# Interactive Exploration of the FQSC Model

# Introduction to the Notebook

This notebook begins by loading the FQSD. After preparing the dataset, we apply BERT, a pretrained Transformer-based language model, to the questions in the FQSD. Our goal is to use BERT's contextual representations to classify the questions and to examine how well the model separates the question classes.


# **Data Download and Processing**

This code downloads the FSQD-Json-dataset archive from a remote URL, unzips it, and loads the main JSON file into a pandas DataFrame. This step prepares the data for the analysis that follows.

In [None]:
import requests
import zipfile
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


data_url = "https://github.com/mahsamb/FSQD/raw/main/FSQD-Json-dataset.zip"
zip_filename = "FSQD-Json-dataset.zip"

# Download the archive; stop with an HTTPError if the request fails,
# so later steps never run against a missing or partial file
response = requests.get(data_url, timeout=60)
response.raise_for_status()
with open(zip_filename, "wb") as f:
    f.write(response.content)

# Unzipping the dataset
try:
    with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
        zip_ref.extractall("FSQD-Json-dataset")
    print("Files extracted:")
    print(os.listdir("FSQD-Json-dataset"))
except zipfile.BadZipFile:
    print("Error: The file doesn’t appear to be a valid zip file")



json_file_path = 'FSQD-Json-dataset/FSQD-Json-dataset.json'  # Update with the correct file path

# Try reading the file as a JSON Lines file
try:
    merged_df = pd.read_json(json_file_path, lines=True)
except ValueError as e:
    print(f"Error reading the JSON file: {e}")
    raise  # merged_df would be undefined in later cells, so stop here



Files extracted:
['Yu_et_al_2012-Json-dataset.json', 'SubjQA-Json-dataset.json', 'FSQD-Json-dataset.json', 'ConvEx-Json-dataset.json']


In [None]:
print(merged_df.columns)

Index(['Question', 'Label_FSQD', 'Label_Subjectivity', 'Label_ComparisionForm',
       'Label_Subjectivity_ComparisionForm', 'Label_SubjectivityType'],
      dtype='object')


This code downloads and processes the data from [FSQD-Json-dataset](https://github.com/mahsamb/FSQD/raw/main/FSQD-Json-dataset.zip).
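The loader above passes `lines=True`, which treats the file as JSON Lines: one JSON object per line. As a rough illustration of that format, the sketch below parses a few records with the standard library and counts how often each label occurs. The three sample records and their label values are invented placeholders; only the column names `Question` and `Label_FSQD` come from the notebook, and `read_json_lines` is a hypothetical helper.

```python
import io
import json
from collections import Counter

# Three made-up records in JSON Lines form (one JSON object per line).
# The question text and label values are placeholders, not real FSQD rows.
sample_jsonl = io.StringIO(
    '{"Question": "placeholder question A", "Label_FSQD": "class_a"}\n'
    '{"Question": "placeholder question B", "Label_FSQD": "class_b"}\n'
    '{"Question": "placeholder question C", "Label_FSQD": "class_a"}\n'
)

def read_json_lines(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]

records = read_json_lines(sample_jsonl)

# Count how many questions carry each label.
label_counts = Counter(record["Label_FSQD"] for record in records)
print(label_counts.most_common())
```

With the real data, a comparable check would be `merged_df['Label_FSQD'].value_counts()`.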

#######################################################################################################

# Environment Setup for Deep Learning

Before diving into the model training and evaluation, we need to set up our environment with all the necessary libraries and frameworks:


In [None]:
# The PyPI package is named scikit-learn (imported as sklearn); the old 'sklearn' PyPI name is deprecated
!pip install tensorflow scikit-learn transformers

#######################################################################################################

# Model Training and Evaluation with BERT

This section trains and evaluates a BERT-based model for sequence classification on the FQSD. We begin by tokenizing the dataset questions with BERT's tokenizer and encoding the labels as integers. The model builds on TFBertForSequenceClassification and adds a dropout layer for regularization and a dense softmax layer sized to the number of classes (10 in the code below).
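To make the two preprocessing steps concrete, here is a minimal pure-Python sketch of integer label encoding and of padding or truncating a token sequence to a fixed length with an attention mask. It is only in the spirit of what `LabelEncoder` and the tokenizer's `padding='max_length', truncation=True` options do: the helper names, the label strings, the token ids, and the pad id of 0 are all assumptions for illustration, and the real tokenizer also handles special tokens when truncating.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, using the sorted order of the labels."""
    classes = sorted(set(labels))
    index_of = {label: i for i, label in enumerate(classes)}
    return [index_of[label] for label in labels], classes

def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])      # drop anything past max_length
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding

# Placeholder label strings, not the real FSQD classes.
encoded, classes = encode_labels(["class_b", "class_a", "class_b"])
print(encoded, classes)

# Arbitrary token ids standing in for a tokenized question.
ids, mask = pad_or_truncate([101, 2054, 2003, 102], max_length=6)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the notebook passes both `input_ids` and `attention_mask` to the model.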

We use 5-fold cross-validation to estimate how well the model generalizes. For each fold, we train a fresh model, record its training history, and evaluate it on the held-out fold using precision, recall, and F1-score.
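As a rough picture of how a shuffled 5-fold split partitions the data, the sketch below builds the train and test index lists by hand. The function `k_fold_indices` is a hypothetical stand-in: it will not reproduce the exact shuffle that scikit-learn's `KFold` produces for the same seed, but it follows the same idea that every sample lands in exactly one test fold.

```python
import random

def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_indices, test_indices) for each of n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    # Each fold serves once as the test set; the rest form the training set.
    for fold in range(n_splits):
        test = folds[fold]
        train = [i for j, other in enumerate(folds) if j != fold for i in other]
        yield train, test

for fold, (train, test) in enumerate(k_fold_indices(12), start=1):
    print(f"fold {fold}: {len(train)} train / {len(test)} test")
```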

Within each fold, precision, recall, and F1-score are macro-averaged over the classes. These per-fold scores are reported first, followed by their mean across the five folds, which summarizes the model's performance over different subsets of the data.
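For readers unfamiliar with macro averaging, the following sketch computes the three scores from scratch: each metric is calculated per class and the per-class values are then averaged with equal weight. This is an illustrative reimplementation under the assumption that undefined ratios count as zero, not the scikit-learn code the notebook calls, and the toy labels are invented.

```python
def macro_precision_recall_f1(y_true, y_pred):
    """Average per-class precision, recall, and F1 with equal class weights."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Undefined ratios (zero denominator) are treated as 0.0 in this sketch.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Invented toy labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
precision, recall, f1 = macro_precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Note that the macro F1 here is the mean of the per-class F1 values, not the F1 of the macro precision and macro recall.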


#######################################################################################################

# Sample Run for Model Evaluation

Following the methodology of our article, the final reported results are the average of five separate runs. The code shown here is one of these runs: it trains and evaluates the BERT-based model on the FQSD. Each run uses 5-fold cross-validation, training on four folds and evaluating on the remaining fold with precision, recall, and F1-score.

This approach lets us assess the model's generalizability and consistency across data partitions, and makes overfitting to a particular split easier to detect. The metrics from all five runs are then averaged to give the overall performance reported in our article.


# **BERT**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import precision_recall_fscore_support
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf
from tensorflow.keras.layers import Input, Dropout
from tensorflow.keras.regularizers import l1
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Assuming merged_df is your DataFrame
# merged_df = pd.read_csv('your_dataset.csv') # Uncomment if needed
questions = merged_df['Question'].values
labels = merged_df['Label_FSQD'].values

# Convert labels to integers
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(labels)
labels = np.array(integer_encoded)

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
MAX_SEQUENCE_LENGTH = 100

def tokenize_texts(text_list, max_length=MAX_SEQUENCE_LENGTH):
    return tokenizer(text_list, padding='max_length', truncation=True, max_length=max_length, return_tensors="tf")

def create_bert_model():
    # Pretrained BERT with a 10-way classification head; outputs[0] holds its logits
    model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=10)
    input_ids = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_ids')
    attention_mask = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='attention_mask')
    outputs = model(input_ids, attention_mask=attention_mask)
    # Apply dropout to the logits, then map them to class probabilities
    bert_output = Dropout(0.5)(outputs[0])
    classification_output = tf.keras.layers.Dense(10, activation='softmax', kernel_regularizer=l1(0.01))(bert_output)
    keras_model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=classification_output)
    # Integer labels from LabelEncoder pair with sparse categorical cross-entropy
    keras_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                        loss='sparse_categorical_crossentropy',
                        metrics=['accuracy'])
    return keras_model

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_metrics = []

# Lists to hold the macro average precision, recall, and f1-score
all_precisions = []
all_recalls = []
all_f1s = []

for fold, (train_index, test_index) in enumerate(kf.split(questions)):
    print(f"Training on fold {fold+1}")
    x_train, x_test = questions[train_index].tolist(), questions[test_index].tolist()
    y_train, y_test = labels[train_index], labels[test_index]

    # Tokenize text
    x_train_tokenized = tokenize_texts(x_train)
    x_test_tokenized = tokenize_texts(x_test)

    model = create_bert_model()

    # Train the model
    history = model.fit(
        {'input_ids': x_train_tokenized['input_ids'], 'attention_mask': x_train_tokenized['attention_mask']},
        y_train,
        validation_data=(
            {'input_ids': x_test_tokenized['input_ids'], 'attention_mask': x_test_tokenized['attention_mask']},
            y_test
        ),
        epochs=5,
        batch_size=32,
        verbose=1
    )

    # Save the model
    model.save(f"bert_fold_{fold+1}.h5")

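    # Predict and evaluate; pass only the two inputs the Keras model declares
    y_pred = np.argmax(model.predict(
        {'input_ids': x_test_tokenized['input_ids'], 'attention_mask': x_test_tokenized['attention_mask']}
    ), axis=-1)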
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='macro')

    # Append the metrics for macro averaging later
    all_precisions.append(precision)
    all_recalls.append(recall)
    all_f1s.append(f1)

    # Store metrics
    fold_metrics.append({
        'history': history.history,
        'precision': precision,
        'recall': recall,
        'f1': f1
    })

# Print out the precision, recall, and F1-score for each fold
for i, metrics in enumerate(fold_metrics):
    print(f"Fold {i+1} - Precision: {metrics['precision']:.4f}, Recall: {metrics['recall']:.4f}, F1-Score: {metrics['f1']:.4f}")

# Calculate and print the macro average precision, recall, and F1-score across all folds
macro_precision = np.mean(all_precisions)
macro_recall = np.mean(all_recalls)
macro_f1 = np.mean(all_f1s)
print(f"Macro Average Precision: {macro_precision:.4f}")
print(f"Macro Average Recall: {macro_recall:.4f}")
print(f"Macro Average F1-Score: {macro_f1:.4f}")



Training on fold 1



All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/5




Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5




Training on fold 2




Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5




Training on fold 3




Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5




Training on fold 4




Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5




Training on fold 5




Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5




Fold 1 - Precision: 0.9692, Recall: 0.9686, F1-Score: 0.9682
Fold 2 - Precision: 0.9638, Recall: 0.9639, F1-Score: 0.9634
Fold 3 - Precision: 0.9763, Recall: 0.9753, F1-Score: 0.9756
Fold 4 - Precision: 0.9641, Recall: 0.9631, F1-Score: 0.9635
Fold 5 - Precision: 0.9735, Recall: 0.9731, F1-Score: 0.9732
Macro Average Precision: 0.9694
Macro Average Recall: 0.9688
Macro Average F1-Score: 0.9688
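As a quick sanity check on the reported averages, the sketch below recomputes the mean of the five per-fold scores printed above and adds the sample standard deviation across folds as a rough indication of spread. The standard deviation is an extra not computed in the notebook; the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold scores copied from the output of this run.
fold_scores = {
    "precision": [0.9692, 0.9638, 0.9763, 0.9641, 0.9735],
    "recall": [0.9686, 0.9639, 0.9753, 0.9631, 0.9731],
    "f1": [0.9682, 0.9634, 0.9756, 0.9635, 0.9732],
}

# Mean and sample standard deviation across the five folds, per metric.
summary = {name: (mean(values), stdev(values)) for name, values in fold_scores.items()}

for name, (average, spread) in summary.items():
    print(f"{name}: mean={average:.4f}, std={spread:.4f}")
```

The recomputed means agree with the averages printed by the notebook to four decimal places.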


#######################################################################################################