# First impressions of the problem

Before deciding on an algorithm, I wanted to get some quick initial results from the simplest model possible.

This is a helpful first step whenever I have a new problem at hand, because basic ML algorithms work well on many problems with minimal complexity :)


In [1]:
import pandas as pd

In [2]:
train = pd.read_csv("../data/raw/train.csv", index_col=0)
train

| | content | cyber_label | environmental_issue |
|---|---|---|---|
| 0 | All rights reserved. MA23-16258 988982046\n\nh... | | 0 |
| 1 | Revisiting our purpose and/or values statement... | | 0 |
| 2 | Amid ongoing strategic competition in a\nmulti... | | 0 |
| 3 | Source: PwC Pulse Survey, November 2, 2022: ba... | | 0 |
| 4 | Executive Summary 2\nAgeing and\nHealth Conce... | | 1 |
| ... | ... | ... | ... |
| 1295 | Source: PwC Pulse Survey, November 2, 2022: ba... | | 0 |
| 1296 | Military-driven innovations in relevant fields... | | 0 |
| 1297 | , artificial\nintelligence, automation in all ... | | 0 |
| 1298 | Year-over-year cyberattacks continue to evolve... | 1.0 | 0 |


First, I want to make sure I can work with the dataset, so let's explore (these steps have since been moved to the data processing notebook).

In [3]:
train.cyber_label.value_counts()

cyber_label
1.0    127
Name: count, dtype: int64

In [4]:
train.environmental_issue.value_counts()

environmental_issue
0    1016
1     284
Name: count, dtype: int64

In [5]:
# cyber_label is NaN wherever the label is absent: treat missing as 0 and cast to int.
# (fillna returns a new Series, so the result has to be assigned back.)
train.cyber_label = train.cyber_label.fillna(0).astype(int)
train.environmental_issue = train.environmental_issue.astype(int)

In [6]:
train["cyber_label"] == 1

0       False
1       False
2       False
3       False
4       False
        ...  
1295    False
1296    False
1297    False
1298     True
1299    False
Name: cyber_label, Length: 1300, dtype: bool

In [7]:
train[(train["cyber_label"] == 1) & (train["environmental_issue"] == 0)].info()

<class 'pandas.core.frame.DataFrame'>
Index: 92 entries, 12 to 1298
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   content              92 non-null     object
 1   cyber_label          92 non-null     int64 
 2   environmental_issue  92 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 2.9+ KB


In [8]:
train[(train["cyber_label"] == 0) & (train["environmental_issue"] == 1)].info()

<class 'pandas.core.frame.DataFrame'>
Index: 249 entries, 4 to 1294
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   content              249 non-null    object
 1   cyber_label          249 non-null    int64 
 2   environmental_issue  249 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 7.8+ KB


In [9]:
train[(train["cyber_label"] == 1) & (train["environmental_issue"] == 1)].info()

<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, 14 to 1218
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   content              35 non-null     object
 1   cyber_label          35 non-null     int64 
 2   environmental_issue  35 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 1.1+ KB


In [10]:
train[(train["cyber_label"] == 0) & (train["environmental_issue"] == 0)].info()

<class 'pandas.core.frame.DataFrame'>
Index: 924 entries, 0 to 1299
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   content              924 non-null    object
 1   cyber_label          924 non-null    int64 
 2   environmental_issue  924 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 28.9+ KB
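
The four filtered `.info()` calls above can also be collapsed into a single contingency table. A minimal sketch, assuming `pd.crosstab` over the two label columns; the toy frame below is made up for illustration and only stands in for `train`.

```python
import pandas as pd

# Toy stand-in for `train` (values are illustrative, not taken from the dataset)
toy = pd.DataFrame({
    "cyber_label":         [0, 0, 1, 1, 0, 1],
    "environmental_issue": [0, 1, 0, 1, 0, 0],
})

# Rows: cyber_label, columns: environmental_issue, cells: number of documents
counts = pd.crosstab(toy["cyber_label"], toy["environmental_issue"])
print(counts)
```

Run on the real `train` frame, the same call should reproduce the 924 / 249 / 92 / 35 split reported by the four cells above.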


In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, classification_report

from sklearn.utils import resample
from scipy import sparse

In [12]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    train['content'], train[['cyber_label', 'environmental_issue']], test_size=0.2, random_state=42)


In [13]:
X_train.shape, X_test.shape

((1040,), (260,))

### Ready to train
Let's kick off with a simple `LogisticRegression` model.

In [14]:

# Create a pipeline with TF-IDF vectorizer and a multi-label logistic regression model
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', MultiOutputClassifier(LogisticRegression(solver='liblinear')))
])

# Train the model
pipeline.fit(X_train, y_train)

# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluating the model

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))
print("Classification Report (multilabel eval):\n", classification_report(y_test, y_pred))
# Example of prediction
sample_text = ["New document discussing climate change and cybersecurity.", "0"]
sample_prediction = pipeline.predict(sample_text)
print("Prediction for the new document:", sample_prediction)

Accuracy: 0.7230769230769231
F1 Score: 0.30067933650079187
Classification Report (multilabel eval):
               precision    recall  f1-score   support

           0       1.00      0.04      0.08        23
           1       1.00      0.23      0.37        70

   micro avg       1.00      0.18      0.31        93
   macro avg       1.00      0.14      0.23        93
weighted avg       1.00      0.18      0.30        93
 samples avg       0.07      0.06      0.06        93

Prediction for the new document: [[0 0]
 [0 0]]




In [15]:
y_test.shape, y_pred.shape

((260, 2), (260, 2))

Let's check a random example where both labels are equal to 1:

In [16]:
sample_text = train[(train["cyber_label"] == 1) & (train["environmental_issue"] == 1)].sample(1)

sample_prediction = pipeline.predict(sample_text.content.to_list())
print("Prediction for the new document:", sample_prediction)
sample_text.to_json()

Prediction for the new document: [[0 0]]


'{"content":{"480":"0"},"cyber_label":{"480":1},"environmental_issue":{"480":1}}'

In [17]:
train[train.content.str.len() < 10]

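| | content | cyber_label | environmental_issue |
|---|---|---|---|
| 49 | 3 | 1 | 0 |
| 111 | joke | 0 | 0 |
| 169 | 1 | 1 | 0 |
| 303 | a | 1 | 0 |
| 480 | 0 | 1 | 1 |
| 486 | 2 | 1 | 1 |
| 488 | got | 0 | 0 |
| 513 | b | 1 | 1 |
| 517 | c | 1 | 0 |
| 633 | is | 1 | 0 |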

## Treat class imbalance

In [18]:
df = train.copy()

# Handle missing values if any
df.fillna(0, inplace=True)

# Splitting the data into training and testing sets
X = df['content']
y = df[['cyber_label', 'environmental_issue']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Balancing the training data by oversampling the minority class
# Assuming environmental_issue is the minority class
Xy_train = pd.concat([X_train, y_train], axis=1)
majority = Xy_train[Xy_train['environmental_issue'] == 0]
minority = Xy_train[Xy_train['environmental_issue'] == 1]

# Oversampling the minority
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
upsampled_train = pd.concat([majority, minority_upsampled])

# Re-splitting the upsampled dataset
X_train_upsampled = upsampled_train['content']
y_train_upsampled = upsampled_train[['cyber_label', 'environmental_issue']]

# Create a pipeline with TF-IDF vectorizer and a multi-label logistic regression model
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', MultiOutputClassifier(LogisticRegression(solver='liblinear', class_weight='balanced')))
])

# Train the model with the balanced dataset
pipeline.fit(X_train_upsampled, y_train_upsampled)

# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluating the model using F1 score and classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))
print("Classification Report (multilabel eval):\n", classification_report(y_test, y_pred))

# Example of prediction
sample_text = ["New document discussing climate change and cybersecurity."]
sample_prediction = pipeline.predict(sample_text)
print("Prediction for the new document:", sample_prediction)


Accuracy: 0.7846153846153846
F1 Score: 0.6940371456500489
Classification Report (multilabel eval):
               precision    recall  f1-score   support

           0       0.47      0.65      0.55        23
           1       0.74      0.74      0.74        70

   micro avg       0.66      0.72      0.69        93
   macro avg       0.61      0.70      0.64        93
weighted avg       0.68      0.72      0.69        93
 samples avg       0.25      0.24      0.24        93

Prediction for the new document: [[1 1]]




Treating the class imbalance (oversampling the `environmental_issue` positives and training with `class_weight='balanced'`) improves the overall scores:

Accuracy: 0.72 -> 0.78

F1 Score: 0.30 -> 0.69
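
To make the `class_weight='balanced'` part concrete: scikit-learn weights each class by `n_samples / (n_classes * count_of_that_class)`. A small sketch, using the `environmental_issue` counts from the `value_counts()` output earlier (1016 vs. 284 on the full dataset, so the numbers on the actual training split will differ):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# environmental_issue counts on the full dataset, taken from value_counts() above
y = np.array([0] * 1016 + [1] * 284)

# What class_weight='balanced' computes internally
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same thing by hand: n_samples / (n_classes * count_per_class)
manual = len(y) / (2 * np.bincount(y))

print(dict(zip([0, 1], weights.round(3))))
```

Note that in the cell above the weights are computed on the already oversampled training set, where `environmental_issue` is split evenly, so for that label the two class weights should come out equal. The weighting then mostly matters for `cyber_label`, which is still imbalanced after the oversampling.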


Next, let's try to treat the imbalance directly in the data by oversampling the minority class of each label in turn:

In [19]:
df = train.copy()

# Fill NaN values if any
df.fillna(0, inplace=True)

# Splitting the data into training and testing sets
X = df['content']
y = df[['cyber_label', 'environmental_issue']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Balancing the training data by oversampling both minority classes
Xy_train = pd.concat([X_train, y_train], axis=1)

# Oversampling logic for each label
for column in y_train.columns:
    majority = Xy_train[Xy_train[column] == 0]
    minority = Xy_train[Xy_train[column] == 1]

    # Check if minority class needs oversampling
    if len(minority) < len(majority):
        minority_upsampled = resample(minority,
                                      replace=True,
                                      n_samples=len(majority),
                                      random_state=42)
        Xy_train = pd.concat([majority, minority_upsampled])
    else:
        Xy_train = pd.concat([majority, minority])

# Re-splitting the upsampled dataset
X_train_upsampled = Xy_train['content']
y_train_upsampled = Xy_train[['cyber_label', 'environmental_issue']]

# Create a pipeline with TF-IDF vectorizer and a multi-label logistic regression model
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', MultiOutputClassifier(LogisticRegression(solver='liblinear', class_weight='balanced')))
])

# Train the model with the balanced dataset
pipeline.fit(X_train_upsampled, y_train_upsampled)

# Predict on the test data
y_pred = pipeline.predict(X_test)

# Evaluating the model using F1 score and classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))
print("Classification Report (multilabel eval):\n", classification_report(y_test, y_pred, target_names=y.columns))

print("Classification Report for 'cyber_label':\n", classification_report(y_test['cyber_label'], y_pred[:, 0], target_names=['Class 0', 'Class 1']))
print("F1 Score:", f1_score(y_test['cyber_label'], y_pred[:, 0], average='weighted'))

print("Classification Report for 'environmental_issue':\n", classification_report(y_test['environmental_issue'], y_pred[:, 1], target_names=['Class 0', 'Class 1']))
print("F1 Score:", f1_score(y_test['environmental_issue'], y_pred[:, 1], average='weighted'))

# Example of prediction
sample_text = ["New document discussing climate change and cybersecurity."]
sample_prediction = pipeline.predict(sample_text)
print("Prediction for the new document:", sample_prediction)


Accuracy: 0.7538461538461538
F1 Score: 0.6176949083606599
Classification Report (multilabel eval):
                      precision    recall  f1-score   support

        cyber_label       0.54      0.57      0.55        23
environmental_issue       0.62      0.66      0.64        70

          micro avg       0.60      0.63      0.62        93
          macro avg       0.58      0.61      0.60        93
       weighted avg       0.60      0.63      0.62        93
        samples avg       0.21      0.22      0.21        93

Classification Report for 'cyber_label':
               precision    recall  f1-score   support

     Class 0       0.96      0.95      0.96       237
     Class 1       0.54      0.57      0.55        23

    accuracy                           0.92       260
   macro avg       0.75      0.76      0.75       260
weighted avg       0.92      0.92      0.92       260

F1 Score: 0.9200046366300696
Classification Report for 'environmental_issue':
               precisio



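It's better to see the impact on each label individually, so this run also prints a separate report per label.

There is a slight performance drop compared with the previous run (accuracy 0.78 -> 0.75, weighted F1 0.69 -> 0.62). Let's try a better-known classifier, like Random Forest: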

In [20]:
df = train.copy()

df.fillna(0, inplace=True)

X = df['content']
y = df[['cyber_label', 'environmental_issue']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handling the imbalance by oversampling each label individually
def balance_classes(X, y):
    Xy = pd.concat([X, y], axis=1)
    balanced_data = pd.DataFrame()

    for column in y.columns:
        majority = Xy[Xy[column] == 0]
        minority = Xy[Xy[column] == 1]

        minority_upsampled = resample(minority,
                                      replace=True,
                                      n_samples=len(majority),
                                      random_state=42)
        balanced = pd.concat([majority, minority_upsampled])
        balanced_data = pd.concat([balanced_data, balanced]) if not balanced_data.empty else balanced

    return balanced_data['content'], balanced_data.drop('content', axis=1)

X_train_balanced, y_train_balanced = balance_classes(X_train, y_train)

# Create a pipeline with TF-IDF vectorizer and a multi-label random forest model
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42)))
])

# Train the model with the balanced dataset
pipeline.fit(X_train_balanced, y_train_balanced)

# Predict on the test data
y_pred = pipeline.predict(X_test)


print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))
print("Classification Report (multilabel eval):\n", classification_report(y_test, y_pred, target_names=y.columns))

# Evaluating the model using a detailed classification report
print("Classification Report for 'cyber_label':\n", classification_report(y_test['cyber_label'], y_pred[:, 0], target_names=['Class 0', 'Class 1']))
print("F1 Score:", f1_score(y_test['cyber_label'], y_pred[:, 0], average='weighted'))

print("Classification Report for 'environmental_issue':\n", classification_report(y_test['environmental_issue'], y_pred[:, 1], target_names=['Class 0', 'Class 1']))
print("F1 Score:", f1_score(y_test['environmental_issue'], y_pred[:, 1], average='weighted'))

# Example of prediction
sample_text = ["New document discussing climate change and cybersecurity."]
sample_prediction = pipeline.predict(sample_text)
print("Prediction for the new document:", sample_prediction)


Accuracy: 0.75
F1 Score: 0.48264811300864435
Classification Report (multilabel eval):
                      precision    recall  f1-score   support

        cyber_label       0.86      0.26      0.40        23
environmental_issue       0.81      0.37      0.51        70

          micro avg       0.82      0.34      0.48        93
          macro avg       0.83      0.32      0.45        93
       weighted avg       0.82      0.34      0.48        93
        samples avg       0.12      0.12      0.11        93

Classification Report for 'cyber_label':
               precision    recall  f1-score   support

     Class 0       0.93      1.00      0.96       237
     Class 1       0.86      0.26      0.40        23

    accuracy                           0.93       260
   macro avg       0.89      0.63      0.68       260
weighted avg       0.93      0.93      0.91       260

F1 Score: 0.9134379905808477
Classification Report for 'environmental_issue':
               precision    recall  



In [21]:
print("Full Sample Accuracy:", accuracy_score(y_test, y_pred))
ac1 = accuracy_score(y_test['cyber_label'], y_pred[:, 0])
ac2 = accuracy_score(y_test['environmental_issue'], y_pred[:, 1])
print("Accuracy Score cyber_label:", ac1)
print("Accuracy Score environmental_issue:", ac2)
print("Per Label avg accuracy :", (ac1 + ac2) / 2 )


Full Sample Accuracy: 0.75
Accuracy Score cyber_label: 0.9307692307692308
Accuracy Score environmental_issue: 0.8076923076923077
Per Label avg accuracy : 0.8692307692307693


Note:\
The average of the two per-label accuracies is not the same as the overall accuracy score, because the overall score counts a sample as correct only when both labels are predicted correctly.
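
A tiny made-up example shows the gap between the two numbers. For multi-label targets, `accuracy_score` computes subset accuracy (every label in a row must match), while the per-label scores are computed column by column:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels, made up for illustration: columns are [cyber_label, environmental_issue]
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_pred = np.array([[1, 0], [0, 0], [1, 0], [0, 0]])

# Subset accuracy: a row counts only if both labels are correct
subset_acc = accuracy_score(y_true, y_pred)

# Per-label accuracy: each column is scored on its own
per_label = [accuracy_score(y_true[:, j], y_pred[:, j]) for j in range(y_true.shape[1])]
avg = float(np.mean(per_label))

print(subset_acc, per_label, avg)  # 0.5 [1.0, 0.5] 0.75
```

Here the first label is always right and the second is right half of the time, so the per-label average is 0.75, but only two of the four rows are fully correct.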

In [22]:
sample_text = train[(train["cyber_label"] == 1) & (train["environmental_issue"] == 1)].sample(1)

sample_prediction = pipeline.predict(sample_text.content.to_list())
print("Prediction for the new document:", sample_prediction)
sample_text.to_json()

Prediction for the new document: [[1 1]]


'{"content":{"720":", artificial intelligence,\\nautomation in all of its forms, hyper-scalable platforms, faster data transmission, quantum computing,\\nblockchain, digital currencies and the metaverse) and\\/or other market forces may outpace our\\norganization\\u2019s ability to compete and\\/or manage the risk appropriately, without making significant\\nchanges to our business model\\n\\n3 \\u25cf \\u25cf \\u25cf\\n\\nRegulatory changes and scrutiny may heighten, noticeably affecting the way our processes are designed\\nand our products or services are produced or delivered\\n\\n9 \\u25cf \\u25cf \\u25cf\\n\\nSubstitute products and services may arise from competitors that may enhance the customer experience\\nand affect the viability of our current business model and planned strategic initiatives\\n\\n11 \\u25cf \\u25cf \\u25cf\\n\\nEase of entrance of new competitors into the industry and marketplace or other significant changes in\\nthe competitive environment (such as major mar

Next, I'd like to train two separate models, one per label, to see if we can get better classification than with the multi-label setup.
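
One thing worth keeping in mind for this comparison: as far as I understand scikit-learn, `MultiOutputClassifier` already fits one independent copy of the base estimator per label column. So the practical difference with two hand-built models is that each one can get its own balancing and hyperparameters. A minimal sketch on a made-up toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Toy corpus and labels, made up for illustration.
# Label columns: [cyber_label, environmental_issue]
texts = [
    "ransomware attack on servers",
    "phishing emails and malware",
    "climate change and emissions",
    "carbon emissions and flooding",
    "quarterly revenue summary",
    "cyber attack on a flood warning system",
]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [1, 1]])

X = TfidfVectorizer().fit_transform(texts)
clf = MultiOutputClassifier(LogisticRegression(solver="liblinear")).fit(X, labels)

# One fitted binary classifier per label column
print(len(clf.estimators_))
```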

In [23]:
df = train.copy()

# Handling missing values
df.fillna(0, inplace=True)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(df['content'], df[['cyber_label', 'environmental_issue']], test_size=0.2, random_state=42)

# Vectorization
tfidf = TfidfVectorizer(stop_words='english')
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Function to balance each class using resampling
def balance_classes(X, y):
    # Convert sparse matrix to DataFrame if necessary
    if sparse.issparse(X):
        X = pd.DataFrame(X.toarray())  # Note this could be memory-intensive with large datasets

    # Reset index if X or y are series or dataframes to ensure alignment
    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    # Concatenate X and y along axis=1
    Xy = pd.concat([X, y], axis=1)

    # Separate majority and minority classes
    majority = Xy[Xy.iloc[:, -1] == 0]
    minority = Xy[Xy.iloc[:, -1] == 1]

    # Resample the minority class
    if len(minority) < len(majority):
        minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
        balanced_data = pd.concat([majority, minority_upsampled])
    else:
        balanced_data = Xy

    # Return balanced X and y
    return balanced_data.iloc[:, :-1], balanced_data.iloc[:, -1]


# Models for each label
def train_model(X_train, y_train, X_test, y_test, n_estimators=100):
    # Balance the training dataset
    X_train_balanced, y_train_balanced = balance_classes(X_train, y_train)

    # Train the model
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train_balanced, y_train_balanced)

    # Predict and evaluate
    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))


# Train and evaluate model for cyber_label
print("Model Performance for cyber_label:")
train_model(X_train_tfidf, y_train['cyber_label'], X_test_tfidf, y_test['cyber_label'], n_estimators=80)

# Train and evaluate model for environmental_issue
print("Model Performance for environmental_issue:")
train_model(X_train_tfidf, y_train['environmental_issue'], X_test_tfidf, y_test['environmental_issue'], n_estimators=43)


Model Performance for cyber_label:
Accuracy: 0.9346153846153846
Recall: 0.30434782608695654
Precision: 0.875
F1 Score: 0.45161290322580644
Classification Report:
               precision    recall  f1-score   support

           0       0.94      1.00      0.97       237
           1       0.88      0.30      0.45        23

    accuracy                           0.93       260
   macro avg       0.91      0.65      0.71       260
weighted avg       0.93      0.93      0.92       260

Model Performance for environmental_issue:
Accuracy: 0.8192307692307692
Recall: 0.4857142857142857
Precision: 0.7555555555555555
F1 Score: 0.591304347826087
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.94      0.88       190
           1       0.76      0.49      0.59        70

    accuracy                           0.82       260
   macro avg       0.79      0.71      0.74       260
weighted avg       0.81      0.82      0.81       260



## SVM

An SVM is often a pretty good classifier for text, so let's give it a shot!

In [24]:
df = train.copy()


# Handling missing values
df.fillna(0, inplace=True)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(df['content'], df[['cyber_label', 'environmental_issue']], test_size=0.2, random_state=42)

# Vectorization
tfidf = TfidfVectorizer(stop_words='english')
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

def balance_classes(X, y):
    # Ensure y is a DataFrame with correct indices
    y = pd.DataFrame(y).reset_index(drop=True)
    # Determine indices of majority and minority classes
    majority_idx = y[y.iloc[:, 0] == 0].index
    minority_idx = y[y.iloc[:, 0] == 1].index

    # Resample the minority class
    if len(minority_idx) < len(majority_idx):
        minority_upsampled_idx = resample(minority_idx, replace=True, n_samples=len(majority_idx), random_state=42)
        new_indices = pd.Index(majority_idx.tolist() + minority_upsampled_idx.tolist())
    else:
        new_indices = pd.Index(majority_idx.tolist() + minority_idx.tolist())

    # Return balanced X and y using the new indices
    return X[new_indices], y.iloc[new_indices].values.ravel()


# Models for each label
def train_model(X_train, y_train, X_test, y_test):
    # Balance the training dataset
    X_train_balanced, y_train_balanced = balance_classes(X_train, y_train)

    # Train the model
    model = SVC(kernel='linear', C=1, probability=True, random_state=42)
    model.fit(X_train_balanced, y_train_balanced)

    # Predict and evaluate
    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))


# Train and evaluate model for cyber_label
print("Model Performance for cyber_label:")
train_model(X_train_tfidf, y_train['cyber_label'], X_test_tfidf, y_test['cyber_label'])

# Train and evaluate model for environmental_issue
print("Model Performance for environmental_issue:")
train_model(X_train_tfidf, y_train['environmental_issue'], X_test_tfidf, y_test['environmental_issue'])


Model Performance for cyber_label:
Accuracy: 0.9115384615384615
Recall: 0.5652173913043478
Precision: 0.5
F1 Score: 0.5306122448979592
Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.95      0.95       237
           1       0.50      0.57      0.53        23

    accuracy                           0.91       260
   macro avg       0.73      0.76      0.74       260
weighted avg       0.92      0.91      0.91       260

Model Performance for environmental_issue:
Accuracy: 0.8192307692307692
Recall: 0.6285714285714286
Precision: 0.676923076923077
F1 Score: 0.6518518518518519
Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.89      0.88       190
           1       0.68      0.63      0.65        70

    accuracy                           0.82       260
   macro avg       0.77      0.76      0.76       260
weighted avg       0.82      0.82      0.82       260



Let me check the impact of going back to a single multi-label classifier instead of two separate models:

In [25]:
df = train.copy()

df.fillna(0, inplace=True)

X = df['content']
y = df[['cyber_label', 'environmental_issue']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handling the imbalance by oversampling each label individually
def balance_classes(X, y):
    Xy = pd.concat([X, y], axis=1)
    balanced_data = pd.DataFrame()

    for column in y.columns:
        majority = Xy[Xy[column] == 0]
        minority = Xy[Xy[column] == 1]

        minority_upsampled = resample(minority,
                                      replace=True,
                                      n_samples=len(majority),
                                      random_state=42)
        balanced = pd.concat([majority, minority_upsampled])
        balanced_data = pd.concat([balanced_data, balanced]) if not balanced_data.empty else balanced

    return balanced_data['content'], balanced_data.drop('content', axis=1)

X_train_balanced, y_train_balanced = balance_classes(X_train, y_train)

# Create a pipeline with TF-IDF vectorizer and a multi-label SVM model
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('classifier', MultiOutputClassifier(SVC(kernel='linear', C=1, probability=True, random_state=42)))
])

# Train the model with the balanced dataset
pipeline.fit(X_train_balanced, y_train_balanced)

# Predict on the test data
y_pred = pipeline.predict(X_test)


print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))
print("Classification Report (multilabel eval):\n", classification_report(y_test, y_pred, target_names=y.columns))

# Evaluating the model using a detailed classification report
print("Classification Report for 'cyber_label':\n", classification_report(y_test['cyber_label'], y_pred[:, 0], target_names=['Class 0', 'Class 1']))
print("F1 Score:", f1_score(y_test['cyber_label'], y_pred[:, 0], average='weighted'))

print("Classification Report for 'environmental_issue':\n", classification_report(y_test['environmental_issue'], y_pred[:, 1], target_names=['Class 0', 'Class 1']))
print("F1 Score:", f1_score(y_test['environmental_issue'], y_pred[:, 1], average='weighted'))

# Example of prediction
sample_text = ["New document discussing climate change and cybersecurity."]
sample_prediction = pipeline.predict(sample_text)
print("Prediction for the new document:", sample_prediction)


Accuracy: 0.7461538461538462
F1 Score: 0.5774244360799397
Classification Report (multilabel eval):
                      precision    recall  f1-score   support

        cyber_label       0.53      0.43      0.48        23
environmental_issue       0.66      0.57      0.61        70

          micro avg       0.62      0.54      0.58        93
          macro avg       0.59      0.50      0.54        93
       weighted avg       0.62      0.54      0.58        93
        samples avg       0.19      0.18      0.18        93

Classification Report for 'cyber_label':
               precision    recall  f1-score   support

     Class 0       0.95      0.96      0.95       237
     Class 1       0.53      0.43      0.48        23

    accuracy                           0.92       260
   macro avg       0.74      0.70      0.72       260
weighted avg       0.91      0.92      0.91       260

F1 Score: 0.9117093506214846
Classification Report for 'environmental_issue':
               precisio



Now that we've tried a few of these models, it's clear that the data is not easy to classify. \
We need more experiments to get better results.