# Logistic Regression on Comment Classification

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

file_path = 'Kenny_claudeclassification.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
df.head()


Unnamed: 0,id,comment,new,technical,local,correctional
0,44894774,New road construction in progress,1,0,1.0,0.0
1,44914065,Added house,1,0,0.0,0.0
2,44967243,(node) - added [tag=website]},1,0,0.0,0.0
3,45147457,Aligning or naming imported tiger roads #to-fix,0,1,0.0,1.0
4,45147673,highways modified,0,0,0.0,1.0


We use Logistic Regression wrapped in a One-vs-Rest classifier because it is simple, fast, and works well for text classification in binary or multi-label setups.

The One-vs-Rest (OvR) approach trains a separate binary classifier for each category (New, Technical, Local, Correctional), so a single comment can receive more than one label.
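To make the OvR idea concrete, here is a minimal sketch on a handful of invented comments. The toy comments, the toy label matrix, and all `toy_*` names are assumptions for illustration and are not taken from the dataset. It shows that fitting on a multi-label indicator matrix produces one binary model per label column.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Invented comments, for illustration only (not from the real dataset).
toy_comments = [
    "added new building",
    "added house",
    "fixed road alignment",
    "corrected street name using josm validator",
    "added new road with josm",
    "fixed tag typo",
]

# Columns: [new, technical, correctional]. A row may carry several labels.
toy_labels = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
])

toy_X = TfidfVectorizer().fit_transform(toy_comments)

# OvR fits an independent logistic regression for each label column.
toy_ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
toy_ovr.fit(toy_X, toy_labels)

n_binary_models = len(toy_ovr.estimators_)
toy_preds = toy_ovr.predict(toy_X)

print("binary models fitted:", n_binary_models)
print("prediction matrix shape:", toy_preds.shape)
```

Each row of the prediction matrix is a set of independent yes/no decisions, which is why a comment can come back with zero labels or with several.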

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Clean the data by dropping rows with NaN values in the target columns
df_cleaned = df.dropna(subset=['new', 'technical', 'local', 'correctional'])

# Split data into features and labels
X = df_cleaned['comment']
y = df_cleaned[['new', 'technical', 'local', 'correctional']]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Initialize Logistic Regression with One-vs-Rest approach
log_reg_ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))

# Train the model
log_reg_ovr.fit(X_train_tfidf, y_train)

# Predict on the test set
log_reg_ovr_preds = log_reg_ovr.predict(X_test_tfidf)

# Generate classification report
log_reg_ovr_report = classification_report(y_test, log_reg_ovr_preds, target_names=y.columns)

# Display the classification report
print(log_reg_ovr_report)

              precision    recall  f1-score   support

         new       0.92      0.95      0.93       486
   technical       0.87      0.79      0.83       254
       local       0.68      0.15      0.24        88
correctional       0.90      0.78      0.84       227

   micro avg       0.90      0.81      0.85      1055
   macro avg       0.85      0.67      0.71      1055
weighted avg       0.89      0.81      0.83      1055
 samples avg       0.90      0.86      0.86      1055





## Model performance
- New Category: The model performs well in identifying new additions, with high precision (0.92) and recall (0.95).
- Technical Category: Good performance (precision 0.87, recall 0.79), capturing most technical comments accurately.
- Local Category: Precision is moderate (0.68), but recall is only 0.15, giving an F1-score of 0.24 on 88 test examples. This suggests that distinguishing local information might require more context or additional features.
- Correctional Category: The model performs fairly well (precision 0.90), but recall is lower than for the New category (0.78 versus 0.95).

Overall, logistic regression scores well on the New and Technical knowledge types (F1 of 0.93 and 0.83). The main challenge is detecting Local edits, likely because the category is contextual and therefore harder to capture through basic text features alone.

Let's try expanding the TF-IDF features with n-grams. Local information might be better captured by phrase-based features (e.g., "local business," "road name," "street update"), so we add bigrams to the TF-IDF vectorizer to capture short phrases in addition to individual words. Trigrams would be a further option.

In [6]:
from sklearn.model_selection import GridSearchCV

# Update the TF-IDF vectorizer to use unigrams and bigrams
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Set up Logistic Regression with balanced class weights
log_reg = LogisticRegression(max_iter=1000, class_weight='balanced')  # Automatically balances each binary class

# One-vs-Rest classifier with balanced Logistic Regression
log_reg_ovr = OneVsRestClassifier(log_reg)

# Train the model with the modified TF-IDF and balanced class weights
log_reg_ovr.fit(X_train_tfidf, y_train)

# Predict on the test set
log_reg_ovr_preds = log_reg_ovr.predict(X_test_tfidf)

# Generate and display the classification report
log_reg_ovr_report = classification_report(y_test, log_reg_ovr_preds, target_names=y.columns)
print("Updated Logistic Regression Classification Report with N-grams and Balanced Class Weights:\n", log_reg_ovr_report)

Updated Logistic Regression Classification Report with N-grams and Balanced Class Weights:
               precision    recall  f1-score   support

         new       0.94      0.90      0.92       486
   technical       0.82      0.81      0.82       254
       local       0.45      0.49      0.47        88
correctional       0.80      0.88      0.83       227

   micro avg       0.83      0.84      0.84      1055
   macro avg       0.75      0.77      0.76      1055
weighted avg       0.84      0.84      0.84      1055
 samples avg       0.86      0.87      0.85      1055





With class weighting, each binary classifier gives more importance to its rarer class, which for an underrepresented label like Local means the positive examples. The model becomes more sensitive to those examples and is more likely to predict the label.

The effect shows up as a precision-recall trade-off. Local recall rises from 0.15 to 0.49, while Local precision falls from 0.68 to 0.45, because the model now labels more borderline comments as Local. The net result is a higher Local F1-score (0.24 to 0.47). Note that bigrams and class weighting were introduced in the same run, so this comparison cannot separate their individual contributions.
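For reference, scikit-learn's `class_weight='balanced'` option sets each class weight to `n_samples / (n_classes * np.bincount(y))`. The sketch below works through that arithmetic on a hypothetical label column with 10% positives. The 10/90 split is an assumption for illustration and is not the actual Local distribution.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical binary label column: 10 positives, 90 negatives.
toy_local = np.array([1] * 10 + [0] * 90)
classes = np.array([0, 1])

# Weights as computed by scikit-learn for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_local)

# The same formula written out by hand.
manual = len(toy_local) / (len(classes) * np.bincount(toy_local))

print("sklearn weights:", weights)
print("manual weights:", manual)
```

With this split, an error on a positive example costs nine times as much as an error on a negative one, which pushes the decision boundary toward predicting the rare label more often.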


# Random Forest

In [7]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest with One-vs-Rest approach
rf_model = OneVsRestClassifier(RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42))

# Train the model
rf_model.fit(X_train_tfidf, y_train)

# Predict on the test set
rf_preds = rf_model.predict(X_test_tfidf)

# Generate and display the classification report
rf_report = classification_report(y_test, rf_preds, target_names=y.columns)
print("Random Forest Classification Report with Class Weights:\n", rf_report)


Random Forest Classification Report with Class Weights:
               precision    recall  f1-score   support

         new       0.94      0.91      0.92       486
   technical       0.88      0.76      0.82       254
       local       0.65      0.17      0.27        88
correctional       0.86      0.81      0.83       227

   micro avg       0.90      0.79      0.84      1055
   macro avg       0.83      0.66      0.71      1055
weighted avg       0.89      0.79      0.82      1055
 samples avg       0.89      0.84      0.85      1055





The Random Forest model performs well for the New, Technical, and Correctional categories, with F1-scores of 0.92, 0.82, and 0.83. Despite class weighting, recall for Local remains low (0.17). Local precision (0.65) is higher than with the weighted Logistic Regression (0.45), but the Local F1-score (0.27) is lower than that model's 0.47.

Random Forest generally performs well, but like Logistic Regression, it has difficulty with the Local category, possibly due to the nature of the class or the data distribution.
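When comparing models on a single label such as Local, it can help to pull that label's row out of the report programmatically rather than reading it off the printed table. The sketch below uses `classification_report(..., output_dict=True)` on invented arrays. The `toy_*` names and values are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import classification_report

# Invented ground truth and predictions; columns: [new, local].
toy_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
toy_pred = np.array([[1, 0], [0, 0], [1, 1], [0, 0]])

# output_dict=True returns nested dicts keyed by label name.
# zero_division=0 scores undefined metrics as 0 without emitting a warning,
# e.g. for samples that have no true or predicted labels.
toy_report = classification_report(
    toy_true,
    toy_pred,
    target_names=["new", "local"],
    output_dict=True,
    zero_division=0,
)

local_row = toy_report["local"]
print("Local precision:", local_row["precision"])
print("Local recall:", local_row["recall"])
print("Local F1:", local_row["f1-score"])
```

Collecting these rows from each model's report into one table makes the Local comparison across Logistic Regression and Random Forest easier to read.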

