In [26]:
import os
import csv
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
import random

In [27]:
# Step 1: Load and preprocess the data
df = pd.read_csv('BABE_scraped.csv')
df['content'] = df['content'].str.lower()  # Convert text to lowercase

df.dropna(subset=['content'], inplace=True) # Drop rows with missing values in the 'content' column

In [28]:
# Step 2: Feature extraction
print("Step 2: Feature extraction...")
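# Note: the vectorizer is fit on all rows before the split in Step 3, so the
# vocabulary and IDF weights also reflect the validation and test articles.
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(df['content'])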
y = df['type_class']

Step 2: Feature extraction...


In [29]:
# Step 3: Split data into training, validation, and testing sets
print("Step 3: Splitting data into training, validation, and testing sets...")
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

Step 3: Splitting data into training, validation, and testing sets...


In [34]:
# Step 4: Train the Naive Bayes classifier
print("Step 4: Training the Naive Bayes classifier...")
clf = MultinomialNB()
clf.fit(X_train, y_train)

Step 4: Training the Naive Bayes classifier...


In [35]:
# Step 5: Evaluate the model on the validation set
print("Step 5: Evaluating the model on the validation set...")
y_val_pred = clf.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", val_accuracy)

Step 5: Evaluating the model on the validation set...
Validation Accuracy: 0.6790123456790124


In [36]:
# Step 6: Evaluate the model on the test set
print("Step 6: Evaluating the model on the test set...")
y_test_pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)


Step 6: Evaluating the model on the test set...
Test Accuracy: 0.7484662576687117


In [42]:
# Step 7: Print classification report for test set
print("Classification Report for Test Set:")
print(classification_report(y_test, y_test_pred, zero_division=1))

Classification Report for Test Set:
              precision    recall  f1-score   support

           0       0.73      0.92      0.82        90
           1       1.00      0.00      0.00        11
           2       0.78      0.63      0.70        62

    accuracy                           0.75       163
   macro avg       0.84      0.52      0.50       163
weighted avg       0.77      0.75      0.72       163



In [45]:
from scipy.sparse import hstack
from sklearn.model_selection import cross_val_score

# Add article length (number of characters) as a feature
df['article_length'] = df['content'].apply(len)

# Combine text features with custom features
X_custom = hstack([X, df['article_length'].values.reshape(-1, 1)])

# Perform 5-fold cross-validation
cv_scores = cross_val_score(clf, X_custom, y, cv=5)
print("Cross-Validation Scores:", cv_scores)
print("Mean CV Accuracy:", cv_scores.mean())

Cross-Validation Scores: [0.69846154 0.70769231 0.63692308 0.70153846 0.68      ]
Mean CV Accuracy: 0.684923076923077


**Data Loading and Preprocessing**
- The dataset (BABE_scraped.csv) contains news articles where each article is labeled with a political bias class (0 for left, 1 for center, and 2 for right).
- Text preprocessing involves converting the text to lowercase and removing rows with missing content.
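
The two preprocessing steps can be tried on a small in-memory frame. This is only a sketch: the rows below are made-up placeholders rather than records from BABE_scraped.csv, and only the column names content and type_class come from the notebook.

```python
import pandas as pd

# Hypothetical rows for illustration; the real data comes from BABE_scraped.csv.
toy_df = pd.DataFrame({
    "content": ["Senate PASSES Budget", None, "Tax Cut Debate"],
    "type_class": [0, 1, 2],
})

# Drop rows with no text, then lowercase what remains.
toy_clean = toy_df.dropna(subset=["content"]).copy()
toy_clean["content"] = toy_clean["content"].str.lower()

print(toy_clean["content"].tolist())
```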

**Feature Extraction**
- Text data is transformed into numerical feature vectors using TF-IDF (Term Frequency-Inverse Document Frequency) representation. This process captures the importance of words in the documents relative to the entire corpus.
- English stop words are removed, and the vocabulary is limited to the 1,000 terms that occur most often across the corpus.
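
To see what the vocabulary cap does, the same vectorizer settings can be applied to a toy corpus. The three sentences and the cap of 3 are assumptions chosen for illustration; the notebook itself uses max_features=1000 on the article text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; "senate", "tax" and "budget" each occur twice,
# every other non-stop word occurs once.
toy_docs = [
    "the senate passed the tax budget",
    "the tax budget faces criticism",
    "critics say the senate ignored voters",
]

# Keep only the 3 most frequent terms after English stop-word removal.
toy_vectorizer = TfidfVectorizer(max_features=3, stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)  # (documents, kept terms)
```

Terms outside the kept vocabulary are dropped entirely, so the third sentence is represented by "senate" alone.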

**Data Splitting**
- The dataset is split into training, validation, and test sets with a ratio of 80:10:10 respectively. This ensures that the model is trained on a majority of the data while still having separate sets for validation and final evaluation.
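
The 80:10:10 ratio comes from two chained calls: the first holds out 20% of the rows, and the second cuts that held-out part in half. The sketch below repeats the two calls on placeholder arrays. The row count of 1625 is an assumption, picked so that the test split has 163 rows like the support shown in the classification report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumed size); only the split sizes matter here.
n_rows = 1625
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows) % 3

# First call: 80% train, 20% held out.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Second call: the held-out 20% is halved into validation and test.
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))
```

Because the classes are imbalanced, passing stratify=y_demo to the first call and stratify=y_tmp to the second is one option for keeping class proportions similar across the three sets; the notebook does not do this.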

**Model Training**
- The Multinomial Naive Bayes classifier is trained on the training data. Naive Bayes classifiers are commonly used for text classification tasks due to their simplicity and effectiveness with high-dimensional data like text.
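
A minimal sketch of the classifier on made-up term counts shows the fit and predict steps in isolation. The counts and labels are hypothetical and have nothing to do with the dataset.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical counts of two terms in four documents (not from the dataset).
X_counts = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
y_labels = np.array([0, 0, 1, 1])

nb = MultinomialNB()  # default alpha=1.0 applies Laplace smoothing
nb.fit(X_counts, y_labels)

# One document dominated by the first term, one by the second.
new_docs = np.array([[4, 0], [0, 4]])
nb_pred = nb.predict(new_docs)
nb_proba = nb.predict_proba(new_docs)

print(nb_pred)
print(nb_proba.round(3))
```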

**Model Evaluation on Validation Set**
- The trained model is evaluated on the validation set using accuracy only (about 0.679). Precision, recall, and F1-score are not computed at this step; they appear only in the test-set classification report.

**Model Evaluation on Test Set**
- The model is then evaluated on the held-out test set to estimate how well it generalizes to unseen data. Test accuracy is about 0.748.

**Classification Report**
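- The classification report provides precision, recall, and F1-score for each political bias class (left, center, right), which shows how the model performs on each class individually. Here the center class (1) has a recall of 0.00 on 11 test articles, meaning the model never predicted it; its precision of 1.00 is a result of the zero_division=1 setting rather than of correct predictions, and it inflates the macro-average precision.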
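
The effect of zero_division can be reproduced on a handful of made-up labels in which one class is never predicted. The labels below are hypothetical; only the zero_division argument mirrors the notebook.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: class 1 occurs once in y_true but is never predicted.
y_true = [0, 0, 1, 2, 2]
y_pred = [0, 0, 0, 2, 2]

# Precision for class 1 is 0/0, so the reported value is whatever
# zero_division is set to.
p_as_one = precision_score(y_true, y_pred, labels=[1], average=None, zero_division=1)
p_as_zero = precision_score(y_true, y_pred, labels=[1], average=None, zero_division=0)

# Recall for class 1 is 0/1: the single true example was missed.
r_class1 = recall_score(y_true, y_pred, labels=[1], average=None, zero_division=0)

print(p_as_one, p_as_zero, r_class1)
```

Setting zero_division=0 instead would report the undefined precision as 0, which gives a more conservative macro average.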

**Custom Feature Addition and Cross-Validation**
- The custom feature added to the model is the "article length", computed as the number of characters in each article's text. Incorporating it into the feature matrix lets the model potentially capture patterns relating article length to political bias classification.
- In the context of this NLP project, adding the article length as a feature serves two primary purposes:
    - **Additional Information Incorporation:** The model gains information beyond the textual content of the articles, which may help it differentiate between articles of different lengths.
    - **Testing for Correlation between Length and Bias:** If certain political biases tend to appear in longer or shorter articles, the classifier can pick up on that association, which could provide insight into how different biases are expressed in media content.
- Overall, the article length feature is meant to capture any relationship between article length and political bias. With it, 5-fold cross-validation gives a mean accuracy of about 0.685; since no cross-validated score without the feature is reported, these results do not show whether it improves classification.
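
One possible variation, sketched here under stated assumptions and not taken from the notebook, wraps the text and length features in a single pipeline so that TF-IDF is refit on the training part of every fold. It also log-scales the character count so it sits closer to the TF-IDF value range. The toy articles, the two-class labels, and the choice of log1p are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data; the notebook would use df['content'] and df['type_class'].
toy = pd.DataFrame({
    "content": [
        "senate passes budget after long debate",
        "budget vote splits senate",
        "senate budget talks continue",
        "budget debate reaches senate floor",
        "team wins final match",
        "coach praises team after match",
        "match ends with late goal for team",
        "team prepares for final match this weekend",
    ],
    "type_class": [0, 0, 0, 0, 1, 1, 1, 1],
})
toy["article_length"] = toy["content"].str.len()  # length in characters

# TF-IDF on the text column, log(1 + length) on the length column.
features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(stop_words="english"), "content"),
    ("length", FunctionTransformer(np.log1p), ["article_length"]),
])
model = make_pipeline(features, MultinomialNB())

# The whole pipeline is refit on each training fold.
scores = cross_val_score(
    model, toy[["content", "article_length"]], toy["type_class"], cv=2
)
print("Mean CV accuracy:", scores.mean())
```

Running the same pipeline once with and once without the length column would give a direct comparison of what the feature contributes.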
