<b>IS41070 Machine Learning Foundations Project</b>

<b>Install Necessary Packages</b>

In this section, we will install the necessary packages for our project. These include:
- `pandas` for data manipulation and analysis
- `numpy` for numerical operations
- `seaborn` for statistical data visualization
- `nltk` for natural language processing
- `matplotlib` for plotting and visualization
- `wordcloud` for generating word clouds from text data
- `scikit-learn` for machine learning algorithms and tools


In [None]:
# Install pandas for data manipulation and analysis
!pip install pandas

In [None]:
# Install numpy for numerical operations
!pip install numpy

In [None]:
# Install seaborn for statistical data visualization
!pip install seaborn

In [None]:
# Install nltk for natural language processing
!pip install nltk

In [None]:
# Install matplotlib for plotting and visualization
!pip install matplotlib

In [None]:
# Install wordcloud for generating word clouds from text data
!pip install wordcloud

In [None]:
# Install scikit-learn for machine learning algorithms and tools
!pip install scikit-learn

<b>Data Understanding & Exploration</b>

Loading the Dataset

We will load the dataset provided for the project. This dataset contains news articles and their corresponding categories. We will use `pandas` to load the data and perform initial exploration.

In [None]:
import pandas as pd  # Import the pandas library for data manipulation

news_data = pd.read_csv('22.csv')  # Load the dataset from the CSV file into a DataFrame

news_data.head() # Display the first few rows of the DataFrame to check the data

Data Exploration

We will perform an initial exploration of the dataset. This includes:
- Checking the basic structure of the dataset
- Exploring the distribution of categories
- Analyzing the text data (most common terms, sentence lengths, etc.)
- Checking for missing values and outliers

Basic description of Data

In [None]:
news_data.info() # Display information about the DataFrame, such as the number of rows and columns, data types, and non-null counts

Let's rename the first column to "S.no"

In [None]:
news_data.rename(columns={'Unnamed: 0': 'S.no'}, inplace=True) # Rename the column 'Unnamed: 0' to 'S.no'

In [None]:
news_data.head() # Display the first few rows of the DataFrame to verify the change

Distribution of Categories

In [None]:

category_distribution = news_data['category'].value_counts()  # Count the articles in each category
print(category_distribution)


import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of categories
plt.figure(figsize=(10, 6))
sns.countplot(data=news_data, x='category', order=category_distribution.index)
plt.title('Distribution of Categories')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()


The dataset is significantly imbalanced, with 5984 instances of "POLITICS" compared to 1996 instances of "CRIME" (roughly 3:1). This disparity can bias the model towards predicting "POLITICS", so overall accuracy may look good even when many true "CRIME" articles are missed or misclassified. In particular, recall and precision for the "CRIME" category are likely to be lower than for "POLITICS".
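
To put the imbalance in numbers, the sketch below uses only the two counts quoted above. It computes the accuracy that a trivial classifier would reach by always predicting "POLITICS", and the per-class weights given by the common "balanced" heuristic `n_samples / (n_classes * class_count)`. The variable names are illustrative, and class weighting is shown as one possible remedy, not as something this notebook applies.

```python
# Class counts quoted in the text above
counts = {"POLITICS": 5984, "CRIME": 1996}

n_samples = sum(counts.values())
n_classes = len(counts)

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(counts.values()) / n_samples

# "Balanced" heuristic: rarer classes receive proportionally larger weights
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
for label, weight in balanced_weights.items():
    print(f"Balanced weight for {label}: {weight:.3f}")
```

Any model whose accuracy is close to this baseline has learned little beyond the class prior, which is why per-class metrics matter in the evaluation later on.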

In [None]:
news_data.describe(include='all') # Display summary statistics of the DataFrame, including count, unique values, top values, and frequency for categorical data

Checking for missing values

It's always important to check the data for completeness, since missing values can hurt the accuracy of the model.

In [None]:
missing_values = news_data.isnull().sum() # Check for missing values 
missing_values  # Display the count of missing values for each column

We can see that several columns have missing values, but our primary concern is the category column. This column is the target label for training: it determines whether each article is classified as POLITICS or CRIME.

We will remove the 20 rows that have no category.

In [None]:
news_data_cleaned = news_data.dropna(subset=['category']) # Drop rows where 'category' is NaN

# Check the result
missing_values_after = news_data_cleaned.isnull().sum()
missing_values_after


In [None]:
news_data_cleaned.describe(include='all') # Display summary statistics of the DataFrame, including count, unique values, top values, and frequency for categorical data

<b>Data Analysis</b>

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer


nltk.download('punkt') # Download necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')


In [None]:
news_data_copy = news_data_cleaned.copy() # Create a copy of the DataFrame to avoid modifying the original data

In [None]:
news_data_copy['headline'] = news_data_copy['headline'].fillna('')

news_data_copy['short_description'] = news_data_copy['short_description'].fillna('')  # Fill missing values in the 'short_description' column with empty strings

In [None]:
# Verify that there are no more missing values
missing_values_after = news_data_copy.isnull().sum()
print(missing_values_after)  # Display the count of missing values for each column to confirm


In [None]:
lemmatizer = WordNetLemmatizer() # Initialize the WordNetLemmatizer
stop_words = set(stopwords.words('english'))


def preprocess_text(text):  # Define a function to preprocess text (tokenize, remove stopwords, and lemmatize)
    tokens = word_tokenize(text.lower())  # Tokenize and convert to lowercase
    filtered_tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalpha() and word not in stop_words]  # Remove stopwords and lemmatize
    return ' '.join(filtered_tokens)

In [None]:
news_data_copy['text'] = news_data_copy['headline'] + ' ' + news_data_copy['short_description'] # Combine the 'headline' and 'short_description' into a single text column for vectorization


news_data_copy['processed_text'] = news_data_copy['text'].apply(preprocess_text) # Apply the preprocessing function to the 'text' column


news_data_copy[['text', 'processed_text']].head() # Display the first few rows of the DataFrame to verify the preprocessing

In [None]:
# Import necessary libraries
from collections import Counter  # To count word frequencies
import matplotlib.pyplot as plt  # For plotting
import seaborn as sns  # For more advanced plotting

In [None]:
def plot_most_common_terms(data, category, num_terms=20): #Plot the most common terms in a specific category.
 
    category_data = data[data['category'] == category] # Filter data for the specific category
    
    all_text = ' '.join(category_data['processed_text'])  # Combine all processed text into a single string
    
    word_counts = Counter(all_text.split())  # Tokenize the combined text and count the terms
    
    common_terms = word_counts.most_common(num_terms)    # Get the most common terms
    
    common_terms_df = pd.DataFrame(common_terms, columns=['term', 'count']) # Create a DataFrame for plotting
    
    plt.figure(figsize=(10, 6))   # Plot the most common terms using seaborn's barplot
    sns.barplot(x='count', y='term', data=common_terms_df)
    plt.title(f'Most Common Terms in {category}')
    plt.xlabel('Count')
    plt.ylabel('Term')
    plt.show()

for category in news_data_copy['category'].unique(): # Plot the most common terms for each category
    plot_most_common_terms(news_data_copy, category)
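
As a small illustration of the counting step inside `plot_most_common_terms`, the sketch below applies `collections.Counter` to a few made-up snippets. The snippets are hypothetical stand-ins for rows of `processed_text`, not data from the project.

```python
from collections import Counter

# Hypothetical, already-preprocessed snippets standing in for 'processed_text'
processed_texts = [
    "police arrest suspect shooting",
    "police officer shot suspect",
    "man found dead police say",
]

# Join the documents, split on whitespace, and count each term
word_counts = Counter(" ".join(processed_texts).split())

# most_common(n) returns (term, count) pairs sorted by descending count
top_terms = word_counts.most_common(2)
print(top_terms)
```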

Key Observations:
Politics:
- "Trump" is the most frequent term by a large margin.
- Political Figures: Frequent mentions of "donald," "clinton," "obama," and "hillary."
- Political Terms: Common words include "president," "republican," "state," and "gop."

Crime:
- "Police" is the most common term.
- High frequency of "man," "shooting," "suspect," "officer," and "killed."
- Common terms include "death," "found," "shot," "accused," and "arrested."


We can further analyze other features in the dataset and their relationship with the category.

In [None]:
news_data_copy['text_length'] = news_data_copy['processed_text'].apply(len) # Add a column for the length of the processed text

plt.figure(figsize=(12, 6)) # Plot the distribution of text length for each category
sns.histplot(data=news_data_copy, x='text_length', hue='category', multiple='stack', kde=True)
plt.title('Distribution of Text Length by Category')
plt.xlabel('Text Length')
plt.ylabel('Frequency')
plt.show()

Observations:

- "POLITICS" category shows an approximately normal distribution with a peak around 100 characters.
- "CRIME" category has a similar distribution but with lower frequency.
- High concentration of articles in both categories within the 50 to 150 character range.
- "POLITICS" articles exhibit a wider spread, some exceeding 500 characters.
- "CRIME" articles show less variation, with fewer long articles.
- "POLITICS" articles are more frequent overall compared to "CRIME" articles.
- Some outliers in the "POLITICS" category have text lengths greater than 400 characters.
- Political articles are generally longer and more frequent than crime-related articles.

In [None]:
news_data_copy.head()

In [None]:
news_data_copy = news_data_copy[(news_data_copy['text_length'] <= 450) & (news_data_copy['text_length'] >= 50)].copy() # Remove outliers based on text length; .copy() avoids a SettingWithCopyWarning on later column assignments

# Verify the cleaning process
news_data_copy.info()

In [None]:
news_data_copy['text_length'] = news_data_copy['processed_text'].apply(len) # Recompute the length of the processed text after filtering

plt.figure(figsize=(12, 6)) # Plot the distribution of text length for each category
sns.histplot(data=news_data_copy, x='text_length', hue='category', multiple='stack', kde=True)
plt.title('Distribution of Text Length by Category')
plt.xlabel('Text Length')
plt.ylabel('Frequency')
plt.show()

We have removed the outliers, so the processed text of every remaining article (headline plus short description, after preprocessing) is between 50 and 450 characters long.

<b>Data Preparation</b>

Since we have noticed a significant imbalance in our dataset, it's crucial to split the data in a way that maintains the class distribution across the training, validation, and test sets. An effective approach is stratified sampling, which ensures that each subset has approximately the same class distribution as the original dataset.

In [None]:
from sklearn.model_selection import train_test_split  # Import train_test_split for splitting the dataset

# First, split the data into training (80%) and testing sets (20%)
train_df, test_df = train_test_split(news_data_copy, test_size=0.2, random_state=42, stratify=news_data_copy['category'])

# Then, split the training set further into training (60% of original data) and validation sets (20% of original data)
train_df, valid_df = train_test_split(train_df, test_size=0.25, random_state=42, stratify=train_df['category'])

# Save the training, validation, and testing sets to CSV files
train_df.to_csv('train.csv', index=False)
valid_df.to_csv('valid.csv', index=False)
test_df.to_csv('test.csv', index=False)
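
The two-stage split above relies on a small piece of arithmetic: taking 25% of the remaining 80% yields 20% of the original data. The sketch below checks this with exact fractions; the variable names are illustrative only.

```python
from fractions import Fraction

test_size = Fraction(1, 5)   # first split: test_size=0.2
valid_size = Fraction(1, 4)  # second split: test_size=0.25 of the remainder

remainder = 1 - test_size              # share left after removing the test set
valid_share = remainder * valid_size   # validation share of the original data
train_share = remainder - valid_share  # training share of the original data

print(f"train={float(train_share):.0%}, valid={float(valid_share):.0%}, test={float(test_size):.0%}")
```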

Verify and check the distribution of the splits

In [None]:
# Verify the splits
print("Training set size:", train_df.shape[0])
print("Validation set size:", valid_df.shape[0])
print("Testing set size:", test_df.shape[0])

# Check the distribution of categories in each set
print("\nTraining set category distribution:\n", train_df['category'].value_counts())
print("\nValidation set category distribution:\n", valid_df['category'].value_counts())
print("\nTesting set category distribution:\n", test_df['category'].value_counts())

In [None]:
# Display the first few rows of the training data to check the loaded data
train_df.head()


In [None]:
# Display the first few rows of the validation data to check the loaded data
valid_df.head()

Vectorizing the Text Data

We will use the TF-IDF vectorizer to convert the text data into numerical features.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer  # Import TfidfVectorizer for converting text to numeric features

# Combine preprocessing and TF-IDF vectorization in a single step
vectorizer = TfidfVectorizer(
    lowercase=True,  # Convert all characters to lowercase
    stop_words='english',  # Remove English stopwords
    max_features=5000  # Consider only the top 5000 words by frequency
)

X_train = vectorizer.fit_transform(train_df['processed_text']) # Fit the vectorizer on the training data and transform the text data into TF-IDF features
X_valid = vectorizer.transform(valid_df['processed_text'])

print("Shape of X_train:", X_train.shape) # Display the shape of the resulting feature matrices
print("Shape of X_valid:", X_valid.shape)

y_train = train_df['category'] # Extract the labels (categories) from the training and validation data
y_valid = valid_df['category']
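
To make the TF-IDF weighting more concrete, here is a minimal pure-Python sketch on a toy corpus. It follows the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalisation that scikit-learn documents as its defaults; the corpus and the `tfidf_vector` helper are hypothetical and only meant to show the idea, not to replace `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical toy corpus of preprocessed documents
docs = ["police arrest suspect", "senate vote bill", "police chief vote"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return {term: weight} using smoothed IDF and L2 normalisation."""
    tf = Counter(tokens)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf_vector(tokenized[0])
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "police" also appears in another document, so it receives a lower weight than "arrest" and "suspect", which are unique to that document.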


<b>Building and Evaluating Machine Learning Models</b>

We will use Logistic Regression and Support Vector Machine (SVM).

- Logistic Regression was chosen for its simplicity, speed, and probabilistic outputs.
- SVM was chosen for its effectiveness on high-dimensional text data and its resistance to overfitting.

  Comparing the two gives us a simple, interpretable baseline alongside a strong margin-based classifier.

Logistic Regression

We will start with a simple Logistic Regression model to set a baseline performance.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Initialize the model
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
log_reg.fit(X_train, y_train)

# Make predictions on the validation set
y_pred_log_reg = log_reg.predict(X_valid)

# Evaluate the model on the validation set
print("Logistic Regression Performance on Validation Set:")
print(f"Accuracy: {accuracy_score(y_valid, y_pred_log_reg)}")
print(classification_report(y_valid, y_pred_log_reg))

# Generate confusion matrix
cm = confusion_matrix(y_valid, y_pred_log_reg)
print("Confusion Matrix:\n", cm)

Support Vector Machine (SVM)

We will build a Support Vector Machine classifier to compare against the Logistic Regression baseline.

In [None]:
from sklearn.svm import SVC

# Initialize the model
svm_clf = SVC(kernel='linear', random_state=42)

# Train the model
svm_clf.fit(X_train, y_train)

# Make predictions on the validation set
y_pred_svm = svm_clf.predict(X_valid)

# Evaluate the model on the validation set
print("Support Vector Machine Performance on Validation Set:")
print(f"Accuracy: {accuracy_score(y_valid, y_pred_svm)}")
print(classification_report(y_valid, y_pred_svm))

# Generate confusion matrix
cm = confusion_matrix(y_valid, y_pred_svm)
print("Confusion Matrix:\n", cm)


<b>Building Our Own Deep Learning Model</b>

We will use TensorFlow/Keras to create a simple LSTM model, a type of recurrent neural network (RNN).

We use the Long Short-Term Memory (LSTM) architecture because it is designed for sequential data and can learn long-term dependencies between time steps, which here are the successive words of an article.

In [None]:
!pip install tensorflow


In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Use the preprocessed text (headline + short description) as model input
train_texts = train_df['processed_text']
valid_texts = valid_df['processed_text']

# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=10000, oov_token='<OOV>')
tokenizer.fit_on_texts(train_texts)
train_sequences = tokenizer.texts_to_sequences(train_texts)
valid_sequences = tokenizer.texts_to_sequences(valid_texts)
train_padded = pad_sequences(train_sequences, maxlen=500, padding='post', truncating='post')
valid_padded = pad_sequences(valid_sequences, maxlen=500, padding='post', truncating='post')

# Encode labels
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(train_df['category'])
y_valid = le.transform(valid_df['category'])
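
To illustrate what post-padding and post-truncation do to a sequence, here is a minimal pure-Python sketch. The `pad_post` helper and the token ids are hypothetical; the helper only mimics the behaviour requested with `padding='post', truncating='post'` and is not the Keras implementation.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Truncate from the end, then append pad_value until length == maxlen."""
    trimmed = list(sequence[:maxlen])
    return trimmed + [pad_value] * (maxlen - len(trimmed))

# Hypothetical token ids for a short, preprocessed headline
token_ids = [12, 7, 431, 9, 88]

padded = pad_post(token_ids, maxlen=500)
padding_share = padded.count(0) / len(padded)

print(padded[:8])
print(f"Share of positions that are padding: {padding_share:.0%}")
```

With short texts and `maxlen=500`, most positions are padding. A smaller `maxlen`, or masking the padding value (for example via the `mask_zero` option of the Keras `Embedding` layer), may be worth considering.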

Build LSTM model

In [None]:
model = Sequential([
    Embedding(10000, 64, input_length=500),  # Match the maxlen=500 used when padding the sequences
    LSTM(64, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
history = model.fit(train_padded, y_train, epochs=10, validation_data=(valid_padded, y_valid), batch_size=64)

Evaluate the Model

In [None]:
loss, accuracy = model.evaluate(valid_padded, y_valid)
print(f'Validation Accuracy: {accuracy}')
print(f'Validation Loss: {loss}')


In [None]:
# Plot training & validation accuracy and loss
plt.figure(figsize=(12, 6))

# Plot accuracy
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='train_accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

# Plot loss
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='train_loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.title('Model Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(loc='upper right')

plt.tight_layout()
plt.show()

<b>EVALUATION</b>

<b>Error Analysis</b>

Own Deep Learning Model:

In [None]:
# Predict the validation data
y_pred = (model.predict(valid_padded) > 0.5).astype("int32")

# Generate a classification report
from sklearn.metrics import classification_report, confusion_matrix

print("Classification Report:\n", classification_report(y_valid, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_valid, y_pred))


In [None]:

# Read the label mapping from the fitted LabelEncoder (codes follow sorted label order)
# POLITICS is the larger class (5528 instances); CRIME is the smaller class (1697 instances)
crime_class = int(le.transform(['CRIME'])[0])
politics_class = int(le.transform(['POLITICS'])[0])

print(f"Class {crime_class} corresponds to CRIME.")
print(f"Class {politics_class} corresponds to POLITICS.")


Based on the above classification report:

Class 0: CRIME

- Precision: 0.00 - The model made no correct class 0 predictions.
- Recall: 0.00 - Out of all the actual class 0 instances, the model didn't get any right.
- F1-Score: 0.00 - Overall performance for class 0 is very poor.

Class 1: POLITICS

- Precision: 0.76 - 76% of the articles predicted as class 1 actually belong to class 1.
- Recall: 1.00 - The model identified all actual class 1 instances correctly.
- F1-Score: 0.87 - Good performance for class 1.

Overall Accuracy: 0.76 - 76% of the total predictions are correct, but only because the model predicts class 1 for essentially every article and class 1 dominates the data (the original dataset has 5984 instances of "POLITICS" compared to 1996 instances of "CRIME").

The model is biased towards the majority class (class 1: POLITICS). Class imbalance is one likely cause; another possible factor is that the sequences are post-padded to 500 tokens without masking, so the LSTM mostly processes padding.

Logistic Regression & SVM :

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.svm import SVC

# Initialize the model
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
log_reg.fit(X_train, y_train)

# Make predictions on the validation set
y_pred_log_reg = log_reg.predict(X_valid)

# Evaluate the model on the validation set
print("Logistic Regression Performance on Validation Set:")
print(f"Accuracy: {accuracy_score(y_valid, y_pred_log_reg)}")
print(classification_report(y_valid, y_pred_log_reg))

# Generate confusion matrix
cm = confusion_matrix(y_valid, y_pred_log_reg)
print("Confusion Matrix:\n", cm)

# Initialize the model
svm_clf = SVC(kernel='linear', random_state=42)

# Train the model
svm_clf.fit(X_train, y_train)

# Make predictions on the validation set
y_pred_svm = svm_clf.predict(X_valid)

# Evaluate the model on the validation set
print("Support Vector Machine Performance on Validation Set:")
print(f"Accuracy: {accuracy_score(y_valid, y_pred_svm)}")
print(classification_report(y_valid, y_pred_svm))

# Generate confusion matrix
cm = confusion_matrix(y_valid, y_pred_svm)
print("Confusion Matrix:\n", cm)


From the above, it is evident that the SVM model has higher accuracy than the other two models.
    
Logistic Regression:

- High accuracy (~91%) and good performance overall.
- Struggles more with identifying CRIME (lower recall of 0.65), indicating many CRIME articles are misclassified as POLITICS.
- POLITICS classification is strong (high recall and precision).
- Out of 340 CRIME instances, 220 were correctly classified, and 120 were misclassified as POLITICS.
- Out of 1105 POLITICS instances, 1090 were correctly classified, and 15 were misclassified as CRIME.

Support Vector Machine:

- Higher accuracy (~93%) than Logistic Regression.
- Better balance in classifying both CRIME and POLITICS.
- Improved recall for CRIME (0.79) compared to Logistic Regression, indicating fewer CRIME articles are misclassified.

We will use the F1-score as our primary evaluation metric. The F1-score is the harmonic mean of precision and recall, so it is a single measure that balances false positives and false negatives. This is especially important with imbalanced datasets, where accuracy may not give a clear view of model performance.

Given the imbalance in our dataset (more instances of "POLITICS" than "CRIME"), the F1-score ensures that our model performs well across both classes rather than just the majority class.
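
As a worked example, the sketch below recomputes precision, recall, F1, and accuracy from the Logistic Regression counts listed above, treating CRIME as the positive class. The helper name `precision_recall_f1` is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Logistic Regression counts reported above, treating CRIME as the positive class
tp = 220   # CRIME correctly classified
fn = 120   # CRIME misclassified as POLITICS
fp = 15    # POLITICS misclassified as CRIME
tn = 1090  # POLITICS correctly classified

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The accuracy of about 0.907 looks strong, while the CRIME F1-score of about 0.77 exposes the weaker recall on the minority class.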


<b>MODEL IMPROVEMENT</b>

Logistic Regression Model:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Feature Scaling 
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

# Initialize the model with GridSearchCV for hyperparameter tuning
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

# Best estimator after hyperparameter tuning
log_reg_best = grid_search.best_estimator_

# Make predictions on the validation set
y_pred_log_reg_best = log_reg_best.predict(X_valid_scaled)

# Evaluate the model on the validation set
print("Improved Logistic Regression Performance on Validation Set:")
print(f"Accuracy: {accuracy_score(y_valid, y_pred_log_reg_best)}")
print(classification_report(y_valid, y_pred_log_reg_best))

# Generate confusion matrix
cm_best = confusion_matrix(y_valid, y_pred_log_reg_best)
print("Confusion Matrix:\n", cm_best)


Key Improvements:

- Feature Scaling: We standardized the TF-IDF features so that they are on a comparable scale (using `with_mean=False`, which scales without centering and so keeps the matrix sparse). This helps the regularized Logistic Regression model treat features more evenly.
- Hyperparameter Tuning: We used GridSearchCV to try several values of the parameter C (the inverse regularization strength) with 5-fold cross-validation and kept the best-scoring configuration.
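
The effect of scaling without centering can be sketched on a single feature column. The values below are hypothetical; the point is that dividing by the standard deviation alone leaves zeros at zero, which preserves sparsity.

```python
import statistics

# Hypothetical values of one term's feature across four documents (mostly zeros)
column = [0.0, 0.0, 2.0, 0.0]

# Scale by the population standard deviation but do not subtract the mean,
# which is the effect of StandardScaler(with_mean=False)
std = statistics.pstdev(column)
scaled = [value / std for value in column]

print(scaled)
```

Subtracting the mean would turn every zero into a non-zero value, which is impractical for a large sparse TF-IDF matrix.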

Performance Comparison:
- Accuracy: Increased from 90.7% to 91.1%.
- Precision and Recall for Minority Class (0): Recall improved (0.76 vs. 0.65) while precision decreased (0.85 vs. 0.94), giving a more balanced model.
- Confusion Matrix: CRIME articles misclassified as POLITICS fell (82 vs. 120), while POLITICS articles misclassified as CRIME rose (46 vs. 15).

Support Vector Machine Model (Using an Ensemble Approach):

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Initialize base models
lr = LogisticRegression()
nb = MultinomialNB()
svm = SVC(kernel='rbf', probability=True)

# Create an ensemble model
ensemble_clf = VotingClassifier(estimators=[
    ('lr', lr), ('nb', nb), ('svm', svm)], voting='soft')

# Train the ensemble model
ensemble_clf.fit(X_train_scaled, y_train)

# Make predictions on the validation set
y_pred_ensemble = ensemble_clf.predict(X_valid_scaled)

# Evaluate the model
print("Ensemble Model Performance on Validation Set:")
print(f"Accuracy: {accuracy_score(y_valid, y_pred_ensemble)}")
print(classification_report(y_valid, y_pred_ensemble))

# Generate confusion matrix
cm_ensemble = confusion_matrix(y_valid, y_pred_ensemble)
print("Confusion Matrix:\n", cm_ensemble)
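
The difference between soft and hard voting can be shown with a small hand calculation. The probabilities below are made up for one hypothetical article; soft voting averages the class probabilities across models, whereas hard voting counts each model's predicted label.

```python
# Hypothetical predicted probabilities [P(class 0), P(class 1)] for one article
probabilities = {
    "lr": [0.45, 0.55],
    "nb": [0.90, 0.10],
    "svm": [0.48, 0.52],
}

n_models = len(probabilities)
n_classes = 2

# Soft voting: average each class probability across the models
averaged = [
    sum(p[c] for p in probabilities.values()) / n_models for c in range(n_classes)
]
predicted_class = max(range(n_classes), key=lambda c: averaged[c])

# Hard voting for comparison: each model casts one vote for its top class
hard_votes = [max(range(n_classes), key=lambda c: p[c]) for p in probabilities.values()]
hard_class = max(set(hard_votes), key=hard_votes.count)

print("Soft vote:", predicted_class, "Hard vote:", hard_class)
```

Here one confident model outweighs two marginal ones under soft voting, which is why soft voting depends on well-calibrated probabilities.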


Despite a slight decline in overall performance compared to the Support Vector Machine (SVM) model, the current ensemble model demonstrates several positive aspects.

Positive Aspects of the Current Ensemble Model:

- Improved Recall for Minority Class (0): Increased from 0.79 to 0.80, better at identifying true positives.
- Fewer False Positives: Decreased from 73 to 69, indicating improved precision.
- Balanced Performance: More balanced across metrics, beneficial when both precision and recall are critical.

The ensemble might perform better when trained on a larger dataset.

- Now, we are left with an improved Logistic Regression model and an ensemble model.

<b>CROSS VALIDATION</b>

Merge the training and validation sets to perform cross-validation

In [None]:
# Merge the training and validation sets
crossvalid = pd.concat([train_df, valid_df], axis=0).reset_index(drop=True)

# Extract features and labels
X_crossvalid = crossvalid['processed_text']  
y_crossvalid = crossvalid['category']  

# Print the number of rows in the combined dataset
print(f"Number of rows in X_crossvalid: {X_crossvalid.shape[0]}")

Vectorizing the cross-validation dataset

In [None]:
# Initialize the vectorizer
vectorizer = TfidfVectorizer(
    lowercase=True,  # Convert all characters to lowercase
    stop_words='english',  # Remove English stopwords
    max_features=5000  # Consider only the top 5000 words by frequency
)

# Fit the vectorizer on the combined data and transform the text data into TF-IDF features
X_crossvalid_vectorized = vectorizer.fit_transform(X_crossvalid)


Cross-validation using the improved Logistic Regression model

In [None]:
# Feature Scaling 
scaler = StandardScaler(with_mean=False)
X_crossvalid_scaled = scaler.fit_transform(X_crossvalid_vectorized)

# Initialize the model with GridSearchCV for hyperparameter tuning
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_crossvalid_scaled, y_crossvalid)

# Best estimator after hyperparameter tuning
log_reg_best = grid_search.best_estimator_

# Make predictions on the combined data (the same data the model was fitted on)

y_pred_log_reg_best = log_reg_best.predict(X_crossvalid_scaled)

# Evaluate the model on the combined training and validation data
print("Improved Logistic Regression Performance on Combined Training and Validation Data:")
print(f"Accuracy: {accuracy_score(y_crossvalid, y_pred_log_reg_best)}")
print(classification_report(y_crossvalid, y_pred_log_reg_best))

# Generate confusion matrix
cm_best = confusion_matrix(y_crossvalid, y_pred_log_reg_best)
print("Confusion Matrix:\n", cm_best)

The flawless performance observed here arises because the model is evaluated on the same data it was trained on: `log_reg_best` is fitted on the combined data and then predicts that same data. The resulting 100% accuracy reflects how well the model fits its training data, not how it will perform on unseen data. The cross-validated score that GridSearchCV computes on held-out folds (`grid_search.best_score_`) is a more realistic estimate.
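
One way to obtain held-out estimates is to wrap the vectorizer and the classifier in a scikit-learn pipeline and score it with `cross_val_score`, so that every fold is evaluated on data the model did not see during fitting. The sketch below uses a tiny hypothetical dataset and macro F1 as an assumed scoring choice; it is an illustration, not the configuration used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for 'processed_text' and 'category'
texts = [
    "police arrest suspect shooting", "officer shot man street",
    "suspect accused killing found", "police found man dead",
    "senate vote bill president", "republican senator campaign",
    "president signs bill congress", "election campaign vote state",
]
labels = ["CRIME"] * 4 + ["POLITICS"] * 4

# The pipeline re-fits the vectorizer inside every training fold,
# so the held-out fold never influences the vocabulary or the model
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1_macro")

print("Per-fold macro F1:", scores)
print("Mean macro F1:", scores.mean())
```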

Cross-validation using Ensemble Model 

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Encode the string labels of the combined data as integers, matching the integer labels the ensemble was trained on
label_encoder = LabelEncoder()
y_true_numeric = label_encoder.fit_transform(crossvalid['category'])
y_pred_ensemble_crossvalid = ensemble_clf.predict(X_crossvalid_vectorized)

# Evaluate the model on the cross-validation set
print("Ensemble Model Performance on Cross-Validation Set:")
print(f"Accuracy: {accuracy_score(y_true_numeric, y_pred_ensemble_crossvalid)}")
print(classification_report(y_true_numeric, y_pred_ensemble_crossvalid))
cm_ensemble_crossvalid = confusion_matrix(y_true_numeric, y_pred_ensemble_crossvalid)
print("Confusion Matrix on Cross-Validation Set:\n", cm_ensemble_crossvalid)


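The ensemble model achieved 93.63% training accuracy but dropped to 76.5% in this evaluation, predicting only the majority class (class 1) and missing all instances of class 0, as the confusion matrix shows. Besides class imbalance, a likely cause is a feature mismatch: the ensemble was trained on scaled features from the earlier vectorizer, whereas here it receives unscaled features from a vectorizer re-fitted on the combined data. Reusing the originally fitted vectorizer and scaler, and addressing the imbalance with adjusted class weights, resampling techniques, or model tuning, can improve minority class classification.

So far, our best model is the improved Logistic Regression model.

Applying it to the test dataset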

In [None]:

# Load the test set
test_df = pd.read_csv('test.csv')
X_test = test_df['processed_text']
y_test = test_df['category']

# Initialize the vectorizer 
vectorizer = TfidfVectorizer(
    lowercase=True,  # Convert all characters to lowercase
    stop_words='english',  # Remove English stopwords
    max_features=5000  # Consider only the top 5000 words by frequency
)


# Initialize the scaler
scaler = StandardScaler(with_mean=False)
# Caution: fit_transform re-fits the new vectorizer and scaler on the test data,
# so the resulting columns do not correspond to the features log_reg_best was trained on
X_test_vectorized = vectorizer.fit_transform(X_test)
X_test_scaled = scaler.fit_transform(X_test_vectorized)

# Make predictions on the test set
y_pred_log_reg_best_test = log_reg_best.predict(X_test_scaled)

# Evaluate the model on the test set
print("Improved Logistic Regression Performance on Test Set:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_log_reg_best_test)}")
print(classification_report(y_test, y_pred_log_reg_best_test))

# Generate confusion matrix
cm_best_test = confusion_matrix(y_test, y_pred_log_reg_best_test)
print("Confusion Matrix on Test Set:\n", cm_best_test)


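This model performed flawlessly when evaluated on the data it was trained on, but its performance dropped on the unseen, independent test set. Part of this drop is likely due to the cell above re-fitting the vectorizer and scaler on the test data, so the test features do not match those the model was trained on. We will retrain this model on the combined training and validation data and check for any improvement.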
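
Why re-fitting a vectorizer on the test data is harmful can be seen in a small pure-Python sketch. The `build_vocabulary` helper and the documents are hypothetical; the helper simply assigns each distinct term a column index, in the same spirit as a fitted vectorizer fixing its columns.

```python
def build_vocabulary(documents):
    """Map each distinct term to a column index, in sorted term order."""
    terms = sorted({term for doc in documents for term in doc.split()})
    return {term: index for index, term in enumerate(terms)}

# Hypothetical preprocessed texts
train_docs = ["police arrest suspect", "senate vote bill"]
test_docs = ["city police chief", "election vote"]

train_vocab = build_vocabulary(train_docs)  # what the model was trained on
test_vocab = build_vocabulary(test_docs)    # what re-fitting on test data yields

for term in ("police", "vote"):
    print(f"{term}: train column {train_vocab[term]}, test column {test_vocab[term]}")
```

The same word lands in different columns, so coefficients learned for one column are applied to an unrelated term. Calling `transform` with the vectorizer fitted on the training data avoids this.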

Retraining our improved Logistic Regression model on the combined training and validation datasets

Applying this retrained model to the test dataset

In [None]:
# Load the training and validation sets
train_df = pd.read_csv('train.csv')
valid_df = pd.read_csv('valid.csv')

# Combine the training and validation sets for vectorizer fitting
full_train_df = pd.concat([train_df, valid_df])
X_full_train = full_train_df['processed_text']
y_full_train = full_train_df['category']

# Initialize the vectorizer 
vectorizer = TfidfVectorizer(
    lowercase=True,  # Convert all characters to lowercase
    stop_words='english',  # Remove English stopwords
    max_features=5000  # Consider only the top 5000 words by frequency
)

# Fit the vectorizer on the combined training data and transform the training data
X_full_train_vectorized = vectorizer.fit_transform(X_full_train)

# Initialize the scaler
scaler = StandardScaler(with_mean=False)

# Fit the scaler on the combined training data and transform the training data
X_full_train_scaled = scaler.fit_transform(X_full_train_vectorized)

# Initialize the model with GridSearchCV for hyperparameter tuning
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_full_train_scaled, y_full_train)

# Best estimator after hyperparameter tuning
log_reg_best = grid_search.best_estimator_

# GridSearchCV already refits best_estimator_ on the full data (refit=True by default),
# so this explicit fit is redundant but harmless
log_reg_best.fit(X_full_train_scaled, y_full_train)

# Load the test set
test_df = pd.read_csv('test.csv')
X_test = test_df['processed_text']
y_test = test_df['category']

# Transform the test data using the fitted vectorizer and scaler
X_test_vectorized = vectorizer.transform(X_test)
X_test_scaled = scaler.transform(X_test_vectorized)

# Make predictions on the test set
y_pred_log_reg_best_test = log_reg_best.predict(X_test_scaled)

# Evaluate the model on the test set
print("Improved Logistic Regression Performance on Test Set:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_log_reg_best_test)}")
print(classification_report(y_test, y_pred_log_reg_best_test))

# Generate confusion matrix
cm_best_test = confusion_matrix(y_test, y_pred_log_reg_best_test)
print("Confusion Matrix on Test Set:\n", cm_best_test)


Retraining the logistic regression model resulted in significant improvements in both accuracy and F1-score. The model's accuracy increased from 71.7% to 92.1%, and the F1-score improved for both classes, particularly for class 0, rising from 13% to 82%. Much of this gain likely comes from transforming the test set with the vectorizer and scaler fitted on the training data instead of re-fitting them on the test set. The model now performs well overall and achieves a balanced precision-recall trade-off for both the majority and minority classes.

In [None]:
# Save the trained model to a file
import joblib
joblib.dump(log_reg_best, 'logreg_improved_model.pkl')

In [None]:
loaded_model = joblib.load('logreg_improved_model.pkl')  # Verify that the model can be loaded and produces the same results
y_pred = loaded_model.predict(X_test_scaled)  # Predict with the reloaded model, not the in-memory one
loaded_test_accuracy = accuracy_score(y_test, y_pred)
print("Loaded Model Test Set Accuracy:", loaded_test_accuracy)