# Statistical NLP Part A

In [1]:
#1. Read and Analyse Dataset.

# Load required libraries
import pandas as pd
import zipfile
import os

# Extracting the zip file
with zipfile.ZipFile('blogtext.zip', 'r') as zip_ref:
    zip_ref.extractall('blogs_data')

# Listing the extracted files
extracted_files = os.listdir('blogs_data')
print(extracted_files)

# The archive contains a single CSV; peek at its first 1000 characters to inspect the format
file_path = os.path.join('blogs_data', extracted_files[0])
with open(file_path, 'r', encoding='utf-8') as file:
    sample_data = file.read(1000)  # Read only the preview instead of loading the whole file

# Print sample data to inspect
print(sample_data)

['blogtext.csv']
id,gender,age,topic,sign,date,text
2059027,male,15,Student,Leo,"14,May,2004","           Info has been found (+/- 100 pages, and 4.5 MB of .pdf files) Now i have to wait untill our team leader has processed it and learns html.         "
2059027,male,15,Student,Leo,"13,May,2004","           These are the team members:   Drewes van der Laag           urlLink mail  Ruiyu Xie                     urlLink mail  Bryan Aaldering (me)          urlLink mail          "
2059027,male,15,Student,Leo,"12,May,2004","           In het kader van kernfusie op aarde:  MAAK JE EIGEN WATERSTOFBOM   How to build an H-Bomb From: ascott@tartarus.uwa.edu.au (Andrew Scott) Newsgroups: rec.humor Subject: How To Build An H-Bomb (humorous!) Date: 7 Feb 1994 07:41:14 GMT Organization: The University of Western Australia  Original file dated 12th November 1990. Seemed to be a transcript of a 'Seven Days' article. Poorly formatted and corrupted. I have added the text between 'examine under a microscop

In [3]:
df=pd.read_csv(file_path)

In [4]:
# #1A. Clearly write outcome of data analysis
# #1B. Clean the Structured Data
# #1B. i. Missing value analysis and imputation.

# Count missing values per column
print("Missing Values in Dataset:\n", df.isnull().sum())

# Basic statistics of the text data (e.g., average length of blog posts)
df['text_length'] = df['text'].apply(len)
print("Statistics on Blog Text Length:\n", df['text_length'].describe())

Missing Values in Dataset:
 id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64
Statistics on Blog Text Length:
 count    681284.000000
mean       1120.730698
std        2328.437003
min           4.000000
25%         230.000000
50%         637.000000
75%        1407.000000
max      790123.000000
Name: text_length, dtype: float64


### Note to the evaluator:
My system's configuration is insufficient to process the full dataset, so I have limited the data to the first 50,000 rows.


In [8]:
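# Limiting the dataframe to the first 50,000 rows
# (.copy() gives an independent frame, so adding columns later does not raise SettingWithCopyWarning)
df = df.head(50000).copy()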
print(f"Dataframe limited to {df.shape[0]} rows")

# Basic statistics of the text data (e.g., average length of blog posts)
df['text_length'] = df['text'].apply(len)
print("Statistics on Blog Text Length:\n", df['text_length'].describe())

Dataframe limited to 50000 rows
Statistics on Blog Text Length:
 count     50000.000000
mean       1130.585300
std        2216.412948
min           4.000000
25%         237.000000
50%         662.000000
75%        1460.000000
max      321278.000000
Name: text_length, dtype: float64


In [10]:
#1B. ii. Eliminate Non-English textual data.

from langdetect import detect, LangDetectException

# Function to detect language
def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"

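# Apply the function to detect language of each blog post
# Note: langdetect is non-deterministic by default; set DetectorFactory.seed before
# the first detect() call if the row count below must be reproducible
df['language'] = df['text'].apply(detect_language)

# Filter to keep only English blogs (.copy() so later column assignments act on an independent frame)
df = df[df['language'] == 'en'].copy()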

# Checking the remaining dataset after filtering
print("Remaining rows after removing non-English posts:", df.shape[0])

Remaining rows after removing non-English posts: 47746


In [20]:
#2. Preprocess unstructured data to make it consumable for model training.
#2A. Eliminate All special Characters and Numbers

import re

def clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters and spaces
    return text

# Apply cleaning function to text column
df['cleaned_text'] = df['text'].apply(clean_text)

# Preview cleaned data
print(df[['text', 'cleaned_text']].head())

                                                text  \
0             Info has been found (+/- 100 pages,...   
2             In het kader van kernfusie op aarde...   
3                   testing!!!  testing!!!             
4               Thanks to Yahoo!'s Toolbar I can ...   
5               I had an interesting conversation...   

                                        cleaned_text  
0             Info has been found   pages and  MB...  
2             In het kader van kernfusie op aarde...  
3                         testing  testing            
4               Thanks to Yahoos Toolbar I can no...  
5               I had an interesting conversation...  
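A side effect visible in the preview: deleting punctuation outright glues neighbouring tokens together (`Yahoo!'s` becomes `Yahoos`). The sketch below compares the pattern used above with a hypothetical variant, `clean_text_keep_boundaries`, that replaces each run of non-letters with a single space instead. The sample sentence is made up for illustration, and the variant has its own trade-off: it leaves fragments such as a stray `s`.

```python
import re

def clean_text_keep_boundaries(text):
    # Hypothetical variant: turn every run of non-letters into one space,
    # so words separated only by punctuation stay separate tokens.
    return re.sub(r"[^a-zA-Z]+", " ", text).strip()

# Made-up sample containing the kinds of punctuation seen in the blog posts
sample = "Yahoo!'s Toolbar blocks pop-ups...which is nice (100 pages)"

# Pattern used in the notebook: delete anything that is not a letter or whitespace
original = re.sub(r"[^a-zA-Z\s]", "", sample)
spaced = clean_text_keep_boundaries(sample)

print(original)  # tokens around '...' are merged into 'popupswhich'
print(spaced)    # 'pop', 'ups' and 'which' remain separate tokens
```

Whether the merged tokens matter depends on the vectorizer's vocabulary size; with `max_features=5000` most of them are likely to be dropped as rare terms anyway.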


In [21]:
#2B. Lowercase all textual data

df['cleaned_text'] = df['cleaned_text'].apply(lambda x: x.lower())

# Preview cleaned data
print(df[['cleaned_text']].head())

                                        cleaned_text
0             info has been found   pages and  mb...
2             in het kader van kernfusie op aarde...
3                         testing  testing          
4               thanks to yahoos toolbar i can no...
5               i had an interesting conversation...


In [24]:
#2C. Remove all Stopwords

import nltk
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Remove Stopwords
df['cleaned_text'] = df['cleaned_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

# Preview cleaned data
print(df[['cleaned_text']].head())

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kanak\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                        cleaned_text
0  info found pages mb pdf files wait untill team...
2  het kader van kernfusie op aarde maak je eigen...
3                                    testing testing
4  thanks yahoos toolbar capture urls popupswhich...
5  interesting conversation dad morning talking k...


In [25]:
#2D. Remove all extra white spaces
# (the split/join in 2C already collapses whitespace; this step makes the requirement explicit)

df['cleaned_text'] = df['cleaned_text'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())

# Preview final cleaned data
print(df[['cleaned_text']].head())

                                        cleaned_text
0  info found pages mb pdf files wait untill team...
2  het kader van kernfusie op aarde maak je eigen...
3                                    testing testing
4  thanks yahoos toolbar capture urls popupswhich...
5  interesting conversation dad morning talking k...


In [26]:
#3. Build a base Classification model
#3A. Create dependent and independent variables

X = df['cleaned_text']  # Independent variable (text)
y = df['gender']        # Dependent variable (target)

# Checking the shapes
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

Shape of X: (47746,)
Shape of y: (47746,)


In [27]:
#3B. Split data into train and test.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shapes of the training and testing data
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

Training set shape: (38196,)
Testing set shape: (9550,)
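The split above is purely random. As an optional refinement, `train_test_split` accepts a `stratify` argument that keeps the class proportions identical in both partitions. The sketch below uses made-up toy data (`texts`, `labels`) with roughly the same 48/52 balance as this dataset; it is an illustration, not the split used for the reported results.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y (assumed data, not the blog corpus)
texts = [f"post {i}" for i in range(1000)]
labels = ["female"] * 480 + ["male"] * 520

# stratify=labels preserves the 48/52 class ratio in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("Train:", dict(train_counts))
print("Test:", dict(test_counts))
```

With classes this close to balanced the effect is small, but stratifying removes one source of run-to-run variation in the metrics.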


In [28]:
#3C. Vectorize data using any one vectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)  # Limit the number of features to 5000 for faster processing
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Check the shape of the vectorized data
print("Shape of TF-IDF training data:", X_train_tfidf.shape)
print("Shape of TF-IDF testing data:", X_test_tfidf.shape)

Shape of TF-IDF training data: (38196, 5000)
Shape of TF-IDF testing data: (9550, 5000)


In [29]:
#3D. Build a base model for Supervised Learning - Classification.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Predict on the test set
y_pred = model.predict(X_test_tfidf)

# Checking a few predictions
print("Predictions on test set:", y_pred[:10])

Predictions on test set: ['female' 'female' 'male' 'female' 'female' 'male' 'female' 'female'
 'male' 'male']


In [30]:
#3E. Clearly print Performance Metrics.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Detailed classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7159
Precision: 0.7158
Recall: 0.7159
F1-Score: 0.7158

Classification Report:
               precision    recall  f1-score   support

      female       0.71      0.69      0.70      4585
        male       0.72      0.74      0.73      4965

    accuracy                           0.72      9550
   macro avg       0.72      0.72      0.72      9550
weighted avg       0.72      0.72      0.72      9550



In [31]:
#4. Improve Performance of model. 
#4A. Experiment with other vectorisers. 

from sklearn.feature_extraction.text import CountVectorizer

# Vectorize the text data using CountVectorizer
count_vectorizer = CountVectorizer(max_features=5000)
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

# Build the Logistic Regression model again using CountVectorizer
model_count = LogisticRegression()
model_count.fit(X_train_count, y_train)

# Predict on the test set
y_pred_count = model_count.predict(X_test_count)

# Performance metrics for CountVectorizer
accuracy_count = accuracy_score(y_test, y_pred_count)
precision_count = precision_score(y_test, y_pred_count, average='weighted')
recall_count = recall_score(y_test, y_pred_count, average='weighted')
f1_count = f1_score(y_test, y_pred_count, average='weighted')

# Print metrics for CountVectorizer
print(f"CountVectorizer - Accuracy: {accuracy_count:.4f}")
print(f"CountVectorizer - Precision: {precision_count:.4f}")
print(f"CountVectorizer - Recall: {recall_count:.4f}")
print(f"CountVectorizer - F1-Score: {f1_count:.4f}")

CountVectorizer - Accuracy: 0.6977
CountVectorizer - Precision: 0.6977
CountVectorizer - Recall: 0.6977
CountVectorizer - F1-Score: 0.6971


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
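The warning above means the lbfgs solver stopped at its iteration limit before converging on the raw count features. A minimal sketch of the remedy the warning itself suggests, raising `max_iter`, is shown below on made-up toy data; the texts, labels and the value 1000 are assumptions for illustration, and the metrics reported above were produced without this change.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus with two neutral placeholder classes (assumed data)
toy_texts = [
    "baking bread with fresh yeast",
    "slow cooking a vegetable stew",
    "new recipe for baking cookies",
    "cycling route through the hills",
    "fixed the chain on my bike",
    "long cycling tour this weekend",
]
toy_labels = ["class_a", "class_a", "class_a", "class_b", "class_b", "class_b"]

# A larger iteration budget gives lbfgs room to converge on unscaled counts
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(toy_texts, toy_labels)

# n_iter_ reports how many iterations the solver actually used
n_iter = clf.named_steps["logisticregression"].n_iter_[0]
print("Iterations used:", n_iter)
```

If `n_iter` stays below `max_iter`, the solver converged and the warning does not appear.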


In [32]:
#4B. Build classifier Models using other algorithms than base model. 

from sklearn.ensemble import RandomForestClassifier

# Build a Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_tfidf, y_train)  # Using TF-IDF for this model

# Predict on the test set
y_pred_rf = rf_model.predict(X_test_tfidf)

# Performance metrics for Random Forest
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf, average='weighted')
recall_rf = recall_score(y_test, y_pred_rf, average='weighted')
f1_rf = f1_score(y_test, y_pred_rf, average='weighted')

# Print metrics for Random Forest
print(f"Random Forest - Accuracy: {accuracy_rf:.4f}")
print(f"Random Forest - Precision: {precision_rf:.4f}")
print(f"Random Forest - Recall: {recall_rf:.4f}")
print(f"Random Forest - F1-Score: {f1_rf:.4f}")

Random Forest - Accuracy: 0.6766
Random Forest - Precision: 0.6781
Random Forest - Recall: 0.6766
Random Forest - F1-Score: 0.6767


In [33]:
#4B. Build classifier Models using other algorithms than base model. 

from sklearn.svm import SVC

# Build an SVM model
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train_tfidf, y_train)  # Using TF-IDF for this model

# Predict on the test set
y_pred_svm = svm_model.predict(X_test_tfidf)

# Performance metrics for SVM
accuracy_svm = accuracy_score(y_test, y_pred_svm)
precision_svm = precision_score(y_test, y_pred_svm, average='weighted')
recall_svm = recall_score(y_test, y_pred_svm, average='weighted')
f1_svm = f1_score(y_test, y_pred_svm, average='weighted')

# Print metrics for SVM
print(f"SVM - Accuracy: {accuracy_svm:.4f}")
print(f"SVM - Precision: {precision_svm:.4f}")
print(f"SVM - Recall: {recall_svm:.4f}")
print(f"SVM - F1-Score: {f1_svm:.4f}")

SVM - Accuracy: 0.7151
SVM - Precision: 0.7149
SVM - Recall: 0.7151
SVM - F1-Score: 0.7149


In [34]:
#4C. Tune Parameters/Hyperparameters of the model/s. 

from sklearn.model_selection import GridSearchCV

# Tuning hyperparameters for Logistic Regression
param_grid = {
    'C': [0.1, 1, 10, 100],   # Regularization parameter
    'solver': ['liblinear', 'lbfgs']  # Solver to use
}

# Initialize GridSearchCV
grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, scoring='accuracy')

# Fit the model on the training data
grid_search.fit(X_train_tfidf, y_train)

# Get the best parameters
print("Best Parameters from GridSearchCV:", grid_search.best_params_)

# Predict with the best model
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test_tfidf)

# Performance metrics for the best model
accuracy_best = accuracy_score(y_test, y_pred_best)
precision_best = precision_score(y_test, y_pred_best, average='weighted')
recall_best = recall_score(y_test, y_pred_best, average='weighted')
f1_best = f1_score(y_test, y_pred_best, average='weighted')

# Print metrics for the best model
print(f"Best Logistic Regression - Accuracy: {accuracy_best:.4f}")
print(f"Best Logistic Regression - Precision: {precision_best:.4f}")
print(f"Best Logistic Regression - Recall: {recall_best:.4f}")
print(f"Best Logistic Regression - F1-Score: {f1_best:.4f}")

Best Parameters from GridSearchCV: {'C': 1, 'solver': 'lbfgs'}
Best Logistic Regression - Accuracy: 0.7159
Best Logistic Regression - Precision: 0.7158
Best Logistic Regression - Recall: 0.7159
Best Logistic Regression - F1-Score: 0.7158


In [35]:
#4D. Clearly print Performance Metrics.

print("Comparison of Performance Metrics:")

# Print metrics for CountVectorizer Logistic Regression
print(f"CountVectorizer - Accuracy: {accuracy_count:.4f}, F1-Score: {f1_count:.4f}")

# Print metrics for Random Forest
print(f"Random Forest - Accuracy: {accuracy_rf:.4f}, F1-Score: {f1_rf:.4f}")

# Print metrics for SVM
print(f"SVM - Accuracy: {accuracy_svm:.4f}, F1-Score: {f1_svm:.4f}")

# Print metrics for Best Logistic Regression (with GridSearchCV)
print(f"Best Logistic Regression - Accuracy: {accuracy_best:.4f}, F1-Score: {f1_best:.4f}")

Comparison of Performance Metrics:
CountVectorizer - Accuracy: 0.6977, F1-Score: 0.6971
Random Forest - Accuracy: 0.6766, F1-Score: 0.6767
SVM - Accuracy: 0.7151, F1-Score: 0.7149
Best Logistic Regression - Accuracy: 0.7159, F1-Score: 0.7158


#### 5. Share insights on relative performance comparison.
#### 5A. Which vectorizer performed better? Probable reason?

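With Logistic Regression, the only model trained on both representations, the TF-IDF Vectorizer performed better than the Count Vectorizer (accuracy 0.7159 vs. 0.6977). A probable reason is that TF-IDF down-weights terms that occur in many documents and gives more weight to rarer, more informative terms, and it L2-normalizes each document vector so that long posts do not dominate. Raw counts are unnormalized, which also makes optimization harder: the Count Vectorizer model hit the iteration limit before converging, so part of the gap may come from the unconverged fit rather than from the features themselves.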
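To make the down-weighting concrete, here is a small hand computation on a made-up three-document corpus. It assumes the smoothed IDF formula that scikit-learn documents for `TfidfVectorizer` defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization; the documents and helper names are illustrative only.

```python
import math
from collections import Counter

# Made-up corpus: 'team' appears in two of three documents, 'meeting' in one
docs = [
    "team meeting team update",
    "team lunch today",
    "kernel patch today",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def smoothed_idf(term):
    # Smoothed IDF as documented for TfidfVectorizer(smooth_idf=True)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

# Raw counts for the first document, then TF-IDF weights with L2 normalization
counts = Counter(tokenized[0])
weighted = {t: c * smoothed_idf(t) for t, c in counts.items()}
norm = math.sqrt(sum(w * w for w in weighted.values()))
tfidf = {t: w / norm for t, w in weighted.items()}

count_ratio = counts["team"] / counts["meeting"]
tfidf_ratio = tfidf["team"] / tfidf["meeting"]
print(f"count ratio team/meeting: {count_ratio:.2f}")
print(f"tf-idf ratio team/meeting: {tfidf_ratio:.2f}")
```

The common term `team` is twice as heavy as `meeting` in raw counts but noticeably less than twice as heavy after TF-IDF weighting, which is the effect described above.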

#### 5B. Which model outperformed? Probable reason?

The best-performing model is Logistic Regression (Accuracy: 0.7159, F1-Score: 0.7158), followed closely by the linear SVM (Accuracy: 0.7151, F1-Score: 0.7149); Random Forest trails at 0.6766. The gap between the top two is only 0.0008, about 8 of the 9,550 test posts, so they are effectively tied. A probable reason is that linear models work well on sparse, high-dimensional text features, while a Random Forest with default settings tends to struggle there because each split looks at a single, mostly-zero feature.
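How much weight the Logistic Regression vs. SVM ordering can bear is easy to estimate. The sketch below computes a rough binomial standard error for an accuracy measured on 9,550 test posts, using only the figures reported above. It treats the test posts as independent trials and ignores that both models were scored on the same posts (a paired test such as McNemar's would be more appropriate), so read it as a yardstick rather than a formal test.

```python
import math

n_test = 9550          # size of the test set
acc_logreg = 0.7159    # reported Logistic Regression accuracy
acc_svm = 0.7151       # reported SVM accuracy

# Binomial standard error of a single accuracy estimate
std_err = math.sqrt(acc_logreg * (1 - acc_logreg) / n_test)

gap = acc_logreg - acc_svm
posts_in_gap = gap * n_test

print(f"Standard error of accuracy: {std_err:.4f}")
print(f"Gap between models: {gap:.4f} (about {posts_in_gap:.0f} posts)")
```

The gap is several times smaller than one standard error, so a different random split could plausibly reverse the order of the two linear models.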

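#### 5C. Which parameter/hyperparameter significantly helped to improve performance? Probable reason?

No tuned hyperparameter improved performance in this run. The grid search over C and solver selected C = 1 with the lbfgs solver, which are the scikit-learn defaults, so the tuned model scores exactly the same as the base model (Accuracy: 0.7159). C is still the parameter that matters most for Logistic Regression: it is the inverse regularization strength, and regularization prevents overfitting by penalizing large coefficients, which is especially important with sparse, high-dimensional data like text. The result suggests that the default strength is already close to the best of the values tried. The SVM was not tuned; its C parameter, which controls the trade-off between a wider margin and allowing some misclassifications, would be the natural next candidate.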
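The grid search above returned `{'C': 1, 'solver': 'lbfgs'}`. A quick way to check whether a search result differs from the untuned model is to compare it with the estimator's defaults, as in the sketch below. It assumes a scikit-learn version in which lbfgs is the default solver (0.22 or later); `best_params` is copied from the printed output rather than read from `grid_search`.

```python
from sklearn.linear_model import LogisticRegression

# Defaults of an untuned LogisticRegression
defaults = LogisticRegression().get_params()

# Best parameters as printed by GridSearchCV above
best_params = {"C": 1, "solver": "lbfgs"}

# True when every tuned value equals the corresponding default
same_as_default = all(defaults[name] == value for name, value in best_params.items())
print("Grid search selected the defaults:", same_as_default)
```

When this prints `True`, identical scores for the base and tuned models are expected, and a finer grid around the selected value of C (for example between 0.1 and 10) is the usual next step.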

#### 5D. According to you, which performance metric should be given most importance, why?

In this case, the F1-Score should be given the most importance, because it balances precision and recall and stays informative when classes are imbalanced or when false positives and false negatives have different costs, situations where accuracy alone can be misleading. In this dataset the two classes are roughly balanced (4,585 female vs. 4,965 male posts in the test set), so accuracy and weighted F1 are nearly identical (0.7159 vs. 0.7158) and either one supports the same ranking of models.
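As a worked check, the per-class F1 values in the classification report can be recomputed from the rounded precision and recall it prints, together with the accuracy a trivial majority-class predictor would reach. The sketch below uses only numbers from the report; the `report` dictionary and helper name are illustrative.

```python
def f1_from(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the classification report above
report = {
    "female": {"precision": 0.71, "recall": 0.69, "support": 4585},
    "male": {"precision": 0.72, "recall": 0.74, "support": 4965},
}

total = sum(row["support"] for row in report.values())
f1_scores = {label: f1_from(row["precision"], row["recall"]) for label, row in report.items()}

# Support-weighted average of the per-class F1 scores
weighted_f1 = sum(f1_scores[label] * row["support"] for label, row in report.items()) / total

# Accuracy of always predicting the larger class
majority_baseline = max(row["support"] for row in report.values()) / total

for label, score in f1_scores.items():
    print(f"{label}: F1 = {score:.2f}")
print(f"Weighted F1 = {weighted_f1:.2f}")
print(f"Majority-class baseline accuracy = {majority_baseline:.4f}")
```

The baseline of roughly 0.52 shows the classes are close to balanced, which is why accuracy and F1 tell the same story here.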