<a href="https://colab.research.google.com/github/pranaysawant17/Artificial_Neural_Network_-ANN-/blob/main/coding_solution_Pranay_Sawant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Step 1: Baseline TF-IDF model

### Importing data

In [2]:
import pandas as pd
import numpy as np

In [3]:
X_train = pd.read_csv('X_train.csv')
y_train = pd.read_csv('y_train.csv')
y_test = pd.read_csv('y_test.csv', index_col=0)
X_test = pd.read_csv('X_test.csv', index_col=0)


In [4]:
X_train.describe()

Unnamed: 0,text
count,200
unique,193
top,I know she called me
freq,2


In [5]:
X_test.describe()

Unnamed: 0,text
count,1115
unique,1091
top,"Sorry, I'll call later"
freq,8


Assumption 1: The file named test (1,115 rows) is larger than the file named train (200 rows), so I use the test file for training and the train file for evaluation.

In [6]:
df_train = pd.concat([X_test, y_test], axis=1)
df_test = pd.concat([X_train, y_train], axis=1)
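
Because `pd.concat(..., axis=1)` aligns rows by index label rather than by position, the swap above only pairs each text with its own label if both frames carry the same index. A minimal sketch of that check, using made-up two-row frames (the toy rows and the names `features`, `labels`, and `paired` are illustrative, not from the dataset):

```python
import pandas as pd

# Toy stand-ins for X_test / y_test; the real frames come from the CSV files.
features = pd.DataFrame({"text": ["free prize now", "see you soon"]}, index=[3245, 944])
labels = pd.DataFrame({"label": [1, 0]}, index=[3245, 944])

# concat(axis=1) joins on the index, so confirm both frames share it first.
assert features.index.equals(labels.index), "feature and label indices differ"

paired = pd.concat([features, labels], axis=1)
print(paired)
```

If the assertion fails on the real data, the concatenated frame would contain NaN values wherever an index label appears in only one of the two files.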

In [7]:
df_train.head(10)

Unnamed: 0,text,label
3245,"Funny fact Nobody teaches volcanoes 2 erupt, t...",0
944,I sent my scores to sophas and i had to do sec...,0
1044,We know someone who you know that fancies you....,1
2484,Only if you promise your getting out as SOON a...,0
812,Congratulations ur awarded either å£500 of CD ...,1
2973,"I'll text carlos and let you know, hang on",0
2991,K.i did't see you.:)k:)where are you now?,0
2942,No message..no responce..what happend?,0
230,Get down in gandhipuram and walk to cross cut ...,0
1181,You flippin your shit yet?,0


In [8]:
import re
import string
import nltk
from nltk.corpus import stopwords

# Make sure to download the NLTK stopwords if you haven't already
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

### Preprocessing data:

The preprocessing steps are lowercasing, removing punctuation, and removing stopwords, which suits a bag-of-words TF-IDF model. In addition, text smileys such as :) are converted to their emoji names before punctuation is stripped, so that they survive as words and add meaning to the text.

In [9]:

# Dictionary mapping common text smileys to their emoji text names
smiley_map = {
    ":)": " :slightly_smiling_face: ",  # :) -> :slightly_smiling_face:
    ":D": " :grinning: ",              # :D -> :grinning:
    ":P": " :stuck_out_tongue: ",      # :P -> :stuck_out_tongue:
    ":(": " :disappointed: ",          # :( -> :disappointed:
    ";)": " :winking_face: ",          # ;) -> :winking_face:
    ":|": " :neutral_face: ",          # :| -> :neutral_face:
}

# Function to convert text smileys to emoji text names
def convert_smileys_to_text(text):
    for smiley, emoji_name in smiley_map.items():
        text = re.sub(re.escape(smiley), emoji_name, text)
    return text

# Function to preprocess the text (lowercasing, stopwords removal, punctuation removal)
def preprocess_text(text):
    # Convert smileys to text emoji descriptions
    text = convert_smileys_to_text(text)

    # Lowercase the text
    text = text.lower()

    # Remove punctuation; re.escape keeps characters such as ], \ and ^ literal inside the class
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = text.split()
    words = [word for word in words if word not in stop_words]

    # Join the words back into a string
    text = ' '.join(words)

    return text
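
To make the order of operations concrete, here is a small self-contained sketch of the same steps applied to one message from the data. It is not the function above: it uses a two-entry smiley map, plain `str.replace`, and a tiny hand-picked stopword set (an assumption, standing in for the full NLTK list) so that it runs without downloads.

```python
import re
import string

# Reduced smiley map and a hypothetical stopword subset, for illustration only.
SMILEY_MAP = {":)": " :slightly_smiling_face: ", ":(": " :disappointed: "}
STOP_WORDS = {"i", "you", "are", "where", "now", "did", "t"}

def demo_preprocess(text):
    # 1. Smileys first, otherwise punctuation removal would delete them.
    for smiley, name in SMILEY_MAP.items():
        text = text.replace(smiley, name)
    # 2. Lowercase, then replace every punctuation character with a space.
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # 3. Drop stopwords and rejoin.
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

result = demo_preprocess("K.i did't see you.:)k:)where are you now?")
print(result)  # k see slightly smiling face k slightly smiling face
```

Note that the colons and underscores in the emoji name are themselves punctuation, so `:slightly_smiling_face:` ends up as the three separate words `slightly smiling face`.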



In [33]:
df_train['text']=df_train['text'].apply(preprocess_text)
df_test['text']=df_test['text'].apply(preprocess_text)

In [11]:
df_train.head(10)

Unnamed: 0,text,label
3245,funny fact nobody teaches volcanoes 2 erupt ts...,0
944,sent scores sophas secondary application schoo...,0
1044,know someone know fancies call 09058097218 fin...,1
2484,promise getting soon text morning let know mad...,0
812,congratulations ur awarded either å£500 cd gif...,1
2973,text carlos let know hang,0
2991,k see slightly smiling face k slightly smiling...,0
2942,message responce happend,0
230,get gandhipuram walk cross cut road right side...,0
1181,flippin shit yet,0


In [34]:

from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()  # Default settings; pass max_features to cap the vocabulary size

# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(df_train['text'])

# Transform the test data
X_test_tfidf = tfidf_vectorizer.transform(df_test['text'])

# Define the number of folds for stratified k-fold cross-validation
n_splits = 5  # You can change this value

# Initialize stratified k-fold
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Initialize lists to store accuracy scores for each fold
accuracy_scores = []

# Perform stratified k-fold cross-validation
for fold, (train_index, val_index) in enumerate(skf.split(X_train_tfidf, df_train['label'])):
    print(f"Fold {fold+1}")

    # Split the data into training and validation sets for the current fold
    X_train_fold, X_val_fold = X_train_tfidf[train_index], X_train_tfidf[val_index]

    # Use .iloc to access data by position, ensuring alignment with train_index and val_index
    y_train_fold, y_val_fold = df_train['label'].iloc[train_index], df_train['label'].iloc[val_index]

    # Initialize and train a Logistic Regression model
    model = LogisticRegression(max_iter=1000) # Increased max_iter
    model.fit(X_train_fold, y_train_fold)

    # Make predictions on the validation set
    y_pred_fold = model.predict(X_val_fold)

    # Calculate the accuracy for the current fold
    accuracy = accuracy_score(y_val_fold, y_pred_fold)
    accuracy_scores.append(accuracy)
    print(f"Accuracy: {accuracy}")

# Print the average accuracy across all folds
print(f"\nAverage Accuracy across {n_splits} folds: {np.mean(accuracy_scores)}")

Fold 1
Accuracy: 0.8834080717488789
Fold 2
Accuracy: 0.8789237668161435
Fold 3
Accuracy: 0.8834080717488789
Fold 4
Accuracy: 0.8834080717488789
Fold 5
Accuracy: 0.8654708520179372

Average Accuracy across 5 folds: 0.8789237668161436
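
One caveat with the loop above: the vectorizer was fitted on all training rows before the folds were split, so each validation fold has already influenced the vocabulary and IDF weights. A hedged sketch of an alternative is to wrap the vectorizer and classifier in a scikit-learn pipeline so that both are refitted inside every fold. The toy corpus below is invented for illustration only, and two folds are used because it is so small:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Invented toy corpus (1 = spam-like, 0 = normal); stands in for df_train.
texts = [
    "win a free prize call now", "free cash award claim now",
    "urgent prize waiting call free", "claim your free award today",
    "see you at lunch", "call me when you get home",
    "running late see you soon", "let me know when you are home",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline refits TF-IDF on the training part of each fold only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=skf, scoring="accuracy")
print(scores, scores.mean())
```

On the real data the same pattern would take `df_train['text']` and `df_train['label']` in place of the toy lists.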


In [35]:
# Train a final model on the entire training data
final_model = LogisticRegression(max_iter=1000)
final_model.fit(X_train_tfidf, df_train['label'])

# Make predictions on the test data
y_pred_test = final_model.predict(X_test_tfidf)

In [44]:
from sklearn.metrics import confusion_matrix
confusion_matrix_original=confusion_matrix(df_test['label'], y_pred_test)
TN, FP, FN, TP = confusion_matrix_original.ravel()
print(f"True Negatives (TN): {TN}")
print(f"False Positives (FP): {FP}")
print(f"False Negatives (FN): {FN}")
print(f"True Positives (TP): {TP}")

True Negatives (TN): 173
False Positives (FP): 0
False Negatives (FN): 14
True Positives (TP): 13


I trained a binary classification model using plain logistic regression on TF-IDF features.

On the 200 held-out messages this gives an accuracy of 0.93 (186 correct), but the model misses 14 of the 27 label-1 messages:

True Negatives (TN): 173

False Positives (FP): 0

False Negatives (FN): 14

True Positives (TP): 13
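
Since most messages carry label 0, it helps to compare the model's accuracy with what a classifier that always predicts the majority class would score. A small sketch using the four counts above (the helper name `accuracy_vs_majority` is illustrative):

```python
def accuracy_vs_majority(tn, fp, fn, tp):
    """Return (model accuracy, accuracy of always predicting the larger class)."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Actual negatives are tn + fp, actual positives are fn + tp.
    majority_accuracy = max(tn + fp, fn + tp) / total
    return accuracy, majority_accuracy

accuracy, majority_accuracy = accuracy_vs_majority(tn=173, fp=0, fn=14, tp=13)
print(f"model accuracy:    {accuracy:.3f}")           # 0.930
print(f"majority baseline: {majority_accuracy:.3f}")  # 0.865
```

The baseline model therefore improves on the trivial classifier by 6.5 percentage points on this evaluation set.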

### Step 2: Using a Transformer-based model to extend the vocabulary

In [38]:

from keybert import KeyBERT


# Initialize the keyword extraction model
kw_model = KeyBERT(model="distilbert-base-uncased")

# Extract keywords from the training dataset
all_keywords = []
for text in df_train['text']:
    keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1), top_n=10)  # Adjust top_n as needed
    all_keywords.extend([keyword[0] for keyword in keywords if keyword[1] > 0.7])  # Adjust threshold and access keyword





To extract keywords I used KeyBERT, because the transformers library does not natively provide a keyword extraction model.

In [65]:
len(tfidf_vectorizer.vocabulary_.keys())

3441

In [64]:


# Drop NLTK stopwords from the TF-IDF vocabulary, then add the KeyBERT keywords
tfidf_vocabulary = set(tfidf_vectorizer.vocabulary_.keys())
updated_vocabulary = tfidf_vocabulary.difference(set(stopwords.words('english')))
updated_vocabulary = updated_vocabulary.union(set(all_keywords))

# Create a new TF-IDF vectorizer with the updated vocabulary
tfidf_vectorizer_updated = TfidfVectorizer(vocabulary=updated_vocabulary)

X_train_tfidf = tfidf_vectorizer_updated.fit_transform(df_train['text'])

# Transform the test data
X_test_tfidf = tfidf_vectorizer_updated.transform(df_test['text'])

# Define the number of folds for stratified k-fold cross-validation
n_splits = 5  # You can change this value

# Initialize stratified k-fold
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

# Initialize lists to store accuracy scores for each fold
accuracy_scores = []

# Perform stratified k-fold cross-validation
for fold, (train_index, val_index) in enumerate(skf.split(X_train_tfidf, df_train['label'])):
    print(f"Fold {fold+1}")

    # Split the data into training and validation sets for the current fold
    X_train_fold, X_val_fold = X_train_tfidf[train_index], X_train_tfidf[val_index]

    # Use .iloc to access data by position, ensuring alignment with train_index and val_index
    y_train_fold, y_val_fold = df_train['label'].iloc[train_index], df_train['label'].iloc[val_index]

    # Initialize and train a Logistic Regression model
    model = LogisticRegression(max_iter=1000) # Increased max_iter
    model.fit(X_train_fold, y_train_fold)

    # Make predictions on the validation set
    y_pred_fold = model.predict(X_val_fold)

    # Calculate the accuracy for the current fold
    accuracy = accuracy_score(y_val_fold, y_pred_fold)
    accuracy_scores.append(accuracy)
    print(f"Accuracy: {accuracy}")

# Print the average accuracy across all folds
print(f"\nAverage Accuracy across {n_splits} folds: {np.mean(accuracy_scores)}")

Fold 1
Accuracy: 0.8834080717488789
Fold 2
Accuracy: 0.8789237668161435
Fold 3
Accuracy: 0.8834080717488789
Fold 4
Accuracy: 0.8834080717488789
Fold 5
Accuracy: 0.8654708520179372

Average Accuracy across 5 folds: 0.8789237668161436


In [66]:
len(updated_vocabulary)

3425

In [67]:

# Use X_train_tfidf and X_test_tfidf, which were rebuilt above with the updated vocabulary
# Initialize and train a Logistic Regression model on the updated TF-IDF features
final_model_updated = LogisticRegression(max_iter=1000)
final_model_updated.fit(X_train_tfidf, df_train['label'])

# Make predictions on the test data using the updated model
y_pred_test_updated = final_model_updated.predict(X_test_tfidf)

# Evaluate the updated model
accuracy_updated = accuracy_score(df_test['label'], y_pred_test_updated)
print(f"Accuracy of the updated model: {accuracy_updated}")

confusion_matrix_updated = confusion_matrix(df_test['label'], y_pred_test_updated)
print(f"Confusion Matrix of the updated model:\n{confusion_matrix_updated}")

TN, FP, FN, TP = confusion_matrix_updated.ravel()
print(f"True Negatives (TN): {TN}")
print(f"False Positives (FP): {FP}")
print(f"False Negatives (FN): {FN}")
print(f"True Positives (TP): {TP}")

Accuracy of the updated model: 0.93
Confusion Matrix of the updated model:
[[173   0]
 [ 14  13]]
True Negatives (TN): 173
False Positives (FP): 0
False Negatives (FN): 14
True Positives (TP): 13


Conclusion: Even though the vocabulary was updated (3,441 terms before, 3,425 after), the confusion matrix values are unchanged from the baseline.
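
One plausible explanation for the unchanged confusion matrix, not verified here against the actual keyword list, is that single-word keywords extracted from the training texts are already TF-IDF vocabulary terms. In that case the union adds nothing new and the stopword difference only removes a handful of terms (a net change of 16 here). The toy sets below, with invented values, show the effect:

```python
# Invented toy values; the real sets come from tfidf_vectorizer and KeyBERT.
tfidf_vocabulary = {"free", "prize", "call", "don", "lunch"}
stop_words = {"don", "the", "is"}
all_keywords = ["prize", "free", "prize"]  # keywords drawn from the same texts

# Same two set operations as in the notebook cell.
updated_vocabulary = tfidf_vocabulary.difference(stop_words).union(all_keywords)

# Keywords that were not already in the original vocabulary.
new_terms = set(all_keywords) - tfidf_vocabulary

print(sorted(updated_vocabulary))  # ['call', 'free', 'lunch', 'prize']
print(sorted(new_terms))           # []
```

Running the `new_terms` check on the real `all_keywords` would show directly whether KeyBERT contributed any term the vectorizer had not already seen.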

# Optional Step 4

### Using a random forest model

In [69]:

from sklearn.ensemble import RandomForestClassifier

# Initialize and train a Random Forest Classifier model on the updated TF-IDF features
final_model_updated = RandomForestClassifier(random_state=42) # You can adjust hyperparameters
final_model_updated.fit(X_train_tfidf, df_train['label'])

# Make predictions on the test data using the updated model
y_pred_test_updated = final_model_updated.predict(X_test_tfidf)

# Evaluate the updated model
accuracy_updated = accuracy_score(df_test['label'], y_pred_test_updated)
print(f"Accuracy of the updated model: {accuracy_updated}")

confusion_matrix_updated = confusion_matrix(df_test['label'], y_pred_test_updated)
print(f"Confusion Matrix of the updated model:\n{confusion_matrix_updated}")

TN, FP, FN, TP = confusion_matrix_updated.ravel()
print(f"True Negatives (TN): {TN}")
print(f"False Positives (FP): {FP}")
print(f"False Negatives (FN): {FN}")
print(f"True Positives (TP): {TP}")


Accuracy of the updated model: 0.96
Confusion Matrix of the updated model:
[[172   1]
 [  7  20]]
True Negatives (TN): 172
False Positives (FP): 1
False Negatives (FN): 7
True Positives (TP): 20


Using a random forest classifier improves on the baseline model: test accuracy rises from 0.93 to 0.96 and false negatives drop from 14 to 7, at the cost of one false positive.

### Transformer-based classification model

In [63]:
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from datasets import Dataset, DatasetDict

# Define the model and tokenizer
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)# binary classification
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Convert pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(df_train[['text', 'label']])
test_dataset = Dataset.from_pandas(df_test[['text', 'label']])

# Combine into a DatasetDict
dataset = DatasetDict({"train": train_dataset, "test": test_dataset})

# Tokenize the datasets
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Define the compute_metrics function for the Trainer
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc}

# Define the training arguments with WandB disabled
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    report_to="none",  # Disable WandB reporting
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"], # Use test dataset for evaluation
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Make predictions
predictions = trainer.predict(tokenized_datasets["test"])

# Evaluate the model
y_pred = np.argmax(predictions.predictions, axis=1)
accuracy = accuracy_score(df_test['label'], y_pred)
print(f"Test Accuracy: {accuracy}")
confusion_mat = confusion_matrix(df_test['label'], y_pred)
print(f"Confusion Matrix:\n{confusion_mat}")
TN, FP, FN, TP = confusion_mat.ravel()
print(f"True Negatives (TN): {TN}")
print(f"False Positives (FP): {FP}")
print(f"False Negatives (FN): {FN}")
print(f"True Positives (TP): {TP}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.171834,0.97
2,No log,0.164656,0.97
3,No log,0.140647,0.975


Test Accuracy: 0.975
Confusion Matrix:
[[169   4]
 [  1  26]]
True Negatives (TN): 169
False Positives (FP): 4
False Negatives (FN): 1
True Positives (TP): 26


Here I fine-tuned a BERT model for classification. Based on the confusion matrix (1 false negative and 4 false positives, accuracy 0.975), it performs considerably better than the previous models, mainly because it catches 26 of the 27 label-1 messages.
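
The three confusion matrices reported above can be put side by side by deriving precision, recall, and F1 for label 1 from the counts. A small sketch (the helper `prf` and the dictionary layout are illustrative choices; the counts are copied from the outputs above):

```python
# (TN, FP, FN, TP) as printed for each model on the 200 evaluation messages.
results = {
    "logistic regression (TF-IDF)": (173, 0, 14, 13),
    "random forest (TF-IDF)": (172, 1, 7, 20),
    "BERT fine-tuned": (169, 4, 1, 26),
}

def prf(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive class (tn is unused)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1

summary = {name: prf(*counts) for name, counts in results.items()}
for name, (precision, recall, f1) in summary.items():
    print(f"{name:30s} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Read this way, the models trade a little precision for much higher recall as they get more expressive: recall goes from about 0.48 to 0.74 to 0.96.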