<a href="https://colab.research.google.com/github/karan2261/Twitter-Sentiment-Analysis-using-NLP/blob/main/TWITTER_SENTIMENT_ANALYSIS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import spacy
from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [None]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 3.0 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
nlp = spacy.load('en_core_web_lg')

In [None]:
# Text cleaning function
def clean_text(text):
    doc = nlp(text.lower())
    cleaned_tokens = [token.text for token in doc if not token.is_punct and not token.is_stop and not token.like_num]
    return " ".join(cleaned_tokens)

In [None]:
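# Load the datasets
# Note: the files have no header row, so with the default settings read_csv
# uses the first record of each file as the column names.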
train_data = pd.read_csv('twitter_training.csv')
validation_data = pd.read_csv('twitter_validation.csv')

In [None]:
print("Train data columns:", train_data.columns)
print("Validation data columns:", validation_data.columns)

Train data columns: Index(['2401', 'Borderlands', 'Positive',
       'im getting on borderlands and i will murder you all ,'],
      dtype='object')
Validation data columns: Index(['3364', 'Facebook', 'Irrelevant',
       'I mentioned on Facebook that I was struggling for motivation to go for a run the other day, which has been translated by Tom’s great auntie as ‘Hayley can’t get out of bed’ and told to his grandma, who now thinks I’m a lazy, terrible person 🤣'],
      dtype='object')
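
The column names printed above are really the first data row of each file, which suggests the CSVs have no header line. The following is a minimal sketch, using a small in-memory CSV as a stand-in for the real files (the two sample rows and the `csv_text` variable are illustrative assumptions), of how `header=None` together with `names=` keeps that first row as data:

```python
import io
import pandas as pd

# Two illustrative rows standing in for a header-less CSV (not the real data).
csv_text = (
    "2401,Borderlands,Positive,first tweet\n"
    "2402,Borderlands,Negative,second tweet\n"
)

# Default behaviour: the first record is consumed as the header row.
default_df = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every record as data; names= assigns the column labels.
named_df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["id", "source", "label", "text"],
)

print(len(default_df), list(default_df.columns))
print(len(named_df), list(named_df.columns))
```

Reading the real files this way would add one row to each dataset, so the row counts and scores shown in this notebook would likely shift slightly.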


In [None]:
# Rename columns
train_data.columns = ['id', 'source', 'label', 'text']
validation_data.columns = ['id', 'source', 'label', 'text']

In [None]:
# Drop rows with missing values in the text column.
# .copy() returns independent DataFrames, so adding a column in the next cell
# does not trigger pandas' SettingWithCopyWarning.
train_data_cleaned = train_data.dropna(subset=['text']).copy()
validation_data_cleaned = validation_data.dropna(subset=['text']).copy()

In [None]:
# Apply text cleaning
train_data_cleaned['cleaned_text'] = train_data_cleaned['text'].apply(clean_text)
validation_data_cleaned['cleaned_text'] = validation_data_cleaned['text'].apply(clean_text)


In [None]:
# Print head and tail of cleaned data
print("Cleaned Train Data Head:")
print(train_data_cleaned[['text', 'cleaned_text']].head())

print("Cleaned Train Data Tail:")
print(train_data_cleaned[['text', 'cleaned_text']].tail())

Cleaned Train Data Head:
                                                text  \
0  I am coming to the borders and I will kill you...   
1  im getting on borderlands and i will kill you ...   
2  im coming on borderlands and i will murder you...   
3  im getting on borderlands 2 and i will murder ...   
4  im getting into borderlands and i can murder y...   

                   cleaned_text  
0           coming borders kill  
1    m getting borderlands kill  
2   m coming borderlands murder  
3  m getting borderlands murder  
4  m getting borderlands murder  
Cleaned Train Data Tail:
                                                    text  \
74676  Just realized that the Windows partition of my...   
74677  Just realized that my Mac window partition is ...   
74678  Just realized the windows partition of my Mac ...   
74679  Just realized between the windows partition of...   
74680  Just like the windows partition of my Mac is l...   

                                            clea

In [None]:
# Select 10% of the data
sample_size = int(0.1 * len(train_data_cleaned))
train_data_cleaned = train_data_cleaned.sample(n=sample_size, random_state=42)

validation_sample_size = int(0.1 * len(validation_data_cleaned))
validation_data_cleaned = validation_data_cleaned.sample(n=validation_sample_size, random_state=42)
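
`DataFrame.sample` draws rows uniformly, so the label proportions in a small sample can drift from those of the full data. As a hedged alternative, the sketch below samples the same fraction within each label group. The `toy` frame is made-up data used only for illustration; this is not what the notebook itself does.

```python
import pandas as pd

# Toy frame with an imbalanced label column (illustrative data only).
toy = pd.DataFrame({
    "label": ["Positive"] * 60 + ["Negative"] * 30 + ["Neutral"] * 10,
    "text": [f"tweet {i}" for i in range(100)],
})

# Draw 10% from each label group so the class proportions are preserved.
stratified = toy.groupby("label", group_keys=False).sample(frac=0.1, random_state=42)

print(stratified["label"].value_counts().to_dict())
```

With a sample as small as the 99-row validation subset, keeping the class mix stable may make the per-class scores less noisy.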

In [None]:
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

X_train_tfidf = tfidf_vectorizer.fit_transform(train_data_cleaned['cleaned_text'])
X_validation_tfidf = tfidf_vectorizer.transform(validation_data_cleaned['cleaned_text'])

# Convert to array and show the result
tfidf_train_vector = X_train_tfidf.toarray()
tfidf_validation_vector = X_validation_tfidf.toarray()

# Display the TF-IDF vectors
print("TF-IDF Train Vectors:\n", tfidf_train_vector)
print("TF-IDF Validation Vectors:\n", tfidf_validation_vector)

# Print shape of the matrix
print("Shape of Train TF-IDF Matrix:", tfidf_train_vector.shape)
print("Shape of Validation TF-IDF Matrix:", tfidf_validation_vector.shape)

TF-IDF Train Vectors:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
TF-IDF Validation Vectors:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Shape of Train TF-IDF Matrix: (7399, 13641)
Shape of Validation TF-IDF Matrix: (99, 13641)
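
The `.toarray()` calls above expand the sparse TF-IDF matrices into dense arrays, which is where most of the memory goes; the classifiers later in the notebook are fit on the sparse matrices directly. A rough sketch of the difference, using a three-document toy corpus (an assumption for illustration) and the training shape reported above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus (not the notebook's data).
corpus = [
    "getting borderlands kill",
    "coming borderlands murder",
    "windows partition mac",
]

X = TfidfVectorizer().fit_transform(corpus)  # SciPy sparse matrix

# Sparse storage only holds the non-zero entries.
print("shape:", X.shape, "stored non-zeros:", X.nnz)

# Dense float64 footprint for the training matrix shape reported above.
rows, cols = 7399, 13641
dense_bytes = rows * cols * 8
print(f"dense size: {dense_bytes / 1e6:.0f} MB")
```

Inspecting `X.shape` and `X.nnz` gives the same information as printing the dense array, without allocating it.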


In [None]:
# Extract labels
y_train = train_data_cleaned['label']
y_validation = validation_data_cleaned['label']

In [None]:
# Model 1: Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_tfidf, y_train)
y_pred_rf = rf_model.predict(X_validation_tfidf)

In [None]:
# Random Forest Model Evaluation
print("Random Forest - Accuracy:", accuracy_score(y_validation, y_pred_rf))
print("Random Forest - Precision:", precision_score(y_validation, y_pred_rf, average='weighted'))
print("Random Forest - Recall:", recall_score(y_validation, y_pred_rf, average='weighted'))
print("Random Forest - F1 Score:", f1_score(y_validation, y_pred_rf, average='weighted'))
print("\nClassification Report:\n", classification_report(y_validation, y_pred_rf))

Random Forest - Accuracy: 0.696969696969697
Random Forest - Precision: 0.7183620718974255
Random Forest - Recall: 0.696969696969697
Random Forest - F1 Score: 0.6892929292929293

Classification Report:
               precision    recall  f1-score   support

  Irrelevant       0.90      0.45      0.60        20
    Negative       0.63      0.74      0.68        23
     Neutral       0.66      0.66      0.66        29
    Positive       0.73      0.89      0.80        27

    accuracy                           0.70        99
   macro avg       0.73      0.68      0.68        99
weighted avg       0.72      0.70      0.69        99
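
In the output above, the weighted recall is identical to the accuracy. That is expected: weighting each class's recall by its support adds up to the overall fraction of correct predictions. A small sketch with made-up labels (not the notebook's data) shows this, with the macro average for contrast:

```python
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels and predictions (not taken from the notebook's data).
y_true = ["Positive", "Positive", "Negative", "Neutral", "Neutral", "Neutral"]
y_pred = ["Positive", "Negative", "Negative", "Neutral", "Positive", "Neutral"]

acc = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted")
macro_recall = recall_score(y_true, y_pred, average="macro")

print(acc, weighted_recall, macro_recall)
```

The macro average treats every class equally, so it is the more informative number when one class, such as `Irrelevant` here, is recalled much worse than the others.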



In [None]:
# Model 2: Naive Bayes
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
y_pred_nb = nb_model.predict(X_validation_tfidf)

In [None]:
# Naive Bayes Model Evaluation
print("Naive Bayes - Accuracy:", accuracy_score(y_validation, y_pred_nb))
print("Naive Bayes - Precision:", precision_score(y_validation, y_pred_nb, average='weighted'))
print("Naive Bayes - Recall:", recall_score(y_validation, y_pred_nb, average='weighted'))
print("Naive Bayes - F1 Score:", f1_score(y_validation, y_pred_nb, average='weighted'))
print("\nClassification Report:\n", classification_report(y_validation, y_pred_nb))

Naive Bayes - Accuracy: 0.6464646464646465
Naive Bayes - Precision: 0.7232954545454545
Naive Bayes - Recall: 0.6464646464646465
Naive Bayes - F1 Score: 0.61452873725601

Classification Report:
               precision    recall  f1-score   support

  Irrelevant       1.00      0.30      0.46        20
    Negative       0.59      0.83      0.69        23
     Neutral       0.75      0.41      0.53        29
    Positive       0.60      1.00      0.75        27

    accuracy                           0.65        99
   macro avg       0.74      0.63      0.61        99
weighted avg       0.72      0.65      0.61        99



In [None]:
# Fine-tune model parameters for optimal performance

from sklearn.model_selection import RandomizedSearchCV

# Parameter distributions
param_distributions_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

param_distributions_nb = {
    'alpha': [0.1, 0.5, 1.0]
}
# RandomizedSearchCV for Random Forest
random_search_rf = RandomizedSearchCV(estimator=rf_model, param_distributions=param_distributions_rf,
                                       n_iter=10, cv=3, scoring='accuracy', n_jobs=-1, random_state=42)
random_search_rf.fit(X_train_tfidf, y_train)

print("Best parameters for Random Forest:", random_search_rf.best_params_)
print("Best score for Random Forest:", random_search_rf.best_score_)
best_rf_model = random_search_rf.best_estimator_

# RandomizedSearchCV for Naive Bayes
# (the alpha grid has only 3 candidates, so n_iter=3 already covers all of them)
random_search_nb = RandomizedSearchCV(estimator=nb_model, param_distributions=param_distributions_nb,
                                       n_iter=3, cv=3, scoring='accuracy', n_jobs=-1, random_state=42)
random_search_nb.fit(X_train_tfidf, y_train)

print("Best parameters for Naive Bayes:", random_search_nb.best_params_)
print("Best score for Naive Bayes:", random_search_nb.best_score_)
best_nb_model = random_search_nb.best_estimator_

Best parameters for Random Forest: {'n_estimators': 50, 'min_samples_split': 2, 'max_depth': None}
Best score for Random Forest: 0.5884600215682916




Best parameters for Naive Bayes: {'alpha': 0.1}
Best score for Naive Bayes: 0.5946760553718382
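
`best_score_` is the mean cross-validated accuracy on the training sample, so it is not directly comparable to the validation accuracies reported earlier. The sketch below, on a toy corpus (all texts and labels are illustrative assumptions), shows one way to score the tuned model on held-out data. It also places the vectorizer inside a `Pipeline`, a swapped-in design choice so that each fold fits TF-IDF only on its own training part.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the cleaned tweets (illustrative only).
train_texts = ["love game", "great fun game", "hate bugs", "terrible lag bugs"] * 6
train_labels = ["Positive", "Positive", "Negative", "Negative"] * 6
held_out_texts = ["love fun", "terrible bugs"]
held_out_labels = ["Positive", "Negative"]

# Vectorizer inside the pipeline: each CV fold fits TF-IDF on its own training part.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

search = RandomizedSearchCV(
    pipeline,
    param_distributions={"nb__alpha": [0.1, 0.5, 1.0]},
    n_iter=3,  # the grid has only three candidates
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(train_texts, train_labels)

# best_score_ is the mean cross-validated accuracy on the training data;
# score() on held-out data is the figure comparable to a validation accuracy.
held_out_accuracy = search.score(held_out_texts, held_out_labels)
print(search.best_params_, search.best_score_, held_out_accuracy)
```

In the notebook's terms, the equivalent check would be calling `predict` on `X_validation_tfidf` with `best_rf_model` and `best_nb_model` and recomputing the metrics.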


In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.regularizers import l2

# Define the autoencoder model
input_dim = X_train_tfidf.shape[1]

input_layer = Input(shape=(input_dim,))
encoded = Dense(64, activation='relu', kernel_regularizer=l2(0.01))(input_layer)
encoded = Dropout(0.3)(encoded)
bottleneck = Dense(32, activation='relu')(encoded)
decoded = Dense(64, activation='relu')(bottleneck)
decoded = Dropout(0.3)(decoded)
output_layer = Dense(input_dim, activation='sigmoid')(decoded)

autoencoder = Model(inputs=input_layer, outputs=output_layer)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Learning rate scheduling function
def lr_schedule(epoch, lr):
    if epoch < 10:
        return lr
    else:
        return lr * tf.math.exp(-0.1).numpy()

# Learning rate scheduler
lr_scheduler = LearningRateScheduler(lr_schedule)

# Convert sparse matrices to dense arrays
X_train_tfidf_dense = X_train_tfidf.toarray()
X_validation_tfidf_dense = X_validation_tfidf.toarray()

# Use a smaller batch size to avoid crashes
batch_size = 64  # Smaller batch size to reduce memory load

# Train the autoencoder model using dense arrays
autoencoder.fit(X_train_tfidf_dense, X_train_tfidf_dense,
                epochs=30, batch_size=batch_size, shuffle=True,
                validation_data=(X_validation_tfidf_dense, X_validation_tfidf_dense),
                callbacks=[lr_scheduler])

Epoch 1/30
116/116 ━━━━━━━━━━━━━━━━━━━━ 6s 27ms/step - loss: 0.7870 - val_loss: 0.0030 - learning_rate: 0.0010
Epoch 2/30
116/116 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - loss: 0.0028 - val_loss: 0.0019 - learning_rate: 0.0010
Epoch 3/30
116/116 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - loss: 0.0019 - val_loss: 0.0018 - learning_rate: 0.0010
Epoch 4/30
116/116 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 0.0019 - val_loss: 0.0018 - learning_rate: 0.0010
Epoch 5/30
116/116 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 0.0018 - val_loss: 0.0018 - learning_rate: 0.0010
Epoch 6/30
116/116 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - loss: 0.0018 - val_loss: 0.0018 - learning_rate: 0.0010
Epoch 7/30
116/116 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - loss: 0.0018 - val_loss: 0.0018 - learning_rate

<keras.src.callbacks.history.History at 0x7d3a3f743fa0>
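
The `lr_schedule` function above leaves the learning rate unchanged for the first 10 epochs and then multiplies it by exp(-0.1) every epoch. The sketch below replays that schedule in plain Python to show the resulting values. It uses `math.exp` instead of `tf.math.exp` and assumes the starting rate of 0.001 shown in the training log.

```python
import math

def lr_schedule(epoch, lr):
    """Keep the rate for the first 10 epochs, then decay it by exp(-0.1) per epoch."""
    if epoch < 10:
        return lr
    return lr * math.exp(-0.1)

# Replay the schedule the way the callback would: each epoch receives
# the rate that was returned for the previous one.
lr = 0.001  # starting rate, as shown in the training log
rates = []
for epoch in range(30):
    lr = lr_schedule(epoch, lr)
    rates.append(lr)

print(rates[9], rates[10], rates[29])
```

Because the decay compounds, the final epoch runs at roughly 0.001 * exp(-2), about one seventh of the starting rate.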

In [None]:
# Evaluate the reconstruction error on the validation set
reconstruction = autoencoder.predict(X_validation_tfidf_dense)
reconstruction_error = tf.keras.losses.binary_crossentropy(X_validation_tfidf_dense, reconstruction)

print("Average Reconstruction Error - Validation:", tf.reduce_mean(reconstruction_error).numpy())

4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 186ms/step
Average Reconstruction Error - Validation: 0.0017588767
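
The reconstruction error above is one binary cross-entropy value per validation row, averaged over the TF-IDF features, and the printed figure is the mean of those values. A NumPy sketch of that per-row quantity follows; the clipping constant `eps` and the two example rows are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np

def per_sample_bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy averaged over the feature axis, one value per row."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return losses.mean(axis=-1)

# Two illustrative rows: a close reconstruction and a poor one.
y_true = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
y_pred = np.array([[0.1, 0.9, 0.1], [0.5, 0.5, 0.5]])

errors = per_sample_bce(y_true, y_pred)
print(errors, errors.mean())
```

Since most TF-IDF entries are zero, a model that outputs values near zero everywhere already achieves a low average error, which may explain why the loss plateaus so quickly in the training log.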


In [None]:
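# Example modification: wider layers and a larger bottleneck.
# The new layers are wired into a separate model (autoencoder_v2) so that they
# are actually used; it still has to be trained with fit() before its
# reconstruction errors mean anything.
encoded = Dense(128, activation='relu', kernel_regularizer=l2(0.01))(input_layer)
encoded = Dropout(0.3)(encoded)
bottleneck = Dense(64, activation='relu')(encoded)
decoded = Dense(128, activation='relu')(bottleneck)
decoded = Dropout(0.3)(decoded)
output_layer = Dense(input_dim, activation='sigmoid')(decoded)
autoencoder_v2 = Model(inputs=input_layer, outputs=output_layer)

In [None]:
# Compile the modified model with a lower learning rate
autoencoder_v2.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005), loss='binary_crossentropy')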

In [None]:
threshold = tf.reduce_mean(reconstruction_error).numpy() + 2 * tf.math.reduce_std(reconstruction_error).numpy()
print("Reconstruction Error Threshold:", threshold)

# Detect anomalies
anomalies = reconstruction_error > threshold
print("Number of anomalies detected:", tf.reduce_sum(tf.cast(anomalies, tf.int32)).numpy())

Reconstruction Error Threshold: 0.0028917385498061776
Number of anomalies detected: 3
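
The rule above flags a row as anomalous when its reconstruction error exceeds the mean by more than two standard deviations. A small NumPy sketch with made-up error values shows the same rule and how to recover which rows were flagged, not only how many:

```python
import numpy as np

# Illustrative per-sample reconstruction errors (made-up values).
errors = np.array([0.0017, 0.0018, 0.0016, 0.0017, 0.0019,
                   0.0018, 0.0017, 0.0016, 0.0018, 0.0045])

# Same rule as above: mean plus two (population) standard deviations.
threshold = errors.mean() + 2 * errors.std()
anomaly_idx = np.flatnonzero(errors > threshold)

print("threshold:", threshold)
print("anomalous rows:", anomaly_idx)
```

The row positions could then be used to look up the corresponding tweets and inspect why they reconstruct poorly.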


In [None]:
# Analysis and Conclusion:

# Analyze the results obtained from various models and hyperparameter configurations.
print("Analysis of Results:")
print("--------------------")
print("Random Forest (Before Tuning):")
print("  - Accuracy:", accuracy_score(y_validation, y_pred_rf))
print("  - Precision:", precision_score(y_validation, y_pred_rf, average='weighted'))
print("  - Recall:", recall_score(y_validation, y_pred_rf, average='weighted'))
print("  - F1 Score:", f1_score(y_validation, y_pred_rf, average='weighted'))

print("\nNaive Bayes (Before Tuning):")
print("  - Accuracy:", accuracy_score(y_validation, y_pred_nb))
print("  - Precision:", precision_score(y_validation, y_pred_nb, average='weighted'))
print("  - Recall:", recall_score(y_validation, y_pred_nb, average='weighted'))
print("  - F1 Score:", f1_score(y_validation, y_pred_nb, average='weighted'))

print("\nRandom Forest (After Tuning):")
print("  - Best Parameters:", random_search_rf.best_params_)
print("  - Best Score:", random_search_rf.best_score_)

print("\nNaive Bayes (After Tuning):")
print("  - Best Parameters:", random_search_nb.best_params_)
print("  - Best Score:", random_search_nb.best_score_)

print("\nAutoencoder:")
print("  - Average Reconstruction Error (Validation):", tf.reduce_mean(reconstruction_error).numpy())
print("  - Reconstruction Error Threshold:", threshold)
print("  - Number of Anomalies Detected:", tf.reduce_sum(tf.cast(anomalies, tf.int32)).numpy())

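# Discuss the impact of SpaCy in comparison to other models in terms of performance and computational efficiency.
# Note: no run without SpaCy preprocessing was made in this notebook, so the
# statements below are expectations rather than measured results.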
print("\nImpact of SpaCy:")
print("----------------")
print("SpaCy's NLP capabilities significantly improved the quality of text preprocessing by removing stop words, punctuation, and numbers.")
print("This likely contributed to better performance in both traditional machine learning models (Random Forest, Naive Bayes) and the autoencoder.")
print("However, using SpaCy for text cleaning can add computational overhead compared to simpler preprocessing techniques.")

# Draw conclusions on the suitability of different models and hyperparameter settings for the given dataset and task.
print("\nConclusions:")
print("------------")
print("On the 99-tweet validation sample, the untuned Random Forest reached about 0.70 accuracy and the untuned Naive Bayes about 0.65.")
print("The tuned models scored about 0.59 in 3-fold cross-validation on the training sample; they were not re-evaluated on the validation sample, so tuning has not been shown to help.")
print("The choice between the two might depend on factors like simplicity (Naive Bayes) and the need for non-linear decision boundaries (Random Forest).")
print("The autoencoder flagged 3 validation tweets as anomalies; this count depends directly on the mean + 2 std reconstruction error threshold.")
print("Further experimentation with different architectures and thresholds could change which tweets are flagged.")
print("Overall, SpaCy preprocessing with TF-IDF features and either classifier gives a workable baseline for sentiment analysis on this dataset, with room for improvement.")

Analysis of Results:
--------------------
Random Forest (Before Tuning):
  - Accuracy: 0.696969696969697
  - Precision: 0.7183620718974255
  - Recall: 0.696969696969697
  - F1 Score: 0.6892929292929293

Naive Bayes (Before Tuning):
  - Accuracy: 0.6464646464646465
  - Precision: 0.7232954545454545
  - Recall: 0.6464646464646465
  - F1 Score: 0.61452873725601

Random Forest (After Tuning):
  - Best Parameters: {'n_estimators': 50, 'min_samples_split': 2, 'max_depth': None}
  - Best Score: 0.5884600215682916

Naive Bayes (After Tuning):
  - Best Parameters: {'alpha': 0.1}
  - Best Score: 0.5946760553718382

Autoencoder:
  - Average Reconstruction Error (Validation): 0.0017588767
  - Reconstruction Error Threshold: 0.0028917385498061776
  - Number of Anomalies Detected: 3

Impact of SpaCy:
----------------
SpaCy's NLP capabilities significantly improved the quality of text preprocessing by removing stop words, punctuation, and numbers.
This likely contributed to better performance in both