<h1>Model Development Support Vector Machine (SVM)</h1>

## Table of Contents<a name="TOC"></a>

1. [Splitting the Dataset Into Training and Testing Sets](#Section1)
<br>First, separate the columns into dependent and independent variables (or features and labels). Then you split those variables into train and test sets.</br>

2. [Feature Extraction](#Section2)
<br>Includes the document-term matrix (TF-IDF & BOW)</br>

3. [Model Generation](#Section3)
<br>Building the sentiment analysis (SA) model using **SVM**</br>

4. [Model Evaluation](#Section4)
<br>Evaluate the SVM model based on performance metrics</br>

5. [Visualization](#Section5)
<br>Heatmaps and Stacked Bar Charts</br>

Approach:
1. Sentiment Analysis using SVM

2. Predicting Sentiment using SVM

**Importing libraries & dataset**

In [None]:
import re
import time
import nltk
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder

from sklearn.svm import LinearSVC

In [None]:
# Load csv file containing tweets dataset (w/ sentiments)

tweets_df = pd.read_csv(r"C:\Users\LENOVO\Documents\Degree Life\FYP Journey\Dataset\Sentiment Analysis\V3 Harmonized [VADER & TextBlob]_All Keywords (Whole Malaysia).csv")
display(tweets_df)

In [None]:
# Drop columns that are irrelevant for modelling purposes
# Irrelevant columns = "Datetime", "Username", "Location"
new_df = tweets_df.drop(['Datetime', 'Username','Location'], axis=1)

# Create a list of the column names in the desired order (VADER & TextBlob)
#cols = ['Cleaned_Tweets', 'Sentiment Score','Sentiment']

# Create a list of the column names in the desired order (Harmonized)
cols = ['Cleaned_Tweets', 'Harmonized_Score','Harmonized_Label', 'Risk_Label']

# Rearrange the columns in the dataset
new_df = new_df[cols]

display(new_df)

## 1. Splitting the Dataset Into Training and Testing Sets<a name="Section1"></a>

Use scikit-learn to randomly split the dataset into training and testing sets. The training set is used to train the model to classify the sentiment of the tweets, and the test set is used to assess how well the model classifies new, unseen tweets.

In [None]:
#Extract the features and labels
features = new_df['Cleaned_Tweets'].values
labels = new_df['Harmonized_Label'].values

# Use LabelEncoder to convert labels to numerical values
# Classes are numbered in sorted (alphabetical) order, e.g. Negative [0], Positive [1]
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
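Because the confusion matrices and label mappings later in the notebook depend on which integer each class receives, it can help to inspect the encoder directly. The sketch below uses made-up label values (`example_labels` is hypothetical, not taken from the dataset) to show that `LabelEncoder` assigns codes in sorted order of the class names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values, for illustration only (not the real dataset)
example_labels = ["Positive", "Negative", "Negative", "Positive", "Negative"]

example_encoder = LabelEncoder()
example_encoded = example_encoder.fit_transform(example_labels)

# classes_ holds the original labels in sorted order; the position is the integer code
label_to_code = {str(name): int(code) for code, name in enumerate(example_encoder.classes_)}

print(label_to_code)
print(example_encoded.tolist())
```

Running the same dictionary comprehension on the notebook's own `encoder` shows the code assigned to each `Harmonized_Label` value.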

In [None]:
# Split the dataset into training and testing sets
# Remember to modify test size each time you're trying to run a new model!!
train_features, test_features, train_labels, test_labels = train_test_split(features, encoded_labels, 
                                                                            test_size=0.3, random_state=42)

train_features.shape, test_features.shape, train_labels.shape, test_labels.shape

In [None]:
# Convert train_features and train_labels back to Pandas DataFrames
train_data = pd.DataFrame({'Cleaned_Tweets': train_features, 'Sentiment': train_labels})
#train_data = pd.concat([train_features.reset_index(drop=True), train_labels.reset_index(drop=True)], axis=1)
display(train_data)

In [None]:
# Convert test_features and test_labels back to Pandas DataFrames
test_data = pd.DataFrame({'Cleaned_Tweets': test_features, 'Sentiment': test_labels})
display(test_data)

In [None]:
#Save the train data into CSV file
train_data.to_csv('(70-30)TextBlob train_data.csv', index=False)

In [None]:
#Save the test data into CSV file
test_data.to_csv('(70-30)TextBlob test_data.csv', index=False)

In [None]:
#view tweet length in train data and test data

length_train = train_data['Cleaned_Tweets'].str.len()
length_test = test_data['Cleaned_Tweets'].str.len()
plt.figure(figsize=(10,6))
plt.hist(length_train, bins=50, label="Train_tweets", color = "darkblue")
plt.hist(length_test, bins=50, label='Test_tweets', color = "skyblue")
plt.legend()

**To check if the dataset is balanced**

Source: https://towardsdatascience.com/fine-grained-sentiment-analysis-in-python-part-1-2697bb111ed4

**Train data**

In [None]:
import matplotlib.pyplot as plt
ax = train_data['Sentiment'].value_counts(sort=False).plot(kind='barh')
ax.set_xlabel("Number of Samples in Training Set")
ax.set_ylabel("Label")

**Test data**

In [None]:
import matplotlib.pyplot as plt
ax = test_data['Sentiment'].value_counts(sort=False).plot(kind='barh')
ax.set_xlabel("Number of Samples in Testing Set")
ax.set_ylabel("Label")
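The bar charts give a visual impression of balance. A numeric summary can make the same check easier to compare between the training and testing sets. The sketch below uses a small made-up series (`example_sentiment` is a hypothetical stand-in for `train_data['Sentiment']`).

```python
import pandas as pd

# Hypothetical encoded labels standing in for train_data['Sentiment']
example_sentiment = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Sentiment")

# Absolute counts and relative shares per class, ordered by class code
counts = example_sentiment.value_counts().sort_index()
shares = example_sentiment.value_counts(normalize=True).sort_index()

# Ratio between the largest and smallest class (1.0 means perfectly balanced)
imbalance_ratio = counts.max() / counts.min()

balance_summary = pd.DataFrame({"count": counts, "share": shares})
print(balance_summary)
print(f"Majority-to-minority ratio: {imbalance_ratio:.2f}")
```

If the shares differ noticeably between the two sets, passing `stratify=encoded_labels` to `train_test_split` is one way to keep the class proportions the same in both.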

## 2. Feature Extraction<a name="Section2"></a>

Use TF-IDF or BOW to convert the tweet text into numeric feature vectors.

A.[TF-IDF](#Section6)

B.[BOW](#Section7)

### A. TF-IDF<a name="Section6"></a>

In [None]:
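# Replace NaN values with an empty string (in both sets, since the vectorizer cannot process NaN)
train_features = np.where(pd.isnull(train_features), '', train_features)
test_features = np.where(pd.isnull(test_features), '', test_features)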

# Create an instance of TfidfVectorizer
vectorizer = TfidfVectorizer()  # For TF-IDF

# Fit the vectorizer on the training data and transform the training features
train_features_vectorized = vectorizer.fit_transform(train_features)

# Transform the testing features using the trained vectorizer
test_features_vectorized = vectorizer.transform(test_features)

train_features_vectorized.shape, test_features_vectorized.shape

In [None]:
# Convert the sparse matrix to a dense matrix and create a DataFrame
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
train_features_df = pd.DataFrame(train_features_vectorized.toarray(), columns=vectorizer.get_feature_names_out())
test_features_df = pd.DataFrame(test_features_vectorized.toarray(), columns=vectorizer.get_feature_names_out())

# Display the feature vectors
print("Training Features:\n", train_features_df)
print("Testing Features:\n", test_features_df)

In [None]:
# Save the train_features vectors to CSV files
train_features_df.to_csv('TF-IDF (70-30)- TextBlob train_features.csv', index=False)

In [None]:
# Save the test_features vectors to CSV files
test_features_df.to_csv('TF-IDF (70-30)- TextBlob test_features.csv', index=False)

### B. BOW<a name="Section7"></a>

In [None]:
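# Replace NaN values with an empty string (in both sets, since the vectorizer cannot process NaN)
train_features = np.where(pd.isnull(train_features), '', train_features)
test_features = np.where(pd.isnull(test_features), '', test_features)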

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()  # For BoW

# Fit the vectorizer on the training data and transform the training features
train_features_vectorized = vectorizer.fit_transform(train_features)

# Transform the testing features using the trained vectorizer
test_features_vectorized = vectorizer.transform(test_features)

train_features_vectorized.shape, test_features_vectorized.shape

In [None]:
# Convert the sparse matrix to a dense matrix and create a DataFrame
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
train_features_df = pd.DataFrame(train_features_vectorized.toarray(), columns=vectorizer.get_feature_names_out())
test_features_df = pd.DataFrame(test_features_vectorized.toarray(), columns=vectorizer.get_feature_names_out())

# Display the feature vectors
print("Training Features:\n", train_features_df)
print("Testing Features:\n", test_features_df)

## 3. Model Generation<a name="Section3"></a>

## 3.1 Train model using SVM<a name="Section6"></a>



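How to train an SVM classifier?
- Fit a linear-kernel `SVC` on the vectorized training features and the encoded training labels, then use the fitted model to predict the labels of the vectorized test features.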

**To save time, jump here**

In [None]:
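# If you want to use the classifier loaded from file (see "Pickling the Model"),
# use this code to perform prediction.
# new_data must already be vectorized with the same fitted vectorizer used for training.
predictions = loaded_model.predict(new_data)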

In [None]:
import pandas as pd

# Load the training dataset from CSV file
train_data = pd.read_csv('train_data.csv')

# Load the testing dataset from CSV file
test_data = pd.read_csv('test_data.csv')

# Load the TF-IDF vectorized features for training from CSV file
train_features = pd.read_csv('train_features.csv')

# Load the TF-IDF vectorized features for testing from CSV file
test_features = pd.read_csv('test_features.csv')

**Training SVM Classifier (Linear)**

In [None]:
from sklearn.svm import SVC

svm_classifier = SVC(kernel='linear')
start_time = time.time()
svm_classifier.fit(train_features_vectorized, train_labels)
end_time = time.time()

# Print the runtime of training the classifier
print(f"Training time: {end_time - start_time} seconds")
display(svm_classifier.get_params())

**Predict labels of test data**
<br>
Use the print() function to display the test_predictions array

In [None]:
# Create a list of the sentiment labels
sentiment_labels = ['Positive', 'Negative']


# Predict the labels of the test data
start_time = time.time()
test_predictions = svm_classifier.predict(test_features_vectorized)
end_time = time.time()

# Print the runtime of predicting the labels
print(f"Prediction time: {end_time - start_time} seconds")

# Convert the numeric labels back to sentiment labels
actual_sentiments = encoder.inverse_transform(test_labels)
predicted_sentiments = encoder.inverse_transform(test_predictions)

# Print the predicted labels
display(test_predictions, predicted_sentiments)

**Print results of predicted labels using DataFrame**

Create a new DataFrame that combines the test data with the predicted labels

**For TextBlob & VADER**

In [None]:
# Create a DataFrame with the test data and predicted labels
results_df = pd.DataFrame({'Text': test_data['Cleaned_Tweets'], 'actual_sentiment': actual_sentiments, 
                           'predicted_sentiment': predicted_sentiments})

# Print the DataFrame
display(results_df)

**For Harmonized**

In [None]:
# Create a dictionary to map sentiment labels to risk labels
sentiment_labels = ['Positive', 'Weakly Negative', 'Mild Negative', 'Strongly Negative']
risk_labels = ['Low Risk', 'Mild Risk', 'Moderate Risk', 'Severe Risk']
label_mapping = {sentiment_labels[i]: risk_labels[i] for i in range(len(sentiment_labels))}

# Apply reassignment to the actual and predicted sentiment labels
results_df['actual_risk'] = results_df['actual_sentiment'].map(label_mapping)
results_df['predicted_risk'] = results_df['predicted_sentiment'].map(label_mapping)

# Print the dataframe
display(results_df)

**Save the predicted labels into a CSV file**

In [None]:
#Save results into CSV file
results_df.to_csv('SVM_BOW (80-20)- Predict Label Modelling Results [VADER].csv', index=False)

<h3>Pickling the Model</h3>

To reuse the trained classifier without retraining it, you can use the pickle module to save the classifier object to a file and then load it back into memory:

In [None]:
import pickle

# Save the classifier object to a file
with open('SVM_classifier (TF-IDF_VADER - 70-30).pkl', 'wb') as file:
    pickle.dump(svm_classifier, file)

# Print the classifier object
print(svm_classifier)

In [None]:
# Load the classifier object from the file saved in the previous cell
with open('SVM_classifier (TF-IDF_VADER - 70-30).pkl', 'rb') as file:
    loaded_model = pickle.load(file)
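A pickled classifier on its own can only score text that has been vectorized with the exact vectorizer it was trained with. One possible way to keep the two together is to wrap them in a scikit-learn pipeline and pickle that single object. The sketch below is an illustration under that assumption, not the approach used in this notebook; the toy texts, labels and variable names are hypothetical.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data, for illustration only
toy_texts = [
    "food is affordable and plentiful",
    "grateful for the food bank",
    "cannot afford food this month",
    "prices are too high to eat well",
]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together
text_clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
text_clf.fit(toy_texts, toy_labels)

# Round trip through bytes; pickle.dump / pickle.load on a file works the same way
restored_clf = pickle.loads(pickle.dumps(text_clf))

# The restored pipeline accepts raw text, because it carries its own vectorizer
new_texts = ["food is too expensive"]
print(restored_clf.predict(new_texts))
```

With this design, the loaded object can be called on raw strings directly, and there is no separate vectorizer file to keep in sync.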

In [None]:
# To use the classifier loaded from file, run this code to perform prediction.
# new_data must already be vectorized with the same fitted vectorizer used for training.
predictions = loaded_model.predict(new_data)

## 4. Model Evaluation<a name="Section4"></a>

**A. Evaluation Metrics:**

1. Accuracy
<br>
2. Precision
<br>
3. F1 Score
<br> Because the classes are imbalanced, the F1 score was used as a metric </br>
4. Recall
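To illustrate why accuracy alone can mislead on imbalanced classes, the sketch below scores a deliberately useless classifier that always predicts the majority class. The labels are made up for illustration. The `average` argument controls how per-class F1 scores are combined: `weighted` averages by class frequency, while `macro` treats every class equally.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical imbalanced labels: 8 negatives (0) and 2 positives (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A classifier that always predicts the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)

# zero_division=0 silences the warning for the class that is never predicted
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
recall_minority = recall_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"Accuracy: {acc:.2f}")
print(f"Weighted F1: {f1_weighted:.2f}")
print(f"Macro F1: {f1_macro:.2f}")
print(f"Minority-class recall: {recall_minority:.2f}")
```

Accuracy is 0.80 even though no positive example is found; the macro F1 of about 0.44 and the minority recall of 0 expose the problem, while the weighted F1 (about 0.71) sits in between because it is dominated by the majority class.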

**B. K-Fold Cross Validation**

Using k-fold (k = 10)
<br></br>
Part of code retrieved from here:
https://github.com/ThinamXx/Twitter..Sentiment..Analysis/blob/master/Twitter%20Sentiment%20Analysis.ipynb

<h2> A. Evaluation Metrics </h2>

In [None]:
# Using the fitted model to make predictions on testing data
# Predict the labels of the test data
test_predictions = svm_classifier.predict(test_features_vectorized)

# Print the predicted labels
print(test_predictions)

**Checking the accuracy in Testing Data**

In [None]:
# A) VADER & textBlob DATASET

# Create a classification report
classification = classification_report(results_df['actual_sentiment'], results_df['predicted_sentiment'])

# Calculate accuracy score
accuracy = accuracy_score(results_df['actual_sentiment'], results_df['predicted_sentiment'])

# Calculate the accuracy of the classifier
accuracy = accuracy_score(test_labels, test_predictions)

# Calculate the precision of the classifier
precision = precision_score(test_labels, test_predictions, average='weighted')

# Calculate the recall of the classifier
recall = recall_score(test_labels, test_predictions, average='weighted')

# Calculate the F1 score of the classifier
f1 = f1_score(test_labels, test_predictions, average='weighted')

# Calculate the confusion matrix of the classifier
confusion_mat = confusion_matrix(test_labels, test_predictions)

# Define the labels for the confusion matrix
# Rows/columns follow the encoded order: 0 = Negative (Food Insecure), 1 = Positive (Food Secured)
sentiment_labels = ['Food Insecure', 'Food Secured']

# Create a DataFrame with the confusion matrix and labels
confusion_df = pd.DataFrame(confusion_mat, index=sentiment_labels, columns=sentiment_labels)

# Print the evaluation metrics & confusion matrix
print("Confusion Matrix:")
print(confusion_mat)

print("\nEvaluation Metrics:")
print("Accuracy:", accuracy * 100, "%")
print("Precision:", precision * 100, "%")
print("Recall:", recall * 100, "%")
print("F1 Score:", f1 * 100, "%")
display(confusion_df)

In [None]:
# Create a heatmap of the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_mat, annot=True, fmt='d', cmap='Blues', xticklabels=sentiment_labels, yticklabels=sentiment_labels)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.show()

In [None]:
# B) HARMONIZED DATASET

# Create a classification report
classification = classification_report(results_df['actual_risk'], results_df['predicted_risk'])

# Calculate accuracy score
accuracy = accuracy_score(results_df['actual_risk'], results_df['predicted_risk'])

# Calculate the accuracy of the classifier
accuracy = accuracy_score(test_labels, test_predictions)

# Calculate the precision of the classifier
precision = precision_score(test_labels, test_predictions, average='weighted')

# Calculate the recall of the classifier
recall = recall_score(test_labels, test_predictions, average='weighted')

# Calculate the F1 score of the classifier
f1 = f1_score(test_labels, test_predictions, average='weighted')

# Calculate the confusion matrix of the classifier
confusion_mat = confusion_matrix(test_labels, test_predictions)

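# Define the labels for the confusion matrix
# LabelEncoder numbers the classes alphabetically, so derive the risk labels in that same order
# (via label_mapping from the results cell) instead of hard-coding them
risk_labels = [label_mapping[name] for name in encoder.classes_]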

# Create a DataFrame with the confusion matrix and labels
confusion_df = pd.DataFrame(confusion_mat, index=risk_labels, columns=risk_labels)

# Print the evaluation metrics & confusion matrix
print("Confusion Matrix:")
print(confusion_mat)

print("\nEvaluation Metrics:")
print("Accuracy:", accuracy * 100, "%")
print("Precision:", precision * 100, "%")
print("Recall:", recall * 100, "%")
print("F1 Score:", f1 * 100, "%")
display(confusion_df)

In [None]:
# Create a heatmap of the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_mat, annot=True, fmt='d', cmap='Blues', xticklabels=risk_labels, yticklabels=risk_labels)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.show()

In [None]:
# Create a dictionary to store the evaluation metrics
evaluation_results = {
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Score': [accuracy, precision, recall, f1]
}

df_evaluation = pd.DataFrame(evaluation_results)

# Create a DataFrame from the confusion matrix
confusion_df = pd.DataFrame(confusion_mat, columns=sentiment_labels, index=sentiment_labels)

# Save the evaluation DataFrame to a CSV file
df_evaluation.to_csv('[VADER] SVM TF-IDF (70-30) Inital Model_Evaluation.csv', index=False)

# Save the confusion matrix DataFrame to a CSV file
confusion_df.to_csv('[VADER] SVM TF-IDF (70-30) Inital Model_Confusion_Matrix.csv', index=True)

<h3>ROC Curve</h3>

In [None]:
%%time
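# Calculate the decision scores (signed distance to the separating hyperplane) for the test data.
# These are not probabilities, but roc_curve accepts them as ranking scores for the positive class (1)
predicted_probabilities = svm_classifier.decision_function(test_features_vectorized)

# Print the decision scores
print(predicted_probabilities)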

In [None]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Create ROC curve
fpr, tpr, thresholds = roc_curve(test_labels, predicted_probabilities)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

## B. K-Fold Cross Validation<a name="Section2"></a>
Back to [Top Page](#TOC)

Use the cross_val_score() function from sklearn.model_selection to evaluate the performance of the classifier using k-fold cross-validation (k = 10 in the cells that follow).
<br></br>
The cross_val_score() function takes the classifier, the feature vectors, the labels, and the number of folds as input, and returns an array of scores for each fold

**Output: Cross-validation scores for each fold + Average cross-validation score**

Note: cv = k (k-fold)
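By default `cross_val_score()` reports accuracy for a classifier. Since the classes here are imbalanced, it may be useful to score the folds with a weighted F1 as well and to fix the fold assignment for reproducibility. The sketch below does this on synthetic data from `make_classification` (a stand-in for the vectorized tweets; all variable names are hypothetical).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic, imbalanced two-class data standing in for the vectorized tweets
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, weights=[0.8, 0.2], random_state=42
)

# Stratified folds keep the class proportions similar in every fold
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
toy_clf = SVC(kernel="linear")

# The scoring argument selects the metric computed on each held-out fold
acc_scores = cross_val_score(toy_clf, X_toy, y_toy, cv=folds, scoring="accuracy")
f1_scores = cross_val_score(toy_clf, X_toy, y_toy, cv=folds, scoring="f1_weighted")

print(f"Mean accuracy: {acc_scores.mean():.3f} (std {acc_scores.std():.3f})")
print(f"Mean weighted F1: {f1_scores.mean():.3f} (std {f1_scores.std():.3f})")
```

Passing an integer such as `cv=10` with a classifier also produces stratified folds, but without shuffling; the explicit `StratifiedKFold` object is only needed when shuffling or a fixed seed is wanted.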

In [None]:
%%time
from sklearn.model_selection import cross_val_score

# Perform k-fold cross-validation and obtain the scores for each fold
scores = cross_val_score(svm_classifier, train_features_vectorized, train_labels, cv=10)

# Print the accuracy for each fold
for fold, score in enumerate(scores):
    print(f"Fold {fold+1}: {score}")

# Calculate and print the mean accuracy and standard deviation
# (RMSE is a regression metric and is not defined for accuracy scores, so it is not reported)
mean_accuracy = scores.mean()
std_deviation = scores.std()
print(f"Mean accuracy: {mean_accuracy}")
print(f"Standard deviation: {std_deviation}")

In [None]:
%%time

from sklearn.model_selection import cross_val_score

# Evaluate the performance of SVM classifier using cross-validation
scores = cross_val_score(svm_classifier, train_features_vectorized, train_labels, cv=10)

#rmse = np.sqrt(-scores)

# Print the cross-validation scores
print(f"Cross-validation scores: {scores}")
print(f"Average cross-validation score: {scores.mean()}")
#print("RMSE values: ", np.round(rmse, 2))
#print("RMSE average: ", np.round(rmse))

## 3. Hyperparameter Tuning<a name="Section4"></a>

Back to [Top Page](#TOC)

**How to find the BEST hyperparameters for SVM classifier**

The GridSearchCV() function takes the classifier, the hyperparameters, and the number of folds as input, and returns the best hyperparameters and the corresponding score.

In [None]:
%%time

from sklearn.model_selection import GridSearchCV #Perform grid search over hyperparameters
from sklearn.svm import SVC

# Create instance of SVM classifier
svm_classifier = SVC(kernel='linear')

# Define the hyperparameters to search over
hyperparameters = {'C': [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]}

# Use grid search to find the best hyperparameters for the classifier
grid_search = GridSearchCV(svm_classifier, hyperparameters, cv=10)
grid_search.fit(train_features_vectorized, train_labels)

# Print the best hyperparameters and the corresponding score
print(f"Best hyperparameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")

We then print the best hyperparameters and the corresponding score using the print() function. The resulting output will show the best hyperparameters found by the grid search and the corresponding score.

After finding the best hyperparameters, you can proceed to train and evaluate the SVM classifier.
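As an alternative to copying the best `C` value by hand into a new classifier, the fitted search object already holds a model refit on the full training data in `best_estimator_` (this is the default behaviour, `refit=True`). The sketch below illustrates this on made-up texts, with the vectorizer placed inside a pipeline so that it is refit within each fold; the data, step names and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data, for illustration only (1 = positive, 0 = negative)
toy_texts = [
    "food is affordable and plentiful", "grateful for the food bank",
    "we had a healthy dinner tonight", "the market has fresh vegetables",
    "thankful for a full pantry", "the harvest was good this year",
    "cannot afford food this month", "prices are too high to eat well",
    "we skipped dinner again tonight", "the pantry is empty and we are hungry",
    "worried about feeding my children", "no money left for groceries",
]
toy_labels = [1] * 6 + [0] * 6

search_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="linear")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"svm__C": [0.1, 1.0, 10.0]}

toy_search = GridSearchCV(search_pipeline, param_grid, cv=3, scoring="f1_weighted")
toy_search.fit(toy_texts, toy_labels)

# best_estimator_ is the pipeline refit on all of the data with the best C
best_model = toy_search.best_estimator_
print(toy_search.best_params_)
print(best_model.predict(["food is too expensive"]))
```

The same pattern should carry over to the notebook's data by fitting the search on the raw training text instead of the pre-vectorized matrix.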

In [None]:
# Replace NaN values with an empty string (in both sets, since the vectorizer cannot process NaN)
train_features = np.where(pd.isnull(train_features), '', train_features)
test_features = np.where(pd.isnull(test_features), '', test_features)

# Create an instance of the vectorizer
vectorizer = TfidfVectorizer()  # For TF-IDF
#vectorizer = CountVectorizer()  # For BoW

# Fit the vectorizer on the training data and transform the training features
train_features_vectorized = vectorizer.fit_transform(train_features)

# Transform the testing features using the trained vectorizer
test_features_vectorized = vectorizer.transform(test_features)

train_features_vectorized.shape, test_features_vectorized.shape

In [None]:
%%time

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Split the data into training and testing sets
# Remember to modify test size each time you're trying to run a new model!!
train_features, test_features, train_labels, test_labels = train_test_split(features, encoded_labels, 
                                                                            test_size=0.2, random_state=42)
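# Replace NaN values with an empty string (in both sets, since the vectorizer cannot process NaN)
train_features = np.where(pd.isnull(train_features), '', train_features)
test_features = np.where(pd.isnull(test_features), '', test_features)

# Create an instance of the vectorizer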
vectorizer = TfidfVectorizer()  # For TF-IDF
#vectorizer = CountVectorizer()  # For BoW

# Fit the vectorizer on the training data and transform the training features
train_features_vectorized = vectorizer.fit_transform(train_features)

# Transform the testing features using the trained vectorizer
test_features_vectorized = vectorizer.transform(test_features)

# Create instance of SVM classifier with best hyperparameters
svm_classifier = SVC(kernel='linear', C=2.0)

# Train a SVM classifier on the training data
svm_classifier.fit(train_features_vectorized, train_labels)

# Predict the labels of the test data
test_predictions = svm_classifier.predict(test_features_vectorized)

# Evaluate the performance of the classifier on the test data
confusion = confusion_matrix(test_labels, test_predictions)
report = classification_report(test_labels, test_predictions)
accuracy = accuracy_score(test_labels, test_predictions)

# Define the labels for the confusion matrix
# Rows/columns follow the encoded order: 0 = Negative (Food Insecure), 1 = Positive (Food Secured)
sentiment_labels = ['Food Insecure', 'Food Secured']

# Print the confusion matrix, classification report, and accuracy score
print(f"Confusion matrix:\n{confusion}")
print(f"Classification report:\n{report}")
print(f"Accuracy score: {accuracy}")

In [None]:
# A) VADER & textBlob DATASET

# Create a classification report
classification = classification_report(results_df['actual_sentiment'], results_df['predicted_sentiment'])

# Calculate accuracy score
accuracy = accuracy_score(results_df['actual_sentiment'], results_df['predicted_sentiment'])

# Calculate the accuracy of the classifier
accuracy = accuracy_score(test_labels, test_predictions)

# Calculate the precision of the classifier
precision = precision_score(test_labels, test_predictions, average='weighted')

# Calculate the recall of the classifier
recall = recall_score(test_labels, test_predictions, average='weighted')

# Calculate the F1 score of the classifier
f1 = f1_score(test_labels, test_predictions, average='weighted')

# Calculate the confusion matrix of the classifier
confusion_mat = confusion_matrix(test_labels, test_predictions)

# Define the labels for the confusion matrix
# Rows/columns follow the encoded order: 0 = Negative (Food Insecure), 1 = Positive (Food Secured)
sentiment_labels = ['Food Insecure', 'Food Secured']

# Create a DataFrame with the confusion matrix and labels
confusion_df = pd.DataFrame(confusion_mat, index=sentiment_labels, columns=sentiment_labels)

# Print the evaluation metrics & confusion matrix
print("Confusion Matrix:")
print(confusion_mat)

print("\nEvaluation Metrics:")
print("Accuracy:", accuracy * 100, "%")
print("Precision:", precision * 100, "%")
print("Recall:", recall * 100, "%")
print("F1 Score:", f1 * 100, "%")
display(confusion_df)

In [None]:
# Create a heatmap of the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_mat, annot=True, fmt='d', cmap='Blues', xticklabels=sentiment_labels, yticklabels=sentiment_labels)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.show()

In [None]:
# For HARMONIZED

# Create a classification report
classification = classification_report(results_df['actual_risk'], results_df['predicted_risk'])

# Calculate accuracy score
accuracy = accuracy_score(results_df['actual_risk'], results_df['predicted_risk'])

# Calculate the accuracy of the classifier
accuracy = accuracy_score(test_labels, test_predictions)

# Calculate the precision of the classifier
precision = precision_score(test_labels, test_predictions, average='weighted')

# Calculate the recall of the classifier
recall = recall_score(test_labels, test_predictions, average='weighted')

# Calculate the F1 score of the classifier
f1 = f1_score(test_labels, test_predictions, average='weighted')

# Calculate the confusion matrix of the classifier
confusion_mat = confusion_matrix(test_labels, test_predictions)

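# Define the labels for the confusion matrix
# LabelEncoder numbers the classes alphabetically, so derive the risk labels in that same order
# (via label_mapping from the results cell) instead of hard-coding them
risk_labels = [label_mapping[name] for name in encoder.classes_]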

# Create a DataFrame with the confusion matrix and labels
confusion_df = pd.DataFrame(confusion_mat, index=risk_labels, columns=risk_labels)

# Print the evaluation metrics & confusion matrix
print("Confusion Matrix:")
print(confusion_mat)

print("\nEvaluation Metrics:")
print("Accuracy:", accuracy * 100, "%")
print("Precision:", precision * 100, "%")
print("Recall:", recall * 100, "%")
print("F1 Score:", f1 * 100, "%")
display(confusion_df)

In [None]:
# Create a heatmap of the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_mat, annot=True, fmt='d', cmap='Blues', xticklabels=risk_labels, yticklabels=risk_labels)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.show()

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Calculate the accuracy of the classifier
accuracy = accuracy_score(test_labels, test_predictions)

# Calculate the precision of the classifier
precision = precision_score(test_labels, test_predictions, average='weighted')

# Calculate the recall of the classifier
recall = recall_score(test_labels, test_predictions, average='weighted')

# Calculate the F1 score of the classifier
f1 = f1_score(test_labels, test_predictions, average='weighted')

# Calculate the confusion matrix of the classifier
confusion_mat = confusion_matrix(test_labels, test_predictions)

# Define the row (actual) and column (predicted) labels for the 2x2 confusion matrix.
# With labels encoded as Negative = 0 and Positive = 1, the cells are:
#   [0, 0] = True Negative (TN),  [0, 1] = False Positive (FP)
#   [1, 0] = False Negative (FN), [1, 1] = True Positive (TP)
row_labels = ['Actual Negative', 'Actual Positive']
col_labels = ['Predicted Negative', 'Predicted Positive']

# Create a DataFrame with the confusion matrix and labels
confusion_df = pd.DataFrame(confusion_mat, index=row_labels, columns=col_labels)

# Print the evaluation metrics
print("Evaluation Metrics:")
print("Accuracy:", accuracy * 100, "%")
print("Precision:", precision * 100, "%")
print("Recall:", recall * 100, "%")
print("F1 Score:", f1 * 100, "%")
display("Confusion Matrix:", confusion_df)

In [None]:
# Create a dictionary to store the evaluation metrics (After tuning)
evaluation_results = {
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Score': [accuracy, precision, recall, f1]
}

# Create a DataFrame from the evaluation results
df_evaluation = pd.DataFrame(evaluation_results)

# Create a DataFrame from the confusion matrix (rows = actual class, columns = predicted class)
confusion_df = pd.DataFrame(confusion_mat, columns=['Predicted Negative', 'Predicted Positive'], index=['Actual Negative', 'Actual Positive'])

# Concatenate the evaluation DataFrame and confusion DataFrame
results_df = pd.concat([df_evaluation, confusion_df], axis=0)

# Save the DataFrame to a CSV file
results_df.to_csv('[VADER] SVM BOW (80-20) Post Hypertuning Model_Evaluation.csv', index=False)

In [None]:
#For Harmonized Sentiments Dataset
# Create a DataFrame from the confusion matrix
confusion_df = pd.DataFrame(confusion_mat, columns=sentiment_labels, index=sentiment_labels)

# Save the DataFrame to a CSV file
results_df.to_csv('[Harmonized] SVM TF-IDF (80-20) Initial Model_Evaluation.csv', index=False)

## 5. Predict Sentiment of New text Data<a name="Section5"></a>
Back to [Top Page](#TOC)

Using the trained model classifier, we can predict the sentiment of new text data
<br>

Positive [1] - Food Secured

Negative [0] - Food Insecure

A) [VADER/TextBlob Dataset](#Section14)
<br>
B) [Harmonized Dataset](#Section15)

### A) VADER/TextBlob Dataset<a name="Section14"></a>

In [None]:
%%time

from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object
#vectorizer = TfidfVectorizer()  # For TF-IDF
vectorizer = CountVectorizer() # For BOW

# Vectorize the training data
train_features_vectorized = vectorizer.fit_transform(train_features)

# Create instance of SVM classifier with best hyperparameters
svm_classifier = SVC(kernel='linear', C=1.0, probability=True)  # Set probability=True

# Train the SVM classifier on the training data
svm_classifier.fit(train_features_vectorized, train_labels)

# Predict the sentiment of new text data
new_data = [
    "I'm so angry about the high food prices! It's making it so hard for me to feed my family.",
    "I'm so grateful for the food banks and other organizations that are helping to feed people who are struggling. They're making a real difference.",
    "I'm so worried about the future of food security. Climate change is making it harder to grow food, and more people are going hungry.",
    "I'm so inspired by the work of food banks and other organizations that are fighting hunger. They're making a real difference in people's lives.",
    "I'm hopeful that we can create a world where everyone has access to the food they need to live a healthy and productive life",
    "I'm working part-time and I'm not sure if I'll be able to keep my job.",
    "I'm not sure if I'll be able to afford to pay my rent this month."
]

new_data_vectorized = vectorizer.transform(new_data)
new_data_predictions = svm_classifier.predict(new_data_vectorized)
new_data_sentiment_scores = svm_classifier.predict_proba(new_data_vectorized)[:, 1]  # Positive sentiment score

# Print the predicted sentiment + sentiment scores for the new data
for i in range(len(new_data)):
    print(f"Text: {new_data[i]}")
    sentiment_label = "Positive (Food Secured)" if new_data_predictions[i] == 1 else "Negative (Food Insecure)"
    print(f"Predicted sentiment: {sentiment_label}")
    print(f"Sentiment score: {new_data_sentiment_scores[i]}")
    print()

**Save the new data results into CSV file**

In [None]:
# Create a DataFrame to store the results
results_df = pd.DataFrame({
    'Text': new_data,
    'Predicted Sentiment': new_data_predictions,
    'Sentiment Score': new_data_sentiment_scores
})

# Save the DataFrame to a CSV file
results_df.to_csv('[SAMPLE] [SVM] BOW Harmonized (80-20) new_data_results.csv', index=False)

### B) Harmonized Dataset<a name="Section15"></a>

**Predict sentiment & FI Risk Category of New Text Data**

Go [here for the immediate solution](#Section16)

### Immediate Solution for Predicting FI Risk <a name="Section16"></a>

In [None]:
# Define the risk category mapping
risk_category_mapping = {
    0: "\033[1;32mLow Risk\033[0m",  # Green
    1: "\033[1;33mMild Risk\033[0m", # Yellow
    2: "\033[1;38;5;208mModerate Risk\033[0m", # Orange (256-colour code; 1;31 would render red, the same as Severe Risk)
    3: "\033[1;31mSevere Risk\033[0m"  # Red
}

# Create a CountVectorizer object
#vectorizer = TfidfVectorizer() # For TF-IDF
vectorizer = CountVectorizer() # For BOW

# Vectorize the training data
train_features_vectorized = vectorizer.fit_transform(train_features)

# Create instance of hypertuned SVM classifier with best hyperparameters
from sklearn.svm import SVC  # only LinearSVC is imported at the top of the notebook
svm_classifier = SVC(kernel='linear', C=1.0, probability=True)  # probability=True enables predict_proba

# Train the SVM classifier on the training data
svm_classifier.fit(train_features_vectorized, train_labels)

# Function to predict sentiment, sentiment score, and FI risk category of new text data
def predict_sentiment_and_fi_risk(text):
    # Vectorize the new text data
    new_text_vectorized = vectorizer.transform([text])

    # Predict the sentiment using the trained model
    sentiment = svm_classifier.predict(new_text_vectorized)[0]

    # Predict the sentiment score using the trained model
    sentiment_score = np.max(svm_classifier.predict_proba(new_text_vectorized))

    # Assign the FI risk category based on sentiment and sentiment score
    if sentiment == 1:  # Positive sentiment (Food Secured)
        fi_sentiment = "Positive"
        fi_risk = risk_category_mapping[0]  # Low Risk
    else:  # Negative sentiment
        sentiment_score *= -1  # Multiply the sentiment score by -1 for Negative sentiment (Food Insecure)
        
        # Closed lower bound: a score of exactly -1.0 must also count as Strongly Negative
        if sentiment_score <= -0.6:
            fi_sentiment = "Negative (Strongly Negative)"
            fi_risk = risk_category_mapping[3]  # Severe Risk
        elif sentiment_score <= -0.3:
            fi_sentiment = "Negative (Mild Negative)"
            fi_risk = risk_category_mapping[2]  # Moderate Risk
        else:
            fi_sentiment = "Negative (Weakly Negative)"
            fi_risk = risk_category_mapping[1]  # Mild Risk

    return fi_sentiment, sentiment_score, fi_risk


# Predict the sentiment, sentiment score, and FI risk for the new text data
new_data = [
    "I'm so angry about the high food prices! It's making it so hard for me to feed my family.",
    "I'm so grateful for the food banks and other organizations that are helping to feed people who are struggling. They're making a real difference.",
    "I'm so worried about the future of food security. Climate change is making it harder to grow food, and more people are going hungry.",
    "I'm so inspired by the work of food banks and other organizations that are fighting hunger. They're making a real difference in people's lives.",
    "I'm hopeful that we can create a world where everyone has access to the food they need to live a healthy and productive life",
    "I'm working part-time and I'm not sure if I'll be able to keep my job.",
    "I'm not sure if I'll be able to afford to pay my rent this month."
]

final_results = []
for text in new_data:
    fi_sentiment, sentiment_score, fi_risk = predict_sentiment_and_fi_risk(text)
    print("Text:", text)
    print("Predicted sentiment:", fi_sentiment)
    print("Sentiment score:", sentiment_score)
    print("FI Risk:", fi_risk)
    print()
    
    # Store the results in a dictionary
    final_result = {
        "Text": text,
        "Predicted Sentiment": fi_sentiment,
        "Sentiment Score": sentiment_score,
        "FI Risk": fi_risk
    }
    final_results.append(final_result)
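
The score bands used in `predict_sentiment_and_fi_risk` can be isolated into a small pure function, which makes the band edges easy to check. This is a minimal sketch, assuming the same sign convention as above (negative predictions carry a negated score). `risk_from_signed_score` is a hypothetical helper, it returns plain text without colour codes, and it treats a score of exactly -1.0 as Severe Risk.

```python
def risk_from_signed_score(score):
    """Map a signed sentiment score in [-1, 1] to a plain-text FI risk category."""
    if score >= 0:
        return "Low Risk"       # positive sentiment (Food Secured)
    if score <= -0.6:
        return "Severe Risk"    # strongly negative
    if score <= -0.3:
        return "Moderate Risk"  # mild negative
    return "Mild Risk"          # weakly negative


# Check the band edges as well as values inside each band
for s in (0.8, -0.1, -0.3, -0.45, -0.6, -1.0):
    print(s, "->", risk_from_signed_score(s))
```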

In [None]:
# Create a DataFrame for the results
final_results_df = pd.DataFrame(final_results)

# Print the DataFrame
print(final_results_df)

**Save the new data results to a CSV file**

In [None]:
# Save the DataFrame to a CSV file
final_results_df.to_csv('[SAMPLE] [SVM] BOW Harmonized (80-20) final new_data_results.csv', index=False)
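
Because `risk_category_mapping` embeds ANSI colour codes in the risk strings, those escape sequences are written to the CSV file as well. One way to avoid that is to strip them before saving. The helper below is a sketch and not part of the original notebook; `strip_ansi` and `ANSI_PATTERN` are hypothetical names.

```python
import re

# Matches ANSI colour/style sequences such as "\033[1;31m" and "\033[0m"
ANSI_PATTERN = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text):
    """Return text with ANSI colour/style escape sequences removed."""
    return ANSI_PATTERN.sub("", text)


coloured = "\033[1;31mSevere Risk\033[0m"
print(repr(strip_ansi(coloured)))  # → 'Severe Risk'
```

Applied to the results above, a call such as `final_results_df["FI Risk"].map(strip_ansi)` before `to_csv` would keep the file free of escape sequences.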

**ALTERNATE CODE**

In [None]:
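# Create a dataframe from the results collected in the prediction loop above
columns = ["Text", "Predicted Sentiment", "Sentiment Score", "FI Risk"]
final_results_df = pd.DataFrame(final_results, columns=columns)

# Strip the ANSI colour codes already embedded by risk_category_mapping
final_results_df["FI Risk"] = final_results_df["FI Risk"].str.replace(r"\x1b\[[0-9;]*m", "", regex=True)

# Assign colors to FI Risk categories (keys must match the plain risk labels)
color_mapping = {
    "Low Risk": "\033[92m",              # Green
    "Mild Risk": "\033[93m",             # Yellow
    "Moderate Risk": "\033[38;5;208m",   # Light Orange
    "Severe Risk": "\033[91m"            # Red
}

# Wrap each FI Risk label in its color code and a reset code, keeping the label text
final_results_df["FI Risk"] = final_results_df["FI Risk"].apply(lambda x: f"{color_mapping.get(x, '')}{x}\033[0m")

# Print the results with color-coded FI Risk column
# (ANSI colors show up in printed text output, not in the HTML table produced by display)
print(final_results_df.to_string(index=False))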

In [None]:
# Save the DataFrame to a CSV file
final_results_df.to_csv('[SAMPLE] [SVM] TF-IDF TextBlob (70-30) final new_data_results.csv', index=False)

## 5. Visualization<a name="Section5"></a>
<br></br>
**Type of Visualizations:**
1. Heatmap
2. Stacked Bar Chart
Example: 
<img src="https://drive.google.com/file/d/1DPFEOBuwPYe4wVSVAvYoZYdVXlBMMTYh/view?usp=sharing" alt="Example Stacked Bar Chart Model Evaluation Results">
3. ROC Curve


**Heatmap**

In [None]:
import seaborn as sns

# Plotting the heatmap of confusion matrix
cm = confusion_matrix(test_labels, test_predictions)
sns.heatmap(cm, annot=True, fmt="d")  # fmt="d" prints the counts as integers instead of scientific notation

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Define the row and column labels for a binary confusion matrix.
# scikit-learn's layout is [[TN, FP], [FN, TP]]: rows are actual classes, columns are predicted classes
row_labels = ['Actual Negative', 'Actual Positive']
col_labels = ['Predicted Negative', 'Predicted Positive']

# Create a DataFrame with the confusion matrix and labels
confusion_df = pd.DataFrame(confusion_mat, index=row_labels, columns=col_labels)

# Create a heatmap of the confusion matrix
sns.heatmap(confusion_df, annot=True, fmt='d')

# Show the plot
plt.show()

In [None]:
# Plotting Function for Confusion Matrix
def plot_cm(cm, classes, title, normalized = False, cmap = plt.cm.Blues):

  # Normalize first so that the image, the color bar, and the cell text all use the same values
  if normalized:
    cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]
    print("Normalized Confusion Matrix")
  else:
    print("Unnormalized Confusion Matrix")

  plt.imshow(cm, interpolation = "nearest", cmap = cmap)
  plt.title(title, pad = 20)
  plt.colorbar()
  tick_marks = np.arange(len(classes))
  plt.xticks(tick_marks, classes)
  plt.yticks(tick_marks, classes)

  # Annotate each cell; use white text on dark cells for contrast
  fmt = ".2f" if normalized else "d"
  threshold = cm.max() / 2
  for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
      plt.text(j, i, format(cm[i, j], fmt), horizontalalignment = "center", color = "white" if cm[i, j] > threshold else "black")

  plt.tight_layout()
  plt.xlabel("Predicted Label", labelpad = 20)
  plt.ylabel("Real Label", labelpad = 20)
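
Row normalization, as intended by the `normalized` option of `plot_cm`, divides every row of the confusion matrix by that row's total, so each row sums to 1. The reshape `[:, np.newaxis]` turns the row totals into a column vector, which is what makes broadcasting work row by row. A small sketch with made-up counts:

```python
import numpy as np

# Hypothetical 2x2 confusion matrix (rows = actual, columns = predicted)
cm = np.array([[8, 2],
               [1, 9]])

# Row totals as a column vector of shape (2, 1), so each row is divided by its own total
row_totals = cm.sum(axis=1)[:, np.newaxis]
row_normalized = cm.astype(float) / row_totals

print(row_normalized)
print(row_normalized.sum(axis=1))  # every row sums to 1
```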

**For Harmonized Sentiment Tweets Dataset**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Perform predictions on the test data
test_predictions = svm_classifier.predict(test_features_vectorized)

# Generate the confusion matrix
confusion_mat = confusion_matrix(test_labels, test_predictions)

# Create a list of sentiment labels (must be in the same order as the classes in the confusion matrix)
sentiment_labels = ['Positive', 'Weakly Negative', 'Mild Negative', 'Strongly Negative']

# Create a heatmap of the confusion matrix
sns.heatmap(confusion_mat, annot=True, fmt="d", cmap="Blues", xticklabels=sentiment_labels, yticklabels=sentiment_labels)

# Set the axis labels and title
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")

# Show the plot
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Create the sentiment and risk label lists (same order, so they pair up one-to-one)
sentiment_labels = ['Positive', 'Weakly Negative', 'Mild Negative', 'Strongly Negative']
risk_labels = ['Low Risk', 'Mild Risk', 'Moderate Risk', 'Severe Risk']

# Create a dictionary to map sentiment labels to risk labels
label_mapping = dict(zip(sentiment_labels, risk_labels))

# Apply reassignment to the predicted sentiment labels
results_df['predicted_risk'] = results_df['predicted_sentiment'].map(label_mapping)

# Print the updated DataFrame
display(results_df)

# Generate the confusion matrix, with rows and columns in the order of risk_labels
confusion_mat = confusion_matrix(results_df['actual_risk'], results_df['predicted_risk'], labels=risk_labels)

# Create a heatmap of the confusion matrix
sns.heatmap(confusion_mat, annot=True, fmt="d", cmap="Blues", xticklabels=risk_labels, yticklabels=risk_labels)

# Set the axis labels and title
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion Matrix")

# Show the plot
plt.show()

In [None]:
sentiment_labels = ['Positive', 'Weakly Negative', 'Mild Negative', 'Strongly Negative']
risk_labels = ['Low Risk', 'Mild Risk', 'Moderate Risk', 'Severe Risk']

# Create a dictionary to map sentiment labels to risk labels
label_mapping = {sentiment_labels[i]: risk_labels[i] for i in range(len(sentiment_labels))}

# Apply reassignment to the predicted sentiment labels
results_df['predicted_risk'] = results_df['predicted_sentiment'].map(label_mapping)

# Print the updated DataFrame
display(results_df)

## 6. Apply the Best SA Model on Full Dataset<a name="Section17"></a>
Back to [Top Page](#TOC)

Using the trained classifier, we can predict the sentiment of every tweet in the full dataset.
<br>

Positive [1] - Food Secured

Negative [0] - Food Insecure
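
The overall flow of this section (encode labels, vectorize text, fit the classifier, then transform and predict on unseen text) can be sketched on a toy example. Everything below is illustrative: the four toy tweets, the names `toy_df`, `toy_encoder`, `toy_vectorizer`, and `toy_model`, and the use of `LinearSVC` are assumptions, not the notebook's tuned model or data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Toy training data (hypothetical), using the notebook's column names
toy_df = pd.DataFrame({
    "Cleaned_Tweets": [
        "food cheap happy",
        "grateful food bank",
        "hungry cannot afford rice",
        "worried food price expensive",
    ],
    "Harmonized_Label": ["Positive", "Positive", "Strongly Negative", "Weakly Negative"],
})

# Encode the string labels as integers
toy_encoder = LabelEncoder()
y = toy_encoder.fit_transform(toy_df["Harmonized_Label"])

# Fit the vectorizer on the training text only
toy_vectorizer = TfidfVectorizer()
X = toy_vectorizer.fit_transform(toy_df["Cleaned_Tweets"])

toy_model = LinearSVC().fit(X, y)

# New text: replace missing values with empty strings, then reuse the fitted vectorizer
new_tweets = np.array(["grateful happy food", None], dtype=object)
new_tweets = np.where(pd.isnull(new_tweets), "", new_tweets)
new_vectorized = toy_vectorizer.transform(new_tweets)

# Convert the numeric predictions back to sentiment labels
predicted = toy_encoder.inverse_transform(toy_model.predict(new_vectorized))
print(predicted)
```

The key point is that the vectorizer and the encoder are fitted once on the training data and only `transform` / `inverse_transform` are called on the new text.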

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

In [3]:
# Load csv file containing tweets dataset (w/ sentiments)

tweets_df = pd.read_csv(r"C:\Users\LENOVO\Documents\Degree Life\FYP Journey\Dataset\Sentiment Analysis\V3 Harmonized [VADER & TextBlob]_All Keywords (Whole Malaysia) - Copy.csv")
display(tweets_df)

| Unnamed: 0 | Datetime | Username | Cleaned_Tweets | Location | VADER_score | TextBlob_score | Harmonized_Score | Harmonized_Label | Risk_Label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 27/1/2023 14:32 | Don Dale | buying forget review first guy feel want comme... | Malaysia | 0.6703 | -0.250000 | 0.210150 | Positive | Low Risk |
| 1 | 27/1/2023 19:04 | Iliani | food security research going explode issue end... | Malaysia | 0.5859 | -0.181818 | 0.202041 | Positive | Low Risk |
| 2 | 29/1/2023 8:28 | Naim Zaini | context slaughtered food muslim consideration ... | Malaysia | 0.8658 | 0.034722 | 0.450261 | Positive | Low Risk |
| 3 | 29/1/2023 13:29 | ?? | raise food price wet good expensive sorry guy | Malaysia | 0.3818 | -0.100000 | 0.140900 | Positive | Low Risk |
| 4 | 30/1/2023 21:52 | Alinosourawr | che restaurant sek send food x order food drin... | Malaysia | -0.8934 | -0.433333 | -0.663367 | Strongly Negative | Severe Risk |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21832 | 2023-03-30 23:45:13+00:00 | Charrlygirl | worried prosecution team family also worried f... | Malaysia | -0.8360 | 0.000000 | -0.418000 | Mild Negative | Moderate Risk |
| 21833 | 2023-03-30 23:49:23+00:00 | angel19971102 | love much clark must always worried bruce drea... | Malaysia | 0.6939 | 0.400000 | 0.546950 | Positive | Low Risk |
| 21834 | 2023-03-30 23:55:01+00:00 | firdyfire | industry player worried energy commission chie... | Malaysia | -0.0258 | 0.000000 | -0.012900 | Weakly Negative | Mild Risk |
| 21835 | 2023-03-30 23:55:16+00:00 | AhmadMuhyie | ah really weak faith fasting without real exam... | Malaysia | 0.6222 | 0.239583 | 0.430892 | Positive | Low Risk |


In [4]:
# Extract the features
features_full = tweets_df['Cleaned_Tweets'].values
print(features_full)

['buying forget review first guy feel want comment buying know fake food expensive'
 'food security research going explode issue end example new rafizi docking expensive food one food security issue'
 'context slaughtered food muslim consideration price cheap chicken bought heart sure halal level willing buy little chicken long feel confident observation chatting old people'
 ... 'industry player worried energy commission chief job'
 'ah really weak faith fasting without real exam g angu know cm holding hungry thirst delayed bed easy see people eating really good patience sincerity practiced cm sweetener jargon status whatsapp'
 'hate sin man know recorded video amazes continues prayer wrong despite tht easy feat yk faith come sacrifice eg org freehair hungry people fasting']


In [5]:
# Replace NaN values with an empty string in the feature set
features_full = np.where(pd.isnull(features_full), '', features_full)

# Transform the full dataset features using the trained vectorizer
features_full_vectorized = vectorizer.transform(features_full)

# Predict the sentiment labels using the trained model
predicted_labels_full = svm_classifier.predict(features_full_vectorized)

# Convert the numeric labels back to sentiment labels
predicted_sentiments_full = encoder.inverse_transform(predicted_labels_full)

# Map the sentiment labels to the corresponding risk labels
predicted_risks_full = [label_mapping[label] for label in predicted_sentiments_full]

# Add the predicted labels to the full dataset
tweets_df['predicted_sentiment'] = predicted_sentiments_full
tweets_df['predicted_risk'] = predicted_risks_full

# Print the DataFrame
display(tweets_df)


In [None]:
# Create a list of sentiment labels
sentiment_labels = ['Positive', 'Weakly Negative', 'Mild Negative', 'Strongly Negative']

# Create a list of risk labels
risk_labels = ['Low Risk', 'Mild Risk', 'Moderate Risk', 'Severe Risk']

# Generate the sentiment confusion matrix
# (labels= fixes the row/column order; without it, string labels are sorted alphabetically
# and would not line up with the tick labels used in the heatmap)
sentiment_confusion_mat = confusion_matrix(tweets_df['Harmonized_Label'], tweets_df['predicted_sentiment'], labels=sentiment_labels)

# Generate the risk confusion matrix
risk_confusion_mat = confusion_matrix(tweets_df['Risk_Label'], tweets_df['predicted_risk'], labels=risk_labels)

# Create a heatmap of the sentiment confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(sentiment_confusion_mat, annot=True, fmt='d', cmap='Blues', xticklabels=sentiment_labels, yticklabels=sentiment_labels)
plt.title('Sentiment Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.show()
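
When `confusion_matrix` receives string labels and no `labels=` argument, it orders the classes by sorting them, which for the sentiment names means alphabetical order rather than Positive, Weakly Negative, Mild Negative, Strongly Negative. The sketch below, with made-up `y_true` and `y_pred` lists, shows how the `labels` parameter pins the row and column order so it agrees with the heatmap tick labels.

```python
from sklearn.metrics import confusion_matrix

# Desired display order for the heatmap ticks
sentiment_order = ['Positive', 'Weakly Negative', 'Mild Negative', 'Strongly Negative']

# Made-up labels for illustration only
y_true = ['Positive', 'Positive', 'Weakly Negative', 'Mild Negative', 'Strongly Negative']
y_pred = ['Positive', 'Weakly Negative', 'Weakly Negative', 'Mild Negative', 'Strongly Negative']

# Default: classes sorted alphabetically
# (Mild Negative, Positive, Strongly Negative, Weakly Negative)
default_cm = confusion_matrix(y_true, y_pred)

# Explicit order: rows and columns follow sentiment_order
ordered_cm = confusion_matrix(y_true, y_pred, labels=sentiment_order)

print(default_cm)
print(ordered_cm)
```

In `ordered_cm` the first row belongs to Positive, whereas in `default_cm` the first row belongs to Mild Negative, so only the ordered version matches tick labels listed in `sentiment_order`.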

In [None]:
# Create a heatmap of the risk confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(risk_confusion_mat, annot=True, fmt='d', cmap='Blues', xticklabels=risk_labels, yticklabels=risk_labels)
plt.title('FI Risk Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.show()