# Natural Language Processing (NLP) Task:
## Sentiment Analysis using Traditional Machine Learning
-    This project interprets and classifies the sentiment expressed in customer reviews using Natural Language Processing (NLP). The approach relies on traditional machine learning techniques to analyze and predict that sentiment.

-   Objectives
    -   Understanding Sentiment Distribution: Our first step involves exploring the distribution of sentiments in the dataset. We aim to understand the proportion of positive, neutral, and negative sentiments expressed in customer reviews.
    -   Data Preprocessing: This includes cleaning text data, transforming it into a suitable format for analysis (using techniques like TF-IDF), and addressing class imbalances.
    -   Model Development and Evaluation: We train and evaluate multiple machine learning models, including Logistic Regression, Random Forest, and Naive Bayes. These models are chosen for their robustness and effectiveness in classification tasks.
    -   Performance Analysis: Each model's performance is assessed using metrics like accuracy, precision, recall, and F1-score. Additionally, confusion matrices are utilized for a more detailed evaluation.

-   Import necessary libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Configure pandas to display all rows and all columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [None]:
# Load the already preprocessed review dataset 
file_path = "C:/Users/Oby/Desktop/Data Science Portfolio/lemmatized_data.csv" 
data = pd.read_csv(file_path)
data.head(5)

### Data Preprocessing

-   Inspect the dataset's size, shape, and overall structure to gain a basic understanding of the data before preprocessing.

In [None]:
data.info()
print(data.shape)

In [None]:
#check for missing values in the dataset
data.isna().sum()

In [None]:
# Drop rows where 'review_title' or 'review_text' is missing
data.dropna(subset=['review_title', 'review_text'], inplace=True)

# Check for missing values again
print(data.isna().sum())


-   Visualization of Ratings Distribution
Plot the distribution of ratings in the review dataset. Knowing how ratings are distributed gives a first view of overall customer satisfaction and preferences.

In [None]:
# Use a pie chart to illustrate the proportion of each rating
# Get the value counts of the 'rating' column
rating_counts = data['rating'].value_counts()

# Create a pie chart
plt.figure(figsize=(5, 5))
plt.pie(rating_counts, labels=rating_counts.index, autopct='%1.1f%%', startangle=140)

# Add a title
plt.title('Distribution of Ratings')

# Show the plot
plt.show()

-   Categorize ratings into 'Good', 'Neutral', and 'Poor' sentiment categories based on the value of the 'rating' column

In [None]:
# Define the function to categorize ratings
def categorize_rating(rating):
    if rating >= 4:
        return 'Good'
    elif rating == 3:
        return 'Neutral'
    else:
        return 'Poor'

# Apply the function to the 'rating' column
data['sentiment'] = data['rating'].apply(categorize_rating)
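
As a quick sanity check, the mapping can be exercised on a few hand-picked ratings. The sketch below restates the function so that it runs on its own; the `sample` DataFrame and its ratings are made up for illustration and are not taken from the review dataset.

```python
import pandas as pd

def categorize_rating(rating):
    """Map a numeric star rating to a sentiment label."""
    if rating >= 4:
        return 'Good'
    elif rating == 3:
        return 'Neutral'
    else:
        return 'Poor'

# Hypothetical ratings, one per star level
sample = pd.DataFrame({'rating': [1, 2, 3, 4, 5]})
sample['sentiment'] = sample['rating'].apply(categorize_rating)
print(sample)
```

Note that a missing rating (NaN) fails both comparisons and falls through to 'Poor', so missing ratings should be handled before the function is applied.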


-   Graphically represent the distribution of sentiments within the dataset. Visualizing the 'Good', 'Neutral', and 'Poor' categories gives a clearer picture of the overall sentiment trends in the data.

In [None]:
#Utilize a pie chart to show the proportion of each sentiment category within the dataset. 
# Get the value counts of the 'sentiment' column
sentiment_counts = data['sentiment'].value_counts()

# Create a pie chart
plt.figure(figsize=(6,5))
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=140)

# Add a title
plt.title('Distribution of Sentiments')

# Show the plot
plt.show()
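
A pie chart shows proportions at a glance, but the exact figures can be easier to read as a table. The following is a minimal sketch of one way to tabulate counts and percentages side by side; the `sentiments` Series is a hypothetical stand-in for `data['sentiment']`.

```python
import pandas as pd

# Hypothetical labels standing in for data['sentiment']
sentiments = pd.Series(
    ['Good', 'Good', 'Good', 'Poor', 'Neutral', 'Good', 'Poor', 'Good']
)

# Absolute counts and percentage share per category
counts = sentiments.value_counts()
shares = sentiments.value_counts(normalize=True).mul(100).round(1)

# The two Series share the same index, so they align by category
summary = pd.DataFrame({'count': counts, 'percent': shares})
print(summary)
```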

-  Import the necessary libraries for the machine learning process

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import ADASYN
from sklearn.model_selection import train_test_split

-  Convert the categorical sentiment labels to numerical form so that the target is suitable for the machine learning models.

In [None]:
# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the 'sentiment' column
data['sentiment_encoded'] = label_encoder.fit_transform(data['sentiment'])

# Display the value counts of the encoded column
data["sentiment_encoded"].value_counts()

# LabelEncoder assigns codes in alphabetical order:
# Good -> 0, Neutral -> 1 (67 rows), Poor -> 2 (664 rows)
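
Rather than tracking the label-to-code mapping by hand, it can be read back from the fitted encoder. The sketch below uses a small made-up list of labels in place of `data['sentiment']`; `encoder`, `labels`, and `mapping` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for data['sentiment']
labels = ['Poor', 'Good', 'Neutral', 'Good', 'Poor']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each class is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts integer codes back to the original labels
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

The same `inverse_transform` call can be applied to model predictions later on to report results as 'Good', 'Neutral', and 'Poor' instead of 0, 1, and 2.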

-   Transform the text data from the 'review_text' column into a numerical format using the Term Frequency-Inverse Document Frequency (TF-IDF) method. 
-   This transformation is crucial for text analysis and machine learning tasks as it converts text into a format that algorithms can process.

In [None]:
# Initialize TfidfVectorizer:
tfidf_vectorizer = TfidfVectorizer()

# Apply the vectorizer to the review_text column.
X = tfidf_vectorizer.fit_transform(data['review_text'])

# Check the matrix shape
X.shape
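
To make the output of the vectorizer more concrete, here is a small sketch on three made-up reviews (not rows from the dataset). It shows that the result has one row per document and one column per vocabulary term, and that with the default settings each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews for illustration only
docs = [
    "great phone great battery",
    "poor battery",
    "great screen",
]

vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(docs)

# One row per document, one column per vocabulary term
print(X_toy.shape)

# Columns follow the alphabetical order of the learned vocabulary
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)

# With the default norm='l2', every row has Euclidean length 1
row_norms = np.linalg.norm(X_toy.toarray(), axis=1)
print(row_norms)
```

Terms that occur in fewer documents ("phone", "screen") receive a higher inverse document frequency than terms shared across documents ("great", "battery").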

In [None]:
# Assign the target variable
y = data["sentiment_encoded"]

### Handling Data Imbalance
-   Addressing the imbalance in our dataset's target classes is crucial for improving the performance of machine learning models. We employ ADASYN (Adaptive Synthetic Sampling Approach) to oversample the minority classes and achieve a more balanced class distribution.

In [None]:
# Initialize the ADASYN object
adasyn = ADASYN(random_state=42)

# Apply ADASYN to generate the oversampled dataset
X_resampled, y_resampled = adasyn.fit_resample(X, y)

# Check the new class distribution (optional)
from collections import Counter
print("Original class distribution:", Counter(y))
print("Resampled class distribution:", Counter(y_resampled))


-    Split the resampled dataset into training and testing sets. The training set is used to fit the models, and the held-out testing set is used to evaluate them on data they did not see during fitting. Because the split is applied after oversampling, the testing set also contains synthetic samples.

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Display the shape of the training and testing sets
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)
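
An alternative ordering, sketched here under stated assumptions, is to split first and oversample only the training portion, so that the testing set contains real samples only. The example uses random data in place of the TF-IDF matrix and simple random oversampling via `sklearn.utils.resample` as a stand-in for ADASYN; all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the feature matrix and encoded labels (90 vs 10 rows)
rng = np.random.default_rng(42)
X_demo = rng.random((100, 5))
y_demo = np.array([0] * 90 + [1] * 10)

# 1. Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2. Oversample the minority class in the training part only
#    (random duplication here, standing in for ADASYN)
n_majority = int((y_tr == 0).sum())
minority_up = resample(
    X_tr[y_tr == 1], replace=True, n_samples=n_majority, random_state=42
)

X_tr_bal = np.vstack([X_tr[y_tr == 0], minority_up])
y_tr_bal = np.array([0] * n_majority + [1] * n_majority)

print("Train before:", np.bincount(y_tr))
print("Train after: ", np.bincount(y_tr_bal))
print("Test (untouched):", np.bincount(y_te))
```

With the notebook's own objects, the equivalent change would be to call `train_test_split` on `X` and `y` and then `adasyn.fit_resample` on the training portion only.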


###  Create baseline models using Logistic Regression, Random Forest, and Naive Bayes
-    Focus on three widely-used machine learning algorithms: Logistic Regression, Random Forest, and Naive Bayes. Each of these models brings unique strengths and characteristics, making them suitable for a broad range of classification tasks, including our scenario of sentiment analysis.

Why These Models?

Logistic Regression:

-   A simple yet powerful linear model for binary and multiclass classification problems.
-   Efficient and interpretable, often serving as the first model to try in classification tasks.
-   Performs well when the classes are approximately linearly separable in the feature space.

Random Forest:

-   An ensemble learning method based on decision tree classifiers.
-   Offers high accuracy through bagging and feature randomness when splitting nodes.
-   Handles large, high-dimensional datasets well and provides estimates of feature importance.

Naive Bayes:

-   Based on Bayes' theorem with the assumption that features are conditionally independent given the class.
-   Particularly useful for text classification problems like spam filtering and sentiment analysis.
-   Fast and efficient with large datasets.

-   Import Necessary Libraries

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

-   Initialize three different machine learning algorithms: Logistic Regression, Random Forest, and Naive Bayes.

In [None]:
#Initialise Models
logreg = LogisticRegression()
rf = RandomForestClassifier()
nb = MultinomialNB()


-  Fit each model on the training data

-   Use a Python dictionary to pair each model name with its initialized instance. Iterating over this dictionary trains each model in a single loop.

In [None]:
# Define the models in a dictionary, reusing the instances initialized above
models = {
    "Logistic Regression": logreg,
    "Random Forest": rf,
    "Naive Bayes": nb
}

# Train each model in a loop
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name} has been trained.")


-   Evaluate and Visualize Model Performance after training the Logistic Regression, Random Forest, and Naive Bayes models. This step involves making predictions on the test set, calculating accuracy, generating classification reports, and displaying confusion matrices.

In [None]:
for name, model in models.items():
    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate predictions
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy:", accuracy)
    print(f"\n{name} Classification Report:\n", classification_report(y_test, y_pred))

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
    plt.title(f'Confusion Matrix for {name}')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show()
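
To connect the confusion matrix to the figures in the classification report, the per-class metrics can be derived from the matrix by hand. The sketch below uses a hypothetical 3-class matrix, not results from the models above, and follows the scikit-learn convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true labels, columns = predictions
cm = np.array([
    [50,  5,  5],
    [10, 30, 10],
    [ 5,  5, 40],
])

true_positives = np.diag(cm)

# Precision: correct predictions of a class / all predictions of that class
precision = true_positives / cm.sum(axis=0)
# Recall: correct predictions of a class / all true members of that class
recall = true_positives / cm.sum(axis=1)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# Accuracy: all correct predictions / all samples
accuracy = true_positives.sum() / cm.sum()

for i in range(len(cm)):
    print(f"class {i}: precision={precision[i]:.2f} "
          f"recall={recall[i]:.2f} f1={f1[i]:.2f}")
print(f"accuracy={accuracy:.2f}")
```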
