# Project 1: Sentiment Analysis on Movie Reviews

This is an individual project. You are expected to solve a classification problem on movie reviews. Each review has one of two sentiments (positive or negative); please train machine learning or deep learning models to classify the reviews into the correct category (1 for positive and 0 for negative).

**NOTE:**
* Please solve the problems in this notebook using the dataset `IMDB Dataset.csv`.
* Important Dates: 
    * Project Start: Feb 19, Monday
    * Project Due: March 7, Thursday midnight
* The submission should include a PDF report (at least 4 pages) and code.
* There are always last-minute issues when submitting the project. DO NOT WAIT UNTIL THE LAST MINUTE!

**HINT:**
* Here are some related tutorials that would be helpful:
    * https://www.kaggle.com/competitions/sentiment-analysis-on-movie-reviews/code
    * https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html 



# Data Exploration: Exploring the Dataset



In [None]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary
import nltk
nltk.data.path.append("C:\\Users\\SUNYLoaner\\Desktop\\DataProject_1")  # Adjust the path accordingly
nltk.download('stopwords')
nltk.download('punkt') 
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re

# Load the dataset
# Assuming your dataset is in a CSV file, adjust the path accordingly
data = pd.read_csv('IMDB Dataset.csv')

# Display the first few rows of the dataset
print(data.head())







# Data Preprocessing

Here are some common preprocessing steps; feel free to add more if needed:
1. Check for missing values.
2. Remove noise and special characters, such as "\[[^]]*\]".
3. Transform all words to lower case.
4. Tokenize the text into words.
5. Remove stop words and apply stemming.
6. Divide the dataset into a train set (75%) and a test set (25%) with random sampling.

 ......
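As a rough illustration of steps 2 and 3, the sketch below chains a few regular expressions. The HTML-tag pattern and the helper name `basic_clean` are assumptions added for illustration; if the reviews contain markup such as `<br />`, a letters-only filter on its own would leave `br` behind as a token, so it may help to strip tags first.

```python
import re

# Patterns are illustrative assumptions, not a required cleaning recipe.
HTML_TAG = re.compile(r"<[^>]+>")       # markup such as <br />
BRACKETED = re.compile(r"\[[^]]*\]")    # bracketed notes, the pattern from step 2
NON_LETTER = re.compile(r"[^a-zA-Z]+")  # digits, punctuation, other symbols


def basic_clean(text: str) -> str:
    """Strip markup and bracketed notes, keep letters only, then lower-case."""
    text = HTML_TAG.sub(" ", text)
    text = BRACKETED.sub(" ", text)
    text = NON_LETTER.sub(" ", text)
    return text.lower().strip()


sample = "Great movie!<br /><br />[spoiler] The ending was 10/10."
print(basic_clean(sample))
```

The order matters in this sketch: tags and bracketed spans are removed while their delimiters are still present, and only then is everything that is not a letter collapsed into single spaces.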

**Hint:**
* You may need the `TfidfVectorizer` class to convert a collection of raw documents to a matrix of TF-IDF features: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
* You are also welcome to use the Python Natural Language Processing Toolkit (www.nltk.org).
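
To make the TF-IDF hint more concrete, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its documented defaults (raw term counts, smoothed IDF, L2 normalisation). The function name `tfidf` and the whitespace tokenisation are simplifications assumed for illustration; the real class uses its own tokeniser and returns a sparse matrix.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenised = [doc.split() for doc in docs]  # simplification: split on whitespace
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = ["good movie", "bad movie", "good good plot"]
vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Terms that appear in fewer documents receive a larger IDF, so within "bad movie" the rarer word "bad" ends up with more weight than "movie".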
 


In [2]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary
# Check for missing values
print("Missing values:\n", data.isnull().sum())

# Text cleaning function
def clean_text(text):
    # Remove noise and special characters
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Convert to lowercase
    text = text.lower()
    return text

# Apply text cleaning to the 'review' column
data['clean_review'] = data['review'].apply(clean_text)

# Tokenization and stemming
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def tokenize_and_stem(text):
    tokens = nltk.word_tokenize(text)
    tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
    return ' '.join(tokens)

# Apply tokenization and stemming to the 'clean_review' column
data['processed_review'] = data['clean_review'].apply(tokenize_and_stem)

# Split the dataset into train and test sets
train_data, test_data, train_labels, test_labels = train_test_split(
    data['processed_review'], data['sentiment'], test_size=0.25, random_state=42
)

# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data)
X_test = vectorizer.transform(test_data)

# Now, you can train your classification model (e.g., using Naive Bayes)
model = MultinomialNB()
model.fit(X_train, train_labels)

# Make predictions on the test set
predictions = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(test_labels, predictions)
report = classification_report(test_labels, predictions)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)






# Data Modelling

* Please use the following models to classify the data:
    * Logistic Regression
    * LinearSVC
    * KNeighborsClassifier
    * Fully-connected layers: please try different numbers of hidden layers and different values of "hidden_layer_sizes" and "activation".
    * CNN: please use different numbers of convolutional layers combined with different numbers of fully-connected layers, and compare the results.
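
One possible way to organise the fully-connected experiments is to enumerate the configurations up front and loop over them. The sketch below is only a suggestion: the layer sizes and the helper name `mlp_configs` are hypothetical placeholders, while `hidden_layer_sizes` and `activation` are the parameter names mentioned in the list above.

```python
from itertools import product

# Hypothetical search space; the specific sizes are placeholders to adapt.
HIDDEN_LAYER_SIZES = [(64,), (128,), (128, 64), (256, 128, 64)]
ACTIVATIONS = ["logistic", "tanh", "relu"]


def mlp_configs():
    """Enumerate keyword-argument dicts, one per architecture/activation pair."""
    return [
        {"hidden_layer_sizes": sizes, "activation": act}
        for sizes, act in product(HIDDEN_LAYER_SIZES, ACTIVATIONS)
    ]


for cfg in mlp_configs():
    print(f"{len(cfg['hidden_layer_sizes'])} hidden layer(s): {cfg}")
```

Each dictionary can then be unpacked into the classifier, for example `MLPClassifier(**cfg, max_iter=100)`, and the resulting accuracy stored alongside the configuration for later comparison.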


In [None]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary
# Install the deep learning libraries (the %pip magic runs inside the notebook kernel)
%pip install keras
%pip install tensorflow
# Import necessary libraries for modeling
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train, train_labels)
lr_predictions = lr_model.predict(X_test)
lr_accuracy = accuracy_score(test_labels, lr_predictions)
print("Logistic Regression Accuracy:", lr_accuracy)

# Results for each model
results_df = pd.DataFrame({
    'Model': ['Logistic Regression'],
    'Accuracy': [lr_accuracy]
})

# Bar plot for accuracy comparison
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='Accuracy', data=results_df, palette='viridis')
plt.title('Model Accuracy Comparison')
plt.ylim(0, 1)
plt.show()

# Confusion Matrix for Logistic Regression
lr_conf_matrix = confusion_matrix(test_labels, lr_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(lr_conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix - Logistic Regression')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()






In [None]:
# Import necessary libraries for modeling
from sklearn.svm import LinearSVC
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix

# Linear SVC
svc_model = LinearSVC()
svc_model.fit(X_train, train_labels)
svc_predictions = svc_model.predict(X_test)
svc_accuracy = accuracy_score(test_labels, svc_predictions)
print("Linear SVC Accuracy:", svc_accuracy)

# Add results for Linear SVC to the DataFrame
results_df = pd.concat([results_df, pd.DataFrame({
    'Model': ['Linear SVC'],
    'Accuracy': [svc_accuracy]
})], ignore_index=True)

# Bar plot for accuracy comparison
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='Accuracy', data=results_df, palette='viridis')
plt.title('Model Accuracy Comparison')
plt.ylim(0, 1)
plt.show()

# Confusion Matrix for Linear SVC
svc_conf_matrix = confusion_matrix(test_labels, svc_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(svc_conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix - Linear SVC')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()


In [None]:
# Import necessary libraries for modeling
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix


# KNeighbors Classifier
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, train_labels)
knn_predictions = knn_model.predict(X_test)
knn_accuracy = accuracy_score(test_labels, knn_predictions)
print("KNeighbors Classifier Accuracy:", knn_accuracy)

# Add results for KNeighbors Classifier to the DataFrame
results_df = pd.concat([results_df, pd.DataFrame({
    'Model': ['KNeighbors Classifier'],
    'Accuracy': [knn_accuracy]
})], ignore_index=True)

# Bar plot for accuracy comparison
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='Accuracy', data=results_df, palette='viridis')
plt.title('Model Accuracy Comparison')
plt.ylim(0, 1)
plt.show()

# Confusion Matrix for KNeighbors Classifier
knn_conf_matrix = confusion_matrix(test_labels, knn_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(knn_conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix - KNeighbors Classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()


In [None]:
# Import necessary libraries for modeling
from sklearn.neural_network import MLPClassifier
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix


# MLP Classifier
mlp_model = MLPClassifier(hidden_layer_sizes=(128,), activation='relu', max_iter=100)
mlp_model.fit(X_train, train_labels)
mlp_predictions = mlp_model.predict(X_test)
mlp_accuracy = accuracy_score(test_labels, mlp_predictions)
print("MLP Classifier Accuracy:", mlp_accuracy)

# Add results for MLP Classifier to the DataFrame
results_df = pd.concat([results_df, pd.DataFrame({
    'Model': ['MLP Classifier'],
    'Accuracy': [mlp_accuracy]
})], ignore_index=True)

# Bar plot for accuracy comparison
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='Accuracy', data=results_df, palette='viridis')
plt.title('Model Accuracy Comparison')
plt.ylim(0, 1)
plt.show()

# Confusion Matrix for MLP Classifier
mlp_conf_matrix = confusion_matrix(test_labels, mlp_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(mlp_conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix - MLP Classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()


In [None]:
# Import necessary libraries
import nltk
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from keras.models import Sequential
from keras.layers import Dense
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Set the NLTK data path
nltk.data.path.append("C:\\Users\\SUNYLoaner\\Desktop\\DataProject_1")  # Adjust the path accordingly
nltk.download('stopwords')

# Load the dataset
data = pd.read_csv('IMDB Dataset.csv')

# Display the first few rows of the dataset
print(data.head())

# Check for missing values
print("Missing values:\n", data.isnull().sum())

# Text cleaning function
def clean_text(text):
    text = re.sub(r"[^a-zA-Z]", " ", text)
    text = text.lower()
    return text

data['clean_review'] = data['review'].apply(clean_text)

# Tokenization and stemming
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def tokenize_and_stem(text):
    tokens = nltk.word_tokenize(text)
    tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
    return ' '.join(tokens)

data['processed_review'] = data['clean_review'].apply(tokenize_and_stem)

# Split the dataset into train and test sets
train_data, test_data, train_labels, test_labels = train_test_split(
    data['processed_review'], data['sentiment'], test_size=0.25, random_state=42
)

# TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=5000)  # Limit the number of features
X_train = vectorizer.fit_transform(train_data)
X_test = vectorizer.transform(test_data)
X_train_np = X_train.toarray()

# Convert 'train_labels' to float32
label_mapping = {'negative': 0, 'positive': 1}
train_labels = train_labels.map(label_mapping).astype('float32')
test_labels_numeric = test_labels.map(label_mapping).astype('float32')


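# Neural network model
# NOTE: despite the "CNN" naming, this network uses only Dense (fully-connected)
# layers and contains no convolutional layers.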
cnn_model = Sequential()
cnn_model.add(Dense(128, input_shape=(X_train.shape[1],), activation='relu'))
cnn_model.add(Dense(64, activation='relu'))
cnn_model.add(Dense(1, activation='sigmoid'))

cnn_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
cnn_model.fit(X_train_np, train_labels, epochs=5, batch_size=32, validation_split=0.2)

# Evaluate CNN model on the test set
cnn_predictions = (cnn_model.predict(X_test.toarray()) > 0.5).astype("int32")
cnn_accuracy = accuracy_score(test_labels_numeric, cnn_predictions)
print("CNN Model Accuracy:", cnn_accuracy)

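# Reuse results_df from the earlier cells when it exists, so the comparison
# plot includes every model; start a new table only in a fresh session
if 'results_df' not in globals():
    results_df = pd.DataFrame(columns=['Model', 'Accuracy'])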

# Add results for CNN to the DataFrame
results_df = pd.concat([results_df, pd.DataFrame({
    'Model': ['CNN'],
    'Accuracy': [cnn_accuracy]
})], ignore_index=True)

# Bar plot for accuracy comparison
plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='Accuracy', data=results_df, palette='viridis')
plt.title('Model Accuracy Comparison')
plt.ylim(0, 1)
plt.show()

# Convert string labels to integers for test_labels
label_mapping = {'negative': 0, 'positive': 1}
test_labels_int = test_labels.map(label_mapping)

# Flatten the predictions from shape (n, 1) to a 1-D array
cnn_predictions_int = cnn_predictions.flatten()

# Confusion Matrix for CNN
cnn_conf_matrix = confusion_matrix(test_labels_int, cnn_predictions_int)
plt.figure(figsize=(8, 6))
sns.heatmap(cnn_conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.title('Confusion Matrix - CNN')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()


# Results: summarize and visualize the results discovered from the analysis

Please use figures or tables to present the results.
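
As one way to present the results as a table, the sketch below formats a mapping of model names to accuracies as a Markdown table sorted from best to worst. The helper name `results_table` and the scores are hypothetical; the numbers are dummy values, not measured results.

```python
def results_table(scores):
    """Format {model name: accuracy} as a Markdown table, best model first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    lines = ["| Model | Accuracy |", "| --- | --- |"]
    lines += [f"| {name} | {acc:.4f} |" for name, acc in ranked]
    return "\n".join(lines)


# Dummy values for illustration only; replace with the measured accuracies.
example_scores = {"Model A": 0.5, "Model B": 0.75}
print(results_table(example_scores))
```

The printed text can be pasted into a Markdown cell or into the report, next to the accuracy bar plot and the confusion matrices.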


In [36]:
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary

# The results, summaries, and visualizations for each model appear in the model cells above.









# Done

All set! 

**What do you need to submit?**

* **Notebook File**: Save this Jupyter notebook. Please make sure all the plotted tables and figures are in the notebook. 

* **PDF Report**: Please prepare a report in PDF form that is at least 4 pages long. The report should include:

  * Data description and exploration.

  * Data preprocessing.

  * Data modelling.

  * What did you find in the data?

  * (please include figures or tables in the report, but no source code)
  
Please compress all the files into a single zip file.