# **Introduction**

The aim of this project is to predict the sentiment of movie reviews from the given dataset. In this Kaggle notebook I have implemented that aim to the best of my knowledge of ML models and their functionality. I have built **11 Machine Learning models** and compared them with each other to gain a more comprehensive understanding and to predict the sentiment of the reviews in an effective and efficient manner.

In [None]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder,OneHotEncoder,MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction import text
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
import seaborn as sns

# **Data Loading**

In [None]:
train = pd.read_csv("/kaggle/input/sentiment-prediction-on-movie-reviews/train.csv")
test = pd.read_csv("/kaggle/input/sentiment-prediction-on-movie-reviews/test.csv")
movies = pd.read_csv("/kaggle/input/sentiment-prediction-on-movie-reviews/movies.csv")

In [None]:
train.shape

In [None]:
test.shape

In [None]:
movies.shape

In [None]:
train.info()

In [None]:
test.info()

In [None]:
movies.info()

In [None]:
train.head(3)

In [None]:
test.head(3)

In [None]:
movies.head(3)

In [None]:
train.isna().sum()

In [None]:
test.isna().sum()

In [None]:
movies.isna().sum()

# **Data Preprocessing**

* **Dataset merging and duplicate dropping**

In [None]:
df = movies.drop_duplicates(subset=['movieid'], keep='first')
merged_df = pd.merge(train, df, on='movieid', how='left')
merged_df1 = pd.merge(test, df, on='movieid', how='left')
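
The reason for dropping duplicate `movieid` rows before merging can be seen on a toy example. The sketch below uses made-up frames (`reviews_toy` and `movies_toy` are hypothetical stand-ins for `train` and `movies`) and shows that a left merge against a table with repeated keys multiplies review rows, while the de-duplicated table keeps one row per review.

```python
import pandas as pd

# Toy stand-ins for the notebook's `train` and `movies` frames (hypothetical values).
reviews_toy = pd.DataFrame({
    "movieid": ["m1", "m1", "m2"],
    "reviewText": ["great", "dull", "fine"],
})
movies_toy = pd.DataFrame({
    "movieid": ["m1", "m1", "m2"],  # "m1" appears twice
    "genre": ["Drama", "Drama", "Comedy"],
})

# Without de-duplication, each "m1" review matches two movie rows.
naive = pd.merge(reviews_toy, movies_toy, on="movieid", how="left")

# With de-duplication, the left merge keeps exactly one row per review.
unique_movies = movies_toy.drop_duplicates(subset=["movieid"], keep="first")
merged_toy = pd.merge(reviews_toy, unique_movies, on="movieid", how="left",
                      validate="many_to_one")

print(len(reviews_toy), len(naive), len(merged_toy))
```

Passing `validate="many_to_one"` makes pandas raise an error if the right-hand keys are not unique, which can serve as a cheap safety check.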

In [None]:
df.shape

In [None]:
merged_df.shape

In [None]:
merged_df1.shape

In [None]:
merged_df.isna().sum()

In [None]:
merged_df1.isna().sum()

* **Changing Column name**

In [None]:
new_column_name={'isTopCritic': 'isFrequentReviewer'}
merged_df1=merged_df1.rename(columns=new_column_name)

In [None]:
merged_df1.head(2)

* **Target column encoding**

In [None]:
merged_df["sentiment"] = merged_df["sentiment"].replace(np.nan, 'Positive')
y=merged_df["sentiment"]
encoder = OrdinalEncoder()
y_enc = encoder.fit_transform(y.to_numpy().reshape(-1, 1))
#Remove the sentiment label column from the merged data
merged_df = merged_df.drop('sentiment', axis=1)
y_enc
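
It can help to check which integer each label receives. A minimal sketch, assuming the labels are the strings `Positive` and `Negative` (the toy values below are illustrative, not read from the dataset): `OrdinalEncoder` sorts the categories, so the alphabetically first label becomes `0.0`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical labels; the real column may use different strings or casing.
toy_labels = np.array(["Positive", "Negative", "Positive"]).reshape(-1, 1)

toy_encoder = OrdinalEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# categories_ lists the labels in the order of their integer codes.
print(toy_encoder.categories_)
print(toy_encoded.ravel())
```

This matters later because `f1_score` treats label `1` as the positive class by default.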

In [None]:
y_enc.shape

In [None]:
merged_df.shape

In [None]:
merged_df.head(2)

* **Removing unwanted symbols**

In [None]:
import re
merged_df1['genre'] = merged_df1['genre'].str.replace(" & ", ",", regex=False) # plain-string match, no regex needed
merged_df['genre'] = merged_df['genre'].str.replace(" & ", ",", regex=False)
merged_df['ratingContents'] = merged_df['ratingContents'].apply(lambda x: re.sub(r"\[|\]", "", str(x))) # convert x to a string before replacing brackets
merged_df1['ratingContents'] = merged_df1['ratingContents'].apply(lambda x: re.sub(r"\[|\]", "", str(x))) # convert x to a string before replacing brackets
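
A quick look at what the bracket-stripping pattern does, using a made-up cell value (the string below is an assumption about how `ratingContents` is stored). The sketch also shows that a missing value turns into the literal string `'nan'` once it passes through `str()`, which is presumably why the imputation cell later maps `'nan'` back to a missing value.

```python
import re

# Hypothetical raw value of a ratingContents cell.
raw_rating_contents = "['Violence', 'Language']"
cleaned = re.sub(r"\[|\]", "", raw_rating_contents)
print(cleaned)  # the brackets are gone, the quotes remain

# A missing value becomes the text 'nan' after str().
missing = re.sub(r"\[|\]", "", str(float("nan")))
print(repr(missing))
```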

* **Splitting Categorical and numerical columns**

In [None]:
categorical = merged_df.select_dtypes(include=['object', 'bool'])
categorical_drop = categorical.drop(['reviewText','ratingContents','genre','director'],axis=1)
categorical_names = categorical_drop.columns.tolist()
remaining_categorical_names=['reviewText','ratingContents','genre','director']

numerical = merged_df.select_dtypes(include=['float64'])
numerical_names = numerical.columns.tolist()

# Print the column names
print("Categorical columns:", categorical_names)
print("Categorical remaining columns :", remaining_categorical_names)
print("Numerical columns:", numerical_names)

* **Imputing missing values using replace function**

In [None]:
merged_df['genre'] = merged_df['genre'].replace(np.nan, 'Drama')
merged_df1['genre'] = merged_df1['genre'].replace(np.nan, 'Drama')
merged_df['ratingContents'] = merged_df['ratingContents'].replace('nan',np.nan)
merged_df1['ratingContents'] = merged_df1['ratingContents'].replace('nan', np.nan)
merged_df['ratingContents'] = merged_df['ratingContents'].replace(np.nan,'Thematic Material')
merged_df1['ratingContents'] = merged_df1['ratingContents'].replace(np.nan, 'Thematic Material')
merged_df['director'] = merged_df['director'].replace(np.nan, 'Information not available')
merged_df1['director'] = merged_df1['director'].replace(np.nan, 'Information not available')


In [None]:
# Categorical and numerical transformer combined pipeline
category_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))
])

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())
])

transformer_pipeline = ColumnTransformer(transformers=[
    ('cat', category_transformer, categorical_names ),
    ('num', numerical_transformer,  numerical_names)
], remainder='passthrough')

# Main pipeline
pipeline = Pipeline([('preprocessor', transformer_pipeline)])

# Fit and transform the train data
transformed_df = pd.DataFrame(pipeline.fit_transform(merged_df), columns=categorical_names  + numerical_names + remaining_categorical_names)
transformed_df[numerical_names] = transformed_df[numerical_names].astype(float)
missing_in_train_values = transformed_df.isna().sum()
print("\nTrain data:\n")
print(missing_in_train_values)

# transform the test data
transformed_df1 = pd.DataFrame(pipeline.transform(merged_df1), columns=categorical_names  + numerical_names + remaining_categorical_names)
transformed_df1[numerical_names] = transformed_df1[numerical_names].astype(float)
missing_in_test_values = transformed_df1.isna().sum()
print("\nTest data:\n")
print( missing_in_test_values)

In [None]:
# transformed_df['reviewText']= transformed_df['reviewText'].fillna(" ")
# transformed_df1['reviewText']= transformed_df1['reviewText'].fillna(" ")

In [None]:
relevant_columns = ['title', 'director','genre','ratingContents']
# For train data
subset = transformed_df[transformed_df['reviewText'].isnull()]
for index, row in subset.iterrows():
    review_text = ' '.join(str(row[column]) for column in relevant_columns if pd.notnull(row[column]))
    transformed_df.at[index, 'reviewText'] = review_text.strip()

# For test data
subset1 = transformed_df1[transformed_df1['reviewText'].isnull()]
for index, row in subset1.iterrows():
    review_text1 = ' '.join(str(row[column]) for column in relevant_columns if pd.notnull(row[column]))
    transformed_df1.at[index, 'reviewText'] = review_text1.strip()

* **Using TF-IDF Vectorizer**

In [None]:
vectorizer = TfidfVectorizer(binary=False,ngram_range=(1,3),max_features=25)
vectorized = vectorizer.fit_transform(merged_df['genre'])
array = vectorized.toarray()
print("vectorized output array of genre:")
print(array)

vectorized1 = vectorizer.transform(merged_df1['genre'])  # reuse the vocabulary fitted on the train data
array1 = vectorized1.toarray()
print("vectorized output array1 of genre :")
print(array1)

In [None]:
rc_vectorizer = TfidfVectorizer(binary=False,ngram_range=(1,3),max_features=30)
rc_vectorized = rc_vectorizer.fit_transform(merged_df['ratingContents'])
rc_array = rc_vectorized.toarray()
print("vectorized output (train):\n")
print(rc_array)

rc_vectorized1 = rc_vectorizer.transform(merged_df1['ratingContents'])  # reuse the vocabulary fitted on the train data
rc_array1 = rc_vectorized1.toarray()
print("\nvectorized output (test):\n")
print(rc_array1)

In [None]:
stop_words = text.ENGLISH_STOP_WORDS  # English stop words
tfidf_vectorizer = TfidfVectorizer( stop_words='english',binary=False,ngram_range=(1,3),max_features=3000)

# TF-IDF on train dataset
tfidf_vectorized = tfidf_vectorizer.fit_transform(transformed_df['reviewText'])
tfidf_array = tfidf_vectorized.toarray()

# TF-IDF on test dataset
tfidf_vectorized1 = tfidf_vectorizer.transform(transformed_df1['reviewText'])
tfidf_array1 = tfidf_vectorized1.toarray()

print("TF-IDF vectorized (train) output:\n")
print(tfidf_array)
print("\nTF-IDF vectorized (test) output:\n")
print(tfidf_array1)

In [None]:
vectorizer_releaseDateTheaters = TfidfVectorizer()
vectorized_releaseDateTheaters = vectorizer_releaseDateTheaters.fit_transform(transformed_df['releaseDateTheaters'])
array_releaseDateTheaters = vectorized_releaseDateTheaters.toarray()
print("Vectorized releaseDateTheaters array:")
print(array_releaseDateTheaters)

vectorized1_releaseDateTheaters = vectorizer_releaseDateTheaters.transform(transformed_df1['releaseDateTheaters'])
array1_releaseDateTheaters = vectorized1_releaseDateTheaters.toarray()
print("Vectorized releaseDateTheaters array 1:")
print(array1_releaseDateTheaters)


In [None]:
vectorizer_releaseDateStreaming = TfidfVectorizer()
vectorized_releaseDateStreaming = vectorizer_releaseDateStreaming.fit_transform(transformed_df['releaseDateStreaming'])
array_releaseDateStreaming = vectorized_releaseDateStreaming.toarray()
print("Vectorized releaseDateStreaming array:")
print(array_releaseDateStreaming)

vectorized1_releaseDateStreaming = vectorizer_releaseDateStreaming.transform(transformed_df1['releaseDateStreaming'])
array1_releaseDateStreaming = vectorized1_releaseDateStreaming.toarray()
print("Vectorized releaseDateStreaming array 1:")
print(array1_releaseDateStreaming)


In [None]:
vectorizer_rating  = TfidfVectorizer()
vectorized_rating  = vectorizer_rating.fit_transform(transformed_df['rating'])
array_rating = vectorized_rating.toarray()
print("Vectorized rating  array:")
print(array_rating )

vectorized1_rating  = vectorizer_rating.transform(transformed_df1['rating'])
array1_rating  = vectorized1_rating.toarray()
print("Vectorized rating  array 1:")
print(array1_rating )


In [None]:
vectorizer_originalLanguage = TfidfVectorizer()
vectorized_originalLanguage = vectorizer_originalLanguage.fit_transform(transformed_df['originalLanguage'])
array_originalLanguage = vectorized_originalLanguage.toarray()
print("Vectorized originalLanguage array:")
print(array_originalLanguage)

vectorized1_originalLanguage = vectorizer_originalLanguage.transform(transformed_df1['originalLanguage'])
array1_originalLanguage = vectorized1_originalLanguage.toarray()
print("Vectorized originalLanguage array 1:")
print(array1_originalLanguage)


In [None]:
vectorizer_director = TfidfVectorizer()
vectorized_director = vectorizer_director.fit_transform(transformed_df['director'])
array_director = vectorized_director.toarray()
print("Vectorized director array:")
print(array_director)

vectorized1_director = vectorizer_director.transform(transformed_df1['director'])
array1_director = vectorized1_director.toarray()
print("Vectorized director array 1:")
print(array1_director)


In [None]:
vectorizer_boxOffice = TfidfVectorizer()
vectorized_boxOffice = vectorizer_boxOffice.fit_transform(transformed_df['boxOffice'])
array_boxOffice = vectorized_boxOffice.toarray()
print("Vectorized boxOffice array:")
print(array_boxOffice)

vectorized1_boxOffice = vectorizer_boxOffice.transform(transformed_df1['boxOffice'])
array1_boxOffice = vectorized1_boxOffice.toarray()
print("Vectorized boxOffice array 1:")
print(array1_boxOffice)


In [None]:
vectorizer_distributor = TfidfVectorizer()
vectorized_distributor = vectorizer_distributor.fit_transform(transformed_df['distributor'])
array_distributor = vectorized_distributor.toarray()
print("Vectorized distributor array:")
print(array_distributor)

vectorized1_distributor = vectorizer_distributor.transform(transformed_df1['distributor'])
array1_distributor = vectorized1_distributor.toarray()
print("Vectorized distributor array 1:")
print(array1_distributor)


In [None]:
vectorizer_soundType = TfidfVectorizer()
vectorized_soundType = vectorizer_soundType.fit_transform(transformed_df['soundType'])
array_soundType = vectorized_soundType.toarray()
print("Vectorized soundType array:")
print(array_soundType)

vectorized1_soundType = vectorizer_soundType.transform(transformed_df1['soundType'])
array1_soundType = vectorized1_soundType.toarray()
print("Vectorized soundType array 1:")
print(array1_soundType)

In [None]:
# vectorizer_isFrequentReviewer = TfidfVectorizer()
# vectorized_isFrequentReviewer = vectorizer_isFrequentReviewer.fit_transform(transformed_df['isFrequentReviewer'])
# array_isFrequentReviewer = vectorized_isFrequentReviewer.toarray()
# print("Vectorized isFrequentReviewer array:")
# print(array_isFrequentReviewer)

# vectorized1_isFrequentReviewer = vectorizer_isFrequentReviewer.transform(transformed_df1['isFrequentReviewer'])
# array1_isFrequentReviewer = vectorized1_isFrequentReviewer.toarray()
# print("Vectorized isFrequentReviewer array 1:")
# print(array1_isFrequentReviewer)

In [None]:
vectorizer_movieid = TfidfVectorizer()
vectorized_movieid = vectorizer_movieid.fit_transform(transformed_df['movieid'])
array_movieid = vectorized_movieid.toarray()
print("Vectorized movieid array:")
print(array_movieid)

vectorized1_movieid = vectorizer_movieid.transform(transformed_df1['movieid'])
array1_movieid = vectorized1_movieid.toarray()
print("Vectorized movieid array 1:")
print(array1_movieid)

In [None]:
vectorizer_title = TfidfVectorizer()
vectorized_title = vectorizer_title.fit_transform(transformed_df['title'])
array_title = vectorized_title.toarray()
print("Vectorized title array:")
print(array_title)

vectorized1_title = vectorizer_title.transform(transformed_df1['title'])
array1_title = vectorized1_title.toarray()
print("Vectorized title array 1:")
print(array1_title)


In [None]:
vectorizer_reviewerName = TfidfVectorizer()
vectorized_reviewerName = vectorizer_reviewerName.fit_transform(transformed_df['reviewerName'])
array_reviewerName = vectorized_reviewerName.toarray()
print("Vectorized reviewerName array:")
print(array_reviewerName)

vectorized1_reviewerName = vectorizer_reviewerName.transform(transformed_df1['reviewerName'])
array1_reviewerName = vectorized1_reviewerName.toarray()
print("Vectorized reviewerName array 1:")
print(array1_reviewerName)


In [None]:
sparse_matrix = sp.csr_matrix(tfidf_vectorized)
sparse_matrix1 = sp.csr_matrix(tfidf_vectorized1)

In [None]:
sparse = sp.csr_matrix(vectorized)
sparse1 = sp.csr_matrix(vectorized1)
rc_sparse = sp.csr_matrix(rc_vectorized)
rc_sparse1 = sp.csr_matrix(rc_vectorized1)
rdt_sparse = sp.csr_matrix(vectorized_releaseDateTheaters)
rdt_sparse1 = sp.csr_matrix(vectorized1_releaseDateTheaters)
rds_sparse = sp.csr_matrix(vectorized_releaseDateStreaming)
rds_sparse1 = sp.csr_matrix(vectorized1_releaseDateStreaming)
ol_sparse = sp.csr_matrix(vectorized_originalLanguage)
ol_sparse1 = sp.csr_matrix(vectorized1_originalLanguage)
d_sparse = sp.csr_matrix(vectorized_director)
d_sparse1 = sp.csr_matrix(vectorized1_director)
st_sparse = sp.csr_matrix(vectorized_soundType)
st_sparse1 = sp.csr_matrix(vectorized1_soundType)
mi_sparse = sp.csr_matrix(vectorized_movieid)
mi_sparse1 = sp.csr_matrix(vectorized1_movieid)
t_sparse = sp.csr_matrix(vectorized_title)
t_sparse1 = sp.csr_matrix(vectorized1_title)
rn_sparse = sp.csr_matrix(vectorized_reviewerName)
rn_sparse1 = sp.csr_matrix(vectorized1_reviewerName)
r_sparse = sp.csr_matrix(vectorized_rating)
r_sparse1 = sp.csr_matrix(vectorized1_rating)

# **Visualization after imputing missing values**

**Data plotting**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.sparse as sp

# List of sparse matrices and their corresponding names
sparse_matrices = [
    ("reviewText", sp.csr_matrix(tfidf_vectorized)),
    ("genre", sp.csr_matrix(vectorized)),
    ("ratingContents", sp.csr_matrix(rc_vectorized)),
    ("releaseDateTheaters", sp.csr_matrix(vectorized_releaseDateTheaters)),
    ("releaseDateStreaming", sp.csr_matrix(vectorized_releaseDateStreaming)),
    ("originalLanguage", sp.csr_matrix(vectorized_originalLanguage)),
    ("director", sp.csr_matrix(vectorized_director)),
    ("soundType", sp.csr_matrix(vectorized_soundType)),
    ("movieid", sp.csr_matrix(vectorized_movieid)),
    ("title", sp.csr_matrix(vectorized_title)),
    ("reviewerName", sp.csr_matrix(vectorized_reviewerName)),
    ("rating", sp.csr_matrix(vectorized_rating))
]

# Filter out sparse matrices with "vectorized1" in their names
filtered_sparse_matrices = [(name, matrix) for name, matrix in sparse_matrices if "vectorized1" not in name]

# Create a correlation matrix
num_matrices = len(filtered_sparse_matrices)
correlation_matrix = np.zeros((num_matrices, num_matrices))

for i in range(num_matrices):
    for j in range(num_matrices):
        correlation_matrix[i, j] = (
            filtered_sparse_matrices[i][1].T.dot(filtered_sparse_matrices[j][1]).sum() /
            np.sqrt(filtered_sparse_matrices[i][1].T.dot(filtered_sparse_matrices[i][1]).sum() *
                    filtered_sparse_matrices[j][1].T.dot(filtered_sparse_matrices[j][1]).sum())
        )

# Extract the names of the remaining sparse matrices
names = [name for name, _ in filtered_sparse_matrices]

# Create a heatmap
plt.figure(figsize=(8,6))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            xticklabels=names,
            yticklabels=names)
plt.title("Correlation Heatmap of Dataset")
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


Rating is the least correlated column, as we can see in the heatmap.
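
A note on what the heatmap cell computes, offered as a reading aid: the sum of all entries of `A.T @ B` equals the dot product of the two matrices' per-row sums, so each heatmap cell is the cosine similarity between the row-sum vectors of two TF-IDF matrices rather than a Pearson correlation. The sketch below checks this identity on two small made-up matrices (`A`, `B` and both helper functions are hypothetical names).

```python
import numpy as np
import scipy.sparse as sp

A = sp.csr_matrix(np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]))
B = sp.csr_matrix(np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 3.0]]))

def heatmap_score(X, Y):
    """Score from the heatmap cell: normalised sum of all entries of X^T Y."""
    return X.T.dot(Y).sum() / np.sqrt(X.T.dot(X).sum() * Y.T.dot(Y).sum())

def row_sum_cosine(X, Y):
    """Cosine similarity between the per-row sums of X and Y."""
    x = np.asarray(X.sum(axis=1)).ravel()
    y = np.asarray(Y.sum(axis=1)).ravel()
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(heatmap_score(A, B), row_sum_cosine(A, B))
```

The two values agree, which also suggests a cheaper way to fill the heatmap: compute each matrix's row sums once and take cosine similarities between them.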

In [None]:
def plot_most_frequent_words(text, title):
    vectorizer = CountVectorizer(stop_words='english', max_features=50)
    count_X = vectorizer.fit_transform([text])
    feature_names = vectorizer.get_feature_names_out()
    word_frequencies = count_X.toarray()[0]
    
    plt.figure(figsize=(16, 10))
    plt.barh(range(len(feature_names)), word_frequencies, align='center')
    plt.yticks(range(len(feature_names)), feature_names)
    plt.xlabel('Frequency')
    plt.ylabel('Words')
    plt.title(title)
    plt.gca().invert_yaxis()
    plt.show()


# All reviews in the train data
reviews = [str(transformed_df['reviewText'][i]) for i in range(len(transformed_df['reviewText']))]
text = ' '.join(reviews)
plot_most_frequent_words(text, 'Top 50 Most Frequent Words in Reviews')


The word 'film' is the most frequent word in the review text.

In [None]:
# Plotting the 50 most frequent reviewers

# reviewerName data
reviewerName = transformed_df['reviewerName']

# Count the occurrences of each reviewerName
reviewerName_counts = reviewerName.value_counts()

# Select the top 50 reviewerName
top_50 = reviewerName_counts.head(50)

# Plotting
plt.figure(figsize=(14, 10))
top_50.plot(kind='bar', color='#90EE90')
plt.title('Top 50 Reviewer by Number of Movies')
plt.xlabel('Reviewer Name')
plt.ylabel('Number of Movies')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Sherri Morrison is the most frequent reviewer, as we can see from the bar graph.

In [None]:
#Plotting frequency of movies released by top 50 director 

# director data
director = transformed_df['director']

# Count the occurrences of each director 
director_counts = director.value_counts()

# Select the top 50 director 
top_50 = director_counts.head(50)

# Plotting
plt.figure(figsize=(14, 10))
top_50.plot(kind='bar', color='#90EE90')
plt.title('Top 50 Director by Number of Movies')
plt.xlabel('Director Name')
plt.ylabel('Number of Movies')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

Joseph Brooks directed the largest number of movies.

In [None]:
#Plotting frequency of movies released by top 20 soundType

# soundType data
soundType = transformed_df['soundType']

# Count the occurrences of each soundType
soundType_counts = soundType.value_counts()

# Select the top 20 soundType
top_20 = soundType_counts.head(20)

# Plotting
plt.figure(figsize=(14, 10))
top_20.plot(kind='bar', color='#90EE90')
plt.title('Top 20 Sound Type by Number of Movies')
plt.xlabel('sound Type')
plt.ylabel('Number of Movies')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

From this graph we can see that Dolby is the most used sound type.

In [None]:
# Applying SVD
svd = TruncatedSVD(n_components=10)

# Fit on the train data, then transform the test data with the same fit
svd_result = svd.fit_transform(sparse_matrix) #reviewtext
svd_result1 = svd.transform(sparse_matrix1) 
svd_genre_result = svd.fit_transform(sparse) #genre
svd_genre_result1 = svd.transform(sparse1) 
svd_rc_result = svd.fit_transform(rc_sparse)   #ratingContents
svd_rc_result1 = svd.transform(rc_sparse1) 
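
A minimal sketch of the fit-on-train, transform-on-test pattern for `TruncatedSVD`, using random sparse matrices as stand-ins for the TF-IDF output (all names and sizes below are made up for illustration).

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Random sparse stand-ins for TF-IDF matrices: 20 features each.
toy_train = sp.random(50, 20, density=0.2, format="csr", random_state=0)
toy_test = sp.random(10, 20, density=0.2, format="csr", random_state=1)

toy_svd = TruncatedSVD(n_components=5, random_state=0)
reduced_train = toy_svd.fit_transform(toy_train)  # learns the components
reduced_test = toy_svd.transform(toy_test)        # projects onto the same components

print(reduced_train.shape, reduced_test.shape)
print(toy_svd.explained_variance_ratio_.sum())
```

Because the notebook reuses a single `svd` object, each `fit_transform` overwrites the previous fit, so the test data has to be transformed before the next `fit_transform`, as the cell above does. The summed `explained_variance_ratio_` gives a rough idea of how much information the chosen number of components retains.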

In [None]:
# Converting to DataFrame

# Train data
s_df = pd.DataFrame(svd_result)
sg_df = pd.DataFrame(svd_genre_result)
rc_df = pd.DataFrame(svd_rc_result)

#Test data
s_df1 = pd.DataFrame(svd_result1)
sg_df1 = pd.DataFrame(svd_genre_result1)
rc_df1 = pd.DataFrame(svd_rc_result1)

In [None]:
columns_to_drop = ['reviewText','releaseDateTheaters','releaseDateStreaming','rating']

# Drop the specified columns from the DataFrame of train dataset
p_df = transformed_df.drop(columns=columns_to_drop)
# Drop the specified columns from the DataFrame of test dataset
p_df1 = transformed_df1.drop(columns=columns_to_drop)

In [None]:
p_df.head(2)

In [None]:
p_df1.head(2)

In [None]:
combined_df = pd.concat([s_df, p_df, rc_df], axis=1)
combined_df1 = pd.concat([s_df1, p_df1, rc_df1], axis=1)

In [None]:
combined_df.head(2)

In [None]:
combined_df.columns = combined_df.columns.astype(str)
combined_df1.columns = combined_df1.columns.astype(str)

In [None]:
cc = p_df.select_dtypes(include=['object', 'bool']).columns.tolist()
cc1 = p_df1.select_dtypes(include=['object', 'bool']).columns.tolist()

In [None]:
cc

In [None]:
cc1

# **Encoding categorical columns**

In [None]:
ordinal = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
ohe= OneHotEncoder()

# Use the ColumnTransformer to ordinal-encode the categorical columns
preprocessort = ColumnTransformer(transformers=[
    ('cat', ordinal , cc)
   ],remainder='passthrough')

# Fit the ColumnTransformer on the train data and transform it
X_t = preprocessort.fit_transform(combined_df)

# Reuse the fitted ColumnTransformer on the test data so that both sets share
# one category-to-integer mapping; categories unseen in training become -1
X_t1 = preprocessort.transform(combined_df1)
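
A minimal sketch, on made-up categories, of fitting an `OrdinalEncoder` on training data and reusing it on test data so that a category receives the same integer in both. `handle_unknown='use_encoded_value'` is one way to cope with categories that appear only in the test set; the value `-1` is an arbitrary choice for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train_cats = np.array([["Drama"], ["Comedy"], ["Drama"]])
test_cats = np.array([["Comedy"], ["Horror"]])  # "Horror" is unseen in training

toy_ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
toy_ordinal.fit(train_cats)

encoded_train = toy_ordinal.transform(train_cats)
encoded_test = toy_ordinal.transform(test_cats)

print(encoded_train.ravel())
print(encoded_test.ravel())
```

An encoder fitted separately on each set would assign integers by each set's own sorted categories, so the same director or title could end up with different codes in train and test.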

In [None]:
X_t.shape

In [None]:
X_t1.shape

# **Train Test Split**

In [None]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
oversampler = RandomOverSampler(random_state=22, sampling_strategy=1.0) # create an oversampler object with a 1:1 class ratio

# Oversample the minority class in the encoded training data
X1, y1 = oversampler.fit_resample(X_t, y_enc)

X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.25, random_state=44, shuffle=True)
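
One point worth keeping in mind when reading the scores below: random oversampling duplicates minority rows, so when it runs before the split, copies of the same row can land in both `X_train` and `X_test`. A hedged sketch of the alternative order (split first, then oversample only the training part), written with NumPy instead of `imblearn` and using made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(22)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 150 + [0] * 50)  # imbalanced toy labels

# Split first, keeping the class ratio in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=44, stratify=y_toy)

# Oversample the minority class in the training part only.
minority_idx = np.flatnonzero(y_tr == 0)
majority_idx = np.flatnonzero(y_tr == 1)
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx),
                   replace=True)
balanced_idx = np.concatenate([majority_idx, minority_idx, extra])

X_tr_bal, y_tr_bal = X_tr[balanced_idx], y_tr[balanced_idx]
print(np.bincount(y_tr_bal), np.bincount(y_val))
```

The validation part keeps the original class ratio and contains no duplicated training rows, so scores measured on it are less likely to be optimistic.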

# **ML Models**

**Model 1: LogisticRegression**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,f1_score, confusion_matrix,roc_curve, roc_auc_score, precision_recall_curve, auc
clf = LogisticRegression(random_state=22, penalty='l2', C=0.75 , solver= "newton-cg",max_iter=1000)
clf.fit(X_train, y_train)
# Make predictions on the test dataset
y_pred = clf.predict(X_test)

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {accuracy}")

# Calculate the F1 score
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1}")


y_pred_prob = clf.predict_proba(X_test)[:, 1]

# Calculate ROC curve values
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Calculate the AUC score
roc_auc = roc_auc_score(y_test, y_pred_prob)

# Plot ROC curve
plt.figure(figsize=(7, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc='lower right')
plt.show()

# Calculate precision-recall curve values
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

# Calculate the area under the precision-recall curve (AUC-PR)
pr_auc = auc(recall, precision)

# Plot precision-recall curve
plt.figure(figsize=(7, 6))
plt.plot(recall, precision, color='blue', lw=2, label='PR curve (AUC-PR = %0.2f)' % pr_auc)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.show()

# Calculate and plot the confusion matrix
confusion = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(7, 6))
sns.heatmap(confusion, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.title("Confusion Matrix")
plt.show()

* Accuracy Score: 0.6569390044316948
* F1 Score: 0.6694074284094131
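
For readers less familiar with the two metrics reported here, a small worked example on made-up predictions shows how accuracy and F1 follow from the confusion matrix (the toy labels below are illustrative, not the notebook's results).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_toy = tp / (tp + fp)
recall_toy = tp / (tp + fn)
f1_by_hand = 2 * precision_toy * recall_toy / (precision_toy + recall_toy)

print(accuracy_score(y_true_toy, y_pred_toy))  # (tp + tn) / total
print(f1_by_hand, f1_score(y_true_toy, y_pred_toy))
```

Accuracy counts all correct predictions, while F1 is the harmonic mean of precision and recall for the class labelled `1`.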

**Model 2: StackingClassifier (XGBoost)**

In [None]:
# import xgboost as xgb
# from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier
# from sklearn.ensemble import StackingClassifier
# from sklearn.metrics import accuracy_score, f1_score

# # Assuming you have X_train, X_test, y_train, y_test

# # Create the base classifiers
# xgb_classifier = xgb.XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=30, random_state=10)
# ada_classifier = AdaBoostClassifier(n_estimators=100, random_state=10)
# extra_trees_classifier = ExtraTreesClassifier(n_estimators=100, random_state=10)

# # Create the StackingClassifier with XGBoost as the final_estimator
# clf = StackingClassifier(
#     estimators=[('xgb', xgb_classifier), ('ada', ada_classifier), ('et', extra_trees_classifier)],
#     final_estimator=xgb_classifier
# )

# # Train the stacking classifier on the training data
# clf.fit(X_train, y_train)
# # Make predictions on the test dataset
# y_pred = clf.predict(X_test)

# # Calculate the accuracy score
# accuracy = accuracy_score(y_test, y_pred)
# print(f"Accuracy Score: {accuracy}")

# # Calculate the F1 score
# f1 = f1_score(y_test, y_pred)
# print(f"F1 Score: {f1}")

# y_pred_prob = clf.predict_proba(X_test)[:, 1]

# # Calculate ROC curve values
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# # Calculate the AUC score
# roc_auc = roc_auc_score(y_test, y_pred_prob)

# # Plot ROC curve
# plt.figure(figsize=(7, 6))
# plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
# plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
# plt.xlim([0.0, 1.0])
# plt.ylim([0.0, 1.05])
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('Receiver Operating Characteristic (ROC)')
# plt.legend(loc='lower right')
# plt.show()

# # Calculate precision-recall curve values
# precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

# # Calculate the area under the precision-recall curve (AUC-PR)
# pr_auc = auc(recall, precision)

# # Plot precision-recall curve
# plt.figure(figsize=(7, 6))
# plt.plot(recall, precision, color='blue', lw=2, label='PR curve (AUC-PR = %0.2f)' % pr_auc)
# plt.xlabel('Recall')
# plt.ylabel('Precision')
# plt.title('Precision-Recall Curve')
# plt.legend(loc='lower left')
# plt.xlim([0.0, 1.0])
# plt.ylim([0.0, 1.05])
# plt.show()

# # Calculate and plot the confusion matrix
# confusion = confusion_matrix(y_test, y_pred)
# plt.figure(figsize=(7, 6))
# sns.heatmap(confusion, annot=True, fmt="d", cmap="Blues")
# plt.xlabel("Predicted Labels")
# plt.ylabel("True Labels")
# plt.title("Confusion Matrix")
# plt.show()

* Accuracy Score: 0.8419116971000901
* F1 Score: 0.8467202738602528

**Model 3: StackingClassifier (Ensemble learning technique)**

In [None]:
# import pickle
# from sklearn.linear_model import LogisticRegression
# from sklearn.svm import LinearSVC
# from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier, BaggingClassifier, StackingClassifier, RandomForestClassifier
# import xgboost as xgb
# from sklearn.metrics import accuracy_score, f1_score


# # Define the base models
# log_reg = LogisticRegression(penalty='l2', C=0.75, solver="newton-cg")
# xgb_classifier = xgb.XGBClassifier(learning_rate=0.1, n_estimators=100, max_depth=30, random_state=10)
# ada_classifier = AdaBoostClassifier(n_estimators=100, random_state=10)
# extra_trees_classifier = ExtraTreesClassifier(n_estimators=100, random_state=10)
# bagging_svm = BaggingClassifier(estimator=LinearSVC(C=0.001, loss='hinge', dual=True),  n_estimators=10, random_state=0)
# random_forest_classifier = RandomForestClassifier(n_estimators=100, random_state=10)

# # Define the stacking classifier

# clf = StackingClassifier(
#     estimators=[('lr', log_reg), ('xgb', xgb_classifier), ('ada', ada_classifier), ('et', extra_trees_classifier), ('bsvm', bagging_svm), ('rf', random_forest_classifier)],
#     final_estimator=xgb_classifier
# )

# # Fit the stacking classifier on the train set
# clf.fit(X_train, y_train)
# # Save the stacked model using pickle
# with open('stacked_model.pkl', 'wb') as model_file:
#     pickle.dump(clf, model_file)

# # Load the saved model using pickle
# with open('stacked_model.pkl', 'rb') as model_file:
#     loaded_model = pickle.load(model_file)

# # Predict on the test set using the loaded model
# y_pred = loaded_model.predict(X_test)

# # Evaluate the accuracy and f1 scores
# accuracy = accuracy_score(y_test, y_pred)
# print(f"Accuracy Score: {accuracy}")
# f1 = f1_score(y_test, y_pred)
# print(f"F1 Score: {f1}")

* Accuracy Score: 0.8015483348963792
* F1 Score: 0.8056895930860641

**Model 4: LightGBM**

In [None]:
# import lightgbm as lgb
# from sklearn.metrics import accuracy_score, f1_score

# # Assuming you have X_train, X_test, y_train, y_test

# # Create the LightGBM model
# clf = lgb.LGBMClassifier(learning_rate= 0.2, max_depth= 7, min_child_samples= 10, n_estimators= 150, num_leaves=63)

# # Train the model on the training data
# clf.fit(X_train, y_train)

# # Make predictions on the test dataset
# y_pred = clf.predict(X_test)

# # Calculate the accuracy score
# accuracy = accuracy_score(y_test, y_pred)
# print(f"Accuracy Score: {accuracy}")

# # Calculate the F1 score
# f1 = f1_score(y_test, y_pred)
# print(f"F1 Score: {f1}")
# y_pred_prob = clf.predict_proba(X_test)[:, 1]

# # Calculate ROC curve values
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# # Calculate the AUC score
# roc_auc = roc_auc_score(y_test, y_pred_prob)

# # Plot ROC curve
# plt.figure(figsize=(7, 6))
# plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
# plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
# plt.xlim([0.0, 1.0])
# plt.ylim([0.0, 1.05])
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('Receiver Operating Characteristic (ROC)')
# plt.legend(loc='lower right')
# plt.show()

# # Calculate precision-recall curve values
# precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

# # Calculate the area under the precision-recall curve (AUC-PR)
# pr_auc = auc(recall, precision)

# # Plot precision-recall curve
# plt.figure(figsize=(7, 6))
# plt.plot(recall, precision, color='blue', lw=2, label='PR curve (AUC-PR = %0.2f)' % pr_auc)
# plt.xlabel('Recall')
# plt.ylabel('Precision')
# plt.title('Precision-Recall Curve')
# plt.legend(loc='lower left')
# plt.xlim([0.0, 1.0])
# plt.ylim([0.0, 1.05])
# plt.show()

# # Calculate and plot the confusion matrix
# confusion = confusion_matrix(y_test, y_pred)
# plt.figure(figsize=(7, 6))
# sns.heatmap(confusion, annot=True, fmt="d", cmap="Blues")
# plt.xlabel("Predicted Labels")
# plt.ylabel("True Labels")
# plt.title("Confusion Matrix")
# plt.show()

* Accuracy Score: 0.7376289512881338
* F1 Score: 0.7344302572311359

**Model 5: LinearSVC**

In [None]:
# from sklearn.svm import LinearSVC

# clf = LinearSVC(C=0.001, loss='squared_hinge', dual=True)
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)


# # Evaluate the accuracy
# accuracy = accuracy_score(y_test, y_pred)
# print("Accuracy:", accuracy)

**Model 6: BaggingClassifier**

In [None]:
# from sklearn.svm import LinearSVC
# from sklearn.ensemble import BaggingClassifier
# from sklearn.metrics import accuracy_score

# # Bag 10 LinearSVC estimators; fit once (the original fitted the model twice)
# clf = BaggingClassifier(estimator=LinearSVC(C=1.2, loss='squared_hinge', dual=True), n_estimators=10, random_state=0)
# clf.fit(X_train, y_train)

# # Make predictions on the test dataset
# y_pred = clf.predict(X_test)

# # Calculate the accuracy score
# accuracy = accuracy_score(y_test, y_pred)
# print(f"Accuracy Score: {accuracy}")

**Model 7: KNeighborsClassifier**

In [None]:
# import pandas as pd
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.metrics import accuracy_score


# # Create a KNN classifier
# clf = KNeighborsClassifier(n_neighbors=5)

# # Fit the KNN classifier to the training data
# clf.fit(X_train, y_train)

# # Make predictions on the test data
# y_pred = clf.predict(X_test)

# # Evaluate the accuracy of the KNN classifier
# accuracy = accuracy_score(y_test, y_pred)
# print("Accuracy:", accuracy)


**Model 8: GradientBoostingClassifier**

In [None]:
# from sklearn.ensemble import GradientBoostingClassifier
# from sklearn.metrics import accuracy_score

# # Create and fit the GradientBoostingClassifier
# clf = GradientBoostingClassifier(random_state=0, n_estimators=50,learning_rate= 0.01,max_depth= 3, min_samples_split= 2)
# clf.fit(X_train, y_train)

# # Make predictions on the test dataset
# y_pred = clf.predict(X_test)

# # Calculate the accuracy score
# accuracy = accuracy_score(y_test, y_pred)
# print(f"Accuracy Score: {accuracy}")


**Model 9: MLPClassifier**

In [None]:
# from sklearn.neural_network import MLPClassifier
# from sklearn.metrics import accuracy_score,f1_score

# # Create the Multi-Layer Perceptron (MLP) classifier
# clf = MLPClassifier(hidden_layer_sizes=(100,), activation='relu', solver='adam',random_state=0, verbose=True)

# # Train the MLP classifier
# clf.fit(X_train, y_train)

# # Make predictions on the test dataset
# y_pred = clf.predict(X_test)

# # Calculate the accuracy score
# accuracy = accuracy_score(y_test, y_pred)
# print(f"Accuracy Score: {accuracy}")

# # Calculate the F1 score
# f1 = f1_score(y_test, y_pred)
# print(f"F1 Score: {f1}")

# y_pred_prob = clf.predict_proba(X_test)[:, 1]

# # Calculate ROC curve values
# fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# # Calculate the AUC score
# roc_auc = roc_auc_score(y_test, y_pred_prob)

# # Plot ROC curve
# plt.figure(figsize=(7, 6))
# plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
# plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
# plt.xlim([0.0, 1.0])
# plt.ylim([0.0, 1.05])
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.title('Receiver Operating Characteristic (ROC)')
# plt.legend(loc='lower right')
# plt.show()

# # Calculate precision-recall curve values
# precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

# # Calculate the area under the precision-recall curve (AUC-PR)
# pr_auc = auc(recall, precision)

# # Plot precision-recall curve
# plt.figure(figsize=(7, 6))
# plt.plot(recall, precision, color='blue', lw=2, label='PR curve (AUC-PR = %0.2f)' % pr_auc)
# plt.xlabel('Recall')
# plt.ylabel('Precision')
# plt.title('Precision-Recall Curve')
# plt.legend(loc='lower left')
# plt.xlim([0.0, 1.0])
# plt.ylim([0.0, 1.05])
# plt.show()

# # Calculate and plot the confusion matrix
# confusion = confusion_matrix(y_test, y_pred)
# plt.figure(figsize=(7, 6))
# sns.heatmap(confusion, annot=True, fmt="d", cmap="Blues")
# plt.xlabel("Predicted Labels")
# plt.ylabel("True Labels")
# plt.title("Confusion Matrix")
# plt.show()

* Accuracy Score: 0.5913246497911034
* F1 Score: 0.6619915848527348

**Model 10: DecisionTreeClassifier**

In [None]:
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.metrics import accuracy_score

# clf = DecisionTreeClassifier(random_state=0)
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)
# print(y_pred)

# accuracy = accuracy_score(y_test, y_pred)
# print("Accuracy:", accuracy)


**Model 11: RandomForestClassifier**

In [None]:
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import accuracy_score

# # Assuming X_train, X_test, y_train, and y_test are defined and contain the appropriate data

# clf = RandomForestClassifier(random_state=0)
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)
# print(y_pred)

# accuracy = accuracy_score(y_test, y_pred)
# print("Accuracy:", accuracy)


# **Submission**

**Classification Report**

In [None]:
from sklearn.metrics import classification_report

# Compare the true labels (y_test) with the predicted labels (y_pred)
report = classification_report(y_test, y_pred, zero_division=0)

# Print the classification report
print("Classification Report:")
print(report)

**Predicting test data (y_pred) and decoding (y_pred)**

In [None]:
y_pred=clf.predict(X_t1)
print('y_pred:',y_pred)
y_dec=encoder.inverse_transform(y_pred.reshape(-1, 1))

**Submitting to competition**

In [None]:
# y_dec has shape (n, 1) after inverse_transform, so flatten it into a single column
submission = pd.DataFrame({'id': range(len(y_dec)), 'sentiment': y_dec.ravel()})
submission.to_csv('submission1.csv', index=False)

# **Hyperparameter Tuning**

**Model 1: Logistic Regression**

In [None]:
# from sklearn.linear_model import LogisticRegression
# from sklearn.metrics import accuracy_score
# from sklearn.model_selection import GridSearchCV

# # Assuming X_train, X_test, y_train, and y_test are defined and contain the appropriate data

# # Create the Logistic Regression classifier
# clf = LogisticRegression(random_state=42, max_iter=500)

# # Define the hyperparameters to search
# param_grid = {
#     'C': [0.75, 0.5, 0.8],  # Try different regularization strengths
#     'solver': ['newton-cg', 'lbfgs'],  # Try different solvers
# }

# # Initialize Grid Search with the classifier and the hyperparameter grid
# grid_search = GridSearchCV(clf, param_grid, cv=5)

# # Fit the Grid Search to the training data to find the best hyperparameters
# grid_search.fit(X_train, y_train)

# # Get the best estimator (model) from the grid search
# best_model = grid_search.best_estimator_

# # Use the best model to make predictions on the test set
# y_pred = best_model.predict(X_test)

# # Calculate accuracy
# accuracy = accuracy_score(y_test, y_pred)
# print("Best Hyperparameters:", grid_search.best_params_)
# print("Accuracy:", accuracy)


* Best Hyperparameters: {'C': 0.75, 'solver': 'newton-cg'}
* Accuracy: 0.6468285763072518
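
The `cv=5` argument means every candidate is scored on five train/validation splits of the training data. The following is a minimal sketch of how such splits can be formed, assuming plain contiguous folds. The helper name `k_fold_indices` is hypothetical, and scikit-learn uses stratified folds for classifiers when `cv` is an integer, which this sketch does not reproduce.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Return (train_indices, validation_indices) pairs for contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, validation))
        start += size
    return folds


# Hypothetical toy size: 12 samples split into 5 folds
folds = k_fold_indices(12, n_splits=5)
for train_idx, val_idx in folds:
    print(len(train_idx), "train /", len(val_idx), "validation")
```

Each sample lands in exactly one validation fold, so every candidate is evaluated on all of the training data once.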

**Model 2: MLPClassifier**

In [None]:
# from sklearn.neural_network import MLPClassifier
# from sklearn.metrics import accuracy_score
# from sklearn.model_selection import GridSearchCV

# # Assuming X_train, X_test, y_train, and y_test are defined and contain the appropriate data

# # Create the Multi-Layer Perceptron (MLP) classifier
# clf = MLPClassifier(random_state=0)

# # Define the hyperparameters to search
# param_grid = {
#     'hidden_layer_sizes': [(50,), (100,), (150,)],  # Try different hidden layer sizes
#     'activation': ['relu', 'tanh'],  # Try different activation functions
#     'solver': ['adam', 'sgd'],  # Try different solvers
#     'alpha': [0.0001, 0.001, 0.01],  # Try different regularization strengths
# }

# # Initialize Grid Search with the classifier and the hyperparameter grid
# grid_search = GridSearchCV(clf, param_grid, cv=5)

# # Fit the Grid Search to the training data to find the best hyperparameters
# grid_search.fit(X_train, y_train)

# # Get the best estimator (model) from the grid search
# best_model = grid_search.best_estimator_

# # Use the best model to make predictions on the test set
# y_pred = best_model.predict(X_test)

# # Calculate accuracy
# accuracy = accuracy_score(y_test, y_pred)
# print("Best Hyperparameters:", grid_search.best_params_)
# print("Accuracy:", accuracy)

# from sklearn.metrics import f1_score

# # Calculate the F1 score; store it in f1 so the f1_score function is not overwritten
# f1 = f1_score(y_test, y_pred)

# print("F1 Score:", f1)

* Best Hyperparameters: {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'solver': 'adam'}
* Accuracy: 0.6205456241051843
* F1 Score: 0.645268237576591

**Model 3: LightGBM**

In [None]:
# import lightgbm as lgb
# from sklearn.metrics import accuracy_score
# from sklearn.model_selection import GridSearchCV

# # Assuming X_train, X_test, y_train, and y_test are defined and contain the appropriate data

# # Create the LightGBM classifier
# clf = lgb.LGBMClassifier(random_state=0)

# # Define the hyperparameters to search
# param_grid = {
#     'n_estimators': [50, 100, 150],  # Try different numbers of trees
#     'learning_rate': [0.01, 0.1, 0.2],  # Try different learning rates
#     'max_depth': [3, 5, 7],  # Try different maximum depths
#     'min_child_samples': [1, 5, 10],  # Try different minimum samples in a leaf node
#     'num_leaves': [15, 31, 63],  # Try different numbers of leaves
# }

# # Initialize Grid Search with the classifier and the hyperparameter grid
# grid_search = GridSearchCV(clf, param_grid, cv=5)

# # Fit the Grid Search to the training data to find the best hyperparameters
# grid_search.fit(X_train, y_train)

# # Get the best estimator (model) from the grid search
# best_model = grid_search.best_estimator_

# # Use the best model to make predictions on the test set
# y_pred = best_model.predict(X_test)

# # Calculate accuracy
# accuracy = accuracy_score(y_test, y_pred)
# print("Best Hyperparameters:", grid_search.best_params_)
# print("Accuracy:", accuracy)


* Best Hyperparameters: {'learning_rate': 0.2, 'max_depth': 7, 'min_child_samples': 10, 'n_estimators': 150, 'num_leaves': 63}
* Accuracy: 0.7214742621465068
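
To get a feel for the cost of this search, the sketch below counts the candidates in the grid from the LightGBM cell. It only enumerates the combinations and does not train anything. The variable names `cv_folds`, `n_candidates`, and `n_fits` are mine.

```python
from itertools import product

# Same grid as in the LightGBM tuning cell
param_grid = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7],
    "min_child_samples": [1, 5, 10],
    "num_leaves": [15, 31, 63],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds

print("Candidates:", n_candidates)
print("Model fits during cross-validation:", n_fits)
```

With the default `refit=True`, GridSearchCV also fits the best candidate once more on the full training set, so the grid grows expensive quickly as values are added.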


In [None]:
# from sklearn.naive_bayes import GaussianNB
# from sklearn.metrics import accuracy_score, f1_score
# from sklearn.model_selection import GridSearchCV

# # Assuming X_train, X_test, y_train, and y_test are defined and contain the appropriate data

# # Create the Gaussian Naive Bayes classifier
# clf = GaussianNB()

# # Define the hyperparameters to search
# param_grid = {
#     'var_smoothing': [1e-09, 1e-08, 1e-07, 1e-06, 1e-05]  # Try different values for var_smoothing
# }

# # Initialize Grid Search with the classifier and the hyperparameter grid
# grid_search = GridSearchCV(clf, param_grid, cv=5)

# # Fit the Grid Search to the training data to find the best hyperparameters
# grid_search.fit(X_train, y_train)

# # Get the best estimator (model) from the grid search
# best_model = grid_search.best_estimator_

# # Use the best model to make predictions on the test set
# y_pred = best_model.predict(X_test)

# # Calculate accuracy
# accuracy = accuracy_score(y_test, y_pred)
# print("Best Hyperparameters:", grid_search.best_params_)
# print("Accuracy:", accuracy)

# # Calculate the F1 score
# f1 = f1_score(y_test, y_pred)
# print("F1 Score:", f1)


In [None]:
# from sklearn.metrics import classification_report
# # Assuming you have true labels y_true and predicted labels y_pred
# report = classification_report(y_test, y_pred, zero_division=0)
# # Print the classification report
# print("Classification Report:")
# print(report)
# y_pre=best_model.predict(X_t1)  # clf itself is never fitted by GridSearchCV; use the refitted best_model
# print(y_pre)
# y_dec=encoder.inverse_transform(y_pre.reshape(-1, 1))
# print(y_dec)
# submission = pd.DataFrame(columns=['id', 'sentiment'])
# submission['id'] = [i for i in range(len(y_dec))]
# submission['sentiment'] = y_dec
# submission.to_csv('submission1.csv', index=False)

# **Final note**

> **Data preprocessing**

Three datasets were provided: train, test, and movies. The movies dataset contained duplicate entries, so I dropped them and kept only the first occurrence of each movieid. I then merged train with movies and test with movies. The merged data had many missing values. I imputed some columns directly with replace functions. For the rest, I split the columns into numerical and categorical groups and built a separate transformer for each. A ColumnTransformer inside a pipeline was then applied to merged_df (train) with fit_transform and to merged_df1 (test) with transform, producing transformed_df (train) and transformed_df1 (test). The reviewText column was imputed separately, using some of the columns in transformed_df and transformed_df1 for train and test respectively. I also applied TF-IDF to all columns containing text data in order to draw a heat map. Finally, I encoded the categorical variables just before train_test_split using an ordinal encoder, because the one-hot encoder produced mismatched columns between the train and test datasets.
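
A column mismatch with one-hot encoding usually appears when the categories are learned separately for train and test. The sketch below is a pure-Python illustration of the alternative, assuming that was the cause here: learn the categories on train only and reuse them for test. The helper names and the toy genre values are hypothetical. scikit-learn's `OneHotEncoder(handle_unknown="ignore")` follows the same idea.

```python
def fit_categories(values):
    """Learn the sorted list of categories seen in the training column."""
    return sorted(set(values))


def one_hot(values, categories):
    """Encode values against fitted categories; unseen values become all zeros."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return rows


# Hypothetical toy columns; "Western" never appears in train
train_genre = ["Drama", "Comedy", "Drama", "Horror"]
test_genre = ["Comedy", "Western"]

categories = fit_categories(train_genre)          # fit on train only
train_encoded = one_hot(train_genre, categories)
test_encoded = one_hot(test_genre, categories)    # reuse the train categories

print(categories)
print(test_encoded)
```

Because both splits are encoded against the same category list, they always end up with the same number of columns.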



> **Model 1: Logistic regression (Best model)**
 
Logistic regression is a machine learning technique for binary classification problems. It uses a mathematical function called the logistic function to model the probability of an outcome given some input features. Logistic regression can be used to predict whether an email is spam or not, whether a customer will buy a product or not, whether a tumor is malignant or benign, and many other things.
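
As a small illustration of the logistic function, the sketch below turns a weighted sum of features into a probability and then into a class label. The weights, bias, and feature values are made up for the example and are not taken from the trained model.

```python
import math


def logistic(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return logistic(z)


# Hypothetical weights and sample
weights = [0.8, -0.5]
bias = 0.1
p = predict_proba([1.0, 2.0], weights, bias)

# Predict the positive class when the probability reaches 0.5
label = int(p >= 0.5)
print(round(p, 3), label)
```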

I have used the sklearn library to import the model and some evaluation metrics. I have also set some parameters for the model, such as the random_state, penalty, C, solver, and max_iter. I then trained the model on the training data and made predictions on the test data. Finally, I calculated the accuracy score and the F1 score to measure the performance of the model.

My model achieved an accuracy score of 0.6556 and an F1 score of 0.6690. This is my best model.

Submission score: 0.66846
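
For reference, the two metrics reported throughout these notes can be computed by hand as in the sketch below. The labels are toy values, not the competition data, and the function names are mine. `f1_binary` treats label 1 as the positive class, which matches the default of sklearn's `f1_score` for binary targets.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only
y_true_toy = [1, 0, 1, 1, 0, 1]
y_pred_toy = [1, 0, 0, 1, 1, 1]

toy_accuracy = accuracy(y_true_toy, y_pred_toy)
toy_f1 = f1_binary(y_true_toy, y_pred_toy)
print("Accuracy:", toy_accuracy)
print("F1 Score:", toy_f1)
```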


> **Model 2: StackingClassifier (XGBoost)**

The code I have written is for a stacking classifier, an ensemble learning technique that combines multiple base classifiers into a more powerful meta-classifier. I have used the xgboost, sklearn, and numpy libraries to import the XGBClassifier, AdaBoostClassifier, ExtraTreesClassifier, StackingClassifier, accuracy_score, and f1_score classes and functions. I have also specified some hyperparameters for the base classifiers, such as the learning_rate, n_estimators, max_depth, and random_state. I then created the stacking classifier with XGBoost as the final_estimator, which means that it uses the predictions of the base classifiers as input features and makes the final prediction.

I then trained the stacking classifier on the training data (X_train, y_train) and made predictions on the test data (X_test). I also calculated the accuracy score and the F1 score to measure the performance of the model. The accuracy score is the fraction of test samples that the model classifies correctly.

* Accuracy Score: 0.8505360328055754
* F1 Score: 0.8570021111893033
* Submission score: 0.65763
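
The structure of such a stacking classifier can be sketched as follows. This is not the exact model from the notebook. It runs on synthetic data from `make_classification`, uses small `n_estimators` values to stay fast, and swaps in sklearn's GradientBoostingClassifier for XGBoost as the final estimator so that only scikit-learn is needed. All of these are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (stand-in for the real features)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Base classifiers whose predictions feed the final estimator
base_learners = [
    ("ada", AdaBoostClassifier(n_estimators=25, random_state=0)),
    ("extra_trees", ExtraTreesClassifier(n_estimators=25, random_state=0)),
]

stack = StackingClassifier(
    estimators=base_learners,
    # Stand-in for XGBoost as the meta-classifier
    final_estimator=GradientBoostingClassifier(n_estimators=25, random_state=0),
    cv=5,  # base predictions for the meta-classifier come from 5-fold CV
)
stack.fit(X_tr, y_tr)

y_hat = stack.predict(X_te)
stack_accuracy = accuracy_score(y_te, y_hat)
stack_f1 = f1_score(y_te, y_hat)
print("Accuracy Score:", stack_accuracy)
print("F1 Score:", stack_f1)
```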

> **Model 3: StackingClassifier (Ensemble learning technique)**

The code I have written is for a stacking classifier, an ensemble learning technique that combines multiple base classifiers into a more powerful meta-classifier. I have used the xgboost, sklearn, and pickle libraries to import the XGBClassifier, LogisticRegression, LinearSVC, AdaBoostClassifier, ExtraTreesClassifier, BaggingClassifier, RandomForestClassifier, StackingClassifier, accuracy_score, and f1_score classes and functions. I have also specified some hyperparameters for the base classifiers, such as the penalty, C, solver, learning_rate, n_estimators, max_depth, and random_state. I then created the stacking classifier with six base classifiers (logistic regression, XGBoost, AdaBoost, extra trees, bagging SVM, and random forest) and XGBoost as the final_estimator, which means that it uses the predictions of the base classifiers as input features and makes the final prediction.

I then trained the stacking classifier on the training data (X_train, y_train) and saved the model using pickle. Pickle is a module that serializes Python objects to binary files and deserializes them again. I used pickle.dump to write the model to a file named 'stacked_model.pkl'. I then loaded the model from the file using pickle.load and made predictions on the test data (X_test). I also calculated the accuracy score and the F1 score to measure the performance of the model.

* Accuracy Score: 0.8015483348963792
* F1 Score: 0.8056895930860641
* Submission Score: 0.6359
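
The save-and-reload step with pickle looks roughly like the sketch below. A plain dictionary stands in for the fitted stacking model, and the file is written to a temporary directory. Both are assumptions to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model object
model_state = {"name": "stacked_model", "weights": [0.4, 0.6], "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stacked_model.pkl")

    # Serialize the object to a binary file
    with open(path, "wb") as f:
        pickle.dump(model_state, f)

    # Deserialize it back into a new Python object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_state)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.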

> **Model 4: LightGBM (Second Best Model)**

The code I have written is for a LightGBM model, which is a type of gradient boosting framework that can be used for classification, regression, or ranking problems. I have used the lightgbm and sklearn libraries to import the LGBMClassifier, accuracy_score, and f1_score classes and functions. I have also specified some hyperparameters for the model, such as the learning_rate, max_depth, min_child_samples, n_estimators, and num_leaves. I have then trained the model on the training data (X_train, y_train) and made predictions on the test data (X_test). Finally, I have evaluated the performance of the model by calculating the accuracy score and the F1 score.

* Accuracy Score: 0.7376289512881338
* F1 Score: 0.7344302572311359
* Submission score: 0.66204

> **Model 5: LinearSVC**

The code I have written is for a linear support vector machine (SVM) model, which is a type of supervised learning algorithm that can be used for classification or regression problems. I have used the sklearn library to import the LinearSVC class and the accuracy_score function. I have also specified some hyperparameters for the model, such as C, loss, and dual. I have then trained the model on the training data (X_train, y_train) and made predictions on the test data (X_test). Finally, I have evaluated the performance of the model by calculating the accuracy score.

> **Model 6: BaggingClassifier (integrated with LinearSVC)**

The code I have written is for a bagging classifier, which is a type of ensemble learning technique that can create multiple bootstrap samples from the original data and train a base classifier on each sample. I have used the sklearn library to import the BaggingClassifier, LinearSVC, and accuracy_score classes and functions. I have also specified some hyperparameters for the base classifier and the bagging classifier, such as the C, loss, dual, n_estimators, and random_state. I have then trained the bagging classifier on the training data (X_train, y_train) using a linear support vector machine (SVM) as the base classifier. I have then made predictions on the test data (X_test) and calculated the accuracy score.
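
The bootstrap idea behind bagging can be shown with the standard library alone. In this sketch each of the 10 estimators would receive its own sample drawn with replacement, and the final label is chosen by majority vote. The row indices and the vote list are made up, the helper names are mine, and majority voting is shown as one common way to combine predictions.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, so some rows repeat and some are left out."""
    return rng.choices(rows, k=len(rows))


def majority_vote(predictions):
    """Return the label predicted by the most estimators."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)           # fixed seed, like random_state=0
rows = list(range(10))           # hypothetical row indices of the training data

# One bootstrap sample per base estimator (n_estimators=10)
samples = [bootstrap_sample(rows, rng) for _ in range(10)]

# Hypothetical predictions of the 10 estimators for a single test row
votes = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
final_label = majority_vote(votes)
print(samples[0], final_label)
```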

> **Model 7: KNeighborsClassifier**

The code I have written is for a k-nearest neighbors (KNN) classifier, which is a type of supervised learning algorithm that can be used for classification or regression problems. I have imported the KNeighborsClassifier class and the accuracy_score function from sklearn. I have also specified the number of neighbors to be used in the KNN algorithm, which is 5. I have then trained the KNN classifier on the training data (X_train, y_train) and made predictions on the test data (X_test). Finally, I have evaluated the performance of the model by calculating the accuracy score.
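
How a prediction with 5 neighbors is formed can be sketched in a few lines of plain Python. The points and labels below are toy values, the function name is mine, and Euclidean distance with an unweighted majority vote is assumed.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Label a query point by majority vote among its k nearest training points."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D data: three points near the origin (class 0), four near (1, 1) (class 1)
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (1.0, 1.0), (0.9, 1.1), (1.1, 0.9), (1.2, 1.0)]
labels = [0, 0, 0, 1, 1, 1, 1]

pred_a = knn_predict(points, labels, (0.1, 0.1))  # 3 of the 5 neighbors are class 0
pred_b = knn_predict(points, labels, (1.0, 1.0))  # 4 of the 5 neighbors are class 1
print(pred_a, pred_b)
```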


> **Model 8: GradientBoostingClassifier**

The code I have written is for a gradient boosting classifier, which is a type of ensemble learning technique that can create a strong learner by combining multiple weak learners. I have used the sklearn library to import the GradientBoostingClassifier and accuracy_score classes and functions. I have also specified some hyperparameters for the model, such as the random_state, n_estimators, learning_rate, max_depth, and min_samples_split. I have then trained the model on the training data (X_train, y_train) and made predictions on the test data (X_test). Finally, I have evaluated the performance of the model by calculating the accuracy score.

> **Model 9: MLPClassifier (Best Model)**

The code I have written is for a multi-layer perceptron (MLP) classifier, which is a type of artificial neural network that can be used for classification or regression problems. I have used the sklearn library to import the MLPClassifier, accuracy_score, and f1_score classes and functions. I have also specified some hyperparameters for the model, such as the hidden_layer_sizes, activation, solver, random_state, and verbose. I have then trained the model on the training data (X_train, y_train) and made predictions on the test data (X_test). Finally, I have evaluated the performance of the model by calculating the accuracy score and the F1 score.

* Accuracy Score: 0.5913246497911034
* F1 Score: 0.6619915848527348
* Submission score: 0.66846

> **Model 10: DecisionTreeClassifier**

The code I have written is for a decision tree classifier, which is a type of supervised learning algorithm that can be used for classification or regression problems. I have used the sklearn library to import the DecisionTreeClassifier and accuracy_score classes and functions. I have also specified the random_state parameter, which is a parameter that sets the seed for random number generation. This can help ensure reproducibility and consistency of results. I have then trained the model on the training data (X_train, y_train) and made predictions on the test data (X_test).

> **Model 11: RandomForestClassifier**

The code I have written is for a random forest classifier, which is a type of ensemble learning technique that can create a large number of decision trees and average their predictions. I have used the sklearn library to import the RandomForestClassifier and accuracy_score classes and functions. I have also specified the random_state parameter, which is a parameter that sets the seed for random number generation. This can help ensure reproducibility and consistency of results. I have then trained the model on the training data (X_train, y_train) and made predictions on the test data (X_test). Finally, I have printed the predictions and calculated the accuracy score.

> **Inference**

Sentiment analysis with sklearn was a challenging task for me. In this project I applied various feature engineering techniques to improve the models, but I could not raise my submission score beyond 0.66846. The best models, which cleared the cutoff score, were LogisticRegression, MLPClassifier, and LightGBM.

