
# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how topics relate to the human meaning of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.


In [None]:
import pandas as pd
import re
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
file_path = "/content/movie_review (1).csv"
try:
    print("Loading the dataset...")
    data = pd.read_csv(file_path, delimiter=",", encoding="utf-8", quoting=1, on_bad_lines='skip')  # Handle quotes
except Exception as e:
    print(f"Error loading file: {e}. Attempting with alternative settings...")
    data = pd.read_csv(file_path, delimiter=",", encoding="latin1", quoting=1, on_bad_lines='skip')

# Checking if dataset is loaded correctly
print("Dataset loaded. Displaying first few rows:")
print(data.head())

# Check and split columns if necessary
if data.columns.size == 1:
    print("Dataset is not properly split. Splitting manually...")
    data = pd.read_csv(file_path, delimiter=",", quoting=1, encoding="latin1", on_bad_lines='skip')
    if data.columns.size > 3:
        data = data.iloc[:, :3]
    data.columns = ['document_id', 'clean_text', 'sentiment']

print(data.head())

# Step 1: Preprocessing the text data
if 'clean_text' in data.columns:
    text_column = 'clean_text'
else:
    text_column = data.columns[1]

print(f"Using column '{text_column}' for text analysis.")

# Text preprocessing function
def preprocess_text(text):
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    text = re.sub(r"\d+", "", text)  # raw string avoids an invalid-escape warning
    text = text.strip()
    return text

data['cleaned_text'] = data[text_column].astype(str).apply(preprocess_text)

# Step 2: Vectorize the text data
# Bag-of-words counts; scikit-learn's built-in English stop-word list is applied here
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
text_vectors = vectorizer.fit_transform(data['cleaned_text'])

# Step 3: Perform LDA Topic Modeling
lda_model = LatentDirichletAllocation(n_components=10, random_state=42)
lda_model.fit(text_vectors)

# Extract topics and their top words
def get_top_words(model, feature_names, n_top_words=10):
    topics = []
    for topic_idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        topics.append(top_words)
    return topics

feature_names = vectorizer.get_feature_names_out()
topics = get_top_words(lda_model, feature_names)

# Display the topics
print("\nTop 10 Topics Identified:")
for idx, topic in enumerate(topics):
    print(f"Topic {idx + 1}: {', '.join(topic)}")

# Step 4: Summarize Topics
print("\nTopic Summaries:")
for idx, topic_words in enumerate(topics):
    summary = f"Topic {idx + 1} focuses on themes related to: {', '.join(topic_words[:5])}."
    print(summary)


Loading the dataset...
Error loading file: 'utf-8' codec can't decode byte 0x92 in position 146: invalid start byte. Attempting with alternative settings...
Dataset loaded. Displaying first few rows:
   document_id                                         clean_text sentiment
0            1  The first 30 minutes dragged a bit but thereaf...  Positive
1            2                       Infinity Stars to the movie.  Positive
2            3  I fear they focused most of their attention on...  Negative
3            4  Kalki 2898 AD: Merging Epic History with Futur...  Positive
4            5  The movie is good but too slow. The action sce...  Negative
   document_id                                         clean_text sentiment
0            1  The first 30 minutes dragged a bit but thereaf...  Positive
1            2                       Infinity Stars to the movie.  Positive
2            3  I fear they focused most of their attention on...  Negative
3            4  Kalki 2898 AD: Merging E

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
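
The cell above prints the top words per topic. To report clusters, each document also needs a topic assignment. The following is a minimal sketch of one way to do that, assuming a tiny made-up corpus (`docs` is hypothetical, not the assignment data) and only two topics because the toy corpus is so small. It takes the highest-probability topic from the document-topic matrix as the document's cluster.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the movie reviews
docs = [
    "the action scenes and visual effects were stunning",
    "great visual effects and thrilling action",
    "the story was slow and the plot dragged",
    "slow plot with a weak story and poor pacing",
]

# Bag-of-words features, the same representation the notebook uses
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(docs)

# Two topics only because the toy corpus is tiny; the notebook uses 10
lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(doc_term)  # shape (n_docs, n_topics); each row sums to 1

# Assign each document to its highest-probability topic
dominant_topic = doc_topic.argmax(axis=1)

for text, topic_id, weights in zip(docs, dominant_topic, doc_topic):
    print(f"Topic {topic_id + 1} (weight {weights.max():.2f}): {text}")
```

Grouping the rows by `dominant_topic` gives one cluster per topic, which can then be summarized by reading a few member documents alongside the top words.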


# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **80% of the data for training and 20% for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two supervised learning algorithms/models from the scikit-learn library (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) and build one sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. The test set must be used for model evaluation in this step. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.
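
As a quick illustration of the four metrics named in item 3, here is a small sketch that computes them directly from the confusion counts for binary labels. The helper `binary_metrics` and the label lists are hypothetical and only serve the example; in practice the scikit-learn functions used later in the notebook do the same job.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 3 TP, 3 TN, 1 FP, 1 FN
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)  # every metric is 0.75 for these counts
```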

In [None]:
# Required Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE  # To handle class imbalance
from nltk.corpus import stopwords
import nltk


nltk.download('stopwords')

# Step 1: Load Dataset
data = pd.read_csv('/content/movie_review (1).csv', encoding='latin1')
data.columns = ['document_id', 'clean_text', 'sentiment']
data['sentiment'] = data['sentiment'].map({'Positive': 1, 'Negative': 0})


data.dropna(subset=['clean_text', 'sentiment'], inplace=True)

# Step 2: Feature Extraction

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(data['clean_text'])
y = data['sentiment']

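# Step 3: Handle Class Imbalance
# Caution: oversampling here (and fitting TF-IDF above) happens before the split,
# so test-set information can reach training and the scores are likely optimistic.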
smote = SMOTE(random_state=42)
X, y = smote.fit_resample(X, y)

# Step 4: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 5: Define Models
logistic_model = LogisticRegression(class_weight='balanced', random_state=42)
rf_model = RandomForestClassifier(class_weight='balanced', random_state=42)

# Step 6: Train Models
logistic_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)

# Step 7: Predict on Test Set
logistic_preds = logistic_model.predict(X_test)
rf_preds = rf_model.predict(X_test)

# Step 8: Evaluate Models
def evaluate_model(y_true, y_pred):
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
        "F1 Score": f1_score(y_true, y_pred, zero_division=0)
    }

logistic_metrics = evaluate_model(y_test, logistic_preds)
rf_metrics = evaluate_model(y_test, rf_preds)

print("Logistic Regression Metrics:")
print(logistic_metrics)

print("\nRandom Forest Metrics:")
print(rf_metrics)

# Step 9: Classification Reports
print("\nClassification Report (Logistic Regression):")
print(classification_report(y_test, logistic_preds, target_names=['Negative', 'Positive'], zero_division=0))

print("\nClassification Report (Random Forest):")
print(classification_report(y_test, rf_preds, target_names=['Negative', 'Positive'], zero_division=0))

# Step 10: Cross-Validation
cv_scores_logistic = cross_val_score(logistic_model, X, y, cv=5, scoring='accuracy')
cv_scores_rf = cross_val_score(rf_model, X, y, cv=5, scoring='accuracy')

print("\nCross-Validation Scores (Logistic Regression):", cv_scores_logistic)
print("Cross-Validation Scores (Random Forest):", cv_scores_rf)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Logistic Regression Metrics:
{'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0}

Random Forest Metrics:
{'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0}

Classification Report (Logistic Regression):
              precision    recall  f1-score   support

    Negative       1.00      1.00      1.00         3
    Positive       1.00      1.00      1.00         3

    accuracy                           1.00         6
   macro avg       1.00      1.00      1.00         6
weighted avg       1.00      1.00      1.00         6


Classification Report (Random Forest):
              precision    recall  f1-score   support

    Negative       1.00      1.00      1.00         3
    Positive       1.00      1.00      1.00         3

    accuracy                           1.00         6
   macro avg       1.00      1.00      1.00         6
weighted avg       1.00      1.00      1.00         6


Cross-Validation Scores (Logistic Regression): [0.83333333 1.         
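
Perfect scores on a six-row test set deserve caution, because the cell above fits TF-IDF and applies SMOTE before splitting the data. Below is a minimal sketch of a leakage-free variant, assuming a made-up toy corpus (`texts` and `labels` are hypothetical). It splits first, keeps the vectorizer inside a scikit-learn `Pipeline` so each cross-validation fold refits it on training rows only, and runs cross-validation on the training portion. Note that it swaps SMOTE for `class_weight="balanced"` to stay within scikit-learn; this is a substitution, not the notebook's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews: 10 positive (1) followed by 10 negative (0)
texts = [
    "great movie with stunning visuals", "loved the story and the acting",
    "brilliant direction and wonderful music", "amazing action scenes, truly enjoyable",
    "excellent cast and a gripping plot", "fantastic film, highly recommended",
    "beautiful cinematography and strong performances", "a thrilling and entertaining experience",
    "superb screenplay with great emotion", "wonderful movie, enjoyed every minute",
    "boring movie with a weak plot", "terrible acting and poor direction",
    "the story was slow and dull", "disappointing film, waste of time",
    "bad screenplay and awful dialogue", "too long and painfully slow",
    "poor visuals and a confusing story", "weak performances and boring scenes",
    "awful movie, not recommended", "dull, predictable, and disappointing",
]
labels = [1] * 10 + [0] * 10

# Split BEFORE any fitting so the test rows stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The vectorizer lives inside the pipeline, so it is refit on each training fold
clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# 5-fold cross-validation on the training portion only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring="accuracy")
print("CV accuracy per fold:", cv_scores)

# Final fit on all training rows, evaluated once on the held-out test rows
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```

If oversampling is still wanted, it would need to run inside each training fold as well; the imbalanced-learn package provides its own pipeline class intended for that, though it is not used in this sketch.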

# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split the data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you selected those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the training and testing data
train_df = pd.read_csv('/content/train.csv')
test_df = pd.read_csv('/content/test.csv')

# Display basic info for EDA
print("Training data preview:")
print(train_df.head())
print("Testing data preview:")
print(test_df.head())

# Drop rows in training data with missing target
train_df.dropna(subset=['SalePrice'], inplace=True)

# Separate target variable from features in the training data
y = train_df['SalePrice']
X = train_df.drop(['SalePrice'], axis=1)

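# Combine train and test datasets for consistent preprocessing.
# 'Id' is a row identifier, not a predictor, so it is excluded from the features.
all_data = pd.concat([X, test_df], keys=['train', 'test']).drop(columns=['Id'])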

# Handling missing values based on feature type
categorical_cols = all_data.select_dtypes(include=['object']).columns
numerical_cols = all_data.select_dtypes(include=['int64', 'float64']).columns

# Define preprocessing pipelines
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Create a column transformer for both numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)

# Define the full pipeline with a regressor
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Split the combined data back into train and test sets
X_train = all_data.loc['train']
X_test = all_data.loc['test']

# Train the model on training data
model_pipeline.fit(X_train, y)

# Predict on test.csv
test_predictions = model_pipeline.predict(X_test)

# Save predictions to a CSV file
output = pd.DataFrame({
    'Id': test_df['Id'],
    'SalePrice': test_predictions
})
output.to_csv('test_predictions.csv', index=False)
print("Predictions saved to test_predictions.csv")

# Evaluate on a holdout set (Validation split)
X_train_split, X_valid_split, y_train_split, y_valid_split = train_test_split(X_train, y, test_size=0.2, random_state=42)
model_pipeline.fit(X_train_split, y_train_split)
y_valid_pred = model_pipeline.predict(X_valid_split)

# Evaluation metrics
mse = mean_squared_error(y_valid_split, y_valid_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_valid_split, y_valid_pred)

print(f"Validation Mean Squared Error: {mse}")
print(f"Validation Root Mean Squared Error: {rmse}")
print(f"Validation R-squared: {r2}")


Training data preview:
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice 
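
Item 2 of the question asks for feature selection based on EDA, which the cell above skips by feeding every column to the model. One simple option is to rank numeric columns by their absolute correlation with the target. The sketch below assumes synthetic data: the column names are borrowed from the housing data for readability, but the values are randomly generated, and the 0.3 threshold is an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical synthetic data; values are random, not from train.csv
df = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),    # drives the synthetic price
    "OverallQual": rng.integers(1, 11, n),    # drives the synthetic price
    "MoSold": rng.integers(1, 13, n),         # unrelated to the synthetic price
})
df["SalePrice"] = (
    100 * df["GrLivArea"] + 15000 * df["OverallQual"] + rng.normal(0, 10000, n)
)

# Correlation of each numeric feature with the target
corr = df.select_dtypes(include="number").corr()["SalePrice"].drop("SalePrice")

# Rank by absolute correlation and keep features above the chosen threshold
ranked = corr.abs().sort_values(ascending=False)
selected = ranked[ranked > 0.3].index.tolist()

print(ranked)
print("Selected features:", selected)
```

Correlation only captures linear, pairwise relationships, so it works as a first screen for a linear regression rather than a complete selection procedure.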

# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, RoBERTa, or any other related model.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources, number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting; NO fine-tuning is required. Evaluate the performance of the model by comparing its predictions with the ground truth (the labels you annotated) on Accuracy, Precision, Recall, and F1.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


In [2]:
# Install necessary libraries
!pip install transformers
!pip install scikit-learn

# Import required libraries
import pandas as pd
from transformers import pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Step 1: Load the dataset
data = pd.read_csv('/content/movie_review (1).csv', encoding='latin1')
data.columns = ['document_id', 'clean_text', 'sentiment']

# Step 2: Check for NaN values and remove rows with NaN in 'sentiment' column
data.dropna(subset=['sentiment'], inplace=True)

# Step 3: Preprocess the sentiment column
# Ensure consistent lower case values for sentiment labels before mapping
data['sentiment'] = data['sentiment'].str.lower()

# Mapping the sentiment column to binary values
sentiment_mapping = {'positive': 1, 'negative': 0}
data['sentiment'] = data['sentiment'].map(sentiment_mapping)

# Drop rows where sentiment mapping is unsuccessful (e.g., neutral values if not mapped)
data.dropna(subset=['sentiment'], inplace=True)

# Step 4: Initialize the Hugging Face Sentiment Analysis pipeline
sentiment_analyzer = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

# Step 5: Predict sentiment on the text data using the pre-trained model
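# truncation=True keeps reviews longer than the model's maximum input length from raising an error
predictions = sentiment_analyzer(data['clean_text'].astype(str).tolist(), truncation=True)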

# Convert predictions to binary format: 1 for positive, 0 for negative
predicted_sentiment_binary = [1 if pred['label'] == 'POSITIVE' else 0 for pred in predictions]

# Step 6: Add predictions to the dataframe
data['predicted_sentiment_binary'] = predicted_sentiment_binary

# Ensure both columns are of type int for evaluation
data['sentiment'] = data['sentiment'].astype(int)
data['predicted_sentiment_binary'] = data['predicted_sentiment_binary'].astype(int)

# Step 7: Evaluate the model's performance
accuracy = accuracy_score(data['sentiment'], data['predicted_sentiment_binary'])
precision = precision_score(data['sentiment'], data['predicted_sentiment_binary'], zero_division=0)
recall = recall_score(data['sentiment'], data['predicted_sentiment_binary'], zero_division=0)
f1 = f1_score(data['sentiment'], data['predicted_sentiment_binary'], zero_division=0)

# Step 8: Print the evaluation metrics
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

# Step 9: Print the classification report for more detailed metrics
print("\nClassification Report:")
print(classification_report(data['sentiment'], data['predicted_sentiment_binary'], target_names=['Negative', 'Positive'], zero_division=0))




Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0

Classification Report:
              precision    recall  f1-score   support

    Negative       1.00      1.00      1.00        15
    Positive       1.00      1.00      1.00        10

    accuracy                           1.00        25
   macro avg       1.00      1.00      1.00        25
weighted avg       1.00      1.00      1.00        25
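
The cell above maps every label other than `POSITIVE` to 0, which would silently hide an unexpected label from a different model. The sketch below shows a stricter mapping that raises a `KeyError` on unknown labels. The predictions are mocked in the list-of-dicts shape that the sentiment pipeline returns, so no model download is needed; `mock_predictions` and `gold` are hypothetical values, not results from the assignment data.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical outputs in the shape returned by the sentiment-analysis pipeline
mock_predictions = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.97},
    {"label": "POSITIVE", "score": 0.61},
    {"label": "NEGATIVE", "score": 0.88},
]
gold = [1, 0, 0, 0]  # hypothetical annotated labels (1 = positive, 0 = negative)

# Explicit mapping: an unexpected label raises KeyError instead of becoming 0
label_to_int = {"POSITIVE": 1, "NEGATIVE": 0}
pred = [label_to_int[p["label"]] for p in mock_predictions]

accuracy = accuracy_score(gold, pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average="binary", zero_division=0
)

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, F1: {f1:.2f}")
```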

