<a href="https://colab.research.google.com/github/mushfiq-hussain/INFO5731_Assignment_Four/blob/main/INFO5731_Assignment_Four.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created in assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how topics connect to the human meaning of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.
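
For item 1, one common text representation for LDA is a bag-of-words count matrix. The following is a minimal pure-Python sketch of that idea, not the scikit-learn implementation. The two sample documents, the tiny stop-word list, and the helper name `bag_of_words` are invented for illustration; the token pattern follows the one passed to `CountVectorizer` in the cell below.

```python
import re
from collections import Counter

# Tokens of three or more letters/hyphens (same pattern idea as the vectorizer).
TOKEN_PATTERN = re.compile(r"[a-zA-Z\-][a-zA-Z\-]{2,}")
STOP_WORDS = {"the", "and", "was", "this"}  # illustrative subset only


def bag_of_words(document):
    """Count lowercase tokens in one document, skipping stop words."""
    tokens = TOKEN_PATTERN.findall(document.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


docs = ["The battery life was great and the screen is great",
        "This phone has a poor battery"]
counts = [bag_of_words(d) for d in docs]
vocabulary = sorted(set().union(*counts))
# One row per document, one column per vocabulary term.
matrix = [[c[term] for term in vocabulary] for c in counts]
```

Each row of `matrix` plays the role of one row of the document-term matrix that LDA is fitted on.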


In [15]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load the dataset.
# Reading with the default UTF-8 codec raised a UnicodeDecodeError ("invalid
# continuation byte"), so the file is not valid UTF-8. 'latin-1' is an assumed
# fallback that can decode any byte sequence; replace it with the file's real
# encoding (for example 'cp1252') if that is known.
data = pd.read_csv('/content/sentimentdata.csv', encoding='latin-1')

# Bag-of-words features: the 1,000 most frequent lowercase tokens of three or
# more letters/hyphens, with English stop words removed.
# The pattern is a raw string so that '\-' is not treated as a string escape.
vectorizer = CountVectorizer(max_features=1000, lowercase=True, stop_words='english',
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')

# Fit and transform the data into a document-term count matrix
X = vectorizer.fit_transform(data['text'])

# LDA Model
lda_model = LatentDirichletAllocation(n_components=10,           # Number of topics
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every=-1,         # compute perplexity every n iters, default: Don't
                                      n_jobs=-1,                 # Use all available CPUs
                                      )
lda_output = lda_model.fit_transform(X)

# Features (text representation) used for topic modeling
features = vectorizer.get_feature_names_out()

# Top 10 clusters for topic modeling: the 10 highest-weighted words of each topic
n_top_words = 10
top_clusters = {}
for topic_idx, topic in enumerate(lda_model.components_):
    # argsort is ascending, so the reversed slice yields the largest weights first
    top_clusters[topic_idx] = [features[i] for i in topic.argsort()[:-n_top_words - 1:-1]]

# Summarize and describe the topic for each cluster
for topic, words in top_clusters.items():
    print(f"Topic {topic + 1}:")
    print(", ".join(words))
    print()

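
A `UnicodeDecodeError` while reading a CSV usually means the file was saved in an encoding other than UTF-8. The sketch below tries a short list of candidate encodings in order and reports which one worked. The helper name `read_text_with_fallback`, the candidate list, and the throwaway demo file are assumptions made for illustration, not part of the assignment data.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) using the first codec that decodes the file."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo on a throwaway file whose 'é' is stored as the single cp1252 byte 0xE9,
# which is not a valid UTF-8 sequence in this position.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write("text,sentiment\ncaf\u00e9 was lovely,positive\n".encode("cp1252"))
text, used_encoding = read_text_with_fallback(tmp.name)
os.remove(tmp.name)
```

The detected name can then be passed on, for example `pd.read_csv(path, encoding=used_encoding)`. Because `latin-1` accepts every byte, it always succeeds as a last resort, but it may render some characters incorrectly if the true encoding is different.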

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **use 80% of the data for training and 20% for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two of the supervised learning algorithms/models from the scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, and build one sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. The test set must be used for model evaluation in this step. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.
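
The four metrics in step 3 can be computed by hand, which helps when checking library output. The following is a minimal sketch of support-weighted averaging over classes, the same idea as `average='weighted'` in scikit-learn; the function name `weighted_scores` and the six example labels are invented for illustration.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    support = Counter(y_true)
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        p_label = tp / n_pred if n_pred else 0.0   # precision for this class
        r_label = tp / n_true                      # recall for this class
        f_label = (2 * p_label * r_label / (p_label + r_label)
                   if p_label + r_label else 0.0)
        weight = n_true / total                    # class share of the true labels
        precision += weight * p_label
        recall += weight * r_label
        f1 += weight * f_label
    return accuracy, precision, recall, f1


y_true = ["pos", "pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "pos", "neg", "neg", "neu", "neu"]
accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
```

With these labels, four of six predictions are correct, and the weighted recall equals the accuracy, which always holds for support-weighted recall.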

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Load the dataset.
# The default UTF-8 read raised a UnicodeDecodeError; 'latin-1' is an assumed
# fallback encoding and should be replaced by the file's real encoding if known.
data = pd.read_csv('/content/sentimentdata.csv', encoding='latin-1')

# Select features for sentiment classification
X = data['text']
y = data['sentiment']

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training split, as the assignment requires.
# The vectorizer sits inside the pipeline so that each fold fits TF-IDF on its
# own training part only.
for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    pipeline = make_pipeline(
        TfidfVectorizer(max_features=1000, lowercase=True, stop_words='english'), clf)
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')
    print(f"{name} 5-fold CV accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")
print()

# Feature extraction using TF-IDF (fit on the training set only)
tfidf_vectorizer = TfidfVectorizer(max_features=1000, lowercase=True, stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Function to evaluate a model with support-weighted averages over the classes
def evaluate_model(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    # zero_division=0 scores a class that is never predicted as 0 instead of warning
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    return accuracy, precision, recall, f1

# Naive Bayes Classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_tfidf, y_train)
nb_y_pred = nb_classifier.predict(X_test_tfidf)
nb_accuracy, nb_precision, nb_recall, nb_f1 = evaluate_model(y_test, nb_y_pred)

# Logistic Regression Classifier
lr_classifier = LogisticRegression(max_iter=1000)
lr_classifier.fit(X_train_tfidf, y_train)
lr_y_pred = lr_classifier.predict(X_test_tfidf)
lr_accuracy, lr_precision, lr_recall, lr_f1 = evaluate_model(y_test, lr_y_pred)

# Print the performance metrics on the test set
print("Naive Bayes Classifier:")
print(f"Accuracy: {nb_accuracy:.2f}")
print(f"Precision: {nb_precision:.2f}")
print(f"Recall: {nb_recall:.2f}")
print(f"F1 Score: {nb_f1:.2f}")
print()
print("Logistic Regression Classifier:")
print(f"Accuracy: {lr_accuracy:.2f}")
print(f"Precision: {lr_precision:.2f}")
print(f"Recall: {lr_recall:.2f}")
print(f"F1 Score: {lr_f1:.2f}")

Explanation for Feature Selection (TF-IDF)

TF-IDF (Term Frequency-Inverse Document Frequency) is a feature extraction approach commonly used in text classification tasks such as sentiment analysis.
It weights each term by how often it appears in a given document, scaled down by how many documents in the corpus contain it, so words that are frequent in one document but rare across the corpus receive the highest weights. Words that occur in almost every document carry little information about sentiment and are weighted down accordingly.

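
To make the TF-IDF weighting concrete, here is a small worked sketch of the textbook formula tf(t, d) · log(N / df(t)). It is an illustration only: scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers differ. The three toy documents and the helper name `tf_idf` are invented.

```python
import math
from collections import Counter

docs = [["great", "phone", "great", "battery"],
        ["poor", "phone", "battery"],
        ["phone", "arrived", "late"]]

n_docs = len(docs)
# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tf_idf(doc):
    """Map each term in doc to (term count / doc length) * log(N / df)."""
    counts = Counter(doc)
    return {term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()}


weights = tf_idf(docs[0])
```

In the first document, "great" gets the largest weight because it is frequent there and appears nowhere else, while "phone" occurs in every document and therefore gets a weight of zero.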

# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you selected those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.
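
For step 4, common regression metrics are the mean squared error (MSE), its square root (RMSE, in the same units as the price), and R-squared. The sketch below computes them from their definitions; the function name `regression_metrics` and the four example prices are invented for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares
    mse = ss_res / n
    return mse, math.sqrt(mse), 1 - ss_res / ss_tot


# Invented sale prices (in dollars) and predictions that are each off by 10,000.
y_true = [200000, 150000, 300000, 250000]
y_pred = [210000, 140000, 290000, 260000]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
```

Here every prediction misses by 10,000, so the RMSE is exactly 10,000, while R-squared (0.968) expresses the same error relative to the spread of the true prices.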

In [12]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# Display the first few rows of the dataset
train_data.head()

# Display information about the dataset
train_data.info()

# Check for missing values
train_data.isnull().sum()

# Drop columns that have fewer than 60% non-null values
train_data = train_data.dropna(thresh=int(len(train_data) * 0.6), axis=1)

# Fill missing values: column mean for numeric columns, most frequent value for the rest.
# Calling mean() on the whole frame is not valid for the text (object) columns.
numeric_cols = train_data.select_dtypes(include='number').columns
categorical_cols = train_data.select_dtypes(exclude='number').columns
train_data[numeric_cols] = train_data[numeric_cols].fillna(train_data[numeric_cols].mean())
train_data[categorical_cols] = train_data[categorical_cols].fillna(
    train_data[categorical_cols].mode().iloc[0])

# Check for missing values again
train_data.isnull().sum()

# Define X and y
X = train_data.drop(['SalePrice'], axis=1)
y = train_data['SalePrice']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# Perform EDA to select features
# Correlation is only defined for numeric columns, so restrict the matrix to them
corr_matrix = train_data[numeric_cols].corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, cmap='coolwarm', annot=False, fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

# Based on the correlation matrix, select features whose absolute correlation with SalePrice exceeds 0.5
top_corr_features = corr_matrix.index[abs(corr_matrix["SalePrice"]) > 0.5]
top_corr_features

# Select the features
selected_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd']

# Update training and testing sets with selected features
X_train = X_train[selected_features]
X_test = X_test[selected_features]

# Develop a regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predictions
y_pred = regressor.predict(X_test)

# Evaluate the model on the held-out test split
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same units as SalePrice, easier to interpret than MSE
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2):", r2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

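
Correlation-based feature selection, the approach taken in the cell above, can be sketched without pandas. The example computes the Pearson correlation of each candidate column with the target and keeps those whose absolute value exceeds 0.5, the same threshold as in the cell. The column names and values are invented and are not taken from the housing data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy table: two candidate features and the target (price in thousands).
features = {
    "quality": [5, 6, 7, 8, 9],
    "month_sold": [6, 2, 9, 2, 6],
}
price = [100, 120, 150, 170, 200]

correlations = {name: pearson(values, price) for name, values in features.items()}
selected = [name for name, r in correlations.items() if abs(r) > 0.5]
```

Only `quality` passes the threshold here; `month_sold` is almost uncorrelated with the price. A high correlation with the target does not rule out redundancy between the selected features themselves, so checking their pairwise correlations is a sensible follow-up.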

# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, RoBERTa, or any other related model.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources, number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting; no fine-tuning is required. Evaluate the performance of the model by comparing its predictions with the ground truth (the labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


In [13]:
# Install the necessary libraries
!pip install transformers

import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import BertForSequenceClassification, BertTokenizer

# Load BERT model and tokenizer.
# Note: bert-base-uncased has no trained classification head, so the head is
# randomly initialized and the predictions below are not meaningful. For real
# zero-shot sentiment prediction, substitute a checkpoint from the Hub that has
# already been fine-tuned for sentiment classification.
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

# Sample sentences for sentiment analysis
sentences = ["I love this movie", "This movie is terrible"]

# Tokenize the sentences
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

# Perform inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Get predicted labels
predicted_labels = torch.argmax(outputs.logits, dim=1).tolist()

# Display predicted labels
for i, sentence in enumerate(sentences):
    print(f"Sentence: {sentence}")
    print(f"Predicted sentiment: {'positive' if predicted_labels[i] == 1 else 'negative'}")
    print()

# Ground truth labels for the sample sentences
ground_truth_labels = [1, 0]  # 1 for positive, 0 for negative

# Calculate accuracy, precision, recall, and F1.
# Comparing two Python lists with == returns a single bool, so accuracy_score is
# used for the elementwise comparison instead.
accuracy = accuracy_score(ground_truth_labels, predicted_labels)
# zero_division=0 returns 0 when the positive class is never predicted
precision = precision_score(ground_truth_labels, predicted_labels, zero_division=0)
recall = recall_score(ground_truth_labels, predicted_labels, zero_division=0)
f1 = f1_score(ground_truth_labels, predicted_labels, zero_division=0)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Benefits of using BERT for sentiment analysis:

BERT is a pre-trained language model that captures rich contextual information.
BERT is bidirectional: each token's representation draws on the words both before and after it in a sentence.
It can be fine-tuned for specific tasks. Note that `bert-base-uncased` itself has no trained classification head (see the warning in the output below), so zero-shot sentiment prediction requires a checkpoint that has already been fine-tuned for sentiment.

Disadvantages and challenges:

BERT is computationally costly, particularly on large datasets.
Reaching the best performance may require fine-tuning on a large amount of labeled data.
BERT's WordPiece tokenization and 512-token input limit may not suit all kinds of text, so long or unusual inputs may need special handling.

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentence: I love this movie
Predicted sentiment: negative

Sentence: This movie is terrible
Predicted sentiment: negative

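
Turning classifier logits into labels and scoring them pair by pair can be sketched in plain Python. The logits below are invented, not model output, and the mapping of index 0 to negative and index 1 to positive follows the cell above. Note that comparing two Python lists with `==` yields one `bool` for the whole list, so accuracy has to be computed elementwise.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three sentences: [negative score, positive score].
batch_logits = [[-1.2, 2.3], [1.5, -0.4], [0.1, 0.3]]
probabilities = [softmax(row) for row in batch_logits]
predicted = [row.index(max(row)) for row in probabilities]

truth = [1, 0, 0]
# Elementwise comparison; `predicted == truth` would give one bool for the whole list.
accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(truth)
```

Softmax does not change which class has the highest score, so taking the argmax of the raw logits gives the same labels; the probabilities are useful mainly for inspecting how confident each prediction is.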