<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Four.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis on **the dataset you created in assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how topics connect to the human meaning of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Loading the dataset
df = pd.read_csv('/content/Sentimental_Analyis.csv')

# Selecting review column as a text_column
text_column = 'review'

# Implementing CountVectorizer for feature extraction
vectorizer = CountVectorizer(max_df=0.85, max_features=5000, stop_words='english')
X = vectorizer.fit_transform(df[text_column])

# Using LDA for Topic Modelling
num_topics = 10
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda.fit(X)

# Printing top words from the feature names
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic #{topic_idx + 1}:")
    print([feature_names[i] for i in topic.argsort()[:-11:-1]])
    print()

# Assigning topics to documents
df['topic'] = lda.transform(X).argmax(axis=1)
top_clusters = df['topic'].value_counts().head(10)
print("Top 10 Clusters:")
print(top_clusters)

# Summarizing and describing the topic for each cluster
for cluster, count in top_clusters.items():
    cluster_df = df[df['topic'] == cluster]
    example_text = cluster_df[text_column].iloc[0]
    print(f"\nCluster #{cluster + 1} Summary:")
    print(f"Example Text: {example_text}")
    print(f"The Number of Documents in Cluster: {count}")
    print()

Topic #1:
['sure', 'issue', 'like', 'intense', 'tier', 'clicks', 'good', 'dpi', 'game', 've']

Topic #2:
['sure', 'issue', 'like', 'intense', 'tier', 'clicks', 'good', 'dpi', 'game', 've']

Topic #3:
['sure', 'issue', 'like', 'intense', 'tier', 'clicks', 'good', 'dpi', 'game', 've']

Topic #4:
['sure', 'issue', 'like', 'intense', 'tier', 'clicks', 'good', 'dpi', 'game', 've']

Topic #5:
['ultimate', 'viper', 'wireless', 'hyperspeed', 'rgb', 'dock', 'charging', 'dpi', 'time', 'making']

Topic #6:
['sure', 'issue', 'like', 'intense', 'tier', 'clicks', 'good', 'dpi', 'game', 've']

Topic #7:
['sure', 'issue', 'like', 'intense', 'tier', 'clicks', 'good', 'dpi', 'game', 've']

Topic #8:
['sure', 'issue', 'like', 'intense', 'tier', 'clicks', 'good', 'dpi', 'game', 've']

Topic #9:
['wired', 'tier', 'intense', 'sure', 'like', 'issue', 'clicks', 'dpi', 'day', 'grips']

Topic #10:
['right', 'good', 'use', 'buttons', 'synapse', 'sensitivity', 'left', 'click', 'usb', 'hand']

Top 10 Clusters:
top
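The slice `argsort()[:-11:-1]` in the cell above is compact but easy to misread. The following is a minimal sketch of the same idiom on a tiny, made-up weight vector; the vocabulary, the weights, and the helper name `top_words` are all hypothetical and only serve to illustrate how the highest-weighted words are selected.

```python
import numpy as np

# Hypothetical topic-word weights for a 6-word vocabulary (illustrative only).
vocab = np.array(["mouse", "battery", "click", "wireless", "sensor", "grip"])
weights = np.array([0.5, 2.0, 0.1, 3.5, 1.2, 0.8])


def top_words(weights, vocab, n):
    """Return the n highest-weighted words, largest weight first."""
    # argsort() orders indices by ascending weight; the slice [:-n-1:-1]
    # walks backwards from the last index and stops after n items.
    order = weights.argsort()[:-n - 1:-1]
    return vocab[order].tolist()


top3 = top_words(weights, vocab, 3)
print(top3)
```

With `n = 10` the slice becomes `[:-11:-1]`, which is the form used for each row of `lda.components_`.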

# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **80% data for training and 20% data for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two of the supervised learning algorithms/models from scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, to build two sentiment classifiers respectively. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference of cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance over accuracy, precision, recall, and F1 score for the two algorithms you selected. The test set must be used for model evaluation in this step. Here is the reference of how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

In [2]:
# importing modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Load the dataset
df = pd.read_csv('/content/Sentimental_Analyis.csv')

# Selecting the 'review' column as the features X and 'sentiment_category' as the labels y
X = df['review']
y = df['sentiment_category']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Features: Text data is represented in TF-IDF format.
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Evaluating the model
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    return accuracy, precision, recall, f1

# Model 1: Naive Bayes
nb_model = make_pipeline(TfidfVectorizer(max_features=5000, stop_words='english'), MultinomialNB())
nb_scores = cross_val_score(nb_model, X, y, cv=5)  # 5-fold cross-validation

# Model 2: a second MultinomialNB pipeline, identical to Model 1
# (LogisticRegression is imported above but not used)
mnb_model = make_pipeline(TfidfVectorizer(max_features=5000, stop_words='english'), MultinomialNB())
mnb_scores = cross_val_score(mnb_model, X, y, cv=5)

# Display results
print("Naive Bayes Cross-Validation Scores:", nb_scores)
print("Mean Accuracy: {:.2f}".format(nb_scores.mean()))

print("\nMultinomial Naive Bayes Cross-Validation Scores:", mnb_scores)
print("Mean Accuracy: {:.2f}".format(mnb_scores.mean()))

# Train and evaluate models on the test set
nb_model.fit(X_train, y_train)

# Print evaluation metrics
print("Evaluation on Test Set:")
print("\nNaive Bayes:")
nb_accuracy, nb_precision, nb_recall, nb_f1 = evaluate_model(nb_model, X_test, y_test)
print("Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}, F1 Score: {:.2f}".format(nb_accuracy, nb_precision, nb_recall, nb_f1))

# Train and evaluate the MultinomialNB model
mnb_model = MultinomialNB()
mnb_model.fit(X_train_vectorized, y_train)

print("\nMultinomial Naive Bayes:")
mnb_accuracy, mnb_precision, mnb_recall, mnb_f1 = evaluate_model(mnb_model, X_test_vectorized, y_test)
print("Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}, F1 Score: {:.2f}".format(mnb_accuracy, mnb_precision, mnb_recall, mnb_f1))

Naive Bayes Cross-Validation Scores: [1. 1. 1. 1. 1.]
Mean Accuracy: 1.00

Multinomial Naive Bayes Cross-Validation Scores: [1. 1. 1. 1. 1.]
Mean Accuracy: 1.00
Evaluation on Test Set:

Naive Bayes:
Accuracy: 1.00, Precision: 1.00, Recall: 1.00, F1 Score: 1.00

Multinomial Naive Bayes:
Accuracy: 1.00, Precision: 1.00, Recall: 1.00, F1 Score: 1.00
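
Both pipelines in the cell above wrap `MultinomialNB`, so the two sets of scores describe the same algorithm. The following is a minimal sketch of how two different scikit-learn classifiers could be compared under the same protocol (5-fold cross-validation on the training split, then metrics on the held-out 20%). The toy corpus and its labels are hypothetical stand-ins for the assignment-three reviews, so the printed numbers say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not the assignment data).
positive = ["great mouse", "love the sensor", "excellent battery life",
            "very comfortable grip", "fast and responsive", "works perfectly",
            "amazing build quality", "good value", "smooth clicks",
            "highly recommend"]
negative = ["terrible mouse", "hate the sensor", "poor battery life",
            "very uncomfortable grip", "slow and laggy", "stopped working",
            "awful build quality", "bad value", "broken clicks",
            "do not recommend"]
texts = positive + negative
labels = ["positive"] * len(positive) + ["negative"] * len(negative)

# 80/20 split, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Two different algorithms behind the same TF-IDF features.
models = {
    "MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "LogisticRegression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)),
}

results = {}
for name, model in models.items():
    # Cross-validate on the training split only, so the test set stays unseen.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="weighted", zero_division=0)
    results[name] = {
        "cv_mean_accuracy": cv_scores.mean(),
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
    print(name, results[name])
```

Wrapping the vectorizer inside each pipeline means it is refit within every cross-validation fold, so no vocabulary statistics leak from the validation fold into training.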


# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict house prices from 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning technique. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split the data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you selected those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

# Loading the train data
train_data = pd.read_csv('/content/train.csv')

# Taking 'SalePrice' as the target variable
y = train_data['SalePrice']
X = train_data.drop('SalePrice', axis=1)

# Identifying all categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns

# Using ColumnTransformer for data preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), X.columns.difference(categorical_cols)),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

# Splitting the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Building a linear regression model with preprocessing
model = make_pipeline(preprocessor, SimpleImputer(strategy='mean'), LinearRegression())
model.fit(X_train, y_train)

# Making predictions on the validation set
y_pred = model.predict(X_val)

# Evaluating the model
mse = mean_squared_error(y_val, y_pred)
r2 = r2_score(y_val, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

# Loading the test data
test_data = pd.read_csv('/content/test.csv')

# Assuming the test data has the same columns as the training features
# Making predictions on the test set
test_predictions = model.predict(test_data)

# Saving the predictions into a csv file
submission_df = pd.DataFrame({'Id': test_data['Id'], 'SalePrice': test_predictions})
submission_df.to_csv('submission.csv', index=False)

Mean Squared Error: 4269008109.8464313
R-squared: 0.443438519492645
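
The reported MSE is hard to interpret because it is in squared price units; its square root (about 65,300) is on the same scale as `SalePrice`. The cell above also imputes missing values only after scaling and encoding. The following is a minimal sketch of an alternative layout in which each column type is imputed before it is scaled or encoded, with RMSE reported alongside R-squared. The toy DataFrame, its column names (`area`, `quality`, `zone`), and the synthetic price formula are hypothetical and are not taken from the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with a known linear signal plus noise.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "area": rng.normal(1500, 400, n),
    "quality": rng.integers(1, 11, n).astype(float),
    "zone": rng.choice(["A", "B", "C"], n),
})
toy["SalePrice"] = (100 * toy["area"] + 15000 * toy["quality"]
                    + rng.normal(0, 10000, n))

# Knock out 10% of two columns to mimic missing values.
toy.loc[toy.sample(frac=0.1, random_state=0).index, "area"] = np.nan
toy.loc[toy.sample(frac=0.1, random_state=1).index, "zone"] = np.nan

numeric_cols = ["area", "quality"]
categorical_cols = ["zone"]

# Impute first, then scale (numeric) or one-hot encode (categorical).
preprocessor = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"),
                          StandardScaler()), numeric_cols),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")),
     categorical_cols),
])
model = make_pipeline(preprocessor, LinearRegression())

X = toy.drop(columns="SalePrice")
y = toy["SalePrice"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

model.fit(X_train, y_train)
y_pred = model.predict(X_val)

# RMSE is in the same units as the target, which makes it easier to read.
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
r2 = r2_score(y_val, y_pred)
print(f"RMSE: {rmse:.2f}")
print(f"R-squared: {r2:.3f}")
```

Listing the feature columns explicitly also keeps identifier columns such as `Id` out of the model, which `X.columns.difference(categorical_cols)` does not do on its own.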


# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, RoBERTa, or any other related model.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources, number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting; NO fine-tuning is required. Evaluate the performance of the model by comparing its predictions with the ground truth (the labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset
import torch

# Loading the dataset
df = pd.read_csv('/content/Sentimental_Analyis.csv')

# Taking the 'review' and 'sentiment_category' columns from the dataset
X = df['review'].values
y = df['sentiment_category'].apply(lambda x: 1 if x == 'positive' else 0).values  # 'positive' -> 1, every other label -> 0

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Keeping 10% of the training split to speed up fine-tuning
X_train, _, y_train, _ = train_test_split(X_train, y_train, train_size=0.1, random_state=42)

# Loading BERT tokenizer and its model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)  # 2 for binary classification

# Tokenizing and encoding the text data
X_train_tokens = tokenizer(list(X_train), padding=True, truncation=True, return_tensors='pt', max_length=256)
X_test_tokens = tokenizer(list(X_test), padding=True, truncation=True, return_tensors='pt', max_length=256)

# Creating a DataLoader for training and testing sets
train_dataset = TensorDataset(X_train_tokens['input_ids'], X_train_tokens['attention_mask'], torch.tensor(y_train, dtype=torch.long))
test_dataset = TensorDataset(X_test_tokens['input_ids'], X_test_tokens['attention_mask'], torch.tensor(y_test, dtype=torch.long))

train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False)

# Fine-tune the BERT model on your task
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = torch.nn.CrossEntropyLoss()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training loop
num_epochs = 1
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Evaluating the model on the testing set
model.eval()
predictions = []
with torch.no_grad():
    for batch in test_dataloader:
        input_ids, attention_mask, labels = batch
        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        probabilities = torch.nn.functional.softmax(logits, dim=1)
        predicted_labels = torch.argmax(probabilities, dim=1)
        predictions.extend(predicted_labels.cpu().numpy())

# Calculating the required metrics
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Accuracy: 0.8000
Precision: 1.0000
Recall: 0.8000
F1 Score: 0.8889
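
The question asks for a zero-shot setting, whereas the cell above runs a fine-tuning loop on `bert-base-uncased`. The following is a minimal sketch of a zero-shot alternative, assuming the checkpoint `distilbert-base-uncased-finetuned-sst-2-english` from the Hugging Face Hub (a model already fine-tuned for sentiment on SST-2, chosen here only for illustration). The helper names `label_to_int`, `zero_shot_predict`, and `score` are hypothetical. The model call is wrapped in a function and not executed here, since it downloads weights on first use; only the scoring helper is run, on made-up predictions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Assumed checkpoint, used as-is with no further training.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"


def label_to_int(label):
    """Map the model's string label to the notebook's 1/0 encoding."""
    return 1 if label.upper() == "POSITIVE" else 0


def zero_shot_predict(texts, model_name=MODEL_NAME):
    """Classify texts with an off-the-shelf sentiment model (no training)."""
    # Imported lazily so the scoring helper works without transformers installed.
    from transformers import pipeline
    classifier = pipeline("sentiment-analysis", model=model_name)
    outputs = classifier(list(texts), truncation=True)
    return [label_to_int(o["label"]) for o in outputs]


def score(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# With the notebook's data the call would look like this (not executed here):
#   preds = zero_shot_predict(df["review"])
#   print(score(y, preds))

# Offline check of the scoring helper with made-up labels and predictions.
demo = score([1, 1, 0, 1, 0], [1, 0, 0, 1, 0])
print(demo)
```

Because no training happens, the whole labeled dataset can be used for evaluation rather than only a 20% test split.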
