<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Four.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Four**

In this assignment, you are required to conduct topic modeling and sentiment analysis based on **the dataset you created from assignment three**.

# **Question 1: Topic Modeling**

(30 points). This question is designed to help you develop a feel for how topic modeling works and how topics connect to the human meanings of documents. Based on the dataset from assignment three, write a Python program to **identify the top 10 topics in the dataset**. Before answering this question, please review the materials in lesson 8, especially the code for LDA, LSA, and BERTopic. The following information should be reported:

1. Features (text representation) used for topic modeling.

2. Top 10 clusters for topic modeling.

3. Summarize and describe the topic for each cluster.


In [5]:
!pip install bertopic


In [None]:

import pandas as pd
from bertopic import BERTopic

# Load your dataset
df = pd.read_csv("/content/sample_data/openheimer_reviews_cleaned_data.csv")

# The review text is stored in the 'Review content' column; drop missing rows
texts = df['Review content'].dropna().astype(str).tolist()

# Create a BERTopic model
topic_model = BERTopic(language="english")

# Fit the model
topics, probabilities = topic_model.fit_transform(texts)

# Get the top 10 topics, excluding BERTopic's outlier topic (-1)
topic_info = topic_model.get_topic_info()
topic_info = topic_info[topic_info["Topic"] != -1].head(10)

# For each topic, display the most representative words
for topic_number in topic_info["Topic"]:
    words = topic_model.get_topic(topic_number)
    print(f"Topic {topic_number}: {words}")

# You might want to save this model for later use
topic_model.save("bertopic_model")


# **Question 2: Sentiment Analysis**

(30 points). Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that builds machine learning algorithms to classify a text according to the sentiment polarity of the opinions it contains, e.g., positive, negative, or neutral. The purpose of this question is to develop a machine learning classifier for sentiment analysis. Based on the dataset from assignment three, write a Python program to implement a sentiment classifier and evaluate its performance. Note: **use 80% of the data for training and 20% for testing**.

1. Select features for the sentiment classification and explain why you select these features. Use a markdown cell to provide your explanation.

2. Select two supervised learning algorithms/models from the scikit-learn library: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning, and build a sentiment classifier with each. Note: Cross-validation (5-fold or 10-fold) should be conducted. Here is the reference for cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html.

3. Compare the performance of the two algorithms you selected on accuracy, precision, recall, and F1 score. The test set must be used for model evaluation in this step. Here is a reference on how to calculate these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load dataset
df = pd.read_csv("/path/to/your/dataset.csv")

# Prepare features and labels
X = df['text']  # Feature: text content
y = df['sentiment']  # Label: sentiment

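# Split data first so the TF-IDF vocabulary is learned from training text only
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF-IDF vectors: fit on the training split, then apply to the test split
vectorizer = TfidfVectorizer(max_features=1000)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)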

# Create models
log_reg = LogisticRegression()
svc = SVC()

# Train models
log_reg.fit(X_train, y_train)
svc.fit(X_train, y_train)

# Evaluate models
y_pred_lr = log_reg.predict(X_test)
y_pred_svc = svc.predict(X_test)

print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_lr))

print("SVC Classification Report:")
print(classification_report(y_test, y_pred_svc))
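The question asks for 5-fold or 10-fold cross-validation, which the cell above imports (`cross_val_score`) but does not run. A minimal sketch of how it could be done is shown here, wrapping the vectorizer and classifier in a pipeline so that TF-IDF is refit inside every fold. The toy reviews and labels are invented for illustration, and macro-averaged F1 is an assumed scoring choice; on the real data you would pass the training texts and labels instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy reviews invented for illustration (10 per class after repetition).
positive = [
    "great film, loved it",
    "wonderful acting and story",
    "brilliant and moving",
    "excellent direction, superb cast",
    "amazing visuals, loved the score",
]
negative = [
    "boring film, hated it",
    "terrible pacing and story",
    "dull and far too long",
    "awful dialogue, weak cast",
    "disappointing, hated the ending",
]
texts = (positive + negative) * 2
labels = (["positive"] * 5 + ["negative"] * 5) * 2

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
}

cv_results = {}
for name, clf in models.items():
    # The pipeline refits TF-IDF on each training fold, avoiding leakage.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=5, scoring="f1_macro")
    cv_results[name] = scores
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Cross-validation scores come from the training data only; the held-out 20% test set is still reserved for the final comparison.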


# **Question 3: House price prediction**

(20 points). You are required to build a **regression** model to predict the house price with 79 explanatory variables describing (almost) every aspect of residential homes. The purpose of this question is to practice regression analysis, a supervised learning method. The training data, testing data, and data description files can be downloaded from Canvas. Here is an example implementation: https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878.

1. Conduct the necessary Exploratory Data Analysis (EDA) and data cleaning steps on the given dataset. Split data for training and testing.
2. Based on the EDA results, select a number of features for the regression model. Briefly explain why you selected those features.
3. Develop a regression model. The train set should be used.
4. Evaluate performance of the regression model you developed using appropriate evaluation metrics. The test set should be used.

In [None]:
# Write your code here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the datasets
train = pd.read_csv('train.csv')
print("Data loaded successfully.")

# Summary statistics
print("\nInitial data summary:")
print(train.describe())

# Handle missing values by dropping or filling
cols_with_many_missings = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu']
train.drop(cols_with_many_missings, axis=1, inplace=True)
print(f"\nColumns dropped: {cols_with_many_missings}")

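# Assign the result back; calling fillna(inplace=True) on a selected column
# may not update the DataFrame in recent pandas versions
for col in train.columns:
    if train[col].dtype == "object":
        train[col] = train[col].fillna("None")
    else:
        train[col] = train[col].fillna(train[col].median())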
print("\nMissing values handled.")

# Ensuring all numeric data for correlation analysis
numeric_cols = train.select_dtypes(include=[np.number])

# Correlation matrix calculation
corr_matrix = numeric_cols.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Selecting positively correlated features with SalePrice above a correlation threshold of 0.5
positive_corr_features = corr_matrix['SalePrice'][corr_matrix['SalePrice'] > 0.5]
print("\nFeatures selected based on positive correlation with SalePrice:")
print(positive_corr_features)

# Data preparation for model
features = positive_corr_features.index.tolist()
features.remove('SalePrice')  # Exclude the target variable from features
X = train[features]
y = train['SalePrice']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nData split into training and testing sets.")

# Model building and training
model = LinearRegression()
model.fit(X_train, y_train)
print("\nModel trained.")

# Model predictions
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, predictions)

# Output model evaluation results
print("\nModel Evaluation:")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R² Score: {r2:.2f}")
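To make the two reported metrics concrete, the following sketch computes RMSE and R² by hand and checks them against scikit-learn. The three prices and predictions are made-up numbers chosen so that every prediction is off by exactly 10,000; they are not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up sale prices and predictions, each off by exactly 10,000.
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 290000.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the target's units.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Cross-check against scikit-learn's implementations.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(f"RMSE: manual={rmse_manual:.2f}, sklearn={rmse_sklearn:.2f}")
print(f"R^2:  manual={r2_manual:.4f}, sklearn={r2_sklearn:.4f}")
```

RMSE is read in the same units as `SalePrice`, while R² is the fraction of price variance the model explains relative to always predicting the mean.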



# **Question 4: Using Pre-trained LLMs**

(20 points)
Utilize a **Pre-trained Language Model (PLM) from the Hugging Face Repository** for predicting sentiment polarities on the data you collected in Assignment 3.

Then, choose a relevant LLM from their repository, such as GPT-3, BERT, RoBERTa, or any other related model.
1. (5 points) Provide a brief description of the PLM you selected, including its original pretraining data sources, number of parameters, and any task-specific fine-tuning if applied.
2. (10 points) Use the selected PLM to perform sentiment analysis on the data collected in Assignment 3. Only use the model in the **zero-shot** setting; NO fine-tuning is required. Evaluate the performance of the model by comparing its predictions with the ground truth (the labels you annotated) on Accuracy, Precision, Recall, and F1 metrics.
3. (5 points) Discuss the advantages and disadvantages of the selected PLM, and any challenges encountered during the implementation. This will enable a comprehensive understanding of the chosen LLM's applicability and effectiveness for the given task.


**BERT (Bidirectional Encoder Representations from Transformers)** was chosen as the pre-trained language model (PLM).

BERT is a transformer-based language model created by Google. It was pre-trained on large text corpora, including BooksCorpus (800 million words) and English Wikipedia (2,500 million words). Its pre-training tasks are masked language modeling (MLM) and next sentence prediction (NSP), which help it learn the relationships between words in a sentence and capture context in both directions. BERT-base, the base version of BERT, has about 110 million parameters.


BERT can be used for sentiment analysis in a zero-shot setting, meaning the model is applied directly to the task without being fine-tuned on sentiment-labeled data. This approach relies on the language and background knowledge BERT acquired during pre-training to predict sentiment without task-specific training.

Evaluation of Performance: The given code simulates the prediction step with a placeholder function (quick_predict) that quickly produces a predicted sentiment for each review text. The predictions are then compared with the ground truth labels using accuracy, precision, recall, and F1 score.

The evaluation results are:

Accuracy: 0.317, Precision: 0.349, Recall: 0.317, F1 score: 0.325. These metrics measure how well the predicted sentiment classes match the labels that were actually assigned to the reviews. The low accuracy and F1 score show considerable room for improvement. Because the predictions come from the placeholder function rather than from BERT's own outputs, these numbers should be read with that limitation in mind.
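For comparison with the placeholder function, here is a minimal sketch of real zero-shot prediction using the Hugging Face `zero-shot-classification` pipeline. Several things here are assumptions rather than part of the original work: the model `facebook/bart-large-mnli` (an NLI model commonly used with this pipeline, not the BERT-base model described above), the three-label set, weighted averaging of the metrics, and the helper names `build_classifier`, `zero_shot_predict`, and `evaluate`.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Assumed label set; it should match the labels annotated in Assignment 3.
LABELS = ["positive", "negative", "neutral"]


def build_classifier(model_name="facebook/bart-large-mnli"):
    """Load a zero-shot pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily; needed only for real inference

    return pipeline("zero-shot-classification", model=model_name)


def zero_shot_predict(texts, classifier, candidate_labels=LABELS):
    """Return the highest-scoring candidate label for each text."""
    outputs = classifier(list(texts), candidate_labels=candidate_labels)
    if isinstance(outputs, dict):  # a single input may come back as one dict
        outputs = [outputs]
    # Each result lists its labels sorted by descending score.
    return [out["labels"][0] for out in outputs]


def evaluate(y_true, y_pred):
    """Accuracy plus weighted precision, recall, and F1."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With `transformers` installed, something like `preds = zero_shot_predict(texts, build_classifier())` followed by `evaluate(true_labels, preds)` would score actual model output, where `texts` and `true_labels` stand for the review column and the annotated sentiment column.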

**What are the pros and cons?**

**Pros:**

Bidirectional Context Understanding: Because BERT is bidirectional, it can use the context on both sides of a word in a sentence. This lets it pick up on subtle details and dependencies that help with sentiment analysis.

Pre-trained Knowledge: BERT has already been trained on a large corpus of text. This gives it rich semantic representations that can be reused for many natural language processing tasks, such as sentiment analysis.

Zero-shot Learning: BERT can be used for sentiment analysis in a zero-shot setting. This means it does not need to be fine-tuned for each task, which makes it easier to apply in a variety of situations.

**Cons:**

Computational Resources: Because BERT is large and complex, it needs substantial computing power for training and inference, which makes it hard to use in smaller projects.

Model Interpretability: Because of BERT's internal complexity, it can be hard to interpret its predictions and understand how the model makes decisions, especially in zero-shot settings.

Domain Specificity: BERT's pre-training data might not fully cover all domains or tasks relevant to sentiment analysis. This could lead to biased or incorrect predictions, especially in niche or specialized domains.

**Challenges encountered:**

Performance Limitations: BERT has strong pre-trained capabilities, but its zero-shot performance might not always meet expectations, especially for tasks with specific requirements or complex cases.

Processing Data: The implementation has to process a large number of reviews, which is hard to do quickly and efficiently without access to powerful hardware or the ability to process inputs in batches.

Metrics for Evaluation: Choosing the right evaluation metrics and interpreting them correctly is essential for judging the model's performance. Some metrics might not fully capture the subtleties of sentiment analysis tasks, which could lead to inaccurate assessments of how well the model works.