# DATA2001 Assignment 2 (Weight: 25%)


The aim of this assignment is to gain practical experience in analysing unstructured data.
Submit only your completed Jupyter notebook (.ipynb format) via Blackboard. It must include your written answers in markdown cells and the results of all executed code cells.


The assignment comprises five main tasks: Data Exploration, Data Preprocessing, Model Training, Model Evaluation, and Model Analysis. Across these tasks you will address and compare two prediction problems: sentiment analysis and rating prediction.


The dataset for this assignment consists of text reviews of various Android applications and their corresponding ratings. Further information about the dataset can be found [here](https://huggingface.co/datasets/sealuzh/app_reviews).


## Task 1: Data Exploration





1. Load the dataset from the file "app_review.csv". How many records does the dataset contain? How many distinct classes are there in the dataset? Randomly select and print 5 reviews with a rating of '1' and 5 reviews with a rating of '5'.

In [1]:
import pandas as pd
import warnings
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')

In [1]:
# Provide your answers here
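
As a rough illustration of the steps in question 1, the sketch below works on a tiny hand-made DataFrame rather than the real file. The column names "Review" and "Rating" follow the field names used later in this assignment; the toy rows and the helper `sample_reviews` are assumptions made only for this example. In the notebook, the DataFrame would instead come from `pd.read_csv("app_review.csv")`.

```python
import pandas as pd

# Toy stand-in for pd.read_csv("app_review.csv"); rows are invented.
df = pd.DataFrame({
    "Review": ["Crashes on start", "Love it", "Useless",
               "Great app", "Works fine", "Best app ever"],
    "Rating": [1, 5, 1, 5, 4, 5],
})

n_records = len(df)                  # number of rows in the dataset
n_classes = df["Rating"].nunique()   # number of distinct rating values


def sample_reviews(frame, rating, n=5, seed=0):
    """Randomly pick up to n reviews with the given rating (hypothetical helper)."""
    subset = frame.loc[frame["Rating"] == rating, "Review"]
    # min() avoids an error when fewer than n reviews exist for this rating.
    return subset.sample(n=min(n, len(subset)), random_state=seed)


print(n_records, n_classes)
print(sample_reviews(df, 1).tolist())
print(sample_reviews(df, 5).tolist())
```

Fixing `random_state` keeps the random selection repeatable when the notebook is re-run.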

2. Is the class distribution balanced? To support your answer, create a bar plot with the classes on the x-axis and the number of reviews in each class on the y-axis. Additionally, based on your observations of the reviews and the class distribution, determine whether there are more positive or negative reviews.

In [2]:
# Provide your answers here
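
One possible way to build the bar plot for question 2 is sketched below. The ratings are invented toy values, so the resulting distribution says nothing about the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy labels standing in for df["Rating"]; values are invented.
ratings = pd.Series([1, 5, 5, 4, 5, 2, 5, 1, 4, 5], name="Rating")

# Count reviews per class and order the classes from lowest to highest rating.
counts = ratings.value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar(counts.index.astype(str), counts.values)
ax.set_xlabel("Rating")
ax.set_ylabel("Number of reviews")
ax.set_title("Class distribution")

# Relative share of each class, useful when judging balance.
shares = counts / counts.sum()
print(shares.round(2))
```

Comparing the shares of the low ratings with those of the high ratings gives a quick numerical check to accompany the plot.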


## Task 2: Data Preprocessing

- Use the provided "clean_data" function to remove unnecessary symbols and clean the dataset.



In [6]:
import re

def clean_data(text):
    """Strip URLs, HTML remnants and punctuation from a review string."""
    # Remove URLs (everything from http(s):// to the end of the line)
    text = re.sub(r'https?://.*[\r\n]*', '', text, flags=re.MULTILINE)
    # Remove HTML remnants before punctuation is stripped, so the tags still match
    text = re.sub(r'<a href', ' ', text)
    text = re.sub(r'<br\s*/?>', ' ', text)
    text = re.sub(r'&amp;', '', text)
    # Replace punctuation, symbols and apostrophes with spaces
    text = re.sub(r'[_"\-;%()|+&=*.,!?:#$@\[\]/\']', ' ', text)

    return text

In [3]:
# Provide your answers here
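
A cleaning function of this kind is usually applied to every review with `Series.apply`. The sketch below uses a simplified stand-in called `toy_clean` (an assumption for this example only) so that it runs on its own; in the notebook the provided `clean_data` function would take its place.

```python
import re

import pandas as pd


def toy_clean(text):
    # Hypothetical stand-in for the provided clean_data function.
    return re.sub(r"[.,!?]", " ", text)


df = pd.DataFrame({"Review": ["Great app!", "Slow, buggy."], "Rating": [5, 1]})

# fillna/astype make sure every entry is a string before the regex runs.
df["Review"] = df["Review"].fillna("").astype(str).apply(toy_clean)
print(df["Review"].tolist())
```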


- Split the clean dataset into separate train and test sets. For this, use the "Review" field as the feature vector (X) and the "Rating" field as the label vector (Y).

In [8]:
from sklearn.model_selection import train_test_split

In [4]:
# Provide your answers here
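
The split can be sketched as follows on toy data. The 80/20 ratio, the fixed `random_state` and the use of `stratify` are assumptions for this example; the assignment text does not prescribe them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the cleaned dataset.
df = pd.DataFrame({
    "Review": [f"review {i}" for i in range(10)],
    "Rating": [1, 1, 1, 1, 1, 5, 5, 5, 5, 5],
})

X = df["Review"]   # features
y = df["Rating"]   # labels

# stratify=y keeps the class proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))
```

Stratifying is worth considering when the classes are imbalanced, because a purely random split may leave a rare class almost absent from the test set.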



- Transform the cleaned data into a numerical representation using Bag of Words (BoW) and remove any stop words. Save the BoW representation in the variables train_data_BOW and test_data_BOW.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

from nltk.corpus import stopwords
import nltk


nltk.download('stopwords')
nltk.download('omw-1.4')
nltk.download('wordnet')

In [5]:
# Provide your answers here
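
A minimal Bag of Words sketch is shown below. To stay self-contained it uses scikit-learn's built-in English stop-word list (`stop_words="english"`), which is an assumption for this example; the NLTK list downloaded above could be passed instead as `stop_words=stopwords.words("english")`. The example sentences are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_text = [
    "this app is great",
    "the app crashes all the time",
    "great design and fast",
]
test_text = ["the design is great", "crashes again"]

vectorizer = CountVectorizer(stop_words="english")

# Learn the vocabulary from the training text only, then reuse it for the
# test text so both matrices share the same columns.
train_data_BOW = vectorizer.fit_transform(train_text)
test_data_BOW = vectorizer.transform(test_text)

print(train_data_BOW.shape, test_data_BOW.shape)
```

Fitting on the training set only avoids leaking information from the test reviews into the vocabulary.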


## Task 3: Model Training

Define two Logistic Regression models, *model1* and *model2*, and train them as follows:
- Train the first Logistic Regression model to predict the application rating (Y).


In [6]:
from sklearn.linear_model import LogisticRegression

# Provide your answers here
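
The sketch below trains a multi-class Logistic Regression on a few invented reviews. The `max_iter=1000` setting is an assumption chosen to give the solver room to converge; it is not required by the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data; in the notebook these come from the earlier tasks.
train_text = [
    "terrible crashes", "bad slow",
    "okay average", "fine okay",
    "great love", "excellent great",
]
y_train = [1, 1, 3, 3, 5, 5]

vectorizer = CountVectorizer()
train_data_BOW = vectorizer.fit_transform(train_text)

# With more than two distinct labels the model learns one weight row per class.
model1 = LogisticRegression(max_iter=1000)
model1.fit(train_data_BOW, y_train)

print(model1.classes_)
print(model1.coef_.shape)
```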


- Create an additional binary label by assigning 1 (positive) to ratings 4 and 5, and -1 (negative) to ratings 1, 2 and 3. Store the results in y_train_binary and y_test_binary.

*Tip: you can use the copy.deepcopy function to create a copy of the label variables.*

In [13]:
import copy

In [10]:
# Provide your answers here
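
One way to derive the binary labels is sketched below, following the tip about `copy.deepcopy`. The helper name `to_binary` and the toy label arrays are assumptions for this example.

```python
import copy

import numpy as np

# Toy rating labels standing in for the real y_train and y_test.
y_train = np.array([5, 4, 3, 1, 2, 5])
y_test = np.array([2, 4])


def to_binary(labels):
    """Map ratings 4-5 to 1 and ratings 1-3 to -1 (hypothetical helper)."""
    binary = copy.deepcopy(labels)   # work on a copy, keep the ratings intact
    positive = labels >= 4           # mask computed from the original ratings
    binary[positive] = 1
    binary[~positive] = -1
    return binary


y_train_binary = to_binary(y_train)
y_test_binary = to_binary(y_test)
print(y_train_binary, y_test_binary)
```

Computing the mask from the original labels matters: once a rating has been overwritten with 1, it can no longer be told apart from a genuine rating of 1.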


- Train the second Logistic Regression model to predict the binary sentiment label (Y_binary).


In [8]:
# Provide your answers here



- Make and store test-set predictions for both models.

In [9]:
# Provide your answers here
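
Training the second model and storing predictions for both models might look like the sketch below. All data is invented, and the variable names `pred_rating` and `pred_binary` are assumptions for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_text = ["terrible crashes", "slow terrible", "great love", "excellent great"]
test_text = ["great excellent", "terrible slow"]
y_train = np.array([1, 2, 5, 4])
y_train_binary = np.where(y_train >= 4, 1, -1)   # 4-5 -> 1, 1-3 -> -1

vectorizer = CountVectorizer()
train_data_BOW = vectorizer.fit_transform(train_text)
test_data_BOW = vectorizer.transform(test_text)

# Same features, different targets: ratings for model1, sentiment for model2.
model1 = LogisticRegression(max_iter=1000).fit(train_data_BOW, y_train)
model2 = LogisticRegression(max_iter=1000).fit(train_data_BOW, y_train_binary)

# Keep the predictions in variables so the evaluation step can reuse them.
pred_rating = model1.predict(test_data_BOW)
pred_binary = model2.predict(test_data_BOW)
print(pred_rating, pred_binary)
```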



## Task 4: Model Evaluation

- Compute and compare the test accuracy of Model 1 and Model 2. Based on your results, analyze which task is easier: binary sentiment prediction or multi-class rating prediction.

In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Provide your answers here
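
The comparison can be sketched with hand-made label lists. The values below are invented to show the mechanics only; here the binary predictions are simply the rating predictions collapsed with the same 4-5 versus 1-3 rule.

```python
from sklearn.metrics import accuracy_score

# Toy true labels and predictions for the rating task (model 1).
y_test = [5, 4, 1, 3, 5, 2]
pred_rating = [5, 5, 1, 4, 5, 1]

# The same samples expressed as binary sentiment (model 2).
y_test_binary = [1, 1, -1, -1, 1, -1]
pred_binary = [1, 1, -1, 1, 1, -1]

acc_model1 = accuracy_score(y_test, pred_rating)
acc_model2 = accuracy_score(y_test_binary, pred_binary)

print(f"Model 1 accuracy: {acc_model1:.3f}")
print(f"Model 2 accuracy: {acc_model2:.3f}")
```

In this toy case, confusing a 4 with a 5 counts as an error for rating prediction but not for sentiment prediction, which is one reason the two accuracies can differ.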




- For Model 1, compute additional evaluation measures, namely the confusion matrix, precision and recall.

In [12]:
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Provide your answers here
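
A sketch of these measures on invented labels follows. Passing `labels` explicitly fixes the row and column order of the matrix, and `zero_division=0` is an assumption used here to silence warnings for classes that are never predicted.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy true ratings and model 1 predictions.
y_test = [1, 1, 2, 3, 4, 5, 5, 5]
pred_rating = [1, 5, 1, 3, 5, 5, 5, 1]
labels = [1, 2, 3, 4, 5]

# Rows are true ratings, columns are predicted ratings.
cm = confusion_matrix(y_test, pred_rating, labels=labels)
print(cm)

# Per-class precision, recall and F1 in one table.
report = classification_report(
    y_test, pred_rating, labels=labels, zero_division=0
)
print(report)
```

The imported `ConfusionMatrixDisplay` can be used to draw the same matrix as a labelled heatmap.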


- Using the confusion matrix from the previous question (Model 1, the Logistic Regression for rating prediction), state how many samples were predicted to have a rating of 1 (the lowest rating) when their actual rating was 5 (the highest rating).

**Provide your answer here**
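
When reading a single cell from a scikit-learn confusion matrix, the row index is the true class and the column index is the predicted class. The toy example below (invented labels) shows how the two directions of error differ.

```python
from sklearn.metrics import confusion_matrix

labels = [1, 2, 3, 4, 5]
y_true = [5, 5, 5, 1, 1, 3]
y_pred = [1, 1, 5, 1, 5, 3]

cm = confusion_matrix(y_true, y_pred, labels=labels)

# Row = actual rating 5, column = predicted rating 1.
true_5_pred_1 = cm[labels.index(5), labels.index(1)]

# The opposite error: actual rating 1, predicted rating 5.
true_1_pred_5 = cm[labels.index(1), labels.index(5)]

print(true_5_pred_1, true_1_pred_5)
```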




## Task 5: Model Analysis



- Discuss the importance of considering alternative evaluation measures, such as precision and recall, instead of relying solely on accuracy. Based on this discussion, identify the most suitable evaluation metric for Model 1.

**Provide your answer here**
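
A small invented example illustrates why accuracy alone can mislead on imbalanced data: a classifier that always predicts the majority rating scores well on accuracy while never identifying the minority class.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy imbalanced test set: nine 5-star reviews and one 1-star review.
y_true = [5] * 9 + [1]
# A trivial "model" that always predicts the majority class.
y_pred = [5] * 10

acc = accuracy_score(y_true, y_pred)

# Recall per class, in the order given by labels.
per_class_recall = recall_score(y_true, y_pred, average=None, labels=[1, 5])

# Macro averaging weights every class equally, regardless of its size.
macro_recall = recall_score(y_true, y_pred, average="macro")

print(acc, per_class_recall, macro_recall)
```

Here the accuracy is 0.9, yet the recall for rating 1 is 0 and the macro-averaged recall is only 0.5.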



- For binary sentiment prediction (Model 2), visualize important words with their model coefficients.  

*Tip: you can reuse the plot_coefficients function from the practical session.*

In [13]:
# Provide your answers here


- Analyze the quality of the features that Model 2 relies on by examining the words with the largest positive coefficients (positive class) and the largest negative coefficients (negative class). Identify any potential bias in the model, and explain how this bias could affect its performance.

**Provide your answer here**
