# Text Mining Project Work (Team 1)

**Opinion Mining on Amazon Reviews with Logistic Regression and Deep Learning Models on top of TF-IDF Features**

_Prof. Gianluca Moro, Prof. Giacomo Frisoni – DISI, University of Bologna_

name.surname@unibo.it


**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Instructions
- The provided exercises must be executed by the students of Team 1.
- At the end, the file must contain all the required results (as code cell outputs) along with all the commands necessary to reproduce them.
- The function of every command or group of related commands must be documented clearly and concisely.
- The submission deadline is March 18th, 2024.
- When finished, one team member must email the notebook file (with the .ipynb extension) from a BBS email account to the teacher (giacomo.frisoni@unibo.it), using “[BBS Teamwork] Your last names” as the subject. Keep your own copy of the file for safety.
- You are allowed to consult the teaching material and to search the Web for quick reference.
- If still in doubt about anything, ask the teacher.
- Communicating with other teams is strictly forbidden. Ask the teacher for any clarification about the exercises.
- Each correctly developed point counts 2/30.

## Setup

The following cell contains some necessary imports.

In [None]:
import numpy as np
import pandas as pd
import gzip
import json
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import os
from urllib.request import urlretrieve

Run the following to download the necessary files.

In [None]:
def download(file, url):
    if not os.path.exists(file):
        urlretrieve(url, file)

In [None]:
download("Magazine_Subscriptions.json.gz", "https://www.dropbox.com/s/g6om8q8c8pvirw8/Magazine_Subscriptions.json.gz?dl=1")

In [None]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Exercises

**1)** In the `Magazine_Subscriptions.json.gz` file, we provide a dataset composed of several reviews posted on Amazon.com about Magazine Subscriptions.
The file is gzip-compressed, so you must open it with a gzip decompression library.

Each review is labeled with a score between 1 and 5 stars (represented by the ```overall``` feature).

The text of each review is represented by the ```reviewText``` feature, which will be our input data along with the ```overall``` one.

Load the dataset by putting it in a new Pandas dataframe. The data is stored as a JSON file, so you will have to use a Python package to load JSON data into a variable.
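
As a rough illustration of point (1), the sketch below reads a gzip-compressed file into a DataFrame. It assumes the file stores one JSON object per line (JSON Lines), which the text above does not state explicitly. The helper name `load_json_lines_gz` and the tiny synthetic file are hypothetical and only make the example runnable on its own.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def load_json_lines_gz(path):
    """Read a gzip-compressed file holding one JSON object per line (assumed layout)."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return pd.DataFrame(records)


# Tiny synthetic file so the sketch runs on its own (not the real dataset).
sample_rows = [
    {"overall": 5.0, "reviewText": "Great magazine"},
    {"overall": 1.0, "reviewText": "Never arrived"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for row in sample_rows:
        handle.write(json.dumps(row) + "\n")

sample_df = load_json_lines_gz(sample_path)
print(sample_df.shape)
```

If the assumption holds, the same helper could be pointed at `Magazine_Subscriptions.json.gz`.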

**2)** Print the number of rows in the dataset and display the first 5 rows.

**3)** Undersample the data by the `overall` feature in order to obtain a class-balanced dataset.
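
One possible way to balance classes is sketched below on made-up data. It uses a pandas-only approach (`groupby(...).sample`) instead of the `RandomUnderSampler` imported in the setup cell, so treat it as an alternative illustration and not as the expected solution. The `toy` DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced data: four 5-star, two 4-star, one 1-star review.
toy = pd.DataFrame({
    "overall": [5, 5, 5, 5, 4, 4, 1],
    "reviewText": ["a", "b", "c", "d", "e", "f", "g"],
})

# Size of the rarest class: every class is cut down to this many rows.
smallest = toy["overall"].value_counts().min()

# Draw `smallest` random rows from each rating group.
balanced = toy.groupby("overall").sample(n=smallest, random_state=0)
print(balanced["overall"].value_counts())
```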



**4)** Cast the `reviewText` column to Unicode strings.

**5)** Keep only the `reviewText` and `overall` attributes of the dataset, and place them into a DataFrame.



**6)** Verify the distribution of the number of stars.

**7)** Remove from the dataframe the reviews rated with 3 stars.

**8)** Add a `label` column to the DataFrame whose value is `"pos"` for reviews with 4 stars, `"very_pos"` for 5-rated reviews, `"neg"` for reviews with 2 stars, and `"very_neg"` for 1-rated reviews.
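
Points (7) and (8) can be sketched with a boolean filter followed by a dictionary lookup. The `toy` DataFrame and the `STAR_TO_LABEL` name below are hypothetical; the ratings are cast to integers first on the assumption that they may be stored as floats.

```python
import pandas as pd

# Mapping from star rating to the label names required by point (8).
STAR_TO_LABEL = {1: "very_neg", 2: "neg", 4: "pos", 5: "very_pos"}

toy = pd.DataFrame({"overall": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Drop the 3-star reviews, then derive the label from the rating.
toy = toy[toy["overall"] != 3].copy()
toy["label"] = toy["overall"].astype(int).map(STAR_TO_LABEL)
print(toy)
```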

**9)** Split the dataset randomly into a training set with 70% of data and a test set with the remaining 30%, stratifying the split by the `label` variable.
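
A stratified 70/30 split might look like the sketch below, which relies on the `stratify` argument of `train_test_split`. The data is synthetic and the `random_state` value is an arbitrary choice made for reproducibility.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 10 reviews for each of the 4 labels.
labels = ["very_neg", "neg", "pos", "very_pos"]
toy = pd.DataFrame({
    "reviewText": [f"review {i}" for i in range(40)],
    "label": labels * 10,
})

# stratify keeps the label proportions equal in both partitions.
train_df, test_df = train_test_split(
    toy, test_size=0.3, stratify=toy["label"], random_state=42
)
print(len(train_df), len(test_df))
print(test_df["label"].value_counts())
```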

**10)** Create a TF-IDF vector space model from the training reviews, excluding words that appear in fewer than 3 documents and using bigrams in addition to single words. Then, extract the document-term matrix for the training reviews.
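
The two constraints in point (10) correspond to the `min_df` and `ngram_range` parameters of `TfidfVectorizer`. The sketch below shows them on a made-up four-document corpus; the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus.
train_texts = [
    "great magazine",
    "great magazine indeed",
    "great magazine again",
    "bad delivery",
]

# min_df=3 drops terms found in fewer than 3 documents;
# ngram_range=(1, 2) keeps both single words and bigrams.
vectorizer = TfidfVectorizer(min_df=3, ngram_range=(1, 2))

# Learn the vocabulary on the training texts and build their document-term matrix.
X_train = vectorizer.fit_transform(train_texts)

# Unseen texts are projected onto the same vocabulary with transform().
X_test = vectorizer.transform(["great magazine today"])

print(sorted(vectorizer.vocabulary_))
print(X_train.shape, X_test.shape)
```

Fitting only on the training reviews keeps test-set vocabulary statistics out of the model.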

**11)** Train a logistic regression classifier on the training reviews, using the representation created above.

**12)** Verify the accuracy of the classifier on the test set.

**13)** Get the model predictions and print the confusion matrix.
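
For points (12) and (13), accuracy and the confusion matrix can both be computed from the true and predicted labels. The sketch below uses hand-written labels in place of real model output, so the numbers say nothing about the actual task.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for five test reviews.
y_true = ["neg", "neg", "pos", "pos", "pos"]
y_pred = ["neg", "pos", "pos", "pos", "neg"]

# Fraction of reviews whose predicted label matches the true one.
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["neg", "pos"])

print(acc)
print(cm)
```

The resulting matrix can be passed to the `ConfusionMatrixDisplay` class imported in the setup cell for plotting.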

**14)** Train a Deep Learning model of your choice (e.g., MLP, RNN) using the document-term representation built in point (10) and evaluate it on test data. Try to maximize the model accuracy.

**15)** Get the predictions of this latter model and compare them with the Logistic Regression model trained in point (11) using the chi-square test at a confidence level of 95% (i.e. p-value must be <= 0.05 for models to be significantly different).

Hint: you will need to convert the predictions of both models into integer arrays for comparison.
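
One way to follow the hint is sketched here, assuming the logistic regression returns string labels and the deep learning model returns one row of class probabilities per review. Both are assumptions about your models, and the arrays below are invented.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the label set; classes are sorted alphabetically.
encoder = LabelEncoder().fit(["neg", "pos", "very_neg", "very_pos"])

# String labels (true values and logistic regression output) become integers.
y_test_int = encoder.transform(["pos", "neg", "very_pos"])
logreg_pred_int = encoder.transform(["pos", "pos", "very_pos"])

# Hypothetical class probabilities from a neural network, one row per review,
# with columns in the same order as encoder.classes_.
probs = np.array([
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
])
dl_pred_int = probs.argmax(axis=1)

print(y_test_int, logreg_pred_int, dl_pred_int)
```

The column order of the network output must match `encoder.classes_` for the integer codes to be comparable.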


To calculate the p-value, you can use the provided `chi2_pval` function, passing the arrays containing the predicted labels from the two models as well as the true labels.

```
chi2_pval(model1_predictions, model2_predictions, y_test)
```

In [None]:
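from scipy.stats import chi2_contingency

def chi2_pval(p1, p2, y_test):
    """Return the chi-square p-value comparing the error patterns of two models."""
    # Convert to NumPy arrays so the comparison is positional, not index-based
    y_true = np.asarray(y_test)

    # Boolean masks marking the test instances each model misclassifies
    model1_errors = np.asarray(p1) != y_true
    model2_errors = np.asarray(p2) != y_true

    # Construct contingency table (model 1 wrong? vs. model 2 wrong?)
    contingency_table = pd.crosstab(model1_errors, model2_errors)
    print(contingency_table)

    # Calculate Chi-square test statistic and p-value (no continuity correction)
    q_statistic, p_value = chi2_contingency(contingency_table, correction=False)[:2]

    return p_value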

