# Results

*Team: nada401*

In this notebook, we present the initial results of our project.

Structure:
- Data cleaning and exploration
    - Language tagging
    - Data cleaning
    - Data exploration
- Expert metric
    - Ad-hoc metric
    - Language depth
    - Embeddings

In [None]:
import pandas as pd

### Reproducibility

To reproduce the results of this notebook, first download the datasets from [here](https://drive.google.com/drive/folders/1Wz6D2FM25ydFw_-41I9uTwG9uNsN4TCF) and unzip them into `./data/`.

Only the datasets
- `RateBeer`
- `BeerAdvocate`

are used for this part of the project.
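
Before running the cells below, it can help to check that the expected folders are in place. The following is a minimal sketch: the helper name `missing_datasets` is hypothetical, and only the two folder names and the `./data` location come from the text above.

```python
from pathlib import Path

# Folder names taken from the list above; the helper itself is hypothetical.
REQUIRED_DATASETS = ("RateBeer", "BeerAdvocate")


def missing_datasets(data_dir="./data"):
    """Return the required dataset folders that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_DATASETS if not (root / name).is_dir()]


missing = missing_datasets()
print("Missing dataset folders:", missing or "none")
```

An empty list means both datasets were unzipped where the later cells expect them.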

### Notation

To distinguish the two datasets, we use a common naming scheme:
- `*_RB` contains data from RateBeer
- `*_BA` contains data from BeerAdvocate

## Data cleaning and exploration

The first step in developing this project is to clean and explore the data.

### Language tagging

A first manual analysis of the datasets revealed that the reviews are written in several languages, so we first measured the share of each language to guide our next steps.

To achieve this, we tested several Python packages for language tagging, including:
- [fast_langdetect](https://github.com/LlmKira/fast-langdetect)
- [langdetect](https://pypi.org/project/langdetect/)
- [lingua](https://github.com/pemistahl/lingua-py)

We finally opted for the first, as it was the fastest while still offering good precision.
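
A comparison of this kind can be set up with a small harness that runs each tagger on the same labelled sample, times it, and records how often it returns the expected language. The sketch below is illustrative only: `benchmark` and the two toy detectors are hypothetical stand-ins rather than the APIs of the packages listed above, and the sample sentences are made up.

```python
import time


def benchmark(detectors, samples):
    """Time each detector on (text, expected_language) pairs and measure accuracy.

    detectors: dict mapping a name to a callable str -> language code.
    samples: list of (text, expected_language_code) tuples.
    """
    results = {}
    for name, detect in detectors.items():
        start = time.perf_counter()
        predictions = [detect(text) for text, _ in samples]
        elapsed = time.perf_counter() - start
        correct = sum(pred == expected for pred, (_, expected) in zip(predictions, samples))
        results[name] = {"seconds": elapsed, "accuracy": correct / len(samples)}
    return results


# Toy stand-ins for real taggers (hypothetical, for illustration only).
def always_english(text):
    return "en"


def keyword_detector(text):
    words = text.lower().split()
    return "it" if "birra" in words else "en"


samples = [
    ("Pours a hazy golden color with a thin white head", "en"),
    ("Una birra molto amara con note di agrumi", "it"),
]
report = benchmark({"always_english": always_english, "keyword": keyword_detector}, samples)
for name, stats in report.items():
    print(f"{name}: accuracy={stats['accuracy']:.2f}, time={stats['seconds']:.6f}s")
```

To compare real packages, each one would be wrapped in a function with the same `str -> language code` signature and passed in through the `detectors` dictionary.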

In [None]:
from src.scripts.lang_tagger import lang_tagger

'''
Execute the language tagging process.
This function calls the pipeline that reads the .txt.gz files and creates a .csv file
with only the reviews and a few other columns for indexing purposes.
'''
lang_tagger.tag_datasets()

In [None]:
d_BA = pd.read_csv("./data/BeerAdvocate/reviews_tagged.csv")
d_RB = pd.read_csv("./data/RateBeer/reviews_tagged.csv")

eng_perc_BA = d_BA["lang_tag"].value_counts()['en']/d_BA["lang_tag"].count() * 100
eng_perc_RB = d_RB["lang_tag"].value_counts()['en']/d_RB["lang_tag"].count() * 100

print(f"Percentage of English reviews in BeerAdvocate = {eng_perc_BA:.3f}%")
print(f"Percentage of English reviews in RateBeer = {eng_perc_RB:.3f}%")

print("\nNumber of reviews for the 5 most used languages in RateBeer (only the top 5 are shown for readability)")
print(f"{d_RB['lang_tag'].value_counts()[:5]}")

The majority of reviews are written in English, especially in BeerAdvocate.
We therefore initially focus solely on English reviews, specifically those in BeerAdvocate. Later in the project we plan to work on RateBeer as well and to test our metrics on other languages.
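
The per-language share can also be read off in one step with `value_counts(normalize=True)`. A minimal sketch on a made-up `lang_tag` column (the values below are placeholders, not the project's data):

```python
import pandas as pd

# Toy stand-in for the `lang_tag` column of reviews_tagged.csv
tags = pd.Series(["en", "en", "en", "pl", "de", "en", "pl", "en"], name="lang_tag")

# normalize=True returns fractions instead of raw counts; NaN tags are excluded by default
shares = tags.value_counts(normalize=True) * 100

print(shares.round(1).head(5))
```

This gives the same percentage as dividing the count for one language by the number of non-null tags, but for every language at once.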

In [None]:
# Free memory after showing the results
del d_BA, d_RB

### Data cleaning

The datasets contain various NaN values and duplicated rows, all of which must be handled properly to ensure a correct analysis.

In [None]:
from src.scripts.data_cleaning import data_cleaning, load_file

'''
Load the datasets and clean them. In particular:
- drop duplicates
- treat NaNs
- delete beers that don't have reviews
- delete users that didn't review at least one beer
- add language_tag column to the datasets
'''
data_cleaning.clean_data('./data')
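
The steps listed in the cell above can be sketched with plain pandas operations. This is a simplified illustration on toy frames, not the code in `src.scripts.data_cleaning`: the single-column layout of `beers` and `users` and the choice to drop reviews with missing text are assumptions.

```python
import pandas as pd

# Toy frames; the real data lives in ./data and has more columns
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "beer_id": [10, 10, 11, 12],
    "text": ["Hoppy and crisp", "Hoppy and crisp", None, "Malty"],
})
beers = pd.DataFrame({"beer_id": [10, 11, 12, 13]})
users = pd.DataFrame({"user_id": ["u1", "u2", "u3", "u4"]})

# 1. Drop exact duplicate rows
reviews = reviews.drop_duplicates()
# 2. Treat NaNs: here, drop reviews without text (one possible policy)
reviews = reviews.dropna(subset=["text"])
# 3. Keep only beers that still have at least one review
beers = beers[beers["beer_id"].isin(reviews["beer_id"])]
# 4. Keep only users who wrote at least one review
users = users[users["user_id"].isin(reviews["user_id"])]

print(len(reviews), beers["beer_id"].tolist(), users["user_id"].tolist())
```

The order matters: beers and users are filtered only after the reviews have been deduplicated and cleaned, so entries whose only review was dropped are removed too.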

In [None]:
# Issues with users.csv
df_users_RB = pd.read_csv('./data/RateBeer/users.csv')
df_users_RB_clean = pd.read_csv('./data/RateBeer/users_RB_clean.csv')

print(f"Are user_ids in RateBeer's user dataframe unique? {df_users_RB['user_id'].is_unique}")
print(f"By removing users that never wrote a review we dropped {df_users_RB.shape[0] - df_users_RB_clean.shape[0]} rows")
print(f"Rows before cleaning: {df_users_RB.shape[0]}\nRows after cleaning:  {df_users_RB_clean.shape[0]}")

In [None]:
# language tagging and formatting
df_ratings_BA_clean = pd.read_csv('./data/BeerAdvocate/ratings_BA_clean.csv', nrows=5)
df_ratings_BA_clean.head()

In [None]:
# Free memory after showing the results
del df_users_RB, df_users_RB_clean, df_ratings_BA_clean

### Data exploration

**TODO**

In [None]:
# Vik's code

## Expert metric

Most of the work done in Milestone 2 went into finding a good "expert metric" and checking that our assumptions hold.

We needed an "expert metric" reliable enough to show how written reviews change over time. This score should reflect the expertise and precision of a review.

We tried different methods to get this metric:
- language depth
- embeddings
- ad-hoc metric

The first two did not yield a meaningful score, while the "ad-hoc metric" satisfied our needs. We therefore start by discussing this method.

### Ad-hoc metric

Because the more general metrics (language depth and embeddings) failed, we created a topic-specific metric that focuses solely on beer reviews.

**CONTINUE WRITING THE EXPLANATION**

For this part, we only use BeerAdvocate's dataframe.

In [None]:
from src.scripts.expert_metric import expert_metric

data_folder = './data'
expert_metric.add_ex_score_BA(data_folder)

In [None]:
df = pd.read_csv('./data/BeerAdvocate/reviews_with_exp_scores.csv')

In [3]:
from src.scripts.expert_metric import expert_analysis
from scipy.stats import pearsonr

data_folder = './data'

rev_with_scores, beers, users = expert_analysis.get_expert_metric_dfs(data_folder)

In [5]:
rev_with_scores.head(1)

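| Unnamed: 0.1 | Unnamed: 0 | user_id | beer_id | date | text | flavor | aroma | mouthfeel | brewing | technical | appearance | judgment | off_flavors | miscellaneous | expertness_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | nmann08.184925 | 142544 | 2015-08-20 | From a bottle, pours a piss yellow color with ... | 2 | 3 | 2 | 1 | 0 | 2 | 0 | 1 | 1 | 12 |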
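
In the sample row above, `expertness_score` (12) equals the sum of the nine category columns. One way such per-category counts could be produced is by matching a review against category-specific vocabularies. The sketch below is only a guess at the general shape: the `VOCAB` word lists and the `score_review` helper are hypothetical and are not taken from `src.scripts.expert_metric`.

```python
import re

# Hypothetical, heavily abbreviated vocabularies (three of the nine categories)
VOCAB = {
    "flavor": {"bitter", "sweet", "malty", "hoppy"},
    "aroma": {"citrus", "floral", "pine"},
    "appearance": {"hazy", "golden", "head", "lacing"},
}


def score_review(text):
    """Count vocabulary hits per category and sum them into a total score."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {cat: sum(tok in words for tok in tokens) for cat, words in VOCAB.items()}
    scores["expertness_score"] = sum(scores.values())
    return scores


example = score_review("Pours hazy and golden with a thick head. Citrus aroma, bitter finish.")
print(example)
```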


In [7]:
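# Per-category scores plus the overall expertness score
col_to_keep = ['flavor', 'aroma', 'mouthfeel', 'brewing', 'technical', 'appearance', 'judgment','off_flavors', 'miscellaneous', 'expertness_score']
# Average every score per user and count how many reviews each user wrote
user_ba = rev_with_scores.groupby('user_id').agg(
    {col: 'mean' for col in col_to_keep} | {'user_id': 'count'}
)

print(user_ba.columns)

user_ba = user_ba.rename(columns={'user_id': 'nbr_rev'})
#! TODO: Finish doing
# Correlation over all users (stored but not displayed: a cell only shows its last expression)
corr_all = pearsonr(user_ba['expertness_score'], user_ba['nbr_rev'])
# Correlation restricted to users with fewer than 200 reviews (this is the result displayed)
user_ba_less_200 = user_ba[user_ba['nbr_rev'] < 200]
pearsonr(user_ba_less_200['expertness_score'], user_ba_less_200['nbr_rev'])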

Index(['flavor', 'aroma', 'mouthfeel', 'brewing', 'technical', 'appearance',
       'judgment', 'off_flavors', 'miscellaneous', 'expertness_score',
       'user_id'],
      dtype='object')


PearsonRResult(statistic=0.25250688377961705, pvalue=0.0)

### Language depth

The assumption behind this method was that reviewers with more expertise would use a more nuanced vocabulary in their reviews.

To extract the language depth of a review we used
[LexicalRichness](https://github.com/LSYS/LexicalRichness),
a Python package that computes several metrics of the lexical richness of a text.<br>
We used it to evaluate each written review.

However, the analysis that followed gave unsatisfactory results. Our main explanation is that lexical richness does not directly correlate with review quality, since the metrics computed by the package are not tailored to reviews or to beer.
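
For reference, one of the simplest lexical richness measures is the type-token ratio (TTR): the number of distinct words divided by the total number of words. The sketch below computes it in plain Python as an illustration; it is not necessarily one of the measures we extracted with LexicalRichness, and the naive regex tokenizer is an assumption.

```python
import re


def type_token_ratio(text):
    """Return distinct words / total words, or 0.0 for empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


short_review = "good beer good beer"
detailed_review = "hazy amber pour with resinous pine aroma"

print(type_token_ratio(short_review))     # 2 distinct words out of 4
print(type_token_ratio(detailed_review))  # 7 distinct words out of 7
```

Note that TTR depends on text length: short texts tend to score high simply because they have had less room to repeat words.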

### Embeddings

We also computed embeddings of the reviews to see whether they could serve as an expertise metric.<br>
To compute the embeddings we used:
- [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) 

This model maps paragraphs to a 768-dimensional dense vector space, which we then visualized by projecting it into 2D with t-SNE and PCA.
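
The PCA half of this projection can be sketched with NumPy alone. In the example below, random vectors stand in for the 768-dimensional review embeddings (an assumption made so the snippet runs without downloading the model), and `pca_2d` is a hypothetical helper, not the project's code.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
# Placeholder for model output: 100 reviews x 768 dimensions
embeddings = rng.normal(size=(100, 768))

points = pca_2d(embeddings)
print(points.shape)
```

Each row of `points` is one review in 2D and can be passed directly to a scatter plot.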

While we observed some interesting patterns, it is clear that embeddings cannot be used as a metric score.