# Modeling
-------------

In the modeling phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, several techniques exist for the same data mining problem type. Some techniques have specific requirements on the form of the data, so stepping back to the data preparation phase is often necessary.

- Modeling technique selection: Select the actual modeling technique that is to be used.

- Test design generation: Generate a procedure or mechanism to test the model's quality and validity. Split the dataset into train and test sets, build the model on the train set, and estimate its quality on the separate test set.

- Build model: Run the modeling tool on the prepared dataset to create one or more models.

- Assess model: Summarize the results of this task, list the qualities of the generated models (e.g., in terms of accuracy), and rank their quality in relation to each other.


In [3]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import joblib


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Access the data

In [4]:
dataset = pd.read_parquet('../data/preprocessed_data.parquet')
dataset

Unnamed: 0,review_id,product_id,product_category,total_votes,review,usefulness
41,R1FBO737KD9F2N,B00NG57H4S,Electronics,23,Great noise cancelling headphones for the pric...,0.826087
145,R227GSNWI6BSZV,B00ICNXESC,Electronics,20,"Garbage, lasted 8 months... warranty is useles...",1.000000
265,R4PF7S0TOV9S7,B00XR1MW4G,Electronics,17,A long lasting bluetooth sound bazooka!\nThis ...,0.882353
274,R22LKIOKMSOG8A,B00XS3HGEO,Electronics,13,nice!\nThis is a nice little turntable. Don't ...,0.923077
304,R3SJTYZBYBG4EE,B00L108SAW,Electronics,99,Very good charger for the price! But has a dow...,1.000000
...,...,...,...,...,...,...
9001932,R23WGBU0VIL6FA,B00000K390,Wireless,26,Great little radios\nI just purchased two of t...,1.000000
9001946,R1YWKS4FWD687C,B00001ZT56,Wireless,18,Sony FRS radio\nI have been a FRS radio user f...,1.000000
9001962,R2STV9N2M963YM,B00000K38X,Wireless,24,"Great\nI have only had a short time, and I am ...",0.916667
9001989,R25BIFCRPWPUHA,B00000J3H5,Wireless,46,Excellent - bearing in mind the power/range li...,0.586957


## Simple preprocessing

In [3]:
dataset['review'] = dataset['review'].str.lower()
# remove stopwords
stop = set(stopwords.words('english'))  # set membership is faster than scanning a list
dataset['review'] = dataset['review'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))
# drop every character that is not a letter or whitespace
dataset['review'] = dataset['review'].str.replace(r'[^a-zA-Z\s]', '', regex=True)
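
To see what these steps do to a single review, the sketch below applies the same sequence (lowercase, drop stopwords, strip non-letters) to a few strings. The small `STOP` set is a hypothetical stand-in for `stopwords.words('english')` so that the example runs without downloading NLTK data, and the helper name `preprocess` is an assumption made for illustration.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOP = {"this", "is", "a", "the", "for", "but", "it"}

def preprocess(text: str) -> str:
    """Lowercase, drop stopwords, then strip everything except letters and whitespace."""
    text = text.lower()
    text = " ".join(word for word in text.split() if word not in STOP)
    return re.sub(r"[^a-zA-Z\s]", "", text)

print(preprocess("This is a GREAT charger for the price!"))
print(preprocess("Garbage, lasted 8 months... but it works."))
# A stopword glued to punctuation is not filtered, because punctuation is stripped afterwards.
print(preprocess("Love it, really"))
```

Two side effects are worth keeping in mind: removing digits can leave double spaces behind, and a stopword followed directly by punctuation (such as `it,`) survives the filter.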

## Apply label encoding to the `product_category` column

In [5]:
label_encoder = LabelEncoder()
dataset['product_category'] = label_encoder.fit_transform(dataset['product_category'])
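
`LabelEncoder` assigns each distinct category an integer according to the sorted order of the category names. A small sketch shows the mapping and how to invert it; `Electronics` and `Wireless` appear in the data preview, while `Books` is a hypothetical third category added for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# 'Electronics' and 'Wireless' appear in the data preview; 'Books' is hypothetical.
categories = ["Wireless", "Electronics", "Books", "Electronics"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

print(list(encoder.classes_))                    # classes are stored in sorted order
print(list(codes))                               # integer code for each input value
print(list(encoder.inverse_transform(codes)))    # map the codes back to the names
```

The integer codes carry an artificial ordering. Tree-based models usually tolerate this, but a linear model treats the codes as a numeric scale, so one-hot encoding is a common alternative in that case.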

## Select features and label from datasets

In [6]:
X = dataset[['product_category', 'total_votes', 'review']]

In [7]:
y = dataset['usefulness']

## Construct the column transformer

In [None]:
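# Bag-of-words counts for the review text; terms that occur in fewer than 2% of the reviews are dropped
count_vectorizer = CountVectorizer(min_df=0.02)

column_transformer = ColumnTransformer(
    transformers=[
        ('count', count_vectorizer, 'review'),
    ],
    remainder='passthrough'  # pass product_category and total_votes through unchanged
)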

## Define models

1. Decision Tree Regressor
2. Linear Regression
3. Gradient Boosting Regressor

In [None]:
models = [
    ('Decision Tree Regressor', DecisionTreeRegressor()),
    ('Linear Regression', LinearRegression()),
    ('Gradient Boosting Regressor', GradientBoostingRegressor())
]

## Create a pipeline for each model

In [None]:
pipelines = []
for name, model in models:
    pipeline = Pipeline([
        ('preprocessor', column_transformer),
        ('model', model)
    ])
    pipelines.append((name, pipeline))

## Split data into train and test sets

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Train and test models

In [None]:
for name, pipeline in pipelines:
    print(f'Training {name}')
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f'Mean Squared Error: {mse}')

    # Save the trained pipeline (same directory that the model is loaded from below)
    joblib.dump(pipeline, f'../models/{name}_model.pkl')
    print()

Training Decision Tree Regressor
Mean Squared Error: 0.11914937113486608

Training Linear Regression
Mean Squared Error: 0.06507291126904217

Training Gradient Boosting Regressor
Mean Squared Error: 0.060718931912111515



## Test model predictions after summarization

In [15]:
gb_regressor_model = joblib.load('../models/Gradient Boosting Regressor_model.pkl')

X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=0)

y_pred = gb_regressor_model.predict(X_test)

In [10]:
from transformers import pipeline, logging
logging.set_verbosity_error()


summarizer = pipeline("summarization", model="Falconsai/text_summarization", device='cuda:0')



In [11]:
_, X_test_with_id = train_test_split(dataset, test_size=0.2, random_state=0)
_, X_test_with_id = train_test_split(X_test_with_id, test_size=0.5, random_state=0)
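
This cell relies on `train_test_split` selecting the same row positions whenever the input length, `test_size`, and `random_state` are identical, so re-splitting `dataset` recovers the rows that correspond to `X_test`. A minimal sketch of that property on made-up data, where `rows` plays the role of `dataset` and `features` the role of `X`:

```python
from sklearn.model_selection import train_test_split

# Made-up stand-ins: 'rows' plays the role of dataset, 'features' the role of X.
rows = [f"review_{i}" for i in range(20)]
features = [[i] for i in range(20)]

# First split, then halve the held-out part, as in the notebook.
_, feat_holdout = train_test_split(features, test_size=0.2, random_state=0)
_, feat_test = train_test_split(feat_holdout, test_size=0.5, random_state=0)

# Repeat the same two splits on the full rows.
_, row_holdout = train_test_split(rows, test_size=0.2, random_state=0)
_, row_test = train_test_split(row_holdout, test_size=0.5, random_state=0)

# Both sequences were cut at the same positions, so they line up one-to-one.
aligned = [row == f"review_{feat[0]}" for row, feat in zip(row_test, feat_test)]
print(all(aligned))
```

Passing the features, the target, and the id columns to a single `train_test_split` call would give the same alignment without depending on this property.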

In [16]:
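# Replace the true usefulness with the model's prediction so that reviews are ranked by predicted usefulness
X_test_with_id = X_test_with_id.copy()
X_test_with_id["usefulness"] = y_pred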

In [17]:
# Sort by predicted usefulness, keep the five highest-ranked reviews per product,
# and join them into one text block per product
ranked = X_test_with_id.sort_values('usefulness', ascending=False)

top_reviews = ranked.groupby('product_id').head(5).groupby('product_id')['review'].apply(lambda x: '\n\n'.join(x))
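
The top-reviews selection can be checked on a toy frame. The sketch below uses made-up predictions for two products and keeps the top 2 instead of the top 5; sorting first and then calling `groupby(...).head(n)` is one way to express a per-group top-n selection.

```python
import pandas as pd

# Made-up predictions for two products.
toy = pd.DataFrame({
    "product_id": ["A", "A", "A", "B", "B"],
    "review": ["a1", "a2", "a3", "b1", "b2"],
    "usefulness": [0.2, 0.9, 0.5, 0.7, 0.8],
})

top_n = 2  # the notebook keeps 5 per product
ranked = toy.sort_values("usefulness", ascending=False)
top = (
    ranked.groupby("product_id")
    .head(top_n)                      # first top_n rows of each group, in sorted order
    .groupby("product_id")["review"]
    .apply(lambda reviews: "\n\n".join(reviews))
)
print(top.to_dict())
```

Each product ends up with a single string in which its highest-ranked reviews appear in descending order of predicted usefulness, separated by blank lines.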


In [18]:
from tqdm.auto import tqdm
from datasets import Dataset

summaries = []
reviews_dataset = Dataset.from_dict({"text": top_reviews.to_list()[:1000]})

# Summarize in batches of 8; the min() guard keeps the final batch within range
batch_size = 8
for i in tqdm(range(0, len(reviews_dataset), batch_size)):
    batch = reviews_dataset.select(range(i, min(i + batch_size, len(reviews_dataset))))
    batch_summaries = summarizer(batch["text"], min_length=16, max_length=96)
    summaries.extend(batch_summaries)
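
Batching by fixed-size slices needs some care when the number of texts is not a multiple of the batch size, because the last batch is then shorter than the others. A minimal, library-free sketch of batching that tolerates a short final batch (the helper name `batched` and the inputs are hypothetical):

```python
from typing import Iterator, List, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        # Slicing never runs past the end, so the final batch may be shorter.
        yield list(items[start:start + batch_size])

texts = [f"review {i}" for i in range(10)]  # made-up inputs
batches = list(batched(texts, batch_size=8))
print([len(batch) for batch in batches])
```

The `transformers` pipeline call also accepts a `batch_size` argument when given a list of inputs, which may make a manual loop unnecessary.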


In [19]:
from evaluate import load

rouge = load("rouge")


In [21]:
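# Note: the references are the concatenated source reviews, not human-written summaries,
# so these scores measure how much of each summary overlaps with its input text
predictions = [summary["summary_text"] for summary in summaries]
results = rouge.compute(predictions=predictions, references=top_reviews.to_list()[:1000])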
print(f"ROUGE-1 F1 score: {results['rouge1'] * 100:.2f}")
print(f"ROUGE-2 F1 score: {results['rouge2'] * 100:.2f}")
print(f"ROUGE-L F1 score: {results['rougeL'] * 100:.2f}")

ROUGE-1 F1 score: 47.94
ROUGE-2 F1 score: 45.62
ROUGE-L F1 score: 47.28
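
Since the references here are the source reviews themselves, the scores reflect how much of each summary is taken from its input rather than agreement with an independent summary. The sketch below is a simplified ROUGE-1 F1 (whitespace tokens, no stemming, so it will not reproduce the `evaluate` implementation exactly) on made-up sentences; it shows why text copied from the source scores high while a paraphrase scores low.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: whitespace tokens, no stemming."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "great little radios with long battery life"  # made-up source text
extractive = "great little radios"                         # copied from the source
paraphrase = "excellent small walkie talkies"              # same idea, new words

print(rouge1_f1(extractive, reference))
print(rouge1_f1(paraphrase, reference))
```

A ROUGE-2 score close to the ROUGE-1 score, as in the output above, is consistent with summaries that reuse long spans of the source text.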
