<strong>
    <font color="#088A68">
        Author: lprtk
    </font>
</strong>

<br/>
<br/>


<Center>
    <h1 style="font-family: Arial">
        <font color="#084B8A">
            NLP: sentiment analysis, topic modeling & sentiment prediction
        </font>
    </h1>
    <h3 style="font-family: Arial">
        <font color="#088A68">
            Notebook 5/5
        </font>
    </h3>
</Center>

<br/>

<h3 style="font-family: Arial"><font color="#088A68">Paris 1 Panthéon-Sorbonne, M2 MoSEF 2021-2022</font></h3>

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#084B8A">
            Introduction & context
        </font>
    </h2>
</div>

<p style="text-align: justify">
    This project focuses on extracting information and value from large volumes of textual data using Natural Language Processing (NLP). Why is this useful?
</p>
<ul>
    <li><p style="text-align: justify">Improve the customer experience on the website, in the mobile application or in the office.</p></li>
    <li><p style="text-align: justify">Assess customer satisfaction differently.</p></li>
    <li><p style="text-align: justify">Evaluate the company's image.</p></li>
    <li><p style="text-align: justify">Be more available and accessible to customers.</p></li>
    <li><p style="text-align: justify">Depending on the company's activity: find new solutions to improve the banking services offered, evaluate sellers on an online sales platform or improve a product based on customer reviews.</p></li>
</ul>

<p style="text-align: justify">
    Our approach is organized into five main steps:
</p>
<ul>
    <li>
        <u>Step 1:</u> Web Scraping
        <ul>
            <li>Collect the data and create the data schema.</li>
            <li>Parse customer reviews to enrich the database: extract title, description, date, time, nickname and rating.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 2:</u> Sentiment Analysis and Scoring
        <ul>
            <li>Understand and gauge the satisfaction of each customer.</li>
            <li>Score the intensity and polarity of the sentiment expressed in the review description.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 3:</u> Text mining and data cleaning
        <ul>
            <li>Text cleaning adapted to the sales domain and to the general content of reviews.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 4:</u> Topic Modeling (unsupervised learning)
        <ul>
            <li>To improve availability and speed up response time, reviews can be separated and prioritized according to the topic they address.</li>
        </ul>
    </li>
</ul>
<ul>
    <li>
        <u>Step 5:</u> Machine Learning (supervised learning)
        <ul>
            <li>Design a robust model that identifies the overall sentiment expressed by the customer in future reviews, without having to read them.</li>
        </ul>
    </li>
</ul>

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#084B8A">
            Libraries import
        </font>
    </h2>
</div>

In [1]:
from lightgbm import LGBMClassifier
import numpy as np
import pandas as pd
from pyTCTK import TextNet, WordNet
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, fbeta_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.filterwarnings("ignore")

-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#084B8A">
            Data import
        </font>
    </h2>
</div>

In [2]:
df_data = pd.read_csv(filepath_or_buffer="amzn_customer_reviews.csv", sep=",")

In [3]:
df_data.head(3)

Unnamed: 0,Pseudo,Title,Review,Rating,Verified Purchase,Date,Score,Compound,Sentiment,New rating,New date,Country
0,Assault Kittens,really good option portability,impressive form factor really good balance siz...,4.5 out of 5 stars,Yes,"Reviewed in the United States on June 18, 2021","{'neg': 0.014, 'neu': 0.736, 'pos': 0.249, 'co...",0.9941,positive,4.5,2021-06-18,United States
1,Kenneth Cramer,excellent portable gaming,write review anyone fence purchasing since rea...,5.0 out of 5 stars,Yes,"Reviewed in the United States on July 7, 2021","{'neg': 0.016, 'neu': 0.88, 'pos': 0.104, 'com...",0.9921,positive,5.0,2021-07-07,United States
2,Assault Kittens,best inch world,sell macbook best decisions lifeif used macboo...,1.0 out of 5 stars,Yes,"Reviewed in the United States on June 18, 2021","{'neg': 0.0, 'neu': 0.854, 'pos': 0.146, 'comp...",0.8779,positive,1.0,2021-06-18,United States


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#084B8A">
            Sentiment prediction: supervised learning
        </font>
    </h2>
</div>

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#088A68">
            1) Pre-processing
        </font>
    </h3>
</div>

In [4]:
# cross validation method
cross_validation = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

# custom scorer: macro-averaged F-beta with beta=3, which weights recall more heavily than precision
fthree_scorer = make_scorer(fbeta_score, average="macro", beta=3)
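
<p style="text-align: justify">
    To make the scorer above more concrete, here is a minimal sketch that recomputes the macro-averaged F-beta score by hand, using <code>F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)</code> per class and then averaging over classes. The toy labels and the helper <code>fbeta_from_pr</code> are assumptions made for illustration; they are not part of the notebook's data or code.
</p>

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels, invented for illustration only (not taken from the review data).
y_true_demo = ["positive", "positive", "positive", "negative", "negative", "neutral"]
y_pred_demo = ["positive", "positive", "negative", "negative", "negative", "neutral"]
labels_demo = ["negative", "neutral", "positive"]


def fbeta_from_pr(precision, recall, beta):
    """Hypothetical helper: F-beta for one class from its precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


# Per-class precision and recall (average=None returns one value per label).
precisions = precision_score(y_true_demo, y_pred_demo, labels=labels_demo, average=None)
recalls = recall_score(y_true_demo, y_pred_demo, labels=labels_demo, average=None)

# Macro average: unweighted mean of the per-class F-beta scores.
manual_f3 = sum(
    fbeta_from_pr(p, r, beta=3) for p, r in zip(precisions, recalls)
) / len(labels_demo)

sklearn_f3 = fbeta_score(
    y_true_demo, y_pred_demo, labels=labels_demo, average="macro", beta=3
)

print(round(manual_f3, 4), round(sklearn_f3, 4))
```

<p style="text-align: justify">
    With <code>beta=3</code>, recall counts for more than precision in each class score, and the macro average gives the minority classes the same weight as the majority class.
</p>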

In [5]:
# train/test split (no separate validation set: cross-validation is used on the training set)
X = df_data["Review"]
y = df_data["Sentiment"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True)

In [6]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((846,), (283,), (846,), (283,))
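
<p style="text-align: justify">
    The split above does not pass <code>stratify</code>, so class proportions may differ slightly between the training and test sets. As a possible variant, the sketch below shows how <code>stratify</code> keeps the proportions aligned. The data is synthetic and only mimics an imbalanced three-class target; the names <code>X_demo</code> and <code>y_demo</code> are assumptions.
</p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative only, not the review data).
y_demo = pd.Series(["positive"] * 80 + ["negative"] * 14 + ["neutral"] * 6)
X_demo = pd.Series([f"review {i}" for i in range(len(y_demo))])

# stratify=y_demo asks the splitter to preserve class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.25,
    random_state=42,
    shuffle=True,
    stratify=y_demo,
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```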

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#088A68">
            2) Vectorization
        </font>
    </h3>
</div>

In [7]:
vectorizer = TfidfVectorizer(
    encoding="utf-8",
    lowercase=False,
    tokenizer=None,
    analyzer="word",
    stop_words=None,
    ngram_range=(1, 2),
    min_df=1,
    norm="l2",
    use_idf=True
)

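# .toarray() returns a plain ndarray; recent scikit-learn versions no longer accept the np.matrix returned by .todense()
X_train = vectorizer.fit_transform(X_train).toarray()
X_test = vectorizer.transform(X_test).toarray()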

In [8]:
X_train.shape, X_test.shape

((846, 35424), (283, 35424))
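
<p style="text-align: justify">
    The large number of columns comes from <code>ngram_range=(1, 2)</code>, which creates one feature per distinct word and per distinct pair of consecutive words. The sketch below illustrates this on three invented sentences; the corpus and the names <code>docs_demo</code> and <code>demo_vectorizer</code> are assumptions.
</p>

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented, already-cleaned sentences (illustrative only).
docs_demo = ["really good battery", "good screen", "battery life really poor"]

demo_vectorizer = TfidfVectorizer(
    lowercase=False,
    analyzer="word",
    ngram_range=(1, 2),  # unigrams and bigrams
    norm="l2",
    use_idf=True,
)
demo_matrix = demo_vectorizer.fit_transform(docs_demo)

# vocabulary_ maps each n-gram to its column index.
demo_features = sorted(demo_vectorizer.vocabulary_)
print(demo_features)
print(demo_matrix.shape)

# With norm="l2", every row vector has unit Euclidean length.
row_norms = np.sqrt(demo_matrix.multiply(demo_matrix).sum(axis=1))
print(row_norms)
```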

In [9]:
print("-"*65)
print("Train shape: ", X_train.shape)
print("Test shape: ", X_test.shape)
print("-"*65)
print("Originally:")
print(df_data.Sentiment.value_counts(normalize=True))
print("\n")
print("Train:")
print(y_train.value_counts(normalize=True))
print("\n")
print("Test:")
print(y_test.value_counts(normalize=True))
print("-"*65)

-----------------------------------------------------------------
Train shape:  (846, 35424)
Test shape:  (283, 35424)
-----------------------------------------------------------------
Originally:
positive    0.800709
negative    0.137290
neutral     0.062002
Name: Sentiment, dtype: float64


Train:
positive    0.797872
negative    0.141844
neutral     0.060284
Name: Sentiment, dtype: float64


Test:
positive    0.809187
negative    0.123675
neutral     0.067138
Name: Sentiment, dtype: float64
-----------------------------------------------------------------
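
<p style="text-align: justify">
    The target is clearly imbalanced, which is why most models below use <code>class_weight="balanced"</code>. As a minimal sketch, scikit-learn's balanced heuristic assigns each class the weight <code>n_samples / (n_classes * class_count)</code>. The labels below are synthetic and only approximate the imbalance shown above.
</p>

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels approximating an 80/14/6 imbalance (illustrative only).
y_weights_demo = np.array(["positive"] * 80 + ["negative"] * 14 + ["neutral"] * 6)
classes_demo = np.unique(y_weights_demo)

# Weights as computed by scikit-learn for class_weight="balanced".
balanced_weights = compute_class_weight(
    class_weight="balanced", classes=classes_demo, y=y_weights_demo
)

# The same weights computed by hand: n_samples / (n_classes * class_count).
class_counts = np.array([(y_weights_demo == c).sum() for c in classes_demo])
manual_weights = len(y_weights_demo) / (len(classes_demo) * class_counts)

print(dict(zip(classes_demo, np.round(balanced_weights, 3))))
```

<p style="text-align: justify">
    Rare classes receive larger weights, so misclassifying a neutral or negative review costs more during training than misclassifying a positive one.
</p>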


<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#088A68">
            3) Models
        </font>
    </h3>
</div>

<h3 style="font-family: Arial">
    <font color="#088A68">
        Logistic Regression
    </font>
</h3>

In [10]:
pipeline = Pipeline(
    [
        (
            "log_clf",
            LogisticRegression(
                dual=False,
                tol=0.0001,
                fit_intercept=True,
                intercept_scaling=1,
                class_weight="balanced",
                random_state=42,
                max_iter=100,
                multi_class="auto",
                verbose=False,
                warm_start=False,
                n_jobs=-1
            )
        )
    ]
)

param_grid = {
    "log_clf__C": [0.001, 0.01, 0.1, 1.0, 10, 100],
    "log_clf__solver": ["saga", "liblinear"]
}

In [11]:
log_grid = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=cross_validation,
    scoring=fthree_scorer,
    n_jobs=-1,
    verbose=True
)

log_grid.fit(X_train, np.ravel(y_train))

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:  2.0min finished


GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('log_clf',
                                        LogisticRegression(class_weight='balanced',
                                                           n_jobs=-1,
                                                           random_state=42,
                                                           verbose=False))]),
             n_jobs=-1,
             param_grid={'log_clf__C': [0.001, 0.01, 0.1, 1.0, 10, 100],
                         'log_clf__solver': ['saga', 'liblinear']},
             scoring=make_scorer(fbeta_score, average=macro, beta=3),
             verbose=True)

In [12]:
dict_params = log_grid.best_params_
dict_params

{'log_clf__C': 1.0, 'log_clf__solver': 'saga'}
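
<p style="text-align: justify">
    In this notebook the TF-IDF vectorizer is fitted on the whole training set before the grid search, so each validation fold has already contributed to the vocabulary and IDF weights. One possible variant, sketched below on invented sentences, places the vectorizer inside the <code>Pipeline</code> so that it is refitted on the training part of every fold. The toy corpus, the reduced grid, the <code>f1_macro</code> scoring and the names <code>text_pipeline</code> and <code>text_grid</code> are assumptions, not the notebook's setup.
</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Invented raw texts and labels (illustrative only).
texts_demo = [
    "great screen and fast delivery", "love this laptop", "excellent battery life",
    "really good value", "works perfectly fine", "very happy with purchase",
    "terrible battery and slow", "stopped working after a week", "very poor build quality",
    "bad screen and broken keys", "awful support never again", "not worth the money",
]
labels_demo = ["positive"] * 6 + ["negative"] * 6

# The vectorizer is the first pipeline step, so it only ever sees training folds.
text_pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("log_clf", LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)),
    ]
)

text_grid = GridSearchCV(
    estimator=text_pipeline,
    param_grid={"log_clf__C": [0.1, 1.0, 10]},
    cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
    scoring="f1_macro",
)
text_grid.fit(texts_demo, labels_demo)

print(text_grid.best_params_)
```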

In [13]:
log_clf = LogisticRegression(
    penalty="l2",
    dual=False,
    tol=0.0001,
    C=dict_params["log_clf__C"],
    fit_intercept=True,
    intercept_scaling=1,
    class_weight="balanced",
    random_state=42,
    solver=dict_params["log_clf__solver"],
    max_iter=100,
    multi_class="auto",
    verbose=False,
    warm_start=False,
    n_jobs=-1,
    l1_ratio=0
)

log_clf.fit(X_train, np.ravel(y_train))
y_pred = log_clf.predict(X_test)

In [14]:
cm = pd.crosstab(y_test, y_pred, rownames=["Real class"], colnames=["Predicted class"])
cm

Real class \ Predicted class,negative,neutral,positive
negative,13,6,16
neutral,2,11,6
positive,5,6,218


In [15]:
accuracy_log = round(accuracy_score(y_true=y_test, y_pred=y_pred), 4)
recall_log = round(recall_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
precision_log = round(precision_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f1_score_log = round(f1_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f3_score_log = round(fbeta_score(y_true=y_test, y_pred=y_pred, average="macro", beta=3), 4)

In [16]:
print(accuracy_log)

0.8551
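
<p style="text-align: justify">
    The accuracy above can be checked directly from the logistic regression confusion matrix (rows are real classes, columns are predicted classes). The sketch below copies the counts from that matrix and derives accuracy and per-class recall and precision with NumPy; the variable names are assumptions.
</p>

```python
import numpy as np

# Counts copied from the logistic regression confusion matrix above.
# Row and column order: negative, neutral, positive.
cm_log = np.array(
    [
        [13, 6, 16],
        [2, 11, 6],
        [5, 6, 218],
    ]
)

# Accuracy: correctly classified reviews (diagonal) over all reviews.
accuracy_from_cm = np.trace(cm_log) / cm_log.sum()

# Recall per class: diagonal over row totals (real class counts).
recall_per_class = np.diag(cm_log) / cm_log.sum(axis=1)

# Precision per class: diagonal over column totals (predicted class counts).
precision_per_class = np.diag(cm_log) / cm_log.sum(axis=0)

# Macro averages: unweighted means over the three classes.
macro_recall = recall_per_class.mean()
macro_precision = precision_per_class.mean()

print(round(accuracy_from_cm, 4))
print(np.round(recall_per_class, 4), round(macro_recall, 4))
print(np.round(precision_per_class, 4), round(macro_precision, 4))
```

<p style="text-align: justify">
    The per-class recall makes the effect of the imbalance visible: the positive class is recovered far more reliably than the negative and neutral classes, which overall accuracy alone does not show.
</p>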


<h3 style="font-family: Arial">
    <font color="#088A68">
        Random Forest
    </font>
</h3>

In [17]:
pipeline = Pipeline(
    [
        (
            "rf_clf",
            RandomForestClassifier(
                    min_weight_fraction_leaf=0.0,
                    max_leaf_nodes=None,
                    min_impurity_decrease=0.0,
                    bootstrap=True,
                    oob_score=False,
                    n_jobs=-1,
                    random_state=42,
                    verbose=False,
                    warm_start=False,
                    class_weight="balanced",
                    ccp_alpha=0.0,
                    max_samples=None
            )
        )
    ]
)

param_grid = {
    "rf_clf__n_estimators": [75, 100, 125, 150, 175],
    "rf_clf__criterion": ["gini", "entropy"],
    "rf_clf__max_depth": [7, 12],
    "rf_clf__min_samples_split": [2, 3],
    "rf_clf__min_samples_leaf": [1, 3],
    "rf_clf__max_features": ["auto", "sqrt"]
}

In [18]:
rf_grid = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=cross_validation,
    scoring=fthree_scorer,
    n_jobs=-1,
    verbose=True
)

rf_grid.fit(X_train, np.ravel(y_train))

Fitting 3 folds for each of 160 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   12.3s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:  3.9min finished


GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('rf_clf',
                                        RandomForestClassifier(class_weight='balanced',
                                                               n_jobs=-1,
                                                               random_state=42,
                                                               verbose=False))]),
             n_jobs=-1,
             param_grid={'rf_clf__criterion': ['gini', 'entropy'],
                         'rf_clf__max_depth': [7, 12],
                         'rf_clf__max_features': ['auto', 'sqrt'],
                         'rf_clf__min_samples_leaf': [1, 3],
                         'rf_clf__min_samples_split': [2, 3],
                         'rf_clf__n_estimators': [75, 100, 125, 150, 175]},
             scoring=make_scorer(fbeta_score, average=macro, beta=3),
             verbose=True)

In [19]:
dict_params = rf_grid.best_params_
dict_params

{'rf_clf__criterion': 'gini',
 'rf_clf__max_depth': 12,
 'rf_clf__max_features': 'auto',
 'rf_clf__min_samples_leaf': 3,
 'rf_clf__min_samples_split': 2,
 'rf_clf__n_estimators': 75}

In [20]:
rf_clf = RandomForestClassifier(
    n_estimators=dict_params["rf_clf__n_estimators"],
    criterion=dict_params["rf_clf__criterion"],
    max_depth=dict_params["rf_clf__max_depth"],
    min_samples_split=dict_params["rf_clf__min_samples_split"],
    min_samples_leaf=dict_params["rf_clf__min_samples_leaf"],
    min_weight_fraction_leaf=0.0,
    max_features=dict_params["rf_clf__max_features"],
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs=-1,
    random_state=42,
    verbose=False,
    warm_start=False,
    class_weight="balanced",
    ccp_alpha=0.0,
    max_samples=None
)

rf_clf.fit(X_train, np.ravel(y_train))
y_pred = rf_clf.predict(X_test)

In [21]:
cm = pd.crosstab(y_test, y_pred, rownames=["Real class"], colnames=["Predicted class"])
cm

Real class \ Predicted class,negative,neutral,positive
negative,6,12,17
neutral,1,18,0
positive,5,44,180


In [22]:
accuracy_rf = round(accuracy_score(y_true=y_test, y_pred=y_pred), 4)
recall_rf = round(recall_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
precision_rf = round(precision_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f1_score_rf = round(f1_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f3_score_rf = round(fbeta_score(y_true=y_test, y_pred=y_pred, average="macro", beta=3), 4)

<h3 style="font-family: Arial">
    <font color="#088A68">
        Adaboost
    </font>
</h3>

In [23]:
pipeline = Pipeline(
    [
        (
            "dtc_clf",
            DecisionTreeClassifier(
                min_weight_fraction_leaf=0.0,
                random_state=42,
                max_leaf_nodes=None,
                min_impurity_decrease=0.0,
                class_weight="balanced",
                ccp_alpha=0.0
            )
        )
    ]
)

param_grid = {
    "dtc_clf__criterion": ["gini", "entropy"],
    "dtc_clf__splitter": ["best", "random"],
    "dtc_clf__max_depth": [7, 12],
    "dtc_clf__min_samples_split": [2, 3],
    "dtc_clf__min_samples_leaf": [1, 3],
    "dtc_clf__max_features": [75, 100, 125, 150, 175],
}

In [24]:
dtc_grid = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=cross_validation,
    scoring=fthree_scorer,
    n_jobs=-1,
    verbose=True
)

dtc_grid.fit(X_train, np.ravel(y_train))

Fitting 3 folds for each of 160 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    7.5s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:   50.2s
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:  2.3min finished


GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('dtc_clf',
                                        DecisionTreeClassifier(class_weight='balanced',
                                                               random_state=42))]),
             n_jobs=-1,
             param_grid={'dtc_clf__criterion': ['gini', 'entropy'],
                         'dtc_clf__max_depth': [7, 12],
                         'dtc_clf__max_features': [75, 100, 125, 150, 175],
                         'dtc_clf__min_samples_leaf': [1, 3],
                         'dtc_clf__min_samples_split': [2, 3],
                         'dtc_clf__splitter': ['best', 'random']},
             scoring=make_scorer(fbeta_score, average=macro, beta=3),
             verbose=True)

In [25]:
dict_params_dtc = dtc_grid.best_params_
dict_params_dtc

{'dtc_clf__criterion': 'entropy',
 'dtc_clf__max_depth': 12,
 'dtc_clf__max_features': 100,
 'dtc_clf__min_samples_leaf': 3,
 'dtc_clf__min_samples_split': 2,
 'dtc_clf__splitter': 'best'}

In [26]:
pipeline = Pipeline(
    [
        (
            "adab_clf",
            AdaBoostClassifier(
                base_estimator=DecisionTreeClassifier(
                    criterion=dict_params_dtc["dtc_clf__criterion"],
                    splitter=dict_params_dtc["dtc_clf__splitter"],
                    max_depth=dict_params_dtc["dtc_clf__max_depth"],
                    min_samples_split=dict_params_dtc["dtc_clf__min_samples_split"],
                    min_samples_leaf=dict_params_dtc["dtc_clf__min_samples_leaf"],
                    min_weight_fraction_leaf=0.0,
                    max_features=dict_params_dtc["dtc_clf__max_features"],
                    random_state=42,
                    max_leaf_nodes=None,
                    min_impurity_decrease=0.0,
                    class_weight="balanced",
                    ccp_alpha=0.0
                ),
                algorithm="SAMME.R",
                random_state=42
            )
        )
    ]
)

param_grid = {
    "adab_clf__n_estimators": [75, 100, 125, 150, 175],
    "adab_clf__learning_rate": [0.001, 0.01, 0.1, 1.0]
}

In [27]:
dtc_grid = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=cross_validation,
    scoring=fthree_scorer,
    n_jobs=-1,
    verbose=True
)

dtc_grid.fit(X_train, np.ravel(y_train))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  4.9min finished


GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('adab_clf',
                                        AdaBoostClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                                                                 criterion='entropy',
                                                                                                 max_depth=12,
                                                                                                 max_features=100,
                                                                                                 min_samples_leaf=3,
                                                                                                 random_state=42),
                                                           random_state=42))]),
             n_jobs=-1,
             param_grid={'adab_clf__learning_rate': [0.001, 0.01, 0.1,

In [28]:
dict_params_adab = dtc_grid.best_params_
dict_params_adab

{'adab_clf__learning_rate': 0.01, 'adab_clf__n_estimators': 125}

In [29]:
adab_clf = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(
        criterion=dict_params_dtc["dtc_clf__criterion"],
        splitter=dict_params_dtc["dtc_clf__splitter"],
        max_depth=dict_params_dtc["dtc_clf__max_depth"],
        min_samples_split=dict_params_dtc["dtc_clf__min_samples_split"],
        min_samples_leaf=dict_params_dtc["dtc_clf__min_samples_leaf"],
        min_weight_fraction_leaf=0.0,
        max_features=dict_params_dtc["dtc_clf__max_features"],
        random_state=42,
        max_leaf_nodes=None,
        min_impurity_decrease=0.0,
        class_weight="balanced",
        ccp_alpha=0.0
    ),
    n_estimators=dict_params_adab["adab_clf__n_estimators"],
    learning_rate=dict_params_adab["adab_clf__learning_rate"],
    algorithm="SAMME.R",
    random_state=42
)

adab_clf.fit(X_train, np.ravel(y_train))
y_pred = adab_clf.predict(X_test)

In [30]:
cm = pd.crosstab(y_test, y_pred, rownames=["Real class"], colnames=["Predicted class"])
cm

Real class \ Predicted class,negative,neutral,positive
negative,11,2,22
neutral,3,12,4
positive,8,6,215


In [31]:
accuracy_adab = round(accuracy_score(y_true=y_test, y_pred=y_pred), 4)
recall_adab = round(recall_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
precision_adab = round(precision_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f1_score_adab = round(f1_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f3_score_adab = round(fbeta_score(y_true=y_test, y_pred=y_pred, average="macro", beta=3), 4)

<h3 style="font-family: Arial">
    <font color="#088A68">
        Support Vector Machine
    </font>
</h3>

In [32]:
pipeline = Pipeline(
    [
        (
            "svc_clf",
            SVC(
                degree=3,
                gamma="auto",
                coef0=0.0,
                shrinking=True,
                probability=False,
                tol=0.001,
                cache_size=200,
                class_weight="balanced",
                verbose=False,
                max_iter=-1,
                decision_function_shape="ovr",
                break_ties=False,
                random_state=42
            )
        )
    ]
)

param_grid = {
    "svc_clf__C": [0.001, 0.01, 0.1, 1.0, 10, 100],
    "svc_clf__kernel": ["linear", "rbf", "sigmoid"]
}

In [33]:
svc_grid = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=cross_validation,
    scoring=fthree_scorer,
    n_jobs=-1,
    verbose=True
)

svc_grid.fit(X_train, np.ravel(y_train))

Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:  4.3min finished


GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('svc_clf',
                                        SVC(class_weight='balanced',
                                            gamma='auto', random_state=42))]),
             n_jobs=-1,
             param_grid={'svc_clf__C': [0.001, 0.01, 0.1, 1.0, 10, 100],
                         'svc_clf__kernel': ['linear', 'rbf', 'sigmoid']},
             scoring=make_scorer(fbeta_score, average=macro, beta=3),
             verbose=True)

In [34]:
dict_params = svc_grid.best_params_
dict_params

{'svc_clf__C': 1.0, 'svc_clf__kernel': 'linear'}

In [35]:
svc_clf = SVC(
    C=dict_params["svc_clf__C"],
    kernel=dict_params["svc_clf__kernel"],
    degree=3,
    gamma="auto",
    coef0=0.0,
    shrinking=True,
    probability=False,
    tol=0.001,
    cache_size=200,
    class_weight="balanced",
    verbose=False,
    max_iter=-1,
    decision_function_shape="ovr",
    break_ties=False,
    random_state=42
)

svc_clf.fit(X_train, np.ravel(y_train))
y_pred = svc_clf.predict(X_test)

In [36]:
cm = pd.crosstab(y_test, y_pred, rownames=["Real class"], colnames=["Predicted class"])
cm

Real class \ Predicted class,negative,neutral,positive
negative,9,2,24
neutral,3,4,12
positive,3,0,226


In [37]:
accuracy_svc = round(accuracy_score(y_true=y_test, y_pred=y_pred), 4)
recall_svc = round(recall_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
precision_svc = round(precision_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f1_score_svc = round(f1_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f3_score_svc = round(fbeta_score(y_true=y_test, y_pred=y_pred, average="macro", beta=3), 4)

<h3 style="font-family: Arial">
    <font color="#088A68">
        LightGBM
    </font>
</h3>

In [38]:
pipeline = Pipeline(
    [
        (
            "lgbm_clf",
            LGBMClassifier(
                    boosting_type="gbdt",
                    subsample_for_bin=200000,
                    objective=None,
                    class_weight="balanced",
                    min_split_gain=0.0,
                    min_child_weight=0.001,
                    min_child_samples=20,
                    subsample=1.0,
                    subsample_freq=0,
                    colsample_bytree=1.0,
                    reg_alpha=0.001,
                    reg_lambda=0.0,
                    random_state=42,
                    n_jobs=-1,
                    importance_type="split"
            )
        )
    ]
)

param_grid = {
    #"lgbm_clf__boosting_type": ["gbdt", "dart"],
    "lgbm_clf__num_leaves": [5, 10, 30, 50],
    "lgbm_clf__max_depth": [3, 7, 12],
    "lgbm_clf__learning_rate": [0.001, 0.01, 0.1, 1.0, 10, 100],
    "lgbm_clf__n_estimators": [50, 100, 300, 500, 1000],
    #"lgbm_clf__min_child_weight": [0.001, 1.0, 300],
    #"lgbm_clf__subsample": [0.7, 1.0],
    #"lgbm_clf__colsample_bytree": [0.7, 1.0],
    #"lgbm_clf__reg_alpha": [0.001, 0.01, 0.1, 1.0, 10, 100],
    #"lgbm_clf__reg_lambda": [0.001, 0.01, 0.1, 1.0, 10, 100]
}

In [39]:
lgbm_grid = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=cross_validation,
    scoring=fthree_scorer,
    n_jobs=-1,
    verbose=True
)

lgbm_grid.fit(X_train, np.ravel(y_train))

Fitting 3 folds for each of 360 candidates, totalling 1080 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    8.7s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:   54.9s
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 1080 out of 1080 | elapsed:  5.3min finished




GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('lgbm_clf',
                                        LGBMClassifier(class_weight='balanced',
                                                       min_child_sample=20,
                                                       random_state=42,
                                                       reg_alpha=0.001))]),
             n_jobs=-1,
             param_grid={'lgbm_clf__learning_rate': [0.001, 0.01, 0.1, 1.0, 10,
                                                     100],
                         'lgbm_clf__max_depth': [3, 7, 12],
                         'lgbm_clf__n_estimators': [50, 100, 300, 500, 1000],
                         'lgbm_clf__num_leaves': [5, 10, 30, 50]},
             scoring=make_scorer(fbeta_score, average=macro, beta=3),
             verbose=True)

In [40]:
dict_params = lgbm_grid.best_params_
dict_params

{'lgbm_clf__learning_rate': 0.1,
 'lgbm_clf__max_depth': 7,
 'lgbm_clf__n_estimators': 300,
 'lgbm_clf__num_leaves': 10}

In [41]:
lgbm_clf = LGBMClassifier(
    boosting_type="gbdt",
    num_leaves=dict_params["lgbm_clf__num_leaves"],
    max_depth=dict_params["lgbm_clf__max_depth"],
    learning_rate=dict_params["lgbm_clf__learning_rate"],
    n_estimators=dict_params["lgbm_clf__n_estimators"],
    subsample_for_bin=200000,
    objective=None,
    class_weight="balanced",
    min_split_gain=0.0,
    min_child_weight=0.001,
    min_child_samples=20,
    subsample=1.0,
    subsample_freq=0,
    colsample_bytree=1.0,
    reg_alpha=0.001,
    reg_lambda=0.0,
    random_state=42,
    n_jobs=-1,
    importance_type="split"
)

lgbm_clf.fit(X_train, np.ravel(y_train))
y_pred = lgbm_clf.predict(X_test)

In [42]:
cm = pd.crosstab(y_test, y_pred, rownames=["Real class"], colnames=["Predicted class"])
cm

Real class \ Predicted class,negative,neutral,positive
negative,11,7,17
neutral,3,12,4
positive,9,16,204


In [43]:
accuracy_lgbm = round(accuracy_score(y_true=y_test, y_pred=y_pred), 4)
recall_lgbm = round(recall_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
precision_lgbm = round(precision_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f1_score_lgbm = round(f1_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f3_score_lgbm = round(fbeta_score(y_true=y_test, y_pred=y_pred, average="macro", beta=3), 4)

<h3 style="font-family: Arial">
    <font color="#088A68">
        Multi Layer Perceptron
    </font>
</h3>

In [44]:
mlp_clf = MLPClassifier(
    hidden_layer_sizes=(128, 64, 32),
    activation="relu",
    solver="adam",
    alpha=0.0001,
    batch_size="auto",
    learning_rate="constant",
    learning_rate_init=0.001,
    max_iter=100,
    shuffle=True,
    random_state=42,
    tol=0.0001,
    verbose=False,
    warm_start=False,
    early_stopping=False,
    validation_fraction=0.1,
)

mlp_clf.fit(X_train, np.ravel(y_train))
y_pred = mlp_clf.predict(X_test)

In [45]:
# architecture
print(f"Minimum loss reached: {mlp_clf.best_loss_}")
print(f"Number of features: {mlp_clf.n_features_in_}")
print(f"Number of iterations: {mlp_clf.n_iter_}")
print(f"Number of layers: {mlp_clf.n_layers_}")
print(f"Number of outputs: {mlp_clf.n_outputs_}")
print(f"Output activation function: {mlp_clf.out_activation_}")

Minimum loss reached: 0.0007344423515283769
Number of features: 35424
Number of iterations: 47
Number of layers: 5
Number of outputs: 3
Output activation function: softmax


In [46]:
cm = pd.crosstab(y_test, y_pred, rownames=["Real class"], colnames=["Predicted class"])
cm

Real class \ Predicted class,negative,neutral,positive
negative,10,3,22
neutral,3,9,7
positive,2,3,224


In [47]:
accuracy_mlp = round(accuracy_score(y_true=y_test, y_pred=y_pred), 4)
recall_mlp = round(recall_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
precision_mlp = round(precision_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f1_score_mlp = round(f1_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f3_score_mlp = round(fbeta_score(y_true=y_test, y_pred=y_pred, average="macro", beta=3), 4)

<h3 style="font-family: Arial">
    <font color="#088A68">
        Voting
    </font>
</h3>

In [48]:
clf1 = MLPClassifier(
    hidden_layer_sizes=(128, 64, 32),
    activation="relu",
    solver="adam",
    alpha=0.0001,
    batch_size="auto",
    learning_rate="constant",
    learning_rate_init=0.001,
    max_iter=100,
    shuffle=True,
    random_state=42,
    tol=0.0001,
    verbose=False,
    warm_start=False,
    early_stopping=False,
    validation_fraction=0.1,
)

clf2 = LogisticRegression(
    penalty="l2",
    dual=False,
    tol=0.0001,
    C=1.0,
    fit_intercept=True,
    intercept_scaling=1,
    class_weight="balanced",
    random_state=42,
    solver="saga",
    max_iter=100,
    multi_class="auto",
    verbose=False,
    warm_start=False,
    n_jobs=-1,
    l1_ratio=0
)

clf3 = SVC(
    C=1.0,
    kernel="linear",
    degree=3,
    gamma="auto",
    coef0=0.0,
    shrinking=True,
    probability=False,
    tol=0.001,
    cache_size=200,
    class_weight="balanced",
    verbose=False,
    max_iter=-1,
    decision_function_shape="ovr",
    break_ties=False,
    random_state=42
)

clf4 = AdaBoostClassifier(
    # Note: this argument is named `estimator` from scikit-learn 1.2 onward.
    base_estimator=DecisionTreeClassifier(
        criterion="entropy",
        splitter="best",
        max_depth=12,
        min_samples_split=2,
        min_samples_leaf=3,
        min_weight_fraction_leaf=0.0,
        max_features=100,
        random_state=42,
        max_leaf_nodes=None,
        min_impurity_decrease=0.0,
        class_weight="balanced",
        ccp_alpha=0.0
    ),
    n_estimators=125,
    learning_rate=0.01,
    algorithm="SAMME.R",
    random_state=42
)

voting_clf = VotingClassifier(
    estimators=[
        ("mlp_clf", clf1),
        ("log_clf", clf2),
        ("svc_clf", clf3),
        ("adab_clf", clf4)
    ],
    voting="hard"
)

voting_clf.fit(X_train, np.ravel(y_train))
y_pred = voting_clf.predict(X_test)
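
<p style="text-align: justify">
    With <code>voting="hard"</code>, each fitted estimator casts one vote per sample and the most frequent label wins; scikit-learn documents that ties are broken in ascending sort order of the class labels. The plain-Python sketch below mimics that rule on made-up votes. The <code>hard_vote</code> helper and the <code>votes</code> lists are illustrative and are not taken from the fitted models.
</p>

```python
from collections import Counter


def hard_vote(labels):
    """Return the majority label; ties go to the label that sorts first."""
    counts = Counter(labels)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


# Hypothetical votes from the four estimators (MLP, logistic regression, SVM, AdaBoost).
votes = [
    ["positive", "positive", "negative", "positive"],  # clear majority
    ["negative", "neutral", "negative", "neutral"],    # 2-2 tie
]
print([hard_vote(v) for v in votes])
```

<p style="text-align: justify">
    The tie in the second sample resolves to <code>negative</code> because it sorts before <code>neutral</code>. With four estimators such ties are possible; an odd number of estimators or the <code>weights</code> argument would avoid them. Soft voting is not an option with this configuration, since <code>SVC</code> with <code>probability=False</code> does not expose <code>predict_proba</code>.
</p>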

In [49]:
cm = pd.crosstab(y_test, y_pred, rownames=["Real class"], colnames=["Predicted class"])
cm

Predicted class,negative,neutral,positive
Real class,,,
negative,11,4,20
neutral,3,9,7
positive,5,2,222


In [50]:
accuracy_voting = round(accuracy_score(y_true=y_test, y_pred=y_pred), 4)
recall_voting = round(recall_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
precision_voting = round(precision_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f1_score_voting = round(f1_score(y_true=y_test, y_pred=y_pred, average="macro"), 4)
f3_score_voting = round(fbeta_score(y_true=y_test, y_pred=y_pred, average="macro", beta=3), 4)

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#088A68">
            3) Model evaluation
        </font>
    </h3>
</div>

In [51]:
models = pd.DataFrame(
    {
        "Model": [
            "Logistic Regression",
            "Random Forest",
            "AdaBoost",
            "SVM",
            "LGBM",
            "MLP",
            "Voting Classifier"
        ],
        "Accuracy": [
            accuracy_log,
            accuracy_rf,
            accuracy_adab,
            accuracy_svc,
            accuracy_lgbm,
            accuracy_mlp,
            accuracy_voting
        ],
        "Recall": [
            recall_log,
            recall_rf,
            recall_adab,
            recall_svc,
            recall_lgbm,
            recall_mlp,
            recall_voting
        ],
        "Precision": [
            precision_log,
            precision_rf,
            precision_adab,
            precision_svc,
            precision_lgbm,
            precision_mlp,
            precision_voting
        ],
        "F1-score": [
            f1_score_log,
            f1_score_rf,
            f1_score_adab,
            f1_score_svc,
            f1_score_lgbm,
            f1_score_mlp,
            f1_score_voting
        ],
        "F3-score": [
            f3_score_log,
            f3_score_rf,
            f3_score_adab,
            f3_score_svc,
            f3_score_lgbm,
            f3_score_mlp,
            f3_score_voting
        ]
    }
)

models.sort_values(by="Accuracy", ascending=False, ignore_index=True)

,Model,Accuracy,Recall,Precision,F1-score,F3-score
0,MLP,0.8587,0.5792,0.7173,0.6196,0.585
1,Logistic Regression,0.8551,0.6341,0.6789,0.6421,0.6342
2,Voting Classifier,0.8551,0.5858,0.6902,0.6219,0.5914
3,SVM,0.8445,0.4849,0.7098,0.5335,0.4905
4,AdaBoost,0.841,0.6282,0.664,0.6387,0.6296
5,LGBM,0.8021,0.6122,0.5759,0.5741,0.6001
6,Random Forest,0.7208,0.6349,0.5523,0.4958,0.5718
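
<p style="text-align: justify">
    Accuracy alone can flatter a model on this test set: the row totals of the confusion matrices above show that 229 of the 283 test reviews are positive. A quick sketch of the majority-class baseline, using those class counts:
</p>

```python
# Test-set class counts, read from the row totals of the confusion matrices above.
class_counts = {"negative": 35, "neutral": 19, "positive": 229}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total
# Its macro recall: 1.0 on the majority class and 0.0 on every other class.
baseline_macro_recall = 1 / len(class_counts)

print(majority_class, round(baseline_accuracy, 4), round(baseline_macro_recall, 4))
```

<p style="text-align: justify">
    Always predicting <code>positive</code> would already reach an accuracy of 0.8092. The best models beat that by roughly five points, and LGBM and Random Forest fall below it, so the macro-averaged columns are arguably the more informative ones for ranking the models.
</p>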


-------------------------------------------------------------------------------------------------------------------------------

<div style="margin: 10px;">
    <h2 style="font-family: Arial">
        <font color="#084B8A">
            New reviews for sentiment prediction
        </font>
    </h2>
</div>

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#088A68">
            1) Data creation
        </font>
    </h3>
</div>

In [52]:
df_new_data = pd.DataFrame(
    {
        "Review": [
            "This product is perfect!",
            "I don't recommend this product, it doesn't work",
            "if you are looking for a computer, buy it!",
            "I bought this laptop for my brother.",
            "Breakdown after 3 weeks. Don't buy, really lousy customer service, no refund or gift code!!!",
            "Pc that supports games like League Of Legends with the RTX 3080 ti GE and the ryzen 9 3900X.",
            "The computer doesn't work, this is not acceptable at all !! I want a PAYBACK ! "
        ]
    }
)

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#088A68">
            2) Text cleaning
        </font>
    </h3>
</div>

In [53]:
df_new_data = TextNet(
    data=df_new_data,
    column="Review"
).remove_space()

In [54]:
df_new_data = TextNet(
    data=df_new_data,
    column="Review"
).additional_cleaning(
    add_regexs=None
)

In [55]:
df_new_data = TextNet(
    data=df_new_data,
    column="Review"
).lowercase()

In [56]:
df_new_data = TextNet(
    data=df_new_data,
    column="Review"
).remove_punctuation()

In [57]:
df_new_data = TextNet(
    data=df_new_data,
    column="Review"
).remove_url()

In [58]:
df_new_data = TextNet(
    data=df_new_data,
    column="Review"
).remove_html()

In [59]:
df_new_data = TextNet(
    data=df_new_data,
    column="Review"
).remove_email()

In [60]:
df_new_data = TextNet(
    data=df_new_data,
    column="Review"
).remove_digit()

In [61]:
df_new_data = TextNet(
    data=df_new_data,
    column="Review"
).remove_mention()

In [62]:
df_new_data = TextNet(
    data=df_new_data,
    column="Review"
).remove_hastag()

In [63]:
df_new_data = TextNet(
    data=df_new_data,
    column="Review"
).remove_emoji()

In [64]:
# Negations and polarity words that must survive stopword removal
stopwords_to_keep = [
    "doesn", "doesn't", "doesnt", "dont", "don't", "not", "aren", "aren't",
    "arent", "couldn", "couldn't", "couldnt", "didn", "didn't", "didnt",
    "hadn", "hadn't", "hadnt", "hasn", "hasn't", "hasnt", "haven't", "havent",
    "isn", "isn't", "isnt", "mightn", "mightn't", "mightnt", "mustn",
    "mustn't", "mustnt", "needn", "needn't", "neednt", "shan", "shan't",
    "shant", "shouldn", "shouldn't", "shouldnt", "wasn", "wasn't", "wasnt",
    "weren", "weren't", "werent", "won", "won't", "wont", "wouldn",
    "wouldn't", "wouldnt", "good", "bad", "worst", "wonderful", "best",
    "better"
]

stopwords_to_add = [
    "es", "que", "en", "la", "las", "le", "les", "lo", "los", "de", "no",
    "el", "al", "un", "una", "se", "sa", "su", "sus", "por", "con", "mi",
    "para", "todo", "gb", "laptop", "computer", "pc"
]

In [65]:
df_new_data = WordNet(
    data=df_new_data,
    column="Review"
).remove_stopword(
    language="english",
    lowercase=False,
    remove_accents=False,
    add_stopwords=stopwords_to_add,
    remove_stopwords=stopwords_to_keep
)
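
<p style="text-align: justify">
    The keep list protects negations, yet the cleaned reviews shown in the final prediction table (for example <code>recommend product doesn work</code>) suggest that "don't" is lost while "doesn't" survives as <code>doesn</code>. One plausible explanation, sketched below in plain Python, is that punctuation removal splits contractions at the apostrophe, leaving <code>don</code> and <code>t</code>, and the bare stem <code>don</code> is not in the keep list. The <code>clean</code> helper, the regex and the small <code>base_stopwords</code> set are stand-ins invented for this illustration; they are not the <code>TextNet</code> or <code>WordNet</code> implementation.
</p>

```python
import re

# Small subset of common English stopwords, standing in for the full list (assumption).
base_stopwords = {"i", "this", "it", "t", "don", "doesn", "not"}
# Negations protected from removal, mirroring part of stopwords_to_keep.
keep = {"doesn", "doesn't", "dont", "don't", "not"}


def clean(text, stopwords):
    """Lowercase, replace punctuation with spaces, then drop stopwords."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(word for word in text.split() if word not in stopwords)


review = "I don't recommend this product, it doesn't work"

# With the keep list as it stands, the first negation disappears.
print(clean(review, base_stopwords - keep))
# Adding the bare stem "don" to the keep list preserves it.
print(clean(review, base_stopwords - (keep | {"don"})))
```

<p style="text-align: justify">
    The first call prints <code>recommend product doesn work</code>, the second <code>don recommend product doesn work</code>. If the assumption holds, adding <code>don</code> (and <code>haven</code>, which is also absent) to <code>stopwords_to_keep</code> would keep "don't recommend" and "don't buy" distinguishable from their positive counterparts.
</p>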

In [66]:
df_new_data = WordNet(
    data=df_new_data,
    column="Review"
).lemmatize(
    language="english",
    lowercase=False,
    remove_accents=False
)

In [67]:
df_new_data = TextNet(
    data=df_new_data,
    column="Review"
).remove_whitespace()

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#088A68">
            3) Vectorization
        </font>
    </h3>
</div>

In [68]:
list_new_data = df_new_data["Review"].tolist()

In [69]:
new_data_vec = vectorizer.transform(list_new_data)
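
<p style="text-align: justify">
    New reviews go through <code>vectorizer.transform</code>, not <code>fit_transform</code>, so they are encoded with the vocabulary learned on the training corpus and keep the feature count the classifier expects (35424 here, per <code>n_features_in_</code> above). The sketch below illustrates the idea with a hand-rolled bag-of-words. <code>fit_vocabulary</code> and <code>transform</code> are hypothetical helpers written for this example, and the notebook's actual <code>vectorizer</code> may tokenize and weight terms differently.
</p>

```python
def fit_vocabulary(corpus):
    """Map each distinct training token to a fixed column index."""
    tokens = sorted({token for doc in corpus for token in doc.split()})
    return {token: index for index, token in enumerate(tokens)}


def transform(docs, vocabulary):
    """Count tokens per document; tokens absent from the vocabulary are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.split():
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows


train_corpus = ["product perfect", "product doesn work"]  # toy training reviews
vocabulary = fit_vocabulary(train_corpus)

new_docs = ["product perfect", "breakdown weeks buy"]  # second review is entirely unseen
print(transform(new_docs, vocabulary))
```

<p style="text-align: justify">
    Both rows have the same four columns as the training vocabulary, and the unseen review becomes an all-zero row. Reviews made mostly of out-of-vocabulary words give the classifier little to work with, which is one reason to inspect the predicted probabilities and not only the label.
</p>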

<div style="margin: 10px;">
    <h3 style="font-family: Arial">
        <font color="#088A68">
            4) Prediction
        </font>
    </h3>
</div>

In [70]:
new_data_pred = mlp_clf.predict(new_data_vec.toarray())

In [71]:
new_data_pred_proba = mlp_clf.predict_proba(new_data_vec.toarray())
new_data_pred_proba = np.around(new_data_pred_proba, decimals=4)

In [72]:
new_data_pred = pd.DataFrame({"Predicted Sentiment": new_data_pred})
new_data_pred_proba = pd.DataFrame(new_data_pred_proba, columns=["Negative proba", "Neutral proba", "Positive proba"])

In [73]:
y_pred_new = pd.concat([df_new_data["Review"], new_data_pred], axis=1)
y_pred_new = pd.concat([y_pred_new, new_data_pred_proba], axis=1)

In [74]:
y_pred_new

,Review,Predicted Sentiment,Negative proba,Neutral proba,Positive proba
0,product perfect,positive,0.0105,0.0142,0.9754
1,recommend product doesn work,negative,0.3576,0.3454,0.297
2,looking buy,positive,0.0504,0.3672,0.5824
3,buy brother,positive,0.0075,0.0446,0.948
4,breakdown weeks buy really lousy customer serv...,positive,0.2222,0.0925,0.6853
5,supports games like league legends rtx ti ge r...,positive,0.0035,0.0143,0.9822
6,doesn work not acceptable want payback,negative,0.3785,0.3242,0.2973
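
<p style="text-align: justify">
    The predicted label is the class with the highest probability, so a label can rest on a thin margin: reviews 1 and 6 above are labelled negative with a top probability below 0.38. A small sketch that flags such cases, using the probabilities from the table above and a 0.5 cut-off chosen arbitrarily for illustration:
</p>

```python
classes = ["negative", "neutral", "positive"]

# Rows copied from the prediction table above: (negative, neutral, positive) probabilities.
probabilities = [
    (0.0105, 0.0142, 0.9754),
    (0.3576, 0.3454, 0.2970),
    (0.0504, 0.3672, 0.5824),
    (0.0075, 0.0446, 0.9480),
    (0.2222, 0.0925, 0.6853),
    (0.0035, 0.0143, 0.9822),
    (0.3785, 0.3242, 0.2973),
]

threshold = 0.5  # hypothetical cut-off, not tuned

# The label is the argmax of each probability row.
labels = [classes[row.index(max(row))] for row in probabilities]
# Indices of reviews whose winning class has a probability below the cut-off.
low_confidence = [i for i, row in enumerate(probabilities) if max(row) < threshold]

print(labels)
print(low_confidence)
```

<p style="text-align: justify">
    The flagged indices are 1 and 6. Review 4 is a different kind of miss: a complaint about a breakdown and poor customer service is predicted positive at 0.6853, which a confidence threshold alone would not catch.
</p>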
