# NLP LAB

###Juliana Varela

Imports

In [1]:
pip install emoji --upgrade

Collecting emoji
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.0-py3-none-any.whl (586 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m586.9/586.9 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: emoji
Successfully installed emoji-2.14.0


We will be using sklearn since it allows fine tuning of the preprocessing

In [2]:
import numpy as np
import pandas as pd

import re
import emoji

import nltk

from nltk.corpus import words,stopwords
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report
from sklearn.model_selection import GridSearchCV

##Data processing

In [3]:
TRAIN = pd.read_csv("http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/train.csv.gz")
TEST = pd.read_csv("http://www.i3s.unice.fr/~riveill/dataset/Amazon_Unlocked_Mobile/test.csv.gz")

In [4]:
TRAIN.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,Samsung Galaxy Note 4 N910C Unlocked Cellphone...,Samsung,449.99,4,I love it!!! I absolutely love it!! 👌👍,0.0
1,BLU Energy X Plus Smartphone - With 4000 mAh S...,BLU,139.0,5,I love the BLU phones! This is my second one t...,4.0
2,Apple iPhone 6 128GB Silver AT&T,Apple,599.95,5,Great phone,1.0
3,BLU Advance 4.0L Unlocked Smartphone -US GSM -...,BLU,51.99,4,Very happy with the performance. The apps work...,2.0
4,Huawei P8 Lite US Version- 5 Unlocked Android ...,Huawei,198.99,5,Easy to use great price,0.0


In [5]:
nltk.download('wordnet')
nltk.download('words')
WNlemma =nltk.WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


In [6]:
custom_stop_words = [ 'the', 'and', 'is', 'in', 'it', 'of', 'to', 'a', 'that', 'was', 'for', 'with' 'by', 'at', 'this', 'but', 'from']

In [7]:
#This code is taken fro the NLP slides, it deletes repeated unecessary characters in words
class RepeatReplacer:
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        valid_words = set(words.words())
        if word in valid_words:
            return word
        # Remove repeated characters
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word


In [8]:
#Create a preprocessor that transforms emojis; uses lemmatization, lowers the fonts and manages dates
#ONLY FOR THE REVIEWS COLUMN

def my_preprocessor(text):
    processed_words = []

    #We apply the emoji demojize function to all the text
    text= emoji.demojize(text)
    #We lower all the characters in text
    text.lower()
    #We split the text to get the words
    words_list = text.split()



    for word in words_list:
        #We were going to remove the repeated characters but it takes 80 years fitting
        #word = replacer.replace(word)

        #We lemmatize the words
        word = WNlemma.lemmatize(word)
        #Then e append them to a list
        processed_words.append(word)

    text = ' '.join(processed_words)
    return text

#replacer = RepeatReplacer()


In [9]:
#We check how our function looks like
textico = my_preprocessor(TRAIN.iloc[0]['Reviews'])
print(textico)

I love it!!! I absolutely love it!! :OK_hand::thumbs_up:


In [10]:
#We change the ys adding the groups given in the instructions, 1
TRAIN['Review Votes'] = TRAIN['Review Votes'].apply(lambda x: 0 if x <= 2 else 1)
TEST['Review Votes'] = TEST['Review Votes'].apply(lambda x: 0 if x <= 2 else 1)

In [11]:
#We take the respective y and X values; the revie votes is what we're trying to predict
y_train = TRAIN['Review Votes']
X_train = TRAIN['Reviews']

y_test = TEST['Review Votes']
X_test = TEST['Reviews']

In [12]:
#We check the shapes of our training and testing sets
y_train.shape, X_train.shape, y_test.shape, X_test.shape

((5000,), (5000,), (1000,), (1000,))

Pipeline

In [13]:
X_train = X_train.dropna()
X_test = X_test.dropna()
y_train = y_train[X_train.index]
y_test = y_test[X_test.index]

In [14]:
#We create the pipeline passing my_preprocessor as the TdifVectorizer preprocessor and use logistic regression for its classification
pipeline = Pipeline(steps=[ ('vectorizer', TfidfVectorizer(preprocessor=my_preprocessor, token_pattern=r'\b\w+\b', analyzer='word')),
                    ('classifier', LogisticRegression()) ])

Training

In [15]:
# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

f1 = f1_score(y_test, y_pred, average='weighted')
print(f'F1 Score: {f1}')

F1 Score: 0.8339292831886853


##Trying different parameters

In [16]:
params = {
    'tfidf__stop_words': [None, 'english',custom_stop_words],
    'tfidf__max_df': [1.0, 0.5],
    'tfidf__min_df': [1, 50],
    'tfidf__ngram_range': [(1, 1), (1, 3)],
    'tfidf__max_features': [None, 100, 1000],
    'clf__class_weight': [None, 'balanced']
}

In [17]:
'''#We create the grid search with our pipeline and parameters, with the goal being the best F1 Score
grid_search = GridSearchCV(pipeline, params, scoring='f1', cv=3, verbose=2, n_jobs=-1)
grid_search.fit(X_train, y_train)

#We get the best model and predict with it
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid_search.best_params_)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))'''

'#We create the grid search with our pipeline and parameters, with the goal being the best F1 Score\ngrid_search = GridSearchCV(pipeline, params, scoring=\'f1\', cv=3, verbose=2, n_jobs=-1)\ngrid_search.fit(X_train, y_train)\n\n#We get the best model and predict with it\nbest_model = grid_search.best_estimator_\ny_pred = best_model.predict(X_test)\n\nprint("Best Parameters:", grid_search.best_params_)\nprint("\nClassification Report:\n", classification_report(y_test, y_pred))\nprint("F1 Score:", f1_score(y_test, y_pred))'

We get an error this is because of the values of max_df and min_df, so we relax them.

In [20]:

params = {
    'vectorizer__stop_words': [None, 'english', custom_stop_words],
    'vectorizer__max_df': [1.0, 0.9],
    'vectorizer__min_df': [1, 10],
    'vectorizer__ngram_range': [(1, 1), (1,3)],
    'classifier__class_weight': [None, 'balanced',{0: 1, 1: 2}, {0: 2, 1: 1}],
    'classifier__C': [0.7, 7.0]
}

# We create the grid search with our pipeline and parameters, with the goal being the best F1 Score
grid_search = GridSearchCV(pipeline, params, scoring='f1_weighted', cv=3, verbose=2, n_jobs=-1)
grid_search.fit(X_train, y_train)

# We get the best model and predict with it
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid_search.best_params_)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred, average='weighted'))


Fitting 3 folds for each of 192 candidates, totalling 576 fits




Best Parameters: {'classifier__C': 0.7, 'classifier__class_weight': {0: 1, 1: 2}, 'vectorizer__max_df': 1.0, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 3), 'vectorizer__stop_words': ['the', 'and', 'is', 'in', 'it', 'of', 'to', 'a', 'that', 'was', 'for', 'withby', 'at', 'this', 'but', 'from']}

Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.97      0.93       863
           1       0.56      0.26      0.36       137

    accuracy                           0.87      1000
   macro avg       0.73      0.62      0.64      1000
weighted avg       0.85      0.87      0.85      1000

F1 Score: 0.8501919142475506


Looking at the f1 score we see with grid search it gets higher. Regarding the parameters we  can observe that since for our model we used unbalanced data the class weight parameter improves the model results a lot, sal with the parameters for the vectorizer. We can conclude from this that we can also improve the scores even more by adding more parameters, the parameters are really important and crucial for the results and by implementing the repeated words class in the text processing we will ensure getting better F1 Scores.