# Classifying reviews of women's clothing sold online

- Are the written reviews of a garment consistent with the rating it was given?

Ideas:
- data visualisation: which types of clothing get the highest ratings?
- number of reviews by customer age
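
The two ideas above could be explored with simple pandas aggregations. The sketch below runs on a handful of hypothetical rows (not the real dataset), and the age bracket edges are an arbitrary assumption chosen only for illustration.

```python
import pandas as pd

# Hypothetical rows, used only to illustrate the two aggregations
sample = pd.DataFrame({
    "Age": [33, 34, 60, 50, 47, 28],
    "Rating": [4, 5, 3, 5, 5, 2],
    "Department Name": ["Intimate", "Dresses", "Dresses", "Bottoms", "Tops", "Tops"],
})

# Idea 1: mean rating per department, highest first
mean_rating = (
    sample.groupby("Department Name")["Rating"].mean().sort_values(ascending=False)
)
print(mean_rating)

# Idea 2: number of reviews per age bracket (bracket edges are an arbitrary choice)
age_brackets = pd.cut(sample["Age"], bins=[0, 30, 45, 60, 120])
reviews_per_age = age_brackets.value_counts().sort_index()
print(reviews_per_age)
```

On the real data, the same two expressions would apply to `data` once it is loaded, and either result could be passed to `.plot(kind="bar")` for a quick chart.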

In [67]:
import pandas as pd
import spacy
import string
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer

## I. Loading the data

We start by loading the data. The dataset contains reviews of women's clothing sold online and was collected by web scraping.

In [68]:
# Load the data
data = pd.read_csv("Womens Clothing E-Commerce Reviews.csv", sep = ",")

# Rename the first column so it can be used as an id later on
data = data.rename(columns = {"Unnamed: 0" : "id"})

# Display the first 5 rows
data.head()

Unnamed: 0,id,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,"Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. i bought a petite and am 5'8"". i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite.",5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,"I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c",3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!,5,1,6,General,Tops,Blouses
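
The `"Unnamed: 0"` name comes from pandas itself: when the first header field of a CSV file is empty, `read_csv` assigns that placeholder name. The sketch below reproduces this on a tiny in-memory CSV (a hypothetical stand-in for the real file) and shows the rename used here next to one possible alternative, `index_col=0`.

```python
import io
import pandas as pd

# Tiny stand-in for the CSV file: the header line starts with an empty field
csv_text = ",Clothing ID,Age\n0,767,33\n1,1080,34\n"

# Approach used in this notebook: read, then rename the auto-named column
df = pd.read_csv(io.StringIO(csv_text)).rename(columns={"Unnamed: 0": "id"})
print(df.columns.tolist())

# Alternative: use that first column directly as the index
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.columns.tolist())
```

Keeping `id` as a regular column, as done here, makes it easy to select it alongside other columns later.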


In [69]:
data.shape

(23486, 11)

The dataset contains 23,486 reviews and 11 columns.

## II. Data preprocessing

### Data cleaning

To get cleaner data, we clean the dataset by:
- lowercasing the text
- removing punctuation

Note: we do not strip accented or other special characters, because English text rarely contains any.
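
As a point of comparison, the two cleaning steps can be sketched with the standard library alone. This is not the spaCy-based approach used later in this notebook, and the helper name `clean_text` is hypothetical. The results also differ slightly: here a contraction such as "it's" collapses to "its", whereas spaCy tokenisation yields "it 's".

```python
import string

def clean_text(text):
    """Lowercase the text and strip ASCII punctuation."""
    lowered = text.lower()
    # Delete every character listed in string.punctuation
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Collapse the extra spaces left behind by removed characters
    return " ".join(no_punct.split())

print(clean_text("Love this dress! It's sooo pretty."))
print(clean_text("Absolutely wonderful - silky and sexy and comfortable"))
```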

First, we convert the 'Review Text' column to strings so that all the preprocessing functions can be applied to it.

In [70]:
data['Review Text'] = data['Review Text'].astype(str)
data['Review Text'].dtype

dtype('O')

We start by lowercasing the text.

In [71]:
def lower_text(df, column_name): # Convert to lowercase
    df[column_name] = df[column_name].apply(lambda x: x.lower())
    return df

data = lower_text(data, 'Review Text')
data.head()

Unnamed: 0,id,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,absolutely wonderful - silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,"love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. i bought a petite and am 5'8"". i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite.",5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,"i had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c",3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"i love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,this shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!,5,1,6,General,Tops,Blouses


Next, we remove all punctuation.

In [72]:
# Install spaCy and the English model
!pip install -U spacy
! python -m spacy validate

import spacy
!python -m spacy download en_core_web_sm

nlp = spacy.load('en_core_web_sm')


✔ Loaded compatibility table

ℹ spaCy installation:
c:\Users\ASUS\anaconda3\Lib\site-packages\spacy

NAME              SPACY            VERSION
en_core_web_sm    >=3.7.2,<3.8.0   3.7.1   ✔
fr_core_news_sm   >=3.7.0,<3.8.0   3.7.0   ✔

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

In [73]:
# Remove punctuation
def remove_punctuation(df, column_name):
    df[column_name] = df[column_name].apply(lambda x: ' '.join([token.text for token in nlp(x) if not token.is_punct and token.text != "'"]))
    return df

In [74]:
data = remove_punctuation(data, 'Review Text')

We display the dataframe to check that the preprocessing was applied correctly.

In [75]:
data.head()

Unnamed: 0,id,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,absolutely wonderful silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,love this dress it 's sooo pretty i happened to find it in a store and i 'm glad i did bc i never would have ordered it online bc it 's petite i bought a petite and am 5'8 i love the length on me- hits just a little below the knee would definitely be a true midi on someone who is truly petite,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual size but i found this to be outrageously small so small in fact that i could not zip it up i reordered it in petite medium which was just ok overall the top half was comfortable and fit nicely but the bottom half had a very tight under layer and several somewhat cheap net over layers imo a major design flaw was the net over layer sewn directly into the zipper it c,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,i love love love this jumpsuit it 's fun flirty and fabulous every time i wear it i get nothing but great compliments,5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,this shirt is very flattering to all due to the adjustable front tie it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan love this shirt,5,1,6,General,Tops,Blouses


### Feature selection: text preprocessing

We use spaCy for our language-processing tasks, because it offers several features in one library:
- tokenising text directly
- lemmatising words
- identifying and classifying named entities
- analysing syntactic dependencies between words
- representing words as vectors (embeddings)
- etc.

#### Text representation

##### Lemmatisation

**Example on a single review**

We first display the content of one review to see what the data looks like.

In [76]:
avis50 = data['Review Text'].iloc[50]
avis50

"this is a cute top that can transition easily from summer to fall it fits well nice print and it 's comfortable i tried this on in the store but did not purchase it because the color washed me out this is not the best color for a blonde would look much better on a brunette if this was in a different color i most likely would have purchased it"

We convert it into a spaCy object so that we can apply text preprocessing to it.

In [77]:
avis_nlp = nlp(avis50)

In [78]:
type(avis_nlp)

spacy.tokens.doc.Doc

We now have a spaCy object.

We print each token of this spaCy object.

In [79]:
for token in avis_nlp:
    print(token)

this
is
a
cute
top
that
can
transition
easily
from
summer
to
fall
it
fits
well
nice
print
and
it
's
comfortable
i
tried
this
on
in
the
store
but
did
not
purchase
it
because
the
color
washed
me
out
this
is
not
the
best
color
for
a
blonde
would
look
much
better
on
a
brunette
if
this
was
in
a
different
color
i
most
likely
would
have
purchased
it


Next, we display the lemma of each token. This step reduces words to their dictionary form, which simplifies the text analysis.

In [80]:
# Lemmatisation
for token in avis_nlp:
    print(token.lemma_)

this
be
a
cute
top
that
can
transition
easily
from
summer
to
fall
it
fit
well
nice
print
and
it
be
comfortable
I
try
this
on
in
the
store
but
do
not
purchase
it
because
the
color
wash
I
out
this
be
not
the
good
color
for
a
blonde
would
look
much
well
on
a
brunette
if
this
be
in
a
different
color
I
most
likely
would
have
purchase
it


**Generalisation**

We can now apply this preprocessing to the whole 'Review Text' column.

To lemmatise the text, we define a function that returns the lemma of every word in the original review, which simplifies the text.

In [81]:
def lemmatise_text(text):
    text = nlp(text) # convert the text into a spaCy Doc
    lemmas = [token.lemma_ for token in text] # collect the lemmas
    return ' '.join(lemmas) # return the lemmas as a single string

In [82]:
import pandas as pd

# Remove the column-width limit so that full reviews are displayed
pd.set_option('display.max_colwidth', None)


In [83]:
# Test on a single review
print("Original review: ", data['Review Text'].iloc[50])
print("Lemmatised review: ", lemmatise_text(data['Review Text'].iloc[50]))


Original review:  this is a cute top that can transition easily from summer to fall it fits well nice print and it 's comfortable i tried this on in the store but did not purchase it because the color washed me out this is not the best color for a blonde would look much better on a brunette if this was in a different color i most likely would have purchased it
Lemmatised review:  this be a cute top that can transition easily from summer to fall it fit well nice print and it be comfortable I try this on in the store but do not purchase it because the color wash I out this be not the good color for a blonde would look much well on a brunette if this be in a different color I most likely would have purchase it


We then add a column containing the **lemmatised** version of each review.

In [84]:
data['lemmas'] = data['Review Text'].apply(lemmatise_text)

In [85]:
data.head()

Unnamed: 0,id,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,lemmas
0,0,767,33,,absolutely wonderful silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates,absolutely wonderful silky and sexy and comfortable
1,1,1080,34,,love this dress it 's sooo pretty i happened to find it in a store and i 'm glad i did bc i never would have ordered it online bc it 's petite i bought a petite and am 5'8 i love the length on me- hits just a little below the knee would definitely be a true midi on someone who is truly petite,5,1,4,General,Dresses,Dresses,love this dress it be sooo pretty I happen to find it in a store and I ' m glad I do bc I never would have order it online bc it be petite I buy a petite and be 5'8 I love the length on me- hit just a little below the knee would definitely be a true midi on someone who be truly petite
2,2,1077,60,Some major design flaws,i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual size but i found this to be outrageously small so small in fact that i could not zip it up i reordered it in petite medium which was just ok overall the top half was comfortable and fit nicely but the bottom half had a very tight under layer and several somewhat cheap net over layers imo a major design flaw was the net over layer sewn directly into the zipper it c,3,0,0,General,Dresses,Dresses,I have such high hope for this dress and really want it to work for I I initially order the petite small my usual size but I find this to be outrageously small so small in fact that I could not zip it up I reorder it in petite medium which be just ok overall the top half be comfortable and fit nicely but the bottom half have a very tight under layer and several somewhat cheap net over layer imo a major design flaw be the net over layer sew directly into the zipper it c
3,3,1049,50,My favorite buy!,i love love love this jumpsuit it 's fun flirty and fabulous every time i wear it i get nothing but great compliments,5,1,0,General Petite,Bottoms,Pants,I love love love this jumpsuit it be fun flirty and fabulous every time I wear it I get nothing but great compliment
4,4,847,47,Flattering shirt,this shirt is very flattering to all due to the adjustable front tie it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan love this shirt,5,1,6,General,Tops,Blouses,this shirt be very flattering to all due to the adjustable front tie it be the perfect length to wear with legging and it be sleeveless so it pair well with any cardigan love this shirt


In [86]:
# Save the data
data.to_pickle('data.pkl')
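
Saving to a pickle file makes it possible to resume work later without recomputing the lemmas, which is the slow step. A minimal round-trip sketch, using a small hypothetical frame and a temporary directory instead of the real `data.pkl`:

```python
import os
import tempfile
import pandas as pd

# Small hypothetical frame standing in for the preprocessed data
df = pd.DataFrame({"id": [0, 1], "lemmas": ["love this dress", "I love this jumpsuit"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl")
    df.to_pickle(path)               # save, as done in the cell above
    restored = pd.read_pickle(path)  # reload in a later session

print(restored.equals(df))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.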

##### Stems

To reduce words to their base form, we apply SnowballStemmer to the text. To do so, we write a function and apply it to the reviews.

In [87]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer

In [88]:
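# Build the stemmer and tokenizer once rather than on every call
stemmer = SnowballStemmer('english')
tokenizer = RegexpTokenizer(r'\w+') # raw string avoids an invalid-escape warning

def stem_text(text):
    stems = [stemmer.stem(token) for token in tokenizer.tokenize(text)]
    return ' '.join(stems)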

We apply the function to our dataframe.

In [89]:
data['stems'] = data['Review Text'].apply(stem_text)

We check that the result looks correct.

In [90]:
data.head()

Unnamed: 0,id,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,lemmas,stems
0,0,767,33,,absolutely wonderful silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates,absolutely wonderful silky and sexy and comfortable,absolut wonder silki and sexi and comfort
1,1,1080,34,,love this dress it 's sooo pretty i happened to find it in a store and i 'm glad i did bc i never would have ordered it online bc it 's petite i bought a petite and am 5'8 i love the length on me- hits just a little below the knee would definitely be a true midi on someone who is truly petite,5,1,4,General,Dresses,Dresses,love this dress it be sooo pretty I happen to find it in a store and I ' m glad I do bc I never would have order it online bc it be petite I buy a petite and be 5'8 I love the length on me- hit just a little below the knee would definitely be a true midi on someone who be truly petite,love this dress it s sooo pretti i happen to find it in a store and i m glad i did bc i never would have order it onlin bc it s petit i bought a petit and am 5 8 i love the length on me hit just a littl below the knee would definit be a true midi on someon who is truli petit
2,2,1077,60,Some major design flaws,i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual size but i found this to be outrageously small so small in fact that i could not zip it up i reordered it in petite medium which was just ok overall the top half was comfortable and fit nicely but the bottom half had a very tight under layer and several somewhat cheap net over layers imo a major design flaw was the net over layer sewn directly into the zipper it c,3,0,0,General,Dresses,Dresses,I have such high hope for this dress and really want it to work for I I initially order the petite small my usual size but I find this to be outrageously small so small in fact that I could not zip it up I reorder it in petite medium which be just ok overall the top half be comfortable and fit nicely but the bottom half have a very tight under layer and several somewhat cheap net over layer imo a major design flaw be the net over layer sew directly into the zipper it c,i had such high hope for this dress and realli want it to work for me i initi order the petit small my usual size but i found this to be outrag small so small in fact that i could not zip it up i reorder it in petit medium which was just ok overal the top half was comfort and fit nice but the bottom half had a veri tight under layer and sever somewhat cheap net over layer imo a major design flaw was the net over layer sewn direct into the zipper it c
3,3,1049,50,My favorite buy!,i love love love this jumpsuit it 's fun flirty and fabulous every time i wear it i get nothing but great compliments,5,1,0,General Petite,Bottoms,Pants,I love love love this jumpsuit it be fun flirty and fabulous every time I wear it I get nothing but great compliment,i love love love this jumpsuit it s fun flirti and fabul everi time i wear it i get noth but great compliment
4,4,847,47,Flattering shirt,this shirt is very flattering to all due to the adjustable front tie it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan love this shirt,5,1,6,General,Tops,Blouses,this shirt be very flattering to all due to the adjustable front tie it be the perfect length to wear with legging and it be sleeveless so it pair well with any cardigan love this shirt,this shirt is veri flatter to all due to the adjust front tie it is the perfect length to wear with leg and it is sleeveless so it pair well with ani cardigan love this shirt


In [91]:
# Save the data
data.to_pickle('data.pkl')

##### Part-of-speech tags

Next, we analyse the text and replace each word with its grammatical category, to continue studying the reviews.

In [92]:
def replace_words_with_pos_tag(text):
    text = nlp(text)
    return ' '.join([token.pos_ for token in text])

In [93]:
data['pos'] = data['Review Text'].apply(replace_words_with_pos_tag)

In [94]:
data.head()

Unnamed: 0,id,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,lemmas,stems,pos
0,0,767,33,,absolutely wonderful silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates,absolutely wonderful silky and sexy and comfortable,absolut wonder silki and sexi and comfort,ADV ADJ NOUN CCONJ ADJ CCONJ ADJ
1,1,1080,34,,love this dress it 's sooo pretty i happened to find it in a store and i 'm glad i did bc i never would have ordered it online bc it 's petite i bought a petite and am 5'8 i love the length on me- hits just a little below the knee would definitely be a true midi on someone who is truly petite,5,1,4,General,Dresses,Dresses,love this dress it be sooo pretty I happen to find it in a store and I ' m glad I do bc I never would have order it online bc it be petite I buy a petite and be 5'8 I love the length on me- hit just a little below the knee would definitely be a true midi on someone who be truly petite,love this dress it s sooo pretti i happen to find it in a store and i m glad i did bc i never would have order it onlin bc it s petit i bought a petit and am 5 8 i love the length on me hit just a littl below the knee would definit be a true midi on someon who is truli petit,VERB DET NOUN SPACE PRON AUX NOUN ADJ SPACE PRON VERB PART VERB PRON ADP DET NOUN CCONJ PRON VERB VERB ADJ PRON VERB PROPN PRON ADV AUX AUX VERB PRON ADJ PROPN PRON AUX ADJ SPACE PRON VERB DET ADJ CCONJ AUX NUM SPACE PRON VERB DET NOUN ADP PROPN VERB ADV DET ADJ ADP DET NOUN SPACE AUX ADV AUX DET ADJ NOUN ADP PRON PRON AUX ADV ADJ
2,2,1077,60,Some major design flaws,i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual size but i found this to be outrageously small so small in fact that i could not zip it up i reordered it in petite medium which was just ok overall the top half was comfortable and fit nicely but the bottom half had a very tight under layer and several somewhat cheap net over layers imo a major design flaw was the net over layer sewn directly into the zipper it c,3,0,0,General,Dresses,Dresses,I have such high hope for this dress and really want it to work for I I initially order the petite small my usual size but I find this to be outrageously small so small in fact that I could not zip it up I reorder it in petite medium which be just ok overall the top half be comfortable and fit nicely but the bottom half have a very tight under layer and several somewhat cheap net over layer imo a major design flaw be the net over layer sew directly into the zipper it c,i had such high hope for this dress and realli want it to work for me i initi order the petit small my usual size but i found this to be outrag small so small in fact that i could not zip it up i reorder it in petit medium which was just ok overal the top half was comfort and fit nice but the bottom half had a veri tight under layer and sever somewhat cheap net over layer imo a major design flaw was the net over layer sewn direct into the zipper it c,PRON VERB ADJ ADJ NOUN ADP DET NOUN CCONJ ADV VERB PRON PART VERB ADP PRON PRON ADV VERB DET ADJ ADJ PRON ADJ NOUN CCONJ PRON VERB PRON PART AUX ADV ADJ ADV ADJ ADP NOUN SCONJ PRON AUX PART VERB PRON ADP PRON VERB PRON ADP ADJ NOUN PRON AUX ADV ADV ADJ DET ADJ NOUN AUX ADJ CCONJ ADJ ADV CCONJ DET ADJ NOUN VERB DET ADV ADJ ADP NOUN CCONJ ADJ ADV ADJ NOUN ADP NOUN ADV DET ADJ NOUN NOUN AUX DET NOUN ADP NOUN VERB ADV ADP DET NOUN PRON VERB
3,3,1049,50,My favorite buy!,i love love love this jumpsuit it 's fun flirty and fabulous every time i wear it i get nothing but great compliments,5,1,0,General Petite,Bottoms,Pants,I love love love this jumpsuit it be fun flirty and fabulous every time I wear it I get nothing but great compliment,i love love love this jumpsuit it s fun flirti and fabul everi time i wear it i get noth but great compliment,PRON VERB NOUN VERB DET NOUN PRON AUX NOUN NOUN CCONJ ADJ DET NOUN PRON VERB PRON PRON VERB PRON SCONJ ADJ NOUN
4,4,847,47,Flattering shirt,this shirt is very flattering to all due to the adjustable front tie it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan love this shirt,5,1,6,General,Tops,Blouses,this shirt be very flattering to all due to the adjustable front tie it be the perfect length to wear with legging and it be sleeveless so it pair well with any cardigan love this shirt,this shirt is veri flatter to all due to the adjust front tie it is the perfect length to wear with leg and it is sleeveless so it pair well with ani cardigan love this shirt,DET NOUN AUX ADV ADJ ADP PRON ADP ADP DET ADJ ADJ NOUN PRON AUX DET ADJ NOUN PART VERB ADP NOUN CCONJ PRON AUX ADJ SCONJ PRON VERB ADV ADP DET NOUN VERB DET NOUN


In [95]:
# Save the data
data.to_pickle('data.pkl')

##### Named entities

In our context, studying named entities makes little sense, because reviews of women's clothing contain few places, dates or people.

To illustrate this point, we test it on one review.

In [96]:
def ner(text):

    text = nlp(text) # convert the text into a spaCy Doc
    
    new_text = [] # tokens and entity labels to output

    for token in text: # for each token in the review

        # print(token.text, token.ent_iob_, token.ent_type_)
        
        if token.ent_iob_ == "O": # the token is outside any named entity
            new_text.append(token.text) # keep the token text
        elif token.ent_iob_ == "B": # the token begins a named entity
            new_text.append(token.ent_type_) # replace it with the entity type

        # For multi-token entities, do not repeat the label ("I" tokens)
        else:
            continue
    return ' '.join(new_text) # return the text with entities replaced by their labels
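
The IOB logic of `ner` can be followed without spaCy. In the sketch below, the `(text, iob, type)` triples are written by hand as a hypothetical example, not produced by a model, and `collapse_entities` is an illustrative name. It shows how a two-token entity ("last summer") ends up as a single label.

```python
def collapse_entities(tagged_tokens):
    """Replace each entity span with one label, given (text, iob, type) triples."""
    output = []
    for text, iob, ent_type in tagged_tokens:
        if iob == "O":      # outside any entity: keep the word
            output.append(text)
        elif iob == "B":    # first token of an entity: emit the label once
            output.append(ent_type)
        # iob == "I": continuation of an entity, already covered by its label
    return " ".join(output)

# Hand-written tags for illustration only
tagged = [
    ("bought", "O", ""),
    ("it", "O", ""),
    ("last", "B", "DATE"),
    ("summer", "I", "DATE"),
    ("in", "O", ""),
    ("paris", "B", "GPE"),
]
print(collapse_entities(tagged))
```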

In [97]:
# Test on a single review
print("Original review: ", data['Review Text'].iloc[50])
print("Review with entity labels: ", ner(data['Review Text'].iloc[50]))

Original review:  this is a cute top that can transition easily from summer to fall it fits well nice print and it 's comfortable i tried this on in the store but did not purchase it because the color washed me out this is not the best color for a blonde would look much better on a brunette if this was in a different color i most likely would have purchased it
Review with entity labels:  this is a cute top that can transition easily from DATE to fall it fits well nice print and it 's comfortable i tried this on in the store but did not purchase it because the color washed me out this is not the best color for a blonde would look much better on a brunette if this was in a different color i most likely would have purchased it


In this review, we find one DATE named entity, corresponding to the time of year the garment seems best suited to. However, this information will not help with our classification goal (positive / negative).

## III. Classification

In this part, we classify the reviews according to their rating.

We use the "Rating" column as labels and "Review Text" as the input text.

### Selecting the relevant columns of the dataframe

We keep only the three columns needed for classification, to reduce computation time and avoid carrying superfluous information.

In [98]:
# Keep the id, Rating and Review Text columns
new_data = data[["id", "Rating", "Review Text"]]
new_data.head()

Unnamed: 0,id,Rating,Review Text
0,0,4,absolutely wonderful silky and sexy and comfortable
1,1,5,love this dress it 's sooo pretty i happened to find it in a store and i 'm glad i did bc i never would have ordered it online bc it 's petite i bought a petite and am 5'8 i love the length on me- hits just a little below the knee would definitely be a true midi on someone who is truly petite
2,2,3,i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual size but i found this to be outrageously small so small in fact that i could not zip it up i reordered it in petite medium which was just ok overall the top half was comfortable and fit nicely but the bottom half had a very tight under layer and several somewhat cheap net over layers imo a major design flaw was the net over layer sewn directly into the zipper it c
3,3,5,i love love love this jumpsuit it 's fun flirty and fabulous every time i wear it i get nothing but great compliments
4,4,5,this shirt is very flattering to all due to the adjustable front tie it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan love this shirt


#### Removing missing values

To analyse only complete rows, we first drop every row with a missing value from the reduced dataset.

In [99]:
new_data = new_data.dropna() # Drop rows with missing values

In [100]:
data.shape, new_data.shape

((23486, 14), (23486, 3))

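After this step, the reduced dataset still has 23,486 rows: `dropna()` removed nothing. Any missing review would already have been turned into the string 'nan' by the earlier `astype(str)` conversion, so it is no longer detected as missing.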
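
The shapes above show the same number of rows before and after `dropna()`. One way to make sure missing reviews are actually removed is to drop them before any conversion to strings. A minimal sketch on a hypothetical three-row frame (the real column names are reused, the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing review
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "Rating": [5, 3, 1],
    "Review Text": ["nice top", np.nan, "too small"],
})

# Drop rows whose review is missing first, then convert to strings
cleaned = toy.dropna(subset=["Review Text"]).copy()
cleaned["Review Text"] = cleaned["Review Text"].astype(str)

print(toy.shape, cleaned.shape)
```

Calling `data['Review Text'].isna().sum()` right after loading the CSV would show how many reviews are affected in the real data.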

#### Analysis of the "Rating" column

In [101]:
# Distribution of the "Rating" column
data["Rating"].value_counts()

Rating
5    13131
4     5077
3     2871
2     1565
1      842
Name: count, dtype: int64

The ratings range from 1 to 5. We group them into 3 categories:
- -1 for ratings 1 and 2
- 0 for ratings 3 and 4
- 1 for rating 5

#### Analysis of the "Review Text" column

The column holding the input values is "Review Text". \
It contains all the reviews left by online shoppers on the various garments.

#### Relabelling

To carry out the classification, we therefore change the labels as described above.

In [102]:
def map_label_to_numeric(label):
    # Rating 5 -> 1; ratings 3 and 4 -> 0; ratings 1 and 2 -> -1
    return 1 if label == 5 else 0 if label in (3, 4) else -1
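
To see how balanced the resulting classes are, the mapping implemented by `map_label_to_numeric` (5 to 1, 3 and 4 to 0, 1 and 2 to -1) can be applied to the rating counts reported by `value_counts()` above. The sketch below is self-contained, so it redefines the function and copies those counts by hand.

```python
# Rating counts copied from the value_counts() output above
rating_counts = {5: 13131, 4: 5077, 3: 2871, 2: 1565, 1: 842}

def map_label_to_numeric(label):
    # Same mapping as in the notebook cell
    return 1 if label == 5 else 0 if label in (3, 4) else -1

# Sum the review counts of all ratings that fall into the same class
class_counts = {}
for rating, count in rating_counts.items():
    cls = map_label_to_numeric(rating)
    class_counts[cls] = class_counts.get(cls, 0) + count

total = sum(class_counts.values())
for cls in sorted(class_counts):
    share = class_counts[cls] / total
    print(f"class {cls:>2}: {class_counts[cls]:>6} reviews ({share:.1%})")
```

With this mapping the classes are uneven (about 10%, 34% and 56% of the reviews), which is worth keeping in mind when choosing an evaluation metric.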

In [103]:
def get_labels(data):
    # Map each rating to its class and store it in a new 'score_avis' column
    data['score_avis'] = data['Rating'].apply(map_label_to_numeric)
    return data

In [104]:
data = get_labels(data)
data.head()


Unnamed: 0,id,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,lemmas,stems,pos,score_avis
0,0,767,33,,absolutely wonderful silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates,absolutely wonderful silky and sexy and comfortable,absolut wonder silki and sexi and comfort,ADV ADJ NOUN CCONJ ADJ CCONJ ADJ,0
1,1,1080,34,,love this dress it 's sooo pretty i happened to find it in a store and i 'm glad i did bc i never would have ordered it online bc it 's petite i bought a petite and am 5'8 i love the length on me- hits just a little below the knee would definitely be a true midi on someone who is truly petite,5,1,4,General,Dresses,Dresses,love this dress it be sooo pretty I happen to find it in a store and I ' m glad I do bc I never would have order it online bc it be petite I buy a petite and be 5'8 I love the length on me- hit just a little below the knee would definitely be a true midi on someone who be truly petite,love this dress it s sooo pretti i happen to find it in a store and i m glad i did bc i never would have order it onlin bc it s petit i bought a petit and am 5 8 i love the length on me hit just a littl below the knee would definit be a true midi on someon who is truli petit,VERB DET NOUN SPACE PRON AUX NOUN ADJ SPACE PRON VERB PART VERB PRON ADP DET NOUN CCONJ PRON VERB VERB ADJ PRON VERB PROPN PRON ADV AUX AUX VERB PRON ADJ PROPN PRON AUX ADJ SPACE PRON VERB DET ADJ CCONJ AUX NUM SPACE PRON VERB DET NOUN ADP PROPN VERB ADV DET ADJ ADP DET NOUN SPACE AUX ADV AUX DET ADJ NOUN ADP PRON PRON AUX ADV ADJ,1
2,2,1077,60,Some major design flaws,i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual size but i found this to be outrageously small so small in fact that i could not zip it up i reordered it in petite medium which was just ok overall the top half was comfortable and fit nicely but the bottom half had a very tight under layer and several somewhat cheap net over layers imo a major design flaw was the net over layer sewn directly into the zipper it c,3,0,0,General,Dresses,Dresses,I have such high hope for this dress and really want it to work for I I initially order the petite small my usual size but I find this to be outrageously small so small in fact that I could not zip it up I reorder it in petite medium which be just ok overall the top half be comfortable and fit nicely but the bottom half have a very tight under layer and several somewhat cheap net over layer imo a major design flaw be the net over layer sew directly into the zipper it c,i had such high hope for this dress and realli want it to work for me i initi order the petit small my usual size but i found this to be outrag small so small in fact that i could not zip it up i reorder it in petit medium which was just ok overal the top half was comfort and fit nice but the bottom half had a veri tight under layer and sever somewhat cheap net over layer imo a major design flaw was the net over layer sewn direct into the zipper it c,PRON VERB ADJ ADJ NOUN ADP DET NOUN CCONJ ADV VERB PRON PART VERB ADP PRON PRON ADV VERB DET ADJ ADJ PRON ADJ NOUN CCONJ PRON VERB PRON PART AUX ADV ADJ ADV ADJ ADP NOUN SCONJ PRON AUX PART VERB PRON ADP PRON VERB PRON ADP ADJ NOUN PRON AUX ADV ADV ADJ DET ADJ NOUN AUX ADJ CCONJ ADJ ADV CCONJ DET ADJ NOUN VERB DET ADV ADJ ADP NOUN CCONJ ADJ ADV ADJ NOUN ADP NOUN ADV DET ADJ NOUN NOUN AUX DET NOUN ADP NOUN VERB ADV ADP DET NOUN PRON VERB,0
3,3,1049,50,My favorite buy!,i love love love this jumpsuit it 's fun flirty and fabulous every time i wear it i get nothing but great compliments,5,1,0,General Petite,Bottoms,Pants,I love love love this jumpsuit it be fun flirty and fabulous every time I wear it I get nothing but great compliment,i love love love this jumpsuit it s fun flirti and fabul everi time i wear it i get noth but great compliment,PRON VERB NOUN VERB DET NOUN PRON AUX NOUN NOUN CCONJ ADJ DET NOUN PRON VERB PRON PRON VERB PRON SCONJ ADJ NOUN,1
4,4,847,47,Flattering shirt,this shirt is very flattering to all due to the adjustable front tie it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan love this shirt,5,1,6,General,Tops,Blouses,this shirt be very flattering to all due to the adjustable front tie it be the perfect length to wear with legging and it be sleeveless so it pair well with any cardigan love this shirt,this shirt is veri flatter to all due to the adjust front tie it is the perfect length to wear with leg and it is sleeveless so it pair well with ani cardigan love this shirt,DET NOUN AUX ADV ADJ ADP PRON ADP ADP DET ADJ ADJ NOUN PRON AUX DET ADJ NOUN PART VERB ADP NOUN CCONJ PRON AUX ADJ SCONJ PRON VERB ADV ADP DET NOUN VERB DET NOUN,1


In [105]:
# Inspect the new "score_avis" column
data["score_avis"].value_counts()

score_avis
 1    13131
 0     7948
-1     2407
Name: count, dtype: int64

This shows that reviews rated 5, which map to label 1, are the majority in our dataset (13,131 reviews). Reviews rated 1 or 2 (label -1) are the least frequent (2,407 reviews).
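The rating-to-label mapping can be sketched as follows. The thresholds (5 gives 1, 3 or 4 gives 0, 1 or 2 gives -1) are inferred from the rows displayed above and from this description, and the name `rating_to_label` is hypothetical; it is not the notebook's own `get_labels`. Building the new column with `assign` on a toy frame also sidesteps the view-versus-copy warning printed above.

```python
import pandas as pd

def rating_to_label(rating: int) -> int:
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    if rating >= 5:
        return 1      # very favourable
    if rating >= 3:
        return 0      # neutral or mixed
    return -1         # unfavourable

# Toy ratings standing in for the real "Rating" column
toy = pd.DataFrame({"Rating": [4, 5, 3, 5, 1, 2]})

# assign() returns a new DataFrame, so no view/copy ambiguity arises
toy = toy.assign(score_avis=toy["Rating"].map(rating_to_label))
print(toy["score_avis"].tolist())
```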

We no longer need the Rating column (nor the other descriptive columns), so we keep only the identifier, the review text and the label.

In [106]:
data = data[["id", "Review Text", "score_avis"]]
data.head()

Unnamed: 0,id,Review Text,score_avis
0,0,absolutely wonderful silky and sexy and comfortable,0
1,1,love this dress it 's sooo pretty i happened to find it in a store and i 'm glad i did bc i never would have ordered it online bc it 's petite i bought a petite and am 5'8 i love the length on me- hits just a little below the knee would definitely be a true midi on someone who is truly petite,1
2,2,i had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small my usual size but i found this to be outrageously small so small in fact that i could not zip it up i reordered it in petite medium which was just ok overall the top half was comfortable and fit nicely but the bottom half had a very tight under layer and several somewhat cheap net over layers imo a major design flaw was the net over layer sewn directly into the zipper it c,0
3,3,i love love love this jumpsuit it 's fun flirty and fabulous every time i wear it i get nothing but great compliments,1
4,4,this shirt is very flattering to all due to the adjustable front tie it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan love this shirt,1


#### Splitting our dataframe

To carry out our classification, we need to split our dataset into a training set, a validation set and a test set.

Note:
In machine learning, data are usually split into three sets:
+ **training**: data used to fit the model;
+ **validation**: data used for an intermediate evaluation of the model, so that its hyperparameters can be tuned. Once the hyperparameters are fixed, the model can be retrained on all the data (training + validation) before being tested on the test set;
+ **test**: data reserved EXCLUSIVELY for the FINAL evaluation (to be run only once!) of the model that is eventually selected. They must not contribute to the design of the model in any way. It is therefore forbidden both to inspect them and to evaluate a model under development on this set.

The validation set will be created later, at the end of the pre-processing carried out for the classification.

In [107]:
# Function to split our dataset in two: train and test
def split_data(data, train_ratio):
    data_train = data.sample(frac = train_ratio)
    data_test = data.drop(data_train.index)
    return data_train, data_test

# Split our dataset in two: train and test
data_train, data_test = split_data(data, 0.6)

In [108]:
data_train.shape, data_test.shape

((14092, 3), (9394, 3))

In our case, at this stage:
+ training (called *Train*) contains 14,092 observations, i.e. 60% of the data;
+ validation (called *Validation*) does not exist yet: it is carved out of *Train* later on;
+ test (called *Test*) contains 9,394 observations, i.e. the remaining 40% of the data.
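Because `sample` is called without a seed, the split above changes on every run. A reproducible, stratified variant could look like the sketch below. It relies on scikit-learn's `train_test_split` with a toy frame; the 60/40 and 75/25 ratios mirror the ones used in this notebook, while the seed value and the `toy_*` names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with roughly the same class imbalance as the real labels
toy = pd.DataFrame({
    "Review Text": [f"review {i}" for i in range(100)],
    "score_avis": [1] * 56 + [0] * 34 + [-1] * 10,
})

# 60 % train / 40 % test, stratified on the label, fixed seed
toy_train, toy_test = train_test_split(
    toy, train_size=0.6, stratify=toy["score_avis"], random_state=0
)

# 75 % / 25 % of the training part gives the fit and validation sets
toy_fit, toy_valid = train_test_split(
    toy_train, train_size=0.75, stratify=toy_train["score_avis"], random_state=0
)
print(len(toy_fit), len(toy_valid), len(toy_test))
```

Stratifying keeps the share of each label close to identical in every subset, which matters here because the negative class is small.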

In [109]:
data_train.head()

Unnamed: 0,id,Review Text,score_avis
7801,7801,i ordered an extra small it fits with both a humble tee and a button down with the sleeves rolled up i 've tried it with jeans and also joggers and heels really a cute and versatile sweater i got the olive but can see where the black might be more universally useful this one works perfectly for me however,1
23286,23286,as soon as my eyes touched these pants i knew my body had to not only doo they look great they feel great the material is very soft with just enough stretch they 've quickly become my favorites i loved the first pair so much i immediately bought the other color,1
4599,4599,,1
9009,9009,i love this top it is a perfect summer staple but for me it was too short i sized up to fix this problem because i prefer the boho flowy look it is flowy on it 's own but i felt it lacked length so went up a size waiting for my retailer mail with anticipation,1
21568,21568,the perfect summer top paired with a small tank underneath and jeans and had many compliments,1


In [110]:
data_test.head()

Unnamed: 0,id,Review Text,score_avis
0,0,absolutely wonderful silky and sexy and comfortable,0
1,1,love this dress it 's sooo pretty i happened to find it in a store and i 'm glad i did bc i never would have ordered it online bc it 's petite i bought a petite and am 5'8 i love the length on me- hits just a little below the knee would definitely be a true midi on someone who is truly petite,1
4,4,this shirt is very flattering to all due to the adjustable front tie it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan love this shirt,1
5,5,i love tracy reese dresses but this one is not for the very petite i am just under 5 feet tall and usually wear a 0p in this brand this dress was very pretty out of the package but its a lot of dress the skirt is long and very full so it overwhelmed my small frame not a stranger to alterations shortening and narrowing the skirt would take away from the embellishment of the garment i love the color and the idea of the style but it just did not work on me i returned this dress,-1
7,7,i ordered this in carbon for store pick up and had a ton of stuff as always to try on and used this top to pair skirts and pants everything went with it the color is really nice charcoal with shimmer and went well with pencil skirts flare pants etc my only compaint is it is a bit big sleeves are long and it does n't go in petite also a bit loose for me but no xxs so i kept it and wil ldecide later since the light color is already sold out in hte smallest size,0


## Data exploration

#### Class distribution

Before classifying, it is important to know how the classes are distributed in the training data.

In [111]:
from collections import Counter
import pandas as pd

class_distribution = (pd.DataFrame.from_dict(Counter(data_train.score_avis.values),
                                             orient='index')
                                  .rename(columns={0: 'num_examples'}))
class_distribution.index.name = 'class'
class_distribution

Unnamed: 0_level_0,num_examples
class,Unnamed: 1_level_1
1,7891
0,4773
-1,1428


In [112]:
import numpy as np

class_distribution['perc_examples'] = np.around(class_distribution.num_examples /
                                                np.sum(class_distribution.num_examples),
                                                2)
class_distribution

Unnamed: 0_level_0,num_examples,perc_examples
class,Unnamed: 1_level_1,Unnamed: 2_level_1
1,7891,0.56
0,4773,0.34
-1,1428,0.1


The classes are unevenly distributed in our training set. Very favourable reviews (label 1, i.e. a rating of 5/5) account for more than 50% of it, whereas negative reviews (label -1) are the minority and represent only 10% of the training set.
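The same table can be obtained more directly with pandas `value_counts`. The sketch below rebuilds a label series from the counts reported above (7891, 4773, 1428) rather than from the real dataframe, so the variable names are illustrative.

```python
import pandas as pd

# Class counts taken from the table above (training set)
labels = pd.Series([1] * 7891 + [0] * 4773 + [-1] * 1428, name="score_avis")

counts = labels.value_counts()                          # absolute frequencies
shares = labels.value_counts(normalize=True).round(2)   # relative frequencies

summary = pd.DataFrame({"num_examples": counts, "perc_examples": shares})
print(summary)
```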

#### Text exploration

To get a feel for the texts we are dealing with, we display a few of them and check which pre-processing steps are needed.

In [113]:
# Display the first 5 reviews
data_train["Review Text"].values[:5]

array(["i ordered an extra small it fits with both a humble tee and a button down with the sleeves rolled up i 've tried it with jeans and also joggers and heels really a cute and versatile sweater i got the olive but can see where the black might be more universally useful this one works perfectly for me however",
       "as soon as my eyes touched these pants i knew my body had to not only doo they look great they feel great the material is very soft with just enough stretch they 've quickly become my favorites i loved the first pair so much i immediately bought the other color",
       'nan',
       "i love this top it is a perfect summer staple but for me it was too short i sized up to fix this problem because i prefer the boho flowy look it is flowy on it 's own but i felt it lacked length so went up a size waiting for my retailer mail with anticipation",
       'the perfect summer top paired with a small tank underneath and jeans and had many compliments'],
      dtype=object)
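The third review above is the literal string 'nan', which suggests that missing reviews were converted to text during pre-processing. A minimal sketch of how such rows could be filtered out before vectorization is given below; it uses a toy frame, and whether the real pipeline should drop or keep these rows is left open.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Review Text": ["love this top", np.nan, "nan", "too short for me"],
    "score_avis": [1, 1, 1, 0],
})

# Treat both real NaN values and the literal string "nan" as missing
is_missing = toy["Review Text"].isna() | (toy["Review Text"] == "nan")
toy_clean = toy[~is_missing].reset_index(drop=True)
print(toy_clean)
```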

##  Splitting the dataset

For the computations that follow, we split our training data once more to obtain a validation set. The test data will be used for the final evaluation of the models.

In [114]:
from sklearn.model_selection import train_test_split

In [115]:
X_train, X_valid, y_train, y_valid = train_test_split(data_train['Review Text'],
                                                      data_train['score_avis'],
                                                      train_size=0.75,
                                                      random_state=5)

In [116]:
X_train.shape, X_valid.shape

((10569,), (3523,))

We therefore have 10,569 rows in our training set and 3,523 in the validation set.

In [117]:
y_train

1773     0
7449     0
12699    1
14613    0
19316    0
        ..
12119    1
1453     1
16740    1
3184     1
21463    0
Name: score_avis, Length: 10569, dtype: int64

We can see that the outputs to predict correspond to the three labels we defined above.

To evaluate our model, we initialise the test sets.

In [118]:
# Retrieve the reviews and labels of the test dataset
X_test, y_test = data_test['Review Text'], data_test['score_avis'] 

## Binary: presence/absence

In [119]:
from sklearn.feature_extraction.text import CountVectorizer

bin_count = CountVectorizer(binary=True)

In [120]:
bin_count.fit(X_train)
bin_count

In [121]:
X_train_vectorized_bin = bin_count.transform(X_train)
X_train_vectorized_bin

<10569x10023 sparse matrix of type '<class 'numpy.int64'>'
	with 434359 stored elements in Compressed Sparse Row format>

In [122]:
X_valid_vectorized_bin = bin_count.transform(X_valid)
X_test_vectorized_bin = bin_count.transform(X_test)

In [123]:
X_valid_vectorized_bin # SAME NUMBER OF COLUMNS AS X_train_vectorized_bin

<3523x10023 sparse matrix of type '<class 'numpy.int64'>'
	with 143751 stored elements in Compressed Sparse Row format>

##  Discrete numeric: occurrence counts

We now count how many times each term occurs in our reviews.

In [124]:
vect_count = CountVectorizer().fit(X_train)

We can inspect the vocabulary of our reviews:

In [125]:
vect_count.get_feature_names_out()[:50] # first 50 words (vocabulary "types")

array(['00', '00p', '02', '03', '0dd', '0p', '0petite', '0xs', '10',
       '100', '100lbs', '101', '102', '102lbs', '103', '103lbs', '104',
       '104lbs', '105', '105lbs', '106', '106lbs', '107', '107lbs',
       '107pound', '108', '109', '109lbs', '10mths', '10p', '10s', '10th',
       '11', '110', '110lbs', '111', '111lbs', '112', '112lbs', '113',
       '113lbs', '114', '115', '115lbs', '116', '116bs', '116ibs',
       '116lbs', '117', '117bl'], dtype=object)

In [126]:
vect_count.get_feature_names_out()[-50:] # last 50 words (vocabulary "types")

array(['yes', 'yest', 'yesterday', 'yesteryear', 'yet', 'yfit', 'yield',
       'yikes', 'yo', 'yoga', 'yogi', 'yoke', 'york', 'you', 'young',
       'younger', 'your', 'yourself', 'yourselves', 'youth', 'youthful',
       'yrs', 'yuck', 'yucky', 'yuk', 'yumi', 'yummy', 'yup', 'zag',
       'zed', 'zermatt', 'zero', 'zeros', 'zig', 'zigzag', 'zillion',
       'zip', 'zipepr', 'zipped', 'zipper', 'zippered', 'zippers',
       'zippie', 'zipping', 'zips', 'zombie', 'zone', 'zoolander', 'zoom',
       'zuma'], dtype=object)

Size of our vocabulary:

In [127]:
len(vect_count.get_feature_names_out()) 

10023

### Building the document-term matrix

We build the document-term matrix with the same vectorizer.

In [128]:
X_train_vectorized_count = vect_count.transform(X_train)
X_train_vectorized_count

<10569x10023 sparse matrix of type '<class 'numpy.int64'>'
	with 434359 stored elements in Compressed Sparse Row format>

In [129]:
X_valid_vectorized_count = vect_count.transform(X_valid)
X_test_vectorized_count = vect_count.transform(X_test)

We now also include bigrams in our vocabulary.

In [130]:
vect_count_bigrams = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)
X_train_vectorized_count_bigrams = vect_count_bigrams.transform(X_train)
X_valid_vectorized_count_bigrams = vect_count_bigrams.transform(X_valid)
X_test_vectorized_count_bigrams = vect_count_bigrams.transform(X_test)

In [131]:
len(vect_count_bigrams.get_feature_names_out())

18997

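Including bigrams gives us almost twice as many features (18,997 versus 10,023), even though this vectorizer also discards every term found in fewer than 5 reviews (min_df=5).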

Still to explore: tri-grams, and filters on part-of-speech patterns (adjective + noun).

###  Continuous numeric: TF-IDF (or other weightings)

We restrict the vocabulary to terms that appear in at least 5 documents (reviews) of the training set.

In [132]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [133]:
vect_tfidf = TfidfVectorizer(min_df=5).fit(X_train)

In [134]:
len(vect_count.get_feature_names_out()), len(vect_tfidf.get_feature_names_out())

(10023, 3393)

The vocabulary shrinks considerably because of the min_df=5 parameter: we end up with almost 3 times fewer terms (3,393 versus 10,023)!

We now vectorize the datasets.

In [135]:
# Vectorize the training, validation and test corpora
X_train_vectorized_tfidf = vect_tfidf.transform(X_train)
X_valid_vectorized_tfidf = vect_tfidf.transform(X_valid)
X_test_vectorized_tfidf = vect_tfidf.transform(X_test)

## Modelling

We fit several classification models in order to compare their performance.
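Since the same fit, predict and score steps are repeated for every model, they could be wrapped in a small helper. The function `evaluate` below is hypothetical (it is not part of the notebook) and is shown on toy data; it reports accuracy together with the macro-averaged F1, which is less flattering than accuracy when classes are imbalanced.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

def evaluate(model, X_fit, y_fit, X_val, y_val):
    """Fit on the training part, score on the validation part."""
    predictions = model.fit(X_fit, y_fit).predict(X_val)
    return {
        "accuracy": accuracy_score(y_val, predictions),
        "macro_f1": f1_score(y_val, predictions, average="macro", zero_division=0),
    }

# Toy data: 2 features, 3 imbalanced classes
X_fit = [[0, 1]] * 6 + [[1, 0]] * 3 + [[1, 1]] * 1
y_fit = [1] * 6 + [0] * 3 + [-1] * 1
X_val, y_val = [[0, 1], [1, 0], [1, 1], [0, 1]], [1, 0, -1, 1]

scores = evaluate(DummyClassifier(strategy="most_frequent"), X_fit, y_fit, X_val, y_val)
print(scores)
```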

In [136]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

### Weak baseline models

#### Random choice

We start with random baselines: either every class has the same probability of being predicted, or the predictor follows the class distribution observed in the training data.

In [137]:
from sklearn.dummy import DummyClassifier

Prediction proportional to the class distribution in the training data:

In [138]:
random_prop_class = DummyClassifier(strategy='stratified').fit(X_train_vectorized_tfidf,
                                                               y_train)
predictions_valid = random_prop_class.predict(X_valid_vectorized_tfidf)
conf_mat = confusion_matrix(y_valid, predictions_valid)

In [139]:
print(conf_mat)

[[  34  100  220]
 [ 107  442  645]
 [ 207  682 1086]]


Uniform prediction:

In [140]:
random_uniform = DummyClassifier(strategy='uniform').fit(X_train_vectorized_tfidf,
                                                         y_train)
predictions_valid = random_uniform.predict(X_valid_vectorized_tfidf)
predictions_valid

array([ 1, -1,  0, ...,  0,  1, -1], dtype=int64)

In [141]:
conf_mat = confusion_matrix(y_valid, predictions_valid)

In [142]:
print(conf_mat)

[[111 118 125]
 [395 370 429]
 [638 672 665]]


In [143]:
accuracy_score(y_valid, predictions_valid)

0.32529094521714447

In [144]:
print(classification_report(y_valid, predictions_valid))

              precision    recall  f1-score   support

          -1       0.10      0.31      0.15       354
           0       0.32      0.31      0.31      1194
           1       0.55      0.34      0.42      1975

    accuracy                           0.33      3523
   macro avg       0.32      0.32      0.29      3523
weighted avg       0.42      0.33      0.35      3523
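The accuracy of both random baselines can be anticipated from the class shares alone. The sketch below uses the rounded training proportions (0.56, 0.34, 0.10) and assumes the validation set follows the same distribution. The results, about 0.44 for the stratified strategy and 0.33 for the uniform one, are close to what is observed above (1562 correct out of 3523 in the first confusion matrix, and an accuracy of 0.325 for the uniform predictor).

```python
import numpy as np

# Class shares in the training set, from the distribution table
p = np.array([0.56, 0.34, 0.10])

# 'stratified': predicts class k with probability p_k, independently of the truth,
# so the chance of being right is the sum over k of p_k * p_k
expected_stratified = float(np.sum(p ** 2))

# 'uniform': predicts each of the 3 classes with probability 1/3
expected_uniform = float(np.sum(p * (1 / 3)))

print(round(expected_stratified, 3), round(expected_uniform, 3))
```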



#### Constant prediction of the majority class

We first recall the class distribution in the training data.

In [145]:
class_distribution

Unnamed: 0_level_0,num_examples,perc_examples
class,Unnamed: 1_level_1,Unnamed: 2_level_1
1,7891,0.56
0,4773,0.34
-1,1428,0.1


In [146]:
maj = DummyClassifier(strategy='most_frequent').fit(X_train_vectorized_tfidf, y_train)
predictions_valid = maj.predict(X_valid_vectorized_tfidf)
predictions_valid

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [147]:
import numpy as np 

maj_class = (class_distribution.index[class_distribution.perc_examples ==
                                      np.amax(class_distribution.perc_examples)][0])
maj_class

1

In [148]:
np.all(predictions_valid == maj_class)

True

In [149]:
maj.score(X_valid_vectorized_tfidf, y_valid)

0.5606017598637525

In [150]:
print(classification_report(y_valid, predictions_valid))

              precision    recall  f1-score   support

          -1       0.00      0.00      0.00       354
           0       0.00      0.00      0.00      1194
           1       0.56      1.00      0.72      1975

    accuracy                           0.56      3523
   macro avg       0.19      0.33      0.24      3523
weighted avg       0.31      0.56      0.40      3523
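Classes -1 and 0 are never predicted here, so their precision is undefined and scikit-learn reports it as 0.00 while emitting a warning. Passing `zero_division=0` makes that choice explicit and silences the warning. The sketch below illustrates it on a toy label vector.

```python
from sklearn.metrics import classification_report, precision_score

y_true = [-1, 0, 0, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 1]   # constant majority-class prediction

# zero_division=0 keeps the 0.00 score but makes the choice explicit
per_class_precision = precision_score(
    y_true, y_pred, average=None, labels=[-1, 0, 1], zero_division=0
)
print(per_class_precision)
print(classification_report(y_true, y_pred, zero_division=0))
```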





### Naive Bayes classifier

In [151]:
from sklearn.naive_bayes import MultinomialNB

In [152]:
model_nb = MultinomialNB().fit(X_train_vectorized_tfidf, y_train)
predictions_valid = model_nb.predict(X_valid_vectorized_tfidf)

In [153]:
accuracy_score(y_valid, predictions_valid)

0.68379222253761

In [154]:
print(classification_report(y_valid, predictions_valid))

              precision    recall  f1-score   support

          -1       0.73      0.02      0.04       354
           0       0.58      0.44      0.50      1194
           1       0.72      0.95      0.82      1975

    accuracy                           0.68      3523
   macro avg       0.67      0.47      0.45      3523
weighted avg       0.67      0.68      0.63      3523



### Logistic regression

In [155]:
from sklearn.linear_model import LogisticRegression

In [156]:
model_lr = LogisticRegression(multi_class='multinomial', solver='lbfgs',
                              max_iter=200).fit(X_train_vectorized_count, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
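The message above means that the lbfgs solver stopped at `max_iter` before reaching its tolerance, so the coefficients may not be final. One way to keep this from going unnoticed is to raise `max_iter` and check `n_iter_` afterwards. The sketch below does so on a toy corpus; the documents, labels and the value 1000 are assumptions, and the real data may need a different setting.

```python
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["love it", "love this dress", "awful fit", "awful cheap fabric",
        "ok but short", "ok fit"]
labels = [1, 1, -1, -1, 0, 0]
X = CountVectorizer().fit_transform(docs)

with warnings.catch_warnings():
    # Turn the convergence warning into an error so it cannot go unnoticed
    warnings.simplefilter("error", category=ConvergenceWarning)
    model = LogisticRegression(max_iter=1000).fit(X, labels)

# n_iter_ holds the iterations actually used; staying below max_iter means convergence
converged = bool((model.n_iter_ < model.max_iter).all())
print(model.n_iter_, converged)
```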


In [157]:
predictions_valid = model_lr.predict(X_valid_vectorized_count)

In [158]:
accuracy_score(y_valid, predictions_valid)

0.6854953164916264

In [159]:
print(classification_report(y_valid, predictions_valid))

              precision    recall  f1-score   support

          -1       0.51      0.43      0.47       354
           0       0.56      0.53      0.54      1194
           1       0.78      0.83      0.80      1975

    accuracy                           0.69      3523
   macro avg       0.62      0.59      0.60      3523
weighted avg       0.68      0.69      0.68      3523



In [160]:
def print_n_strongly_associated_features(vectoriser, model, n):
    feature_names = np.array(vectoriser.get_feature_names_out())

    # One row of coefficients per class (multinomial model)
    for i, class_name in enumerate(model.classes_):
        print("CLASS {}".format(class_name))
        idx_coefs_sorted = model.coef_[i].argsort() # ascending order
        print("The {} features with the strongest negative association "
              "with class {}:\n{}\n".format(n, class_name,
                                            feature_names[idx_coefs_sorted[:n]]))
        idx_coefs_sorted = idx_coefs_sorted[::-1] # descending order
        print("The {} features with the strongest positive association "
              "with class {}:\n{}\n".format(n, class_name,
                                            feature_names[idx_coefs_sorted[:n]]))
        print()

Let us examine the features (terms) most strongly associated with each class.

In [161]:
print_n_strongly_associated_features(vect_count, model_lr, 10)

CLASS -1
The 10 features with the strongest negative association with class -1:
['sold' 'stunning' 'compliments' 'comfortable' 'gorgeous' 'girls' 'wait'
 'liner' 'perfectly' 'lovely']

The 10 features with the strongest positive association with class -1:
['horrible' 'generally' 'nothing' 'awful' 'taste' 'killer' 'returning'
 'poorly' 'disappointing' 'boring']


CLASS 0
The 10 features with the strongest negative association with class 0:
['sat' 'sequins' 'holding' 'massive' 'hesitant' 'pic' 'lays' 'thanks' 'hi'
 '37']

The 10 features with the strongest positive association with class 0:
['scrunch' 'ribs' 'settled' 'alterations' 'crotch' 'provided' 'secondly'
 'unsure' 'yarn' 'reg']


CLASS 1
The 10 features with the strongest negative association with class 1:
['disappointing' 'returning' 'disappointed' 'shame' 'strange' 'scrunch'
 'taste' 'shrunk' 'crotch' 'suited']

The 10 features with the strongest positive association [output truncated]

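The terms most positively associated with class -1 are clearly negative ('horrible', 'awful', 'poorly', 'disappointing', 'boring'), while its most negatively associated terms are words of praise ('stunning', 'gorgeous', 'lovely'). Class 1 mirrors this pattern: 'disappointing', 'returning', 'disappointed' and 'shame' push a review away from it. The terms tied to class 0 are harder to interpret.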

In [162]:
model_lr = LogisticRegression(multi_class='multinomial',
                              solver='lbfgs').fit(X_train_vectorized_tfidf, y_train)
predictions_valid = model_lr.predict(X_valid_vectorized_tfidf)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [163]:
accuracy_score(y_valid, predictions_valid)

0.7189894975872836

In [164]:
print(classification_report(y_valid, predictions_valid))

              precision    recall  f1-score   support

          -1       0.62      0.34      0.44       354
           0       0.61      0.57      0.59      1194
           1       0.79      0.87      0.83      1975

    accuracy                           0.72      3523
   macro avg       0.67      0.60      0.62      3523
weighted avg       0.71      0.72      0.71      3523



In [165]:
feature_names = np.array(vect_tfidf.get_feature_names_out())
idx_tfidf_sorted = X_train_vectorized_tfidf.max(0).toarray()[0].argsort()
print("Lowest TF-IDF: {}".format(feature_names[idx_tfidf_sorted[:10]]))
print("Highest TF-IDF: {}".format(feature_names[idx_tfidf_sorted[:-11:-1]]))

Lowest TF-IDF: ['pros' 'secondly' 'cons' 'thi' 'xspetite' 'rigid' 'wi' 'lastly'
 'naturally' 'speaking']
Highest TF-IDF: ['nan' 'amp' 'structure' 'ribbed' 'dolman' 'comfort' 'simple' 'lacey'
 'awesome' 'cute']


We repeat the logistic regression with the vectorizer that uses unigrams and bigrams, this time with max_iter raised to 500.

In [166]:
model_lr = LogisticRegression(multi_class='multinomial', solver='lbfgs',max_iter=500).fit(X_train_vectorized_count_bigrams, y_train)
predictions_valid = model_lr.predict(X_valid_vectorized_count_bigrams)

In [167]:
accuracy_score(y_valid, predictions_valid)

0.6979846721544138

In [168]:
print(classification_report(y_valid, predictions_valid))

              precision    recall  f1-score   support

          -1       0.54      0.40      0.46       354
           0       0.57      0.56      0.57      1194
           1       0.79      0.84      0.81      1975

    accuracy                           0.70      3523
   macro avg       0.63      0.60      0.61      3523
weighted avg       0.69      0.70      0.69      3523



In [169]:
print_n_strongly_associated_features(vect_count_bigrams, model_lr, 10)

CLASS -1
The 10 features with the strongest negative association with class -1:
['comfortable' 'super cute' 'beautiful' 'nan' 'comfy' 'lovely' 'soft'
 'great' 'love this' 'tee']

The 10 features with the strongest positive association with class -1:
['to love' 'unflattering' 'shapeless' 'returning' 'cheap' 'generally'
 'disappointed' 'nothing' 'weird' 'ordered an']


CLASS 0
The 10 features with the strongest negative association with class 0:
['tad big' 'amazing' 'do think' 'ordered an' 'big so' 'so comfy'
 'too snug' 'shapeless' 'such' 'into the']

The 10 features with the strongest positive association with class 0:
['extra small' 'top very' 'piece is' 'keeping' 'as other' 'gray and'
 'much bought' 'cool summer' 'great but' 'can work']


CLASS 1
The 10 features with the strongest negative association with class 1:
['to love' 'not flattering' 'returning' 'disappointed' 'cute but'
 'going back' 'top very' 'frumpy' 'returned' 'thin a

### SVM

In [170]:
from sklearn.svm import SVC

In [171]:
model_svm = SVC(kernel='linear', C=0.1).fit(X_train_vectorized_count_bigrams, y_train)
predictions_valid = model_svm.predict(X_valid_vectorized_count_bigrams)

In [172]:
accuracy_score(y_valid, predictions_valid)

0.698552370139086

In [173]:
print(classification_report(y_valid, predictions_valid))

              precision    recall  f1-score   support

          -1       0.47      0.44      0.46       354
           0       0.59      0.53      0.56      1194
           1       0.79      0.85      0.82      1975

    accuracy                           0.70      3523
   macro avg       0.62      0.61      0.61      3523
weighted avg       0.69      0.70      0.69      3523

