# Classifying reviews of women's clothing sold online

- Are the reviews written about a garment representative of the rating it is given?

Ideas
- Data visualization: which types of clothing receive the highest ratings?
- Number of reviews by customer age

In [29]:
import pandas as pd


import spacy


## I. Importing the data

We start by importing the data. The dataset contains information about reviews of women's clothing sold online. The data were collected by web scraping.

In [2]:
data = pd.read_csv("Womens Clothing E-Commerce Reviews.csv", sep = ",")

# rename the first column
data = data.rename(columns = {"Unnamed: 0" : "id"})

data.head()

Unnamed: 0,id,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


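## II. Data preprocessing

To work with cleaner data, we apply several preprocessing steps.

We start by dropping the rows that contain missing values. Note that `dropna()` without arguments drops a row as soon as any column is missing, so reviews that have a text but no title (such as rows 0 and 1 above) are removed as well.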

In [3]:
# drop rows with missing values
data = data.dropna()
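
To make the effect of this choice concrete, here is a minimal pure-Python sketch of the two possible behaviours on hypothetical toy rows (the `drop_missing` helper and the sample rows are illustrative and not part of the notebook). It contrasts dropping a row when any field is missing with dropping it only when the review text is missing.

```python
# Toy rows standing in for the dataframe; None marks a missing value.
rows = [
    {"Title": None, "Review Text": "Absolutely wonderful", "Rating": 4},
    {"Title": "My favorite buy!", "Review Text": "I love this jumpsuit", "Rating": 5},
    {"Title": "No text", "Review Text": None, "Rating": 3},
]

def drop_missing(rows, subset=None):
    """Keep rows with no missing value in `subset` (all fields if None)."""
    kept = []
    for row in rows:
        fields = subset if subset is not None else list(row)
        if all(row[field] is not None for field in fields):
            kept.append(row)
    return kept

# Strict: any missing field drops the row.
all_columns = drop_missing(rows)
# Lenient: only a missing review text drops the row.
text_only = drop_missing(rows, subset=["Review Text"])

print(len(all_columns), len(text_only))
```

With pandas, the lenient variant corresponds to `data.dropna(subset=["Review Text"])`; the notebook uses the strict `data.dropna()`.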


In [4]:
nb_dpt = data.groupby("Department Name").count()
nb_dpt

Unnamed: 0_level_0,id,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Class Name
Department Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Bottoms,3184,3184,3184,3184,3184,3184,3184,3184,3184,3184
Dresses,5371,5371,5371,5371,5371,5371,5371,5371,5371,5371
Intimate,1408,1408,1408,1408,1408,1408,1408,1408,1408,1408
Jackets,879,879,879,879,879,879,879,879,879,879
Tops,8713,8713,8713,8713,8713,8713,8713,8713,8713,8713
Trend,107,107,107,107,107,107,107,107,107,107


### Extracting the data for our study

### Case handling

- remove missing values => no review text: useless
- remove special characters
- convert uppercase to lowercase
- remove stop words
- lemmatization
- display the number of words per part-of-speech tag
- extract the most frequent words (or word groups)

- word cloud
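
Two of the planned steps, stop-word removal and extraction of the most frequent words, can be sketched with the standard library alone. The tiny `STOP_WORDS` set below is a hypothetical stand-in for a real stop-word list (for example the ones shipped with NLTK or spaCy), and the tokens come from one of the reviews shown in this notebook.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (a real one is much longer).
STOP_WORDS = {"i", "it", "this", "and", "but", "s"}

# Tokens from one review of the dataset, already lowercased and tokenized.
tokens = ["i", "love", "love", "love", "this", "jumpsuit", "it", "s", "fun",
          "flirty", "and", "fabulous", "every", "time", "i", "wear", "it",
          "i", "get", "nothing", "but", "great", "compliments"]

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

content_tokens = remove_stop_words(tokens, STOP_WORDS)
most_common = Counter(content_tokens).most_common(3)
print(most_common)
```

The same counting, applied to the whole corpus, would give the input needed for a word cloud.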

First, we retrieve the reviews.

In [5]:
avis = data["Review Text"]
avis

2        I had such high hopes for this dress and reall...
3        I love, love, love this jumpsuit. it's fun, fl...
4        This shirt is very flattering to all due to th...
5        I love tracy reese dresses, but this one is no...
6        I aded this in my basket at hte last mintue to...
                               ...                        
23481    I was very happy to snag this dress at such a ...
23482    It reminds me of maternity clothes. soft, stre...
23483    This fit well, but the top was very see throug...
23484    I bought this dress for a wedding i have this ...
23485    This dress in a lovely platinum is feminine an...
Name: Review Text, Length: 19662, dtype: object

We keep only the text of each review, as a plain Python list, so that we no longer work with a pandas `Series` object.

In [6]:
from pprint import pprint
liste_avis = data["Review Text"].values.tolist()
pprint(liste_avis[:10])

['I had such high hopes for this dress and really wanted it to work for me. i '
 'initially ordered the petite small (my usual size) but i found this to be '
 'outrageously small. so small in fact that i could not zip it up! i reordered '
 'it in petite medium, which was just ok. overall, the top half was '
 'comfortable and fit nicely, but the bottom half had a very tight under layer '
 'and several somewhat cheap (net) over layers. imo, a major design flaw was '
 'the net over layer sewn directly into the zipper - it c',
 "I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time "
 'i wear it, i get nothing but great compliments!',
 'This shirt is very flattering to all due to the adjustable front tie. it is '
 'the perfect length to wear with leggings and it is sleeveless so it pairs '
 'well with any cardigan. love this shirt!!!',
 'I love tracy reese dresses, but this one is not for the very petite. i am '
 'just under 5 feet tall and usually wear a 0p in this 

After this step, each review is one element of a list of reviews.

Next, we convert the reviews to lowercase so they are easier to analyze. Splitting them into words is left to the tokenization step that follows.

In [7]:
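liste_avis_clean = []

# Note: the loop variable `avis` stays bound to the last review after the loop
for avis in liste_avis:
    avis = str(avis)
    # Lowercase the text; splitting into tokens is done by the tokenizer in the next step
    avis_clean = avis.lower()
    liste_avis_clean.append(avis_clean)

liste_avis_clean[:10]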

['i had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c',
 "i love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!",
 'this shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!',
 'i love tracy reese dresses, but this one is not for the very petite. i am just under 5 feet tall and usually wear a 0p in this brand. this dress was very pretty out of

Then we tokenize the text.
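
As an illustration of what a `\w+` pattern keeps and discards, here is a small sketch using only the standard `re` module. It assumes, as is the case for NLTK's `RegexpTokenizer` in its default non-gap mode, that the tokens are simply the successive matches of the pattern; the helper name `tokenize_words` is introduced for this example.

```python
import re

def tokenize_words(text):
    """Return the maximal runs of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", text)

sample = "i love, love, love this jumpsuit. it's fun!"
tokens = tokenize_words(sample)
print(tokens)
```

Punctuation disappears, and contractions such as `it's` are split into `it` and `s`.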

In [8]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')  # raw string avoids an invalid escape sequence warning
liste_avis_clean = [tokenizer.tokenize(str(avis)) for avis in liste_avis_clean]
liste_avis_clean[:10]

[['i',
  'had',
  'such',
  'high',
  'hopes',
  'for',
  'this',
  'dress',
  'and',
  'really',
  'wanted',
  'it',
  'to',
  'work',
  'for',
  'me',
  'i',
  'initially',
  'ordered',
  'the',
  'petite',
  'small',
  'my',
  'usual',
  'size',
  'but',
  'i',
  'found',
  'this',
  'to',
  'be',
  'outrageously',
  'small',
  'so',
  'small',
  'in',
  'fact',
  'that',
  'i',
  'could',
  'not',
  'zip',
  'it',
  'up',
  'i',
  'reordered',
  'it',
  'in',
  'petite',
  'medium',
  'which',
  'was',
  'just',
  'ok',
  'overall',
  'the',
  'top',
  'half',
  'was',
  'comfortable',
  'and',
  'fit',
  'nicely',
  'but',
  'the',
  'bottom',
  'half',
  'had',
  'a',
  'very',
  'tight',
  'under',
  'layer',
  'and',
  'several',
  'somewhat',
  'cheap',
  'net',
  'over',
  'layers',
  'imo',
  'a',
  'major',
  'design',
  'flaw',
  'was',
  'the',
  'net',
  'over',
  'layer',
  'sewn',
  'directly',
  'into',
  'the',
  'zipper',
  'it',
  'c'],
 ['i',
  'love',
  'love',
 

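Next, we remove punctuation. Because the `\w+` tokenizer used above already discards punctuation characters, this step leaves the tokens unchanged here, as the identical output shows.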
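
One detail of the filter used in the next cell is worth illustrating: `string.punctuation` is a string, so `token not in punct` is a substring test rather than a per-character test. The sketch below contrasts it with a character-set check; the helper name `is_punctuation` and the sample tokens are hypothetical.

```python
import string

punct = string.punctuation
punct_chars = set(punct)

def is_punctuation(token):
    """True if the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(char in punct_chars for char in token)

tokens = ["love", "!", "!!!", "...", "shirt"]

# Substring test, as used in the notebook: only tokens that appear verbatim
# inside the punctuation string are removed.
substring_filter = [t for t in tokens if t not in punct]
# Stricter alternative: remove any token made only of punctuation characters.
charset_filter = [t for t in tokens if not is_punctuation(t)]

print(substring_filter)
print(charset_filter)
```

The difference only matters for tokenizers that keep punctuation, such as the `TweetTokenizer` used later in this notebook.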

In [9]:
import string
punct = string.punctuation

# For each token of each review, keep the token if it is not found in the punctuation string
liste_avis_clean = [[token for token in avis if token not in punct] for avis in liste_avis_clean]
liste_avis_clean[:10]

[['i',
  'had',
  'such',
  'high',
  'hopes',
  'for',
  'this',
  'dress',
  'and',
  'really',
  'wanted',
  'it',
  'to',
  'work',
  'for',
  'me',
  'i',
  'initially',
  'ordered',
  'the',
  'petite',
  'small',
  'my',
  'usual',
  'size',
  'but',
  'i',
  'found',
  'this',
  'to',
  'be',
  'outrageously',
  'small',
  'so',
  'small',
  'in',
  'fact',
  'that',
  'i',
  'could',
  'not',
  'zip',
  'it',
  'up',
  'i',
  'reordered',
  'it',
  'in',
  'petite',
  'medium',
  'which',
  'was',
  'just',
  'ok',
  'overall',
  'the',
  'top',
  'half',
  'was',
  'comfortable',
  'and',
  'fit',
  'nicely',
  'but',
  'the',
  'bottom',
  'half',
  'had',
  'a',
  'very',
  'tight',
  'under',
  'layer',
  'and',
  'several',
  'somewhat',
  'cheap',
  'net',
  'over',
  'layers',
  'imo',
  'a',
  'major',
  'design',
  'flaw',
  'was',
  'the',
  'net',
  'over',
  'layer',
  'sewn',
  'directly',
  'into',
  'the',
  'zipper',
  'it',
  'c'],
 ['i',
  'love',
  'love',
 

## III. Classification

### Processing and splitting the data

In this part, we classify the reviews according to their rating.

We use the "Rating" column as labels and the "Review Text" column as input values.




#### Selecting the relevant columns of the dataframe

We select the three columns needed for the classification, to reduce computation time and to drop superfluous information.

In [10]:
# Keep the id, Rating and Review Text columns
data = data[["id", "Rating", "Review Text"]]
data.head()

Unnamed: 0,id,Rating,Review Text
2,2,3,I had such high hopes for this dress and reall...
3,3,5,"I love, love, love this jumpsuit. it's fun, fl..."
4,4,5,This shirt is very flattering to all due to th...
5,5,2,"I love tracy reese dresses, but this one is no..."
6,6,5,I aded this in my basket at hte last mintue to...


#### Analysis of the "Rating" column

In [11]:
# Distribution of the "Rating" column
data["Rating"].value_counts()

5    10858
4     4289
3     2464
2     1360
1      691
Name: Rating, dtype: int64

The ratings range from 1 to 5. We group them into 3 categories:
- -1 for ratings 1 and 2
- 0 for ratings 3 and 4
- 1 for the top rating (5)

#### Analysis of the "Review Text" column

The column holding our input values is "Review Text". \
It contains all the reviews that customers left about the various garments.

In [12]:
# TODO: add the information from the previous sections

#### Relabelling

For the classification, we therefore recode the labels as described above.

In [13]:
def map_label_to_numeric(label):
    return 1 if label == 5 else 0 if label == 3 or label == 4 else -1

In [14]:
def get_labels(data):
    # Work on a copy so that pandas does not emit a SettingWithCopyWarning
    data = data.copy()
    # Map each rating (1 to 5) to its class (-1, 0 or 1)
    data['score_avis'] = data['Rating'].apply(map_label_to_numeric)
    return data

In [15]:
data = get_labels(data)
data.head()


Unnamed: 0,id,Rating,Review Text,score_avis
2,2,3,I had such high hopes for this dress and reall...,0
3,3,5,"I love, love, love this jumpsuit. it's fun, fl...",1
4,4,5,This shirt is very flattering to all due to th...,1
5,5,2,"I love tracy reese dresses, but this one is no...",-1
6,6,5,I aded this in my basket at hte last mintue to...,1


In [16]:
# Distribution of the new "score_avis" column
data["score_avis"].value_counts()

 1    10858
 0     6753
-1     2051
Name: score_avis, dtype: int64
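
As a consistency check, the class sizes above can be recomputed from the rating counts printed earlier. The sketch below restates the mapping of `map_label_to_numeric` as a dictionary (an equivalent formulation chosen for this illustration, not the notebook's code) and aggregates the counts.

```python
from collections import Counter

# Rating counts as printed by data["Rating"].value_counts() in this notebook.
rating_counts = {5: 10858, 4: 4289, 3: 2464, 2: 1360, 1: 691}

# Same mapping as map_label_to_numeric: 5 -> 1, 3 and 4 -> 0, 1 and 2 -> -1.
RATING_TO_CLASS = {1: -1, 2: -1, 3: 0, 4: 0, 5: 1}

class_counts = Counter()
for rating, count in rating_counts.items():
    class_counts[RATING_TO_CLASS[rating]] += count

print(dict(class_counts))
```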

This shows that reviews rated 5 form the majority class in our dataset (10858 of 19662), while reviews rated 1 or 2 form the smallest class (2051).

We no longer need the Rating column, so we drop it from the dataframe.

In [17]:
data = data[["id", "Review Text", "score_avis"]]
data.head()

Unnamed: 0,id,Review Text,score_avis
2,2,I had such high hopes for this dress and reall...,0
3,3,"I love, love, love this jumpsuit. it's fun, fl...",1
4,4,This shirt is very flattering to all due to th...,1
5,5,"I love tracy reese dresses, but this one is no...",-1
6,6,I aded this in my basket at hte last mintue to...,1


#### Splitting the dataframe

For the classification, we need to split the dataset into a training set, a validation set and a test set.

Note:
In machine learning, data are usually split into three sets:
+ **training**: data used to fit the model;
+ **validation**: data used for an intermediate evaluation of the model, to tune its hyperparameters. Once the hyperparameters are fixed, the model can be retrained on all the data (training + validation) before being evaluated on the test set;
+ **test**: data reserved EXCLUSIVELY for the FINAL evaluation (to be run once only!) of the model eventually chosen. They must not contribute to the design of the model in any form. It is therefore forbidden both to inspect them and to evaluate a model under development on them.

In [18]:
# Split the dataset into three parts: train, validation and test
def split_data(data, train_ratio, validation_ratio, test_ratio):
    # No random_state is set, so the split differs from run to run
    data_train = data.sample(frac = train_ratio)
    data = data.drop(data_train.index)
    # The remaining rows are shared between validation and test in proportion to their ratios
    data_validation = data.sample(frac = validation_ratio/(validation_ratio + test_ratio))
    data_test = data.drop(data_validation.index)
    return data_train, data_validation, data_test

# The ratios sum to 0.9, so the 40% left after training is split 2:1 between validation and test
data_train, data_validation, data_test = split_data(data, 0.6, 0.2, 0.1)

In [19]:
data_train.shape, data_validation.shape, data_test.shape

((11797, 3), (5243, 3), (2622, 3))

In our case:
+ training set (*Train*): 11797 observations (60% of the data);
+ validation set (*Validation*): 5243 observations (about 27%);
+ test set (*Test*): 2622 observations (about 13%), i.e. roughly 22% of the size of the training set.
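
These sizes follow directly from how `split_data` chains two calls to `sample`. The sketch below reproduces the arithmetic in plain Python, assuming that `DataFrame.sample(frac=...)` draws `round(frac * n)` rows; the function name `split_sizes` is introduced here for illustration only.

```python
def split_sizes(n_rows, train_ratio, validation_ratio, test_ratio):
    """Sizes produced by sampling train first, then validation from the remainder."""
    n_train = round(train_ratio * n_rows)
    n_remaining = n_rows - n_train
    # Validation takes its share of the remainder; test gets whatever is left.
    n_validation = round(n_remaining * validation_ratio / (validation_ratio + test_ratio))
    n_test = n_remaining - n_validation
    return n_train, n_validation, n_test

sizes = split_sizes(19662, 0.6, 0.2, 0.1)
print(sizes)
```

Because the three ratios sum to 0.9 rather than 1, the validation and test sets end up larger than 20% and 10% of the data.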

In [20]:
data_train.head()

Unnamed: 0,id,Review Text,score_avis
13301,13301,This top is so pretty and easy to wear. the ma...,1
6121,6121,I've been looking for a new winter dress and t...,1
14393,14393,The material is so soft and i really like the ...,0
6804,6804,This hoodie has a great fit! it is slightly ta...,1
2048,2048,I'm 5'6 and between 100-105lbs. buying clothes...,1


In [21]:
data_validation.head()

Unnamed: 0,id,Review Text,score_avis
5065,5065,Just bought this at the store. just to be safe...,0
21441,21441,I tried on about 25 items in the store today a...,1
2025,2025,This dress runs large so i sized down. when it...,1
21178,21178,I purchased the bird design top for my 17 year...,1
7335,7335,I originally ordered the regular xxs (loose fi...,1


In [22]:
data_test.head()

Unnamed: 0,id,Review Text,score_avis
10,10,Dress runs small esp where the zipper area run...,0
32,32,These pants are even better in person. the onl...,1
53,53,Very soft and comfortable. the shirt has an un...,1
66,66,"Just received this in the mail, tried it on an...",0
71,71,Why do designers keep making crop tops??!! i c...,-1


### Data exploration

#### Class distribution

Knowing how the classes are distributed in the training data is important before we proceed with the classification.

In [23]:
# Distribution of the "score_avis" column in the training set
print(data_train["score_avis"].value_counts())

# Proportion of each class in the training set
data_train["score_avis"].value_counts()/len(data_train)

 1    6547
 0    4014
-1    1236
Name: score_avis, dtype: int64


 1    0.554972
 0    0.340256
-1    0.104772
Name: score_avis, dtype: float64

The classes are imbalanced in our training set. More than 50% of the reviews are very favorable (rating 5/5), while negative reviews are a minority, at about 10%.
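
To quantify this imbalance, the sketch below derives two simple quantities from the counts printed above: the accuracy of a trivial classifier that always predicts the majority class, and inverse-frequency class weights computed as n_samples / (n_classes * class_count). Using such weights is one common option for imbalanced data; it is mentioned here as an assumption, not as the approach taken in this notebook.

```python
# Class counts of the training set, as printed above.
class_counts = {1: 6547, 0: 4014, -1: 1236}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(class_counts.values()) / n_samples

# Inverse-frequency weights: rare classes get larger weights.
class_weights = {label: n_samples / (n_classes * count)
                 for label, count in class_counts.items()}

print(round(majority_baseline, 6))
print({label: round(weight, 3) for label, weight in class_weights.items()})
```

Any classifier trained on these data should therefore be compared against a baseline accuracy of roughly 55%.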

#### Exploring the text

To get a sense of the texts we are dealing with, we display a few of them to decide which preprocessing steps are needed.

In [24]:
# Display the first 5 reviews
data_train["Review Text"].values[:5]

array(["This top is so pretty and easy to wear. the material is super soft, it's long enough to tuck in as pictured but looks fine untucked. the back is gorgeous and will prompt me to wear my hair up to show it off! great purchase on retailer weekend, i'd be surprised if it lasts long enough to make it to sale.",
       "I've been looking for a new winter dress and this one fit the order! it's warm yet flattering and i love the color.",
       "The material is so soft and i really like the design (i got the map one). as others mentioned, sleeves are quite short, almost morel like a muscle tee than a t-shirt, but i don't mind. looks really cute under a cardigan for fall too.",
       'This hoodie has a great fit! it is slightly tapered in at the waist so it is not boxy. in fact, it is quite flattering. it has pockets and a cute rubber zipper pull. the hood is a nice size, not overwhelming to the back profile. i purchased the matching pants a few months ago but had not idea when i ordere

## Text representation

### Feature selection: text preprocessing

#### Example on a single review

In [26]:
tw = data_train['Review Text'].iloc[100]
tw

"I cannot say enough about these pajama pants. they're beautiful and crazy comfortable. it's a nice change from black or grey. i also love that there are no pockets because i hate how they jut out on me. i wanted a petite l because i am short, but the regular large is fine. i just wear them higher up on my hips. i normally wait for sales on sleepwear, but i couldn't resist on these. they're well worth the investment!"

In [32]:
nlp = spacy.load('en_core_web_sm')

In [35]:
avis_nlp = nlp(avis) # spaCy; `avis` still holds the last review from the loop in cell In [7]
avis_nlp
#dir(avis_nlp)

This dress in a lovely platinum is feminine and fits perfectly, easy to wear and comfy, too! highly recommend!

In [36]:
# No word selection at all: print every token
for token in avis_nlp:
    print(token)

This
dress
in
a
lovely
platinum
is
feminine
and
fits
perfectly
,
easy
to
wear
and
comfy
,
too
!
highly
recommend
!


In [37]:
# Reduction by grouping/normalizing word forms
# Lemmatization
for token in avis_nlp:
    print(token.lemma_)

this
dress
in
a
lovely
platinum
be
feminine
and
fit
perfectly
,
easy
to
wear
and
comfy
,
too
!
highly
recommend
!


#### Lemmatized version


In [38]:
def lemmatise_text(text):
    text = nlp(text)
    lemmas = [token.lemma_ for token in text]
    return ' '.join(lemmas)

In [39]:
lemmatise_text(avis)

'this dress in a lovely platinum be feminine and fit perfectly , easy to wear and comfy , too ! highly recommend !'

In [40]:
data_train['lemmas'] = data_train['Review Text'].apply(lemmatise_text)

In [41]:
data_train.head()

Unnamed: 0,id,Review Text,score_avis,lemmas
13301,13301,This top is so pretty and easy to wear. the ma...,1,this top be so pretty and easy to wear . the m...
6121,6121,I've been looking for a new winter dress and t...,1,I have be look for a new winter dress and this...
14393,14393,The material is so soft and i really like the ...,0,the material be so soft and I really like the ...
6804,6804,This hoodie has a great fit! it is slightly ta...,1,this hoodie have a great fit ! it be slightly ...
2048,2048,I'm 5'6 and between 100-105lbs. buying clothes...,1,I be 5'6 and between 100 - 105lbs . buy clothe...


In [42]:
data_test['lemmas'] = data_test['Review Text'].apply(lemmatise_text)

In [43]:
data_test.shape

(2622, 4)

In [44]:
# Save
data_train.to_pickle('train.pkl')
data_test.to_pickle('test.pkl')

##### Stems

In [45]:
from nltk.tokenize import word_tokenize, TweetTokenizer
from nltk.stem import SnowballStemmer

In [46]:
stemmer = SnowballStemmer('english')
tokenizer = TweetTokenizer()
tokenizer.tokenize(tw)

['I',
 'cannot',
 'say',
 'enough',
 'about',
 'these',
 'pajama',
 'pants',
 '.',
 "they're",
 'beautiful',
 'and',
 'crazy',
 'comfortable',
 '.',
 "it's",
 'a',
 'nice',
 'change',
 'from',
 'black',
 'or',
 'grey',
 '.',
 'i',
 'also',
 'love',
 'that',
 'there',
 'are',
 'no',
 'pockets',
 'because',
 'i',
 'hate',
 'how',
 'they',
 'jut',
 'out',
 'on',
 'me',
 '.',
 'i',
 'wanted',
 'a',
 'petite',
 'l',
 'because',
 'i',
 'am',
 'short',
 ',',
 'but',
 'the',
 'regular',
 'large',
 'is',
 'fine',
 '.',
 'i',
 'just',
 'wear',
 'them',
 'higher',
 'up',
 'on',
 'my',
 'hips',
 '.',
 'i',
 'normally',
 'wait',
 'for',
 'sales',
 'on',
 'sleepwear',
 ',',
 'but',
 'i',
 "couldn't",
 'resist',
 'on',
 'these',
 '.',
 "they're",
 'well',
 'worth',
 'the',
 'investment',
 '!']

In [47]:
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True) 
tokenizer.tokenize(avis)

['This',
 'dress',
 'in',
 'a',
 'lovely',
 'platinum',
 'is',
 'feminine',
 'and',
 'fits',
 'perfectly',
 ',',
 'easy',
 'to',
 'wear',
 'and',
 'comfy',
 ',',
 'too',
 '!',
 'highly',
 'recommend',
 '!']

In [48]:
for token in tokenizer.tokenize(avis):
    print(stemmer.stem(token))

this
dress
in
a
love
platinum
is
feminin
and
fit
perfect
,
easi
to
wear
and
comfi
,
too
!
high
recommend
!


In [49]:
def stem_text(text):
    tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
    stemmer = SnowballStemmer('english')
    stems = [stemmer.stem(token) for token in tokenizer.tokenize(text)]
    return ' '.join(stems)

In [50]:
stem_text(avis)

'this dress in a love platinum is feminin and fit perfect , easi to wear and comfi , too ! high recommend !'
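
The two reductions can be compared token by token on the example review. The lists below are copied from the lemmatizer and stemmer outputs shown in this notebook; the sketch simply lists the positions where they disagree.

```python
# Outputs of lemmatise_text and stem_text on the example review, as shown above.
lemmas = ("this dress in a lovely platinum be feminine and fit perfectly , "
          "easy to wear and comfy , too ! highly recommend !").split()
stems = ("this dress in a love platinum is feminin and fit perfect , "
         "easi to wear and comfi , too ! high recommend !").split()

# Pairs (lemma, stem) for the tokens on which the two methods differ.
differences = [(lemma, stem) for lemma, stem in zip(lemmas, stems) if lemma != stem]

for lemma, stem in differences:
    print(f"{lemma:10} -> {stem}")
```

Lemmas are dictionary forms (`be`), whereas stems are truncated strings that need not be actual words (`easi`, `comfi`).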

In [53]:
data_train['stems'] = data_train['Review Text'].apply(stem_text)

In [54]:
data_train.head()

Unnamed: 0,id,Review Text,score_avis,lemmas,stems
13301,13301,This top is so pretty and easy to wear. the ma...,1,this top be so pretty and easy to wear . the m...,this top is so pretti and easi to wear . the m...
6121,6121,I've been looking for a new winter dress and t...,1,I have be look for a new winter dress and this...,i'v been look for a new winter dress and this ...
14393,14393,The material is so soft and i really like the ...,0,the material be so soft and I really like the ...,the materi is so soft and i realli like the de...
6804,6804,This hoodie has a great fit! it is slightly ta...,1,this hoodie have a great fit ! it be slightly ...,this hoodi has a great fit ! it is slight tape...
2048,2048,I'm 5'6 and between 100-105lbs. buying clothes...,1,I be 5'6 and between 100 - 105lbs . buy clothe...,i'm 5 ' 6 and between 100-105 lbs . buy cloth ...


In [55]:
data_test['stems'] = data_test['Review Text'].apply(stem_text)

In [56]:
data_test.shape

(2622, 5)

##### Part-of-speech tags

In [57]:
def replace_words_with_pos_tag(text):
    text = nlp(text)
    return ' '.join([token.pos_ for token in text])

In [58]:
replace_words_with_pos_tag(avis)

'DET NOUN ADP DET ADJ NOUN AUX ADJ CCONJ VERB ADV PUNCT ADJ PART VERB CCONJ ADJ PUNCT ADV PUNCT ADV VERB PUNCT'
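
Once words are replaced by their tags, counting the number of words per part-of-speech tag (one of the steps listed at the start of the preprocessing section) becomes a simple frequency count. A minimal sketch on the tag sequence shown above:

```python
from collections import Counter

# POS sequence returned by replace_words_with_pos_tag for the example review.
pos_sequence = ("DET NOUN ADP DET ADJ NOUN AUX ADJ CCONJ VERB ADV PUNCT ADJ PART "
                "VERB CCONJ ADJ PUNCT ADV PUNCT ADV VERB PUNCT")

pos_counts = Counter(pos_sequence.split())

for tag, count in pos_counts.most_common():
    print(tag, count)
```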

In [59]:
data_train['pos'] = data_train['Review Text'].apply(replace_words_with_pos_tag)

In [60]:
data_train.head()

Unnamed: 0,id,Review Text,score_avis,lemmas,stems,pos
13301,13301,This top is so pretty and easy to wear. the ma...,1,this top be so pretty and easy to wear . the m...,this top is so pretti and easi to wear . the m...,DET NOUN AUX ADV ADJ CCONJ ADJ PART VERB PUNCT...
6121,6121,I've been looking for a new winter dress and t...,1,I have be look for a new winter dress and this...,i'v been look for a new winter dress and this ...,PRON AUX AUX VERB ADP DET ADJ NOUN NOUN CCONJ ...
14393,14393,The material is so soft and i really like the ...,0,the material be so soft and I really like the ...,the materi is so soft and i realli like the de...,DET NOUN AUX ADV ADJ CCONJ PRON ADV VERB DET N...
6804,6804,This hoodie has a great fit! it is slightly ta...,1,this hoodie have a great fit ! it be slightly ...,this hoodi has a great fit ! it is slight tape...,DET NOUN VERB DET ADJ NOUN PUNCT PRON AUX ADV ...
2048,2048,I'm 5'6 and between 100-105lbs. buying clothes...,1,I be 5'6 and between 100 - 105lbs . buy clothe...,i'm 5 ' 6 and between 100-105 lbs . buy cloth ...,PRON AUX NUM CCONJ ADP NUM PUNCT NUM PUNCT VER...


In [61]:
data_test['pos'] = data_test['Review Text'].apply(replace_words_with_pos_tag)

In [62]:
data_test.shape

(2622, 6)

In [63]:
# Save
data_train.to_pickle('train.pkl')
data_test.to_pickle('test.pkl')

##### Named-entity classes

In [64]:
def ner(text):
    text = nlp(text)
    new_text = []
    for token in text:
        if token.ent_iob_ == "O":
            new_text.append(token.text)
        elif token.ent_iob_ == "B":
            new_text.append(token.ent_type_)
        # If the entity spans several words, the label is not repeated
        else:
            continue
    return ' '.join(new_text)

In [65]:
ner(avis)

'This dress in a lovely platinum is feminine and fits perfectly , easy to wear and comfy , too ! highly recommend !'
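
The example review contains no named entity, so its words come back unchanged. To show the replacement logic itself without loading a spaCy model, the sketch below applies the same rule to hand-written `(text, IOB tag, entity type)` triples; these annotations are hypothetical and only mimic what `token.ent_iob_` and `token.ent_type_` could return.

```python
# Hypothetical annotations: (token text, IOB tag, entity type).
annotated = [
    ("I", "O", ""), ("moved", "O", ""), ("to", "O", ""),
    ("New", "B", "GPE"), ("York", "I", "GPE"),
    ("last", "B", "DATE"), ("winter", "I", "DATE"), (".", "O", ""),
]

def collapse_entities(annotated_tokens):
    """Replace each entity by its type, emitted once at the entity's first token."""
    output = []
    for text, iob, entity_type in annotated_tokens:
        if iob == "O":        # outside any entity: keep the word
            output.append(text)
        elif iob == "B":      # beginning of an entity: emit its type
            output.append(entity_type)
        # iob == "I": inside an entity, already represented by its type
    return " ".join(output)

collapsed = collapse_entities(annotated)
print(collapsed)
```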

In [67]:
data_train['entites_nommees'] = data_train['Review Text'].apply(ner)

In [68]:
data_train.head()

Unnamed: 0,id,Review Text,score_avis,lemmas,stems,pos,entites_nommees
13301,13301,This top is so pretty and easy to wear. the ma...,1,this top be so pretty and easy to wear . the m...,this top is so pretti and easi to wear . the m...,DET NOUN AUX ADV ADJ CCONJ ADJ PART VERB PUNCT...,This top is so pretty and easy to wear . the m...
6121,6121,I've been looking for a new winter dress and t...,1,I have be look for a new winter dress and this...,i'v been look for a new winter dress and this ...,PRON AUX AUX VERB ADP DET ADJ NOUN NOUN CCONJ ...,I 've been looking for a new DATE dress and th...
14393,14393,The material is so soft and i really like the ...,0,the material be so soft and I really like the ...,the materi is so soft and i realli like the de...,DET NOUN AUX ADV ADJ CCONJ PRON ADV VERB DET N...,The material is so soft and i really like the ...
6804,6804,This hoodie has a great fit! it is slightly ta...,1,this hoodie have a great fit ! it be slightly ...,this hoodi has a great fit ! it is slight tape...,DET NOUN VERB DET ADJ NOUN PUNCT PRON AUX ADV ...,This ORG has a great fit ! it is slightly tape...
2048,2048,I'm 5'6 and between 100-105lbs. buying clothes...,1,I be 5'6 and between 100 - 105lbs . buy clothe...,i'm 5 ' 6 and between 100-105 lbs . buy cloth ...,PRON AUX NUM CCONJ ADP NUM PUNCT NUM PUNCT VER...,I 'm CARDINAL and CARDINAL . buying clothes fr...


In [69]:
data_test['entites_nommees'] = data_test['Review Text'].apply(ner)

In [70]:
data_test.shape

(2622, 7)

In [71]:
data_train.to_pickle('train.pkl')
data_test.to_pickle('test.pkl')

#### TODO: see the SUBSTITUTE_URL function

### Computing the feature values

In [72]:
from sklearn.model_selection import train_test_split

In [74]:
X_train, X_valid, y_train, y_valid = train_test_split(data_train['Review Text'],
                                                      data_train['score_avis'],
                                                      train_size=0.75,
                                                      random_state=5)

In [75]:
X_train.shape, X_valid.shape

((8847,), (2950,))

In [76]:
y_train

20304    1
13051    0
10628   -1
4648     0
19493    0
        ..
19357    1
9412     0
20438    0
22105    1
11296    1
Name: score_avis, Length: 8847, dtype: int64

In [78]:
X_test, y_test = data_test['Review Text'], data_test['score_avis']

### Binary: presence/absence

In [81]:
from sklearn.feature_extraction.text import CountVectorizer


In [82]:
bin_count = CountVectorizer(binary=True)

In [83]:
bin_count.fit(X_train)
bin_count

In [84]:
X_train_vectorized_bin = bin_count.transform(X_train)
X_train_vectorized_bin

<8847x9677 sparse matrix of type '<class 'numpy.int64'>'
	with 388140 stored elements in Compressed Sparse Row format>

In [85]:
X_valid_vectorized_bin = bin_count.transform(X_valid)
X_test_vectorized_bin = bin_count.transform(X_test)

In [86]:
X_valid_vectorized_bin # SAME NUMBER OF COLUMNS AS X_train_vectorized_bin

<2950x9677 sparse matrix of type '<class 'numpy.int64'>'
	with 128245 stored elements in Compressed Sparse Row format>
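
What the binary representation encodes can be shown on a toy corpus in plain Python. This sketch is not scikit-learn's implementation: it assumes lowercasing and tokens of at least two word characters, which mirrors the defaults of `CountVectorizer`, and it builds dense lists instead of a sparse matrix. The corpus and helper names are made up for the example.

```python
import re

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(corpus):
    """Sorted list of the distinct tokens seen in the corpus."""
    return sorted({token for doc in corpus for token in tokenize(doc)})

def transform_binary(corpus, vocabulary):
    """One row per document: 1 if the term occurs in it, else 0."""
    rows = []
    for doc in corpus:
        present = set(tokenize(doc))
        rows.append([1 if term in present else 0 for term in vocabulary])
    return rows

train_corpus = ["Love this dress, love it!", "This top runs small"]
vocabulary = fit_vocabulary(train_corpus)
train_matrix = transform_binary(train_corpus, vocabulary)

# Words unseen during fitting ("zipper", "broke") are ignored at transform time.
valid_matrix = transform_binary(["the zipper broke on this dress"], vocabulary)

print(vocabulary)
print(train_matrix)
print(valid_matrix)
```

The validation row has as many columns as the training rows, which is why the validation matrix in the notebook has the same 9677 columns as the training matrix.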

### Discrete numeric: occurrence counts

In [87]:
vect_count = CountVectorizer().fit(X_train) # binary=False

In [88]:
vect_count.get_feature_names_out()[:50] # first 50 words ("types" of the vocabulary)

array(['00', '00p', '03', '03dd', '0in', '0p', '0petite', '0r', '0verall',
       '0xs', '10', '100', '1000', '100lb', '100lbs', '102', '102lbs',
       '103', '103lb', '103lbs', '104', '105', '105lbs', '106', '107',
       '107lb', '107pound', '108', '108lbs', '109', '109lbs', '10lbs',
       '10p', '10x', '11', '110', '110lbs', '111', '111lbs', '112',
       '112lbs', '113', '114', '114lb', '115', '115lbs', '116', '116lb',
       '116lbs', '117'], dtype=object)

In [89]:
vect_count.get_feature_names_out()[-50:] # last 50 words ("types" of the vocabulary)

array(['yep', 'yes', 'yest', 'yesterday', 'yet', 'yey', 'yfit', 'yiddish',
       'yield', 'yikes', 'yippee', 'yo', 'yoga', 'yogini', 'yogis',
       'yoke', 'yolk', 'york', 'you', 'young', 'younger', 'your', 'youre',
       'yours', 'yourself', 'yourselves', 'youthful', 'youthfull', 'yr',
       'yrs', 'yuck', 'yucky', 'yummiest', 'yummy', 'zag', 'zermatt',
       'zero', 'zig', 'zigzag', 'zip', 'zipped', 'zipper', 'zippered',
       'zippers', 'zipping', 'zips', 'zombie', 'zone', 'zoom', 'zuma'],
      dtype=object)

In [90]:
len(vect_count.get_feature_names_out()) # vocabulary size

9677

#### Building the document-term matrix

In [91]:
X_train_vectorized_count = vect_count.transform(X_train)
X_train_vectorized_count

<8847x9677 sparse matrix of type '<class 'numpy.int64'>'
	with 388140 stored elements in Compressed Sparse Row format>

In [92]:
# transform the validation and test corpora into document-term matrices with the same vectorizer
X_valid_vectorized_count = vect_count.transform(X_valid)
X_test_vectorized_count = vect_count.transform(X_test)

In [93]:
# Include bigrams in the vocabulary
vect_count_bigrams = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)
X_train_vectorized_count_bigrams = vect_count_bigrams.transform(X_train)
X_valid_vectorized_count_bigrams = vect_count_bigrams.transform(X_valid)
X_test_vectorized_count_bigrams = vect_count_bigrams.transform(X_test)


In [94]:
len(vect_count_bigrams.get_feature_names_out())

17411

Including bigrams (with `min_df=5`) gives a vocabulary of 17411 features, almost twice the 9677 unigram features.
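
How unigrams and bigrams are combined, and how `min_df` prunes rare ones, can be sketched as follows. This is a simplified pure-Python illustration on a toy corpus, assuming whitespace tokenization; `min_df` is interpreted, as in scikit-learn, as the minimum number of documents a term must appear in. The helper names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """All n-grams of the token list for n_min <= n <= n_max, joined by spaces."""
    result = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

def build_vocabulary(corpus, min_df=1):
    """Sorted n-grams that occur in at least `min_df` documents."""
    document_frequency = Counter()
    for doc in corpus:
        # A set ensures each n-gram counts once per document.
        document_frequency.update(set(ngrams(doc.lower().split())))
    return sorted(term for term, df in document_frequency.items() if df >= min_df)

corpus = ["runs small", "runs very small", "fits well", "runs small again"]

full_vocabulary = build_vocabulary(corpus, min_df=1)
pruned_vocabulary = build_vocabulary(corpus, min_df=2)

print(full_vocabulary)
print(pruned_vocabulary)
```

Adding bigrams enlarges the vocabulary, while the document-frequency threshold removes the many n-grams that occur only once or twice.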

### Continuous numeric: TF-IDF (or other weightings)

In [95]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [96]:
vect_tfidf = TfidfVectorizer(min_df=5).fit(X_train)

In [97]:
# The vocabulary shrinks substantially because of the min_df=5 parameter
len(vect_count.get_feature_names_out()), len(vect_tfidf.get_feature_names_out())

(9677, 3260)

In [98]:
# Vectorize the training, validation and test corpora
X_train_vectorized_tfidf = vect_tfidf.transform(X_train)
X_valid_vectorized_tfidf = vect_tfidf.transform(X_valid)
X_test_vectorized_tfidf = vect_tfidf.transform(X_test)
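
For reference, the weighting can be written out on a toy example. The sketch below assumes the defaults of scikit-learn's `TfidfVectorizer` (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row) and uses whitespace tokenization for brevity; it illustrates the formula and is not a replacement for the library. The function name `tfidf_rows` and the corpus are made up for the example.

```python
import math
from collections import Counter

def tfidf_rows(corpus):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in corpus]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows

corpus = ["great dress", "great top", "great fit great price"]
vocabulary, rows = tfidf_rows(corpus)

print(vocabulary)
print([round(w, 3) for w in rows[0]])
```

The term `great`, present in every document, receives a lower weight in the first row than `dress`, which appears in only one document.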

## Text classification