<a href="https://colab.research.google.com/github/marouane-hadjali/fw-yb/blob/main/1_data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##  **Downloading the data into the notebook**

In [24]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip   # download data to notebook 

--2022-12-09 17:28:34--  https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42989872 (41M) [application/x-httpd-php]
Saving to: ‘drugsCom_raw.zip.1’


2022-12-09 17:28:38 (12.3 MB/s) - ‘drugsCom_raw.zip.1’ saved [42989872/42989872]



In [25]:
##  **Unzipping the data archive**

In [26]:
!unzip drugsCom_raw.zip          #unzip data 

Archive:  drugsCom_raw.zip
replace drugsComTest_raw.tsv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 
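
The `unzip` prompt above appears because the extracted file already exists from an earlier run. A minimal sketch of an alternative that uses Python's standard `zipfile` module, which overwrites existing files without prompting. The helper name `extract_archive` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir="."):
    """Extract every member of zip_path into target_dir, overwriting existing files.

    Returns the list of member names found in the archive.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        return archive.namelist()
```

In this notebook the call would be `extract_archive("drugsCom_raw.zip")`, assuming the archive sits in the working directory.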

##  **Loading the data**

In [27]:
import pandas as pd
df1 = pd.read_csv('drugsComTest_raw.tsv', delimiter='\t')    # test split, read into a pandas DataFrame
df2 = pd.read_csv('drugsComTrain_raw.tsv', delimiter='\t')   # train split; the files are tab-separated, so passing '\t' as delimiter prevents a parser error

### Combine the two dataframes into one to obtain a larger dataset and simplify preprocessing.

In [28]:
df = pd.concat([df1,df2])

## Dimensions (number of rows, number of columns)

In [29]:
df1.shape

(53766, 7)

In [30]:
df2.shape

(161297, 7)

In [31]:
df.shape            # confirm the concatenation

(215063, 7)

## Dataset preview

In [32]:
df.head()

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,163740,Mirtazapine,Depression,"""I&#039;ve tried a few antidepressants over th...",10.0,"February 28, 2012",22
1,206473,Mesalamine,"Crohn's Disease, Maintenance","""My son has Crohn&#039;s disease and has done ...",8.0,"May 17, 2009",17
2,159672,Bactrim,Urinary Tract Infection,"""Quick reduction of symptoms""",9.0,"September 29, 2017",3
3,39293,Contrave,Weight Loss,"""Contrave combines drugs that were used for al...",9.0,"March 5, 2017",35
4,97768,Cyclafem 1 / 35,Birth Control,"""I have been on this birth control for one cyc...",9.0,"October 22, 2015",4


## Renaming the columns

In [33]:
df.columns = ['Id','drugName','condition','review','rating','date','usefulCount']    #rename columns

In [34]:
df.head()

Unnamed: 0,Id,drugName,condition,review,rating,date,usefulCount
0,163740,Mirtazapine,Depression,"""I&#039;ve tried a few antidepressants over th...",10.0,"February 28, 2012",22
1,206473,Mesalamine,"Crohn's Disease, Maintenance","""My son has Crohn&#039;s disease and has done ...",8.0,"May 17, 2009",17
2,159672,Bactrim,Urinary Tract Infection,"""Quick reduction of symptoms""",9.0,"September 29, 2017",3
3,39293,Contrave,Weight Loss,"""Contrave combines drugs that were used for al...",9.0,"March 5, 2017",35
4,97768,Cyclafem 1 / 35,Birth Control,"""I have been on this birth control for one cyc...",9.0,"October 22, 2015",4


## Dataframe (copy) on which the sentiment analysis will be performed

In [35]:
df2 = df[['Id','review','rating']].copy()    # new dataframe with only Id, review and rating for sentiment analysis (reuses the name df2)

In [36]:
df2.head()

Unnamed: 0,Id,review,rating
0,163740,"""I&#039;ve tried a few antidepressants over th...",10.0
1,206473,"""My son has Crohn&#039;s disease and has done ...",8.0
2,159672,"""Quick reduction of symptoms""",9.0
3,39293,"""Contrave combines drugs that were used for al...",9.0
4,97768,"""I have been on this birth control for one cyc...",9.0


## Checking for null (missing) values

In [37]:
df2.isnull().any().any()    # check for null

False

## Checking the number of observations (no duplicates)

In [38]:
df2['Id'].nunique()     #shows unique Id values

215063

In [39]:
df['review'][1]         # label-based lookup; returns two rows because index label 1 occurs in both concatenated frames

1    "My son has Crohn&#039;s disease and has done ...
1    "My son is halfway through his fourth week of ...
Name: review, dtype: object
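
The lookup above returns two rows because `pd.concat` keeps each frame's original index, so the label `1` exists once in the test split and once in the training split. A minimal sketch on toy data of how `ignore_index=True` yields a unique index. The frames `test_part` and `train_part` are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the two splits; both start their index at 0.
test_part = pd.DataFrame({"review": ["a", "b"]})
train_part = pd.DataFrame({"review": ["c", "d"]})

stacked = pd.concat([test_part, train_part])                         # index: 0, 1, 0, 1
renumbered = pd.concat([test_part, train_part], ignore_index=True)   # index: 0, 1, 2, 3

print(stacked.index.is_unique)      # False
print(renumbered.index.is_unique)   # True
print(renumbered["review"][1])      # b
```

With a unique index, a label-based lookup such as `renumbered["review"][1]` returns a single value rather than a Series.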

# Sentiment analysis

## Installing the "vaderSentiment" sentiment analysis library

In [40]:
!pip install vaderSentiment       # install Sentiment Analysis  library

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Natural language processing

In [41]:
import nltk
nltk.download(['punkt','stopwords'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [42]:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

In [43]:
df2.head()

Unnamed: 0,Id,review,rating
0,163740,"""I&#039;ve tried a few antidepressants over th...",10.0
1,206473,"""My son has Crohn&#039;s disease and has done ...",8.0
2,159672,"""Quick reduction of symptoms""",9.0
3,39293,"""Contrave combines drugs that were used for al...",9.0
4,97768,"""I have been on this birth control for one cyc...",9.0


## Removing stopwords and adding a new column to the dataframe

In [44]:
df2['cleanReview'] = df2['review'].apply(lambda x: ' '.join([item for item in x.split() if item not in stopwords]))     # remove stopwords from review

In [45]:
df2.head()

Unnamed: 0,Id,review,rating,cleanReview
0,163740,"""I&#039;ve tried a few antidepressants over th...",10.0,"""I&#039;ve tried antidepressants years (citalo..."
1,206473,"""My son has Crohn&#039;s disease and has done ...",8.0,"""My son Crohn&#039;s disease done well Asacol...."
2,159672,"""Quick reduction of symptoms""",9.0,"""Quick reduction symptoms"""
3,39293,"""Contrave combines drugs that were used for al...",9.0,"""Contrave combines drugs used alcohol, smoking..."
4,97768,"""I have been on this birth control for one cyc...",9.0,"""I birth control one cycle. After reading revi..."
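
The filter above is case-sensitive and leaves HTML entities such as `&#039;` untouched, which is why a capitalized stopword like "After" survives in the preview. A minimal sketch of one possible variant, assuming that unescaping entities and comparing tokens in lowercase is acceptable for the analysis. The function `clean_review` and the tiny `DEMO_STOPWORDS` set are hypothetical; the notebook itself uses NLTK's English list.

```python
import html

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead.
DEMO_STOPWORDS = {"i", "have", "been", "on", "this", "for", "after", "of"}


def clean_review(text, stopwords=DEMO_STOPWORDS):
    """Unescape HTML entities, then drop tokens whose lowercase form is a stopword."""
    unescaped = html.unescape(text)
    kept = [token for token in unescaped.split() if token.lower() not in stopwords]
    return " ".join(kept)


print(clean_review("I&#039;ve been on this for one cycle. After reading reviews"))
# I've one cycle. reading reviews
```

The original casing of the kept tokens is preserved; only the comparison against the stopword list is done in lowercase.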


## Loading the sentiment analyzer

In [46]:
import vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

## Adding a column of polarity scores to the dataframe

In [47]:
df2['vaderReviewScore'] = df2['cleanReview'].apply(lambda x: analyzer.polarity_scores(x)['compound'])

In [48]:
df2.head()

Unnamed: 0,Id,review,rating,cleanReview,vaderReviewScore
0,163740,"""I&#039;ve tried a few antidepressants over th...",10.0,"""I&#039;ve tried antidepressants years (citalo...",0.7429
1,206473,"""My son has Crohn&#039;s disease and has done ...",8.0,"""My son Crohn&#039;s disease done well Asacol....",0.4767
2,159672,"""Quick reduction of symptoms""",9.0,"""Quick reduction symptoms""",0.0
3,39293,"""Contrave combines drugs that were used for al...",9.0,"""Contrave combines drugs used alcohol, smoking...",0.8115
4,97768,"""I have been on this birth control for one cyc...",9.0,"""I birth control one cycle. After reading revi...",0.9617


## Number of sentiments (positive, neutral and negative)

In [49]:
positive_num = len(df2[df2['vaderReviewScore'] >= 0.05])
neutral_num = len(df2[(df2['vaderReviewScore'] > -0.05) & (df2['vaderReviewScore'] < 0.05)])
negative_num = len(df2[df2['vaderReviewScore'] <= -0.05])

In [50]:
positive_num,neutral_num, negative_num

(106198, 9035, 99830)

## Encoding the score in the "vaderReviewScore" column:
if score >= 0.05, it is encoded as 2 (positive) \\
else if score <= -0.05, it is encoded as 1 (negative) \\
otherwise it is encoded as 0 (neutral)

In [51]:
df2['vaderSentiment']= df2['vaderReviewScore'].map(lambda x:int(2) if x>=0.05 else int(1) if x<=-0.05 else int(0) )
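
The mapping in the cell above applies cutoffs of ±0.05 to the compound score. A minimal sketch of the same rule written as a named function, which can be easier to test than a nested conditional expression. The names `encode_compound` and `THRESHOLD` are hypothetical.

```python
THRESHOLD = 0.05  # cutoff used in the notebook cell above


def encode_compound(score, threshold=THRESHOLD):
    """Return 2 (positive), 1 (negative) or 0 (neutral) for a VADER compound score."""
    if score >= threshold:
        return 2
    if score <= -threshold:
        return 1
    return 0


print([encode_compound(s) for s in (0.7429, 0.0, -0.05, 0.049)])  # [2, 0, 1, 0]
```

Such a function could be passed to `Series.map` in place of the lambda.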

In [52]:
df2['vaderSentiment'].value_counts()

2    106198
1     99830
0      9035
Name: vaderSentiment, dtype: int64

The sum equals the number of observations (rows) in the dataset.

In [53]:
Total_vaderSentiment = positive_num + neutral_num + negative_num
Total_vaderSentiment

215063

## Labeling the score in the "vaderReviewScore" column:
if score >= 0.05, it is labeled 'positive' \\
else if score <= -0.05, it is labeled 'negative' \\
else if -0.05 < score < 0.05, it is labeled 'neutral'

In [54]:
df2.loc[df2['vaderReviewScore'] >= 0.05,"vaderSentimentLabel"] = "positive"
df2.loc[(df2['vaderReviewScore'] > -0.05) & (df2['vaderReviewScore'] < 0.05),"vaderSentimentLabel"] = "neutral"
df2.loc[df2['vaderReviewScore'] <=-0.05,"vaderSentimentLabel"] = "negative"
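
Since the labels carry the same information as the integer codes, they could also be derived from the codes with a dictionary lookup, which keeps the two columns consistent by construction. A minimal sketch on a toy frame; the name `CODE_TO_LABEL` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Same coding as in the notebook: 2 = positive, 1 = negative, 0 = neutral.
CODE_TO_LABEL = {2: "positive", 1: "negative", 0: "neutral"}

toy = pd.DataFrame({"vaderSentiment": [2, 0, 1, 2]})
toy["vaderSentimentLabel"] = toy["vaderSentiment"].map(CODE_TO_LABEL)

print(toy["vaderSentimentLabel"].tolist())
# ['positive', 'neutral', 'negative', 'positive']
```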

In [55]:
df2.shape

(215063, 7)

In [56]:
df2.head()

Unnamed: 0,Id,review,rating,cleanReview,vaderReviewScore,vaderSentiment,vaderSentimentLabel
0,163740,"""I&#039;ve tried a few antidepressants over th...",10.0,"""I&#039;ve tried antidepressants years (citalo...",0.7429,2,positive
1,206473,"""My son has Crohn&#039;s disease and has done ...",8.0,"""My son Crohn&#039;s disease done well Asacol....",0.4767,2,positive
2,159672,"""Quick reduction of symptoms""",9.0,"""Quick reduction symptoms""",0.0,0,neutral
3,39293,"""Contrave combines drugs that were used for al...",9.0,"""Contrave combines drugs used alcohol, smoking...",0.8115,2,positive
4,97768,"""I have been on this birth control for one cyc...",9.0,"""I birth control one cycle. After reading revi...",0.9617,2,positive


## Sentiment according to the drug rating ("rating" column)

In [57]:
positive_rating = len(df2[df2['rating'] >=7.0])
neutral_rating = len(df2[(df2['rating'] >=4) & (df2['rating']<7)])
negative_rating = len(df2[df2['rating']<=3])

In [58]:
positive_rating,neutral_rating,negative_rating

(142306, 25856, 46901)

In [59]:
Total_rating = positive_rating+neutral_rating+negative_rating
Total_rating

215063

## Sentiment according to the drug rating ("rating" column):
if rating >= 7, it is encoded as '2' \\
else if rating <= 3, it is encoded as '1' \\
otherwise it is encoded as '0'

In [60]:
df2['ratingSentiment']= df2['rating'].map(lambda x:int(2) if x>=7 else int(1) if x<=3 else int(0) )

In [61]:
df2['ratingSentiment'].value_counts()

2    142306
1     46901
0     25856
Name: ratingSentiment, dtype: int64

In [62]:
df2.head()

Unnamed: 0,Id,review,rating,cleanReview,vaderReviewScore,vaderSentiment,vaderSentimentLabel,ratingSentiment
0,163740,"""I&#039;ve tried a few antidepressants over th...",10.0,"""I&#039;ve tried antidepressants years (citalo...",0.7429,2,positive,2
1,206473,"""My son has Crohn&#039;s disease and has done ...",8.0,"""My son Crohn&#039;s disease done well Asacol....",0.4767,2,positive,2
2,159672,"""Quick reduction of symptoms""",9.0,"""Quick reduction symptoms""",0.0,0,neutral,2
3,39293,"""Contrave combines drugs that were used for al...",9.0,"""Contrave combines drugs used alcohol, smoking...",0.8115,2,positive,2
4,97768,"""I have been on this birth control for one cyc...",9.0,"""I birth control one cycle. After reading revi...",0.9617,2,positive,2


## Sentiment according to the drug rating ("rating" column):
if rating >= 7, it is labeled 'positive' \\
else if rating <= 3, it is labeled 'negative' \\
else if 4 <= rating < 7, it is labeled 'neutral'

In [63]:
df2.loc[df2['rating'] >=7.0,"ratingSentimentLabel"] ="positive"
df2.loc[(df2['rating'] >=4.0) & (df2['rating']<7.0),"ratingSentimentLabel"]= "neutral"
df2.loc[df2['rating']<=3.0,"ratingSentimentLabel"] = "negative"
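
The three `.loc` assignments above amount to binning the rating into three ranges. A minimal sketch of the same binning with `pd.cut`, assuming the ratings are whole numbers from 1 to 10. For fractional values such as 6.5 the bin edges below would behave differently from the conditions in the cell above. The sample `ratings` series is made up for illustration.

```python
import pandas as pd

ratings = pd.Series([1.0, 3.0, 4.0, 6.0, 7.0, 10.0])

# Intervals are right-inclusive: (0, 3], (3, 6], (6, 10].
labels = pd.cut(ratings, bins=[0, 3, 6, 10], labels=["negative", "neutral", "positive"])

print(labels.tolist())
# ['negative', 'negative', 'neutral', 'neutral', 'positive', 'positive']
```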

In [64]:
df2.head()

Unnamed: 0,Id,review,rating,cleanReview,vaderReviewScore,vaderSentiment,vaderSentimentLabel,ratingSentiment,ratingSentimentLabel
0,163740,"""I&#039;ve tried a few antidepressants over th...",10.0,"""I&#039;ve tried antidepressants years (citalo...",0.7429,2,positive,2,positive
1,206473,"""My son has Crohn&#039;s disease and has done ...",8.0,"""My son Crohn&#039;s disease done well Asacol....",0.4767,2,positive,2,positive
2,159672,"""Quick reduction of symptoms""",9.0,"""Quick reduction symptoms""",0.0,0,neutral,2,positive
3,39293,"""Contrave combines drugs that were used for al...",9.0,"""Contrave combines drugs used alcohol, smoking...",0.8115,2,positive,2,positive
4,97768,"""I have been on this birth control for one cyc...",9.0,"""I birth control one cycle. After reading revi...",0.9617,2,positive,2,positive


## Reordering the columns

In [65]:
df2 = df2[['Id','review','cleanReview','rating','ratingSentiment','ratingSentimentLabel','vaderReviewScore','vaderSentiment','vaderSentimentLabel']]

In [66]:
df2.head()

Unnamed: 0,Id,review,cleanReview,rating,ratingSentiment,ratingSentimentLabel,vaderReviewScore,vaderSentiment,vaderSentimentLabel
0,163740,"""I&#039;ve tried a few antidepressants over th...","""I&#039;ve tried antidepressants years (citalo...",10.0,2,positive,0.7429,2,positive
1,206473,"""My son has Crohn&#039;s disease and has done ...","""My son Crohn&#039;s disease done well Asacol....",8.0,2,positive,0.4767,2,positive
2,159672,"""Quick reduction of symptoms""","""Quick reduction symptoms""",9.0,2,positive,0.0,0,neutral
3,39293,"""Contrave combines drugs that were used for al...","""Contrave combines drugs used alcohol, smoking...",9.0,2,positive,0.8115,2,positive
4,97768,"""I have been on this birth control for one cyc...","""I birth control one cycle. After reading revi...",9.0,2,positive,0.9617,2,positive


To save the preprocessed dataset in CSV format:

In [71]:
df2.to_csv('data_processed.csv')    # To save preprocessed dataset to csv

In [72]:
df2.head()

Unnamed: 0,Id,review,cleanReview,rating,ratingSentiment,ratingSentimentLabel,vaderReviewScore,vaderSentiment,vaderSentimentLabel
0,163740,"""I&#039;ve tried a few antidepressants over th...","""I&#039;ve tried antidepressants years (citalo...",10.0,2,positive,0.7429,2,positive
1,206473,"""My son has Crohn&#039;s disease and has done ...","""My son Crohn&#039;s disease done well Asacol....",8.0,2,positive,0.4767,2,positive
2,159672,"""Quick reduction of symptoms""","""Quick reduction symptoms""",9.0,2,positive,0.0,0,neutral
3,39293,"""Contrave combines drugs that were used for al...","""Contrave combines drugs used alcohol, smoking...",9.0,2,positive,0.8115,2,positive
4,97768,"""I have been on this birth control for one cyc...","""I birth control one cycle. After reading revi...",9.0,2,positive,0.9617,2,positive


In [73]:
df2.to_csv('data_processed.csv.gz',compression='gzip')
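
By default `to_csv` also writes the index as an unnamed first column, which pandas names `Unnamed: 0` when the file is read back. A minimal sketch, on a toy frame and in a temporary directory, of writing with `index=False` and reloading the gzip file. The file name `demo_processed.csv.gz` is a placeholder.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"Id": [163740, 206473], "rating": [10.0, 8.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_processed.csv.gz"   # placeholder file name
    toy.to_csv(path, index=False, compression="gzip")
    restored = pd.read_csv(path)                  # compression inferred from ".gz"

print(restored.columns.tolist())  # ['Id', 'rating']
print(restored.equals(toy))       # True
```

Whether to drop the index depends on whether later notebooks rely on it; here the `Id` column already identifies each review.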