## Importing the libraries

In [85]:
# %matplotlib inline
%matplotlib tk

In [36]:
import numpy as np 
import pandas as pd 
import seaborn as sns
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from yellowbrick.cluster import KElbowVisualizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD 
from sklearn.metrics import homogeneity_score
from sklearn.metrics import silhouette_score
from sklearn.metrics import accuracy_score
from bs4 import BeautifulSoup
from scipy.sparse import csr_matrix
import re

## Grouping problem: Clustering

* Group the text data into a given number of clusters.

## Loading the dataset

In [3]:
dataset = pd.read_csv("IMDB Dataset.csv")
dataset

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | positive |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | I am a Catholic taught in parochial elementary... | negative |
| 49998 | I'm going to have to disagree with the previou... | negative |


### - Exploring the dataset

* Information about the dataset

In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


* Display the first row of the first column of the dataset

In [5]:
dataset["review"][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

* Display the first row of the second column of the dataset

In [6]:
dataset["sentiment"][0]

'positive'

* Check the dataset for missing values.

In [7]:
dataset.isna().sum()

review       0
sentiment    0
dtype: int64

* Check whether the dataset is balanced

In [8]:
dataset["review"].groupby(dataset["sentiment"]).count()

sentiment
negative    25000
positive    25000
Name: review, dtype: int64

### - Preprocessing

* We clean the first row of the dataset by stripping the **HTML** tags with a **BeautifulSoup** object.
* **Beautiful Soup** is a library that makes it easy to extract information from web pages. It sits on top of an HTML or XML parser.
* **get_text()** is one of its methods; it returns the text of the document without the markup.

In [9]:
data_ = BeautifulSoup(dataset['review'][0], "lxml" )
print (dataset['review'][0])
print('')
print (data_.get_text())

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

In [10]:
# dir(data_)

* The **re** (**Regular Expression**) module lets us clean the text: **re.sub()** with the pattern `[^a-zA-Z]` replaces every character that is not a letter (punctuation, digits, special characters) with a space.

In [11]:
print(data_.get_text())
print('')
data_ =re.sub("[^a-zA-Z]"," ",data_.get_text())
print(data_)

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wou

* We use the **lower()** method to convert the whole text to lowercase.

In [12]:
print(data_)
print('')
data_ = data_.lower()
print(data_)

One of the other reviewers has mentioned that after watching just   Oz episode you ll be hooked  They are right  as this is exactly what happened with me The first thing that struck me about Oz was its brutality and unflinching scenes of violence  which set in right from the word GO  Trust me  this is not a show for the faint hearted or timid  This show pulls no punches with regards to drugs  sex or violence  Its is hardcore  in the classic use of the word It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary  It focuses mainly on Emerald City  an experimental section of the prison where all the cells have glass fronts and face inwards  so privacy is not high on the agenda  Em City is home to many  Aryans  Muslims  gangstas  Latinos  Christians  Italians  Irish and more    so scuffles  death stares  dodgy dealings and shady agreements are never far away I would say the main appeal of the show is due to the fact that it goes where other shows wou

* The **split()** method splits the text into its words and returns them as a list.

In [13]:
print(data_)
print('')
data_ = data_.split()
print(data_)

one of the other reviewers has mentioned that after watching just   oz episode you ll be hooked  they are right  as this is exactly what happened with me the first thing that struck me about oz was its brutality and unflinching scenes of violence  which set in right from the word go  trust me  this is not a show for the faint hearted or timid  this show pulls no punches with regards to drugs  sex or violence  its is hardcore  in the classic use of the word it is called oz as that is the nickname given to the oswald maximum security state penitentary  it focuses mainly on emerald city  an experimental section of the prison where all the cells have glass fronts and face inwards  so privacy is not high on the agenda  em city is home to many  aryans  muslims  gangstas  latinos  christians  italians  irish and more    so scuffles  death stares  dodgy dealings and shady agreements are never far away i would say the main appeal of the show is due to the fact that it goes where other shows wou

* We define a function that reduces each word of the text to its stem using a **SnowballStemmer** object.
* **stem()** is a method of **SnowballStemmer** that returns the stem of a word.

In [14]:
def Stemming(sentence):
    stemmer = SnowballStemmer("english")
    phrase = []
    for word in sentence:
        phrase.append(stemmer.stem(word.lower()))
    return phrase

In [15]:
#  stemmer = SnowballStemmer("english")

In [16]:
# dir(stemmer)

* Using the **Stemming** function

In [17]:
print(data_)
print('')
print(Stemming(data_))

['one', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching', 'just', 'oz', 'episode', 'you', 'll', 'be', 'hooked', 'they', 'are', 'right', 'as', 'this', 'is', 'exactly', 'what', 'happened', 'with', 'me', 'the', 'first', 'thing', 'that', 'struck', 'me', 'about', 'oz', 'was', 'its', 'brutality', 'and', 'unflinching', 'scenes', 'of', 'violence', 'which', 'set', 'in', 'right', 'from', 'the', 'word', 'go', 'trust', 'me', 'this', 'is', 'not', 'a', 'show', 'for', 'the', 'faint', 'hearted', 'or', 'timid', 'this', 'show', 'pulls', 'no', 'punches', 'with', 'regards', 'to', 'drugs', 'sex', 'or', 'violence', 'its', 'is', 'hardcore', 'in', 'the', 'classic', 'use', 'of', 'the', 'word', 'it', 'is', 'called', 'oz', 'as', 'that', 'is', 'the', 'nickname', 'given', 'to', 'the', 'oswald', 'maximum', 'security', 'state', 'penitentary', 'it', 'focuses', 'mainly', 'on', 'emerald', 'city', 'an', 'experimental', 'section', 'of', 'the', 'prison', 'where', 'all', 'the', 'cells', 'h

* A function that combines all the preprocessing steps described above

In [18]:
def review_format(raw_review):
    review = BeautifulSoup(raw_review, "lxml" )
    review = re.sub("[^a-zA-Z]"," ",review.get_text())
    review = review.lower().split()
    review = Stemming(review)
    return(' '.join(review)) 

In [19]:
data = dataset['review'][0]
print(data)

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

In [20]:
data_clean = review_format(data)
print('')
print(data_clean)


one of the other review has mention that after watch just oz episod you ll be hook they are right as this is exact what happen with me the first thing that struck me about oz was it brutal and unflinch scene of violenc which set in right from the word go trust me this is not a show for the faint heart or timid this show pull no punch with regard to drug sex or violenc it is hardcor in the classic use of the word it is call oz as that is the nicknam given to the oswald maximum secur state penitentari it focus main on emerald citi an experiment section of the prison where all the cell have glass front and face inward so privaci is not high on the agenda em citi is home to mani aryan muslim gangsta latino christian italian irish and more so scuffl death stare dodgi deal and shadi agreement are never far away i would say the main appeal of the show is due to the fact that it goe where other show wouldn t dare forget pretti pictur paint for mainstream audienc forget charm forget romanc oz 

* Get the size of the dataset from the review column.

In [21]:
num_reviews = dataset['review'].size
print (num_reviews)

50000


* Loop that applies the transformations to every record of the dataset's review column

In [22]:
clean_data_review= []
for i in range(0,num_reviews):
    clean_data_review.append(review_format(dataset['review'][i]))
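The same loop can also be written with pandas' `Series.apply`, which avoids indexing rows by position. A minimal sketch, assuming a stand-in cleaner `clean_text` (a hypothetical, standard-library-only replacement for `review_format`) and a tiny made-up DataFrame:

```python
import re

import pandas as pd


def clean_text(raw):
    """Hypothetical stand-in for review_format: keep letters only, lowercase."""
    letters_only = re.sub("[^a-zA-Z]", " ", raw)
    return " ".join(letters_only.lower().split())


# Toy data, not taken from the IMDB dataset.
toy = pd.DataFrame({"review": ["Great film! Loved it.", "Bad plot, 0 stars"]})

# apply() calls the function once per row and returns a Series of results.
cleaned = toy["review"].apply(clean_text).to_numpy()
print(cleaned)
```

With the notebook's own function, the equivalent call would be `dataset['review'].apply(review_format)`.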



### - Splitting the data

In [150]:
data_y = dataset['sentiment']
clean_data_review = np.array(clean_data_review)

In [152]:
# clean_data_review

In [44]:
# Split the data into training and validation sets
# I use only 20% of the dataset because my machine does not have enough resources to handle all of the data
# X_train, X_test, y_train, y_test = train_test_split(clean_data_review, data_y, test_size=0.80, random_state=42)

![](https://plumbr.io/app/uploads/2016/06/tf-idf.png)

In [62]:
vec = TfidfVectorizer(stop_words="english")

* **TfidfVectorizer** computes TF-IDF (term frequency–inverse document frequency), a weight that measures how important a word is to a document relative to the whole corpus. The **TF-IDF** value of a word increases with the number of times it occurs in the document and decreases with the number of documents in the corpus that contain it. We use this technique in NLP because it is an effective way to represent natural-language text numerically: decisions are based on keyword frequencies, and the inverse document frequency also tells us whether a word is common or rare in the corpus.
* The **corpus** is the set of all documents.
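To make the weighting concrete, here is a minimal sketch on a made-up three-document corpus (the corpus and the name `vec_demo` are illustrative assumptions, not part of the notebook). A word that appears in every document should receive a lower inverse-document-frequency weight than a word that appears in only one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up): "movie" is in every document, "boring" in only one.
corpus = [
    "movie great acting",
    "movie boring plot",
    "movie great plot",
]

vec_demo = TfidfVectorizer()
tfidf = vec_demo.fit_transform(corpus)  # sparse matrix, one row per document

# idf_ holds one inverse-document-frequency weight per vocabulary term;
# vocabulary_ maps each term to its column index.
idf = {term: vec_demo.idf_[col] for term, col in vec_demo.vocabulary_.items()}
print(idf["movie"], idf["boring"])  # the rare word gets the larger weight
print(tfidf.shape)  # one row per document, one column per distinct term
```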

In [154]:
vec.fit(clean_data_review)
features = vec.transform(clean_data_review)

* **Elbow** method

In [155]:
visualizer = KElbowVisualizer(KMeans(), k=(1,11), timings=False)

visualizer.fit(features)        
visualizer.show()





<Axes3DSubplot:title={'center':'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
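The curve drawn by the elbow method can also be computed by hand. By default, the distortion score of `KElbowVisualizer` is the sum of squared distances from each point to its assigned center, which scikit-learn exposes as `inertia_`. The sketch below assumes small synthetic 2-D data in place of the TF-IDF features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three well-separated groups (illustrative only).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(points)
    inertias.append(model.inertia_)  # sum of squared distances to centroids

for k, value in zip(range(1, 7), inertias):
    print(k, round(value, 1))
```

On data like this the inertia drops sharply up to k=3 and then flattens; that bend is the "elbow" used to pick the number of clusters.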

### - Dimensionality reduction

* The TruncatedSVD transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD).
* Here we reduce the features to 3 dimensions by setting n_components to 3, so that the clusters can be plotted in 3D.
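A minimal sketch of the transformer on a small sparse matrix (the values and the name `svd_demo` are made up for illustration). Unlike PCA, TruncatedSVD does not center the data, so it can work directly on sparse TF-IDF matrices. The sketch also shows that new rows are projected with `transform()`, which reuses the components learned during `fit_transform()`:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Small sparse matrix standing in for TF-IDF features (6 documents, 5 terms).
dense = np.array([
    [1.0, 0.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 4.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],
])
sparse_features = csr_matrix(dense)

svd_demo = TruncatedSVD(n_components=3, random_state=42)
reduced = svd_demo.fit_transform(sparse_features)  # learn and apply components

# New rows must be projected with transform(), reusing the fitted components.
new_row = csr_matrix(np.array([[0.0, 2.0, 0.0, 0.0, 1.0]]))
projected = svd_demo.transform(new_row)

print(reduced.shape, projected.shape)
print(svd_demo.explained_variance_ratio_.sum())  # share of variance kept
```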

In [156]:
svd = TruncatedSVD(n_components=3, random_state=42)

In [158]:
reduced_features = svd.fit_transform(features)

In [159]:
reduced_features.shape

(50000, 3)

### Kmeans
* K-means is a commonly used clustering algorithm that groups items according to their proximity.
* Optimization criterion: the centroids are chosen to minimize the inertia, i.e. the sum of squared distances between each sample xi and the mean of the cluster it is assigned to.
* Our hyperparameters here are: n_clusters, random_state
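The criterion above can be checked numerically. The sketch below, on six made-up 2-D points, recomputes the sum of squared distances between each sample and its assigned centroid and compares it with the `inertia_` attribute reported by scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

# Six toy points forming two obvious groups (illustrative only).
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Recompute the criterion by hand: squared distance of each sample
# to the centroid of the cluster it was assigned to, summed over samples.
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centroids) ** 2).sum()

print(manual_inertia, km.inertia_)
```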

In [160]:
cls = KMeans(n_clusters=3, random_state=42)
cls.fit(reduced_features)

KMeans(n_clusters=3, random_state=42)

In [161]:
y_clusters = cls.predict(reduced_features)

In [162]:
y_clusters.shape

(50000,)

In [163]:
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(reduced_features[y_clusters == 0,0],reduced_features[y_clusters == 0,1],reduced_features[y_clusters == 0,2], s = 40 , color = 'blue', label = "cluster 0")
ax.scatter(reduced_features[y_clusters == 1,0],reduced_features[y_clusters == 1,1],reduced_features[y_clusters == 1,2], s = 40 , color = 'orange', label = "cluster 1")
ax.scatter(reduced_features[y_clusters == 2,0],reduced_features[y_clusters == 2,1],reduced_features[y_clusters == 2,2], s = 40 , color = 'green', label = "cluster 2")
# ax.scatter(reduced_features[y_clusters == 3,0],reduced_features[y_clusters == 3,1],reduced_features[y_clusters == 3,2], s = 40 , color = 'yellow', label = "cluster 3")
# ax.scatter(cls.cluster_centers_[:,0], cls.cluster_centers_[:,1], cls.cluster_centers_[:,2], marker='x', s = 100, c='r')
ax.view_init(0, azim=360)
ax.legend()
plt.show()
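The notebook imports `silhouette_score` and `homogeneity_score` but does not call them. As a sketch of how they could quantify a clustering beyond the plot, the example below uses synthetic data with known labels (an assumption for illustration; on the real data, `y_clusters` and the `sentiment` column would play these roles):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, silhouette_score

# Synthetic data: two separated groups with known labels (illustrative only).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(40, 3))
group_b = rng.normal(loc=5.0, scale=0.5, size=(40, 3))
points = np.vstack([group_a, group_b])
true_labels = np.array([0] * 40 + [1] * 40)

pred = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(points)

# Silhouette needs no ground truth: near 1 means compact, well-separated clusters.
sil = silhouette_score(points, pred)
# Homogeneity compares with known labels: 1.0 means each cluster holds one class.
hom = homogeneity_score(true_labels, pred)
print(sil, hom)
```

On 50,000 samples the silhouette computation is expensive because it relies on pairwise distances; `silhouette_score` accepts a `sample_size` argument to evaluate it on a random subset.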

### Testing the model: inference phase

* New data for the test

In [164]:
review1 = "Parasite was directed and written by Bong Joon Ho and tells the story of the Kim family and their life-changing involvement with the Park family. Parasite can best be described as astonishing, astounding, stunning or any other synonym of amazing. It is so far my favourite film of the year and one of my favourite films of the decade. The very idea of the plot is simple but incredibly hard to execute and that's, why it's editing, is pitch-perfect and leads to an unexpectedly shocking and brutal ending. The biggest forte of Parasite is the screenplay as it impeccably mixes comedy, drama and horror featuring flawless pacing, breath-taking cinematography, a beautiful score and a brilliant cast making it a masterpiece. Parasite also perfectly presents the subject of classism, showing us how both the working class and the upper class view each other and the people around them. Themes of capitalism can also be felt throughout the film, but Parasite shouldn't be mistaken as a pro-capitalism film as it doesn't support or hate anyone or any side; it's ambiguity also contributes to this factor. Taking everything into account, Parasite is a true work of art and a rare and extraordinary masterpiece that should be viewed by everyone at some point in their lives, especially film lovers."

In [165]:
review2 = "Cidade de Deus seems to have a lot of praise on the IMDb boards, and with good reason too. It simply is, in my opinion, one of the best contemporary films ever made. Based on true events and characters who live in the overlooked and poverty stricken slums in the shadows of Rio de Janiero, where life expectancy doesn't reach the 30's and drug dealers are kings. The tale of the City of God, and its myriad of characters is told by Rocket, a young man who struggles to make something of his life, other than to wind up another victim of drugs or gang wars. Not only are the characters in City of God absolutely fascinating, and also very endearing, but also convincingly acted by groups of young and unknown actors. The stoies are well-told, and at times, funny, and at others, brutally shocking. The cinematic style of the film gives a nod to Tarantino, with some clever time-jumping, freeze-framing, and texts indicating another chapter of the film. In every sense, a bit of a Brazillian Pulp Fiction or Goodfellas, but with its own unique flavour to it. The City of God is a marvel, and a highly recommended film to watch, but not recommended for the over-sensitive or easily distressed."

In [166]:
review3 = "This movie reminds me of that scene from Jurassic Park where Jeff Goldblum says You were so preoccupied with whether or not you could, you never bothered to ask if you should. This was hands down the most disturbingly awful movie I have ever seen. Whoever greenlit this should never be in charge of the light ever again. How dare they do this to me?!? Please don't go see this movie. And if you do, may God have mercy on your soul."

In [167]:
review4 = "This movie is like a big story put up in one paragraph with no punctuation marks whatsoever. Usually Transformers builds up a simple story that explodes into one big final battle, filled with special effects and epic fights between autobots and decepticons; pretty entertaining to watch, especially if you're into big explosions- action-robots-movies and usually it works. The last knight is different from the beginning, it introduces way more story lines than it should, it tries to make a suspenseful plot with so many resources it just drags into a story that's full of holes and patches that it seems you're watching a lot of trailers from different movies with no connection between each other. The movie forces a lot of secondary roles (between the old and new characters) that it becomes boring and confusing at the same time. The main story becomes clear in the last quarter of the movie, just in time for the final battleby this point you're in fully overdrive mode trying to catch up with everything that's happened that it's barely enjoyable. Personally I think that the worst part is the lack of continuity, they jump from one scene to another in less than 30 seconds. This dynamicity makes it impossible to follow the main plot or even one story line. There are so many details and jumps between stories that it exhausting trying to connect the dots. I don't recommend it at all, not worth the money or time. Maybe if you're really curious or a big fan into the saga you may find it entertaining at some point, my only suggestion: lower your expectations, cause this is, by far, the worst Transformers movie."


In [168]:
A_df = pd.DataFrame([review1, review2, review3, review4], columns=['review'])

In [169]:
A_df

| | review |
|---|---|
| 0 | Parasite was directed and written by Bong Joon... |
| 1 | Cidade de Deus seems to have a lot of praise o... |
| 2 | This movie reminds me of that scene from Juras... |
| 3 | This movie is like a big story put up in one p... |


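* Preprocessing and prediction:

In [170]:
# Clean the new reviews with the same preprocessing as the training data
to_array=[]
for i in range(0, A_df['review'].size):
    to_array.append(review_format(A_df['review'][i]))
sample_final=vec.transform(to_array)
# Reuse the SVD fitted on the training features: transform(), not fit_transform(),
# otherwise the components are recomputed from these 4 samples alone and no longer
# match the space in which KMeans was trained.
# The output shown below was recorded with the original fit_transform() call.
reduced_sample_final = svd.transform(sample_final)
cls.predict(reduced_sample_final)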

array([1, 1, 0, 1])
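One way to keep the training and inference steps consistent is to chain the three estimators with `make_pipeline`, which the notebook imports but does not use. A minimal sketch on a made-up corpus (the documents, `pipe`, and the choice of two clusters are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus (made up) standing in for the cleaned reviews.
train_docs = [
    "wonderful film great acting",
    "great story wonderful cast",
    "terrible plot awful acting",
    "awful film terrible story",
    "great cast wonderful story",
    "terrible cast awful plot",
]

pipe = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=3, random_state=42),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
pipe.fit(train_docs)  # every step is fitted once, on the training corpus

# predict() only calls transform() on the first two steps, then KMeans.predict().
new_docs = ["wonderful acting great film", "awful story terrible plot"]
labels = pipe.predict(new_docs)
print(labels)
```

Because the pipeline never refits the vectorizer or the SVD at prediction time, new reviews are always projected into the same space the clusters were learned in.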