### Tweet processing: NLP (Natural Language Processing)

# Project: Data Mining
# Topic: Classification and clustering of tweets in Python.
<h1> By: Hajer Chakroun </h1>
<hr>


## Objectives:
1. Master the Twitter API for extracting tweets
2. Master the NLP (natural language processing) part with NLTK in Python
3. Apply the principles of data cleaning
4. Cluster the tweets: group similar tweets together.



## Twitter

Twitter is a microblogging social network run by Twitter Inc. It lets a user send short messages, called tweets, free of charge over the internet, by instant messaging or by SMS. These messages are limited to 280 characters.


## Specifications

Suppose you have a Twitter account and want to follow the tweets (very short texts) on this
social network. Given the colossal number of tweets and a lack of time, you cannot read them
all. You therefore need an application that acts as an assistant and summarizes all this
information for you. One possible approach is to cluster the tweets into groups, so that the
user is shown a single tweet from each group.
To do so, we proceed in three main steps:

<h3> 1. Tweet preprocessing </h3>
In this step, the goal is to remove useless text from the tweets, such as # symbols, user
names, URLs, …

<h3> 2. Tweet processing: NLP (Natural Language Processing)</h3>
Each tweet is analyzed following the different steps of NLP (Natural Language Processing).
The library used is NLTK in Python.

<h3> 3. Tweet classification </h3>
Given a set of tweets, the goal is to summarize them as groups such that the tweets in the
same group are similar. The user can then read just one tweet from each group (the tweet
that is the centroid of the group).

# Implementation:


## Libraries

The libraries used are added to this part of the code as the project progresses.

In [29]:
!pip install tweepy
!pip install spacy
!python -m spacy download en_core_web_sm

[+] Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [30]:
!pip install en_core_web_sm



In [31]:
import pandas as pd
import spacy
import en_core_web_sm
import tweepy
import numpy as np
import datetime
import csv
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.model_selection import train_test_split
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string
from string import punctuation
import collections
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score
%matplotlib inline

### Dataset
To build our dataset, we need to download tweets from Twitter using the Twitter API. To do so, we created a "Twitter Developer" account and obtained an API secret key and other tokens, which are used below.
![alt text](output.png)

#### Loading the data

In [32]:
import os
# Credentials are read from environment variables instead of being hard-coded;
# the variable names below are assumptions, adapt them to your own setup
auth = tweepy.OAuthHandler(os.environ['TWITTER_API_KEY'], os.environ['TWITTER_API_SECRET'])
auth.set_access_token(os.environ['TWITTER_ACCESS_TOKEN'], os.environ['TWITTER_ACCESS_TOKEN_SECRET'])
api = tweepy.API(auth)

public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)


In [33]:
user = api.get_user(screen_name='twitter')

##### The tweets are saved to a .csv file named twitter_data_analysisY-M-D-H

In [34]:
'''filename = 'Datasets/twitter_data_analysis'+(datetime.datetime.now().strftime("%Y-%m-%d-%H"))+'.csv'
with open (filename, 'w', newline='',encoding="utf-8") as csvFile:
    csvWriter = csv.writer(csvFile)
    csvWriter.writerow(['date', 'TweetId','Tweet','created_at','geo','place','coordinates','location'])
    #using tweepy Cursor
    for tweet in tweepy.Cursor(api.user_timeline , id="Twitter").items(11000):
        csvWriter.writerow([datetime.datetime.now().strftime("%Y-%m-%d  %H:%M"), tweet.id, tweet.text, tweet.created_at, tweet.geo, tweet.place.name if tweet.place else None, tweet.coordinates, tweet._json["user"]["location"]])
'''


##### To obtain at least 10,000 tweets, four previously downloaded files were concatenated

In [35]:
tweet_df1= pd.read_csv('Datasets/twitter_data_analysis2020-11-22-13.csv')
tweet_df2= pd.read_csv('Datasets/twitter_data_analysis2020-12-13-11.csv')
tweet_df3= pd.read_csv('Datasets/twitter_data_analysis2020-12-10-12.csv')
tweet_df4= pd.read_csv('Datasets/twitter_data_analysis2020-12-14-14.csv')
'''tweet_dfi= pd.concat([tweet_df1, tweet_df2], ignore_index= True)
tweet_dfii= pd.concat([tweet_dfi, tweet_df3], ignore_index= True)'''
tweet_df= pd.concat([tweet_df1, tweet_df2, tweet_df3, tweet_df4], ignore_index= True)

# Display the dataset size (n_rows and n_columns)
print('Dataset size:',tweet_df.shape)
print('Columns are:',tweet_df.columns)
tweet_df.info()
tweet_df.head()

Dataset size: (12890, 8)
Columns are: Index(['date', 'TweetId', 'Tweet', 'created_at', 'geo', 'place', 'coordinates',
       'location'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12890 entries, 0 to 12889
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         12890 non-null  object 
 1   TweetId      12890 non-null  int64  
 2   Tweet        12890 non-null  object 
 3   created_at   12890 non-null  object 
 4   geo          0 non-null      float64
 5   place        264 non-null    object 
 6   coordinates  0 non-null      float64
 7   location     12890 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 805.8+ KB


Unnamed: 0,date,TweetId,Tweet,created_at,geo,place,coordinates,location
0,2020-11-22 13:18,1329561340596391936,RT @shesooosaddity: if you had a twitter befor...,2020-11-19 23:05:00,,,,everywhere
1,2020-11-22 13:18,1329217044391342082,@CloudNaii 40404,2020-11-19 00:16:53,,,,everywhere
2,2020-11-22 13:18,1329216472711827458,@issahairplug drink water replaced good morning,2020-11-19 00:14:37,,,,everywhere
3,2020-11-22 13:18,1329107688916135936,@Ne_ThatGuy we're taking oomf to the Fleets,2020-11-18 17:02:21,,,,everywhere
4,2020-11-22 13:18,1329104797727940612,"@_JusJust_ remember ""I dedicate my 500th Tweet...",2020-11-18 16:50:52,,,,everywhere


In [36]:
tweet_df=tweet_df.drop(columns = ['created_at','geo','place', 'coordinates', 'location'])
tweet_df.head()

Unnamed: 0,date,TweetId,Tweet
0,2020-11-22 13:18,1329561340596391936,RT @shesooosaddity: if you had a twitter befor...
1,2020-11-22 13:18,1329217044391342082,@CloudNaii 40404
2,2020-11-22 13:18,1329216472711827458,@issahairplug drink water replaced good morning
3,2020-11-22 13:18,1329107688916135936,@Ne_ThatGuy we're taking oomf to the Fleets
4,2020-11-22 13:18,1329104797727940612,"@_JusJust_ remember ""I dedicate my 500th Tweet..."



## Preprocessing
After inspecting the data, we saw that the sentences contained HTML tags, stop words and punctuation. We therefore started by removing this noise to normalize the sentences.
We remove HTML tags as well as every character that is not a letter, which also removes all punctuation from the texts.

In [37]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

We strip punctuation characters (which include the `#` and `@` signs of hashtags and mentions) and digits. Note that the hashtag and user-name text itself is kept.

In [38]:
def remove_punct(text):
    text  = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

tweet_df['Tweet_punct'] = tweet_df['Tweet'].apply(lambda x: remove_punct(x))
tweet_df.head(10)

Unnamed: 0,date,TweetId,Tweet,Tweet_punct
0,2020-11-22 13:18,1329561340596391936,RT @shesooosaddity: if you had a twitter befor...,RT shesooosaddity if you had a twitter before ...
1,2020-11-22 13:18,1329217044391342082,@CloudNaii 40404,CloudNaii
2,2020-11-22 13:18,1329216472711827458,@issahairplug drink water replaced good morning,issahairplug drink water replaced good morning
3,2020-11-22 13:18,1329107688916135936,@Ne_ThatGuy we're taking oomf to the Fleets,NeThatGuy were taking oomf to the Fleets
4,2020-11-22 13:18,1329104797727940612,"@_JusJust_ remember ""I dedicate my 500th Tweet...",JusJust remember I dedicate my th Tweet to
5,2020-11-22 13:18,1329104643062902788,@ambr_ncole they're tourists,ambrncole theyre tourists
6,2020-11-22 13:18,1329101613940797441,@PhallonXOXO proof you're doing it right 😌,PhallonXOXO proof youre doing it right 😌
7,2020-11-22 13:18,1328838299419627525,some of you hating...\n\nbut we see you Fleeti...,some of you hating\n\nbut we see you Fleeting 🧐
8,2020-11-22 13:18,1328684389388185600,That thing you didn’t Tweet but wanted to but ...,That thing you didn’t Tweet but wanted to but ...
9,2020-11-22 13:18,1328426768009736192,@quakerraina this is art,quakerraina this is art
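
The function above deletes punctuation one character at a time, so a mention such as `@CloudNaii` loses its `@` but the user name stays in the text. Below is a minimal sketch of a regex-based alternative that drops whole mentions and URLs before punctuation is removed. The helper name `clean_tweet` and its patterns are illustrative assumptions, not part of the original notebook.

```python
import re
import string

def clean_tweet(text):
    """Hypothetical helper: strip URLs, @mentions, '#' signs, punctuation and digits."""
    text = re.sub(r'https?://\S+', ' ', text)   # remove URLs
    text = re.sub(r'@\w+', ' ', text)           # remove whole @mentions
    text = text.replace('#', ' ')               # keep the hashtag word, drop the sign
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'[0-9]+', '', text)          # remove digits
    return " ".join(text.split())               # collapse repeated whitespace

print(clean_tweet("@Ne_ThatGuy we're taking oomf to the Fleets https://t.co/abc #Fleets"))
```

Removing URLs and mentions first matters: once punctuation is stripped, they can no longer be told apart from ordinary words.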


All letters are also converted to lowercase, and each tweet is split into tokens on spaces.

In [39]:
def tokenization(text):
    text = re.split(' ', text)
    return text
tweet_df['Tweet_tokenized'] = tweet_df['Tweet_punct'].apply(lambda x: tokenization(x.lower()))
tweet_df.head()

Unnamed: 0,date,TweetId,Tweet,Tweet_punct,Tweet_tokenized
0,2020-11-22 13:18,1329561340596391936,RT @shesooosaddity: if you had a twitter befor...,RT shesooosaddity if you had a twitter before ...,"[rt, shesooosaddity, if, you, had, a, twitter,..."
1,2020-11-22 13:18,1329217044391342082,@CloudNaii 40404,CloudNaii,"[cloudnaii, ]"
2,2020-11-22 13:18,1329216472711827458,@issahairplug drink water replaced good morning,issahairplug drink water replaced good morning,"[issahairplug, drink, water, replaced, good, m..."
3,2020-11-22 13:18,1329107688916135936,@Ne_ThatGuy we're taking oomf to the Fleets,NeThatGuy were taking oomf to the Fleets,"[nethatguy, were, taking, oomf, to, the, fleets]"
4,2020-11-22 13:18,1329104797727940612,"@_JusJust_ remember ""I dedicate my 500th Tweet...",JusJust remember I dedicate my th Tweet to,"[jusjust, remember, i, dedicate, my, th, tweet..."
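
Splitting on a single space leaves empty tokens (as in `[cloudnaii, ]` above) and keeps newlines glued to words. As a sketch, `str.split()` without arguments splits on any run of whitespace. The function name `tokenize_ws` is an assumption for illustration.

```python
import re

def tokenize_ws(text):
    """Hypothetical variant of tokenization(): split on any run of whitespace."""
    return text.lower().split()

sample = "some of you hating\n\nbut we see you "
print(re.split(' ', sample.lower()))  # space-only split, as in the notebook
print(tokenize_ws(sample))            # whitespace-aware split
```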


In [40]:
!pip install nltk



Because stop words, by definition, add no information to the text, we remove them as well.

In [41]:
import nltk
nltk.download('stopwords')
stopword = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [42]:
stopword.extend(['a’s', 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after', 'afterwards',\
                 'again', 'against', 'ain’t', 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already','also', 'although', 'always', 'am', 'among',\
                 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways',\
                 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', 'aren’t', 'around', 'as', 'aside', 'ask', 'asking',\
                 'associated', 'at', 'available', 'away', 'awfully', 'be', 'became', 'because', 'become', 'becomes', 'becoming',\
                 'been', 'before', 'beforehand', 'behind', 'being', 'believe', 'below', 'beside','besides', 'best', 'better', 'between',\
                 'beyond', 'both', 'brief', 'but', 'by', 'c’mon', 'c’s', 'came', 'can', 'can’t', 'cannot', 'cant', 'cause', 'causes',\
                 'certain', 'certainly', 'changes', 'clearly', 'co', 'com', 'come', 'comes', 'old', 'new', 'age', 'lot', 'bag', 'top', 'cat', 'bat', 'sap', 'jda', 'tea', 'dog', 'lie', 'law', 'lab',\
                 'mob', 'map', 'car', 'fat', 'sea', 'saw', 'raw', 'rob', 'win', 'can', 'get', 'fan', 'fun', 'big',\
                 'use', 'pea', 'pit','pot', 'pat', 'ear', 'eye', 'kit', 'pot', 'pen', 'bud', 'bet', 'god', 'tax', 'won', 'run',\
                 'lid', 'log', 'pr', 'pd', 'cop', 'nyc', 'ny', 'la', 'toy', 'war', 'law', 'lax', 'jfk', 'fed', 'cry', 'ceo',\
                 'pay', 'pet', 'fan', 'fun', 'usd', 'rio',':)', ';)', '(:', '(;', '}', '{','}','here', 'there', 'where', 'when', 'would', 'should', 'could','thats', 'youre', 'thanks', 'hasn',\
                 'thank', 'https', 'since', 'wanna', 'gonna', 'aint', 'http', 'unto', 'onto', 'into', 'havent',\
                 'dont', 'done', 'cant', 'werent', 'https', 'u', 'isnt', 'go', 'theyre', 'each', 'every', 'shes', 'youve', 'youll',\
                 'weve', 'theyve','googleele' , 'goog', 'lyin', 'lie', 'googles', 'goog', 'aapl','apple',\
                 'msft','microsoft', 'google', 'goog', 'googl','goog','https', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'containing', 'contains',\
                 'corresponding', 'could', 'couldn’t', 'course', 'currently', 'definitely', 'described', 'despite',\
                 'did', 'didn’t', 'different', 'do', 'does', 'doesn’t', 'doing', 'don’t', 'done', 'down', 'downwards', 'during', 'each', 'edu', 'eg', 'eight', 'either',\
                 'else', 'elsewhere', 'enough', 'entirely', 'especially', 'et', 'etc', 'even', 'ever', 'every', 'everybody', 'everyone', 'everything',\
                 'everywhere', 'ex', 'exactly', 'example', 'except', 'far', 'few', 'fifth', 'first', 'five', 'followed', 'following', 'follows',\
                 'for', 'former', 'formerly', 'forth', 'four', 'from', 'further', 'furthermore', 'get', 'gets', 'getting', 'given', 'gives', 'go',\
                 'goes', 'going', 'gone', 'got', 'gotten', 'greetings', 'had', 'hadn’t', 'happens', 'hardly', 'has', 'hasn’t', 'have', 'haven’t',\
                 'having', 'he', 'he’s', 'hello', 'help', 'hence', 'her', 'here', 'here’s', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself',\
                 'hi', 'him', 'himself', 'his', 'hither', 'hopefully', 'how', 'howbeit', 'however', 'i’d', 'i’ll', 'i’m', 'i’ve', 'ie', 'if', 'ignored', 'immediate',\
                 'in', 'inasmuch', 'inc', 'indeed', 'indicate', 'indicated', 'indicates', 'inner', 'insofar', 'instead', 'into', 'inward', 'is', 'isn’t', 'it', 'it’d', 'it’ll',\
                 'it’s', 'its', 'itself', 'just', 'keep', 'keeps', 'kept', 'know', 'knows', 'known', 'last', 'lately', 'later', 'latter', 'latterly',\
                 'least', 'less', 'lest', 'let', 'let’s', 'like', 'liked', 'likely', 'little', 'look', 'looking', 'looks', 'ltd', 'mainly', 'many', 'may', 'maybe',\
                 'me', 'mean', 'meanwhile', 'merely', 'might', 'more', 'moreover', 'most', 'mostly', 'much', 'must', 'my', 'myself', 'name', 'namely', 'nd', 'near', 'nearly', 'necessary',\
                 'need', 'needs', 'neither', 'never', 'nevertheless', 'new', 'next', 'nine', 'no', 'nobody', 'non', 'none', 'noone', 'nor', 'normally', 'not', 'nothing', 'novel', 'now', 'nowhere',\
                 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'on', 'once', 'one', 'ones', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'own',\
                 'particular', 'particularly', 'per', 'perhaps', 'placed', 'please', 'plus', 'possible', 'presumably', 'probably', 'provides', 'que', 'quite', 'qv', 'rather', 'rd', 're', 'really', 'reasonably', 'regarding', 'regardless', 'regards', 'relatively', 'respectively', 'right',\
                 'said', 'same', 'saw', 'say', 'saying', 'says', 'second', 'secondly', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sensible', 'sent', 'serious', 'seriously', 'seven', 'several', 'shall', 'she',\
                 'should', 'shouldn’t', 'since', 'six', 'so', 'some', 'somebody', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specified', 'specify', 'specifying',\
                 'still', 'sub', 'such', 'sup', 'sure', 't’s', 'take', 'taken', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', 'that’s', 'thats', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'there’s', 'thereafter', 'thereby',\
                 'therefore', 'therein', 'theres', 'thereupon', 'these', 'they', 'they’d', 'they’ll', 'they’re', 'they’ve', 'think', 'third', 'this', 'thorough', 'thoroughly', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'took', 'toward', 'towards',\
                 'tried', 'tries', 'truly', 'try', 'trying', 'twice', 'two', 'un', 'under', 'unfortunately', 'unless', 'unlikely', 'until', 'unto', 'up', 'upon',\
                 'us', 'use', 'used', 'useful', 'uses', 'using', 'usually', 'value', 'various', 'very', 'via', 'viz', 'vs', 'want', 'wants', 'was', 'wasn’t', 'way',\
                 'we', 'we’d', 'we’ll', 'we’re', 'we’ve', 'welcome', 'well', 'went', 'were', 'weren’t', 'what', 'what’s', 'whatever', 'when', 'whence', 'whenever', 'where', 'where’s', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither',\
                 'who', 'who’s', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'willing', 'wish', 'with', 'within', 'without', 'won’t', 'wonder', 'would', 'wouldn’t', 'yes', 'yet',\
                 'you', 'you’d', 'you’ll', 'you’re', 'you’ve', 'your', 'yours', 'yourself', 'yourselves', 'zero'])


We remove the words that carry no meaning.

In [43]:
def remove_stopwords(text):
    text = [word for word in text if word not in stopword]
    return text
    
tweet_df['Tweet_nonstop'] = tweet_df['Tweet_tokenized'].apply(lambda x: remove_stopwords(x))
tweet_df.head(10)

Unnamed: 0,date,TweetId,Tweet,Tweet_punct,Tweet_tokenized,Tweet_nonstop
0,2020-11-22 13:18,1329561340596391936,RT @shesooosaddity: if you had a twitter befor...,RT shesooosaddity if you had a twitter before ...,"[rt, shesooosaddity, if, you, had, a, twitter,...","[rt, shesooosaddity, twitter, , rt]"
1,2020-11-22 13:18,1329217044391342082,@CloudNaii 40404,CloudNaii,"[cloudnaii, ]","[cloudnaii, ]"
2,2020-11-22 13:18,1329216472711827458,@issahairplug drink water replaced good morning,issahairplug drink water replaced good morning,"[issahairplug, drink, water, replaced, good, m...","[issahairplug, drink, water, replaced, good, m..."
3,2020-11-22 13:18,1329107688916135936,@Ne_ThatGuy we're taking oomf to the Fleets,NeThatGuy were taking oomf to the Fleets,"[nethatguy, were, taking, oomf, to, the, fleets]","[nethatguy, taking, oomf, fleets]"
4,2020-11-22 13:18,1329104797727940612,"@_JusJust_ remember ""I dedicate my 500th Tweet...",JusJust remember I dedicate my th Tweet to,"[jusjust, remember, i, dedicate, my, th, tweet...","[jusjust, remember, dedicate, tweet]"
5,2020-11-22 13:18,1329104643062902788,@ambr_ncole they're tourists,ambrncole theyre tourists,"[ambrncole, theyre, tourists]","[ambrncole, tourists]"
6,2020-11-22 13:18,1329101613940797441,@PhallonXOXO proof you're doing it right 😌,PhallonXOXO proof youre doing it right 😌,"[phallonxoxo, proof, youre, doing, it, right, 😌]","[phallonxoxo, proof, 😌]"
7,2020-11-22 13:18,1328838299419627525,some of you hating...\n\nbut we see you Fleeti...,some of you hating\n\nbut we see you Fleeting 🧐,"[some, of, you, hating\n\nbut, we, see, you, f...","[hating\n\nbut, fleeting, 🧐]"
8,2020-11-22 13:18,1328684389388185600,That thing you didn’t Tweet but wanted to but ...,That thing you didn’t Tweet but wanted to but ...,"[that, thing, you, didn’t, tweet, but, wanted,...","[thing, tweet, wanted, close, nah, \n\nwe, pla..."
9,2020-11-22 13:18,1328426768009736192,@quakerraina this is art,quakerraina this is art,"[quakerraina, this, is, art]","[quakerraina, art]"
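
`stopword` is a Python list with many repeated entries, so every `in` test scans it linearly. Below is a minimal sketch, using a small stand-in list rather than the full one, that converts the list to a `set` for constant-time lookups and also drops the empty tokens visible in the table above. The names `stopword_set` and `remove_stopwords_fast` are assumptions for illustration.

```python
# Small stand-in for the full stopword list built above (assumption for illustration)
stopword_list = ['you', 'it', 'doing', 'youre', 'right', 'right']
stopword_set = set(stopword_list)  # removes duplicates and gives O(1) membership tests

def remove_stopwords_fast(tokens):
    """Hypothetical variant of remove_stopwords(): set lookup, empty tokens dropped."""
    return [word for word in tokens if word and word not in stopword_set]

print(remove_stopwords_fast(['phallonxoxo', 'proof', 'youre', 'doing', 'it', 'right', '']))
```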


NLTK is a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning.
We use NLTK to analyze each tweet and turn it into a set of words, following the basic
steps of the NLP (Natural Language Processing) pipeline.

Stemming and lemmatization are techniques used to reduce words to a base form: stemming strips affixes, while lemmatization maps each word to its dictionary form (lemma).


In [44]:
ps = nltk.PorterStemmer()
def stemming(text):
    text = [ps.stem(word) for word in text]
    return text

tweet_df['Tweet_stemmed'] = tweet_df['Tweet_nonstop'].apply(lambda x: stemming(x))
tweet_df.head()

Unnamed: 0,date,TweetId,Tweet,Tweet_punct,Tweet_tokenized,Tweet_nonstop,Tweet_stemmed
0,2020-11-22 13:18,1329561340596391936,RT @shesooosaddity: if you had a twitter befor...,RT shesooosaddity if you had a twitter before ...,"[rt, shesooosaddity, if, you, had, a, twitter,...","[rt, shesooosaddity, twitter, , rt]","[rt, shesooosadd, twitter, , rt]"
1,2020-11-22 13:18,1329217044391342082,@CloudNaii 40404,CloudNaii,"[cloudnaii, ]","[cloudnaii, ]","[cloudnaii, ]"
2,2020-11-22 13:18,1329216472711827458,@issahairplug drink water replaced good morning,issahairplug drink water replaced good morning,"[issahairplug, drink, water, replaced, good, m...","[issahairplug, drink, water, replaced, good, m...","[issahairplug, drink, water, replac, good, morn]"
3,2020-11-22 13:18,1329107688916135936,@Ne_ThatGuy we're taking oomf to the Fleets,NeThatGuy were taking oomf to the Fleets,"[nethatguy, were, taking, oomf, to, the, fleets]","[nethatguy, taking, oomf, fleets]","[nethatguy, take, oomf, fleet]"
4,2020-11-22 13:18,1329104797727940612,"@_JusJust_ remember ""I dedicate my 500th Tweet...",JusJust remember I dedicate my th Tweet to,"[jusjust, remember, i, dedicate, my, th, tweet...","[jusjust, remember, dedicate, tweet]","[jusjust, rememb, dedic, tweet]"


In [45]:
nltk.download('wordnet')
wn = nltk.WordNetLemmatizer()

def lemmatizer(text):
    text = [wn.lemmatize(word) for word in text]
    return text

tweet_df['Tweet_lemmatized'] = tweet_df['Tweet_nonstop'].apply(lambda x: lemmatizer(x))
tweet_df.head()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,date,TweetId,Tweet,Tweet_punct,Tweet_tokenized,Tweet_nonstop,Tweet_stemmed,Tweet_lemmatized
0,2020-11-22 13:18,1329561340596391936,RT @shesooosaddity: if you had a twitter befor...,RT shesooosaddity if you had a twitter before ...,"[rt, shesooosaddity, if, you, had, a, twitter,...","[rt, shesooosaddity, twitter, , rt]","[rt, shesooosadd, twitter, , rt]","[rt, shesooosaddity, twitter, , rt]"
1,2020-11-22 13:18,1329217044391342082,@CloudNaii 40404,CloudNaii,"[cloudnaii, ]","[cloudnaii, ]","[cloudnaii, ]","[cloudnaii, ]"
2,2020-11-22 13:18,1329216472711827458,@issahairplug drink water replaced good morning,issahairplug drink water replaced good morning,"[issahairplug, drink, water, replaced, good, m...","[issahairplug, drink, water, replaced, good, m...","[issahairplug, drink, water, replac, good, morn]","[issahairplug, drink, water, replaced, good, m..."
3,2020-11-22 13:18,1329107688916135936,@Ne_ThatGuy we're taking oomf to the Fleets,NeThatGuy were taking oomf to the Fleets,"[nethatguy, were, taking, oomf, to, the, fleets]","[nethatguy, taking, oomf, fleets]","[nethatguy, take, oomf, fleet]","[nethatguy, taking, oomf, fleet]"
4,2020-11-22 13:18,1329104797727940612,"@_JusJust_ remember ""I dedicate my 500th Tweet...",JusJust remember I dedicate my th Tweet to,"[jusjust, remember, i, dedicate, my, th, tweet...","[jusjust, remember, dedicate, tweet]","[jusjust, rememb, dedic, tweet]","[jusjust, remember, dedicate, tweet]"


#### The data after preprocessing:

In [46]:
tweet_df.Tweet_lemmatized

0                      [rt, shesooosaddity, twitter, , rt]
1                                            [cloudnaii, ]
2        [issahairplug, drink, water, replaced, good, m...
3                         [nethatguy, taking, oomf, fleet]
4                     [jusjust, remember, dedicate, tweet]
                               ...                        
12885               [themegaboi, keeping, brain, thinking]
12886    [guillaumetc, hamillhimself, chrisevans, combo...
12887                                   [ksjize, dogrates]
12888               [insomniacookies, cc, mecookiemonster]
12889    [mnoir, amp, guaranteed, good, morning, good, ...
Name: Tweet_lemmatized, Length: 12890, dtype: object

#### We save the preprocessed dataset to a new csv file

In [47]:
tweet_df.Tweet_lemmatized.to_csv('Datasets/new_cleaned_tweets.csv',index = False)

In [48]:
# display
new_tweet_df= pd.read_csv('Datasets/new_cleaned_tweets.csv')
print('Dataset size:',new_tweet_df.shape)
print('Columns are:',new_tweet_df.columns)
new_tweet_df.info()
new_tweet_df.head()

Dataset size: (12890, 1)
Columns are: Index(['Tweet_lemmatized'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12890 entries, 0 to 12889
Data columns (total 1 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Tweet_lemmatized  12890 non-null  object
dtypes: object(1)
memory usage: 100.8+ KB


Unnamed: 0,Tweet_lemmatized
0,"['rt', 'shesooosaddity', 'twitter', '', 'rt']"
1,"['cloudnaii', '']"
2,"['issahairplug', 'drink', 'water', 'replaced',..."
3,"['nethatguy', 'taking', 'oomf', 'fleet']"
4,"['jusjust', 'remember', 'dedicate', 'tweet']"
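
Writing a column of Python lists to CSV stores each list as its string representation, which is why the values above are quoted strings. Below is a minimal sketch, using an in-memory buffer instead of the `Datasets/` files, of how such a column can be parsed back into lists with `ast.literal_eval` or joined into plain text. The toy frame and the column names `tokens` and `text` are assumptions for illustration.

```python
import ast
import io
import pandas as pd

# Toy frame standing in for tweet_df (assumption for illustration)
df = pd.DataFrame({'Tweet_lemmatized': [['jusjust', 'remember', 'dedicate', 'tweet'],
                                        ['quakerraina', 'art']]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # each list is written as its repr, e.g. "['a', 'b']"
buffer.seek(0)
restored = pd.read_csv(buffer)

first_raw = restored['Tweet_lemmatized'][0]
print(type(first_raw))           # the column now holds strings, not lists

# Parse each string back into a real list, then join the tokens into plain text
restored['tokens'] = restored['Tweet_lemmatized'].apply(ast.literal_eval)
restored['text'] = restored['tokens'].apply(' '.join)
print(restored['text'].tolist())
```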


## Vectorization
After the CSV round trip, each cleaned tweet is a single string (the text form of its token list), which `CountVectorizer` can tokenize directly.

In [49]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X=cv.fit_transform(new_tweet_df.Tweet_lemmatized)
print(X)

  (0, 4591)	2
  (0, 4803)	1
  (0, 5616)	1
  (1, 899)	1
  (2, 2722)	1
  (2, 1340)	1
  (2, 5816)	1
  (2, 4480)	1
  (2, 1859)	1
  (2, 3637)	1
  (3, 3772)	1
  (3, 5242)	1
  (3, 4027)	1
  (3, 1687)	1
  (4, 2958)	1
  (4, 4468)	1
  (4, 1185)	1
  (4, 5603)	1
  (5, 172)	1
  (5, 5526)	1
  (6, 4167)	1
  (6, 4307)	1
  (7, 1994)	1
  (7, 3745)	1
  (7, 1688)	1
  :	:
  (12882, 3461)	1
  (12883, 2587)	1
  (12884, 369)	1
  (12884, 3607)	1
  (12885, 597)	1
  (12885, 5407)	1
  (12885, 3008)	1
  (12885, 5362)	1
  (12886, 1931)	1
  (12886, 1955)	1
  (12886, 844)	1
  (12886, 941)	1
  (12886, 5863)	1
  (12887, 1297)	1
  (12887, 3092)	1
  (12888, 3474)	1
  (12888, 758)	1
  (12888, 2680)	1
  (12889, 1859)	2
  (12889, 3637)	1
  (12889, 5603)	1
  (12889, 3865)	1
  (12889, 179)	1
  (12889, 3598)	1
  (12889, 1925)	1
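
Each `(row, column)  count` line in the output above is one non-zero entry of a sparse document-term matrix: the row is the tweet index, the column is the index of a vocabulary term, and the value is how many times that term occurs in the tweet. Below is a small sketch on three toy strings; the strings and the names `toy_cv` and `toy_X` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents in the same format as the CSV column: stringified token lists
docs = ["['good', 'morning', 'good', 'tweet']",
        "['drink', 'water', 'good', 'morning']",
        "['remember', 'dedicate', 'tweet']"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(docs)   # sparse matrix, one row per document

print(toy_X.shape)                   # (number of documents, vocabulary size)
print(sorted(toy_cv.vocabulary_))    # learned terms; brackets, quotes and commas are ignored
good_col = toy_cv.vocabulary_['good']
print(toy_X[0, good_col])            # count of 'good' in the first document
```

With its default settings, `CountVectorizer` keeps only tokens of two or more alphanumeric characters, so one-letter tokens are dropped from the vocabulary.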


# Tweet classification

We use K-Means, a popular unsupervised clustering algorithm in machine learning, to group the tweets into 30 clusters; each tweet is assigned to exactly one cluster.

In [None]:
%%time
import logging
from sklearn.cluster import KMeans
seed = 42

# candidate numbers of clusters: 3 to 30
ks = list(range(3, 31))
# inertia (within-cluster sum of squared distances) of each fitted model
result = []

# fit one model per k and record its inertia
for k in ks:
    logging.warning('fitting model for {} clusters'.format(k))
    # n_jobs is no longer accepted by KMeans in recent scikit-learn releases
    Kmeans = KMeans(n_clusters=k, random_state=seed)
    Kmeans.fit(X)
    result.append(Kmeans.inertia_)





In [None]:
plt.plot(ks, result, 'o--')
plt.xlabel('number of clusters')
plt.ylabel('inertia')
plt.title('kmeans parameter search')

In [None]:
import matplotlib.pyplot as plt
plt.plot(ks,result)
plt.xlabel('number of clusters')
plt.ylabel('inertia')
plt.show()
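
Inertia generally keeps decreasing as the number of clusters grows, so the curve is usually read by looking for an "elbow". A complementary criterion is the silhouette score, which lies in [-1, 1] and is higher when clusters are compact and well separated. Below is a minimal sketch on synthetic data; the data from `make_blobs` and the range of `candidate_ks` are assumptions for illustration, not the tweet matrix.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data drawn around 4 centers (assumption for illustration)
points, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.5, random_state=42)

candidate_ks = list(range(2, 8))
inertias, silhouettes = [], []
for k in candidate_ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(points)
    inertias.append(model.inertia_)                              # within-cluster sum of squares
    silhouettes.append(silhouette_score(points, model.labels_))  # in [-1, 1], higher is better

best_k = candidate_ks[silhouettes.index(max(silhouettes))]
print("k with the highest silhouette score:", best_k)
```

On the full tweet matrix, the silhouette score is much more expensive than inertia, so it may be worth computing it on a sample of the rows.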

In [None]:
true_k=30
Kmeans=KMeans(n_clusters=true_k,init='k-means++',n_init=1,random_state=seed)
Kmeans.fit(X)

In [None]:
print("Top terms per cluster:")
order_centroids = Kmeans.cluster_centers_.argsort()[:, ::-1]
terms = cv.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    print("-----------------------")
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()
print("\n")

### Display a single tweet for each cluster
For each cluster, we print one tweet assigned to it (here, the first one found in the dataset).


In [None]:
# for each cluster, print the first tweet that K-Means assigns to it
predictions = Kmeans.predict(X)
for i in range(true_k):
    for j, label in enumerate(predictions):
        if label == i:
            # index with j (the matching tweet), not i (the cluster number)
            print("Tweet of cluster " + str(i) + " : " + tweet_df.Tweet[j])
            print("-----------------------------------------------")
            print("\n")
            break
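
The specification asks for the tweet that is closest to the centroid of each group. Below is a minimal sketch of that selection, assuming toy documents and 2 clusters. `KMeans.transform` returns the distance from every sample to every centroid, so the representative of a cluster is the member with the smallest distance to its own centroid. The names `tweets`, `vec`, `M`, `km` and `representatives` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Toy tweets (assumption for illustration)
tweets = ["good morning twitter", "good morning everyone", "morning good morning",
          "drink more water", "drink water daily", "water water drink"]

vec = CountVectorizer()
M = vec.fit_transform(tweets)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(M)

distances = km.transform(M)                  # shape (n_tweets, n_clusters)
representatives = {}
for c in range(km.n_clusters):
    members = np.where(km.labels_ == c)[0]   # indices of the tweets in cluster c
    if len(members) == 0:                    # skip empty clusters
        continue
    closest = members[np.argmin(distances[members, c])]
    representatives[c] = tweets[closest]
    print("Cluster", c, ":", tweets[closest])
```

Applied to the notebook, the same loop would use `X`, `Kmeans` and `tweet_df.Tweet` in place of the toy objects.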