# Analyse d’opinion (Polarité de tweet) avec Bayes
Par Louis Boivin, Romain Deburghgraeve et Paul Peseux

L'objectif de ce Notebbok est d'obtenir un modèle de classification binaire sur des phrases. L'objectif est de dire si une phrase donnée contient une opinion positive ou negative.

Il n'y a **pas** d'entre deux. On considère qu'une phrase est soit positive soit négative.



---------
Petite astuce pour occuper tout l'espace

In [83]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## Import des packages utiles et utilisés

In [84]:
import nltk
import string
from nltk.corpus import stopwords 
import sys
from sys import exit 
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix as confusion_matrix
from sklearn.metrics import precision_score as precision_score
from sklearn.metrics import accuracy_score as accuracy_score
from sklearn.metrics import recall_score as recall_score

In [85]:
stop_words = set(stopwords.words('english')) 

## Petit Exemple

Pour reprendre l'exemple proposé, on crée à la main un dataset d'entrainement 

In [86]:
pos_tweets = [  ("I love this car", "positive"),
                ("This view is amazing", "positive"),
                ("I feel great this morning", "positive"),
                ("I am so excited about the concert", "positive"),
                ("He is my best friend", "positive"),
                ("Going well", "positive"),
                ("Thank you", "positive"),
                ("Hope you are doing well", "positive"),
                ("I am very happy", "positive"),
                ("Good for you", "positive"),
                ("It is all good. I know about it and I accept it.", "positive"), ("This is really good!", "positive"),
                ("Tomorrow is going to be fun.", "positive"),
                ("Smiling all around.", "positive"),
                ("These are great apples today.", "positive"),
                ("How about them apples? Thomas is a happy boy.", "positive"), 
                ("Thomas is very zen. He is well−mannered.", "positive")]
neg_tweets = [  ("I do not like this car", "negative"), ("This view is horrible", "negative"),
                ("I feel tired this morning", "negative"),
                ("I am not looking forward to the concert", "negative"),
                ("He is my enemy", "negative"),
                ("I am a bad boy", "negative"),
                ("This is not good", "negative"),
                ("I am bothered by this", "negative"),
                ("I am not connected with this", "negative"),
                ("Sadistic creep you ass. Die.", "negative"),
                ("All sorts of crazy and scary as hell.", "negative"),
                ("Not his emails, no.", "negative"),
                ("His father is dead. Returned obviously.", "negative"),
                ("He has a bomb.", "negative"),
                ("Too fast to be on foot. We cannot catch them.", "negative")]
tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
    words_filtered = [e.lower() for e in words.split() if (len(e) >= 3 and not e.lower() in stop_words)] 
    tweets.append((words_filtered, sentiment))
    

On se retrouve avec un dataset **tweets** avec les mots filtrés

--------
On transforme alors cette liste en un training set applicable à un classifieur du type **nltk.NaiveBayesClassifier**

In [87]:
def mon_get_words_in_tweets(tweets): # from __future__ import print_function
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words) 
    return all_words
def mon_get_word_features(wordlist): 
    wordlist = nltk.FreqDist(wordlist) 
    word_features = wordlist.keys() 
    return word_features

def mon_extract_features(document): 
    document_words = set(document) 
    features = {}
    for word in word_features:
        features["contains(%s)" % word] = (word in document_words) 
    return features
word_features = mon_get_word_features(mon_get_words_in_tweets(tweets)) 
training_set = nltk.classify.apply_features(mon_extract_features, tweets) 

In [88]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

On utilise le formidable outil _pickle_ pour sauvegrader ce mini-modèle

In [89]:
save_classifier = open("tweetposneg.pickle","wb") 
pickle.dump(classifier, save_classifier) 
save_classifier.close()
# On doit recharger pour tester d"autres données de test : on peut le charger par les 3 lignes) : # classifier_f = open("naivebayes.pickle", "rb")
# classifier = pickle.load(classifier_f)
# classifier_f.close()


## Grand Exemple

### Paramètres

In [103]:
test_size = 0.5
line_used = 10000
path = "Tweets-folder-Alex/Sentiment-Analysis-Dataset.csv"

### Création du dataset d'entrainement et de test

On utilise la fameuse combinaison **pandas/sklearn** afin de générer ces deux datasets

In [104]:
df = pd.read_csv(path, sep=",", index_col="ItemID")
df = df.head(line_used)

train, test = train_test_split(df, test_size=test_size)

  mask |= (ar1 == a)


In [105]:
train.head()

Unnamed: 0_level_0,Sentiment,SentimentSource,SentimentText
ItemID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
8328,0,Sentiment140,- Threw up
6168,0,Sentiment140,see yas later mommy :p
5404,1,Sentiment140,#followfriday @tiggercolman Annoyingly talente...
6249,1,Sentiment140,#dontuhateitwhen ppl be lying on you &amp; act...
5293,1,Sentiment140,#flowers BOTD Place colorful gerbera daisies i...


In [106]:
def turn0intoneg(x):
    if x:
        return "positive"
    else:
        return "negative"
  

Cette transformation est-elle judicieuse ? 

Il serait sûrement plus économique de stocker l'information sous True ou False.


Cependant c'est une pratique courante en text mining, on s'adapte donc.

In [107]:
  
def makeTweets(dataFrame):
    text = list(df.SentimentText)
    sentiment = list(df.Sentiment)
    sentiment = [turn0intoneg(s) for s in sentiment]
    tweet_sent = [(t,s) for t, s in zip(text, sentiment)]
    tweets = []

    for (words, sentiment) in tweet_sent:
        
        words_filtered = [e.lower() for e in words.translate(str.maketrans('', '', string.punctuation)).split() if (len(e) >= 3 and not e.lower() in stop_words)] 
        
        tweets.append((words_filtered, sentiment))
    return tweets
tweetsTrain = makeTweets(train)
tweetsTest = makeTweets(test)

textTest = list(test.SentimentText)
sentimentTest = list(test.Sentiment)

On a supprimé les _stopwords_ ainsi que la ponctuation.

Ce qui allonge considérablement le préprocessing.

In [108]:
word_features = mon_get_word_features(mon_get_words_in_tweets(tweetsTrain)) 
training_set = nltk.classify.apply_features(mon_extract_features, tweetsTrain) 
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [109]:
predicted = []
for tweett in textTest:
    valued = classifier.classify(mon_extract_features(tweett.split())) 
    predicted.append(valued)

In [110]:
def turnneginto0(x):
    if x=="positive":
        return 1
    elif x=="negative":
        return 0
    else:
        return x

In [111]:
predicted = [turnneginto0(p) for p in predicted]
sentimentTest = [turnneginto0(s) for s in sentimentTest]
CM = confusion_matrix(sentimentTest, predicted)
CM = pd.DataFrame(CM, index = ["False", "True"], columns=["Negative", "positive"])
PS = precision_score(sentimentTest, predicted)
ACC = accuracy_score(sentimentTest, predicted)
REC = recall_score(sentimentTest, predicted)

On affiche alors la matrice de confusion, qui permet d'avoir une idée de notre performance sur les données de test

In [112]:
print("="*30, "RESULTATS", "="*30) 
CM.head()



Unnamed: 0,Negative,positive
False,2848,36
True,1220,896


Ce qui nous donne des métriques :

In [113]:
print("Précision :", str(int(100 * PS)),  "%")
print("Recall :", str(int(100 * REC)),  "%")
print("Accuracy :", str(int(100 * ACC)),  "%")

Précision : 96 %
Recall : 42 %
Accuracy : 74 %


Le résultat est alors très biaisé, mais donne des performances satisfaisantes