# Entity-level sentiment analysis
The goal here is to count, for each entity, the number of messages classified as "positive" and "negative" that mention it.


In [None]:
!pip install pandas
!pip install tensorflow
!pip install transformers
!pip install sentencepiece

In [1]:
import tensorflow
import pandas as pd
import numpy as np
import json
from transformers import TFCamembertForSequenceClassification
import transformers.models.camembert.tokenization_camembert as tk

### Loading the messages and the entities

In [2]:
with open('./messages/BDE_8223.json', encoding="utf8") as f:
  messages = json.load(f)

with open('./entities/entities_final.json', encoding="utf8") as f:
  entities = json.load(f)

### Retrieving the tokenizer for our model

In [3]:
tokenizer = tk.CamembertTokenizer.from_pretrained("jplu/tf-camembert-base", do_lower_case=True)
assert tokenizer is not None

def encode_msg(messages, tokenizer=tokenizer, max_length=80):
    # one row per message, right-padded with zeros up to max_length
    token_ids = np.zeros(shape=(len(messages), max_length),
                         dtype=np.int32)
    for i, msg in enumerate(messages):
        # truncation=True makes the cut at max_length explicit
        encoded = tokenizer.encode(msg, max_length=max_length, truncation=True)
        token_ids[i, 0:len(encoded)] = encoded
    # attend only to the non-padding positions
    attention_mask = (token_ids != 0).astype(np.int32)
    return {"input_ids": token_ids, "attention_mask": attention_mask}
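
To make the padding and masking logic easier to follow, here is a minimal sketch that runs without downloading the real tokenizer. `ToyTokenizer` and `encode_batch` are hypothetical names introduced only for this illustration; the toy tokenizer assigns arbitrary ids and, unlike the real one, simply cuts off the end of an over-long sequence.

```python
import numpy as np

class ToyTokenizer:
    """Hypothetical stand-in for CamembertTokenizer: one id per whitespace-separated word."""
    def encode(self, text, max_length=None, truncation=True):
        # 5 and 6 play the role of the start and end markers (arbitrary choice here)
        ids = [5] + [10 + len(w) for w in text.split()] + [6]
        return ids[:max_length]

def encode_batch(messages, tokenizer, max_length=8):
    # same layout as encode_msg: zero-padded ids plus a 0/1 attention mask
    token_ids = np.zeros((len(messages), max_length), dtype=np.int32)
    for i, msg in enumerate(messages):
        encoded = tokenizer.encode(msg, max_length=max_length, truncation=True)
        token_ids[i, :len(encoded)] = encoded
    attention_mask = (token_ids != 0).astype(np.int32)
    return {"input_ids": token_ids, "attention_mask": attention_mask}

batch = encode_batch(["Meh", "Bah tu fais ce que tu veux"], ToyTokenizer())
print(batch["input_ids"])
print(batch["attention_mask"])
```

The short message ends up with a mask of three ones followed by zeros, while the long one fills the whole row and is cut at `max_length`.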

### Loading our fine-tuned model

In [4]:
model = TFCamembertForSequenceClassification.from_pretrained("jplu/tf-camembert-base")
model.load_weights("./models_weights/f179_count8000_epo20_batch4.h5")

All model checkpoint layers were used when initializing TFCamembertForSequenceClassification.

Some layers of TFCamembertForSequenceClassification were not initialized from the model checkpoint at jplu/tf-camembert-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Preprocessing our dataset
Messages containing links, GIFs, or voice recordings are discarded. Messages shorter than 2 characters or longer than 120 characters are also dropped, because the model was trained on short messages.

The data is also reformatted to keep only the content and the author of each message.

In [5]:
# input (messages): {who:.., what:.., when:.., feedback:.., whatType:..},{..},{..},..
# output (messages_keep): [[what, who],[],..]

messages_keep = [[m['what'],m['who']] for m in messages if m['whatType'] == "Texte"]

df = pd.DataFrame(messages_keep,columns=['messages','auteur'])
print(f"{len(df)} messages kept")

6972 messages kept
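
The cell above filters on `whatType` only; the length bounds described in the text do not appear in it and may have been applied elsewhere. As a rough sketch of how both criteria could be combined, assuming the bounds are inclusive (2 to 120 characters), one could write the following. `keep_message`, the sample records, and the `"Lien"` type value are hypothetical and serve only as illustration.

```python
def keep_message(m, min_len=2, max_len=120):
    """Return True for plain-text messages whose length is within the bounds."""
    if m.get("whatType") != "Texte":
        return False  # links, GIFs, voice recordings, ...
    return min_len <= len(m["what"]) <= max_len

# made-up sample records, not taken from the corpus
sample = [
    {"who": "A", "what": "Meh", "whatType": "Texte"},
    {"who": "B", "what": "x", "whatType": "Texte"},          # too short
    {"who": "C", "what": "y" * 121, "whatType": "Texte"},    # too long
    {"who": "D", "what": "https://example.com", "whatType": "Lien"},  # hypothetical type value
]
sample_keep = [[m["what"], m["who"]] for m in sample if keep_message(m)]
print(sample_keep)
```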


### Predicting sentiment with the fine-tuned model

In [6]:
# output [[what,who,sentiment],..]

messages_array = df.iloc[:,0].values
encoded_messages = encode_msg(messages_array)

scores = model.predict(encoded_messages)
sent_pred = np.argmax(scores['logits'], axis=1)
df['sentiment'] = sent_pred

df.head()

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method


| | messages | auteur | sentiment |
|---|---|---|---|
| 0 | Mais elle a même pas présenté ce que c'était l... | Fanny-T | 0 |
| 1 | Bah tu fais ce que tu veux | Olivier | 1 |
| 2 | Elle a dit qu’on fait ce qu’on veut | Al | 1 |
| 3 | Meh | Fanny-T | 1 |
| 4 | Mais la vous avez tous des projets ou vous dem... | Al | 0 |
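
The prediction step takes the `argmax` of the two logits, with label 0 counted as negative and label 1 as positive in the cells that follow. The sketch below shows the same step on made-up logits and adds a softmax so that a confidence can be read next to each label; the `softmax` helper and the numeric values are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def softmax(logits):
    # subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# made-up logits for two messages: columns are (negative, positive)
logits = np.array([[2.0, -1.0], [0.1, 0.3]])

labels = np.argmax(logits, axis=1)   # 0 = negative, 1 = positive
probs = softmax(logits)
confidence = probs.max(axis=1)       # probability of the predicted label

print(labels, confidence)
```

A low confidence, as for the second row, flags messages for which the hard label says little.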


In [7]:
# label convention: 0 = negative, 1 = positive
counts = df['sentiment'].value_counts()
mneg = counts[0]
mpos = counts[1]
tot = mpos + mneg
print(f"{round(100 * mpos/tot,1)}% positive messages ({mpos})")
print(f"{round(100 * mneg/tot,1)}% negative messages ({mneg})")

63.4% positive messages (4422)
36.6% negative messages (2550)


### Sentiment coloring of the entities
For each entity (a group of words), we count the positive and negative messages that contain at least one of the words associated with that entity.

In [9]:
# output {entity: [n_negative, n_positive, diff, percent_positive]}

def containOneOf(message,elements):
    # substring match against the lowercased message
    message = message.lower()
    for e in elements:
        if e in message:
            return True
    return False

entities_with_sentiment = {}

n_sans_entity = 0
for row in df.to_numpy():
    sans_entity = True
    for ent in entities:
        if containOneOf(row[0],entities[ent]):
            arr = np.array([0,0,0,0])
            arr[row[2]] = 1
            if ent not in entities_with_sentiment:
                entities_with_sentiment[ent] = arr.tolist()
            else:
                entities_with_sentiment[ent] = (np.array(arr)+np.array(entities_with_sentiment[ent])).tolist()
            sans_entity = False
    if sans_entity:
        n_sans_entity += 1

# add the difference (positive - negative) and the percentage of positive messages
for ent in entities_with_sentiment:
    entities_with_sentiment[ent][2] = entities_with_sentiment[ent][1] - entities_with_sentiment[ent][0]
    entities_with_sentiment[ent][3] = round(100*entities_with_sentiment[ent][1]/(entities_with_sentiment[ent][1] + entities_with_sentiment[ent][0]))

# save the analysis
with open('analyse.json', 'w', encoding="utf8") as fout:
    json.dump(entities_with_sentiment, fout, ensure_ascii=False)
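
Because `containOneOf` tests for substrings, a word such as `julie` also matches messages that contain `juliette` or `julien`. Whether this is intended is not stated; the entity lists may well rely on it (an entity keyed `kermess` appears in the results, which could be a deliberate stem). If whole-word matching is wanted, one possible variant is sketched below. `contains_one_of_words` is a hypothetical helper, not the notebook's method.

```python
import re

def contains_one_of_words(message, elements):
    """Return True if any element occurs in the message as a whole word or phrase."""
    text = message.lower()
    for e in elements:
        # lookarounds reject matches glued to other word characters
        pattern = r"(?<!\w)" + re.escape(e.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            return True
    return False

print(contains_one_of_words("Juliette arrive", ["julie"]))
print(contains_one_of_words("Julie arrive !", ["julie"]))
print(contains_one_of_words("on va turn up ce soir", ["turn up"]))
```

The trade-off is that stems no longer match inflected forms, so each entity's word list would need to enumerate the variants it should cover.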

### Converting the dict to a DataFrame to view the results

In [12]:
# convert the dictionary to a list of rows: [entity, negatif, positif, diff, pourcent]
entities_with_sentiment_list = []
for key in entities_with_sentiment:
    temp = [key, *entities_with_sentiment[key]]
    entities_with_sentiment_list.append(temp)

# convert the list to a DataFrame, then sort
df_senti = pd.DataFrame(entities_with_sentiment_list,columns=["entity","negatif","positif","diff","pourcent"])

df_senti = df_senti.sort_values(by=["diff"],ascending=False)
# drop entities matched by 2 messages or fewer
index = df_senti[df_senti["positif"]+df_senti["negatif"] <= 2].index
df_senti = df_senti.drop(index)
df_senti

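| | entity | negatif | positif | diff | pourcent |
|---|---|---|---|---|---|
| 11 | julie | 39 | 138 | 99 | 78 |
| 39 | soirée | 19 | 56 | 37 | 75 |
| 12 | juliette | 13 | 43 | 30 | 77 |
| 36 | wec | 17 | 45 | 28 | 73 |
| 3 | albane | 4 | 29 | 25 | 88 |
| ... | ... | ... | ... | ... | ... |
| 104 | chat | 3 | 1 | -2 | 25 |
| 53 | orange | 5 | 3 | -2 | 38 |
| 17 | kermess | 4 | 1 | -3 | 20 |
| 88 | gala | 7 | 4 | -3 | 36 |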


### Displaying the results

In [13]:
msg_avec_entitees = len(messages_keep) - n_sans_entity
print(f"{msg_avec_entitees}   messages containing an entity (~{round(100*msg_avec_entitees/len(messages_keep))}%)")
print(f"{len(df_senti)}    entities appearing at least 3 times in the corpus")
# totals over all entities, including those filtered out of df_senti
neg = sum(counts[0] for counts in entities_with_sentiment.values())
pos = sum(counts[1] for counts in entities_with_sentiment.values())
print(f"{pos}   (positive message, entity) associations")
print(f"{neg}    (negative message, entity) associations")

1986   messages containing an entity (~28%)
123    entities appearing at least 3 times in the corpus
1841   (positive message, entity) associations
894    (negative message, entity) associations


In [33]:
def plot_result(df,column):
    # each row prints: negative count, positive count, then diff (column 3) or percentage (column 4)
    if column == 3:
        print(f"{' '*16}Entities  -   +  diff")
    else:
        print(f"{' '*16}Entities  -   +   %")
    print("-"*120)
    n = 1
    if(column==3):
        df = df.sort_values(by=["diff"],ascending=False)
    else:
        df = df.sort_values(by=["pourcent"],ascending=False)
    array_senti = df.to_numpy()
    maxdiff = array_senti[0][column]
    for row in array_senti:
        spaces1 = 24-len(row[0])
        nb_char = int(row[column]*70/maxdiff)
        if(column == 4):
            nb_char-=35
        nb_moins = nb_plus = 0
        if nb_char < 0:
            nb_moins = -nb_char
        else:
            nb_plus = nb_char
        spaces2 = 10-nb_moins
        if column == 4:
            spaces2 += 18
        counts = [f"{' '*(3-len(str(row[i])))}{row[i]}" for i in [1,2,column]]
        print(f"{n}{' '*(spaces1-len(str(n)))}{row[0]} {counts[0]} {counts[1]} {counts[2]} {' '*spaces2}{'-'*nb_moins}|{'+'*nb_plus}")
        n+=1
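
The core of `plot_result` is the bar scaling: each value is scaled so that the largest one fills 70 characters, negative values are drawn with `-` to the left of the `|` axis, and positive ones with `+` to the right. For the percentage column the function also subtracts 35, which places the axis at half of the largest percentage. A stripped-down sketch of the scaling alone, with a hypothetical `bar` helper and without the column alignment, could look like this:

```python
def bar(value, max_value, width=70):
    """Render a signed value as a text bar around a '|' axis."""
    n = int(value * width / max_value)  # truncates toward zero, as in plot_result
    if n < 0:
        return "-" * (-n) + "|"
    return "|" + "+" * n

print(bar(99, 99))   # largest value fills the full width
print(bar(37, 99))
print(bar(-3, 99))
```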

In [37]:
plot_result(df_senti,3)

                Entities  -   +  diff
------------------------------------------------------------------------------------------------------------------------
1                  julie  39 138  99           |++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2                 soirée  19  56  37           |++++++++++++++++++++++++++
3               juliette  13  43  30           |+++++++++++++++++++++
4                    wec  17  45  28           |+++++++++++++++++++
5                 albane   4  29  25           |+++++++++++++++++
6                 julien  22  47  25           |+++++++++++++++++
7                  simon   8  31  23           |++++++++++++++++
8                   paul   8  29  21           |++++++++++++++
9                  élise   7  28  21           |++++++++++++++
10               antoine  15  36  21           |++++++++++++++
11                google   8  28  20           |++++++++++++++
12                projet   6  26  20           |++++++++++++

In [38]:
index = df_senti[df_senti["positif"]+df_senti["negatif"] <= 6].index
plot_result(df_senti.drop(index),4)
# percentage of positive messages among all messages that mention the entity

                Entities  -   +   %
------------------------------------------------------------------------------------------------------------------------
1                     ia   1  11  92                             |+++++++++++++++++++++++++++++++++++
2                 tiktok   1   8  89                             |++++++++++++++++++++++++++++++++
3                   hugo   1   7  88                             |+++++++++++++++++++++++++++++++
4                turn up   1   7  88                             |+++++++++++++++++++++++++++++++
5                 albane   4  29  88                             |+++++++++++++++++++++++++++++++
6                 lydia    2  14  88                             |+++++++++++++++++++++++++++++++
7                 ricard   1   7  88                             |+++++++++++++++++++++++++++++++
8                   rhum   1   7  88                             |+++++++++++++++++++++++++++++++
9                    pii   2  12  86                  