## École Polytechnique de Montréal
## Département Génie Informatique et Génie Logiciel

## INF8460 – Traitement automatique de la langue naturelle - TP1

## Objectifs d'apprentissage: 

•	Savoir accéder à un corpus, le nettoyer et effectuer divers pré-traitements sur les données
•	Savoir effectuer une classification automatique des textes pour l’analyse de sentiments
•	Evaluer l’impact des pré-traitements sur les résultats obtenus


## Équipe et contributions 
Veuillez indiquer la contribution effective de chaque membre de l'équipe en pourcentage et en indiquant les modules ou questions sur lesquelles chaque membre a travaillé


Nom Étudiant 1: Luu Thien-Kim (1834378) 33.33%

Nom Étudiant 2: Mellouk Souhaila (1835144) 33.33%

Nom Étudiant 3: Younes Mourad (1832387) 33.33%

Nous avons tous travaillé ensemble sur chaque question

## Librairies externes

In [442]:
import os
import pandas as pd
from typing import List, Literal, Tuple

## Valeurs globales

In [443]:
data_path = "data"
output_path = "output"

## Données

In [444]:
def read_data(path: str) -> Tuple[List[str], List[bool], List[Literal["M", "W"]]]:
    data = pd.read_csv(path)
    inputs = data["response_text"].tolist()
    labels = (data["sentiment"] == "Positive").tolist()
    gender = data["op_gender"].tolist()
    return inputs, labels, gender

In [445]:
train_data = read_data(os.path.join(data_path, "train.csv"))
test_data = read_data(os.path.join(data_path, "test.csv"))

train_data = ([text.lower() for text in train_data[0]], train_data[1], train_data[2])
test_data = ([text.lower() for text in test_data[0]], test_data[1], test_data[2])

## 1. Pré-traitement et Exploration des données

### Lecture et prétraitement

Dans cette section, vous devez compléter la fonction preprocess_corpus qui doit être appelée sur les fichiers train.csv et test.csv. La fonction preprocess_corpus appellera les différentes fonctions créées ci-dessous. Les différents fichiers de sortie doivent se retrouver dans le répertoire output.  Chacune des sous-questions suivantes devraient être une ou plusieurs fonctions.

In [446]:
train_path = os.path.join(data_path, "train.csv")
test_path = os.path.join(data_path, "test.csv")

train_phrases_path = os.path.join(output_path, "train_phrases.csv")
test_phrases_path = os.path.join(output_path, "test_phrases.csv")

#### 1) Segmentez chaque corpus en phrases, et stockez-les dans un fichier `nomcorpus`_phrases.csv (une phrase par ligne)

In [447]:
import nltk
nltk.download("punkt") 
nltk.download("wordnet")
import csv

def segmentSentences(path) :
    data = read_data(path)
    corpus = data[0]
    if not os.path.isdir(output_path) :
        try:
            os.mkdir(output_path)
        except OSError:
            print ("Creation of the directory %s failed" % path)
        else:
            print ("Successfully created the directory %s " % path)
    newFilePath = output_path + '/' + os.path.splitext(os.path.basename(path))[0] + "_phrases.csv"
    file = open(newFilePath, "w")
    with open(newFilePath, "w") as f: 
        f.write("response_text" + ',' + "sentiment" + ',' + "op_gender" +'\n')
        for i in range(len(corpus)) :
            sentences = nltk.sent_tokenize(corpus[i])
            for sentence in sentences:
                sentence = sentence.replace('"', '""').replace('"', '""')
                f.write('"'+ sentence +'"' + ',' + '"' +str(data[1][i])+ '"'+ ',' + '"'+data[2][i] + '"\n')
                
    return newFilePath


[nltk_data] Downloading package punkt to /Users/kimluu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/kimluu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [448]:
segmentSentences(train_path)
segmentSentences(test_path)

'output/test_phrases.csv'

#### 2) Normalisez chaque corpus au moyen d’expressions régulières en annotant les négations avec _Neg L’annotation de la négation doit ajouter un suffixe _NEG à chaque mot qui apparait entre une négation et un signe de ponctuation qui identifie une clause. Exemple : 
No one enjoys it.  no one_NEG enjoys_NEG it_NEG .

I don’t think I will enjoy it, but I might.  i don’t think_NEG i_NEG will_NEG enjoy_NEG it_NEG, but i might.

In [449]:
def getPath(path) :
    if "train" in path :
        path = train_path
    elif "test" in path :
        path = test_path
        
    return path

In [450]:
import re

def normalize(path) :
    with open(path, "r") as f :
        data = list(f)
    
    newFilePath = output_path + '/' + os.path.splitext(os.path.basename(getPath(path)))[0] + "_negation.csv"
    file = open(newFilePath, "w")
    with open(newFilePath, "w") as f:
        for sentence in data:
            match = re.sub("(?i)(?<=not |n't | no )(.*?[,.(?!;]+)", lambda m: m.group(1).replace(" ", "_NEG ")
                           .replace(".", "_NEG.").replace(",", "_NEG,").replace("?", "_NEG?").replace("!", "_NEG!")
                           .replace("(", "_NEG(").replace(";", "_NEG;"), sentence)
            f.write(match)
            
    return newFilePath
            

In [451]:
normalize(train_phrases_path)
normalize(test_phrases_path)

'output/test_negation.csv'

#### 3) Segmentez chaque phrase en mots (tokenisation) et stockez-les dans un fichier `nomcorpus`_mots.csv. (Une phrase par ligne, chaque token séparé par un espace, il n’est pas nécessaire de stocker la phrase non segmentée ici) ;

In [452]:
def tokenize(path) :
    sentences = []
    
#     with open(path, "r") as f :
#         data = list(f)
    data = read_data(path)
    corpus = data[0]        
#     print(data[1])
    newFilePath = output_path + '/' + os.path.splitext(os.path.basename(getPath(path)))[0] + "_mots.csv"
    file = open(newFilePath, "w")
    with open(newFilePath, "w") as f: 
        f.write("response_text" + ',' + "sentiment" + ',' + "op_gender" +'\n')
        for i in range(len(corpus)) :
            listTokens = nltk.word_tokenize(corpus[i])
            tokens = ' '.join(listTokens)
            print(tokens)
            tokens = sentence.replace('"', '""').replace('"', '')
            f.write('"' + tokens + '"' + ',' + '"' + str(data[1][i])+ ',' + data[2][i] + '"\n')
                
    return newFilePath

In [453]:
# tokenize(train_phrases_path)
tokenize(test_phrases_path)

# train_mots_path = os.path.join(output_path, "train_mots.csv")
# test_mots_path = os.path.join(output_path, "test_mots.csv")

  res_values = method(rvalues)


Thanks back !
Yep , University of Alberta .
You live around here ?
please do n't sell my land Steve
just shaking my head at the ignorance and deliberate ignoring of the facts about FDR , Pearl Harbor , and WWII .
To be contemplated during your tri , perhaps ?
Pshh ... Is that how you treat my props .. Just go around deleting them ? ! ? !
Sureeeeeeeeeeeeeeeeeeee I see how it is .
: pYeah there 's definitely still some bugs around here .
My workout from last night posted with today 's date on it .
lol
Thanks !
I also love bacon .
: )
Hello Isaac !
My copy arrived yesterday in France , I 'm so happy and very excited to read it ! ! !
XD
We need to keep Bob Menendez in congress .
It 's really an excellent lecture , I believe in her.so , fake it till you make it !
I 'm human according to most of the questions .
And his tone was great , I had a lot of fun : )
You 're both awesome ! !
!
B Mattek is awesome , I love her bad girl risque style .
Brit : I 'm so glad u r on Fox tonight .
You calm p

Mind blowing conclusion .
Glad we could reach it together .
Yeah , like the twenty billion he gave to the unions .
Yeah , thats really investing in his reelection and thats all he is doing .
unbelievable - courageous inspiring .
loved it ... am sending this to my daughter who aspires to be an artist ...
Of course , they probably wrote It for you .
Palin can see Russia from her porch and you can see the South China Sea from yours .
Birds of a feather flock together ! ! !
!
Looking forward to your leadership in the Senate , Tim !
( 0_0 ) who me ? ?
Thank you , lovely !
Nice to have you in my feed !
Good luck , and I still support your war on outrageous gasoline prices .
There is my dedicated sister in law !
books include endless imagination .
so we tend to book when our dream dont take place .
because they understand better than human .
lisa bu has done in such .
If I could just take this to work and make all my whole company watch it ... ( sigh ... ) I feel vulnerable now ...
I need a c

With restricted movements and a hundred cables plugged into her brain ? ?
He 's understimating our inteligence .
Yes ! ! ! !
Knock em out ! !
!
WAIT but can we take a minute to look at Jared 's hair ?
!
This deal needs to be a NO-GO .
The only folks who will benefit will be the company executives .
Yea like everyone leaving world of warcraft after the shitty release .
The entire service was sooo beautiful .
< 3
You should encourage the Senators who gathered for an all nighter on climate change to put on tshirts and shorts and go out on the Capitol steps to bask in that weather : )
Congrats on your success Danny
This is just mind blowing and I am ever so grateful to Louie Schwartzberg for allowing me this perspective !
Thanks for the same and the props .
Is that tengwar in your picture ?
Lots of uncertain factors in this , but nevertheless very interesting .
Oh man , if could talk Chapelle into a movie or a show the world would be a better place .
3 of you with you & Yakov as well , wou

'output/test_mots.csv'

#### 4) Lemmatisez les mots et stockez les lemmes dans un fichier `nomcorpus`_lemmes.csv (une phrase par ligne, les lemmes séparés par un espace) ;

In [454]:
def lemmatize(path) :
    with open(path, "r") as f :
        data = list(f)
        
    newFilePath = output_path + '/' + os.path.splitext(os.path.basename(getPath(path)))[0] + "_lemmes.csv"
    lemmzer = nltk.WordNetLemmatizer()
    
    file = open(newFilePath, "w")
    with open(newFilePath, "w") as f: 
        for sentences in data :
            lemmes = [lemmzer.lemmatize(token) for token in sentences.split()]
            sentences = ' '.join(lemmes)
            f.write(sentences+'\n')
                
    return newFilePath

In [455]:
lemmatize(train_mots_path)
lemmatize(test_mots_path)

'output/test_lemmes.csv'

#### 5) Retrouvez la racine des mots (stemming) en utilisant nltk.PorterStemmer(). Stockez-les dans un fichier `nomcorpus`_stems.csv (une phrase par ligne, les racines séparées par une espace) ;

In [456]:
def stemmize(path) :    
    with open(path, "r") as f :
        reader = csv.reader(f)
        data = list(reader)
        
    path = getPath(path)
    newFilePath = output_path + '/' + os.path.splitext(os.path.basename(path))[0] + "_stems.csv"
    
    stemmer = nltk.PorterStemmer()
    
    file = open(newFilePath, "w")
    with open(newFilePath, "w") as f: 
        for sentences in data :
            for sentence in sentences :
                stems = [stemmer.stem(token) for token in sentence.split()]
                sentences = ' '.join(stems)
                f.write('"' + sentences + '"\n')
                
    return newFilePath
    

In [457]:
stemmize(train_mots_path)
stemmize(test_mots_path)

'output/test_stems.csv'

#### 6) Ecrivez une fonction qui supprime les mots outils (stopwords) du corpus. Vous devez utiliser la liste de stopwords de NLTK ;

In [458]:
nltk.download("stopwords")
from nltk.corpus import stopwords
stopwords.words("english")

def deleteStopWords(path) :
    with open(path, "r") as f :
        reader = csv.reader(f)
        data = list(reader)
        
    path = getPath(path)
    newFilePath = output_path + '/' + os.path.splitext(os.path.basename(path))[0] + "_stopWords.csv"
    stopwords_english = set(stopwords.words("english"))
    output = []
    
    file = open(newFilePath, "w")
    with open(newFilePath, "w") as f: 
        for sentences in data :
            for sentence in sentences :
                newSentence = [token for token in nltk.word_tokenize(sentence) if token not in stopwords_english]
                sentences = ' '.join(newSentence)
                output.append(sentences)
                f.write('"' + sentences + '"\n')
                
    return output
                
#enlever la création de nouveaux fichiers
    

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kimluu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [459]:
deleteStopWords(train_mots_path)
deleteStopWords(test_mots_path)

['response_text',
 'sentiment',
 'op_gender',
 '.',
 'False , W',
 '.',
 'False , W',
 '.',
 'False , W',
 '.',
 'False , M',
 '.',
 'False , W',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , W',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , W',
 '.',
 'False , W',
 '.',
 'False , W',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , W',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , W',
 '.',
 'False , W',
 '.',
 'False , W',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , M',
 '.',
 'False , W',
 '.',
 'False , M',
 '.',
 'False 

#### 7) Écrivez une fonction preprocess_corpus(corpus) qui prend un corpus brut stocké dans un fichier.csv, effectue les étapes précédentes, puis stocke le résultat de ces différentes opérations dans un fichier corpus _norm.csv

In [460]:
def preprocess_corpus(input_file: str, output_file: str) :
    #to do : vérifier si c'est bien le résultat voulu
    results = deleteStopWords(stemmize(lemmatize(tokenize(normalize(segmentSentences(input_file))))))
    file = open(output_file, "w")
    with open(output_file, "w") as f:
        for element in results :
            r = element.replace('"', '""').replace('"', '')
            f.write('"' + r + '"\n')
            
    

In [461]:
preprocess_corpus(
   os.path.join(data_path, "train.csv"), os.path.join(output_path, "train_norm.csv")
)
preprocess_corpus(
   os.path.join(data_path, "test.csv"), os.path.join(output_path, "test_norm.csv")
)

I do n't think_NEG any_NEG one_NEG there_NEG has_NEG EBOLA_NEG Bob_NEG Latta_NEG You_NEG should_NEG be_NEG back_NEG in_NEG Washington_NEG actually_NEG getting_NEG something_NEG done_NEG there_NEG on_NEG the_NEG House_NEG floor_NEG .
; - ) ... anything other than jeans and t-shirts are superfluous , by the way .
'Update your wardrobe ' ... pfft .
Meh , I could only get to 8 .
Need to work up .
A bill consisting of a single sentence .
Very well done , sir .
So far , so good .
Thx !
My buddy Jeff Johnson was your prop master on that .
she had me at everlasting youth .
Congratulations to you for a well deserved recognition !
baffoon , idiot , dumb .
The intelegent conversation continues .......
They do n't deserve_NEG that_NEG honour_NEG ( stupid Hollywood movie business people )
Yawn !
Is this honestly news ?
Same to you brotha !
Get at it : )
Would good to know how the age of all of these things have been measured .
the perfect society is shaped in the form of a pyramid , the old at the 

I just got a phone call about a town hall meeting ...
SO COOL ... you 've got my vote ! ! ! ! ! ! ! ! ! !
BIG THANKS ! ! ! !
!
Liz Diller is to function what Frank Gehry is to form .
Fascinating !
Thanks , it is great !
Hey Adam , thanks for wasting congressional time , and my hard-earned tax dollars on your salary by voting in lock-step with crazy Issa and your GOP cohorts on their biased , witch-hunt to hold Eric Holder in contempt of Congress .
SHAME ON YOU !
It was wonderful to see the Bi-partisan staff set-up in Cordova .
The staffers were very informative , respectful , and helpful .
Thank you for taking the time to visit us .
Yeah , destiny beta is pretty good .
Cant wait till full game comes out .
Dont like the warlock much but love the Titan .
happy your state is prospering with such a dedicated team to work with .
Hehe thank you = ) I appreciate the compliment !
Mornings like this , I feel like i 've been run over .
Oh - and the prop bomb was well deserved .
: -D
: - ) It 's 

I think music makes everybady happy .
Thank you for your talk .
Akitakaorita
< 3 peace and love and props and smiles : )
I support this initiative and have some childhood testimonies to share .
I think its more Brazils gape and Germanys girth .
Commented !
I am impressed with how low you 're getting - far too many people do n't get_NEG anywhere_NEG close_NEG .
Maybe this kind of speech is a little conscious , but it can be really a psychological support .
cheers : ) I 'm getting stronger at them , both back and front .
Great work , Stacy Arteaga and the team !
Wish I could 've helped this past weekend .
also , I think hes actually going to be good .
Thank you Robin for being a voice for humanity ! !
These are frightening times however I know you are standing up for all people !
Wish I coulda been there B !
!
Way cool , I love that guy !
Talk fast ... Ed does .
Give 'em hell ! !
Go Jeff !
Federal employees dont get paid as contractors do .
Do not compare_NEG apples_NEG to_NEG oranges_NE

Looking forward to cheering your workouts !
: )
> Also when did r/conspiracy get a surge of highly statist boot lickers ?
And thats * exactly * what this motherfucker is .
Sheesh .
I just ... shake my f head at this ...
That looks awesome .
Congrats , and enjoy the day !
Seems like you guys are on track for a wonderful time .
Look another old joke .
he made some good points !
some of the things he said are what we sometimes do consistently and I agree with him .
Did not know_NEG about_NEG the_NEG Southern_NEG California_NEG Chinese_NEG Relay_NEG for_NEG Life_NEG .
Will look into entering for next year .
Yes ! ...
It 's a must , Enrique Capriles need support from USA and the world , to show dictators we are United for Justice , Peace and Freedom ! ...
I agree , voting No as_NEG well_NEG !
Shame on Charlie Baker with his Yes on 2 commercial !
I 'll take ANY Biz-Govt .
Relationship , Eric !
At 5:00am in Canada , a good cup of coffee with fresh news from CNN .
Love the New Day trio keep wo

Fight the good fight , as you often do .
I love this talk ... !
Yeah , I love most of their stuff .
I think what he does is really cool .
I dont listen to it too much but when it pops up I wont usually hit next .
This is amazing things that I love most about TED .
Egory Mullins you are not being_NEG polite_NEG here_NEG .
I recognize those courts !
Good luck tmrw !
I would like to thank Mr Kelly for unknowingly helping us present the idea of namesets at http : //www.youtube.com/watch ? v=Ejkxu67jtE4
Your job iss on the line , Honey !
We thank you for your support and your efforts to keep our community safe !
Looks like a great day to be on the slopes !
Had a SANTANA tik for you .. but never answered .
Calling it quits .
: - (
Do n't follow_NEG me_NEG though_NEG .
You do n't know_NEG me_NEG .
Interesting .
Thanks !
Thanks Congresswoman for your pursuit of the CROOKS !
This kind of stuff continues to this day !
That 's cool .
I want exercise to make me smarter !
Stunning .
The epitome of 

Its unbelievable that such an important diagnosis is done with such simple methods , a child can their entire life altered just because someone made a mistake !
I think you 're so funny person your speech was so interested I 'm insterested in educational psychology , so i learn that subject in university .
And learned many things through your speech Your speech makes me to decide to act more powerful
Happy Holidays to you and your family .
Thank you for everything !
Awkwafina you 're my hero
Thank you lady frient : D
always look forward to a new koontz book , always excitting ! ! !
!
It 's a FOUNDATION to build on .
Thank you for all your hard work Congresswoman DeGette .
How does this organization constitute welfare Greg Robertson ?
Its was started after the blizzard by local people who wanted to help .
Maybe you should read before you comment .
Man my corner sucks .
All the Johns have little willies ...
You got my vote this am ... but you prob knew that .
Best of luck
First thing tha

intresting and inspiring
Yep , since about December 2011 .
It 's great for ideas and encouragement !
Keep up the good work .
Do n't let_NEG them_NEG get_NEG you_NEG down_NEG .
Oh , wait , I remember the last electon .
Stand Tall !
Im thnkfull dat God protectd me nd ma famly nd he has blessd us , may he continue blessing nd healin de fmly nd de world ....
Well , I am very honored you feel so honored
Common sense , Common sense , Common sense ...
Please purchase some .. ! !
America should not hand_NEG shake_NEG with_NEG any_NEG country_NEG that_NEG for_NEG decades_NEG have_NEG been_NEG chanting_NEG publicly_NEG `` '' death_NEG to_NEG America_NEG and_NEG Israel '' '' _NEG._NEG._NEG._NEG .
If it was n't for_NEG the_NEG tasty_NEG flavors_NEG , I would probably still be smoking , and be in bad health like I was before I started vaping .
My doctor is amazed with how much I have improved .
This is like cutting off your nose to spite your face .
Congratulations to the children and to you Jody .

Personally Im very glad that GOT is quite magic-light .
The story is interesting itself and doesnt need a lot of suprernatural occurences imo .
The most inspiring talk I 've watched ! !
Its a very inspiring talk and its has inspired me .
3 keys to a great career is find your passion , pursue it , and do n't let_NEG fear_NEG hold_NEG you_NEG back_NEG .
Amazing !
I guess we 'll see in 45 years how accurate it is .
Maybe they 'll do a remake with Brad being made to look younger as the movie goes on .
No but the fact that this gets 70k upvotes because of a misunderstood tradition from a different culture is dumb
No problem Chick : ) xxx
YEAH U FU8IN BLOW ! ! !
! u call urself a PRO ! !
! BACK TO THE STATES U WAN NA BNE ! ! ! ! !
! u suck ! ! ! !
! u G * * K ..... cant stand players like u that ruin the tour ! ! ! !
: ////////
Hm , I thought they had everything on here for fitness !
That 's a bummer !
Yes please do more debates with him .
You need to bring him out and let him talk .
We do n

Mean while the Bay of Fundy is still not generating_NEG any_NEG kind_NEG of_NEG ( green ) power that would create permanent jobs and profits for Atlantic Canada and Canada as a whole .
good stuff Chuck !
lem me know when you go hiking next !
I may want to join you .
: )
i live in his home town and the only reason he keeps getting re-elected is because he is connected to the old families in this town he has never done much for the regular people that live here .
That 's more than enough !
Lol
try standing up to pee first before you try to stand for anything
Great job , well done .
We would lose our country so this would be a moot point .
Thank you for voting for the Dicks amendment to eliminate the extinction rider from the Interior spending bill .
So good to see you back and playing tennis .
Thanks !
You rock too !
Keep up the great workouts and have a lovely day : )
I want Chris Rock to be the last colonel sanders for the kfc commercials .
I laugh every time I see it in my head ...
He

damn wish i was there
yeah , i need to get back to exercising regularly .
I 've been slacking this semester with school and attempting to spend my free time with Jen .
I 'll try to keep this up .
Do you know if there is an app for this ?
Child pornography is legal now ?
Obama got his behind handed back to him yesterday by the leaders of the emerging Republican Majority .
As good as I expected ...
I applaud you Monica !
You are claiming your story and using your voice for good .
To the Ted moderators I respectfully ask that the team review the moderation on this thread - the misogynistic comments that are getting through here are not helpful_NEG to_NEG discussion_NEG
I think more interesting then the talk was the photographers concept of placing the dog in the pictures .
I wonder if there is a write up on him stating his reasons for coming up with the concept of this photo shoot
It doesnt change my hate for them .
! ! !
NOW ! ! !
https : //twitter.com/onlynomaly/status/82138839563734220

Yes the Pic is nice too
Yes !
Agreed with Stefan : )
The Rebublican Party is in it 's death throes ..... https : //www.facebook.com/photo.php ? fbid=578405182255738 & set=a.279728135456779.59808.273864989376427 & type=1 & relevant_count=1
Idiots .
If world really really want to protect the innocents of Gaza , then do something about Hamas extremists .
Does the policy allow for strikes within the US ?
How do I know if I 'm on the list ?
Can I turn myself in prior to being blown up ?
Haha yeah .
Last time thru I hardly worked out , and when I did I kept forgetting to track , lol !
I thought the visual satisfaction of 'seeing ' my progress would help this time !
Have a lovely Sunday , Foxy !
Hope you get some laughs , some rest , and lots of hugs .
Here 's one to start : * * hug ! !
* *
Will do !
Great speech last night .
There is another reason for wanting to know what consciousness is and that is to know at what point we can say that someone has indeed died .
Happy birthday and good luc

I did , you guys are mocking the fact that CNN called out a racist , and thinking thats bad
Congratulations on your sweet baby girl .
What a joy she will be to your family !
!
sure !
thanks for the follow
we continue to enjoy your living sculpture here in Phoenix .
Thank you for your vision and creative insights .
What are you currently working on ?
I 'm so happy for you , Sloane , Jamie , and Serena !
Excellent talk , highly informative , showing solid research , including a critical evaluation of his own work .
Very important .
did that already .
you got mine .
pay off the national debt , balance the budget and restore the strength of our dollar .
if you do n't do_NEG the_NEG above_NEG , NOTHING else matters .
No worries : ) thanks for the FB !
Stopped it to soon .
Congrats on taking the challenge .
thank you !
your workouts are insane .
One day I hope to be as strong as you with those 200lb deadlifts !
Merry Christmas and a Happy New Year to you and your family from all of us .
I pr

25-30 years ago , Milt was a rock and major force in the SD Legislature .
His words should be considered as he is a wise and thoughtful man .
Thank you John , you do listen .
Passion is the first thing !
Absolutely !
Make sure u keep an eye on my grand daughters or I will send a zombie over there .. lol ... Thank you u Ms.Rebecca Lander
Be careful what you wish for Mr. Monreal .
Congressman Coffman is all for cutting jobs , COLAs , and benefits for Federal Government employees and retirees .
you are always there for me , and I am often owing you
You are an arrogant , dangerous man .
Just because the people had no choice_NEG and_NEG you_NEG ran_NEG uncontested_NEG , do n't think_NEG that_NEG people_NEG chose_NEG you_NEG and_NEG like_NEG you_NEG .
2018 will hopefully be different .
You 're a punk that needs to be taken down .
No one wants someone like you representing IL .
China needs more women with her courage , character and strength to take a change-we share many similarities !
Misse

Like The Great One , Mark Levin , I will say what I mean and mean what I say .
Congratulations Congressman Elect Richmond !
I am pulling for you on your journey to Washinton D.C .
These two are awesome
Me too .
But my children are on vacation for three weeks , so I now have all the time I could ask for .
No excuses now .
; )
I stand corrected !
; )
I receive it in Jesus name !
Bitsie , you have just got to come back , Nick needs you and so do we ! ! !
Besides , what would the Blazer games be without you .
I have a permanent crush on Alyson Hannigan and Im not usually_NEG into_NEG redheads_NEG .
Nowhere in the sidebar is that said , and this submission also violates rule 7 .
You either dont understand hypothetical questions or are just being a jerk .
GREAT ! ....
Thanks .....
Good for you and continue full support for Social Security & Medicare .
Thank-you for the props and the follow ...
Good luck , and I still support your war on outrageous gasoline prices .
I am in the bay area ! !
Y

Rep. Brenda L. So no one_NEG can_NEG question_NEG him_NEG about_NEG Flint's_NEG crisis_NEG ?
Will he never have to answer questions regarding this travesty ?
I am in your corner Tammy Duckworth .
Mark Kirk has got to .
How in hell did Mark Kirk steal my name Lorenza Kirk ?
Take care of Oddie !
Love those story lines !
Thank you : ) have a good day !
Very well said ! , , But I see a plus side to all of this too ..
! !
Thanks to technology I could listen to her great speech ! ..
And yes - we must know the limits to which we let technology control our lives and enrich ourselves with real world experiences too !
I shared this on my wall so my liberal friends in other areas would see how lucky we are to have you working in Washington .
Happy 4th , Rep. Schweikert !
With your terrific sense of humour you do n't need_NEG the_NEG led_NEG shoes_NEG to_NEG light_NEG up_NEG the_NEG dance_NEG floor_NEG -_NEG or_NEG anything_NEG else_NEG , for that matter !
Hope you put it in your breast pocket and

When will you finish Ride The Storm ? ? ? ?
We 're all waiting for Christopher Snow ! ! !
!
Compromise ! !
Stop blaming each other and try something that works !
!
Already did , I know who is good fot us !
Just leveled up !
Back in the game ! !
!
Respected sir , I am an kashmiri pandit from jammu and Kashmir just want to say a big thanks to u for ur efforts for introducing resolution on kashmiri pandits in us house of representatives
Definately i 'll be tuned .
Gd luck .
I do n't see_NEG your_NEG name_NEG on_NEG the_NEG list_NEG of_NEG lawmakers_NEG who_NEG will_NEG NOT_NEG attend_NEG the_NEG inauguration_NEG .
How you can support this man is mind boggling .
What a disappointment !
Congratulations to you and Lucy .
It was amazing match !
I can SO relate to this speech , but is she from Scandinavia ?
She is talking abt other people 's doubt in you , jelousy and envy , typically Swedish !
I did n't think_NEG Americans_NEG would_NEG say_NEG that_NEG .
Congrats Elizabeth , this doming book

keep your spirit on .
Let 's support the WI platform !
Succession is always an option if we do n't like_NEG the_NEG `` '' commons '' '' _NEG accepted_NEG by_NEG the_NEG larger_NEG community_NEG .
Wonderful news .
See you Sunday .
> he tried to only for the cop to cut him off , Dumbass FTFY
whats with the BS quality ?
So now next is money as a right , an equal right .
Take it from those who have it , give to those who do n't , and we 're all equal , right ?
Profound Thought process .
It can be quite intimidating to know that We are spending lives in digital illusion '' '' .
We ARE the Big 12 Champs !
!
!
!
!
inject SQL steriods erry day
Education is the answer not fascism_NEG .
I hope you reconsider your own humanity .
LOVE your Cover Photo ! !
I 'm behind you all the way !
Yeah , what he said : )
I knnnoowwww -- I 'll change it today .
I know I do n't get_NEG much_NEG say_NEG in_NEG my_NEG nickname_NEG but_NEG as_NEG long_NEG as_NEG E-dub_NEG sounds_NEG badass_NEG then_NEG it_NEG works

My copy should be here in the mail today ! ! !
YA ! ! ! !
!
Please , I want humanitarian organization Thtweini the family and I we suffer from persecution is very Social Mama Mtaah of suffering from cancer
Thanks for following back : D , good luck on your goals !
Very interesting and informative .
I totally enjoyed listening to this .
It is time for Holder to be tried and jailed .
No problem ! !
Yours was quite the surprise too !
!
Oh my pleasure !
So just curious ... What is your pp from ?
Very powerful speech .
Amazing .
Aww , it 's not a_NEG problem_NEG in_NEG the_NEG least_NEG ! _NEG !
Did n't expect_NEG that_NEG bomb_NEG to_NEG go_NEG off_NEG there_NEG .
I 'm happy to see you back at it and putting up some solid workouts !
These personal one on one discussions are critical in terms of getting the VOTE OUT !
Looks very promising ,
sure could talk a lot , seemed like he was just selling a product not promoting_NEG some_NEG idea_NEG
What 's the point you 're trying to make here Tim ?

omg r u going to fly in the sky thats so cool
Thank goodness Bitsie !
Love you all and we grimmsters are a very loyal bunch !
Oh , thank you ! ! !
It 's been a while since I rocked the sparklekini so it 's a bit of a throwback .
: )
yea they are great !
do you have access to rubber weights ?
It doesnt bode well .
OO HELLO WAS WONDER YOU OK THE WEATHER WAS TO BAD YESTERDAY ON THE WAY TENNESSEE AMEN JMJ +++AMEN CIAO BLESS ALL NO WE_NEG NOT_NEG LET_NEG NO_NEG ONE_NEG DESTROY_NEG THE_NEG CROSS_NEG AMEN_NEG _NEG .
Mike kay most stubborn person ever .
Not just_NEG yes_NEG , BUT HELL YES ENOUGH IS ENOUGH ! !
!
kench u get enough of them ?
May I attempt to sum up this talk ?
`` `` acai- fraud ! ...
i do n't even_NEG know_NEG what_NEG that_NEG is '' '' _NEG
If you sir are related to the late Honorable Harold Love , sr. , I knew him well .
I worked with him .
He would be proud .
the frozen tundra is where your socks really feel at home ... and no photobomb_NEG by_NEG Mr_NEG Rodgers_NEG._NEG._NEG

I will be watching # OscarBuzz
You realize it is your party and biggest supporters who have decimated our troop preparedness .
Oh of course !
Thanks for the follow and sorry for the delay on my part !
I 'm assuming due to workout and not fun_NEG : P_NEG I_NEG understand_NEG these_NEG feels_NEG , though -- my legs were still sore yesterday from doing zercher squats on Sunday ...
thank you , breakthrough moment for me , several ones .
We 're a big fan of yours too Eric .
We look forward to working with you to strengthen the U.S.-Israel relationship ... Jeremy Garelick and Samantha Rifkin Garelick
congratulation isner : )
Dont worry , Im sure there will be a spin-off collection announced sometime in the near fututre .
Reach , ODST , Wars , and Spartan Assault .
Happy Birthday josh , have a good one with lots of whisky
https : //www.youtube.com/watch ? v=M_Hh2hFuLpg ... you bused in people as seat fillers and have no idea_NEG what_NEG they_NEG were_NEG supporting_NEG we_NEG have_NEG proof_

Thanks back !
Yep , University of Alberta .
You live around here ?
please do n't sell_NEG my_NEG land_NEG Steve_NEG
just shaking my head at the ignorance and deliberate ignoring of the facts about FDR , Pearl Harbor , and WWII .
To be contemplated during your tri , perhaps ?
Pshh ... Is that how you treat my props .. Just go around deleting them ? ! ? !
Sureeeeeeeeeeeeeeeeeeee I see how it is .
: pYeah there 's definitely still some bugs around here .
My workout from last night posted with today 's date on it .
lol
Thanks !
I also love bacon .
: )
Hello Isaac !
My copy arrived yesterday in France , I 'm so happy and very excited to read it ! ! !
XD
We need to keep Bob Menendez in congress .
It 's really an excellent lecture , I believe in her.so , fake it till you make it !
I 'm human according to most of the questions .
And his tone was great , I had a lot of fun : )
You 're both awesome ! !
!
B Mattek is awesome , I love her bad girl risque style .
Brit : I 'm so glad u r on Fox toni

The ending to that talk was a disgrace .
Yes kiddies - DARPA makes weapons .
I have the exact same IVs on my Eirika , so shes always the team cheerleader .
Not that_NEG Im_NEG complaining_NEG , shes basically babysat every unit Ive ever had through training tower .
I wish I could be there .
I have loved teaching Warm Bodies and reading A New Hunger .
I have pre-ordered Burning World .
Best wishes .
Thank you Representative Correa !
Our community needs this .
We appreciate your leadership .
They actually do have the prettiest song in the world .
Well ... maybe Im a little biased .
: )
Haaaaaaaaaaaaaiiiiiiiiiiiiiii ! ! ...
Pfffffff why you no go_NEG on_NEG a_NEG ninja_NEG date_NEG wid_NEG me_NEG !
delightfully boring .
IT is rapidly progessing .
Save yourself 8 minutes .
Yeah Moraby is where it completely clicked for me .
needless to say , even if i saw it coming it was still fantastic .
Quite a pic !
Very cool !
: D
BTW ... you have 2 pts til level 2 .... Awesome !
Hurry , track a calf 

### Exploration des données

#### 1)

Complétez les fonctions retournant les informations suivantes (une fonction par information, chaque fonction prenant en argument un corpus composé d'une liste de phrases segmentées en tokens(tokenization)) ou une liste de genres et une liste de sentiments:

In [462]:
corpus = [['soso.', 'kim a acheté un mbp13 silver!', 'mourad a acheté un mbp16 spacegrey']]
corpus = [['soso', 'a', 'acheté', 'un', 'mbp16', 'silver', '.', 'Je', 'ne', 'suis', 'pas', 'daccord!'],
          ['kim', 'a', 'acheté', 'un', 'mpb13', 'silver','.'], 
          ['mourad', 'a', 'acheté', 'un', 'mbp16', 'spacegrey', '.', 'Quil', 'aime', 'beaucoup']]
corpus = [["I", "do", "not", "agree_NEG", "with_NEG", "this_NEG!", "I", "prefer", "the", "other", "option"],
           ["I", "really", "like", "that", "new", "mbp16", "Mourad", "made", "a", "good", "choice"],
           ["I", "don't", "think_NEG", "that_NEG.", "I", "prefer", "the", "mbp13"]]

In [463]:
#Return if the corpus is tokenized or not
def isTokenized(corpus) :
    for sentences in corpus :
        for sentence in corpus :
            for l in sentence :
                if ' ' in l :
                    return 0
    return 1

In [464]:
#Return the corpus as a list of documents, that are not tokenizated, but segmented in sentences
def getListOfSentences(corpus):
    listOfDocs = []
    listOfTokens = []
    if isTokenized(corpus) :
        for sentences in corpus :
            for token in sentences :
                listOfTokens.append(token)
            s = ' '.join(listOfTokens)
            s = nltk.sent_tokenize(s)
            listOfDocs.append(s)
            listOfTokens = []
    return listOfDocs

In [465]:
getListOfSentences(corpus)

[['I do not agree_NEG with_NEG this_NEG!', 'I prefer the other option'],
 ['I really like that new mbp16 Mourad made a good choice'],
 ["I don't think_NEG that_NEG.", 'I prefer the mbp13']]

##### a. Le nombre total de tokens (mots non distincts)

In [466]:
def getNumberOfTokens(corpus):
    corpus = getListOfSentences(corpus)
    count = 0
    for sentences in corpus :
        for sentence in sentences :
            count = count + len(nltk.word_tokenize(sentence))
    return count

In [467]:
getNumberOfTokens(corpus)

33

##### b. Le nombre total de types

In [468]:
def getNumberOfTypes(corpus):
    corpus = getListOfSentences(corpus)
    listOfTokens = []
    for sentences in corpus :
        for sentence in sentences :
            tokenList = nltk.word_tokenize(sentence)
            for token in tokenList :
                listOfTokens.append(token)
    listOfTypes = list(dict.fromkeys(listOfTokens))  
    return len(listOfTypes)

In [469]:
getNumberOfTypes(corpus)

26

##### c. Le nombre total de phrases avec négation

In [470]:
def getNumberOfNeg(corpus) :
    corpus = getListOfSentences(corpus)
    numberOfNegativeSentences = 0;
    for sentences in corpus:
        for sentence in sentences:
            if "_NEG" in sentence:
                numberOfNegativeSentences = numberOfNegativeSentences + 1
    return numberOfNegativeSentences

In [471]:
getNumberOfNeg(corpus)

2

##### d. Le ratio token/type

In [472]:
def getRatioTokenType(corpus):
    return float(getNumberOfTokens(corpus)/getNumberOfTypes(corpus))

In [473]:
getRatioTokenType(corpus)

1.2692307692307692

##### e. Le nombre total de lemmes distincts

In [474]:
import nltk
def getLemmesNumber(corpus):
    corpus = getListOfSentences(corpus)
    lemmzer = nltk.WordNetLemmatizer()
    lemmesList = []
    for sentences in corpus :
        for sentence in sentences :
            lemmes = [lemmzer.lemmatize(token) for token in sentence.split()]
            for lemme in lemmes :  
                lemmesList.append(lemme)
    
    lemmesList = list(dict.fromkeys(lemmesList))   
    return len(lemmesList)

In [475]:
getLemmesNumber(corpus)

24

##### f. Le nombre total de racines (stems) distinctes

In [476]:
import nltk
def getStemsNumber(corpus):
    corpus = getListOfSentences(corpus)
    stemmer = nltk.PorterStemmer()
    stemsList = []
    for sentences in corpus :
        for sentence in sentences :
            stems = [stemmer.stem(token) for token in sentence.split()]
            for stem in stems :
                stemsList.append(stem)
    stemsList = list(dict.fromkeys(stemsList))     
    return len(stemsList)

In [477]:
getStemsNumber(corpus)

24

##### g. Le nombre total de documents (par classe)

In [478]:
def getNumberOfDocPerClass(sentiments):
    countPositive = 0
    countNegative = 0
    for sentiment in sentiments : 
        if sentiment : #positif
            countPositive = countPositive + 1
        else : #negatif
            countNegative = countNegative + 1
    return countPositive, countNegative
        
        

In [479]:
semtiments = [0,
          0,
          0,
          0,
          0, #5 negatif
          1,
          1,
          1,
          1,
          1,
          1] #6 positif

getNumberOfDocPerClass(semtiments)

(6, 5)

##### h. Le nombre total de phrases (par classe)

In [480]:
def getNumberOfSentencesPerClass(corpus, sentiments) :
    corpus = getListOfSentences(corpus)
    countSentencesPositives = 0
    countSentencesNegatives = 0
    for i in range(len(corpus)):
        if sentiments[i] : #positive
            countSentencesPositives = countSentencesPositives + len(corpus[i])
        else : #negative
            countSentencesNegatives = countSentencesNegatives + len(corpus[i])          
    return countSentencesPositives, countSentencesNegatives   

In [481]:
sentiments = [0,1,1]
getNumberOfSentencesPerClass(corpus, sentiments)

(3, 2)

##### i. Le nombre total de phrases avec négation (par classe)

In [482]:
def getNumberOfNegativeSentences(corpus, sentiments) :
    corpus = getListOfSentences(corpus)
    countNegativeSentencesPositives = 0
    countNegativeSentencesNegatives = 0
    i = 0;
    for sentences in corpus:
        for sentence in sentences:
            if sentiments[i]: #positive
                if "_NEG" in sentence:
                    countNegativeSentencesPositives = countNegativeSentencesPositives + 1
            else: #negative
                if "_NEG" in sentence:
                    countNegativeSentencesNegatives = countNegativeSentencesNegatives + 1
            i = i + 1
    return countNegativeSentencesPositives, countNegativeSentencesNegatives

In [483]:
sentiments = [0,1,1,1,1]
getNumberOfNegativeSentences(corpus, sentiments)

(1, 1)

##### j. Le pourcentage de réponses positives par genre de la personne à qui cette réponse est faite (op_gender)

In [484]:
genders = [['M', 'M', 'W', 'W', 'M', 'M', 'W', 'W', 'M', 'M', 'W', 'W']]
sentiments = [1 , 0, 1, 0, 1 , 0, 1,
             0, 0, 1,  0, 0]

In [485]:
def getPourcentageOfPositiveReponsesPerGender(genders, sentiments):
    countPosM = countPosW = 0
    iterator = 0
    totalResponse = len(sentiments)
  
    for sentiment in sentiments:
        if sentiment :
            if genders[0][iterator] == 'M':
                countPosM = countPosM + 1
            elif genders[0][iterator] == 'W':
                countPosW = countPosW + 1
        iterator = iterator + 1
  
    pourcentageW = float(countPosW / totalResponse) * 100
    pourcentageM = float(countPosM / totalResponse) * 100
  
    return(pourcentageM,pourcentageW)       

In [486]:
getPourcentageOfPositiveReponsesPerGender(genders, sentiments)

(25.0, 16.666666666666664)

#### 2) Écrivez la fonction explore(corpus, sentiments, genders) qui calcule et affiche toutes ces informations, précédées d'une légende reprenant l’énoncé de chaque question (a,b, ….j).

In [487]:
def explore(
    corpus: List[List[str]], sentiments: List[bool], genders: List[Literal["M", "W"]]
) -> None:
    print("Le nombre total de tokens (mots non distincts) : " + getNumberOfTokens(corpus) + "\n")
    print("Le nombre total de types : " + getNumberOfTypes(corpus) + "\n")
    print("Le nombre total de phrases avec négation : " + getNumberOfNeg(corpus)  + "\n")
    print("Le ratio token/type : " + getRatioTokenType(corpus) + "\n")
    print("Le nombre total de lemmes distincts : " + getLemmesNumber(corpus) + "\n")
    print("Le nombre total de racines (stems) distinctes : " + getStemsNumber(corpus) + "\n")
    print("Le nombre total de documents (par classe) : " + getNumberOfDocPerClass(semtiments) + "\n")
    print("Le nombre total de phrases (par classe) : " + getNumberOfSentencesPerClass(corpus, sentiments) + "\n")
    print("Le nombre total de phrases avec négation (par classe) : " + getNumberOfNegativeSentences(corpus, sentiments)  + "\n")
    print("Le pourcentage de réponses positives par genre de la personne à qui cette réponse est faite (op_gender) : " 
          + getPourcentageOfPositiveReponsesPerGender(sentiments, genders)  + "\n")

    

#### 3) Calculer une table de fréquence (lemme, rang (le mot le plus fréquent a le rang 1 etc.) ; fréquence (le nombre de fois où il a été vu dans le corpus).  Seuls les N mots les plus fréquents du vocabulaire (N est un paramètre) doivent être gardés. Vous devez stocker les 1000 premières lignes de cette table dans un fichier nommé table_freq.csv

In [488]:
def calculateFrequences(corpus) :
   
    corpus = getListOfSentences(corpus)
    lemmzer = nltk.WordNetLemmatizer()
    lemmesList = []
    sorted_dict = {}
   
    for sentences in corpus :
        for sentence in sentences :
            lemmes = [lemmzer.lemmatize(token) for token in sentence.split()]
            for lemme in lemmes :
                lemmesList.append(lemme)
               
    for word in lemmesList:
        if word not in sorted_dict:
            sorted_dict[word] = 0
        sorted_dict[word] += 1
    words = sorted_dict.items()
    sorted_lemme = sorted(words, key= lambda kv: kv[1], reverse=True)
    return sorted_lemme

In [489]:
calculateFrequences(corpus)

[('I', 5),
 ('prefer', 2),
 ('the', 2),
 ('do', 1),
 ('not', 1),
 ('agree_NEG', 1),
 ('with_NEG', 1),
 ('this_NEG!', 1),
 ('other', 1),
 ('option', 1),
 ('really', 1),
 ('like', 1),
 ('that', 1),
 ('new', 1),
 ('mbp16', 1),
 ('Mourad', 1),
 ('made', 1),
 ('a', 1),
 ('good', 1),
 ('choice', 1),
 ("don't", 1),
 ('think_NEG', 1),
 ('that_NEG.', 1),
 ('mbp13', 1)]

## 2. Classification automatique

### a) Classification  automatique avec un modèle sac de mots (unigrammes), Naive Bayes et la régression logistique

En utilisant la librairie scikitLearn et l’algorithme Multinomial Naive Bayes et Logistic Regression, effectuez la classification des textes avec un modèle sac de mots unigramme pondéré avec TF-IDF.  Vous devez entrainer chaque modèle sur l’ensemble d’entrainement et le construire à partir de votre fichier corpus_train.csv. 

Construisez et sauvegardez votre modèle sac de mots avec les données d’entrainement en testant les pré-traitements suivants (séparément et en combinaison): tokenisation, lemmatisation, stemming, normalisation des négations, et suppression des mots outils. Vous ne devez garder que la combinaison d’opérations qui vous donne les meilleures performances sur le corpus de test. Indiquez dans un commentaire les pré-traitements qui vous amènent à votre meilleure performance (voir la section 3 – évaluation). Il est possible que la combinaison optimale ne soit pas la même selon que vous utilisiez la régression logistique ou Naive Bayes. On s’attend à avoir deux modèles optimaux, un pour Naive Bayes, et un avec régression logistique.

In [490]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

In [491]:
def getDataSet(train_csv, test_csv):
    trainPath = os.path.join(output_path, train_csv)
    testPath = os.path.join(output_path, test_csv)
    
    trainData_ = []
    testData_ = []
    
    with open(trainPath, "r") as f :
        trainData = list(f)
     
    for data in trainData :
        data = data.replace('"', '').replace('\n', '')
        trainData_.append(data)
    
    with open(testPath, "r") as f :
        testData = list(f)
        
    for data in testData :
        data = data.replace('"', '').replace('\n', '')
        testData_.append(data)
#     print(testData_)
    
#     trainData = pd.read_csv(trainPath)
#     testData = pd.read_csv(testPath)
#     print(testData[0])
    
#     trainInputs = trainData
#     testInputs = testData
    
    training_data = (trainData_, train_data[1], train_data[2])
    testing_data = (testData_, test_data[1], test_data[2])
    return training_data, testing_data

In [492]:
training_data, testing_data = getDataSet("train_phrases.csv", "test_phrases.csv" )
# print(training_data, testing_data)

### Naive Bayes

In [493]:
def naiveBayes(train_data, test_data):
    vectorizer = TfidfVectorizer()    
    vectors = vectorizer.fit_transform(train_data[0])
    clf = MultinomialNB(alpha=0.5)
    clf.fit(vectors, train_data[1])
    
    vectors_test = vectorizer.transform(test_data)
    y_pred = clf.predict(vectors_test)
    return y_pred

### Régression Logistique

In [494]:
def logisticsRegression(train_data, test_data):
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(train_data[0])
    model = LogisticRegression(C=1.0)
    model.fit(vectors, train_data[1])
    
    vectors_test = vectorizer.transform(test_data[0])
    y_pred = model.predict(vectors_test)
    return y_pred

###  b) Autre représentation pour l’analyse de sentiments et classification automatique

On vous propose maintenant d’utiliser une nouvelle représentation de chaque document à classifier.
Vous devez créer à partir de votre corpus la table suivante :

| Vocabulaire | Freq-positive | Freq-négative |
|-------------|---------------|---------------|
| happy | 10 | 1 |
| ... | ... | ... |

Où :

• Vocabulaire représente tous les types (mots uniques) de votre corpus d’entrainement

• Freq-positive : représente la somme des fréquences du mot dans tous les documents de la classe positive

• Freq-négative : représente la somme des fréquences du mot dans tous les documents de la classe négative

Notez qu’en Python, vous pouvez créer un dictionnaire associant à tout (mot, classe) une fréquence.
Ensuite il vous suffit de représenter chaque document par un vecteur à 3 dimensions dont le premier élément représente un biais (initialisé à 1), le deuxième élément représente la somme des fréquences positives (freq-pos) de tous les mots uniques (types) du document et enfin le troisième élément représente la somme des fréquences négative (freq-neg) de tous les mots uniques du document. 

En utilisant cette représentation ainsi que les pré-traitements suggérés, trouvez le meilleur modèle possible en testant la régression logistique et Naive Bayes. Vous ne devez fournir que le code de votre meilleur modèle dans votre notebook.

In [528]:
import pandas as pd
corpus = test_data

dictionnary = {}
listOfTokens = []

for i in range(len(corpus[0])) :
    tokenList = nltk.word_tokenize(corpus[0][i])
    for token in tokenList :
        listOfTokens.append(token)
        key = (token, corpus[1][i])
        if key in dictionnary : #if the key already exists in dic, increment frequency
            dictionnary[key] = dictionnary[key] + 1
        else : #if not, create key and initiate frequency to 1
            dictionnary.update({key : 1})
    listOfTokens = list(dict.fromkeys(listOfTokens)) #list of distinct tokens

data = []
for token in listOfTokens :
    freq_pos = 0
    freq_neg = 0
    if (token, True) not in dictionnary and (token, False) not in dictionnary :
        freq_pos = 0
        freq_neg = 0
    elif (token, True) in dictionnary and (token, False) in dictionnary :
        freq_pos = dictionnary[token, True]
        freq_neg = dictionnary[token, False]
    elif (token, True) in dictionnary and (token, False) not in dictionnary :
        freq_pos = dictionnary[token, True]
        freq_neg = 0
    elif (token, True) not in dictionnary and (token, False) in dictionnary :
        freq_neg = dictionnary[token, False]
        freq_pos = 0
        
        
    data.append([token, freq_pos, freq_neg])    
    
pd.DataFrame(data, columns=["Vocabulaire", "Freq-positive", "Freq-négative"])

Unnamed: 0,Vocabulaire,Freq-positive,Freq-négative
0,thanks,95,2
1,back,42,6
2,!,776,105
3,yep,0,0
4,",",355,176
...,...,...,...
3963,msnbc,0,0
3964,tingles,0,0
3965,ship,0,0
3966,fiber,0,0


## 3. Évaluation et discussion

#### a) Pour déterminer la performance de vos modèles, vous devez tester vos modèles de classification sur l’ensemble de test et générer vos résultats pour chaque modèle dans une table avec les métriques suivantes : Accuracy et pour chaque classe, la précision, le rappel et le F1 score. On doit voir cette table générée dans votre notebook avec la liste de vos modèles de la section 2 et leurs performances respectives. 

In [None]:
y_pred_bayes = naiveBayes(training_data, testing_data)
# print(classification_report(test_data[1], y_pred_bayes))

In [None]:
y_pred_regression = logisticsRegression(train_data, test_data)
print(classification_report(test_data[1], y_pred_regression))

#### b) Générez un graphique qui représente la performance moyenne (mean accuracy – 10 Fold cross-validation) de vos différents modèles par tranches de 500 textes sur l’ensemble d’entrainement.

#### c) Que se passe-t-il lorsque le paramètre de régularisation de la régression logisque (C) est augmenté ?

## 4. Analyse et discussion

#### a) En considérant les deux types de représentations, répondez aux question suivantes en reportant la question dans le notebook et en inscrivant votre réponse:

#### b) Quel est l’impact de l’annotation de la négation ?

#### c) La suppression des stopwords est-elle une bonne idée pour l’analyse de sentiments ?

#### d) Le stemming et/ou la lemmatisation sont-ils souhaitables dans le cadre de l’analyse de sentiments ?

## 5. Contribution

Complétez la section en haut du notebook indiquant la contribution de chaque membre de l’équipe en indiquant ce qui a été effectué par chaque membre et le pourcentage d’effort du membre dans le TP. 