<font color='red'>
    
# NLTK
<font color='black'> 
<n>

<font color='blue'>
## 1 - Tokenization
<n>
<font color='blue'>

### Tokenize = split :
- par mot : word_tokenize()
- par phrase : sent_tokenize()

In [1]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [2]:
#nltk.download()

In [3]:
example_text = "Hello Mr. Smith, how are you doing today? The weather is great and Python is awesome. The sky is pinkish-blue, you should not eat cardboard"

In [4]:
print(sent_tokenize(example_text))

['Hello Mr. Smith, how are you doing today?', 'The weather is great and Python is awesome.', 'The sky is pinkish-blue, you should not eat cardboard']


In [5]:
print(word_tokenize(example_text))

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', ',', 'you', 'should', 'not', 'eat', 'cardboard']


In [6]:
for i in sent_tokenize(example_text):
    print(i)

Hello Mr. Smith, how are you doing today?
The weather is great and Python is awesome.
The sky is pinkish-blue, you should not eat cardboard


In [7]:
for i in word_tokenize(example_text):
    print(i)

Hello
Mr.
Smith
,
how
are
you
doing
today
?
The
weather
is
great
and
Python
is
awesome
.
The
sky
is
pinkish-blue
,
you
should
not
eat
cardboard


<font color='blue'>
## 2 - Stop words
<n>
<font color='blue'>

In [8]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [9]:
stop_words = set(stopwords.words("english"))
print(stop_words)

{'again', 'own', 'll', 'these', 'only', "that'll", 'itself', 'that', 'they', 'am', 'against', 'up', "needn't", 'be', 'because', 'too', 'who', 'were', 'between', 'more', 'y', 'an', 'couldn', 'if', 'don', 'did', 'o', 'while', 'm', 'myself', 'to', 'or', "aren't", 'then', 'was', 'from', 'is', 'once', 'my', 'just', 'ours', 'me', 'he', 'can', 'theirs', 'does', "mustn't", 'for', "hasn't", 'needn', 'before', "doesn't", 'her', 'weren', 'all', 'nor', 'which', 'where', 'themselves', 'same', 'after', 'there', 'his', 'ain', "shouldn't", "couldn't", 'herself', 'this', 'mustn', "you'd", 'very', 'any', 'yours', 'some', 'mightn', 'off', 'so', 'in', 'being', 'not', 'it', "you've", 'further', "it's", 're', "you'll", 'other', 'isn', 't', 'under', 'hasn', "isn't", 'are', "won't", 'won', 'until', 'out', "should've", "hadn't", 'when', "wasn't", 'such', 'what', 'them', 'by', 'down', 'here', 'd', 'had', 'haven', 'having', 'through', 'should', 'at', 'with', 'him', "mightn't", 'a', 'doesn', 'you', "weren't", 'be

In [10]:
example_sentence = "This is an example showing off stop word filtration"

In [11]:
words = word_tokenize(example_sentence)

filtered_sentence = []

for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
    
print(filtered_sentence)

['This', 'example', 'showing', 'stop', 'word', 'filtration']


In [12]:
filtered_sentence = [w for w in words if not w in stop_words]
print(filtered_sentence)

['This', 'example', 'showing', 'stop', 'word', 'filtration']


<font color='blue'>
## 3 - Stemming
<n>
<font color='blue'>

### Stemming = "racinisation" ou "désuffixation" :
- On va chercher la racine du mot en supprimant les suffixes ou préfixes. La racine ainsi obtenue n'existe pas forcément comme mot du vocabulaire (en ce sens, racine $\ne$ lemme) => obtention d'une forme tronquée du mot, commune à toutes les variantes morphologiques.
- Un des algorithmes les plus connus est l'algorithme de Porter. 

L'algorithme de Porter se compose d'une cinquantaine de règles de racinisation/désuffixation classées en sept phases successives (traitement des pluriels et verbes à la troisième personne du singulier, traitement du passé et du progressif,...). Les mots à analyser passent par tous les stades et, dans le cas où plusieurs règles pourraient leur être appliquées, c'est toujours celle comprenant le suffixe le plus long qui est choisie.

In [13]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [14]:
ps = PorterStemmer()

example_words = ['python','pythoner','pythoning','pythoned','pythonly']

for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli


In [15]:
new_text = 'it is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once'

In [16]:
words = word_tokenize(new_text)
print(words)
for w in words:
    print(ps.stem(w))

['it', 'is', 'very', 'important', 'to', 'be', 'pythonly', 'while', 'you', 'are', 'pythoning', 'with', 'python', '.', 'All', 'pythoners', 'have', 'pythoned', 'poorly', 'at', 'least', 'once']
it
is
veri
import
to
be
pythonli
while
you
are
python
with
python
.
all
python
have
python
poorli
at
least
onc


<font color='blue'>
## 4 - Speech tagging
<n>
<font color='blue'>

### POS tagging (part-of-speech tagging) = Etiquetage morpho-syntaxique ou grammatical
- Identification et marquage des informations grammaticales correspondants à chaque mot : noms, verbes, adjectifs, adverbe etc...
- 2 types de POS tagger : 
    - Rule-based POS tagger
    - Stochastic POS tagger

$Rule-based \; POS \;tagger$ :
- analyse des caractéristiques linguistiques du mot,
- analyse du mot précédent (i.e précédent = un article -> dès lors, le mot a une plus forte probabilité d'être un nom),
- analyse du mot suivant.
            
$Stochastic \; POS \;tagger$ :
- fréquence d'occurence d'un tag pour un mot en particulier (peut donner des résultats abérrants),
- n-gram approach : vise à obtenir une vraissemblance à partir d'une séquence donnée. Lorsque la séquence sont des mots : n-gram = shingles. Cette modélisation correspond à un modèle de Markov d'ordre n, où seules les n dernières observations sont utilisées pour la prédiction de la suivante.
- Hidden Markov model (HMM) : combine les 2 approches précédentes.

$Rappels :$

Propriété de Markov : $\mathbb{P}\left(q_1, q_2, ..., q_n\right) = \prod_{i=1}^n\mathbb{P}\left(q_i  \, \mid \,q_{i-1}\right)$

La distribution d'une variable aléatoire dans l'avenir dépend uniquement de sa distribution dans  l'état actuel, et aucun des états précédents n'a d'impact sur les états futurs (processus sans mémoire).

Dans HMM : les __états__ sont cachés (ex endormi/reveillé) = POS tags. Les __observations__ (ex : bruits/calme) = mots dans une séquence donnée. 
- Probabilité de transistion i.e $\mathbb{P}(VP\mid NP)$ (le mot a comme tag 'VP' sanchant que le mot précédent était un nom).
- Probabilité d'émission i.e $\mathbb{P}(John\mid NP)$ (le mot est John sachant que le tag est un nom).

$Exemples\;de\;POS\;tag\;:$
- Nom : **NN** - *nom pluriel* : **NNS** - *nom propre* : **NNP** ...
- Verbe : **VB** - *verbe conjugué au passé* : **VBD** - *participe présent* : **VBG** - *participe passé* : **VBN** - *participe présent* : **VBP** ...
- Adjectif : **JJ - *adjectif comparatif* : JJR - *superlatif* : JJS
- Adverb : **RB** - *adverbe comparatif* : **RBR** (i.e better) - *superlatif* : **RBS** (i.e best)

In [17]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

In [18]:
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
            
    except Exception as e:
        print(str(e))
        
process_content()

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nat

[('In', 'IN'), ('a', 'DT'), ('dynamic', 'JJ'), ('world', 'NN'), ('economy', 'NN'), (',', ','), ('we', 'PRP'), ('are', 'VBP'), ('seeing', 'VBG'), ('new', 'JJ'), ('competitors', 'NNS'), (',', ','), ('like', 'IN'), ('China', 'NNP'), ('and', 'CC'), ('India', 'NNP'), (',', ','), ('and', 'CC'), ('this', 'DT'), ('creates', 'VBZ'), ('uncertainty', 'NN'), (',', ','), ('which', 'WDT'), ('makes', 'VBZ'), ('it', 'PRP'), ('easier', 'JJR'), ('to', 'TO'), ('feed', 'VB'), ('people', 'NNS'), ("'s", 'POS'), ('fears', 'NNS'), ('.', '.')]
[('So', 'IN'), ('we', 'PRP'), ("'re", 'VBP'), ('seeing', 'VBG'), ('some', 'DT'), ('old', 'JJ'), ('temptations', 'NNS'), ('return', 'NN'), ('.', '.')]
[('Protectionists', 'NNS'), ('want', 'VBP'), ('to', 'TO'), ('escape', 'VB'), ('competition', 'NN'), (',', ','), ('pretending', 'VBG'), ('that', 'IN'), ('we', 'PRP'), ('can', 'MD'), ('keep', 'VB'), ('our', 'PRP$'), ('high', 'JJ'), ('standard', 'NN'), ('of', 'IN'), ('living', 'NN'), ('while', 'IN'), ('walling', 'VBG'), ('off'

[('We', 'PRP'), ('will', 'MD'), ('compete', 'VB'), ('and', 'CC'), ('excel', 'VB'), ('in', 'IN'), ('the', 'DT'), ('global', 'JJ'), ('economy', 'NN'), ('.', '.')]
[('We', 'PRP'), ('will', 'MD'), ('renew', 'VB'), ('the', 'DT'), ('defining', 'VBG'), ('moral', 'JJ'), ('commitments', 'NNS'), ('of', 'IN'), ('this', 'DT'), ('land', 'NN'), ('.', '.')]
[('And', 'CC'), ('so', 'RB'), ('we', 'PRP'), ('move', 'VBP'), ('forward', 'RB'), ('--', ':'), ('optimistic', 'JJ'), ('about', 'IN'), ('our', 'PRP$'), ('country', 'NN'), (',', ','), ('faithful', 'JJ'), ('to', 'TO'), ('its', 'PRP$'), ('cause', 'NN'), (',', ','), ('and', 'CC'), ('confident', 'NN'), ('of', 'IN'), ('the', 'DT'), ('victories', 'NNS'), ('to', 'TO'), ('come', 'VB'), ('.', '.')]
[('May', 'NNP'), ('God', 'NNP'), ('bless', 'NN'), ('America', 'NNP'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]


In [19]:
train_text

'PRESIDENT GEORGE W. BUSH\'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nFebruary 2, 2005\n\n\n9:10 P.M. EST \n\nTHE PRESIDENT: Mr. Speaker, Vice President Cheney, members of Congress, fellow citizens: \n\nAs a new Congress gathers, all of us in the elected branches of government share a great privilege: We\'ve been placed in office by the votes of the people we serve. And tonight that is a privilege we share with newly-elected leaders of Afghanistan, the Palestinian Territories, Ukraine, and a free and sovereign Iraq. (Applause.) \n\nTwo weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. \n\nTonight, with a healthy, growing economy, with more Americans going back to work, with our nation an active force for good in the world -- the state of our union is confident and strong. (Applause.

In [20]:
custom_sent_tokenizer

<nltk.tokenize.punkt.PunktSentenceTokenizer at 0x7f09256c5748>

In [21]:
tokenized

["PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all.",
 'Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream.',
 'Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King.',
 '(Applause.)',
 'President George W. Bush reacts to applause during his State of the Union Address at the Capitol, Tuesday, Jan.',
 '31, 2006.',
 "White House photo by Eric DraperEvery time I'm invited to this rostrum, I'm humbled by the privilege, and mindful of the history we've seen together.",
 'We have gathered under this Capitol dome in moments of national mourning and national ach

<font color='blue'>
## 5 - Chunking
<n>
<font color='blue'>

### "Morcellement" : grouper les mots en "morceaux" qui, ont l'espère, auront un sens.
- Pour cela, on va former des 'noun phrases' contenant un nom, quelques mots descriptifs, éventuellement un verbe, et peut-être un adverbe ou autre mot du genre => On va combiner les **POS tags avec des Regex**.

In [22]:
def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)
#            chunked.draw

    except Exception as e:
        print(str(e))
        
process_content()

(S
  (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
  'S/POS
  (Chunk ADDRESS/NNP)
  BEFORE/IN
  (Chunk A/NNP JOINT/NNP SESSION/NNP)
  OF/IN
  (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
  OF/IN
  (Chunk THE/NNP UNION/NNP January/NNP)
  31/CD
  ,/,
  2006/CD
  (Chunk THE/NNP PRESIDENT/NNP)
  :/:
  (Chunk Thank/NNP)
  you/PRP
  all/DT
  ./.)
(S
  (Chunk Mr./NNP Speaker/NNP)
  ,/,
  (Chunk Vice/NNP President/NNP Cheney/NNP)
  ,/,
  members/NNS
  of/IN
  (Chunk Congress/NNP)
  ,/,
  members/NNS
  of/IN
  the/DT
  (Chunk Supreme/NNP Court/NNP)
  and/CC
  diplomatic/JJ
  corps/NN
  ,/,
  distinguished/JJ
  guests/NNS
  ,/,
  and/CC
  fellow/JJ
  citizens/NNS
  :/:
  Today/VB
  our/PRP$
  nation/NN
  lost/VBD
  a/DT
  beloved/VBN
  ,/,
  graceful/JJ
  ,/,
  courageous/JJ
  woman/NN
  who/WP
  (Chunk called/VBD America/NNP)
  to/TO
  its/PRP$
  founding/NN
  ideals/NNS
  and/CC
  carried/VBD
  on/IN
  a/DT
  noble/JJ
  dream/NN
  ./.)
(S
  Tonight/NN
  we/PRP
  are/VBP
  comforted

(S (/( (Chunk Applause/NNP) ./. )/))
(S
  Our/PRP$
  coalition/NN
  has/VBZ
  learned/VBN
  from/IN
  our/PRP$
  experience/NN
  in/IN
  (Chunk Iraq/NNP)
  ./.)
(S
  We/PRP
  've/VBP
  adjusted/VBN
  our/PRP$
  military/JJ
  tactics/NNS
  and/CC
  changed/VBD
  our/PRP$
  approach/NN
  to/TO
  reconstruction/NN
  ./.)
(S
  Along/IN
  the/DT
  way/NN
  ,/,
  we/PRP
  have/VBP
  benefitted/VBN
  from/IN
  responsible/JJ
  criticism/NN
  and/CC
  counsel/NN
  offered/VBN
  by/IN
  members/NNS
  of/IN
  (Chunk Congress/NNP)
  of/IN
  both/DT
  parties/NNS
  ./.)
(S
  In/IN
  the/DT
  coming/VBG
  year/NN
  ,/,
  I/PRP
  will/MD
  continue/VB
  to/TO
  reach/VB
  out/RP
  and/CC
  seek/VB
  your/PRP$
  good/JJ
  advice/NN
  ./.)
(S
  Yet/RB
  ,/,
  there/EX
  is/VBZ
  a/DT
  difference/NN
  between/IN
  responsible/JJ
  criticism/NN
  that/WDT
  aims/VBZ
  for/IN
  success/NN
  ,/,
  and/CC
  defeatism/NN
  that/WDT
  refuses/VBZ
  to/TO
  acknowledge/VB
  anything/NN
  but/CC
  failure/NN


(S (/( (Chunk Applause/NNP) ./. )/))
(S
  And/CC
  every/DT
  year/NN
  we/PRP
  fail/VBP
  to/TO
  act/VB
  ,/,
  the/DT
  situation/NN
  gets/VBZ
  worse/JJR
  ./.)
(S
  So/RB
  tonight/JJ
  ,/,
  I/PRP
  ask/VBP
  you/PRP
  to/TO
  join/VB
  me/PRP
  in/IN
  creating/VBG
  a/DT
  commission/NN
  to/TO
  examine/VB
  the/DT
  full/JJ
  impact/NN
  of/IN
  baby/NN
  boom/NN
  retirements/NNS
  on/IN
  (Chunk Social/NNP Security/NNP)
  ,/,
  (Chunk Medicare/NNP)
  ,/,
  and/CC
  (Chunk Medicaid/NNP)
  ./.)
(S
  This/DT
  commission/NN
  should/MD
  include/VB
  members/NNS
  of/IN
  (Chunk Congress/NNP)
  of/IN
  both/DT
  parties/NNS
  ,/,
  and/CC
  offer/VBP
  bipartisan/JJ
  solutions/NNS
  ./.)
(S
  We/PRP
  need/VBP
  to/TO
  put/VB
  aside/RP
  partisan/JJ
  politics/NNS
  and/CC
  work/NN
  together/RB
  and/CC
  get/VB
  this/DT
  problem/NN
  solved/VBD
  ./.)
(S (/( (Chunk Applause/NNP) ./. )/))
(S
  (Chunk Keeping/VBG America/NNP)
  competitive/JJ
  requires/VBZ
  us/PRP
  

  ./.)
(S
  If/IN
  we/PRP
  ensure/VB
  that/IN
  (Chunk America/NNP)
  's/POS
  children/NNS
  succeed/VB
  in/IN
  life/NN
  ,/,
  they/PRP
  will/MD
  ensure/VB
  that/IN
  (Chunk America/NNP)
  succeeds/VBZ
  in/IN
  the/DT
  world/NN
  ./.)
(S (/( (Chunk Applause/NNP) ./. )/))
(S
  Preparing/VBG
  our/PRP$
  nation/NN
  to/TO
  compete/VB
  in/IN
  the/DT
  world/NN
  is/VBZ
  a/DT
  goal/NN
  that/IN
  all/DT
  of/IN
  us/PRP
  can/MD
  share/NN
  ./.)
(S
  I/PRP
  urge/VBP
  you/PRP
  to/TO
  support/VB
  the/DT
  American/JJ
  (Chunk Competitiveness/NNP Initiative/NNP)
  ,/,
  and/CC
  together/RB
  we/PRP
  will/MD
  show/VB
  the/DT
  world/NN
  what/WP
  the/DT
  American/JJ
  people/NNS
  can/MD
  achieve/VB
  ./.)
(S
  (Chunk America/NNP)
  is/VBZ
  a/DT
  great/JJ
  force/NN
  for/IN
  freedom/NN
  and/CC
  prosperity/NN
  ./.)
(S
  Yet/RB
  our/PRP$
  greatness/NN
  is/VBZ
  not/RB
  measured/VBN
  in/IN
  power/NN
  or/CC
  luxuries/NNS
  ,/,
  but/CC
  by/IN
  who/WP


<font color='blue'>
## 6 - Chinking
<n>
<font color='blue'>

### Filtre le chunk d'un chunk
- on enlève les mots indésirables d'un chunk (le chunk qu'on retire d'un chunk est un chink).
- On retire les verbes + préposition + déterminant

In [23]:
def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            chunkGram = r"""Chunk: {<.?>+}
                                    }<VB.?|IN|DT>+{"""
            
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)

            chunked.draw

    except Exception as e:
        print(str(e))
        
process_content()

<font color='blue'>
## 7 - Name Entity Recognition
<n>
<font color='blue'>

### => NER
- On cherche dans cette étape, à localiser et classer les entités nommées, en catégories prédéfinies telles que les noms de personnes, les pays, organisations, lieux, expressions de temps (dates, périodes), quantités (km, kg...), valeurs monétaires, pourcentages, etc...
- pretrained NE identifier : nltk.ne_chunk() -> binary = true : les NE ne sont pas caractérisés. .draw() pour les visualiser (arbres avec NE mis en relief).

In [24]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
#            namedEnt.draw()
    except Exception as e:
        print(str(e))


process_content()

<font color='blue'>
## 8 - Lemmatizing
<n>
<font color='blue'>

### Proche du stemming (racinisation).
- Mais à la différence du stemme, le lemme va former des mots existants (qu'on peut trouver dans le dictionnaire).
- lorsque que l'on procède à cette étaoe, on peut préciser le genre du mot lemmatisé (nom (par défaut), verbe, adjectif).

In [25]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))

print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("better", pos="n"))

print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

cat
cactus
goose
rock
python
good
better
run
run


<font color='blue'>
## 9 - Corpora
<n>
<font color='blue'>

Corpus sous NLTK :  http://www.nltk.org/nltk_data/

Accessing Text Corpora and Lexical Resources:
https://www.nltk.org/book/ch02.html

##### Sentiment Analysis:

- IMDB Movie Reviews – 50.000 annotated IMDB movie reviews
- Multi-Domain Sentiment Dataset – contains product reviews taken from Amazon.com from 4 product types (domains): Kitchen, Books, DVDs, and Electronics
- Opinion Lexicon – Curated list of positive/negative words – available in NLTK: nltk.corpus.opinion_lexicon
- UMICH SI650 – Sentiment Classification on Kaggle – Positive/Negative Sentiment annotated sentences
- Sanders Analytics Twitter Sentiment Corpus – 5513 hand-classified tweets
- SentiWordNet – Polarity annotated Wordnet Synsets – available in NLTK: nltk.corpus.sentiwordnet
- Movie Reviews – 2000 Sentiment annotated movie reviews – available in NLTK: nltk.corpus.movie_reviews
- Twitter Samples – Sentiment annotated tweets – nltk.corpus.twitter_samples
- Subjectivity Dataset – 5000 subjective and 5000 objective processed sentences – available in NLTK: nltk.corpus.subjectivity
- Opinion Dataset – Miscellaneous Opinion annotated datasets
- Twitter airline sentiment on Kaggle – What travelers expressed about their adventures with the airlines on Twitter in February 2015
- Amazon Fine food Reviews
- First GOP Debate Twitter Sentiment – Analyze tweets on the first 2016 GOP Presidential Debate

In [26]:
import nltk
from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer
from nltk.corpus import gutenberg

print(nltk.__file__, '\n\n\n')
path = str('/home/stephane/nltk_data'),
        
# sample text
sample = gutenberg.raw("melville-moby_dick.txt")

tok = sent_tokenize(sample)

for x in range(5):
    print(tok[x])

/home/stephane/anaconda3/lib/python3.6/site-packages/nltk/__init__.py 



[Moby Dick by Herman Melville 1851]


ETYMOLOGY.
(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.
He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.
He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.
"While you take in hand to school others, and to teach them by what
name a whale-fish is to be called in our tongue leaving out, through
ignorance, the letter H, which almost alone maketh the signification
of the word, you deliver that which is not true."


In [27]:
tok

['[Moby Dick by Herman Melville 1851]\r\n\r\n\r\nETYMOLOGY.',
 '(Supplied by a Late Consumptive Usher to a Grammar School)\r\n\r\nThe pale Usher--threadbare in coat, heart, body, and brain; I see him\r\nnow.',
 'He was ever dusting his old lexicons and grammars, with a queer\r\nhandkerchief, mockingly embellished with all the gay flags of all the\r\nknown nations of the world.',
 'He loved to dust his old grammars; it\r\nsomehow mildly reminded him of his mortality.',
 '"While you take in hand to school others, and to teach them by what\r\nname a whale-fish is to be called in our tongue leaving out, through\r\nignorance, the letter H, which almost alone maketh the signification\r\nof the word, you deliver that which is not true."',
 '--HACKLUYT\r\n\r\n"WHALE.',
 '... Sw. and Dan.',
 'HVAL.',
 'This animal is named from roundness\r\nor rolling; for in Dan.',
 'HVALT is arched or vaulted."',
 '--WEBSTER\'S\r\nDICTIONARY\r\n\r\n"WHALE.',
 '...',
 'It is more immediately from the Dut.',
 '

<font color='blue'>
## 10 - Wordnet
<n>
<font color='blue'>

### Base de donnée lexicale pour la langue anglaise 
- créée par Princeton et faisant partie du corpus NLTK.
-  définition, synonymes, antonymes, indices de similarité, lemme, exemples...

In [28]:
from nltk.corpus import wordnet

In [29]:
syns = wordnet.synsets("program")
syns

[Synset('plan.n.01'),
 Synset('program.n.02'),
 Synset('broadcast.n.02'),
 Synset('platform.n.02'),
 Synset('program.n.05'),
 Synset('course_of_study.n.01'),
 Synset('program.n.07'),
 Synset('program.n.08'),
 Synset('program.v.01'),
 Synset('program.v.02')]

In [30]:
print(syns[0].name())

plan.n.01


In [31]:
print(syns[0].lemmas())

[Lemma('plan.n.01.plan'), Lemma('plan.n.01.program'), Lemma('plan.n.01.programme')]


In [32]:
print(syns[0].lemmas()[0].name())

plan


In [33]:
print(syns[0].definition())

a series of steps to be carried out or goals to be accomplished


In [34]:
print(syns[0].examples())

['they drew up a six-step plan', 'they discussed plans for a new bond issue']


In [35]:
synonyms = []
antonyms = []

for syn in wordnet.synsets("good"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

{'practiced', 'respectable', 'sound', 'skilful', 'trade_good', 'in_effect', 'safe', 'expert', 'unspoilt', 'beneficial', 'thoroughly', 'proficient', 'salutary', 'soundly', 'ripe', 'unspoiled', 'full', 'dear', 'dependable', 'right', 'near', 'secure', 'goodness', 'undecomposed', 'skillful', 'serious', 'upright', 'honest', 'well', 'adept', 'effective', 'honorable', 'good', 'commodity', 'in_force', 'just', 'estimable'}
{'badness', 'evil', 'evilness', 'bad', 'ill'}


In [36]:
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('boat.n.01')
print(w1.wup_similarity(w2))

0.9090909090909091


In [37]:
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('car.n.01')
print(w1.wup_similarity(w2))

0.6956521739130435


In [38]:
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('cactus.n.01')
print(w1.wup_similarity(w2))

0.38095238095238093


<font color='blue'>
## 11 - Text Classification
<n>
<font color='blue'>

print(doc[1]) = texte tokenisé + label (i.e positif ou negatif)

In [39]:
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

print('\n', documents[1])

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print('\n', all_words.most_common(15))
print('\n', all_words["stupid"])


 (['the', 'u', '.', 's', '.', 'army', 'utilizes', 'a', 'number', 'of', 'books', 'known', 'as', 'field', 'manuals', 'which', 'stipulate', 'the', 'specific', 'way', 'in', 'which', 'almost', 'every', 'action', 'imaginable', 'must', 'be', 'done', '.', 'one', 'particular', 'field', 'manual', 'is', 'known', 'as', 'the', 'fm', '22', '-', '5', ',', 'which', 'among', 'other', 'things', ',', 'covers', 'the', 'practice', 'of', 'saluting', '.', 'under', 'the', '"', 'saluting', '"', 'section', 'is', 'a', 'sub', '-', 'section', 'which', 'covers', 'how', 'a', 'salute', 'is', 'rendered', 'by', 'a', 'military', 'work', 'detail', 'in', 'the', 'presence', 'of', 'a', 'superior', 'officer', '.', 'the', 'salute', 'is', 'rendered', 'by', 'the', 'highest', '-', 'ranking', 'individual', 'present', 'when', 'the', 'superior', 'officer', 'comes', 'within', 'six', 'paces', 'of', 'the', 'detail', ',', 'and', 'is', 'dropped', 'when', 'the', 'officer', 'passes', 'six', 'paces', 'from', 'the', 'detail', '.', 'in', 'a


 [(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]

 253


In [40]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

In [41]:
print(all_words.most_common(15))

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]


In [42]:
print(all_words["stupid"])

253


<font color='blue'>
## 12 - Words as Features for Learning 
<n>
<font color='blue'>

On prend chacun des mots dans les textes à analyser et on les tag en marquant leur présence dans les textes du corpus (déjà classifiés)

In [43]:
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

In [44]:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

In [45]:
print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))



In [46]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]

<font color='blue'>
## 13 - Naive Bayes Classifier
<n>
<font color='blue'>

Train/Test la classification.

In [47]:
# set that we'll train our classifier with
training_set = featuresets[:1900]

# set that we'll test against.
testing_set = featuresets[1900:]

In [48]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [49]:
print("Classifier accuracy percent:",(nltk.classify.accuracy(classifier, testing_set))*100)

Classifier accuracy percent: 76.0


In [50]:
classifier.show_most_informative_features(15)

Most Informative Features
                  regard = True              pos : neg    =     11.1 : 1.0
                   sucks = True              neg : pos    =     10.1 : 1.0
                  annual = True              pos : neg    =      9.1 : 1.0
                 frances = True              pos : neg    =      8.4 : 1.0
             silverstone = True              neg : pos    =      7.6 : 1.0
           unimaginative = True              neg : pos    =      7.6 : 1.0
              schumacher = True              neg : pos    =      7.4 : 1.0
               atrocious = True              neg : pos    =      7.0 : 1.0
                  shoddy = True              neg : pos    =      7.0 : 1.0
                 idiotic = True              neg : pos    =      7.0 : 1.0
                  suvari = True              neg : pos    =      7.0 : 1.0
                    mena = True              neg : pos    =      7.0 : 1.0
                 cunning = True              pos : neg    =      6.4 : 1.0

<font color='blue'>
## 14 - Pickle
<n>
<font color='blue'>

Permet de sauver dans un ficher n'importe quel obejet Python.
Ainsi, on peut sauvegarder notre classifieur préentrainé.

In [51]:
import pickle
### ecriture
save_classifier = open("naivebayes.pickle","wb")
pickle.dump(classifier, save_classifier) #ce que l'on charge, endroit où on le charge
save_classifier.close() #on referme le fichier

In [52]:
### lecture
classifier_f = open("naivebayes.pickle", "rb")
classifier = pickle.load(classifier_f)
classifier_f.close()

In [53]:
print('Naive Bayes Algo Accuracy percent:', (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

Naive Bayes Algo Accuracy percent: 76.0
Most Informative Features
                  regard = True              pos : neg    =     11.1 : 1.0
                   sucks = True              neg : pos    =     10.1 : 1.0
                  annual = True              pos : neg    =      9.1 : 1.0
                 frances = True              pos : neg    =      8.4 : 1.0
             silverstone = True              neg : pos    =      7.6 : 1.0
           unimaginative = True              neg : pos    =      7.6 : 1.0
              schumacher = True              neg : pos    =      7.4 : 1.0
               atrocious = True              neg : pos    =      7.0 : 1.0
                  shoddy = True              neg : pos    =      7.0 : 1.0
                 idiotic = True              neg : pos    =      7.0 : 1.0
                  suvari = True              neg : pos    =      7.0 : 1.0
                    mena = True              neg : pos    =      7.0 : 1.0
                 cunning = True   

<font color='blue'>
## 15 - Scikit-Learn Sklearn with NLTK
<n>
<font color='blue'>

- SET CLASS :  MNB_classifier = SklearnClassifier(MultinomialNB())
- TRAIN : MNB_classifier.train(training_set)
- ACCURACY : (nltk.claVC, LinearSVC, nuSCV

In [54]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

In [55]:
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MultinomialNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier(max_iter=5, tol=None))
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(training_set)
print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)

Original Naive Bayes Algo accuracy percent: 76.0
Most Informative Features
                  regard = True              pos : neg    =     11.1 : 1.0
                   sucks = True              neg : pos    =     10.1 : 1.0
                  annual = True              pos : neg    =      9.1 : 1.0
                 frances = True              pos : neg    =      8.4 : 1.0
             silverstone = True              neg : pos    =      7.6 : 1.0
           unimaginative = True              neg : pos    =      7.6 : 1.0
              schumacher = True              neg : pos    =      7.4 : 1.0
               atrocious = True              neg : pos    =      7.0 : 1.0
                  shoddy = True              neg : pos    =      7.0 : 1.0
                 idiotic = True              neg : pos    =      7.0 : 1.0
                  suvari = True              neg : pos    =      7.0 : 1.0
                    mena = True              neg : pos    =      7.0 : 1.0
                 cunning 

<font color='blue'>
## 16 - Combining algorithms
<n>
<font color='blue'>

In [56]:
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode


In [57]:
class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

#print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))

featuresets = [(find_features(rev), category) for (rev, category) in documents]
        
training_set = featuresets[:1900]
testing_set =  featuresets[1900:]

#classifier = nltk.NaiveBayesClassifier.train(training_set)

classifier_f = open("naivebayes.pickle","rb")
classifier = pickle.load(classifier_f)
classifier_f.close()


print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier(max_iter=5, tol=None))
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(training_set)
print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)


voted_classifier = VoteClassifier(classifier,
                                  NuSVC_classifier,
                                  LinearSVC_classifier,
                                  SGDClassifier_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)

print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier, testing_set))*100)

print("Classification:", voted_classifier.classify(testing_set[0][0]), "Confidence %:",voted_classifier.confidence(testing_set[0][0])*100)
print("Classification:", voted_classifier.classify(testing_set[1][0]), "Confidence %:",voted_classifier.confidence(testing_set[1][0])*100)
print("Classification:", voted_classifier.classify(testing_set[2][0]), "Confidence %:",voted_classifier.confidence(testing_set[2][0])*100)
print("Classification:", voted_classifier.classify(testing_set[3][0]), "Confidence %:",voted_classifier.confidence(testing_set[3][0])*100)
print("Classification:", voted_classifier.classify(testing_set[4][0]), "Confidence %:",voted_classifier.confidence(testing_set[4][0])*100)
print("Classification:", voted_classifier.classify(testing_set[5][0]), "Confidence %:",voted_classifier.confidence(testing_set[5][0])*100)


Original Naive Bayes Algo accuracy percent: 84.0
Most Informative Features
                  regard = True              pos : neg    =     11.1 : 1.0
                   sucks = True              neg : pos    =     10.1 : 1.0
                  annual = True              pos : neg    =      9.1 : 1.0
                 frances = True              pos : neg    =      8.4 : 1.0
             silverstone = True              neg : pos    =      7.6 : 1.0
           unimaginative = True              neg : pos    =      7.6 : 1.0
              schumacher = True              neg : pos    =      7.4 : 1.0
               atrocious = True              neg : pos    =      7.0 : 1.0
                  shoddy = True              neg : pos    =      7.0 : 1.0
                 idiotic = True              neg : pos    =      7.0 : 1.0
                  suvari = True              neg : pos    =      7.0 : 1.0
                    mena = True              neg : pos    =      7.0 : 1.0
                 cunning 

<font color='blue'>
## 17 - Improving Training Data for sentiment analysis
<n>
<font color='blue'>

In [1]:
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode

from nltk.tokenize import word_tokenize


class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf
        
short_pos = open("short_reviews/positive.txt","r", encoding='latin-1').read()
short_neg = open("short_reviews/negative.txt","r", encoding='latin-1').read()

documents = []

for r in short_pos.split('\n'):
    documents.append( (r, "pos") )

for r in short_neg.split('\n'):
    documents.append( (r, "neg") )


all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())

for w in short_neg_words:
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:5000]

def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

#print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))

featuresets = [(find_features(rev), category) for (rev, category) in documents]

random.shuffle(featuresets)

# positive data example:      
training_set = featuresets[:10000]
testing_set =  featuresets[10000:]

##
### negative data example:      
##training_set = featuresets[100:]
##testing_set =  featuresets[:100]


In [None]:


classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

##SVC_classifier = SklearnClassifier(SVC())
##SVC_classifier.train(training_set)
##print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)


voted_classifier = VoteClassifier(
                                  NuSVC_classifier,
                                  LinearSVC_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)

print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier, testing_set))*100)

Original Naive Bayes Algo accuracy percent: 69.87951807228916
Most Informative Features
              engrossing = True              pos : neg    =     18.4 : 1.0
               inventive = True              pos : neg    =     15.1 : 1.0
              refreshing = True              pos : neg    =     13.7 : 1.0
            refreshingly = True              pos : neg    =     13.1 : 1.0
                    warm = True              pos : neg    =     12.3 : 1.0
               wonderful = True              pos : neg    =     12.3 : 1.0
             mesmerizing = True              pos : neg    =     11.7 : 1.0
                provides = True              pos : neg    =     11.0 : 1.0
           extraordinary = True              pos : neg    =     11.0 : 1.0
                  beauty = True              pos : neg    =     10.6 : 1.0
                  stupid = True              neg : pos    =     10.6 : 1.0
               realistic = True              pos : neg    =     10.4 : 1.0
            

In [None]:
short_pos = open("short_reviews/positive.txt","r", encoding='latin-1').read()
short_neg = open("short_reviews/negative.txt","r", encoding='latin-1').read()

documents = []

for r in short_pos.split('\n'):
    documents.append( (r, "pos") )

for r in short_neg.split('\n'):
    documents.append( (r, "neg") )


all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())

for w in short_neg_words:
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

In [None]:
word_features = list(all_words.keys())[:5000]

def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features
	
featuresets = [(find_features(rev), category) for (rev, category) in documents]
random.shuffle(featuresets)

<font color='blue'>
## 18 - Creating a module for Sentiment Analysis
<n>
<font color='blue'>

In [None]:
import nltk
import random
#from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize



class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf
    
short_pos = open("short_reviews/positive.txt","r", encoding='latin-1').read()
short_neg = open("short_reviews/negative.txt","r", encoding='latin-1').read()

# move this up here
all_words = []
documents = []


#  j is adject, r is adverb, and v is verb
#allowed_word_types = ["J","R","V"]
allowed_word_types = ["J"]

for p in short_pos.split('\n'):
    documents.append( (p, "pos") )
    words = word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())

    
for p in short_neg.split('\n'):
    documents.append( (p, "neg") )
    words = word_tokenize(p)
    pos = nltk.pos_tag(words)
    for w in pos:
        if w[1][0] in allowed_word_types:
            all_words.append(w[0].lower())



save_documents = open("pickled_algos/documents.pickle","wb")
pickle.dump(documents, save_documents)
save_documents.close()


all_words = nltk.FreqDist(all_words)


word_features = list(all_words.keys())[:5000]


save_word_features = open("pickled_algos/word_features5k.pickle","wb")
pickle.dump(word_features, save_word_features)
save_word_features.close()


def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]

random.shuffle(featuresets)
print(len(featuresets))

testing_set = featuresets[10000:]
training_set = featuresets[:10000]


classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

###############
save_classifier = open("pickled_algos/originalnaivebayes5k.pickle","wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy percent:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

save_classifier = open("pickled_algos/MNB_classifier5k.pickle","wb")
pickle.dump(MNB_classifier, save_classifier)
save_classifier.close()

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, testing_set))*100)

save_classifier = open("pickled_algos/BernoulliNB_classifier5k.pickle","wb")
pickle.dump(BernoulliNB_classifier, save_classifier)
save_classifier.close()

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

save_classifier = open("pickled_algos/LogisticRegression_classifier5k.pickle","wb")
pickle.dump(LogisticRegression_classifier, save_classifier)
save_classifier.close()


LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

save_classifier = open("pickled_algos/LinearSVC_classifier5k.pickle","wb")
pickle.dump(LinearSVC_classifier, save_classifier)
save_classifier.close()


##NuSVC_classifier = SklearnClassifier(NuSVC())
##NuSVC_classifier.train(training_set)
##print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)


SGDC_classifier = SklearnClassifier(SGDClassifier())
SGDC_classifier.train(training_set)
print("SGDClassifier accuracy percent:",nltk.classify.accuracy(SGDC_classifier, testing_set)*100)

save_classifier = open("pickled_algos/SGDC_classifier5k.pickle","wb")
pickle.dump(SGDC_classifier, save_classifier)
save_classifier.close()

In [None]:
#File: sentiment_mod.py

import nltk
import random
#from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode
from nltk.tokenize import word_tokenize



class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf


documents_f = open("pickled_algos/documents.pickle", "rb")
documents = pickle.load(documents_f)
documents_f.close()




word_features5k_f = open("pickled_algos/word_features5k.pickle", "rb")
word_features = pickle.load(word_features5k_f)
word_features5k_f.close()


def find_features(document):
    words = word_tokenize(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features



featuresets_f = open("pickled_algos/featuresets.pickle", "rb")
featuresets = pickle.load(featuresets_f)
featuresets_f.close()

random.shuffle(featuresets)
print(len(featuresets))

testing_set = featuresets[10000:]
training_set = featuresets[:10000]



open_file = open("pickled_algos/originalnaivebayes5k.pickle", "rb")
classifier = pickle.load(open_file)
open_file.close()


open_file = open("pickled_algos/MNB_classifier5k.pickle", "rb")
MNB_classifier = pickle.load(open_file)
open_file.close()



open_file = open("pickled_algos/BernoulliNB_classifier5k.pickle", "rb")
BernoulliNB_classifier = pickle.load(open_file)
open_file.close()


open_file = open("pickled_algos/LogisticRegression_classifier5k.pickle", "rb")
LogisticRegression_classifier = pickle.load(open_file)
open_file.close()


open_file = open("pickled_algos/LinearSVC_classifier5k.pickle", "rb")
LinearSVC_classifier = pickle.load(open_file)
open_file.close()


open_file = open("pickled_algos/SGDC_classifier5k.pickle", "rb")
SGDC_classifier = pickle.load(open_file)
open_file.close()




voted_classifier = VoteClassifier(
                                  classifier,
                                  LinearSVC_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)




def sentiment(text):
    feats = find_features(text)
    return voted_classifier.classify(feats),voted_classifier.confidence(feats)

In [None]:
import sentiment_mod as s

print(s.sentiment("This movie was awesome! The acting was great, plot was wonderful, and there were pythons...so yea!"))
print(s.sentiment("This movie was utter junk. There were absolutely 0 pythons. I don't see what the point was at all. Horrible movie, 0/10"))

<font color='blue'>
## 19 - Twitter Sentiment Analysis
<n>
<font color='blue'>

In [None]:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener


#consumer key, consumer secret, access token, access secret.
ckey="fsdfasdfsafsffa"
csecret="asdfsadfsadfsadf"
atoken="asdf-aassdfs"
asecret="asdfsadfsdafsdafs"

class listener(StreamListener):

    def on_data(self, data):
        print(data)
        return(True)

    def on_error(self, status):
        print status

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)

twitterStream = Stream(auth, listener())
twitterStream.filter(track=["car"])

In [None]:
tweet = all_data["text"]

In [None]:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import json
import sentiment_mod as s

#consumer key, consumer secret, access token, access secret.
ckey="asdfsafsafsaf"
csecret="asdfasdfsadfsa"
atoken="asdfsadfsafsaf-asdfsaf"
asecret="asdfsadfsadfsadfsadfsad"

from twitterapistuff import *

class listener(StreamListener):

    def on_data(self, data):

		all_data = json.loads(data)

		tweet = all_data["text"]
		sentiment_value, confidence = s.sentiment(tweet)
		print(tweet, sentiment_value, confidence)

		if confidence*100 >= 80:
			output = open("twitter-out.txt","a")
			output.write(sentiment_value)
			output.write('\n')
			output.close()

		return True

    def on_error(self, status):
        print(status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)

twitterStream = Stream(auth, listener())
twitterStream.filter(track=["happy"])

<font color='blue'>
## 21 - Graphing Live Twitter Sentiment Analysis
<n>
<font color='blue'>

In [None]:
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib import style
import time

style.use("ggplot")

fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)

def animate(i):
    pullData = open("twitter-out.txt","r").read()
    lines = pullData.split('\n')

    xar = []
    yar = []

    x = 0
    y = 0

    for l in lines[-200:]:
        x += 1
        if "pos" in l:
            y += 1
        elif "neg" in l:
            y -= 1

        xar.append(x)
        yar.append(y)
        
    ax1.clear()
    ax1.plot(xar,yar)
ani = animation.FuncAnimation(fig, animate, interval=1000)
plt.show()