# [Natural Language Processing Overview](https://www.youtube.com/watch?v=5ctbvkAMQO4)
### NLTK - Main tools for NLP
TEXT MINING/ANALYTICS = process of deriving meaningful information from natural language text 
*   involves structuring input text
*   deriving patterns
*   evaluating & interpreting the output

NLP = part of cs & AI which deals with human languages

Applications of NLP
*   Sentiment analysis 
*   Chatbot
*   Speech recognition
*   Machine translation
*   spell checking, keyword search, info extraction, advertisement matching (recommendation of ads based on history)

Components of NLP
- **natural language understanding** = analyze and find useful info from input
  - mapping input to useful representations
  - analyzing diff aspects of the language
- **natural language generation** = producing meaningful phrases 
  - text planning
  - sentence planning
  - text realization

Processes of NLP
- **Tokenization** = breaking sentences into smaller useful units (i.e. words) 
  1. Break a complex sentence into words
  2. Understand the importance of each of the words with respect to the sentence
  3. Produce a structural description on an input sentence 
- Stemming = normalizing words to their base or root form
  - ex. affectation, affects, affections, affecting >>> affect 
  - stemming algorithms usually strip the beginning or end of a word, guided by a list of common prefixes and suffixes 
  - does not always produce a real root word 
- Lemmatization = morphological analysis of the word
  - need to have a detailed dictionary that the algorithm can look through to link the form back to its original word/**lemma** 
  - **lemma** = root word 
  1. groups together the different inflected forms of a word under one **lemma** 
  2. somewhat similar to **Stemming**, as it maps several words onto one common root
  3. Output of lemmatisation is a proper word 
    - ex. using Lemmatization: mapping gone, going, went >> **go**
    - this would not be the same output when using stemming 
- POS tags
  - once we have tokens & have reduced them to their root forms, we assign POS tags
  - **POS Tags** = parts of speech, ex. nouns, pronouns, verbs etc. 
  - describes how the word functions in **meaning** and **grammatically** within the sentence 
  - **Issue** : *"Google" it on the web*
    - Google is a proper noun but it's used as a verb here 
- Named Entity Recognition (NER) 
  - helps to overcome issues involved with POS 
  - takes named entities (ex. person names, company names, locations, quantities etc.) & has 3 steps 
  1. noun phrase identification 
  2. phrase classification
  3. entity disambiguation 
  - ex. Google's CEO Sundar introduced the new phone at New York Central Mall
    - Organization: Google, Central Mall
    - Person: Sundar
    - Location: New York
- Chunking = picking up individual pieces of info & grouping them into bigger pieces (i.e. **chunks**)
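
As a rough illustration of the tokenization step listed above, here is a minimal sketch that uses only Python's `re` module. The function name `simple_tokenize` and the pattern are assumptions made for this example; NLTK's `word_tokenize` applies more elaborate rules (for instance, it splits off the possessive `'s`), so its output will differ in places.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+(?:[-']\w+)*  -> a word, keeping inner hyphens/apostrophes together
    # [^\w\s]          -> any single character that is neither word nor space
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

tokens = simple_tokenize("Google's CEO introduced the new phone, didn't he?")
print(tokens)
# → ["Google's", 'CEO', 'introduced', 'the', 'new', 'phone', ',', "didn't", 'he', '?']
```

Punctuation comes out as separate tokens, which is why it later shows up in frequency counts unless it is filtered out.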





# [NLTK Tutorial](https://www.youtube.com/watch?v=05ONoGfmKvA)
## Human Language
- **Language** = combinations of: alphabets > words > sentences
- **Grammar** = rules to create sentences 

## What is Text Mining?
- **Text Mining / Text Analytics** = process of **deriving** meaningful **information** from natural language **text**
- the overall goal is essentially to turn text into data for analysis, via application of NLP

## Basic Structure of a NLP Application 
![](https://www.dropbox.com/s/a0lppcyj2l9126w/Screenshot%202020-06-15%2017.24.59.png?dl=0)
- NLP Layer is connected to 
  1. Knowledge Base - Source Content = ex. chat logs
    - data used to train the algorithms
  2. Data Storage - Interaction History & Analytics 
    - help generate meaningful output 

## Ambiguity 
1. Lexical/Semantic = many possible meanings for a single **word**
  - The fisherman went to the **bank** : money bank or river bank??
2. Syntactic = many possible meanings for a single **sentence**
  - The chicken is ready to eat : is the chicken about to eat, or about to be eaten?
3. Referential = concerning pronouns...what does **he** refer to?



In [None]:
import os
import nltk
import nltk.corpus

nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [None]:
hamlet = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
hamlet

['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', ...]

In [None]:
#first 500 words in hamlet
for word in hamlet[:500]:
  print(word, sep=' ', end = ' ')

[ The Tragedie of Hamlet by William Shakespeare 1599 ] Actus Primus . Scoena Prima . Enter Barnardo and Francisco two Centinels . Barnardo . Who ' s there ? Fran . Nay answer me : Stand & vnfold your selfe Bar . Long liue the King Fran . Barnardo ? Bar . He Fran . You come most carefully vpon your houre Bar . ' Tis now strook twelue , get thee to bed Francisco Fran . For this releefe much thankes : ' Tis bitter cold , And I am sicke at heart Barn . Haue you had quiet Guard ? Fran . Not a Mouse stirring Barn . Well , goodnight . If you do meet Horatio and Marcellus , the Riuals of my Watch , bid them make hast . Enter Horatio and Marcellus . Fran . I thinke I heare them . Stand : who ' s there ? Hor . Friends to this ground Mar . And Leige - men to the Dane Fran . Giue you good night Mar . O farwel honest Soldier , who hath relieu ' d you ? Fra . Barnardo ha ' s my place : giue you goodnight . Exit Fran . Mar . Holla Barnardo Bar . Say , what is Horatio there ? Hor . A peece of him Bar 

# Tokenization

## Demo: Tokenizing A String Input

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize 
rona = """The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID‑19), caused by severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2). The outbreak was first identified in Wuhan, China, in December 2019. The World Health Organization declared the outbreak a Public Health Emergency of International Concern on 30 January 2020, and a pandemic on 11 March. As of 15 June 2020, more than 7.96 million cases of COVID-19 have been reported in more than 188 countries and territories, resulting in more than 434,000 deaths; more than 3.8 million people have recovered."""

rona_tokens = word_tokenize(rona)
rona_tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['The',
 'COVID-19',
 'pandemic',
 ',',
 'also',
 'known',
 'as',
 'the',
 'coronavirus',
 'pandemic',
 ',',
 'is',
 'an',
 'ongoing',
 'pandemic',
 'of',
 'coronavirus',
 'disease',
 '2019',
 '(',
 'COVID‑19',
 ')',
 ',',
 'caused',
 'by',
 'severe',
 'acute',
 'respiratory',
 'syndrome',
 'coronavirus',
 '2',
 '(',
 'SARS‑CoV‑2',
 ')',
 '.',
 'The',
 'outbreak',
 'was',
 'first',
 'identified',
 'in',
 'Wuhan',
 ',',
 'China',
 ',',
 'in',
 'December',
 '2019',
 '.',
 'The',
 'World',
 'Health',
 'Organization',
 'declared',
 'the',
 'outbreak',
 'a',
 'Public',
 'Health',
 'Emergency',
 'of',
 'International',
 'Concern',
 'on',
 '30',
 'January',
 '2020',
 ',',
 'and',
 'a',
 'pandemic',
 'on',
 '11',
 'March',
 '.',
 'As',
 'of',
 '15',
 'June',
 '2020',
 ',',
 'more',
 'than',
 '7.96',
 'million',
 'cases',
 'of',
 'COVID-19',
 'have',
 'been',
 'reported',
 'in',
 'more',
 'than',
 '188',
 'countries',
 'and',
 'territories',
 ',',
 'resulting',
 'in',
 'more',
 'than',
 '434,00

## Demo: Find Frequency of Tokens

In [None]:
from nltk.probability import FreqDist
fdist = FreqDist()

for word in rona_tokens:
  fdist[word.lower()] += 1
fdist

FreqDist({'(': 2,
          ')': 2,
          ',': 8,
          '.': 4,
          '11': 1,
          '15': 1,
          '188': 1,
          '2': 1,
          '2019': 2,
          '2020': 2,
          '3.8': 1,
          '30': 1,
          '434,000': 1,
          '7.96': 1,
          ';': 1,
          'a': 2,
          'acute': 1,
          'also': 1,
          'an': 1,
          'and': 2,
          'as': 2,
          'been': 1,
          'by': 1,
          'cases': 1,
          'caused': 1,
          'china': 1,
          'concern': 1,
          'coronavirus': 3,
          'countries': 1,
          'covid-19': 2,
          'covid‑19': 1,
          'deaths': 1,
          'december': 1,
          'declared': 1,
          'disease': 1,
          'emergency': 1,
          'first': 1,
          'have': 2,
          'health': 2,
          'identified': 1,
          'in': 4,
          'international': 1,
          'is': 1,
          'january': 1,
          'june': 1,
          'known': 1,
   

## Demo: Select x # of tokens of highest frequency

In [None]:
fdist_top10 = fdist.most_common(10)
fdist_top10

[(',', 8),
 ('the', 5),
 ('pandemic', 4),
 ('of', 4),
 ('.', 4),
 ('in', 4),
 ('more', 4),
 ('than', 4),
 ('coronavirus', 3),
 ('covid-19', 2)]

## Demo: Blankline Tokenizer 
- `blankline_tokenize` splits text wherever a blank line occurs
- the number of pieces it returns indicates how many paragraphs there are

In [None]:
from nltk.tokenize import blankline_tokenize
rona_blank = blankline_tokenize(rona)
len(rona_blank)

1
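
The result is 1 because `rona` contains no blank lines. To see the count change, here is a small sketch of the same blank-line splitting idea written with plain `re.split`. It is an approximation for illustration, not NLTK's own code, and `split_paragraphs` is a made-up name.

```python
import re

def split_paragraphs(text):
    """Split text wherever one or more blank lines occur."""
    parts = re.split(r"\s*\n\s*\n\s*", text)
    return [p for p in parts if p]  # drop empty pieces at the edges

one_paragraph = "The outbreak was first identified in Wuhan.\nIt spread quickly."
three_paragraphs = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird paragraph."

print(len(split_paragraphs(one_paragraph)))     # → 1 (a single newline is not a blank line)
print(len(split_paragraphs(three_paragraphs)))  # → 3
```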

## Bigrams, Trigrams & Ngrams
**Bigrams** = sequences of 2 consecutive tokens (words)

**Trigrams** = sequences of 3 consecutive tokens

**Ngrams** = sequences of any number (n) of consecutive tokens

### Demo: Bigrams, Trigrams & Ngrams

In [None]:
from nltk.util import bigrams, trigrams, ngrams
string = "The best and most beautiful things int he world canno tbe seen or even touched, they must be felt with the heart"
quote_tokens = nltk.word_tokenize(string)
quotes_bigram = list(nltk.bigrams(quote_tokens))
quotes_bigram

[('The', 'best'),
 ('best', 'and'),
 ('and', 'most'),
 ('most', 'beautiful'),
 ('beautiful', 'things'),
 ('things', 'int'),
 ('int', 'he'),
 ('he', 'world'),
 ('world', 'canno'),
 ('canno', 'tbe'),
 ('tbe', 'seen'),
 ('seen', 'or'),
 ('or', 'even'),
 ('even', 'touched'),
 ('touched', ','),
 (',', 'they'),
 ('they', 'must'),
 ('must', 'be'),
 ('be', 'felt'),
 ('felt', 'with'),
 ('with', 'the'),
 ('the', 'heart')]

In [None]:
quotes_trigrams = list(nltk.trigrams(quote_tokens))
quotes_trigrams

[('The', 'best', 'and'),
 ('best', 'and', 'most'),
 ('and', 'most', 'beautiful'),
 ('most', 'beautiful', 'things'),
 ('beautiful', 'things', 'int'),
 ('things', 'int', 'he'),
 ('int', 'he', 'world'),
 ('he', 'world', 'canno'),
 ('world', 'canno', 'tbe'),
 ('canno', 'tbe', 'seen'),
 ('tbe', 'seen', 'or'),
 ('seen', 'or', 'even'),
 ('or', 'even', 'touched'),
 ('even', 'touched', ','),
 ('touched', ',', 'they'),
 (',', 'they', 'must'),
 ('they', 'must', 'be'),
 ('must', 'be', 'felt'),
 ('be', 'felt', 'with'),
 ('felt', 'with', 'the'),
 ('with', 'the', 'heart')]

In [None]:
quotes_ngrams = list(nltk.ngrams(quote_tokens, 5))
quotes_ngrams

[('The', 'best', 'and', 'most', 'beautiful'),
 ('best', 'and', 'most', 'beautiful', 'things'),
 ('and', 'most', 'beautiful', 'things', 'int'),
 ('most', 'beautiful', 'things', 'int', 'he'),
 ('beautiful', 'things', 'int', 'he', 'world'),
 ('things', 'int', 'he', 'world', 'canno'),
 ('int', 'he', 'world', 'canno', 'tbe'),
 ('he', 'world', 'canno', 'tbe', 'seen'),
 ('world', 'canno', 'tbe', 'seen', 'or'),
 ('canno', 'tbe', 'seen', 'or', 'even'),
 ('tbe', 'seen', 'or', 'even', 'touched'),
 ('seen', 'or', 'even', 'touched', ','),
 ('or', 'even', 'touched', ',', 'they'),
 ('even', 'touched', ',', 'they', 'must'),
 ('touched', ',', 'they', 'must', 'be'),
 (',', 'they', 'must', 'be', 'felt'),
 ('they', 'must', 'be', 'felt', 'with'),
 ('must', 'be', 'felt', 'with', 'the'),
 ('be', 'felt', 'with', 'the', 'heart')]
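
All three outputs above are the same sliding window with a different width. A minimal pure-Python sketch of that idea follows. `sliding_ngrams` is a hypothetical helper written for this note and is not part of NLTK.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # a window of width n fits at len(tokens) - n + 1 starting positions
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["they", "must", "be", "felt", "with", "the", "heart"]
print(sliding_ngrams(tokens, 2)[:3])   # → [('they', 'must'), ('must', 'be'), ('be', 'felt')]
print(len(sliding_ngrams(tokens, 3)))  # → 5
```

This matches the counts in the demo: the quote has 23 tokens, which gives 22 bigrams, 21 trigrams and 19 five-grams.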

# Stemming
- RECAP: normalizes words to their root form
- Literally just **cutting off the beginning or ending** of words whilst taking into account common prefixes & suffixes
- This approach doesn't always find the real root word 
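
To make the "cutting off" idea concrete, here is a deliberately naive suffix stripper. The suffix list and the `naive_stem` function are invented for this sketch; real stemmers such as Porter apply ordered rule sets and are considerably more careful.

```python
# Hypothetical, deliberately tiny suffix list, longest suffixes first.
SUFFIXES = ("ations", "ation", "ions", "ing", "ion", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first listed suffix that leaves a long-enough stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ["affects", "affecting", "affections", "affectation", "gave", "string"]:
    print(w, "->", naive_stem(w))
# affects -> affect
# affecting -> affect
# affections -> affect
# affectation -> affect
# gave -> gave
# string -> str
```

The last two lines show both failure modes: an irregular form (`gave`) is left untouched, and a word that merely ends in a suffix (`string`) gets cut down to a non-word.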

## Porter Stemmer

In [None]:
from nltk.stem import PorterStemmer
pst = PorterStemmer()
# find stemming of a word
pst.stem('having')

'have'

In [None]:
words_to_stem=['give', "giving", "given", "gave"]
for words in words_to_stem:
  print(words + ":" + pst.stem(words))

give:give
giving:give
given:given
gave:gave


## Lancaster Stemmer
- more aggressive than Porter

In [None]:
from nltk.stem import LancasterStemmer
lst=LancasterStemmer()
for words in words_to_stem:
  print(words + ":" + lst.stem(words))

give:giv
giving:giv
given:giv
gave:gav


# Lemmatization
- Like stemming but smarter, because it uses morphological analysis of words
- needs a detailed dictionary to link each word form back to its **Lemma** = root word 
- Output is always a **proper word**
  - unlike stemming, where the output can be a non-word such as **giv** 

## Demo: Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer
# WordNet is the dictionary the lemmatizer looks words up in
nltk.download('wordnet')
word_lem = WordNetLemmatizer()

for words in words_to_stem:
  print(words + ":" + word_lem.lemmatize(words))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
give:give
giving:giving
given:given
gave:gave


### Why did it not find the lemma?
- we did not pass a POS tag, so the lemmatizer **assumes every word is a noun**
- passing `pos='v'` to `lemmatize` makes it treat the word as a verb instead

In [None]:
word_lem.lemmatize('corpora')

'corpus'
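
The effect of the POS tag can be mimicked with a toy lookup table. This is only a sketch of the dictionary idea: `LEMMA_TABLE` and `toy_lemmatize` are made up for illustration and have nothing to do with how WordNet is stored internally.

```python
# Hypothetical miniature lemma dictionary keyed by (word form, POS).
LEMMA_TABLE = {
    ("gave", "v"): "give",
    ("given", "v"): "give",
    ("giving", "v"): "give",
    ("went", "v"): "go",
    ("gone", "v"): "go",
    ("corpora", "n"): "corpus",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given POS; return it unchanged if absent."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("gave"))           # → gave   (looked up as a noun, not found)
print(toy_lemmatize("gave", pos="v"))  # → give
print(toy_lemmatize("corpora"))        # → corpus (found under the default noun POS)
```

This mirrors the demo: `corpora` resolves with the default noun tag, while the verb forms only resolve once the verb tag is supplied.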

# POS Tags
- POS = Parts of Speech
- tags what type of word it is: i.e. noun, verb etc. 
- a word can have more than one POS depending on the context it's used in 

## Stop Words
- words that carry little meaning of their own and so rarely help with NLP
- ex. Really, All, Begin, Take, However, Of, The etc. 
- NLTK has a list of stop words
- they are helpful mainly for forming sentences
- they are usually not helpful when processing a language

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
english_stopwords = stopwords.words('english')

In [None]:
# we have 179 stop words in english
len(english_stopwords)

179

In [None]:
# recap, the most frequent tokens 
fdist_top10

[(',', 8),
 ('the', 5),
 ('pandemic', 4),
 ('of', 4),
 ('.', 4),
 ('in', 4),
 ('more', 4),
 ('than', 4),
 ('coronavirus', 3),
 ('covid-19', 2)]

## Removing punctuation & numbers from tokens

In [None]:
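import re
# characters to strip: ASCII punctuation and digits
# (the non-ASCII hyphen in 'COVID‑19' is not listed, so it survives in the output)
punctuation = re.compile(r'[-.?!,:;()|0-9]')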

In [None]:
post_punctuation=[]
for words in rona_tokens:
  word = punctuation.sub("", words)
  if len(word)>0:
    post_punctuation.append(word)
post_punctuation

['The',
 'COVID',
 'pandemic',
 'also',
 'known',
 'as',
 'the',
 'coronavirus',
 'pandemic',
 'is',
 'an',
 'ongoing',
 'pandemic',
 'of',
 'coronavirus',
 'disease',
 'COVID‑',
 'caused',
 'by',
 'severe',
 'acute',
 'respiratory',
 'syndrome',
 'coronavirus',
 'SARS‑CoV‑',
 'The',
 'outbreak',
 'was',
 'first',
 'identified',
 'in',
 'Wuhan',
 'China',
 'in',
 'December',
 'The',
 'World',
 'Health',
 'Organization',
 'declared',
 'the',
 'outbreak',
 'a',
 'Public',
 'Health',
 'Emergency',
 'of',
 'International',
 'Concern',
 'on',
 'January',
 'and',
 'a',
 'pandemic',
 'on',
 'March',
 'As',
 'of',
 'June',
 'more',
 'than',
 'million',
 'cases',
 'of',
 'COVID',
 'have',
 'been',
 'reported',
 'in',
 'more',
 'than',
 'countries',
 'and',
 'territories',
 'resulting',
 'in',
 'more',
 'than',
 'deaths',
 'more',
 'than',
 'million',
 'people',
 'have',
 'recovered']
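
The stop-word list loaded earlier is never actually applied in this notebook. A minimal sketch of that missing step is below. It uses a small hand-picked `STOP_WORDS` set and a short token list as stand-ins (both are assumptions for the example); in the notebook you would use `english_stopwords` and `post_punctuation` instead.

```python
from collections import Counter

# Assumption: a tiny stand-in for stopwords.words('english').
STOP_WORDS = {"the", "of", "in", "and", "a", "as", "is", "an", "on", "was"}

# Assumption: a short stand-in for post_punctuation.
tokens = ["The", "outbreak", "was", "first", "identified", "in", "Wuhan",
          "The", "World", "Health", "Organization", "declared", "the",
          "outbreak", "a", "pandemic"]

# lowercase each token and keep it only if it is not a stop word
content_tokens = [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

print(Counter(content_tokens).most_common(1))  # → [('outbreak', 2)]
```

With stop words and punctuation gone, the top of the frequency list reflects the topic of the text rather than commas and articles.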

## POS Tags List 
![alt text](https://thottingal.in/wp-content/uploads/2019/09/PENN-treebank-tagset.png)

## Demo


In [None]:
nltk.download('averaged_perceptron_tagger')

sentence = "Timothy is a natural when it comes to drawing"
sentence_tokens = word_tokenize(sentence)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
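# note: each token is tagged on its own here, so the tagger sees no context;
# nltk.pos_tag(sentence_tokens) would tag the whole sentence in one call
for token in sentence_tokens:
  print(nltk.pos_tag([token]))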

[('Timothy', 'NN')]
[('is', 'VBZ')]
[('a', 'DT')]
[('natural', 'JJ')]
[('when', 'WRB')]
[('it', 'PRP')]
[('comes', 'VBZ')]
[('to', 'TO')]
[('drawing', 'VBG')]


# Named Entity Recognition 
- Movie, Monetary Value, Organization, Location, Quantities, Person

## 3 types of identification 
1. **noun phrase identification** = extracting noun phrases using dependency parsing & POS tagging
2. **phrase classification** = the noun phrases are categorized into location, movies etc. 
3. **knowledge graphs** =  validation layer for when phrases are misclassified 


In [None]:
from nltk import ne_chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
NE_sentence = "The US President stays in the WHITE HOUSE"
NE_tokens = word_tokenize(NE_sentence)
NE_tags = nltk.pos_tag(NE_tokens)

In [None]:
NE_NER = ne_chunk(NE_tags)
print(NE_NER)

(S
  The/DT
  (ORGANIZATION US/NNP)
  President/NNP
  stays/VBZ
  in/IN
  the/DT
  (FACILITY WHITE/NNP HOUSE/NNP))


# Syntax 
Linguistics Definition = the set of rules, principles, and processes that govern the structure of sentences in a given language 

## Phrase Structure Rules 
- phrase structure rules specify the well-formed structures of sentences
- a tree must match the phrase structure rules to be grammatical
![alt text](https://image.slidesharecdn.com/recursion-140717094159-phpapp01/95/recursion-16-638.jpg?cb=1405590175)
- NP = noun phrase
- VP = verb phrase
- PP = prepositional phrase 
- P = preposition
- Art = article 


## Download **GhostScript**

# Chunking 


In [None]:
new = "The big cat ate the little mouse who was after fresh cheese"
new_tokens = nltk.pos_tag(word_tokenize(new))
new_tokens

[('The', 'DT'),
 ('big', 'JJ'),
 ('cat', 'NN'),
 ('ate', 'VBD'),
 ('the', 'DT'),
 ('little', 'JJ'),
 ('mouse', 'NN'),
 ('who', 'WP'),
 ('was', 'VBD'),
 ('after', 'IN'),
 ('fresh', 'JJ'),
 ('cheese', 'NN')]

In [None]:
# specifying the grammar for a noun phrase: optional determiner, one adjective, one noun
grammar_np = r"NP: {<DT>?<JJ><NN>}"

In [None]:
# build a chunk parser that matches the grammar against the POS tags
chunk_parser = nltk.RegexpParser(grammar_np)

In [None]:
chunk_result = chunk_parser.parse(new_tokens)
chunk_result

# the error comes from the notebook trying to draw the syntax tree, which needs Ghostscript
# the parsed result is still shown:
#Tree('S', [
  #Tree('NP', [('The', 'DT'), ('big', 'JJ'), ('cat', 'NN')]), ('ate', 'VBD'), 
  #Tree('NP', [('the', 'DT'), ('little', 'JJ'), ('mouse', 'NN')]), ('who', 'WP'), ('was', 'VBD'), ('after', 'IN'), 
  #Tree('NP', [('fresh', 'JJ'), ('cheese', 'NN')])
#])

TclError: ignored

Tree('S', [Tree('NP', [('The', 'DT'), ('big', 'JJ'), ('cat', 'NN')]), ('ate', 'VBD'), Tree('NP', [('the', 'DT'), ('little', 'JJ'), ('mouse', 'NN')]), ('who', 'WP'), ('was', 'VBD'), ('after', 'IN'), Tree('NP', [('fresh', 'JJ'), ('cheese', 'NN')])])
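
To see what the grammar `{<DT>?<JJ><NN>}` is doing, the same pattern can be matched by hand over the tag sequence. This is a hand-rolled sketch for illustration only. `chunk_noun_phrases` is a made-up function and is not how `RegexpParser` is implemented.

```python
def chunk_noun_phrases(tagged):
    """Group tokens matching <DT>? <JJ> <NN> into NP chunks."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":               # optional determiner
            j += 1
        if (j + 1 < len(tagged)
                and tagged[j][1] == "JJ"        # exactly one adjective
                and tagged[j + 1][1] == "NN"):  # followed by one singular noun
            chunks.append([word for word, _ in tagged[i:j + 2]])
            i = j + 2                           # continue after the chunk
        else:
            i += 1                              # no chunk starts here
    return chunks

tagged = [('The', 'DT'), ('big', 'JJ'), ('cat', 'NN'), ('ate', 'VBD'),
          ('the', 'DT'), ('little', 'JJ'), ('mouse', 'NN'), ('who', 'WP'),
          ('was', 'VBD'), ('after', 'IN'), ('fresh', 'JJ'), ('cheese', 'NN')]

print(chunk_noun_phrases(tagged))
# → [['The', 'big', 'cat'], ['the', 'little', 'mouse'], ['fresh', 'cheese']]
```

The three chunks are the same three `NP` subtrees that appear in the parser output above.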

# Summary Project: ML Classifier on Movie Reviews from NLTK Corpora

In [None]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

In [None]:
from nltk.corpus import movie_reviews
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [None]:
print(movie_reviews.categories())

['neg', 'pos']


In [None]:
review = nltk.corpus.movie_reviews.words('pos/cv000_29590.txt')
review

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]

In [None]:
review_list = []

In [None]:
def review_to_string(fileid):
  # join the tokens of one review back into a single string
  text = " ".join(movie_reviews.words(fileid))
  # re-attach punctuation that the join separated with spaces
  text = text.replace(' ,', ',').replace(' .', '.')
  text = text.replace("' ", "'").replace(" '", "'")
  return text

# negatives first, then positives, so the order matches the targets built below
for rev in movie_reviews.fileids('neg'):
  review_list.append(review_to_string(rev))

for rev in movie_reviews.fileids('pos'):
  review_list.append(review_to_string(rev))


In [None]:
len(review_list)

2000

In [None]:
# create targets: 0 = negative, 1 = positive
neg_targets = np.zeros((1000,), dtype=int)
pos_targets = np.ones((1000,), dtype=int)

In [None]:
target_list = []
for neg_tar in neg_targets:
  target_list.append(neg_tar)
for pos_tar in pos_targets:
  target_list.append(pos_tar)

In [None]:
len(target_list)

2000

In [None]:
y = pd.Series(target_list)
type(y)
y.head()

0    0
1    0
2    0
3    0
4    0
dtype: int64

In [None]:
# now create features with CountVectorizer

# init CountVectorizer: lowercase the text, drop English stop words,
# and keep only words that appear in at least 2 reviews
count_vect = CountVectorizer(lowercase=True, stop_words='english', min_df=2)

X_count_vect = count_vect.fit_transform(review_list)

X_count_vect.shape

(2000, 17316)
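
Each row of that matrix is one review and each column is one vocabulary word. A stripped-down sketch of the counting idea, using three made-up documents, looks like this. It skips the lowercasing, stop-word removal and `min_df` pruning that `CountVectorizer` performs, and it keeps one-letter words, which the real class drops by default.

```python
from collections import Counter

# Assumption: three toy documents standing in for the reviews.
docs = ["a great great film", "a dull film", "great cast"]

# Vocabulary: every distinct word, sorted so each word gets a fixed column.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, cell = word count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)   # → ['a', 'cast', 'dull', 'film', 'great']
print(matrix)  # → [[1, 0, 0, 1, 2], [1, 0, 1, 1, 0], [0, 1, 0, 0, 1]]
```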

In [None]:
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
X_names = count_vect.get_feature_names_out().tolist()
X_names

['00',
 '000',
 '007',
 '10',
 '100',
 '1000',
 '101',
 '102',
 '105',
 '107',
 '11',
 '110',
 '12',
 '129',
 '13',
 '130',
 '137',
 '13th',
 '14',
 '14th',
 '15',
 '150',
 '1500s',
 '155',
 '16',
 '160',
 '161',
 '16mm',
 '16th',
 '16x9',
 '17',
 '175',
 '1773',
 '17th',
 '18',
 '180',
 '1800s',
 '1839',
 '1888',
 '18th',
 '19',
 '1900',
 '1912',
 '1914',
 '1919',
 '1930',
 '1930s',
 '1932',
 '1935',
 '1937',
 '1938',
 '1939',
 '1940',
 '1940s',
 '1941',
 '1943',
 '1944',
 '1945',
 '1947',
 '1950',
 '1950s',
 '1953',
 '1954',
 '1957',
 '1958',
 '1959',
 '1960',
 '1960s',
 '1962',
 '1963',
 '1964',
 '1965',
 '1967',
 '1968',
 '1969',
 '1970',
 '1970s',
 '1971',
 '1972',
 '1973',
 '1974',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1980s',
 '1981',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1990s',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '19th',
 '1st',
 '20',
 '200',
 '2000',
 '2001',
 '2013',
 '2050',

In [None]:
X_count_vect = pd.DataFrame(X_count_vect.toarray(), columns=X_names)
X_count_vect.shape

(2000, 17316)

In [None]:
X_count_vect.head()

Unnamed: 0,00,000,007,10,100,1000,101,102,105,107,11,110,12,129,13,130,137,13th,14,14th,15,150,1500s,155,16,160,161,16mm,16th,16x9,17,175,1773,17th,18,180,1800s,1839,1888,18th,...,yuppie,yuppies,yvette,zachary,zack,zahn,zane,zany,zapped,zeal,zellweger,zemeckis,zen,zero,zeroing,zest,zeta,zeus,zhang,zhou,ziggy,zingers,zip,zippel,zipper,zippy,zoe,zombie,zombies,zone,zoo,zoolander,zoom,zooming,zooms,zorg,zorro,zucker,zuko,zwick
0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# splitting to testing & training sets
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix

In [None]:
X_train_cv, X_test_cv, Y_train_cv, Y_test_cv = train_test_split(X_count_vect, y, test_size = 0.25, random_state = 5)
# test size will be 25% of the whole data frame, so training size will be 75%

In [None]:
X_train_cv.shape

(1500, 17316)

In [None]:
X_test_cv.shape

(500, 17316)

## Naive Bayes Classifier 
- classification technique based on Bayes' theorem, with the assumption of independence among predictors 
- SO: it assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature 
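
Written out, the independence assumption turns Bayes' theorem into a simple product. The symbols here are chosen for this note rather than taken from the tutorial: $c$ is a class (neg or pos) and $x_1, \dots, x_n$ are the word features of one review.

$$
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The classifier estimates $P(c)$ and each $P(x_i \mid c)$ from the training set and predicts the class $\hat{c}$ with the largest product.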

In [None]:
# the data has now been split; fit a naive Bayes classifier on the training set and predict on the test set
from sklearn.naive_bayes import GaussianNB
# instantiate the classifier, fit it with the training features & labels, then predict the test labels
gnb = GaussianNB()
y_pred_gnb = gnb.fit(X_train_cv, Y_train_cv).predict(X_test_cv)

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf_cv = MultinomialNB()
clf_cv.fit(X_train_cv, Y_train_cv)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [None]:
# use the predict function on the test features
y_pred_cv = clf_cv.predict(X_test_cv)
type(y_pred_cv)

numpy.ndarray

In [None]:
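# a perfect score on held-out reviews is a red flag: before trusting it, check that
# every row of review_list holds its own review text and lines up with its label in y
print(metrics.accuracy_score(Y_test_cv, y_pred_cv))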

1.0


In [None]:
score_clf_cv = confusion_matrix(Y_test_cv, y_pred_cv)
score_clf_cv

array([[258,   0],
       [  0, 242]])
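
In scikit-learn's confusion matrix the rows are the true labels and the columns are the predicted labels, so the diagonal holds the correct predictions. The sketch below tallies such a table by hand on six made-up labels and derives accuracy from it. `confusion_counts` is a hypothetical helper for binary 0/1 labels only.

```python
def confusion_counts(y_true, y_pred):
    """2x2 table: rows are true labels (0, 1), columns are predicted labels."""
    table = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        table[t][p] += 1
    return table

# Assumption: toy labels, not the notebook's predictions.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

table = confusion_counts(y_true, y_pred)
correct = table[0][0] + table[1][1]   # diagonal = correct predictions
accuracy = correct / len(y_true)

print(table)     # → [[2, 1], [1, 2]]
print(accuracy)  # 4 correct out of 6
```

Applied to the matrix above, the diagonal sums to 258 + 242 = 500, which is all 500 test rows, and that is where the accuracy of 1.0 comes from.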