# Steps in NLP

### Step 1 - Tokenization 

Tokenization is the process of breaking a string into tokens, which are small structures or units that can be used for further processing.

Tokenization involves three steps -

1. Break a complex sentence into words.
2. Understand the importance of each word with respect to the sentence.
3. Produce a structural description of the input sentence.

For example, consider the sentence below -

**Tokenization is the first step in NLP**

This sentence contains seven tokens, as shown below.

[Tokenization] [is] [the] [first] [step] [in] [NLP]
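As a rough illustration of the first step, breaking a sentence into words, here is a minimal sketch that uses only Python's `re` module. The function name `simple_tokenize` and the pattern are assumptions made for this example; NLTK's own tokenizers handle many more cases.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks (toy example)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Tokenization is the first step in NLP")
print(tokens)       # ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP']
print(len(tokens))  # 7
```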

NLTK can also tokenize text containing more than one word, such as phrases and full sentences.

So, let's see how we can use **tokenization** with NLTK -




In [1]:
import os
import nltk
import nltk.corpus

In [67]:
# nltk.download('gutenberg')
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker')
# nltk.download('movie_reviews')

In [54]:
print(os.listdir(nltk.data.find('corpora')))

['stopwords', 'words', 'wordnet.zip', 'stopwords.zip', 'gutenberg', 'words.zip', 'wordnet', 'gutenberg.zip']


In [55]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [5]:
hamlet=nltk.corpus.gutenberg.words('shakespeare-macbeth.txt')
hamlet

['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', ...]

Let's look at the first 500 elements of `hamlet` (despite its name, the variable holds the text of *Macbeth*).

In [6]:
for word in hamlet[:500]:
    print(word, sep=' ',end=' ')

[ The Tragedie of Macbeth by William Shakespeare 1603 ] Actus Primus . Scoena Prima . Thunder and Lightning . Enter three Witches . 1 . When shall we three meet againe ? In Thunder , Lightning , or in Raine ? 2 . When the Hurley - burley ' s done , When the Battaile ' s lost , and wonne 3 . That will be ere the set of Sunne 1 . Where the place ? 2 . Vpon the Heath 3 . There to meet with Macbeth 1 . I come , Gray - Malkin All . Padock calls anon : faire is foule , and foule is faire , Houer through the fogge and filthie ayre . Exeunt . Scena Secunda . Alarum within . Enter King Malcome , Donalbaine , Lenox , with attendants , meeting a bleeding Captaine . King . What bloody man is that ? he can report , As seemeth by his plight , of the Reuolt The newest state Mal . This is the Serieant , Who like a good and hardie Souldier fought ' Gainst my Captiuitie : Haile braue friend ; Say to the King , the knowledge of the Broyle , As thou didst leaue it Cap . Doubtfull it stood , As two spent S

For the analysis that follows, we use a much shorter text.

In [7]:
AI = """According to the father of Artificial Intelligence, John McCarthy, it is "The science and engineering of making intellgence" """

In [8]:
type(AI)

str

**Let's tokenize the above string**

In [9]:
from nltk.tokenize import word_tokenize

In [10]:
AI_tokens = word_tokenize(AI)
AI_tokens

['According',
 'to',
 'the',
 'father',
 'of',
 'Artificial',
 'Intelligence',
 ',',
 'John',
 'McCarthy',
 ',',
 'it',
 'is',
 '``',
 'The',
 'science',
 'and',
 'engineering',
 'of',
 'making',
 'intellgence',
 "''"]

In [11]:
len(AI_tokens)

22

**So in total we have 22 tokens**; these tokens are simply the separated words and punctuation marks of the string.

Now, to find the frequency of each distinct token in the paragraph, we import NLTK's frequency distribution class, `FreqDist`.



In [12]:
from nltk.probability import FreqDist
fdist = FreqDist()

We want the count of every word in the paragraph. We also convert the tokens to lowercase, so that the same word in uppercase and lowercase is not counted as two different words.

In [13]:
for word in AI_tokens:
    fdist[word.lower()]+=1
fdist

FreqDist({'the': 2, 'of': 2, ',': 2, 'according': 1, 'to': 1, 'father': 1, 'artificial': 1, 'intelligence': 1, 'john': 1, 'mccarthy': 1, ...})

Now, suppose we want to select the 10 tokens with the highest frequency -

In [14]:
fdist_top10 = fdist.most_common(10)
fdist_top10

[('the', 2),
 ('of', 2),
 (',', 2),
 ('according', 1),
 ('to', 1),
 ('father', 1),
 ('artificial', 1),
 ('intelligence', 1),
 ('john', 1),
 ('mccarthy', 1)]
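The counting done above can also be sketched with Python's standard `collections.Counter`, which offers a very similar interface to `FreqDist`. The token list below is copied from the `word_tokenize` output earlier; the variable names `ai_tokens` and `counts` are chosen for this example only.

```python
from collections import Counter

# Tokens of the AI sentence, copied from the word_tokenize output above
ai_tokens = ['According', 'to', 'the', 'father', 'of', 'Artificial',
             'Intelligence', ',', 'John', 'McCarthy', ',', 'it', 'is', '``',
             'The', 'science', 'and', 'engineering', 'of', 'making',
             'intellgence', "''"]

# Count lowercased tokens in one pass instead of an explicit loop
counts = Counter(token.lower() for token in ai_tokens)

print(counts['the'])          # 2
print(counts.most_common(3))  # [('the', 2), ('of', 2), (',', 2)]
```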

**There is another type of tokenizer: the blank line tokenizer.**

Let's use the **blank line tokenizer** (`blankline_tokenize`) on the same string; it splits the text into paragraphs wherever a blank line occurs -

In [15]:
from nltk.tokenize import blankline_tokenize
AI_blank = blankline_tokenize(AI)
len(AI_blank)

1

The output here is **1**. This number indicates how many paragraphs the text contains, where paragraphs are separated by a blank line. Our string contains no blank line, so it is treated as a single paragraph.

The original structure of the data remains intact. 
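To see the effect of a blank line, here is a small plain-Python sketch of the same idea. The helper `split_paragraphs` and its pattern are assumptions made for this illustration, not NLTK's implementation.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs at blank lines (toy blank line tokenizer)."""
    # A blank line is a newline, optional whitespace, then another newline
    parts = re.split(r"\s*\n\s*\n\s*", text)
    return [part for part in parts if part]

one_paragraph = "A single paragraph.\nA line break alone does not split it."
two_paragraphs = "First paragraph.\nStill the first.\n\nSecond paragraph."

print(len(split_paragraphs(one_paragraph)))  # 1
print(split_paragraphs(two_paragraphs))
# ['First paragraph.\nStill the first.', 'Second paragraph.']
```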

Other important terms in tokenization are -

- **Bigrams** - Sequences of two consecutive tokens
- **Trigrams** - Sequences of three consecutive tokens
- **Ngrams** - Sequences of any number (n) of consecutive tokens

**Let's do a demo of the above** -

In [16]:
from nltk.util import bigrams, trigrams, ngrams

- First, I split a string into tokens.

In [17]:
string = "The best and most beautiful things in the world cannot be seen or even touched, they must be felt with heart"
quotes_tokens = nltk.word_tokenize(string)
quotes_tokens

['The',
 'best',
 'and',
 'most',
 'beautiful',
 'things',
 'in',
 'the',
 'world',
 'can',
 'not',
 'be',
 'seen',
 'or',
 'even',
 'touched',
 ',',
 'they',
 'must',
 'be',
 'felt',
 'with',
 'heart']

**Let's create bigrams from the list of tokens**

In [18]:
quotes_bigrams = list(nltk.bigrams(quotes_tokens))
quotes_bigrams

[('The', 'best'),
 ('best', 'and'),
 ('and', 'most'),
 ('most', 'beautiful'),
 ('beautiful', 'things'),
 ('things', 'in'),
 ('in', 'the'),
 ('the', 'world'),
 ('world', 'can'),
 ('can', 'not'),
 ('not', 'be'),
 ('be', 'seen'),
 ('seen', 'or'),
 ('or', 'even'),
 ('even', 'touched'),
 ('touched', ','),
 (',', 'they'),
 ('they', 'must'),
 ('must', 'be'),
 ('be', 'felt'),
 ('felt', 'with'),
 ('with', 'heart')]

**Let's create trigrams from the list of tokens**

In [19]:
quotes_trigrams = list(nltk.trigrams(quotes_tokens))
quotes_trigrams

[('The', 'best', 'and'),
 ('best', 'and', 'most'),
 ('and', 'most', 'beautiful'),
 ('most', 'beautiful', 'things'),
 ('beautiful', 'things', 'in'),
 ('things', 'in', 'the'),
 ('in', 'the', 'world'),
 ('the', 'world', 'can'),
 ('world', 'can', 'not'),
 ('can', 'not', 'be'),
 ('not', 'be', 'seen'),
 ('be', 'seen', 'or'),
 ('seen', 'or', 'even'),
 ('or', 'even', 'touched'),
 ('even', 'touched', ','),
 ('touched', ',', 'they'),
 (',', 'they', 'must'),
 ('they', 'must', 'be'),
 ('must', 'be', 'felt'),
 ('be', 'felt', 'with'),
 ('felt', 'with', 'heart')]

**Let's create n-grams from the list of tokens**

In [20]:
quotes_ngrams = list(nltk.ngrams(quotes_tokens,4)) # n is 4 here; any other positive integer works as well
quotes_ngrams

[('The', 'best', 'and', 'most'),
 ('best', 'and', 'most', 'beautiful'),
 ('and', 'most', 'beautiful', 'things'),
 ('most', 'beautiful', 'things', 'in'),
 ('beautiful', 'things', 'in', 'the'),
 ('things', 'in', 'the', 'world'),
 ('in', 'the', 'world', 'can'),
 ('the', 'world', 'can', 'not'),
 ('world', 'can', 'not', 'be'),
 ('can', 'not', 'be', 'seen'),
 ('not', 'be', 'seen', 'or'),
 ('be', 'seen', 'or', 'even'),
 ('seen', 'or', 'even', 'touched'),
 ('or', 'even', 'touched', ','),
 ('even', 'touched', ',', 'they'),
 ('touched', ',', 'they', 'must'),
 (',', 'they', 'must', 'be'),
 ('they', 'must', 'be', 'felt'),
 ('must', 'be', 'felt', 'with'),
 ('be', 'felt', 'with', 'heart')]
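All three outputs above follow the same pattern: a window of n tokens slides over the list one position at a time, so a list of k tokens yields k - n + 1 n-grams (22 bigrams, 21 trigrams and 20 four-grams for the 23 tokens above). A minimal plain-Python sketch of that sliding window, with the hypothetical helper name `make_ngrams`:

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (sliding window)."""
    # zip over n shifted views of the list; zip stops at the shortest view
    return list(zip(*(tokens[i:] for i in range(n))))

words = "The best and most beautiful things".split()

print(make_ngrams(words, 2)[:2])    # [('The', 'best'), ('best', 'and')]
print(len(make_ngrams(words, 2)))   # 5
print(len(make_ngrams(words, 4)))   # 3
```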

### Step 2 - Stemming

Once we have tokens, we often need to normalize them, and one way to do that is **stemming**.

Stemming refers to reducing words to their base or root form. For example, consider the words -

- Affectation
- Affects
- Affections
- Affected
- Affection
- Affecting

The root word here is **Affect**.

Keep in mind that the result will not always be a real root word: a stemming algorithm works by cutting off the end or the beginning of the word.

Because the result is not always correct, this approach has some limitations.
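To make the "cutting off the end" idea and its limitations concrete, here is a deliberately naive suffix stripper. It is a toy written for this illustration (the suffix list and the name `naive_stem` are assumptions), not the Porter or Lancaster algorithm.

```python
# Toy suffix list for illustration only; real stemmers use far richer rules
SUFFIXES = sorted(["ations", "ation", "ions", "ion", "ing", "ed", "s"],
                  key=len, reverse=True)

def naive_stem(word, min_stem_len=3):
    """Strip the longest matching suffix, keeping at least min_stem_len letters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_len:
            return lowered[: -len(suffix)]
    return lowered

family = ["Affectation", "Affects", "Affections", "Affected", "Affection", "Affecting"]
print({naive_stem(w) for w in family})  # {'affect'}

# Blind truncation also produces wrong or non-word results
print(naive_stem("news"))    # new
print(naive_stem("giving"))  # giv
```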





**Let's do a demo**

In [21]:
from nltk.stem import PorterStemmer
pst = PorterStemmer()

In [22]:
pst.stem('having')

'have'

In [23]:
pst.stem('corpora')

'corpora'

In [24]:
words_to_stem=['give','giving','given','gave']
for words in words_to_stem:
    print(words+ ':' +pst.stem(words))

give:give
giving:give
given:given
gave:gave


Here the stemmer only strips the *ing* from *giving* and restores the final **e**; *given* and *gave* are left unchanged.

Another stemmer is the **LancasterStemmer** -

In [25]:
from nltk.stem import LancasterStemmer
lst = LancasterStemmer()
for words in words_to_stem:
    print(words+ ':' +lst.stem(words))

give:giv
giving:giv
given:giv
gave:gav


From the above result you can see that the **LancasterStemmer** is more aggressive. Which stemmer to use depends on the task at hand.

After all, **stemming** is basically normalizing a word by cutting it from the beginning or the end.

### Step 3 - Lemmatization

**Lemmatization** takes the morphological analysis of the word into consideration. To do so, it needs a detailed dictionary that the algorithm can look into to link a word form back to its lemma.

So basically **lemmatization** -

- Groups together the different inflected forms of a word under a common base form, called the **lemma**.
- Is somewhat similar to **stemming**, as it maps several words to one common root.
- Produces a proper word as its output, which is the important difference from stemming.
- For example, a **lemmatizer** should map *gone*, *going* and *went* to *go*, the root of all three words. Because the output must be a proper word rather than an arbitrary truncated string, the lemmatizer needs a detailed dictionary to look it up.

**Let's see the demo** -


In [26]:
from nltk.stem import wordnet
from nltk.stem import WordNetLemmatizer
word_lem = WordNetLemmatizer()

In [27]:
for words in words_to_stem:
    print(words+ ":" +word_lem.lemmatize(words))

give:give
giving:giving
given:given
gave:gave


The output looks like this because we haven't passed a **POS** tag, so the lemmatizer has treated every word as a noun.

A **POS** tag states what exactly the word is: a noun, a verb, or another part of speech.
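The role of the dictionary and of the POS tag can be sketched with a tiny hand-written lookup table. Everything below (`LEMMA_DICT`, `toy_lemmatize`) is hypothetical and only mimics the idea; with NLTK itself, `WordNetLemmatizer.lemmatize` accepts a `pos` argument, for example `word_lem.lemmatize('gave', pos='v')`, so that verb forms are looked up as verbs.

```python
# Tiny hand-written dictionary, for illustration only: (word, pos) -> lemma
LEMMA_DICT = {
    ("gone", "v"): "go",
    ("going", "v"): "go",
    ("went", "v"): "go",
    ("giving", "v"): "give",
    ("given", "v"): "give",
    ("gave", "v"): "give",
    ("corpora", "n"): "corpus",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when it is unknown."""
    return LEMMA_DICT.get((word.lower(), pos), word)

print(toy_lemmatize("gave"))           # gave  (treated as a noun by default)
print(toy_lemmatize("gave", pos="v"))  # give
print(toy_lemmatize("corpora"))        # corpus
```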

In [28]:
word_lem.lemmatize('corpora')

'corpus'

In [29]:
pst.stem('corpora')

'corpora'

In [30]:
lst.stem('corpora')

'corpor'

**You can see the difference between *lemmatization* and *stemming*.**

## Stop Words - 

There are several words in the English language, such as *I*, *at*, *above* and *below*, that are very useful in forming sentences; without them a sentence would not make sense. However, these words provide little help in NLP, and they are known as **stop words**.

NLTK has its own list of stop words, and we can use it by importing it.

**Let's see the demo**

In [31]:
from nltk.corpus import stopwords

In [32]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [33]:
len(stopwords.words('english'))

179

**NLTK's English stop word list contains 179 words.**

These words are important when forming sentences, but they matter little for processing. Let's look again at the top 10 tokens we extracted earlier from the AI paragraph.

In [34]:
fdist_top10

[('the', 2),
 ('of', 2),
 (',', 2),
 ('according', 1),
 ('to', 1),
 ('father', 1),
 ('artificial', 1),
 ('intelligence', 1),
 ('john', 1),
 ('mccarthy', 1)]

Above we can see that several of the top tokens, such as *the*, *of*, *to* and the comma, are stop words or punctuation, and those can be removed.

For this we first use a regular expression from the `re` module that matches any digit or special character, and then we will see how we can remove the stop words.

In [35]:
import re
punctuation = re.compile(r'[-.?!,:;()|0-9]')

In [36]:
post_punctuation = []
for words in AI_tokens:
    word = punctuation.sub("",words)
    if len(word)>0:
        post_punctuation.append(word)

In [37]:
post_punctuation

['According',
 'to',
 'the',
 'father',
 'of',
 'Artificial',
 'Intelligence',
 'John',
 'McCarthy',
 'it',
 'is',
 '``',
 'The',
 'science',
 'and',
 'engineering',
 'of',
 'making',
 'intellgence',
 "''"]

In `post_punctuation` the comma tokens are gone; the quote tokens remain because the pattern does not include quote characters. Stop words are still present at this stage. Let's check the length.

In [38]:
len(post_punctuation)

20

This step is very useful in NLP, as it removes tokens that do not carry much meaning.
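The stop word removal itself can be sketched as a simple filter. To stay self-contained, the example hard-codes a small subset of the stop words listed earlier (`STOP_WORDS` is an assumption for this sketch); in the notebook one would build the set from `stopwords.words('english')` instead.

```python
# Assumed subset of NLTK's English stop word list, hard-coded to stay self-contained
STOP_WORDS = {"the", "of", "to", "it", "is", "and", "a", "in"}

# Tokens of the AI sentence, copied from the word_tokenize output
ai_tokens = ['According', 'to', 'the', 'father', 'of', 'Artificial',
             'Intelligence', ',', 'John', 'McCarthy', ',', 'it', 'is', '``',
             'The', 'science', 'and', 'engineering', 'of', 'making',
             'intellgence', "''"]

# Keep alphabetic tokens whose lowercase form is not a stop word
content_words = [t for t in ai_tokens
                 if t.isalpha() and t.lower() not in STOP_WORDS]

print(content_words)
# ['According', 'father', 'Artificial', 'Intelligence', 'John', 'McCarthy',
#  'science', 'engineering', 'making', 'intellgence']
print(len(content_words))  # 10
```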

## POS: Parts of Speech - 

Generally speaking, the part of speech is the grammatical type of a word, such as verb, noun, adjective, adverb or article. It indicates how a word functions in meaning as well as grammatically within the sentence.

A word can have more than one part of speech, depending on the context in which it is used.

For example, consider the sentence *Google something on the internet.*

Here *Google* acts as a verb, although it is a proper noun.

Some examples of POS tags -

- CC -> Coordinating Conjunction
- CD -> Cardinal number
- RB -> Adverb
- TO -> to

**Let's see a demo of POS**

Consider the sentence **The Dog killed the bat** -

- **The** - DT (determiner)
- **Dog** - NN (noun)
- **killed** - VBD (verb, past tense)
- **the** - DT (determiner)
- **bat** - NN (noun)

Another common tag is **JJ** (adjective).

Each token corresponds to a particular tag, its part-of-speech tag, which is very helpful in later text-processing steps.

Now, let's consider a string and check how NLTK performs POS tagging on it - 

### Example 1

**Step 1 - Tokenise**

In [83]:
# As a rule of thumb, we tokenize the sentence first
sent = "Timothy is a natural when it comes to drawing"
sent_tokens = word_tokenize(sent)

**Step 2 - TAG**

In [84]:
# Then we pass each token to nltk.pos_tag -

for token in sent_tokens:
    print(nltk.pos_tag([token]))

[('Timothy', 'NN')]
[('is', 'VBZ')]
[('a', 'DT')]
[('natural', 'JJ')]
[('when', 'WRB')]
[('it', 'PRP')]
[('comes', 'VBZ')]
[('to', 'TO')]
[('drawing', 'VBG')]


**This is how POS tags are assigned**

### Example 2

In [85]:
sent2 = "John is eating a delicious cake"
sent2_tokens = word_tokenize(sent2)
for token in sent2_tokens:
    print(nltk.pos_tag([token]))

[('John', 'NNP')]
[('is', 'VBZ')]
[('eating', 'VBG')]
[('a', 'DT')]
[('delicious', 'JJ')]
[('cake', 'NN')]


You can see that **is** is tagged **VBZ** and **eating** is tagged **VBG**; both are verb tags. Note that the loop above tags each token in isolation, so the tagger cannot use the neighbouring words as context. Passing the whole token list in one call, `nltk.pos_tag(sent2_tokens)`, lets it take context into account. Even so, POS taggers are not always correct, which is one of their shortcomings.

With POS tags in place, we can move on to another important topic: **Named Entity Recognition (NER)**.

Named entities are things such as person names, locations, companies, organisations, quantities and monetary values; detecting them in text is called **Named Entity Recognition (NER)**.

To look at **NER**, consider the sentence below -

**Google's CEO Sundar Pichai introduced the new Pixel at Minnesota Roi Centre Event.**

From the above sentence, the following entities can be identified -

- Google - Organisation
- Sundar Pichai - Person
- Minnesota - Location
- Roi Centre Event - Organisation

**Let's see a demo of NER using NLTK**


In [86]:
from nltk import ne_chunk

In [87]:
NE_sent = "The US President stays in the WHITE HOUSE"

In [88]:
# Step 1 - Tokenise 
NE_tokens = word_tokenize(NE_sent)
# Step 2 - Tag
NE_tags = nltk.pos_tag(NE_tokens)

In [89]:
NE_NER = ne_chunk(NE_tags)
print(NE_NER)

(S
  The/DT
  (ORGANIZATION US/NNP)
  President/NNP
  stays/VBZ
  in/IN
  the/DT
  (FACILITY WHITE/NNP HOUSE/NNP))


**NER chunking builds on the POS tags; without POS tagging this would not have been possible**
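Once the chunk tree exists, the named entities can be read off its subtrees. The sketch below rebuilds the tree printed above by hand, so that it runs without downloading any NLTK models; the variable names are chosen for this example.

```python
from nltk.tree import Tree

# The chunk tree printed above, rebuilt by hand so no model download is needed
ne_tree = Tree("S", [
    ("The", "DT"),
    Tree("ORGANIZATION", [("US", "NNP")]),
    ("President", "NNP"),
    ("stays", "VBZ"),
    ("in", "IN"),
    ("the", "DT"),
    Tree("FACILITY", [("WHITE", "NNP"), ("HOUSE", "NNP")]),
])

# Every subtree other than the root "S" is one named entity
entities = [
    (subtree.label(), " ".join(word for word, tag in subtree.leaves()))
    for subtree in ne_tree.subtrees(filter=lambda t: t.label() != "S")
]
print(entities)  # [('ORGANIZATION', 'US'), ('FACILITY', 'WHITE HOUSE')]
```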

## Syntax -

Syntax is the set of rules, principles and processes that govern the structure of sentences in a given language. The term syntax is also used to refer to the study of such principles and processes.

- Principles
- Rules
- Processes

These rules state which part of a sentence should come at which position, and with them one can build a syntax tree for any input sentence.

In layman's terms, a syntax tree is a tree representation of the syntactic structure of a sentence or string. The same idea is used to represent the syntax of a programming language as a hierarchical tree, which compilers use to generate symbol tables and, later, code. The tree represents all the constructs in the language and the rules that produce them.

Let's consider one example, **The Cat Sat on the Mat**. The input is a sentence, and it is split into a noun phrase and a prepositional phrase. The noun phrase is split into an article and a noun; then comes the verb, **sat**; and finally the prepositional phrase contains the preposition **on**, the article **the** and the noun **mat**. To render syntax trees graphically in the notebook, you need to install Ghostscript, a rendering engine, which can take quite some time.
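A syntax tree like the one described can be written down and inspected with NLTK's `Tree` class without any graphical rendering. The bracketing below is written by hand for this example (an assumed parse, not the output of a parser).

```python
from nltk.tree import Tree

# Bracketing written by hand for this example (an assumed parse, not parser output)
syntax_tree = Tree.fromstring(
    "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)

print(syntax_tree.label())   # S
print(syntax_tree.leaves())  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
syntax_tree.pretty_print()   # draws the tree as plain text
```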

## Chunking -

Chunking is the process of analyzing sentence structure by grouping words into phrases.

Chunking basically means picking up individual pieces of information and grouping them into bigger pieces, known as **chunks**. In the context of NLP and text mining, chunking means grouping words or tokens into chunks.

Let's see an example -

**We Caught the Black Panther**

Let's break the above sentence down -

- We - PRP - NP - (chunked into a Noun Phrase (NP))
- Caught - VBD
- the - DT - NP - (chunked together into a Noun Phrase (NP))
- Black - JJ - NP - (chunked together into a Noun Phrase (NP))
- Panther - NN - NP - (chunked together into a Noun Phrase (NP))

The noun phrases *We* and *the Black Panther* are the **chunks** here.

In [90]:
new = "The big cat ate the little mouse who was after fresh cheese"
new_tokens = nltk.pos_tag(word_tokenize(new))
new_tokens

[('The', 'DT'),
 ('big', 'JJ'),
 ('cat', 'NN'),
 ('ate', 'VBD'),
 ('the', 'DT'),
 ('little', 'JJ'),
 ('mouse', 'NN'),
 ('who', 'WP'),
 ('was', 'VBD'),
 ('after', 'IN'),
 ('fresh', 'JJ'),
 ('cheese', 'NN')]

In [91]:
grammer_np = r"NP: {<DT>?<JJ>*<NN>}"

In [92]:
chunk_parser = nltk.RegexpParser(grammer_np)

In [108]:
chunk_result = chunk_parser.parse(new_tokens)
# chunk_result

**This is how we run chunking with NLTK**
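To make the chunking step concrete, here is a self-contained sketch that reuses the tagged tokens shown above, so it needs no tagger model. It assumes `nltk.RegexpParser` with a noun-phrase rule of an optional determiner, any number of adjectives and a noun; the variable names are chosen for this example.

```python
import nltk

# POS-tagged tokens copied from the output above
tagged = [('The', 'DT'), ('big', 'JJ'), ('cat', 'NN'), ('ate', 'VBD'),
          ('the', 'DT'), ('little', 'JJ'), ('mouse', 'NN'), ('who', 'WP'),
          ('was', 'VBD'), ('after', 'IN'), ('fresh', 'JJ'), ('cheese', 'NN')]

# NP = optional determiner, any number of adjectives, then a noun
chunker = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN>}")
tree = chunker.parse(tagged)

# Collect the words of every NP subtree
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
print(noun_phrases)  # ['The big cat', 'the little mouse', 'fresh cheese']
```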

So far we have learnt the individual steps and applied them one by one. Now it is time to apply them **together** and build a machine learning classifier on the movie reviews from the NLTK corpora.

In [109]:
# Import the libraries - 

import pandas as pd
import numpy as np

In [110]:
from sklearn.feature_extraction.text import CountVectorizer

In [111]:
print(os.listdir(nltk.data.find('corpora')))

['stopwords', 'words', 'wordnet.zip', 'stopwords.zip', 'gutenberg', 'words.zip', 'movie_reviews', 'wordnet', 'gutenberg.zip', 'movie_reviews.zip']


In [112]:
from nltk.corpus import movie_reviews

In [113]:
print(movie_reviews.categories())

['neg', 'pos']


In [114]:
print(len(movie_reviews.fileids('pos')))
print(' ')
print(movie_reviews.fileids('pos'))

1000
 
['pos/cv000_29590.txt', 'pos/cv001_18431.txt', 'pos/cv002_15918.txt', 'pos/cv003_11664.txt', 'pos/cv004_11636.txt', 'pos/cv005_29443.txt', 'pos/cv006_15448.txt', 'pos/cv007_4968.txt', 'pos/cv008_29435.txt', 'pos/cv009_29592.txt', 'pos/cv010_29198.txt', 'pos/cv011_12166.txt', 'pos/cv012_29576.txt', 'pos/cv013_10159.txt', 'pos/cv014_13924.txt', 'pos/cv015_29439.txt', 'pos/cv016_4659.txt', 'pos/cv017_22464.txt', 'pos/cv018_20137.txt', 'pos/cv019_14482.txt', 'pos/cv020_8825.txt', 'pos/cv021_15838.txt', 'pos/cv022_12864.txt', 'pos/cv023_12672.txt', 'pos/cv024_6778.txt', 'pos/cv025_3108.txt', 'pos/cv026_29325.txt', 'pos/cv027_25219.txt', 'pos/cv028_26746.txt', 'pos/cv029_18643.txt', 'pos/cv030_21593.txt', 'pos/cv031_18452.txt', 'pos/cv032_22550.txt', 'pos/cv033_24444.txt', 'pos/cv034_29647.txt', 'pos/cv035_3954.txt', 'pos/cv036_16831.txt', 'pos/cv037_18510.txt', 'pos/cv038_9749.txt', 'pos/cv039_6170.txt', 'pos/cv040_8276.txt', 'pos/cv041_21113.txt', 'pos/cv042_10982.txt', 'pos/cv043_1

In [115]:
neg_rev = movie_reviews.fileids('neg')
# len(neg_rev)

In [116]:
rev = nltk.corpus.movie_reviews.words('pos/cv000_29590.txt')
rev

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]

The review above is already tokenized; we now use the **join** method to join all the tokens of the list into a single string.

In [117]:
rev_list = []

In [118]:
# Negative reviews: join each review's tokens back into one string
for rev in neg_rev:
    rev_words = nltk.corpus.movie_reviews.words(rev)
    review_one_string = " ".join(rev_words)
    # remove the space that join() leaves before commas and full stops
    review_one_string = review_one_string.replace(' ,',',')
    review_one_string = review_one_string.replace(' .','.')
    review_one_string = review_one_string.replace("\' " , "''")
    review_one_string = review_one_string.replace("\'", "''")
    rev_list.append(review_one_string)
    

**Above, we joined the tokens of each review into one string, removed the extra spaces before commas and full stops, and appended the string to the list. The same is done for the positive reviews below.**

The loop above handles the **negative** reviews; we need to do the same for the **positive** reviews as well.

In [119]:
len(rev_list)

1000

So far **rev_list** holds only the negative reviews; once the positive reviews are added as well, its length will reach **2000**.

In [120]:
# Positive Reviews - 

pos_rev = movie_reviews.fileids('pos')
# len(pos_rev)


In [121]:
for rev in pos_rev:
    rev_words = nltk.corpus.movie_reviews.words(rev)
    review_one_string = " ".join(rev_words)
    review_one_string = review_one_string.replace(' ,',',')
    review_one_string = review_one_string.replace(' .','.')
    review_one_string = review_one_string.replace("\' " , "''")
    review_one_string = review_one_string.replace("\'", "''")
    rev_list.append(review_one_string)
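Since the negative and positive loops differ only in the file list, the cleanup could be factored into one helper. The sketch below mirrors the same four replacements; the name `join_review_tokens` is hypothetical.

```python
def join_review_tokens(tokens):
    """Join tokens into one string and tidy spacing, mirroring the loops above."""
    text = " ".join(tokens)
    text = text.replace(" ,", ",")   # no space before commas
    text = text.replace(" .", ".")   # no space before full stops
    text = text.replace("' ", "''")  # same apostrophe handling as the loops
    text = text.replace("'", "''")
    return text

print(join_review_tokens(["films", "adapted", ",", "mostly", "."]))
# films adapted, mostly.
```

With such a helper, both loops could be reduced to calling it once per file id in `neg_rev` and `pos_rev` and appending the result to `rev_list`.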

In [122]:
len(rev_list)

2000

**Now let's create the targets before creating the features for our classifier**

For the targets, we label the negative reviews as **zero** and the positive reviews as **one**. We create an empty list and add a thousand zeros followed by a thousand ones to it.

In [123]:
neg_targets = np.zeros((1000),dtype=int)  # np.int was removed in NumPy 1.24; the built-in int works
pos_targets = np.ones((1000),dtype=int)

In [125]:
target_list = []
for neg_tar in neg_targets:
    target_list.append(neg_tar)
for pos_tar in pos_targets:
    target_list.append(pos_tar)

In [126]:
len(target_list)

2000
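The same target vector can be built in one step, without the explicit loops. This is a sketch of an alternative, with the variable name `targets` chosen for the example; passing it to `pd.Series` would then give an equivalent `y`.

```python
import numpy as np

# 1000 zeros (negative reviews) followed by 1000 ones (positive reviews)
targets = np.concatenate([np.zeros(1000, dtype=int), np.ones(1000, dtype=int)])

print(targets.shape)              # (2000,)
print(targets[:3], targets[-3:])  # [0 0 0] [1 1 1]
```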

Now we create a pandas Series from the target list, so that the type of **y** is a pandas Series.

In [127]:
y = pd.Series(target_list)

In [128]:
type(y)

pandas.core.series.Series

Now let's look at the first five entries of the series. It holds a thousand zeros followed by a thousand ones, so the first five entries are all zeros.

In [129]:
y.head()

0    0
1    0
2    0
3    0
4    0
dtype: int64

Now we can start creating features using the count vectorizer, i.e. the bag-of-words model. For that we need to import `CountVectorizer`.

In [130]:
from sklearn.feature_extraction.text import CountVectorizer

In [131]:
count_vect = CountVectorizer(lowercase=True,stop_words='english',min_df=2)

Now, we need to fit it onto the rev list. 

In [132]:
X_count_vect = count_vect.fit_transform(rev_list)

Now, if we see the shape of this - 

In [133]:
X_count_vect.shape

(2000, 23784)
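The shape reads as 2000 documents by 23784 vocabulary words: each row is one review and each column counts one word. A plain-Python sketch of the bag-of-words idea on three toy documents (an illustration of the concept, not scikit-learn's implementation):

```python
from collections import Counter

docs = ["good movie", "bad movie", "good good film"]

# Vocabulary: every distinct word, sorted so each column has a fixed position
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, values are counts
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['bad', 'film', 'good', 'movie']
print(matrix)      # [[0, 0, 1, 1], [1, 0, 0, 1], [0, 1, 2, 0]]
```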

Now we create a list with the names of all the features from the vectorizer.

In [134]:
X_names = count_vect.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
X_names

['00',
 '000',
 '007',
 '05',
 '10',
 '100',
 '1000',
 '100m',
 '101',
 '102',
 '103',
 '105',
 '106',
 '107',
 '108',
 '10th',
 '11',
 '110',
 '113',
 '115',
 '11th',
 '12',
 '126',
 '129',
 '13',
 '130',
 '132',
 '137',
 '13th',
 '14',
 '14th',
 '15',
 '150',
 '1500s',
 '155',
 '15th',
 '16',
 '160',
 '1600',
 '161',
 '16mm',
 '16th',
 '16x9',
 '17',
 '175',
 '1773',
 '17th',
 '18',
 '180',
 '1800s',
 '1839',
 '1869',
 '1871',
 '1888',
 '18th',
 '19',
 '1900',
 '1912',
 '1914',
 '1919',
 '1925',
 '1928',
 '1930',
 '1930s',
 '1932',
 '1933',
 '1935',
 '1937',
 '1938',
 '1939',
 '1940',
 '1940s',
 '1941',
 '1943',
 '1944',
 '1945',
 '1947',
 '1948',
 '1949',
 '1950',
 '1950s',
 '1953',
 '1954',
 '1957',
 '1958',
 '1959',
 '1960',
 '1960s',
 '1961',
 '1962',
 '1963',
 '1964',
 '1965',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1970s',
 '1971',
 '1972',
 '1973',
 '1974',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1980s',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '198

Now we create a pandas DataFrame by passing the SciPy CSR matrix (converted to a dense array) as the values and the feature names as the column names.

In [135]:
X_count_vect = pd.DataFrame(X_count_vect.toarray(), columns=X_names)

Now check the dimensions of this pandas DataFrame; as you can see, they are the same as before.

In [136]:
X_count_vect.shape

(2000, 23784)

Now let's look at the top five rows of the data frame -

In [137]:
X_count_vect.head()

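|   | 00 | 000 | 007 | 05 | 10 | 100 | 1000 | 100m | 101 | 102 | ... | zoom | zooming | zooms | zoot | zorg | zorro | zucker | zuko | zwick | zwigoff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |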


As you can see, the data frame has 23784 columns, and almost all of the displayed entries are **0**, because each review uses only a small part of the vocabulary.

Next we split the data frame into training and testing sets, and then examine both sets -

In [138]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix

In [139]:
X_train_cv, X_test_cv, y_train_cv, y_test_cv = train_test_split(X_count_vect, y, test_size=0.25, random_state=5)

Here `test_size` is set to 0.25, so the test set gets 25% of the data and the training set the remaining 75%.

In [140]:
X_train_cv.shape

(1500, 23784)

In [141]:
X_test_cv.shape

(500, 23784)

In [142]:
y_train_cv.shape

(1500,)

In [143]:
y_test_cv.shape

(500,)

**Now that our data is split, I will use a Naive Bayes classifier for text classification on the training and testing sets.**

In [144]:
from sklearn.naive_bayes import GaussianNB

**Naive Bayes** - This is a classification technique based on Bayes' theorem, with an assumption of independence among the predictors.

In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
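Written out, Bayes' theorem combined with the independence assumption gives the decision rule below. The symbols are introduced here for illustration: $c$ is a class (negative or positive) and $x_1, \dots, x_n$ are the features of a review.

$$
P(c \mid x_1, \dots, x_n) = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)} \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

$$
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

The denominator is the same for every class, so it can be dropped when comparing classes; the classifier simply picks the class with the largest product.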

To implement the Naive Bayes algorithm in Python, I will use the library and functions below. We instantiate the classifier and fit it with the training features and labels. We import the multinomial Naive Bayes variant, which is suited to features that are word counts, as ours are.

In [145]:
from sklearn.naive_bayes import MultinomialNB

In [146]:
clf_cv = MultinomialNB()

In [147]:
clf_cv.fit(X_train_cv,y_train_cv)

MultinomialNB()

We have now fitted the multinomial Naive Bayes classifier on the **training** data. Next we pass the **test** features to its `predict` method to get the predicted labels.

In [149]:
y_pred_cv = clf_cv.predict(X_test_cv)
type(y_pred_cv)

numpy.ndarray

Now let's check the accuracy of the predictions -

In [150]:
print(metrics.accuracy_score(y_test_cv,y_pred_cv))

0.798


The accuracy here is 0.798, i.e. about 80% of the test reviews are classified correctly. You can also check the confusion matrix for the same predictions.

In [151]:
score_clf_cv = confusion_matrix(y_test_cv,y_pred_cv)
score_clf_cv

array([[213,  45],
       [ 56, 186]])
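The accuracy reported earlier can be recomputed from this matrix. In scikit-learn's convention the rows are the true labels and the columns the predicted labels, so the four cells are the true negatives, false positives, false negatives and true positives. The arithmetic below is a worked check; precision and recall are added here for illustration and were not computed in the notebook.

```python
# Confusion matrix from above: rows are true labels, columns are predictions
tn, fp = 213, 45   # true negatives, false positives
fn, tp = 56, 186   # false negatives, true positives

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found

print(total)                 # 500
print(round(accuracy, 3))    # 0.798
print(round(precision, 3))   # 0.805
print(round(recall, 3))      # 0.769
```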