In [2]:
import nltk
import random
from nltk.corpus import movie_reviews
import pprint
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
import pickle

**The corpus contains a thousand movie reviews in each of two categories:**

- positive
- negative

In [3]:
movie_reviews.categories()

['neg', 'pos']

### Now I need to store the reviews as a list of `(words, category)` pairs

```python
documents = [
    (['good', 'awesome', ...], 'pos'),
    (['ridiculous', 'horrible', ...], 'neg')
]
```

**OR**

Storing it in a dictionary might also be a good idea; I will try both.

```python
documents = {
    'pos': ['good', 'awesome', ...],
    'neg': ['ridiculous', 'horrible', ...]
}
```
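To see how the two layouts differ in practice, here is a minimal sketch that uses a few made-up reviews in place of the corpus. The `toy_reviews` data and the variable names are illustrative assumptions only; the pairs are written as `(words, category)`, the order the notebook's own code produces.

```python
from collections import defaultdict

# Toy stand-in for the corpus: (words, category) pairs, made up for illustration.
toy_reviews = [
    (["good", "awesome"], "pos"),
    (["ridiculous", "horrible"], "neg"),
    (["great", "fun"], "pos"),
]

# Layout 1: a list with one (words, category) entry per review.
documents_list = list(toy_reviews)

# Layout 2: a dictionary that pools the words of every review in a category.
documents_dict = defaultdict(list)
for words, category in toy_reviews:
    documents_dict[category].extend(words)

print(len(documents_list))   # one entry per review
print(dict(documents_dict))  # words grouped by category
```

The list keeps the boundary between individual reviews, which matters later when features are extracted per document; the dictionary loses that boundary because it merges all words of a category.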

In [4]:
# Each document is (list of words with stop words removed, category)
documents = [
    ([word for word in movie_reviews.words(fileid) if word not in stop_words], category)
    for category in movie_reviews.categories()
    for fileid in movie_reviews.fileids(category)
]
random.shuffle(documents)

**Getting the list of all words so that we can find the most frequently occurring ones**

In [5]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

**Making a frequency distribution of the words**

In [6]:
all_words = nltk.FreqDist(all_words)
all_words.most_common(20)

[(',', 77717),
 ('the', 76529),
 ('.', 65876),
 ('a', 38106),
 ('and', 35576),
 ('of', 34123),
 ('to', 31937),
 ("'", 30585),
 ('is', 25195),
 ('in', 21822),
 ('s', 18513),
 ('"', 17612),
 ('it', 16107),
 ('that', 15924),
 ('-', 15595),
 (')', 11781),
 ('(', 11664),
 ('as', 11378),
 ('with', 10792),
 ('for', 9961)]

In [7]:
all_words["hate"]  # counting the occurrences of a single word

134

**We will use only 5000 words from this list as features**

In [8]:
feature_words = list(all_words.keys())[:5000]
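One thing to watch here: `list(all_words.keys())[:5000]` takes the first 5000 keys in the dictionary's own order, which is not necessarily frequency order. In NLTK 3, `FreqDist` subclasses `collections.Counter`, so `most_common(n)` is available if the most frequent words are what we want. The sketch below uses a plain `Counter` and a made-up token list (both assumptions for illustration) to show the difference.

```python
from collections import Counter

# Toy token stream standing in for movie_reviews.words().
tokens = ["ending", "the", "plot", "the", "cast", "plot", "the"]
counts = Counter(tokens)

# keys() follows insertion order, i.e. first appearance in the stream.
first_two_keys = list(counts.keys())[:2]

# most_common(n) returns (word, count) pairs sorted by descending count.
top_two = [word for word, _ in counts.most_common(2)]

print(first_two_keys)  # ['ending', 'the']
print(top_two)         # ['the', 'plot']
```

Applied to the notebook, the equivalent would be something like `[w for w, _ in all_words.most_common(5000)]`. Note that this changes which features are used, so the outputs and accuracy recorded below would no longer match.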

### Finding these feature words in each document (a helper function makes this easier)

In [9]:
def find_features(document):
    words = set(document)
    feature = {}
    for w in feature_words:
        feature[w] = (w in words)
    return feature
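A quick way to check what the function returns is to run it on a tiny example. The three-word vocabulary and the sample review below are hypothetical stand-ins for the real `feature_words` and corpus documents.

```python
# Hypothetical three-word vocabulary standing in for the 5000 feature words.
feature_words = ["good", "boring", "plot"]

def find_features(document):
    # A set makes each membership test fast and ignores repeated words.
    words = set(document)
    return {w: (w in words) for w in feature_words}

review = ["the", "plot", "was", "good", "good"]
features = find_features(review)
print(features)  # {'good': True, 'boring': False, 'plot': True}
```

Every document gets a dictionary with the same keys, so the classifier always sees the same set of features regardless of the review's length.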

**Until now, each document was a list of words paired with its category. The cell below replaces the word list with a feature set: a dictionary that maps every feature word to a boolean saying whether that word appears in the document. Each entry then becomes (feature set, category).**

In [10]:
feature_sets = [(find_features(rev), category) for (rev, category) in documents]

In [11]:
feature_sets[:1]

[({'murderers': False,
   'magoo': False,
   'autumn': False,
   'cheryl': False,
   'dalmatians': False,
   'inaction': False,
   'goateed': False,
   'entomologist': False,
   'filmed': False,
   '92s': False,
   'snipers': False,
   'unintentionally': False,
   'dragon': False,
   'wrinkles': False,
   'blasphemy': False,
   'forward': False,
   'butter': False,
   'interrelate': False,
   'tantor': False,
   'marx': False,
   'errs': False,
   'chews': False,
   'outlook': False,
   'keywords': False,
   'honkey': False,
   'replenishing': False,
   'shapes': False,
   'sommeliers': False,
   'grappling': False,
   'traps': False,
   'loathsome': False,
   'obscene': False,
   'korman': False,
   'hohh': False,
   'campaigns': False,
   'horizontally': False,
   'unearthing': False,
   'uhhm': False,
   'promoted': False,
   'discontented': False,
   'wallpaper': False,
   'lifespan': False,
   'excises': False,
   'bonnier': False,
   'lipnicki': False,
   'curious': False,
   'im

### Training the classifier

In [12]:
training_set = feature_sets[:1900]
testing_set = feature_sets[1900:]

We won't tell the machine the category of the test documents, i.e. whether a document is positive or negative. We ask it to predict that instead. Then we compare each prediction with the known category and calculate how accurate the classifier is.
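The comparison described above can be sketched in a few lines. This is only an illustration of the idea, not NLTK's implementation; `toy_predict` and `toy_test_set` are hypothetical and exist just to exercise the function.

```python
def accuracy(predict, labelled_examples):
    """Fraction of examples whose predicted label equals the known label."""
    correct = sum(
        1 for features, label in labelled_examples if predict(features) == label
    )
    return correct / len(labelled_examples)

# Hypothetical rule-based "classifier" used only to exercise the function.
def toy_predict(features):
    return "pos" if features.get("good") else "neg"

toy_test_set = [
    ({"good": True}, "pos"),
    ({"good": False}, "neg"),
    ({"good": True}, "neg"),   # the rule gets this one wrong
    ({"good": False}, "neg"),
]
print(accuracy(toy_predict, toy_test_set) * 100)  # 75.0
```

The labels in the test set are never shown to the classifier; they are used only after prediction, to count how many answers were right.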

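## Naive Bayes algorithm

It states that

\begin{equation*}
\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}
\end{equation*}

Here the posterior is the probability of a category given the words observed in a document. The prior is how common the category is, the likelihood is the probability of seeing those words in that category, and the evidence is the overall probability of seeing those words.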
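To make the formula concrete, here is a minimal sketch of a Naive Bayes classifier over boolean features. It is an assumption-laden illustration, not how NLTK's `NaiveBayesClassifier` works internally: the function names, the toy data, and the add-one smoothing are all my own choices.

```python
import math
from collections import Counter, defaultdict

def train(labelled_featuresets):
    """Count labels and, per label, how often each feature is True."""
    label_counts = Counter()
    true_counts = defaultdict(Counter)
    for features, label in labelled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            if value:
                true_counts[label][name] += 1
    return label_counts, true_counts

def posterior(features, label_counts, true_counts):
    """Return P(label | features) for every label."""
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for name, value in features.items():
            # Add-one smoothing for a boolean feature (an assumption of this sketch).
            p_true = (true_counts[label][name] + 1) / (n + 2)
            score += math.log(p_true if value else 1 - p_true)  # log likelihood
        log_scores[label] = score
    # Dividing by the evidence is the same as normalising the scores to sum to 1.
    evidence = sum(math.exp(s) for s in log_scores.values())
    return {label: math.exp(s) / evidence for label, s in log_scores.items()}

toy_training = [
    ({"good": True, "boring": False}, "pos"),
    ({"good": True, "boring": False}, "pos"),
    ({"good": False, "boring": True}, "neg"),
    ({"good": False, "boring": True}, "neg"),
]
model = train(toy_training)
probs = posterior({"good": True, "boring": False}, *model)
print(max(probs, key=probs.get))  # pos
```

The "naive" part is the inner loop: each feature's likelihood is multiplied in independently, as if the words had no influence on one another. Working with logarithms avoids numerical underflow when thousands of small probabilities are multiplied.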

In [20]:
## TODO: build my own Naive Bayes algorithm
# classifier = nltk.NaiveBayesClassifier.train(training_set)

## Saving the classifier
# with open("naive_bayes.pickle", "wb") as save_classifier:
#     pickle.dump(classifier, save_classifier)

## Now that the pickle is saved, we will load it instead of retraining.

In [21]:
## Using the pickle file now
with open("naive_bayes.pickle", "rb") as pickle_classifier:
    classifier = pickle.load(pickle_classifier)

## Testing its accuracy
print("Naive bayes classifier accuracy percentage : ", (nltk.classify.accuracy(classifier, testing_set))*100)

Naive bayes classifier accuracy percentage :  71.0


In [19]:
classifier.show_most_informative_features(20)

Most Informative Features
                  hatred = True              pos : neg    =     10.1 : 1.0
                 symbols = True              pos : neg    =      7.5 : 1.0
               balancing = True              pos : neg    =      7.5 : 1.0
               laughably = True              neg : pos    =      7.1 : 1.0
                   pixar = True              pos : neg    =      6.9 : 1.0
             fulfillment = True              pos : neg    =      6.2 : 1.0
                 labeled = True              pos : neg    =      6.2 : 1.0
                    jude = True              pos : neg    =      6.2 : 1.0
                 outlook = True              pos : neg    =      6.2 : 1.0
                  symbol = True              pos : neg    =      6.1 : 1.0
               strongest = True              pos : neg    =      6.1 : 1.0
                   jolie = True              neg : pos    =      5.9 : 1.0
                 misfire = True              neg : pos    =      5.8 : 1.0

To read the list above, take **laughably** as an example:

> **neg : pos    =      7.1 : 1.0**

This means the word is about **7.1** times more likely to appear in a **neg** review than in a **pos** review.
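The ratio can be pictured as the share of reviews containing the word in one category divided by the share in the other. The sketch below computes it on made-up data without any smoothing; NLTK works from smoothed probability estimates, so its numbers will not match a raw count like this exactly. The function name and the toy reviews are hypothetical.

```python
def likelihood_ratio(reviews_by_label, word, numerator, denominator):
    """Share of reviews containing `word` in one label divided by the other."""
    def share(label):
        reviews = reviews_by_label[label]
        return sum(1 for review in reviews if word in review) / len(reviews)
    # Raises ZeroDivisionError if the word never occurs in `denominator`.
    return share(numerator) / share(denominator)

# Made-up reviews, each reduced to a set of words.
toy_reviews = {
    "neg": [{"dull", "bad"}, {"dull"}, {"dull", "slow"}, {"flat"}],
    "pos": [{"dull", "sweet"}, {"great"}, {"fun"}, {"charming"}],
}
print(likelihood_ratio(toy_reviews, "dull", "neg", "pos"))  # 3.0
```

Here the word shows up in three of four negative reviews and one of four positive ones, giving a ratio of 3.0 : 1.0 in favour of **neg**.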

## Saving the trained algorithm using **Pickle**

We save Python objects with `pickle` so that we can quickly load them again instead of retraining.

_`pickle` is imported at the top of the notebook._
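A minimal round-trip sketch, assuming a plain dictionary in place of the trained classifier and a temporary file in place of `naive_bayes.pickle` (both are stand-ins for illustration):

```python
import os
import pickle
import tempfile

# Stand-in object; any picklable Python object, such as a trained classifier, works.
model = {"labels": ["neg", "pos"], "note": "toy object, not a real classifier"}

path = os.path.join(tempfile.mkdtemp(), "toy_model.pickle")

# "wb" and "rb": pickle reads and writes bytes, so the file is opened in binary mode.
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```

Using `with` closes the file even if dumping or loading fails. Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.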

**We will now use this classifier in the next file to classify documents**