In [21]:
import nltk
import random
from nltk.corpus import movie_reviews
import pprint
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
import pickle

**There are a thousand movie reviews in each of the two categories:**

- positive and
- negative

In [2]:
movie_reviews.categories()

['neg', 'pos']

### Now I need to store it as a list of (words, category) tuples

```python
documents = [
    (['good', 'awesome', ...], 'pos'),
    (['ridiculous', 'horrible', ...], 'neg')
]
```

**OR** 

Storing it in a dictionary keyed by category is another option; I will try both.

```python
documents = {
    'pos': ['good', 'awesome', ....],
    'neg': ['ridiculous', 'horrible', ...]
}
```

In [3]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)
            ]
random.shuffle(documents)

#### The other way is to use explicit loops instead of the one-liner, here filling the dictionary form

In [4]:
document_dict = {
    'pos': [],
    'neg': []
}
# a set makes the membership test below much faster than a list
stop_word_set = set(stop_words)
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        # the list of words read from the file identified by fileid
        raw_list = movie_reviews.words(fileid)
        # drop the stopwords
        word_list = [word for word in raw_list if word not in stop_word_set]
        # category is either 'pos' or 'neg', matching the dictionary keys
        document_dict[category].extend(word_list)


**Number of words left in each category after removing the stopwords**

In [5]:
print(len(document_dict['pos']))
print(len(document_dict['neg']))

506495
457774


**Getting the list of all words so that we can find the most frequently occurring ones**

In [6]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

**Making a frequency distribution of the words**

In [7]:
all_words = nltk.FreqDist(all_words)
all_words.most_common(20)

[(',', 77717),
 ('the', 76529),
 ('.', 65876),
 ('a', 38106),
 ('and', 35576),
 ('of', 34123),
 ('to', 31937),
 ("'", 30585),
 ('is', 25195),
 ('in', 21822),
 ('s', 18513),
 ('"', 17612),
 ('it', 16107),
 ('that', 15924),
 ('-', 15595),
 (')', 11781),
 ('(', 11664),
 ('as', 11378),
 ('with', 10792),
 ('for', 9961)]

In [8]:
all_words["hate"]  ## counting the occurrences of a single word

134

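**We will train on only 5000 words from the vocabulary. Note that `all_words.keys()` is not sorted by frequency, so the slice below is not the 5000 most common words; `all_words.most_common(5000)` would give those.**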

In [9]:
feature_words = list(all_words.keys())[:5000]
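
If the goal is the most frequent words, one option is to go through `most_common`. The sketch below uses `collections.Counter` (which `nltk.FreqDist` subclasses) on a made-up token list, so it runs without the corpus; `tokens` and `top_words` are illustrative names, not part of the notebook.

```python
from collections import Counter

# Hypothetical toy tokens standing in for movie_reviews.words()
tokens = ["the", "movie", "the", "plot", "the", "movie", "good"]

freq = Counter(w.lower() for w in tokens)

# most_common(n) returns (word, count) pairs sorted by descending count
top_words = [word for word, _count in freq.most_common(2)]
print(top_words)  # ['the', 'movie']
```

With the real data the same pattern would read `[w for w, _ in all_words.most_common(5000)]`. Because punctuation and stopwords dominate the top of the list, filtering them out first may be worth trying.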

### Checking which of these feature words occur in a document; a helper function makes this easy

In [10]:
def find_features(document):
    words = set(document)
    feature = {}
    for w in feature_words:
        feature[w] = (w in words)
    return feature
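
To see what the function returns, here is a small self-contained run. The three-word `feature_words` list and the sample review are made up for illustration; the function body is the same logic written as a dictionary comprehension.

```python
# Hypothetical tiny feature list; the notebook uses 5000 words
feature_words = ["good", "bad", "plot"]

def find_features(document):
    # a set gives fast membership tests and ignores repeated words
    words = set(document)
    return {w: (w in words) for w in feature_words}

example = find_features(["a", "good", "plot", "but", "slow"])
print(example)  # {'good': True, 'bad': False, 'plot': True}
```

Each review therefore becomes a dictionary with one boolean per feature word, regardless of how long the review is.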

**Previously each entry of `documents` was a (word list, category) pair. Now each entry of `feature_sets` is a (feature dictionary, category) pair, where the dictionary maps every feature word to a boolean saying whether that word occurs in the review.**

In [11]:
feature_sets = [(find_features(rev), category) for (rev, category) in documents]

In [12]:
print(feature_sets[:1])

[({'republic': False, 'seat': False, 'cider': False, 'forster': False, 'loitering': False, 'coto': False, 'subconscience': False, 'every': False, 'pricey': False, 'syd': False, 'readers': False, 'housed': False, 'database': False, 'rollickingly': False, 'approximately': False, 'hysteria': False, 'mistic': False, '1942': False, 'greenstreet': False, 'cystic': False, 'breast': False, 'dreyer': False, 'disorientation': False, 'pomp': False, 'greenway': False, 'jurek': False, 'transcends': False, 'pileggi': False, 'spielberg': False, 'gagging': False, 'gripe': False, 'septien': False, 'posters': False, 'malevolence': False, 'fountain': False, 'humbert': False, 'interpersonal': False, 'major': False, 'revealing': False, 'shepherd': False, 'wim': False, 'gushing': False, 'bangers': False, 'extract': False, '_looks_': False, 'vines': False, 'calling': False, 'polymorphously': False, 'isaacman': False, 'communities': False, 'traveller': False, 'lugubrious': False, 'kickboxer': False, 'newton':

### Training the classifier

In [13]:
training_set = feature_sets[:1900]
testing_set = feature_sets[1900:]

For the testing set we won't tell the machine the category, i.e. whether the document is a positive or a negative one. We ask it to predict that instead. Then we compare each prediction with the known category and calculate how accurate the classifier is.
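
The comparison described above amounts to counting correct predictions. A minimal sketch, assuming a hypothetical `predict` function and a made-up labelled set in the same (features, category) shape as `feature_sets`:

```python
def accuracy(predict, labeled_examples):
    """Fraction of examples whose predicted label equals the known label."""
    correct = sum(1 for features, label in labeled_examples
                  if predict(features) == label)
    return correct / len(labeled_examples)

# Hypothetical rule standing in for a trained classifier
def predict(features):
    return "pos" if features["good"] else "neg"

toy_test_set = [
    ({"good": True}, "pos"),
    ({"good": False}, "neg"),
    ({"good": True}, "neg"),   # the rule gets this one wrong
    ({"good": False}, "neg"),
]
print(accuracy(predict, toy_test_set))  # 0.75
```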

## Naive Bayes algorithm

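It is based on Bayes' theorem, which states that

\begin{equation*}
\mathrm{posterior} = \frac{\mathrm{prior} \times \mathrm{likelihood}}{\mathrm{evidence}}
\end{equation*}

Here the posterior is the probability of a category given the words observed in the document. The prior is how common the category is overall, the likelihood is how probable the observed words are within that category, and the evidence is the same for every category, so it does not affect which category scores highest.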
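
As a rough idea of what a hand-written version could look like, here is a small sketch for boolean features. It works in log space, drops the evidence term (it is identical for every category), and uses add-one smoothing, which is my assumption here and not necessarily the estimator NLTK uses. The function names and toy data are illustrative.

```python
import math
from collections import defaultdict

def train_naive_bayes(labeled_features):
    """labeled_features: list of (feature_dict, label) pairs."""
    label_counts = defaultdict(int)
    # value_counts[label][feature][value] = number of documents
    value_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for features, label in labeled_features:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[label][name][value] += 1
    return label_counts, value_counts

def classify(model, features):
    label_counts, value_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for name, value in features.items():
            count = value_counts[label][name][value]
            # add-one smoothing over the two boolean values (assumption)
            score += math.log((count + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data in the same shape as feature_sets
toy_train = [
    ({"good": True, "bad": False}, "pos"),
    ({"good": True, "bad": False}, "pos"),
    ({"good": False, "bad": True}, "neg"),
    ({"good": False, "bad": True}, "neg"),
]
model = train_naive_bayes(toy_train)
print(classify(model, {"good": True, "bad": False}))  # pos
```

The "naive" part is the inner loop: each feature contributes its own log likelihood independently of the others.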

In [18]:
## TO-DO: build my own naive Bayes algorithm
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [20]:
## Testing its accuracy
print("Naive bayes classifier accuracy percentage : ", (nltk.classify.accuracy(classifier, testing_set))*100)

Naive bayes classifier accuracy percentage :  65.0


In [17]:
classifier.show_most_informative_features(20)

Most Informative Features
                  hatred = True              pos : neg    =      9.7 : 1.0
                reminder = True              pos : neg    =      9.0 : 1.0
                fairness = True              neg : pos    =      9.0 : 1.0
              weaknesses = True              pos : neg    =      8.4 : 1.0
                 wasting = True              neg : pos    =      8.3 : 1.0
                    deft = True              pos : neg    =      7.7 : 1.0
                 symbols = True              pos : neg    =      7.7 : 1.0
              compensate = True              neg : pos    =      7.0 : 1.0
             overwrought = True              neg : pos    =      7.0 : 1.0
                  crappy = True              neg : pos    =      7.0 : 1.0
                    mena = True              neg : pos    =      7.0 : 1.0
                  suvari = True              neg : pos    =      7.0 : 1.0
                 supreme = True              pos : neg    =      6.3 : 1.0

To read the table above, take **wasting** as an example:

> **neg : pos    =      8.3 : 1.0**

means that the share of **neg** reviews containing the word is about **8.3** times the share of **pos** reviews containing it.
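
A ratio like this can be pictured with raw counts. The numbers below are made up purely for illustration, and `presence_ratio` is a hypothetical helper; NLTK derives its figures from its own probability estimates, so its values may differ slightly from a raw-count calculation.

```python
def presence_ratio(docs_with_word_a, total_docs_a, docs_with_word_b, total_docs_b):
    """How many times more often a word is present in class a than in class b."""
    return (docs_with_word_a / total_docs_a) / (docs_with_word_b / total_docs_b)

# Hypothetical counts: the word occurs in 25 of 1000 neg reviews
# and in 3 of 1000 pos reviews
ratio = presence_ratio(25, 1000, 3, 1000)
print(round(ratio, 1))  # 8.3
```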

## Saving the trained classifier using **Pickle**

We will be saving Python objects to disk so that we can quickly load them again without retraining.

_`pickle` is imported in the first cell._

In [22]:
## 'wb' opens the file for writing in binary mode; the with block closes it for us
with open("naivebayes.pickle", "wb") as save_classifier:
    pickle.dump(classifier, save_classifier)
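
Loading the object back is the mirror image, using `'rb'` and `pickle.load`. The sketch below round-trips a plain dictionary as a stand-in for the classifier and writes to a temporary directory, so it runs on its own; with the real file you would open `"naivebayes.pickle"` instead.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the trained classifier
model = {"labels": ["neg", "pos"]}

path = os.path.join(tempfile.mkdtemp(), "naivebayes.pickle")

# save: 'wb' = write, binary
with open(path, "wb") as f:
    pickle.dump(model, f)

# load: 'rb' = read, binary
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.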

**We will now use this classifier in the next file to classify documents**