# Bag of Words

One way to understand the context of a statement is to analyze just its word frequencies. There is a drawback to this approach: language encodes meaning partly through word order, and word counts discard that order. Still, this approach can be effective with the right dataset and goal.
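
To make the word-order drawback concrete, here is a minimal sketch using two made-up sentences (they are not taken from the Yelp data). The sentences say opposite things, yet their word counts are identical.

```python
from collections import Counter

# Two invented sentences with opposite meanings but the same words
sentence_a = "the food was good not bad"
sentence_b = "the food was bad not good"

bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

# Counters compare equal when every word has the same count,
# so the two bags of words cannot be told apart
same_bag = bag_a == bag_b
print(same_bag)
```

Any model that sees only `bag_a` and `bag_b` receives the same input for both sentences.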

In [1]:
import pandas as pd
import numpy as np
fname = "../Resources/yelp_reviews.csv"
data = pd.read_csv(fname)
data.head()

| | class | text |
|---|---|---|
| 0 | positive | Wow... Loved this place. |
| 1 | negative | Crust is not good. |
| 2 | negative | Not tasty and the texture was just nasty. |
| 3 | positive | Stopped by during the late May bank holiday of... |
| 4 | positive | The selection on the menu was great and so wer... |


## Converting text to word frequencies

Removing filler words will help the model, since these words rarely add meaning. The text is then converted to a vector of numbers, where each value is the frequency of a word and each position in the vector identifies the word.

In [2]:
# convert text to word frequencies
from collections import Counter
text = "Peter Piper picked a peck of pickled peppers; A peck of pickled peppers Peter Piper picked"
word_frequency = Counter(text.split())
print("Vector of frequencies:\n {}".format(list(word_frequency.values())))
print("Corresponding words:\n {}".format(list(word_frequency.keys())))

Vector of frequencies:
 [2, 2, 2, 1, 2, 2, 2, 1, 1, 1]
Corresponding words:
 ['Peter', 'Piper', 'picked', 'a', 'peck', 'of', 'pickled', 'peppers;', 'A', 'peppers']
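
When there is more than one document, the positions only stay meaningful if every vector uses the same vocabulary in the same order. The sketch below illustrates this with three invented snippets. The helper `to_vector` and the names `docs` and `count_vectors` are hypothetical and exist only for this example.

```python
from collections import Counter

# Invented example documents (not the Yelp data)
docs = ["loved this place", "crust is not good", "not good not tasty"]

# One shared, sorted vocabulary: position i means the same word in every vector
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc, vocabulary):
    """Return the word counts of `doc`, ordered by `vocabulary`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

count_vectors = [to_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for vec in count_vectors:
    print(vec)
```

Words missing from a document get a count of 0, which is why real bag-of-words matrices are mostly zeros.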


In [3]:
# Preprocessing steps:
#   - normalize text (lowercase)
#   - remove special characters
#   - remove stop words
#   - stem words

from string import punctuation
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# download the stop word list and the punkt tokenizer data (only needed once)
nltk.download("stopwords")
nltk.download("punkt")

# note: this rebinds the name `stopwords` from the corpus module to a plain list
stopwords = stopwords.words("english") + list(punctuation)
stemmer = PorterStemmer()

text = text.lower()
words = nltk.word_tokenize(text)
words = [stemmer.stem(w) for w in words if w not in stopwords]
words

['peter',
 'piper',
 'pick',
 'peck',
 'pickl',
 'pepper',
 'peck',
 'pickl',
 'pepper',
 'peter',
 'piper',
 'pick']

In [4]:
Counter(words)

Counter({'peter': 2,
         'piper': 2,
         'pick': 2,
         'peck': 2,
         'pickl': 2,
         'pepper': 2})

## Sentiment Classifier

To classify a Yelp review as Positive or Negative, we use word frequencies, on the assumption that some words are used more than others when expressing sentiment. Using raw word frequency alone has a drawback: common words don't convey much meaning, yet they would receive high importance weights. **Solution:** Divide the word frequency by the number of documents the word appears in. This is called **Term Frequency Inverse Document Frequency**, or **TFIDF**. Common implementations, including scikit-learn's, use a logarithmically scaled version of this ratio.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

# This vectorizer breaks text into single words and bi-grams
# and then calculates the TF-IDF representation
vectorizer = TfidfVectorizer(ngram_range=(1,2), lowercase=True, stop_words=stopwords)

vectors = vectorizer.fit_transform(data["text"][:1])
pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names()).head()

| | loved | loved place | place | wow | wow loved |
|---|---|---|---|---|---|
| 0 | 0.447214 | 0.447214 | 0.447214 | 0.447214 | 0.447214 |
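
The value 0.447214 in the single-review output can be reproduced by hand. The sketch below is a plain-Python approximation, not the library's code. It assumes scikit-learn's documented defaults: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` and L2 normalization of each row. The function name `tfidf_l2` is hypothetical.

```python
import math
from collections import Counter

def tfidf_l2(docs_tokens):
    """TF-IDF with smoothed IDF and L2-normalized rows (assumed defaults)."""
    n_docs = len(docs_tokens)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    rows = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        weights = {
            t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t in tf
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # guard against empty documents, whose norm would be zero
        rows.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return rows

# The five features of the first review after stop word removal:
# three single words and two bi-grams, each occurring once
tokens = ["wow", "loved", "place", "wow loved", "loved place"]
row = tfidf_l2([tokens])[0]
print(round(row["wow"], 6))
```

With a single document every IDF equals 1, so all five features share the weight 1/sqrt(5), which rounds to 0.447214.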


In [6]:
vectors = vectorizer.fit_transform(data["text"])
pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names()).head()

Unnamed: 0,00,10,10 minutes,10 times,100,100 recommended,100 times,11,11 99,12,...,yum,yum sauce,yum yum,yummy,yummy christmas,yummy try,yummy tummy,zero,zero stars,zero taste
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
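# Custom tokenizer: lowercase, tokenize, drop stop words, stem

punc_list = list(punctuation)
# also drop the ellipsis token; build the stop list once instead of per word
stop_list = stopwords + ["..."]

def special_remove(word):
    """Return True for tokens of at most 2 characters that contain punctuation."""
    if len(word) > 2:
        return False
    for c in word:
        if c in punc_list:
            return True
    return False

# custom function that overrides default token generation
def custom_tokenizer(text):
    text = text.lower()
    words = nltk.word_tokenize(text)
    words = [stemmer.stem(w) for w in words if w not in stop_list]
    # further remove short tokens that contain a special character
    words = [w for w in words if not special_remove(w)]
    return words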
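
The length check in `special_remove` decides which punctuation-bearing tokens survive. The sketch below repeats that function unchanged and applies it to a hand-picked token list. The list is an assumption for illustration, not output from the tokenizer.

```python
from string import punctuation

punc_list = list(punctuation)

def special_remove(word):
    # same logic as the notebook cell: only tokens of length <= 2 are candidates
    if len(word) > 2:
        return False
    for c in word:
        if c in punc_list:
            return True
    return False

# Hand-picked tokens of the kind a tokenizer might produce (illustrative only)
tokens = ["'s", "!", "n't", "'ll", "...", "ok", "good"]
kept = [t for t in tokens if not special_remove(t)]
print(kept)
```

Three-character tokens such as `'ll` and `...` pass the filter because of the length check. That is consistent with `"..."` being added to the stop list explicitly, and with features like `'ll back` appearing in the vectorizer output.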

In [8]:
vectorizer = TfidfVectorizer(ngram_range=(1,2), tokenizer=custom_tokenizer)

vectors = vectorizer.fit_transform(data["text"])
pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names()).head()

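| | 'll | 'll back | 'll definit | 'll done | 'll go | 'll hit | 'll impress | 'll leav | 'll never | 'll regular | ... | yum | yum sauc | yum yum | yummi | yummi christma | yummi tri | yummi tummi | zero | zero star | zero tast |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |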


## Naive Bayes Model

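**TFIDF** converts text into numeric values, but a model must still turn those numbers into probabilities of Positive or Negative sentiment. A Naive Bayes model computes a probability by

* Assuming the features (words and bi-grams) are independent of one another given the class
* Applying Bayes' theorem: multiplying the prior probability of each class by the likelihood of each feature under that class, then choosing the class with the highest result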
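
The independence assumption can be made concrete with a small from-scratch sketch. This is not scikit-learn's implementation. It uses raw word counts rather than TFIDF values and a tiny invented training set. The names `train_nb` and `predict_nb` are hypothetical, and `alpha=1.0` is add-one smoothing.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_label in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_label / len(labels))
        # add-alpha smoothing keeps unseen words from getting probability zero
        log_like = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        # independence: each known word contributes its own term to the sum
        scores[label] = log_prior + sum(log_like[w] for w in doc if w in log_like)
    return max(scores, key=scores.get)

# Invented, already tokenized and stemmed training examples
train_docs = [["love", "place"], ["great", "food"], ["bad", "crust"], ["nasti", "textur"]]
train_labels = ["positive", "positive", "negative", "negative"]

nb = train_nb(train_docs, train_labels)
prediction = predict_nb(nb, ["great", "place"])
print(prediction)
```

Working with sums of logarithms instead of products of probabilities avoids numeric underflow when a document contains many features.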

In [9]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
X = vectors.toarray()
y = data[["class"]].values
# y is a 2-D column of shape (n, 1); flatten it to the 1-D shape fit() expects
model.fit(X, y.ravel())


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [10]:
model.predict(X[2:3])

array(['negative'], dtype='<U8')

In [11]:
y[2:3]

array([['negative']], dtype=object)

In [12]:
pd.DataFrame({"actual": y.reshape(-1), "prediction": model.predict(X)}).head()

| | actual | prediction |
|---|---|---|
| 0 | positive | positive |
| 1 | negative | negative |
| 2 | negative | negative |
| 3 | positive | positive |
| 4 | positive | positive |


In [13]:
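# Count the predictions that match the true labels.
# Note: X is the same data the model was fit on, so this is training accuracy.
correct_preds = sum(
    actual == predict
    for actual, predict in zip(y.reshape(-1), model.predict(X))
)

total_preds = len(y.reshape(-1))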

In [14]:
correct_preds / total_preds

0.988