
# Bag of words intro
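This demo mostly follows the Kaggle tutorial:
[https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/part-1-for-beginners-bag-of-words](https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/part-1-for-beginners-bag-of-words)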

In [25]:
# Read the labeled training data
import pandas as pd
raw_data = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

In [26]:
# Take a look at the raw data: we have IDs, sentiment labels, and the review text.
# Sentiment: the data set description says that an IMDB rating < 5 gives a
# sentiment score of 0, and a rating >= 7 gives a sentiment score of 1.
print(raw_data)

              id  sentiment                                             review
0       "5814_8"          1  "With all this stuff going down at the moment ...
1       "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2       "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3       "3630_4"          0  "It must be assumed that those who praised thi...
4       "9495_8"          1  "Superbly trashy and wondrously unpretentious ...
...          ...        ...                                                ...
24995   "3453_3"          0  "It seems like more consideration has gone int...
24996   "5064_1"          0  "I don't believe they made this film. Complete...
24997  "10905_3"          0  "Guy is a loser. Can't get girls, needs to bui...
24998  "10194_3"          0  "This 30 minute documentary Buñuel made in the...
24999   "8478_8"          1  "I saw this movie as a child and it broke my h...

[25000 rows x 3 columns]
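
The labelling rule quoted in the comment above can be written as a small function. This is only a sketch: `rating_to_sentiment` is a hypothetical helper (the data set already ships with the `sentiment` column), and returning `None` for ratings of 5 and 6 is an assumption based on the comment leaving that range unlabelled. The check also assumes that the number after the underscore in each `id` is the star rating, which is an inference from the ten rows printed above rather than something stated in this notebook.

```python
def rating_to_sentiment(rating):
    """Map an IMDB star rating to a binary sentiment label."""
    if rating < 5:
        return 0
    if rating >= 7:
        return 1
    return None  # assumption: ratings 5 and 6 are covered by neither rule


# (id, sentiment) pairs copied from the printed rows, surrounding quotes stripped
rows = [("5814_8", 1), ("2381_9", 1), ("7759_3", 0), ("3630_4", 0), ("9495_8", 1),
        ("3453_3", 0), ("5064_1", 0), ("10905_3", 0), ("10194_3", 0), ("8478_8", 1)]

# Assumption: the id suffix is the rating, so the rule should reproduce the label
mismatches = [(rid, label) for rid, label in rows
              if rating_to_sentiment(int(rid.split("_")[1])) != label]
print(mismatches)  # an empty list means the rule agrees with all ten rows
```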


In [27]:
# ...and a high level summary:
raw_data.describe()

       sentiment
count    25000.0
mean         0.5
std      0.50001
min          0.0
25%          0.0
50%          0.5
75%          1.0
max          1.0


In [28]:
# split into training and test sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(raw_data, test_size=0.33, random_state=42)
train.reset_index(inplace=True, drop=True) # tidy up the indices
test.reset_index(inplace=True, drop=True)

In [29]:
train.describe()

       sentiment
count    16750.0
mean    0.498806
std     0.500014
min          0.0
25%          0.0
50%          0.0
75%          1.0
max          1.0


In [30]:
test.describe()

       sentiment
count     8250.0
mean    0.502424
std     0.500024
min          0.0
25%          0.0
50%          1.0
75%          1.0
max          1.0
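
The two summaries show slightly different class balances (mean sentiment 0.4988 in train, 0.5024 in test). If you want the proportions to match more closely, `train_test_split` accepts a `stratify` argument. The sketch below uses synthetic stand-in data rather than the real reviews, so the names `reviews` and `labels` are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 "reviews" with balanced labels
reviews = [f"review {i}" for i in range(1000)]
labels = [i % 2 for i in range(1000)]

# stratify=labels asks for the same class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.33, random_state=42, stratify=labels
)

train_share = sum(y_train) / len(y_train)
test_share = sum(y_test) / len(y_test)
print(len(y_train), len(y_test))
print(round(train_share, 3), round(test_share, 3))
```

On the real data the equivalent call would pass `stratify=raw_data["sentiment"]`.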


# Preprocessing

Now we'll look at the individual preprocessing steps for a single review, before wrapping them all up into a function that will be applied to every review.

In [31]:
# We'll use BeautifulSoup to strip the html markup
from bs4 import BeautifulSoup

# Initialize the BeautifulSoup object on a single movie review
# (naming the parser explicitly avoids a warning and parser-dependent results)
example1 = BeautifulSoup(train["review"][0], "html.parser")

# Print the raw review and then the output of get_text(), for
# comparison
print(train["review"][0])
print("=====================")
print(example1.get_text())

"When I saw previews of this movie I thought that it may be dumb, but it will at least be funny. Well I was wrong. Even though somewhere deep down the producers had an interesting message to convey about parents being left alone and re-evaluating their life, the way they tried to deliver that message was horrible. The first fifty times something silly happened to the couple was relatively funny. But by the end, I could almost predict what stupid mishap is going to happen next.<br /><br />Throughout the movie I like a total of maybe five lines of dialogue and everything else was at best mediocre, which is still more than I can say for the movie itself."
"When I saw previews of this movie I thought that it may be dumb, but it will at least be funny. Well I was wrong. Even though somewhere deep down the producers had an interesting message to convey about parents being left alone and re-evaluating their life, the way they tried to deliver that message was horrible. The first fifty times s

In [32]:
# Use a regular expression to replace every character that is not a letter
# (digits, punctuation, etc.) with a space
import re
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The replacement string
                      example1.get_text())   # The text to search
print(letters_only)

 When I saw previews of this movie I thought that it may be dumb  but it will at least be funny  Well I was wrong  Even though somewhere deep down the producers had an interesting message to convey about parents being left alone and re evaluating their life  the way they tried to deliver that message was horrible  The first fifty times something silly happened to the couple was relatively funny  But by the end  I could almost predict what stupid mishap is going to happen next Throughout the movie I like a total of maybe five lines of dialogue and everything else was at best mediocre  which is still more than I can say for the movie itself  
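
One side effect of the pattern `[^a-zA-Z]` is worth seeing on a small example: it also replaces accented letters and apostrophes, so a name such as "Buñuel" (which appears in one of the reviews printed earlier) is split in two. The sample sentence below is made up for illustration, and the second pattern is an alternative offered as an assumption, not something this notebook uses.

```python
import re

# Made-up sample; \u00f1 is the letter n with a tilde, as in "Bunuel" with an accent
sample = "Bu\u00f1uel's 30 minute documentary isn't bad!"

# The notebook's pattern: anything outside a-z / A-Z becomes a space
ascii_only = re.sub("[^a-zA-Z]", " ", sample)
print(ascii_only.lower().split())

# Alternative (an assumption, not used in this notebook): keep any Unicode
# letter by removing only non-word characters, digits and underscores
unicode_letters = re.sub(r"[\W\d_]", " ", sample)
print(unicode_letters.lower().split())
```

Note that both versions break "isn't" into "isn" and "t".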


In [33]:
lower_case = letters_only.lower()        # Convert to lower case
words = lower_case.split()               # Split into words


words   # take a look at the resulting processed words in this review

['when',
 'i',
 'saw',
 'previews',
 'of',
 'this',
 'movie',
 'i',
 'thought',
 'that',
 'it',
 'may',
 'be',
 'dumb',
 'but',
 'it',
 'will',
 'at',
 'least',
 'be',
 'funny',
 'well',
 'i',
 'was',
 'wrong',
 'even',
 'though',
 'somewhere',
 'deep',
 'down',
 'the',
 'producers',
 'had',
 'an',
 'interesting',
 'message',
 'to',
 'convey',
 'about',
 'parents',
 'being',
 'left',
 'alone',
 'and',
 're',
 'evaluating',
 'their',
 'life',
 'the',
 'way',
 'they',
 'tried',
 'to',
 'deliver',
 'that',
 'message',
 'was',
 'horrible',
 'the',
 'first',
 'fifty',
 'times',
 'something',
 'silly',
 'happened',
 'to',
 'the',
 'couple',
 'was',
 'relatively',
 'funny',
 'but',
 'by',
 'the',
 'end',
 'i',
 'could',
 'almost',
 'predict',
 'what',
 'stupid',
 'mishap',
 'is',
 'going',
 'to',
 'happen',
 'next',
 'throughout',
 'the',
 'movie',
 'i',
 'like',
 'a',
 'total',
 'of',
 'maybe',
 'five',
 'lines',
 'of',
 'dialogue',
 'and',
 'everything',
 'else',
 'was',
 'at',
 'best',
 'm

In [34]:
import nltk
nltk.download("stopwords")  # Download stop words

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [35]:
# let's see what counts as a stop word
from nltk.corpus import stopwords # Import the stop word list
print(stopwords.words("english"))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [36]:
# Remove stop words from "words"
stops = set(stopwords.words("english"))  # build the set once, not once per word
words = [w for w in words if w not in stops]
print(words)

['saw', 'previews', 'movie', 'thought', 'may', 'dumb', 'least', 'funny', 'well', 'wrong', 'even', 'though', 'somewhere', 'deep', 'producers', 'interesting', 'message', 'convey', 'parents', 'left', 'alone', 'evaluating', 'life', 'way', 'tried', 'deliver', 'message', 'horrible', 'first', 'fifty', 'times', 'something', 'silly', 'happened', 'couple', 'relatively', 'funny', 'end', 'could', 'almost', 'predict', 'stupid', 'mishap', 'going', 'happen', 'next', 'throughout', 'movie', 'like', 'total', 'maybe', 'five', 'lines', 'dialogue', 'everything', 'else', 'best', 'mediocre', 'still', 'say', 'movie']
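
The stop word list printed earlier includes the negations 'no', 'nor' and 'not'. For sentiment analysis that can matter, because removing them makes a negated sentence look the same as its positive counterpart. A minimal sketch, using a hand-picked subset of the printed list so that it runs without downloading anything (`remove_stops` is a hypothetical helper):

```python
# A few entries taken from the NLTK list printed above
stops = {"a", "i", "is", "it", "no", "nor", "not"}


def remove_stops(text, stop_set):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_set]


positive = remove_stops("it is a good film", stops)
negative = remove_stops("it is not a good film", stops)
print(positive == negative)  # the negation has been lost

# One option: keep the negations by removing them from the stop set
keep_negations = stops - {"no", "nor", "not"}
print(remove_stops("it is not a good film", keep_negations))
```

Whether keeping negations helps a bag-of-words model on this data set is something to test rather than assume.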


In [37]:
# Now, wrap all of the above into a function that we can call for every review in our data set

# We can uncomment these lines, and the two stemming lines inside
# review_to_words(), to apply stemming, but we're not doing that just now.
# There is a little more on stemming at the end of this notebook.
#from nltk.stem import PorterStemmer
#from nltk.tokenize import sent_tokenize, word_tokenize
#ps = PorterStemmer()

def review_to_words( raw_review ):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]

    # this is how we'd add the Porter stemmer...
    # (stem the filtered words, otherwise the stop words come back)
    #stemmed_words = [ps.stem(w) for w in meaningful_words]
    #meaningful_words = stemmed_words

    #
    # 6. Join the words back into one string separated by space,
    # and return the result.
    return " ".join(meaningful_words)

In [38]:
# test the function on a single review...

clean_review = review_to_words( train["review"][0] )
print(clean_review)

saw previews movie thought may dumb least funny well wrong even though somewhere deep producers interesting message convey parents left alone evaluating life way tried deliver message horrible first fifty times something silly happened couple relatively funny end could almost predict stupid mishap going happen next throughout movie like total maybe five lines dialogue everything else best mediocre still say movie
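
Step 4 in `review_to_words()` converts the stop words to a set because membership tests on a set do not scan every element. The sketch below times both approaches on made-up data; the word list and token list are stand-ins, and the timings will differ from machine to machine.

```python
import timeit

# Stand-in data: 200 made-up "stop words" and 1000 tokens to filter
stop_list = [f"word{i}" for i in range(200)]
stop_set = set(stop_list)
tokens = ["word199", "missing"] * 500

# Same filter, once against the list and once against the set
list_time = timeit.timeit(lambda: [t for t in tokens if t not in stop_list], number=20)
set_time = timeit.timeit(lambda: [t for t in tokens if t not in stop_set], number=20)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both filters return the same tokens; only the lookup cost changes.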


In [39]:
# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over every row of the training DataFrame
for index, row in train.iterrows():
    # Call our function for each review, and add the result to the list of
    # clean reviews
    clean_train_reviews.append( review_to_words( row["review"] ) )

# Equivalent alternative without an explicit loop:
# clean_train_reviews = train["review"].apply(review_to_words).tolist()

clean_train_reviews[0]

'saw previews movie thought may dumb least funny well wrong even though somewhere deep producers interesting message convey parents left alone evaluating life way tried deliver message horrible first fifty times something silly happened couple relatively funny end could almost predict stupid mishap going happen next throughout movie like total maybe five lines dialogue everything else best mediocre still say movie'

In [43]:
# create bag of words...
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=50)

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# Numpy arrays are easy to work with, so convert the result to an
# array
train_data_features = train_data_features.toarray()
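
To make the two steps of `fit_transform()` concrete, here is a hand-rolled approximation in plain Python on three made-up documents. It is a sketch of the idea, not scikit-learn's implementation: the real `CountVectorizer` also lower-cases the text, uses a default token pattern that drops single-character tokens, and returns a sparse matrix.

```python
from collections import Counter

# Three made-up, already-cleaned "reviews"
docs = ["bad movie bad acting", "great movie great story", "bad movie"]
max_features = 3

# "Fit": count every word across the corpus and keep the most frequent ones
corpus_counts = Counter(word for doc in docs for word in doc.split())
top_words = sorted(corpus_counts, key=lambda w: (-corpus_counts[w], w))[:max_features]
vocabulary = sorted(top_words)  # columns in alphabetical order


# "Transform": one row per document, one count per vocabulary word
def to_vector(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


features = [to_vector(doc) for doc in docs]
print(vocabulary)  # ['bad', 'great', 'movie']
print(features)    # [[2, 0, 1], [0, 2, 1], [1, 0, 1]]
```

Words outside the top `max_features` ("acting" and "story" here) simply do not get a column, which is why 50 features is too limited for modelling.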

In [46]:
# Take a look at the words in the vocabulary
# repeat the above with 50 and 5000 - 50 is a bit easier to see but too limited for modelling
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
vocab = vectorizer.get_feature_names_out()
print(vocab)

In [47]:
# show review 8 and its feature vector
# first word: "acting" appears once
# second word: "also" appears no times
# third word: "bad" appears 4 times
print(train["review"][8])
print(train_data_features[8])

"I don't think you can get much worse then this. Put together bad actors, fake limbs, and three stupid stories and what do you get? This B-rate pointless excuse for a movie.<br /><br />The first story immediately shows the bad video quality and the acting is just really pathetic, especially when you bring in the 25 year old posing as a grandma with the usually grandma bun over the ears bit. Plus, the man is OK, but the woman is rather ugly. \"You look great!\" NOT! The werewolf in this one was the best one out of all three I'd say, but its still not impressive since it was all bad costume. The face on the woman later was decent enough for halloween but not for a werewolf movie.<br /><br />The more stories you go through the worse it gets. There are two lesbians in this next one who are completely retarded its ridiculous. The whole \"I want to be a werewolf, too\" \"How could you do this to me?!\" Was silly to say. You asked for it now get over it! The werewolf will not even be spoken o

In [49]:
# We now have a vector of numbers for each review!
# We can train the random forest...

from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100)

# Fit the forest to the training set, using the bag of words as
# features and the sentiment labels as the response variable
#
# This may take a few minutes to run
forest = forest.fit( train_data_features, train["sentiment"] )

In [50]:
# now let's make a test set...
# Create an empty list and append the clean reviews one by one
num_reviews = len(test["review"])
clean_test_reviews = []

for index, row in test.iterrows():
    clean_review = review_to_words( row["review"] )
    clean_test_reviews.append( clean_review )

# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)
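
Note that the test set goes through `transform()`, not `fit_transform()`: the vocabulary learned from the training reviews is reused, so the test matrix has the same columns and any word not seen in training is ignored. A small sketch on made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents; "terrible" and "plot" never appear in the training set
train_docs = ["bad movie", "great movie", "bad acting"]
test_docs = ["great acting", "terrible plot"]

vec = CountVectorizer()
train_X = vec.fit_transform(train_docs)  # learns the vocabulary
test_X = vec.transform(test_docs)        # reuses it; unseen words are ignored

print(sorted(vec.vocabulary_))  # ['acting', 'bad', 'great', 'movie']
print(test_X.toarray())         # second row is all zeros
```

Calling `fit_transform()` on the test set instead would build a different vocabulary, and the columns would no longer line up with what the forest was trained on.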

In [53]:
# how did the model do?
from sklearn.metrics import accuracy_score, confusion_matrix
print(accuracy_score(test["sentiment"], result))

# rows are true labels, columns are predicted labels
confusion_matrix(test["sentiment"], result)

0.6987878787878787


array([[2832, 1273],
       [1212, 2933]])
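
The accuracy can be recovered from the confusion matrix by hand, along with precision and recall for the positive class. scikit-learn puts true labels in rows and predicted labels in columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The numbers below are copied from the output above.

```python
# Confusion matrix from the run above: [[TN, FP], [FN, TP]]
tn, fp = 2832, 1273
fn, tp = 1212, 2933

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all reviews classified correctly
precision = tp / (tp + fp)                  # of the predicted positives, how many are right
recall = tp / (tp + fn)                     # of the true positives, how many were found
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```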

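When I run this I get an accuracy of about 70% (69.9% in the run above; the figure varies slightly between runs because the forest has no fixed `random_state`), which isn't bad but isn't great! We can probably do better.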

Try tuning your model: use a grid search or random search, and try alternatives to the Random Forest. What's the best you can do?
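
As a starting point for the tuning exercise, here is a minimal grid search sketch. It runs on synthetic data from `make_classification` so that it is self-contained; on the real problem you would pass `train_data_features` and `train["sentiment"]` instead. The parameter values in `param_grid` are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for train_data_features / train["sentiment"]
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Illustrative grid: every combination is evaluated with 3-fold cross-validation
param_grid = {"n_estimators": [25, 50], "max_depth": [5, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` from the same module has a very similar interface and samples from the grid instead of trying every combination.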

In the next part of this session we'll look at word embeddings, which should offer a considerable improvement.

In [55]:
# A brief example of the Porter stemmer
# We could rerun the above by adding the stemmer to review_to_words()
# but in this case it doesn't actually make much difference,
# and slows things down a bit!

# try a few different example words with alternative methods from
# https://www.nltk.org/api/nltk.stem.html to see their impact

# in particular, try WordNetLemmatizer to see lemmatization

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt_tab')

ps = PorterStemmer()
example_words = ["run","runner","running","runs","ran"]
print("Example 1")
for w in example_words:
    print(ps.stem(w))

example_sentence = "the quick brown fox jumped over the lazy dog"
example_words = word_tokenize(example_sentence)
print("\nExample 2")
for w in example_words:
    print(ps.stem(w))


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Example 1
run
runner
run
run
ran

Example 2
the
quick
brown
fox
jump
over
the
lazi
dog
