# Reading in the Data

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import re

In [3]:
# quoting=3 is csv.QUOTE_NONE: double quotes are kept as literal characters
train = pd.read_csv('../data/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)

In [4]:
unlabeled_train = pd.read_csv('../data/unlabeledTrainData.tsv', header=0, delimiter='\t', quoting=3)

In [5]:
test = pd.read_csv('../data/testData.tsv', header=0, delimiter='\t', quoting=3)

In [6]:
train.shape

(25000, 3)

In [16]:
# Strip the literal double quotes from the id and keep the part after the underscore as the score
train['score'] = train.id.map(lambda x: x.strip('"').split('_')[1])

In [28]:
train = train.astype({'score':int}, copy=False)

In [31]:
train.score.value_counts()

1     5100
10    4732
8     3009
4     2696
7     2496
3     2420
2     2284
9     2263
Name: score, dtype: int64

In [40]:
(train.score < 7).mean()

0.5

In [17]:
train.review[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

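As we can see above, the review contains a lot of HTML tags, so those need to be cleaned. The backslashes in front of the single quotes are only how Python displays an escaped quote when it echoes a string; they are not part of the text itself.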
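A quick way to check whether those backslashes are really in the data is to compare the echoed form of a string with the string itself. The sketch below uses a short made-up string (`sample` is hypothetical, not taken from the dataset) that, like the review, contains both a double quote and an apostrophe.

```python
# Hypothetical string shaped like the start of the review:
# it contains both a double quote and an apostrophe.
sample = '"i\'ve started'

shown = repr(sample)   # what a notebook echoes for a bare expression
print(shown)           # the apostrophe is displayed with a backslash
print(sample)          # the actual text contains no backslash

has_backslash_in_repr = "\\" in shown
has_backslash_in_text = "\\" in sample
```

If `has_backslash_in_text` is False, the escape characters are a display detail and need no cleaning step of their own.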

Testing BeautifulSoup's tag stripping on the first review

In [26]:
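# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning
BeauSoup = BeautifulSoup(train.review[0], 'html.parser')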

In [27]:
BeauSoup.get_text()

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 2

So it looks like BeautifulSoup did a good job getting rid of the HTML tags. Next I'm going to use a regular expression to keep only letters, replacing every other character with a space.

In [28]:
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The string to replace each match with
                      BeauSoup.get_text() )  # The text to search

In [29]:
letters_only

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    m

So it looks like we got rid of the punctuation, including the single quotes, and also the digits (note the gap where the running time used to be). Now I will convert all the characters to lowercase.
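To make the effect of the letters-only pattern easier to see, here is a minimal sketch on a short made-up string (`sample` is hypothetical). Each non-letter character is replaced by exactly one space, so the string keeps its length, and splitting afterwards discards the extra whitespace.

```python
import re

# Hypothetical input containing a digit, punctuation and an apostrophe
sample = "2 stars, i've seen better!"

# Same pattern as in the notebook: anything that is not a-z or A-Z becomes a space
letters_only = re.sub("[^a-zA-Z]", " ", sample)

# Lowercase and split on whitespace; runs of spaces collapse away
words = letters_only.lower().split()
print(words)
```

One side effect worth keeping in mind is that contractions are broken apart ("i've" becomes "i" and "ve") and numbers disappear entirely.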

In [30]:
lower_case = letters_only.lower()

In [31]:
lower_case

' with all this stuff going down at the moment with mj i ve started listening to his music  watching the odd documentary here and there  watched the wiz and watched moonwalker again  maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring  some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts is only on for    m

Now split the lowercase text into separate words.

In [35]:
words = lower_case.split()

Taking a look at the first 10 words of our review

In [36]:
words[0:10]

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with']

Importing stopwords from NLTK (stopwords are very common words like "the" that do not help much with gauging sentiment).

In [37]:
from nltk.corpus import stopwords

Here is a sample of the first 10 words in the stopword list

In [38]:
stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Removing stopwords from our corpus

In [39]:
stops = set(stopwords.words('english'))  # build the set once; membership tests on a set are fast
words = [w for w in words if w not in stops]

As we can see, removing the stopwords changed the composition of our first 10 words that we saw above. 

In [41]:
words[0:10]

['stuff',
 'going',
 'moment',
 'mj',
 'started',
 'listening',
 'music',
 'watching',
 'odd',
 'documentary']
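The filtering step can be sketched without NLTK by using a small hand-written stopword set. The set below is a hypothetical subset chosen for illustration; the real list from `stopwords.words('english')` is much longer.

```python
# First ten tokens of the review, as shown earlier in the notebook
tokens = ["with", "all", "this", "stuff", "going",
          "down", "at", "the", "moment", "with"]

# Hypothetical mini stopword set (an assumption, not NLTK's full list)
mini_stops = {"with", "all", "this", "down", "at", "the"}

# Keep only tokens that are not stopwords; order is preserved
kept = [t for t in tokens if t not in mini_stops]
print(kept)
```

Using a set rather than a list for the stopwords keeps each membership test cheap, which matters once the same filter runs over many thousands of reviews.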

So far this looks good, but it only covers one review. Below I will write a custom function that does this for every review in the training set. Before that, however, there is one more step: removing URLs.

Removing URLs from the review data

In [42]:
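# Replace http:// and https:// URLs with a space (a raw string avoids invalid-escape warnings)
train.review = train.review.map(lambda x: re.sub(r'https?://\S*', ' ', x))

In [43]:
# Note: 'http?' makes the final 'p' optional, so this matches 'htt://' or 'http://'.
# After the pass above it should find little or nothing new.
train.review = train.review.map(lambda x: re.sub(r'http?://\S*', ' ', x))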
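As a small check of what the URL pattern removes, the sketch below applies it to a made-up sentence (`sample` and its URLs are hypothetical). It uses `\S*`, which is equivalent to the `[^\s]*` written above.

```python
import re

# "s?" makes the trailing "s" optional, so both schemes are covered
URL_PATTERN = re.compile(r"https?://\S*")

# Hypothetical text with one http and one https link
sample = "see http://example.com/a and https://example.org/b?x=1 for more"

cleaned = URL_PATTERN.sub(" ", sample)
tokens = cleaned.split()
print(tokens)
```

Because the match runs to the next whitespace character, anything glued to the URL (a trailing period, or the rest of an HTML attribute) is removed along with it.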

---

Creating a custom function that cleans a review and returns it as a single space-separated string of words, with the option of removing stopwords (in some later analyses that look at entire sentences, I will not want to remove them).

In [52]:
def review_to_wordlist(review, remove_stopwords=False):
    # 1. Remove HTML (the parser is named explicitly to avoid a bs4 warning)
    review_text = BeautifulSoup(review, "html.parser").get_text()

    # 2. Replace non-letters with spaces
    review_text = re.sub("[^a-zA-Z]", " ", review_text)

    # 3. Convert to lower case and split into words
    words = review_text.lower().split()

    # 4. Optionally remove stop words (False by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    # 5. Join the words back into a single space-separated string
    return " ".join(words)

In [53]:
clean_review = review_to_wordlist(train.review[0])

In [54]:
clean_review

'with all this stuff going down at the moment with mj i ve started listening to his music watching the odd documentary here and there watched the wiz and watched moonwalker again maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts is only on for minutes or s

This matches what we got from the earlier steps up to lowercasing and splitting (the stopwords are still there, since remove_stopwords defaults to False), and we can now run the custom function over all reviews!

In [56]:
# Iterate over the reviews directly instead of indexing by position
clean_train = [review_to_wordlist(review) for review in train.review]

Checking the length of the cleaned reviews, to make sure we have 25,000 reviews

In [62]:
len(clean_train)

25000

Checking the first 2 reviews, to see what they look like

In [61]:
clean_train[0:2]

['with all this stuff going down at the moment with mj i ve started listening to his music watching the odd documentary here and there watched the wiz and watched moonwalker again maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts is only on for minutes or 

Appending the cleaned reviews to the original training dataframe and saving it for Exploratory Data Analysis in the next notebook

In [68]:
train['cleanreview'] = clean_train

In [72]:
train.to_csv('../data/cleantrain.csv', index=False)

---

Loading NLTK's punkt tokenizer for sentence splitting

In [None]:
import nltk.data

In [None]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

Creating a custom function similar to the one before, but this time for entire sentences

In [41]:
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Split a review into parsed sentences. Returns a list of
    # sentences, where each sentence is a list of words.
    #
    # 1. Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    #
    # 2. Loop over each sentence
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # review_to_wordlist returns a space-joined string, so split it
            # back into a list of words
            cleaned = review_to_wordlist(raw_sentence, remove_stopwords)
            sentences.append(cleaned.split())
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences

In [51]:
sentences = []  # Initialize an empty list of sentences

print("Parsing sentences from training set")
for review in train["review"]:
    sentences += review_to_sentences(review, tokenizer)

print("Parsing sentences from unlabeled set")
for review in unlabeled_train["review"]:
    sentences += review_to_sentences(review, tokenizer)


Parsing sentences from training set
Parsing sentences from unlabeled set


In [52]:
len(sentences)

795534
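The count is in sentences rather than reviews because `+=` extends the running list with each review's sentences instead of nesting them. A minimal sketch of the difference, using two made-up parsed reviews (`review_a` and `review_b` are hypothetical):

```python
# Each hypothetical "parsed review" is a list of sentences,
# and each sentence is a list of words.
review_a = [["with", "all", "this"], ["watched", "moonwalker", "again"]]
review_b = [["visually", "impressive"]]

# += extends: the result is one flat list of sentences
flat = []
for parsed in (review_a, review_b):
    flat += parsed

# append nests: the result has one entry per review
nested = []
for parsed in (review_a, review_b):
    nested.append(parsed)

print(len(flat), len(nested))
```

The flat list of token lists is the shape that sentence-level models usually expect, which is presumably why the notebook accumulates with `+=`.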

In [58]:
sentences[0]

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'mj',
 'i',
 've',
 'started',
 'listening',
 'to',
 'his',
 'music',
 'watching',
 'the',
 'odd',
 'documentary',
 'here',
 'and',
 'there',
 'watched',
 'the',
 'wiz',
 'and',
 'watched',
 'moonwalker',
 'again']

Saving the sentences to use for modeling later.

In [79]:
import pickle

# The file is a binary pickle despite its .txt extension
with open('../data/sentences.txt', 'wb') as fp:
    pickle.dump(sentences, fp)
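For reading the sentences back in a later notebook, a round trip might look like the sketch below. It writes to a temporary directory and uses a tiny stand-in list (`sentences_demo` is hypothetical) so that it runs on its own; the real file would be opened the same way, in binary mode.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the real list of tokenized sentences
sentences_demo = [["with", "all", "this", "stuff"],
                  ["watched", "moonwalker", "again"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentences.txt")

    # Write in binary mode ('wb'), as in the cell above
    with open(path, "wb") as fp:
        pickle.dump(sentences_demo, fp)

    # Read back in binary mode ('rb'); text mode would fail on a pickle
    with open(path, "rb") as fp:
        restored = pickle.load(fp)

print(restored == sentences_demo)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.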