## Using basic Bag of Words features for text mining applications

This notebook is based on the excellent Kaggle tutorial [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words), which details how to use various Python libraries to preprocess text data for NLP tasks like sentiment analysis, document clustering, and the like.  

This preprocessing includes:
- removing all HTML tags from each document
- parsing the text of each document (separating it into individual words)
- removing stop words (words of little meaning like 'and' and 'the')
- stemming (reducing inflected forms of a word, like 'dogs' and 'dog', to a common stem)
- making a word-frequency representation (a.k.a. a Bag of Words feature vector) of a document preprocessed in the manner above

This notebook also uses the example dataset from the tutorial: a collection of movie reviews, each labeled as either 'positive' (meaning the person enjoyed the film) or 'negative'.  These reviews are in raw HTML form (hence the need for tag-stripping, etc.).  

Once all documents are pre-processed (once each movie review is transformed into a Bag of Words feature vector) we can then train a supervised learning model to distinguish between positive and negative reviews.  

In [1]:
# a class containing all necessary functions to transform raw HTML documents into Bag of Words feature vectors
from sentiment_analyzer import sentiment_analyzer
sentiment_analyzer = sentiment_analyzer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Nurgetson/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We first load in our training and testing datasets consisting of 20,000 and 5,000 movie reviews respectively. Each review has a label of 'positive' or 'negative'.

In [2]:
### we'll need to split this into training and testing sets ourselves
# load in training data
csvname = "training_data.tsv"
training_data,training_labels = sentiment_analyzer.load_data(csvname)

# load in testing data
csvname = "testing_data.tsv"
testing_data,testing_labels = sentiment_analyzer.load_data(csvname)
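
The internals of `load_data` are not shown in this notebook. As a rough sketch of how a tab-separated file of labeled reviews could be read with the standard library, the example below assumes (hypothetically) columns named `id`, `sentiment`, and `review`, with labels encoded as 1 for positive and 0 for negative. The helper name `load_tsv` is also an assumption, not part of the `sentiment_analyzer` class.

```python
import csv
import io

def load_tsv(handle, text_column="review", label_column="sentiment"):
    """Read a tab-separated file and return (documents, labels) as two lists."""
    # QUOTE_NONE keeps quote characters in the text instead of interpreting them
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    documents, labels = [], []
    for row in reader:
        documents.append(row[text_column])
        labels.append(int(row[label_column]))
    return documents, labels

# Tiny in-memory stand-in for a file such as training_data.tsv (made-up rows)
sample = io.StringIO(
    "id\tsentiment\treview\n"
    '1\t1\t"Loved it.<br /><br />Would watch again."\n'
    '2\t0\t"Dull and far too long."\n'
)
documents, labels = load_tsv(sample)
```

With a real file, the same function could be called as `load_tsv(open("training_data.tsv", encoding="utf-8"))`, provided the column names match.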

Let's take a quick look at the raw data. Notice the HTML tags that need removing.

In [3]:
# show a raw document from the training set - those from the testing set look the same
training_data[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

Next we clean both the training and testing datasets. This means:

- removing all HTML tags from each document
- parsing the text of each document (separating it into individual words)
- removing stop words (words of little meaning like 'and' and 'the'), punctuation, and numbers
- stemming (reducing inflected forms of a word, like 'dogs' and 'dog', to a common stem)

In [4]:
# clean training data 
clean_training_data = sentiment_analyzer.clean_data(training_data)

# clean testing data
clean_testing_data = sentiment_analyzer.clean_data(testing_data)
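
To make the cleaning steps concrete, here is a minimal standard-library sketch of what a function like `clean_data` might do to a single document. It is a deliberately simplified stand-in: the tag stripping uses a regular expression rather than an HTML parser, the stop-word list is a tiny hand-picked set, and `toy_stem` is a hypothetical, very crude stemmer. None of these names come from the `sentiment_analyzer` class.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "this", "was", "but"}

def strip_tags(raw_html):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", raw_html)

def toy_stem(word):
    """Crude stemmer: drop a trailing 's' from words longer than 3 letters."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean_document(raw_html):
    """Strip tags, keep letters only, lowercase, drop stop words, then stem."""
    text = strip_tags(raw_html)
    # Replacing non-letters with spaces removes punctuation and numbers
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return [toy_stem(word) for word in words if word not in STOP_WORDS]

example = "The dogs and the dog: 2 great films!<br /><br />Loved it."
cleaned = clean_document(example)
print(cleaned)  # ['dog', 'dog', 'great', 'film', 'loved']
```

Note how 'dogs' and 'dog' collapse to the same token, which is the point of stemming: both now contribute to the same Bag of Words feature.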





Now that each dataset has been cleaned, we can transform its documents into Bag of Words features.  Note: we need to fit this transformation on the **training data** alone, because the dictionary (the set of words found across the documents in the training set) defines the features on which we train our supervised learning algorithm.  Any future testing data must therefore be transformed with the same dictionary for us to be able to apply the trained model.

In [5]:
# fit a BoW transform to the cleaned documents
BoW_transform = sentiment_analyzer.make_BoW_transform(clean_training_data)
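
The idea of fitting on the training data alone can be sketched in a few lines of plain Python. This is only an illustration, assuming each cleaned document is a whitespace-separated string; `fit_vocabulary` and `transform` are hypothetical helpers, not the methods behind `make_BoW_transform`.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Map each word seen in the training documents to a column index."""
    words = sorted({word for doc in train_docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def transform(docs, vocabulary):
    """Turn each document into a list of word counts, one per vocabulary word."""
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        # Words missing from the vocabulary are simply ignored
        vectors.append([counts[word] for word in vocabulary])
    return vectors

train_docs = ["great film great cast", "dull film"]
test_docs = ["great plot"]

vocabulary = fit_vocabulary(train_docs)           # built from training data only
train_vectors = transform(train_docs, vocabulary)
test_vectors = transform(test_docs, vocabulary)   # 'plot' has no column
print(train_vectors)  # [[1, 0, 1, 2], [0, 1, 1, 0]]
print(test_vectors)   # [[0, 0, 0, 1]]
```

Because the testing document is mapped through the training vocabulary, its vector has the same length and column meaning as the training vectors, which is what lets a trained model be applied to it.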

In [6]:
# use the newly fitted BoW transformation to transform both training and testing sets,
# then normalize each document's feature vector to unit length
from sklearn.preprocessing import normalize  # explicit import; a bare `import sklearn` may not load the submodule

training_BoW_features = BoW_transform.transform(clean_training_data)
training_BoW_features = training_BoW_features.toarray()
training_BoW_features = normalize(training_BoW_features, axis=1)

testing_BoW_features = BoW_transform.transform(clean_testing_data)
testing_BoW_features = testing_BoW_features.toarray()
testing_BoW_features = normalize(testing_BoW_features, axis=1)
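
With `axis = 1` and scikit-learn's default `norm='l2'`, the normalization step rescales each row (one document) to unit Euclidean length, so long and short reviews become comparable. The pure-Python sketch below shows the arithmetic on a toy matrix; `l2_normalize_rows` is a hypothetical helper written for illustration only.

```python
import math

def l2_normalize_rows(rows):
    """Scale each row so that its Euclidean (L2) norm equals 1."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(value * value for value in row))
        if norm > 0:
            normalized.append([value / norm for value in row])
        else:
            # Leave all-zero rows unchanged to avoid dividing by zero
            normalized.append(list(row))
    return normalized

rows = [[3, 4, 0], [0, 0, 0]]
unit_rows = l2_normalize_rows(rows)
# The first row has norm sqrt(9 + 16) = 5, so it becomes 3/5, 4/5, 0
```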

In [7]:
# perform classification on training set, report accuracy on training and testing sets
sentiment_analyzer.perform_classification(training_BoW_features,training_labels,testing_BoW_features,testing_labels)

done training boosted classifier
accuracy on training set is 0.8383
accuracy on testing set is 0.834
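
The reported accuracy is presumably the fraction of reviews whose predicted label matches the true label. A minimal sketch of that calculation follows; `accuracy` is a hypothetical helper, not a method of the `sentiment_analyzer` class, and the labels below are made up.

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that equal the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

predicted_labels = [1, 0, 1, 1]
true_labels = [1, 0, 0, 1]
score = accuracy(predicted_labels, true_labels)  # 3 of 4 correct
```

On 5,000 testing reviews, an accuracy of 0.834 would correspond to 4,170 correctly classified reviews.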
