PART 1 - Processing text to create design matrices

In [1]:
import pandas as pd
train = pd.read_csv("labeledTrainData.tsv", header=0, \
                   delimiter="\t", quoting=3)

The training data contains 25,000 reviews, each with a binary sentiment score and ID.

In [2]:
print train.shape
print train.columns.values

(25000, 3)
['id' 'sentiment' 'review']


In [3]:
!pip install BeautifulSoup4
from bs4 import BeautifulSoup



In [4]:
import re

In [5]:
import nltk
#nltk.download()
from nltk.corpus import stopwords

Given a raw review, the function review_to_words strips the HTML markup, replaces punctuation and digits with spaces, lowercases the text, and drops English stop words.

In [6]:
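def review_to_words(raw_review):
    # strip HTML markup; naming the parser avoids bs4's "no parser" warning
    review_text = BeautifulSoup(raw_review, "lxml").get_text()
    # replace everything that is not a letter with a space
    # (note A-Z: the range A-z would also let [ \ ] ^ _ ` through)
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    words = letters_only.lower().split()
    # a set makes the stop-word membership test fast
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)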
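
To see what the letters-only and stop-word steps do, here is a minimal sketch on a single sentence. It leaves out the HTML stripping, and TOY_STOPS and clean_text are hypothetical names used only for this illustration; the real NLTK stop-word list is much longer.

```python
import re

# Hypothetical stand-in for stopwords.words("english").
TOY_STOPS = {"the", "was", "and", "a", "it"}

def clean_text(text, stops=TOY_STOPS):
    # Replace every character that is not an ASCII letter with a space.
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stops)

sample = "The plot was thin, and the ending (2/10) felt rushed!"
cleaned = clean_text(sample)

# The ASCII characters that sit between 'Z' and 'a'; a character range
# written as A-z would treat these as letters too.
between = [chr(c) for c in range(ord("Z") + 1, ord("a"))]
```

Here cleaned is "plot thin ending felt rushed": the digits and punctuation have become whitespace and the stop words are gone.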

We can now apply this function to every review in the training data to build a list of cleaned reviews.

In [7]:
num_reviews = train['review'].size
clean_train_reviews = []
sentiments_train = []

print "Cleaning and parsing the movie reviews training set...\n"
for i in xrange(0, num_reviews):
    clean_train_reviews.append(review_to_words(train['review'][i]))
    sentiments_train.append(train['sentiment'][i])
    if( (i+1)%1000 == 0 ):
        print "Review %d of %d\n" % ( i+1, num_reviews )

Cleaning and parsing the movie reviews training set...







Review 1000 of 25000

Review 2000 of 25000

Review 3000 of 25000

Review 4000 of 25000

Review 5000 of 25000

Review 6000 of 25000

Review 7000 of 25000

Review 8000 of 25000

Review 9000 of 25000

Review 10000 of 25000

Review 11000 of 25000

Review 12000 of 25000

Review 13000 of 25000

Review 14000 of 25000

Review 15000 of 25000

Review 16000 of 25000

Review 17000 of 25000

Review 18000 of 25000

Review 19000 of 25000

Review 20000 of 25000

Review 21000 of 25000

Review 22000 of 25000

Review 23000 of 25000

Review 24000 of 25000

Review 25000 of 25000



Now we can use clean_train_reviews to create a bag of words using sklearn's feature extraction methods. We limit the vocabulary to the 5000 most frequent words and create the X_counts array to hold the raw word counts.

In [8]:
print "Creating the bag of words...\n"
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000) 

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)
X_counts = train_data_features.toarray()

Creating the bag of words...
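
As a rough illustration of what a capped bag of words looks like, the sketch below builds one by hand for three toy documents. It is not how CountVectorizer is implemented; docs, toy_counts and the alphabetical tie-break are assumptions made for this example only.

```python
from collections import Counter

docs = ["good movie good cast", "bad movie", "good plot bad ending"]
max_features = 3  # the notebook uses 5000

# Corpus-wide term frequencies decide which words enter the vocabulary.
totals = Counter(w for d in docs for w in d.split())

# Keep the max_features most frequent words; ties are broken
# alphabetically here, which is a choice made for this sketch only.
top = sorted(totals, key=lambda w: (-totals[w], w))[:max_features]
vocab = sorted(top)

# One row per document, one column per vocabulary word.
toy_counts = [[d.split().count(w) for w in vocab] for d in docs]
```

With these inputs vocab is ['bad', 'good', 'movie'], and the first row of toy_counts is [0, 2, 1] because "good" occurs twice in the first document.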



We can now transform X_counts into a binary matrix X_binary, where counts greater than 0 become 1 and counts of 0 remain as they were.

In [9]:
# X_counts is a NumPy array, so the comparison is applied elementwise;
# astype(int) turns the resulting True/False mask into 1/0.
X_binary = (X_counts > 0).astype(int)

We can use sklearn's TfidfTransformer to convert the raw counts into X_tfidf, the term-frequency inverse-document-frequency (tf-idf) matrix.

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer

In [11]:
transformer = TfidfTransformer(smooth_idf=False)
X_tfidf = transformer.fit_transform(X_counts)
X_tfidf = X_tfidf.toarray()
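
The sketch below works through the weighting on a tiny count matrix. It assumes, from our reading of the scikit-learn documentation, that with smooth_idf=False the idf of a term is ln(n / df) + 1 and that each row is then scaled to unit Euclidean length (the default norm). The names toy_counts and toy_tfidf are hypothetical, and the sketch assumes every term appears in at least one document and no row is all zeros.

```python
import math

toy_counts = [[1, 0],
              [1, 1]]
n_docs = len(toy_counts)
n_terms = len(toy_counts[0])

# Document frequency: number of documents in which each term occurs.
df = [sum(1 for row in toy_counts if row[t] > 0) for t in range(n_terms)]

# Unsmoothed idf (assumed formula): ln(n / df) + 1.
idf = [math.log(float(n_docs) / df[t]) + 1.0 for t in range(n_terms)]

toy_tfidf = []
for row in toy_counts:
    weighted = [row[t] * idf[t] for t in range(n_terms)]
    # Scale the row to unit Euclidean length.
    norm = math.sqrt(sum(v * v for v in weighted))
    toy_tfidf.append([v / norm for v in weighted])
```

The first term occurs in both documents, so its idf is 1; the second occurs in only one, so it gets the larger weight ln(2) + 1 and dominates the second row after normalization.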

Create X_binary_imbalance by deleting 75% of the reviews with sentiment=1, which yields an imbalanced dataset.

In [12]:
import numpy as np

X_binary_imbalance = []
X_1 = []
for i in range(len(X_counts)):
    if sentiments_train[i] == 1:
        X_1.append(X_counts[i])
    else:
        X_binary_imbalance.append(X_counts[i])
        
# take X_1 and delete 75% of its rows randomly
# first, select random indices to keep; replace=False samples without
# replacement, so no row is kept twice

random_indices = np.random.choice(len(X_1), len(X_1)//4, replace=False)
for index in random_indices:
    X_binary_imbalance.append(X_1[int(index)])
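
If the imbalanced matrix is later used for classification, the labels have to be subsampled together with the rows. A minimal sketch of one way to do that is given below; downsample_positive is a hypothetical helper, not part of the notebook, and it uses the standard library's random.sample, which draws without replacement.

```python
import random

def downsample_positive(rows, labels, keep_fraction=0.25, seed=0):
    # Keep every label-0 row and a random subset of the label-1 rows.
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y != 1]
    # Sampling without replacement: no positive row is kept twice.
    kept_pos = rng.sample(pos, int(len(pos) * keep_fraction))
    keep = neg + kept_pos
    return [rows[i] for i in keep], [labels[i] for i in keep]

rows = [[i] for i in range(16)]
labels = [i % 2 for i in range(16)]  # 8 positives, 8 negatives
X_small, y_small = downsample_positive(rows, labels)
```

With 8 positive and 8 negative toy rows, the result keeps all 8 negatives and 2 distinct positives.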

PART 2 - Feature Space Similarity Experiment

Let dist(X, i, j, distance_function='Euclidean') be a function that returns the (Euclidean) distance between rows i and j of a design matrix. Import scipy to make the calculation efficient.

In [13]:
import scipy.spatial.distance

In [14]:
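def dist(X,i,j,distance_function='Euclidean'):
    matrix = [X[i],X[j]]
    # pdist returns a condensed array; with two rows it holds a single
    # distance, so index it to return a scalar. Metric names are lowercase.
    return scipy.spatial.distance.pdist(matrix, distance_function.lower())[0]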
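
As a quick sanity check of what the Euclidean distance between two rows means, here is a plain-Python sketch on a 3-4-5 triangle. The names euclidean and X_toy are hypothetical and used only for this illustration.

```python
import math

def euclidean(u, v):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

X_toy = [[0, 3], [4, 0], [0, 0]]
d01 = euclidean(X_toy[0], X_toy[1])  # legs of length 3 and 4
d02 = euclidean(X_toy[0], X_toy[2])
```

Rows 0 and 1 differ by 4 in the first coordinate and 3 in the second, so d01 is 5.0.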

Let topk(X, k) be a function which returns ((i1,j1,d1),...(ik,jk,dk)) where (ix,jx) are the indices of the xth closest pair, and dx is the corresponding distance.

In [15]:
from sklearn.metrics.pairwise import euclidean_distances
from scipy.sparse import csr_matrix, coo_matrix
from itertools import izip

In [17]:
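def topk(X,k):
    X = csr_matrix(X)
    # coo_matrix keeps only the nonzero entries, so the zero diagonal
    # (each row's distance to itself) is dropped here.
    distances = coo_matrix(euclidean_distances(X,X))
    
    tuples = izip(distances.row, distances.col, distances.data)
    tup_sort = sorted(tuples, key=lambda x: (x[2]))
    
    # now get the first k unique pairs from tup_sort; each pair appears
    # twice, as (i,j,d) and (j,i,d), so skip an entry that mirrors the
    # one just before it
    closest = []
    i, skips = 0,0
    while i < (k+skips):
        if i > 0 and (tup_sort[i][0] == tup_sort[i-1][1])\
                and (tup_sort[i][1] == tup_sort[i-1][0]):
            skips += 1
        else:
            closest.append(tup_sort[i])
        i += 1
    return closest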
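
For small inputs, the result can be cross-checked against a brute-force version. The sketch below is a hypothetical reference implementation (topk_bruteforce is not part of the notebook): it enumerates each unordered pair once and keeps the k smallest distances. One difference to be aware of: it also returns pairs at distance 0, which the sparse-matrix version would not store.

```python
import heapq
import math
from itertools import combinations

def topk_bruteforce(X, k):
    def d(i, j):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
    # combinations yields each unordered pair (i < j) exactly once,
    # so no mirrored duplicates need to be skipped afterwards.
    pairs = ((i, j, d(i, j)) for i, j in combinations(range(len(X)), 2))
    return heapq.nsmallest(k, pairs, key=lambda t: t[2])

X_toy = [[0, 0], [3, 4], [0, 1], [10, 10]]
closest = topk_bruteforce(X_toy, 2)
```

On X_toy the closest pair is rows 0 and 2 at distance 1.0, followed by rows 1 and 2. Enumerating all pairs is quadratic in the number of rows, so this is only practical as a check on small slices.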

In [21]:
def assemble_topk(X,n,k):
    segmentsize = len(X)//n
    s = 0
    winners = []
    for i in np.arange(segmentsize,len(X)+segmentsize,segmentsize):
        winners.append(topk(X[s:i],k))
        s += segmentsize
    return winners

In [22]:
print assemble_topk(X_counts,10,1)

[[(135, 356, 4.2426406871192848)], [(1269, 1548, 3.1622776601683795)], [(345, 621, 3.3166247903553998)], [(1538, 1745, 3.7416573867739413)], [(217, 517, 4.4721359549995796)], [(15, 2196, 4.4721359549995796)], [(116, 1369, 4.0)], [(989, 1328, 3.7416573867739413)], [(276, 2258, 2.2360679774997898)], [(1445, 1722, 3.3166247903553998)]]
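
Because topk is called on a slice X[s:i], the indices it reports are relative to the start of that slice, which is why every index printed above is below 2500. A minimal sketch for mapping them back to row numbers in the full matrix follows; to_global is a hypothetical helper, and the two winners entries are copied (with rounded distances) from the output above.

```python
def to_global(winners, segmentsize):
    # Shift segment-local (i, j) indices by the segment's starting row.
    shifted = []
    for seg, pairs in enumerate(winners):
        offset = seg * segmentsize
        shifted.append([(i + offset, j + offset, d) for (i, j, d) in pairs])
    return shifted

winners = [[(135, 356, 4.24)], [(1269, 1548, 3.16)]]
global_pairs = to_global(winners, 2500)
```

The pair from the second segment, (1269, 1548), corresponds to rows 3769 and 4048 of the full matrix. Note that pairs whose two rows fall in different segments are never compared by this segment-wise approach.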


In [None]:
#topk_X_binary = topk(X_binary,1)

In [None]:
#topk_X_binary_imblance = topk(X_binary_imbalance,1)

In [None]:
#topk_X_tfidf = topk(X_tfidf,1)

PART 3 - Classification Experiment