# Reviewing Bag of Words and TF-IDF

In this document, we use a few simple techniques to tackle the task set by the Kaggle StumbleUpon competition linked below.

https://www.kaggle.com/c/stumbleupon

**Competition**: Some web pages, such as news articles or seasonal recipes, are only relevant for a short period of time. Others continue to be important for a long time.

**Goal**: Identify which pages will be relevant for only a short span of time and which will stay relevant for a long span of time and are thus considered "evergreen".

**Evaluation**: Area under the curve (AUC)
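
To make the metric concrete, here is a minimal sketch using `roc_auc_score`, the function this notebook uses for evaluation later on. The labels and scores are made-up toy values, not competition data. The manual calculation relies on the standard interpretation of ROC AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score.

```python
from sklearn.metrics import roc_auc_score

# Toy data (hypothetical): 1 = evergreen, 0 = not evergreen
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC as computed by scikit-learn
auc = roc_auc_score(y_true, y_score)

# The same number by counting correctly ordered (positive, negative) pairs
pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]
pairs = [(p, n) for p in pos for n in neg]
manual_auc = sum(p > n for p, n in pairs) / len(pairs)

print(auc, manual_auc)  # 0.75 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.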

## 1. Initial Setup

### 1.1 Import Python Packages

In [6]:
# quick hack to fix import path
# import sys; sys.path.append('/Users/julianalverio/code/conda/envs/sac/lib/python3.6/site-packages/')

# data manipulation
import pandas as pd
import numpy as np
# plots
%matplotlib inline
import random
import matplotlib
import matplotlib.pyplot as plt
import pylab as pl

# classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

# dimensionality reduction
from sklearn.decomposition import PCA

# cross-validation
from sklearn.model_selection import train_test_split
from sklearn import model_selection

# text features
import re
from nltk import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# model evaluation
from sklearn.metrics import roc_auc_score

import warnings
warnings.filterwarnings("ignore")

import os
os.chdir(os.getcwd())

Here we look at the data.

In [7]:
# Read stumbleupon data using pandas
data = pd.read_table("train.tsv", sep= "\t")

# Look at data
df = pd.DataFrame(data["boilerplate"]) # extract data
df # print out dataframe to look at it

Unnamed: 0,boilerplate
0,"{""title"":""IBM Sees Holographic Calls Air Breat..."
1,"{""title"":""The Fully Electronic Futuristic Star..."
2,"{""title"":""Fruits that Fight the Flu fruits tha..."
3,"{""title"":""10 Foolproof Tips for Better Sleep ""..."
4,"{""title"":""The 50 Coolest Jerseys You Didn t Kn..."
5,"{""url"":""conveniencemedical genital herpes home..."
6,"{""title"":""fashion lane American Wild Child "",""..."
7,"{""url"":""insidershealth article racing for reco..."
8,"{""title"":""Valet The Handbook 31 Days 31 days"",..."
9,"{""url"":""howsweeteats 2010 03 24 cookies and cr..."


### 1.2 Using Numerical Features (same as last week)

In [8]:
# Alchemy category, converted to one-hot columns (about 2K rows have the value '?')
df = pd.get_dummies(data['alchemy_category'])
df = df.rename(columns={'?': 'alchemy_cat_?'})

# FrameTagRatio, leaving as continuous number
df_var = data['frameTagRatio']
df['frame_tag_ratio'] = df_var

# link word score, 0-100 gaussian, keeping continuous
df['link_word_score'] = data['linkwordscore']

# alchemy category score, replacing missing values ('?') with a random number in [0, 1)
df_var = data['alchemy_category_score']
df_var_temp = df_var.apply(lambda x: np.random.random() if x == '?' else float(x)).astype('float32')
df['alchemy_category_score'] = df_var_temp

# number of words in url -- discrete 0-25, binned with custom edges chosen from the histogram
# (with right=True and no include_lowest, a count of exactly 0 falls in no bin)
df_var = data['numwords_in_url']
bins = [0, 6, 8, 13, 25]
df_var_temp = pd.cut(x=df_var, bins=bins, right=True, labels=['num_words_url_bin_0', 'num_words_url_bin_1', 'num_words_url_bin_2', 'num_words_url_bin_3'])
dummies = pd.get_dummies(df_var_temp)
df = pd.concat([df, dummies], axis=1)

# parameterized_link_ratio -- leaving as continuous, right-half gaussian
df['parameterized_link_ratio'] = data['parametrizedLinkRatio']

# spelling errors ratio -- leaving as continuous
df['spelling_errors_ratio'] = data['spelling_errors_ratio']

# embed_ratio -- bimodal continuous binned into 2 bins
df_var = pd.DataFrame(data['embed_ratio'])
df_var = df_var['embed_ratio'].apply(lambda x: 1 if x > -1 else 0)
dummies = pd.get_dummies(df_var)
rename = {0: 'embed_ratio_0', 1: 'embed_ratio_1'}
dummies = dummies.rename(columns=rename)
df = pd.concat([df, dummies], axis=1)

# html_ratio -- leaving continuous
df['html_ratio'] = data['html_ratio']

# lengthy_link_domain
df_var = pd.get_dummies(data['lengthyLinkDomain'])
rename = {0: 'lengthy_link_domain_0', 1: 'lengthy_link_domain_1'}
df_var = df_var.rename(columns=rename)
df = pd.concat([df, df_var], axis=1)

df['labels'] = data['label']
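
The binning and one-hot steps above can be sketched on a toy series. This is an illustration under assumed values, not the competition data; the bin edges match the cell above, while the `bin_*` labels and the `counts` values are made up. It also shows an edge case worth knowing: with `right=True`, the first interval is `(0, 6]`, so a value of exactly 0 gets no bin unless `include_lowest=True` is passed.

```python
import pandas as pd

# Hypothetical word counts, including both edge values 0 and 25
counts = pd.Series([0, 3, 6, 7, 10, 25])
bins = [0, 6, 8, 13, 25]
labels = ['bin_0', 'bin_1', 'bin_2', 'bin_3']

# Intervals are (0, 6], (6, 8], (8, 13], (13, 25]
binned = pd.cut(counts, bins=bins, right=True, labels=labels)

# One column per bin; a row that fell in no bin is all zeros
one_hot = pd.get_dummies(binned)

# include_lowest=True closes the first interval on the left: [0, 6]
binned_closed = pd.cut(counts, bins=bins, right=True, labels=labels,
                       include_lowest=True)

print(binned.tolist())
print(one_hot.shape)  # (6, 4)
```

Whether a zero count should share a bin with counts 1 to 6 is a modelling choice; the sketch only shows what each option does.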

### 1.3 Creating Training and Testing Data Splits

In [9]:
# Split data in half: training set and a held-out half
train, val = train_test_split(df, test_size=0.5, train_size=0.5, random_state=234)

# Split the held-out half into validation and test sets (25% of the data each)
val, test = train_test_split(val, test_size=0.5, train_size=0.5, random_state=675)

# Get labels for training dataset
train_labels = train['labels']
train = train.drop(['labels'], axis=1, inplace=False)

# Get labels for validation dataset
val_labels = val['labels']
val = val.drop(['labels'], axis=1, inplace=False)

# Get labels for testing dataset
test_labels = test['labels']
test = test.drop(['labels'], axis=1, inplace=False)

## 2. Bag of Words

### 2.1 Import Data

The data for this exercise is in the same folder as this notebook, so we can reference it by file name below.

In [10]:
# Read stumbleupon data using pandas
data = pd.read_table("train.tsv", sep= "\t")

### 2.2 Using Count Vectorizer

Below is the code that we will re-run from last week. Four parameters of `CountVectorizer` are relevant here. For your convenience, I have copied their descriptions from the function documentation, but it will often be your responsibility to do this step yourself. How well you understand the functions you use depends on how diligently you consult the documentation, so please read through it yourself for this exercise. Here is the [link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit) to the `CountVectorizer` class documentation.

- *min_df* = minimum frequency cut-off
    - `min_df`: float in range [0.0, 1.0] or int, default=1. When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
- *max_features* = keep only the 1000 most frequent terms
    - `max_features`: int or None, default=None. If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
- *strip_accents* = handle non-English letters
    - `strip_accents`: {'ascii', 'unicode', None}. Remove accents and perform other character normalization during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any characters. None (default) does nothing. Both 'ascii' and 'unicode' use NFKD normalization from unicodedata.normalize.
- *ngram_range* = we use unigrams (single words) as our bag-of-words features
    - `ngram_range`: tuple (min_n, max_n), default=(1, 1). The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.
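
To see what these parameters do, here is a small sketch on a made-up three-document corpus (the documents and variable names are illustrative assumptions, not part of the exercise). It reads the learned vocabulary from the fitted vectorizer's `vocabulary_` attribute. Note that the default tokenizer drops one-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
docs = ["the cat sat", "the cat ran", "a dog ran"]

# ngram_range=(1, 1): single words only
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
print(sorted(unigrams.vocabulary_))  # ['cat', 'dog', 'ran', 'sat', 'the']

# ngram_range=(1, 2): single words plus adjacent word pairs
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(sorted(uni_bi.vocabulary_))

# min_df=2: keep only terms that appear in at least 2 documents
frequent = CountVectorizer(min_df=2).fit(docs)
print(sorted(frequent.vocabulary_))  # ['cat', 'ran', 'the']

# strip_accents='unicode': accented letters are mapped to plain ones
accents = CountVectorizer(strip_accents="unicode").fit(["café crème"])
print(sorted(accents.vocabulary_))  # ['cafe', 'creme']
```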

#### 2.2.1 Instantiate Class

First, we create an instance of the `CountVectorizer` class.

In [11]:
# Instantiate our class
unigram_dtm = CountVectorizer(min_df= 10,  max_features= 1000, strip_accents= "unicode",
                          ngram_range=(1, 1))
print(unigram_dtm) # by printing this variable, we see that it outputs a class description

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=1000, min_df=10,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents='unicode', token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)


#### 2.2.2 Make Training and Testing Sets to do Bag of Words

We have talked about generating training and testing data, and `train_test_split` does this for us. If the code is confusing, consult the documentation for the function. The toy example below shows what the function does. When writing your own code, look for examples like this, or build your own, to experiment with.

In [12]:
# Example of using the function
X = np.arange(10).reshape((5, 2))
print("Our data:\n{}".format(X))

X_train, X_test = train_test_split(X, test_size=0.5, random_state=88)
print("Our training set:\n{}".format(X_train))
print("Our testing set:\n{}".format(X_test))

Our data:
[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]
Our training set:
[[8 9]
 [0 1]]
Our testing set:
[[4 5]
 [6 7]
 [2 3]]


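Here we split the text data for our problem. We reuse the random seeds from Section 1.3 so that the text rows line up with the rows of the numerical features.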

In [13]:
# Split Data Before Generating Bag of Words Representation
train_boilerplate, val_boilerplate = train_test_split(data['boilerplate'], test_size=0.5, train_size=0.5, random_state=234)
val_boilerplate, test_boilerplate = train_test_split(val_boilerplate, test_size=0.5, train_size=0.5, random_state=675)

Following the instantiation of the class, we then run some functions which help us to build a Bag of Words representation of the data.

#### 2.2.3 Learn Vocabulary from Document

Below, we call the `fit()` function in order to learn a vocabulary from one or more documents. 

#### 2.2.4 Generate Bag of Words Vectors from Document

Following this, we call the `transform()` function on one or more documents as needed to encode each as a vector. 

In this case, we pass the same training data to both functions, and then reuse the fitted vocabulary to transform the validation data. Think about why this is!
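
As a hint, here is a minimal sketch of the fit/transform distinction on made-up documents (the texts and variable names are assumptions for illustration). The vocabulary is learned from the training documents only, and any word that appears only in the validation documents is ignored when they are transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
train_docs = ["apple pie recipe", "apple cider"]
val_docs = ["banana pie"]

vec = CountVectorizer()
vec.fit(train_docs)  # learn the vocabulary from the training text only

# Columns follow the alphabetically sorted vocabulary
vocab = sorted(vec.vocabulary_)
print(vocab)  # ['apple', 'cider', 'pie', 'recipe']

# 'banana' was never seen during fit, so it has no column and is dropped
val_matrix = vec.transform(val_docs).toarray()
print(val_matrix)  # [[0 0 1 0]]
```

Because both splits are encoded with the same vocabulary, every column means the same word in the training and validation matrices.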

In [14]:
# Make Bag of Words Representation
unigram_dtm.fit(train_boilerplate) # learn the vocabulary (term -> column index) from the training text
train_text = unigram_dtm.transform(train_boilerplate) # encode each training document as a count vector over that vocabulary
val_text = unigram_dtm.transform(val_boilerplate) # why do we have to transform but not fit?

# Look at some of the variables
print(type(train_text)) # when looking at the output, keep in mind that this data type is a SPARSE matrix
print(train_text) # looking at the data, what may sparse mean?

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 0)	2
  (0, 1)	2
  (0, 11)	2
  (0, 12)	1
  (0, 13)	1
  (0, 17)	1
  (0, 28)	3
  (0, 29)	1
  (0, 40)	1
  (0, 43)	1
  (0, 45)	1
  (0, 48)	1
  (0, 55)	1
  (0, 57)	1
  (0, 58)	2
  (0, 61)	1
  (0, 64)	2
  (0, 71)	3
  (0, 72)	17
  (0, 73)	1
  (0, 77)	1
  (0, 78)	6
  (0, 80)	2
  (0, 81)	2
  (0, 85)	3
  :	:
  (3696, 920)	2
  (3696, 921)	2
  (3696, 922)	1
  (3696, 923)	1
  (3696, 935)	1
  (3696, 945)	7
  (3696, 946)	1
  (3696, 949)	1
  (3696, 951)	13
  (3696, 956)	1
  (3696, 958)	1
  (3696, 960)	4
  (3696, 963)	4
  (3696, 966)	2
  (3696, 970)	10
  (3696, 971)	1
  (3696, 972)	1
  (3696, 974)	5
  (3696, 977)	7
  (3696, 979)	1
  (3696, 981)	1
  (3696, 987)	3
  (3696, 992)	4
  (3696, 993)	1
  (3696, 995)	4
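
The printout above lists only the non-zero entries as `(row, column)  value`. A small sketch with a hand-made matrix (the values are arbitrary) shows the idea behind the sparse format: most entries of a bag-of-words matrix are zero, so only the non-zero ones are stored.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 4 count matrix with mostly zeros
dense = np.array([[2, 0, 0, 1],
                  [0, 0, 0, 0],
                  [0, 3, 0, 0]])

sparse = csr_matrix(dense)

# Number of stored (non-zero) entries versus total number of cells
n_stored = sparse.nnz
density = n_stored / (sparse.shape[0] * sparse.shape[1])
print(n_stored, density)  # 3 0.25

# Converting back recovers the original dense array
round_trip = sparse.toarray()
```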


Here, we actually implement the functions for our data.

#### 2.2.5 Explore Your Output

In [15]:
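# Randomly choose features
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions)
np.random.choice(unigram_dtm.get_feature_names(), 10)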

array(['cakes', 'against', '29', 'size', 'kind', 'hands', 'pictures',
       'usually', 'for', 'september'], dtype='<U11')

In [16]:
xx = train_text.toarray() # convert data type to something easier to look at
print(xx.shape) # look at shape of array
print("\nLooking at our training text matrix:\n{}".format(xx)) # look at data

(3697, 1000)

Looking at our training text matrix:
[[ 2  2  0 ...  2  3  0]
 [ 0  0  0 ... 30 33  0]
 [ 0  0  0 ...  0  0  0]
 ...
 [ 0  0  0 ...  0  2  0]
 [ 0  0  0 ...  0  0  0]
 [ 0  7  1 ...  0  0  0]]


In [17]:
# Print dimensionality of training data
print(train.shape)
# Convert to array and print dimensions
print(train_text.toarray().shape)

(3697, 28)
(3697, 1000)


#### 2.2.6 Add New Features to Training Set

In [18]:
# Concatenate the numerical features with the bag-of-words columns (reset the index so the rows align)
train_with_text = pd.concat([train.reset_index(drop = True), pd.DataFrame(train_text.toarray())], axis=1)
val_with_text = pd.concat([val.reset_index(drop = True), pd.DataFrame(val_text.toarray())], axis=1)
train_with_text.head()

Unnamed: 0,alchemy_cat_?,arts_entertainment,business,computer_internet,culture_politics,gaming,health,law_crime,recreation,religion,...,990,991,992,993,994,995,996,997,998,999
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,3,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,30,33,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,7,1,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
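
The `reset_index(drop=True)` calls in the cell above matter because `pd.concat` with `axis=1` aligns rows by index label, and the rows coming out of `train_test_split` keep their original shuffled labels. A minimal sketch with made-up frames (names and values are illustrative assumptions):

```python
import pandas as pd

# 'left' mimics a shuffled split: its index labels are not 0..n-1
left = pd.DataFrame({'a': [10, 20]}, index=[5, 2])
# 'right' mimics a frame built from an array: fresh 0..n-1 index
right = pd.DataFrame({'b': [1, 2]})

# Without resetting, rows are matched by label, giving extra rows full of NaN
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)  # (4, 2)

# After resetting, rows are matched by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
print(aligned.shape)  # (2, 2)
```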


#### 2.2.7 Do Logistic Regression with New Feature Set

In [19]:
# Create logistic regression model with sklearn
model = LogisticRegression()

# Fit, generate predictions, and evaluate model
model.fit(train_with_text, train_labels.values)
preds = model.predict_proba(val_with_text)[:,1]
score = roc_auc_score(val_labels, preds)
print("Score is {}".format(score))

Score is 0.7966274228989841


### 2.3 Keras Implementation

#### 2.3.1 Instantiate Class

#### 2.3.2 Make Training and Testing Sets to do Bag of Words

#### 2.3.3 Learn Vocabulary from Document

#### 2.3.4 Generate Bag of Words Vectors from Document

#### 2.3.5 Explore Your Output

#### 2.3.6 Add New Features to Training Set

#### 2.3.7 Do Logistic Regression with New Feature Set in Keras