# Classification: End-to-end

Kaggle StumbleUpon Competition

https://www.kaggle.com/c/stumbleupon

**Competition**:
1. Some web pages, such as news articles or seasonal recipes, are only relevant for a short period of time. Others continue to be important for a long time.
2. The goal is to identify which pages will be relevant for only a short span of time, and which will stay relevant for a long span of time and are thus considered "evergreen".

**Evaluation**: Area under the ROC curve (AUC)
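
To make the metric concrete, here is a minimal sketch of one common way to read AUC: the probability that a randomly chosen positive (evergreen) page receives a higher score than a randomly chosen negative page, with ties counted as half. The helper name `auc_by_ranking` and the toy labels and scores are made up for illustration; scikit-learn's `roc_auc_score`, which this notebook uses for evaluation, computes the same quantity more efficiently.

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: 3 evergreen pages (label 1) and 2 non-evergreen pages (label 0)
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]

auc = auc_by_ranking(labels, scores)
print(round(auc, 4))  # 5 of the 6 pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.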


Import Python Modules 
=================

In [None]:
# quick hack to fix import path
# import sys; sys.path.append('/Users/julianalverio/code/conda/envs/sac/lib/python3.6/site-packages/')

# data manipulation
import pandas as pd
import numpy as np

# plots
%matplotlib inline
import random
import matplotlib
import matplotlib.pyplot as plt
import pylab as pl

# classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm

# dimensionality reduction
from sklearn.decomposition import PCA

# cross-validation
from sklearn.model_selection import train_test_split
from sklearn import model_selection


# model evaluation
from sklearn.metrics import roc_auc_score

import warnings
warnings.filterwarnings("ignore")

import os
os.chdir(os.path.join("..", "data"))

In [None]:
os.getcwd()

# 1. Data Import

In [None]:
# We will use pandas to read this .tsv table
data = pd.read_table("stumbleupon/train.tsv", sep= "\t")

## Variable descriptions:

1. **url**: Url of the webpage to be classified
2. **urlid**: StumbleUpon's unique identifier for each url
3. **boilerplate**: Boilerplate text
4. **alchemy_category**:	Alchemy category
5. **alchemy_category_score**:	Alchemy category score
6. **avglinksize**:	Average number of words in each link
7. **commonLinkRatio_1**: # of links sharing at least 1 word with 1 other link / # of links
8. **commonLinkRatio_2**:	# of links sharing at least 1 word with 2 other links / # of links
9. **commonLinkRatio_3**:	# of links sharing at least 1 word with 3 other links / # of links
10. **commonLinkRatio_4**:	# of links sharing at least 1 word with 4 other links / # of links
11. **compression_ratio**:	Compression achieved on this page via gzip (measure of redundancy)
12. **embed_ratio**: Count of number of "embed" usage
13. **frameBased**: A page is frame-based (1) if it has no body markup but has a frameset markup
14. **frameTagRatio**: Ratio of iframe markups over total number of markups
15. **hasDomainLink**:	True (1) if it contains an "a" with an url with domain
16. **html_ratio**:	Ratio of tags vs text in the page
17. **image_ratio**: Ratio of "img" tags vs text in the page
18. **is_news**: True (1) if StumbleUpon's news classifier determines that this webpage is news
19. **lengthyLinkDomain**: True (1) if at least 3 "a"'s text contains more than 30 alphanumeric characters
20. **linkwordscore**: Percentage of words on the page that are in hyperlink's text
21. **news_front_page**: True (1) if StumbleUpon's news classifier determines that this webpage is front-page news
22. **non_markup_alphanum_characters**: Number of alphanumeric characters in the page's text (integer)
23. **numberOfLinks**: Number of "a"  markups
24. **numwords_in_url**: Number of words in url
25. **parametrizedLinkRatio**: A link is parametrized if its URL contains parameters or has an attached onClick event
26. **spelling_errors_ratio**: Ratio of words not found in wiki (considered to be a spelling mistake)
27. **label**: User-determined label. Either evergreen (1) or non-evergreen (0)


## Summary of Features

Before we dive into engineering our features for this problem, let's first understand how we can organize different features based off of their properties.

### Types of Numerical Features

1. Continuous number
2. Discrete number
3. One-hot vector
4. Binned number

In [None]:
'''
------------------------------
FEATURES THAT ARE NOT SUITABLE
------------------------------
url -- object type, the literal url
urlid -- uid for the url, drop it
avglinksize -- huge spread, mostly around 1 or 2. Would need very careful binning; probably not useful, judging from Madhav's results
frameBased -- 0 everywhere, garbage
is_news -- 1 everywhere

------------------------------
TEXT
------------------------------
boilerplate -- text feature

------------------------------
NUMERICAL
------------------------------
alchemy_category -- business, health, sports, etc.
alchemy_category_score -- 0 to 1 STRING with LOTS of '?' values
commonLinkRatio_1..4 -- cont value from 0 to 1 that works nicely, needs binning. Nice gaussian on commonLinkRatio_1; the peak moves left as the index increases
compression_ratio -- continuous value with huge bimodal spread -- needs to be binned carefully
embed_ratio -- perfect bimodal spread -- two humps. -1 and 0.25
frameTagRatio -- 0 to 0.44 cont, spread leans left -- it is half a gaussian
hasDomainLink -- 0/1 categorical, almost all 0
html_ratio -- cont 0 to 0.7, gaussian spread
image_ratio -- spread -1 to 113, but almost exclusively right at 0
lengthyLinkDomain -- categorical 0/1, good spread
linkwordscore -- 0 to 100 good spread
news_front_page -- STRING, ? values, 0-1 categorical, almost all 0
non_markup_alphanum_characters -- huge spread, highly concentrated near 0
numberOfLinks -- huge spread, mostly close to 0
numwords_in_url -- good spread, 0 to 25, has a long upper tail
parametrizedLinkRatio -- 0 to 1 continuous, skewed toward 0 -- right half of gaussian
spelling_errors_ratio -- 0 to 1, heavy leaning to 0

------------------------------
Entropy measures on each feature - Entropy captures the amount of "randomness" of a vector. 
------------------------------

[(-0.0017029098026004608, 'avglinksize'),
 (-0.0015477640384510272, 'image_ratio'),
 (-0.0013366491854696072, 'numberOfLinks'),
 (-0.0012422148274934264, 'urlid'),
 (-0.0012276386004722584, 'non_markup_alphanum_characters'),
 (-0.0003632131231506852, 'hasDomainLink'),
 (-1.6445368175466157e-05, 'compression_ratio'),
 (0, 'alchemy_category_score'),
 (0, 'boilerplate'),
 (0, 'commonLinkRatio_1'),
 (0, 'commonLinkRatio_2'),
 (0, 'commonLinkRatio_3'),
 (0, 'commonLinkRatio_4'),
 (0, 'frameBased'),
 (0, 'is_news'),
 (0, 'label'),
 (0, 'news_front_page'),
 (0, 'url'),
 (0.00016331922818457745, 'lengthyLinkDomain'),
 (0.0007063846400122697, 'html_ratio'),
 (0.0011190978216005787, 'embed_ratio'),
 (0.0020862547468744053, 'spelling_errors_ratio'),
 (0.002983052198938685, 'parametrizedLinkRatio'),
 (0.004883582510772921, 'numwords_in_url'),
 (0.016376170421172676, 'linkwordscore'),
 (0.016942353001673127, 'frameTagRatio'),
 (0.040352206434091764, 'alchemy_category')]

'''

# Feature Engineering - Getting Features into a Useful Form

Now we can dive into feature engineering itself - the manipulation of features into a form in which they can be leveraged to solve useful problems in machine learning.  A few of the feature engineering techniques we use are:

1. **One-hot encoding**: Converting a discrete feature that takes one of N possible values into a vector of length N.  The vector is all zeros except for a single 1 at the index that corresponds to the feature's value.  For example:

If we have the numbers 0, 1, and 2, we can convert these to a **one-hot encoding** with:

0 --> (1 0 0), 1 --> (0 1 0), 2 --> (0 0 1)

2. **Replace missing with random**: Replacing each missing value with a randomly drawn value (below, a uniform draw from [0, 1) for a score that lies between 0 and 1).

3. **Binning**: Taking a set of discrete or continuous numbers and reducing the size of the feature space for that feature by grouping values into a small number of ranges (bins).  An example would be binning age (a discrete number greater than or equal to zero) into decades.
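
The three techniques above can be sketched on a tiny table. This is only an illustration: the DataFrame `toy`, its column names, and its values are invented here and are not part of the StumbleUpon data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; '?' marks a missing value, as in the competition files
toy = pd.DataFrame({
    "category": ["sports", "health", "?", "sports"],
    "score": ["0.8", "?", "0.5", "?"],
    "age": [23, 37, 41, 65],
})

# 1. One-hot encoding: one 0/1 column per distinct category value
one_hot = pd.get_dummies(toy["category"], prefix="category")

# 2. Replace missing with random: draw a uniform number in [0, 1) for each '?'
rng = np.random.default_rng(0)  # seeded so the imputed values are repeatable
score = toy["score"].apply(lambda x: rng.random() if x == "?" else float(x))

# 3. Binning: group ages into decades, e.g. 23 -> "20s"
edges = list(range(0, 101, 10))                  # 0, 10, ..., 100
decade_labels = [f"{d}s" for d in edges[:-1]]    # "0s", "10s", ..., "90s"
age_bin = pd.cut(toy["age"], bins=edges, right=False, labels=decade_labels)

print(one_hot)
print(score)
print(list(age_bin))
```

Note that the missing category is not dropped: it simply becomes its own one-hot column, which lets a model learn whether "missing" is itself informative.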





In [None]:
#1. Alchemy category, converting to one-hot encodings: 
# Link: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

# About 2K rows have '?' (missing) as their category; it gets its own one-hot column
df = pd.get_dummies(data['alchemy_category'])  # Convert to one-hot representation
df = df.rename(columns={'?': 'alchemy_cat_?'})

# FrameTagRatio, leaving as continuous number
df_var = data['frameTagRatio']
df['frame_tag_ratio'] = df_var


# link word score, 0-100 gaussian, keeping continuous
df['link_word_score'] = data['linkwordscore']


#2.  alchemy category score, replacing missing values ('?') with a uniform random number in [0, 1)
np.random.seed(0)  # fix the seed so the imputed values are the same on every run
df_var = data['alchemy_category_score']
df_var_temp = df_var.apply(lambda x: np.random.random() if x == '?' else float(x)).astype('float32')
df['alchemy_category_score'] = df_var_temp


#3. num words in url -- discrete 0-25, custom bins chosen by looking at the histogram
df_var = data['numwords_in_url']
bins = [0, 6, 8, 13, 25]
bin_labels = ['num_words_url_bin_0', 'num_words_url_bin_1', 'num_words_url_bin_2', 'num_words_url_bin_3']
# include_lowest=True keeps urls with 0 words in the first bin; without it they fall outside (0, 6] and become NaN
df_var_temp = pd.cut(x=df_var, bins=bins, right=True, include_lowest=True, labels=bin_labels)
dummies = pd.get_dummies(df_var_temp)
df = pd.concat([df, dummies], axis=1)


# parameterized_link_ratio -- leaving as continuous, right-half gaussian
df['parameterized_link_ratio'] = data['parametrizedLinkRatio']

# spelling errors ratio -- leaving as continuous
df['spelling_errors_ratio'] = data['spelling_errors_ratio']

# embed_ratio -- bimodal continuous binned into 2 bins
df_var = pd.DataFrame(data['embed_ratio'])
df_var = df_var['embed_ratio'].apply(lambda x: 1 if x > -1 else 0)
dummies = pd.get_dummies(df_var)
rename = {0: 'embed_ratio_0', 1: 'embed_ratio_1'}
dummies = dummies.rename(columns=rename)
df = pd.concat([df, dummies], axis=1)


# html_ratio -- leaving continuous
df['html_ratio'] = data['html_ratio']

# lengthy_link_domain
df_var = pd.get_dummies(data['lengthyLinkDomain'])
rename = {0: 'lengthy_link_domain_0', 1: 'lengthy_link_domain_1'}
df_var = df_var.rename(columns=rename)
df = pd.concat([df, df_var], axis=1)

df['labels'] = data['label']
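
One detail of `pd.cut` is worth checking whenever a binned feature can take the value of the lowest bin edge (here 0). With `right=True`, every bin is open on the left, so a value equal to the first edge is assigned to a bin only when `include_lowest=True` is passed. A small sketch with made-up counts:

```python
import pandas as pd

# Made-up word counts, including the lowest edge (0) and the highest edge (25)
counts = pd.Series([0, 3, 6, 7, 13, 25])
bins = [0, 6, 8, 13, 25]

# Default behaviour: bins are (0, 6], (6, 8], (8, 13], (13, 25], so 0 is unassigned
default = pd.cut(counts, bins=bins, right=True)

# With include_lowest=True the first bin also contains 0
with_lowest = pd.cut(counts, bins=bins, right=True, include_lowest=True)

n_missing_default = int(default.isna().sum())
n_missing_with_lowest = int(with_lowest.isna().sum())
print(n_missing_default, n_missing_with_lowest)
```

A row whose bin is `NaN` ends up with all of its bin dummies set to 0 after `pd.get_dummies`, which silently creates an extra, unnamed category.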

## Create Training and Testing Data Splits
Whenever we build a **supervised** machine learning model, we want to split our data into **training (also known as in-sample)** and **testing (also known as out-of-sample)** datasets.  The ratio can vary, but common train:test splits are 0.5:0.5 and 0.8:0.2.  The **testing** dataset can be split further into **validation** and **test** datasets. The difference between these two subsets is given below, courtesy of [machinelearningmastery.com](https://machinelearningmastery.com/difference-test-validation-datasets/).

**Validation Dataset**: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

**Test Dataset**: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

In [None]:
# Split data into training (50%) and held-out (50%) sets
train, holdout = train_test_split(df, test_size=0.5, train_size=0.5, random_state=234)

# Split the held-out set into validation (25% of all data) and test (25% of all data)
val, test = train_test_split(holdout, test_size=0.5, train_size=0.5, random_state=675)

# Get labels for training dataset
train_labels = train['labels']
train = train.drop(['labels'], axis=1, inplace=False)

# Get labels for validation dataset
val_labels = val['labels']
val = val.drop(['labels'], axis=1, inplace=False)

# Get labels for testing dataset
test_labels = test['labels']
test = test.drop(['labels'], axis=1, inplace=False)
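
As a quick sanity check on what two consecutive 50/50 splits produce, here is a minimal sketch on 100 placeholder rows (the list `rows` stands in for the feature table and is not real data). It reuses the same `random_state` values as the cell above.

```python
from sklearn.model_selection import train_test_split

rows = list(range(100))  # stand-in for 100 rows of a feature table

# First split: half for training, half held out
train_rows, holdout_rows = train_test_split(rows, test_size=0.5, train_size=0.5, random_state=234)

# Second split: the held-out half is divided into validation and test
val_rows, test_rows = train_test_split(holdout_rows, test_size=0.5, train_size=0.5, random_state=675)

split_sizes = (len(train_rows), len(val_rows), len(test_rows))
# Every row lands in exactly one of the three subsets
no_overlap = len(set(train_rows) | set(val_rows) | set(test_rows)) == len(rows)
print(split_sizes, no_overlap)
```

The result is a 50% / 25% / 25% train / validation / test split with no row shared between subsets.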

## Hyperparameter Search with Logistic Regression
Now we will investigate the role hyperparameters play in machine learning algorithms.  We will try many different hyperparameter values for our model and evaluate how well each combination works based on the model's score on the **validation dataset**.  We will select the model and hyperparameters that yield the best performance on this **validation dataset**.

In [2]:
def logistic_search(budget):
    ret = []
    for i in range(budget):  # Iterate through models with different hyperparameters
        
        # Draw a random seed for the model; it is stored on the model, so any run can be reproduced later
        seed = random.randrange(500)
        
        # Hyperparameters to randomly choose from
        rand_C = random.choice([0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000])
        rand_fit_intercept = random.choice([True, False])
        rand_penalty = random.choice(['l1', 'l2'])
        
        # Create model using randomly-selected hyperparameters
        # (the 'liblinear' solver supports both 'l1' and 'l2'; the default 'lbfgs' solver rejects 'l1')
        model = LogisticRegression(random_state=seed, C=rand_C, fit_intercept=rand_fit_intercept, penalty=rand_penalty, solver='liblinear')
        
        # Train the model, make predictions, and evaluate the model
        model.fit(train, train_labels)
        preds = model.predict_proba(val)[:,1]
        score = roc_auc_score(val_labels, preds)
        
        # Add score to all the scores
        ret.append((score, model))
    
    # Return all (score, model) pairs, beginning with the model/hyperparameters that performed the best
    return sorted(ret, key=lambda x: x[0], reverse=True)
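
Because the loop above draws each hyperparameter independently with `random.choice`, the same combination can be drawn more than once (there are only 9 x 2 x 2 = 36 distinct combinations). One possible alternative, shown here only as a sketch, is scikit-learn's `ParameterSampler`, which samples without replacement when every parameter is given as a list. The dictionary name `param_space` is chosen here for illustration.

```python
from sklearn.model_selection import ParameterSampler

# Same search space as the loop above, written as a dictionary of lists
param_space = {
    "C": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000],
    "fit_intercept": [True, False],
    "penalty": ["l1", "l2"],
}

# Draw 5 distinct combinations; random_state makes the draw repeatable
sampled = list(ParameterSampler(param_space, n_iter=5, random_state=0))

for params in sampled:
    print(params)  # each entry is a dict that can be passed as keyword arguments
```

Each sampled dictionary could then be passed to a model as `LogisticRegression(solver='liblinear', **params)`.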

## Best logistic regression: AUC=0.71

In [None]:
# Hyperparameters of the best model found by the search
seed = 61
C = 1
fit_intercept = False
penalty = 'l1'
# solver='liblinear' is needed for the 'l1' penalty (the default 'lbfgs' solver rejects it)
model = LogisticRegression(random_state=seed, C=C, fit_intercept=fit_intercept, penalty=penalty, solver='liblinear')
model.fit(train, train_labels)
preds = model.predict_proba(val)[:,1]
score = roc_auc_score(val_labels, preds)
print(score)

In [None]:
# Now we will run logistic search over our logistic regression model with 100 random combinations of hyperparameters
logistic_search(100)

## Hyperparameter Search with Rule-Based Decision Trees (RBDTs)
We will now apply the same random hyperparameter search as above, but for **(gradient-boosted) Rule-Based Decision Trees**, using scikit-learn's `GradientBoostingClassifier`.

In [None]:
def rbdt_search(budget):
    ret = []
    for i in range(budget):  # Iterate through different random combinations of hyperparameters

        # Draw a random seed for the model; it is stored on the model, so any run can be reproduced later
        seed = random.randrange(500)
        
        # Hyperparameters to randomly choose from
        ccp_alpha = random.choice([0.0, 0.1])
        learning_rate = random.choice([0.001, 0.01, 0.1])
        n_estimators = random.choice([10, 30, 80, 160, 300])
        max_depth = random.choice([1,3,5,10])
        
        # Create model using randomly-selected hyperparameters
        # (ccp_alpha was previously drawn but never passed to the model)
        model = GradientBoostingClassifier(random_state=seed, learning_rate=learning_rate, n_estimators=n_estimators, max_depth=max_depth, ccp_alpha=ccp_alpha)
        
        # Train the model, make predictions, and evaluate the model
        model.fit(train, train_labels)
        preds = model.predict_proba(val)[:,1]
        score = roc_auc_score(val_labels, preds)
        
        # Add score to all scores
        ret.append((score, model))
        
    # Return all (score, model) pairs, in order of best to worst
    return sorted(ret, key=lambda x: x[0], reverse=True)
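
For gradient boosting specifically, `n_estimators` does not have to be searched by refitting: a model fit with the largest value can be scored after every boosting stage using `staged_predict_proba`. The following is a minimal sketch under stated assumptions: the data comes from `make_classification` as a stand-in for the StumbleUpon features, and the hyperparameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the competition data)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Fit once with the largest number of trees we want to consider
model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Validation AUC after 1, 2, ..., 50 trees, all from the single fitted model
stage_aucs = [roc_auc_score(y_val, proba[:, 1]) for proba in model.staged_predict_proba(X_val)]

# Number of trees with the highest validation AUC (stages are 1-indexed)
best_n = max(range(len(stage_aucs)), key=stage_aucs.__getitem__) + 1
print(best_n, round(max(stage_aucs), 3))
```

This only covers `n_estimators`; `learning_rate` and `max_depth` still need separate fits, as in the random search above.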

In [None]:
# Best Gradient Boosted Decision Tree: AUC=0.73
learning_rate = 0.01
max_depth = 3
n_estimators = 300
seed = 239  # was assigned to an unused variable `random_state`, while the model read `seed`
model = GradientBoostingClassifier(random_state=seed, learning_rate=learning_rate, n_estimators=n_estimators, max_depth = max_depth)
model.fit(train, train_labels)
preds = model.predict_proba(val)[:,1]
score = roc_auc_score(val_labels, preds)
print(score)

In [None]:
# Run RBDT over 100 combinations of hyperparameters
rbdt_search(100)

## Hyperparameter Search with Support Vector Machines (SVMs)
We will now apply the same random hyperparameter approach as the one used above, but for **Support Vector Machines**.

In [None]:
def svm_search(budget):
    ret = []
    for i in range(budget): # Iterate through random combinations of hyperparameters

        # Draw a random seed for the model; it is stored on the model, so any run can be reproduced later
        seed = random.randrange(500)
        
        # Hyperparameters to randomly choose from
        C = random.choice([0.0001, 0.001, 0.01, 0.1, 1.0])
        kernel = random.choice(['linear', 'poly', 'rbf', 'sigmoid'])
        print ("running for ", C, kernel)
        
        # Create model using randomly-selected hyperparameters
        model = svm.SVC(random_state=seed, C=C, kernel=kernel, probability=True)
        
        # Train the model, make predictions, and evaluate the model
        model.fit(train, train_labels)
        preds = model.predict_proba(val)[:,1]
        score = roc_auc_score(val_labels, preds)
        
        # Add score to all scores
        ret.append((score, model))
        print (score, model)
    
    # Return all (score, model) pairs, in order of best to worst
    return sorted(ret, key=lambda x: x[0], reverse=True)
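
AUC depends only on how the examples are ranked, so calibrated probabilities are not strictly required to compute it. With `probability=True`, `SVC` runs an additional internal cross-validation to calibrate its outputs, which makes fitting slower. A possible alternative, sketched here on synthetic data from `make_classification` (an assumption, not the competition data), is to score the raw `decision_function` margins directly:

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# probability is left at its default (False), so no extra calibration step is run
model = svm.SVC(C=0.1, kernel="linear")
model.fit(X_train, y_train)

# Signed distance to the separating hyperplane; larger means "more likely class 1"
margins = model.decision_function(X_val)
auc = roc_auc_score(y_val, margins)
print(round(auc, 3))
```

If probabilities are needed later (for example, for a competition submission that expects values in [0, 1]), `probability=True` remains the simpler choice.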


In [None]:
# Run SVM search with 20 randomly drawn combinations of hyperparameters
svm_search(20)

### Optimal Parameters for SVM

In [None]:
# Best SVM: AUC=0.7
C = 0.1
kernel = 'linear'
seed = 184
model = svm.SVC(random_state=seed, C=C, kernel=kernel, probability=True)
model.fit(train, train_labels)
preds = model.predict_proba(val)[:,1]
score = roc_auc_score(val_labels, preds)
print(score)
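
All of the scores above are computed on the validation split; the `test` split created earlier is never scored. A minimal sketch of that last step follows, under stated assumptions: the data is synthetic (`make_classification`), and the three candidate values of `C` are arbitrary. The model is chosen on the validation set and then scored once on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split 50% / 25% / 25% as in this notebook
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.5, random_state=234)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=675)

# Model selection: compare candidates on the validation set only
candidates = []
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, solver="liblinear", random_state=0).fit(X_train, y_train)
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    candidates.append((val_auc, model))

best_val_auc, best_model = max(candidates, key=lambda pair: pair[0])

# Final evaluation: score the selected model once on the untouched test set
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(f"validation AUC={best_val_auc:.3f}, test AUC={test_auc:.3f}")
```

Because the validation score was used to pick the model, it tends to be optimistic; the test score is the one to report.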

**Note**: scikit-learn estimators do not accept categorical (string) features, so these need to be converted to numeric or binary form before fitting the model.

We did reasonably well using only numerical features (validation AUC of roughly 0.70 to 0.73), but there's much more to be done! Dimensionality reduction, text processing, and more. Stay tuned for next week's natural language processing (NLP) theme.