__The data contains both numerical and text-based features.__

*Objectives of this notebook:*

* Determine whether the numerical features carry any significant signal.
* Decide whether we should parse the raw content.
* Check whether the `boilerplate` field is sufficient to capture the finer details of the data.
* Learn a range of new text mining techniques.
* Learn how to run processes in parallel, which matters when we want to iterate quickly over our ideas.

**Evaluation metric: AUC (area under the ROC curve)**
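As a reminder of what the metric measures, here is a minimal pure-Python sketch. It treats AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The function name `pairwise_auc` and the toy labels and scores are made up for illustration; the notebook itself relies on `roc_auc_score` from scikit-learn.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError('AUC needs at least one positive and one negative example')

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy labels and scores, made up for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A score of 0.5 corresponds to random ranking, which is the baseline to keep in mind when reading the cross-validation scores below.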

In [34]:
%matplotlib inline

# load libraries
import pandas as pd
import numpy as np
import os
import sys

from urllib.parse import urlparse

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

import matplotlib.pyplot as plt
import seaborn as sns


sns.set_style('whitegrid')
sns.set_context('poster')

import warnings
warnings.filterwarnings('ignore')

# set seed
np.random.seed(1)

basepath = os.path.expanduser('~/Desktop/src/Stumbleupon_classification_challenge/')
sys.path.append(os.path.join(basepath, 'src'))

from models import train_test_split, cross_val_scheme

In [2]:
# load files
train = pd.read_csv(os.path.join(basepath, 'data/raw/train.tsv'), delimiter='\t')
test = pd.read_csv(os.path.join(basepath, 'data/raw/test.tsv'), delimiter='\t')
sample_sub = pd.read_csv(os.path.join(basepath, 'data/raw/sampleSubmission.csv'))

In [3]:
train.head(2)

| | url | urlid | boilerplate | alchemy_category | alchemy_category_score | avglinksize | commonlinkratio_1 | commonlinkratio_2 | commonlinkratio_3 | commonlinkratio_4 | ... | is_news | lengthyLinkDomain | linkwordscore | news_front_page | non_markup_alphanum_characters | numberOfLinks | numwords_in_url | parametrizedLinkRatio | spelling_errors_ratio | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://www.bloomberg.com/news/2010-12-23/ibm-p... | 4042 | {"title":"IBM Sees Holographic Calls Air Breat... | business | 0.789131 | 2.055556 | 0.676471 | 0.205882 | 0.047059 | 0.023529 | ... | 1 | 1 | 24 | 0 | 5424 | 170 | 8 | 0.152941 | 0.07913 | 0 |
| 1 | http://www.popsci.com/technology/article/2012-... | 8471 | {"title":"The Fully Electronic Futuristic Star... | recreation | 0.574147 | 3.677966 | 0.508021 | 0.28877 | 0.213904 | 0.144385 | ... | 1 | 1 | 40 | 0 | 4973 | 187 | 9 | 0.181818 | 0.125448 | 1 |


In [4]:
test.head(2)

| | url | urlid | boilerplate | alchemy_category | alchemy_category_score | avglinksize | commonlinkratio_1 | commonlinkratio_2 | commonlinkratio_3 | commonlinkratio_4 | ... | image_ratio | is_news | lengthyLinkDomain | linkwordscore | news_front_page | non_markup_alphanum_characters | numberOfLinks | numwords_in_url | parametrizedLinkRatio | spelling_errors_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://www.lynnskitchenadventures.com/2009/04/... | 5865 | {"title":"Homemade Enchilada Sauce Lynn s Kitc... | recreation | 0.443906 | 2.55814 | 0.389706 | 0.257353 | 0.044118 | 0.022059 | ... | 0.199438 | 1 | 1 | 15 | 0 | 5643 | 136 | 3 | 0.242647 | 0.080597 |
| 1 | http://lolpics.se/18552-stun-grenade-ar | 782 | {"title":"lolpics Stun grenade ar ","body":" f... | culture_politics | 0.135844 | 3.771429 | 0.461538 | 0.205128 | 0.051282 | 0.0 | ... | 0.08 | ? | 1 | 62 | 0 | 382 | 39 | 2 | 0.128205 | 0.176471 |


In [5]:
sample_sub.head()

| | urlid | label |
|---|---|---|
| 0 | 5865 | 0 |
| 1 | 782 | 0 |
| 2 | 6962 | 0 |
| 3 | 7640 | 0 |
| 4 | 3589 | 0 |


In [6]:
# remove urlid from train and test, keeping the values in separate variables
def fetch_urlid(data):
    return data['urlid']

def delete_urlid(data):
    del data['urlid']

train_urlid = fetch_urlid(train)
test_urlid = fetch_urlid(test)

delete_urlid(train)
delete_urlid(test)

**Helper functions**

In [28]:
def encode_variable(train, test):
    """
    Convert a categorical variable to numerical form
    
    train: Values of the variable in the training set
    test: Values of the variable in the test set
    
    """
    
    data = pd.concat((train, test), axis=0)
    
    lbl = LabelEncoder()
    lbl.fit(data)
    
    train_ = lbl.transform(train)
    test_ = lbl.transform(test)
    
    return train_, test_
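To make the helper's behaviour concrete, the sketch below reproduces the same idea in plain Python: every distinct value seen in either set gets an integer code, assigned in sorted order as `LabelEncoder` does. The function name `encode_labels` and the sample values are hypothetical and only serve the illustration.

```python
def encode_labels(train_values, test_values):
    """Map each distinct value to an integer code, using values from both sets."""
    # Sorting the union mirrors the ordering LabelEncoder uses for its classes.
    classes = sorted(set(train_values) | set(test_values))
    to_code = {value: code for code, value in enumerate(classes)}
    train_codes = [to_code[v] for v in train_values]
    test_codes = [to_code[v] for v in test_values]
    return train_codes, test_codes, classes


# Made-up values: 'net' appears only in the test set.
train_tld = ['com', 'org', 'com', 'uk']
test_tld = ['com', 'net']

train_codes, test_codes, classes = encode_labels(train_tld, test_tld)
print(classes)      # ['com', 'net', 'org', 'uk']
print(train_codes)  # [0, 2, 0, 3]
print(test_codes)   # [0, 1]
```

Building the mapping from the union of both sets is what lets the test set be transformed even when it contains a value that never occurs in training. Note that the integer codes impose an arbitrary order on the categories: tree-based models tend to cope with that, while a linear model would read the codes as magnitudes.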

### Exploratory Data Analysis

In [7]:
train.columns

Index(['url', 'boilerplate', 'alchemy_category', 'alchemy_category_score',
       'avglinksize', 'commonlinkratio_1', 'commonlinkratio_2',
       'commonlinkratio_3', 'commonlinkratio_4', 'compression_ratio',
       'embed_ratio', 'framebased', 'frameTagRatio', 'hasDomainLink',
       'html_ratio', 'image_ratio', 'is_news', 'lengthyLinkDomain',
       'linkwordscore', 'news_front_page', 'non_markup_alphanum_characters',
       'numberOfLinks', 'numwords_in_url', 'parametrizedLinkRatio',
       'spelling_errors_ratio', 'label'],
      dtype='object')

**Let's look at the url variable.**

**Let's create a feature that counts the depth of the path in the URL.**

e.g. http://www.guardian.co.uk/a has depth 1, whereas http://www.guardian.co.uk/a/b has depth 2

In [8]:
def url_depth(url):
    """
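    Takes in a url and calculates the depth of its path
    e.g. http://www.guardian.co.uk/a has depth 1, whereas http://www.guardian.co.uk/a/b has depth 2
    
    url - url of the webpage (must include the scheme, e.g. http://,
          otherwise urlparse treats the host name as part of the path)
    """
    
    parsed_url = urlparse(url)
    path = parsed_url.path

    # count the non-empty segments between slashes
    return len([part for part in path.split('/') if part])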

url_depths = train.url.map(url_depth)
url_depths_test = test.url.map(url_depth)

assert len(url_depths) == len(train)
assert len(url_depths_test) == len(test)
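A few worked cases help to check the depth calculation, including one edge case worth knowing about: `urlparse` only recognises a host when the URL carries a scheme (or a leading `//`), so a bare `www.guardian.co.uk/a` is parsed entirely as a path. The helper name `path_depth` is an illustrative stand-in for the function above.

```python
from urllib.parse import urlparse


def path_depth(url):
    """Count the non-empty segments of the URL path."""
    return len([part for part in urlparse(url).path.split('/') if part])


examples = [
    'http://www.guardian.co.uk',       # no path at all
    'http://www.guardian.co.uk/a',     # one segment
    'http://www.guardian.co.uk/a/b/',  # trailing slash adds no segment
    'www.guardian.co.uk/a',            # no scheme: host is counted as a segment
]

for url in examples:
    print('%-32s depth=%d' % (url, path_depth(url)))
# depths printed, in order: 0, 1, 2, 2
```

The URLs shown in the data preview all start with `http://`, so the last case should not arise here, but it is a cheap thing to assert before trusting the feature on other data.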

In [9]:
feature_df = pd.DataFrame({'url_depths': url_depths, 'label': train.label})
feature_df_test = pd.DataFrame({'url_depths': url_depths_test})

**Check whether this feature is actually predictive.**

1. Split the dataset into a training set and a held-out test set.
2. Set up a cross-validation scheme on the training set.
3. Record the final performance on the held-out test set.

In [10]:
train.is_news.value_counts() / train.is_news.value_counts().sum()

1    0.615551
?    0.384449
Name: is_news, dtype: float64

In [11]:
test.is_news.value_counts() / test.is_news.value_counts().sum()

1    0.613687
?    0.386313
Name: is_news, dtype: float64

**The share of pages flagged as news (`is_news` = 1) versus the rest (`?`) is nearly identical in the training and test sets. We therefore preserve this ratio in the different folds we create during cross-validation, so that each fold stays representative of the original data.**
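The split and cross-validation helpers used in the following cells (`tr_ts_split` and `cv_scheme`) come from the notebook's own `models` package, whose source is not shown here. As a rough sketch of the same three steps, and not a reproduction of those helpers, the block below uses scikit-learn's `train_test_split` and `StratifiedKFold` directly. The data is synthetic and every variable name is an assumption made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.RandomState(0)

# Synthetic stand-ins for the notebook's data (not the real dataset).
n = 500
X = rng.randint(0, 8, size=(n, 1)).astype(float)  # a URL-depth-like feature
y = rng.randint(0, 2, size=n)                     # binary target
strata = np.where(rng.rand(n) < 0.6, '1', '?')    # an is_news-like flag

# 1. Hold out 20% of the rows, keeping the share of each stratum the same.
X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(
    X, y, strata, test_size=0.2, random_state=2, stratify=strata)

# 2. Three-fold CV on the training part, stratified on the same flag.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=3)
scores = []
for fit_idx, val_idx in skf.split(X_tr, s_tr):
    model = LogisticRegression()
    model.fit(X_tr[fit_idx], y_tr[fit_idx])
    val_pred = model.predict_proba(X_tr[val_idx])[:, 1]
    scores.append(roc_auc_score(y_tr[val_idx], val_pred))

# 3. Refit on the whole training part and score the held-out rows once.
model = LogisticRegression().fit(X_tr, y_tr)
holdout_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print('CV AUC: %s' % np.round(scores, 3))
print('Held-out AUC: %.3f' % holdout_auc)
print('Share of "1" overall / train / held-out: %.3f / %.3f / %.3f' % (
    (strata == '1').mean(), (s_tr == '1').mean(), (s_te == '1').mean()))
```

One detail worth noticing in this sketch: the stratification labels passed to `skf.split` are `s_tr`, the flag values for the training rows only, so they line up row for row with `X_tr`.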

In [12]:
params = {
    'test_size': 0.2,
    'random_state': 2,
    'stratify': train.is_news
}

X_train, X_test, y_train, y_test = train_test_split.tr_ts_split(feature_df[['url_depths']], feature_df['label'], **params)

In [13]:
# cross validation scheme
est = LogisticRegression()

params = {
    'n_folds': 3,
    'shuffle': True,
    'random_state': 3
}

scores, mean_score, std_score = cross_val_scheme.cv_scheme(est, X_train, y_train, train.is_news, **params)

print('CV Scores: %s'%(scores))
print('Mean CV Score: %f'%(mean_score))
print('Std CV Score: %f'%(std_score))

CV Scores: [ 0.52118866  0.54039943  0.52608173]
Mean CV Score: 0.529223
Std CV Score: 0.008151


In [16]:
# performance on the held out test set
est.fit(X_train, y_train)
y_pred = est.predict_proba(X_test)[:, 1]

print('ROC AUC score on the held out set: %f '%(roc_auc_score(y_test, y_pred)))

ROC AUC score on the held out set: 0.531968 


**Private leaderboard score: 0.54425**

In [120]:
# train on full dataset
est.fit(feature_df[['url_depths']], feature_df.label)
predictions = est.predict_proba(feature_df_test[['url_depths']])[:, 1]

### Top Level Domains

In [23]:
def extract_top_level_domain(url):
    """
    Extracts top level domain from a given url
    
    url: Url of the webpage in the dataset
    """
    parsed_url = urlparse(url)
    # the top level domain is the last dot-separated label of the network location
    top_level = parsed_url.netloc.split('.')[-1]
    
    return top_level
    
top_level_domains_train = train.url.map(extract_top_level_domain)
top_level_domains_test = test.url.map(extract_top_level_domain)

assert len(top_level_domains_train) == len(train)
assert len(top_level_domains_test) == len(test)
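The extraction above takes the last dot-separated label of the network location. The sketch below compares that with a variant based on `urlparse(...).hostname`, which strips any port and lowercases the host. Both helper names are hypothetical, and the `example.com` URLs are made up to show the edge cases; whether such URLs occur in this dataset has not been checked here.

```python
from urllib.parse import urlparse


def last_netloc_label(url):
    """Last dot-separated label of the network location."""
    return urlparse(url).netloc.split('.')[-1]


def last_hostname_label(url):
    """Same idea, but on the hostname: port removed, lowercased."""
    hostname = urlparse(url).hostname or ''
    return hostname.split('.')[-1]


urls = [
    'http://lolpics.se/18552-stun-grenade-ar',
    'http://www.guardian.co.uk/a',
    'http://example.com:8080/a',  # made-up URL with a port
    'http://Example.COM/a',       # made-up URL with upper-case host
]

for url in urls:
    print('%-42s netloc=%-9s hostname=%s' % (
        url, last_netloc_label(url), last_hostname_label(url)))
```

For `http://www.guardian.co.uk/a` both variants return `uk` rather than `co.uk`; recovering multi-part suffixes would need a public suffix list, which is outside the scope of this notebook.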

In [29]:
tld_encoded_train, tld_encoded_test = encode_variable(top_level_domains_train, top_level_domains_test)

In [32]:
feature_df['tld'] = tld_encoded_train
feature_df_test['tld'] = tld_encoded_test

In [33]:
params = {
    'test_size': 0.2,
    'random_state': 2,
    'stratify': train.is_news
}

features = ['url_depths', 'tld']

X_train, X_test, y_train, y_test = train_test_split.tr_ts_split(feature_df[features], feature_df['label'], **params)

In [44]:
# cross validation scheme
est = RandomForestClassifier(n_jobs=-1)

params = {
    'n_folds': 3,
    'shuffle': True,
    'random_state': 3
}

scores, mean_score, std_score = cross_val_scheme.cv_scheme(est, X_train, y_train, train.is_news, **params)

print('CV Scores: %s'%(scores))
print('Mean CV Score: %f'%(mean_score))
print('Std CV Score: %f'%(std_score))

CV Scores: [ 0.58944466  0.60132065  0.58664653]
Mean CV Score: 0.592471
Std CV Score: 0.006361


**Private leaderboard score: 0.61713**

In [46]:
# performance on the held out test set
est.fit(X_train, y_train)
y_pred = est.predict_proba(X_test)[:, 1]

print('ROC AUC score on the held out set: %f '%(roc_auc_score(y_test, y_pred)))

ROC AUC score on the held out set: 0.594963 


In [47]:
# train on full dataset
est.fit(feature_df[features], feature_df.label)
predictions = est.predict_proba(feature_df_test[features])[:, 1]

### Whether the webpage belongs to the news category

In [54]:
train_is_news, test_is_news = encode_variable(train.is_news, test.is_news)

In [56]:
feature_df['is_news'] = train_is_news
feature_df_test['is_news'] = test_is_news

In [57]:
params = {
    'test_size': 0.2,
    'random_state': 2,
    'stratify': train.is_news
}

features = ['url_depths', 'tld', 'is_news']

X_train, X_test, y_train, y_test = train_test_split.tr_ts_split(feature_df[features], feature_df['label'], **params)

In [64]:
# cross validation scheme
est = RandomForestClassifier(n_estimators=50, max_depth=10, n_jobs=-1)

params = {
    'n_folds': 3,
    'shuffle': True,
    'random_state': 3
}

scores, mean_score, std_score = cross_val_scheme.cv_scheme(est, X_train, y_train, train.is_news, **params)

print('CV Scores: %s'%(scores))
print('Mean CV Score: %f'%(mean_score))
print('Std CV Score: %f'%(std_score))

CV Scores: [ 0.59582831  0.60836585  0.59857301]
Mean CV Score: 0.600922
Std CV Score: 0.005381


In [65]:
# performance on the held out test set
est.fit(X_train, y_train)
y_pred = est.predict_proba(X_test)[:, 1]

print('ROC AUC score on the held out set: %f '%(roc_auc_score(y_test, y_pred)))

ROC AUC score on the held out set: 0.586896 


### Submission

In [48]:
sample_sub['label'] = predictions

In [49]:
sample_sub.to_csv(os.path.join(basepath, 'submissions/depth_tld.csv'), index=False)