__The data contains both numerical and text-based features.__

*Objectives of this notebook:*

* Determine whether the numerical features are significant.
* Decide whether we should parse the raw content.
* Check whether the `boilerplate` field is sufficient to capture the finer details in the data.
* Learn a range of new text mining techniques.
* Learn how to run processes in parallel, which is important when we want to iterate quickly through our ideas.

**Evaluation Metric: AUC (Area Under the Curve)**
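
For reference, taking AUC to mean the area under the ROC curve, it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The symbols below are introduced here for illustration only: $s_i^{+}$ and $s_j^{-}$ are the predicted scores of the $n_{+}$ positive and $n_{-}$ negative examples.

$$
\mathrm{AUC} = \frac{1}{n_{+}\, n_{-}} \sum_{i=1}^{n_{+}} \sum_{j=1}^{n_{-}} \left( \mathbf{1}\left[s_i^{+} > s_j^{-}\right] + \tfrac{1}{2}\, \mathbf{1}\left[s_i^{+} = s_j^{-}\right] \right)
$$

A value of 0.5 corresponds to a random ranking and 1.0 to a perfect one.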

In [30]:
%matplotlib inline

# load libraries
import pandas as pd
import numpy as np
import os
import sys

from urllib.parse import urlparse

import matplotlib.pyplot as plt
import seaborn as sns


sns.set_style('whitegrid')
sns.set_context('poster')

import warnings
warnings.filterwarnings('ignore')

# set seed
np.random.seed(1)

basepath = os.path.expanduser('~/Desktop/src/Stumbleupon_classification_challenge/')
sys.path.append(os.path.join(basepath, 'src'))

In [7]:
# load files
train = pd.read_csv(os.path.join(basepath, 'data/raw/train.tsv'), delimiter='\t')
test = pd.read_csv(os.path.join(basepath, 'data/raw/test.tsv'), delimiter='\t')
sample_sub = pd.read_csv(os.path.join(basepath, 'data/raw/sampleSubmission.csv'))

In [10]:
train.head(2)

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,is_news,lengthyLinkDomain,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio,label
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,4042,"{""title"":""IBM Sees Holographic Calls Air Breat...",business,0.789131,2.055556,0.676471,0.205882,0.047059,0.023529,...,1,1,24,0,5424,170,8,0.152941,0.07913,0
1,http://www.popsci.com/technology/article/2012-...,8471,"{""title"":""The Fully Electronic Futuristic Star...",recreation,0.574147,3.677966,0.508021,0.28877,0.213904,0.144385,...,1,1,40,0,4973,187,9,0.181818,0.125448,1


In [11]:
test.head(2)

Unnamed: 0,url,urlid,boilerplate,alchemy_category,alchemy_category_score,avglinksize,commonlinkratio_1,commonlinkratio_2,commonlinkratio_3,commonlinkratio_4,...,image_ratio,is_news,lengthyLinkDomain,linkwordscore,news_front_page,non_markup_alphanum_characters,numberOfLinks,numwords_in_url,parametrizedLinkRatio,spelling_errors_ratio
0,http://www.lynnskitchenadventures.com/2009/04/...,5865,"{""title"":""Homemade Enchilada Sauce Lynn s Kitc...",recreation,0.443906,2.55814,0.389706,0.257353,0.044118,0.022059,...,0.199438,1,1,15,0,5643,136,3,0.242647,0.080597
1,http://lolpics.se/18552-stun-grenade-ar,782,"{""title"":""lolpics Stun grenade ar "",""body"":"" f...",culture_politics,0.135844,3.771429,0.461538,0.205128,0.051282,0.0,...,0.08,?,1,62,0,382,39,2,0.128205,0.176471


In [12]:
sample_sub.head()

Unnamed: 0,urlid,label
0,5865,0
1,782,0
2,6962,0
3,7640,0
4,3589,0


In [13]:
# store urlid in separate variables, then remove it from train and test
def fetch_urlid(data):
    return data['urlid']

def delete_urlid(data):
    del data['urlid']

train_urlid = fetch_urlid(train)
test_urlid = fetch_urlid(test)

delete_urlid(train)
delete_urlid(test)

### Exploratory Data Analysis

In [16]:
train.columns

Index(['url', 'boilerplate', 'alchemy_category', 'alchemy_category_score',
       'avglinksize', 'commonlinkratio_1', 'commonlinkratio_2',
       'commonlinkratio_3', 'commonlinkratio_4', 'compression_ratio',
       'embed_ratio', 'framebased', 'frameTagRatio', 'hasDomainLink',
       'html_ratio', 'image_ratio', 'is_news', 'lengthyLinkDomain',
       'linkwordscore', 'news_front_page', 'non_markup_alphanum_characters',
       'numberOfLinks', 'numwords_in_url', 'parametrizedLinkRatio',
       'spelling_errors_ratio', 'label'],
      dtype='object')

**Let's look at the url variable.**

**Let's create a variable that counts the depth of the url.**

e.g. www.guardian.co.uk/a has depth 1, whereas www.guardian.co.uk/a/b has depth 2

In [56]:
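def url_depth(url):
    """
    Takes in a url and returns its depth, measured as the number of '/' separators in its path
    e.g. www.guardian.co.uk/a has depth 1, whereas www.guardian.co.uk/a/b has depth 2
    Note: a trailing slash counts as an extra level

    url - url of the webpage
    """

    # keep only the path component, e.g. '/a/b'
    parsed_url = urlparse(url)
    path = parsed_url.path

    # splitting on '/' yields one more piece than there are separators
    return len(path.split('/')) - 1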

url_depths = train.url.map(url_depth)
assert len(url_depths) == len(train.label)
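
One thing to be aware of: with the definition above, `http://www.guardian.co.uk/a/b/` gets depth 3 because of the trailing slash, and a bare root path `/` gets depth 1. If the intended definition is "number of non-empty path segments", a variant could look like the sketch below. The name `url_depth_strict` is hypothetical and not part of the original notebook, and the example urls assume a scheme such as `http://` is present.

```python
from urllib.parse import urlparse


def url_depth_strict(url):
    """
    Hypothetical variant of url_depth: counts only non-empty path segments,
    so trailing slashes and the bare root path do not add to the depth.

    url - url of the webpage, including the scheme (e.g. http://)
    """
    path = urlparse(url).path
    # empty strings come from leading, trailing or doubled slashes; skip them
    return sum(1 for segment in path.split('/') if segment)


for example in ['http://www.guardian.co.uk/',
                'http://www.guardian.co.uk/a',
                'http://www.guardian.co.uk/a/b',
                'http://www.guardian.co.uk/a/b/']:
    print(example, url_depth_strict(example))
```

Switching to this variant would change the depth counts, so any table computed with `url_depth` would need to be regenerated.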

In [57]:
feature_df = pd.DataFrame({'url_depths': url_depths, 'label': train.label})

**Validate whether this feature is actually indicative of the label.**

1. Split the dataset into training and test set
2. Set up a cross validation scheme.
3. Record the final performance on the test set.
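
The three steps above can be sketched as follows. This is only an illustration under stated assumptions, not the notebook's actual validation code: the data is synthetic, the "model" simply scores each url by the positive rate of its depth value as seen in the training part, and the helper names (`auc`, `fit_rates`, `evaluate`) are hypothetical. Only the standard library is used so the sketch runs on its own.

```python
import random
from collections import defaultdict


def auc(labels, scores):
    """Chance that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_rates(rows):
    """Learn the positive rate per depth; unseen depths fall back to the overall rate."""
    counts = defaultdict(lambda: [0, 0])  # depth -> [n_rows, n_positive]
    for depth, label in rows:
        counts[depth][0] += 1
        counts[depth][1] += label
    overall = sum(label for _, label in rows) / len(rows)
    rates = {d: n_pos / n for d, (n, n_pos) in counts.items()}
    return lambda depth: rates.get(depth, overall)


def evaluate(train_rows, eval_rows):
    """Fit on train_rows and return the AUC on eval_rows."""
    predict = fit_rates(train_rows)
    return auc([y for _, y in eval_rows], [predict(d) for d, _ in eval_rows])


# Synthetic (url_depth, label) pairs; an assumption standing in for feature_df.
rng = random.Random(1)
rows = []
for _ in range(600):
    depth = rng.randint(1, 9)
    p_positive = 0.6 if 3 <= depth <= 5 else 0.4
    rows.append((depth, 1 if rng.random() < p_positive else 0))
rng.shuffle(rows)

# 1. Split the dataset: 80% for development, 20% held out as a test set.
cut = int(0.8 * len(rows))
dev, held_out = rows[:cut], rows[cut:]

# 2. Cross validation: 5 folds over the development set only.
k = 5
folds = [dev[i::k] for i in range(k)]
cv_scores = []
for i in range(k):
    train_part = [r for j, fold in enumerate(folds) if j != i for r in fold]
    cv_scores.append(evaluate(train_part, folds[i]))

# 3. Final performance: fit on all development rows, score the held-out set once.
test_auc = evaluate(dev, held_out)

print('CV AUC per fold:', [round(s, 3) for s in cv_scores])
print('Held-out AUC:', round(test_auc, 3))
```

The held-out set is scored only once, after the cross validation results have been looked at, so that it stays an unbiased estimate. In practice a library such as scikit-learn would likely replace the hand-written split, folds, and metric.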

In [59]:
pd.crosstab(feature_df.url_depths, feature_df.label)

url_depths,label=0,label=1
1,739,590
2,935,852
3,714,1033
4,544,651
5,414,479
6,139,132
7,66,42
8,9,8
9,39,9


In [44]:
urlparse(train.url.iloc[17]).path

'/fashion/gallery/2012/jul/08/gentlemen-compete-chap-olympiad-in-pictures'

In [27]:
train[['url', 'label']]

Unnamed: 0,url,label
0,http://www.bloomberg.com/news/2010-12-23/ibm-p...,0
1,http://www.popsci.com/technology/article/2012-...,1
2,http://www.menshealth.com/health/flu-fighting-...,1
3,http://www.dumblittleman.com/2007/12/10-foolpr...,1
4,http://bleacherreport.com/articles/1205138-the...,0
5,http://www.conveniencemedical.com/genital-herp...,0
6,http://gofashionlane.blogspot.tw/2012/06/ameri...,1
7,http://www.insidershealth.com/article/racing_f...,0
8,http://www.valetmag.com/the-handbook/features/...,1
9,http://www.howsweeteats.com/2010/03/24/cookies...,1
