The dataset contains an ID column, 25 features and one target label column. The description of the features is given below:

    url : URL of the webpage to be classified
    webpageDescription : One line description of the webpage
    alchemy_category : Alchemy category (per the publicly available Alchemy API)
    alchemy_category_score : Alchemy category score (per the publicly available Alchemy API)
    avgLinkWordLength : Average number of words in a webpage
    AvglinkWithOneCommonWord : Average number of web pages sharing at least one word with one other web page
    AvglinkWithTwoCommonWord : Average number of web pages sharing at least one word with two other web pages
    AvglinkWithThreeCommonWord : Average number of web pages sharing at least one word with three other web pages
    AvglinkWithFourCommonWord : Average number of web pages sharing at least one word with four other web pages
    redundancyMeasure : Measure of redundancy computed by finding the compression achieved on this web page via gzip
    embedRatio : Count of tags
    frameBased : Binary indication of whether a webpage has frameset markup
    frameTagRatio : Ratio of frameset markups over total markups
    domainLink : Binary indication of whether the webpage contains a URL with a domain
    tagRatio : Ratio of tags over text in the webpage
    imageTagRatio : Ratio of image tags over text in the webpage
    isNews : Binary indication of whether a webpage is a news article
    lengthyDomain : Binary indication of whether webpage's text contains more than 30 alpha-numeric characters
    hyperlinkToAllWordsRatio : Percentage of words on the webpage that are also in the hyperlink text
    isFrontPageNews : Binary indication of whether webpage is front-page news
    alphanumCharCount : Number of alpha-numeric characters in webpage's text
    linksCount : Number of markups
    wordCount : Number of words in URL
    parametrizedLinkRatio :
    spellingErrorsRatio : Ratio of words that contain spelling errors

The label value of 0 represents that the webpage is not "ad-worthy", and a label value of 1 represents that the webpage is "ad-worthy". Your task is to predict the label value based on the features described above.

In [2]:
import numpy as np
import pandas as pd


import json
import urllib.parse
import string
import re
import nltk

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
train_data = pd.read_csv('dataset/train_data.csv')
test_data = pd.read_csv('dataset/test_data.csv')

### Duplicate rows

In [4]:
print(train_data['url'].duplicated().sum())
print(test_data.duplicated().sum())

0
0


No duplicate rows

### Merging train and test data

### Converting webpageDescription from string to JSON data

In [5]:
merged_data = pd.concat([train_data, test_data], ignore_index=True)

merged_data['webpageDescription'] = merged_data['webpageDescription'].apply(json.loads)

### Checking missing entries in each column

In [6]:
missing_entries = []

for feature in merged_data.columns:
    record = {}
    record['feature'] = feature
    record['NA values'] = merged_data[feature].isna().sum()
    record['? values'] = (merged_data[feature] == '?').sum()
    missing_entries.append(record)

pd.DataFrame(missing_entries)

Unnamed: 0,feature,NA values,? values
0,url,0,0
1,webpageDescription,0,0
2,alchemy_category,0,2342
3,alchemy_category_score,0,2342
4,avgLinkWordLength,0,0
5,AvglinkWithOneCommonWord,0,0
6,AvglinkWithTwoCommonWord,0,0
7,AvglinkWithThreeCommonWord,0,0
8,AvglinkWithFourCommonWord,0,0
9,redundancyMeasure,0,0


### alchemy_category & alchemy_category_score have 2342 missing entries (? values)

### isNews has 2843 missing entries (? values)

### isFrontPageNews has 1248 missing entries (? values)
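
The "?" placeholders can also be counted without an explicit loop. The following is a minimal sketch on a small made-up frame (the `toy` values are illustrative assumptions, not rows from the dataset); it also shows how the placeholders could be converted to NaN so that `isna()` picks them up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for merged_data; the values are made up for illustration.
toy = pd.DataFrame({
    "alchemy_category": ["health", "?", "sports", "?"],
    "isNews": ["1", "?", "?", "1"],
    "linksCount": [10, 20, 30, 40],
})

# Count "?" placeholders per column with a single vectorised comparison.
question_mark_counts = (toy == "?").sum()

# Alternatively, turn the placeholders into NaN so that isna() detects them.
toy_nan = toy.replace("?", np.nan)
na_counts = toy_nan.isna().sum()

print(question_mark_counts)
print(na_counts)
```

`pd.read_csv` also accepts an `na_values` argument, which could perform the same conversion at load time if that is preferred.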



### Analyzing the keys in JSON value of webpageDescription

It seems that the "body" key holds the one-line description of the page mentioned in the dataset description

In [7]:
merged_data['webpageDescription'].apply(lambda x: x.keys())

0       (title, body, url)
1            (body, title)
2       (title, body, url)
3       (title, body, url)
4       (url, title, body)
               ...        
7390    (title, body, url)
7391    (title, body, url)
7392    (url, title, body)
7393    (title, body, url)
7394    (title, body, url)
Name: webpageDescription, Length: 7395, dtype: object

In [8]:
merged_data['webpageDescription'].apply(lambda x: x['body'] is None).sum()

57

### There are 57 entries where the "body" value of webpageDescription is None

In [9]:
merged_data[merged_data['webpageDescription'].apply(lambda x: x['title'] if x['body'] is None else x['body']).isna()]

Unnamed: 0,url,webpageDescription,alchemy_category,alchemy_category_score,avgLinkWordLength,AvglinkWithOneCommonWord,AvglinkWithTwoCommonWord,AvglinkWithThreeCommonWord,AvglinkWithFourCommonWord,redundancyMeasure,...,lengthyDomain,hyperlinkToAllWordsRatio,isFrontPageNews,alphanumCharCount,linksCount,wordCount,parametrizedLinkRatio,spellingErrorsRatio,label,id
2994,http://icanhascheezburger.com/2009/05/07/funny...,"{'title': None, 'body': None, 'url': 'icanhasc...",?,?,1.982178,0.756809,0.285992,0.235409,0.210117,0.0,...,1,35,?,8682,514,6,0.206226,0.125,0.0,5330


### There is 1 entry where webpageDescription has neither a "body" value nor a "title" value

This entry belongs to the train set (it has a label), so we can just delete it

### Replacing webpageDescription using the following logic

    All entries have a "body" key in the webpageDescription column, but for some entries its value is None
    So we fill in those entries with the value of the "title" key
    
    if webpageDescription["body"] != None
        webpageDescription["body"]
    else webpageDescription["title"]

In [10]:
cleaned_data = merged_data.copy(deep=True)
cleaned_data.drop(index=2994, inplace=True)
cleaned_data['webpageDescription'] = cleaned_data['webpageDescription'].apply(lambda x: x['title'] if x['body'] is None else x['body'])

In [11]:
print(cleaned_data['webpageDescription'].isna().sum())
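# An elementwise `== None` comparison is always False in pandas, so test identity per value instead
print(cleaned_data['webpageDescription'].apply(lambda x: x is None).sum())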

0
0


Ensuring that no NA or None values are present in the description column after making the changes

### Analyzing value_counts() of the two news columns

In [12]:
cleaned_data['isNews'].value_counts()

1    4552
?    2842
Name: isNews, dtype: int64

It seems reasonable to assume that ? means isNews = 0, since the column contains no explicit 0 values, but more analysis needs to be done

In [13]:
cleaned_data['isFrontPageNews'].value_counts()

0    5853
?    1247
1     294
Name: isFrontPageNews, dtype: int64

It also seems reasonable to assume that ? refers to isFrontPageNews = 1, but more analysis needs to be done

### Analyzing ? values of isNews column

In [14]:
cleaned_data[cleaned_data['isNews'] == '?'].head(10)

Unnamed: 0,url,webpageDescription,alchemy_category,alchemy_category_score,avgLinkWordLength,AvglinkWithOneCommonWord,AvglinkWithTwoCommonWord,AvglinkWithThreeCommonWord,AvglinkWithFourCommonWord,redundancyMeasure,...,lengthyDomain,hyperlinkToAllWordsRatio,isFrontPageNews,alphanumCharCount,linksCount,wordCount,parametrizedLinkRatio,spellingErrorsRatio,label,id
0,http://www.polyvore.com/cgi/home?id=1389651,polyvore is the best place to discover or sta...,?,?,1.916667,0.047619,0.007937,0.0,0.0,0.803797,...,0,34,0,682,126,1,0.531746,0.142857,1.0,3711
1,http://www.youtube.com/watch?v=ippMPPu6gh4,Speed Air Man--David Belle david belle speed a...,?,?,1.257576,0.141026,0.0,0.0,0.0,1.142857,...,0,12,0,3008,78,1,0.628205,0.0,1.0,7222
4,http://recipes.wuzzle.org/index.php/72,Barbecued Chicken Chow Siew from The Exotic Ki...,computer_internet,0.535009,0.181818,0.036364,0.0,0.0,0.0,0.292614,...,0,3,0,1745,55,1,0.072727,0.115044,1.0,4321
6,http://www.insidershealth.com/natural_cure/poi...,Copyright 2011 InsidersHealth com All Rights R...,health,0.893828,1.691781,0.469799,0.147651,0.026846,0.006711,0.522134,...,0,57,0,961,149,4,0.006711,0.086957,1.0,5416
7,http://yawoot.com/post/2122,Well the world is made of peanut butter now po...,recreation,0.550131,1.170213,0.634921,0.0,0.0,0.0,0.224906,...,0,11,0,1865,63,1,0.031746,0.123209,0.0,7349
16,http://www.nationalpost.com/news/story.html?id...,In a discovery that has stunned even those beh...,health,0.603181,3.261649,0.653595,0.369281,0.196078,0.140523,0.442116,...,1,44,0,5791,306,2,0.075163,0.098795,0.0,4794
17,http://thecompletecookbook.wordpress.com/,Before we get to today s post please understan...,science_technology,0.564931,1.371717,0.475904,0.269076,0.128514,0.054217,0.511271,...,1,13,?,19995,498,0,0.104418,0.104478,0.0,3472
21,http://www.pimpthatsnack.com/full.php,Pimp That Snack Snack Search Join the mailing ...,culture_politics,0.208574,2.388489,0.733813,0.496403,0.302158,0.086331,0.723183,...,0,86,0,825,139,1,0.093525,0.125,1.0,4644
22,http://www.etsy.com/treasury/MTA0NjAxODZ8NzkwM...,,culture_politics,0.165873,1.488235,0.408451,0.140845,0.014085,0.0,21.0,...,0,43,0,1818,213,2,0.267606,0.315789,0.0,3528
24,http://warehouse.carlh.com/article_157/,February 4th 2008 Copy the line below to link ...,arts_entertainment,0.361629,2.0,0.086957,0.0,0.0,0.0,0.485349,...,0,4,0,2814,23,1,0.173913,0.095164,1.0,3744


In [15]:
cleaned_data.iloc[16, 1]

'In a discovery that has stunned even those behind it scientists at a Toronto hospital say they have proof the body s nervous system helps trigger diabetes opening the door to a potential near cure of the disease that affects millions of Canadians Diabetic mice became healthy virtually overnight after researchers injected a substance to counteract the effect of malfunctioning pain neurons in the pancreas I couldn t believe it said Dr Michael Salter a pain expert at the Hospital for Sick Children and one of the scientists Mice with diabetes suddenly didn t have diabetes any more The researchers caution they have yet to confirm their findings in people but say they expect results from human studies within a year or so Any treatment that may emerge to help at least some patients would likely be years away from hitting the market But the excitement of the team from Sick Kids whose work is being published today in the journal Cell is almost palpable I ve never seen anything like it said Dr 

It is clear that this is a news article, yet its 'isNews' value is '?'

This tells us that every entry with '?' for isNews would have to be examined closely to determine whether it is a news page or not

Alternatively, we can treat this entry as an outlier and assume that a page is usually not a news page whenever isNews = '?'

A separate NLP model could be trained to predict isNews from the description of the page, but this would be unreliable, and the given dataset contains no entries with class label 0 for isNews to train on. So it makes more sense to manually label each entry as news or not news.

Similar manual labelling can be done for isFrontPageNews as well


These same trends were observed on the test data separately as well

### Analyzing isFrontPageNews column for ? values

In [16]:
cleaned_data[cleaned_data['isFrontPageNews'] == '?']['isNews'].value_counts()

?    1247
Name: isNews, dtype: int64

The entries which have ? values in isFrontPageNews also have ? values in isNews

### Analyzing the alchemy_category column

In [17]:
cleaned_data['alchemy_category'].value_counts()

?                     2341
recreation            1229
arts_entertainment     941
business               880
health                 506
sports                 380
culture_politics       343
computer_internet      296
science_technology     289
gaming                  76
religion                72
law_crime               31
unknown                  6
weather                  4
Name: alchemy_category, dtype: int64

In [18]:
cleaned_data[cleaned_data['alchemy_category'] == 'law_crime'].head()

Unnamed: 0,url,webpageDescription,alchemy_category,alchemy_category_score,avgLinkWordLength,AvglinkWithOneCommonWord,AvglinkWithTwoCommonWord,AvglinkWithThreeCommonWord,AvglinkWithFourCommonWord,redundancyMeasure,...,lengthyDomain,hyperlinkToAllWordsRatio,isFrontPageNews,alphanumCharCount,linksCount,wordCount,parametrizedLinkRatio,spellingErrorsRatio,label,id
66,http://vii2012.com/2012/06/accidental-90s-nick...,Your smiling at me is my daily dose of magic S...,law_crime,0.505332,3.666667,0.545455,0.454545,0.318182,0.227273,0.621094,...,0,51,0,306,22,4,0.0,0.16129,0.0,6771
377,http://www.independent.co.uk/news/world/americ...,google ad client ca pub 5964551156905038 if re...,law_crime,0.818603,2.06846,0.579572,0.187648,0.054632,0.033254,0.432215,...,1,53,0,3828,421,14,0.07601,0.09396,1.0,3319
639,http://newsfeed.time.com/2012/01/09/reading-wh...,Monday s links talk surprising birthdays body ...,law_crime,0.752151,2.972222,0.553333,0.193333,0.1,0.073333,0.504981,...,1,57,0,1595,150,8,0.126667,0.087108,0.0,452
847,http://www.insidethekaganoffkitchen.com/2010/0...,by Rachel on April 11 2010 I just love a good ...,law_crime,0.246582,1.904255,0.392523,0.242991,0.121495,0.11215,0.522953,...,1,19,0,3475,107,2,0.009346,0.089928,1.0,3203
1334,http://www.magnoliarouge.com/2012/08/outfit-of...,AboutWelcome to Magnolia Rouge a collection of...,law_crime,0.313761,1.300971,0.258929,0.071429,0.035714,0.017857,0.675141,...,0,45,0,1034,112,4,0.241071,0.136986,0.0,1628


### This cell shows how alchemy_category can be misclassified

Some of these pages are clearly unrelated to the law_crime category

This tells us that alchemy_category and its score cannot be relied on too much. One option is to threshold on alchemy_category_score and retain only those categories whose score is high, for example above 0.6 (60%)



In [19]:
cleaned_data[cleaned_data['alchemy_category_score'].apply(lambda x: 0 if x == '?' else float(x)) > 0.6]

Unnamed: 0,url,webpageDescription,alchemy_category,alchemy_category_score,avgLinkWordLength,AvglinkWithOneCommonWord,AvglinkWithTwoCommonWord,AvglinkWithThreeCommonWord,AvglinkWithFourCommonWord,redundancyMeasure,...,lengthyDomain,hyperlinkToAllWordsRatio,isFrontPageNews,alphanumCharCount,linksCount,wordCount,parametrizedLinkRatio,spellingErrorsRatio,label,id
5,http://www.sleepdisordersguide.com/blog/good-n...,You have to pay high price if you are not gett...,health,0.953935,3.697917,0.848485,0.636364,0.191919,0.101010,0.382585,...,1,23,0,6061,99,16,0.010101,0.105439,1.0,6940
6,http://www.insidershealth.com/natural_cure/poi...,Copyright 2011 InsidersHealth com All Rights R...,health,0.893828,1.691781,0.469799,0.147651,0.026846,0.006711,0.522134,...,0,57,0,961,149,4,0.006711,0.086957,1.0,5416
12,http://sugarcrafter.net/2009/08/20/sesame-chic...,August 20 2009 Print E mail Filed under Asian ...,arts_entertainment,0.722537,2.573529,0.451220,0.195122,0.146341,0.097561,0.487435,...,1,19,0,3447,82,2,0.329268,0.073770,1.0,6194
16,http://www.nationalpost.com/news/story.html?id...,In a discovery that has stunned even those beh...,health,0.603181,3.261649,0.653595,0.369281,0.196078,0.140523,0.442116,...,1,44,0,5791,306,2,0.075163,0.098795,0.0,4794
19,http://bleacherreport.com/articles/1319130-100...,100 Ashley Judd Dario Franchitti Superfan Poli...,arts_entertainment,0.85,2.649573,0.496503,0.237762,0.118881,0.055944,0.463007,...,1,7,0,18053,143,4,0.041958,0.116240,0.0,5947
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7388,http://www.deepsouthdish.com/2010/10/macaroni-...,As soon as I saw this Mac n Cheese Soup recipe...,recreation,0.65858,1.125320,0.295122,0.151220,0.078049,0.043902,0.482424,...,1,27,0,5241,410,4,0.209756,0.094808,,1428
7389,http://www.nanny.net/blog/18-blogs-featuring-t...,Posted on June 5 2013 by admin in NannyMillion...,health,0.625404,4.294737,0.845361,0.701031,0.360825,0.144330,0.466958,...,1,34,0,3456,97,11,0.020619,0.119048,,3262
7390,http://www.lanascooking.com/2011/07/08/roasted...,,business,0.681418,1.640000,0.367521,0.102564,0.034188,0.017094,21.000000,...,1,14,0,4994,117,3,0.119658,0.250000,,3336
7392,http://www.laweekly.com/bestof/2011/award/best...,Best Belly Dance Workout 2011 Swerve Studio s ...,arts_entertainment,0.943693,1.829787,0.512690,0.279188,0.055838,0.020305,0.547304,...,1,56,0,1425,197,5,0.446701,0.056122,,738


There are 2733 rows in total with alchemy_category_score > 0.6, i.e. with more than 60% confidence that the alchemy_category is correct.

Of the 7394 entries, 2341 are ? entries and 2733 have high confidence, which leaves 7394 - 2341 - 2733 = 2320 entries with a tagged alchemy_category but low confidence
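
The three groups (missing, high confidence, low confidence) can be counted directly. This is a minimal sketch on made-up scores, assuming the same 0.6 cut-off as above; `scores_raw` is a hypothetical stand-in for the alchemy_category_score column. It coerces "?" to NaN rather than to 0, so that missing scores stay distinct from low ones.

```python
import pandas as pd

# Made-up scores standing in for the alchemy_category_score column.
scores_raw = pd.Series(["0.95", "?", "0.31", "0.72", "?", "0.60"])

# errors="coerce" turns the non-numeric "?" placeholders into NaN.
scores = pd.to_numeric(scores_raw, errors="coerce")

THRESHOLD = 0.6  # same cut-off as used in the cell above

n_missing = int(scores.isna().sum())
n_high = int((scores > THRESHOLD).sum())   # NaN compares as False, so it is not counted
n_low = len(scores) - n_missing - n_high   # tagged but below the threshold

print(n_missing, n_high, n_low)
```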

In [20]:
cleaned_data[cleaned_data['alchemy_category'] == '?']

Unnamed: 0,url,webpageDescription,alchemy_category,alchemy_category_score,avgLinkWordLength,AvglinkWithOneCommonWord,AvglinkWithTwoCommonWord,AvglinkWithThreeCommonWord,AvglinkWithFourCommonWord,redundancyMeasure,...,lengthyDomain,hyperlinkToAllWordsRatio,isFrontPageNews,alphanumCharCount,linksCount,wordCount,parametrizedLinkRatio,spellingErrorsRatio,label,id
0,http://www.polyvore.com/cgi/home?id=1389651,polyvore is the best place to discover or sta...,?,?,1.916667,0.047619,0.007937,0.000000,0.000000,0.803797,...,0,34,0,682,126,1,0.531746,0.142857,1.0,3711
1,http://www.youtube.com/watch?v=ippMPPu6gh4,Speed Air Man--David Belle david belle speed a...,?,?,1.257576,0.141026,0.000000,0.000000,0.000000,1.142857,...,0,12,0,3008,78,1,0.628205,0.000000,1.0,7222
11,http://www.chow.com/recipes/13499-creamy-carro...,Difficulty Easy Total Time 55 mins Active Time...,?,?,1.804000,0.436090,0.086466,0.007519,0.003759,0.476190,...,1,25,0,6611,266,4,0.406015,0.068306,1.0,5156
15,http://www.whattoexpect.com/preconception/heal...,Next time you re in the bathroom crack open yo...,?,?,2.476562,0.558824,0.250000,0.095588,0.014706,0.450391,...,1,34,0,3185,136,9,0.044118,0.144550,1.0,4988
23,http://www.washingtonpost.com/lifestyle/wellne...,The oil is the healthiest part of a nut contai...,?,?,2.919403,0.632312,0.281337,0.130919,0.077994,0.440646,...,1,42,0,6962,359,9,0.128134,0.094203,0.0,2242
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7383,http://www.healthguru.com/applications/quiz/qz...,Weird Facts on Drugs Quiz HealthGuru Question ...,?,?,2.243094,0.622951,0.327869,0.120219,0.027322,0.858025,...,1,78,?,730,183,6,0.016393,0.132075,,3106
7385,http://news.menshealth.com/touch-at-your-own-p...,It s a wonder your hand doesn t melt off every...,?,?,0.000000,0.000000,0.000000,0.000000,0.000000,0.512821,...,0,0,0,0,161,12,0.124224,0.080357,,4313
7387,http://dentalcalgary.org/,,?,?,1.363636,0.230769,0.153846,0.000000,0.000000,21.000000,...,0,1,?,4594,13,0,0.000000,0.500000,,6131
7391,http://www.marthastewart.com/255456/billys-cho...,Save to your Collections Sorry for the inconve...,?,?,2.236559,0.676768,0.232323,0.090909,0.020202,0.413333,...,1,36,?,1933,99,4,0.060606,0.105263,,2867


As with the isNews labelling, each ? value can fall into any one of the given categories, which can usually be inferred quickly from the context. These values can either be filled in manually or predicted by a separate NLP classifier.

Once the labelling is finished, the alchemy_category_score column is no longer needed, because by that point alchemy_category will only contain entries for which we have very high confidence

### Analyzing the numerical columns in the dataset

Basically analyzing the output of the describe() call on the dataset

In [21]:
cleaned_data.describe()

Unnamed: 0,avgLinkWordLength,AvglinkWithOneCommonWord,AvglinkWithTwoCommonWord,AvglinkWithThreeCommonWord,AvglinkWithFourCommonWord,redundancyMeasure,embedRatio,framebased,frameTagRatio,domainLink,...,imageTagRatio,lengthyDomain,hyperlinkToAllWordsRatio,alphanumCharCount,linksCount,wordCount,parametrizedLinkRatio,spellingErrorsRatio,label,id
count,7394.0,7394.0,7394.0,7394.0,7394.0,7394.0,7394.0,7394.0,7394.0,7394.0,...,7394.0,7394.0,7394.0,7394.0,7394.0,7394.0,7394.0,7394.0,5915.0,7394.0
mean,2.761929,0.468191,0.21407,0.092043,0.04924,2.255408,-0.103629,0.0,0.056427,0.021233,...,0.275882,0.660265,30.076413,5716.197187,178.709224,4.960509,0.17286,0.101218,0.517329,3696.779145
std,8.620371,0.203119,0.14675,0.09597,0.07261,5.704639,0.306389,0.0,0.041447,0.144171,...,1.919392,0.473651,20.3944,8875.965656,179.435974,3.233307,0.183298,0.079236,0.499742,2134.956849
min,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,...,-1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,1.601957,0.340342,0.105263,0.022222,0.0,0.442632,0.0,0.0,0.028504,0.0,...,0.025923,0.0,14.0,1579.0,82.0,3.0,0.040984,0.068734,0.0,1848.25
50%,2.088562,0.481481,0.202417,0.06856,0.022222,0.483681,0.0,0.0,0.045788,0.0,...,0.083064,1.0,25.0,3500.0,139.0,5.0,0.113402,0.089311,1.0,3696.5
75%,2.627451,0.616532,0.3,0.133273,0.065041,0.578263,0.0,0.0,0.07346,0.0,...,0.236869,1.0,43.0,6376.0,222.0,7.0,0.241339,0.112331,1.0,5545.75
max,363.0,1.0,1.0,0.980392,0.980392,21.0,0.25,0.0,0.444444,1.0,...,113.333333,1.0,100.0,207952.0,4997.0,22.0,1.0,1.0,1.0,7394.0


avgLinkWordLength has a max value of 363, which looks like an outlier considering that both the median and the mean are around the 2-3 mark.

However, avgLinkWordLength is described as the average number of words in a webpage, and it is plausible that a particular page has 363 words in it. So the value cannot be treated as invalid.

A simple outlier fix, such as setting the outlier values to the third quartile (Q3), is also valid.

Regardless, it will be beneficial if feature standardization is done on all these numeric columns.
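
Both ideas can be sketched on a small made-up series. The 1.5 * IQR rule used below to decide what counts as an outlier is an assumption (the notebook does not define one), and the values only mimic a skewed column such as avgLinkWordLength.

```python
import pandas as pd

# Made-up values mimicking a skewed numeric column.
values = pd.Series([1.5, 2.0, 2.1, 2.6, 3.0, 363.0])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)  # assumption: 1.5 * IQR rule defines an outlier

# Keep values at or below the fence and replace the rest with Q3.
capped = values.where(values <= upper_fence, q3)

# Standardize to zero mean and unit variance (population std, ddof=0).
standardized = (capped - capped.mean()) / capped.std(ddof=0)

print(capped.tolist())
print(standardized.round(3).tolist())
```

scikit-learn's `StandardScaler` applies the same zero-mean, unit-variance transformation and may be more convenient when the scaling has to be fitted on the train rows and reused on the test rows.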

The embedRatio column is described as containing the "count of tags", but it has negative values (its minimum is -1), which makes that description hard to interpret. Note that the max value in the column is only 0.25, which suggests that this is a low-valued column.

framebased seems to have just 0 values, so that can be dropped

domainLink (binary column) seems to be skewed toward the 0 class.

hyperlinkToAllWordsRatio (percentage column) has values in range 0 to 100, so no erroneous or outlier values there.

alphanumCharCount understandably has very high values. Its min value is 0, which is also possible because some webpages in the dataset have only images or videos hosted on them.

linksCount, wordCount, parametrizedLinkRatio and spellingErrorsRatio also seem to have appropriate values


In [22]:
cleaned_data['framebased'].value_counts()

0    7394
Name: framebased, dtype: int64

The framebased column is completely 0. So we can just drop it

In [23]:
cleaned_data.drop('framebased', axis=1, inplace=True)

### Analyzing tagRatio column

In [24]:
cleaned_data['tagRatio'].describe()

count    7394.000000
mean        0.233787
std         0.052484
min         0.045564
25%         0.201087
50%         0.230565
75%         0.260774
max         0.716883
Name: tagRatio, dtype: float64

Everything looks alright here

### Analyzing domainLink column

In [25]:
cleaned_data['domainLink'].value_counts()

0    7237
1     157
Name: domainLink, dtype: int64

In [26]:
test_data['domainLink'].value_counts()

0    1436
1      43
Name: domainLink, dtype: int64

domainLink is heavily imbalanced in both the merged data and the test data, but oversampling doesn't make much sense. At this point it seems more sensible to drop the column entirely, but this can be tested later by training the model with and without this column.

### Analyzing lengthyDomain column

In [27]:
cleaned_data['lengthyDomain'].value_counts()

1    4882
0    2512
Name: lengthyDomain, dtype: int64

Everything looks alright here

### Checking correlation between numeric columns

In [28]:
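# Restrict to numeric columns explicitly; newer pandas versions raise on non-numeric columns in corr()
cleaned_data.select_dtypes(include='number').corr()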

Unnamed: 0,avgLinkWordLength,AvglinkWithOneCommonWord,AvglinkWithTwoCommonWord,AvglinkWithThreeCommonWord,AvglinkWithFourCommonWord,redundancyMeasure,embedRatio,frameTagRatio,domainLink,tagRatio,imageTagRatio,lengthyDomain,hyperlinkToAllWordsRatio,alphanumCharCount,linksCount,wordCount,parametrizedLinkRatio,spellingErrorsRatio,label,id
avgLinkWordLength,1.0,0.120501,0.161778,0.174599,0.134599,-0.003583,0.005221,-0.049283,-0.002048,0.01896,-0.00301,0.020861,0.122553,-0.010978,0.000383,-0.033887,0.006091,0.035397,0.010598,-0.009222
AvglinkWithOneCommonWord,0.120501,1.0,0.808077,0.560458,0.388558,-0.017805,0.005846,-0.294754,0.006819,-0.201301,-0.064318,0.421218,0.25719,0.193878,0.317052,0.144313,-0.078072,-0.035082,0.094861,-0.000993
AvglinkWithTwoCommonWord,0.161778,0.808077,1.0,0.758358,0.555195,-0.032435,0.019593,-0.259183,0.000273,-0.159636,-0.044621,0.39879,0.257583,0.177767,0.311447,0.096921,-0.079498,-0.027908,0.095873,-0.008873
AvglinkWithThreeCommonWord,0.174599,0.560458,0.758358,1.0,0.850567,-0.01611,0.008175,-0.218435,-0.031072,-0.133142,-0.050232,0.363082,0.109622,0.263996,0.283656,0.049146,-0.00869,-0.008661,0.109624,-0.004883
AvglinkWithFourCommonWord,0.134599,0.388558,0.555195,0.850567,1.0,-0.020304,0.006354,-0.177883,-0.052492,-0.13623,-0.037886,0.287049,0.059171,0.162838,0.233471,0.026297,0.036344,-0.013602,0.083346,-0.003317
redundancyMeasure,-0.003583,-0.017805,-0.032435,-0.01611,-0.020304,1.0,-0.890026,0.1593,0.02765,0.106278,-0.189019,-0.090291,0.146485,-0.064146,-0.055301,-0.042597,-0.033763,0.364144,-0.059075,0.011109
embedRatio,0.005221,0.005846,0.019593,0.008175,0.006354,-0.890026,1.0,-0.131163,-0.026546,-0.091519,0.183657,0.075652,-0.108444,0.046643,0.043716,0.043496,0.037455,-0.342287,0.041775,-0.011827
frameTagRatio,-0.049283,-0.294754,-0.259183,-0.218435,-0.177883,0.1593,-0.131163,1.0,0.010177,0.384853,-0.088929,-0.196608,0.158909,-0.30366,-0.362384,0.049369,-0.094541,0.033699,-0.182243,-0.010369
domainLink,-0.002048,0.006819,0.000273,-0.031072,-0.052492,0.02765,-0.026546,0.010177,1.0,0.00964,-0.003903,0.008593,0.022588,-0.017355,0.013718,0.058092,0.051334,0.008724,-0.004861,0.00047
tagRatio,0.01896,-0.201301,-0.159636,-0.133142,-0.13623,0.106278,-0.091519,0.384853,0.00964,1.0,-0.173067,-0.215714,-0.141338,-0.136508,-0.455465,-0.04195,-0.183339,0.013915,-0.057335,-0.002651


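Consecutive CommonWord columns are strongly correlated (0.81, 0.76 and 0.85), so only one or two of them should be retained, like OneCommonWord & FourCommonWord, which have a correlation of only 0.39

And one of embedRatio or redundancyMeasure must definitely be dropped, since their correlation is -0.89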
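
Rather than reading the pairs off the matrix by eye, they can be listed programmatically. The helper below is a hypothetical sketch (the name `highly_correlated_pairs` and the 0.75 threshold are assumptions, not part of the notebook), shown on made-up data.

```python
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.75):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold."""
    corr = frame.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Made-up data: b is nearly 2 * a, c is a reversed, d is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],
    "d": [1.0, -1.0, 0.5, -0.5, 0.0],
})

for col_a, col_b, value in highly_correlated_pairs(toy):
    print(col_a, col_b, round(value, 3))
```

Taking the absolute value matters here, because a strong negative correlation (such as the one between embedRatio and redundancyMeasure) is just as redundant as a strong positive one.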

### Dropping embedRatio, two common word & three common word columns

embedRatio is dropped rather than redundancyMeasure because its meaning is more ambiguous

In [29]:
non_correlated_data = cleaned_data.drop(['embedRatio', 'AvglinkWithTwoCommonWord', 'AvglinkWithThreeCommonWord'], axis=1)

### Generating websiteName feature from URL parameter

    Using the urllib library to parse through the url of the given page and extract the host name from it, 
        urllib.parse.urlparse("URL").netloc

In [30]:
non_correlated_data['websiteName'] = non_correlated_data['url'].apply(lambda x: urllib.parse.urlparse(x).netloc)
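
As a quick illustration of what `netloc` returns, here is a small sketch using three URLs copied from the sample rows shown earlier. It also shows two caveats worth keeping in mind: the "www." prefix is kept as part of the host name, and a URL without a scheme yields an empty `netloc`.

```python
from urllib.parse import urlparse

# URLs copied from the sample rows displayed earlier in this notebook.
urls = [
    "http://www.youtube.com/watch?v=ippMPPu6gh4",
    "http://recipes.wuzzle.org/index.php/72",
    "http://yawoot.com/post/2122",
]

hosts = [urlparse(u).netloc for u in urls]
print(hosts)  # ['www.youtube.com', 'recipes.wuzzle.org', 'yawoot.com']

# Without a scheme, the host ends up in .path and .netloc is empty.
no_scheme = urlparse("yawoot.com/post/2122")
print(repr(no_scheme.netloc))  # ''
```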

websiteName is a categorical column with a very large number of distinct values

So we now have to decide how many categories to keep in this dataset and tag the remaining ones under the umbrella category of "other"

In [31]:
website_names = non_correlated_data['websiteName'].value_counts()

# Number of website names that occur more than min_count times
for min_count in (0, 5, 10, 15, 20, 30):
    print((website_names > min_count).sum())

3372
199
68
40
31
19


There are 3372 unique website names in total across the 7394 entries.

19 unique website names have a count > 30 in the dataset, so we can keep these website names as they are and combine all the other website names into an "other" category

This gives 20 categories in total in websiteName

The 19 retained website names are:

In [32]:
websitesWithAtleast30Entries = list(website_names[website_names > 30].index)

website_names[website_names > 30]

www.insidershealth.com       143
sportsillustrated.cnn.com    109
www.huffingtonpost.com        99
allrecipes.com                93
bleacherreport.com            86
www.youtube.com               85
blogs.babble.com              62
www.ivillage.com              59
www.foodnetwork.com           57
www.dailymail.co.uk           46
www.epicurious.com            36
www.womansday.com             35
www.bbc.co.uk                 34
www.guardian.co.uk            33
www.popsci.com                33
www.marthastewart.com         33
www.collegehumor.com          31
www.buzzfeed.com              31
itechfuture.com               31
Name: websiteName, dtype: int64

In [33]:
non_correlated_data['websiteName'] = non_correlated_data['websiteName'].apply(lambda x: x if x in websitesWithAtleast30Entries else 'other')

New value counts of websiteName column

In [34]:
non_correlated_data['websiteName'].value_counts()

other                        6258
www.insidershealth.com        143
sportsillustrated.cnn.com     109
www.huffingtonpost.com         99
allrecipes.com                 93
bleacherreport.com             86
www.youtube.com                85
blogs.babble.com               62
www.ivillage.com               59
www.foodnetwork.com            57
www.dailymail.co.uk            46
www.epicurious.com             36
www.womansday.com              35
www.bbc.co.uk                  34
www.popsci.com                 33
www.guardian.co.uk             33
www.marthastewart.com          33
www.buzzfeed.com               31
itechfuture.com                31
www.collegehumor.com           31
Name: websiteName, dtype: int64
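
The same grouping can be expressed without a Python-level lambda. This is a minimal sketch on made-up host names, with a hypothetical `MIN_COUNT` of 2 standing in for the cut-off of 30 used above.

```python
import pandas as pd

# Made-up host names standing in for the websiteName column.
hosts = pd.Series(["a.com"] * 4 + ["b.com"] * 3 + ["c.com", "d.com"])

MIN_COUNT = 2  # hypothetical cut-off for this toy data; the notebook uses 30

counts = hosts.value_counts()
frequent = counts[counts > MIN_COUNT].index

# Keep frequent names as they are and replace everything else with "other".
grouped = hosts.where(hosts.isin(frequent), "other")

print(grouped.value_counts())
```

One design point to consider: if the set of frequent names is derived from train and test rows together, as in the merged frame here, the category list depends on the test data as well.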

### Processing the webpageDescription column

Reference: https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

### 1. Split the sentences into words

In [35]:
webpage_description = non_correlated_data['webpageDescription']

In [36]:
webpage_description_words = [x.split() for x in webpage_description]

In [37]:
print(webpage_description_words[0])

['polyvore', 'is', 'the', 'best', 'place', 'to', 'discover', 'or', 'start', 'fashion', 'trends.', 'browse', 'and', 'shop', 'looks', 'created', 'by', 'a', 'global', 'community', 'of', 'independent', 'trendsetters', 'and', 'stylists.']


### 2. Remove punctuation marks

Python strings provide a translate() method that maps one set of characters to another.

The mapping table is created with str.maketrans().
The first two arguments are left empty, i.e., '' is mapped to '', so no characters are substituted.
The third argument lists all of the characters to remove during the translation process, so we pass it all the punctuation marks, which are available as string.punctuation

In [38]:
# Make a translation table that maps all '' to '' and removes all punctuation marks
translation_table = str.maketrans('', '', string.punctuation)

stripped_data = [[y.translate(translation_table) for y in x] for x in webpage_description_words]

In [39]:
print(stripped_data[0])

['polyvore', 'is', 'the', 'best', 'place', 'to', 'discover', 'or', 'start', 'fashion', 'trends', 'browse', 'and', 'shop', 'looks', 'created', 'by', 'a', 'global', 'community', 'of', 'independent', 'trendsetters', 'and', 'stylists']


The full stops at the end of 'trends.' and 'stylists.' have now been removed
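
One side effect of deleting punctuation is that words joined by punctuation get fused, e.g. "Man--David" from the second row's description. The sketch below contrasts the deletion approach with an alternative that maps every punctuation mark to a space; the alternative is an assumption about what might be preferable, not what this notebook does.

```python
import string

text = "Speed Air Man--David Belle, parkour!"

# Deleting punctuation (the approach used above) can fuse neighbouring words.
delete_table = str.maketrans("", "", string.punctuation)
deleted = text.translate(delete_table).split()

# Alternative: map each punctuation mark to a space, then split on whitespace.
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = text.translate(space_table).split()

print(deleted)  # ['Speed', 'Air', 'ManDavid', 'Belle', 'parkour']
print(spaced)   # ['Speed', 'Air', 'Man', 'David', 'Belle', 'parkour']
```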

### 3. Converting everything to lowercase

It is common to convert all words to one case.

This means that the vocabulary will shrink in size, but some distinctions are lost (e.g. “Apple” the company vs “apple” the fruit is a commonly used example).

In [40]:
lowercase_data = [[y.lower() for y in x] for x in stripped_data]

In [41]:
print(stripped_data[1])
print(lowercase_data[1])

['Speed', 'Air', 'ManDavid', 'Belle', 'david', 'belle', 'speed', 'air', 'parkour', 'Sports']
['speed', 'air', 'mandavid', 'belle', 'david', 'belle', 'speed', 'air', 'parkour', 'sports']


### 4. Filter out tokens/words that are not alphabetic

In [42]:
alphabetic_data = [[y for y in x if y.isalpha()] for x in lowercase_data]

In [43]:
print(lowercase_data[4][-30:], end='\n\n')
print(alphabetic_data[4][-30:])

['of', 'malaysia', 'by', 'copeland', 'marks', 'collection', 'of', '4000', 'recipes', 'from', 'all', 'over', 'the', 'world', 'great', 'international', 'and', 'ethnic', 'cuisine', 'international', 'recipes', 'ethnic', 'cuisine', 'world', 'recipes', 'ethnic', 'food', 'malaysian', 'recipes', 'malaysian']

['kitchens', 'of', 'malaysia', 'by', 'copeland', 'marks', 'collection', 'of', 'recipes', 'from', 'all', 'over', 'the', 'world', 'great', 'international', 'and', 'ethnic', 'cuisine', 'international', 'recipes', 'ethnic', 'cuisine', 'world', 'recipes', 'ethnic', 'food', 'malaysian', 'recipes', 'malaysian']
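str.isalpha() returns True only when a string is non-empty and every character is a letter. A small sketch with made-up tokens shows which kinds of token this filter drops:

```python
# Illustrative tokens: a number, a mixed token, an empty string,
# and a word with a non-ASCII letter ('caf\u00e9' is 'café')
tokens = ['of', '4000', 'recipes', 'mp3', '', 'caf\u00e9']

kept = [t for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

print(dropped)  # ['4000', 'mp3', '']
```

Besides pure numbers such as '4000', the filter also removes mixed tokens and the empty strings left behind when a punctuation-only token is stripped. Letters outside ASCII still count as alphabetic.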


### 5. Removing stopwords

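Calling stopwords.words('english') inside the comprehension rebuilds the stopword list for every single token, which makes this step take a significant amount of time. Loading the list once into a set avoids that and gives the same result.

In [44]:
english_stopwords = set(stopwords.words('english'))
no_stopwords_data = [[y for y in x if y not in english_stopwords] for x in alphabetic_data]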

In [45]:
print(no_stopwords_data[0])

['polyvore', 'best', 'place', 'discover', 'start', 'fashion', 'trends', 'browse', 'shop', 'looks', 'created', 'global', 'community', 'independent', 'trendsetters', 'stylists']


### 6. Stemming words

Stemming chops the ends off words so that each one is reduced to something close to its root form.

For example, "studies" is stemmed to "studi" and "drinking" to "drink".

In [46]:
porter = PorterStemmer()

stemmed_data = [[porter.stem(y) for y in x] for x in no_stopwords_data]

In [47]:
print(no_stopwords_data[0], end='\n\n')

print(stemmed_data[0])

['polyvore', 'best', 'place', 'discover', 'start', 'fashion', 'trends', 'browse', 'shop', 'looks', 'created', 'global', 'community', 'independent', 'trendsetters', 'stylists']

['polyvor', 'best', 'place', 'discov', 'start', 'fashion', 'trend', 'brows', 'shop', 'look', 'creat', 'global', 'commun', 'independ', 'trendsett', 'stylist']
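To make the idea of suffix stripping concrete, here is a deliberately tiny stemmer. It is a toy written for illustration only and is not the Porter algorithm, which applies many more rules and conditions. The function name `toy_stem` and its three rules are assumptions.

```python
def toy_stem(word):
    """Strip a few common suffixes. A toy illustration, NOT the Porter algorithm."""
    rules = [('ies', 'i'), ('ing', ''), ('s', '')]
    for suffix, replacement in rules:
        stem = word[:len(word) - len(suffix)]
        # Only strip when the suffix matches and at least three characters remain
        if word.endswith(suffix) and len(stem) >= 3:
            return stem + replacement
    return word

words = ['studies', 'drinking', 'trends', 'trend', 'looks', 'look']
stems = [toy_stem(w) for w in words]

print(stems)  # ['studi', 'drink', 'trend', 'trend', 'look', 'look']
print(len(set(words)), '->', len(set(stems)))  # 6 -> 4
```

Different surface forms of the same word end up as one vocabulary entry, which is the reason for stemming before vectorization.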


### Using TF-IDF Vectorizer to encode text into word vectors

The vectorizer expects a list of documents, each given as a single string, so the tokens of every description are joined with spaces before being passed to it.

Reference: https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

In [48]:
vectorizer = TfidfVectorizer()

joined_data = [' '.join(x) for x in stemmed_data]
vectorizer.fit(joined_data)

TfidfVectorizer()

In [49]:
vectorized_data = vectorizer.transform(joined_data)

In [50]:
vectorized_data.shape

(7394, 62013)

Vectorization produced a 7394 x 62013 matrix, where 7394 is the number of rows in the dataset and 62013 is the size of the vocabulary.

In [55]:
joined_data[0]

'polyvor best place discov start fashion trend brows shop look creat global commun independ trendsett stylist'

In [51]:
vectorized_data[0]

<1x62013 sparse matrix of type '<class 'numpy.float64'>'
	with 16 stored elements in Compressed Sparse Row format>

The first entry contains 16 distinct terms, and these correspond to the 16 stored (non-zero) elements in vectorized_data[0].
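The weights themselves can be reproduced by hand. The following pure-Python sketch follows the formula that scikit-learn documents for its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The three-document corpus is made up, and splitting on whitespace is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Made-up corpus of already-stemmed, space-joined documents
docs = ["fashion trend shop", "fashion shop shop", "recip food"]

tokenized = [d.split() for d in docs]
vocabulary = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents in which each term occurs
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector of one document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
nonzero = [sum(1 for v in row if v > 0) for row in matrix]

print(len(matrix), len(vocabulary))  # 3 5
print(nonzero)                       # [3, 2, 2]
```

As in the real matrix, the number of columns equals the vocabulary size and the number of non-zero values in a row equals the number of distinct terms in that document.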

In [52]:
vectorized_data_df = pd.DataFrame(vectorized_data.toarray())

This dataframe can be passed directly to an ML classifier, as it now contains only numbers.
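One practical caveat is that toarray() materializes every zero in the matrix. A back-of-the-envelope calculation using the shape reported above, assuming 8 bytes per float64 value:

```python
rows, cols = 7394, 62013   # shape of vectorized_data
bytes_per_value = 8        # float64

dense_bytes = rows * cols * bytes_per_value
print(f"{dense_bytes / 1e9:.2f} GB")  # 3.67 GB
```

Many scikit-learn estimators accept the sparse matrix returned by transform() directly, which avoids this cost when a dataframe is not needed.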

### Merging the vectorized data with our processed dataset

In [53]:
processed_data = pd.concat([non_correlated_data, vectorized_data_df], axis=1)
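
Two details of pd.concat with axis=1 are worth keeping in mind. It aligns rows on the index rather than on position, and it keeps the integer column labels of the TF-IDF dataframe next to the string labels of the other features. The sketch below uses small made-up frames (`features` and `tfidf` are hypothetical stand-ins for the two dataframes). Whether non_correlated_data actually has a non-default index is not shown in this notebook, so the reset is a precaution rather than a required fix.

```python
import pandas as pd

# Stand-in for the feature dataframe, with gaps in its index (e.g. after dropping rows)
features = pd.DataFrame({"wordCount": [10, 20, 30]}, index=[0, 2, 5])
# Stand-in for the TF-IDF dataframe, with a default 0..2 index and integer column labels
tfidf = pd.DataFrame([[0.1, 0.0], [0.0, 0.7], [0.5, 0.5]])

# Index-based alignment: the result has the union of both indexes, padded with NaN
misaligned = pd.concat([features, tfidf], axis=1)
print(misaligned.shape)  # (4, 3)

# Positional alignment with string-only column names
aligned = pd.concat(
    [features.reset_index(drop=True), tfidf.add_prefix("tfidf_")],
    axis=1,
)
print(aligned.columns.tolist())  # ['wordCount', 'tfidf_0', 'tfidf_1']
```

Giving the TF-IDF columns a string prefix also avoids mixing string and integer column names, which recent scikit-learn versions complain about when a dataframe is passed to an estimator.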

In [54]:
processed_data.head()

Unnamed: 0,url,webpageDescription,alchemy_category,alchemy_category_score,avgLinkWordLength,AvglinkWithOneCommonWord,AvglinkWithFourCommonWord,redundancyMeasure,frameTagRatio,domainLink,...,62003,62004,62005,62006,62007,62008,62009,62010,62011,62012
0,http://www.polyvore.com/cgi/home?id=1389651,polyvore is the best place to discover or sta...,?,?,1.916667,0.047619,0.0,0.803797,0.027778,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,http://www.youtube.com/watch?v=ippMPPu6gh4,Speed Air Man--David Belle david belle speed a...,?,?,1.257576,0.141026,0.0,1.142857,0.015086,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,http://www.musingsofahousewife.com/2011/03/tri...,Chicken Gruyere one of our favorite special di...,science_technology,0.386685,2.024,0.63035,0.202335,0.443409,0.033935,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,http://www.thelittleteochew.com/2011/07/ikan-b...,Oh me oh my This was really snackalicious swee...,recreation,0.475039,1.665254,0.41958,0.066434,0.472649,0.03653,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,http://recipes.wuzzle.org/index.php/72,Barbecued Chicken Chow Siew from The Exotic Ki...,computer_internet,0.535009,0.181818,0.036364,0.0,0.292614,0.015152,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [59]:
processed_data.columns[:25]

Index([                      'url',        'webpageDescription',
                'alchemy_category',    'alchemy_category_score',
               'avgLinkWordLength',  'AvglinkWithOneCommonWord',
       'AvglinkWithFourCommonWord',         'redundancyMeasure',
                   'frameTagRatio',                'domainLink',
                        'tagRatio',             'imageTagRatio',
                          'isNews',             'lengthyDomain',
        'hyperlinkToAllWordsRatio',           'isFrontPageNews',
               'alphanumCharCount',                'linksCount',
                       'wordCount',     'parametrizedLinkRatio',
             'spellingErrorsRatio',                     'label',
                              'id',               'websiteName',
                                 0],
      dtype='object')