The dataset contains an ID column, 25 features and one target label column. The description of the features is given below:

    url : URL of the webpage to be classified
    webpageDescription : One line description of the webpage
    alchemy_category : Alchemy category (per the publicly available Alchemy API)
    alchemy_category_score : Alchemy category score (per the publicly available Alchemy API)
    avgLinkWordLength : Average number of words in a webpage
    AvglinkWithOneCommonWord : Average number of web pages sharing at least one word with one other web page
    AvglinkWithTwoCommonWord : Average number of web pages sharing at least one word with two other web pages
    AvglinkWithThreeCommonWord : Average number of web pages sharing at least one word with three other web pages
    AvglinkWithFourCommonWord : Average number of web pages sharing at least one word with four other web pages
    redundancyMeasure : Measure of redundancy computed by finding the compression achieved on this web page via gzip
    embedRatio : Count of embed tags
    frameBased : Binary indication of whether a webpage has frameset markup
    frameTagRatio : Ratio of frameset markups over total markups
    domainLink : Binary indication of whether the webpage contains a link whose URL includes a domain
    tagRatio : Ratio of tags over text in the webpage
    imageTagRatio : Ratio of image tags over text in the webpage
    isNews : Binary indication of whether a webpage is a news article
    lengthyDomain : Binary indication of whether webpage's text contains more than 30 alpha-numeric characters
    hyperlinkToAllWordsRatio : Percentage of words on the webpage that are also in the hyperlink text
    isFrontPageNews : Binary indication of whether webpage is front-page news
    alphanumCharCount : Number of alpha-numeric characters in webpage's text
    linksCount : Number of link markups
    wordCount : Number of words in URL
    parametrizedLinkRatio :
    spellingErrorsRatio : Ratio of words that contain spelling errors
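
Most of these features are precomputed, but the idea behind redundancyMeasure is easy to try out. The sketch below is only an illustration: the dataset description does not give the exact formula, so the definition used here (compressed size divided by raw size) and the name `gzip_ratio` are assumptions, and the values in the dataset may be computed differently.

```python
import gzip


def gzip_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more redundant text.

    Assumed definition, not necessarily the one used to build the dataset.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(gzip.compress(raw)) / len(raw)


repetitive = "buy now " * 200
varied = "The quick brown fox jumps over the lazy dog near the riverbank at dawn."

# Repetitive text compresses far better than short, varied text.
print(gzip_ratio(repetitive))
print(gzip_ratio(varied))
```

For very short inputs the gzip header dominates, so the ratio can exceed 1.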

A label value of 0 means that the webpage is not "ad-worthy", and a label value of 1 means that it is "ad-worthy". The task is to predict the label value from the features described above.

In [38]:
import numpy as np
import pandas as pd
import json

In [2]:
train_data = pd.read_csv('dataset/train_data.csv')
test_data = pd.read_csv('dataset/test_data.csv')

### Merging train and test data

### Parsing webpageDescription from a JSON string into a dict

In [42]:
merged_data = pd.concat([train_data, test_data])

# Parse each JSON string into a Python dict
merged_data['webpageDescription'] = merged_data['webpageDescription'].apply(json.loads)
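
Because `pd.concat` keeps the original row labels by default, the merged frame contains repeated index labels, and the two sets can only be told apart later by the missing label. The sketch below shows one possible way to keep track of the origin of each row. It uses toy frames, and `is_train` is a hypothetical helper column that does not exist in the dataset.

```python
import pandas as pd

# Toy stand-ins for train_data / test_data; only the shape of the idea matters.
toy_train = pd.DataFrame({"id": [10, 11, 12], "label": [1, 0, 1]})
toy_test = pd.DataFrame({"id": [20, 21]})

# `is_train` is a hypothetical helper column, not part of the dataset.
toy_merged = pd.concat(
    [toy_train.assign(is_train=True), toy_test.assign(is_train=False)]
)

# The original row labels are kept, so 0 and 1 now appear twice.
print(toy_merged.index.tolist())  # [0, 1, 2, 0, 1]

# Splitting back into the two sets after cleaning:
train_part = toy_merged[toy_merged["is_train"]].drop(columns="is_train")
test_part = toy_merged[~toy_merged["is_train"]].drop(columns=["is_train", "label"])
```

Passing `ignore_index=True` to `pd.concat` would give unique labels instead, at the cost of changing the row labels that appear in the outputs of this notebook.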

### Checking missing entries in each column

In [43]:
# Count NaN values and '?' placeholders in every column
missing_entries = []

for feature in merged_data.columns:
    missing_entries.append({
        'feature': feature,
        'NA values': merged_data[feature].isna().sum(),
        '? values': (merged_data[feature] == '?').sum(),
    })

pd.DataFrame(missing_entries)

Unnamed: 0,feature,NA values,? values
0,url,0,0
1,webpageDescription,0,0
2,alchemy_category,0,2342
3,alchemy_category_score,0,2342
4,avgLinkWordLength,0,0
5,AvglinkWithOneCommonWord,0,0
6,AvglinkWithTwoCommonWord,0,0
7,AvglinkWithThreeCommonWord,0,0
8,AvglinkWithFourCommonWord,0,0
9,redundancyMeasure,0,0


### alchemy_category and alchemy_category_score each have 2342 missing entries (? values)

### isNews has 2843 missing entries (? values)

### isFrontPageNews has 1248 missing entries (? values)
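
Since missing values are encoded as the string '?' rather than NaN, `isna()` does not see them. One option, sketched below on toy data, is to convert the placeholder to NaN so that the usual pandas tools apply. The frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "alchemy_category": ["health", "?", "recreation"],
        "alchemy_category_score": ["0.893828", "?", "0.550131"],
        "isNews": ["1", "?", "1"],
    }
)

# Replace the '?' placeholder with NaN so pandas' missing-data tools apply.
toy_nan = toy.replace("?", np.nan)

# Columns that held '?' were read as strings; convert the score back to float.
toy_nan["alchemy_category_score"] = pd.to_numeric(toy_nan["alchemy_category_score"])

print(toy_nan.isna().sum())
```

The same effect can be had at load time with the `na_values='?'` argument of `pd.read_csv`.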



### Analyzing the keys in JSON value of webpageDescription

The "body" key appears to hold the one-line description of the page mentioned in the dataset description.

In [48]:
merged_data['webpageDescription'].apply(lambda x: x.keys())

0       (title, body, url)
1            (body, title)
2       (title, body, url)
3       (title, body, url)
4       (url, title, body)
               ...        
1474    (title, body, url)
1475    (title, body, url)
1476    (url, title, body)
1477    (title, body, url)
1478    (title, body, url)
Name: webpageDescription, Length: 7395, dtype: object
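
The output above lists the keys row by row, which makes it hard to see how many distinct key combinations exist. A possible way to summarise them is sketched below on a toy Series; `toy_desc` is made up for illustration.

```python
import pandas as pd

toy_desc = pd.Series(
    [
        {"title": "a", "body": "b", "url": "u"},
        {"body": "b", "title": None},
        {"url": "u", "title": "a", "body": None},
    ]
)

# Sort the keys so that ordering differences do not create separate groups,
# then join them into one string per row.
key_sets = toy_desc.apply(lambda d: ", ".join(sorted(d)))
key_counts = key_sets.value_counts()
print(key_counts)
```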

In [89]:
# Number of rows whose "body" value is None
merged_data['webpageDescription'].apply(lambda x: x['body'] is None).sum()

57

### There are 57 entries where the "body" value of webpageDescription is None

In [98]:
merged_data[merged_data['webpageDescription'].apply(lambda x: x['title'] if x['body'] is None else x['body']).isna()]

Unnamed: 0,url,webpageDescription,alchemy_category,alchemy_category_score,avgLinkWordLength,AvglinkWithOneCommonWord,AvglinkWithTwoCommonWord,AvglinkWithThreeCommonWord,AvglinkWithFourCommonWord,redundancyMeasure,...,lengthyDomain,hyperlinkToAllWordsRatio,isFrontPageNews,alphanumCharCount,linksCount,wordCount,parametrizedLinkRatio,spellingErrorsRatio,label,id
2994,http://icanhascheezburger.com/2009/05/07/funny...,"{'title': None, 'body': None, 'url': 'icanhasc...",?,?,1.982178,0.756809,0.285992,0.235409,0.210117,0.0,...,1,35,?,8682,514,6,0.206226,0.125,0.0,5330


### There is 1 entry where webpageDescription has neither a "body" nor a "title" value

This entry has a label, so it belongs to the train set and can simply be deleted

### Replacing webpageDescription using the following logic

    Every entry has a "body" key in the webpageDescription column, but for some entries its value is None
    Those entries are filled with the value of the "title" key instead
    
    if webpageDescription["body"] is not None
        use webpageDescription["body"]
    else use webpageDescription["title"]
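
The fallback rule above can be written as a small named function, which is easier to test than an inline lambda. This is a minimal sketch on toy data; `pick_description` and `toy_desc` are illustrative names, not part of the notebook.

```python
import pandas as pd


def pick_description(desc: dict):
    """Return the "body" text, or the "title" text when "body" is None."""
    body = desc.get("body")
    return body if body is not None else desc.get("title")


toy_desc = pd.Series(
    [
        {"title": "T0", "body": "B0", "url": "u0"},
        {"title": "T1", "body": None},
        {"title": None, "body": None, "url": "u2"},
    ]
)

picked = toy_desc.apply(pick_description)

# The last row has neither body nor title, so it stays missing.
print(picked.isna().tolist())  # [False, False, True]
```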

In [95]:
# Drop the single row (index label 2994) that has neither "body" nor "title"
cleaned_data = merged_data.copy(deep=True)
cleaned_data.drop(index=2994, inplace=True)

In [96]:
cleaned_data['webpageDescription'] = cleaned_data['webpageDescription'].apply(lambda x: x['title'] if x['body'] is None else x['body'])

### Analyzing value_counts() of the two news columns

In [114]:
cleaned_data['isNews'].value_counts()

1    4552
?    2842
Name: isNews, dtype: int64

Since 0 never appears among the recorded values, it seems reasonable to assume that ? means isNews = 0, but more analysis needs to be done

In [115]:
cleaned_data['isFrontPageNews'].value_counts()

0    5853
?    1247
1     294
Name: isFrontPageNews, dtype: int64

It also seems reasonable to assume that ? refers to isFrontPageNews = 1, but more analysis needs to be done

In [127]:
cleaned_data[cleaned_data['isNews'] == '?'].head(10)

Unnamed: 0,url,webpageDescription,alchemy_category,alchemy_category_score,avgLinkWordLength,AvglinkWithOneCommonWord,AvglinkWithTwoCommonWord,AvglinkWithThreeCommonWord,AvglinkWithFourCommonWord,redundancyMeasure,...,lengthyDomain,hyperlinkToAllWordsRatio,isFrontPageNews,alphanumCharCount,linksCount,wordCount,parametrizedLinkRatio,spellingErrorsRatio,label,id
0,http://www.polyvore.com/cgi/home?id=1389651,polyvore is the best place to discover or sta...,?,?,1.916667,0.047619,0.007937,0.0,0.0,0.803797,...,0,34,0,682,126,1,0.531746,0.142857,1.0,3711
1,http://www.youtube.com/watch?v=ippMPPu6gh4,Speed Air Man--David Belle david belle speed a...,?,?,1.257576,0.141026,0.0,0.0,0.0,1.142857,...,0,12,0,3008,78,1,0.628205,0.0,1.0,7222
4,http://recipes.wuzzle.org/index.php/72,Barbecued Chicken Chow Siew from The Exotic Ki...,computer_internet,0.535009,0.181818,0.036364,0.0,0.0,0.0,0.292614,...,0,3,0,1745,55,1,0.072727,0.115044,1.0,4321
6,http://www.insidershealth.com/natural_cure/poi...,Copyright 2011 InsidersHealth com All Rights R...,health,0.893828,1.691781,0.469799,0.147651,0.026846,0.006711,0.522134,...,0,57,0,961,149,4,0.006711,0.086957,1.0,5416
7,http://yawoot.com/post/2122,Well the world is made of peanut butter now po...,recreation,0.550131,1.170213,0.634921,0.0,0.0,0.0,0.224906,...,0,11,0,1865,63,1,0.031746,0.123209,0.0,7349
16,http://www.nationalpost.com/news/story.html?id...,In a discovery that has stunned even those beh...,health,0.603181,3.261649,0.653595,0.369281,0.196078,0.140523,0.442116,...,1,44,0,5791,306,2,0.075163,0.098795,0.0,4794
17,http://thecompletecookbook.wordpress.com/,Before we get to today s post please understan...,science_technology,0.564931,1.371717,0.475904,0.269076,0.128514,0.054217,0.511271,...,1,13,?,19995,498,0,0.104418,0.104478,0.0,3472
21,http://www.pimpthatsnack.com/full.php,Pimp That Snack Snack Search Join the mailing ...,culture_politics,0.208574,2.388489,0.733813,0.496403,0.302158,0.086331,0.723183,...,0,86,0,825,139,1,0.093525,0.125,1.0,4644
22,http://www.etsy.com/treasury/MTA0NjAxODZ8NzkwM...,,culture_politics,0.165873,1.488235,0.408451,0.140845,0.014085,0.0,21.0,...,0,43,0,1818,213,2,0.267606,0.315789,0.0,3528
24,http://warehouse.carlh.com/article_157/,February 4th 2008 Copy the line below to link ...,arts_entertainment,0.361629,2.0,0.086957,0.0,0.0,0.0,0.485349,...,0,4,0,2814,23,1,0.173913,0.095164,1.0,3744


In [124]:
cleaned_data.iloc[16, 1]

'In a discovery that has stunned even those behind it scientists at a Toronto hospital say they have proof the body s nervous system helps trigger diabetes opening the door to a potential near cure of the disease that affects millions of Canadians Diabetic mice became healthy virtually overnight after researchers injected a substance to counteract the effect of malfunctioning pain neurons in the pancreas I couldn t believe it said Dr Michael Salter a pain expert at the Hospital for Sick Children and one of the scientists Mice with diabetes suddenly didn t have diabetes any more The researchers caution they have yet to confirm their findings in people but say they expect results from human studies within a year or so Any treatment that may emerge to help at least some patients would likely be years away from hitting the market But the excitement of the team from Sick Kids whose work is being published today in the journal Cell is almost palpable I ve never seen anything like it said Dr 

This is clearly a news article, yet its 'isNews' value is '?'

So '?' cannot simply be read as "not news": either each entry with isNews = '?' has to be examined closely to decide whether it is a news page

Or this entry is treated as an outlier, and entries with isNews = '?' are assumed to be mostly non-news pages

Either way, more analysis needs to be done
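
One possible starting point for that analysis is to compare how the target label is distributed within each isNews value. The sketch below uses a made-up toy frame; it only shows whether '?' rows behave differently with respect to the label, not whether they are actually news pages.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "isNews": ["1", "?", "1", "?", "1", "?"],
        "label": [1.0, 0.0, 1.0, 1.0, 0.0, 0.0],
    }
)

# Row-normalised table: share of each label within each isNews value.
label_share = pd.crosstab(toy["isNews"], toy["label"], normalize="index")
print(label_share)
```

Rows without a label, such as those from the test set, are left out of the table by `pd.crosstab`.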

### Analyzing isFrontPageNews column for ? values

In [99]:
cleaned_data[cleaned_data['isFrontPageNews'] == '?']['isNews'].value_counts()

?    1247
Name: isNews, dtype: int64

All 1247 entries that have ? in isFrontPageNews also have ? in isNews
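
The same relationship can be checked with boolean masks, which also makes it explicit that the reverse direction does not have to hold. This is a sketch on toy data with made-up values.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "isNews": ["1", "?", "?", "1", "?"],
        "isFrontPageNews": ["0", "?", "0", "1", "?"],
    }
)

front_missing = toy["isFrontPageNews"] == "?"
news_missing = toy["isNews"] == "?"

# True when every row with a missing isFrontPageNews also has a missing isNews.
front_implies_news = bool((~front_missing | news_missing).all())

# The reverse direction: does a missing isNews imply a missing isFrontPageNews?
news_implies_front = bool((~news_missing | front_missing).all())

print(front_implies_news, news_implies_front)  # True False
```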

### Analyzing isNews column for ? values

In [101]:
cleaned_data[cleaned_data['isNews'] == '?']['webpageDescription']

0        polyvore is the best place to discover or sta...
1       Speed Air Man--David Belle david belle speed a...
4       Barbecued Chicken Chow Siew from The Exotic Ki...
6       Copyright 2011 InsidersHealth com All Rights R...
7       Well the world is made of peanut butter now po...
                              ...                        
1471                                                     
1474                                                     
1475    Save to your Collections Sorry for the inconve...
1477    outerbanxchic posted 10 14 2011 Delicious and ...
1478    Summer always signifies the beginning of steak...
Name: webpageDescription, Length: 2842, dtype: object
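
Some of the descriptions in the output above appear to be blank. A possible way to count such rows is sketched below on a toy Series; `toy_text` is made up for illustration.

```python
import pandas as pd

toy_text = pd.Series(
    ["polyvore is the best place", "", "   ", None, "Summer always signifies"]
)

# Treat None/NaN and whitespace-only strings as blank.
is_blank = toy_text.fillna("").str.strip().eq("")
print(int(is_blank.sum()))  # 3
```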