# Obtain and Scrub

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import pickle

## Obtain and Inspect Data

In this project, I will use the attributes of online articles to predict whether an article earns at least a chosen number of shares or falls short of it.

In [2]:
# reading the dataset into a pandas dataframe
df = pd.read_csv('data/online-news-popularity.csv', index_col=0)
print(df.shape)
df.head()

(39644, 60)


url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
http://mashable.com/2013/01/07/amazon-instant-video-browser/,731.0,12.0,219.0,0.663594,1.0,0.815385,4.0,2.0,1.0,0.0,...,0.1,0.7,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593
http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/,731.0,9.0,255.0,0.604743,1.0,0.791946,3.0,1.0,1.0,0.0,...,0.033333,0.7,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,711
http://mashable.com/2013/01/07/apple-40-billion-app-downloads/,731.0,9.0,211.0,0.57513,1.0,0.663866,3.0,1.0,1.0,0.0,...,0.1,1.0,-0.466667,-0.8,-0.133333,0.0,0.0,0.5,0.0,1500
http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/,731.0,9.0,531.0,0.503788,1.0,0.665635,9.0,0.0,1.0,0.0,...,0.136364,0.8,-0.369697,-0.6,-0.166667,0.0,0.0,0.5,0.0,1200
http://mashable.com/2013/01/07/att-u-verse-apps/,731.0,13.0,1072.0,0.415646,1.0,0.54089,19.0,19.0,20.0,0.0,...,0.033333,1.0,-0.220192,-0.5,-0.05,0.454545,0.136364,0.045455,0.136364,505


##### Dataset Information:

* This dataset was acquired from the University of California, 
  Irvine's Center for Machine Learning and Intelligent Systems archive (https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity#).
* Data references articles published by Mashable (www.mashable.com).
* Citation:
  K Fernandes, P Vinagre, P Cortez - Progress in Artificial Intelligence: 17th 
  Portuguese Conference on Artificial Intelligence, EPIA 2015, Coimbra, Portugal, 
  September 8-11, 2015.
     

##### Attribute Information:
     0. url:                           URL of the article
     1. timedelta:                     Days between the article publication and
                                       the dataset acquisition
     2. n_tokens_title:                Number of words in the title
     3. n_tokens_content:              Number of words in the content
     4. n_unique_tokens:               Rate of unique words in the content
     5. n_non_stop_words:              Rate of non-stop words in the content
     6. n_non_stop_unique_tokens:      Rate of unique non-stop words in the
                                       content
     7. num_hrefs:                     Number of links
     8. num_self_hrefs:                Number of links to other articles
                                       published by Mashable
     9. num_imgs:                      Number of images
    10. num_videos:                    Number of videos
    11. average_token_length:          Average length of the words in the
                                       content
    12. num_keywords:                  Number of keywords in the metadata
    13. data_channel_is_lifestyle:     Is data channel 'Lifestyle'?
    14. data_channel_is_entertainment: Is data channel 'Entertainment'?
    15. data_channel_is_bus:           Is data channel 'Business'?
    16. data_channel_is_socmed:        Is data channel 'Social Media'?
    17. data_channel_is_tech:          Is data channel 'Tech'?
    18. data_channel_is_world:         Is data channel 'World'?
    19. kw_min_min:                    Worst keyword (min. shares)
    20. kw_max_min:                    Worst keyword (max. shares)
    21. kw_avg_min:                    Worst keyword (avg. shares)
    22. kw_min_max:                    Best keyword (min. shares)
    23. kw_max_max:                    Best keyword (max. shares)
    24. kw_avg_max:                    Best keyword (avg. shares)
    25. kw_min_avg:                    Avg. keyword (min. shares)
    26. kw_max_avg:                    Avg. keyword (max. shares)
    27. kw_avg_avg:                    Avg. keyword (avg. shares)
    28. self_reference_min_shares:     Min. shares of referenced articles in
                                       Mashable
    29. self_reference_max_shares:     Max. shares of referenced articles in
                                       Mashable
    30. self_reference_avg_sharess:    Avg. shares of referenced articles in
                                       Mashable
    31. weekday_is_monday:             Was the article published on a Monday?
    32. weekday_is_tuesday:            Was the article published on a Tuesday?
    33. weekday_is_wednesday:          Was the article published on a Wednesday?
    34. weekday_is_thursday:           Was the article published on a Thursday?
    35. weekday_is_friday:             Was the article published on a Friday?
    36. weekday_is_saturday:           Was the article published on a Saturday?
    37. weekday_is_sunday:             Was the article published on a Sunday?
    38. is_weekend:                    Was the article published on the weekend?
    39. LDA_00:                        Closeness to LDA topic 0
    40. LDA_01:                        Closeness to LDA topic 1
    41. LDA_02:                        Closeness to LDA topic 2
    42. LDA_03:                        Closeness to LDA topic 3
    43. LDA_04:                        Closeness to LDA topic 4
    44. global_subjectivity:           Text subjectivity
    45. global_sentiment_polarity:     Text sentiment polarity
    46. global_rate_positive_words:    Rate of positive words in the content
    47. global_rate_negative_words:    Rate of negative words in the content
    48. rate_positive_words:           Rate of positive words among non-neutral
                                       tokens
    49. rate_negative_words:           Rate of negative words among non-neutral
                                       tokens
    50. avg_positive_polarity:         Avg. polarity of positive words
    51. min_positive_polarity:         Min. polarity of positive words
    52. max_positive_polarity:         Max. polarity of positive words
    53. avg_negative_polarity:         Avg. polarity of negative words
    54. min_negative_polarity:         Min. polarity of negative words
    55. max_negative_polarity:         Max. polarity of negative words
    56. title_subjectivity:            Title subjectivity
    57. title_sentiment_polarity:      Title polarity
    58. abs_title_subjectivity:        Absolute subjectivity level
    59. abs_title_sentiment_polarity:  Absolute polarity level
    60. shares:                        Number of shares (target)

In [4]:
# viewing statistics and checking for missing data
display(df.describe().round(3))
# note: df.info() prints its report and returns None, which display() then echoes
display(df.info())

,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
count,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,...,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0,39644.0
mean,354.53,10.399,546.515,0.548,0.996,0.689,10.884,3.294,4.544,1.25,...,0.095,0.757,-0.26,-0.522,-0.108,0.282,0.071,0.342,0.156,3395.38
std,214.164,2.114,471.108,3.521,5.231,3.265,11.332,3.855,8.309,4.108,...,0.071,0.248,0.128,0.29,0.095,0.324,0.265,0.189,0.226,11626.951
min,8.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0,0.0,1.0
25%,164.0,9.0,246.0,0.471,1.0,0.626,4.0,1.0,1.0,0.0,...,0.05,0.6,-0.328,-0.7,-0.125,0.0,0.0,0.167,0.0,946.0
50%,339.0,10.0,409.0,0.539,1.0,0.69,8.0,3.0,1.0,0.0,...,0.1,0.8,-0.253,-0.5,-0.1,0.15,0.0,0.5,0.0,1400.0
75%,542.0,12.0,716.0,0.609,1.0,0.755,14.0,4.0,4.0,1.0,...,0.1,1.0,-0.187,-0.3,-0.05,0.5,0.15,0.5,0.25,2800.0
max,731.0,23.0,8474.0,701.0,1042.0,650.0,304.0,116.0,128.0,91.0,...,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.5,1.0,843300.0


<class 'pandas.core.frame.DataFrame'>
Index: 39644 entries, http://mashable.com/2013/01/07/amazon-instant-video-browser/ to http://mashable.com/2014/12/27/youtube-channels-2015/
Data columns (total 60 columns):
 timedelta                        39644 non-null float64
 n_tokens_title                   39644 non-null float64
 n_tokens_content                 39644 non-null float64
 n_unique_tokens                  39644 non-null float64
 n_non_stop_words                 39644 non-null float64
 n_non_stop_unique_tokens         39644 non-null float64
 num_hrefs                        39644 non-null float64
 num_self_hrefs                   39644 non-null float64
 num_imgs                         39644 non-null float64
 num_videos                       39644 non-null float64
 average_token_length             39644 non-null float64
 num_keywords                     39644 non-null float64
 data_channel_is_lifestyle        39644 non-null float64
 data_channel_is_entertainment    39644 non-null

None

> There are 39644 rows.
* There are no missing data.
* The target variable's 50th percentile is 1400 shares, which makes that number a natural starting point for separating the dataset into two classes.
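Both observations can also be checked directly instead of being read off the `describe()` and `info()` output. A minimal sketch, assuming a small stand-in frame (`toy` and its values are made up for illustration and are not the real dataset):

```python
import pandas as pd

# hypothetical stand-in for the real dataframe; the leading space in
# ' shares' mirrors the raw column labels at this stage of the notebook
toy = pd.DataFrame({' n_tokens_title': [12.0, 9.0, 9.0, 13.0],
                    ' shares': [593, 711, 1500, 1200]})

# total number of missing cells across the whole frame
n_missing = int(toy.isna().sum().sum())

# 50th percentile (median) of the target column
median_shares = toy[' shares'].median()

print(n_missing, median_shares)
```

On the real frame the same two expressions would be applied to `df` and `df[' shares']`.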

In [4]:
# testing lengths to separate the classes
value = 1400
t = df[' shares']

for i in range(6): # print and increment test value by 25, 6 times
    print(f"Number of rows below {value}: ", 
          len(df[t < value]))
    value += 25

Number of rows below 1400:  18490
Number of rows below 1425:  20082
Number of rows below 1450:  20082
Number of rows below 1475:  20082
Number of rows below 1500:  20082
Number of rows below 1525:  21405


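> Every test value from 1425 through 1500 yields the same count, so no article has a share count between 1425 and 1499. We can therefore use the round value of **1500** to segment our dataset. That gives us 20082 articles shared fewer than 1500 times and 19562 articles shared 1500 times or more.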
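The sizes of both classes for a candidate threshold can be computed in one step. The sketch below assumes a hypothetical helper, `class_sizes`, and a made-up series of share counts:

```python
import pandas as pd

def class_sizes(shares, threshold):
    """Return (rows below threshold, rows at or above threshold)."""
    below = int((shares < threshold).sum())
    return below, len(shares) - below

# made-up share counts, for illustration only
toy_shares = pd.Series([593, 711, 1500, 1200, 505, 3600])

print(class_sizes(toy_shares, 1500))
```

Called as `class_sizes(df[' shares'], 1500)` on the real data, this should reproduce the 20082 / 19562 split reported above.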

> To proceed:
* Clean feature labels
* Create a new target column, indicating whether or not articles were shared at least 1500 times
* Build a decision tree classifier, to set a modeling baseline
* Examine the relative importance of dataset features
* Test, tune, and score ensemble method models against the dataset
* Compare the model scores to determine which model is most effective

In addition, depending on available time and on the relative importance of the features, I may repeat the tests without features such as the best and worst keywords, LDA, and polarity, since interpreting them may require more unpacking than our current scope allows.

## Clean Feature Names

I will not be too picky about name length or meaningfulness at this point.
* Column names include an unnecessary leading space; we will remove it.
* In addition, some column names use the prefix 'n', while others use 'num'; we will make them all use 'n'.
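Leading spaces are easy to miss when labels are printed, so it can help to flag them programmatically. A minimal sketch, using a made-up set of labels (`url_id` is a hypothetical clean label added for contrast):

```python
import pandas as pd

# made-up labels: two padded with a leading space, one clean
cols = pd.Index([' timedelta', 'url_id', ' shares'])

# a label that changes when stripped carries stray whitespace
padded = cols[cols != cols.str.strip()]

print(list(padded))
```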

In [5]:
# viewing column labels
df.columns

Index([' timedelta', ' n_tokens_title', ' n_tokens_content',
       ' n_unique_tokens', ' n_non_stop_words', ' n_non_stop_unique_tokens',
       ' num_hrefs', ' num_self_hrefs', ' num_imgs', ' num_videos',
       ' average_token_length', ' num_keywords', ' data_channel_is_lifestyle',
       ' data_channel_is_entertainment', ' data_channel_is_bus',
       ' data_channel_is_socmed', ' data_channel_is_tech',
       ' data_channel_is_world', ' kw_min_min', ' kw_max_min', ' kw_avg_min',
       ' kw_min_max', ' kw_max_max', ' kw_avg_max', ' kw_min_avg',
       ' kw_max_avg', ' kw_avg_avg', ' self_reference_min_shares',
       ' self_reference_max_shares', ' self_reference_avg_sharess',
       ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday',
       ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday',
       ' weekday_is_sunday', ' is_weekend', ' LDA_00', ' LDA_01', ' LDA_02',
       ' LDA_03', ' LDA_04', ' global_subjectivity',
       ' global_sentiment_p

Okay, maybe I will be a bit more picky.
To shorten some of these names, we can also do the following:
* remove 'tokens_'
* remove 'is_'
* remove 'data_'
* replace 'entertainment' with 'ent'
* replace 'reference' with 'ref'
* replace 'sharess' with 'shares'
* replace 'average' with 'avg'
* replace 'positive' with 'pos'
* replace 'negative' with 'neg'
* remove 'day'
* remove 'week_is'
* remove 'ectivity'
* remove 'arity'

So I will make those replacements along with the original two:
* replace ' ' with ''
* replace 'num_' with 'n' (so `num_hrefs` becomes `nhrefs`)
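Because substring replacements are applied one after another, their order affects the result. The sketch below illustrates this on two raw labels with a hypothetical helper, `shorten`, and a handful of rules chosen for illustration (writing `num_` and `is_` with trailing underscores is an assumption of this sketch):

```python
def shorten(name, replacements):
    """Apply (old, new) substring replacements to a label, in order."""
    for old, new in replacements:
        name = name.replace(old, new)
    return name

# a few illustrative rules, in one possible order
rules = [(' ', ''), ('num_', 'n'), ('is_', ''), ('day', ''), ('week_is', '')]

print(shorten(' weekday_is_monday', rules))
print(shorten(' num_hrefs', rules))
```

With this ordering the `week_is` rule never matches, because `is_` has already been removed by the time it is tried.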

In [6]:
# make replacements in the listed names; applied in order, as plain
# (non-regex) substrings
replacements = [
    (' ', ''),
    ('num_', 'n'),
    ('tokens_', ''),
    ('is_', ''),
    ('data_', ''),
    ('day', ''),
    ('week_is', ''),  # no effect in this order: 'is_' is already removed
    ('ectivity', ''),
    ('arity', ''),
    ('entertainment', 'ent'),
    ('reference', 'ref'),
    ('sharess', 'shares'),
    ('average', 'avg'),
    ('positive', 'pos'),
    ('negative', 'neg'),
]
for old, new in replacements:
    df.columns = df.columns.str.replace(old, new, regex=False)

df.columns

Index(['timedelta', 'n_title', 'n_content', 'n_unique_tokens',
       'n_non_stop_words', 'n_non_stop_unique_tokens', 'nhrefs', 'nself_hrefs',
       'nimgs', 'nvideos', 'avg_token_length', 'nkeywords',
       'channel_lifestyle', 'channel_ent', 'channel_bus', 'channel_socmed',
       'channel_tech', 'channel_world', 'kw_min_min', 'kw_max_min',
       'kw_avg_min', 'kw_min_max', 'kw_max_max', 'kw_avg_max', 'kw_min_avg',
       'kw_max_avg', 'kw_avg_avg', 'self_ref_min_shares',
       'self_ref_max_shares', 'self_ref_avg_shares', 'week_mon', 'week_tues',
       'week_wednes', 'week_thurs', 'week_fri', 'week_satur', 'week_sun',
       'weekend', 'LDA_00', 'LDA_01', 'LDA_02', 'LDA_03', 'LDA_04',
       'global_subj', 'global_sentiment_pol', 'global_rate_pos_words',
       'global_rate_neg_words', 'rate_pos_words', 'rate_neg_words',
       'avg_pos_pol', 'min_pos_pol', 'max_pos_pol', 'avg_neg_pol',
       'min_neg_pol', 'max_neg_pol', 'title_subj', 'title_sentiment_pol',
       'abs_title_

## Building a Target

In [7]:
# creating column (string '1' for rows where `shares` is at least 1500, else '0')
df['Shares_plus'] = np.where(df['shares'] >= 1500, '1', '0')

In [8]:
# viewing the first 10 rows
df[['shares', 'Shares_plus']].head(10)

url,shares,Shares_plus
http://mashable.com/2013/01/07/amazon-instant-video-browser/,593,0
http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/,711,0
http://mashable.com/2013/01/07/apple-40-billion-app-downloads/,1500,1
http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/,1200,0
http://mashable.com/2013/01/07/att-u-verse-apps/,505,0
http://mashable.com/2013/01/07/beewi-smart-toys/,855,0
http://mashable.com/2013/01/07/bodymedia-armbandgets-update/,556,0
http://mashable.com/2013/01/07/canon-poweshot-n/,891,0
http://mashable.com/2013/01/07/car-of-the-future-infographic/,3600,1
http://mashable.com/2013/01/07/chuck-hagel-website/,710,0
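Note that `np.where` with `'1'` and `'0'` stores the labels as strings. If numeric labels turn out to be more convenient later, they can be cast. A minimal sketch on made-up share counts (`Shares_plus_int` is a hypothetical column name, not part of the notebook's dataframe):

```python
import numpy as np
import pandas as pd

# made-up share counts, for illustration only
toy = pd.DataFrame({'shares': [593, 711, 1500, 1200, 3600]})

# same construction as in the cell above: string labels '1' / '0'
toy['Shares_plus'] = np.where(toy['shares'] >= 1500, '1', '0')

# optional integer copy of the labels for numeric use
toy['Shares_plus_int'] = toy['Shares_plus'].astype(int)

# class counts, and the number of rows in the positive class
print(toy['Shares_plus'].value_counts().to_dict())
print(toy['Shares_plus_int'].sum())
```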


## Notebook Summary
> 39,644 rows with `url` index and 59 predictors for `shares`
* all numeric features
* shortened several feature names
* data to be divided into two classes of articles: 
    * 20082 shared < 1500 times and 19562 shared >= 1500 times
* built binary target feature, `Shares_plus` (stored as the strings '0' and '1'), representing the two classes


## Save and Continue

In [9]:
# saving the current-state dataframe to a pickle file
with open('data/df-os.pickle', 'wb') as f:
    # pickling the dataframe using the highest protocol available
    pickle.dump(df, f, pickle.HIGHEST_PROTOCOL)
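A follow-up notebook would need to load the pickle back. The sketch below shows a round trip on a made-up frame written to a temporary path (not the project's `data/` directory), to confirm that the restored dataframe matches the original:

```python
import os
import pickle
import tempfile

import pandas as pd

# made-up frame, for illustration only
toy = pd.DataFrame({'shares': [593, 711, 1500]})

# write to a throwaway location
path = os.path.join(tempfile.mkdtemp(), 'toy.pickle')
with open(path, 'wb') as f:
    pickle.dump(toy, f, pickle.HIGHEST_PROTOCOL)

# read it back and confirm nothing changed
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored.equals(toy))
```

As with any pickle, the file should only be loaded from a trusted source.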
