## Studying News Popularity in terms of Number of Shares

Courtesy of K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.

Refer to https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity for details. 

**Table of Contents**

1. [Upload the data](#Upload-the-data)
2. [Read data with random missingness](#Read-data-with-random-missingness)
3. [Explore the data](#Explore-the-data)

In [101]:
import os
import pandas as pd
import azureml.dataprep as dprep

In [102]:
import azureml.core
print("SDK version:", azureml.core.VERSION)
from azureml.core import Workspace, Experiment, Run
ws = Workspace.from_config()

SDK version: 1.0.17
Found the config file in: /home/nbuser/library/config.json


### Upload the data

In [103]:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

data_folder = os.path.join(os.getcwd(), 'data/OnlineNewsPopularity')
print(data_folder)
os.makedirs(data_folder, exist_ok=True)

/home/nbuser/library/data/OnlineNewsPopularity


In [159]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip"
# download the archive into memory and read the CSV straight out of it
with urlopen(url) as resp:
    archive = ZipFile(BytesIO(resp.read()))
csv_name = 'OnlineNewsPopularity/OnlineNewsPopularity.csv'
original_df = pd.read_csv(archive.open(csv_name))
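
The column names in this CSV appear to carry a leading space (visible in the per-column output further down, where every name except `url` is indented). A minimal sketch of normalising them, assuming that is desired before any column lookups; `strip_column_names` is a hypothetical helper and the toy CSV text is made up for illustration:

```python
import io

import pandas as pd


def strip_column_names(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame whose column labels have no surrounding whitespace."""
    return frame.rename(columns=str.strip)


# Toy CSV (made up) that mimics the ", " separated header of the real file
raw_csv = "url, timedelta, shares\nhttp://example.com/a, 731, 593\n"
raw_toy = pd.read_csv(io.StringIO(raw_csv))
toy = strip_column_names(raw_toy)

print(list(raw_toy.columns))
print(list(toy.columns))
```

On the real data this would be `original_df = strip_column_names(original_df)`. Passing `skipinitialspace=True` to `pd.read_csv` is another option that handles the same issue at read time.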

In [151]:
import collections
import random
import numpy as np
from tqdm import tqdm_notebook as tqdm

def insert_missing(df: pd.DataFrame, missing_percent: float) -> pd.DataFrame:
    """Return a copy of df with about missing_percent of its cells set to NaN."""
    df = df.copy()  # leave the caller's frame untouched
    replaced_dict = collections.defaultdict(set)  # row -> columns already blanked
    idx = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]  # every (row, col) pair
    random.shuffle(idx)
    index_to_replace = int(round(missing_percent * len(idx)))
    with tqdm(total=index_to_replace, unit_scale=True) as pbar:
        for row, col in idx:
            if index_to_replace == 0:
                break
            if len(replaced_dict[row]) < df.shape[1] - 1:  # so that no row is completely missing
                df.iloc[row, col] = np.nan
                index_to_replace -= 1
                replaced_dict[row].add(col)
                pbar.update(1)  # one step per blanked cell
    return df
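
The cell-by-cell loop above touches hundreds of thousands of cells through `iloc`, which can be slow. A possible vectorised alternative is sketched below. It is not the notebook's method: it draws an independent random mask instead of shuffling an index list, so the missing fraction is only approximately `missing_percent`, and it uses NumPy's generator rather than `random.seed`. The name `insert_missing_vectorized` and the `seed` parameter are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def insert_missing_vectorized(df: pd.DataFrame, missing_percent: float, seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with roughly missing_percent of its cells set to NaN."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = df.shape
    # True marks a cell that will be blanked out
    mask = rng.random((n_rows, n_cols)) < missing_percent
    # Protect one randomly chosen column per row so that no row is completely missing
    keep_col = rng.integers(0, n_cols, size=n_rows)
    mask[np.arange(n_rows), keep_col] = False
    mask_df = pd.DataFrame(mask, index=df.index, columns=df.columns)
    # DataFrame.mask replaces cells where the condition is True with NaN
    return df.mask(mask_df)


# Small made-up frame to show the call
demo = pd.DataFrame(np.arange(200.0).reshape(50, 4), columns=list("abcd"))
demo_missing = insert_missing_vectorized(demo, 0.5, seed=1)
print(demo_missing.isnull().mean())
```

Because one cell per row is always protected, the realised missing rate sits slightly below the requested one; the loop version gives an exact count instead.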

In [None]:
random.seed(607)

df = insert_missing(original_df, 0.2)

HBox(children=(IntProgress(value=0, max=483657), HTML(value='')))

In [153]:
df.shape

(39644, 61)

In [155]:
df.isnull().sum() / df.shape[0] * df.shape[1]

url                               26.933306
 timedelta                        26.800979
 n_tokens_title                   27.108718
 n_tokens_content                 26.762511
 n_unique_tokens                  27.153340
 n_non_stop_words                 26.794824
 n_non_stop_unique_tokens         26.931768
 num_hrefs                        26.753279
 num_self_hrefs                   26.931768
 num_imgs                         26.876375
 num_videos                       26.854833
 average_token_length             27.025628
 num_keywords                     27.011780
 data_channel_is_lifestyle        26.759434
 data_channel_is_entertainment    26.951771
 data_channel_is_bus              26.994854
 data_channel_is_socmed           26.957926
 data_channel_is_tech             26.770205
 data_channel_is_world            26.830214
 kw_min_min                       27.057941
 kw_max_min                       26.717889
 kw_avg_min                       26.965619
 kw_min_max                     
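
Note that the expression in the cell above multiplies each column's missing fraction by the number of columns (`df.shape[1]`), so the figures printed are not percentages. A minimal sketch of reading the share of missing values per column directly as a percentage; `missing_percent_by_column` is a hypothetical helper and the toy frame is made up:

```python
import numpy as np
import pandas as pd


def missing_percent_by_column(frame: pd.DataFrame) -> pd.Series:
    """Percentage of missing cells in each column."""
    # isnull() gives booleans; their mean is the fraction of missing cells
    return frame.isnull().mean() * 100


toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, np.nan],
})
print(missing_percent_by_column(toy))
```

Applied to the notebook's frame this would be `missing_percent_by_column(df)`.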

In [154]:
# to use dprep module on Azure, save to folder first
df.to_csv(os.path.join(data_folder,'Missing_OnlineNewsPopularity.csv'), index=False)

### Read data with random missingness

In [156]:
print('Reading data...')
df = dprep.read_csv(path=os.path.join(data_folder,'Missing_OnlineNewsPopularity.csv'))
# df.shape

Reading data...


In [157]:
test2 = dprep.auto_read_file(path=os.path.join(data_folder,'Missing_OnlineNewsPopularity.csv'))

BrokenPipeError: [Errno 32] Broken pipe

### Explore the data

In [158]:
df.get_profile()

BrokenPipeError: [Errno 32] Broken pipe
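
Since `get_profile()` fails here with a `BrokenPipeError`, a rough per-column summary can still be obtained with plain pandas while the dprep problem is investigated. The sketch below is a stand-in, not a replacement for the dprep profile: `quick_profile` is a hypothetical helper, the statistics chosen are an assumption about what is useful, and the toy frame is made up.

```python
import numpy as np
import pandas as pd


def quick_profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing counts, and min/mean/max for numeric columns."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "missing_pct": frame.isnull().mean() * 100,
    })
    numeric = frame.select_dtypes(include="number")
    # Assignments align on the column-name index; non-numeric columns get NaN
    summary["min"] = numeric.min()
    summary["mean"] = numeric.mean()
    summary["max"] = numeric.max()
    return summary


toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "name": ["a", "b", None],
})
profile = quick_profile(toy)
print(profile)
```

For the file written earlier, this could be called as `quick_profile(pd.read_csv(os.path.join(data_folder, 'Missing_OnlineNewsPopularity.csv')))`.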

In [None]:
# create an experiment
experiment = Experiment(workspace = ws, name = "news_popularity_eng")