# Work in Progress: Studying News Popularity in terms of Number of Shares

Courtesy of K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.

Refer to https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity for details. 

**Table of Contents**

1. [Upload the data](#Upload-the-data)
2. [Explore the data](#Explore-the-data)
3. [Train the Model](#Train-the-Model)
4. [Submit an experiment](#Submit-an-experiment)

In [1]:
import os
import pandas as pd
import azureml.dataprep as dprep

In [2]:
import azureml.core
print("SDK version:", azureml.core.VERSION)
from azureml.core import Workspace, Experiment, Run
ws = Workspace.from_config()

SDK version: 1.0.17


If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


Found the config file in: /home/nbuser/library/config.json
Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code F7C7PFFXX to authenticate.
Interactive authentication successfully completed.


# Upload the data

In [3]:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

project_folder = os.getcwd()
data_folder = os.path.join(os.getcwd(), 'data/OnlineNewsPopularity')
print(data_folder)
os.makedirs(data_folder, exist_ok=True)

/home/nbuser/library/data/OnlineNewsPopularity


In [4]:
resp = urlopen("https://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip")
zipfile = ZipFile(BytesIO(resp.read()))
zipfile.namelist()
file = 'OnlineNewsPopularity/OnlineNewsPopularity.csv'
original_df = pd.read_csv(zipfile.open(file))

In [5]:
# Work on a copy so that in-place changes below do not alter the downloaded data.
df = original_df.copy()

In [6]:
df.shape

(39644, 61)

In [7]:
df.head()

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
0,http://mashable.com/2013/01/07/amazon-instant-...,731.0,12.0,219.0,0.663594,1.0,0.815385,4.0,2.0,1.0,...,0.1,0.7,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731.0,9.0,255.0,0.604743,1.0,0.791946,3.0,1.0,1.0,...,0.033333,0.7,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,711
2,http://mashable.com/2013/01/07/apple-40-billio...,731.0,9.0,211.0,0.57513,1.0,0.663866,3.0,1.0,1.0,...,0.1,1.0,-0.466667,-0.8,-0.133333,0.0,0.0,0.5,0.0,1500
3,http://mashable.com/2013/01/07/astronaut-notre...,731.0,9.0,531.0,0.503788,1.0,0.665635,9.0,0.0,1.0,...,0.136364,0.8,-0.369697,-0.6,-0.166667,0.0,0.0,0.5,0.0,1200
4,http://mashable.com/2013/01/07/att-u-verse-apps/,731.0,13.0,1072.0,0.415646,1.0,0.54089,19.0,19.0,20.0,...,0.033333,1.0,-0.220192,-0.5,-0.05,0.454545,0.136364,0.045455,0.136364,505


In [50]:
df_columns = df.columns.tolist()

# Explore the data

In [37]:
df.rename(columns=lambda x: x.strip(), inplace=True)
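The header row of this CSV has a space after each comma, so every column name except the first arrives with a leading space (visible as `Name:  shares` in the output further down). A minimal sketch with a two-column toy CSV, not the real file, shows what the rename does. `skipinitialspace=True` is a `pd.read_csv` option that strips the space at load time instead.

```python
import io
import pandas as pd

# Toy CSV mimicking the header style of the real file (space after each comma).
raw = "url, shares\nhttp://example.com/a, 593\nhttp://example.com/b, 711\n"

toy = pd.read_csv(io.StringIO(raw))
print(toy.columns.tolist())      # the second name keeps its leading space

# Same fix as in the notebook: strip whitespace from every column name.
toy.rename(columns=lambda x: x.strip(), inplace=True)
print(toy.columns.tolist())
print(toy.shares.tolist())       # attribute access works once the name is clean

# Alternative: let the parser drop the space after each delimiter.
toy_alt = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
print(toy_alt.columns.tolist())
```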

In [38]:
df.shares.head()

0     593
1     711
2    1500
3    1200
4     505
Name: shares, dtype: int64

In [68]:
# Bin the share counts into quintiles labelled 1 (least shared) to 5 (most shared).
share_categories = [1, 2, 3, 4, 5]
df['share_cat'] = pd.qcut(df['shares'], 5, labels=share_categories).astype(int)
df['share_cat'].dtype
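To see what the quintile binning does, here is a small sketch on the first ten share counts printed later in this notebook. The bin edges come from this ten-row sample only, so the labels will not match the ones the full dataset produces.

```python
import pandas as pd

# First ten share counts from the dataset output shown in this notebook.
shares = pd.Series([593, 711, 1500, 1200, 505, 855, 556, 891, 3600, 710])

labels = [1, 2, 3, 4, 5]
# retbins=True also returns the bin edges computed from the sample quantiles.
cats, edges = pd.qcut(shares, 5, labels=labels, retbins=True)
share_cat = cats.astype(int)

# Quantile binning aims for equally sized groups rather than equally wide bins.
counts = share_cat.value_counts().sort_index()
print(edges)
print(counts.to_dict())
print(share_cat.tolist())
```

Because the bins hold equal numbers of rows, the resulting classes are balanced, which keeps a metric such as accuracy easier to interpret.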

In [69]:
df.head()

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares,share_cat
0,http://mashable.com/2013/01/07/amazon-instant-...,731.0,12.0,219.0,0.66,1.0,0.82,4.0,2.0,1.0,...,0.7,-0.35,-0.6,-0.2,0.5,-0.19,0.0,0.19,593,1
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731.0,9.0,255.0,0.6,1.0,0.79,3.0,1.0,1.0,...,0.7,-0.12,-0.12,-0.1,0.0,0.0,0.5,0.0,711,1
2,http://mashable.com/2013/01/07/apple-40-billio...,731.0,9.0,211.0,0.58,1.0,0.66,3.0,1.0,1.0,...,1.0,-0.47,-0.8,-0.13,0.0,0.0,0.5,0.0,1500,3
3,http://mashable.com/2013/01/07/astronaut-notre...,731.0,9.0,531.0,0.5,1.0,0.67,9.0,0.0,1.0,...,0.8,-0.37,-0.6,-0.17,0.0,0.0,0.5,0.0,1200,2
4,http://mashable.com/2013/01/07/att-u-verse-apps/,731.0,13.0,1072.0,0.42,1.0,0.54,19.0,19.0,20.0,...,1.0,-0.22,-0.5,-0.05,0.45,0.14,0.05,0.14,505,1


# Train the Model

In [43]:
df.columns[:-2]

Index(['url', 'timedelta', 'n_tokens_title', 'n_tokens_content',
       'n_unique_tokens', 'n_non_stop_words', 'n_non_stop_unique_tokens',
       'num_hrefs', 'num_self_hrefs', 'num_imgs', 'num_videos',
       'average_token_length', 'num_keywords', 'data_channel_is_lifestyle',
       'data_channel_is_entertainment', 'data_channel_is_bus',
       'data_channel_is_socmed', 'data_channel_is_tech',
       'data_channel_is_world', 'kw_min_min', 'kw_max_min', 'kw_avg_min',
       'kw_min_max', 'kw_max_max', 'kw_avg_max', 'kw_min_avg', 'kw_max_avg',
       'kw_avg_avg', 'self_reference_min_shares', 'self_reference_max_shares',
       'self_reference_avg_sharess', 'weekday_is_monday', 'weekday_is_tuesday',
       'weekday_is_wednesday', 'weekday_is_thursday', 'weekday_is_friday',
       'weekday_is_saturday', 'weekday_is_sunday', 'is_weekend', 'LDA_00',
       'LDA_01', 'LDA_02', 'LDA_03', 'LDA_04', 'global_subjectivity',
       'global_sentiment_polarity', 'global_rate_positive_words',
     

In [11]:
df[df.columns[-1]]

0          593
1          711
2         1500
3         1200
4          505
5          855
6          556
7          891
8         3600
9          710
10        2200
11        1900
12         823
13       10000
14         761
15        1600
16       13600
17        3100
18        5700
19       17100
20        2800
21         598
22         445
23        1500
24         852
25         783
26        1500
27        1800
28         462
29         425
         ...  
39614     1400
39615     5700
39616     2100
39617      691
39618     1400
39619     1200
39620     2400
39621    24300
39622     2900
39623      947
39624     3200
39625     1400
39626     1100
39627     1200
39628     1000
39629     2400
39630     1500
39631      914
39632     1700
39633     1500
39634     1000
39635     1300
39636     1700
39637     1400
39638     1200
39639     1800
39640     1900
39641     1900
39642     1100
39643     1300
Name:  shares, Length: 39644, dtype: int64

In [71]:
from sklearn.model_selection import train_test_split
import logging


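# Features: every column except the raw target ('shares') and its binned version ('share_cat').
# Note: 'url' and 'timedelta' are documented as non-predictive in the dataset description but are kept here.
x_df = df.drop(columns=['shares', 'share_cat'])
# Target: the quintile category created above.
y_df = df['share_cat']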


x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=607)

method = 'classification'

if method == 'regression':
    # flatten y_train to 1d array
    y_train_array = y_train.values.flatten()
else: 
    y_train_array = y_train.values

For the full list of supported primary metrics, see: <br>
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train

When `preprocess` is `True`, automated ML applies the preprocessing and featurization steps described here: <br>
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train?view=azure-dataprep-py#data-pre-processing-and-featurization


In [72]:
automl_settings = {
    "iteration_timeout_minutes" : 10,
    "iterations" : 30,
    "primary_metric" : 'AUC_weighted', # regression: r2_score
    "preprocess" : True,
    "verbosity" : logging.INFO,
    "n_cross_validations": 5
}
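As described in the automated ML documentation linked above, `AUC_weighted` is the mean of the per-class (one-vs-rest) AUC scores, weighted by the number of true instances of each class. The sketch below is a plain-Python illustration of that definition, not the SDK's implementation. The function names and toy data are made up, and it assumes every class occurs at least once without being the only class.

```python
def binary_auc(is_positive, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for flag, s in zip(is_positive, scores) if flag]
    neg = [s for flag, s in zip(is_positive, scores) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_ovr_auc(y_true, proba, classes):
    """One-vs-rest AUC per class, averaged with weights equal to class frequency."""
    total = len(y_true)
    result = 0.0
    for k, cls in enumerate(classes):
        is_cls = [y == cls for y in y_true]
        support = sum(is_cls)
        class_scores = [row[k] for row in proba]
        result += (support / total) * binary_auc(is_cls, class_scores)
    return result


classes = [1, 2, 3]
y_true = [1, 1, 2, 3]

# A classifier that always ranks the true class highest.
perfect = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]
# A classifier that gives every row the same probabilities.
uninformative = [[1 / 3, 1 / 3, 1 / 3] for _ in y_true]

print(weighted_ovr_auc(y_true, perfect, classes))
print(weighted_ovr_auc(y_true, uninformative, classes))
```

A score near 0.5 therefore means the ranking is no better than chance, which puts the values in the iteration log below into context.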

In [73]:
from azureml.train.automl import AutoMLConfig

# local compute
automated_ml_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automated_ml_errors.log',
                             path = project_folder,
                             X = x_train.values,
                             y = y_train_array,
                             **automl_settings)

# Submit an experiment

In [74]:
# create an experiment
experiment = Experiment(workspace = ws, name = "news_popularity")
run = experiment.submit(automated_ml_config, show_output=True)

Running on local machine
Parent Run ID: AutoML_96d26b44-876a-4643-b56f-6250e616e123
********************************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
SAMPLING %: Percent of the training data to sample.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
********************************************************************************************************************

 ITERATION   PIPELINE                                       SAMPLING %  DURATION      METRIC      BEST
         0   MaxAbsScaler LightGBM                          100.0000    0:00:41       0.6559    0.6559
         1   RobustScaler LightGBM                          100.0000    0:01:03       0.6694    0.6694
         2   RobustScaler LogisticRegression                100

In [75]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

### Retrieve the Best Model

In [76]:
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: news_popularity,
Id: AutoML_96d26b44-876a-4643-b56f-6250e616e123_29,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(logger=None, task=None)), ('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
               estimators=[('LightGBM', Pipeline(memory=None,
     steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ...666666666667, 0.06666666666666667, 0.26666666666666666, 0.13333333333333333, 0.26666666666666666]))])


### Best Model Based on Any Other Metric

In [78]:
lookup_metric = "accuracy"
best_run, fitted_model = run.get_output(metric = lookup_metric)
print(best_run)
print(fitted_model)

Run(Experiment: news_popularity,
Id: AutoML_96d26b44-876a-4643-b56f-6250e616e123_29,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(logger=None, task=None)), ('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
               estimators=[('LightGBM', Pipeline(memory=None,
     steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ...666666666667, 0.06666666666666667, 0.26666666666666666, 0.13333333333333333, 0.26666666666666666]))])


## *Working draft* Hyperparameter Sweep

Bayesian sampling is based on the Bayesian optimization algorithm and makes intelligent choices on the hyperparameter values to sample next. It picks the sample based on how the previous samples performed, such that the new sample improves the reported primary metric.

When you use Bayesian sampling, the number of concurrent runs has an impact on the effectiveness of the tuning process. Typically, a smaller number of concurrent runs can lead to better sampling convergence, since the smaller degree of parallelism increases the number of runs that benefit from previously completed runs.

Bayesian sampling supports only choice and uniform distributions over the search space.
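The effect of concurrency can be made concrete with some rough arithmetic. The sketch below assumes an idealized scheduler in which runs start and finish in synchronized batches; a real scheduler starts new runs as others finish, so treat the numbers as an approximation. The function name is made up for illustration.

```python
import math


def tuning_rounds(max_total_runs, max_concurrent_runs):
    """Idealized model: runs start and finish in synchronized batches."""
    rounds = math.ceil(max_total_runs / max_concurrent_runs)
    # Runs in the first batch start with no completed results to learn from.
    uninformed = min(max_concurrent_runs, max_total_runs)
    informed = max_total_runs - uninformed
    return rounds, informed


for concurrency in (1, 4, 20, 100):
    rounds, informed = tuning_rounds(100, concurrency)
    print(f"concurrency={concurrency:>3}: {rounds:>3} rounds, "
          f"{informed} runs can use earlier results")
```

With 100 total runs, going from 4 to 20 concurrent runs cuts the number of sequential rounds from 25 to 5, so the sampler gets far fewer opportunities to update its choices.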

#### Ways to specify Hyperparameters

Discrete hyperparameters can be specified as a `choice` among explicit values or, for more advanced cases, as a quantized distribution. The following distributions are supported:
```
quniform(low, high, q) - Returns a value like round(uniform(low, high) / q) * q
qloguniform(low, high, q) - Returns a value like round(exp(uniform(low, high)) / q) * q
qnormal(mu, sigma, q) - Returns a value like round(normal(mu, sigma) / q) * q
qlognormal(mu, sigma, q) - Returns a value like round(exp(normal(mu, sigma)) / q) * q
```
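The formulas above can be tried out with the standard library. This is a sketch of the sampling rules as written, not the HyperDrive implementation; the variable names `batch_sizes` and `n_estimators` are only illustrative.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def quniform(low, high, q):
    return round(rng.uniform(low, high) / q) * q


def qloguniform(low, high, q):
    return round(math.exp(rng.uniform(low, high)) / q) * q


def qnormal(mu, sigma, q):
    return round(rng.gauss(mu, sigma) / q) * q


def qlognormal(mu, sigma, q):
    return round(math.exp(rng.gauss(mu, sigma)) / q) * q


# Every draw is a multiple of q, which is what makes these distributions discrete.
batch_sizes = [quniform(16, 128, 16) for _ in range(5)]
# For the log variants, low and high are given in log space.
n_estimators = [qloguniform(math.log(10), math.log(1000), 10) for _ in range(5)]
print(batch_sizes)
print(n_estimators)
```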

**Continuous hyperparameters**

Continuous hyperparameters are specified as a distribution over a continuous range of values. Supported distributions include:
```
uniform(low, high) - Returns a value uniformly distributed between low and high
loguniform(low, high) - Returns a value drawn according to exp(uniform(low, high)) so that the logarithm of the return value is uniformly distributed
normal(mu, sigma) - Returns a real value that's normally distributed with mean mu and standard deviation sigma
lognormal(mu, sigma) - Returns a value drawn according to exp(normal(mu, sigma)) so that the logarithm of the return value is normally distributed
```
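One consequence of the `loguniform` definition is that `low` and `high` are bounds on the logarithm of the value, not on the value itself. The sketch below follows the formula as written, using only the standard library, and counts how many draws land in each decade between 1e-4 and 1e-1. The range is a made-up example, for instance for a learning rate.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def loguniform(low, high):
    # exp of a uniform draw: the logarithm of the result is uniform on [low, high]
    return math.exp(rng.uniform(low, high))


# Bounds are passed in log space to cover the values 1e-4 .. 1e-1.
low, high = math.log(1e-4), math.log(1e-1)
samples = [loguniform(low, high) for _ in range(10000)]

# Count draws per decade: [1e-4, 1e-3), [1e-3, 1e-2), [1e-2, 1e-1].
per_decade = [0, 0, 0]
for s in samples:
    idx = min(2, max(0, int(math.floor(math.log10(s) + 4))))
    per_decade[idx] += 1
print(per_decade)
```

Each decade receives roughly a third of the draws, whereas a plain `uniform(1e-4, 1e-1)` would put about 90% of them in the top decade.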

In [1]:
from azureml.train.hyperdrive import BayesianParameterSampling, choice, uniform

In [2]:
# choice() and uniform() must be imported alongside the sampler (see In [1]);
# omitting them raises "NameError: name 'uniform' is not defined".
param_sampling = BayesianParameterSampling( {
        "learning_rate": uniform(0.01, 0.1),
        "min_samples_split": choice(2, 5, 10, 20),
        "min_samples_leaf": choice(1, 5, 10, 20),
        "max_depth": choice(3,6,9),
        "C":uniform(0.1, 1)
    }
)

### Default parameters for GBM (scikit-learn)

`learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001`

### Default parameters for logistic regression (scikit-learn)

`penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None`

### Default parameters for random forest (scikit-learn)

`n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None`

#### Log metrics for hyperparameter tuning

The training script for your model must log the relevant metrics during model training. When you configure hyperparameter tuning, you specify the primary metric used to evaluate run performance. The training script must log this metric under the same name so that it is available to the hyperparameter tuning process.

In [None]:
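from azureml.core.run import Run
# Runs inside the training script; val_accuracy is assumed to be the validation accuracy computed there.
run_logger = Run.get_context()
# The metric name must match primary_metric_name in the HyperDrive configuration.
run_logger.log("accuracy", float(val_accuracy))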

Configure the hyperparameter tuning experiment with the hyperparameter search space defined above, an early termination policy, the primary metric, and a resource allocation. Additionally, provide an estimator that will be called with the sampled hyperparameters. The estimator describes the training script to run, the resources per job (single or multi-GPU), and the compute target to use. Because concurrency is limited by the available resources, make sure the compute target specified in the estimator has enough capacity for the desired number of concurrent runs.

The configuration looks like this:

In [None]:
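from azureml.train.hyperdrive import HyperDriveRunConfig, PrimaryMetricGoal

# `estimator` must already be defined: it wraps the training script and the compute target.
# Bayesian sampling does not support early termination, so no policy is set here.
hyperdrive_run_config = HyperDriveRunConfig(estimator=estimator,
                          hyperparameter_sampling=param_sampling, 
                          policy=None,
                          primary_metric_name="accuracy", 
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                          max_total_runs=100,
                          max_concurrent_runs=4)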

Submit the HyperDrive experiment:


In [None]:
from azureml.core.experiment import Experiment
# Reuses the workspace `ws` and the experiment name from the automated ML run above.
experiment = Experiment(workspace=ws, name="news_popularity")
hyperdrive_run = experiment.submit(hyperdrive_run_config)

# Explain Model

# Deploy Model

# Batch Scoring