# Studying News Popularity in terms of Number of Shares

Courtesy of K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.

Refer to https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity for details. 

**Table of Contents**

1. [Set Up Workspace](#Set-Up-Workspace)
2. [Define preprocessing script for model training](#Define-preprocessing-script-for-model-training)
3. [Specify experiment settings and compute targets](#Specify-Experiment-Settings-and-Compute-Targets)
4. [Submit an experiment](#Submit-an-experiment)
5. [Explain Model](#Explain-Model)
6. [Register Model](#Register-Model)


Tip: to debug the SDK, run `from azureml._logging.debug_mode import debug_sdk` and call `debug_sdk()` just before the code block you want to debug.

# Set Up Workspace

In [3]:
import os
import pandas as pd
import numpy as np
import logging

## Read data from a website
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

## Split data 
from sklearn.model_selection import train_test_split


## Azure-related
import azureml.dataprep as dprep # preprocessing module 
import azureml.core
print("SDK version:", azureml.core.VERSION)
from azureml.core import Workspace, Experiment, Run

### Specify compute targets
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget


SDK version: 1.0.17


In [4]:
ws = Workspace.from_config()

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


Found the config file in: /home/nbuser/library/config.json
Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code FYZ7TLMAG to authenticate.
Interactive authentication successfully completed.
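
The warning above recommends `ServicePrincipalAuthentication` for unattended runs. The following is a minimal sketch of that route, not a definitive setup: the environment variable names (`AML_TENANT_ID`, `AML_SP_ID`, `AML_SP_PASSWORD`) and both helper functions are hypothetical and chosen only for illustration. The SDK import is deferred so that the credential check can be exercised without the SDK installed.

```python
import os

# Hypothetical variable names; use whatever your deployment actually defines.
REQUIRED_VARS = ("AML_TENANT_ID", "AML_SP_ID", "AML_SP_PASSWORD")


def load_sp_credentials(env=os.environ):
    """Return the service principal settings, failing early if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def workspace_from_service_principal(env=os.environ):
    """Load the workspace from config.json without an interactive login."""
    # Imported lazily so load_sp_credentials() works without the SDK installed.
    from azureml.core import Workspace
    from azureml.core.authentication import ServicePrincipalAuthentication

    creds = load_sp_credentials(env)
    auth = ServicePrincipalAuthentication(
        tenant_id=creds["AML_TENANT_ID"],
        service_principal_id=creds["AML_SP_ID"],
        service_principal_password=creds["AML_SP_PASSWORD"],
    )
    return Workspace.from_config(auth=auth)
```

In an unattended job, `ws = workspace_from_service_principal()` would take the place of the interactive `Workspace.from_config()` call.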


# Define preprocessing script for model training

If you intend to use remote compute rather than local compute, supply your data through a Python script that defines a `get_data()` function, as shown below, rather than through a series of Jupyter cells. With local compute, you can write your code as you usually would in a Jupyter notebook. The section [Specify experiment settings and compute targets](#Specify-Experiment-Settings-and-Compute-Targets) later shows how to supply the objects for either remote or local compute.

In [7]:
import os
project_folder = os.getcwd()

In [8]:
%%writefile $project_folder/preprocess.py
import os
project_folder = os.getcwd()
print(project_folder)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## Read data from a website
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

def get_data():
    # Read Data from url
    print('Reading data...')
    resp = urlopen("https://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip")
    zipfile = ZipFile(BytesIO(resp.read()))
    zipfile.namelist()
    file = 'OnlineNewsPopularity/OnlineNewsPopularity.csv'
    df = pd.read_csv(zipfile.open(file))
    
    # Preprocessing
    # Remove beginning white space in the columns
    print('Stripping off white space...')
    df.rename(columns=lambda x: x.strip(), inplace=True)
    
    # Set Target Label
    # Define number of popularity categories to predict
    print('Make target categories')
    share_categories = [1,2,3,4,5]
    # Bin shares into five equal-frequency categories labelled 1 to 5 (stored as integers)
    df['share_cat'] = pd.qcut(df['shares'], q=5, labels=share_categories).astype(int)
    
    # Split Data
    # time delta and url are not predictive attributes, exclude them
    x_df = df[df.columns[2:-2]] # url and time delta are the first two attributes 
    y_df = df[df.columns[-1]]
    
    print('Splitting data...')
    x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, 
                                                        random_state=607)
    
    return { "X": x_train.values, "y": y_train.values, 
            "X_valid": x_test.values, "y_valid": y_test.values}


Overwriting /home/nbuser/library/preprocess.py
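
The `get_data()` function written to `preprocess.py` returns a dictionary with the keys `X`, `y`, `X_valid` and `y_valid`. Before submitting a remote run, it can help to sanity-check that shape locally. The helper below is a hypothetical sketch (the name `check_get_data_output` is not part of the SDK) that assumes `X` and `y` are required and that the validation pair must be given together.

```python
def check_get_data_output(result):
    """Return a list of problems found in a get_data()-style dictionary."""
    problems = []

    # Training features and labels are assumed to be mandatory.
    for key in ("X", "y"):
        if key not in result:
            problems.append("missing key: " + key)

    # The validation pair is assumed to be optional, but all-or-nothing.
    if ("X_valid" in result) != ("y_valid" in result):
        problems.append("X_valid and y_valid must be supplied together")

    # Features and labels must describe the same number of rows.
    for x_key, y_key in (("X", "y"), ("X_valid", "y_valid")):
        if x_key in result and y_key in result:
            if len(result[x_key]) != len(result[y_key]):
                problems.append(x_key + " and " + y_key + " differ in length")

    return problems


# Toy data standing in for the real arrays.
good = {"X": [[1, 2], [3, 4]], "y": [1, 5], "X_valid": [[5, 6]], "y_valid": [3]}
print(check_get_data_output(good))  # an empty list means no problems were found
```

Calling `check_get_data_output(get_data())` in a local cell would apply the same checks to the real data.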


### A glimpse into the data set
The cells below only preview the data. The `preprocess.py` script written above is the only script needed to submit a model training job.

In [3]:
project_folder = os.getcwd()
data_folder = os.path.join(os.getcwd(), 'data/OnlineNewsPopularity')
print(data_folder)
os.makedirs(data_folder, exist_ok=True)

/home/nbuser/library/data/OnlineNewsPopularity


In [4]:
resp = urlopen("https://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip")
zipfile = ZipFile(BytesIO(resp.read()))
zipfile.namelist()
file = 'OnlineNewsPopularity/OnlineNewsPopularity.csv'
original_df = pd.read_csv(zipfile.open(file))

In [5]:
df = original_df

In [6]:
df.shape

(39644, 61)

In [7]:
df.head()

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
0,http://mashable.com/2013/01/07/amazon-instant-...,731.0,12.0,219.0,0.663594,1.0,0.815385,4.0,2.0,1.0,...,0.1,0.7,-0.35,-0.6,-0.2,0.5,-0.1875,0.0,0.1875,593
1,http://mashable.com/2013/01/07/ap-samsung-spon...,731.0,9.0,255.0,0.604743,1.0,0.791946,3.0,1.0,1.0,...,0.033333,0.7,-0.11875,-0.125,-0.1,0.0,0.0,0.5,0.0,711
2,http://mashable.com/2013/01/07/apple-40-billio...,731.0,9.0,211.0,0.57513,1.0,0.663866,3.0,1.0,1.0,...,0.1,1.0,-0.466667,-0.8,-0.133333,0.0,0.0,0.5,0.0,1500
3,http://mashable.com/2013/01/07/astronaut-notre...,731.0,9.0,531.0,0.503788,1.0,0.665635,9.0,0.0,1.0,...,0.136364,0.8,-0.369697,-0.6,-0.166667,0.0,0.0,0.5,0.0,1200
4,http://mashable.com/2013/01/07/att-u-verse-apps/,731.0,13.0,1072.0,0.415646,1.0,0.54089,19.0,19.0,20.0,...,0.033333,1.0,-0.220192,-0.5,-0.05,0.454545,0.136364,0.045455,0.136364,505
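
The `shares` column shown above is what `preprocess.py` turns into the five-level target `share_cat` using quantile binning. To make that step concrete, here is a rough standard-library sketch of equal-frequency binning applied to the five `shares` values from `df.head()`. It illustrates the idea behind `pd.qcut` under the assumption of linearly interpolated cut points and right-closed bins; it is not pandas' implementation, and the function name `quintile_categories` is made up for this example.

```python
from bisect import bisect_left
from statistics import quantiles


def quintile_categories(values):
    """Assign each value a category from 1 (lowest fifth) to 5 (highest fifth)."""
    # Four interior cut points that split the data into five equal-frequency bins.
    edges = quantiles(values, n=5, method="inclusive")
    # bisect_left keeps a value equal to a cut point in the lower bin,
    # which corresponds to right-closed intervals.
    return [bisect_left(edges, v) + 1 for v in values]


shares = [593, 711, 1500, 1200, 505]  # the `shares` values from df.head()
print(quintile_categories(shares))  # [2, 3, 5, 4, 1]
```

On the full data set the cut points come from all 39,644 rows, so the categories of these five articles may differ from the toy result.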


# Specify Experiment Settings and Compute Targets

Specify a compute target based on an existing cluster, or create a new one. You can view your existing clusters in the portal, as shown here:
<br>
<img src="Images/compute_target.png" width="2000">

In [11]:
# gpucluster is an existing compute target
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "gpucluster")
# environment variables are strings, so convert the node counts to int
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4))

# This example uses a GPU VM (STANDARD_NC24). For a CPU VM, set the SKU to e.g. STANDARD_D2_V2
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_STANDARD_NC24", "STANDARD_NC24") 

if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

found compute target. just use it. gpucluster
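
One detail of the cell above is easy to miss: values read from environment variables are always strings, whereas the fallback defaults are integers. A small sketch, assuming the same variable names as the cell and a hypothetical helper `read_cluster_settings`, shows one way to normalise and validate the settings before they reach the provisioning configuration.

```python
import os


def read_cluster_settings(env=os.environ):
    """Read cluster settings, converting node counts to integers."""
    settings = {
        "name": env.get("AML_COMPUTE_CLUSTER_NAME", "gpucluster"),
        # int() accepts both the string from the environment and the int default.
        "min_nodes": int(env.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0)),
        "max_nodes": int(env.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4)),
    }
    if settings["min_nodes"] > settings["max_nodes"]:
        raise ValueError("min_nodes must not exceed max_nodes")
    return settings


# Passing a plain dict makes the behaviour easy to try without touching os.environ.
print(read_cluster_settings({"AML_COMPUTE_CLUSTER_MAX_NODES": "8"}))
```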


With automated machine learning, Azure iterates through the machine learning algorithms appropriate to your task. The supported models are listed [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train). Note that the settings below set the `preprocess` flag to `True`; refer to this [article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-auto-train?view=azure-dataprep-py#data-pre-processing-and-featurization) to learn which preprocessing steps Azure takes.


In [9]:
automl_settings = {
    "iteration_timeout_minutes" : 10,
    "iterations" : 10,
    "primary_metric" : 'AUC_weighted',
    "verbosity" : logging.INFO,
    "preprocess": True
}
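
The primary metric chosen above, `AUC_weighted`, averages the one-vs-rest AUC of each class, weighted by the number of true instances of that class. The sketch below works through that definition in plain Python on a toy three-class example. It is meant as an illustration of the metric, not as the code Azure runs; the function names and the probability values are made up.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for is_pos, s in zip(labels, scores) if is_pos]
    neg = [s for is_pos, s in zip(labels, scores) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_weighted(y_true, proba, classes):
    """Support-weighted mean of the one-vs-rest AUC of every class."""
    total = 0.0
    for k, cls in enumerate(classes):
        labels = [y == cls for y in y_true]
        scores = [row[k] for row in proba]
        total += sum(labels) * binary_auc(labels, scores)
    return total / len(y_true)


# Toy labels and predicted class probabilities (made up for illustration).
y_true = [1, 1, 2, 3]
proba = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
    [0.1, 0.2, 0.7],
]
print(auc_weighted(y_true, proba, classes=[1, 2, 3]))  # 0.875
```

Here class 1 scores 0.75 with two true instances and classes 2 and 3 each score 1.0 with one instance, giving (2 × 0.75 + 1 + 1) / 4 = 0.875.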

You can specify your experiment settings as shown below. Because this example uses a remote compute target, the flag `local_compute` is set to `False`. Note how the two configurations differ in the way the data is supplied: the local configuration takes the data arrays directly, while the remote one takes a data script. Beyond the parameters shown here, you can supply others as needed; see [this web page](https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.automlconfig?view=azure-ml-py) for the full list.

In [12]:
from azureml.train.automl import AutoMLConfig

local_compute = False

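if local_compute: 
    print('using local compute')
    # local runs take the arrays directly, so load them with get_data() from preprocess.py
    from preprocess import get_data
    data = get_data()
    # no compute_target is passed, so the run stays on the local machine
    automated_ml_config = AutoMLConfig(task = 'classification',
                                 debug_log = 'automated_ml_errors.log',
                                 path = project_folder,
                                 X = data["X"],
                                 y = data["y"],
                                 X_valid = data["X_valid"],
                                 y_valid = data["y_valid"],
                                 **automl_settings)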
else: 
    print('using remote compute')
    automated_ml_config = AutoMLConfig(task = 'classification',
                                 debug_log = 'automated_ml_errors.log',
                                 compute_target=compute_target,
                                 path = project_folder,
                                 data_script = os.path.join(project_folder, "preprocess.py"),  # the script written above
                                 model_explainability=True,
                                 **automl_settings)

using remote compute


# Submit an experiment

Now that you have defined your experiment settings, you can create and name the experiment to run in your workspace.
<br>
<img src="Images/experiment_homepage.png" width="1500">

In [15]:
# create an experiment
experiment = Experiment(workspace = ws, name = "news_popularity")

Next, submit the experiment. To view the output within the notebook, set the flag `show_output` to `True`.

In [None]:
# submit an experiment
run = experiment.submit(automated_ml_config, show_output=True)

/home/nbuser/library
Reading data...
Stripping off white space...
Make target categories
Splitting data...
Running on remote compute: gpucluster
Parent Run ID: AutoML_2e02b696-bf8a-47f4-b035-64347532b27e
********************************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
SAMPLING %: Percent of the training data to sample.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
********************************************************************************************************************

 ITERATION   PIPELINE                                       SAMPLING %  DURATION      METRIC      BEST
         0   StandardScalerWrapper SGD                      100.0000    0:26:46       0.6304    0.6304


Runs submitted under the same experiment name are grouped together under that experiment.

<br>
<img src="Images/experiment_runs.png" width="1000">

In [17]:
from azureml.widgets import RunDetails
RunDetails(run).show()


<img src="Images/run_output_details.png" width="1500">

###  Retrieve the Best Model

You can retrieve the best model from the run that you just submitted.

In [76]:
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: news_popularity,
Id: AutoML_96d26b44-876a-4643-b56f-6250e616e123_29,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(logger=None, task=None)), ('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
               estimators=[('LightGBM', Pipeline(memory=None,
     steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ...666666666667, 0.06666666666666667, 0.26666666666666666, 0.13333333333333333, 0.26666666666666666]))])


### Best Model Based on Any Other Metric

Even though I specified `AUC_weighted` as the primary metric, I can also pick the best model according to any other relevant metric. In the following cell, I look up the model with the highest accuracy.

In [78]:
lookup_metric = "accuracy"
best_run, fitted_model = run.get_output(metric = lookup_metric)
print(best_run)
print(fitted_model)

Run(Experiment: news_popularity,
Id: AutoML_96d26b44-876a-4643-b56f-6250e616e123_29,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('datatransformer', DataTransformer(logger=None, task=None)), ('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
               estimators=[('LightGBM', Pipeline(memory=None,
     steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ...666666666667, 0.06666666666666667, 0.26666666666666666, 0.13333333333333333, 0.26666666666666666]))])
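
Conceptually, looking up the best model by another metric means ranking the child runs by that metric instead of the primary one. The sketch below illustrates the idea on a list of plain dictionaries. The iteration numbers and metric values are invented for illustration, `best_iteration` is a hypothetical helper, and none of this reflects how `get_output` is implemented inside the SDK.

```python
def best_iteration(child_runs, metric):
    """Return the child run with the highest value of the given metric."""
    scored = [run for run in child_runs if metric in run["metrics"]]
    if not scored:
        raise ValueError("no run reports the metric " + repr(metric))
    return max(scored, key=lambda run: run["metrics"][metric])


# Invented metric values; a real run reports its own numbers.
child_runs = [
    {"iteration": 0, "metrics": {"AUC_weighted": 0.63, "accuracy": 0.29}},
    {"iteration": 1, "metrics": {"AUC_weighted": 0.66, "accuracy": 0.31}},
    {"iteration": 2, "metrics": {"AUC_weighted": 0.65, "accuracy": 0.33}},
]

print(best_iteration(child_runs, "AUC_weighted")["iteration"])  # 1
print(best_iteration(child_runs, "accuracy")["iteration"])      # 2
```

As the toy data shows, the winner can change with the metric, although in the run above both lookups returned the same child run.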


### Retrieve the Best Model Run Without Re-training

At times you may wish to retrieve the best model run without defining your model objects again. In that case, go to the home page of the experiment in the portal to obtain the `Run Id` of the run you want.
<br>
<img src="Images/run_id.png" width="1500">

In [9]:
from azureml.core import get_run
run_cpu_id = 'AutoML_fd9055d1-1f4b-4484-ae70-d773ca82bd67_1' #get from portal
best_run = get_run(experiment, run_cpu_id)
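
The run IDs printed in this notebook suggest a pattern: a child run ID looks like the parent run ID followed by an underscore and an iteration number (for example `..._29` or `..._1`). Assuming that pattern holds, which is an observation from the IDs shown here rather than a documented guarantee, a small hypothetical helper can split a child ID into its two parts.

```python
def split_child_run_id(run_id):
    """Split 'PARENT_N' into (parent_id, iteration), assuming a numeric suffix."""
    parent_id, sep, suffix = run_id.rpartition("_")
    if not sep or not suffix.isdigit():
        raise ValueError("not a child run id: " + run_id)
    return parent_id, int(suffix)


parent_id, iteration = split_child_run_id("AutoML_fd9055d1-1f4b-4484-ae70-d773ca82bd67_1")
print(parent_id)   # AutoML_fd9055d1-1f4b-4484-ae70-d773ca82bd67
print(iteration)   # 1
```

A parent ID such as the one printed when the experiment was submitted has no numeric suffix, so the helper rejects it with a `ValueError`.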

# Explain Model

Since I included the flag `model_explainability` in my `automated_ml_config` settings (refer to Section [Specify Experiment Settings and Compute Targets](#Specify-Experiment-Settings-and-Compute-Targets)), I can easily retrieve the model explanation. To learn more about the `automlexplainer` module, visit this [page](https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.automlexplainer?view=azure-ml-py).

In [20]:
best_run = run

You can print the overall and class-level feature importance in the notebook, or navigate to the portal to view a ready-made chart that compares feature importance.

In [None]:
# if there is problem in importing `numpy.core.multiarray`, upgrade your numpy package
# !pip install -U numpy
from azureml.train.automl.automlexplainer import retrieve_model_explanation

## for a specific run 
shap_values, expected_values, overall_summary, overall_imp, per_class_summary, per_class_imp = \
    retrieve_model_explanation(best_run)

# Overall feature importance
# print(overall_imp)
# print(overall_summary)

# Class-level feature importance
# print(per_class_imp)
# print(per_class_summary)
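
To read the overall importance more easily than with the bare `print` calls above, the feature names and their importance values can be paired and ranked. The sketch below assumes that `overall_imp` holds feature names and `overall_summary` holds the matching importance values; check this against the SDK documentation for your version. The helper `top_features` and the numbers are invented for illustration, with only the column names taken from the data set.

```python
def top_features(names, values, k=3):
    """Pair feature names with importance values and return the k largest."""
    if len(names) != len(values):
        raise ValueError("names and values must have the same length")
    ranked = sorted(zip(names, values), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Column names from the data set; the importance values are made up.
names = ["n_tokens_title", "num_hrefs", "num_imgs", "title_subjectivity"]
values = [0.02, 0.11, 0.07, 0.01]

for name, value in top_features(names, values, k=2):
    print(name, value)
```

With the real results, the call would be `top_features(overall_imp, overall_summary)`.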

<br>
The block above also creates a folder within your project directory and saves the `explanation` there as JSON files.
<br>
<img src="Images/explanation.png" width="1500">

To view the feature importance chart, go to your portal and click on the experiment run you are interested in.
<br>
<img src="Images/explanation_chart.png" width="500">

# Register Model

Finally, if you are interested in hosting your model online, you can register it. Once registered, the model can be connected to a web page or used for batch scoring.

In [36]:
model = best_run.register_model('predict_news_popularity')

<br>
<img src="Images/register_model.png" width="1000">