# Data Preparation
In this notebook, we use a subset of [Stack Exchange network](https://archive.org/details/stackexchange) question data which includes original questions tagged as 'JavaScript', their duplicate questions and their answers. Here, we provide the steps to prepare the data to use for training, tuning, and testing a model that will match a new question with an existing original question. The data files produced are stored in a `data` directory for ease of reference and also to keep them separate from the training script.

The data preparation steps are
- [import libraries and define parameters](#import),
- [ingest the data](#ingest),
- [cleanse the data](#cleanse),
- [prepare the train, tune, and test datasets](#prepare), and
- [save the datasets.](#save)

## Imports and parameters <a id='import'></a>

In [1]:
import logging

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig

In [2]:
from azureml.core import Workspace

ws = Workspace.from_config()

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.
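
The warning above mentions `ServicePrincipalAuthentication` for unattended runs. The following is a minimal sketch of how that could be wired up, not a definitive setup. The environment variable names (`AML_TENANT_ID`, `AML_PRINCIPAL_ID`, `AML_PRINCIPAL_PASSWORD`) and both helper functions are hypothetical; only `ServicePrincipalAuthentication` and the `auth` argument of `Workspace.from_config` come from the SDK.

```python
import os

# Hypothetical environment variable names; use whatever your scheduler provides.
ENV_NAMES = {
    "tenant_id": "AML_TENANT_ID",
    "service_principal_id": "AML_PRINCIPAL_ID",
    "service_principal_password": "AML_PRINCIPAL_PASSWORD",
}


def service_principal_kwargs(env=None):
    """Collect the keyword arguments for ServicePrincipalAuthentication."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {arg: env[name] for arg, name in ENV_NAMES.items()}


def workspace_for_unattended_run(env=None):
    """Load the workspace from config.json without an interactive login."""
    # Imported lazily so the helper above works without the SDK installed.
    from azureml.core import Workspace
    from azureml.core.authentication import ServicePrincipalAuthentication

    auth = ServicePrincipalAuthentication(**service_principal_kwargs(env))
    return Workspace.from_config(auth=auth)
```

In a scheduled job, `ws = workspace_for_unattended_run()` would then replace the interactive `Workspace.from_config()` call.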


In [3]:
from azureml.core import Dataset
dataset_name = 'ai_impact_scores'

ai_impact_score_ds = Dataset.get_by_name(workspace=ws, name=dataset_name)

In [4]:
ai_impact_score_pd = ai_impact_score_ds.to_pandas_dataframe()

2019-12-18 23:42:45.923825 | ActivityCompleted: Activity=to_pandas_dataframe, HowEnded=Failure, Duration=7384.49 [ms], Info = {'activity_id': '37bb98aa-e380-4867-8b13-ef25e47aaa99', 'activity_name': 'to_pandas_dataframe', 'activity_type': 'PublicApi', 'app_name': 'TabularDataset', 'source': 'azureml.dataset', 'version': '1.0.76', 'completionStatus': 'Success', 'durationMs': 776.5}, Exception=DatasetExecutionError; Could not execute the specified transform.|session_id=9ad585b2-02c4-46b9-be46-3b80136c33f2


DatasetExecutionError: Could not execute the specified transform.|session_id=9ad585b2-02c4-46b9-be46-3b80136c33f2

In [None]:
ai_impact_score_pd.head()

In [None]:
ai_impact_score_pd.Title.iloc[0]

In [None]:
# Drop columns that are not used as features.
columns_to_remove = ["IsBlocker", "Pri", "LogScore", "DCReview"]
ai_impact_score_pd = ai_impact_score_pd.drop(columns=columns_to_remove)

ai_impact_score_pd.head(5)

In [None]:
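# Fraction of rows held out for testing; used by train_test_split below.
test_size = 0.20

In [None]:
from sklearn.model_selection import train_test_split

# Separate the target column from the features.
y_df = ai_impact_score_pd.pop("Score")
x_df = ai_impact_score_pd

x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=test_size, random_state=223)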
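
To make the effect of `test_size` and `random_state` concrete, here is a dependency-free sketch of a seeded shuffle split. It illustrates the idea only and does not reproduce scikit-learn's shuffling, so the selected rows will differ from those of `train_test_split`. The function name `split_indices` and the row count of 450 are hypothetical, chosen for illustration.

```python
import math
import random


def split_indices(n_samples, test_size, seed):
    """Shuffle row indices with a fixed seed and split them into train and test."""
    # Assumption of this sketch: a fractional test count is rounded up.
    n_test = math.ceil(n_samples * test_size)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


# 450 rows is a hypothetical dataset size, not a figure from the notebook.
train_idx, test_idx = split_indices(n_samples=450, test_size=0.2, seed=223)
print(len(train_idx), len(test_idx))

# The same seed always reproduces the same partition.
assert split_indices(450, 0.2, 223) == (train_idx, test_idx)
```

Fixing the seed is what makes later comparisons between runs meaningful, since every run is evaluated on the same held-out rows.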

In [None]:
import logging
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='regression',
                             iteration_timeout_minutes=10,   # time limit per iteration
                             iterations=10,                  # number of pipelines to try
                             primary_metric='spearman_correlation',
                             n_cross_validations=5,
                             debug_log='automl.log',
                             verbosity=logging.INFO,
                             X=x_train,
                             y=y_train,
                             preprocess=True)
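
The primary metric chosen above, `spearman_correlation`, measures how well the predicted scores preserve the ordering of the true scores. As a rough illustration, Spearman correlation can be computed as the Pearson correlation of the ranks. The sketch below is plain Python with hypothetical helper names; it is not how AutoML computes the metric internally.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(y_true, y_pred):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(y_true), average_ranks(y_pred))


# Predictions that are badly scaled but correctly ordered still score 1.0.
print(spearman([1, 2, 3, 4], [10, 20, 30, 400]))
```

Because only the ordering matters, this metric suits cases where scores are used to prioritize items rather than to estimate exact values.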

In [8]:
from azureml.core.experiment import Experiment
experiment = Experiment(ws, "ai-impact-score-experiment")

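# get_runs() returns a generator, which is always truthy, so fetch the
# most recent run explicitly and submit a new one only if none exists.
local_run = next(experiment.get_runs(), None)

if local_run is None:
    local_run = experiment.submit(automl_config, show_output=True)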
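
One detail worth keeping in mind when reusing an earlier run: `experiment.get_runs()` returns a generator, and a generator object is truthy even when it yields nothing. The sketch below uses a stand-in generator and a hypothetical helper, `first_or_none`, to show a check that does distinguish "no runs yet" from "at least one run".

```python
def first_or_none(iterable):
    """Return the first item of an iterable, or None if it is empty."""
    return next(iter(iterable), None)


# Stand-in for an experiment with no history: a generator that yields nothing.
empty_runs = (run for run in [])

print(bool(empty_runs))           # truthy, even though it yields nothing
print(first_or_none(empty_runs))  # None, meaning nothing has been run yet

# Stand-in for an experiment with history, most recent run first.
print(first_or_none(run for run in ["run_3", "run_2", "run_1"]))
```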

In [9]:
children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

# Order the columns by iteration number.
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata

| Metric | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| explained_variance | 0.17 | 0.77 | 0.75 | 0.73 | 0.18 | 0.78 | 0.77 | 0.58 | 0.78 |
| mean_absolute_error | 6.1 | 1.81 | 1.84 | 2.16 | 5.82 | 1.56 | 1.68 | 1.77 | 1.61 |
| mean_absolute_percentage_error | 221.29 | 17.74 | 17.48 | 29.78 | 206.53 | 9.4 | 12.59 | 12.24 | 11.15 |
| median_absolute_error | 3.91 | 0.07 | 0.1 | 0.38 | 3.67 | 0.0 | 0.01 | 0.0 | 0.03 |
| normalized_mean_absolute_error | 0.09 | 0.03 | 0.03 | 0.03 | 0.09 | 0.02 | 0.03 | 0.03 | 0.02 |
| normalized_median_absolute_error | 0.06 | 0.0 | 0.0 | 0.01 | 0.06 | 0.0 | 0.0 | 0.0 | 0.0 |
| normalized_root_mean_squared_error | 0.14 | 0.07 | 0.07 | 0.08 | 0.13 | 0.07 | 0.07 | 0.09 | 0.07 |
| normalized_root_mean_squared_log_error | 0.24 | 0.06 | 0.06 | 0.07 | 0.23 | 0.05 | 0.06 | 0.06 | 0.05 |
| r2_score | 0.06 | 0.76 | 0.74 | 0.71 | 0.06 | 0.77 | 0.76 | 0.57 | 0.77 |
| root_mean_squared_error | 9.17 | 4.94 | 4.93 | 5.05 | 8.94 | 4.85 | 4.79 | 6.24 | 4.81 |
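
As a quick way to read the output above, the iterations can be ranked by a single metric. The sketch below copies the `root_mean_squared_error` row into a plain dictionary and sorts it; the variable names are illustrative. Note that this ranking is by RMSE only and need not match the run that AutoML selects, since the experiment is configured with `spearman_correlation` as its primary metric.

```python
# root_mean_squared_error per iteration, copied from the output above.
rmse_by_iteration = {
    0: 9.17, 1: 4.94, 2: 4.93, 3: 5.05, 4: 8.94,
    5: 4.85, 6: 4.79, 7: 6.24, 8: 4.81,
}

# Lower RMSE is better, so sort ascending by value.
ranked = sorted(rmse_by_iteration.items(), key=lambda item: item[1])
best_iteration, best_rmse = ranked[0]

print("Lowest RMSE: iteration {} ({:.2f})".format(best_iteration, best_rmse))
print("Top three iterations:", [iteration for iteration, _ in ranked[:3]])
```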


In [None]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

In [None]:
y_pred_train = fitted_model.predict(x_train)
y_residual_train = y_train - y_pred_train

y_pred_test = fitted_model.predict(x_test)
y_residual_test = y_test - y_pred_test
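
The residuals computed above feed directly into the two summary figures reported in the plots that follow, RMSE and R2. As an illustration of what those numbers mean, here is a dependency-free sketch with made-up toy values; the notebook itself uses `mean_squared_error` and `r2_score` from scikit-learn.

```python
def rmse(y_true, y_pred):
    """Root mean squared error: the typical size of a residual."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return (sum(r * r for r in residuals) / len(residuals)) ** 0.5


def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variance over total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration only; they are not taken from the dataset.
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.0, 5.0, 8.0, 9.0]

print("RMSE = {0:.2f}".format(rmse(toy_true, toy_pred)))
print("R2 score = {0:.2f}".format(r2(toy_true, toy_pred)))
```

A model that always predicts the mean of the targets gets an R2 of 0, so values close to 1 indicate that most of the variance in the target is explained.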

In [None]:
%matplotlib inline
from sklearn.metrics import mean_squared_error, r2_score

# Set up a multi-plot chart.
f, (a0, a1) = plt.subplots(1, 2, gridspec_kw = {'width_ratios':[1, 1], 'wspace':0, 'hspace': 0})
f.suptitle('Regression Residual Values', fontsize = 18)
f.set_figheight(6)
f.set_figwidth(16)

# Plot residual values of training set.
a0.axis([0, 360, -200, 200])
a0.plot(y_residual_train, 'bo', alpha = 0.5)
a0.plot([-10,360],[0,0], 'r-', lw = 3)
a0.text(16,170,'RMSE = {0:.2f}'.format(np.sqrt(mean_squared_error(y_train, y_pred_train))), fontsize = 12)
a0.text(16,140,'R2 score = {0:.2f}'.format(r2_score(y_train, y_pred_train)), fontsize = 12)
a0.set_xlabel('Training samples', fontsize = 12)
a0.set_ylabel('Residual Values', fontsize = 12)

# Plot a histogram.
a0.hist(y_residual_train, orientation = 'horizontal', color = 'b', bins = 10, histtype = 'step')
a0.hist(y_residual_train, orientation = 'horizontal', color = 'b', alpha = 0.2, bins = 10)

# Plot residual values of test set.
a1.axis([0, 90, -200, 200])
a1.plot(y_residual_test, 'bo', alpha = 0.5)
a1.plot([-10,360],[0,0], 'r-', lw = 3)
a1.text(5,170,'RMSE = {0:.2f}'.format(np.sqrt(mean_squared_error(y_test, y_pred_test))), fontsize = 12)
a1.text(5,140,'R2 score = {0:.2f}'.format(r2_score(y_test, y_pred_test)), fontsize = 12)
a1.set_xlabel('Test samples', fontsize = 12)
a1.set_yticklabels([])

# Plot a histogram.
a1.hist(y_residual_test, orientation = 'horizontal', color = 'b', bins = 10, histtype = 'step')
a1.hist(y_residual_test, orientation = 'horizontal', color = 'b', alpha = 0.2, bins = 10)

plt.show()

In [None]:
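# List the files produced by the best run (not the last child run from the loop above).
best_run.get_file_names()

In [None]:
# Register the model saved by the best run.
model = best_run.register_model(model_name='best_impact_score_model', model_path='./outputs/model.pkl')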

In [None]:
print("Registered model:\n --> Name: {}\n --> Version: {}\n --> URL: {}".format(model.name, model.version, model.url))