# Classifying semester grades by alcohol consumption and family situation among young people

As the problem for this notebook I chose a dataset that is the output of a survey. The main purpose of the study was to find links between alcohol consumption, the household situation, the student's ambition and the final grade in a school subject. This [dataset](https://www.kaggle.com/uciml/student-alcohol-consumption) contains mostly binary or discrete values, and the final grade is discrete as well (an integer from 0 to 20).

This notebook builds and trains a classification model. It shows how to create, manipulate and upload data, and how to run the training process with the Azure AutoML service.

The notebook consists of the following parts:
1. [Dataset manipulation and cloud upload](#dataset)
2. [Upload dataset to cloud](#dataset_upload)
3. [Create and run AutoML Experiment](#experiment)
4. [Retrieve Training process logs and model](#training)

#### IMPORTANT

This notebook is not aimed at training or developing the best possible model. Its purpose is to show how easily experiments can be run with automated cloud machine learning services, and how flexible a cloud solution can be compared with configuring everything manually.

### Requirements

To run this notebook, every package imported in the following section must be installed on your machine. If anything is missing, install it into your environment with pip.
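
Before running the imports, it can help to see which packages are missing. The snippet below is a minimal sketch that uses only the standard library; the helper name `missing_packages` is made up for this example, and the list simply mirrors the modules imported in the next section.

```python
import importlib.util


def missing_packages(names):
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            # for dotted names find_spec imports the parent package first
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


# module names taken from the imports used in this notebook
required = ["msrest", "azureml.core", "azureml.train.automl", "sklearn", "pandas"]
print("Missing:", missing_packages(required))
```

An empty list means every module was found; anything listed still has to be installed.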

### Necessary imports

In [3]:
from msrest.exceptions import HttpOperationError
from azureml.core import Workspace, Datastore, Dataset
from azureml.data.dataset_type_definitions import PromoteHeadersBehavior
from azureml.train.automl import AutoMLConfig
from azureml.core import Experiment



import time
from datetime import datetime
from sklearn.metrics import accuracy_score
import pandas as pd
import os
import logging

<a id="dataset"></a>
# 1. Dataset manipulation and cloud upload

## Clean the dataset locally

First we need to make some changes to the input data so that the dataset can be used for training. Applying such changes directly online is not easy, so it is more convenient to run the operations locally and then upload the data to the cloud.

So let's load the dataset.

In [4]:
school_dataset = pd.read_csv('original_data/mat.csv')

First we list all variables and decide which are relevant and which are not

In [5]:
# print all available columns from the dataset
school_dataset.columns

Index(['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
       'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime',
       'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery',
       'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc',
       'Walc', 'health', 'absences', 'G1', 'G2', 'G3'],
      dtype='object')

Let's mark the data that we don't want to analyze during classification:
* Mjob, Fjob - mother's and father's job - dropped because the values are not clearly identifiable and cannot be easily distinguished and classified
* reason - why the student picked this school - irrelevant too
* nursery - attended nursery school - I don't want to take that into consideration

NOTE: This operation doesn't need to happen here. The Azure API allows dropping columns before the training process, so we upload these columns to the cloud anyway, just in case.



In [6]:
# school_dataset.drop(['Mjob', 'Fjob', 'reason', 'nursery'], axis=1)

Let's convert other values to numeric representations

In [7]:
# school - binary, because only two schools participated in the study
# GP - 1, MS - 0
# assign the result back instead of using inplace=True on a column selection
school_dataset['school'] = school_dataset['school'].replace({"GP": 1, "MS": 0})
school_dataset['school'].unique()

array([1, 0])

In [8]:
# sex: F (female) - 1, M (male) - 0
school_dataset['sex'] = school_dataset['sex'].replace({"F": 1, "M": 0})
school_dataset['sex'].unique()

array([1, 0])

In [9]:
# address: U (urban) - 1, R (rural) - 0
school_dataset['address'] = school_dataset['address'].replace({"U": 1, "R": 0})
school_dataset['address'].unique()

array([1, 0])

In [10]:
# famsize - family size: LE3 - 0 (less than or equal to 3), GT3 - 1 (greater than 3)
school_dataset['famsize'] = school_dataset['famsize'].replace({"GT3": 1, "LE3": 0})
school_dataset['famsize'].unique()

array([1, 0])

In [11]:
# Pstatus: T - 1 (parents living together), A - 0 (parents living apart)
school_dataset['Pstatus'] = school_dataset['Pstatus'].replace({"T": 1, "A": 0})
school_dataset['Pstatus'].unique()

array([0, 1])

In [12]:
# guardian - the student's guardian: father (1) / mother (2) / other (0)
school_dataset['guardian'] = school_dataset['guardian'].replace({"father": 1, "mother": 2, "other": 0})
school_dataset['guardian'].unique()

array([2, 1, 0])
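
The per-column conversions above can also be written as one mapping table applied in a loop. The following is a minimal sketch under that assumption: the mappings are copied from the cells above, while the helper name `encode_columns` and the two-row sample are made up for illustration. Unlike `replace`, `map` is not safe to run twice on the same column, so the helper returns a new frame and rejects unexpected values.

```python
import pandas as pd

# mappings copied from the cells above
ENCODINGS = {
    "school": {"GP": 1, "MS": 0},
    "sex": {"F": 1, "M": 0},
    "address": {"U": 1, "R": 0},
    "famsize": {"GT3": 1, "LE3": 0},
    "Pstatus": {"T": 1, "A": 0},
    "guardian": {"father": 1, "mother": 2, "other": 0},
}


def encode_columns(frame, encodings):
    """Return a copy of `frame` with each listed column mapped to integers."""
    encoded = frame.copy()
    for column, mapping in encodings.items():
        unknown = set(encoded[column].unique()) - set(mapping)
        if unknown:
            raise ValueError(f"unexpected values in {column!r}: {unknown}")
        encoded[column] = encoded[column].map(mapping)
    return encoded


# tiny made-up sample, not rows from the real dataset
sample = pd.DataFrame({
    "school": ["GP", "MS"],
    "sex": ["F", "M"],
    "address": ["U", "R"],
    "famsize": ["GT3", "LE3"],
    "Pstatus": ["A", "T"],
    "guardian": ["mother", "other"],
})
encoded_sample = encode_columns(sample, ENCODINGS)
print(encoded_sample)
```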

### Convert yes/no information to 1/0

* schoolsup - extra educational support: yes - 1, no - 0
* famsup - family educational support: yes - 1, no - 0
* paid - extra paid classes within the course subject: yes - 1, no - 0
* activities - extra-curricular activities: yes - 1, no - 0
* higher - wants to take higher education: yes - 1, no - 0
* internet - has internet access at home: yes - 1, no - 0
* romantic - in a romantic relationship: yes - 1, no - 0
* nursery - attended nursery school: yes - 1, no - 0

In [13]:
yes_no_to_numeric = [
    'schoolsup', 'famsup', 'paid', 'activities',
    'higher', 'internet', 'romantic', 'nursery'
]

for col in yes_no_to_numeric:
    school_dataset[col] = school_dataset[col].replace({"yes": 1, "no": 0})
    print(school_dataset[col].unique())

[1 0]
[0 1]
[0 1]
[0 1]
[1 0]
[0 1]
[0 1]
[1 0]


## Description of the other dataset columns

* Medu - Mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, or 4 - higher education)
* Fedu - Father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, or 4 - higher education)
* traveltime - Home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
* studytime - Weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
* failures - Number of past class failures (numeric: n if 1<=n<3, else 4)
* famrel - Quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
* freetime - Free time after school (numeric: from 1 - very low to 5 - very high)
* goout - Going out with friends (numeric: from 1 - very low to 5 - very high)
* Dalc - Workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
* Walc - Weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
* health - Current health status (numeric: from 1 - very bad to 5 - very good)
* absences - Number of school absences (numeric: from 0 to 93)
* G1 - First period grade (numeric: from 0 to 20)
* G2 - Second period grade (numeric: from 0 to 20)

## Output variable

* G3 - Final grade (numeric: from 0 to 20, output target)

In [14]:
# save data locally on disk; index=False keeps the row index out of the CSV
os.makedirs('dataset', exist_ok=True)
school_dataset.to_csv('dataset/mat.csv', index=False)

<a id="dataset_upload"></a>
## 2. Upload dataset to cloud

### Configuration

In this section we specify which actions will be performed on the dataset.

In [15]:
# indicates whether we want to upload a new dataset
send_dataset_to_cloud = False
# folder that holds the data for this presentation
local_dataset_source = 'dataset'
# path in the cloud blob storage where the dataset is uploaded to and read from
upstream_dataset_path = 'datasets/tabular/'
# where the project outputs will be stored
project_folder = './AMLclassification'

In [16]:
# log in to the Microsoft account and connect to the configured Azure workspace
workspace = Workspace.from_config()

`Workspace.from_config` loads a `config.json` file; by default it starts searching in the directory the notebook is run from.

The file looks like the following example:

```
{
    "subscription_id": "<your azure subscription id>",
    "resource_group": "<resource group name where your AML resource is placed>",
    "workspace_name": "<AML workspace name>"
}
```
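
A quick local check of the file can catch a missing entry before the login step. This is only a sketch: the three key names come from the example above, while the helper name `missing_config_keys` and the sample text are made up for illustration.

```python
import json

# keys shown in the example config.json above
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_text):
    """Return the required keys that are absent or empty in a config.json string."""
    config = json.loads(config_text)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# sample content with an empty workspace name
example = """
{
    "subscription_id": "<your azure subscription id>",
    "resource_group": "<resource group name where your AML resource is placed>",
    "workspace_name": ""
}
"""
print(missing_config_keys(example))  # ['workspace_name']
```

For a real file, read its contents with `open('config.json').read()` and pass the text to the helper.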

## Upload dataset and retrieve its cloud representation

In [17]:
# get default AML workspace datastore to get datasets or upload new ones
datastore = workspace.get_default_datastore()

In [18]:
# decide if we want to upload data to cloud
if send_dataset_to_cloud:
    datastore.upload(
        src_dir = local_dataset_source,
        target_path = upstream_dataset_path,
        overwrite = True,
        show_progress = True)

# specify dataset path and datastore which stores it
datastore_paths = [
    (datastore, upstream_dataset_path + 'mat.csv')
#     (datastore, upstream_dataset_path + 'por.csv')
]

# read data from cloud blob as tabular data read from csv files
school_dataset = Dataset.Tabular.from_delimited_files(
    path=datastore_paths, separator=',',
    header=PromoteHeadersBehavior.ALL_FILES_HAVE_SAME_HEADERS
)

### Drop irrelevant columns from online dataset

In [19]:
school_dataset = school_dataset.drop_columns(['Mjob', 'Fjob', 'reason', 'nursery'])

## Split dataset into train and test subsets

In [20]:
train_dataset, test_dataset = school_dataset.random_split(0.9, seed=1)

Please note that all of these operations are performed on Azure objects. They look similar to scikit-learn and pandas methods, but they are implemented by Azure. This makes adopting cloud solutions easier for data scientists.

To materialize the dataset as a pandas DataFrame in IPython, call `to_pandas_dataframe()` on the Azure object.
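
For readers who think in pandas, the split above roughly corresponds to the sketch below. It is only a local analogue on made-up data, not how Azure implements `random_split`; the helper name `split_frame` is invented for this example, and the sizes of the Azure split may differ slightly because, as far as I know, its percentage is treated as approximate.

```python
import pandas as pd


def split_frame(frame, train_fraction, seed):
    """Sample a reproducible train part and use the remaining rows as the test part."""
    train = frame.sample(frac=train_fraction, random_state=seed)
    test = frame.drop(train.index)
    return train, test


# made-up grades, only to show the resulting sizes
toy = pd.DataFrame({"G1": range(20), "G3": range(20)})
train_part, test_part = split_frame(toy, 0.9, seed=1)
print(len(train_part), len(test_part))  # 18 2
```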

<a id="experiment"></a>
## 3. Create and run AutoML Experiment

### Check if there are any created AML compute instances in current workspace

In [21]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

ComputeTarget.list(workspace)

[AmlCompute(workspace=Workspace.create(name='AMLAlcohol', subscription_id='2e2771d1-53f9-4787-9bc8-31f23d10d063', resource_group='AMLAlcohol'), name=alcohol-cluster, id=/subscriptions/2e2771d1-53f9-4787-9bc8-31f23d10d063/resourceGroups/AMLAlcohol/providers/Microsoft.MachineLearningServices/workspaces/AMLAlcohol/computes/alcohol-cluster, type=AmlCompute, provisioning_state=Succeeded, location=northeurope, tags=None)]

#### Connect or create AML computing instance

In [22]:
# Choose a name for your cluster.
amlcompute_cluster_name = "alcohol-cluster"
# assume at first that the compute cluster is not found
is_cluster_found = False
# get compute targets from the given workspace
compute_targets = workspace.compute_targets

if amlcompute_cluster_name in compute_targets and compute_targets[amlcompute_cluster_name].type == 'AmlCompute':
    is_cluster_found = True
    print('Found existing training cluster.')
    # Get existing cluster
    aml_remote_compute = compute_targets[amlcompute_cluster_name]
         
# create new instance
if not is_cluster_found:
    print('Creating a new training cluster...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_DS2_V2", # for GPU, use "STANDARD_NC12"
                                                                 max_nodes = 20)
    # Create the cluster.
    aml_remote_compute = ComputeTarget.create(workspace, amlcompute_cluster_name, provisioning_config)


print('Checking cluster status...')
aml_remote_compute.wait_for_completion(show_output = True, min_node_count = 0, timeout_in_minutes = 20)

Found existing training cluster.
Checking cluster status...
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


#### Let's check which metrics we can use to evaluate a classification model

In [23]:
from azureml.train import automl

automl.utilities.get_primary_metrics('classification')

['average_precision_score_weighted',
 'AUC_weighted',
 'accuracy',
 'precision_score_weighted',
 'norm_macro_recall']

### Let's define experiment configuration

In this step we specify what the experiment will look like: whether it is a regression, classification or time series forecasting task, and much more. All options and configurations can be found in the [docs](https://docs.microsoft.com/pl-pl/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py).

In [24]:
# create the local directory if it does not exist
os.makedirs(project_folder, exist_ok=True)
# create the AML experiment settings
automl_config = AutoMLConfig(
    compute_target=aml_remote_compute,  # remote compute target on which the experiment will run
    task='classification',  # multi-class classification to predict the final grade
    primary_metric='accuracy',  # evaluate the classifier by prediction accuracy; any metric from the cell above can be used
    experiment_timeout_minutes=15,  # stop the experiment after the given number of minutes
    training_data=train_dataset,  # data for training
    label_column_name="G3",  # name of the output variable column
    n_cross_validations=5,  # number of cross-validations to run on the given dataset
    enable_early_stopping=True,  # allow the training process to terminate early
    featurization='auto',  # let the service detect the data type of each column
    debug_log='alcohol_model_errors.log',  # name of the log file
    verbosity=logging.INFO,  # which logs should be visible
    path=project_folder  # where to store the AML project
)


### Create and run AML Experiment on remote instance

This part is the most crucial for training. In this step we send the configured experiment to run on Azure. It is the core functionality of this notebook.

In [25]:
now = datetime.now()
time_string = now.strftime("%m-%d-%Y-%H")
experiment_name = "school-alcohol-remote-experiment-{0}".format(time_string)
print(experiment_name)

experiment = Experiment(workspace=workspace, name=experiment_name)
start_time = time.time()

run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (time.time() - start_time))

school-alcohol-remote-experiment-01-26-2021-18
Running on remote.
No run_configuration provided, running on alcohol-cluster with default configuration
Running on remote compute: alcohol-cluster
Parent Run ID: AutoML_54445ab1-a532-4f34-b761-82b6daea772b

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetBalancing. Performing class balancing sweeping
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a mod
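
The class balancing alert is easier to interpret after looking at how often each grade occurs. The snippet below is a minimal sketch of such a check with the standard library; the helper name `class_shares` and the grade values are hypothetical and are not taken from the real dataset.

```python
from collections import Counter


def class_shares(labels):
    """Return (label, share) pairs, most frequent label first."""
    counts = Counter(labels)
    total = len(labels)
    return [(label, count / total) for label, count in counts.most_common()]


# hypothetical final grades, only to illustrate the check
toy_grades = [10, 10, 10, 10, 11, 11, 0, 15, 15, 20]
for grade, share in class_shares(toy_grades):
    print(f"G3={grade}: {share:.0%}")
```

On the real data the same helper could be applied to the `G3` column of the local pandas DataFrame, for example `class_shares(list(frame['G3']))`, where `frame` stands for whichever DataFrame holds the grades.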

<a id="training"></a>
## 4. Retrieve Training process logs and model

### Let's look into results using Widget

In [27]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

#### Now we can check how long the AML process lasted

In [28]:
run_details = run.get_details()

# Timestamps look like: 2020-01-12T23:11:56.292703Z
end_time_utc_str = run_details['endTimeUtc'].split(".")[0]
start_time_utc_str = run_details['startTimeUtc'].split(".")[0]
# `datetime` is the class imported at the top of the notebook;
# subtracting the parsed timestamps avoids a local-time conversion
run_end = datetime.strptime(end_time_utc_str, "%Y-%m-%dT%H:%M:%S")
run_start = datetime.strptime(start_time_utc_str, "%Y-%m-%dT%H:%M:%S")

parent_run_time = (run_end - run_start).total_seconds()
print('Run Timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (parent_run_time))

Run Timing: --- 1636.0 seconds needed for running the whole Remote AutoML Experiment ---


In [41]:
# register the model in AML Studio
registered_model = run.register_model("alcclassificationmodel")

In [43]:
# download the registered model to the local machine
os.makedirs('model', exist_ok=True)
registered_model.download('model')

'model/model.pkl'

### Finally, let's retrieve the best model found

In [37]:
best_run, fitted_model = run.get_output()
print(best_run)
print()
print(fitted_model)
print()
print(run.summary())



Run(Experiment: school-alcohol-remote-experiment-01-26-2021-18,
Id: AutoML_54445ab1-a532-4f34-b761-82b6daea772b_15,
Type: azureml.scriptrun,
Status: Completed)

None





[['StackEnsemble', 1, 0.3702213279678068], ['VotingEnsemble', 1, 0.5279678068410463], ['XGBoostClassifier', 4, 0.4602012072434608], ['LightGBM', 3, 0.3840241448692153], ['RandomForest', 5, 0.499758551307847], ['LogisticRegression', 1, 0.33319919517102614], ['ExtremeRandomTrees', 1, 0.44334004024144874], ['GradientBoosting', 1, 0.35585513078470826]]
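
The summary list is easier to read when sorted. The sketch below copies the rows printed above and ranks them; it assumes that each row holds the algorithm name, the number of runs for that algorithm and its best score, which is my reading of the output rather than something stated in this notebook.

```python
# rows copied from the run.summary() output above
summary_rows = [
    ['StackEnsemble', 1, 0.3702213279678068],
    ['VotingEnsemble', 1, 0.5279678068410463],
    ['XGBoostClassifier', 4, 0.4602012072434608],
    ['LightGBM', 3, 0.3840241448692153],
    ['RandomForest', 5, 0.499758551307847],
    ['LogisticRegression', 1, 0.33319919517102614],
    ['ExtremeRandomTrees', 1, 0.44334004024144874],
    ['GradientBoosting', 1, 0.35585513078470826],
]

# assumption: the last element of each row is the best score for that algorithm
ranked = sorted(summary_rows, key=lambda row: row[2], reverse=True)
for name, count, score in ranked:
    print(f"{name:<20} {count:>2} {score:.3f}")

best_name, _, best_score = ranked[0]
print("Best:", best_name)  # Best: VotingEnsemble
```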


## Predictions made with our model

Now we can test how our model behaves on new data. The main advantage of this approach is that we can train the model in the cloud and get back a ready model to run on our local machine. With a little more work we can upload it online and create a script as a service to use it online. In this showcase we test the classifier locally and measure its accuracy.

As a first step, let's download the test set to the local machine by converting the Dataset.Tabular object to a pandas.DataFrame.

In [51]:
import pickle

# load the model that was downloaded to 'model/model.pkl' earlier
with open('model/model.pkl', 'rb') as file:
    pickle_model = pickle.load(file)
# download the test dataset as a pandas DataFrame
local_test_dataset = test_dataset.to_pandas_dataframe()
# separate the output variable so we can compare the classifier output with the real grades
y_test = local_test_dataset.pop('G3')
# the remaining columns of the test set are the model input
x_test = local_test_dataset

Note: in this environment `from sklearn.utils.fixes import pinvh` fails with `ImportError: cannot import name 'pinvh' from 'sklearn.utils.fixes' (/home/mkielczykowski/.local/lib/python3.8/site-packages/sklearn/utils/fixes.py)`; nothing in this cell uses that import, so it is left out.

### Finally, let's calculate the model accuracy

In [36]:
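# use the model loaded from 'model/model.pkl', because run.get_output() returned None for fitted_model
y_predictions = pickle_model.predict(x_test)
print('Accuracy:')
print(accuracy_score(y_test, y_predictions))

Note: `run.get_output()` returned `None` for `fitted_model` in this run, so `fitted_model.predict(x_test)` fails with `AttributeError: 'NoneType' object has no attribute 'predict'`.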
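
To put an accuracy value in context, it helps to compare it with a trivial baseline. The sketch below uses only the standard library and hypothetical labels; `accuracy` computes the same fraction of correct predictions that `accuracy_score` returns for single-label classification, and `majority_baseline` is a made-up helper that always predicts the most frequent training grade.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def majority_baseline(y_train, y_true):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_true, [most_common_label] * len(y_true))


# hypothetical grades, not results from the trained model
y_train_toy = [10, 10, 11, 12, 10, 15]
y_true_toy = [10, 11, 10, 15]
y_pred_toy = [10, 11, 12, 15]

print(accuracy(y_true_toy, y_pred_toy))            # 0.75
print(majority_baseline(y_train_toy, y_true_toy))  # 0.5
```

A classifier is only useful if its accuracy is clearly above such a baseline on the same test set.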