<p style="padding: 10px; border: 1px solid black;">
<img src="./../../images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>
</p>
    
# <a name="0">MLU Day One Machine Learning - Demo</a>


[__AutoGluon__](https://auto.gluon.ai/stable/index.html#) automates machine learning tasks, enabling you to achieve strong predictive performance in your applications with just a few lines of code. 

This notebook shows how to use AutoGluon Tabular to solve a __multiclass classification task__. The metric we use to evaluate the performance of the model is accuracy.

1. <a href="#1">Business Problem and ML Problem Description</a>
2. <a href="#2">Importing AutoGluon</a>
3. <a href="#3">Getting the Data</a>
4. <a href="#4">Sampling Data</a>
5. <a href="#5">Model Training with AutoGluon (smaller train dataset)</a>
6. <a href="#6">AutoGluon Training Results</a>
7. <a href="#7">Model Prediction with AutoGluon</a>
8. <a href="#8">Re-Train (with full train data) and predict again</a>
9. <a href="#9">Before You Go (clean up model artifacts)</a>


__Jupyter notebooks environment__:

* Jupyter notebooks allow creating and sharing documents that contain both code and rich text cells. If you are not familiar with Jupyter notebooks, read more [here](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). 
* This is a quick-start demo to bring you up to speed on coding and experimenting with machine learning. Move through the notebook __from top to bottom__. 
* Run each code cell to see its output. To run a cell, click within the cell and press __Shift+Enter__, or click __Run__ from the top of the page menu. 
* A `[*]` symbol next to the cell indicates the code is still running. A `[#]` symbol, where # is an integer, indicates it is finished.
* Be aware that __some code cells might take a while to run__, sometimes 5-10 minutes, depending on the task (installing packages and libraries, training models, etc.).

Let's start by loading some libraries and packages!

In [1]:
%%capture
!pip install -q autogluon==0.8.2

In [2]:
# Load in libraries
import pandas as pd

## 1. <a name="1">Business Problem and ML Problem Description</a>
(<a href="#0">Go to top</a>)

__Business Problem:__ Products from the Amazon Product Catalog cannot be listed for sale because they are missing a relevant piece of information: the Unit Of Measure (count, volume, weight). 

__ML Problem Description:__ Predict the Unit Of Measure (count, volume, weight) Identification (UOMI) for a product from the Amazon Product Catalog. 
> This is a __multiclass classification__ task (3 distinct classes). <br>

The data for this ML problem has 33 feature columns and 1 label column. Examples of features include:


| Feature | Description |
| :---        |    :----  |
| marketplace_id | Marketplace ID.|
| product_type   | Type of product.  |
| item_name | Short item description. |
| product_description   | Long item description.  |
| bullet_point | Bullet point item description. |
| brand   | Brand name.  |
| manufacturer | Manufacturer name. |
| ...   | ...  |
| list_price_value_with_tax   | Price of item including tax.  |
| imgID | ID for image of product. |
| ID   | Product identifier.  |

___
## 2. <a name="2">Importing AutoGluon</a>
(<a href="#0">Go to top</a>)

Now we load the libraries needed to work with our Tabular dataset.

In [3]:
from autogluon.tabular import TabularPredictor, TabularDataset

___
## 3. <a name="3">Getting the Data</a>
(<a href="#0">Go to top</a>)

Let's load the datasets and look at a few data samples.

In [4]:
train = TabularDataset("../../data/uomi-train.csv")
test = TabularDataset("../../data/uomi-test.csv")

In [5]:
# Print size of train set
print(f"Size of training set: {len(train)}")

# Show the first rows of train data
train.head(2)

Size of training set: 28305


|  | ID | marketplace_id | label | product_type | item_name | product_description | bullet_point | brand | manufacturer | part_number | ... | item_dimensions_height | item_dimensions_width | item_dimensions_length | normalized_item_weight | normalized_item_package_weight | list_price_currency | list_price_value | list_price_value_with_tax | imgID | ID_0 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 1633 | 1 | 1 | GROCERY | JELL-O Play Ocean Build + Eat Kit, 6 oz Box |  | One 6 oz. JELL-O Play Ocean Build + Eat Kit | Jell-O Play | Jell-o | 4300008150.0 | ... | 2.625 | 6.625 | 8.5 | 0.023438 | 0.500449 | USD | 3.99 |  | 51sislDjTYL | 9cd726a519754b6bad27be39bc95cac6 |
| 1 | 18103 | 1 | 2 | GROCERY | Crystal Light Pure Variety Pack includes- Rasp... | With no artificial sweeteners, flavors or pres... | Customer Will Receive 6 Boxes Total - 1 Raspbe... | Crystal Light | Crystal Light |  | ... |  |  |  |  | 0.599657 |  |  |  | 41MsGCednqL | 44a997b7ff9f4d2ebd1615ac5f3861ff |


___
## 4. <a name="4">Sampling Data</a>
(<a href="#0">Go to top</a>)

It is good practice to grab a small sample dataset to quickly run AutoGluon before using the full dataset.

In [6]:
# Take a sample of 1000 datapoints for a quick test
train_sample_small = train.sample(n=1000, random_state=1)
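
A plain random sample can shift the class proportions slightly compared with the full dataset. The sketch below compares `DataFrame.sample` with a per-class (stratified) sample drawn through `groupby(...).sample`. It runs on a small synthetic DataFrame, not on the UOMI data: the class counts are made up for illustration, and the helper name `stratified_sample` is hypothetical.

```python
import pandas as pd

# Synthetic stand-in for the train data: 3 imbalanced classes (made-up counts).
toy = pd.DataFrame({
    "label": [0] * 100 + [1] * 300 + [2] * 600,
    "feature": range(1000),
})


def stratified_sample(df, label_col, frac, random_state=1):
    """Sample the same fraction from every class so proportions are preserved."""
    return df.groupby(label_col).sample(frac=frac, random_state=random_state)


# Plain random sample, as used in this notebook (here: 100 of 1000 rows).
random_part = toy.sample(n=100, random_state=1)

# Stratified sample: 10% of each class.
stratified_part = stratified_sample(toy, "label", frac=0.1)

for name, part in [("full", toy), ("random", random_part), ("stratified", stratified_part)]:
    shares = part["label"].value_counts(normalize=True).sort_index().to_dict()
    print(f"{name:>10}: {shares}")
```

For a quick smoke test, the plain random sample used in this notebook is usually enough; stratifying matters more when a class is rare.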

___
## 5. <a name="5">Model Training with AutoGluon</a>
(<a href="#0">Go to top</a>)


We can train a model using AutoGluon with only a single line of code.  All we need to do is tell AutoGluon what column from the dataset we want to predict, and what the train dataset is.

For fast experimentation, we use only the small sample from our train dataset, containing 1000 data points.

__NOTE__: Training on this smaller dataset might still take approx. 3-4 minutes!

In [7]:
# We pass only the train data; AutoGluon handles the train/validation split itself
first_predictor = TabularPredictor(label="label").fit(
    train_data=train_sample_small
) 

No path specified. Models will be saved in: "AutogluonModels/ag-20230706_182206/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230706_182206/"
AutoGluon Version:  0.8.2
Python Version:     3.10.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Mon Apr 24 23:34:06 UTC 2023
Disk Space Avail:   477.39 GB / 528.24 GB (90.4%)
Train Data Rows:    1000
Train Data Columns: 33
Label Column: label
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	3 unique label values:  [2, 0, 1]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 3
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:
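
The log above notes that the problem type was inferred and can be set manually. Below is a minimal sketch of doing so, together with the evaluation metric, assuming AutoGluon 0.8.x is installed. The function name `fit_multiclass` is hypothetical, and the import sits inside the function so that defining it does not require AutoGluon.

```python
def fit_multiclass(train_data, label="label", time_limit=None):
    """Fit a TabularPredictor with problem type and metric stated explicitly.

    Assumes AutoGluon is installed and that `train_data` is a pandas
    DataFrame or TabularDataset containing the `label` column.
    """
    from autogluon.tabular import TabularPredictor  # imported lazily

    predictor = TabularPredictor(
        label=label,
        problem_type="multiclass",  # state the problem type instead of inferring it
        eval_metric="accuracy",     # the metric used in this notebook
    )
    return predictor.fit(train_data=train_data, time_limit=time_limit)


# Usage in this notebook would look like:
# first_predictor = fit_multiclass(train_sample_small)
```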

___
## 6. <a name="6">AutoGluon Training Results</a>
(<a href="#0">Go to top</a>)

Now let's take a look at all the training information AutoGluon provides via its [__leaderboard function__](https://auto.gluon.ai/api/autogluon.task.html?highlight=leaderboard#autogluon.tabular.TabularPredictor.leaderboard). <br/>

In [8]:
first_predictor.leaderboard(silent=True)

|  | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | WeightedEnsemble_L2 | 0.825 | 0.114462 | 21.056885 | 0.000723 | 0.704545 | 2 | True | 14 |
| 1 | CatBoost | 0.82 | 0.040112 | 19.47726 | 0.040112 | 19.47726 | 1 | True | 8 |
| 2 | LightGBM | 0.78 | 0.012597 | 1.798811 | 0.012597 | 1.798811 | 1 | True | 5 |
| 3 | XGBoost | 0.765 | 0.013985 | 1.947053 | 0.013985 | 1.947053 | 1 | True | 11 |
| 4 | LightGBMLarge | 0.76 | 0.028222 | 5.479264 | 0.028222 | 5.479264 | 1 | True | 13 |
| 5 | LightGBMXT | 0.755 | 0.014995 | 1.971939 | 0.014995 | 1.971939 | 1 | True | 4 |
| 6 | NeuralNetTorch | 0.75 | 0.054374 | 2.571996 | 0.054374 | 2.571996 | 1 | True | 12 |
| 7 | NeuralNetFastAI | 0.745 | 0.018688 | 7.281922 | 0.018688 | 7.281922 | 1 | True | 3 |
| 8 | ExtraTreesGini | 0.745 | 0.075681 | 0.765243 | 0.075681 | 0.765243 | 1 | True | 9 |
| 9 | RandomForestGini | 0.735 | 0.077066 | 0.801487 | 0.077066 | 0.801487 | 1 | True | 6 |
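
The `score_val` column reports validation accuracy: the fraction of validation rows whose predicted class matches the true class. A minimal pure-Python sketch of that calculation follows. The `accuracy` helper and the toy label lists are illustrative only, not AutoGluon internals.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy example with the three classes 0, 1, 2 (values are made up).
y_true = [2, 0, 1, 2, 2, 1, 0, 2]
y_pred = [2, 0, 2, 2, 1, 1, 0, 2]

# 6 of the 8 predictions match the true labels.
print(accuracy(y_true, y_pred))
```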


___
## 7. <a name="7">Model Prediction with AutoGluon</a>
(<a href="#0">Go to top</a>)

Now that we trained a model on the train data (that had labels to learn from), let's use the fitted model to predict the labels for the test dataset.

In [9]:
prediction = first_predictor.predict(test)

# Print a few test predictions
print(f"Predictions for the first 20 data points in the test dataset: {prediction.values[0:20]}")

Predictions for the first 20 data points in the test dataset: [2 2 2 2 2 1 2 1 2 2 1 0 2 2 2 1 0 2 1 2]
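
A quick sanity check is to look at how often each class is predicted. The sketch below counts the classes in the 20 predictions printed above, copied by hand into a list. In the notebook itself, calling `prediction.value_counts()` on the returned pandas Series should give the same kind of summary for the whole test set.

```python
from collections import Counter

# First 20 predictions from the output above, copied by hand.
first_20 = [2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 0, 2, 2, 2, 1, 0, 2, 1, 2]

counts = Counter(first_20)
for label in sorted(counts):
    share = counts[label] / len(first_20)
    print(f"class {label}: {counts[label]} predictions ({share:.0%})")
```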


___
## 8. <a name="8">Re-Train (with full train data) and predict again</a>
(<a href="#0">Go to top</a>)

This first submission to the MLU Leaderboard, which uses only a small sample of the train dataset, might not perform best. To improve performance, repeat the process using the full dataset and submit again to see whether the score improves.

In [10]:
# Retrain the model using all training data 
# We let AutoGluon handle the train/validation split directly
# NOTE: We cap the training time to 20 minutes!
second_predictor = TabularPredictor(label="label").fit(
    train_data=train,
    time_limit=60 * 20,  # seconds
)

# Use the trained model to make predictions on the test dataset
prediction = second_predictor.predict(test)

# Print a few test predictions
print(f"Predictions for the first 20 data points in the test dataset: {prediction.values[0:20]}")

No path specified. Models will be saved in: "AutogluonModels/ag-20230706_182304/"
Beginning AutoGluon training ... Time limit = 1200s
AutoGluon will save models to "AutogluonModels/ag-20230706_182304/"
AutoGluon Version:  0.8.2
Python Version:     3.10.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Mon Apr 24 23:34:06 UTC 2023
Disk Space Avail:   477.33 GB / 528.24 GB (90.4%)
Train Data Rows:    28305
Train Data Columns: 33
Label Column: label
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	3 unique label values:  [1, 2, 0]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Train Data Class Count: 3
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator..

[1000]	valid_set's multi_error: 0.1232


	0.878	 = Validation score   (accuracy)
	258.15s	 = Training   runtime
	0.8s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 111.77s of remaining time.
	0.88	 = Validation score   (accuracy)
	0.81s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 1091.25s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230706_182304/")


Predictions for the first 20 data points in the test dataset: [2 2 2 2 2 1 2 1 2 2 1 0 2 2 2 2 2 2 1 2]
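
To see what changed after retraining, the two printed prediction samples can be compared position by position. The sketch below uses the first 20 predictions from each run, copied by hand from the outputs in this notebook. Agreement between two models is not accuracy: the test set has no labels here, so this only shows where the two predictors disagree.

```python
# First 20 test predictions from the sample-trained and the fully trained predictor.
first_run = [2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 0, 2, 2, 2, 1, 0, 2, 1, 2]
second_run = [2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 0, 2, 2, 2, 2, 2, 2, 1, 2]

# Positions where the two predictors disagree.
changed = [i for i, (a, b) in enumerate(zip(first_run, second_run)) if a != b]
agreement = 1 - len(changed) / len(first_run)

print(f"Changed positions: {changed}")
print(f"Agreement on these 20 rows: {agreement:.0%}")
```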


___
## 9. <a name="9">Before You Go</a>
(<a href="#0">Go to top</a>)

After you are done with this demo, clean up the model artifacts by executing the cell below.

__It is always good practice to clean everything when you are done, preventing the disk from getting full.__

In [11]:
!rm -r AutogluonModels
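
The shell command above reports an error if the directory does not exist, for example when the cell is run twice. A minimal Python alternative using only the standard library is sketched below. The helper name `remove_model_dir` is hypothetical, and the default path assumes the models were saved to the default `AutogluonModels` folder shown in the training logs.

```python
import shutil
from pathlib import Path


def remove_model_dir(path="AutogluonModels"):
    """Delete the model directory if it exists; return True if it was removed."""
    target = Path(path)
    if not target.is_dir():
        return False  # nothing to clean up
    shutil.rmtree(target)
    return True


# Usage in this notebook would look like:
# remove_model_dir()
```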