## [AutoGluon](https://autogluon.mxnet.io/) TabularPrediction for a Regression Problem 

In this notebook, we will see how __AutoGluon TabularPrediction__ works on our regression problem to predict the __log_votes__ field of our review dataset, using:
* TabularPrediction from here: https://autogluon.mxnet.io/tutorials/tabular_prediction/index.html

Via a simple __fit()__ call, __AutoGluon TabularPrediction__ can produce a highly-accurate model to predict the values in the __log_votes__ column of our data table based on the rest of the columns’ values. 

__AutoGluon__ with tabular data works for both classification and regression problems. Moreover, we do not need to specify the kind of problem, as this is automatically inferred from the data, and the appropriate performance metric is reported (by default, RMSE for regression and accuracy for classification).
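
To make the idea of problem-type inference concrete, here is a toy heuristic in plain Python. It is only a sketch under simplifying assumptions: `infer_problem_type` is a hypothetical helper, not AutoGluon's actual logic, which is more involved.

```python
def infer_problem_type(labels):
    """Toy heuristic (an illustration, NOT AutoGluon's implementation)."""
    unique = set(labels)
    # Float labels with a fractional part cannot be class ids -> regression
    if any(isinstance(v, float) and not v.is_integer() for v in unique):
        return "regression"
    # Exactly two distinct labels -> binary classification
    if len(unique) == 2:
        return "binary"
    # Anything else is treated as multiclass in this simplified sketch
    return "multiclass"


print(infer_problem_type([0.0, 1.609, 1.098]))  # regression
print(infer_problem_type([0, 1, 1, 0]))          # binary
print(infer_problem_type(["a", "b", "c"]))       # multiclass
```

If the inferred type is not what you want, it is usually safer to state the problem type explicitly than to rely on a heuristic.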

__AutoGluon__ also automatically decides which variables should be represented as integers, which variables should be represented as categorical objects, and handles common issues like missing data and rescaling feature values.

Rather than just a single model, __AutoGluon__ trains multiple models and ensembles them to improve predictive performance. Each type of model has various hyperparameters, which the user would traditionally have to specify. __AutoGluon__ automates this process, including cross-validation, so there is no need to specify separate validation data.
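
To see why cross-validation removes the need for a separate validation set, consider a k-fold split: every row is held out exactly once, so each row gets a prediction from a model that never saw it. The sketch below uses plain Python; `k_fold_indices` is a hypothetical helper for illustration and does not reflect how AutoGluon implements this internally.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs; each row lands in exactly one validation fold."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


for train_idx, val_idx in k_fold_indices(n_rows=10, k=3):
    print(len(train_idx), val_idx)
```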

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __rating:__ Rating of the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
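
Because the target is on a log scale, predictions have to be mapped back to obtain vote counts. A minimal sketch using only Python's `math` module, assuming the natural logarithm; the helper names `to_log_votes` and `from_log_votes` are hypothetical.

```python
import math


def to_log_votes(votes):
    """Forward transform: log(1 + votes), natural logarithm assumed."""
    return math.log1p(votes)


def from_log_votes(log_votes):
    """Inverse transform: exp(log_votes) - 1."""
    return math.expm1(log_votes)


print(to_log_votes(0))                             # 0.0
print(round(from_log_votes(to_log_votes(4)), 6))   # 4.0
```

Using `log1p` and `expm1` rather than `log(1 + x)` and `exp(x) - 1` keeps the round trip accurate for small values.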


### 1. Set up the AutoGluon environment

In [1]:
!pip install bokeh
!pip install --upgrade pip
!pip install mxnet autogluon
# macOS only: installs the OpenMP runtime (libomp), which LightGBM requires there
!brew install libomp

import warnings
warnings.filterwarnings('ignore')


Collecting pip
  Downloading https://files.pythonhosted.org/packages/54/0c/d01aa759fdc501a58f431eb594a17495f15b88da142ce14b5845662c13f3/pip-20.0.2-py2.py3-none-any.whl (1.4MB)
Installing collected packages: pip
  Found existing installation: pip 19.1.1
    Uninstalling pip-19.1.1:
      Successfully uninstalled pip-19.1.1
Successfully installed pip-20.0.2
Collecting autogluon
  Downloading autogluon-0.0.5-py3-none-any.whl (328 kB)
Collecting paramiko>=2.5.0
  Downloading paramiko-2.7.1-py2.py3-none-any.whl (206 kB)
Collecting scipy>=1.3.3
  Downloading scipy-1.4.1-cp37-cp37m-macosx_10_6_intel.whl (28.4 MB)
Collecting boto3==1.9.187
  Downloading boto3-1.9.187-py2.py3-none-any.whl (12

Collecting cffi!=1.11.3,>=1.8
  Downloading cffi-1.14.0-cp37-cp37m-macosx_10_9_x86_64.whl (174 kB)
Collecting heapdict
  Downloading HeapDict-1.0.1-py3-none-any.whl (3.9 kB)
Collecting pycparser
  Downloading pycparser-2.19.tar.gz (158 kB)
Building wheels for collected packages: gluonnlp, psutil, ConfigSpace, toolz, pycparser
  Building wheel for gluonnlp (setup.py) ... done
  Created wheel for gluonnlp: filename=gluonnlp-0.8.1-py3-none-any.whl size=293520 sha256=ccff06b938d447619d6bb6ab2474f2f69a5b37279cfd4fb905a359ce14300759
  Stored in directory: /Users/rlhu/Library/Caches/pip/wheels/04/c6/2f/fed73b370eadabfe8809fc8c19b657a4eb4d71228c7ce17a45
  Building wheel for psutil (setup.py) ... done
  Created wheel for psutil: filename=psutil-5.6.7-cp37-cp37m-macosx_10_7_x86_64.whl size=227611 sha256=a37104a5fd0e55a319d76

### 2. AutoGluon TabularPrediction on raw unprocessed datasets

#### 2.1 Reading and getting the dataset in AutoGluon TabularPrediction friendly format

We first use the __pandas__ library to read our raw, unpreprocessed __review_dataset__ and split it into training and test datasets. Let's take a look at the dataset.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/NLP/review_dataset.csv')

# Hold out 10% of the rows for testing; the remaining 90% are used for training
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("log_votes", axis=1), df["log_votes"],
    test_size=0.10,  # 10% test, 90% training
    shuffle=True     # Shuffle the whole dataset before splitting
)

# Write features and label back together so AutoGluon can load them from CSV
pd.concat([X_train, y_train], axis=1).to_csv('data/NLP/review_dataset_AG_training.csv', index=False)
pd.concat([X_test, y_test], axis=1).to_csv('data/NLP/review_dataset_AG_test.csv', index=False)

X_train.head()

| | reviewText | summary | verified | time | rating |
|---|---|---|---|---|---|
| 21972 | Update 5/2/2011\nOk this software is starting ... | Errors in Mortgage Calculations & Software dis... | True | 1303948800 | 1.0 |
| 6156 | Great software product. I have used this produ... | Great software | False | 1425427200 | 5.0 |
| 39264 | When I tried to download the Turbo Tax program... | Virus?? | True | 1451347200 | 2.0 |
| 35413 | The topo map is outdated as far as road names ... | SE topo | False | 1178668800 | 3.0 |
| 8245 | Ripped me off this last year, though I used th... | Filling out the forms is fine, especially if y... | True | 1426464000 | 2.0 |


#### 2.2 Use AutoGluon TabularPrediction to train a regressor 

AutoGluon can handle a variety of tasks, such as image classification, object detection, and text classification. See [autogluon.task](https://autogluon.mxnet.io/api/autogluon.task.html) for details. Our task in this demo is `TabularPrediction`, which predicts the values in one column of a tabular dataset (classification or regression).


Now, let's load the raw unpreprocessed training and test datasets. 

In [5]:
from autogluon import TabularPrediction as task

train_data = task.Dataset(file_path='data/NLP/review_dataset_AG_training.csv')
test_data = task.Dataset(file_path='data/NLP/review_dataset_AG_test.csv')

# For speed, grab a small subset of the dataset
train_data = train_data.head(1000)

Loaded data from: data/NLP/review_dataset_AG_training.csv | Columns = 6 / 6 | Rows = 49500 -> 49500
Loaded data from: data/NLP/review_dataset_AG_test.csv | Columns = 6 / 6 | Rows = 5500 -> 5500


##### Hyperparameters

To control which models AutoGluon `TabularPrediction` trains, we can set the `hyperparameters` argument of `fit()`. It is a dictionary of key-value pairs. The keys are strings that indicate which ML models to train, including:
- ‘NN’ (neural network), 
- ‘GBM’ (lightGBM boosted trees), 
- ‘CAT’ (CatBoost boosted trees), 
- ‘RF’ (random forest), 
- ‘XT’ (extremely randomized trees), 
- ‘KNN’ (k-nearest neighbors).

The values are dictionaries of hyperparameter settings for each model type. For example, we can define a `hyperparameters` dictionary named `hyp` as:

```
hyp = {'NN': {'num_epochs': 500}, 'GBM': {'num_boost_round': 10000}, 'CAT': {'iterations': 10000}, 'RF': {'n_estimators': 300}, 'XT': {'n_estimators': 300}, 'KNN': {}, 'custom': ['GBM']}
```
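
Since the model keys are plain strings, a typo in a key is easy to miss. The snippet below is a small sanity check in plain Python; `check_hyperparameters` and `KNOWN_MODEL_KEYS` are hypothetical names for illustration, built only from the keys listed above, and are not part of the AutoGluon API.

```python
# Model keys as listed above; 'custom' appears in the example dictionary.
KNOWN_MODEL_KEYS = {"NN", "GBM", "CAT", "RF", "XT", "KNN", "custom"}


def check_hyperparameters(hyp):
    """Hypothetical helper: return the keys that are not in the documented list."""
    for key, settings in hyp.items():
        # Every model entry except 'custom' maps to a dict of settings
        if key != "custom" and not isinstance(settings, dict):
            raise TypeError(f"settings for {key!r} must be a dict, got {type(settings).__name__}")
    return sorted(set(hyp) - KNOWN_MODEL_KEYS)


hyp = {"GBM": {"num_boost_round": 1000}, "CAT": {"iterations": 1000}}
print(check_hyperparameters(hyp))          # []
print(check_hyperparameters({"XGB": {}}))  # ['XGB']
```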

Let's define the hyperparameters for this demo.

In [7]:
# For speed, change the default hyperparameters
hyp = {'GBM': {'num_boost_round': 1000}, 'CAT': {'iterations': 1000}}

##### Auto_stack

`auto_stack` controls whether AutoGluon automatically attempts to select `num_bagging_folds` and `stack_ensemble_levels` based on data properties. This can increase training time by up to 20x, but can produce much better results. It can also increase inference time by up to 20x.

Now we are ready to define and train our model using `fit()`. By default, it trains neural networks and various types of tree ensembles; here, the `hyperparameters` argument restricts training to LightGBM and CatBoost.

In [10]:
auto_stack = True 

model = task.fit(train_data = train_data, label = 'log_votes', 
                 eval_metric = 'r2', auto_stack = auto_stack, hyperparameters = hyp)


No output_directory specified. Models will be saved in: AutogluonModels/ag-20200210_195926/
Beginning AutoGluon training ...
AutoGluon will save models to AutogluonModels/ag-20200210_195926/
Train Data Rows:    1000
Train Data Columns: 6
Preprocessing data ...
Here are the first 10 unique label values in your data:  [0.         1.60943791 1.09861229 1.79175947 2.30258509 3.4657359
 1.38629436 2.63905733 1.94591015 3.49650756]
AutoGluon infers your prediction problem is: regression  (because dtype of label-column == float and label-values can't be converted to int)
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Feature Generator processed 1000 data points with 466 features
Original Features:
	object features: 2
	bool features: 1
	int features: 1
	float features: 1
Generated Features:
	int features: 461
All Features:
	object features: 2
	bool features: 1
	int features: 462
	float 

#### 2.3 Evaluate performance with TabularPrediction

Let's now use our trained `model` to make predictions on the test dataset using `predict()`. 

Then we evaluate performance using [`evaluate_predictions()`](https://autogluon.mxnet.io/api/autogluon.task.html?highlight=evaluate_predictions#autogluon.task.tabular_prediction.TabularPredictor.evaluate_predictions). Here we set `auxiliary_metrics` to `True`, which tells the predictor to compute other tabular metrics in addition to the evaluation metric (`r2` here).

In [13]:
y_pred = model.predict(test_data)
y_test = test_data['log_votes']
performance = model.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)


Evaluation: r2 on test data: 0.320081
Evaluations on test data:
{
    "r2": 0.3200810218276856,
    "mean_absolute_error": 0.5421254660779772,
    "explained_variance_score": 0.32412104656348506,
    "r2_score": 0.3200810218276856,
    "pearson_correlation": 0.5837249356520682,
    "mean_squared_error": 0.6419765336436036,
    "median_absolute_error": 0.274038471246732
}
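
To make these numbers easier to interpret, the sketch below computes several of the same metrics from their standard definitions in plain Python. `regression_metrics` is a hypothetical helper written for illustration, not the code AutoGluon uses, and the toy inputs are made up.

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    # Total sum of squares: variation of the labels around their mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - (mse * n) / ss_tot,  # 1 - SS_res / SS_tot
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "median_absolute_error": statistics.median(abs(e) for e in errors),
    }


# Toy values for illustration only
print(regression_metrics([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 2.0]))
```

RMSE is the square root of the mean squared error, so the reported `mean_squared_error` of about 0.642 corresponds to an RMSE of roughly 0.80 on the `log_votes` scale.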


Moreover, for a summary of what happened during `fit()`, we can call `fit_summary()`, which returns the details of the trained models and ensembles, as shown below. It may also generate summary plots in a new window.

In [16]:
results = model.fit_summary()

*** Summary of fit() ***
Number of models trained: 6
Types of models trained: 
{'WeightedEnsembleModel', 'StackerEnsembleModel'}
Validation performance of individual models: {'LightGBMRegressor_STACKER_l0': 0.28411560144152903, 'CatboostRegressor_STACKER_l0': 0.3194285529330769, 'weighted_ensemble_k0_l1': 0.32083864852375077, 'LightGBMRegressor_STACKER_l1': 0.3002416757259313, 'CatboostRegressor_STACKER_l1': 0.31119809889680494, 'weighted_ensemble_k0_l2': 0.3257069698656295}
Best model (based on validation performance): weighted_ensemble_k0_l2
Hyperparameter-tuning used: False
Bagging used: True  (with 10 folds)
Stack-ensembling used: True  (with 1 levels)
User-specified hyperparameters:
{'GBM': {'num_boost_round': 1000}, 'CAT': {'iterations': 1000}}
Plot summary of models saved to file: SummaryOfModels.html
*** End of fit() summary ***


From this summary, we can see that __AutoGluon__ trained six models: LightGBM and CatBoost regressors at two stacking layers (`_l0` and `_l1`), each followed by a weighted ensemble. The summary also reports how well each model performed on the held-out validation data; here, `weighted_ensemble_k0_l2` scored best.
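
The "best model" line can be reproduced from the validation scores in the summary. A short sketch in plain Python, with the scores copied from the output above (higher is better for `r2`):

```python
validation_scores = {
    "LightGBMRegressor_STACKER_l0": 0.28411560144152903,
    "CatboostRegressor_STACKER_l0": 0.3194285529330769,
    "weighted_ensemble_k0_l1": 0.32083864852375077,
    "LightGBMRegressor_STACKER_l1": 0.3002416757259313,
    "CatboostRegressor_STACKER_l1": 0.31119809889680494,
    "weighted_ensemble_k0_l2": 0.3257069698656295,
}

# Rank models from best to worst validation score
ranked = sorted(validation_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{score:.4f}  {name}")

best_model = ranked[0][0]
print(best_model)  # weighted_ensemble_k0_l2
```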

We can also view what properties __AutoGluon__ automatically inferred about our prediction task, along with more details on feature preprocessing:

In [17]:
print("AutoGluon infers problem type is: ", model.problem_type)
print()
print("AutoGluon categorized the features as: ", model.feature_types)


AutoGluon infers problem type is:  regression

AutoGluon categorized the features as:  {'nlp': ['reviewText', 'summary'], 'vectorizers': ['__nlp__.10', '__nlp__.able', '__nlp__.able to', '__nlp__.about', '__nlp__.after', '__nlp__.again', '__nlp__.all', '__nlp__.all the', '__nlp__.also', '__nlp__.always', '__nlp__.am', '__nlp__.amazon', '__nlp__.an', '__nlp__.and', '__nlp__.and have', '__nlp__.and it', '__nlp__.and the', '__nlp__.another', '__nlp__.any', '__nlp__.anything', '__nlp__.are', '__nlp__.around', '__nlp__.as', '__nlp__.as well', '__nlp__.at', '__nlp__.at all', '__nlp__.at the', '__nlp__.back', '__nlp__.back to', '__nlp__.bad', '__nlp__.be', '__nlp__.because', '__nlp__.been', '__nlp__.been using', '__nlp__.before', '__nlp__.being', '__nlp__.best', '__nlp__.better', '__nlp__.bit', '__nlp__.bought', '__nlp__.bought this', '__nlp__.business', '__nlp__.but', '__nlp__.but it', '__nlp__.buy', '__nlp__.by', '__nlp__.can', '__nlp__.can be', '__nlp__.company', '__nlp__.computer', '__nlp
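
Feature names such as `__nlp__.able` and `__nlp__.able to` suggest that the text columns are expanded into counts of single words and word pairs (n-grams). The sketch below imitates that idea in plain Python. It assumes simple lowercase whitespace tokenization, which is an assumption and not necessarily what AutoGluon's vectorizer does, and `ngram_counts` is a hypothetical helper.

```python
from collections import Counter


def ngram_counts(text, max_n=2):
    """Count word n-grams (n = 1..max_n), prefixed like the feature names above."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts["__nlp__." + " ".join(tokens[i:i + n])] += 1
    return counts


features = ngram_counts("I was able to install it and it works")
print(features["__nlp__.able to"])  # 1
print(features["__nlp__.it"])       # 2
```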

### 3. Summary: AutoGluon TabularPrediction

With just a few lines of code, __AutoGluon TabularPrediction__ should be able to achieve strong predictive performance on your datasets, as long as your tabular datasets are stored in a common format like .csv.

**Note**: The code below can be very computationally-intensive!

In [7]:
## AutoGluon TabularPrediction run on the full datasets!
## WARNING: The code below can be very computationally-intensive!

# from autogluon import TabularPrediction as task
# train_data = task.Dataset(file_path='data/NLP/review_dataset_AG_training.csv')
# test_data = task.Dataset(file_path='data/NLP/review_dataset_AG_test.csv')
# predictor = task.fit(train_data=train_data, label='log_votes', eval_metric = 'r2')
# performance = predictor.evaluate(test_data)


In [None]:
# results = predictor.fit_summary()

In [None]:
# print("AutoGluon infers problem type is: ", predictor.problem_type)
# print("AutoGluon categorized the features as: ", predictor.feature_types)