<a href="https://colab.research.google.com/github/veronicalimpooikhoon/ITI103/blob/main/AutoML_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Automated machine learning (AutoML)**

Automated machine learning (AutoML) can be a huge time saver, especially when the dataset is large or the task is a standard classification or regression problem. One open-source AutoML tool is auto-sklearn. The popular scikit-learn (sklearn) library is widely used for building machine learning models, but with sklearn it is up to the user to choose the algorithm and tune its hyperparameters. Auto-sklearn automates these steps. Along with data preparation and model building, it also learns from models that performed well on similar datasets and can automatically build ensemble models for better accuracy. In this *session*, we will see how to use auto-sklearn for classification and regression problems.
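
For contrast, the manual scikit-learn workflow mentioned above could look like the following sketch. The choice of logistic regression, standard scaling and the grid of `C` values are illustrative assumptions made here, not part of the original session.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Manual choices: the algorithm (logistic regression), the preprocessing
# (standard scaling) and the hyperparameter grid are all picked by hand.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validated grid search over the hand-written grid
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best C:", search.best_params_["logisticregression__C"])
print("Test accuracy:", search.score(X_test, y_test))
```

Every decision in this sketch (model family, scaler, grid) is something auto-sklearn searches over automatically within its time budget.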

**Install libraries: auto-sklearn and PipelineProfiler**

In [None]:
!pip install auto-sklearn
!pip install PipelineProfiler
!pip install --upgrade scipy
#import scipy
#print(scipy.__version__)

# You may need to restart the runtime after the installation

# **Import all the libraries**

Note: You may need to click "restart runtime" if either of the following errors occurs.
- IncorrectPackageVersionError: found 'scipy' version 1.4.1 but requires scipy version >=1.7.0
- Auto-sklearn not found
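
To see which versions are actually installed before and after restarting, a small check along these lines may help. It is a sketch using only the standard library; the helper name `installed_version` is hypothetical and the list of packages is an assumption based on the errors above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used by pip
for name in ("scipy", "auto-sklearn", "scikit-learn"):
    print(name, "->", installed_version(name) or "not installed")
```

This reports what is installed on disk. A module that was imported before the upgrade stays loaded in the running kernel at its old version, which is why restarting the runtime is needed.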

In [1]:
import sklearn
from pprint import pprint

import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

import autosklearn.classification

# **AutoML- Classifier**

1. Split the dataset into features and label, then into training and test sets. The Breast Cancer dataset will be used.

In [6]:
X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
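
The call above does not pass `test_size`, so scikit-learn's default of holding out 25% of the samples applies. The following sketch repeats the split with direct imports and prints the resulting shapes; the variable names `split_shapes` and `test_fraction` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 569 samples with 30 features each; the default split keeps 75% for training
split_shapes = {"train": X_train.shape, "test": X_test.shape}
print(split_shapes)  # {'train': (426, 30), 'test': (143, 30)}

test_fraction = len(X_test) / len(X)
print(f"Held-out fraction: {test_fraction:.2f}")
```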

2. Building the classification model

`time_left_for_this_task` is the total time, in seconds, that the user allows for the model search, and `per_run_time_limit` caps the time for a single model fit. Here I have limited the search to 60 seconds with at most 30 seconds per model, but you can choose any budget you wish.

Note: Since we are using auto-sklearn, we need not specify the name of the algorithm or its hyperparameters.
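
As a rough feel for how the two time settings interact, the arithmetic below gives an upper bound on how many model fits could each use their full per-run limit. It is a back-of-the-envelope sketch with a hypothetical helper name; it assumes runs execute one after another and ignores auto-sklearn's own overhead, so real runs will differ.

```python
def max_full_length_runs(time_left_for_this_task, per_run_time_limit):
    """Upper bound on the number of runs that can each use the full per-run limit."""
    if per_run_time_limit <= 0:
        raise ValueError("per_run_time_limit must be positive")
    return time_left_for_this_task // per_run_time_limit


# Classifier settings used in this section: 60 s total, 30 s per run
print(max_full_length_runs(60, 30))   # 2
# Regressor settings used in the exercise later on: 120 s total, 30 s per run
print(max_full_length_runs(120, 30))  # 4
```

Most fits on a small dataset finish well before the per-run limit, so many more models are usually tried than this bound suggests.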

In [9]:
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,  # total search budget in seconds
    per_run_time_limit=30,       # cap for a single model fit in seconds
    # tmp_folder='/tmp/autosklearn_classification_example_tmp',
)

automl.fit(X_train, y_train, dataset_name='breast_cancer')

[ERROR] [2022-07-04 19:21:35,583:Client-AutoML(1):breast_cancer] Dummy prediction failed with run state StatusType.CRASHED and additional output: {'error': 'Result queue is empty', 'exit_status': "<class 'pynisher.limit_function_call.AnythingException'>", 'subprocess_stdout': '', 'subprocess_stderr': 'Process pynisher function call:\nTraceback (most recent call last):\n  File "/usr/local/Cellar/python@3.9/3.9.7_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap\n    self.run()\n  File "/usr/local/Cellar/python@3.9/3.9.7_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 108, in run\n    self._target(*self._args, **self._kwargs)\n  File "/usr/local/lib/python3.9/site-packages/pynisher/limit_function_call.py", line 108, in subprocess_func\n    resource.setrlimit(resource.RLIMIT_AS, (mem_in_b, mem_in_b))\nValueError: current limit exceeds maximum limit\n', 'exitcode': 1, 'configuration_origi

ValueError: Dummy prediction failed with run state StatusType.CRASHED and additional output: {'error': 'Result queue is empty', 'exit_status': "<class 'pynisher.limit_function_call.AnythingException'>", 'subprocess_stdout': '', 'subprocess_stderr': 'Process pynisher function call:\nTraceback (most recent call last):\n  File "/usr/local/Cellar/python@3.9/3.9.7_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap\n    self.run()\n  File "/usr/local/Cellar/python@3.9/3.9.7_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 108, in run\n    self._target(*self._args, **self._kwargs)\n  File "/usr/local/lib/python3.9/site-packages/pynisher/limit_function_call.py", line 108, in subprocess_func\n    resource.setrlimit(resource.RLIMIT_AS, (mem_in_b, mem_in_b))\nValueError: current limit exceeds maximum limit\n', 'exitcode': 1, 'configuration_origin': 'DUMMY'}.
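
The traceback above ends in `resource.setrlimit(resource.RLIMIT_AS, ...)` raising `ValueError: current limit exceeds maximum limit`. The following diagnostic sketch only reads the current address-space limits so you can see what the platform allows. It assumes a Unix-like system where the standard-library `resource` module is available, and the helper `describe_limit` is hypothetical.

```python
import resource


def describe_limit(value):
    """Format an rlimit value as MiB, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 2**20:.0f} MiB"


# Soft and hard limits on the process address space (read-only check)
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("RLIMIT_AS soft:", describe_limit(soft))
print("RLIMIT_AS hard:", describe_limit(hard))
```

If the limit cannot be changed on your platform, auto-sklearn's `memory_limit` constructor argument is the setting to look at. As far as I understand its documentation, passing `memory_limit=None` disables the memory limit, but verify this for your installed version before relying on it.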

3. View the models found by auto-sklearn (Classifier)

In [10]:
print(automl.leaderboard())

AttributeError: 'AutoMLClassifier' object has no attribute 'runhistory_'

Print the final ensemble constructed by auto-sklearn (Classifier)

In [None]:
pprint(automl.show_models())

The output lists each model in the final ensemble together with its statistics; in the original run, 7 algorithms were evaluated. Let us now check the accuracy of the model on the test set.

In [None]:
predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))

A high accuracy here is a good result, given that we have not scaled or pre-processed the data ourselves and the search was allowed to run for only 60 seconds. Thus, we have built a classification model using auto-sklearn.
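
Whether an accuracy score is good depends on what a trivial model would achieve. The sketch below computes accuracy by hand and compares it with a majority-class baseline on made-up toy labels; both helper names are hypothetical and the labels are not results from this notebook.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Toy labels for illustration only
toy_true = [1, 0, 1, 1, 0, 1, 1, 0]
toy_pred = [1, 0, 1, 0, 0, 1, 1, 1]

print(accuracy(toy_true, toy_pred))          # 0.75
print(majority_baseline_accuracy(toy_true))  # 0.625
```

In the notebook, calling `majority_baseline_accuracy(list(y_test))` gives the reference point against which the auto-sklearn accuracy can be judged.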

Inspect the best-performing pipelines with PipelineProfiler

In [None]:
import PipelineProfiler
profiler_data = PipelineProfiler.import_autosklearn(automl)
PipelineProfiler.plot_pipeline_matrix(profiler_data)

# **AutoML- Regression**

1. Import the Boston dataset and separate it into features and label. Splitting into training and test sets is left as the exercise below.

In [8]:
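from sklearn.datasets import load_boston
import pandas as pd

# Load the Boston housing data and wrap features and target in DataFrames
boston_data = load_boston()
features = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
target = pd.DataFrame(boston_data.target, columns=['TARGET'])
# Combined view of features and target, handy for inspection
dataset = pd.concat([features, target], axis=1)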

#from sklearn.model_selection import train_test_split
#xtrain,xtest,ytrain,ytest=train_test_split(features,target,test_size=0.2)


**Exercise**

Import `train_test_split` and split the data 80:20 into training and test sets.

<details><summary>Click here for answer</summary>
<br/>
    
```python
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(features,target,test_size=0.2)
    
```
</details>

In [9]:
###complete the code below###

2. Building the regressor model

**Exercise**

Build your auto-sklearn regressor with a maximum total search time of 2 minutes, a maximum of 30 seconds per model, and mean absolute error as the performance metric.

Next, fit your model on the training data.

<details><summary>Click here for answer</summary>
<br/>
    
```python
import autosklearn.metrics
from autosklearn.regression import AutoSklearnRegressor
regressor = AutoSklearnRegressor(time_left_for_this_task=120, per_run_time_limit=30,
                                 metric=autosklearn.metrics.mean_absolute_error)

regressor.fit(xtrain, ytrain)
```
</details>

In [None]:
###complete the code below###


3. View the models found by auto-sklearn (Regression)

**Exercise**

Print the leaderboard for the regressor.

Hint: You may refer to leaderboard under AutoML-Classifier.

<details><summary>Click here for answer</summary>
<br/>
    
```python
print(regressor.leaderboard())
    
```
</details>

In [None]:
###complete the code below###

4. Print the final ensemble constructed by auto-sklearn (Regression)

**Exercise**

Show the final ensemble constructed by auto-sklearn (Regression).
<details><summary>Click here for answer</summary>
<br/>
    
```python
pprint(regressor.show_models(), indent=4)
    
```
</details>

In [None]:
###complete the code below###


Get the score of the final ensemble

**Exercise**

Calculate the mean absolute error for the testing set.
<details><summary>Click here for answer</summary>
<br/>
    
```python
from sklearn.metrics import mean_absolute_error

print(regressor.sprint_statistics())
pred = regressor.predict(xtest)
mae = mean_absolute_error(ytest, pred)
print("MAE:", mae)
    
```
</details>
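
For reference, the mean absolute error is the average of the absolute differences between true and predicted values. The sketch below computes it by hand on made-up numbers; the helper name and the values are illustrative assumptions, not results from this notebook.

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return total / len(y_true)


# Toy values for illustration only
toy_true = [24.0, 21.5, 30.0, 18.0]
toy_pred = [22.0, 23.0, 29.0, 20.5]

# Absolute errors are 2.0, 1.5, 1.0 and 2.5, so the mean is 7.0 / 4
print(mean_absolute_error_by_hand(toy_true, toy_pred))  # 1.75
```

A lower value is better, and the result is in the same units as the target.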

In [None]:
###complete the code below###


Inspect the best-performing pipelines with PipelineProfiler

**Exercise**

Plot the PipelineProfiler matrix for the regressor.

Hint: You may refer to PipelineProfiler matrix under classifier.

<details><summary>Click here for answer</summary>
<br/>
    
```python
import PipelineProfiler
profiler_data = PipelineProfiler.import_autosklearn(regressor)
PipelineProfiler.plot_pipeline_matrix(profiler_data)
    
```
</details>

In [None]:
###complete the code below###
