## Exploring the capabilities of the AutoML package from MLJAR

---

Based on https://github.com/mljar/mljar-supervised

### Supported evaluation metrics (eval_metric argument in AutoML())
---

    for binary classification: logloss, auc, f1, average_precision, accuracy - default is logloss
    for multiclass classification: logloss, f1, accuracy - default is logloss
    for regression: rmse, mse, mae, r2, mape, spearman, pearson - default is rmse
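
A minimal sketch of picking a metric explicitly, assuming the list above is complete and that the first name in each row is the default. The helper names `resolve_eval_metric` and `make_automl` are hypothetical, and `make_automl` needs mljar-supervised installed, so it is only defined here and not called.

```python
# Metric names copied from the list above; the first entry of each list is the default.
SUPPORTED_METRICS = {
    "binary_classification": ["logloss", "auc", "f1", "average_precision", "accuracy"],
    "multiclass_classification": ["logloss", "f1", "accuracy"],
    "regression": ["rmse", "mse", "mae", "r2", "mape", "spearman", "pearson"],
}


def resolve_eval_metric(task, metric=None):
    """Return a metric name that is listed for the task, or the task's default."""
    allowed = SUPPORTED_METRICS[task]
    if metric is None:
        return allowed[0]
    if metric not in allowed:
        raise ValueError(f"{metric!r} is not listed for {task}: {allowed}")
    return metric


def make_automl(task, metric=None):
    """Build an AutoML object with an explicit metric (requires mljar-supervised)."""
    from supervised.automl import AutoML  # imported lazily so the helper above runs without it

    return AutoML(mode="Explain", ml_task=task, eval_metric=resolve_eval_metric(task, metric))


print(resolve_eval_metric("regression"))         # rmse
print(resolve_eval_metric("regression", "mae"))  # mae
```

For the Moscow data used below, `make_automl("regression", "mae")` would be one way to swap rmse for mae.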



Let's now use data on flats in Moscow and see what a fully automatic setup gives.

## Loading libraries

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 500)  # show all columns instead of a truncated view
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML
from tqdm import tqdm

pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.


## Loading data and preparing X, y

In [21]:
df_train = pd.read_csv(
    "./moskwa/train_property.csv"
)

In [22]:
df_train.sample(3)

Unnamed: 0.1,Unnamed: 0,breadcrumbs,date,geo_block,owner,price,Security:,Building type:,Object type:,Ad type:,Commission agent:,Construction phase:,Housing class:,Elevator:,Bathroom type:,Balcony type:,Mortgage possible:,The view from the window:,Garbage chute:,Repair:,Fridge:,Phone:,Furniture:,Free layout:,It is possible to bargain:,Floor covering:,Room type:,Internet:,Kitchen furniture:,TV:,Washing machine:,Foundation type:,Overlap type:,Type of the building:,Playground:,Class:
18002,18002,"['Москва', 'Черёмушки', 'м. Профсоюзная', 'МЦК...","['16 мая', '3', '(+1 за сегодня)']","['г. Москва', 'Черёмушки', 'г. Москва', 'Черём...",[],11.627275,,Monolithic,apartments,secondary,no fee,,,,,,,,,,,,,,,,,,,,,,,,,
12824,12824,"['Москва', 'Богородское', 'м. Бульвар Рокоссов...","['9 мая', '2', '(+1 за сегодня)', 'Обновлено 1...","['г. Москва', 'Богородское', 'г. Москва', 'Бог...",[],9.194868,,Monolithic,apartments,secondary,no fee,,,,,,,,,,,,,,,,,,,,,,,,,
21975,21975,"['Москва', 'Ярославский', 'МЦК Ростокино', 'ул...","['16 января', '17', '(+1 за сегодня)', 'Обновл...","['г. Москва', 'Ярославский', 'ул Красная Сосна...",[],11.40454,provided,Monolithic,flat,new building,no fee,Playground,Comfort class,yes,,,yes,,yes,,,,,,,,,,,,,,,,,


In [23]:
df_test = pd.read_csv(
    "./moskwa/test_property.csv"
)

In [24]:
df_test.sample(3)

Unnamed: 0.1,Unnamed: 0,breadcrumbs,date,geo_block,owner,Security:,Building type:,Object type:,Ad type:,Commission agent:,Construction phase:,Housing class:,Elevator:,Bathroom type:,Balcony type:,Mortgage possible:,The view from the window:,Garbage chute:,Repair:,Fridge:,Phone:,Furniture:,Free layout:,It is possible to bargain:,Floor covering:,Room type:,Internet:,Kitchen furniture:,TV:,Washing machine:,Foundation type:,Overlap type:,Type of the building:,Playground:,Class:,id
15941,15941,"['Москва', 'МЦК Локомотив']","['23 апреля', '4', '(+2 за сегодня)', 'Обновле...","['г. Москва', 'г. Москва']",[],video surveillance,Monolithic,apartments,from the developer,no fee,Finish,Business class,,,,,,,,,,,,,,,,,,,,,,,,15941
20749,20749,"['Москва', 'Филёвский Парк', 'м. Пятницкое шос...","['30 октября 2017', '89', '(+2 за сегодня)', '...","['г. Москва', 'Филёвский Парк', 'г. Москва', '...",[],provided,Monolithic,flat,from the developer,no fee,Finish,Comfort class,,,,yes,,,,,,,,,,,,,,,,,,,,20749
2984,2984,"['Москва', 'Новая Москва', 'п. Сосенское', 'п....","['15 мая', '4', '(+1 за сегодня)']","['Новая Москва', 'п. Сосенское', 'п. Коммунарк...",[],provided,Monolithic,flat,from the developer,no fee,Finish,Comfort class,yes,two,,yes,,,,,,,,,,,,,,,,,,,,2984


In [25]:
df_test.shape, df_train.shape

((22667, 36), (45694, 36))

In [14]:
columns_ = list(df_train.columns)
columns_

['Unnamed: 0',
 'breadcrumbs',
 'date',
 'geo_block',
 'owner',
 'Security:',
 'Building type:',
 'Object type:',
 'Ad type:',
 'Commission agent:',
 'Construction phase:',
 'Housing class:',
 'Elevator:',
 'Bathroom type:',
 'Balcony type:',
 'Mortgage possible:',
 'The view from the window:',
 'Garbage chute:',
 'Repair:',
 'Fridge:',
 'Phone:',
 'Furniture:',
 'Free layout:',
 'It is possible to bargain:',
 'Floor covering:',
 'Room type:',
 'Internet:',
 'Kitchen furniture:',
 'TV:',
 'Washing machine:',
 'Foundation type:',
 'Overlap type:',
 'Type of the building:',
 'Playground:',
 'Class:',
 'id']

In [None]:
# Most columns are text, so let's factorize them to get numeric features for X

In [26]:
df = pd.read_csv(
    "./moskwa/train_property.csv"
)

In [29]:
cat_feats = [x for x in df.columns if ":" in x]
for feat in tqdm(cat_feats):
    df["{}_cat".format(feat)] = df[feat].factorize()[0]

100%|██████████| 31/31 [00:00<00:00, 90.05it/s] 
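
A small illustration of what `factorize()` does with missing values, which is worth knowing before reading the statistics that follow. The toy values are assumptions for the demo ("Panel" does not come from the dataset).

```python
import pandas as pd

# Toy column with a missing value, mimicking the sparse text columns in this dataset.
col = pd.Series(["Monolithic", None, "Panel", "Monolithic"])

# factorize() returns integer codes plus the unique values they refer to.
codes, uniques = col.factorize()

print(codes.tolist())  # [0, -1, 1, 0]
print(list(uniques))   # ['Monolithic', 'Panel']
```

Missing entries get the code -1, so a column that is almost entirely empty ends up with a mean close to -1 after factorizing.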


In [30]:
df.shape

(45694, 67)

In [32]:
df.describe()

Unnamed: 0.1,Unnamed: 0,price,Unnamed: 0_cat,Security:_cat,Building type:_cat,Object type:_cat,Ad type:_cat,Commission agent:_cat,Construction phase:_cat,Housing class:_cat,Elevator:_cat,Bathroom type:_cat,Balcony type:_cat,Mortgage possible:_cat,The view from the window:_cat,Garbage chute:_cat,Repair:_cat,Fridge:_cat,Phone:_cat,Furniture:_cat,Free layout:_cat,It is possible to bargain:_cat,Floor covering:_cat,Room type:_cat,Internet:_cat,Kitchen furniture:_cat,TV:_cat,Washing machine:_cat,Foundation type:_cat,Overlap type:_cat,Type of the building:_cat,Playground:_cat,Class:_cat
count,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0,45694.0
mean,22846.5,17.422577,22846.5,2.638727,1.992778,0.493785,0.769204,-0.007069,1.090953,0.846873,-0.41303,-0.550794,-0.626253,-0.58198,-0.513656,-0.577297,-0.846457,-0.992909,-0.985118,-0.992866,-0.980938,-0.99326,-0.982689,-0.981748,-0.981639,-0.995273,-0.996608,-0.998008,-0.986935,-0.991946,-0.9921,-0.995711,-0.998249
std,13190.865937,38.332439,13190.865937,7.170444,0.686521,0.565491,0.693,0.083779,1.612602,1.139299,0.492384,0.839044,0.736021,0.493239,0.759319,0.493994,0.455172,0.083908,0.12108,0.084165,0.136743,0.081824,0.175871,0.161432,0.134255,0.068592,0.058144,0.044582,0.200081,0.094845,0.090247,0.065354,0.041806
min,0.0,0.820018,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,11423.25,7.173917,11423.25,-1.0,2.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
50%,22846.5,9.91,22846.5,2.0,2.0,0.0,1.0,0.0,1.0,1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
75%,34269.75,15.405717,34269.75,2.0,2.0,1.0,1.0,0.0,2.0,2.0,0.0,0.0,-1.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
max,45693.0,3000.000015,45693.0,108.0,8.0,3.0,2.0,0.0,6.0,3.0,0.0,2.0,4.0,0.0,1.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,0.0,0.0,0.0,0.0,5.0,3.0,1.0,0.0,0.0


In [33]:
num_feats = [x for x in df.columns if "_cat" in x]
num_feats

['Unnamed: 0_cat',
 'Security:_cat',
 'Building type:_cat',
 'Object type:_cat',
 'Ad type:_cat',
 'Commission agent:_cat',
 'Construction phase:_cat',
 'Housing class:_cat',
 'Elevator:_cat',
 'Bathroom type:_cat',
 'Balcony type:_cat',
 'Mortgage possible:_cat',
 'The view from the window:_cat',
 'Garbage chute:_cat',
 'Repair:_cat',
 'Fridge:_cat',
 'Phone:_cat',
 'Furniture:_cat',
 'Free layout:_cat',
 'It is possible to bargain:_cat',
 'Floor covering:_cat',
 'Room type:_cat',
 'Internet:_cat',
 'Kitchen furniture:_cat',
 'TV:_cat',
 'Washing machine:_cat',
 'Foundation type:_cat',
 'Overlap type:_cat',
 'Type of the building:_cat',
 'Playground:_cat',
 'Class:_cat']
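
The list includes `Unnamed: 0_cat` because the test `":" in x` also matches the column `Unnamed: 0`, which looks like a saved row index (it runs from 0 to 45693 in `df.describe()`). Whether to keep it is a judgment call. A sketch of one alternative, assuming only the attribute columns whose names end with a colon are wanted; the toy frame is a stand-in for the real data.

```python
import pandas as pd

# Tiny stand-in frame; only the column names matter for this check.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "price": [11.6, 9.2],
    "Building type:": ["Monolithic", "Monolithic"],
    "Elevator:": ["yes", None],
})

# Keep columns whose name ends with ":" instead of merely containing it.
cat_feats = [c for c in toy.columns if c.endswith(":")]

print(cat_feats)  # ['Building type:', 'Elevator:']
```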

In [34]:
X_train, X_test, y_train, y_test = train_test_split(
    df[num_feats], df["price"], test_size=0.25
)

In [36]:
X_train.shape

(34270, 31)

In [38]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((34270, 31), (11424, 31), (34270,), (11424,))

Let's see what AutoML gives us without even log-transforming the price.

In [39]:
# automl = AutoML()
automl = AutoML(mode="Explain")
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)

Linear algorithm was disabled.
AutoML directory: AutoML_1
The task is regression with evaluation metric rmse
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble availabe models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 2 models
1_Baseline rmse 36.241518 trained in 1.26 seconds
2_DecisionTree rmse 31.822292 trained in 28.07 seconds
* Step default_algorithms will try to check up to 3 models


ntree_limit is deprecated, use `iteration_range` or model slicing instead.
ntree_limit is deprecated, use `iteration_range` or model slicing instead.


3_Default_Xgboost rmse 26.313265 trained in 16.53 seconds
4_Default_NeuralNetwork rmse 28.971193 trained in 6.72 seconds
5_Default_RandomForest rmse 29.087302 trained in 27.68 seconds
* Step ensemble will try to check up to 1 model
Ensemble rmse 26.313265 trained in 0.63 seconds


An input array is constant; the correlation coefficent is not defined.


AutoML fit time: 126.71 seconds
AutoML best model: 3_Default_Xgboost


ntree_limit is deprecated, use `iteration_range` or model slicing instead.
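
The run above trains on the raw price. A minimal sketch of the log-target variant mentioned earlier, assuming `np.log1p` / `np.expm1` as the transform pair. `MeanModel` is a hypothetical stand-in so the sketch runs without mljar-supervised; any object with `fit` and `predict` would work in its place.

```python
import numpy as np


def fit_predict_log_target(model, X_train, y_train, X_test):
    """Fit on log1p(price) and map the predictions back to the price scale."""
    model.fit(X_train, np.log1p(y_train))
    return np.expm1(model.predict(X_test))


def rmse(y_true, y_pred):
    """Root mean squared error on the original price scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


class MeanModel:
    """Hypothetical stand-in that always predicts the mean of its training target."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


# Prices close to the min, median and max reported by df.describe().
y = np.array([0.82, 9.91, 3000.0])
X = np.zeros((3, 1))

preds = fit_predict_log_target(MeanModel(), X, y, X)
print(preds, rmse(y, preds))
```

With the notebook's objects this would read `fit_predict_log_target(AutoML(mode="Explain"), X_train, y_train, X_test)`. Note that the rmse printed in the AutoML log would then be on the log scale, so it is not directly comparable to the numbers above.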




# Excellent automatic checks of tabular data across typical models
----


In [None]:
# This is a test of the MLJAR tools

In [None]:
# Binary classification: check that the libraries are installed and load correctly

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.


In [4]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv",
    skipinitialspace=True,
)
X_train, X_test, y_train, y_test = train_test_split(
    df[df.columns[:-1]], df["income"], test_size=0.25
)

# automl = AutoML()
automl = AutoML(mode="Explain")
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)

Linear algorithm was disabled.
AutoML directory: AutoML_2
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble availabe models


Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.


AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 2 models
1_Baseline logloss 0.551796 trained in 4.31 seconds
2_DecisionTree logloss 0.367442 trained in 50.73 seconds
* Step default_algorithms will try to check up to 3 models


ntree_limit is deprecated, use `iteration_range` or model slicing instead.
ntree_limit is deprecated, use `iteration_range` or model slicing instead.


3_Default_Xgboost logloss 0.276031 trained in 44.76 seconds
4_Default_NeuralNetwork logloss 0.320872 trained in 23.64 seconds
5_Default_RandomForest logloss 0.337926 trained in 36.01 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 0.276031 trained in 8.01 seconds


An input array is constant; the correlation coefficent is not defined.


AutoML fit time: 222.51 seconds
AutoML best model: 3_Default_Xgboost


ntree_limit is deprecated, use `iteration_range` or model slicing instead.
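
The cell above stores `predictions` but never compares them with `y_test`; the logloss values in the log come from AutoML's own internal validation. A minimal sketch of a hold-out check, using plain NumPy. The toy labels are illustrative and only imitate the style of the `income` column.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Illustrative labels; in the notebook these would be y_test and predictions.
y_true = ["<=50K", ">50K", "<=50K", "<=50K"]
y_pred = ["<=50K", "<=50K", "<=50K", ">50K"]

print(accuracy(y_true, y_pred))  # 0.5
```

In the notebook the call would be `accuracy(y_test, predictions)`.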


### The package used the default metric, rmse, and generated an excellent results directory with a top-level README
http://localhost:8888/lab/tree/MLJAR/AutoML_1/README.md



# AutoML Leaderboard

| Best model   | name                                                         | model_type     | metric_type   |   metric_value |   train_time |
|:-------------|:-------------------------------------------------------------|:---------------|:--------------|---------------:|-------------:|
|              | [1_Baseline](1_Baseline/README.md)                           | Baseline       | rmse          |        36.2415 |         2.01 |
|              | [2_DecisionTree](2_DecisionTree/README.md)                   | Decision Tree  | rmse          |        31.8223 |        29.27 |
| **the best** | [3_Default_Xgboost](3_Default_Xgboost/README.md)             | Xgboost        | rmse          |        26.3133 |        17.71 |
|              | [4_Default_NeuralNetwork](4_Default_NeuralNetwork/README.md) | Neural Network | rmse          |        28.9712 |         7.47 |
|              | [5_Default_RandomForest](5_Default_RandomForest/README.md)   | Random Forest  | rmse          |        29.0873 |        29.27 |
|              | [Ensemble](Ensemble/README.md)                               | Ensemble       | rmse          |        26.3133 |         0.63 |

### AutoML Performance
![AutoML Performance](ldb_performance.png)

### AutoML Performance Boxplot
![AutoML Performance Boxplot](ldb_performance_boxplot.png)

### Features Importance
![features importance across models](features_heatmap.png)



### Spearman Correlation of Models
![models spearman correlation](correlation_heatmap.png)

In [None]:
# Looks like it is working.
# It would be nice to try the other modes as well:
# Explain, Perform, Compete, Optuna

In [3]:
# automl = AutoML(mode="Explain")
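
A sketch of switching between the four modes named in the comment above. The helper `automl_for_mode` is hypothetical; it only validates the mode name and forwards any extra keyword arguments to `AutoML`, which requires mljar-supervised to be installed.

```python
# The four mode names mentioned in the notebook comment.
MODES = ("Explain", "Perform", "Compete", "Optuna")


def automl_for_mode(mode="Explain", **kwargs):
    """Build an AutoML object for one of the known modes (requires mljar-supervised)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    from supervised.automl import AutoML  # imported lazily, after the name check

    return AutoML(mode=mode, **kwargs)


print(MODES)
```

For example, `automl_for_mode("Perform", total_time_limit=600)` would cap training at roughly ten minutes; how strictly the limit applies may differ between modes, so treat the value as an assumption to tune.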

## Supported evaluation metrics (eval_metric argument in AutoML())
---

    for binary classification: logloss, auc, f1, average_precision, accuracy - default is logloss
    for multiclass classification: logloss, f1, accuracy - default is logloss
    for regression: rmse, mse, mae, r2, mape, spearman, pearson - default is rmse

If you don't find the eval_metric that you need, please add a new issue. We will add it.

# AutoML also created automatic documentation with a nice structure
---
http://localhost:8888/lab/tree/MLJAR/AutoML_2/README.md
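
The generated README can also be read without the Jupyter file browser. A minimal sketch, assuming the results directory (for example `AutoML_2`) sits next to the notebook; `read_report` is a hypothetical helper, and the demo uses a throwaway directory so it runs without a trained model.

```python
from pathlib import Path
import tempfile


def read_report(results_dir):
    """Return the text of README.md from an AutoML results directory."""
    readme = Path(results_dir) / "README.md"
    if not readme.is_file():
        raise FileNotFoundError(f"no README.md in {results_dir}")
    return readme.read_text(encoding="utf-8")


# Demo on a temporary directory that imitates the layout of a results folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "README.md").write_text("# AutoML Leaderboard\n", encoding="utf-8")
    first_line = read_report(tmp).splitlines()[0]

print(first_line)  # # AutoML Leaderboard
```

In the notebook, `print(read_report("AutoML_2"))` would show the leaderboard as raw Markdown.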