# Random forest (RF) - Regression: Sales



![title](connective.png)

Random forests build an ensemble of tree models, each constructed from a bootstrapped sample of the input data. The results of these models are then combined to yield a single prediction which, at the expense of some loss in interpretability, has been found to be highly accurate.
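
The bagging idea described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: each "tree" is a one-split regression stump on a single feature, the toy data and all function names are hypothetical, and none of it reflects how MADlib implements its forests internally.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression stump: choose the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # every x in the sample is identical: always predict the mean
        m = mean(ys)
        return (float("inf"), m, m)
    _, t, lm, rm = best
    return (t, lm, rm)

def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean

def fit_forest(xs, ys, n_trees=10, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        # bootstrap: draw n rows with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    # regression forests combine the trees by averaging their predictions
    return mean(predict_stump(s, x) for s in forest)

# toy data (assumed): hourly sales that jump after midday
hours = list(range(24))
sales = [5.0 if h < 12 else 20.0 for h in hours]
forest = fit_forest(hours, sales)
print(predict_forest(forest, 3), predict_forest(forest, 20))
```

Averaging over bootstrap samples is what makes the combined prediction more stable than any single tree, at the cost of no longer having one tree to read.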



### [Documentation](http://madlib.apache.org/docs/latest/group__grp__random__forest.html)



### Load the SQL extension

In [None]:
%load_ext sql

### Connect to the database

In [None]:
%sql postgresql://gpadmin:pivotal@10.0.2.6:5432/gpadmin

### Check the version of the database

In [None]:
%sql select version();

### Create the desired table in the madlib schema

In [None]:
#%%sql
#DROP TABLE IF EXISTS madlib.ventas_timeseries;
#CREATE TABLE madlib.ventas_timeseries 
#AS(
#SELECT count(*) as ventas, date_trunc('hour', ventas."fecha_venta")::timestamp as fecha 
#from interbus.ventas 
#group by fecha);

### Create a table with sales aggregated by hour in the public schema

In [None]:
%%sql
DROP TABLE IF EXISTS ventas_timeseries;
CREATE TABLE ventas_timeseries 
AS(
SELECT count(*) as ventas, date_trunc('hour', ventas."fecha_venta")::timestamp as fecha , tarifa, trayecto, origen
from interbus.ventas 
group by fecha, tarifa, trayecto, origen);
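
To make the aggregation concrete, here is a small Python sketch of what `date_trunc('hour', ...)` followed by `count(*) ... group by` produces. The timestamps and fare labels are made up for illustration, and only the `fecha` and `tarifa` keys are shown to keep it short.

```python
from collections import Counter
from datetime import datetime

def truncate_to_hour(ts):
    """Python counterpart of date_trunc('hour', ts): zero out minutes and below."""
    return ts.replace(minute=0, second=0, microsecond=0)

# hypothetical raw sales: (sale timestamp, fare)
raw_sales = [
    (datetime(2019, 7, 1, 9, 5), "adult"),
    (datetime(2019, 7, 1, 9, 40), "adult"),
    (datetime(2019, 7, 1, 9, 59), "child"),
    (datetime(2019, 7, 1, 10, 1), "adult"),
]

# count(*) grouped by (fecha, tarifa)
counts = Counter((truncate_to_hour(ts), tarifa) for ts, tarifa in raw_sales)
for (fecha, tarifa), ventas in sorted(counts.items()):
    print(fecha, tarifa, ventas)
```

Each distinct combination of hour and grouping key becomes one row, so the table only contains hours in which at least one sale occurred.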

### Add a column with an ID

In [None]:
%%sql
ALTER TABLE ventas_timeseries
ADD COLUMN id SERIAL;

In [None]:
%%sql
SELECT * 
from ventas_timeseries
where fecha >= '2019-07-01 00:00:00'::timestamp
and fecha < '2019-08-01 00:00:00'::timestamp

### Select the rows for July 2019 with BETWEEN

In [None]:
%%sql
SELECT * FROM ventas_timeseries WHERE fecha BETWEEN '2019-07-01 00:00:00' AND '2019-08-01 00:00:00'
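
The two date filters above are not quite equivalent: SQL `BETWEEN` includes both endpoints, so a row stamped exactly `2019-08-01 00:00:00` matches the `BETWEEN` query but not the half-open `>= ... <` query. A minimal Python sketch of the difference (function names are illustrative):

```python
from datetime import datetime

start = datetime(2019, 7, 1)
end = datetime(2019, 8, 1)

def in_half_open_range(ts):
    # fecha >= start AND fecha < end
    return start <= ts < end

def in_between_range(ts):
    # fecha BETWEEN start AND end (both endpoints included)
    return start <= ts <= end

boundary = datetime(2019, 8, 1, 0, 0, 0)
print(in_half_open_range(boundary), in_between_range(boundary))  # False True
```

For hourly buckets the half-open form is usually the safer choice, because it keeps the first hour of August out of the July selection.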

### Get the number of rows in the table

In [None]:
%%sql
SELECT COUNT(*) FROM ventas_timeseries;

### Create a copy of the table in the madlib schema

The copied table is the one used as `<model_table>` below:

    * <model_table>: ventas_timeseries

### Train random forest for regression
We train a regression random forest with grouping on `tarifa` and use surrogates for NULL handling.

A table with the trained model is generated:

    * <model_table>_output

Additionally, two more tables are generated:

    * <model_table>_output_group
    * <model_table>_output_summary

### [Documentation](http://madlib.apache.org/docs/latest/group__grp__random__forest.html)

The __id column__ is mandatory and is used for prediction and other purposes. Its values are expected to be unique for each row.

__Grouping columns__ produce multiple random forests, one for each group.
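
The two requirements can be pictured with a short Python sketch: check that the ids are unique, split the rows by the grouping column, and fit one model per group. The rows and fare labels are invented, and a simple group mean stands in for a real forest.

```python
from collections import defaultdict
from statistics import mean

# hypothetical rows shaped like the training table
rows = [
    {"id": 1, "tarifa": "normal", "ventas": 4},
    {"id": 2, "tarifa": "normal", "ventas": 6},
    {"id": 3, "tarifa": "reduced", "ventas": 1},
    {"id": 4, "tarifa": "reduced", "ventas": 3},
]

# the id column must identify each row uniquely
ids = [r["id"] for r in rows]
assert len(ids) == len(set(ids)), "id values must be unique"

# one partition per value of the grouping column
groups = defaultdict(list)
for r in rows:
    groups[r["tarifa"]].append(r)

# stand-in "model" per group: the mean response (a real forest would go here)
models = {tarifa: mean(r["ventas"] for r in members) for tarifa, members in groups.items()}
print(models)
```

Because each group gets its own model, a group with very few rows is trained on very little data, which is worth checking before choosing a grouping column.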

In [None]:
%%sql
DROP TABLE IF EXISTS ventas_timeseries_output,
                     ventas_timeseries_output_group,
                     ventas_timeseries_output_summary;

SELECT madlib.forest_train('ventas_timeseries',         -- source table
                           'ventas_timeseries_output',  -- output model table
                           'id',              -- id column
                           'ventas',             -- response
                           '*',               -- features
                           '',  -- exclude columns
                           'tarifa',              -- grouping columns
                           10::integer,       -- number of trees
                           2::integer,        -- number of random features
                           TRUE::boolean,     -- variable importance
                           1,                 -- num_permutations
                           10,                -- max depth
                           8,                 -- min split
                           3,                 -- min bucket
                           10,                -- number of splits per continuous variable
                           'max_surrogates=2' -- NULL handling
                           );

SELECT * FROM ventas_timeseries_output_summary;

Review the group table to see variable importance by group:

In [None]:
%%sql
SELECT * FROM ventas_timeseries_output_group ORDER BY gid;

Use the helper function to display normalized variable importance:

In [None]:
%%sql
DROP TABLE IF EXISTS ventas_timeseries_imp_output;

SELECT madlib.get_var_importance('ventas_timeseries_output','ventas_timeseries_imp_output');
SELECT * FROM ventas_timeseries_imp_output ORDER BY oob_var_importance DESC;
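
As a rough intuition for permutation-based variable importance, the sketch below shuffles one feature at a time and measures how much the prediction error grows. It is an assumption-laden illustration, not MADlib's code: it scores on the full toy dataset rather than on out-of-bag rows, the model and data are hypothetical, and "normalized" is assumed to mean rescaling the scores so they sum to 100.

```python
import random

def mse(model, X, y):
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_features, num_permutations=1, seed=0):
    """Mean increase in MSE when one feature column is shuffled, rescaled to sum to 100."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    raw = []
    for j in range(n_features):
        increases = []
        for _ in range(num_permutations):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the response
            X_perm = [row[:j] + (v,) + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        raw.append(max(0.0, sum(increases) / len(increases)))
    total = sum(raw)
    return [100.0 * r / total if total else 0.0 for r in raw]

def toy_model(row):
    # hypothetical fitted model that only uses feature 0
    return 2.0 * row[0]

X = [(float(i), float(i % 3)) for i in range(20)]
y = [2.0 * a for a, _ in X]
importance = permutation_importance(toy_model, X, y, n_features=2, num_permutations=5)
print(importance)
```

A feature the model never uses gets zero importance because shuffling it cannot change the predictions; more permutations only reduce the noise in the estimate.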

### Predict

Predict regression output for the same data and compare with original:

In [None]:
%%sql
DROP TABLE IF EXISTS prediction_results;

SELECT madlib.forest_predict('ventas_timeseries_output',  -- model table
                             'ventas_timeseries',         -- table to predict on
                             'prediction_results',        -- output table
                             'response');                 -- prediction type


In [None]:
%%sql

SELECT s.id, ventas, estimated_ventas, ventas - estimated_ventas AS delta
FROM prediction_results p, ventas_timeseries s
WHERE s.id = p.id
ORDER BY s.id;

In [None]:
from sqlalchemy.engine import create_engine
import pandas as pd

engine = create_engine("postgresql://gpadmin:pivotal@10.0.2.6:5432/gpadmin")

sql = """
SELECT s.id, ventas, estimated_ventas, ventas - estimated_ventas AS delta
FROM prediction_results p, ventas_timeseries s
WHERE s.id = p.id
ORDER BY s.id;
"""

df = pd.read_sql_query(sql, engine)
df.dropna(inplace=True)
df.head()


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

df[['ventas','estimated_ventas']].plot(kind='bar')
plt.show()

In [None]:
df.loc[1:10,['ventas','estimated_ventas']].plot(kind='bar')
plt.show()

In [None]:
# absolute percentage error (APE) per row
df['APE'] = 100*(df['estimated_ventas']-df['ventas']).abs()/df['ventas']
df['APE'].plot()
plt.show()

In [None]:
from sklearn.metrics import r2_score
# r2_score expects the true values first, then the predictions
r2_score(df['ventas'], df['estimated_ventas'])
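
For reference, the two error measures used above can be written out in a few lines of plain Python. The numbers are invented; the point of the sketch is that R² is not symmetric in its arguments, so the true values must be passed first, and that a percentage error needs the absolute value to count as an APE.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot taken around mean(y_true)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must not contain zeros)."""
    return 100 * sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# hypothetical observed and predicted sales
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 39]

print(r2(observed, predicted))   # correct argument order
print(r2(predicted, observed))   # swapped order gives a different value
print(mape(observed, predicted))
```

Since `ventas` is a `count(*)` per group, it is never zero in this table, so the division in the percentage error is safe here.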

### View a tree

In [None]:
%%sql
SELECT madlib.get_tree('ventas_timeseries_output',1,7);

In [None]:
%pwd

In [None]:
# set paths for your environment
from os.path import expanduser
home = expanduser("C:\\Users\\javelascor\\INDRA\\madlib-site-asf-site\\community-artifacts\\Supervised-learning")

# export tree 7 of group 1 in DOT format (dot_format=TRUE, verbose=TRUE)
dot_output = %sql select madlib.get_tree('ventas_timeseries_output',1,7, TRUE, TRUE);
with open('tree_out.dot', 'w') as f:
    f.write(dot_output[0][0])

render_tree = False  # set to True if pygraphviz is installed
if render_tree:
    import pygraphviz as pgv
    from IPython.display import Image, display
    graph = pgv.AGraph("tree_out.dot")
    print(bool(graph))
    graph.draw('tree_out.png', prog='dot')
    display(Image('tree_out.png'))
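
If pygraphviz is not available, one alternative is to call the Graphviz `dot` command-line tool directly. The sketch below assumes Graphviz may or may not be installed and simply reports whether rendering happened. The tiny graph is a hypothetical stand-in for the `get_tree` output, and `render_dot` is an illustrative helper, not part of MADlib.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def render_dot(dot_path, png_path):
    """Render a DOT file to PNG with the Graphviz `dot` CLI; return False if that is not possible."""
    if shutil.which("dot") is None:
        return False  # Graphviz is not on the PATH
    result = subprocess.run(["dot", "-Tpng", str(dot_path), "-o", str(png_path)])
    return result.returncode == 0

# hypothetical three-node graph standing in for the exported tree
example_dot = 'digraph "tree" { "0" -> "1"; "0" -> "2"; }'

with tempfile.TemporaryDirectory() as tmp:
    dot_file = Path(tmp) / "tree_out.dot"
    png_file = Path(tmp) / "tree_out.png"
    dot_file.write_text(example_dot)
    rendered = render_dot(dot_file, png_file)
    png_exists = png_file.exists()

print("rendered:", rendered)
```

In the notebook, the same helper could be pointed at the `tree_out.dot` file written above, and the resulting PNG displayed with `IPython.display.Image`.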

Display the surrogate variables that are used to compute the split for each node when the primary variable is NULL:

In [None]:
%%sql
SELECT madlib.get_tree_surr('ventas_timeseries_output',1,7);
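
The role of surrogates can be sketched as a routing rule: use the primary split variable if it has a value, otherwise try each surrogate in order, and fall back to a default branch if all of them are NULL. The feature names and thresholds below are hypothetical, every split is assumed to send small values left, and the fallback is assumed to be the majority branch; MADlib's actual node encoding is richer than this.

```python
def route(row, primary, surrogates, majority="left"):
    """Pick a branch for `row`; each split is a (feature_name, threshold) pair."""
    for feature, threshold in [primary, *surrogates]:
        value = row.get(feature)
        if value is not None:  # first non-NULL variable decides
            return "left" if value <= threshold else "right"
    return majority  # primary and all surrogates are NULL

# hypothetical node with two surrogates, matching max_surrogates=2
primary = ("hour", 12)
surrogates = [("weekday", 4), ("month", 6)]

print(route({"hour": 9, "weekday": 6}, primary, surrogates))         # left (primary used)
print(route({"hour": None, "weekday": 6}, primary, surrogates))      # right (first surrogate used)
print(route({"hour": None, "weekday": None}, primary, surrogates))   # left (majority fallback)
```

Surrogates are chosen to mimic the primary split as closely as possible, so rows with missing values still follow roughly the path they would have taken.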