### Random Forest is an ensemble learning algorithm that combines the predictions of several decision trees to improve the accuracy and stability of the model. It works for both classification and regression.
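
The way the trees' predictions are combined can be checked directly. The following is a minimal sketch, assuming synthetic data from `make_regression` (not the insurance dataset) and the hypothetical names `X_demo`, `y_demo` and `forest`: for a scikit-learn `RandomForestRegressor`, the ensemble prediction is the mean of the individual tree predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption: stands in for any numeric feature matrix)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=10)
forest.fit(X_demo, y_demo)

# Predict with each fitted decision tree, then average across trees
tree_preds = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
mean_of_trees = tree_preds.mean(axis=0)

# Check that the forest's own prediction matches the average of its trees
print(np.allclose(forest.predict(X_demo), mean_of_trees))
```

Averaging many trees, each trained on a different bootstrap sample, is what reduces the variance of a single deep decision tree.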

### Dataset: https://www.kaggle.com/mirichoi0218/insurance

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.simplefilter('ignore')

In [3]:
data = pd.read_csv('insurance.csv')

In [4]:
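# Features: every column except the last; copy() so later column assignments do not act on a view of `data`
X = data.iloc[:, :-1].copy()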

In [5]:
Y = data.iloc[:,-1]

In [6]:
data.head()

index,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


## Label encoding

In [7]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [8]:
X['sex'] = le.fit_transform(X['sex'])

In [9]:
X['smoker'] = le.fit_transform(X['smoker'])

In [10]:
X

index,age,sex,bmi,children,smoker,region
0,19,0,27.900,0,1,southwest
1,18,1,33.770,1,0,southeast
2,28,1,33.000,3,0,southeast
3,33,1,22.705,0,0,northwest
4,32,1,28.880,0,0,northwest
...,...,...,...,...,...,...
1333,50,1,30.970,3,0,northwest
1334,18,0,31.920,0,0,northeast
1335,18,0,36.850,0,0,southeast
1336,21,0,25.800,0,0,southwest
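
To see which integer each category receives, the fitted encoder can be inspected. This is a small sketch on a toy column (an assumption: it reuses the two categories of `sex` shown above, and the names `encoder` and `mapping` are illustrative only). `LabelEncoder` sorts the categories and uses their position as the code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (assumption: same categories as the 'sex' column above)
sex = pd.Series(['female', 'male', 'male', 'female'])

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ holds the sorted categories; the position is the integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
print(encoder.inverse_transform(codes))
```

Because the notebook refits the same `le` object on `smoker`, its `classes_` end up describing `smoker` only; a separate encoder per column would be needed to invert both.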


## One-hot encoding

In [11]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [12]:
# Column 5 is 'region': one-hot encode it and pass the remaining columns through unchanged
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [5])], remainder='passthrough')

In [13]:
X = columnTransformer.fit_transform(X)

In [14]:
print(X)

[[ 0.    0.    0.   ... 27.9   0.    1.  ]
 [ 0.    0.    1.   ... 33.77  1.    0.  ]
 [ 0.    0.    1.   ... 33.    3.    0.  ]
 ...
 [ 0.    0.    1.   ... 36.85  0.    0.  ]
 [ 0.    0.    0.   ... 25.8   0.    0.  ]
 [ 0.    1.    0.   ... 29.07  0.    1.  ]]


### Train/test split

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)
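
As a quick sanity check on what `test_size=0.20` and `random_state` do, here is a minimal sketch on hypothetical arrays of 1000 samples (the names `X_demo`, `y_demo` and the sizes are assumptions, not taken from the dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 samples, 3 features
X_demo = np.arange(3000).reshape(1000, 3)
y_demo = np.arange(1000)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=1)

# 80% of the rows go to training, 20% to testing
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state yields the same partition
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.20, random_state=1)
print(np.array_equal(X_te, X_te_again))
```

Fixing `random_state` makes the split, and therefore the reported score, reproducible from run to run.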

### Model

In [17]:
from sklearn.ensemble import RandomForestRegressor

In [18]:
model = RandomForestRegressor(n_estimators=25, random_state=10)
model.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=25, n_jobs=None, oob_score=False,
                      random_state=10, verbose=0, warm_start=False)

### Prediction

In [19]:
y_pred = model.predict(X_test)

### Comparison

In [20]:
comparaison = pd.DataFrame()
comparaison['Actual'] = y_test
comparaison['predicted'] = y_pred

In [21]:
comparaison

index,Actual,predicted
559,1646.42970,2480.000463
1087,11353.22760,11659.776198
1020,8798.59300,9062.291662
460,10381.47870,10289.648328
802,2103.08000,2218.975448
...,...,...
682,40103.89000,41028.495589
629,42983.45850,47780.614048
893,44202.65360,44502.188259
807,2136.88225,2083.121326


### Evaluation

In [22]:
from sklearn.metrics import r2_score

r2_score(y_test, y_pred)

0.8516692931213358
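
For intuition about what this score measures, the sketch below recomputes R² by hand as 1 - SS_res / SS_tot and compares it with `r2_score`. It uses only the first five actual/predicted pairs from the comparison table above (rounded), so its value will differ from the full test-set score; the mean absolute error is included as an assumption about what a complementary metric might look like.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# First five rows of the comparison table above, rounded to two decimals
actual = np.array([1646.43, 11353.23, 8798.59, 10381.48, 2103.08])
predicted = np.array([2480.00, 11659.78, 9062.29, 10289.65, 2218.98])

# Residual sum of squares: error left after the model's predictions
ss_res = np.sum((actual - predicted) ** 2)
# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(actual, predicted))

# Mean absolute error, in the same units as 'charges'
mae = mean_absolute_error(actual, predicted)
print(mae)
```

An R² of about 0.85 on the test set means the model explains roughly 85% of the variance in `charges` relative to a constant mean prediction.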