# Ensemble Random Forest

---

A Random Forest is a type of Ensemble method used for classification or regression...

---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
cars = pd.read_csv('cars.csv')

## Exploratory Data Analysis

In [3]:
cars.shape

(1728, 7)

---

We can see that there are 1728 instances and 7 variables in the dataset. Let's view the top 5 instances and a summary of the data.

---

In [4]:
cars.head()

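| | price | maintenance | doors | passengers | trunk_space | safety | state |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |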


In [5]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
price          1728 non-null object
maintenance    1728 non-null object
doors          1728 non-null object
passengers     1728 non-null object
trunk_space    1728 non-null object
safety         1728 non-null object
state          1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB
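
Since every column is of type `object`, it can help to list the distinct levels in each column and confirm that nothing is missing before encoding. The following is a minimal sketch, not part of the original notebook: the `sample` frame is a hypothetical stand-in built from the five rows printed by `cars.head()`, and on the real data one would substitute `cars`.

```python
import pandas as pd

# Hypothetical stand-in built from the five rows shown by cars.head().
sample = pd.DataFrame({
    'price':       ['vhigh'] * 5,
    'maintenance': ['vhigh'] * 5,
    'doors':       ['2'] * 5,
    'passengers':  ['2'] * 5,
    'trunk_space': ['small', 'small', 'small', 'med', 'med'],
    'safety':      ['low', 'med', 'high', 'low', 'med'],
    'state':       ['unacc'] * 5,
})

# Distinct levels per column, in order of first appearance.
levels = {col: sample[col].unique().tolist() for col in sample.columns}
for col, vals in levels.items():
    print(f"{col}: {vals}")

# Number of missing values per column.
missing = sample.isnull().sum()
print(missing)
```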


---

This dataset is about evaluating the state of a list of vehicles. Our goal is to predict the ```state``` variable (or desirability) of a car using the other variables. Let's next split our dataset into our instances, X, and label, y.

---

In [6]:
col_names = ['price', 'maintenance', 'doors', 'passengers', 'trunk_space', 'safety']

X = cars[col_names]
y = cars['state']

---
## Building Our Random Forest Model

We are concerned with how our ML models perform on out-of-sample data, so we'll next use ```sklearn.model_selection.train_test_split``` to partition our data into training and testing subsets. Documentation of this function can be found at [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split).


---

In [7]:
from sklearn.model_selection import train_test_split

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 40)
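
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows, so 1728 rows split into 1296 training rows and 432 test rows. The sketch below checks this on a synthetic stand-in with the same number of rows. The arrays `X_demo` and `y_demo`, the label `'other'`, and the class proportions are all made up for illustration. The `stratify` argument is an optional extra that the notebook itself does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same number of rows as the cars data.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 5, size=(1728, 6))
y_demo = rng.choice(['unacc', 'other'], size=1728, p=[0.7, 0.3])

# Default split: 25% of the rows are held out for testing.
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=40)
print(Xtr.shape, Xte.shape)

# Optional: stratify on the label so both subsets keep similar class proportions.
Xtr_s, Xte_s, ytr_s, yte_s = train_test_split(
    X_demo, y_demo, random_state=40, stratify=y_demo
)
train_share = np.mean(ytr_s == 'other')
test_share = np.mean(yte_s == 'other')
print(train_share, test_share)
```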

---

We next need to encode the categorical data in our ```X_train``` and ```X_test``` datasets as integers. We'll need to import the ```category_encoders``` module, documentation for which can be found at [category_encoders](https://contrib.scikit-learn.org/category_encoders/).

We'll then use its ```OrdinalEncoder``` to map each category in ```X_train``` and ```X_test``` to an integer code.

---

In [16]:
import category_encoders as ce

encoder = ce.OrdinalEncoder(cols=col_names)

# Fit the encoder on the training data only, then reuse the learned mapping on the test data.
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
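
An ordinal encoder learns its category-to-integer mapping when it is fitted, so the usual practice is to fit on the training data and only call `transform` on the test data; otherwise the same category could receive different codes in the two sets. The sketch below illustrates this with scikit-learn's `OrdinalEncoder`, swapped in here for `category_encoders` because its `categories` argument makes the intended order explicit. The tiny `train` and `test` frames are hypothetical. Note that scikit-learn's encoder starts its codes at 0, whereas the output in this notebook starts at 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical miniature train/test frames using the safety levels seen above.
train = pd.DataFrame({'safety': ['low', 'high', 'med', 'low']})
test = pd.DataFrame({'safety': ['high', 'low']})

# Spell out the category order so the codes reflect low < med < high.
enc = OrdinalEncoder(categories=[['low', 'med', 'high']])

train_enc = enc.fit_transform(train)  # learn the mapping from the training data
test_enc = enc.transform(test)        # reuse the same mapping on the test data

print(train_enc.ravel())
print(test_enc.ravel())
```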

---

Let's preview both datasets to confirm that the encoder worked.

---

In [17]:
print(X_train.head())
print(X_test.head())

      price  maintenance  doors  passengers  trunk_space  safety
1417      1            1      1           1            1       1
404       2            2      2           2            2       2
1273      3            2      3           3            1       1
1423      1            1      1           2            3       1
192       2            1      3           3            1       3
      price  maintenance  doors  passengers  trunk_space  safety
237       2            4      1           2            1       3
854       4            2      3           1            2       2
752       4            4      3           2            1       2
1674      1            2      2           3            3       3
189       2            1      3           3            3       3


---

Our target variable is categorical, so this is a classification problem. To build our model we'll need to import ```sklearn.ensemble.RandomForestClassifier```, documentation for which can be found at [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier).

We will then instantiate the ```RandomForestClassifier``` class with 100 estimators and fit our model to the training data. Next, we'll predict on unseen data from our test dataset.

---

In [18]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators = 100, 
                                random_state = 42)
rf_clf.fit(X_train, y_train)

pred = rf_clf.predict(X_test)
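
To make the "ensemble" part concrete, the sketch below fits a forest and then rebuilds its prediction by hand from the individual trees. In scikit-learn, a fitted forest exposes its trees through `estimators_`, and its class probabilities are the average of the trees' probabilities. This is an illustration under assumptions: synthetic data from `make_classification` stands in for the encoded cars data, and the names `forest`, `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 6 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# The forest holds one fitted decision tree per estimator.
n_trees = len(forest.estimators_)
print(n_trees)

# Average the per-tree class probabilities by hand.
tree_probs = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
avg_probs = tree_probs.mean(axis=0)

# The hand-computed average should match the forest's own probabilities.
probs_match = np.allclose(avg_probs, forest.predict_proba(X_demo))
print(probs_match)

# The predicted class is the one with the highest averaged probability.
manual_pred = forest.classes_[avg_probs.argmax(axis=1)]
preds_match = np.array_equal(manual_pred, forest.predict(X_demo))
print(preds_match)
```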

---

Let's check the accuracy of our model. To do so we'll need to import ```sklearn.metrics.accuracy_score```, documentation for which can be found at [accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

---

In [19]:
from sklearn.metrics import accuracy_score

print(f"Random Forest Classifier model with 100 estimators yields an accuracy score of {np.round(100*accuracy_score(y_test, pred), 2)}%")

Random Forest Classifier model with 100 estimators yields an accuracy score of 93.98%
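
Accuracy is simply the share of predictions that match the true label, so a single number can hide which classes are being confused. The sketch below computes accuracy by hand and alongside a confusion matrix. The labels are hypothetical and chosen only to keep the arithmetic easy to follow; on the real data one would pass `y_test` and `pred` instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels ('other' is a made-up second class).
y_true = np.array(['unacc', 'unacc', 'unacc', 'unacc', 'other', 'other'])
y_hat = np.array(['unacc', 'unacc', 'unacc', 'other', 'other', 'unacc'])

# Accuracy by hand: fraction of positions where prediction equals truth.
manual_acc = np.mean(y_true == y_hat)
sk_acc = accuracy_score(y_true, y_hat)
print(manual_acc, sk_acc)

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_hat, labels=['unacc', 'other'])
print(cm)
```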


---

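Our Random Forest Classifier model yielded an accuracy score of almost 94% on the held-out test set, which corresponds to 406 of the 432 test instances being classified correctly. This illustrates how well an ensemble of decision trees can perform on this dataset without any hyperparameter tuning.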

---