# Final Conclusion
* Compare random forest modeling with linear regression modeling

#### 1. Evaluate and Draw Conclusions from Random Forest Model
- Once you have coerced and generated features, and selected features, it is time to fit the model, and score the model against a holdout set.

After scoring the model, write 2 to 3 paragraphs where you interpret the model (by feature scores, and by grouping features such as geographic features or datetimes), identify assumptions of your work (e.g. omitted variable bias, multicollinearity), answer the "so what" of your analysis, and propose next steps.

#### 2. Compare against linear regression
How did the random forest model compare to the linear regression model?

- If there is a significant difference in accuracy, what would account for this?
- Is there a significant difference in the feature importances found?
- Imagine that your company/stakeholder now needs to decide which type of model to rely on: a linear regression model or a random forest model. Which of the two models do you trust more? Why?

## 1. Evaluating the final Random Forest Model

### Loading the dataset

In [1]:
import pandas as pd
# Load the dataset with the 10 features selected in the previous task.

X_train = pd.read_feather('./X_train') 
y_train = pd.read_feather('./y_train')

X_val = pd.read_feather('./X_val')
y_val = pd.read_feather('./y_val')

X_test = pd.read_feather('./X_test')
y_test = pd.read_feather('./y_test')

In [2]:
y_train.drop(columns = 'index', inplace=True)
y_val.drop(columns = 'index', inplace=True)
y_test.drop(columns = 'index', inplace=True)
X_test.drop(columns = 'index', inplace=True)

In [3]:
y_train = y_train['normalized_losses']
y_val = y_val['normalized_losses']
y_test = y_test['normalized_losses']
X_test = X_test[X_train.columns]

In [4]:
print(X_train.shape, X_val.shape, X_test.shape)
print(y_train.shape, y_val.shape, y_test.shape)

(131, 10) (33, 10) (41, 10)
(131,) (33,) (41,)


In [5]:
X_train.columns

Index(['symboling', 'num_of_doors', 'wheel-base', 'length', 'width', 'height',
       'peak_rpm', 'highway_mpg', 'price', 'make_other'],
      dtype='object')

In [6]:
len(X_train.columns)

10

By performing feature selection using permutation importance and correlation analysis, followed by focused feature engineering, I narrowed 48 features down to 10 for the final model. Below is the final random forest regressor after hyperparameter tuning.
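The correlation part of that selection can be sketched as follows. This is a standard-library illustration on toy columns with a hypothetical threshold of 0.9 and hypothetical names (`pearson`, `drop_correlated`, `feat_a` and so on); it is not the exact procedure used in the previous task.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy columns for illustration only (not values from the car dataset)
columns = {
    "feat_a": [160.0, 170.0, 180.0, 190.0, 200.0],
    "feat_b": [90.0, 95.0, 101.0, 104.0, 110.0],       # nearly proportional to feat_a
    "feat_c": [5000.0, 5800.0, 4800.0, 5500.0, 5200.0],
}

kept = drop_correlated(columns)
print(kept)
```

With a real DataFrame, the same pairwise check is usually done with `DataFrame.corr()` rather than by hand.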

In [7]:
from sklearn.ensemble import RandomForestRegressor

rfr_tuned = RandomForestRegressor(min_samples_leaf = 4,
                                  n_estimators = 14,
                                  max_features = 0.5, random_state=1)

### Fit the model & Score the model against a holdout set.

In [8]:
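# Combine the training and validation sets for the final fit.
# pd.concat keeps the column names, so the model is fit and scored on DataFrames consistently.
combined_X = pd.concat([X_train, X_val], ignore_index=True)
print(combined_X.shape)
combined_y = pd.concat([y_train, y_val], ignore_index=True)
print(combined_y.shape)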

(164, 10)
(164,)


In [9]:
rfr_tuned.fit(X_train, y_train)
rfr_tuned.score(X_val, y_val)

0.5166901930384304

In [10]:
rfr_tuned.fit(combined_X, combined_y)
rfr_tuned.score(X_test, y_test)

0.5415616813159074

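The tuned model's R² score on the validation set is 0.5167, and after refitting on the combined training and validation data, its R² on the held-out test set is 0.5416. The two scores are close, which suggests that the model generalizes to unseen data about as well as it did during validation and that the tuning did not overfit the validation set. An R² of roughly 0.54 means the model explains about half of the variance in normalized losses.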
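For a regressor, scikit-learn's `score` method returns the coefficient of determination, R² = 1 - SS_res / SS_tot, rather than a classification accuracy. A minimal sketch of that computation, using only the standard library and made-up toy values (`y_true_toy` and `y_pred_toy` are illustrative, not taken from this dataset):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only
y_true_toy = [100.0, 120.0, 140.0, 160.0]
y_pred_toy = [110.0, 115.0, 145.0, 150.0]

print(r_squared(y_true_toy, y_pred_toy))

# Predicting the mean of the target for every row is the R² = 0 baseline
baseline = [sum(y_true_toy) / len(y_true_toy)] * len(y_true_toy)
print(r_squared(y_true_toy, baseline))
```

A score of 1 corresponds to perfect predictions, 0 to a model no better than always predicting the mean, and negative values to a model that does worse than that baseline.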

### - Interpret the model by feature scores or grouping features, identify assumptions of your work, "so what" analysis, propose next steps. 

In [11]:
X_train.columns

Index(['symboling', 'num_of_doors', 'wheel-base', 'length', 'width', 'height',
       'peak_rpm', 'highway_mpg', 'price', 'make_other'],
      dtype='object')

In [12]:
import eli5
from eli5.sklearn import PermutationImportance

pmi = PermutationImportance(rfr_tuned).fit(X_val, y_val)
eli5.explain_weights_df(pmi, feature_names = X_val.columns.to_list())

|   | feature | weight | std |
|---|---|---|---|
| 0 | symboling | 0.352003 | 0.035547 |
| 1 | height | 0.237157 | 0.043077 |
| 2 | peak_rpm | 0.102802 | 0.055563 |
| 3 | length | 0.091071 | 0.007083 |
| 4 | price | 0.082163 | 0.020734 |
| 5 | wheel-base | 0.075515 | 0.006757 |
| 6 | num_of_doors | 0.072968 | 0.027162 |
| 7 | highway_mpg | 0.053519 | 0.016581 |
| 8 | width | 0.018479 | 0.027678 |
| 9 | make_other | 0.000889 | 0.004117 |


Based on the permutation importances above, symboling, height, peak_rpm, length, and price are the five most important features in my model. The 10 selected features do not appear to be correlated with each other. Since these features matter most for predicting normalized losses, a person who is interested in buying a car may consider them in order to reduce the normalized losses. Car manufacturers may consider them as well, and a car insurance company can use them to gauge the normalized losses for new customers. One caveat is that these importances were computed on the validation set after the model had been refit on the combined training and validation data, so they describe data the model has already seen. As a next step, I propose finding the values of each important feature that minimize the normalized losses.
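The weights above come from permutation importance: each feature column is shuffled in turn, and the resulting drop in the model's score is recorded. The sketch below shows the idea with only the standard library. The `predict` function is a hypothetical stand-in for a fitted model and the data are toy values; eli5 likewise repeats the shuffle several times, which is where the `weight` (mean) and `std` columns come from.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def predict(rows):
    # Hypothetical stand-in for a fitted model: it only uses column 0
    return [3.0 * row[0] + 1.0 for row in rows]


def permutation_importance(rows, y, n_repeats=5, seed=0):
    """Mean drop in R² when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    base_score = r_squared(y, predict(rows))
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted
            permuted = [row[:col] + [v] + row[col + 1:]
                        for row, v in zip(rows, shuffled)]
            drops.append(base_score - r_squared(y, predict(permuted)))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: the target depends on column 0; column 1 is pure noise
data_rng = random.Random(42)
rows = [[float(i), data_rng.random()] for i in range(20)]
y = [3.0 * row[0] + 1.0 for row in rows]

importances = permutation_importance(rows, y)
print(importances)
```

Shuffling the column the model relies on lowers the score, while shuffling a column the model ignores leaves it unchanged, which is why a weight near zero (as for `make_other`) indicates a feature the model barely uses.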

## 2. Compare Random Forest Model against Linear Regression Model

### - Is there a significant difference in accuracy? In the feature importances found? Which model do you trust more? Why?

For the linear regression model, I narrowed 48 features down to 21; for the random forest model, I narrowed them down to 10. The linear regression model's score was 0.1956, while the random forest model's validation score (R²) is 0.5167. With fewer features, the random forest model predicted the target variable better than the linear regression model did with more features.

Regarding feature importances, the linear regression model found 'fuel_type_is_gas', 'compression_ratio', 'engine_size', 'symboling', and 'drive_wheels_rwd' to be the five most important features. The random forest model found 'symboling', 'height', 'peak_rpm', 'length', and 'price'. I think this discrepancy occurred because the two models relate the features to the target through different mechanisms, and because they were fit on different feature sets (21 versus 10 features).

I prefer the random forest because linear regression has more limitations and requirements. First, linear regression requires every variable to be numeric and sensibly encoded (for example, one-hot encoding for categorical variables), which takes more time and can be cumbersome; scikit-learn's random forest also needs numeric input, but it is much less sensitive to how the features are encoded or scaled. Second, missing values and outliers must be handled before fitting a linear regression, because the fitted relationship between the features and the target variable is global: a single line is fit to all the data points, so a few extreme points can shift it everywhere. Hence, if I know that the relationship between the target variable and the explanatory variables (features) is linear, I will use a linear regression model; if I am not sure about the relationship, I will use a random forest model.
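The point about a single global line can be illustrated with a small least-squares fit. The sketch below uses only the standard library and toy numbers (not the car data), with a hypothetical helper `fit_line`, to show how one outlier changes the fitted line for every point.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data lying exactly on y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
clean_slope, clean_intercept = fit_line(xs, ys)

# Replace only the last target with an extreme value
ys_outlier = ys[:-1] + [40.0]
outlier_slope, outlier_intercept = fit_line(xs, ys_outlier)

print(clean_slope, clean_intercept)
print(outlier_slope, outlier_intercept)
```

A single changed point moves both the slope and the intercept, so predictions change across the whole range of `x`. A random forest averages trees that partition the feature space, so an extreme point mostly affects the leaves it falls into.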