# Task

As a data analyst, you will have plenty of opportunities to improve processes or suggest new ways of doing things. When doing so, it is often smart and efficient (time is a scarce resource) to create a POC (Proof of Concept): a small demo that checks whether an idea is worth pursuing further. A POC is also something concrete to discuss, and the value of that should not be underestimated.

In this example, you work at a company that sells houses, where prices are currently set manually by humans. As a Data Scientist, you may be able to improve this process with Machine Learning. Your task is to create a POC that you will present to your colleagues and use as a basis for discussing whether or not to continue with more detailed modelling.

Two quotes to help you reflect on the value of creating a POC:

"*Premature optimization is the root of all evil*". 

"*Fail fast*".

**More specifically, do the following:**

1. A short EDA (Exploratory Data Analysis) of the housing data set.
2. Drop the column "ocean_proximity" so that only numeric columns remain, which simplifies the analysis. Remember, this is a POC!
3. Split your data into a train set and a test set.
4. Create a pipeline containing a SimpleImputer [SimpleImputer(strategy="median")] and a StandardScaler (named std_scaler), then fit-transform your train set with it.

5. Use GridSearchCV when choosing your model. You will look at a RandomForestRegressor with 2, 5, 10 or 100 estimators. More specifically, use the following code: 

```python
param_grid = [{'n_estimators': [2, 5, 10, 100]}]

forest_reg = RandomForestRegressor(random_state=42)

grid_search = GridSearchCV(forest_reg, param_grid, cv=3,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

grid_search.fit(train_feature, train_label)
```

6. Evaluate your model on the test set using the mean squared error as the metric. Conclusions? (Remember: the pipeline was already fitted on the train set above, so now you only transform the test set. Fitting the pipeline on the test set would leak test information into the preprocessing, which is "cheating".)

7. Give a short presentation (about 2-5 min) of your POC to your colleagues (no need to prepare anything in particular, just talk from the code). Think about:
- What do you want to highlight/present?
- What is your conclusion?
- What could be the next step? Is the POC convincing enough, or is it not worthwhile to continue? Do we need to dig deeper before making any decisions?


**(8. If you have time, try to build a better model than the one presented in the POC.)**

# POC

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [2]:
# Below, set your own path where you have stored the data file. 
housing = pd.read_csv(r'C:\Users\Antonio Prgomet\Documents\ec_utbildning\kursframställning\sthlm_gbg\ml_sthlm_gbg\exercises_and_examinations\housing.csv')

## EDA

In [3]:
# total_bedrooms has 207 missing values (20433 non-null out of 20640 rows)
# All columns are float64 except "ocean_proximity", which has dtype object
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
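
The missing-value count read off the info() output can also be computed directly. The following is a minimal sketch on a small hypothetical DataFrame (the name toy and its values are made up for illustration, they are not taken from the housing data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data (made-up values).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Applied to the real data, housing.isna().sum() should report 207 for total_bedrooms, which matches 20640 - 20433 from the info() output.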


In [4]:
housing.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [5]:
housing.describe()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


## Drop the column "ocean_proximity"

In [6]:
housing_num = housing.drop("ocean_proximity", axis=1)

## Splitting into train and test sets

In [7]:
train_set, test_set = train_test_split(housing_num, test_size=0.2, random_state=42)

In [8]:
X_train_pre = train_set.drop('median_house_value', axis=1)
y_train = train_set['median_house_value'].copy()

In [9]:
X_test_pre = test_set.drop('median_house_value', axis=1)
y_test = test_set['median_house_value'].copy()

## Create a pipeline

In [10]:
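# Preprocessing: fill missing values with the column median, then standardize.
my_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])
# Fit on the train set only; the test set is transformed later with these fitted statistics.
X_train = my_pipe.fit_transform(X_train_pre)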
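
To illustrate why the pipeline is fitted on the train set only, here is a small sketch on hypothetical toy arrays (toy_train, toy_test and toy_pipe are illustrative names, not part of the exercise). It shows that transform reuses the median, mean and standard deviation learned from the train data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one feature; NaN marks a missing value.
toy_train = np.array([[1.0], [2.0], [3.0], [np.nan]])
toy_test = np.array([[np.nan], [10.0]])

toy_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])

# fit_transform learns the median, mean and standard deviation from the train data.
toy_train_scaled = toy_pipe.fit_transform(toy_train)

# transform reuses those train statistics; nothing is learned from the test data.
toy_test_scaled = toy_pipe.transform(toy_test)

learned_median = toy_pipe.named_steps['imputer'].statistics_[0]
learned_mean = toy_pipe.named_steps['std_scaler'].mean_[0]
print(learned_median, learned_mean)
print(toy_test_scaled)
```

The missing test value is filled with the train median, not with a statistic of the test set. Calling fit_transform on the test data instead would compute new statistics from it, which is the "cheating" the task warns about.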

## GridSearchCV

In [11]:
param_grid = [{'n_estimators': [2, 5, 10, 100]}]

forest_reg = RandomForestRegressor(random_state=42)

grid_search = GridSearchCV(forest_reg, param_grid, cv=3,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

grid_search.fit(X_train, y_train)

GridSearchCV(cv=3, estimator=RandomForestRegressor(random_state=42),
             param_grid=[{'n_estimators': [2, 5, 10, 100]}],
             return_train_score=True, scoring='neg_mean_squared_error')

In [12]:
pd.DataFrame(grid_search.cv_results_)

|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.12093 | 0.006972 | 0.004659 | 0.000478 | 2 | {'n_estimators': 2} | -4381116000.0 | -3972698000.0 | -4196944000.0 | -4183586000.0 | 167003300.0 | 4 | -1315632000.0 | -1254250000.0 | -1183058000.0 | -1250980000.0 | 54172380.0 |
| 1 | 0.289832 | 0.008042 | 0.011328 | 0.001884 | 5 | {'n_estimators': 5} | -3301854000.0 | -2984350000.0 | -3243556000.0 | -3176586000.0 | 137999900.0 | 3 | -717662900.0 | -675719400.0 | -685729000.0 | -693037100.0 | 17886100.0 |
| 2 | 0.568673 | 0.006463 | 0.018649 | 0.004492 | 10 | {'n_estimators': 10} | -2963571000.0 | -2785780000.0 | -2917142000.0 | -2888831000.0 | 75293130.0 | 2 | -531791100.0 | -514179600.0 | -510709700.0 | -518893500.0 | 9229349.0 |
| 3 | 5.826572 | 0.279962 | 0.154571 | 0.007409 | 100 | {'n_estimators': 100} | -2628848000.0 | -2473381000.0 | -2584217000.0 | -2562149000.0 | 65359270.0 | 1 | -365895800.0 | -370222200.0 | -356067500.0 | -364061800.0 | 5922344.0 |


In [13]:
grid_search.best_params_

{'n_estimators': 100}
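
Because the scoring is 'neg_mean_squared_error', the scores in cv_results_ are negative and in squared units, which makes them hard to read. A minimal sketch of converting them to RMSE, using the mean_test_score values copied from the table above (the names mean_test_scores and cv_rmse are illustrative):

```python
import math

# mean_test_score values copied from the cv_results_ table (negative MSE).
mean_test_scores = {
    2: -4183586000.0,
    5: -3176586000.0,
    10: -2888831000.0,
    100: -2562149000.0,
}

# Flip the sign to get the MSE, then take the square root to get the RMSE.
cv_rmse = {n: math.sqrt(-score) for n, score in mean_test_scores.items()}

for n, value in cv_rmse.items():
    print(f"n_estimators={n}: cross-validated RMSE ~ {value:,.0f}")
```

Note that this is the square root of the mean MSE across folds, which is not exactly the same as the mean of the per-fold RMSE values. For n_estimators=100 it comes out at roughly 50,600, in the same unit as median_house_value.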

## Evaluate model on the test set

In [14]:
X_test = my_pipe.transform(X_test_pre)

In [15]:
prediction = grid_search.predict(X_test)

In [16]:
mse = mean_squared_error(y_test, prediction)
rmse = np.sqrt(mse)
rmse

49875.648686594046

In [17]:
# Mean house value over the full data set, used only to put the RMSE in context
mean_value = housing["median_house_value"].mean()
mean_value

206855.81690891474

In [18]:
# RMSE relative to the mean house value
rmse / mean_value

0.24111310685817405
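
One possible next step, sketched here under assumptions rather than taken from the exercise: in the notebook above, the imputer and scaler are fitted on the whole train set before cross-validation, so each validation fold has already influenced the preprocessing statistics. Putting the model inside the same Pipeline lets GridSearchCV refit the preprocessing on each training fold. The data below is synthetic (X_demo, y_demo), and the step name 'forest' and the smaller grid are illustrative choices to keep the sketch fast:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumption): 300 rows, 3 features, about 5 % missing values.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Preprocessing and model in one pipeline.
full_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
    ('forest', RandomForestRegressor(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
demo_grid = [{'forest__n_estimators': [2, 5, 10]}]

demo_search = GridSearchCV(full_pipe, demo_grid, cv=3,
                           scoring='neg_mean_squared_error')
demo_search.fit(X_tr, y_tr)

# The fitted search applies the train-fitted preprocessing to the test set itself.
demo_rmse = np.sqrt(mean_squared_error(y_te, demo_search.predict(X_te)))
print(demo_search.best_params_, demo_rmse)
```

With this layout there is no separate transform call for the test set, which removes one way of accidentally fitting the preprocessing on test data. Whether it changes the scores noticeably on the housing data would have to be checked.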