## Kaggle

For our project, we are going to use [www.kaggle.com](https://www.kaggle.com) as a hosting tool. The fun part is that we can see the team leaderboards in real time. To join the competition, go to

[https://www.kaggle.com/t/2f3cef9d033149e7a02091e681dda594](https://www.kaggle.com/t/2f3cef9d033149e7a02091e681dda594)

and download the files.

In [1]:
import pandas as pd
import numpy as np

train = pd.read_csv('./housing_data/train.csv')
test = pd.read_csv('./housing_data/test.csv')

Notice that the training set has the `median_house_value` column, which is the value we want to predict, but the test file does not.

In [2]:
train.head()

|   | id | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -121.89 | 37.29 | 38.0 | 1568.0 | 351.0 | 710.0 | 339.0 | 2.7042 | 286600.0 | <1H OCEAN |
| 1 | 2 | -121.93 | 37.05 | 14.0 | 679.0 | 108.0 | 306.0 | 113.0 | 6.4214 | 340600.0 | <1H OCEAN |
| 2 | 3 | -117.2 | 32.77 | 31.0 | 1952.0 | 471.0 | 936.0 | 462.0 | 2.8621 | 196900.0 | NEAR OCEAN |
| 3 | 4 | -119.61 | 36.31 | 25.0 | 1847.0 | 371.0 | 1460.0 | 353.0 | 1.8839 | 46300.0 | INLAND |
| 4 | 5 | -118.59 | 34.23 | 17.0 | 6592.0 | 1525.0 | 4459.0 | 1463.0 | 3.0347 | 254500.0 | <1H OCEAN |


In [3]:
test.head()

|   | id | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16513 | -118.39 | 34.12 | 29.0 | 6447.0 | 1012.0 | 2184.0 | 960.0 | 8.2816 | <1H OCEAN |
| 1 | 16514 | -117.86 | 33.77 | 39.0 | 4159.0 | 655.0 | 1669.0 | 651.0 | 4.6111 | <1H OCEAN |
| 2 | 16515 | -119.05 | 34.21 | 27.0 | 4357.0 | 926.0 | 2110.0 | 876.0 | 3.0119 | <1H OCEAN |
| 3 | 16516 | -118.15 | 34.2 | 52.0 | 1786.0 | 306.0 | 1018.0 | 322.0 | 4.1518 | INLAND |
| 4 | 16517 | -117.68 | 34.07 | 32.0 | 1775.0 | 314.0 | 1067.0 | 302.0 | 4.0375 | INLAND |
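
A quick way to confirm which columns differ between the two files is a set difference on the column names. The sketch below uses two tiny hand-built frames, `toy_train` and `toy_test`, as hypothetical stand-ins for the real files.

```python
import pandas as pd

# Tiny stand-ins for the real train/test files (only a few columns kept).
toy_train = pd.DataFrame({
    "id": [1, 2],
    "median_income": [2.7042, 6.4214],
    "median_house_value": [286600.0, 340600.0],
})
toy_test = pd.DataFrame({
    "id": [16513, 16514],
    "median_income": [8.2816, 4.6111],
})

# Columns present in the training frame but absent from the test frame.
only_in_train = sorted(set(toy_train.columns) - set(toy_test.columns))
print(only_in_train)
```

On the real data, the same expression with `train` and `test` should list only the label column.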


## Creating labels

We first need to split our training set into the features we want to use to train and the labels. I am also dropping `id` as that is not important for modeling the data. 

In [4]:
housing = train.drop(["id","median_house_value"], axis=1)
housing_labels = train["median_house_value"].copy()

In [5]:
housing.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -121.89 | 37.29 | 38.0 | 1568.0 | 351.0 | 710.0 | 339.0 | 2.7042 | <1H OCEAN |
| 1 | -121.93 | 37.05 | 14.0 | 679.0 | 108.0 | 306.0 | 113.0 | 6.4214 | <1H OCEAN |
| 2 | -117.2 | 32.77 | 31.0 | 1952.0 | 471.0 | 936.0 | 462.0 | 2.8621 | NEAR OCEAN |
| 3 | -119.61 | 36.31 | 25.0 | 1847.0 | 371.0 | 1460.0 | 353.0 | 1.8839 | INLAND |
| 4 | -118.59 | 34.23 | 17.0 | 6592.0 | 1525.0 | 4459.0 | 1463.0 | 3.0347 | <1H OCEAN |


In [6]:
housing_labels.head()

0    286600.0
1    340600.0
2    196900.0
3     46300.0
4    254500.0
Name: median_house_value, dtype: float64

## Fill in missing values

When we call `housing.info()`, we notice that not every feature has a value in every row: `total_bedrooms` has only 16354 non-null entries out of 16512. Datasets often have missing values, and it is our job to do something meaningful with them.

In [7]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16512 entries, 0 to 16511
Data columns (total 9 columns):
longitude             16512 non-null float64
latitude              16512 non-null float64
housing_median_age    16512 non-null float64
total_rooms           16512 non-null float64
total_bedrooms        16354 non-null float64
population            16512 non-null float64
households            16512 non-null float64
median_income         16512 non-null float64
ocean_proximity       16512 non-null object
dtypes: float64(8), object(1)
memory usage: 1.1+ MB


To handle this problem, we basically have three options:
1. Get rid of the corresponding districts (rows) that have missing values. 
```
housing.dropna(subset=["total_bedrooms"])
```
2. Delete the problem feature. The option `axis=1` means delete along the columns as opposed to the rows (`axis=0`). 
```
housing.drop('total_bedrooms', axis=1)
```
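3. Set the values to something meaningful (such as zero, the mean, the median, etc.).
```
median = housing['total_bedrooms'].median()
# Assign back to the column only; assigning to `housing` would replace the whole frame with one Series
housing['total_bedrooms'] = housing['total_bedrooms'].fillna(median) # Fills all NaN with the median value
```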
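
To see the three options side by side, here is a minimal sketch on a toy frame (the name `toy` and its values are made up for illustration). Note that `dropna` and `drop` return new frames and leave the original untouched unless you assign the result.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (hypothetical numbers).
toy = pd.DataFrame({
    "total_rooms": [1568.0, 679.0, 1952.0, 1847.0],
    "total_bedrooms": [351.0, np.nan, 471.0, 371.0],
})

# Option 1: drop the rows with a missing value.
opt1 = toy.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole column.
opt2 = toy.drop("total_bedrooms", axis=1)

# Option 3: fill the gaps with the median of the observed values.
median = toy["total_bedrooms"].median()
opt3 = toy.copy()
opt3["total_bedrooms"] = opt3["total_bedrooms"].fillna(median)

print(opt1.shape, opt2.shape, opt3.shape)
print(median)
```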

### A better solution

These options will work, but you will need to keep track of every deletion or addition. All operations you do on your training set will need to be done on your test set as well. For example, if we went with option 2, we would need to drop the `total_bedrooms` feature from the test set as well. But what if the test set has all the values for `total_bedrooms`, but is missing values for `total_rooms`? We would need to delete both features from the training and test set. This can be problematic.

One solution is to use the `SimpleImputer` from `sklearn.impute`. The word impute is synonymous with assign. We wish to use option 3 and replace all the missing values with something meaningful, say the median. On top of that, we want to keep track of **all** the median values of our features so that we can fill in any missing values of our test set (as well as any new data we wish to use).

In [8]:
from sklearn.impute import SimpleImputer
help(SimpleImputer)

Help on class SimpleImputer in module sklearn.impute._base:

class SimpleImputer(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin)
 |  SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)
 |  
 |  Imputation transformer for completing missing values.
 |  
 |  Read more in the :ref:`User Guide <impute>`.
 |  
 |  Parameters
 |  ----------
 |  missing_values : number, string, np.nan (default) or None
 |      The placeholder for the missing values. All occurrences of
 |      `missing_values` will be imputed.
 |  
 |  strategy : string, optional (default="mean")
 |      The imputation strategy.
 |  
 |      - If "mean", then replace missing values using the mean along
 |        each column. Can only be used with numeric data.
 |      - If "median", then replace missing values using the median along
 |        each column. Can only be used with numeric data.
 |      - If "most_frequent", then replace missing using the most fr

Here we will call our instance of `SimpleImputer` the boring name of `imputer`. As we want `imputer` to only work on numerical values, we create a new data frame, `housing_num` that drops the `ocean_proximity` feature. We then use the `fit` method to collect the median values and store them in the `statistics_` attribute.  

In [9]:
imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='median', verbose=0)

We can get the stored statistics of our `imputer` by calling `imputer.statistics_` (this is the state of the `statistics_` attribute of our object).

In [10]:
imputer.statistics_

array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

Notice it is the same as the actual median values of our training data. We have now stored this information so we can use it later on in our pipeline, if needed.

In [11]:
housing_num.median().values

array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

Now we can use this information to transform our data (fill in the missing values). 

In [12]:
X = imputer.transform(housing_num)
X.shape

(16512, 8)
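
To see why the stored statistics matter, here is a small sketch in which the gap sits in a different column in each set. The frames `toy_train` and `toy_test` are hypothetical; the point is that the test gap is filled with the median learned from the training set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: the gap is in a different column in each set.
toy_train = pd.DataFrame({"total_rooms": [100.0, 200.0, 300.0],
                          "total_bedrooms": [10.0, np.nan, 30.0]})
toy_test = pd.DataFrame({"total_rooms": [np.nan, 500.0],
                         "total_bedrooms": [50.0, 60.0]})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_train)                     # medians come from the training set only

filled_test = toy_imputer.transform(toy_test)  # test gaps get the training medians
print(toy_imputer.statistics_)
print(filled_test)
```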

Concerning data types, `X` is a simple NumPy array. It is not too hard to put this back into a Pandas DataFrame. 

In [13]:
housing_tr = pd.DataFrame(X,columns=housing_num.columns)
housing_tr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16512 entries, 0 to 16511
Data columns (total 8 columns):
longitude             16512 non-null float64
latitude              16512 non-null float64
housing_median_age    16512 non-null float64
total_rooms           16512 non-null float64
total_bedrooms        16512 non-null float64
population            16512 non-null float64
households            16512 non-null float64
median_income         16512 non-null float64
dtypes: float64(8)
memory usage: 1.0 MB


## What to do with categorical data?

Now that we have handled missing data in the numerical data, what are we going to do with the categorical stuff? Let's look at what we have. 

In [14]:
housing_cat = housing[['ocean_proximity']]
housing_cat.head(10)

|   | ocean_proximity |
|---|---|
| 0 | <1H OCEAN |
| 1 | <1H OCEAN |
| 2 | NEAR OCEAN |
| 3 | INLAND |
| 4 | <1H OCEAN |
| 5 | INLAND |
| 6 | <1H OCEAN |
| 7 | INLAND |
| 8 | <1H OCEAN |
| 9 | <1H OCEAN |


One idea is to assign a number to each category, e.g. `0 = <1H OCEAN`, `1 = INLAND`, etc. Scikit-learn has a nice class called `OrdinalEncoder` that will do this for us, as well as keep track of the transformation so we can apply it to future data.

In [15]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

array([[0.],
       [0.],
       [4.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.]])
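
To find out which number belongs to which category, you can inspect the encoder's `categories_` attribute. A minimal sketch on a toy column (only three categories, so the codes differ from the full data, where the categories are also sorted):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column with three of the categories seen above.
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["NEAR OCEAN", "INLAND", "<1H OCEAN", "INLAND"]})

toy_ordinal = OrdinalEncoder()
codes = toy_ordinal.fit_transform(toy_cat)

# categories_ holds one array per column; a value's position is its code.
for code, name in enumerate(toy_ordinal.categories_[0]):
    print(code, name)
```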

One issue here is that ML algorithms will assume that two nearby values are more similar than two distant values. This is great for cases such as "bad", "average", "good", etc. But with `ocean_proximity`, this is not the case. We need a better solution. Scikit-learn has a class called `OneHotEncoder` that creates, for each row, a vector whose length equals the number of categories, with a 1 in the position of that row's category and a 0 everywhere else. For example,
`<1H OCEAN` would be associated to `[1,0,0,0,0]` and `INLAND` to `[0,1,0,0,0]`. This creates numerical data that keeps the categories separate.

In [16]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

In [17]:
housing_cat_1hot.toarray()

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
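
The column order of the one-hot vectors follows `categories_`, and every row contains exactly one 1. A small sketch on a toy column (hypothetical data with three categories instead of five):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_cat = pd.DataFrame(
    {"ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR OCEAN", "INLAND"]})

toy_1hot = OneHotEncoder()
sparse_rows = toy_1hot.fit_transform(toy_cat)   # SciPy sparse matrix by default
dense_rows = sparse_rows.toarray()              # dense copy, easier to read

print(toy_1hot.categories_[0])   # column order of the one-hot vectors
print(dense_rows)
```

The sparse format only stores the non-zero entries, which is why the real output above reports 16512 stored elements: one per row.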

## Creating new columns and automating it

Now we would like to perform some new operations on our data. Maybe we want to create new columns. To do this we can use the `FunctionTransformer`. Here we define a function that will transform old data to new data.


In [63]:
from sklearn.preprocessing import FunctionTransformer

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6 #really we should not hard code this

def add_extra_features(X, add_bedrooms_per_room=True):
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]

attr_adder = FunctionTransformer(add_extra_features, validate=False,
                                 kw_args={"add_bedrooms_per_room": False})
housing_extra_attribs = attr_adder.fit_transform(housing.values)



Here is some code to see what the result of this method is.

In [64]:
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity | rooms_per_household | population_per_household |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -121.89 | 37.29 | 38 | 1568 | 351 | 710 | 339 | 2.7042 | <1H OCEAN | 4.62537 | 2.0944 |
| 1 | -121.93 | 37.05 | 14 | 679 | 108 | 306 | 113 | 6.4214 | <1H OCEAN | 6.00885 | 2.70796 |
| 2 | -117.2 | 32.77 | 31 | 1952 | 471 | 936 | 462 | 2.8621 | NEAR OCEAN | 4.22511 | 2.02597 |
| 3 | -119.61 | 36.31 | 25 | 1847 | 371 | 1460 | 353 | 1.8839 | INLAND | 5.23229 | 4.13598 |
| 4 | -118.59 | 34.23 | 17 | 6592 | 1525 | 4459 | 1463 | 3.0347 | <1H OCEAN | 4.50581 | 3.04785 |
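
As the comment in the cell admits, hard-coding the column positions is fragile. One possible alternative, sketched here on a toy frame `toy_num` (hypothetical, with fewer columns than the real data), is to look the positions up by name with `columns.get_loc`:

```python
import numpy as np
import pandas as pd

# Toy numeric frame reusing four of the column names above.
toy_num = pd.DataFrame({
    "total_rooms": [1568.0, 679.0],
    "total_bedrooms": [351.0, 108.0],
    "population": [710.0, 306.0],
    "households": [339.0, 113.0],
})

# Look the positions up by name instead of hard-coding 3, 4, 5, 6.
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    toy_num.columns.get_loc(name)
    for name in ("total_rooms", "total_bedrooms", "population", "households")
]

X = toy_num.values
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
print(rooms_ix, bedrooms_ix, population_ix, household_ix)
print(rooms_per_household)
```

With this approach the indices stay correct even if a column is added or removed earlier in the frame.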


Now it is time to create a pipeline that will keep track of all our changes to the data. This will allow us to incorporate new data as well as run the test set through the same process. Notice we are also running the data through a `StandardScaler`. This rescales each numerical feature to have zero mean and unit variance (the values are not confined to the range 0 to 1; that is what `MinMaxScaler` does), which helps many ML algorithms that are sensitive to feature scale. There are several ways to scale features; look at the documentation for more info.

In [20]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler()),
    ])

This pipeline takes a list of name/estimator pairs defining a sequence of steps. To run it we can call `fit_transform` on the numerical data. 

In [21]:
housing_num_tr = num_pipeline.fit_transform(housing_num)
housing_num_tr

array([[-1.15604281,  0.77194962,  0.74333089, ..., -0.31205452,
        -0.08649871,  0.15531753],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.21768338,
        -0.03353391, -0.83628902],
       [ 1.18684903, -1.34218285,  0.18664186, ..., -0.46531516,
        -0.09240499,  0.4222004 ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.3469342 ,
        -0.03055414, -0.52177644],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.02499488,
         0.06150916, -0.30340741],
       [-1.43579109,  0.99645926,  1.85670895, ..., -0.22852947,
        -0.09586294,  0.10180567]])
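
The negative values in this output come from the scaling step: standardization centers each column on zero. A short sketch on one toy feature column (made-up values) contrasts `StandardScaler` with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One toy feature column (hypothetical values).
x = np.array([[1.0], [2.0], [3.0], [10.0]])

standardized = StandardScaler().fit_transform(x)   # (x - mean) / std
min_maxed = MinMaxScaler().fit_transform(x)        # (x - min) / (max - min)

print(standardized.ravel())   # zero mean, unit variance, can be negative
print(min_maxed.ravel())      # squeezed into the range [0, 1]
```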

Combining the numerical data with the categorical data, we obtain the full pipeline. Notice we need the `ColumnTransformer` to do this. 

In [22]:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

In [24]:
housing_prepared[5]

array([-0.69645635,  0.94500913, -0.37004716,  0.14369276,  0.1314467 ,
        0.02528492,  0.19413836, -0.17643487, -0.11486671, -0.04800274,
       -0.19926409,  0.        ,  1.        ,  0.        ,  0.        ,
        0.        ])

Now our data is ready to teach a machine!

## Selecting a model

### Linear Regression

Here is the linear regression model.

In [25]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

That's it! We can use the `predict` method to see our results.

In [26]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions: ", lin_reg.predict(some_data_prepared))
print("Labels: ", list(some_labels))

Predictions:  [210644.60459286 317768.80697211 210956.43331178  59218.98886849
 189747.55849879]
Labels:  [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]


Now it is time to look at our errors. Here is the root mean squared error (RMSE).

In [27]:
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

68628.19819848923

Here is the mean absolute error (MAE).

In [28]:
from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(housing_labels, housing_predictions)
lin_mae

49439.89599001897
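
Both metrics are easy to compute by hand, which makes clear what they measure. The sketch below uses the five labels and (rounded) predictions printed earlier in this section and compares the hand-rolled values against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# The five labels and (rounded) predictions printed earlier in this section.
y_true = np.array([286600.0, 340600.0, 196900.0, 46300.0, 254500.0])
y_pred = np.array([210644.60, 317768.81, 210956.43, 59218.99, 189747.56])

errors = y_pred - y_true
rmse_by_hand = np.sqrt(np.mean(errors ** 2))   # squaring penalizes large misses more
mae_by_hand = np.mean(np.abs(errors))          # average miss, in the label's units

print(rmse_by_hand, mae_by_hand)
print(np.isclose(rmse_by_hand, np.sqrt(mean_squared_error(y_true, y_pred))))
print(np.isclose(mae_by_hand, mean_absolute_error(y_true, y_pred)))
```

The RMSE is never smaller than the MAE on the same errors; a large gap between the two hints at a few very large misses.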

### Decision Tree

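Notice this model seems super fly: it reports zero error on the training data. That is far more likely a sign that the tree has badly overfit than that the model is perfect, so we should not trust this number.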

In [29]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

0.0
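
A zero training error is typical of an unrestricted decision tree, which can memorize its training set. The sketch below illustrates this on synthetic data (not the housing set) by holding out a validation split; all names and numbers here are made up for the illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy linear target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# Error on data the tree has seen versus data it has not seen.
train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))
print(train_rmse, val_rmse)
```

The training error should be essentially zero while the validation error is not, which is the gap that signals overfitting.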

### Random Forest

In [40]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse



22149.80457453486

## Cross Validation

In [31]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

The decision tree scores are actually not that good :(.

In [32]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

Scores: [70194.33680785 66855.16363941 72432.58244769 70758.73896782
 71115.88230639 75585.14172901 70262.86139133 70273.6325285
 75366.87952553 71231.65726027]
Mean: 71407.68766037929
Standard deviation: 2439.4345041191004


Here are the linear regression scores.

In [33]:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [66782.73843989 66960.118071   70347.95244419 74739.57052552
 68031.13388938 71193.84183426 64969.63056405 68281.61137997
 71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard deviation: 2731.674001798342


And the forest scores.

In [34]:
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

Scores: [51441.98152668 50233.14678114 53022.38310738 54484.02740789
 52136.53810838 55814.97727589 51372.86763109 50865.01493639
 56274.95209222 52294.8164913 ]
Mean: 52794.07053583609
Standard deviation: 1973.8198160865963
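
For intuition about what `cross_val_score` is doing in the cells above, here is a rough hand-written equivalent on synthetic data. It assumes the default behavior for a regressor with an integer `cv`, which is unshuffled K-fold splitting; `make_regression` stands in for the prepared housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for housing_prepared / housing_labels.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Train on k-1 folds, score on the held-out fold, repeat for every fold.
manual_rmse = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mse = mean_squared_error(y[val_idx], model.predict(X[val_idx]))
    manual_rmse.append(np.sqrt(mse))

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
library_rmse = np.sqrt(-scores)   # scores are negated MSE, hence the minus sign

print(np.allclose(manual_rmse, library_rmse))
```

The scoring is negated because scikit-learn treats scores as "higher is better", so an error metric has to be flipped.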


## Grid Search

A grid search trains the model once for every combination of hyperparameter values and every cross-validation fold, so we can see which combination scores best. To see what parameters you can use, try the help function.

In [53]:
help(forest_reg)

Help on RandomForestRegressor in module sklearn.ensemble.forest object:

class RandomForestRegressor(ForestRegressor)
 |  RandomForestRegressor(n_estimators='warn', criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False)
 |  
 |  A random forest regressor.
 |  
 |  A random forest is a meta estimator that fits a number of classifying
 |  decision trees on various sub-samples of the dataset and uses averaging
 |  to improve the predictive accuracy and control over-fitting.
 |  The sub-sample size is always the same as the original
 |  input sample size but the samples are drawn with replacement if
 |  `bootstrap=True` (default).
 |  
 |  Read more in the :ref:`User Guide <forest>`.
 |  
 |  Parameters
 |  ----------
 |  n_estimators : integer, optional (d

Here we are going to try a few combinations of the parameters. Keep in mind, the more combinations you try, the longer your search will take.
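
The cost can be counted before running anything. A minimal sketch, assuming the same two-part grid used in this section, with `ParameterGrid` from `sklearn.model_selection` doing the enumeration:

```python
from sklearn.model_selection import ParameterGrid

# The same two-part grid as in the search cell of this section.
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

n_combinations = len(ParameterGrid(param_grid))   # 3*4 + 1*2*3
n_folds = 5
n_fits = n_combinations * n_folds                 # one fit per combination per fold
print(n_combinations, n_fits)
```

With the default `refit=True`, `GridSearchCV` performs one more fit at the end, training the best combination on the whole training set.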

In [54]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    # 12 combinations here
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # 6 combinations here
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators='warn', n_jobs=None,
                                             oob_score=False, random_state=42,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid=[{'max_features': [2, 4, 6, 8],
  

Once this finishes, you can look at the best parameter combination using the `best_params_` attribute. 

In [55]:
grid_search.best_params_

{'max_features': 8, 'n_estimators': 30}

We can also get the best estimator itself, already configured with those parameters.

In [56]:
grid_search.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features=8, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=30,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)
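
Because `GridSearchCV` refits the best combination on the full training set by default (`refit=True`), `best_estimator_` can be used for predictions right away. A small sketch on synthetic data, with a deliberately tiny hypothetical grid so it runs quickly:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared housing features.
X, y = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

toy_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [3, 10], "max_features": [2, 4]},
    cv=3,
    scoring="neg_mean_squared_error",
)
toy_search.fit(X, y)

# With the default refit=True, this model is already fitted on all of X, y.
best_model = toy_search.best_estimator_
preds = best_model.predict(X[:3])
print(toy_search.best_params_, preds.shape)
```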

We can also look at all the results of the parameters.

In [57]:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

63669.05791727153 {'max_features': 2, 'n_estimators': 3}
55627.16171305252 {'max_features': 2, 'n_estimators': 10}
53384.57867637289 {'max_features': 2, 'n_estimators': 30}
60965.99185930139 {'max_features': 4, 'n_estimators': 3}
52740.98248528835 {'max_features': 4, 'n_estimators': 10}
50377.344409590376 {'max_features': 4, 'n_estimators': 30}
58663.84733372485 {'max_features': 6, 'n_estimators': 3}
52006.15355973719 {'max_features': 6, 'n_estimators': 10}
50146.465964159885 {'max_features': 6, 'n_estimators': 30}
57869.25504027614 {'max_features': 8, 'n_estimators': 3}
51711.09443660957 {'max_features': 8, 'n_estimators': 10}
49682.25345942335 {'max_features': 8, 'n_estimators': 30}
62895.088889905004 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.14484390074 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.399594730654 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52725.01091081235 {'bootstrap': False, 'max_features': 3, 'n_estimators'

With this knowledge, we can define a new random forest model.

In [58]:
forest_reg = RandomForestRegressor(max_features=8, n_estimators=30)
forest_reg.fit(housing_prepared, housing_labels)

housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

19361.154389599888

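Nice! Keep in mind this RMSE is measured on the training data, so it is optimistic; the cross-validated grid search score for the same parameters (about 49,682) is a more honest estimate. If we are ok with this result, we can apply our model to the test file.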

## Creating a submission file

Here we will use our model to create a submission file. All we need is the test file, our full pipeline, and our `forest_reg` model. 

In [59]:
test.head()

|   | id | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16513 | -118.39 | 34.12 | 29.0 | 6447.0 | 1012.0 | 2184.0 | 960.0 | 8.2816 | <1H OCEAN |
| 1 | 16514 | -117.86 | 33.77 | 39.0 | 4159.0 | 655.0 | 1669.0 | 651.0 | 4.6111 | <1H OCEAN |
| 2 | 16515 | -119.05 | 34.21 | 27.0 | 4357.0 | 926.0 | 2110.0 | 876.0 | 3.0119 | <1H OCEAN |
| 3 | 16516 | -118.15 | 34.2 | 52.0 | 1786.0 | 306.0 | 1018.0 | 322.0 | 4.1518 | INLAND |
| 4 | 16517 | -117.68 | 34.07 | 32.0 | 1775.0 | 314.0 | 1067.0 | 302.0 | 4.0375 | INLAND |


Now run the test file through the pipeline to prep it for the model. Keep in mind, if it is missing values, our imputer will fill in the values. 

In [61]:
housing_test_prepared = full_pipeline.transform(test)
housing_test_prepared

array([[ 0.59238393, -0.71074948,  0.02758786, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.8571457 , -0.87445443,  0.8228579 , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.26268061, -0.66865392, -0.13146615, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.54242889, -0.68268578,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.12690297, -0.77155418, -0.13146615, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.33261768,  0.53808541, -0.76768218, ...,  0.        ,
         0.        ,  0.        ]])
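
The important detail in this step is calling `transform`, not `fit_transform`, on the test data, so that the medians, scaling factors, and categories learned from the training set are reused. A minimal sketch with hypothetical toy frames and a two-column version of the pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frames; the test frame has a missing value.
toy_train = pd.DataFrame({"median_income": [2.0, 4.0, 6.0],
                          "ocean_proximity": ["INLAND", "NEAR OCEAN", "INLAND"]})
toy_test = pd.DataFrame({"median_income": [np.nan, 8.0],
                         "ocean_proximity": ["NEAR OCEAN", "INLAND"]})

toy_pipeline = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("std_scaler", StandardScaler())]), ["median_income"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

toy_pipeline.fit(toy_train)                       # learn medians, scales, categories
test_prepared = toy_pipeline.transform(toy_test)  # reuse them; never refit on test
print(test_prepared)
```

Here the missing income is filled with the training median (4.0), which is also the training mean, so it becomes 0 after standardization.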

Now we can run our prediction on the prepared data. 

In [62]:
final_predictions = forest_reg.predict(housing_test_prepared)
final_predictions

array([489500.86666667, 312096.66666667, 209906.66666667, ...,
       324250.03333333, 172806.66666667, 151176.66666667])

These are our predictions! Now we just create a csv file with the proper format. 

In [50]:
test_ID = test["id"]
output = pd.DataFrame({'Id': test_ID,
                        'median_house_value': final_predictions})

In [51]:
output.head()

|   | Id | median_house_value |
|---|---|---|
| 0 | 16513 | 484950.9 |
| 1 | 16514 | 247150.0 |
| 2 | 16515 | 248120.0 |
| 3 | 16516 | 215020.0 |
| 4 | 16517 | 147340.0 |


In [52]:
output.to_csv('./housing_data/submission.csv', index=False)
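
Before uploading, a few cheap sanity checks can catch a malformed file. The sketch below uses hypothetical ids and predictions in place of `test["id"]` and `final_predictions`; the column names are copied from the cell above, and whether they match what the competition expects is an assumption you should verify on the competition page.

```python
import numpy as np
import pandas as pd

# Hypothetical ids and predictions standing in for test["id"] and final_predictions.
toy_ids = pd.Series([16513, 16514, 16515], name="id")
toy_predictions = np.array([489500.9, 312096.7, 209906.7])

toy_output = pd.DataFrame({"Id": toy_ids,
                           "median_house_value": toy_predictions})

# Checks worth running before uploading.
assert len(toy_output) == len(toy_ids)                   # one row per test id
assert toy_output["Id"].is_unique                        # no duplicated ids
assert toy_output["median_house_value"].notna().all()    # no missing predictions
print(toy_output.columns.tolist())
```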

Finally, upload this submission file to Kaggle and cross your fingers!