### California Housing Prices

https://www.kaggle.com/camnugent/california-housing-prices


1. longitude: A measure of how far west a house is; California longitudes are negative, so a lower (more negative) value is farther west

2. latitude: A measure of how far north a house is; a higher value is farther north

3. housingMedianAge: Median age of a house within a block; a lower number is a newer building

4. totalRooms: Total number of rooms within a block

5. totalBedrooms: Total number of bedrooms within a block

6. population: Total number of people residing within a block

7. households: Total number of households within a block, where a household is a group of people residing within a home unit

8. medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)

9. medianHouseValue: Median house value for households within a block (measured in US Dollars)

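10. oceanProximity: Location of the house relative to the ocean (a categorical feature)

In `housing.csv` the columns use snake_case names (for example `median_house_value` and `ocean_proximity`), which is how the code refers to them.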

In [None]:
import pandas as pd

df = pd.read_csv("../data/housing.csv")

In [None]:
df.info()

In [None]:
model_features = df.columns.drop('median_house_value')
model_target = 'median_house_value'

In [None]:
import numpy as np

# Keep only the numeric feature columns; the categorical ocean_proximity column is excluded
numerical_features_all = df[model_features].select_dtypes(include=np.number).columns
print('Numerical columns:', numerical_features_all)
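
As a quick illustration of what `select_dtypes` does here, the sketch below applies it to a toy frame. The values are made up for the example and only the column names mirror `housing.csv`; `toy`, `numeric_cols`, and `categorical_cols` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror housing.csv
toy = pd.DataFrame({
    'longitude': [-122.23, -118.30],
    'median_income': [8.3, 3.1],
    'ocean_proximity': ['NEAR BAY', 'INLAND'],
})

# include=np.number keeps integer and float columns
numeric_cols = toy.select_dtypes(include=np.number).columns
# exclude=np.number keeps everything else (here, the text column)
categorical_cols = toy.select_dtypes(exclude=np.number).columns

print('Numerical columns:', list(numeric_cols))
print('Other columns:', list(categorical_cols))
```

The text column is dropped from the numeric selection, which is why `ocean_proximity` does not reach the model in this notebook.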

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

for c in numerical_features_all:
    print(c)
    df[c].plot.hist(bins=10)
    plt.show()

In [None]:
# matplotlib was already imported and set to inline mode above

df['ocean_proximity'].value_counts().plot.bar()
plt.show()

In [None]:
df[numerical_features_all].corr()


In [None]:
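# Scatter plot of the numeric columns at positions 3 and 4
# (total_rooms and total_bedrooms in the standard housing.csv column order)
df.plot.scatter(x=numerical_features_all[3], y=numerical_features_all[4])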
plt.show()

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

In [None]:
regressor = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ('estimator', KNeighborsRegressor(n_neighbors = 3))
])
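
To make the role of `n_neighbors = 3` concrete, here is a minimal sketch on made-up one-dimensional data (not the housing data). With scikit-learn's default uniform weights, the prediction is the average of the targets of the three nearest training points. The names `X_toy`, `y_toy`, and `query` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data: one feature, four points
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 10.0, 20.0, 100.0])

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_toy, y_toy)

# The three nearest neighbors of 1.2 are x=1, x=2 and x=0,
# so the prediction is the mean of their targets: (10 + 20 + 0) / 3
query = np.array([[1.2]])
prediction = knn.predict(query)[0]
print(prediction)  # 10.0
```

Because neighbors are chosen by distance, a feature with a large numeric range can dominate the result, which is presumably why the pipeline scales the features before the estimator.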

In [None]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df, test_size=0.1, shuffle=True, random_state=42)

X_train = train_data[numerical_features_all]
y_train = train_data[model_target]

regressor.fit(X_train, y_train)

In [None]:
from sklearn.metrics import mean_absolute_error


# Get test data to evaluate the regressor
X_test = test_data[numerical_features_all]
y_test = test_data[model_target]

# Use the fitted model to make predictions on the test dataset
# When test data goes through the Pipeline, it is first imputed (with the means learned from the training data),
# then scaled (with the min/max learned from the training data), and finally used to make predictions
test_predictions = regressor.predict(X_test)

# Mean absolute error (MAE) of the k-NN regressor on the test set, in US dollars
knn_mae = mean_absolute_error(y_test, test_predictions)
knn_mae
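
The comment in the cell above says that test data is imputed and scaled with statistics learned from the training data. The sketch below checks that behavior on a tiny made-up array (not `housing.csv`), using only the preprocessing steps of the pipeline. The names `prep`, `X_train_demo`, and `X_test_demo` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# Made-up data: one feature, with a missing value in both train and test
X_train_demo = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test_demo = np.array([[np.nan], [9.0]])

prep = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
])
prep.fit(X_train_demo)

# Statistics learned from the training data only
train_mean = prep.named_steps['imputer'].statistics_[0]  # mean of 1, 3, 5
train_min = prep.named_steps['scaler'].data_min_[0]
train_max = prep.named_steps['scaler'].data_max_[0]
print(train_mean, train_min, train_max)  # 3.0 1.0 5.0

# The missing test value becomes the train mean (3.0), then both values
# are scaled with the train min/max: (3 - 1) / 4 and (9 - 1) / 4
X_test_prepared = prep.transform(X_test_demo)
print(X_test_prepared.ravel())  # [0.5 2. ]
```

Note that the scaled test value 2.0 falls outside the [0, 1] range, since `MinMaxScaler` does not clip by default when test values exceed the training maximum.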