# Pre-processing exercise

In this exercise you will pre-process the [California housing dataset](https://www.kaggle.com/camnugent/california-housing-prices).


The pre-processing must include the following steps:
- [ ] Basic inspection of the data. 
- [ ] Dealing with missing values (you can choose what to do with them in each case). 
- [ ] Finding outliers and deciding what to do with them.
- [ ] Extract new variables. 
- [ ] Transform all categorical variables into one-hot-encoding variables. 
- [ ] Transform the numerical variables; you can use MinMax scaling, standardization, Box-Cox, or any other transformation that makes sense.
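
As a rough illustration of the last three items in the checklist, the sketch below works on a small made-up table rather than the real file. The column names `total_rooms`, `households`, and `median_income` are assumed to match the dataset, the derived `rooms_per_household` column is only one possible new variable, and the min-max formula is written out by hand (scikit-learn's `MinMaxScaler` would be an alternative).

```python
import pandas as pd

# Toy stand-in for the housing table; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
        "households": [126.0, 1138.0, 177.0, 219.0],
        "median_income": [8.3, 8.3, 7.3, 5.6],
        "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
    }
)

# Extract a new variable: average number of rooms per household.
toy["rooms_per_household"] = toy["total_rooms"] / toy["households"]

# One-hot encode the categorical column (one 0/1 column per category).
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)

# Min-max scale the numerical columns to the [0, 1] range.
numeric_cols = ["total_rooms", "households", "median_income", "rooms_per_household"]
col_min = encoded[numeric_cols].min()
col_max = encoded[numeric_cols].max()
encoded[numeric_cols] = (encoded[numeric_cols] - col_min) / (col_max - col_min)

print(encoded.round(3))
```

Scaling after encoding keeps the 0/1 indicator columns untouched, because only the columns listed in `numeric_cols` are transformed.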


In [None]:
# %matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd
from collections import Counter
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"
pd.set_option("display.precision", 3)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
housing = pd.read_csv("housing.csv", header=0, delimiter=",")
housing.head()

In [None]:
housing.plot(
    kind="scatter",
    x="longitude",
    y="latitude",
    alpha=0.4,
    s=housing["population"] / 100,
    label="population",
    figsize=(10, 7),
    c="median_house_value",
    cmap=plt.get_cmap("jet"),
    colorbar=True,
)

In [None]:
housing.describe()

In [None]:
housing.isna().sum()

In [None]:
subset_of_housing = housing.drop(columns=["total_bedrooms", "ocean_proximity"])

In [None]:
subset_of_housing.shape
housing.shape

In [None]:
rows_not_missing = housing.total_bedrooms.notna()
rows_missing = housing.total_bedrooms.isna()
rows_not_missing.sum()
rows_missing.sum()

In [None]:
housing_without_missings = subset_of_housing[rows_not_missing]
housing_without_missings.shape

In [None]:
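# total_bedrooms is numeric, so a regressor fits the task better than a classifier.
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(housing_without_missings, housing.total_bedrooms[rows_not_missing])
imputed_values = knn.predict(subset_of_housing[rows_missing])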

In [None]:
# Assign through .loc so that housing itself is modified (no chained assignment).
housing.loc[rows_missing, "total_bedrooms"] = imputed_values
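
The nearest-neighbour search above runs on unscaled columns, so features with large ranges (for example `population`) are likely to dominate the distance. One possible variant, sketched here on made-up data, standardizes the features inside a scikit-learn pipeline before the 1-nearest-neighbour step. The toy values and the choice of `StandardScaler` are assumptions for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made-up values); one total_bedrooms entry is missing.
toy = pd.DataFrame(
    {
        "total_rooms": [800.0, 820.0, 5000.0, 5100.0, 810.0],
        "households": [120.0, 125.0, 900.0, 950.0, 122.0],
        "total_bedrooms": [130.0, 135.0, 1000.0, 1050.0, np.nan],
    }
)

features = ["total_rooms", "households"]
missing = toy["total_bedrooms"].isna()

# Scale the features so that no single column dominates the distance,
# then predict the missing value from the single nearest complete row.
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=1))
model.fit(toy.loc[~missing, features], toy.loc[~missing, "total_bedrooms"])

# Write the predictions back with .loc to avoid chained assignment.
toy.loc[missing, "total_bedrooms"] = model.predict(toy.loc[missing, features])

print(toy)
```

Because the scaler lives inside the pipeline, it is fitted only on the complete rows and then reused unchanged when predicting for the rows with missing values.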

In [None]:
housing.isna().sum()

In [None]:
housing.total_bedrooms.hist()

In [None]:
housing.describe()
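
The histogram and summary statistics are a natural starting point for the outlier step. A minimal sketch of one common approach, Tukey's 1.5 × IQR rule, is shown below on made-up values. Both the rule and the clipping treatment are assumptions chosen for illustration; dropping rows or applying a log transform may suit the real columns better.

```python
import pandas as pd

# Made-up values; the last one is deliberately extreme.
values = pd.Series(
    [420.0, 510.0, 480.0, 530.0, 450.0, 495.0, 6200.0], name="total_bedrooms"
)

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: points more than 1.5 * IQR outside the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(values[is_outlier])

# One possible treatment: clip flagged values to the fences instead of dropping rows.
clipped = values.clip(lower=lower, upper=upper)
```

Clipping keeps every row in the table, which can matter when the other columns of a flagged row are still informative.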