* **Objective**:
The main objective of this project is to build a model that predicts a district’s median housing price. This prediction will help determine whether it is worth investing in a given area. More specifically, our model’s output will be fed to another machine learning system, along with some other signals, so it’s important to make our housing price model as accurate as we can.


* **Current solution**:
Housing prices are currently estimated manually by experts using complex rules.

## Download the data

In [None]:
# Import important libraries for downloading the data.

import urllib.request
from pathlib import Path
import tarfile # to extract the .tgz archive
import pandas as pd

In [14]:
# Get the data

def get_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():  # download the tarball only if it is not already there
        Path("datasets").mkdir(parents=True, exist_ok=True)

        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)

    # Extract the tarball into the datasets directory
    with tarfile.open(tarball_path) as housing_tarball:
        housing_tarball.extractall(path="datasets")

    # Read the CSV file into a DataFrame
    return pd.read_csv("datasets/housing/housing.csv")

housing_full = get_housing_data()



## A Quick Look at the Data

In [13]:
housing_full.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [None]:
housing_full.info()

* All features are numerical, except ocean_proximity.
* Our dataset has 20,640 districts, i.e., rows (data samples).
* It contains 10 columns.
* total_bedrooms has missing values.

In [None]:
housing_full.describe()

In [None]:
housing_full[housing_full["total_bedrooms"].isna()].shape[0]

* total_bedrooms has 207 missing values.

In [None]:
housing_full['ocean_proximity'].unique()

In [None]:
housing_full["ocean_proximity"].value_counts()

* The column ocean_proximity has 5 unique values.

#### Data Visualization at a Glance

In [None]:
import matplotlib.pyplot as plt

In [None]:
housing_full.hist(bins=50, figsize=(12,8))

plt.show()

* Some distributions are right-skewed. As a result, some models may struggle to find patterns in such data.
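One common way to reduce right skew is a log transform. The sketch below is only an illustration on synthetic log-normal data (the `skewed` series is made up and merely stands in for a feature such as population); it is not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed data (assumption: stands in for a skewed feature).
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

# log1p computes log(1 + x), which is safe for zero values.
log_transformed = np.log1p(skewed)

print(f"skewness before: {skewed.skew():.2f}")
print(f"skewness after:  {log_transformed.skew():.2f}")
```

The skewness after the transform should be noticeably closer to zero, meaning the distribution is more symmetric.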

#### Create a Test Set

In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

* Based on the conversation with the client, the income distribution really matters. So it is better to use stratified sampling based on income categories instead of purely random sampling.

* If you split train/test randomly, some income groups may disappear or be under-represented.
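To see the difference, here is a minimal sketch on a toy dataset (the `toy` frame and its `group` column are invented for illustration). It compares the category proportions of a random test set and a stratified test set with the overall proportions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data (assumption): 1,000 rows with an imbalanced category.
toy = pd.DataFrame({
    "group": rng.choice(["low", "mid", "high"], size=1000, p=[0.70, 0.25, 0.05])
})

# Purely random split vs. split stratified on the category.
_, random_test = train_test_split(toy, test_size=0.2, random_state=42)
_, strat_test = train_test_split(toy, test_size=0.2, random_state=42,
                                 stratify=toy["group"])

# Proportion of each category overall and in each test set.
comparison = pd.DataFrame({
    "overall": toy["group"].value_counts(normalize=True),
    "random": random_test["group"].value_counts(normalize=True),
    "stratified": strat_test["group"].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

The stratified column should track the overall proportions closely, while the random column can drift, especially for the rare category.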

In [None]:
housing_full.head()

In [None]:
housing_full["income_cat"] = pd.cut(housing_full["median_income"],
                                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                    labels=[1, 2, 3, 4, 5])

In [None]:
housing_full["income_cat"].head()

In [None]:
cat_counts = housing_full["income_cat"].value_counts().sort_index()
cat_counts

In [None]:
cat_counts.plot(kind="bar", rot=0, grid=True)

plt.xlabel("Income category")
plt.ylabel("Number of districts")
plt.show()

In [None]:
housing_full.head()

In [None]:
# Stratified sampling based on the income category.

strat_train_set, strat_test_set = train_test_split(
    housing_full, test_size=0.2, random_state=42,
    stratify=housing_full["income_cat"])

In [None]:
strat_train_set.shape

In [None]:
strat_test_set.shape

In [None]:
strat_test_set.shape[0]/housing_full.shape[0]

## Data Visualization to Gain Insights

In [None]:
housing = strat_train_set.copy() # housing represents training data set.

### Visualize Geographical Data

In [None]:
housing.head()

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.2, grid=True)

plt.show()

* It looks like California, which makes sense: the dataset covers housing districts in California.
* The districts are densest along the coast, which also makes sense.

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude",
             s=housing["population"]/100, label="population",
             c="median_house_value", cmap="jet")

In [None]:
# just for graph comparison

housing.plot(kind="scatter", x="longitude", y="latitude", grid=True,
             s=housing["population"]/100, label="population",
             c="median_house_value", cmap="jet", colorbar=True,
             legend=True, sharex=False, figsize=(10,7))

### Correlations

In [None]:
housing.corr(numeric_only=True)

In [None]:
corr_matrix = housing.corr(numeric_only=True)["median_house_value"]
corr_matrix

In [None]:
corr_matrix.sort_values(ascending=False)

* median_income is the attribute most strongly correlated with median_house_value.
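Keep in mind that the correlation coefficient returned by `corr()` (Pearson by default) only measures linear relationships. A small sketch on made-up data shows that a perfectly predictable but nonlinear relationship can have a correlation of about zero:

```python
import numpy as np
import pandas as pd

# Made-up data: x is symmetric around zero.
x = np.linspace(-1, 1, 201)
toy = pd.DataFrame({
    "x": x,
    "linear": 3 * x + 1,   # perfectly linear in x
    "quadratic": x ** 2,   # fully determined by x, but not linear
})

corr_with_x = toy.corr()["x"]
print(corr_with_x)
```

Here `linear` has a correlation of 1 with `x`, while `quadratic` is close to 0 even though it depends entirely on `x`. A low correlation therefore does not prove that an attribute is useless.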

In [None]:
from pandas.plotting import scatter_matrix

# let's go crazy: plot every numerical attribute against every other one

scatter_matrix(housing)

plt.show()

Oh, no! There are too many plots to read, so let's focus on a few promising attributes.

In [None]:
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]

scatter_matrix(housing[attributes], figsize=(12,8))

plt.show()

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value")

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.2, grid=True)

* There are horizontal lines around 500k, 450k, 275k, 230k, ...
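Horizontal lines in a scatter plot mean that many rows share exactly the same target value. One plausible cause, which is an assumption here, is that values were capped or rounded when the data was collected. A quick way to check is to count the most frequent values. The sketch below uses synthetic prices with an artificial cap at 500,000 (the `prices` series is hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic prices (assumption), clipped to imitate a floor and a cap.
prices = (
    pd.Series(rng.normal(300_000, 150_000, 5_000))
    .clip(lower=15_000, upper=500_000)
    .round(-2)  # round to the nearest hundred
)

# Values that occur suspiciously often show up at the top.
top = prices.value_counts().head(3)
print(top)
```

Running the same `value_counts()` check on `housing["median_house_value"]` would show which exact values are behind the lines in the real data.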

## Prepare the Data for Machine Learning Algorithms

* Let's separate the predictors and the labels.

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1) # X_train
housing_labels = strat_train_set["median_house_value"].copy() # y_train

In [None]:
housing.head(2)

In [None]:
housing_labels.head()

### Clean Data

* Missing values

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

In [None]:
# the median can only be computed on numerical attributes

housing_num = housing.select_dtypes(include="number")

In [None]:
housing_num.head(2)

In [None]:
imputer.fit(housing_num)

In [None]:
housing_num.isna().sum()

In [None]:
imputer.statistics_

In [None]:
housing_num_filled = imputer.transform(housing_num)

In [None]:
housing_num_filled

In [None]:
housing_tr = pd.DataFrame(housing_num_filled, columns=housing_num.columns, index=housing_num.index)

In [None]:
housing_tr.head(1) # it is a pandas DataFrame again
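The same fit/transform steps can be checked on a tiny made-up frame, where the medians are easy to verify by hand (the `toy` data and column names are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up frame with one missing value per column.
toy = pd.DataFrame({
    "rooms": [2.0, 4.0, np.nan, 10.0],
    "age": [10.0, np.nan, 30.0, 50.0],
})

toy_imputer = SimpleImputer(strategy="median")

# fit() learns the medians; transform() fills the gaps with them.
filled = pd.DataFrame(toy_imputer.fit_transform(toy),
                      columns=toy.columns, index=toy.index)

print(toy_imputer.statistics_)  # learned medians, one per column
print(filled)
```

The medians are learned from the non-missing values only, so the missing `rooms` entry becomes 4.0 and the missing `age` entry becomes 30.0.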

### Handling Categorical Attributes

In [None]:
housing_cat = housing[["ocean_proximity"]]
housing_cat.head()

In [None]:
housing_cat["ocean_proximity"].unique()

In [None]:
# Note: the ordinal encoder is great for ordered categories; it is not recommended for unordered categories such as cities

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

In [None]:
housing_cat_encoded

In [None]:
ordinal_encoder.categories_

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

In [None]:
housing_cat_1hot

In [None]:
# Use the .toarray() method to convert the sparse matrix to a dense NumPy array. Note that the dense array uses more memory.

housing_cat_1hot.toarray()

Alternatively, you can set sparse_output=False when creating the OneHotEncoder (note: the sparse hyperparameter was renamed to sparse_output in Scikit-Learn 1.2).

In [None]:
cat_encoder.categories_
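To make the difference between the two encoders concrete, here is a small side-by-side sketch on four made-up rows (the `toy_cat` frame is illustrative, not the real data):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Four made-up rows with three distinct categories.
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]}
)

# Ordinal encoding: one integer code per category (sorted alphabetically).
ordinal = OrdinalEncoder().fit_transform(toy_cat)

# One-hot encoding: one binary column per category.
one_hot = OneHotEncoder().fit_transform(toy_cat).toarray()

print(ordinal.ravel())
print(one_hot)
```

The ordinal codes imply that INLAND (0) is closer to ISLAND (1) than to NEAR BAY (2), which is meaningless for this attribute. The one-hot columns carry no such implied order.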

### Feature Scaling and Transformation

In [None]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler(feature_range=(-1,1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)


In [None]:
housing_num_min_max_scaled

In [None]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)

In [None]:
housing_num_std_scaled
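The effect of the two scalers is easier to see on a single made-up column (the array `X` below is illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature with five evenly spaced values.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Min-max scaling maps the smallest value to -1 and the largest to 1.
min_max = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Standardization subtracts the mean and divides by the standard deviation.
std = StandardScaler().fit_transform(X)

print(min_max.ravel())
print(std.ravel())
```

Min-max scaling bounds the values to a fixed range, so a single outlier squeezes everything else together. Standardization has no fixed range but is less affected by outliers.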

In [None]:
from sklearn.linear_model import LinearRegression
target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())

model = LinearRegression()
model.fit(housing[["median_income"]], scaled_labels)

some_new_data = housing[["median_income"]].iloc[:5]  # pretend this is new data
scaled_predictions = model.predict(some_new_data)

predictions = target_scaler.inverse_transform(scaled_predictions)

In [None]:
from sklearn.compose import TransformedTargetRegressor

model = TransformedTargetRegressor(LinearRegression(),
                                   transformer=StandardScaler())

model.fit(housing[["median_income"]], housing_labels)
predictions = model.predict(some_new_data)
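As a sanity check, the manual approach (scale the labels, train, then inverse-transform the predictions) and `TransformedTargetRegressor` should give the same predictions. The sketch below verifies this on synthetic data; `X` and `y` are made up for the purpose.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (assumption): one feature and a noisy linear target.
X = rng.uniform(0, 10, size=(50, 1))
y = 40_000 * X[:, 0] + 50_000 + rng.normal(0, 5_000, size=50)

# Manual approach: scale the target, fit, then undo the scaling.
scaler = StandardScaler()
y_scaled = scaler.fit_transform(y.reshape(-1, 1))
manual_model = LinearRegression().fit(X, y_scaled)
manual_pred = scaler.inverse_transform(manual_model.predict(X[:5]))

# Wrapped approach: the regressor handles both steps internally.
wrapped = TransformedTargetRegressor(regressor=LinearRegression(),
                                     transformer=StandardScaler())
wrapped.fit(X, y)
wrapped_pred = wrapped.predict(X[:5])

print(np.allclose(manual_pred.ravel(), wrapped_pred))
```

The wrapped version is shorter and removes the risk of forgetting the inverse transform.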

### Transformation Pipelines

In [None]:
# a small pipeline for numerical attributes

from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("standardized", StandardScaler()),
])

num_pipeline

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_selector, make_column_transformer

num_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler()
)

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
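    # handle_unknown="ignore": a category not seen during fit is encoded
    # as all zeros at transform time instead of raising an error
    OneHotEncoder(handle_unknown="ignore")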
)


preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object))
)
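The `handle_unknown="ignore"` setting matters when new data contains a category the encoder never saw during `fit`. A minimal sketch with made-up rows:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit on two categories only (made-up training rows).
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY"]}))

# "ISLAND" was never seen during fit.
new_data = pd.DataFrame({"ocean_proximity": ["INLAND", "ISLAND"]})
encoded = encoder.transform(new_data).toarray()

print(encoded)
```

The known category gets its usual one-hot row, while the unseen one becomes a row of zeros. With the default `handle_unknown="error"`, the `transform` call would raise an exception instead.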

In [None]:
housing_prepared = preprocessing.fit_transform(housing)
housing_prepared

In [None]:
preprocessing.get_feature_names_out()