### Step 1: Frame the problem

The first question to ask your stakeholder, manager, or boss is: what is the business objective? <br>
Useful questions include: <br>
1. What will the model be used for? In other words, what business problem are you trying to address?
2. How does the company expect to benefit from the model?
3. What metric will we use to call this project a success, and how will we measure it?

#### Problem: Predict the district's median house price

Questions to ask: <br>
- Do we currently have the required data, or do we first need to start collecting it?
- If the data is available, how was it collected and what are the data sources?
- If the data is available, what attributes/features would you need to start making predictions?
- Is there a current solution in place, or are we starting from scratch?

#### What sort of problem is it?
It appears to be a supervised learning problem and, more specifically, a **multiple regression** problem, since the model uses several features to predict a single numeric value.

#### Performance Measure
Root Mean Square Error
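
RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same unit as the target (here, dollars) and penalizes large errors more heavily. As a rough illustration, the sketch below computes it with NumPy; the function name `rmse` and the sample prices are made up for this example and are not taken from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up example values (not taken from the housing data)
actual = [200_000, 150_000, 300_000]
predicted = [210_000, 140_000, 330_000]
print(rmse(actual, predicted))
```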

### Step 2: Download or Load the Dataset

In [2]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

In [5]:
import pandas as pd

df = load_housing_data()

df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


### Step 3: Exploratory Data Analysis

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


### Step 4: Prepare the Test Dataset

In [97]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [100]:
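# Note: despite its name, avg_room is the rounded ratio of total rooms to total bedrooms per district
df['avg_room'] = (df['total_rooms'] / df['total_bedrooms']).round()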

In [101]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | avg_room |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | 7.0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | 6.0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | 8.0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY | 5.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | 6.0 |


In [102]:
import numpy as np
X = df.drop(columns=['total_rooms','total_bedrooms','median_house_value','ocean_proximity'])
y = df['median_house_value']
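
One caveat worth checking at this point: the `df.info()` output reports only 20433 non-null `total_bedrooms` values out of 20640, so `avg_room` inherits 207 missing values, and scikit-learn's `LinearRegression` does not accept NaN inputs. How the notebook handled this is not shown here. The sketch below illustrates one possible approach, dropping the affected rows, on a small made-up frame (`toy`, `clean`, `X_toy`, and `y_toy` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the housing frame, with one missing total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
    "median_house_value": [452600.0, 358500.0, 352100.0],
})
toy["avg_room"] = (toy["total_rooms"] / toy["total_bedrooms"]).round()

# Count missing values per column before modelling
print(toy.isna().sum())

# One option: drop rows where the engineered feature is missing
clean = toy.dropna(subset=["avg_room"])
X_toy = clean[["avg_room"]]
y_toy = clean["median_house_value"]
```

Imputing the missing values, for example with the column median, is a common alternative that keeps all rows.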

In [103]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
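
Because `train_test_split` shuffles the rows randomly, each run of the cell above produces a different split, so the coefficients and scores shown further down will vary from run to run. A minimal sketch of making the split repeatable by fixing `random_state` follows; the seed value 42 and the toy arrays are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same seed -> same split on every run
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
```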

In [104]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

linreg = LinearRegression()
linreg.fit(X_train, y_train)
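
A single train/test split gives only one estimate of the error. As a complementary check, k-fold cross-validation on the training data can indicate how stable that estimate is. The sketch below uses synthetic data from `make_regression` rather than the housing features, and the choice of 5 folds is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for X_train, y_train
X_syn, y_syn = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=0)

# scikit-learn scorers follow "higher is better", so RMSE is returned negated
neg_rmse = cross_val_score(
    LinearRegression(), X_syn, y_syn,
    scoring="neg_root_mean_squared_error", cv=5,
)
rmse_per_fold = -neg_rmse
print(rmse_per_fold.mean(), rmse_per_fold.std())
```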

In [105]:
print(linreg.intercept_)

-3260826.3960699295


In [106]:
print(linreg.coef_)

[-3.93152459e+04 -3.89210691e+04  1.11988284e+03 -3.94369605e+01
  1.28371007e+02  4.37371000e+04 -1.12595576e+04]


In [107]:
mse = mean_squared_error(y_test, linreg.predict(X_test))

np.sqrt(mse)

70758.97094482878

In [108]:
linreg.score(X_test,y_test)

0.6210635135989993

In [112]:
X_train.head()

| | longitude | latitude | housing_median_age | population | households | median_income | avg_room |
|---|---|---|---|---|---|---|---|
| 12672 | -117.92 | 33.74 | 13.0 | 3385.0 | 1109.0 | 3.1773 | 4.0 |
| 10821 | -121.27 | 38.67 | 15.0 | 866.0 | 519.0 | 2.7388 | 4.0 |
| 8674 | -123.23 | 39.13 | 33.0 | 529.0 | 217.0 | 3.8958 | 6.0 |
| 7632 | -122.29 | 37.91 | 38.0 | 905.0 | 378.0 | 5.1691 | 6.0 |
| 1717 | -117.26 | 34.17 | 30.0 | 945.0 | 344.0 | 3.8906 | 6.0 |


In [110]:
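# using the model: the input must have the same 7 features, in the same order, as X_train
# (longitude, latitude, housing_median_age, population, households, median_income, avg_room).
# Passing 8 values instead raises:
# ValueError: X has 8 features, but LinearRegression is expecting 7 features as input.
# Here total_rooms (880) and total_bedrooms (129) are replaced by avg_room = round(880 / 129) = 7.
sample = pd.DataFrame([[-117.92, 33.74, 13, 322, 126, 8.3252, 7]], columns=X_train.columns)
linreg.predict(sample)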