<a href="https://colab.research.google.com/github/om1chael/machine-learning-/blob/master/Chapter2_housing_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# The problem
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.

# Working with Real Data

# What is the problem?
## How does the company expect to use and benefit from this model?

### **Boss**: "Your model’s output (a prediction of a district’s median housing price) will be fed to another Machine Learning system, along with many other signals. This downstream system will determine whether it is worth investing in a given area or not. Getting this right is critical, as it directly affects revenue."

### **The next question** to ask your boss is what the current solution looks like (if any). The current situation will often give you a *reference for performance*, as well as insights on how to solve the problem.

# Analysing the problem:

### With all this information, you are now ready to start designing your system. First, you need to frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques? Before you read on, pause and try to answer these questions for yourself.

#### Personal answer:
- Supervised
- Regression problem
- Online learning (incorrect: batch learning is the better fit). Why?
  - Because there is no particular need to adjust rapidly to changing data.

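Performance Measure:
- RMSE: $\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^{2}}$, where $m$ is the number of instances, $\mathbf{x}^{(i)}$ is the feature vector of instance $i$, $y^{(i)}$ is its label, and $h$ is the prediction function.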
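
As a minimal sketch of the measure above, RMSE can be computed in a few lines of NumPy. The function name `rmse` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square each error, average over the m instances, then take the root
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Made-up median house values in dollars, for illustration only
y_true = [200000.0, 300000.0, 400000.0]
y_pred = [210000.0, 290000.0, 400000.0]
error = rmse(y_true, y_pred)
print(error)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, so it is expressed in dollars but is sensitive to outliers.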

# Let's code

In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

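# Scikit-Learn ≥0.20 is required
import sklearn
from packaging import version  # compare versions numerically, not as strings
assert version.parse(sklearn.__version__) >= version.parse("0.20")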

# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [None]:
# Download and unzip the dataset
!wget https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz
!tar -xvzf cal_housing.tgz

--2024-12-18 01:27:59--  https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz
Resolving www.dcc.fc.up.pt (www.dcc.fc.up.pt)... 193.136.39.12
Connecting to www.dcc.fc.up.pt (www.dcc.fc.up.pt)|193.136.39.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 441963 (432K) [application/x-gzip]
Saving to: ‘cal_housing.tgz’


2024-12-18 01:28:02 (253 KB/s) - ‘cal_housing.tgz’ saved [441963/441963]



In [None]:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz and extract it into housing_path."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)


import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [None]:
# fetch_housing_data() has to run first: without it, datasets/housing/housing.csv
# does not exist and load_housing_data() raises FileNotFoundError.
fetch_housing_data()
load_housing_data(HOUSING_PATH)

In [6]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("camnugent/california-housing-prices")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/camnugent/california-housing-prices?dataset_version_number=1...


100%|██████████| 400k/400k [00:00<00:00, 784kB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/camnugent/california-housing-prices/versions/1
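
The path printed above is a directory, not a file, so the CSV still has to be located inside it before it can be loaded. The helper below is a hypothetical sketch: the name `find_csv_files` is an assumption, and since the output does not show the file names inside the download, it lists every `.csv` under the directory instead of guessing one.

```python
import os

def find_csv_files(root_dir):
    """Return the sorted paths of all .csv files found under root_dir."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith(".csv"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)

# Possible use in this notebook (assumes `path` is the directory returned by
# kagglehub.dataset_download and that it contains at least one CSV file):
# csv_files = find_csv_files(path)
# housing = pd.read_csv(csv_files[0])
```

Listing the files first avoids hard-coding a file name that may differ between dataset versions.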





In [7]:
import pandas as pd
import numpy as np

In [14]:
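# Colab ships a California housing sample under /content/sample_data;
# this file is the test split, and the path only resolves inside Colab.
path = "/content/sample_data/california_housing_test.csv"
housing = pd.read_csv(path)  # read_csv already returns a DataFrame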
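
Before going further, it may help to sanity-check what was loaded. The sketch below uses a stand-in DataFrame built from the first five rows of the table displayed in the next cell, so it runs outside Colab. The names `sample` and `summarize` are hypothetical; in the notebook, `summarize(housing)` could be called directly.

```python
import pandas as pd

def summarize(df):
    """Collect basic size, missing-value, and dtype information for a DataFrame."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "missing_per_column": df.isna().sum().to_dict(),
        "numeric_columns": df.select_dtypes("number").columns.tolist(),
    }

# First five rows as displayed by the notebook, used here as a stand-in
sample = pd.DataFrame({
    "longitude": [-122.05, -118.30, -117.81, -118.36, -119.67],
    "latitude": [37.37, 34.26, 33.78, 33.82, 36.33],
    "housing_median_age": [27.0, 43.0, 27.0, 28.0, 19.0],
    "total_rooms": [3885.0, 1510.0, 3589.0, 67.0, 1241.0],
    "total_bedrooms": [661.0, 310.0, 507.0, 15.0, 244.0],
    "population": [1537.0, 809.0, 1484.0, 49.0, 850.0],
    "households": [606.0, 277.0, 495.0, 11.0, 237.0],
    "median_income": [6.6085, 3.5990, 5.7934, 6.1359, 2.9375],
    "median_house_value": [344700.0, 176500.0, 270500.0, 330000.0, 81700.0],
})

summary = summarize(sample)
print(summary["n_rows"], summary["n_columns"])
print(sample.describe())
```

Checking the missing-value counts and column types early shows whether any cleaning is needed before training a model.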

In [15]:
housing

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085,344700.0
1,-118.30,34.26,43.0,1510.0,310.0,809.0,277.0,3.5990,176500.0
2,-117.81,33.78,27.0,3589.0,507.0,1484.0,495.0,5.7934,270500.0
3,-118.36,33.82,28.0,67.0,15.0,49.0,11.0,6.1359,330000.0
4,-119.67,36.33,19.0,1241.0,244.0,850.0,237.0,2.9375,81700.0
...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23.0,1450.0,642.0,1258.0,607.0,1.1790,225000.0
2996,-118.14,34.06,27.0,5257.0,1082.0,3496.0,1036.0,3.3906,237200.0
2997,-119.70,36.30,10.0,956.0,201.0,693.0,220.0,2.2895,62000.0
2998,-117.12,34.10,40.0,96.0,14.0,46.0,14.0,3.2708,162500.0
