# Working with Real Data
When you are learning about machine learning, it is best to experiment with real-world
data, not artificial datasets. Fortunately, there are thousands of open datasets to
choose from, ranging across all sorts of domains. Here are a few places you can look
to get data:

- OpenML.org
- Kaggle.com
- PapersWithCode.com
- UC Irvine Machine Learning Repository
- Amazon’s AWS datasets
- TensorFlow datasets
- DataPortals.org
- OpenDataMonitor.eu
- Wikipedia’s list of machine learning datasets
- Quora.com
- The datasets subreddit

# Machine Learning Project Checklist
This checklist can guide you through your machine learning projects. There are eight
main steps:
1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to machine learning
algorithms.
5. Explore many different models and shortlist the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

Obviously, you should feel free to adapt this checklist to your needs.

# Root Mean Square Error
A typical performance measure for regression problems is the root mean square error
(RMSE). It gives an idea of how much error the system typically makes in its
predictions, with a higher weight given to large errors.
$$
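\text{RMSE}(X, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(h(x_i) - y_i\right)^2}
$$

- m is the number of instances in the dataset
- $x_i$ is the feature vector of the i-th instance, and $y_i$ is its label (the desired output)
- X is the matrix containing the feature vectors of all instances in the dataset
- h is your system’s prediction function, also called a hypothesis; its output for an instance, $\hat{y}_i = h(x_i)$, is the predicted value (“y-hat”)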

RMSE(X,h) is the cost function measured on the set of examples using your
hypothesis h.
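
As a minimal sketch of the formula above, the RMSE can be computed with NumPy in a few lines. The function name `rmse` and the sample values are made up for illustration; they are not part of the housing dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true              # h(x_i) - y_i for every instance
    return np.sqrt(np.mean(errors ** 2))  # square, average over m, take the root

# Hypothetical labels and predictions, for illustration only.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
print(rmse(y_true, y_pred))
```

Here the errors are -1, 0, 2 and 0, so the mean squared error is 1.25 and the RMSE is its square root, roughly 1.118.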

- Computing the root of a sum of squares (RMSE) corresponds to the Euclidean
norm: this is the notion of distance we are all familiar with. It is also called the $ℓ_2$
norm, noted $∥ · ∥_2$ (or just $∥ · ∥$).
- Computing the sum of absolutes (MAE) corresponds to the $ℓ_1$ norm, noted $∥ · ∥_1$.
This is sometimes called the Manhattan norm because it measures the distance
between two points in a city if you can only travel along orthogonal city blocks.
- More generally, the $ℓ_k$ norm of a vector v containing n elements is defined as
$∥v∥_k = (|v_1|^k + |v_2|^k + ... + |v_n|^k)^{\frac{1}{k}}$.

$ℓ_0$ gives the number of nonzero elements in the
vector, and $ℓ_∞$ gives the maximum absolute value in the vector.

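When you compute the RMSE or the MAE as an $ℓ_2$ or $ℓ_1$ norm, the vector v is the error vector e, whose i-th element is the difference between the prediction and the label for instance i (hence the minus sign inside the sum). The norm is then scaled by the number of instances: the RMSE is $∥e∥_2 / \sqrt{m}$ and the MAE is $∥e∥_1 / m$.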
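
The norms described above can be checked numerically. The following is a small sketch using `numpy.linalg.norm`; the vector `v`, the sample labels and predictions, and the helper `lk_norm` are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Illustrative vector, chosen so the results are easy to check by hand.
v = np.array([3.0, -4.0, 0.0])

l1 = np.linalg.norm(v, ord=1)         # sum of absolute values
l2 = np.linalg.norm(v)                # Euclidean norm (default for vectors)
l0 = np.linalg.norm(v, ord=0)         # number of nonzero elements
linf = np.linalg.norm(v, ord=np.inf)  # maximum absolute value

def lk_norm(vec, k):
    """General l_k norm written directly from the definition."""
    return np.sum(np.abs(vec) ** k) ** (1.0 / k)

# Made-up labels and predictions; e is the error vector (prediction - label).
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])
e = y_pred - y_true
m = len(e)

rmse = np.linalg.norm(e) / np.sqrt(m)  # l2 norm of the errors, scaled by sqrt(m)
mae = np.linalg.norm(e, ord=1) / m     # l1 norm of the errors, scaled by m

print(l1, l2, l0, linf)
print(rmse, mae)
```

For this vector the $ℓ_1$, $ℓ_2$, $ℓ_0$ and $ℓ_∞$ norms are 7, 5, 2 and 4, and `lk_norm(v, 2)` agrees with the built-in Euclidean norm.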


The higher the norm index, the more it focuses on large values and neglects small
ones. This is why the RMSE is more sensitive to outliers than the MAE. But when
outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very
well and is generally preferred.
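
To make the sensitivity to outliers concrete, here is a small sketch that compares both metrics on a made-up error vector before and after adding one large error. The numbers and function names are hypothetical and only serve the illustration.

```python
import numpy as np

def rmse(errors):
    """Root mean square of an error vector."""
    return np.sqrt(np.mean(np.square(errors)))

def mae(errors):
    """Mean absolute value of an error vector."""
    return np.mean(np.abs(errors))

# Hypothetical prediction errors: four small ones, then the same plus one outlier.
errors = np.array([1.0, -1.0, 1.0, -1.0])
with_outlier = np.append(errors, 10.0)

print(rmse(errors), mae(errors))              # both metrics agree on the small errors
print(rmse(with_outlier), mae(with_outlier))  # RMSE grows faster than MAE
```

On the four small errors both metrics equal 1. After adding the single error of 10, the MAE rises to 2.8 while the RMSE rises to about 4.56, because squaring gives the large error much more weight.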

Use RMSE:

- When errors are expected to follow a bell-shaped (Gaussian) distribution.
- When large errors are rare but should be heavily penalized.
- When you need a smooth, differentiable loss function for optimization (the absolute error used by MAE is not differentiable at zero).

Use MAE:

- When errors may contain outliers or follow a heavy-tailed distribution.
- When you want a robust metric that is less sensitive to extreme values.
- When interpretability is important (MAE is simply the average size of the errors; note that both MAE and RMSE are expressed in the same units as the target).



# Downloading the data

In [1]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

#pathlib.Path: Used for handling file paths in a platform-independent way.
#pandas: Used for data manipulation and analysis (loading the dataset into a DataFrame).
#tarfile: Used for extracting files from a .tgz (tar.gz) archive.
#urllib.request: Used for downloading files from the internet.


def load_housing_data():
  """
  This function is responsible for:

  Checking if the dataset already exists locally.
  Downloading the dataset if it doesn’t exist.
  Extracting the dataset from the .tgz archive.
  Loading the dataset into a Pandas DataFrame.
  """
  tarball_path = Path('datasets/housing.tgz') #tarball_path is the path where the downloaded .tgz file will be saved.
  if not tarball_path.is_file(): #If the file does not exist, the script proceeds to download it.
    Path('datasets').mkdir(parents=True, exist_ok=True)
    #If the datasets directory does not exist, it creates it.
    #parents=True: Ensures that any missing parent directories are also created.
    #exist_ok=True: Prevents an error if the directory already exists.

    url = "https://github.com/ageron/data/raw/main/housing.tgz"
    urllib.request.urlretrieve(url, tarball_path)

    with tarfile.open(tarball_path) as housing_tarball:
      housing_tarball.extractall(path='datasets')
  return pd.read_csv(Path("datasets/housing/housing.csv"))


housing = load_housing_data()


In [3]:
#read the first 5 rows of the dataset
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [4]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


The total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts
(20,640 - 20,433) are missing this feature.
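
Rather than reading the gap off the `info()` output, the missing values can be counted directly. The sketch below uses a tiny made-up DataFrame named `toy` as a stand-in for the housing data; on the real dataset the same calls would be applied to `housing`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the housing data (values are illustrative).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Same result for one column: total rows minus non-null count,
# which is the subtraction done by hand on the info() output.
n_missing = len(toy) - toy["total_bedrooms"].count()
print(n_missing)
```

Both approaches report two missing values for `total_bedrooms` in this toy frame and none for `total_rooms`.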

In [6]:
#Checking the values of ocean_proximity
housing["ocean_proximity"].value_counts()

| ocean_proximity | count |
|---|---|
| <1H OCEAN | 9136 |
| INLAND | 6551 |
| NEAR OCEAN | 2658 |
| NEAR BAY | 2290 |
| ISLAND | 5 |


In [7]:
housing.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |
