# A *machine learning* project

In these first lessons we will build an example *machine learning* project. To do so, we will use a *dataset* of used car prices.

## Getting the data

The *dataset* comes from the Kaggle website (https://www.kaggle.com/). Whenever possible, it is good practice to automate downloading, unpacking, and reading the data, so that the whole process can be repeated with a single function call.

In [1]:
import zipfile
from pathlib import Path

import pandas as pd
import requests

_CAR_DATASET_URL = 'https://www.kaggle.com/api/v1/datasets/download/asinow/car-price-dataset'
_TIMEOUT = 10
_PROJECT_NAME = 'car_price'
_COMPRESSED_CAR_DATASET_FILENAME = 'car_price_dataset.zip'
_CAR_DATASET_FILENAME = 'car_price_dataset.csv'


def _fetch_car_dataset(raw_dataset_path: Path, project_data_dir: Path) -> None:
    '''Downloads the compressed car dataset from Kaggle to raw_dataset_path.

    Creates project_data_dir first if it does not exist yet.
    '''
    project_data_dir.mkdir(parents=True, exist_ok=True)
    response = requests.get(_CAR_DATASET_URL, timeout=_TIMEOUT)
    response.raise_for_status()
    with open(raw_dataset_path, 'wb') as f:
        f.write(response.content)


def _unpack_car_dataset(raw_dataset_path: Path, project_data_dir: Path) -> None:
    '''Extracts the archive at raw_dataset_path into project_data_dir.
    '''
    with zipfile.ZipFile(raw_dataset_path, 'r') as zip_ref:
        zip_ref.extractall(project_data_dir)


def _fetch_and_unpack_car_dataset(project_data_dir: Path) -> None:
    '''Fetches and unpacks the car dataset from Kaggle.
    '''
    raw_dataset_path = project_data_dir / _COMPRESSED_CAR_DATASET_FILENAME
    _fetch_car_dataset(raw_dataset_path, project_data_dir)
    _unpack_car_dataset(raw_dataset_path, project_data_dir)


def load_car_dataset(data_dir: Path) -> pd.DataFrame:
    '''Loads the car dataset, downloading and unpacking it first if the
    CSV file is not already present under data_dir.
    '''
    project_data_dir = data_dir / _PROJECT_NAME
    dataset_path = project_data_dir / _CAR_DATASET_FILENAME
    if not dataset_path.exists():
        _fetch_and_unpack_car_dataset(project_data_dir)
    dataset = pd.read_csv(dataset_path)
    return dataset

In [2]:
DATA_DIR = Path.cwd().resolve().parents[2] / 'datasets'

dataset = load_car_dataset(DATA_DIR)

Check that the code worked:

In [3]:
dataset.head()

|   | Brand | Model | Year | Engine_Size | Fuel_Type | Transmission | Mileage | Doors | Owner_Count | Price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kia | Rio | 2020 | 4.2 | Diesel | Manual | 289944 | 3 | 5 | 8501 |
| 1 | Chevrolet | Malibu | 2012 | 2.0 | Hybrid | Automatic | 5356 | 2 | 3 | 12092 |
| 2 | Mercedes | GLA | 2020 | 4.2 | Diesel | Automatic | 231440 | 4 | 2 | 11171 |
| 3 | Audi | Q5 | 2023 | 2.0 | Electric | Manual | 160971 | 2 | 1 | 11780 |
| 4 | Volkswagen | Golf | 2003 | 2.6 | Hybrid | Semi-Automatic | 286618 | 3 | 3 | 2867 |


In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Brand         10000 non-null  object 
 1   Model         10000 non-null  object 
 2   Year          10000 non-null  int64  
 3   Engine_Size   10000 non-null  float64
 4   Fuel_Type     10000 non-null  object 
 5   Transmission  10000 non-null  object 
 6   Mileage       10000 non-null  int64  
 7   Doors         10000 non-null  int64  
 8   Owner_Count   10000 non-null  int64  
 9   Price         10000 non-null  int64  
dtypes: float64(1), int64(5), object(4)
memory usage: 781.4+ KB


Everything appears to have worked with this *dataset*: all 10,000 rows were loaded and no column has missing values.
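The visual check of `dataset.info()` can also be turned into a programmatic one. The sketch below is only an illustration: `EXPECTED_KINDS` and `check_schema` are hypothetical names that are not part of the notebook, and the expected column kinds are copied from the `info()` output shown above. A tiny stand-in frame built from the first two rows of `dataset.head()` is used so the example runs without downloading anything.

```python
import pandas as pd
from pandas.api import types as pdt

# Hypothetical expectation table: column name -> kind of data,
# copied by hand from the dataset.info() output.
EXPECTED_KINDS = {
    'Brand': 'text',
    'Model': 'text',
    'Year': 'integer',
    'Engine_Size': 'float',
    'Fuel_Type': 'text',
    'Transmission': 'text',
    'Mileage': 'integer',
    'Doors': 'integer',
    'Owner_Count': 'integer',
    'Price': 'integer',
}

_KIND_CHECKS = {
    'integer': pdt.is_integer_dtype,
    'float': pdt.is_float_dtype,
    'text': lambda s: pdt.is_object_dtype(s) or pdt.is_string_dtype(s),
}


def check_schema(df: pd.DataFrame) -> list[str]:
    '''Returns a list of problems found; an empty list means no problem.'''
    problems = []
    for column, kind in EXPECTED_KINDS.items():
        if column not in df.columns:
            problems.append(f'missing column: {column}')
        elif not _KIND_CHECKS[kind](df[column]):
            problems.append(f'{column} is not of kind {kind}')
    # Report every column that contains at least one missing value.
    null_counts = df.isna().sum()
    for column, count in null_counts[null_counts > 0].items():
        problems.append(f'{column} has {count} missing values')
    return problems


# Stand-in for the real dataset: the first two rows shown by dataset.head().
sample = pd.DataFrame({
    'Brand': ['Kia', 'Chevrolet'],
    'Model': ['Rio', 'Malibu'],
    'Year': [2020, 2012],
    'Engine_Size': [4.2, 2.0],
    'Fuel_Type': ['Diesel', 'Hybrid'],
    'Transmission': ['Manual', 'Automatic'],
    'Mileage': [289944, 5356],
    'Doors': [3, 2],
    'Owner_Count': [5, 3],
    'Price': [8501, 12092],
})

print(check_schema(sample))
```

In the notebook itself, the same function could be called as `check_schema(dataset)`. A check like this is useful when the download is automated, because a changed or truncated file would otherwise go unnoticed.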

***

***Exercise***

Check that the data files were actually created on disk.

***

***Exercise***

Modify the code above to add an option that automatically deletes the original `.zip` file (to save disk space).

That is, change the function:

> ```Python
> def _fetch_and_unpack_car_dataset(project_data_dir: Path) -> None:
>     ...
> ```

to

> ```Python
> def _fetch_and_unpack_car_dataset(
>     project_data_dir: Path,
>     remove_original: bool,
> ) -> None:
>     ...
> ```

Also change the function:

> ```Python
> def load_car_dataset(data_dir: Path) -> pd.DataFrame:
>     ...
> ```

to

> ```Python
> def load_car_dataset(
>     data_dir: Path,
>     remove_original: bool = False,
> ) -> pd.DataFrame:
>     ...
> ```
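
As a hint for this exercise, the sketch below shows the general pattern of extracting an archive and then optionally deleting it with `Path.unlink`. It is not the notebook's code: `unpack_and_cleanup` and the file names are hypothetical, and the example builds a small archive in a temporary directory so that it can be run without touching the real data.

```python
import tempfile
import zipfile
from pathlib import Path


def unpack_and_cleanup(
    archive_path: Path,
    target_dir: Path,
    remove_original: bool,
) -> None:
    '''Extracts archive_path into target_dir, optionally deleting the archive.'''
    with zipfile.ZipFile(archive_path, 'r') as zip_ref:
        zip_ref.extractall(target_dir)
    # Delete only after the archive has been closed and fully extracted.
    if remove_original:
        archive_path.unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    archive = tmp_dir / 'example.zip'
    # Build a throwaway archive containing a single small CSV file.
    with zipfile.ZipFile(archive, 'w') as zip_ref:
        zip_ref.writestr('example.csv', 'a,b\n1,2\n')

    unpack_and_cleanup(archive, tmp_dir, remove_original=True)
    extracted_exists = (tmp_dir / 'example.csv').exists()
    archive_exists = archive.exists()

print(extracted_exists, archive_exists)
```

The deletion happens outside the `with` block on purpose: on some platforms an open file cannot be removed. Adapting this idea to `_fetch_and_unpack_car_dataset` and `load_car_dataset` is left as the exercise.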