# Fuel efficiency prediction project

This notebook applies the concepts from the previous presentations to a real dataset. It walks through every step: loading the data, exploratory analysis and cleaning, building and training a model, and evaluating the results.

## The dataset

This project uses the Auto MPG dataset, which describes the city-cycle fuel consumption, in miles per gallon, of automobiles from model years 1970 to 1982.

The data contains the following attributes:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

Dataset URL on UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/auto+mpg

In [25]:
# Prepare necessary libraries
import csv

import numpy as np
import pandas as pd
import requests


In [4]:
# Download the dataset
dataset_source_url = "https://archive.ics.uci.edu/" + \
                     "ml/machine-learning-databases/" + \
                     "auto-mpg/auto-mpg.data"
filename = dataset_source_url.split("/")[-1] or "auto-mpg.data"
response = requests.get(dataset_source_url, timeout=30)  # fail instead of hanging on a stalled connection
response.raise_for_status()

with open(filename, "wb") as f:
    f.write(response.content)

print(f"Successfully downloaded dataset to file: {filename}")

Successfully downloaded dataset to file: auto-mpg.data


In [48]:
# Read the dataset
column_names = [
    "mpg", "cylinders", "displacement", "horsepower", 
    "weight", "acceleration", "model_year", "origin"
]

dataset_raw = pd.read_csv(
    filepath_or_buffer=filename, # name of file containing data
    names=column_names, # list of column names
    comment="\t", # the car name follows a tab, so treating the tab as a
                  # comment marker keeps that column out of the data
    sep=" ", # delimiter to use
    skipinitialspace=True, # skip spaces after delimiter
    na_values="?" # character used for missing values
)

# Show first five rows
dataset_raw.head()

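|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin |
|---|-----|-----------|--------------|------------|--------|--------------|------------|--------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 |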
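
The parsing options used in the `pd.read_csv` call can be hard to picture without seeing the raw file. The sketch below runs the same options on two in-memory rows written in the file's layout (space-padded numeric fields, a tab, then the quoted car name). The names `sample`, `columns` and `sample_df` are illustrative, and the rows are only a small sample in that layout, not a substitute for the downloaded file.

```python
import io

import pandas as pd

# Two illustrative rows in the raw file's layout: numeric fields padded
# with spaces, then a tab, then the quoted car name.
sample = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"\n'
)

columns = [
    "mpg", "cylinders", "displacement", "horsepower",
    "weight", "acceleration", "model_year", "origin",
]

sample_df = pd.read_csv(
    io.StringIO(sample),    # read from memory instead of a file on disk
    names=columns,
    comment="\t",           # everything from the tab onward is discarded
    sep=" ",
    skipinitialspace=True,  # collapse the runs of padding spaces
    na_values="?",          # "?" is parsed as NaN
)

print(sample_df)
print(sample_df.dtypes)
```

The car name never reaches the DataFrame, and the `?` in the second row becomes `NaN`, which is why `horsepower` is read as a floating-point column.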


In [50]:
# Show number of missing values for each column
dataset_raw.isna().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
dtype: int64
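
Beyond the per-column counts, it can help to see what share of a column is missing and which rows are affected. The following is a minimal sketch on a small, made-up DataFrame; `toy` and its values are hypothetical and only stand in for `dataset_raw`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dataset_raw; the values are made up.
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0, 40.9],
    "horsepower": [130.0, np.nan, 90.0, np.nan],
    "weight": [3504.0, 2046.0, 2430.0, 1835.0],
})

# Number of missing values per column (same call as in the cell above)
missing_counts = toy.isna().sum()

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Rows that contain at least one missing value
rows_with_gaps = toy[toy.isna().any(axis=1)]

print(missing_counts)
print(missing_share)
print(rows_with_gaps)
```

Applied to `dataset_raw`, the same three expressions would show whether the missing `horsepower` values are few enough to drop or need another treatment.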