# The cars dataset

The cars dataset is a simple dataset describing cars and their fuel economy.  
mpg.csv contains fuel economy data from 1999 to 2008 for 38 popular car models.  
The meaning of each field is documented [here](https://www.rdocumentation.org/packages/ggplot2/versions/3.3.6/topics/mpg).  

In this notebook we'll import the dataset and clean it up. After the cleaning, we make some selections and aggregations.

## Part 1: cleaning the dataset

In [None]:
import pandas as pd

df = pd.read_csv("files/mpg.csv")  
    
df.head(10) 

The import worked, but the first column is wrong: the CSV already contains an index, which pandas read as an ordinary column while adding a new index of its own. Passing index_col=0 tells read_csv to use the first column as the index.

In [None]:
df = pd.read_csv("files/mpg.csv", index_col=0) 
df.head(10)
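The effect of index_col can be seen without the real file. The sketch below is a minimal illustration that assumes a made-up three-row CSV (the `csv_text` string is hypothetical, not the content of mpg.csv) whose first, unnamed column holds a saved index.

```python
import io
import pandas as pd

# Hypothetical stand-in for mpg.csv: the first, unnamed column is a saved index.
csv_text = """,manufacturer,model,cty
1,audi,a4,18
2,audi,a4,21
3,ford,mustang,15
"""

# Without index_col, the saved index is read as a regular data column.
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())

# With index_col=0, the first column becomes the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
print(indexed.index.tolist())
```

In the second call, only the three real fields remain as columns.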

To get a first impression of what is in the file, call the info() or describe() methods:   

* info(): provides a concise summary of a dataframe: number of records, number of columns, data types, ...
* describe(): generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution. It excludes NaN values. It can be used on a dataframe or on a single series.

In [None]:
df.info()
df.describe()

When you are interested in the number of distinct observations for each column, use:

* nunique(): counts distinct observations. Can be used on a dataframe or a series. By default, it excludes NaN values.   
* value_counts(): returns how often each unique value occurs in the specified series. NaN values are excluded by default.

In [None]:
df.nunique()

In [None]:
df['class'].value_counts()
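The difference between the two methods, and the effect of NaN values, is easiest to see on a tiny series. This is a minimal sketch on made-up values (the series `s` is hypothetical and not taken from mpg.csv); the dropna parameter is the switch that controls whether NaN is counted.

```python
import numpy as np
import pandas as pd

# Made-up series with one missing value, for illustration only.
s = pd.Series(["suv", "compact", "suv", np.nan, "pickup", "suv"])

# nunique() returns a single number: how many distinct values there are.
print(s.nunique())               # NaN is ignored
print(s.nunique(dropna=False))   # NaN counts as its own value

# value_counts() returns one count per distinct value.
print(s.value_counts())
print(s.value_counts(dropna=False))
```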

The next step after loading data is to check the data quality. But what does this mean? Data can be dirty in a number of ways:

- Bad data (missing observations, duplicate observations, ...)
- Wrong structure (fields joined or spread out, ...)
- Dirty data (wrong datatypes, string processing needed, ...)

The mpg dataset has no bad data, and the structure is also fine. Still, there are improvements possible.

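Let's start with the miles per gallon in the city (cty) and on the highway (hwy). In liters per 100 km that would be:

Liters100km = (100 * 3.785411784) / (1.609344 * MPG)

Here 3.785411784 is the number of liters in a US gallon and 1.609344 is the number of kilometers in a mile.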

In [None]:
# Vectorized conversion from miles per gallon to liters per 100 km
df['clkm'] = (100 * 3.785411784) / (1.609344 * df['cty'])
df['hwlkm'] = (100 * 3.785411784) / (1.609344 * df['hwy'])

df.head(10)
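As a quick sanity check of the formula, the conversion can be wrapped in a small helper and tried on a single value. This is only a sketch: the function name `mpg_to_l_per_100km` and the constant names are made up for this example.

```python
import pandas as pd

LITERS_PER_GALLON = 3.785411784  # US liquid gallon
KM_PER_MILE = 1.609344

def mpg_to_l_per_100km(mpg):
    """Convert miles per US gallon to liters per 100 km (hypothetical helper)."""
    return (100 * LITERS_PER_GALLON) / (KM_PER_MILE * mpg)

# A single value: 25 mpg is roughly 9.41 l/100km.
print(round(mpg_to_l_per_100km(25), 2))

# The same function works element-wise on a pandas series.
print(mpg_to_l_per_100km(pd.Series([18, 29])))
```

Note that the relationship is inverse: a higher mpg value gives a lower l/100km value.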

The class of a car is actually a categorical value. This means it can only take a limited number of discrete values. Let's convert the class of the car to that type.

In [None]:
df["class"] = pd.Categorical(df['class'])

At first sight, the class of a car is not an ordered category (unlike health labels on food, the year a student is in, or a grade from A to E). However, in our scenario, where we want to study the impact of the car class on fuel consumption, we can assume there is some order in the car classes.   
First, let's find out which unique values exist for class. 

In [None]:
df['class'].unique()

The next step is to put these values in the preferred order.

In [None]:
from pandas.api.types import CategoricalDtype

# categories list copied and rearranged from the unique values,
# with one extra category (three wheeled car) added just for fun
cat_type = CategoricalDtype(categories=['three wheeled car','2seater',
        'subcompact', 'compact', 'midsize', 'minivan', 'suv', 'pickup'], ordered=True)

df["class"] = df['class'].astype(cat_type)

df.head()

Why are we doing this? The previous output shows no difference, but it becomes clear when you look at the result of a 'group by': the class categories are no longer sorted alphabetically but appear in the order you specified.

In [None]:
df.groupby('class').describe()
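The effect of an ordered category can also be checked on a small example. The sketch below uses a made-up five-row frame (`cars` and `size_order` are hypothetical names) with a shortened version of the category order. It shows sorting, comparison and grouping; observed=False is passed explicitly so that every declared category is kept in the grouped result.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Made-up sample; the categories are a shortened version of the order above.
size_order = CategoricalDtype(
    categories=["2seater", "compact", "suv", "pickup"], ordered=True
)
cars = pd.DataFrame({
    "class": ["suv", "compact", "pickup", "2seater", "suv"],
    "hwy": [20, 29, 17, 26, 18],
})
cars["class"] = cars["class"].astype(size_order)

# Sorting follows the category order, not the alphabet.
print(cars.sort_values("class")["class"].tolist())

# Ordered categories also support comparisons.
print((cars["class"] > "compact").tolist())

# Grouping lists the groups in category order.
mean_hwy = cars.groupby("class", observed=False)["hwy"].mean()
print(mean_hwy)
```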

Finally, let's do some last cleaning.
When we look at the transmission of a car, each cell clearly holds more than one value. We'll use the pandas str.split method to split this up.

In [None]:
df[["trans"]]

In [None]:
df["trans"].str.split('(')

So now each cell holds a list. But what if we want a dataframe with one column per part?

In [None]:
df["trans"].str.split('(', expand=True)

Good! How could we get rid of the final ")"? One way is to store the output as a dataframe and apply a lambda function to each row.

In [None]:
splitted = df["trans"].str.split('(', expand=True)
splitted[1] = splitted.apply(lambda row : row[1].replace(')',''), axis=1)
splitted
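The lambda is not the only option. As an alternative sketch, the vectorized string methods can do the same clean-up, shown here on made-up values in the same shape as the trans column. The column names `trans_type` and `trans_detail` are hypothetical and chosen only for this example.

```python
import pandas as pd

# Made-up values in the same "type(detail)" shape as the trans column.
trans = pd.Series(["auto(l5)", "manual(m5)", "auto(av)"])

# Split on the opening parenthesis, then strip the closing one.
# regex=False makes str.replace treat ")" as a literal character.
parts = trans.str.split("(", expand=True)
parts[1] = parts[1].str.replace(")", "", regex=False)
print(parts)

# str.extract does both steps at once; named groups become column names.
extracted = trans.str.extract(r"^(?P<trans_type>[^(]+)\((?P<trans_detail>[^)]*)\)$")
print(extracted)
```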

If you don't want to end up with a separate dataframe, you can add 'splitted' to the original dataframe (Chapter 6).  

In [None]:
df.join(splitted)
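Keep in mind that join returns a new dataframe and leaves the original untouched, so the result has to be assigned if you want to keep it. The sketch below illustrates this on a made-up two-row frame. It also gives the new columns readable names first (`trans_type` and `trans_detail` are hypothetical names).

```python
import pandas as pd

# Made-up two-row sample with a trans column in "type(detail)" form.
cars = pd.DataFrame({
    "model": ["a4", "mustang"],
    "trans": ["auto(l5)", "manual(m5)"],
})

parts = cars["trans"].str.split("(", expand=True)
parts[1] = parts[1].str.replace(")", "", regex=False)

# Rename the default columns 0 and 1 before joining.
parts.columns = ["trans_type", "trans_detail"]

# join aligns on the index and returns a new dataframe, so assign the result.
cars = cars.join(parts)
print(cars)
```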

## Part 2: Selections and aggregations

Now that the cleaning is finished, we can explore the dataset.   
**First we show some selections.**  

Let's get all cars with an engine displacement of 3 or less.

In [None]:
df[ df.displ <= 3]

And from this, only show the manufacturer and the number of cylinders.

In [None]:
df[ df.displ <= 3][['manufacturer', 'cyl']]

All cars having an odd number of cylinders or a displacement of exactly 2.8.

In [None]:
df[ (df.displ == 2.8) | (df.cyl % 2 == 1)][['manufacturer', 'cyl', 'displ']]

Same as above, but sort by ascending number of cylinders.

In [None]:
df[ (df.displ == 2.8) | (df.cyl % 2 == 1)][['manufacturer', 'cyl', 'displ']].sort_values('cyl')
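When a selection grows this long, it can help to give the condition a name and to select rows and columns in one step with .loc. The sketch below does that on a made-up four-row frame (the data and the name `wanted` are hypothetical). The parentheses around each comparison are required because | binds more tightly than == in Python.

```python
import pandas as pd

# Made-up sample with the three columns used in the selection.
cars = pd.DataFrame({
    "manufacturer": ["audi", "volkswagen", "ford", "audi"],
    "cyl": [6, 5, 8, 4],
    "displ": [2.8, 2.5, 4.6, 1.8],
})

# Name the boolean mask: displacement of exactly 2.8 OR an odd cylinder count.
wanted = (cars["displ"] == 2.8) | (cars["cyl"] % 2 == 1)

# .loc takes the row mask and the column list together.
result = cars.loc[wanted, ["manufacturer", "cyl", "displ"]].sort_values("cyl")
print(result)
```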

Return the number of cars per manufacturer.

In [None]:
df['manufacturer'].value_counts()

**Finally, we show how to do some aggregations.**

Watch out! Leave categorical columns out of numeric aggregations such as sum(), because they raise a TypeError.

* sum(): Return the sum of the values for the requested axis. You can use it for both dataframe and series.


In [None]:
df[['displ', 'cyl', 'hwy','model']].sum()

# TypeError: unsupported operand type(s) for +: 'CategoricalDtype' and 'int64'
#df['class'].sum() 
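One way to avoid the error without listing columns by hand is to restrict the aggregation to numeric columns. This is a minimal sketch on a made-up frame (the `cars` data is hypothetical), using either select_dtypes or the numeric_only parameter of sum.

```python
import pandas as pd

# Made-up sample with a string, a categorical and two numeric columns.
cars = pd.DataFrame({
    "model": ["a4", "mustang", "civic"],
    "class": pd.Categorical(["compact", "subcompact", "subcompact"]),
    "cyl": [4, 8, 4],
    "hwy": [29, 23, 33],
})

# Option 1: keep only the numeric columns, then sum.
numeric_totals = cars.select_dtypes(include="number").sum()
print(numeric_totals)

# Option 2: let sum() skip the non-numeric columns itself.
print(cars.sum(numeric_only=True))
```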

* count(): Return number of non-NA/null observations.

In [None]:
df.count(numeric_only=True)

Min and Max, Mean and Median:

* min(): Return the minimum value
* max(): Return the maximum value
* mean(): Return the mean of the values
* median(): Return the median of the values

These functions can be applied to both dataframe and series.

(Note the class! The max is the last value of our ordered class.)

In [None]:
df.max()

* agg(): applies one or more aggregation operations to the same dataset over the specified axis.

In [None]:
df[['displ', 'cyl', 'hwy']].agg(['count','min','max'])
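agg() also accepts a dictionary, so that each column gets its own set of aggregations. The following is a small sketch on made-up numbers (the `cars` frame is hypothetical). Combinations that were not requested show up as NaN in the result.

```python
import pandas as pd

# Made-up sample of three cars.
cars = pd.DataFrame({
    "displ": [1.8, 2.8, 4.6],
    "cyl": [4, 6, 8],
    "hwy": [29, 26, 23],
})

# Different aggregations per column: min and max for displ, mean for hwy.
summary = cars.agg({"displ": ["min", "max"], "hwy": ["mean"]})
print(summary)
```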

* groupby(): groups rows that share the same values into summary rows, to which you can then apply aggregate functions such as sum, max or min.

In [None]:
df.groupby('class').cyl.mean()
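groupby() and agg() can be combined to compute several summaries per group at once. The sketch below uses named aggregation on a made-up frame. The data and the output column names `cars_count`, `mean_cyl` and `best_hwy` are hypothetical.

```python
import pandas as pd

# Made-up sample; class is kept as a plain string column here.
cars = pd.DataFrame({
    "class": ["suv", "compact", "suv", "compact", "pickup"],
    "cyl": [8, 4, 6, 4, 8],
    "hwy": [17, 29, 20, 31, 16],
})

# Named aggregation: output_column=(input_column, function).
per_class = cars.groupby("class").agg(
    cars_count=("cyl", "size"),
    mean_cyl=("cyl", "mean"),
    best_hwy=("hwy", "max"),
)
print(per_class)
```

Each keyword becomes a column in the result, which keeps the output flat instead of producing a two-level column header.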