## Cleaning and Exploratory Data Analysis 

In this notebook, we clean and explore the data.


In [10]:
import pandas as pd
from IPython.display import display

from mtcars_practice.config import data_dir

In [12]:
mtcars = pd.read_csv(data_dir + '/processed/mtcars.csv')
display(mtcars.head())

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


The first few rows of the dataset show which columns are available. mpg is the dependent variable. The variables cylinders, displacement, horsepower, weight, and acceleration seem like good candidates for independent variables. We can examine the model_year variable to see whether it would be helpful to include. The documentation does not describe exactly what the variable origin denotes, but I would guess it is the country or region where the vehicle was built. If so, including it could be beneficial. The car_name attribute will be dropped, since a free-text name is not directly usable as a predictor.

In [13]:
# Drop the free-text name column, then summarize the remaining numeric columns.
mtcars = mtcars.drop(columns='car_name')
display(mtcars.describe())

Unnamed: 0,mpg,cylinders,displacement,weight,acceleration,model_year,origin
count,398.0,398.0,398.0,398.0,398.0,398.0,398.0
mean,23.514573,5.454774,193.425879,2970.424623,15.56809,76.01005,1.572864
std,7.815984,1.701004,104.269838,846.841774,2.757689,3.697627,0.802055
min,9.0,3.0,68.0,1613.0,8.0,70.0,1.0
25%,17.5,4.0,104.25,2223.75,13.825,73.0,1.0
50%,23.0,4.0,148.5,2803.5,15.5,76.0,1.0
75%,29.0,8.0,262.0,3608.0,17.175,79.0,2.0
max,46.6,8.0,455.0,5140.0,24.8,82.0,3.0


The summary above shows the distribution of the numeric variables. Note that horsepower is absent from it, which indicates that pandas did not read that column as numeric. model_year only ranges from 70 to 82, so it behaves more like a categorical variable than a continuous one. The origin variable is clearly categorical, with 3 categories.

The following table checks the data for missing values; surprisingly, it reports none.

In [22]:
# Count missing (NaN) values in each column.
missing = mtcars.isnull().sum()
display(missing.to_frame('missingness'))

Unnamed: 0,missingness
mpg,0
cylinders,0
displacement,0
horsepower,0
weight,0
acceleration,0
model_year,0
origin,0
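
One caveat on the table above: `isnull()` only counts true NaN values, so a column stored as text could hide missing entries behind a placeholder string. Since horsepower does not appear in the `describe()` output, it may be worth checking. The following is a minimal sketch on a toy frame; the `"?"` placeholder and the toy values are assumptions for illustration, not something confirmed for this dataset.

```python
import pandas as pd

# Toy data: the "?" placeholder is a hypothetical stand-in for a
# missing value stored as text.
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0],
    "horsepower": ["130.0", "?", "95.0"],
})

# isnull() sees an ordinary string here, so it reports no missing values.
n_missing_before = int(toy["horsepower"].isnull().sum())

# Coercing to numeric turns entries that cannot be parsed into NaN.
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")
n_missing_after = int(toy["horsepower"].isnull().sum())

print(n_missing_before, n_missing_after)
```

If the same coercion on `mtcars.horsepower` produced any NaN values, those rows would need to be imputed or dropped before modeling.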


First, I would like to look at the distribution of the categorical variables, model_year and origin. The following are frequency tables for both variables.

In [23]:
display(mtcars.model_year.value_counts().sort_index().to_frame())
display(mtcars.origin.value_counts().sort_index().to_frame())

Unnamed: 0,model_year
70,29
71,28
72,28
73,40
74,27
75,30
76,34
77,28
78,36
79,29


Unnamed: 0,origin
1,249
2,70
3,79


Because model_year covers such a limited range of years, I am rather skeptical of its usefulness for analysis. Treating it as a continuous variable would be inappropriate. Treating it as a categorical variable raises its own challenges: using each year as its own category would create many categories, which would then need to be one-hot encoded, and this might cause issues when training models. Grouping the years would also be difficult, since any new data would have to be grouped the same way, and the range of years that new data might contain is unknown.
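
The grouping problem can be illustrated with `pd.cut`. This is only a sketch: the cut points, labels, and the out-of-range year 85 are hypothetical choices, not groupings used in this notebook.

```python
import pandas as pd

# 85 stands in for a year outside the range seen in the data.
years = pd.Series([70, 75, 82, 85])

# Hypothetical cut points: (69, 73], (73, 77], (77, 82].
bins = [69, 73, 77, 82]
labels = ["70-73", "74-77", "78-82"]

grouped = pd.cut(years, bins=bins, labels=labels)

# Years that fall outside every bin come back as NaN.
print(grouped.tolist())
```

Any year beyond the last bin edge is left without a group, which is the difficulty described above.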


Still, I believe the year a car was made has a lot of predictive power for its mpg, since fuel economy has increased over the years. For this reason, I would like to keep it, if only to test its usefulness. If it is included, I think treating it as an ordinal variable would be best, so we will keep it in the data after ordinal encoding.
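
One possible way to do the ordinal encoding with pandas alone is an ordered categorical. This is a sketch on toy values; the column name `model_year_ord` is a hypothetical choice, and a dedicated encoder from a modeling library would work equally well.

```python
import pandas as pd

toy = pd.DataFrame({"model_year": [70, 82, 76, 70]})

# Build an ordered categorical type from the years present in the data.
year_order = sorted(toy["model_year"].unique())
year_type = pd.CategoricalDtype(categories=year_order, ordered=True)

# The integer codes give each year its rank: earliest year -> 0, and so on.
toy["model_year_ord"] = toy["model_year"].astype(year_type).cat.codes

print(toy)
```

Note that the codes are ranks, so they only preserve the spacing between years when every year in the range is present. A year that is not in `year_order` would receive the code -1, so new data still needs a policy for unseen years.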


The distribution of the origin variable is heavily skewed toward the "1" category, which makes me believe that it represents the "American" category, and that the other two categories represent "European" and "Asian" cars. We could confirm this by looking at the car names, but that really should not be necessary. Since there are only three categories, this variable is a good candidate for one-hot encoding. I think it would contribute to the predictive power of the model, since differences in the manufacturing process have resulted in differences in mpg between cars manufactured in different locations.
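
A minimal sketch of the one-hot encoding with `pd.get_dummies`, using toy rows rather than the real data. The `origin_` prefix and the integer dtype are illustrative choices.

```python
import pandas as pd

toy = pd.DataFrame({
    "mpg": [18.0, 26.0, 31.0],
    "origin": [1, 2, 3],
})

# Replace origin with one indicator column per category.
# dtype=int gives 0/1 columns regardless of the pandas version.
encoded = pd.get_dummies(toy, columns=["origin"], prefix="origin", dtype=int)

print(encoded.columns.tolist())
```

For a linear model, passing `drop_first=True` would drop one indicator and avoid perfectly collinear columns.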