# <font color='#eb3483'> Exploratory Data Analysis </font>

## <font color='#eb3483'> 1. Loading & Data Processing </font>

In this section we are going to explain how to do Exploratory Data Analysis (EDA).

One of the reasons I like to explain EDA is that there is no standard way of performing it. The process I usually follow is based on one proposed by District Data Labs and explained in three blog posts ([part 1](https://www.districtdatalabs.com/data-exploration-with-python-1), [part 2](https://www.districtdatalabs.com/data-exploration-with-python-2) and [part 3](https://www.districtdatalabs.com/data-exploration-with-python-3)) and a [video](https://www.youtube.com/watch?v=YEBRkLo568Q).

We are going to analyze the **Fuel Economy Dataset**, a dataset generated by the US Environmental Protection Agency (EPA) and the Department of Energy that records fuel consumption and CO2 emissions for pretty much any car that was ever sold in the US.

The original dataset is located at: https://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip
The file we are going to use is a simplified version.

Here is the dataset description (the **Data Dictionary**):
http://www.fueleconomy.gov/feg/ws/index.shtml#ft7


For this example we will put ourselves in the place of an analyst at the EPA, in charge of analyzing this dataset **to find insights related to pollution and fuel consumption.**

In [1]:
import pandas as pd
import seaborn as sns
%matplotlib inline

### <font color='#eb3483'> Data Loading </font>

In this step we read the data in whatever form it was provided, whether as a CSV file (like in this case) or as the result of some SQL queries we had to run. The goal is to receive the data and, usually, to create a dataset that fits in our computer's memory and that we can analyze easily.
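For the SQL case mentioned above, here is a minimal sketch of what loading could look like. It assumes a SQLite database; the in-memory database, the table name `vehicles`, and the variable `vehicles_from_sql` are made up for illustration, and the three rows are copied from the CSV preview in this notebook.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vehicles (make TEXT, model TEXT, year INTEGER, comb08 INTEGER)"
)
conn.executemany(
    "INSERT INTO vehicles VALUES (?, ?, ?, ?)",
    [
        ("Alfa Romeo", "Spider Veloce 2000", 1985, 21),
        ("Ferrari", "Testarossa", 1985, 11),
        ("Dodge", "Charger", 1985, 27),
    ],
)
conn.commit()

# Select only the columns and rows we need so the result fits in memory.
query = "SELECT make, model, year, comb08 FROM vehicles WHERE year >= 1985"
vehicles_from_sql = pd.read_sql_query(query, conn)
conn.close()

print(vehicles_from_sql.shape)
```

Whatever the source, the outcome is the same: a single DataFrame that we can explore with pandas.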

In [2]:
vehicles = pd.read_csv("data/vehicles_original.csv")

In [3]:
vehicles.shape

(39865, 11)

In [4]:
vehicles.head()

Unnamed: 0,co2TailpipeGpm,comb08,cylinders,displ,drive,fuelType,make,model,trany,VClass,year
0,423.190476,21,4.0,2.0,Rear-Wheel Drive,Regular,Alfa Romeo,Spider Veloce 2000,Manual 5-spd,Two Seaters,1985
1,807.909091,11,12.0,4.9,Rear-Wheel Drive,Regular,Ferrari,Testarossa,Manual 5-spd,Two Seaters,1985
2,329.148148,27,4.0,2.2,Front-Wheel Drive,Regular,Dodge,Charger,Manual 5-spd,Subcompact Cars,1985
3,807.909091,11,8.0,5.2,Rear-Wheel Drive,Regular,Dodge,B150/B250 Wagon 2WD,Automatic 3-spd,Vans,1985
4,467.736842,19,4.0,2.2,4-Wheel or All-Wheel Drive,Premium,Subaru,Legacy AWD Turbo,Manual 5-spd,Compact Cars,1993


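The first thing to do is rename the columns so they are easily identifiable for our analysis. It is important not to use spaces in column names: a name with a space is not a valid Python identifier, so pandas shortcuts such as attribute access (`vehicles.column_name`) will not work with it.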

In [5]:
vehicles = vehicles.rename(columns={
    "trany":"transmission",
    "displ":"displacement", #engine displacement
    "VClass":"vehicle_class",
    "fuelType":"fuel",
    "comb08":"consumption_mpg", #combined MPG for fuelType1
    "co2TailpipeGpm":"co2", # tailpipe CO2 in grams/mile
})
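To illustrate why spaces in column names get in the way, here is a small sketch on a toy DataFrame. The frame `toy` and the column name `"vehicle class"` are invented for this example and are not part of the dataset.

```python
import pandas as pd

# Toy frame with one column name that contains a space (illustrative only).
toy = pd.DataFrame({"vehicle class": ["Vans", "Two Seaters"], "year": [1985, 1985]})

# Bracket access always works, even when the name contains a space.
print(toy["vehicle class"].tolist())

# Attribute access needs a valid Python identifier; `toy.vehicle class`
# would be a syntax error because of the space.
print("vehicle class".isidentifier())   # False
print("vehicle_class".isidentifier())   # True

renamed = toy.rename(columns={"vehicle class": "vehicle_class"})

# After renaming, attribute access and plain query strings both work.
classes = renamed.vehicle_class.tolist()
only_vans = renamed.query("vehicle_class == 'Vans'")
print(classes)
print(only_vans)
```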

In [6]:
vehicles.head()

Unnamed: 0,co2,consumption_mpg,cylinders,displacement,drive,fuel,make,model,transmission,vehicle_class,year
0,423.190476,21,4.0,2.0,Rear-Wheel Drive,Regular,Alfa Romeo,Spider Veloce 2000,Manual 5-spd,Two Seaters,1985
1,807.909091,11,12.0,4.9,Rear-Wheel Drive,Regular,Ferrari,Testarossa,Manual 5-spd,Two Seaters,1985
2,329.148148,27,4.0,2.2,Front-Wheel Drive,Regular,Dodge,Charger,Manual 5-spd,Subcompact Cars,1985
3,807.909091,11,8.0,5.2,Rear-Wheel Drive,Regular,Dodge,B150/B250 Wagon 2WD,Automatic 3-spd,Vans,1985
4,467.736842,19,4.0,2.2,4-Wheel or All-Wheel Drive,Premium,Subaru,Legacy AWD Turbo,Manual 5-spd,Compact Cars,1993


In [7]:
vehicles.dtypes

co2                float64
consumption_mpg      int64
cylinders          float64
displacement       float64
drive               object
fuel                object
make                object
model               object
transmission        object
vehicle_class       object
year                 int64
dtype: object

**What is the goal of our analysis?**

When we start an EDA it is important to always keep our goal in mind. Usually there is a list of questions that we are trying to answer, or at least a general reason why the dataset was compiled.

In this case, one of the main reasons the EPA collects this dataset is to monitor each car's emissions (mostly CO2).

### <font color='#eb3483'> Data Dictionary </font>

It is important to write down the description and datatypes of the variables.

* co2 (float). Tailpipe CO2 emissions, in grams per mile
* consumption_mpg (int). Combined fuel consumption, in miles per gallon
* cylinders (float). Number of cylinders in the engine
* displacement (float). Engine displacement (volume displaced by the engine), in liters
* drive (categorical). Drive type
* fuel (categorical). Type of fuel
* make (categorical). Car manufacturer
* model (categorical). Model name
* transmission (categorical). Transmission type
* vehicle_class (categorical). Vehicle class
* year (int). Year the car was released
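One way to keep such a dictionary next to the data is to build it as a small table. The sketch below is only a suggestion: `sample` is a stand-in with a few columns from the first rows shown earlier, and `descriptions` and `data_dictionary` are hypothetical names.

```python
import pandas as pd

# Small stand-in for the renamed `vehicles` frame (values from the preview).
sample = pd.DataFrame({
    "co2": [423.190476, 807.909091],
    "consumption_mpg": [21, 11],
    "cylinders": [4.0, 12.0],
    "make": ["Alfa Romeo", "Ferrari"],
    "year": [1985, 1985],
})

# Hand-written descriptions, keyed by column name (hypothetical helper dict).
descriptions = {
    "co2": "Tailpipe CO2 emissions, in grams per mile",
    "consumption_mpg": "Combined fuel consumption, in miles per gallon",
    "cylinders": "Number of cylinders in the engine",
    "make": "Car manufacturer",
    "year": "Year the car was released",
}

# One row per column: pandas dtype, number of distinct values, description.
data_dictionary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
    "description": pd.Series(descriptions),
})
print(data_dictionary)
```

The `dtype` column comes straight from pandas, so it is a quick way to spot a column whose type differs from what the written description says.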


### <font color='#eb3483'> Entity Description </font>

Here we describe the possible entities that we can break our dataset into. This will help us think of different ways to slice and group the dataset in later steps.

- make  *(All Toyota cars)*
- make-model *(All Toyota Camry cars)*
- make-model-year *(The Toyota Camry of 2015)*
- make-year *(All Toyota cars manufactured in 2014)*
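Each of these entities maps naturally to a `groupby` key. The sketch below uses a five-row stand-in called `sample`, built from the rows shown in the preview, so the numbers are illustrative and not results for the full dataset.

```python
import pandas as pd

# Five rows copied from the preview; a stand-in for the full `vehicles` frame.
sample = pd.DataFrame({
    "make": ["Alfa Romeo", "Ferrari", "Dodge", "Dodge", "Subaru"],
    "model": [
        "Spider Veloce 2000",
        "Testarossa",
        "Charger",
        "B150/B250 Wagon 2WD",
        "Legacy AWD Turbo",
    ],
    "year": [1985, 1985, 1985, 1985, 1993],
    "co2": [423.190476, 807.909091, 329.148148, 807.909091, 467.736842],
})

# Entity "make": average CO2 per manufacturer.
co2_by_make = sample.groupby("make")["co2"].mean()

# Entity "make-year": average CO2 per manufacturer and year.
co2_by_make_year = sample.groupby(["make", "year"])["co2"].mean()

# Entity "make-model-year": number of rows per model and year.
n_by_make_model_year = sample.groupby(["make", "model", "year"]).size()

print(co2_by_make)
print(co2_by_make_year)
print(n_by_make_model_year)
```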

### <font color='#eb3483'>  Saving our data </font>
After each step it is important to save the dataset under a different name (so we don't modify the original).

In [8]:
vehicles.to_csv("data/vehicles.1.initial_process.csv", index=False)
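The `index=False` argument matters when the file is read back. The following sketch shows the difference on a tiny stand-in frame; `sample` and the file names are hypothetical, and a temporary directory is used so nothing is written under `data/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in frame (illustrative only).
sample = pd.DataFrame({"make": ["Dodge", "Ferrari"], "year": [1985, 1985]})

with tempfile.TemporaryDirectory() as tmp:
    path_no_index = Path(tmp) / "sample.no_index.csv"
    path_with_index = Path(tmp) / "sample.with_index.csv"

    sample.to_csv(path_no_index, index=False)  # row index is not written
    sample.to_csv(path_with_index)             # default: row index is written

    reloaded = pd.read_csv(path_no_index)
    reloaded_with_index = pd.read_csv(path_with_index)

print(list(reloaded.columns))             # ['make', 'year']
print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'make', 'year']
```

Without `index=False`, the row index is saved as an extra unnamed column, which pandas labels `Unnamed: 0` on the next read.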