# What is Fuel Economy?

by [Wikipedia](https://en.wikipedia.org/wiki/Fuel_economy_in_automobiles)

The fuel economy of an automobile is the relationship between the distance traveled and the amount of fuel consumed by the vehicle. Consumption can be expressed as the volume of fuel needed to travel a given distance, or as the distance traveled per unit volume of fuel consumed.
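The two conventions above are reciprocals of each other up to a unit constant. As a minimal sketch, assuming US gallons and statute miles (the units behind the MPG columns used later), the conversion to litres per 100 km could look like this. The function names are hypothetical and not part of the original notebook.

```python
# Unit definitions (US liquid gallon and international mile).
LITRES_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344


def mpg_to_l_per_100km(mpg: float) -> float:
    """Convert distance per volume (miles per US gallon) to volume per distance (L/100 km)."""
    if mpg <= 0:
        raise ValueError("mpg must be positive")
    return 100 * LITRES_PER_US_GALLON / (KM_PER_MILE * mpg)


def l_per_100km_to_mpg(l_per_100km: float) -> float:
    """Convert volume per distance (L/100 km) back to miles per US gallon."""
    if l_per_100km <= 0:
        raise ValueError("consumption must be positive")
    return 100 * LITRES_PER_US_GALLON / (KM_PER_MILE * l_per_100km)


print(round(mpg_to_l_per_100km(23), 2))  # 10.23
```

Note that a higher MPG means a lower L/100 km figure, so the two scales rank vehicles in opposite order.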

![img](fuel-consumption.png)

- Fuel Economy: [Information](https://www.epa.gov/compliance-and-fuel-economy-data/data-cars-used-testing-fuel-economy)
- Fuel Economy: [Dataset](https://www.fueleconomy.gov/feg/download.shtml/)
    - [Data Description](http://www.fueleconomy.gov/feg/epadata/Readme.txt)
    - [PDF](http://www.fueleconomy.gov/feg/EPAGreenGuide/GreenVehicleGuideDocumentation.pdf)

--------------

# 1. Accessing Data

In [33]:
import pandas as pd
import numpy as np

In [34]:
df_2008 = pd.read_csv("Data/all_alpha_08.csv")
df_2018 = pd.read_csv("Data/all_alpha_18.csv")

In [35]:
# total number of cells (rows x columns) in each dataset
print("Data Size:")
df_2008.size, df_2018.size

Data Size:


(43272, 28998)

In [36]:
# number of rows and columns in each dataset
print("Number of rows and columns:")
df_2008.shape, df_2018.shape

Number of rows and columns:


((2404, 18), (1611, 18))
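`size` and `shape` answer different questions: `size` counts cells, while `shape` gives the row and column counts. A small sketch on a made-up frame (the `toy` data is an assumption for illustration only) shows how the two relate, and checks the figures reported above.

```python
import pandas as pd

# Toy frame standing in for the EPA tables (values are made up).
toy = pd.DataFrame({"model": ["A", "B", "C"], "displ": [2.0, 3.5, 1.6]})

n_rows, n_cols = toy.shape
cells = toy.size          # rows * columns, not the number of samples
samples = len(toy)        # row count, i.e. the number of samples

print(n_rows, n_cols, cells, samples)

# The same identity holds for the shapes and sizes printed above.
assert 2404 * 18 == 43272
assert 1611 * 18 == 28998
```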

In [37]:
# duplicate rows in each dataset
print("Duplicate information 2008: ", df_2008.duplicated().sum())
print("Duplicate information 2018: ", df_2018.duplicated().sum())

Duplicate information 2008:  25
Duplicate information 2018:  0
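`duplicated()` flags every row that repeats an earlier row across all columns, so its sum is the number of surplus rows. A sketch of how the 25 duplicates in the 2008 data might later be removed, shown on made-up rows rather than the real file:

```python
import pandas as pd

# Made-up rows; rows 1 and 3 repeat row 0 exactly.
toy = pd.DataFrame(
    {
        "model": ["ACURA MDX", "ACURA MDX", "ACURA RDX", "ACURA MDX"],
        "displ": [3.7, 3.7, 2.3, 3.7],
    }
)

dup_mask = toy.duplicated()            # True for each repeat of an earlier row
n_dups = int(dup_mask.sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dups, len(deduped))
```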


In [38]:
# datatypes of columns
print("Datatypes information: ")
df_2008.info()

Datatypes information: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2404 entries, 0 to 2403
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Model                 2404 non-null   object 
 1   Displ                 2404 non-null   float64
 2   Cyl                   2205 non-null   object 
 3   Trans                 2205 non-null   object 
 4   Drive                 2311 non-null   object 
 5   Fuel                  2404 non-null   object 
 6   Sales Area            2404 non-null   object 
 7   Stnd                  2404 non-null   object 
 8   Underhood ID          2404 non-null   object 
 9   Veh Class             2404 non-null   object 
 10  Air Pollution Score   2404 non-null   object 
 11  FE Calc Appr          2205 non-null   object 
 12  City MPG              2205 non-null   object 
 13  Hwy MPG               2205 non-null   object 
 14  Cmb MPG               2205 non-null   object 
 1

In [39]:
df_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1611 entries, 0 to 1610
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Model                 1611 non-null   object 
 1   Displ                 1609 non-null   float64
 2   Cyl                   1609 non-null   float64
 3   Trans                 1611 non-null   object 
 4   Drive                 1611 non-null   object 
 5   Fuel                  1611 non-null   object 
 6   Cert Region           1611 non-null   object 
 7   Stnd                  1611 non-null   object 
 8   Stnd Description      1611 non-null   object 
 9   Underhood ID          1611 non-null   object 
 10  Veh Class             1611 non-null   object 
 11  Air Pollution Score   1611 non-null   int64  
 12  City MPG              1611 non-null   object 
 13  Hwy MPG               1611 non-null   object 
 14  Cmb MPG               1611 non-null   object 
 15  Greenhouse Gas Score 

In [40]:
# features with missing values
print("Features with missing values for 2008: ", df_2008.isnull().sum())

Features with missing values for 2008:  Model                     0
Displ                     0
Cyl                     199
Trans                   199
Drive                    93
Fuel                      0
Sales Area                0
Stnd                      0
Underhood ID              0
Veh Class                 0
Air Pollution Score       0
FE Calc Appr            199
City MPG                199
Hwy MPG                 199
Cmb MPG                 199
Unadj Cmb MPG           199
Greenhouse Gas Score    199
SmartWay                  0
dtype: int64


In [41]:
print("Features with missing values for 2018: ", df_2018.isna().sum())

Features with missing values for 2018:  Model                   0
Displ                   2
Cyl                     2
Trans                   0
Drive                   0
Fuel                    0
Cert Region             0
Stnd                    0
Stnd Description        0
Underhood ID            0
Veh Class               0
Air Pollution Score     0
City MPG                0
Hwy MPG                 0
Cmb MPG                 0
Greenhouse Gas Score    0
SmartWay                0
Comb CO2                0
dtype: int64
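`isnull()` and `isna()` are aliases in pandas, so the two cells above compute the same thing. When most columns are complete, it can be easier to read only the columns that actually have gaps. A sketch on made-up data (the `toy` frame is an assumption):

```python
import numpy as np
import pandas as pd

# Made-up frame with one incomplete row.
toy = pd.DataFrame(
    {
        "model": ["A", "B", "C", "D"],
        "cyl": [4.0, np.nan, 6.0, 8.0],
        "drive": ["2WD", None, "4WD", "2WD"],
    }
)

missing = toy.isnull().sum()
only_missing = missing[missing > 0]            # keep columns with at least one gap

rows_with_gaps = toy[toy.isnull().any(axis=1)]  # rows that have any missing value
complete = toy.dropna()                         # rows with no missing values

print(only_missing)
```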


In [42]:
# number of unique non-null values per feature (2008)
print("Number of non null unique values for features 2008: ", df_2008.nunique())

Number of non null unique values for features 2008:  Model                   436
Displ                    47
Cyl                       8
Trans                    14
Drive                     2
Fuel                      5
Sales Area                3
Stnd                     12
Underhood ID            343
Veh Class                 9
Air Pollution Score      13
FE Calc Appr              2
City MPG                 39
Hwy MPG                  43
Cmb MPG                  38
Unadj Cmb MPG           721
Greenhouse Gas Score     20
SmartWay                  2
dtype: int64


In [43]:
# number of unique non-null values per feature (2018)
print("Number of non null unique values for features 2018: ", df_2018.nunique())

Number of non null unique values for features 2018:  Model                   367
Displ                    36
Cyl                       7
Trans                    26
Drive                     2
Fuel                      5
Cert Region               2
Stnd                     19
Stnd Description         19
Underhood ID            230
Veh Class                 9
Air Pollution Score       6
City MPG                 58
Hwy MPG                  62
Cmb MPG                  57
Greenhouse Gas Score     10
SmartWay                  3
Comb CO2                299
dtype: int64
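`nunique()` only reports how many distinct values each feature has. To see what those values are and how often each occurs, `unique()` and `value_counts()` can be used per column. A sketch on a made-up `fuel` column:

```python
import pandas as pd

# Made-up column with one missing entry.
toy = pd.DataFrame({"fuel": ["Gasoline", "Gasoline", "Diesel", None, "Gasoline"]})

n_unique = toy["fuel"].nunique()                   # missing values are excluded by default
values = toy["fuel"].unique()                      # distinct values, missing one included
counts = toy["fuel"].value_counts()                # frequency of each non-missing value
counts_with_na = toy["fuel"].value_counts(dropna=False)

print(counts)
```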


--------------

# 2. Cleaning Data

## 2.1. Drop Extra columns

In [44]:
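# drop the extra columns
# note: drop() returns a new DataFrame; the result is only displayed here, not stored
df_2008.drop(["Stnd", "Underhood ID", "FE Calc Appr", "Unadj Cmb MPG"], axis = 1)
df_2018.drop(["Stnd", "Stnd Description", "Underhood ID", "Comb CO2"], axis = 1)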

| | Model | Displ | Cyl | Trans | Drive | Fuel | Cert Region | Veh Class | Air Pollution Score | City MPG | Hwy MPG | Cmb MPG | Greenhouse Gas Score | SmartWay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACURA RDX | 3.5 | 6.0 | SemiAuto-6 | 2WD | Gasoline | FA | small SUV | 3 | 20 | 28 | 23 | 5 | No |
| 1 | ACURA RDX | 3.5 | 6.0 | SemiAuto-6 | 2WD | Gasoline | CA | small SUV | 3 | 20 | 28 | 23 | 5 | No |
| 2 | ACURA RDX | 3.5 | 6.0 | SemiAuto-6 | 4WD | Gasoline | FA | small SUV | 3 | 19 | 27 | 22 | 4 | No |
| 3 | ACURA RDX | 3.5 | 6.0 | SemiAuto-6 | 4WD | Gasoline | CA | small SUV | 3 | 19 | 27 | 22 | 4 | No |
| 4 | ACURA TLX | 2.4 | 4.0 | AMS-8 | 2WD | Gasoline | CA | small car | 3 | 23 | 33 | 27 | 6 | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1606 | VOLVO XC 90 | 2.0 | 4.0 | SemiAuto-8 | 4WD | Gasoline | FA | standard SUV | 5 | 22 | 28 | 24 | 5 | No |
| 1607 | VOLVO XC 90 | 2.0 | 4.0 | SemiAuto-8 | 4WD | Gasoline | CA | standard SUV | 5 | 20 | 27 | 23 | 5 | No |
| 1608 | VOLVO XC 90 | 2.0 | 4.0 | SemiAuto-8 | 4WD | Gasoline | FA | standard SUV | 5 | 20 | 27 | 23 | 5 | No |
| 1609 | VOLVO XC 90 | 2.0 | 4.0 | SemiAuto-8 | 4WD | Gasoline/Electricity | CA | standard SUV | 7 | 26/63 | 30/61 | 27/62 | 10 | Elite |
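The preview in this step is the return value of `drop()`. The call does not modify the DataFrame it is called on unless the result is assigned back or `inplace=True` is passed. A sketch of the difference on a made-up one-row frame (column names borrowed from the 2018 data, values illustrative):

```python
import pandas as pd

# Made-up frame with two columns to discard.
toy = pd.DataFrame(
    {
        "Model": ["ACURA RDX"],
        "Stnd": ["T3B125"],
        "Underhood ID": ["JHNXT03.5GV3"],
        "Cmb MPG": ["23"],
    }
)

# Result is discarded, so `toy` keeps all four columns.
toy.drop(["Stnd", "Underhood ID"], axis=1)
unchanged_cols = list(toy.columns)

# Assigning the result back is one way to make the drop persist.
trimmed = toy.drop(columns=["Stnd", "Underhood ID"])

print(unchanged_cols)
print(list(trimmed.columns))
```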


## 2.2. Rename the columns

In [47]:
# make the column name consistent between the two datasets
df_2008.rename(columns = {"Sales Area": "Cert Region"}, inplace = True)
df_2008.head(2)

| | Model | Displ | Cyl | Trans | Drive | Fuel | Cert Region | Stnd | Underhood ID | Veh Class | Air Pollution Score | FE Calc Appr | City MPG | Hwy MPG | Cmb MPG | Unadj Cmb MPG | Greenhouse Gas Score | SmartWay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACURA MDX | 3.7 | (6 cyl) | Auto-S5 | 4WD | Gasoline | CA | U2 | 8HNXT03.7PKR | SUV | 7 | Drv | 15 | 20 | 17 | 22.0527 | 4 | no |
| 1 | ACURA MDX | 3.7 | (6 cyl) | Auto-S5 | 4WD | Gasoline | FA | B5 | 8HNXT03.7PKR | SUV | 6 | Drv | 15 | 20 | 17 | 22.0527 | 4 | no |


In [48]:
# replace space with underscore _ and lowercase for all column names
df_2008.rename(columns = lambda x: x.strip().lower().replace(" ", "_"), inplace = True)
df_2008.head(2)

| | model | displ | cyl | trans | drive | fuel | cert_region | stnd | underhood_id | veh_class | air_pollution_score | fe_calc_appr | city_mpg | hwy_mpg | cmb_mpg | unadj_cmb_mpg | greenhouse_gas_score | smartway |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACURA MDX | 3.7 | (6 cyl) | Auto-S5 | 4WD | Gasoline | CA | U2 | 8HNXT03.7PKR | SUV | 7 | Drv | 15 | 20 | 17 | 22.0527 | 4 | no |
| 1 | ACURA MDX | 3.7 | (6 cyl) | Auto-S5 | 4WD | Gasoline | FA | B5 | 8HNXT03.7PKR | SUV | 6 | Drv | 15 | 20 | 17 | 22.0527 | 4 | no |


In [49]:
df_2018.rename(columns = lambda x : x.strip().lower().replace(" ", "_"), inplace = True)
df_2018.head(2)

| | model | displ | cyl | trans | drive | fuel | cert_region | stnd | stnd_description | underhood_id | veh_class | air_pollution_score | city_mpg | hwy_mpg | cmb_mpg | greenhouse_gas_score | smartway | comb_co2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACURA RDX | 3.5 | 6.0 | SemiAuto-6 | 2WD | Gasoline | FA | T3B125 | Federal Tier 3 Bin 125 | JHNXT03.5GV3 | small SUV | 3 | 20 | 28 | 23 | 5 | No | 386 |
| 1 | ACURA RDX | 3.5 | 6.0 | SemiAuto-6 | 2WD | Gasoline | CA | U2 | California LEV-II ULEV | JHNXT03.5GV3 | small SUV | 3 | 20 | 28 | 23 | 5 | No | 386 |
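The same label clean-up is applied to both frames, so it can be factored into one named function instead of repeating the lambda. A sketch, assuming a hypothetical helper `normalize_label` and a made-up empty frame whose column names mimic the 2008 headers:

```python
import pandas as pd


def normalize_label(label: str) -> str:
    """Trim surrounding whitespace, lowercase, and replace spaces with underscores."""
    return label.strip().lower().replace(" ", "_")


# Made-up frame: only the column labels matter here.
toy = pd.DataFrame(columns=["Model", "Sales Area", " Air Pollution Score "])

# First align the one differing name, then normalize every label.
toy = toy.rename(columns={"Sales Area": "Cert Region"}).rename(columns=normalize_label)

print(list(toy.columns))
```

Passing a function to `rename(columns=...)` applies it to each label, which is what the lambda in the cells above does.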


In [53]:
# confirm all columns between 2008 and 2018 are identical
(df_2008.columns == df_2018.columns).all()

True
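The elementwise `==` check assumes both frames have the same number of columns. To my knowledge, comparing two Index objects of different lengths fails with an error rather than returning `False`. `Index.equals` is a more forgiving alternative, and `Index.difference` shows which labels are out of line. A sketch on made-up empty frames:

```python
import pandas as pd

# Made-up frames: only the column labels matter here.
a = pd.DataFrame(columns=["model", "displ", "cyl"])
b = pd.DataFrame(columns=["model", "displ", "cyl"])
c = pd.DataFrame(columns=["model", "displ"])

same_ab = a.columns.equals(b.columns)   # same labels in the same order
same_ac = a.columns.equals(c.columns)   # different lengths give False, no exception

# Labels present in `a` but missing from `c`.
only_in_a = a.columns.difference(c.columns).tolist()

print(same_ab, same_ac, only_in_a)
```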

In [54]:
# save the progress datasets
df_2008.to_csv("Data/data_08.csv", index = False)
df_2018.to_csv("Data/data_18.csv", index = False)
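`index = False` matters when the files are read back: without it, the row index is written as an unnamed first column and reappears as `Unnamed: 0` on reload. A sketch of the round trip using in-memory buffers instead of the `Data/` files (the `toy` frame is made up):

```python
import io

import pandas as pd

# Made-up frame standing in for the cleaned data.
toy = pd.DataFrame({"model": ["ACURA MDX", "ACURA RDX"], "cmb_mpg": [17, 19]})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded.columns))
```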

---------