# What is Fuel Economy?

by [Wikipedia](https://en.wikipedia.org/wiki/Fuel_economy_in_automobiles)

The fuel economy of an automobile is the relationship between the distance traveled and the amount of fuel consumed by the vehicle. Consumption can be expressed as the volume of fuel needed to travel a given distance, or as the distance traveled per unit volume of fuel consumed.
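The two ways of expressing consumption are reciprocal, so one can be converted into the other. The sketch below is illustrative only: it assumes US gallons and statute miles, and the function names are hypothetical helpers that do not appear elsewhere in this notebook.

```python
# Exact unit definitions: 1 US gallon = 3.785411784 L, 1 mile = 1.609344 km
LITERS_PER_GALLON = 3.785411784
KM_PER_MILE = 1.609344


def mpg_to_l_per_100km(mpg: float) -> float:
    """Convert distance-per-volume (miles per US gallon) to volume-per-distance (L/100 km)."""
    if mpg <= 0:
        raise ValueError("mpg must be positive")
    return 100.0 * LITERS_PER_GALLON / (KM_PER_MILE * mpg)


def l_per_100km_to_mpg(l_per_100km: float) -> float:
    """Inverse conversion: L/100 km back to miles per US gallon."""
    if l_per_100km <= 0:
        raise ValueError("l_per_100km must be positive")
    return 100.0 * LITERS_PER_GALLON / (KM_PER_MILE * l_per_100km)


# A car rated at 23 mpg uses roughly 10.23 L per 100 km
print(round(mpg_to_l_per_100km(23), 2))
```

Because the relationship is reciprocal, a higher mpg value always means a lower L/100 km value.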

![img](fuel-consumption.png)

- Fuel Economy : [Information](https://www.epa.gov/compliance-and-fuel-economy-data/data-cars-used-testing-fuel-economy)
- Fuel Economy: [Dataset](https://www.fueleconomy.gov/feg/download.shtml/)
    - [Data Description](http://www.fueleconomy.gov/feg/epadata/Readme.txt)
    - [PDF](http://www.fueleconomy.gov/feg/EPAGreenGuide/GreenVehicleGuideDocumentation.pdf)

--------------

# 1. Accessing Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_2008 = pd.read_csv("Data/all_alpha_08.csv")
df_2018 = pd.read_csv("Data/all_alpha_18.csv")

In [3]:
# total number of elements (rows x columns) in each dataset
print("Data Size:")
df_2008.size, df_2018.size

Data Size:


(43272, 28998)

In [4]:
# number of rows and columns in each dataset
print("Number of rows and columns:")
df_2008.shape, df_2018.shape

Number of rows and columns:


((2404, 18), (1611, 18))
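The `size` and `shape` results are related: `size` counts every cell, so it equals rows times columns. A small check on a toy frame (the toy data is made up for illustration) shows the relationship, and the same arithmetic reproduces the numbers reported above.

```python
import pandas as pd

# Toy frame: 3 rows, 2 columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape
print(toy.size, n_rows * n_cols)  # both count the cells in the frame

# The same relationship holds for the shapes and sizes printed above
size_2008 = 2404 * 18
size_2018 = 1611 * 18
print(size_2008, size_2018)
```

To count samples (rows) rather than cells, `len(df)` or `df.shape[0]` is the usual choice.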

In [5]:
# duplicate rows in each dataset
print("Duplicate information 2008: ", df_2008.duplicated().sum())
print("Duplicate information 2018: ", df_2018.duplicated().sum())

Duplicate information 2008:  25
Duplicate information 2018:  0


In [6]:
# datatypes of columns
print("Datatypes information: ")
df_2008.info()

Datatypes information: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2404 entries, 0 to 2403
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Model                 2404 non-null   object 
 1   Displ                 2404 non-null   float64
 2   Cyl                   2205 non-null   object 
 3   Trans                 2205 non-null   object 
 4   Drive                 2311 non-null   object 
 5   Fuel                  2404 non-null   object 
 6   Sales Area            2404 non-null   object 
 7   Stnd                  2404 non-null   object 
 8   Underhood ID          2404 non-null   object 
 9   Veh Class             2404 non-null   object 
 10  Air Pollution Score   2404 non-null   object 
 11  FE Calc Appr          2205 non-null   object 
 12  City MPG              2205 non-null   object 
 13  Hwy MPG               2205 non-null   object 
 14  Cmb MPG               2205 non-null   object 
 1

In [7]:
df_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1611 entries, 0 to 1610
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Model                 1611 non-null   object 
 1   Displ                 1609 non-null   float64
 2   Cyl                   1609 non-null   float64
 3   Trans                 1611 non-null   object 
 4   Drive                 1611 non-null   object 
 5   Fuel                  1611 non-null   object 
 6   Cert Region           1611 non-null   object 
 7   Stnd                  1611 non-null   object 
 8   Stnd Description      1611 non-null   object 
 9   Underhood ID          1611 non-null   object 
 10  Veh Class             1611 non-null   object 
 11  Air Pollution Score   1611 non-null   int64  
 12  City MPG              1611 non-null   object 
 13  Hwy MPG               1611 non-null   object 
 14  Cmb MPG               1611 non-null   object 
 15  Greenhouse Gas Score 

In [8]:
# features with missing values
print("Features with missing values for 2008: ", df_2008.isnull().sum())

Features with missing values for 2008:  Model                     0
Displ                     0
Cyl                     199
Trans                   199
Drive                    93
Fuel                      0
Sales Area                0
Stnd                      0
Underhood ID              0
Veh Class                 0
Air Pollution Score       0
FE Calc Appr            199
City MPG                199
Hwy MPG                 199
Cmb MPG                 199
Unadj Cmb MPG           199
Greenhouse Gas Score    199
SmartWay                  0
dtype: int64


In [9]:
print("Features with missing values for 2018: ", df_2018.isna().sum())

Features with missing values for 2018:  Model                   0
Displ                   2
Cyl                     2
Trans                   0
Drive                   0
Fuel                    0
Cert Region             0
Stnd                    0
Stnd Description        0
Underhood ID            0
Veh Class               0
Air Pollution Score     0
City MPG                0
Hwy MPG                 0
Cmb MPG                 0
Greenhouse Gas Score    0
SmartWay                0
Comb CO2                0
dtype: int64


In [10]:
# number of non-null unique values for features in each dataset
# what those unique values are and counts for each
print("Number of non null unique values for features 2008: ", df_2008.nunique())

Number of non null unique values for features 2008:  Model                   436
Displ                    47
Cyl                       8
Trans                    14
Drive                     2
Fuel                      5
Sales Area                3
Stnd                     12
Underhood ID            343
Veh Class                 9
Air Pollution Score      13
FE Calc Appr              2
City MPG                 39
Hwy MPG                  43
Cmb MPG                  38
Unadj Cmb MPG           721
Greenhouse Gas Score     20
SmartWay                  2
dtype: int64


In [11]:
# number of non-null unique values for features in each dataset
# what those unique values are and counts for each
print("Number of non null unique values for features 2018: ", df_2018.nunique())

Number of non null unique values for features 2018:  Model                   367
Displ                    36
Cyl                       7
Trans                    26
Drive                     2
Fuel                      5
Cert Region               2
Stnd                     19
Stnd Description         19
Underhood ID            230
Veh Class                 9
Air Pollution Score       6
City MPG                 58
Hwy MPG                  62
Cmb MPG                  57
Greenhouse Gas Score     10
SmartWay                  3
Comb CO2                299
dtype: int64


--------------

# 2. Cleaning Data

## 2.1. Drop Extra columns

In [12]:
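# NOTE: drop() returns a new DataFrame. The results are not assigned here
# (and inplace=True is not set), so df_2008 and df_2018 keep these columns,
# which is why they still appear in the outputs of later cells.
df_2008.drop(["Stnd", "Underhood ID", "FE Calc Appr", "Unadj Cmb MPG"], axis = 1)
df_2018.drop(["Stnd", "Stnd Description", "Underhood ID", "Comb CO2"], axis = 1)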

Unnamed: 0,Model,Displ,Cyl,Trans,Drive,Fuel,Cert Region,Veh Class,Air Pollution Score,City MPG,Hwy MPG,Cmb MPG,Greenhouse Gas Score,SmartWay
0,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,FA,small SUV,3,20,28,23,5,No
1,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,CA,small SUV,3,20,28,23,5,No
2,ACURA RDX,3.5,6.0,SemiAuto-6,4WD,Gasoline,FA,small SUV,3,19,27,22,4,No
3,ACURA RDX,3.5,6.0,SemiAuto-6,4WD,Gasoline,CA,small SUV,3,19,27,22,4,No
4,ACURA TLX,2.4,4.0,AMS-8,2WD,Gasoline,CA,small car,3,23,33,27,6,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1606,VOLVO XC 90,2.0,4.0,SemiAuto-8,4WD,Gasoline,FA,standard SUV,5,22,28,24,5,No
1607,VOLVO XC 90,2.0,4.0,SemiAuto-8,4WD,Gasoline,CA,standard SUV,5,20,27,23,5,No
1608,VOLVO XC 90,2.0,4.0,SemiAuto-8,4WD,Gasoline,FA,standard SUV,5,20,27,23,5,No
1609,VOLVO XC 90,2.0,4.0,SemiAuto-8,4WD,Gasoline/Electricity,CA,standard SUV,7,26/63,30/61,27/62,10,Elite


## 2.2. Rename the columns

In [13]:
#make column name consistent between two data sets
df_2008.rename(columns = {"Sales Area": "Cert Region"}, inplace = True)
df_2008.head(2)

Unnamed: 0,Model,Displ,Cyl,Trans,Drive,Fuel,Cert Region,Stnd,Underhood ID,Veh Class,Air Pollution Score,FE Calc Appr,City MPG,Hwy MPG,Cmb MPG,Unadj Cmb MPG,Greenhouse Gas Score,SmartWay
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,U2,8HNXT03.7PKR,SUV,7,Drv,15,20,17,22.0527,4,no
1,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,FA,B5,8HNXT03.7PKR,SUV,6,Drv,15,20,17,22.0527,4,no


In [14]:
# replace space with underscore _ and lowercase for all column names
df_2008.rename(columns = lambda x: x.strip().lower().replace(" ", "_"), inplace = True)
df_2008.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,U2,8HNXT03.7PKR,SUV,7,Drv,15,20,17,22.0527,4,no
1,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,FA,B5,8HNXT03.7PKR,SUV,6,Drv,15,20,17,22.0527,4,no


In [15]:
df_2018.rename(columns = lambda x : x.strip().lower().replace(" ", "_"), inplace = True)
df_2018.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,stnd,stnd_description,underhood_id,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway,comb_co2
0,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,FA,T3B125,Federal Tier 3 Bin 125,JHNXT03.5GV3,small SUV,3,20,28,23,5,No,386
1,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,CA,U2,California LEV-II ULEV,JHNXT03.5GV3,small SUV,3,20,28,23,5,No,386


In [16]:
# confirm all columns between 2008 and 2018 are identical
(df_2008.columns == df_2018.columns).all()

False
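The check returns `False` because the two frames still differ in some columns. One way to see exactly which ones is to compare the column sets instead of comparing position by position. The sketch below hard-codes the column names shown in the two `head(2)` outputs above; the variable names are illustrative.

```python
import pandas as pd

# Column names as printed by head(2) for each dataset
cols_2008 = pd.Index([
    "model", "displ", "cyl", "trans", "drive", "fuel", "cert_region", "stnd",
    "underhood_id", "veh_class", "air_pollution_score", "fe_calc_appr",
    "city_mpg", "hwy_mpg", "cmb_mpg", "unadj_cmb_mpg", "greenhouse_gas_score",
    "smartway",
])
cols_2018 = pd.Index([
    "model", "displ", "cyl", "trans", "drive", "fuel", "cert_region", "stnd",
    "stnd_description", "underhood_id", "veh_class", "air_pollution_score",
    "city_mpg", "hwy_mpg", "cmb_mpg", "greenhouse_gas_score", "smartway",
    "comb_co2",
])

# Columns present in one dataset but not the other
only_2008 = cols_2008.difference(cols_2018)
only_2018 = cols_2018.difference(cols_2008)

print("only in 2008:", list(only_2008))
print("only in 2018:", list(only_2018))
```

A set-based comparison also works when the two frames have different numbers of columns, a case in which an element-wise `==` between the two indexes would be expected to raise an error rather than return `False`.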

In [17]:
# save the progress datasets
df_2008.to_csv("Data/data_08.csv", index = False)
df_2018.to_csv("Data/data_18.csv", index = False)

---------

# 3. Filter, Drop Nulls, Dedupe

## 3.1. Filter only for California region

For consistency, compare only cars certified to California standards. Filter both datasets with `query` to select only the rows where `cert_region` is `CA`. Then drop the `cert_region` column, since it no longer carries any information (every remaining value is `CA`).

In [18]:
# get only for CA region
df_2008.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,U2,8HNXT03.7PKR,SUV,7,Drv,15,20,17,22.0527,4,no
1,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,FA,B5,8HNXT03.7PKR,SUV,6,Drv,15,20,17,22.0527,4,no


In [19]:
df_2008 = df_2008.query('cert_region == "CA"')
df_2008.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,CA,U2,8HNXT03.7PKR,SUV,7,Drv,15,20,17,22.0527,4,no
2,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,CA,U2,8HNXT02.3DKR,SUV,7,Drv,17,22,19,24.1745,5,no


In [20]:
df_2008.drop("cert_region", axis=1, inplace=True)
df_2008.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,U2,8HNXT03.7PKR,SUV,7,Drv,15,20,17,22.0527,4,no
2,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,U2,8HNXT02.3DKR,SUV,7,Drv,17,22,19,24.1745,5,no


In [21]:
# 2018 dataset
df_2018 = df_2018.query('cert_region == "CA"')
df_2018.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,cert_region,stnd,stnd_description,underhood_id,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway,comb_co2
1,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,CA,U2,California LEV-II ULEV,JHNXT03.5GV3,small SUV,3,20,28,23,5,No,386
3,ACURA RDX,3.5,6.0,SemiAuto-6,4WD,Gasoline,CA,U2,California LEV-II ULEV,JHNXT03.5GV3,small SUV,3,19,27,22,4,No,402


In [22]:
df_2018.drop("cert_region", axis = 1, inplace = True)
df_2018.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,stnd_description,underhood_id,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway,comb_co2
1,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,U2,California LEV-II ULEV,JHNXT03.5GV3,small SUV,3,20,28,23,5,No,386
3,ACURA RDX,3.5,6.0,SemiAuto-6,4WD,Gasoline,U2,California LEV-II ULEV,JHNXT03.5GV3,small SUV,3,19,27,22,4,No,402
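The filter-then-drop steps above can also be written as a single function that returns a new frame instead of modifying a filtered result in place. This is only a sketch on toy data; `keep_california` is a hypothetical helper name, and the boolean mask is used here as an alternative to `query`.

```python
import pandas as pd


def keep_california(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the CA-certified rows, without the cert_region column."""
    ca_rows = df[df["cert_region"] == "CA"]
    # drop() returns a new DataFrame, so the input frame is left untouched
    return ca_rows.drop(columns="cert_region")


# Toy data loosely modelled on the rows shown above
toy = pd.DataFrame({
    "model": ["ACURA RDX", "ACURA RDX", "ACURA TLX"],
    "cert_region": ["FA", "CA", "CA"],
    "cmb_mpg": [23, 23, 27],
})

ca_only = keep_california(toy)
print(ca_only)
```

Returning a new frame avoids the chained-assignment warnings that pandas can emit when a filtered subset is later modified with `inplace=True`.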


## 3.2. Drop Nulls

Drop any rows in both datasets that contain missing values.

In [23]:
# 2008 data, check for na values
df_2008.isnull().sum()

model                    0
displ                    0
cyl                     75
trans                   75
drive                   37
fuel                     0
stnd                     0
underhood_id             0
veh_class                0
air_pollution_score      0
fe_calc_appr            75
city_mpg                75
hwy_mpg                 75
cmb_mpg                 75
unadj_cmb_mpg           75
greenhouse_gas_score    75
smartway                 0
dtype: int64

In [24]:
# drop null values rows
df_2008.dropna(axis=0, inplace = True)

In [25]:
# checks if any of columns in 2008 have null values - should print False
df_2008.isna().sum().any()

False

In [26]:
# 2018 data
df_2018.dropna(axis = 0, inplace = True)
df_2018.isnull().sum().any()

False

## 3.3. Dedupe
Drop any duplicate rows in both datasets.

In [27]:
# check for duplicate rows
print("duplicated rows for 2008: ", df_2008.duplicated().sum())
print("duplicated rows for 2018: ", df_2018.duplicated().sum())

duplicated rows for 2008:  3
duplicated rows for 2018:  0


In [28]:
# drop the duplicate rows for 2008 data
df_2008.drop_duplicates(inplace = True)

In [29]:
# print number of duplicates again to confirm dedupe - should be 0
df_2008.duplicated().sum()

0
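For reference, the count-drop-recount pattern used above behaves as follows on a toy frame (made-up data). By default, `duplicated()` marks every repeat after the first occurrence, and `drop_duplicates()` keeps that first occurrence.

```python
import pandas as pd

toy = pd.DataFrame({
    "model": ["A", "A", "B"],
    "cmb_mpg": [20, 20, 25],
})

# Rows 0 and 1 are identical, so exactly one row is flagged as a duplicate
n_dupes = int(toy.duplicated().sum())

# Keep the first copy of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```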

In [30]:
# save the progress data
df_2008.to_csv("Data/data_08.csv", index = False)
df_2018.to_csv("Data/data_18.csv", index = False)

------------

# 4. Inspecting Data Types

Inspect the datatypes of the features in each dataset and decide what changes are needed to make them practical and consistent across both datasets.

In [31]:
df_2008.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,U2,8HNXT03.7PKR,SUV,7,Drv,15,20,17,22.0527,4,no
2,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,U2,8HNXT02.3DKR,SUV,7,Drv,17,22,19,24.1745,5,no


In [32]:
df_2018.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,stnd_description,underhood_id,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway,comb_co2
1,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,U2,California LEV-II ULEV,JHNXT03.5GV3,small SUV,3,20,28,23,5,No,386
3,ACURA RDX,3.5,6.0,SemiAuto-6,4WD,Gasoline,U2,California LEV-II ULEV,JHNXT03.5GV3,small SUV,3,19,27,22,4,No,402


In [33]:
df_2008.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1006 entries, 0 to 2400
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   model                 1006 non-null   object 
 1   displ                 1006 non-null   float64
 2   cyl                   1006 non-null   object 
 3   trans                 1006 non-null   object 
 4   drive                 1006 non-null   object 
 5   fuel                  1006 non-null   object 
 6   stnd                  1006 non-null   object 
 7   underhood_id          1006 non-null   object 
 8   veh_class             1006 non-null   object 
 9   air_pollution_score   1006 non-null   object 
 10  fe_calc_appr          1006 non-null   object 
 11  city_mpg              1006 non-null   object 
 12  hwy_mpg               1006 non-null   object 
 13  cmb_mpg               1006 non-null   object 
 14  unadj_cmb_mpg         1006 non-null   float64
 15  greenhouse_gas_score 

In [34]:
df_2018.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 797 entries, 1 to 1609
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   model                 797 non-null    object 
 1   displ                 797 non-null    float64
 2   cyl                   797 non-null    float64
 3   trans                 797 non-null    object 
 4   drive                 797 non-null    object 
 5   fuel                  797 non-null    object 
 6   stnd                  797 non-null    object 
 7   stnd_description      797 non-null    object 
 8   underhood_id          797 non-null    object 
 9   veh_class             797 non-null    object 
 10  air_pollution_score   797 non-null    int64  
 11  city_mpg              797 non-null    object 
 12  hwy_mpg               797 non-null    object 
 13  cmb_mpg               797 non-null    object 
 14  greenhouse_gas_score  797 non-null    int64  
 15  smartway              

## Inspecting Results:

- `cyl`: make the column consistent across both datasets by converting it to the int data type
- `air_pollution_score`: convert to the float data type
- `city_mpg`, `hwy_mpg`, `cmb_mpg`: convert to the float data type
- `greenhouse_gas_score`: convert to the int data type
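Mismatches like the ones listed above can also be found programmatically rather than by reading two `info()` outputs side by side. The following is a sketch under the assumption that only columns present in both frames matter; `dtype_mismatches` is a hypothetical helper and the toy frames are made up.

```python
import pandas as pd


def dtype_mismatches(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """List shared columns whose dtypes differ between the two frames."""
    shared = left.columns.intersection(right.columns)
    report = pd.DataFrame({
        # dtype names as plain strings so they compare cleanly
        "left": left.dtypes.loc[shared].astype(str),
        "right": right.dtypes.loc[shared].astype(str),
    })
    return report[report["left"] != report["right"]]


# Toy frames imitating the 2008 (strings) and 2018 (numbers) layouts
toy_2008 = pd.DataFrame({
    "cyl": ["(6 cyl)", "(4 cyl)"],
    "displ": [3.7, 2.3],
    "air_pollution_score": ["7", "6/4"],
})
toy_2018 = pd.DataFrame({
    "cyl": [6.0, 4.0],
    "displ": [3.5, 2.4],
    "air_pollution_score": [3, 5],
})

mismatches = dtype_mismatches(toy_2008, toy_2018)
print(mismatches)
```

Columns that are stored as strings in both datasets, such as the MPG columns here, would not be reported by this check even though they still need converting.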

-------------

# 5. Fixing Data Types

## 5.1. Fix `cyl` datatype
- 2008: extract int from string.
- 2018: convert float to int.

### 2008: extract int from string.

In [35]:
# check value counts for the 2008 cyl column
df_2008["cyl"].value_counts()

(6 cyl)     417
(4 cyl)     287
(8 cyl)     207
(5 cyl)      48
(12 cyl)     30
(10 cyl)     14
(2 cyl)       2
(16 cyl)      1
Name: cyl, dtype: int64

In [36]:
df_2008.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,U2,8HNXT03.7PKR,SUV,7,Drv,15,20,17,22.0527,4,no
2,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,U2,8HNXT02.3DKR,SUV,7,Drv,17,22,19,24.1745,5,no


In [37]:
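# alternative: slice the digits out by position
# df_2008["cyl"].str[1:-5].astype(int)

# extract the digits with a regex; the raw string avoids an invalid-escape warning,
# and expand=False returns a Series instead of a one-column DataFrame
df_2008["cyl"] = df_2008["cyl"].str.extract(r"(\d+)", expand=False).astype(int)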

In [38]:
df_2008.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,6,Auto-S5,4WD,Gasoline,U2,8HNXT03.7PKR,SUV,7,Drv,15,20,17,22.0527,4,no
2,ACURA RDX,2.3,4,Auto-S5,4WD,Gasoline,U2,8HNXT02.3DKR,SUV,7,Drv,17,22,19,24.1745,5,no


In [39]:
#confirm the value counts of cyl again
df_2008["cyl"].value_counts()

6     417
4     287
8     207
5      48
12     30
10     14
2       2
16      1
Name: cyl, dtype: int64
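The regex extraction used for the 2008 `cyl` column can be tried in isolation on a toy Series. The sample values below mirror the `(N cyl)` format shown in the value counts; the variable names are illustrative.

```python
import pandas as pd

cyl_raw = pd.Series(["(6 cyl)", "(4 cyl)", "(12 cyl)"])

# (\d+) captures the first run of digits in each string;
# expand=False makes str.extract return a Series for a single capture group
cyl_int = cyl_raw.str.extract(r"(\d+)", expand=False).astype(int)

print(cyl_int.tolist())  # [6, 4, 12]
```

Unlike positional slicing, the regex does not depend on the number of digits or on the exact length of the surrounding text.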

### 2018: convert float to int.

In [40]:
df_2018.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,stnd_description,underhood_id,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway,comb_co2
1,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,U2,California LEV-II ULEV,JHNXT03.5GV3,small SUV,3,20,28,23,5,No,386
3,ACURA RDX,3.5,6.0,SemiAuto-6,4WD,Gasoline,U2,California LEV-II ULEV,JHNXT03.5GV3,small SUV,3,19,27,22,4,No,402


In [41]:
df_2018["cyl"] = df_2018["cyl"].astype(int)

In [42]:
df_2018["cyl"].dtype

dtype('int32')

In [43]:
# save the progress data
df_2008.to_csv("Data/data_08.csv", index = False)
df_2018.to_csv("Data/data_18.csv", index = False)

----------

## 5.2. Fix air_pollution_score datatype

- 2008: convert string to float.
- 2018: convert int to float.

#### 2008: convert string to float.

In [44]:
df_2008.head(2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,6,Auto-S5,4WD,Gasoline,U2,8HNXT03.7PKR,SUV,7,Drv,15,20,17,22.0527,4,no
2,ACURA RDX,2.3,4,Auto-S5,4WD,Gasoline,U2,8HNXT02.3DKR,SUV,7,Drv,17,22,19,24.1745,5,no


In [45]:
df_2008["air_pollution_score"] = df_2008["air_pollution_score"].astype(float)

ValueError: could not convert string to float: '6/4'

In [46]:
# inspect the row whose value caused the error
err_position = df_2008.query('air_pollution_score == "6/4"')
err_position

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
1550,MERCEDES-BENZ C300,3.0,6,Auto-L7,2WD,ethanol/gas,L2,8MBXV03.0U2A,small car,6/4,Drv,13/18,19/25,15/21,19.7699,7/6,no
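Querying for the exact string from the error message finds one offending value at a time. A more general way to list every value that cannot be parsed as a number is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN`. This is a sketch on a made-up Series, not the notebook's own data.

```python
import pandas as pd

scores = pd.Series(["7", "6", "6/4", "9.5"])

# Unparseable strings become NaN instead of raising ValueError
as_number = pd.to_numeric(scores, errors="coerce")

# Keep only entries that were present originally but failed to parse
bad_values = scores[as_number.isna() & scores.notna()]

print(bad_values.tolist())  # ['6/4']
```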


According to the fueleconomy.gov [help page](http://www.fueleconomy.gov/feg/findacarhelp.shtml#airPollutionScore),

- If a vehicle can operate on more than one type of fuel, an estimate is provided for each fuel type.

### Findings
We therefore need to check for cars that can run on more than one type of fuel. The car above has `fuel = ethanol/gas`.

Columns that use `/` to hold two values in a single string:
- fuel
- air_pollution_score
- city_mpg
- hwy_mpg
- cmb_mpg
- greenhouse_gas_score
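A list like the one above can be cross-checked by scanning every column for the separator. The helper below is a hypothetical sketch on toy data; note that it would also flag any column where `/` appears for an unrelated reason, for example inside a model name.

```python
import pandas as pd


def columns_with_slash(df: pd.DataFrame) -> list:
    """Return the names of columns in which at least one value contains '/'."""
    found = []
    for col in df.columns:
        # Cast to str so numeric columns can be scanned the same way
        as_text = df[col].astype(str)
        if as_text.str.contains("/", regex=False).any():
            found.append(col)
    return found


toy = pd.DataFrame({
    "model": ["ACURA MDX", "MERCEDES-BENZ C300"],
    "displ": [3.7, 3.0],
    "fuel": ["Gasoline", "ethanol/gas"],
    "city_mpg": ["15", "13/18"],
})

print(columns_with_slash(toy))  # ['fuel', 'city_mpg']
```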

#### Get all hybrids in 2008

In [47]:
hb_2008 = df_2008[df_2008["fuel"].str.contains("/")]
hb_2008.head()

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
1550,MERCEDES-BENZ C300,3.0,6,Auto-L7,2WD,ethanol/gas,L2,8MBXV03.0U2A,small car,6/4,Drv,13/18,19/25,15/21,19.7699,7/6,no


In [48]:
# get all hybrids in 2018
hb_2018 = df_2018[df_2018["fuel"].str.contains("/")]
hb_2018.head()

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,stnd_description,underhood_id,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway,comb_co2
108,BMW 330e,2.0,4,SemiAuto-8,2WD,Gasoline/Electricity,L3ULEV125,California LEV-III ULEV125,JBMXV02.0H48,small car,3,28/66,34/78,30/71,10,Yes,189
160,BMW 530e,2.0,4,SemiAuto-8,2WD,Gasoline/Electricity,L3SULEV30,California LEV-III SULEV30,JBMXV02.0H30,small car,7,27/70,31/75,29/72,10,Elite,193
162,BMW 530e,2.0,4,SemiAuto-8,4WD,Gasoline/Electricity,L3SULEV30,California LEV-III SULEV30,JBMXV02.0H30,small car,7,27/66,31/68,28/67,10,Elite,200
188,BMW 740e,2.0,4,SemiAuto-8,4WD,Gasoline/Electricity,L3ULEV125,California LEV-III ULEV125,JBMXV02.0H48,large car,3,25/62,29/68,27/64,9,Yes,214
382,CHEVROLET Impala,3.6,6,SemiAuto-6,2WD,Ethanol/Gas,L3ULEV70,California LEV-III ULEV70,JGMXV03.6166,large car,5,14/18,20/28,16/22,4,No,394/409


Each of these rows needs to be split into two rows, one for each fuel type (the values are separated by `/`).

In [49]:
# create two copies of 2008 hybrid df
df1 = hb_2008.copy() # data on first fuel type of each hybrid vehicle
df2 = hb_2008.copy() # data on second fuel type of each hybrid vehicle

In [50]:
# columns that hold two "/"-separated values
columns_to_split = ["fuel", "air_pollution_score", "city_mpg", "hwy_mpg", "cmb_mpg", "greenhouse_gas_score"]

# split each column
for col in columns_to_split:
    df1[col] = df1[col].apply(lambda x: x.split("/")[0]) # value for the first fuel type
    df2[col] = df2[col].apply(lambda x: x.split("/")[1]) # value for the second fuel type

In [51]:
df1.head()

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
1550,MERCEDES-BENZ C300,3.0,6,Auto-L7,2WD,ethanol,L2,8MBXV03.0U2A,small car,6,Drv,13,19,15,19.7699,7,no


In [52]:
df2.head()

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
1550,MERCEDES-BENZ C300,3.0,6,Auto-L7,2WD,gas,L2,8MBXV03.0U2A,small car,4,Drv,18,25,21,19.7699,6,no


In [53]:
# combine dataframes to add to the original dataframe
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
new_rows = pd.concat([df1, df2])

# now we have separate rows for each fuel type of each vehicle!
new_rows

Unnamed: 0,model,displ,cyl,trans,drive,fuel,stnd,underhood_id,veh_class,air_pollution_score,fe_calc_appr,city_mpg,hwy_mpg,cmb_mpg,unadj_cmb_mpg,greenhouse_gas_score,smartway
1550,MERCEDES-BENZ C300,3.0,6,Auto-L7,2WD,ethanol,L2,8MBXV03.0U2A,small car,6,Drv,13,19,15,19.7699,7,no
1550,MERCEDES-BENZ C300,3.0,6,Auto-L7,2WD,gas,L2,8MBXV03.0U2A,small car,4,Drv,18,25,21,19.7699,6,no


In [54]:
# drop the original hybrid rows
df_2008.drop(hb_2008.index, inplace = True)

# add newly fixed rows (pd.concat replaces DataFrame.append, removed in pandas 2.0)
df_2008 = pd.concat([df_2008, new_rows], ignore_index = True)

In [56]:
# recheck whether "/" values are gone
df_2008["fuel"].str.contains("/").sum()

0

In [57]:
df_2008.shape # one additional row from the split (1006 + 1)

(1007, 17)
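Since the same split is needed for both years, the steps above can be wrapped in a reusable function. The following is a minimal sketch, not the notebook's own code: `split_dual_fuel_rows` is a hypothetical helper, it assumes the `fuel` column has no missing values, and it takes the last part of each value for the second row so that a column without a separator does not raise an `IndexError`.

```python
import pandas as pd


def split_dual_fuel_rows(df: pd.DataFrame, split_cols, sep: str = "/") -> pd.DataFrame:
    """Replace each dual-fuel row with two rows, one per fuel type."""
    is_dual = df["fuel"].str.contains(sep, regex=False)
    dual = df[is_dual]

    first, second = dual.copy(), dual.copy()
    for col in split_cols:
        parts = dual[col].astype(str).str.split(sep)
        first[col] = parts.str[0]    # value for the first fuel type
        second[col] = parts.str[-1]  # value for the second fuel type

    # Single-fuel rows first, then the two halves of every dual-fuel row
    return pd.concat([df[~is_dual], first, second], ignore_index=True)


toy = pd.DataFrame({
    "model": ["ACURA MDX", "MERCEDES-BENZ C300"],
    "fuel": ["Gasoline", "ethanol/gas"],
    "city_mpg": ["15", "13/18"],
    "air_pollution_score": ["7", "6/4"],
})

result = split_dual_fuel_rows(toy, ["fuel", "city_mpg", "air_pollution_score"])
print(result)
```

The split values are still strings after this step, so the numeric conversion has to follow as a separate step.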

#### Repeat the process for the 2018 hybrids

In [67]:
# create two copies of 2018 hybrid df
df1 = hb_2018.copy()
df2 = hb_2018.copy()

In [68]:
# columns that hold two "/"-separated values
columns_to_split = ["fuel", "city_mpg", "hwy_mpg", "cmb_mpg"]

# split each column
for col in columns_to_split:
    df1[col] = df1[col].apply(lambda x: x.split("/")[0]) # value for the first fuel type
    df2[col] = df2[col].apply(lambda x: x.split("/")[1]) # value for the second fuel type

In [69]:
# combine the two dataframes
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
new_rows = pd.concat([df1, df2])

# drop the original, unsplit hybrid rows from the 2018 dataset
df_2018.drop(hb_2018.index, inplace = True)

# add the newly created rows to the 2018 dataset
df_2018 = pd.concat([df_2018, new_rows], ignore_index = True)

In [70]:
# recheck whether "/" values are gone
df_2018["fuel"].str.contains("/").sum()

0

In [71]:
df_2018.shape

(835, 17)

----------

### Continue changes for `air_pollution_score`:
- 2008: convert string to float.
- 2018: convert int to float.

In [74]:
df_2008["air_pollution_score"] = df_2008["air_pollution_score"].astype(float)

In [75]:
df_2018["air_pollution_score"] = df_2018["air_pollution_score"].astype(float)

In [77]:
# save the progress
df_2008.to_csv("Data/data_08.csv", index = False)
df_2018.to_csv("Data/data_18.csv", index = False)

-------------