# What is Fuel Economy?

[Wikipedia](https://en.wikipedia.org/wiki/Fuel_economy_in_automobiles): The fuel economy of an automobile is the fuel efficiency relationship between the distance traveled and the amount of fuel consumed by the vehicle. Consumption can be expressed in terms of volume of fuel to travel a distance, or the distance travelled per unit volume of fuel consumed.
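
The two conventions in the definition above are reciprocals of each other. As a minimal sketch, assuming US gallons and statute miles (the function name `mpg_to_l_per_100km` is illustrative and not part of this notebook), a distance-per-volume figure can be converted to a volume-per-distance figure like this:

```python
# Conversion constants (exact by definition).
LITERS_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344

def mpg_to_l_per_100km(mpg: float) -> float:
    """Convert distance-per-volume (US mpg) to volume-per-distance (L/100 km)."""
    if mpg <= 0:
        raise ValueError("mpg must be positive")
    km_per_liter = mpg * KM_PER_MILE / LITERS_PER_US_GALLON
    return 100.0 / km_per_liter

print(round(mpg_to_l_per_100km(30.0), 2))  # 7.84
```

Note that a higher mpg means a lower L/100 km value, so "better" points in opposite directions for the two units.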

- Fuel Economy: [Information](https://www.epa.gov/compliance-and-fuel-economy-data/data-cars-used-testing-fuel-economy)
- Fuel Economy: [Dataset](https://www.fueleconomy.gov/feg/download.shtml/)
    - [Data Description](http://www.fueleconomy.gov/feg/epadata/Readme.txt)
    - [PDF](http://www.fueleconomy.gov/feg/EPAGreenGuide/GreenVehicleGuideDocumentation.pdf)

--------------

## 1. Data Acquisition

In [None]:
import pandas as pd
import numpy as np

In [None]:
df_2008 = pd.read_csv("Data/all_alpha_08.csv")
df_2018 = pd.read_csv("Data/all_alpha_18.csv")

In [None]:
# total number of elements (rows x columns) in each dataset
print("Data Size:")
df_2008.size, df_2018.size

In [None]:
# number of rows and columns in each dataset
print("Number of rows and columns:")
df_2008.shape, df_2018.shape

In [None]:
# duplicate rows in each dataset
print("Duplicate information 2008: ", df_2008.duplicated().sum())
print("Duplicate information 2018: ", df_2018.duplicated().sum())

In [None]:
# datatypes of columns
print("Datatypes information: ")
df_2008.info()

In [None]:
df_2018.info()

In [None]:
# features with missing values
print("Features with missing values for 2008: ", df_2008.isnull().sum())

In [None]:
print("Features with missing values for 2018: ", df_2018.isna().sum())

In [None]:
# number of non-null unique values for features in each dataset
# what those unique values are and counts for each
print("Number of non null unique values for features 2008: ", df_2008.nunique())

In [None]:
# number of non-null unique values for features in each dataset
# what those unique values are and counts for each
print("Number of non null unique values for features 2018: ", df_2018.nunique())

--------------

## 2. Data Cleaning and Pre-processing

### 2.1. Drop Extra columns

In [None]:
df_2008.drop(["Stnd", "Underhood ID", "FE Calc Appr", "Unadj Cmb MPG"], axis = 1, inplace = True)

In [None]:
df_2008.head(2)

In [None]:
df_2018.drop(["Stnd", "Stnd Description", "Underhood ID", "Comb CO2"], axis = 1, inplace = True)

In [None]:
df_2018.head(2)

### 2.2. Rename the columns

In [None]:
#make column name consistent between two data sets
df_2008.rename(columns = {"Sales Area": "Cert Region"}, inplace = True)
df_2008.head(2)

In [None]:
# replace space with underscore _ and lowercase for all column names
df_2008.rename(columns = lambda x: x.strip().lower().replace(" ", "_"), inplace = True)
df_2008.head(2)

In [None]:
df_2018.rename(columns = lambda x : x.strip().lower().replace(" ", "_"), inplace = True)
df_2018.head(2)

In [None]:
# confirm all columns between 2008 and 2018 are identical
(df_2008.columns == df_2018.columns).all()

In [None]:
# save the progress datasets
df_2008.to_csv("Data/data_08.csv", index = False)
df_2018.to_csv("Data/data_18.csv", index = False)

### 2.3 Filter only for California region

For consistency, only compare cars certified by California standards. Filter both datasets using `query` to select only rows where `cert_region` is CA. Then drop the `cert_region` column, since it no longer provides any useful information (every remaining value is 'CA').

In [None]:
# get only for CA region
df_2008.head(2)

In [None]:
df_2008 = df_2008.query('cert_region == "CA"')
df_2008.head(2)

In [None]:
# reassign instead of dropping in place, since df_2008 is now a filtered subset
df_2008 = df_2008.drop(columns = "cert_region")
df_2008.head(2)

In [None]:
# 2018 dataset
df_2018 = df_2018.query('cert_region == "CA"')
df_2018.head(2)

In [None]:
# reassign instead of dropping in place, since df_2018 is now a filtered subset
df_2018 = df_2018.drop(columns = "cert_region")
df_2018.head(2)

### 2.4 Drop Missing Values

Drop any rows in both datasets that contain missing values.

In [None]:
# 2008 data, check for na values
df_2008.isnull().sum()

In [None]:
# drop null values rows
df_2008.dropna(axis=0, inplace = True)

In [None]:
# checks if any of columns in 2008 have null values - should print False
df_2008.isna().sum().any()

In [None]:
# 2018 data
df_2018.dropna(axis = 0, inplace = True)
df_2018.isnull().sum().any()

### 2.5 Drop Duplicates
Drop any duplicate rows in both datasets.

In [None]:
# check for duplicate rows
print("duplicated rows for 2008: ", df_2008.duplicated().sum())
print("duplicated rows for 2018: ", df_2018.duplicated().sum())

In [None]:
# drop the duplicate rows in both datasets
df_2008.drop_duplicates(inplace = True)
df_2018.drop_duplicates(inplace = True)

In [None]:
# print number of duplicates again to confirm dedupe - both should be 0
df_2008.duplicated().sum(), df_2018.duplicated().sum()

In [None]:
# save the progress data
df_2008.to_csv("Data/data_08.csv", index = False)
df_2018.to_csv("Data/data_18.csv", index = False)

### 2.6 Data Types

Inspect the datatypes of the features in each dataset and decide what changes are needed to make them practical and consistent across both datasets.

In [None]:
df_2008.head(2)

In [None]:
df_2018.head(2)

In [None]:
df_2008.info()

In [None]:
df_2018.info()

- for the `cyl` column, make the data consistent by using the int data type
- for the `air_pollution_score` column, convert to the float data type
- for `city_mpg`, `hwy_mpg`, and `cmb_mpg`, convert to the float data type
- for the `greenhouse_gas_score` column, convert to the int data type
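
The list above was put together by reading the two `info()` outputs side by side. As a rough sketch, the same comparison can be automated. The helper `dtype_mismatches` and the toy frames below are hypothetical stand-ins, not part of the original notebook:

```python
import pandas as pd

def dtype_mismatches(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return the shared columns whose dtypes differ between the two frames."""
    shared = [c for c in left.columns if c in right.columns]
    table = pd.DataFrame({
        "left": left[shared].dtypes.astype(str),
        "right": right[shared].dtypes.astype(str),
    })
    return table[table["left"] != table["right"]]

# Toy frames standing in for df_2008 and df_2018 (values are made up).
toy_08 = pd.DataFrame({"cyl": ["(6 cyl)"], "displ": [3.5]})
toy_18 = pd.DataFrame({"cyl": [6.0], "displ": [3.5]})

mismatches = dtype_mismatches(toy_08, toy_18)
print(mismatches)
```

Calling `dtype_mismatches(df_2008, df_2018)` on the real frames should list the columns that still need conversion.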

-------------

#### Fix `cyl` datatype
- 2008: extract int from string.
- 2018: convert float to int.
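
A small sketch of the 2008 extraction step on made-up values, assuming the column holds strings of the form `(N cyl)`:

```python
import pandas as pd

# Made-up values in the "(N cyl)" form assumed for the 2008 column.
cyl_raw = pd.Series(["(4 cyl)", "(6 cyl)", "(12 cyl)"])

# expand=False returns a Series; the raw string avoids an invalid-escape warning.
cyl_int = cyl_raw.str.extract(r"(\d+)", expand=False).astype(int)
print(cyl_int.tolist())  # [4, 6, 12]
```

Matching on digits rather than slicing by position keeps two-digit cylinder counts such as 12 intact.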

In [None]:
# check value counts for the 2008 cyl column
df_2008["cyl"].value_counts()

In [None]:
df_2008.head(2)

In [None]:
# df_2008["cyl"].str[1:-5].astype(int)

# raw string avoids an invalid escape sequence; expand=False returns a Series
df_2008['cyl'] = df_2008['cyl'].str.extract(r'(\d+)', expand = False).astype(int)

In [None]:
df_2008.head(2)

In [None]:
#confirm the value counts of cyl again
df_2008["cyl"].value_counts()

In [None]:
df_2018.head(2)

In [None]:
df_2018["cyl"] = df_2018["cyl"].astype(int)

In [None]:
df_2018["cyl"].dtype

In [None]:
# save the progress data
df_2008.to_csv("Data/data_08.csv", index = False)
df_2018.to_csv("Data/data_18.csv", index = False)

----------

#### Fix `air_pollution_score` datatype

- 2008: convert string to float.
- 2018: convert int to float.

#### 2008: convert string to float.

In [None]:
df_2008.head(2)

In [None]:
# this first attempt raises a ValueError: some rows hold two scores in one string, e.g. "6/4"
df_2008["air_pollution_score"] = df_2008["air_pollution_score"].astype(float)

In [None]:
# inspect the row that caused the error
err_position = df_2008.query('air_pollution_score == "6/4"')
err_position

According to this [resource](http://www.fueleconomy.gov/feg/findacarhelp.shtml#airPollutionScore):

- If a vehicle can operate on more than one type of fuel, an estimate is provided for each fuel type, so we need to check for cars that can use more than one fuel. The car above uses `fuel = ethanol/gas`.

Columns that use `/` to hold two values in one string:
- fuel
- air_pollution_score
- city_mpg
- hwy_mpg
- cmb_mpg
- greenhouse_gas_score
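
One way to arrive at a list like the one above is to scan every column for string values containing `/`. This is only a sketch on a toy frame; the helper name `columns_with_slash` is hypothetical, and on the real data other text columns (for example model names) may also contain a slash, so the result needs a manual review:

```python
import pandas as pd

def columns_with_slash(df: pd.DataFrame) -> list:
    """Return the names of columns where at least one string value contains '/'."""
    found = []
    for col in df.columns:
        # Only text values can contain "/"; non-strings are treated as no match.
        has_slash = df[col].map(lambda v: isinstance(v, str) and "/" in v)
        if has_slash.any():
            found.append(col)
    return found

# Toy data (made up) with one dual-fuel row.
toy = pd.DataFrame({
    "model": ["CAR A", "CAR B"],
    "fuel": ["Gasoline", "ethanol/gas"],
    "city_mpg": ["20", "13/18"],
    "displ": [2.0, 3.5],
})
slash_cols = columns_with_slash(toy)
print(slash_cols)  # ['fuel', 'city_mpg']
```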

#### All hybrids in 2008

In [None]:
hb_2008 = df_2008[df_2008["fuel"].str.contains("/")]
hb_2008.head()

In [None]:
# get all hybrids in 2018
hb_2018 = df_2018[df_2018["fuel"].str.contains("/")]
hb_2018.head()

Each of these rows needs to be split into two rows, one for each fuel type (separated by `/`).
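
The split can be illustrated end to end on a toy frame. This is a sketch with made-up values, using `pd.concat` to stack the pieces:

```python
import pandas as pd

# Toy data (made up): CAR B runs on two fuels and has two mpg values.
toy = pd.DataFrame({
    "model": ["CAR A", "CAR B"],
    "fuel": ["Gasoline", "ethanol/gas"],
    "cmb_mpg": ["25", "13/18"],
})
split_cols = ["fuel", "cmb_mpg"]

# Rows that hold two fuel types in one string.
dual = toy[toy["fuel"].str.contains("/")]
first, second = dual.copy(), dual.copy()
for col in split_cols:
    parts = dual[col].str.split("/")
    first[col] = parts.str[0]   # value for the first fuel type
    second[col] = parts.str[1]  # value for the second fuel type

# Replace each dual-fuel row with its two single-fuel rows.
result = pd.concat([toy.drop(dual.index), first, second], ignore_index=True)
print(result)
```

The toy frame grows from two rows to three: one dual-fuel row is removed and two single-fuel rows are added.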

In [None]:
# create two copies of 2008 hybrid df
df1 = hb_2008.copy() # data on first fuel type of each hybrid vehicle
df2 = hb_2008.copy() # data on second fuel type of each hybrid vehicle

In [None]:
# affected columns process
columns_to_be_splited = ["fuel", "air_pollution_score", "city_mpg", "hwy_mpg", "cmb_mpg", "greenhouse_gas_score"]

# split each column
for col in columns_to_be_splited:
    df1[col] = df1[col].apply(lambda x: x.split("/")[0]) # first fuel type value
    df2[col] = df2[col].apply(lambda x: x.split("/")[1]) # second fuel type value

In [None]:
df1.head()

In [None]:
df2.head()

In [None]:
# combine dataframes to add to the original dataframe
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
new_rows = pd.concat([df1, df2])

# now we have separate rows for each fuel type of each vehicle!
new_rows

In [None]:
# drop the original hybrid rows
df_2008.drop(hb_2008.index, inplace = True)

# add newly fixed rows (pd.concat replaces the removed DataFrame.append)
df_2008 = pd.concat([df_2008, new_rows], ignore_index = True)

In [None]:
# recheck whether "/" values are gone
df_2008["fuel"].str.contains("/").sum()

In [None]:
df_2008.shape # each hybrid row was replaced by two rows, so one extra row per hybrid

#### All hybrids in 2018

In [None]:
# create two copies of 2018 hybrid df
df1 = hb_2018.copy()
df2 = hb_2018.copy()

In [None]:
# affected columns process
columns_to_be_splited = ["fuel", "city_mpg", "hwy_mpg", "cmb_mpg"]

# split each column
for col in columns_to_be_splited:
    df1[col] = df1[col].apply(lambda x: x.split("/")[0]) # first fuel type value
    df2[col] = df2[col].apply(lambda x: x.split("/")[1]) # second fuel type value

In [None]:
# combine the two dataframes (pd.concat replaces the removed DataFrame.append)
new_rows = pd.concat([df1, df2])

# drop the original unsplit hybrid rows from the 2018 dataset
df_2018.drop(hb_2018.index, inplace = True)

# add the newly created rows to the 2018 dataset
df_2018 = pd.concat([df_2018, new_rows], ignore_index = True)

In [None]:
# check "/" still there or not
df_2018["fuel"].str.contains("/").sum()

In [None]:
df_2018.shape

----------

#### Fix `air_pollution_score` datatype
- 2008: convert string to float.
- 2018: convert int to float.

In [None]:
df_2008["air_pollution_score"] = df_2008["air_pollution_score"].astype(float)

In [None]:
df_2018["air_pollution_score"] = df_2018["air_pollution_score"].astype(float)

In [None]:
# save the progress
df_2008.to_csv("Data/data_08.csv", index = False)
df_2018.to_csv("Data/data_18.csv", index = False)

-------------

#### Fix `city_mpg`, `hwy_mpg`, `cmb_mpg` datatypes
- 2008 and 2018: convert string to float.

In [None]:
df_2008.head(2)

In [None]:
df_2008["city_mpg"] = df_2008["city_mpg"].astype(float)

In [None]:
df_2008["hwy_mpg"] = df_2008["hwy_mpg"].astype(float)
df_2008["cmb_mpg"] = df_2008["cmb_mpg"].astype(float)

In [None]:
df_2018["city_mpg"] = df_2018["city_mpg"].astype(float)
df_2018["hwy_mpg"] = df_2018["hwy_mpg"].astype(float)
df_2018["cmb_mpg"] = df_2018["cmb_mpg"].astype(float)

--------

#### Fix `greenhouse_gas_score` datatype
- 2008: convert from float to int.

In [None]:
df_2008["greenhouse_gas_score"] = df_2008["greenhouse_gas_score"].astype(int)

In [None]:
# check all data type again
df_2008.dtypes

In [None]:
df_2018.dtypes

In [None]:
df_2008.shape, df_2018.shape

------------------

## 3. Data Exploration and Visualization

### 3.1. Distributions of greenhouse gas score in 2008 and 2018

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df_2008.head(2)

In [None]:
df_2018.head(2)

In [None]:
df_2008["greenhouse_gas_score"].hist();
plt.title("Green House Gas Score 2008")
plt.xlabel("Green House Gas Score")
plt.ylabel("Frequency");

As per the histogram above, the 2008 greenhouse gas score is left-skewed.

In [None]:
df_2018["greenhouse_gas_score"].hist();
plt.title("Green House Gas Score 2018")
plt.xlabel("Green House Gas Score")
plt.ylabel("Frequency");

As per the histogram above, the 2018 greenhouse gas score is right-skewed.

-------

### 3.2. Distribution of combined mpg changed from 2008 to 2018

In [None]:
df_2008["cmb_mpg"].hist(bins = 20);
plt.title("Combined mpg 2008 Histogram")
plt.xlabel("combined mpg score")
plt.ylabel("frequency");

In [None]:
df_2018["cmb_mpg"].hist(bins = 20);
plt.title("Combined mpg 2018 Histogram")
plt.xlabel("combined mpg score")
plt.ylabel("frequency");

As per the two histograms above, combined mpg is right-skewed in both 2008 and 2018.
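
Skew read off a histogram can be cross-checked numerically with `Series.skew()`: a positive value indicates a longer right tail, a negative value a longer left tail. A sketch on made-up numbers (not taken from the datasets):

```python
import pandas as pd

# Made-up samples: one with a long right tail, one with a long left tail.
right_tail = pd.Series([18, 19, 20, 20, 21, 22, 45])
left_tail = pd.Series([1, 6, 7, 7, 8, 8, 9])

print(right_tail.skew() > 0, left_tail.skew() < 0)  # True True
```

On the real data, `df_2008["cmb_mpg"].skew()` and `df_2018["cmb_mpg"].skew()` would give the corresponding figures.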

-------

### 3.3. Correlation between displacement and combined mpg

In [None]:
df_2008.plot(x="displ", y="cmb_mpg", kind="scatter");
plt.title("Correlation between displacement and combined mpg 2008");

In [None]:
df_2018.plot(x="displ", y="cmb_mpg", kind="scatter");
plt.title("Correlation between displacement and combined mpg 2018");

As per the two scatter plots above, there is a negative correlation between displacement and combined mpg.

------

### 3.4. Correlation between greenhouse gas score and combined mpg

In [None]:
df_2008.plot(x="greenhouse_gas_score", y="cmb_mpg", kind="scatter");
plt.title("Correlation between green house gas score and combined mpg 2008");

In [None]:
df_2018.plot(x="greenhouse_gas_score", y="cmb_mpg", kind="scatter");
plt.title("Correlation between green house gas score and combined mpg 2018");

As per the two scatter plots above, there is a positive correlation between greenhouse gas score and combined mpg.
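
The direction of a relationship seen in a scatter plot can be backed up with Pearson's correlation coefficient via `Series.corr`. A sketch on a made-up toy frame that reuses the notebook's column names:

```python
import pandas as pd

# Toy data (made up): larger engines get lower mpg and lower greenhouse gas scores.
toy = pd.DataFrame({
    "displ": [1.6, 2.0, 2.5, 3.5, 5.0],
    "greenhouse_gas_score": [8, 7, 6, 4, 2],
    "cmb_mpg": [34.0, 30.0, 27.0, 21.0, 16.0],
})

r_displ = toy["displ"].corr(toy["cmb_mpg"])
r_ghg = toy["greenhouse_gas_score"].corr(toy["cmb_mpg"])
print(r_displ < 0, r_ghg > 0)  # True True
```

A coefficient near -1 or +1 indicates a strong linear relationship, while values near 0 indicate a weak one.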

---------

## 4. Conclusions

### Are more models using alternative fuels in 2018 compared to 2008? By how much?

In [None]:
df_2008.head(2)

Check the fuel usage across the years.

In [None]:
df_2008["fuel"].value_counts()

In [None]:
df_2018["fuel"].value_counts()

#### Alternative Fuel classification: reference from [Wikipedia](https://en.wikipedia.org/wiki/Alternative_fuel#:~:text=Alternative%20fuels%2C%20known%20as%20non,as%20artificial%20radioisotope%20fuels%20that)

- 2008 alternative fuels: Ethanol, CNG
- 2018 alternative fuels: Ethanol, Electricity

Check the number of models per fuel type across the years.

In [None]:
df_2008.groupby("fuel").count().model

In [None]:
# how many unique models used alternative sources of fuel in 2008
alternative_2008 = df_2008.query('fuel in ["CNG", "ethanol"]').model.nunique()
alternative_2008

In [None]:
df_2018.groupby("fuel").count().model

In [None]:
# how many unique models used alternative sources of fuel in 2018
alternative_2018 = df_2018.query('fuel in ["Ethanol", "Electricity"]').model.nunique()
alternative_2018

In [None]:
# plot the numbers
plt.bar(["2008", "2018"], [alternative_2008, alternative_2018]);
plt.xlabel("Year")
plt.ylabel("Number of Models")
plt.title("Number of Unique Models using Alternative Fuel");

In 2018, 24 more unique models use alternative fuels than in 2008. Since the total number of models also differs between the years, the proportion of such models gives additional context.

In [None]:
# get the total unique models for each year
total_2008 = df_2008.model.nunique()
total_2018 = df_2018.model.nunique()

print("Total unique model in 2008: ", total_2008)
print("Total unique model in 2018: ", total_2018)

In [None]:
# calculate the proportion of models using alternative fuels in each year
proportion_2008 = alternative_2008 / total_2008
proportion_2018 = alternative_2018 / total_2018

print("Proportion of models using alternative fuel in 2008: ", proportion_2008)
print("Proportion of models using alternative fuel in 2018: ", proportion_2018)

In [None]:
# plot the proportion
plt.bar(["2008", "2018"], [proportion_2008, proportion_2018]);
plt.title("Proportion of Unique Models Using Alternative Fuels")
plt.xlabel("Year")
plt.ylabel("Proportion of Unique Models");

-------

### How much have vehicle classes improved in fuel economy (increased in mpg)?

In [None]:
df_2008.head(2)

We can compute the mean combined mpg for each vehicle class and compare it between the two years.
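
A sketch of this comparison on made-up data. Subtracting two grouped Series aligns them on the class label, so classes that exist in only one year come out as `NaN` and can be dropped:

```python
import pandas as pd

# Toy data (made up): only "SUV" exists in both years.
toy_08 = pd.DataFrame({"veh_class": ["SUV", "SUV", "van"],
                       "cmb_mpg": [18.0, 20.0, 17.0]})
toy_18 = pd.DataFrame({"veh_class": ["SUV", "small car"],
                       "cmb_mpg": [24.0, 30.0]})

mean_08 = toy_08.groupby("veh_class")["cmb_mpg"].mean()
mean_18 = toy_18.groupby("veh_class")["cmb_mpg"].mean()

# Index alignment leaves NaN for classes missing from either year.
diff = (mean_18 - mean_08).dropna()
print(diff)
```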

In [None]:
vehicle_2008 = df_2008.groupby("veh_class").cmb_mpg.mean()

In [None]:
vehicle_2018 = df_2018.groupby("veh_class").cmb_mpg.mean()

In [None]:
# check the improvement
improvement = vehicle_2018 - vehicle_2008
improvement

In [None]:
# keep only vehicle classes that exist in both years
improvement.dropna(inplace = True)

In [None]:
improvement

In [None]:
# plot the improvement
fig, ax = plt.subplots(figsize = (10, 6))
ax.bar(improvement.index, improvement);
plt.title("Improvements in Fuel Economy from 2008 to 2018 by Vehicle Class")
plt.xlabel("Vehicle Class")
plt.ylabel("Increase in Average Combined MPG");

----------

### What are the characteristics of SmartWay vehicles? Have they changed over time? (mpg, greenhouse gas)

In [None]:
df_2008.head(2)

We can select the SmartWay vehicles and analyze each of their characteristics.

In [None]:
df_2008.smartway.value_counts()

In [None]:
smart_2008 = df_2008[df_2008["smartway"] == 'yes']
print("Total Number of SmartWay Vehicles in 2008: ", smart_2008.shape[0])

smart_2008.head(2)

In [None]:
df_2018.smartway.value_counts()

In [None]:
smart_2018 = df_2018.query('smartway in ["Yes", "Elite"]')
print("Total Number of SmartWay Vehicles in 2018: ", smart_2018.shape[0])

smart_2018.head(2)

In [None]:
# check the characteristics for 2008
smart_2008.describe()

In [None]:
# check the characteristics for 2018
smart_2018.describe()

#### Summary Findings

After comparing the mean value of each characteristic of SmartWay vehicles between 2008 and 2018, we can briefly say the following:
- Engine displacement and the number of engine cylinders have decreased.
- Air pollution score (smog rating) has also decreased.
- City, highway, and combined city/highway fuel economy have improved.
- Greenhouse gas score has improved.
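
Rather than reading two `describe()` tables side by side, the means can be placed in one table with a change column. This is a sketch on made-up frames; on the real data `smart_2008` and `smart_2018` would take the place of the toy frames:

```python
import pandas as pd

# Toy frames (made up) standing in for smart_2008 and smart_2018.
toy_08 = pd.DataFrame({"displ": [3.0, 2.0], "cmb_mpg": [24.0, 26.0]})
toy_18 = pd.DataFrame({"displ": [2.0, 1.5], "cmb_mpg": [33.0, 37.0]})

comparison = pd.DataFrame({
    "mean_2008": toy_08.mean(numeric_only=True),
    "mean_2018": toy_18.mean(numeric_only=True),
})
# Positive change means the 2018 mean is higher than the 2008 mean.
comparison["change"] = comparison["mean_2018"] - comparison["mean_2008"]
print(comparison)
```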

----------

### What features are associated with better fuel economy (mpg)?

We will explore the features associated with better mpg by selecting the vehicles whose combined mpg is above the average for each year.

In [None]:
top_2008 = df_2008.query('cmb_mpg > cmb_mpg.mean()')
top_2008.describe()

In [None]:
top_2018 = df_2018.query('cmb_mpg > cmb_mpg.mean()')
top_2018.describe()

-----------

### For all of the models that were produced in 2008 that are still being produced in 2018, how much has the mpg improved and which vehicle improved the most?

In [None]:
df_2008.head(2)

In [None]:
df_2008.model.value_counts()

In [None]:
df_2018.model.value_counts()

--------

### Which model has made the most improvement for mpg from 2008-2018?

In [None]:
# suffix each column name with "_2008" (names are truncated to 10 characters first)
df_2008 = df_2008.rename(columns = lambda x : x[:10] + "_2008")
df_2008.head(2)

We are only interested in how the same model of car has been updated and how the new model's mpg compares to the old model's mpg.
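
An inner merge keeps only the models present in both years, which is what this comparison needs. A sketch on made-up frames that mirror the column naming used here (`model_2008` and `cmb_mpg_2008` for 2008, `model` and `cmb_mpg` for 2018):

```python
import pandas as pd

# Toy data (made up): only CAR A is produced in both years.
toy_08 = pd.DataFrame({"model_2008": ["CAR A", "CAR B"],
                       "cmb_mpg_2008": [20.0, 25.0]})
toy_18 = pd.DataFrame({"model": ["CAR A", "CAR C"],
                       "cmb_mpg": [28.0, 40.0]})

# how="inner" drops CAR B (2008 only) and CAR C (2018 only).
merged = toy_08.merge(toy_18, left_on="model_2008", right_on="model", how="inner")
merged["mpg_change"] = merged["cmb_mpg"] - merged["cmb_mpg_2008"]
print(merged)
```

If a model has several rows in each year, the merge produces every pairing of them, which is why the per-model means are taken afterwards.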

In [None]:
df_combined = df_2008.merge(df_2018, left_on = "model_2008", right_on = "model", how = "inner")

In [None]:
# check the result
df_combined.head()

In [None]:
# save the combined dataset, just in case
df_combined.to_csv("Data/models_still_produced_2018.csv", index = False)

-------

In [None]:
# 1. Create a new dataframe, combined_model_mpg, that contains the mean combined mpg values in 2008 and 2018 for each unique model
combined_model_mpg  = df_combined.groupby(["model"])[["cmb_mpg_2008", "cmb_mpg"]].mean()
combined_model_mpg.head()

In [None]:
# 2. Create a new column, mpg_change, with the change in mpg
combined_model_mpg["mpg_change"] = combined_model_mpg["cmb_mpg"] - combined_model_mpg["cmb_mpg_2008"]
combined_model_mpg.head()

In [None]:
# 3. Find the vehicle that improved the most
combined_model_mpg.sort_values(by = "mpg_change", ascending = False)[0:1]

### Summary Findings:
The VOLVO XC 90 is the model with the biggest improvement in combined mpg from 2008 to 2018.

----------