# Automobile Fuel Efficiency Analysis in Python
## Objectives 
* Acquiring automobile fuel efficiency data
* Preparing the Python environment
* Importing automobile fuel efficiency data into Python
* Exploring and describing fuel efficiency data
* Analyzing automobile fuel efficiency over time

### Metadata

The U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy and the U.S. Environmental Protection Agency (EPA) provide access to a vehicle dataset through their [Fuel Economy Site](http://www.fueleconomy.gov). This dataset contains fuel efficiency performance metrics, measured in miles per gallon (MPG), for most makes and models of automobiles available in the U.S. since 1984. In addition to fuel efficiency attributes, the dataset contains several other attributes of each vehicle, which makes it possible to summarize and group the data to determine which makes or models have historically had better fuel efficiency and how this has changed over the years.

### Preparing the Python environment

In [None]:
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
import pandas as pd
import numpy as np
import plotnine as pn
import matplotlib.pyplot as plt
import functools

### Acquiring automobile fuel efficiency data

In [None]:
# URL containing the data file
url='http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip'
# Downloading the file and saving it in memory
response = urlopen(url)
compressed_file = ZipFile(BytesIO(response.read()))
# Decompressing the file and loading it into a pandas DataFrame
file=compressed_file.open(compressed_file.namelist().pop())
data=pd.read_csv(file,low_memory=False)
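
The in-memory pattern used above (wrap the downloaded bytes in `BytesIO`, open them as a `ZipFile`, read the single member) can be tried without a network connection. The sketch below is only an illustration: the archive, the member name `toy.csv`, and its two rows are invented here, and the standard-library `csv` module stands in for `pd.read_csv` to keep the example dependency-free.

```python
import csv
from io import BytesIO, StringIO
from zipfile import ZipFile

# Build a tiny zip archive in memory; this stands in for response.read().
# The member name and the rows are hypothetical toy values.
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("toy.csv", "year,comb08\n1984,20\n1985,21\n")
payload = buffer.getvalue()

# Same steps as the cell above: wrap the bytes, open the only member, parse it.
with ZipFile(BytesIO(payload)) as archive:
    member = archive.namelist()[0]
    with archive.open(member) as handle:
        text = handle.read().decode("utf-8")

rows = list(csv.DictReader(StringIO(text)))
print(member, rows)
```

Nothing is written to disk at any point, which is the main reason for routing the download through `BytesIO`.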

### Exploring and describing fuel efficiency data

#### Data Structure
A description of the features can be accessed [here](https://www.fueleconomy.gov/feg/ws/index.shtml#vehicle).

In [None]:
data.info()

#### Summary of the data

In [None]:
data.describe()

In [None]:
data.year.value_counts()

We see from the previous output that the year ranges from 1984 to 2021. The distribution of observations by year shows that 2021 has only 3 observations. Therefore, this year is removed from the dataset for the subsequent analysis.
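
The reasoning above (count the rows per year, then drop a year that is barely populated) can be sketched on a toy frame. The years and counts below are invented for illustration and are not the EPA figures.

```python
import pandas as pd

# Hypothetical toy data: the last year has a single observation.
toy = pd.DataFrame({"year": [2019, 2019, 2020, 2020, 2020, 2021]})

# Count observations per year, ordered by year rather than by frequency.
counts = toy.year.value_counts().sort_index()

# Keep only the years before the sparsely populated one.
filtered = toy[toy.year < 2021]
print(counts)
print(filtered.year.max())
```

Sorting the counts by index makes a thin final year easy to spot, since `value_counts` orders by frequency by default.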

In [None]:
data=data[data.year<2021]

### Overall trend in average MPG over time
To do this, we will use a technique called split-apply-combine. We will __split__ the dataframe into groups by year, __apply__ the mean function to specific variables, and __combine__ the results into a new dataframe. This can be accomplished with the groupby and agg functions of the pandas library: the observations are aggregated by year and, for each group, the mean highway, city, and combined fuel efficiency is computed with the mean function of the NumPy library. The result is assigned to a new data frame, meanMPG. This dataset is then visualized using the plotnine library, a Python implementation of the [Grammar of Graphics](https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149) modeled on R's ggplot2.
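
As a minimal sketch of split-apply-combine, the toy frame below uses the same column names as the dataset, but the values are invented for illustration. It calls the `mean` method of the grouped frame directly, which is one of several equivalent ways to express the aggregation.

```python
import pandas as pd

# Hypothetical toy values, not taken from the EPA data.
toy = pd.DataFrame({
    "year":   [1984, 1984, 1985, 1985],
    "comb08": [20, 22, 24, 26],
    "city08": [18, 20, 21, 23],
})

# Split by year, apply the mean to each selected column, combine the results.
# reset_index turns the group key back into an ordinary column.
toy_mean = toy.groupby("year")[["comb08", "city08"]].mean().reset_index()
toy_mean.columns = ["year", "avgMPG", "avgCity"]
print(toy_mean)
```

Each output row corresponds to one group, so the result has as many rows as there are distinct years.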

In [None]:
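# Split by year, apply the mean to each MPG column, and combine the results
meanMPG = data.groupby('year')[['comb08','highway08','city08']].agg(np.mean)
meanMPG.columns = ['avgMPG','avgHghy','avgCity']
# Expose the group key (the index) as a regular column for plotting
meanMPG['year'] = meanMPG.index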
pn.ggplot(meanMPG,pn.aes('year','avgMPG')) + pn.geom_point(color='steelblue') + pn.geom_smooth(linetype='dashed') + \
    pn.xlab("Year") + pn.ylab("Average MPG") + pn.ggtitle("All cars") + pn.theme_bw()

Based on this visualization, one might conclude that there has been a tremendous increase in the fuel economy of cars sold in the last few years. However, this can be a little misleading as there have been more hybrid and non-gasoline vehicles in the later years.
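
One way to set such vehicles aside is to combine boolean masks, one per condition, with `&`. The sketch below uses the dataset's column names, but the four rows are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "fuelType1": ["Regular Gasoline", "Diesel", "Premium Gasoline", "Regular Gasoline"],
    "fuelType2": [None, None, "E85", None],
    "atvType":   [None, "Diesel", "FFV", "Hybrid"],
})

is_gasoline = toy.fuelType1.str.endswith("Gasoline")  # primary fuel is gasoline
no_second_fuel = toy.fuelType2.isnull()               # no alternative fuel listed
not_hybrid = toy.atvType != "Hybrid"                  # missing atvType counts as not hybrid

# A row is kept only when all three masks are True.
kept = toy[is_gasoline & no_second_fuel & not_hybrid]
print(kept)
```

Note that the `!=` comparison evaluates to True for a missing `atvType`, so conventional vehicles with no alternative-technology label pass the third mask.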

In [None]:
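# Keep vehicles whose primary fuel is some grade of gasoline
criteria1 = data.fuelType1.str.endswith('Gasoline')
# Exclude vehicles that list a second fuel type
criteria2 = data.fuelType2.isnull()
# Exclude hybrids
criteria3 = data.atvType != "Hybrid"
vehicles_non_hybrid = data[criteria1 & criteria2 & criteria3]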
meanMPG_non_hybrid = vehicles_non_hybrid.groupby('year')[['comb08','highway08','city08']].agg([np.mean])
meanMPG_non_hybrid.columns = ['avgMPG','avgHghy','avgCity']
# Expose the group key (the index) as a regular column for plotting
meanMPG_non_hybrid['year']=meanMPG_non_hybrid.index
pn.ggplot(meanMPG_non_hybrid,pn.aes('year','avgMPG')) + pn.geom_point(color='steelblue') + \
    pn.geom_smooth(linetype='dashed') + pn.xlab("Year") + pn.ylab("Average MPG") + pn.ggtitle("Gasoline cars") + \
    pn.theme_bw()

This visualization shows that there is still a marked rise in the average miles per gallon even after eliminating hybrids. The next question we can ask is whether fewer cars with large engines have been built more recently. If this is true, it could explain the increase in average miles per gallon. First, let's verify that cars with larger engines have poorer miles per gallon.

In [None]:
# Keep rows with both displacement and combined MPG; copy so the assignment below
# modifies a new frame rather than a view of the original
vehicles_non_hybrid = vehicles_non_hybrid[vehicles_non_hybrid.displ.notnull() &
                                          vehicles_non_hybrid.comb08.notnull()].copy()
vehicles_non_hybrid['displ'] = vehicles_non_hybrid.displ.astype('float')
# Pearson correlation between combined MPG and engine displacement
corr = np.corrcoef(vehicles_non_hybrid.comb08,vehicles_non_hybrid.displ)[0,1]
pn.ggplot(vehicles_non_hybrid, pn.aes('displ', 'comb08')) + pn.geom_point(color='steelblue') + \
    pn.xlab("Engine Displacement") + pn.ylab("Combined MPG") + pn.ggtitle("Gasoline cars") + \
    pn.annotate('text',x=5,y=40,label='Correlation: ' + str(round(corr,2)),color='darkred',size=10) + \
    pn.theme_bw()
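
The number shown in the plot annotation is the off-diagonal entry of the 2x2 matrix returned by `np.corrcoef`. As a rough check of what that entry means, the sketch below compares it with the Pearson formula written out by hand. The four data points are invented for illustration.

```python
import numpy as np

# Hypothetical toy values: larger engines paired with lower MPG.
displ = np.array([1.6, 2.0, 3.5, 5.0])
mpg = np.array([33.0, 30.0, 22.0, 17.0])

# corrcoef returns a 2x2 matrix; [0, 1] is the correlation between the two inputs.
matrix = np.corrcoef(mpg, displ)
r = matrix[0, 1]

# Pearson coefficient written out: covariance over the product of the spreads.
dx = displ - displ.mean()
dy = mpg - mpg.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(r, r_manual)
```

A value close to -1 indicates a strong negative linear relationship, which is the pattern the scatter plot is meant to confirm.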

Now, let's see whether more cars with small engines were made in later years, which could explain the drastic increase in fuel efficiency.

In [None]:
# Average engine displacement per year; reset_index turns the year key into a column
vehicles_non_hybrid_grouped = vehicles_non_hybrid.groupby('year')[['displ']].agg(np.mean).reset_index()
vehicles_non_hybrid_grouped.columns = ['year','avgDispl']
pn.ggplot(vehicles_non_hybrid_grouped,pn.aes('year', 'avgDispl')) + pn.geom_line(color='steelblue') + \
    pn.xlab("Year") + pn.ylab("Average engine displacement (l)") + pn.theme_bw()

From the preceding figure, it can be observed that the average engine displacement has decreased substantially since 2008. To get a better sense of the impact this might have had on fuel efficiency, we can put both MPG and displacement by year on the same graph. To do this, we need to reshape the dataset from the wide format to the long format using the __melt__ function of the pandas library.
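
As a small sketch of the wide-to-long reshape, the toy frame below has one column per measure. After `pd.melt`, each measure becomes its own block of rows labeled by a `variable` column. The numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical wide-format data: one row per year, one column per measure.
wide = pd.DataFrame({
    "year": [1984, 1985],
    "avgDispl": [3.0, 3.1],
    "avgMPG": [20.0, 21.0],
})

# Long format: one row per (year, variable) pair, with the number in 'value'.
long_format = pd.melt(wide, id_vars="year")
print(long_format)
```

The `variable` column is what a faceted plot can split on, which is why the long format suits stacking the two series in separate panels.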

In [None]:
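# Mean displacement and combined MPG per year, with year as a regular column
avg_grouped_by_year = vehicles_non_hybrid.groupby('year')[['displ','comb08']].agg(np.mean).reset_index()
avg_grouped_by_year.columns = ['year','avgDispl','avgMPG']
# Wide to long: one row per (year, variable) pair, as needed for faceting
melted_avg_grouped_by_year = pd.melt(avg_grouped_by_year,id_vars='year')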
pn.ggplot(melted_avg_grouped_by_year,pn.aes('year','value')) + pn.geom_line(color='steelblue') + pn.theme_bw() + \
    pn.facet_wrap('variable',ncol = 1,scales='free_y')

From this visualization, we can see the following:
* Engine sizes generally increased until 2008, with a sudden increase in large engines between 2006 and 2008.
* Since 2009, there has been a decrease in the average engine size, which partially explains the increase in fuel efficiency.
* Until 2005, there was an increase in the average engine size, but the fuel efficiency remained roughly constant. This seems to indicate that engine efficiency has increased over the years.
* The years 2006–2008 are interesting. Though the average engine size increased quite suddenly, the MPG remained roughly the same as in previous years. This seeming discrepancy might require more investigation.

Now, let's see how makes and models of cars inform us about fuel efficiency over time. First, look at the frequency of makes and models of cars available in the U.S., concentrating on 4-cylinder cars.
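
Counting distinct makes per year comes down to a `groupby` followed by `nunique`. The sketch below shows the idea on a toy frame with placeholder make names that are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: several models per make, two years.
toy = pd.DataFrame({
    "year": [1984, 1984, 1984, 1985, 1985],
    "make": ["Make A", "Make A", "Make B", "Make A", "Make A"],
})

# nunique counts distinct makes within each year, ignoring repeated models.
makes_per_year = toy.groupby("year")["make"].nunique()
print(makes_per_year)
```

Because repeated rows for the same make count once, the result reflects how many manufacturers offered at least one model, not how many models were sold.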

In [None]:
vehicles_non_hybrid_4 = vehicles_non_hybrid[(vehicles_non_hybrid.cylinders == 4.0)]
vehicles_non_hybrid_4_grouped = vehicles_non_hybrid_4.groupby('year').make.nunique()
figure = vehicles_non_hybrid_4_grouped.plot(figsize=(9,6))
figure.set_xlabel('Year')
figure.set_ylabel('Number of 4-cylinder Makes')
figure

We can see in the preceding graph that there has been a decline in the number of makes with 4-cylinder engines available since the start of the dataset in 1984. Can we look at the makes that have been available every year in this dataset?
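
Finding the makes present in every year amounts to intersecting one set of makes per year. A minimal sketch with placeholder names (invented for illustration) shows how `functools.reduce` folds `set.intersection` across the list.

```python
import functools

# Hypothetical sets of makes, one per year.
makes_by_year = [
    {"Make A", "Make B", "Make C"},
    {"Make A", "Make C"},
    {"Make A", "Make C", "Make D"},
]

# reduce applies set.intersection pairwise: ((s0 & s1) & s2) ...
common = functools.reduce(set.intersection, makes_by_year)

# Equivalent one-step form: unpack the list into a single call.
common_alt = set.intersection(*makes_by_year)
print(common)
```

Only makes that appear in every set survive, so a make missing from even one year is dropped.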

In [None]:
unique_makes=[]
for name, group in vehicles_non_hybrid_4.groupby('year'):
    unique_makes.append(set(group.make.unique()))
unique_makes=functools.reduce(set.intersection,unique_makes)
unique_makes

Let's see how these car manufacturers' models have performed over time.

In [None]:
vehicles_non_hybrid_4_makers = vehicles_non_hybrid_4[vehicles_non_hybrid_4.make.isin(unique_makes)]
# Average only comb08; make is a grouping key and a text column, so it is not aggregated
vehicles_non_hybrid_4_makers_grouped = vehicles_non_hybrid_4_makers.groupby(['year','make'])[['comb08']] \
    .agg(np.mean).reset_index()
vehicles_non_hybrid_4_makers_grouped.columns = ['year','make','avgMPG']
pn.ggplot(vehicles_non_hybrid_4_makers_grouped,pn.aes(x='year', y='avgMPG')) + pn.geom_line() + \
    pn.facet_wrap('make') + pn.theme_bw()