**<font size=5>Tennessee Fuel Quality Analysis</font>**

* **Date Published**: 2019/07/31
* **Collaborators**: [Kate Hayes](https://github.com/99KHayes) & [Misha Berrien](https://github.com/mishaberrien)
* **Data Source**: State of Tennessee Department of Agriculture

In [12]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
import seaborn as sns
import sklearn.preprocessing as preprocessing
import statsmodels.api as sm
import sys

from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


src_dir = os.path.join(os.getcwd(), '..', '..', 'src')
sys.path.append(src_dir)

# helper functions
from d01_data.intermediate_cleaning import concatenate_and_save_intermediate_files, clean_dataset_intermediate_1, clean_dataset_intermediate_2
from d03_processing.feature_engineering import merge_gasoline_asm_datasets
from d04_modelling.modelling import get_model_pvalue

# Load the "autoreload" extension
%load_ext autoreload

# reload modules so that as you change code in src, it gets loaded
%autoreload

%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Introduction   

In this project, we are interested in understanding if we can predict when a fuel compliance test will fail. 

The state of Tennessee's Department of Agriculture (TDA) maintains a fuel quality inspection program. Each year, the state inspects all places where fuel is sold or distributed, including gas stations, terminals, and airports. The results of these routine tests, as well as follow-up tests for complaints, are maintained by the TDA.

Around 97% of these compliance tests pass each year, and the remaining 3% or so fail. A failing test could mean that consumers are exposed to fuel that could harm their property or themselves.

## Dataset

Tennessee has a Sunshine law (public records law) that allows anyone to request any state record through the right legal channels. The data set was acquired in this manner. We sent an official Tennessee records request for the fuel quality data for the last 5 years. It came as routine inspections and complaints, one set of each per year, so 10 Excel files in total.

Five-year fuel quality inspection records for the state of Tennessee

* Source: State of Tennessee Department of Agriculture
* Time Period: Mid 2014 to early 2019
* Number of Fuel Products: 11
* Number of Test Types: 72

## Key Metrics

***The following descriptions of the key metrics were given to us by the TDA Fuel Quality Manager. The main things we examined in this project were the prediction of pass/fail rates and how the volatility properties vary with the seasons.***

Volatility is an important property of gasoline because it must be able to vaporize before combusting in an engine. Three characteristics are used to measure the volatility of gasoline and evaluate suitability: vapor pressure, vapor-liquid ratio, and the distillation temperature at which 50% of the fuel is evaporated.


* Vapor pressure is a measure of the amount of vapor that is produced by a gasoline sample at 37.8°C (100°F). Vapor pressure most affects an engine’s ease of starting. The vapor pressure specification is a maximum allowable limit reported in kilopascals. The pressure must be high enough to promote easy starting but not so high that it contributes to excessive emissions or vapor lock (the presence of too much vapor, which leads to loss of engine power or rough operation).


* Vapor-liquid ratio is the ratio of the volume of vapor to the volume of liquid at atmospheric pressure. The vapor-liquid ratio specification is a minimum allowable limit reported in degrees Celsius. The reported value is the temperature at which the vapor-liquid ratio is equal to 20 (20:1 vol/vol), the approximate temperature at which engine problems may occur. Vapor-liquid ratio is used to evaluate a gasoline sample’s tolerance to changes in temperature. A noncompliant test result (too low) may lead to vapor lock or hot fuel handling problems, as evidenced by loss of power while accelerating or idling.


* Distillation measures the temperature range over which a heated sample fully evaporates. The temperature at which 50% of a sample is evaporated (T50) relates to the driveability (smoothness and ease of driving) and idling characteristics of the fuel. T50 most closely relates to how a fuel performs under continuous activity (not starting or warming up). T50 has minimum and maximum allowable limits reported in degrees Celsius.
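
The three limits described above can be sketched as a small compliance check. This is only an illustration: the class name, the function name, and the numeric limits are hypothetical and are not actual ASTM or TDA values.

```python
from dataclasses import dataclass


@dataclass
class VolatilityLimits:
    """Seasonal limits for one fuel class (all values hypothetical)."""
    vapor_pressure_max_kpa: float  # maximum allowable vapor pressure
    vapor_liquid_min_c: float      # minimum temperature at which V/L = 20
    t50_min_c: float               # minimum 50%-evaporated temperature
    t50_max_c: float               # maximum 50%-evaporated temperature


def check_volatility(vapor_pressure_kpa, vapor_liquid_c, t50_c, limits):
    """Return a pass/fail flag for each of the three volatility measures."""
    return {
        # Vapor pressure has only an upper limit.
        "vapor_pressure": vapor_pressure_kpa <= limits.vapor_pressure_max_kpa,
        # Vapor-liquid ratio temperature has only a lower limit.
        "vapor_liquid": vapor_liquid_c >= limits.vapor_liquid_min_c,
        # T50 must fall between a lower and an upper limit.
        "t50": limits.t50_min_c <= t50_c <= limits.t50_max_c,
    }


# Made-up limits and a made-up sample, for illustration only.
limits = VolatilityLimits(62.0, 50.0, 66.0, 121.0)
result = check_volatility(60.0, 48.0, 100.0, limits)
print(result)
```

In this made-up sample, the vapor-liquid temperature falls below its minimum, so that measure is flagged as noncompliant while the other two pass.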

## Questions

For this project, we are interested in two distinct questions: 

1. Is there seasonality associated with fuel compliance test failures in the state of Tennessee? 
1. Can we use data collected by fuel inspectors to better predict gas station fuel test failures in the state of Tennessee?

## Time Series Analysis

### Load and Process Datasets 

The fuel volatility specifications change throughout the year based on outdoor temperature. These specifications are set by ASTM. We joined the ASTM standards onto the five-year results data frame in order to assess the seasonality of the data, then ran a Dickey-Fuller test and a selective seasonal decomposition.
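
One way to attach a date-dependent standard to each test result is an as-of join, sketched below. We assume here that each row of the standards table applies from its date until the next row; the column names and all numbers are hypothetical, and this is not necessarily how `merge_gasoline_asm_datasets` does it.

```python
import pandas as pd

# Hypothetical test results (one row per sample).
results = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-15", "2018-05-20", "2018-07-04"]),
    "vapor_pressure": [88.0, 60.0, 56.0],
})

# Hypothetical standards table: each limit applies from its date onward.
standards = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-05-01", "2018-06-01"]),
    "vapor_pressure_max": [93.0, 62.0, 54.0],
})

# merge_asof picks, for each result, the latest standard dated on or
# before the sample date. Both frames must be sorted by the join key.
joined = pd.merge_asof(
    results.sort_values("date"),
    standards.sort_values("date"),
    on="date",
)

# A sample complies when it does not exceed the limit in force that day.
joined["compliant"] = joined["vapor_pressure"] <= joined["vapor_pressure_max"]
print(joined)
```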

### Dickey-Fuller Stationarity Tests
Once the proper dataframes were created, a Dickey-Fuller test was run on each series to assess stationarity. Nearly all of the series were stationary at the 1% critical value. Two failed at the 1% level, but that was due to extreme sample failures. Because there was more than one value for each date, and time series analysis requires no more than one value per date, we took the average, maximum, and minimum values for each day. Assessing the data at these points allowed differing seasonal trends to be seen and made the outliers at either end of the spectrum more obvious. There is a caveat: averaging the value of one sample in Knoxville with another in Nashville does not necessarily tell the user anything about the trends, and it can disguise a lot of the underlying variation.

### Time Series Decomposition
We ran a time series decomposition on the average vapor pressure. We selected this series because it showed a clear seasonal trend with few outliers.

## Logistic Regression Analysis

### Load and Process Datasets

**The functions for cleaning and processing the dataset can be found in the src/d03_processing folder.** 

In [2]:
# gasoline_proc = pd.read_csv('../../data/03_processed/gasoline_processed.csv')
astm = pd.read_csv('../../data/01_raw/ASTM_fuel.csv')
astm.columns = ['Date', 'TN_retailers_seasons', 'TN_distributor_seasons',
       'vapor_liquid_minC_retail', 'distillation_50_minC_retail',
       'distillation_50_maxC_retail', 'vapor_pressure_maxC_retail',
       'vapor_liquid_minC_dist', 'distillation_50_minC_dist',
       'distillation_50_maxC_dist', 'vapor_pressure_maxC_dist']

In [3]:
concatenate_and_save_intermediate_files('fy_*_routine', 'routine_full')



In [4]:
routine_full = pd.read_csv('../../data/02_intermediate/routine_full.csv')



In [5]:
routine_clean = clean_dataset_intermediate_1(routine_full)

In [6]:
gasoline_clean = clean_dataset_intermediate_2(routine_clean)



In [13]:
gasoline = merge_gasoline_asm_datasets(gasoline_clean, astm)

### Build & Choose Model

#### Define Variables

In [None]:
# construct features 
x_feats = ['Grade']
X = pd.get_dummies(gasoline[x_feats], dtype=float)
X = sm.tools.add_constant(X)
# convert target using get_dummies
y = pd.get_dummies(gasoline["compliance_vap_liq_pressure"], dtype=float)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y.iloc[:,0], test_size=0.3, random_state=0)

Our dataset is heavily imbalanced, with a 26:9062 ratio of 1s to 0s. To get a reliable result, we need to balance the two classes with oversampling. 

In [None]:
print("Label Count '1': {}".format(sum(y_train==1)))
print("Label Count '0': {} \n".format(sum(y_train==0)))
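
To see why this imbalance matters, consider a trivial classifier that always predicts the majority label. The short calculation below uses the 26:9062 counts quoted above. The variable names are ours.

```python
# Class counts quoted in the text: 26 ones, 9062 zeros.
n_minority, n_majority = 26, 9062
total = n_minority + n_majority

# A model that always predicts 0 is right on every majority sample...
baseline_accuracy = n_majority / total
# ...and wrong on every minority sample, so its recall on label 1 is zero.
baseline_recall_minority = 0 / n_minority

minority_share = n_minority / total

print(f"Minority share:    {minority_share:.4%}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall:   {baseline_recall_minority:.1f}")
```

Such a model scores above 99% accuracy without ever identifying a single minority case, which is why accuracy alone is not a useful yardstick here.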

#### Oversampling

In [None]:
smote = SMOTE()

# simple resampling from your previously split data
# note: fit_sample was renamed to fit_resample in newer imbalanced-learn releases
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [None]:
print("Label Count '1': {}".format(sum(y_train_resampled==1)))
print("Label Count '0': {} \n".format(sum(y_train_resampled==0)))

We now have a balanced dataset to build our models with. 
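
For intuition, the core idea behind SMOTE is to create a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbors. The function below is a simplified sketch of that idea only. It is not the `imblearn` implementation, and the name `smote_like_sample` is hypothetical.

```python
import numpy as np


def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point interpolated between two minority points."""
    if rng is None:
        rng = np.random.default_rng()
    # Pick a random minority sample.
    i = rng.integers(len(minority))
    x = minority[i]
    # Find its k nearest minority neighbors (index 0 is the point itself).
    dists = np.linalg.norm(minority - x, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbors)
    # Move a random fraction of the way from x toward the chosen neighbor.
    lam = rng.random()
    return x + lam * (minority[j] - x)


# Toy minority class with two features.
minority = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]])
synthetic = smote_like_sample(minority, k=2, rng=np.random.default_rng(0))
print(synthetic)
```

Because each synthetic point lies on the segment between two real minority points, it always stays within the range already spanned by the minority class.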

#### Model 1: Vapor-Liquid Ratio Test Outcome ~ Grade

In [None]:
logreg = LogisticRegression(fit_intercept = False, C = 1e12)
model_log = logreg.fit(X_train_resampled, y_train_resampled)
model_log

Let's find our p-values.

In [None]:
get_model_pvalue(y_train_resampled, X_train_resampled)

These p-values are not significant. Let's try another model. 

#### Model 2: Vapor-Liquid Ratio Test Outcome ~ Tennessee Retailers Season & Grade

We will repeat the process laid out above for our second model

In [None]:
# construct features 
x_feats = ['TN_retailers_seasons', 'grade']
X = pd.get_dummies(gasoline[x_feats], dtype=float)
X = sm.tools.add_constant(X)
# convert target using get_dummies
y = pd.get_dummies(gasoline["compliance_vap_liq_pressure"], dtype=float)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y.iloc[:,0], test_size=0.3, random_state=0)

In [None]:
smote = SMOTE()

# simple resampling from your previously split data
# note: fit_sample was renamed to fit_resample in newer imbalanced-learn releases
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [None]:
# simple resampling from your previously split data
X_test_resampled, y_test_resampled = smote.fit_resample(X_test, y_test)

In [None]:
logreg = LogisticRegression(fit_intercept = False, C = 1e12)
model_log = logreg.fit(X_train_resampled, y_train_resampled)
model_log

In [None]:
get_model_pvalue(y_train_resampled, X_train_resampled)

Neither of our models has significant p-values. We will nonetheless choose the second model (which has slightly higher p-values) and move on to our testing phase. 

### Test the Chosen Model (Prediction) 

In [None]:
y_hat_test_resampled = logreg.predict(X_test_resampled)
y_hat_train_resampled = logreg.predict(X_train_resampled)

#### Precision, Recall, Accuracy and F1-Score

In [None]:
# scikit-learn metrics expect the true labels first, then the predictions
print('Training Precision: ', precision_score(y_train_resampled, y_hat_train_resampled))
print('Testing Precision: ', precision_score(y_test_resampled, y_hat_test_resampled))
print('\n\n')

print('Training Recall: ', recall_score(y_train_resampled, y_hat_train_resampled))
print('Testing Recall: ', recall_score(y_test_resampled, y_hat_test_resampled))
print('\n\n')

print('Training Accuracy: ', accuracy_score(y_train_resampled, y_hat_train_resampled))
print('Testing Accuracy: ', accuracy_score(y_test_resampled, y_hat_test_resampled))
print('\n\n')

print('Training F1-Score: ', f1_score(y_train_resampled, y_hat_train_resampled))
print('Testing F1-Score: ', f1_score(y_test_resampled, y_hat_test_resampled))
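
For reference, the four scores can be computed by hand from the confusion counts. The sketch below uses a small made-up label vector and a hypothetical helper, `binary_metrics`. It also shows that swapping the true and predicted labels exchanges precision and recall, which is why argument order matters.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, accuracy, and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Made-up labels: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
swapped = binary_metrics(y_pred, y_true)  # arguments in the wrong order
print(metrics)
print(swapped)
```

Accuracy and F1 are unchanged by the swap, but precision and recall trade places.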

#### ROC Curve & AUC

In [None]:
# First calculate the decision scores (signed distances from the decision boundary)
# for each data point, using the model already fitted on the resampled training data:
y_score = logreg.decision_function(X_test_resampled)

fpr, tpr, thresholds = roc_curve(y_test_resampled, y_score)

In [None]:
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
sns.set_context('poster')
print('AUC: {}'.format(auc(fpr, tpr)))
plt.figure(figsize=(10,8))
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i/20.0 for i in range(21)])
plt.xticks([i/20.0 for i in range(21)])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.savefig("../../results/Images/roc.jpg")
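
The AUC printed above is the area under the ROC curve, which can be obtained from the `(fpr, tpr)` points with the trapezoidal rule. The sketch below shows that calculation on a small made-up curve. The helper name `trapezoid_auc` is hypothetical.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a curve given by points sorted by increasing fpr."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area


# A made-up ROC curve with five points.
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
example_auc = trapezoid_auc(fpr, tpr)

# The dashed diagonal in the plot (random guessing) has area 0.5.
chance_auc = trapezoid_auc([0.0, 1.0], [0.0, 1.0])

print(example_auc, chance_auc)
```

An AUC close to 0.5 means the model ranks positives and negatives no better than chance, while a value close to 1.0 means it separates them almost perfectly.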

### Summary & Caveats

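Neither model produced significant p-values, so the coefficients should not be read as evidence of an effect. The dataset is heavily imbalanced (a 26:9062 ratio of 1s to 0s), and because the test set was also oversampled with SMOTE, the test metrics above are computed partly on synthetic samples. 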

## Conclusion & Next Steps

* There are obvious seasonal trends due to the nature of the regulations.
* Further analysis is required in order to interpret the logistic regression.
* Do tests that pass but barely pass lead to more complaints?
* Do certain regions of Tennessee have more failures in certain seasons?
* Do certain brands of gas stations or distributors deliver more failures or have more complaints against them?