# COGS 108 - Data Checkpoint

# Names

- Anurag Asthana
- Michael Granado
- Victorionna Tran
- Alex Bumbalov
- Tianyue (Terry) Zhang

<a id='research_question'></a>
# Research Question

Is there a relationship between the number of electric vehicles registered in a state and that state's total carbon emissions? Furthermore, how does a state's number of electric vehicles affect carbon emissions in its electric-energy and petroleum-energy subsectors?

# Dataset(s)

- Dataset Name: Annual number of EVs (electric vehicles) registered by state (2011-2020)
   - Link to the dataset: https://www.atlasevhub.com/materials/state-ev-registration-data/#data 
   - Number of observations: 16 datasets (one per state), each with over 1,000,000 observations
   - There are 16 datasets, one for each of 16 states. Each contains over 1,000,000 EV observations with variables including the vehicle registration year and model. Only 16 states are covered because only these states have had their information made public through the Open Vehicle Registration Initiative. We plan to build a single custom dataset from these 16 datasets. It will have 16 observations, one per state, and 10 variables, one for each year from 2011 to 2020, each holding the count of EVs registered in that state in that year.

 
- Dataset Name: Annual energy-related CO2 emissions by state (2000-2018) in million metric tons
   - Link to the dataset: https://www.eia.gov/environment/emissions/state/excel/table2.xlsx 
   - Number of observations: 50 observations (1 per state)
   - This dataset contains an observation for each of the 50 U.S. states. Variables span the years 2000-2018 and give the million metric tons of carbon dioxide produced by energy production. We can trim this dataset down to the 16 states for which we have EV registration data. Then, we can merge this dataset with the dataset containing EV registrations.
 
 
- Dataset Name: Annual electricity energy-related carbon dioxide emissions by state (1980-2018) in million metric tons
   - Link to the dataset: https://www.eia.gov/environment/emissions/state/excel/electricity.xlsx 
   - Number of observations: 50 observations (1 per state)
   - This dataset contains an observation for each of the 50 U.S. states. Variables span the years 1980-2018 and give the million metric tons of carbon dioxide produced specifically by electric energy production. We can trim this dataset down to the 16 states recorded in the EV registration data. Then, this dataset can be merged with the others.

 
- Dataset Name: Annual electricity petroleum-related carbon dioxide emissions by state (1980-2018) in million metric tons
   - Link to the dataset: https://www.eia.gov/environment/emissions/state/excel/transportation.xlsx 
   - Number of observations: 50 observations (1 per state)
   - This dataset contains an observation for each of the 50 U.S. states. Variables span the years 1980-2018 and give the million metric tons of carbon dioxide produced specifically by petroleum energy production. We can trim this dataset down to the 16 states recorded in the EV registration data. Then, this dataset can be merged with the others.
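
The per-state aggregation described for the EV registration data can be sketched roughly as follows. This is only an illustration on toy data: the column name `registration_year` and the `raw_by_state` dictionary are assumptions made for the example, not the actual schema of the Atlas EV Hub files.

```python
import pandas as pd

# Toy stand-ins for two of the per-state registration files.
# "registration_year" is a hypothetical column name, not the real schema.
raw_by_state = {
    "California": pd.DataFrame({"registration_year": [2011, 2011, 2012, 2020]}),
    "Vermont": pd.DataFrame({"registration_year": [2012, 2020, 2020]}),
}

ev_years = range(2011, 2021)  # 2011-2020 inclusive

# Count registrations per year for each state, filling absent years with 0
per_state_counts = {
    state: df["registration_year"].value_counts().reindex(ev_years, fill_value=0)
    for state, df in raw_by_state.items()
}

# One row per state, one column per year
ev_counts = pd.DataFrame(per_state_counts).T
ev_counts.columns = [str(year) for year in ev_years]
ev_counts.index.name = "State"
ev_counts = ev_counts.reset_index()
print(ev_counts)
```

With the real files, the dictionary would hold one DataFrame per state, and the year would likely need to be extracted from whichever registration date column the files actually provide.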
  

**Combining Datasets:** Our end goal is to have our own complete custom dataset containing 128 observations: one for each combination of the 16 states and the 8 years from 2011 to 2018 (16 × 8 = 128). We choose these 16 states because they are the ones with public EV registration data. We choose the years 2011-2018 because they are the years covered by all of the datasets. Our variables will consist of the number of EVs registered, million metric tons of energy-related CO2 emissions, million metric tons of electricity energy-related carbon dioxide emissions, and million metric tons of petroleum energy-related carbon dioxide emissions.
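
One possible way to arrive at a table with one row per state and year is to reshape each wide table into long form and merge on `State` and `Year`. The sketch below uses two states and two years only. The electricity values are taken from the cleaned output in this notebook, while the EV, total, and petroleum numbers are placeholders rather than real data. The helper `to_long` and all column names other than `State` are assumptions made for illustration.

```python
import pandas as pd

def to_long(wide, value_name):
    """Reshape a wide State x year table into one row per (State, Year)."""
    long = wide.melt(id_vars="State", var_name="Year", value_name=value_name)
    long["Year"] = long["Year"].astype(int)
    return long

state_col = ["California", "Colorado"]
# Placeholder values, not real data
ev = pd.DataFrame({"State": state_col, "2011": [10, 2], "2012": [25, 5]})
total = pd.DataFrame({"State": state_col, "2011": [1.0, 2.0], "2012": [3.0, 4.0]})
petroleum = pd.DataFrame({"State": state_col, "2011": [5.0, 6.0], "2012": [7.0, 8.0]})
# Values copied from the cleaned electricity table in this notebook
electricity = pd.DataFrame({"State": state_col, "2011": [36.4, 39.3], "2012": [48.0, 39.5]})

combined = to_long(ev, "ev_registrations")
for wide, name in [(total, "total_co2"),
                   (electricity, "electricity_co2"),
                   (petroleum, "petroleum_co2")]:
    # validate raises an error if a (State, Year) pair appears more than once
    combined = combined.merge(to_long(wide, name), on=["State", "Year"],
                              how="inner", validate="one_to_one")

# Each state-year pair should appear exactly once
assert len(combined) == len(state_col) * 2
print(combined)
```

With all 16 states and the years 2011-2018, the same row-count check would expect 128 rows.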

# Setup

In [1]:
# Libraries that will be used
import numpy as np
import pandas as pd

# Display options for Pandas DataFrames
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# List of states that the data will be filtered by
states = ["California", "Colorado", "Connecticut", "Florida", "Montana", "Michigan", "Minnesota", "New Jersey", 
            "New York", "Oregon", "Tennessee", "Texas", "Vermont", "Virginia", "Washington", "Wisconsin"]

# List of years that the data will be filtered by (2011-2018 inclusive)
years = range(2011,2019)

# Data Cleaning


### Annual number of EVs (electric vehicles) registered by state

In [2]:
## Annual number of EVs (electric vehicles) registered by state

### Annual energy-related CO2 emissions by state (in million metric tons)

In [3]:
## Annual energy-related CO2 emissions by state

### Annual electricity energy-related carbon dioxide emissions by state (in million metric tons)

In [4]:
## Annual electricity energy-related carbon dioxide emissions by state (in million metric tons)

# Reading in the dataset as a csv (modified from original to exclude empty rows and column names)
electric_energy_CO2_by_state = pd.read_csv("datawrangling/annual_electricity_related_CO2_emissions.csv")

# Filtering by the states in the EV dataset
electric_energy_CO2_by_state = electric_energy_CO2_by_state.loc[(electric_energy_CO2_by_state["State"].isin(states))]

# Filtering from the years 2011 to 2018
str_years = [str(i) for i in years]
electric_energy_CO2_by_state = electric_energy_CO2_by_state[["State"] + str_years]

# Resetting the indices of the dataframe
electric_energy_CO2_by_state.reset_index(drop=True, inplace=True)
electric_energy_CO2_by_state

| | State | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | California | 36.4 | 48.0 | 45.7 | 46.3 | 44.3 | 36.6 | 33.0 | 33.7 |
| 1 | Colorado | 39.3 | 39.5 | 39.0 | 38.1 | 37.0 | 35.7 | 35.2 | 34.2 |
| 2 | Connecticut | 6.6 | 7.2 | 6.8 | 6.7 | 7.4 | 7.0 | 6.3 | 8.1 |
| 3 | Florida | 111.0 | 107.5 | 105.2 | 109.8 | 108.1 | 106.4 | 103.4 | 100.9 |
| 4 | Michigan | 65.6 | 63.5 | 62.7 | 60.2 | 63.0 | 55.6 | 55.7 | 58.9 |
| 5 | Minnesota | 29.3 | 25.7 | 26.0 | 29.4 | 27.2 | 26.7 | 25.3 | 26.7 |
| 6 | Montana | 16.7 | 15.7 | 16.6 | 17.3 | 17.8 | 16.1 | 15.5 | 15.1 |
| 7 | New Jersey | 15.7 | 14.9 | 14.4 | 16.8 | 17.9 | 19.7 | 16.7 | 17.4 |
| 8 | New York | 33.9 | 32.3 | 30.1 | 30.6 | 29.2 | 27.8 | 22.1 | 24.5 |
| 9 | Oregon | 6.4 | 7.0 | 9.1 | 8.0 | 8.6 | 7.8 | 7.5 | 8.4 |


This data is very clean, as values are provided for every state listed. The only small issue is that Vermont's annual CO2 emissions from electricity are far below one million metric tons, so the Vermont data may prove somewhat unworkable. The data was initially provided through [this site](https://www.eia.gov/environment/emissions/state/excel/electricity.xlsx), which had column names that were impractical to use. By removing the unnecessary column names as well as a row that contributed no data, we produced a more useful table that can be read easily with Pandas. With Pandas, we limited the years to 2011 through 2018 and limited the states to those that have EV registration data, producing the dataframe seen above.
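
A quick sanity check along the following lines could back up the claim that every listed state is present and flag states whose values are all very small. This is a sketch on toy data: the shortened `states` list, the `emissions` frame, and the 1.0 threshold are assumptions for illustration. The third state is deliberately left out of the toy frame to show what the check reports.

```python
import pandas as pd

# Shortened list for illustration; the notebook's full list has 16 states
states = ["California", "Colorado", "Vermont"]

# Toy frame using values from the cleaned output above (Vermont omitted on purpose)
emissions = pd.DataFrame({
    "State": ["California", "Colorado"],
    "2011": [36.4, 39.3],
    "2012": [48.0, 39.5],
})

year_cols = [col for col in emissions.columns if col != "State"]

# States we expected but did not find after filtering
missing_states = sorted(set(states) - set(emissions["State"]))

# States with at least one missing yearly value
has_gap = emissions[year_cols].isna().any(axis=1)
states_with_gaps = emissions.loc[has_gap, "State"].tolist()

# States whose emissions never reach 1.0 million metric tons (assumed threshold)
is_tiny = emissions[year_cols].max(axis=1) < 1.0
tiny_states = emissions.loc[is_tiny, "State"].tolist()

print("Missing states:", missing_states)
print("States with gaps:", states_with_gaps)
print("States below threshold:", tiny_states)
```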

### Annual electricity petroleum-related carbon dioxide emissions by state (in million metric tons)

In [5]:
## Annual electricity petroleum-related carbon dioxide emissions by state