# COGS 108 - Data Checkpoint

# Names

- Anurag Asthana
- Michael Granado
- Victorionna Tran
- Alex Bumbalov
- Tianyue (Terry) Zhang

<a id='research_question'></a>
# Research Question

Is there a relationship between the number of electric vehicles registered in a state and the state's total carbon emissions? Furthermore, how does the number of electric vehicles in a state affect the carbon emissions of the state's electric-energy and petroleum-energy subsectors?

# Dataset(s)

- Dataset Name: Annual number of EVs (electric vehicles) registered by state(2011-2020)
   - Link to the dataset: https://www.atlasevhub.com/materials/state-ev-registration-data/#data 
   - Number of observations: 16 datasets, one per state, each with over 1,000,000 observations
   - There are 16 datasets, one for each of 16 states. Each contains over 1,000,000 EV observations with variables such as the vehicle registration year and model. Only 16 states are covered because only these states have made their data public through the Open Vehicle Registration Initiative. We plan to build a single custom dataset from these 16 files. It will have 16 observations, one per state, and 10 variables, one for each year from 2011 to 2020, each holding the count of EVs registered in that state and year.

 
- Dataset Name: Annual energy-related CO2 emissions by state(2000-2018) in million metric tons
   - Link to the dataset: https://www.eia.gov/environment/emissions/state/excel/table2.xlsx 
   - Number of observations: 50 observations(1 per state)
   - This dataset contains an observation for each of the 50 U.S. states. Variables span the years 2000-2018 and give the million metric tons of carbon dioxide produced by energy production. We can trim this dataset down to the 16 states for which we have EV registration data. Then, we can merge this dataset with the dataset containing EV registrations.
 
 
- Dataset Name: Annual electricity energy-related carbon dioxide emissions by state(1980-2018) in million metric tons
   - Link to the dataset: https://www.eia.gov/environment/emissions/state/excel/electricity.xlsx 
   - Number of observations: 50 observations(1 per state)
   - This dataset contains an observation for each of the 50 U.S. states. Variables span from the years 1980-2018 and describe the million metric tons of carbon dioxide specifically produced by electric energy production. We can trim this dataset down to the 16 state observations recorded in the other datasets. Then, this dataset can be merged with the others.

 
- Dataset Name: Annual petroleum energy-related carbon dioxide emissions by state(1980-2018) in million metric tons
   - Link to the dataset: https://www.eia.gov/environment/emissions/state/excel/transportation.xlsx 
   - Number of observations: 50 observations(1 per state)
   - This dataset contains an observation for each of the 50 U.S. states. Variables span from the years 1980-2018 and describe the million metric tons of carbon dioxide specifically produced by petroleum energy production. We can trim this dataset down to the 16 state observations recorded in the other datasets. Then, this dataset can be merged with the others.
  

**Combining Datasets:** Our end goal is a complete custom dataset that contains 128 observations: 16 states, each with 8 years (2011-2018). We use 16 states because those are the states with public EV registration data. We use the years 2011-2018 because these are the years common to all datasets. Our variables will be the number of EVs registered, million metric tons of energy-related CO2 emissions, million metric tons of electricity energy-related carbon dioxide emissions, and million metric tons of petroleum energy-related carbon dioxide emissions.
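The combination step described above could be sketched as follows. This is only a minimal illustration on made-up values, assuming each cleaned table is in wide form (one row per state, one column per year); the helper `to_long` and the column names `EV_registrations` and `total_CO2` are hypothetical and not part of the notebook.

```python
import pandas as pd

def to_long(wide, value_name):
    """Reshape a wide table (one column per year) into one row per (State, Year)."""
    long = wide.melt(id_vars="State", var_name="Year", value_name=value_name)
    long["Year"] = long["Year"].astype(int)
    return long

# Toy stand-ins for two cleaned tables (made-up values, two states, two years)
ev_wide = pd.DataFrame({"State": ["A", "B"], "2011": [10, 20], "2012": [15, 25]})
co2_wide = pd.DataFrame({"State": ["A", "B"], "2011": [1.0, 2.0], "2012": [1.5, 2.5]})

# One observation per state and year, with one column per variable
combined = to_long(ev_wide, "EV_registrations").merge(
    to_long(co2_wide, "total_CO2"), on=["State", "Year"], how="inner"
)
print(combined)
```

With 16 states and 8 years, the same reshaping would give 16 × 8 = 128 rows, and the two remaining emissions tables could be merged in the same way.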

# Setup

In [1]:
# Libraries that will be used
import numpy as np
import pandas as pd

# Display options for Pandas DataFrames
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# List of states that the data will be filtered by
states = ["California", "Colorado", "Connecticut", "Florida", "Montana", "Michigan", "Minnesota", "New Jersey", 
            "New York", "Oregon", "Tennessee", "Texas", "Vermont", "Virginia", "Washington", "Wisconsin"]

# List of years that the data will be filtered by (2011-2018 inclusive)
years = range(2011,2019)

# Data Cleaning


### Annual number of EVs (electric vehicles) registered by state

In [2]:
## Annual number of EVs (electric vehicles) registered by state

# Load the data from the web (some .csv files are too big (> 200 MB) to be stored in the GitHub repo)
tables = ["https://www.atlasevhub.com/public/dmv/ca_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/co_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/ct_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/fl_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/mt_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/mi_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/mn_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/nj_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/ny_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/or_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/tn_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/tx_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/vt_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/va_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/wa_ev_registrations_public.csv",
          "https://www.atlasevhub.com/public/dmv/wi_ev_registrations_public.csv",]
EV_registration_columns = ["State", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018"]
EV_registration = pd.DataFrame(columns = EV_registration_columns)
EV_model_columns = ["State", "Year", "Model", "Amount"]
EV_model = pd.DataFrame(columns = EV_model_columns)

# Process the data for each state (16 in total)
for curr_state, curr_table in zip(states, tables):
    table = pd.read_csv(curr_table, low_memory=False)

    # Reduce each registration date to its year and drop rows without a valid date
    table["Registration Valid Date"] = pd.to_datetime(table["Registration Valid Date"]).dt.year
    table = table[table["Registration Valid Date"].notna()].copy()
    table["Registration Valid Date"] = table["Registration Valid Date"].astype(int)

    # Keep registrations from 2011-2018 and count them per year, in chronological order
    table = table[(table["Registration Valid Date"] >= 2011) & (table["Registration Valid Date"] <= 2018)]
    table = table.sort_values(by='Registration Valid Date')
    annual_registration_amount = table["Registration Valid Date"].value_counts().sort_index().tolist()

    # ONLY process states that have data for all 8 years (2011-2018)
    if len(annual_registration_amount) == 8:

        # Add a new row (the annual number of EVs registered in the current state) to table "EV_registration"
        newrow_data = [curr_state] + annual_registration_amount
        newrow_df = pd.DataFrame([newrow_data], columns = EV_registration_columns)
        EV_registration = pd.concat([EV_registration, newrow_df], axis='rows', ignore_index=True)

        # Some files have no "Vehicle Name" column, so build it from "Make" and "Model"
        if 'Vehicle Name' not in table.columns:
            table = table.assign(**{"Vehicle Name": table["Make"] + table["Model"]})

        for curr_year in years:
            # Add one row per model (the number of EVs of that model registered
            # in the current state and year) to table "EV_model"
            curr_year_table = table[table['Registration Valid Date'] == curr_year]
            model_counts = curr_year_table["Vehicle Name"].value_counts(sort = False)
            newrows_df = pd.DataFrame({"State": curr_state, "Year": curr_year,
                                       "Model": model_counts.index, "Amount": model_counts.values},
                                      columns = EV_model_columns)
            EV_model = pd.concat([EV_model, newrows_df], axis='rows', ignore_index=True)
    else:
        print(curr_state + " only has data in " + str(table["Registration Valid Date"].unique()) + " among the period 2011-2018")
print("------------------------------------------------------------------------------------------------------------------------")
print(EV_registration)
print("------------------------------------------------------------------------------------------------------------------------")
print(EV_model)

Connecticut only has data in [2011 2012 2013 2014 2015 2016 2018] among the period 2011-2018
Florida only has data in [2018] among the period 2011-2018
Michigan only has data in [2013 2014 2015 2016 2017 2018] among the period 2011-2018
New Jersey only has data in [2017 2018] among the period 2011-2018
Oregon only has data in [] among the period 2011-2018
Tennessee only has data in [] among the period 2011-2018
Texas only has data in [] among the period 2011-2018
Wisconsin only has data in [2018] among the period 2011-2018


This data does not directly provide the information we need, so we build two new tables from the original data:
1) number of EVs registered in each state by year
2) number of EVs registered in each state by year and model
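As a rough illustration of how these two tables can be derived, the sketch below uses `groupby` on a handful of made-up registration records. The vehicle names and dates are invented, and the `Year` and `Amount` column names are assumptions chosen for the example.

```python
import pandas as pd

# Toy registration records (made-up); the real files have one row per registered vehicle
records = pd.DataFrame({
    "Registration Valid Date": ["2011-03-01", "2011-07-15", "2012-01-20", "2012-05-05", "2012-09-09"],
    "Vehicle Name": ["Model X1", "Model X1", "Model X1", "Model Y2", "Model Y2"],
})
records["Year"] = pd.to_datetime(records["Registration Valid Date"]).dt.year

# Table 1: registrations per year; reindexing makes a year without data show up as 0
wanted_years = list(range(2011, 2014))
per_year = records.groupby("Year").size().reindex(wanted_years, fill_value=0)

# Table 2: registrations per year and model
per_year_model = records.groupby(["Year", "Vehicle Name"]).size().reset_index(name="Amount")

print(per_year)
print(per_year_model)
```

Reindexing against the full list of years is one way to make gaps explicit, which is what the printed messages above report for states such as Connecticut and Michigan.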

### Annual energy-related CO2 emissions by state (in million metric tons)

In [2]:
## Annual energy-related CO2 emissions by state
annual_energy_related_CO2_em = pd.read_csv('https://raw.githubusercontent.com/vctrnna/COGS108_Repo/main/annual_energy_related_CO2_emmisions.csv')

#drop columns from 2000-2010
annual_energy_related_CO2_em = annual_energy_related_CO2_em.drop(annual_energy_related_CO2_em.columns[[1,2,3,4,5,6,7,8,9,10,11]], axis = 1)

#only keep rows for the 16 states mentioned in first data set
annual_energy_related_CO2_em = annual_energy_related_CO2_em.loc[annual_energy_related_CO2_em['State'].isin(states)]

#re-index the rows
annual_energy_related_CO2_em = annual_energy_related_CO2_em.reset_index(drop = True)

annual_energy_related_CO2_em


| | State | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | Percent | Absolute |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | California | 341.4 | 347.8 | 348.3 | 344.0 | 351.0 | 351.8 | 354.2 | 355.5 | -6.00% | -22.8 |
| 1 | Colorado | 93.0 | 91.9 | 92.6 | 93.4 | 91.9 | 88.9 | 89.0 | 90.3 | 5.90% | 5.0 |
| 2 | Connecticut | 35.1 | 34.5 | 34.8 | 35.2 | 36.5 | 34.2 | 33.9 | 37.5 | -13.10% | -5.7 |
| 3 | Florida | 233.3 | 228.4 | 227.9 | 234.1 | 238.6 | 239.8 | 238.8 | 242.5 | 0.60% | 1.3 |
| 4 | Michigan | 162.9 | 156.2 | 164.5 | 164.3 | 165.0 | 154.7 | 155.4 | 163.6 | -16.20% | -31.7 |
| 5 | Minnesota | 91.7 | 87.0 | 91.4 | 96.6 | 89.8 | 91.4 | 91.4 | 94.9 | -3.20% | -3.1 |
| 6 | Montana | 31.7 | 30.5 | 31.7 | 32.2 | 32.1 | 30.7 | 30.7 | 30.7 | 0.10% | 0.0 |
| 7 | New Jersey | 104.1 | 98.8 | 100.9 | 105.0 | 103.9 | 105.6 | 99.1 | 105.1 | -14.40% | -17.7 |
| 8 | New York | 175.9 | 168.5 | 170.1 | 177.8 | 175.6 | 170.7 | 166.4 | 175.4 | -17.50% | -37.2 |
| 9 | Oregon | 37.3 | 37.1 | 39.3 | 38.1 | 38.2 | 37.9 | 39.0 | 39.7 | -3.30% | -1.4 |


  <p style="margin-left:2.5em"><small> Units: million metric tons of energy-related carbon dioxide</small></p>

The data is very clean. From the original dataset, we removed the columns for the years 2000-2010 and kept the rows for the 16 states in the annual EV registration dataset. The row and column names are unchanged from the original dataset. To make the original dataset usable and readable in CSV form, we removed unnecessary titles and descriptions from the top of the file. Using Pandas, we then removed the columns for the years we do not need and extracted the rows for the states we intend to use.
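Dropping columns by position works only while the column order of the file stays the same. As a hedged alternative, the sketch below selects the year columns by name on a small made-up CSV. The values are invented, and unlike the cell above it keeps only `State` and the year columns, so the `Percent` and `Absolute` columns are dropped as well.

```python
import io
import pandas as pd

# Toy CSV standing in for the emissions file (made-up values)
csv_text = """State,2010,2011,2012,Percent,Absolute
California,3.0,3.1,3.2,-1.00%,-0.1
Alabama,2.0,2.1,2.2,1.00%,0.1
Colorado,1.0,1.1,1.2,2.00%,0.2
"""
emissions = pd.read_csv(io.StringIO(csv_text))

keep_states = ["California", "Colorado"]               # stand-in for the `states` list
keep_years = [str(y) for y in range(2011, 2013)]       # column labels are strings

# Filter rows by state and columns by name in one step, then renumber the rows
subset = (
    emissions.loc[emissions["State"].isin(keep_states), ["State"] + keep_years]
    .reset_index(drop=True)
)
print(subset)
```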

### Annual electricity energy-related carbon dioxide emissions by state (in million metric tons)

In [3]:
## Annual electricity energy-related carbon dioxide emissions by state (in million metric tons)

# Reading in the dataset as a csv (modified from original to exclude empty rows and column names)
electric_energy_CO2_by_state = pd.read_csv("datawrangling/annual_electricity_related_CO2_emissions.csv")

# Filtering by the states in the EV dataset
electric_energy_CO2_by_state = electric_energy_CO2_by_state.loc[(electric_energy_CO2_by_state["State"].isin(states))]

# Filtering from the years 2011 to 2018
str_years = [str(i) for i in years]
electric_energy_CO2_by_state = electric_energy_CO2_by_state[["State"] + str_years]

# Resetting the indices of the dataframe
electric_energy_CO2_by_state.reset_index(drop=True, inplace=True)
electric_energy_CO2_by_state

| | State | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | California | 36.4 | 48.0 | 45.7 | 46.3 | 44.3 | 36.6 | 33.0 | 33.7 |
| 1 | Colorado | 39.3 | 39.5 | 39.0 | 38.1 | 37.0 | 35.7 | 35.2 | 34.2 |
| 2 | Connecticut | 6.6 | 7.2 | 6.8 | 6.7 | 7.4 | 7.0 | 6.3 | 8.1 |
| 3 | Florida | 111.0 | 107.5 | 105.2 | 109.8 | 108.1 | 106.4 | 103.4 | 100.9 |
| 4 | Michigan | 65.6 | 63.5 | 62.7 | 60.2 | 63.0 | 55.6 | 55.7 | 58.9 |
| 5 | Minnesota | 29.3 | 25.7 | 26.0 | 29.4 | 27.2 | 26.7 | 25.3 | 26.7 |
| 6 | Montana | 16.7 | 15.7 | 16.6 | 17.3 | 17.8 | 16.1 | 15.5 | 15.1 |
| 7 | New Jersey | 15.7 | 14.9 | 14.4 | 16.8 | 17.9 | 19.7 | 16.7 | 17.4 |
| 8 | New York | 33.9 | 32.3 | 30.1 | 30.6 | 29.2 | 27.8 | 22.1 | 24.5 |
| 9 | Oregon | 6.4 | 7.0 | 9.1 | 8.0 | 8.6 | 7.8 | 7.5 | 8.4 |


This data is very clean, as information for all listed states is provided. The only small issue is that Vermont produces far less than one million metric tons of CO2 per year from electricity generation, so the Vermont data may prove difficult to work with. The data was initially provided through [this site](https://www.eia.gov/environment/emissions/state/excel/electricity.xlsx), which had column names that were impractical to use. By removing the unnecessary column names, as well as a row that contributed no data, we generated a data table that could be easily used with Pandas. With Pandas, we limited the years to 2011 through 2018 and the states to those that have EV registration data, producing the dataframe seen above.
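One way to spot states like Vermont before analysis is to flag rows whose values never reach some minimum. The sketch below does this on made-up numbers. The state labels, the values, and the cutoff of 1.0 million metric tons are all assumptions for illustration, not figures from the dataset.

```python
import pandas as pd

# Toy table in the same wide layout as above (made-up values)
toy = pd.DataFrame({
    "State": ["State A", "State B"],
    "2011": [0.0, 16.0],
    "2012": [0.1, 15.0],
})

year_cols = [c for c in toy.columns if c != "State"]
threshold = 1.0  # hypothetical cutoff, in million metric tons

# States whose largest yearly value is still below the cutoff
low_emitters = toy.loc[toy[year_cols].max(axis=1) < threshold, "State"].tolist()
print(low_emitters)
```

Whether such states should be excluded, or handled with finer-grained data, is a choice this sketch does not make.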

### Annual petroleum energy-related carbon dioxide emissions by state (in million metric tons)

In [5]:
## Annual petroleum energy-related carbon dioxide emissions by state

# Reading in the dataset as an Excel file
transportation_CO2_by_state = pd.read_excel("datawrangling/annual_transportation_related_CO2_emissions.xlsx")

# Trim the title and footer rows and the last two columns, keep only the state column
# and the last 8 year columns (2011 to 2018), then create column names for the table
transportation_CO2_by_state = transportation_CO2_by_state.iloc[1:-4,:-2]
transportation_CO2_by_state.drop(columns = transportation_CO2_by_state.columns[1:-8], inplace = True)
str_years = [str(i) for i in years]
transportation_CO2_by_state.columns = ["State"] + str_years

# Filtering by the states in the EV dataset
transportation_CO2_by_state = transportation_CO2_by_state.loc[transportation_CO2_by_state["State"].isin(states)]

# Resetting the indices of the dataframe
transportation_CO2_by_state.reset_index(drop=True, inplace = True)
transportation_CO2_by_state

| | State | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | California | 194.0 | 190.3 | 189.8 | 191.0 | 195.9 | 203.8 | 209.5 | 210.6 |
| 1 | Colorado | 28.6 | 28.4 | 28.4 | 29.3 | 28.8 | 29.4 | 29.8 | 30.7 |
| 2 | Connecticut | 15.7 | 15.3 | 15.1 | 15.1 | 15.2 | 15.3 | 15.5 | 15.9 |
| 3 | Florida | 105.7 | 103.7 | 105.2 | 106.9 | 110.7 | 113.3 | 115.6 | 121.7 |
| 4 | Michigan | 49.7 | 48.8 | 50.9 | 50.4 | 52.1 | 52.8 | 51.9 | 53.5 |
| 5 | Minnesota | 30.8 | 31.3 | 31.6 | 31.7 | 31.7 | 33.3 | 33.5 | 33.5 |
| 6 | Montana | 8.1 | 7.9 | 8.0 | 7.8 | 7.6 | 7.9 | 8.1 | 8.1 |
| 7 | New Jersey | 54.0 | 51.5 | 51.7 | 50.6 | 50.6 | 52.8 | 50.3 | 52.4 |
| 8 | New York | 77.3 | 75.6 | 76.4 | 80.9 | 79.1 | 82.3 | 82.9 | 83.3 |
| 9 | Oregon | 20.9 | 20.6 | 20.8 | 20.9 | 20.4 | 20.5 | 21.0 | 21.7 |


The original data is an Excel file with a decorative title block above the table, so Pandas cannot find the column names on its own. We therefore create the column names ourselves before filtering the data.
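An alternative to renaming columns by hand is to tell the reader where the header row is. The sketch below shows the idea with `pd.read_csv` and `skiprows` on a made-up in-memory file so that it runs without the spreadsheet. `pd.read_excel` accepts `skiprows` and `header` as well, but the number of rows to skip in the real file is an assumption that would need to be checked against the spreadsheet.

```python
import io
import pandas as pd

# Toy file with two title lines above the real header (made-up content)
raw = """Table title (made-up),,
million metric tons,,
State,2011,2012
California,1.0,2.0
Colorado,3.0,4.0
"""

# Skip the two title lines so that the third line is used as the header
table = pd.read_csv(io.StringIO(raw), skiprows=2)
print(table)
```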