# INFO 2950 Final Project - Phase II

## Research Questions: 


## Data Descriptions:

We collected five data tables for this phase: one for the variable we aim to predict and four for the model's input variables.

1. **Vehicle Registration Counts by State**
   * <u>Data Source</u>: US Department of Energy - Alternative Fuels Data Center (AFDC)
   * <u>URL</u>: https://afdc.energy.gov/vehicle-registration?year=2023
   * <u>Description</u>: This page provides approximate light-duty vehicle registration counts derived by the National Renewable Energy Laboratory with data from Experian Information Solutions. Counts are rounded to the closest 100 vehicles and reflect the total number of light-duty registered vehicles through the selected year. Fuel types are based on vehicle identification numbers (VINs), which do not reflect aftermarket conversions to use different fuels or power sources.

2. **Renewable and Total Energy Production by State**
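    * <u>Data Source</u>: US Energy Information Administration (EIA) – State Energy Data System (SEDS)
    * <u>URL</u>: https://www.eia.gov/renewable/data.php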
    * <u>Description</u>: 


3. **EV Pricing**
    * <u>Data Source</u>: 
    * <u>URL</u>: 
    * <u>Description</u>: 


4. **EV Charging Stations**
    * <u>Data Source</u>: 
    * <u>URL</u>: 
    * <u>Description</u>: 


5. **EV Incentives**
    * <u>Data Source</u>: 
    * <u>URL</u>: 
    * <u>Description</u>: 


## Importing:

In [1]:
%pip install pdfplumber

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import re
from io import BytesIO

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pdfplumber
import requests
import seaborn as sns
from bs4 import BeautifulSoup

## Data Scraping:

1. **Vehicle Registration Counts by State**
   * <u>Data Source</u>: US Department of Energy - Alternative Fuels Data Center (AFDC)
   * <u>URL</u>: https://afdc.energy.gov/vehicle-registration?year=2023

In [6]:
afdc_url = "https://afdc.energy.gov/vehicle-registration?year={}"
years = range(2016, 2024)

compiled_data = []
last_table = None  # most recent successfully parsed table, used for the column names

for year in years:
    url = afdc_url.format(year)
    afdc_result = requests.get(url)

    # Skip this year if the request failed
    if afdc_result.status_code != 200:
        print(f"Failed to retrieve data for {year}: {afdc_result.status_code} - {afdc_result.reason}")
        continue

    print(f"Scraping data for {year}...")

    page = BeautifulSoup(afdc_result.text, 'html.parser')
    table = page.find('table')

    if table is None:
        print(f"No table found for {year}.")
        continue

    last_table = table
    rows = table.find_all('tr')
    print(f"Found {len(rows)} rows in the table for {year}.")

    # Skip the first two rows (assumed to be header rows)
    for row in rows[2:]:
        cols = [col.text.strip() for col in row.find_all('td')]
        compiled_data.append([year] + cols)

if last_table is None:
    raise RuntimeError("No registration table was retrieved for any year.")

# Column names come from the `headers` attribute of the cells in the first body row
header_row = last_table.find('tbody').find_all('tr')[0]
headers = [td['headers'] for td in header_row.find_all('td')]

clean_headers = []
for header in headers:
    name = header[0].strip()
    # Keep all-caps names such as PHEV; capitalize the rest
    clean_headers.append(name if name.isupper() else name.capitalize())

print(f"Headers found: {clean_headers}")

compiled_df = pd.DataFrame(compiled_data, columns=["Year"] + clean_headers)

print(compiled_df.head(n=5))

Scraping data for 2016...
Found 54 rows in the table for 2016.
Scraping data for 2017...
Found 54 rows in the table for 2017.
Scraping data for 2018...
Found 54 rows in the table for 2018.
Scraping data for 2019...
Found 54 rows in the table for 2019.
Scraping data for 2020...
Found 54 rows in the table for 2020.
Scraping data for 2021...
Found 54 rows in the table for 2021.
Scraping data for 2022...
Found 54 rows in the table for 2022.
Scraping data for 2023...
Found 54 rows in the table for 2023.
Headers found: ['State', 'Electric', 'PHEV', 'HEV', 'Biodiesel', 'Flex', 'CNG', 'Propane', 'Hydrogen', 'Methanol', 'Gas', 'Diesel', 'Unknown']
   Year       State Electric     PHEV      HEV Biodiesel       Flex     CNG  \
0  2016     Alabama      500      900   29,100         0    428,300  20,100   
1  2016      Alaska      200      200    5,000         0     55,700   4,900   
2  2016     Arizona    4,700    4,400   89,600         0    427,300  17,500   
3  2016    Arkansas      200      500
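
The scraped counts are still strings with thousands separators (for example `29,100`), so they cannot be summed or plotted yet. The sketch below shows one possible way to convert them to integers. The `sample` frame and the `counts_to_int` helper are hypothetical and only mimic the layout of `compiled_df`; they are not part of the scraping code above.

```python
import pandas as pd

# Hypothetical sample that mimics the layout of the scraped table
sample = pd.DataFrame({
    "Year": [2016, 2016],
    "State": ["Alabama", "Alaska"],
    "Electric": ["500", "200"],
    "HEV": ["29,100", "5,000"],
})

def counts_to_int(df, id_cols=("Year", "State")):
    """Return a copy of df with every non-identifier column converted to nullable integers."""
    out = df.copy()
    for col in out.columns:
        if col in id_cols:
            continue
        # Drop the thousands separators, then parse; unparseable cells become <NA>
        stripped = out[col].str.replace(",", "", regex=False)
        out[col] = pd.to_numeric(stripped, errors="coerce").astype("Int64")
    return out

clean = counts_to_int(sample)
print(clean.dtypes)
```

The nullable `Int64` dtype is used here so that any cell that fails to parse is kept as a missing value rather than forcing the whole column to floats.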

2. **Renewable and Total Energy Production by State**
    * <u>Data Source</u>: US Energy Information Administration (EIA) – State Energy Data System (SEDS)
    * <u>Website URL</u>: https://www.eia.gov/renewable/data.php
    * <u>PDF URL</u>: https://www.eia.gov/state/seds/sep_prod/SEDS_Production_Report.pdf

In [4]:
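def extract_pdf(pdf_url, start_page, end_page):
    """Return one 59 x 6 array per PDF page in [start_page, end_page), using 0-based page indices."""
    response = requests.get(pdf_url)

    if response.status_code != 200:
        raise RuntimeError(f"Failed to download PDF: {response.status_code} - {response.reason}")

    all_tables = []

    with pdfplumber.open(BytesIO(response.content)) as pdf:
        for i in range(start_page, end_page):
            text = pdf.pages[i].extract_text() or ""

            # Match 'NA', '(s)', and numbers with optional thousands separators or decimals.
            # Runs of four or more digits without commas (such as years) are skipped.
            entries = re.findall(r'NA|\(s\)|\b\d{1,3}(?:,\d{3})*(?:\.\d+)?\b', text)

            # Convert numeric strings to floats; keep the 'NA' and '(s)' markers as text
            cleaned_entries = [float(num.replace(',', '')) if num not in ['NA', '(s)'] else num for num in entries]

            # Keep the first 354 entries (59 rows x 6 columns) and drop any trailing matches
            one_page = cleaned_entries[:354]

            # dtype=object stops NumPy from casting the floats back to strings
            reshaped_data = np.array(one_page, dtype=object).reshape(59, 6)

            all_tables.append(reshaped_data)

    return all_tables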

pdf_url = 'https://www.eia.gov/state/seds/sep_prod/SEDS_Production_Report.pdf'

data = extract_pdf(pdf_url, 17, 119)
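
To make the extraction pattern easier to reason about, the short check below runs the same regular expression on a made-up line of text. The sample strings are hypothetical and not taken from the EIA report. It illustrates two properties of the pattern: four-digit years without a comma are skipped, and the `NA` alternative has no word boundary, so it also matches inside uppercase words that contain "NA".

```python
import re

# Same pattern as in extract_pdf
PATTERN = r'NA|\(s\)|\b\d{1,3}(?:,\d{3})*(?:\.\d+)?\b'

# Hypothetical row: a year followed by table values
line = "2019 13,011 57 (s) NA 0.5"
matches = re.findall(PATTERN, line)
print(matches)  # ['13,011', '57', '(s)', 'NA', '0.5']

# 'NA' is matched anywhere, including inside an uppercase word
inside_word = re.findall(PATTERN, "INDIANA")
print(inside_word)  # ['NA']
```

If the page text contains uppercase words of that kind, the extra `NA` matches would shift the values in the reshaped table, so it may be worth spot-checking a few pages against the PDF.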

In [5]:
states = ["Alabama", "Alaska", "Arizona", "Arkansas", 
          "California", "Colorado", "Connecticut", 
          "Delaware", "District of Columbia", "Florida", 
          "Georgia", "Hawaii", "Idaho", "Illinois", 
          "Indiana", "Iowa", "Kansas", "Kentucky", 
          "Louisiana", "Maine", "Maryland", "Massachusetts", 
          "Michigan", "Minnesota", "Mississippi", "Missouri", 
          "Montana", "Nebraska", "Nevada", "New Hampshire", 
          "New Jersey", "New Mexico", "New York", "North Carolina", 
          "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", 
          "Rhode Island", "South Carolina", "South Dakota", "Tennessee",
          "Texas", "Utah", "Vermont", "Virginia", "Washington", 
          "West Virginia", "Wisconsin", "Wyoming"]

physical_units = []
thermal_units = []

for j, page_table in enumerate(data):
    df = pd.DataFrame(page_table)

    # Consecutive page pairs belong to the same state:
    # even positions hold physical units, odd positions hold thermal units
    df.insert(0, 'State', states[j // 2])

    if j % 2 == 0:
        physical_units.append(df)
    else:
        thermal_units.append(df)


physical_units_df = pd.concat(physical_units, ignore_index=True)

physical_units_df.insert(1, 'Units', "Physical")
physical_units_df.rename(
    columns={
        0: 'Coal (K short tons)',
        1: 'Natural Gas (M cubic ft)',
        2: 'Crude Oil (K barrels)',
        3: 'Fuel Ethanol (K barrels)',
        4: 'Biodiesel (K barrels)',
        5: 'Renewable Diesel (K barrels)'
    }, inplace=True)


thermal_units_df = pd.concat(thermal_units, ignore_index=True)

thermal_units_df.insert(1, 'Units', "Thermal")
thermal_units_df.rename(
    columns={
        0: 'Coal (T Btu)',
        1: 'Natural Gas (T Btu)',
        2: 'Crude Oil (T Btu)',
        3: 'Fuel Ethanol (T Btu)',
        4: 'Biodiesel (T Btu)',
        5: 'Renewable Diesel (T Btu)'
    }, inplace=True)

print(physical_units_df.head(n=5))
print(thermal_units_df.head(n=5))

     State     Units Coal (K short tons) Natural Gas (M cubic ft)  \
0  Alabama  Physical             13011.0                     57.0   
1  Alabama  Physical             14832.0                    203.0   
2  Alabama  Physical             14219.0                    252.0   
3  Alabama  Physical             15486.0                    248.0   
4  Alabama  Physical             16440.0                    230.0   

  Crude Oil (K barrels) Fuel Ethanol (K barrels) Biodiesel (K barrels)  \
0                7329.0                       NA                    NA   
1                8064.0                       NA                    NA   
2                8030.0                       NA                    NA   
3                7348.0                       NA                    NA   
4                7635.0                       NA                    NA   

  Renewable Diesel (K barrels)  
0                           NA  
1                           NA  
2                           NA  
3       
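
The energy columns print as text because numeric values are mixed with the `NA` and `(s)` markers. The sketch below shows one possible way to turn them into numeric columns. It assumes both markers can be treated as missing values, which should be checked against the report's footnotes before any analysis. The `sample` frame and the `to_numeric_columns` helper are hypothetical and only mimic the layout of `physical_units_df`.

```python
import pandas as pd

# Hypothetical sample that mimics the layout of the extracted table
sample = pd.DataFrame({
    "State": ["Alabama", "Alabama"],
    "Units": ["Physical", "Physical"],
    "Coal (K short tons)": ["13011.0", "(s)"],
    "Fuel Ethanol (K barrels)": ["NA", "NA"],
})

def to_numeric_columns(df, id_cols=("State", "Units")):
    """Return a copy of df with non-identifier columns parsed as floats; markers become NaN."""
    out = df.copy()
    for col in out.columns:
        if col not in id_cols:
            # 'NA' and '(s)' cannot be parsed, so errors="coerce" turns them into NaN
            out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

# Count the markers per column before they are replaced, so the information is not lost
marker_counts = sample.isin(["NA", "(s)"]).sum()

numeric = to_numeric_columns(sample)
print(marker_counts)
print(numeric.dtypes)
```

Counting the markers first keeps a record of how much of each column is missing, which may be useful for the Data Limitations section.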

## Data Cleaning:

## Data Limitations:

## Exploratory Data Analysis: 

## Questions For Reviewers: 