<a href="https://colab.research.google.com/github/manoloc0/cs4774-final-project/blob/main/CS4774_Final_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Big Picture

*   **What’s the benefit (use case) for this model?** The main motivation for analyzing EV charger distribution is to give proponents of electric vehicle adoption and infrastructure growth reliable, model-backed data to guide the strategic allocation of financial support. By training a machine learning model on a diverse dataset covering EV driving ranges, EV density, and traffic, we hope to provide a useful tool for substantiating decisions about where to place new EV chargers. The problem is that many jurisdictions in Virginia don’t have enough charging stations to support long-distance trips. We hypothesize that there are optimal locations for new charging stations in less populated areas, and that these locations should be prioritized during charging infrastructure planning.

*   **What performance measure?**  
*   **How much data to evaluate?** Data is available on [Drive Electric VA EV Dashboard](https://driveelectricva.org/why-drive-electric/ev-dashboard/#/analyze?region=US-VA&show_map=true&country=US&access=public&access=private&fuel=ELEC&lpg_secondary=true&hy_nonretail=true&ev_levels=all).
   *   **Station Location Data**: [Download](https://developer.nrel.gov/api/alt-fuel-stations/v1.csv?access=public%2Cprivate&api_key=foEpOo7RpC4gPM41vxhvNB8IQLzek39WVbwjlX5p&cards_accepted=all&cng_fill_type=all&cng_has_rng=all&cng_psi=all&country=US&download=true&e85_has_blender_pump=false&ev_charging_level=all&ev_connector_type=all&ev_network=all&fuel_type=ELEC&funding_sources=all&hy_is_retail=true&limit=all&lng_has_rng=all&lpg_include_secondary=false&maximum_vehicle_class=all&offset=0&owner_type=all&state=US-VA&status=E&utf8_bom=true). **Description:** Electric charging stations in Virginia, with ZIP code, longitude, and latitude data.
   *   **EV Charging Port Data**: [Download](https://developer.nrel.gov/api/alt-fuel-stations/v1/ev-charging-units.csv?access=public%2Cprivate&api_key=foEpOo7RpC4gPM41vxhvNB8IQLzek39WVbwjlX5p&cards_accepted=all&cng_fill_type=all&cng_has_rng=all&cng_psi=all&country=US&download=true&e85_has_blender_pump=false&ev_charging_level=all&ev_connector_type=all&ev_network=all&fuel_type=ELEC&funding_sources=all&hy_is_retail=true&limit=all&lng_has_rng=all&lpg_include_secondary=false&maximum_vehicle_class=all&offset=0&owner_type=all&state=US-VA&status=E&utf8_bom=true). **Description:** Electric charging stations in Virginia, with ZIP code, longitude, and latitude data. Includes power output by port; each location may have multiple ports and port types.
   *   **Electric and Hybrid Vehicle Registration by Jurisdiction Data**: [Download](https://driveelectricva.org/wp-content/uploads/2023/07/Virginia-Electric-Vehicle-Hybrid-Electric-Vehicle-Registrations-by-Jurisdiction-2008-2022.xlsx). **Description:** Electric vehicle ownership data from 2008 to 2022 by jurisdiction (i.e., county or city). Shows total registered EVs, total registered vehicles, growth rates, and projections.
*   **What learning algorithm?** We plan to run DBSCAN (Density-Based Spatial Clustering of Applications with Noise) experiments to find clusters or locations with high need or demand for EV chargers. Specifically, we plan to investigate core neighborhoods as candidates for additional chargers to accommodate growing demand and the projected growth in electric vehicle adoption. Isolated clusters and outliers will also be investigated as potential charging "deserts". By comparing the distance to the nearest charging station against typical travel distances and the lowest EV mileage range, we can assess whether an additional charging station is needed to keep an EV from depleting its charge before reaching its destination.
*   **How much effort to be spent?**
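
The DBSCAN plan above can be sketched on a handful of made-up coordinates. This is only an illustration under stated assumptions: the points, the 5 km radius (`eps_km`), and `min_samples=3` are hypothetical and not tuned on the Virginia data. Points labeled `-1` are DBSCAN's noise points, which is how a candidate charging "desert" would show up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, used to convert km to radians

# Hypothetical (latitude, longitude) pairs in degrees, NOT real station data:
# two tight groups plus one remote point.
coords_deg = np.array([
    [38.80, -77.05], [38.81, -77.06], [38.82, -77.04],  # group 1
    [37.54, -77.44], [37.55, -77.43], [37.53, -77.45],  # group 2
    [36.91, -80.30],                                    # isolated point
])

eps_km = 5.0  # assumed neighborhood radius; would need tuning on real data

db = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,  # the haversine metric works in radians
    min_samples=3,                 # assumed; counts the point itself
    metric="haversine",
    algorithm="ball_tree",
)

# The haversine metric expects [latitude, longitude] in radians.
labels = db.fit_predict(np.radians(coords_deg))

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("cluster labels:", labels.tolist())
print("clusters found:", n_clusters, "| noise points:", int((labels == -1).sum()))
```

On the real data, the coordinate array would come from the latitude and longitude columns of the station file, and `eps_km` could be chosen relative to the lowest EV driving range mentioned above.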


In [None]:
# Scikit-Learn ≥0.20 is required
import sklearn # general ml package

# Common imports
import numpy as np # fundamental package for scientific computing
import os # for file-system paths and directory creation
import pandas as pd # DataFrames for loading and inspecting the datasets

# to make this notebook's output stable across runs
# any number will do, as long as it is used consistently
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# Step 2: Get Data
*   Load data, get basic statistics, create train & test sets, plot histogram
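
As a rough sketch of the "create train & test sets, plot histogram" item, the snippet below splits a small synthetic frame 80/20 and computes histogram bin counts. The frame `toy` is a made-up stand-in for `ev_charging` (only the column names are borrowed from it), and the 20% test size is an assumption. Because DBSCAN is unsupervised, a hold-out set may turn out to be optional for the clustering experiments.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ev_charging; values are random, not real data.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "ZIP": rng.integers(20105, 24660, size=100),
    "EV J1772 Power Output (kW)": rng.choice([6.5, 7.4, 21.6], size=100),
})

# Hold out 20% of the rows; random_state keeps the split reproducible.
train_set, test_set = train_test_split(toy, test_size=0.2, random_state=42)
print(len(train_set), "train rows,", len(test_set), "test rows")

# Histogram counts for one numeric column (5 equal-width bins).
counts, bin_edges = np.histogram(toy["EV J1772 Power Output (kW)"], bins=5)
print("bin edges:", np.round(bin_edges, 2))
print("counts:   ", counts)
```

In the notebook itself, `ev_charging.hist(bins=50, figsize=(20, 15))` followed by `plt.show()` would draw the histograms for every numeric column at once.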




In [None]:
import urllib.request  # standard-library downloader (replaces the Python 2 era six.moves shim)

# URLs for datasets
EV_CHARGING_UNITS_URL = "https://developer.nrel.gov/api/alt-fuel-stations/v1/ev-charging-units.csv?access=public%2Cprivate&api_key=foEpOo7RpC4gPM41vxhvNB8IQLzek39WVbwjlX5p&cards_accepted=all&cng_fill_type=all&cng_has_rng=all&cng_psi=all&country=US&download=true&e85_has_blender_pump=false&ev_charging_level=all&ev_connector_type=all&ev_network=all&fuel_type=ELEC&funding_sources=all&hy_is_retail=true&limit=all&lng_has_rng=all&lpg_include_secondary=false&maximum_vehicle_class=all&offset=0&owner_type=all&state=US-VA&status=E&utf8_bom=true"
EV_REGISTRATIONS_URL = "https://driveelectricva.org/wp-content/uploads/2023/07/Virginia-Electric-Vehicle-Hybrid-Electric-Vehicle-Registrations-by-Jurisdiction-2008-2022.xlsx"

# Paths to save the files
EV_CHARGING_PATH = os.path.join("datasets", "ev_charging_units")
EV_REGISTRATIONS_PATH = os.path.join("datasets", "ev_registrations")

def fetch_data():
    """Fetch both EV charging and EV registration data to the local file system."""

    # Download EV Charging Data
    os.makedirs(EV_CHARGING_PATH, exist_ok=True)  # create the folder if it does not exist yet
    csv_path = os.path.join(EV_CHARGING_PATH, "ev_charging_units.csv")
    urllib.request.urlretrieve(EV_CHARGING_UNITS_URL, csv_path)
    print("EV Charging data downloaded and saved to:", csv_path)

    # Download EV Registrations Data
    os.makedirs(EV_REGISTRATIONS_PATH, exist_ok=True)  # create the folder if it does not exist yet
    xlsx_path = os.path.join(EV_REGISTRATIONS_PATH, "Virginia_EV_Registrations.xlsx")
    urllib.request.urlretrieve(EV_REGISTRATIONS_URL, xlsx_path)
    print("EV Registrations data downloaded and saved to:", xlsx_path)

fetch_data()


EV Charging data downloaded and saved to: datasets/ev_charging_units/ev_charging_units.csv
EV Registrations data downloaded and saved to: datasets/ev_registrations/Virginia_EV_Registrations.xlsx


In [None]:
EV_CHARGING_PATH = os.path.join("datasets", "ev_charging_units")

def load_ev_charging_data(ev_charging_path=EV_CHARGING_PATH):
    """Load EV Charging Stations Data into Workspace from a CSV file."""
    csv_path = os.path.join(ev_charging_path, "ev_charging_units.csv")
    return pd.read_csv(csv_path)

# Load the EV charging station data
ev_charging = load_ev_charging_data()

# Display the first 20 rows of the data
ev_charging.head(20)


Unnamed: 0,Fuel Type Code,Station Name,Street Address,Intersection Directions,City,State,ZIP,Plus4,Station Phone,Status Code,...,EV Workplace Charging,Funding Sources,EV J1772 Connector Count,EV J1772 Power Output (kW),EV CCS Connector Count,EV CCS Power Output (kW),EV CHAdeMO Connector Count,EV CHAdeMO Power Output (kW),EV J3400 Connector Count,EV J3400 Power Output (kW)
0,ELEC,Hotel Floyd,120 Wilson St,,Floyd,VA,24091,,540-745-6080,E,...,False,,1,,0,,0,,0,
1,ELEC,Passport Nissan - Alexandria,150 S Pickett St,,Alexandria,VA,22304,,703-823-9000,E,...,False,,1,,0,,0,,0,
2,ELEC,Passport Nissan - Alexandria,150 S Pickett St,,Alexandria,VA,22304,,703-823-9000,E,...,False,,1,,0,,0,,0,
3,ELEC,Passport Nissan - Alexandria,150 S Pickett St,,Alexandria,VA,22304,,703-823-9000,E,...,False,,1,,0,,0,,0,
4,ELEC,Passport Nissan - Alexandria,150 S Pickett St,,Alexandria,VA,22304,,703-823-9000,E,...,False,,0,,1,50.0,1,50.0,0,
5,ELEC,Priority Nissan - Chantilly,14840 Stonecroft Center Ct,,Chantilly,VA,20151,,703-889-3700,E,...,False,,1,,0,,0,,0,
6,ELEC,Priority Nissan - Chantilly,14840 Stonecroft Center Ct,,Chantilly,VA,20151,,703-889-3700,E,...,False,,0,,0,,1,44.0,0,
7,ELEC,Priority Nissan - Chantilly,14840 Stonecroft Center Ct,,Chantilly,VA,20151,,703-889-3700,E,...,False,,1,,0,,0,,0,
8,ELEC,Colonial Nissan,200 Myers Dr,,Charlottesville,VA,22901,,434-978-3711,E,...,False,,1,,0,,0,,0,
9,ELEC,Colonial Nissan,200 Myers Dr,,Charlottesville,VA,22901,,434-978-3711,E,...,False,,1,,0,,0,,0,


In [None]:
EV_REGISTRATIONS_PATH = os.path.join("datasets", "ev_registrations")

def load_ev_registrations_data(ev_registrations_path=EV_REGISTRATIONS_PATH):
    """Load EV Registrations Data into Workspace from an Excel file starting at cell B3."""
    xlsx_path = os.path.join(ev_registrations_path, "Virginia_EV_Registrations.xlsx")
    return pd.read_excel(xlsx_path, skiprows=2, usecols="B:R")  # Columns B:R hold FIPS, Jurisdiction, and 2008-2022; adjust if the sheet layout changes

# Load the EV registrations data
ev_registrations = load_ev_registrations_data()

# Display the first 20 rows of the data
ev_registrations.head(20)


Unnamed: 0,FIPS,Jurisdiction,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,51001,ACCOMACK,6,6,8,14,14,25,43,73.0,79.0,92.0,8.0,10.0,85.0,145.0,179
1,51003,ALBEMARLE,5,9,10,10,15,26,42,48.0,85.0,101.0,183.0,298.0,435.0,647.0,979
2,51510,ALEXANDRIA,8,7,9,11,9,17,26,51.0,126.0,195.0,319.0,452.0,624.0,920.0,1444
3,51005,ALLEGHANY,-,-,-,-,-,-,-,0.0,0.0,1.0,,1.0,3.0,3.0,9
4,51007,AMELIA,-,-,-,-,-,-,1,1.0,1.0,1.0,1.0,2.0,4.0,8.0,12
5,51009,AMHERST,-,-,-,-,1,-,3,5.0,6.0,8.0,11.0,12.0,22.0,25.0,44
6,51011,APPOMATTOX,-,-,-,-,-,-,-,0.0,0.0,0.0,0.0,3.0,3.0,3.0,3
7,51013,ARLINGTON,9,9,11,9,17,54,87,107.0,251.0,321.0,592.0,873.0,1192.0,1711.0,2479
8,51015,AUGUSTA,2,3,7,7,7,10,13,16.0,16.0,24.0,23.0,33.0,50.0,84.0,146
9,51017,BATH,-,-,1,-,-,1,2,3.0,3.0,7.0,2.0,1.0,2.0,11.0,11


In [None]:
ev_registrations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   FIPS          147 non-null    object 
 1   Jurisdiction  135 non-null    object 
 2   2008          149 non-null    object 
 3   2009          149 non-null    object 
 4   2010          149 non-null    object 
 5   2011          148 non-null    object 
 6   2012          148 non-null    object 
 7   2013          148 non-null    object 
 8   2014          146 non-null    object 
 9   2015          145 non-null    float64
 10  2016          145 non-null    float64
 11  2017          145 non-null    float64
 12  2018          136 non-null    float64
 13  2019          145 non-null    float64
 14  2020          145 non-null    float64
 15  2021          144 non-null    float64
 16  2022          146 non-null    object 
dtypes: float64(7), object(10)
memory usage: 23.6+ KB


In [None]:
ev_charging.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4986 entries, 0 to 4985
Data columns (total 83 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Fuel Type Code                           4986 non-null   object 
 1   Station Name                             4986 non-null   object 
 2   Street Address                           4986 non-null   object 
 3   Intersection Directions                  706 non-null    object 
 4   City                                     4986 non-null   object 
 5   State                                    4986 non-null   object 
 6   ZIP                                      4986 non-null   int64  
 7   Plus4                                    0 non-null      float64
 8   Station Phone                            4673 non-null   object 
 9   Status Code                              4986 non-null   object 
 10  Expected Date                            0 non-n

In [None]:
ev_registrations.describe()

Unnamed: 0,2015,2016,2017,2018,2019,2020,2021
count,145.0,145.0,145.0,136.0,145.0,145.0,144.0
mean,103740.8,100836.5,103807.8,112082.1,105825.2,106249.8,107659.9
std,879450.8,854470.5,879443.0,918747.1,895013.1,897517.9,904463.7
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,1.011998,2.0,3.0,4.0,7.0
50%,3.0,5.0,6.0,8.0,11.0,17.0,29.0
75%,11.0,19.0,24.0,34.25,55.0,80.0,127.0
max,7514484.0,7301081.0,7514484.0,7604646.0,7647692.0,7669209.0,7702236.0


In [None]:
ev_charging.describe()

Unnamed: 0,ZIP,Plus4,Expected Date,BD Blends,NG Fill Type Code,NG PSI,EV Level1 EVSE Num,EV Level2 EVSE Num,EV DC Fast Count,EV Other Info,...,LNG Station Sells Renewable Natural Gas,Funding Sources,EV J1772 Connector Count,EV J1772 Power Output (kW),EV CCS Connector Count,EV CCS Power Output (kW),EV CHAdeMO Connector Count,EV CHAdeMO Power Output (kW),EV J3400 Connector Count,EV J3400 Power Output (kW)
count,4986.0,0.0,0.0,0.0,0.0,0.0,87.0,3726.0,1368.0,0.0,...,0.0,0.0,4986.0,2485.0,4986.0,506.0,4986.0,263.0,4986.0,401.0
mean,22687.580826,,,,,,5.16092,4.51986,7.184211,,...,,,0.678099,7.569296,0.1073,186.758893,0.05355,60.467681,0.231047,249.281796
std,4348.314258,,,,,,6.130127,7.196495,5.217969,,...,,,1.165911,2.648638,0.363752,137.408134,0.22515,19.693692,0.421545,10.622754
min,20105.0,,,,,,1.0,1.0,1.0,,...,,,0.0,3.1,0.0,22.0,0.0,25.0,0.0,62.0
25%,22102.0,,,,,,2.0,2.0,4.0,,...,,,0.0,6.5,0.0,50.0,0.0,50.0,0.0,250.0
50%,22315.0,,,,,,4.0,2.0,8.0,,...,,,1.0,6.5,0.0,150.0,0.0,50.0,0.0,250.0
75%,23320.0,,,,,,5.0,4.0,8.0,,...,,,1.0,7.4,0.0,350.0,0.0,62.5,0.0,250.0
max,99354.0,,,,,,38.0,65.0,24.0,,...,,,57.0,21.6,14.0,480.0,1.0,100.0,1.0,250.0


In [None]:
# TODO (Tyler): create train & test sets, plot histogram

# Step 3: Discover & Visualize Data
*   Plot the data (use opacity, color, and marker size; overlay a map image), look at correlations and isolate the interesting ones, and experiment with feature extraction (e.g., charging ports per registered EV).
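
A minimal sketch of the correlation and feature-extraction ideas for this step, using a tiny made-up frame. The two port-count column names are borrowed from `ev_charging`; `registered_evs`, `total_ports`, and `evs_per_port` are hypothetical names introduced here, and joining registrations to stations by jurisdiction is not shown.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "EV Level2 EVSE Num": [2, 4, 2, 8, 6],
    "EV DC Fast Count": [0, 1, 0, 4, 3],
    "registered_evs": [10, 40, 15, 120, 90],  # hypothetical column
})

# Derived features: total ports per site and EVs served per port.
toy["total_ports"] = toy["EV Level2 EVSE Num"] + toy["EV DC Fast Count"]
toy["evs_per_port"] = toy["registered_evs"] / toy["total_ports"]

# Pearson correlation of every numeric column with the EV count,
# sorted so the strongest positive relationships come first.
corr = toy.corr()
ranked = corr["registered_evs"].sort_values(ascending=False)
print(ranked)
```

For the plotting side, `DataFrame.plot(kind="scatter", x=..., y=..., alpha=0.3)` is one way to apply opacity so that dense regions stand out.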




In [None]:
# TO DO: Visualize Data

# Step 4: Prepare data (data cleaning)

*   Detect and fill missing values (e.g., with an imputer), then transform the training set (i.e., change the raw data)
*   Pipeline for a sequence of transformations (e.g., detect/fill missing values, process categorical inputs, scale features in one pipeline)
*   Process categorical inputs (one-hot encoding)
*   Combine Columns (num-attributes & cat-attributes)
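
The cleaning steps listed above could look roughly like the sketch below. It assumes that the `-` placeholders seen in the registration table mean "no value", and it uses a small made-up frame; the `region` column is hypothetical and exists only to give the one-hot encoder something to encode. Median imputation is one option among several, not a decision the project has made.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows shaped like ev_registrations; "region" is a hypothetical column.
raw = pd.DataFrame({
    "Jurisdiction": ["ACCOMACK", "ALLEGHANY", "AMELIA", "BATH"],
    "2014": [43, "-", 1, 2],
    "2018": [8.0, np.nan, 1.0, 2.0],
    "region": ["east", "west", "central", "west"],
})

year_cols = ["2014", "2018"]
cat_cols = ["region"]

# Turn "-" placeholders (and any other non-numeric text) into NaN.
raw[year_cols] = raw[year_cols].apply(pd.to_numeric, errors="coerce")

# Numeric columns: fill gaps with the median, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Combine numeric and categorical handling in one transformer.
full_pipeline = ColumnTransformer(
    [
        ("num", num_pipeline, year_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

prepared = full_pipeline.fit_transform(raw)
print(prepared.shape)  # 2 scaled year columns + 3 one-hot region columns
```

Keeping every step inside one `ColumnTransformer` means the same fitted transformations can later be applied to a test set with `full_pipeline.transform(...)`.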

In [None]:
# Tyler test

# Step 5: Select model to train


*   Select model (e.g., Linear Regression, DecisionTreeRegressor, RandomForest)
*   Evaluate on some data (pipelined)
*   Calculate errors (e.g., MSE, RMSE)
*   Cross-validation scores (outputs array)
*   Display score data (Scores, mean, standard deviation)
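
As an illustration of the evaluation items in this step, the sketch below cross-validates two candidate regressors on synthetic data and prints the scores, their mean, and their standard deviation. The data, the target, and the two models are assumptions made for the example; the project's actual features and labels are not defined yet.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (assumed for illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

def display_scores(name, scores):
    """Print per-fold RMSE values with their mean and standard deviation."""
    print(f"{name}: scores={np.round(scores, 3)}, "
          f"mean={scores.mean():.3f}, std={scores.std():.3f}")

candidates = [
    ("LinearRegression", LinearRegression()),
    ("RandomForestRegressor", RandomForestRegressor(n_estimators=50, random_state=42)),
]

results = {}
for name, model in candidates:
    # scikit-learn maximizes scores, so MSE is reported as a negative number.
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    results[name] = np.sqrt(-neg_mse)  # one RMSE value per fold
    display_scores(name, results[name])
```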



# Step 6: Fine-Tuning your Model
*   Grid Search: a hyperparameter optimization technique that evaluates every combination of the candidate values (e.g., n_estimators & max_features). It is computationally expensive. Identify the best score (e.g., lowest error) and the associated hyperparameters.
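
A small grid search might look like the sketch below. The synthetic data, the random forest estimator, and the candidate values are all assumptions chosen to keep the example fast; `max_features` is scikit-learn's name for the number of features considered at each split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))
y = 2.0 * X[:, 0] + X[:, 2] + rng.normal(scale=0.1, size=90)

# 2 x 3 = 6 combinations, each evaluated with 3-fold cross-validation.
param_grid = {
    "n_estimators": [10, 30],
    "max_features": [1, 2, 3],
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)

best_rmse = np.sqrt(-grid_search.best_score_)
print("best hyperparameters:", grid_search.best_params_)
print(f"best cross-validated RMSE: {best_rmse:.3f}")
```

The cost grows multiplicatively with each added hyperparameter, which is why the grid here is kept deliberately small.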



# Step 7: Presenting your solution

*   Evaluate model on Test Set (drop label)
*   Prepare test set w/ pipeline
*   Find final error (e.g., MSE, RMSE)
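
The final evaluation could follow the pattern sketched below: split once, fit the pipeline on the training portion only, then drop the label from the test set and score the predictions. The frame, the `label` column, and the linear model are hypothetical placeholders rather than the project's chosen setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a hypothetical "label" column.
rng = np.random.default_rng(7)
data = pd.DataFrame(rng.normal(size=(100, 2)), columns=["f1", "f2"])
data["label"] = 1.5 * data["f1"] - 0.5 * data["f2"] + rng.normal(scale=0.1, size=100)

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

# Preparation and model live in one pipeline, fitted on training data only.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
model.fit(train_set.drop(columns="label"), train_set["label"])

# Drop the label from the test set, predict, and compute the final error.
X_test = test_set.drop(columns="label")
y_test = test_set["label"]
final_predictions = model.predict(X_test)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print(f"final MSE: {final_mse:.4f}, final RMSE: {final_rmse:.4f}")
```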