# 🚗 Electric Vehicle (EV) Population Analysis and Prediction

## 📜 Overview

This project analyzes the "Electric Vehicle Population Size History" dataset to understand trends in EV adoption. The primary goal is to perform data cleaning, exploratory data analysis (EDA), and build a machine learning model to predict the total number of electric vehicles (`Electric Vehicle (EV) Total`) based on various features like date, county, state, and vehicle type. The notebook uses a **Random Forest Regressor** for the prediction task.

***

## 📊 Dataset

The dataset used is `Electric_Vehicle_Population_Size_History_By_County_.csv`. It contains historical data on vehicle registrations, separating electric and non-electric vehicles.

**Key Columns:**
- `Date`: The date of the data record.
- `County`: The county of registration.
- `State`: The state of registration.
- `Vehicle Primary Use`: The primary use of the vehicle (e.g., Passenger, Truck).
- `Battery Electric Vehicles (BEVs)`: Count of fully electric vehicles.
- `Plug-In Hybrid Electric Vehicles (PHEVs)`: Count of plug-in hybrid vehicles.
- `Electric Vehicle (EV) Total`: The target variable for prediction; sum of BEVs and PHEVs.
- `Non-Electric Vehicle Total`: Count of non-electric vehicles.
- `Total Vehicles`: Total registered vehicles.
- `Percent Electric Vehicles`: The percentage of EVs in the total vehicle population.

***

## 🛠️ Project Workflow

The project follows these main steps, as detailed in the Jupyter Notebook:

### 1. Data Loading and Initial Exploration
- The dataset is loaded using **Pandas**.
- Initial exploration uses `.head()`, `.shape`, and `.info()` to understand the data's structure and types.
- The dataset contains **20,819 rows** and **10 columns**.
- `df.info()` reveals that several numerical columns are incorrectly typed as `object`.

### 2. Data Cleaning and Preprocessing
- **Handling Missing Values:** The `County` and `State` columns each have 86 missing values, which are filled with the string `'Unknown'`.
- **Correct Data Types:** The `Date` column is converted from `object` to a proper `datetime` format. Other numerical columns stored as objects are converted to numeric types.
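- **Outlier Treatment:** Outliers in the `Percent Electric Vehicles` column are identified using the Interquartile Range (IQR) method. These outliers are **capped** (winsorized) by replacing any value above `Q3 + 1.5*IQR` or below `Q1 - 1.5*IQR` with the boundary value itself. For this dataset the bounds are about -3.52 and 6.90, and 2,476 rows fall outside them. Because a percentage cannot be negative, only the upper cap has any effect in practice. Capping mitigates the effect of extreme values without removing data.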
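
The type-conversion step described above might look like the following sketch. It runs on a small toy DataFrame rather than the real CSV, and it assumes the count columns are read as `object` because some values carry thousands separators (for example `"1,234"`); that cause is an assumption, not something stated in the notebook output.

```python
import pandas as pd

# Toy frame standing in for the real CSV. The comma-formatted strings and the
# "n/a" entry are assumptions made purely for illustration.
df = pd.DataFrame({
    "Electric Vehicle (EV) Total": ["7", "1,234", "0"],
    "Total Vehicles": ["467", "12,345", "n/a"],
})

count_cols = ["Electric Vehicle (EV) Total", "Total Vehicles"]
for col in count_cols:
    # Strip thousands separators, then coerce anything unparseable to NaN.
    cleaned = df[col].astype(str).str.replace(",", "", regex=False)
    df[col] = pd.to_numeric(cleaned, errors="coerce")

print(df.dtypes)
```

Using `errors="coerce"` turns unparseable entries into `NaN` instead of raising, so any rows that fail conversion can be inspected or dropped afterwards.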

### 3. Feature Engineering
- **Date Features:** New features like `Year`, `Month`, and `Day` are extracted from the `Date` column to help the model capture time-based trends.
- **Categorical Encoding:** Categorical features such as `County`, `State`, and `Vehicle Primary Use` are converted into numerical format using `LabelEncoder` so they can be used by the machine learning model.
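
A minimal sketch of these two feature-engineering steps, using a three-row toy DataFrame in place of the real data. The `encoders` dictionary is a hypothetical addition: it keeps each fitted `LabelEncoder` so the same mapping could be reused at prediction time.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows modeled on the dataset's columns (illustrative only).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-09-30", "2022-12-31", "2020-01-31"]),
    "County": ["Riverside", "Prince William", "Dakota"],
    "State": ["CA", "VA", "MN"],
    "Vehicle Primary Use": ["Passenger", "Passenger", "Truck"],
})

# Date features extracted through the .dt accessor.
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day

# One encoder per column; classes are assigned integers in sorted order.
encoders = {}
for col in ["County", "State", "Vehicle Primary Use"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

print(df)
```

The integer codes carry no real ordering. Tree-based models such as Random Forest can usually work with them, but a linear model would likely need one-hot encoding instead.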

### 4. Model Training and Evaluation
- **Splitting Data:** The dataset is split into training and testing sets using `train_test_split`.
- **Model Selection:** A **`RandomForestRegressor`** is chosen for its robustness and ability to handle complex interactions between features.
- **Hyperparameter Tuning:** `RandomizedSearchCV` is used to efficiently search for the best hyperparameters for the Random Forest model, optimizing its performance.
- **Evaluation:** The model's predictions are evaluated against the test set using standard regression metrics:
  - **Mean Absolute Error (MAE)**
  - **Mean Squared Error (MSE)**
  - **R² Score**
- **Model Persistence:** The final trained model is saved to a file using `joblib` for future use or deployment.
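
The training steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the notebook's actual configuration: the data is synthetic, and the search space, `n_iter`, `cv`, split ratio, and the file name `ev_rf_model.pkl` are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for the encoded features and the EV totals.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical search space; the notebook's real grid is not shown here.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,          # number of sampled parameter combinations
    cv=3,              # 3-fold cross-validation on the training set
    scoring="r2",
    random_state=42,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_

# Evaluate on the held-out test set.
y_pred = best_model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  R2={r2:.3f}")

# Persist the tuned model and confirm it reloads.
model_path = os.path.join(tempfile.gettempdir(), "ev_rf_model.pkl")
joblib.dump(best_model, model_path)
reloaded = joblib.load(model_path)
```

`RandomizedSearchCV` samples only `n_iter` combinations rather than the full grid, which is why it is cheaper than an exhaustive search when the space is large.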

***

## 🔧 Technologies Used

- **Python**
- **Pandas** for data manipulation and analysis.
- **NumPy** for numerical operations.
- **Scikit-learn** for machine learning (model building, preprocessing, and evaluation).
- **Matplotlib** & **Seaborn** for data visualization.
- **Joblib** for saving the trained model.

***

## 🚀 How to Run

1.  **Clone the repository:**
    ```bash
    git clone https://github.com/your-username/EV_Adoption_Trend_Analysis.git
    cd EV_Adoption_Trend_Analysis
    ```
2.  **Install the required libraries:**
    ```bash
    pip install -r requirements.txt
    ```
3.  **Launch Jupyter Notebook:**
    ```bash
    jupyter notebook
    ```
4.  Open and run the notebook to see the complete analysis and model training process.

In [1]:
import joblib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [4]:
# Load data
df = pd.read_csv("Electric_Vehicle_Population_Size_History_By_County_.csv")

In [5]:
df.head() # top 5 rows

| | Date | County | State | Vehicle Primary Use | Battery Electric Vehicles (BEVs) | Plug-In Hybrid Electric Vehicles (PHEVs) | Electric Vehicle (EV) Total | Non-Electric Vehicle Total | Total Vehicles | Percent Electric Vehicles |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | September 30 2022 | Riverside | CA | Passenger | 7 | 0 | 7 | 460 | 467 | 1.5 |
| 1 | December 31 2022 | Prince William | VA | Passenger | 1 | 2 | 3 | 188 | 191 | 1.57 |
| 2 | January 31 2020 | Dakota | MN | Passenger | 0 | 1 | 1 | 32 | 33 | 3.03 |
| 3 | June 30 2022 | Ferry | WA | Truck | 0 | 0 | 0 | 3575 | 3575 | 0.0 |
| 4 | July 31 2021 | Douglas | CO | Passenger | 0 | 1 | 1 | 83 | 84 | 1.19 |


In [6]:
# Number of rows and columns
df.shape

(20819, 10)

In [7]:
# Column dtypes, non-null counts, and memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20819 entries, 0 to 20818
Data columns (total 10 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Date                                   20819 non-null  object 
 1   County                                 20733 non-null  object 
 2   State                                  20733 non-null  object 
 3   Vehicle Primary Use                    20819 non-null  object 
 4   Battery Electric Vehicles (BEVs)       20819 non-null  object 
 5   Plug-In Hybrid Electric Vehicles (PHEVs) 20819 non-null  object 
 6   Electric Vehicle (EV) Total            20819 non-null  object 
 7   Non-Electric Vehicle Total             20819 non-null  object 
 8   Total Vehicles                         20819 non-null  object 
 9   Percent Electric Vehicles              20819 non-null  float64
dtypes: float64(1), object(9)
memory usage: 1.6+ MB


In [8]:
df.isnull().sum()

Date                                      0
County                                   86
State                                    86
Vehicle Primary Use                       0
Battery Electric Vehicles (BEVs)          0
Plug-In Hybrid Electric Vehicles (PHEVs)    0
Electric Vehicle (EV) Total               0
Non-Electric Vehicle Total                0
Total Vehicles                            0
Percent Electric Vehicles                 0
dtype: int64

In [9]:
# Compute Q1 and Q3
Q1 = df['Percent Electric Vehicles'].quantile(0.25)
Q3 = df['Percent Electric Vehicles'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print('lower_bound:', lower_bound)
print('upper_bound:', upper_bound)

# Identify outliers
outliers = df[(df['Percent Electric Vehicles'] < lower_bound) | (df['Percent Electric Vehicles'] > upper_bound)]
print("Number of outliers in 'Percent Electric Vehicles':", outliers.shape[0])

lower_bound: -3.5174999999999996
upper_bound: 6.9025
Number of outliers in 'Percent Electric Vehicles': 2476


In [10]:
# Converts the "Date" column to actual datetime objects
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

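# Remove rows where "Date" conversion failed
df = df[df['Date'].notnull()]

# Remove rows where the target (EV Total) is missing; .copy() makes the result
# an independent DataFrame so the fills below do not trigger SettingWithCopyWarning
df = df[df['Electric Vehicle (EV) Total'].notnull()].copy()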

# Fill missing values
df['County'] = df['County'].fillna('Unknown')
df['State'] = df['State'].fillna('Unknown')

# Confirm remaining nulls
print("Missing after fill:")
print(df[['County', 'State']].isnull().sum())

df.head()

Missing after fill:
County    0
State     0
dtype: int64


        Date          County State Vehicle Primary Use  \
0 2022-09-30       Riverside    CA             Passenger   
1 2022-12-31  Prince William    VA             Passenger   
2 2020-01-31          Dakota    MN             Passenger   
3 2022-06-30           Ferry    WA                 Truck   
4 2021-07-31         Douglas    CO             Passenger   

  Battery Electric Vehicles (BEVs) Plug-In Hybrid Electric Vehicles (PHEVs)  \
0                                7                                         0   
1                                1                                         2   
2                                0                                         1   
3                                0                                         0   
4                                0                                         1   

  Electric Vehicle (EV) Total Non-Electric Vehicle Total Total Vehicles  \
0                           7                        460           467   
1           

In [11]:
# Cap the outliers: this keeps every row while limiting the influence of extreme values.
df['Percent Electric Vehicles'] = df['Percent Electric Vehicles'].clip(lower=lower_bound, upper=upper_bound)

# Re-check: no values should remain outside the bounds after capping
outliers = df[(df['Percent Electric Vehicles'] < lower_bound) | (df['Percent Electric Vehicles'] > upper_bound)]
print("Number of outliers in 'Percent Electric Vehicles':", outliers.shape[0])

Number of outliers in 'Percent Electric Vehicles': 0
