# Electric Vehicle Population Data Analysis

This notebook analyzes the Electric Vehicle Population dataset. It covers:
1. Loading the data
2. Initial data exploration (structure, types, summary statistics, missing values)
3. Data cleaning (examples for handling missing values)
4. Visualizations of distributions and trends
5. Optional further analyses (commented out by default)
6. Questions to guide interpretation of the results

**Instructions for Colab:**
1. Upload the `Electric_Vehicle_Population_Data.csv` file to your Colab session (use the file upload feature in the left sidebar).
2. Run the cells sequentially.
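
As an alternative to the sidebar, the upload can be triggered from a cell. The sketch below assumes the notebook runs in Colab, where the `google.colab` package is available. The helper name `upload_in_colab` is hypothetical, and outside Colab it simply returns an empty list.

```python
# Sketch: programmatic upload in Colab; does nothing outside Colab.
try:
    from google.colab import files  # only importable inside Colab
except ImportError:
    files = None


def upload_in_colab():
    """Open Colab's upload dialog and return the uploaded file names."""
    if files is None:
        return []  # not running in Colab, nothing to upload
    uploaded = files.upload()  # dict mapping file name -> file contents (bytes)
    return list(uploaded.keys())
```

Calling `upload_in_colab()` and selecting `Electric_Vehicle_Population_Data.csv` should place the file in the session's working directory, so the relative path used in the loading cell should resolve.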

## 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
# Make sure the path to the CSV file is correct.
# If you uploaded it to the root of your Colab environment, 'Electric_Vehicle_Population_Data.csv' should work.
try:
    df = pd.read_csv('Electric_Vehicle_Population_Data.csv')
except FileNotFoundError:
    print("Error: 'Electric_Vehicle_Population_Data.csv' not found. Please upload the file to your Colab session.")
    df = pd.DataFrame() # Create an empty DataFrame to avoid errors in subsequent cells

# Set plot style
sns.set_style('whitegrid')

## 2. Initial Data Exploration
Get a first look at the data: structure, data types, summary statistics, and missing values.

In [None]:
if not df.empty:
    print("First 5 rows of the dataset:")
    print(df.head())
    print("\nDataset information:")
    df.info()
    print("\nSummary statistics for numerical columns:")
    print(df.describe())
    print("\nMissing values per column:")
    print(df.isnull().sum())
else:
    print("DataFrame is empty. Please check the file loading step.")

## 3. Data Cleaning (Example)
Based on the output of `df.isnull().sum()` from the previous step, you may need to handle missing values.
The cell below contains commented-out examples. Inspect your data and choose an appropriate strategy: dropping rows or columns, filling with the mean, median, or mode, or filling with a constant such as 'Unknown'.
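
To make these strategies concrete, here is a minimal sketch on a small toy frame. The values are made up for illustration and are not taken from the real dataset; the column names mirror the ones used in this notebook, and which strategy suits which column is an assumption you should check against your own data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Make": ["TESLA", "NISSAN", None, "KIA"],
    "Model Year": [2020, 2018, 2021, 2019],
    "Electric Range": [291.0, np.nan, 26.0, np.nan],
    "City": ["Seattle", None, "Tacoma", "Olympia"],
})

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean().mul(100).round(1)

# Strategy 1: drop rows that are missing a key identifier.
cleaned = toy.dropna(subset=["Make"]).copy()

# Strategy 2: fill a numeric column with its median.
range_median = cleaned["Electric Range"].median()
cleaned["Electric Range"] = cleaned["Electric Range"].fillna(range_median)

# Strategy 3: fill a categorical column with a constant.
cleaned["City"] = cleaned["City"].fillna("Unknown")

print(missing_pct)
print(cleaned)
```

Looking at the missing share as a percentage, rather than a raw count, can make it easier to judge whether dropping rows would discard a meaningful part of the data.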

In [None]:
if not df.empty:
    # Example: Drop rows where 'Model Year' or 'Make' is missing, if any.
    # df.dropna(subset=['Model Year', 'Make'], inplace=True)

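    # For numerical columns like 'Electric Range', you might fill with 0 or the mean/median.
    # Assign the result back to the column: calling fillna(..., inplace=True) on a
    # selected column is chained assignment, which warns in recent pandas versions
    # and does not modify df under Copy-on-Write.
    # df['Electric Range'] = df['Electric Range'].fillna(0)  # Example: fill with 0

    # For categorical columns, you might fill with 'Unknown'
    # df['SomeCategoricalColumn'] = df['SomeCategoricalColumn'].fillna('Unknown')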

    print("Missing values after potential cleaning (if any operations were un-commented and run):")
    print(df.isnull().sum())
else:
    print("DataFrame is empty.")

## 4. Visualizations and Analysis

### a. Distribution of Vehicle Makes

In [None]:
if not df.empty and 'Make' in df.columns:
    plt.figure(figsize=(12, 8))
    make_counts = df['Make'].value_counts().nlargest(15) # Top 15 makes
    sns.barplot(x=make_counts.values, y=make_counts.index, palette='viridis', hue=make_counts.index, legend=False)
    plt.title('Top 15 Electric Vehicle Makes')
    plt.xlabel('Number of Vehicles')
    plt.ylabel('Make')
    plt.tight_layout()
    plt.show()
else:
    print("DataFrame is empty or 'Make' column not found.")

### b. Distribution of Model Years

In [None]:
if not df.empty and 'Model Year' in df.columns:
    plt.figure(figsize=(10, 6))
    # 'Model Year' takes whole-number values, so discrete=True draws exactly one
    # bar per year, centered on that year. Passing bins=nunique() instead yields
    # equal-width bins whose edges do not fall on whole years.
    sns.histplot(df['Model Year'].dropna(), discrete=True, shrink=0.8, color='skyblue')
    plt.title('Distribution of Model Years')
    plt.xlabel('Model Year')
    plt.ylabel('Number of Vehicles')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
else:
    print("DataFrame is empty or 'Model Year' column not found.")
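
As a numeric cross-check on the histogram, the per-year counts can also be tabulated directly. This is a minimal sketch on made-up model years, not the real data; it assumes the years are whole numbers.

```python
import pandas as pd

# Toy model years (made up); note that 2020 is absent.
years = pd.Series([2018, 2019, 2019, 2021, 2021, 2021], name="Model Year")

# One count per distinct year, in chronological order.
per_year = years.value_counts().sort_index()

# Reindex over the full span so years with no vehicles show up as 0.
full_span = list(range(int(years.min()), int(years.max()) + 1))
per_year_full = per_year.reindex(full_span, fill_value=0)

print(per_year_full)
```

Reindexing over the full span makes gaps visible as explicit zeros, which a plain `value_counts()` would silently omit.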

### c. Distribution of Electric Vehicle Types

In [None]:
if not df.empty and 'Electric Vehicle Type' in df.columns:
    plt.figure(figsize=(8, 6))
    ev_type_counts = df['Electric Vehicle Type'].value_counts()
    sns.barplot(x=ev_type_counts.index, y=ev_type_counts.values, palette='coolwarm', hue=ev_type_counts.index, legend=False)
    plt.title('Distribution of Electric Vehicle Types')
    plt.xlabel('Electric Vehicle Type')
    plt.ylabel('Number of Vehicles')
    plt.tight_layout()
    plt.show()
else:
    print("DataFrame is empty or 'Electric Vehicle Type' column not found.")

### d. Distribution of Electric Range
This plot focuses on Battery Electric Vehicles (BEVs), because plug-in hybrids (PHEVs) typically have a much shorter all-electric range and would otherwise crowd the low end of the distribution. If the `Electric Vehicle Type` column is missing, or no BEV rows have range data, the plot falls back to all EV types.

In [None]:
if not df.empty and 'Electric Range' in df.columns:
    plt.figure(figsize=(10, 6))
    plot_made = False
    if 'Electric Vehicle Type' in df.columns:
        bevs = df[df['Electric Vehicle Type'] == 'Battery Electric Vehicle (BEV)']
        # Check if BEVs exist and have non-null range data
        if not bevs.empty and bevs['Electric Range'].notna().any():
            sns.histplot(bevs['Electric Range'].dropna(), bins=30, kde=True, color='coral')
            plt.title('Distribution of Electric Range for BEVs')
            plot_made = True
        # Fallback if no BEVs or BEVs have no range data, but general range data exists
        elif df['Electric Range'].notna().any():
            print("No BEVs found or BEVs have no range data. Plotting range for all EV types.")
            sns.histplot(df['Electric Range'].dropna(), bins=30, kde=True, color='coral')
            plt.title('Distribution of Electric Range (All EV Types with available data)')
            plot_made = True
        else:
            print("No electric range data available to plot (even after checking all EV types).")
    # Fallback if 'Electric Vehicle Type' column doesn't exist, but general range data exists
    elif df['Electric Range'].notna().any(): 
        sns.histplot(df['Electric Range'].dropna(), bins=30, kde=True, color='coral')
        plt.title('Distribution of Electric Range (All EV Types with available data)')
        plot_made = True
    else:
        print("No electric range data available to plot.")

    if plot_made:
        plt.xlabel('Electric Range (miles)')
        plt.ylabel('Frequency')
        plt.tight_layout()
        plt.show()
else:
    print("DataFrame is empty or 'Electric Range' column not found.")
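
The reason for separating BEVs from PHEVs can also be checked numerically by summarizing range per vehicle type. The sketch below uses made-up values; the BEV label matches the one used in the cell above, while the PHEV label is an assumption about how the dataset spells it.

```python
import pandas as pd

# Toy data (values are made up for illustration).
toy = pd.DataFrame({
    "Electric Vehicle Type": [
        "Battery Electric Vehicle (BEV)",
        "Battery Electric Vehicle (BEV)",
        "Battery Electric Vehicle (BEV)",
        "Plug-in Hybrid Electric Vehicle (PHEV)",  # assumed label
        "Plug-in Hybrid Electric Vehicle (PHEV)",  # assumed label
    ],
    "Electric Range": [220.0, 291.0, None, 25.0, 38.0],
})

# Per-type summary; 'count' ignores missing range values.
summary = (
    toy.groupby("Electric Vehicle Type")["Electric Range"]
    .agg(["count", "median", "min", "max"])
)

print(summary)
```

If the medians of the two groups differ widely in the real data, a single combined histogram would mix two quite different populations, which supports plotting them separately.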

### e. Clean Alternative Fuel Vehicle (CAFV) Eligibility

In [None]:
if not df.empty and 'Clean Alternative Fuel Vehicle (CAFV) Eligibility' in df.columns:
    plt.figure(figsize=(10, 6))
    cafv_counts = df['Clean Alternative Fuel Vehicle (CAFV) Eligibility'].value_counts()
    sns.barplot(x=cafv_counts.index, y=cafv_counts.values, palette='pastel', hue=cafv_counts.index, legend=False)
    plt.title('CAFV Eligibility Status')
    plt.xlabel('Eligibility')
    plt.ylabel('Number of Vehicles')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print("DataFrame is empty or 'Clean Alternative Fuel Vehicle (CAFV) Eligibility' column not found.")

### f. Top 15 Cities with the Most Electric Vehicles

In [None]:
if not df.empty and 'City' in df.columns:
    plt.figure(figsize=(12, 8))
    city_counts = df['City'].value_counts().nlargest(15) # Top 15 cities
    sns.barplot(x=city_counts.values, y=city_counts.index, palette='mako', hue=city_counts.index, legend=False)
    plt.title('Top 15 Cities with Electric Vehicles')
    plt.xlabel('Number of Vehicles')
    plt.ylabel('City')
    plt.tight_layout()
    plt.show()
else:
    print("DataFrame is empty or 'City' column not found.")

## 5. Further Analysis Ideas (Commented Out)
These are more complex analyses you might explore. They are commented out by default.

### a. Average Electric Range by Make and Model Year

In [None]:
# if not df.empty and 'Make' in df.columns and 'Model Year' in df.columns and 'Electric Range' in df.columns:
#     avg_range_by_make_year = df.groupby(['Make', 'Model Year'])['Electric Range'].mean().reset_index()
#     # This can be complex to visualize directly (e.g., heatmap or faceted plots)
#     # For simplicity, let's look at top makes' average range over years
#     top_makes = df['Make'].value_counts().nlargest(5).index
#     plt.figure(figsize=(14, 8))
#     for make in top_makes:
#         make_data = avg_range_by_make_year[avg_range_by_make_year['Make'] == make]
#         if not make_data.empty:
#             sns.lineplot(x='Model Year', y='Electric Range', data=make_data, label=make, marker='o')
#     plt.title('Average Electric Range by Model Year for Top 5 Makes')
#     plt.xlabel('Model Year')
#     plt.ylabel('Average Electric Range (miles)')
#     plt.legend()
#     plt.tight_layout()
#     plt.show()
# else:
#     print("DataFrame is empty or required columns ('Make', 'Model Year', 'Electric Range') not found.")
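
The heatmap option mentioned in the comments above needs the averages arranged as a grid with one row per make and one column per model year. A minimal sketch of that reshaping step, on made-up values, could look like this:

```python
import pandas as pd

# Toy data (values are made up for illustration).
toy = pd.DataFrame({
    "Make": ["TESLA", "TESLA", "TESLA", "NISSAN", "NISSAN"],
    "Model Year": [2019, 2019, 2020, 2019, 2020],
    "Electric Range": [220.0, 240.0, 291.0, 150.0, 149.0],
})

# Rows: makes, columns: model years, cells: mean electric range.
range_grid = toy.pivot_table(
    index="Make",
    columns="Model Year",
    values="Electric Range",
    aggfunc="mean",
)

print(range_grid)
```

A grid like `range_grid` can then be passed to `sns.heatmap(range_grid, annot=True)`. Make and year combinations with no vehicles come out as NaN in the grid.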

### b. Correlation Heatmap (for numerical features)

In [None]:
# if not df.empty:
#     numerical_cols = df.select_dtypes(include=['number']).columns
#     if len(numerical_cols) > 1:
#         # Drop columns that are identifiers or categorical encoded as numbers if they are not truly numeric for correlation
#         # For example, 'VIN (1-10)', 'DOL Vehicle ID', 'Model Year' might not be suitable for direct correlation in some contexts
#         # Add other columns like 'Postal Code' to this list if they exist and are numeric but not suitable for correlation
#         cols_to_exclude_from_corr = ['VIN (1-10)', 'DOL Vehicle ID', 'Model Year', 'Legislative District', '2020 Census Tract', 'Postal Code'] 
#         relevant_numerical_cols = [col for col in numerical_cols if col not in cols_to_exclude_from_corr and df[col].nunique() > 1] # Ensure variety
#         if len(relevant_numerical_cols) > 1:
#             correlation_matrix = df[relevant_numerical_cols].corr()
#             plt.figure(figsize=(10, 8))
#             sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
#             plt.title('Correlation Heatmap of Numerical Features')
#             plt.show()
#         else:
#             print("Not enough relevant numerical columns with variance for a correlation heatmap after exclusions.")
#     else:
#         print("Not enough numerical columns for a correlation heatmap.")
# else:
#     print("DataFrame is empty.")

## 6. Interpreting the Results
For each plot and analysis, consider:
*   **Distributions:** Are they skewed? Are there multiple peaks? What are the common values?
*   **Comparisons:** How do different categories (e.g., makes, EV types) compare?
*   **Trends:** Are there any trends over time (e.g., model year vs. range)?
*   **Relationships:** Do certain variables seem to influence others?
*   **Data Quality:** Are there anomalies or missing data points that affect the interpretation?
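
Some of the distribution questions above can be backed by simple numbers in addition to visual inspection. The sketch below uses a made-up series; treating "mean above median with positive skewness" as a sign of right skew is a rule of thumb, not a formal test.

```python
import pandas as pd

# Toy range values (made up): mostly short ranges plus two long ones.
ranges = pd.Series([20, 25, 30, 30, 35, 40, 250, 300], name="Electric Range")

stats = {
    "mean": ranges.mean(),
    "median": ranges.median(),
    "skew": ranges.skew(),          # sample skewness
    "mode": ranges.mode().tolist(),  # most common value(s)
}

# Rule of thumb: a long right tail pulls the mean above the median.
right_skewed = stats["mean"] > stats["median"] and stats["skew"] > 0

print(stats)
print("Right-skewed:", right_skewed)
```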