## Introduction
In this assignment, I chose to explore the electric vehicle population for the state of Washington (see the link below). I have an interest in the evolving electric car market, so I thought this would be an interesting dataset to explore. The dataset shows a subset (n=1000) of the full-electric vehicles and plug-in hybrids that are currently registered in Washington State.

## Data Exploration


In [3]:
#Data Import and exploration
import pandas as pd
electric_cars = pd.read_csv("https://www.pro-football-reference.com/years/2024/#team_stats")
print("Dataframe dimensions (rows, columns):",electric_cars.shape, "\n")

print(electric_cars.info(),"\n")

print("Descriptive Statistics of each Column:\n",electric_cars.describe(),"\n")

print("Missing Values in each Column:\n", electric_cars.isnull().sum())

Dataframe dimensions (rows, columns): (1000, 17) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   vin_1_10              1000 non-null   object 
 1   county                999 non-null    object 
 2   city                  999 non-null    object 
 3   state                 1000 non-null   object 
 4   zip_code              999 non-null    float64
 5   model_year            1000 non-null   int64  
 6   make                  1000 non-null   object 
 7   model                 1000 non-null   object 
 8   ev_type               1000 non-null   object 
 9   cafv_type             1000 non-null   object 
 10  electric_range        1000 non-null   int64  
 11  base_msrp             1000 non-null   int64  
 12  legislative_district  996 non-null    float64
 13  dol_vehicle_id        1000 non-null   int64  
 14  geocoded_column       
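
Neither `describe()` nor `isnull()` flags placeholder values such as zeros, so it can help to count them before wrangling. The following is a minimal sketch, not part of the original run. The frame `toy` and its values are hypothetical and only reuse two column names from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for electric_cars; the values are invented for illustration.
toy = pd.DataFrame({
    "electric_range": [0, 215, 0, 32],
    "base_msrp": [0, 0, 0, 69900],
})

# Comparing to 0 gives a boolean frame; the column mean is the share of zeros.
zero_share = (toy == 0).mean()
print(zero_share)
```

A column with a high zero share may be a candidate for dropping or filtering rather than imputing.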

## Data Wrangling

In [4]:
# 1. Modify multiple column names.
#Simplified three column names
print("1.Modify multiple column names.\n")
electric_cars.rename(columns={"vin_1_10":"VIN", "dol_vehicle_id": "id", "electric_utility": "utility"}, inplace=True)
print(electric_cars.info(),"\n") 

# 2. Look at the structure of your data – are any variables improperly coded? Such as strings or characters? 
#Changed generic object type to string
print("2. Redefine improperly coded variables \n")
string_cols = ['VIN', 'county', 'city', 'state', 'make', 'model',
               'ev_type', 'cafv_type', 'geocoded_column', 'utility']
# Convert all text columns in one call instead of one assignment per column
electric_cars = electric_cars.astype({col: 'string' for col in string_cols})
print(electric_cars.info(), "\n") 

# 3. Fix missing and invalid values in data.
#dropped rows with missing data
print("3. Fix missing and invalid values in data. \n")
electric_cars.dropna(inplace = True)
print(electric_cars.shape,"\n")

# 4. Create new columns based on existing columns or calculations.
#created two new columns
print("4. Create new columns based on existing columns or calculations.\n")
electric_cars['range_comp_to_mean'] = electric_cars['electric_range']/electric_cars['electric_range'].mean()
electric_cars['is_tesla_or_audi'] = electric_cars['make'].isin(["TESLA", "AUDI"]).astype(int)
print(electric_cars.columns,"\n")

# 5. Drop column(s) from your dataset.
#dropped base_msrp since most values were $0
print("5. Drop column(s) from your dataset.\n")
electric_cars.drop(columns=['base_msrp'],inplace = True)
print(electric_cars.columns, "\n")

# 6. Drop a row(s) from your dataset.
#Dropped the first 3 rows
print("6. Drop rows from your dataset.\n")
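# index[:3] selects the first three remaining rows, even if labels 0-2 were removed by dropna above
electric_cars.drop(electric_cars.index[:3], inplace = True)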
print(electric_cars.shape,"\n")

# 7. Sort your data based on multiple variables.
#Sorted data based on zip code and electric range
print("7. Sort your data based on multiple variables.\n")
# Named sorted_cars to avoid shadowing Python's built-in sorted()
sorted_cars = electric_cars.sort_values(['zip_code','electric_range'])
print(sorted_cars.head())
print("\n")

# 8. Filter your data based on some condition.
#filtered rows based on whether they had useful (non-zero) information regarding range
print("8. Filter your data based on some condition. \n")
electric_cars_filtered = electric_cars[electric_cars['electric_range'] > 0]
print(electric_cars_filtered.shape)
print("\n")

# 9. Convert all the string values to upper or lower cases in one column.
#Converted strings in the "make" column to lower case
print("9. Convert all the string values to upper or lower cases in one column.\n")
electric_cars_filtered.loc[:,'make'] = electric_cars_filtered['make'].str.lower()
print(electric_cars_filtered['make'])
print("\n")

# 10. Check whether numeric values are present in a given column of your dataframe.
print("10. Check whether numeric values are present in a given column of your dataframe. \n")
print(electric_cars_filtered.dtypes)
print("\n")

# 11. Group your dataset by one column, and get the mean, min, and max values by group.
#Grouped by model year and examined the mean, min, and max values of the electric range
print("11.Group your dataset by one column, and get the mean, min, and max values by group.\n ")
print(electric_cars_filtered.groupby('model_year')['electric_range'].agg(['mean', 'min', 'max']))


1.Modify multiple column names.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   VIN                   1000 non-null   object 
 1   county                999 non-null    object 
 2   city                  999 non-null    object 
 3   state                 1000 non-null   object 
 4   zip_code              999 non-null    float64
 5   model_year            1000 non-null   int64  
 6   make                  1000 non-null   object 
 7   model                 1000 non-null   object 
 8   ev_type               1000 non-null   object 
 9   cafv_type             1000 non-null   object 
 10  electric_range        1000 non-null   int64  
 11  base_msrp             1000 non-null   int64  
 12  legislative_district  996 non-null    float64
 13  id                    1000 non-null   int64  
 14  geocoded_column       999 non-null    ob

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  electric_cars_filtered.loc[:,'make'] = electric_cars_filtered['make'].str.lower()
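
The warning above appears because `electric_cars_filtered` was created by filtering `electric_cars`, so pandas cannot tell whether the later assignment is meant to affect the original frame. One common way to avoid it, sketched here on a hypothetical frame `toy` with invented values, is to take an explicit copy when filtering.

```python
import pandas as pd

# Hypothetical stand-in for electric_cars; the values are invented for illustration.
toy = pd.DataFrame({
    "make": ["TESLA", "AUDI", "NISSAN"],
    "electric_range": [215, 0, 84],
})

# .copy() makes the filtered frame independent of the original,
# so the assignment below is unambiguous and raises no warning.
toy_filtered = toy[toy["electric_range"] > 0].copy()
toy_filtered["make"] = toy_filtered["make"].str.lower()

print(toy_filtered["make"].tolist())
print(toy["make"].tolist())
```

The original frame keeps its upper-case values. Only the filtered copy is changed.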


## Conclusions
The data was grouped by model year, then the electric range was summarized via mean, max, and min values. The results were surprising in that the mean range varied considerably over time rather than improving. This is unlikely to be a true reflection of reality and is more likely due to limitations of the data set. Many of the electric range values are suspiciously low, so I suspect that some of the data in the electric_range column is actually the miles per gallon equivalent (MPGe) rather than range. An attribute that is inconsistently defined is challenging to use for analysis and would likely lead to a "garbage in, garbage out" result. Ultimately, I would seek out an alternative data set if I were to further analyze the electric vehicle market in Washington state.
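
One way to probe where the low values come from is to split the summary by `ev_type` as well as `model_year`, since the dataset mixes full-electric vehicles and plug-in hybrids. The sketch below is an illustration only. The frame `toy`, its values, and the short labels `"BEV"` and `"PHEV"` are hypothetical and may not match the labels used in the actual `ev_type` column.

```python
import pandas as pd

# Hypothetical stand-in for electric_cars_filtered; values and labels are invented.
toy = pd.DataFrame({
    "model_year": [2018, 2018, 2018, 2021, 2021],
    "ev_type": ["BEV", "PHEV", "BEV", "PHEV", "BEV"],
    "electric_range": [215, 25, 151, 32, 220],
})

# Summarize range per vehicle type and model year instead of per model year alone.
summary = (
    toy.groupby(["ev_type", "model_year"])["electric_range"]
    .agg(["mean", "min", "max"])
)
print(summary)
```

If the low values cluster in one vehicle type, the per-year means would partly reflect the mix of vehicle types rather than inconsistent definitions.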