# <font color = "Green"> This notebook preprocesses the data and builds models for predicting EV buying intention

# <font color= "Indigo">Project Team:
<ol>
  <font color= "Indigo"><li>Gayathri Shanmugam</li>
  <li>Kayalvizhi Vellaichamy</li>
  <li>Nitya Malladi</li>
    <li>Saranya Anandan</li>
</ol>

### Context
The global auto manufacturing industry is undergoing rapid transformation, shifting its focus from fuel-based vehicles to zero-emission vehicles (ZEVs). ZEVs are further categorized into battery electric vehicles and hydrogen fuel cell electric vehicles. In the United States, the federal government has set a target that at least 50% of new cars sold should belong to the zero-emission category by the year 2030.
The scope of this project is to build a predictive model that classifies EV buyers in the United States based on their socio-demographic characteristics and their views on the current EV ecosystem. According to a recent survey, 53% of American vehicle users continue to prefer traditional fuel-based vehicles over electric vehicles. This is a concern for the government, auto companies, and their dealers, and addressing it starts with distinguishing EV buyers from non-buyers. Once the non-buyers are identified, focused strategies can be implemented to convert them into EV buyers.

### Data attribute information


### Import required packages.

In [1]:
from pathlib import Path
import warnings

import numpy as np
import pandas as pd
from numpy import mean, std

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

from sklearn import metrics, neighbors
from sklearn.compose import make_column_transformer
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.preprocessing import (LabelEncoder, OneHotEncoder, OrdinalEncoder,
                                   PolynomialFeatures, StandardScaler)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

from lightgbm import LGBMClassifier
from mord import LogisticIT
from xgboost import XGBClassifier

from dmba import classificationSummary, gainsChart, liftChart
from dmba import adjusted_r2_score, AIC_score, BIC_score
from dmba import regressionSummary, exhaustive_search
from dmba import backward_elimination, forward_selection, stepwise_selection
from dmba import plotDecisionTree

warnings.filterwarnings("ignore")



## Loading dataset

In [2]:
# Create data frame for EV data set.
EV_intention_df = pd.read_csv('AfterMerge_Dataset.csv')

# Display the first 10 records of EV_intention_df data frame.
print(EV_intention_df.head(10))

   bichoice  range  home_chg  work_chg  town  highway  gender          state  \
0         0      1         3         1     3        2       0  Massachusetts   
1         0      4         3         3     4        2       0  Massachusetts   
2         0      2         5         0     2        4       0  Massachusetts   
3         0      4         5         0     1        1       0  Massachusetts   
4         0      1         5         0     1        2       0  Massachusetts   
5         0      3        20        10     2        4       0  Massachusetts   
6         1      3         1         1     3        2       0  Massachusetts   
7         0      1         3         3     4        2       0  Massachusetts   
8         0      1         3         5     4        1       0  Massachusetts   
9         0      2         5        20     3        4       0  Massachusetts   

   Region  education  ...  home_parking  home_evse  work_parking  work_evse  \
0       1          4  ...             3 

# Data preprocessing

### Understanding the shape of the dataset

In [31]:
# Determine dimensions of dataframe. 
print('Dimensions of dataframe:',EV_intention_df.shape )
# It has 5898 rows and 27 columns.

Dimensions of dataframe: (5898, 27)


In [32]:
EV_intention_df.duplicated().sum()


0

In [33]:
EV_intention_df.isna().sum()

bichoice        0
range           0
home_chg        0
work_chg        0
town            0
highway         0
gender          0
state           0
Region          0
education       0
employment      0
hsincome        0
hsize           0
housit          0
residence       0
all_cars        0
ev_cars         0
home_parking    0
home_evse       0
work_parking    0
work_evse       0
buycar          0
zipcode         0
dmileage        0
long_dist       0
Age_category    0
RUCA            0
dtype: int64

### Check the data types of the columns for the dataset.

In [34]:
# Display column data types in the dataframe
print('Datatypes of all the columns in the dataset')
print(EV_intention_df.info())
EV_intention_df.describe().T

Datatypes of all the columns in the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5898 entries, 0 to 5897
Data columns (total 27 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   bichoice      5898 non-null   int64 
 1   range         5898 non-null   int64 
 2   home_chg      5898 non-null   int64 
 3   work_chg      5898 non-null   int64 
 4   town          5898 non-null   int64 
 5   highway       5898 non-null   int64 
 6   gender        5898 non-null   int64 
 7   state         5898 non-null   object
 8   Region        5898 non-null   int64 
 9   education     5898 non-null   int64 
 10  employment    5898 non-null   int64 
 11  hsincome      5898 non-null   int64 
 12  hsize         5898 non-null   int64 
 13  housit        5898 non-null   int64 
 14  residence     5898 non-null   int64 
 15  all_cars      5898 non-null   int64 
 16  ev_cars       5898 non-null   int64 
 17  home_parking  5898 non-null   int64 
 18  home

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| bichoice | 5898.0 | 0.550017 | 0.497534 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| range | 5898.0 | 2.50746 | 1.112326 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| home_chg | 5898.0 | 5.943371 | 6.602592 | 0.0 | 1.0 | 3.0 | 10.0 | 20.0 |
| work_chg | 5898.0 | 5.899627 | 6.574685 | 0.0 | 1.0 | 3.0 | 10.0 | 20.0 |
| town | 5898.0 | 2.504408 | 1.124472 | 1.0 | 1.0 | 3.0 | 4.0 | 4.0 |
| highway | 5898.0 | 2.502204 | 1.115089 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| gender | 5898.0 | 0.503561 | 0.50003 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Region | 5898.0 | 2.899288 | 1.367156 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| education | 5898.0 | 2.703967 | 0.839617 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| employment | 5898.0 | 1.659207 | 1.291301 | 1.0 | 1.0 | 1.0 | 2.0 | 6.0 |


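- There are no null values and no duplicate rows.
- All columns are of integer type except `state`, which has the object datatype.
- Several integer columns hold class codes rather than quantities, so they need to be converted to a categorical type (or to equivalent dummy variables) before modeling.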

In [35]:
# Display column data types in the dataframe before modification
print('Original Column data types')
print(EV_intention_df.dtypes)

Original Column data types
bichoice         int64
range            int64
home_chg         int64
work_chg         int64
town             int64
highway          int64
gender           int64
state           object
Region           int64
education        int64
employment       int64
hsincome         int64
hsize            int64
housit           int64
residence        int64
all_cars         int64
ev_cars          int64
home_parking     int64
home_evse        int64
work_parking     int64
work_evse        int64
buycar           int64
zipcode          int64
dmileage         int64
long_dist        int64
Age_category     int64
RUCA             int64
dtype: object


In [3]:
# Change all variables that represent classes to the 'category' data type.
EV_intention_df.gender = EV_intention_df.gender.astype('category')
EV_intention_df.state = EV_intention_df.state.astype('category')
EV_intention_df.employment = EV_intention_df.employment.astype('category')
EV_intention_df.hsize = EV_intention_df.hsize.astype('category')
EV_intention_df.housit = EV_intention_df.housit.astype('category')
EV_intention_df.residence = EV_intention_df.residence.astype('category')
#EV_intention_df.zipcode = EV_intention_df.zipcode.astype('category')
EV_intention_df.buycar = EV_intention_df.buycar.astype('category')
EV_intention_df.home_evse = EV_intention_df.home_evse.astype('category')
EV_intention_df.work_evse = EV_intention_df.work_evse.astype('category')
EV_intention_df.town = EV_intention_df.town.astype('category')
EV_intention_df.highway = EV_intention_df.highway.astype('category')
EV_intention_df.home_parking = EV_intention_df.home_parking.astype('category')
EV_intention_df.work_parking = EV_intention_df.work_parking.astype('category')
EV_intention_df.RUCA = EV_intention_df.RUCA.astype('category')
EV_intention_df.Region = EV_intention_df.Region.astype('category')
EV_intention_df.Age_category = EV_intention_df.Age_category.astype('category')
EV_intention_df.education = EV_intention_df.education.astype('category')
EV_intention_df.hsincome = EV_intention_df.hsincome.astype('category')
EV_intention_df.range = EV_intention_df.range.astype('category')
EV_intention_df.bichoice = EV_intention_df.bichoice.astype('category')

# Display category levels (attributes) and category type.
print(' ')
print('Category levels and changed variable type:')
print(EV_intention_df.gender.cat.categories)
print(EV_intention_df.gender.dtype)
print(EV_intention_df.state.cat.categories)
print(EV_intention_df.state.dtype)
print(EV_intention_df.employment.cat.categories)
print(EV_intention_df.employment.dtype)
print(EV_intention_df.hsize.cat.categories)
print(EV_intention_df.hsize.dtype)
print(EV_intention_df.housit.cat.categories)
print(EV_intention_df.housit.dtype)
print(EV_intention_df.residence.cat.categories)
print(EV_intention_df.residence.dtype)
print(EV_intention_df.bichoice.cat.categories)
print(EV_intention_df.bichoice.dtype)
# print(EV_intention_df.zipcode.cat.categories)
# print(EV_intention_df.zipcode.dtype)
print(EV_intention_df.buycar.cat.categories)
print(EV_intention_df.buycar.dtype)
print(EV_intention_df.home_evse.cat.categories)
print(EV_intention_df.home_evse.dtype)
print(EV_intention_df.work_evse.cat.categories)
print(EV_intention_df.work_evse.dtype)
print(EV_intention_df.town.cat.categories)
print(EV_intention_df.town.dtype)
print(EV_intention_df.highway.cat.categories)
print(EV_intention_df.highway.dtype)
print(EV_intention_df.home_parking.cat.categories)
print(EV_intention_df.home_parking.dtype)
print(EV_intention_df.work_parking.cat.categories)
print(EV_intention_df.work_parking.dtype)
print(EV_intention_df.RUCA.cat.categories)
print(EV_intention_df.RUCA.dtype)
print(EV_intention_df.Region.cat.categories)
print(EV_intention_df.Region.dtype)
print(EV_intention_df.Age_category.cat.categories)
print(EV_intention_df.Age_category.dtype)
print(EV_intention_df.education.cat.categories)
print(EV_intention_df.education.dtype)
print(EV_intention_df.hsincome.cat.categories)
print(EV_intention_df.hsincome.dtype)
print(EV_intention_df.range.cat.categories)
print(EV_intention_df.range.dtype)
print(EV_intention_df.bichoice.cat.categories)
print(EV_intention_df.bichoice.dtype)

 
Category levels and changed variable type:
Int64Index([0, 1], dtype='int64')
category
Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',
       'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
       'North Carolina', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas',
       'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia',
       'Wisconsin', 'Wyoming'],
      dtype='object')
category
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')
category
Int64Index([1, 2, 3, 4, 5], dtype='int64')
category
Int64Index([1, 2, 3, 4], dtype='int64')
category
Int64Index([1, 2, 3, 4, 5, 6, 8], dtype='int64')
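
The column-by-column casts above can also be written as a single call. The following is a minimal sketch on a small made-up frame (the three columns and their values are illustrative, not the survey data), assuming the same goal of casting integer-coded columns to the `category` dtype:

```python
import pandas as pd

# Toy frame standing in for EV_intention_df (values are made up).
toy_df = pd.DataFrame({
    "gender": [0, 1, 1, 0],
    "employment": [1, 3, 6, 1],
    "dmileage": [10, 25, 40, 5],   # a quantity, so it stays numeric
})

# Columns that hold class codes rather than quantities.
categorical_cols = ["gender", "employment"]

# astype accepts a {column: dtype} mapping, so one call casts them all.
toy_df = toy_df.astype({col: "category" for col in categorical_cols})

# Report the levels and dtype of each converted column.
for col in categorical_cols:
    print(col, list(toy_df[col].cat.categories), toy_df[col].dtype)
```

Keeping the column names in one list also makes it harder to print or cast the same column twice by accident.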

In [9]:
ordinal_encoded_columns= ['state']

ordinal_encoder = OrdinalEncoder(categories='auto')
ordinal_encoded_data = ordinal_encoder.fit_transform(EV_intention_df[ordinal_encoded_columns])

# Convert the encoded array to a data frame.
ordinal_encoded_data_df = pd.DataFrame(ordinal_encoded_data, index=EV_intention_df.index, columns=['state'])

# Extract only the columns that did not need to be encoded.
data_other_cols = EV_intention_df.drop(columns=ordinal_encoded_columns)

# Concatenate the two data frames.
EV_intention_df = pd.concat([ordinal_encoded_data_df, data_other_cols], axis=1)
print(EV_intention_df)
EV_intention_df.shape



      state bichoice range  home_chg  work_chg town highway gender Region  \
0      20.0        0     1         3         1    3       2      0      1   
1      20.0        0     4         3         3    4       2      0      1   
2      20.0        0     2         5         0    2       4      0      1   
3      20.0        0     4         5         0    1       1      0      1   
4      20.0        0     1         5         0    1       2      0      1   
...     ...      ...   ...       ...       ...  ...     ...    ...    ...   
5893    1.0        0     2        10         5    2       2      0      5   
5894    1.0        1     3         1         3    4       3      0      5   
5895    1.0        0     1        20         2    2       4      0      5   
5896    1.0        0     2        20         5    4       2      0      5   
5897    1.0        0     1         2         1    3       1      0      5   

     education  ... home_parking home_evse work_parking work_evse buycar  \

(5898, 27)
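
With `categories='auto'`, `OrdinalEncoder` sorts the distinct values and numbers them from 0, which is why Massachusetts appears as 20.0 and Alaska as 1.0 in the output above. A small sketch with a made-up subset of state names shows the mapping and how to recover the labels. Note that these codes carry no meaningful order for a nominal variable such as `state`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up subset of state names, for illustration only.
states = pd.DataFrame({"state": ["Texas", "Alaska", "Maine", "Alaska"]})

encoder = OrdinalEncoder(categories="auto")
codes = encoder.fit_transform(states[["state"]])

# categories_ holds the sorted levels; a level's position is its code.
print(encoder.categories_[0])
print(codes.ravel())

# inverse_transform maps the codes back to the original labels.
restored = encoder.inverse_transform(codes)
print(restored.ravel())
```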

In [10]:
EV_intention_df.state = EV_intention_df.state.astype('category')


In [11]:
# Display column data types in the dataframe after modification
print('Modified Column data types')
print(EV_intention_df.dtypes)

Modified Column data types
state           category
bichoice        category
range           category
home_chg           int64
work_chg           int64
town            category
highway         category
gender          category
Region          category
education       category
employment      category
hsincome        category
hsize           category
housit          category
residence       category
all_cars           int64
ev_cars            int64
home_parking    category
home_evse       category
work_parking    category
work_evse       category
buycar          category
zipcode            int64
dmileage           int64
long_dist          int64
Age_category    category
RUCA            category
dtype: object


### Data summary

In [12]:
# Use describe() function to display column statistics for the entire data set. 
np.round(EV_intention_df.describe(), decimals=2).T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| home_chg | 5898.0 | 5.94 | 6.6 | 0.0 | 1.0 | 3.0 | 10.0 | 20.0 |
| work_chg | 5898.0 | 5.9 | 6.57 | 0.0 | 1.0 | 3.0 | 10.0 | 20.0 |
| all_cars | 5898.0 | 1.58 | 0.72 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
| ev_cars | 5898.0 | 0.08 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| zipcode | 5898.0 | 52330.8 | 29095.13 | 1247.0 | 29483.0 | 48073.0 | 78258.0 | 99703.0 |
| dmileage | 5898.0 | 24.75 | 20.35 | 0.0 | 10.0 | 20.0 | 30.0 | 100.0 |
| long_dist | 5898.0 | 1.49 | 1.32 | 0.0 | 0.0 | 1.0 | 2.0 | 4.0 |


### Observation


In [23]:
#Develop predictors X and output variable Y for the data set.
X = EV_intention_df.drop(columns=['bichoice','zipcode','state'])
y = EV_intention_df['bichoice']

# Develop training (70%) and validation (30% or 0.3) partitions for
# EV_intention_df data frame.
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.3, random_state=1)
print('Training : ', train_X.shape)
print('Validation : ', valid_X.shape)

Training :  (4128, 24)
Validation :  (1770, 24)
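
The partition sizes follow from `test_size=0.3`: scikit-learn rounds the validation count up, so 5898 rows give ceil(0.3 × 5898) = 1770 validation rows and 4128 training rows. The sketch below reproduces those counts on placeholder data (random values, not the survey). The `stratify` argument is an optional addition that is not part of the original cell; it keeps the class ratio similar in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(5898, 3))        # placeholder predictors
y_demo = rng.integers(0, 2, size=5898)     # placeholder 0/1 target

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# Same row counts as the notebook's partitions.
print(X_tr.shape, X_va.shape)
# Share of class 1 in each partition; stratification keeps them close.
print(y_tr.mean(), y_va.mean())
```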


In [24]:
print('Predictors list')
print(X.columns)

Predictors list
Index(['range', 'home_chg', 'work_chg', 'town', 'highway', 'gender', 'Region',
       'education', 'employment', 'hsincome', 'hsize', 'housit', 'residence',
       'all_cars', 'ev_cars', 'home_parking', 'home_evse', 'work_parking',
       'work_evse', 'buycar', 'dmileage', 'long_dist', 'Age_category', 'RUCA'],
      dtype='object')


In [25]:
scaler = StandardScaler()

In [26]:
# Note the use of an array of column names.
scaler.fit(train_X[['range', 'home_chg', 'work_chg', 'town', 'highway', 'gender', 'Region',
       'education', 'employment', 'hsincome', 'hsize', 'housit', 'residence',
       'all_cars', 'ev_cars', 'home_parking', 'home_evse', 'work_parking',
       'work_evse', 'buycar', 'dmileage', 'long_dist', 'Age_category', 'RUCA']])  


StandardScaler()

In [27]:
# Transform the training partition into a standardized (normalized) data set,
# keeping the original columns next to the z-prefixed ones.
#train_X.reset_index(drop=True,inplace=True)
train_X = pd.concat([pd.DataFrame(scaler.transform(train_X[['range', 'home_chg', 'work_chg', 'town', 'highway', 'gender', 'Region',
       'education', 'employment', 'hsincome', 'hsize', 'housit', 'residence',
       'all_cars', 'ev_cars', 'home_parking', 'home_evse', 'work_parking',
       'work_evse', 'buycar', 'dmileage', 'long_dist', 'Age_category', 'RUCA']]), 
                                    columns=['zrange', 'zhome_chg', 'zwork_chg', 'ztown', 'zhighway', 'zgender', 'zRegion',
       'zeducation', 'zemployment', 'zhsincome', 'zhsize', 'zhousit', 'zresidence',
       'zall_cars', 'zev_cars', 'zhome_parking', 'zhome_evse', 'zwork_parking',
       'zwork_evse', 'zbuycar', 'zdmileage', 'zlong_dist', 'zAge_category', 'zRUCA'],index=train_X.index),
                       train_X ], axis=1)
print('Standardized (Normalized) Values of EV Intention Data Set')
print()
print(train_X)

Standardized (Normalized) Values of EV Intention Data Set

        zrange  zhome_chg  zwork_chg     ztown  zhighway   zgender   zRegion  \
5495  1.325118  -0.150337  -0.126720  0.445019  1.348281 -1.031504  1.533678   
2146  1.325118   0.602443  -0.587059 -1.323064  1.348281  0.969458 -0.654212   
1893  1.325118  -0.602005   2.174977 -0.439022 -1.349152  0.969458 -0.654212   
4741  0.434079   0.602443  -0.433613 -1.323064 -0.450008 -1.031504  0.804382   
1686 -0.456960   0.602443  -0.126720 -0.439022 -0.450008  0.969458 -0.654212   
...        ...        ...        ...       ...       ...       ...       ...   
905  -1.347998  -0.602005   2.174977 -0.439022  1.348281 -1.031504 -0.654212   
5192  0.434079   0.602443  -0.126720  1.329060  1.348281  0.969458  1.533678   
3980 -1.347998  -0.150337  -0.126720 -1.323064  0.449137 -1.031504 -0.654212   
235  -0.456960   2.108003  -0.126720  1.329060 -0.450008  0.969458 -1.383508   
5157  0.434079  -0.451449  -0.893952 -0.439022 -1.349152  0.9

In [28]:
# Transform the validation partition with the scaler fitted on the training
# partition, keeping the original columns next to the z-prefixed ones.
#valid_X.reset_index(drop=True,inplace=True)
valid_X = pd.concat([pd.DataFrame(scaler.transform(valid_X[['range', 'home_chg', 'work_chg', 'town', 'highway', 'gender', 'Region',
       'education', 'employment', 'hsincome', 'hsize', 'housit', 'residence',
       'all_cars', 'ev_cars', 'home_parking', 'home_evse', 'work_parking',
       'work_evse', 'buycar', 'dmileage', 'long_dist', 'Age_category', 'RUCA']]), 
                                    columns=['zrange', 'zhome_chg', 'zwork_chg', 'ztown', 'zhighway', 'zgender', 'zRegion',
       'zeducation', 'zemployment', 'zhsincome', 'zhsize', 'zhousit', 'zresidence',
       'zall_cars', 'zev_cars', 'zhome_parking', 'zhome_evse', 'zwork_parking',
       'zwork_evse', 'zbuycar', 'zdmileage', 'zlong_dist', 'zAge_category', 'zRUCA'],index=valid_X.index),
                       valid_X ], axis=1)
print('Standardized (Normalized) Values of EV Intention Data Set')
print()
print(valid_X)

Standardized (Normalized) Values of EV Intention Data Set

        zrange  zhome_chg  zwork_chg     ztown  zhighway   zgender   zRegion  \
573  -0.456960  -0.903117  -0.433613 -0.439022 -1.349152 -1.031504 -1.383508   
3219  0.434079   2.108003  -0.740506 -0.439022  0.449137 -1.031504  0.075085   
4436 -1.347998   2.108003  -0.587059  0.445019 -0.450008  0.969458  0.804382   
3887  0.434079   0.602443   2.174977 -0.439022 -0.450008 -1.031504  0.075085   
3656  1.325118  -0.150337   0.640513  0.445019 -0.450008 -1.031504  0.075085   
...        ...        ...        ...       ...       ...       ...       ...   
3321 -0.456960  -0.602005  -0.126720  1.329060 -1.349152 -1.031504  0.075085   
2105 -0.456960  -0.451449  -0.740506 -0.439022  1.348281  0.969458 -0.654212   
710   1.325118  -0.602005   2.174977 -0.439022 -1.349152 -1.031504 -1.383508   
4302  1.325118  -0.602005  -0.126720  0.445019 -1.349152  0.969458  0.804382   
3201  0.434079  -0.602005   0.640513 -0.439022 -0.450008 -1.0

In [32]:
train_X_s = train_X.drop(columns= ['range', 'home_chg', 'work_chg', 'town', 'highway', 'gender', 'Region',
       'education', 'employment', 'hsincome', 'hsize', 'housit', 'residence',
       'all_cars', 'ev_cars', 'home_parking', 'home_evse', 'work_parking',
       'work_evse', 'buycar', 'dmileage', 'long_dist', 'Age_category', 'RUCA'])
print(train_X_s)

valid_X_s = valid_X.drop(columns= ['range', 'home_chg', 'work_chg', 'town', 'highway', 'gender', 'Region',
       'education', 'employment', 'hsincome', 'hsize', 'housit', 'residence',
       'all_cars', 'ev_cars', 'home_parking', 'home_evse', 'work_parking',
       'work_evse', 'buycar', 'dmileage', 'long_dist', 'Age_category', 'RUCA'])
print(valid_X_s)



        zrange  zhome_chg  zwork_chg     ztown  zhighway   zgender   zRegion  \
5495  1.325118  -0.150337  -0.126720  0.445019  1.348281 -1.031504  1.533678   
2146  1.325118   0.602443  -0.587059 -1.323064  1.348281  0.969458 -0.654212   
1893  1.325118  -0.602005   2.174977 -0.439022 -1.349152  0.969458 -0.654212   
4741  0.434079   0.602443  -0.433613 -1.323064 -0.450008 -1.031504  0.804382   
1686 -0.456960   0.602443  -0.126720 -0.439022 -0.450008  0.969458 -0.654212   
...        ...        ...        ...       ...       ...       ...       ...   
905  -1.347998  -0.602005   2.174977 -0.439022  1.348281 -1.031504 -0.654212   
5192  0.434079   0.602443  -0.126720  1.329060  1.348281  0.969458  1.533678   
3980 -1.347998  -0.150337  -0.126720 -1.323064  0.449137 -1.031504 -0.654212   
235  -0.456960   2.108003  -0.126720  1.329060 -0.450008  0.969458 -1.383508   
5157  0.434079  -0.451449  -0.893952 -0.439022 -1.349152  0.969458  1.533678   

      zeducation  zemployment  zhsincom
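
The scaling cells above fit the scaler on the training partition and reuse it on the validation partition. The same steps can be sketched more compactly. The two toy frames and the helper name `standardize` are hypothetical, and `add_prefix` produces the `z`-prefixed column names directly, so no separate drop of the original columns is needed.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def standardize(scaler, frame):
    """Scale frame with an already fitted scaler and prefix its columns with 'z'."""
    scaled = scaler.transform(frame)
    return pd.DataFrame(scaled, columns=frame.columns, index=frame.index).add_prefix("z")

# Toy partitions with shuffled row labels, as produced by a random split.
train_demo = pd.DataFrame({"home_chg": [1, 3, 10, 20], "dmileage": [5, 10, 30, 40]},
                          index=[7, 2, 9, 4])
valid_demo = pd.DataFrame({"home_chg": [5, 0], "dmileage": [20, 100]}, index=[1, 3])

# The mean and standard deviation come from the training rows only.
scaler_demo = StandardScaler().fit(train_demo)
train_demo_s = standardize(scaler_demo, train_demo)
valid_demo_s = standardize(scaler_demo, valid_demo)

print(train_demo_s.round(3))
print(valid_demo_s.round(3))
```

Passing the original index through keeps the scaled rows aligned with `train_y` and `valid_y`.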

In [30]:
#poly = PolynomialFeatures(degree=2)
#poly.fit(train_X_s)
#train_X_p = poly.transform(train_X_s)
#valid_X_p = poly.transform(valid_X_s)
# #X = poly.transform(X)

# PCA

In [21]:
#pca = PCA()

In [22]:
#pca.fit(train_X_p)

PCA()

In [23]:
#print(pca.explained_variance_ratio_)

[1.07615695e-01 2.98593510e-02 2.58944330e-02 2.00274046e-02
 1.66499154e-02 1.49455789e-02 1.31376225e-02 1.28702050e-02
 1.04674310e-02 9.49731006e-03 8.75787088e-03 8.68175818e-03
 8.43240515e-03 8.21052677e-03 7.77058337e-03 7.49339879e-03
 7.40834098e-03 7.36253601e-03 6.97472030e-03 6.92947373e-03
 6.82347241e-03 6.70725386e-03 6.57466970e-03 6.49345030e-03
 6.30754305e-03 6.16278129e-03 6.13355856e-03 6.12579697e-03
 5.84841606e-03 5.79613721e-03 5.74536543e-03 5.66923409e-03
 5.51769527e-03 5.44390189e-03 5.31798254e-03 5.27140798e-03
 5.10685063e-03 5.03767190e-03 4.95163237e-03 4.88140657e-03
 4.74864144e-03 4.69702735e-03 4.65196863e-03 4.60563599e-03
 4.59145779e-03 4.53698094e-03 4.46273773e-03 4.38140976e-03
 4.35384040e-03 4.30441045e-03 4.26962420e-03 4.22195088e-03
 4.15821359e-03 4.05557280e-03 4.02194859e-03 4.00053665e-03
 3.95401172e-03 3.93361941e-03 3.83782443e-03 3.82418818e-03
 3.73316513e-03 3.71975743e-03 3.69875635e-03 3.63649400e-03
 3.62753910e-03 3.594138

In [None]:
#np.sum(pca.explained_variance_ratio_[0:50])

In [None]:
#np.sum(pca.explained_variance_ratio_[0:200])

In [None]:
#np.sum(pca.explained_variance_ratio_[0:250])

In [None]:
# np.sum(pca.explained_variance_ratio_[0:100])

In [None]:
# np.sum(pca.explained_variance_ratio_[0:75])

In [None]:
#np.sum(pca.explained_variance_ratio_[0:300])
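
Summing the first k explained-variance ratios for several values of k, as in the cells above, can be replaced by one cumulative sum. The sketch below uses synthetic data, and the 90% threshold is an assumption chosen for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 10 features with decreasing spread.
data = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)

pca_demo = PCA().fit(data)

# cumulative[k-1] is the share of variance explained by the first k components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.90
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(n_components, cumulative[n_components - 1])
```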

In [24]:
#pca300=PCA(n_components=300)
#pca300.fit(train_X_p)
#train_X_pca300 = pd.DataFrame(pca300.transform(train_X_p))
#valid_X_pca300 = pd.DataFrame(pca300.transform(valid_X_p))

In [None]:
# train_X_pca200
#print('Training : ', train_X_pca300.shape)


In [None]:
#print('Validation : ', valid_X_pca300.shape)


In [None]:
# train_X_pca300.head()

In [None]:
# valid_X_pca300.head()

In [None]:
#train_X_s.shape
#valid_X_s.shape

## Neural network Model 

In [None]:
!pip install pytorch_tabnet

In [None]:
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier
import torch

# define the model
clf= TabNetClassifier(optimizer_fn=torch.optim.Adam,
                       scheduler_params={"step_size":10, 
                                         "gamma":0.9},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,
                      )
# TabNet works on NumPy arrays, so convert the standardized partitions
# and the targets before fitting the model.
train_X_s = np.asarray(train_X_s)
train_y = np.asarray(train_y)
valid_X_s = np.asarray(valid_X_s)
valid_y = np.asarray(valid_y)



clf.fit(
    train_X_s,train_y,
    eval_set=[(train_X_s, train_y), (valid_X_s, valid_y)],
    eval_name=['train', 'test'],
    eval_metric=['auc','balanced_accuracy'],
    max_epochs=200, patience=60,
    batch_size=512, virtual_batch_size=512,
    num_workers=0,
    weights=1,
    drop_last=False
)            

In [None]:
# Confusion matrices for the TabNet neural network model for EV intention.
# The model was fitted on the standardized partitions, so predict on those.

# Identify and display confusion matrix for training partition. 
print('Training Partition for TabNet Neural Network Model')
classificationSummary(train_y, clf.predict(train_X_s))

# Identify and display confusion matrix for validation partition. 
print()
print('Validation Partition for TabNet Neural Network Model')
classificationSummary(valid_y, clf.predict(valid_X_s))
print(classification_report(valid_y, clf.predict(valid_X_s)))
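
`classificationSummary` from `dmba` prints a confusion matrix together with the accuracy. For readers without `dmba`, the same quantities can be sketched with scikit-learn alone; the labels below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up actual and predicted classes.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(f"Accuracy: {acc:.4f}")
```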

In [30]:
%pip install fastai torchtext

Note: you may need to restart the kernel to use updated packages.


In [45]:
#Code without data split
from fastai.tabular.all import *
from pathlib import Path
import pandas as pd
from fastai.metrics import accuracy

# load the data
path = Path('/Users/nitya/Downloads/AfterMerge_Dataset.csv')
df = pd.read_csv(path)
df = df.drop(columns=['state', 'zipcode'])

#df = EV_intention_df
dep_var = 'bichoice'
cat_names = ['range', 'town', 'highway', 'gender', 'Region',
       'education', 'employment', 'hsincome', 'hsize', 'housit', 'residence',
       'home_parking', 'home_evse', 'work_parking','work_evse', 'buycar', 'Age_category', 'RUCA']
cont_names = ['home_chg', 'work_chg', 'all_cars', 'ev_cars','dmileage', 'long_dist']
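procs = [Categorify, FillMissing, Normalize]
# bichoice is read as int64, so fastai would otherwise infer a regression target.
# y_block=CategoryBlock() declares it a class label, which the accuracy metric and
# ClassificationInterpretation (it reads the class vocab) both assume.
# path is set to the folder holding the CSV, since fastai uses it as a working directory.
dls = TabularDataLoaders.from_df(df, path=path.parent, procs=procs, cat_names=cat_names,
                                 cont_names=cont_names, y_names=dep_var, y_block=CategoryBlock())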

# define the model architecture
learn = tabular_learner(dls, metrics=accuracy)

# train the model
learn.fit_one_cycle(5)

# get the validation accuracy
accuracy_val = learn.validate()[1]

# evaluate the model
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.581493 | 0.459067 | 0.0 | 00:05 |
| 1 | 0.367263 | 0.284824 | 0.0 | 00:05 |
| 2 | 0.266319 | 0.236123 | 0.0 | 00:05 |
| 3 | 0.222906 | 0.223348 | 0.0 | 00:05 |
| 4 | 0.20574 | 0.220953 | 0.0 | 00:05 |


AttributeError: vocab

In [48]:
#Code with data split
from fastai.tabular.all import *
from pathlib import Path
import pandas as pd
from fastai.metrics import accuracy

# load the data
path = Path('/Users/nitya/Downloads/AfterMerge_Dataset.csv')
df = pd.read_csv(path)
df = df.drop(columns=['state', 'zipcode'])

cat_names = ['range', 'town', 'highway', 'gender', 'Region',
       'education', 'employment', 'hsincome', 'hsize', 'housit', 'residence',
       'home_parking', 'home_evse', 'work_parking','work_evse', 'buycar', 'Age_category', 'RUCA']
cont_names = ['home_chg', 'work_chg', 'all_cars', 'ev_cars','dmileage', 'long_dist']


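# Split data into training and validation sets.
# y_block=CategoryBlock() marks the int64 target as a class label; without it
# fastai infers a regression target from the numeric dtype.
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df))
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names=cat_names, cont_names=cont_names,
                   y_names='bichoice', y_block=CategoryBlock(), splits=splits)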

# Create dataloaders
dls = to.dataloaders(bs=64)

# Define model
learn = tabular_learner(dls, metrics=accuracy)

# Train model
learn.fit_one_cycle(5, 0.01)

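# Evaluate model on the held-out split.
# dls.test_dl is meant to receive a raw DataFrame, so pass the held-out rows of df
# rather than the already processed TabularPandas object `to.valid`.
test_df = df.iloc[list(splits[1])]
test_dl = dls.test_dl(test_df)
preds = learn.get_preds(dl=test_dl)[0]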

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.389315 | 0.265234 | 0.0 | 00:05 |
| 1 | 0.281105 | 0.238546 | 0.0 | 00:05 |
| 2 | 0.248736 | 0.227556 | 0.0 | 00:05 |
| 3 | 0.225009 | 0.215319 | 0.0 | 00:05 |
| 4 | 0.196219 | 0.211574 | 0.0 | 00:05 |


TypeError: only list-like objects are allowed to be passed to isin(), you passed a [slice]

In [49]:
test_dl

<fastai.tabular.core.TabDataLoader at 0x7f947fbe8af0>