## Train and Classify - Census Tracts

This script explores several machine learning regression models for predicting total e-scooter trip counts for census tracts in Minneapolis, MN. The demographic data come from the ACS 5-year estimates for 2014-2018 and 2015-2019. The models explored are Random Forest, linear, ridge, and Poisson regression.

The script requires scikit-learn (`sklearn`), pandas, and an ArcGIS Pro license.

Data sources: ACS-Survey 2014-2018 5-year Estimates, ACS-Survey 2015-2019 5-year Estimates, City of Minneapolis, U.S. Census Bureau

In [394]:
import pandas as pd
from sklearn import metrics, preprocessing
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression, PoissonRegressor, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [376]:
file_path = r"C:\Users\msong\Desktop\Independent proj\escooter_ML\escooter_all.csv"
cols = ['GISJOIN',
        'year',
        'SUM_TripCount',
        'percent_nonwhite',
        'percent_hsandabv',
        'medhhinc_normal',
        'popdens_sqmi']
#        'med_hh_inc']
df = pd.read_csv(file_path, usecols=cols)

In [377]:
# collect rows that do not correspond to a census tract (missing GISJOIN)
empty_df = df.loc[(df['GISJOIN'].isna())]

In [378]:
# keep only rows with a GISJOIN, then drop the identifier column
data = df[df['GISJOIN'].notna()].drop('GISJOIN',axis=1)

In [379]:
data['SUM_TripCount'].describe()

count       232.000000
mean       4097.737069
std       13487.965540
min           5.000000
25%         121.250000
50%         459.000000
75%        2002.500000
max      124153.000000
Name: SUM_TripCount, dtype: float64

In [380]:
# pearson's correlation to see relationships between variables
# ignore year correlation values
corr = data.astype('float64').corr()

In [381]:
corr

| | year | SUM_TripCount | percent_nonwhite | percent_hsandabv | medhhinc_normal | popdens_sqmi |
|---|---|---|---|---|---|---|
| year | 1.0 | 0.160063 | 0.008275 | 0.719022 | 0.049112 | 0.006367 |
| SUM_TripCount | 0.160063 | 1.0 | -0.036318 | 0.130319 | -0.098501 | 0.179049 |
| percent_nonwhite | 0.008275 | -0.036318 | 1.0 | -0.526152 | -0.683684 | 0.185241 |
| percent_hsandabv | 0.719022 | 0.130319 | -0.526152 | 1.0 | 0.438344 | -0.178517 |
| medhhinc_normal | 0.049112 | -0.098501 | -0.683684 | 0.438344 | 1.0 | -0.447139 |
| popdens_sqmi | 0.006367 | 0.179049 | 0.185241 | -0.178517 | -0.447139 | 1.0 |


In [426]:
# split data into train and test sets for validation
# fracNum is the fraction of rows sampled into the training set (here 30%);
# the remaining rows (70%) form the test set
fracNum = 0.30
train_set = data.sample(frac = fracNum)
test_set = data.drop(train_set.index)
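
Because `data.sample` is called without a seed, the split above changes on every run, and so do the scores reported further down. The following is a minimal sketch of the same 30/70 split made reproducible. It assumes a small synthetic stand-in for `data` (the values are made up), and the seed value `42` is an arbitrary choice, not something used in the original run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for `data` (made-up values, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "popdens_sqmi": rng.uniform(1_000, 20_000, size=100),
    "SUM_TripCount": rng.poisson(500, size=100),
})

# Option 1: scikit-learn helper; 30% train, remaining 70% test
demo_train, demo_test = train_test_split(demo, train_size=0.30, random_state=42)

# Option 2: same pattern as the cell above, with a fixed seed added
demo_train_alt = demo.sample(frac=0.30, random_state=42)
demo_test_alt = demo.drop(demo_train_alt.index)

print(len(demo_train), len(demo_test))
print(len(demo_train_alt), len(demo_test_alt))
```

Either option gives the same proportions; fixing the seed only makes the row assignment repeatable.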

In [430]:
# input demographic fields of interest
x_cols = ['percent_nonwhite',
          'percent_hsandabv', 
          'medhhinc_normal',
          'popdens_sqmi']
# field to predict
y_cols = 'SUM_TripCount'


# select predictor and target fields for the training set
X_train = train_set[x_cols]
y_train = train_set[y_cols]

# format test set
X_test = test_set[x_cols]
y_test = test_set[y_cols]

### Linear Regression

In [389]:
# run linear regression
lin_reg = LinearRegression()
_ = lin_reg.fit(X_train, y_train) # train regression with training set
preds = lin_reg.predict(X_test) # predict values in test set

print("Training score:", lin_reg.score(X_train, y_train))
print("Testing score:", lin_reg.score(X_test, y_test))
print("MAE of Linear Regression:", mean_absolute_error(y_test, preds), '\n')

Training score: 0.050158136719898216
Testing score: 0.02782287541735995
MAE of Linear Regression: 5396.486526471126 
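
An R² score near zero means the model explains almost none of the variance in trip counts. One way to put an MAE figure in context is to compare it with a predictor that ignores the features entirely. The sketch below does this with scikit-learn's `DummyRegressor` on synthetic data; the data, the variable names, and the choice of the median strategy are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed target that is unrelated to the features (made up)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = rng.lognormal(mean=6.0, sigma=1.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.30, random_state=0
)

# Baseline: always predict the median of the training target
baseline = DummyRegressor(strategy="median").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))
model_mae = mean_absolute_error(y_te, model.predict(X_te))

print("Baseline MAE:", baseline_mae)
print("Linear regression MAE:", model_mae)
```

If a fitted model does not beat such a baseline on the test set, its features are adding little predictive value.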



### Ridge Regression

In [390]:
# based on Pearson's correlation, ridge is probably not a good fit here
# because the predictors do not show strong multicollinearity
ridge = Ridge(alpha=0.1) # alpha can be altered
_ = ridge.fit(X_train, y_train)
preds = ridge.predict(X_test)

print("Training score:", ridge.score(X_train, y_train))
print("Testing score:", ridge.score(X_test, y_test))
print("MAE of Ridge Regression:", mean_absolute_error(y_test, preds), '\n')

Training score: 0.0500130752943293
Testing score: 0.0322975290094315
MAE of Ridge Regression: 5355.170285914989 
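
The ridge cell notes that `alpha` can be altered. Rather than trying values by hand, the penalty can be chosen by cross-validation. The sketch below uses `RidgeCV` inside a pipeline with `StandardScaler`; the synthetic data and the grid of candidate alphas are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a weak linear signal (made up)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=5.0, size=200)

# Candidate penalties on a log scale (an assumed grid)
alphas = np.logspace(-3, 3, 13)

# Scaling first puts all features on the same footing for the L2 penalty
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
ridge_cv.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
best_alpha = ridge_cv.named_steps["ridgecv"].alpha_
print("Selected alpha:", best_alpha)
```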



### Random Forest Classifier Approach

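Note: this approach was attempted before running the Random Forest regression algorithm. Because `RandomForestClassifier` treats every distinct trip count as its own class, exact-match accuracy on a count target is expected to be at or near zero.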

In [None]:
# Resource: https://www.datacamp.com/community/tutorials/random-forests-classifier-python
clf = RandomForestClassifier(n_estimators=100,
                             bootstrap=True,
                             warm_start=True,
                             max_features="sqrt")

# Accuracy score will not compute anything when bootstrap is set to True.
# Reducing the number of input fields was also tried.

# Other settings tried:
# oob_score=True
# min_samples_split=.1
# min_samples_split=10
# random_state=0

_ = clf.fit(X_train, y_train)  # build the trees from the training data
y_pred = clf.predict(X_test)   # predict values in the test dataset

In [393]:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))  # share of test rows whose predicted count matches exactly

Accuracy: 0.0
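
One likely reason for an accuracy of 0.0 is that a classifier can only predict labels it saw during training, and with raw trip counts most test rows carry a count that never appeared in the training set. If a classifier is still wanted, the counts can first be binned into a few classes. The sketch below assumes synthetic data and three quantile-based classes made with `pd.qcut`; the class names and bin count are illustrative choices, not the author's method.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic features and skewed counts (made up; features are pure noise)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
counts = pd.Series(rng.lognormal(mean=6.0, sigma=1.5, size=200).round())

# Most raw counts are distinct, so each would act as its own class
n_unique = counts.nunique()

# Bin counts into terciles so the classifier has three classes to learn
usage = pd.qcut(counts, q=3, labels=["low", "medium", "high"]).astype(str)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, usage, train_size=0.30, random_state=0
)
clf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf_demo.predict(X_te))

print("Distinct raw counts:", n_unique)
print("Accuracy on binned classes:", acc)
```

Since the synthetic features carry no signal, the accuracy here should sit near chance; the point is only that the metric becomes meaningful once the target is categorical.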


### Random Forest Regression

In [443]:
# run Random Forest Regressor
rf_regr = RandomForestRegressor(n_estimators = 1000, random_state=0)
_ = rf_regr.fit(X_train, y_train) # create trees in forest
preds=rf_regr.predict(X_test) # predict values in test set

print("Training score:", rf_regr.score(X_train, y_train))
print("Testing score:", rf_regr.score(X_test, y_test))
print("MAE of Random Forest Regression:", mean_absolute_error(y_test, preds), '\n')

Training score: 0.8726307058531736
Testing score: 0.03695197469390088
MAE of Random Forest Regression: 5109.405580246914 
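
The gap between the training score (about 0.87) and the testing score (about 0.04) suggests the forest is overfitting the training rows. A single split on a small dataset is also noisy, so cross-validation gives a steadier estimate. The sketch below compares a few `min_samples_leaf` values with `cross_val_score`; the synthetic data, the candidate values, and the smaller forest size are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data with a weak signal in the first feature (made up)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=3.0, size=200)

cv_r2 = {}
for leaf in (1, 5, 20):
    # Larger leaves force smoother, less flexible trees
    rf = RandomForestRegressor(
        n_estimators=100, min_samples_leaf=leaf, random_state=0
    )
    scores = cross_val_score(rf, X_demo, y_demo, cv=5, scoring="r2")
    cv_r2[leaf] = scores.mean()

for leaf, score in cv_r2.items():
    print(f"min_samples_leaf={leaf}: mean CV R^2 = {score:.3f}")
```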



### Poisson Regression

In [None]:
""" 
resources:
- about: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PoissonRegressor.html
- about_2: https://timeseriesreasoning.com/contents/poisson-regression-model/
- tutorial: https://www.kaggle.com/gauravduttakiit/explore-the-poisson-regression
"""


# pr = PoissonRegressor()
# pr.fit(X_train, y_train)
# y_pr = pr.predict(X_test)

In [None]:
# reformat table to include two fields:
# year of dataset and total trip counts
pdata = data[["year","SUM_TripCount"]]

In [None]:
# create a training and test set
train,test=train_test_split(pdata, train_size = .3,random_state =1)

In [None]:
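# reshape SUM_TripCount into a single-column 2D array, as scikit-learn expects
# note: as set up here, SUM_TripCount is the predictor and year is the target
X_train = train['SUM_TripCount'].values.reshape(-1, 1)
y_train = train.year
# standardize to zero mean and unit variance (this is not a -1 to 1 range);
# X_scaled is not used below because the pipeline applies its own StandardScaler
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)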


X_train.shape,y_train.shape

In [None]:
# reshape the test-set SUM_TripCount into a single-column 2D array
X_test = test['SUM_TripCount'].values.reshape(-1, 1)
y_test = test.year
X_test.shape,y_test.shape

In [None]:
# train the Poisson regression and predict the test-set target
# (with the X/y assignment above, the predicted value is year, not trip count)
pipeline = Pipeline([('standardscaler', StandardScaler()),('model', PoissonRegressor())])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

r2_test = metrics.r2_score(y_test, y_pred)
r2_test
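
The cells above use `SUM_TripCount` as the predictor and `year` as the target. For comparison, the sketch below shows a Poisson regression set up the other way around, with the count as the target and the four demographic fields as predictors. It is only an illustration: the data are synthetic, the assumed relationship between density and trips is invented, and `alpha=1e-3` and `max_iter=1000` are arbitrary settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the demographic fields (made-up values)
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "percent_nonwhite": rng.uniform(0, 100, n),
    "percent_hsandabv": rng.uniform(50, 100, n),
    "medhhinc_normal": rng.uniform(0, 1, n),
    "popdens_sqmi": rng.uniform(1_000, 20_000, n),
})

# Assumed data-generating process: expected count rises with density
log_rate = 4.0 + 1.0 * (X_demo["popdens_sqmi"] / 20_000)
y_demo = rng.poisson(np.exp(log_rate))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.30, random_state=1
)

# Scale the features, then fit a Poisson GLM with a log link
poisson_pipe = Pipeline([
    ("standardscaler", StandardScaler()),
    ("model", PoissonRegressor(alpha=1e-3, max_iter=1000)),
])
poisson_pipe.fit(X_tr, y_tr)
preds = poisson_pipe.predict(X_te)

# For PoissonRegressor, score() returns D^2 (fraction of deviance explained)
d2 = poisson_pipe.score(X_te, y_te)
deviance = mean_poisson_deviance(y_te, preds)
print("D^2:", d2)
print("Mean Poisson deviance:", deviance)
```

Deviance-based metrics are generally a better match for count targets than `r2_score`, since they follow the loss the Poisson model actually minimizes.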

In [None]:
"""
Notes: 

Documentation:
<https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier.score>

- warm_start: bool, default=False
When set to True, reuse the solution of the previous call to fit and 
add more estimators to the ensemble, otherwise, just fit a whole new forest. 
See the Glossary.

- bootstrap : bool, default=True
Whether bootstrap samples are used when building trees. 
If False, the whole dataset is used to build each tree.

- max_features : {“auto”, “sqrt”, “log2”}, int or float, default=“auto”
The number of features to consider when looking for the best split.

- random_state : int, RandomState instance or None, default=None
Controls both the randomness of the bootstrapping of the samples used 
when building trees (if bootstrap=True) and the sampling of the 
features to consider when looking for the best split at each node 
(if max_features < n_features). See Glossary for details.

- min_samples_leaf : int or float, default=1
The minimum number of samples required to be at a leaf node. 
A split point at any depth will only be considered if it leaves at 
least min_samples_leaf training samples in each of the left and right 
branches. This may have the effect of smoothing the model, especially 
in regression.

"""