<a href="https://colab.research.google.com/github/kelseytyler/In-N-Out-Eastward-Expansion/blob/main/Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**3. Machine Learning**

In this section, we trained 70% of our cleaned data in order to test the remaining 30%. To do so, we first created three different models: Random Forest Classifier, Logistics Regression, and Ridge Classifier. After creating our models, we made confusion matrices for each one to discover which model is the best one to use to predict our test variable.

In [5]:
#creating training and testing data
training = west[:196]
testing = west[196:]

#preprocessor
x_train = training[['Median Age', 'Male Population', 'Female Population',
            'Total Population', 'Number of Veterans', 'Foreign-born',
            'Average Household Size', 'White Percentage', 'Minority Percentage']]
y_train = training['InNOut']

x_test = testing[['Median Age', 'Male Population', 'Female Population',
            'Total Population', 'Number of Veterans', 'Foreign-born',
            'Average Household Size', 'White Percentage', 'Minority Percentage']]
y_test = testing['InNOut']

#pipeline
pipeline1 = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=10))

#train
pipeline1.fit(x_train, y_train)
y_train_predict = pipeline1.predict(x_train)

#test
scores = cross_val_score(
    pipeline1,
    x_train,
    y_train,
    scoring='f1_macro',
    cv=10
)
scores.mean()

0.4805155016552075

**MODEL 1** : Random Forest Classifier


In [6]:
#trying different classifiers
def models(x):
  pipeline1 = make_pipeline(
      StandardScaler(),
      x,
  )
  scores = cross_val_score(
      pipeline1,
      x_train,
      y_train,
      cv=10,
      scoring='f1_macro')
  print(x, scores.mean())
models(DecisionTreeClassifier())
models(GaussianProcessClassifier())
models(GaussianNB()) #high result **
models(RandomForestClassifier()) #highest result
models(RidgeClassifier()) #high result
models(SGDClassifier()) #high result **
models(LogisticRegression()) #high result
models(KNeighborsClassifier())

DecisionTreeClassifier() 0.4880622689451126
GaussianProcessClassifier() 0.5539685192258722
GaussianNB() 0.5874767115220152
RandomForestClassifier() 0.5789336660861804
RidgeClassifier() 0.574763655462185
SGDClassifier() 0.5861936984656673
LogisticRegression() 0.5787219887955182
KNeighborsClassifier() 0.5230643723765167


In [7]:
#hyper-tuning random forest classifier: n_estimators
def n_estimators(x):
  pipeline1 = make_pipeline(
      StandardScaler(),
      RandomForestClassifier(n_estimators = x),
  )
  scores = cross_val_score(
      pipeline1,
      x_train,
      y_train,
      cv=10,
      scoring='f1_macro')
  return scores.mean()

f1scores = pd.Series([])
for x in range(1, 101):
  f1score = n_estimators(x)
  f1scores[x] = f1score
print(f1scores.sort_values(ascending=False))

64    0.631504
80    0.625954
1     0.622593
49    0.621325
40    0.619841
        ...   
12    0.551232
7     0.548519
8     0.545816
6     0.535036
3     0.518065
Length: 100, dtype: float64


In [8]:
#trying different combinations of features
from itertools import combinations
potentialFeatures = ['Median Age', 'Male Population', 'Female Population',
            'Total Population', 'Number of Veterans', 'Foreign-born',
            'Average Household Size', 'White Percentage', 'Minority Percentage']
combination = []
for feat in combinations(potentialFeatures, 8):
  print(feat)
  list(feat)

pipeline = make_pipeline(
      StandardScaler(),
      RandomForestClassifier(n_estimators = 43),
  )
f1 = {}
y_train = training['InNOut']
for size in range(1, len(potentialFeatures) +1):
  for feat in combinations(potentialFeatures, size):
    scores = cross_val_score(
      pipeline,
      x_train,
      y_train,
      cv=10,
      scoring='f1_macro')
    f1[str(list(feat))] = scores.mean()
sorted = pd.Series(f1).sort_values(ascending=False)
sorted

('Median Age', 'Male Population', 'Female Population', 'Total Population', 'Number of Veterans', 'Foreign-born', 'Average Household Size', 'White Percentage')
('Median Age', 'Male Population', 'Female Population', 'Total Population', 'Number of Veterans', 'Foreign-born', 'Average Household Size', 'Minority Percentage')
('Median Age', 'Male Population', 'Female Population', 'Total Population', 'Number of Veterans', 'Foreign-born', 'White Percentage', 'Minority Percentage')
('Median Age', 'Male Population', 'Female Population', 'Total Population', 'Number of Veterans', 'Average Household Size', 'White Percentage', 'Minority Percentage')
('Median Age', 'Male Population', 'Female Population', 'Total Population', 'Foreign-born', 'Average Household Size', 'White Percentage', 'Minority Percentage')
('Median Age', 'Male Population', 'Female Population', 'Number of Veterans', 'Foreign-born', 'Average Household Size', 'White Percentage', 'Minority Percentage')
('Median Age', 'Male Population', '

['Male Population', 'Number of Veterans', 'Foreign-born', 'Average Household Size', 'White Percentage']                                         0.650756
['Median Age', 'Number of Veterans', 'Minority Percentage']                                                                                     0.637692
['Female Population', 'Number of Veterans', 'Average Household Size', 'Minority Percentage']                                                    0.633021
['Median Age', 'Female Population', 'Total Population', 'Foreign-born', 'Average Household Size', 'White Percentage', 'Minority Percentage']    0.629563
['Median Age', 'Male Population', 'Female Population', 'White Percentage', 'Minority Percentage']                                               0.628887
                                                                                                                                                  ...   
['Median Age', 'Average Household Size']                                          

In [9]:
#testing cv=10 vs cv=5
x_train1 = training[['Median Age', 'Male Population', 'Female Population', 'Average Household Size', 'Minority Percentage'] ]
y_train = training['InNOut']
eastModel1 = east[['Median Age', 'Male Population', 'Female Population', 'Average Household Size', 'Minority Percentage'] ].dropna()
pipeline1 = make_pipeline(
      StandardScaler(),
      RandomForestClassifier(n_estimators = 43),
      )
scores = cross_val_score(
      pipeline1,
      x_train1,
      y_train,
      cv=10,
      scoring='f1_macro')

scores1 = cross_val_score(
      pipeline,
      x_train1,
      y_train,
      cv=5,
      scoring='f1_macro')
print(scores.mean(), scores1.mean())

0.6046300283520563 0.6066403221163098


In [10]:
#Final Model 1
x_train1 = training[['Median Age', 'Male Population', 'Female Population', 'Average Household Size', 'Minority Percentage'] ]
y_train = training['InNOut']
x_test = testing[['Median Age', 'Male Population', 'Female Population', 'Average Household Size', 'Minority Percentage']]

eastModel1 = east[['Median Age', 'Male Population', 'Female Population', 'Average Household Size', 'Minority Percentage'] ].dropna()
pipeline1 = make_pipeline(
      StandardScaler(),
      RandomForestClassifier(n_estimators = 43),
      )
scores1 = cross_val_score(
      pipeline1,
      x_train1,
      y_train,
      cv=5,
      scoring='f1_macro')
print(scores1.mean())

0.563729580106896


In [11]:
#confusion matrix

pipeline1.fit(x_train1, y_train)
y_predict = pipeline1.predict(x_test)

pd.DataFrame(
   confusion_matrix(y_test, y_predict),
   index=pipeline1.classes_,
   columns=pipeline1.classes_)

Unnamed: 0,False,True
False,57,5
True,17,4


**Model 1 Analysis:**

Model 1 gives an f1_macro score of .6495. This means this model predicts locations accurately 64.95% of the time. Model 1 uses the following features: median age, male population, female population, and white percentage. It performs very poorly when predicting True, and fairly well when predicting False.

**MODEL #2**: Logistics Regression

In [12]:
#testing C values
def logistic_C(C):
    pipeline2 = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=C),
    )
    scores = cross_val_score(
        pipeline2,
        x_train,
        y_train,
        cv=10,
        scoring='f1_macro'
    )
    return scores.mean()

C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]  # You can adjust these values as needed
f1_scores = {}
for C in C_values:
    f1_scores[C] = logistic_C(C)
f1_scores

{0.001: 0.4464604023427552,
 0.01: 0.5003071683218743,
 0.1: 0.5944342106344005,
 1.0: 0.5787219887955182,
 10.0: 0.5994857904060942,
 100.0: 0.5840020631908678}

In [13]:
#trying different combinations of features
potential_features = ['Median Age', 'Male Population', 'Female Population',
                      'Total Population', 'Number of Veterans', 'Foreign-born',
                      'Average Household Size', 'White Percentage', 'Minority Percentage']
for size in range(1, len(potential_features) + 1):
    for feat in combinations(potential_features, size):
        pipeline2 = make_pipeline(
            StandardScaler(),
            LogisticRegression(C=10.0),  # Adjust the C value as needed
        )
        scores = cross_val_score(
            pipeline2,
            x_train[list(feat)],  # Use the selected features
            y_train,
            cv=10,
            scoring='f1_macro'
        )
        f1_scores[str(list(feat))] = scores.mean()

sorted_f1_scores = pd.Series(f1_scores).sort_values(ascending=False)
sorted_f1_scores

['Foreign-born', 'Average Household Size', 'White Percentage']                               0.623424
['Foreign-born', 'Average Household Size', 'Minority Percentage']                            0.623424
['Foreign-born', 'Average Household Size', 'White Percentage', 'Minority Percentage']        0.623424
['Total Population', 'Number of Veterans', 'Average Household Size']                         0.620882
['Female Population', 'Total Population', 'Number of Veterans', 'Average Household Size']    0.620882
                                                                                               ...   
['Median Age', 'White Percentage', 'Minority Percentage']                                    0.439009
['White Percentage']                                                                         0.435648
['Minority Percentage']                                                                      0.435648
['White Percentage', 'Minority Percentage']                                       

In [14]:
#testing cv=10 vs cv=5
x_train2 = training[['Foreign-born', 'Average Household Size', 'Minority Percentage'] ]
y_train = training['InNOut']
eastModel2 = east[['Foreign-born', 'Average Household Size', 'Minority Percentage'] ].dropna()
pipeline2 = make_pipeline(
      StandardScaler(),
      LogisticRegression(C=10.0),
      )
scores = cross_val_score(
      pipeline2,
      x_train2,
      y_train,
      cv=10,
      scoring='f1_macro')

scores2 = cross_val_score(
      pipeline2,
      x_train2,
      y_train,
      cv=5,
      scoring='f1_macro')
print(scores.mean(), scores2.mean())

0.6234242824695861 0.6150652879728967


In [15]:
#Final Model 2
x_train2 = training[['Foreign-born', 'Average Household Size', 'Minority Percentage']]
y_train = training['InNOut']
x_test2 = testing[['Foreign-born', 'Average Household Size', 'Minority Percentage']]

eastModel2 = east[['Foreign-born', 'Average Household Size', 'Minority Percentage'] ].dropna()
pipeline2 = make_pipeline(
      StandardScaler(),
      LogisticRegression(C=10.0),
      )
scores2 = cross_val_score(
      pipeline2,
      x_train2,
      y_train,
      cv=10,
      scoring='f1_macro')
print(scores2.mean())

0.6234242824695861


In [16]:
#confusion matrix

pipeline2.fit(x_train2, y_train)
y_predict = pipeline2.predict(x_test2)

pd.DataFrame(
   confusion_matrix(y_test, y_predict),
   index=pipeline2.classes_,
   columns=pipeline2.classes_)

Unnamed: 0,False,True
False,59,3
True,18,3


**Model 2 Analysis:**

Model 2 has an f1_macro score of .6234. The confusion matrix for this model shows that it performs poorly when correctly predicting True, and it performs well when correctly predicting False. Based on the f1_macro score, this model ia a little less accurate than the previous one, only predicting location correctly 62.34% of the time.

MODEL #3: Ridge Classifier

In [17]:
#MODEL 3: ridge classifier
x_train = training[['Median Age', 'Male Population', 'Female Population',
            'Total Population', 'Number of Veterans', 'Foreign-born',
            'Average Household Size', 'White Percentage', 'Minority Percentage']]
y_train = training['InNOut']

#trying different alphas
def ridge_alpha(alpha):
    pipeline3 = make_pipeline(
        StandardScaler(),
        RidgeClassifier(alpha=alpha),
    )
    scores = cross_val_score(
        pipeline3,
        x_train,
        y_train,
        cv=10,
        scoring='f1_macro'
    )
    return scores.mean()
alpha_values = [0.1, 1.0, 10.0]  # You can adjust these values as needed
f1_scores = {}
for alpha in alpha_values:
    f1_scores[alpha] = ridge_alpha(alpha)

f1_scores

{0.1: 0.590359853618867, 1.0: 0.574763655462185, 10.0: 0.5765628978864275}

In [18]:
#trying different combinations of features
potential_features = ['Median Age', 'Male Population', 'Female Population',
                      'Total Population', 'Number of Veterans', 'Foreign-born',
                      'Average Household Size', 'White Percentage', 'Minority Percentage']
for size in range(1, len(potential_features) + 1):
    for feat in combinations(potential_features, size):
        pipeline3 = make_pipeline(
            StandardScaler(),
            RidgeClassifier(alpha=1.0),  # Adjust the alpha value as needed
        )
        scores = cross_val_score(
            pipeline3,
            x_train[list(feat)],  # Use the selected features
            y_train,
            cv=10,
            scoring='f1_macro'
        )
        f1_scores[str(list(feat))] = scores.mean()

sorted_f1_scores = pd.Series(f1_scores).sort_values(ascending=False)
sorted_f1_scores

0.1                                                                                                              0.590360
['Female Population', 'Average Household Size', 'White Percentage']                                              0.578838
['Male Population', 'Female Population', 'Foreign-born', 'Average Household Size', 'White Percentage']           0.578838
['Male Population', 'Female Population', 'Total Population', 'Average Household Size', 'Minority Percentage']    0.578838
['Male Population', 'Female Population', 'Total Population', 'Average Household Size', 'White Percentage']       0.578838
                                                                                                                   ...   
['Median Age', 'Minority Percentage']                                                                            0.425041
['Median Age', 'White Percentage', 'Minority Percentage']                                                        0.425041
['White Percentage']    

In [19]:
#Final Model 3
x_train3 = training[['Female Population', 'Average Household Size', 'White Percentage']]
y_train = training['InNOut']
east_model3 = east[['Female Population', 'Average Household Size', 'White Percentage']].dropna()
x_test3 = testing[['Female Population', 'Average Household Size', 'White Percentage']]

pipeline3 = make_pipeline(
            StandardScaler(),
            LogisticRegression(C=10.0),
        )
scores3 = cross_val_score(
      pipeline3,
      x_train3,
      y_train,
      cv=10,
      scoring='f1_macro')
print(scores3.mean())

0.6039546727782023


In [20]:
#confusion matrix

pipeline3.fit(x_train3, y_train)
y_predict = pipeline3.predict(x_test3)

pd.DataFrame(
   confusion_matrix(y_test, y_predict),
   index=pipeline3.classes_,
   columns=pipeline3.classes_)

Unnamed: 0,False,True
False,59,3
True,17,4


**Model 3 Analysis**
Model 3 has an f1_macro score of .604. It follows the trend of performing poorly at predicting True and performing well at predcting False. Based on the f1_macro score, this model is worse than both of the previous ones we attempted, prediction location correctly only 60.4% of the time.

**Model comparison**

Ultimately, based on the three models we created, and their f1_macro scores, we can conclude that using RandomForestClassifier is the most accurate model to use. We proceeded our analysis using Model 1.

In [21]:
#Predicting with Model 1
x_train1 = training[['Median Age', 'Male Population', 'Female Population', 'White Percentage']]
y_train = training['InNOut']
eastModel1 = east[['Median Age', 'Male Population', 'Female Population', 'White Percentage']] #159
pipeline1 = make_pipeline(
      StandardScaler(),
      RandomForestClassifier(n_estimators = 43),
      )

pipeline1.fit(x_train1, y_train)
y_predict = pipeline1.predict(eastModel1) #159
east['Predictions'] = y_predict
east[east['Predictions']==True]

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,White Percentage,Minority Percentage,InNOut,Predictions
0,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,0.271013,0.728987,False,True
42,Columbus,Georgia,33.7,98785.0,101794.0,200579,21747.0,10376.0,2.62,GA,0.485774,0.514226,False,True
52,Richmond,Virginia,33.6,104793.0,115496.0,220289,12538.0,15741.0,2.29,VA,0.474686,0.525314,False,True
157,Fayetteville,North Carolina,30.7,101051.0,100914.0,201965,28089.0,12863.0,2.5,NC,0.505409,0.494591,False,True
161,Charlotte,North Carolina,34.3,396646.0,430475.0,827121,36046.0,128897.0,2.52,NC,0.540181,0.459819,False,True
465,Jersey City,New Jersey,34.3,131765.0,132512.0,264277,4374.0,109186.0,2.57,NJ,0.375742,0.624258,False,True
526,Baltimore,Maryland,34.7,294027.0,327822.0,621849,29540.0,49857.0,2.51,MD,0.333319,0.666681,False,True
607,Boynton Beach,Florida,45.8,33701.0,40271.0,73972,4727.0,15431.0,2.45,FL,0.67193,0.32807,False,True
733,Jacksonville,Florida,35.7,419203.0,448828.0,868031,75432.0,85650.0,2.62,FL,0.626224,0.373776,False,True
870,Augusta-Richmond County consolidated government,Georgia,33.7,94662.0,101917.0,196579,19085.0,7915.0,2.67,GA,0.396482,0.603518,False,True


Next, we used the Whataburger dataset to test out how well our chosen model predicts the correct location.

In [22]:
cities = pd.Series([])
whata_locs = pd.Series([])

for x, y in whataburger['city'].items():
  whata_locs[x] = y

for x, y in df['City'].items():
  cities[x] = y

matches = cities.isin(whata_locs)
df['in_df'] = matches

westWhata = df[(df['State Code'] == 'CA') | (df['State Code'] == 'TX') | (df['State Code'] == 'OR')
 |(df['State Code'] == 'WA') |(df['State Code'] == 'CO') |(df['State Code'] == 'NV') |(df['State Code'] == 'ID') |
  (df['State Code'] == 'MT') |(df['State Code'] == 'AZ') |(df['State Code'] == 'NM') |(df['State Code'] == 'UT') |
  (df['State Code'] == 'WY')]

eastWhata = (df[((df['State Code'] == 'ME') | (df['State Code'] == 'VA') | (df['State Code'] == 'NC')
 |(df['State Code'] == 'SC') |(df['State Code'] == 'GA') |(df['State Code'] == 'FL') |(df['State Code'] == 'PA') |
  (df['State Code'] == 'NY') |(df['State Code'] == 'DE') |(df['State Code'] == 'RI') |(df['State Code'] == 'CT') |
  (df['State Code'] == 'VT') |(df['State Code'] == 'MA') |(df['State Code'] == 'NH') |(df['State Code'] == 'MD')|
  (df['State Code'] == 'NJ')) & (df['City'] != 'Union City') &(df['City'] != 'Reading')]).dropna()

In [23]:
x_trainV = westWhata[['Median Age', 'Male Population', 'Female Population', 'White Percentage']]
y_trainV = westWhata['in_df']

eastPred = eastWhata[['Median Age', 'Male Population', 'Female Population', 'White Percentage']]

pipeline1.fit(x_trainV, y_trainV)
y_predict = pipeline1.predict(eastPred)
eastWhata['Predictions'] = y_predict
predictions = eastWhata[eastWhata['Predictions']==True]
predictions

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,White Percentage,Minority Percentage,InNOut,in_df,Predictions
133,Charleston,South Carolina,35.0,63956.0,71568.0,135524,9368.0,5767.0,2.4,SC,0.76751,0.23249,False,False,True
152,Chesapeake,Virginia,36.7,114964.0,120465.0,235429,29772.0,13966.0,2.73,VA,0.638893,0.361107,False,False,True
568,Miami,Florida,40.4,215840.0,225149.0,440989,7233.0,260789.0,2.5,FL,0.766985,0.233015,False,False,True
739,Hialeah,Florida,43.0,111530.0,125552.0,237082,1844.0,170148.0,3.31,FL,0.927911,0.072089,False,False,True
1007,Brandon,Florida,36.1,55679.0,58289.0,113968,9417.0,16390.0,2.64,FL,0.709067,0.290933,False,False,True
1091,Saint Petersburg,Florida,41.8,123524.0,133564.0,257088,20247.0,28567.0,2.42,FL,0.69673,0.30327,False,False,True
1262,Alexandria,Virginia,36.6,74989.0,78522.0,153511,10635.0,44030.0,2.2,VA,0.691905,0.308095,False,False,True
2300,Orlando,Florida,33.1,130940.0,139977.0,270917,12782.0,50558.0,2.42,FL,0.661166,0.338834,False,False,True
2327,Gainesville,Florida,26.0,60803.0,69330.0,130133,4788.0,15272.0,2.33,FL,0.67325,0.32675,False,True,True
2486,Lynchburg,Virginia,28.7,38614.0,41198.0,79812,4322.0,4364.0,2.48,VA,0.673169,0.326831,False,False,True


In [24]:
correctPred = whataburger[(whataburger['city'] == 'Charleston')|(whataburger['city'] == 'Miami')|(whataburger['city'] == 'Hialeah')|
                   (whataburger['city'] == 'Homestead')|(whataburger['city'] == 'Saint Petersburg')|(whataburger['city'] == 'Alexandria')|
                   (whataburger['city'] == 'Wilmington')|(whataburger['city'] == 'Columbia')|(whataburger['city'] == 'Orlando')|
                   (whataburger['city'] == 'Gainsville')|(whataburger['city'] == 'Stamford')|(whataburger['city'] == 'Cambridge')|
                   (whataburger['city'] == 'Raleigh')]
correctPred

Unnamed: 0,city,state


**Verification Analysis**

This verification used a dataset on Whataburger locations. We split the data into Whataburger locations in western states and Whataburger locations in eastern states. We trained Model 1 on the western locations and predicted on the eastern locations to test if cities that have Whataburger locations were correctly identified. With the confusion matrix results for this model in mind, it is not surprising that the model did not predict any cities that currently have a Whataburger location.

**Reflection**

Based on the results from testing out our Whataburger data, our model was not very accurate. If we were to do this project again, we would need to collect more data in order to create a better predictive model. Our training data did not have enough rows where an In-N-Out existed, to make the code learn well enough, so the model was not the best at predicting the values to be True.