# School Location Choice Model

In this notebook, a simple logit location choice model is used to predict university campus choice, using data from the 2015 StudentMoveTO survey. The model is estimated separately for each respondent segment (four in this version of the notebook).

First, we import Biogeme and pandas, and load the original data.

In [1]:
import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
from biogeme.expressions import Beta, DefineVariable
#import biogeme.messaging as msg
import pandas as pd

hh_df = pd.read_csv('../Data/SMTO_2015/SMTO_2015_Households.csv')
ps_df = pd.read_csv('../Data/SMTO_2015/SMTO_2015_Respondents.csv')

Now, we prepare a dataframe to use as input for Biogeme by selecting and manipulating the relevant columns from the data. We use a function to convert information about Level and Living Arrangement into a column containing the respondent's Segment. Status and Years of Involvement are loaded as well, but the segmentation function below does not use them.

In [2]:
def info_to_segment(x):
    return (not x.Family) + (0 if (x.Level == 'UG') else 2)

# Load relevant columns
df = ps_df[['pscampusattend', 'personstatusgrad', 'personstatustime', 'psuniversityinvolvednumyears']]
df = df.join(hh_df[['HmTTS2006', 'hhlivingsituation']])
df = df.rename(columns={'HmTTS2006': 'HomeZone', 'pscampusattend': 'Campus', 'hhlivingsituation': 'Family', 
                       'personstatusgrad': 'Level', 'personstatustime': 'Status', 'psuniversityinvolvednumyears': 'Years'})
df = df.dropna() # Remove rows with missing data

# Convert Campus column to numerical column
campus_name_to_num = {"Downtown Toronto (St. George)": 0, "Scarborough (UTSC)": 1, "Mississauga (UTM)": 2,
                      "Keele": 3, "Glendon": 4, "RyersonU": 5, "OCADu": 6}
df.replace({'Campus': campus_name_to_num}, inplace=True)
df['HomeZone'] = pd.to_numeric(df['HomeZone'], downcast='signed')
df['Family'] = (df['Family'] == 'Live with family/parents')*1
df['Segment'] = df.apply(info_to_segment, axis=1)
df['Available'] = 1 # All campuses available to all students

df['Segment'].value_counts()

0    7310
1    3787
3    2706
2    1107
Name: Segment, dtype: int64
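
To make the segment codes easier to read, here is a minimal sketch of how `info_to_segment` maps the two inputs to the four codes. The `Row` tuple and the `'Grad'` label are hypothetical stand-ins for a dataframe row and for any Level value other than `'UG'`; they are not taken from the survey data.

```python
from collections import namedtuple

# Hypothetical stand-in for a dataframe row; only the two fields the function reads
Row = namedtuple("Row", ["Family", "Level"])

def info_to_segment(x):
    # Same rule as in the notebook: +1 if not living with family, +2 if Level is not 'UG'
    return (not x.Family) + (0 if (x.Level == 'UG') else 2)

# Reading of the codes implied by the rule above
segment_labels = {
    0: "UG, lives with family",
    1: "UG, does not live with family",
    2: "non-UG, lives with family",
    3: "non-UG, does not live with family",
}

examples = [Row(1, 'UG'), Row(0, 'UG'), Row(1, 'Grad'), Row(0, 'Grad')]
codes = [info_to_segment(r) for r in examples]
for r, c in zip(examples, codes):
    print(r, "->", c, "(" + segment_labels[c] + ")")
```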

Keeping only the relevant columns, we have:

In [3]:
df = df[['Campus', 'HomeZone', 'Segment', 'Available']]
df

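| | Campus | HomeZone | Segment | Available |
|---|---|---|---|---|
| 0 | 1 | 261 | 0 | 1 |
| 1 | 0 | 71 | 3 | 1 |
| 2 | 0 | 3714 | 0 | 1 |
| 3 | 0 | 74 | 1 | 1 |
| 4 | 0 | 71 | 3 | 1 |
| ... | ... | ... | ... | ... |
| 15221 | 3 | 212 | 1 | 1 |
| 15222 | 3 | 233 | 0 | 1 |
| 15223 | 3 | 95 | 1 | 1 |
| 15224 | 3 | 2221 | 0 | 1 |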


Now we must load information about the distance between each respondent's home zone and each campus. First, we load the data from the LoS matrix.

In [4]:
# Dataframe with walk distances
df_path = pd.read_csv('../../LoS/Walk_Distances.csv')
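# unique() keeps the order in which zones appear in the file; set() does not guarantee
# any order, and the flat index used in the lookup below depends on that order
origins = list(df_path['Origin'].unique())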
dists = list(df_path['Data'])

Now, we declare a function to look up the walk distance given an origin-destination pair.

In [5]:
not_found = set()
# Function for distance lookup
def find_distance(origin, destination):
    try:
        i = origins.index(origin)
    except ValueError:
        not_found.add(origin)
        return 0
    try:
        j = origins.index(destination)
    except ValueError:
        not_found.add(destination)
        return 0
    return dists[i*2392 + j] / 1000
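
The lookup above treats the distance column as a square matrix flattened row by row, so the entry for an origin at position `i` and a destination at position `j` sits at index `i * n_zones + j`. The sketch below illustrates that indexing on a made-up 3-zone matrix. The zone numbers and distances are invented, and the dictionary `zone_index` is an assumed alternative to repeated `list.index` calls, not part of the original notebook.

```python
# Toy "long format" matrix: one value per (origin, destination) pair,
# ordered by origin and then by destination. All numbers are made up.
zones = [10, 20, 30]
flat_dists = [0,    1500, 4000,
              1500, 0,    2500,
              4000, 2500, 0]

# Map each zone number to its position once, instead of searching a list each time
zone_index = {z: k for k, z in enumerate(zones)}
n_zones = len(zones)

def lookup_km(origin, destination):
    i = zone_index.get(origin)
    j = zone_index.get(destination)
    if i is None or j is None:
        return 0  # same placeholder the notebook uses for unknown zones
    # Row-major position in the flattened matrix, then divide by 1000 as in the notebook
    return flat_dists[i * n_zones + j] / 1000

print(lookup_km(10, 30))
print(lookup_km(20, 30))
print(lookup_km(99, 10))
```

This only works if the file really is a complete square matrix in that order, which is the same assumption `find_distance` makes.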

We apply this function to each row for each campus zone, creating seven new distance columns.

In [6]:
# TTS zones of the seven campuses, from Joven's MOE data (same order as campus_name_to_num)
campus_zones = [69, 566, 3631, 391, 225, 38, 67]

# Load distances in dataframe
for i in range(7):
    df["Dist" + str(i)] = df['HomeZone'].apply(lambda x: find_distance(x, campus_zones[i]))
len(not_found)

127

127 zones were not found as they are outside the GTHA. `find_distance` returns 0 for these zones, and the affected rows are removed before estimation. Let's see what our dataframe looks like now:

In [7]:
df

| | Campus | HomeZone | Segment | Available | Dist0 | Dist1 | Dist2 | Dist3 | Dist4 | Dist5 | Dist6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 261 | 0 | 1 | 10.256060 | 14.88098 | 29.20657 | 22.59214 | 9.218413 | 9.580635 | 11.241730 |
| 1 | 0 | 71 | 3 | 1 | 1.132351 | 23.03920 | 19.64290 | 15.87906 | 11.211150 | 2.675173 | 2.723838 |
| 2 | 0 | 3714 | 0 | 1 | 23.319230 | 45.63271 | 4.51742 | 28.58045 | 32.555200 | 24.964000 | 23.686150 |
| 3 | 0 | 74 | 1 | 1 | 0.699414 | 24.11954 | 19.43932 | 16.81186 | 12.830410 | 2.314008 | 1.541276 |
| 4 | 0 | 71 | 3 | 1 | 1.132351 | 23.03920 | 19.64290 | 15.87906 | 11.211150 | 2.675173 | 2.723838 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15221 | 3 | 212 | 1 | 1 | 7.120260 | 19.01731 | 23.76046 | 14.19530 | 5.732956 | 6.553092 | 8.193741 |
| 15222 | 3 | 233 | 0 | 1 | 15.917590 | 12.03644 | 32.96591 | 17.68772 | 6.019180 | 15.242170 | 16.903260 |
| 15223 | 3 | 95 | 1 | 1 | 2.783940 | 25.09743 | 17.84462 | 15.53600 | 12.829600 | 4.733398 | 3.979057 |
| 15224 | 3 | 2221 | 0 | 1 | 23.379880 | 26.15476 | 37.35434 | 13.30458 | 15.379040 | 23.250580 | 24.518920 |


We are now ready to run our model! We do this once per segment (four times here), each time keeping only the rows for that segment. The following function is used to print the results from each run.

In [8]:
def print_results(results):
    # Note: `i` is not an argument; it is the segment index read from the global loop variable
    print("___________Segment " + str(i) + "__________")
    print("n:" + str(results.getGeneralStatistics()['Sample size'][0]), "\tR^2", results.getGeneralStatistics()['Rho-square for the init. model'][0])
    print(results.getEstimatedParameters()[['Value', 'p-value']])
    print()
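
The "R^2" printed above is the rho-square relative to the initial model. As we understand it, this compares the final log likelihood with the log likelihood at the starting parameter values. Since every parameter in this notebook starts at 0 and all seven campuses are available to everyone, the initial model assigns probability 1/7 to each campus. The sketch below illustrates the definition under that assumption; the sample size and the final log likelihood are hypothetical values chosen for the example, not results from the survey.

```python
import math

def initial_loglik(n_obs, n_alternatives):
    # With all parameters at 0, each alternative has probability 1 / n_alternatives
    return n_obs * math.log(1.0 / n_alternatives)

def rho_square(loglik_final, loglik_init):
    # 0 means no improvement over the initial model; values closer to 1 mean a better fit
    return 1.0 - loglik_final / loglik_init

ll_init = initial_loglik(1000, 7)  # hypothetical sample of 1000 respondents
ll_final = 0.8 * ll_init           # hypothetical final log likelihood
rho = rho_square(ll_final, ll_init)
print(round(rho, 3))
```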

Finally, let's run the model.

In [10]:
for i in range(4):
    temp_df = df.copy()
    database = db.Database("SMTO", temp_df)
    globals().update(database.variables)
    database.remove(Dist0 == 0) # Remove unknown distances
    
    database.remove(Segment != i)
    
    # Beta initialization: (name, value, lowerbound, upperbound, status, desc='')
    # Status: 0 if estimated, 1 if held fixed; the reference alternative (ASC_YG) is fixed at 0
    ASC_SG = Beta('ASC_SG', 0, None, None, 0)
    ASC_SC = Beta('ASC_SC', 0, None, None, 0)
    ASC_MI = Beta('ASC_MI', 0, None, None, 0)
    ASC_YK = Beta('ASC_YK', 0, None, None, 0)
    ASC_YG = Beta('ASC_YG', 0, None, None, 1)
    ASC_RY = Beta('ASC_RY', 0, None, None, 0)
    ASC_OC = Beta('ASC_OC', 0, None, None, 0)
    B_DIST = Beta('B_DIST', 0, None, None, 0)

    # Variables: from columns in database
    AV = DefineVariable('AV', Available, database)
    SG_DIST = DefineVariable('SG_DIST', Dist0, database)
    SC_DIST = DefineVariable('SC_DIST', Dist1, database)
    MI_DIST = DefineVariable('MI_DIST', Dist2, database)
    YK_DIST = DefineVariable('YK_DIST', Dist3, database)
    YG_DIST = DefineVariable('YG_DIST', Dist4, database)
    RY_DIST = DefineVariable('RY_DIST', Dist5, database)
    OC_DIST = DefineVariable('OC_DIST', Dist6, database)

    # Utility Functions: note ASC_YG is 0
    V0 = ASC_SG + B_DIST * SG_DIST
    V1 = ASC_SC + B_DIST * SC_DIST
    V2 = ASC_MI + B_DIST * MI_DIST
    V3 = ASC_YK + B_DIST * YK_DIST
    V4 = ASC_YG + B_DIST * YG_DIST
    V5 = ASC_RY + B_DIST * RY_DIST
    V6 = ASC_OC + B_DIST * OC_DIST

    V  = {0: V0, 1: V1, 2: V2, 3: V3, 4: V4, 5: V5, 6: V6}
    av = {0: AV, 1: AV, 2: AV, 3: AV, 4: AV, 5: AV, 6: AV}
    
    logprob = models.loglogit(V, av, Campus)

    biogeme = bio.BIOGEME(database, logprob, numberOfThreads=1)
    biogeme.modelName = "SMTO_Mod_2_Segmented_Output/SMTO_Campus_Choice_Segment_" + str(i)
    results = biogeme.estimate(saveIterations=True)
    print_results(results)


___________Segment 0__________
n:7221 	R^2 0.1906045050172679
           Value  p-value
ASC_MI  1.135481  0.00000
ASC_OC  0.215536  0.02838
ASC_RY  2.292203  0.00000
ASC_SC  1.451279  0.00000
ASC_SG  2.314392  0.00000
ASC_YK  2.163624  0.00000
B_DIST -0.066455  0.00000

___________Segment 1__________
n:3645 	R^2 0.4375934034204848
           Value       p-value
ASC_MI  0.565082  1.726372e-04
ASC_OC -0.133929  2.967667e-01
ASC_RY  1.142239  0.000000e+00
ASC_SC  0.959644  2.176903e-11
ASC_SG  2.000747  0.000000e+00
ASC_YK  2.000966  0.000000e+00
B_DIST -0.163347  0.000000e+00

___________Segment 2__________
n:1079 	R^2 0.41994535800795163
           Value   p-value
ASC_MI  0.828133  0.041993
ASC_OC  0.455907  0.293688
ASC_RY  3.027166  0.000000
ASC_SC  0.835999  0.041244
ASC_SG  4.305832  0.000000
ASC_YK  3.207224  0.000000
B_DIST -0.057274  0.000000

___________Segment 3__________
n:2645 	R^2 0.5034146725144619
           Value       p-value
ASC_MI  1.430642  4.329096e-08
ASC_OC  0.1374

Results information:  
http://biogeme.epfl.ch/jupyter/bioResults.html
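
As a rough illustration of how to read the estimates, the sketch below plugs the Segment 0 parameter values and the distances from the first row of the dataframe shown earlier (a Segment 0 respondent) into the logit formula, where each campus has utility `ASC + B_DIST * distance` and the choice probability is the exponentiated utility divided by the sum over all campuses. This is a hand calculation written for this example, not a Biogeme call, and the variable names are our own.

```python
import math

# Segment 0 estimates copied from the output above; ASC_YG is the fixed reference (0)
asc = {"SG": 2.314392, "SC": 1.451279, "MI": 1.135481, "YK": 2.163624,
       "YG": 0.0, "RY": 2.292203, "OC": 0.215536}
b_dist = -0.066455

# Dist0..Dist6 for the first row of the dataframe, in campus order
campuses = ["SG", "SC", "MI", "YK", "YG", "RY", "OC"]
dists = [10.256060, 14.88098, 29.20657, 22.59214, 9.218413, 9.580635, 11.241730]

# Systematic utility of each campus
utilities = [asc[c] + b_dist * d for c, d in zip(campuses, dists)]

# Logit probabilities; subtracting the maximum keeps the exponentials well behaved
v_max = max(utilities)
exp_v = [math.exp(v - v_max) for v in utilities]
total = sum(exp_v)
probs = [e / total for e in exp_v]

for c, p in zip(campuses, probs):
    print(c, round(p, 3))
```

Because `B_DIST` is negative, a campus becomes less likely as its distance from home grows, while the constants capture how attractive each campus is relative to Glendon (`YG`) at equal distance.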