# School Location Choice Model

In this notebook, a simple logit location choice model is used to predict university campus choice, using data from the 2015 StudentMoveTO survey. The model is run separately on each student segment.

First, we import Biogeme and pandas, and load the original data.

In [1]:
import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
from biogeme.expressions import Beta, DefineVariable
#import biogeme.messaging as msg
import pandas as pd
import math

hh_df = pd.read_csv('../Data/SMTO_2015/SMTO_2015_Households.csv')
ps_df = pd.read_csv('../Data/SMTO_2015/SMTO_2015_Respondents.csv')

Now, we prepare a dataframe which will be used as input for Biogeme by selecting and manipulating the relevant columns from the data. We use a function to convert information about Level, Status, and Living Arrangement into a column containing the respondent's Segment. Years of Involvement is loaded as well, but the function below does not use it.

In [None]:
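def info_to_segment(x):
    # Segment codes: 0-2 undergraduate, 3-5 graduate, 6 other.
    # Within each level: not part-time and living with family, not part-time
    # and not living with family, then part-time. Family must already be 0/1.
    if x.Level == 'UG':
        return 2 if x.Status == 'PT' else int(not x.Family)
    elif x.Level == 'Grad':
        return 5 if x.Status == 'PT' else 3 + int(not x.Family)
    else:
        return 6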

# Load relevant columns
df = ps_df[['pscampusattend', 'personstatusgrad', 'personstatustime', 'psuniversityinvolvednumyears']]
# Join household columns by row index (assumes both files share the same row order)
df = df.join(hh_df[['HmTTS2006', 'hhlivingsituation']])
df = df.rename(columns={'HmTTS2006': 'HomeZone', 'pscampusattend': 'Campus', 'hhlivingsituation': 'Family', 
                       'personstatusgrad': 'Level', 'personstatustime': 'Status', 'psuniversityinvolvednumyears': 'Years'})
df = df.dropna() # Remove rows with missing data

# Convert Campus column to numerical column
campus_name_to_num = {"Downtown Toronto (St. George)": 0, "Scarborough (UTSC)": 1, "Mississauga (UTM)": 2,
                      "Keele": 3, "Glendon": 4, "RyersonU": 5, "OCADu": 6}
df.replace({'Campus': campus_name_to_num}, inplace=True)
df['HomeZone'] = pd.to_numeric(df['HomeZone'], downcast='signed')
df['Family'] = (df['Family'] == 'Live with family/parents')*1
df['Segment'] = df.apply(info_to_segment, axis=1)
df['Available'] = 1 # All campuses available to all students

df['Segment'].value_counts()
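
As a quick sanity check of the segment coding, the sketch below applies the same branching logic to a few made-up respondents. The rows are hypothetical (including the `'FT'` and `'Other'` labels, which are assumptions about how the data is coded), and `Family` is taken to be already recoded to 0/1 as in the cell above.

```python
from collections import namedtuple

# Hypothetical respondents; Family is 1 if living with family/parents, else 0
Row = namedtuple("Row", ["Level", "Status", "Family"])

def info_to_segment(x):
    # Same branching as the notebook's function
    if x.Level == 'UG':
        return 2 if x.Status == 'PT' else (not x.Family) * 1
    elif x.Level == 'Grad':
        return 5 if x.Status == 'PT' else (3 + (not x.Family))
    else:
        return 6

examples = [
    Row('UG', 'FT', 1),
    Row('UG', 'FT', 0),
    Row('UG', 'PT', 1),
    Row('Grad', 'FT', 1),
    Row('Grad', 'FT', 0),
    Row('Grad', 'PT', 0),
    Row('Other', 'FT', 1),
]
segments = [info_to_segment(r) for r in examples]
print(segments)  # [0, 1, 2, 3, 4, 5, 6]
```

Note that the function can return seven distinct codes (0 to 6).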

Keeping only the relevant columns, we have:

In [None]:
df = df[['Campus', 'HomeZone', 'Segment', 'Available']]
df

Now we must load information about the distance between each respondent's home zone and each campus. First, we load the data from the level-of-service (LoS) matrix.

In [None]:
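# Dataframe with walk distances
df_path = pd.read_csv('../../LoS/Walk_Distances.csv')
# Unique origins in file order; a plain set() would not preserve the ordering
# that the flat index arithmetic in find_distance relies on
origins = df_path['Origin'].unique().tolist()
dists = list(df_path['Data'])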

Now, we declare a function to look up the walk distance given an origin-destination pair.

In [None]:
not_found = set()
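# Function for distance lookup.
# The LoS data is treated as a flattened square matrix with 2392 zones per row,
# so entry (i, j) sits at position i*2392 + j. Unknown zones return 0 and are
# recorded in not_found.
def find_distance(origin, destination):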
    try:
        i = origins.index(origin)
    except ValueError:
        not_found.add(origin)
        return 0
    try:
        j = origins.index(destination)
    except ValueError:
        not_found.add(destination)
        return 0
    return dists[i*2392 + j] / 1000
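
`list.index` scans the whole list on every call, which adds up when the lookup runs once per respondent and campus. One possible alternative, sketched below on a made-up 3-zone matrix, is to build a zone-to-position dictionary once. The zone numbers, distances, and the names `build_lookup` and `zone_pos` are illustrative assumptions, not part of the original notebook.

```python
def build_lookup(origins, dists):
    """Return a find_distance function over a flattened square matrix.

    origins: zone ids in the order they appear along each matrix axis
    dists:   row-major list of distances, len(origins)**2 entries
    """
    n = len(origins)
    zone_pos = {zone: k for k, zone in enumerate(origins)}  # O(1) lookups
    not_found = set()

    def find_distance(origin, destination):
        i = zone_pos.get(origin)
        j = zone_pos.get(destination)
        if i is None:
            not_found.add(origin)
        if j is None:
            not_found.add(destination)
        if i is None or j is None:
            return 0  # same sentinel as the notebook: 0 marks an unknown distance
        return dists[i * n + j] / 1000

    return find_distance, not_found

# Hypothetical 3-zone example
origins = [10, 20, 30]
dists = [0, 1500, 4000,
         1500, 0, 2500,
         4000, 2500, 0]
find_distance, not_found = build_lookup(origins, dists)
print(find_distance(10, 30))  # 4.0
print(find_distance(20, 99))  # 0, and 99 is recorded in not_found
print(not_found)
```

Using `len(origins)` for the row length instead of a hard-coded constant keeps the index arithmetic valid if the zone system changes.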

We run this function for each row and each campus zone, creating seven new columns (`Dist0` to `Dist6`).

In [None]:
# TTS zone of each campus from Joven's MOE data (same order as campus_name_to_num)
campus_zones = [69, 566, 3631, 391, 225, 38, 67]

# Load distances in dataframe
for i in range(7):
    df["Dist" + str(i)] = df['HomeZone'].apply(lambda x: find_distance(x, campus_zones[i]))
len(not_found)

127 zones were not found as they are outside the GTHA. Let's see what our dataframe looks like now:

In [None]:
df

We are now ready to run our model! We do this once per segment, each time keeping only that segment's respondents. The following function is used to print the results from each run.

In [None]:
def print_results(results):
    # Note: i is the global loop variable set by the estimation loop below
    stats = results.getGeneralStatistics()
    print("___________Segment " + str(i) + "__________")
    print("n:" + str(stats['Sample size'][0]), "\tR^2", stats['Rho-square for the init. model'][0])
    print(results.getEstimatedParameters()[['Value', 'p-value']])
    print()
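
The value labelled R^2 here is the rho-square relative to the initial model. As far as we understand it, this is the likelihood ratio index, one minus the ratio of the final to the initial log-likelihood. A minimal sketch with made-up log-likelihood values (the function name `rho_square` is our own, not a Biogeme call):

```python
def rho_square(ll_init, ll_final):
    """Likelihood ratio index: 1 - LL(final) / LL(initial)."""
    return 1 - ll_final / ll_init

# Hypothetical log-likelihood values
ll_init = -1200.0
ll_final = -900.0
print(rho_square(ll_init, ll_final))  # 0.25
```

A value of 0 means the estimated model fits no better than the initial one, and values closer to 1 indicate a larger improvement.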

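Finally, let's run the model. We first load the enrollment figures for each campus; their logarithms are used as fixed alternative-specific constants in the utility functions.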

In [None]:
enrollment_df = pd.read_csv('../Data/Enrolment/Joven_Enrollment.csv')
enrollment_df = enrollment_df.set_index('School')
enrollment_df

In [None]:
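def get_log_enrollment(segment, school):
    # Segments 0-2 use undergraduate enrollment, 3-5 graduate, 6 total
    column = 'UG' if segment < 3 else ('Grad' if segment < 6 else 'Total')
    return math.log1p(enrollment_df.loc[school, column])  # log(1 + enrollment)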
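
Fixing each constant at the log of enrollment makes it act as a size term: when every campus has the same distance term, the logit choice probabilities are proportional to `1 + enrollment` (the `1 +` comes from `log1p`, which also keeps a zero enrollment finite). A small check with made-up enrollment figures, which are assumptions for illustration only:

```python
import math

# Hypothetical enrollment counts, not taken from the enrollment file
enrollment = {'A': 30000, 'B': 10000, 'C': 0}

# Fixed constants, computed as in get_log_enrollment
asc = {k: math.log1p(v) for k, v in enrollment.items()}

# Logit probabilities when every alternative has the same distance term
denom = sum(math.exp(a) for a in asc.values())
prob = {k: math.exp(a) / denom for k, a in asc.items()}

# Shares implied directly by (1 + enrollment)
total = sum(1 + v for v in enrollment.values())
share = {k: (1 + v) / total for k, v in enrollment.items()}

for k in enrollment:
    print(k, round(prob[k], 6), round(share[k], 6))
```

The two printed columns agree, so any departure from enrollment shares in the fitted model comes from the distance term.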

In [None]:
for i in range(6):
    temp_df = df.copy()
    database = db.Database("SMTO", temp_df)
    globals().update(database.variables)
    database.remove(Dist0 == 0) # Remove unknown distances
    
    database.remove(Segment != i)
    
    # Beta initialization: (name, value, lowerbound, upperbound, status, desc='')
    # Status 0 if estimated, 1 if fixed. Every ASC is fixed at log(1 + enrollment),
    # so B_DIST is the only estimated parameter
    ASC_SG = Beta('ASC_SG', get_log_enrollment(i, 'SG'), None, None, 1)
    ASC_SC = Beta('ASC_SC', get_log_enrollment(i, 'SC'), None, None, 1)
    ASC_MI = Beta('ASC_MI', get_log_enrollment(i, 'MI'), None, None, 1)
    ASC_YK = Beta('ASC_YK', get_log_enrollment(i, 'YK'), None, None, 1)
    ASC_YG = Beta('ASC_YG', get_log_enrollment(i, 'YG'), None, None, 1)
    ASC_RY = Beta('ASC_RY', get_log_enrollment(i, 'RY'), None, None, 1)
    ASC_OC = Beta('ASC_OC', get_log_enrollment(i, 'OC'), None, None, 1)
    B_DIST = Beta('B_DIST', 0, None, None, 0)

    # Variables: from columns in database
    AV = DefineVariable('AV', Available, database)
    SG_DIST = DefineVariable('SG_DIST', Dist0, database)
    SC_DIST = DefineVariable('SC_DIST', Dist1, database)
    MI_DIST = DefineVariable('MI_DIST', Dist2, database)
    YK_DIST = DefineVariable('YK_DIST', Dist3, database)
    YG_DIST = DefineVariable('YG_DIST', Dist4, database)
    RY_DIST = DefineVariable('RY_DIST', Dist5, database)
    OC_DIST = DefineVariable('OC_DIST', Dist6, database)

    # Utility functions: fixed ASC plus a distance term shared across campuses
    V0 = ASC_SG + B_DIST * SG_DIST
    V1 = ASC_SC + B_DIST * SC_DIST
    V2 = ASC_MI + B_DIST * MI_DIST
    V3 = ASC_YK + B_DIST * YK_DIST
    V4 = ASC_YG + B_DIST * YG_DIST
    V5 = ASC_RY + B_DIST * RY_DIST
    V6 = ASC_OC + B_DIST * OC_DIST

    V  = {0: V0, 1: V1, 2: V2, 3: V3, 4: V4, 5: V5, 6: V6}
    av = {0: AV, 1: AV, 2: AV, 3: AV, 4: AV, 5: AV, 6: AV}
    
    logprob = models.loglogit(V, av, Campus)

    biogeme = bio.BIOGEME(database, logprob, numberOfThreads=1)
    biogeme.modelName = "SMTO_Location_Choice_Enrollment/Segment_" + str(i)
    results = biogeme.estimate(saveIterations=True)
    print_results(results)
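
To make the estimated quantity concrete, the sketch below evaluates the same utility specification, `V = ASC + B_DIST * distance`, for a single respondent in plain Python and converts it to logit probabilities and a log-likelihood contribution. The constants, distances, and the value of `b_dist` are made-up numbers, and `logit_probabilities` is a hypothetical helper rather than a Biogeme function.

```python
import math

def logit_probabilities(utilities):
    """Multinomial logit probabilities from a dict of utilities."""
    v_max = max(utilities.values())  # subtract the max for numerical stability
    exp_v = {k: math.exp(v - v_max) for k, v in utilities.items()}
    denom = sum(exp_v.values())
    return {k: e / denom for k, e in exp_v.items()}

# Hypothetical inputs for one respondent and three campuses
asc = {0: math.log1p(30000), 1: math.log1p(10000), 2: math.log1p(12000)}
dist_km = {0: 25.0, 1: 5.0, 2: 40.0}
b_dist = -0.1  # illustrative value; Biogeme estimates this parameter

utilities = {k: asc[k] + b_dist * dist_km[k] for k in asc}
probs = logit_probabilities(utilities)

chosen = 1  # observed campus for this respondent
log_likelihood = math.log(probs[chosen])

print(probs)
print(log_likelihood)
```

Summing such log-likelihood contributions over all respondents in a segment gives the objective that the estimation maximizes with respect to `B_DIST`.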


Results information:  
http://biogeme.epfl.ch/jupyter/bioResults.html