In [1]:
# Import useful libraries
import pandas as pd
import numpy as np
import os
import pickle

Looking at the tables, there are a few observations, also confirmed by our "client partners".

The tables beginning with R_ are reference tables containing descriptions of codes found in the data tables. Those include:
* *R\_ANZSCO*: occupation titles defined by ANZSCO at 6 levels e.g. Professionals -> Education Professionals -> School Teachers -> School Teachers nfd -> Art Teacher <br><br>

* *R_ASGS_2016*: contains 7-, 5-, 4-, and 3-digit codes that represent ABS statistical areas in decreasing size, as well as a description for each. More can be read about the Australian Statistical Geography Standard (ASGS) at the [ABS website](https://www.abs.gov.au/websitedbs/D3310114.nsf/home/Australian+Statistical+Geography+Standard+(ASGS)).<br><br>

* *R_Distance_bins*: these represent distance but in groups (bins) e.g. BIN_1KM contains the description '0 to 0.99', BIN_2km contains '0 to 1.99'.<br><br>

* *R_GROUP_MAINACT*: contains combinations of an individual's main activity and "grouped main activity" e.g. if the main activity is 'Full-time Work', this is mapped to the grouped main activity 'Worker'.<br><br>

* *R_GROUP_MODE*: contains combinations of transport mode choices and several group mode choices. e.g. if the main mode choice is "Car passenger", the GROUPMODE1 is 'Vehicle as Passenger', GROUPMODE2 is 'Private Vehicle', GROUPMODE3 is 'Vehicle as Passenger'.<br><br>

* *R_LGA*: the local government area (council) the household belongs to. <br><br>

* *R_MAINACT*: contains combinations of main activity with studying and work status. This table is not populated for all combinations. <br><br>

* *R_MAINMODE*: contains combinations of main transport mode and 7 other modes e.g. Train, Car passenger, Walking, Public bus, School bus (with route number). <br><br>

* *R_OVERALL_PURPOSE*: contains combinations of original purpose, destination purpose, overall purpose (1) and overall purpose (3). Only original purpose appears to change for each of the 5 total combinations, e.g. ORIGPURP = 'Shopping'. <br><br>

* *R_REGION*: contains a mapping of regions in Queensland but only two distinct: Sunshine Coast and Greater Brisbane. <br><br>

* *R_TIME*: contains combinations of time values that are represented within 15-minute blocks and peak / non-peak information. These appear to be labels used for scheduling. <br><br>

* *R_TRAVELWHYNOT*: contains reasons why travellers did not travel e.g. "Illness". <br><br> 

* *RP_AGE_GROUP*: contains a mapping from age group to multiple age groupings and their descriptions, e.g. age 19 maps to age group ID 5, which is described as "20-24 years" and "18-29 years".

There are five tables that contain the results of the Queensland Transport Survey (QTS) and include:

* *1_QTS_HOUSEHOLDS*: contains data at the household level e.g. household size, bikes owned, dwelling type and represented by a unique identifier HHID.<br><br>

* *2_QTS_PERSONS*: contains 42 columns of data at the individual level including unique individual identifier, household ID, age, sex, etc. <br><br>

* *3_QTS_VEHICLES*: contains data about the number of vehicles per household and their details e.g. type (car), make, year.<br><br> 

* *4_QTS_STOPS*: contains data about the travel behaviour, specifically the number of stops individuals make on their travel route with STOPID as unique identifier. Includes links to other tables such as individual ID, household ID and Trip ID. <br><br>

* *5_QTS_TRIPS*: contains data about the trips individuals made such as number of stops, mode choice, destination etc. with a unique Trip identifier.

The remaining two tables contain weighting information and include:

* WGT_DEMOG
* WGT_HH




Next, the tables must be combined. This will be dependent on the types of information that should be included in the modelling. Factors to consider here include:

* Classes - one of the objectives of the analysis is to identify latent classes i.e. classes that are not obvious. These are likely to be determined by a clustering technique of some kind.

* Mode Choice - once the class an individual belongs to is identified, this can be used to understand how likely a member of that class is to choose a particular mode of transport. However, there may also be value in understanding individual choice as well.

## Update Data Types

### 1_QTS_HOUSEHOLDS table
The following columns must be kept as strings because they contain unexpected values that indicate non-response, and a full coding system will not be implemented here.
* REFUSEAGE
* REFUSESEX 
* REFUSESIZE
* REFUSEVEH
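The count columns in this table (HHSIZE, BIKES, HHVEH) contain missing values, which is why pandas reads them as floats. A minimal sketch of one way to hold them as integers while leaving the refusal columns as strings is shown below. The toy DataFrame and its values are invented for illustration and are not taken from the survey.

```python
import numpy as np
import pandas as pd

# Toy stand-in for 1_QTS_HOUSEHOLDS; the values are invented for illustration.
households = pd.DataFrame({
    'HHID': [1.0, 2.0, 3.0],
    'HHSIZE': [2.0, np.nan, 4.0],
    'REFUSESIZE': [None, 'Refused', None],
})

# HHID has no missing values, so a plain integer type is enough.
households['HHID'] = households['HHID'].astype('int64')

# The nullable Int64 dtype stores whole numbers and keeps missing values as <NA>.
households['HHSIZE'] = households['HHSIZE'].astype('Int64')

# The refusal column stays as strings; here we only flag whether it is populated.
refused_size = households['REFUSESIZE'].notna()

print(households.dtypes)
print('Households with a REFUSESIZE entry:', refused_size.sum())
```

Whether the downstream modelling libraries accept the nullable dtype is a separate question, so this is a storage choice rather than a recommendation.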

In [2]:
# Import the python objects created in 01-Access_Database_Import.ipynb
# This is where the objects were exported
filepath = './Access_Tables/'
df = pd.read_pickle(filepath + '1_QTS_HOUSEHOLDS.pyobj')

In [3]:
#df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10094 entries, 0 to 10093
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   HHID           10094 non-null  float64
 1   STATUS         10094 non-null  object 
 2   HHSIZE         7724 non-null   float64
 3   BIKES          7573 non-null   float64
 4   HHVEH          6740 non-null   float64
 5   DWELLTYPE      10094 non-null  object 
 6   REFUSEAGE      2354 non-null   object 
 7   REFUSESEX      2357 non-null   object 
 8   REFUSESIZE     2354 non-null   object 
 9   REFUSEVEH      2354 non-null   object 
 10  SURVEYWEEK     10094 non-null  int64  
 11  STRATA_LGA     10094 non-null  int64  
 12  TRAVDATE       10094 non-null  int64  
 13  TRAVMONTH      10094 non-null  int64  
 14  TRAVYEAR       10094 non-null  int64  
 15  TRAVDOW        10094 non-null  int64  
 16  HOME_SA1_2016  10094 non-null  int64  
 17  HHWGT_17       5448 non-null   float64
dtypes: flo

In [4]:
# Import the python objects created in 01-Access_Database_Import.ipynb
# This is where the objects were exported
filepath = './Access_Tables/'
df = pd.read_pickle(filepath + '1_QTS_HOUSEHOLDS.pyobj')

# 1_QTS_HOUSEHOLDS table
# Convert columns to correct types
df['HHID'] = pd.to_numeric(df['HHID'], downcast='integer')

# Convert Reject values to 0 and Accept values to 1
m = {'Reject':0, 'Accept': 1}
#df['STATUS'] = df['STATUS'].map(m).astype('bool')
df['STATUS'] = pd.to_numeric(df['STATUS'].map(m), downcast = "integer", errors = "coerce")
#df['HHSIZE'] = pd.array(df['HHSIZE'], dtype="Int64")
#df['BIKES'] = pd.array(df['BIKES'], dtype="Int64")
#df['HHVEH'] = pd.array(df['HHVEH'], dtype="Int64")

# Check ID uniqueness
print('Number of rows:', df.shape[0])
print('Number of unique HHID:', df['HHID'].nunique())

#df.head()
df.info()

# Export changes
#df.to_pickle(filepath + '1_QTS_HOUSEHOLDS.pyobj')
df.to_csv(filepath + '1_QTS_HOUSEHOLDS.csv')

Number of rows: 10094
Number of unique HHID: 10094
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10094 entries, 0 to 10093
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   HHID           10094 non-null  int16  
 1   STATUS         10094 non-null  int8   
 2   HHSIZE         7724 non-null   float64
 3   BIKES          7573 non-null   float64
 4   HHVEH          6740 non-null   float64
 5   DWELLTYPE      10094 non-null  object 
 6   REFUSEAGE      2354 non-null   object 
 7   REFUSESEX      2357 non-null   object 
 8   REFUSESIZE     2354 non-null   object 
 9   REFUSEVEH      2354 non-null   object 
 10  SURVEYWEEK     10094 non-null  int64  
 11  STRATA_LGA     10094 non-null  int64  
 12  TRAVDATE       10094 non-null  int64  
 13  TRAVMONTH      10094 non-null  int64  
 14  TRAVYEAR       10094 non-null  int64  
 15  TRAVDOW        10094 non-null  int64  
 16  HOME_SA1_2016  10094 non-null  int64  
 17 

### 2_QTS_PERSONS table

In [5]:
# 2_QTS_PERSONS table
df = pd.read_pickle(filepath + '2_QTS_PERSONS.pyobj')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20202 entries, 0 to 20201
Data columns (total 43 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   HHID              20202 non-null  int16         
 1   PERSID            20202 non-null  object        
 2   AGEGROUP          17724 non-null  float64       
 3   SEX               17762 non-null  object        
 4   RELATIONSHIP      10917 non-null  object        
 5   CARLICENCE        20202 non-null  bool          
 6   CARLICTYPE        11830 non-null  object        
 7   MBLICENCE         20202 non-null  bool          
 8   MBLICTYPE         984 non-null    object        
 9   OTHERLICENCE      20202 non-null  bool          
 10  WORKSTATUS        14008 non-null  object        
 11  ANZSCO_1-digit    8094 non-null   Int64         
 12  ANZSCO_3-digit    8094 non-null   Int64         
 13  INDUSTRY          8854 non-null   object        
 14  STUDYING          4359

In [6]:
# 2_QTS_PERSONS table
df = pd.read_pickle(filepath + '2_QTS_PERSONS.pyobj')

# Convert columns to correct types
df['HHID'] = pd.to_numeric(df['HHID'], downcast='integer')

# This appears to be age without decimal places rather than an age group
#df['AGEGROUP'].fillna(-1, inplace=True)
#df['AGEGROUP'] = pd.to_numeric(df['AGEGROUP'], downcast='signed')
#df['AGEGROUP'] = pd.array(df['AGEGROUP'], dtype="Int16")

df['ANZSCO_1-digit'] = pd.array(df['ANZSCO_1-digit'], dtype="Int64")
df['ANZSCO_3-digit'] = pd.array(df['ANZSCO_3-digit'], dtype="Int64")

# df['ASSISTAGE'] = df['ASSISTAGE'].fillna(0).astype('bool')
# df['ASSISTLTHC'] = df['ASSISTLTHC'].fillna(0).astype('bool')
# df['ASSISTSTHC'] = df['ASSISTSTHC'].fillna(0).astype('bool')
# df['ASSISTDISABILITY'] = df['ASSISTDISABILITY'].fillna(0).astype('bool')
# df['ASSISTENGLISH'] = df['ASSISTENGLISH'].fillna(0).astype('bool')
# df['ASSISTOTHER'] = df['ASSISTOTHER'].fillna(0).astype('bool')
# df['ASSISTANY'] = df['ASSISTANY'].fillna(0).astype('bool')

# df['RIDESHAREENT'] = df['RIDESHAREENT'].fillna(0).astype('bool')
# df['RIDESHAREHC'] = df['RIDESHAREHC'].fillna(0).astype('bool')
# df['RIDESHAREED'] = df['RIDESHAREED'].fillna(0).astype('bool')
# df['RIDESHARESHOP'] = df['RIDESHARESHOP'].fillna(0).astype('bool')
# df['RIDESHAREWORK'] = df['RIDESHAREWORK'].fillna(0).astype('bool')
# df['RIDESHAREOTHER'] = df['RIDESHAREOTHER'].fillna(0).astype('bool') 

# Several "Select One" values should be made equal to missing
df.loc[df['TAXITRIPS']=='Select One', 'TAXITRIPS'] = np.nan

# df['TAXIENT'] = df['TAXIENT'].fillna(0).astype('bool')
# df['TAXIHC'] = df['TAXIHC'].fillna(0).astype('bool')
# df['TAXIED'] = df['TAXIED'].fillna(0).astype('bool')
# df['TAXISHOP'] = df['TAXISHOP'].fillna(0).astype('bool')
# df['TAXIWORK'] = df['TAXIWORK'].fillna(0).astype('bool')
# df['TAXIOTHER'] = df['TAXIOTHER'].fillna(0).astype('bool') 

# Check ID uniqueness
print('Number of rows:', df.shape[0])
print('Number of unique PERSID:', df['PERSID'].nunique())
print()

#df.head()
df.info()

# Export changes
#df.to_pickle(filepath + '2_QTS_PERSONS.pyobj')
df.to_csv(filepath + '2_QTS_PERSONS.csv')

# Take an explicit copy so later column assignments do not act on a slice of df
df_person_reduced = df[['HHID', 'PERSID', 'AGEGROUP', 'SEX']].copy()

Number of rows: 20202
Number of unique PERSID: 20202

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20202 entries, 0 to 20201
Data columns (total 43 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   HHID              20202 non-null  int16         
 1   PERSID            20202 non-null  object        
 2   AGEGROUP          17724 non-null  float64       
 3   SEX               17762 non-null  object        
 4   RELATIONSHIP      10917 non-null  object        
 5   CARLICENCE        20202 non-null  bool          
 6   CARLICTYPE        11830 non-null  object        
 7   MBLICENCE         20202 non-null  bool          
 8   MBLICTYPE         984 non-null    object        
 9   OTHERLICENCE      20202 non-null  bool          
 10  WORKSTATUS        14008 non-null  object        
 11  ANZSCO_1-digit    8094 non-null   Int64         
 12  ANZSCO_3-digit    8094 non-null   Int64         
 13  INDUSTRY          8854 n

Investigating the AGEGROUP column, we find it's a double. This should be an integer as it will be used to map to the reference table later on. Therefore, we'll also drop the rows with missing values here, which removes about 2,500 of the roughly 20,000 rows (20,202 down to 17,724).
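As a small sketch of this step, the rows missing AGEGROUP or SEX can be dropped explicitly and AGEGROUP cast to an integer in one pass. The toy data is invented, and naming the columns in `subset` is an assumption about which gaps matter, not what the notebook cell does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the reduced persons table; values are invented.
persons = pd.DataFrame({
    'HHID': [1, 1, 2],
    'PERSID': ['1/1000', '1/1001', '2/1000'],
    'AGEGROUP': [15.0, np.nan, 10.0],
    'SEX': ['female', 'male', None],
})

# Drop only the rows missing the columns needed later, and take a copy
# so the cast below acts on an independent DataFrame.
persons_clean = persons.dropna(subset=['AGEGROUP', 'SEX']).copy()

# With no missing values left, the float column can be cast to a small integer type.
persons_clean['AGEGROUP'] = persons_clean['AGEGROUP'].astype('int8')

print(len(persons), '->', len(persons_clean))
print(persons_clean.dtypes)
```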

In [7]:
df_person_reduced = df_person_reduced.dropna()

In [8]:
#df_person_reduced.info()
df_person_reduced.head()

Unnamed: 0,HHID,PERSID,AGEGROUP,SEX
0,1,1/1000,15.0,female
1,1,1/1001,16.0,male
2,1000,1000/1000,13.0,male
3,1000,1000/1001,15.0,female
4,10007,10007/1000,10.0,female


Now let's look at the SEX column, which appears to contain the string values "male" and "female".

In [9]:
df_person_reduced['SEX'].unique()

array(['female', 'male'], dtype=object)

Only these two labels are included here. They will be mapped to 0 for female and 1 for male.

In [10]:
df_person_reduced['SEX'] = df_person_reduced['SEX'].map({'female': 0, 'male': 1})

In [11]:
df_person_reduced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17724 entries, 0 to 20201
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   HHID      17724 non-null  int16  
 1   PERSID    17724 non-null  object 
 2   AGEGROUP  17724 non-null  float64
 3   SEX       17724 non-null  int64  
dtypes: float64(1), int16(1), int64(1), object(1)
memory usage: 588.5+ KB


### 3_QTS_VEHICLES table

In [12]:
# 3_QTS_VEHICLES table
df = pd.read_pickle(filepath + '3_QTS_VEHICLES.pyobj')

# Convert columns to correct types
# Fix year 68 to 1968 and convert 0s and 1s to NA
m = {0: np.nan, 1: np.nan, 68: 1968}
df['YEAR'] = pd.array(df['YEAR'].replace(m), dtype="Int64")

# Check ID uniqueness
print('Number of rows:', df.shape[0])
print('Number of unique VEHID:', df['VEHID'].nunique())
print()

#df.head()
df.info()

# Export changes
#df.to_pickle(filepath + '3_QTS_VEHICLES.pyobj')
df.to_csv(filepath + '3_QTS_VEHICLES.csv')

Number of rows: 11798
Number of unique VEHID: 11798

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11798 entries, 0 to 11797
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   VEHID     11798 non-null  int64 
 1   HHID      11798 non-null  int64 
 2   FUELTYPE  11766 non-null  object
 3   TYPE      11798 non-null  object
 4   YEAR      11664 non-null  Int64 
dtypes: Int64(1), int64(2), object(2)
memory usage: 472.5+ KB


### 4_QTS_STOPS table

In [13]:
# 4_QTS_STOPS table
df = pd.read_pickle(filepath + '4_QTS_STOPS.pyobj')

# Convert columns to correct types
df['STOPID'] = pd.to_numeric(df['STOPID'], downcast='integer')
df['HHID'] = pd.to_numeric(df['HHID'], downcast='integer')
df['TRIPID'] = pd.array(df['TRIPID'], dtype = 'Int64')
df['NOONE'] = df['NOONE'].fillna(0).astype('bool')
df['VEHOCC'] = pd.array(df['VEHOCC'], dtype = 'Int64')
df['VEHID'] = pd.array(df['VEHID'], dtype = 'Int64')
df['DURATION'] = pd.to_numeric(df['DURATION'])
df['TRAVTIME'] = pd.to_numeric(df['TRAVTIME'], downcast='float')

# Check ID uniqueness
print('Number of rows:', df.shape[0])
print('Number of unique STOPID:', df['STOPID'].nunique())
print()

#df.head()
df.info()

# Export changes
#df.to_pickle(filepath + '4_QTS_STOPS.pyobj')
df.to_csv(filepath + '4_QTS_STOPS.csv')

Number of rows: 44880
Number of unique STOPID: 44880

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44880 entries, 0 to 44879
Data columns (total 21 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   STOPID        44880 non-null  int64  
 1   HHID          44880 non-null  int16  
 2   PERSID        44880 non-null  object 
 3   TRIPID        44879 non-null  Int64  
 4   STOPNO        44880 non-null  int64  
 5   STARTIME      44880 non-null  int64  
 6   ORIGPLACE     44880 non-null  object 
 7   ORIGPURP      44880 non-null  object 
 8   ORIGSA1_2016  44880 non-null  int64  
 9   DESTPLACE     44880 non-null  object 
 10  DESTPURP      44880 non-null  object 
 11  DESTSA1_2016  44880 non-null  int64  
 12  NOONE         44880 non-null  bool   
 13  MODE          44880 non-null  object 
 14  VEHOCC        34982 non-null  Int64  
 15  VEHID         33370 non-null  Int64  
 16  VEHPARKED     34869 non-null  object 
 17  ARRTIME       448

### 5_QTS_TRIPS table

In [14]:
# 5_QTS_TRIPS table
df = pd.read_pickle(filepath + '5_QTS_TRIPS.pyobj')

# Convert columns to correct types
df['TRIPID'] = pd.to_numeric(df['TRIPID'], downcast='integer')
df['HHID'] = pd.to_numeric(df['HHID'], downcast='integer')
df['DURATION'] = pd.to_numeric(df['DURATION'])
df['TRAVTIME'] = pd.to_numeric(df['TRAVTIME'], downcast='float')

# Check ID uniqueness
print('Number of rows:', df.shape[0])
print('Number of unique TRIPID:', df['TRIPID'].nunique())
print()

#df.head()
df.info()

# Export changes
#df.to_pickle(filepath + '5_QTS_TRIPS.pyobj')
df.to_csv(filepath + '5_QTS_TRIPS.csv')

Number of rows: 40470
Number of unique TRIPID: 40470

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40470 entries, 0 to 40469
Data columns (total 25 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   TRIPID           40470 non-null  int64  
 1   HHID             40470 non-null  int16  
 2   PERSID           40470 non-null  object 
 3   STARTSTOP        40470 non-null  int64  
 4   ENDSTOP          40470 non-null  int64  
 5   STARTIME         40470 non-null  int64  
 6   ORIGPLACE        40470 non-null  object 
 7   ORIGPURP         40470 non-null  object 
 8   ORIGSA1_2016     40470 non-null  int64  
 9   DESTPLACE        40470 non-null  object 
 10  DESTPURP         40470 non-null  object 
 11  DESTSA1_2016     40470 non-null  int64  
 12  MAINMODE         40470 non-null  object 
 13  MODE1            40470 non-null  object 
 14  MODE2            2087 non-null   object 
 15  MODE3            1804 non-null   object 
 16  MODE4 

Basic model with few variables

Variables chosen based on Koppelman and Bhat (2006):

* Income
* Automobile ownership
* Sex
* Age group
* Travel time
* Travel cost



In [15]:
df_trips = df

In [16]:
df_trips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40470 entries, 0 to 40469
Data columns (total 25 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   TRIPID           40470 non-null  int64  
 1   HHID             40470 non-null  int16  
 2   PERSID           40470 non-null  object 
 3   STARTSTOP        40470 non-null  int64  
 4   ENDSTOP          40470 non-null  int64  
 5   STARTIME         40470 non-null  int64  
 6   ORIGPLACE        40470 non-null  object 
 7   ORIGPURP         40470 non-null  object 
 8   ORIGSA1_2016     40470 non-null  int64  
 9   DESTPLACE        40470 non-null  object 
 10  DESTPURP         40470 non-null  object 
 11  DESTSA1_2016     40470 non-null  int64  
 12  MAINMODE         40470 non-null  object 
 13  MODE1            40470 non-null  object 
 14  MODE2            2087 non-null   object 
 15  MODE3            1804 non-null   object 
 16  MODE4            310 non-null    object 
 17  MODE5       

In [17]:
df_trips_reduced = df_trips[['TRIPID', 'HHID', 'PERSID', 'MAINMODE', 'TRAVTIME', 'OVERALL_PURPOSE']]

In [18]:
df_trips_reduced.head()

Unnamed: 0,TRIPID,HHID,PERSID,MAINMODE,TRAVTIME,OVERALL_PURPOSE
0,371000001,37,37/1000,Car driver,20.0,Pickup/Dropoff Someone
1,371000002,37,37/1000,Car driver,25.0,Pickup/Dropoff Someone
2,371000003,37,37/1000,Car driver,5.0,Shopping
3,371000004,37,37/1000,Car driver,5.0,Shopping
4,371002001,37,37/1002,Car passenger,20.0,Education


In [19]:
df_trips_reduced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40470 entries, 0 to 40469
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   TRIPID           40470 non-null  int64  
 1   HHID             40470 non-null  int16  
 2   PERSID           40470 non-null  object 
 3   MAINMODE         40470 non-null  object 
 4   TRAVTIME         40470 non-null  float32
 5   OVERALL_PURPOSE  40470 non-null  object 
dtypes: float32(1), int16(1), int64(1), object(3)
memory usage: 1.5+ MB


In [20]:
df_trips_reduced['MAINMODE'].unique()

array(['Car driver', 'Car passenger', 'Walking', 'Motorcycle driver',
       'Bicycle', 'Charter/Courtesy/Other bus', 'Truck passenger',
       'Truck driver', 'Taxi or ride share e.g. Uber',
       'School bus (private/chartered)', 'Mobility scooter', 'Train',
       'Other method', 'Ferry', 'Light rail', 'Motorcycle passenger',
       'Public bus', 'Taxi', 'Uber / Other Ride Share', 'Taxi or Uber',
       'School bus (with route number)', 'Public Bus'], dtype=object)

Looking at the counts below, it's clear that some modes aren't used often. Since the number of alternatives should be kept between 4 and 7 for simplicity (Koppelman & Bhat, 2006, p. 9), some of these will be grouped together.

Proposed groupings:
* Bicycle + Walking -> Active (~4,000)
* Ferry +  Light rail + Public Bus + Public bus -> PublicTransport (~835)
* Car driver (~23,000)
* Car passenger (~10,000)
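The proposed groupings can be checked before committing to them. The sketch below maps each mode label to a group name and counts the result. The toy trips and the `MODE_GROUP` column name are assumptions made for illustration; the group names follow the list above.

```python
import pandas as pd

# Toy stand-in for the trips table; the rows are invented.
trips = pd.DataFrame({'MAINMODE': [
    'Car driver', 'Walking', 'Public Bus', 'Public bus',
    'Bicycle', 'Taxi', 'Car passenger', 'Ferry',
]})

# Both spellings of the bus label appear in the data, so both are listed.
mode_groups = {
    'Bicycle': 'Active',
    'Walking': 'Active',
    'Ferry': 'PublicTransport',
    'Light rail': 'PublicTransport',
    'Public Bus': 'PublicTransport',
    'Public bus': 'PublicTransport',
    'Car driver': 'Car driver',
    'Car passenger': 'Car passenger',
}

# Labels missing from the dictionary become NaN, which makes them easy to spot.
trips['MODE_GROUP'] = trips['MAINMODE'].map(mode_groups)
group_counts = trips['MODE_GROUP'].value_counts(dropna=False)

print(group_counts)
```

Counting with `dropna=False` shows how many trips fall outside the four groups, which is the number of rows that would be lost by restricting the analysis to them.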

In [21]:
df_trips_reduced.groupby(by=['MAINMODE']).count()

Unnamed: 0_level_0,TRIPID,HHID,PERSID,TRAVTIME,OVERALL_PURPOSE
MAINMODE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Bicycle,537,537,537,537,537
Car driver,23374,23374,23374,23374,23374
Car passenger,10233,10233,10233,10233,10233
Charter/Courtesy/Other bus,148,148,148,148,148
Ferry,61,61,61,61,61
Light rail,56,56,56,56,56
Mobility scooter,20,20,20,20,20
Motorcycle driver,187,187,187,187,187
Motorcycle passenger,9,9,9,9,9
Other method,15,15,15,15,15


Here, we will code the modes of transport into four groups:
1. Bicycle, Walking -> 3
2. Ferry, Light rail, Public Bus -> 2
3. Car passenger -> 1
4. Car driver -> 0

In [22]:
df_trips_reduced.to_pickle(filepath + 'df_trips_reduced.pyobj')
#df_person_trips.to_pickle(filepath + 'df_person_trips.pyobj')

In [23]:
# In a new column 'MODE_CODE', convert the values from 'MAINMODE' to their integer codes.
# Modes that are not listed map to NaN and are dropped in a later cell.
mode_codes = {
    'Bicycle': 3, 'Walking': 3,
    'Ferry': 2, 'Light rail': 2, 'Public Bus': 2, 'Public bus': 2,
    'Car driver': 0,
    'Car passenger': 1,
}
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_trips_reduced = df_trips_reduced.assign(MODE_CODE=df_trips_reduced['MAINMODE'].map(mode_codes))

# df_trips_reduced.loc[df_trips_reduced['MAINMODE'] == 'Bicycle', 'MODE_CODE'] = 1
# df_trips_reduced.loc[df_trips_reduced['MAINMODE'] == 'Walking', 'MODE_CODE'] = 2
# df_trips_reduced.loc[df_trips_reduced['MAINMODE'] == 'Ferry', 'MODE_CODE'] = 2
# df_trips_reduced.loc[df_trips_reduced['MAINMODE'] == 'Light rail', 'MODE_CODE'] = 2
# df_trips_reduced.loc[df_trips_reduced['MAINMODE'] == 'Public Bus', 'MODE_CODE'] = 2
# df_trips_reduced.loc[df_trips_reduced['MAINMODE'] == 'Public bus', 'MODE_CODE'] = 2
# df_trips_reduced.loc[df_trips_reduced['MAINMODE'] == 'Car driver', 'MODE_CODE'] = 3
# df_trips_reduced.loc[df_trips_reduced['MAINMODE'] == 'Car passenger', 'MODE_CODE'] = 4


In [24]:
df_trips_reduced['MODE_CODE'].unique()

array([ 0.,  1.,  3., nan,  2.])

Now reduce the number of rows by omitting other mode types. Recall that the mode types being analysed include:
* Car driver
* Car passenger
* Public Transport
* Active

In [25]:
df_trips_reduced = df_trips_reduced.dropna()

In [26]:
df_trips_reduced['MODE_CODE'].unique()

array([0., 1., 3., 2.])

In [27]:
df_trips_reduced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38316 entries, 0 to 40465
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   TRIPID           38316 non-null  int64  
 1   HHID             38316 non-null  int16  
 2   PERSID           38316 non-null  object 
 3   MAINMODE         38316 non-null  object 
 4   TRAVTIME         38316 non-null  float32
 5   OVERALL_PURPOSE  38316 non-null  object 
 6   MODE_CODE        38316 non-null  float64
dtypes: float32(1), float64(1), int16(1), int64(1), object(3)
memory usage: 2.0+ MB


Join the PERSON and TRIPS tables together
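One possible variant of this join, sketched on invented toy data, merges on both HHID and PERSID so that no duplicate `HHID_x` / `HHID_y` columns are produced, and asks pandas to check that each trip matches at most one person. Using both keys assumes HHID agrees between the two tables for every PERSID, which has not been verified here.

```python
import pandas as pd

# Toy stand-ins for the reduced trips and persons tables; values are invented.
trips = pd.DataFrame({
    'TRIPID': [1, 2, 3],
    'HHID': [37, 37, 40],
    'PERSID': ['37/1000', '37/1000', '40/1000'],
})
persons = pd.DataFrame({
    'HHID': [37],
    'PERSID': ['37/1000'],
    'AGEGROUP': [11],
    'SEX': [1],
})

merged = trips.merge(
    persons,
    how='left',
    on=['HHID', 'PERSID'],    # shared keys, so HHID appears only once
    validate='many_to_one',   # raises if a person appears more than once
    indicator=True,           # adds a _merge column recording the match status
)

# Trips whose person record is missing (for example, dropped for missing age or sex).
unmatched = (merged['_merge'] == 'left_only').sum()

print(merged.columns.tolist())
print('Trips without a person record:', unmatched)
```

The `_merge` column makes it possible to count unmatched trips directly instead of inferring them from non-null counts.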

In [28]:
df_person_trips = df_trips_reduced.merge(right = df_person_reduced, how = 'left', on = 'PERSID')

In [29]:
df_person_trips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38316 entries, 0 to 38315
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   TRIPID           38316 non-null  int64  
 1   HHID_x           38316 non-null  int16  
 2   PERSID           38316 non-null  object 
 3   MAINMODE         38316 non-null  object 
 4   TRAVTIME         38316 non-null  float32
 5   OVERALL_PURPOSE  38316 non-null  object 
 6   MODE_CODE        38316 non-null  float64
 7   HHID_y           38311 non-null  float64
 8   AGEGROUP         38311 non-null  float64
 9   SEX              38311 non-null  float64
dtypes: float32(1), float64(4), int16(1), int64(1), object(3)
memory usage: 2.9+ MB


In [30]:
df_person_trips = df_person_trips.dropna()

In [31]:
df_person_trips.head()

Unnamed: 0,TRIPID,HHID_x,PERSID,MAINMODE,TRAVTIME,OVERALL_PURPOSE,MODE_CODE,HHID_y,AGEGROUP,SEX
0,371000001,37,37/1000,Car driver,20.0,Pickup/Dropoff Someone,0.0,37.0,11.0,1.0
1,371000002,37,37/1000,Car driver,25.0,Pickup/Dropoff Someone,0.0,37.0,11.0,1.0
2,371000003,37,37/1000,Car driver,5.0,Shopping,0.0,37.0,11.0,1.0
3,371000004,37,37/1000,Car driver,5.0,Shopping,0.0,37.0,11.0,1.0
4,371002001,37,37/1002,Car passenger,20.0,Education,1.0,37.0,4.0,0.0


Fix the incorrect data types for AGEGROUP and SEX. These should be integers. Also remove the second HHID column.

In [32]:
#df_person_trips.head()
df_person_trips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38311 entries, 0 to 38315
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   TRIPID           38311 non-null  int64  
 1   HHID_x           38311 non-null  int16  
 2   PERSID           38311 non-null  object 
 3   MAINMODE         38311 non-null  object 
 4   TRAVTIME         38311 non-null  float32
 5   OVERALL_PURPOSE  38311 non-null  object 
 6   MODE_CODE        38311 non-null  float64
 7   HHID_y           38311 non-null  float64
 8   AGEGROUP         38311 non-null  float64
 9   SEX              38311 non-null  float64
dtypes: float32(1), float64(4), int16(1), int64(1), object(3)
memory usage: 2.8+ MB


The OVERALL_PURPOSE values appear to be quite clean, with a couple of lower-frequency classes: `Other Purpose` and `Pickup/Deliver Something`. This variable can be used to filter by trip purpose so that, for example, direct work commutes can be targeted for a model. However, OVERALL_PURPOSE is itself an input to the model here, so its categories are coded to integers below.

In [33]:
df_trips_reduced.groupby(by=['OVERALL_PURPOSE']).count()

Unnamed: 0_level_0,TRIPID,HHID,PERSID,MAINMODE,TRAVTIME,MODE_CODE
OVERALL_PURPOSE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Accompany Someone,3260,3260,3260,3260,3260,3260
Direct Work Commute,6757,6757,6757,6757,6757,6757
Education,3953,3953,3953,3953,3953,3953
Other Purpose,220,220,220,220,220,220
Personal Business,2654,2654,2654,2654,2654,2654
Pickup/Deliver Something,542,542,542,542,542,542
Pickup/Dropoff Someone,5900,5900,5900,5900,5900,5900
Recreation,4293,4293,4293,4293,4293,4293
Shopping,7120,7120,7120,7120,7120,7120
Social,1439,1439,1439,1439,1439,1439


Now code the trip purposes as integers. The mapping used is:

* Accompany Someone -> 0
* Direct Work Commute, Work Related -> 1
* Education -> 2
* Other Purpose -> 3
* Personal Business -> 4
* Pickup/Deliver Something -> 5
* Pickup/Dropoff Someone -> 5
* Recreation -> 6
* Shopping -> 7
* Social -> 8
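These purpose codes are labels rather than quantities, so a model that reads the column as a number will treat purpose 8 as "more" than purpose 0. One common alternative, sketched here on invented data, is to expand the code into indicator columns with `pd.get_dummies`. The `purpose` prefix and dropping the first level as the baseline are illustrative choices, not something the notebook settles on.

```python
import pandas as pd

# Toy stand-in for the merged table; the codes are invented.
trips = pd.DataFrame({'PURPOSE_CODE': [5, 7, 2, 1, 7]})

# One indicator column per purpose; the lowest code is dropped as the baseline
# so the indicators are not collinear with a model constant.
purpose_dummies = pd.get_dummies(
    trips['PURPOSE_CODE'], prefix='purpose', drop_first=True, dtype=int
)

trips_encoded = pd.concat(
    [trips.drop(columns='PURPOSE_CODE'), purpose_dummies], axis=1
)

print(trips_encoded.columns.tolist())
```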

In [34]:
# In a new column 'PURPOSE_CODE', convert the values from 'OVERALL_PURPOSE' to their integer codes
df_person_trips.loc[df_person_trips['OVERALL_PURPOSE'] == 'Accompany Someone', 'PURPOSE_CODE'] = 0
df_person_trips.loc[df_person_trips['OVERALL_PURPOSE'] == 'Direct Work Commute', 'PURPOSE_CODE'] = 1
df_person_trips.loc[df_person_trips['OVERALL_PURPOSE'] == 'Education', 'PURPOSE_CODE'] = 2
df_person_trips.loc[df_person_trips['OVERALL_PURPOSE'] == 'Other Purpose', 'PURPOSE_CODE'] = 3
df_person_trips.loc[df_person_trips['OVERALL_PURPOSE'] == 'Personal Business', 'PURPOSE_CODE'] = 4
df_person_trips.loc[df_person_trips['OVERALL_PURPOSE'] == 'Pickup/Deliver Something', 'PURPOSE_CODE'] = 5
df_person_trips.loc[df_person_trips['OVERALL_PURPOSE'] == 'Pickup/Dropoff Someone', 'PURPOSE_CODE'] = 5
df_person_trips.loc[df_person_trips['OVERALL_PURPOSE'] == 'Recreation', 'PURPOSE_CODE'] = 6
df_person_trips.loc[df_person_trips['OVERALL_PURPOSE'] == 'Shopping', 'PURPOSE_CODE'] = 7
df_person_trips.loc[df_person_trips['OVERALL_PURPOSE'] == 'Social', 'PURPOSE_CODE'] = 8
df_person_trips.loc[df_person_trips['OVERALL_PURPOSE'] == 'Work Related', 'PURPOSE_CODE'] = 1

In [35]:
df_person_trips['PURPOSE_CODE'].unique()

array([5., 7., 2., 1., 6., 0., 4., 8., 3.])

In [36]:
#df_person_trips = pd.concat([df_person_trips, pd.get_dummies(df_person_trips['PURPOSE_CODE'], prefix='purpose')], axis=1)

In [37]:
df_person_trips['AGEGROUP'] = pd.to_numeric(df_person_trips['AGEGROUP'], downcast='integer', errors='coerce')
df_person_trips['SEX'] = pd.to_numeric(df_person_trips['SEX'], downcast='integer')
df_person_trips['MODE_CODE'] = pd.to_numeric(df_person_trips['MODE_CODE'], downcast='integer')
df_person_trips['PURPOSE_CODE'] = pd.to_numeric(df_person_trips['PURPOSE_CODE'], downcast='integer')

df_person_trips = df_person_trips.drop('HHID_y', axis = 1)
df_person_trips = df_person_trips.rename(columns={"HHID_x": "HHID"})

In [47]:
df_person_trips = df_person_trips.drop(['TRIPID', 'HHID', 'PERSID', 'MAINMODE', 'OVERALL_PURPOSE'], axis = 1)

In [48]:
df_person_trips.head()

Unnamed: 0,TRAVTIME,MODE_CODE,AGEGROUP,SEX,PURPOSE_CODE
0,20.0,0,11,1,5
1,25.0,0,11,1,5
2,5.0,0,11,1,7
3,5.0,0,11,1,7
4,20.0,1,4,0,2


In [49]:
df_person_trips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38311 entries, 0 to 38315
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TRAVTIME      38311 non-null  float32
 1   MODE_CODE     38311 non-null  int8   
 2   AGEGROUP      38311 non-null  int8   
 3   SEX           38311 non-null  int8   
 4   PURPOSE_CODE  38311 non-null  int8   
dtypes: float32(1), int8(4)
memory usage: 598.6 KB


In [50]:
%matplotlib inline

In [51]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

In [52]:
# Target: the transport mode code. Features: all remaining columns plus an intercept column.
y = df_person_trips['MODE_CODE']
#y = (df_person_trips['MODE_CODE']).values
X = df_person_trips.drop(['MODE_CODE'], axis = 1)
X = sm.add_constant(X)
#X = sm.add_constant(X, prepend=False).values

In [53]:
rs = 6
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=rs)

In [54]:
print("X Training shape:", X_train.shape)
print("X Testing shape:", X_test.shape)
print("y Training shape:", y_train.shape)
print("y Testing shape:", y_test.shape)

X Training shape: (30648, 5)
X Testing shape: (7663, 5)
y Training shape: (30648,)
y Testing shape: (7663,)


In [55]:
choice_model = sm.MNLogit(y_train, X_train, missing='drop')

In [56]:
choice_model_res = choice_model.fit()

Optimization terminated successfully.
         Current function value: 0.822317
         Iterations 8


In [57]:
print(choice_model_res.summary())

                          MNLogit Regression Results                          
Dep. Variable:              MODE_CODE   No. Observations:                30648
Model:                        MNLogit   Df Residuals:                    30633
Method:                           MLE   Df Model:                           12
Date:                Sun, 14 Jun 2020   Pseudo R-squ.:                  0.1518
Time:                        16:18:06   Log-Likelihood:                -25202.
converged:                       True   LL-Null:                       -29714.
Covariance Type:            nonrobust   LLR p-value:                     0.000
 MODE_CODE=1       coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const            1.8734      0.043     43.512      0.000       1.789       1.958
TRAVTIME        -0.0106      0.001    -11.176      0.000      -0.012      -0.009
AGEGROUP        -0.3111      0.004    -70.08

In [58]:
print(choice_model_res.summary2())

                        Results: MNLogit
Model:              MNLogit          Pseudo R-squared: 0.152     
Dependent Variable: MODE_CODE        AIC:              50434.7384
Date:               2020-06-14 16:18 BIC:              50559.6933
No. Observations:   30648            Log-Likelihood:   -25202.   
Df Model:           12               LL-Null:          -29714.   
Df Residuals:       30633            LLR p-value:      0.0000    
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     8.0000                                       
----------------------------------------------------------------
 MODE_CODE = 0   Coef.  Std.Err.    t     P>|t|   [0.025  0.975]
----------------------------------------------------------------
       const     1.8734   0.0431  43.5120 0.0000  1.7890  1.9578
    TRAVTIME    -0.0106   0.0009 -11.1763 0.0000 -0.0124 -0.0087
    AGEGROUP    -0.3111   0.0044 -70.0862 0.0000 -0.3198 -0.3024
         SEX    -0.3300   0.0314 -10.5078

In [59]:
print(choice_model_res.params)

                     0         1         2
const         1.873417 -2.422629 -0.858047
TRAVTIME     -0.010579  0.025347 -0.008515
AGEGROUP     -0.311104 -0.111827 -0.127922
SEX          -0.329991 -0.515125  0.030208
PURPOSE_CODE  0.016799 -0.133243  0.097728
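
To see how these coefficients turn into choice probabilities, the sketch below applies the multinomial logit (softmax) formula by hand. It assumes the three parameter columns correspond to `MODE_CODE` = 1, 2, 3 with `MODE_CODE` = 0 as the reference category (the statsmodels default of using the first category), and it uses the first row of `df_person_trips.head()` as the example trip. The variable names are illustrative.

```python
import numpy as np

# Coefficients copied from the params table above: one column per
# non-reference outcome (MODE_CODE = 1, 2, 3); MODE_CODE = 0 is the reference.
params = np.array([
    [ 1.873417, -2.422629, -0.858047],   # const
    [-0.010579,  0.025347, -0.008515],   # TRAVTIME
    [-0.311104, -0.111827, -0.127922],   # AGEGROUP
    [-0.329991, -0.515125,  0.030208],   # SEX
    [ 0.016799, -0.133243,  0.097728],   # PURPOSE_CODE
])

# First row of df_person_trips.head(), with the constant prepended:
# const=1, TRAVTIME=20, AGEGROUP=11, SEX=1, PURPOSE_CODE=5.
x = np.array([1.0, 20.0, 11.0, 1.0, 5.0])

# Linear predictor for each non-reference mode; the reference mode is fixed at 0.
utilities = np.concatenate(([0.0], x @ params))

# Softmax turns the utilities into choice probabilities that sum to 1.
probs = np.exp(utilities) / np.exp(utilities).sum()
print(probs.round(3))
```

The same calculation is what `choice_model_res.predict` performs for each row, so this can serve as a sanity check on a single observation.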


In [60]:
print(np.exp(choice_model_res.params))

                     0         1         2
const         6.510502  0.088688  0.423989
TRAVTIME      0.989477  1.025671  0.991521
AGEGROUP      0.732638  0.894199  0.879922
SEX           0.718930  0.597426  1.030669
PURPOSE_CODE  1.016941  0.875252  1.102663
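
Exponentiated coefficients read as odds ratios against the reference mode (assumed here to be `MODE_CODE` = 0). A small sketch of converting the `TRAVTIME` row into a percentage change in odds per additional unit of `TRAVTIME`, using the values printed above:

```python
import numpy as np

# TRAVTIME coefficients from the params table, for MODE_CODE = 1, 2, 3
# relative to the assumed reference MODE_CODE = 0.
travtime_coef = np.array([-0.010579, 0.025347, -0.008515])

odds_ratio = np.exp(travtime_coef)
pct_change = (odds_ratio - 1.0) * 100.0  # % change in odds per extra unit of TRAVTIME

for mode, pct in zip([1, 2, 3], pct_change):
    print(f"MODE_CODE={mode} vs 0: {pct:+.2f}% change in odds per unit of TRAVTIME")
```

An odds ratio below 1 means the odds of that mode relative to the reference fall as the variable increases, and a ratio above 1 means they rise.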


In [61]:
print(choice_model_res.get_margeff().summary())

       MNLogit Marginal Effects      
Dep. Variable:              MODE_CODE
Method:                          dydx
At:                           overall
 MODE_CODE=0      dy/dx    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
TRAVTIME         0.0014      0.000      9.744      0.000       0.001       0.002
AGEGROUP         0.0453      0.000     95.837      0.000       0.044       0.046
SEX              0.0432      0.005      8.557      0.000       0.033       0.053
PURPOSE_CODE    -0.0061      0.001     -5.652      0.000      -0.008      -0.004
--------------------------------------------------------------------------------
 MODE_CODE=1      dy/dx    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
TRAVTIME        -0.0015      0.000    -10.892      0.000      -0.002      -0.001
AGEGROUP        -0.0423      0.000    

In [62]:
import numpy as np
np.exp(choice_model_res.params)

Unnamed: 0,0,1,2
const,6.510502,0.088688,0.423989
TRAVTIME,0.989477,1.025671,0.991521
AGEGROUP,0.732638,0.894199,0.879922
SEX,0.71893,0.597426,1.030669
PURPOSE_CODE,1.016941,0.875252,1.102663


In [63]:
preds = choice_model_res.predict(X_test)

In [64]:
respondent1000 = X.iloc[[1000]]
choice_model_res.predict(respondent1000)

Unnamed: 0,0,1,2,3
1000,0.500399,0.373015,0.042478,0.084108


In [65]:
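# pred_table() tabulates the data the model was fitted on (rows: observed, columns: predicted),
# so this is in-sample (training) accuracy, not accuracy on X_test.
t = choice_model_res.pred_table()
print(t)
print("Accuracy:", np.diag(t).sum() / t.sum())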

[[1.7807e+04 8.5400e+02 3.6000e+01 0.0000e+00]
 [2.5710e+03 5.6130e+03 1.2000e+01 0.0000e+00]
 [5.6800e+02 1.1000e+02 5.0000e+00 0.0000e+00]
 [2.1290e+03 9.4200e+02 1.0000e+00 0.0000e+00]]
Accuracy: 0.7643239363090577
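
Overall accuracy hides how unevenly the model treats the four modes. The sketch below recomputes accuracy from the table printed above and adds per-mode recall and a majority-class baseline. It assumes rows are observed modes and columns are predicted modes, which is the `pred_table` convention in statsmodels.

```python
import numpy as np

# Prediction table printed above: rows are observed modes,
# columns are predicted modes.
t = np.array([
    [17807,  854, 36, 0],
    [ 2571, 5613, 12, 0],
    [  568,  110,  5, 0],
    [ 2129,  942,  1, 0],
])

accuracy = np.diag(t).sum() / t.sum()
recall = np.diag(t) / t.sum(axis=1)       # share of each observed mode predicted correctly
baseline = t.sum(axis=1).max() / t.sum()  # always guessing the most common mode

print("accuracy:", round(accuracy, 4))
print("per-mode recall:", recall.round(4))
print("majority-class baseline:", round(baseline, 4))
```

The last column of the table is all zeros, so the model never predicts `MODE_CODE` = 3, and the accuracy should be judged against the baseline rather than against zero.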


In [66]:
import pandas as pd 
import numpy as np 
import scipy as scp
import sklearn

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn import metrics 
from sklearn.metrics import confusion_matrix

In [67]:
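# X_train already contains the 'const' column from sm.add_constant, and fit_intercept=True
# adds a second intercept, so the constant is shared between intercept_ and coef_[:, 0].
model1 = LogisticRegression(random_state=rs, multi_class='multinomial', penalty='none', solver='newton-cg').fit(X_train, y_train)
preds = model1.predict(X_train)  # training-set predictions (overwrites the earlier test-set preds)

# Print the model's parameters. multi_class, penalty and solver are set explicitly above; the rest are defaults.
params = model1.get_params()
print(params)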


{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'multinomial', 'n_jobs': None, 'penalty': 'none', 'random_state': 6, 'solver': 'newton-cg', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}


In [68]:
print('Intercept: \n', model1.intercept_)
print('Coefficients: \n', model1.coef_)

Intercept: 
 [ 0.17590739  1.11261564 -1.0354071  -0.25311594]
Coefficients: 
 [[ 0.17590739 -0.00156322  0.13771343  0.2037271   0.00467907]
 [ 1.11261564 -0.01214244 -0.17339065 -0.1262644   0.02147805]
 [-1.0354071   0.02378385  0.02588625 -0.31139819 -0.12856395]
 [-0.25311594 -0.01007819  0.00979097  0.23393549  0.10240682]]


In [69]:
np.exp(model1.coef_)

array([[1.19232763, 0.998438  , 1.14764662, 1.22596354, 1.00469004],
       [3.04230558, 0.98793099, 0.84080909, 0.88138178, 1.02171036],
       [0.3550818 , 1.02406894, 1.02622421, 0.73242217, 0.87935733],
       [0.77637787, 0.98997242, 1.00983906, 1.26356298, 1.10783407]])
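
The scikit-learn coefficients look different from the statsmodels ones because scikit-learn's multinomial model keeps one coefficient row per class instead of fixing a reference class, so `np.exp(model1.coef_)` is not a table of odds ratios against `MODE_CODE` = 0. The sketch below, using only the numbers printed above, folds the intercept into the `const` column and subtracts the class 0 row, on the assumption that class 0 is the reference used by `MNLogit`.

```python
import numpy as np

# Values copied from the scikit-learn output above (4 classes x 5 columns;
# the first column is the 'const' feature added by sm.add_constant).
intercept = np.array([0.17590739, 1.11261564, -1.0354071, -0.25311594])
coef = np.array([
    [ 0.17590739, -0.00156322,  0.13771343,  0.2037271,   0.00467907],
    [ 1.11261564, -0.01214244, -0.17339065, -0.1262644,   0.02147805],
    [-1.0354071,   0.02378385,  0.02588625, -0.31139819, -0.12856395],
    [-0.25311594, -0.01007819,  0.00979097,  0.23393549,  0.10240682],
])

# Fold the fitted intercept into the 'const' column, since both multiply 1.
combined = coef.copy()
combined[:, 0] += intercept

# Express every class relative to class 0.
relative = combined[1:] - combined[0]

# Rows are now MODE_CODE = 1, 2, 3; transpose to match the params layout.
print(relative.T.round(6))
```

The result agrees with `choice_model_res.params` to about six decimal places, which suggests both libraries fitted the same unpenalised model and differ only in parameterisation.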

In [70]:
df_person_trips

Unnamed: 0,TRAVTIME,MODE_CODE,AGEGROUP,SEX,PURPOSE_CODE
0,20.0,0,11,1,5
1,25.0,0,11,1,5
2,5.0,0,11,1,7
3,5.0,0,11,1,7
4,20.0,1,4,0,2
...,...,...,...,...,...
38311,60.0,2,5,1,2
38312,60.0,2,5,1,2
38313,93.0,2,4,1,7
38314,95.0,2,3,1,2


In [71]:
df_person_trips.to_pickle(filepath + 'df_person_trips.pyobj')
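
A quick way to confirm the saved file can be read back unchanged is a round trip through `pd.read_pickle`. This is a minimal sketch on a toy frame written to a temporary directory; the rows and the name `frame` are made up for illustration, and the temporary directory stands in for the notebook's `filepath`.

```python
import os
import tempfile

import pandas as pd

# Toy frame with two of the columns and dtypes of df_person_trips (hypothetical rows).
frame = pd.DataFrame({
    "TRAVTIME": pd.Series([20.0, 25.0, 5.0], dtype="float32"),
    "MODE_CODE": pd.Series([0, 0, 1], dtype="int8"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # os.path.join avoids depending on a trailing separator in the directory name.
    path = os.path.join(tmp_dir, "df_person_trips.pyobj")
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# Pickle preserves dtypes and the index.
print(restored.dtypes)
print(restored.equals(frame))
```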

In [75]:
#mode_sex_counts = df_person_trips[['MAINMODE', 'SEX', 'HHID_x']].groupby(by=['MAINMODE', 'SEX']).count()

In [76]:
pd.set_option("display.max_rows", None)

In [77]:
df_person_trips.groupby(by=['MODE_CODE', 'SEX']).count()

MODE_CODE,SEX,TRAVTIME,AGEGROUP,PURPOSE_CODE
0,0,12246,12246,12246
0,1,11126,11126,11126
1,0,6011,6011,6011
1,1,4221,4221,4221
2,0,482,482,482
2,1,360,360,360
3,0,2060,2060,2060
3,1,1805,1805,1805
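
Raw counts are hard to compare because the two `SEX` codes have different totals. The sketch below rebuilds the counts from the output above and converts them to mode shares within each `SEX` code. The frame name `counts` is an illustrative assumption.

```python
import pandas as pd

# Trip counts copied from the groupby output above.
counts = pd.DataFrame(
    {0: [12246, 6011, 482, 2060], 1: [11126, 4221, 360, 1805]},
    index=pd.Index([0, 1, 2, 3], name="MODE_CODE"),
)
counts.columns.name = "SEX"

# Divide each column by its total to get the mode share within each SEX code.
shares = counts / counts.sum(axis=0)
print(shares.round(3))
```

On the full frame, `pd.crosstab(df_person_trips['MODE_CODE'], df_person_trips['SEX'], normalize='columns')` should give the same shares in one step.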


In [78]:
df_person_trips.groupby(by=['MODE_CODE', 'AGEGROUP']).count()

MODE_CODE,AGEGROUP,TRAVTIME,SEX,PURPOSE_CODE
0,3,1,1,1
0,4,439,439,439
0,5,1097,1097,1097
0,6,1408,1408,1408
0,7,2123,2123,2123
0,8,2510,2510,2510
0,9,2628,2628,2628
0,10,2754,2754,2754
0,11,2183,2183,2183
0,12,2113,2113,2113


In [None]:
set(df_person_trips['AGEGROUP'].unique())

In [81]:
df_person_trips.groupby(by=['MODE_CODE', 'PURPOSE_CODE']).count()

MODE_CODE,PURPOSE_CODE,TRAVTIME,AGEGROUP,SEX
0,0,164,164,164
0,1,7672,7672,7672
0,2,381,381,381
0,3,90,90,90
0,4,1816,1816,1816
0,5,5525,5525,5525
0,6,1949,1949,1949
0,7,4828,4828,4828
0,8,947,947,947
1,0,2738,2738,2738


In [None]:
df_person_trips