In [1]:
# libraries
import numpy as np
import pandas as pd
import altair as alt
alt.renderers.enable('default')

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import add_dummy_feature
from sklearn.decomposition import PCA

# warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# display settings
pd.options.display.max_columns = None

## Access to Public Transportation in California

Kabir Snell, Ryan Sevilla, Jaime Gomez, and Joseph Momich.

#### Author contributions

Name | Role
---|---
Kabir Snell | Found the data set, wrote the background, and came up with questions for future work
Ryan Sevilla | Worked on variable summaries and explanatory plots
Jaime Gomez | Ensured guidelines were met and worked on final completion of the report
Joseph Momich | Provided data description, helped transform the data, and assisted in data exploration & analysis

#### Abstract

Well-designed public transit plans help improve community health, reduce air pollution, and provide economic benefits to the community. California's public transportation system serves residents in both urban and rural areas, but not all residents have equal access to it. The aim of this analysis was to identify how access to public transit differs by location and ethnicity, and to determine whether population or ethnicity is a significant predictor of access to transit. Access to public transit differs between regions and between counties in California. However, a location's ethnic demographics and population alone are not significant predictors of access to transit.

---
### Background
California has a vast public transportation system that includes buses, trains, light rail, and subways. The state's largest public transportation agency is the Los Angeles County Metropolitan Transportation Authority (Metro), which serves the Los Angeles metropolitan area with an extensive network of bus and rail lines. Other major public transportation agencies in California include the San Francisco Municipal Transportation Agency and the San Diego Metropolitan Transit System. In addition to these larger agencies, many smaller cities and towns throughout California have their own public transportation systems, which may include buses, shuttles, or other types of services. These systems often provide connections to regional or statewide transportation networks, making it possible for people to travel throughout the state using public transportation. While many areas have well-established public transportation systems, there may be gaps or limitations in service in some parts of the state, particularly in more rural or remote areas.

Every 10 years, the California Department of Public Health collects public transportation data as part of the Healthy Communities Data and Indicators Project (HCI). The goal of the project is to evaluate how city plans and policies affect community health. In particular, it provides information on the number of households that are within a certain distance of public transit stops. This information can be useful for understanding the accessibility of public transportation in different areas of the state, and for identifying opportunities to improve transit service or expand transit infrastructure. Overall, it is a valuable resource for policymakers, transportation planners, and researchers who are interested in understanding the availability and accessibility of public transportation in California.

### Aims
There were two main objectives in the data analysis. The first was to visualize how access to public transportation varies across geographic location and ethnicity group. The second was to determine whether population or ethnicity is a significant predictor of access rates. To approach these objectives, the data set was grouped by each level of geography (region, county, town). To show how public transportation access varies, visualizations such as bar charts and scatter plots were used. Simple and multiple linear regression models were fit to the data to determine how well population and ethnicity explain access rates.

The findings show that access to public transportation varies significantly across the four regions and the counties analyzed. In certain counties with high access rates, the ethnic demographic is segmented, but this is less apparent at the regional level. However, the relationship between a location's population and its public transit access rate is very scattered. **Overall, the analysis provided negative results.** Ethnicity and population are not strong enough predictors of the percent of residents that reside within 1/2 mile of public transportation.

---
### Datasets

<p style='text-align: left;'>

The data set shows the percent of the population that resides within 1/2 mile of a major transit location in four California regions, and whose waiting time is less than 15 minutes during peak commute hours. The data is stratified by 8 race/ethnicity groups and includes both geographic information and statistical reliability measurements.
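As a quick sanity check on how the access rate appears to be defined, the sketch below recomputes it for the Placer County rows shown in the data preview of this report. It assumes that `p_trans_acc` is simply `pop_trans_acc` divided by `pop2010`, which is an inference from the column descriptions rather than something stated in the HCI documentation; the `recomputed` column and `rates_match` flag are illustrative names.

```python
import numpy as np
import pandas as pd

# Placer County rows copied from the data preview in this report
placer = pd.DataFrame({
    'race_eth_name': ['AfricanAm', 'AIAN', 'Asian', 'Latino', 'Multiple'],
    'pop_trans_acc': [55, 51, 117, 1835, 241],
    'pop2010': [4427, 2080, 19963, 44710, 10658],
    'p_trans_acc': [0.012424, 0.024519, 0.005861, 0.041042, 0.022612],
})

# assumed relationship: access rate = residents near transit / residents in the group
placer['recomputed'] = placer['pop_trans_acc'] / placer['pop2010']

# True when the recomputed rates agree with the published ones up to rounding
rates_match = np.allclose(placer['recomputed'], placer['p_trans_acc'], atol = 1e-6)

print(placer[['race_eth_name', 'p_trans_acc', 'recomputed']])
print('rates match:', rates_match)
```

If the assumption holds for the full data set, rows with `pop2010` equal to zero would need separate handling, since the ratio is undefined there.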

The data includes 2012 Transit Stops from the San Diego and Southern California Association of Governments, as well as the Metropolitan Transportation Commission. It also includes 2008 Transit Stops from the Sacramento Council of Government and 2010 block-level population data from the U.S. Census Bureau. The four California regions are defined as the following:
* Southern California (SCAG): Imperial, Los Angeles, Orange, Riverside, San Bernardino, and Ventura 
* Sacramento (SACOG): Placer, Sacramento, and Yolo 
* Bay Area (MTC): Contra Costa, Marin, Napa, San Francisco, San Mateo, Santa Clara, Solano, Sonoma 
* San Diego County

Data values were obtained using automated methods to download information from various public websites. In order to compile them into one data set, the census blocks from the 2010 U.S. Census were merged with the blocks from the travel surveys. One important survey was the 2010-2012 California Household Travel Survey. Multiple data collection methods were used in this survey, including computer-assisted telephone interviews and online/mail surveys. To identify census blocks inside 1/2 mile of the transit stops, geospatial software was used. The data was processed into Excel files with standard formats. 

The population is adults aged 18 years and over who reside in the four California regions. The sampling frame includes adults in these four regions with access to telephone or mail services. The sampling mechanism for the respective year (2008 or 2012) is a probability sample because the surveys downloaded by the HCI project were sent to randomly selected adults. However, the scope of inference has limitations. The transit data is from 2012 for the SCAG, MTC, and San Diego regions, but from 2008 for the SACOG region, and some transit stops and services may have changed during that period. In addition, the population data comes from the 2010 U.S. Census, which is a different time period than the transit data (2008, 2012), so some variation may exist if demographics changed. The following table provides a summary of the variables used for analysis:


Name | Variable description | Type | Units of measurement
---|---|---|---
year | year when data was reported | Numeric | Calendar year 
race_eth_name | name of the race/ethnicity group ('AfricanAm', 'AIAN', 'Asian', 'Latino', 'Multiple', 'NHOPI', 'Other', 'Total', 'White') | Object | Name
geotype | describes the level of geography for data in that row ('RE'=region, 'CT'=census tract, 'PL'=place/town/city, 'CO'=county)  | Object | Name
geoname | name of the city/town | Object | Name 
county_name | name of the county | Object | Name
region_name | name of the region ('Sacramento Area', 'Bay Area', 'San Diego', 'Southern California') | Object | Name
pop_trans_acc | number of residents that live within 1/2 mile of public transportation | Numeric | Integer
pop2010 | number of residents of that race/ethnicity group living in the geography for that row (2010 Census) | Numeric | Integer
p_trans_acc | the percent of residents that live within 1/2 mile of public transportation | Numeric | Float

The data set used for analysis is shown below:

In [2]:
# load tidied data and print rows
data = pd.read_csv(
    'tidy-data', 
    dtype = {
    'pop_trans_acc':'Int64',
    'county_fips': 'Int64'
    },
    index_col = 0
)
data[['year', 'race_eth_name', 'geotype', 'geoname', 'county_name', 'region_name', 'pop_trans_acc', 'pop2010', 'p_trans_acc']].head()

index | year | race_eth_name | geotype | geoname | county_name | region_name | pop_trans_acc | pop2010 | p_trans_acc
---|---|---|---|---|---|---|---|---|---
0 | 2008 | AfricanAm | CO | Placer | Placer | Sacramento Area | 55 | 4427 | 0.012424
1 | 2008 | AIAN | CO | Placer | Placer | Sacramento Area | 51 | 2080 | 0.024519
2 | 2008 | Asian | CO | Placer | Placer | Sacramento Area | 117 | 19963 | 0.005861
3 | 2008 | Latino | CO | Placer | Placer | Sacramento Area | 1835 | 44710 | 0.041042
4 | 2008 | Multiple | CO | Placer | Placer | Sacramento Area | 241 | 10658 | 0.022612


### Methods
In the exploratory analysis, the data was grouped by region and by county. First, the total population and the mean access to transit in each region were computed and displayed in a table. The same process was conducted on the county-level data, but the results were displayed in a scatter plot. Then, the relationship between ethnicity in each region and access to transit was explored with a bar chart. 

In the analysis section, a simple linear regression model was fit to determine if the 2010 population influenced access to transit in different counties. The data was then modified to include ethnicity "rates". These rates were calculated from the original data set and represent the percent of each race/ethnicity group in the county or town. A multiple linear regression model was fit to this data, including ethnicity rates and population as predictors. This MLR model was fit to both the county-level data and the town-level data, the latter of which has a much larger sample size. Lastly, PCA was used on the town-level data to determine whether a smaller subset of the variables better explains the data.

---
### Results

#### Exploratory Analysis

The data was first grouped by the four regions, and the 2010 population and percent access to transit were summarized for each region.
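One detail of this summary is worth making explicit: averaging `p_trans_acc` across race/ethnicity groups gives an unweighted mean, which treats a group of 2,000 residents the same as a group of 45,000. The sketch below contrasts that with a pooled (population-weighted) rate, using only the five Placer County rows shown in the data preview as an illustrative subset; the variable names `unweighted_mean` and `pooled_rate` are not from the original analysis.

```python
import pandas as pd

# five Placer County rows copied from the data preview (an illustrative subset only)
placer = pd.DataFrame({
    'race_eth_name': ['AfricanAm', 'AIAN', 'Asian', 'Latino', 'Multiple'],
    'pop_trans_acc': [55, 51, 117, 1835, 241],
    'pop2010': [4427, 2080, 19963, 44710, 10658],
    'p_trans_acc': [0.012424, 0.024519, 0.005861, 0.041042, 0.022612],
})

# unweighted mean: every group counts equally, regardless of its size
unweighted_mean = placer['p_trans_acc'].mean()

# pooled rate: all residents near transit divided by all residents
pooled_rate = placer['pop_trans_acc'].sum() / placer['pop2010'].sum()

print('unweighted mean of group rates:', round(unweighted_mean, 4))
print('pooled (weighted) rate:        ', round(pooled_rate, 4))
```

Which of the two is appropriate depends on the question: the unweighted mean describes a typical group, while the pooled rate describes a typical resident.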

In [3]:
# regional data, do not include 'Total' in race_eth_name
data_region = data[
    ((data.geotype == 'RE') | ((data.region_name == 'San Diego') & (data.geotype == 'CO'))) & 
    (data.race_eth_name != 'Total')
]

# merge the 'pop2010' and 'p_trans_acc' statistics into one dataframe
# (select each column before aggregating so text columns are never summed or averaged)
region_pop_percent = pd.merge(
    data_region.groupby('region_name', as_index = False)['pop2010'].sum(), 
    data_region.groupby('region_name', as_index = False)['p_trans_acc'].mean()
)

# table of region, 2010 population, and percent access
region_pop_percent_table = region_pop_percent.rename(
    columns = {'region_name': 'Region', 'pop2010': '2010 Population', 'p_trans_acc': 'Percent Access'}
)

region_pop_percent_table

index | Region | 2010 Population | Percent Access
---|---|---|---
0 | Bay Area | 7150739 | 0.571932
1 | Sacramento Area | 1999270 | 0.190002
2 | San Diego | 3095313 | 0.358582
3 | Southern California | 18051534 | 0.351986


The table shows that the Bay Area has the highest percent of residents that reside within 1/2 mile of public transit. The Sacramento region has both the smallest population and the lowest access. While San Diego and Southern California have similar rates, Southern California has a much greater population. The table provides only a general overview, and there could be counties that are outliers within each region. A similar approach was used for the county-level data, and the results were displayed in a scatter plot:

In [4]:
# county-level data, do not include 'Total' in race_eth_name
data_county = data[(data.geotype == 'CO') & (data.race_eth_name != 'Total')]

# merge the pop2010 and p_trans_acc statistics into one dataframe
# (select each column before aggregating so text columns are never summed or averaged)
county_pop_percent = pd.merge(
    data_county.groupby('county_name', as_index = False)['pop2010'].sum(), 
    data_county.groupby('county_name', as_index = False)['p_trans_acc'].mean()
)

# drop counties with '0' for 'p_trans_acc'
county_pop_percent = county_pop_percent[(county_pop_percent['p_trans_acc'] != 0)]

In [5]:
# scatter plot of counties
scatter_county = alt.Chart(county_pop_percent).mark_point().encode(
    x = alt.X('pop2010',
              title = '2010 County Population',
              scale = alt.Scale(zero = False, type = 'pow', exponent = 0.5)),
    y = alt.Y('mean(p_trans_acc)', 
              title = 'Percent Access')
).properties(
    title = 'Access to Public Transportation by County',
    width = 700, 
    height = 250
)

# labels for each county
text_county = scatter_county.mark_text(
    align = 'left',
    baseline ='middle',
    dx = 5,
    dy = 0.2
).encode(
    text='county_name'
)

# plot
county_plot = scatter_county + text_county
county_plot

Nearly 100% of residents within San Francisco live within 1/2 mile of public transportation. In more rural counties, like Napa and Placer, rates are lower. Los Angeles County has a substantially larger population than the other counties, and has transit rates similar to those of more urban Bay Area counties like Alameda and Santa Clara. Next, the regional data was stratified by ethnicity in order to determine whether certain ethnicities live closer to transit.

In [6]:
# bar chart 
alt.Chart(data_region).mark_bar().encode(
    x = alt.X('race_eth_name', 
              title = ''), 
    y = alt.Y('mean(p_trans_acc)', 
              scale=alt.Scale(domain=[0, 0.8]), 
              title = 'Percent Access')
).properties(
    width = 200, height = 250
).facet(
    column = 'region_name').properties(title = 'Access to Public Transportation by Ethnicity & Region'
)

While there are differences among the race/ethnicity groups, the chart shows that **region is a more significant indicator** of whether residents live within 1/2 mile of public transit. Across all regions, African Americans have the highest access to public transit. However, this access varies significantly by individual county, as the data analysis below shows.

#### Data Analysis

**Population and Access to Transit in Individual Counties:** The first part of the analysis was to determine whether access to public transportation could be predicted from the population of each county. A simple linear regression model was used to model 'p_trans_acc' against 'pop2010'.

In [7]:
## Simple Linear Regression

# store response variable as array
y_slr = county_pop_percent["p_trans_acc"]

# store explanatory variable
x_slr_df = county_pop_percent.loc[: , ['pop2010']]

# add intercept column
x_slr = add_dummy_feature(x_slr_df, value = 1)

# configure model
slr = LinearRegression(fit_intercept = False)

# fit model
slr.fit(x_slr, y_slr)

# fitted values
fitted_slr = slr.predict(x_slr)

# residuals
resid_slr = y_slr - fitted_slr

# append fitted values and residuals to the data
county_pop_percent_slr = county_pop_percent.copy()
county_pop_percent_slr['fitted_slr'] = fitted_slr
county_pop_percent_slr['resid_slr'] = resid_slr

# compute R-squared
r2_slr = r2_score(county_pop_percent_slr.p_trans_acc, county_pop_percent_slr.fitted_slr)

In [8]:
# modifying scale from original 'scatter_county' plot
scatter_county_mod1 = scatter_county.encode(
    x = alt.X('pop2010', title = '2010 County Population')
    ).properties(
    title = 'Public Transportation by County with Simple Regression Line'
    )

# modifying labels from original 'scatter_county' plot
text_county_mod1 = scatter_county_mod1.mark_text(
    align = 'left', 
    baseline ='middle', 
    dx = 5, 
    dy = 0.2
).encode(text='county_name')

# plot
county_plot_mod1 = scatter_county_mod1 + text_county_mod1

# adding the line of best fit
slr_line = alt.Chart(
    county_pop_percent_slr
).mark_line(
).encode(
    x = 'pop2010',
    y = 'fitted_slr'
)
 
# layer with the line of best fit
county_plot_mod1 = county_plot_mod1 + slr_line
county_plot_mod1

Adding the regression line to the scatter plot above demonstrates that a simple linear regression model does not fit the data well. The $R^2$ value is approximately 0.02, meaning the 2010 population of each county explains almost none of the variation in public transportation rates. Next, a smooth curve was fit between the two variables.
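For readers unfamiliar with the statistic, $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand with NumPy on synthetic data (not the county data), using an explicit intercept column in the same spirit as `add_dummy_feature`; the values of `x` and `y` are randomly generated and purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, size = 18)   # hypothetical populations, in millions
y = rng.uniform(0.0, 1.0, size = 18)    # hypothetical access rates, generated independently of x

# design matrix with an explicit intercept column
X = np.column_stack([np.ones_like(x), x])

# ordinary least squares fit
beta, *_ = np.linalg.lstsq(X, y, rcond = None)
fitted = X @ beta

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('R^2:', round(r2, 3))
```

With a single predictor and an intercept, this value equals the squared correlation between `x` and `y`, so a value near zero indicates almost no linear association.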

In [9]:
# compute smooth
county_smooth = scatter_county_mod1.transform_loess(
    on = 'pop2010', # x variable
    loess = 'p_trans_acc', # y variable
    bandwidth = 0.3 # how smooth?
).mark_line(color = 'black')

county_plot_mod2 = scatter_county_mod1 + county_smooth + text_county_mod1
county_plot_mod2


Even with this smooth curve added, there is no clear relationship between county population and access to transit.

**Analyzing Ethnicities:** The next step was to determine the demographics in each county. The original data table was pivoted to show the percent of each ethnicity in the counties.
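The pivot step can be illustrated on a toy example before applying it to the real data. The sketch below uses two hypothetical counties ('A' and 'B') with made-up population counts; only the column names `county_name`, `race_eth_name`, and `pop2010` come from the data set, and the `share` column is an illustrative name.

```python
import numpy as np
import pandas as pd

# hypothetical long-format counts: one row per county and race/ethnicity group
long_df = pd.DataFrame({
    'county_name': ['A', 'A', 'A', 'B', 'B', 'B'],
    'race_eth_name': ['Asian', 'Latino', 'White'] * 2,
    'pop2010': [200, 300, 500, 100, 600, 300],
})

# share of each group within its county
county_totals = long_df.groupby('county_name')['pop2010'].transform('sum')
long_df['share'] = long_df['pop2010'] / county_totals

# pivot from long to wide: one row per county, one column per group
wide_df = long_df.pivot_table(
    values = 'share', index = 'county_name', columns = 'race_eth_name'
).reset_index().rename_axis(None, axis = 1)

print(wide_df)
```

Because the shares in each row sum to one, the resulting columns are not independent of each other, which is worth keeping in mind when several of them are used together as predictors.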

In [10]:
# build a dataframe with race_eth_name, county_name, and pop2010 as columns
data_county_mod1 = data_county.groupby(['county_name', 'race_eth_name'], as_index = False)['pop2010'].sum().loc[:, ['race_eth_name', 'county_name', 'pop2010']]

# merging, in order to get total population values from 'county_pop_percent' data set
data_county_mod1 = pd.merge(
    data_county_mod1,
    county_pop_percent.drop(columns = 'p_trans_acc').rename(columns = {'pop2010':'county_total'}), 
    how='left'
)

# percent of ethnicity in each county
data_county_mod1['race_eth_percent'] = data_county_mod1.pop2010 / data_county_mod1.county_total

# pivot table to make the columns a percent of each race 
data_county_mod1 = data_county_mod1.pivot_table('race_eth_percent', ['county_name'], 'race_eth_name').reset_index().rename_axis(None, axis = 1)

# merging, in order to get 'p_trans_acc' from 'county_pop_percent' data set
data_county_mod1 = pd.merge(
    data_county_mod1,
    county_pop_percent,
    how='left'
)

# sorting the values by descending rates to public transit
data_county_mod1 = data_county_mod1.sort_values(by = 'p_trans_acc', ascending = False)

# print
data_county_mod1.head(3)

index | county_name | AIAN | AfricanAm | Asian | Latino | Multiple | NHOPI | Other | White | pop2010 | p_trans_acc
---|---|---|---|---|---|---|---|---|---|---|---
11 | San Francisco | 0.00227 | 0.058096 | 0.329966 | 0.151228 | 0.032387 | 0.003885 | 0.003097 | 0.419071 | 805235 | 0.992901
13 | Santa Clara | 0.002269 | 0.02376 | 0.317385 | 0.268971 | 0.030059 | 0.003509 | 0.002176 | 0.351871 | 1781642 | 0.673294
0 | Alameda | 0.002774 | 0.121916 | 0.258579 | 0.225052 | 0.040299 | 0.0079 | 0.002775 | 0.340706 | 1510271 | 0.65937


For example, in San Francisco, a county of 805,235 residents, White residents are ~42% of the population, whereas in Los Angeles they are only ~28%. The scatter plot above did not show how ethnicity varies by county. Bar charts of the counties, their percent access, and their ethnicity rates were then constructed. The following is for the African American demographic.

In [11]:
alt.Chart(data_county_mod1).mark_bar().encode(
    x = alt.X('county_name', title = ''),
    y = alt.Y('p_trans_acc', title = 'Percent Access to Transit'),
    color = 'AfricanAm'
).properties(
    width = 300, height = 175, title = 'African American Demographic'
)

In all of the counties, African Americans are never more than 15% of the population. They represent the greatest percentage in Alameda (high transit access) and Solano (low transit access). In other counties like Marin, San Mateo, and Santa Clara, where transit access is very high, they represent a very small percent of the population. This chart is significant because it shows how certain counties are ethnically segregated and how African Americans may only have high access rates in some of them.

In [12]:
alt.Chart(data_county_mod1).mark_bar().encode(
    x = alt.X('county_name', title = ''),
    y = alt.Y('p_trans_acc', title = 'Percent Access to Transit'),
    color = 'White'
).properties(
    width = 300, height = 175, title = 'White Demographic'
)

White residents make up a significant portion of each county. In Marin, unlike African Americans, they represent an extremely high percent of the population (> 80%). In rural areas, like Sonoma and Placer, they also represent a large part of the population. In contrast to African Americans, White residents have high access in affluent areas (Marin, San Mateo, Santa Clara) and low access in rural areas (Sonoma, Placer).

**Multiple Linear Regression:** A multiple linear regression model was then fit to the data, with ethnicity rates as added predictor variables. Only 'White' and 'pop2010' were used as the explanatory variables at first.

In [13]:
# analyze 'p_trans_acc' based on 'White' and 'pop2010' variables
x_mlr_county = data_county_mod1[['White', 'pop2010']]

# add intercept 
x_mx_county = add_dummy_feature(x_mlr_county, value = 1)

# response variable
y_mlr_county = data_county_mod1['p_trans_acc']

# configure model
mlr_county = LinearRegression(fit_intercept = False)

# fit model
mlr_county.fit(x_mx_county, y_mlr_county)

# fitted values
fitted_mlr_county = mlr_county.predict(x_mx_county)

# residuals
resid_mlr_county = y_mlr_county - fitted_mlr_county

# use ".copy()" to solve potential error warning
data_county_mod2 = data_county_mod1.copy()

# append fitted and residual values
data_county_mod2.loc[: , "fitted_mlr"] = fitted_mlr_county
data_county_mod2.loc[: , "resid_mlr"] = resid_mlr_county

# R^2 value
R_2_mlr_county = r2_score(data_county_mod2.p_trans_acc, data_county_mod2.fitted_mlr)
# R_2_mlr_county


The following table shows the county, the ethnicity rates, and the fitted values for 'p_trans_acc' (access rates). It also includes the residuals used for the $R^2$ analysis.

In [14]:
data_county_mod2.head(3)

index | county_name | AIAN | AfricanAm | Asian | Latino | Multiple | NHOPI | Other | White | pop2010 | p_trans_acc | fitted_mlr | resid_mlr
---|---|---|---|---|---|---|---|---|---|---|---|---|---
11 | San Francisco | 0.00227 | 0.058096 | 0.329966 | 0.151228 | 0.032387 | 0.003885 | 0.003097 | 0.419071 | 805235 | 0.992901 | 0.401179 | 0.591721
13 | Santa Clara | 0.002269 | 0.02376 | 0.317385 | 0.268971 | 0.030059 | 0.003509 | 0.002176 | 0.351871 | 1781642 | 0.673294 | 0.443941 | 0.229353
0 | Alameda | 0.002774 | 0.121916 | 0.258579 | 0.225052 | 0.040299 | 0.0079 | 0.002775 | 0.340706 | 1510271 | 0.65937 | 0.453485 | 0.205885


The fitted values have trouble with outliers like San Francisco and are more accurate for counties with medium populations and access rates. Still, the $R^2$ increases to **0.116**, a substantial improvement over simple linear regression that comes from adding one ethnicity rate ('White') as a predictor. Adding more predictors continues to increase the $R^2$ value. For example, adding either 'Asian' or 'Latino' in addition to 'White' increases the $R^2$ value to approximately 0.7. However, this is because **the model is overfitting the data**: the sample size is small, and the in-sample $R^2$ cannot decrease when predictors are added. To increase the sample size, a regression model was fit to the town-level data, since there are far more towns than counties. The groupings for each town are displayed below:
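As an aside on the overfitting point, the sketch below shows on synthetic data how the in-sample $R^2$ keeps rising as predictors are added to a small sample, even when every predictor is pure noise. The sample size of 18 is chosen arbitrarily, the response and predictors are randomly generated, and the adjusted $R^2$ shown for comparison is a standard correction that was not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 18                                   # small, arbitrarily chosen sample size
y = rng.uniform(0.0, 1.0, size = n)      # hypothetical response
noise = rng.normal(size = (n, 8))        # eight predictors unrelated to y

def r_squared(X, y):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond = None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# fit nested models using the first p noise columns, p = 1..8
p = np.arange(1, 9)
r2 = np.array([r_squared(noise[:, :k], y) for k in p])

# adjusted R^2 penalizes the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

for k, a, b in zip(p, r2, adj_r2):
    print(f'predictors: {k}  R^2: {a:.3f}  adjusted R^2: {b:.3f}')
```

The plain $R^2$ never decreases along the sequence, which is why a jump in $R^2$ after adding predictors to a small sample is weak evidence of a better model on its own.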

In [15]:
# town-level data, do not include 'Total' in race_eth_name
data_town = data[(data.geotype == 'PL') & (data.race_eth_name != 'Total')]

# merge the 'pop2010' and 'p_trans_acc' statistics into one dataframe
# (select each column before aggregating so text columns are never summed or averaged)
town_pop_percent = pd.merge(
    data_town.groupby('geoname', as_index = False)['pop2010'].sum(), 
    data_town.groupby('geoname', as_index = False)['p_trans_acc'].mean()
)

# table of town, 2010 population, and percent access
town_pop_percent_table = town_pop_percent

town_pop_percent_table = town_pop_percent_table.rename(columns = {'geoname':'Town', 'pop2010':'2010 Population', 'p_trans_acc':'Percent Access'})

town_pop_percent_table.head(3)

index | Town | 2010 Population | Percent Access
---|---|---|---
0 | Acalanes Ridge CDP | 1137 | 0.295509
1 | Acton CDP | 7596 | 0.0
2 | Adelanto city | 31765 | 0.0


This table shows that each town's population is much smaller than that of the counties analyzed. In some towns, the percent access is 0%, possibly because the town has few residents or is in a rural area. As with the county data, the data table was pivoted to include ethnicity rates for each town. A multiple linear regression model was then fit to the data, with the expectation that it would fit better with more sample points. First, only 'White' and 'pop2010' were used as the explanatory variables.

In [16]:
### ---------------------------------- PIVOTING THE DATA TABLE --------------------------------- 

# Repeating the same steps as in county data, in order to have percent of each ethnicity as a column
data_town_mod1 = data_town.groupby(['geoname', 'race_eth_name'], as_index = False)['pop2010'].sum().loc[:, ['race_eth_name', 'geoname', 'pop2010']]

# merging, in order to get total population values from 'town_pop_percent' data set
data_town_mod1 = pd.merge(
    data_town_mod1,
    town_pop_percent.drop(columns = 'p_trans_acc').rename(columns = {'pop2010':'town_total'}), 
    how='left'
)

# percent of ethnicity in each town
data_town_mod1['race_eth_percent'] = data_town_mod1.pop2010 / data_town_mod1.town_total

# pivot table to make the columns a percent of each race 
data_town_mod1 = data_town_mod1.pivot_table('race_eth_percent', ['geoname'], 'race_eth_name').reset_index().rename_axis(None, axis = 1)

# merging, in order to get 'p_trans_acc' from 'town_pop_percent' data set
data_town_mod1 = pd.merge(
    data_town_mod1,
    town_pop_percent,
    how='left'
)

# sorting the values by descending rates to public transit
data_town_mod1 = data_town_mod1.sort_values(by = 'p_trans_acc', ascending = False)

# print
# data_town_mod1.head(3)

In [17]:
### ---------------------------------- MULTIPLE LINEAR REGRESSION MODEL --------------------------------- 

# analyze 'p_trans_acc' based on 'White' and 'pop2010' variables
x_mlr_town = data_town_mod1[['White', 'pop2010']]

# add intercept 
x_mx_town = add_dummy_feature(x_mlr_town, value = 1)

# response variable
y_mlr_town = data_town_mod1['p_trans_acc']

# configure model
mlr_town = LinearRegression(fit_intercept = False)

# fit model
mlr_town.fit(x_mx_town, y_mlr_town)

# fitted values
fitted_mlr_town = mlr_town.predict(x_mx_town)

# residuals
resid_mlr_town = y_mlr_town - fitted_mlr_town

# use ".copy()" to solve potential error warning
data_town_mod2 = data_town_mod1.copy()

# append fitted and residual values
data_town_mod2.loc[: , "fitted_mlr"] = fitted_mlr_town
data_town_mod2.loc[: , "resid_mlr"] = resid_mlr_town

# R^2 value
R_2_mlr_town = r2_score(data_town_mod2.p_trans_acc, data_town_mod2.fitted_mlr)
# R_2_mlr_town

The following table shows each town, the ethnicity rates, and the fitted values for 'p_trans_acc' (access rates). It also includes the residuals used for the $R^2$ analysis.

In [18]:
data_town_mod2.head(3)

index | geoname | AIAN | AfricanAm | Asian | Latino | Multiple | NHOPI | Other | White | pop2010 | p_trans_acc | fitted_mlr | resid_mlr
---|---|---|---|---|---|---|---|---|---|---|---|---|---
79 | Burbank CDP | 0.003654 | 0.023955 | 0.076127 | 0.509338 | 0.02294 | 0.001624 | 0.002233 | 0.36013 | 4926 | 1.0 | 0.224305 | 0.775695
17 | Alto CDP | 0.001406 | 0.008439 | 0.040788 | 0.07173 | 0.046414 | 0.001406 | 0.007032 | 0.822785 | 711 | 1.0 | 0.060105 | 0.939895
517 | Rollingwood CDP | 0.002021 | 0.06029 | 0.17548 | 0.61839 | 0.017851 | 0.006736 | 0.005052 | 0.11418 | 2969 | 1.0 | 0.310499 | 0.689501


The $R^2$ value increased slightly, from 0.116 in the county data to 0.13 in the town data. Adding all the ethnicity rates increased the $R^2$ value to 0.25. Even with all the predictors added, the $R^2$ value is still low. Besides the data being very scattered, one reason is that **many of the towns had a value of '0' or '1' for 'p_trans_acc'**. A linear regression model struggles to fit a response with many values piled up at 0 and 1. The following scatterplot shows how scattered the data is:

In [19]:
# scatter plot of towns and public transportation rates
alt.Chart(data_town_mod1).mark_point().encode(
    x = alt.X('pop2010', 
              title = '2010 Population',
              scale = alt.Scale(zero = False, type = 'pow', exponent = 0.2)),
    y = alt.Y('p_trans_acc', 
              title = 'Percent Access')
).properties(height = 200, title = 'Public Transportation Access by Town')

**PCA Analysis for Town Data**

The next question was whether principal component analysis (PCA) could help narrow down the predictors.

In [20]:
### ---------------------------------  PCA ANALYSIS --------------------------------- 

# data_mod3: do not include 'p_trans_acc' since it is not a predictor
data_town_mod3 = data_town_mod2[['AIAN', 'AfricanAm', 'Asian', 'Latino', 'Multiple', 'NHOPI', 'Other', 'White', 'pop2010']]

# center and scale ('normalize')
town_ctr = (data_town_mod3 - data_town_mod3.mean())/data_town_mod3.std()

# compute principal components
pca = PCA(n_components = town_ctr.shape[1])
pca.fit(town_ctr)

# store proportion of variance explained as a dataframe
pca_var_explained = pd.DataFrame({'Proportion of variance explained': pca.explained_variance_ratio_})

# add component number as a new column
pca_var_explained['Component'] = np.arange(1, pca.n_components_ + 1)

pca_var_explained['Cumulative variance explained'] = pca_var_explained['Proportion of variance explained'].cumsum()
pca_var_explained.head(3)

index | Proportion of variance explained | Component | Cumulative variance explained
---|---|---|---
0 | 0.250581 | 1 | 0.250581
1 | 0.217423 | 2 | 0.468003
2 | 0.130466 | 3 | 0.598469


This table shows that three components explain roughly 60% of the variation in the data. A heat map was then used to show the correlation between variables.
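To make the "proportion of variance explained" concrete, the sketch below reproduces the calculation with a plain singular value decomposition on synthetic data (not the town data). It assumes the same centering and scaling step used above; the array `X` is randomly generated and the names `explained_ratio`, `cumulative`, and `scores` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size = (200, 4))           # synthetic data: 200 rows, 4 variables
X[:, 1] = X[:, 0] + 0.5 * X[:, 1]         # make two of the columns correlated

# center and scale each column (sample standard deviation, as pandas .std() uses)
Z = (X - X.mean(axis = 0)) / X.std(axis = 0, ddof = 1)

# principal components from the singular value decomposition of Z
U, s, Vt = np.linalg.svd(Z, full_matrices = False)

# each component's share of the total variance, and the running total
explained_ratio = s ** 2 / np.sum(s ** 2)
cumulative = np.cumsum(explained_ratio)

# component scores: the data projected onto the principal directions
scores = Z @ Vt.T

print('proportion explained:', np.round(explained_ratio, 3))
print('cumulative:          ', np.round(cumulative, 3))
```

The proportions are sorted from largest to smallest and always sum to one, so the cumulative column indicates how many components are needed to reach a chosen share of the variance.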

In [21]:
### --------------------------------- Plot of Components vs. Variance Explained [Don't Show] ---------------------------------

base_pca = alt.Chart(pca_var_explained).encode(x = 'Component')

prop_var_base = base_pca.encode(
    y = alt.Y('Proportion of variance explained',
              axis = alt.Axis(titleColor = '#57A44C'))
)

cum_var_base = base_pca.encode(
    y = alt.Y('Cumulative variance explained', axis = alt.Axis(titleColor = '#5276A7'))
)

prop_var = prop_var_base.mark_line(stroke = '#57A44C') + prop_var_base.mark_point(color = '#57A44C')
cum_var = cum_var_base.mark_line() + cum_var_base.mark_point()

var_explained_plot = alt.layer(prop_var, cum_var).resolve_scale(y = 'independent')

# plot
# var_explained_plot

In [22]:
### --------------------------------- Heat Map ---------------------------------


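# project data onto first four components; store as data frame
projected_data = pd.DataFrame(pca.fit_transform(town_ctr)).iloc[:, 0:4].rename(columns = {0: 'PC1', 1: 'PC2', 2: 'PC3', 3: 'PC4'})

# add town name and access rate by position; 'data_town_mod2' keeps a shuffled index
# after sorting, so index-aligned assignment would attach values to the wrong rows
projected_data['geoname'] = data_town_mod2['geoname'].to_numpy()
projected_data['p_trans_acc'] = data_town_mod2['p_trans_acc'].to_numpy()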

# print
# projected_data.head(4)

### --------------------------------- [Break] ---------------------------------

### HeatMap

# correlation matrix
corr_mx = data_town_mod3.corr()

# melt corr_mx
corr_mx_long = corr_mx.reset_index().rename(
    columns = {'index': 'row'}
).melt(
    id_vars = 'row',
    var_name = 'col',
    value_name = 'Correlation'
)
# construct plot
alt.Chart(corr_mx_long).mark_rect().encode(
    x = alt.X('col', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
    y = alt.Y('row', title = '', sort = {'field': 'Correlation', 'order': 'ascending'}),
    color = alt.Color('Correlation',
                      scale = alt.Scale(scheme = 'blueorange', 
                                        domain = (-1, 1), 
                                        type = 'sqrt'),
                      legend = alt.Legend(tickCount = 5)) 
).properties(width = 300, height = 200)

This shows that some pairs of variables, such as 'White' and 'Latino', are strongly (negatively) correlated. The loadings for each principal component were then plotted.

In [23]:
### --------------------------------- Loading Plot ---------------------------------

# store the loadings as a data frame with appropriate names
loading_df = pd.DataFrame(pca.components_).transpose().rename(
    columns = {0: 'PC1', 1: 'PC2', 2: 'PC3', 3: 'PC4'}
    ).loc[:, ['PC1', 'PC2', 'PC3', 'PC4']] 

# add a column with the variable names
loading_df['Variable'] = corr_mx.index

# melt from wide to long
loading_plot_df = loading_df.melt(
    id_vars = 'Variable',
    var_name = 'Principal Component',
    value_name = 'Loading'
)
# add a column of zeros to encode for x = 0 line to plot
loading_plot_df['zero'] = np.repeat(0, len(loading_plot_df))

# create base layer
base = alt.Chart(loading_plot_df)

# create lines + points for loadings
loadings = base.mark_line(point = True).encode(
    y = alt.Y('Variable', title = ''),
    x = 'Loading',
    color = 'Principal Component'
)

# create line at zero

rule = base.mark_rule().encode(x = alt.X('zero', title = 'Loading'), size = alt.value(0.05))
# layer

loading_plot = (loadings + rule).properties(height = 200, width = 120)

loading_plot.facet(column='Principal Component')

Ultimately, this still shows that additional data is required at both the town and county level. For example, in PC1 the variables with the largest loadings are 'White' (negative) and 'Latino' (positive). From the correlation matrix, 'White' and 'Latino' have a strong negative correlation. One interpretation is that PC1 measures neighborhood homogeneity, with towns having a higher value for PC1 if most of the residents are Latino and a lower value if most are White. However, **we cannot make a reasonable inference from this PCA data.**

---
### Discussion

The findings show that access to public transportation differs between regions. Ethnicity rates and access to transportation also differ significantly between counties. However, **ethnicity rates and population are not significant predictors of public transportation rates.** For example, both Marin and Sonoma counties have a large White population share, yet Marin has a much higher transit access rate than Sonoma. This shows that it is important to consider how urban or rural the location is. Additional variables are needed too, such as unemployment rates or residents' types of work. 

Further analysis should focus on one specific region, particularly the Bay Area. In the Bay Area, there is a mix of urban and rural counties, as well as high-income and low-income counties. Using the additional predictors and focusing on one region could help the model fit the data better. Each region has different demographics and transportation infrastructure, so it is likely difficult to build a general model for the entire state of California. In addition to simple and multiple linear regression, other models could be used. For example, a regression tree could be used since the relationship is non-linear and a tree would be more interpretable than other non-linear methods.