In [2]:
import pandas as pd

In [2]:
census_data = pd.read_csv('/home/ec2-user/efs/fcc_census_2.csv')

## Picking Variables

In [7]:
census_data.columns

Index(['tract_geoid', 'All_Provider_Count', 'All_Providers', 'MaxAdDown',
       'MaxAdUp', 'AllMaxAdDown', 'AllMaxAdUp', 'Wired_Provider_Count',
       'Satellite_Provider_Count', 'Fixed_Wireless_Provider_Count',
       ...
       'pct_pop_60_to_64', 'pct_pop_65_to_69', 'pct_pop_70_to_74',
       'pct_pop_75_to_79', 'pct_pop_80_to_84', 'pct_pop_gt_85',
       'pct_pop_disability', 'pct_pop_households_with_kids',
       'Ookla Median Download Speed (Mbps)',
       'Ookla Median Upload Speed (Mbps)'],
      dtype='object', length=111)

### Using .corr() to see which variables are most correlated with poverty rate

In [12]:
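# Keep numeric columns only: older pandas dropped non-numeric columns silently,
# but pandas >= 2.0 raises an error if any are passed to .corr()
numeric_data = census_data.select_dtypes(include='number')
pearson_corr = numeric_data.corr('pearson')
kendall_corr = numeric_data.corr('kendall')
spearman_corr = numeric_data.corr('spearman')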

In [15]:
correlation_matrix = pd.DataFrame()

correlation_matrix['pearson'] = pearson_corr['poverty_rate']
correlation_matrix['kendall'] = kendall_corr['poverty_rate']
correlation_matrix['spearman'] = spearman_corr['poverty_rate']

correlation_matrix['pearson_magnitude'] = correlation_matrix['pearson'].abs()
correlation_matrix['kendall_magnitude'] = correlation_matrix['kendall'].abs()
correlation_matrix['spearman_magnitude'] = correlation_matrix['spearman'].abs()

In [17]:
correlation_matrix.columns

Index(['pearson', 'kendall', 'spearman', 'pearson_magnitude',
       'kendall_magnitude', 'spearman_magnitude'],
      dtype='object')
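
The table built above can also be produced by a small helper. This is a minimal sketch, not the notebook's own code: the function name `correlations_with_target` and the `demo` frame (with made-up values) are assumptions for illustration. It restricts the input to numeric columns, correlates each one with the target, and adds the magnitude columns used for sorting.

```python
import pandas as pd


def correlations_with_target(df, target, methods=("pearson", "kendall", "spearman")):
    """Correlate every numeric column of `df` with `target`, one column per method.

    Hypothetical helper. Note that the 'kendall' method requires SciPy.
    """
    numeric = df.select_dtypes(include="number")
    table = pd.DataFrame(
        {m: numeric.corr(method=m)[target].drop(target) for m in methods}
    )
    # Absolute values make it easy to rank by strength regardless of sign.
    for m in methods:
        table[f"{m}_magnitude"] = table[m].abs()
    return table


# Tiny synthetic frame (made-up values), only to show the call pattern.
demo = pd.DataFrame({
    "poverty_rate": [5.0, 10.0, 15.0, 20.0, 25.0],
    "pct_internet": [95.0, 90.0, 80.0, 75.0, 60.0],
    "provider_count": [3, 5, 2, 4, 3],
    "label": ["a", "b", "c", "d", "e"],  # non-numeric, so it is ignored
})

# Only pearson and spearman here so the demo runs without SciPy.
demo_table = correlations_with_target(demo, "poverty_rate", methods=("pearson", "spearman"))
print(demo_table.sort_values("pearson_magnitude", ascending=False))
```

On the real data the call would presumably be `correlations_with_target(census_data, 'poverty_rate')`, assuming `poverty_rate` is a numeric column in `census_data`.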

In [34]:
sort_cols = ['pearson_magnitude', 'kendall_magnitude', 'spearman_magnitude']
(correlation_matrix
    .sort_values(by=sort_cols, ascending=False)[['pearson', 'kendall', 'spearman']]
    .iloc[91:120])  # rows ranked 92nd onward, i.e. the weakest correlations



Variable | pearson | kendall | spearman
--- | --- | --- | ---
pct_pop_10_to_14 | 0.031669 | 0.014224 | 0.021752
pct_pop_50k_thru_60k | 0.029187 | 0.108086 | 0.160779
pct_internet_dial_up | -0.028923 | -0.025795 | -0.035136
Wired_Provider_Count | -0.021317 | -0.008239 | -0.011463
pct_pop_households_with_kids | 0.01941 | 0.101789 | 0.15215
ave_household_size | 0.017471 | -0.025317 | -0.034924
ALAND | 0.015176 | -0.092464 | -0.143074
ALAND_SQMI | 0.015176 | -0.092476 | -0.143077
pct_internet_only_satellite | 0.013003 | 0.049589 | 0.068336
state | -0.010787 | -0.022417 | -0.032301


Takeaways:

- Most of the top correlations are income metrics.
- The broadband metrics with the strongest correlations seem to reflect broadband affordability rather than availability. For example, pct_internet increases when more people can afford broadband, not when more people have access to it. The same goes for pct_computer and pct_no_computer.
- Several of the metrics are probably strongly correlated with each other, for example [pct_internet, pct_internet_broadband_any_type, pct_internet_none], [pct_computer, pct_no_computer], or [pct_pop_hs, pct_pop_bachelors+].
- Many of the provider-count metrics don't seem to be correlated with poverty rate. My guess is that infrastructure itself doesn't discriminate between rich and poor areas (the top speed will be the same), but whether people can afford it is a different story.
- I'd have expected fewer providers in areas where broadband access is less available, but there doesn't seem to be a correlation like that.
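
The suspicion that some metrics move together can be checked directly from a correlation matrix. The sketch below is an illustration under assumptions: the helper name `highly_correlated_pairs`, the 0.9 cutoff, and the synthetic `demo` values are not from the notebook. It lists every pair of columns whose absolute correlation reaches the cutoff.

```python
import pandas as pd


def highly_correlated_pairs(corr, columns, threshold=0.9):
    """Return (a, b, r) for each pair in `columns` with |r| >= threshold.

    Hypothetical helper; `corr` is a square correlation matrix such as
    the result of DataFrame.corr(). The 0.9 default is an assumed cutoff.
    """
    columns = list(columns)
    pairs = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Synthetic values: pct_internet_none is exactly 100 - pct_internet,
# so the two are perfectly (negatively) correlated.
demo = pd.DataFrame({
    "pct_internet": [95.0, 90.0, 80.0, 75.0, 60.0],
    "pct_internet_none": [5.0, 10.0, 20.0, 25.0, 40.0],
    "download_speed": [120.0, 40.0, 90.0, 30.0, 70.0],
})
pairs = highly_correlated_pairs(demo.corr(), demo.columns)
print(pairs)
```

With the notebook's objects, something like `highly_correlated_pairs(pearson_corr, ['pct_internet', 'pct_internet_none', 'pct_computer', 'pct_no_computer'])` would test the groups named in the takeaways.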

Top Correlations Related to Broadband Metrics:

Metric | Pearson correlation with poverty rate | Description
--- | --- | ---
pct_internet | -0.646842 | Weighted average of the percent of the population with an internet subscription in the zip code
pct_computer_with_broadband | -0.646616 | Percent of the population that has a computer with a broadband internet subscription
pct_internet_broadband_any_type | -0.641772 | Percent of the population with a broadband (> 25 Mbps) internet subscription of any type in the census tract
pct_internet_none | 0.610228 | Percent of the population with no internet access
pct_internet_cellular | -0.543007 | Percent of the population with an internet subscription and a cellular data plan in the census tract
pct_internet_no_subscrp | 0.319738 | Percent of the population with internet access without a subscription
Ookla Median Download Speed (Mbps) | -0.158891 | Median download speed from the Ookla dataset
Ookla Median Upload Speed (Mbps) | -0.144009 | Median upload speed from the Ookla dataset
All_Provider_Count_100 | -0.122607 | A count of all unique provider IDs that service the area and have a max speed over 100 Mbps
All_Provider_Count_25 | -0.104580 | A count of all unique provider IDs that service the area and have a max speed over 25 Mbps
pct_internet_broadband_satellite | -0.096784 | Percent of the population with an internet subscription: satellite internet service
Fixed_Wireless_Provider_Count_25 | -0.085291 | A count of unique provider IDs that service the area and are considered Fixed Wireless technology type, i.e. a reported TechCode of 70 (Terrestrial Fixed Wireless), with speeds > 25 Mbps
pct_internet_broadband_fiber | -0.085230 | Percent of the population with an internet subscription: broadband such as cable, fiber optic or DSL
Wired_Provider_Count_100 | -0.083568
Wired_Provider_Count_25 | -0.071912
Fixed_Wireless_Provider_Count_100 | -0.067551
Satellite_Provider_Count_100 | -0.062588
Fixed_Wireless_Provider_Count | -0.051429
MaxAdUp | -0.020005
All_Provider_Count | -0.047808
Satellite_Provider_Count_25 | 0.040239
pct_internet_other | 0.036371
pct_internet_dial_up | -0.028923
Wired_Provider_Count | -0.021317
Satellite_Provider_Count | -0.005400
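
Variable lists for a given correlation threshold can be derived from these values instead of being typed by hand. The following is a minimal sketch: the helper name `variables_above` is hypothetical, and `poverty_corr` holds only a handful of values copied from the table above (in the notebook it would presumably be `pearson_corr['poverty_rate']`). The 0.3 and 0.1 cutoffs are the ones this notebook uses.

```python
import pandas as pd


def variables_above(corr_with_target, candidates, threshold):
    """Keep candidates whose absolute correlation with the target exceeds `threshold`.

    Hypothetical helper. Duplicate names in `candidates` are kept only once,
    and the original order is preserved.
    """
    kept = []
    for name in dict.fromkeys(candidates):  # de-duplicates, keeps order
        if abs(corr_with_target[name]) > threshold:
            kept.append(name)
    return kept


# A few Pearson correlations with poverty rate, copied from the table above.
poverty_corr = pd.Series({
    "pct_internet": -0.646842,
    "pct_internet_none": 0.610228,
    "pct_internet_no_subscrp": 0.319738,
    "Ookla Median Download Speed (Mbps)": -0.158891,
    "All_Provider_Count_25": -0.104580,
    "Wired_Provider_Count_100": -0.083568,
})

high = variables_above(poverty_corr, poverty_corr.index, 0.3)
mid = variables_above(poverty_corr, poverty_corr.index, 0.1)
print(high)
print(mid)
```

Comparing on the absolute value matters here, since pct_internet_none is positively correlated with poverty rate while most of the other strong metrics are negatively correlated.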

In [37]:
all_broadband_variables = ["pct_internet",
                           "pct_computer_with_broadband",
                           "pct_internet_broadband_any_type",
                           "pct_internet_none",
                           "pct_internet_cellular",
                           "pct_internet_no_subscrp",
                           "Ookla Median Download Speed (Mbps)",
                           "Ookla Median Upload Speed (Mbps)",
                           "All_Provider_Count_100",
                           "All_Provider_Count_25",
                           "pct_internet_broadband_satellite",
                           "Fixed_Wireless_Provider_Count_25",
                           "pct_internet_broadband_fiber",
                           "Wired_Provider_Count_100",
                           "Wired_Provider_Count_25",
                           "Fixed_Wireless_Provider_Count_100",
                           "Satellite_Provider_Count_100",
                           "Fixed_Wireless_Provider_Count",
                           "MaxAdUp",
                           "All_Provider_Count",
                           "Satellite_Provider_Count_25",
                           "pct_internet_other",
                           "pct_internet_dial_up",
                           "Wired_Provider_Count",
                           "Satellite_Provider_Count"]

# All variables with |correlation with poverty rate| > 0.3
high_corr = ["pct_internet", 
             "pct_computer_with_broadband",
             "pct_internet_broadband_any_type",
             "pct_internet_none",
             "pct_internet_cellular",
             "pct_internet_no_subscrp"]

# All variables with |correlation with poverty rate| > 0.1
mid_corr = ["pct_internet", 
            "pct_computer_with_broadband",
            "pct_internet_broadband_any_type",
            "pct_internet_none",
            "pct_internet_cellular",
            "pct_internet_no_subscrp",
            "Ookla Median Download Speed (Mbps)",
            "Ookla Median Upload Speed (Mbps)",
            "All_Provider_Count_100",
            "All_Provider_Count_25"]

### Checking Correlation Between Top Broadband Variables

We want to find variables that are correlated with poverty rate but not highly correlated with each other. Let's start by looking at how correlated pct_internet is with each of the other variables.

In [41]:
for i in all_broadband_variables:
    print(f"{i}: {pearson_corr['pct_internet'][i]}")


pct_internet: 1.0
pct_computer_with_broadband: 0.9952068563120876
pct_internet_broadband_any_type: 0.9985175386272158
pct_internet_none: -0.9560269551263195
pct_internet_cellular: 0.8690814697697261
pct_internet_no_subscrp: -0.4559727264071304
Ookla Median Download Speed (Mbps): 0.3639283669160873
Ookla Median Upload Speed (Mbps): 0.22030364547616488
All_Provider_Count_100: 0.190527264338434
All_Provider_Count_25: 0.10985180367629867
pct_internet_broadband_satellite: -0.038924987333297766
Fixed_Wireless_Provider_Count_25: 0.10727742941562714
pct_internet_broadband_fiber: 0.1304320912020837
Wired_Provider_Count_100: 0.10656025667524201
Wired_Provider_Count_25: 0.05193617077302265
Fixed_Wireless_Provider_Count_100: 0.1610039885988648
Satellite_Provider_Count_100: 0.09162155078356726
Fixed_Wireless_Provider_Count: -0.004433158904339903
MaxAdUp: 0.12285371396701919
All_Provider_Count: 0.09040669697691196
Satellite_Provider_Count_25: -0.012718677145945644
pct_internet_other: -0.065782036928

pct_internet is highly correlated (|r| > 0.85) with:
- pct_computer_with_broadband
- pct_internet_broadband_any_type
- pct_internet_none
- pct_internet_cellular

It is also moderately correlated with pct_internet_no_subscrp (about -0.46).

That covers every other variable that had |correlation| > 0.3 with poverty rate. This doesn't bode well.

Let's keep pct_internet, take Ookla Median Download Speed (Mbps) as the next candidate, and continue through the variables that have |correlation| > 0.1 with poverty rate.

In [42]:
for i in all_broadband_variables:
    print(f"{i}: {pearson_corr['Ookla Median Download Speed (Mbps)'][i]}")


pct_internet: 0.3639283669160873
pct_computer_with_broadband: 0.36868096012540547
pct_internet_broadband_any_type: 0.3723814906399157
pct_internet_none: -0.37256590967520853
pct_internet_cellular: 0.3663446775185923
pct_internet_no_subscrp: -0.09147093820816783
Ookla Median Download Speed (Mbps): 1.0
Ookla Median Upload Speed (Mbps): 0.3896286257840724
All_Provider_Count_100: 0.08565906496990369
All_Provider_Count_25: -0.14072958602833066
pct_internet_broadband_satellite: -0.35170303916975315
Fixed_Wireless_Provider_Count_25: -0.12778394307219632
pct_internet_broadband_fiber: 0.060294627130492134
Wired_Provider_Count_100: 0.0177650322283854
Wired_Provider_Count_25: -0.09151514865519354
Fixed_Wireless_Provider_Count_100: 0.1120035543395934
Satellite_Provider_Count_100: 0.06115695166883511
Fixed_Wireless_Provider_Count: -0.3054907991333883
MaxAdUp: 0.20611834809579532
All_Provider_Count: -0.11271896699240068
Satellite_Provider_Count_25: 0.008082893538194833
pct_internet_other: -0.0820062

In [44]:
for i in all_broadband_variables:
    print(f"{i}: {pearson_corr['All_Provider_Count_100'][i]}")


pct_internet: 0.190527264338434
pct_computer_with_broadband: 0.1919359126070964
pct_internet_broadband_any_type: 0.19084079835635412
pct_internet_none: -0.19982474036615605
pct_internet_cellular: 0.20594341893955193
pct_internet_no_subscrp: -0.03323644782001384
Ookla Median Download Speed (Mbps): 0.08565906496990369
Ookla Median Upload Speed (Mbps): 0.1305537943248629
All_Provider_Count_100: 1.0
All_Provider_Count_25: 0.735068709215781
pct_internet_broadband_satellite: -0.023402699360777802
Fixed_Wireless_Provider_Count_25: 0.3715162866759355
pct_internet_broadband_fiber: -0.0476222222974764
Wired_Provider_Count_100: 0.7872989543473666
Wired_Provider_Count_25: 0.7132926615879259
Fixed_Wireless_Provider_Count_100: 0.4923671428374058
Satellite_Provider_Count_100: 0.472024613897881
Fixed_Wireless_Provider_Count: 0.29366623083168797
MaxAdUp: 0.4114493979971153
All_Provider_Count: 0.4875733131888181
Satellite_Provider_Count_25: 0.17519145397858
pct_internet_other: -0.005369209853732419

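So now we have a new list of variables with |correlation| > 0.1 with poverty rate that are only weakly to moderately correlated with each other. One caveat: All_Provider_Count_100 and All_Provider_Count_25 have a correlation of about 0.74 with each other in the output above, so those two are not independent.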

In [45]:
small_covariance_med_corr = ['pct_internet',
                             'Ookla Median Download Speed (Mbps)',
                             'All_Provider_Count_100',
                             'All_Provider_Count_25']
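
The manual procedure used in this section (rank by correlation with poverty rate, then drop anything too correlated with what has already been kept) can be sketched as a greedy filter. This is an illustration, not the notebook's method: the function name `greedy_select` and the 0.5 pairwise cutoff are assumptions. The demo matrix uses only correlation values reported in the outputs above, rounded to six decimals.

```python
import pandas as pd


def greedy_select(corr, target, candidates, min_target_corr=0.1, max_pair_corr=0.5):
    """Greedily pick variables correlated with `target` but not with each other.

    Hypothetical helper. `max_pair_corr=0.5` is an assumed cutoff, not a value
    taken from the notebook.
    """
    # Keep candidates that clear the target threshold, strongest first.
    ranked = sorted(
        (c for c in dict.fromkeys(candidates) if abs(corr.loc[c, target]) > min_target_corr),
        key=lambda c: abs(corr.loc[c, target]),
        reverse=True,
    )
    selected = []
    for c in ranked:
        # Accept c only if it is weakly correlated with everything kept so far.
        if all(abs(corr.loc[c, s]) <= max_pair_corr for s in selected):
            selected.append(c)
    return selected


names = [
    "poverty_rate",
    "pct_internet",
    "Ookla Median Download Speed (Mbps)",
    "All_Provider_Count_100",
    "All_Provider_Count_25",
]
# Symmetric matrix filled with values reported earlier in this notebook.
values = [
    [1.0, -0.646842, -0.158891, -0.122607, -0.104580],
    [-0.646842, 1.0, 0.363928, 0.190527, 0.109852],
    [-0.158891, 0.363928, 1.0, 0.085659, -0.140730],
    [-0.122607, 0.190527, 0.085659, 1.0, 0.735069],
    [-0.104580, 0.109852, -0.140730, 0.735069, 1.0],
]
demo_corr = pd.DataFrame(values, index=names, columns=names)

selected = greedy_select(demo_corr, "poverty_rate", names[1:])
print(selected)
```

With the assumed 0.5 cutoff, All_Provider_Count_25 is dropped because its reported correlation with All_Provider_Count_100 is about 0.74. A looser cutoff such as 0.8 would keep all four variables.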