# FCC Data Merging



## About the Data

The FCC data was taken from: https://www.fcc.gov/form-477-broadband-deployment-data-december-2019-version-1
* Download the US - US - Fixed with Satellite - Dec 19v1(CSV)
* Note this downloads as a zip. Joanie's Mac had trouble unzipping but what worked well wa
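
If the system unzip tool struggles with the archive, one alternative is Python's standard-library `zipfile` module (which this notebook already imports). This is only a sketch: the helper name `extract_csv_members` and the paths in the usage note are placeholders, and it is not necessarily the method that was used for this project.

```python
import zipfile
from pathlib import Path


def extract_csv_members(zip_path, out_dir):
    """Extract every .csv member of a zip archive and return the extracted paths.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.lower().endswith(".csv"):
                # extract() returns the path of the file it wrote
                extracted.append(Path(archive.extract(member, path=out_dir)))
    return extracted
```

For example, `extract_csv_members("downloaded.zip", "../data/fcc")` would place the CSV under `../data/fcc`, where `downloaded.zip` stands in for whatever name the FCC download has.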

The columns in the FCC dataset are: https://www.fcc.gov/general/explanation-broadband-deployment-data 

With more info on the tech codes: https://www.fcc.gov/general/technology-codes-used-fixed-broadband-deployment-data

In [1]:
## imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import zipfile
#import os

In [2]:
## Read in census data (Note: this came from the census data cleaning notebook)
census_data = pd.read_csv("../data/relabeled_census.csv")
census_data.head()

Unnamed: 0,NAME,total_pop2,median_age_overall,median_age_male,median_age_female,state,county,tract,employment_rate,median_income,...,pct_internet_broadband_satellite,pct_internet_only_satellite,pct_internet_other,pct_internet_no_subscrp,pct_internet_none,pct_computer,pct_computer_with_dialup,pct_computer_with_broadband,pct_computer_no_internet,pct_no_computer
0,"Census Tract 11, Jefferson County, Alabama",4781,39.0,42.5,38.1,1,73,1100,51.0,37030.0,...,9.02215,0.918422,0.0,1.134522,24.851432,80.821178,0.0,74.014046,6.807131,19.178822
1,"Census Tract 14, Jefferson County, Alabama",1946,44.3,40.5,49.1,1,73,1400,45.4,36066.0,...,4.901961,0.0,0.0,2.083333,25.490196,85.661765,0.0,71.078431,14.583333,14.338235
2,"Census Tract 20, Jefferson County, Alabama",4080,34.0,31.0,36.4,1,73,2000,47.7,27159.0,...,4.651163,0.0,0.0,0.0,45.454545,71.317829,0.0,54.545455,16.772375,28.682171
3,"Census Tract 38.02, Jefferson County, Alabama",5291,35.8,31.7,37.3,1,73,3802,51.7,38721.0,...,3.959873,0.0,0.0,6.335797,33.632524,85.744456,0.0,59.450898,26.293559,14.255544
4,"Census Tract 40, Jefferson County, Alabama",2533,52.1,51.6,53.8,1,73,4000,36.9,18525.0,...,4.548635,1.959412,0.0,5.108467,47.515745,63.051085,0.0,44.786564,18.264521,36.948915


In [3]:
## Double check which census values are null
census_data.isnull().sum()

NAME                                   0
total_pop2                             0
median_age_overall                   672
median_age_male                      702
median_age_female                    780
state                                  0
county                                 0
tract                                  0
employment_rate                      646
median_income                       1024
total_households                       0
ave_household_size                   857
ave_family_size                      880
pct_health_ins_children              935
pct_health_ins_19_64                 762
pct_health_ins_65+                  1038
total_population                       0
median_house_value                  1972
pct_white                            646
pct_hisp_latino                      646
pct_black                            646
pct_native                           646
pct_asian                            646
pct_hi_pi                            646
pct_other_race  

In [4]:
## Read in the FCC dataset
fcc = pd.read_csv("../data/fcc/fbd_us_with_satellite_dec2019_v1.csv", converters={'BlockCode' : lambda x: str(x)})
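
Because the full file has more than 73 million rows, it can help to load only the columns used later and to declare `BlockCode` as a string up front so leading zeros survive. The sketch below demonstrates the idea on a tiny in-memory CSV. The column subset is an assumption about what the analysis needs, and the sample rows are illustrative rather than real records.

```python
import io

import pandas as pd

# Tiny stand-in for the FCC file (illustrative rows, not real data)
sample_csv = io.StringIO(
    "LogRecNo,Provider_Id,StateAbbr,BlockCode,TechCode,MaxAdDown,MaxAdUp\n"
    "1,53763,WY,560379705001336,70,10.0,1.0\n"
    "2,99999,AL,010010201001000,50,100.0,10.0\n"
)

# Assumed subset of columns; adjust to whatever the analysis needs
wanted = ["Provider_Id", "StateAbbr", "BlockCode", "TechCode", "MaxAdDown", "MaxAdUp"]

fcc_small = pd.read_csv(
    sample_csv,
    usecols=wanted,
    dtype={"BlockCode": str},  # keep the 15-digit code as text
)
print(fcc_small.dtypes)
```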

In [5]:
fcc.head()

Unnamed: 0,LogRecNo,Provider_Id,FRN,ProviderName,DBAName,HoldingCompanyName,HocoNum,HocoFinal,StateAbbr,BlockCode,TechCode,Consumer,MaxAdDown,MaxAdUp,Business
0,1,53763,1630201,Union Telephone Company,Union Telephone Company,Union Holding Corp.,360114,Union Holding Corp.,WY,560379705001336,70,1,10.0,1.0,1
1,2,53763,1630201,Union Telephone Company,Union Telephone Company,Union Holding Corp.,360114,Union Holding Corp.,WY,560379705001391,70,1,10.0,1.0,1
2,3,53763,1630201,Union Telephone Company,Union Telephone Company,Union Holding Corp.,360114,Union Holding Corp.,WY,560379705001398,70,1,10.0,1.0,1
3,4,53763,1630201,Union Telephone Company,Union Telephone Company,Union Holding Corp.,360114,Union Holding Corp.,WY,560379705001399,70,1,10.0,1.0,1
4,5,53763,1630201,Union Telephone Company,Union Telephone Company,Union Holding Corp.,360114,Union Holding Corp.,WY,560379705001400,70,1,10.0,1.0,1


In [6]:
fcc.tail()

Unnamed: 0,LogRecNo,Provider_Id,FRN,ProviderName,DBAName,HoldingCompanyName,HocoNum,HocoFinal,StateAbbr,BlockCode,TechCode,Consumer,MaxAdDown,MaxAdUp,Business
73215372,73215373,55262,18756155,"VSAT Systems, LLC",Skycasters,"VSAT Systems, LLC",300167,"VSAT Systems, LLC",PR,721537506022018,60,1,2.0,1.3,1
73215373,73215374,55262,18756155,"VSAT Systems, LLC",Skycasters,"VSAT Systems, LLC",300167,"VSAT Systems, LLC",PR,721537506022019,60,1,2.0,1.3,1
73215374,73215375,55262,18756155,"VSAT Systems, LLC",Skycasters,"VSAT Systems, LLC",300167,"VSAT Systems, LLC",PR,721537506022020,60,1,2.0,1.3,1
73215375,73215376,55262,18756155,"VSAT Systems, LLC",Skycasters,"VSAT Systems, LLC",300167,"VSAT Systems, LLC",PR,721537506022021,60,1,2.0,1.3,1
73215376,73215377,55262,18756155,"VSAT Systems, LLC",Skycasters,"VSAT Systems, LLC",300167,"VSAT Systems, LLC",PR,721537506022022,60,1,2.0,1.3,1


In [7]:
## FCC data is huge - 73M+ rows
fcc.shape

(73215377, 15)

In [8]:
# Only 3000 unique provider ids
np.unique(fcc.Provider_Id).shape

(3000,)

In [9]:
## Creating a csv of the provider id mapping to providername and dbanames, this could be useful
names = fcc.drop_duplicates(subset = ["Provider_Id", "ProviderName", "DBAName"])
names = names[["Provider_Id", "ProviderName", "DBAName"]]
names.to_csv("../data/fcc_names.csv", index=False)

In [10]:
# StateAbbr includes codes for Guam, the Virgin Islands, Puerto Rico, etc., which have very different
# policies and may not show similar correlations with broadband, so we drop these territories.
#np.unique(fcc.StateAbbr)

In [11]:
# https://www.census.gov/library/reference/code-lists/ansi/ansi-codes-for-states.html
## Don't need to keep codes for American Samoa, Guam, Puerto Rico, etc
## Only want US states and Washington DC
state_codes_to_drop = ["AS", "GU", "MP", "PR", "VI"]

In [12]:
## Dropping codes from US territories
df = fcc[~fcc['StateAbbr'].isin(state_codes_to_drop)]
df

Unnamed: 0,LogRecNo,Provider_Id,FRN,ProviderName,DBAName,HoldingCompanyName,HocoNum,HocoFinal,StateAbbr,BlockCode,TechCode,Consumer,MaxAdDown,MaxAdUp,Business
0,1,53763,1630201,Union Telephone Company,Union Telephone Company,Union Holding Corp.,360114,Union Holding Corp.,WY,560379705001336,70,1,10.0,1.0,1
1,2,53763,1630201,Union Telephone Company,Union Telephone Company,Union Holding Corp.,360114,Union Holding Corp.,WY,560379705001391,70,1,10.0,1.0,1
2,3,53763,1630201,Union Telephone Company,Union Telephone Company,Union Holding Corp.,360114,Union Holding Corp.,WY,560379705001398,70,1,10.0,1.0,1
3,4,53763,1630201,Union Telephone Company,Union Telephone Company,Union Holding Corp.,360114,Union Holding Corp.,WY,560379705001399,70,1,10.0,1.0,1
4,5,53763,1630201,Union Telephone Company,Union Telephone Company,Union Holding Corp.,360114,Union Holding Corp.,WY,560379705001400,70,1,10.0,1.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73067164,73067165,55262,18756155,"VSAT Systems, LLC",Skycasters,"VSAT Systems, LLC",300167,"VSAT Systems, LLC",WY,560459513003125,60,1,2.0,1.3,1
73067165,73067166,55262,18756155,"VSAT Systems, LLC",Skycasters,"VSAT Systems, LLC",300167,"VSAT Systems, LLC",WY,560459513003126,60,1,2.0,1.3,1
73067166,73067167,55262,18756155,"VSAT Systems, LLC",Skycasters,"VSAT Systems, LLC",300167,"VSAT Systems, LLC",WY,560459513003127,60,1,2.0,1.3,1
73067167,73067168,55262,18756155,"VSAT Systems, LLC",Skycasters,"VSAT Systems, LLC",300167,"VSAT Systems, LLC",WY,560459513003128,60,1,2.0,1.3,1


In [13]:
## Adding a new column for tract_geoid
## Tract geoid is the first 2+3+6=11 digits of the 2+3+6+4=15 digit block code
## Note that some geoids start with 0, and a leading 0 may have been cut off (e.g. 012345678915678),
## so we drop the last 4 digits instead of taking the first 11

## assign() returns a new DataFrame, which avoids the SettingWithCopyWarning raised when
## writing a column onto a filtered slice of fcc
df = df.assign(tract_geoid=df["BlockCode"].str[:-4])
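
To make the GEOID layout concrete, the following sketch splits a 15-digit block code into its state (2), county (3), tract (6), and block (4) parts. The function name `split_block_code` is hypothetical, and it assumes the code is already a zero-padded 15-character string.

```python
def split_block_code(block_code):
    """Split a 15-character census block code into its components."""
    if len(block_code) != 15:
        raise ValueError("expected a zero-padded 15-character block code")
    return {
        "state": block_code[:2],
        "county": block_code[2:5],
        "tract": block_code[5:11],
        "block": block_code[11:],
        "tract_geoid": block_code[:11],
    }


parts = split_block_code("560379705001336")
print(parts["tract_geoid"])  # 56037970500
```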


In [14]:
## Now that we have a tract ID, we can drop the BlockCode column

newdf = df.drop("BlockCode", axis="columns")

## Now that we have dropped the BlockCode column, we can drop duplicate rows (i.e. same provider info in a tract)
newdf = newdf.drop_duplicates(subset = ["Provider_Id", "TechCode", "MaxAdDown", "MaxAdUp", "tract_geoid"])

In [15]:
newdf.columns

Index(['LogRecNo', 'Provider_Id', 'FRN', 'ProviderName', 'DBAName',
       'HoldingCompanyName', 'HocoNum', 'HocoFinal', 'StateAbbr', 'TechCode',
       'Consumer', 'MaxAdDown', 'MaxAdUp', 'Business', 'tract_geoid'],
      dtype='object')

In [16]:
## Now only about 1.7M rows of data
newdf.shape

(1696206, 15)

In [17]:
## Add some booleans for categories we care about
## Columns inspired by the BroadbandNow dataset: https://github.com/BroadbandNow/Open-Data

## Vectorized comparisons give the same result as a row-wise apply, but run much faster
newdf["Faster_than_25_3"] = (newdf["MaxAdDown"] >= 25) & (newdf["MaxAdUp"] >= 3)
newdf["Faster_than_100_3"] = (newdf["MaxAdDown"] >= 100) & (newdf["MaxAdUp"] >= 3)

## Technology type flags (TechCode 10-50 is wired, 60 is satellite, 70 is fixed wireless)
newdf["Is_Wired"] = (newdf["TechCode"] >= 10) & (newdf["TechCode"] <= 50)
newdf["Is_Satellite"] = newdf["TechCode"] == 60
newdf["Is_Fixed_Wireless"] = newdf["TechCode"] == 70

## Deciding how to summarize the Data

There are over 11M block codes in the data if we include the non-US state codes of "AS", "GU", "MP", "PR", "VI" : 11165833

Otherwise, there are XX block codes.

But only XXX tract codes.


The ACS data is available by tract.

We want to summarize the FCC data by tract.

In [18]:
## Let's drop more duplicates (i.e. we don't care if the same provider is listed more than once in a tract with the same speeds and flags)

newdf = newdf.drop_duplicates(subset =["Provider_Id", "Faster_than_25_3", "Faster_than_100_3", "Is_Wired",
                                       "Is_Satellite", "Is_Fixed_Wireless", "tract_geoid", "MaxAdDown", "MaxAdUp"])

In [19]:
## Let's look at results for a single tract
newdf.loc[newdf.tract_geoid=='56037970500']

Unnamed: 0,LogRecNo,Provider_Id,FRN,ProviderName,DBAName,HoldingCompanyName,HocoNum,HocoFinal,StateAbbr,TechCode,Consumer,MaxAdDown,MaxAdUp,Business,tract_geoid,Faster_than_25_3,Faster_than_100_3,Is_Wired,Is_Satellite,Is_Fixed_Wireless
0,1,53763,1630201,Union Telephone Company,Union Telephone Company,Union Holding Corp.,360114,Union Holding Corp.,WY,70,1,10.0,1.0,1,56037970500,False,False,False,False,True
156673,156674,53788,3723822,"Level 3 Communications, LLC",CenturyLink,"CenturyLink, Inc.",130228,"CenturyLink, Inc.",WY,50,0,0.0,0.0,1,56037970500,False,False,True,False,False
656958,656959,54076,4335584,MCI Communications Corporation,MCI,Verizon Communications Inc.,131425,Verizon Communications Inc.,WY,30,0,0.0,0.0,1,56037970500,False,False,True,False,False
2674220,2674221,54891,8293649,"All West/Wyoming, Inc.",All West/Wyoming Inc.,"All West Communications, Inc.",130037,"All West Communications, Inc.",WY,42,1,300.0,30.0,1,56037970500,True,True,True,False,False
8189899,8189900,55859,15616642,Wyoming.Com,Wyoming.com,Wyoming.Com,250090,Wyoming.Com,WY,20,1,41.0,41.0,1,56037970500,True,False,True,False,False
8205538,8205539,55859,15616642,Wyoming.Com,Wyoming.com,Wyoming.Com,250090,Wyoming.Com,WY,70,1,4.0,2.0,1,56037970500,False,False,False,False,True
19498401,19498402,56004,4963088,"ViaSat, Inc.",Viasat Inc,"ViaSat, Inc.",290111,"ViaSat, Inc.",WY,60,1,35.0,3.0,1,56037970500,True,False,False,True,False
19613204,19613205,56069,1631134,Bridger Valley Electric Association Inc,Bridger Valley Electric Association Inc,Bridger Valley Electric Association Inc.,300182,Bridger Valley Electric Association Inc.,WY,70,1,8.0,0.512,0,56037970500,False,False,False,False,True
30822824,30822825,58876,18626853,"CenturyLink, Inc.",CenturyLink,"CenturyLink, Inc.",130228,"CenturyLink, Inc.",WY,11,1,15.0,0.64,1,56037970500,False,False,True,False,False
30822825,30822826,58876,18626853,"CenturyLink, Inc.",CenturyLink,"CenturyLink, Inc.",130228,"CenturyLink, Inc.",WY,11,1,1.5,0.5,1,56037970500,False,False,True,False,False


In [21]:
## There are still repeat providers because some providers offer speeds both above and below the threshold speeds

### Ideas

Groupby tract_geoid

* LogRecNo: can toss
* Provider_Id: list of providers in tract
* StateAbbr: should be the same within a tract, i.e. choose the most common
* TechCode: list
* Consumer_All: 1 only if all blocks are 1
* Consumer_Some: 1 if any blocks are 1
* MaxAdDown: max in tract
* MaxAdUp: max in tract
* Business_All: 1 only if all blocks are 1
* Business_Some: 1 if any blocks are 1


Could later add unique tech code columns based on apply definitions if a code is present
* Wired if tech code is Cable, Copper, DSL, Fiber



Make a dictionary of Provider_Id to List of [ProviderName, DBAName, TechCode]
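
A rough sketch of how the groupby plan listed above could map onto a single named aggregation. It runs on a made-up four-row frame. The `_All`/`_Some` flags use `min`/`max` on the 0/1 columns, which is one possible reading of the bullets rather than the approach the notebook ends up taking.

```python
import pandas as pd

# Made-up rows for two tracts (illustrative only)
toy = pd.DataFrame({
    "tract_geoid": ["01001020100", "01001020100", "56037970500", "56037970500"],
    "Provider_Id": [100, 200, 100, 100],
    "StateAbbr": ["AL", "AL", "WY", "WY"],
    "TechCode": [50, 60, 70, 70],
    "Consumer": [1, 0, 1, 1],
    "MaxAdDown": [100.0, 25.0, 10.0, 4.0],
    "MaxAdUp": [10.0, 3.0, 1.0, 2.0],
    "Business": [1, 1, 0, 0],
})

summary = toy.groupby("tract_geoid").agg(
    Providers=("Provider_Id", lambda s: set(s)),
    StateAbbr=("StateAbbr", lambda s: s.mode().iloc[0]),  # most common value
    TechCodes=("TechCode", lambda s: set(s)),
    Consumer_All=("Consumer", "min"),    # 1 only if every row is 1
    Consumer_Some=("Consumer", "max"),   # 1 if any row is 1
    MaxAdDown=("MaxAdDown", "max"),
    MaxAdUp=("MaxAdUp", "max"),
    Business_All=("Business", "min"),
    Business_Some=("Business", "max"),
)
print(summary)
```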

## New Idea - Use a few different groupbys and then merge the results

In [20]:
grouped_tract = newdf.groupby(["tract_geoid"])

In [23]:
df1 = grouped_tract.agg(
    All_Provider_Count = ("Provider_Id", "nunique"),
    All_Providers = ("Provider_Id", lambda x: set(x)),
    MaxAdDown = ("MaxAdDown", "max"),
    MaxAdUp = ("MaxAdUp", "max"),
    AllMaxAdDown = ("MaxAdDown", lambda x: set(x)),
    AllMaxAdUp = ("MaxAdUp", lambda x: set(x))
)
df1 = df1.reset_index()
df1

Unnamed: 0,tract_geoid,All_Provider_Count,All_Providers,MaxAdDown,MaxAdUp,AllMaxAdDown,AllMaxAdUp
0,01001020100,7,"{56004, 58757, 54694, 67056, 59349, 55262, 58623}",1000.0,50.0,"{0.0, 0.768, 1.5, 3.0, 100.0, 2.0, 6.0, 1000.0...","{0.0, 0.512, 0.128, 35.0, 3.0, 0.384, 0.768, 0..."
1,01001020200,10,"{56004, 58757, 54694, 67056, 59349, 56888, 555...",1000.0,50.0,"{0.0, 1.5, 2.0, 3.0, 100.0, 6.0, 1000.0, 10.0,...","{0.0, 0.384, 0.512, 35.0, 3.0, 0.128, 0.768, 0..."
2,01001020300,10,"{56004, 58757, 54694, 67056, 59349, 56888, 555...",1000.0,50.0,"{0.0, 2.0, 100.0, 6.0, 1000.0, 940.0, 12.0, 18...","{0.0, 0.512, 0.768, 35.0, 3.0, 1.3, 50.0}"
3,01001020400,8,"{56004, 58757, 54694, 67056, 59349, 55514, 552...",1000.0,50.0,"{0.0, 0.768, 1.5, 3.0, 100.0, 5.0, 6.0, 2.0, 1...","{0.0, 0.512, 0.128, 35.0, 3.0, 0.384, 0.768, 0..."
4,01001020500,13,"{56004, 58757, 54694, 56170, 54860, 54927, 670...",1000.0,1000.0,"{0.0, 1.5, 2.0, 100.0, 6.0, 1000.0, 75.0, 940....","{0.0, 0.512, 0.128, 35.0, 3.0, 0.768, 5.0, 1.3..."
...,...,...,...,...,...,...,...
73052,56043000200,10,"{56004, 56358, 54694, 57421, 59469, 56273, 593...",1000.0,1000.0,"{0.0, 2.0, 35.0, 5.0, 1000.0, 40.0, 10.0, 940....","{0.0, 1.0, 2.0, 3.0, 35.0, 1.3, 6.0, 1000.0, 1..."
73053,56043000301,11,"{56004, 56358, 54694, 56683, 57421, 59469, 562...",1000.0,1000.0,"{0.0, 2.0, 35.0, 1000.0, 10.0, 940.0, 15.0, 50...","{0.0, 1.0, 1.3, 3.0, 35.0, 1000.0, 10.0, 20.0}"
73054,56043000302,10,"{56004, 56358, 54694, 57421, 59469, 56273, 593...",1000.0,1000.0,"{0.0, 2.0, 35.0, 1000.0, 10.0, 940.0, 15.0, 25...","{0.0, 1.0, 1.3, 3.0, 35.0, 1000.0, 10.0}"
73055,56045951100,11,"{56004, 54694, 55912, 56683, 57421, 59469, 593...",1000.0,1000.0,"{0.0, 2.0, 35.0, 100.0, 1000.0, 10.0, 12.0, 94...","{0.0, 1.0, 1.3, 3.0, 100.0, 5.0, 35.0, 1000.0,..."


In [24]:
# Total provider count for Is_Wired
grouped_tract2 = newdf.groupby(["tract_geoid", "Is_Wired"])
df2 = grouped_tract2.agg(
    Wired_Provider_Count = ("Provider_Id", "nunique")
    
)
df2 = df2.reset_index()
df2 = df2[df2.Is_Wired==True]
df2 = df2.drop(columns="Is_Wired")
df2

Unnamed: 0,tract_geoid,Wired_Provider_Count
1,01001020100,3
3,01001020200,6
5,01001020300,6
7,01001020400,4
9,01001020500,9
...,...,...
145786,56043000200,5
145788,56043000301,4
145790,56043000302,3
145792,56045951100,5
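
The per-category cells in this section all follow the same pattern: keep the rows that satisfy a condition, then count unique providers per tract. A minimal sketch of that pattern as a reusable function follows. `count_providers` is a hypothetical helper and the toy frame is made up. Filtering before grouping gives the same counts as grouping on the flag and keeping only the `True` groups.

```python
import pandas as pd


def count_providers(frame, mask, name):
    """Count unique Provider_Id values per tract among rows where mask is True."""
    return (
        frame[mask]
        .groupby("tract_geoid")["Provider_Id"]
        .nunique()
        .rename(name)
        .reset_index()
    )


# Made-up example rows
toy = pd.DataFrame({
    "tract_geoid": ["A", "A", "A", "B"],
    "Provider_Id": [1, 1, 2, 3],
    "Is_Wired": [True, True, False, False],
    "Faster_than_25_3": [True, False, True, False],
})

wired = count_providers(toy, toy["Is_Wired"], "Wired_Provider_Count")
wired_25 = count_providers(
    toy, toy["Is_Wired"] & toy["Faster_than_25_3"], "Wired_Provider_Count_25"
)
print(wired)
print(wired_25)
```

Tracts with no matching rows are simply absent from the result, so the outer merge followed by `fillna(0)` is still needed afterwards.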


In [25]:
#Total provider count for Satellite
grouped_tract3 = newdf.groupby(["tract_geoid", "Is_Satellite"])
df3 = grouped_tract3.agg(
    Satellite_Provider_Count = ("Provider_Id", "nunique")
    
)
df3 = df3.reset_index()
df3 = df3[df3.Is_Satellite==True]
df3 = df3.drop(columns="Is_Satellite")
df3 #.Satellite_Provider_Count.value_counts()

Unnamed: 0,tract_geoid,Satellite_Provider_Count
1,01001020100,4
3,01001020200,4
5,01001020300,4
7,01001020400,4
9,01001020500,4
...,...,...
145913,56043000200,4
145915,56043000301,4
145917,56043000302,4
145919,56045951100,4


In [26]:
#Total provider count for Fixed Wireless
grouped_tract4 = newdf.groupby(["tract_geoid", "Is_Fixed_Wireless"])
df4 = grouped_tract4.agg(
    Fixed_Wireless_Provider_Count = ("Provider_Id", "nunique")
    
)
df4 = df4.reset_index()
df4 = df4[df4.Is_Fixed_Wireless==True]
df4 = df4.drop(columns="Is_Fixed_Wireless")
df4

Unnamed: 0,tract_geoid,Fixed_Wireless_Provider_Count
6,01001020600,1
9,01001020801,1
11,01001020802,1
13,01001020900,1
15,01001021000,1
...,...,...
118882,56043000200,4
118884,56043000301,3
118886,56043000302,3
118888,56045951100,3


In [27]:
## All providers at or above 25/3 Mbps (down/up)

grouped_tract5 = newdf.groupby(["tract_geoid", "Faster_than_25_3"])
df5 = grouped_tract5.agg(
    All_Provider_Count_25 = ("Provider_Id", "nunique")
    
)
df5 = df5.reset_index()
df5 = df5[df5.Faster_than_25_3==True]
df5 = df5.drop(columns="Faster_than_25_3")
df5

Unnamed: 0,tract_geoid,All_Provider_Count_25
1,01001020100,5
3,01001020200,4
5,01001020300,4
7,01001020400,5
9,01001020500,5
...,...,...
145895,56043000200,7
145897,56043000301,5
145899,56043000302,5
145901,56045951100,8


In [28]:
## All providers at or above 100/3 Mbps (down/up)

grouped_tract6 = newdf.groupby(["tract_geoid", "Faster_than_100_3"])
df6 = grouped_tract6.agg(
    All_Provider_Count_100 = ("Provider_Id", "nunique")
    
)
df6 = df6.reset_index()
df6 = df6[df6.Faster_than_100_3==True]
df6 = df6.drop(columns="Faster_than_100_3")
df6

Unnamed: 0,tract_geoid,All_Provider_Count_100
1,01001020100,4
3,01001020200,3
5,01001020300,3
7,01001020400,3
9,01001020500,4
...,...,...
145512,56043000200,4
145514,56043000301,2
145516,56043000302,2
145518,56045951100,4


In [29]:
## Fixed Wireless providers at or above 25/3 Mbps

grouped_tract7 = newdf.groupby(["tract_geoid", "Faster_than_25_3", "Is_Fixed_Wireless"])
df7 = grouped_tract7.agg(
    Fixed_Wireless_Provider_Count_25 = ("Provider_Id", "nunique")
    
)
df7 = df7.reset_index()
df7 = df7[(df7.Faster_than_25_3==True)&(df7.Is_Fixed_Wireless==True)]
df7 = df7.drop(columns=["Faster_than_25_3", "Is_Fixed_Wireless"])
df7

Unnamed: 0,tract_geoid,Fixed_Wireless_Provider_Count_25
33,01003010100,1
169,01009050101,1
173,01009050102,1
177,01009050200,1
187,01009050500,1
...,...,...
211526,56043000200,2
211530,56043000301,1
211534,56043000302,1
211538,56045951100,3


In [30]:
## Fixed Wireless providers at or above 100/3 Mbps
grouped_tract = newdf.groupby(["tract_geoid", "Faster_than_100_3", "Is_Fixed_Wireless"])
df10 = grouped_tract.agg(
    Fixed_Wireless_Provider_Count_100 = ("Provider_Id", "nunique")
    
)
df10 = df10.reset_index()
df10 = df10[(df10.Faster_than_100_3==True)&(df10.Is_Fixed_Wireless==True)]
df10 = df10.drop(columns=["Faster_than_100_3", "Is_Fixed_Wireless"])
df10

Unnamed: 0,tract_geoid,Fixed_Wireless_Provider_Count_100
256,01015002000,1
263,01015002103,1
393,01029959500,1
398,01029959700,1
401,01029959800,1
...,...,...
200922,56037971600,2
200930,56039967702,1
200937,56041975200,2
200941,56041975300,2


In [31]:
## Wired providers at or above 25/3 Mbps

grouped_tract = newdf.groupby(["tract_geoid", "Faster_than_25_3", "Is_Wired"])
df8 = grouped_tract.agg(
    Wired_Provider_Count_25 = ("Provider_Id", "nunique")
    
)
df8 = df8.reset_index()
df8 = df8[(df8.Faster_than_25_3==True)&(df8.Is_Wired==True)]
df8 = df8.drop(columns=["Faster_than_25_3", "Is_Wired"])
df8

Unnamed: 0,tract_geoid,Wired_Provider_Count_25
3,01001020100,3
7,01001020200,2
11,01001020300,2
15,01001020400,3
19,01001020500,3
...,...,...
290828,56043000200,4
290832,56043000301,2
290836,56043000302,2
290840,56045951100,3


In [32]:
## Wired providers at or above 100/3 Mbps

grouped_tract = newdf.groupby(["tract_geoid", "Faster_than_100_3", "Is_Wired"])
df11 = grouped_tract.agg(
    Wired_Provider_Count_100 = ("Provider_Id", "nunique")
    
)
df11 = df11.reset_index()
df11 = df11[(df11.Faster_than_100_3==True)&(df11.Is_Wired==True)]
df11 = df11.drop(columns=["Faster_than_100_3", "Is_Wired"])
df11

Unnamed: 0,tract_geoid,Wired_Provider_Count_100
3,01001020100,3
7,01001020200,2
11,01001020300,2
15,01001020400,2
19,01001020500,3
...,...,...
259501,56043000200,4
259504,56043000301,2
259507,56043000302,2
259511,56045951100,3


In [33]:
## Satellite providers at or above 25/3 Mbps

grouped_tract = newdf.groupby(["tract_geoid", "Faster_than_25_3", "Is_Satellite"])
df9 = grouped_tract.agg(
    Satellite_Provider_Count_25 = ("Provider_Id", "nunique")
    
)
df9 = df9.reset_index()
df9 = df9[(df9.Faster_than_25_3==True)&(df9.Is_Satellite==True)]
df9 = df9.drop(columns=["Faster_than_25_3", "Is_Satellite"])
df9

Unnamed: 0,tract_geoid,Satellite_Provider_Count_25
3,01001020100,2
7,01001020200,2
11,01001020300,2
15,01001020400,2
19,01001020500,2
...,...,...
291104,56043000200,2
291108,56043000301,2
291112,56043000302,2
291116,56045951100,2


In [34]:
## Satellite providers at or above 100/3 Mbps

grouped_tract = newdf.groupby(["tract_geoid", "Faster_than_100_3", "Is_Satellite"])
df12 = grouped_tract.agg(
    Satellite_Provider_Count_100 = ("Provider_Id", "nunique")
    
)
df12 = df12.reset_index()
df12 = df12[(df12.Faster_than_100_3==True)&(df12.Is_Satellite==True)]
df12 = df12.drop(columns=["Faster_than_100_3", "Is_Satellite"])
df12

Unnamed: 0,tract_geoid,Satellite_Provider_Count_100
3,01001020100,1
7,01001020200,1
11,01001020300,1
15,01001020400,1
19,01001020500,1
...,...,...
253985,56041975200,1
253989,56041975300,1
253993,56041975400,1
254006,56045951100,1


In [35]:
## Aggregate it all

data = df1
dfs = [df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12]

for d in dfs:
    data = data.merge(d, how="outer", on="tract_geoid")

#Fill NAs with 0, if it was NA that means there was nothing to count so it is 0
data = data.fillna(0)
data


Unnamed: 0,tract_geoid,All_Provider_Count,All_Providers,MaxAdDown,MaxAdUp,AllMaxAdDown,AllMaxAdUp,Wired_Provider_Count,Satellite_Provider_Count,Fixed_Wireless_Provider_Count,All_Provider_Count_25,All_Provider_Count_100,Fixed_Wireless_Provider_Count_25,Wired_Provider_Count_25,Satellite_Provider_Count_25,Fixed_Wireless_Provider_Count_100,Wired_Provider_Count_100,Satellite_Provider_Count_100
0,01001020100,7,"{56004, 58757, 54694, 67056, 59349, 55262, 58623}",1000.0,50.0,"{0.0, 0.768, 1.5, 3.0, 100.0, 2.0, 6.0, 1000.0...","{0.0, 0.512, 0.128, 35.0, 3.0, 0.384, 0.768, 0...",3.0,4,0.0,5.0,4.0,0.0,3.0,2.0,0.0,3.0,1.0
1,01001020200,10,"{56004, 58757, 54694, 67056, 59349, 56888, 555...",1000.0,50.0,"{0.0, 1.5, 2.0, 3.0, 100.0, 6.0, 1000.0, 10.0,...","{0.0, 0.384, 0.512, 35.0, 3.0, 0.128, 0.768, 0...",6.0,4,0.0,4.0,3.0,0.0,2.0,2.0,0.0,2.0,1.0
2,01001020300,10,"{56004, 58757, 54694, 67056, 59349, 56888, 555...",1000.0,50.0,"{0.0, 2.0, 100.0, 6.0, 1000.0, 940.0, 12.0, 18...","{0.0, 0.512, 0.768, 35.0, 3.0, 1.3, 50.0}",6.0,4,0.0,4.0,3.0,0.0,2.0,2.0,0.0,2.0,1.0
3,01001020400,8,"{56004, 58757, 54694, 67056, 59349, 55514, 552...",1000.0,50.0,"{0.0, 0.768, 1.5, 3.0, 100.0, 5.0, 6.0, 2.0, 1...","{0.0, 0.512, 0.128, 35.0, 3.0, 0.384, 0.768, 0...",4.0,4,0.0,5.0,3.0,0.0,3.0,2.0,0.0,2.0,1.0
4,01001020500,13,"{56004, 58757, 54694, 56170, 54860, 54927, 670...",1000.0,1000.0,"{0.0, 1.5, 2.0, 100.0, 6.0, 1000.0, 75.0, 940....","{0.0, 0.512, 0.128, 35.0, 3.0, 0.768, 5.0, 1.3...",9.0,4,0.0,5.0,4.0,0.0,3.0,2.0,0.0,3.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73052,56043000200,10,"{56004, 56358, 54694, 57421, 59469, 56273, 593...",1000.0,1000.0,"{0.0, 2.0, 35.0, 5.0, 1000.0, 40.0, 10.0, 940....","{0.0, 1.0, 2.0, 3.0, 35.0, 1.3, 6.0, 1000.0, 1...",5.0,4,4.0,7.0,4.0,2.0,4.0,2.0,0.0,4.0,0.0
73053,56043000301,11,"{56004, 56358, 54694, 56683, 57421, 59469, 562...",1000.0,1000.0,"{0.0, 2.0, 35.0, 1000.0, 10.0, 940.0, 15.0, 50...","{0.0, 1.0, 1.3, 3.0, 35.0, 1000.0, 10.0, 20.0}",4.0,4,3.0,5.0,2.0,1.0,2.0,2.0,0.0,2.0,0.0
73054,56043000302,10,"{56004, 56358, 54694, 57421, 59469, 56273, 593...",1000.0,1000.0,"{0.0, 2.0, 35.0, 1000.0, 10.0, 940.0, 15.0, 25...","{0.0, 1.0, 1.3, 3.0, 35.0, 1000.0, 10.0}",3.0,4,3.0,5.0,2.0,1.0,2.0,2.0,0.0,2.0,0.0
73055,56045951100,11,"{56004, 54694, 55912, 56683, 57421, 59469, 593...",1000.0,1000.0,"{0.0, 2.0, 35.0, 100.0, 1000.0, 10.0, 12.0, 94...","{0.0, 1.0, 1.3, 3.0, 100.0, 5.0, 35.0, 1000.0,...",5.0,4,3.0,8.0,4.0,3.0,3.0,2.0,0.0,3.0,1.0
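
The outer merges introduce `NaN` for tracts missing from a given table, which turns the count columns into floats (hence values like `3.0` above). If integer counts are preferred, one option is to cast after filling. This sketch uses a made-up two-table example and selects count columns by the `_Count` substring, which is an assumption about the naming used here.

```python
import pandas as pd

left = pd.DataFrame({"tract_geoid": ["A", "B"], "All_Provider_Count": [3, 2]})
right = pd.DataFrame({"tract_geoid": ["A"], "Wired_Provider_Count": [1]})

merged = left.merge(right, how="outer", on="tract_geoid")

# Tract B has no wired rows, so its count is NaN until filled
count_cols = [c for c in merged.columns if "_Count" in c]
merged = merged.fillna({c: 0 for c in count_cols})
merged = merged.astype({c: int for c in count_cols})
print(merged.dtypes)
```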


In [36]:
data.to_csv("../data/aggregated_fcc_by_tract.csv")

In [37]:
census_data.columns

Index(['NAME', 'total_pop2', 'median_age_overall', 'median_age_male',
       'median_age_female', 'state', 'county', 'tract', 'employment_rate',
       'median_income', 'total_households', 'ave_household_size',
       'ave_family_size', 'pct_health_ins_children', 'pct_health_ins_19_64',
       'pct_health_ins_65+', 'total_population', 'median_house_value',
       'pct_white', 'pct_hisp_latino', 'pct_black', 'pct_native', 'pct_asian',
       'pct_hi_pi', 'pct_other_race', 'pct_two+_race', 'pct_rent_burdened',
       'poverty_rate', 'pct_pop_bachelors+', 'pct_pop_hs+', 'pct_internet',
       'pct_internet_dial_up', 'pct_internet_broadband_any_type',
       'pct_internet_cellular', 'pct_only_cellular',
       'pct_internet_broadband_fiber', 'pct_internet_broadband_satellite',
       'pct_internet_only_satellite', 'pct_internet_other',
       'pct_internet_no_subscrp', 'pct_internet_none', 'pct_computer',
       'pct_computer_with_dialup', 'pct_computer_with_broadband',
       'pct_compute

## Merging with the Census Data



In [38]:
data = pd.read_csv("../data/aggregated_fcc_by_tract.csv", index_col="Unnamed: 0")

In [39]:
data.shape

(73057, 18)

In [40]:
census_data.shape

(73056, 46)

In [41]:
## Helper functions: left-pad codes with zeros to their fixed widths

def pad_county_code(x):
    return str(x).zfill(3)

def pad_state_code(x):
    return str(x).zfill(2)

def pad_tract_code(x):
    return str(x).zfill(6)

def pad_tract_geoid(x):
    return str(x).zfill(11)

In [42]:
data["tract_geoid"] = data["tract_geoid"].apply(pad_tract_geoid)

census_data["tract_geoid"] = census_data["state"].apply(pad_state_code) + census_data["county"].apply(
    pad_county_code) + census_data["tract"].apply(pad_tract_code)
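
The same padding can also be done without `apply`, using pandas string methods. A small sketch on made-up state, county, and tract codes (integer-typed, like the columns shown in `census_data.head()`):

```python
import pandas as pd

toy_census = pd.DataFrame({"state": [1, 56], "county": [73, 37], "tract": [1100, 970500]})

# Pad each part to its fixed width (2 + 3 + 6) and concatenate into an 11-character GEOID
toy_census["tract_geoid"] = (
    toy_census["state"].astype(str).str.zfill(2)
    + toy_census["county"].astype(str).str.zfill(3)
    + toy_census["tract"].astype(str).str.zfill(6)
)
print(toy_census["tract_geoid"].tolist())  # ['01073001100', '56037970500']
```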

## We noted below in the EDA that we have to address 2011 & 2012 geography changes in the FCC tract ID data

## Updating Census Tracts Dataset with Geographic Changes in 2011 & 2012 

Note that the geocorr mapping tool only provides 2010 tracts mapped to ZIP codes. A few tracts changed after 2010, which affects our 2019 data.

For more info, refer to:
* 2011 Changes: https://www.census.gov/programs-surveys/acs/technical-documentation/table-and-geography-changes/2011/geography-changes.html
* 2012 Changes: https://www.census.gov/programs-surveys/acs/technical-documentation/table-and-geography-changes/2012/geography-changes.html

We also reviewed the changes listed for the later years through 2019 and did not find any other relevant tract changes.

In [43]:
## Madison County, NY state and County code is: 36 053
madison_pairs = {
    "940101" : "030101",
    "940102" : "030102",
    "940103" : "030103",
    "940200" : "030200",
    "940300" : "030300",
    "940401" : "030401",
    "940403" : "030403",
    "940600" : "030600",
    "940700" : "030402"
}

for old, new in madison_pairs.items():
    data.loc[(data.tract_geoid=="36053"+old), "tract_geoid"] = "36053"+new
    


In [44]:
## Oneida NY County code is: 36065

oneida_pairs = {
    "940000" : "024800",
    "940100" : "024700",
    "940200" : "024900"
}

for old, new in oneida_pairs.items():
    data.loc[(data.tract_geoid=="36065"+old), "tract_geoid"] = "36065"+new

In [45]:
## Pima County, AZ 04019

pima_pairs = {
    "002701" : "002704",
    "002903" : "002906",
    "410501" : "004118",
    "410502" : "004121",
    "410503" : "004125",
    "470400" : "005200",
    "470500" : "005300"
}

for old, new in pima_pairs.items():
    data.loc[(data.tract_geoid=="04019"+old), "tract_geoid"] = "04019"+new

In [46]:
## Los Angeles County Code = 06037

data.loc[(data.tract_geoid=="06037930401"), "tract_geoid"] = "06037137000"

## There were also some county level geography changes

* 2015: https://www.census.gov/programs-surveys/acs/technical-documentation/table-and-geography-changes.2015.html
* 2014: https://www.census.gov/programs-surveys/acs/technical-documentation/table-and-geography-changes.2014.html

In [47]:
## Wade Hampton Census Area, Alaska, was renamed as Kusilvak Census Area and the county code changed from 270 to 158

data.loc[data.tract_geoid == "02270000100", "tract_geoid"] = "02158000100"

In [48]:
## Shannon County, South Dakota, was renamed as Oglala Lakota County and the county code changed from 113 to 102. 
data.loc[data.tract_geoid == "46113940500", "tract_geoid"] = "46102940500"
data.loc[data.tract_geoid == "46113940800", "tract_geoid"] = "46102940800"
data.loc[data.tract_geoid == "46113940900", "tract_geoid"] = "46102940900"


In [49]:
## Bedford City, Virginia changed its legal status to town, ending its independent city status (county equivalent), and was absorbed as a municipality within Bedford County, Virginia.  
## Its former FIPS State and FIPS County codes 51-515 have ceased to exist.

data.loc[data.tract_geoid == "51515050100", "tract_geoid"] = "51019050100"
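
The individual recodes in this section could also be collected into one old-to-new dictionary and applied in a single pass with `Series.replace`. The sketch below shows the idea with a few of the pairs from the cells above on a made-up frame. It is an alternative formulation, not a change to the mapping itself.

```python
import pandas as pd

# A few of the old -> new tract GEOIDs from the cells above
geoid_recodes = {
    "36065940000": "36065024800",  # Oneida County, NY
    "06037930401": "06037137000",  # Los Angeles County, CA
    "02270000100": "02158000100",  # Wade Hampton -> Kusilvak, AK
    "51515050100": "51019050100",  # Bedford City -> Bedford County, VA
}

toy = pd.DataFrame({"tract_geoid": ["36065940000", "01001020100", "02270000100"]})

# Exact-match replacement; GEOIDs not in the dictionary are left unchanged
toy["tract_geoid"] = toy["tract_geoid"].replace(geoid_recodes)
print(toy["tract_geoid"].tolist())
```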

In [50]:
big_data = data.merge(census_data, how="outer", on="tract_geoid")
#big_data.to_csv("../data/fcc_census.csv")
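
To see which tracts fail to match on either side of an outer merge, `merge` accepts `indicator=True`, which adds a `_merge` column. A sketch on made-up frames (the frame names and values are illustrative only):

```python
import pandas as pd

fcc_side = pd.DataFrame({"tract_geoid": ["A", "B", "C"], "All_Provider_Count": [3, 2, 1]})
census_side = pd.DataFrame({"tract_geoid": ["A", "B", "D"], "NAME": ["t1", "t2", "t4"]})

checked = fcc_side.merge(census_side, how="outer", on="tract_geoid", indicator=True)

# 'left_only' rows exist only in the FCC-style table,
# 'right_only' rows exist only in the census-style table
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["tract_geoid", "_merge"]])
```

Any `left_only` or `right_only` rows point at GEOIDs that may need a recode like the ones above.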

In [51]:

## Census Tract 0089.00 merged into Census Tract 0097.00. The area merged into Census Tract 0097.00 is entirely water.
## See cells below
big_data[big_data.NAME.isna()]

Unnamed: 0,tract_geoid,All_Provider_Count,All_Providers,MaxAdDown,MaxAdUp,AllMaxAdDown,AllMaxAdUp,Wired_Provider_Count,Satellite_Provider_Count,Fixed_Wireless_Provider_Count,...,pct_internet_broadband_satellite,pct_internet_only_satellite,pct_internet_other,pct_internet_no_subscrp,pct_internet_none,pct_computer,pct_computer_with_dialup,pct_computer_with_broadband,pct_computer_no_internet,pct_no_computer
47066,36085008900,1,{56004},35.0,3.0,{35.0},{3.0},0.0,1,0.0,...,,,,,,,,,,


In [52]:
## Choosing where NAME is not NA due to note in cell above
big_data = big_data[~big_data.NAME.isna()]

In [53]:
## ONE ADDITIONAL STEP -> ADD POPULATION DENSITY column using the gazetteer file from:
## https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.2019.html

gazetteer = pd.read_csv("../census_scripts/2019_Gaz_tracts_national.txt", sep="\t", 
                        converters = {"GEOID" : lambda x: str(x)})
gazetteer

Unnamed: 0,USPS,GEOID,ALAND,AWATER,ALAND_SQMI,AWATER_SQMI,INTPTLAT,INTPTLONG
0,AL,01001020100,9817813,28435,3.791,0.011,32.481959,-86.491338
1,AL,01001020200,3325680,5669,1.284,0.002,32.475758,-86.472468
2,AL,01001020300,5349273,9054,2.065,0.003,32.474024,-86.459703
3,AL,01001020400,6384276,8408,2.465,0.003,32.471030,-86.444835
4,AL,01001020500,11408866,43534,4.405,0.017,32.458922,-86.421826
...,...,...,...,...,...,...,...,...
73996,PR,72153750501,1820185,0,0.703,0.000,18.031211,-66.867347
73997,PR,72153750502,689930,0,0.266,0.000,18.024746,-66.860442
73998,PR,72153750503,3298433,1952,1.274,0.001,18.023148,-66.876603
73999,PR,72153750601,10987037,4527,4.242,0.002,18.017809,-66.839070


In [54]:
big_data = big_data.merge(gazetteer[["GEOID", "ALAND", "AWATER", "ALAND_SQMI", "AWATER_SQMI"]], how="left", left_on="tract_geoid", right_on="GEOID")


In [55]:
pd.set_option('display.max_rows', 500)
big_data.isnull().sum()

tract_geoid                             0
All_Provider_Count                      0
All_Providers                           0
MaxAdDown                               0
MaxAdUp                                 0
AllMaxAdDown                            0
AllMaxAdUp                              0
Wired_Provider_Count                    0
Satellite_Provider_Count                0
Fixed_Wireless_Provider_Count           0
All_Provider_Count_25                   0
All_Provider_Count_100                  0
Fixed_Wireless_Provider_Count_25        0
Wired_Provider_Count_25                 0
Satellite_Provider_Count_25             0
Fixed_Wireless_Provider_Count_100       0
Wired_Provider_Count_100                0
Satellite_Provider_Count_100            0
NAME                                    0
total_pop2                              0
median_age_overall                    672
median_age_male                       702
median_age_female                     780
state                             

In [56]:
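def safe_division(x, y):
    """Return (x / y) * 100, or NaN when y is zero."""
    try:
        # Note the *100: the result is 100x the plain ratio x / y
        return (x / y) * 100
    except ZeroDivisionError:
        # y was 0, so return NaN
        return np.nan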

In [57]:
big_data["population_density"] = big_data.apply(lambda row: safe_division(row["total_pop2"], row["ALAND_SQMI"]), axis=1)
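
The row-wise `apply` above can probably be replaced by a single vectorized division. The sketch below uses a toy frame with made-up values and, unlike `safe_division`, leaves out the `* 100` scaling, so it yields plain people per square mile (100 times smaller than the `population_density` column). The column name `people_per_sq_mile` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with the two columns used above (illustrative values only)
toy = pd.DataFrame({"total_pop2": [4000, 1200, 300],
                    "ALAND_SQMI": [0.5, 0.0, 2.0]})

# Replace zero land area with NaN first so the division gives NaN, not inf
land = toy["ALAND_SQMI"].replace(0, np.nan)
toy["people_per_sq_mile"] = toy["total_pop2"] / land
print(toy)
```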

In [58]:
big_data.to_csv("../data/fcc_census.csv", index=False)
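
One thing to watch when reading this CSV back: by default `pd.read_csv` will parse `tract_geoid` as an integer and drop the leading zero that states such as Alabama (01) rely on. A small round-trip sketch on toy data, using an in-memory buffer instead of the real file:

```python
import io
import pandas as pd

toy = pd.DataFrame({"tract_geoid": ["01001020100", "36085009700"],
                    "total_pop2": [1900, 4500]})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing infers an integer column and loses the leading zero
buffer.seek(0)
as_int = pd.read_csv(buffer)

# Forcing a string dtype keeps the 11-character GEOID intact
buffer.seek(0)
as_str = pd.read_csv(buffer, dtype={"tract_geoid": str})

print(as_int["tract_geoid"].iloc[0], as_str["tract_geoid"].iloc[0])
```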

## Some EDA of big data

In [59]:
big_data.shape

(73056, 70)

In [60]:
pd.set_option('display.max_rows', None)
big_data.isnull().sum()

tract_geoid                             0
All_Provider_Count                      0
All_Providers                           0
MaxAdDown                               0
MaxAdUp                                 0
AllMaxAdDown                            0
AllMaxAdUp                              0
Wired_Provider_Count                    0
Satellite_Provider_Count                0
Fixed_Wireless_Provider_Count           0
All_Provider_Count_25                   0
All_Provider_Count_100                  0
Fixed_Wireless_Provider_Count_25        0
Wired_Provider_Count_25                 0
Satellite_Provider_Count_25             0
Fixed_Wireless_Provider_Count_100       0
Wired_Provider_Count_100                0
Satellite_Provider_Count_100            0
NAME                                    0
total_pop2                              0
median_age_overall                    672
median_age_male                       702
median_age_female                     780
state                             

In [107]:
## The ID 36085008900 is a tract in Richmond County, NY, which was merged into census tract 0097.00
big_data[big_data.NAME.isna()]

Unnamed: 0,tract_geoid,All_Provider_Count,All_Providers,Wired_Provider_Count,Satellite_Provider_Count,Fixed_Wireless_Provider_Count,All_Provider_Count_25,All_Provider_Count_100,Fixed_Wireless_Provider_Count_25,Wired_Provider_Count_25,...,pct_computer_with_dialup,pct_computer_with_broadband,pct_computer_no_internet,pct_no_computer,GEOID,ALAND,AWATER,ALAND_SQMI,AWATER_SQMI,population_density


In [108]:
## We can see that its single provider is already covered by tract 36085009700, so we can drop 36085008900
big_data[(big_data.tract_geoid=="36085008900") | (big_data.tract_geoid=="36085009700")]

Unnamed: 0,tract_geoid,All_Provider_Count,All_Providers,Wired_Provider_Count,Satellite_Provider_Count,Fixed_Wireless_Provider_Count,All_Provider_Count_25,All_Provider_Count_100,Fixed_Wireless_Provider_Count_25,Wired_Provider_Count_25,...,pct_computer_with_dialup,pct_computer_with_broadband,pct_computer_no_internet,pct_no_computer,GEOID,ALAND,AWATER,ALAND_SQMI,AWATER_SQMI,population_density
47068,36085009700,8,"{56004, 54694, 56843, 54895, 55027, 59349, 552...",4.0,4,0.0,4.0,2.0,0.0,2.0,...,0.566038,84.08805,7.798742,7.54717,36085009700,1146651,520483,0.443,0.201,1023702.0


In [112]:
# These are the rows that are missing from the FCC data.
# This is due to the 2011 & 2012 tract ID changes. Let's fix this above, before the merge.
big_data[big_data.All_Provider_Count.isna()][["tract_geoid", "NAME", "total_pop2"]]

Unnamed: 0,tract_geoid,NAME,total_pop2


## Old Idea - Use apply to get the counts we want - takes too long!

In [None]:
# ## Get the total provider count
# tracts["All_Provider_Count"] = tracts.apply(lambda row: len(np.unique(newdf.loc[newdf.tract_geoid == row.tract_geoid,
#                                                                   "Provider_Id"])), axis=1)

# ## Get the total provider count with speeds > 25
# tracts["All_Provider_Count_25_3"] = tracts.apply(lambda row: len(np.unique(newdf.loc[((newdf.tract_geoid==row.tract_geoid)
#                                                                                 &(newdf.Faster_than_25_3)),
#                                                                   "Provider_Id"])), axis=1)

# ## Get the total provider count with speeds > 100
# tracts["All_Provider_Count_100_3"] = tracts.apply(lambda row: len(np.unique(newdf.loc[((newdf.tract_geoid==row.tract_geoid)
#                                                                                 &(newdf.Faster_than_100_3)),
#                                                                   "Provider_Id"])), axis=1)

# ## Get wired (Cable, Copper, DSL, Fiber) provider count
# tracts["Wired_Provider_Count"] = tracts.apply(lambda row: len(np.unique(newdf.loc[((newdf.tract_geoid==row.tract_geoid)
#                                                                                 &(newdf.Is_Wired)),
#                                                                   "Provider_Id"])), axis=1)

# ## Get wired (Cable, Copper, DSL, Fiber) provider count with speeds > 25
# tracts["Wired_Provider_Count_25_3"] = tracts.apply(lambda row: len(np.unique(newdf.loc[((newdf.tract_geoid==row.tract_geoid)
#                                                                                 &(newdf.Faster_than_25_3)&
#                                                                                 (newdf.Is_Wired)),
#                                                                   "Provider_Id"])), axis=1)

# ## Get wired (Cable, Copper, DSL, Fiber) provider count with speeds > 100
# tracts["Wired_Provider_Count_100_3"] = tracts.apply(lambda row: len(np.unique(newdf.loc[((newdf.tract_geoid==row.tract_geoid)
#                                                                                 &(newdf.Faster_than_100_3)&
#                                                                                 (newdf.Is_Wired)),
#                                                                   "Provider_Id"])), axis=1)

# ## Get fixed wireless provider count
# tracts["Fixed_Wireless_Provider_Count"] = tracts.apply(lambda row: len(np.unique(newdf.loc[((newdf.tract_geoid==row.tract_geoid)
#                                                                                 &(newdf.Is_Fixed_Wireless)),
#                                                                   "Provider_Id"])), axis=1)

# ## Get fixed wireless provider count > 25
# tracts["Fixed_Wireless_Provider_Count_25_3"] = tracts.apply(lambda row: len(np.unique(newdf.loc[((newdf.tract_geoid==row.tract_geoid)
#                                                                                 &(newdf.Faster_than_25_3)&
#                                                                                 (newdf.Is_Fixed_Wireless)),
#                                                                   "Provider_Id"])), axis=1)

# ## Get fixed wireless provider count > 100
# tracts["Fixed_Wireless_Provider_Count_100_3"] = tracts.apply(lambda row: len(np.unique(newdf.loc[((newdf.tract_geoid==row.tract_geoid)
#                                                                                 &(newdf.Faster_than_100_3)&
#                                                                                 (newdf.Is_Fixed_Wireless)),
#                                                                   "Provider_Id"])), axis=1)

# ## Get satellite provider count
# tracts["Satellite_Provider_Count"] = tracts.apply(lambda row: len(np.unique(newdf.loc[((newdf.tract_geoid==row.tract_geoid)
#                                                                                 &(newdf.Is_Satellite)),
#                                                                   "Provider_Id"])), axis=1)


# ## Get satellite provider count > 25
# tracts["Satellite_Provider_Count_25_3"] = tracts.apply(lambda row: len(np.unique(newdf.loc[((newdf.tract_geoid==row.tract_geoid)
#                                                                                 &(newdf.Faster_than_25_3)&
#                                                                                 (newdf.Is_Satellite)),
#                                                                   "Provider_Id"])), axis=1)

# ## Get satellite provider count > 100
# tracts["Satellite_Provider_Count_100_3"]  = tracts.apply(lambda row: len(np.unique(newdf.loc[((newdf.tract_geoid==row.tract_geoid)
#                                                                                 &(newdf.Faster_than_100_3)&
#                                                                                 (newdf.Is_Satellite)),
#                                                                   "Provider_Id"])), axis=1)

# # ##Get all the provider dba names
# # tracts["All_Provider_Names"] = tracts.apply(lambda row: np.unique(agged_reset.loc[agged_reset.tract_geoid==row.tract_geoid,
# #                                                                   "dba_name"]), axis=1)

# # ## Get business boolean
# # ##Get all the provider dba names
# # tracts["Business"] = tracts.apply(lambda row: np.sum(agged_reset.loc[agged_reset.tract_geoid==row.tract_geoid,
# #                                                                   "business_any"])>=1, axis=1)

# # tracts["Consumer"] = tracts.apply(lambda row: np.sum(agged_reset.loc[agged_reset.tract_geoid==row.tract_geoid,
# #                                                                   "consumer_any"])>=1, axis=1)



# tracts
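
The row-wise `apply` calls above re-scan `newdf` once per tract, which is likely why they take so long. A faster alternative is to filter once and let `groupby(...).nunique()` do the counting. This is only a sketch on a toy `newdf` that assumes the column names used in the commented-out code; the helper `provider_count` is hypothetical.

```python
import pandas as pd

# Toy stand-in for newdf, using column names from the commented-out code above
newdf = pd.DataFrame({
    "tract_geoid": ["A", "A", "A", "B", "B"],
    "Provider_Id": [1, 1, 2, 3, 4],
    "Faster_than_25_3": [True, False, False, True, True],
    "Is_Wired": [True, True, False, False, False],
})

def provider_count(mask):
    """Count distinct providers per tract among the rows selected by mask."""
    return newdf[mask].groupby("tract_geoid")["Provider_Id"].nunique()

all_rows = pd.Series(True, index=newdf.index)
tracts = pd.DataFrame({
    "All_Provider_Count": provider_count(all_rows),
    "All_Provider_Count_25_3": provider_count(newdf["Faster_than_25_3"]),
    "Wired_Provider_Count_25_3": provider_count(newdf["Faster_than_25_3"] & newdf["Is_Wired"]),
})
# Tracts with no rows left after a filter come back as NaN; treat them as 0
tracts = tracts.fillna(0).astype(int)
print(tracts)
```

The remaining counts (fixed wireless, satellite, the 100 Mbps thresholds) would follow the same pattern with a different mask.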

In [None]:
## TODO Merge the tracts dataset with the Census values


tracts.to_csv("tracts.csv")

In [67]:
test = pd.DataFrame({"first" : [1,2,3,4,5], "second" : ["red","blue", "yellow", "red", "yellow"], "third" : ["a","a","a","a","b"]})
print(test)
test.groupby(["second","third"]).agg(
    num = ("first", lambda x: ((x==3) | (x==4)).all())
)

   first  second third
0      1     red     a
1      2    blue     a
2      3  yellow     a
3      4     red     a
4      5  yellow     b


Unnamed: 0_level_0,Unnamed: 1_level_0,num
second,third,Unnamed: 2_level_1
blue,a,False
red,a,False
yellow,a,True
yellow,b,False


In [45]:
agged2 = agged.reset_index()
grouped2 = agged2.groupby("tract_geoid")


In [None]:
grouped2.agg(
    
)

In [50]:
agged["fcc_satellite"] = agged["tech_codes"].apply(lambda x: 60 in x)
agged["fcc_dsl"]= agged["tech_codes"].apply(lambda x: (10 in x) | (11 in x) | (12 in x) | (20 in x))
agged["fcc_cable_modem"] = agged["tech_codes"].apply(lambda x: (40 in x) | (41 in x) | (42 in x) | (43 in x))
agged["fcc_other_copper"] = agged["tech_codes"].apply(lambda x: 30 in x)
agged["fcc_fiber"] = agged["tech_codes"].apply(lambda x: 50 in x)
agged["fcc_terrestrial_fixed_wireless"] = agged["tech_codes"].apply(lambda x: 70 in x)
agged["fcc_other"] = agged["tech_codes"].apply(lambda x: (90 in x) | (0 in x))
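# Wired = DSL, cable, other copper, or fiber (same grouping as the wired provider counts)
agged["fcc_wired"] = agged["fcc_dsl"] | agged["fcc_cable_modem"] | agged["fcc_other_copper"] | agged["fcc_fiber"]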
agged

Unnamed: 0_level_0,provider_ids,num_blocks,provider_names,state,tract,tech_codes,consumer_any,max_ad_down,max_ad_up,business_any,fcc_satellite,fcc_dsl,fcc_cable_modem,fcc_other_copper,fcc_fiber,fcc_terrestrial_fixed_wireless,fcc_electric_power_line,fcc_all_other
BlockCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
010010202002029,{53788},1,"{Level 3 Communications, LLC}",AL,01001020200,{30},False,0.0,0.0,True,False,False,False,True,False,False,False,False
010010203002020,{53788},1,"{Level 3 Communications, LLC}",AL,01001020300,{30},False,0.0,0.0,True,False,False,False,True,False,False,False,False
010010205002010,{53788},1,"{Level 3 Communications, LLC}",AL,01001020500,{30},False,0.0,0.0,True,False,False,False,True,False,False,False,False
010010205002014,{53788},1,"{Level 3 Communications, LLC}",AL,01001020500,{30},False,0.0,0.0,True,False,False,False,True,False,False,False,False
010010205002021,{53788},1,"{Level 3 Communications, LLC}",AL,01001020500,{30},False,0.0,0.0,True,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
560419752004147,{53763},1,{Union Telephone Company},WY,56041975200,{12},True,100.0,20.0,True,False,True,False,False,False,False,False,False
560419753005006,{53763},1,{Union Telephone Company},WY,56041975300,{12},True,100.0,20.0,True,False,True,False,False,False,False,False,False
560419753006361,{53763},1,{Union Telephone Company},WY,56041975300,{12},True,100.0,20.0,True,False,True,False,False,False,False,False,False
560419754003043,{53763},1,{Union Telephone Company},WY,56041975400,{12},True,100.0,20.0,True,False,True,False,False,False,False,False,False
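
The per-technology flags in the cell above could also be driven from one lookup table, which keeps the code groupings in a single place. The sketch below runs on a toy `agged` with made-up code sets; the names `TECH_GROUPS` and `WIRED` are hypothetical, and "wired" is taken to mean DSL, other copper, cable and fiber, as in the wired provider counts earlier in the notebook.

```python
import pandas as pd

# FCC technology codes, grouped the same way as in the cell above
TECH_GROUPS = {
    "fcc_dsl": {10, 11, 12, 20},
    "fcc_other_copper": {30},
    "fcc_cable_modem": {40, 41, 42, 43},
    "fcc_fiber": {50},
    "fcc_satellite": {60},
    "fcc_terrestrial_fixed_wireless": {70},
}
WIRED = ["fcc_dsl", "fcc_other_copper", "fcc_cable_modem", "fcc_fiber"]

# Toy stand-in for agged: one set of tech codes per block
agged = pd.DataFrame({"tech_codes": [{30}, {12, 50}, {60, 70}]})

for column, codes in TECH_GROUPS.items():
    # A block gets the flag when its code set overlaps the group's codes
    agged[column] = agged["tech_codes"].apply(lambda x, codes=codes: not codes.isdisjoint(x))

# A block is wired if any of the wired technology flags is set
agged["fcc_wired"] = agged[WIRED].any(axis=1)
print(agged[["tech_codes", "fcc_wired"]])
```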


In [31]:
##Get all the provider dba names
tracts["All_Provider_Names"] = tracts.apply(lambda row: np.unique(agged_reset.loc[agged_reset.tract_geoid==row.tract_geoid,
                                                                  "dba_name"]), axis=1)

## Get business and consumer booleans
## (True if at least one row in the tract has the flag set)
tracts["Business"] = tracts.apply(lambda row: np.sum(agged_reset.loc[agged_reset.tract_geoid==row.tract_geoid,
                                                                  "business_any"])>=1, axis=1)

tracts["Consumer"] = tracts.apply(lambda row: np.sum(agged_reset.loc[agged_reset.tract_geoid==row.tract_geoid,
                                                                  "consumer_any"])>=1, axis=1)

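The three `apply` calls above can most likely be collapsed into one `groupby`, which avoids re-filtering `agged_reset` for every tract. A minimal sketch on a toy `agged_reset` with invented provider names:

```python
import pandas as pd

# Toy stand-in for agged_reset (illustrative values only)
agged_reset = pd.DataFrame({
    "tract_geoid": ["A", "A", "B"],
    "dba_name": ["Acme", "Acme", "Bolt"],
    "business_any": [False, True, False],
    "consumer_any": [True, True, False],
})

grouped = agged_reset.groupby("tract_geoid")
tracts = pd.DataFrame({
    # Distinct provider names per tract (order of appearance, not sorted)
    "All_Provider_Names": grouped["dba_name"].unique(),
    # True when at least one row in the tract has the flag set
    "Business": grouped["business_any"].any(),
    "Consumer": grouped["consumer_any"].any(),
})
print(tracts)
```

Note that `np.unique` returns sorted names while `unique()` keeps first-seen order, so the two differ if order matters.
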
## EDA

Why does tech code 30 so often have a speed of 0?

In [39]:
tech_code_30 = df[df.TechCode==30]
tech_code_30.MaxAdDown.value_counts()

0.000       407081
20.000       53373
12.000       14947
100.000       5146
25.000        4110
1000.000      1308
10.000         705
1.500          678
5.000          330
45.000         289
50.000         224
3.000          146
2.000          119
1.000           89
500.000         55
7.000           49
15.000          41
40.000          40
150.000         17
6.000           15
30.000          13
250.000         13
200.000         10
4.000            9
400.000          5
32.000           4
8.000            4
300.000          3
18.000           2
80.000           1
70.000           1
0.256            1
16.000           1
0.768            1
60.000           1
9.000            1
75.000           1
Name: MaxAdDown, dtype: int64

In [None]:
tech_code_30.MaxAdUp.value_counts()
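
One hypothesis worth checking, which this notebook has not confirmed, is that the zero speeds belong to rows with no consumer offering: the tech code 30 rows in the `agged` output above have `consumer_any` False and `max_ad_down` 0.0. The sketch below shows how that could be tested. It runs on toy rows, and the raw column name `Consumer` (1 = consumer service offered) is an assumption about the FCC file.

```python
import pandas as pd

# Toy rows; "Consumer" is an assumed column name in the raw FCC data
df = pd.DataFrame({
    "TechCode": [30, 30, 30, 30, 50],
    "Consumer": [0, 0, 1, 1, 1],
    "MaxAdDown": [0.0, 0.0, 20.0, 12.0, 1000.0],
})

tech_code_30 = df[df.TechCode == 30]

# Share of rows with a zero advertised download speed, per consumer flag
zero_share = (tech_code_30["MaxAdDown"] == 0).groupby(tech_code_30["Consumer"]).mean()
print(zero_share)
```

If the real data shows a share near 1 for `Consumer == 0` and near 0 otherwise, the zeros are probably placeholders for business-only service and not actual speeds.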