# The Goal

I want to look at individual contributions to political campaigns just prior to the 2020 presidential election made by people on Long Island. Money is interesting, because there's a reason that we say "put your money where your mouth is". 

There's also, as we'll see, a lot you can learn from the data on its own. It contains things like Occupation and Employer that are just interesting in their own right. We could use it, for instance, to find employers and contacts that we might be interested in. Here's an example record: 

| Field           | Value                          |
|:----------------|:-------------------------------|
| CMTE_ID         | C00401224                      |
| AMNDT_IND       | N                              |
| RPT_TP          | MY                             |
| TRANSACTION_PGI |                                |
| IMAGE_NUM       | 201907299155126104             |
| TRANSACTION_TP  | 24T                            |
| ENTITY_TP       | IND                            |
| NAME            | LEVINE, ELIOT                  |
| CITY            | HUNTINGTON                     |
| STATE           | NY                             |
| ZIP_CODE        | 11743                          |
| EMPLOYER        | SELF                           |
| OCCUPATION      | LAWYER                         |
| TRANSACTION_DT  | 01092019                       |
| TRANSACTION_AMT | 25                             |
| OTHER_ID        | C00000935                      |
| TRAN_ID         | SA11AI_144839311               |
| FILE_NUM        | 1344765                        |
| MEMO_CD         |                                |
| MEMO_TEXT       | EARMARKED FOR DCCC (C00000935) |
| SUB_ID          | 4082820191120772033            |
| postalcode      | 11743                          |
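
As a small illustration of working with a record like this, here is a minimal sketch that parses two of its fields. It assumes `TRANSACTION_DT` is laid out as MMDDYYYY and that `TRANSACTION_AMT` is a plain decimal number; `parse_contribution` is a helper name made up for this example.

```python
from datetime import datetime
from decimal import Decimal


def parse_contribution(record):
    """Return (date, amount) parsed from a dict of raw string fields."""
    # Assumption: TRANSACTION_DT is MMDDYYYY, so "01092019" is 9 January 2019.
    when = datetime.strptime(record["TRANSACTION_DT"], "%m%d%Y").date()
    # Decimal avoids floating-point surprises when summing money later.
    amount = Decimal(record["TRANSACTION_AMT"])
    return when, amount


# A trimmed-down copy of the example record above.
record = {
    "NAME": "LEVINE, ELIOT",
    "TRANSACTION_DT": "01092019",
    "TRANSACTION_AMT": "25",
}
when, amount = parse_contribution(record)
print(when.isoformat(), amount)  # 2019-01-09 25
```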


## The Problem

The files for 2020 are large: 9GB compressed and 18GB uncompressed. A large share of this data is not useful for my purpose, so I have to figure out a way to keep only what I need. Filtering on NY is possible, but as I already knew, and learned even more thoroughly during this little project, NY is really, really big and very politically active. So we want to restrict the data further, from all of NY down to Long Island. But how should we do that?

We don't want to use the CITY data because one person's Bellmore is another person's North Bellmore (that is, there's really only one if you don't live there!). Also, NY is big enough to have two towns with the same name in different counties (Tuckahoe, NY for example!). 

Zip codes are more reliable indicators, but then we face a different problem - how do we know which zip codes are on Long Island? 

## The Gameplan

- Get a list of zip codes that are contained within Long Island (defined as Nassau and Suffolk counties).
- Filter the FEC data using our list of zip codes.
- Analyze the Long Island FEC data. 

We start with a list of Zip Codes for the entire US, downloaded from the USPS. We'll geocode this list of zip codes using the open source Nominatim geocoder, and use polygons for Nassau and Suffolk counties to retrieve only the zip codes we need. 
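
To make the filtering step concrete, here is a minimal sketch of streaming a large delimited file and keeping only the rows whose zip code is on a wanted list, so the whole file never has to sit in memory. It assumes the contributions file is pipe-delimited with no header row and that zip codes may carry a ZIP+4 suffix; the function name, the shortened column list, and the two "DONOR" rows are invented for illustration.

```python
import csv
import io


def filter_by_zip(lines, fieldnames, wanted_zips):
    """Yield rows (as dicts) whose 5-digit zip code is in wanted_zips."""
    reader = csv.DictReader(
        lines, fieldnames=fieldnames, delimiter="|", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        # Compare on the first five characters so ZIP+4 values still match.
        if (row["ZIP_CODE"] or "")[:5] in wanted_zips:
            yield row


# Hypothetical, shortened column list; the real file has many more fields.
fieldnames = ["NAME", "CITY", "STATE", "ZIP_CODE", "TRANSACTION_AMT"]
sample = io.StringIO(
    "LEVINE, ELIOT|HUNTINGTON|NY|11743|25\n"
    "DONOR, EXAMPLE|ALBANY|NY|122070001|100\n"
    "DONOR, ANOTHER|BELLMORE|NY|117101234|50\n"
)
long_island_zips = {"11743", "11710"}

kept = list(filter_by_zip(sample, fieldnames, long_island_zips))
print([row["NAME"] for row in kept])  # ['LEVINE, ELIOT', 'DONOR, ANOTHER']
```

Because the function is a generator, the same code works unchanged on an open file handle for the full multi-gigabyte download.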



## Geocoding

Geocoding is the process of transforming an address or other geographic information entered by a user into geographical coordinates. We do it so often nowadays that we don't even think about what a wonderful thing it is! Every time we type an address into Google Maps to get directions, we are using a geocoder. 

Geocoding is also a computationally expensive process, and we have a large number of zip codes (just over 2000 in NY state) to geocode. We'll do this cheaply by using Nominatim, the geocoder used by OpenStreetMap. Building and installing Nominatim locally is an involved process, but we can sidestep it by running Nominatim in a Docker container. We can follow the instructions here to build and run a Docker instance of Nominatim: 

https://www.linkedin.com/pulse/geocoding-geopy-your-own-nominatim-server-chonghua-yin/?trk=related_artice_Geocoding%20with%20GeoPy%20and%20Your%20Own%20Nominatim%20Server_article-card_title

We could interact with this instance directly using http requests, but we'll use geopy instead. This library allows us to switch geocoders based on the purpose we have in mind, and also allows us to implement rate limiters if the geocoding service requires it.  


In [4]:
from geopy.geocoders import Nominatim
geocoder = Nominatim(domain="localhost:8080", scheme="http")

We test it by geocoding the North Bellmore Library!

In [5]:
geocoder.geocode("North Bellmore Library")

Location(Public Library, 1551, Newbridge Road, North Bellmore, Town of Hempstead, Nassau County, 11710, United States, (40.6831165, -73.53970382629029, 0.0))
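
For comparison, the direct-HTTP route mentioned earlier might look like the sketch below. It assumes the container listens on `localhost:8080` (as in the geopy setup) and that it exposes the standard Nominatim `/search` endpoint with the `postalcode`, `format` and `limit` parameters; both helper names are made up, and the parameters are worth checking against the Nominatim version you are running.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def nominatim_search_url(postalcode, base="http://localhost:8080"):
    """Build a structured search URL for a single postal code."""
    params = {"postalcode": postalcode, "format": "jsonv2", "limit": 1}
    return f"{base}/search?{urlencode(params)}"


def fetch_json(url):
    """Fetch and decode the JSON response (needs a running Nominatim server)."""
    with urlopen(url) as response:
        return json.load(response)


url = nominatim_search_url("11710")
print(url)  # http://localhost:8080/search?postalcode=11710&format=jsonv2&limit=1
# fetch_json(url) would return a list of matches when the server is up.
```

Using geopy saves us from writing and maintaining this plumbing, and makes it easy to swap in a different geocoding service later.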

## Processing Long Island Zip Codes

We have a list of zip codes provided by the United States Postal Service, for the entire country. 

In [6]:
import pandas as pd
zip_codes_df = pd.read_excel("../references/ZIP_Locale_Detail.xls", sheet_name=0)

We can filter on PHYSICAL STATE to get only the zip codes in NY. 

In [7]:
ny_zip_codes_df = zip_codes_df[zip_codes_df["PHYSICAL STATE"]=="NY"].copy()

Amusingly, there is one zip code that is physically in NY but serviced as part of Connecticut. This is Fishers Island, which is in the Sound but whose ferry service is from Connecticut. The more you know!

In [8]:
ny_zip_codes_df["DISTRICT NAME"].value_counts()

DISTRICT NAME
NEW YORK 3     1756
NEW YORK 2      309
NEW YORK 1      208
CONNECTICUT       1
Name: count, dtype: int64

## Shapefiles and GeoPandas 

We are interested primarily in Nassau and Suffolk counties. Happily, NYS makes the boundaries of those counties available in Shapefile format. The Shapefile is a data format originally created by ESRI, makers of ArcGIS software, and it caught on as a popular format for exchanging spatial data. Many other formats are available today, and GeoJSON is particularly popular, but Shapefiles are everywhere, particularly for government data. New York State maintains a GIS clearinghouse with all kinds of useful assets, including polygon representations of all its counties. We use this file to select the spatial definitions for Nassau and Suffolk counties. 

We also use the GeoPandas library, which lets us turn the shapefiles into Pandas dataframes with a special Geometry column that we can use in spatial operations. 

In [9]:
import geopandas as gpd
counties_gdf = gpd.read_file("../references/shapefiles/Counties.shp")

Also happily, the names are spelled the way we expect. We use GeoPandas to reproject the county polygons into WGS84 (EPSG:4326), a coordinate system that uses latitude and longitude. The counties and the zip code points we work with below must share the same coordinate system. 

In [10]:
long_island_gdf = counties_gdf[counties_gdf.NAME.isin(["Nassau","Suffolk"])].copy()
long_island_gdf = long_island_gdf.to_crs(4326)
long_island_gdf

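|    | NAME    | ABBREV | GNIS_ID | FIPS_CODE | SWIS   | NYSP_ZONE   | POP1990 | POP2000 | POP2010 | POP2020 | DOS_LL | DOSLL_DATE | NYC | CALC_SQ_MI  | DATEMOD    | Shape_Leng    | Shape_Area   | geometry                                          |
|:---|:--------|:-------|:--------|:----------|:-------|:------------|:--------|:--------|:--------|:--------|:-------|:-----------|:----|:------------|:-----------|:--------------|:-------------|:--------------------------------------------------|
| 29 | Nassau  | NASS   | 974128  | 36059     | 280000 | Long Island | 1287348 | 1334544 | 1339532 | 1395774 |        |            | N   | 446.637468  | 2018-04-12 | 168031.844843 | 1156786000.0 | POLYGON ((-73.42898 40.97791, -73.42934 40.940... |
| 51 | Suffolk | SUFF   | 974149  | 36103     | 470000 | Long Island | 1321864 | 1419369 | 1493350 | 1525920 |        |            | N   | 2372.634185 |            | 385044.83796  | 6145094000.0 | POLYGON ((-72.13717 40.90804, -72.15988 40.899... |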


As mentioned above, geocoding is expensive and we don't want to do it unnecessarily. There is duplication in the file we received from the USPS. For instance, there are two post offices in Bellmore, one north and one south, but Bellmore only has one zip code (11710). There's no point in geocoding the same zip code twice.  

In [11]:
ny_zip_codes_df[ny_zip_codes_df["PHYSICAL ZIP"]==11710]

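|      | AREA NAME | AREA CODE | DISTRICT NAME | DISTRICT NO | DELIVERY ZIPCODE | LOCALE NAME    | PHYSICAL DELV ADDR | PHYSICAL CITY  | PHYSICAL STATE | PHYSICAL ZIP | PHYSICAL ZIP 4 |
|:-----|:----------|:----------|:--------------|:------------|:-----------------|:---------------|:-------------------|:---------------|:---------------|:-------------|:---------------|
| 4290 | ATLANTIC  | 4B        | NEW YORK 2    | 117         | 11710            | BELLMORE       | 2611 MERRICK RD    | BELLMORE       | NY             | 11710        | 5752.0         |
| 4291 | ATLANTIC  | 4B        | NEW YORK 2    | 117         | 11710            | NORTH BELLMORE | 2465 JERUSALEM AVE | NORTH BELLMORE | NY             | 11710        | 9991.0         |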


Just for fun, I tested the geocoder with my own town's zip code. The location given appears to be the geometric center of Bellmore, rather than the address of one of its two post offices. That's certainly going to fall within Nassau County. 

In [12]:
result = geocoder.geocode("11710", featuretype="postalcode", addressdetails=True)
result

Location(Bellmore, Town of Hempstead, Nassau County, 11710, United States, (40.67664607465925, -73.53396529349835, 0.0))

We come up with a smaller list by restricting the columns and dropping duplicates. We also make all zip codes the same length by padding them with leading zeros, which we have to do because of Fishers Island, whose zip code starts with a 0. We then run the geocoder on the resulting postal codes. 

In [13]:
ny_zip_only_df = ny_zip_codes_df[["PHYSICAL CITY", "LOCALE NAME", "DELIVERY ZIPCODE", "PHYSICAL ZIP"]].drop_duplicates()
ny_zip_only_df["postalcode"] = ny_zip_only_df["PHYSICAL ZIP"].apply(lambda x: str(int(x)).rjust(5, "0"))
ny_zip_only_df["geocode"] = ny_zip_only_df["postalcode"].apply(lambda x: geocoder.geocode(query={"postalcode": x}))
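
Because the deduplicated frame still keeps the locale and city columns, rows such as the two Bellmore post offices remain distinct, so the same zip code may still be sent to the geocoder more than once. One way to guard against that, sketched below, is to memoize the lookup. The geocoder here is a hypothetical stand-in that just records its calls, so the example runs without a Nominatim server; `normalize_zip` and `geocode_once` are names invented for this sketch.

```python
from functools import lru_cache


def normalize_zip(value):
    """Turn 6390, 6390.0 or "6390" into the five-character string "06390"."""
    return str(int(value)).zfill(5)


calls = []  # records every (simulated) geocoder request


def fake_geocode(postalcode):
    """Hypothetical stand-in for geocoder.geocode(query={"postalcode": ...})."""
    calls.append(postalcode)
    return f"location for {postalcode}"


@lru_cache(maxsize=None)
def geocode_once(postalcode):
    """Memoized wrapper: each distinct zip code reaches the geocoder once."""
    return fake_geocode(postalcode)


raw_zips = [11710, 11710, 6390]
results = [geocode_once(normalize_zip(z)) for z in raw_zips]
print(len(results), calls)  # 3 ['11710', '06390']
```

Three rows get a result, but only two requests are made. In the notebook, the same effect could be had by geocoding the unique `postalcode` values and mapping the results back onto the frame.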

### An Interlude - The Sorrows and Joys of Open Data

The surprising thing (perhaps) is that we are missing a fair amount of data. Take one example, Niobe, NY. There wasn't a zip code associated with it in OpenStreetMap (I've added one!). OpenStreetMap relies on public "donations" of data, and it seems that the good people of Niobe may not be aware of the need. We are fortunate that Long Island is very well supported in OpenStreetMap. If we were looking at a less fortunate area, we might have to invest in a commercial geocoder. 

In [14]:
ny_zip_only_df[ny_zip_only_df.geocode.isnull()]

|      | PHYSICAL CITY     | LOCALE NAME              | DELIVERY ZIPCODE | PHYSICAL ZIP | postalcode | geocode |
|:-----|:------------------|:-------------------------|:-----------------|:-------------|:-----------|:--------|
| 3936 | MAHOPAC FALLS     | MAHOPAC FALLS            | 10542            | 10542        | 10542      |         |
| 3938 | MARYKNOLL         | MARYKNOLL                | 10545            | 10545        | 10545      |         |
| 3965 | SHENOROCK         | SHENOROCK                | 10587            | 10587        | 10587      |         |
| 4039 | NEW MILFORD       | NEW MILFORD              | 10959            | 10959        | 10959      |         |
| 4202 | BROOKLYN          | OZONE PARK CARRIER ANNEX | 11416            | 11256        | 11256      |         |
| ...  | ...               | ...                      | ...              | ...          | ...        | ...     |
| 5888 | NIOBE             | NIOBE                    | 14758            | 14758        | 14758      |         |
| 5899 | SAINT BONAVENTURE | SAINT BONAVENTURE        | 14778            | 14778        | 14778      |         |
| 5906 | WEST CLARKSVILLE  | WEST CLARKSVILLE         | 14786            | 14786        | 14786      |         |
| 5939 | COOPERS PLAINS    | COOPERS PLAINS           | 14827            | 14827        | 14827      |         |


For now, we drop any records that were not geocoded successfully and apply a function to the rest to get a 2-dimensional point. 

In [15]:
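from shapely.geometry import Point

# Keep only the rows where the geocoder returned a result.
ny_zip_geocoded_df = ny_zip_only_df.dropna(subset=["geocode"]).copy()
# Shapely points are (x, y), which means (longitude, latitude).
ny_zip_geocoded_df["Point"] = ny_zip_geocoded_df["geocode"].apply(lambda x: Point(x.longitude, x.latitude))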

We build a GeoPandas dataframe from the data. 

In [16]:
ny_zip_geocoded_gdf = gpd.GeoDataFrame(ny_zip_geocoded_df, geometry=ny_zip_geocoded_df["Point"], crs=4326)

With our data in a dataframe, we can plot these zip code positions using Kepler. The picture looks an awful lot like NY!

In [17]:
from keplergl import KeplerGl
ny_map = KeplerGl(height=800)
ny_map.add_data(ny_zip_geocoded_gdf[["LOCALE NAME", "geometry"]])
ny_map

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter


KeplerGl(data={'unnamed': {'index': [2518, 3738, 3739, 3740, 3741, 3742, 3743, 3744, 3745, 3746, 3747, 3748, 3…

We can now do a spatial join between the polygons representing Nassau and Suffolk counties, and the points (longitude and latitude) returned by our geocoder. This operation is written in GeoPandas and is fairly efficient, at least for this relatively small amount of data. 

In [18]:
long_island_zipcodes_df = gpd.sjoin(ny_zip_geocoded_gdf, long_island_gdf, predicate="within")
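
Conceptually, the `within` predicate asks of each point whether it falls inside a polygon. GeoPandas does this far more efficiently than anything hand-written, but the pure-Python ray-casting sketch below shows the idea. The rectangle is a made-up box drawn loosely around Nassau County, not the real boundary, and the second test point is an arbitrary location to the west of it.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the ring of (lon, lat) vertices?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray starting at the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the east of the point; odd means inside.
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangle, NOT the real county polygon.
rough_nassau_box = [(-73.78, 40.54), (-73.42, 40.54), (-73.42, 40.98), (-73.78, 40.98)]

bellmore = (-73.53396529349835, 40.67664607465925)  # from the geocoder output above
west_of_box = (-73.98, 40.75)                        # arbitrary point outside the box

print(point_in_polygon(*bellmore, rough_nassau_box))     # True
print(point_in_polygon(*west_of_box, rough_nassau_box))  # False
```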

Bellmore only has one entry in the list, as we might expect. It is within Nassau County, which is a relief. 

In [20]:
long_island_zipcodes_df[long_island_zipcodes_df["PHYSICAL CITY"]=='BELLMORE']

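|      | PHYSICAL CITY | LOCALE NAME | DELIVERY ZIPCODE | PHYSICAL ZIP | postalcode | geocode                                           | Point                                        | geometry                   | index_right | NAME   | ... | POP2000 | POP2010 | POP2020 | DOS_LL | DOSLL_DATE | NYC | CALC_SQ_MI | DATEMOD    | Shape_Leng    | Shape_Area   |
|:-----|:--------------|:------------|:-----------------|:-------------|:-----------|:--------------------------------------------------|:---------------------------------------------|:---------------------------|:------------|:-------|:----|:--------|:--------|:--------|:-------|:-----------|:----|:-----------|:-----------|:--------------|:-------------|
| 4290 | BELLMORE      | BELLMORE    | 11710            | 11710        | 11710      | (Bellmore, Town of Hempstead, Nassau County, 1... | POINT (-73.53396529349835 40.67664607465925) | POINT (-73.53397 40.67665) | 29          | Nassau | ... | 1334544 | 1339532 | 1395774 |        |            | N   | 446.637468 | 2018-04-12 | 168031.844843 | 1156786000.0 |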


We save our list of zipcodes to a file for use in filtering the Federal Election data in our next notebook!

In [21]:
long_island_zipcodes_df["address"] = long_island_zipcodes_df["geocode"].apply(lambda x: x.address)
long_island_zipcodes_df["locality"] = long_island_zipcodes_df["address"].apply(lambda x: x.split(",")[0])
long_island_zipcodes_df = long_island_zipcodes_df.rename(columns={"NAME": "County"})
long_island_zipcodes_df = long_island_zipcodes_df[["County", "address", "locality", "postalcode", "PHYSICAL ZIP", "Point"]].drop_duplicates()
long_island_zipcodes_gdf = gpd.GeoDataFrame(long_island_zipcodes_df.drop("Point", axis="columns"), geometry=long_island_zipcodes_df["Point"], crs=4326)  # keep the WGS84 CRS with the saved points
long_island_zipcodes_gdf.to_file("../references/long_island_zipcodes.geojson", driver="GeoJSON")
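
Since GeoJSON is plain JSON, the next notebook does not strictly need GeoPandas to recover the zip code list. The sketch below assumes the saved file is a `FeatureCollection` whose features carry the `postalcode` column as a property; the in-memory sample stands in for the real file and contains only the Bellmore entry shown earlier, and `load_zip_set` is a name made up for this example.

```python
import json


def load_zip_set(geojson_text):
    """Collect the distinct postal codes from a GeoJSON FeatureCollection."""
    collection = json.loads(geojson_text)
    return {feature["properties"]["postalcode"] for feature in collection["features"]}


# Stand-in for the contents of long_island_zipcodes.geojson.
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"County": "Nassau", "locality": "Bellmore", "postalcode": "11710"},
            "geometry": {"type": "Point", "coordinates": [-73.53396529349835, 40.67664607465925]},
        },
    ],
})

print(load_zip_set(sample))  # {'11710'}
```

A set makes the later membership test against each contribution's zip code a constant-time lookup.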

We can confirm that the zip codes are on Long Island using Kepler. 

In [23]:
from keplergl import KeplerGl
li_zip_map = KeplerGl(height=800)
li_zip_map.add_data(long_island_zipcodes_gdf)
li_zip_map


KeplerGl(data={'unnamed': {'index': [2518, 4280, 4281, 4282, 4283, 4285, 4286, 4292, 4294, 4295, 4296, 4297, 4…