# NYPD Fare Evasion Enforcement's Disparate Impact
Leena Kahlon, Dylan Fosgett, Woo Hyuk Chang and Rebecca Dorn
University of California, Santa Cruz, Fall 2019

<img src="https://pbs.twimg.com/media/EJAwu4gWsAoPKwL?format=jpg&name=medium">

*Image from Twitter's @DecolonizeThisPlace*

## Introduction
***
Imagine your loan application has just been rejected by your bank. You ask why, and the bank responds that the neighborhood you live in scored poorly on likelihood of repaying the loan. Suppose the bank has published the algorithm it uses online. You look into it further and realize that, although the bank does not use race as a feature, people who live in predominantly African American neighborhoods are almost three times as likely to be denied a loan. Disparate impact refers to "when policies, practices, rules or other systems that appear to be neutral result in a disproportionate impact on a protected group". No one at the bank sat down and said, "Hey, I'm going to create a fairly racist predictor!" But the rules the bank follows disproportionately impact African Americans. Situations like these are all too common in the United States, and they must be addressed.
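
To make "almost three times as likely" concrete, here is a minimal sketch of how such a disparity could be measured as a ratio of denial rates. The counts and the function name `denial_rate_ratio` are hypothetical and chosen only for illustration; they are not the bank's data or any data from this project.

```python
def denial_rate_ratio(denied_a, total_a, denied_b, total_b):
    """Return group A's denial rate divided by group B's denial rate."""
    rate_a = denied_a / total_a
    rate_b = denied_b / total_b
    return rate_a / rate_b

# Hypothetical counts, for illustration only
ratio = denial_rate_ratio(denied_a=29, total_a=100, denied_b=10, total_b=100)
print(round(ratio, 2))
```

A ratio near 1 would mean both groups are denied at similar rates; the further the ratio is from 1, the larger the disparity, even when the protected attribute is never used as an input.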

On October 28, 2019, a video of around ten police officers tackling and pulling guns on a black teen for evading the \$2.75 subway fare in New York City went viral. Communities were enraged to see such a fiercely violent attack on a teen of color over such a small fare. A few months earlier, the MTA had calculated that it was losing millions of dollars annually to fare evasion. The MTA decided to deploy 500 new officers to watch for turnstile hoppers, in hopes of losing less money. While in theory this plan makes sense, we can't help but wonder: what kind of disparate impact is it having on communities of color? What about on lower-income communities?


## Setup: Installing Necessary Libraries
```
pip install matplotlib
pip install numpy
pip install pandas
pip install geopandas
pip install beautifulsoup4
pip install seaborn
pip install python-twitter
pip install nltk
```

Corresponding files can be found on GitHub: https://github.com/rebdorn/fare_evasion

## Project
***
We organize our project into three distinct portions:
1. Web Scraping
2. Data Wrangling
3. Data Visualization

Our goal is to create an easy-to-use visualization to answer the following research questions:
1. How has fare evasion enforcement in New York City changed in recent years?
2. What subway stations have an outstanding number of recent police sightings? What are the demographics around these stations?
3. What are the fare evasion enforcement trends around stations with outstanding numbers of recent police sightings?
4. What is the overlap between stations with more police sightings and stations with more fare evasion enforcement?

## Web Scraping
***
To get recent police sightings, we scrape the Twitter account @unfarenyc. People send this account information about police sightings at subway stations, and the account posts the number of officers seen and the station where they were seen. We scrape these sightings from the account using nltk and bear's Python Twitter wrapper (python-twitter). Note that to run this code yourself, you must possess API credentials (keys and secrets) issued by Twitter.

In [1]:
# Some import statements to make our code run
from itertools import permutations
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import twitter
import nltk

In [5]:
# Some functions that will be useful in making sense of the data
def timeline_to_df(statuses_list,column_names):
    tweets = []
    for s in statuses_list:
        tweets.append([s.id,s.user.screen_name,s.created_at,s.text,s.retweet_count])
    return pd.DataFrame(tweets,columns=column_names)

Order: scrape (Twitter), NLTK, showing why we chose some stations!

In [8]:
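import os

# Read the Twitter API credentials from environment variables instead of
# hard-coding secrets in the notebook (the variable names are our own choice)
CONSUMER_KEY = os.environ['TWITTER_CONSUMER_KEY']
CONSUMER_SECRET = os.environ['TWITTER_CONSUMER_SECRET']
ACCESS_TOKEN_KEY = os.environ['TWITTER_ACCESS_TOKEN_KEY']
ACCESS_TOKEN_SECRET = os.environ['TWITTER_ACCESS_TOKEN_SECRET']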

# connect to twitter API with our given user credentials
api = twitter.Api(consumer_key = CONSUMER_KEY,
                      consumer_secret = CONSUMER_SECRET,
                      access_token_key = ACCESS_TOKEN_KEY,
                      access_token_secret = ACCESS_TOKEN_SECRET)

# Create a dataframe to get all @unfarenyc tweets.
# Note that the maximum number of tweets we can get at once is 200
statuses_list = api.GetUserTimeline(count=200,screen_name='unfarenyc') # Load @unfarenyc's timeline

# To get earlier tweets we repeatedly find the earliest ID so far and request the 200 tweets before it
while statuses_list: # While there may be more tweets to get
    earliest_tweet = min(statuses_list, key=lambda x: x.id).id # Find the earliest tweet we have so far
    # max_id is inclusive, so subtract 1 to avoid fetching the earliest tweet a second time
    nexttweets = api.GetUserTimeline(screen_name='unfarenyc', max_id=earliest_tweet - 1, count=200)
    if not nexttweets: # Stop once no older tweets are returned
        break
    statuses_list += nexttweets

tweetsdf = timeline_to_df(statuses_list,['ID','user','created_at','text','retweet_count'])
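
The paging idea above (repeatedly request the tweets older than the earliest one seen so far) can be checked without API access. The sketch below is a self-contained illustration under stated assumptions: `fetch_all` and `fake_timeline` are hypothetical names, plain integers stand in for tweet IDs, and the fake fetcher only imitates a timeline request that returns up to `count` items, newest first, with IDs no greater than `max_id`. It is not the python-twitter implementation.

```python
def fake_timeline(all_ids, max_id=None, count=200):
    """Stand-in for a timeline request: newest-first IDs no greater than max_id."""
    eligible = [i for i in sorted(all_ids, reverse=True)
                if max_id is None or i <= max_id]
    return eligible[:count]

def fetch_all(fetch_page, count=200):
    """Collect every item by paging backwards through the timeline."""
    collected = fetch_page(max_id=None, count=count)
    while collected:
        earliest = min(collected)
        # Ask only for items strictly older than the earliest one seen so far,
        # so the boundary item is not fetched twice
        page = fetch_page(max_id=earliest - 1, count=count)
        if not page:
            break
        collected += page
    return collected

ids = list(range(1, 451))  # 450 hypothetical tweet IDs
result = fetch_all(lambda max_id, count: fake_timeline(ids, max_id, count))
print(len(result), len(set(result)))
```

Checking for an empty page before taking a minimum matters here: calling `min` on an empty list raises a `ValueError`, so the emptiness test has to come first.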

In [2]:
# Our protocol for getting unigrams into a usable form
def process_unigram(unigram):
    # Are we of the form "1"st, "2"nd, "3"rd or "10"th? ("rd" is included so that e.g. "23rd" is handled too)
    if unigram[:1].isdigit() and unigram.endswith(('st', 'nd', 'rd', 'th')):
        unigram = unigram[:-2] # Shorten unigram to only that street number
    elif unigram == 'park': # If this unigram reads "park"
        unigram = 'pk' # Shorten it to match our station names
    elif unigram == 'parkway': # If this unigram reads "parkway"
        unigram = 'pkwy' # Shorten it to match our station names
    elif unigram == 'heights': # If this unigram reads "heights"
        unigram = 'hts' # Shorten it to match our station names
    return unigram
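
As a usage sketch, the snippet below applies this kind of normalization to the words of a made-up sighting message. The message text is invented for illustration, whitespace splitting stands in for whatever tokenizer the project uses, and `normalize_unigram` is a hypothetical, compact restatement of the ordinal and abbreviation rules, written so the block runs on its own.

```python
ORDINAL_SUFFIXES = ('st', 'nd', 'rd', 'th')  # handling 'rd' is an assumption of this sketch
ABBREVIATIONS = {'park': 'pk', 'parkway': 'pkwy', 'heights': 'hts'}

def normalize_unigram(unigram):
    """Strip ordinal suffixes from street numbers and abbreviate common words."""
    if unigram[:1].isdigit() and unigram.endswith(ORDINAL_SUFFIXES):
        return unigram[:-2]
    return ABBREVIATIONS.get(unigram, unigram)

message = "2 cops at 125th and eastern parkway"  # invented example text
tokens = [normalize_unigram(word) for word in message.lower().split()]
print(tokens)
```

Lowercasing before normalizing keeps the comparisons against `'park'`, `'parkway'`, and `'heights'` from missing capitalized words.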

## Data Wrangling
***
We further split our data wrangling into two parts:
1. Wrangling at the Community District level
2. Wrangling at the Transit District level

### Wrangling at the Community District level

In [None]:
# Code and descriptions, etc.

### Wrangling at the Transit District level

In [None]:
# More code and descriptions, etc.

## Data Visualization
***
We visualize the wrangled data to answer the research questions above.

In [None]:
# More of code and descriptions, etc.

# Sources:
https://nyc.streetsblog.org/2019/11/14/mta-will-spend-249m-on-new-cops-to-save-200m-on-fare-evasion/ , defining disparate impact
*These are really really not finished*