# <center> A6 : Social Security Benefits Analysis </center>

<center> Lohith B R
<br>
DATA 512 Final Project
<br>
University of Washington, Fall 2018 </center>

## Introduction 

Social Security has become very important, especially in a developed country like the United States. Income from Social Security is currently estimated to have reduced the poverty rate for Americans aged 65 or older from about 40% to below 10%.<sup>[1](https://www.cbpp.org/research/social-security/social-security-keeps-22-million-americans-out-of-poverty-a-state-by-state)</sup> In 2018, the trustees of the Social Security Trust Fund reported that the program will become financially insolvent in 2034.<sup>[2](https://www.ssa.gov/oact/tr/2018/tr2018.pdf)</sup> It is therefore important to analyze where the money is going and to ask whether some problems could be addressed in better ways than paying beneficiaries directly. For example, it may be a good idea to invest more in upgrading or reforming the health care system to reduce the cost of health care per person. It is also important to identify and prevent Social Security fraud. This analysis can, in theory, help decision makers make informed decisions.

The annual publication of OASDI (Old-Age, Survivors, and Disability Insurance) benefits from the Social Security Administration (SSA) presents basic program data on the number and type of beneficiaries and the amount of benefits paid in each state at the ZIP code level. The dataset also shows the number of men and women aged 65 or older receiving benefits. It contains only those persons to whom benefits are payable; those whose benefits were withheld are excluded.

This analysis uses the dataset to answer the research questions below and, where possible, to uncover interesting patterns and insights.

## Reproducibility

This analysis is completely reproducible given the dataset, the code, and the tools and libraries listed below. The code in this notebook handles all the data processing and produces the necessary output CSV files. For better visualization, however, I used Tableau instead of the matplotlib or seaborn libraries. The links in the references show how to use Tableau to visualize the data the way I did.

## Tools and libraries needed to reproduce the results

* Tableau Public Desktop Version : 2018.2.3 64-bit <sup>[1]</sup>
* Python Version : Python 3.6.5 |Anaconda, Inc.| <sup>[7]</sup>
* Python Numpy version : 1.12.1 <sup>[3]</sup>
* Pandas Version : 0.23.1 <sup>[4]</sup>

# Code

### Import the necessary libraries

In [1]:
import pandas as pd
import numpy as np

# Suppress warning messages, as they are not relevant for this analysis
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Read the spreadsheet into a pandas ExcelFile object
data = pd.ExcelFile("oasdi_zip16.xlsx")

In [3]:
# List all the headers that are present in the dataset
headers = ['contact_info', 'zip_code', 'nan', 'total', 'retired', 'disabled',
          'widowers_parents', 'spouses', 'children', 'amount_all_beneficiaries', 'amount_retired',
          'amount_widowers', 'old_beneficiaries']

In [4]:
# Function to parse the sheets in the Excel file and create a single Pandas Dataframe object

def state_df_parse(input_df):
    """ input : Pandas ExcelFile
        output : Pandas Dataframe with all the headers mentioned above and 
                an extra column containing the name of the state
    """
    all_df = pd.DataFrame()
    
    for state in input_df.sheet_names:
        state_df = input_df.parse(state, names = headers)
        
        state_df.drop(['contact_info', 'nan'], inplace=True, axis=1) # contact_info field is not required here
        state_df.dropna(subset=['zip_code'], inplace=True) 
        state_df['state'] = state
        all_df = pd.concat([all_df, state_df], axis=0)  # concatenate the dataframe vertically
        
    return all_df

In [5]:
# Parse the ExcelFile format and generate a single pandas Dataframe
all_states = state_df_parse(data)

In [6]:
# Inspect the dataframe
all_states.head()

Unnamed: 0,zip_code,total,retired,disabled,widowers_parents,spouses,children,amount_all_beneficiaries,amount_retired,amount_widowers,old_beneficiaries,state
6,35013,35,15,5,5,5,5,33,20,5,20,Alabama
7,35016,4400,2745,840,325,145,345,5087,3468,370,2945,Alabama
8,35031,2040,1180,425,150,85,200,2273,1473,168,1290,Alabama
9,35049,910,570,165,60,40,75,1089,752,71,610,Alabama
10,35079,2045,1160,455,170,70,190,2557,1614,206,1265,Alabama


## Metadata for the dataset

|  Attribute | Value  |
|---|---|
|  Last Updated | December 2, 2017  |
|  Created |  December 2, 2017 |
|  File Format | MS Excel  |
|  License | [Creative Commons CCZero](https://creativecommons.org/publicdomain/zero/1.0/legalcode)  |
| Number of sheets  | 56  |
|  ID | 2d66781f-0590-4cee-95ea-d8ab01e08e03  |
| Revision ID | c745f739-adee-4936-b575-9cc44548e45a |
| Raw File Size | 3.4 MB|

* Every sheet either corresponds to a State or a territory in the US
* Every sheet has the following data format




|  Headers  | Data Type / Description   |
|---|---|
| Field Office and Zip Code  | Text (either field office name or ZIP code)  |
| Total  | Text (denoting a comma-separated number)  |
| Retired workers  |  Text (denoting a comma-separated number) |
| Disabled Workers  | Text (denoting a comma-separated number)  |
| Widow(er)s and parents  | Text (denoting a comma-separated number)  |
| Spouses | Text (denoting a comma-separated number)  |
| Children  | Text (denoting a comma-separated number)  |
| All Beneficiaries  | Text (denoting a comma-separated number in thousands of dollars)  |
| Retired Workers  |  Text (denoting a comma-separated number in thousands of dollars) |
| Widow(er)s and parents  | Text (denoting a comma-separated number in thousands of dollars)  |
| Number of OASDI beneficiaries aged 65 or older | Text (denoting a comma-separated number)  |
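
The table above describes the numeric columns as text holding comma-separated numbers. If a column does load as strings rather than numbers, a conversion along the following lines may be needed before any arithmetic. This is only a minimal sketch: `to_number` is a hypothetical helper that is not part of this notebook, and the sample values are illustrative.

```python
import pandas as pd

def to_number(column):
    """Convert text such as '4,400' to a float; unparseable entries become NaN."""
    cleaned = column.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")

# Illustrative values only, including one entry that cannot be parsed
raw = pd.Series(["4,400", "2,040", "910", "n/a"])
parsed = to_number(raw)
print(parsed.tolist())
```

Using `errors="coerce"` keeps the conversion from failing on stray text, at the cost of silently producing missing values that should be counted afterwards.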

In [7]:
# Calculate amount per beneficiary at the zip code level
all_states['amount_per_beneficiary'] = all_states['amount_all_beneficiaries'] / all_states['total']
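
Because the amount columns are reported in thousands of dollars, the ratio computed above is in thousands of dollars per beneficiary. The sketch below shows the conversion to dollars on two of the Alabama rows from the dataframe preview. The column name `dollars_per_beneficiary` is an assumption introduced for illustration only.

```python
import pandas as pd

sample = pd.DataFrame({
    "zip_code": [35016, 35031],
    "total": [4400, 2040],                      # number of beneficiaries
    "amount_all_beneficiaries": [5087, 2273],   # thousands of dollars
})

# Ratio is in thousands of dollars per beneficiary; multiply by 1000 for dollars
ratio = sample["amount_all_beneficiaries"] / sample["total"]
sample["dollars_per_beneficiary"] = (ratio * 1000).round(2)
print(sample)
```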

### Research Question 1
* Which ZIP codes have some of the lowest/highest amounts received per month, and how does that compare with the average annual household income in that ZIP code?

In [8]:
# Sort the dataframe in the descending order of the amount per beneficiary
all_states.sort_values('amount_per_beneficiary', ascending=False)[['zip_code', 'amount_per_beneficiary']].head(10)

Unnamed: 0,zip_code,amount_per_beneficiary
833,10020,2.26667
1116,77010,2.15556
1410,12604,2.13333
1141,94305,2.08929
340,57186,2.05
140,55323,2.0
845,10107,2.0
725,37383,1.97143
150,7846,1.95
519,24595,1.92
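
ZIP code 7846 in the output above has only four digits, which suggests that leading zeros were lost when the column was read as a number. A minimal sketch of restoring the five-character form follows. `format_zip` is a hypothetical helper, and it assumes the column has no missing values, since the integer cast would fail on them.

```python
import pandas as pd

def format_zip(zip_codes):
    """Render numeric ZIP codes as zero-padded five-character strings."""
    return zip_codes.astype(float).astype(int).astype(str).str.zfill(5)

# Values taken from the outputs in this notebook
zips = pd.Series([10020, 7846, 1844.0])
formatted = format_zip(zips)
print(formatted.tolist())
```

Keeping ZIP codes as strings also helps when joining against other datasets, which usually store them with leading zeros.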


### Comments
* Consider ZIP code 77010 in Texas, where beneficiaries receive about 2,155 dollars per month on average (2.15556 in the table, which is in thousands of dollars)
* The average household income in this ZIP code is about 250,001 dollars <sup>[6]</sup>
* Though there are exceptions, ZIP codes with very high household incomes also tend to collect more benefits from the SSA

In [9]:
# Sort the dataframe in the ascending order of the amount per beneficiary
all_states.sort_values('amount_per_beneficiary', ascending=True)[['zip_code', 'amount_per_beneficiary']].head(10)

Unnamed: 0,zip_code,amount_per_beneficiary
565,19806,0.24
6,80231,0.3
16,86044,0.333333
1468,54481,0.333333
115,1844,0.371429
241,2864,0.4
1690,47876,0.4
1379,47876,0.4
691,75504,0.4
132,5671,0.416


### Comments
* Consider ZIP code 80231 in Colorado, where beneficiaries receive about 300 dollars per month on average (0.3 in the table, which is in thousands of dollars)
* The average household income in this ZIP code is about 51,099 dollars <sup>[6]</sup>
* Again, though there are exceptions, ZIP codes with very low household incomes also tend to collect lower benefits from the SSA
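
The bottom-ten table lists ZIP code 47876 twice under different row indices, so a ZIP code can apparently appear more than once in the combined dataframe. A short sketch for surfacing such repeats is shown here. The toy frame reuses a few rows from that table rather than the full dataset.

```python
import pandas as pd

rows = pd.DataFrame({
    "zip_code": [19806, 47876, 47876, 75504],
    "amount_per_beneficiary": [0.24, 0.4, 0.4, 0.4],
})

# keep=False marks every occurrence of a repeated ZIP code, not just the later ones
repeated = rows[rows.duplicated(subset="zip_code", keep=False)]
print(repeated)
```

Whether repeats should be dropped or merged depends on why they occur in the source sheets, which this sketch does not determine.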

### Research Question 2
* Which states have some of the lowest/highest amounts received per month, and is there an apparent trend in the data?

In [10]:
# Identify all the numerical features in the dataframe
num_features = ['total', 'retired', 'disabled', 'widowers_parents', 'spouses', 
    'children', 'amount_all_beneficiaries', 'amount_retired', 'amount_widowers', 
    'old_beneficiaries', 'amount_per_beneficiary' ]

In [11]:
# Convert all object types to floating point numbers
for feature in num_features:
    all_states[feature] = all_states[feature].astype(float)

In [12]:
# Aggregate the data at the state level by computing the mean amount per beneficiary in each state
state_details = all_states.groupby('state').agg({'amount_per_beneficiary' : np.mean})

In [13]:
state_details.head()

Unnamed: 0_level_0,amount_per_beneficiary
state,Unnamed: 1_level_1
Alabama,1.164613
Alaska,1.0453
American Samoa,0.702348
Arizona,1.206986
Arkansas,1.113012
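
The state figures above are a simple mean of ZIP-level ratios, so a ZIP code with 35 beneficiaries counts as much as one with 4,400. An alternative, offered only as a sketch and not as the method used in this notebook, divides each state's total benefits by its total beneficiaries. The rows below are three Alabama rows from the dataframe preview.

```python
import pandas as pd

zips = pd.DataFrame({
    "state": ["Alabama"] * 3,
    "total": [35, 4400, 2040],
    "amount_all_beneficiaries": [33, 5087, 2273],
})
zips["amount_per_beneficiary"] = zips["amount_all_beneficiaries"] / zips["total"]

# Unweighted: every ZIP code contributes equally to the state average
unweighted = zips.groupby("state")["amount_per_beneficiary"].mean()

# Weighted: total benefits in the state divided by total beneficiaries
sums = zips.groupby("state")[["amount_all_beneficiaries", "total"]].sum()
weighted = sums["amount_all_beneficiaries"] / sums["total"]

print(pd.DataFrame({"unweighted": unweighted, "weighted": weighted}))
```

The two numbers differ whenever small and large ZIP codes have different ratios, so it is worth stating which one a map is showing.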


In [14]:
# Persist the data in CSV format so that it's easy to visualize using Tableau
state_details.to_csv('state_details.csv')

### Tableau instructions
* Load the csv file state_details.csv as a text file into Tableau
* Drag the state Dimension onto the worksheet
* Drag the amount per beneficiary Measure onto the color mark on the central pane
* These steps above will generate the visualization as seen below
* Similar visualizations can be created using other visualization tools as well

![](states.png)

### Comments
* Although average benefits vary widely across states, and some states stand out, it is hard to find correlations with other indicators such as income level or poverty.
* This suggests that Social Security (SS) benefits need to be examined at a more granular level than the state

### Research Question 3
* Are there parts of cities/towns where top and bottom beneficiaries live close to each other?
* Since it's unwieldy to look at every town/city in the country, we will focus on New York City for this analysis.
* In particular, we will look at the Bronx and Manhattan boroughs of NYC

In [15]:
# Drop the na values in the dataframe
all_states.dropna(inplace=True)

In [16]:
# compute the 95th and 5th percentile for amount per beneficiary in the dataset

per_top = np.percentile(all_states['amount_per_beneficiary'], 95)
per_bot = np.percentile(all_states['amount_per_beneficiary'], 5)

per_bot, per_top

(0.96842105263157896, 1.4904212040271207)
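
`np.percentile` returns `nan` when its input contains missing values, which is presumably why `dropna` runs before it. As a hedged alternative, pandas' `Series.quantile` skips missing values and, with default settings, interpolates linearly just as `np.percentile` does. The amounts below are illustrative values, not the full dataset.

```python
import numpy as np
import pandas as pd

# Illustrative amounts in thousands of dollars per beneficiary
amounts = pd.Series([0.24, 0.4, 1.05, 1.16, 1.21, 2.0, 2.27])

# quantile takes fractions (0.05, 0.95) where np.percentile takes 5 and 95
per_bot, per_top = amounts.quantile([0.05, 0.95])
same_as_numpy = bool(np.isclose(per_top, np.percentile(amounts, 95)))

# Missing values are skipped by Series.quantile
with_gap = pd.Series([1.0, np.nan, 3.0])
median_with_gap = with_gap.quantile(0.5)

print(per_bot, per_top, same_as_numpy, median_with_gap)
```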

In [17]:
# create a dataframe with top and bottom 5% in terms of the amount per beneficiary

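# .copy() gives independent dataframes, so the labels added later do not touch all_states
top_zip_codes = all_states[all_states['amount_per_beneficiary'] > per_top].copy()
bot_zip_codes = all_states[all_states['amount_per_beneficiary'] < per_bot].copy()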

In [18]:
# Create appropriate labels for each of the dataframes

top_zip_codes['status'] = 'TOP 5%'
bot_zip_codes['status'] = 'BOT 5%'

# concatenate the dataframes vertically
extremes = pd.concat([top_zip_codes, bot_zip_codes], axis=0)
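
An alternative way to build the same labels, sketched here under stated assumptions, is to assign them on one dataframe with `.loc` and then filter, which avoids concatenating two separately labelled frames. The `MIDDLE` label is a hypothetical placeholder, the thresholds are rounded from the percentile output in this notebook, and the three rows reuse ZIP codes shown in earlier outputs.

```python
import pandas as pd

df = pd.DataFrame({
    "zip_code": [10020, 35016, 19806],
    "amount_per_beneficiary": [2.27, 1.16, 0.24],
})
per_bot, per_top = 0.968, 1.490  # rounded from the percentile output

df["status"] = "MIDDLE"  # hypothetical label for rows between the thresholds
df.loc[df["amount_per_beneficiary"] > per_top, "status"] = "TOP 5%"
df.loc[df["amount_per_beneficiary"] < per_bot, "status"] = "BOT 5%"

# Keep only the rows that fall outside the two thresholds
extremes = df[df["status"] != "MIDDLE"]
print(extremes)
```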

In [19]:
# Persist the dataframe in CSV format to enable visualization using Tableau
extremes.to_csv('extremes_5.csv')

### Tableau instructions
* Load the csv file extremes_5.csv as a text file into Tableau
* Drag the state zip code onto the worksheet
* Drag the status Dimension onto the color mark on the central pane
* These steps above will generate the visualization as seen below
* Similar visualizations can be created using other visualization tools as well
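
Since Research Question 3 focuses on Manhattan and the Bronx, the extremes can also be narrowed to those boroughs before plotting. The sketch below assumes that Manhattan ZIP codes begin with 100, 101 or 102 and Bronx ZIP codes with 104. That prefix list (`NYC_PREFIXES`) is an assumption introduced here, not something taken from the dataset, and the rows reuse ZIP codes from earlier outputs.

```python
import pandas as pd

# Assumed three-digit ZIP prefixes for Manhattan (100-102) and the Bronx (104)
NYC_PREFIXES = ("100", "101", "102", "104")

extremes = pd.DataFrame({
    "zip_code": [10020, 10107, 77010, 19806],
    "status": ["TOP 5%", "TOP 5%", "TOP 5%", "BOT 5%"],
})

# Compare on zero-padded strings so that the prefix test is reliable
zip_text = extremes["zip_code"].astype(int).astype(str).str.zfill(5)
nyc = extremes[zip_text.str[:3].isin(NYC_PREFIXES)]
print(nyc)
```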

![](nyc.png)

# Conclusions
* Many wealthy people, who probably do not need SS benefits, are receiving them
* Eligibility criteria for SS benefits should change to take into account not just taxable income but also the total value of assets at the time claims are made


# Future work
* Combine this dataset with income/wealth data and do further analysis
* Combine this dataset with ZIP-code-level racial and gender distribution data and look for patterns, if any
* Is it possible to find cases of widespread fraud using these datasets?


# References

[1] https://public.tableau.com/en-us/s/
<br>
[2] https://catalog.data.gov/dataset/oasdi-beneficiaries-by-state-and-zip-code-2016
<br>
[3] https://docs.scipy.org/doc/
<br>
[4] https://pandas.pydata.org/pandas-docs/stable/
<br>
[5] https://creativecommons.org/publicdomain/zero/1.0/legalcode
<br>
[6] https://www.incomebyzipcode.com/
<br>
[7] https://www.anaconda.com/download/#macos
<br>
[8] https://www.tableau.com/learn/training