### How to Get IRS Data

IRS Form 990 data is hosted on AWS and can be found [here](https://registry.opendata.aws/irs990/). The documentation is self-explanatory, but there are two things you need to know:

- The IRS provides an index file for each year that lists all tax returns for that year. Each index file includes basic information about every filing: the name of the filer, the filer's Employer Identification Number (EIN), the filing date, and a unique identifier for the filing. The index file for 2017, for example, can be found [here](https://s3.amazonaws.com/irs-form-990/index_2017.csv). For other years, just change the year in the link.

- Each return is an XML file whose URL contains a unique identifier. For example, http://s3.amazonaws.com/irs-form-990/201703199349311180_public.xml shows FOUNDATION HEALTH SYSTEMS CORP.'s return for 2017. In this number, 2017 is the year and 03199349311180 identifies that organization's filing. The whole number (OBJECT_ID) can be found in the last column of the index file mentioned above.
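
The two URL patterns described above can be sketched as small helpers. This is only an illustration: the function names `index_url` and `filing_url` are made up here, and the base URL is the one used in the links above.

```python
# Base URL taken from the links above.
BASE_URL = "https://s3.amazonaws.com/irs-form-990/"


def index_url(year):
    """Return the URL of the index file for one year (hypothetical helper)."""
    return f"{BASE_URL}index_{year}.csv"


def filing_url(object_id):
    """Return the URL of the XML return for one OBJECT_ID (hypothetical helper)."""
    return f"{BASE_URL}{object_id}_public.xml"


print(index_url(2017))
print(filing_url("201703199349311180"))
```

Passing the OBJECT_ID as a string or as an integer gives the same URL, which is convenient because the index column may be read as integers.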

Last but not least, the IRS XML structure changed after 2013. Although the changes to element names are minor, anyone who wants to build panel data spanning pre-2013 and post-2013 years needs an extra processing step to match each element across the two structures.
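
One way to handle such renamed elements is a lookup table from old names to new names, applied to each record before the years are combined. The sketch below is an assumption about how this could be organized, not a complete mapping: the only pair it contains, `TaxYear` to `TaxYr`, is the one used in the code later in this document. The `harmonize` helper and the sample record are made up for illustration.

```python
# Old (pre-2013) element name -> new element name.
# Only this pair appears in this document; any further pairs would have
# to be checked against the actual returns.
RENAMES = {"TaxYear": "TaxYr"}


def harmonize(record, renames=RENAMES):
    """Return a copy of record with old element names replaced by new ones."""
    return {renames.get(key, key): value for key, value in record.items()}


# Made-up record in the old naming style, not a real filing.
old_style = {"Filer.EIN": "000000000", "TaxYear": "2011"}
print(harmonize(old_style))
```

Keys that are not in the table are left untouched, so the same function can be applied to returns from any year.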

The code below shows how I get the XML links for hospitals that filed Schedule H with their 2017 tax returns. I store all the XML links in a CSV file for use in further analysis.

**PS**: The code below is not final, so it is not clean yet. A final (clean) version with many comments will follow.

In [None]:
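import csv
import os
from xml.dom import minidom

import pandas as pd
import requests

base_url = "https://s3.amazonaws.com/irs-form-990/"
# Returns listed in the 2017 index are saved under a 2016 label here;
# adjust the index year and both labels when processing another year.
index = pd.read_csv(base_url + "index_2017.csv")
path = "/home/msari/Project1/RawData/IRS/2016"
url_list = []

# The last column of the index file holds the OBJECT_ID of each filing.
for object_id in index.iloc[:, -1]:
    url = base_url + str(object_id) + "_public.xml"
    try:
        # Download once, then reuse the bytes for parsing and for saving.
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        doc = minidom.parseString(response.content)
    except Exception:
        # Skip filings that cannot be downloaded or parsed.
        continue
    # Keep only returns that contain exactly one Schedule H.
    if len(doc.getElementsByTagName("IRS990ScheduleH")) == 1:
        url_list.append(url)
        with open(os.path.join(path, str(object_id) + "_public.xml"), "wb") as file:
            file.write(response.content)

# Write one URL per row, without a header row.
with open("2016output.csv", "w", newline="") as r:
    wr = csv.writer(r)
    for url in url_list:
        wr.writerow([url])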
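
The Schedule H check used in the code above can be tried without any download. The sketch below applies the same `getElementsByTagName` test to two tiny made-up XML documents; they are not real filings, and the helper name `has_schedule_h` is hypothetical.

```python
from xml.dom import minidom


def has_schedule_h(xml_bytes):
    """Return True if the document contains exactly one IRS990ScheduleH element."""
    doc = minidom.parseString(xml_bytes)
    return len(doc.getElementsByTagName("IRS990ScheduleH")) == 1


# Tiny made-up documents, not real filings.
with_h = b"<Return><ReturnData><IRS990ScheduleH/></ReturnData></Return>"
without_h = b"<Return><ReturnData><IRS990/></ReturnData></Return>"

print(has_schedule_h(with_h))
print(has_schedule_h(without_h))
```

Separating the check from the download makes it easier to test on a handful of saved files before running it over a whole index.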

After storing all the XML links I need, I convert the XMLs to a dataframe whose unit of observation is a hospital in a specific year, so the resulting dataframe is panel data. There are many ways to turn XML into a dataframe; what I do below is simply take every element and child element as a column name (variable name) without specifying them in advance. For example, CharityCareAtCost.NetCommunityBenefitExpense shows each key on the path under the Schedule H element. Other categories have the same NetCommunityBenefitExpense key, so this method avoids having to deduplicate and rename columns.
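
To make the dotted column names concrete, here is a toy illustration in plain Python, separate from the conversion code that follows. The `flatten` helper is written for this example only and is not what the notebook uses (the notebook relies on `json_normalize`). The values are made up, and `OtherCategory` is a hypothetical element name standing in for any other category that shares the `NetCommunityBenefitExpense` key.

```python
def flatten(nested, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted paths."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the path so far as the prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Made-up values; OtherCategory is a hypothetical element name.
schedule_h = {
    "CharityCareAtCost": {"NetCommunityBenefitExpense": "100"},
    "OtherCategory": {"NetCommunityBenefitExpense": "250"},
}
print(flatten(schedule_h))
```

Because the parent name is part of each column name, the two `NetCommunityBenefitExpense` values end up in different columns without any manual renaming.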

In [None]:
import pandas as pd
import requests
import xmltodict

# Step 1: flatten Schedule H and the return header of every stored XML,
# writing one CSV per year.
for i in range(2010, 2018):
    # The link files have no header row; header=None keeps the first URL.
    df = pd.read_csv(str(i) + "output.csv", header=None)
    profiles = []
    for url in df.iloc[:, 0]:
        webpage = requests.get(url)
        units = xmltodict.parse(webpage.content)
        schedule_h = units['Return']['ReturnData']['IRS990ScheduleH']
        return_header = units['Return']['ReturnHeader']
        # Merge both sections; nested keys become dotted column names.
        profile = {**schedule_h, **return_header}
        profiles.append(pd.json_normalize(profile))
    # DataFrame.append was removed in pandas 2.0, so concatenate once per year.
    IRS_temp = pd.concat(profiles, sort=False) if profiles else pd.DataFrame()
    IRS_temp.to_csv("IRS_temp_" + str(i) + ".csv")

# Step 2: stack the yearly files into one panel indexed by filer and tax year.
d = {}
for i in range(2010, 2018):
    temp = pd.read_csv("IRS_temp_" + str(i) + ".csv", low_memory=False)
    temp = temp.assign(Year=i)
    if i < 2013:
        # Pre-2013 returns use TaxYear where later returns use TaxYr.
        temp = temp.rename(columns={'TaxYear': 'TaxYr'})
    d[i] = temp.set_index(['Filer.EIN', 'TaxYr'], append=True)
IRS990 = pd.concat([d[i] for i in range(2010, 2018)], sort=False)
IRS990.to_csv("IRS990.csv", index=True)