## **DEVELOPING DATASET FOR EACH YEAR**

Goal: <br>
   +  To link [EIN, URL] from IRS to [EIN, NTEE1] from NCCS files, and get Mission or Purpose statements for each EIN.
    
To-do list: <br>
   + [+] 1. Get [EIN, URL] from IRS. -> df_irs <br>
   + [+] 2. Get [EIN, NTEE1] from NCCS. -> df_nccs <br>
   + [+] 3. Intersect [df_irs, df_nccs] -> df_inter -> "link'year'.csv" <br>
   + [+] 4. Visit each URL in df_inter and get data from the relevant tags. <br>
    


In [4]:
import pandas as pd
import re, requests, string
from bs4 import BeautifulSoup as bs
from tqdm.notebook import tqdm  # replaces the deprecated tqdm_notebook alias
from multiprocessing import Pool
# TODO: use regex instead of BeautifulSoup, if possible.

year_list = [2015, 2014, 2013, 2012, 2011]
year = year_list[0]


Step 1. Get [EIN, URL] from IRS

In [5]:
# Each entry of the 'Filings<year>' column is a dict describing one filing.
irsfile = pd.read_json('https://s3.amazonaws.com/irs-form-990/index_'+str(year)+'.json')
filings = irsfile['Filings'+str(year)]
# Keep only the two fields needed for linking.
df_irs = pd.DataFrame([[s['EIN'], s['URL']] for s in filings], columns=['EIN', 'URL'])

Step 2. Get [EIN, NTEE1] from NCCS

In [6]:
# Read the full NCCS core file first, then select the needed columns into a second variable.
df_nccs1 = pd.read_csv('https://nccs-data.urban.org/data/core/'+str(year)+'/nccs.core'+str(year)+'pc.csv')
# .copy() makes df_nccs an independent frame rather than a slice of df_nccs1.
df_nccs = df_nccs1[['EIN', 'NTEE1']].copy()



Step 3. Match df_irs and df_nccs on their common EIN and build the corresponding table of [EIN, NTEE1, URL]

In [7]:
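# Cast EIN to string so it can be matched against the EIN column of df_irs.
# assign() returns a new frame, which avoids pandas' SettingWithCopyWarning.
df_nccs = df_nccs.assign(EIN=df_nccs['EIN'].astype(str))
# how='outer' keeps EINs that appear in only one source; rows without a URL are dropped in Step 5.
df_inter = pd.merge(df_nccs, df_irs, how='outer', on='EIN')[['EIN', 'NTEE1', 'URL']]
df_inter.to_csv('link'+str(year)+'.csv')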
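
The difference between an outer merge and the intersection named in the to-do list can be seen on a few toy records. The sketch below uses plain dictionaries rather than pandas; the EIN values, URLs, and the helper `normalize_ein` are invented for illustration. It also assumes that an EIN read from a CSV as an integer may have lost its leading zeros, so every EIN is left-padded to nine digits before comparison. Whether the NCCS and IRS files actually differ in this way should be checked against the data.

```python
def normalize_ein(ein):
    """Return the EIN as a 9-character string, left-padded with zeros."""
    return str(ein).strip().zfill(9)

# Toy stand-ins for df_irs and df_nccs (all values are invented).
irs = {"010000001": "https://example.org/a.xml",
       "990000002": "https://example.org/b.xml"}
nccs = {10000001: "A", 550000003: "B"}  # integer EINs, as a CSV reader might return them

irs_norm = {normalize_ein(k): v for k, v in irs.items()}
nccs_norm = {normalize_ein(k): v for k, v in nccs.items()}

both = sorted(irs_norm.keys() & nccs_norm.keys())    # intersection (inner join)
either = sorted(irs_norm.keys() | nccs_norm.keys())  # union (outer join)

# One [EIN, NTEE1, URL] record per EIN present in both sources.
linked = [(ein, nccs_norm[ein], irs_norm[ein]) for ein in both]
print(linked)
print(len(both), len(either))
```

In pandas, the intersection corresponds to `pd.merge(..., how='inner')`, while `how='outer'` corresponds to the union.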



Step 4: Provide the list of tags to search for:
+ The parser converts tag names to lowercase, e.g. 'ActivityOrMissionDesc' is parsed as 'activityormissiondesc'.
+ So the original tag names are given in lowercase for comparison.
+ Data dictionary: https://github.com/lecy/Open-Data-for-Nonprofit-Research/blob/master/Build_IRS990_E-Filer_Datasets/Data_Dictionary.md



|Year | MissionTags | PurposeTags|
| --- | --- | --- |
|2015, 2014, 2013| ActivityOrMissionDesc|OtherExemptPurposeExpendGrp, TotalExemptPurposeExpendGrp|
|2012, 2011| ActivityOrMissionDescription| OtherExemptPurposeExpenditures, TotalExemptPurposeExpenditures|

| FormType | Mission | Purpose |
| --- | --- | --- |
| 990 | Yes | No |
| 990EZ | No | Yes |
| 990PF | No | No |
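
The two tables above can be combined into a small lookup that returns the lowercase tags to expect for a given year and form type. This is only a sketch: the tag names come from the tables, while the dictionary layout and the function name `expected_tags` are illustrative choices.

```python
# Tag names copied from the tables; "new" covers 2013-2015, "old" covers 2011-2012.
MISSION_TAGS = {
    "new": ["ActivityOrMissionDesc"],
    "old": ["ActivityOrMissionDescription"],
}
PURPOSE_TAGS = {
    "new": ["OtherExemptPurposeExpendGrp", "TotalExemptPurposeExpendGrp"],
    "old": ["OtherExemptPurposeExpenditures", "TotalExemptPurposeExpenditures"],
}

def expected_tags(year, form_type):
    """Return the lowercase tag names expected for this year and form type."""
    if year in (2013, 2014, 2015):
        schema = "new"
    elif year in (2011, 2012):
        schema = "old"
    else:
        raise ValueError("year not covered by the table: %r" % year)

    if form_type == "990":
        tags = MISSION_TAGS[schema]   # mission only
    elif form_type == "990EZ":
        tags = PURPOSE_TAGS[schema]   # purpose only
    else:
        tags = []                     # 990PF (or any other form): neither
    return [t.lower() for t in tags]

print(expected_tags(2015, "990"))
print(expected_tags(2011, "990EZ"))
print(expected_tags(2014, "990PF"))
```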


In [8]:
# Original tag names, kept for reference:
#alltags = ['ActivityOrMissionDesc', 'ActivityOrMissionDescription',
#           'OtherExemptPurposeExpendGrp', 'OtherExemptPurposeExpenditures',
#           'TotalExemptPurposeExpendGrp', 'TotalExemptPurposeExpenditures']

# Lowercase versions, matching the names produced by the parser.
alltags = ['activityormissiondesc', 'activityormissiondescription',
           'otherexemptpurposeexpendgrp', 'otherexemptpurposeexpenditures',
           'totalexemptpurposeexpendgrp', 'totalexemptpurposeexpenditures']
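
The lowercasing described in Step 4 can be checked offline with Python's standard-library `HTMLParser`, which BeautifulSoup's `'html.parser'` backend builds on. The sketch below is a simplified stand-in for the BeautifulSoup lookup, not the notebook's own code: the class `TagTextCollector` and the sample XML string are invented for illustration.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect (tag, text) pairs for the wanted lowercase tag names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # wanted tag we are currently inside, if any
        self.found = []

    def handle_starttag(self, tag, attrs):
        # `tag` arrives already lowercased by HTMLParser.
        if tag in self.wanted:
            self.current = tag

    def handle_data(self, data):
        if self.current and data.strip():
            self.found.append((self.current, data.strip()))

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

# Invented sample; real filings are much larger.
sample = ("<Return><ActivityOrMissionDesc>Feed the hungry."
          "</ActivityOrMissionDesc><Other>ignored</Other></Return>")

collector = TagTextCollector(["activityormissiondesc"])
collector.feed(sample)
collector.close()
print(collector.found)
```

Searching for the mixed-case name `'ActivityOrMissionDesc'` instead would find nothing, which is why the comparison list is kept in lowercase.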


Step 5: Visit each URL and get data from relevant tags.





In [13]:

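df_inter1 = pd.read_csv("link"+str(year)+".csv")

# Keep only the rows that have a filing URL to visit.
df_inter = df_inter1[~pd.isnull(df_inter1['URL'])]

print("length: ", len(df_inter))

blocksize = 100  # number of rows handled by one call to build_coredata

def build_coredata(r):
    """Scrape rows r .. r+blocksize-1 of df_inter and append the results to the output CSV."""
    print("Inside the function:", r)

    masterdata = pd.DataFrame(columns=['EIN', 'NTEE', 'IRS_URL', 'TEXT', 'TEXTTYPE', 'YEAR'])

    # Stop at the last row so the final block does not run past the end of df_inter.
    for turn in tqdm(range(r, min(r + blocksize, len(df_inter)))):
        # Columns after the CSV round trip: [saved index, EIN, NTEE1, URL]
        row = df_inter.values[turn]
        found = False
        page = requests.get(row[3])

        bss = bs(page.text, 'html.parser')

        for tag in bss.find_all():
            if tag.name in alltags:
                masterdata.loc[len(masterdata)] = [str(row[1]), row[2], row[3], tag.string, tag.name, str(year)]
                found = True

        # Record the filing even when none of the tags is present.
        if not found:
            masterdata.loc[len(masterdata)] = [str(row[1]), row[2], row[3], '', '', str(year)]

    # mode='a' appends this block's rows; pandas opens and closes the file itself.
    masterdata.to_csv('sampleMasterData22.csv', mode='a', header=False, index=False)


# One task per block of rows: 0, 100, 200, ...
records = list(range(0, len(df_inter), blocksize))

agents = 4
with Pool(processes=agents) as pool:
    # chunksize=1 hands out one block at a time, so the workers start on blocks 0, 100, 200, 300.
    pool.map(build_coredata, records, chunksize=1)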

length:  261034
Inside the function: 100
Inside the function: 0
Inside the function: 300
Inside the function: 200


KeyboardInterrupt: the run was interrupted before the workers (ForkPoolWorker-5 to ForkPoolWorker-8) finished.
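
One way to give each worker a distinct block of rows is to pass only the block start offsets to the pool and let each call process the rows from its start up to the next start. The sketch below uses hypothetical helper names (`block_starts`, `block_rows`) and a made-up row count. It checks that every row index is visited exactly once and that the last block stops at the end of the table.

```python
def block_starts(n_rows, block_size):
    """Start offsets of consecutive, non-overlapping blocks."""
    return list(range(0, n_rows, block_size))

def block_rows(start, n_rows, block_size):
    """Row indices belonging to the block that begins at `start`."""
    return range(start, min(start + block_size, n_rows))

n_rows, block_size = 250, 100   # made-up sizes for the demonstration
starts = block_starts(n_rows, block_size)

# Flatten all blocks and compare with the full index range.
covered = [i for s in starts for i in block_rows(s, n_rows, block_size)]

print(starts)
print(covered == list(range(n_rows)))
print(len(block_rows(starts[-1], n_rows, block_size)))
```

By contrast, calling a function that reads `range(r, r + 101)` once for every `r` in `range(n_rows)` would fetch most rows about a hundred times and run past the last row.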

In [6]:
from multiprocessing import Pool

def printstuff(r):
    print("hey: ", r)

records = list(range(0, 97))
agents = 4
chunksize = 10
with Pool(processes=agents) as pool:
    pool.map(printstuff, records, chunksize)

hey:  0
hey:  10
hey:  20
hey:  11
hey:  30
hey:  1
hey:  21
hey:  2
hey:  12
hey:  31
hey:  3
hey:  22
hey:  13
hey:  32
hey:  4
hey:  23
hey:  14
hey:  33
hey:  5
hey:  24
hey:  15
hey:  34
hey:  25
hey:  6
hey:  16
hey:  35
hey:  26
hey:  7
hey:  17
hey:  8
hey:  27
hey:  36
hey:  18
hey:  37
hey:  19
hey:  9
hey:  28
hey:  38
hey:  40
hey:  50
hey:  29
hey:  39
hey:  60
hey:  51
hey:  41
hey:  61
hey:  70
hey:  52
hey:  42
hey:  43
hey:  71
hey:  62
hey:  53
hey:  63
hey:  72
hey:  54
hey:  64
hey:  73
hey:  65
hey:  55
hey:  66
hey:  56
hey:  67
hey:  57
hey:  68
hey:  58
hey:  69
hey:  80
hey:  74
hey:  81
hey:  82
hey:  83
hey:  84
hey:  85
hey:  86
hey:  87
hey:  88
hey:  89
hey:  44
hey:  90
hey:  91
hey:  92
hey:  93
hey:  45
hey:  94
hey:  46
hey:  95
hey:  47
hey:  96
hey:  59
hey:  75
hey:  76
hey:  48
hey:  77
hey:  78
hey:  79
hey:  49
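
The interleaved output above is consistent with `pool.map` splitting the input into consecutive chunks of `chunksize` items and handing whole chunks to the workers. The first values printed are 0, 10, 20 and 30, which are the starts of the first four chunks. A rough pure-Python model of that grouping follows; the `chunks` helper is illustrative and is not the library's own code.

```python
from itertools import islice

def chunks(iterable, chunksize):
    """Yield consecutive tuples of at most `chunksize` items."""
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

groups = list(chunks(range(97), 10))

print(len(groups))                   # number of chunks
print([g[0] for g in groups[:4]])    # first item of the first four chunks
print(groups[-1])                    # the last, shorter chunk
```

Within one chunk the items are processed in order, but the chunks themselves run concurrently, so the printed lines from different chunks interleave.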
