## **DEVELOPING DATASET FOR EACH YEAR**

Goal: <br>
   +  To link [EIN, URL] from IRS to [EIN, NTEE1] from NCCS files, and get Mission or Purpose statements for each EIN.
    
To- do List: <br>
   + [+] 1. Get [EIN, URL] from IRS. -> df_irs <br>
   + [+] 2. Get [EIN, NTEE1] from NCCS. -> df_nccs <br>
   + [+] 3. Intersect [df_irs, df_nccs] -> df_inter -> "link'year'.csv" <br>
   + [+] 4. Visit each URL in df_inter and get data from relevant tabs. <br>
    


In [30]:
import pandas as pd
import re, requests, string
from bs4 import BeautifulSoup as bs
from tqdm import tqdm_notebook as tqdm
from multiprocessing import Pool
#use regex instead of beautifulsoup, if possible.

year_list = [2015, 2014, 2013, 2012, 2011]
year = year_list[0]

Step 1. Get [EIN, URL] from IRS

In [5]:
irsfile = pd.read_json('https://s3.amazonaws.com/irs-form-990/index_'+str(year)+'.json')
ein_url=list(map(list, zip(*[[s['EIN'] for s in irsfile['Filings'+str(year)]], [s['URL'] for s in irsfile['Filings'+str(year)]]])))
df_irs = pd.DataFrame(ein_url, columns=['EIN', 'URL'])

Step 2. Get [EIN, NTEE1] from NCCS

In [6]:
df_nccs1 = pd.read_csv('https://nccs-data.urban.org/data/core/'+str(year)+'/nccs.core'+str(year)+'pc.csv')
df_nccs = df_nccs1[['EIN', 'NTEE1']]

#print(df_nccs)
#break it into two, first read and then select categories into other variable

  interactivity=interactivity, compiler=compiler, result=result)


Step 3. Get common URL from df_irs and df_nccs and make corresponding list of [EIN, URL, NTEE1]

In [7]:
df_nccs['EIN'] = df_nccs['EIN'].apply(str)
df_inter = pd.DataFrame(pd.merge(df_nccs, df_irs, how='outer', on=['EIN']), columns=['EIN','NTEE1','URL'])
df_inter.to_csv('link'+str(year)+'.csv')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Step 4: Provide list of tags to search from:
+ The original tag names are converted into small letters while parsing, for e.g. 'ActivityOrMissionDesc' is parsed as 'activityormissiondesc'. 
+ So, we will provide original tags converted into small letters for comparision.
+ https://github.com/lecy/Open-Data-for-Nonprofit-Research/blob/master/Build_IRS990_E-Filer_Datasets/Data_Dictionary.md


year| tag| line#

|Year | MissionTags | PurposeTags|
| --- | --- | --- |
|2015, 2014, 2013| ActivityOrMissionDesc|OtherExemptPurposeExpendGrp, TotalExemptPurposeExpendGrp|
|2012, 2011| ActivityOrMissionDescription| OtherExemptPurposeExpenditures, TotalExemptPurposeExpenditures|

| FormType | Mission | Purpose |
| --- | --- | --- |
| 990 | Yes | No |
| 990EZ | No | Yes |
| 990PF | No | No |


In [8]:
#alltags = ['ActivityOrMissionDesc', 'ActivityOrMissionDescription',
#           'OtherExemptPurposeExpenGrp', 'OtherExemptPurposeExpenditures',
#           'TotalExemptPurposeExpendGrp', 'TotalExemptPurposeExpenditures']
    
alltags = ['activityormissiondesc', 'activityormissiondescription',
           'otherexemptpurposeexpengrp', 'otherexemptpurposeexpenditures',
           'totalexemptpurposeexpendgrp', 'totalexemptpurposeexpenditures']


Step 5: Visit each URL and get data from relevant tags.





In [100]:

#df_inter1= pd.read_csv("link"+str(year)+".csv")

alltags = ['activityormissiondesc', 'activityormissiondescription',
           'otherexemptpurposeexpengrp', 'otherexemptpurposeexpenditures',
           'totalexemptpurposeexpendgrp', 'totalexemptpurposeexpenditures']


df = pd.read_pickle("/Users/Rushi/Documents/GRAFall2018/Data/intermediary/link"+str(year)+".pkl.gz", 'gzip')
df_inter = df[~ pd.isnull(df['IRS_URL'])]

masterdata=pd.DataFrame(columns=['EIN', 'NTEE', 'IRS_URL', 'TEXT', 'TEXTTYPE', 'YEAR'])
masterdata.to_csv(open('2015/MasterData2015.csv.gzip', 'a'), index=False)
#masterdata.to_pickle(open('MasterData2015.pkl.gzip', 'wb'))

print("length: ",len(df_inter))

def build_coredata(r):
    print("Inside the function:", r)
    
    masterdata=pd.DataFrame(columns=['EIN', 'NTEE', 'IRS_URL', 'TEXT', 'TEXTTYPE', 'YEAR'])
    
    #for turn in tqdm(range(r,r+101)):
    for turn in range(r[0],r[1]):
        row = df_inter4.values[int(turn)]
        flag = 0
        page = requests.get(row[2])

        bss = bs(page.text, 'html.parser')

        for tag in bss.find_all():
            if tag.name in alltags:
                masterdata.loc[len(masterdata)] = [str(row[0]), row[1], row[2], tag.string, tag.name, str(year)]
                flag = 1

        if(flag == 0):
            masterdata.loc[len(masterdata)] = [str(row[0]), row[1], row[2],'','', str(year)]            

    print(masterdata)    
    masterdata.to_csv(open('2015/MasterData2015Take2.csv.gzip', 'a'), header=False, index=False)
    #masterdata.to_pickle(open('MasterData2015.pkl.gzip', 'wb'))
    print("written: ", r)


no_urls = 100
#records = [[i, i+no_urls] for i in range(18000, 19000, no_urls)]
records = [[i, i+no_urls] for i in range(0, len(df_inter), no_urls)]

agents = 4
#with Pool(processes=agents) as pool:
#    pool.map(build_coredata, records)
   
no_url = df[pd.isnull(df['IRS_URL'])]
masterdata=pd.DataFrame(no_url, columns=['EIN', 'NTEE', 'IRS_URL', 'TEXT', 'TEXTTYPE', 'YEAR'])
#masterdata.to_csv(open('2015/MasterData2015.csv.gzip', 'a'), index=False)
print("Added data without URLs!")

length:  261034
Added data without URLs!


In [76]:
df_scraped = pd.read_csv('2015/MasterData2015.csv.gzip').drop_duplicates('IRS_URL')
df_inter1 = df_inter[~df_inter['IRS_URL'].isin(df_scraped['IRS_URL'])]

print("Tota: ", len(df_inter1))
no_urls = 100
#records = [[i, i+no_urls] for i in range(18000, 19000, no_urls)]
records = [[i, i+no_urls] for i in range(0, len(df_inter1), no_urls)]

agents = 4
with Pool(processes=agents) as pool:
    pool.map(build_coredata, records)
   
no_url = df[pd.isnull(df['IRS_URL'])]
masterdata=pd.DataFrame(no_url, columns=['EIN', 'NTEE', 'IRS_URL', 'TEXT', 'TEXTTYPE', 'YEAR'])
masterdata.to_csv(open('2015/MasterData2015.csv.gzip', 'a'), index=False)
print("Added data without URLs!")

Tota:  172334
Inside the function: [0, 100]
Inside the function: [10800, 10900]
Inside the function: [21600, 21700]
Inside the function: [32400, 32500]
written:  [10800, 10900]
Inside the function: [10900, 11000]
written:  [32400, 32500]
Inside the function: [32500, 32600]
written:  [21600, 21700]
Inside the function: [21700, 21800]
written:  [0, 100]
Inside the function: [100, 200]
written:  [10900, 11000]
Inside the function: [11000, 11100]
written:  [32500, 32600]
Inside the function: [32600, 32700]
written:  [21700, 21800]
Inside the function: [21800, 21900]
written:  [100, 200]
Inside the function: [200, 300]
written:  [11000, 11100]
Inside the function: [11100, 11200]
written:  [32600, 32700]
Inside the function: [32700, 32800]
written:  [200, 300]
Inside the function: [300, 400]
written:  [21800, 21900]
Inside the function: [21900, 22000]
written:  [32700, 32800]
Inside the function: [32800, 32900]
written:  [11100, 11200]
Inside the function: [11200, 11300]
written:  [300, 400]

written:  [14300, 14400]
Inside the function: [14400, 14500]
written:  [3200, 3300]
Inside the function: [3300, 3400]
written:  [36000, 36100]
Inside the function: [36100, 36200]
written:  [24800, 24900]
Inside the function: [24900, 25000]
written:  [14400, 14500]
Inside the function: [14500, 14600]
written:  [36100, 36200]
Inside the function: [36200, 36300]
written:  [3300, 3400]
Inside the function: [3400, 3500]
written:  [24900, 25000]
Inside the function: [25000, 25100]
written:  [14500, 14600]
Inside the function: [14600, 14700]
written:  [36200, 36300]
Inside the function: [36300, 36400]
written:  [3400, 3500]
Inside the function: [3500, 3600]
written:  [25000, 25100]
Inside the function: [25100, 25200]
written:  [14600, 14700]
Inside the function: [14700, 14800]
written:  [36300, 36400]
Inside the function: [36400, 36500]
written:  [3500, 3600]
Inside the function: [3600, 3700]
written:  [25100, 25200]
Inside the function: [25200, 25300]
written:  [14700, 14800]
Inside the func

Inside the function: [17900, 18000]
written:  [6400, 6500]
Inside the function: [6500, 6600]
written:  [39600, 39700]
Inside the function: [39700, 39800]
written:  [28100, 28200]
Inside the function: [28200, 28300]
written:  [17900, 18000]
Inside the function: [18000, 18100]
written:  [6500, 6600]
Inside the function: [6600, 6700]
written:  [39700, 39800]
Inside the function: [39800, 39900]
written:  [28200, 28300]
Inside the function: [28300, 28400]
written:  [18000, 18100]
Inside the function: [18100, 18200]
written:  [6600, 6700]
Inside the function: [6700, 6800]
written:  [39800, 39900]
Inside the function: [39900, 40000]
written:  [28300, 28400]
Inside the function: [28400, 28500]
written:  [18100, 18200]
Inside the function: [18200, 18300]
written:  [6700, 6800]
Inside the function: [6800, 6900]
written:  [39900, 40000]
Inside the function: [40000, 40100]
written:  [28400, 28500]
Inside the function: [28500, 28600]
written:  [18200, 18300]
Inside the function: [18300, 18400]
writ

written:  [43100, 43200]
Inside the function: [43200, 43300]
written:  [9900, 10000]
Inside the function: [10000, 10100]
written:  [21200, 21300]
Inside the function: [21300, 21400]
written:  [31400, 31500]
Inside the function: [31500, 31600]
written:  [43200, 43300]
Inside the function: [43300, 43400]
written:  [10000, 10100]
Inside the function: [10100, 10200]
written:  [21300, 21400]
Inside the function: [21400, 21500]
written:  [31500, 31600]
Inside the function: [31600, 31700]
written:  [43300, 43400]
Inside the function: [43400, 43500]
written:  [10100, 10200]
Inside the function: [10200, 10300]
written:  [21400, 21500]
Inside the function: [21500, 21600]
written:  [31600, 31700]
Inside the function: [31700, 31800]
written:  [43400, 43500]
Inside the function: [43500, 43600]
written:  [10200, 10300]
Inside the function: [10300, 10400]
written:  [31700, 31800]
Inside the function: [31800, 31900]
written:  [21500, 21600]
Inside the function: [54000, 54100]
written:  [43500, 43600]


Inside the function: [46600, 46700]
written:  [78100, 78200]
Inside the function: [78200, 78300]
written:  [56800, 56900]
Inside the function: [56900, 57000]
written:  [67200, 67300]
Inside the function: [67300, 67400]
written:  [46600, 46700]
Inside the function: [46700, 46800]
written:  [78200, 78300]
Inside the function: [78300, 78400]
written:  [56900, 57000]
Inside the function: [57000, 57100]
written:  [67300, 67400]
Inside the function: [67400, 67500]
written:  [46700, 46800]
Inside the function: [46800, 46900]
written:  [78300, 78400]
Inside the function: [78400, 78500]
written:  [67400, 67500]
Inside the function: [67500, 67600]
written:  [57000, 57100]
Inside the function: [57100, 57200]
written:  [46800, 46900]
Inside the function: [46900, 47000]
written:  [78400, 78500]
Inside the function: [78500, 78600]
written:  [67500, 67600]
Inside the function: [67600, 67700]
written:  [57100, 57200]
Inside the function: [57200, 57300]
written:  [78500, 78600]
Inside the function: [78

written:  [60000, 60100]
Inside the function: [60100, 60200]
written:  [81800, 81900]
Inside the function: [81900, 82000]
written:  [70400, 70500]
Inside the function: [70500, 70600]
written:  [49900, 50000]
Inside the function: [50000, 50100]
written:  [60100, 60200]
Inside the function: [60200, 60300]
written:  [81900, 82000]
Inside the function: [82000, 82100]
written:  [70500, 70600]
Inside the function: [70600, 70700]
written:  [50000, 50100]
Inside the function: [50100, 50200]
written:  [82000, 82100]
Inside the function: [82100, 82200]
written:  [60200, 60300]
Inside the function: [60300, 60400]
written:  [70600, 70700]
Inside the function: [70700, 70800]
written:  [82100, 82200]
Inside the function: [82200, 82300]
written:  [50100, 50200]
Inside the function: [50200, 50300]
written:  [60300, 60400]
Inside the function: [60400, 60500]
written:  [70700, 70800]
Inside the function: [70800, 70900]
written:  [82200, 82300]
Inside the function: [82300, 82400]
written:  [50200, 50300]

Inside the function: [73900, 74000]
written:  [85200, 85300]
Inside the function: [85300, 85400]
written:  [53200, 53300]
Inside the function: [53300, 53400]
written:  [63300, 63400]
Inside the function: [63400, 63500]
written:  [73900, 74000]
Inside the function: [74000, 74100]
written:  [85300, 85400]
Inside the function: [85400, 85500]
written:  [53300, 53400]
Inside the function: [53400, 53500]
written:  [63400, 63500]
Inside the function: [63500, 63600]
written:  [74000, 74100]
Inside the function: [74100, 74200]
written:  [85400, 85500]
Inside the function: [85500, 85600]
written:  [63500, 63600]
Inside the function: [63600, 63700]
written:  [74100, 74200]
Inside the function: [74200, 74300]
written:  [53400, 53500]
Inside the function: [53500, 53600]
written:  [85500, 85600]
Inside the function: [85600, 85700]
written:  [74200, 74300]
Inside the function: [74300, 74400]
written:  [63600, 63700]
Inside the function: [63700, 63800]
written:  [53500, 53600]
Inside the function: [53

Inside the function: [99300, 99400]
written:  [109800, 109900]
Inside the function: [109900, 110000]
written:  [120600, 120700]
Inside the function: [120700, 120800]
written:  [88700, 88800]
Inside the function: [88800, 88900]
written:  [99300, 99400]
Inside the function: [99400, 99500]
written:  [109900, 110000]
Inside the function: [110000, 110100]
written:  [120700, 120800]
Inside the function: [120800, 120900]
written:  [88800, 88900]
Inside the function: [88900, 89000]
written:  [99400, 99500]
Inside the function: [99500, 99600]
written:  [110000, 110100]
Inside the function: [110100, 110200]
written:  [120800, 120900]
Inside the function: [120900, 121000]
written:  [88900, 89000]
Inside the function: [89000, 89100]
written:  [110100, 110200]
Inside the function: [110200, 110300]
written:  [99500, 99600]
Inside the function: [99600, 99700]
written:  [120900, 121000]
Inside the function: [121000, 121100]
written:  [89000, 89100]
Inside the function: [89100, 89200]
written:  [110200

written:  [113000, 113100]
Inside the function: [113100, 113200]
written:  [102400, 102500]
Inside the function: [102500, 102600]
written:  [91900, 92000]
Inside the function: [92000, 92100]
written:  [123900, 124000]
Inside the function: [124000, 124100]
written:  [113100, 113200]
Inside the function: [113200, 113300]
written:  [102500, 102600]
Inside the function: [102600, 102700]
written:  [92000, 92100]
Inside the function: [92100, 92200]
written:  [124000, 124100]
Inside the function: [124100, 124200]
written:  [113200, 113300]
Inside the function: [113300, 113400]
written:  [102600, 102700]
Inside the function: [102700, 102800]
written:  [92100, 92200]
Inside the function: [92200, 92300]
written:  [124100, 124200]
Inside the function: [124200, 124300]
written:  [113300, 113400]
Inside the function: [113400, 113500]
written:  [102700, 102800]
Inside the function: [102800, 102900]
written:  [92200, 92300]
Inside the function: [92300, 92400]
written:  [124200, 124300]
Inside the fun

Inside the function: [105700, 105800]
written:  [116200, 116300]
Inside the function: [116300, 116400]
written:  [127100, 127200]
Inside the function: [127200, 127300]
written:  [95100, 95200]
Inside the function: [95200, 95300]
written:  [105700, 105800]
Inside the function: [105800, 105900]
written:  [116300, 116400]
Inside the function: [116400, 116500]
written:  [127200, 127300]
Inside the function: [127300, 127400]
written:  [95200, 95300]
Inside the function: [95300, 95400]
written:  [116400, 116500]
Inside the function: [116500, 116600]
written:  [105800, 105900]
Inside the function: [105900, 106000]
written:  [127300, 127400]
Inside the function: [127400, 127500]
written:  [116500, 116600]
Inside the function: [116600, 116700]
written:  [95300, 95400]
Inside the function: [95400, 95500]
written:  [105900, 106000]
Inside the function: [106000, 106100]
written:  [127400, 127500]
Inside the function: [127500, 127600]
written:  [116600, 116700]
Inside the function: [116700, 116800]

written:  [141300, 141400]
Inside the function: [141400, 141500]
written:  [130500, 130600]
Inside the function: [130600, 130700]
written:  [151900, 152000]
Inside the function: [152000, 152100]
written:  [141400, 141500]
Inside the function: [141500, 141600]
written:  [162700, 162800]
Inside the function: [162800, 162900]
written:  [130600, 130700]
Inside the function: [130700, 130800]
written:  [152000, 152100]
Inside the function: [152100, 152200]
written:  [141500, 141600]
Inside the function: [141600, 141700]
written:  [130700, 130800]
Inside the function: [130800, 130900]
written:  [162800, 162900]
Inside the function: [162900, 163000]
written:  [152100, 152200]
Inside the function: [152200, 152300]
written:  [141600, 141700]
Inside the function: [141700, 141800]
written:  [130800, 130900]
Inside the function: [130900, 131000]
written:  [162900, 163000]
Inside the function: [163000, 163100]
written:  [152200, 152300]
Inside the function: [152300, 152400]
written:  [141700, 141800

Inside the function: [155100, 155200]
written:  [133500, 133600]
Inside the function: [133600, 133700]
written:  [144700, 144800]
Inside the function: [144800, 144900]
written:  [165800, 165900]
Inside the function: [165900, 166000]
written:  [155100, 155200]
Inside the function: [155200, 155300]
written:  [133600, 133700]
Inside the function: [133700, 133800]
written:  [144800, 144900]
Inside the function: [144900, 145000]
written:  [165900, 166000]
Inside the function: [166000, 166100]
written:  [155200, 155300]
Inside the function: [155300, 155400]
written:  [144900, 145000]
Inside the function: [145000, 145100]
written:  [133700, 133800]
Inside the function: [133800, 133900]
written:  [166000, 166100]
Inside the function: [166100, 166200]
written:  [155300, 155400]
Inside the function: [155400, 155500]
written:  [145000, 145100]
Inside the function: [145100, 145200]
written:  [133800, 133900]
Inside the function: [133900, 134000]
written:  [166100, 166200]
Inside the function: [166

written:  [158200, 158300]
Inside the function: [158300, 158400]
written:  [168900, 169000]
Inside the function: [169000, 169100]
written:  [136500, 136600]
Inside the function: [136600, 136700]
written:  [148100, 148200]
Inside the function: [148200, 148300]
written:  [158300, 158400]
Inside the function: [158400, 158500]
written:  [169000, 169100]
Inside the function: [169100, 169200]
written:  [148200, 148300]
Inside the function: [148300, 148400]
written:  [136600, 136700]
Inside the function: [136700, 136800]
written:  [158400, 158500]
Inside the function: [158500, 158600]
written:  [169100, 169200]
Inside the function: [169200, 169300]
written:  [148300, 148400]
Inside the function: [148400, 148500]
written:  [136700, 136800]
Inside the function: [136800, 136900]
written:  [158500, 158600]
Inside the function: [158600, 158700]
written:  [148400, 148500]
Inside the function: [148500, 148600]
written:  [169200, 169300]
Inside the function: [169300, 169400]
written:  [136800, 136900

Process ForkPoolWorker-58:
Process ForkPoolWorker-57:
Process ForkPoolWorker-56:
Process ForkPoolWorker-55:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/Users/Rushi/anaconda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/Rushi/anaconda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/Rushi/anaconda/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/Rushi/anaconda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/Rushi/anaconda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/Rushi/anaconda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/Rushi/anaconda/lib/python3.6/mu

KeyboardInterrupt: 

In [94]:
df_scraped2 = pd.read_csv('2015/MasterData2015.csv.gzip').drop_duplicates('IRS_URL')
df_scrapedtake2 = pd.read_csv('2015/MasterData2015Take2.csv.gzip').drop_duplicates('IRS_URL')

df_merge = pd.concat([df_scraped2, df_scrapedtake2])
#print(df_merge)

df_inter2 = df_inter[~df_inter['IRS_URL'].isin(df_merge['IRS_URL'])]


print(len(df_merge), len(df_inter))
print("Total: ", len(df_inter2))
no_urls = 100
#records = [[i, i+no_urls] for i in range(18000, 19000, no_urls)]
records = [[i, i+no_urls] for i in range(0, len(df_inter2), no_urls)]

agents = 4
with Pool(processes=agents) as pool:
    pool.map(build_coredata, records)

258402 261034
Total:  91334
Inside the function: [0, 100]
Inside the function: [5800, 5900]
Inside the function: [17400, 17500]
Inside the function: [11600, 11700]
written:  [0, 100]
Inside the function: [100, 200]
written:  [11600, 11700]
Inside the function: [11700, 11800]
written:  [17400, 17500]
Inside the function: [17500, 17600]
written:  [5800, 5900]
Inside the function: [5900, 6000]
written:  [100, 200]
Inside the function: [200, 300]
written:  [17500, 17600]
Inside the function: [17600, 17700]
written:  [5900, 6000]
Inside the function: [6000, 6100]
written:  [11700, 11800]
Inside the function: [11800, 11900]
written:  [200, 300]
Inside the function: [300, 400]
written:  [17600, 17700]
Inside the function: [17700, 17800]
written:  [6000, 6100]
Inside the function: [6100, 6200]
written:  [11800, 11900]
Inside the function: [11900, 12000]
written:  [17700, 17800]
Inside the function: [17800, 17900]
written:  [300, 400]
Inside the function: [400, 500]
written:  [6100, 6200]
Insid

written:  [20800, 20900]
Inside the function: [20900, 21000]
written:  [9200, 9300]
Inside the function: [9300, 9400]
written:  [15000, 15100]
Inside the function: [15100, 15200]
written:  [20900, 21000]
Inside the function: [21000, 21100]
written:  [3500, 3600]
Inside the function: [3600, 3700]
written:  [9300, 9400]
Inside the function: [9400, 9500]
written:  [15100, 15200]
Inside the function: [15200, 15300]
written:  [21000, 21100]
Inside the function: [21100, 21200]
written:  [9400, 9500]
Inside the function: [9500, 9600]
written:  [3600, 3700]
Inside the function: [3700, 3800]
written:  [15200, 15300]
Inside the function: [15300, 15400]
written:  [3700, 3800]
Inside the function: [3800, 3900]
written:  [21100, 21200]
Inside the function: [21200, 21300]
written:  [9500, 9600]
Inside the function: [9600, 9700]
written:  [15300, 15400]
Inside the function: [15400, 15500]
written:  [3800, 3900]
Inside the function: [3900, 4000]
written:  [9600, 9700]
Inside the function: [9700, 9800]

Inside the function: [30100, 30200]
written:  [41600, 41700]
Inside the function: [41700, 41800]
written:  [35800, 35900]
Inside the function: [35900, 36000]
written:  [24300, 24400]
Inside the function: [24400, 24500]
written:  [30100, 30200]
Inside the function: [30200, 30300]
written:  [35900, 36000]
Inside the function: [36000, 36100]
written:  [41700, 41800]
Inside the function: [41800, 41900]
written:  [24400, 24500]
Inside the function: [24500, 24600]
written:  [36000, 36100]
Inside the function: [36100, 36200]
written:  [41800, 41900]
Inside the function: [41900, 42000]
written:  [30200, 30300]
Inside the function: [30300, 30400]
written:  [36100, 36200]
Inside the function: [36200, 36300]
written:  [24500, 24600]
Inside the function: [24600, 24700]
written:  [41900, 42000]
Inside the function: [42000, 42100]
written:  [30300, 30400]
Inside the function: [30400, 30500]
written:  [36200, 36300]
Inside the function: [36300, 36400]
written:  [24600, 24700]
Inside the function: [24

written:  [39200, 39300]
Inside the function: [39300, 39400]
written:  [44900, 45000]
Inside the function: [45000, 45100]
written:  [33400, 33500]
Inside the function: [33500, 33600]
written:  [27700, 27800]
Inside the function: [27800, 27900]
written:  [39300, 39400]
Inside the function: [39400, 39500]
written:  [45000, 45100]
Inside the function: [45100, 45200]
written:  [33500, 33600]
Inside the function: [33600, 33700]
written:  [39400, 39500]
Inside the function: [39500, 39600]
written:  [27800, 27900]
Inside the function: [27900, 28000]
written:  [45100, 45200]
Inside the function: [45200, 45300]
written:  [33600, 33700]
Inside the function: [33700, 33800]
written:  [39500, 39600]
Inside the function: [39600, 39700]
written:  [27900, 28000]
Inside the function: [28000, 28100]
written:  [45200, 45300]
Inside the function: [45300, 45400]
written:  [33700, 33800]
Inside the function: [33800, 33900]
written:  [39600, 39700]
Inside the function: [39700, 39800]
written:  [28000, 28100]

Inside the function: [65700, 65800]
written:  [48400, 48500]
Inside the function: [48500, 48600]
written:  [60000, 60100]
Inside the function: [60100, 60200]
written:  [54200, 54300]
Inside the function: [54300, 54400]
written:  [65700, 65800]
Inside the function: [65800, 65900]
written:  [48500, 48600]
Inside the function: [48600, 48700]
written:  [60100, 60200]
Inside the function: [60200, 60300]
written:  [54300, 54400]
Inside the function: [54400, 54500]
written:  [65800, 65900]
Inside the function: [65900, 66000]
written:  [48600, 48700]
Inside the function: [48700, 48800]
written:  [60200, 60300]
Inside the function: [60300, 60400]
written:  [54400, 54500]
Inside the function: [54500, 54600]
written:  [65900, 66000]
Inside the function: [66000, 66100]
written:  [48700, 48800]
Inside the function: [48800, 48900]
written:  [60300, 60400]
Inside the function: [60400, 60500]
written:  [54500, 54600]
Inside the function: [54600, 54700]
written:  [66000, 66100]
Inside the function: [66

written:  [51700, 51800]
Inside the function: [51800, 51900]
written:  [69000, 69100]
Inside the function: [69100, 69200]
written:  [57600, 57700]
Inside the function: [57700, 57800]
written:  [51800, 51900]
Inside the function: [51900, 52000]
written:  [63400, 63500]
Inside the function: [63500, 63600]
written:  [69100, 69200]
Inside the function: [69200, 69300]
written:  [51900, 52000]
Inside the function: [52000, 52100]
written:  [57700, 57800]
Inside the function: [57800, 57900]
written:  [63500, 63600]
Inside the function: [63600, 63700]
written:  [69200, 69300]
Inside the function: [69300, 69400]
Inside the function: [69600, 69700]
Inside the function: [75400, 75500]
Inside the function: [81200, 81300]
Inside the function: [87000, 87100]


ConnectionError: None: Max retries exceeded with url: /irs-form-990/201531299349100008_public.xml (Caused by None)

In [101]:
#df_scraped3 = pd.read_csv('2015/MasterData2015.csv.gzip').drop_duplicates('IRS_URL')
df_scraped3 = pd.read_csv('2015/MasterData2015Take2.csv.gzip').drop_duplicates('IRS_URL')

#df_merge = pd.concat([df_scraped2, df_scrapedtake2])
#print(df_merge)

df_inter3 = df_inter2[~df_inter2['IRS_URL'].isin(df_scraped3['IRS_URL'])].drop_duplicates()
print("Total: ", df_inter3)

no_urls = 100
#records = [[i, i+no_urls] for i in range(18000, 19000, no_urls)]
records = [[i, i+no_urls] for i in range(0, len(df_inter3), no_urls)]

agents = 4
with Pool(processes=agents) as pool:
    pool.map(build_coredata, records)

Total:                EIN NTEE                                            IRS_URL
497072  753170579  NaN  https://s3.amazonaws.com/irs-form-990/20151229...
497073  426149467  NaN  https://s3.amazonaws.com/irs-form-990/20151229...
497074  391707934  NaN  https://s3.amazonaws.com/irs-form-990/20151229...
497075  275427246  NaN  https://s3.amazonaws.com/irs-form-990/20151229...
497076  911117863  NaN  https://s3.amazonaws.com/irs-form-990/20151229...
497077  330939708  NaN  https://s3.amazonaws.com/irs-form-990/20151229...
497078  061607026  NaN  https://s3.amazonaws.com/irs-form-990/20153225...
497079  956138372  NaN  https://s3.amazonaws.com/irs-form-990/20150228...
497080  726028392  NaN  https://s3.amazonaws.com/irs-form-990/20150228...
497081  363335089  NaN  https://s3.amazonaws.com/irs-form-990/20150228...
497082  431908752  NaN  https://s3.amazonaws.com/irs-form-990/20150228...
497083  760027581  NaN  https://s3.amazonaws.com/irs-form-990/20150229...
497084  202383241  NaN  https:

written:  [7500, 7600]
Inside the function: [7600, 7700]
written:  [9000, 9100]
Inside the function: [9100, 9200]
written:  [10500, 10600]
Inside the function: [10600, 10700]
written:  [6100, 6200]
Inside the function: [6200, 6300]
written:  [7600, 7700]
Inside the function: [7700, 7800]
written:  [9100, 9200]
Inside the function: [9200, 9300]
written:  [10600, 10700]
Inside the function: [10700, 10800]
written:  [6200, 6300]
Inside the function: [6300, 6400]
written:  [7700, 7800]
Inside the function: [7800, 7900]
written:  [9200, 9300]
Inside the function: [9300, 9400]
written:  [10700, 10800]
Inside the function: [10800, 10900]
written:  [6300, 6400]
Inside the function: [6400, 6500]
written:  [7800, 7900]
Inside the function: [7900, 8000]
written:  [9300, 9400]
Inside the function: [9400, 9500]
written:  [10800, 10900]
Inside the function: [10900, 11000]
written:  [6400, 6500]
Inside the function: [6500, 6600]
written:  [7900, 8000]
Inside the function: [8000, 8100]
written:  [9400

written:  [18600, 18700]
Inside the function: [18700, 18800]
written:  [21500, 21600]
Inside the function: [21600, 21700]
written:  [20100, 20200]
Inside the function: [20200, 20300]
written:  [18700, 18800]
Inside the function: [18800, 18900]
written:  [21600, 21700]
Inside the function: [21700, 21800]
written:  [20200, 20300]
Inside the function: [20300, 20400]
written:  [18800, 18900]
Inside the function: [18900, 19000]
written:  [21700, 21800]
Inside the function: [21800, 21900]
written:  [20300, 20400]
Inside the function: [20400, 20500]
written:  [18900, 19000]
Inside the function: [19000, 19100]
written:  [21800, 21900]
Inside the function: [21900, 22000]
written:  [20400, 20500]
Inside the function: [20500, 20600]
written:  [19000, 19100]
Inside the function: [19100, 19200]
written:  [20500, 20600]
Inside the function: [20600, 20700]
written:  [21900, 22000]
Inside the function: [22000, 22100]
written:  [19100, 19200]
Inside the function: [19200, 19300]
written:  [20600, 20700]

IndexError: index 22634 is out of bounds for axis 0 with size 22634

In [106]:
#df_scraped3 = pd.read_csv('2015/MasterData2015.csv.gzip').drop_duplicates('IRS_URL')
df_scraped4 = pd.read_csv('2015/MasterData2015Take2.csv.gzip').drop_duplicates('IRS_URL')

#df_merge = pd.concat([df_scraped2, df_scrapedtake2])
#print(df_merge)

df_inter4 = df_inter3[~df_inter3['IRS_URL'].isin(df_scraped4['IRS_URL'])].drop_duplicates()
print("Total: ", df_inter4)
print(len(df_inter4))


Total:                EIN NTEE                                            IRS_URL
536372  222698057  NaN  https://s3.amazonaws.com/irs-form-990/20154274...
536373  223551487  NaN  https://s3.amazonaws.com/irs-form-990/20154224...
536374  463118314  NaN  https://s3.amazonaws.com/irs-form-990/20154226...
536375  137355037  NaN  https://s3.amazonaws.com/irs-form-990/20154226...
536376  133920688  NaN  https://s3.amazonaws.com/irs-form-990/20154226...
536377  731573960  NaN  https://s3.amazonaws.com/irs-form-990/20154226...
536378  134065892  NaN  https://s3.amazonaws.com/irs-form-990/20154226...
536379  208191024  NaN  https://s3.amazonaws.com/irs-form-990/20154226...
536380  930961091  NaN  https://s3.amazonaws.com/irs-form-990/20154226...
536381  990342196  NaN  https://s3.amazonaws.com/irs-form-990/20154226...
536382  330020792  NaN  https://s3.amazonaws.com/irs-form-990/20154226...
536383  133782849  NaN  https://s3.amazonaws.com/irs-form-990/20154226...
536384  561872786  NaN  https:

In [109]:
def build_coredatas(df_inter4):
    
    masterdata=pd.DataFrame(columns=['EIN', 'NTEE', 'IRS_URL', 'TEXT', 'TEXTTYPE', 'YEAR'])
    
    #for turn in tqdm(range(r,r+101)):
    for turn in range(0, 34):
        row = df_inter4.values[int(turn)]
        flag = 0
        page = requests.get(row[2])

        bss = bs(page.text, 'html.parser')

        for tag in bss.find_all():
            if tag.name in alltags:
                masterdata.loc[len(masterdata)] = [str(row[0]), row[1], row[2], tag.string, tag.name, str(year)]
                flag = 1

        if(flag == 0):
            masterdata.loc[len(masterdata)] = [str(row[0]), row[1], row[2],'','', str(year)]            

    print(masterdata)    
    masterdata.to_csv(open('2015/MasterData2015Take2.csv.gzip', 'a'), header=False, index=False)
    #masterdata.to_pickle(open('MasterData2015.pkl.gzip', 'wb'))
    print("written: ", r)
    
build_coredatas(df_inter4)

          EIN NTEE                                            IRS_URL  \
0   222698057  NaN  https://s3.amazonaws.com/irs-form-990/20154274...   
1   223551487  NaN  https://s3.amazonaws.com/irs-form-990/20154224...   
2   463118314  NaN  https://s3.amazonaws.com/irs-form-990/20154226...   
3   137355037  NaN  https://s3.amazonaws.com/irs-form-990/20154226...   
4   133920688  NaN  https://s3.amazonaws.com/irs-form-990/20154226...   
5   731573960  NaN  https://s3.amazonaws.com/irs-form-990/20154226...   
6   134065892  NaN  https://s3.amazonaws.com/irs-form-990/20154226...   
7   208191024  NaN  https://s3.amazonaws.com/irs-form-990/20154226...   
8   930961091  NaN  https://s3.amazonaws.com/irs-form-990/20154226...   
9   990342196  NaN  https://s3.amazonaws.com/irs-form-990/20154226...   
10  330020792  NaN  https://s3.amazonaws.com/irs-form-990/20154226...   
11  133782849  NaN  https://s3.amazonaws.com/irs-form-990/20154226...   
12  561872786  NaN  https://s3.amazonaws.com/irs-fo

NameError: name 'r' is not defined

In [None]:
df_file1 = pd.read_csv('MasterData2015.csv.gzip').drop_duplicates()
df_file2 = pd.read_csv('MasterData2015Take2.csv.gzip').drop_duplicates()

df_dist = pd.concat([df_file1,df_file2])
df_dist = df_dist[df_dist.NTEE != 'NTEE']

df_dist.to_pickle('MasterData2015.pkl.gz')