<a href="https://colab.research.google.com/github/johnzelson/local-nonprofit-colab/blob/main/S7_Get_People.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

Step Seven (S7 Get People)

This notebook collects the people associated with each nonprofit: the ICO (In Care Of) name from the IRS Business Master File, and the officers and trusted staff (e.g. Director) listed in the latest tax return.

The list of people and affiliations creates a fun way to show affiliation and interconnection. Potential collaborators?

For example, is there a similar group of people active in Arts-related programs?

# Tech Notes


In:  np_local_df

Out:  all_people_df


IRS tax data is provided in XML files.  During processing, the XML sections on people were dumped into a dataframe column.  Here the people snippet gets massaged into JSON, flattened, and written to a people dataframe, with one row for each person and affiliation.  TODO: should use xmltodict on these snippets instead of the simple-minded tweaks that format them for JSON (e.g. valid JSON needs double quotes, but names like O'Connell contain a single quote).
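
As a rough illustration of that TODO, here is a sketch that reads people straight from an XML snippet. It uses the standard library's `xml.etree.ElementTree` rather than `xmltodict`, swapped in only to keep the example dependency-free. The wrapper tags `People` and `Person`, the sample names, and the helper `people_from_xml` are assumptions for illustration; only the four field names come from this notebook. Real IRS XML may carry a namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

PERSON_COLS = ['PersonNm', 'TitleTxt', 'IndividualTrusteeOrDirectorInd', 'OfficerInd']

def people_from_xml(snippet):
    """Return one dict per person element, with "na" for missing tags."""
    root = ET.fromstring(snippet.strip())
    rows = []
    for person in root:  # each child element is assumed to be one person
        row = {}
        for col in PERSON_COLS:
            row[col] = person.findtext(col, default="na")
        rows.append(row)
    return rows

# hypothetical snippet; <People> and <Person> are made-up wrapper tags
sample = """
<People>
  <Person><PersonNm>KATHLEEN O'CONNELL</PersonNm><TitleTxt>TREASURER</TitleTxt><OfficerInd>X</OfficerInd></Person>
  <Person><PersonNm>JANE DOE</PersonNm><TitleTxt>BD MEMBER</TitleTxt><IndividualTrusteeOrDirectorInd>X</IndividualTrusteeOrDirectorInd></Person>
</People>
"""

for r in people_from_xml(sample):
    print(r)
```

Because the apostrophe lives inside XML text, no quote replacement is needed in this approach.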

# Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


Mounted at /content/drive


In [None]:
import requests
import pprint
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import folium
from google.colab import userdata
import json
import numpy as np
import re

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

proc_dir = '/content/drive/My Drive/IRS_processed/'
data_dir = '/content/drive/My Drive/irs_data/'

In [None]:
# get np local df


dtype = {"CLASSIFICATION": str,
         "EIN" : str,
         "ACTIVITY" : str,
         "AFFILIATION" : str,
         "ORGANIZATION" : str,
         "FOUNDATION" : str,
         "NTEE_CD" : str,
         "RULING" : str,
         "ZIP" : str,
         "TAX_PERIOD" : str,
         "GROUP" : str
         }

np_local_df = pd.read_csv(proc_dir + 'np_local_df.csv',
                          dtype=dtype)

display(np_local_df)



In [None]:
np_local_df.info(verbose=True)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162 entries, 0 to 161
Data columns (total 121 columns):
 #    Column                        Dtype  
---   ------                        -----  
 0    EIN                           int64  
 1    NAME                          object 
 2    ICO                           object 
 3    STREET                        object 
 4    CITY                          object 
 5    STATE                         object 
 6    ZIP                           object 
 7    GROUP                         int64  
 8    SUBSECTION                    int64  
 9    AFFILIATION                   int64  
 10   CLASSIFICATION                int64  
 11   RULING                        int64  
 12   DEDUCTIBILITY                 int64  
 13   FOUNDATION                    int64  
 14   ACTIVITY                      int64  
 15   ORGANIZATION                  int64  
 16   STATUS                        int64  
 17   TAX_PERIOD                    float64
 18   ASSET_CD

In [None]:
# note: people table combines tax info and BMF ICO
# need to decide on structure

# org columns:    'p_org_id', 'ICO', 'NAME', 'EIN', 'ntee_cat'
# person columns: 'PersonNm', 'TitleTxt', 'IndividualTrusteeOrDirectorInd', 'OfficerInd'



# Extract ICO names

In [None]:
# create a dataframe that starts with ICOs

filt = np_local_df['ICO'].notna()
ico_df = np_local_df[filt][['p_org_id', 'ICO', 'NAME', 'EIN', 'ntee_cat' ]]

display(ico_df)
# np_local_df



Unnamed: 0,p_org_id,ICO,NAME,EIN,ntee_cat
2,49916,% TINA CAVALIER,AFRICA READS INC,204703107,"(Q) International, Foreign Affairs & National ..."
3,85879,% SUNY CORTLAND,ALPHA PHI OMEGA,593837821,no_NTEE
4,119027,% STEVEN NANN,AMERICAN FEDERATION OF TEACHERS,990725348,no_NTEE
5,36701,% RICHARD NAUSEEF,AMERICAN LEGION,150610966,no_NTEE
6,61771,% PATTI W FITZPATRICK,AMERICAN LEGION AUXILIARY,263340217,no_NTEE
7,42029,% SUE CONNELLY,AMERICAN VOLKSSPORT ASSOCIATION INC,161445103,no_NTEE
8,48184,% JENNIFER SCHULTZ,ASC FAMILY FUND INC,201878196,(P) Human Services
10,36518,% ERIKA DEAN,AUXILIARY SERVICES CORPORATION OF SUNY CORTLAND,150548524,(B) Education
11,115772,% BARRY PRIMARY PTO TREASURER,BARRY PRIMARY PTO INC,922673272,(B) Education
13,73252,% NONE,BILLY BIMBA GLOBAL MINISTRIES INC,454495541,(X) Religion-Related


In [None]:
# prep ico for merge with other people

# rename ICO to PersonNm and drop the leading '% ' so names line up with tax-return people
ico_df = ico_df.rename(columns={'ICO': 'PersonNm'})
ico_df['PersonNm'] = ico_df['PersonNm'].str.replace('% ', '', regex=False)
ico_df['TitleTxt'] = 'ICO'

display(ico_df)

Unnamed: 0,p_org_id,PersonNm,NAME,EIN,ntee_cat,TitleTxt
2,49916,TINA CAVALIER,AFRICA READS INC,204703107,"(Q) International, Foreign Affairs & National ...",ICO
3,85879,SUNY CORTLAND,ALPHA PHI OMEGA,593837821,no_NTEE,ICO
4,119027,STEVEN NANN,AMERICAN FEDERATION OF TEACHERS,990725348,no_NTEE,ICO
5,36701,RICHARD NAUSEEF,AMERICAN LEGION,150610966,no_NTEE,ICO
6,61771,PATTI W FITZPATRICK,AMERICAN LEGION AUXILIARY,263340217,no_NTEE,ICO
7,42029,SUE CONNELLY,AMERICAN VOLKSSPORT ASSOCIATION INC,161445103,no_NTEE,ICO
8,48184,JENNIFER SCHULTZ,ASC FAMILY FUND INC,201878196,(P) Human Services,ICO
10,36518,ERIKA DEAN,AUXILIARY SERVICES CORPORATION OF SUNY CORTLAND,150548524,(B) Education,ICO
11,115772,BARRY PRIMARY PTO TREASURER,BARRY PRIMARY PTO INC,922673272,(B) Education,ICO
13,73252,NONE,BILLY BIMBA GLOBAL MINISTRIES INC,454495541,(X) Religion-Related,ICO
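
The table above still contains a filler value (`NONE`) in the name column. A minimal sketch of a cleaning helper that strips the `%` marker and turns such placeholders into `None` follows. The helper name `clean_ico` is hypothetical, and treating only `NONE` as a placeholder is an assumption based on the rows shown here. It does not try to tell people apart from organizations such as SUNY CORTLAND.

```python
def clean_ico(raw):
    """Strip the leading '%' marker from a BMF ICO value; return None for placeholders."""
    if raw is None:
        return None
    name = raw.strip()
    if name.startswith('%'):
        name = name[1:].strip()
    # 'NONE' shows up as a filler value in the table above
    if name == '' or name.upper() == 'NONE':
        return None
    return name

for raw in ['% TINA CAVALIER', '% NONE', '% SUNY CORTLAND']:
    print(repr(raw), '->', repr(clean_ico(raw)))
```

Something like `ico_df['PersonNm'].map(clean_ico)` could apply it to the column, followed by dropping the empty rows.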


# Extract people XML snippets from IRS Taxes.

In [None]:
# nonprofits that have people listed in json-ish column
filt = np_local_df['people'].notna()
# .copy() so the in-place fixes below modify people_df, not a view of np_local_df
people_df = np_local_df[filt][['p_org_id', 'people', 'NAME', 'EIN', 'ntee_cat' ]].copy()


In [None]:
# Reading snippet as json
# (should have tried xmltodict first, but the simple-minded massage
# seemed to work, so leaving it for now)

# Plan: replace single quotes with double quotes to get valid json.
# Problem: some names contain a single quote (O'Connell), which would
# break the json once every single quote becomes a double quote.
# Catch: a name that merely ends in O is followed by a closing quote and
# a comma (O',), and that closing quote has to survive.

#filt= people_df['people'].str.contains('O\',')
#display(people_df[filt])

# here's an over-simple fix
# first protect names ending in O with a marker <end>
people_df['people'] = people_df['people'].str.replace("O',", "O<end>, ", regex=False)

# then turn the actual O' names into "O " (O'CONNELL becomes O CONNELL)
people_df['people'] = people_df['people'].str.replace("O'", "O ", regex=False)

# then take out the marker and put the single quote back
people_df['people'] = people_df['people'].str.replace("<end>", "'", regex=False)

# note: there is a routine in the streamlit app that uses findall,
# but cut'n paste didn't work and this was faster to try
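
An alternative worth trying, assuming the `people` column holds the string form of a Python list of dicts (which the printed sample in the Fodder section suggests): `ast.literal_eval` from the standard library reads single-quoted and double-quoted strings alike, so a name like O'CONNELL would need no quote juggling. The sample string and the helper name `parse_people` are made up for illustration.

```python
import ast

def parse_people(text):
    """Parse a Python-literal people string into Python objects."""
    return ast.literal_eval(text)

# hypothetical value: Python's repr double-quotes a string that contains an apostrophe
sample = """[{'PersonNm': "KATHLEEN O'CONNELL", 'TitleTxt': 'TREASURER'}, {'PersonNm': 'MARIO ROSSO', 'TitleTxt': 'BD MEMBER'}]"""

for person in parse_people(sample):
    print(person['PersonNm'], '|', person['TitleTxt'])
```

Unlike `eval`, `ast.literal_eval` only accepts literals (strings, numbers, lists, dicts and so on), so it is a reasonable fit for data read from a file.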



In [None]:
display(people2_df)

Unnamed: 0,PersonNm,TitleTxt,IndividualTrusteeOrDirectorInd,OfficerInd
0,RICHARD NAUSEEF,1ST VICE TREASURER,na,X
1,LYLE BUDDENHAGEN,3RD VICE,na,X
2,KENNETH BUSH,ADJUTANT,na,X
3,THOMAS THOMPSON,COMMANDER,na,X
4,DOUGLAS BROWN,2ND VICE,na,X


In [None]:
# iterates dataframe to extract people from json-ish

# previous cell fixes the name problem
# now simple replacement of single quotes for double to make valid json
# Note: dict to dataframe tools would prob be better than iteration

people2_df = pd.DataFrame(columns=['p_org_id', 'NAME', 'EIN', 'ntee_cat',
                                   'PersonNm', 'TitleTxt', 'IndividualTrusteeOrDirectorInd',
                                   'OfficerInd'])

for index, row in people_df.iterrows():
  # basic columns that will get added to every person
  org_common_list = [row['p_org_id'], row['NAME'], row['EIN'], row['ntee_cat']]

  # the simple fix above handled names with a single quote (O'GORMAN);
  # now a simple general attempt to just replace single quotes
  ppl = row['people'].replace("'", '"')

  try:
    ppl_dict = json.loads(ppl)   # IRS tax people: I guess a list of dicts, but single quoted
  except json.JSONDecodeError:
    print(ppl)
    continue   # skip this org so the previous org's people are not reused

  for each_row in ppl_dict:
    person_list = []
    for each_col in ['PersonNm', 'TitleTxt', 'IndividualTrusteeOrDirectorInd', 'OfficerInd']:
      # different IRS forms have different tags
      if each_col in each_row:
        try:
          person_list.append(each_row[each_col])
        except TypeError:
          print("hmm", each_row)
      else:
        person_list.append("na")

    list_to_add = org_common_list + person_list

    # add list to end of dataframe
    try:
      people2_df.loc[len(people2_df)] = list_to_add
    except ValueError:
      print("erg", list_to_add)



hmm PersonNm
erg [73252, 'BILLY BIMBA GLOBAL MINISTRIES INC', 454495541, '(X) Religion-Related', 'na', 'na', 'na']
hmm TitleTxt
erg [73252, 'BILLY BIMBA GLOBAL MINISTRIES INC', 454495541, '(X) Religion-Related', 'na', 'na', 'na']
hmm PersonNm
erg [37344, 'GRANGE PATRONS OF HUSBANDRY NEW YORK STATE GRANGE INC', 160725792, 'no_NTEE', 'na', 'na', 'na']
hmm TitleTxt
erg [37344, 'GRANGE PATRONS OF HUSBANDRY NEW YORK STATE GRANGE INC', 160725792, 'no_NTEE', 'na', 'na', 'na']
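
The `hmm`/`erg` lines above look like what happens when a snippet parses to a single dict rather than a list of dicts: looping over a dict yields its keys as strings, so `each_row[each_col]` fails and the row ends up one value short. That reading is a guess from the output, not something verified against the two filings. A sketch of a flattening helper that wraps a lone dict in a list and uses `dict.get` for missing tags (`flatten_people` and the sample values are hypothetical):

```python
PERSON_COLS = ['PersonNm', 'TitleTxt', 'IndividualTrusteeOrDirectorInd', 'OfficerInd']

def flatten_people(org_fields, parsed):
    """Return one row (list) per person: org fields followed by person fields."""
    if isinstance(parsed, dict):  # single person: wrap so the loop sees one dict
        parsed = [parsed]
    rows = []
    for person in parsed:
        rows.append(list(org_fields) + [person.get(col, "na") for col in PERSON_COLS])
    return rows

# made-up org and person values, shaped like the data in this notebook
org = [12345, 'EXAMPLE ORG INC', '000000000', 'no_NTEE']
single = {'PersonNm': 'JANE DOE', 'TitleTxt': 'PRESIDENT', 'OfficerInd': 'X'}

rows = flatten_people(org, single)
print(rows)
```

Rows collected this way across all organizations could be passed to `pd.DataFrame(rows, columns=[...])` in one go, which also avoids growing the dataframe one `.loc` assignment at a time.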


# Combine all people

In [None]:
all_people_df = pd.concat([ico_df, people2_df])
display(all_people_df)

Unnamed: 0,p_org_id,PersonNm,NAME,EIN,ntee_cat,TitleTxt,IndividualTrusteeOrDirectorInd,OfficerInd
2,49916,TINA CAVALIER,AFRICA READS INC,204703107,"(Q) International, Foreign Affairs & National ...",ICO,,
3,85879,SUNY CORTLAND,ALPHA PHI OMEGA,593837821,no_NTEE,ICO,,
4,119027,STEVEN NANN,AMERICAN FEDERATION OF TEACHERS,990725348,no_NTEE,ICO,,
5,36701,RICHARD NAUSEEF,AMERICAN LEGION,150610966,no_NTEE,ICO,,
6,61771,PATTI W FITZPATRICK,AMERICAN LEGION AUXILIARY,263340217,no_NTEE,ICO,,
...,...,...,...,...,...,...,...,...
814,36391,MIKE STAPLETON,YOUNG WOMENS CHRISTIAN ASSOCIATION,150536617,(P) Human Services,Trustee-Treasur,X,X
815,36391,ANGELA LOH,YOUNG WOMENS CHRISTIAN ASSOCIATION,150536617,(P) Human Services,Trustee-PRES,X,X
816,36391,JOHN WHITTLETON,YOUNG WOMENS CHRISTIAN ASSOCIATION,150536617,(P) Human Services,Trustee,X,na
817,36391,SUE SHERMAN-BROYLES,YOUNG WOMENS CHRISTIAN ASSOCIATION,150536617,(P) Human Services,TRUSTEE SEC,X,X


# Data Checks

In [None]:
# TODO: Research these people who had errors

# hmm PersonNm
# erg [73252, 'BILLY BIMBA GLOBAL MINISTRIES INC', 454495541, '(X) Religion-Related', 'na', 'na', 'na']
# hmm TitleTxt
# erg [73252, 'BILLY BIMBA GLOBAL MINISTRIES INC', 454495541, '(X) Religion-Related', 'na', 'na', 'na']
# hmm PersonNm
# erg [37344, 'GRANGE PATRONS OF HUSBANDRY NEW YORK STATE GRANGE INC', 160725792, 'no_NTEE', 'na', 'na', 'na']
# hmm TitleTxt
# erg [37344, 'GRANGE PATRONS OF HUSBANDRY NEW YORK STATE GRANGE INC', 160725792, 'no_NTEE', 'na', 'na', 'na']

In [None]:
# Name Variations
# TODO: Add fuzzy name check

all_people_df = pd.read_csv(data_dir + 'all_people.csv')
display(all_people_df)


Unnamed: 0.1,Unnamed: 0,p_org_id,PersonNm,NAME,EIN,ntee_cat,TitleTxt,IndividualTrusteeOrDirectorInd,OfficerInd
0,779,36375,ADAM CLIFFORD,YOUNG MENS CHRISTIAN ASSOCIATION CORTLAND,150533570,(P) Human Services,BOARD MEMBER/PRESIDENT,X,X
1,771,76383,ADAM CLIFFORD,YMCA OF CORTLAND PROPERTIES INC,463376307,(P) Human Services,PRESIDENT,na,na
2,28,44725,ADAM MCCRACKEN,CIVIL SERVICE EMPLOYEES ASSOCIATION,161613155,no_NTEE,ICO,,
3,243,38548,ADELE FETTERLY,CORTLAND COUNTY BOARD REALTORS INC,160987063,no_NTEE,TREASURER,na,na
4,797,36391,ADRIANNE TRAUB,YOUNG WOMENS CHRISTIAN ASSOCIATION,150536617,(P) Human Services,BOARD MEMBER,X,na
...,...,...,...,...,...,...,...,...,...
884,657,56342,WILLIAM MCKEE,NATIONAL ACADEMY OF ARBITRATORS,237126791,no_NTEE,PRESIDENT-ELECT,X,X
885,604,38243,WILLIAM MURPHY,J M MURRAY CENTER INC,160919050,(J) Employment,DIRECTOR,X,na
886,715,39411,WILLIAM WEISMORE,STATEWIDE COUNTRY MUSIC ASSOCIATION INC,161132390,"(A) Arts, Culture & Humanities",MEMBERSHIP DIRECTOR,X,na
887,472,35994,Warren Eddy,CORTLAND RURAL CEMETERY,150279170,no_NTEE,Trustee,X,na
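
For the fuzzy name check in the TODO, one possible starting point is `difflib` from the standard library. The sketch below upper-cases names (the table mixes `Warren Eddy` with all-caps entries) and flags pairs whose similarity ratio passes a threshold. The 0.9 cutoff, the helper name `similar_names`, and the misspelled variant in the sample are assumptions, not values from the data.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_names(names, cutoff=0.9):
    """Return (name_a, name_b, ratio) for distinct names that look alike."""
    # normalise case and whitespace first so exact duplicates collapse
    unique = sorted({n.strip().upper() for n in names})
    pairs = []
    for a, b in combinations(unique, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= cutoff:
            pairs.append((a, b, round(ratio, 2)))
    return pairs

# sample list; the misspelled variant is invented for the demo
names = ['ADAM CLIFFORD', 'Adam Clifford', 'ADAM CLIFORD', 'WILLIAM MCKEE', 'WILLIAM MURPHY']
for pair in similar_names(names):
    print(pair)
```

Comparing every pair is quadratic in the number of names, which should be fine for a list of under a thousand people but would need rethinking for larger sets.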


In [None]:
display(people2_df)
#display(ico_df)
#people2_df.info()

Unnamed: 0,p_org_id,NAME,EIN,ntee_cat,PersonNm,TitleTxt,IndividualTrusteeOrDirectorInd,OfficerInd
0,18383,1890 HOUSE MUSEUM AND CENTER FOR THE ARTS,132951986,"(A) Arts, Culture & Humanities",TERRY MINGLE,BD MEMBER,X,na
1,18383,1890 HOUSE MUSEUM AND CENTER FOR THE ARTS,132951986,"(A) Arts, Culture & Humanities",MARK HARRINGTON,BD MEMBER,X,na
2,18383,1890 HOUSE MUSEUM AND CENTER FOR THE ARTS,132951986,"(A) Arts, Culture & Humanities",JERRY WILCOX,BD MEMBER,X,na
3,18383,1890 HOUSE MUSEUM AND CENTER FOR THE ARTS,132951986,"(A) Arts, Culture & Humanities",JANE HUNTER,BD MEMBER,X,na
4,18383,1890 HOUSE MUSEUM AND CENTER FOR THE ARTS,132951986,"(A) Arts, Culture & Humanities",NICOLE HOLLENBACK,BD MEMBER,X,na
...,...,...,...,...,...,...,...,...
814,36391,YOUNG WOMENS CHRISTIAN ASSOCIATION,150536617,(P) Human Services,MIKE STAPLETON,Trustee-Treasur,X,X
815,36391,YOUNG WOMENS CHRISTIAN ASSOCIATION,150536617,(P) Human Services,ANGELA LOH,Trustee-PRES,X,X
816,36391,YOUNG WOMENS CHRISTIAN ASSOCIATION,150536617,(P) Human Services,JOHN WHITTLETON,Trustee,X,na
817,36391,YOUNG WOMENS CHRISTIAN ASSOCIATION,150536617,(P) Human Services,SUE SHERMAN-BROYLES,TRUSTEE SEC,X,X


# Save the people!

In [None]:
#all_people_df['PersonNm'].value_counts()

# all_people_df['PersonNm'].groupby(all_people_df['PersonNm']).count().sort_values(ascending=False)

filt = all_people_df['PersonNm'] != 'na'
# all_people_df = all_people_df[filt]

all_people_df[filt].sort_values('PersonNm').to_csv(data_dir + 'all_people.csv')


# Test collections for visualization



In [None]:
all_people_df = pd.read_csv(data_dir + 'all_people.csv')

In [None]:
display(all_people_df)

Unnamed: 0.1,Unnamed: 0,p_org_id,PersonNm,NAME,EIN,ntee_cat,TitleTxt,IndividualTrusteeOrDirectorInd,OfficerInd
0,779,36375,ADAM CLIFFORD,YOUNG MENS CHRISTIAN ASSOCIATION CORTLAND,150533570,(P) Human Services,BOARD MEMBER/PRESIDENT,X,X
1,771,76383,ADAM CLIFFORD,YMCA OF CORTLAND PROPERTIES INC,463376307,(P) Human Services,PRESIDENT,na,na
2,28,44725,ADAM MCCRACKEN,CIVIL SERVICE EMPLOYEES ASSOCIATION,161613155,no_NTEE,ICO,,
3,243,38548,ADELE FETTERLY,CORTLAND COUNTY BOARD REALTORS INC,160987063,no_NTEE,TREASURER,na,na
4,797,36391,ADRIANNE TRAUB,YOUNG WOMENS CHRISTIAN ASSOCIATION,150536617,(P) Human Services,BOARD MEMBER,X,na
...,...,...,...,...,...,...,...,...,...
884,657,56342,WILLIAM MCKEE,NATIONAL ACADEMY OF ARBITRATORS,237126791,no_NTEE,PRESIDENT-ELECT,X,X
885,604,38243,WILLIAM MURPHY,J M MURRAY CENTER INC,160919050,(J) Employment,DIRECTOR,X,na
886,715,39411,WILLIAM WEISMORE,STATEWIDE COUNTRY MUSIC ASSOCIATION INC,161132390,"(A) Arts, Culture & Humanities",MEMBERSHIP DIRECTOR,X,na
887,472,35994,Warren Eddy,CORTLAND RURAL CEMETERY,150279170,no_NTEE,Trustee,X,na


In [None]:
# people with more than one connection
all_people_df[all_people_df['PersonNm'].groupby(all_people_df['PersonNm']).transform('size')>1]

Unnamed: 0.1,Unnamed: 0,p_org_id,PersonNm,NAME,EIN,ntee_cat,TitleTxt,IndividualTrusteeOrDirectorInd,OfficerInd
0,779,36375,ADAM CLIFFORD,YOUNG MENS CHRISTIAN ASSOCIATION CORTLAND,150533570,(P) Human Services,BOARD MEMBER/PRESIDENT,X,X
1,771,76383,ADAM CLIFFORD,YMCA OF CORTLAND PROPERTIES INC,463376307,(P) Human Services,PRESIDENT,na,na
8,737,42886,ALICE STARMER,THE GREAT CORTLAND PUMPKINFEST INC,161506254,(N) Recreation & Sports,VICE PRESIDENT,X,X
9,321,42170,ALICE STARMER,CORTLAND COUNTY CONVENTION & VISITORS BUREAU INC,161454737,(S) Community Improvement & Capacity Building,BOARD MEMBER,X,na
25,359,54486,ANDREA PIEDIGROSSI,CORTLAND COUNTY YOUTH SOCCER ASSOCIATION INC,223320358,no_NTEE,DIRECTOR OF REC,na,na
...,...,...,...,...,...,...,...,...,...
846,766,76383,TIM HERMAN,YMCA OF CORTLAND PROPERTIES INC,463376307,(P) Human Services,TREASURER,na,na
859,17,114554,TRAVIS MACDOWELL,CENTRAL NEW YORK ACTION SPORTS INC,920791511,(N) Recreation & Sports,ICO,,
860,87,114554,TRAVIS MACDOWELL,CENTRAL NEW YORK ACTION SPORTS INC,920791511,(N) Recreation & Sports,PRESIDENT,X,X
875,694,70345,WALT DE TREUX,NATIONAL ACADEMY OF ARBITRATORS RESEARCH & EDU...,382613043,no_NTEE,SECRETARY-TREASURER,X,X
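
Note that the filter above counts rows, so someone listed twice for the same organization (TRAVIS MACDOWELL appears as both ICO and PRESIDENT of one org) is included. If the goal is people who link different organizations, a sketch that counts distinct `p_org_id` values per person might look like this. It works on plain records, such as those from `all_people_df.to_dict('records')`. The helper name `multi_org_people` is hypothetical.

```python
from collections import defaultdict

def multi_org_people(records):
    """Map each person to the set of org ids they appear under, keeping only 2+ orgs."""
    orgs_by_person = defaultdict(set)
    for rec in records:
        orgs_by_person[rec['PersonNm']].add(rec['p_org_id'])
    return {name: orgs for name, orgs in orgs_by_person.items() if len(orgs) > 1}

# rows trimmed from the table above to the two fields used here
records = [
    {'PersonNm': 'ADAM CLIFFORD', 'p_org_id': 36375},
    {'PersonNm': 'ADAM CLIFFORD', 'p_org_id': 76383},
    {'PersonNm': 'TRAVIS MACDOWELL', 'p_org_id': 114554},
    {'PersonNm': 'TRAVIS MACDOWELL', 'p_org_id': 114554},
]

print(multi_org_people(records))
```

A pandas equivalent would probably be along the lines of `all_people_df.groupby('PersonNm')['p_org_id'].transform('nunique') > 1` used as the filter.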


# Fodder

In [None]:
# this used in streamlit to fix O' names
import re

for index, row in people_df.iterrows():
  print ("----------- print people -------------")
  print(row['people'])
  print ("----------- end people -------------")

  # ppl = json.loads(row['people'])
  # ppl = json.loads(json.dumps(row['people']))
  ppl = json.loads(row['people'].replace("'", "\""))

  #print (json.loads(ppl))

  # for r in row['people']:
  #  print (r)

  # ppl = row['people'].replace("\'", "\"")
  #  print(ppl)



  if index == 2:
    break

   ###

  if isinstance(ppl, str):

      quoted_stuff = re.findall('"([^"]*)"', ppl)
      # st.write (quoted_stuff)
      for t in quoted_stuff:
          fix_t = t.replace("'", " ") # KATHLEEN O CONNELL
          # ok, try this way
          ppl = ppl.replace(t, fix_t) # replace name with no sq

      # after taking sq from any quoted string
      # then replace dq with single quote
      ppl = ppl.replace('"', "'")   # with quoted handled, make all sq
      ppl =  ppl.replace("'", '"')  # replace sq with dq for json

      # st.text (ppl)
      ppl_dict = json.loads(ppl)
      # return {'status' : 'no people'}
      # return ppl_dict
      pprint.pprint(ppl_dict)

  else:
      print ( 'no people')

  ###




----------- print people -------------
[{'PersonNm': 'TERRY MINGLE', 'TitleTxt': 'BD MEMBER', 'AverageHoursPerWeekRt': '1.00', 'AverageHoursPerWeekRltdOrgRt': '0.00', 'IndividualTrusteeOrDirectorInd': 'X', 'ReportableCompFromOrgAmt': '0', 'ReportableCompFromRltdOrgAmt': '0', 'OtherCompensationAmt': '0'}, {'PersonNm': 'MARK HARRINGTON', 'TitleTxt': 'BD MEMBER', 'AverageHoursPerWeekRt': '1.00', 'AverageHoursPerWeekRltdOrgRt': '0.00', 'IndividualTrusteeOrDirectorInd': 'X', 'ReportableCompFromOrgAmt': '0', 'ReportableCompFromRltdOrgAmt': '0', 'OtherCompensationAmt': '0'}, {'PersonNm': 'JERRY WILCOX', 'TitleTxt': 'BD MEMBER', 'AverageHoursPerWeekRt': '1.00', 'AverageHoursPerWeekRltdOrgRt': '0.00', 'IndividualTrusteeOrDirectorInd': 'X', 'ReportableCompFromOrgAmt': '0', 'ReportableCompFromRltdOrgAmt': '0', 'OtherCompensationAmt': '0'}, {'PersonNm': 'JANE HUNTER', 'TitleTxt': 'BD MEMBER', 'AverageHoursPerWeekRt': '1.00', 'AverageHoursPerWeekRltdOrgRt': '0.00', 'IndividualTrusteeOrDirectorInd':

JSONDecodeError: Expecting ',' delimiter: line 1 column 25 (char 24)