<a href="https://colab.research.google.com/github/johnzelson/np-colab-notebooks/blob/main/S5_Web_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

This notebook runs web searches (using the Bing Web Search API) to learn more about an organization when no website is listed on its IRS form. In the future, once each organization has a website, scraping can collect additional information about each nonprofit.

While the IRS tax documents contain a lot of information, web searches can surface details that are interesting to someone who wants to learn about their community.

I tried a few ideas but ended up using the Bing free tier (it needs a credit card, though). It has an API and allows 1,000 requests per month. I experimented with DuckDuckGo, Wikidata, and Wikipedia, but the results were not satisfactory. Simply scraping Google or Bing without a registered API appears to violate their terms of service.

Roadmap: Once all the websites are populated, scrapy or bs4 can be used to dig into the nonprofit websites.



# Tech Notes

|   In           |   Description |
| -------------- | ----------------- |
| np_cortland_df    | Local Area df |
| irs_latest_df   | Latest IRS tax form data; includes the website, if one was reported |


|   Out           |   Description |
| -------------- | ----------------- |
| web_find_df.csv | Web Search Results  |




# Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


Mounted at /content/drive


In [None]:
import requests
import pprint
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import folium

pd.set_option('display.max_columns', 100);
pd.set_option('display.max_rows', 100);

from google.colab import userdata


In [None]:
#TODO: create a configuration section -- not implemented

# base folder for retrieving raw data
data_dir = '/content/drive/My Drive/irs_data/'

# folder for writing processed data
proc_dir = '/content/drive/My Drive/IRS_processed/'


# Prep DFs

First, get the website listed on each organization's IRS tax return.

In [None]:
# load the dataframes saved by previous processing steps

#TODO: Again, when to introduce np_local

# using np_cortland
np_cortland_df = pd.read_csv('/content/drive/My Drive/IRS_processed/np_cortland_p_df.csv')
np_cortland_df

# need website from IRS Latest
IRS_latest_df = pd.read_csv ('/content/drive/My Drive/IRS_processed/irs_latest_df.csv')
IRS_latest_df

# create np_local_df (used that name previously in code) but it's just a temp name
# for this notebook
#TODO: reviewing naming strategies for each step
np_local_df = pd.merge(
    np_cortland_df,
    IRS_latest_df[['EIN', 'WebsiteAddressTxt']],  # Select desired columns
    on='EIN',
    how='left'
)

filt =  (np_local_df['WebsiteAddressTxt'] == "tag_not_found") |  np_local_df['WebsiteAddressTxt'].isnull()
np_local_df[filt].shape # 117



(117, 68)
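
The filter above covers two distinct cases: a tax return that was processed but had no website (`tag_not_found`), and an organization with no tax return at all (null after the left merge). As a minimal sketch of that behavior, the toy example below uses made-up EINs, names, and a placeholder URL; none of these values come from the real data.

```python
import pandas as pd

# Toy stand-ins for np_cortland_df and IRS_latest_df (all values are made up)
orgs = pd.DataFrame({"EIN": [1, 2, 3], "NAME": ["ORG A", "ORG B", "ORG C"]})
irs = pd.DataFrame({
    "EIN": [1, 2],
    "WebsiteAddressTxt": ["https://example.org", "tag_not_found"],
})

# Left merge keeps every org; EIN 3 has no tax return, so its website is NaN
merged = pd.merge(orgs, irs[["EIN", "WebsiteAddressTxt"]], on="EIN", how="left")

# Same filter as the notebook: explicit marker OR missing value
needs_search = (
    (merged["WebsiteAddressTxt"] == "tag_not_found")
    | merged["WebsiteAddressTxt"].isnull()
)

print(merged[needs_search]["NAME"].tolist())  # ['ORG B', 'ORG C']
```

Only ORG A, which has a real website value, is left out of the search list.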

# Lookup with Bing

In [None]:
# --------------------
# kickoff search
# --------------------
# iterate np_local_df
# filter where website is "tag_not_found" or is null
# save results with p_org_id and ein in web_srch_df

# null website:  no tax return was found for nonprofit
# tag_not_found:  a tax return was processed, but didn't have a website

import requests
import pprint

from google.colab import userdata
subscription_key = userdata.get('bing_key')

search_url = "https://api.bing.microsoft.com/v7.0/search"

# list of dicts, one per search
web_find_list = []

# keep rows where the website is null or marked "tag_not_found"
filt =  (np_local_df['WebsiteAddressTxt'] == "tag_not_found") |  np_local_df['WebsiteAddressTxt'].isnull()


#for index, row in cortland_geotax_df[filt].head(3).iterrows():
for index, row in np_local_df[filt].iterrows():

  each_find = {}
  p_org_id = row['p_org_id']
  ein = row['EIN']
  np_name= row['NAME']
  city = row['CITY']
  state = row['STATE']

  # go do search (bing_lookup is defined in the next cell; run that cell first)
  print(f"searching {np_name} ({ein})...")
  each_find = bing_lookup(ein, np_name, city, state)
  each_find['p_org_id'] = p_org_id
  each_find['ein'] = ein

  web_find_list.append(each_find)

# build the results dataframe from the list of per-search dicts
web_srch_df = pd.DataFrame(web_find_list)


searching CORTLANDVILLE FIRE DEPARTMENT INCORPORATED (115227037)...
searching 1890 HOUSE MUSEUM AND CENTER FOR THE ARTS (132951986)...
searching CORTLAND RURAL CEMETERY (150279170)...
searching BENEVOLENT & PROTECTIVE ORDER OF ELKS OF THE USA (150298495)...
searching AMERICAN LEGION (150610966)...
searching RALPH WILKINS FOUNDATION (161188525)...
searching THE GREAT CORTLAND PUMPKINFEST INC (161506254)...
searching CIVIL SERVICE EMPLOYEES ASSOCIATION (161513551)...
searching YAMAN FOUNDATION (161571985)...
searching THE KINGS DAUGHTERS HOME FOR CHILDREN (166054829)...
searching DISABLED VETS OF CORTLAND COUNTY (222315953)...
searching FRIENDS OF THE CORTLAND COUNTY CHILD ADVOCACY CENTER INC (462277341)...
searching DAN AND ROSE MCNEIL FOUNDATION INC (833255297)...
searching CORTLAND REUSE INC (853825876)...
searching LITTLE TAGS FOUNDATION INC (870850886)...
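
As written, the loop stops at the first failed request (for example, an HTTP error raised by `raise_for_status`), and the results collected so far never reach the dataframe. One possible way to make the run more forgiving is sketched below. `run_searches` and `fake_lookup` are hypothetical names introduced here for illustration, and the rows are made-up data; the stand-in lookup lets the sketch run without an API key.

```python
def run_searches(rows, lookup):
    """Call lookup once per row; collect results and failures separately."""
    found, failed = [], []
    for row in rows:
        try:
            result = dict(lookup(row["EIN"], row["NAME"], row["CITY"], row["STATE"]))
        except Exception as err:  # broad on purpose: record the failure and move on
            failed.append({"ein": row["EIN"], "error": str(err)})
            continue
        result["p_org_id"] = row["p_org_id"]
        result["ein"] = row["EIN"]
        found.append(result)
    return found, failed


# Stand-in for bing_lookup so the sketch runs offline (hypothetical)
def fake_lookup(ein, np_name, city, state):
    if ein == 2:
        raise RuntimeError("no results")
    return {"url": f"https://example.org/{ein}"}


rows = [
    {"p_org_id": 10, "EIN": 1, "NAME": "ORG A", "CITY": "CORTLAND", "STATE": "NY"},
    {"p_org_id": 11, "EIN": 2, "NAME": "ORG B", "CITY": "CORTLAND", "STATE": "NY"},
]
found, failed = run_searches(rows, fake_lookup)
print(len(found), len(failed))  # 1 1
```

In the notebook, the same function could presumably be called with `(row for _, row in np_local_df[filt].iterrows())` and the real `bing_lookup`, since each pandas row supports the same `row["EIN"]` style of access. The `failed` list then shows which EINs to retry without spending quota on the ones that already succeeded.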


In [None]:
# Look up a nonprofit with Bing.
# Returns the first result's URL, title, and snippet, plus the full response text.
# https://learn.microsoft.com/en-us/bing/search-apis/bing-web-search/reference/query-parameters

def bing_lookup(ein, np_name, city, state):

  each_bing_find = {}


  # test in browser with EIN
  search_term = np_name + " in " + city + ", " + state

  headers = {"Ocp-Apim-Subscription-Key": subscription_key}
  params = {"q": search_term, "textDecorations": True, "textFormat": "HTML"}
  #params = {"q": search_term}
  # timeout keeps a stalled request from hanging the whole loop
  response = requests.get(search_url, headers=headers, params=params, timeout=30)
  response.raise_for_status()
  search_results = response.json()
  #pprint.pprint(search_results)

  # just take the first web page result
  # TODO: add some intelligence, though it can be done later since the full
  # response is saved
  # 'webPages' may be absent when Bing returns no web results, so guard the lookup
  pages = search_results.get('webPages', {}).get('value', [])
  first = pages[0] if pages else {}

  each_bing_find['url'] = first.get('url')
  each_bing_find['found_name'] = first.get('name')
  each_bing_find['snippet'] = first.get('snippet')
  each_bing_find['full_result'] = response.text

  return each_bing_find
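
The TODO in `bing_lookup` mentions adding some intelligence later, which is possible because the full response is saved. As one rough idea (not the notebook's method), the sketch below re-reads a saved `full_result` string and picks the result whose title shares the most words with the organization name. `pick_best` and `SKIP_HOSTS` are hypothetical names, the host list is an arbitrary example, and the sample response is made up; only the `webPages` / `value` / `name` / `url` layout is taken from the code above.

```python
import json
from urllib.parse import urlparse

# Hypothetical list of directory-style hosts to skip; adjust as needed
SKIP_HOSTS = {"www.facebook.com", "www.guidestar.org"}


def pick_best(full_result, np_name):
    """Return the result whose title shares the most words with np_name."""
    pages = json.loads(full_result).get("webPages", {}).get("value", [])
    name_words = set(np_name.lower().split())
    best, best_score = None, 0
    for page in pages:
        if urlparse(page.get("url", "")).netloc in SKIP_HOSTS:
            continue
        score = len(name_words & set(page.get("name", "").lower().split()))
        if score > best_score:
            best, best_score = page, score
    return best  # None when nothing overlaps


# Made-up response in the same layout as the saved full_result column
sample = json.dumps({"webPages": {"value": [
    {"name": "Some Org | Facebook", "url": "https://www.facebook.com/someorg"},
    {"name": "Cortland Reuse - Home", "url": "https://example.org/"},
]}})
print(pick_best(sample, "CORTLAND REUSE INC")["url"])  # https://example.org/
```

Because the notebook requests `textDecorations` with `textFormat` set to HTML, real titles may contain highlighting markup, so stripping tags before matching would likely be needed. Word overlap is a crude signal; it is only meant to show that the saved responses can be re-scored without calling the API again.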



In [None]:
# save results of web search into web_find_df
fn = proc_dir + 'web_find_df.csv'
#web_srch_df.to_csv('/content/drive/My Drive/IRS_geocode/web_find_df.csv')
web_srch_df.to_csv(fn, index=False)
print(f"saved {fn}")

