# Web Scraper
This tool scrapes Google for the first search result of each query and saves the results as a .csv file.

## Important
Please do not modify any code below except for one line that is specified.

## Requirements
This notebook runs on Colab, so you only need the notebook itself and a Google Drive (or a shared team drive).

## To Run this
1. Click the run button at the top left of each code snippet.
2. The first snippet mounts Google Drive: after the code runs, click the link, copy the authorization code from Google Drive into the box, then hit Enter.
3. After running every snippet before the "Stop Here" sign, decide which kind of scrape you will run and follow the instructions below.


In [11]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [12]:
!pip install bs4 google google-api-python-client nltk numpy pandas scikit-learn selenium webdriver_manager
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

Hit:1 http://security.ubuntu.com/ubuntu bionic-security InRelease
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ InRelease
Hit:4 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Hit:5 http://archive.ubuntu.com/ubuntu bionic InRelease
Ign:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:8 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Hit:9 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
Hit:10 http://ppa.launchpad.net/marutter/c2d4u3.5/ubuntu bionic InRelease
Hit:11 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
chromiu

In [13]:
from bs4 import BeautifulSoup
import csv
from selenium import webdriver
from urllib.parse import quote_plus
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options


def read_csv(filename):
  # The context manager closes the file even if reading fails
  with open(filename, 'r', newline='') as file:
    reader = csv.reader(file)
    # skip the header
    next(reader, None)
    return list(reader)

def fetch_webpage(url, browser):
    browser.get(url)
    html = browser.page_source
    return html

def get_google_url(query, num=1, start=0, lang='en'):
  query = quote_plus(str(query))
  url = 'https://www.google.com/search?q={}&num={}&start={}&nl={}'.format(query, num, start, lang)
  return url

def save_as_dict(name, link, about):
  data_dict = {
    "name": name,
    "link": link,
    "about": about
  }
  return data_dict

def find_link(webpage):
  soup = BeautifulSoup(webpage, 'html.parser')
  # Find the first span with class 'ellip'; its text is used as the result URL
  span = soup.find('span', {'class' : 'ellip'})
  if not span:
    # Fall back to the first <cite> element
    link = soup.find('cite')
    if link:
      url = link.get_text()
    else:
      return '', ''
  else:
    url = span.get_text()
  print(url)

  # Find a brief description
  data = soup.find('div', {"class": 'g'})
  if not data:
    return '', ''
  about = data.find('span',{'class':'st'})
  if not about:
    return '', ''
  about_text = about.text.strip()
  print(about_text)
  return url, about_text

def search_google(data, start, length):
  names = data[start: start + length]
  # Chrome fails to start on Google Cloud Shell
  # browser = webdriver.Chrome(ChromeDriverManager().install())
  chrome_options = Options()
  chrome_options.add_argument('--headless')
  chrome_options.add_argument('--no-sandbox')
  chrome_options.add_argument('--disable-dev-shm-usage')
  browser = webdriver.Chrome('chromedriver', options=chrome_options)
  info_dict = []
  try:
    for name in names:
      link = get_google_url(name[0])
      print(link)
      webpage = fetch_webpage(link, browser)
      url, about = find_link(webpage)
      info_dict.append(save_as_dict(name[0], url, about))
  finally:
    # Always close the browser so headless Chrome processes do not pile up
    browser.quit()
  return info_dict

def save_results(results, start, length):
  csv_columns = ['name','link','about']
  # newline='' lets the csv module control line endings
  with open('/content/drive/My Drive/Capstone/scraped_data/links_irs990_' + str(start) + '_' + str(start + length) + '.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=csv_columns)
    writer.writeheader()
    writer.writerows(results)


data = read_csv('/content/drive/My Drive/irs990_names.csv')

# Stop Here

## Note
Colab can scrape 50 links in a single function call without issues, but requests sometimes fail beyond 50 queries. The scraping is therefore done 50 entries at a time, and each run saves its results to a csv. You can run multiple scrapes at once (see the options below). Google allows only about 1000 queries per day per IP, and the realistic limit is lower, so scraping in batches of 500 entries avoids failures.
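
To make the batching concrete, here is a minimal sketch of how a run of 500 entries splits into 50-entry batches and which output file each batch would produce. The helper names `batch_ranges` and `output_name` are hypothetical (they are not part of the notebook); the file-name pattern is copied from `save_results`.

```python
def batch_ranges(start, total=500, step=50):
    """Return (batch_start, batch_end) pairs covering [start, start + total)."""
    return [(i, i + step) for i in range(start, start + total, step)]


def output_name(batch_start, batch_end):
    # Hypothetical helper: mirrors the file-name pattern used by save_results.
    return 'links_irs990_{}_{}.csv'.format(batch_start, batch_end)


batches = batch_ranges(8000)
print(len(batches))               # 10 batches of 50 entries each
print(output_name(*batches[0]))   # links_irs990_8000_8050.csv
print(output_name(*batches[-1]))  # links_irs990_8450_8500.csv
```

With a start of 8000, one run therefore writes ten csv files, from entries 8000-8050 up to 8450-8500.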

## Options
1. The first code snippet below runs multiple batches at once.
   **Instructions:**
   * Set the start in increments of 500.
   * If the printed search links are not followed by a result URL and description (about/name), the VM needs to restart.
   * After each run is complete, check on Google Drive that the output is valid, then restart the VM (see the instructions below).
2. The second code snippet runs a single scrape of only 50 organizations (used only when debugging).
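
One possible way to check that an output file is valid is to count the rows where the link or description came back empty. The sketch below assumes that "valid" means most rows have both fields filled in; that criterion, the function name `count_incomplete_rows`, and the sample data are all assumptions for illustration, not part of the notebook.

```python
import csv
import io


def count_incomplete_rows(csv_file):
    """Return (total_rows, rows_with_empty_link_or_about) for an open csv file."""
    reader = csv.DictReader(csv_file)
    total = 0
    incomplete = 0
    for row in reader:
        total += 1
        # find_link returns '' for both fields when nothing could be parsed
        if not row.get('link') or not row.get('about'):
            incomplete += 1
    return total, incomplete


# Hypothetical sample in the same column layout that save_results writes
sample = io.StringIO(
    'name,link,about\n'
    'ORG A,example.org,A short description\n'
    'ORG B,,\n'
)
total, incomplete = count_incomplete_rows(sample)
print(total, incomplete)  # 2 1
```

To check a real file, pass an open file handle instead of the in-memory sample, for example `open(path, newline='')` with the path of one of the saved csv files. A batch in which nearly every row is incomplete suggests the VM needed a restart.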

## Restart the VM
* From the menu bar, click Runtime -> Factory Reset Runtime.
* Go back to the top of the notebook and rerun everything before "Stop Here".

In [None]:
# Click me for larger scrapes
# TODO: Set Start in increments of 500
start = 8000

'''
Do not modify below this line
-----------------------------------------------
'''
stop = start + 500
step = 50
for i in range(start, stop, step):
  results = search_google(data, i, step)
  save_results(results, i, step)

https://www.google.com/search?q=HALL+INSTITUTE+FOR+NEW+JERSEY+PUBLIC+POLICY+INC&num=1&start=0&nl=en
en.wikipedia.org › wiki › Hall_Institute_of_Public_Poli...
The Hall Institute for Public Policy - New Jersey is a nonpartisan, non-profit think tank that focuses on public policy issues in New Jersey. The institute was ...
https://www.google.com/search?q=HEARTS+THAT+CARE+VOLUNTEER+HEALTH+CLINIC+INC&num=1&start=0&nl=en
heartsthatcarelawton.org
Hearts that care provides a free health clinic and medications for people in Lawton who don't have insurance. Our work is made possible by donations from ...
https://www.google.com/search?q=SAINT+BARNABAS+MEDICAL+CENTER&num=1&start=0&nl=en
www.rwjbh.org › saint-barnabas-medical-center
Saint Barnabas Medical Center (SBMC) is a fully accredited acute care hospital and offers a comprehensive array of services including advanced cancer care, ...
https://www.google.com/search?q=MIDLAND+SCHOOL&num=1&start=0&nl=en
midland-school.org
SANTA BARBARA COUNTY CO

# Stop and Factory Reset Runtime after Each Iteration of 500 Scrapes
You don't need to run or change anything below this line.

----

In [None]:
# For single runs
# TODO: set start
start = 0
num_entries = 50
results = search_google(data, start, num_entries)
save_results(results, start, num_entries)