# Web Scraper
This tool scrapes Google for the first search result of each query and saves the results as a .csv file.

## Important
Please do not modify any code below except for the one line marked with a `TODO` (the `start` value).

## Requirements
This is Colab, so you don't need anything other than this notebook itself and a Google Drive (or a shared team drive).

## Please Make a Copy of this First
To ensure everyone can collaborate without overwriting each other's progress, first make a copy of this notebook before changing anything. To do so, go to the navigation bar -> File -> Save a Copy in Drive. You will then find the copy in your Google Drive.

## To Run this
1. Click the run button at the top left of each code snippet.
2. For the first snippet, which mounts Google Drive, click the link that appears after the code runs, copy the authorization code into the box, then hit Enter.
3. After running every snippet before the "Stop Here" sign, decide which option you need and follow the instructions below.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pip install bs4 selenium webdriver_manager
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

In [None]:
from bs4 import BeautifulSoup
import csv
from selenium import webdriver
from urllib.parse import quote_plus
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options


def read_csv(filename):
  # Read every row after the header into a list of lists;
  # the with-block closes the file even if reading fails
  with open(filename, 'r', newline='') as file:
    reader = csv.reader(file)
    # skip the header
    next(reader, None)
    return list(reader)

def fetch_webpage(url, browser):
  # Load the page in the headless browser and return its rendered HTML
  browser.get(url)
  return browser.page_source

def get_google_url(query, num=1, start=0, lang='en'):
  query = quote_plus(str(query))
  # 'hl' is Google's interface-language parameter (the original 'nl' was a typo)
  url = 'https://www.google.com/search?q={}&num={}&start={}&hl={}'.format(query, num, start, lang)
  return url

def save_as_dict(name, link, about):
  data_dict = {
    "name": name,
    "link": link,
    "about": about
  }
  return data_dict

def find_link(webpage):
  soup = BeautifulSoup(webpage, 'html.parser')
  # Prefer the first span with class 'ellip'; fall back to the first <cite> element
  span = soup.find('span', {'class' : 'ellip'})
  if span:
    url = span.get_text()
  else:
    cite = soup.find('cite')
    if not cite:
      return '', ''
    url = cite.get_text()
  print(url)

  # Find a brief description
  data = soup.find('div', {"class": 'g'})
  if not data:
    return '', ''
  about = data.find('span',{'class':'st'})
  if not about:
    return '', ''
  about_text = about.text.strip()
  print(about_text)
  return url, about_text

def search_google(data, start, length):
  names = data[start: start + length]
  # Chrome fails to start on Google Cloud Shell
  # browser = webdriver.Chrome(ChromeDriverManager().install())
  chrome_options = Options()
  chrome_options.add_argument('--headless')
  chrome_options.add_argument('--no-sandbox')
  chrome_options.add_argument('--disable-dev-shm-usage')
  browser = webdriver.Chrome('chromedriver', options=chrome_options)
  info_dict = []
  for name in names:
    link = get_google_url(name[0])
    print(link)
    webpage = fetch_webpage(link, browser)
    url, about = find_link(webpage)
    info_dict.append(save_as_dict(name[0], url, about))
  return info_dict

def save_results(results, start, length):
  csv_columns = ['name','link','about']
  # newline='' lets the csv module control line endings, as its documentation recommends
  with open('/content/drive/My Drive/Capstone/scraped_data/links_irs990_' +
            str(start) + '_' + str(start + length) + '.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=csv_columns)
    writer.writeheader()
    for data in results:
        writer.writerow(data)


data = read_csv('/content/drive/My Drive/irs990_names.csv')

# Stop Here

## Note
Colab can scrape 50 links in one function call without any issues. With more than 50 queries in a single call, however, the requests sometimes fail. The scraping is therefore done 50 entries at a time, and each 50-entry call saves its results in its own csv. You can run multiple scrapes in one go (see the options below). Google allows only about 1000 queries per day per IP, and realistically the limit is lower. For that reason, the scraping is done in batches of 500 entries, which runs without failing.
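
The batching described above can be sketched as follows. This is only an illustration and not part of the notebook's pipeline: `batch_ranges` and `output_name` are hypothetical helper names, and the file-name pattern is assumed to mirror the one used in `save_results`.

```python
def batch_ranges(start, total=500, step=50):
    """Yield (first, last) index pairs, one per 50-entry scrape in a run."""
    for i in range(start, start + total, step):
        yield i, i + step


def output_name(first, last):
    # Hypothetical helper: mirrors the file-name pattern used by save_results
    return 'links_irs990_{}_{}.csv'.format(first, last)


ranges = list(batch_ranges(10000))
print(len(ranges))              # prints 10: one csv per 50-entry scrape
print(output_name(*ranges[0]))  # prints links_irs990_10000_10050.csv
```

With `start = 10000`, one run of 500 entries therefore produces ten csv files, covering entries 10000 up to (but not including) 10500.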

## Options
1. The first code snippet below runs multiple batches at once.
**Instructions:**
* Set `start` in increments of 500
* If the printed links are not followed by a name/about text, the VM needs to restart
* If you see any errors during the run, try restarting the runtime and rerunning the same batch. If the problem persists, please describe the error to Tony
* After each run is complete, please check on Google Drive that the output is valid, and restart the VM (see the instructions below)
2. The second code snippet runs a single scrape of only 50 organizations (only used when debugging)
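
One possible way to check that an output file is valid is to count the rows whose `link` or `about` column is empty. The sketch below is an assumption about what "valid" means here, not a check the notebook itself performs. `count_incomplete_rows` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import csv


def count_incomplete_rows(lines):
    """Return (total_rows, rows_missing_link_or_about).

    `lines` is any iterable of csv lines, e.g. an open file object.
    """
    total = 0
    incomplete = 0
    for row in csv.DictReader(lines):
        total += 1
        if not row.get('link') or not row.get('about'):
            incomplete += 1
    return total, incomplete


# Made-up sample in the same name/link/about layout that save_results writes
sample = [
    'name,link,about',
    'Example Org,example.org,An example description',
    'Missing Org,,',
]
print(count_incomplete_rows(sample))  # prints (2, 1)
```

To check a real output file, open it with `open(path, newline='')` and pass the file object to the function. A batch in which most rows are incomplete is a sign that the VM needs to restart.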

## Restart the VM
* From the navigation bar, click Runtime -> Factory Reset Runtime
* Go back to the top of the notebook and rerun everything before "Stop Here"

In [None]:
# Click me for larger scrapes
# TODO: Set Start in increments of 500
start = 10000

'''
Do not modify below this line
-----------------------------------------------
'''
stop = start + 500
step = 50
for i in range(start, stop, step):
  results = search_google(data, i, step)
  save_results(results, i, step)

# Stop and Factory Reset Runtime after Each Iteration of 500 Scrapes
You don't need to run or change anything below this line.

----

In [None]:
# For single runs
# TODO: set start
start = 9500
num_entries = 50
results = search_google(data, start, num_entries)
save_results(results, start, num_entries)