# Google Email Scraper Tool 
Adapted from: https://towardsdatascience.com/web-scraping-to-extract-contact-information-part-1-mailing-lists-854e8a8844d2

# INSTRUCTIONS: 
    - CHANGE USER AGENT: In the 'get_info' function, the 'process' variable is created with settings={'USER_AGENT': 'Mozilla...'}. Replace this value with your own user agent, which you can copy from the top result of a Google search for "What is my user agent?"
    - Run every cell above the one starting with the "final_result" variable, either by pressing "Shift+Return" on each cell in turn or by selecting that cell and clicking "Run All Above" in the "Cell" menu of Jupyter Notebook.
    - In the 'final_result' cell, which calls the 'get_info' function, change the string to any Google search query and the number to how many sites you want to scrape.*
    - Leave the language ('en') and path placeholder ('csv') as they are.
    - Run the cell to produce the final results. **
    - In the following cell, change the path string passed to "to_csv" to wherever you want the file saved. By default it writes a generically named file to a 'downloads' folder relative to the notebook's working directory.
    - Run the cell, and open your results at that path.
    - To run the script with a different query, shut down the notebook ("Close and Halt" in the "File" menu of Jupyter Notebook) and restart it. **

*: Run time and the chance of failure both increase with this number, because requests are sent too quickly and sites' scraper-detection software is more likely to trigger. A failure shows up as a "ValueError" while the script runs and leaves it stuck; when that happens, terminate and restart the program with a lower number of URLs to scrape. The scraper is likely detected because it does not pay for a USER AGENT database and rotate through it while running. I recommend starting with 10 to familiarize yourself with the results, then going higher over time, keeping in mind the greater run time and risk of failure.

** The "ReactorNotRestartable" error appears if you run the script a second time in the same session, because Scrapy's CrawlerProcess runs on a Twisted reactor that cannot be restarted once it has stopped. The script runs once; the notebook must then be quit and restarted for each further use.

Google Email Scraper
ACTION: Scrape Emails + URLs from Google Search


OUTSTANDING CHALLENGES: Without paying for a user agent list to rotate through, we cannot get past every site's scraping-prevention software. The script runs successfully once, then must be terminated and restarted because of the "ReactorNotRestartable" error described above.

In [5]:
import logging
import re
import os
import pandas as pd
# Scrapy enables us to Crawl Websites. Install with pip install scrapy
import scrapy 
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
# Provides the search() helper for Google queries. Install with: pip install google
# (the 'google' package installs the 'googlesearch' module imported here).
from googlesearch import search
# A setting for using Scrapy within a notebook, to avoid excess warnings & logs.
logging.getLogger('scrapy').propagate = False

In [6]:
pd.set_option('display.max_rows', 100)

A function to return URLs from a search query.
Feed in a query string, a stop value that caps the number of URLs returned, and optionally a language code.

WARNING: Google will throw an HTTP Error 429 if you send too many requests at once.
The limits cited here are 50k per day and 10 per second. Note that the source below documents the Google Analytics Management API, so treat these numbers as a rough guide rather than the limits of Google Search itself.
Source: https://developers.google.com/analytics/devguides/config/mgmt/v3/limits-quotas
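
One common way to cope with HTTP 429 responses is to retry with exponentially growing pauses. The following is a minimal sketch and not part of the original notebook: `with_backoff`, `max_tries`, `base_delay` and `sleep` are hypothetical names, and the wrapped callable stands in for a call such as `return_urls`.

```python
import time


def with_backoff(func, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and retry.

    Hypothetical helper (not in the original notebook). `sleep` is injectable
    so the waiting can be replaced or skipped when experimenting.
    """
    for attempt in range(max_tries):
        try:
            return func()
        except Exception:
            # Broad catch for illustration; in practice narrow this to the
            # HTTP error type your search library raises.
            if attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage sketch (assumes return_urls from this notebook is defined):
# urls = with_backoff(lambda: return_urls('therapists detroit email', 10))
```

With the defaults, the pauses between attempts are 1, 2 and 4 seconds before the last error is re-raised.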

In [7]:
def return_urls(query_string, stop_val, language='en'):
	# search() yields result URLs lazily; 'stop' limits how many are requested.
	url_list = list(search(query_string, lang=language, stop=stop_val))[:stop_val]
	return url_list

Regex function to find emails in a site's HTML: one or more word characters (letters, digits or underscore), then an @ sign, then more word characters, then a '.', then more word characters.

In [8]:
def email_finder(html_text):
	# '\.' matches exactly one literal dot, so the '{1}' quantifier is not needed.
	return re.findall(r'\w+@\w+\.\w+', html_text)
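
To see what this pattern does and does not capture, here is a small sketch run on made-up sample text (the addresses and the `EMAIL_PATTERN` name are illustrative, not from the notebook). The pattern is equivalent to the one in `email_finder`.

```python
import re

# Equivalent to the pattern used in email_finder; \w matches letters,
# digits and underscore, but not '.' or '-'.
EMAIL_PATTERN = r'\w+@\w+\.\w+'

# Made-up sample text for illustration only.
sample = 'Contact jane.doe@example.com or bob@mail.org; see logo@2x.png'
matches = re.findall(EMAIL_PATTERN, sample)
print(matches)  # ['doe@example.com', 'bob@mail.org', 'logo@2x.png']
```

The first address loses everything before its dot, and a retina image filename is picked up as if it were an email, which is why the cleaning step later filters out "@2x.png".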

Function to create the final file for download.

In [9]:
def create_file(path):
    # Opening in write mode creates (or empties) the file; the 'with' block closes it.
    with open(path, 'wb'):
        pass

Scrapy Spider to extract emails from a set of URLs 

In [10]:
class MailSpider(scrapy.Spider):
    
    name = 'email'
    # To avoid denial of crawling by websites. Slows the program down, but is the recommended course of action should the spider fail.
    # Set to any number 2+ for best results.
    #download_delay = 2
    
    def parse(self, response):
        
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))
        
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link) 
            
    def parse_link(self, response):
        
        for word in self.reject:
            if word in str(response.url):
                return
            
        html_text = str(response.text)
        mail_list = email_finder(html_text)

        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        
        # Append this page's emails to the shared CSV (the header is written once in get_info).
        df.to_csv(self.path, mode='a', header=False)
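
The reject check at the top of `parse_link` can be sketched in isolation as follows. `is_rejected` is a hypothetical helper name and the reject values are examples, not taken from the notebook.

```python
def is_rejected(url, reject):
    """Return True if any rejected substring occurs in the URL."""
    return any(word in url for word in reject)


reject = ['facebook', 'twitter']  # example values, not from the notebook
print(is_rejected('https://www.facebook.com/some-page', reject))  # True
print(is_rejected('https://example.org/contact', reject))         # False
```

Passing such a list as the `reject` argument of `get_info` makes the spider skip pages whose URL contains any of those substrings.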


Creates a new CSV with an empty data frame, gets Google URLs from the 'return_urls' function, scrapes them for emails, and returns the cleaned results as a data frame (also saved to the CSV).

In [18]:
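def get_info(tag, n, language, path, reject=None):
    # Avoid a mutable default argument; fall back to an empty reject list.
    reject = [] if reject is None else reject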
    
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)
    
    print('Collecting Google urls...')
    google_urls = return_urls(tag, n, language)
    
    print('Searching for emails...')
    # Change to your own USER AGENT by replacing the "Mozilla..." string with the Google result for "what is my user agent".
    process = CrawlerProcess(settings={'USER_AGENT': 'Mozilla/5.0 (Macintosh;...'})
    process.crawl(MailSpider, start_urls=google_urls, path=path, reject=reject)
    process.start()
    
    #time.sleep(0.5)

    #os.execl(sys.executable, sys.executable, *sys.argv)
    
    
    
    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df[~df.email.str.contains("@2x.png", na=False)]
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)
    
    return df
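
The cleaning step above drops "@2x.png" matches and duplicate emails with pandas. The same idea can be sketched without dependencies, here with a slightly broader image filter. This is an illustrative variant, not the notebook's method: `clean_rows`, `IMAGE_SUFFIX` and the list of extensions are assumptions.

```python
import re

# Assumed list of image extensions that produce false "emails" such as logo@2x.png.
IMAGE_SUFFIX = re.compile(r'\.(png|jpe?g|gif|svg|webp)$', re.IGNORECASE)


def clean_rows(rows):
    """Drop image-filename matches and keep the first row seen for each email."""
    seen = set()
    cleaned = []
    for email, link in rows:
        if IMAGE_SUFFIX.search(email) or email in seen:
            continue
        seen.add(email)
        cleaned.append((email, link))
    return cleaned


# Made-up rows for illustration only.
rows = [
    ('bob@mail.org', 'https://example.org/contact'),
    ('logo@2x.png', 'https://example.org/'),
    ('bob@mail.org', 'https://example.org/about'),
]
print(clean_rows(rows))  # [('bob@mail.org', 'https://example.org/contact')]
```

Like `drop_duplicates(subset='email')` with its default settings, this keeps the first link seen for each address.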


To run the script, call the get_info function, passing a Google search query for 'tag', any number for the stop value, 'en' for language, and any unique name for the path (CSV file). The example below scrapes 20 URLs to find therapists in Detroit. Because it is a larger query it takes a few minutes to run, and in this case it yielded around 1200 results (90%+ usable emails) in its CSV output. Numerous sites detected the scraper and returned errors, but given enough time the run still produced that output.

In [17]:
final_result = get_info('therapists @gmail.com detroit email', 20, 'en', 'csv')

In [16]:
final_result.to_csv('./downloads/google_scraper_results.csv')
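
Writing to './downloads/...' fails if that folder does not exist in the working directory. A minimal sketch of creating it first is shown below; `prepare_output_path` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def prepare_output_path(path):
    """Create the parent folder of `path` if it is missing and return a Path."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    return out


# Usage sketch (assumes final_result from the cell above):
# final_result.to_csv(prepare_output_path('./downloads/google_scraper_results.csv'))
```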

Final Remarks: A known error is that an email with a '.' before the '@' sign may come back truncated to the part after that '.'. If an email in the resulting spreadsheet looks off, open its accompanying link and use Control-F to find the full address. The quality of results can be hit or miss depending on the search query: "therapists @gmail.com new york email" yielded 200+ emails, and the number of scrapable emails varies greatly with which sites come up. For a poor set of results, changing the city in the query is a good way to try again.
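
One possible way to reduce the truncation issue is a broader pattern that allows dots, plus signs and hyphens before the '@' and multi-part domains after it. This is a sketch of an alternative, not the notebook's regex: `BROAD_EMAIL` and `email_finder_broad` are hypothetical names, and the pattern is still a heuristic rather than a full email validator.

```python
import re

# Hypothetical broader pattern: allows '.', '_', '%', '+', '-' in the local part
# and one or more dot-separated labels in the domain.
BROAD_EMAIL = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+')


def email_finder_broad(html_text):
    """Return every substring of html_text that matches BROAD_EMAIL."""
    return BROAD_EMAIL.findall(html_text)


# Made-up sample text for illustration only.
found = email_finder_broad('Write to jane.doe@example.co.uk or bob@mail.org.')
print(found)  # ['jane.doe@example.co.uk', 'bob@mail.org']
```

Image filenames such as "logo@2x.png" still match this pattern, so the cleaning step remains necessary.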