In [1]:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

# Google search using Python
> See the [documentation](http://pythonhosted.org/google/) and the [PyPI page](https://pypi.python.org/pypi/google) for the package source.

## Initializing the Python environment

In [2]:
# IMPORTING KEY PACKAGES
from google import search
import csv, re, os
import urllib, requests

## Testing out the search function

In [3]:
help(search)

Help on function search in module google:

search(query, tld='com', lang='en', tbs='0', safe='off', num=10, start=0, stop=None, pause=2.0, only_standard=False, extra_params={}, tpe='', user_agent=None)
    Search the given query string using Google.
    
    @type  query: str
    @param query: Query string. Must NOT be url-encoded.
    
    @type  tld: str
    @param tld: Top level domain.
    
    @type  lang: str
    @param lang: Languaje.
    
    @type  tbs: str
    @param tbs: Time limits (i.e "qdr:h" => last hour, "qdr:d" => last 24 hours, "qdr:m" => last month).
    
    @type  safe: str
    @param safe: Safe search.
    
    @type  num: int
    @param num: Number of results per page.
    
    @type  start: int
    @param start: First result to retrieve.
    
    @type  stop: int
    @param stop: Last result to retrieve.
        Use C{None} to keep searching forever.
    
    @type  pause: float
    @param pause: Lapse to wait between HTTP requests.
        A lapse too long will

In [4]:
# Example of using the function:
for url in search('BENJAMIN FRANKLIN CHARTER SCHOOL MESA 2345 NORTH HORNE, MESA, AZ',
                  stop=5, pause=1):
    print(url)

http://www.ade.az.gov/edd/NewDetails.asp?EntityID=5536&RefTypeID=1035
https://www.schooldigger.com/go/AZ/schools/0006500821/school.aspx
https://www.mapquest.com/us/arizona/schools-mesa/franklin-benjamin-charter-school-345471093
https://www.yellowpages.com/mesa-az/mip/benjamin-franklin-charter-school-16162575
http://public-schools.startclass.com/l/2200/Benjamin-Franklin-Charter-School-Mesa
https://www.publicschoolreview.com/benjamin-franklin-charter-school-mesa-profile
https://www.noodle.com/schools/ktzL7/benjamin-franklin-charter-school-mesa
http://www.markmyagent.com/ShowSchoolDetail.aspx?pageid=2022821&schoolid=040006500821
https://www.spellingcity.com/benjamin-franklin-charter-school-mesa-mesa-az.html
http://www.ratemyteachers.com/benjamin-franklin-charter-school-mesa/500425-s


## Configuring search environment

In [5]:
# Sites we DON'T want to spider but that an automated Google search often returns
# (directories, review sites, real-estate listings, etc.).
# We filter these out below so we don't crawl them by accident.

bad_sites = []
with open("../bad_sites.csv", "r", encoding = "utf-8") as csvfile:
    for row in csvfile:
        bad_sites.append(re.sub("\n", "", row))

print(bad_sites)

['high-schools.com', 'yelp.com', 'har.com', 'trulia.com', 'redfin.com', 'practutor.com', 'startclass.com', 'greatschools.org', 'greatschools.com', 'greatschools.net', 'paschoolperformance.org', 'worldcontactinfo.com', 'kula.com', 'mapquest.com', 'maps.net', 'google.com', 'facebook.com', 'zillow.com', 'manta.com', 'yellowpages.com', 'usnews.com', 'publicschoolreview.com', 'publicschoolreview.org', 'schooldigger.com', 'niche.com', 'privateschoolreview.com', 'cappex.com', 'collegeconfidential.com', 'tripsadvisor.com', 'groupon.com', 'school-ratings.com', 'superpages.com', 'onsaleph.com', 'psk12.com', 'schoolmatters.com', 'neighborhoodscout.com', 'localschooldirectory.com', 'publicschoolsk12.com', 'schooldatadirect.org', 'nces.ed.gov', 'cityrating.com', 'blogspot.com', 'public-schools.findthebest.com', 'twitter.com', 'zoominfo.com', 'jigsaw.com', 'hoovers.com', 'corporateinformation.com', 'doe.k12.ga.us', 'gradeschools.net', 'charterschoolratings.net', 'schools.net', 'insiderpages.com', 'p
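
The loading step above keeps every line of the file, including blank ones. A minimal sketch of a slightly more defensive loader follows, assuming the file holds one domain per line. The helper name `load_bad_sites` and the demo file contents are made up for illustration and are not part of the original notebook.

```python
import os
import tempfile

def load_bad_sites(path):
    """Return the non-empty lines of a one-domain-per-line file, stripped and lowercased."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip().lower() for line in f if line.strip()]

# Demonstration with a throwaway file (contents are invented for this example)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, encoding="utf-8") as tmp:
    tmp.write("yelp.com\n\n  Trulia.com \nzillow.com\n")
    demo_path = tmp.name

demo_sites = load_bad_sites(demo_path)
os.remove(demo_path)
print(demo_sites)
```

Lowercasing is a design choice here: hostnames are case-insensitive, so normalizing once at load time keeps later comparisons simple.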

## Helpful bash-fu

In [3]:
!cat > bad_sites

^C


In [None]:
cat > testlist.txt
cat testlist.txt
for i in $(cat testlist.txt | head -n 4); do echo $i; done
for i in $(cat testlist.txt | head -n 4); do echo wget --exclude-domains=$(echo $(cat ../Charter-school-identities/bad_sites.txt  ) | tr ' ' ,) $i; done
for i in $(cat testlist.txt | head -n 4); do echo wget --exclude-domains=$(echo $(cat ../Charter-school-identities/bad_sites.txt  ) | tr ' ' ,) $i; echo; echo; done
for i in $(cat testlist.txt | head -n 4); do wget --exclude-domains=$(echo $(cat ../Charter-school-identities/bad_sites.txt  ) | tr ' ' ,) $i; done
ls -la
rm -f 500425-s 55362003.pdf franklin-benjamin-charter-school-mesa index.html
for i in $(cat testlist.txt | head -n 4); do wget --mirror --exclude-domains=$(echo $(cat ../Charter-school-identities/bad_sites.txt  ) | tr ' ' ,) $i; done
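
The shell loop above joins the bad-site list with commas and passes it to `wget --exclude-domains`. A rough Python equivalent of that command-building step is sketched below. It only constructs the argument list and does not run `wget`. The helper name `build_wget_command` is hypothetical, and the two flags are the ones already used in the history above.

```python
import shlex

def build_wget_command(url, excluded_domains, mirror=True):
    """Build (but do not run) a wget argument list that skips the given domains."""
    cmd = ["wget"]
    if mirror:
        cmd.append("--mirror")
    if excluded_domains:
        # wget expects a single comma-separated value for this option
        cmd.append("--exclude-domains=" + ",".join(excluded_domains))
    cmd.append(url)
    return cmd

demo_cmd = build_wget_command("http://www.sdccs.org/", ["yelp.com", "trulia.com"])
print(shlex.join(demo_cmd))  # shell-safe rendering of the argument list
```

Keeping the command as a list means it could later be handed to `subprocess.run` without any shell quoting concerns.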

## Reading in data

In [6]:
sample = [] # make empty list to store the dictionaries in
with open('../charter_URLs_Apr17.csv', 'r', encoding = 'Latin-1') as csvfile: # open file
    reader = csv.DictReader(csvfile) # create a reader
    for row in reader: # loop through rows
        sample.append(row) # append each row to the list

In [7]:
# Look at the second entry (index 1) in our sample (a list of dictionaries), plus its list of variables
print(sample[1]['SEARCH'], "\n", sample[1]["OLD_URL"], "\n")
print(sample[1].keys())

POLK STATE COLLEGE COLLEGIATE HIGH SCHOOL 3425 WINTER LK RD LAC1200, WINTER HAVEN, FL 33881 
 https://www.polk.edu/charter-high-schools/ 

dict_keys(['MANUAL_URL', 'SCH_NAME', 'STABR', 'SEARCH', 'NCESSCH', 'OLD_URL', 'ADDRESS'])


## Getting URLs

In [8]:
with open("api_key.txt") as keyfile:
    api_key = keyfile.read().strip()  # strip the trailing newline
# Do not echo the key: notebook output is saved and shared along with the code

In [9]:
os.system("python kg-api.py" + " " + "'BENJAMIN FRANKLIN CHARTER SCHOOL MESA'")
# try googling: BENJAMIN FRANKLIN CHARTER SCHOOL MESA 2345 NORTH HORNE, MESA, AZ

0

In [10]:
os.system("python kg-api.py" + " " + "Taylor Swift")

512
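
On Unix, `os.system` returns a wait status rather than an exit code, so `512` corresponds to exit code 2 (`512 >> 8`). The notebook does not show why `kg-api.py` failed. One plausible cause, offered only as a guess, is that the unquoted two-word query reached the script as two separate arguments. The sketch below shows how passing arguments as a list keeps a multi-word query intact. It uses a one-line stand-in child process because the interface of `kg-api.py` is not shown here.

```python
import subprocess
import sys

# Stand-in for kg-api.py: a tiny script that prints how many arguments it received.
child_code = "import sys; print(len(sys.argv) - 1)"

def count_args_via_list(query):
    """Run the stand-in child with the query as ONE list element; return the count it prints."""
    result = subprocess.run(
        [sys.executable, "-c", child_code, query],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())

print(count_args_via_list("Taylor Swift"))  # the query arrives as a single argument
print(512 >> 8)                             # exit code encoded in the Unix wait status
```

With the list form, no shell is involved, so spaces and quotes in school names need no escaping.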

Example of a Google Knowledge Graph API search URL (see [here](http://searchengineland.com/cool-tricks-hack-googles-knowledge-graph-results-featuring-donald-trump-268231)):

https://kgsearch.googleapis.com/v1/entities:search?query=Taylor+Swift&key={YOUR_API_KEY} 

https://kgsearch.googleapis.com/v1/entities:search?query=Taylor+Swift&indent=True&limit=5&key={YOUR_API_KEY}
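
The URLs above can be assembled programmatically rather than by hand. A minimal sketch, using only the endpoint and the `query`, `indent`, `limit`, and `key` parameters that appear in those examples, is shown below. The helper name `build_kg_url` is hypothetical, and the code builds the URL without sending a request.

```python
from urllib.parse import urlencode

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"

def build_kg_url(query, api_key, limit=5, indent=True):
    """Return a Knowledge Graph search URL with the query string properly encoded."""
    params = {"query": query, "indent": indent, "limit": limit, "key": api_key}
    return KG_ENDPOINT + "?" + urlencode(params)  # urlencode turns spaces into '+'

demo_url = build_kg_url("Taylor Swift", "YOUR_API_KEY")  # placeholder, not a real key
print(demo_url)
```

Letting `urlencode` handle the escaping matters for school names, which often contain commas, slashes, or ampersands.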

In [13]:
def getURL(search_terms, bad_sites_list, manual_url, known_urls):
    
    '''Find the single best URL for a school via a Google search of the
    school's name and address (stored in the SEARCH variable).
    Skip any result that matches one of the 61 bad_sites defined above
    (e.g. trulia.com, greatschools.org, mapquest.com).
    Return a tuple: the number of bad results skipped before the first acceptable
    one, plus either that acceptable URL or, when one exists, the already-documented manual_url.'''
    
    #print(os.system('python kg-api.py' + ' ' + search_terms))
    
    new_urls = []    # start with empty list
    good_url = ""    # output goes here
    k = 0    # initialize counter
    print("\nGetting URL for", search_terms)    # show school name & address
    
    # TO DO: Use KG-API here, search for school NAME only. 
    # Strict test: For each KG entity, check if address=school's address. If so, take that entity's URL as good_url.
    # Otherwise, use method below.
    
    new_urls = list(search(search_terms, num=20, pause=1, stop=10)) # grab first 10 Google results (URLs)
    
    # TO DO: Check output below for accuracy. If necessary, modify bad_sites_list and/or this method.
    
    # Loop through google search output to find first good result:
    for url in new_urls:
        if any(domain in url for domain in bad_sites_list):
            k+=1    # If this url is in bad_sites_list, add 1 to counter and move on
        else:
            good_url = url
            break    # Exit for loop after first good url is found
    
    #if k>2: # Print this warning if any bad sites have been detected (and deleted)
    #    print("WARNING!! CHECK THIS URL!: " + new_urls[0] + \
    #          "\n" + str(k) + " bad Google results have been omitted.")

    if k>1:
        print(str(k) + " bad Google results have been omitted. Check this URL!")
        
    elif k>0:
        print(str(k) + " bad Google result has been omitted. Check this URL!")
    
    else: 
        print("No bad sites detected. Reliable URL!")

        
    if manual_url != "":
        print("VALIDITY CHECK: Is the discovered URL of " + good_url + \
              " consistent with the known URL of " + manual_url + " ?")
        known_urls.append(manual_url)
        return(k, manual_url)
    
    elif good_url == "":
        print("WARNING: No good URL found via Google search; school is probably CLOSED!")
        return(k, good_url)
    
    else:
        known_urls.append(good_url)
        return(k, good_url)
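
The filter in `getURL` tests whether a bad domain appears anywhere in the URL string, so a short entry such as `har.com` would also match an unrelated host that merely ends in those letters. One possible refinement, sketched here as an assumption rather than as the notebook's method, compares against the parsed hostname instead. The helper name `is_bad_site` and the `example-char.com` URL are invented for illustration.

```python
from urllib.parse import urlparse

def is_bad_site(url, bad_sites_list):
    """True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in bad_sites_list)

demo_bad = ["har.com", "mapquest.com", "startclass.com"]

print(is_bad_site("https://www.mapquest.com/us/arizona/schools-mesa/x", demo_bad))  # listed domain
print(is_bad_site("http://public-schools.startclass.com/l/2200/x", demo_bad))       # subdomain of one
print(is_bad_site("http://www.example-char.com/", demo_bad))                        # unrelated host
# For comparison, the substring test flags the unrelated host as bad:
print(any(d in "http://www.example-char.com/" for d in demo_bad))
```

Inside `getURL`, the condition `any(domain in url for domain in bad_sites_list)` could be swapped for `is_bad_site(url, bad_sites_list)` if this stricter matching is wanted.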

In [17]:
numschools = 0 # initialize school counter
known_URLs = [] # initialize list of known URLs

for school in sample: # loop through list of schools
    numschools += 1
    school["NUM_BAD_URLS"], school["URL"] = getURL(school["SEARCH"], bad_sites, school["MANUAL_URL"], known_URLs)

print("\n\nURLs discovered for " + str(numschools) + " schools.")

#print("\nThe list of known URLs is now: \n" + str(known_URLs))


Getting URL for Richland Two Charter High 750 Old Clemson Road, Columbia, SC 29229
2 bad Google results have been omitted.
VALIDITY CHECK: Is the discovered URL of https://www.richland2.org/aec consistent with the known URL of https://www.richland2.org/charterhigh/ ?

Getting URL for POLK STATE COLLEGE COLLEGIATE HIGH SCHOOL 3425 WINTER LK RD LAC1200, WINTER HAVEN, FL 33881
2 bad Google results have been omitted.
VALIDITY CHECK: Is the discovered URL of http://www.ncsasports.org/football-recruiting/florida/winter-haven/polk-state-college-collegiate-high-school consistent with the known URL of https://www.polk.edu/lakeland-gateway-to-college-high-school/ ?

Getting URL for River City Scholars Charter Academy 944 Evergreen Street, Grand Rapids, MI 49507
No bad sites detected.
VALIDITY CHECK: Is the discovered URL of https://www.nhaschools.com/schools/rivercity consistent with the known URL of https://www.nhaschools.com/schools/rivercity/Pages/default.aspx ?

Getting URL for Detroit Ente

No bad sites detected.
VALIDITY CHECK: Is the discovered URL of https://trcsboone.org/contact-us/ consistent with the known URL of http://www.tworiverscommunityschool.net/ ?

Getting URL for TWO RIVERS COMMUNITY SCHOOL 195 CENTER DRIVE, GLENWOOD SPRINGS, CO 81601
No bad sites detected.
VALIDITY CHECK: Is the discovered URL of http://www.tworiverscs.org/ consistent with the known URL of http://www.tworiverscommunityschool.net/ ?

Getting URL for TWO DIMENSIONS/VICKERY 12330 VICKERY ST, HOUSTON, TX 77039
2 bad Google results have been omitted.
VALIDITY CHECK: Is the discovered URL of http://www.twodimensions.org/ consistent with the known URL of http://www.twodimensions.org/ ?

Getting URL for TEXAS LEADERSHIP OF MIDLAND 3300 THOMAS AVE, MIDLAND, TX 79703
No bad sites detected.
VALIDITY CHECK: Is the discovered URL of http://www.texasleadershipmidland.com/ consistent with the known URL of http://www.tlca-cl.com/ ?

Getting URL for TAOS INTEGRATED SCHOOL OF ARTS 123 MANZANARES ST, TAOS, N

No bad sites detected.
VALIDITY CHECK: Is the discovered URL of http://www.sequoiavillageschool.org/ consistent with the known URL of http://www.sequoiavillageschool.org/ ?

Getting URL for SEED PCS of Washington DC 4300 C St SE, Washington, DC 20019
No bad sites detected.
VALIDITY CHECK: Is the discovered URL of https://www.seedschooldc.org/ consistent with the known URL of http://www.seedschooldc.org/ ?

Getting URL for San Diego Cooperative Charter 7260 Linda Vista Rd., San Diego, CA 92111
No bad sites detected.
VALIDITY CHECK: Is the discovered URL of http://www.sdccs.org/ consistent with the known URL of http://www.sdccs.org/ ?

Getting URL for Sauvie Island Academy 14445 NW Charlton Rd, Portland, OR 97231
5 bad Google results have been omitted.
VALIDITY CHECK: Is the discovered URL of http://schools.oregonlive.com/school/Scappoose/Sauvie-Island-Elementary-School/ consistent with the known URL of http://www.sauvieislandacademy.org/ ?

Getting URL for Sanger Academy Charter 2207 Ni

UnboundLocalError: local variable 'good_url' referenced before assignment
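
The VALIDITY CHECK lines in the output above are meant to be read by eye. A first-pass automated comparison is sketched below, under the assumption that two URLs "agree" when their hostnames match after ignoring the scheme and a leading `www.`. The helper names are hypothetical. As the third example shows, this check cannot tell apart two different pages on the same district site, so it would complement manual review rather than replace it.

```python
from urllib.parse import urlparse

def normalized_host(url):
    """Lowercased hostname with any leading 'www.' removed ('' if the URL has no host)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def same_site(url_a, url_b):
    """True when both URLs have a host and the normalized hosts are equal."""
    host_a, host_b = normalized_host(url_a), normalized_host(url_b)
    return host_a != "" and host_a == host_b

# Pairs taken from the validity-check output above
print(same_site("https://www.seedschooldc.org/", "http://www.seedschooldc.org/"))
print(same_site("http://www.tworiverscs.org/", "http://www.tworiverscommunityschool.net/"))
print(same_site("https://www.richland2.org/aec", "https://www.richland2.org/charterhigh/"))
```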

In [31]:
# SAVE IT
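
The save step is left as a stub here. One way it might look, assuming the goal is to write the list of dictionaries back out as CSV with the new `URL` and `NUM_BAD_URLS` columns, is sketched below. The helper name `write_rows` and the demo rows are invented for illustration. The demo writes to an in-memory buffer so that it runs without touching disk.

```python
import csv
import io

def write_rows(rows, fileobj):
    """Write a list of dictionaries as CSV, using the union of their keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # preserve first-seen column order
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)

demo_rows = [
    {"SCH_NAME": "Demo School A", "URL": "http://www.sdccs.org/", "NUM_BAD_URLS": 0},
    {"SCH_NAME": "Demo School B", "URL": "", "NUM_BAD_URLS": 2},
]
buffer = io.StringIO()
write_rows(demo_rows, buffer)
print(buffer.getvalue())
```

To write a real file, the same function could be called inside `with open(path, "w", newline="", encoding="utf-8") as f:`, passing `sample` and `f`. The output path is left to the reader.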