# Introduction to web-scraping: Extra challenges

## Challenge: Scraping URLs with a blacklist

1. Use automated Google searching to get 10 URLs for Dr. David C. Walker Intermediate School, located at 6500 Ih 35 N Ste C, San Antonio, TX 78218.
2. Get the first result that doesn't match any domain on the blacklist in `../extra/blacklist_school_domains.csv`.
3. Get the first 10-20 quality results, that is, those that don't match any blacklisted domain.

### Part 1

In [15]:
# Import automated Google search package
from googlesearch import search

# Your solution here

# Define metadata
school_name = 'Dr. David C. Walker Intermediate School'
school_address = '6500 Ih 35 N Ste C, San Antonio, TX 78218'

# Automated search
for url in search(school_name + ' ' + school_address,
                  stop=10, pause=5.0):
    print(url)

https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
https://www.har.com/school/015806106/dr-david-c-walker-elementary-school
https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
https://www.usnews.com/education/k12/texas/dr-david-c-walker-el-206298
https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
https://nces.ed.gov/ccd/schoolsearch/school_detail.asp?ID=480006211404
https://www.dnb.com/business-directory/company-profiles.school_of_excellence_in_education.8fde8b90005cb3de714dd31c0d8e98f4.html
https://www.schooldigger.com/go/TX/schools/0006211404/school.aspx
https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
https://closelocation.com/find-school/dr-david-c-walker-elementary-school-school-in-basse-basse-16-11508-1216-80


### Part 2

In [22]:
# Define blacklisted domains to filter out: third-party domains/false positives that we DON'T want to scrape
blacklist = []
with open('../extra/blacklist_school_domains.csv', 'r', encoding='utf-8') as csvfile:
    for row in csvfile:
        domain = row.strip() # Drop the trailing newline and any stray whitespace
        if domain: # Skip blank lines: an empty string would match every URL
            blacklist.append(domain)

print(blacklist)

['high-schools.com', 'yelp.com', 'har.com', 'trulia.com', 'redfin.com', 'practutor.com', 'startclass.com', 'greatschools.org', 'greatschools.com', 'greatschools.net', 'paschoolperformance.org', 'worldcontactinfo.com', 'kula.com', 'mapquest.com', 'maps.net', 'google.com', 'facebook.com', 'zillow.com', 'manta.com', 'yellowpages.com', 'usnews.com', 'publicschoolreview.com', 'publicschoolreview.org', 'schooldigger.com', 'niche.com', 'privateschoolreview.com', 'cappex.com', 'collegeconfidential.com', 'tripsadvisor.com', 'groupon.com', 'school-ratings.com', 'superpages.com', 'onsaleph.com', 'psk12.com', 'schoolmatters.com', 'neighborhoodscout.com', 'localschooldirectory.com', 'publicschoolsk12.com', 'schooldatadirect.org', 'nces.ed.gov', 'cityrating.com', 'blogspot.com', 'public-schools.findthebest.com', 'twitter.com', 'zoominfo.com', 'jigsaw.com', 'hoovers.com', 'corporateinformation.com', 'doe.k12.ga.us', 'gradeschools.net', 'charterschoolratings.net', 'schools.net', 'insiderpages.com', 'p

In [23]:
# Your solution here

# Collect search results
urls = search(school_name + ' ' + school_address,
              stop=20, pause=5.0, num=20)
print("Successfully collected Google search results.")

# Initialize blacklist match counter: How many blacklisted domains has this search encountered?
blacklisted_num = 0 

# Loop through Google search output to find first good result:
good_url = None # Stays None if every result matches the blacklist
for url in urls:
    if any(bad_domain in url for bad_domain in blacklist):
        print(f'Bad site detected: {url}')
        blacklisted_num += 1 # Add one to blacklist match counter
    else:
        good_url = url
        print(f"Success! URL obtained by Google search with {blacklisted_num} bad URLs avoided.")
        break # Exit for loop after first good url is found

print(f'Quality URL: {good_url}') # Print the saved result, not the loop variable

Successfully collected Google search results.
Bad site detected: https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
Bad site detected: https://www.har.com/school/015806106/dr-david-c-walker-elementary-school
Bad site detected: https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
Bad site detected: https://www.usnews.com/education/k12/texas/dr-david-c-walker-el-206298
Bad site detected: https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
Bad site detected: https://nces.ed.gov/ccd/schoolsearch/school_detail.asp?ID=480006211404
Bad site detected: https://www.dnb.com/business-directory/company-profiles.school_of_excellence_in_education.8fde8b90005cb3de714dd31c0d8e98f4.html
Bad site detected: https://www.schooldigger.com/go/TX/schools/0006211404/school.aspx
Bad site detected: https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
Bad site detected:
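
The check above uses substring matching (`bad_domain in url`), which also fires when a blacklisted domain merely appears inside a longer hostname or in the path or query string. A minimal sketch of a stricter alternative, assuming we only want to match on the URL's hostname: parse the URL with the standard library and compare the host against each blacklisted domain, allowing subdomains such as `www.`. The helper name `is_blacklisted` and the `example-har.com` and `example.org` URLs are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse

def is_blacklisted(url, blacklist):
    """Return True if the URL's hostname equals a blacklisted domain or is a subdomain of one.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    host = urlparse(url).hostname or ''  # hostname is lowercased; None if the URL has no host
    return any(host == domain or host.endswith('.' + domain)
               for domain in blacklist)

# Small demonstration with a two-entry blacklist
demo_blacklist = ['har.com', 'nces.ed.gov']

# Hostname is a subdomain of a blacklisted domain
print(is_blacklisted('https://www.har.com/school/015806106/dr-david-c-walker-elementary-school',
                     demo_blacklist))  # True

# Hypothetical lookalike domain: contains 'har.com' as a substring but is a different host
print(is_blacklisted('https://www.example-har.com/', demo_blacklist))  # False

# Hypothetical URL where the blacklisted domain only appears in the query string
print(is_blacklisted('https://example.org/?ref=har.com', demo_blacklist))  # False
```

To try it in the loops in this notebook, the condition `any(bad_domain in url for bad_domain in blacklist)` could be swapped for `is_blacklisted(url, blacklist)`. Whether the stricter rule is preferable depends on how the blacklist entries were written.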

### Part 3

In [28]:
# Your solution here

# Define minimum number of quality results we want to see
min_number_results = 10 

# Initialize counter: How many blacklisted domains has this search encountered?
blacklisted_num = 0 

# Initialize URL lists: quality URLs and those that match blacklist
good_urls = []
bad_urls = []

# Initialize where to start the search and total number seen
start_num = 0
batched_num = 0

while len(good_urls) < min_number_results: # Get more results until we have 10 that are good quality
    
    start_num = batched_num # In case we need additional batches, start after the results already requested
    
    # Get batch of search results
    urls = search(school_name + ' ' + school_address,
                  start=start_num, stop=start_num + min_number_results, pause=5.0)
    print("Collected batch of Google search results.")
    
    batched_num += min_number_results # Add to number batched
    
    # Loop through urls and add to quality URL list anything not matching a blacklisted domain
    for url in urls:
        if any(bad_domain in url for bad_domain in blacklist): # Check if any blacklisted domain is in this url
            print(f'Bad site detected: {url}') 
            blacklisted_num += 1 # Add to counter
        elif url not in good_urls: # Don't add duplicates
            good_urls.append(url)
        
print(f'Success! Collected {len(good_urls)} quality Google search results and avoided {blacklisted_num} third-party URLs.')
print()

# Print each quality URL
print('Quality URLs:')
for url in good_urls:
    print(url)

Collected batch of Google search results.
Bad site detected: https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/
Bad site detected: https://www.har.com/school/015806106/dr-david-c-walker-elementary-school
Bad site detected: https://www.greatschools.org/texas/san-antonio/12035-Dr-David-C-Walker-Intermediate-School/
Bad site detected: https://www.usnews.com/education/k12/texas/dr-david-c-walker-el-206298
Bad site detected: https://www.publicschoolreview.com/dr-david-c-walker-elementary-school-profile
Bad site detected: https://nces.ed.gov/ccd/schoolsearch/school_detail.asp?ID=480006211404
Bad site detected: https://www.dnb.com/business-directory/company-profiles.school_of_excellence_in_education.8fde8b90005cb3de714dd31c0d8e98f4.html
Bad site detected: https://www.schooldigger.com/go/TX/schools/0006211404/school.aspx
Bad site detected: https://elementaryschools.org/directory/tx/cities/san-antonio/dr-david-c-walker-elementary/480006211404/
Bad site detected: htt
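
The Part 3 logic (skip blacklisted URLs, skip duplicates, stop once enough quality results are collected) can also be written as a lazy filter over any iterable of URLs, which makes it easy to test without querying Google. The following is a rough sketch under that assumption, not the notebook's own method: `filter_quality` and `first_n_quality` are hypothetical names, the `example.org` and `example.net` URLs are made up, and the substring match mirrors the check used in the cells above.

```python
from itertools import islice

def filter_quality(urls, blacklist):
    """Yield each URL once, skipping any that contains a blacklisted domain.

    Hypothetical helper; uses the same substring check as the notebook cells.
    """
    seen = set()
    for url in urls:
        if url in seen:  # Don't yield duplicates
            continue
        seen.add(url)
        if not any(bad_domain in url for bad_domain in blacklist):
            yield url

def first_n_quality(urls, blacklist, n):
    """Return up to n quality URLs, consuming only as much of `urls` as needed."""
    return list(islice(filter_quality(urls, blacklist), n))

# Offline demonstration: two blacklisted results, one duplicate, two quality results
sample_urls = [
    'https://www.niche.com/k12/dr-david-c-walker-intermediate-school-san-antonio-tx/',
    'https://school.example.org/',    # made-up quality result
    'https://school.example.org/',    # duplicate of the previous result
    'https://www.schooldigger.com/go/TX/schools/0006211404/school.aspx',
    'https://district.example.net/',  # made-up quality result
]
sample_blacklist = ['niche.com', 'schooldigger.com']

for url in first_n_quality(sample_urls, sample_blacklist, 2):
    print(url)
```

Because `islice` stops pulling from its input as soon as `n` items have been produced, passing a generator of search results instead of a list should avoid requesting more results than needed.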