# Introduction to web-scraping

It's 2019. The web is everywhere.

* If you want to buy a house, real estate agents have [websites](https://www.wendytlouie.com/) where they list the houses they're currently selling. 
* If you want to know whether to wear a rain jacket or shorts, you check the weather on a [website](https://weather.com/weather/tenday/l/Berkeley+CA+USCA0087:1:US). 
* If you want to know what's happening in the world, you read the news [online](https://www.sfchronicle.com/). 
* If you've forgotten which city is the capital of Australia, you check [Wikipedia](https://en.wikipedia.org/wiki/Australia).

**The point is this: there is an enormous amount of information (also known as data) on the web.**

If we (in our capacities as, for example, data scientists, social scientists, digital humanists, businesses, public servants or members of the public) can get our hands on this information, **we can answer all sorts of interesting questions or solve important problems**.

* Maybe you're studying gender bias in student evaluations of professors. One option would be to scrape ratings from [Rate My Professors](https://www.ratemyprofessors.com/) (provided you follow their [terms of service](https://www.ratemyprofessors.com/TermsOfUse_us.jsp#use)).
* Perhaps you want to build an app that shows users articles relating to their specified interests. You could scrape stories from various news websites and then use NLP methods to decide which articles to show which users.
* [Geoff Boeing](https://geoffboeing.com/) and [Paul Waddell](https://ced.berkeley.edu/ced/faculty-staff/paul-waddell) recently published [a great study](https://arxiv.org/pdf/1605.05397.pdf) of the US housing market by scraping millions of Craigslist rental listings. Among other insights, their study shows which metropolitan areas in the US are more or less affordable to renters.

This first day's workshop is a one-hour beginner's introduction to web scraping. 


## Learning Goals
*   

## Outline

* [How the web works](#mechanics)
* [Structured queries with APIs](#apis)
* [Domain collection with automated google search](#domain)
* [Downloading and mirroring with `wget`](#wget)
* [6](#6)
* [7](#7)
* [8](#8)
* [9](#9)
* [Terms of Service](#terms)

## Background

We will do some review, but this notebook assumes you have basic familiarity with Python. If you need a beginner's introduction to coding in Python, please walk through the intro to Python notebook at `solutions/intro-to-python.ipynb` and/or [this one](https://github.com/lknelson/text-analysis-course/blob/master/scripts/01.25.02_PythonBasics.ipynb) *before* the workshop. 

We will also use some regular expressions, which are character sequences defining a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations. Don't worry if you haven't seen these before--we will keep it simple. If you want to get more out of this session, first go through [this notebook on regular expressions](https://github.com/lknelson/text-analysis-course/blob/master/scripts/03.20.01_RegularExpressions.ipynb).
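
To make "find" and "find and replace" concrete, here is a minimal sketch using Python's built-in `re` module. The sample text and both patterns are made up for illustration; real email addresses and phone numbers need more careful patterns.

```python
import re

# Made-up text for illustration only.
text = "Contact us at info@example.org or help@example.org. Call 510-555-0100."

# "Find": collect every substring that looks like word@word.word
emails = re.findall(r"\w+@\w+\.\w+", text)

# "Find and replace": replace every run of digits with a '#'
masked = re.sub(r"\d+", "#", text)

print(emails)
print(masked)
```

`re.findall` returns a list of all matches, while `re.sub` returns a new string with each match replaced.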

## Vocabulary

* *domain*: 
    *  
* *web-scraping* (i.e., *screen-scraping*):
    * Extracting structured information from web pages, usually relying on their HTML or CSS formatting.
* *web-crawling*:
    * Finding web pages through links, automated search, etc. to download or mirror them.
* *downloading*:
    *  
* *mirroring*:
    *  
* *Application Programming Interface (API)*:
    *  

**__________________________________**


## How the web works<a id='mechanics'></a>

Here's our high-level description of the web.

**The internet is a bunch of computers connected together.** Some computers are laptops, some are desktops, some are smart phones, some are servers owned by companies. Each computer has its own address on the internet. Using these addresses, **one computer can ask another computer for some information (data). We say that the first computer sends a _request_ to the second computer, asking for some particular information. The second computer sends back a _response_**. The response could include the information requested, or it could be an error message. Perhaps the second computer doesn't have that information any more, or the first computer isn't allowed to access that information.

<img src='../assets/computer-network.png' />

We said that there is an enormous amount of information available on the web. When people put information on the web, they generally have two different audiences in mind, two different types of consumers of their information: humans and computers. If they want their information to be used primarily by humans, they'll make a website. This will let them lay out the information in a visually appealing way, choose colours, add pictures, and make the information interactive. If they want their information to be used by computers, they'll make a web API. A web API provides other computers structured access to their data. We only touch on APIs briefly in this workshop, but you should know that i) APIs are very common and ii) if there is an API for a website/data source, you should use that over web scraping. Many data sources that you might be interested in (e.g. social media sites) have APIs.

**Websites are just a bunch of files on one of those computers. They are just plain text files, so you can view them if you want. When you type in the address of a website in your browser, your computer sends a request to the computer located at that address. The request says "hey buddy, please send me the file(s) for this website". If everything goes well, the other computer will send back the file(s) in the response**. Every time you navigate to a new website or page in your browser, this process repeats.

<img src='../assets/request-response.png' />

**There are three main languages that website files are written in: HyperText Markup Language (HTML), Cascading Style Sheets (CSS) and JavaScript (JS)**. They normally have `.html`, `.css` and `.js` file extensions. Each language (and thus each type of file) serves a different purpose. **HTML files are the ones we care about the most, because they are the ones that contain the text you see on a web page**. CSS files contain the instructions on how to make the content in an HTML file visually appealing (all the colours, font sizes, border widths, etc.). JavaScript files have the instructions on how to make the information on a website interactive (things like changing colour when you click something, entering data in a form). In this workshop, we're going to focus on HTML.


**It's not too much of a simplification to say:**

\begin{equation}
\textrm{Web scraping} = \textrm{Making a request for an HTML file} + \textrm{Parsing the HTML response}
\end{equation}
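
The two halves of that equation can be sketched in a few lines of Python. This is a minimal illustration rather than the workshop's own scraper: it uses only the standard library (`urllib.request` and `html.parser`) so it runs without extra installs, whereas the import cell later in this notebook uses `requests` and `BeautifulSoup` for the same two steps. The class name, function names, and sample HTML are all made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleAndLinkParser(HTMLParser):
    """Collect the page title and every link target in an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url):
    """Step 1: make a request for an HTML file and return it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_html(html):
    """Step 2: parse the HTML response into a title and a list of links."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    return parser.title.strip(), parser.links


# A small stand-in for a downloaded page, so the example runs offline.
sample_html = """
<html>
  <head><title>Example page</title></head>
  <body><p>Hello!</p><a href="https://example.org/about">About</a></body>
</html>
"""
title, links = parse_html(sample_html)
print(title, links)
# To scrape a live page instead: parse_html(fetch_html("https://example.org/"))
```

Keeping the request and the parsing in separate functions makes it easy to test the parser on saved HTML without hitting anyone's server.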

## Existing data

**Before web scraping, see if you can get the same data elsewhere.** This will often be easier for you and preferred by the people who own the data.

For example, Wikipedia offers an [API](https://www.mediawiki.org/wiki/REST_API) to access their pages. In fact, Wikipedia would prefer that we access their data that way rather than web-scraping. There's even a [Python package](https://pypi.org/project/wikipedia/) that wraps around this API to make it even easier to use. Wikipedia also makes all of its content available for [direct download](https://dumps.wikimedia.org/). 
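
As a rough sketch of what using such an API looks like, the code below builds a request URL for a single page and (optionally) fetches it as JSON. The base URL and path layout are assumptions based on the MediaWiki REST API documentation linked above, so check that documentation before relying on them. The function names and the `User-Agent` string are made up for this example.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed base URL for the MediaWiki REST API on English Wikipedia.
REST_BASE = "https://en.wikipedia.org/w/rest.php/v1"


def page_url(title):
    """Build the REST URL for a page; spaces in titles become underscores."""
    return "{}/page/{}".format(REST_BASE, quote(title.replace(" ", "_"), safe=""))


def get_page(title):
    """Request a page as JSON (needs network access)."""
    request = Request(page_url(title), headers={"User-Agent": "scraping-workshop-example"})
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


print(page_url("Australia"))
# page = get_page("Australia")  # uncomment to make the real request
```

Compared with scraping the rendered article, the response is already structured, so there is no HTML to parse.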

Moreover, if you're affiliated with an institution, you may be breaching existing contracts by engaging in scraping. UC Berkeley's Library [recommends](http://guides.lib.berkeley.edu/text-mining) following this workflow:

<img src='../assets/workflow.png' />

## Terms of Service<a id='terms'></a>

As you've seen, web scraping involves making requests to other computers for their data. It costs people money to maintain the computers that we request data from: they need electricity, they require staff, sometimes they need upgrading, etc. But we didn't pay anyone for using their resources.

Because we're making these requests programmatically, we could make many, many requests per second. For example, we could put a request in a never-ending loop which would constantly request data from a server. But computers can't handle too much traffic, so eventually this might crash someone else's computer. Moreover, if we make too many requests when we're web scraping, that might restrict the number of people who can view the web page in their browser. This isn't very nice.
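
One simple way to avoid this is to pause between requests. The sketch below is an illustration, not a prescribed method: `fetch_politely` and its parameters are hypothetical names, and the right delay depends on the site. The `fetch` and `sleep` arguments are passed in so the example can run offline with stand-ins.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any function that takes a URL and returns a response.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request after the first
        results[url] = fetch(url)
    return results


# Offline demonstration with a stand-in fetch function and a recorded "sleep".
pauses = []
pages = fetch_politely(
    ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
    fetch=lambda url: "contents of " + url,
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(len(pages), "pages fetched with pauses of", pauses)
```

In real use you would pass something like `requests.get` as `fetch` and leave `sleep` at its default.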

Websites often have Terms of Service, documents that you agree to whenever you visit a site. Some of these terms prohibit web scraping, because it puts too much strain on their servers, or they just don't want their data accessed programmatically. Whatever the reason, we need to respect a website's Terms of Service. **Before you scrape a site, you should always check its terms of service to make sure it's allowed.**

# Structured queries with APIs<a id='apis'></a>

As an example, let's try out the [Google Fact Check API](https://developers.google.com/fact-check/tools/api/), which can be easily explored [in a browser](https://toolbox.google.com/factcheck/explorer). By searching this Google service, the script below (the Fact Identifier) collects fact checks relevant to queries entered by the user (or built in by default, as in the current version).

In [None]:
# Import libraries

import csv
from tqdm import tqdm
import requests # for downloading
from bs4 import BeautifulSoup, NavigableString, Tag # for html scraping
import regex as re # Regex module with Unicode support
import html5lib # slower but more accurate bs4 parser for messy HTML # lxml faster
import urllib.parse # for building query strings (urllib.parse.urlencode)
import json

# Import functions to scrape fact check web pages
from scrape_helpers import load_api_key, clean_text, scrape_politifact, scrape_factcheck, scrape_snopes

In [None]:
######################################################
# Call API
######################################################

# Elements in query response: text, claimDate, claimReview[publisher[name], url, textualRating]
# Columns in output CSV: date (DD-MM-YYYY), claim, truth rating, url, source (publisher), fact, explanation

if __name__ == '__main__':
    page_token = 0
    domains = ['covid', 'blm', 'election']
    query_sets = [
        ["masks", "Chinese bioweapon", "China virus"],
        ["George Floyd", "Antifa", "Black Lives Matter"],
        ["Hunter Biden", "rigged election", "mail-in", "election ballots"],
    ]

    api_key_fp = "api_key.txt"
    key = load_api_key(api_key_fp)
    endpoint = 'https://factchecktools.googleapis.com'
    search = '/v1alpha1/claims:search'

    sites = ['politifact.com', 'factcheck.org', 'snopes.com']
    site_scrapers = [scrape_politifact, scrape_factcheck, scrape_snopes]
    site_switches = ['politifact', 'factcheck.org', 'snopes']


    for i in range(0, len(domains)):
        domain = domains[i]
        queries = query_sets[i]
        for query in queries:
            claims = [] # initialize list of claims for this query (so claims from earlier queries are not re-scraped)
            urls = set() # initialize set of fact check URLs already seen for this query

            for site in tqdm(sites, desc='Collecting data for {} via API'.format(query)):
                params = {
                    'pageToken': page_token,
                    'query': query,
                    'reviewPublisherSiteFilter': site,
                    'key': key
                }

                nextToken = True
                while nextToken:
                    url = endpoint + search + '?' + urllib.parse.urlencode(params)
                    response = requests.get(url)
                    data = response.json()

                    if 'claims' in data:
                        for claim in data['claims']:
                            if not site == 'snopes.com':
                                claims.append([claim['claimDate'],
                                               claim['text'],
                                               claim['claimReview'][0]['textualRating'],
                                               claim['claimReview'][0]['url'],
                                               claim['claimReview'][0]['publisher']['name']])
                            else:
                                claims.append([claim['claimReview'][0]['reviewDate'],
                                               claim['text'],
                                               claim['claimReview'][0]['textualRating'],
                                               claim['claimReview'][0]['url'],
                                               claim['claimReview'][0]['publisher']['name']
                                              .replace('.com', '')])

                    if 'nextPageToken' in data:
                        params['pageToken'] = data['nextPageToken']
                    else:
                        nextToken = False

            for j in tqdm(range(0, len(claims)), desc='Scraping websites for {}'.format(query)):
                claim = claims[j]
                switch = site_switches.index(claim[4].lower()) # use fact to get publisher site then index (site name is 5th element)
                scraper = site_scrapers[switch] # get scraper using index
                claim.extend(scraper(claim[3])) # scrape URL using scraper (URL is 4th element), add to existing claim info
                claims[j] = claim # record fact in list

            # Duplicates are removed while writing, using the `urls` set below

            # Save output for this query
            query_string = query.replace(' ', '-')
            with open('data/fact_checker_data_{}.csv'.format(query_string), 'w', newline='', encoding='utf-8') as f:
                csv_writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_MINIMAL)
                csv_writer.writerow(['date', 'claim', 'truth_rating', 'url', 'source', 'fact', 'explanation'])
                for claim in claims: # Each claim gets its own row
                    if claim[3] not in urls: # don't add if fact check URL already seen
                        csv_writer.writerow(claim) # save row
                        urls.add(claim[3]) # add to set of urls already saved

            print('Saved {} claims for {} query.'.format(str(len(urls)), query))
            print()
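
The `nextPageToken` loop in the cell above is a common pattern for paginated APIs. Below is a minimal sketch of the same idea pulled out into a generator. `iter_claims` and `fetch_page` are hypothetical names, and the fake pages stand in for real API responses so the example runs without a key or network access.

```python
def iter_claims(fetch_page):
    """Yield every claim from a paginated API.

    `fetch_page(page_token)` must return a dict shaped like the responses
    handled above: an optional 'claims' list and an optional 'nextPageToken'.
    """
    page_token = None  # None means "first page"
    while True:
        data = fetch_page(page_token)
        for claim in data.get("claims", []):
            yield claim
        page_token = data.get("nextPageToken")
        if not page_token:
            break  # no more pages


# Stand-in for the real API: two pages of made-up claims.
fake_pages = {
    None: {"claims": [{"text": "claim A"}, {"text": "claim B"}], "nextPageToken": "p2"},
    "p2": {"claims": [{"text": "claim C"}]},
}
all_claims = list(iter_claims(lambda token: fake_pages[token]))
print([claim["text"] for claim in all_claims])
```

Separating pagination from the request itself keeps the loop easy to test and reuse across queries.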

# Domain collection with automated google search<a id='domain'></a>

This script uses two related functions to scrape the best URL from online sources: 
> The Google Places API. See the [GitHub page](https://github.com/slimkrazy/python-google-places) for the Python wrapper and sample code, [Google Web Services](https://developers.google.com/places/web-service/) for general documentation, and [here](https://developers.google.com/places/web-service/details) for details on Place Details requests.

> The Google Search function (manually filtered). See [here](https://pypi.python.org/pypi/google) for source code and [here](http://pythonhosted.org/google/) for documentation.

To get an API key for the Google Places API (or Knowledge Graph API), go to the [Google API Console](http://code.google.com/apis/console).
To upgrade your quota limits, sign up for billing--it's free and raises your daily request quota from 1K to 150K (!!).

The code below doesn't use Google's Knowledge Graph (KG) Search API because it turns out NOT to reveal the websites related to search results (despite these being displayed in the KG cards visible at right in a standard Google search). The KG API is only useful for scraping the KG id, description, name, and other basic (here irrelevant) info. To see examples of how the KG API constructs a search URL, etc., see [here](http://searchengineland.com/cool-tricks-hack-googles-knowledge-graph-results-featuring-donald-trump-268231).

Possibly useful note on debugging: An issue causing the GooglePlaces package to unnecessarily give a "ValueError" and stop was resolved in [July 2017](https://github.com/slimkrazy/python-google-places/issues/59). <br>
Other instances of this error may occur if Google Places API cannot identify a location as given. Dealing with this is a matter of proper Exception handling (which seems to be working fine below).

In [None]:
!pip install google # For automated Google searching 
!pip install https://github.com/slimkrazy/python-google-places/zipball/master # Google Places API

## Define helper functions

In [None]:
def dicts_to_csv(list_of_dicts, file_name, header):
    '''This helper function writes a list of dictionaries to a csv called file_name, with column names decided by 'header'.'''
    
    with open(file_name, 'w', newline='', encoding='utf-8') as output_file:
        print("Saving to " + str(file_name) + " ...")
        dict_writer = csv.DictWriter(output_file, header)
        dict_writer.writeheader()
        dict_writer.writerows(list_of_dicts)

In [None]:
def count_left(list_of_dicts, varname):
    '''This helper function determines how many dicts in list_of_dicts don't have a valid key/value pair with key varname.'''
    
    count = 0
    for school in list_of_dicts:
        if school.get(varname) in ("", None):  # key missing, empty, or None
            count += 1

    print(str(count) + " schools in this data are missing " + str(varname) + "s.")

# count_left(sample, 'URL')  # example call; run this only after `sample` has been loaded below

## Initialize Python search environment

In [None]:
# IMPORTING KEY PACKAGES
from googlesearch import search # automated Google Search package
from googleplaces import GooglePlaces, types, lang  # Google Places API

import csv, re, os  # Standard packages
import pandas as pd  # for working with csv files
import urllib, requests  # for scraping
from tqdm import tqdm # for progress tracking in for loops

In [None]:
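# Initializing Google Places API search functionality
with open("../data/places_api_key.txt") as key_file:
    places_api_key = key_file.read().strip()  # strip the trailing newline
# Avoid printing the key itself: notebook output is easy to share by accident
print("Loaded Places API key ({} characters)".format(len(places_api_key)))

google_places = GooglePlaces(places_api_key)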

In [None]:
# Here's a list of sites we DON'T want to spider, 
# but that an automated Google search might return...
# and we might thus accidentally spider unless we filter them out (as below)!

bad_sites = []
with open('../data/bad_sites.csv', 'r', encoding = 'utf-8') as csvfile:
    for row in csvfile:
        bad_sites.append(re.sub('\n', '', row))

print(bad_sites)

In [None]:
# See the Google Places API wrapper at work!
school_name = "River City Scholars Charter Academy"
address = "944 Evergreen Street, Grand Rapids, MI 49507"

query_result = google_places.nearby_search(
        location=address, name=school_name,
        radius=15000, types=[types.TYPE_SCHOOL], rankby='distance')

for place in query_result.places:
    print(place.name)
    place.get_details()  # makes further API call
    #print(place.details) # A dict matching the JSON response from Google.
    print(place.website)
    print(place.formatted_address)

# Are there any additional pages of results?
if query_result.has_next_page_token:
    query_result_next_page = google_places.nearby_search(
            pagetoken=query_result.next_page_token)

In [None]:
# Example of using the google search function:
for url in search('DR DAVID C WALKER INT 6500 IH 35 N STE C, SAN ANTONIO, TX 78218', \
                  stop=5, pause=5.0):
    print(url)

## Read in data

In [None]:
sample = []  # make empty list in which to store the dictionaries

if os.path.exists('../data/filtered_schools.csv'):  # first, check if file containing search results is available on disk
    file_path = '../data/filtered_schools.csv'
else:  # use original data if no existing results are available on disk
    file_path = '../../data_management/data/charters_unscraped_noURL_2015.csv'

with open(file_path, 'r', encoding = 'utf-8') as csvfile: # open file                      
    print('Reading in ' + str(file_path) + ' ...')
    reader = csv.DictReader(csvfile)  # create a reader
    for row in reader:  # loop through rows
        sample.append(row)  # append each row to the list

print("\nColumns in data: ")
print(list(sample[0]))
sample = sample[0:5]
sample

In [None]:
# Create new "URL" and "QUERY_RANKING" variables for each school, without overwriting any with data there already:
for school in sample:
    school.setdefault("URL", "")  # only adds the key if it is missing
    school.setdefault("QUERY_RANKING", "")

In [None]:
#### Take a look at one entry's contents (here the second, index 1) and the variables list in our sample (a list of dictionaries)
print(sample[1]["SCH_NAME"], "\n", sample[1]["ADDRESSES"], "\n", sample[1]["NCESSCH"], "\n")
print(sample[1].keys())

## Getting URLs

In [None]:
def getURL(school_name, address, bad_sites_list): # manual_url
    
    '''This function finds the one best URL for a school using two methods:
    
    1. If a school with this name can be found within 15 km (to account for proximal relocations) in
    the Google Maps database (using the Google Places API), AND
    if this school has a website on record, then this website is returned.
    If no school is found, the school discovered has missing data in Google's database (latitude/longitude, 
    address, etc.), or the address on record is unreadable, this passes to method #2. 
    
    2. An automated Google search using the school's name + address. This is an essential backup plan to 
    Google Places API, because sometimes the address on record (courtesy of Dept. of Ed. and our tax dollars) is not 
    in Google's database. For example, look at: "3520 Central Pkwy Ste 143 Mezz, Cincinnati, OH 45223". 
    No wonder Google Maps can't find this. How could it intelligibly interpret "Mezz"?
    
    Whether using the first or second method, this function excludes URLs with any of the 62 bad_sites defined above, 
    e.g. trulia.com, greatschools.org, mapquest. It returns the number of excluded URLs (from either method) 
    and the first non-bad URL discovered.'''
    
    
    ## INITIALIZE
    
    new_urls = []    # start with empty list
    good_url = ""    # output goes here
    k = 0    # initialize counter for number of URLs skipped
    
    radsearch = 15000  # define radius of Google Places API search, in meters (15 km)
    numgoo = 20  # define number of google results to collect for method #2
    wait_time = 20.0  # define length of pause between Google searches (longer is better for big catches like this)
    
    search_terms = school_name + " " + address
    print("Getting URL for " + school_name + ", " + address + "...")    # show school name & address
    
    
    
    ## FIRST URL-SCRAPE ATTEMPT: GOOGLE PLACES API
    # Search for nearest school with this name within radsearch meters of this address
    
    try:
        query_result = google_places.nearby_search(
            location=address, name=school_name,
            radius=radsearch, types=[types.TYPE_SCHOOL], rankby='distance')
        
        for place in query_result.places:
            place.get_details()  # Make further API call to get detailed info on this place

            found_name = place.name  # Compare this name in Places API to school's name on file
            found_address = place.formatted_address  # Compare this address in Places API to address on file

            try: 
                url = place.website  # Grab school URL from Google Places API, if it's there

                if any(domain in url for domain in bad_sites_list):
                    k+=1    # If this url is in bad_sites_list, add 1 to counter and move on
                    #print("  URL in Google Places API is a bad site. Moving on.")

                else:
                    good_url = url
                    print("    Success! URL obtained from Google Places API with " + str(k) + " bad URLs avoided.")
                    
                    '''
                    # For testing/ debugging purposes:
                    
                    print("  VALIDITY CHECK: Is the discovered URL of " + good_url + \
                          " consistent with the known URL of " + manual_url + " ?")
                    print("  Also, is the discovered name + address of " + found_name + " " + found_address + \
                          " consistent with the known name/address of: " + search_terms + " ?")
                    
                    if manual_url != "":
                        if manual_url == good_url:
                            print("    Awesome! The known and discovered URLs are the SAME!")
                    '''
                            
                    return(k, good_url)  # Returns valid URL of the Place discovered in Google Places API
        
            except Exception:  # No URL in the Google database? Then try next API result or move on to Google searching.
                print("  Error collecting URL from Google Places API. Moving on.")
                pass
    
    except Exception:
        print("  Google Places API search failed. Moving on to Google search.")
        pass
    
    

    ## SECOND URL-SCRAPE ATTEMPT: FILTERED GOOGLE SEARCH
    # Automate Google search and take first result that doesn't have a bad_sites_list element in it.
    
    
    # Loop through google search output to find first good result:
    try:
        new_urls = list(search(search_terms, stop=numgoo, pause=wait_time))  # Grab first numgoo Google results (URLs)
        print("  Successfully collected Google search results.")
        
        for url in new_urls:
            if any(domain in url for domain in bad_sites_list):
                k+=1    # If this url is in bad_sites_list, add 1 to counter and move on
                #print("  Bad site detected. Moving on.")
            else:
                good_url = url
                print("    Success! URL obtained by Google search with " + str(k) + " bad URLs avoided.")
                break    # Exit for loop after first good url is found
                
    
    except Exception:
        print("  Problem with collecting Google search results. Try this by hand instead.")
            
        
    '''
    # For testing/ debugging purposes:
    
    if k>2:  # Print warning messages depending on number of bad sites preceding good_url
        print("  WARNING!! CHECK THIS URL!: " + good_url + \
              "\n" + str(k) + " bad Google results have been omitted.")
    if k>1:
        print(str(k) + " bad Google results have been omitted. Check this URL!")
    elif k>0:
        print(str(k) + " bad Google result has been omitted. Check this URL!")
    else: 
        print("  No bad sites detected. Reliable URL!")
    
    if manual_url != "":
        if manual_url == good_url:
            print("    Awesome! The known and discovered URLs are the SAME!")
    '''
    
    if good_url == "":
        print("  WARNING! No good URL found via API or google search.\n")
    
    return(k + 1, good_url)
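
The filtering step at the heart of `getURL` can be sketched on its own, without any Google calls. The version below is an illustration under stated assumptions, not the function used above: `first_good_url` is a hypothetical name, the search results are made up, and it compares host names rather than using the substring test in `getURL`. It therefore assumes each entry in `bad_sites` is a full domain such as `trulia.com`.

```python
from urllib.parse import urlparse


def first_good_url(urls, bad_sites):
    """Return (rank, url) for the first URL whose host is not a bad site.

    rank is 1 for the first result; returns (None, "") if every URL is filtered out.
    """
    for rank, url in enumerate(urls, start=1):
        host = urlparse(url).netloc.lower()
        # A host is bad if it equals a bad domain or is a subdomain of one.
        if not any(host == bad or host.endswith("." + bad) for bad in bad_sites):
            return rank, url
    return None, ""


# Made-up search results for illustration.
results = [
    "https://www.greatschools.org/michigan/some-school",
    "https://www.trulia.com/schools/some-school",
    "https://www.exampleschool.org/",
]
print(first_good_url(results, ["greatschools.org", "trulia.com"]))
```

Matching on the host avoids false positives when a bad domain merely appears somewhere in a URL's path or query string.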

In [None]:
numschools = 0  # initialize scraping counter

keys = sample[0].keys()  # define keys for writing function
fname = "../data/final_schools.csv"  # define file name for writing function

In [None]:
for school in sample[:2]:
    print(school["URL"])

In [None]:
# Now call the function above to find a URL for each school that lacks one.
for school in tqdm(sample): # loop through list of schools (sample)
    if not school["URL"]:  # if URL is missing or empty, fill it in by searching
        numschools += 1
        school["QUERY_RANKING"], school["URL"] = getURL(school["SCH_NAME"], school["ADDRESSES"], bad_sites) # school["MANUAL_URL"]
    # If a URL already exists, don't bother searching for it again

print("\n\nURL search run for " + str(numschools) + " schools.")

In summer 2017, the above approach worked to get a good URL for 6,677 out of the 6,752 schools in this data set. Not bad! <br>
For some reason, the Google search algorithm (method #2) is less likely to work after passing from the Google Places API. <br>
To fill in for the remaining 75, let's skip the function's layers of code and just call the google search function by hand.

In [None]:
for school in sample:
    school["SEARCH"] = school["SCH_NAME"] + " " + school["ADDRESSES"]
    if school["URL"] == "":
        k = 0  # initialize counter for number of URLs skipped
        school["QUERY_RANKING"] = ""

        
        print("Scraping URL for " + school["SEARCH"] + "...")
        urls_list = list(search(school["SEARCH"], stop=20, pause=10.0))
        print("  URLs list collected successfully!")

        for url in urls_list:
            if any(domain in url for domain in bad_sites):
                k+=1    # If this url is in bad_sites_list, add 1 to counter and move on
                # print("  Bad site detected. Moving on.")
            else:
                good_url = url
                print("    Success! URL obtained by Google search with " + str(k) + " bad URLs avoided.")

                school["URL"] = good_url
                school["QUERY_RANKING"] = k + 1
                
                count_left(sample, 'URL')
                dicts_to_csv(sample, fname, keys)
                print()
                break    # Exit for loop after first good url is found                               
                                           
    else:
        pass

count_left(sample, 'URL')
dicts_to_csv(sample, fname, keys)

In [None]:
# Save sample to file (can continue to load and add to it):
count_left(sample, 'URL')
dicts_to_csv(sample, fname, keys)

In [None]:
# CHECK OUT RESULTS
# TO DO: Make a histogram of 'QUERY_RANKING'
# systematic way to look at problem URLs (with k > 0)?

f = 0
for school in sample:
    ranking = school['QUERY_RANKING']  # rank of the first good URL; "" if none was found
    if ranking != "" and int(ranking) > 14:
        print(school["SEARCH"], "\n", school["URL"], "\n")
        f += 1

print(str(f))

# Downloading and mirroring with `wget`<a id='wget'></a>

### Limitations

- `wget` only works for static HTML: it does not run JavaScript, so any element generated by JS will not be captured.

More info:

https://www.petekeen.net/archiving-websites-with-wget

http://askubuntu.com/questions/411540/how-to-get-wget-to-download-exact-same-web-page-html-as-browser

https://www.reddit.com/r/linuxquestions/comments/3tb7vu/wget_specify_dns_server/ (on the DNS error `failed: nodename nor servname provided, or not known.`)

In [None]:
#---------------------------------------------------------------
#Define most general wget parameters (more specific params below)
#This list would not be so long if Parallel would allow wget to read from /usr/local/etc/wgetrc
wget_general_options = '--no-parent --level 7 --no-check-certificate \
--recursive --adjust-extension --convert-links --page-requisites --wait=2 --random-wait \
-e robots=off --follow-ftp --secure-protocol=auto --retry-connrefused --tries=12 --no-remove-listing \
--local-encoding=UTF-8 --no-cookies --default-page=default --server-response --trust-server-names \
--header="Accept:text/html" --exclude-directories=' + exclude_dirs + reject_files
#---------------------------------------------------------------

#Other options:
#--verbose --convert-file-only --force-directories --show-progress 
#--user-agent="Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"

In [None]:
# Some of these options explained: 
'''
--warc-file turns on WARC output to the specified file
--warc-cdx tells wget to dump out an index file for our new WARC file
--page-requisites will grab all of the linked resources necessary to render the page (images, css, javascript, etc)
--adjust-extension appends .html to the files when appropriate
--convert-links will turn links into local links as appropriate
--execute robots=off turns off wget's automatic robots.txt checking
--exclude-directories takes a comma-separated list of directories that wget should exclude from the archive
--user-agent overrides wget's default user agent
--random-wait varies the pause between requests from 0.5 to 1.5 times the --wait value (1 to 3 seconds with --wait=2, as above)
'''
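
The long option string above can also be kept as a Python list, which makes individual flags easier to comment on and avoids shell-quoting problems. The sketch below is one possible way to do that. `build_wget_command`, `random_wait_range` and the example URL and directories are hypothetical, not part of the original notebook, and the call that would actually launch wget is left as a comment.

```python
import shlex

def build_wget_command(url, wait=2, level=7, exclude_dirs=()):
    """Assemble a wget command as a list of arguments (hypothetical helper)."""
    cmd = [
        "wget",
        "--recursive", "--no-parent", "--level={}".format(level),
        "--page-requisites", "--adjust-extension", "--convert-links",
        "--wait={}".format(wait), "--random-wait",
        "--execute", "robots=off",
    ]
    if exclude_dirs:
        # wget expects a single comma-separated list here
        cmd.append("--exclude-directories=" + ",".join(exclude_dirs))
    cmd.append(url)
    return cmd

def random_wait_range(wait):
    """Range of delays (in seconds) that --random-wait draws from for a given --wait."""
    return 0.5 * wait, 1.5 * wait

cmd = build_wget_command("https://example.org/", exclude_dirs=["/calendar", "/login"])
print(shlex.join(cmd))        # printable, shell-quoted form of the command
print(random_wait_range(2))   # (1.0, 3.0)
# To run it: import subprocess; subprocess.run(cmd)
```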

In [1]:
# import necessary libraries
import os, csv
import shutil
import urllib.error  # provides urllib.error.URLError and urllib.error.HTTPError
from urllib.request import urlopen
from socket import error as SocketError
import errno

In [2]:
#setting directories
micro_sample_cvs = "/Users/anhnguyen/Desktop/research/scraping_Python/micro-sample_Feb17.csv"
wget_folder = "/Users/anhnguyen/Desktop/research/scraping_Python/wget_accept"
no_dir_folder = "/Users/anhnguyen/Desktop/research/scraping_Python/no_dir"
learning_wget = "/Users/anhnguyen/Desktop/research/scraping_Python/learning_wget"

In [3]:
sample = [] # make empty list
with open(micro_sample_cvs, 'r', encoding = 'Windows-1252')\
as csvfile: # open file; the windows-1252 encoding looks weird but works for this
    reader = csv.DictReader(csvfile) # create a reader
    for row in reader: # loop through rows
        sample.append(row) # append each row to the list
        
# note: each row, sample[i], is a dictionary mapping column names to that row's values
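
To make the row structure concrete, here is a small self-contained sketch that runs `csv.DictReader` over an in-memory CSV instead of the real sample file. The two rows are made up for illustration; only the column names `URL`, `SCHNAM` and `ADDRESS` come from the notebook.

```python
import csv
import io

# Hypothetical stand-in for the sample CSV (the values are invented).
fake_csv = io.StringIO(
    "SCHNAM,URL,ADDRESS\n"
    "EXAMPLE CHARTER HIGH,https://example.org/high/,1 Main St\n"
    "SAMPLE ACADEMY,https://example.com/academy/,2 Oak Ave\n"
)

rows = list(csv.DictReader(fake_csv))  # one dict per data row

print(len(rows))           # 2
print(rows[0]["SCHNAM"])   # EXAMPLE CHARTER HIGH
print(sorted(rows[1]))     # ['ADDRESS', 'SCHNAM', 'URL']
```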

In [4]:
# turning this into tuples we can use with wget!
# first, make some empty lists
url_list = []
name_list = []
terms_list = []

# now let's fill these lists with content from the sample
for school in sample:
    url_list.append(school["URL"])
    name_list.append(school["SCHNAM"])
    terms_list.append(school["ADDRESS"])

In [5]:
tuple_list = list(zip(url_list, name_list))
# Let's check what these tuples look like:
print(tuple_list[:3])
print("\n", tuple_list[1][1].title())

[('https://www.richland2.org/charterhigh/', 'RICHLAND TWO CHARTER HIGH'), ('https://www.polk.edu/lakeland-gateway-to-college-high-school/', 'POLK STATE COLLEGE COLLEGIATE HIGH SCHOOL'), ('https://www.nhaschools.com/schools/rivercity/Pages/default.aspx', 'RIVER CITY SCHOLARS CHARTER ACADEMY')]

 Polk State College Collegiate High School
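
The same (URL, name) tuples can also be built in one step with a list comprehension. This is only an alternative sketch; `sample_rows` is a hypothetical two-row stand-in for `sample`, reusing two entries from the output above.

```python
# Hypothetical stand-in for `sample`: a list of dicts as produced by csv.DictReader.
sample_rows = [
    {"URL": "https://www.richland2.org/charterhigh/",
     "SCHNAM": "RICHLAND TWO CHARTER HIGH"},
    {"URL": "https://www.polk.edu/lakeland-gateway-to-college-high-school/",
     "SCHNAM": "POLK STATE COLLEGE COLLEGIATE HIGH SCHOOL"},
]

# Build the (url, name) pairs directly, without intermediate lists or zip().
pairs = [(row["URL"], row["SCHNAM"]) for row in sample_rows]

print(pairs[1][1].title())  # Polk State College Collegiate High School
```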


### Helper Functions

In [27]:
def get_parent_link(link):
    """Return the nearest valid parent of link, or link itself if no parent is reachable."""
    ls = get_parent_link_helper(5, link, [])
    if len(ls) > 1:
        return ls[1]  # ls[0] is the link itself, so ls[1] is taken as its parent
    return link

def get_parent_link_helper(level, link, result):
    """This is a tail recursive function that collects link and its parents
    (at most `level` URLs) while each one is reachable. Return a list of urls."""
    if level == 0 or '/' not in link or not check(link):
        return result
    else:
        result += [link]
        return get_parent_link_helper(level - 1, link[: link.rindex('/')], result)
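
`get_parent_link` tests every candidate over the network. To see which parents a URL has in the first place, the path can be shortened one segment at a time with `urllib.parse`. The sketch below does only that string work and makes no requests; `parent_urls` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlsplit, urlunsplit

def parent_urls(url, max_levels=5):
    """Return up to max_levels parent URLs of url, nearest parent first."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]  # non-empty path segments
    parents = []
    while segments and len(parents) < max_levels:
        segments.pop()  # drop the last path segment
        path = "/" + "/".join(segments) + ("/" if segments else "")
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents

for parent in parent_urls("https://www.nhaschools.com/schools/rivercity/Pages/default.aspx"):
    print(parent)
```

Each printed URL could then be passed to `check` to find the nearest parent that actually responds.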

In [25]:
def format_folder_name (k, name):
    """Format a folder nicely for easy access"""
    if k < 10: # Add two zeros to the folder name if k is less than 10 (for ease of organizing the output folders)
        dirname = "00" + str(k) + " " + name
    elif k < 100: # Add one zero if k is less than 100
        dirname = "0" + str(k) + " " + name
    else: # Add nothing if k >= 100
        dirname = str(k) + " " + name
    return dirname

def run_wget_command(link, parent_folder, my_folder):
    """wget on link and print output to appropriate folders"""
    #navigate to parent folder
    os.chdir(parent_folder)
    # create dir my_folder if it doesn't exist yet
    if not os.path.exists(my_folder):
        os.makedirs(my_folder)
    #navigate to the correct folder, ready to wget
    os.chdir(my_folder)
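    # note: no space after --referer=, otherwise the parent link is treated as a second URL to download
    os.system('wget --header="Accept: text/html" -r --level=3 --accept .html --referer=' + get_parent_link(link) + ' ' + link)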
#     os.system('wget -np --no-parent --show-progress --progress=dot --recursive --level=3 --convert-links --retry-connrefused \
#          --random-wait --no-cookies --secure-protocol=auto --no-check-certificate --execute robots=off \
#          --header "Accept: text/html" \
#          --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" \
#           --accept .html' + ' ' + link)
    

def contains_html(my_folder):
    """Check whether a wget run succeeded by looking for at least one .html file under my_folder"""

    for r,d,f in os.walk(my_folder):
        for file in f:
            if file.endswith('.html'):
                return True
    return False

def count_with_file_ext(folder, ext):
    """Count the files under folder (recursively) whose names end with ext"""
    count = 0
    for r,d,f in os.walk(folder):
        for file in f:
            if file.endswith(ext):
                count +=1
    return count 

# write a file and add num line at the beginning of line
def write_to_file(num, link, file_name):
    with open(file_name, "a") as text_file:
        text_file.write(str(num) + "\t" + link +"\n")

# just write str to file
def write_file(str, file_name):
    with open(file_name, "a") as text_file:
        text_file.write(str)
        
def reset(folder, text_file_1, text_file_2):
    """Deletes all files in a folder and sets 2 text files to blank"""
    shutil.rmtree(folder)
    os.makedirs(folder)
    filelist = [ f for f in os.listdir(folder) if f.endswith(".bak") ]
    for f in filelist:
        os.unlink(os.path.join(folder, f))  # listdir returns bare names, so join them with folder
    for file_name in [text_file_1, text_file_2]:
        reset_text_file(file_name)
        
def reset_text_file(file_name):
    """Empty file_name if it already exists"""
    if os.path.exists(file_name):
        with open(file_name, "w") as text_file:
            text_file.write("")
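
`run_wget_command` changes the working directory with `os.chdir` and hands one long string to the shell. A possible alternative, sketched below under the assumption that `wget` is on the PATH, passes the arguments as a list and uses `cwd=` so the notebook's own working directory is left alone. The name `run_wget` and the `dry_run` flag are hypothetical.

```python
import os
import subprocess
import tempfile

def run_wget(link, parent_folder, my_folder, referer=None, dry_run=False):
    """Run wget for link inside parent_folder/my_folder (hypothetical sketch)."""
    target = os.path.join(parent_folder, my_folder)
    os.makedirs(target, exist_ok=True)  # create the output folder if needed
    cmd = ["wget", "--header=Accept: text/html", "-r", "--level=3", "--accept", ".html"]
    if referer:
        cmd.append("--referer=" + referer)
    cmd.append(link)
    if dry_run:
        return cmd  # only show what would be executed
    # cwd= makes wget write into target without calling os.chdir
    return subprocess.run(cmd, cwd=target).returncode

with tempfile.TemporaryDirectory() as tmp:
    cmd = run_wget("https://example.org/", tmp, "001 Example School", dry_run=True)
    folder_created = os.path.isdir(os.path.join(tmp, "001 Example School"))

print(cmd)
print(folder_created)  # True
```

Because the arguments never pass through a shell, characters such as `#`, `!` or spaces in a link or folder name need no extra quoting.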

In [7]:
#testing methods
print(format_folder_name(30, "name me"))



030 name me
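
The three-way branch in `format_folder_name` can also be written with a format specification, which pads the number to three digits in one step. A minimal sketch (the name `format_folder_name_short` is hypothetical):

```python
def format_folder_name_short(k, name):
    """Zero-pad k to three digits and put it in front of name."""
    return "{:03d} {}".format(k, name)

print(format_folder_name_short(7, "name me"))    # 007 name me
print(format_folder_name_short(30, "name me"))   # 030 name me
print(format_folder_name_short(250, "name me"))  # 250 name me
```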


In [11]:
def check(url):
    """ Helper function, check if url is a valid link"""
    try:
        urlopen(url)
        
    except urllib.error.HTTPError:  # subclass of URLError, so it must be caught first
        print('urllib.error.HTTPError')
        return False
    except urllib.error.URLError:
        print("urllib.error.URLError")
        return False
    except SocketError:
        print('SocketError')
        return False
    return True


def read_txt(txt_file):
    """Read a file of 'num<TAB>link' lines; return (list of links, number of lines)."""
    links = []
    count = 0
    with open(txt_file) as f:
        for line in f:
            elem = line.split('\t')[1].rstrip()  # keep only the link after the tab
            count += 1
            links += [elem]
    return links, count

def read_txt_2(txt_file):
    """Read a file with one link per line; return (list of links, number of lines)."""
    links = []
    count = 0
    with open(txt_file) as f:
        for line in f:
            count += 1
            links += [line.rstrip()]
    return links, count
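
Because `urllib.error.HTTPError` is a subclass of `urllib.error.URLError`, the order in which the two are tested decides which message `check` prints. The sketch below illustrates this without touching the network by classifying hand-made exception objects. `classify_error` is a hypothetical helper and the example URL is a placeholder.

```python
import urllib.error

def classify_error(exc):
    """Name the most specific category an exception raised by urlopen falls into."""
    if isinstance(exc, urllib.error.HTTPError):   # must be tested before URLError
        return "HTTP error {}".format(exc.code)
    if isinstance(exc, urllib.error.URLError):
        return "URL error: {}".format(exc.reason)
    if isinstance(exc, OSError):                  # socket.error is an alias of OSError
        return "socket error"
    return "other"

http_exc = urllib.error.HTTPError("http://example.org/", 404, "Not Found", None, None)
url_exc = urllib.error.URLError("nodename nor servname provided, or not known")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(classify_error(http_exc))  # HTTP error 404
print(classify_error(url_exc))
```

Reporting the status code or the reason, rather than only the exception name, can help distinguish a dead domain from a server that merely rejects the request.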

### Running wget

In [8]:
# set up file directories
success_file = "/Users/anhnguyen/Desktop/research/scraping_Python/success.txt"
fail_file = "/Users/anhnguyen/Desktop/research/scraping_Python/fail.txt"

In [26]:
valid_now = '/Users/anhnguyen/Desktop/research/scraping_Python/validlinks_from_Sammy.txt'
list_valid_now,count = read_txt_2(valid_now)
for link in list_valid_now:
    run_wget_command(str(link), wget_folder, "new "+ str(link)[6:])
    

In [11]:
#reset(wget_folder, success_file, fail_file)

In [12]:

k=200 # initialize this numerical variable k, which keeps track of which entry in the sample we are on.

# run on entries 201-300 of the sample
tuple_test = tuple_list[200:300]


for tup in tuple_test:
    school_title = tup[1].title()


    k += 1 # Add one to k, so we start with 201 and increase by 1 all the way up to entry # 300
    print("Capturing website data for", school_title + ", which is school #" + str(k), "of 300...")
    
    # use the tuple to create a name for the folder
    dirname = format_folder_name(k, school_title)
    
    run_wget_command(tup[0], wget_folder, dirname)
    
    school_folder = wget_folder + '/'+ dirname
    # write one "k<TAB>url" line per school, the format read_txt expects
    if contains_html(school_folder):
        write_to_file(k, tup[0], success_file)
    else :
        write_to_file(k, tup[0], fail_file)
print("done!")
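
`read_txt` expects each line to hold a number and a link separated by a tab, which is the layout `write_to_file` produces. The following sketch round-trips two made-up links through a temporary file to show that pairing. `write_record` and `read_records` are simplified, hypothetical stand-ins for the notebook's helpers.

```python
import os
import tempfile

def write_record(num, link, file_name):
    """Append one 'num<TAB>link' line (same layout as write_to_file)."""
    with open(file_name, "a") as text_file:
        text_file.write(str(num) + "\t" + link + "\n")

def read_records(file_name):
    """Read the links back, ignoring the leading number on each line."""
    with open(file_name) as f:
        return [line.split("\t")[1].rstrip() for line in f]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "success.txt")
    write_record(201, "https://example.org/", path)
    write_record(202, "https://example.com/school", path)
    links = read_records(path)

print(links)  # ['https://example.org/', 'https://example.com/school']
```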
    

In [17]:
success_links, count = read_txt(success_file)
print("There are {} links in success file.".format( count))
# print(success_links)

There are 243 links in success file.


In [18]:
fail_links, count = read_txt(fail_file)
print("There are {} links in fail file.".format( count))

There are 57 links in fail file.


In [124]:
# count how many links are reachable and record valid and invalid links in two files
    
def count_valid_links(list_of_links, valid_file, invalid_file):
    count_success, count_fail = 0, 0
    valid, invalid = '', ''
    for l in list_of_links:
#         print(l)
        if check(l):
            valid += l + '\n'
            count_success +=1
        else:
            invalid += l + '\n'
            count_fail += 1
#             print(l)
    write_file(valid, valid_file)
    write_file(invalid, invalid_file)
    return count_success, count_fail



In [125]:
valid_list = '/Users/anhnguyen/Desktop/research/scraping_Python/valid_links.txt'
invalid_list = '/Users/anhnguyen/Desktop/research/scraping_Python/invalid_links.txt'
reset_text_file(valid_list)
reset_text_file(invalid_list)

In [126]:

count_success, count_fail = count_valid_links(fail_links, valid_list, invalid_list)


urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError


In [127]:
print("There are {} valid links and {} invalid links".format(count_success, count_fail))

There are 31 valid links and 26 invalid links


In [114]:
# recheck links without "/"
recheck, count = read_txt_2(invalid_list)
print(count)

26


In [115]:
for index in range (0, len(recheck)):
    if recheck[index].endswith('/'):
        recheck[index] = recheck[index][: recheck[index].rindex('/')]
print(recheck[20])

http://responsiveed.com/dallasclassical
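
The trailing-slash cleanup can also be expressed as a small function applied to every link. A minimal sketch (`strip_trailing_slash` is a hypothetical name; the second URL is taken from the recheck list):

```python
def strip_trailing_slash(url):
    """Remove a single trailing '/' from url, if present."""
    return url[:-1] if url.endswith("/") else url

links = ["http://responsiveed.com/dallasclassical/", "http://www.ccaschool.net"]
cleaned = [strip_trailing_slash(u) for u in links]
print(cleaned)  # ['http://responsiveed.com/dallasclassical', 'http://www.ccaschool.net']
```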


In [116]:
invalid2 = '/Users/anhnguyen/Desktop/research/scraping_Python/invalid2.txt'
count_success, count_fail = count_valid_links(recheck, valid_list, invalid2 )

http://www.trinityschoolforchildren.org
http://www.pasadenarosebud.com
http://www.mlacademy.org/#!contact-us/c2q4
http://www.materacademy.com/schools
http://www.jeffersoncommunityschool.org
http://www.evergladesprep.com/pages/Everglades_Preparatory_Academy
http://www.clevelandta.org/school/oak-leadership-institute
http://www.chandlerparkacademy.net/index.php/schools/elementary-school.html
http://www.ccaschool.net
http://www.blracademy.org
http://www.academycharterhs.org/pages/mainpg
http://www.academiadeestrellas.org
http://rpes-susd-ca.schoolloop.com
http://responsiveed.com/premierpharrmcallen
http://responsiveed.com/premiernewbraunfels
http://responsiveed.com/huntsvilleclassical
http://responsiveed.com/dallasclassical
http://ideacharterschool.com
http://gowan.craneschools.org
http://arthuracademy.org/woodburn/woodburn-arthur-academy.html


In [118]:
print("There are {} valid links and {} invalid links".format(count_success, count_fail))

There are 6 valid links and 20 invalid links


### Running wget with log output

In [26]:
# setting up files
invalid2 = '/Users/anhnguyen/Desktop/research/scraping_Python/invalid2.txt'
log = '/Users/anhnguyen/Desktop/research/scraping_Python/wget_accept_logs.txt'

In [27]:
failed_links, counts = read_txt_2(invalid2)  # invalid2 holds one link per line (no tab-separated number)
print(counts)

20


In [121]:
## something wrong with check function???
print(check('http://responsiveed.com/dallasclassical'))

urllib.error.URLError
False


In [28]:
os.chdir('/Users/anhnguyen/Desktop/research/scraping_Python/no_dir')
reset_text_file(log)
for link in failed_links:
    # append each link's wget output to the log file defined above
    # (no custom Host header: wget sets Host from each link's own URL)
    os.system('wget --no-parent --show-progress --progress=dot --recursive --level=3 --convert-links --retry-connrefused --tries=5 \
         --random-wait --no-cookies --secure-protocol=auto --no-check-certificate --execute robots=off \
         --append-output="' + log + '" \
         --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" \
          --accept .html' + ' ' + link)
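
Besides reading the log, the exit status of wget gives a quick first indication of why a download failed. The sketch below maps the exit codes listed in the wget manual to short descriptions. `describe_wget_exit` is a hypothetical helper, and the `subprocess.run` call is shown only as a comment because it needs wget and network access.

```python
# Exit codes as listed in the "Exit Status" section of the wget manual.
WGET_EXIT_CODES = {
    0: "No problems occurred",
    1: "Generic error code",
    2: "Parse error (command line or .wgetrc)",
    3: "File I/O error",
    4: "Network failure",
    5: "SSL verification failure",
    6: "Username/password authentication failure",
    7: "Protocol errors",
    8: "Server issued an error response",
}

def describe_wget_exit(code):
    """Translate a wget exit code into a short description."""
    return WGET_EXIT_CODES.get(code, "Unknown exit code {}".format(code))

# result = subprocess.run(["wget", "--tries=5", link])
# print(link, describe_wget_exit(result.returncode))
print(describe_wget_exit(4))  # Network failure
print(describe_wget_exit(8))  # Server issued an error response
```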