# Introduction to web-scraping

This first day's workshop is a one-hour beginner's introduction to web scraping. 


## Learning Goals
*   

## Outline

* [Motivation](#motivation)
* [How the Web Works](#mechanics)
* [3](#3)
* [4](#4)
* [5](#5)
* [6](#6)
* [7](#7)
* [8](#8)
* [9](#9)
* [Terms of Service](#terms)

## Background

We will do some review, but this notebook assumes you have basic familiarity with Python. If you need a beginner's introduction to coding in Python, please walk through the intro to Python notebook at `solutions/intro-to-python.ipynb` and/or [this one](https://github.com/lknelson/text-analysis-course/blob/master/scripts/01.25.02_PythonBasics.ipynb) *before* the workshop. 

We will also use some regular expressions, which are character sequences defining a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations. Don't worry if you haven't seen these before--we will keep it simple. If you want to get more out of this session, first go through [this notebook on regular expressions](https://github.com/lknelson/text-analysis-course/blob/master/scripts/03.20.01_RegularExpressions.ipynb).
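
As a quick taste, here is a minimal sketch of a "find" and a "find and replace" using Python's built-in `re` module. The sample text and the patterns are made up for illustration; they are not taken from the workshop data.

```python
import re

text = "Contact us at info@example.org or visit http://example.org/about"

# "find": \w+ matches a run of letters, digits or underscores
emails = re.findall(r"\w+@\w+\.\w+", text)

# "find and replace": swap every run of digits for a '#'
masked = re.sub(r"\d+", "#", "Room 12, Building 345")

print(emails)
print(masked)
```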

## Vocabulary

* *domain*: 
    *  
* *web-scraping* (i.e., *screen-scraping*):
    *   
* *web-crawling*:
    *  
* *downloading*:
    *  
* *mirroring*:
    *  
* *Application Programming Interface (API)*:
    *  

**__________________________________**


## Motivation<a id='motivation'></a>

It's 2019. The web is everywhere.

* If you want to buy a house, real estate agents have [websites](https://www.wendytlouie.com/) where they list the houses they're currently selling. 
* If you want to know whether to wear a rain jacket or shorts, you check the weather on a [website](https://weather.com/weather/tenday/l/Berkeley+CA+USCA0087:1:US). 
* If you want to know what's happening in the world, you read the news [online](https://www.sfchronicle.com/). 
* If you've forgotten which city is the capital of Australia, you check [Wikipedia](https://en.wikipedia.org/wiki/Australia).

**The point is this: there is an enormous amount of information (also known as data) on the web.**

If we (in our capacities as, for example, data scientists, social scientists, digital humanists, businesses, public servants or members of the public) can get our hands on this information, **we can answer all sorts of interesting questions or solve important problems**.

* Maybe you're studying gender bias in student evaluations of professors. One option would be to scrape ratings from [Rate My Professors](https://www.ratemyprofessors.com/) (provided you follow their [terms of service](https://www.ratemyprofessors.com/TermsOfUse_us.jsp#use)).
* Perhaps you want to build an app that shows users articles relating to their specified interests. You could scrape stories from various news websites and then use NLP methods to decide which articles to show which users.
* [Geoff Boeing](https://geoffboeing.com/) and [Paul Waddell](https://ced.berkeley.edu/ced/faculty-staff/paul-waddell) recently published [a great study](https://arxiv.org/pdf/1605.05397.pdf) of the US housing market by scraping millions of Craigslist rental listings. Among other insights, their study shows which metropolitan areas in the US are more or less affordable to renters.

## How the Web works<a id='mechanics'></a>

Here's our high-level description of the web.

**The internet is a bunch of computers connected together.** Some computers are laptops, some are desktops, some are smart phones, some are servers owned by companies. Each computer has its own address on the internet. Using these addresses, **one computer can ask another computer for some information (data). We say that the first computer sends a _request_ to the second computer, asking for some particular information. The second computer sends back a _response_**. The response could include the information requested, or it could be an error message. Perhaps the second computer doesn't have that information any more, or the first computer isn't allowed to access that information.

<img src='../assets/computer-network.png' />

We said that there is an enormous amount of information available on the web. When people put information on the web, they generally have two different audiences in mind, two different types of consumers of their information: humans and computers. If they want their information to be used primarily by humans, they'll make a website. This will let them lay out the information in a visually appealing way, choose colours, add pictures, and make the information interactive. If they want their information to be used by computers, they'll make a web API. A web API provides other computers structured access to their data. We won't cover APIs in this workshop, but you should know that i) APIs are very common and ii) if there is an API for a website/data source, you should use that over web scraping. Many data sources that you might be interested in (e.g. social media sites) have APIs.
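
To make "structured access" a little more concrete, here is a small sketch of what a web API response often looks like once it reaches your program. The JSON text below is invented for illustration and does not come from any real API.

```python
import json

# A made-up response body, in the JSON format that many web APIs return
response_body = (
    '{"city": "Berkeley", "forecast": ['
    '{"day": "Mon", "high_c": 18}, {"day": "Tue", "high_c": 21}]}'
)

data = json.loads(response_body)  # text -> Python dicts and lists

# Because the data is already structured, no HTML parsing is needed
highs = [entry["high_c"] for entry in data["forecast"]]
print(data["city"], highs)
```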

**Websites are just a bunch of files on one of those computers. They are just plain text files, so you can view them if you want. When you type in the address of a website in your browser, your computer sends a request to the computer located at that address. The request says "hey buddy, please send me the file(s) for this website". If everything goes well, the other computer will send back the file(s) in the response**. Every time you navigate to a new website or page in your browser, this process repeats.

<img src='../assets/request-response.png' />
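
The request/response cycle can be tried out without touching the real web. The sketch below is only an illustration: it starts a tiny pretend website on your own computer (the class name `TinySite` and the page contents are made up), then sends it one request that succeeds and one that fails.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class TinySite(BaseHTTPRequestHandler):
    """A pretend website that only knows about one page."""

    def do_GET(self):
        if self.path == "/index.html":
            body = b"<html><body><h1>Hello, web!</h1></body></html>"
            self.send_response(200)  # 200 means "OK, here is the file"
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)  # 404 means "I don't have that file"

    def log_message(self, *args):
        pass  # keep the example's output quiet


# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), TinySite)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = "http://127.0.0.1:{}".format(server.server_port)

# Talk to the local server directly, ignoring any proxy settings
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# A request that succeeds: the response carries the HTML file
with opener.open(base_url + "/index.html") as response:
    ok_status = response.status
    html = response.read().decode("utf-8")

# A request that fails: the response is an error message instead
try:
    opener.open(base_url + "/missing.html")
    error_status = None
except urllib.error.HTTPError as err:
    error_status = err.code

server.shutdown()
server.server_close()
print(ok_status, html)
print(error_status)
```

The status codes are how the second computer tells the first whether the request worked.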

**There are three main languages that website files are written in: HyperText Markup Language (HTML), Cascading Style Sheets (CSS) and JavaScript (JS)**. They normally have `.html`, `.css` and `.js` file extensions. Each language (and thus each type of file) serves a different purpose. **HTML files are the ones we care about the most, because they are the ones that contain the text you see on a web page**. CSS files contain the instructions on how to make the content in an HTML file visually appealing (all the colours, font sizes, border widths, etc.). JavaScript files have the instructions on how to make the information on a website interactive (things like changing colour when you click something, entering data in a form). In this workshop, we're going to focus on HTML.


**It's not too much of a simplification to say:**

\begin{equation}
\textrm{Web scraping} = \textrm{Making a request for an HTML file} + \textrm{Parsing the HTML response}
\end{equation}
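
Here is a minimal sketch of the "parsing" half of that equation, using only Python's standard library. To keep it runnable offline, the request step is replaced by a hard-coded string; the HTML and the `LinkCollector` class are invented for this example.

```python
from html.parser import HTMLParser

# Stand-in for the text of an HTML response (invented for this example)
html = """
<html>
  <body>
    <h1>Schools</h1>
    <a href="https://example.org/one">School One</a>
    <a href="https://example.org/two">School Two</a>
  </body>
</html>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


parser = LinkCollector()
parser.feed(html)
print(parser.links)
```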

# Wget using accept

In [1]:
# import necessary libraries
import os, csv
import shutil
import urllib
from urllib.request import urlopen
from socket import error as SocketError
import errno


In [2]:
#setting directories
micro_sample_cvs = "/Users/anhnguyen/Desktop/research/scraping_Python/micro-sample_Feb17.csv"
wget_folder = "/Users/anhnguyen/Desktop/research/scraping_Python/wget_accept"
no_dir_folder = "/Users/anhnguyen/Desktop/research/scraping_Python/no_dir"
learning_wget = "/Users/anhnguyen/Desktop/research/scraping_Python/learning_wget"

In [3]:
sample = [] # make empty list
# open file; the windows-1252 encoding looks weird but works for this
with open(micro_sample_cvs, 'r', encoding = 'Windows-1252') as csvfile:
    reader = csv.DictReader(csvfile) # create a reader
    for row in reader: # loop through rows
        sample.append(row) # append each row to the list
        
#note: each row, sample[i] is a dictionary with keys as column name and value as info

In [4]:
# turning this into tuples we can use with wget!
# first, make some empty lists
url_list = []
name_list = []
terms_list = []

# now let's fill these lists with content from the sample
for school in sample:
    url_list.append(school["URL"])
    name_list.append(school["SCHNAM"])
    terms_list.append(school["ADDRESS"])

In [5]:
tuple_list = list(zip(url_list, name_list))
# Let's check what these tuples look like:
print(tuple_list[:3])
print("\n", tuple_list[1][1].title())

[('https://www.richland2.org/charterhigh/', 'RICHLAND TWO CHARTER HIGH'), ('https://www.polk.edu/lakeland-gateway-to-college-high-school/', 'POLK STATE COLLEGE COLLEGIATE HIGH SCHOOL'), ('https://www.nhaschools.com/schools/rivercity/Pages/default.aspx', 'RIVER CITY SCHOLARS CHARTER ACADEMY')]

 Polk State College Collegiate High School


### Helper Functions

In [27]:
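def get_parent_link(url):
    """Return the first valid link in the chain that starts at url,
    or url itself if fewer than two valid links are found."""
    ls = get_parent_link_helper(5, url, [])
    if len(ls) > 1:
        return ls[0] # ls[0] is the link itself; ls[1] would be its immediate parent
    return url

def get_parent_link_helper(level, url, result):
    """This is a tail recursive function
    to get parent links of a given link. Return a list of valid urls."""
    if level == 0 or not check(url):
        return result # stop: no levels left, or this link is not valid
    else:
        result += [url]
        # strip the last path segment and check the parent link next
        return get_parent_link_helper(level - 1, url[: url.rindex('/')], result)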
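
Slicing a URL at its last `/` works for most links, but it can also cut into the `https://` prefix. As a hedged alternative, the sketch below builds the list of parent URLs with `urllib.parse`. The function name `parent_urls` is hypothetical, and unlike `get_parent_link` it does not check whether each parent is actually reachable.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url, max_levels=5):
    """Return up to max_levels parent URLs of url, nearest parent first."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    parents = []
    # Stop once the site root has been added or max_levels is reached
    while path not in ("", "/") and len(parents) < max_levels:
        path = posixpath.dirname(path)  # drop the last path segment
        parent_path = path if path.endswith("/") else path + "/"
        parents.append(urlunsplit((parts.scheme, parts.netloc, parent_path, "", "")))
    return parents


example = parent_urls("https://www.nhaschools.com/schools/rivercity/Pages/default.aspx")
for parent in example:
    print(parent)
```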

In [25]:
def format_folder_name (k, name):
    """Format a folder nicely for easy access"""
    if k < 10: # Add two zeros to the folder name if k is less than 10 (for ease of organizing the output folders)
        dirname = "00" + str(k) + " " + name
    elif k < 100: # Add one zero if k is less than 100
        dirname = "0" + str(k) + " " + name
    else: # Add nothing if k>100
        dirname = str(k) + " " + name
    return dirname

def run_wget_command(link, parent_folder, my_folder):
    """wget on link and print output to appropriate folders"""
    #navigate to parent folder
    os.chdir(parent_folder)
    # create dir my_folder if it doesn't exist yet
    if not os.path.exists(my_folder):
        os.makedirs(my_folder)
    #navigate to the correct folder, ready to wget
    os.chdir(my_folder)
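    # note: no space after --referer=, otherwise wget treats the parent link as a second URL to download
    os.system('wget --header="Accept: text/html" -r --level=3 --accept .html --referer='+get_parent_link(link) + ' ' + link)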
#     os.system('wget -np --no-parent --show-progress --progress=dot --recursive --level=3 --convert-links --retry-connrefused \
#          --random-wait --no-cookies --secure-protocol=auto --no-check-certificate --execute robots=off \
#          --header "Accept: text/html" \
#          --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" \
#           --accept .html' + ' ' + link)
    

def contains_html(my_folder):
    """check if a wget is success by checking if a directory has a html file"""

    for r,d,f in os.walk(my_folder):
        for file in f:
            if file.endswith('.html'):
                return True
    return False

def count_with_file_ext(folder, ext):
    """count the files under folder (including subfolders) whose names end with ext"""
    count = 0
    for r,d,f in os.walk(folder):
        for file in f:
            if file.endswith(ext):
                count +=1
    return count 

# write a file and add num line at the beginning of line
def write_to_file(num, link, file_name):
    with open(file_name, "a") as text_file:
        text_file.write(str(num) + "\t" + link +"\n")

# just write str to file
def write_file(str, file_name):
    with open(file_name, "a") as text_file:
        text_file.write(str)
        
def reset(folder, text_file_1, text_file_2):
    """Deletes all files in a folder and set 2 text files to blank"""
    shutil.rmtree(folder)
    os.makedirs(folder)
    # remove any leftover .bak files (join with folder so the path is correct)
    filelist = [ f for f in os.listdir(folder) if f.endswith(".bak") ]
    for f in filelist:
        os.unlink(os.path.join(folder, f))
    for file_name in [text_file_1, text_file_2]:
        reset_text_file(file_name)
        
def reset_text_file(file_name):
    if os.path.exists(file_name):
            with open(file_name, "w") as text_file:
                text_file.write("")
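
`run_wget_command` builds one long shell string by concatenation, which can break when a link contains spaces or characters such as `&`. One possible alternative, sketched below, passes the arguments to `subprocess.run` as a list. The flags mirror the ones used in `run_wget_command`; the names `build_wget_args` and `run_wget` are hypothetical.

```python
import shutil
import subprocess


def build_wget_args(link, referer):
    """Return the wget command as a list, one argument per item."""
    return [
        "wget",
        "--header=Accept: text/html",
        "--recursive",
        "--level=3",
        "--accept", ".html",
        "--referer=" + referer,
        link,
    ]


def run_wget(link, referer, folder):
    """Run wget inside folder; return None if wget is not installed."""
    if shutil.which("wget") is None:
        return None
    # cwd= replaces the os.chdir calls: wget writes its output into folder
    return subprocess.run(build_wget_args(link, referer), cwd=folder).returncode


# Only build the argument list here, so the example makes no network requests
args = build_wget_args("https://example.org/school/", "https://example.org/")
print(args)
```

Because each argument is its own list item, no shell quoting is needed.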

In [7]:
#testing methods
print(format_folder_name(30, "name me"))



030 name me
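
The same zero padding can also be written with a format specification. This is just a shorter sketch of the same idea; the name `format_folder_name_short` is made up here.

```python
def format_folder_name_short(k, name):
    """Zero-pad k to three digits, then add a space and the name."""
    return "{:03d} {}".format(k, name)


print(format_folder_name_short(30, "name me"))
```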


In [11]:
def check(url):
    """ Helper function, check if url is a valid (reachable) link"""
    try:
        urlopen(url)
        
    except urllib.error.HTTPError: # must come before URLError, its parent class
        print('urllib.error.HTTPError')
        return False
    except urllib.error.URLError:
        print("urllib.error.URLError")
        return False
    except SocketError:
        print('SocketError')
        return False
    return True


def read_txt(txt_file):
    links = []
    count = 0
    with open(txt_file) as f:
        for line in f:   
            
            elem =  line.split('\t')[1].rstrip()
            count +=1
    
#             print(elem)
            links += [elem.rstrip()]
    return links, count

def read_txt_2(txt_file):
    links = []
    count = 0
    with open(txt_file) as f:
        for line in f:   
            
#             elem =  line.split('\t')[1].rstrip()
#             if elem.endswith('\'):
#                 elem = elem[:-1]
            count +=1
    
#             print(elem)
            links += [line.rstrip()]
    return links, count
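
`check` catches both `URLError` and `HTTPError`. Since `HTTPError` is a subclass of `URLError`, whichever clause is listed first wins for HTTP errors, so the more specific `HTTPError` clause has to come first if it is ever to run. The sketch below shows this without any network access; the `classify` helper and the example errors are made up for illustration.

```python
import io
from urllib.error import HTTPError, URLError


def classify(exc):
    """Raise exc and report which except clause catches it."""
    try:
        raise exc
    except HTTPError as err:  # the more specific class goes first
        return "HTTPError {}".format(err.code)
    except URLError as err:  # every other URL-related error lands here
        return "URLError: {}".format(err.reason)


http_error = HTTPError("http://example.org/x", 404, "Not Found", None, io.BytesIO())
url_error = URLError("no such host")

print(issubclass(HTTPError, URLError))
print(classify(http_error))
print(classify(url_error))
```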

### Running wget

In [8]:
# set up file directories
success_file = "/Users/anhnguyen/Desktop/research/scraping_Python/success.txt"
fail_file = "/Users/anhnguyen/Desktop/research/scraping_Python/fail.txt"

In [26]:
valid_now = '/Users/anhnguyen/Desktop/research/scraping_Python/validlinks_from_Sammy.txt'
list_valid_now,count = read_txt_2(valid_now)
for link in list_valid_now:
    run_wget_command(str(link), wget_folder, "new "+ str(link)[6:])
    

In [11]:
#reset(wget_folder, success_file, fail_file)

In [12]:

k=200 # initialize this numerical variable k, which keeps track of which entry in the sample we are on.

# process entries 201 through 300 of the sample
tuple_test = tuple_list[200:300]


for tup in tuple_test:
    school_title = tup[1].title()


    k += 1 # Add one to k, so we start with 201 and increase by 1 all the way up to entry # 300
    print("Capturing website data for", school_title + ", which is school #" + str(k), "of 300...")
    
    # use the tuple to create a name for the folder
    dirname = format_folder_name(k, school_title)
    
    run_wget_command(tup[0], wget_folder, dirname)
    
    school_folder = wget_folder + '/'+ dirname
    if contains_html(school_folder):
        write_to_file(k, tup[0], success_file) # one "number<TAB>link" line, the format read_txt expects
    else:
        write_to_file(k, tup[0], fail_file)
print("done!")
    

### Limitation of wget

* wget only works for static HTML; it does not run JavaScript, so any element generated by JS will not be captured.

More info:

* https://www.petekeen.net/archiving-websites-with-wget
* http://askubuntu.com/questions/411540/how-to-get-wget-to-download-exact-same-web-page-html-as-browser
* https://www.reddit.com/r/linuxquestions/comments/3tb7vu/wget_specify_dns_server/ (about the wget error `failed: nodename nor servname provided, or not known.`)
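
To see why JavaScript-generated elements are missed, consider the invented page below. The paragraph with the news only exists after a browser runs the script; a tool that just downloads the file (like wget) or parses the raw HTML never sees it. This is a sketch using the standard library's `html.parser`, with a made-up `TagCounter` class.

```python
from html.parser import HTMLParser

# Invented page: the <p> element is only created when a browser runs the script
html = """
<html><body>
  <h1>Our School</h1>
  <div id="news"></div>
  <script>
    var p = document.createElement("p");
    p.textContent = "Enrollment opens Monday";
    document.getElementById("news").appendChild(p);
  </script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Record the name of every opening tag found in the static HTML."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


counter = TagCounter()
counter.feed(html)
print(counter.tags)
```

The list of tags contains the empty `div`, but no `p`, because the script is never executed.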


In [17]:
success_links, count = read_txt(success_file)
print("There are {} links in success file.".format( count))
# print(success_links)

There are 243 links in success file.


In [18]:
fail_links, count = read_txt(fail_file)
print("There are {} links in fail file.".format( count))

There are 57 links in fail file.


In [124]:
# counting # of html files
# def count_html(file):
    
def count_valid_links(list_of_links, valid_file, invalid_file):
    count_success, count_fail = 0, 0
    valid, invalid = '', ''
    for l in list_of_links:
#         print(l)
        if check(l):
            valid += l + '\n'
            count_success +=1
        else:
            invalid += l + '\n'
            count_fail += 1
#             print(l)
    write_file(valid, valid_file)
    write_file(invalid, invalid_file)
    return count_success, count_fail



In [125]:
valid_list = '/Users/anhnguyen/Desktop/research/scraping_Python/valid_links.txt'
invalid_list = '/Users/anhnguyen/Desktop/research/scraping_Python/invalid_links.txt'
reset_text_file(valid_list)
reset_text_file(invalid_list)

In [126]:

count_success, count_fail = count_valid_links(fail_links, valid_list, invalid_list)


urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError
urllib.error.URLError


In [127]:
print("There are {} valid links and {} invalid links".format(count_success, count_fail))

There are 31 valid links and 26 invalid links


In [114]:
# recheck links without "/"
recheck, count = read_txt_2(invalid_list)
print(count)

26


In [115]:
for index in range (0, len(recheck)):
    if recheck[index].endswith('/'):
        recheck[index] = recheck[index][: recheck[index].rindex('/')]
print(recheck[20])

http://responsiveed.com/dallasclassical


In [116]:
invalid2 = '/Users/anhnguyen/Desktop/research/scraping_Python/invalid2.txt'
count_success, count_fail = count_valid_links(recheck, valid_list, invalid2 )

http://www.trinityschoolforchildren.org
http://www.pasadenarosebud.com
http://www.mlacademy.org/#!contact-us/c2q4
http://www.materacademy.com/schools
http://www.jeffersoncommunityschool.org
http://www.evergladesprep.com/pages/Everglades_Preparatory_Academy
http://www.clevelandta.org/school/oak-leadership-institute
http://www.chandlerparkacademy.net/index.php/schools/elementary-school.html
http://www.ccaschool.net
http://www.blracademy.org
http://www.academycharterhs.org/pages/mainpg
http://www.academiadeestrellas.org
http://rpes-susd-ca.schoolloop.com
http://responsiveed.com/premierpharrmcallen
http://responsiveed.com/premiernewbraunfels
http://responsiveed.com/huntsvilleclassical
http://responsiveed.com/dallasclassical
http://ideacharterschool.com
http://gowan.craneschools.org
http://arthuracademy.org/woodburn/woodburn-arthur-academy.html


In [118]:
print("There are {} valid links and {} invalid links".format(count_success, count_fail))

There are 6 valid links and 20 invalid links


### Running wget with log output

In [26]:
# setting up files
invalid2 = '/Users/anhnguyen/Desktop/research/scraping_Python/invalid2.txt'
log = '/Users/anhnguyen/Desktop/research/scraping_Python/wget_accept_logs.txt'

In [27]:
failed_links, counts = read_txt(invalid2)
print(counts)

20


In [121]:
## something wrong with check function???
print(check('http://responsiveed.com/dallasclassical'))

urllib.error.URLError
False


In [28]:
os.chdir('/Users/anhnguyen/Desktop/research/scraping_Python/no_dir')
reset_text_file(log)
for link in failed_links:
    
    
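    # --append-output adds each run's messages to the file named by the log variable
    # (--output-file=log would write to a file literally called "log" and overwrite it on every run)
    # NOTE: the Host header below is sent with every request, whatever the link; check that this is intended
    os.system('wget -np --no-parent --show-progress --progress=dot --recursive --level=3 --convert-links --retry-connrefused --tries=5\
         --random-wait --no-cookies --secure-protocol=auto --no-check-certificate --execute robots=off \
         --header "Host: jrs-s.net" \
         --append-output=' + log + ' \
         --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" \
          --accept .html' + ' ' + link)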

## Terms of Service<a id='terms'></a>

As you've seen, web scraping involves making requests from other computers for their data. It costs people money to maintain the computers that we request data from: they need electricity, they require staff, and sometimes they need to be upgraded. But we didn't pay anyone for using their resources.

Because we're making these requests programmatically, we could make many, many requests per second. For example, we could put a request in a never-ending loop which would constantly request data from a server. But computers can't handle too much traffic, so eventually this might crash someone else's computer. Moreover, if we make too many requests when we're web scraping, that might restrict the number of people who can view the web page in their browser. This isn't very nice.
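
One simple way to be nicer is to pause between requests. The sketch below shows the idea; `polite_fetch_all` and `fake_fetch` are hypothetical names, the URLs are placeholders, and the right delay depends on the site you are scraping.

```python
import time


def polite_fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # give the server a break
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for a real download function, so the example runs offline."""
    return "contents of " + url


urls = ["https://example.org/a", "https://example.org/b", "https://example.org/c"]

start = time.monotonic()
pages = polite_fetch_all(urls, fake_fetch, delay_seconds=0.1)
elapsed = time.monotonic() - start

print(len(pages), "pages fetched in about", round(elapsed, 1), "seconds")
```

In a real scraper, `fake_fetch` would be replaced by a function that actually makes the request.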

Websites often have Terms of Service, documents that you agree to whenever you visit a site. Some of these terms prohibit web scraping, because it puts too much strain on their servers, or they just don't want their data accessed programmatically. Whatever the reason, we need to respect a website's Terms of Service. **Before you scrape a site, you should always check its terms of service to make sure it's allowed.**

Often, there are better ways of accessing the same data. For the Wikipedia sites we scraped, there's actually an [API](https://www.mediawiki.org/wiki/REST_API) that we could have used. In fact, Wikipedia would prefer that we access their data that way. There's even a [Python package](https://pypi.org/project/wikipedia/) that wraps around this API to make it even easier to use. Furthermore, Wikipedia actually makes all of its content available for [direct download](https://dumps.wikimedia.org/). **The point of the story is: before web scraping, see if you can get the same data elsewhere.** This will often be easier for you and preferred by the people who own the data.

Moreover, if you're affiliated with an institution, you may be breaching existing contracts by engaging in scraping. UC Berkeley's Library [recommends](http://guides.lib.berkeley.edu/text-mining) following this workflow:

<img src='img/workflow.png' />