# A First Web Scraping Exercise

We're going to scrape a bunch of separate lists from ONE Wikipedia page. Wikipedia is a good place to practice scraping because the HTML there is (mostly) well formatted, and the site handles so much traffic that they don't mind us hitting one page again and again.

First we import two libraries: 
* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is for extracting what we want from a page or pages.
* [Requests](http://docs.python-requests.org/en/master/user/quickstart/) is for making the HTTP request to the server where we want to scrape.

**If you have not yet installed these two libraries** into the virtual environment where you are running this Notebook, you'll need to do that *first.*

This is the page we will scrape: [List of colleges and universities in Florida](https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Florida)


In [None]:
# load the Python libraries
from bs4 import BeautifulSoup
import requests

Most scraping scripts begin with one page, which means one URL, to be scraped. 

In the three lines below, the only thing that changes if you scrape a different page, even at another website, is the URL inside the single quotation marks.

In [None]:
# open the page and copy all its HTML into a variable named `soup`
url = 'https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Florida'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')

As a result of those three lines above, we now have a BeautifulSoup "object" named `soup`, and it contains **all the HTML** from the file at the URL we provided. The HTML is stored in a manner that allows the BeautifulSoup functions to search and use the data.
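
To get a feel for what a `soup` object holds without making a web request, here is a minimal sketch that parses a tiny hand-written HTML string. The names `sample_html` and `sample_soup` and the HTML itself are made up for illustration; they are not part of the Wikipedia page.

```python
from bs4 import BeautifulSoup

# a tiny, made-up HTML document (an assumption, for illustration only)
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h3>First heading</h3>
    <p>Some text.</p>
  </body>
</html>
"""

# parse the string the same way we parsed html.text above
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# the soup object lets us reach an element by its tag name
page_title = sample_soup.title.text
first_h3 = sample_soup.h3.text

print(page_title)
print(first_h3)
```

The real `soup` works the same way; it simply contains far more HTML.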

Our goal is to scrape **all the lists** of colleges from that one Wikipedia page and put them in a CSV file, with separate columns for college name, location, type of college, and the URL for that college's Wikipedia page.

We start by using Chrome Dev Tools: right-click on the heading "State University System" and select **Inspect** from the pop-up menu.

If we inspect the heading above *each* of the lists of colleges and schools, we find that *all* of those headings are H3 headings. 


## Find all of the H3 elements

Here comes your first Beautiful Soup command &mdash;

In [None]:
# start by collecting all the H3 elements on the page into a Python list
h3_list = soup.find_all('h3')

In [None]:
# we don't really NEED to print them, but we do it so we can see what we got
# I would delete this cell after I had run it - it's only for testing 
print(h3_list)

Look closely, and you'll see that all the H3 heading elements are items in a Python *list.* A Python list is enclosed in square brackets &mdash; `[ ]`
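
Because the result behaves like a Python list, you can count its items and pick them out by index. The sketch below uses a made-up HTML string (`sample_html` is an assumption, not the Wikipedia page) so it runs without a web request.

```python
from bs4 import BeautifulSoup

# made-up HTML with three H3 headings (for illustration only)
sample_html = "<h3>One</h3><p>text</p><h3>Two</h3><h3>Three</h3>"
sample_soup = BeautifulSoup(sample_html, 'html.parser')

sample_h3s = sample_soup.find_all('h3')

# how many headings did we get?
count = len(sample_h3s)

# list indexes work as usual: 0 is the first item, -1 is the last
first_text = sample_h3s[0].text
last_text = sample_h3s[-1].text

print(count, first_text, last_text)
```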

We're going to use a heading to grab the list of schools that follows it.

## Grab a list of schools that follows a certain H3 heading

The heading above the first list of schools we want is "State University System" &mdash; so let's get the first UL that comes after that heading. (A UL element in HTML is an "unordered list," which is a list with bullets.)

But first, notice how there are multiple SPAN elements inside the H3 element.

Then look at the code we use to grab the UL (the list of schools):

In [None]:
for head in h3_list:
    if head.span.text == 'State University System':
        the_ul = head.find_next('ul')
        print(the_ul)
        break


Again, printing all of that *wasn't strictly necessary,* but it's good to make sure you got what you were trying to get. Otherwise, if you start getting error messages, you won't understand why. 

**Note that the code that enables us to print the list *will be used later* to scrape the list.** Printing is not the point of the scrape. It's just a step along the way.

You should be able to recognize that you got the complete UL element and everything it contains. It is NOT in a Python list. It is one single BeautifulSoup "tag object," and it's been assigned to the **variable** named `the_ul`.

* Above, you began with a *list* of all H3 elements on the page. Each H3 element from a Wikipedia page contains `span` tags and text and more. (You saw this with **Inspect.**)

* The next task was to *get* the complete UL element that comes after the heading we chose: "State University System." So we *looped* over the list of all H3 elements, and we checked *each one* to see if the text inside its *first* `span` tag matched the string we provided.

* As soon as we found that exact string, we grabbed *the next UL element* that comes after that H3, and we assigned that entire UL element to the variable `the_ul`.

* We printed that whole lump of HTML.

* Then, we stopped. The command `break` makes the for-loop stop looping.
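
The steps above can be tried on a small scale. This sketch repeats the same loop on a made-up HTML string with two headings and two lists; the headings "Fruits" and "Vegetables" and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# made-up HTML that imitates the heading-then-list pattern
sample_html = """
<h3><span class="mw-headline">Fruits</span><span>[edit]</span></h3>
<ul><li>Apple</li><li>Pear</li></ul>
<h3><span class="mw-headline">Vegetables</span><span>[edit]</span></h3>
<ul><li>Carrot</li></ul>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

found_ul = None
for head in sample_soup.find_all('h3'):
    # head.span is the FIRST span inside this H3
    if head.span.text == 'Vegetables':
        # grab the next UL that comes after this heading
        found_ul = head.find_next('ul')
        break

items = [li.text for li in found_ul.find_all('li')]
print(items)
```

Only the list under the matching heading is returned, because the loop stops at `break` as soon as the heading text matches.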


##  Get the info we want for *each* school in the list

Next, we want to get **three things** out of each LI element in that list: the college name, the location, and the URL for that college's Wikipedia page. Use **Inspect** again to see the structure of the HTML and figure out how to use it.

We will scrape the **text** for college name and location. We will scrape the **href** (inside the A element) for the URL.


In [None]:
# make a Python list of all items in this UL element (in `the_ul`)
schools = the_ul.find_all('li')

# loop over that list, scraping three things from each line in the list
for li in schools:
    a_list = li.find_all('a')
    college_name = a_list[0].text
    location = a_list[1].text
    url = a_list[0]['href']
    print(college_name, location, url)


### What the code above does

One by one, for one LI at a time, we:

* Find all the A elements in it and put them into a new list.
* From the *first* A element, `a_list[0]`, we get the text inside the A tags and assign it to the variable `college_name`.
* From the *second* A element, `a_list[1]`, we get the text inside the A tags and assign it to the variable `location`.
* From the *first* A element, `a_list[0]`, we get the *value* of the `href` attribute and assign it to the variable `url`.
* We print the three variables for each time the loop loops.
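
One thing this loop assumes is that every LI contains at least two A elements. If a list item had only one link, `a_list[1]` would raise an `IndexError`. The sketch below shows one possible guard, using made-up HTML; the school names and the decision to skip short items are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# made-up list: the second item has only ONE link
sample_html = """
<ul>
  <li><a href="/wiki/Example_University">Example University</a>,
      <a href="/wiki/Example_City">Example City</a></li>
  <li><a href="/wiki/Other_College">Other College</a> (no location link)</li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

rows = []
for li in sample_soup.find_all('li'):
    a_list = li.find_all('a')
    # skip any item that lacks a second link, instead of crashing
    if len(a_list) < 2:
        continue
    college_name = a_list[0].text
    location = a_list[1].text
    url = a_list[0]['href']
    rows.append((college_name, location, url))

print(rows)
```

Whether to skip such an item or record it with an empty location depends on what you need from the data.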

### How to make the partial URL a complete URL

This is another common scraping task. The HTML holds a partial URL because the link goes to another page on Wikipedia. We want a full URL so we can use it *outside* Wikipedia.

The solution is simply to *add* the missing front part of the URL to the partial.


In [None]:
# the missing front part of the URL is - 'https://en.wikipedia.org'
# let's re-do the same for-loop from above

for li in schools:
    a_list = li.find_all('a')
    college_name = a_list[0].text
    location = a_list[1].text
    # here, we add the missing front part of the URL 
    url = 'https://en.wikipedia.org' + a_list[0]['href']
    print(college_name, location, url)
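
Adding strings together works here because every partial URL on this page starts with a slash. As an alternative, Python's standard library has `urljoin`, which also copes with links that are already complete. This is a sketch, not a required change; the partial path below is just an example value.

```python
from urllib.parse import urljoin

base = 'https://en.wikipedia.org'

# a partial URL like the ones in the href attributes (example value)
partial = '/wiki/University_of_Florida'
full_url = urljoin(base, partial)

# if the href is already a complete URL, urljoin leaves it alone
already_full = urljoin(base, 'https://www.example.com/page')

print(full_url)
print(already_full)
```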


## Put what was scraped into a CSV file

Python has standard code for doing this. You will use these few lines of code the same way every time you need to store data in a CSV.

To demonstrate, we'll create a "test" CSV file with two rows of nonsense data.


In [None]:
# import Python's built-in CSV module
import csv

# create and open a new file for writing - its name will be `test.csv` 
csvfile = open("test.csv", 'w', newline='', encoding='utf-8')

# make a new variable, c, for Python's CSV writer object 
c = csv.writer(csvfile)

# write your header row to test.csv
c.writerow( ['first', 'second', 'third', 'fourth'] )

# write two more rows to test.csv
c.writerow( ['a', 'b', 'c', 'd'] )
c.writerow( [10, 20, 30, 40] )

# close and save the CSV file
csvfile.close()


You have now created a new CSV file and written three rows and four columns into it. But WHERE was it *saved*? To find out which directory this Jupyter Notebook is running in, use the `pwd` command. (`pwd` stands for "print working directory.")

In [None]:
pwd

Go to that folder now, in your Finder or File Explorer, and open the CSV file named *test.csv.*
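
You can also check the file from Python itself. The sketch below writes the same three rows using a `with` block, which closes the file automatically, and then reads them back. It writes into a temporary folder (an assumption made here so the sketch does not touch your real files); the names `folder`, `csv_path`, and `rows_read` are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# a temporary folder, so this sketch does not overwrite anything
folder = Path(tempfile.mkdtemp())
csv_path = folder / 'test.csv'

# the with-block closes the file for us, even if an error occurs
with open(csv_path, 'w', newline='', encoding='utf-8') as csvfile:
    c = csv.writer(csvfile)
    c.writerow(['first', 'second', 'third', 'fourth'])
    c.writerow(['a', 'b', 'c', 'd'])
    c.writerow([10, 20, 30, 40])

# read the file back to confirm what was written
with open(csv_path, newline='', encoding='utf-8') as csvfile:
    rows_read = list(csv.reader(csvfile))

print(csv_path)
print(rows_read)
```

Notice that the numbers come back as strings: a CSV file stores only text.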

Now you know **how to write a new CSV file.** Can you figure out how to take the earlier code from above &mdash; the for-loop that PRINTS a list of schools &mdash; and make it WRITE to a row in the CSV *instead* of printing?

Because that's all you need to do.

Note: The `import csv` line does not need to be repeated.


In [None]:
# create and open a new file for writing - its name will be `state_university_system.csv` 
csvfile = open("state_university_system.csv", 'w', newline='', encoding='utf-8')

# make a new variable, c, for Python's CSV writer object
c = csv.writer(csvfile)

# write your header row to the file 
c.writerow( ['school', 'location', 'url'] )

# the same code from above that loops over all the schools in the list of LI elements
for li in schools:
    a_list = li.find_all('a')
    college_name = a_list[0].text
    location = a_list[1].text
    url = 'https://en.wikipedia.org' + a_list[0]['href']
    # instead of print() --- we write to the CSV file 
    c.writerow( [college_name, location, url] )

# close and save the CSV file
csvfile.close()


Open your new CSV, named *state_university_system.csv,* and take a look at the rows and columns.

An **important point** to understand about the new line in the code above &mdash; line 17, `c.writerow( [college_name, location, url] )` &mdash; is that the square brackets *inside* the parentheses are **required** by the csv module's code. It expects to receive a Python list (which is like an array in JavaScript). If it doesn't, the CSV file will not be written correctly.

So any time you WRITE a ROW to a CSV file here, you must pass a *list* and NOT just variables separated by commas (which raises a `TypeError`).
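
To see why the square brackets matter, here is a small sketch that writes to an in-memory text buffer instead of a real file (using `io.StringIO` is a choice made for this example only). It shows a correct call, a call with a bare string, and a call with separate arguments.

```python
import csv
import io

# an in-memory stand-in for a file, so nothing is saved to disk
buffer = io.StringIO()
c = csv.writer(buffer)

# correct: one list, one row with two columns
c.writerow(['Example University', 'Example City'])

# a bare string is treated as a sequence of characters:
# each character becomes its own column
c.writerow('abc')

# separate arguments are not accepted at all
try:
    c.writerow('Example University', 'Example City')
    error_raised = False
except TypeError:
    error_raised = True

lines = buffer.getvalue().splitlines()
print(lines)
print(error_raised)
```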


## We said we would get all the schools from all the lists, and we will

What we've done is nice, but it will be even nicer when we get *all* the lists of schools that are still in operation (we will skip the closed schools). And we already have almost all the Python and Beautiful Soup code we need to get the job done.

(If we are short on time for the live demonstration, we will stop here.)

Recall that previously, we got a list of all the H3 headings. It's in the variable `h3_list`.


In [None]:
print(h3_list)

That is all the complete H3 elements, with tags and everything. 

Let's look at just the *text* of the headings instead, using the same list.


In [None]:
for head in h3_list:
    if head.span:
        print(head.span.text)


### Check the page for missing data

Always keep in mind that your code might not be getting all the items you want, and check the web page.

When I check the Wikipedia page, I discover the list above is missing "Trade/technical institutions." 

I use **Inspect** again and see that the H3 element there is a bit different from the others, so I try another way.


In [None]:
for head in h3_list:
    if head.find( 'span', {'class':'mw-headline'} ):
        print( head.find( 'span', {'class':'mw-headline'} ).text)


Notice how that gave me a much better list!

The BeautifulSoup pattern `object.find( 'element', {'class':'classname'} )` is very, very common and very, very useful. 

### Review basic Beautiful Soup commands

`.find()` brings you only the **first** instance of the tag or class, even if there are others on the page. 

`.find_all()` brings you a Python **list** of ALL the elements of that kind on the page that have that class.

`.text` strips the **text only** out of the element and gives it to you.

When I examined the Wikipedia HTML carefully with **Inspect**, I realized that all the headings are inside a `span` tag that has `class="mw-headline"` &mdash; *so that is what I used* to get the printed H3 headlines I want.

Use Command-f (or Control-f on Windows) on the [BeautifulSoup documentation page](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to discover these tricks.
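
Here is a compact sketch of those three commands side by side, run on a made-up HTML string (the headings "Alpha" and "Beta" and the class `other` are assumptions for illustration).

```python
from bs4 import BeautifulSoup

# made-up HTML: the second H3 has an extra span BEFORE the headline span
sample_html = """
<h3><span class="mw-headline">Alpha</span></h3>
<h3><span class="other">skip me</span><span class="mw-headline">Beta</span></h3>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# .find() returns only the FIRST matching element
first = sample_soup.find('span', {'class': 'mw-headline'})

# .find_all() returns a list of ALL matching elements
all_spans = sample_soup.find_all('span', {'class': 'mw-headline'})

# .text gives just the text inside an element
first_text = first.text
all_texts = [span.text for span in all_spans]

print(first_text)
print(all_texts)
```

Searching by class finds "Beta" even though it is not the first `span` inside its H3, which is why this approach caught a heading that `head.span` missed.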

Now I will manually create a Python list of the headings for the UL lists I want to scrape from this page.


In [None]:
# copy/paste that nice text from above to create a new Python list 
# I removed 'Defunct public (county) colleges in Florida' and 
# 'Segregated junior colleges' because those schools are all closed 

headings = [
    'State University System',
    'Florida College System',
    'Other public institutions',
    'Religiously affiliated institutions',
    'Trade/technical institutions',
    'Other private institutions'
]


I've checked the list carefully against the Wikipedia page. Every **list of schools** that I want to add to my CSV comes under one of those headings. I will use the headings the same way I used 'State University System' alone, in the section "Grab a list of schools that follows a certain H3 heading" above, to get ONE list of schools.

I will use the `h3_list` as I did before &mdash; but now I will ALSO use the new `headings` list.


In [None]:
for item in headings:
    for head in h3_list:
        text = head.find( 'span', {'class':'mw-headline'} ).text
        if text == item:
            the_ul = head.find_next('ul')
            break
    # show only the first LI in the UL - just to test this code 
    print( the_ul.li.text )


I printed only the text in the first LI in each list to test my code, and **I have a problem.** 

"By state and in insular areas" is not a school. Back on the Wikipedia page, I inspect again, and I find that the heading "Other private institutions" is followed (in the HTML) by a table, and *inside the table* there is a UL I do not want.

This is a poor page layout. We can't help that. 

This is a typical scraping problem, and we can solve it.


In [None]:
for item in headings:
    for head in h3_list:
        text = head.find( 'span', {'class':'mw-headline'} ).text
        if text == item:
            the_ul = head.find_next_sibling('ul')
            break
    print( the_ul.li.text )


### Using siblings in Beautiful Soup

Siblings are tricky to work with, and they don't always act like you would expect. Learn more about them in the [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

`head.find_next_sibling('ul')` means "get the first sibling that is a UL element following this head."
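
The difference between `find_next()` and `find_next_sibling()` is easier to see on a small example. The made-up HTML below imitates the problem described above: a heading followed by a table that contains an unwanted UL, and then the UL we actually want.

```python
from bs4 import BeautifulSoup

# made-up HTML: an unwanted UL is nested inside a table after the heading
sample_html = """
<div>
<h3>Heading</h3>
<table><tr><td><ul><li>Inside the table</li></ul></td></tr></table>
<ul><li>The list we want</li></ul>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
head = sample_soup.h3

# find_next() takes the next UL anywhere in the HTML, however deeply nested
via_find_next = head.find_next('ul').li.text

# find_next_sibling() only considers elements at the same level as the H3
via_sibling = head.find_next_sibling('ul').li.text

print(via_find_next)
print(via_sibling)
```

The UL inside the table is a descendant of the table, not a sibling of the H3, so `find_next_sibling('ul')` passes over it.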

Since we now know we can successfully get all the LI elements, let's try to add in the code that gets the location and the URL too. (We continue testing without writing a CSV until we *know* we can write the CSV file cleanly.)

We continue *printing* because we are still *testing.*


In [None]:
for item in headings:
    for head in h3_list:
        text = head.find( 'span', {'class':'mw-headline'} ).text
        if text == item:
            the_ul = head.find_next_sibling('ul')
            break
    
    # get all schools under ONE heading
    schools = the_ul.find_all('li')

    # remember this? we used it above, earlier 
    for li in schools:
        a_list = li.find_all('a')
        college_name = a_list[0].text
        location = a_list[1].text
        url = 'https://en.wikipedia.org' + a_list[0]['href']
        print(college_name, location, url, item)


That looks pretty good, so let's insert that whole chunk of code *into* the CSV-writing code and give it a try.

## Make the final, complete CSV of all schools

We'll use the same CSV code we used above. Remember, we don't need to `import csv` again.


In [None]:
# create and open a new file for writing
csvfile = open("florida_colleges.csv", 'w', newline='', encoding='utf-8')

# make a new variable, c, for Python's CSV writer object 
c = csv.writer(csvfile)

# write your header row 
c.writerow( ['school', 'location', 'url', 'type'] )

# write all the schools into rows
for item in headings:
    for head in h3_list:
        text = head.find( 'span', {'class':'mw-headline'} ).text
        if text == item:
            the_ul = head.find_next_sibling('ul')
            break
    
    # get all schools under ONE heading
    schools = the_ul.find_all('li')
 
    for li in schools:
        a_list = li.find_all('a')
        college_name = a_list[0].text
        location = a_list[1].text
        url = 'https://en.wikipedia.org' + a_list[0]['href']
        
        # write one row into the CSV
        c.writerow( [college_name, location, url, item] )

# close and save the file
csvfile.close()


And it works! You have a clean CSV with 120+ schools, and you could use it to make a map, a database, a mailing list (that would require more scraping on other pages), etc. You can sort the schools alphabetically or by location and still know which type of schools they are because of the "type" column. Some schools might not be real (it is Wikipedia, after all), so some more human intelligence would need to be applied to the CSV before it could be used with confidence.

**Note** that *only the code in the final cell* is needed to make this CSV. Many of the cells above were only for testing and working toward this point.

Also needed were:

* import statements for Beautiful Soup, Requests, and csv 
* the cell in which we opened the Wikipedia page and copied all its HTML into a variable named `soup`
* the cell in which we created `h3_list`
* the cell in which we created `headings`

**Note also** that the list of URLs (which might need some cleanup) could be used to further scrape data from each college's Wikipedia page.
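
As a closing sketch, the scraping logic can be gathered into one function and tried on made-up HTML before pointing it at a real page. The function name `collect_rows`, its parameters, and the sample HTML are all assumptions for illustration. Unlike the cells above, this version loops over the H3 elements once, so rows come out in page order, and it skips headings or list items that lack the expected tags instead of raising an error.

```python
from bs4 import BeautifulSoup

def collect_rows(soup, headings, base='https://en.wikipedia.org'):
    """Return [name, location, url, heading] rows for the wanted headings."""
    rows = []
    for head in soup.find_all('h3'):
        span = head.find('span', {'class': 'mw-headline'})
        # skip H3 elements with no headline span, or an unwanted heading
        if span is None or span.text not in headings:
            continue
        the_ul = head.find_next_sibling('ul')
        if the_ul is None:
            continue
        for li in the_ul.find_all('li'):
            a_list = li.find_all('a')
            # skip list items that lack a second link
            if len(a_list) < 2:
                continue
            url = base + a_list[0]['href']
            rows.append([a_list[0].text, a_list[1].text, url, span.text])
    return rows

# made-up HTML that imitates the page structure
sample_html = """
<h3><span class="mw-headline">Wanted</span></h3>
<ul><li><a href="/wiki/A_College">A College</a>, <a href="/wiki/A_Town">A Town</a></li></ul>
<h3>No span here</h3>
<h3><span class="mw-headline">Unwanted</span></h3>
<ul><li><a href="/wiki/B_College">B College</a>, <a href="/wiki/B_Town">B Town</a></li></ul>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
result = collect_rows(sample_soup, ['Wanted'])
print(result)
```

Each row in the returned list can be passed directly to `c.writerow()`.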