<a href="https://colab.research.google.com/github/nurfnick/Data_Viz/blob/main/Content/Data_Collecting/08_html.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reading Data From the Web

Some data is easy to gather directly from the web!  The [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/index.php) has lots of data cleaned up and ready for us to use.  We can also upload csv and other files to GitHub and access the 'raw' version to get to the data.  We did that with the [iris](https://raw.githubusercontent.com/nurfnick/Data_Viz/main/iris.csv) dataset earlier!  How then can we get data from a table in a web page?  We can of course [copy and paste](https://en.wikipedia.org/wiki/Copypasta), but if there are multiple tables or the table is of an odd shape, this sometimes just won't work!  Instead we want to read that data directly from the web.

Reading data from the web is an important task for some data analysis projects.  Web scraping is the gathering of that data.  There are lots of fantastic packages to read and parse html.  `requests` is going to gather the raw html for me.  `BeautifulSoup` will help me parse the code.

In [None]:
import requests
import pandas as pa
from bs4 import BeautifulSoup

Next I am going to look at a simple web page from Wikipedia.  I have been a big fan of **The Simpsons** for many years.  Let's look at the Wikipedia page for them.  [https://en.wikipedia.org/wiki/The_Simpsons](https://en.wikipedia.org/wiki/The_Simpsons)

Let's gather that html!

In [None]:
r = requests.get('https://en.wikipedia.org/wiki/The_Simpsons')
html_contents = r.text
html_soup = BeautifulSoup(html_contents,"lxml")
#html_soup

## Basic Building Blocks

I do not print the html because it is very long!  Let's examine some aspects of the html that we have gathered.

In [None]:
html_soup.title

<title>The Simpsons - Wikipedia</title>

I think the `title` is rather obvious.  It is what shows in my tab!

In [None]:
html_soup.a

<a id="top"></a>

The `a` is an anchor.  Normally that is a hyperlink but this one does not appear to be one!

In [None]:
html_soup.p

<p class="mw-empty-elt">
</p>

`p` stands for paragraph.  This one happens to be empty.  

In [None]:
html_soup.img

<img alt="Featured article" data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/>

`img` is an image.  Next are the headers, which come in six levels, `h1` through `h6`.

In [None]:
html_soup.h2

<h2 id="mw-toc-heading">Contents</h2>
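The six header levels can be collected in one pass. The following is a minimal sketch on a small made-up page (not the Wikipedia html), so it runs without a network connection. It assumes the built-in `"html.parser"` instead of `"lxml"` to avoid an extra dependency; the names `sample_html` and `sample_soup` are only for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page used only for illustration.
sample_html = """
<html><head><title>Sample Page</title></head>
<body>
  <h1>Main Title</h1>
  <h2>Section One</h2>
  <h3>A Subsection</h3>
  <h2>Section Two</h2>
</body></html>
"""

# "html.parser" ships with Python; the cells above use "lxml".
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Passing a list of tag names to find_all matches any of them, in page order.
header_tags = ["h1", "h2", "h3", "h4", "h5", "h6"]
headers = [(tag.name, tag.get_text()) for tag in sample_soup.find_all(header_tags)]

print(headers)
```

Each pair holds the header level and its text, which gives a quick outline of a page.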

The tag we will use the most for this class is `table`.

In [None]:
html_soup.table

<table class="infobox vevent"><tbody><tr><th class="infobox-above summary" colspan="2" style="background: #CCCCFF; padding: 0.25em 1em; font-size: 125%;"><i>The Simpsons</i></th></tr><tr><td class="infobox-image" colspan="2"><a class="image" href="/wiki/File:The_Simpsons_yellow_logo.svg"><img alt="The Simpsons yellow logo.svg" data-file-height="206" data-file-width="464" decoding="async" height="111" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/98/The_Simpsons_yellow_logo.svg/250px-The_Simpsons_yellow_logo.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/98/The_Simpsons_yellow_logo.svg/375px-The_Simpsons_yellow_logo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/98/The_Simpsons_yellow_logo.svg/500px-The_Simpsons_yellow_logo.svg.png 2x" width="250"/></a></td></tr><tr><th class="infobox-label" scope="row">Genre</th><td class="infobox-data category"><div class="plainlist">
<ul><li><a href="/wiki/Animated_sitcom" title="Animated sitcom">Animated si
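Raw table html like this is hard to read. One way to pull values out of it is to walk the rows (`tr`) and read the label (`th`) and data (`td`) cells. Here is a hedged sketch on a made-up table that is only loosely shaped like the infobox above; the contents and the name `table_html` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up infobox-style table; not the real Wikipedia markup.
table_html = """
<table class="infobox">
  <tr><th colspan="2">Example Show</th></tr>
  <tr><th scope="row">Genre</th><td>Animated sitcom</td></tr>
  <tr><th scope="row">Created by</th><td>Jane Doe</td></tr>
</table>
"""

table = BeautifulSoup(table_html, "html.parser").table

# Keep only rows that have both a label cell (th) and a data cell (td).
info = {}
for row in table.find_all("tr"):
    label = row.find("th")
    value = row.find("td")
    if label is not None and value is not None:
        info[label.get_text(strip=True)] = value.get_text(strip=True)

print(info)
```

The title row is skipped because it has no `td`. Real infoboxes are messier, so expect to adjust the filtering for each page.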

You can combine these commands!

In [None]:
html_soup.table.a['href']

'/wiki/File:The_Simpsons_yellow_logo.svg'

This looks like an image on top of the table.  `href` is the link to the file that gives that image.  You should go check out the webpage and see if you can find it!

Notice how I only keep getting the first of something?  There are many more links and tables on the webpage!  Use `find_all` to get all of them.

In [None]:
html_soup.table.find_all('a')

[<a class="image" href="/wiki/File:The_Simpsons_yellow_logo.svg"><img alt="The Simpsons yellow logo.svg" data-file-height="206" data-file-width="464" decoding="async" height="111" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/98/The_Simpsons_yellow_logo.svg/250px-The_Simpsons_yellow_logo.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/98/The_Simpsons_yellow_logo.svg/375px-The_Simpsons_yellow_logo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/98/The_Simpsons_yellow_logo.svg/500px-The_Simpsons_yellow_logo.svg.png 2x" width="250"/></a>,
 <a href="/wiki/Animated_sitcom" title="Animated sitcom">Animated sitcom</a>,
 <a href="/wiki/Satire" title="Satire">Satire</a>,
 <a href="/wiki/Matt_Groening" title="Matt Groening">Matt Groening</a>,
 <a href="/wiki/The_Simpsons_shorts" title="The Simpsons shorts"><i>The Simpsons</i> shorts</a>,
 <a href="/wiki/James_L._Brooks" title="James L. Brooks">James L. Brooks</a>,
 <a href="/wiki/Sam_Simon" title="Sam Sim

This gives all the links from this table, which includes the talent for the show.  You can access each link by its index:

In [None]:
html_soup.table.find_all('a')[1]['href']

'/wiki/Animated_sitcom'
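Indexing with `['href']` assumes the anchor actually has an `href`. The first anchor on the page, `<a id="top"></a>`, did not, and indexing it raises a `KeyError`. A small sketch of two safer options follows, using a made-up snippet (`links_html` is a hypothetical name): `.get('href')` returns `None` when the attribute is missing, and `find_all('a', href=True)` keeps only anchors that have one.

```python
from bs4 import BeautifulSoup

# Made-up snippet: one anchor without an href, two with.
links_html = (
    '<p><a id="top"></a>'
    '<a href="/wiki/Satire">Satire</a>'
    '<a href="/wiki/Matt_Groening">Matt Groening</a></p>'
)
links_soup = BeautifulSoup(links_html, "html.parser")

# .get returns None instead of raising KeyError when the attribute is absent.
first_href = links_soup.a.get("href")

# href=True keeps only the anchors that carry an href attribute.
hrefs = [a["href"] for a in links_soup.find_all("a", href=True)]

print(first_href)
print(hrefs)
```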

If you wanted to do some crawling along the web you might do something like:

In [None]:
links = html_soup.table.find_all('a')
listOfURLS = []

for link in links:
  listOfURLS.append('https://en.wikipedia.org' + link['href'])

listOfURLS

['https://en.wikipedia.org/wiki/File:The_Simpsons_yellow_logo.svg',
 'https://en.wikipedia.org/wiki/Animated_sitcom',
 'https://en.wikipedia.org/wiki/Satire',
 'https://en.wikipedia.org/wiki/Matt_Groening',
 'https://en.wikipedia.org/wiki/The_Simpsons_shorts',
 'https://en.wikipedia.org/wiki/James_L._Brooks',
 'https://en.wikipedia.org/wiki/Sam_Simon',
 'https://en.wikipedia.org/wiki/Dan_Castellaneta',
 'https://en.wikipedia.org/wiki/Julie_Kavner',
 'https://en.wikipedia.org/wiki/Nancy_Cartwright',
 'https://en.wikipedia.org/wiki/Yeardley_Smith',
 'https://en.wikipedia.org/wiki/Hank_Azaria',
 'https://en.wikipedia.org/wiki/Harry_Shearer',
 'https://en.wikipedia.org/wiki/List_of_The_Simpsons_cast_members',
 'https://en.wikipedia.org/wiki/Danny_Elfman',
 'https://en.wikipedia.org/wiki/The_Simpsons_Theme',
 'https://en.wikipedia.org/wiki/Richard_Gibbs',
 'https://en.wikipedia.org/wiki/Alf_Clausen',
 'https://en.wikipedia.org/wiki/Bleeding_Fingers_Music',
 'https://en.wikipedia.org/wiki/Li

It doesn't look like all of these worked, but you should get the general idea!  We could visit each of these pages just like we did above!
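Prepending `'https://en.wikipedia.org'` only works for hrefs that start with a single `/`. A more forgiving option is `urljoin` from the standard library, which resolves an href against the address of the page it came from. In this sketch the first href is from the output above; the other three are made-up examples of shapes you may run into.

```python
from urllib.parse import urljoin

# The page the links were scraped from.
base_url = "https://en.wikipedia.org/wiki/The_Simpsons"

hrefs = [
    "/wiki/Satire",                        # site-relative, as in the table above
    "//upload.wikimedia.org/example.png",  # protocol-relative (made up)
    "#cite_note-1",                        # fragment on the same page (made up)
    "https://example.org/page",            # already absolute (made up)
]

# urljoin leaves absolute URLs alone and fills in whatever is missing otherwise.
absolute_urls = [urljoin(base_url, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```

With simple string concatenation, the last three would have turned into broken addresses.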

## Developer Tools

Your favorite web browser will have developer tools! These will allow you to examine the raw html code while also highlighting the rendered output in your browser. This is very useful for web scraping and figuring out how a website has been constructed! I accessed the developer tools with the F12 key but it may vary for you!

Here is a screenshot of me highlighting the first table.

![simpsonsDevTools](https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Content/devtools.png)

The developer tools are at the bottom and I have grabbed the first table that we have also scraped.  The html is well organized in the developer tools, but the browser might also have called a server and gotten external data from somewhere.  So be aware that what you see here and what you get from `requests.get` may not be the same.
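To make that difference concrete, here is a hedged sketch with a made-up page whose table is filled in by JavaScript. A browser would run the script and show a row, but `BeautifulSoup` only parses the text it is given and never runs scripts, so the table stays empty. The string `raw_html` stands in for what `requests.get(...).text` might return from such a page.

```python
from bs4 import BeautifulSoup

# Made-up page: the table rows are added by JavaScript after the page loads.
raw_html = """
<table id="scores"></table>
<script>
  document.getElementById("scores").innerHTML = "<tr><td>42</td></tr>";
</script>
"""

raw_soup = BeautifulSoup(raw_html, "html.parser")

# The script is never executed, so the parsed table has no rows.
rows = raw_soup.find("table", id="scores").find_all("tr")

print(len(rows))
```

If a table you can see in the developer tools comes back empty like this, the data is probably being loaded separately by the browser.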

## Your Turn

Navigate to the Wikipedia page for your favorite television show or sports club.

1. Display the title for the page
2. Within an interesting table, retrieve all links and store them in a list