<a href="https://colab.research.google.com/github/johnsl01/WebBits/blob/master/Test01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python Colab notebook

Illustrating the use of Beautiful Soup to scrape data from a web page.

This notebook is stored at:

(https://colab.research.google.com/github/johnsl01/WebBits/blob/master/Test01.ipynb)

## Imports

First let's verify that our imports are available - if they are not, we'll have to install them with pip:
 

In [0]:
import requests
from bs4 import BeautifulSoup
print ("Done")

Yes, that works. 

The absence of errors is sufficient to verify this - but the first code snippet in a notebook can take a while to start, so I tend to have it print something simple when it completes.
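If you would rather check for the libraries without triggering an ImportError, here is a minimal sketch using only the standard library. The helper name missing_modules is my own invention for illustration; "requests" and "bs4" are the import names used in this notebook (the pip package that provides bs4 is called beautifulsoup4).

```python
import importlib.util

def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# missing_modules is a hypothetical helper, not part of any library.
missing = missing_modules(["requests", "bs4"])
if missing:
    print("Missing:", ", ".join(missing), "- install them with pip")
else:
    print("Done")
```

In Colab a missing package can usually be installed from a code cell with a line such as !pip install requests beautifulsoup4.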

So let's write the actual Python code, using text cells instead of normal Python comments.

This is much more broken up than necessary - just as an example.

## The web page

We are going to use a Wikipedia page:

(https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States)

So let's assign it to a variable called url. 

And then pass this to the get method of the requests library that we imported above, putting the result into a variable called page. 


In [0]:
url = "https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
page = requests.get(url)
print(page.status_code)
print (type(page))

The page object returned by requests.get() is more than just the text of the page - it is an object of class requests.models.Response (so we would have been better off calling it response), and we can use its attributes and methods to get at the data inside this instance.  For example, the status_code value is the status of the http(s) GET - so 200 for ***success***, 404 for ***not found***, etc.

To illustrate : 

In [0]:
print (page.status_code)
print (type(page))
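
To put names to those numeric codes, here is a small sketch using the standard library's http.HTTPStatus. It does not touch the network; the helper describe_status is a made-up name for illustration only.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '404 Not Found' for an HTTP status code."""
    return f"{code} {HTTPStatus(code).phrase}"

# describe_status is a hypothetical helper, not part of requests.
for code in (200, 400, 404):
    print(describe_status(code))
```

With a real response this could be called as describe_status(page.status_code).  The requests library also has page.raise_for_status(), which raises an exception for 4xx and 5xx responses if you would rather stop than inspect the code yourself.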

Clearly the part of the response we need to get at next is the page content itself, and this is directly accessible via the content attribute of the object.

In [0]:
print(page.content)

In [0]:
print(len(page.content))

And we clearly have the entire Wikipedia page as a bytes object (note the b' at the beginning).  We don't need to worry here about the differences between bytes and Python strings, because all the text we need to look at uses only ASCII characters (the first 128 code points), so their representation in both formats is identical.
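A quick sketch of that bytes versus string point, using a short made-up sample rather than the real page content.  The variable names here are my own and are not used elsewhere in the notebook.

```python
# A short sample standing in for a fragment of page.content (an assumption,
# not the real page data).
raw = b"George Washington"
text = raw.decode("utf-8")      # bytes -> str

print(raw)    # printed with the b' prefix
print(text)   # printed as plain text

# For ASCII-only text the ASCII and UTF-8 encodings give the same bytes.
same_for_ascii = text.encode("ascii") == text.encode("utf-8")

# A non-ASCII character needs more than one byte in UTF-8.
utf8_length = len("é".encode("utf-8"))
print(same_for_ascii, utf8_length)
```

If you do need a string, the Response object also offers page.text, which is the content already decoded by requests.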

So we need to get at subsets of the content of the page to find the data we want.  To do this we are going to use the BeautifulSoup library - and the starting point is to get the page content into a BeautifulSoup object so we can use its methods on the page content.

By convention we call the BeautifulSoup object instance soup - but that's just to make it clear that it is different from the requests object we have so far:

In [0]:
soup = BeautifulSoup(page.content, 'html.parser')

And to verify the content we can use one of the BeautifulSoup methods.  You can clear the output after examining it, as it is long and gets in the way of the next bit.

In [0]:
print(soup.prettify())


So we can clearly see that the soup object has the page contents, and its methods can process the content and return the processed result.

Of course just laying out the page in a pretty manner is fairly trivial - what we want to do is get at bits within it.  And to do this we need to understand how the page is constructed.  Scrolling up and down through the prettified output gives us some idea - but using the developer tools in a browser is a more direct way of exploring a page to understand its structure.

After exploring the page we see that the names of the presidents are in a large 'table' and this table has a class identifier of 'wikitable'.  And this is the first table of this class on the page.

So we can use these two bits of information to 'chop' a bit of the page content into a new object.

To do this we use the find() method of the soup object.  We pass it the type of html element we want to find - a html 'table' - and we select the specific table by passing it a value to match: the class should equal 'wikitable'.  Because class is a Python reserved word, the method takes a keyword argument named class_ (with a trailing underscore) instead, so we tell it to look for class_ equal to 'wikitable':

In [0]:
tb = soup.find('table', class_='wikitable')

Importantly, the object returned by the find() method (a bs4 Tag) supports the same search methods as the BeautifulSoup object, so we can further refine it using the same range of methods.
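To see how find() behaves without downloading anything, here is a small sketch on a made-up html fragment.  The fragment and the names sample_soup, first and nothing are assumptions for illustration; only find() and class_ come from the notebook itself.

```python
from bs4 import BeautifulSoup

# Made-up html with three tables; only two carry the 'wikitable' class.
html = """
<table class="plain"><tr><td>skip me</td></tr></table>
<table class="wikitable sortable"><tr><td>first match</td></tr></table>
<table class="wikitable"><tr><td>second match</td></tr></table>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching element.
first = sample_soup.find("table", class_="wikitable")
print(type(first).__name__)   # Tag
print(first.td.text)          # first match

# When nothing matches, find() returns None rather than raising an error.
nothing = sample_soup.find("table", class_="no-such-class")
print(nothing)                # None
```

Note that class_ matches a table whose class attribute lists several classes, as long as 'wikitable' is one of them.  The None result is worth remembering: if a page changes its layout, calling a method on the result of find() will fail with an AttributeError.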



So let's look at it 

In [0]:
print(tb.prettify())

Note that the returned object starts with the opening tag of the html table we searched for and ends with its closing tag:    < table  .....  > ..... < /table >

The table we have found contains the names we are looking for - and by examining the html, and the page, we can see that the names we want are in large text inside a < big> ... < /big> section.

However we want more than the first president so we use a different method to get all the matching sections - each into its own small BeautifulSoup object.  And we directly call a for loop to process each one further and pull out the name.

To start let's just look at the first one : 


In [0]:
print(tb.find('big').prettify())

This is much more manageable - the name we want is there, and conveniently it is the only actual text in the object we have.  There is a text attribute we can access to get at just this text.  By text here we mean text not inside an html tag.
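As a rough illustration of what the text attribute gives back, here is a sketch on a made-up fragment (the name, the link and the variable names are invented for the example).

```python
from bs4 import BeautifulSoup

# Invented fragment: a link with a nested <b> tag inside a <big> section.
snippet = BeautifulSoup(
    '<big><a href="/wiki/X">Jane <b>Q.</b> Sample</a></big>', "html.parser"
)

big_text = snippet.big.text        # all text inside <big>, with the tags stripped
link_target = snippet.big.a["href"]  # attributes are read like dictionary keys
print(big_text)      # Jane Q. Sample
print(link_target)   # /wiki/X
```

So text joins together every piece of text found inside the element, however deeply it is nested, and leaves the tags and their attributes behind.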

However, just to be sure, let's refine it further and find the < a> element within this object.

And to get at all the names we will use the find_all() method, which will return all the < big> sections, not just the first one:

In [0]:
for name_link in tb.find_all('big'):
  name = name_link.find('a').text
  print(name)
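
The loop above assumes every < big> section contains a link.  Here is a slightly more defensive sketch on a made-up table, which skips sections without a link and collects the names into a list.  The html, the names in it and the variables sample_tb and names are all invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up table: the second <big> section has no link inside it.
html = """
<table class="wikitable">
  <tr><td><big><a href="/wiki/A">Alice Example</a></big></td></tr>
  <tr><td><big>No link here</big></td></tr>
  <tr><td><big><a href="/wiki/B">Bob Example</a></big></td></tr>
</table>
"""
sample_tb = BeautifulSoup(html, "html.parser").find("table", class_="wikitable")

names = []
for big in sample_tb.find_all("big"):
    link = big.find("a")
    if link is None:
        # find() returns None when there is no <a>, so skip this section.
        continue
    names.append(link.get_text(strip=True))

print(names)   # ['Alice Example', 'Bob Example']
```

Keeping the names in a list rather than printing them directly makes it easier to do something else with them afterwards, such as counting or saving them.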

And there we have it.  In a few lines of code we have requested a web page, created a BeautifulSoup object, and used the BeautifulSoup methods to access parts of the page and work into it to pull out the bits we want.



# Credits

I didn't create this example myself - I was looking for a simple example and I found this one, which is very simple but works very well as a basic introduction to the process - and more importantly allowed me to focus on the notebook rather than the requests or BeautifulSoup parts of the exercise.

From https://www.codementor.io/dankhan/web-scrapping-using-python-and-beautifulsoup-o3hxadit4



