In [None]:
import numpy as np
import pandas as pd

## The requests library

In [None]:
import requests
page = requests.get("https://en.wikipedia.org/wiki/Nova_School_of_Business_and_Economics")
page

After running our request, we get a Response object. This object has a ```status_code``` property, which indicates if the page was downloaded successfully:

In [None]:
page.status_code

A status code of 200 means that the request was successfully received, understood, and accepted.

In [None]:
requests.get("https://en.wikipedia.org/wiki/Nova_School_of_Business_and_Economics2")

As you can see, requesting a page that does not exist returns a different status code. The full list of HTTP status codes is available [here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
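As a small illustration of how status codes map to meanings, the sketch below uses Python's standard `http.HTTPStatus` enum instead of a live request. The helper name `describe_status` is hypothetical and not part of the requests library.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code.

    describe_status is a hypothetical helper written for this example.
    """
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return f"{code}: unknown status code"
    return f"{code}: {status.phrase}"


# Look up a successful response, a missing page, and a server error.
for code in (200, 404, 500):
    print(describe_status(code))
```

With a real Response object, the same helper could be called as `describe_status(page.status_code)`.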

We can print out the HTML content of the page using the ```content``` property:

In [None]:
page.content

## Parsing a page with BeautifulSoup

As you can see above, we now have downloaded an HTML document.

We can use the BeautifulSoup library to parse this document, and extract the text from the ```p``` tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:

In [None]:
print(soup.prettify())

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns an iterator rather than a list, so we need to call the list function on it:

In [None]:
list(soup.children)

The above tells us that there are two elements at the top level of the page: the initial <!DOCTYPE html> declaration and the <html> tag. There is a newline character (\n) in the list as well. Let’s see what the type of each element in the list is:

In [None]:
[type(item) for item in list(soup.children)]

As you can see, all of the items are BeautifulSoup objects. The first is a ```Doctype``` object, which contains information about the type of the document. The second is a ```NavigableString```, which represents text found in the HTML document. The final item is a ```Tag``` object, which contains other nested tags. The most important object type, and the one we’ll deal with most often, is the Tag object.

The Tag object allows us to navigate through an HTML document, and extract other tags and text. You can learn more about the various BeautifulSoup objects [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects).
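To make the three object types concrete without downloading anything, here is a minimal sketch that parses a tiny hand-written document. The HTML string and the `sample_` names are assumptions made for illustration; they are not taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A tiny hand-written document (an assumption for illustration only).
sample_html = (
    "<!DOCTYPE html>\n"
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello</p></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Record the class name of every top-level node.
kinds = [type(node).__name__ for node in sample_soup.children]
print(kinds)

# Keep only the Tag objects, since those are the nodes we can descend into.
top_tags = [node for node in sample_soup.children if isinstance(node, Tag)]
print([tag.name for tag in top_tags])
```

Filtering with `isinstance(node, Tag)` is one way to avoid relying on a fixed list position such as index 2, which can shift if the page's whitespace changes.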

We can now select the html tag and its children by taking the third item in the list:

In [None]:
html = list(soup.children)[2]
html

Each item in the list returned by the children property is also a BeautifulSoup object, so we can also use the children property on html.

Now, we can find the children inside the html tag:

In [None]:
list(html.children)

As you can see above, there are many tags here, including head, title, script and body. We want to extract the text inside the title tag, so we’ll dive into the head first:

In [None]:
head = list(html.children)[1]
head

In [None]:
title = list(head.children)[3]
title

Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:

In [None]:
title.get_text()

### Finding all instances of a tag at once

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. To extract tags directly, we can instead use the ```find``` and ```find_all``` methods: ```find_all``` finds every instance of a tag on a page, while ```find``` returns only the first.

If you only want to find the first instance of a tag, you can use the ```find``` method, which will return a single ```Tag``` object (or ```None``` if nothing matches):

In [None]:
soup.find('title').get_text()

Note that find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract text:

In [None]:
soup.find_all('p')

In [None]:
[paragraph for paragraph in soup.find_all('p')
 if "Financial Times" in paragraph.get_text()]
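The difference between ```find``` and ```find_all``` can be sketched on a small hand-written snippet. The HTML and the `sample_` names below are made up for illustration and do not come from the downloaded page.

```python
from bs4 import BeautifulSoup

# Hand-written HTML used only for this example.
sample_html = """
<html><body>
<p>First paragraph.</p>
<p>Second paragraph mentions rankings.</p>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find returns the first matching tag, or None when nothing matches.
first = sample_soup.find("p")
missing = sample_soup.find("table")

# find_all returns every matching tag as a list-like result.
all_paragraphs = sample_soup.find_all("p")

# Extract the text of each paragraph, then filter by a keyword.
texts = [p.get_text() for p in all_paragraphs]
matching = [text for text in texts if "rankings" in text]
print(matching)
```

Because ```find``` can return ```None```, it is usually worth checking the result before calling ```get_text``` on it.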

### Searching for tags by class and id

We introduced classes and ids earlier, but it probably wasn’t clear why they were useful. Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to target the specific elements we want to extract.

In [None]:
soup.find_all(class_="street-address")

In [None]:
soup.find_all(class_="nickname")

In [None]:
soup.find_all("h2")

Similarly, you can find a specific tag by its id:

In [None]:
soup.find_all("h2", id="mw-toc-heading")
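The same class and id lookups can be tried on a small hand-written snippet. The markup and values below are invented for illustration; only the class names echo the ones used above.

```python
from bs4 import BeautifulSoup

# Invented markup: the class names mirror the examples above, the values do not.
sample_html = """
<div id="infobox">
  <span class="nickname">Example nickname</span>
  <span class="street-address">Example Street 1</span>
  <span class="nickname highlight">Another nickname</span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any element that carries the class, even alongside other classes.
nicknames = sample_soup.find_all(class_="nickname")
print([tag.get_text() for tag in nicknames])

# A tag name and a class can be combined to narrow the search.
addresses = sample_soup.find_all("span", class_="street-address")

# Ids are meant to be unique, so find is usually enough.
box = sample_soup.find(id="infobox")
print(box.name)
```

The keyword is spelled ```class_``` with a trailing underscore because ```class``` is a reserved word in Python.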

In [None]:
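from bs4 import BeautifulSoup
from lxml import html as lxml_html  # aliased so it does not shadow the html variable above

# Re-parse with the lxml parser, and build an lxml tree that supports XPath queries
soup = BeautifulSoup(page.content, 'lxml')
tree = lxml_html.fromstring(page.content)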

In [None]:
tree.xpath('//*[@id="mw-content-text"]/div[1]/table')

In [None]:
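for row in tree.xpath('//*[@id="mw-content-text"]/div[1]/table//tr'):
    # The leading dot makes the query relative to the current row;
    # '//td' alone would search the whole document on every iteration
    print(row.xpath('.//td//text()'))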
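When looping over rows with XPath, it matters whether the inner expression is relative to the row. The sketch below, run on a hand-written table (an assumption for illustration, not the Wikipedia infobox), compares a relative query starting with `.//` against an absolute one starting with `//`.

```python
from lxml import html as lxml_html

# Hand-written table used only for this example.
sample_page = """
<html><body>
<table>
  <tr><td>Type</td><td>Public</td></tr>
  <tr><td>Country</td><td>Portugal</td></tr>
</table>
</body></html>
"""
sample_tree = lxml_html.fromstring(sample_page)
rows = sample_tree.xpath("//tr")

# Relative query: only the cells inside the current row.
relative = [row.xpath(".//td//text()") for row in rows]

# Absolute query: every cell in the document, repeated for each row.
absolute = [row.xpath("//td//text()") for row in rows]

print(relative)
print(absolute)
```

The relative form yields one short list per row, which is usually what a table scrape needs.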

## Data Access using API

In [None]:
response = requests.get('https://jsonplaceholder.typicode.com/todos')

We can use the pandas ```read_json``` function to read the JSON data contained in the response:

In [None]:
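from io import StringIO  # recent pandas versions expect a file-like object rather than a raw JSON string
todos = pd.read_json(StringIO(response.text))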
todos
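An alternative route is to decode the JSON into Python objects first and then build the DataFrame from them. The sketch below uses a few hand-written records shaped like the todos returned by the API (the values are invented) so that it runs without a network connection.

```python
import json

import pandas as pd

# Invented records that mirror the field names of the todos endpoint.
sample_json = """
[
  {"userId": 1, "id": 1, "title": "first task", "completed": false},
  {"userId": 1, "id": 2, "title": "second task", "completed": true},
  {"userId": 2, "id": 3, "title": "third task", "completed": false}
]
"""

# Decode the JSON text into a list of dictionaries.
records = json.loads(sample_json)

# A list of dictionaries maps naturally onto rows and columns.
sample_todos = pd.DataFrame(records)
print(sample_todos)

# Boolean columns can be used directly as filters.
open_tasks = sample_todos[~sample_todos["completed"]]
print(open_tasks["title"].tolist())
```

With a live response, `response.json()` would take the place of `json.loads(sample_json)` here.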

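Write the DataFrame to an Excel file (writing the .xlsx format requires the openpyxl package; recent pandas versions can no longer write the legacy .xls format):

In [None]:
todos.to_excel('todos.xlsx')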

Now try it yourself with https://jsonplaceholder.typicode.com/photos