## Accessing web data:
1. Making requests
2. Parsing through json data
3. Fetching data through html tags using beautiful soup

### Part 1: Making requests
1. Using requests module

Other libraries can also talk http and make different kinds of requests, but in my experience the requests module is sufficient for the majority of needs. One can also use urllib, urllib2 and urllib3 for similar tasks. If you already know how to make web requests in python and don't want to use the requests module, you can skip this section and continue with xml and html parsing.

In [1]:
import requests
import os

We will use the OpenWeatherMap api to demonstrate how http requests are made. Go to <a href='http://openweathermap.org/appid'>this link</a> to create an account and generate an api key. Once you have a key, you can try one of the api calls listed <a href='http://openweathermap.org/api'>here</a>. For example, visiting http://api.openweathermap.org/data/2.5/forecast?id=524901&APPID=your_key in a browser makes one such call; you need to put your own api key in place of your_key for this to work.

We have now seen how to make the api call from a browser (it was just a url that we visited). Let's see how to do the same task programmatically.

In [2]:
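os.chdir(r'E:\Work\Python\Python Trainings')##raw string, so the backslashes are never read as escape sequences
with open('open_weather_api.txt','r') as f:##the file is closed automatically when the block ends
    key=f.read()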

In [3]:
base_url='http://api.openweathermap.org/data/2.5/weather?q=London&APPID='
url=base_url+key
request=requests.get(url.strip())##The raw text file has whitespaces after the key value


To find out whether the request was successful, check the http status code. A status code of 200 means the request succeeded. <a href='https://en.wikipedia.org/wiki/List_of_HTTP_status_codes'>Here is a comprehensive list of http status codes and what they mean</a>

In [4]:
print request.status_code

200
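
Besides comparing `status_code` with 200 by hand, a response object also exposes `ok` and `raise_for_status()`. The sketch below is only illustrative: the `Response` objects are built by hand (an assumption made so that the example runs without network access or an api key), and `describe` is a hypothetical helper name. The 401 code is used because it is the standard http status for missing or invalid credentials.

```python
import requests

def describe(response):
    """Return a short label for a response based on its status code."""
    try:
        # raise_for_status() raises HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        return 'failed with status %d' % response.status_code
    return 'succeeded with status %d' % response.status_code

# Hand-built responses stand in for real ones (assumption, see above)
good = requests.Response()
good.status_code = 200
bad = requests.Response()
bad.status_code = 401

good_label = describe(good)
bad_label = describe(bad)
```

With a live call, the object returned by `requests.get()` can be passed to the same helper.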


Sometimes requests.get() will not be able to fetch data because the default request does not look like one sent by a browser. This can often be remedied by supplying appropriate headers. <a href='http://docs.python-requests.org/en/master/user/quickstart/#custom-headers'>See the official docs for how to supply custom headers</a>

In [5]:
headers={'user-agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
request=requests.get(url.strip(),headers=headers)##pass headers by keyword; the second positional argument of get() is params

In [6]:
print request.status_code

200
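
Query parameters and headers are separate keyword arguments in requests (`params=` and `headers=`). A minimal sketch of how the two end up in the outgoing request: it prepares the request without sending it, so no network access is needed. The value `your_key` is a placeholder, not a real api key.

```python
import requests

# 'your_key' is a placeholder for a real api key
params = {'q': 'London', 'APPID': 'your_key'}
headers = {'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}

# Build the request without sending it, to inspect what would be sent
prepared = requests.Request(
    'GET',
    'http://api.openweathermap.org/data/2.5/weather',
    params=params,
    headers=headers
).prepare()

final_url = prepared.url                       # base url plus the encoded query string
sent_agent = prepared.headers['user-agent']    # the custom header is attached as given
```

For a live call the same keyword arguments go straight to `requests.get(url, params=params, headers=headers)`, which also removes the need to concatenate the key onto the url by hand.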


One can read the contents of the response through specific attributes and methods of the response object. Here is the list:
1. <a href='http://docs.python-requests.org/en/master/user/quickstart/#response-content'>Response Content</a>
2. <a href='http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content'>Binary Response</a>
3. <a href='http://docs.python-requests.org/en/master/user/quickstart/#json-response-content'>JSON Response</a>
4. <a href='http://docs.python-requests.org/en/master/user/quickstart/#raw-response-content'> Raw Response</a>

In [7]:
print request.text

{"coord":{"lon":-0.13,"lat":51.51},"weather":[{"id":310,"main":"Drizzle","description":"light intensity drizzle rain","icon":"09d"}],"base":"stations","main":{"temp":287.78,"pressure":1010,"humidity":87,"temp_min":287.15,"temp_max":289.15},"visibility":10000,"wind":{"speed":3.6,"deg":300},"clouds":{"all":90},"dt":1500875400,"sys":{"type":1,"id":5091,"message":0.0047,"country":"GB","sunrise":1500869588,"sunset":1500926402},"id":2643743,"name":"London","cod":200}


In [8]:
data=request.text

In [9]:
type(data)

unicode

In [10]:
data=request.json()

In [11]:
type(data)

dict

In [12]:
print data.keys()

[u'clouds', u'name', u'visibility', u'sys', u'weather', u'coord', u'base', u'dt', u'main', u'id', u'wind', u'cod']


In [13]:
print data['main']

{u'pressure': 1010, u'temp_min': 287.15, u'temp_max': 289.15, u'temp': 287.78, u'humidity': 87}
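
The dictionary is nested: some keys hold dictionaries and `weather` holds a list of dictionaries. A short sketch of reaching into it, using an abridged copy of the json text printed earlier and `json.loads` in place of a live `request.json()` call (an assumption made so that the example runs offline):

```python
import json

# Abridged copy of the json text printed earlier
raw = ('{"weather":[{"id":310,"main":"Drizzle","description":"light intensity drizzle rain"}],'
       '"main":{"temp":287.78,"humidity":87},"name":"London"}')

data = json.loads(raw)  # the same kind of dict that request.json() returns

description = data['weather'][0]['description']          # 'weather' is a list, so index it first
temp_celsius = round(data['main']['temp'] - 273.15, 2)   # the temperature is reported in kelvin
```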


In [14]:
## Can you extract the country name?
## Can you extract the sunrise and sunset times? (they are given in unix time format)
## Use the file Api hands on.docx and answer the questions that follow.
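
As a hint for the unix time values: they count seconds since 1970-01-01 00:00:00 UTC. A minimal sketch that converts the `dt` field of the response shown earlier using only the standard library:

```python
import time

dt = 1500875400  # the 'dt' field from the response shown earlier

# time.gmtime() interprets the value as seconds since the epoch, in UTC
readable = time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(dt))
# readable is '2017-07-24 05:50:00'
```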

OpenWeatherMap lets you choose the format of the response. By default the response is a json object, though one can get <a href='https://openweathermap.org/current#other'>an xml or an html response as well</a>

In [15]:
base_url='http://api.openweathermap.org/data/2.5/weather?q=London&mode=xml&APPID='
url=base_url+key
request_xml=requests.get(url.strip())##The raw text file has whitespaces after the key value


In [16]:
print request_xml.text

<?xml version="1.0" encoding="UTF-8"?>
<current><city id="2643743" name="London"><coord lon="-0.13" lat="51.51"></coord><country>GB</country><sun rise="2017-07-24T04:13:08" set="2017-07-24T20:00:02"></sun></city><temperature value="287.78" min="287.15" max="289.15" unit="kelvin"></temperature><humidity value="87" unit="%"></humidity><pressure value="1010" unit="hPa"></pressure><wind><speed value="3.6" name="Gentle Breeze"></speed><gusts></gusts><direction value="300" code="WNW" name="West-northwest"></direction></wind><clouds value="90" name="overcast clouds"></clouds><visibility value="10000"></visibility><precipitation mode="no"></precipitation><weather number="310" value="light intensity drizzle rain" icon="09d"></weather><lastupdate value="2017-07-24T05:50:00"></lastupdate></current>


In [17]:
type(request_xml.text)

unicode

The requests module has no method for handling an xml response. Eventually every form of response should be converted into a data structure native to python: the json() method gave us a dictionary, but as the next cells show, the response object provides no equivalent for xml.

In [18]:
print dir(request_xml)

['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getstate__', '__hash__', '__init__', '__iter__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']


In [19]:
"xml" in dir(request_xml)

False
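
Since the response object has no xml helper, the text has to be handed to a parser. A minimal sketch using the standard library's `xml.etree.ElementTree` on an abridged copy of the response shown above (the copy omits the xml declaration line; this is one possible parser, not the only one):

```python
import xml.etree.ElementTree as ET

# Abridged copy of the xml response shown above
xml_text = (
    '<current>'
    '<city id="2643743" name="London"><country>GB</country></city>'
    '<temperature value="287.78" min="287.15" max="289.15" unit="kelvin"></temperature>'
    '<humidity value="87" unit="%"></humidity>'
    '</current>'
)

root = ET.fromstring(xml_text)  # root is the <current> element

city_name = root.find('city').get('name')                     # attribute of a tag
country = root.find('city/country').text                      # text inside a nested tag
temperature = float(root.find('temperature').get('value'))    # attributes are strings
```

With a live response, `request_xml.content` can be passed to `ET.fromstring` in the same way.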

The api exposes an html response as well, but again the response object has no method to parse it. Let's use the api to get an html response and see what the text attribute of the response returns.

In [20]:
base_url='http://api.openweathermap.org/data/2.5/weather?q=London&mode=html&APPID='
url=base_url+key
request_html=requests.get(url.strip())##The raw text file has whitespaces after the key value


In [21]:
print request_html.text

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="keywords" content="weather, world, openweathermap, weather, layer" />
  <meta name="description" content="A layer with current weather conditions in cities for world wide" />
  <meta name="domain" content="openweathermap.org" />
  <meta http-equiv="pragma" content="no-cache" />
  <meta http-equiv="Expires" content="-1" />
</head>
<body>
  <div style="font-size: medium; font-weight: bold; margin-bottom: 0px;">London</div>
  <div style="float: left; width: 130px;">
    <div style="display: block; clear: left;">
      <div style="float: left;" title="Titel">
        <img height="45" width="45" style="border: medium none; width: 45px; height: 45px; background: url(&quot;http://openweathermap.org/img/w/09d.png&quot;) repeat scroll 0% 0% transparent;" alt="title" src="http://openweathermap.org/images/transparent.png"/>
      </div>
      <div style="float: left;">
        <div style="display: block; clear: left; f

In [22]:
print type(request_html.text)
print type(request_html.content)

<type 'unicode'>
<type 'str'>


In [23]:
os.chdir(r'E:\Work\Python\Python Trainings\Python Advanced\Code\Day_2')

In [24]:
f=open('sample.html','w')

In [25]:
f.write(request_html.content)##write the whole response in one call instead of one character at a time

In [26]:
f.close()

### Part 2: Parsing html using beautifulsoup4
We saw an html response earlier from the api call that we made. One also obtains an html response when a request is made to a web page. The discussion below focuses on how to parse an html response obtained via the requests module. The markup will be navigated as a tree of tags, following the DOM model.

In [27]:
from bs4 import BeautifulSoup

We will first look at the basic objects and classes that are provided by bs4.

<img src="BeautifulSoup.png">

We will read in a file with html markup and then introduce you to bs4 objects

In [28]:
os.chdir(r'E:\Work\Python\Python Trainings\Python Advanced\Code\Day_2')
f=open('html.html','r')
html=f.read()
f.close()

In [29]:
##The first thing that one needs to do after acquiring the markup is to convert it into a soup object
html_soup=BeautifulSoup(html,'html.parser')
print html_soup.prettify()

<!DOCTYPE html>
<html>
 <head>
  <title>
   GETTING STARTED WITH bs4
  </title>
 </head>
 <div class="para 1">
  <p>
   This is paragraph one
  </p>
  <p>
   This is paragraph two
  </p>
 </div>
 <div class="para 2">
  <p>
   This is para 1 in div 2
  </p>
 </div>
 <div class="para 1">
  <p>
   This is paragraph three of div with class para 1
  </p>
  <p>
   This is paragraph four of div with class para 1
  </p>
 </div>
</html>


In [30]:
print type(html_soup)

<class 'bs4.BeautifulSoup'>


We will look into how following tasks can be done:
1. Selecting specific elements (based on html tags)
2. Extracting the text from tags
3. Traversing the html tree

In [31]:
##Selecting specific elements
head=html_soup.head
print type(head)

<class 'bs4.element.Tag'>


<img src='Tag Object.png'>

In [32]:
print head.name
print head.attrs
print head

head
{}
<head>
<title>GETTING STARTED WITH bs4</title>
</head>


In [33]:
## Extracting text from tags
print html_soup.head.title
print type(html_soup.head.title)
print html_soup.head.title.contents
print html_soup.head.contents

<title>GETTING STARTED WITH bs4</title>
<class 'bs4.element.Tag'>
[u'GETTING STARTED WITH bs4']
[u'\n', <title>GETTING STARTED WITH bs4</title>, u'\n']


In [34]:
## Extracting text from tags
print html_soup.head.title.string
print type(html_soup.head.title.string)

GETTING STARTED WITH bs4
<class 'bs4.element.NavigableString'>


In [35]:
## Extracting text from tags
print html_soup.head.title.text
print type(html_soup.head.title.text)

GETTING STARTED WITH bs4
<type 'unicode'>


In [36]:
## Traversing the html tree
# One can go deep into the tree by using appropriate tag methods
print html_soup.prettify()

<!DOCTYPE html>
<html>
 <head>
  <title>
   GETTING STARTED WITH bs4
  </title>
 </head>
 <div class="para 1">
  <p>
   This is paragraph one
  </p>
  <p>
   This is paragraph two
  </p>
 </div>
 <div class="para 2">
  <p>
   This is para 1 in div 2
  </p>
 </div>
 <div class="para 1">
  <p>
   This is paragraph three of div with class para 1
  </p>
  <p>
   This is paragraph four of div with class para 1
  </p>
 </div>
</html>


In [37]:
##Suppose we want to traverse to the title tag
print html_soup.head.title

<title>GETTING STARTED WITH bs4</title>


In [38]:
##Getting the text
print html_soup.head.title.text


GETTING STARTED WITH bs4
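
The cells above read text through both `.string` and `.text`. The two agree for a tag such as `title` that holds a single string, but differ once a tag has several children. A small sketch on made-up markup (the `snippet` string is an assumption, it is not part of html.html):

```python
from bs4 import BeautifulSoup

snippet = '<div><p> one </p><p> two </p></div>'
soup = BeautifulSoup(snippet, 'html.parser')

single = soup.p.string                         # the only child of <p> is a string
multiple = soup.div.string                     # None, because the div has two children
joined = soup.div.get_text(' ', strip=True)    # strip each piece, join with a space
```

`get_text()` with `strip=True` is often a convenient way to avoid calling `strip()` on every extracted string.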


In [39]:
##Suppose we want to traverse to first para in div with class para 1
print html_soup.div 
##only the first occurrence of div is returned, although the document has 3 occurrences of div

<div class="para 1">
<p>
            This is paragraph one
        </p>
<p>
            This is paragraph two
        </p>
</div>


In [40]:
print html_soup.div.p

<p>
            This is paragraph one
        </p>


In [41]:
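## What if we wanted to look at the second para in the tree?
## next_sibling of the first <p> is the whitespace between the tags, so only blank lines are printed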
print html_soup.div.p.next_sibling





In [42]:
## What if we wanted to look at the second para in the tree?
print html_soup.div.p.next_sibling.next_sibling

<p>
            This is paragraph two
        </p>


In [43]:
## What if we wanted to look at the second para in the tree?
print html_soup.div.p.next_sibling.next_sibling.contents

[u'\n            This is paragraph two\n        ']


In [44]:
## What if we wanted to look at the second para in the tree?
print html_soup.div.p.next_sibling.next_sibling.string


            This is paragraph two
        


In [45]:
## What if we wanted to look at the second para in the tree?
print html_soup.div.p.next_sibling.next_sibling.text


            This is paragraph two
        


In [46]:
## Traversing the html tree
# One can go deep into the tree by using appropriate tag methods
print html_soup.prettify()
#Suppose we wanted to extract the first para in div with class para 2?

<!DOCTYPE html>
<html>
 <head>
  <title>
   GETTING STARTED WITH bs4
  </title>
 </head>
 <div class="para 1">
  <p>
   This is paragraph one
  </p>
  <p>
   This is paragraph two
  </p>
 </div>
 <div class="para 2">
  <p>
   This is para 1 in div 2
  </p>
 </div>
 <div class="para 1">
  <p>
   This is paragraph three of div with class para 1
  </p>
  <p>
   This is paragraph four of div with class para 1
  </p>
 </div>
</html>


In [47]:
## There are find methods that help us do that
print html_soup.find('div',class_='para 2')

<div class="para 2">
<p>
            This is para 1 in div 2
        </p>
</div>


In [48]:
print type(html_soup.find('div',class_='para 2'))

<class 'bs4.element.Tag'>


In [49]:
## There are find methods that help us do that
print html_soup.find_all('div',class_='para 2')

[<div class="para 2">\n<p>\n            This is para 1 in div 2\n        </p>\n</div>]


In [50]:
print type(html_soup.find_all('div',class_='para 2'))

<class 'bs4.element.ResultSet'>


In [51]:
## Suppose we wanted to extract all the text within div with class para 1?
html_soup.find_all('div',class_='para 1')

[<div class="para 1">\n<p>\n            This is paragraph one\n        </p>\n<p>\n            This is paragraph two\n        </p>\n</div>,
 <div class="para 1">\n<p>\n            This is paragraph three of div with class para 1\n        </p>\n<p>\n            This is paragraph four of div with class para 1\n        </p>\n</div>]

In [52]:
len(html_soup.find_all('div',class_='para 1'))

2
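
One detail worth knowing here: bs4 treats `class` as a multi-valued attribute, so `class="para 1"` is really the two classes `para` and `1`. Passing the full string to `class_` matches the exact attribute value, while passing a single class matches any tag that carries it. A sketch on made-up markup (the third div, with the classes in reverse order, is an assumption added to show the difference):

```python
from bs4 import BeautifulSoup

markup = ('<div class="para 1">a</div>'
          '<div class="para 2">b</div>'
          '<div class="1 para">c</div>')
soup = BeautifulSoup(markup, 'html.parser')

classes = soup.div['class']                       # a list of the individual classes
exact = soup.find_all('div', class_='para 1')     # exact attribute string only
by_token = soup.find_all('div', class_='para')    # any div carrying the class 'para'
```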

In [53]:
html_soup.find_all('div',class_='para 1')[0].text

u'\n\n            This is paragraph one\n        \n\n            This is paragraph two\n        \n'
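
`find_all()` returns a list-like ResultSet that can be indexed as above, while `find()` returns a single Tag. They also behave differently when nothing matches, which matters before chaining further attribute access. A small sketch (the class `para 9` is a made-up value that matches nothing):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="para 2"><p>This is para 1 in div 2</p></div>', 'html.parser')

hit = soup.find('div', class_='para 2')         # a Tag
miss = soup.find('div', class_='para 9')        # None, because nothing matches
hits = soup.find_all('div', class_='para 9')    # empty ResultSet, safe to loop over

# Guard against None before reaching into the result of find()
text = miss.p.text if miss is not None else ''
```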

In [54]:
## We can also loop
for t in html_soup.find_all('div',class_='para 1'):
    print t.text



            This is paragraph one
        

            This is paragraph two
        



            This is paragraph three of div with class para 1
        

            This is paragraph four of div with class para 1
        



In [55]:
##Suppose we wanted only the second paragraph in each div?
for t in html_soup.find_all('div',class_='para 1'):
    print t.p.next_sibling.next_sibling.text


            This is paragraph two
        

            This is paragraph four of div with class para 1
        


In [66]:
##Suppose we wanted only the second paragraph in each div?
for t in html_soup.find_all('div',class_='para 1'):
    print t.p.next_sibling.next_sibling.text.strip()

This is paragraph two
This is paragraph four of div with class para 1
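
Chaining `next_sibling` twice works because a whitespace string sits between the two `<p>` tags. Two alternatives that do not depend on that whitespace are sketched below, on a trimmed copy of the first div:

```python
from bs4 import BeautifulSoup

markup = """
<div class="para 1">
    <p>This is paragraph one</p>
    <p>This is paragraph two</p>
</div>
"""
soup = BeautifulSoup(markup, 'html.parser')
first_p = soup.div.p

whitespace = first_p.next_sibling              # the text node between the two <p> tags
second_p = first_p.find_next_sibling('p')      # skips text nodes, returns the next <p> tag
also_second = soup.div.find_all('p')[1]        # index into all <p> tags of the div
```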


## Demo: Combining requests + beautifulsoup to extract top 100 favourite movie quotes

We will use the list at http://www.imdb.com/list/ls000029269/ to scrape the quotes and write them out to a text file. A rough sequence of steps would be:
1. Use requests to get the html markup
2. Create a soup object
3. Use appropriate Tag methods to grab the data
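
The last two steps can be tried offline before hitting the real page. The sketch below is an assumption-laden stand-in: `sample_html` is made-up markup, `extract_descriptions` is a hypothetical helper name, and the `description` class is simply the one used in the cells that follow, so the live page may be structured differently.

```python
import io
import os
import tempfile
from bs4 import BeautifulSoup

def extract_descriptions(html):
    """Return the stripped text of every <div class="description"> in html."""
    soup = BeautifulSoup(html, 'html.parser')
    return [div.get_text().strip() for div in soup.find_all('div', class_='description')]

# Made-up markup; with a live page this would be requests.get(url).content
sample_html = u"""
<div class="description"> "First sample quote" </div>
<div class="other">not a quote</div>
<div class="description"> "Second sample quote" </div>
"""

quotes = extract_descriptions(sample_html)

# io.open takes care of the utf-8 encoding when writing
out_path = os.path.join(tempfile.gettempdir(), 'quotes_sample.txt')
with io.open(out_path, 'w', encoding='utf-8') as f:
    for quote in quotes:
        f.write(quote + u'\n')
```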

In [None]:
url='http://www.imdb.com/list/ls000029269/'
imdb_html=requests.get(url)
imdb_html=imdb_html.content

In [None]:
imdb_soup=BeautifulSoup(imdb_html,'html.parser')

In [None]:
print imdb_soup.prettify()[0:1000]

In [None]:
with open('quotes.txt','w') as f:##the file is closed automatically when the block ends
    for t in imdb_soup.find_all('div',class_='description'):
        f.write(t.text.strip().encode('utf-8')+'\n')

In [None]:
for t in imdb_soup.find_all('div',class_='description'):
    print t.text.strip()
