[View in Colaboratory](https://colab.research.google.com/github/pascalesdedy/python-simple-web-scrape/blob/master/python_simple_web_scrape.ipynb)

# **Learn Web Scraping with Python**


### Simple Python script for web scraping


Your first task will be to download web pages.

The requests package comes to the rescue. It aims to be an easy-to-use tool for doing all things HTTP in Python, and it doesn’t disappoint. In this tutorial, you need only the requests.get() function, but you should definitely check out the full documentation when you want to go further.

First, here’s your function:

In [0]:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the request succeeds and the Content-Type of the response is some
    kind of HTML, return the raw content (bytes), otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
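    # Use .get() with a default so a response that has no Content-Type
    # header is treated as "not HTML" instead of raising KeyError.
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)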


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
    


The simple_get() function accepts a single url argument.
It then makes a GET request to that URL. If nothing goes wrong, you end up with the raw HTML content (as bytes) for the page you requested.
If there is a problem with the request (for example a bad URL or an unreachable server), or if the response is not a successful HTML response, the function returns None.
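
Whether simple_get() returns content or None depends on the Content-Type check in is_good_response(). Real servers often append parameters such as `charset` to that header, or omit it entirely. The following is a minimal sketch of a stricter check under those assumptions; `looks_like_html` is a hypothetical helper name used only for illustration and is not part of requests.

```python
def looks_like_html(content_type_header):
    """Return True if a Content-Type header value names an HTML media type."""
    # A missing or empty header is treated as "not HTML".
    if not content_type_header:
        return False
    # Drop parameters such as "; charset=UTF-8" and normalise the case.
    media_type = content_type_header.split(';', 1)[0].strip().lower()
    return media_type in ('text/html', 'application/xhtml+xml')


print(looks_like_html('text/html; charset=UTF-8'))  # True
print(looks_like_html('application/json'))          # False
print(looks_like_html(None))                        # False
```

Comparing the media type exactly, rather than searching for the substring `html`, is a design choice. It is stricter than the check in the tutorial and may reject unusual but valid types.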

You may have noticed the use of the closing() function in your definition of simple_get(). closing() guarantees that the response's close() method is called when the with block exits, even if an exception is raised inside it.
That matters here because the request is made with stream=True, which keeps the connection open until the response has been fully read or explicitly closed. Closing it promptly avoids leaking connections.
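
To see what closing() does in isolation, here is a small sketch that needs no network access. `FakeConnection` is a hypothetical stand-in for a response object; it only records whether close() was called.

```python
from contextlib import closing


class FakeConnection:
    """Hypothetical stand-in for a network response; it only tracks close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


conn = FakeConnection()
try:
    with closing(conn):
        # Simulate something going wrong while the resource is in use.
        raise RuntimeError('simulated failure while reading')
except RuntimeError:
    pass

# close() was still called, even though the block raised an exception.
print(conn.closed)  # True
```

Recent versions of requests also let a response be used directly as a context manager (`with get(url, stream=True) as resp:`), which has the same effect.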

In [10]:
raw_html = simple_get('https://realpython.com/blog/')
len(raw_html)    

254791

### **Using Beautiful Soup**

Fetch mathematician names

In [0]:
raw_html = simple_get('http://www.fabpedigree.com/james/mathmen.htm')
html = BeautifulSoup(raw_html, 'html.parser')
for i, li in enumerate(html.select('li')):
    print(i, li.text)
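
Because simple_get() can return None, passing its result straight to BeautifulSoup fails when the download did not succeed. The sketch below guards against that and strips stray whitespace from each item. `list_items` is a hypothetical helper, and the sample markup is made up so the example runs without a network connection.

```python
from bs4 import BeautifulSoup


def list_items(raw_html):
    """Return the stripped text of every <li>, or [] if there is no content."""
    # simple_get() returns None on failure, so handle that case explicitly.
    if raw_html is None:
        return []
    soup = BeautifulSoup(raw_html, 'html.parser')
    return [li.get_text(strip=True) for li in soup.select('li')]


# Made-up sample standing in for a downloaded page (bytes, like resp.content).
sample = b"<ol><li> Isaac Newton\n</li><li>Archimedes</li></ol>"

print(list_items(sample))  # ['Isaac Newton', 'Archimedes']
print(list_items(None))    # []
```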

Fetch popular news


In [0]:
raw_html = simple_get('http://jogja.tribunnews.com/populer')
html = BeautifulSoup(raw_html, 'html.parser')
for i, a in enumerate(html.select('a')):
    print(i, a.text)
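
Selecting every `a` element usually returns navigation links, image links and anchors without text alongside the headlines. One possible refinement, sketched here on made-up markup, keeps only anchors that have both an `href` attribute and visible text. `extract_links` and the sample paths are hypothetical and do not reflect the structure of any real news site.

```python
from bs4 import BeautifulSoup


def extract_links(raw_html):
    """Return (text, href) pairs for anchors that have visible text and a target."""
    if raw_html is None:
        return []
    soup = BeautifulSoup(raw_html, 'html.parser')
    links = []
    # 'a[href]' matches only anchors that carry an href attribute.
    for a in soup.select('a[href]'):
        text = a.get_text(strip=True)
        if text:  # skip anchors that wrap only an image or whitespace
            links.append((text, a['href']))
    return links


# Made-up sample: one headline, one image-only link, one anchor without href.
sample = (
    '<a href="/news/1">First headline</a>'
    '<a href="/img"><img src="x.png"></a>'
    '<a>No target</a>'
)

print(extract_links(sample))  # [('First headline', '/news/1')]
```

For a real page, a narrower CSS selector based on the site's own markup would usually give cleaner results than filtering all anchors.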

    
