## Practical Session 6 - scraping the web with urllib3 and beautifulsoup

Students (pair):
- [MIFDAL Oussama](https://github.com/username1)
- [LAMSAOUB Mohamed](https://github.com/username2)

**Useful references for this lab**:

[1] `urllib3`: [documentation](https://urllib3.readthedocs.io/en/latest/)

[2] `beautifulsoup4`: [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) 


## <a name="content">Contents</a>
- [Exercise 1: Parsing the content of a web page](#ex1)
- [Exercise 2: Extracting information from Wikipedia](#ex2)
---

 This notebook introduces Python functions and libraries for automatically collecting data from static web pages. In particular, this session is devoted to the `urllib3` and `Beautiful Soup` packages.

 Other useful packages in this context:
 - `os` & `sys` to issue system instructions;
 - `re` for [**r**egular **e**xpressions when manipulating text strings](https://docs.python.org/3/library/re.html). To test the validity of a regular expression, see [regexr.com](http://regexr.com);
 - `datetime` to interact with dates & times.

In [1]:
%matplotlib inline

In [11]:
import os
import re
import sys
from datetime import date, datetime

import urllib3
from bs4 import BeautifulSoup

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


 To take Centrale Lille's proxy into account:

In [12]:
# If proxy : to get out through Centrale Lille's proxy

centrale_proxy = False
if centrale_proxy:
    proxy = urllib3.ProxyManager("http://cache.ec-lille.fr:3128")
else:
    proxy = urllib3.PoolManager()

# See https://stackoverflow.com/questions/40490187/get-proxy-address-on-auto-proxy-discovery-mode-on-mac-os-x
# scutil --proxy
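
The choice between the two managers can also be wrapped in a small helper. This is only a minimal sketch: `make_http` is a hypothetical name that is not part of the lab material, and the timeout and retry values are arbitrary assumptions chosen for illustration.

```python
import urllib3
from urllib3.util import Retry, Timeout


def make_http(proxy_url=None):
    """Return a ProxyManager if proxy_url is given, else a plain PoolManager.

    Hypothetical helper: the timeout and retry values are arbitrary.
    """
    options = {
        "timeout": Timeout(connect=5.0, read=10.0),  # seconds
        "retries": Retry(total=3, backoff_factor=0.5),
    }
    if proxy_url:
        return urllib3.ProxyManager(proxy_url, **options)
    return urllib3.PoolManager(**options)


# No connection is opened until .request() is called
http = make_http()
http_via_proxy = make_http("http://cache.ec-lille.fr:3128")
```

Both objects expose the same `.request("GET", url)` method, so the rest of the notebook does not need to know which one is in use.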

## <a name="ex1">Exercise/example 1: parsing the content of a web page</a> [(&#8593;)](#content)

This example consists in retrieving the version number of the Beautiful Soup package, which appears in the top left corner of the associated [documentation webpage](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). To do this, you can for instance use the following instructions.

In [3]:
# from urllib3.request import urlopen
response = proxy.request(
    "GET", "https://www.crummy.com/software/BeautifulSoup/bs4/doc/"
)

 Decode the raw response bytes into a UTF-8 string:

In [4]:
utf8_text = response.data.decode("utf-8")
# print(utf8_text)
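
The cell above assumes the page is encoded in UTF-8. As a hedged sketch, the encoding could instead be read from the `Content-Type` header when one is declared. `decode_body` is a hypothetical helper, not part of the lab, and the byte strings below are made-up samples rather than real responses.

```python
import re


def decode_body(data, content_type=None, default="utf-8"):
    """Decode raw bytes using the charset found in a Content-Type header value.

    Falls back to `default` when no charset is declared. Undecodable bytes
    are replaced instead of raising UnicodeDecodeError.
    """
    match = re.search(r"charset=([\w-]+)", content_type or "", flags=re.IGNORECASE)
    encoding = match.group(1) if match else default
    return data.decode(encoding, errors="replace")


print(decode_body("août".encode("utf-8"), "text/html; charset=UTF-8"))
print(decode_body("août".encode("latin-1"), "text/html; charset=ISO-8859-1"))
```

With the response object used in this notebook, the call would presumably look like `decode_body(response.data, response.headers.get("Content-Type"))`.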

 Look for the version number and print it:

In [5]:
# Search data using Regex (regular expressions)
# Test regex http://regexr.com
regex = r"Beautiful Soup (\d\.){2}(\d)"  # raw string; looks for a version number of the form Beautiful Soup 4.9.0
web_text = re.search(regex, utf8_text)
print(web_text.group(0))

Beautiful Soup 4.9.0


1\. Extract only the version number from the same page.

> Hint: two useful pages about regular expressions (regexp): [tutorial](https://www.lucaswillems.com/fr/articles/25/tutoriel-pour-maitriser-les-expressions-regulieres), [verifying validity of an expression](http://regexr.com).

Your answer(s)

In [6]:
# your code

# Let's extract the version mentioned at the top left.

pattern_of_version = r"\d+\.\d+\.\d+"  

web_text = re.search(pattern_of_version, utf8_text)

print(web_text.group(0))

4.12.0
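
A bare pattern such as `\d+\.\d+\.\d+` returns the first three-part number found anywhere in the page, which is not necessarily the Beautiful Soup version. One possible variant, sketched below, anchors the search on the "Beautiful Soup" prefix and reads the number from a capturing group. The `sample` string is a made-up stand-in for `utf8_text`, not the real page content.

```python
import re

# Hypothetical sample standing in for utf8_text
sample = "Built with Sphinx 1.8.5. Beautiful Soup 4.12.0 documentation"

bare = re.search(r"\d+\.\d+\.\d+", sample)
anchored = re.search(r"Beautiful Soup (\d+\.\d+\.\d+)", sample)

print(bare.group(0))      # first three-part number anywhere in the text
print(anchored.group(1))  # only the number that follows "Beautiful Soup"
```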


2\. Take a look at the quickstart page of [`Beautiful Soup` (bs4 package)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), and use this library to retrieve the same information.

> Hint:
> - [this page on Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree) can be useful
> - useful elements of code:
>
>```python
> from bs4 import BeautifulSoup
> html_doc = proxy.request('GET','https://www.crummy.com/software/BeautifulSoup/bs4/doc/')
> soup = BeautifulSoup(html_doc.data, 'html.parser')
> ...
>```

Your answer(s)

In [7]:
# your code
from bs4 import BeautifulSoup


html_doc = proxy.request('GET','https://www.crummy.com/software/BeautifulSoup/bs4/doc/')
soup = BeautifulSoup(html_doc.data, 'html.parser')

In [8]:
## Look for the version shown in the top left corner of the page

version = re.search( pattern_of_version ,soup.title.string)

print(version.group(0))

4.12.0


In [9]:
## Find all the titles of the HTML page:

pattern = r'<h1>(.*?)<a'


Titles = [re.findall(pattern, str(link))[0] for link in soup.find_all('h1')]


print(Titles[1])

Quick Start
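
The same titles can be obtained without running a regular expression on the serialized tags, by letting Beautiful Soup return the text directly. The sketch below works on a hypothetical HTML fragment that imitates the headings of the documentation page. The `headerlink` class name is an assumption about that markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the <h1> headings of the documentation page
html = """
<h1>Beautiful Soup Documentation<a class="headerlink" href="#top">¶</a></h1>
<h1>Quick Start<a class="headerlink" href="#quick-start">¶</a></h1>
"""
soup_sample = BeautifulSoup(html, "html.parser")

titles = []
for h1 in soup_sample.find_all("h1"):
    # Drop the trailing permalink anchor, then keep the remaining text
    for anchor in h1.find_all("a", class_="headerlink"):
        anchor.decompose()
    titles.append(h1.get_text(strip=True))

print(titles[1])
```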


---

## <a name="ex2">Exercise 2: Extracting information from Wikipedia</a> [(&#8593;)](#content)

This exercise consists in extracting the birthdate of a list of actors from their Wikipedia page to infer their age. Consider for instance a list composed of Brad Pitt, Laurent Cantet, Jean-Paul Belmondo, Matthew McConaughey, Marion Cotillard, ...

To this end, open one such Wikipedia page, check that a birthdate is reported, and inspect the HTML source code of the page (from your browser) to see where this information is located.

First write a function to automatically retrieve the birthdate of each actor in the list. In a second step, convert this information into a "numerical date" (see codes below) and compute the difference with the current date to estimate the actors' age.

> Hints: 
> - note that the birth date is associated with the class `class="nowrap date-lien bday"` (check the source code of the web page);
> - useful object: `bs4.BeautifulSoup`, with its `find` method, see the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/);
> - you can create an `Actor` class to collect useful attributes (see [here](https://scipy-lectures.org/intro/language/oop.html?highlight=classes) and [there](https://docs.python.org/3/tutorial/classes.html) for more details on defining classes in Python).
> 
>```python
>class Actor:
>    def __init__(self, firstname, name):
>        self.name = name
>        self.firstname = firstname
>    ...
>
>```

> Codes: one possible way to translate words into a numerical date to compute an age is
>```python
># Parse data (replace month by number)
>month = ['janvier', 'février', 'mars', 'avril', 'mai', 'juin', 'juillet', 'août', 'septembre', 'octobre', 'novembre', 'décembre']
>month_number = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
>
>for i in range(0, 12):
>    web_date = web_date.replace(month[i], month_number[i])
>    
># Parse data and find the date to translate it into a numerical value
>born = datetime.strptime(web_date, '%m %Y')
>now = date.today()
>
># Compute the age
>age = now - born.date()
>age.days / 365.25  # approximate age in years
>
>result = now.year - born.date().year - ((now.month, now.day) < (born.date().month, born.date().day))
>```
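
The snippet above parses only the month and the year, so the day of birth is lost. As a hedged alternative, a full date can be parsed with a dictionary of month names. `parse_french_date` and `age_on` are hypothetical helpers, and the input format `'18 décembre 1963'` is an assumption about how the date appears in the page text.

```python
from datetime import date

# French month names mapped to month numbers (1 to 12)
MONTHS = {
    name: number
    for number, name in enumerate(
        ["janvier", "février", "mars", "avril", "mai", "juin", "juillet",
         "août", "septembre", "octobre", "novembre", "décembre"],
        start=1,
    )
}


def parse_french_date(text):
    """Parse a date written as '18 décembre 1963' (assumed format)."""
    day, month_name, year = text.split()
    day = day.replace("er", "")  # '1er' -> '1'
    return date(int(year), MONTHS[month_name.lower()], int(day))


def age_on(born, today):
    """Age in full years on `today`."""
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))


born = parse_french_date("18 décembre 1963")
print(age_on(born, date(2023, 12, 17)))  # day before the 60th birthday
print(age_on(born, date(2023, 12, 18)))  # on the birthday
```

Passing the reference date explicitly, instead of calling `date.today()` inside the function, keeps the result reproducible.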

Your answer(s)

In [10]:
import pandas as pd

# define a class for actors
class Actor:

    def __init__(self, full_name):

        self.full_name = full_name
    
    def set_age(self,age): 
        self.age = age

    def set_birthday(self,birthday):
        self.birthday = birthday



def date_to_age(web_date):
    """Return the age in full years for a birth date given as 'month year'.

    The month is either a French month name or a two-digit number,
    e.g. 'décembre 1963' or '12 1963'. The day is not parsed, so the
    birthday is taken to be the first day of the month.
    """
    month = ['janvier', 'février', 'mars', 'avril', 'mai', 'juin', 'juillet', 'août', 'septembre', 'octobre', 'novembre', 'décembre']
    month_number = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']

    # Replace the month name by its number
    for name, number in zip(month, month_number):
        web_date = web_date.replace(name, number)

    # Parse the date to translate it into a numerical value
    born = datetime.strptime(web_date, '%m %Y').date()
    now = date.today()

    # Age in full years: subtract one if the birthday has not occurred yet this year
    return now.year - born.year - ((now.month, now.day) < (born.month, born.day))
        

Actors = ["Brad Pitt", "Laurent Cantet", "Jean-Paul Belmondo", "Matthew McConaughey", "Marion Cotillard"]
instance_actors = [Actor(full_name) for full_name in Actors]

### Getting birthdays

In [72]:
Links = ['https://fr.wikipedia.org/wiki/' + name.replace(" ", "_") for name in Actors]

# Two dicts keyed by actor name: {'actor': birthday} and {'actor': age}
Actors_birthady = {}
Actors_ages = {}

for actor, name, link in zip(instance_actors, Actors, Links):

    html_doc = proxy.request('GET', link)
    soup = BeautifulSoup(html_doc.data, 'html.parser')

    # The first <time> tag of the page is assumed to hold the birth date
    # in its 'data-sort-value' attribute, formatted as YYYY-MM-DD
    birthday = soup.find_all('time')[0].get('data-sort-value')
    Actors_birthady[name] = birthday

    # date_to_age expects 'MM YYYY'
    Actors_ages[name] = date_to_age(birthday[5:7] + ' ' + birthday[:4])

    # Store the birthday and the age on the Actor instance
    actor.set_birthday(birthday)
    actor.set_age(Actors_ages[name])
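
Taking the first `<time>` tag of the page works for these actors, but it relies on the page layout. A slightly more defensive sketch selects the tag by the `bday` class mentioned in the hints and returns `None` when nothing is found. `extract_birthdate` is a hypothetical helper, and the HTML snippet is a made-up imitation of the Wikipedia markup, which assumes that the class and the `data-sort-value` attribute sit on the same tag.

```python
from datetime import date

from bs4 import BeautifulSoup

# Hypothetical snippet imitating the markup described in the hints
html = (
    '<time class="nowrap date-lien bday" '
    'data-sort-value="1963-12-18">18 décembre 1963</time>'
)


def extract_birthdate(soup):
    """Return the birth date as a datetime.date, or None if no tag is found."""
    tag = soup.find("time", class_="bday")
    if tag is None or not tag.get("data-sort-value"):
        return None
    return date.fromisoformat(tag["data-sort-value"])


print(extract_birthdate(BeautifulSoup(html, "html.parser")))
print(extract_birthdate(BeautifulSoup("<p>no date</p>", "html.parser")))
```

Returning a `date` object keeps the day of birth, so the age can be computed without truncating the birthday to month and year.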


In [73]:
data_scaped = pd.DataFrame({'Actors' : Actors , 'Birthday' : list(Actors_birthady.values()) , 'Ages':list(Actors_ages.values())})

In [74]:
data_scaped

|   | Actors | Birthday | Ages |
|---|--------|----------|------|
| 0 | Brad Pitt | 1963-12-18 | 59 |
| 1 | Laurent Cantet | 1961-04-11 | 62 |
| 2 | Jean-Paul Belmondo | 1933-04-09 | 90 |
| 3 | Matthew McConaughey | 1969-11-04 | 53 |
| 4 | Marion Cotillard | 1975-09-30 | 48 |
