## 1 - APIS Scraping

### 1.1 - Introduction aux APIs et requêtes APIs

##### Qu'est-ce qu'une API ?


API pour une Application Program Interface


Ensemble d'outils et méthodes qui autorisent différentes applications à interragir entre elle --> récupérer de la data dynamiquement.

### 1.2 - Requête API

In [123]:
# Import requests
import requests

### 1.3 - Requête GET

In [124]:
# Requête pour obtenir la dernière position de la station ISS depuis l'API OpenNotify
response = requests.get("http://api.open-notify.org/iss-now.json") # On ajoute après l'adresse de l'API un point d'accès
#ou endpoint qui donne accès à des informations (ici iss-now.json --> latitude et longitude de la station)

### 1.4 - Code Status 

In [125]:
response

<Response [200]>

#### Code 200 - tout est ok le serveur retourne le resultat

In [126]:
status_code = response.status_code
print(status_code)

200


#### Code 301 - Le serveur redirige vers un autre paramètre

#### Code 400 - Mauvaise requête 

In [127]:
response = requests.get("http://api.open-notify.org/iss-pass.json")
status_code = response.status_code
print(status_code)

400


#### Code 401 - Le serveur pense que vous n'êtes pas authentifié

#### Code 403 - vous n'êtes pas autorisé à accéder à l'API

#### Code 404 - Le serveur n'a pas trouvé la ressource

In [128]:
response = requests.get("http://api.open-notify.org/iss-pass")
status_code = response.status_code
print(status_code)

404


### 1.5 - Paramètre de requête 

In [129]:
# latitude et longitude de la ville de Paris 
parameters = {"lat": 48.87, "lon": 2.33}

In [130]:
#http://api.open-notify.org/iss-pass.json?lat=48.87&lon=2.33

In [131]:
response = requests.get("http://api.open-notify.org/iss-pass.json", 
                        params=parameters)

In [132]:
content = response.content # on recupère le contenu
print(content)

b'{\n  "message": "success", \n  "request": {\n    "altitude": 100, \n    "datetime": 1588176541, \n    "latitude": 48.87, \n    "longitude": 2.33, \n    "passes": 5\n  }, \n  "response": [\n    {\n      "duration": 558, \n      "risetime": 1588213805\n    }, \n    {\n      "duration": 653, \n      "risetime": 1588219534\n    }, \n    {\n      "duration": 653, \n      "risetime": 1588225347\n    }, \n    {\n      "duration": 655, \n      "risetime": 1588231169\n    }, \n    {\n      "duration": 637, \n      "risetime": 1588236979\n    }\n  ]\n}\n'


#### Training

Appliquer la requête GET à la ville de San Francisco:
* Récupérer le contenu avec response.content
* Assigner le résultat à la variable content
* Afficher le resultat

In [135]:
# latitude et longitude de la ville de San Francisco 
parameters = {"lat": 37.78, "lon": -122.41}

In [136]:
response = requests.get("http://api.open-notify.org/iss-pass.json", 
                        params=parameters)

In [137]:
content = response.content # on recupère le contenu
print(content)

b'{\n  "message": "success", \n  "request": {\n    "altitude": 100, \n    "datetime": 1588179146, \n    "latitude": 37.78, \n    "longitude": -122.41, \n    "passes": 5\n  }, \n  "response": [\n    {\n      "duration": 636, \n      "risetime": 1588181339\n    }, \n    {\n      "duration": 611, \n      "risetime": 1588187153\n    }, \n    {\n      "duration": 619, \n      "risetime": 1588241456\n    }, \n    {\n      "duration": 633, \n      "risetime": 1588247253\n    }, \n    {\n      "duration": 522, \n      "risetime": 1588253167\n    }\n  ]\n}\n'


### 1.6 - Format JSON

Librairie json :

- dumps -- prend en entrée un objet Python et retourne une chaine de caractères

- loads -- prend en entrée une chaine de caractères JSON et retourne un objet Python (listes, dictionnaires...)

#### Exemple

In [138]:
# soit une liste de sports
sports = ["Tennis", "Foot", "Triathlon"]
print(sports)

['Tennis', 'Foot', 'Triathlon']


In [139]:
print(type(sports))

<class 'list'>


In [140]:
# Import la librairie json
import json

In [141]:
# Méthode json.dumps pour convertir en chaine de caractères
sports_string = json.dumps(sports)
print(sports_string)

["Tennis", "Foot", "Triathlon"]


In [142]:
print(type(sports_string))

<class 'str'>


In [143]:
# Méthode json.loads pour convertir sports_string en liste 
sports2 = json.loads(sports_string)
print(sports2)

['Tennis', 'Foot', 'Triathlon']


In [144]:
print(type(sports2))

<class 'list'>


#### Training

Soit le dictionnaire ci dessous :
- Convertir en chaine de caractères 
- Re-convertir en dictionnaire
- Vérifier les types 

In [145]:
# Soit le dictionnaire contenant le nombre de licenciés pour 
# quelques sports en France en 2016
sports_number = {
    "Football": 1962241,
    "Tennis": 1039337,
    "Equitation": 663194,
    "Basketball": 641367
}

In [146]:
# Import json
import json

In [147]:
# Convertir en chaine de caractères
sports_number_string = json.dumps(sports_number)
print(sports_number_string)

{"Football": 1962241, "Tennis": 1039337, "Equitation": 663194, "Basketball": 641367}


In [148]:
# Vérifier le type sports_number_string
print(type(sports_number_string))

<class 'str'>


In [149]:
# Re-convertir en dictionnaire 
sports_number_dic = json.loads(sports_number_string)
print(sports_number_dic)

{'Football': 1962241, 'Tennis': 1039337, 'Equitation': 663194, 'Basketball': 641367}


In [150]:
# Vérifier le type de sports_numbers_dict
print(type(sports_number_dic))

<class 'dict'>


### 1.7 - Obtenir un JSON depuis une requête

#### Méthode json()

In [151]:
# Reprenons notre requête avec les coordonnées de la ville de Paris
parameters = {"lat": 48.87, "lon": 2.33}
response = requests.get("http://api.open-notify.org/iss-pass.json", 
                        params=parameters)

In [152]:
# Obtenir un objet Python
json_data = response.json()
print(type(json_data))

<class 'dict'>


In [153]:
print(json_data)

{'message': 'success', 'request': {'altitude': 100, 'datetime': 1588176541, 'latitude': 48.87, 'longitude': 2.33, 'passes': 5}, 'response': [{'duration': 558, 'risetime': 1588213805}, {'duration': 653, 'risetime': 1588219534}, {'duration': 653, 'risetime': 1588225347}, {'duration': 655, 'risetime': 1588231169}, {'duration': 637, 'risetime': 1588236979}]}


In [154]:
# Durée necessaire pour que la station spaciale passe au dessus de Paris
first_pass_duration = json_data["response"][0]['duration']
print(first_pass_duration)

558


### 1.8 - Type de Contenu

In [155]:
# .headers 
print(response.headers)

{'Server': 'nginx/1.10.3', 'Date': 'Wed, 29 Apr 2020 17:00:25 GMT', 'Content-Type': 'application/json', 'Content-Length': '518', 'Connection': 'keep-alive', 'Via': '1.1 vegur'}


In [156]:
content_type = response.headers["content-type"]
print(content_type)

application/json


#### Trouver le nombre de personnes dans l'espace 

#### Training
Trouver cmbien de personnes sont actuellement dans l'espace :
- Assigner le résultat à la variable in_space_count
- Afficher le résultat

In [157]:
# Appeler l' API
response = requests.get("http://api.open-notify.org/astros.json")
json_data = response.json()
print(json_data)

{'number': 3, 'message': 'success', 'people': [{'craft': 'ISS', 'name': 'Chris Cassidy'}, {'craft': 'ISS', 'name': 'Anatoly Ivanishin'}, {'craft': 'ISS', 'name': 'Ivan Vagner'}]}


In [158]:
in_space_count = json_data["number"]
print(in_space_count)

3


## 2 - Authentification à une API

### 2.1 - Authentification à l'API de Github

In [159]:
import requests

In [160]:
# Création dictionnaire contenant le token
headers = {"Authorization": "token 4c229fe484728c0ee1be9e936d9dca23472f6ecc"}

In [161]:
# Requête Get 
response = requests.get("http://api.github.com/users/mkinty", 
                        headers=headers)

In [162]:
json_data = response.json()
print(json_data)

{'login': 'mkinty', 'id': 44556576, 'node_id': 'MDQ6VXNlcjQ0NTU2NTc2', 'avatar_url': 'https://avatars2.githubusercontent.com/u/44556576?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/mkinty', 'html_url': 'https://github.com/mkinty', 'followers_url': 'https://api.github.com/users/mkinty/followers', 'following_url': 'https://api.github.com/users/mkinty/following{/other_user}', 'gists_url': 'https://api.github.com/users/mkinty/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/mkinty/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/mkinty/subscriptions', 'organizations_url': 'https://api.github.com/users/mkinty/orgs', 'repos_url': 'https://api.github.com/users/mkinty/repos', 'events_url': 'https://api.github.com/users/mkinty/events{/privacy}', 'received_events_url': 'https://api.github.com/users/mkinty/received_events', 'type': 'User', 'site_admin': False, 'name': 'Moustapha KINTY', 'company': None, 'blog': '', 'location': 'Yerres, Fra

### 2.2 - Autres points d'accès

In [163]:
# compte github de huandu
response = requests.get("https://api.github.com/users/huandu", headers=headers)
huandu = response.json()
print(huandu)

{'login': 'huandu', 'id': 239739, 'node_id': 'MDQ6VXNlcjIzOTczOQ==', 'avatar_url': 'https://avatars1.githubusercontent.com/u/239739?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/huandu', 'html_url': 'https://github.com/huandu', 'followers_url': 'https://api.github.com/users/huandu/followers', 'following_url': 'https://api.github.com/users/huandu/following{/other_user}', 'gists_url': 'https://api.github.com/users/huandu/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/huandu/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/huandu/subscriptions', 'organizations_url': 'https://api.github.com/users/huandu/orgs', 'repos_url': 'https://api.github.com/users/huandu/repos', 'events_url': 'https://api.github.com/users/huandu/events{/privacy}', 'received_events_url': 'https://api.github.com/users/huandu/received_events', 'type': 'User', 'site_admin': False, 'name': 'Huan Du', 'company': '@altstory', 'blog': '', 'location': 'Beijing, China',

In [164]:
# se connecter à l'organisation facebook
response = requests.get("https://api.github.com/orgs/facebook", headers=headers)
orgs = response.json()
print(orgs)

{'login': 'facebook', 'id': 69631, 'node_id': 'MDEyOk9yZ2FuaXphdGlvbjY5NjMx', 'url': 'https://api.github.com/orgs/facebook', 'repos_url': 'https://api.github.com/orgs/facebook/repos', 'events_url': 'https://api.github.com/orgs/facebook/events', 'hooks_url': 'https://api.github.com/orgs/facebook/hooks', 'issues_url': 'https://api.github.com/orgs/facebook/issues', 'members_url': 'https://api.github.com/orgs/facebook/members{/member}', 'public_members_url': 'https://api.github.com/orgs/facebook/public_members{/member}', 'avatar_url': 'https://avatars3.githubusercontent.com/u/69631?v=4', 'description': 'We are working to build community through open source technology. NB: members must have two-factor auth.', 'name': 'Facebook', 'company': None, 'blog': 'https://opensource.fb.com', 'location': 'Menlo Park, California', 'email': None, 'is_verified': True, 'has_organization_projects': True, 'has_repository_projects': True, 'public_repos': 128, 'public_gists': 12, 'followers': 0, 'following': 

#### Training 
Faites une requête GET sur le point d'accès 

http://api.github.com/repos/octocat/Hello-World:
- Assigner le résultat JSON à la variable hello_world
- Afficher le resultat

In [165]:
response = requests.get("http://api.github.com/repos/octocat/Hello-World", headers=headers)
hello_world = response.json()
print(hello_world)

{'id': 1296269, 'node_id': 'MDEwOlJlcG9zaXRvcnkxMjk2MjY5', 'name': 'Hello-World', 'full_name': 'octocat/Hello-World', 'private': False, 'owner': {'login': 'octocat', 'id': 583231, 'node_id': 'MDQ6VXNlcjU4MzIzMQ==', 'avatar_url': 'https://avatars3.githubusercontent.com/u/583231?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/octocat', 'html_url': 'https://github.com/octocat', 'followers_url': 'https://api.github.com/users/octocat/followers', 'following_url': 'https://api.github.com/users/octocat/following{/other_user}', 'gists_url': 'https://api.github.com/users/octocat/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/octocat/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/octocat/subscriptions', 'organizations_url': 'https://api.github.com/users/octocat/orgs', 'repos_url': 'https://api.github.com/users/octocat/repos', 'events_url': 'https://api.github.com/users/octocat/events{/privacy}', 'received_events_url': 'https://api.github.

### 2.3 - Pagination

In [166]:
params = {"per_page": 50, "page": 1}
response = requests.get("https://api.github.com/users/rakeshsukla53/starred", 
                        headers=headers, params=params)
page1_repos = response.json()
print(page1_repos)

[{'id': 176058541, 'node_id': 'MDEwOlJlcG9zaXRvcnkxNzYwNTg1NDE=', 'name': 'termpair', 'full_name': 'cs01/termpair', 'private': False, 'owner': {'login': 'cs01', 'id': 5715368, 'node_id': 'MDQ6VXNlcjU3MTUzNjg=', 'avatar_url': 'https://avatars2.githubusercontent.com/u/5715368?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/cs01', 'html_url': 'https://github.com/cs01', 'followers_url': 'https://api.github.com/users/cs01/followers', 'following_url': 'https://api.github.com/users/cs01/following{/other_user}', 'gists_url': 'https://api.github.com/users/cs01/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/cs01/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/cs01/subscriptions', 'organizations_url': 'https://api.github.com/users/cs01/orgs', 'repos_url': 'https://api.github.com/users/cs01/repos', 'events_url': 'https://api.github.com/users/cs01/events{/privacy}', 'received_events_url': 'https://api.github.com/users/cs01/received_events', 

#### Training

Obtenir la seconde page des repositories que rakeshsukla53 a marqué comme intéressant
- Assigner le résultat JSON à la variable page2_repos
- Afficher le résultat

In [167]:
params = {"per_page": 50, "page": 2}
response = requests.get("https://api.github.com/users/rakeshsukla53/starred", 
                        headers=headers, params=params)
page2_repos = response.json()
print(page2_repos)

[{'id': 105175251, 'node_id': 'MDEwOlJlcG9zaXRvcnkxMDUxNzUyNTE=', 'name': 'python-dialog-example', 'full_name': 'slackapi/python-dialog-example', 'private': False, 'owner': {'login': 'slackapi', 'id': 6962987, 'node_id': 'MDEyOk9yZ2FuaXphdGlvbjY5NjI5ODc=', 'avatar_url': 'https://avatars3.githubusercontent.com/u/6962987?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/slackapi', 'html_url': 'https://github.com/slackapi', 'followers_url': 'https://api.github.com/users/slackapi/followers', 'following_url': 'https://api.github.com/users/slackapi/following{/other_user}', 'gists_url': 'https://api.github.com/users/slackapi/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/slackapi/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/slackapi/subscriptions', 'organizations_url': 'https://api.github.com/users/slackapi/orgs', 'repos_url': 'https://api.github.com/users/slackapi/repos', 'events_url': 'https://api.github.com/users/slackapi/events{/p

### 2.4 - Point d'accès User-Level

In [168]:
response = requests.get("https://api.github.com/user", headers=headers)
user = response.json()
print(user)

{'login': 'mkinty', 'id': 44556576, 'node_id': 'MDQ6VXNlcjQ0NTU2NTc2', 'avatar_url': 'https://avatars2.githubusercontent.com/u/44556576?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/mkinty', 'html_url': 'https://github.com/mkinty', 'followers_url': 'https://api.github.com/users/mkinty/followers', 'following_url': 'https://api.github.com/users/mkinty/following{/other_user}', 'gists_url': 'https://api.github.com/users/mkinty/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/mkinty/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/mkinty/subscriptions', 'organizations_url': 'https://api.github.com/users/mkinty/orgs', 'repos_url': 'https://api.github.com/users/mkinty/repos', 'events_url': 'https://api.github.com/users/mkinty/events{/privacy}', 'received_events_url': 'https://api.github.com/users/mkinty/received_events', 'type': 'User', 'site_admin': False, 'name': 'Moustapha KINTY', 'company': None, 'blog': '', 'location': 'Yerres, Fra

### 2.5 - Requête POST

In [169]:
# payload = {"name": "test", "description": "Ceci est la description du repository"}
# requests.post("https://api.github.com/user/repos", json=payload)

In [170]:
payload = {"name": "api-scraping"}
response = requests.post("https://api.github.com/user/repos", 
                         json=payload, headers=headers)
status = response.status_code
print(status)

201


### 2.6 - Requête PATCH/PUT

In [171]:
payload = {"name":"api", "description": "Super formation!"}
response = requests.patch("https://api.github.com/repos/mkinty/api-scraping", 
                         json=payload, headers=headers)
status = response.status_code
print(status)

200


### 2.7 - Requête DELETE

In [172]:
response = requests.delete("https://api.github.com/repos/mkinty/api",
                           headers=headers)
status = response.status_code
print(status)

204


#### Training
- Sur votregithub, créer un repository que vous nommerez training
- Mettez-le à jour, et ajouter la description "Training API" et renommez-le "Mon repository"
- Enfin supprimez-le !

In [173]:
# Création dictionnaire contenant le token
headers = {"Authorization": "token 4c229fe484728c0ee1be9e936d9dca23472f6ecc"}

In [174]:
# Requête POST 
payload = {"name": "training"}
response = requests.post("https://api.github.com/user/repos", 
                         json=payload, headers=headers)
status = response.status_code
print(status)

201


In [175]:
# Requête PATCH
payload = {"name":"mon-repository", "description": "Training API!"}
response = requests.patch("https://api.github.com/repos/mkinty/training", 
                         json=payload, headers=headers)
status = response.status_code
print(status)

200


In [176]:
# Requête DELETE
response = requests.delete("https://api.github.com/repos/mkinty/mon-repository",
                           headers=headers)
status = response.status_code
print(status)

204


## 3 - Cas pratique: API Reddit

### 3.1 - Authentification à l'API Reddit

In [177]:
import requests
import requests.auth

In [178]:
# On utilise la méthode HTTPBasicAuth de la librairie requests.auth pour ajouter les identifiants
# de notre script
client_auth = requests.auth.HTTPBasicAuth('tSzvPEWz-pZ9KA', 'U9Y1T8593JHfTLmwzZ5ibyS0cck')
# On ajoute l'identifiant et le mot de passe de notre compte Reddit
post_data = {"grant_type":"password","username": "moustaphak", "password": "Kinty_1989"}
headers = {'User-agent': 'Formation API'} # Ajout d'un nom pour l'identification
response = requests.post("https://www.reddit.com/api/v1/access_token", auth = client_auth,
                        data = post_data, headers = headers) # On génère un token avec toutes les infos précédentes
response.json()

{'access_token': '495868378210-afDL4YvyjztyQDd-mfu56IvmEWY',
 'token_type': 'bearer',
 'expires_in': 3600,
 'scope': '*'}

In [180]:
headers = {"authorization": "bearer 495868378210-afDL4YvyjztyQDd-mfu56IvmEWY", "User-agent": "Formation API"}
params = {"t": "day"} # On ajoute un paramètre pour cibler le dernier jour
# On applique une requêt GET avec les paramètres hearders et params afin de récupérer les posts sur 
# python les plus populaires 
response = requests.get("https://oauth.reddit.com/r/python/top", headers = headers, params = params)
python_top = response .json()
print(python_top)

{'kind': 'Listing', 'data': {'modhash': None, 'dist': 25, 'children': [{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'Python', 'selftext': '', 'author_fullname': 't2_z74zx', 'saved': False, 'mod_reason_title': None, 'gilded': 0, 'clicked': False, 'title': 'Shared this one on FB and everyone was confused. :D', 'link_flair_richtext': [], 'subreddit_name_prefixed': 'r/Python', 'hidden': False, 'pwls': 6, 'link_flair_css_class': 'made-this', 'downs': 0, 'hide_score': False, 'name': 't3_g9vesj', 'quarantine': False, 'link_flair_text_color': 'dark', 'author_flair_background_color': None, 'subreddit_type': 'public', 'ups': 1930, 'total_awards_received': 0, 'media_embed': {}, 'author_flair_template_id': None, 'is_original_content': False, 'user_reports': [], 'secure_media': None, 'is_reddit_media_domain': True, 'is_meta': False, 'category': None, 'secure_media_embed': {}, 'link_flair_text': 'I Made This', 'can_mod_post': False, 'score': 1930, 'approved_by': None, 'author_premiu

### 3.2 - Obtenir le post avec le plus de votes 

In [181]:
python_top_articles = python_top['data']['children']
python_top_articles

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'Python',
   'selftext': '',
   'author_fullname': 't2_z74zx',
   'saved': False,
   'mod_reason_title': None,
   'gilded': 0,
   'clicked': False,
   'title': 'Shared this one on FB and everyone was confused. :D',
   'link_flair_richtext': [],
   'subreddit_name_prefixed': 'r/Python',
   'hidden': False,
   'pwls': 6,
   'link_flair_css_class': 'made-this',
   'downs': 0,
   'hide_score': False,
   'name': 't3_g9vesj',
   'quarantine': False,
   'link_flair_text_color': 'dark',
   'author_flair_background_color': None,
   'subreddit_type': 'public',
   'ups': 1930,
   'total_awards_received': 0,
   'media_embed': {},
   'author_flair_template_id': None,
   'is_original_content': False,
   'user_reports': [],
   'secure_media': None,
   'is_reddit_media_domain': True,
   'is_meta': False,
   'category': None,
   'secure_media_embed': {},
   'link_flair_text': 'I Made This',
   'can_mod_post': False,
   'score': 1930,
 

In [182]:
most_upvoted = ""
most_upvotes = 0

for article in python_top_articles:
    ar = article['data']
    if ar['ups'] >= most_upvotes:
        most_upvoted = ar['id']
        most_upvotes = ar['ups']

In [183]:
print(most_upvotes)

1930


In [185]:
print(most_upvoted)

g9vesj


In [186]:
for article in python_top_articles:
    ar = article['data']
    print(ar["id"], ar["ups"])

g9vesj 1930
ga7y7f 1267
ga4ehh 56
ga5u1b 31
ga3n5o 15
g9tyei 14
ga1b3z 10
g9v681 10
ga0vzz 8
g9x3md 9
g9xsyj 7
g9xbp1 6
ga8iwy 3
ga4iqj 3
ga4i7p 3
ga4aa3 2
ga48ij 3
ga0tmd 3
g9xqdj 3
g9t1t9 4
gabydq 2
ga8sti 2
ga7y38 2
ga7ufx 2
ga3jei 2


### 3.3 - Obtenir les commentaire du post

In [187]:
# r/python/comments/

#### Training
Obtenir tous les commentaires du pot le plus populaire du sureddit Python
- Générer l'URL de la requête en utilisant le nom du subreddit et l'ID du post
- Faire une requête GET 
- Obtenir la reponse en utilisant la méthode json()
- Afficher le résultat

In [188]:
response = requests.get("https://oauth.reddit.com/r/python/comments/g9vesj", headers=headers)
comments = response.json()
print(comments)

[{'kind': 'Listing', 'data': {'modhash': None, 'dist': 1, 'children': [{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'Python', 'selftext': '', 'user_reports': [], 'saved': False, 'mod_reason_title': None, 'gilded': 0, 'clicked': False, 'title': 'Shared this one on FB and everyone was confused. :D', 'link_flair_richtext': [], 'subreddit_name_prefixed': 'r/Python', 'hidden': False, 'pwls': 6, 'link_flair_css_class': 'made-this', 'downs': 0, 'parent_whitelist_status': 'all_ads', 'hide_score': False, 'name': 't3_g9vesj', 'quarantine': False, 'link_flair_text_color': 'dark', 'upvote_ratio': 0.97, 'author_flair_background_color': None, 'subreddit_type': 'public', 'ups': 1932, 'total_awards_received': 0, 'media_embed': {}, 'author_flair_template_id': None, 'is_original_content': False, 'author_fullname': 't2_z74zx', 'secure_media': None, 'is_reddit_media_domain': True, 'is_meta': False, 'category': None, 'secure_media_embed': {}, 'link_flair_text': 'I Made This', 'can_mod_post

## 4 - Web Scraping

### 4.1 - Introduction

In [189]:
import requests

In [190]:
# Télécharger la page 
response = requests.get("https://raw.githubusercontent.com/codelikerod/web-scraping/master/exemple1.html")
# Extraire le contenu de la page 
content = response.content
# Afficher le contenu de la page 
print(content)

b'<html>\r\n  <head>\r\n      <title> Un exemple de page HTML </title>\r\n  </head>\r\n\r\n  <body>\r\n      <p>Un simple paragraphe</p>\r\n  </body>\r\n</html>'


### 4.2 - Récupérer des éléments d'une page

In [191]:
# Librairie BeautifulSoup du package bs4

In [192]:
from bs4 import BeautifulSoup

In [193]:
# On applique BeautifulSoup pour analyser le contenu précédement téléchargé
parser = BeautifulSoup(content, 'html.parser')

# Obtenir le tag body du document HTML
body = parser.body

# Obtenir le tag p du body
p = body.p

# Afficher le texte -- on utilise l'attribut .text
print(p.text)

Un simple paragraphe


#### Training
Faire de même avec la balise head
- Récupérer le titre 
- Afficher le résultat

In [194]:
# Obtenir le tag head du document HTML
head = parser.head

# Obtenir le tag title du head
title = head.title

# Afficher le texte 
print(title.text)

 Un exemple de page HTML 


### 4.3 - Utiliser Find All

In [195]:
parser = BeautifulSoup(content, 'html.parser')

# Obtenir tous les éléments de la balise body
body = parser.find_all("body")

# On récupère les éléments de la balise p dans la liste body
p = body[0].find_all("p") # body[0] car on récupère le premier élément de la liste

# Obtenir le texte 
print(p[0].text)

Un simple paragraphe


#### Training
Faire de même avec la balise head
- Récupérer le titre 
- Afficher le résultat

In [196]:
# Obtenir tous les éléments de la balise head
head = parser.find_all("head")

# On récupère les éléments de la balise title dans la liste head
title = head[0].find_all("title") # head[0] car on récupère le premier élément de la liste

# Obtenir le texte 
title_text = title[0].text
print(title_text)

 Un exemple de page HTML 


### 4.4 - Eléments correspondant aux IDs

In [197]:
# Télécharger la page 
response = requests.get("https://raw.githubusercontent.com/codelikerod/web-scraping/master/exemple2.html")
# Extraire le contenu de la page 
content = response.content
parser = BeautifulSoup(content, 'html.parser')

In [198]:
# Récupérer l'ID souhaité
first_paragraph = parser.find_all("p", id = "first")[0] # On ajoute id="first"
print(first_paragraph.text)

1er paragraphe


#### Training
- Obtenir le texte du second paragraphe et assigner le résultat à la variable second_paragraph_text.
- Afficher le résultat

In [199]:
# Récupérer l'ID souhaité
second_paragraph = parser.find_all("p", id = "second")[0] # On ajoute id="second"
second_paragraph_text = second_paragraph.text
print(second_paragraph_text)

2nd paragraphe


### 4.5 - Les classes

In [200]:
# Télécharger le Site Web
response = requests.get("https://raw.githubusercontent.com/codelikerod/web-scraping/master/exemple3.html")
# Extraire le contenu de la page 
content = response.content
parser = BeautifulSoup(content, 'html.parser')

In [201]:
# Obtenir le premier paragraphe de la classe 1
# Trouver tous les paragraphes de la class 1 et récupérer le premier élément
first_class1_paragraph = parser.find_all("p", class_ = "class1")[0]
print(first_class1_paragraph.text)

1er paragraphe classe 1


#### Training
- Récupérer le texte du second paragraphe de la classe 1 et assigner le résultat à la variable second_class1_paragraph_text
- Récupérer le texte du premier paragraphe de la classe 2 et assigner le résultat à la variable first_class2_paragraph_text

In [202]:
# Obtenir le second paragraphe de la classe 1
# Trouver tous les paragraphes de la class 1 et récupérer le second élément
second_class1_paragraph = parser.find_all("p", class_ = "class1")[1]
second_class1_paragraph_text = second_class1_paragraph.text
print(second_class1_paragraph_text)

2nd paragraphe class 1


In [203]:
# Obtenir le premier paragraphe de la classe 2
# Trouver tous les paragraphes de la class 2 et récupérer le premier élément
first_class2_paragraph = parser.find_all("p", class_ = "class2")[0]
first_class2_paragraph_text = first_class2_paragraph.text
print(first_class2_paragraph_text)

1er paragraphe class 2


### 4.6 - Select en CSS

In [204]:
# #first{
#    color: red
#    }

# .class1 {
#     color: red
#     }

In [205]:
# Télécharger le Site Web
response = requests.get("https://raw.githubusercontent.com/codelikerod/web-scraping/master/exemple4.html")
# Extraire le contenu de la page 
content = response.content
parser = BeautifulSoup(content, 'html.parser')

In [206]:
# Sélectionner tous les élements de la classe first-item
first_items = parser.select(".first-item")
print(first_items)

[<p class="class1 first-item" id="first">1er paragraphe classe 1
      </p>, <p class="class2 first-item" id="second">1er paragraphe class 2
      </p>]


#### Training 
- Sélectionner tous les éléments de la class 2 et assigner le premier élément à la variable first_class2_text.
- Sélectionner tous les éléments qui possèdent l'ID second et assigner le premier paragraphe à la variable second_text.

In [207]:
# Sélectionner tous les élements de la classe 2
# Assigner le premier élément à la variable first_class2_text
first_class2 = parser.select(".class2")[0]
first_class2_text = first_class2.text
print(first_class2_text)

1er paragraphe class 2
      


In [208]:
# Sélectionner tous les éléments qui possèdent l'ID second
# Assigner le premier élément à la variable second_text
second = parser.select("#second")[0]
second_text = second.text
print(second_text)

1er paragraphe class 2
      


### 4.7 - Associer des sélecteurs en CSS

In [209]:
# div p
# div .first-item
# body div #first

#### Training 
- Extraire le nombre de fautes de Chelsea et assigner le résultat à la variable chelsea_offences_count
- Extraire le nombre de passes réussies par le PSG et assigner le résultat à la variable psg_pass_count

In [210]:
# Télécharger le Site Web
response = requests.get("https://raw.githubusercontent.com/codelikerod/web-scraping/master/psg-vs-chelsea.html")
# Extraire le contenu de la page 
content = response.content
parser = BeautifulSoup(content, 'html.parser')

In [211]:
# Trouver le nombre de fautes de Chelsea
offences = parser.select("#fautes")[0]
chelsea_offences = offences.select("td")[1]
# Assigner le résultat à la variable chelsea_offences_count
chelsea_offences_count = chelsea_offences.text
print(chelsea_offences_count)

24


In [212]:
# Trouver le nombre de fautes de Chelsea
passes = parser.select("#passes")[0]
psg_pass = passes.select("td")[2]
# Assigner le résultat à la variable chelsea_offences_count
psg_pass_count = psg_pass.text
print(psg_pass_count)

545


## 5 - Challenge 1: Site météo

### 5.1 - Exploration de la structure de la page web

##### Pratique :
- Télécharger la page contenant la prévision météo
- Utiliser BeautifulSoup pour analyser le code HTML
- Trouver l'ID seven-day-forecast et assigner le résultat à la variable seven_day
- À l'intérieur de seven_day sélectionner chaque prévision individuellement 
- Extraire et afficher le premier élément

In [213]:
# import librairies
import requests
from bs4 import BeautifulSoup

In [214]:
# Télécharger la page contenant la prévision météo
response = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.Xql0x5MzauU")
# Extraire le contenu de la page 
content = response.content
# Utiliser BeautifulSoup pour analyser le code HTML
soup = BeautifulSoup(content, 'html.parser')

In [215]:
# Trouver l'ID seven-day-forecast et assigner le résultat à la variable seven_day
seven_day = soup.find(id = "seven-day-forecast")
# À l'intérieur de seven_day sélectionner chaque prévision individuellement
forecast_items = seven_day.find_all(class_ = "tombstone-container")
today = forecast_items[0]
print(today)

<div class="tombstone-container">
<p class="period-name">Today<br/><br/></p>
<p><img alt="Today: Mostly cloudy, with a high near 66. West wind 13 to 17 mph, with gusts as high as 23 mph. " class="forecast-icon" src="newimages/medium/bkn.png" title="Today: Mostly cloudy, with a high near 66. West wind 13 to 17 mph, with gusts as high as 23 mph. "/></p><p class="short-desc">Mostly Cloudy</p><p class="temp temp-high">High: 66 °F</p></div>


In [216]:
# .prettify() pour être plus claire dans la lisibilité
print(today.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Mostly cloudy, with a high near 66. West wind 13 to 17 mph, with gusts as high as 23 mph. " class="forecast-icon" src="newimages/medium/bkn.png" title="Today: Mostly cloudy, with a high near 66. West wind 13 to 17 mph, with gusts as high as 23 mph. "/>
 </p>
 <p class="short-desc">
  Mostly Cloudy
 </p>
 <p class="temp temp-high">
  High: 66 °F
 </p>
</div>


### 5.2 - Extraire toutes les informations d'un élément

##### Pratique :
- Extraire le nom de l'objet forecast, la courte description et la température 
- Extraire le titre de l'objet img

In [217]:
# Extraire le nom de l'objet forecast
period = today.find(class_ = "period-name").get_text() # .get_text() pour obtenir la version texte 
# La courte description
short_desc = today.find(class_ = "short-desc").get_text()
# La température 
temp = today.find(class_ = "temp").get_text()
print(period)
print(short_desc)
print(temp)

Today
Mostly Cloudy
High: 66 °F


In [218]:
# Regardons l'objet img
img = today.find("img")
print(img)

<img alt="Today: Mostly cloudy, with a high near 66. West wind 13 to 17 mph, with gusts as high as 23 mph. " class="forecast-icon" src="newimages/medium/bkn.png" title="Today: Mostly cloudy, with a high near 66. West wind 13 to 17 mph, with gusts as high as 23 mph. "/>


In [219]:
# img est un dictionnaire, on récupère title avec un []
desc = img['title']
print(desc)

Today: Mostly cloudy, with a high near 66. West wind 13 to 17 mph, with gusts as high as 23 mph. 


### 5.3 - Extraire toutes les informations de la page 

##### Exemple :
- Sélectionner tous les objets de la classe period-name à l'intérieur d'un objet de classe tombstone-container dans l'élément seven-day
- Utiliser une compréhension de liste puis appleler la méthode get_text() sur chaque objet.

In [230]:
# Regardons l'élément seven_day
print(seven_day.prettify())

<div class="panel panel-default" id="seven-day-forecast">
 <div class="panel-heading">
  <b>
   Extended Forecast for
  </b>
  <h2 class="panel-title">
   San Francisco CA
  </h2>
 </div>
 <div class="panel-body" id="seven-day-forecast-body">
  <div id="seven-day-forecast-container">
   <ul class="list-unstyled" id="seven-day-forecast-list">
    <li class="forecast-tombstone">
     <div class="tombstone-container">
      <p class="period-name">
       Today
       <br/>
       <br/>
      </p>
      <p>
       <img alt="Today: Mostly cloudy, with a high near 66. West wind 13 to 17 mph, with gusts as high as 23 mph. " class="forecast-icon" src="newimages/medium/bkn.png" title="Today: Mostly cloudy, with a high near 66. West wind 13 to 17 mph, with gusts as high as 23 mph. "/>
      </p>
      <p class="short-desc">
       Mostly Cloudy
      </p>
      <p class="temp temp-high">
       High: 66 °F
      </p>
     </div>
    </li>
    <li class="forecast-tombstone">
     <div class="tomb

In [233]:
# Sélectionner tous les objets de la classe period-name à l'intérieur 
# d'un objet de classe tombstone-container dans l'élément seven-day
period_tags = seven_day.select(".tombstone-container .period-name")
# On parcourt tous les éléments de period_tags et pour chaque élément 
# on applique la méthode get_text(). On obtient une liste 
periods = [pt.get_text() for pt in period_tags]

In [234]:
# Afficher la liste 
print(periods)

['Today', 'Tonight', 'Thursday', 'ThursdayNight', 'Friday', 'FridayNight', 'Saturday', 'SaturdayNight', 'Sunday']


##### Pratique :
- Faites de même avec la courte description, les températures et les titres (descriptions)

In [235]:
# La même chose pour la courte description
short_desc_tags = seven_day.select(".tombstone-container .short-desc") 
short_descs = [short.get_text() for short in short_desc_tags]
print(short_descs)

['Mostly Cloudy', 'Partly Cloudy', 'Sunny', 'Partly Cloudy', 'Sunny', 'Mostly Clear', 'Partly Sunny', 'Partly Cloudy', 'Mostly Sunny']


In [236]:
# La même chose pour la température
temp_tags = seven_day.select(".tombstone-container .temp")
temps = [temp.get_text() for temp in temp_tags]
print(temps)

['High: 66 °F', 'Low: 54 °F', 'High: 69 °F', 'Low: 52 °F', 'High: 68 °F', 'Low: 54 °F', 'High: 66 °F', 'Low: 54 °F', 'High: 65 °F']


In [237]:
# La même chose pour les titres (descriptions)
desc_tags = seven_day.select(".tombstone-container img")
descs = [img["title"] for img in desc_tags]
print(descs)

['Today: Mostly cloudy, with a high near 66. West wind 13 to 17 mph, with gusts as high as 23 mph. ', 'Tonight: Partly cloudy, with a low around 54. West wind 14 to 16 mph, with gusts as high as 21 mph. ', 'Thursday: Sunny, with a high near 69. West wind 13 to 16 mph, with gusts as high as 22 mph. ', 'Thursday Night: Partly cloudy, with a low around 52. West wind 10 to 16 mph, with gusts as high as 21 mph. ', 'Friday: Sunny, with a high near 68. West wind 9 to 16 mph, with gusts as high as 21 mph. ', 'Friday Night: Mostly clear, with a low around 54.', 'Saturday: Partly sunny, with a high near 66.', 'Saturday Night: Partly cloudy, with a low around 54.', 'Sunday: Mostly sunny, with a high near 65.']


### 5.4 - Affichage du résultat avec Pandas 

In [71]:
import pandas as pd
weather = pd.DataFrame({
        "period": periods,
        "short_desc": short_descs,
        "temp": temps,
        "desc": descs
    })

In [72]:
weather

Unnamed: 0,period,short_desc,temp,desc
0,Today,Mostly Cloudy,High: 66 °F,"Today: Mostly cloudy, with a high near 66. Wes..."
1,Tonight,Partly Cloudy,Low: 54 °F,"Tonight: Partly cloudy, with a low around 54. ..."
2,Thursday,Sunny,High: 69 °F,"Thursday: Sunny, with a high near 69. West win..."
3,ThursdayNight,Partly Cloudy,Low: 52 °F,"Thursday Night: Partly cloudy, with a low arou..."
4,Friday,Sunny,High: 68 °F,"Friday: Sunny, with a high near 68. West wind ..."
5,FridayNight,Mostly Clear,Low: 54 °F,"Friday Night: Mostly clear, with a low around 54."
6,Saturday,Partly Sunny,High: 66 °F,"Saturday: Partly sunny, with a high near 66."
7,SaturdayNight,Partly Cloudy,Low: 54 °F,"Saturday Night: Partly cloudy, with a low arou..."
8,Sunday,Mostly Sunny,High: 65 °F,"Sunday: Mostly sunny, with a high near 65."


## 6 - Challenge 2: Critique de films

### 6.1 - Structure de l'URL

#### Pratique :
- Importer la fonction get() du module requests
- Assigner l'url de la page à la variable url
- Télécharger la page et assigner le résultat à la variable response 
- afficher un extrait du résultat 

In [286]:
from requests import get
url = "https://www.imdb.com/search/title/?release_date=2017&sort=num_votes,desc&page=1"
response = get(url)
print(response.text[:500]) # Afficher les 500 premiers éléments




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">



        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>

<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle"


### 6.2 - Structure HTML de la page 

#### Pratique :
- Importer  la classe BeautifulSoup du package bs4
- Extraire le code HTML
- Utiliser la méthode find_all() pour extraire les éléments souhaités

In [295]:
from bs4 import BeautifulSoup
text = response.text
html_soup = BeautifulSoup(text, 'html.parser')
movie_containers = html_soup.find_all('div', class_ = "lister-item mode-advanced")
print(type(movie_containers))
print(len(movie_containers))

<class 'bs4.element.ResultSet'>
50


 ### 6.3 - Extraire la data pour un seul film

#### Challenge :
- Extraire l'année de sortie du premier film
- Extraire la note IMDB (à convertir en nombre à virgule)
- Extraire la note Metacritic (à convertir en entier)
- Extraire le nombre de votes (utiliser paramètre attrs)

In [296]:
first_movie = movie_containers[0]
print(first_movie)

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt3315342"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt3315342/"> <img alt="Logan" class="loadlate" data-tconst="tt3315342" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BYzc5MTU4N2EtYTkyMi00NjdhLTg3NWEtMTY4OTEyMzJhZTAzXkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/large/film-184890147._CB466725069_.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt3315342/">Logan</a>
<span class="lister-item-year text-muted unbold">(2017)</span>
</h3>
<p class="text-muted">
<span class="certificate">12</span>
<span class="ghost">|</span>
<span class="runtime">137 min</span>
<span class="ghost">|</span>
<span class="genre">
Ac

In [297]:
# Le nom du film
first_name = first_movie.h3.a.text
first_name

'Logan'

In [298]:
# L'année de sortie 
first_year = first_movie.h3.find('span', class_ = "lister-item-year text-muted unbold")
first_year = first_year.text
first_year

'(2017)'

In [299]:
# Note IMDB
first_imdb = float(first_movie.strong.text) # convertir en float (nombre à virgule)
first_imdb

8.1

In [305]:
# Note Metacritic
first_metascore = first_movie.find('span', class_ = "metascore favorable")
first_metascore = int(first_metascore.text) # convertir en nombre entier
first_metascore

77

In [301]:
# Nombre de votes 
first_votes = first_movie.find('span', attrs = {'name':'nv'})
first_votes

<span data-value="611782" name="nv">611,782</span>

In [302]:
first_votes = int(first_votes["data-value"]) # convertir en nombre entier
first_votes

611782

### 6.4 - Script pour scraper une seule page 

In [307]:
# On crée des liste vides pour toutes nos informations
names = []
years = []
imdb_ratings = []
metascores = []
votes = []

# On reprend notre data movie_containers pour y extraire l'information
for container in movie_containers:
    
    # Si le film a une note Metacritic, on extrait
    if container.find('div', class_ = 'ratings-metascore') is not None:
        
        # Le titre du film
        name = container.h3.a.text
        names.append(name) # On ajoute chaque élément name de la boucle à la liste names
        
        # L'année de sortie
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)
        
        # Note IMDB
        imdb_rating = float(container.strong.text)
        imdb_ratings.append(imdb_rating)
        
        # Note Metacritic
        metascore = int(container.find('span', class_ = "metascore").text)
        metascores.append(metascore)
        
        # Votes 
        vote = int(container.find('span', attrs = {'name':'nv'})["data-value"])
        votes.append(vote)

In [317]:
print(names)

['Logan', 'Thor: Ragnarok', 'Les gardiens de la galaxie Vol. 2', 'Star Wars: Episode VIII - Les derniers Jedi', 'Wonder Woman', 'Dunkerque', 'Spider-Man: Homecoming', 'Get Out', 'Ça', 'Blade Runner 2049', 'Baby Driver', '3 Billboards: Les panneaux de la vengeance', 'Justice League', "La forme de l'eau", 'John Wick 2', 'Coco', 'Jumanji: Bienvenue dans la jungle', 'La Belle et la Bête', 'Kong: Skull Island', "Kingsman: Le cercle d'or", 'Pirates des Caraïbes: la Vengeance de Salazar', 'Alien: Covenant', 'The Greatest Showman', 'La planète des singes: Suprématie', 'Lady Bird', "Le crime de l'Orient-Express", 'Life: Origine inconnue', 'Fast & Furious 8', 'Ghost in the Shell', 'Wind River', 'Call Me by Your Name', "Le roi Arthur: La légende d'Excalibur", 'Mother', 'Hitman & Bodyguard', 'Moi, Tonya', 'Atomic Blonde', 'La momie', 'Bright', 'Les heures sombres', 'Valérian et la Cité des Mille Planètes', 'Baywatch: Alerte à Malibu', 'Barry Seal: American Traffic']


In [318]:
print(votes)

[611782, 546773, 535645, 534465, 524957, 513301, 480804, 456199, 430806, 426615, 407643, 395175, 360168, 347064, 340744, 334423, 298336, 258051, 256514, 254658, 247650, 242917, 219045, 216827, 214672, 196961, 196459, 194862, 188251, 188173, 187657, 182276, 178559, 176697, 166195, 165902, 165779, 160490, 158282, 153819, 152166, 148627]


In [311]:
print(years)

['(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(I) (2017)', '(I) (2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(I) (2017)', '(2017)', '(I) (2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(I) (2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(2017)', '(I) (2017)', '(2017)', '(2017)', '(2017)', '(2017)']


In [312]:
print(imdb_ratings)

[8.1, 7.9, 7.6, 7.0, 7.4, 7.9, 7.4, 7.7, 7.3, 8.0, 7.6, 8.2, 6.4, 7.3, 7.5, 8.4, 6.9, 7.1, 6.6, 6.7, 6.6, 6.4, 7.6, 7.4, 7.4, 6.5, 6.6, 6.7, 6.3, 7.7, 7.9, 6.7, 6.6, 6.9, 7.5, 6.7, 5.4, 6.3, 7.4, 6.5, 5.5, 7.1]


In [313]:
print(metascores)

[77, 74, 67, 85, 76, 94, 73, 85, 69, 81, 86, 88, 45, 87, 75, 81, 58, 65, 62, 44, 39, 65, 48, 82, 94, 52, 54, 56, 52, 73, 93, 41, 75, 47, 77, 63, 34, 29, 75, 51, 37, 65]


In [319]:
print(len(names))
print(len(votes))
print(len(years))
print(len(imdb_ratings))
print(len(metascores))

42
42
42
42
42


### 6.5 - Affichage du DataFrame avec Pandas 

In [320]:
import pandas as pd
test_df = pd.DataFrame({
        "movie": names,
        "year": years,
        "imdb": imdb_ratings,
        "metascore": metascores,
        "vote": votes
    })

In [321]:
print(test_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   movie      42 non-null     object 
 1   year       42 non-null     object 
 2   imdb       42 non-null     float64
 3   metascore  42 non-null     int64  
 4   vote       42 non-null     int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 1.8+ KB
None


In [323]:
test_df.tail()

Unnamed: 0,movie,year,imdb,metascore,vote
37,Bright,(I) (2017),6.3,29,160490
38,Les heures sombres,(2017),7.4,75,158282
39,Valérian et la Cité des Mille Planètes,(2017),6.5,51,153819
40,Baywatch: Alerte à Malibu,(2017),5.5,37,152166
41,Barry Seal: American Traffic,(2017),7.1,65,148627


### 6.6 - Script pour toutes les pages

In [324]:
# Choix des pages à scraper (de la page 1 à 4)
pages = [str(i) for i in range(1,5)]
pages

['1', '2', '3', '4']

In [327]:
# Varier les années ( de 2000 à 2017)
years_url = [str(i) for i in range(2000, 2018)]
years_url

['2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017']

### 6.7 - Contrôler le taux de requêtes envoyées 

In [328]:
from time import sleep
from random import randint

# exemple
for a in range(0,5):
    print('Booh')
    sleep(randint(1,4)) 


Booh
Booh
Booh
Booh
Booh


In [330]:
from time import time 
from IPython.display import clear_output
start_time = time() # on fixe le temps de début 
requests = 0 # la variable requests va compter le nombre de requêtes 
for _ in range(5):
    requests += 1 # on incrémente la variable requests d'une unité à chaque passage de boucle 
    sleep(randint(1,3)) # on fait une pause de 1 à 3 secondes
    elapsed_time = time() - start_time # on calcule le temps écoulé depuis la première requête
    print('Requests: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True) # afficher uniquement la requête en cours 

Requests: 5; Frequency: 0.5539399630712326 requests/s


In [331]:
# Pour afficher les status 
from warnings import warn
warn("Attention")

  This is separate from the ipykernel package so we can avoid doing imports until


### 6.8 - Script final

- Redéclarer les listes précédemment crées pour notre script sur une seule page pour qu'elles deviennent de nouveau vides
- Préparer l'écran d'affichage de la boucle 
- Ecrire une boucle qui fait varier le paramètre release_date et l'URL avec les valeurs de la liste years_url
- Faire une requête GET sur la boucle des pages 
- Pauser la boucle sur un intervalle de 8 à 15 secondes 
- Afficher à l'écran le temps de requêtes vu à la vidéo précedente
- Ajouter un avertissement pur tout code status diffrérent de 200
- Arrêter la boucle si le nombre de requêtes est supérieur au nombre attendu (72 pages)
- Convertir le contenu HTML response en un objet BeautifulSoup
- Extraire tous les containers des différents films 
- Ecrire une boucle qui parcourt tous les containers 
- Extraire les informations de chaque container si celui-ci a un Metascore 