# **Machine Learning Module**

# **Part 2: Web Scraping**
The goal of this part is to collect finance-related sentences from the Financial Times website and feed them to the model from Part 1, once that model has been loaded.

### Libraries

In [18]:
from bs4 import BeautifulSoup
import requests
import csv

### Script for retrieving financial data (Financial Times)

#### Fetching a web page

In [19]:
response = requests.get("https://www.ft.com/global-economy")

response
# If response displays <Response [200]>, the request succeeded and the variable response holds the web page for that URL.
# A status such as <Response [404]>, on the other hand, indicates an error.

<Response [200]>
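
The status check described in the comment above can be made explicit. The sketch below uses only the standard library's `http.HTTPStatus` to label a status code. The helpers `is_success` and `describe` are hypothetical names introduced for illustration and are not part of the notebook. With `requests`, calling `response.raise_for_status()` raises an `HTTPError` for 4xx and 5xx responses, which avoids parsing an error page by mistake.

```python
from http import HTTPStatus


def is_success(status_code: int) -> bool:
    """Return True for 2xx status codes (hypothetical helper)."""
    return 200 <= status_code < 300


def describe(status_code: int) -> str:
    """Return the numeric code followed by its standard reason phrase."""
    status = HTTPStatus(status_code)
    return f"{status.value} {status.phrase}"


# Classify a few representative codes.
for code in (200, 404, 500):
    outcome = "success" if is_success(code) else "error"
    print(describe(code), "->", outcome)
```

In the notebook, the same idea would apply to `response.status_code` right after the call to `requests.get`.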

#### Parsing the web page content

In [20]:
soup = BeautifulSoup(response.content, 'html.parser')

soup.title
# The title should confirm that we retrieved the page from the Financial Times website

<title>Global Economy | Financial Times</title>

### Isolating the data of interest

In [21]:
tag_a = soup.find_all("a", class_="js-teaser-heading-link")
tag_a

[<a class="js-teaser-heading-link" data-trackable="heading-link" href="/content/a064c6b4-37b3-40fa-bc88-e58bd1b0adce">FirstFT: Trump blasts ‘election interference’ after historic charges</a>,
 <a class="js-teaser-heading-link" data-trackable="heading-link" href="/content/94d2410b-c3c1-4e0b-ad50-6144b310c75f">EU trade deal with South America delayed by row over environmental rules</a>,
 <a class="js-teaser-heading-link" data-trackable="heading-link" href="/content/6890be22-9280-460a-bd2a-3c45f2cf531f">BoE’s chief economist hints at May interest rate rise</a>,
 <a class="js-teaser-heading-link" data-trackable="heading-link" href="/content/92d95586-f1eb-4148-ae32-1864f7deeb43">Waging war on trade will be costly</a>,
 <a class="js-teaser-heading-link" data-trackable="heading-link" href="/content/95745636-2d21-46aa-b0f1-6bda1c0fdd0b">Personal inflation calculator: what is your inflation rate?</a>,
 <a class="js-teaser-heading-link" data-trackable="heading-link" href="/content/b7340b0f-7919-

In [22]:
for tag in tag_a[:-2]:
  print(tag.contents)
# Skip the last 2 links, which do not contain finance-related sentences

['FirstFT: Trump blasts ‘election interference’ after historic charges']
['EU trade deal with South America\xa0delayed\xa0by row over environmental rules']
['BoE’s chief economist hints at May interest rate rise']
['Waging war on trade will be costly']
['Personal inflation calculator: what is your inflation rate?']
['US job openings fall to lowest level in almost two years']
['Opec isn’t scaring anyone']
['FirstFT: Trump prepares to face charges']
['How Spain has taken on the problem of precarious work  ']
['China Inc keen on setting up shop in the US despite tensions']
['Surprise cut by Opec+ fuels optimism for oil companies']
['Israel political crisis could cut 2.8% a year from GDP, central bank warns']
['High inflation boosts public finances, IMF says']
['Europe’s aversion to anti-coercion']
['FirstFT: Oil prices surge']
['China’s ports dominance undermines western aims to loosen trade ties']
['The financial turmoil is not over']
['How China is winning the race for Africa’s lithium'
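
The output above shows non-breaking spaces (`\xa0`) and trailing spaces inside some headlines. A minimal sketch of one way to normalize them before passing the text to a model, assuming whitespace differences carry no meaning here. The helper `clean_headline` is a hypothetical name, not part of the notebook.

```python
def clean_headline(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends.

    str.split() without arguments splits on any Unicode whitespace,
    which includes the non-breaking space "\xa0".
    """
    return " ".join(text.split())


# Two headlines copied from the output above.
raw = [
    "EU trade deal with South America\xa0delayed\xa0by row over environmental rules",
    "How Spain has taken on the problem of precarious work  ",
]

cleaned = [clean_headline(headline) for headline in raw]
for headline in cleaned:
    print(repr(headline))
```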

In [23]:
phrases_web = [['Phrases']]
for tag in tag_a[:-2]:
  phrases_web.append(tag.contents)
print("ok")
# Here we gather the sentences we just identified into a table (a list of rows), with 'Phrases' as the header row

ok


In [24]:
for row in phrases_web:
  print(row)
# Check that our sentences are in the table

['Phrases']
['FirstFT: Trump blasts ‘election interference’ after historic charges']
['EU trade deal with South America\xa0delayed\xa0by row over environmental rules']
['BoE’s chief economist hints at May interest rate rise']
['Waging war on trade will be costly']
['Personal inflation calculator: what is your inflation rate?']
['US job openings fall to lowest level in almost two years']
['Opec isn’t scaring anyone']
['FirstFT: Trump prepares to face charges']
['How Spain has taken on the problem of precarious work  ']
['China Inc keen on setting up shop in the US despite tensions']
['Surprise cut by Opec+ fuels optimism for oil companies']
['Israel political crisis could cut 2.8% a year from GDP, central bank warns']
['High inflation boosts public finances, IMF says']
['Europe’s aversion to anti-coercion']
['FirstFT: Oil prices surge']
['China’s ports dominance undermines western aims to loosen trade ties']
['The financial turmoil is not over']
['How China is winning the race for Afric

### Writing the data to a CSV file

In [25]:
type(phrases_web[0][0])
# Check the type of the entries. Note that row 0 is the 'Phrases' header, which is already a plain str.

str

In [26]:
for row in phrases_web:
  row[0] = str(row[0])
type(phrases_web[0][0])
# The scraped sentences have a type specific to the BeautifulSoup library (NavigableString).
# We convert them to plain str objects to avoid potential problems when writing the CSV file.

str

In [27]:
with open("../data/web.csv", "w", newline='', encoding='utf-8') as csvfile:
  writer = csv.writer(csvfile, delimiter=',')
  writer.writerows(phrases_web)
  print("ok")

ok
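
Some headlines contain commas, which is also the CSV delimiter. The sketch below, which writes to an in-memory buffer instead of `../data/web.csv` (an assumption made so the example runs anywhere), illustrates that `csv.writer` quotes such fields and that `csv.reader` restores them unchanged.

```python
import csv
import io

# Header row plus two headlines copied from the output above.
rows = [
    ["Phrases"],
    ["Israel political crisis could cut 2.8% a year from GDP, central bank warns"],
    ["Waging war on trade will be costly"],
]

# Write to an in-memory text buffer; newline="" leaves line endings to the csv module.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter=",")
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back and compare with the original rows.
read_back = list(csv.reader(io.StringIO(csv_text, newline="")))
print(read_back == rows)
```

The headline containing a comma is wrapped in double quotes in `csv_text`, so it still occupies a single column when read back.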


### Loading the data and the model

In [28]:
import pandas as pan
from joblib import load

In [29]:
# Load the scraped sentences
dataframe_web = pan.read_csv("../data/web.csv")
dataframe_web

Unnamed: 0,Phrases
0,FirstFT: Trump blasts ‘election interference’ ...
1,EU trade deal with South America delayed by ro...
2,BoE’s chief economist hints at May interest ra...
3,Waging war on trade will be costly
4,Personal inflation calculator: what is your in...
5,US job openings fall to lowest level in almost...
6,Opec isn’t scaring anyone
7,FirstFT: Trump prepares to face charges
8,How Spain has taken on the problem of precario...
9,China Inc keen on setting up shop in the US de...


In [30]:
# Load the model from Part 1
model = load("model.logiR")

### Predictions on the data scraped from the web

In [31]:
model

In [33]:
predictions = model.predict(dataframe_web.Phrases)
predictions

array(['neutral', 'neutral', 'positive', 'neutral', 'neutral', 'neutral',
       'neutral', 'neutral', 'neutral', 'positive', 'negative',
       'negative', 'neutral', 'neutral', 'neutral', 'positive', 'neutral',
       'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'neutral',
       'positive', 'neutral'], dtype=object)
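
To read the prediction array more easily, the labels can be tallied. A minimal sketch using `collections.Counter`, with the 25 labels copied from the output above into a plain list (in the notebook, the NumPy array `predictions` could be passed to `Counter` in the same way).

```python
from collections import Counter

# Labels copied verbatim from the prediction output above.
predictions = [
    'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'neutral',
    'neutral', 'neutral', 'neutral', 'positive', 'negative',
    'negative', 'neutral', 'neutral', 'neutral', 'positive', 'neutral',
    'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'neutral',
    'positive', 'neutral',
]

# Count each label and print it with its share of the total.
counts = Counter(predictions)
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / len(predictions):.0%})")
```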