# **A bit of scraping**

## From Pixels to Images (*Du pixel aux images*) - 32M7138

*Spring 2024 - University of Geneva*

*Adrien Jeanrenaud (adrien.jeanrenaud@unige.ch)*


<div class="alert alert-block alert-info">
<b>A bit of scraping</b>: 
    <br>Scraping means data retrieval.
    <br>This step is crucial in the processing pipeline for digital data and images: it aims to retrieve a dataset and structure it for later analysis. 
    <br>To that end, we will cover the following topics together:
</div>

## **Course outline**

> **Scraping online**
> * Starting from a website
> * Structuring the data
> * Downloading

> **Retrieving data from Explore**
> * The data on Explore
> * Structuring the data
> * Downloading

In [16]:
# import libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd

### **Scraping online**

Starting from a website: https://archive.org/details/crash-magazine-01

In [4]:
# URL of the page to download
url = "https://archive.org/details/crash-magazine-01"

# Download the HTML content
response = requests.get(url)

# Check whether the request succeeded (status code 200)
if response.status_code == 200:
    # Retrieve the HTML content
    html_content = response.text
    print(html_content)
else:
    print(f"Error {response.status_code} while downloading the page.")

<!DOCTYPE html>
<html lang="en">
<!-- __ _ _ _ __| |_ (_)__ _____
    / _` | '_/ _| ' \| |\ V / -_)
    \__,_|_| \__|_||_|_| \_/\___| -->
  <head data-release=5c813afd data-node="www26.us.archive.org">
    <title>Crash Magazine Issue 01 : Free Download, Borrow, and Streaming : Internet Archive</title>

          <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
    
        <meta name="google-site-verification" content="Q2YSouphkkgHkFNP7FgAkc4TmBs1Gmag3uGNndb53B8" />
    <meta name="google-site-verification" content="bpjKvUvsX0lxfmjg19TLblckWkDpnptZEYsBntApxUk" />

    <script  nonce="**CSP_NONCE**" >
/* @licstart  The following is the entire license notice for the
 * JavaScript code in this page.
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Affero General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * Thi

In [None]:
# publication date (excerpt of the page's HTML)

<dl class="metadata-definition">
        <dt>Publication date</dt>
        <dd class="">
          <a href="/search.php?query=date:1984-02">
            <span itemprop="datePublished">1984-02</span>
        </a>
                </dd>
    </dl>

In [6]:
# Use BeautifulSoup to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the page title
title = soup.title.text if soup.title else "Title not found"

# Extract the publication date: locate the <dt> label, then the next <span> that carries the date
# (string= replaces the deprecated text= argument)
publication_date_tag = soup.find("dt", string="Publication date")
publication_date = publication_date_tag.find_next("span", {"itemprop": "datePublished"}).text if publication_date_tag else "Date not found"
print(publication_date)
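
The `<dt>`/`<dd>` pattern in the excerpt above can be used for more than one field, so the lookup can be generalized. The following is a minimal sketch, assuming every `<dt>` label is directly followed by a sibling `<dd>` value as in that excerpt. The `extract_metadata` helper and the inline sample markup are illustrative and not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the page excerpt shown above.
SAMPLE_HTML = """
<dl class="metadata-definition">
  <dt>Publication date</dt>
  <dd class="">
    <a href="/search.php?query=date:1984-02">
      <span itemprop="datePublished">1984-02</span>
    </a>
  </dd>
</dl>
"""

def extract_metadata(html):
    """Return a {label: value} dict built from every <dt>/<dd> pair (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    metadata = {}
    for dt in soup.find_all("dt"):
        # The value is assumed to live in the next sibling <dd> element.
        dd = dt.find_next_sibling("dd")
        if dd is not None:
            metadata[dt.get_text(strip=True)] = dd.get_text(strip=True)
    return metadata

metadata = extract_metadata(SAMPLE_HTML)
print(metadata.get("Publication date", "Date not found"))
```

Collecting all pairs in one pass avoids writing a separate `find` call per field, at the cost of relying on the page keeping this markup.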

In [7]:
# tag for the PDF (excerpt of the page's HTML)

<a class="format-summary download-pill" href="/download/crash-magazine-01/Crash_01_Feb_1984.pdf" title="" data-toggle="tooltip" data-placement="auto left" data-container="body" data-original-title="31.8M">
                PDF                <span class="iconochive-download" aria-hidden="true"></span><span class="icon-label sr-only">download</span>              </a>

'1984-02'

In [14]:
# link to the PDF
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the link to the PDF file: an <a> tag whose href contains "download"
# (title="31.8M" is the file size of this particular issue, so the filter is specific to this page)
pdf_link_tag = soup.find("a", {"class": "stealth","title": "31.8M", "href": lambda href: href and "download" in href})
pdf_link = pdf_link_tag["href"] if pdf_link_tag else "PDF link not found"

print(pdf_link)

/download/crash-magazine-01/Crash_01_Feb_1984.pdf


In [15]:
# retrieve the title (the item identifier is used as the title)

# Extract the identifier
# (string= replaces the deprecated text= argument)
identifier_tag = soup.find("dt", string="Identifier")
identifier = identifier_tag.find_next("span", {"itemprop": "identifier"}).text if identifier_tag else "Identifier not found"
print(identifier)

crash-magazine-01


In [19]:
# store the information in a structured table

df = pd.DataFrame({"Titre": [identifier],
                       "Date_publication": [publication_date],
                       "Lien_PDF": [pdf_link]})
df

| | Titre | Date_publication | Lien_PDF |
|---|---|---|---|
| 0 | crash-magazine-01 | 1984-02 | /download/crash-magazine-01/Crash_01_Feb_1984.pdf |


In [33]:
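# download the PDF

# Build the full URL: drop the leading "/download" (9 characters) from the relative link
# and prepend the storage server that hosts this item
pdf_url = "https://ia904703.us.archive.org/3/items"+df.Lien_PDF[0][9:]
pdf_response = requests.get(pdf_url)
if pdf_response.status_code == 200:
    # Write the binary content to a file named after the identifier
    with open(f"{df.Titre[0]}.pdf", "wb") as pdf_file:
        pdf_file.write(pdf_response.content)
else:
    print(f"Error {pdf_response.status_code} while downloading the PDF.")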
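
The download cell above hard-codes one storage server (`ia904703.us.archive.org`) and slices the link by character position. A possibly more portable variant resolves the relative link against the page URL with the standard library's `urljoin`. This is only a sketch: it assumes the site serves, or redirects from, the `/download/...` path on its main host, which the notebook does not confirm.

```python
from urllib.parse import urljoin

# Page URL and relative link as obtained in the cells above.
page_url = "https://archive.org/details/crash-magazine-01"
relative_link = "/download/crash-magazine-01/Crash_01_Feb_1984.pdf"

# A link starting with "/" is resolved against the scheme and host of the page.
pdf_url = urljoin(page_url, relative_link)
print(pdf_url)
```

The resulting URL can then be passed to `requests.get`, which follows redirects by default for GET requests.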

### **Retrieving data from Explore**

In [47]:
# import a CSV file

# Specify the path to the CSV file
chemin_vers_csv = "38-vangogh.csv"

# Load the CSV file into a DataFrame
df = pd.read_csv(chemin_vers_csv)

# Display the DataFrame
df

Unnamed: 0,numero_cluster,manifest_url,canvas_number,image_url,City,Country,Title,wkt,Date,Journal Type,notice,group_name,group_tags
0,71,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,122.0,https://digi.ub.uni-heidelberg.de/iiif/2/samle...,Copenhagen,Denmark,Samleren,POINT(12.568888888889 55.676111111111),1929-01-01,Art History,,VanGogh_Autoportrait_1889,autoportrait
1,71,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,145.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,Paris,France,L'Amour de l'art (1920),POINT(2.3513888888889 48.856944444444),1937-01-01,Modern Art Journal,,VanGogh_Autoportrait_1889,autoportrait
2,71,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,202.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,Paris,France,Cahiers d'art (Paris),POINT(2.3513888888889 48.856944444444),1947-01-01,Avant-garde Journal,,VanGogh_Autoportrait_1889,autoportrait
3,71,https://iiif.archivelab.org/iiif/sim_computerw...,114.0,https://iiif.archivelab.org/iiif/sim_computerw...,Framingham,United States of America,Computerworld,POINT(-71.416666666667 42.279166666667),2000-10-16,Computing,,VanGogh_Autoportrait_1889,autoportrait
4,86,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,672.0,https://digi.ub.uni-heidelberg.de/iiif/2/cicer...,Leipzig,Germany,Der Cicerone,POINT(12.375 51.34),1922-01-01,Modern Art Journal,,VanGogh_CafeDeNuit_1888,scene_genre
...,...,...,...,...,...,...,...,...,...,...,...,...,...
142,281,https://iiif.archivelab.org/iiif/sim_carnegie_...,17.0,https://iiif.archivelab.org/iiif/sim_carnegie_...,Pittsburgh,United States of America,Carnegie,POINT(-80 40.44166666666667),1943-02-01,Philanthropy,,VanGogh_ArlesiennePortraitMadameGinoux_1890,Portrait
143,284,https://iiif.unige.ch/dhportal/ug8084702/manifest,57.0,https://iiif.unige.ch/iiif/2/fedora_ug8042454;...,Buenos Aires,Argentina,El Amante. Cine,POINT(-58.381944444444 -34.599722222222),2002-05-01,Cinema,,VanGogh_EgliseAuversSurOise_1890,Rural scene
144,287,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,532.0,https://digi.ub.uni-heidelberg.de/iiif/2/cicer...,Leipzig,Germany,Der Cicerone,POINT(12.375 51.34),1926-01-01,Modern Art Journal,,VanGogh_YoungPEasantGirlInAStraxHatSittingInFr...,Portrait
145,287,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k30...,271.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k30...,Paris,France,L'Art vivant (Paris. 1925),POINT(2.3513888888889 48.856944444444),1938-01-01,Modern Art Journal,,VanGogh_YoungPEasantGirlInAStraxHatSittingInFr...,Portrait


In [48]:
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 0 to 146
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   numero_cluster  147 non-null    int64  
 1   manifest_url    147 non-null    object 
 2   canvas_number   142 non-null    float64
 3   image_url       147 non-null    object 
 4   City            147 non-null    object 
 5   Country         147 non-null    object 
 6   Title           147 non-null    object 
 7   wkt             147 non-null    object 
 8   Date            147 non-null    object 
 9   Journal Type    147 non-null    object 
 10  notice          0 non-null      float64
 11  group_name      147 non-null    object 
 12  group_tags      147 non-null    object 
dtypes: float64(2), int64(1), object(10)
memory usage: 16.1+ KB
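
The `info()` summary shows that `notice` has no non-null values, that `canvas_number` is stored as a float because some entries are missing, and that `Date` is a plain string column. Below is a sketch of a possible cleanup step, run on a tiny hand-made frame that reuses three of the column names. The sample rows are placeholders for illustration, not a faithful copy of the CSV.

```python
import pandas as pd

# Tiny stand-in reusing three column names from the summary above.
sample = pd.DataFrame({
    "canvas_number": [122.0, None, 17.0],
    "Date": ["1929-01-01", "1937-01-01", "1943-02-01"],
    "notice": [float("nan")] * 3,
})

# Drop columns that contain no values at all (here: notice).
cleaned = sample.dropna(axis=1, how="all")

# Parse the date strings so they can be sorted and filtered by year.
cleaned = cleaned.assign(Date=pd.to_datetime(cleaned["Date"]))

# Use the nullable integer type so missing canvas numbers stay missing.
cleaned = cleaned.assign(canvas_number=cleaned["canvas_number"].astype("Int64"))

print(cleaned.dtypes)
```

Whether to drop `notice` depends on the later analysis; it is removed here only because it carries no information in this export.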


In [49]:
# create a unique identifier
import hashlib

# Add a new column holding a SHA-256 hash of the 'Title', 'manifest_url' and 'image_url' columns
df['Hash'] = df.apply(lambda row: hashlib.sha256(str(row["Title"]+row['manifest_url']+row['image_url']).encode('utf-8')).hexdigest(), axis=1)

# Display the updated DataFrame
df

Unnamed: 0,numero_cluster,manifest_url,canvas_number,image_url,City,Country,Title,wkt,Date,Journal Type,notice,group_name,group_tags,Hash
0,71,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,122.0,https://digi.ub.uni-heidelberg.de/iiif/2/samle...,Copenhagen,Denmark,Samleren,POINT(12.568888888889 55.676111111111),1929-01-01,Art History,,VanGogh_Autoportrait_1889,autoportrait,02d3d63bbfd6db3ea7fd7376a9d1273b20368602db099e...
1,71,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,145.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,Paris,France,L'Amour de l'art (1920),POINT(2.3513888888889 48.856944444444),1937-01-01,Modern Art Journal,,VanGogh_Autoportrait_1889,autoportrait,d44e2d3ee4c4b4aa398a9d437ff6d8a565b929b065f599...
2,71,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,202.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,Paris,France,Cahiers d'art (Paris),POINT(2.3513888888889 48.856944444444),1947-01-01,Avant-garde Journal,,VanGogh_Autoportrait_1889,autoportrait,de9eb43c79200891c4485dd53ddf84164278290fba6006...
3,71,https://iiif.archivelab.org/iiif/sim_computerw...,114.0,https://iiif.archivelab.org/iiif/sim_computerw...,Framingham,United States of America,Computerworld,POINT(-71.416666666667 42.279166666667),2000-10-16,Computing,,VanGogh_Autoportrait_1889,autoportrait,88fb8a30397acb3223e7cad3667149dcba58c0e50c515b...
4,86,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,672.0,https://digi.ub.uni-heidelberg.de/iiif/2/cicer...,Leipzig,Germany,Der Cicerone,POINT(12.375 51.34),1922-01-01,Modern Art Journal,,VanGogh_CafeDeNuit_1888,scene_genre,7a702f6e82b64b1be026ba421e4b2af986e9b5697fb69f...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142,281,https://iiif.archivelab.org/iiif/sim_carnegie_...,17.0,https://iiif.archivelab.org/iiif/sim_carnegie_...,Pittsburgh,United States of America,Carnegie,POINT(-80 40.44166666666667),1943-02-01,Philanthropy,,VanGogh_ArlesiennePortraitMadameGinoux_1890,Portrait,95bccb25ae7d3a81fbd439562d223b36d256fae5625e88...
143,284,https://iiif.unige.ch/dhportal/ug8084702/manifest,57.0,https://iiif.unige.ch/iiif/2/fedora_ug8042454;...,Buenos Aires,Argentina,El Amante. Cine,POINT(-58.381944444444 -34.599722222222),2002-05-01,Cinema,,VanGogh_EgliseAuversSurOise_1890,Rural scene,402f67b9e472878c65956c306f6ef3b8cc31b3085fd602...
144,287,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,532.0,https://digi.ub.uni-heidelberg.de/iiif/2/cicer...,Leipzig,Germany,Der Cicerone,POINT(12.375 51.34),1926-01-01,Modern Art Journal,,VanGogh_YoungPEasantGirlInAStraxHatSittingInFr...,Portrait,e82ce126f2e94a8314960823540b0f33171b2ef79d318b...
145,287,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k30...,271.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k30...,Paris,France,L'Art vivant (Paris. 1925),POINT(2.3513888888889 48.856944444444),1938-01-01,Modern Art Journal,,VanGogh_YoungPEasantGirlInAStraxHatSittingInFr...,Portrait,1fc340431a9954c4f085d8dca2baf697a3508b1a9a0a8f...
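
The hash cell above concatenates the three fields directly, which means two different rows could in principle produce the same input string (for example `"ab" + "c"` and `"a" + "bc"`). Here is a sketch of a slightly more defensive variant that joins the fields with a separator before hashing. The `make_identifier` helper and the `|` separator are illustrative choices, not part of the notebook.

```python
import hashlib

def make_identifier(*fields, separator="|"):
    """Hash the given fields into a stable hex identifier (hypothetical helper)."""
    joined = separator.join(str(field) for field in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

first = make_identifier("ab", "c")
second = make_identifier("a", "bc")

print(len(first))                           # length of a SHA-256 hex digest
print(first == second)                      # do the two inputs collide?
print(first == make_identifier("ab", "c"))  # is the identifier reproducible?
```

It could be applied row by row with `df.apply(lambda row: make_identifier(row["Title"], row["manifest_url"], row["image_url"]), axis=1)`. The resulting values would differ from the hashes displayed above, since the hashed string is no longer the same.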


In [50]:
# check: the number of unique hashes should equal the number of rows

len(df.Hash.unique())

147

In [56]:
# download
import os

# This assumes you already have a DataFrame df with your data

# Create a folder for the images if it does not exist yet
dossier_images = "imagesVanGogh"
os.makedirs(dossier_images, exist_ok=True)

# Download and save the images, with error handling
for index, row in df.iterrows():
    url = row['image_url']

    try:
        # Download the image
        response = requests.get(url)
        response.raise_for_status()  # Raises an exception for HTTP error codes

        # Reuse the row's hash as the file name
        hash_image = row['Hash']

        # Save the image in the folder, named after its hash
        chemin_image = f"{dossier_images}/{hash_image}.jpg"
        with open(chemin_image, "wb") as image_file:
            image_file.write(response.content)
        
        print(f"Image {url} saved as {chemin_image}")

    except requests.exceptions.RequestException as e:
        # Report the error and continue with the next image
        print(f"Error while processing image {url}: {e}")

print("Processing finished.")

Image https://digi.ub.uni-heidelberg.de/iiif/2/samleren1929%3A064_bg.jpg/502,689,1339,1652/full/0/default.jpg saved as imagesVanGogh/02d3d63bbfd6db3ea7fd7376a9d1273b20368602db099ef89331981ba3214204.jpg
Error while processing image https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4226205f/f146/510,480,1042,1229/full/0/default.jpg: HTTPSConnectionPool(host='gallica.bnf.fr', port=443): Max retries exceeded with url: /iiif/ark:/12148/bpt6k4226205f/f146/510,480,1042,1229/full/0/default.jpg (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff4328b1370>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
Error while processing image https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42260903/f203/1613,333,1663,1955/full/0/default.jpg: HTTPSConnectionPool(host='gallica.bnf.fr', port=443): Max retries exceeded with url: /iiif/ark:/12148/bpt6k42260903/f203/1613,333,1663,1955/full/0/default.jpg (Caused by NewCo
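
The image URLs in the output above appear to follow the IIIF Image API pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`, with `full` as the size. Assuming the servers honor that pattern, a smaller derivative can be requested by rewriting the size segment, which may reduce download time. The `resize_iiif_url` helper below is hypothetical and only manipulates the URL string.

```python
def resize_iiif_url(url, width):
    """Replace the IIIF size segment with a fixed width (hypothetical helper)."""
    # The last four path segments are region, size, rotation and quality.format.
    prefix, region, size, rotation, quality = url.rsplit("/", 4)
    # "w," asks for the given width and lets the server scale the height.
    return "/".join([prefix, region, f"{width},", rotation, quality])

url = ("https://digi.ub.uni-heidelberg.de/iiif/2/samleren1929%3A064_bg.jpg/"
       "502,689,1339,1652/full/0/default.jpg")
print(resize_iiif_url(url, 600))
```

The rewritten URL keeps the same region of the page, so the crop is unchanged and only its resolution differs.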