# Become a movie director

Let's use Scrapy to get some information about the 250 top rated movies on <a href="http://www.imdb.com/" target="_blank">IMDB</a>.

1. Install `Scrapy`:

In [22]:
# !pip install scrapy

## Étape 1 — Extraire les informations d’un seul film

https://www.imdb.com/fr/chart/boxoffice/

2. Lance une requête vers la page IMDb :

In [23]:
import requests
from scrapy import Selector

In [35]:
url = "https://www.imdb.com/chart/boxoffice/"

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36',
    "Accept-Language": "fr-FR,fr;q=0.9"
}

response = requests.get(url,headers=headers)

response.content



In [36]:
response.content



In [37]:
sel = Selector(text=response.text)
sel

<Selector query=None data='<html lang="fr-FR" xmlns:og="http://o...'>

3. Identifie la balise principale contenant les films :

In [38]:
films = sel.css("li.ipc-metadata-list-summary-item")
films

[<Selector query="descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' ipc-metadata-list-summary-item ')]" data='<li class="ipc-metadata-list-summary-...'>,
 <Selector query="descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' ipc-metadata-list-summary-item ')]" data='<li class="ipc-metadata-list-summary-...'>,
 <Selector query="descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' ipc-metadata-list-summary-item ')]" data='<li class="ipc-metadata-list-summary-...'>,
 <Selector query="descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' ipc-metadata-list-summary-item ')]" data='<li class="ipc-metadata-list-summary-...'>,
 <Selector query="descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' ipc-metadata-list-summary-item ')]" data='<li class="ipc-metadata-list-summary-...'>,
 <Selector query="descendant-or-self::li

4. Sélectionne le premier film, récupère pas à pas :
- le classement
- le titre du film
- la note IMDb
- le nombre de votes (en nettoyant la valeur : '84\xa0k' → 84000)
- les revenus s’ils sont affichés ('34\xa0M\xa0$US' → 34000000)

💡 Utilise `.css("...::text").get()` pour tester les sélecteurs un par un.

💡 Quand un texte est découpé en plusieurs morceaux (['\xa0(', '84\xa0k', ')']), pense à utiliser `.getall()` puis `''.join(...)`.

In [39]:
premier_film = films[0]
print(premier_film)

<Selector query="descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' ipc-metadata-list-summary-item ')]" data='<li class="ipc-metadata-list-summary-...'>


In [40]:
# Titre du film
premier_titre = premier_film.css("h3[class='ipc-title__text']::text").get()
premier_titre

'Send Help'

In [41]:
# Url ??? 
premier_url_end = premier_film.css("a[class='ipc-title-link-wrapper']::attr(href)").get()
premier_url_end

'/fr/title/tt8036976/?ref_=chtbo_t_1'

In [42]:
# Url finale, propre
premier_url = "imdb.com" + premier_url_end
premier_url

'imdb.com/fr/title/tt8036976/?ref_=chtbo_t_1'

In [None]:

# total_gross = premier_film.css("span.sc-382281d-2.bXJhOC:nth-of-type(1)::text")


'37\xa0M\xa0$US'

In [68]:
total_gross = premier_film.css("span[class='sc-382281d-2 bXJhOC']::text")[1].get() #selection du 2e élément possédant la même classe
print(total_gross)
budget = premier_film.css("li.sc-382281d-1.gPDhWQ:nth-child(2) span.sc-382281d-2.bXJhOC::text").get() #On utilise le css selector de la classe mère pour distinguer les deux classes
print(budget)
budget == total_gross

37 M $US
37 M $US


True

In [49]:
# Earnings
premier_earnings_raw = premier_film.css("span[class='sc-382281d-2 bXJhOC']::text").get()
premier_earnings_raw = premier_film.css("span[class='sc-382281d-2 bXJhOC']::text").get()

premier_earnings_raw


'9\xa0M\xa0$US'

In [44]:
# Fonction pour nettoyer le string
import re
def parse_earnings(raw_text):
    """Nettoie et convertit un chiffre d'affaires IMDb (ex: '5,3\xa0M\xa0$US')"""
    if not raw_text:
        return None
    text = raw_text.replace('\xa0', ' ').replace('$US', '').strip()
    match = re.search(r'([\d\s.,]+)\s*([KkMm]?)', text)
    if not match:
        return None

    number = match.group(1).replace(' ', '').replace(',', '.')
    suffix = match.group(2).upper()
    try:
        value = float(number)
    except ValueError:
        return None

    if suffix == 'K':
        value *= 1_000
    elif suffix == 'M':
        value *= 1_000_000
    return int(value)

In [45]:
premier_earnings = parse_earnings(premier_earnings_raw)
premier_earnings

In [69]:
# Le rating
premier_rating = premier_film.css("span[class='ipc-rating-star--rating']::text").get()
premier_rating
# <span class="ipc-rating-star--rating">7.2</span>

'7,2'

In [76]:
# Nombre de voteurs
premier_nb_voteurs = premier_film.css("span[class='ipc-rating-star--voteCount']::text")[1].get()
premier_nb_voteurs
# <span class="ipc-rating-star--voteCount">&nbsp;(<!-- -->22K<!-- -->)</span>

'22\xa0k'

In [77]:
# Nombre de voteurs
premier_nb_voteurs_raw = premier_film.css("span[class='ipc-rating-star--voteCount']::text").getall()
premier_nb_voteurs_raw

['\xa0(', '22\xa0k', ')']

In [78]:
import re

def parse_vote_count(text_list):
    """Nettoie et convertit un nombre de votes IMDb (ex: ['\xa0(', '84\xa0k', ')'])"""
    text = ''.join(text_list).replace('\xa0', ' ').strip()
    match = re.search(r'([\d\s.,]+)\s*([KkMm]?)', text)
    if not match:
        return None

    number = match.group(1).replace(' ', '').replace(',', '.')
    suffix = match.group(2).upper()
    try:
        value = float(number)
    except ValueError:
        return None

    if suffix == 'K':
        value *= 1_000
    elif suffix == 'M':
        value *= 1_000_000
    return int(value)



In [79]:
premier_nb_voteurs = parse_vote_count(premier_nb_voteurs_raw)
premier_nb_voteurs

22000

## Étape 2 — Extraire les informations de tous les films
Une fois que ton code fonctionne pour un film, généralise-le à toute la liste.

### 1. Boucle sur chaque élément de films.
### 2. Applique les mêmes sélecteurs et nettoyages.
### 3. Stocke les résultats dans une liste de dictionnaires :

In [None]:
import requests
from scrapy import Selector

url = "https://www.imdb.com/chart/boxoffice"

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36',
    "Accept-Language": "fr-FR,fr;q=0.9"
}

r = requests.get(url, headers=headers)

response = Selector(text=r.text)

films = response.css("li[class='ipc-metadata-list-summary-item']")
classement = 0
dict_films = {}

for film in films:
    
    ### Création du dico d'un film
    dict_film = {}
    
    ### On crée le classement nous meme
    classement += 1
    dict_film['classement'] = classement
    
    ### On récupère le titre
    titre = film.css("h3[class='ipc-title__text']::text").get()
    dict_film['title'] = titre
    
    ### On récupère l'URL
    url_film_end = film.css("").get()
    url_film = "imdb.com" + url_film_end
    dict_film['url'] = url_film
    
    
    ### On récupère les earnings
    earnings_raw = film.css("").get()
    earnings = parse_earnings(earnings_raw)
    dict_film['earnings']  = earnings
    
    ### On récupère le rating
    rating = film.css("").get()
    dict_film['rating'] = rating
    
    ### On récupère le nombre de votants
    nb_voteurs_raw = film.css("").getall()
    nb_voteurs = parse_vote_count(nb_voteurs_raw)
    dict_film['nb_voters'] = nb_voteurs
    
    dict_films[f'film number {classement}'] = dict_film

In [18]:
dict_films

{'film number 1': {'classement': 1,
  'title': None,
  'url': 'imdb.com/fr/title/tt8036976/?ref_=chtbo_i_1',
  'earnings': 9000000,
  'rating': '7,2',
  'nb_voters': 22000},
 'film number 2': {'classement': 2,
  'title': None,
  'url': 'imdb.com/fr/title/tt32306991/?ref_=chtbo_i_2',
  'earnings': 7000000,
  'rating': '7,6',
  'nb_voters': 1900},
 'film number 3': {'classement': 3,
  'title': None,
  'url': 'imdb.com/fr/title/tt27564844/?ref_=chtbo_i_3',
  'earnings': 6000000,
  'rating': '6,6',
  'nb_voters': 15000},
 'film number 4': {'classement': 4,
  'title': None,
  'url': 'imdb.com/fr/title/tt39216314/?ref_=chtbo_i_4',
  'earnings': 5700000,
  'rating': '8,5',
  'nb_voters': 1100},
 'film number 5': {'classement': 5,
  'title': None,
  'url': 'imdb.com/fr/title/tt31434030/?ref_=chtbo_i_5',
  'earnings': 4400000,
  'rating': '6,2',
  'nb_voters': 26000},
 'film number 6': {'classement': 6,
  'title': None,
  'url': 'imdb.com/fr/title/tt26443597/?ref_=chtbo_i_6',
  'earnings': 4000

## 🕷️ Étape 3 — Créer un crawler Scrapy

Maintenant que tes sélecteurs sont au point, transforme ton code en Spider Scrapy.

### 1. Crée une classe ImdbSpider(scrapy.Spider) :

In [None]:
import os
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

class imdb_spider(scrapy.Spider):
    # Name of your spider
    name = "imdb"

    # Url to start your spider from 
    start_urls = ["https://www.imdb.com/chart/boxoffice"]

    # Callback function that will be called when starting your spider
    def parse(self, response):
        
        films = response.css("li[class='ipc-metadata-list-summary-item']")

        for film,i in zip(films,range(1,11)):
            ### Classement 
            classement = i
            
            ### Titre 
            titre = film.css("h3[class='ipc-title__text']::text").get()
            
            ### URL
            url_film_end = film.css("a[href]::attr(href)").get()
            url_film = "imdb.com" + url_film_end
            
            ### Earnings
            earnings_raw = film.css("span[class='sc-382281d-2 bXJhOC']::text").get()
            earnings = parse_earnings(earnings_raw)
            
            ### Ratings
            rating = film.css("span[class='ipc-rating-star--rating']::text").get()

            ### Nb votants
            nb_voteurs_raw = film.css("span[class='ipc-rating-star--voteCount']::text").getall()
            nb_voteurs = parse_vote_count(nb_voteurs_raw)

            yield {
                "ranking": classement,
                "title":   titre,
                "url":     url_film,
                "total_earnings": earnings,
                "rating":  rating,
                "nb_voters": nb_voteurs,
            }


# Name of the file where the results will be saved
filename = "imdb.json"

# If file already exists, delete it before crawling (because Scrapy will 
# concatenate the last and new results otherwise)
if filename in os.listdir('01-Become_a_movie_director/'):
        os.remove('01-Become_a_movie_director/' + filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Simulates a browser on an OS
## LOG_LEVEL => Minimal Level of Log 
## FEEDS => Where the file will be stored 
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    "USER_AGENT": ("Chrome/140.0.0.0"),
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        '01-Become_a_movie_director/' + filename : {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(imdb_spider)
process.start()

2026-02-12 09:39:23 [scrapy.utils.log] INFO: Scrapy 2.13.3 started (bot: scrapybot)
2026-02-12 09:39:23 [scrapy.utils.log] INFO: Versions:
{'lxml': '5.3.0',
 'libxml2': '2.13.8',
 'cssselect': '1.2.0',
 'parsel': '1.8.1',
 'w3lib': '2.1.2',
 'Twisted': '24.11.0',
 'Python': '3.13.5 | packaged by Anaconda, Inc. | (main, Jun 12 2025, '
           '16:37:03) [MSC v.1929 64 bit (AMD64)]',
 'pyOpenSSL': '25.1.0 (OpenSSL 3.5.3 16 Sep 2025)',
 'cryptography': '45.0.5',
 'Platform': 'Windows-11-10.0.26200-SP0'}
2026-02-12 09:39:23 [scrapy.addons] INFO: Enabled addons:
[]
2026-02-12 09:39:23 [scrapy.extensions.telnet] INFO: Telnet Password: 78f531f90173f9d3
2026-02-12 09:39:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2026-02-12 09:39:24 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20, 'USER_AGENT': 'Chrome/14

RuntimeError: This event loop is already running

2026-02-12 09:39:25 [scrapy.core.engine] INFO: Spider opened
2026-02-12 09:39:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2026-02-12 09:39:25 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2026-02-12 09:39:25 [scrapy.core.engine] INFO: Closing spider (finished)
2026-02-12 09:39:25 [scrapy.extensions.feedexport] INFO: Stored json feed (0 items) in: 01-Become_a_movie_director/imdb.json
2026-02-12 09:39:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 213,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 3006,
 'downloader/response_count': 1,
 'downloader/response_status_count/202': 1,
 'elapsed_time_seconds': 0.190562,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2026, 2, 12, 8, 39, 25, 437236, tzinfo=datetime.timezone.utc),
 'items_per_minute

: 

### 2. Teste ton spider directement depuis le notebook :

In [None]:
import os
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

class imdb_spider(scrapy.Spider):
    # Name of your spider
    name = "imdb"

    # Url to start your spider from 
    start_urls = ["https://www.imdb.com/chart/boxoffice"]

    # Callback function that will be called when starting your spider
    def parse(self, response):
        
        films = response.css("li[class='ipc-metadata-list-summary-item']")

        for film,i in zip(films,range(1,11)):
            ### Classement 
            classement = i
            
            ### Titre 
            titre = film.css("h3[class='ipc-title__text']::text").get()
            
            ### URL
            url_film_end = film.css("a[href]::attr(href)").get()
            url_film = "imdb.com" + url_film_end
            
            ### Earnings
            earnings_raw = film.css("span[class='sc-382281d-2 bXJhOC']::text").get()
            earnings = parse_earnings(earnings_raw)
            
            ### Ratings
            rating = film.css("span[class='ipc-rating-star--rating']::text").get()

            ### Nb votants
            nb_voteurs_raw = film.css("span[class='ipc-rating-star--voteCount']::text").getall()
            nb_voteurs = parse_vote_count(nb_voteurs_raw)

            yield {
                "ranking": classement,
                "title":   titre,
                "url":     url_film,
                "total_earnings": earnings,
                "rating":  rating,
                "nb_voters": nb_voteurs,
            }


# Name of the file where the results will be saved
filename = "imdb.json"

# If file already exists, delete it before crawling (because Scrapy will 
# concatenate the last and new results otherwise)
if filename in os.listdir('01-Become_a_movie_director/'):
        os.remove('01-Become_a_movie_director/' + filename)

# Declare a new CrawlerProcess with some settings
## USER_AGENT => Simulates a browser on an OS
## LOG_LEVEL => Minimal Level of Log 
## FEEDS => Where the file will be stored 
## More info on built-in settings => https://docs.scrapy.org/en/latest/topics/settings.html?highlight=settings#settings
process = CrawlerProcess(settings = {
    "USER_AGENT": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/124.0 Safari/124.0"),
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        '01-Become_a_movie_director/' + filename : {"format": "json"},
    }
})

# Start the crawling using the spider you defined above
process.crawl(imdb_spider)
process.start()

2026-02-12 09:39:49 [scrapy.utils.log] INFO: Scrapy 2.13.3 started (bot: scrapybot)
2026-02-12 09:39:49 [scrapy.utils.log] INFO: Versions:
{'lxml': '5.3.0',
 'libxml2': '2.13.8',
 'cssselect': '1.2.0',
 'parsel': '1.8.1',
 'w3lib': '2.1.2',
 'Twisted': '24.11.0',
 'Python': '3.13.5 | packaged by Anaconda, Inc. | (main, Jun 12 2025, '
           '16:37:03) [MSC v.1929 64 bit (AMD64)]',
 'pyOpenSSL': '25.1.0 (OpenSSL 3.5.3 16 Sep 2025)',
 'cryptography': '45.0.5',
 'Platform': 'Windows-11-10.0.26200-SP0'}
2026-02-12 09:39:49 [scrapy.addons] INFO: Enabled addons:
[]
2026-02-12 09:39:49 [scrapy.extensions.telnet] INFO: Telnet Password: c2821777f7c18491
2026-02-12 09:39:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2026-02-12 09:39:50 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/

RuntimeError: This event loop is already running

2026-02-12 09:39:51 [scrapy.core.engine] INFO: Spider opened
2026-02-12 09:39:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2026-02-12 09:39:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2026-02-12 09:39:52 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.imdb.com/chart/boxoffice/> (referer: None)
Traceback (most recent call last):
  File "c:\Users\dubos\miniforge3\envs\JedhaTraining\Lib\site-packages\scrapy\utils\defer.py", line 343, in iter_errback
    yield next(it)
          ~~~~^^^^
  File "c:\Users\dubos\miniforge3\envs\JedhaTraining\Lib\site-packages\scrapy\utils\python.py", line 369, in __next__
    return next(self.data)
  File "c:\Users\dubos\miniforge3\envs\JedhaTraining\Lib\site-packages\scrapy\utils\python.py", line 369, in __next__
    return next(self.data)
  File "c:\Users\dubos\miniforge3\envs\JedhaTraining\Lib\site-packages\scrapy\core\spidermw.py", line 16

: 

### 3. Maintenant tu peux exporter ton spider en un imdb.py et le déployer depuis ton terminal. 