<a href="https://colab.research.google.com/github/lugsantistebanji/WCS-IA/blob/main/WCS_IA_Exercice_Scraping_Imdb_Top_10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# __Mission: Scrape the Top 10 Box-Office Films in Real Time!__

## __Objective:__

You will scrape, in real time, the 10 most popular films currently at the
worldwide box office and generate an interactive web page showing each film's poster, title, and score.

## __Context__

You are a passionate film analyst, and you want to know which films are being watched the most right now.

Rather than searching by hand, automate the process by scraping a movie site, then display the ranking of the 10 most-watched films with their posters and ratings.

Target site: IMDb

We will scrape the 10 films at the top of the IMDb Box Office chart from this link:
https://www.imdb.com/chart/boxoffice/


## __Tools to use:__

- requests to fetch the page
- BeautifulSoup to extract the films
- pandas to structure the data
- matplotlib for a graphical visualization
- HTML & CSS for a styled web page


## __Challenge steps:__

In [None]:
!pip install -q streamlit

In [3]:
!npm install localtunnel

added 22 packages in 3s

3 packages are looking for funding
  run `npm fund` for details

In [16]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re

___

### __1. Scrape the Top 10 films__

- Use requests to fetch the IMDb page
- Use BeautifulSoup to extract each film's title, score, box-office revenue, and poster
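
Once the raw text has been extracted, the clean-up can be tested separately from the network call. The sketch below is a minimal illustration using only the standard library. The helper names `clean_title` and `parse_gross`, the sample strings, and the `K`/`M`/`B` suffixes are assumptions about how the page formats its text, not something the IMDb page guarantees.

```python
import re

def clean_title(raw: str) -> str:
    """Drop a leading rank such as '1. ' from a chart title."""
    return re.sub(r'^\d+\.\s*', '', raw.strip())

# Multipliers for the suffixes this sketch assumes the page may use.
_SUFFIX = {'K': 1e3, 'M': 1e6, 'B': 1e9}

def parse_gross(raw: str) -> float:
    """Convert text like 'Total Gross: $37M' or '$1.2B' to dollars."""
    match = re.search(r'\$\s*([\d,.]+)\s*([KMB])?', raw)
    if match is None:
        raise ValueError(f"No dollar amount found in {raw!r}")
    number = float(match.group(1).replace(',', ''))
    # A missing suffix means the amount is already in plain dollars.
    return number * _SUFFIX.get(match.group(2), 1)

print(clean_title("1. Dog Man"))
print(parse_gross("Total Gross: $37M"))
```

Unlike a plain `int(...)` conversion, this version accepts decimal amounts such as `$1.2B`, and it raises a clear error when no amount is present.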

In [12]:
url_base = "https://www.imdb.com/chart/boxoffice/"
headers = {
    'user-agent' : 'Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0'
}

In [174]:
#%%writefile scraping.py
# Uncomment the line above to save this cell as scraping.py (script_update.py imports it).

from bs4 import BeautifulSoup
import re
import requests
import pandas as pd

headers = {
    'user-agent' : 'Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0'
}

def get_info_movie(movie_soup: BeautifulSoup) -> dict:
    movie = {}
    title = ''
    score = 0
    revenue = 0
    link_image = ''

    try:
        title = re.sub(r'^\d+\.\s',"", movie_soup.find('h3').text)

        score = float(movie_soup.find('span', attrs={'aria-label':re.compile(r'IMDb rating')}).get('aria-label').split(":")[-1].strip())

        revenue = int(movie_soup.find('ul').find('li').next_sibling.text.split(":")[-1].strip("$M "))

        link_image = movie_soup.find('img').get('src')

        movie = {
            'title' : title,
            'score' : score,
            'revenue' : revenue,
            'link_image' : link_image
        }
    except Exception as e:
        print("Error info movie")
        print(e)

    return movie

def get_movies(url: str, headers=headers) -> list:
    movies = []

    # A timeout keeps the call from hanging forever if the site does not answer.
    response = requests.get(url, headers=headers, timeout=10)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        movies_container = soup.find('div', attrs={'data-testid': 'chart-layout-main-column'})
        if movies_container is not None:
            movies_container = movies_container.find('ul')
            movies_list = movies_container.find_all('li', class_="ipc-metadata-list-summary-item")
            for movie in movies_list:
                movies.append(get_info_movie(movie))
        else:
            print("Error in scraping the main container")
    else:
        print(f"Error in the website: {response.status_code}")
    return movies

def create_csv_file(url: str, headers: dict, file_name: str) -> None:
    # index=False avoids writing the row index as an extra unnamed column.
    pd.DataFrame(get_movies(url, headers)).to_csv(file_name, index=False)
    print("CSV File Created")

In [175]:
movies = get_movies(url_base)

___

### __2. Display the data as a table__

- Store the data in a pandas DataFrame
- Sort the films by revenue and display them
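
The sorting step could look like the minimal sketch below. The three sample rows and the name `ranked_df` are illustrative assumptions, and `revenue` is taken to be in millions of dollars, as the scraper's output suggests.

```python
import pandas as pd

# Illustrative sample rows; in the notebook these come from get_movies().
movies = [
    {'title': 'Dog Man', 'score': 6.5, 'revenue': 37},
    {'title': 'Moana 2', 'score': 6.9, 'revenue': 454},
    {'title': 'Companion', 'score': 7.4, 'revenue': 10},
]
movies_df = pd.DataFrame(movies)

# Highest-grossing first; reset_index gives a clean 0..n-1 ranking.
ranked_df = (
    movies_df.sort_values('revenue', ascending=False)
             .reset_index(drop=True)
)
print(ranked_df)
```

`sort_values` returns a new DataFrame, so `movies_df` keeps the original chart order.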

In [176]:
movies_df = pd.DataFrame(movies)

In [178]:
movies_df

| | title | score | revenue | link_image |
|---|---|---|---|---|
| 0 | Dog Man | 6.5 | 37 | https://m.media-amazon.com/images/M/MV5BNDVlZG... |
| 1 | Companion | 7.4 | 10 | https://m.media-amazon.com/images/M/MV5BYjkyZT... |
| 2 | Mufasa: The Lion King | 6.7 | 230 | https://m.media-amazon.com/images/M/MV5BYjBkOW... |
| 3 | One of Them Days | 7.1 | 35 | https://m.media-amazon.com/images/M/MV5BNjI3OG... |
| 4 | Flight Risk | 5.5 | 21 | https://m.media-amazon.com/images/M/MV5BY2Q3Yj... |
| 5 | Sonic the Hedgehog 3 | 7.0 | 231 | https://m.media-amazon.com/images/M/MV5BMjZjNj... |
| 6 | Moana 2 | 6.9 | 454 | https://m.media-amazon.com/images/M/MV5BZDUxNT... |
| 7 | A Complete Unknown | 7.7 | 67 | https://m.media-amazon.com/images/M/MV5BYTA2NT... |
| 8 | The Brutalist | 8.0 | 12 | https://m.media-amazon.com/images/M/MV5BM2U0MW... |
| 9 | Den of Thieves 2: Pantera | 6.4 | 35 | https://m.media-amazon.com/images/M/MV5BZGIyYT... |


In [121]:
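# index=False keeps the row index out of the file, so no unnamed column appears on reload.
movies_df.to_csv('movies.csv', index=False)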

___

### __3. Generate a dynamic HTML page__

- Create a web page showing the films' posters, titles, and scores
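
Since the tool list mentions HTML & CSS, here is a minimal sketch of building such a page by hand with the standard library only. It is an illustration, not the approach used in this notebook. The function name `render_page`, the CSS class `movie`, and the placeholder poster URL are all assumptions.

```python
from html import escape

# Small stylesheet; the class name "movie" is an arbitrary choice for this sketch.
STYLE = (
    "body { font-family: sans-serif; } "
    ".movie { display: flex; gap: 1rem; margin: 1rem 0; } "
    ".movie img { width: 90px; }"
)

def render_page(movies: list) -> str:
    """Return a complete HTML document with one card per film."""
    cards = []
    for rank, movie in enumerate(movies, start=1):
        cards.append(
            '<div class="movie">'
            f'<img src="{escape(movie["link_image"])}" alt="">'
            f'<div><h2>{rank}. {escape(movie["title"])}</h2>'
            f'<p>Score: {movie["score"]} | Revenue: {movie["revenue"]}</p></div>'
            '</div>'
        )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<title>Top Box Office</title><style>{STYLE}</style></head>"
        f"<body><h1>Top Box Office</h1>{''.join(cards)}</body></html>"
    )

sample = [
    {"title": "Dog Man", "score": 6.5, "revenue": 37,
     "link_image": "https://example.com/poster.jpg"},  # placeholder URL
]
page = render_page(sample)
print(page)
```

Escaping titles and URLs with `html.escape` keeps characters such as `&` or `<` in scraped text from breaking the markup. The returned string can be written to a file such as `index.html` and opened in a browser.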

In [180]:
%%writefile app.py

import streamlit as st
import pandas as pd


csv_filename = "./movies.csv"
movies_df = pd.read_csv(csv_filename)

st.title("IMDb Top Box Office (US)")

for index, movie in movies_df.iterrows():
    col1, mid, col2 = st.columns([6, 1, 20])
    with col1:
        st.image(movie['link_image'])
    with col2:
        st.write(f"**{int(index)+1}. {movie['title'].upper()}**")
        st.write(f":star: **{movie['score']}**")
        st.write(f":dollar: **{movie['revenue']}**")

Overwriting app.py


In [146]:
!streamlit run app.py &>/content/logs.txt & npx localtunnel --port 8501 & curl ipv4.icanhazip.com

34.44.121.124
your url is: https://breezy-tigers-train.loca.lt


___

### __4. Automate the update (Bonus)__

- Add a refresh script that updates the films every 24 hours
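
Where no cron daemon is available, a plain Python loop is one possible alternative. The sketch below is a minimal illustration under that assumption. `run_periodically` is a hypothetical helper, and the `sleep` parameter exists only so the demo can run without actually waiting.

```python
import time
from typing import Callable

def run_periodically(task: Callable[[], None], interval_seconds: float,
                     max_runs: int,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Call task max_runs times, waiting interval_seconds between calls."""
    runs = 0
    while runs < max_runs:
        task()
        runs += 1
        if runs < max_runs:  # no need to wait after the final run
            sleep(interval_seconds)
    return runs

# Demo with a stand-in task and a recording "sleep" so it finishes instantly.
calls = []
waits = []
run_periodically(lambda: calls.append('updated'), 24 * 60 * 60,
                 max_runs=3, sleep=waits.append)
print(calls, waits)
```

In a real run, `task` would wrap the scraping-and-save step, and the default `time.sleep` would block for 24 hours between updates.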

In [181]:
!pip install python-crontab

Collecting python-crontab
  Downloading python_crontab-3.2.0-py3-none-any.whl.metadata (17 kB)
Downloading python_crontab-3.2.0-py3-none-any.whl (27 kB)
Installing collected packages: python-crontab
Successfully installed python-crontab-3.2.0


In [182]:
%%writefile script_update.py
from scraping import create_csv_file


url_base = "https://www.imdb.com/chart/boxoffice/"
headers = {
    'user-agent' : 'Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0'
}
csv_filename = "./movies.csv"
create_csv_file(url_base, headers, csv_filename)

Writing script_update.py


In [None]:
from crontab import CronTab

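cron = CronTab(user=True)  # target the current user's crontab so write() has somewhere to save
job = cron.new(command='python script_update.py')
job.setall('0 0 * * *')  # once a day at midnight
cron.write()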

___