### Targeted information retrieval

We have seen how to parse a web page, which retrieves all of its content without distinction.

However, the usual purpose of scraping is to automate the collection of targeted information on the web.


In [18]:
from bs4 import BeautifulSoup
import requests

Let's say I want to scrape the descriptions of the latest movies released in theaters.

So I go to the AlloCiné website and look for the tags that hold the links to each movie's page, where its summary can be found.

#### Retrieving the URLs of the pages of newly released films

In [19]:
url='http://www.allocine.fr/'
r = requests.get(url)
print(url, r.status_code)
soup = BeautifulSoup(r.content,'lxml')
soup

http://www.allocine.fr/ 200


<!DOCTYPE html>
<html lang="fr">
<head>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=no" name="viewport"/>
<meta content="index,follow,max-snippet:-1" name="robots"/>
<title>AlloCiné : Cinéma, Séries TV, BO de films et séries, Vidéos, DVD et VOD</title>
<meta content="" name="keywords"/>
<meta content="noarchive" name="Googlebot"/>
<meta content="global" name="distribution"/>
<meta content="AlloCine" name="author"/>
<meta content="France" name="country"/>
<meta content="48.87078;2.30447" name="geo.position"/>
<meta content="FR" name="geo.country"/>
<meta content="48.87078;2.30447" name="ICBM"/>
<meta content="#FECC00" name="theme-color"/>
<meta content="AlloCiné, le site de référence du cinéma et des séries tv ! Découvrez notre recherche d'horaires de films, le programme tv de vos séries préférées, l'actualité ciné et séries, les émissions

In your web browser (Chrome, Firefox, ...), you can use the "inspect" function (right click -> Inspect) and hover over the areas of the page that interest you. As you do, the HTML panel highlights the markup that corresponds to the area under the mouse.

That's how you find the tags you are interested in.

I notice that the relative link to the page of a newly released movie is stored in tags like this one:

```html
<a class="meta-title meta-title-link" href="/film/fichefilm_gen_cfilm=235582.html" title="Le Grand Bain">Le Grand Bain</a>
```


In [20]:
# Print the visible text of every link on the page
for p in soup.find_all('a'):
    print(p.text)

Un homme en colère
West Side Story
Lucifer
The Resident

            Cinéma
        

            Séries
        

            Trailers
        

            DVD
        

            VOD
        

                        Kids
                    
séances cinéma
news
dossiers
émissions AlloCiné
dernières bandes-annonces
Meilleurs films
Meilleurs films pour enfants
Meilleurs documentaires
Tous les films
Meilleurs films 2019
Tous les films pour enfants
Avant-premières
Agenda des sorties
Films pour enfants à l'affiche
Box Office
Bandes-annonces à ne pas manquer
Army Of The Dead
Army of the Dead Bande-annonce VF
Fast & Furious 9 Bande-annonce VO
Top Gun: Maverick Bande-annonce (2) VO
Hitman & Bodyguard Bande-annonce VO
Shadow in the Cloud Teaser VO
 Fast & Furious 9 
 Nomadland
 
 Godzilla vs Kong
 
 Space Jam - Nouvelle ère
 
 
Tous les films à venir
Dernières news cinéma
Godzilla vs Kong privé de sortie cinéma
Le Chant du Loup sur M6 : François Civil et Omar Sy ont-ils tourné dans de vra

This time, the site is harder to extract from: listing every link returns far too much. Let's pass more specific parameters to the search function `find_all`.

In [21]:
# Besides the tag name a, the links we want share the same value
# for their class attribute, so we filter on it as well.
for elem in soup.find_all('a', attrs={"class": "meta-title meta-title-link"}):
    print(elem)

<a class="meta-title meta-title-link" href="/series/ficheserie_gen_cserie=25362.html" title="Le Serpent">Le Serpent</a>
<a class="meta-title meta-title-link" href="/series/ficheserie_gen_cserie=27414.html" title="Le Remplaçant">Le Remplaçant</a>
<a class="meta-title meta-title-link" href="/series/ficheserie_gen_cserie=28562.html" title="Qui a tué Sara ?">Qui a tué Sara ?</a>
<a class="meta-title meta-title-link" href="/series/ficheserie_gen_cserie=27142.html" title="L'Ecole de la vie">L'Ecole de la vie</a>
<a class="meta-title meta-title-link" href="/series/ficheserie_gen_cserie=23853.html" title="Jupiter's Legacy">Jupiter's Legacy</a>
<a class="meta-title meta-title-link" href="/series/ficheserie_gen_cserie=25616.html" title="HPI">HPI</a>
<a class="meta-title meta-title-link" href="/series/ficheserie_gen_cserie=24647.html" title="Shadow and Bone">Shadow and Bone</a>
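Passing the full string `"meta-title meta-title-link"` compares against the class attribute as written, so the filter can miss tags if the site reorders the classes or adds one. Here is a minimal sketch of two more tolerant alternatives, run on a small hand-written fragment (the fragment and the name `demo_soup` are assumptions for illustration, not the real page):

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration; the live page will differ.
html = """
<a class="meta-title meta-title-link" href="/film/fichefilm_gen_cfilm=235582.html" title="Le Grand Bain">Le Grand Bain</a>
<a class="meta-title-link meta-title extra" href="/series/ficheserie_gen_cserie=25362.html" title="Le Serpent">Le Serpent</a>
<a class="other" href="/news/">news</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Exact string: only matches when the attribute reads exactly like this.
exact = demo_soup.find_all("a", attrs={"class": "meta-title meta-title-link"})

# Single class name: matches any tag that carries this class.
by_one_class = demo_soup.find_all("a", class_="meta-title-link")

# CSS selector: requires both classes, in any order.
by_selector = demo_soup.select("a.meta-title.meta-title-link")

print(len(exact), len(by_one_class), len(by_selector))
```

Which form is best depends on how stable the site's markup is; the CSS selector is usually the least fragile of the three.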


#### Retrieving the `href`

We have noticed that these tags carry `href` links to the pages that interest us. Let's retrieve them:

In [22]:
for elem in soup.find_all('a', attrs={"class": "meta-title meta-title-link"}):
    # .get('href') returns the attribute value as a string (or None if it is missing)
    print(elem.get('href'))

/series/ficheserie_gen_cserie=25362.html
/series/ficheserie_gen_cserie=27414.html
/series/ficheserie_gen_cserie=28562.html
/series/ficheserie_gen_cserie=27142.html
/series/ficheserie_gen_cserie=23853.html
/series/ficheserie_gen_cserie=25616.html
/series/ficheserie_gen_cserie=24647.html


Can you retrieve the titles by reading the `title` attribute of the same tags?

In [23]:
for elem in soup.find_all('a',attrs={"class" :"meta-title meta-title-link"}):
    print(elem.get('title'))

Le Serpent
Le Remplaçant
Qui a tué Sara ?
L'Ecole de la vie
Jupiter's Legacy
HPI
Shadow and Bone
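Since `href` and `title` come from the same tag, it can be convenient to collect them together so that they stay paired. A minimal sketch on a hand-written fragment (the fragment, `demo_soup`, and `records` are illustrative names, not part of the notebook):

```python
from bs4 import BeautifulSoup

# Hand-written fragment; the second link deliberately has no title attribute.
html = """
<a class="meta-title-link" href="/series/ficheserie_gen_cserie=25362.html" title="Le Serpent">Le Serpent</a>
<a class="meta-title-link" href="/series/ficheserie_gen_cserie=25616.html">HPI</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

records = []
for a in demo_soup.select("a.meta-title-link"):
    records.append({
        "href": a.get("href"),
        # .get returns None for a missing attribute, so fall back to the link text.
        "title": a.get("title") or a.get_text(strip=True),
    })

for record in records:
    print(record["title"], "->", record["href"])
```

Keeping one dictionary per link avoids having to check later that two separate lists are still in the same order.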


#### Getting the summary

Let's start by building the URLs that we will use to retrieve the summaries.

Start by putting the `href` values in a list of links.


In [24]:
links=[]
for elem in soup.find_all('a',attrs={"class" :"meta-title meta-title-link"}):
    # Collect every href in a list
    links.append(elem.get('href'))
links

['/series/ficheserie_gen_cserie=25362.html',
 '/series/ficheserie_gen_cserie=27414.html',
 '/series/ficheserie_gen_cserie=28562.html',
 '/series/ficheserie_gen_cserie=27142.html',
 '/series/ficheserie_gen_cserie=23853.html',
 '/series/ficheserie_gen_cserie=25616.html',
 '/series/ficheserie_gen_cserie=24647.html']

The absolute URL of a movie page has this form: http://www.allocine.fr/film/fichefilm_gen_cfilm=243835.html

We therefore need to go through the previous list again and build the absolute URLs for our search.

Your turn.

NB: Do not take the links for the shows (series).

In [25]:
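# In this run every link collected above points to a series, so the filter keeps
# them to have pages to work with. To exclude series instead, use: 'series' not in elem
links_movie = ['http://www.allocine.fr' + elem for elem in links if 'series' in elem]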
links_movie

['http://www.allocine.fr/series/ficheserie_gen_cserie=25362.html',
 'http://www.allocine.fr/series/ficheserie_gen_cserie=27414.html',
 'http://www.allocine.fr/series/ficheserie_gen_cserie=28562.html',
 'http://www.allocine.fr/series/ficheserie_gen_cserie=27142.html',
 'http://www.allocine.fr/series/ficheserie_gen_cserie=23853.html',
 'http://www.allocine.fr/series/ficheserie_gen_cserie=25616.html',
 'http://www.allocine.fr/series/ficheserie_gen_cserie=24647.html']
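String concatenation works here because every `href` starts with `/`. As a hedged alternative, the standard library's `urljoin` handles relative and absolute links alike. The sketch below also keeps only film pages, assuming film links start with `/film/` as in the example URL above (`absolute_film_links` and `sample` are hypothetical names):

```python
from urllib.parse import urljoin

BASE_URL = "http://www.allocine.fr/"

def absolute_film_links(hrefs, base_url=BASE_URL):
    """Return absolute URLs for film pages only, skipping series and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # Skip missing hrefs and anything that is not a film page.
        if not href or not href.startswith("/film/"):
            continue
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result

sample = [
    "/film/fichefilm_gen_cfilm=235582.html",
    "/series/ficheserie_gen_cserie=25362.html",
    "/film/fichefilm_gen_cfilm=235582.html",  # duplicate
    None,                                      # tag without href
]
print(absolute_film_links(sample))
# ['http://www.allocine.fr/film/fichefilm_gen_cfilm=235582.html']
```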

Finally, the title and synopsis must be retrieved from each page. Let's try with the first page of the list.

In [26]:
url=links_movie[0]
r = requests.get(url)
print(url, r.status_code)
soup = BeautifulSoup(r.content,'lxml')
soup

http://www.allocine.fr/series/ficheserie_gen_cserie=25362.html 200


<!DOCTYPE html>
<html lang="fr">
<head>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=no" name="viewport"/>
<meta content="index,follow,max-snippet:-1,max-image-preview:large" name="robots"/>
<title>Le Serpent - Série TV 2020 - AlloCiné</title>
<meta content="" name="keywords"/>
<meta content="noarchive" name="Googlebot"/>
<meta content="global" name="distribution"/>
<meta content="AlloCine" name="author"/>
<meta content="France" name="country"/>
<meta content="48.87078;2.30447" name="geo.position"/>
<meta content="FR" name="geo.country"/>
<meta content="48.87078;2.30447" name="ICBM"/>
<meta content="#FECC00" name="theme-color"/>
<meta content="Le Serpent est une série TV de Richard Warlow et Toby Finlay avec Tahar Rahim (Charles Sobhraj), Jenna Coleman (Marie-Andrée Leclerc). Retrouvez toutes les news et les vidéos de la série Le Serpent. 

For the title:
```html
<div class="titlebar-title titlebar-title-lg">Le Grand Bain</div>
```
For the synopsis:

```html
<div class="content-txt " itemprop="description">

 
              C’est dans les couloirs de leur piscine municipale que Bertrand, Marcus, Simon, Laurent, Thierry et les autres s’entraînent sous l’autorité toute relative de Delphine, ancienne gloire des bassins. Ensemble, ils se sentent libres et utiles. Ils vont mettre toute leur énergie dans une discipline jusque-là propriété de la gent féminine : la natation synchronisée. Alors, oui c’est une idée plutôt bizarre, mais ce défi leur permettra de trouver un sens à leur vie...
    
      </div>
```

In [27]:
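for elem in soup.find_all('div', attrs={"class": "titlebar-title titlebar-title-lg"}):
    # .text gives the text content of the tag
    print(elem.text)

# Note: "content-txt " is written with a trailing space here; no synopsis appears
# in the output below, so this filter most likely matches nothing as written.
for elem in soup.find_all('div', attrs={"class": "content-txt "}):
    print(elem.text)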

Le Serpent
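A CSS selector such as `div.content-txt` matches on the class name itself, so it is not sensitive to the trailing space in the page's `class="content-txt "` attribute. Here is a minimal sketch of a helper that returns one title and one synopsis per page, tested on a hand-written fragment with an abridged synopsis. The function name is hypothetical, and taking the first `content-txt` block as the synopsis is an assumption about the page layout:

```python
from bs4 import BeautifulSoup

def extract_title_and_synopsis(html):
    """Return (title, synopsis) for one page; either value is None if not found."""
    page = BeautifulSoup(html, "html.parser")
    title_tag = page.select_one("div.titlebar-title")
    # Assumption: the first content-txt block on the page is the synopsis.
    synopsis_tag = page.select_one("div.content-txt")
    title = title_tag.get_text(strip=True) if title_tag else None
    synopsis = synopsis_tag.get_text(" ", strip=True) if synopsis_tag else None
    return title, synopsis

# Hand-written fragment modelled on the tags shown above (synopsis abridged).
sample_html = """
<div class="titlebar-title titlebar-title-lg">Le Grand Bain</div>
<div class="content-txt " itemprop="description">
    C'est dans les couloirs de leur piscine municipale...
</div>
"""
page_title, page_synopsis = extract_title_and_synopsis(sample_html)
print(page_title)
print(page_synopsis)
```

Returning `None` when a tag is missing makes it easy to spot pages where the markup differs from what was expected.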


1) Automate this script for the entire list

2) Put the information in three lists (film_links, title, synopsis)

3) Create a dataframe that holds these three pieces of information in three matching columns

4) Save this dataframe to a CSV file

And that's your first real scrape: you're real hackers now.

In [32]:
import time
import random

title = []
synopsis = []

for url in links_movie:
    # Slow down the requests to avoid being flagged and banned from the site
    time.sleep(random.uniform(1.0, 2.0))
    r = requests.get(url)
    print(url, r.status_code)
    soup = BeautifulSoup(r.content, 'lxml')

    for elem in soup.find_all('div', attrs={"class": "titlebar-title titlebar-title-lg"}):
        title.append(elem.text.strip())

    # A page can contain several "content-txt" blocks, so this list
    # may end up longer than the list of titles
    for elem in soup.find_all('div', attrs={"class": "content-txt"}):
        synopsis.append(elem.text.strip())

# Check the lengths of the lists before creating the dataframe
len(title), len(synopsis), len(links_movie)

http://www.allocine.fr/series/ficheserie_gen_cserie=25362.html 200
http://www.allocine.fr/series/ficheserie_gen_cserie=27414.html 200
http://www.allocine.fr/series/ficheserie_gen_cserie=28562.html 200
http://www.allocine.fr/series/ficheserie_gen_cserie=27142.html 200
http://www.allocine.fr/series/ficheserie_gen_cserie=23853.html 200
http://www.allocine.fr/series/ficheserie_gen_cserie=25616.html 200
http://www.allocine.fr/series/ficheserie_gen_cserie=24647.html 200


(7, 23, 7)
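The lengths above differ: 23 synopsis entries for 7 pages, so the lists cannot be zipped together safely. One possible way around this is to build one record per page. The sketch below is an illustration under assumptions, not the notebook's method: `scrape_pages` is a hypothetical helper, its `get` parameter exists only so the example can run offline with a fake response, and it assumes the first `content-txt` block is the synopsis.

```python
import random
import time

import requests
from bs4 import BeautifulSoup

def scrape_pages(urls, get=requests.get, delay=(1.0, 2.0)):
    """Fetch each URL and return one dict per page, so columns stay aligned."""
    rows = []
    for url in urls:
        time.sleep(random.uniform(*delay))  # pause between requests
        response = get(url, timeout=10)
        row = {"link": url, "title": None, "synopsis": None}
        if response.status_code == 200:
            page = BeautifulSoup(response.content, "html.parser")
            title_tag = page.select_one("div.titlebar-title")
            synopsis_tag = page.select_one("div.content-txt")  # first match only
            if title_tag:
                row["title"] = title_tag.get_text(strip=True)
            if synopsis_tag:
                row["synopsis"] = synopsis_tag.get_text(" ", strip=True)
        rows.append(row)
    return rows

# Offline demonstration with a fake response instead of a real request.
class FakeResponse:
    status_code = 200
    content = (b'<div class="titlebar-title">Demo</div>'
               b'<div class="content-txt">First</div>'
               b'<div class="content-txt">Second</div>')

def fake_get(url, timeout=None):
    return FakeResponse()

rows = scrape_pages(["http://example.com/a", "http://example.com/b"],
                    get=fake_get, delay=(0, 0))
print(len(rows), rows[0]["synopsis"])
```

With real pages, calling `scrape_pages(links_movie)` would use `requests.get` and the default delay of one to two seconds.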

In [45]:
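import pandas as pd
df = pd.DataFrame({'Titre': title})
# Caution: synopsis has 23 entries for 7 pages. Slicing the first 7 makes the
# lengths match but does not guarantee that each row pairs a title with its own synopsis.
df['synopsis'] = synopsis[0:7]
df['liens'] = links_movie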

In [46]:
df.to_csv('./assets/allo_cine.csv', index=False)

PermissionError: [Errno 13] Permission denied: './assets/allo_cine.csv'
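A `PermissionError` on `to_csv` generally means the target cannot be written, for example because the file is open in another program or the folder is not writable for the current user. A minimal sketch, assuming you are free to write somewhere else (`save_rows`, the temporary folder, and the sample row are hypothetical):

```python
from pathlib import Path
import tempfile

import pandas as pd

def save_rows(rows, path):
    """Write a list of dicts to CSV, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df = pd.DataFrame(rows, columns=["title", "synopsis", "link"])
    df.to_csv(path, index=False, encoding="utf-8")
    return path

# Placeholder data; a temporary folder is used because it should be writable.
rows = [{"title": "Demo", "synopsis": "Placeholder text", "link": "http://example.com/a"}]
out_path = save_rows(rows, Path(tempfile.mkdtemp()) / "output" / "allo_cine.csv")
print(out_path.exists(), pd.read_csv(out_path).shape)
```

If the original path must be kept, closing any program that has `allo_cine.csv` open and checking the permissions on `./assets` are the first things to try.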