## General instructions

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel/runtime** (Colab: in the menubar, select *Runtime*$\rightarrow$*Factory Reset Runtime*; Jupyter: in the menubar, select *Kernel*$\rightarrow$*Restart*) and then **run all cells** (Colab: in the menubar, select *Runtime*$\rightarrow$*Run all*; Jupyter: in the menubar, select *Cell*$\rightarrow$*Run All*).

Make sure you fill in any place that says `YOUR CODE HERE` or `"YOUR ANSWER HERE"`, as well as the list of the group members in the following cell.

Enter here the *Group Name* and the list of *Group Members*.

`GROUP NAME`

`GROUP MEMBERS`

To make the evaluation possible, DO NOT delete or cut the cells containing code and answers. Once you have finished, download the notebook (Colab: in the menubar, select *File*$\rightarrow$*Download .ipynb*; Jupyter: in the menubar, select *File*$\rightarrow$*Download as*$\rightarrow$*Notebook (.ipynb)*) and upload it as an assignment on the e-learning platform.

The following cell loads the Google Drive extension for the current notebook when the variable `MOUNT` is `True`. This allows you to mount the Google Drive filesystem for file persistence. The mount point will be `/content/gdrive`.
Furthermore, the cell sets the `PATH` variable, so that from now on you can refer to external files by writing:

```python
os.path.join(PATH, filename)
```

This joins `PATH` and the filename into a single path, with the filename placed after `PATH`.
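As a minimal illustration, assuming `PATH` is `'.'` (the value used outside Colab) and a hypothetical file name `movies.csv` that is not part of the assignment, the call behaves as sketched here:

```python
import os

PATH = '.'               # assumption: the value used when Drive is not mounted
filename = 'movies.csv'  # hypothetical file name, for illustration only

# Join the directory and the file name using the separator of the current OS
full_path = os.path.join(PATH, filename)
print(full_path)
```

Using `os.path.join` instead of string concatenation keeps the code independent of the path separator of the operating system.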

In [1]:
import os
MOUNT = False
if 'google.colab' in str(get_ipython()) and MOUNT:
    from google.colab import drive
    drive.mount('/content/gdrive')
    PATH = '/content/gdrive/MyDrive'
else:
    PATH = '.'

# Important warning

**⚠️ Avoid copying, removing or modifying test cells; if you do, your assignment might be graded incorrectly ⚠️**

---

# The top 100 Greatest Movies of all time

In this practice, which will consist of two parts, we will try to find out what the top 100 greatest movies of all time have in common.

To this aim we will refer to the IMDb website, which collects movie ratings. In particular, the URL [http://www.imdb.com/list/ls055592025/](http://www.imdb.com/list/ls055592025/) provides a list of the top 100 greatest movies of all time, presented as an HTML page.

Our goal will be to automatically fetch the data from that page and to process it in order to perform an analysis of the movies.


In order to interact with a web server you can use the [`requests` library](https://requests.readthedocs.io), which allows you to download the content of a web page with a simple `.get()` call (refer to the library documentation for details).

Moreover, the downloaded data will be encoded in `HTML`, therefore you have to transform it into something that can be manipulated by a Python program. The [beautifulsoup `bs4` library](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) does this on your behalf (refer to the library documentation).

Finally, you need some way to *navigate* the result. Once the page is loaded in a beautiful soup object, you can address specific parts of the page by means of [*css selectors*](https://css-tricks.com/how-css-selectors-work/), which allow you to extract parts of the HTML document according to a basic *query language*. In addition, beautiful soup also has other `find_` methods that can be useful.


## Getting the data

In this phase, the goal is to *scrape* the data from the web and make a local copy of the relevant movie synopses for the following phases.




We have to retrieve and parse the 100 Greatest Movies of All Time page from the URL `http://www.imdb.com/list/ls055592025/`. We do it by *web scraping*, that is by retrieving the data from a web page.

In order to do that, the fragment of the web page containing the relevant information should be identified by looking at the HTML source of the web page. We will use the *beautiful soup* library for parsing the HTML content of the page and querying it.

In particular, for the previous web page, the structure of a single movie is the following:

```html
<div class="lister-item-content">
    <h3 class="lister-item-header">
    <span class="lister-item-index unbold text-primary">1.</span>
    <a href="/title/tt0068646/">The Godfather</a>
    <span class="lister-item-year text-muted unbold">(1972)</span>
    </h3>
    <p class="text-muted text-small">
      <span class="certificate">T</span>
      <span class="ghost">|</span> 
      <span class="runtime">175 min</span>
      <span class="ghost">|</span> 
      <span class="genre">Crime, Drama</span>
    </p> 
...
</div>
```


In particular, each movie is contained in a `<div>` element whose class is `lister-item-content`. Within it, the movie title and a link to its page are contained in an `<h3>` element whose class is `lister-item-header`.

These elements can be accessed through CSS selectors: `div.lister-item-content` retrieves the list of all the movies, `h3.lister-item-header` retrieves the header holding each movie's name and link, and `span.genre` retrieves the movie genre (have a look at [this content](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class) for an explanation of selecting by CSS class).

In the following we collect in a list all the movies' titles, URLs and genres; the result should look like the following:
    
```python
[('The Godfather', '/title/tt0068646/', 'Crime, Drama'),
 ('The Shawshank Redemption', '/title/tt0111161/', 'Drama'),
 ("Schindler's List", '/title/tt0108052/', 'Biography, Drama, History'),
 ('Raging Bull', '/title/tt0081398/', 'Biography, Drama, Sport'),
 ...
]
```

In [2]:
import requests
from bs4 import BeautifulSoup

# Set the Accept-Language header to force English language content
headers = {
    "Accept-Language": "en-US,en;q=0.9"
}

response = requests.get('http://www.imdb.com/list/ls055592025/', headers=headers)
assert response.ok, f"Could not download data from the website {response.status_code}"
soup = BeautifulSoup(response.text, "html.parser")

# Extract data from each div.lister-item-content 
movies_data = soup.select("div.lister-item-content")
movies = []
for m in movies_data:
    title = m.select_one('h3.lister-item-header a')
    genre = m.select_one('span.genre')
    movies.append((title.text.strip(), title.get('href'), genre.text.strip()))


## Exercise

Write a function `extract_movie_id(url)` that, given the URL of a movie, extracts the movie id, e.g., for `/title/tt0068646/` it should return `tt0068646`.

**Notice** that all movie ids share the same format: the `tt` prefix followed by a sequence of digits.

In [None]:
import re
def extract_movie_id(url):
    match = re.search(r"/(tt\d+)/", url)
    if match:
        return match.group(1)    

In [None]:
assert extract_movie_id('/title/tt0068646/') == 'tt0068646'
assert extract_movie_id('prova') is None

## Exercise

Write a function `extract_synopsis(movie_id)` that, given a movie id, returns the synopsis of the movie. That data is available at a URL composed as follows: `https://www.imdb.com/title/{movie_id}/plotsummary`.

Since the site guards against scraping, it is necessary to send a user agent different from the default one used by `requests`. For example, you can send a header (see the example above for setting the headers in the `get` request) whose name is `"User-Agent"` and whose value could be:
`"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"`.

Once the page has been downloaded, the synopsis is available in a `<div>` element with a `data-testid` attribute having value `sub-section-synopsis`.

In this case the CSS selector to be used in the beautifulsoup `select()` method is `[data-testid="sub-section-synopsis"]`.



In [None]:
def extract_synopsis(movie_id):
    headers = {
        "Accept-Language": "en-US,en;q=0.9",
        'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    }
    response = requests.get(f'https://www.imdb.com/title/{movie_id}/plotsummary', headers=headers)
    if not response.ok:
        raise Exception(response.status_code)
    soup = BeautifulSoup(response.text, "html.parser")
    synopsis = soup.select_one('[data-testid="sub-section-synopsis"]')
    return synopsis.text if synopsis is not None else ""

In [None]:
assert extract_synopsis('tt0068646').strip() == '''
In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleone\'s daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando), the head of the Corleone Mafia family, is known to friends and associates as "Godfather." He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors because, according to Italian tradition, "no Sicilian can refuse a request on his daughter\'s wedding day." One of the men who asks the Don for a favor is Amerigo Bonasera, a successful mortician and acquaintance of the Don, whose daughter was brutally beaten by two young men because she refused their advances; the men received minimal punishment from the presiding judge. The Don is disappointed in Bonasera, who\'d avoided most contact with the Don due to Corleone\'s nefarious business dealings. The Don\'s wife is godmother to Bonasera\'s shamed daughter, a relationship the Don uses to extract new loyalty from the undertaker. The Don agrees to have his men punish the young men responsible (in a non-lethal manner) in return for future service if necessary.Meanwhile, the Don\'s youngest son Michael (Al Pacino), a decorated US Marine hero returning from World War II service, arrives at the wedding and tells his girlfriend Kay Adams (Diane Keaton) anecdotes about his family, informing her about his father\'s criminal life; he reassures her that he is different from his family and doesn\'t plan to join them in their criminal dealings. The wedding scene serves as critical exposition for the remainder of the film, as Michael introduces the main characters to Kay. Fredo (John Cazale), Michael\'s next older brother, is a bit dim-witted and quite drunk by the time he finds Michael at the party. 
Santino, who is nicknamed Sonny (James Caan), the Don\'s eldest child and next in line to become Don upon his father\'s retirement, is married but he is a hot-tempered philanderer who sneaks into a bedroom to have sex with one of Connie\'s bridesmaids, Lucy Mancini (Jeannie Linero). Tom Hagen is not related to the family by blood but is considered one of the Don\'s sons because he was homeless when he befriended Sonny in the Little Italy neighborhood of Manhattan and the Don took him in and saw to Tom\'s upbringing and education. Now a talented attorney, Tom is being groomed for the important position of consigliere (counselor) to the Don, despite his non-Sicilian heritage.Also among the guests at the celebration is the famous singer Johnny Fontane (Al Martino), Corleone\'s godson, who has come from Hollywood to petition Vito\'s help in landing a movie role that will revitalize his flagging career. Jack Woltz (John Marley), the head of the studio, denies Fontane the part (a character much like Johnny himself), which will make him an even bigger star, but Don Corleone explains to Johnny: "I\'m gonna make him an offer he can\'t refuse." The Don also receives congratulatory salutations from Luca Brasi, a terrifying enforcer in the criminal underworld, and fills a request from the baker, Nazorine, who made Connie\'s wedding cake who wishes for his nephew Enzo to become an American citizen.After the wedding, Hagen is dispatched to Los Angeles to meet with Woltz, but Woltz angrily tells him that he will never cast Fontane in the role. Woltz holds a grudge because Fontane seduced and "ruined" a starlet who Woltz had been grooming for stardom and with whom he had a sexual relationship. Woltz is persuaded to give Johnny the role, however, when he wakes up early the next morning and feels something wet in his bed. 
He pulls back the sheets and finds himself in a pool of blood; he screams in horror when he discovers the severed head of his prized $600,000 stud horse, Khartoum, in the bed with him. (A deleted scene from the film implies that Luca Brasi (Lenny Montana), Vito\'s top "button man" or hitman, is responsible.)Upon Hagen\'s return, the family meets with Virgil "The Turk" Sollozzo (Al Lettieri), who is being backed by the rival Tattaglia family. He asks Don Corleone for financing as well as political and legal protection for importing and distributing heroin. Despite the huge profit to be made, Vito Corleone refuses, explaining that his political influence would be jeopardized by a move into the narcotics trade -- the judges and politicians he\'s allied himself with over the course of several decades would renounce their friendships with him if he were to enter the drug trade. The Don\'s eldest son, Sonny, who had earlier urged the family to enter the narcotics trade, breaks rank during the meeting and begins to question Sollozzo\'s assurances as to the Corleone Family\'s investment being guaranteed by the Tattaglia Family. His father, angry at Sonny\'s dissension in a non-family member\'s presence, silences Sonny with a single look and privately rebukes him later. Don Corleone then dispatches Luca Brasi to infiltrate Sollozzo\'s organization and report back with information. During the meeting, while Brasi is bent over to allow Bruno Tattaglia to light his cigarette, he is stabbed in the hand by Sollozzo, and is subsequently garroted by an assassin.Soon after his meeting with Sollozzo, Don Corleone is gunned down in an assassination attempt just outside his office, and it is not immediately known whether he has survived. Fredo Corleone had been assigned driving and protection duty for his father when Paulie Gatto, the Don\'s usual bodyguard, had called in sick. Fredo proves to be ineffectual, fumbling with his gun and unable to shoot back. 
When Sonny hears about the Don being shot and Paulie\'s absence, he orders Clemenza (Richard S. Castellano), one of his father\'s two "caporegimes," to find Paulie and bring him to the Don\'s house.Sollozzo abducts Tom Hagen and holds him for several hours, persuading him to offer Sonny the deal previously offered to his father. When Tom is released, Sollozzo gets word that the Don has survived the attempt on his life. He angrily tells Tom to convince Sonny to accept his offer.Enraged, Sonny refuses to consider it and issues an ultimatum to the Tattaglias: turn over Sollozzo or face a lengthy, bloody and costly (for both sides) gang war. They refuse, and instead send Sonny "a Sicilian message," in the form of two fresh fish wrapped in Luca Brasi\'s bullet-proof vest, telling the Corleones that Luca Brasi "sleeps with the fishes."Clemenza later takes Paulie and one of the family\'s hitmen, Rocco Lampone, for a drive into Manhattan. Sonny wants to "go to the mattresses" -- set up beds in apartments for Corleone button men to operate out of in the event that the crime war breaks out. On their way back from Manhattan, Clemenza has Paulie stop the car in a remote area so he can urinate. Rocco shoots Paulie dead; he and Clemenza leave Paulie and the car behind.Michael, whom the other Mafia families consider a "civilian" and not involved in mob business, visits his father at a small private hospital after having dinner with Kay at her hotel. He is shocked to find that no one is guarding him -- a nurse tells him that the men were interfering with hospital policy and were told to leave by the police about 10 minutes before Mike\'s arrival. Realizing that his father is again being set up to be killed, he calls Sonny for help, moves his father to another room, and goes outside to watch the entrance. Michael enlists help from Enzo the baker (Gabriele Torrei), who has come to the hospital to pay his respects. Together, they bluff away Sollozzo\'s men as they drive by. 
Police cars soon appear bringing the corrupt Captain McCluskey (Sterling Hayden), who viciously punches Michael in the cheek and breaks his jaw when Michael insinuates that Sollozzo paid McCluskey to set up his father. Just then, Hagen arrives with "private detectives" licensed to carry guns to protect Don Corleone, and he takes the injured Michael home. Sonny responds by having Bruno Tattaglia (Tony Giorgio), the eldest son and underboss of Don Phillip Tattaglia (Victor Rendina), killed (off-camera).Following the attempt on the Don\'s life at the hospital, Sollozzo requests a meeting with the Corleones, which Captain McCluskey will attend as Sollozzo\'s bodyguard. When Michael volunteers to kill both men during the meeting, Sonny and the other senior Family members are amused; however, Michael convinces them that he is serious and that killing Sollozzo and McCluskey is in the family\'s interest: "It\'s not personal. It\'s strictly business." Because Michael is considered a civilian, he won\'t be regarded as a suspicious ambassador for the Corleones. Although police officers are usually off limits for hits, Michael argues that since McCluskey is corrupt and has illegal dealings with Sollozzo, he is fair game. Michael also implies that newspaper reporters that the Corleones have on their payroll would delight in publishing stories about a corrupt police captain.Michael meets with Clemenza, who prepares a small pistol for him, covering the trigger and grip with tape to prevent any fingerprint evidence. He instructs Michael about the proper way to perform the assassination and tells him to leave the gun behind. He also tells Michael that the family were all very proud of Michael for becoming a war hero during his service in the Marines and that a war like the impending one that Sollozzo\'s and McClusky\'s killings will spark is necessary about every five to tens years to clean out the ambition and resentment that builds between the Five Families. 
Clemenza shows great confidence that Michael can perform the job and tells him it will all go smoothly. The plan is to have the Corleone\'s informers find out the location of the meeting and plant the revolver before Michael, Sollozzo and McCluskey arrive. Before he leaves for the meeting, Sonny tells Michael he\'ll get word to Kay about not saying goodbye.Before the meeting in a small Italian restaurant in the Bronx, McCluskey frisks Michael for weapons and finds him clean. After a few minutes where Michael and Sollozzo converse in Italian, Michael excuses himself to go to the bathroom, where he retrieves the planted revolver. Returning to the table, he fatally shoots Sollozzo, then McCluskey. Michael is sent to hide in Sicily while the Corleone family prepares for all-out warfare with the Five Families (who are united against the Corleones) as well as a general clampdown on the mob by the police and government authorities. Three months later, when the don returns home from the hospital, he is distraught to learn that it was Michael who killed Sollozzo and McCluskey.Meanwhile, Connie and Carlo\'s marriage is disintegrating. They argue frequently over Carlo\'s suspected infidelity and his possessive behavior toward Connie. By Italian tradition, nobody, not even a high-ranking Mafia don, can intervene in a married couple\'s personal disputes, even if they involve infidelity, money, or domestic abuse. One day, Sonny sees a bruise on Connie\'s face and she tells him that Carlo hit her after she asked him if he was having an affair. Sonny tracks down and severely beats Carlo in the middle of a crowded street for brutalizing the pregnant Connie, and threatens to kill Carlo if he ever harms Connie again. An angry Carlo responds by plotting with Tattaglia and Don Emilio Barzini (Richard Conte), the Corleones\' chief rivals, to have Sonny killed.Later, Carlo has one of his mistresses phone his house, knowing that Connie will answer. 
The woman asks Connie to tell Carlo not to meet her tonight. The very pregnant and distraught Connie throws a tantrum, throwing the plates with their dinner around the dining room and kitchen. Carlo takes advantage of the altercation to beat Connie in order to lure Sonny out in the open and away from the Corleone compound. When Connie phones the compound to tell Sonny that Carlo has beaten her again, the enraged Sonny drives off (alone and unprotected) to fulfill his threat against Carlo. On the way to Connie and Carlo\'s house, Sonny is ambushed at a toll booth on the Long Island Causeway and violently shot to death by several carloads of hitmen wielding Thompson sub-machine guns.Tom Hagen relays the news of Sonny\'s massacre to the Don, who calls in the favor from Bonasera to personally handle the embalming of Sonny\'s body. Rather than seek revenge for Sonny\'s killing, Don Corleone meets with the heads of the Five Families to negotiate a cease-fire. Not only is the conflict draining all their assets and threatening their survival, but ending it is the only way that Michael can return home safely. Reversing his previous decision, Vito agrees that the Corleone family will provide political protection for Tattaglia\'s traffic in heroin, as long as it is controlled and not sold to children. At the meeting, Don Corleone deduces that Don Barzini, not Tattaglia, was ultimately behind the start of the mob war and Sonny\'s death, despite showing early signs of senility.In Sicily, Michael patiently waits out his exile, protected by Don Tommasino (Corrado Gaipa), an old family friend. Michael aimlessly wanders the countryside, accompanied by his ever-present bodyguards, Calo (Franco Citti) and Fabrizio (Angelo Infanti). In a small village, Michael meets and falls in love with Apollonia Vitelli (Simonetta Stefanelli), the beautiful young daughter of a bar owner. 
They court and marry in the traditional Sicilian fashion, but soon Michael\'s presence becomes known to Corleone enemies. One day, while Michael is teaching his new bride to drive, Tommasino brings the bad news about Sonny\'s assassination. He wants to movie Michael to a safer location. As the couple is about to leave, Apollonia is killed as a result of a rigged car (originally intended for Michael) exploding on ignition; Michael, who saw the car explode, spots Fabrizio hurriedly leaving the grounds seconds before the explosion, implicating him in the assassination plot. (In a deleted scene, Fabrizio is found years later and killed.)With his safety guaranteed, Michael returns home. More than a year later, in 1950, he reunites with his former girlfriend Kay after a total of four years of separation -- three in Italy and one in America. He tells her he wants them to be married. Although Kay is hurt that he waited so long to contact her, she accepts his proposal. With Don Vito semi-retired, Sonny dead, and middle brother Fredo considered incapable of running the family business, Michael is now in charge; he promises Kay he will make the family business completely legitimate within five years.Two years later, Clemenza and Salvatore Tessio (Abe Vigoda), complain that they are being pushed around by the Barzini Family and ask permission to strike back, but Michael denies the request. He plans to move the family operations to Nevada and after that, Clemenza and Tessio may break away to form their own families in the New York area. Michael further promises Connie\'s husband, Carlo, that he will be his right hand man in Nevada (Carlo had grown up there), unaware of his part in Sonny\'s assassination. Tom Hagen has been removed as consigliere and is now merely the family\'s lawyer, with Vito serving as consigliere. 
Privately, Hagen inquires about his change in status, and also questions Michael about a new regime of "soldiers" secretly being built under Rocco Lampone (Tom Rosqui). Don Vito explains to Hagen that Michael is acting on his advice.Another year or so later, Michael travels to Las Vegas and meets with Moe Greene (Alex Rocco), a rich and shrewd casino boss looking to expand his business dealings. After the Don\'s attempted assassination, Fredo had been sent to Las Vegas to learn about the casino business from Greene. Michael arrogantly offers to buy out Greene but is rudely rebuffed. Greene believes the Corleones are weak and that he can secure a better deal from Barzini. As Moe and Michael heatedly negotiate, Fredo sides with Moe. After Moe storms out of the meeting, Michael warns Fredo to never again "take sides with anyone against the family."Michael returns home. In a private moment, Vito explains his expectation that the Family\'s enemies will attempt to murder Michael by using a trusted associate to arrange a meeting as a pretext for assassination. Vito also reveals that he had never really intended a life of crime for Michael, hoping that his youngest son would hold legitimate power as a senator or governor. Some months later, Vito collapses and dies while playing with his young grandson Anthony (Anthony Gounaris) in his tomato garden. At the burial, Tessio conveys a proposal for a meeting with Barzini, which identifies Tessio as the traitor that Vito was expecting.Kay asks Michael if he\'ll agree to be godfather to Connie and Carlo\'s newborn son. Michael agrees and seizes the opportunity to eliminate competition from the other five families while also using the baptism as an alibi. 
The murders occur simultaneously during the ceremony:Don Stracci (Don Costello) is gunned down along with his bodyguard in a hotel elevator by a shotgun-wielding Clemenza.Moe Greene is killed while having a massage, shot through the eye by an unidentified assassin.Don Cuneo (Rudy Bond) is trapped in a revolving door at the St. Regis Hotel and shot dead by soldier Willi Cicci (Joe Spinell).Don Tattaglia is assassinated in bed, along with a prostitute, by Rocco Lampone and an unknown associate.Don Barzini is killed on the steps of his office building along with his bodyguard and driver, shot by Al Neri (Richard Bright), disguised in his old police uniform.After the baptism, Tessio believes he and Hagen are on their way to the meeting between Michael and Barzini that he has arranged. Instead, he is surrounded by Willi Cicci and other button men as Hagen steps away. Realizing that Michael has uncovered his betrayal, Tessio tells Hagen that he always respected Michael, and that his disloyalty "was only business." He asks if Tom can get him off for "old times\' sake," but Tom says he cannot. Tessio is driven away and never seen again (it is implied that Cicci shoots and kills Tessio with his own gun after he disarms him prior to entering the car).Meanwhile, Michael confronts Carlo about Sonny\'s murder and forces him to admit his role in setting up the ambush, having been approached by Barzini himself. (The hitmen who killed Sonny were the core members of Barzini\'s personal bodyguard.) Michael assures Carlo he will not be killed, but his punishment is exclusion from all family business. He hands Carlo a plane ticket to exile in Las Vegas. However, when Carlo gets into a car headed for the airport, he is immediately garroted to death by Clemenza, on Michael\'s orders.Later, a hysterical Connie confronts Michael at the Corleone compound as movers carry away the furniture in preparation for the family move to Nevada. 
She accuses him of murdering Carlo in retribution for Carlo\'s brutal treatment of her and for Carlo\'s suspected involvement in Sonny\'s murder and that Michael craftily waited until their father died so Vito couldn\'t stop him. After Connie is removed from the house, Kay questions Michael about Connie\'s accusation, but he refuses to answer, reminding her to never ask him about his business or what he does for a living. She insists, and Michael outright lies, reassuring his wife that he played no role in Carlo\'s death. Kay believes him and is relieved. The film ends with Clemenza and new caporegimes Rocco Lampone and Al Neri arriving and paying their respects to Michael. Clemenza kisses Michael\'s hand and greets him as "Don Corleone." As Kay watches, the office door is closed.
'''.strip()

Given the list of Top 100 movies obtained above, we download and extract the synopsis of each movie and save it to a dataframe organized as follows:

|ID           |Title        |Genre       |Synopsis                        |
|-------------|-------------|------------|--------------------------------|
|**tt0068646**|The Godfather|Crime, Drama|Don Vito Corleone, head of a ...|
|...          |...          |...         |...                             |

Recall that the `pd.concat([df1, df2])` function creates the concatenation of the two dataframes (assuming they have the same structure). Moreover, a dataframe can be created from a list of dictionaries or from a dictionary of lists (refer to the documentation if needed).

At the end of the process set `ID` as the *index* of the dataframe.

Note that `tqdm` prints a progress bar, so you can check that the `for` loop is making progress.

In [None]:
import pandas as pd
try:
    from tqdm.notebook import tqdm
except ImportError:
    # Install tqdm on the fly if it is missing, then retry the import
    %pip install tqdm
    from tqdm.notebook import tqdm
df = pd.DataFrame()
for title, url, genre in tqdm(movies):
    movie_id = extract_movie_id(url)
    synopsis = extract_synopsis(movie_id)
    df = pd.concat([df, pd.DataFrame([{'ID': movie_id, 'Title': title, 'Genre': genre, 'Synopsis': synopsis}])])
df = df.set_index('ID')    
display(df)

## Data preparation

Textual data needs a proper preparation for later analysis. In particular, the main activities involve the removal of the terms that are too frequent and do not convey significant meaning (such as articles, pronouns, or prepositions). This task is called **stopword removal**. 

Afterwards, a sort of *normalization* of the terms is needed. Indeed, terms appear in different inflected or derived forms: for example, "am", "are", "is" all refer to the verb "to be", and "car", "cars", "car's", "cars'" all refer to "car". **Stemming** is the process of transforming the text so that each term is replaced by its *root*.
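To make the idea concrete, here is a deliberately naive sketch of suffix stripping on the "car" example. The function `toy_stem` is hypothetical and only handles plural and possessive endings; it is not how the Snowball stemmer used later works, and irregular forms such as "am" or "is" are out of its reach.

```python
def toy_stem(term):
    """Strip a plural or possessive ending (a naive stand-in for a real stemmer)."""
    term = term.lower()
    for suffix in ("s'", "'s", "s"):
        # Keep at least two characters so very short words are left alone
        if term.endswith(suffix) and len(term) > len(suffix) + 1:
            return term[: -len(suffix)]
    return term

forms = ["car", "cars", "car's", "cars'"]
stems = {toy_stem(form) for form in forms}
print(stems)
```

All four forms collapse to a single stem, which is the effect a real stemmer aims for on a much larger set of endings.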

Finally, once the text has been properly normalized, it should undergo a **tokenization** process that transforms the text flow into a list of terms, to be used either directly to represent the document or as the source of further processing (e.g., $n$-gram building).
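As a rough sketch of tokenization and $n$-gram building, the snippet below uses only the standard library. The helpers `simple_tokenize` and `ngrams` are hypothetical names introduced for illustration; the exercises rely on `nltk` instead, whose tokenizers handle punctuation and contractions far more carefully.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return its alphabetic runs as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = simple_tokenize("Guests are gathered for the wedding reception.")
bigrams = ngrams(tokens, 2)
print(tokens)
print(bigrams[:2])
```

Each bigram pairs a token with its successor, so a list of seven tokens yields six bigrams.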

The goal of this phase is to get a proper representation of the documents (i.e., the *movies*) that can be used for the following clustering phase.

To help in this phase we will make use of the `nltk` library (Natural Language Toolkit, [https://www.nltk.org](https://www.nltk.org)), a very comprehensive library for performing textual analysis.

## Exercise

The `nltk` library provides a bunch of *corpora*, which are relevant textual sources for different purposes (e.g., 25,000 free books of the *Gutenberg project*, >10,000 documents of the *Reuters* news, ...).

One particular corpus is the `stopwords` one, which contains the stopwords (in different languages) and is available in the `nltk.corpus` package.

Load the whole `stopwords` corpus of the English language in a `stopwords` variable. For suggestions, you can refer to [https://www.nltk.org/book/ch02.html](https://www.nltk.org/book/ch02.html).

**Note**: in order to use the corpora, you are required to download them, therefore you should issue the `nltk.download(corpus_name)` command for this purpose.

In [None]:
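### BEGIN_SOLUTION
try:
    import nltk
except ImportError:
    # Install nltk on the fly if it is missing, then retry the import
    !pip install nltk
    import nltk
assert nltk.download('stopwords')
# Load the English stopword list into the `stopwords` variable
from nltk.corpus import stopwords as stopwords_corpus
stopwords = stopwords_corpus.words('english')
### END_SOLUTION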

## Exercise

The `nltk.sent_tokenize(text)` and `nltk.word_tokenize(sentence)` functions are two notable functions for splitting a text into sentences and a sentence into words, respectively.
It is also advisable to normalize the tokens by transforming them to lowercase.

A stemmer, instead, is a function that will transform each word to its root. A very popular one is the so-called *Snowball* stemmer, which is available from the package `nltk.stem.snowball` as `SnowballStemmer`. The `SnowballStemmer` constructor takes a string as its argument, which specifies the stemming language (`"english"` in our case).

Write a function `tokenize_and_stem(text)` that, given the text of the synopsis: 
1. tokenizes its content,
2. stems it,
3. removes the stopwords and everything that is not a word (i.e., everything that does not start with a letter, `isalpha()` could be useful),
4. returns the transformed text as a list of strings.

**Note**: you may need to download the `'punkt'` component for tokenizing sentences, with `nltk.download('punkt')` (recent NLTK releases may ask for `'punkt_tab'` instead).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert tokenize_and_stem('In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleone\'s daughter Connie (Talia Shire)') == ['in',
 'late',
 'summer',
 'guest',
 'gather',
 'wed',
 'recept',
 'don',
 'vito',
 'corleon',
 'daughter',
 'conni',
 'talia',
 'shire']

## Exercise

Now we have to represent each document (i.e., a movie synopsis) as a vector of features. In particular, we want to weight each term by its $tf.idf$ (term frequency, inverse document frequency), so that we obtain a $tf.idf$ matrix.
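As a worked sketch of the weighting, the snippet below computes the textbook variant, $tf.idf(t, d) = tf(t, d) \cdot \log(N / df(t))$, on three made-up documents. The documents and the `tf_idf` helper are assumptions introduced for illustration. The values will not match those of scikit-learn, whose `TfidfVectorizer` by default uses a smoothed idf and normalizes each document vector.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents (not taken from the IMDb data)
docs = [
    ["mafia", "family", "wedding"],
    ["prison", "friendship", "hope"],
    ["mafia", "boxing", "family", "family"],
]

def tf_idf(docs):
    """Return one {term: weight} dictionary per document (textbook tf-idf)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # raw term frequency within the document
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return weights

weights = tf_idf(docs)
print(weights[0])
```

In the first document, "wedding" gets a higher weight than "mafia" because it occurs in one document only, while "mafia" occurs in two.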

In order to do that, we can make use of the `sklearn` library (scikit-learn), a general-purpose machine learning library. In particular, the tf.idf is available in the package `sklearn.feature_extraction.text` as the class `TfidfVectorizer`.

In general, the `sklearn` classes have two peculiar methods `fit()` and `transform()` which let the data be learned and the model applied, respectively. In our case, the two operations can be combined in one, using the `fit_transform()` method. This method should be provided with a list of strings (each string is a document). Therefore the synopses from the dataframe should be mapped to a list of strings. The method will return a `tf_idf_matrix`.

When a `TfidfVectorizer` is created a couple of parameters have to be provided:

* `max_df`: the maximum document frequency a term can have to be used in the tf-idf matrix; a reasonable value is $80\%$ (or `0.8`). If a term appears almost everywhere, it probably carries little meaning.
* `min_df`: the minimum document frequency a term must have; terms that appear too rarely also carry little meaning, and a reasonable value is $20\%$ (or `0.2`).
* `max_features`: the maximum number of terms (i.e., features) to be considered (you can leave it as default, i.e., unlimited).
* `ngram_range`: whether to look at single words or sequences of more than one token (you can leave it as default, i.e., `(1, 1)`, single words only).
* `tokenizer`: a function that performs the tokenization; we will use the one defined above.
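The effect of the two document-frequency thresholds can be sketched in plain Python. The toy documents and the helper `filter_by_document_frequency` are hypothetical; the function only mimics the idea behind `max_df` and `min_df` when they are given as proportions, and it is not the scikit-learn implementation.

```python
from collections import Counter

# Ten hypothetical stemmed documents: "famili" occurs in all of them,
# "ship" in one only
docs = [["famili", "war"]] * 4 + [["famili", "love"]] * 5 + [["famili", "ship"]]

def filter_by_document_frequency(docs, min_df=0.2, max_df=0.8):
    """Keep the terms whose document-frequency ratio lies in [min_df, max_df]."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(t for t, count in df.items() if min_df <= count / n_docs <= max_df)

vocabulary = filter_by_document_frequency(docs)
print(vocabulary)
```

Here "famili" is discarded because it appears in every document, and "ship" because it appears in only one out of ten.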

Once the vectorizer has been fitted, the terms it employs are available through the `get_feature_names_out()` method.

The matrix returned from the vectorizer through the `fit_transform()` method is a $documents \times terms$ matrix whose values are the $tf.idf$ metric computed for each pair.



In [None]:
# YOUR CODE HERE
raise NotImplementedError()

We want to save the results of this preparation phase, which are stored in the `tf_idf_matrix` object. To do that, we can use the `joblib` library, which allows you to save a Python object to a (binary) file.

The function `joblib.dump(object, filename)` writes the content to a file, whereas `joblib.load(filename)` can read from a file and returns the loaded object.

In [None]:
import joblib
joblib.dump(tf_idf_matrix, 'tf_idf_matrix.joblib')