# PROGRES 2024 - Mini-Projet 2
# API Web

Fabien Mathieu - fabien.mathieu@normalesup.org

Sébastien Tixeuil - Sebastien.Tixeuil@lip6.fr

The purpose of this mini-project is to work with the *Internet Movie DataBase* (IMDB) and the Python package bottle. It will involve:

- Retrieve and manipulate datasets
- Build an API to perform various tasks on the data
- Build a website that will use the API above

# Rules

1. Cite your sources
2. One file to rule them all
3. Explain
4. Execute your code


https://github.com/balouf/progres/blob/main/rules.ipynb

# The IMDB dataset

[IMDB](https://www.imdb.com) allows to retrieve a part of its dataset for any non-commercial purpose. The available data and the formatting convention is described here: https://developer.imdb.com/non-commercial-datasets/

We are especially interested in the data from the following files:
- https://datasets.imdbws.com/title.principals.tsv.gz
- https://datasets.imdbws.com/name.basics.tsv.gz
- https://datasets.imdbws.com/title.basics.tsv.gz

**Important notes**:
- If you see *Your answer here*, that means something is expected from you.
- To help you, the start and/or the end of a possible solution is sometimes given.
- The content of IMDB is refreshed regularly. That means that some of the results you will compute, like the number of movies, will vary with time. This should not surprise you.

## Exercise 1: Download

Write a `download_imdb` function inspired by the `download` function seen in course, with the following modifications:
- `download_imdb` will have one single argument, the name of the file to retrieve. Location is assumed to be https://datasets.imdbws.com/
- If the file already exists, print a message telling that it exists and do nothing. You can use the `pathlib` module for that.

Your answer here.

In [5]:
from pathlib import Path
from requests import Session

base_url = "https://datasets.imdbws.com/"

def download_imdb(file):
    file_path = Path(file)
    if file_path.exists():
        print(f"{file} already exists.")
        return
    
    url = base_url + file
    with Session() as session:
        response = session.get(url)
        response.raise_for_status()
        with open(file, 'wb') as f:
            f.write(response.content)
    print(f"{file} has been downloaded.")

In [6]:
files = ['title.principals.tsv.gz', 'name.basics.tsv.gz', 'title.basics.tsv.gz']
for file in files:
    download_imdb(file)

title.principals.tsv.gz already exists.
name.basics.tsv.gz already exists.
title.basics.tsv.gz already exists.


## Exercise 2: Explore

- What is the size of the different files you retrieved? You can use Python or a file explorer, as you prefer.

Your answer here.

As explained in https://developer.imdb.com/non-commercial-datasets/:
- the data is stored as `tsv`, which means each text line represents a row.
- A [gzip compression](https://docs.python.org/3/library/gzip.html) is used to reduce the size of the data on the hard drive.

Large compressed files should not be uncompressed on your hard drive or fully loaded in memory.

The Python [gzip module](https://docs.python.org/3/library/gzip.html) is designed so you can open a compressed file as if it was already uncompressed. For example, the following code reads 666 lines from `title.basics` and print the last line read.

In [8]:
import gzip
with gzip.open('title.basics.tsv.gz', 'rt', encoding='utf8') as f:
    for _ in range(666):
        l = f.readline()
print(l)

tt0000671	short	Desdemona	Desdemona	0	1908	\N	\N	Drama,Short



- Write a function that read the 4 first lines of a compressed tsv file. Each line read should be converted into a list of elements and printed.

Your answer here.

In [9]:
def explore(name):
    with gzip.open(name, 'rt', encoding='utf8') as f:
        for _ in range(4):
            line = f.readline().strip()
            elements = line.split('\t')
            print(elements)

In [10]:
for file in files:
    print(f"First lines of {file}:")
    explore(file)

First lines of title.principals.tsv.gz:
['tconst', 'ordering', 'nconst', 'category', 'job', 'characters']
['tt0000001', '1', 'nm1588970', 'self', '\\N', '["Self"]']
['tt0000001', '2', 'nm0005690', 'director', '\\N', '\\N']
['tt0000001', '3', 'nm0005690', 'producer', 'producer', '\\N']
First lines of name.basics.tsv.gz:
['nconst', 'primaryName', 'birthYear', 'deathYear', 'primaryProfession', 'knownForTitles']
['nm0000001', 'Fred Astaire', '1899', '1987', 'actor,miscellaneous,producer', 'tt0072308,tt0050419,tt0053137,tt0027125']
['nm0000002', 'Lauren Bacall', '1924', '2014', 'actress,soundtrack,archive_footage', 'tt0037382,tt0075213,tt0117057,tt0038355']
['nm0000003', 'Brigitte Bardot', '1934', '\\N', 'actress,music_department,producer', 'tt0057345,tt0049189,tt0056404,tt0054452']
First lines of title.basics.tsv.gz:
['tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear', 'endYear', 'runtimeMinutes', 'genres']
['tt0000001', 'short', 'Carmencita', 'Carmencita', '0',

- How many movie entries are present in the retrieved database?
- How many people entries?

Your answer here.

## Exercise 3: Extract

We want to study the relations between actors and movies. In particular, we focus on:
- Actual movies (e.g. not TV shows or short movies), where the movie year is known and at least one actor/actress is credited.
- Actors that are credited in at least one actual movie.

To start with, build a [Python set](https://docs.python.org/3/tutorial/datastructures.html#sets) that contains all movie ids (`tconst`) such that:
- The type of movie (`titleType`) is `movie`;
- The year (`startYear`) exists, i.e. is an integer.

How many movies have you referenced in the set?

Your answer here.

In [13]:
true_movies = set()
def build_true_movies(file):
    true_movies = set()
    with gzip.open(file, 'rt', encoding='utf8') as f:
        next(f)  # Skip the header line
        for line in f:
            elements = line.strip().split('\t')
            tconst = elements[0]
            titleType = elements[1]
            startYear = elements[5]
            if titleType == 'movie' and startYear.isdigit():
                true_movies.add(tconst)
    return true_movies
true_movies = build_true_movies('title.basics.tsv.gz')

In [14]:
len(true_movies)

595086

Now we want to build two lists, `movies` and `actors`:

- Each element of `movies` should represent a movie, each element of `actors` an actor or actress;
- A movie is represented by a list of three elements:
  - The original name of the movie (`str`),
  - The principal actors of the movie, stored as a list whose elements are integers that represent the index (position) of the actors in the list `actors`,
  - The movie year, `startYear` (`int`);
- An actor/actress is represented by a list of two elements:
  - The name of the person (`str`),
  - The movies the person acted in, stored as a list whose elements are integers that represent the index (position) of the movies in the list `movies`.
  

Build these two lists.

A possible way to do this:
- Initiate `movies` and `actors` as empty lists;
- Create two auxiliary dictionary that will associate to each movie id (`tconst`) and person id (`nconst`) their position in the list;
- Read the file `title.principals.tsv.gz` line by line:
  - Ignore any line where the movie is not in the set `true_movies` or the `category` of the relation is not `actor` or `actress`,
  - If the movie id `tconst` is not in the movie auxiliary index, append an empty movie to `movies` (`["", [], 0]`) and update the movie auxiliary index with an entry for `tconst`,
  - If the actor id `nconst` is not in the actor auxiliary index, append an empty actor to `actors` (`["", []]`) and update the actor auxiliary index with an entry for `nconst`,
  - Append the movie index (not `tconst`!) to the movies of the corresponding actor in `actors`,
  - Append the actor index (not `nconst`!) to the actors of the corresponding movie in `movies`;
- There can be a few undesired duplicates, e.g. some actors can have multiple entries for the same movies. For each actor, remove possible duplicates in the list of movies, and for each movie, remove possible duplicates in the list of actors;
- Using `title.basics.tsv.gz` and your movie auxiliary index, populate each movie in `movies` with its correct name (`str`) and year (`int`);
- Using `name.basics.tsv.gz` and your actor auxiliary index, populate each actor in `movies` with her correct name.

Your answer here.

In [15]:
movie_id_to_index = dict()
movies = []
actor_id_to_index = dict()
actors = []
# Read the title.principals.tsv.gz file
with gzip.open('title.principals.tsv.gz', 'rt', encoding='utf8') as f:
    next(f)  # Skip the header line
    for line in f:
        elements = line.strip().split('\t')
        tconst = elements[0]
        nconst = elements[2]
        category = elements[3]
        
        if tconst not in true_movies or category not in ['actor', 'actress']:
            continue
        
        if tconst not in movie_id_to_index:
            movie_id_to_index[tconst] = len(movies)
            movies.append(["", [], 0])
        
        if nconst not in actor_id_to_index:
            actor_id_to_index[nconst] = len(actors)
            actors.append(["", []])
        
        movie_index = movie_id_to_index[tconst]
        actor_index = actor_id_to_index[nconst]
        
        if movie_index not in actors[actor_index][1]:
            actors[actor_index][1].append(movie_index)
        
        if actor_index not in movies[movie_index][1]:
            movies[movie_index][1].append(actor_index)

# Remove duplicates in the lists
for actor in actors:
    actor[1] = list(set(actor[1]))

for movie in movies:
    movie[1] = list(set(movie[1]))

# Populate movies with their names and years
with gzip.open('title.basics.tsv.gz', 'rt', encoding='utf8') as f:
    next(f)  # Skip the header line
    for line in f:
        elements = line.strip().split('\t')
        tconst = elements[0]
        primaryTitle = elements[2]
        startYear = elements[5]
        
        if tconst in movie_id_to_index:
            movie_index = movie_id_to_index[tconst]
            movies[movie_index][0] = primaryTitle
            movies[movie_index][2] = int(startYear) if startYear.isdigit() else None

# Populate actors with their names
with gzip.open('name.basics.tsv.gz', 'rt', encoding='utf8') as f:
    next(f)  # Skip the header line
    for line in f:
        elements = line.strip().split('\t')
        nconst = elements[0]
        primaryName = elements[1]
        
        if nconst in actor_id_to_index:
            actor_index = actor_id_to_index[nconst]
            actors[actor_index][0] = primaryName

Manually check that your files are correct. For example, try to get the name and year of the movies Michel Blanc played in, or the actors of the first Harry Potter movie.

Your answer here (if everything went well, you just need to execute the two cells below).

In [16]:
', '.join([f"{movies[i][0]} ({movies[i][2]})" for i in [a for a in actors if a[0]=='Michel Blanc'][0][1]])

"The Best Way to Walk (1976), The Favour, the Watch and the Very Big Fish (1991), Gramps Is in the Resistance (1983), Dream one (1984), The Hundred-Foot Journey (2014), R.A.I.D. Special Unit (2016), Move Along, There is Nothing to See (1983), Uranus (1990), Cause toujours... tu m'intéresses! (1979), Out of Whack (1979), You Are So Beautiful (2005), Prospero's Books (1991), Kiss & Tell (2018), The Horse of Pride (1980), Summer Things (2002), Ma femme s'appelle reviens (1982), Separate Bedrooms (1989), Toxic Affair (1993), The Girl on the Train (2009), The Day I Saw Your Heart (2011), Odd Job (2016), The Witnesses (2007), The New Beaujolais Wine Has Arrived... (1978), A Spot of Bother (2009), Top Dogs (2022), Madame Edouard (2004), Viens chez moi, j'habite chez une copine (1981), You Won't Have Alsace-Lorraine (1977), A Good Doctor (2019), Marche à l'ombre (1984), I Hate Actors (1986), Santa Claus Is a Stinker (1982), Le routard (2025), Ménage (1986), To Catch a Cop (1984), Drôle de same

In [17]:
', '.join([actors[i][0] for i in [m for m in movies if m[0].startswith('Harry Potter')][0][1]])

'Maggie Smith, Fiona Shaw, Richard Griffiths, Rupert Grint, Emma Watson, Saunders Triplets, Harry Melling, Richard Harris, Daniel Radcliffe, Robbie Coltrane'

When you have successfully reached this point of the project, you can save the two lists `movies` and `actors` as compressed json files using the code below:

In [18]:
import gzip
import json

with gzip.open('movies.json.gz', 'wt', encoding='utf8') as f:
    json.dump(movies, f)
with gzip.open('actors.json.gz', 'wt', encoding='utf8') as f:
    json.dump(actors, f)

After your files have been saved, you do not need to re-execute all of the above each time your restart your notebook. Instead, you just need to reload `movies` and `actors` using the code below:

In [19]:
import gzip
import json

with gzip.open('movies.json.gz', 'rt', encoding='utf8') as f:
    movies = json.load(f)
with gzip.open('actors.json.gz', 'rt', encoding='utf8') as f:
    actors = json.load(f)    

**Important remark:** in what follows, you will have to build functions that use the two lists a lot. You should NOT reload the lists each time you call a function. Instead, ensure that the two lists are loaded in memory and use them directly.

## Exercise 4: Explore again (now on the curated dataset)

- How many actors do you have in the new dataset? How many movies?
- In average, in how many movies played an actor?
- In average, how many actors play in a movie?
- What is the name of the actor that played in the most movies? How many movies did he feature in?
- What is the oldest movie in the DB?
- Your answer here.

In [20]:
# Number of actors
num_actors = len(actors)
print(f"Number of actors: {num_actors}")

# Number of movies
num_movies = len(movies)
print(f"Number of movies: {num_movies}")

# Average number of movies per actor
avg_movies_per_actor = sum(len(actor[1]) for actor in actors) / num_actors
print(f"Average number of movies per actor: {avg_movies_per_actor:.2f}")

# Average number of actors per movie
avg_actors_per_movie = sum(len(movie[1]) for movie in movies) / num_movies
print(f"Average number of actors per movie: {avg_actors_per_movie:.2f}")

# Actor with the most movies
most_movies_actor = max(actors, key=lambda actor: len(actor[1]))
most_movies_actor_name = most_movies_actor[0]
most_movies_actor_count = len(most_movies_actor[1])
print(f"Actor with the most movies: {most_movies_actor_name} ({most_movies_actor_count} movies)")

# Oldest movie
oldest_movie = min(movies, key=lambda movie: movie[2] if movie[2] is not None else float('inf'))
oldest_movie_name = oldest_movie[0]
oldest_movie_year = oldest_movie[2]
print(f"Oldest movie: {oldest_movie_name} ({oldest_movie_year})")

Number of actors: 1195940
Number of movies: 469661
Average number of movies per actor: 3.03
Average number of actors per movie: 7.71
Actor with the most movies: Brahmanandam (1122 movies)
Oldest movie: Miss Jerry (1894)


## Exercise 5: Prepare some functions

Write the following functions
- `search_movie(name: str) -> list`: return a list of movies whose name contains `name` (ignoring case). Each movie is described as a dictionary with keys `name`, `year`, and `index` (its position in `movies`)
- `get_movie(i: int) -> dict`: returns the a json of the movie at position `i`, with following keys:
  - `name` (`str`)
  - `year` (`int`)
  - `actors` (list of dictionaries with keys `name` and `index`)
- `search_actor(name: str) -> list`: return a list of actors whose name contains `name` (ignoring case). Each actor is described as a dictionary with keys `name` and `index` (its position in `actor`)
- `get_actor(i: int) -> dict`: returns the a json of the actor at position `i`, with following keys:
  - `name` (`str`)
  - `movies` (list of dictionaries with keys `name`, `year`, and `index`)

Your answer here.

In [21]:
def search_movie(name):
    name_lower = name.lower()
    return [
        {"name": movie[0], "year": movie[2], "index": i}
        for i, movie in enumerate(movies)
        if name_lower in movie[0].lower()
    ]

In [22]:
def get_movie(i):
    movie = movies[i]
    return {
        "name": movie[0],
        "year": movie[2],
        "actors": [{"name": actors[actor_index][0], "index": actor_index} for actor_index in movie[1]]
    }


In [23]:
def search_actor(name):
    name_lower = name.lower()
    return [
        {"name": actor[0], "index": i}
        for i, actor in enumerate(actors)
        if name_lower in actor[0].lower()
    ]


In [None]:
def get_actor(i):
    actor = actors[i]
    return {
        "name": actor[0],
        "movies": [{"name": movies[movie_index][0], "year": movies[movie_index][2], "index": movie_index} for movie_index in actor[1]]
    }

In [25]:
search_movie('bronzés')

[{'name': "Les P'tits Bronzés au Pyrénéen", 'year': 2013, 'index': 444600}]

In [26]:
get_movie(55030)

{'name': 'The Brood',
 'year': 1979,
 'actors': [{'name': 'Samantha Eggar', 'index': 62916},
  {'name': 'Cindy Hinds', 'index': 107141},
  {'name': 'Susan Hogan', 'index': 107142},
  {'name': 'Gary McKeehan', 'index': 107143},
  {'name': 'Michael Magee', 'index': 95335},
  {'name': 'Oliver Reed', 'index': 60781},
  {'name': 'Nuala Fitzgerald', 'index': 81815},
  {'name': 'Robert A. Silverman', 'index': 99479},
  {'name': 'Henry Beckman', 'index': 67292},
  {'name': 'Art Hindle', 'index': 85303}]}

In [27]:
search_actor('Daniel Radcliffe')

[{'name': 'Daniel Radcliffe', 'index': 277301}]

In [28]:
get_actor(277289)

{'name': 'Brigitte Rabald',
 'movies': [{'name': 'Star mit fremden Federn',
   'year': 1955,
   'index': 122970}]}

Write a function `movie_path(origin: int, destination: int) -> distance: int, path: list` that computes the collaboration distance between two actors. That distance is the length of the shortest path `(origin, act1, act2, ..., actX, destination)`, where `origin` and `act` played in the same movie, `act1` and `act2` played in the same movie, ... and
`actX` and `destination` played in the same movie.  In addition to the distance, the response should include one shortest path between the two actors, as a list of the form `["origin_name", "movie1_name", "act1_name", "movie2_name", ..., "destination_name"]`, where `movie1` is a movie that featured `origin` and `act1`, and so on...

In particular:
- One actor is by convention at distance 0 from herself. The return path should be `["origin_name"]` then;
- Two distinct actors that play in the same movie are at distance 1;
- If there is no connection between two actors, the function should return `-1, []` by convention.

**Important remarks**: `movie_path` is tricky. You need to try to implement it but you are allowed to fail. If you are stuck for too long, please explain what you did/try and what blocked you in your opinion. Then move on.

Your answer here.

In [29]:
from collections import deque

def movie_path(origin, destination):
    if origin == destination:
        return 0, [actors[origin][0]]
    
    # BFS initialization
    queue = deque([(origin, [origin])])
    visited = set([origin])
    
    while queue:
        current_actor, path = queue.popleft()
        
        for movie_index in actors[current_actor][1]:
            for co_actor in movies[movie_index][1]:
                if co_actor == destination:
                    # Found the destination actor
                    full_path = path + [co_actor]
                    path_names = [actors[origin][0]]
                    for i in range(len(full_path) - 1):
                        path_names.append(movies[movies[full_path[i]][1][0]][0])
                        path_names.append(actors[full_path[i + 1]][0])
                    return len(full_path) - 1, path_names
                
                if co_actor not in visited:
                    visited.add(co_actor)
                    queue.append((co_actor, path + [co_actor]))
    
    return -1, []


In [30]:
search_actor('jean dujardin')

[{'name': 'Jean Dujardin', 'index': 329551}]

In [31]:
search_actor('kiefer sutherland')

[{'name': 'Kiefer Sutherland', 'index': 123741}]

In [32]:
search_actor('kevin bacon')

[{'name': 'Kevin Bacon', 'index': 105313},
 {'name': 'Kevin Bacon', 'index': 1141015}]

In [33]:
search_actor('louis de funès')

[{'name': 'Louis de Funès', 'index': 38478}]

In [34]:
movie_path(105311, 105311)

(0, ['Mark Metcalf'])

In [35]:
movie_path(105311, 329504)

(3,
 ['Mark Metcalf',
  'Island Bruthas',
  'C. Thomas Howell',
  'A Gnome Named Gnorm',
  'Ann-Margret',
  "Summer of '69",
  'Gisele Bündchen'])

In [36]:
movie_path(38476, 123737)

(5,
 ['Carole Donne',
  'Lovers and Thieves',
  'Thurston Hall',
  'The Luring Lights',
  'Rosalind Russell',
  'Badaranii',
  'Maximilian Schell',
  'Underground',
  'Iva Janzurová',
  'The Day It Rained',
  'Karel Kopecký'])

In [37]:
search_movie('gendarme')

[{'name': 'The Gendarme of Saint-Tropez', 'year': 1964, 'index': 41002},
 {'name': 'The Gendarme in New York', 'year': 1965, 'index': 42673},
 {'name': 'The Gendarme Gets Married', 'year': 1968, 'index': 44455},
 {'name': 'The Gendarme Takes Off', 'year': 1970, 'index': 46402},
 {'name': 'The Gendarme and the Extra-Terrestrials',
  'year': 1979,
  'index': 55242},
 {'name': 'The Gendarme and the Gendarmettes', 'year': 1982, 'index': 58335},
 {'name': 'The Gendarme of Champignol', 'year': 1959, 'index': 92829},
 {'name': 'El gendarme de la esquina', 'year': 1951, 'index': 116184},
 {'name': 'Sacrés gendarmes', 'year': 1980, 'index': 120027},
 {'name': "Hainburg - Je t'aime, gendarme", 'year': 2001, 'index': 145903},
 {'name': 'Le gendarme de Abobo', 'year': 2019, 'index': 319652}]

## Exercise 6. Provide a Web API

Using Python and the Bottle package, build a web server that implements the following API:
- `/movies/{id}` : where `id` is the index of a movie, returns the corresponding movie as a json (cf `get_movie`).
- `/movies` : returns by default the first 100 movies. The value 100 can be modified by sending a URL parameter `limit`.
- `/actors/{id}` : where `id` is the index of an author, returns the json of the actor (cf `get_actor`).
- `/actors` : returns by default the first 100 actors. The value 100 can be modified by sending a URL parameter `limit`.
- `/actors/{id}/costars` : returns the co-stars of one actor (actors that play in a same movie).
- `/search/actors/{searchString}` : where `searchString` is a string to lookup one actor. This route should return the actors whose name contains `searchString` (for example, `/search/actors/w` returns the actors whose name contains `w` or `W`).
- `/search/movies/{searchString}`: where `searchString` is a string, returns the list of movies whose title contains `searchString`. The route should accept a URL parameter `filter` formatted like `key1:value1,key2:value2,...`  to restrain the search to the publications where key `keyi` contains `valuei`. For example, `/search/movies/gendarme?filter=year:1964`
should return the list of movies where the title contains `gendarme` published in 1964.
- `/actors/{id_origin}/distance/{id_destination}` : where `id_origin`
and `id_destination` are two actor indices, returns the collaboration distance between the two actors. In addition to the distance, the response should include one shortest path between the two actors, e.g. the json you return should be a list of two elements, one integer and one list.

The developed API should have the following characteristics:

- All errors should have the same format.
- In absence of error, the API should always return a `json`.
- Each route must be documented with the return format, possible errors, and an explanation of parameters.
- Each route that returns a list should return a maximum of 100 elements and should accept URL parameters `start` and `limit` to display `limit` elements starting from the `start`-th element. For example: `/actors` should return the first 100 authors, `/actors?start=100` displays the next 100, and `/actors?start=200&limit=2` displays the next 2 elements.
- For each route that returns a list, the returned elements should be sortable based on a given field using a URL parameter `order`. For example: `/movies?order=year` displays the first 100 movies sorted by year.

Your answer here.

In [44]:
from bottle import Bottle, run, request, response
import json

app = Bottle()

def json_response(data, status=200):
    response.content_type = 'application/json'
    response.status = status
    return json.dumps(data)

@app.route('/movies/<id:int>')
def get_movie_route(id):
    try:
        movie = get_movie(id)
        return json_response(movie)
    except IndexError:
        return json_response({"error": "Movie not found"}, status=404)

@app.route('/movies')
def list_movies():
    try:
        start = int(request.query.start or 0)
        limit = int(request.query.limit or 100)
        order = request.query.order or None
        sorted_movies = sorted(movies, key=lambda x: x[2] if order == 'year' else x[0])
        return json_response(sorted_movies[start:start+limit])
    except Exception as e:
        return json_response({"error": str(e)}, status=400)

@app.route('/actors/<id:int>')
def get_actor_route(id):
    try:
        actor = get_actor(id)
        return json_response(actor)
    except IndexError:
        return json_response({"error": "Actor not found"}, status=404)

@app.route('/actors')
def list_actors():
    try:
        start = int(request.query.start or 0)
        limit = int(request.query.limit or 100)
        order = request.query.order or None
        sorted_actors = sorted(actors, key=lambda x: x[0])
        return json_response(sorted_actors[start:start+limit])
    except Exception as e:
        return json_response({"error": str(e)}, status=400)

@app.route('/actors/<id:int>/costars')
def get_costars(id):
    try:
        actor = actors[id]
        costars = set()
        for movie_index in actor[1]:
            for co_actor in movies[movie_index][1]:
                if co_actor != id:
                    costars.add(co_actor)
        costar_list = [{"name": actors[co_actor][0], "index": co_actor} for co_actor in costars]
        return json_response(costar_list)
    except IndexError:
        return json_response({"error": "Actor not found"}, status=404)

@app.route('/search/actors/<searchString>')
def search_actors_route(searchString):
    try:
        result = search_actor(searchString)
        return json_response(result)
    except Exception as e:
        return json_response({"error": str(e)}, status=400)

@app.route('/search/movies/<searchString>')
def search_movies_route(searchString):
    try:
        filter_params = request.query.filter or None
        filters = {}
        if filter_params:
            for param in filter_params.split(','):
                key, value = param.split(':')
                filters[key] = value
        result = search_movie(searchString)
        if filters:
            result = [movie for movie in result if all(str(movie.get(k)) == v for k, v in filters.items())]
        return json_response(result)
    except Exception as e:
        return json_response({"error": str(e)}, status=400)

@app.route('/actors/<id_origin:int>/distance/<id_destination:int>')
def get_distance(id_origin, id_destination):
    try:
        distance, path = movie_path(id_origin, id_destination)
        return json_response({"distance": distance, "path": path})
    except Exception as e:
        return json_response({"error": str(e)}, status=400)

if __name__ == '__main__':
    run(app, host='localhost', port=8080)

Bottle v0.12.25 server starting up (using WSGIRefServer())...
Listening on http://localhost:8080/
Hit Ctrl-C to quit.



## Exercise 7. Test a Web API

Using `pytest`, write a program that checks that the API made in the previous exercise works as expected.

In [45]:

import pytest
from bottle import Bottle, request, response
from bottle import LocalRequest, LocalResponse
from bottle import HTTPError
import json
from app import app

# Assuming the app is defined in a file named `app.py`

@pytest.fixture
def client():
    return app

def test_get_movie(client):
    response = client.get('/movies/0')
    assert response.status_code == 200
    data = response.json
    assert data['name'] == 'Miss Jerry'
    assert data['year'] == 1894

def test_get_movie_not_found(client):
    response = client.get('/movies/999999')
    assert response.status_code == 404
    data = response.json
    assert data['error'] == 'Movie not found'

def test_list_movies(client):
    response = client.get('/movies')
    assert response.status_code == 200
    data = response.json
    assert len(data) <= 100

def test_get_actor(client):
    response = client.get('/actors/0')
    assert response.status_code == 200
    data = response.json
    assert data['name'] == 'Blanche Bayliss'

def test_get_actor_not_found(client):
    response = client.get('/actors/999999')
    assert response.status_code == 404
    data = response.json
    assert data['error'] == 'Actor not found'

def test_list_actors(client):
    response = client.get('/actors')
    assert response.status_code == 200
    data = response.json
    assert len(data) <= 100

def test_get_costars(client):
    response = client.get('/actors/0/costars')
    assert response.status_code == 200
    data = response.json
    assert isinstance(data, list)

def test_search_actors(client):
    response = client.get('/search/actors/blanche')
    assert response.status_code == 200
    data = response.json
    assert any('Blanche Bayliss' in actor['name'] for actor in data)

def test_search_movies(client):
    response = client.get('/search/movies/jerry')
    assert response.status_code == 200
    data = response.json
    assert any('Miss Jerry' in movie['name'] for movie in data)

def test_get_distance(client):
    response = client.get('/actors/0/distance/1')
    assert response.status_code == 200
    data = response.json
    assert 'distance' in data
    assert 'path' in data

Your answer here.

## Exercise 8. Make a Website that uses the Web API

Create a Python web server using the Bottle library that utilizes the Web API you developed to offer the user a graphical Web interface. This interface allows the user to obtain, by entering relevant information into a Web form:

- The complete list of movies and the complete list of costars of an actor, possibly sorted alphabetically. This actor can be searched beforehand using a substring of characters appearing in her name.
- The colloration distance between two actors. As above, the actors can be searched beforehand using a substring of characters appearing in their names. Try to format a bit (not too much). For example:
  - The collaboration distance between Kevin Bacon and Jean Dujardin is 2.
  - Kevin bacon played in Wild things with Bill Murray;
  - Bill Murray played in The Monuments Men with Jean Dujardin.

Your answer here.

In [48]:
from bottle import Bottle, run, request, template, static_file
import json
from collections import deque

app = Bottle()

# Load the movies and actors data
with gzip.open('movies.json.gz', 'rt', encoding='utf8') as f:
    movies = json.load(f)
with gzip.open('actors.json.gz', 'rt', encoding='utf8') as f:
    actors = json.load(f) 
@app.route('/')
def index():
    return template('index')

@app.route('/search_actor', method='POST')
def search_actor():
    search_string = request.forms.get('search_string')
    result = [actor for actor in actors if search_string.lower() in actor[0].lower()]
    return template('search_actor', actors=result, search_string=search_string)

@app.route('/actor/<actor_id:int>')
def actor_details(actor_id):
    actor = actors[actor_id]
    actor_movies = [movies[movie_id] for movie_id in actor[1]]
    costars = set()
    for movie_id in actor[1]:
        for co_actor_id in movies[movie_id][1]:
            if co_actor_id != actor_id:
                costars.add(co_actor_id)
    costar_list = [actors[co_actor_id] for co_actor_id in costars]
    return template('actor_details', actor=actor, movies=actor_movies, costars=costar_list)

@app.route('/collaboration_distance', method='POST')
def collaboration_distance():
    actor1_id = int(request.forms.get('actor1_id'))
    actor2_id = int(request.forms.get('actor2_id'))
    distance, path = movie_path(actor1_id, actor2_id)
    return template('collaboration_distance', distance=distance, path=path)

@app.route('/static/<filename>')
def server_static(filename):
    return static_file(filename, root='./static')

def movie_path(origin, destination):

    if origin == destination:
        return 0, [actors[origin][0]]

    queue = deque([(origin, [origin])])
    visited = set([origin])

    while queue:
        current_actor, path = queue.popleft()

        for movie_index in actors[current_actor][1]:
            for co_actor in movies[movie_index][1]:
                if co_actor == destination:
                    full_path = path + [co_actor]
                    path_names = [actors[origin][0]]
                    for i in range(len(full_path) - 1):
                        path_names.append(movies[movies[full_path[i]][1][0]][0])
                        path_names.append(actors[full_path[i + 1]][0])
                    return len(full_path) - 1, path_names

                if co_actor not in visited:
                    visited.add(co_actor)
                    queue.append((co_actor, path + [co_actor]))

    return -1, []

if __name__ == '__main__':
    run(app, host='localhost', port=8080)

Bottle v0.12.25 server starting up (using WSGIRefServer())...
Listening on http://localhost:8080/
Hit Ctrl-C to quit.

