
<h1><font color="#00586D" size=6>Movie billboard on AWS</font></h1>

# <font color="#00586D">Introduction</font>

In this project I deploy an application on the AWS cloud that, on a recurring basis, consults the state of the movie billboards through calls to an API of third parties. The result of these queries will be stored through AWS storage services. In addition, the most relevant data will be stored in a non-relational database, which will allow future queries through an API built on Lambda functions.
The validation of the application is done from the current notebook.

# <font color="#00586D"> Architecture</font>

The app will have the following functionality:
<ol type="A">
     <li> It will consult data through an external API, which allows obtaining information about the films that are currently projected in theaters. Initially, queries will be made through the use of the request library within the current notebook</li>
     <li> It will store the data obtained with each call in JSON files, using the AWS S3 storage service. These files will be organized by name that reflects the date they were obtained</li>
     <li> It will keep a database, created in AWS DynamoDB, up-to-date with key data for every movie I have obtained information about. The update process will start upon the presence of new files in AWS S3.
     <li> It will offer queries to the database that allow obtaining movies that meet certain conditions </li>
         <ol>
             <li> Movies released in specific month </li>
             <li> Movies with a rating above a threshold </li>
         </ol>
     <li> It will automate external API queries by deploying an AWS Lambda function</li>
     <li> It will offer the queries to the database through an API deployed with API Gateway + AWS Lambda </li>
</ol>

The application architecture is shown in the following image:
  
<img src="img/arquitectura.png" alt="Arquitectura del capstone">

# <font color="#00586D" size=5> Actions to take </font>

### <font color="#00586D" size=3><b> </b></font> Use of "The Movie Database API"

To consume the API I must obtain an `API Key` that identifies us during our queries.

The TMDb API requires authentication, so to work with it it is necessary, first of all, to have a user. Once the registration on the web is done, it is necessary to request a key to use the API. Detailed instructions are shown on this page([link](https://developers.themoviedb.org/3/getting-started/introduction)). This process is simple, and basically consists of 3 steps:

- Enter the personal account settings.
- Enter the API menu.
- Create the API and identify the token (I will use it as `API KEY`)

 
Once the API is obtained, I use it to perform any of the queries included in the documentation (https://developers.themoviedb.org/3/movies).

During the project, I'll make use of the now_playing call, which gives us the following data for each movie currently playing in theaters:

- poster_path: string with values to generate a URL to the movie poster
- adult: boolean value that tells us if it is an adult movie or not
- overview: string with the summary of the movie
- release_date: string with the release date
- id: integer that serves as a unique identifier
- popularity: integer with the current popularity
- vote_average: integer with the average value of the votes made to date
- vote_count: integer indicating the number of votes made



Using POSTMAN, I make a call to the URL https://api.themoviedb.org/3/movie/now_playing?api_key=YOUR_KEY replacing YOUR_KEY with my personal API Key.

I paste the data obtained for the first 3 films in the following cell:

### <font color="#00586D" size=3><b></font> Query through "requests"

- Now I use the request library to do the API calls.
- Knowing that the obtained data is ordered by popularity, I add a value to each result called rank with the position of the movie
    - The movie in position 0 of the array will have the ranking value 1
    - The movie in position 1 of the array will have the ranking value 2
    - The movie in position n-1 of the array will have the ranking value n

In [1]:
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup
from pandas import json_normalize
API_KEY = "your_key"
response = requests.get(f"https://api.themoviedb.org/3/movie/now_playing?api_key={API_KEY}")
data = response.json()["results"]
df = pd.json_normalize (data)
df = df.assign(rank=range(1, len(df) + 1))
df.head(3).to_json(orient="records")

'[{"adult":false,"backdrop_path":"\\/yF1eOkaYvwiORauRCPWznV9xVvi.jpg","genre_ids":[28,12,878],"id":298618,"original_language":"en","original_title":"The Flash","overview":"When his attempt to save his family inadvertently alters the future, Barry Allen becomes trapped in a reality in which General Zod has returned and there are no Super Heroes to turn to. In order to save the world that he is in and return to the future that he knows, Barry\'s only hope is to race for his life. But will making the ultimate sacrifice be enough to reset the universe?","popularity":4114.54,"poster_path":"\\/rktDFPbfHfUbArZ6OOOKsXcv0Bm.jpg","release_date":"2023-06-13","title":"The Flash","video":false,"vote_average":6.9,"vote_count":1870,"rank":1},{"adult":false,"backdrop_path":"\\/2vFuG6bWGyQUzYS9d69E5l85nIz.jpg","genre_ids":[28,12,878],"id":667538,"original_language":"en","original_title":"Transformers: Rise of the Beasts","overview":"When a new threat capable of destroying the entire planet emerges, Opt

### <font color="#00586D" size=3></font> Uploading data to S3

- I create an S3 bucket named
   - I do it from the AWS console
- The code cell below, after querying data from the API and including the rank field, an object is generated for each result
- Path of the object to create /movies/ID_MOVIE/YEAR_MONTH_DAY.json
   - ID_MOVIE corresponds to the id of the movie
   - YEAR/MONTH/DAY with the date of the query
  
For example, if the query gets 4 movies, the execution of the cell should generate 4 JSON files that will be uploaded, using the boto3 library, to the S3 bucket

In [2]:
!pip install boto3




[notice] A new release of pip available: 22.2.2 -> 23.2.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [None]:
import requests
import json
import boto3    
from datetime import datetime

API_KEY = "your_key"
NOMBRE_BUCKET = "your_bucket" # Creado desde la consola de AWS 
now = datetime.now()
year_month_day = now.strftime("%Y_%m_%d") # Obtener dígitos del año, mes y día


# 1.- GET DATA FROM THE API

response = requests.get(f"https://api.themoviedb.org/3/movie/now_playing?api_key={API_KEY}")
data = response.json()["results"]
    
# 2.- CONFIGURING S3
    
s3_client = boto3.client('s3', region_name='us-east-1')
s3_resource = boto3.resource('s3', region_name='us-east-1')
    
try:
    s3_client.put_object(Bucket=NOMBRE_BUCKET, Key='movies/')
    print('Folder created correctly')
except Exception as e:
    print(e) 


# 3.- Uploading files to S3

for index, json_obj in enumerate (data):
    json_obj["rank"] = index+1
    movie_id = str(json_obj["id"])
    path = f"movies/{movie_id}/{year_month_day}.json"
    data_movie = json.dumps(json_obj)
    s3_client.put_object(Bucket= NOMBRE_BUCKET, Key=path, Body=data_movie)

- I paste the folders inside the S3 bucket

<img src="img/eje_2.png" alt="Validacion 2">

### <font color="#00586D" size=3></font> Data storage in DynamoDB

For ease of future reference, I am going to store movie data in a DynamoDB database. To do this, I will create a DynamoDB table through the AWS console with the following settings:

- Name: MoviesDB
- Partition Key: id (string)
-Global Secondary Index
   - Partition Key: y_m (string)
      - Represents the release date of the movie (release_date)
      - Composed of the year (4 digits) and month (2 digits). For example 2023_01
   - Sort Key: val (number)

To validate the created table, I am going to generate an input through boto3.

In [88]:
import requests
import json
import boto3    
from decimal import Decimal

API_KEY = "your_key"
NOMBRE_TABLA = "MoviesDB"

# 1.- GET DATA FROM API MOVIES DB

response = requests.get(f"https://api.themoviedb.org/3/movie/now_playing?api_key={API_KEY}")
data = response.json()["results"]

# 2.- CONFIGURE THE DYNAMODB RESOURCE AND TABLE
    
dynamodb_resource = boto3.resource('dynamodb', region_name='us-east-1')
dynamodb_table = dynamodb_resource.Table(NOMBRE_TABLA)

# 3.- MODIFY THE FORMAT OF THE DATA OBTAINED SO THAT IT CAN BE SAVED IN DYNAMO_DB

for index, json_obj in enumerate (data):
    json_obj["rank"] = index+1
    str_id= str(json_obj["id"])
    json_obj["id"] = str_id
    split_rd= json_obj['release_date'].split('-')
    format_rd = split_rd[0]+'_'+split_rd[1]
    json_obj["y_m"] = format_rd
    json_obj["val"] =  json_obj["vote_average"]

# 4.- SAVE A SINGLE ENTRY TO DYNAMO DB, USING THE METHOD BELOW

single_element = data[0]
dynamodb_table.put_item(Item=json.loads(json.dumps(single_element), parse_float=Decimal))

{'ResponseMetadata': {'RequestId': 'O34CHBBI2FSABFAOTE617CQGRFVV4KQNSO5AEMVJF66Q9ASUAAJG',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'server': 'Server',
   'date': 'Tue, 02 May 2023 11:12:44 GMT',
   'content-type': 'application/x-amz-json-1.0',
   'content-length': '2',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'O34CHBBI2FSABFAOTE617CQGRFVV4KQNSO5AEMVJF66Q9ASUAAJG',
   'x-amz-crc32': '2745614147'},
  'RetryAttempts': 0}}

- I paste a screenshot that shows the contents of the movies table in DynamoDB

<img src="img/eje_3.png" alt="Validacion 3">

### <font color="#00586D" size=3></font> Process automation

I want to automate the process of adding/updating the data in our DynamoDB table. To do this, I will use a lambda function with the following characteristics:

- Must be executed after the creation of any object in the S3 bucket created above, under the path /movies
- In case there is no previous entry for the movie, an entry will be created once proper fields have been generated for the partition and sort keys
   - Partition key: id
   -GSI
       - Partition key: y_m
       - Sort key: val
      
In addition, it will have two "special" fields with the following characteristics:
   -rank_hist
     - It will represent a dictionary with the evolution of the position of the film in the rankings over the weeks
     - To do this, it will have an entry for each week in which the value obtained has been consulted, where date will have the format year_nWeek (number of the week/year in which the ranking has been consulted).
       - Year will be represented by 4 digits
       - nWeek will be represented by 2 digits
     - If I make a request on January 15, I will be in week 2, while on the 16th I will be in week 3
     - For the most popular movie, I will add the entry
         - rank_hist = {"2023_02", 1}
   -until_date
     - It will help me to check the number of days that a movie has been showing
     - Will save the date of the query in the same format as the release_date field (provided by the MoviesDB API)
         -yyyy-mm-dd
        
- In case an entry already exists, I will update its values so that I "can" have the following changes
     - New value for current valuation
       - field val
     - Updating the until_date field
     - New entry in the rank_hist field with the value of the current week

In [None]:
import requests
import json
import boto3
import botocore
from decimal import Decimal
from datetime import datetime

API_KEY = "your_key"
NOMBRE_TABLA = "MoviesDB"

# 1.- GET DATA FROM API MOVIES DB

response = requests.get(f"https://api.themoviedb.org/3/movie/now_playing?api_key={API_KEY}")
data = response.json()["results"]

# 2.- CONFIGURE THE DYNAMODB RESOURCE AND TABLE
    
dynamodb_resource = boto3.resource('dynamodb', region_name='us-east-1')
dynamodb_table = dynamodb_resource.Table(NOMBRE_TABLA)

# 3.- MODIFY THE FORMAT OF THE DATA OBTAINED SO THAT IT CAN BE SAVED IN DYNAMO_DB

now = datetime.now()    
year_number = now.isocalendar()[0]
week_number = now.isocalendar()[1]    
year_week = f'{year_number}_{week_number}'

for index, json_obj in enumerate (data):
    json_obj["rank"] = index+1
    str_id= str(json_obj["id"])
    json_obj["id"] = str_id
    split_rd= json_obj['release_date'].split('-')
    format_rd = split_rd[0]+'_'+split_rd[1]
    json_obj["y_m"] = format_rd
    json_obj["val"] =  json_obj["vote_average"]
    new_rank = {year_week: json_obj["val"]}
    json_obj["rank_hist"] = new_rank
    
# 4.- SAVE A SINGLE ENTRY TO DYNAMO DB, USING THE METHOD BELOW
single_element = data[0]
try:
    response = dynamodb_table.update_item(
    Key={ 'id': single_element['id']},
    UpdateExpression= 'SET val = :val1, until_date = :val2, #dict.#key = :val3',
    ExpressionAttributeNames={
        '#dict': 'rank_hist',
        '#key': year_week
    },
    ExpressionAttributeValues={
        ':val1': round(Decimal(single_element['val']),2),
        ':val2': datetime.now().strftime("%Y_%m_%d"),
        ':val3': round(Decimal(single_element['val']),2)
    },
    ConditionExpression='attribute_exists(id) and attribute_exists(y_m) and attribute_exists(val)'
    )
    print("UPDATE")
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "ConditionalCheckFailedException":
        dynamodb_table.put_item(Item=json.loads(json.dumps(single_element), parse_float=Decimal))
        print("PUT")
    else:
        raise

In [118]:
import boto3
import json
import botocore
from decimal import Decimal
from datetime import datetime


def lambda_handler(event, context):
    
    s3_resource = boto3.resource('s3', region_name='us-east-1')
    now = datetime.now()    
    year_number = now.isocalendar()[0]
    week_number = now.isocalendar()[1]    
    year_week = f'{year_number}_{week_number}'
    
    NOMBRE_TABLA = "MoviesDB"
    dynamodb_resource = boto3.resource('dynamodb', region_name='us-east-1')
    dynamodb_table = dynamodb_resource.Table(NOMBRE_TABLA)
    
    for record in event.get('Records', []):
        # 1 READ DATA FROM S3
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        obj = s3_resource.Object(bucket, key)
        element = json.load(obj.get()['Body'])
        str_id= str(element["id"])
        element["id"] = str_id
        split_rd= element['release_date'].split('-')
        format_rd = split_rd[0]+'_'+split_rd[1]
        element["y_m"] = format_rd
        element["val"] =  element["vote_average"]
        new_rank = {year_week: element["val"]}
        element["rank_hist"] = new_rank
        
        # 2 TRANSFORM DATA AND STORE IN DYNAMO DB
        try:
            response = dynamodb_table.update_item(
                Key={ 'id': element["id"]},
                UpdateExpression= 'SET val = :val1, until_date = :val2, #dict.#key = :val3',
                ExpressionAttributeNames={
                    '#dict': 'rank_hist',
                    '#key': year_week
                },
                ExpressionAttributeValues={
                    ':val1': round(Decimal(element['val']),2),
                    ':val2': now.strftime("%Y_%m_%d"),
                    ':val3': round(Decimal(element['val']),2)
                },
                ConditionExpression='attribute_exists(id) and attribute_exists(y_m) and attribute_exists(val)'
            )
            print("UPDATE")
        except botocore.exceptions.ClientError as e:
            if e.response['Error']['Code'] == "ConditionalCheckFailedException":
                dynamodb_table.put_item(Item=json.loads(json.dumps(element), parse_float=Decimal))
                print("PUT")
            else:
                raise

- I paste a screenshot that shows the logs of the lambda function during the last 15 minutes
- I paste an image that shows the entries in the DynamoDB table

<img src="img/eje_4_a.png" alt="Validacion 4a">

<img src="img/eje_4_b.png" alt="Validacion 4b">

### <font color="#00586D" size=3></font> Automation of data capture

Once I have all the pieces of the process, let´s automate the data capture through a lambda function deployed in the cloud.

Once deployed, I will add a _trigger_ that runs it every 24 hours.

For the function I'll need it to handle additional dependencies to use the _requests_ library using a _layer_. Also, I need to modify the default parameters (timeout, memory, ...)

- I paste screenshot that shows the logs of the lambda function during the last 5 minutes
- Then I modify the trigger so that the function is executed every day

<img src="img/eje_5.png" alt="Validacion 5">

### <font color="#00586D" size=3></font> Movie list API (given month and year)

Now, I'll build a lambda function executed upon an API Gateway endpoint invocation.

To do this, I will generate a lambda that from events with the following format: `{'pathParameters': {'month': '1', 'year': '2023'}}`, generates a suitable listing. This query returns a movie list from DynamoDB using the following characteristics:
- Query of type _query_
- Query performed on the global secondary index
   - A value must be provided for the y_m field to get the correct data
- Descending order, so that the films with the highest rating are obtained in the first positions
- Limit the query to 10 movies

So that the result (list of movies) can be correctly interpreted by API Gateway, I will include it as json in the lambda response, specifically in the `body` field.

Once the lambda is developed, I can validate it locally by making the following call:

`lambda_handler({'pathParameters': {'month': '12', 'year': '2022'}}, None)`

After the validation of the lambda, I will deploy an API through API Gateway that allows its invocation. To do this, I will define 1 base resource called _/list_ on top of which I will add 1 subresource _/list/{year}_ and a (sub)subresource _/list/{year}/{month}_. On the last subresource (_{month}_) I will add a GET method, which will connect the calls made on `API_URL/stage/list/{year}/{month}` with the previous lambda, where _year}_ and _month}_ will be replaced by real values.

The event received by the lambda will have the previously defined format

`{'pathParameters': {'month': 'MONTH', 'year': 'YEAR'}}`

In [145]:
import boto3
import decimal
import json
from boto3.dynamodb.conditions import Key

NOMBRE_TABLA = "MoviesDB"
dynamo_resource = boto3.resource('dynamodb', region_name='us-east-1')
dynamo_table = dynamo_resource.Table(NOMBRE_TABLA)


class DecimalEncoder(json.JSONEncoder):       # Necesario para manejar los tipos Decimal
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)
        return super(DecimalEncoder, self).default(o)
    
    
def lambda_handler(event, context):
    year = int(event['pathParameters']['year'])
    month = int(event['pathParameters']['month'])
    y_m = f'{year}_{month:02d}'
    response = dynamo_table.query(
            IndexName='y_m-val-index',
            KeyConditionExpression=Key('y_m').eq(y_m), 
            ScanIndexForward=False,
            Limit=10   
    )
    items= response["Items"]
    return {
        'statusCode': 200,
        'body': json.dumps(items, cls=DecimalEncoder)
    } 

In [146]:
event = {'pathParameters': {'month': '04', 'year': '2022'}}
lambda_handler(event, {})

{'statusCode': 200, 'body': '[]'}

### <font color="#00586D" size=3></font> Movie details api (given its id)

Now I'll build another lambda function executed upon invocation of another API Gateway endpoint.

To do this, I will generate a lambda that from events in the following format: `{'pathParameters': {'id': '76600'}}`, get details for the movie with the id provided. The details are obtained from a query to the DynamoDB database using the following features:
- Query of type _get_item()_
- Query performed on the table (using the partition key)
   - A value must be provided for the id field

So that the result (movie details) can be correctly interpreted by API Gateway, I will include it as json in the lambda response, specifically in the `body` field (check the content of the included code).

Once the lambda is developed, I can validate it locally by making the following call:

`lambda_handler({'pathParameters': {'id': '76600'}}, None)`

After the validation of the lambda, I will deploy an API through API Gateway that allows its invocation. To do this, I will define 1 new base resource called _/movies_ to which I will add 1 subresource _/movies/{id}_. On the subresource (_{id}_) I will add a GET method, which will connect the calls made on `API_URL/stage/movies/{id}` with the previous lambda, where {id}_ will be replaced by a real value .

The event received by the lambda will have the previously defined format

`{'pathParameters': {'id': 'MOVIE_ID'}}`

In [134]:
import boto3
import decimal
import json
from boto3.dynamodb.conditions import Key

NOMBRE_TABLA = "MoviesDB"
dynamo_resource = boto3.resource('dynamodb', region_name='us-east-1')
dynamo_table = dynamo_resource.Table(NOMBRE_TABLA)


class DecimalEncoder(json.JSONEncoder):       # Necesario para manejar los tipos Decimal
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            return str(o)
        return super(DecimalEncoder, self).default(o)
    
    
def lambda_handler(event, context):
    movie_id = event['pathParameters']['id']
    response = dynamo_table.get_item(
    Key={
        'id': movie_id,
    }
    )
    item= response["Item"]
    return {
        'statusCode': 200,
        'body': json.dumps(item, cls=DecimalEncoder)
    } 

In [135]:
event = {'pathParameters': {'id': '700391'}}
lambda_handler(event, {})

{'statusCode': 200,
 'body': '{"genre_ids": ["878", "12", "53", "28"], "original_title": "65", "val": "6.3", "adult": false, "y_m": "2023_03", "overview": "65 million years ago, the only 2 survivors of a spaceship from Somaris that crash-landed on Earth must fend off dinosaurs and reach the escape vessel in time before an imminent asteroid strike threatens to destroy the planet.", "vote_average": "6.3", "rank": "9", "popularity": "1665.196", "rank_hist": {"2023_18": "6.3"}, "backdrop_path": "/eEF40Xk2twM3WjRNZftfo771gjv.jpg", "release_date": "2023-03-02", "original_language": "en", "vote_count": "853", "until_date": "2023_05_02", "id": "700391", "poster_path": "/rzRb63TldOKdKydCvWJM8B6EkPM.jpg", "video": false, "title": "65"}'}

### <font color="#00586D" size=3></font> API validation through the use of IPWidgets.

To validate that everything works correctly, I have included a series of visual elements (widgets) that allow queries to be made to the previously deployed API.

In [151]:
BASE_URL = 'https://gevlrxz224.execute-api.us-east-1.amazonaws.com/dev'

In [None]:
!pip install ipywidgets

In [152]:
%matplotlib inline

import requests
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display, HTML
from matplotlib.ticker import MaxNLocator    
    
def display_movie_info(year, month):    
    url = BASE_URL + f'/list/{year}/{month}'
    res = requests.get(url)
    data = res.json()
    
    dropdown_movies = widgets.Dropdown(
        options=[(x['title'], x['id']) for x in data],    
        description='Pelicula'
    )
    display(dropdown_movies)
    movie_data = widgets.interactive_output(display_movie, {'_id': dropdown_movies})
    display(movie_data)
    
def display_movie(_id):
    url = BASE_URL + f'/movies/{_id}'
    res = requests.get(url)
    data = res.json()
    if not "message" in data:
        output1 = widgets.Output()
        with output1:
            poster_path = 'http://image.tmdb.org/t/p/w185'+data['poster_path']
            html = HTML(f'''
                <p>
                    <b>Estreno:</b> {data["release_date"]}</br>
                    <b>Hasta:</b> {data["until_date"]}</br>
                    <b>Votos:</b> {data["vote_count"]}</br>                    
                    <b>Media:</b> {data["vote_average"]}
                </p>
                <img src="{poster_path}" style=max-width:185px;"/>
            ''')
            display(html)

        output2 = widgets.Output()
        with output2:
            ranks = [float(value) for value in data['rank_hist'].values()]
            weeks = list(data['rank_hist'].keys())

            fig = plt.figure(figsize=(8, 5), dpi=100)
            ax = fig.gca()
            ax.yaxis.set_major_locator(MaxNLocator(integer=True))
            plt.xticks(range(len(weeks)), weeks)
            plt.ylim(0, max(ranks) + 1)

            plt.title("Ranking semanal")
            plt.xlabel("Semana")
            plt.ylabel("Posición")

            # Dibujar grafica
            plt.plot(ranks, marker='o', linewidth=2) 
            plt.show()

        two_columns = widgets.HBox([output1, output2])
        display(two_columns)


dropdown_year = widgets.Dropdown(
    options=[2022, 2023],    
    description='Año'
)

dropdown_month = widgets.Dropdown(
    options=range(1, 13),    
    description='Mes'
)
movie_info = widgets.interactive_output(display_movie_info, {'year': dropdown_year, 'month': dropdown_month})

display(dropdown_year)
display(dropdown_month)
display(movie_info)

Dropdown(description='Año', options=(2022, 2023), value=2022)

Dropdown(description='Mes', options=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), value=1)

Output()