<a href="https://www.kaggle.com/code/mohamedabidi97/build-your-data-api-now?scriptVersionId=116328675" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<center><img style="height: 500px;" src="https://img.freepik.com/free-photo/no-people-desk-with-multiple-computers-call-center-office-used-by-telemarketing-agents-answer-phone-calls-helpline-empty-space-with-technology-give-assistance-customer-care_482257-40793.jpg?w=1380&t=st=1673164584~exp=1673165184~hmac=e8a64604ea41372a09589e5c40bc1451be61ec8a4be79da97130f7cbb1b7842b"></img><br>
</center>



## <center><b>📚 Build a Data API</b></center>

## Introduction 

If you've worked in **data science** lately, you've probably been inundated with a slew of buzzwords referring to the **collection** and **manipulation** of data. NoSQL! Big Data! data applications! Processing! APIs! Latency! Databases! Cloud services! Real-time!

As a data scientist, you should do much more than simply work with data and build machine learning models. 
The next step is to figure out how to make your work truly available to the **general public**, and that is the **challenging** part.

If you want to develop **applications** that have some kind of **server/backend** for storing or processing data, and your applications use the internet (e.g., web applications, mobile apps, or internet-connected sensors), then this **notebook is** for you.

**Welcome!**


## Objectives

You can use this notebook as a roadmap to get to the following points: 

- First, clean up your data!
- Prepare your architecture, build your API, and save data into the database.
- Finalize your API
- Launch your API - Continue to test! 
- Make your API live by deploying it.

## I- Beginning with data as usual 😺

As usual, we begin by reviewing and understanding our data. Most of you are familiar with this step.


### 1- Context

This dataset contains information on all of the movies and TV shows available on Netflix as of May 2022. In addition to basic information such as title, release year, and runtime, the dataset includes data on the cast and crew, IMDB score and number of votes, genres, production countries, and more. With this data, you can build models to find the best movies and TV shows on Netflix according to your own criteria.



Raw data file: this file contains all of the raw data for the movies and TV shows in the dataset.

Best movie by year file: this file contains a list of the best movies by year, as determined by their IMDB score and number of votes.

Best show by year file: this file contains a list of the best TV shows by year, as determined by their IMDB score and number of votes.

Best movies file: this file contains all of the movies that pass the following criteria:

- at least an IMDB score of 6.9
- at least 10,000 votes

Best shows file: this file contains all of the TV shows that pass the following criteria:
- at least an IMDB score of 7.5
- at least 10,000 votes
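
To make these thresholds concrete, here is a minimal sketch of how such a filter could be written with pandas. It is not the dataset author's actual code: the sample rows are made up, the helper name `best_of` is hypothetical, and the column names are assumed to match raw_titles.csv.

```python
import pandas as pd

# Tiny made-up sample using the raw_titles column names; not real rows from the dataset.
sample = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "type": ["MOVIE", "MOVIE", "SHOW", "SHOW"],
        "imdb_score": [7.2, 6.5, 8.1, 7.9],
        "imdb_votes": [25_000, 40_000, 12_000, 3_000],
    }
)


def best_of(df, kind, min_score, min_votes=10_000):
    """Keep titles of one type that reach both the score and the vote threshold."""
    mask = (
        (df["type"] == kind)
        & (df["imdb_score"] >= min_score)
        & (df["imdb_votes"] >= min_votes)
    )
    return df[mask].sort_values("imdb_score", ascending=False)


best_movies_sample = best_of(sample, "MOVIE", min_score=6.9)
best_shows_sample = best_of(sample, "SHOW", min_score=7.5)
print(best_movies_sample["title"].tolist())
print(best_shows_sample["title"].tolist())
```

In this sample, only movie A and show C pass: B misses the score threshold and D misses the vote threshold.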



### 1- Import libraries 📑

In [1]:
# Data Manipulation
import pandas as pd
import numpy as np

# HTML
from IPython.display import HTML as html_print, display

import base64

### 2- First look at the data 

In [2]:
# Read data
best_movies_year = pd.read_csv("/kaggle/input/the-ultimate-netflix-tv-shows-and-movies-dataset/Best Movie by Year Netflix.csv")
best_movies = pd.read_csv("/kaggle/input/the-ultimate-netflix-tv-shows-and-movies-dataset/Best Movies Netflix.csv")
best_shows_year = pd.read_csv("/kaggle/input/the-ultimate-netflix-tv-shows-and-movies-dataset/Best Show by Year Netflix.csv")
best_shows = pd.read_csv("/kaggle/input/the-ultimate-netflix-tv-shows-and-movies-dataset/Best Shows Netflix.csv")
raw_credits = pd.read_csv("/kaggle/input/the-ultimate-netflix-tv-shows-and-movies-dataset/raw_credits.csv")
raw_titles = pd.read_csv("/kaggle/input/the-ultimate-netflix-tv-shows-and-movies-dataset/raw_titles.csv")

#### 📌 There are numerous CSV files here, so let's figure out what we have. 

File: raw_titles.csv



| Column name          | Description                                                 |
|----------------------|-------------------------------------------------------------|
| title                | The title of the movie or TV show. (String)                 |
| type                 | The type of the movie or TV show. (String)                  |
| release_year         | The year the movie or TV show was released. (Integer)       |
| age_certification    | The age certification of the movie or TV show. (String)     |
| runtime              | The runtime of the movie or TV show. (Integer)              |
| genres               | The genres of the movie or TV show. (String)                |
| production_countries | The production countries of the movie or TV show. (String)  |
| seasons              | The number of seasons of the TV show. (Integer)             |
| imdb_score           | The IMDB score of the movie or TV show. (Float)             |
| imdb_votes           | The number of IMDB votes of the movie or TV show. (Integer) |


File: Best Shows Netflix.csv

|    Column name    |                              Description                              |
|:-----------------:|:---------------------------------------------------------------------:|
| TITLE             | The title of the TV show. (String)                                    |
| RELEASE_YEAR      | The year the TV show was released. (Integer)                          |
| SCORE             | The IMDB score for the TV show. (Float)                               |
| NUMBER_OF_VOTES   | The number of votes the TV show has received on IMDB. (Integer)       |
| DURATION          | The duration of the TV show in minutes. (Integer)                     |
| NUMBER_OF_SEASONS | The number of seasons the TV show has. (Integer)                      |
| MAIN_GENRE        | The main genre of the TV show. (String)                               |
| MAIN_PRODUCTION   | The main production country of the TV show, e.g. US or GB. (String)   |

File: raw_credits.csv

| Column name |                                 Description                                 |
|:-----------:|:---------------------------------------------------------------------------:|
| name        | The name of the actor or actress. (String)                                  |
| character   | The character the actor or actress played in the movie or TV show. (String) |
| role        | The role the actor or actress played in the movie or TV show. (String)      |

File: Best Movies Netflix.csv

File: Best Movie by Year Netflix.csv

File: Best Show by Year Netflix.csv

These three files share the columns below. The two "by Year" files do not include NUMBER_OF_VOTES and DURATION, and Best Show by Year Netflix.csv adds a NUMBER_OF_SEASONS column.

|   Column name   |                                Description                               |
|:---------------:|:------------------------------------------------------------------------:|
| TITLE           | The title of the movie or TV show. (String)                              |
| RELEASE_YEAR    | The year the movie or TV show was released. (Integer)                    |
| SCORE           | The IMDB score for the movie or TV show. (Float)                         |
| NUMBER_OF_VOTES | The number of votes the movie or TV show has received on IMDB. (Integer) |
| DURATION        | The duration of the movie or TV show in minutes. (Integer)               |
| MAIN_GENRE      | The main genre of the movie or TV show. (String)                         |
| MAIN_PRODUCTION | The main production country of the movie or TV show, e.g. US. (String)   |

#### 📌 Let's learn more about each of them. 

We can create a **function** to extract all the essential information from each **dataset** so that we don't have to manually go through each one. 

In [3]:
def cprint(title, text, color='#e63e50'):
    """
    It takes a title, text, and color as input, and returns a formatted HTML string
    
    :param title: The title of the section
    :param text: the text to be displayed
    :param color: The color of the title, defaults to #e63e50 (optional)
    :return: an IPython HTML object wrapping the formatted text.
    """
    
    text = "<br><strong style=color:{}>{}:</strong><br>".format(color, title) + \
            "<text>{}</text><br>".format(text)
    return html_print(text)


def check_datasets(df_list):
    """
    It takes a list of [dataframe, name] pairs and displays, for each one, the name of the dataframe,
    its size, the total number of null values, the data type of each column, and the
    first two rows
    
    :param df_list: A list of [dataframe, name] pairs
    """
    
    # Unpack each pair so the code reads df / name and not df[0] / df[1]
    for index, (df, name) in enumerate(df_list, start=1):
        display(cprint("\n{} - Name of the dataset\n".format(index), name))
        display(cprint("{} - The size of the dataset \n".format(index), df.shape))
        display(cprint("{} - Total number of null values \n".format(index), df.isnull().sum().sum()))
        display(cprint("{} - Data types of each column \n".format(index), ""))
        display(df.dtypes)
        print("\n")
        display(df.head(2))
        
        
def create_download_link(title = "Download CSV file", filename = "data.csv"):
    """
    It takes a title and a filename as input, and returns a link to download a CSV file
    
    :param title: The title of the link, defaults to Download CSV file (optional)
    :param filename: The name of the file that will be downloaded, defaults to data.csv (optional)
    :return: A link to download the data.csv file.
    """
    
    html = '<a href="{filename}">{title}</a>'  # quote the href so file names with spaces still work
    html = html.format(title=title,filename=filename)
    return html_print(html)

In [4]:
dfs = [
          [best_movies_year, "Best Movies By Year"], 
          [best_movies, "Best Movies"],
          [best_shows_year, "Best Shows by Year"],
          [best_shows, "Best Shows"],
          [raw_credits, "Credits"],
          [raw_titles, "Titles"]
      ]

check_datasets(dfs)

index                int64
TITLE               object
RELEASE_YEAR         int64
SCORE              float64
MAIN_GENRE          object
MAIN_PRODUCTION     object
dtype: object





Unnamed: 0,index,TITLE,RELEASE_YEAR,SCORE,MAIN_GENRE,MAIN_PRODUCTION
0,0,White Christmas,1954,7.5,romance,US
1,1,The Guns of Navarone,1961,7.5,war,US


index                int64
TITLE               object
RELEASE_YEAR         int64
SCORE              float64
NUMBER_OF_VOTES      int64
DURATION             int64
MAIN_GENRE          object
MAIN_PRODUCTION     object
dtype: object





Unnamed: 0,index,TITLE,RELEASE_YEAR,SCORE,NUMBER_OF_VOTES,DURATION,MAIN_GENRE,MAIN_PRODUCTION
0,0,David Attenborough: A Life on Our Planet,2020,9.0,31180,83,documentary,GB
1,1,Inception,2010,8.8,2268288,148,scifi,GB


index                  int64
TITLE                 object
RELEASE_YEAR           int64
SCORE                float64
NUMBER_OF_SEASONS      int64
MAIN_GENRE            object
MAIN_PRODUCTION       object
dtype: object





Unnamed: 0,index,TITLE,RELEASE_YEAR,SCORE,NUMBER_OF_SEASONS,MAIN_GENRE,MAIN_PRODUCTION
0,0,Monty Python's Flying Circus,1969,8.8,4,comedy,GB
1,1,Knight Rider,1982,6.9,4,action,US


index                  int64
TITLE                 object
RELEASE_YEAR           int64
SCORE                float64
NUMBER_OF_VOTES        int64
DURATION               int64
NUMBER_OF_SEASONS      int64
MAIN_GENRE            object
MAIN_PRODUCTION       object
dtype: object





Unnamed: 0,index,TITLE,RELEASE_YEAR,SCORE,NUMBER_OF_VOTES,DURATION,NUMBER_OF_SEASONS,MAIN_GENRE,MAIN_PRODUCTION
0,0,Breaking Bad,2008,9.5,1727694,48,5,drama,US
1,1,Avatar: The Last Airbender,2005,9.3,297336,24,3,scifi,US


index         int64
person_id     int64
id           object
name         object
character    object
role         object
dtype: object





Unnamed: 0,index,person_id,id,name,character,role
0,0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR


index                     int64
id                       object
title                    object
type                     object
release_year              int64
age_certification        object
runtime                   int64
genres                   object
production_countries     object
seasons                 float64
imdb_id                  object
imdb_score              float64
imdb_votes              float64
dtype: object





Unnamed: 0,index,id,title,type,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes
0,0,ts300399,Five Came Back: The Reference Films,SHOW,1945,TV-MA,48,['documentation'],['US'],1.0,,,
1,1,tm84618,Taxi Driver,MOVIE,1976,R,113,"['crime', 'drama']",['US'],,tt0075314,8.3,795222.0


- Only the dataframes for "Titles" and "Credits" have missing values. 

- Each dataframe includes an additional "index" column.

Honestly, I expected more missing values, so this is good news 😺.

Let's study some statistical details before addressing the missing values of the two datasets. 

### 3 - Let's look at some dataset statistics ☟

We start with the datasets that include missing values. But first, let's remove the superfluous "index" column.

In [5]:
# Delete "index" column from the datasets
for df in dfs:
    df[0].drop('index', axis=1, inplace=True)

In [6]:
display(raw_credits.describe().T)
display(raw_titles.describe().T)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
person_id,77213.0,499460.322666,612843.136282,7.0,41584.0,182985.0,841557.0,2371585.0


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
release_year,5806.0,2016.013434,7.324883,1945.0,2015.0,2018.0,2020.0,2022.0
runtime,5806.0,77.643989,39.47416,0.0,44.0,84.0,105.0,251.0
seasons,2047.0,2.165608,2.636207,1.0,1.0,1.0,2.0,42.0
imdb_score,5283.0,6.533447,1.160932,1.5,5.8,6.6,7.4,9.6
imdb_votes,5267.0,23407.194988,87134.315849,5.0,521.0,2279.0,10144.0,2268288.0


- In the "runtime" column of the "Titles" dataframe, the maximum (251) is far above both the mean (about 78) and the median (84). This suggests the presence of outliers.
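
One common way to turn that hunch into a check is Tukey's 1.5 × IQR rule. With the quartiles reported above (25% = 44, 75% = 105), the IQR is 61 and the upper fence is 105 + 1.5 × 61 = 196.5, so a runtime of 251 minutes falls outside it. The sketch below applies the same rule to a small made-up series; the helper name `iqr_bounds` is hypothetical, and the rule itself is only one possible definition of an outlier.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up runtimes in minutes, for illustration only.
runtimes = pd.Series([44, 48, 84, 90, 105, 113, 251])
lower, upper = iqr_bounds(runtimes)

# Anything outside the fences is flagged as a potential outlier.
outliers = runtimes[(runtimes < lower) | (runtimes > upper)]
print(outliers.tolist())
```

On the real data you would pass `raw_titles["runtime"]` in place of the sample series.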

Let's see whether the **categorical** columns tell us anything.

In [7]:
display(raw_titles.describe(include=object).T)
display(raw_credits.describe(include=object).T)

Unnamed: 0,count,unique,top,freq
id,5806,5806,ts300399,1
title,5805,5751,The Gift,3
type,5806,2,MOVIE,3759
age_certification,3196,11,TV-MA,841
genres,5806,1626,['comedy'],510
production_countries,5806,449,['US'],1950
imdb_id,5362,5362,tt0075314,1


Unnamed: 0,count,unique,top,freq
id,77213,5434,tm32982,208
name,77213,53687,Shah Rukh Khan,30
character,67586,47125,Self,1667
role,77213,2,ACTOR,72690


- The most frequent value in "genres" is ['comedy'] (510 titles) 🎭 ! Nice 😹

- Titles produced in the US are the most common: ['US'] is the top value of "production_countries", with 1,950 of 5,806 titles. 
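
Note that "genres" stores each list as a string, so the count above only covers titles whose genre list is exactly ['comedy']. A minimal sketch of how to count every genre separately, assuming the strings are valid Python list literals as in the preview rows (the sample frame below is made up):

```python
import ast

import pandas as pd

# Made-up rows that mimic how raw_titles stores genres: a list written as a string.
titles = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "genres": ["['comedy']", "['crime', 'drama']", "['comedy', 'drama']"],
    }
)

# Turn each string into a real Python list, then give every genre its own row.
parsed = titles.assign(genres=titles["genres"].apply(ast.literal_eval))
genre_counts = parsed.explode("genres")["genres"].value_counts()
print(genre_counts.to_dict())
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for this kind of parsing.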

### 4 - Let's now resolve the missing values ⚙️

In [8]:
raw_credits.isnull().sum()

person_id       0
id              0
name            0
character    9627
role            0
dtype: int64

 - Since the missing values in the "character" column cannot be reliably inferred, the simplest option is to **drop** those rows.

In [9]:
raw_credits = raw_credits.dropna()

In [10]:
raw_titles.isnull().sum()

id                         0
title                      1
type                       0
release_year               0
age_certification       2610
runtime                    0
genres                     0
production_countries       0
seasons                 3759
imdb_id                  444
imdb_score               523
imdb_votes               539
dtype: int64

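- In this dataframe as well, it is preferable to delete the rows with missing values. We could instead replace them with placeholders such as "no-score" or "no-age-certification," but we don't need all of this data anyway. One caveat: "seasons" is missing for exactly 3,759 rows, which matches the number of MOVIE titles, so we set it to 0 first. Otherwise, dropping missing values would discard every movie.

In [11]:
raw_titles["seasons"] = raw_titles["seasons"].fillna(0)  # movies have no seasons
raw_titles = raw_titles.dropna()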
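
For comparison, the placeholder alternative mentioned above could look like the sketch below. The sample rows are made up and the placeholder text is only an example. The numeric "imdb_score" column is deliberately left as NaN here, because writing a string such as "no-score" into it would turn the whole column into a non-numeric one.

```python
import pandas as pd

# Made-up rows using raw_titles column names, for illustration only.
titles = pd.DataFrame(
    {
        "title": ["A", "B", None],
        "age_certification": ["R", None, "TV-MA"],
        "imdb_score": [8.3, None, 7.1],
    }
)

# Hypothetical placeholder; pick whatever your API consumers expect.
placeholders = {"age_certification": "no-age-certification"}

filled = (
    titles.fillna(placeholders)   # keep rows, label the missing certification
    .dropna(subset=["title"])     # a title is essential, so drop rows without one
)
print(filled)
```

This keeps more rows than a plain `dropna()`, at the cost of having to handle placeholders and NaN scores later in the API.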

Now that we have cleaned the datasets, let's download them, because I will be using PyCharm locally. 

In [12]:
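# index=False avoids writing the dataframe index as an extra unnamed column
raw_titles.to_csv('raw_titles.csv', index=False)
raw_credits.to_csv('raw_credits.csv', index=False)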

# create a link to download the dataframe which was saved with .to_csv method
create_download_link(filename='raw_titles.csv')

In [13]:
create_download_link(filename='raw_credits.csv')

#### Step one is done, let's go to the next 🙋‍♂️, shall we?

## II - Let's use an architecture diagram to clarify things

We will examine each component of the diagram one at a time to see what we will do next.

<center><img src="https://raw.githubusercontent.com/mohamedabidi97/data-api-kaggle/main/assets/architecture.png"></img></center>

As we can see, the diagram contains **3 major** components. Let's begin with them: 

- 1 - Interface: The client side, which may take the form of a web interface, a mobile interface, or even just a request sent from your terminal or from software like Postman.

- 2 - Server: Can be a cloud server or your local machine for testing. We will test on both, and I'm pretty sure you will like this part.

- 3 - Database: Where you can keep your data, and we'll be using MongoDB, a NoSQL database. 

- JSON: A common data format with diverse uses in electronic data interchange, including the exchange between web applications and servers.
- Logs: An application log is a file that contains information about events that have occurred within a software application. These events are logged out by the application and written to the file.
- Fetch/Store: Getting data or storing data (from/to the database)
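
As a small illustration of the JSON step, here is a sketch of how one title could travel between server and client. The values come from the Taxi Driver row in the "Titles" preview. Representing "genres" as a real list and the exact set of fields are assumptions made for this example.

```python
import json

# One record shaped like a raw_titles row (values from the Taxi Driver preview row).
record = {
    "id": "tm84618",
    "title": "Taxi Driver",
    "type": "MOVIE",
    "release_year": 1976,
    "genres": ["crime", "drama"],  # assumed to be parsed into a list beforehand
    "imdb_score": 8.3,
}

payload = json.dumps(record)    # the text the server would send over the wire
restored = json.loads(payload)  # what the client gets back after parsing
print(payload)
```

The round trip preserves strings, numbers, and lists, which is why JSON is a convenient exchange format between the interface and the server.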


## III - Make your API 

An application programming interface (**API**) is a way for two or more computer programs to **communicate** with each other. It is a type of software interface, offering a service to other pieces of software.

### 1 - FastAPI

FastAPI is a web framework for developing **RESTful APIs** in Python. It relies on Pydantic and type hints to validate, serialize, and deserialize data, and it automatically generates OpenAPI documentation.

<center><img style="height: 500px;" src="https://repository-images.githubusercontent.com/160919119/29516980-f308-11e9-9096-0836920fdae3"></img></center>


The key features are:

- Fast: Very high performance, on par with NodeJS and Go (thanks to Starlette and Pydantic). One of the fastest Python frameworks available.

- Fast to code: Increase the speed to develop features by about 200% to 300%. *
- Fewer bugs: Reduce about 40% of human (developer) induced errors. *
- Intuitive: Great editor support. Completion everywhere. Less time debugging.
- Easy: Designed to be easy to use and learn. Less time reading docs.
- Short: Minimize code duplication. Multiple features from each parameter declaration. Fewer bugs.
- Robust: Get production-ready code. With automatic interactive documentation.
- Standards-based: Based on (and fully compatible with) the open standards for APIs: OpenAPI (previously known as Swagger) and JSON Schema.

\* Estimates quoted from the FastAPI documentation, based on tests by an internal development team building production applications.


🔗 [FastAPI Documentation](https://fastapi.tiangolo.com/)

### 2 - Project Structure


Before we move into the coding, this is the structure of the API project.

    .
    ├── requirements
    ├── src
    │   ├── routes
    │   ├── datasets
    │   ├── utils
    │   ├── .env
    │   └── app.py
    ├── venv
    └── server.py


🫡 Don't worry, I will explain each folder and file individually in the subsequent steps. 

As you can see in the structure section, we must first build a virtual environment for our API project, which lives in a folder called venv. 
The virtual environment tool creates this folder inside the project directory. By convention, the folder is called venv, but you can give it a custom name too. It keeps the Python and pip executables inside the virtual environment folder.

If you are using PyCharm, the environment is created automatically when you start a project. Otherwise, you can create and activate it manually with the following steps, which depend on your operating system.

Create an environment first using virtualenv :

- Install the virtualenv package. Enter the following in your terminal :

>pip install virtualenv

- Create the virtual environment :

>virtualenv venv

- Activate the virtual environment :

For macOS/Linux :

> source venv/bin/activate

Windows :

> venv\Scripts\activate

- After finishing your work, generate a requirements.txt file from inside the activated environment using :

> pip freeze > requirements.txt

### Your virtual environment is now created and activated, and you can continue 🏃‍♀️

Let's install the FastAPI package using:

> pip install fastapi

To run a server, you will need Uvicorn as well:

> pip install uvicorn

#### ⚠️ Keep your project environment active whenever you install packages or run the API, so that everything is installed into it and not into your global Python. 
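
If you are unsure whether the environment is active, a quick check from Python is possible. This is a minimal sketch that relies on `sys.prefix` differing from `sys.base_prefix` inside an environment, which holds for the standard venv module and recent virtualenv releases. Very old virtualenv versions behaved differently, and the helper name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside an environment, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

When the environment is active, the interpreter path printed on the second line should point inside your venv folder.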

### 3 - Hello world Endpoint


app.py:

In [14]:
from fastapi import FastAPI

app = FastAPI()


@app.get("/")
def read_root():
    return {"Hello": "World"}

server.py:

```python
import uvicorn


if __name__ == '__main__':
    uvicorn.run('src.app:app', host="0.0.0.0", port=3000, reload=True)
```

## work in progress