# Information retrieval for movie recommendation

Dataset this project is based on:  
[HBO Max](https://www.kaggle.com/datasets/dgoenrique/hbo-max-movies-and-tv-shows)  


In [1]:
import os
from dotenv import load_dotenv
import pandas as pd
from tmdb_tool import search


### API key import for The Movie Database (TMDB)

You'll need to configure your API key, which the next cell reads from the `AUTH_KEY` entry of a local `.env` file. If you need help getting a key, check this link:  
[Getting Started with the API](https://developer.themoviedb.org/reference/intro/getting-started)

In [2]:

load_dotenv('.env')
API_KEY = os.getenv('AUTH_KEY')
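
`os.getenv` returns `None` when the variable is missing, so a misnamed `.env` entry would only show up later as a failed API call. A minimal sketch of an early check, assuming the key lives under `AUTH_KEY` as in the cell above; `require_env` is a hypothetical helper, not part of `tmdb_tool`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail loudly."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value


# Usage, after load_dotenv('.env'):
# API_KEY = require_env('AUTH_KEY')
```

Failing here keeps the error close to its cause instead of surfacing it in the middle of the enrichment step.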



## Files import

No secret here, mate. The dataset ships as CSV files, so there are no major problems: pandas handles the import and it's all fine.


In [3]:
dtm = pd.read_csv('../data/titles.csv')
dtm.head(2)

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,tm77588,Casablanca,MOVIE,"In Casablanca, Morocco in December 1941, a cyn...",1943,PG,102,"['drama', 'romance', 'war']",['US'],,tt0034583,8.5,577842.0,22.005,8.167
1,tm155702,The Wizard of Oz,MOVIE,Young Dorothy finds herself in a magical world...,1939,G,102,"['fantasy', 'family']",['US'],,tt0032138,8.1,406105.0,56.631,7.583
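
The preview shows two quirks of the CSV: an `Unnamed: 0` column (a saved index) and list-like columns such as `genres` stored as plain strings. A sketch of how both could be handled, using a tiny inline CSV as a stand-in for `titles.csv`; whether later steps need real Python lists is an assumption:

```python
import ast
import io

import pandas as pd

# Tiny stand-in for ../data/titles.csv (a few columns only, for illustration)
csv_text = """,id,title,genres
0,tm77588,Casablanca,"['drama', 'romance', 'war']"
1,tm155702,The Wizard of Oz,"['fantasy', 'family']"
"""

# index_col=0 reuses the saved index instead of creating "Unnamed: 0"
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# literal_eval safely parses the string "['drama', ...]" into a Python list
sample['genres'] = sample['genres'].apply(ast.literal_eval)
```

The same two arguments should carry over to the real file, for example `pd.read_csv('../data/titles.csv', index_col=0)`.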


## Data enrichment

There's more information, and more up-to-date data, available from The Movie Database, so we'll pull it in to try to improve our model's accuracy.

In [4]:
# Call the API to fetch the enrichment data
enrichment = search.search_info(dtm['title'], dtm['release_year'], dtm['type'], api_key=API_KEY)

# Set title as index for performance gain
enrichment.set_index(keys=['title'], inplace=True)

In [5]:
# Left join on title: keep every row of dtm and attach enrichment where the title matches
dtm = dtm.merge(enrichment, how='left', left_on=['title'], right_index=True)
dtm.head()
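
Joining on `title` alone can multiply rows when the enrichment table holds the same title more than once (remakes, or a movie and a show sharing a name). A minimal sketch with made-up frames, showing how the `validate` argument of `merge` would flag that; `left` and `right` are illustrative stand-ins for `dtm` and `enrichment`, and the `overview` column is hypothetical:

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({'title': ['Dune', 'Casablanca'], 'release_year': [2021, 1943]})

# Hypothetical enrichment with a duplicated title, indexed like the notebook's
right = pd.DataFrame({
    'title': ['Dune', 'Dune', 'Casablanca'],
    'overview': ['1984 film', '2021 film', 'Classic drama'],
}).set_index('title')

# Without validation the left join silently returns 3 rows for 2 titles
joined = left.merge(right, how='left', left_on='title', right_index=True)

# validate='many_to_one' raises if the right-hand keys are not unique
try:
    left.merge(right, how='left', left_on='title', right_index=True,
               validate='many_to_one')
    duplicates_detected = False
except MergeError:
    duplicates_detected = True

# One possible fix: keep only the first enrichment row per title
right_unique = right[~right.index.duplicated(keep='first')]
clean = left.merge(right_unique, how='left', left_on='title', right_index=True,
                   validate='many_to_one')
```

Keeping the first row per title is only one option; matching on title plus release year would be stricter, if the enrichment table carries a year column.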

## Data Export

In [7]:
dtm.to_parquet('../data/enriched_data.parquet.gzip', compression='gzip')