# Processed data storage

In this notebook we merge all the parts of the dataset together. In particular, we use the datasets built in the previous notebook:
- _movies_
- _tags_
- _ratings_
- _tmdb_
- _genome_

## Imports

In [5]:
import os

import pandas as pd

from src.utils.const import DATA_DIR

### Useful paths to data

In [6]:
ROOT_DIR = os.path.join(os.getcwd(), '..')
INTERIM_DIR = os.path.join(ROOT_DIR, DATA_DIR, 'interim')
PROCESSED_DIR = os.path.join(ROOT_DIR, DATA_DIR, 'processed')

## Import all interim data

After the preprocessing done in the previous notebook, we saved the intermediate results in the interim folder in parquet format. We now read these preprocessed datasets in order to build the final one.

In [7]:
movies = pd.read_parquet(
    os.path.join(INTERIM_DIR, 'movies.parquet')
)

tags = pd.read_parquet(
    os.path.join(INTERIM_DIR, 'tags.parquet')
)

ratings = pd.read_parquet(
    os.path.join(INTERIM_DIR, 'ratings.parquet')
)

tmdb = pd.read_parquet(
    os.path.join(INTERIM_DIR, 'tmdb.parquet')
)

genome = pd.read_parquet(
    os.path.join(INTERIM_DIR, 'genome.parquet')
)

## Merging all the datasets

Thanks to the preprocessing already done on these datasets, we can now merge them with pandas' merge function, joining on _movieId_. Every merge uses the _inner_ method, because we want to keep only the movies that are common to all the datasets, except the merge with _tags_, which uses the _left_ method. We decided to also keep the films that don't have any tags and to set their _tag_count_ to 0.

In [8]:
final = (movies
         .merge(ratings, on='movieId', how='inner')
         .merge(tags, on='movieId', how='left')
         .merge(tmdb, on='movieId', how='inner')
         .merge(genome, on='movieId', how='inner'))
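
To see how the mix of _inner_ and _left_ merges decides which movies survive, here is a minimal sketch on toy tables. The values and the column names _rating_mean_ and _budget_ are hypothetical stand-ins, not the real interim columns.

```python
import pandas as pd

# Toy stand-ins for the interim tables (hypothetical values and columns).
movies = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})
ratings = pd.DataFrame({'movieId': [1, 2, 3], 'rating_mean': [3.5, 4.0, 2.5]})
tags = pd.DataFrame({'movieId': [1], 'tag_count': [7]})
tmdb = pd.DataFrame({'movieId': [1, 2], 'budget': [100, 200]})

toy = (movies
       .merge(ratings, on='movieId', how='inner')
       .merge(tags, on='movieId', how='left')    # keeps movies without tags
       .merge(tmdb, on='movieId', how='inner'))  # drops movie 3: no tmdb row

print(toy)
```

Movie 2 has no tags but is kept with a missing _tag_count_, while movie 3 is dropped because it is absent from the toy _tmdb_ table.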

### Fill tag_count with 0

In [9]:
final = (final
         .fillna({'tag_count': 0})
         .drop(columns='movieId'))
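
Passing a dictionary to fillna fills only the listed columns, so missing values elsewhere are left untouched. A small sketch on a toy frame, where the _budget_ column is a hypothetical example of another column containing missing values:

```python
import numpy as np
import pandas as pd

# Toy frame: one movie without tags and without a budget (hypothetical data).
toy = pd.DataFrame({
    'movieId': [1, 2],
    'tag_count': [7.0, np.nan],
    'budget': [100.0, np.nan],
})

# Only tag_count is filled; budget keeps its NaN.
filled = toy.fillna({'tag_count': 0}).drop(columns='movieId')

print(filled)
```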

In [10]:
print(f'Movies dimensionality: {final.shape}')

Movies dimensionality: (13147, 1153)
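
The shape alone does not show whether a merge duplicated rows. As a hedged sketch, assuming each interim table holds exactly one row per _movieId_ (the notebook does not state this), the _validate_ parameter of merge can make pandas raise an error when that assumption fails. The toy tables below are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({'movieId': [1, 2], 'title': ['A', 'B']})
# movieId 1 appears twice, so a one-to-one merge is not possible.
dup = pd.DataFrame({'movieId': [1, 1], 'tag_count': [3, 4]})

try:
    left.merge(dup, on='movieId', how='left', validate='one_to_one')
    merge_ok = True
except pd.errors.MergeError:
    # Raised because the right-hand keys are not unique.
    merge_ok = False

print(merge_ok)  # False
```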
