# Processed data storage

In this notebook we merge all the datasets prepared so far, which the previous notebooks saved inside the _interim_ folder.

## Imports

In [1]:
import os

import pandas as pd

from src.utils.const import DATA_DIR

### Useful paths to data

In [2]:
ROOT_DIR = os.path.join(os.getcwd(), '..')
INTERIM_DIR = os.path.join(ROOT_DIR, DATA_DIR, 'interim')
PROCESSED_DIR = os.path.join(ROOT_DIR, DATA_DIR, 'processed')

## Read all interim data

After the pre-processing carried out in the previous notebooks, we saved the intermediate results as _parquet_, an efficient, compressed columnar format that preserves the column data types.

In [3]:
movies = pd.read_parquet(
    os.path.join(INTERIM_DIR, 'movies.parquet')
)

tags = pd.read_parquet(
    os.path.join(INTERIM_DIR, 'tags.parquet')
)

ratings = pd.read_parquet(
    os.path.join(INTERIM_DIR, 'ratings.parquet')
)

tmdb = pd.read_parquet(
    os.path.join(INTERIM_DIR, 'tmdb.parquet')
)

genome = pd.read_parquet(
    os.path.join(INTERIM_DIR, 'genome.parquet')
)

## Merging all the datasets

Thanks to the preparation done on these datasets in the previous notebooks, we can now merge them with pandas' `merge()` function, joining on `movieId`. Every merge uses an _inner_ join, because we want to keep only the movies that are present in all the datasets. The one exception is the tags dataset, which is merged with a _left_ join: this also keeps the films that do not have any tags, and we later fill their __tag_count__ with 0.

In [4]:
final = (movies
         .merge(ratings, on='movieId', how='inner')
         .merge(tags, on='movieId', how='left')
         .merge(tmdb, on='movieId', how='inner')
         .merge(genome, on='movieId', how='inner'))

final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13147 entries, 0 to 13146
Columns: 1154 entries, movieId to zombies
dtypes: float32(1131), float64(1), int32(22)
memory usage: 58.0 MB
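
The difference between the _inner_ and _left_ joins used above can be seen on a toy example. The sketch below uses small made-up frames (`toy_movies`, `toy_ratings`, `toy_tags` and their values are hypothetical, not the project's data). The `validate='one_to_one'` argument is an optional extra check, not part of the original cell: it assumes each frame holds one row per `movieId` and raises an error otherwise.

```python
import pandas as pd

# Hypothetical toy frames; names and values are illustrative only.
toy_movies = pd.DataFrame({'movieId': [1, 2, 3], 'year': [1995, 1999, 2004]})
toy_ratings = pd.DataFrame({'movieId': [1, 2], 'rating_mean': [3.5, 4.0]})
toy_tags = pd.DataFrame({'movieId': [1], 'tag_count': [7]})

toy = (toy_movies
       # inner: movie 3 has no rating, so it is dropped
       .merge(toy_ratings, on='movieId', how='inner', validate='one_to_one')
       # left: movie 2 has no tags, so it is kept with tag_count = NaN
       .merge(toy_tags, on='movieId', how='left', validate='one_to_one'))

print(toy)
```

Movie 2 survives the left join with a missing `tag_count`, which is the same situation the next section handles on the real data.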


### Fill tag_count feature with 0

In [5]:
final['tag_count'].isna().sum()

160

In [6]:
final = (final
         .fillna({'tag_count': 0})   # movies without tags get a count of 0
         .drop(columns='movieId'))   # drop the join key, no longer needed after merging

final['tag_count'].isna().sum()

0
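
The `final.info()` output above lists a single `float64` column. This is most likely `tag_count`, since pandas upcasts an integer column to float when a left join introduces missing values. Once the gaps are filled, the counts could be cast back to an integer type. The sketch below shows the idea on a hypothetical toy frame; the cast to `int32` is an optional step we assume here, not something the original pipeline does.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the result of a left merge.
toy = pd.DataFrame({'movieId': [1, 2], 'tag_count': [7.0, np.nan]})

toy = (toy
       .fillna({'tag_count': 0})          # only tag_count is filled
       .astype({'tag_count': 'int32'}))   # safe once no NaN remains

print(toy.dtypes)
```

Passing a dictionary to `fillna()` restricts the fill to the named column, so missing values in any other column would stay visible instead of being silently replaced.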

In [7]:
print(f'Final dataset dimensionality: {final.shape}')

Final dataset dimensionality: (13147, 1153)


In [8]:
final.to_parquet(os.path.join(PROCESSED_DIR, 'final.parquet'))