# Genome dataset

The tag genome is a data structure that contains tag relevance scores for movies. It is a dense matrix: each movie in the genome has a value for every tag in the genome. We added the two genome datasets (_genome-scores_ and _genome-tags_) at a later stage because the model's results were not very satisfactory.

### Imports

In [22]:
import os

import pandas as pd

from src.utils.const import DATA_DIR

### Useful path to data

In [23]:
ROOT_DIR = os.path.join(os.getcwd(), '..')
RAW_DIR = os.path.join(ROOT_DIR, DATA_DIR, 'raw')

## Data Acquisition

We assume that the notebooks are explored in order, so these two datasets should already be stored inside the raw folder.
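
Since the rest of the notebook relies on that assumption, it can be worth checking it explicitly. The following is only a minimal sketch: the helper name `missing_raw_files` is hypothetical (it is not part of the project), and the demonstration runs on a temporary folder rather than on the real raw folder.

```python
import os
import tempfile

# File names taken from the sections below.
EXPECTED_FILES = ('genome-scores.csv', 'genome-tags.csv')


def missing_raw_files(raw_dir, filenames=EXPECTED_FILES):
    """Return the expected files that are not present in raw_dir (hypothetical helper)."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(raw_dir, name))]


# Demonstration on a temporary folder that contains only one of the two files.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, 'genome-tags.csv'), 'w').close()
    demo_missing = missing_raw_files(demo_dir)

print(demo_missing)
```

In the notebook itself, the same helper could be called as `missing_raw_files(RAW_DIR)`; an empty list would mean that both files are in place.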

## Data Pre-processing

To use these datasets, we need to build a new `DataFrame` in which each sample (row) holds the relevance of every __tag__ for a given __movieId__. To do so, we first read _genome-scores_ and _genome-tags_.

### genome-scores.csv

In [24]:
# Read
genome_scores = pd.read_csv(
    os.path.join(RAW_DIR, 'genome-scores.csv'),
    encoding='utf-8',
    dtype={'movieId': 'int32', 'tagId': 'int32', 'relevance': 'float32'}
)

genome_scores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14862528 entries, 0 to 14862527
Data columns (total 3 columns):
 #   Column     Dtype  
---  ------     -----  
 0   movieId    int32  
 1   tagId      int32  
 2   relevance  float32
dtypes: float32(1), int32(2)
memory usage: 170.1 MB


### genome-tags.csv

In [25]:
# Read
genome_tags = pd.read_csv(
    os.path.join(RAW_DIR, 'genome-tags.csv'),
    encoding='utf-8',
    dtype={'tagId': 'int32', 'tag': 'string'}
)

genome_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128 entries, 0 to 1127
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tagId   1128 non-null   int32 
 1   tag     1128 non-null   string
dtypes: int32(1), string(1)
memory usage: 13.3 KB


### Merge

The next step performs a left join of the two datasets on __tagId__, so that every score row keeps its tag name.

In [26]:
tags_relevance = genome_scores.merge(genome_tags, on='tagId', how='left')

In [27]:
tags_relevance.head()

|   | movieId | tagId | relevance | tag |
|---|---|---|---|---|
| 0 | 1 | 1 | 0.029 | 007 |
| 1 | 1 | 2 | 0.02375 | 007 (series) |
| 2 | 1 | 3 | 0.05425 | 18th century |
| 3 | 1 | 4 | 0.06875 | 1920s |
| 4 | 1 | 5 | 0.16 | 1930s |
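
A left join keeps every score row, so two things are worth verifying: that the join did not duplicate rows, and that every __tagId__ found a tag name. The sketch below illustrates this on toy data (the values are made up for the example, and the `validate` argument is an optional safeguard, not something the notebook uses).

```python
import pandas as pd

# Toy stand-ins for genome_scores and genome_tags (illustrative values only).
scores = pd.DataFrame({
    'movieId': [1, 1, 2, 2],
    'tagId': [1, 2, 1, 2],
    'relevance': [0.029, 0.024, 0.036, 0.036],
})
tags = pd.DataFrame({'tagId': [1, 2], 'tag': ['007', '007 (series)']})

# validate='many_to_one' makes pandas raise if tagId is not unique in `tags`,
# which would otherwise silently duplicate score rows.
merged = scores.merge(tags, on='tagId', how='left', validate='many_to_one')

# With a left join, a score whose tagId is unknown would get a missing tag.
unmatched = int(merged['tag'].isna().sum())

print(len(merged) == len(scores), unmatched)
```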


The `pivot()` function gives us exactly the layout we wanted: one row per __movieId__, with one column per __tag__ holding its relevance.

In [28]:
tags_relevance = (tags_relevance
                  .pivot(index='movieId', columns='tag', values='relevance')
                  .reset_index()
                  .astype({'movieId': 'int32'}))

In [29]:
tags_relevance.head()

tag,movieId,007,007 (series),18th century,1920s,1930s,1950s,1960s,1970s,1980s,...,world politics,world war i,world war ii,writer's life,writers,writing,wuxia,wwii,zombie,zombies
0,1,0.029,0.02375,0.05425,0.06875,0.16,0.19525,0.076,0.252,0.2275,...,0.03775,0.0225,0.04075,0.03175,0.1295,0.0455,0.02,0.0385,0.09125,0.02225
1,2,0.03625,0.03625,0.08275,0.08175,0.102,0.069,0.05775,0.101,0.08225,...,0.04775,0.0205,0.0165,0.0245,0.1305,0.027,0.01825,0.01225,0.09925,0.0185
2,3,0.0415,0.0495,0.03,0.09525,0.04525,0.05925,0.04,0.1415,0.04075,...,0.058,0.02375,0.0355,0.02125,0.12775,0.0325,0.01625,0.02125,0.09525,0.0175
3,4,0.0335,0.03675,0.04275,0.02625,0.0525,0.03025,0.02425,0.07475,0.0375,...,0.049,0.03275,0.02125,0.03675,0.15925,0.05225,0.015,0.016,0.09175,0.015
4,5,0.0405,0.05175,0.036,0.04625,0.055,0.08,0.0215,0.07375,0.02825,...,0.05375,0.02625,0.0205,0.02125,0.17725,0.0205,0.015,0.0155,0.08875,0.01575
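
To make the reshaping easier to follow, here is a small sketch of the same long-to-wide pivot on toy data (the values are illustrative). It also shows that `pivot()` refuses to reshape when a (__movieId__, __tag__) pair appears more than once, which is why it only works on data with a single score per pair.

```python
import pandas as pd

# Long format: one row per (movieId, tag) pair, as after the merge.
long_df = pd.DataFrame({
    'movieId': [1, 1, 2, 2],
    'tag': ['007', 'zombie', '007', 'zombie'],
    'relevance': [0.029, 0.091, 0.036, 0.099],
})

# Wide format: one row per movieId, one column per tag.
wide_df = (long_df
           .pivot(index='movieId', columns='tag', values='relevance')
           .reset_index())
wide_df.columns.name = None  # drop the leftover 'tag' label on the column axis

print(wide_df.shape)

# A duplicated (movieId, tag) pair makes pivot() raise a ValueError.
with_duplicate = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)
try:
    with_duplicate.pivot(index='movieId', columns='tag', values='relevance')
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(duplicate_rejected)
```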


In [30]:
tags_relevance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13176 entries, 0 to 13175
Columns: 1129 entries, movieId to zombies
dtypes: float32(1128), int32(1)
memory usage: 56.7 MB


In [31]:
print(f'Merged genomes dimensionality: {tags_relevance.shape}')

Merged genomes dimensionality: (13176, 1129)
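
These dimensions are consistent with the genome being a dense matrix. As a quick sanity check, using only the numbers printed by the cells above, the number of score rows should equal the number of movies times the number of tags:

```python
# Figures reported by the .info() and shape outputs above.
n_score_rows = 14_862_528   # rows in genome-scores.csv
n_tags = 1_128              # rows in genome-tags.csv
n_movies, n_columns = 13_176, 1_129  # shape of the pivoted DataFrame

# Dense matrix: exactly one score for every (movie, tag) pair.
is_dense = n_movies * n_tags == n_score_rows

# The pivoted frame has one column per tag plus the movieId column.
has_expected_columns = n_columns == n_tags + 1

print(is_dense, has_expected_columns)
```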


In the end, this dataset has fewer rows (13,176 movies) than the movies dataset. In the next notebook, we will drop all the samples that have no entry in `tags_relevance`.
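
One possible way to perform that filtering is an inner join on __movieId__. The sketch below is only an assumption about how it could be done, not necessarily what the next notebook does: the `movies` DataFrame, its `title` column, and all the values are hypothetical.

```python
import pandas as pd

# Hypothetical movies dataset: movie 2 has no genome scores.
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['movie A', 'movie B', 'movie C'],
})

# Toy stand-in for tags_relevance with a single tag column.
tags_relevance_demo = pd.DataFrame({
    'movieId': [1, 3],
    '007': [0.029, 0.0415],
})

# An inner join keeps only the movies that appear in both DataFrames.
movies_with_genome = movies.merge(tags_relevance_demo, on='movieId', how='inner')

print(movies_with_genome['movieId'].tolist())
```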