# 03 - Dataset Preparation
To streamline training and evaluating the machine learning models we build in the following notebook, ``04-Modelling``, we use the pipelines developed in the previous notebooks to create the core `metadata_df` and `image_embeddings_df` datasets, which are saved in .pkl format.

## Setting up Colab Environment if in Colab
A key part of processing our data is reducing the dimensionality of the high-level image features obtained with the pre-trained MobileNetV2 model introduced in the previous notebook. Because the data is both high-dimensional and large, and because we have chosen UMAP as the dimensionality reduction algorithm for this project, we use the RAPIDS cuML library (https://docs.rapids.ai/api), which provides a scikit-learn-like API for machine learning algorithms implemented to run on GPU hardware. Using this library significantly reduces the computation time of the dimensionality reduction carried out towards the end of this notebook.

As I do not personally have access to a GPU, the GPU-enabled part of this notebook is run in Google's Colab notebook environment, which offers free GPU access. In the cells below, we define the functions required to set up a 25GB RAM Colab environment with the packages the code needs.

In [None]:
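def upgrade_runtime_ram():
    # Read the memory statistics (reported in kB) from /proc/meminfo
    meminfo = subprocess.getoutput('cat /proc/meminfo').split('\n')

    memory_info = {entry.split(':')[0]: int(entry.split(':')[1].replace(' kB','').strip()) for entry in meminfo}

    # Already on a high-RAM runtime: nothing to do
    if memory_info['MemTotal'] > 17000000:
        return

    # Deliberately exhaust memory so that the runtime crashes;
    # at the time of writing, Colab then offered to switch to a high-RAM runtime
    a = []
    while True:
        a.append('1')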
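
The memory check in `upgrade_runtime_ram` depends on how `/proc/meminfo` is laid out. As a minimal sketch, the parsing step can be tried on a hard-coded sample without a Colab runtime. The sample text and the helper name `parse_meminfo` are illustrative assumptions, not part of the project code.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict.

    Values are in kB where the kernel reports a unit; unitless counters
    (e.g. HugePages_Total) are kept as plain integers.
    """
    info = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(':')
        info[key.strip()] = int(value.replace('kB', '').strip())
    return info


# Hypothetical sample resembling a standard Colab runtime.
sample = """MemTotal:       13333556 kB
MemFree:        10000000 kB
HugePages_Total:       0
"""

memory_info = parse_meminfo(sample)
HIGH_RAM_THRESHOLD_KB = 17000000  # same threshold as in upgrade_runtime_ram

# The notebook only forces an upgrade when MemTotal is at or below the threshold.
needs_upgrade = memory_info['MemTotal'] <= HIGH_RAM_THRESHOLD_KB
print(needs_upgrade)
```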

In [None]:
def restart_runtime():
    os.kill(os.getpid(), 9)

In [None]:
def setup_rapids():
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    device_name = pynvml.nvmlDeviceGetName(handle)
    if (device_name != b'Tesla T4') and (device_name != b'Tesla P4') and (device_name != b'Tesla P100-PCIE-16GB'):
        print("Wrong GPU - Restarting Runtime")
        restart_runtime()


    # clone RAPIDS AI rapidsai-csp-utils scripts repo
    !git clone https://github.com/rapidsai/rapidsai-csp-utils.git

    # install RAPIDS
    !bash rapidsai-csp-utils/colab/rapids-colab.sh 0.13


    # make the RAPIDS site-packages directory importable by inserting it just before dist-packages
    dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
    sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]

    # update pyarrow & modules 
    exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())
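
The device check in `setup_rapids` compares the device name against byte strings. Depending on the installed `pynvml` version, `nvmlDeviceGetName` may return either `bytes` or `str`, so a comparison that tolerates both can be sketched as follows. The helper `is_supported_gpu` and the constant `SUPPORTED_GPUS` are hypothetical names introduced for illustration; the GPU list is the one used in the cell above.

```python
# GPU models accepted by setup_rapids in the cell above.
SUPPORTED_GPUS = {'Tesla T4', 'Tesla P4', 'Tesla P100-PCIE-16GB'}


def is_supported_gpu(device_name):
    """Return True if device_name (bytes or str) is one of SUPPORTED_GPUS."""
    if isinstance(device_name, bytes):
        device_name = device_name.decode('utf-8')
    return device_name in SUPPORTED_GPUS


print(is_supported_gpu(b'Tesla T4'))
print(is_supported_gpu('Tesla K80'))
```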

In [None]:
def setup_conda():
    if 'Miniconda3-4.5.4-Linux-x86_64.sh' not in os.listdir():
        !wget https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh && bash Miniconda3-4.5.4-Linux-x86_64.sh -bfp /usr/local

    if not ('EPFL-Capstone-Project' in os.listdir()) and (os.getcwd().split('/')[-1] != 'EPFL-Capstone-Project'):
        !git clone https://github.com/helmigsimon/EPFL-Capstone-Project  
    if 'EPFL-Capstone-Project' in os.listdir():
        os.chdir('EPFL-Capstone-Project')

    !conda env create -f environment.yml
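    # NOTE: each `!` command runs in its own subshell, so this activation does not persist in the notebook kernel
    !conda activate exts-ml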

In [None]:
def setup_drive():
    #Mounting Google Drive
    global drive
    from google.colab import drive
    drive.mount('/content/drive')

In [None]:
try:
    import sys,os,subprocess
    
    upgrade_runtime_ram()
    setup_drive()

    #Setting up PyPi Packages
    !pip install geopandas sparse-dot-topn pdpipe category-encoders
    import geopandas as gpd
    import sparse_dot_topn.sparse_dot_topn as ct
    import pdpipe as pdp
    import category_encoders

    #Setting up Conda Packages
    setup_conda()
    
    #Initializing NLTK
    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')
    
    #Setting up RAPIDS AI
    import pynvml
    setup_rapids()
    
    from cuml import UMAP
    
except ModuleNotFoundError as e:
    print(e)
    print('Not in colab environment, continuing to run locally')
    from umap import UMAP
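
The `try`/`except` block above falls back to the CPU implementation whenever any import inside the block fails. A narrower variant of the same idea, which wraps only the import expected to differ between environments, might look like the sketch below. `import_first_available` is a hypothetical helper, and the standard-library module `json` stands in for the CPU fallback so that the example runs anywhere.

```python
import importlib


def import_first_available(module_names):
    """Import and return the first module in module_names that is installed."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ModuleNotFoundError:
            continue
    raise ModuleNotFoundError('none of %s could be imported' % (module_names,))


# 'not_an_installed_gpu_library' is a deliberately missing placeholder name.
backend = import_first_available(['not_an_installed_gpu_library', 'json'])
print(backend.__name__)
```

In the notebook, this would correspond to trying `cuml` first and `umap` second, then reading `UMAP` from whichever module was returned.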

## Imports

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import pandas as pd
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split,  StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
tqdm.pandas()

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer

In [3]:
from lib.transformers import *
from lib.pipelines import *
from lib.processing import save_to_pkl, load_from_pkl
from data.util.paths import DATA_PATH

In [4]:
from category_encoders.leave_one_out import LeaveOneOutEncoder

## DataFrame Setup
### Creating the ``metadata_df`` DataFrame
To avoid re-running the processing steps every time we build a model, we generate the project's core DataFrames once, below. First, we create the ``metadata_df`` DataFrame, which combines the ``api_df`` and ``extracted_df`` DataFrames we worked with in previous notebooks. We apply the ``api_pipe`` and ``extracted_pipe`` pipelines to the respective DataFrames, merge the results on ``release_id``, and then apply the ``consolidation_pipe`` pipeline. These pipelines are built from the data transformations outlined in the previous two notebooks. The result is the ``metadata_df`` DataFrame.

In [5]:
api_df = load_from_pkl('api',DATA_PATH)
extracted_df = load_from_pkl('extracted',DATA_PATH)

In [6]:
api_df = api_pipe.fit_transform(api_df)
extracted_df = extracted_pipe.fit_transform(extracted_df)
metadata_df = api_df.merge(extracted_df,how='inner',on='release_id')
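
Because the merge is an inner join, releases that appear in only one of the two sources are dropped. A toy illustration with made-up rows (the `year` column and all values are invented for the example):

```python
import pandas as pd

# Two small stand-ins for api_df and extracted_df with partially overlapping ids.
api_toy = pd.DataFrame({'release_id': [1, 2, 3], 'year': [1959, 1964, 1970]})
extracted_toy = pd.DataFrame({'release_id': [2, 3, 4], 'market_value': [25.0, 40.0, 12.5]})

# Only release_id 2 and 3 exist in both frames, so only those rows survive.
merged = api_toy.merge(extracted_toy, how='inner', on='release_id')
print(merged)
```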

In [8]:
col_set = {
    'format': {
        'description': 'format_description_', 
        'name': 'format_name_', 
        'text': ('format_text_clean'),
        'quantity': ('format_quantity')
    },
    'geography': {
        'superregion': 'superregion_',
        'region': 'region_',
        'country': 'country_'
    },
    'timeperiod': {
        'period': 'period_',
        'era': 'era_'
    },
    'genre': 'genre_',
    'style': 'style_',
    'null': None,
    'indicator': lambda x: x.max() == 1 and x.min() == 0,
}
column_store = ColumnStore()
column_store.fit(metadata_df,col_set)
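
The `'indicator'` entry in `col_set` is a predicate rather than a column-name prefix: it is true for a column whose minimum is 0 and whose maximum is 1. `ColumnStore` lives in `lib` and its internals are not shown here, so the sketch below only demonstrates the predicate itself on a toy DataFrame, under the assumption that it is evaluated column by column. The toy column names are made up.

```python
import pandas as pd

# Same predicate as the 'indicator' entry in col_set.
is_indicator = lambda x: x.max() == 1 and x.min() == 0

toy = pd.DataFrame({
    'genre_jazz': [1, 0, 1],       # 0/1 column: passes
    'format_quantity': [1, 2, 1],  # maximum is 2: fails
    'all_ones': [1, 1, 1],         # minimum is 1: fails
})

indicator_columns = [col for col in toy.columns if is_indicator(toy[col])]
print(indicator_columns)
```

A column containing 0, 0.5 and 1 would also pass, so the predicate is only a reliable test for indicator columns when the candidates are integer-coded.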

In [15]:
consolidation_pipe = make_data_consolidation_pipe(metadata_df,column_store)
metadata_df = consolidation_pipe.fit_transform(metadata_df)

100%|██████████| 296157/296157 [00:24<00:00, 12155.53it/s]
100%|██████████| 165777/165777 [00:00<00:00, 773462.86it/s]


In [16]:
save_to_pkl(metadata_df,'metadata',DATA_PATH)

### Creating the ``image_embeddings_df`` DataFrame
As outlined in the previous notebook, we use the high-level features extracted from the cover images of the Jazz albums considered in this project, in order to recreate the full experience of a Jazz album consumer purchasing albums in the Record Store and Full Information scenarios. In what follows, we load these features and apply the UMAP dimensionality reduction algorithm to obtain a lower-dimensional representation of the visual information in the covers, as input for our prediction algorithms. We reduce the data from 1280 to 10 dimensions. We do so for two reasons: we want to control the dimensionality of the input to our prediction algorithms, and the evidence outlined in the previous notebook indicates that the visual information in the covers is not a strong predictor of Jazz album ``market_value``. Nonetheless, in keeping with the framing of our problem, we continue to include these features in the inputs used to train our prediction algorithms.

We save the results of this dimensionality reduction procedure to ``image_embeddings.pkl``, which we will then merge into our core DataFrame in the following notebook.

In [None]:
with np.load(os.path.join(DATA_PATH,'high_level_features_labelled.npz')) as data:
    high_level_feature_df = pd.concat([pd.DataFrame(data[section]) for section in ('release_id','bitmap','features')],axis=1)
    high_level_feature_df.columns = ['release_id', 'bitmap'] + ['feature_%s' % i for i in range(1,1281)]

In [None]:
from cuml import UMAP

In [None]:
scaler = StandardScaler()
umap = UMAP(n_components=10)
high_level_features_df_scaled = scaler.fit_transform(high_level_feature_df.loc[:,['feature_%s' % i for i in range(1,1281)]])


In [None]:
image_embeddings_df = umap.fit_transform(high_level_features_df_scaled)

In [None]:
image_embeddings_df = pd.concat([
      high_level_feature_df.loc[:,'release_id'],
      pd.DataFrame(
          image_embeddings_df,
          columns = ['images_umap_%s' % i for i in range(image_embeddings_df.shape[1])]
      )],
      axis=1
)
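
The scale, reduce, and relabel steps above need a GPU and the full feature file. To check the shapes and column names on a CPU, the same flow can be sketched on random data with a different reducer swapped in: the example below uses a plain PCA computed via NumPy's SVD instead of UMAP, so the embedding values are not comparable to the notebook's. All data, sizes, and `toy_` names are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n_releases, n_features, n_components = 50, 1280, 10

# Random stand-ins for the MobileNetV2 features and their release ids.
toy_features = rng.normal(size=(n_releases, n_features))
toy_release_ids = pd.Series(np.arange(1000, 1000 + n_releases), name='release_id')

# Standardize each feature to zero mean and unit variance, as StandardScaler does.
scaled = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)

# PCA stand-in for UMAP: project onto the top right-singular vectors.
_, _, vt = np.linalg.svd(scaled, full_matrices=False)
embedding = scaled @ vt[:n_components].T

# Label the embedding columns and attach the release ids, as in the cell above.
toy_embeddings_df = pd.concat([
    toy_release_ids,
    pd.DataFrame(embedding, columns=['images_umap_%s' % i for i in range(embedding.shape[1])]),
], axis=1)

print(toy_embeddings_df.shape)
```

`pd.concat(..., axis=1)` aligns rows on the index, so the ids and embeddings only line up here because both pieces carry the default 0..n-1 index.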


In [None]:
save_to_pkl(image_embeddings_df, 'image_embeddings',DATA_PATH)