# 01 - Data Processing
In this notebook, we will be loading the data necessary for this project, combining the data sources and performing initial data processing steps, in order to prepare the dataset for visualization and additional data cleaning.

In [None]:
%load_ext autoreload
%autoreload 2
import os
from tqdm import tqdm
import pickle
import numpy as np
import pandas as pd
tqdm.pandas()

In [None]:
from data.util.paths import DATA_PATH
from lib.processing import load_from_pkl, save_to_pkl
from lib.pipelines import extracted_pipe, api_pipe

## Introduction

The data for this project has been sourced from Discogs.com, the largest online marketplace for second-hand physical music. There are three components to the data that will make up our final dataset, namely:

1. ``api_df`` -> Data taken from the official Discogs.com API (https://www.discogs.com/developers)
2. ``extracted_df`` -> Data scraped from Discogs release pages
3. ``high_level_features_df`` -> High-Level Features extracted from the cover images of each Album

At this juncture, we will load, investigate and transform the first two data sources outlined above. The data was collected continuously over February and March 2020 using the scripts found in the data directory of this project. To scrape each of the three data sources, the ``main()`` function of the ``data/main.py`` file was used. Running this script yourself is not recommended, as it requires substantial time to complete and potentially an investment in a proxy service provider. A sample of all data used in this project has been made available for the EPFL Extension School Reviewers in the ``data/sample_data`` directory, including the DataFrames used as the starting point for our processing here, as well as the final composite data that we generate in notebook ``03-DatasetPrep``. For more details on how each data source was filtered and scraped, please refer to the classes and functions contained in the ``data/util/scrape.py`` module.
    
First, we will load the data extracted from the Discogs.com API, henceforth referred to as ``api_df``.

## API Data
### Loading and Overview

In [None]:
api_df = load_from_pkl('api',path=DATA_PATH)

In [None]:
api_df.head()

Above we see the data saved from querying the Discogs.com API. The data includes all available jazz albums from the API and retains the columns that were deemed most relevant and informative for estimating the market value of each record. As the ``.head()`` call on ``api_df`` shows, several columns have been saved in a binary format. The API often returned multiple categories for these columns, so they were collected in lists during scraping; because the data was originally stored in a SQL database, these lists were pickled so that they could be stored. Let us now unpickle this data to get a better understanding of the structure of ``api_df``.

In [None]:
api_df.applymap(lambda x: pickle.loads(x) if isinstance(x,bytes) else x).head()
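
The round trip behind these binary columns can be sketched with the standard library alone. The values below are made up for illustration, and ``unpickle_if_bytes`` is a hypothetical helper name that mirrors the lambda used above rather than a function from this project.

```python
import pickle

# A multi-valued field as the API might return it (hypothetical values).
styles = ["Bop", "Hard Bop"]

# Serialise the list to bytes so it fits into a single binary database column.
blob = pickle.dumps(styles)

def unpickle_if_bytes(value):
    """Restore pickled values and pass everything else through unchanged."""
    return pickle.loads(value) if isinstance(value, bytes) else value

restored = unpickle_if_bytes(blob)
untouched = unpickle_if_bytes("USA")
print(restored, untouched)
```

Note that ``pickle.loads`` can execute arbitrary code, so it should only ever be applied to data from a trusted source, such as a database we populated ourselves.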

As all columns are now human-readable, we can discuss each column and its relevance to our goal of estimating the market value of jazz albums on the basis of their meta-characteristics:
- ``release_id`` and ``master_id``
    - These columns represent the unique identifier of a specific release of a record on Discogs.com, and the unique identifier of the album irrespective of its release, respectively.
    - While ``release_id`` will be of no use to us in the estimation of album market value, it is valuable as a primary key for the joining of the data extracted from the Discogs.com API with the scraped release page and image data we will be introducing later
    - ``master_id`` will be useful in constructing linkages between entries that are the same album, but released in a different country, format, or by another label
- ``country``
    - This feature gives us an insight into which country the record was released from
    - While not immediately obvious from the snapshot above, this feature can have multiple countries for its value, such as "England & USA", which requires a more thoughtful approach than just direct one-hot encoding in order to optimally preserve the information it encodes
- ``title``
    - This feature includes both the title of the album and the name of the artist
    - It will be necessary to split this feature into two, such that connections between albums can be drawn on the basis of being authored by the same artist or group
- ``community_want`` and ``community_have``
    - These features outline the registered demand and supply, respectively, for albums on the Discogs.com platform
    - Users can register their ownership or desire of a certain album, which is then aggregated over users and recorded in these features
    - These are the first features we see which will not be taken into account in the 'Record Store Scenario', as in a physical visit to the record store, it is not possible to know exactly how many people have and want a specific album
- ``genre``, ``style`` and ``label``
    - These features outline the genres, styles and labels associated with each album release
    - As has been shown above, each of these features can hold multiple values per release, which makes standard encoding methods such as One-Hot Encoding difficult to rationalize
- ``formats``
    - Taken directly from the Discogs.com API, the formats feature is the most unruly of all, as it contains up to 4 sub-features, namely:
        - Format Description
            - Release-specific information related to its format
            - Examples
                - '10"', '78RPM', 'Album', 'Reissue'
        - Format Text
            - Additional free form notes associated with the record release
            - Examples
                - 'Paper Sleeve', 'Red Vinyl'
        - Format Name
            - The name of the format
            - Examples
                - 'CD', 'Vinyl', 'Cassette'
        - Format Quantity
            - An integer representing how many units of the format are associated with a purchase of the release
- ``thumb_url`` and ``release_url``
    - These features are unnecessary for the purposes of our analysis here, but were useful for extracting the ``extracted_df`` and ``high_level_features_df`` data from the Discogs.com platform itself

### Investigation and Transformation

Having taken an initial look at ``api_df``, we can now move on to the cleaning and transformation of this DataFrame through a pre-written pipeline, which is to be found in the ``lib/pipelines.py`` module. For the purposes of exposition, the pipeline will be replicated here step-by-step, with motivation and explanation for each step. In the following notebooks, a direct import of the pipeline will be used in lieu of a replication as below.

Below the required transformers will be imported and composed into the pre-defined pipeline structure, after which the pipeline will be explained in its entirety.

In [None]:
from sklearn.pipeline import Pipeline
from lib.transformers import (ColumnRemover, 
                              TitleSplitter, 
                              Unpickler, 
                              LabelCleanReduce, 
                              ArtistCleanReduce, 
                              DuplicateRemover, 
                              CountryEncoder, 
                              GenreEncoder, 
                              MultiValueCategoricalEncoder, 
                              FormatEncoder, 
                              FormatTextCleanReduce, 
                              TimePeriodEncoder,
                              OutlierRemover,
                              DummyGenerator)

In [None]:
clean_text_pipe = Pipeline([
    #Step 4a - Homogenize and clean the label feature
    ('label',LabelCleanReduce()),
    #Step 4b - Homogenize and clean the artist feature
    ('artist',ArtistCleanReduce())
])

column_encoding_pipe = Pipeline([
    #Step 6a - Identify regions and superregions associated with countries and appropriately encode
    #multi-country album releases
    ('country',CountryEncoder()),
    #Step 6b - Expand and encode the genre feature
    ('genre',GenreEncoder()),
    #Step 6c - Expand and encode the style feature
    ('style',MultiValueCategoricalEncoder(feature='style')),
    #Step 6d - OneHotEncode 'format_name', while maintaining the DataFrame structure
    ('encode_format_name', DummyGenerator('format_name'))
])

format_pipe = Pipeline([
    #Step 7a - Create the format_description, format_text, format_name and format_quantity features
    ('make_columns',FormatEncoder()),
    #Step 7b - Remove format_quantity outliers
    ('remove_quantity_outliers', OutlierRemover('format_quantity')),
    #Step 7c - Expand and encode the format_description feature
    ('encode_descriptions',MultiValueCategoricalEncoder('format_description')),
    #Step 7d - Homogenize and clean the format_text feature
    ('clean_format_text',FormatTextCleanReduce())
])

api_pipe = Pipeline([
    #Step 1 - Remove Unnecessary Columns
    ('remove_columns',ColumnRemover(['id','thumb_url','release_url'])),
    #Step 2 - Splitting title Feature into title and artist
    ('split_title',TitleSplitter()),
    #Step 3 - Unpickle Pickled Features
    ('unpickle',Unpickler(['genre','style','label','formats'])),
    #Step 4 - Clean Features with Multiple Categorical Values per Sample and High Categorical Variance
    ('clean_text',clean_text_pipe),
    #Step 5 - Remove Entries with Duplicate release_ids
    ('remove_duplicates',DuplicateRemover('release_id')),
    #Step 6 - Encode Categorical Features which cannot be encoded conventionally
    ('encode_columns',column_encoding_pipe),
    #Step 7 - Expand the 'format' Feature from Dictionary to Column Format
    ('format_columns',format_pipe),
    #Step 8 - Encode Jazz Periods and Eras associated with the Release Year of each Album
    ('encode_time_periods',TimePeriodEncoder())
])

### Pipeline Step Exposition
#### Step 1 - Removing Unnecessary Columns
This step is mostly self-explanatory. For the purposes of this analysis, neither ``thumb_url`` nor ``release_url`` will be relevant, and the ``id`` column is superfluous.
#### Step 2 - Splitting ``title`` Feature into ``title`` and ``artist``
As alluded to in the initial ``api_df`` overview, the original ``title`` feature contains two distinct pieces of information: the name of the performing artist or group for a given release, and the name of the release itself. With this transformer, we split the ``title`` feature on the delimiter common to this column, a hyphen ('-'), to yield individual ``artist`` and ``title`` columns. An equivalent operation is shown below on a sample of the dataset.

In [None]:
api_df['title'].head().str.split('-',n=1,expand=True).rename(columns={0:'artist',1:'title'})
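
Splitting on a bare hyphen also cuts hyphenated artist names such as 'Jean-Luc Ponty' in two. A minimal sketch of a more defensive variant follows, assuming the raw strings have an 'Artist - Title' shape with spaces around the delimiter. The sample titles are made up, and this is not necessarily how ``TitleSplitter`` is implemented.

```python
import pandas as pd

# Hypothetical titles in the assumed "Artist - Title" shape.
titles = pd.Series([
    "Miles Davis - Kind Of Blue",
    "Jean-Luc Ponty - Aurora",
    "Untitled Release",
])

# Split on the first " - " (with surrounding spaces) so that hyphens
# inside an artist name are left alone.
parts = titles.str.split(" - ", n=1, expand=True)
parts.columns = ["artist", "title"]

# Trim stray whitespace left over from the raw strings.
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```

Rows without the delimiter end up with a missing ``title``, which makes them easy to find and inspect afterwards.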

#### Step 3 - Unpickle Pickled Features
This step replicates the operation performed in the ``api_df`` overview, by unpickling ``genre``, ``style``, ``label`` and ``formats``.
#### Step 4 - Clean Features with Multiple Categorical Values per Sample and High Categorical Variance
For the features ``label`` and ``artist``, it is clear that a more sophisticated approach is necessary than just simple One-Hot Encoding, as each has a high unique cardinality, and is composed of fairly dirty textual data.

In [None]:
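# Apply only the 'split_title' and 'unpickle' steps of api_pipe, looked up by name
label_artist_columns = api_pipe.named_steps['unpickle'].fit_transform(
    api_pipe.named_steps['split_title'].fit_transform(api_df)
).loc[:,['label','artist']]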
label_artist_columns

First, let us identify the number of unique values for each of these columns.

In [None]:
label_values = label_artist_columns.loc[:,'label'].apply(lambda x: 'üü'.join(x)).str.split('üü',expand=True).stack()
artist_values = label_artist_columns.loc[:,'artist']
for values, column in zip((label_values,artist_values),('label','artist')):
    print('The api_data dataset has %s unique %s values' % (len(np.unique(values)),column))

As we can see from the output above, the order of magnitude of unique values for these columns prevents us from taking the standard approach in machine learning of One-Hot Encoding these columns, in the interest of preserving the dimensionality of our dataset. Let us now see if there is any possibility of reducing the number of unique values, by homogenizing similar values.

In [None]:
from collections import Counter
Counter(label_values).most_common(n=20)

From the above we can clearly see that labels have been entered under slightly different names in some cases, in particular for the following pairs:
- Capitol Records & Capitol Records, Inc.
- CBS & CBS Inc.

Unifying these variants would by itself greatly increase the number of connected entries.

In [None]:
Counter(artist_values).most_common(n=20)

In the case of artists, it is less obvious that such cleaning is required to obtain widespread connections between records. However, a look at all entries containing 'Miles Davis', for example, quickly shows in how many different ways Miles Davis appears in the ``artist`` column.

In [None]:
miles_davis_values = tuple(filter(lambda x: 'miles davis' in x.lower(), Counter(artist_values)))
miles_davis_values

Clearly, in order to reduce the dimensionality and increase the linkages between different jazz albums, it will also be crucial to clean the ``artist`` column. 

To achieve this, we apply the ``ArtistCleanReduce`` and ``LabelCleanReduce`` transformers, which clean and homogenize the text inputs, and match the resulting unique strings with one another using TF-IDF Vectorization with n-grams. In the case of ``LabelCleanReduce``, we pick the first unique element attributed to an entry, as it is deemed to be the most relevant label for the release. For both the ``artist`` and ``label`` columns, this results in a great reduction in the cardinality of unique entries associated with both columns, as we can see below.

In [None]:
cleaned_text = clean_text_pipe.fit_transform(label_artist_columns)

In [None]:
for column in ('label','artist'):
    print('The cleaned api_data has %s unique %s values' % (len(cleaned_text.loc[:,column].unique()),column))
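
The matching idea behind ``ArtistCleanReduce`` and ``LabelCleanReduce`` can be sketched without any third-party dependency. The snippet below is a simplified illustration, not the project's implementation: it assumes character trigrams, a smoothed IDF weighting and cosine similarity, and all function names and sample labels are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Return the character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def tfidf_vectors(names, n=3):
    """Build one L2-normalised TF-IDF vector (as a dict) per name."""
    grams = [Counter(char_ngrams(name, n)) for name in names]
    # Document frequency: in how many names does each n-gram occur?
    doc_freq = Counter(g for counts in grams for g in counts)
    total = len(names)
    vectors = []
    for counts in grams:
        # Smoothed IDF, so n-grams shared by every name keep a small weight.
        weights = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
                   for g, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({g: w / norm for g, w in weights.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())

# Hypothetical label names; the first two are spelling variants of each other.
labels = ["Capitol Records", "Capitol Records, Inc.", "Blue Note", "CBS"]
vectors = tfidf_vectors(labels)

similarity = cosine(vectors[0], vectors[1])
unrelated = cosine(vectors[0], vectors[2])
print(f"variants: {similarity:.2f}, unrelated: {unrelated:.2f}")
```

Names whose similarity exceeds a chosen threshold can then be mapped to a single canonical spelling. The threshold itself is a tuning decision that trades missed variants against false merges.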

Although we have successfully reduced the unique cardinality of ``label`` and ``artist``, these values still far exceed the threshold under which it would be acceptable to apply One-Hot Encoding. As such, we will use the Leave One Out approach to encode these features, which will allow us to eliminate this explosion of dimensionality. The approach will be explained in further detail in ``04-Modelling``.
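
As a preview, the general idea of Leave One Out encoding can be sketched in a few lines of pandas. This is a generic illustration on made-up data, assuming a numeric target column; falling back to the global mean for categories that occur only once is our own choice for this sketch.

```python
import pandas as pd

# Hypothetical toy data: a high-cardinality category and a numeric target.
df = pd.DataFrame({
    "label": ["Blue Note", "Blue Note", "Blue Note", "Verve", "Verve", "ECM"],
    "market_value": [30.0, 50.0, 40.0, 10.0, 20.0, 25.0],
})

grouped = df.groupby("label")["market_value"]
group_sum = grouped.transform("sum")
group_count = grouped.transform("count")

# Mean of the target over all *other* rows that share the same category.
loo = (group_sum - df["market_value"]) / (group_count - 1)

# Categories seen only once have no "other" rows; fall back to the global mean.
df["label_loo"] = loo.fillna(df["market_value"].mean())
print(df)
```

Each category is thus replaced by a single numeric column, regardless of how many distinct values it has. Because the encoding uses the target, it must be computed on training data only.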

In [None]:
#Namespace Cleanup
del label_artist_columns,label_values,artist_values,miles_davis_values,cleaned_text

#### Step 5 - Remove Entries with Duplicate ``release_ids``
The results obtained from querying the Discogs.com API yielded a substantial number of duplicates for the ``release_id`` feature, as we can see from the cell below. As this is a unique identifier for each album, we will remove those entries for which a duplicate ``release_id`` is identified.

In [None]:
print('The number of entries with duplicated release_id values is:',len(api_df[api_df['release_id'].duplicated()]))

#### Step 6 - Encode Categorical Features

For the ``column_encoding_pipe``, we turn our attention to the encoding of the ``country``, ``genre`` and ``style`` columns shown below.

In [None]:
categorical_features = Unpickler(['genre','style']).fit_transform(api_df.loc[:,['country','genre','style']])
categorical_features.head()

As we can see from the output above, we have two different encodings, as in the ``clean_text_pipe``. The ``country`` column is of a string type, whereas the ``genre`` and ``style`` features contain list-type values for each sample. It is important to know that the ``country`` attribute is provided by users themselves, whereas the ``genre`` and ``style`` values are set by Discogs and are thus more standardized.

As such, for the ``genre`` and ``style`` columns, we simply need to utilize dummy encoding in order to encode the information from these features. We do this by implementing and extending the ``MultiValueCategoricalEncoder`` class which encodes the features accordingly:

In [None]:
from lib.transformers import MultiValueCategoricalEncoder

In [None]:
encoded_genre_style = MultiValueCategoricalEncoder('style')\
    .fit_transform(
        GenreEncoder()\
            .fit_transform(categorical_features)
)
encoded_genre_style.head()
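
The encoding of a list-valued column can also be sketched with plain pandas. The snippet below assumes that every list element should become its own indicator column. The sample values are made up, and this is an illustration rather than the body of ``MultiValueCategoricalEncoder``.

```python
import pandas as pd

# Hypothetical list-valued column, similar in shape to ``style``.
df = pd.DataFrame({
    "release_id": [1, 2, 3],
    "style": [["Bop", "Hard Bop"], ["Cool Jazz"], ["Bop"]],
})

# One row per (release, style) pair, then one indicator column per style.
exploded = df["style"].explode()
indicators = pd.get_dummies(exploded, prefix="style", dtype=int)

# Collapse back to one row per release: 1 if the style occurs at all.
multi_hot = indicators.groupby(level=0).max()
encoded = df.drop(columns="style").join(multi_hot)
print(encoded)
```

Unlike standard One-Hot Encoding, a row can have several indicator columns set at once, which is what a multi-valued feature requires.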

To deal with the ``country`` feature, we take a different approach. As each country associated with a given sample carries information about the broader region of the globe the album comes from, we implement a custom transformer called ``CountryEncoder``. It reads the string input of the ``country`` feature, dummy encodes the countries the album was released in, and additionally adds columns for the regions and superregions the release belongs to. Here, regions refer to the UN geoscheme regions according to their M49 classification codes. Superregions refer to the following areas, which contain the UN geoscheme regions:
- Africa
- Americas
- Asia
- Europe
- Oceania
- Unknown (in case of missing information)
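
The lookup logic described above can be sketched as follows. The mapping is a tiny hand-written stand-in for the full UN M49 geoscheme, and the choice of delimiters and the function name ``encode_country`` are assumptions made for this illustration only.

```python
import re

# Illustrative lookup only; the real transformer relies on the complete
# UN M49 geoscheme, which is not reproduced here.
REGION_LOOKUP = {
    "USA": ("Northern America", "Americas"),
    "England": ("Northern Europe", "Europe"),
    "Japan": ("Eastern Asia", "Asia"),
}

def encode_country(raw):
    """Split a raw country string and collect countries, regions and superregions."""
    if not raw:
        # Missing information is mapped to the 'Unknown' superregion.
        return {"countries": set(), "regions": set(), "superregions": {"Unknown"}}
    # Multi-country values such as "England & USA" are split on '&' or ','.
    countries = {part.strip() for part in re.split(r"[&,]", raw) if part.strip()}
    regions, superregions = set(), set()
    for country in countries:
        region, superregion = REGION_LOOKUP.get(country, ("Unknown", "Unknown"))
        regions.add(region)
        superregions.add(superregion)
    return {"countries": countries, "regions": regions, "superregions": superregions}

both = encode_country("England & USA")
missing = encode_country(None)
print(both)
```

Each of the three resulting sets can then be dummy encoded, so that a release from "England & USA" activates two country columns, two region columns and two superregion columns.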

In [None]:
CountryEncoder().fit_transform(categorical_features)

In [None]:
del categorical_features, encoded_genre_style

#### Step 7 - Create and Encode Format Columns

As a next step, we must decompose the formats column such that its component features are interpretable for the machine learning algorithms we aim to build as part of this project.

In [None]:
Unpickler('formats').fit_transform(api_df)['formats']

The ``formats`` column is saved in a dictionary, which means that we must extract the value of each key of the dictionaries and create a new independent feature. We do this via the ``FormatEncoder`` transformer, which executes this for the four keys that are found throughout the feature, and creates the columns ``format_description``, ``format_text``,``format_name``, ``format_quantity``.
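
A minimal sketch of this expansion for a single entry is given below. The key names ``name``, ``qty``, ``text`` and ``descriptions``, the default quantity of 1 and the helper ``expand_format`` are assumptions made for illustration and may differ from what ``FormatEncoder`` actually does.

```python
# Hypothetical raw value; the key names are an assumption of this sketch.
raw_format = {
    "name": "Vinyl",
    "qty": "2",
    "text": "Red Vinyl",
    "descriptions": ["LP", "Album", "Reissue"],
}

def expand_format(fmt):
    """Map one format dictionary onto the four flat feature columns."""
    return {
        "format_name": fmt.get("name"),
        # Quantities may arrive as strings; default to 1 when absent (assumption).
        "format_quantity": int(fmt.get("qty") or 1),
        "format_text": fmt.get("text"),
        "format_description": list(fmt.get("descriptions") or []),
    }

row = expand_format(raw_format)
sparse_row = expand_format({"name": "CD"})
print(row)
```

Using ``dict.get`` keeps the expansion robust for entries in which some of the four keys are missing.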

In [None]:
formats = FormatEncoder().fit_transform(Unpickler('formats').fit_transform(api_df))[['format_name','format_quantity','format_text','format_description']]
formats

Of the above features, two are fairly cleanly formatted: ``format_name`` and ``format_quantity``.

In [None]:
formats['format_name'].unique()

In [None]:
np.sort(formats['format_quantity'].unique())

While the range of values in ``format_name`` shows that we require no encoding other than One-Hot encoding prior to estimation at this stage, there is a fairly long tail in the quantity of units associated with each album. To prevent these outliers from negatively impacting our predictions, we use the ``OutlierRemover`` transformer, which removes values that lie more than three standard deviations from the mean.
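
The three standard deviation rule itself is straightforward to express in pandas. The sketch below uses made-up quantities and assumes that offending rows are dropped entirely, which may differ in detail from ``OutlierRemover``.

```python
import pandas as pd

# Hypothetical quantities: mostly 1 or 2 units, plus one extreme box set.
df = pd.DataFrame({"format_quantity": [1] * 30 + [2] * 10 + [50]})

mean = df["format_quantity"].mean()
std = df["format_quantity"].std()

# Keep only rows within three standard deviations of the mean.
within_bounds = (df["format_quantity"] - mean).abs() <= 3 * std
filtered = df[within_bounds]
print(len(df), "->", len(filtered))
```

Note that the mean and standard deviation are computed with the outliers still present, so extreme values widen the very bounds they are tested against.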

In [None]:
np.sort(OutlierRemover('format_quantity').fit_transform(formats)['format_quantity'].unique())

For the ``format_text`` and ``format_description`` features, we use approaches similar to those in previous sections, as similar issues occur in the data. For ``format_text``, we extend the ``FeatureCleanReduce`` transformer to create the ``FormatTextCleanReduce`` transformer in order to reduce the feature's cardinality.

In [None]:
len(formats['format_text'].unique())

In [None]:
len(FormatTextCleanReduce().fit_transform(formats)['format_text'].unique())

As the cardinality of the feature is still fairly high, we will forgo dummy encoding at this point and use the Leave One Out encoding approach that we previously discussed employing for the ``artist`` and ``label`` features.

For the ``format_description`` feature, we observe that there is a more standardized set of values for the feature, and as such we can simply apply the ``MultiValueCategoricalEncoder`` as in previous sections for features with list datatypes, such as ``genre`` and ``style``.

In [None]:
MultiValueCategoricalEncoder('format_description').fit_transform(formats)

Finally, we will One-Hot encode the ``format_name`` feature, as its structure of singular, standardized values in the dataset makes it ideal for such an encoding procedure, as shown below.

In [None]:
formats['format_name'].unique()

We implement this One-Hot encoding using the custom ``DummyGenerator`` transformer, which we use in order to be able to apply the One-Hot encoding within our transformation pipeline, without having to forgo the DataFrame format in downstream preprocessing steps.

In [None]:
del formats

#### Step 8 - Create and Encode Time Period Columns

As an additional step of feature engineering, we will utilize the ``year`` feature to encode some information about the distinct eras taking place in the history of Jazz music. This information is obtained from the following source: https://www.preceden.com/timelines/3242-jazz-history, and encodes the following time periods and eras:
- Eras
    - Swing (1925 - 1945)
    - Modern (1940 - 1970)
    - Contemporary (1970 - 2020)
- Periods
    - Big Band (1930 - 1950)
    - Bebop (1940 - 1955)
    - Cool (1950 - 1970)
    - Fusion (1970 - 2020)
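
Because the spans listed above overlap, a single release year can belong to several eras and periods at once. A minimal sketch of such an encoding follows, assuming that both bounds are inclusive. The name ``period_fusion`` and the helper ``encode_year`` are hypothetical and simply follow the naming pattern of the other columns.

```python
# Year ranges copied from the list above; treating both bounds as
# inclusive is an assumption of this sketch.
TIME_SPANS = {
    "era_swing": (1925, 1945),
    "era_modern": (1940, 1970),
    "era_contemporary": (1970, 2020),
    "period_big_band": (1930, 1950),
    "period_bebop": (1940, 1955),
    "period_cool": (1950, 1970),
    "period_fusion": (1970, 2020),  # hypothetical column name
}

def encode_year(year):
    """Return one 0/1 flag per era and period for a release year."""
    if year is None:
        # A missing year belongs to no era or period.
        return {name: 0 for name in TIME_SPANS}
    return {name: int(start <= year <= end)
            for name, (start, end) in TIME_SPANS.items()}

flags = encode_year(1942)
print(flags)
```

A release from 1942, for instance, falls into the Swing and Modern eras as well as the Big Band and Bebop periods.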

In [None]:
TimePeriodEncoder().fit_transform(api_df)[[
    'year','era_modern','era_contemporary','era_swing',
    'period_big_band','period_bebop','period_cool'
]]

## Synthesis
All in all, we combine these steps through the ``api_pipe``, which yields the following output:

In [None]:
api_pipe.fit_transform(api_df)

## Extracted Data
### Loading and Overview

In [None]:
extracted_df = load_from_pkl('extracted',path=DATA_PATH)

In [None]:
Unpickler('track_titles').fit_transform(extracted_df)

The ``extracted_df`` DataFrame represents the data obtained by scraping the Discogs.com release pages of the albums returned by the initial querying of the API. This data is crucial for building the model that estimates the market value of jazz albums, as no price information is divulged by the API and it can only be obtained programmatically through web scraping. The release pages are also informative in that they provide platform-specific metadata, which can ultimately be used to predict market value in the previously outlined full-information scenario.

What follows is a description of each variable and its relevance to the project:
- ``release_id``
    - As for ``api_df``, the ``release_id`` is the unique identifier for each album release and will form the primary key with which we can later join our data sources
- ``market_price``
    - Records the lowest price at which a given album release is available on the Discogs.com platform
        - Releases which have no open listings on the platform have null values
- ``units_for_sale``
    - Records how many units of the album release are available on the Discogs.com platform
- ``average_rating``
    - Records the mean score out of 5 attributed to the album release by Discogs.com users
- ``rating_count``
    - Records the number of ratings that have been submitted for the album release
- ``last_sold``
    - Records the last time the album release was sold on the Discogs.com platform
- ``number_of_tracks``
    - Generated during scraping
    - Records the total number of tracks that are listed for an album release
- ``running_time``
    - Generated during scraping
    - Records the total ``running_time`` of the album in minutes
- ``lowest``, ``median``, ``highest``
    - Record the lowest, median and highest prices each album was sold for on the platform, respectively. 
- ``track_titles``
    - Records the track titles of the albums

### Transformation Pipeline
As for ``api_df``, we will replicate the transformation pipeline used to process the data scraped from Discogs.com release pages and explain each step of the pipeline in further detail.

In [None]:
from lib.transformers import (
    ColumnRemover,
    ColumnCombiner,
    DuplicateRemover,
    NullRemover,
    StandardCountEncoder,
    LastSoldEncoder
)

In [None]:
market_value_pipe =  Pipeline([
    #Step 3a - Creates the market_value feature by imputing 'median' nulls with market_price values
    ('make_market_value', ColumnCombiner('median','market_price','market_value')),
    #Step 3b - Removes samples with null values for market_value
    ('remove_nulls', NullRemover('market_value')),
    #Step 3c - Removes samples with outlier values for market_value according to 3 standard deviation rule
    ('remove_outliers', OutlierRemover('market_value'))
])

extracted_pipe = Pipeline([
    #Step 1 - Removes unnecessary columns
    ('remove_id', ColumnRemover(['id','last_sold'])),
    #Step 2 - Unpickles the track_titles features
    ('unpickle', Unpickler('track_titles')),
    #Step 3 - Creates and processes the ``market_value`` feature
    ('make_market_value',market_value_pipe),
    #Step 4 - Removes any duplicates present in the release_id feature
    ('remove_duplicates', DuplicateRemover('release_id')),
    #Step 5 - Counts the number of Jazz standards each album has
    ('count_standards',StandardCountEncoder('track_titles',DATA_PATH))
])

In comparison to the ``api_df`` DataFrame, the ``extracted_df`` DataFrame is far less unwieldy and messy, and instead requires only small adjustments to be made to the core features, as well as some additional transformations for feature engineering purposes.

#### Step 1 - Removing unnecessary features
We remove the ``last_sold`` feature because it contains an untenable number of null values. Given the wide range of dates on which the albums in question were last sold, the missing values cannot be imputed in good conscience, so the feature is dropped from the dataset.

In [None]:
extracted_df['last_sold'].isna().sum()

#### Step 2 - Unpickling ``track_titles``
The ``track_titles`` feature is stored as a list, and as such needed to be pickled before being saved to the SQL database that was used for the intermediate storing of data during scraping. As such, we unpickle it here for downstream processing.

#### Step 3 - Creating & processing the ``market_value`` feature
There are two features in the ``extracted_df`` dataset that could conceivably be used as the target feature in our machine learning models to predict the market value of a given Jazz album, namely ``market_price`` and ``median``. Without missing data, it would be preferable to use the ``median`` price as the ``market_value`` of a given record, as the value of ``market_price`` is contingent on the period in which the data was scraped and may not accurately represent the price at which the album is generally valued in the market. Furthermore, as ``market_price`` is the lowest price at which a given record is currently offered, a low value may simply reflect the poor condition of the copy on sale. Condition is not captured in our data; it is available in the separate listings view of each record, which has been left out of the scope of this project in the interest of time, as it would have required scraping the results of another 350k web requests to the Discogs.com servers. 
However, as we can see below, we do not have access to the ``median`` price for around 150,000 albums, so relying solely on this attribute would exclude these albums from our analysis. To minimize the amount of data that must be discarded, we combine the ``market_price`` and ``median`` features by filling the missing values of ``median`` with the values of ``market_price``. The rationale is that a missing value for ``median`` indicates that the album in question has never been sold on the Discogs.com platform. This could be for a number of reasons, such as the album being fairly niche, new on the market or not particularly popular among the types of users that Discogs.com attracts. As such, we can noisily impute the value of the record with the supply-side valuation of the album in order to reduce the number of samples that must be removed from the dataset.

In [None]:
extracted_df.loc[:,['market_price','median']].isna().sum()

In [None]:
market_value_features = ColumnCombiner('median','market_price','market_value').fit_transform(extracted_df).loc[:,['market_value','median','market_price']]
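
In plain pandas, the combination performed above amounts to a ``fillna`` from one column into another. The sketch below uses made-up prices and is meant as an illustration of the idea rather than the body of ``ColumnCombiner``.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: ``median`` is missing where a release has never sold.
prices = pd.DataFrame({
    "median": [12.5, np.nan, np.nan],
    "market_price": [10.0, 8.0, np.nan],
})

# Prefer the historical median; fall back to the lowest current listing.
prices["market_value"] = prices["median"].fillna(prices["market_price"])

# Rows with neither price cannot serve as training targets and are dropped.
usable = prices.dropna(subset=["market_value"])
print(usable)
```

Only rows where both source columns are missing remain null after the fill, which is why a null removal step is still needed afterwards.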

In [None]:
market_value_features.isna().sum()

As we can see from the above, post-imputation it is possible to keep around 100,000 samples in the dataset for training our machine learning models. Nevertheless, we must still remove the entries for which ``market_value`` is null, as well as those samples that have outlier values, which we achieve using the ``NullRemover`` and ``OutlierRemover`` transformers.

In [None]:
OutlierRemover('market_value').fit_transform(
    NullRemover('market_value').fit_transform(
        market_value_features
    )
).describe()

As we can see from the output above, by removing the nulls and outliers from the ``market_value`` feature, we manage to obtain a very similar distribution to the ``median`` feature we would have otherwise preferred from the original dataset.

In [None]:
del market_value_features

#### Step 4 - Removing Duplicates

As in the processing of ``api_df``, it is critical that we discard the duplicate values of ``release_id`` in our dataset.

In [None]:
print('Number of duplicates is: ',len(extracted_df)-len(DuplicateRemover('release_id').fit_transform(extracted_df)))

#### Step 5 - Count number of standards per Jazz album
In an attempt to inject some additional domain knowledge into our data, we will analyze the ``track_titles`` feature of the ``extracted_df`` DataFrame and seek to identify the number of Jazz standards that each album has on its tracklist. Jazz standards are pieces of music which are commonly covered by artists within the Jazz community. They provide a possible entry point for prospective customers, who may be more likely to recognize a certain track on a record and therefore have more incentive to buy it. Generally, audiences enjoy hearing another artist's take on an old classic, which makes this aspect of Jazz music culture a possibly informative feature for our modelling. 
We accomplish this via a method similar to the one employed in the estimators inheriting from ``FeatureCleanReduce``: we use TF-IDF with n-grams to compare the ``track_titles`` of each album with a third-party list of accepted Jazz standards (https://en.wikipedia.org/wiki/List_of_jazz_standards) and attempt to find high-probability matches.
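
The counting step can be illustrated with a deliberately simpler matcher. The sketch below swaps TF-IDF for the similarity ratio of the standard library's ``difflib``, so it shows the shape of the computation rather than the method used by ``StandardCountEncoder``. The list of standards, the cutoff and the function names are all assumptions of this example.

```python
import difflib

# Tiny illustrative subset; the notebook uses a much longer third-party list.
STANDARDS = ["Autumn Leaves", "Body and Soul", "All the Things You Are"]

def normalise(title):
    """Lower-case a title, spell out '&' and drop punctuation."""
    text = title.lower().replace("&", " and ")
    kept = "".join(ch if ch.isalnum() else " " for ch in text)
    return " ".join(kept.split())

def count_standards(track_titles, standards=STANDARDS, cutoff=0.85):
    """Count tracks whose normalised title closely matches a known standard."""
    known = [normalise(s) for s in standards]
    return sum(
        1 for title in track_titles
        if difflib.get_close_matches(normalise(title), known, n=1, cutoff=cutoff)
    )

# Hypothetical tracklist with a typo, an ampersand and a non-standard.
tracks = ["Autum Leaves", "Body & Soul", "So What", "All The Things You Are"]
n_standards = count_standards(tracks)
print(n_standards)
```

Normalising titles before comparing them absorbs most differences in case and punctuation, leaving the fuzzy matcher to deal with genuine spelling variations.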

In [None]:
standards = StandardCountEncoder(path=DATA_PATH).fit_transform(
    Unpickler('track_titles').fit_transform(extracted_df))

In [None]:
del standards

## High Level Features
### Loading and Overview
For the final component dataset that we will use to model the meta-characteristics of Jazz albums, we will utilize a learned representation of the cover image of each album and feed it as input into our final machine learning model. This representation stands in for the visual information the customer considers when evaluating a given album for purchase. 

To obtain the learned representation, we took the following steps:
1. Download the cover images for each ``release_id`` in the ``api_df`` dataset
    - This was done while scraping the data from Discogs.com for the ``extracted_df`` dataset, as the cover images are found on these pages
    - This step is enabled by the ``ImageScraper`` class found in the ``data/util/scrape.py`` module
2. Use a pre-trained neural network, MobileNetV2 in this case, from TFHub to extract the high-level features from each of the cover images
    - For the pre-trained model used in this project, we obtain a vector of 1280 high-level features representing each image as learned by the MobileNetV2 model
    - This step is enabled by the ``FeatureExtractor`` class found in the ``data/util/scrape.py`` module
    
The scraping and feature extraction have been completed in advance, and the high-level feature matrix has been saved in the ``data/data/high_level_features.npz`` file. We load the file below and display it for reference.

In [None]:
with np.load(os.path.join(DATA_PATH,'high_level_features.npz')) as file:
    high_level_features_df = pd.DataFrame(file['data'].reshape(350330,1280))
    high_level_features_df['release_id'] = file['label']
    # regex=False: strip the literal text, so '.' in '.png' cannot match any character
    high_level_features_df['release_id'] = high_level_features_df['release_id']\
        .str.replace('bitmap_','',regex=False).str.replace('.png','',regex=False)

In [None]:
high_level_features_df