# Data Processing
In this notebook, I will load the data needed for this project, combine the data sources and perform the initial data processing steps. I will also carry out some exploratory data analysis to identify missing values and outliers, followed by the appropriate processing steps.

In [7]:
%load_ext autoreload
%autoreload 2
from tqdm import tqdm
import pickle
tqdm.pandas()



In [4]:
from data.util.paths import DATA_PATH
from lib.processing import load_from_pkl
from lib.pipelines import extracted_pipe, api_pipe

## Introduction

The data for this project has been sourced from Discogs.com, the largest online marketplace for second-hand physical music. There are three components to the data that will make up our final dataset, namely:
1. api_data -> Data taken from the official Discogs.com API (https://www.discogs.com/developers)
2. extracted_data -> Data scraped from Discogs release pages
3. image_embedding_data -> High-Level Features extracted from the cover images of each Album

At this juncture, we will load, investigate and transform the first two data sources outlined above. The data was collected continuously during February and March 2020 using the scripts found in the data directory of this project. All three data sources were scraped using the main() function of the data/main.py file. Running this script yourself is not recommended, as it requires investing in a paid proxy service due to the rate limiting enforced by Discogs.com. The data has been made available for the EPFL Extension School Reviewers at INSERT_DATA_SOURCE_HERE. For more details on how each data source was filtered and scraped, please refer to the classes and functions contained in the data/util/scrape.py module.

First, we will load the data extracted from the Discogs.com API, henceforth referred to as ``api_data``.

## API Data
### Loading and Overview

In [5]:
api_df = load_from_pkl('api',path=DATA_PATH)

In [6]:
api_df.head()

Unnamed: 0,id,release_id,title,year,country,genre,style,label,community_have,community_want,formats,master_id,thumb_url,release_url
0,1,11918321,Bing Crosby - Crosbyana,1934,Canada,b'\x80\x03]q\x00(X\x04\x00\x00\x00Jazzq\x01X\x...,b'\x80\x03]q\x00(X\x08\x00\x00\x00Big Bandq\x0...,b'\x80\x03]q\x00(X\x05\x00\x00\x00Deccaq\x01X\...,1,2,b'\x80\x03]q\x00}q\x01(X\x03\x00\x00\x00qtyq\x...,1354381,https://img.discogs.com/J7vwmOhWMdUJ5vYuaYZvIj...,https://api.discogs.com/releases/11918321
1,2,10550056,Bing Crosby - Crosbyana,1934,US,b'\x80\x03]q\x00(X\x04\x00\x00\x00Jazzq\x01X\x...,b'\x80\x03]q\x00(X\x08\x00\x00\x00Big Bandq\x0...,b'\x80\x03]q\x00X\x05\x00\x00\x00Deccaq\x01a.',4,4,b'\x80\x03]q\x00}q\x01(X\x03\x00\x00\x00qtyq\x...,1354381,https://img.discogs.com/aLpqYUso3yY53XDHwiqqB-...,https://api.discogs.com/releases/10550056
2,3,6910984,Tommy Dorsey And His Clambake Seven - Tommy Do...,1935,US,b'\x80\x03]q\x00X\x04\x00\x00\x00Jazzq\x01a.',b'\x80\x03]q\x00X\x05\x00\x00\x00Swingq\x01a.',b'\x80\x03]q\x00(X\r\x00\x00\x00Swing Classicq...,4,2,b'\x80\x03]q\x00}q\x01(X\x0c\x00\x00\x00descri...,0,https://img.discogs.com/FkuaZ7cqdBt5-TvS2ck0hP...,https://api.discogs.com/releases/6910984
3,4,12959431,"Paul Whiteman And His Orchestra, Bix Beiderbec...",1936,US,b'\x80\x03]q\x00X\x04\x00\x00\x00Jazzq\x01a.',b'\x80\x03]q\x00X\x05\x00\x00\x00Swingq\x01a.',b'\x80\x03]q\x00X\x06\x00\x00\x00Victorq\x01a.',3,3,b'\x80\x03]q\x00}q\x01(X\x0c\x00\x00\x00descri...,0,https://img.discogs.com/Wg31rg_x6TtOMz-jIDC-3l...,https://api.discogs.com/releases/12959431
4,5,4453491,Jimmie Lunceford And His Orchestra - For Dance...,1937,Canada,b'\x80\x03]q\x00X\x04\x00\x00\x00Jazzq\x01a.',b'\x80\x03]q\x00.',b'\x80\x03]q\x00(X\x05\x00\x00\x00Deccaq\x01X\...,5,2,b'\x80\x03]q\x00}q\x01(X\x0c\x00\x00\x00descri...,1519538,,https://api.discogs.com/releases/4453491


Above we see the data saved from querying the Discogs.com API. The data includes all available jazz albums from the API and retains the columns that were deemed most relevant and informative for estimating the market value of each record. As the output of the ``.head()`` method on the ``api_df`` DataFrame shows, several columns have been saved in a binary format. The API often returned multiple categories for these columns, so they were collected as lists during scraping. Because the data was stored in a SQL database, these lists were pickled into byte strings so that each could be stored in a single column. Let us now unpickle this data to get a better understanding of the structure of the ``api_data``.

In [13]:
api_df.applymap(lambda x: pickle.loads(x) if isinstance(x,bytes) else x).head()

Unnamed: 0,id,release_id,title,year,country,genre,style,label,community_have,community_want,formats,master_id,thumb_url,release_url
0,1,11918321,Bing Crosby - Crosbyana,1934,Canada,"[Jazz, Pop]","[Big Band, Vocal]","[Decca, The Compo Company Ltd.]",1,2,"[{'qty': '6', 'descriptions': ['10""', '78 RPM'...",1354381,https://img.discogs.com/J7vwmOhWMdUJ5vYuaYZvIj...,https://api.discogs.com/releases/11918321
1,2,10550056,Bing Crosby - Crosbyana,1934,US,"[Jazz, Pop]","[Big Band, Vocal]",[Decca],4,4,"[{'qty': '6', 'descriptions': ['10""', '78 RPM'...",1354381,https://img.discogs.com/aLpqYUso3yY53XDHwiqqB-...,https://api.discogs.com/releases/10550056
2,3,6910984,Tommy Dorsey And His Clambake Seven - Tommy Do...,1935,US,[Jazz],[Swing],"[Swing Classic, RCA Victor]",4,2,"[{'descriptions': ['10""', '78 RPM', 'Album', '...",0,https://img.discogs.com/FkuaZ7cqdBt5-TvS2ck0hP...,https://api.discogs.com/releases/6910984
3,4,12959431,"Paul Whiteman And His Orchestra, Bix Beiderbec...",1936,US,[Jazz],[Swing],[Victor],3,3,"[{'descriptions': ['10""', '78 RPM', 'Album', '...",0,https://img.discogs.com/Wg31rg_x6TtOMz-jIDC-3l...,https://api.discogs.com/releases/12959431
4,5,4453491,Jimmie Lunceford And His Orchestra - For Dance...,1937,Canada,[Jazz],[],"[Decca, Decca Records, Inc.]",5,2,"[{'descriptions': ['10""', '78 RPM', 'Album'], ...",1519538,,https://api.discogs.com/releases/4453491


As all columns are now human-readable, we can discuss each column and its relevance to our goal of estimating the market value of jazz albums on the basis of their meta-characteristics:
- ``release_id`` and ``master_id``
    - These columns represent the unique identifier of a specific release of a record on Discogs.com, and the unique identifier of the album irrespective of the release, respectively.
    - While ``release_id`` will be of no use to us in the estimation of album market value, it is valuable as a primary key for the joining of the data extracted from the Discogs.com API with the scraped release page and image data we will be introducing later
    - ``master_id`` will be useful in constructing linkages between entries that are the same album, but released in a different country, format, or by another label
- ``country``
    - This feature tells us which country the record was released in
    - While not immediately obvious from the snapshot above, this feature can have multiple countries for its value, such as "England & USA", which requires a more thoughtful approach than just direct one-hot encoding in order to optimally preserve the information it encodes
- ``title``
    - This feature includes both the title of the album and the name of the artist
    - It will be necessary to split this feature into two independent ones, such that entries can be linked to others by virtue of having the same authoring artist or group
- ``community_want`` and ``community_have``
    - These features outline the registered demand and supply, respectively, for a given album on the Discogs.com platform
    - Users can register their ownership or desire of a certain album, which is then aggregated over users and recorded in these features
    - These are the first features we see which will not be taken into account in the record store estimation scenario, as in a physical visit to the record store, it is not possible to know exactly how many people have and want a specific record
- ``genre``, ``style`` and ``label``
    - These features outline the genres, styles and labels associated with each album release
    - As has been alluded to previously, each of these features can hold multiple values, which makes standard encoding methods such as One-Hot Encoding difficult to justify
- ``formats``
    - Taken directly from the Discogs.com API, the formats feature is the most unruly of all, as it contains up to 4 sub-features, namely:
        - Format Description
            - Album release specifics
            - Examples
                - '10"', '78 RPM', 'Album', 'Reissue'
        - Format Text
            - Additional free form notes associated with the record release
            - Examples
                - 'Paper Sleeve', 'Red Vinyl'
        - Format Name
            - The name of the format
            - Examples
                - 'CD', 'Vinyl', 'Cassette'
        - Format Quantity
            - An integer representing how many units of the format are associated with a purchase of the release
- ``thumb_url`` and ``release_url``
    - These features are unnecessary for the purposes of our analysis here, but were useful in order to conduct the data extraction of the ``extracted_data`` and ``image_embedding_data`` obtained from the Discogs.com platform itself
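
To make the structure of ``formats`` more concrete, here is a minimal sketch of how one record's list of format dictionaries could be flattened into the four sub-features listed above. The keys ``qty`` and ``descriptions`` are visible in the preview; the keys ``name`` and ``text``, the output names and the helper ``flatten_formats`` are assumptions made for illustration, and the sample record is invented from the example values above. This is not the project's ``FormatEncoder``.

```python
def flatten_formats(formats):
    """Collapse a list of format dicts into four flat fields (illustrative only)."""
    names, descriptions, texts = [], [], []
    quantity = 0
    for entry in formats:
        # 'name' and 'text' are assumed key names; 'qty' and 'descriptions'
        # appear in the unpickled preview of the formats column.
        if entry.get("name"):
            names.append(entry["name"])
        descriptions.extend(entry.get("descriptions", []))
        if entry.get("text"):
            texts.append(entry["text"])
        # qty is stored as a string in the preview, e.g. '6'
        quantity += int(entry.get("qty") or 0)
    return {
        "format_name": names,
        "format_description": descriptions,
        "format_text": texts,
        "format_quantity": quantity,
    }


# Invented record built from the example values listed above
sample_formats = [
    {"name": "Vinyl", "qty": "1", "descriptions": ['10"', "Album", "Reissue"], "text": "Red Vinyl"},
]
flat = flatten_formats(sample_formats)
print(flat)
```

Keeping the descriptions as a list leaves them ready for a multi-value encoding step, while the quantity becomes a plain integer.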

### Investigation and Transformation

Having taken an initial look at ``api_data``, we can now move on to the cleaning and transformation of this DataFrame through a previously written pipeline, to be found in the lib/pipelines.py module. For the purposes of exposition, the pipeline will be replicated here step-by-step, with motivation behind each step. In the following notebooks, a direct import of the pipeline will be used in lieu of a replication as below.

Below the required transformers will be imported and composed into the pre-defined pipeline structure, after which the pipeline will be explained in its entirety.

In [50]:
from sklearn.pipeline import Pipeline
from lib.transformers import (ColumnRemover, 
                              TitleSplitter, 
                              Unpickler, 
                              LabelCleaner, 
                              ArtistCleaner, 
                              DuplicateRemover, 
                              CountryEncoder, 
                              GenreEncoder, 
                              MultiValueCategoricalEncoder, 
                              FormatEncoder, 
                              FormatTextCleaner, 
                              TimePeriodEncoder)

In [60]:
clean_text_pipe = Pipeline([
    #Step 4a - Homogenize and clean the label feature
    ('label',LabelCleaner()),
    #Step 4b - Homogenize and clean the artist feature
    ('artist',ArtistCleaner())
])

column_encoding_pipe = Pipeline([
    #Step 6a - Identify regions and superregions associated with countries and appropriately encode
    #multi-country album releases
    ('country',CountryEncoder()),
    #Step 6b - Expand and encode the genre feature
    ('genre',GenreEncoder()),
    #Step 6c - Expand and encode the style feature
    ('style',MultiValueCategoricalEncoder(feature='style'))
])

format_pipe = Pipeline([
    #Step 7a - Create the format_description, format_text, format_name and format_quantity features
    ('make_columns',FormatEncoder()),
    #Step 7b - Expand and encode the format_description feature
    ('encode_descriptions',MultiValueCategoricalEncoder('format_description')),
    #Step 7c - Homogenize and clean the format_text feature
    ('clean_format_text',FormatTextCleaner())
])

api_pipe = Pipeline([
    #Step 1 - Remove Unnecessary Columns
    ('remove_columns',ColumnRemover(['id','thumb_url','release_url'])),
    #Step 2 - Splitting title Feature into title and artist
    ('split_title',TitleSplitter()),
    #Step 3 - Unpickle Pickled Features
    ('unpickle',Unpickler(['genre','style','label','formats'])),
    #Step 4 - Clean Features with Multiple Categorical Values per Sample and High Categorical Variance
    ('clean_text',clean_text_pipe),
    #Step 5 - Remove Entries with Duplicate release_ids
    ('remove_duplicates',DuplicateRemover('release_id')),
    #Step 6 - Encode Categorical Features which cannot be encoded conventionally
    ('encode_columns',column_encoding_pipe),
    #Step 7 - Expand the 'formats' Feature from Dictionary to Column Format
    ('format_columns',format_pipe),
    #Step 8 - Encode Jazz Periods and Eras associated with the Release Year of each Album
    ('encode_time_periods',TimePeriodEncoder())
])
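
The transformers imported from lib/transformers.py are not reproduced in this notebook. As a rough sketch of how steps like these can be written so that they compose inside a scikit-learn ``Pipeline``, the two classes below follow the usual ``fit``/``transform`` convention. ``ToyColumnRemover`` and ``ToyUnpickler`` are hypothetical stand-ins written for this illustration, not the project's implementations, and the toy DataFrame reuses two rows from the preview above.

```python
import pickle

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ToyColumnRemover(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: drop the listed columns if they are present."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained
        return self

    def transform(self, X):
        return X.drop(columns=self.columns, errors="ignore")


class ToyUnpickler(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: unpickle byte values in the listed columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        for column in self.columns:
            X[column] = X[column].map(
                lambda value: pickle.loads(value) if isinstance(value, bytes) else value
            )
        return X


toy_df = pd.DataFrame({
    "id": [1, 2],
    "release_id": [11918321, 10550056],
    "genre": [pickle.dumps(["Jazz", "Pop"]), pickle.dumps(["Jazz", "Pop"])],
})

toy_pipe = Pipeline([
    ("remove_columns", ToyColumnRemover(["id"])),
    ("unpickle", ToyUnpickler(["genre"])),
])
toy_clean = toy_pipe.fit_transform(toy_df)
print(toy_clean)
```

Because every step takes a DataFrame and returns a DataFrame, steps can be nested into sub-pipelines in the same way as ``clean_text_pipe`` and ``format_pipe`` above.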

### Pipeline Step Exposition
#### Step 1 - Removing Unnecessary Columns
This step is mostly self-explanatory. For the purposes of this analysis, neither ``thumb_url`` nor ``release_url`` will be relevant, and the ``id`` column is superfluous.
#### Step 2 - Splitting ``title`` Feature into ``title`` and ``artist``
As was alluded to in the initial ``api_data`` overview, the original ``title`` feature clearly contains two distinct pieces of information, namely the name of the performing artist or group for a given release, as well as the name of said release. As such, with this transformer, we split the title feature according to the common delimiter in this column, namely a hyphen ('-') in order to yield an individual ``title`` and ``artist`` column. An equivalent operation is shown below on a sample of the dataset.

In [59]:
api_df['title'].head().str.split('-',n=1,expand=True).rename(columns={0:'artist',1:'title'})

Unnamed: 0,artist,title
0,Bing Crosby,Crosbyana
1,Bing Crosby,Crosbyana
2,Tommy Dorsey And His Clambake Seven,Tommy Dorsey And His Clambake Seven
3,"Paul Whiteman And His Orchestra, Bix Beiderbec...",Dedicated To The Memory Of Bix Beiderbecke
4,Jimmie Lunceford And His Orchestra,For Dancers Only
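
The split above breaks on the first bare hyphen, so it would cut an artist name that itself contains a hyphen, and it leaves the surrounding spaces attached to both parts. A possible variation, assuming every entry follows the "Artist - Title" pattern seen in the sample, splits on the spaced delimiter and strips any remaining whitespace. The third title below is invented for illustration, and whether ``TitleSplitter`` treats this case the same way is not shown in this notebook.

```python
import pandas as pd

titles = pd.Series([
    "Bing Crosby - Crosbyana",
    "Jimmie Lunceford And His Orchestra - For Dancers Only",
    "Some Hyphen-Named Trio - Some Album",  # invented example
])

# Split once on the spaced delimiter so hyphens inside names are preserved
split_titles = (
    titles.str.split(" - ", n=1, expand=True, regex=False)
    .rename(columns={0: "artist", 1: "title"})
    .apply(lambda column: column.str.strip())
)
print(split_titles)
```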


#### Step 3 - Unpickle Pickled Features
This step replicates the operation performed in the ``api_data`` overview by unpickling ``genre``, ``style``, ``label`` and ``formats``.
#### Step 4 - Clean Features with Multiple Categorical Values per Sample and High Categorical Variance
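For the features ``label`` and ``artist``, simple One-Hot Encoding is not sufficient. As the name of this step indicates, a single sample can carry several values and the number of distinct values is very large, so these features are first homogenized and cleaned by ``LabelCleaner`` and ``ArtistCleaner``.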
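
For comparison, the sketch below shows what a direct multi-hot encoding of ``label`` looks like on the first three label lists from the unpickled preview. It only illustrates the naive baseline and is not what ``LabelCleaner`` does: every distinct string becomes its own column, so the column count grows with each new spelling of a label.

```python
import pandas as pd

# Label lists taken from the first three rows of the unpickled preview
labels = pd.Series(
    [["Decca", "The Compo Company Ltd."], ["Decca"], ["Swing Classic", "RCA Victor"]],
    name="label",
)

# One row per (sample, label) pair, then one indicator column per distinct label,
# collapsed back to one row per sample
multi_hot = pd.get_dummies(labels.explode()).groupby(level=0).max().astype(int)
print(multi_hot)
```

Three samples already produce four columns here, which hints at how quickly this representation widens on the full dataset.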

#### Step 5 - Remove Entries with Duplicate ``release_id`` Values
Querying the Discogs.com API yielded a substantial number of duplicates for the ``release_id`` feature, as we can see from the cell below. As this is meant to be a unique identifier for each release, we will remove those entries for which a duplicate ``release_id`` is identified.

In [66]:
print('The number of entries with duplicated release_id values is:',len(api_df[api_df['release_id'].duplicated()]))

The number of entries with duplicated release_id values is: 98334
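
Note that ``duplicated()`` with its default ``keep='first'`` flags every repeat after the first occurrence, so the count above refers to surplus rows rather than to all rows that share an identifier. The toy example below, using invented ``release_id`` values, shows the difference and one way to drop the surplus rows with plain pandas. It is a sketch of the idea, not necessarily how ``DuplicateRemover`` is implemented.

```python
import pandas as pd

# Invented identifiers: 101 appears three times, 303 twice
toy = pd.DataFrame({
    "release_id": [101, 101, 101, 202, 303, 303],
    "title": list("abcdef"),
})

# Repeats after the first occurrence (what the cell above counts)
surplus_rows = int(toy["release_id"].duplicated().sum())

# Every row whose release_id occurs more than once
rows_involved = int(toy["release_id"].duplicated(keep=False).sum())

# Keep the first row for each release_id
deduplicated = toy.drop_duplicates(subset="release_id", keep="first")

print(surplus_rows, rows_involved, len(deduplicated))
```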



In [1]:
extracted_df = load_from_pkl('extracted',path=DATA_PATH)


# Transforming

In [None]:
extracted_df = extracted_pipe.fit_transform(extracted_df)
extracted_df.describe()

In [None]:
api_df = api_pipe.fit_transform(api_df)
api_df.describe()