# Final project

This project aims to develop a solution for the [RecSys Challenge 2018](http://www.recsyschallenge.com/2018/).

Course: EEL 410250 - Machine Learning<br>
Student: Gustavo de Paula Santos<br>
Student ID: 19100833<br>

In [1]:
import json
import os
from tqdm import tqdm
import autorootcwd
import numpy as np
import pandas as pd
import torch
from sklearn.preprocessing import OrdinalEncoder

In [2]:
# Named data_dir to avoid shadowing the built-in dir()
data_dir = '../spotify_million_playlist_dataset/data/subset_10_000'

rows = []
for file in tqdm(os.listdir(data_dir)):
    if not file.endswith('.json'):
        continue
    with open(os.path.join(data_dir, file), 'r') as f:
        data = json.load(f)
    # One row per (playlist, track) pair
    for playlist in data['playlists']:
        for track in playlist['tracks']:
            rows.append({
                'pid': playlist['pid'],
                'plist_name': playlist['name'],
                # Keep only the ID part of URIs such as 'spotify:track:<id>'
                'track_uri': track['track_uri'].split(':')[-1],
                'artist_uri': track['artist_uri'].split(':')[-1],
            })

100%|██████████| 10/10 [00:04<00:00,  2.04it/s]


In [3]:
df = pd.DataFrame(rows)
df.head()

| | pid | plist_name | track_uri | artist_uri |
|---|---|---|---|---|
| 0 | 7000 | NewNew | 3uvsVUrAaGQJCTEUR1S3Sx | 0ypTT9UqAU5sZpPo5JZmjR |
| 1 | 7000 | NewNew | 0heE5tAAaDQmnGhVDImPl2 | 0h3YCmvRJ2jqt4jFiR6nGL |
| 2 | 7000 | NewNew | 3omXshBamrREltcf24gYDC | 6VDdCwrBM4qQaGxoAyxyJC |
| 3 | 7000 | NewNew | 6TYWE19e35N7Bn5heHwyY6 | 4uNv6RD2YXwoaKgHfJZkkL |
| 4 | 7000 | NewNew | 1xznGGDReH1oQq0xzbwXa3 | 3TVXtAsR1Inumwj472S9r4 |


In [4]:
print(f"Total memory usage: {df.memory_usage(deep=True).sum()/1e6} MB")

Total memory usage: 153.83059 MB
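To see which columns account for that total, `memory_usage(deep=True)` can also be read per column instead of summed. The sketch below uses a toy frame built from three rows of the `df.head()` output above (the name `toy` is introduced here for illustration only) and suggests that, with the `object` dtype used in this notebook, string columns cost far more per row than integer columns.

```python
import pandas as pd

# Toy frame: three rows copied from the df.head() output above.
toy = pd.DataFrame({
    "pid": [7000, 7000, 7000],
    "track_uri": [
        "3uvsVUrAaGQJCTEUR1S3Sx",
        "0heE5tAAaDQmnGhVDImPl2",
        "3omXshBamrREltcf24gYDC",
    ],
})

# deep=True also counts the bytes held by the string values themselves,
# so text columns are not underestimated.
per_column = toy.memory_usage(deep=True, index=False)
print(per_column)
```

Running the same call on the full `df` would give the per-column breakdown for the real data.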


In [5]:
df.to_pickle("./data/raw/subset_10_000.pkl")

In [6]:
encoder = OrdinalEncoder()
# Replace each URI string with an ordinal code (returned as float64)
df['track_uri'] = encoder.fit_transform(df[['track_uri']])
# Save the code -> URI mapping before the encoder is refit on the next column
track_uri_mapping = {i: uri for i, uri in enumerate(encoder.categories_[0])}
df['artist_uri'] = encoder.fit_transform(df[['artist_uri']])
artist_uri_mapping = {i: uri for i, uri in enumerate(encoder.categories_[0])}
df.head()

| | pid | plist_name | track_uri | artist_uri |
|---|---|---|---|---|
| 0 | 7000 | NewNew | 85654.0 | 4495.0 |
| 1 | 7000 | NewNew | 15536.0 | 3177.0 |
| 2 | 7000 | NewNew | 83496.0 | 29903.0 |
| 3 | 7000 | NewNew | 141436.0 | 22603.0 |
| 4 | 7000 | NewNew | 42999.0 | 16055.0 |


In [9]:
print(f"Total memory usage after encoding: {df.memory_usage(deep=True).sum()/1e6} MB")

Total memory usage after encoding: 59.633757 MB


In [19]:
df["pid"] = pd.to_numeric(df["pid"], downcast="integer")
df["track_uri"] = pd.to_numeric(df["track_uri"], downcast="integer")
df["artist_uri"] = pd.to_numeric(df["artist_uri"], downcast="integer")

In [20]:
df.dtypes

pid            int16
plist_name    object
track_uri      int32
artist_uri     int32
dtype: object

In [21]:
print(f"Total memory usage after downcast: {df.memory_usage(deep=True).sum()/1e6} MB")

Total memory usage after downcast: 50.327789 MB
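With `downcast="integer"`, `pd.to_numeric` picks the smallest signed integer dtype that can hold every value, and it also converts whole-number floats to integers. The sketch below illustrates this on two toy series (the names `small` and `large` are introduced here for illustration); the values are chosen to fall on either side of the `int16` limit of 32767, which is consistent with the `int16` and `int32` dtypes shown above.

```python
import pandas as pd

small = pd.Series([7000, 7999])          # all values fit in int16
large = pd.Series([85654.0, 141436.0])   # whole-number floats above 32767

small_dtype = pd.to_numeric(small, downcast="integer").dtype
large_dtype = pd.to_numeric(large, downcast="integer").dtype
print(small_dtype, large_dtype)
```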


### Title preprocessing

Following the title preprocessing approach proposed by the team whose solution inspired this project, the preprocessing steps are:

1. Lowercasing
2. Punctuation removal
3. Stopword removal

In [27]:
import re

def normalize_name(name):
    """Lowercase a playlist name, replace punctuation with spaces, and collapse whitespace."""
    name = name.lower()
    name = re.sub(r"[.,#!$%\^\*;:{}=\_`~()@]", ' ', name)
    name = re.sub(r'\s+', ' ', name).strip()
    return name
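
The function above covers lowercasing and punctuation removal; stopword removal (step 3) does not appear in this notebook. A minimal sketch of how it could be added is shown below. The stopword set and the helper `remove_stopwords` are hypothetical, since the notebook does not specify a stopword list, and falling back to the unchanged name when every token is a stopword is a design choice made here, not something taken from the original approach.

```python
# Hypothetical stopword set for illustration; the notebook does not specify one.
STOPWORDS = {"the", "a", "an", "of", "and", "for", "my"}

def remove_stopwords(name, stopwords=STOPWORDS):
    """Drop stopword tokens from an already-normalized playlist name."""
    kept = [token for token in name.split() if token not in stopwords]
    # Keep the original name if it consists only of stopwords,
    # so that no title becomes an empty string.
    return " ".join(kept) if kept else name

print(remove_stopwords("songs for the road"))
```

It could be applied after `normalize_name`, for example with `df['plist_name'].apply(remove_stopwords)`.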

In [28]:
df['plist_name'] = df['plist_name'].apply(normalize_name)

In [29]:
print(f"Total memory usage after name normalization: {df.memory_usage(deep=True).sum()/1e6} MB")

Total memory usage after name normalization: 50.041778 MB


In [30]:
df.to_pickle("./data/processed/subset_10_000.pkl")