# Log Ratio Analysis 
### Group 03:
- Catarina Oliveira | 20211616
- Daniel Kruk | 20211687
- Joana Rosa | 20211516
- Marcelo Junior | 20211677
- Martim Serra | 20211543

##### This notebook includes:
 - Preparation of the dataset after it was treated in the preprocessing stage for the log_ratio analysis;

 - Creating a powerful vocabulary that reflects the characteristics of each tag;

 - Exporting the vocabulary chosen;

# Imports

In [2]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

from functions import *

import json

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Martin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\Martin\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


DONE


# Data Loading

In [3]:
data = pd.read_csv(
    'data/preproc_final/train_preproc.csv', index_col='id'
).loc[:, ['lyrics', 'tag']].copy()

# Log Ratio Analysis
#### Getting some `lyrics` insights

In [6]:
n = 3000
most_common_lyrics, tags, total_words_tag, total_words, top_words = prep_data(data,
                                                                              'lyrics',
                                                                              'tag',
                                                                               n = n)

Most Common Keys:
 [('get', 417062), ('like', 286100), ('know', 264743), ('go', 221786), ('love', 180760), ('say', 165295), ('yeah', 163217), ('make', 153830), ('na', 153365), ('see', 140957), ('come', 140146), ('oh', 137772), ('time', 135974), ('one', 130919), ('feel', 122917), ('take', 121190), ('want', 117476), ('never', 115856), ('let', 111408), ('shall', 104478), ('back', 103086), ('im', 102378), ('nigga', 100956), ('think', 100605), ('fuck', 97529), ('tell', 96352), ('way', 91657), ('good', 90466), ('give', 85913), ('would', 85832), ('need', 85820), ('wan', 85639), ('u', 83184), ('life', 80847), ('baby', 79138), ('look', 78800), ('bitch', 77165), ('day', 77046), ('gon', 75174), ('man', 74880), ('could', 72729), ('shit', 72443), ('keep', 71923), ('right', 70793), ('leave', 70524), ('girl', 60763), ('thing', 59527), ('away', 58438), ('try', 58294), ('still', 58294), ('night', 57926), ('dont', 56745), ('heart', 56372), ('cause', 55465), ('call', 55111), ('find', 54884), ('live
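
`prep_data` comes from the project's `functions` module and its body is not shown in this notebook. As a rough idea of the kind of counting behind the output above, here is a minimal sketch that tallies word frequencies per tag with `Counter`. The helper name `count_words_per_tag` and the toy rows are hypothetical, and whitespace tokenisation is an assumption.

```python
from collections import Counter


def count_words_per_tag(rows):
    """Return {tag: Counter of word frequencies} from (lyrics, tag) pairs."""
    counts = {}
    for lyrics, tag in rows:
        # Assumption: the lyrics are already cleaned, so splitting on whitespace is enough.
        counts.setdefault(tag, Counter()).update(str(lyrics).split())
    return counts


# Toy rows standing in for the preprocessed lyrics (not the real dataset).
toy_rows = [
    ('love love baby', 'pop'),
    ('get money get', 'rap'),
    ('love night', 'pop'),
]

per_tag = count_words_per_tag(toy_rows)
overall = sum(per_tag.values(), Counter())  # corpus-wide frequencies
print(overall.most_common(2))
```

With the notebook's DataFrame, the pairs could be supplied as `zip(data['lyrics'], data['tag'])`.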

Before building our vocabulary from the log ratio analysis performed later in this notebook, we must define how many of the vocabulary's words are allotted to each tag:

In [7]:
# each weight is based on the percentage of rows per tag in the whole data
distribution = {
    'pop': int(n * (0.413008)),
    'rap': int(n * (0.285879)),
    'rock': int(n * (0.187692)),
    'rb': int(n * (0.045960)),
    'misc': int(n * (0.041781)),
    'country': int(n * (0.025681))
}
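
The shares above are typed in by hand. A possible alternative, sketched here under the assumption that the weights are simply each tag's fraction of rows, is to derive them from the labels directly. The function name `build_distribution` and the toy labels are hypothetical.

```python
from collections import Counter


def build_distribution(tags, n):
    """Allot each tag a share of n words proportional to its row count (truncated)."""
    counts = Counter(tags)
    total = sum(counts.values())
    # int() truncates, mirroring the hand-written dictionary in the cell above.
    return {tag: int(n * count / total) for tag, count in counts.most_common()}


# Toy labels standing in for data['tag'] (not the real dataset).
toy_tags = ['pop'] * 5 + ['rap'] * 3 + ['rock'] * 2
print(build_distribution(toy_tags, n=100))
```

On the real data this could be called as `build_distribution(data['tag'], n)`, which avoids retyping the percentages whenever the training set changes.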

In the next cell we build a dictionary called `log_values`, keyed by tag, that holds the most important words for each tag; across all tags it contains roughly `n` words.

In [8]:
log_values = log_ratio_analysis(top_words,
                                tags,
                                total_words_tag,
                                total_words,
                                distribution,
                                data,
                                label = 'tag')
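
`log_ratio_analysis` lives in the `functions` module, so its exact formula is not visible here. One common formulation, given purely as an assumption, scores a word by the log of its relative frequency inside a tag divided by its relative frequency in all other tags. The helper names, the simple additive smoothing, and the toy counts below are all hypothetical.

```python
import math
from collections import Counter


def log_ratio(word, tag, counts_per_tag, smoothing=1.0):
    """Log of (relative frequency in `tag`) / (relative frequency elsewhere)."""
    in_tag = counts_per_tag[tag]
    in_tag_total = sum(in_tag.values())
    rest_count = sum(c[word] for t, c in counts_per_tag.items() if t != tag)
    rest_total = sum(sum(c.values()) for t, c in counts_per_tag.items() if t != tag)
    # Smoothing keeps the ratio finite when a word never appears on one side.
    p_tag = (in_tag[word] + smoothing) / (in_tag_total + smoothing)
    p_rest = (rest_count + smoothing) / (rest_total + smoothing)
    return math.log(p_tag / p_rest)


def top_words_for_tag(tag, counts_per_tag, k):
    """Return the k highest-scoring words for `tag` as (word, score) pairs."""
    vocab = set().union(*counts_per_tag.values())
    scored = [(w, log_ratio(w, tag, counts_per_tag)) for w in vocab]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy counts standing in for the per-tag word frequencies (made-up numbers).
toy_counts = {
    'pop': Counter({'love': 8, 'baby': 4, 'get': 2}),
    'rap': Counter({'get': 9, 'money': 6, 'love': 1}),
}
top_pop = top_words_for_tag('pop', toy_counts, k=2)
print(top_pop)
```

A positive score means the word is over-represented in the tag, a negative score means it is more typical of the other tags. Under this reading, the per-tag sizes in `distribution` would play the role of `k`.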

Let's quickly check how many words there are for each tag.
The total will not exactly match `n`, because each tag's share in the `distribution` object is truncated to an integer by `int()`.

In [9]:
# use `total` rather than `sum` so the built-in sum() is not shadowed
total = 0
for genre in data.tag.value_counts().keys():
    print(f"{genre}: {len(log_values[genre])} \n")
    total += len(log_values[genre])

print(total)

pop: 1239 

rap: 857 

rock: 563 

rb: 137 

misc: 125 

country: 77 

2998
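
The gap between `n = 3000` and the printed total can be reproduced from the shares in `distribution` alone. The short check below uses only the numbers already defined in this notebook. The variable names `shares` and `truncated` are introduced here for illustration.

```python
n = 3000
shares = {
    'pop': 0.413008, 'rap': 0.285879, 'rock': 0.187692,
    'rb': 0.045960, 'misc': 0.041781, 'country': 0.025681,
}

# int() drops the fractional part, e.g. 3000 * 0.285879 = 857.637 becomes 857.
truncated = {tag: int(n * share) for tag, share in shares.items()}
print(truncated)
print(sum(truncated.values()))
```

The discarded fractional parts add up to about 2 words, which is why the per-tag counts sum to 2998 rather than 3000.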


Before letting this be our final vocabulary to fit into our vectorizer, let's see if there are any words that are considered important in too many tags.

In [11]:
common_keys = find_common_keys(log_values)
common_keys

{'shatter': ['pop', 'rock'],
 'cling': ['pop', 'rock'],
 'cruel': ['pop', 'rock'],
 'farewell': ['pop', 'rock'],
 'misery': ['pop', 'rock'],
 'toll': ['pop', 'rock'],
 'masquerade': ['pop', 'rock'],
 'sorrow': ['pop', 'rock'],
 'nowhere': ['pop', 'rock'],
 'reveal': ['pop', 'rock'],
 'within': ['pop', 'rock'],
 'judgement': ['pop', 'rock'],
 'instrumental': ['pop', 'rock'],
 'despair': ['pop', 'rock'],
 'crumble': ['pop', 'rock'],
 'spiral': ['pop', 'rock'],
 'burden': ['pop', 'rock'],
 'midnight': ['pop', 'rock'],
 'darkness': ['pop', 'rock'],
 'wind': ['pop', 'rock'],
 'someday': ['pop', 'rock'],
 'unfold': ['pop', 'rock'],
 'guitar': ['pop', 'rock'],
 'awaken': ['pop', 'rock'],
 'sleepless': ['pop', 'rock'],
 'silence': ['pop', 'rock'],
 'overwhelm': ['pop', 'rock'],
 'everyone': ['pop', 'rock'],
 'poison': ['pop', 'rock'],
 'mold': ['pop', 'rock'],
 'emptiness': ['pop', 'rock'],
 'spell': ['pop', 'rock'],
 'shallow': ['pop', 'rock'],
 'everythings': ['pop', 'rock'],
 'halfway': ['p
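
`find_common_keys` is also imported from `functions`. Judging only by its output, it maps each word listed under more than one tag to those tags. A minimal sketch of that behaviour, assuming each entry of `log_values[tag]` is a sequence whose first element is the word, could look like this (the name `find_shared_words` and the toy scores are hypothetical):

```python
from collections import defaultdict


def find_shared_words(words_per_tag):
    """Map each word listed under more than one tag to the tags that list it."""
    tags_by_word = defaultdict(list)
    for tag, entries in words_per_tag.items():
        for entry in entries:
            tags_by_word[entry[0]].append(tag)  # entry[0] is assumed to be the word
    return {word: tags for word, tags in tags_by_word.items() if len(tags) > 1}


# Toy stand-in for log_values: {tag: [(word, score), ...]} with made-up scores.
toy_values = {
    'pop': [('sorrow', 1.2), ('disco', 2.0)],
    'rock': [('sorrow', 1.5), ('guitar', 1.1)],
    'rap': [('money', 2.3)],
}
shared = find_shared_words(toy_values)
print(shared)
```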

In [12]:
key_frequencies = {key: len(locations) for key, locations in common_keys.items()}
sorted_key_frequencies = dict(sorted(key_frequencies.items(), key=lambda item: item[1], reverse=True))

As we can see, every word is considered important in only 1 or 2 tags, so we will allow these words to be in our final vocabulary.

In [13]:
sorted_key_frequencies

{'shatter': 2,
 'cling': 2,
 'cruel': 2,
 'farewell': 2,
 'misery': 2,
 'toll': 2,
 'masquerade': 2,
 'sorrow': 2,
 'nowhere': 2,
 'reveal': 2,
 'within': 2,
 'judgement': 2,
 'instrumental': 2,
 'despair': 2,
 'crumble': 2,
 'spiral': 2,
 'burden': 2,
 'midnight': 2,
 'darkness': 2,
 'wind': 2,
 'someday': 2,
 'unfold': 2,
 'guitar': 2,
 'awaken': 2,
 'sleepless': 2,
 'silence': 2,
 'overwhelm': 2,
 'everyone': 2,
 'poison': 2,
 'mold': 2,
 'emptiness': 2,
 'spell': 2,
 'shallow': 2,
 'everythings': 2,
 'halfway': 2,
 'fog': 2,
 'nightmare': 2,
 'eternally': 2,
 'forsake': 2,
 'sky': 2,
 'dust': 2,
 'obey': 2,
 'desperate': 2,
 'sin': 2,
 'slowly': 2,
 'cure': 2,
 'undo': 2,
 'robot': 2,
 'reel': 2,
 'drown': 2,
 'beneath': 2,
 'tide': 2,
 'sew': 2,
 'underneath': 2,
 'earth': 2,
 'collide': 2,
 'vein': 2,
 'die': 2,
 'aa': 2,
 'sun': 2,
 'forgiveness': 2,
 'thirst': 2,
 'drag': 2,
 'celebration': 2,
 'away': 2,
 'survive': 2,
 'dolla': 2,
 'fragile': 2,
 'cradle': 2,
 'parade': 2,
 '

Now, let's save our vocabulary to a separate file named `vocabulary.py`, which will contain a list (`best_words`) of the words that best represent our tags.

In [14]:
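best_words = []
for tag in log_values:
    for word in log_values[tag]:
        # the first element of each entry is the word itself
        best_words.append(word[0])

print(best_words)
# number of distinct words (a word shared by two tags is counted once)
len(set(best_words))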

['chee', 'oya', 'sleigh', 'woh', 'bruk', 'ev', 'rosie', 'waan', 'hallelujah', 'disco', 'ariana', 'woof', 'starlight', 'grande', 'aaah', 'amin', 'aah', 'ooooo', 'unkind', 'ooooh', 'bwoy', 'mistletoe', 'ry', 'oo', 'merry', 'christmas', 'darlin', 'heartbeat', 'fairytale', 'starry', 'gal', 'romance', 'claus', 'ly', 'heartbreaker', 'newborn', 'jingle', 'noel', 'la', 'whatcha', 'seh', 'eh', 'lullaby', 'fabulous', 'moonlight', 'lover', 'goodbye', 'pum', 'jo', 'wo', 'bawl', 'darling', 'raindrop', 'silhouette', 'dum', 'sunshine', 'jah', 'adore', 'overflow', 'sushi', 'oh', 'eternally', 'wonderland', 'fi', 'gyal', 'paradise', 'ba', 'someday', 'jive', 'goodnight', 'rhythm', 'ee', 'fingertip', 'tonight', 'loneliness', 'bittersweet', 'alibi', 'heartache', 'somehow', 'spooky', 'gypsy', 'sunrise', 'pre', 'overload', 'ooo', 'symphony', 'stormy', 'forevermore', 'joyful', 'oooh', 'twilight', 'surrender', 'dance', 'harmony', 'madonna', 'tu', 'lonesome', 'rainbow', 'hoo', 'caroline', 'sing', 'wah', 'dreame

2762

In [72]:
with open('vocabulary.py', 'w') as file:
    file.write('vocabulary = ' + json.dumps(best_words))
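
Since the number of distinct words (2762) is lower than the total collected across tags (2998), `best_words` contains repeats. If the list is later handed to a vectorizer that requires unique terms (scikit-learn's `CountVectorizer`, for instance, rejects duplicate terms in a list vocabulary), it may be worth deduplicating before export. The sketch below is one way to do that while preserving order. The toy list, the temporary directory, and the `vocabulary.json` file name are hypothetical.

```python
import json
import os
import tempfile

words = ['sorrow', 'disco', 'sorrow', 'guitar']  # toy list with one repeat

# dict.fromkeys keeps the first occurrence of each word and preserves order.
unique_words = list(dict.fromkeys(words))

# Write to a temporary location so the sketch does not touch project files.
path = os.path.join(tempfile.mkdtemp(), 'vocabulary.json')
with open(path, 'w', encoding='utf-8') as file:
    json.dump(unique_words, file)

# Reading the file back confirms the round trip.
with open(path, encoding='utf-8') as file:
    restored = json.load(file)

print(restored)
```

Writing plain JSON rather than a `.py` module is a design choice: the file can then be loaded with `json.load` without importing generated code.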