## Can we make decent genre feature vectors?
metal-archives already sorts bands into top-level genres: black, death, doom/stoner/sludge, etc.
From that, one can get all bands in the black metal genre, for instance.
But that loses the "atmospheric" modifier in "atmospheric black metal", for example.
So can we build a better representation, such as a word embedding, from the genre text on each band's page?
Then we could pick a band and find the bands nearest to it in genre, perhaps sorted by popularity.
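
As a rough illustration of what "nearby in genre" could mean, here is a minimal sketch that uses plain bag-of-words vectors and cosine similarity rather than a learned embedding. The band names, the sample genre strings, and the helper names (`genre_vector`, `cosine_similarity`, `nearest_bands`) are all made up for this example; nothing here comes from the database.

```python
import math
from collections import Counter

# Hypothetical sample data; the real genre strings would come from the Bands table.
band_genres = {
    'Band A': 'Atmospheric Black Metal',
    'Band B': 'Black Metal',
    'Band C': 'Technical Death Metal',
}

def genre_vector(genre_text):
    """Bag-of-words vector: lowercased word -> count."""
    return Counter(genre_text.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[word] * v[word] for word in u if word in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_bands(band, band_genres):
    """Rank all other bands by genre similarity to `band`, most similar first."""
    target = genre_vector(band_genres[band])
    scored = [(other, cosine_similarity(target, genre_vector(text)))
              for other, text in band_genres.items() if other != band]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(nearest_bands('Band A', band_genres))
```

With this toy data, 'Band B' ranks above 'Band C' for 'Band A' because it shares two words instead of one. Every word counts equally here, so a ubiquitous word like "metal" pulls unrelated bands together; that is one reason to look at tokenization and weighting more carefully.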

In [None]:
import re, collections, string, itertools
from pprint import pprint

import sqlite3 as lite

Get a list of (band_id, band, genre) tuples.

In [None]:
db_filename = 'database.db'
with lite.connect(db_filename) as connection:
    cursor = connection.cursor()
    
    genres_and_bands = cursor.execute('select band_id,band,genre from Bands').fetchall()

In [None]:
print(len(genres_and_bands))
pprint(genres_and_bands[0:10])

In [None]:
unique_genres = list({t[2] for t in genres_and_bands})
unique_genres.sort()
pprint(unique_genres[0:25])

In [None]:
len(unique_genres)

## Tokenization
The genre texts are pretty structured, but not totally consistent.  To get the modifiers right, I need to make sense of how they attach: I want to group **Technical** with **Death**, for example.

I'm seeing some patterns:
* Clearly these are comma-separated lists.
* There are (early), (later), (mid), (Later) modifiers.
* **Brutal/Technical Death Metal** should map to **Brutal Death Metal** and **Technical Death Metal**?
* **Raw Black/Viking Metal** could be **Raw Black Metal** and **Viking Metal**, or **Raw Black Metal** and **Raw Viking Metal**.  The band in this example is Aasfresser; I'm guessing the former interpretation, but I'm no expert.
* I want to remove **Metal** and **Rock**, but **Alternative Metal** shouldn't be the same as **Alternative Rock**.  So maybe it's better to keep these "0th level" genres and attach all the modifiers to them a la **Alternative Metal** and **Alternative Rock**.  But then that screws up if the genre is just **Funeral Doom**, not **Funeral Doom Metal**.  Hmm...

This is gonna piss people off, but I'm going to define supergenres = metal, rock, punk, etc.
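
To make the slash question concrete, here is a small sketch of the simplest possible rule: treat every slash group as a set of alternatives and distribute the surrounding words over them. The function name `expand_slashes` is made up for this example. Note that this rule always picks the "shared modifier" reading, so for **Raw Black/Viking Metal** it produces the second interpretation above, not the one I guessed.

```python
import itertools

def expand_slashes(genre_text):
    """Expand every slash group into all of its alternatives."""
    # 'Brutal/Technical Death Metal' -> [['Brutal', 'Technical'], ['Death'], ['Metal']]
    alternatives = [word.split('/') for word in genre_text.split()]
    # One output phrase per combination of alternatives
    return [' '.join(combo) for combo in itertools.product(*alternatives)]

print(expand_slashes('Brutal/Technical Death Metal'))
# ['Brutal Death Metal', 'Technical Death Metal']
print(expand_slashes('Raw Black/Viking Metal'))
# ['Raw Black Metal', 'Raw Viking Metal']
```

Getting **Raw Black Metal** and **Viking Metal** instead requires knowing that **Viking** can stand on its own as a genre, which is the kind of knowledge the super genre lists below try to encode.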

### What's in parens?

In [None]:
in_parens_regex = re.compile(r'\(.*?\)')
in_parens = set()
for genre in unique_genres:
    for match in re.findall(in_parens_regex, genre):
        in_parens.add(match)

#pprint(in_parens)

Probably just going to drop these.  But when tokenizing or TF-IDF-ing (if we go that route), I don't want to double-count something like Prog Death (early), Prog Death/Black Metal (later).
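
One way to avoid the double-counting, sketched under the assumption that word presence matters more than word frequency within a single band's genre text: strip the parenthesized parts, split on commas, slashes and whitespace, and keep each word once. The function name `unique_genre_words` is made up for this example.

```python
import re

paren_regex = re.compile(r'\(.*?\)')

def unique_genre_words(genre_text):
    """Return the distinct lowercased words in a genre text, ignoring (early)/(later) notes."""
    # Replace with a space so 'Metal(later)' does not fuse with a following word
    without_parens = paren_regex.sub(' ', genre_text).lower()
    words = re.split(r'[/,\s]+', without_parens)
    # The set collapses repeats such as the second 'prog' and 'death'
    return sorted({word for word in words if word})

print(unique_genre_words('Prog Death (early), Prog Death/Black Metal(later)'))
# ['black', 'death', 'metal', 'prog']
```

This throws away word order and the modifier structure, so it only addresses the counting problem, not the grouping problem.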

### Raw Black/Viking Metal

In [None]:
raw_black_viking = [t for t in genres_and_bands if 'Raw Black/Viking Metal' in t[2]]
raw_black_viking

## Tokenizer class

In [None]:
def allgrams(text):
    tokens = text.split(' ')
    return tokens

def flatten(iterable):
    return [item for subiterable in iterable for item in subiterable]

In [None]:
class GenreTokenizer(object):
    """
    Try to tokenize those pesky genre texts.
    """
    def __init__(self):
        self.whitespace_regex = re.compile(r'\s+')
        self.paren_regex = re.compile(r'\(.*?\)')
        self.split_regex = re.compile(r"""[/]+""")
        
        self.pre_split_replace = {" 'n' ": "'n'",
                                  " n' ": "'n'",
                                 }
        
        # this is stupid!  let the data speak for itself!
        self.simple_super_genres = frozenset([
                                        'acoustic',
                                        'alternative',
                                        'ambient',
                                        'aor',
                                        'classical',
                                        'crossover',
                                        'crust',
                                        'djent',
                                        'drone',
                                        'doom',
                                        'downtempo',
                                        'electronic',
                                        'electronica',
                                        'electronics',
                                        'experimental',
                                        'folk',
                                        'funk',
                                        'gothic',
                                        'grunge',
                                        'industrial',
                                        'medieval',
                                        'metal',
                                        'noise',
                                        'psybient',
                                        'psychedelic',
                                        'punk',
                                        'rock',
                                        'synth',
                                        'trance',
                                       ])
        self.compound_super_genre_prefix = frozenset(['neo', 'post', 'synth',])
        self.compound_super_genre_suffix = frozenset(['core', 'gaze', 'wave',
                                                     ])

        self.pre_split_words = frozenset(["rock 'n' roll",
                                          "death 'n' roll",
                                          "black 'n' roll",
                                          "thrash 'n' roll",
                                          'drum and bass',
                                          'a cappella',
                                          'middle eastern',
                                          'middle-eastern',
                                          'avant-garde',
                                          'j-rock',
                                          'd-beat',
                                          'new age',
                                          'nu-metal',
                                         ])

        self.drop_tokens = frozenset([None,'',
                            'influences',
                           'elements',
                           'of',
                           'and'])
        
        self.token_map = {'blackened': 'black',
                          'middle eastern': 'middle-eastern',
                          'neoclassic': 'neoclassical',
                          'operatic': 'opera',
                          'drum and bass': 'drum-and-bass',
                         }
        
        self.bad_tokens = frozenset([None,'','age',
                                    ])
        
        self.split_these_tokens = ['core', 'noise', 'grind', 'synth', 'wave',]
        
    def tokenize(self, genre_text):
        genre_text = self.normalizeWhitespace(genre_text).lower()
        genre_text = self.removeParens(genre_text)
        texts = genre_text.split(',')
        
        # Check for any remaining "bad" characters
        for text in texts:
            if '(' in text or ')' in text or ',' in text:
                print(genre_text, texts)
                raise RuntimeError('bug')
        
        # do the messy split
        tokens = [token for text in texts for token in self.split2(text)]
        
        # Check for weird tokens
        for bad_token in self.bad_tokens:
            if bad_token in tokens or any(map(lambda token: len(token)==1, tokens)):
                print(genre_text, tokens)
                raise RuntimeError('bug')
        
        return tokens
    
    def normalizeWhitespace(self, text):
        """
        Remove leading and trailing whitespaces.  Use only one space between non-whitespace characters.
        """
        return self.whitespace_regex.sub(' ', text).strip()
    
    def removeParens(self, text):
        """
        Drop the stuff in parentheses.
        """
        return self.paren_regex.sub('', text).strip()
    

    
    
    def is_super_genre(self, token):
        if token in self.simple_super_genres:
            return True
        
        for prefix in self.compound_super_genre_prefix:
            if token[0:len(prefix)] == prefix:
                return True
            
        for ending in self.compound_super_genre_suffix:
            if token[-len(ending):] == ending:
                return True

        return False
    
    def split(self, text):
        """
        Do the split.  This is gonna be gross...
        """
        tokens = []
        
        text = self.normalizeWhitespace(text)
        
        # Do some replacements before any further tokenization
        for pat,sub in self.pre_split_replace.items():
            text = text.replace(pat,sub)
        
        # Grab these special cases first and remove them from the text
        pre_split_tokens = []
        for word in self.pre_split_words:
            if word in text:
                text = text.replace(word, ' ')
                pre_split_tokens.append(word)
                
        # Deal with modifiers like 'atmospheric black/folk metal'
        # We want 'atmospheric', 'atmospheric black', 'atmospheric black metal', 'black metal',
        # and 'folk metal' as tokens.
        splits = flatten(part.split(' ') for part in re.split(r'(/)', text)) # keep the / in the list
        
        # Drop certain tokens
        splits = [token for token in splits if token not in self.drop_tokens]
        
        if splits and splits[0] == '/': # weird edge case in the data
            splits.pop(0)
        
        #print(text)
        #print(splits)
        
        # Starting from the end, find a super genre and look ahead until the next super genre.
        # Then that slice is a "structured token".  Gotta handle what the slash means too.
        # Some genre texts have "modified super_genre with super_genre influences"; those super_genre
        # influences get counted as a separate super_genre token.
        structured_tokens = []
        i = len(splits)-1
        while i >= 0:
            token = splits[i]
            
            if token == '/': # weird edge case in the data, not this algorithm
                i -= 1
                continue
                
            #print(token)
            if not self.is_super_genre(token):
                print(text)
                print(splits)
                raise RuntimeError('token={} is not a super genre'.format(token))
                #print('skipping apparent super genre {}'.format(token))
            
            iend = i
            istart = i-1
            have_slash = False
            slash_is_modifier = False
            while istart >= 0:
                if splits[istart] == '/':
                    have_slash = True
                    if istart > 0:
                        if self.is_super_genre(splits[istart-1]):
                            # slash separates super genres
                            #print('slash separates super genres')
                            structured_tokens.append(splits[istart+1:iend+1])
                            i = istart - 1
                            break
                        else:
                            # slash separates modifiers
                            #print('slash separates modifiers')
                            slash_is_modifier = True
                            istart -= 1
                            continue
                    else:
                        print(text)
                        print(splits)
                        raise RuntimeError('bug')
                else:
                    if self.is_super_genre(splits[istart]) and slash_is_modifier:
                        structured_tokens.append(splits[istart+1:iend+1])
                        i = istart
                        break
                    else:
                        istart -= 1
                
                #istart -= 1
            else:
                structured_tokens.append(splits[0:iend+1])
                i = istart
        
        print('text =', repr(text), ' -> structured_tokens =', structured_tokens)
        
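        return tokens
        # NOTE: `tokens` is still empty at the return above (structured_tokens is only printed),
        # and everything below this point is unreachable.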
        
        split_tokens = re.split(self.split_regex, text)
        
        for token in itertools.chain(pre_split_tokens, split_tokens):
            if token in self.drop_tokens:
                continue
            
            if token in self.token_map:
                token = self.token_map[token]
            
            tokens.append(token)
            
            #for base in self.split_these_tokens:
            #    if base in token:
            #        print('stuff')
        
        return tokens
    
        
    def split2(self, text):
        """
        Do the split.  This is gonna be gross...
        
        This assumes the following structure
        
        modifier modifier/modifier super_genre with modifier influences
        """
        # Do some replacements before any further tokenization
        for pat,sub in self.pre_split_replace.items():
            text = text.replace(pat,sub)
            
        # Deal with modifiers like 'atmospheric black/folk metal'
        # We want 'atmospheric', 'atmospheric black', 'atmospheric black metal', 'black metal',
        # and 'folk metal' as tokens.
        splits = flatten(re.split(r'( )', part) for part in re.split(r'(/)', text)) # keep the / and ' ' in the list
        
        # Drop certain tokens
        splits = [token for token in splits if token not in self.drop_tokens]
        
        if splits and (splits[0] == '/' or splits[0] == ' '): # weird edge case in the data
            splits.pop(0)
        if splits and (splits[-1] == '/' or splits[-1] == ' '): # weird edge case in the data
            splits.pop(-1)
        
        print('text =', text)
        print('splits =', splits)
        
        extra_modifiers = []
        if 'with' in splits:
            ind = splits.index('with')
            extra_modifiers.extend(splits[ind+1:])
            splits = splits[0:ind]
            
            extra_modifiers = [token for token in extra_modifiers if token != ' ']
            print('extra_modifiers =', extra_modifiers)
            print('splits =', splits)
        
        super_genres = frozenset(['rock', 'metal'])
        
        def splitter(tokens, splits, i):
            # Ran off the front of the list (or splits is empty): nothing left to consume.
            if i < 0:
                return

            # token is assumed to be a super genre here
            token = splits[i]
            
            # but sometimes the data entry is wrong...
            if token == ' ' or token == '/':
                print('Bug in data entry')
                splitter(tokens, splits, i-1)
                return
                
            #print("token={} isn't in super_genres set".format(token))
                
            
            if i > 0:
                if splits[i-1] == ' ':
                    # there are modifiers; collect them all
                    modifiers = [[]]
                    i1 = i-2
                    while i1 >= 0:
                        mod = splits[i1]
                        
                        if mod == ' ' or mod == '/': # bug in data entry
                            i1 -= 1
                            continue
                        
                        modifiers[-1].append(mod)
                        if i1 > 1:
                            if splits[i1-1] == ' ':
                                # compound modifier
                                pass
                                
                            elif splits[i1-1] == '/':
                                # end of current modifier
                                next_token = splits[i1-2]
                                if next_token in super_genres:
                                    print('next token is a super genre')
                                    break
                                else:
                                    modifiers.append([])
                                
                            else:
                                raise RuntimeError('bug')
                        else:
                            # Just one modifier and we're done
                            break
                        i1 -= 2
                    
                    # reverse order of modifiers to match up with text
                    modifiers = [[m for m in reversed(modifier)] for modifier in reversed(modifiers)]
                    
                    tokens.append([*modifiers, token])
                    splitter(tokens, splits, i1-2)
                    
                elif splits[i-1] == '/':
                    # token is a bare super genre
                    tokens.append(token)
                    splitter(tokens, splits, i-2)
                
                else:
                    raise RuntimeError()
            elif i == 0:
                # token is a bare super genre and we're done
                tokens.append(token)
        
        tokens = []
        splitter(tokens, splits, len(splits)-1)
        print('tokens =', tokens)
            

            
        return []
    
    
tokenizer = GenreTokenizer()
genres_tokens = [(genre_text, tokenizer.tokenize(genre_text)) for genre_text in unique_genres]
#pprint(genres_tokens[0:25])
unique_tokens = {token for tup in genres_tokens for token in tup[1]}
print(len(unique_tokens))
pprint(unique_tokens)

### Common genres and modifiers?
These are the most common words... that should allow me to pick them out manually.

In [None]:
tokenizer = GenreTokenizer()
word_counts = collections.defaultdict(int)
split_pattern = re.compile(r'[/\s]+')
for genre_text in unique_genres:
    genre_text = tokenizer.normalizeWhitespace(genre_text).lower()
    genre_text = tokenizer.removeParens(genre_text)
    texts = genre_text.split(',')

    for text in texts:
        for word in re.split(split_pattern, text):
            if not word:
                continue
            
            word_counts[word] += 1

pprint(sorted(word_counts.items(), key=lambda t: t[1]))
super_genres = frozenset([])