# Music name generator
Well, here we are. I'm going to find THE best music name, thanks to machine learning.

In [17]:
import numpy as np 
import pandas as pd 
import string
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/home/goznalo/Programming/Python/musicnames'):
    for filename in filenames:
        if ".git" not in dirname:
            print(os.path.join(dirname, filename))


%matplotlib inline

/home/goznalo/Programming/Python/musicnames/artists.csv
/home/goznalo/Programming/Python/musicnames/clearedlist3.csv
/home/goznalo/Programming/Python/musicnames/miusic.ipynb
/home/goznalo/Programming/Python/musicnames/clearedlist1.csv
/home/goznalo/Programming/Python/musicnames/clearedlist2.csv
/home/goznalo/Programming/Python/musicnames/.ipynb_checkpoints/miusic-checkpoint.ipynb


### Importing the data
Let's import the data from the Kaggle dataset music-artists-popularity, using the pandas library.

In [18]:
filename = '/home/goznalo/Programming/Python/musicnames/artists.csv'
dataset = pd.read_csv(filename, usecols = [2], dtype=str) # Obtain the artist name column.

In [19]:
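dataset.head # preview of the data (no parentheses, so this shows the bound method's repr: a summary of the whole frame; dataset.head() would show only the first five rows)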

<bound method NDFrame.head of                  artist_lastfm
0                     Coldplay
1                    Radiohead
2        Red Hot Chili Peppers
3                      Rihanna
4                       Eminem
...                        ...
1466078                    NaN
1466079                    NaN
1466080                    NaN
1466081                    NaN
1466082                    NaN

[1466083 rows x 1 columns]>

### Clearing, part 1
We don't want any of the NaN entries, nor duplicate names. We will also convert the result to a NumPy array for the transformations that follow.

In [20]:
df = dataset.dropna().drop_duplicates()  # remove NaN rows and duplicate names (defaults: how='any', all columns)
names = np.squeeze(np.asarray(df))  # convert to a 1-D NumPy array
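To see what this cell does on a small scale, here is a minimal sketch on a made-up five-row frame. The `toy` data and variable names are assumptions for illustration only; the real input is artists.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the artist column; the values are made up for illustration.
toy = pd.DataFrame({"artist_lastfm": ["Coldplay", None, "Muse", "Coldplay", "Queen"]})

toy_clean = toy.dropna().drop_duplicates()     # drop missing rows, then repeated names
toy_names = np.squeeze(np.asarray(toy_clean))  # (3, 1) array squeezed to shape (3,)

print(toy_names)
```

The squeeze step matters because a one-column DataFrame converts to a two-dimensional array, while the loops below expect a flat sequence of strings.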

In [34]:
len(names)

957812

In [21]:
names[0]

'Coldplay'

### Clearing, part 2
We will now remove all names that use non-standard characters. By non-standard I mean any character that is not a Latin letter, a punctuation mark, a digit or a space. For instance, we will get rid of names that use Chinese characters or Greek letters.

The .isalnum() method does not help here, because it also returns True for non-Latin letters. One way around this is the string module: string.printable contains exactly the characters we want to allow (ASCII letters, digits, punctuation and whitespace), so we build a set from it and check every name in the database against that set in a loop.
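A quick sketch of the problem, using a few sample names chosen for illustration. The helper `uses_only_allowed` is a hypothetical name, not part of the notebook.

```python
import string

allowed = set(string.printable)

def uses_only_allowed(name, allowed_chars=allowed):
    """Hypothetical helper: True if every character of `name` is in `allowed_chars`."""
    return all(ch in allowed_chars for ch in name)

samples = ["Coldplay", "Beyoncé", "Σigma", "Maroon 5"]
for s in samples:
    # str.isalnum() accepts accented and Greek letters, so it cannot tell these names apart.
    print(s, s.replace(" ", "").isalnum(), uses_only_allowed(s))
```

The membership test against string.printable is the one that separates the ASCII-only names from the rest.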

In [22]:
validchars = {}
validchars[0] = set(string.printable)
print(validchars[0])

{'>', 't', 'U', 'a', 'w', '4', 'H', '(', 'X', '&', 'o', '`', '?', '2', 'p', '%', 'L', '!', '-', ';', '8', 'v', '9', 'k', '@', '{', 'S', 'h', '~', '*', '^', 'J', '+', 'W', ')', '\r', '\t', '=', 'V', 'K', 's', 'b', 'D', 'I', "'", 'm', '\\', ':', 'x', '|', 'j', ',', 'r', '"', 'A', ']', '}', '[', 'z', 'T', 'f', '.', 'E', '6', 'Y', 'n', 'M', 'c', 'F', '$', ' ', '0', '7', '<', '#', '1', '\n', 'u', 'Q', 'C', '_', 'R', 'l', 'O', 'N', '5', '\x0c', 'i', 'y', 'B', 'Z', 'e', '\x0b', 'G', 'g', 'd', '3', '/', 'q', 'P'}


Now comes the loop. An initial approach was to create a new NumPy array and append each valid name to it, thereby discarding the names with invalid characters. However, appending to a NumPy array copies the whole array every time, so the total running time grows quadratically with the number of names. Instead, we save the index of each invalid name in a list, which we then pass to np.delete() to remove those entries from the "names" array in a single call.

In [37]:
clearedlist = {}
def deletechars(validcharacters, listofnames, verbose=False):  # defined as a function because the next steps reuse it
    j = 0  # counts each invalid name
    deletelist = []  # indices of the names to remove
    for i, name in enumerate(listofnames):
        if not all(char in validcharacters for char in name):
            deletelist.append(i)
            if j % 10000 == 0 and verbose:
                print("invalid: " + name)
            j += 1
        if (i + 1) % 50000 == 0 and verbose:
            print(str(i + 1) + " cases inspected.")
    return np.delete(listofnames, deletelist)

clearedlist[0] = deletechars(validchars[0], names) #by default, verbose = False (no output)
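For comparison, the same filtering can be written with a boolean mask instead of an index list. This is an alternative sketch, not the notebook's method; `keep_valid` is a hypothetical name, and it assumes `valid_characters` is a set.

```python
import string
import numpy as np

def keep_valid(valid_characters, list_of_names):
    """Hypothetical helper: return only the names made up of valid characters."""
    names_arr = np.asarray(list_of_names, dtype=object)
    # One boolean per name: True when its characters are a subset of the valid set.
    mask = np.array([set(name) <= valid_characters for name in names_arr], dtype=bool)
    return names_arr[mask]

sample = np.array(["Coldplay", "Beyoncé", "Maroon 5", "Sigur Rós"], dtype=object)
kept = keep_valid(set(string.printable), sample)
print(kept)
```

Like np.delete(), the mask builds the result in one pass, so it avoids the repeated copying of the append approach.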

In [38]:
print(len(clearedlist[0]))  # number of names left after the first filter
clearedlist[0][0:22] # Check that Beyoncé has been correctly removed.

861756


array(['Coldplay', 'Radiohead', 'Red Hot Chili Peppers', 'Rihanna',
       'Eminem', 'The Killers', 'Kanye West', 'Nirvana', 'Muse', 'Queen',
       'Foo Fighters', 'Linkin Park', 'Lady Gaga', 'The Rolling Stones',
       'Daft Punk', 'Green Day', 'Katy Perry', 'The Beatles', 'Oasis',
       'Gorillaz', 'Michael Jackson', 'Maroon 5'], dtype=object)

In [39]:
np.savetxt("clearedlist1.csv", clearedlist[0], delimiter=",", fmt='%s') #save the list to a csv file.

### Clearing, part 3: choose your own adventure
We now distinguish four cases that can be studied, in order of decreasing complexity:
1. The full list, as is.
2. The list, having removed names with punctuation characters or digits.
3. The list, having removed names with punctuation characters or digits, and names with more than two words.
4. The list, having removed names with punctuation characters or digits, and names with more than one word.

Case 2 is easy to implement: we apply the previously defined deletechars() function, allowing only ASCII letters and spaces, so any name containing punctuation or digits is discarded. Case 4 is also straightforward, since string.ascii_letters contains no whitespace and therefore any multi-word name is discarded. Case 3 entails splitting each name on whitespace and discarding the names that yield more than two words.
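The four cases can be summarised as a predicate on a single name. This is only a sketch to make the rules concrete; `survives` and the constants are hypothetical names, and case 1 is taken to mean the list already filtered with string.printable.

```python
import string

LETTERS = set(string.ascii_letters)
LETTERS_AND_SPACE = LETTERS | {" "}

def survives(name, case):
    """Hypothetical helper: does `name` pass the filter for the given case (1 to 4)?"""
    if case == 1:
        return set(name) <= set(string.printable)
    if case == 2:
        return set(name) <= LETTERS_AND_SPACE
    if case == 3:
        return set(name) <= LETTERS_AND_SPACE and len(name.split()) <= 2
    if case == 4:
        return set(name) <= LETTERS
    raise ValueError("case must be 1, 2, 3 or 4")

for name in ["Coldplay", "Daft Punk", "Red Hot Chili Peppers", "Maroon 5"]:
    print(name, [survives(name, c) for c in (1, 2, 3, 4)])
```

Each case is at least as strict as the one before it, which is why the lists can be built by filtering the previous list rather than starting from scratch.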

In [40]:
case = 3 #change to whichever of the above
verbose = False

In [59]:
validchars[1] = set(string.ascii_letters).union(" ")
validchars[2] = set(string.ascii_letters).union(" ") # not used for filtering (case 3 filters by word count), but used later to build the character set
validchars[3] = set(string.ascii_letters)

clearedlist[1] = []
clearedlist[2] = []
clearedlist[3] = []

if case in (2, 3, 4):
    clearedlist[1] = deletechars(validchars[1], clearedlist[0])
    np.savetxt("clearedlist2.csv", clearedlist[1], delimiter=",", fmt='%s')
    
    if case in (3, 4):  # same idea as deletechars(), but names are flagged by word count (len(name.split()))
        i = 0 
        j = 0 
        deletelist = []
        for name in clearedlist[1]:
            if len(name.split()) > 2:
                deletelist.append(i)
                if j%10000 == 0 and verbose:
                    print("invalid: " + name)
                j += 1
            i += 1
            if i%50000 == 0 and verbose:
                print(str(i) + " cases inspected (lengthwise).")
        clearedlist[2] = np.delete(clearedlist[1], deletelist)
        np.savetxt("clearedlist3.csv", clearedlist[2], delimiter=",", fmt='%s') 
        
        if case == 4: # string.ascii_letters contains no whitespace, so multi-word names are dropped.
            clearedlist[3] = deletechars(validchars[3], clearedlist[2])
            np.savetxt("clearedlist4.csv", clearedlist[3], delimiter=",", fmt='%s')

print((len(names), len(clearedlist[0]), len(clearedlist[1]), len(clearedlist[2]), len(clearedlist[3])))
print(clearedlist[2][0:22])

(957812, 861756, 743526, 626581, 0)
['Coldplay' 'Radiohead' 'Rihanna' 'Eminem' 'The Killers' 'Kanye West'
 'Nirvana' 'Muse' 'Queen' 'Foo Fighters' 'Linkin Park' 'Lady Gaga'
 'Daft Punk' 'Green Day' 'Katy Perry' 'The Beatles' 'Oasis' 'Gorillaz'
 'Michael Jackson' 'Arctic Monkeys' 'Drake' 'David Bowie']


### Some more pre-processing, I guess

Now that we have our desired array of names, we need to prepare them for feeding into a neural network. There are several steps: first, we append a "\n" character to each name, which will be our end-of-name character. We then make all names lowercase, so as to use a lower-dimensional encoding. Finally, using the char_to_index dictionary and the keras.utils.to_categorical() function, we create our one-hot vectors.


In [64]:
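data = clearedlist[case-1]  # definitive list of names (a reference to the same array, not a copy)
m = len(data)  # number of training examples

# Caution: this modifies clearedlist[case-1] in place, so re-running the cell appends another '\n'.
for i in range(m):
    data[i] = data[i] + '\n'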

In [65]:
clearedlist[2]

array(['Coldplay\n\n', 'Radiohead\n\n', 'Rihanna\n\n', ...,
       'wilkwceniu\n\n', 'xHoods Upx\n\n', 'yellow Labradore\n\n'],
      dtype=object)
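The doubled "\n" above comes from running the appending cell more than once on an array that is shared with clearedlist. A minimal sketch of a version that is safe to re-run, assuming a hypothetical helper named `with_terminator`:

```python
import numpy as np

def with_terminator(names, terminator="\n"):
    """Hypothetical helper: return a new array and leave `names` untouched."""
    return np.array(
        [n if n.endswith(terminator) else n + terminator for n in names],
        dtype=object,
    )

source = np.array(["Coldplay", "Daft Punk"], dtype=object)
once = with_terminator(source)
twice = with_terminator(once)  # applying it again adds nothing
print(repr(source[0]), repr(once[0]), repr(twice[0]))
```

Building a new array keeps the cleared lists intact, and the endswith() check makes the step idempotent.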

In [55]:
for i in range(m):
    data[i] = data[i].lower()

In [35]:
chars = sorted({char.lower() for char in validchars[case-1]})  # unique lowercase characters, sorted
numchars = len(chars)
char_to_index = { ch:ix for ix, ch in enumerate(chars)} #assigns numbers to characters.
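The one-hot step described earlier can be sketched as follows. Two assumptions here: np.eye is used as a stand-in for keras.utils.to_categorical() so the example runs without Keras, and "\n" is added to the vocabulary explicitly, since the character sets for cases 2 to 4 do not contain it. The names `vocab`, `vocab_index` and `one_hot` are hypothetical.

```python
import string
import numpy as np

# Assumption: vocabulary = lowercase letters, the space, and the "\n" end-of-name character.
vocab = sorted(set(string.ascii_lowercase) | {" ", "\n"})
vocab_index = {ch: ix for ix, ch in enumerate(vocab)}

def one_hot(name, index=vocab_index):
    """Encode a lowercase, newline-terminated name as a (len(name), len(index)) matrix."""
    ids = [index[ch] for ch in name]
    # Row k of the identity matrix is the one-hot vector for character index k.
    return np.eye(len(index), dtype=np.float32)[ids]

x = one_hot("muse\n")
print(x.shape)
```

Each row of `x` has a single 1 in the column of the corresponding character, which is the per-character input format a character-level network expects.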