## Positive Naive Bayes Classification 

A song can belong to more than one genre, and the data conditioning preserves that one-to-many relationship. In this part, we train one binary classifier per genre, each deciding whether a given body of text fits that genre. We use the NLTK PositiveNaiveBayesClassifier, trained on lyrics from 80% of the songs, for each of the fifteen most frequent genres. Each genre's relative frequency among all songs is also computed as a candidate prior. Note that the training call below does not pass this value, so NLTK's default positive prior of 0.5 applies.
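
To make the "positive" part concrete: the classifier is given labeled in-genre examples plus an unlabeled pool, and has to infer how features are distributed in the negative class. Below is a minimal sketch of that inference, assuming the mixture identity P(f) = P(pos) P(f|pos) + P(neg) P(f|neg). The function name and the sample numbers are illustrative assumptions, not taken from the notebook, and NLTK's own implementation adds smoothing and normalization on top of this idea.

```python
def negative_feature_prob(p_feature, p_feature_given_pos, pos_prior):
    """Estimate P(feature | negative) from the unlabeled pool.

    Rearranges P(f) = P(pos)*P(f|pos) + P(neg)*P(f|neg) and clamps at zero,
    because estimated probabilities can make the numerator negative.
    """
    neg_prior = 1.0 - pos_prior
    estimate = (p_feature - pos_prior * p_feature_given_pos) / neg_prior
    return max(estimate, 0.0)


# Illustrative numbers only: a word seen in 30% of all sentences and in
# 50% of in-genre sentences, with an in-genre prior of 0.4.
p_neg = negative_feature_prob(0.30, 0.50, 0.4)
print(round(p_neg, 4))
```

The prior matters here: a larger positive prior attributes more of a word's overall frequency to the in-genre class and leaves less for the negative class.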

In [16]:
from nltk.classify import PositiveNaiveBayesClassifier
import json
import pandas as pd

In [25]:
# open the data set
df = pd.read_csv("../../data/conditioned/all_years_and_genres_with_lyrics_and_wordcount_and_vocabulary_clean.csv")
df = df.drop_duplicates('song_key')
# total number of songs (float, so the division below is not integer division in Python 2)
allsongs = float(len(df))

# load the genre -> songs mapping and count the songs in each genre
with open("../../notebooks/ss/songsbygenre.json") as json_file:
    genresj = json.load(json_file)

topgenres = {}
for k in genresj.keys():
    topgenres[k] = len(genresj[k])

# keep the 15 most frequent genres;
# glist holds (rank, genre, relative frequency in the overall population)
glist = []
ranked = sorted(topgenres, key=topgenres.get, reverse=True)[:15]
for rank, w in enumerate(ranked):
    glist.append((rank, w, topgenres[w] / allsongs))

glist

[(0, u'/wiki/Pop_music', 0.7867435158501441),
 (1, u'/wiki/Hip_hop_music', 0.6296829971181557),
 (2, u'/wiki/Contemporary_R%26B', 0.6162343900096061),
 (3, u'/wiki/Soul_music', 0.41306436119116235),
 (4, u'/wiki/Rock_music', 0.3621517771373679),
 (5, u'/wiki/Pop_rock', 0.3520653218059558),
 (6, u'/wiki/Soft_rock', 0.24639769452449567),
 (7, u'/wiki/Country_music', 0.1988472622478386),
 (8, u'/wiki/Rhythm_and_blues', 0.19548511047070125),
 (9, u'/wiki/Alternative_rock', 0.16234390009606148),
 (10, u'/wiki/Funk', 0.15994236311239193),
 (11, u'/wiki/Hard_rock', 0.15417867435158503),
 (12, u'/wiki/Dance-pop', 0.14169068203650337),
 (13, u'/wiki/Dance_music', 0.14169068203650337),
 (14, u'/wiki/Disco', 0.13160422670509125)]

In [18]:
dftr = df.sample(frac=0.8)
dftst =  df.loc[~df.index.isin(dftr.index)]
dftr.shape, dftst.shape

((3331, 454), (833, 454))

## Readying the Training Sets
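For each genre, the training songs are split into an in-genre set and a not-in-genre set that contains every other song. The in-genre sentences are the positive examples and the remaining sentences form the unlabeled pool. Training on each pair yields fifteen classifiers, one per genre.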

In [19]:
def features(sentence):
    words = sentence.lower().split()
    return dict(('contains(%s)' % w, True) for w in words)
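
As a quick illustration of what this featurizer produces, the sketch below repeats the function and applies it to two made-up strings (the sample text is an assumption, not from the data set). It shows that case is folded, repeated words collapse into a single feature, and punctuation is not stripped.

```python
def features(sentence):
    # Same bag-of-words featurizer as in the cell above.
    words = sentence.lower().split()
    return dict(('contains(%s)' % w, True) for w in words)


feats = features('Love me love ME do')
print(sorted(feats))
# ['contains(do)', 'contains(love)', 'contains(me)']

# Punctuation stays attached to the token, so 'you,' and 'you' differ.
punct = features('you, you')
print(sorted(punct))
# ['contains(you)', 'contains(you,)']
```

Because only presence is recorded, word counts and word order carry no weight in the classifiers.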


In [26]:
classdict = {}
for genretuple in glist:
    gindex = genretuple[0]
    genre = genretuple[1]
    gprior = genretuple[2]  # relative frequency; computed but not passed to train() below

    in_genre_df = dftr[dftr[genre] == True]
    out_genre_df = dftr[dftr[genre] == False]

    # split each song's lyrics (row position 2) on '.' into "sentences"
    in_sentences = []
    out_sentences = []
    for row in in_genre_df.iterrows():
        songsents = row[1][2].split('.')
        for s in songsents:
            in_sentences.append(s)

    for row in out_genre_df.iterrows():
        songsents = row[1][2].split('.')
        for s in songsents:
            out_sentences.append(s)

    # in-genre sentences are the positive examples; the rest form the unlabeled pool
    positive_featuresets = list(map(features, in_sentences))
    unlabeled_featuresets = list(map(features, out_sentences))
    classdict[int(gindex)] = PositiveNaiveBayesClassifier.train(positive_featuresets, unlabeled_featuresets)
print("training complete")

In [33]:
def genre_classify(classtext):
    # run every per-genre classifier on the text and print its True/False verdict
    feats = features(classtext)
    for gindex, genre, gprior in glist:
        # genre[6:] strips the leading '/wiki/'
        print("%s :  %s" % (genre[6:], classdict[gindex].classify(feats)))
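
Since a text can match several genres at once, it can be handy to collect the matching genres in a list instead of printing one line per classifier. The sketch below is one possible variant, not part of the original notebook: `matching_genres` and `KeywordStub` are hypothetical names, and the stub only stands in for a trained classifier so the example runs without NLTK or the data set.

```python
def features(sentence):
    # Same bag-of-words featurizer as used for training.
    words = sentence.lower().split()
    return dict(('contains(%s)' % w, True) for w in words)


class KeywordStub(object):
    """Hypothetical stand-in for a trained classifier: True if a keyword is present."""

    def __init__(self, keyword):
        self.key = 'contains(%s)' % keyword

    def classify(self, featureset):
        return self.key in featureset


def matching_genres(text, genre_list, classifiers):
    # Return the names of all genres whose classifier accepts the text.
    feats = features(text)
    return [genre[6:] for idx, genre, _prior in genre_list
            if classifiers[idx].classify(feats)]


# Made-up genre list and classifiers in the same shape as glist and classdict.
demo_glist = [(0, '/wiki/Pop_music', 0.79), (1, '/wiki/Country_music', 0.20)]
demo_classifiers = {0: KeywordStub('love'), 1: KeywordStub('truck')}

both = matching_genres('I love my truck', demo_glist, demo_classifiers)
print(both)
```

With the real `glist` and `classdict`, the same function would return the genres that currently print as True.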

In [34]:
genre_classify('beat the bitch with a bat')



Pop_music :  False
Hip_hop_music :  True
Contemporary_R%26B :  False
Soul_music :  False
Rock_music :  False
Pop_rock :  False
Soft_rock :  False
Country_music :  False
Rhythm_and_blues :  False
Alternative_rock :  False
Funk :  False
Hard_rock :  True
Dance-pop :  True
Dance_music :  False
Disco :  True


In [36]:
genre_classify('I Love you')


Pop_music :  True
Hip_hop_music :  False
Contemporary_R%26B :  True
Soul_music :  True
Rock_music :  False
Pop_rock :  True
Soft_rock :  True
Country_music :  False
Rhythm_and_blues :  True
Alternative_rock :  False
Funk :  False
Hard_rock :  True
Dance-pop :  True
Dance_music :  True
Disco :  True


In [39]:
genre_classify('My team lost the game')


Pop_music :  False
Hip_hop_music :  True
Contemporary_R%26B :  False
Soul_music :  False
Rock_music :  False
Pop_rock :  True
Soft_rock :  True
Country_music :  False
Rhythm_and_blues :  False
Alternative_rock :  True
Funk :  False
Hard_rock :  False
Dance-pop :  True
Dance_music :  False
Disco :  False
