# A million song: Milestone 2

## Goals

The goal is to perform sentiment analysis by music genre.

    *Stemming is very commonly used in this kind of text analysis task. For statistical purposes, it is more interesting to treat "cry", "cried", and "crying" as instances of the same thing, rather than treating them as distinct, unrelated tokens. We use a simple, well-known stemming algorithm (Porter2) (which for this example maps all these words to "cri").*
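
To make the effect of stemming concrete, here is a minimal sketch that counts tokens before and after mapping them to a common stem. The `STEMS` lookup table and the `stem` helper are hypothetical stand-ins covering only the three words from the quote. They are not the Porter2 algorithm; a real pipeline would call a stemmer implementation instead.

```python
from collections import Counter

# Hypothetical stand-in for a stemmer: covers only the words quoted above.
STEMS = {"cry": "cri", "cried": "cri", "crying": "cri"}


def stem(token):
    """Return the stem of a token, or the token itself if it is not in the table."""
    return STEMS.get(token, token)


tokens = ["cry", "cried", "crying", "you", "cry"]

# Without stemming, each surface form is counted separately.
raw_counts = Counter(tokens)
# With stemming, the three forms collapse into a single entry.
stemmed_counts = Counter(stem(t) for t in tokens)

print(raw_counts)
print(stemmed_counts)
```

After stemming, the three surface forms contribute to one count, which is what makes per-word statistics more meaningful.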

In [7]:
#Imports
%matplotlib inline
import numpy as np
import pandas as pd
import re
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.pipeline import Pipeline
from mpl_toolkits.mplot3d import Axes3D
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
#from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
import sklearn.metrics as skm
from IPython.display import display, HTML

We import the most used words and their counts from the dataset. A bit of data wrangling is necessary, because each row stores a word and its count in a single `<SEP>`-delimited string.

In [8]:
#Importing the data, putting it in a single-column DataFrame
full_word_list = pd.read_table('Data/full_word_list.txt')
#Renaming the column
full_word_list.columns = ['Word']
#Splitting each row once on the <SEP> delimiter and reusing the result
parts = full_word_list['Word'].str.split('<SEP>', expand=True)
#Extracting the words
full_word_list['Word'] = parts[0]
#Extracting the word counts
full_word_list['Count'] = pd.to_numeric(parts[1])
#Dropping rows we will not use
full_word_list = full_word_list.drop(full_word_list.index[:6])

display(full_word_list.head(20))

Unnamed: 0,Word,Count
6,i,2078808.0
7,the,1863782.0
8,you,1744257.0
9,to,1067578.0
10,and,1055748.0
11,a,974499.0
12,it,821152.0
13,me,771755.0
14,not,735396.0
15,in,626410.0
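
The `<SEP>` split used in the cell above can be checked on a small inline sample without reading the file. This is a minimal sketch: the three rows are copied from the output table, and the names `raw`, `parts` and `words` are illustrative only.

```python
import pandas as pd

# Three rows in the word<SEP>count format, with counts taken from the table above.
raw = pd.Series(["i<SEP>2078808", "the<SEP>1863782", "you<SEP>1744257"])

# expand=True returns one column per field instead of a list per row.
parts = raw.str.split("<SEP>", expand=True)

words = pd.DataFrame({
    "Word": parts[0],
    "Count": pd.to_numeric(parts[1]),  # strings -> numbers
})

print(words)
```

Splitting once and reusing `parts` avoids re-parsing every row for each extracted column.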


In [125]:
#Importing the text file in a DataFrame, skipping malformed lines (bad data)
#on_bad_lines='warn' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
matches = pd.read_table('Data/mxm_779k_matches.txt', on_bad_lines='warn')
#Changing the column's title in order to be clearer
matches.columns = ['Raw']
#Splitting each row once on the <SEP> delimiter
fields = matches['Raw'].str.split('<SEP>', expand=True)
#Getting the Tid
matches['Tid'] = fields[0]
#Extracting artist names
matches['Artist_Name'] = fields[1]
#Extracting titles
matches['Title'] = fields[2]
#Extracting MXM_Tid
matches['MXM_Tid'] = fields[3]

#Second artist name
matches['Artist_again'] = fields[4]

#Second title
matches['Title_again'] = fields[5]

#Dropping rows we do not need
matches = matches.drop(matches.index[:17])

b'Skipping line 60821: expected 1 fields, saw 2\nSkipping line 126702: expected 1 fields, saw 2\n'
b'Skipping line 580629: expected 1 fields, saw 2\nSkipping line 632526: expected 1 fields, saw 2\n'


In [129]:
display(matches.head(20))

Unnamed: 0,Raw,Tid,Artist_Name,Title,MXM_Tid,Artist_again,Title_again
17,TRMMMKD128F425225D<SEP>Karkkiautomaatti<SEP>Ta...,TRMMMKD128F425225D,Karkkiautomaatti,Tanssi vaan,4418550,Karkkiautomaatti,Tanssi vaan
18,TRMMMRX128F93187D9<SEP>Hudson Mohawke<SEP>No O...,TRMMMRX128F93187D9,Hudson Mohawke,No One Could Ever,8898149,Hudson Mohawke,No One Could Ever
19,TRMMMCH128F425532C<SEP>Yerba Brava<SEP>Si Vos ...,TRMMMCH128F425532C,Yerba Brava,Si Vos Querés,9239868,Yerba Brava,Si vos queres
20,TRMMMXN128F42936A5<SEP>David Montgomery<SEP>Sy...,TRMMMXN128F42936A5,David Montgomery,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",5346741,Franz Berwald,"Symphony No. 1 in G minor ""Sinfonie Sérieuse"":..."
21,TRMMMBB12903CB7D21<SEP>Kris Kross<SEP>2 Da Bea...,TRMMMBB12903CB7D21,Kris Kross,2 Da Beat Ch'yall,2511405,Kris Kross,2 Da Beat Ch'yall
22,TRMMMHY12903CB53F1<SEP>Joseph Locke<SEP>Goodby...,TRMMMHY12903CB53F1,Joseph Locke,Goodbye,793273,Joseph LoDuca,Goodbye
23,TRMMMNS128F93548E1<SEP>3 Gars Su'l Sofa<SEP>L'...,TRMMMNS128F93548E1,3 Gars Su'l Sofa,L'antarctique,7503609,3 gars su'l sofa,L'Antarctique
24,TRMMMXJ12903CBF111<SEP>Jorge Negrete<SEP>El hi...,TRMMMXJ12903CBF111,Jorge Negrete,El hijo del pueblo,7362052,Jorge Negrete,El hijo del pueblo
25,TRMMMBW128F4260CAE<SEP>Tiger Lou<SEP>Pilots<SE...,TRMMMBW128F4260CAE,Tiger Lou,Pilots,7833814,Tiger Lou,Pilots
26,TRMMMXI128F4285A3F<SEP>Waldemar Bastos<SEP>N G...,TRMMMXI128F4285A3F,Waldemar Bastos,N Gana,2320200,Waldemar Bastos,N Gana
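
As a plain-Python illustration of the row layout, the sketch below splits one match line into the six fields named in the table above. `FIELDS` and `parse_match` are hypothetical helpers, and the sample line is rebuilt from the parsed columns of row 25. The field-count check is an assumption about what counts as a malformed line; it is not the rule pandas applied when it skipped lines in the import cell.

```python
# Column names in the order used by the DataFrame above.
FIELDS = ["Tid", "Artist_Name", "Title", "MXM_Tid", "Artist_again", "Title_again"]


def parse_match(line, sep="<SEP>"):
    """Split one line into the six named fields; return None if the field count is off."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))


# Sample line rebuilt from the parsed columns shown in the output above.
sample = "TRMMMBW128F4260CAE<SEP>Tiger Lou<SEP>Pilots<SEP>7833814<SEP>Tiger Lou<SEP>Pilots"

record = parse_match(sample)
print(record["Artist_Name"], "-", record["Title"])

# A line without the expected six fields is rejected.
print(parse_match("not a valid line"))
```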


The dataset comes with some metadata:
- The total word count is 55 163 335 
- There are 498 134 unique words 

These values could be useful for our sentiment analysis, so we store them as constants.


In [128]:
Word_count_total = 55163335
Unique_word_number = 498134

In [149]:
#Computing the percentage of occurrence of each word:
full_word_list['Occurence_percentage'] = (full_word_list['Count'] / Word_count_total) * 100
display(full_word_list.head(20))

#Showing the share of the total word count covered by the 100 most used words
display(full_word_list['Occurence_percentage'].iloc[:100].sum())

Unnamed: 0,Word,Count,Occurence_percentage
6,i,2078808.0,3.76846
7,the,1863782.0,3.378661
8,you,1744257.0,3.161986
9,to,1067578.0,1.935304
10,and,1055748.0,1.913858
11,a,974499.0,1.76657
12,it,821152.0,1.488583
13,me,771755.0,1.399036
14,not,735396.0,1.333125
15,in,626410.0,1.135555


52.8380581050801

Note that the 100 most used words already account for more than half (about 52.8%) of the total word count.
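
The percentage computed above is simply `count / total * 100`. The sketch below reproduces that arithmetic for the ten most used words, using the counts from the output table and the total word count given with the dataset. The names `WORD_COUNT_TOTAL`, `top_counts` and `percentage` are illustrative and not part of the notebook.

```python
WORD_COUNT_TOTAL = 55163335  # total word count given with the dataset

# Counts of the ten most used words, copied from the table above.
top_counts = {
    "i": 2078808, "the": 1863782, "you": 1744257, "to": 1067578,
    "and": 1055748, "a": 974499, "it": 821152, "me": 771755,
    "not": 735396, "in": 626410,
}


def percentage(count, total=WORD_COUNT_TOTAL):
    """Share of the total word count represented by `count`, in percent."""
    return count / total * 100


# Summing the individual percentages gives the cumulative share of the top words.
top10_share = sum(percentage(c) for c in top_counts.values())

print(f"i: {percentage(top_counts['i']):.5f}%")
print(f"top 10 words: {top10_share:.2f}%")
```

The same summation over the first 100 rows yields the 52.8% figure reported above.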