# DATASCI Final Project Title

Group: Baby Alive!

Members:
    Benedictos,
    Loquinte,
    Marasigan,
    Masilang,
    Tejada

In [None]:
import os
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import zscore
from sklearn.feature_extraction.text import CountVectorizer

%matplotlib inline

### Musixmatch API Library

- describe API Library

### Azapi

Azapi is an API for AZLyrics.com, written by Khaled ElMorshedy (https://github.com/elmoiv), that retrieves the lyrics of a song. We use it to collect the complete lyrics of the songs obtained from the Musixmatch API. Azapi is available at https://github.com/elmoiv/azapi.

In [None]:
#code here

### Research Objectives

In [None]:
#code here

### Scopes and Limitations

In [None]:
#code here

---

## Data Preparation

In [None]:
# code here

### Get sets of genres (Musixmatch)

In [None]:
# code here

### Get list of Filipino songs filtered by genres 
Year: 2011-2020

In [None]:
# code here

### Generate `.csv` files
(1 large DataFrame -> 1 DataFrame per genre)

In [None]:
# code here

### Get lyrics of each song via lyrics_id

In [None]:
# code here

### Load files

The files to be used for this study are the following:
- `all_music.csv`
- `alternative_music.csv`
- `christian_music.csv`
- `hiphop_music.csv`
- `pop_music.csv`
- `rbSoul_music.csv`
- `rock_music.csv`

In [None]:
df = pd.read_csv('CSV Files/all_music.csv')
df_alt = pd.read_csv('CSV Files/alternative_music.csv')
df_chr = pd.read_csv('CSV Files/christian_music.csv')
df_hphp = pd.read_csv('CSV Files/hiphop_music.csv')
df_pop = pd.read_csv('CSV Files/pop_music.csv')
df_rb = pd.read_csv('CSV Files/rbSoul_music.csv')
df_rck = pd.read_csv('CSV Files/rock_music.csv')
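
The seven `read_csv` calls above could also be driven by a single mapping. The following is a minimal sketch, not the notebook's method: the helper name `load_genre_frames` and the dictionary keys are hypothetical, and the folder and file names are assumed to match the list above.

```python
from pathlib import Path

import pandas as pd

# Assumption: file names follow the list above; the keys are hypothetical labels.
GENRE_FILES = {
    "all": "all_music.csv",
    "alternative": "alternative_music.csv",
    "christian": "christian_music.csv",
    "hiphop": "hiphop_music.csv",
    "pop": "pop_music.csv",
    "rb_soul": "rbSoul_music.csv",
    "rock": "rock_music.csv",
}


def load_genre_frames(folder, files=GENRE_FILES):
    """Read every CSV named in `files` from `folder` into a dict of DataFrames."""
    return {key: pd.read_csv(Path(folder) / name) for key, name in files.items()}


# Hypothetical usage: frames = load_genre_frames("CSV Files"); df = frames["all"]
```

Keeping the frames in one dictionary makes it easy to loop over genres later instead of repeating the same analysis for each variable.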

---

## Initial Exploratory Data Analysis

In [None]:
df

### Number of songs per genre

First, we look at the genre composition of the dataset. For this part, a song with multiple genres is counted under its combined label rather than once per genre. Therefore, if a song's genre is both _Alternative_ and _Pop_, it falls under the label _Alternative, Pop_.

In [None]:
primary_genres = df["genre_names"].value_counts().rename_axis('Genres').reset_index(name='Number of Songs')
primary_genres = primary_genres.nlargest(10, 'Number of Songs')
primary_genres["Genres"] = primary_genres["Genres"].str.replace(r"[\[\]']", '', regex=True)  # remove brackets and quotes

primary_genres_plot = sns.catplot(y="Genres", x="Number of Songs", kind="bar", data=primary_genres)

#### OBSERVATION HERE

Next, we treat each genre of a multi-genre song as its own label. Using the earlier example, a song tagged both _Alternative_ and _Pop_ now counts toward both the _Alternative_ label and the _Pop_ label.

To count the frequency of each individual genre, we convert the `genre_names` column into a table in which each row holds the genres of a single song and each column holds one of that song's genres (e.g. a song with two genres fills the first two columns).

To split the genres, we use the fact that multiple genres are separated by commas. We also remove unnecessary characters such as brackets and apostrophes.

In [None]:
genres_series = df['genre_names'].replace(r"[\[\]']", '', regex=True)  # strip brackets and quotes
genres_matrix = []

for string in genres_series:
    split_str = string.split(', ')
    genres_matrix.append(split_str)

genres_df = pd.DataFrame(genres_matrix)
genres_df
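
Because each `genre_names` value appears to be a stringified Python list, it could also be parsed directly instead of stripped with a regular expression. The sketch below uses toy data and assumes that format; the variable names are illustrative only. Parsing with `ast.literal_eval` would keep a genre intact even if its name contained a comma or an apostrophe.

```python
import ast

import pandas as pd

# Toy stand-in for df["genre_names"]; the real column is assumed to hold
# stringified Python lists such as "['Alternative', 'Pop']".
toy_genres = pd.Series([
    "['Pop']",
    "['Alternative', 'Pop']",
    "['Rock', 'Pop', 'Alternative']",
])

# Parse each string into a real list, then give every genre its own row.
genre_lists = toy_genres.apply(ast.literal_eval)
per_genre = genre_lists.explode()

# Frequency of each individual genre across all songs.
genre_counts = per_genre.value_counts()
print(genre_counts)
```

The length of each parsed list also gives the number of genres per song without building a separate table.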

The resulting table has 4 columns, which means that a single song has at most 4 genres. To find the songs with 4 genres, we check which rows have a value in the 4th column.

In [None]:
genres_df[~genres_df.iloc[:, 3].isnull()]

In [None]:
df.iloc[[38,573]]

### Observations here

Next, we concatenate all of these columns into a single list and use it to count each individual genre in the dataset.

In [None]:
# include every genre column (the table has 4), then drop the empty slots
genres_list = pd.concat([genres_df[col] for col in genres_df.columns]).dropna()
genres_list

In [None]:
unique_genres = genres_list.value_counts().rename_axis('Genres').reset_index(name='Number of Songs')

unique_genres = unique_genres.nlargest(10, 'Number of Songs')
unique_genres_plot = sns.catplot(y="Genres", x="Number of Songs", kind="bar", data=unique_genres)

### Observations here

### Word Counts (Top N words)

Next, we count the frequency of each word in our corpus of lyrics. For convenience, we use scikit-learn's `CountVectorizer`.

In [None]:
# vectorizer = CountVectorizer()
# vocabulary = vectorizer.fit_transform([df_temp.iloc[0,6]])
# print(vectorizer.get_feature_names())

In [None]:
# df_temp = df.dropna(subset=['lyrics'])
# print(df_temp[df_temp['lyrics'].str.contains("\\b002737\\b", regex=True)].iloc[0,6])
# # print(df_temp.iloc[0, 6])
# # df_temp

In [None]:
corpus = df["lyrics"].dropna()
# df.loc[df["lyrics"].notnull(), ["lyrics"]]

words = "Hello Philippines hello world"

print(corpus)

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform(corpus)
# tokenizer = vectorizer.build_tokenizer()
# tokenized = tokenizer(words)
# len(tokenized)

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())
counts

In [None]:
# sum over all songs to get each word's total frequency in the corpus
word_count = counts.sum().nlargest(20).reset_index(name = "Count").rename(columns={'index': 'Word'})

word_count_plot = sns.catplot(y="Word", x="Count", orient="h", kind="bar", data=word_count)

TODO: also show the results with English stop words removed.
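
One possible way to filter stop words is the `stop_words` parameter of `CountVectorizer`. The sketch below uses toy lyrics; the custom word list is a hypothetical sample, not a vetted Tagalog stop-word list. Since the songs are Filipino, the built-in English list alone would likely leave common Tagalog function words in place.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_lyrics = ["I love you and you love me", "Mahal kita and I love the world"]

# stop_words="english" drops words in scikit-learn's built-in English list.
english_filtered = CountVectorizer(stop_words="english")
english_filtered.fit(toy_lyrics)
print(sorted(english_filtered.vocabulary_))

# A custom list can extend the filter to other languages (hypothetical sample).
custom_filtered = CountVectorizer(stop_words=["and", "the", "kita"])
custom_filtered.fit(toy_lyrics)
print(sorted(custom_filtered.vocabulary_))
```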

### Average Length of Song Lyrics

In [None]:
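def count_words(text):
    """Return the total number of tokens in a list of lyric strings."""
    # Note: raises ValueError if no string contains a token of 2+ characters.
    vectorizer = CountVectorizer()
    matrix = vectorizer.fit_transform(text)
    return matrix.sum()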

In [None]:
df_lyrics = df[["artist_name", "genre_names", "lyrics"]].dropna()

df_lyrics["count"] = df_lyrics["lyrics"].apply(lambda x: count_words([x]))
df_lyrics
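
Fitting a new `CountVectorizer` for every song works, but the same count can be obtained by building the default analyzer once and reusing it. This is a sketch under that assumption; `count_tokens` is a hypothetical helper name. Unlike fitting a vectorizer, it returns 0 instead of raising an error when a string yields no tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Build the default analyzer once; it lowercases the text and keeps
# tokens of two or more word characters, matching CountVectorizer's defaults.
analyzer = CountVectorizer().build_analyzer()


def count_tokens(text):
    """Number of tokens CountVectorizer would count in one lyric string."""
    return len(analyzer(text))


print(count_tokens("Hello Philippines hello world"))  # 4
print(count_tokens("a I"))  # 0, since no token has two or more characters
```

It could be applied with `df_lyrics["lyrics"].apply(count_tokens)`.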

In [None]:
# TODO: remove outliers and other anomalies
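
One common way to flag outliers is a z-score cut-off, using the `zscore` function already imported at the top of the notebook. The sketch below runs on toy data. The helper name `drop_count_outliers` and the threshold of 3 are assumptions, and a z-score rule may not suit lyric lengths if their distribution is strongly skewed.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore


def drop_count_outliers(frame, column="count", threshold=3.0):
    """Keep rows whose z-score in `column` lies within +/- threshold."""
    z = zscore(frame[column].to_numpy(dtype=float))
    mask = np.abs(z) < threshold
    return frame.loc[mask]


# Toy data: 19 ordinary lyric lengths and one extreme value.
toy_counts = pd.DataFrame({"count": list(range(180, 218, 2)) + [5000]})
trimmed = drop_count_outliers(toy_counts)
print(len(toy_counts), len(trimmed))
```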

We can now compute the average lyric length, in words, of the songs in our data.

In [None]:
df_lyrics['count'].mean()

### Average Length of Song Lyrics x Genres

For this part, we use the values in the `genre_names` column. If a song falls under two or more genres, its value in this column is formatted as **['genre1', 'genre2', ..., 'genreN']**.

In [None]:
## remove brackets and quotes from genre_names ##
df_lyrics['genre_names'] = df_lyrics['genre_names'].replace(r"[\[\]']", '', regex=True)
df_lyrics.head(30)

In [None]:
## create one row per (song, genre) pair based on genre_names ##
temp = df_lyrics['genre_names'].str.split(", ").explode()
temp.name = 'genre_names'

# replace the combined genre column with the exploded one
del df_lyrics['genre_names']
df_lyrics = df_lyrics.join(temp)

In [None]:
df_lyrics.head(30)

In [None]:
#get the mean for every genre
genre_mean = df_lyrics.groupby('genre_names')['count'].mean()
genre_mean

In [None]:
#create a dataframe for the result, with genre_names as a regular column for plotting
avg_genre = genre_mean.rename('mean').reset_index()
avg_genre

In [None]:
avg_genre_plot = sns.catplot(y="genre_names", x="mean", orient="h", kind="bar", data=avg_genre)
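
A per-genre mean can be misleading when a genre has only a handful of songs. As a hedged sketch, the mean can be reported together with the number of songs behind it. The toy data and the minimum of 2 songs are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the exploded df_lyrics (one row per song-genre pair).
toy_lyrics_df = pd.DataFrame({
    "genre_names": ["Pop", "Pop", "Pop", "Rock", "Rock", "Folk"],
    "count": [100, 200, 300, 150, 250, 900],
})

# Mean lyric length together with the number of songs behind each mean.
summary = (
    toy_lyrics_df.groupby("genre_names")["count"]
    .agg(mean="mean", n_songs="count")
    .reset_index()
)

# Hypothetical cut-off: ignore genres with fewer than 2 songs.
reliable = summary[summary["n_songs"] >= 2]
print(reliable)
```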

### Average Length of Song Lyrics x Artists

In [None]:
# code here