# MALLET Analysis on Song Lyric Corpus

In [2]:
# Import libraries and set directory
import pandas as pd
import numpy as np
import os
os.chdir('/Users/nickbruno/Documents/spring_2019/DS5559/project/code')

In [3]:
# Import data
df = pd.read_csv('mallet_start_df.csv')

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,artist,song,text,artist_id,song_id
0,0,ABBA,Ahe's My Kind Of Girl,"Look at her face, it's a wonderful face \nAnd...",0,0
1,1,ABBA,"Andante, Andante","Take it easy with me, please \nTouch me gentl...",0,1
2,2,ABBA,As Good As New,I'll never know why I had to go \nWhy I had t...,0,2
3,3,ABBA,Bang-A-Boomerang,Making somebody happy is a question of give an...,0,3
4,4,ABBA,Burning My Bridges,"Well, you hoot and you holler and you make me ...",0,4


In [5]:
# Create an artist-song label #
df['doc_label'] = df.apply(lambda x: "{}-{}".format(x.artist, x.song), axis=1)

In [6]:
# Get rid of extraneous columns #
df = df[['doc_label','text']]

In [7]:
# Format the text to get rid of verse and line splits #
df['text'] = df['text'].str.replace('  \n', ' ', regex=False)
df['text'] = df['text'].str.replace('\n\n', '', regex=False)
# These splits matter in later analysis, but not for the MALLET analysis
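
As a quick check of what the two replacements do, here is a minimal sketch on a made-up lyric string (the sample text is an assumption, not taken from the corpus). It shows that a verse break leaves a double space behind, which matches the output printed in the next cell. The final whitespace collapse is an optional extra step, not something the original notebook does.

```python
import pandas as pd

# Made-up sample: lines end in two spaces plus a newline, verses are separated by an extra break
sample = pd.Series(["first line  \nsecond line  \n  \nnew verse"])

# Same literal replacements as in the cell above
flat = sample.str.replace("  \n", " ", regex=False)
flat = flat.str.replace("\n\n", "", regex=False)
print(repr(flat.iloc[0]))

# Optional: collapse any remaining runs of whitespace into a single space
collapsed = flat.str.replace(r"\s+", " ", regex=True).str.strip()
print(repr(collapsed.iloc[0]))
```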

In [16]:
df.text.iloc[0] # looks good

"Look at her face, it's a wonderful face And it means something special to me Look at the way that she smiles when she sees me How lucky can one fellow be?  She's just my kind of girl, she makes me feel fine Who could ever believe that she could be mine? She's just my kind of girl, without her I'm blue And if she ever leaves me what could I do, what could I do?  And when we go for a walk in the park And she holds me and squeezes my hand We'll go on walking for hours and talking About all the things that we plan  She's just my kind of girl, she makes me feel fine Who could ever believe that she could be mine? She's just my kind of girl, without her I'm blue And if she ever leaves me what could I do, what could I do?"

In [17]:
# Write out the dataframe so MALLET can be run on it from the terminal #
df.to_csv('project-mallet.csv')
    # save project-mallet.csv in the mallet bin directory in order to run the commands in the terminal
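
By default, MALLET's `import-file` reads one document per line and treats the first token as the document name, the second as a label, and the rest as the text. Because `doc_label` contains spaces, a whitespace-free name may be safer. The sketch below is one possible alternative to the CSV export, not what was run for the results in this notebook. The helper `to_mallet_lines`, the placeholder label `lyrics`, and the output file name are all hypothetical.

```python
import re
import pandas as pd

def to_mallet_lines(frame, label="lyrics"):
    """Return one 'name<TAB>label<TAB>text' line per row of frame."""
    lines = []
    for i, row in enumerate(frame.itertuples(index=False)):
        # Whitespace-free document name so it is read as a single token
        name = "{}_{}".format(i, re.sub(r"\W+", "_", row.doc_label).strip("_"))
        # Collapse tabs and newlines so each document stays on one line
        text = re.sub(r"\s+", " ", row.text).strip()
        lines.append("\t".join([name, label, text]))
    return lines

# Tiny made-up frame with the same columns as df
demo = pd.DataFrame({
    "doc_label": ["ABBA-As Good As New"],
    "text": ["I'll never know\nwhy I had to go"],
})
lines = to_mallet_lines(demo)
print(lines[0])

# To write the file (hypothetical name):
# with open("project-mallet.tsv", "w") as handle:
#     handle.write("\n".join(lines) + "\n")
```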

### First MALLET run: every song in the corpus, with num_topics = 5 and 1000 iterations

#### The following commands were run in the terminal on my Mac

In [None]:
# Importing data to run the MALLET operation
import-file --input project-mallet.csv --output project-mallet.mallet --keep-sequence TRUE --remove-stopwords
    # '--remove-stopwords' is extremely helpful: it drops stopwords at import time, so we do not have to
    # build a bag of words for the entire corpus and filter it ourselves, and it improves our results.

In [None]:
# Train the topics and produce other important MALLET files
train-topics --input project-mallet.mallet --num-topics 5 --num-iterations 1000 \
--output-doc-topics project-mallet-doc-topics.txt \
--output-topic-keys project-mallet-topic-keys.txt \
--word-topic-counts-file project-mallet-word-topic-counts-file.txt \
--topic-word-weights-file project-mallet-topic-word-weights-file.txt \
--xml-topic-report project-mallet-topic-report.xml \
--xml-topic-phrase-report project-mallet-topic-phrase-report.xml \
--show-topics-interval 10 \
--use-symmetric-alpha false  \
--optimize-interval 100 \
--diagnostics-file project-mallet-diagnostics.xml

### Results

0	1.085	i'm love don't it's you're baby can't time make yeah i'll gonna i've feel heart back give won't life that's 

1	0.45024	man she's he's home back big town good it's boy ain't i'm christmas day night gonna girl round city boys 

2	0.17176	i'm yeah ain't don't rock shit fuck nigga money hey wanna back it's make bitch dance gotta chorus niggas man 

3	0.66453	life world eyes day love light night time sun heart god sky hear there's lord dream it's rain soul sing 

4	0.24917	dead die blood hell death war man kill people fight we're hate black head it's burn power they're men god 
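
Each line of the topic-keys output has three tab-separated fields: the topic id, the Dirichlet alpha weight for that topic (optimized here because `--optimize-interval` is set), and the top words. A minimal sketch for loading such a file into pandas follows. The helper name `read_topic_keys` is hypothetical, and the sample lines are shortened copies of the results above.

```python
import io
import pandas as pd

def read_topic_keys(source):
    """Parse MALLET topic-keys lines (id, alpha, words) into a DataFrame."""
    rows = []
    for line in source:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic_id, alpha, words = line.split("\t", 2)
        rows.append({
            "topic": int(topic_id),
            "alpha": float(alpha),
            "top_words": words.split(),
        })
    return pd.DataFrame(rows).set_index("topic")

# Shortened sample; with a real file, pass open('project-mallet-topic-keys.txt')
sample = io.StringIO("0\t1.085\ti'm love don't\n\n4\t0.24917\tdead die blood\n")
keys = read_topic_keys(sample)

# Topics with a larger alpha tend to be spread across more documents
print(keys.sort_values("alpha", ascending=False))
```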

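### Analysis

The topics below are numbered from 1, so Topic 1 corresponds to MALLET's topic 0 in the results above.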

Topic 1: Love, baby, heartbreak, and life

Love is a major theme in most songs, so this topic is not surprising.

Topic 2: Home, town, boy, girl, Christmas (?)

This seems like it could be Country music, talking about boys and girls in towns. The inclusion of Christmas seems odd here, but our corpus may contain a number of holiday-related songs.

Topic 3: Expletives, Rock, Dance, Chorus, Money

This seems like the hip-hop cluster, although it is interesting to find the words "rock" and "chorus" here. It is also interesting to see "I'm" as one of the main words, as in the first topic. Do the artists in these two topics choose to focus on themselves?

Topic 4: Weather, God, Dreams

This topic seems whimsical and daydreamy, interweaving life and the world in a calming manner. This could be a hippie topic, maybe influenced by older musicians. Weather and life are major themes in music, and this topic seems to encompass both.

Topic 5: Hell, Burn, God, Die, Blood

This topic is very gory and is possibly associated with goth or metal music. It stands in stark contrast to topic 4.

All in all, the topics are pretty distinguishable, and each revolves around a solid theme that is often seen in song lyrics.

### Second try: 10 topics and 1000 iterations

In [None]:
# Training a MALLET analysis with an increased number of topics
train-topics --input project-mallet.mallet --num-topics 10 --num-iterations 1000 \
--output-doc-topics project-mallet-ten-top-doc-topics.txt \
--output-topic-keys project-mallet-ten-top-topic-keys.txt \
--word-topic-counts-file project-mallet-ten-top-word-topic-counts-file.txt \
--topic-word-weights-file project-mallet-ten-top-topic-word-weights-file.txt \
--xml-topic-report project-mallet-ten-top-topic-report.xml \
--xml-topic-phrase-report project-mallet-ten-top-topic-phrase-report.xml \
--show-topics-interval 10 \
--use-symmetric-alpha false  \
--optimize-interval 100 \
--diagnostics-file project-mallet-ten-top-diagnostics.xml

### Results

0	0.15045	god lord jesus sing heaven world soul life man born free king people chorus peace glory holy he's children earth 

1	0.42227	night light sun day sky eyes rain time hear wind blue moon dream fly there's home world it's sea high 

2	0.27399	baby yeah i'm gonna hey wanna girl don't make ain't ooh you're gotta good tonight it's night dance let's rock 

3	0.03075	christmas santa dem year tree ring merry rhythm yuh bow happy bum cha les vocals music ding bells che claus 

4	0.50516	love heart i'll life you're give hold time it's make eyes feel i'm i've true i'd can't chorus mine day 

5	0.10735	i'm ain't shit fuck nigga don't money back bitch niggas that's it's chorus hit ass make man yeah put y'all 

6	0.3373	man she's he's big back home town boy good boys girl girls ain't money that's city day made night house 

7	0.32078	dead die fire we're blood head hell it's world death burn life black war kill fight pain face eyes hate 

8	0.06995	ang bang mary john doctor jane lee billy chorus lang row kung yang kong dumb ako man jean dah ikaw 

9	0.62393	i'm don't it's can't you're time i've back won't make i'll gonna there's feel mind things life that's find you've 

Topic 1: GOSPEL CLUSTER: God, Lord, Heaven, Chorus, Jesus.

Topic 2: WEATHER CLUSTER: Night, sun, rain, wind, moon, blue sky

Topic 3: R&B CLUSTER: Baby, girl, tonight, dance, rock, ooh, yeah

Topic 4: CHRISTMAS CLUSTER: Santa, Christmas, merry, tree, bells

Topic 5: LOVE CLUSTER: Love, heart, eyes, feel, chorus, mine, life.

Topic 6: HIP HOP CLUSTER: Expletives. Chorus. I'm

Topic 7: LESS EXPLICIT HIP HOP CLUSTER: Big, Town, Money, Girls, Night, Man (possibly country)

Topic 8: GORY CLUSTER: Die, Blood, hell, death, burn, black, pain, hate

Topic 9: NO IDEA CLUSTER: gibberish
This could be because some of the artists in our corpus create non-English music, so their lyrics contain words from languages other than English (for example, "ako" and "ikaw" are Tagalog). The cluster also picks up proper names such as Mary, John, and Billy.

Topic 10: EMOTION/REGRET CLUSTER: can't, time, feel, mind, life, find

Both MALLET models highlighted similar topics. Increasing the number of topics to 10 gave us a wider breadth of topics, which makes sense given that our corpus likely spans many genres. Most of the topic clusters from the first analysis (Love, Weather, Gory) reappear in the second, but the second analysis does a better job of illustrating the nuances within the major themes in music. For example, it was interesting that MALLET split topics 6 and 7. Both clusters seem to center on Hip-Hop lyrics, but topic 6 is much more expletive-heavy, while topic 7 focuses on other important Hip-Hop themes, specifically money, houses, and girls. Overall, the MALLET analysis gives us a good starting point for our similarity analysis and helps us better understand the major themes of the lyrical content in our corpus.
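
One way to make the comparison between the two models less impressionistic is to measure the overlap of their top-word lists. The sketch below uses Jaccard similarity (shared words divided by all distinct words) on a few of the topic-key lines reported above. The helper names are hypothetical, the ids are MALLET's 0-based topic ids, and this is only a rough check on 20 top words per topic, not a measure MALLET itself reports.

```python
def top_words(line):
    """Turn a space-separated topic-keys word list into a set."""
    return set(line.split())

def jaccard(a, b):
    """Share of words common to both sets, out of all distinct words."""
    return len(a & b) / len(a | b)

def rank_matches(words, candidates):
    """Return (topic_id, score) pairs for candidates, best match first."""
    scores = [(topic_id, jaccard(words, cand)) for topic_id, cand in candidates.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Top words copied from the 5-topic results (0-based ids)
five_topic = {
    0: top_words("i'm love don't it's you're baby can't time make yeah i'll gonna i've feel heart back give won't life that's"),
    4: top_words("dead die blood hell death war man kill people fight we're hate black head it's burn power they're men god"),
}

# Top words copied from the 10-topic results (0-based ids)
ten_topic = {
    4: top_words("love heart i'll life you're give hold time it's make eyes feel i'm i've true i'd can't chorus mine day"),
    7: top_words("dead die fire we're blood head hell it's world death burn life black war kill fight pain face eyes hate"),
    9: top_words("i'm don't it's can't you're time i've back won't make i'll gonna there's feel mind things life that's find you've"),
}

for topic_id, words in five_topic.items():
    print(topic_id, rank_matches(words, ten_topic)[:2])
```

A 5-topic cluster that splits in the larger model should show up as two or more candidates with moderate overlap, rather than a single close match.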

It was also interesting to look up individual songs and see which topic they mostly embody. For example, ABBA's song "Givin' a little bit more" has a topic 1 score of 0.967 in the first MALLET analysis, meaning that the song focuses on themes such as Love and Life. In the second MALLET analysis, the same song has a topic 3 score of 0.509 and a topic 10 score of 0.403; both topics have themes around girls and emotion, similar to topic 1 in the first analysis. The same comparison can be repeated for any other song in the corpus.
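
Per-song scores like these come from the `--output-doc-topics` files. A minimal sketch for looking them up in pandas is given below. It assumes the dense layout written by recent MALLET releases (document index, document name, then one proportion per topic, all tab-separated); older releases write sorted topic and proportion pairs instead, which this sketch does not handle. The helper `read_doc_topics` and the sample rows are hypothetical, and columns are numbered from 1 to match the topic numbering used in this write-up.

```python
import io
import pandas as pd

def read_doc_topics(source, num_topics):
    """Parse dense doc-topics lines into a DataFrame indexed by document name."""
    rows = {}
    for line in source:
        if line.startswith("#") or not line.strip():
            continue  # skip a header line or blank lines
        fields = line.rstrip("\n").split("\t")
        name = fields[1]
        rows[name] = [float(x) for x in fields[2:2 + num_topics]]
    columns = [k + 1 for k in range(num_topics)]  # 1-based topic numbers
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)

# Made-up rows for illustration; with a real file, pass
# open('project-mallet-doc-topics.txt') and num_topics=5
sample = "0\tsong_a\t0.90\t0.05\t0.05\n1\tsong_b\t0.10\t0.20\t0.70\n"
doc_topics = read_doc_topics(io.StringIO(sample), 3)

# Dominant topic per song, and the two strongest topics for one song
dominant = doc_topics.idxmax(axis=1)
top_two = doc_topics.loc["song_b"].nlargest(2)
print(dominant)
print(top_two)
```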