Tutorials

* https://github.com/derekgreene/topic-model-tutorial
* https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html 
* https://nlpforhackers.io/topic-modeling/

In [1]:
import pandas as pd

In [4]:
song_lyrics = pd.read_csv("lyrics.csv", index_col="index")
song_lyrics.head()

index,song,year,artist,genre,lyrics
0,ego-remix,2009,beyonce-knowles,Pop,"Oh baby, how you doing?\nYou know I'm gonna cu..."
1,then-tell-me,2009,beyonce-knowles,Pop,"playin' everything so easy,\nit's like you see..."
2,honesty,2009,beyonce-knowles,Pop,If you search\nFor tenderness\nIt isn't hard t...
3,you-are-my-rock,2009,beyonce-knowles,Pop,"Oh oh oh I, oh oh oh I\n[Verse 1:]\nIf I wrote..."
4,black-culture,2009,beyonce-knowles,Pop,"Party the people, the people the party it's po..."


In [8]:
print(song_lyrics.iloc[0]['lyrics'])

Oh baby, how you doing?
You know I'm gonna cut right to the chase
Some women were made but me, myself
I like to think that I was created for a special purpose
You know, what's more special than you? You feel me
It's on baby, let's get lost
You don't need to call into work 'cause you're the boss
For real, want you to show me how you feel
I consider myself lucky, that's a big deal
Why? Well, you got the key to my heart
But you ain't gonna need it, I'd rather you open up my body
And show me secrets, you didn't know was inside
No need for me to lie
It's too big, it's too wide
It's too strong, it won't fit
It's too much, it's too tough
He talk like this 'cause he can back it up
He got a big ego, such a huge ego
I love his big ego, it's too much
He walk like this 'cause he can back it up
Usually I'm humble, right now I don't choose
You can leave with me or you could have the blues
Some call it arrogant, I call it confident
You decide when you find on what I'm working with
Damn I know I'm kil

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

In [19]:
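# Drop English stop words and keep only terms that appear in at least 50 documents.
vectorizer = CountVectorizer(stop_words="english", min_df=50)
# Random sample of 10,000 songs with lyrics; no random_state is set, so results vary per run.
X = vectorizer.fit_transform(song_lyrics['lyrics'].dropna().sample(10000))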
X

<10000x1822 sparse matrix of type '<class 'numpy.int64'>'
	with 396897 stored elements in Compressed Sparse Row format>
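
To make the effect of `stop_words` and `min_df` concrete, here is a minimal sketch on a three-document toy corpus. The corpus and the `toy_*` names are hypothetical and not part of the lyrics dataset; `min_df=2` stands in for the `min_df=50` used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not taken from lyrics.csv).
toy_docs = [
    "love you baby",
    "baby love the night",
    "the night is long",
]

# Same settings as above, except min_df=2 so the filter is visible on 3 documents.
toy_vectorizer = CountVectorizer(stop_words="english", min_df=2)
toy_X = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index; sort terms by that index.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)
print(toy_X.toarray())
```

Stop words such as "the" and "you" are removed before counting, and "long" is dropped because it appears in only one document, so three columns remain.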

In [20]:
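# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions.
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())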


Unnamed: 0,10,20,2x,40,50,able,accept,act,acting,action,...,years,yellow,yes,yesterday,yo,york,young,youre,youth,zu
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


In [22]:
words = vectorizer.get_feature_names()
print("Vocabulary has %d distinct terms" % len(words))
words[0:100] # first 100 words

Vocabulary has 1822 distinct terms


['10',
 '20',
 '2x',
 '40',
 '50',
 'able',
 'accept',
 'act',
 'acting',
 'action',
 'add',
 'admit',
 'advice',
 'afraid',
 'age',
 'ago',
 'ah',
 'ahead',
 'ahora',
 'ai',
 'ain',
 'aint',
 'air',
 'al',
 'album',
 'alive',
 'alles',
 'alright',
 'als',
 'american',
 'amor',
 'angel',
 'angels',
 'anger',
 'angry',
 'answer',
 'answers',
 'anybody',
 'anymore',
 'apart',
 'aren',
 'arm',
 'arms',
 'art',
 'ashes',
 'aside',
 'ask',
 'asked',
 'asking',
 'asleep',
 'ass',
 'attack',
 'attention',
 'au',
 'auch',
 'auf',
 'aus',
 'awake',
 'away',
 'ay',
 'babe',
 'baby',
 'bad',
 'bag',
 'bags',
 'ball',
 'band',
 'bang',
 'bank',
 'bar',
 'barely',
 'bass',
 'battle',
 'beach',
 'bear',
 'beast',
 'beat',
 'beating',
 'beats',
 'beautiful',
 'beauty',
 'bed',
 'bedroom',
 'beef',
 'beer',
 'beg',
 'began',
 'begging',
 'begin',
 'beginning',
 'begins',
 'begun',
 'believe',
 'believed',
 'bell',
 'bells',
 'belong',
 'bend',
 'beneath',
 'benz']

In [23]:
from sklearn.decomposition import LatentDirichletAllocation

In [25]:
NUM_TOPICS = 20

# fit_transform returns the document-topic matrix: one row per song, one column per topic.
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10)
lda_Z = lda_model.fit_transform(X)
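
The matrix returned by `fit_transform` can be read as a per-document topic distribution. The sketch below illustrates this on a small synthetic count matrix rather than the real `X`; the `toy_*` names, the Poisson counts, and `random_state=0` are assumptions made only so the example is self-contained and repeatable.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
# Hypothetical stand-in for X: 30 documents, 12 terms, small Poisson counts.
toy_counts = rng.poisson(1.0, size=(30, 12))

toy_lda = LatentDirichletAllocation(n_components=3, max_iter=10, random_state=0)
toy_Z = toy_lda.fit_transform(toy_counts)

# Each row is a distribution over topics, so it sums to 1.
row_sums = toy_Z.sum(axis=1)
# Index of the highest-weighted topic for each document.
dominant_topic = toy_Z.argmax(axis=1)

print(toy_Z.shape)                 # (documents, topics)
print(toy_lda.components_.shape)   # (topics, terms)
```

Applied to the notebook's own objects, `lda_Z.argmax(axis=1)` would give the dominant topic for each sampled song in the same way.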



In [52]:
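def print_topics(model, vectorizer, top_n=10):
    """Print the top_n highest-weighted terms for each topic."""
    # Look up the vocabulary once instead of once per printed term.
    terms = vectorizer.get_feature_names()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % idx)
        # argsort is ascending, so the reversed slice yields the top_n largest weights first.
        print([terms[i] for i in topic.argsort()[:-top_n - 1:-1]])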
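
The slice `[:-top_n - 1:-1]` in `print_topics` is easy to misread, so here is a small sketch of what it does. The weights and terms are made-up values for illustration, not output of the fitted model.

```python
import numpy as np

# Hypothetical topic-term weights for a five-word vocabulary.
weights = np.array([0.1, 0.7, 0.05, 0.4, 0.2])
terms = ["day", "love", "rain", "baby", "night"]
top_n = 3

# argsort returns indices in ascending order of weight.
order = weights.argsort()
# Walk backwards from the end and stop after top_n entries: largest weights first.
top_idx = order[:-top_n - 1:-1]
top_terms = [terms[i] for i in top_idx]
print(top_terms)
```

An equivalent spelling is `np.argsort(-weights)[:top_n]`, which some readers find clearer.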
            

In [53]:
print_topics(lda_model, vectorizer)

Topic 0:
['life', 'time', 'day', 'll', 'new', 'way', 'people', 'just', 'live', 'mind']
Topic 1:
['ve', 'days', 'thought', 'times', 'time', 'things', 'seen', 'just', 'gone', 'heart']
Topic 2:
['na', 'hard', 'fall', 'walk', 'today', 'tomorrow', 'carry', 'di', 'mo', 'walking']
Topic 3:
['tonight', 'long', 'gone', 'coming', 'sound', 'stop', 'bring', 'inside', 'hear', 'running']
Topic 4:
['baby', 'don', 'let', 'come', 'wanna', 'want', 'yeah', 'just', 'need', 'make']
Topic 5:
['did', 'said', 'old', 'white', 'didn', 'like', 'young', 'blue', 'bad', 'black']
Topic 6:
['god', 'world', 'soul', 'free', 'sing', 'blood', 'come', 'song', 'heart', 'life']
Topic 7:
['ya', 'im', 'shake', 'dont', 'round', 'somebody', 'til', 'goes', 'city', 'town']
Topic 8:
['gonna', 'time', 'alright', 'rock', 'lord', 'day', 'roll', 'lay', 'fool', 'make']
Topic 9:
['oh', 'yeah', 'whoa', 'everybody', 'beat', 'night', 'lover', 'big', 'stand', 'deny']
Topic 10:
['got', 'like', 'ain', 'know', 'cause', 'em', 'just', 'man', 'mo