# 4. Let's try text analysis!

There are so many things you can do with text analysis! Create visualizations, convert text to tables, or compare texts. In this notebook we are going to do all three.

## 1. We need to pick a dataset

How about music lyrics? MusiXmatch offers an API that allows non-commercial use, including access to lyrics. This access only returns 30% of each song's lyrics, but we will use it anyway for practice. I was able to get access to this API with my .edu email address. Many APIs offer educational access.

First we need to get the lyrics. I did this by making an API call to MusiXmatch. Let's pick a band or musician and a song. Any suggestions?

In [None]:
import Constants
import json
import re

In [None]:
from musixmatch import Musixmatch

music_key=Constants.MUSIXMATCH_KEY

musixmatch = Musixmatch(music_key)

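# Look up the lyrics by song title and artist name
lyrics = musixmatch.matcher_lyrics_get('[SONG]', '[ARTIST]')
# json.dumps escapes each newline as a literal \n, which is then replaced with a space
lyrics = re.sub(r"\\n", " ", json.dumps(lyrics["message"]["body"]["lyrics"]["lyrics_body"]))
# Keep only the text before the last "..." and drop whatever follows it
lyrics = re.sub(r"(.+)(\.\.\..+)", r"\1", lyrics)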

print (lyrics)
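
The cleanup step can be tried without an API key. The sketch below is only an illustration: `sample_response` is a made-up stand-in that copies the nesting used in the cell above (`message`, `body`, `lyrics`, `lyrics_body`), and `clean_lyrics` is a hypothetical helper name. It assumes the truncated lyrics end with a `...` marker followed by a notice, which is what the regular expression above appears to target.

```python
import re

# Made-up stand-in for the API response; only the nesting is taken from the cell above.
sample_response = {
    "message": {
        "body": {
            "lyrics": {
                "lyrics_body": "First line of the song\nSecond line of the song\n...\n\n******* Placeholder notice *******"
            }
        }
    }
}


def clean_lyrics(response):
    """Return the lyric text on one line, without the trailing notice."""
    text = response["message"]["body"]["lyrics"]["lyrics_body"]
    # Cut everything from the last "..." onward (assumed truncation marker and notice).
    text = text.rsplit("...", 1)[0]
    # Collapse newlines and repeated whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_lyrics(sample_response)
print(cleaned)
```

Working on the string directly, rather than on its `json.dumps` form, also avoids the quotation marks that `json.dumps` wraps around the text.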


## 2. Convert the text to tokens

Great! Now we have the song lyrics as text named `lyrics`. What can we do with it now? Let's start by turning the lyrics into tokens using spaCy.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(lyrics)
print(doc)
print (len(lyrics))
print (len(doc))

The printed lyrics look the same, but the two lengths differ: `len(lyrics)` counts characters, while `len(doc)` counts tokens. That difference shows that `doc` contains tokens rather than a plain string.
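
To see why the two lengths differ, here is a small sketch that compares a character count with a token count. It does not use spaCy: the regular expression is a rough stand-in tokenizer chosen for illustration, and spaCy's own rules are more detailed.

```python
import re

text = "Hello, world! Hello again."

# Rough stand-in tokenizer (an assumption for illustration, not spaCy's rules):
# each run of word characters is a token, and each punctuation mark is its own token.
tokens = re.findall(r"\w+|[^\w\s]", text)

print(tokens)
print(len(text))    # number of characters in the string
print(len(tokens))  # number of tokens
```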

## 3. Create visualizations

One way to analyze text is to visualize its grammatical structure. spaCy has a built-in dependency visualizer, displaCy, which shows each token's part of speech and the dependencies between tokens:

In [None]:
from spacy import displacy

In [None]:
displacy.render(doc, style="dep", jupyter = True, options={'distance':140})

## 4. Convert part-of-speech tagging to tables

Creating a table out of the parts of speech will make data analysis even easier. We could even compare to other songs! So let's start by using Pandas to convert our part-of-speech analysis to a table with the count of each part of speech.

In [None]:
#!pip install pandas

In [None]:
import pandas as pd

In [None]:
list_of_strings  = [i.text for i in doc]
list_of_tokens = [j.pos_ for j in doc]

In [None]:
# One row per token: the token text and its part-of-speech tag
df = pd.DataFrame({'token': list_of_strings, 'POS': list_of_tokens })

In [None]:
df['POS'].value_counts()

I want to see how this song compares to another. First, we need to bring in another song.

In [None]:
from musixmatch import Musixmatch

music_key=Constants.MUSIXMATCH_KEY

musixmatch = Musixmatch(music_key)

lyrics2 = musixmatch.matcher_lyrics_get('[SONG]', '[ARTIST]')
lyrics2 = re.sub(r"\\n", " ", json.dumps(lyrics2["message"]["body"]["lyrics"]["lyrics_body"]))
lyrics2 = re.sub(r"(.+)(\.\.\..+)", r"\1", lyrics2)

print (lyrics2)

In [None]:
doc2 = nlp(lyrics2)
print(doc2)
print (len(lyrics2))
print (len(doc2))

In [None]:
list_of_strings2  = [k.text for k in doc2]
list_of_tokens2 = [l.pos_ for l in doc2]

In [None]:
# One row per token of the second song: the token text and its part-of-speech tag
df2 = pd.DataFrame({'token': list_of_strings2, 'POS': list_of_tokens2 })

In [None]:
df2['POS'].value_counts()

Raw counts are hard to compare when the two songs have different numbers of tokens, so percentages give a fairer comparison. Let's convert the counts to percentages:

In [None]:
table2= df2['POS'].value_counts(normalize=True).rename_axis('unique_values').reset_index(name='Song 2')
table2['Song 2'] = table2['Song 2'] * 100
print(table2)

In [None]:
table= df['POS'].value_counts(normalize=True).rename_axis('unique_values').reset_index(name='Song 1')
table['Song 1'] = table['Song 1'] * 100

print(table)
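
Both songs go through the same conversion, so the steps could be wrapped in a small helper. This is a minimal sketch: `pos_percent_table` is a hypothetical name, and the tags are made up for illustration. With spaCy, the list would come from something like `[t.pos_ for t in doc]`.

```python
import pandas as pd


def pos_percent_table(pos_tags, column_name):
    """Build a table of part-of-speech percentages from a list of tags."""
    # normalize=True gives fractions that sum to 1; multiply by 100 for percentages.
    percentages = pd.Series(pos_tags).value_counts(normalize=True) * 100
    return percentages.rename_axis("unique_values").reset_index(name=column_name)


# Made-up tags for illustration only.
example_tags = ["NOUN", "VERB", "NOUN", "PUNCT"]
example_table = pos_percent_table(example_tags, "Song 1")
print(example_table)
```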

## 5. Compare songs!

Now that the parts of speech are converted to tables, we can put the two tables together and compare the songs.

In [None]:
right = table
left = table2
result = pd.merge(left, right, on=['unique_values'])
print(result)
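
Note that `pd.merge` performs an inner join by default, so a part of speech that appears in only one song is dropped from the merged table. If those rows should be kept, one option is an outer join with missing percentages filled in as 0. The sketch below uses made-up percentages to show the difference.

```python
import pandas as pd

# Made-up percentages for illustration only.
song1 = pd.DataFrame({"unique_values": ["NOUN", "VERB", "INTJ"], "Song 1": [50.0, 30.0, 20.0]})
song2 = pd.DataFrame({"unique_values": ["NOUN", "VERB", "ADJ"], "Song 2": [40.0, 40.0, 20.0]})

# Default inner join: only tags present in both songs survive.
inner = pd.merge(song2, song1, on="unique_values")

# Outer join: tags from either song are kept; a tag missing from a song gets 0.
outer = pd.merge(song2, song1, on="unique_values", how="outer").fillna(0)

print(inner)
print(outer)
```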

This kind of comparison is easier to read as a bar chart, which makes the differences between the songs visible at a glance. We can import another tool, `matplotlib`, to turn the table above into a graph.

In [None]:
#! pip install matplotlib

In [None]:
from matplotlib import pyplot as plt

In [None]:
# Put each part of speech on the x-axis, with one bar per song
result.plot.bar(x='unique_values', rot=0, figsize=(20, 10));

The nice part of Jupyter notebooks is that you can run through an entire text analysis, change your mind, and run things again differently. We might want to try removing spaces or punctuation from this comparison, for example.
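
One way to try that is to filter the token table before counting. The sketch below uses a made-up table in the same shape as `df`, and assumes punctuation is tagged `PUNCT` and whitespace is tagged `SPACE`, as in spaCy's coarse-grained part-of-speech tags.

```python
import pandas as pd

# Made-up token table in the same shape as df (token text plus POS tag).
tokens_df = pd.DataFrame({
    "token": ["Hello", ",", " ", "world", "!"],
    "POS": ["INTJ", "PUNCT", "SPACE", "NOUN", "PUNCT"],
})

# Keep only rows whose tag is neither punctuation nor whitespace.
words_only = tokens_df[~tokens_df["POS"].isin(["PUNCT", "SPACE"])]

# Percentages are now computed over the remaining tokens only.
print(words_only["POS"].value_counts(normalize=True) * 100)
```

Rerunning the percentage and merge cells on the filtered table would then give a comparison based on words alone.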