## Purpose

The purpose of this notebook is to show a working example of how to use Google's Universal Sentence Encoder to compare two different strings. The general idea is to apply some basic NLP techniques with spaCy to increase the weight of the 'important' parts of a sentence, then apply the sentence encoder to get a vector representation of each text. These vectors are later compared.

In general, one news article is taken as input. It is compared to a dataframe of songs that have already been processed by the encoder, and the song with the highest cosine similarity to the article is selected.
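As a minimal sketch of the selection step (not the notebook's own code), the snippet below computes cosine similarity from its definition and picks the closest candidate. The vectors and the names `cosine_sim`, `article_vec`, and `song_vecs` are made up for illustration; real encoder vectors are much longer.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only)
article_vec = [1.0, 0.0, 1.0]
song_vecs = {
    "song_a": [1.0, 0.0, 0.9],
    "song_b": [0.0, 1.0, 0.0],
}

# The selected song is the one whose vector points in the most similar direction
best = max(song_vecs, key=lambda name: cosine_sim(article_vec, song_vecs[name]))
print(best)
```

Cosine similarity only depends on the direction of the vectors, not their length, which is why it is a common choice for comparing embeddings.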

In [9]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/14/3d/efb655a670b98f62ec32d66954e1109f403db4d937c50d779a75b9763a29/sentencepiece-0.1.83-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)

In [0]:
# Standard scientific stack
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# NLP and text processing
import spacy
import sentencepiece
from newspaper import Article
from nltk.corpus import stopwords
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Modelling and similarity
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score, train_test_split

In [0]:
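def embed(text):
    """Encode a list of strings with the Universal Sentence Encoder.

    Returns a DataFrame with one row per input string and one column per embedding dimension.
    """
    print('Start')
    print('Starting embeddings...')
    # Alternative module path (left commented out):
    # embed_US = hub.Module("universal_sentence")
    embed_US = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
    embeddings = embed_US(text)
    print('Extracting embeddings...')
    with tf.Session() as sess:
        # Initialize the module's variables and lookup tables before running it
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        embd = sess.run(embeddings)
    dim_vector = ['Dim_{}'.format(i) for i in range(embd.shape[1])]
    df_return = pd.DataFrame(embd, columns=dim_vector)
    return df_return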

In [0]:
# First, download and parse the article that we want to match
article = Article('https://www.voanews.com/usa/us-politics/sanders-still-wants-revolution-now-hes-got-company')
article.download()
article.parse()
article_text = article.text

In [0]:
# Now we import all of the song lyrics (this should probably be done in another Python script)
df_songs = pd.read_csv('songdata.csv')

In [14]:
political_artists = [
    'Zac Brown Band',
    'Ziggy Marley',
    'The Beatles',
    'Arrogant Worms',
    'Billy Joel',
    'Bob Marley',
    'Coldplay',
    'Creedence Clearwater Revival',
    'Elton John',
    'Eminem',
    'Fleetwood Mac',
    'Garth Brooks',
    'John Denver',
    'Kanye West',
    'Linkin Park',
    'Lynyrd Skynyrd',
    'Rage Against The Machine',
    'Rascal Flatts',
    'Red Hot Chili Peppers',
    'System Of A Down',
    'Tragically Hip',
    'The White Stripes',
]
# The artist filter is currently disabled, so all songs are used
#df_songs_political = df_songs[df_songs['artist'].isin(political_artists)]
df_songs_political = df_songs
df_songs_political.head()


Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


In [0]:
df_songs_political=df_songs_political.dropna(subset=['text'])
df_songs_political['text'] = df_songs_political['text'].apply(lambda x: x.replace('\n',' '))


In [0]:
df_songs_politcal_lyrics= list(df_songs_political.iloc[:,3])

## Here we begin to implement some of the NLP

Stemming is still to be implemented after this step.

In [0]:
# This cell removes stop words and weights the text to focus on nouns, proper nouns, adjectives, and verbs.
def nlp_weighting(input_list):
    print('Start')
    # 'en' is the spaCy 2.x shortcut; on spaCy 3+ load a named model such as 'en_core_web_sm'
    nlp = spacy.load('en')
    # Number of times each lemma is emitted, by part of speech (default: once).
    # Nouns and proper nouns are tripled; adjectives and verbs are doubled.
    repeats = {'NOUN': 3, 'PROPN': 3, 'ADJ': 2, 'VERB': 2}
    newtext = []

    for doc in input_list:
        nlpdoc = nlp(doc)
        tempDoc = ''
        for token in nlpdoc:
            if not token.is_stop:
                # Repeating a lemma increases its weight in the text passed to the encoder
                tempDoc += (' ' + str(token.lemma_)) * repeats.get(token.pos_, 1)

        # A hard cutoff at 2100 characters used to be applied here because of memory
        # issues with the encoding; it has since been taken out.
        # Documents shorter than 110 characters after weighting are blanked out.
        if len(tempDoc) < 110:
            tempDoc = ''

        newtext.append(tempDoc)

    print('Returned')

    return newtext


In [18]:
df_songs_politcal_lyrics = nlp_weighting(df_songs_politcal_lyrics)

Start
Returned
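To make the effect of the weighting concrete, here is a small sketch that applies the same repetition counts to hand-written tokens, so it runs without spaCy. The `(lemma, pos, is_stop)` triples and the `weight_tokens` helper are hypothetical stand-ins for spaCy tokens, and unlike `nlp_weighting` the result has no leading space.

```python
# Hypothetical (lemma, part of speech, is_stop) triples standing in for spaCy tokens
tokens = [
    ("the", "DET", True),
    ("senator", "NOUN", False),
    ("want", "VERB", False),
    ("big", "ADJ", False),
    ("change", "NOUN", False),
]

# Same repetition counts as nlp_weighting: nouns and proper nouns x3,
# adjectives and verbs x2, everything else x1
REPEATS = {"NOUN": 3, "PROPN": 3, "ADJ": 2, "VERB": 2}

def weight_tokens(tokens):
    parts = []
    for lemma, pos, is_stop in tokens:
        if is_stop:
            continue  # stop words are dropped entirely
        parts.extend([lemma] * REPEATS.get(pos, 1))
    return " ".join(parts)

weighted = weight_tokens(tokens)
print(weighted)
# senator senator senator want want big big change change change
```

The weighted string is what gets passed to the encoder, so repeated lemmas take up a larger share of the input text.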


Note: weighting with TF-IDF is an alternative, but it would be harder to implement. In practice, we use a pretrained model that returns a vector for each text, and we compare those vectors for similarity. We will also want to restrict the size of our data to keep the comparisons feasible.
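One possible way to keep memory use manageable with tens of thousands of lyrics is to encode them in batches. The sketch below is not part of the original notebook: `embed_in_batches`, `embed_fn`, and `fake_embed` are hypothetical names, and `embed_fn` is assumed to map a list of strings to a list of vectors.

```python
def embed_in_batches(texts, embed_fn, batch_size=1000):
    """Apply embed_fn to texts in chunks and collect the resulting rows in order.

    embed_fn is a hypothetical callable: list of str -> list of vectors.
    """
    rows = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        rows.extend(embed_fn(batch))
    return rows

# Stand-in encoder for demonstration only: maps each text to a
# one-dimensional "vector" holding its length
def fake_embed(batch):
    return [[float(len(t))] for t in batch]

vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(vectors)
# [[1.0], [2.0], [3.0], [4.0], [5.0]]
```

The notebook's `embed` function returns a DataFrame rather than a list, so it would need a small adapter before it could be used as `embed_fn`.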

Implement embedding below

In [19]:
len(df_songs_politcal_lyrics)

57650

In [20]:
df_songs_political_lyrics_embed = embed(df_songs_politcal_lyrics)

Start
Starting embeddings...
INFO:tensorflow:Saver not created because there are no variables in the graph to restore



Extracting embeddings...


In [0]:
df_songs_political_lyrics_embed.to_csv('Lyric_embeddings.csv')

### Below we apply the NLP to the Article

We will use the functions from above to do this

In [22]:
article_text = nlp_weighting([article_text])

Start
Returned


In [23]:
article_text_embed = embed([article_text[0],'temp']).iloc[0,:]

Start
Starting embeddings...
INFO:tensorflow:Saver not created because there are no variables in the graph to restore



Extracting embeddings...


In [24]:
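# Linear scan over all song embeddings, keeping the row with the highest cosine similarity
max_cos = 0
max_col = ''
for i in range(len(df_songs_political_lyrics_embed)):
    # cosine_similarity returns a 1x1 array here, so take the scalar out of it
    temp_cos = cosine_similarity([article_text_embed], [df_songs_political_lyrics_embed.iloc[i]])[0, 0]
    if temp_cos > max_cos:
        max_cos = temp_cos
        max_col = df_songs_political.iloc[i]
print(max_col)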

artist                                              The Jam
song                                           In The Crowd
link                      /j/jam/in+the+crowd_20068837.html
text      When I'm in the crowd, I don't see anything   ...
Name: 8973, dtype: object
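The row-by-row loop above can be slow for tens of thousands of songs. As a sketch of an alternative (not the notebook's code), all similarities can be computed at once with a matrix product. The `best_match` helper and the toy arrays are hypothetical, and only NumPy is assumed.

```python
import numpy as np

def best_match(query_vec, matrix):
    """Return (row index, cosine similarity) of the row of matrix closest to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    rows = np.asarray(matrix, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms
    sims = rows @ query / (np.linalg.norm(rows, axis=1) * np.linalg.norm(query))
    idx = int(np.argmax(sims))
    return idx, float(sims[idx])

# Toy data, made up for illustration
songs = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.1]])
article = np.array([1.0, 0.0])

idx, score = best_match(article, songs)
print(idx, round(score, 3))
```

With the notebook's data, `matrix` would correspond to `df_songs_political_lyrics_embed.values` and the returned index could be passed to `df_songs_political.iloc`. Rows with zero norm would produce NaN similarities, so they would need to be filtered out first.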


In [25]:
max_col['text']

"When I'm in the crowd, I don't see anything   My mind goes a blank, in the humid sunshine   When I'm in the crowd I don't see anything   I fall into a trance, at the supermarket   The noise flows me along, as I catch falling cans   Of baked beans on toast, technology is the most      And everyone seems just like me,   They struggle hard to set themselves free   And their waiting for the change      When I'm in the crowd, I can't remember my name   And my only link is a pint of walls ice cream   When I'm in the crowd - I don't see anything      Sometimes I think that its a plot   An equilibrium melting pot   The government sponsors underhand   When I'm in the crowd   When I'm in the crowd   When I'm in the crowd      And everyone seems that they're acting a dream   Cause they're just not thinking about each other   And they're taking orders, which are media spawned   And they should know better, now you have been warned      And don't forget you saw it here first   When I'm in the crow