## Purpose

The purpose of this notebook is to show a working example of how to use Google's Universal Sentence Encoder to compare two strings. The general idea is to apply some basic NLP techniques with spaCy to increase the weight of the 'important' parts of a sentence, then apply the sentence encoder to get a vector representation of each sentence. These vectors are compared later in the notebook.

In general, a single news article is given as input. It is compared against a dataframe of songs that have already been processed by the encoder, and the song with the highest cosine similarity to the article is selected.
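
The selection step can be illustrated without the encoder. The following is a minimal, dependency-free sketch: the three-dimensional vectors and the names `best_match`, `song_a`, `song_b`, and `song_c` are made up for illustration, and real encoder vectors are much longer.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)

def best_match(query, candidates):
    """Return (name, score) for the candidate vector most similar to query."""
    scored = ((name, cosine(query, vec)) for name, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical toy "embeddings" standing in for the article and the songs.
article_vec = [1.0, 0.0, 1.0]
song_vecs = {
    "song_a": [0.0, 1.0, 0.0],
    "song_b": [2.0, 0.0, 2.0],
    "song_c": [1.0, 1.0, 0.0],
}

name, score = best_match(article_vec, song_vecs)
print(name, round(score, 3))
```

Cosine similarity ignores vector length, so `song_b` matches the article perfectly even though its values are twice as large.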

In [2]:
import numpy as np
import pandas as pd
import spacy
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import sentencepiece
import tensorflow as tf
import tensorflow_hub as hub
from newspaper import Article
from nltk.corpus import stopwords
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score, train_test_split

In [3]:
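def embed(text):
    """Encode a list of strings and return a DataFrame with one row per string."""
    print('Start')
    print('Starting embeddings...')
    # Load the Universal Sentence Encoder from the path 'universal_sentence'
    # (presumably a local copy). hub.Module and tf.Session are TensorFlow 1.x APIs.
    embed_US = hub.Module("universal_sentence")
    #embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
    embeddings = embed_US(text)
    print('Extracting embeddings...')
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        embd = sess.run(embeddings)
    # Name the columns Dim_0, Dim_1, ... so each embedding dimension is a column.
    dim_vector = ['Dim_{}'.format(i) for i in range(embd.shape[1])]
    df_return = pd.DataFrame(embd, columns=dim_vector)
    return df_return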

In [4]:
# First, download and parse the article we want to match
article = Article('https://www.cbc.ca/news/world/cocaine-bust-philadelphia-ship-1.5180447')
article.download()
article.parse()
article_text = article.text

In [5]:
# Now we load all of the song lyrics (this should probably be done in a separate Python script)
df_songs = pd.read_csv('lyrics.csv')

In [None]:
political_artists = ['eminem-d12', 'eminem']  # , 'bob-marley', 'bob-marley-the-wailers', 'beyonce-knowles'
df_songs_political = df_songs[df_songs['artist'].isin(political_artists)]
df_songs_political.head()


| Unnamed: 0 | index | song | year | artist | genre | lyrics |
|---|---|---|---|---|---|---|
| 283031 | 283031 | fight-music | 2013 | eminem-d12 | Hip-Hop | [Chorus: Eminem]\nThis kinda music\nUse it and... |
| 283032 | 283032 | keep-talkin | 2008 | eminem-d12 | Hip-Hop | Yeah, Detroit, motherfucka\nDJ Green Lantern, ... |
| 283033 | 283033 | i-ll-shit-on-you | 2008 | eminem-d12 | Hip-Hop | I'll shit on you, da da, da da, da da\nI'll sh... |
| 313003 | 313003 | people-make-me | 2009 | eminem | Hip-Hop | |
| 313004 | 313004 | my-darling | 2009 | eminem | Hip-Hop | Ya look\nIf I were to rap about the crap that'... |


In [None]:
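# Drop songs without lyrics, then replace newlines with spaces.
# .copy() avoids pandas' SettingWithCopyWarning when assigning to the filtered frame.
df_songs_political = df_songs_political.dropna(subset=['lyrics']).copy()
df_songs_political['lyrics'] = df_songs_political['lyrics'].str.replace('\n', ' ', regex=False)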


In [None]:
# Select the lyrics by column name rather than by position
df_songs_politcal_lyrics = df_songs_political['lyrics'].tolist()

## Here we begin to implement some of the NLP

Stemming is left to implement later; for now the tokens are lemmatized.

In [None]:
# This cell removes stop words and weights the text to emphasise nouns, adjectives, and verbs.
def nlp_weighting(input_list):
    print('Start')
    # 'en' is the spaCy 2.x shortcut; spaCy 3 needs a full model name such as 'en_core_web_sm'.
    nlp = spacy.load('en')
    newtext = []

    for doc in input_list:
        nlpdoc = nlp(doc)
        tempDoc = ''
        for token in nlpdoc:
            if not token.is_stop:
                tempDoc = tempDoc + ' ' + str(token.lemma_)
                if token.pos_ == 'NOUN':
                    # We triple the strength of nouns
                    tempDoc = tempDoc + ' ' + str(token.lemma_)
                    tempDoc = tempDoc + ' ' + str(token.lemma_)
                elif token.pos_ == 'ADJ':
                    # We double the strength of adjectives
                    tempDoc = tempDoc + ' ' + str(token.lemma_)
                elif token.pos_ == 'VERB':
                    # We double the strength of verbs
                    tempDoc = tempDoc + ' ' + str(token.lemma_)

        # Hard cutoff at 2100 characters, because the encoder ran into memory issues otherwise
        if len(tempDoc) > 2100:
            tempDoc = tempDoc[0:2100]
        # Texts shorter than 110 characters are blanked out
        if len(tempDoc) < 110:
            tempDoc = ''

        newtext.append(tempDoc)

    print('Returned')

    return newtext
    
    

In [None]:
df_songs_politcal_lyrics = nlp_weighting(df_songs_politcal_lyrics)

Start
Returned
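
To make the effect of the weighting concrete, here is a small sketch of the same repetition scheme that runs without spaCy. The `(lemma, POS)` pairs are hand-tagged for illustration, and `POS_REPEATS` and `weight_tokens` are hypothetical names rather than part of the notebook.

```python
# Repetition counts per part of speech, mirroring nlp_weighting above:
# nouns appear three times, adjectives and verbs twice, everything else once.
POS_REPEATS = {"NOUN": 3, "ADJ": 2, "VERB": 2}

def weight_tokens(tagged_tokens, stop_words=frozenset()):
    """Repeat each non-stop-word lemma according to its part-of-speech tag."""
    out = []
    for lemma, pos in tagged_tokens:
        if lemma in stop_words:
            continue  # stop words are dropped entirely
        out.extend([lemma] * POS_REPEATS.get(pos, 1))
    return " ".join(out)

# Hand-tagged (lemma, POS) pairs standing in for spaCy's output (assumed, not real model output).
tagged = [
    ("the", "DET"),
    ("police", "NOUN"),
    ("seize", "VERB"),
    ("large", "ADJ"),
    ("shipment", "NOUN"),
    ("quickly", "ADV"),
]

weighted = weight_tokens(tagged, stop_words={"the"})
print(weighted)
```

Because the encoder sees the repeated words as part of the input text, the repeated nouns should pull the resulting vector toward the topic words of the sentence.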


Note: one option is to compare texts using TF-IDF weights. In reality, this would be harder to implement. In practice, we will instead use a pretrained model that returns a vector for each text so that similarities can be compared. We will also want to restrict the size of our data to keep the comparisons feasible.
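
For comparison, a TF-IDF approach could look roughly like the sketch below. It is a standard-library illustration only: the whitespace tokeniser, the smoothed IDF formula, the toy documents, and the function names are all assumptions, not code from this notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        vectors.append([counts[tok] * idf[tok] for tok in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy documents: the first plays the role of the article, the rest are "songs".
docs = ["ship cocaine police", "love heart song", "police ship arrest"]
vocab, vectors = tfidf_vectors(docs)
scores = [cosine(vectors[0], vec) for vec in vectors[1:]]
print(scores)
```

TF-IDF only rewards exact word overlap, so the second document scores zero against the first, whereas a pretrained sentence encoder can also relate texts that use different words for similar ideas.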

The embedding is implemented below.

In [None]:
len(df_songs_politcal_lyrics)

581

In [None]:
df_songs_political_lyrics_embed = embed(df_songs_politcal_lyrics)

Start
Starting embeddings...
INFO:tensorflow:Saver not created because there are no variables in the graph to restore

Extracting embeddings...


In [None]:
df_songs_political_lyrics_embed.head()

### Below we apply the NLP to the Article

We will use the functions defined above to do this.

In [None]:
article_text = nlp_weighting([article_text])

Start
Returned


In [None]:
# Quick sanity check of embed() on two short test sentences
article_text_embed_temp = embed(['my fave color','temp sent']).iloc[0,:]
print(article_text_embed_temp)

Start
Starting embeddings...
W0918 16:19:55.427418 4384302528 deprecation.py:323] From /anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py:3632: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.


INFO:tensorflow:Saver not created because there are no variables in the graph to restore

Extracting embeddings...


In [None]:
# 'temp' is a throwaway second sentence; only the first row (the article) is kept
article_text_embed = embed([article_text[0],'temp']).iloc[0,:]

In [None]:
article_text_embed.head()

In [None]:
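# cosine_similarity expects 2D arrays, so reshape the article vector to a single row
article_vec = article_text_embed.values.reshape(1, -1)
# One call compares the article against every song embedding
similarities = cosine_similarity(article_vec, df_songs_political_lyrics_embed.values)[0]
# Embedding rows line up positionally with the rows of df_songs_political
best_idx = int(np.argmax(similarities))
max_cos = similarities[best_idx]
max_col = df_songs_political.iloc[best_idx]
print(max_cos)
print(max_col)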