Run the cell below to import the required packages:

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline

from nltk.corpus import stopwords
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Uncomment on the first run to download the NLTK data used below:
#nltk.download('stopwords')
#nltk.download('vader_lexicon')

Recall our example from yesterday. We first used the TfidfVectorizer to turn the articles into a document-term matrix in which words that appear in fewer articles receive more weight:

In [45]:
articles = ['Football baseball basketball',
            'baseball giants cubs redsox',
            'football broncos cowboys',
            'baseball redsox tigers',
            'pop stars hendrix prince',
            'hendrix prince jagger rock',
            'joplin pearl jam tupac rock',
          ]

vectorizer = TfidfVectorizer(lowercase=True, 
                     token_pattern="\\b[a-zA-Z][a-zA-Z]+\\b", 
                     stop_words=stopwords.words('english'),
                     min_df=1)

X = vectorizer.fit_transform(articles).toarray()

articles_df = pd.DataFrame(X,
             columns=vectorizer.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0
articles_df

doc,baseball,basketball,broncos,cowboys,cubs,football,giants,hendrix,jagger,jam,joplin,pearl,pop,prince,redsox,rock,stars,tigers,tupac
0,0.479185,0.675356,0.0,0.0,0.0,0.560603,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.397106,0.0,0.0,0.0,0.559675,0.0,0.559675,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.464579,0.0,0.0,0.0,0.0
2,0.0,0.0,0.609819,0.609819,0.0,0.506202,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.479185,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.560603,0.0,0.0,0.675356,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.451635,0.0,0.0,0.0,0.0,0.544082,0.451635,0.0,0.0,0.544082,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.473977,0.570997,0.0,0.0,0.0,0.0,0.473977,0.0,0.473977,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.461804,0.461804,0.461804,0.0,0.0,0.0,0.383337,0.0,0.0,0.461804
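
To make the weighting concrete, here is a minimal sketch that recomputes the first row of the table by hand. It assumes scikit-learn's default TF-IDF settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses a simplified whitespace tokenizer, which is enough here because none of the article words are stop words. The helper name `tfidf_row` is our own and not part of any library.

```python
import math
from collections import Counter

articles = [
    "Football baseball basketball",
    "baseball giants cubs redsox",
    "football broncos cowboys",
    "baseball redsox tigers",
    "pop stars hendrix prince",
    "hendrix prince jagger rock",
    "joplin pearl jam tupac rock",
]

# Simplified tokenizer (an assumption): lowercase and split on whitespace.
docs = [a.lower().split() for a in articles]
n_docs = len(docs)

# Document frequency: in how many articles does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))


def tfidf_row(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
           for w, c in counts.items()}
    # Scale the row so that its Euclidean length is 1.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}


row0 = tfidf_row(docs[0])
for word in sorted(row0):
    print(f"{word:<12}{row0[word]:.6f}")
```

"basketball" appears in only one article, so it ends up with the largest weight in the row, while "baseball" appears in three articles and gets the smallest.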


Then, we used a truncated SVD with two components to see how strongly each word loads on each component:

In [3]:
svd = TruncatedSVD(2)
X_svd = svd.fit_transform(X)
pd.DataFrame(svd.components_.round(5),
             index = ["component_1","component_2"],
             columns = vectorizer.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0

component,baseball,basketball,broncos,cowboys,cubs,football,giants,hendrix,jagger,jam,joplin,pearl,pop,prince,redsox,rock,stars,tigers,tupac
component_1,0.59434,0.26389,0.10775,0.10775,0.25565,0.30849,0.25565,0.0,0.0,-0.0,-0.0,-0.0,0.0,0.0,0.47627,0.0,0.0,0.31811,-0.0
component_2,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.51977,0.33357,0.10539,0.10539,0.10539,0.29259,0.51977,-0.0,0.36438,0.29259,-0.0,0.10539
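
For intuition about what the truncated SVD returns, here is a small sketch using plain NumPy on a hypothetical 4 x 4 count matrix (not our article data). It assumes the usual relationship: the rows of `components_` play the role of the top right singular vectors, and the transformed data are the documents projected onto them (up to sign).

```python
import numpy as np

# Hypothetical 4-document x 4-word matrix with two obvious word groups:
# documents 0-1 use only words 0-1, documents 2-3 use only words 2-3.
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Full SVD: X = U @ diag(S) @ Vt, singular values sorted largest first.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
components = Vt[:k]             # word loadings, analogous to svd.components_
doc_coords = U[:, :k] * S[:k]   # document coordinates, analogous to X_svd

# Projecting the documents onto the components gives the same coordinates.
projected = X @ components.T

print(np.round(components, 3))
print(np.allclose(projected, doc_coords))
```

Each of the two components loads on only one group of words, which is the same block pattern we see in the table above for the sports and music words.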


We then scaled the data with Normalizer, which rescales each row (document vector) so that it has a norm of 1. Unit vectors are convenient for similarity calculations because the dot product of two unit vectors equals their cosine similarity. The table below expresses each document as a combination of the two components:

In [4]:
dtm_svd = Normalizer(copy=False).fit_transform(X_svd)

pd.DataFrame(dtm_svd.round(5),
             index=articles, 
             columns=["component_1","component_2" ])

article,component_1,component_2
Football baseball basketball,1.0,0.0
baseball giants cubs redsox,1.0,0.0
football broncos cowboys,1.0,0.0
baseball redsox tigers,1.0,-0.0
pop stars hendrix prince,0.0,1.0
hendrix prince jagger rock,0.0,1.0
joplin pearl jam tupac rock,-0.0,1.0
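
As a rough sketch of what row normalization does, the snippet below divides each row by its Euclidean length and then checks that dot products between the resulting unit vectors behave like cosine similarities. The coordinates are hypothetical values chosen for illustration, not the output of our SVD.

```python
import numpy as np

# Hypothetical 2-component coordinates for three documents.
X_svd = np.array([
    [0.9, 0.1],
    [0.3, 0.0],
    [0.0, 0.5],
])

# Divide every row by its own Euclidean (L2) norm.
norms = np.linalg.norm(X_svd, axis=1, keepdims=True)
unit = X_svd / norms

# For unit vectors, the matrix of dot products is the cosine similarity matrix.
cosine = unit @ unit.T

print(np.round(unit, 5))
print(np.round(cosine, 5))
```

Documents 0 and 1 point in almost the same direction, so their similarity is close to 1 even though their original lengths differ, while documents 1 and 2 are orthogonal and have similarity 0.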


### Pipelines

We can create a pipeline that performs all of the above processes. Let's make the pipe:

In [50]:
pipe = [('tfidf', TfidfVectorizer(stop_words='english', 
                        token_pattern="\\b[a-zA-Z][a-zA-Z]+\\b", 
                        min_df=2)),
       ('lsa', TruncatedSVD(2)),
       ('normalizer', Normalizer())]
pipeline = Pipeline(pipe)

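Now we can put our article data through the pipe to reproduce the table above. Note that this pipe uses min_df=2 and scikit-learn's built-in English stop word list, so its vocabulary is smaller than before, but the normalized output is the same apart from the sign of some zeros. The code is a lot cleaner this way: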

In [51]:
dtm_svd = pipeline.fit_transform(articles)
pd.DataFrame(dtm_svd.round(5),
             index=articles, 
             columns=["component_1","component_2" ])

article,component_1,component_2
Football baseball basketball,1.0,-0.0
baseball giants cubs redsox,1.0,0.0
football broncos cowboys,1.0,-0.0
baseball redsox tigers,1.0,0.0
pop stars hendrix prince,0.0,1.0
hendrix prince jagger rock,0.0,1.0
joplin pearl jam tupac rock,0.0,1.0
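
To see why chaining steps works, here is a toy sketch of the idea behind a pipeline: each step is fitted on the output of the previous one, and new data later flows through the already fitted steps. The classes `Lowercase`, `WordCount`, and `TinyPipeline` are hypothetical stand-ins written for this illustration; this is not scikit-learn's actual implementation.

```python
class Lowercase:
    """Hypothetical transformer: lowercases every string."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [s.lower() for s in X]


class WordCount:
    """Hypothetical transformer: learns a vocabulary, then counts words."""

    def fit(self, X, y=None):
        self.vocab_ = sorted({w for s in X for w in s.split()})
        return self

    def transform(self, X):
        return [[s.split().count(w) for w in self.vocab_] for s in X]


class TinyPipeline:
    """Toy illustration of chaining steps (not scikit-learn's real code)."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, X, y=None):
        # Fit each step on the output of the previous step.
        for _name, step in self.steps:
            X = step.fit(X, y).transform(X)
        return X

    def transform(self, X):
        # Reuse what was learned during fitting; nothing is refitted here.
        for _name, step in self.steps:
            X = step.transform(X)
        return X


toy = TinyPipeline([("lower", Lowercase()), ("counts", WordCount())])
train = toy.fit_transform(["Rock rock jam", "Baseball rock"])
new = toy.transform(["JAM jam tupac"])

print(toy.steps[1][1].vocab_)
print(train)
print(new)
```

Note that "tupac" is ignored in the new sentence because it was not in the vocabulary learned during fitting. The same thing happens when our real pipeline sees words that were not in the training articles.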


### 1. Topic Modeling

When we chose two components to use in our SVD, we were essentially choosing to use two topics. We can view the five most important words associated with each topic below:

In [24]:
n_topics = 2
n_words = 5

feature_names = vectorizer.get_feature_names_out()             # all of the words (get_feature_names() in scikit-learn < 1.0)

for topic_num in range(n_topics):        
    topic_mat = svd.components_[topic_num]                     # row of word weights for this component

    print(f'Topic {topic_num + 1}:'.center(80))

    # Pair each weight with its word, sort in descending order of weight,
    # and keep the top n_words.
    topic_values = sorted(zip(topic_mat, feature_names),
                          reverse=True)[:n_words]
    print(' '.join([word for weight, word in topic_values]))   # print the top words
    print('-'*80)

                                    Topic 1:                                    
baseball redsox tigers football basketball
--------------------------------------------------------------------------------
                                    Topic 2:                                    
prince hendrix rock jagger stars
--------------------------------------------------------------------------------
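
The same top-word lookup can be sketched with `np.argsort`. To keep the snippet self-contained, it hard-codes the rounded component values from the table above instead of refitting the SVD. One caveat: ties are broken by column order here rather than by word, so the order of tied words can differ from the loop above ("hendrix" before "prince", and "pop" instead of "stars" in the last slot).

```python
import numpy as np

feature_names = np.array([
    "baseball", "basketball", "broncos", "cowboys", "cubs", "football",
    "giants", "hendrix", "jagger", "jam", "joplin", "pearl", "pop",
    "prince", "redsox", "rock", "stars", "tigers", "tupac",
])

# Rounded component values copied from the SVD table above.
components = np.array([
    [0.59434, 0.26389, 0.10775, 0.10775, 0.25565, 0.30849, 0.25565,
     0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.47627, 0.0, 0.0, 0.31811, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.51977, 0.33357, 0.10539,
     0.10539, 0.10539, 0.29259, 0.51977, 0.0, 0.36438, 0.29259, 0.0, 0.10539],
])

n_words = 5
top_words = []
for row in components:
    # Negate so the largest weights sort first; a stable sort keeps
    # tied words in their original column order.
    top_idx = np.argsort(-row, kind="stable")[:n_words]
    top_words.append(feature_names[top_idx].tolist())

for topic_num, words in enumerate(top_words, start=1):
    print(f"Topic {topic_num}: {' '.join(words)}")
```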


### 2. Predictive Modeling

Here we will train a classifier on labeled articles (sports = 1, music = 0) and then use it to predict the labels of new sentences. First, let's add a logistic regression classifier to the end of our pipe:

In [8]:
model = LogisticRegression(solver="lbfgs")

pipeline = Pipeline(pipe + [('model', model)])

Now, we can use our previous articles as our training data. (Remember that the first four articles were sports-related and the last three were music-related.)

In [9]:
pipeline.fit(articles, [1,1,1,1,0,0,0])

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])

Now, let's use our model to predict a new sentence:

In [10]:
new_sentence = 'babe ruth played baseball'
y_pred_1 = pipeline.predict([new_sentence])
print(y_pred_1)

[1]


It makes sense that this new sentence was labeled as sports (1) rather than music (0): the only word it shares with the training articles is "baseball". What about a new sentence about music?

In [25]:
new_sentence = 'rock and roll music'
y_pred_1 = pipeline.predict([new_sentence])
print(y_pred_1)

[0]


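This sentence gets labeled as music. To see the predicted probability of each class, we can call predict_proba. The columns follow the order of pipeline.classes_, here [0, 1], so the first value is the probability of music and the second is the probability of sports: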

In [30]:
new_sentence = 'rock and roll music'
y_pred_1 = pipeline.predict_proba([new_sentence])
print(y_pred_1)

[[0.68091588 0.31908412]]
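
For a two-class logistic regression, these two probabilities come from a single linear score passed through the sigmoid function. The sketch below works backwards from the printed sports probability to recover that score and then rebuilds both columns. The helper names `sigmoid` and `logit` are our own, and the snippet does not use the fitted pipeline.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the sigmoid: recover the score from a probability."""
    return math.log(p / (1.0 - p))


# Probability of class 1 (sports) taken from the output above.
p_sports = 0.31908412

# The linear score the model must have produced for this sentence.
z = logit(p_sports)

# Column 0 is P(music) = 1 - sigmoid(z); column 1 is P(sports) = sigmoid(z).
probs = [1.0 - sigmoid(z), sigmoid(z)]

# The predicted label is the class with the larger probability.
label = 1 if probs[1] > 0.5 else 0

print(round(z, 3), [round(p, 4) for p in probs], label)
```

The score is negative, so the sports probability falls below 0.5 and the sentence is labeled 0 (music).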


### 3. Article Suggestions

We can use a function similar to the one from our recommendation system unit to find the articles most similar to a given article. Let's find the sentences most similar to Sentence 0 by taking the dot product of its row in the SVD matrix with every other row and then sorting in descending order of the dot product:

In [58]:
pipe = [('tfidf', TfidfVectorizer(stop_words='english', 
                        token_pattern="\\b[a-zA-Z][a-zA-Z]+\\b", 
                        min_df=1)),
       ('lsa', TruncatedSVD(2))]

pipeline = Pipeline(pipe)

dtm_svd = pipeline.fit_transform(articles)

df = pd.DataFrame(dtm_svd.round(10),
             index=articles, 
             columns=["component_1","component_2" ])
df

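def get_similar_sentences(compare_sentence, df, num_recom):
    """Return the row positions of the num_recom sentences most similar to compare_sentence."""
    recs = []
    for sentence in range(df.shape[0]):
        if sentence != compare_sentence:                        # skip the sentence itself
            # similarity = dot product of the two rows of component values
            recs.append((np.dot(df.iloc[compare_sentence],df.iloc[sentence]), sentence))
    recs.sort(reverse = True)                                   # most similar first
    final_rec = [recs[i][1] for i in range(num_recom)]
    return final_rec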

sentence = 0
print(f"Sentences similar to sentence {sentence}: {get_similar_sentences(sentence,df,2)}")
sentence = 5
print(f"Sentences similar to sentence {sentence}: {get_similar_sentences(sentence,df,2)}")


Sentences similar to sentence 0: [3, 1]
Sentences similar to sentence 5: [4, 6]


It makes sense that Sentences 3 and 1 are most similar to Sentence 0, as they are all sports-related.
It also makes sense that Sentences 4 and 6 are most similar to Sentence 5, as they are all music-related.
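
Because this pipe has no Normalizer step, the raw dot product rewards long vectors as well as similar directions. As a sketch of the difference, the snippet below ranks neighbors both ways on hypothetical coordinates (not our fitted values). The function `most_similar` is our own illustrative helper.

```python
import numpy as np


def most_similar(vectors, query_idx, num_recom, normalize=True):
    """Return indices of the rows most similar to row query_idx."""
    vectors = np.asarray(vectors, dtype=float)
    if normalize:
        # Unit-length rows turn the dot product into cosine similarity.
        vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = vectors @ vectors[query_idx]
    scores[query_idx] = -np.inf          # never recommend the query itself
    return np.argsort(-scores, kind="stable")[:num_recom].tolist()


# Hypothetical 2-component coordinates: rows 0-3 "sports", rows 4-6 "music".
coords = np.array([
    [0.90, 0.00],
    [0.80, 0.10],
    [0.50, 0.00],
    [0.70, 0.05],
    [0.00, 0.80],
    [0.10, 0.90],
    [0.00, 0.40],
])

by_dot = most_similar(coords, 0, 2, normalize=False)
by_cosine = most_similar(coords, 0, 2, normalize=True)

print("dot product:", by_dot)
print("cosine:     ", by_cosine)
```

With the raw dot product, row 1 ranks first because it is long. With cosine similarity, row 2 ranks first because it points in exactly the same direction as row 0. Both rankings stay within the sports group, so which one to use is a design choice that depends on whether vector length should count.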

### 4. Sentiment Analysis

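Natural language processing can also be used for sentiment analysis. VADER's polarity_scores returns the proportions of the text that read as negative (neg), neutral (neu), and positive (pos), plus a compound score that summarizes the overall sentiment: it is near +1 for highly positive text and near -1 for highly negative text.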

Here's a highly positive sentence:

In [14]:
sid = SentimentIntensityAnalyzer()
print(sid.polarity_scores("Oh my god I love football, it's so awesome."))

{'neg': 0.0, 'neu': 0.302, 'pos': 0.698, 'compound': 0.9107}


Here's a highly negative sentence:

In [15]:
print(sid.polarity_scores("I hate swimming it makes me so tired."))

{'neg': 0.598, 'neu': 0.402, 'pos': 0.0, 'compound': -0.8147}
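
The compound score always stays between -1 and +1 because of how it is normalized. To our understanding, VADER sums the adjusted valence scores of the words and squashes the total with `x / sqrt(x^2 + alpha)`, using `alpha = 15`. Treat that formula and the constant as assumptions to check against the VADER source. The valence totals below are made-up inputs, not values computed by VADER.

```python
import math


def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence total into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Guard against floating-point overshoot at the extremes.
    return max(-1.0, min(1.0, score))


# Hypothetical valence totals, from strongly negative to strongly positive.
for total in (-8.0, -2.0, 0.0, 2.0, 8.0):
    print(f"{total:+.1f} -> {normalize(total):+.4f}")
```

Larger totals push the score toward +1 or -1 but never past it, which is why even a very enthusiastic sentence tops out a little below 1.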


### Homework

Read more about VADER sentiment analysis here:

https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f

Then read the actual VADER docs here:

http://www.nltk.org/howto/sentiment.html

In particular, read all of the "tricky_sentences" and then view their sentiment scores further down the page. Sentiment analysis is difficult!

The VADER package is not the only package that deals with sentiment analysis. Read about other tools and their pros and cons here:

https://medium.com/@b.terryjack/nlp-pre-trained-sentiment-analysis-1eb52a9d742c

Comment on what you learned/found interesting on Google Classroom.