# Sentiment Analysis and Topic Models


Consider the sentiment of the following statements:

- Coronet has the best lines of all day cruisers.
- Bertram has a deep V hull and runs easily through seas.
- Pastel-colored 1980s day cruisers from Florida are ugly.
- I dislike old cabin cruisers.

Are they positive, negative, or neutral?  Why?

In [1]:
%matplotlib inline 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from textblob import TextBlob

### Sentiment Analysis

In [2]:
text = 'Hi, I thought the speech you gave was awful, your hair looked terrible, and your mom would be ashamed.'

In [3]:
analysis = TextBlob(text)

In [4]:
pos_or_neg = analysis.sentiment.polarity

In [5]:
pos_or_neg

-1.0
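
TextBlob's default analyzer is lexicon-based, so a score of -1.0 is what we would expect when the only opinion words it recognizes are strongly negative. The sketch below illustrates that idea in a much simplified form. It is not TextBlob's implementation: `TOY_LEXICON`, its values, and `toy_polarity` are hypothetical names invented for this example, and the sketch ignores negation and intensifiers.

```python
import re

# Hypothetical toy lexicon: word -> polarity in [-1.0, 1.0].
# The values are illustrative assumptions, not TextBlob's actual lexicon.
TOY_LEXICON = {
    "awful": -1.0,
    "terrible": -1.0,
    "ugly": -0.7,
    "best": 1.0,
    "great": 0.8,
}


def toy_polarity(text, lexicon=TOY_LEXICON):
    """Average the polarity of every lexicon word found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    # Text with no known opinion words is treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0


text = ("Hi, I thought the speech you gave was awful, "
        "your hair looked terrible, and your mom would be ashamed.")
print(toy_polarity(text))                         # -1.0
print(toy_polarity("Bertram has a deep V hull"))  # 0.0
```

Only "awful" and "terrible" are in the toy lexicon, so the average is -1.0, while a sentence with no opinion words scores 0.0 (neutral).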

In [6]:
ny = pd.read_csv('data/ny_donors.csv')

In [7]:
sent = ny.project_essay_2[30]

In [8]:
sent

"We currently have 3 outdated desktop computers in our classroom that still run on Windows XP! We like to use reading websites, like Raz Kids, Starfall and News-O-Matic to practice our reading skills. It's difficult to make sure everyone gets a fair turn when there are 24 students sharing 3 computers. Most of them don't have access to technology at home either. \\r\\nThese new Kindle Fires will allow more students to have access to technology at the same time. Students will be able to read ebooks, as well as use other reading apps to practice reading on their own level. They can also use them for research for their writing, publishing their work, and learning how to use different types of  technology. It will encourage reluctant readers to practice reading more if they can have access to different kinds of books on technology."

In [9]:
analysis = TextBlob(sent)

In [10]:
analysis.sentiment.polarity

0.18742424242424244

In [11]:
analysis.sentiment.subjectivity

0.6067845117845116

In [12]:
import spacy

In [13]:
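# The 'en' shortcut was removed in spaCy 3; load the model by its full name
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')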

In [14]:
doc = nlp(sent)

In [15]:
# Use a separate loop variable so the essay stored in `sent` is not overwritten
for sentence in doc.sents:
    print(sentence)

We currently have 3 outdated desktop computers in our classroom that still run on Windows XP!
We like to use reading websites, like Raz Kids, Starfall and News-O-Matic to practice our reading skills.
It's difficult to make sure everyone gets a fair turn when there are 24 students sharing 3 computers.
Most of them don't have access to technology at home either.
\r\nThese new Kindle Fires will allow more students to have access to technology at the same time.
Students will be able to read ebooks, as well as use other reading apps to practice reading on their own level.
They can also use them for research for their writing, publishing their work, and learning how to use different types of  technology.
It will encourage reluctant readers to practice reading more if they can have access to different kinds of books on technology.


In [16]:
from spacy import displacy

In [17]:
displacy.render(doc, style = 'ent', jupyter = True)

In [18]:
from textblob import TextBlob

In [19]:
text = ny.project_essay_2[10]

In [20]:
blob = TextBlob(text)

In [21]:
blob.tags[:5]

[('We', 'PRP'),
 ('are', 'VBP'),
 ('looking', 'VBG'),
 ('to', 'TO'),
 ('add', 'VB')]

In [22]:
blob.tags[0][0]

'We'

In [23]:
blob.tags[0][1]

'PRP'

In [24]:
for sent in blob.sentences:
    print(sent.sentiment.polarity, sent[:10])

0.0 We are loo
0.0 But, we ar
0.7 We believe
0.0 \r\n\r\nA 
0.0 For exampl
0.5 More stude
0.0 We could t
0.3181818181818182 Our robot 
0.0 The possib
0.0 Robotics c
0.0 It allows 
0.08333333333333333 This will 
0.0 Thank you!


In [25]:
blob.words

WordList(['We', 'are', 'looking', 'to', 'add', 'robotics', 'coding', 'and', 'programming', 'to', 'our', 'STEM', 'lab', 'in', 'a', 'rural', 'school', 'But', 'we', 'are', 'currently', 'lacking', 'the', 'technology', 'to', 'accomplish', 'this', 'with', 'our', 'students', 'We', 'believe', 'a', 'set', 'of', 'three', 'GoPiGo', 'Robot', 'Starter', 'Kits', 'and', 'Raspberry', 'Pi', 'computers', 'would', 'be', 'the', 'perfect', 'fit', 'to', 'introduce', 'hands-on', 'innovation', 'r\\n\\r\\nA', 'set', 'of', 'three', 'Raspberry', 'Pi', 'computers', 'to', 'program', 'three', 'GoPiGo', 'Robots', 'can', 'engage', 'a', 'variety', 'of', 'grade', 'levels', 'covering', 'a', 'variety', 'of', 'STEAM', 'topics', 'and', 'disciplines', 'For', 'example', 'younger', 'programmers', 'can', 'be', 'introduced', 'to', 'coding', 'by', 'moving', 'the', 'robots', 'through', 'a', 'maze', 'or', 'competing', 'in', 'a', 'robot', 'soccer', 'match', 'More', 'students', 'can', 'learn', 'about', 'planets', 'and', 'program', '

In [26]:
blob.words.count('robotics')

2

In [27]:
blob.words.count('STEM')

2

In [28]:
blob.words.count('technology')

1

In [29]:
blob.words.count('robot')

5

### Problem

Add columns to our dataframe `ny` that contain polarity and subjectivity scores for `project_essay_1` and `project_essay_2`.

In [31]:
ny.head()

Unnamed: 0.1,Unnamed: 0,id,teacher_id,teacher_prefix,school_state,project_submitted_datetime,project_grade_category,project_subject_categories,project_subject_subcategories,project_title,project_essay_1,project_essay_2,project_essay_3,project_essay_4,project_resource_summary,teacher_number_of_previously_posted_projects,project_is_approved
0,13,p173555,9b7f355e34bc9ca5740779b69ee14d8e,Mrs.,NY,2016-11-15 22:13:39,Grades 3-5,Literacy & Language,Literature & Writing,Extra! Extra! Read all about it!! We love to ...,"Each day my fifth graders walk into our \""home...",My students have had a taste of good reading! ...,,,"My students need good books, with life lessons...",5,0
1,21,p116615,b3593a375f2cf7fd4469b928ffac1c95,Mrs.,NY,2016-09-30 08:12:37,Grades PreK-2,"Applied Learning, Music & The Arts","Early Development, Performing Arts",Oral Language Development through the use of p...,Teaching kindergarten in a diverse district po...,Students don't often get the chance to 'play' ...,,,My students need the opportunity to develop or...,0,1
2,30,p081434,17563b7d138a9ca1e7308f0f480e7d09,Ms.,NY,2016-12-06 21:19:44,Grades PreK-2,"Health & Sports, Special Needs","Health & Wellness, Special Needs",Seating Like a Boss- Our 21st Century Room,"\""Great job buddy!\"" is something I hear every...",In order to promote essential learning skills ...,,,My students need an opportunity to sit and wor...,9,0
3,32,p156550,a902ce7ebdce6f236873d6b443c3ca08,Ms.,NY,2017-03-30 20:05:08,Grades 9-12,"Applied Learning, Special Needs","Other, Special Needs",Keeping Students Focused with Fun and Technology!,"Attending a District 75 high school in Bronx, ...","With a classroom lacking technology, student i...",,,My students need a wider variety of Interactiv...,0,1
4,59,p186381,da67f09a612a32fa30c9c80bed7e6365,Mrs.,NY,2016-09-24 11:36:26,Grades PreK-2,Literacy & Language,"Literacy, Literature & Writing",Listening & Learning in First Grade,Who doesn't enjoy listening to a great story? ...,The Listening Center in the classroom is alway...,,,My students need wireless headphones to use in...,9,1


In [33]:
ny['polarity_1'] = ny['project_essay_1'].apply(lambda essay: TextBlob(essay).sentiment.polarity)
ny['subjectivity_1'] = ny['project_essay_1'].apply(lambda essay: TextBlob(essay).sentiment.subjectivity)
ny['polarity_2'] = ny['project_essay_2'].apply(lambda essay: TextBlob(essay).sentiment.polarity)
ny['subjectivity_2'] = ny['project_essay_2'].apply(lambda essay: TextBlob(essay).sentiment.subjectivity)





In [35]:
ny[['polarity_1','subjectivity_1','polarity_2','subjectivity_2']].head()

Unnamed: 0,polarity_1,subjectivity_1,polarity_2,subjectivity_2
0,0.463889,0.686111,0.477778,0.597222
1,0.215783,0.411869,-0.142551,0.666035
2,0.332481,0.450631,0.227692,0.506923
3,0.287773,0.572455,0.206151,0.535119
4,0.430592,0.580014,0.113462,0.427564


### Topic Models

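Below, we explore two approaches to topic modeling with `scikit-learn`: non-negative matrix factorization (NMF) and latent Dirichlet allocation (LDA). Before we can use these models, we have to convert the essays into numeric features. We do so first with a basic `CountVectorizer` and then with TF-IDF weighting (`TfidfTransformer` and `TfidfVectorizer`).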
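
To make the preprocessing step concrete, here is a minimal pure-Python sketch of the kind of document-term count matrix a count vectorizer produces, including a `min_df`-style filter on document frequency. The mini-corpus and the helper names `tokenize` and `count_matrix` are hypothetical. The regular expression mirrors the default token pattern of `CountVectorizer` (runs of two or more word characters), but this is an illustration rather than the library's implementation, and stop-word removal is left out.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for the essay column.
docs = [
    "Students need books to read.",
    "Books help students love reading.",
    "We need chairs and flexible seating.",
]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def count_matrix(docs, min_df=1):
    """Return (vocabulary, rows), where rows[i][j] counts term j in doc i."""
    counts = [Counter(tokenize(d)) for d in docs]
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for c in counts for term in c)
    # Keep only terms that appear in at least min_df documents.
    vocab = sorted(t for t, n in doc_freq.items() if n >= min_df)
    rows = [[c[t] for t in vocab] for c in counts]
    return vocab, rows


vocab, rows = count_matrix(docs, min_df=2)
print(vocab)  # ['books', 'need', 'students']
for row in rows:
    print(row)
```

With `min_df=2`, only the three terms that occur in at least two documents survive, which is the same mechanism that shrinks the vocabulary when `min_df` is raised on the real data.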

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

In [37]:
vect = CountVectorizer()

In [38]:
X = vect.fit_transform(ny.project_essay_2)

In [39]:
X

<12157x20418 sparse matrix of type '<class 'numpy.int64'>'
	with 1056010 stored elements in Compressed Sparse Row format>

In [40]:
X2 = X.toarray()

In [41]:
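# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names()
names = vect.get_feature_names_out()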
words = pd.DataFrame(X2, columns=names)

In [42]:
words.head()

Unnamed: 0,00,000,00am,00pm,021,04,05a,06,08,10,...,zoo,zoob,zoology,zoom,zooms,zooplankton,zoos,zucchini,zuma,zumba
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
words['zumba'].value_counts()

0    12154
1        2
2        1
Name: zumba, dtype: int64

In [44]:
vect = CountVectorizer(min_df = 15, stop_words = 'english')
X = vect.fit_transform(ny.project_essay_2)
X2 = X.toarray()
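# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names()
names = vect.get_feature_names_out()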
words = pd.DataFrame(X2, columns=names)

In [45]:
words.head()

Unnamed: 0,00,000,10,100,11,11th,12,13,14,15,...,yes,yoga,york,young,younger,youngest,youngsters,youth,yummy,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:
from sklearn.feature_extraction.text import TfidfTransformer

In [47]:
transformer = TfidfTransformer()
transformer

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
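
The parameters shown above (`norm='l2'`, `smooth_idf=True`, `sublinear_tf=False`, `use_idf=True`) correspond to weighting each count by a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, and then scaling each row to unit Euclidean length. The sketch below works through that arithmetic in plain Python on a small hypothetical count matrix. The function name `tfidf_rows` is an assumption made for this example.

```python
import math


def tfidf_rows(counts):
    """TF-IDF with smoothed IDF and L2 row normalization."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: how many rows have a nonzero count for each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # Rows that are all zeros stay all zeros.
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Hypothetical counts: 3 documents (rows) by 3 terms (columns).
counts = [[1, 1, 1], [1, 0, 1], [0, 1, 0]]
for row in tfidf_rows(counts):
    print([round(v, 4) for v in row])
```

Because every term here appears in exactly two documents, all three IDF values are equal, so the weights within a row differ only through the counts and the normalization.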

In [48]:
tfidf = transformer.fit_transform(words)

In [49]:
tfidf

<12157x4175 sparse matrix of type '<class 'numpy.float64'>'
	with 608648 stored elements in Compressed Sparse Row format>

In [50]:
transformer.fit_transform(words[:10]).toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.10751966, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [51]:
words.columns[:10]

Index(['00', '000', '10', '100', '11', '11th', '12', '13', '14', '15'], dtype='object')

In [65]:
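from sklearn.decomposition import NMF, LatentDirichletAllocation

no_topic = 10

# NMF factorizes the TF-IDF matrix into no_topic (10) topics
nmf = NMF(n_components=no_topic).fit(tfidf)

# LDA is fit with 5 topics here; it does not use no_topic.
# No random_state is set, so topics can differ between runs.
lda = LatentDirichletAllocation(n_components=5).fit(tfidf)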



In [54]:
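def display_topics(model, feature_names, no_top_words):
    """Print the highest-weighted words for each topic of a fitted model."""
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        # argsort is ascending, so the reversed slice takes the largest weights first
        top_indices = topic.argsort()[:-no_top_words - 1:-1]
        print(" ".join(feature_names[i] for i in top_indices))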
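
The slice `[:-no_top_words - 1:-1]` in `display_topics` can look cryptic. The following pure-Python sketch reproduces the same selection on a small hypothetical weight vector: sort the indices in ascending order of weight (which is what `argsort` returns), then walk backwards from the end for `n` steps. The helper name `top_words` and the sample data are assumptions for illustration.

```python
def top_words(weights, feature_names, n):
    """Return the n feature names with the largest weights, largest first."""
    # Indices sorted by ascending weight, analogous to numpy's argsort.
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Start at the last index and step backwards, stopping after n items.
    top = order[:-n - 1:-1]
    return [feature_names[i] for i in top]


# Hypothetical topic weights over a four-word vocabulary.
weights = [0.1, 0.9, 0.3, 0.5]
names = ["paper", "books", "chairs", "reading"]
print(top_words(weights, names, 2))  # ['books', 'reading']
```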

In [55]:
no_top_words = 10

display_topics(nmf, words.columns, no_top_words) 

Topic 0:
supplies need students paper year materials pencils classroom school help
Topic 1:
books reading read library book students readers level love classroom
Topic 2:
technology students access chromebooks use computer research computers classroom able
Topic 3:
seating students chairs sit classroom work sitting flexible stools comfortable
Topic 4:
math skills students learning games materials help fun centers practice
Topic 5:
school students healthy snacks day children play music equipment snack
Topic 6:
art create projects paint express creative artists students creativity arts
Topic 7:
printer print ink color work students classroom projects able pictures
Topic 8:
science stem students world hands explore learn learning kits life
Topic 9:
ipad ipads apps use technology students able classroom learning reading


In [56]:
display_topics(lda, words.columns, no_top_words) 

Topic 0:
supplies school students paper need pencils healthy classroom snacks year
Topic 1:
students reading books learning help classroom use able school skills
Topic 2:
science plants garden calculators plant backpacks cycle labs scientists lab
Topic 3:
students chairs seating sit music sitting day comfortable movement physical
Topic 4:
water film bottles lighting french bottle determination sugar humans stream


In [57]:
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)
dtm_tf = tf_vectorizer.fit_transform(ny.project_essay_1)
print(dtm_tf.shape)

(12157, 3827)


In [58]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(ny.project_essay_1)
print(dtm_tfidf.shape)

(12157, 3827)


In [60]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tf.fit(dtm_tf)
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tfidf.fit(dtm_tfidf)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=20, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

### Visualizing Topic Models

In [61]:
# In pyLDAvis 3.4 and later this module is named pyLDAvis.lda_model
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [62]:
pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

### Project Essay Visualizations

What are some topics in the `project_essay_2` column?  Can you determine a way to incorporate these into a `LogisticRegression` model?