## Text Analysis (with VADER): 12 Dancing Princesses
### Kiran Pandey April 27, 2019

In this module, we will use the Natural Language Toolkit (NLTK) to look at individual words and sentences in a text and to clean unnecessary features from the text data in preparation for sentiment analysis. Then, using NLTK's VADER sentiment analyzer, we will score the sentiment of the text to produce a numerical value that can be used in a predictive model.

In [1]:
# libraries used
import nltk

from nltk.tokenize import word_tokenize   
from nltk.tokenize import sent_tokenize
from nltk.tokenize import TweetTokenizer

from nltk.probability import FreqDist  
from nltk.corpus import stopwords      
from nltk.stem import WordNetLemmatizer 
from nltk.sentiment.vader import SentimentIntensityAnalyzer 

from nltk.corpus import names      #this is sample data
from string import punctuation

##  uncomment the lines below if these resources are not already downloaded
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('names')
#nltk.download('stopwords')
#nltk.download('vader_lexicon')

## Load file and explore raw data

In [2]:
with open('../datasets/datasets_12dancingprincesses.txt', 'r') as file:    # context manager closes the file after reading
    dp_raw = file.read().replace('\n', '  ')      # read as one string, with each newline replaced by '  '

In [3]:
print (type(dp_raw))
print (len(dp_raw))           # establish baseline value number of characters
dp_raw                     

<class 'str'>
8498


'THE TWELVE DANCING PRINCESSES    There was a king who had twelve beautiful daughters. They slept in  twelve beds all in one room; and when they went to bed, the doors were  shut and locked up; but every morning their shoes were found to be quite  worn through as if they had been danced in all night; and yet nobody  could find out how it happened, or where they had been.    Then the king made it known to all the land, that if any person could  discover the secret, and find out where it was that the princesses  danced in the night, he should have the one he liked best for his  wife, and should be king after his death; but whoever tried and did not  succeed, after three days and nights, should be put to death.    A king’s son soon came. He was well entertained, and in the evening was  taken to the chamber next to the one where the princesses lay in their  twelve beds. There he was to sit and watch where they went to dance;  and, in order that nothing might pass without his hearing it, th

## Clean data
#### convert to lower case / remove punctuation

In [4]:
dp_in = dp_raw.lower()    # change all the words to lowercase
dp_in[0:400]

'the twelve dancing princesses    there was a king who had twelve beautiful daughters. they slept in  twelve beds all in one room; and when they went to bed, the doors were  shut and locked up; but every morning their shoes were found to be quite  worn through as if they had been danced in all night; and yet nobody  could find out how it happened, or where they had been.    then the king made it kn'

In [5]:
translator = str.maketrans('‘’', '  ', punctuation)    # map curly quotes ‘ ’ to spaces; delete every character in string.punctuation
dp_in = dp_in.translate(translator)                      # strip all punctuation
print (type(dp_in))
print (len(dp_in))                                       # character count after stripping punctuation
dp_in[0:400]      

<class 'str'>
8292


'the twelve dancing princesses    there was a king who had twelve beautiful daughters they slept in  twelve beds all in one room and when they went to bed the doors were  shut and locked up but every morning their shoes were found to be quite  worn through as if they had been danced in all night and yet nobody  could find out how it happened or where they had been    then the king made it known to '
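The three-argument form of `str.maketrans` used above does two different things at once, which is easy to miss: the first two arguments map characters one-to-one, while the third lists characters to delete. A minimal sketch on a made-up sentence (the sample text is illustrative and not taken from the story file):

```python
from string import punctuation

# Made-up sample with ASCII punctuation and curly quotes (illustrative only).
sample = "‘hello,’ said the king; ‘who’s there?’"

# First two arguments: map each curly quote to a space (one-to-one, equal length).
# Third argument: every character in `punctuation` is deleted outright.
translator = str.maketrans("‘’", "  ", punctuation)

cleaned = sample.translate(translator)
tokens = cleaned.split()   # whitespace split collapses the extra spaces
print(tokens)
```

Because the curly apostrophe becomes a space rather than being deleted, a contraction such as "who’s" ends up as two tokens, "who" and "s".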

## Structure data -- tokenize 

In [6]:
dp_words = word_tokenize(dp_in)    # split the cleaned text into word tokens
len(dp_words)

1611

In [7]:
#the NLTK FreqDist gives a count for how often each part of the text occurs
fd_wct = FreqDist(dp_words)
fd_wct

FreqDist({'the': 139, 'and': 78, 'to': 42, 'he': 33, 'they': 32, 'of': 28, 'in': 25, 'was': 24, 'all': 24, 'a': 22, ...})

In [8]:
#shows the top 10 words in the text
fd_wct.most_common(10)

[('the', 139),
 ('and', 78),
 ('to', 42),
 ('he', 33),
 ('they', 32),
 ('of', 28),
 ('in', 25),
 ('was', 24),
 ('all', 24),
 ('a', 22)]

In [9]:
dp_new = []  # placeholder for retained (non-stopword) words
dp_rm = []   # placeholder for removed stopwords
stop_set = set(stopwords.words('english'))          # build the stopword set once instead of reloading the list for every word

for word in dp_words:                               # loop through each word; bin stopwords into dp_rm and the rest into dp_new
    if word not in stop_set:
        dp_new.append(word)
    else: 
        dp_rm.append(word)

In [10]:
print ('removed word count = ', len(dp_rm))       # confirm/validate removed words are appropriate
fd_rm = FreqDist(dp_rm)
print ('no of unique removed words = ', len(fd_rm))
fd_rm.most_common(15)

removed word count =  945
no of unique removed words =  92


[('the', 139),
 ('and', 78),
 ('to', 42),
 ('he', 33),
 ('they', 32),
 ('of', 28),
 ('in', 25),
 ('was', 24),
 ('all', 24),
 ('a', 22),
 ('as', 19),
 ('it', 18),
 ('had', 17),
 ('i', 17),
 ('were', 16)]

In [11]:
print ('new word count = ', len(dp_new))          # review baseline data for enrichment 
fd_nw = FreqDist(dp_new)
print ('no of unique new words = ', len(fd_nw))
fd_nw.most_common(15)

new word count =  666
no of unique new words =  349


[('soldier', 19),
 ('princesses', 17),
 ('king', 16),
 ('said', 16),
 ('twelve', 11),
 ('went', 11),
 ('eldest', 11),
 ('came', 10),
 ('one', 7),
 ('night', 7),
 ('happened', 7),
 ('time', 7),
 ('youngest', 7),
 ('bed', 6),
 ('danced', 6)]
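A quick sanity check on the stopword split is that every token lands in exactly one of the two bins. In the counts above, 945 removed words plus 666 retained words equals the 1611 tokens produced by `word_tokenize`. A minimal sketch of that check, using a tiny hand-written token list and stopword set as stand-ins (assumptions for illustration; the notebook uses `dp_words` and `stopwords.words('english')`):

```python
from collections import Counter

# Tiny hand-written stand-ins (assumptions for illustration only).
tokens = ["the", "king", "had", "twelve", "daughters", "and", "the", "soldier"]
stop_set = {"the", "had", "and"}

kept = [t for t in tokens if t not in stop_set]
removed = [t for t in tokens if t in stop_set]

# Every token must land in exactly one bin.
assert len(kept) + len(removed) == len(tokens)

# Counter offers the same most_common() interface used on FreqDist above.
removed_counts = Counter(removed)
print(len(kept), len(removed), removed_counts.most_common(2))
```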

In [12]:
for i, word in enumerate(dp_new):      # list each retained word with its index
    print (i, word)

0 twelve
1 dancing
2 princesses
3 king
4 twelve
5 beautiful
6 daughters
7 slept
8 twelve
9 beds
10 one
11 room
12 went
13 bed
14 doors
15 shut
16 locked
17 every
18 morning
19 shoes
20 found
21 quite
22 worn
23 danced
24 night
25 yet
26 nobody
27 could
28 find
29 happened
30 king
31 made
32 known
33 land
34 person
35 could
36 discover
37 secret
38 find
39 princesses
40 danced
41 night
42 one
43 liked
44 best
45 wife
46 king
47 death
48 whoever
49 tried
50 succeed
51 three
52 days
53 nights
54 put
55 death
56 king
57 son
58 soon
59 came
60 well
61 entertained
62 evening
63 taken
64 chamber
65 next
66 one
67 princesses
68 lay
69 twelve
70 beds
71 sit
72 watch
73 went
74 dance
75 order
76 nothing
77 might
78 pass
79 without
80 hearing
81 door
82 chamber
83 left
84 open
85 king
86 son
87 soon
88 fell
89 asleep
90 awoke
91 morning
92 found
93 princesses
94 dancing
95 soles
96 shoes
97 full
98 holes
99 thing
100 happened
101 second
102 third
103 night
104 king
105 ordered
106 head
107 cut
10

In [13]:
#  To be completed -- create an index of occurrences of each retained word in the original text.
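One possible way to approach this to-do is to record, for each retained word, the token positions where it appears in the tokenized text. The sketch below is only one reading of the task (token positions rather than character offsets), and `build_position_index` is a hypothetical helper name, not something defined elsewhere in this notebook:

```python
from collections import defaultdict

def build_position_index(tokens, vocabulary):
    """Map each word in `vocabulary` to the positions where it occurs in `tokens`."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        if token in vocabulary:
            index[token].append(position)
    return dict(index)

# Toy data standing in for dp_words and set(dp_new) (illustrative only).
toy_tokens = ["the", "king", "had", "twelve", "daughters", "and", "the", "king", "slept"]
toy_vocab = {"king", "twelve", "daughters", "slept"}

toy_index = build_position_index(toy_tokens, toy_vocab)
print(toy_index["king"])
```

In the notebook this would presumably be called as `build_position_index(dp_words, set(dp_new))`.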

###  Enrich retained word list for analysis
#### Lemmatization: verb, noun, adjective

In [14]:
wnl = WordNetLemmatizer()      # create instance of wordnetlemmatizer

In [15]:
##### to be completed: set up a loop structure or function to replace the separate lemmatization steps (verb, noun, adjective)
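A hedged sketch of what such a function could look like: it applies a lemmatizer once per part-of-speech tag and feeds each pass into the next. The name `lemmatize_chain` and the toy lemmatizer are assumptions made so the example runs without the WordNet data; the only thing it relies on from NLTK is that the lemmatizer accepts `(word, pos=...)`, as `WordNetLemmatizer.lemmatize` does.

```python
def lemmatize_chain(words, lemmatize, pos_order=("v", "n", "a")):
    """Apply `lemmatize(word, pos=...)` once per tag, feeding each pass into the next."""
    result = list(words)
    for pos in pos_order:
        result = [lemmatize(word, pos=pos) for word in result]
    return result

# Stand-in lemmatizer so the sketch runs without WordNet (illustrative only).
TOY_LEMMAS = {
    ("went", "v"): "go",
    ("princesses", "n"): "princess",
    ("youngest", "a"): "young",
}

def toy_lemmatize(word, pos="n"):
    return TOY_LEMMAS.get((word, pos), word)

toy_result = lemmatize_chain(["went", "princesses", "youngest", "king"], toy_lemmatize)
print(toy_result)
```

With the real lemmatizer, the call would be along the lines of `lemm_a = lemmatize_chain(dp_new, wnl.lemmatize)`, which should match running the verb, noun, and adjective passes one after another.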

In [16]:
#empty list to hold the new lemmatized words
lemm_v = []

for word in dp_new:
    lemm_v.append(wnl.lemmatize(word, pos="v"))  #lemmatize using 'verb' part-of-speech

In [17]:
#this is the list of tokens after being lemmatized
print ('new word count = ', len(lemm_v))
fd_nw = FreqDist(lemm_v)
print ('no of unique words = ', len(fd_nw))
fd_nw.most_common(15)

new word count =  666
no of unique words =  300


[('soldier', 19),
 ('princesses', 17),
 ('say', 17),
 ('king', 16),
 ('go', 16),
 ('dance', 12),
 ('twelve', 11),
 ('come', 11),
 ('eldest', 11),
 ('bed', 8),
 ('take', 8),
 ('one', 7),
 ('night', 7),
 ('happen', 7),
 ('leave', 7)]

In [18]:
lemm_n = []

for word in lemm_v:
    lemm_n.append(wnl.lemmatize(word, pos="n"))  #lemmatize using 'noun' part-of-speech

In [19]:
#this is the list of tokens after being lemmatized
print ('new word count = ', len(lemm_n))
fd_nw = FreqDist(lemm_n)
print ('no of unique words = ', len(fd_nw))
fd_nw.most_common(15)

new word count =  666
no of unique words =  291


[('princess', 22),
 ('soldier', 19),
 ('king', 17),
 ('say', 17),
 ('go', 16),
 ('dance', 12),
 ('twelve', 11),
 ('come', 11),
 ('eldest', 11),
 ('bed', 8),
 ('night', 8),
 ('take', 8),
 ('one', 7),
 ('happen', 7),
 ('leave', 7)]

In [20]:
lemm_a = []

for word in lemm_n:
    lemm_a.append(wnl.lemmatize(word, pos="a"))  #lemmatize using 'adjective' part-of-speech

In [21]:
##this is the list of tokens after being lemmatized
print ('new word count = ', len(lemm_a))
fd_nw = FreqDist(lemm_a)
print ('no of unique words = ', len(fd_nw))
fd_nw.most_common(15)

new word count =  666
no of unique words =  289


[('princess', 22),
 ('soldier', 19),
 ('king', 17),
 ('say', 17),
 ('go', 16),
 ('dance', 12),
 ('twelve', 11),
 ('come', 11),
 ('eldest', 11),
 ('bed', 8),
 ('night', 8),
 ('take', 8),
 ('young', 8),
 ('one', 7),
 ('happen', 7)]

### Publish data for analysis

In [22]:
dp_final = " ".join(lemm_a)                  # convert final word list to a single string   

### Data Analysis and Modeling 
#### Sentiment Analysis

In [23]:
sid = SentimentIntensityAnalyzer()         #  create instance of sentiment intensity analyzer

In [24]:
print ('sid polarity score = ', sid.polarity_scores(dp_final))
sid.polarity_scores(dp_final)['compound']    #extract the sentiment value from the dictionary of scores

sid polarity score =  {'neg': 0.09, 'neu': 0.744, 'pos': 0.166, 'compound': 0.9961}


0.9961
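To read the compound value, it helps to know how it is produced. As far as we understand the NLTK implementation, VADER sums the adjusted valence of every scored word and squashes that sum into the range -1 to +1 with x / sqrt(x² + α), where α defaults to 15. The sketch below re-implements only that normalization step (it is not the full VADER scoring) to show how quickly the value saturates as the summed valence grows:

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded summed valence into the range [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))

# Larger summed valences (typical of longer texts) push the result toward 1.
for total in (1.0, 4.0, 10.0, 50.0):
    print(total, round(normalize(total), 4))
```

Because the sum grows with the length of the text, a passage of several hundred words that leans even modestly positive can land very close to +1. A compound score near 1 on a long text therefore says more about the direction of the sentiment than about its intensity.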

###  Insights
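Overall, VADER scores the Twelve Dancing Princesses passage as positive: the compound score is 0.9961 on a scale from -1 (most negative) to +1 (most positive), with 16.6% of the text scored positive, 9.0% negative, and 74.4% neutral.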