# This notebook uses the NLTK library for natural language processing tasks.
Created by Eugene Luo

# Q1)
# A) First install the NLTK package using the computer terminal:
pip install nltk

# B) Download the Gutenberg corpus through the NLTK package

In [1]:
import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\luoeu\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

# C) Use text in the corpus

In [2]:
from nltk.corpus import gutenberg #import corpus

In [3]:
gutenberg.fileids()  # list the file IDs in the Gutenberg corpus

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [4]:
whitman = gutenberg.words('whitman-leaves.txt')  # tokens (words and punctuation) of the text
len(whitman)  # number of tokens in the text

154883

In [5]:
print(whitman) # see the text

['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]


In [6]:
len(gutenberg.fileids()) # Number of sources in Gutenberg

18

# D) Create a table displaying relative frequencies for (can, could, may, might, will, would, and should) in the corpus

In [7]:
# Count every token in every text, keyed by file ID
cfd = nltk.ConditionalFreqDist((file, word)
                              for file in gutenberg.fileids()
                              for word in gutenberg.words(fileids=file))
files = gutenberg.fileids()  # all 18 texts, in corpus order
modals = ['can', 'could', 'may', 'might', 'will', 'would', 'should']
# tabulate prints raw counts; matching is case-sensitive, so 'Will' is not counted
cfd.tabulate(conditions=files, samples=modals)

                           can  could    may  might   will  would should 
        austen-emma.txt    270    825    213    322    559    815    366 
  austen-persuasion.txt    100    444     87    166    162    351    185 
       austen-sense.txt    206    568    169    215    354    507    228 
          bible-kjv.txt    213    165   1024    475   3807    443    768 
        blake-poems.txt     20      3      5      2      3      3      6 
     bryant-stories.txt     75    154     18     23    144    110     38 
burgess-busterbrown.txt     23     56      3     17     19     46     13 
      carroll-alice.txt     57     73     11     28     24     70     27 
    chesterton-ball.txt    131    117     90     69    198    139     75 
   chesterton-brown.txt    126    170     47     71    111    132     56 
chesterton-thursday.txt    117    148     56     71    109    116     54 
  edgeworth-parents.txt    340    420    160    127    517    503    271 
 melville-moby_dick.txt    220    215 
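
The table above holds raw counts, while the heading asks for relative frequencies. Below is a minimal sketch of one way to normalize, assuming that "occurrences per 1,000 tokens" is an acceptable definition of relative frequency. The helper name `modal_rates` and the toy token list are illustrative only and are not part of the original notebook.

```python
from collections import Counter

MODALS = ['can', 'could', 'may', 'might', 'will', 'would', 'should']

def modal_rates(tokens, modals=MODALS, per=1000):
    """Return {modal: occurrences per `per` tokens}, ignoring case."""
    counts = Counter()
    total = 0
    for token in tokens:
        counts[token.lower()] += 1
        total += 1
    if total == 0:
        return {m: 0.0 for m in modals}
    return {m: counts[m] * per / total for m in modals}

# Toy input, made up for illustration. With the corpus loaded, this could
# be gutenberg.words('bible-kjv.txt') instead.
toy_tokens = ['I', 'will', 'go', ',', 'and', 'you', 'may', 'stay', '.',
              'Will', 'he', '?']
rates = modal_rates(toy_tokens)
print(rates['will'], rates['may'])
```

Dividing by the token total makes a very long text and a very short one comparable, which raw counts do not.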

# E) Select the largest span of relative frequency (most-used modal minus least-used modal). Compare usage in the two texts, and suggest an explanation for why those words are used differently.

Most used modal in a text: 'will' : 3,807    (bible-kjv.txt)

Least used modal in a text: 'might' : 2      (blake-poems.txt)

'will' reaches its highest count in the King James Bible. A likely explanation is that the Bible often states what God will do in particular events. In addition, 'will' also appears as a noun, as in 'God's will', and the table counts both uses. Note that these are raw counts and the Bible is by far the longest text in the corpus, which also inflates the figure.

'might' reaches its lowest count in Blake's poems. Judging from the opening of the text, the first poem narrates what two people are doing. Because these actions are stated as having already happened, there is little call for 'might' in such sentences. The text is also among the shortest in the corpus, so all of its modal counts are low.
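
To make the "largest span" comparison concrete, here is a hedged sketch that takes a table of relative frequencies and reports, for each modal, the texts with the highest and lowest rate and the difference between them. The function name `spans_by_modal`, the text labels, and all numbers are hypothetical placeholders, not results from the corpus.

```python
def spans_by_modal(rates):
    """rates: {text: {modal: rate}} -> {modal: (span, max_text, min_text)}."""
    modals = set()
    for per_text in rates.values():
        modals.update(per_text)
    result = {}
    for m in sorted(modals):
        # Rank the texts by their rate for this modal (missing means 0.0)
        ranked = sorted(rates, key=lambda text: rates[text].get(m, 0.0))
        lo_text, hi_text = ranked[0], ranked[-1]
        span = rates[hi_text].get(m, 0.0) - rates[lo_text].get(m, 0.0)
        result[m] = (span, hi_text, lo_text)
    return result

# Placeholder rates per 1,000 tokens (made-up numbers, not corpus results).
example_rates = {
    'text_a': {'will': 4.0, 'might': 0.5},
    'text_b': {'will': 0.5, 'might': 0.25},
    'text_c': {'will': 2.0, 'might': 1.5},
}
spans = spans_by_modal(example_rates)
widest = max(spans, key=lambda m: spans[m][0])
print(widest, spans[widest])
```

Working from rates rather than raw counts keeps the longest text from winning every comparison simply because it has more tokens.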

In [8]:
print(gutenberg.raw('bible-kjv.txt').strip()[:501]) # First 501 characters. 

[The King James Bible]

The Old Testament of the King James Bible

The First Book of Moses:  Called Genesis


1:1 In the beginning God created the heaven and the earth.

1:2 And the earth was without form, and void; and darkness was upon
the face of the deep. And the Spirit of God moved upon the face of the
waters.

1:3 And God said, Let there be light: and there was light.

1:4 And God saw the light, that it was good: and God divided the light
from the darkness.

1:5 And God called the light Day


In [9]:
print(gutenberg.raw('blake-poems.txt').strip()[:502]) # First 502 characters.

[Poems by William Blake 1789]

 
SONGS OF INNOCENCE AND OF EXPERIENCE
and THE BOOK of THEL


 SONGS OF INNOCENCE
 
 
 INTRODUCTION
 
 Piping down the valleys wild,
   Piping songs of pleasant glee,
 On a cloud I saw a child,
   And he laughing said to me:
 
 "Pipe a song about a Lamb!"
   So I piped with merry cheer.
 "Piper, pipe that song again;"
   So I piped: he wept to hear.
 
 "Drop thy pipe, thy happy pipe;
   Sing thy songs of happy cheer:!"
 So I sang the same again,
   While he wept with


# Q2)

# A) Import inaugural corpus

In [10]:
nltk.download('inaugural')
from nltk.corpus import inaugural

[nltk_data] Downloading package inaugural to
[nltk_data]     C:\Users\luoeu\AppData\Roaming\nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


# B) Choose Kennedy's speech

In [11]:
Kennedy = nltk.corpus.inaugural.words('1961-Kennedy.txt')
Kennedy

['Vice', 'President', 'Johnson', ',', 'Mr', '.', ...]

# C) Find the 10 most frequently used long words (more than 7 characters)

In [12]:
from nltk import FreqDist
import pandas as pd

# Keep only tokens longer than 7 characters, lowercased
freqchar = [w.lower() for w in Kennedy if len(w) > 7]

# Make a frequency distribution and take the 10 most common words
fdist = FreqDist(freqchar)
freq = fdist.most_common(10)  # list of (word, count) pairs

# Build the data frame directly from the (word, count) pairs
df_freq = pd.DataFrame(freq, columns=['Words', 'Count'])
df_freq  # display the data frame

        Words  Count
0    citizens      5
1   president      4
2   americans      4
3  generation      3
4   forebears      2
5  revolution      2
6   committed      2
7    powerful      2
8  supporting      2
9  themselves      2


# D) Which word has the largest number of synonyms using WordNet?

In [13]:
nltk.download('wordnet')

from nltk.corpus import wordnet as wn

np_syn = []  # list of [word, lemma-name count] pairs


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\luoeu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [14]:
# Loop through each of the 10 most common long words
for i in fdist.most_common(10):
    print('Word: ', i[0])
    count = 0

    # Print the lemma names (synonyms) of every synset of the word
    print('Synonyms: ')
    for j in wn.synsets(i[0]):
        print(' ', j.lemma_names())
        count += len(j.lemma_names())  # counts repeats and the word itself
    np_syn.append([i[0], count])

Word:  citizens
Synonyms: 
  ['citizen']
Word:  president
Synonyms: 
  ['president']
  ['President_of_the_United_States', 'United_States_President', 'President', 'Chief_Executive']
  ['president']
  ['president', 'chairman', 'chairwoman', 'chair', 'chairperson']
  ['president', 'prexy']
  ['President_of_the_United_States', 'President', 'Chief_Executive']
Word:  americans
Synonyms: 
  ['American']
  ['American_English', 'American_language', 'American']
  ['American']
Word:  generation
Synonyms: 
  ['coevals', 'contemporaries', 'generation']
  ['generation']
  ['generation']
  ['generation']
  ['genesis', 'generation']
  ['generation']
  ['generation', 'multiplication', 'propagation']
Word:  forebears
Synonyms: 
  ['forebear', 'forbear']
Word:  revolution
Synonyms: 
  ['revolution']
  ['revolution']
  ['rotation', 'revolution', 'gyration']
Word:  committed
Synonyms: 
  ['perpetrate', 'commit', 'pull']
  ['give', 'dedicate', 'consecrate', 'commit', 'devote']
  ['commit', 'institutionalize

In [15]:
# Build the data frame from the [word, count] pairs collected above
df_syn = pd.DataFrame(np_syn, columns=['Words', 'Count'])

# E) List all synonyms for the 10 most frequently used words

In [16]:
#sort from highest to lowest
df_syn.sort_values(by="Count", ascending=False)

        Words  Count
8  supporting     52
6   committed     27
1   president     16
7    powerful     16
3  generation     12
2   americans      5
5  revolution      5
4   forebears      2
0    citizens      1
9  themselves      0


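By this count, which totals the lemma names across all synsets of a word (repeats and the word itself included), 'supporting' has the largest number of synonyms.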
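
The count above adds up every lemma name in every synset, so the word itself and repeated names are included. Below is a minimal sketch of a stricter count that keeps only distinct names other than the word. The helper `distinct_synonyms` is an illustrative name, and the lemma lists are copied from the output printed earlier for 'president'. With WordNet loaded they could instead come from `[s.lemma_names() for s in wn.synsets('president')]`.

```python
def distinct_synonyms(word, lemma_lists):
    """Return the set of distinct lemma names (lowercased) other than `word`."""
    target = word.lower()
    found = set()
    for lemmas in lemma_lists:
        for name in lemmas:
            key = name.lower()
            if key != target:
                found.add(key)
    return found

# Lemma names for 'president', copied from the notebook output above.
president_lemmas = [
    ['president'],
    ['President_of_the_United_States', 'United_States_President',
     'President', 'Chief_Executive'],
    ['president'],
    ['president', 'chairman', 'chairwoman', 'chair', 'chairperson'],
    ['president', 'prexy'],
    ['President_of_the_United_States', 'President', 'Chief_Executive'],
]
raw_total = sum(len(lemmas) for lemmas in president_lemmas)
synonyms = distinct_synonyms('president', president_lemmas)
print(raw_total, len(synonyms))
```

For inflected words such as 'citizens', WordNet returns the base form 'citizen' as the lemma, so this sketch would need the base form passed in as `word` (for example from `wn.morphy`) to exclude it.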

# F) Which one of the 10 words has the largest number of hyponyms?

In [17]:
#Create array for hyponyms
np_hyp = []
#Loop through each word
for i in fdist.most_common(10):
    # Print the word
    print('Word: ', i[0])
    count=0
    
    #Print Hyponyms
    for j in wn.synsets(i[0]):
        print(j)
        print(j.hyponyms())
        count+=len(j.hyponyms())
    np_hyp.append([i[0], count])

Word:  citizens
Synset('citizen.n.01')
[Synset('active_citizen.n.01'), Synset('civilian.n.01'), Synset('freeman.n.01'), Synset('private_citizen.n.01'), Synset('repatriate.n.01'), Synset('thane.n.02'), Synset('voter.n.01')]
Word:  president
Synset('president.n.01')
[]
Synset('president_of_the_united_states.n.01')
[]
Synset('president.n.03')
[Synset('ex-president.n.01')]
Synset('president.n.04')
[Synset('kalon_tripa.n.01'), Synset('vice_chairman.n.01')]
Synset('president.n.05')
[]
Synset('president_of_the_united_states.n.02')
[]
Word:  americans
Synset('american.n.01')
[Synset('african-american.n.01'), Synset('alabaman.n.01'), Synset('alaskan.n.01'), Synset('anglo-american.n.01'), Synset('appalachian.n.01'), Synset('arizonan.n.01'), Synset('arkansan.n.01'), Synset('asian_american.n.01'), Synset('bay_stater.n.01'), Synset('bostonian.n.01'), Synset('californian.n.01'), Synset('carolinian.n.01'), Synset('coloradan.n.01'), Synset('connecticuter.n.01'), Synset('creole.n.02'), Synset('delaware

In [18]:
# Build the data frame from the [word, count] pairs collected above
df_hyp = pd.DataFrame(np_hyp, columns=['Word', 'Count'])

In [19]:
df_hyp.sort_values(by="Count", ascending=False) # Sort from highest to lowest hyponyms

         Word  Count
2   americans     75
8  supporting     42
6   committed     16
5  revolution      8
0    citizens      7
3  generation      6
1   president      3
4   forebears      2
7    powerful      0
9  themselves      0


The word 'americans' has the highest number of direct hyponyms.
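
`hyponyms()` returns only the direct children of a synset, so the counts above ignore anything further down the hierarchy. The sketch below shows, on a toy parent-to-children map, how a full traversal would differ. The function name `all_descendants` is illustrative. The first level of the map is copied from the hyponyms printed for `citizen.n.01`, and the second level is a made-up placeholder.

```python
from collections import deque

def all_descendants(root, children):
    """Return every node reachable below `root` in a {parent: [children]} map."""
    seen = set()
    queue = deque(children.get(root, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(children.get(node, []))
    return seen

# Toy hierarchy: first level from the notebook output, second level hypothetical.
toy_children = {
    'citizen': ['active_citizen', 'civilian', 'freeman', 'private_citizen',
                'repatriate', 'thane', 'voter'],
    'voter': ['hypothetical_subtype'],  # placeholder, not a WordNet entry
}
direct = len(toy_children['citizen'])
total = len(all_descendants('citizen', toy_children))
print(direct, total)
```

With WordNet loaded, `synset.closure(lambda s: s.hyponyms())` should perform the analogous traversal on real synsets, and the ranking of the ten words could change under that definition.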

# G) Reflect on the results

Using NLTK, we can extract useful information from a corpus of text. Here, with a few lines of code, we analyzed the 18 texts in the Gutenberg corpus and compared how often each modal verb appears, which tells us a little about each text.

In Kennedy's inaugural address, we found the most common long words and learned which of them have the most synonyms and hyponyms in WordNet.