# Natural Language Processing using NLTK


### About the Data Set:

The data set consists of 8 plain-text ebooks that are freely available from http://www.gutenberg.org/. The books are:

    The Adventures of Tom Sawyer
    The Time Machine
    The War of the Worlds
    Astounding Stories
    Common Science
    Northanger Abbey
    General Science
    Sailing Alone Around the World

### Steps Performed:
    - Importing the text files
    - Parsing and transforming the text: conversion to lower case, removal of special characters, expansion of "not" contractions, and tokenization
    - Tagging each term with its part of speech (POS)
    - Removing stop words
    - Lemmatizing or stemming terms to reduce them to their root form
    - Producing a term-document matrix of the highest-occurring terms in each document
    - Comparing the term-document matrices to see how POS tagging, stop-word removal and stemming affect the result
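
The parsing and transformation step can be sketched on its own with plain string operations. The helper name `normalize_text` is an assumption made for illustration; it mirrors the replacements used in the cells of this notebook and needs only the standard library.

```python
def normalize_text(text):
    """Lower-case the text and apply the same replacements as the notebook cells."""
    text = text.lower()
    # Replace special characters with spaces
    for ch in ('-', '_', ','):
        text = text.replace(ch, ' ')
    # Expand "not" contractions and drop the "'d" suffix
    text = text.replace("'nt", " not")
    text = text.replace("n't", " not")
    text = text.replace("'d", " ")
    return text


sample = "He didn't see the half-open door, but she'd noticed."
print(normalize_text(sample).split())
```

Note that dropping `'d` discards the auxiliary verb ("she'd" becomes "she"), which is acceptable here because such words would be removed as stop words anyway.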

In [19]:
#Importing Libraries
import string
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.probability import FreqDist
from collections import Counter
import operator

- ## Term Document Matrix with POS, Stemming and Stopword Removal
    POS = True
    
    Stemming = True
    
    Remove Stopwords = True

In [20]:
file_path = 'TextFiles/'
files = ['T1.txt', 'T2.txt', 'T3.txt', 'T4.txt', 'T5.txt', 'T6.txt','T7.txt', 'T8.txt']
term_doc = []
pos_tags = True
stemming = True
remove_stop = True
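
The three flags are the only part that changes between the scenarios in this notebook, so they could be collected in one place instead of being reassigned in each cell. The following is a minimal sketch; the `Scenario` class and the `SCENARIOS` list are hypothetical names and not part of the original notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Flags that control one run of the pipeline (hypothetical helper)."""
    pos_tags: bool
    stemming: bool
    remove_stop: bool

    def label(self):
        return "POS={} Remove Stop Words={} Stemming={}".format(
            self.pos_tags, self.remove_stop, self.stemming)


# The four flag combinations that this notebook runs
SCENARIOS = [
    Scenario(pos_tags=True, stemming=True, remove_stop=True),
    Scenario(pos_tags=False, stemming=False, remove_stop=False),
    Scenario(pos_tags=False, stemming=False, remove_stop=True),
    Scenario(pos_tags=False, stemming=True, remove_stop=True),
]

for scenario in SCENARIOS:
    print(scenario.label())
```

With the processing loop wrapped in a function that accepts a `Scenario`, each comparison would become a single call.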

In [21]:
#Initialize and Reading File
for file in files:
    with open (file_path+file, "r") as text_file:
        adoc = text_file.read()
    # Convert to all lower case - required
    adoc = ("%s" %adoc).lower()

    # Replace special characters with spaces
    adoc = adoc.replace('-', ' ')
    adoc = adoc.replace('_', ' ')
    adoc = adoc.replace(',', ' ')

    # Replace not contraction with not
    adoc = adoc.replace("'nt", " not")
    adoc = adoc.replace("n't", " not")
    adoc = adoc.replace("'d", " ")

    # Tokenize
    tokens = word_tokenize(adoc)
    tokens = [word.replace(',', '') for word in tokens]
    tokens = [word for word in tokens if ('*' not in word) and word != "''" and word !="``"]

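    # A loop that reassigns `word` would leave `tokens` unchanged, so no regex
    # clean-up is applied here; punctuation tokens are removed later only when
    # remove_stop is True.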
    print("\nDocument "+file+" contains a total of", len(tokens), " terms.")
    
    #POS Tagging
    if pos_tags:
        tokens = nltk.pos_tag(tokens)

    # Remove stop words
    if remove_stop:
        stop = stopwords.words('english') + list(string.punctuation)
        stop.append("said")
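        # Remove single character words and simple punctuation (with POS tagging each
        # token is a (word, tag) tuple of length 2, so this filter only applies when pos_tags is False)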
        tokens = [word for word in tokens if len(word) > 1]
        # Remove stop words
        if pos_tags:
            tokens = [word for word in tokens if word[0] not in stop]
            tokens = [word for word in tokens if (not word[0].replace('.','',1).isnumeric()) and word[0]!="'s" ]
        else:
            tokens = [word for word in tokens if word not in stop]
            tokens = [word for word in tokens if word != "'s" ]
            
    # Lemmatization - Stemming with POS
    if stemming:
        stemmer = SnowballStemmer("english")
        wn_tags = {'N':wn.NOUN, 'J':wn.ADJ, 'V':wn.VERB, 'R':wn.ADV}
        wnl = WordNetLemmatizer()
        stemmed_tokens = []
        if pos_tags:
            for token in tokens:
                term = token[0]
                pos = token[1]
                pos = pos[0]
                try:
                    pos = wn_tags[pos]
                    stemmed_tokens.append(wnl.lemmatize(term, pos=pos))
                except KeyError:  # no WordNet mapping for this tag, so stem instead
                    stemmed_tokens.append(stemmer.stem(term))
        else:
            for token in tokens:
                stemmed_tokens.append(stemmer.stem(token))
    if stemming:
        print("Document "+file+" contains", len(stemmed_tokens), "terms after stemming.\n")
        tokens = stemmed_tokens
        
    # Count term frequencies and keep the 2000 most common terms of this document
    fdist = FreqDist(tokens)
    td = dict(fdist.most_common(2000))
    term_doc.append(td)


Document T1.txt contains a total of 86484  terms.
Document T1.txt contains 40039 terms after stemming.


Document T2.txt contains a total of 108474  terms.
Document T2.txt contains 48289 terms after stemming.


Document T3.txt contains a total of 104778  terms.
Document T3.txt contains 50177 terms after stemming.


Document T4.txt contains a total of 83140  terms.
Document T4.txt contains 35269 terms after stemming.


Document T5.txt contains a total of 76238  terms.
Document T5.txt contains 35030 terms after stemming.


Document T6.txt contains a total of 35136  terms.
Document T6.txt contains 15670 terms after stemming.


Document T7.txt contains a total of 80206  terms.
Document T7.txt contains 35461 terms after stemming.


Document T8.txt contains a total of 64511  terms.
Document T8.txt contains 29797 terms after stemming.
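
The lemmatization branch maps the first letter of each Penn Treebank tag returned by `nltk.pos_tag` to a WordNet part of speech, and falls back to the Snowball stemmer when there is no mapping. The sketch below isolates that decision with plain strings so that it runs without NLTK data. `penn_to_wordnet` is a hypothetical helper, and the one-letter values are assumed to match the `wn.NOUN`, `wn.ADJ`, `wn.VERB` and `wn.ADV` constants.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code
WN_TAGS = {'N': 'n', 'J': 'a', 'V': 'v', 'R': 'r'}


def penn_to_wordnet(tag):
    """Return the WordNet POS code for a Penn Treebank tag, or None."""
    return WN_TAGS.get(tag[:1])


for tag in ('NNS', 'VBD', 'JJ', 'RB', 'IN', 'PRP'):
    pos = penn_to_wordnet(tag)
    action = 'lemmatize as ' + pos if pos else 'fall back to the stemmer'
    print(tag, '->', action)
```

Using `dict.get` makes the fallback explicit and avoids wrapping the dictionary lookup in `try`/`except`.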



In [22]:
#Prepare Term-Document Matrix

td_mat = {}
for td in term_doc:
    td_mat = Counter(td_mat)+Counter(td)
td_matrix = {}
for k, v in td_mat.items():
    td_matrix[k] = [v]

for td in term_doc:
    for k, v in td_matrix.items():
        if k in td:
            td_matrix[k].append(td[k])
        else:
            td_matrix[k].append(0)
                
#Print Term Document Matrix

td_matrix_sorted = sorted(td_matrix.items(), key=operator.itemgetter(1),reverse=True)
print("Scenario: POS=", pos_tags, "Remove Stop Words=", remove_stop, " Stemming=", stemming)
print("------------------------------------------------------------")
print("TERM            TOTAL  D1   D2   D3   D4   D5   D6   D7   D8")
for i in range(20):
    s = '{:<15s}'.format(td_matrix_sorted[i][0])
    v = td_matrix_sorted[i][1]
    #print(v)
    for j in range(9):
        s = s + '{:>5d}'.format(v[j])
    print('{:<60s}'.format(s))
print("____________________________________________________________")

Scenario: POS= True Remove Stop Words= True  Stemming= True
------------------------------------------------------------
TERM            TOTAL  D1   D2   D3   D4   D5   D6   D7   D8
one             2127  291  437  348  211  312  121  202  205
water           2040   47  922  825    7   94    7   55   83
make            1928  204  694  262  185  237   63  169  114
would           1855  270  407  195  309  222   60  289  103
go              1620  212  292   18  239  154  103  374  228
come            1511  211  153   62  126  276  155  282  246
could           1363  221  121   49  364  195   93  203  117
time            1333  137  128  175  167  164  213  216  133
see             1188  179  232  129  156  110   72  172  138
light           1175   87  461  322   21   92   61   60   71
get             1146  171  291   24   76  121   53  315   95
air             1126   69  518  412   20   19   23   30   35
know            1042  165  102  112  223  119   46  202   73
day              939   87
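
The matrix construction can be illustrated on toy data. This is a sketch only: `build_term_doc_matrix` is a hypothetical function, and the two small dictionaries stand in for the per-document counts held in `term_doc`. Each row has the form `[total, D1, D2, ...]`, so sorting the rows in descending order ranks terms by their total first.

```python
from collections import Counter


def build_term_doc_matrix(term_doc):
    """Map each term to [total, count in doc 1, count in doc 2, ...]."""
    totals = Counter()
    for td in term_doc:
        totals.update(td)  # add this document's counts to the running totals
    return {term: [total] + [td.get(term, 0) for td in term_doc]
            for term, total in totals.items()}


# Toy per-document counts (made-up numbers, not taken from the books)
toy_term_doc = [{'water': 3, 'time': 1}, {'water': 2, 'light': 4}]
matrix = build_term_doc_matrix(toy_term_doc)

# Rank by the count list, highest total first
for term, counts in sorted(matrix.items(), key=lambda kv: kv[1], reverse=True):
    print(term, counts)
```

`Counter.update` adds counts in place, which avoids building a new `Counter` for every document.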

In [23]:
!pip install wordcloud
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt



- ## Term Document Matrix without POS, Stemming and Stopword Removal
    POS = False
    
    Stemming = False
    
    Remove Stopwords = False

In [25]:
file_path = 'TextFiles/'
files = ['T1.txt', 'T2.txt', 'T3.txt', 'T4.txt', 'T5.txt', 'T6.txt','T7.txt', 'T8.txt']
term_doc = []
pos_tags = False
stemming = False
remove_stop = False

#Initialize and Reading File
for file in files:
    with open (file_path+file, "r") as text_file:
        adoc = text_file.read()
    # Convert to all lower case - required
    adoc = ("%s" %adoc).lower()

    # Replace special characters with spaces
    adoc = adoc.replace('-', ' ')
    adoc = adoc.replace('_', ' ')
    adoc = adoc.replace(',', ' ')

    # Replace not contraction with not
    adoc = adoc.replace("'nt", " not")
    adoc = adoc.replace("n't", " not")
    adoc = adoc.replace("'d", " ")

    # Tokenize
    tokens = word_tokenize(adoc)
    tokens = [word.replace(',', '') for word in tokens]
    tokens = [word for word in tokens if ('*' not in word) and word != "''" and word !="``"]

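    # A loop that reassigns `word` would leave `tokens` unchanged, so no regex
    # clean-up is applied here; punctuation tokens are removed later only when
    # remove_stop is True.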
    print("\nDocument "+file+" contains a total of", len(tokens), " terms.")
    
    #POS Tagging
    if pos_tags:
        tokens = nltk.pos_tag(tokens)

    # Remove stop words
    if remove_stop:
        stop = stopwords.words('english') + list(string.punctuation)
        stop.append("said")
        # Remove single character words and simple punctuation
        tokens = [word for word in tokens if len(word) > 1]
        # Remove stop words
        if pos_tags:
            tokens = [word for word in tokens if word[0] not in stop]
            tokens = [word for word in tokens if (not word[0].replace('.','',1).isnumeric()) and word[0]!="'s" ]
        else:
            tokens = [word for word in tokens if word not in stop]
            tokens = [word for word in tokens if word != "'s" ]
            
    # Lemmatization - Stemming with POS
    if stemming:
        stemmer = SnowballStemmer("english")
        wn_tags = {'N':wn.NOUN, 'J':wn.ADJ, 'V':wn.VERB, 'R':wn.ADV}
        wnl = WordNetLemmatizer()
        stemmed_tokens = []
        if pos_tags:
            for token in tokens:
                term = token[0]
                pos = token[1]
                pos = pos[0]
                try:
                    pos = wn_tags[pos]
                    stemmed_tokens.append(wnl.lemmatize(term, pos=pos))
                except KeyError:  # no WordNet mapping for this tag, so stem instead
                    stemmed_tokens.append(stemmer.stem(term))
        else:
            for token in tokens:
                stemmed_tokens.append(stemmer.stem(token))
    if stemming:
        print("Document "+file+" contains", len(stemmed_tokens), "terms after stemming.\n")
        tokens = stemmed_tokens
        
    # Count term frequencies and keep the 2000 most common terms of this document
    fdist = FreqDist(tokens)
    td = dict(fdist.most_common(2000))
    term_doc.append(td)
    
#Prepare Term-Document Matrix

td_mat = {}
for td in term_doc:
    td_mat = Counter(td_mat)+Counter(td)
td_matrix = {}
for k, v in td_mat.items():
    td_matrix[k] = [v]

for td in term_doc:
    for k, v in td_matrix.items():
        if k in td:
            td_matrix[k].append(td[k])
        else:
            td_matrix[k].append(0)
                
#Print Term Document Matrix

td_matrix_sorted = sorted(td_matrix.items(), key=operator.itemgetter(1),reverse=True)
print("Scenario: POS=", pos_tags, "Remove Stop Words=", remove_stop, " Stemming=", stemming)
print("------------------------------------------------------------")
print("TERM            TOTAL  D1   D2   D3   D4   D5   D6   D7   D8")
for i in range(20):
    s = '{:<15s}'.format(td_matrix_sorted[i][0])
    v = td_matrix_sorted[i][1]
    #print(v)
    for j in range(9):
        s = s + '{:>5d}'.format(v[j])
    print('{:<60s}'.format(s))
print("____________________________________________________________")


Document T1.txt contains a total of 86484  terms.

Document T2.txt contains a total of 108474  terms.

Document T3.txt contains a total of 104778  terms.

Document T4.txt contains a total of 83140  terms.

Document T5.txt contains a total of 76238  terms.

Document T6.txt contains a total of 35136  terms.

Document T7.txt contains a total of 80206  terms.

Document T8.txt contains a total of 64511  terms.
Scenario: POS= False Remove Stop Words= False  Stemming= False
------------------------------------------------------------
TERM            TOTAL  D1   D2   D3   D4   D5   D6   D7   D8
the            42083 5178 8302 8767 3174 5833 2241 3794 4794
.              28176 4818 4935 4228 2793 2803 1763 3832 3004
of             19694 2421 3304 4322 2358 2370 1152 1466 2301
and            19163 2358 2278 3240 2304 2121 1235 3124 2503
a              15432 1968 2772 2719 1536 2092  815 1895 1635
to             13276 1864 2056 1941 2239 1583  691 1727 1175
in             10300 1118 1848 2206 126

In [None]:
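# Without stop-word removal, the top of this matrix is dominated by function words
# and punctuation (the, ., of, and, a, to, in), so it reveals little about the content of each document.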
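
To illustrate why the unfiltered counts are dominated by function words, the sketch below filters a short token list against a stop list. The tiny `STOP` set and the sample tokens are made up for the example; the notebook itself uses `stopwords.words('english')` plus punctuation. A set is used here because a membership test on a set does not scan the whole collection the way a test on a list does.

```python
import string
from collections import Counter

# Tiny illustrative stop list; a stand-in for the NLTK English stop words
STOP = {'the', 'of', 'and', 'a', 'to', 'in'} | set(string.punctuation)

tokens = ['the', 'water', 'of', 'the', 'river', 'and', 'the', 'light', '.',
          'the', 'water', 'in', 'a', 'glass', '.']

before = Counter(tokens)
after = Counter(t for t in tokens if t not in STOP)

print(before.most_common(3))
print(after.most_common(3))
```

After filtering, only the content words remain, and they are the ones that distinguish one document from another.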

- ## Term Document Matrix with Stopword Removal
    POS = False
    
    Stemming = False
    
    Remove Stopwords = True

In [29]:
file_path = 'TextFiles/'
files = ['T1.txt', 'T2.txt', 'T3.txt', 'T4.txt', 'T5.txt', 'T6.txt','T7.txt', 'T8.txt']
term_doc = []
pos_tags = False
stemming = False
remove_stop = True

#Initialize and Reading File
for file in files:
    with open (file_path+file, "r") as text_file:
        adoc = text_file.read()
    # Convert to all lower case - required
    adoc = ("%s" %adoc).lower()

    # Replace special characters with spaces
    adoc = adoc.replace('-', ' ')
    adoc = adoc.replace('_', ' ')
    adoc = adoc.replace(',', ' ')

    # Replace not contraction with not
    adoc = adoc.replace("'nt", " not")
    adoc = adoc.replace("n't", " not")
    adoc = adoc.replace("'d", " ")

    # Tokenize
    tokens = word_tokenize(adoc)
    tokens = [word.replace(',', '') for word in tokens]
    tokens = [word for word in tokens if ('*' not in word) and word != "''" and word !="``"]

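    # A loop that reassigns `word` would leave `tokens` unchanged, so no regex
    # clean-up is applied here; punctuation tokens are removed later only when
    # remove_stop is True.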
    print("\nDocument "+file+" contains a total of", len(tokens), " terms.")
    
    #POS Tagging
    if pos_tags:
        tokens = nltk.pos_tag(tokens)

    # Remove stop words
    if remove_stop:
        stop = stopwords.words('english') + list(string.punctuation)
        stop.append("said")
        # Remove single character words and simple punctuation
        tokens = [word for word in tokens if len(word) > 1]
        # Remove stop words
        if pos_tags:
            tokens = [word for word in tokens if word[0] not in stop]
            tokens = [word for word in tokens if (not word[0].replace('.','',1).isnumeric()) and word[0]!="'s" ]
        else:
            tokens = [word for word in tokens if word not in stop]
            tokens = [word for word in tokens if word != "'s" ]
            
    # Lemmatization - Stemming with POS
    if stemming:
        stemmer = SnowballStemmer("english")
        wn_tags = {'N':wn.NOUN, 'J':wn.ADJ, 'V':wn.VERB, 'R':wn.ADV}
        wnl = WordNetLemmatizer()
        stemmed_tokens = []
        if pos_tags:
            for token in tokens:
                term = token[0]
                pos = token[1]
                pos = pos[0]
                try:
                    pos = wn_tags[pos]
                    stemmed_tokens.append(wnl.lemmatize(term, pos=pos))
                except KeyError:  # no WordNet mapping for this tag, so stem instead
                    stemmed_tokens.append(stemmer.stem(term))
        else:
            for token in tokens:
                stemmed_tokens.append(stemmer.stem(token))
    if stemming:
        print("Document "+file+" contains", len(stemmed_tokens), "terms after stemming.\n")
        tokens = stemmed_tokens
        
    # Count term frequencies and keep the 2000 most common terms of this document
    fdist = FreqDist(tokens)
    td = dict(fdist.most_common(2000))
    term_doc.append(td)
    
#Prepare Term-Document Matrix

td_mat = {}
for td in term_doc:
    td_mat = Counter(td_mat)+Counter(td)
td_matrix = {}
for k, v in td_mat.items():
    td_matrix[k] = [v]

for td in term_doc:
    for k, v in td_matrix.items():
        if k in td:
            td_matrix[k].append(td[k])
        else:
            td_matrix[k].append(0)
                
#Print Term Document Matrix

td_matrix_sorted = sorted(td_matrix.items(), key=operator.itemgetter(1),reverse=True)
print("Scenario: POS=", pos_tags, "Remove Stop Words=", remove_stop, " Stemming=", stemming)
print("------------------------------------------------------------")
print("TERM            TOTAL  D1   D2   D3   D4   D5   D6   D7   D8")
for i in range(20):
    s = '{:<15s}'.format(td_matrix_sorted[i][0])
    v = td_matrix_sorted[i][1]
    #print(v)
    for j in range(9):
        s = s + '{:>5d}'.format(v[j])
    print('{:<60s}'.format(s))
print("____________________________________________________________")


Document T1.txt contains a total of 86484  terms.

Document T2.txt contains a total of 108474  terms.

Document T3.txt contains a total of 104778  terms.

Document T4.txt contains a total of 83140  terms.

Document T5.txt contains a total of 76238  terms.

Document T6.txt contains a total of 35136  terms.

Document T7.txt contains a total of 80206  terms.

Document T8.txt contains a total of 64511  terms.
Scenario: POS= False Remove Stop Words= True  Stemming= False
------------------------------------------------------------
TERM            TOTAL  D1   D2   D3   D4   D5   D6   D7   D8
one             2061  277  422  340  207  305  117  190  203
water           2001   40  920  816    5   79    7   53   81
would           1854  270  407  195  309  222   59  289  103
could           1363  221  121   49  364  195   93  203  117
time            1137  114  100  109  149  152  200  191  122
air             1123   69  518  410   19   19   23   30   35
light            980   68  407  291   17
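
The printing loop builds each row from a 15-character left-aligned term followed by right-aligned 5-character counts. A minimal sketch of that layout follows. `format_row` is a hypothetical helper, and slicing with `[:20]` instead of indexing over `range(20)` is a suggested variation that also works when fewer than 20 terms are available.

```python
def format_row(term, counts):
    """Left-align the term in 15 columns, right-align each count in 5."""
    return '{:<15s}'.format(term) + ''.join('{:>5d}'.format(c) for c in counts)


# Two rows copied from the output above, shortened to three count columns
rows = [('one', [2061, 277, 422]), ('water', [2001, 40, 920])]

for term, counts in rows[:20]:
    print(format_row(term, counts))
```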

- ## Term Document Matrix with Stemming and Stopword Removal
    POS = False
    
    Stemming = True
    
    Remove Stopwords = True

In [30]:
file_path = 'TextFiles/'
files = ['T1.txt', 'T2.txt', 'T3.txt', 'T4.txt', 'T5.txt', 'T6.txt','T7.txt', 'T8.txt']
term_doc = []
pos_tags = False
stemming = True
remove_stop = True

#Initialize and Reading File
for file in files:
    with open (file_path+file, "r") as text_file:
        adoc = text_file.read()
    # Convert to all lower case - required
    adoc = ("%s" %adoc).lower()

    # Replace special characters with spaces
    adoc = adoc.replace('-', ' ')
    adoc = adoc.replace('_', ' ')
    adoc = adoc.replace(',', ' ')

    # Replace not contraction with not
    adoc = adoc.replace("'nt", " not")
    adoc = adoc.replace("n't", " not")
    adoc = adoc.replace("'d", " ")

    # Tokenize
    tokens = word_tokenize(adoc)
    tokens = [word.replace(',', '') for word in tokens]
    tokens = [word for word in tokens if ('*' not in word) and word != "''" and word !="``"]

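    # A loop that reassigns `word` would leave `tokens` unchanged, so no regex
    # clean-up is applied here; punctuation tokens are removed later only when
    # remove_stop is True.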
    print("\nDocument "+file+" contains a total of", len(tokens), " terms.")
    
    #POS Tagging
    if pos_tags:
        tokens = nltk.pos_tag(tokens)

    # Remove stop words
    if remove_stop:
        stop = stopwords.words('english') + list(string.punctuation)
        stop.append("said")
        # Remove single character words and simple punctuation
        tokens = [word for word in tokens if len(word) > 1]
        # Remove stop words
        if pos_tags:
            tokens = [word for word in tokens if word[0] not in stop]
            tokens = [word for word in tokens if (not word[0].replace('.','',1).isnumeric()) and word[0]!="'s" ]
        else:
            tokens = [word for word in tokens if word not in stop]
            tokens = [word for word in tokens if word != "'s" ]
            
    # Lemmatization - Stemming with POS
    if stemming:
        stemmer = SnowballStemmer("english")
        wn_tags = {'N':wn.NOUN, 'J':wn.ADJ, 'V':wn.VERB, 'R':wn.ADV}
        wnl = WordNetLemmatizer()
        stemmed_tokens = []
        if pos_tags:
            for token in tokens:
                term = token[0]
                pos = token[1]
                pos = pos[0]
                try:
                    pos = wn_tags[pos]
                    stemmed_tokens.append(wnl.lemmatize(term, pos=pos))
                except KeyError:  # no WordNet mapping for this tag, so stem instead
                    stemmed_tokens.append(stemmer.stem(term))
        else:
            for token in tokens:
                stemmed_tokens.append(stemmer.stem(token))
    if stemming:
        print("Document "+file+" contains", len(stemmed_tokens), "terms after stemming.\n")
        tokens = stemmed_tokens
        
    # Count term frequencies and keep the 2000 most common terms of this document
    fdist = FreqDist(tokens)
    td = dict(fdist.most_common(2000))
    term_doc.append(td)
    
#Prepare Term-Document Matrix

td_mat = {}
for td in term_doc:
    td_mat = Counter(td_mat)+Counter(td)
td_matrix = {}
for k, v in td_mat.items():
    td_matrix[k] = [v]

for td in term_doc:
    for k, v in td_matrix.items():
        if k in td:
            td_matrix[k].append(td[k])
        else:
            td_matrix[k].append(0)
                
#Print Term Document Matrix

td_matrix_sorted = sorted(td_matrix.items(), key=operator.itemgetter(1),reverse=True)
print("Scenario: POS=", pos_tags, "Remove Stop Words=", remove_stop, " Stemming=", stemming)
print("------------------------------------------------------------")
print("TERM            TOTAL  D1   D2   D3   D4   D5   D6   D7   D8")
for i in range(20):
    s = '{:<15s}'.format(td_matrix_sorted[i][0])
    v = td_matrix_sorted[i][1]
    #print(v)
    for j in range(9):
        s = s + '{:>5d}'.format(v[j])
    print('{:<60s}'.format(s))
print("____________________________________________________________")


Document T1.txt contains a total of 86484  terms.
Document T1.txt contains 40081 terms after stemming.


Document T2.txt contains a total of 108474  terms.
Document T2.txt contains 50486 terms after stemming.


Document T3.txt contains a total of 104778  terms.
Document T3.txt contains 52753 terms after stemming.


Document T4.txt contains a total of 83140  terms.
Document T4.txt contains 35293 terms after stemming.


Document T5.txt contains a total of 76238  terms.
Document T5.txt contains 35184 terms after stemming.


Document T6.txt contains a total of 35136  terms.
Document T6.txt contains 15669 terms after stemming.


Document T7.txt contains a total of 80206  terms.
Document T7.txt contains 35425 terms after stemming.


Document T8.txt contains a total of 64511  terms.
Document T8.txt contains 29810 terms after stemming.

Scenario: POS= False Remove Stop Words= True  Stemming= True
------------------------------------------------------------
TERM            TOTAL  D1   D2   D3 
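
One detail of the cleaning step deserves a note: assigning to the loop variable inside a `for` loop does not change the list being iterated, so per-token substitutions have to be collected into a new list. The sketch below shows the difference on made-up tokens. Applying such a filter to the notebook would change the counts reported above, so it is offered as an illustration and not as the method used for these results.

```python
import re

tokens = ['water', '...', "o'clock", 'half;', '--']

# Rebinding the loop variable leaves the list untouched
for word in tokens:
    word = re.sub(r'[^\w\s]+', '', word)
unchanged = list(tokens)

# A list comprehension builds the cleaned list and drops empty results
cleaned = [re.sub(r'[^\w\s]+', '', word) for word in tokens]
cleaned = [word for word in cleaned if word]

print(unchanged)
print(cleaned)
```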

In [32]:
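# Comparing the scenarios shows how each step affects the term-document matrix:
# without stop-word removal the top terms are function words and punctuation, while
# lemmatization with POS tags merges inflected forms and raises the counts of content
# words (for example, "water" rises from 2001 to 2040 and "light" from 980 to 1175).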
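
One way to quantify the effect is to compare the token counts printed before and after filtering. The sketch below uses the figures reported for the first scenario (POS tagging, stop-word removal and lemmatization); the variable names are illustrative.

```python
# (tokens before, tokens after) as printed by the first scenario
counts = {
    'T1.txt': (86484, 40039),
    'T2.txt': (108474, 48289),
    'T3.txt': (104778, 50177),
    'T4.txt': (83140, 35269),
    'T5.txt': (76238, 35030),
    'T6.txt': (35136, 15670),
    'T7.txt': (80206, 35461),
    'T8.txt': (64511, 29797),
}

# Share of tokens removed per document
reduction = {name: 1 - after / before for name, (before, after) in counts.items()}

for name, share in reduction.items():
    print('{}: {:.1%} of tokens removed'.format(name, share))
```

The "after" figures include the effect of stop-word removal as well as lemmatization, so the share mostly reflects how much of each text consists of stop words and punctuation.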