# Course Introduction
Welcome to the course on NLP Using Python. In this course, you will learn the following topics:<br>

* Tokenizing text using functions word_tokenize and sent_tokenize.

* Computing Frequencies with FreqDist and ConditionalFreqDist.

* Generating Bigrams and collocations with bigrams and collocations.

* Stemming word affixes using PorterStemmer and LancasterStemmer.

* Tagging words to their parts of speech using pos_tag.

# NLP - Natural Language Processing
* Humans communicate in natural languages such as English, German, Japanese and so on.

* On the other hand, a Computer communicates in Machine Language, which has a defined set of rules.

* As a result, a computer cannot communicate with humans effectively in their own languages.

* Natural Language Processing helps a computer understand human languages as they are spoken, and respond to them.

# Why NLP?
* NLP techniques can process huge volumes of unstructured data available online and extract meaningful insights from it.

* It can automate translating text from one language to another.

* These techniques can be used for performing sentiment analysis.

* It helps in building applications that interact with humans as humans do.

* Also, NLP can help in automating Text Classification, Spam Filtering, and more.

# nltk
* nltk is a popular Python framework for developing programs that work with human language data.

Key features of nltk:<br>
* It provides access to over 50 text corpora and other lexical resources.
* It is a suite of text processing tools.
* It is free to use and open source.
* It is available for Windows, Mac OS X, and Linux.

# Basic Understanding of nltk
Now let's understand by performing simple tasks in the next couple of slides.<br>

__Note__: If the tokenizer models are not installed yet, run nltk.download('punkt') right after the import nltk line in the example below.<br>
* Splitting a sample text into a list of sentences.

In [1]:
import nltk
text = "Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991."
sentences = nltk.sent_tokenize(text)
len(sentences)

2

__As seen above, sent_tokenize function generates sentences from the given text.__

* Splitting a sample text into words using __word_tokenize__ function.

In [2]:
words = nltk.word_tokenize(text)
len(words)
words[:5]

['Python', 'is', 'an', 'interpreted', 'high-level']

__The expression words[:5] displays the first five words of the list words.__

* Determining the frequency of words present in sample text using FreqDist function.

In [3]:
wordfreq = nltk.FreqDist(words)
wordfreq.most_common(2)

[('programming', 2), ('.', 2)]

__The expression wordfreq.most_common(2) displays the two most frequent tokens with their respective frequency counts. Note that the full stop is counted as a token.__
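FreqDist is a subclass of Python's collections.Counter, so the counting step above can be sketched with the standard library alone. The token list below is typed in by hand to approximate the tokenized sample text. It is an assumption for illustration, not the output of a tokenizer run.

```python
from collections import Counter

# Hand-typed tokens approximating the tokenized sample text (an assumption,
# not produced by running a tokenizer here).
tokens = [
    "Python", "is", "an", "interpreted", "high-level", "programming",
    "language", "for", "general-purpose", "programming", ".",
    "Created", "by", "Guido", "van", "Rossum", "and", "first",
    "released", "in", "1991", ".",
]

# Count every token, then take the two most frequent ones.
token_counts = Counter(tokens)
top_two = token_counts.most_common(2)
print(top_two)
```

Because punctuation marks are tokens too, the full stop ranks alongside programming.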

# Downloading NLTK Book collection
* In this course, you will be working with several texts curated by the NLTK authors.
* These texts are available in the book collection of nltk.
* They can be downloaded by running the following command in the Python interpreter, after importing nltk successfully.

In [4]:
import nltk
nltk.download('book')

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\abc\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\abc\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\brown.zip.
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     C:\Users\abc\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\chat80.zip.
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\abc\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\cmudict.zip.
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     C:\Users\abc\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\conll2000.zip.
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     C:\Users\abc\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping cor

True

# Importing Items of 'book'

* The command from nltk.book import * loads nine texts and nine sentences from the book collection.
* Its output, which lists the loaded texts, is shown with the next example.

# Searching Text
* There are multiple ways of searching for a pattern in a text.
* The example shown below searches for words starting with tri, and ending with r.

In [6]:
from nltk.book import *
text1.findall("<tri.*r>")

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
triangular; triangular; triangular; triangular
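In Text.findall, each pair of angle brackets stands for exactly one token, so the pattern above can only match within a single word. For a one-token pattern like this, the search behaves roughly like a full regular-expression match against each token. A minimal standard-library sketch, using a short made-up token list rather than text1:

```python
import re

# Made-up tokens for illustration; text1 itself is not loaded here.
tokens = ["a", "triangular", "sail", "tried", "the", "trier", "trip", "tremor"]

# Full match per token: it must start with "tri" and end with "r".
pattern = re.compile(r"tri.*r")
matches = [token for token in tokens if pattern.fullmatch(token)]
print("; ".join(matches))
```

Tokens such as tried and tremor are rejected because the whole token has to fit the pattern, not just a part of it.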


# Basic Tasks with Text
__In this topic, you will understand how to perform the following activities, using text1 as input text.__<br>
* Total Word Count
* Unique Word Count
* Transforming Words
* Word Coverage
* Filtering Words
* Frequency Distribution

# Determining Total Word Count
* The NLTK book module consists of nine texts. Type the name of a text or sentence to view it.
  * text1: Moby Dick by Herman Melville 1851
  * text2: Sense and Sensibility by Jane Austen 1811
  * text3: The Book of Genesis
  * text4: Inaugural Address Corpus
  * text5: Chat Corpus
  * text6: Monty Python and the Holy Grail
  * text7: Wall Street Journal
  * text8: Personals Corpus
  * text9: The Man Who Was Thursday by G . K . Chesterton 1908
* The text1, imported from nltk.book is an object of nltk.text.Text class.

In [7]:
from nltk.book import *
type(text1)

nltk.text.Text

* Total number of words in text1 is determined using len.

In [8]:
n_words = len(text1)
n_words

260819

# Determining Unique Word Count
* The number of unique words in text1 is determined using set and len.

In [20]:
n_unique_words = len(set(text1))
n_unique_words

19317

__set(text1) generates the set of unique words from text1.__

# Transforming Words
* It is possible to apply a function to any number of words and transform them.
* Now let's transform every word of text1 to lowercase and determine unique words once again.

In [19]:
text1_lcw = [ word.lower() for word in set(text1) ]
n_unique_words_lc = len(set(text1_lcw))
n_unique_words_lc

17231

__This is 2086 fewer than n_unique_words, because words that differ only in case are now counted once.__
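The count drops because set treats The, the, and THE as three different words, while lowercasing merges them into one. A small sketch with made-up tokens (not text1):

```python
# Made-up tokens for illustration only.
tokens = ["The", "whale", "and", "the", "Whale", "THE", "sea", "."]

# Distinct tokens as written, and distinct tokens after lowercasing.
unique_tokens = set(tokens)
unique_lowercase = {token.lower() for token in tokens}

print(len(unique_tokens), len(unique_lowercase))
```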

# Determining Word Coverage
Word Coverage: the average number of times a word occurs in the text, computed as the total word count divided by the unique word count.<br>
* The following examples determine Word Coverage of raw and transformed text1.

In [11]:
word_coverage1 = n_words / n_unique_words
word_coverage1

13.502044830977896

* On average, each distinct word in text1 occurs about 13.5 times.

In [12]:
word_coverage2 = n_words / n_unique_words_lc
word_coverage2

15.136614241773549
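The same ratio can be wrapped in a small helper. The function name word_coverage and the token list are hypothetical, chosen only to show why the coverage rises once words are lowercased:

```python
def word_coverage(tokens):
    """Return total tokens divided by distinct tokens (hypothetical helper)."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(tokens) / len(set(tokens))


# Made-up tokens for illustration only.
tokens = ["To", "be", "or", "not", "to", "be"]

raw_coverage = word_coverage(tokens)
lowercase_coverage = word_coverage([token.lower() for token in tokens])
print(raw_coverage, lowercase_coverage)
```

Lowercasing leaves the total unchanged but shrinks the number of distinct words, so the ratio can only stay the same or grow.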

# Filtering Words
* Now let's see how to filter words based on specific criteria.
* The following example filters words with more than 17 characters.

In [13]:
big_words = [word for word in set(text1) if len(word) > 17 ]
big_words

['uninterpenetratingly', 'characteristically']

* A list comprehension with a condition is used above.

# Filtering Words
* Now let's see one more example which filters words having the prefix Sun.

In [14]:
sun_words = [word for word in set(text1) if word.startswith('Sun') ]
sun_words

['Sunset', 'Sunday', 'Sunda']

* The above example is case-sensitive. It does not match words that start with a lowercase s followed by un.
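If a case-insensitive match is wanted instead, one option is to lowercase each word before testing the prefix. A minimal sketch on a made-up vocabulary (not text1):

```python
# Made-up vocabulary for illustration only.
vocabulary = {"Sunday", "sunset", "Sunda", "sunk", "SUN", "moon", "unsung"}

# Lowercase each word before testing the prefix so case no longer matters.
sun_words_any_case = sorted(
    word for word in vocabulary if word.lower().startswith("sun")
)
print(sun_words_any_case)
```

The original spelling of each word is kept in the result. Only the comparison is done in lowercase.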

# Frequency Distribution
* The FreqDist class of nltk can be used to determine the frequency of all words present in an input text.
* The following example builds frequency distributions for text1 and text6, and displays the frequency of the word ARTHUR in text6.

In [22]:
text1_freq = nltk.FreqDist(text1)
text6_freq = nltk.FreqDist(text6)
text6_freq['ARTHUR']

225

# Common Methods of Frequency Distribution
* A frequency distribution fdist provides several commonly used methods, such as most_common.

* Now let's identify three frequent words from text1_freq distribution using most_common method.

In [16]:
top3_text1 = text1_freq.most_common(3)
top3_text1

[(',', 18713), ('the', 13721), ('.', 6862)]

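* The output shows that the three most frequent tokens are the comma (,), the word the, and the full stop (.).
* This may look odd at first, but punctuation marks are counted as tokens just like words.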
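Beyond most_common, a frequency distribution offers a few other frequently used methods. A short sketch on a made-up token list (not text1), using nltk.FreqDist and needing no downloaded data:

```python
import nltk

# Made-up tokens for illustration only.
tokens = ["the", "whale", ",", "the", "sea", ",", "the", "ship"]
fdist = nltk.FreqDist(tokens)

total = fdist.N()                # total number of tokens counted
distinct = fdist.B()             # number of distinct tokens
top = fdist.max()                # the single most frequent token
share = fdist.freq("the")        # relative frequency of "the"
once = sorted(fdist.hapaxes())   # tokens that occur exactly once

print(total, distinct, top, share, once)
```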

# Frequency Distribution
In general, you would be interested in finding frequent words which are not common in usage and specific to input text.<br>
In the next example, you will perform the following.
* Filter words that are entirely alphabetic and longer than seven characters.
* Determine frequency distribution of the filtered words.
* Identify the ten most common words.

In [17]:
large_uncommon_words = [word for word in text1 if word.isalpha() and len(word) > 7 ]
text1_uncommon_freq = nltk.FreqDist(large_uncommon_words)
text1_uncommon_freq.most_common(10)

[('Queequeg', 252),
 ('Starbuck', 196),
 ('something', 119),
 ('Nantucket', 85),
 ('sometimes', 81),
 ('harpooneer', 77),
 ('standing', 73),
 ('whalemen', 71),
 ('business', 67),
 ('Leviathan', 64)]

* Queequeg, Starbuck and something are the top three words, based on the chosen criteria.

# Hands-on

In [None]:
#!/bin/python3

import math
import os
import random
import re
import sys

import zipfile

os.environ['NLTK_DATA'] = os.getcwd()+"/nltk_data"
from nltk.corpus import gutenberg
from nltk.text import Text

#
# Complete the 'calculateWordCounts' function below.
#
#
#

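def calculateWordCounts(text):
    # Total number of tokens in the text
    n_words = len(text)
    print(n_words)
    # Number of distinct tokens
    n_unique_words = len(set(text))
    print(n_unique_words)
    # Word coverage: average occurrences per distinct token, truncated to an integer
    word_coverage = n_words / n_unique_words
    print(int(word_coverage))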

if __name__ == '__main__':
    text = input()
    if not os.path.exists(os.getcwd()+"/nltk_data"):
        with zipfile.ZipFile("nltk_data.zip", 'r') as zip_ref:
            zip_ref.extractall(os.getcwd())
    
    text = Text(gutenberg.words(text))

    calculateWordCounts(text)


In [None]:
#!/bin/python3

import math
import os
import random
import re
import sys
import zipfile

os.environ['NLTK_DATA'] = os.getcwd() + "/nltk_data"
from nltk.corpus import gutenberg
from nltk.text import Text


#
# Complete the 'filterWords' function below.
#
# 
#

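def filterWords(text):
    # Work on the set of unique words so each match is reported once
    unique_words = set(text)
    # Words ending with 'ing'
    ing_words = [word for word in unique_words if word.endswith('ing')]
    # Words longer than 15 characters
    large_words = [word for word in unique_words if len(word) > 15]
    # Words written entirely in uppercase
    upper_words = [word for word in unique_words if word.isupper()]
    return ing_words, large_words, upper_words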
    
if __name__ == '__main__':
    text = input()
    if not os.path.exists(os.getcwd() + "/nltk_data"):
        with zipfile.ZipFile("nltk_data.zip", 'r') as zip_ref:
            zip_ref.extractall(os.getcwd())
            
    text = Text(gutenberg.words(text))

    ing_words, big_words, upper_words = filterWords(text)

    print(sorted(ing_words))
    print(sorted(big_words))
    print(sorted(upper_words))


In [None]:
#!/bin/python3

import math
import os
import random
import re
import sys
import zipfile
os.environ['NLTK_DATA'] = os.getcwd() + "/nltk_data"
from nltk.corpus import gutenberg
from nltk.text import Text
import nltk


#
# Complete the 'findWordFreq' function below.
#
# 

def findWordFreq(text, word):
    # Frequency of the given word among all tokens
    text_freq = nltk.FreqDist(text)
    word_freq = text_freq[word]
    # Most frequent token among the purely alphabetic tokens
    alpha_words = [w for w in text if w.isalpha()]
    alpha_freq = nltk.FreqDist(alpha_words)
    max_freq_word = alpha_freq.most_common(1)[0][0]
    return word_freq, max_freq_word

if __name__ == '__main__':
    text = input()
    word = input()
    if not os.path.exists(os.getcwd() + "/nltk_data"):
        with zipfile.ZipFile("nltk_data.zip", 'r') as zip_ref:
            zip_ref.extractall(os.getcwd())

    text = Text(gutenberg.words(text))

    word_freq, max_freq = findWordFreq(text, word)

    print(word_freq)
    print(max_freq)

# Text Corpora
* In the previous topic, you have worked with a simple text collection text1.
* In this topic, you will work with larger text collections known as Text Corpora or Text Corpus.
* The following code snippet downloads additional text corpora with varied content.

In [23]:
import nltk
nltk.download('book')

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\abc\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\abc\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     C:\Users\abc\AppData\Roaming\nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\abc\AppData\Roaming\nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     C:\Users\abc\AppData\Roaming\nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     C:\Users\abc\AppData\Roaming\nltk_

True

# Popular Text Corpora
## Two popular __Text Corpora__ available from __nltk__, which you will be using in this course are:
* __Genesis__: A collection of texts of the Book of Genesis in several languages.
* __Brown__: The first electronic corpus of English to contain one million words.
## Other Corpus in nltk
* __Gutenberg__ : Collections from Project Gutenberg
* __Inaugural__ : Collection of U.S. presidents' inaugural speeches

# Popular Text Corpora
* __stopwords__ : Collection of stop words.
* __reuters__ : Collection of news articles.
* __cmudict__ : Collection of CMU Dictionary words.
* __movie_reviews__ : Collection of Movie Reviews.
* __nps_chat__ : Collection of chat text.
* __names__ : Collection of names associated with males and females.
* __state_union__ : Collection of State of the Union addresses.
* __wordnet__ : Collection of all lexical entries.
* __words__ : Collection of words in Wordlist corpus.

# Accessing Text Corpora
* Any text corpus has to be imported before you start working with it.
* The code below imports the __genesis__ text corpus.

In [24]:
from nltk.corpus import genesis

* The text collections available in the genesis text corpus are listed by the fileids method.

In [25]:
genesis.fileids()

['english-kjv.txt',
 'english-web.txt',
 'finnish.txt',
 'french.txt',
 'german.txt',
 'lolcat.txt',
 'portuguese.txt',
 'swedish.txt']

* The output displays eight text collections, present in genesis corpus.

# Working with a Text Corpus
* Now let's understand how to work with a text corpus.
* The following example determines the average word length and average sentence length of each text collection present in genesis corpus.

In [27]:
for fileid in genesis.fileids():
    n_chars = len(genesis.raw(fileid))
    n_words = len(genesis.words(fileid))
    n_sents = len(genesis.sents(fileid))
    print(int(n_chars/n_words), int(n_words/n_sents), fileid)

4 30 english-kjv.txt
4 19 english-web.txt
5 15 finnish.txt
4 23 french.txt
4 23 german.txt
4 20 lolcat.txt
4 27 portuguese.txt
4 30 swedish.txt


* The methods raw, words and sents return the characters, words, and sentences of a specific text collection, and len gives the total number of each.
* The output of the code is shown above. The text collection finnish.txt has a higher average word length (5) than the others (4).
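Note that n_chars is the length of the raw text, so spaces and punctuation are included in the first ratio. One possible variation, sketched below on made-up data rather than the genesis corpus, averages the lengths of the alphabetic tokens directly and counts tokens per sentence:

```python
# Made-up sentences, each already split into tokens (illustrative data only).
sentences = [
    ["In", "the", "beginning", "."],
    ["And", "there", "was", "light", "."],
]

# Flatten the sentences into one token list and keep the alphabetic tokens.
tokens = [token for sentence in sentences for token in sentence]
alpha_tokens = [token for token in tokens if token.isalpha()]

# Mean length of alphabetic tokens and mean number of tokens per sentence.
avg_word_length = sum(len(token) for token in alpha_tokens) / len(alpha_tokens)
avg_sentence_length = len(tokens) / len(sentences)
print(round(avg_word_length, 2), avg_sentence_length)
```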

# Text Corpus Structure
A text corpus is organized into one of the following four structures.
* __Isolated__ - Holds individual text collections.
* __Categorized__ - Each text collection is tagged to a category.
* __Overlapping__ - Each text collection is tagged to one or more categories.
* __Temporal__ - Each text collection is tagged to a period, date, time, etc.

# Isolated Text Corpus
* genesis text corpus has eight text collections, which are isolated in structure.

# Categorized Text Corpus
* Each text collection is tagged to a specific category or genre.
* E.g.: Brown text corpus contains 500 collections, which are categorized into 15 genres.

# Overlapping Text Corpus
* Each collection is categorized into one or more genres.
* E.g.: Reuters corpus contains 10788 collections, which are tagged to 90 genres.

# Temporal Text Corpus
* Each text collection is tagged to a period of time.
* E.g.: inaugural corpus contains text collections corresponding to U.S. presidential inaugural speeches, gathered over a period of time.
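One way to picture the difference between these structures is as plain mappings from file names to labels. The file names, categories, and the helper files_in_category below are all hypothetical. They are not taken from any nltk corpus.

```python
# Hypothetical file names and labels, for illustration only.
categorized = {"news01.txt": ["news"], "fic01.txt": ["fiction"]}
overlapping = {"art01.txt": ["grain", "trade"], "art02.txt": ["trade"]}
temporal = {"1789-speech.txt": 1789, "1793-speech.txt": 1793}


def files_in_category(corpus, category):
    """Return the files tagged with the given category (hypothetical helper)."""
    return sorted(name for name, labels in corpus.items() if category in labels)


# In an overlapping corpus one file can appear under several categories.
trade_files = files_in_category(overlapping, "trade")
grain_files = files_in_category(overlapping, "grain")
print(trade_files, grain_files)
```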

# Loading User Specific Corpus
* Now let's see how to convert your collection of text files into a text corpus.
* Suppose you have three files c1.txt, c2.txt and c3.txt in the /usr/share/dict path.
* The following example creates the corpus wordlists from them.

In [None]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/usr/share/dict'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()
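A self-contained variation of the example above, assuming a temporary directory and two made-up files in place of /usr/share/dict. It only calls words(), which relies on the reader's default word tokenizer and needs no downloaded models.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Made-up file contents for illustration only.
files = {
    "c1.txt": "NLTK reads plain text files.",
    "c2.txt": "Each file becomes one fileid.",
}

with tempfile.TemporaryDirectory() as corpus_root:
    # Write the sample files into the temporary corpus directory.
    for name, content in files.items():
        with open(os.path.join(corpus_root, name), "w", encoding="utf-8") as fh:
            fh.write(content)

    # Treat every .txt file under corpus_root as one text collection.
    wordlists = PlaintextCorpusReader(corpus_root, r".*\.txt")
    fileids = wordlists.fileids()
    # Read the words while the temporary directory still exists.
    c1_words = list(wordlists.words("c1.txt"))

print(fileids)
print(c1_words)
```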

In [None]:
#!/bin/python3

import math
import os
import random
import re
import sys
import zipfile
os.environ['NLTK_DATA'] = os.getcwd() + "/nltk_data"
import nltk


#
# Complete the 'accessTextCorpora' function below.
#
# The function accepts following parameters:
#  1. STRING fileid
#  2. STRING word
#

def accessTextCorpora(fileid, word):
    from nltk.corpus import inaugural
    # All tokens of the requested inaugural address
    file_words = inaugural.words(fileid)
    # Word coverage: total tokens per distinct token, truncated to an integer
    wordcoverage = int(len(file_words) / len(set(file_words)))
    # Distinct words ending with 'ed'
    ed_words = [w for w in set(file_words) if w.endswith('ed')]
    # Frequency of the given word among lowercased alphabetic tokens
    lower_alpha_words = [w.lower() for w in file_words if w.isalpha()]
    textfreq = nltk.FreqDist(lower_alpha_words)
    wordfreq = textfreq[word]

    return wordcoverage, ed_words, wordfreq

if __name__ == '__main__':
    fileid = input()
    word = input()

    if not os.path.exists(os.getcwd() + "/nltk_data"):
        with zipfile.ZipFile("nltk_data.zip", 'r') as zip_ref:
            zip_ref.extractall(os.getcwd())

    word_coverage, ed_words, word_freq = accessTextCorpora(fileid, word)

    print(word_coverage)
    print(sorted(ed_words))
    print(word_freq)

In [None]:
#!/bin/python3

import math
import os
import random
import re
import sys
import nltk

from nltk.corpus import PlaintextCorpusReader
#
# Complete the 'createUserTextCorpora' function below.
#
# The function accepts following parameters:
#  1. STRING filecontent1
#  2. STRING filecontent2
#

def createUserTextCorpora(filecontent1, filecontent2):
    # Write each input string to its own file; the with block closes the file
    with open(os.path.join('nltk_data', 'content1.txt'), "w") as file1:
        file1.write(filecontent1)
    with open(os.path.join('nltk_data', 'content2.txt'), "w") as file2:
        file2.write(filecontent2)

    # Build a corpus from every file in the nltk_data directory
    text_corpus = PlaintextCorpusReader('nltk_data/', '.*')
    # Total and unique word counts for each file
    no_of_words_corpus1 = len(text_corpus.words('content1.txt'))
    no_of_words_corpus2 = len(text_corpus.words('content2.txt'))
    no_of_unique_words_corpus1 = len(set(text_corpus.words('content1.txt')))
    no_of_unique_words_corpus2 = len(set(text_corpus.words('content2.txt')))
    return text_corpus, no_of_words_corpus1, no_of_unique_words_corpus1, no_of_words_corpus2, no_of_unique_words_corpus2


if __name__ == '__main__':
    filecontent1 = input()

    filecontent2 = input()

    path = os.path.join(os.getcwd(), "nltk_data")
    os.makedirs(path, exist_ok=True)
    # Remove files left over from earlier runs, using a portable path join
    for file in os.listdir(path):
        os.remove(os.path.join(path, file))


    text_corpus, no_of_words_corpus1, no_of_unique_words_corpus1, no_of_words_corpus2, no_of_unique_words_corpus2 = createUserTextCorpora(filecontent1, filecontent2)
    expected_corpus_files = ['content1.txt', 'content2.txt']
    if type(text_corpus) == nltk.corpus.reader.plaintext.PlaintextCorpusReader and sorted(list(text_corpus.fileids())) == expected_corpus_files:
        print(no_of_words_corpus1)
        print(no_of_unique_words_corpus1)
        print(no_of_words_corpus2)
        print(no_of_unique_words_corpus2)
