# Conditional Frequency
* The previous topic covered frequency distributions.
* The __FreqDist__ function computes the frequency of each item in a list.
* A frequency distribution records how many times each event occurs.

In [3]:
import nltk
items = ['apple', 'apple', 'kiwi', 'cabbage', 'cabbage', 'potato']
nltk.FreqDist(items)

FreqDist({'apple': 2, 'cabbage': 2, 'kiwi': 1, 'potato': 1})

* A conditional frequency distribution is a collection of frequency distributions, each computed for a different condition.
* For computing a conditional frequency, you have to attach a condition to every occurrence of an event.
* Let's consider the following list for computing Conditional Frequency.

In [1]:
c_items = [('F','apple'), ('F','apple'), ('F','kiwi'), ('V','cabbage'), ('V','cabbage'), ('V','potato') ]

* Each item is grouped either as a fruit F or a vegetable V.
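
To make the idea concrete before turning to NLTK, here is a minimal plain-Python sketch of the same grouping. It uses one `Counter` per condition; the name `grouped` is only illustrative, and this is a stand-in for the concept, not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Same (condition, event) pairs as in the list above.
c_items = [('F', 'apple'), ('F', 'apple'), ('F', 'kiwi'),
           ('V', 'cabbage'), ('V', 'cabbage'), ('V', 'potato')]

# One Counter per condition: each condition gets its own frequency table.
grouped = defaultdict(Counter)
for condition, event in c_items:
    grouped[condition][event] += 1

print(dict(grouped['F']))  # event counts under the fruit condition
print(dict(grouped['V']))  # event counts under the vegetable condition
```

Each condition key maps to an independent frequency table, which is the structure the rest of this topic works with.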

# Computing Conditional Frequency
* The __ConditionalFreqDist__ function of nltk is used to compute a Conditional Frequency Distribution (CFD).
* The following example shows how it is used.

In [4]:
cfd = nltk.ConditionalFreqDist(c_items)
cfd.conditions()
print(cfd['V'])
print(cfd['F'])

<FreqDist with 2 samples and 3 outcomes>
<FreqDist with 2 samples and 3 outcomes>


# Common methods of a CFD
* Illustration of Commonly used methods on a conditional frequency distribution, cfdist.
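
As a rough illustration of a few of these methods, the sketch below builds a small `cfdist` from the fruit and vegetable pairs used earlier. It only uses calls that need no corpus download; the selection of methods is an assumption about which ones are most useful here, not a complete list.

```python
import nltk

c_items = [('F', 'apple'), ('F', 'apple'), ('F', 'kiwi'),
           ('V', 'cabbage'), ('V', 'cabbage'), ('V', 'potato')]
cfdist = nltk.ConditionalFreqDist(c_items)

print(sorted(cfdist.conditions()))   # all conditions seen in the data
print(cfdist['F'])                   # the FreqDist for condition 'F'
print(cfdist['F']['apple'])          # count of one sample under one condition
print(cfdist['V'].most_common(1))    # most frequent sample under 'V'
print(cfdist.N())                    # total number of recorded outcomes
cfdist.tabulate()                    # counts as a condition-by-sample table
```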

# Counting Words by Genre
* Now let's determine the frequency of words in each genre of the __brown__ corpus.

In [None]:
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist([ (genre, word) for genre in brown.categories() for word in brown.words(categories=genre) ])

* The conditions applied can be viewed as shown below.

In [None]:
cfd.conditions()

# Viewing Word Count
* After computing the conditional frequency distribution, use the __tabulate__ method with the arguments __conditions__ and __samples__ to view the counts.

In [None]:
cfd.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership', 'worship', 'hardship'])

# Viewing Cumulative Word Count
* To view cumulative counts across the listed samples for each condition, set the __cumulative__ argument to __True__.

In [None]:
cfd.tabulate(conditions=['government', 'humor', 'reviews'], samples=['leadership', 'worship', 'hardship'], cumulative = True)
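
What a cumulative row means can be sketched in plain Python: each column holds the running total of the counts up to and including that sample. The counts below are made-up illustrative numbers, not values from the brown corpus.

```python
from itertools import accumulate

# Hypothetical counts for one condition, in the same column order as `samples`.
samples = ['leadership', 'worship', 'hardship']
counts = [12, 3, 9]

# Running totals, left to right across the columns.
cumulative = list(accumulate(counts))
for sample, running_total in zip(samples, cumulative):
    print(sample, running_total)
```

The last column of a cumulative row therefore equals the total count of all listed samples for that condition.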

# Accessing Individual Frequency Distributions
* From the obtained conditional frequency distribution, you can access individual frequency distributions.
* The example below extracts the frequency distribution of words in the __news__ genre of the __brown__ corpus.

In [None]:
news_fd = cfd['news']
news_fd.most_common(3)

* You can then look up the count of any sample, as shown below.

In [None]:
news_fd['the']

# Comparing Frequency Distributions
* Now let's see another example, which computes the frequency of the last character of male and female names and compares the two distributions.
* The text corpus __names__ contains two files, __male.txt__ and __female.txt__.

In [7]:
from nltk.corpus import names
nt = [(fid.split('.')[0], name[-1])    for fid in names.fileids() for name in names.words(fid) ]
cfd2 = nltk.ConditionalFreqDist(nt)
cfd2['female'] > cfd2['male']

True

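* The expression __cfd2['female'] > cfd2['male']__ compares the frequency distribution of last characters in female names with the one for male names. How __>__ is evaluated for two distributions depends on the NLTK version, so the result shown above may not reproduce on every release.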

* The following code snippet displays how often female and male names end in the characters __a__ and __e__.

In [8]:
cfd2.tabulate(samples=['a', 'e'])

          a    e 
female 1773 1432 
  male   29  468 


* You can observe a significant difference between female and male names in the frequencies of __a__ and __e__.
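
Raw counts can be hard to compare when the two groups differ in size, so a relative measure helps. The sketch below uses only the four counts from the table above and computes, for each group, the share of `a` among names ending in either `a` or `e`. The variable names are illustrative.

```python
# Counts copied from the tabulated output above.
counts = {
    'female': {'a': 1773, 'e': 1432},
    'male': {'a': 29, 'e': 468},
}

# Share of 'a' among names that end in 'a' or 'e', per group.
shares = {
    group: by_letter['a'] / sum(by_letter.values())
    for group, by_letter in counts.items()
}

for group, share in shares.items():
    print(f"{group}: {share:.1%} of names ending in 'a' or 'e' end in 'a'")
```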

1. Hands-on - NLP - Python - Conditional Frequency Distribution

Define a function called `calculateCFD`, which takes two parameters, each a list. The first parameter, `cfdconditions`, is a list of categories, and the second parameter, `cfdevents`, is a list of samples. The function definition code stub is given in the editor. Perform the following operations for the given `cfdconditions` and `cfdevents`:

* Determine the conditional frequency of all the words (converted to lower case, with all stop words removed) for the given categories `cfdconditions` of the brown corpus. Store the result in `cdev_cfd`. (Hint: Use the stopwords corpus to identify the stop words.)
* Determine the words ending with `ing` or `ed`. Compute a conditional frequency distribution, where the condition is `cfdconditions` and the event is either `ing` or `ed`. Store the conditional frequency distribution in the variable `inged_cfd`.
* Compute the conditional frequency using the condition `cfdconditions` and event `cfdevents`. Display the frequency of modals in the form of a table. (Hint: Use the tabulate method.)
* Compute the conditional frequency using the condition `cfdconditions` and event ['ing', 'ed']. Display the frequency of modals in the form of a table. (Hint: Use the tabulate method.)

__Input Format for Custom Testing__

Input from stdin will be processed as follows and passed to the function.

* The first line contains an integer m, the size of cfdconditions.
* Each of the next m lines contains cfdconditions[i] where 0 ≤ i < m.
* The next line contains an integer n, the size of cfdevents.
* Each of the next n lines contains cfdevents[i] where 0 ≤ i < n.

__Sample Case__

Sample Input

    STDIN          Function Parameters
    -----          -------------------
    2           →  cfdconditions[] Size m = 2
    government  →  cfdconditions[] = ['government', 'hobbies']
    hobbies
    2           →  cfdevents[] Size n = 2
    first       →  cfdevents[] = ['first', 'last']
    last

Sample Output

               first  last
    government    62    21
       hobbies   126    26
                 ed  ing
    government 2507 1474
       hobbies 2561 2169

Explanation

* The first table shows the frequency distribution based on the cfdconditions (['government', 'hobbies']) and cfdevents (['first', 'last']).
* The second table shows the frequency distribution based on the cfdconditions (['government', 'hobbies']) and ['ed', 'ing'].

In [None]:
#!/bin/python3

import math
import os
import random
import re
import sys
import zipfile
os.environ['NLTK_DATA'] = os.getcwd()+"/nltk_data"
import nltk

#
# Complete the 'calculateCFD' function below.
#
# The function accepts following parameters:
#  1. STRING_ARRAY cfdconditions
#  2. STRING_ARRAY cfdevents
#

def calculateCFD(cfdconditions, cfdevents):
    # Write your code here
    from nltk.corpus import brown, stopwords

    stop_words = set(stopwords.words('english'))

    # Task 1: counts of lower-cased, non-stop words for each requested genre.
    cdev_cfd = nltk.ConditionalFreqDist(
        (genre, word.lower())
        for genre in cfdconditions
        for word in brown.words(categories=genre)
        if word.lower() not in stop_words
    )
    cdev_cfd.tabulate(conditions=cfdconditions, samples=cfdevents)

    # Task 2: replace each non-stop word ending in 'ing' or 'ed' by its suffix.
    inged_cfd = nltk.ConditionalFreqDist(
        (genre, 'ing' if word.lower().endswith('ing') else 'ed')
        for genre in cfdconditions
        for word in brown.words(categories=genre)
        if word.lower() not in stop_words
        and word.lower().endswith(('ing', 'ed'))
    )
    inged_cfd.tabulate(conditions=cfdconditions, samples=['ed', 'ing'])

if __name__ == '__main__':
    cfdconditions_count = int(input().strip())

    cfdconditions = []

    for _ in range(cfdconditions_count):
        cfdconditions_item = input()
        cfdconditions.append(cfdconditions_item)

    cfdevents_count = int(input().strip())

    cfdevents = []

    for _ in range(cfdevents_count):
        cfdevents_item = input()
        cfdevents.append(cfdevents_item)

    if not os.path.exists(os.getcwd() + "/nltk_data"):
        with zipfile.ZipFile("nltk_data.zip", 'r') as zip_ref:
            zip_ref.extractall(os.getcwd())

    calculateCFD(cfdconditions, cfdevents)


# Raw Text Processing
* For most of the NLTK studies that you carry out, data is not readily available in the form of a text corpus.
* Instead, raw text is obtained from another source and processed before it is used with NLTK.
* Some of the processing steps that you perform are
   * Tokenization
   * Stemming

# Reading a Text File
* In this topic, you will see how data is read from different external sources.
* The following example reads content from a text file available on the Project Gutenberg site.

In [9]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
content1 = request.urlopen(url).read()
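
`read()` returns bytes, which must be decoded before tokenizing. The sketch below assumes the downloaded file is UTF-8 encoded and starts with a byte order mark (BOM), which is an assumption about this particular file rather than something the example above states. The `raw` value is a hypothetical stand-in for the downloaded bytes.

```python
# Hypothetical stand-in for the bytes returned by request.urlopen(url).read().
raw = '\ufeffThe Project Gutenberg eBook'.encode('utf-8')

plain = raw.decode('utf-8')          # keeps the BOM as a leading '\ufeff'
stripped = raw.decode('utf-8-sig')   # drops the BOM if one is present

print(repr(plain[:4]))
print(repr(stripped[:3]))
```

Under that assumption, decoding with `utf-8-sig` keeps the BOM from ending up attached to the first token.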

# Reading an HTML File
* The following example reads content from a news article available over the web.
* The __BeautifulSoup__ module is used for scraping the required text from the webpage.

In [11]:
from urllib import request
url = "http://www.bbc.com/news/health-42802191"
html_content = request.urlopen(url).read()
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

In [None]:
inner_body = soup.find_all('div', attrs={'class':'story-body__inner'})
inner_text = [elm.text for elm in inner_body[0].find_all(['h1', 'h2', 'p', 'li']) ]
text_content2 = '\n'.join(inner_text)

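* The __find_all__ method returns every div element whose class attribute is `story-body__inner`. The text of the headings, paragraphs and list items inside the first such element is then joined into a single string.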

# Reading from Other Sources
* You can also read text from other sources such as RSS feeds, FTP repositories and local text files.
* It is also possible to read text stored in binary formats, such as Microsoft Word and PDF documents.
* Third-party libraries such as __pywin32__ and __pypdf__ are required for accessing Microsoft Word or PDF documents.

# Tokenization
* __Tokenization__ is a step in which a text is broken down into words and punctuation.
* The simplest way of tokenizing is to use the __word_tokenize__ function.
* The example below tokenizes the text read from Project Gutenberg.

In [14]:
text_content1 = content1.decode('unicode_escape')  # Converts bytes to unicode
tokens1 = nltk.word_tokenize(text_content1)
tokens1[3:8]

['Project', 'Gutenberg', 'eBook', 'of', 'Crime']

* The following example tokenizes text scraped from the HTML page.

In [None]:
tokens2 = nltk.word_tokenize(text_content2)
tokens2[:5]
len(tokens2)

# Regular Expressions for Tokenization
* Regular expressions can also be utilized to split the text into tokens.
* The example below extracts every match of the regular expression __\w+__ from the text __text_content2__.

In [None]:
import re
tokens2_2 = re.findall(r'\w+', text_content2)
len(tokens2_2)
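
The choice of pattern decides what counts as a token. As a small illustration on a made-up sentence, `\w+` drops punctuation and splits contractions, while a slightly longer pattern (one possible choice among many) keeps both.

```python
import re

sample = "Don't panic, it's fine."

# Word characters only: punctuation is dropped and contractions are split.
words_only = re.findall(r'\w+', sample)

# Words with an optional apostrophe part, or any single punctuation mark.
with_punct = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sample)

print(words_only)
print(with_punct)
```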

* __nltk__ contains the function __regexp_tokenize__, which can be used similarly to __re.findall__ and produce the tokens.

In [None]:
pattern = r'\w+'
tokens2_3 = nltk.regexp_tokenize(text_content2, pattern)
len(tokens2_3)

# Creation of NLTK text
* An NLTK __Text__ object can be created from the list of tokens, as shown below.

In [None]:
input_text2 = nltk.Text(tokens2)
type(input_text2)

* The resulting text object can be used for further linguistic processing.

1. Hands-on - NLP - Python - Processing Raw Text

Define a function called `processRawText`, which takes one parameter, `textURL`, a URL link. The function definition code stub is given in the editor. Perform the following tasks:

* Read the text content from the given link `textURL`. Store the content in the variable `textcontent`.
* Tokenize all the words in `textcontent`, and convert them into lower case. Store the tokenized list of words in `tokenizedlcwords`. (Hint: Use word tokenize)
* Find the number of words in `tokenizedlcwords`, and store the result in `noofwords`.
* Find the number of unique words in `tokenizedlcwords`, and store the result in `noofunqwords`.
* Calculate the word coverage of `tokenizedlcwords` from the number of words and the number of unique words. Store the result in `wordcov`.
* Determine the frequency distribution of all words having only alphabets in `tokenizedlcwords`. Store the result in the variable `wordfreq`.
* Find the most frequent word of `tokenizedlcwords`. Store the result in the variable `maxfreq`.

Return the `noofwords`, `noofunqwords`, `wordcov`, and `maxfreq` variables from the function.

Note: Word coverage should be of the int data type.

__Input Format for Custom Testing__

Input from stdin will be processed as follows and passed to the function.

* The first line contains a string `textURL`, the URL of the text content.

__Sample Case__

Sample Input

    STDIN                                                                     Function Parameters
    -----                                                                     -------------------
    https://hrcdn.net/s3_pub/istreet-assets/2KDELtu3svGwJgNXUXFE7Q/001.txt →  textURL = 'https://hrcdn.net/s3_pub/istreet-assets/2KDELtu3svGwJgNXUXFE7Q/001.txt'

Sample Output

    210
    127
    1
    the

Explanation

* The first line denotes the number of words in `tokenizedlcwords`.
* The second line denotes the number of unique words in `tokenizedlcwords`.
* The third line denotes the word coverage in `tokenizedlcwords`.
* The fourth line denotes the most frequent word in `tokenizedlcwords`.

In [None]:
#!/bin/python3

import math
import os
import random
import re
import sys
import zipfile
os.environ['NLTK_DATA'] = os.getcwd() + "/nltk_data"
import nltk

#
# Complete the 'processRawText' function below.
#
# The function accepts STRING textURL as parameter.
#

def processRawText(textURL):
    # Write your code here
    from urllib import request

    # Read the raw bytes from the URL and decode them to text.
    textcontent = request.urlopen(textURL).read().decode('unicode_escape')

    # Total token count, taken from the text as downloaded.
    noofwords = len(nltk.word_tokenize(textcontent))

    # Lower-cased tokens, used for the remaining statistics.
    tokenizedlcwords = nltk.word_tokenize(textcontent.lower())
    noofunqwords = len(set(tokenizedlcwords))

    # Word coverage: average occurrences per unique word, truncated to int.
    wordcov = int(noofwords / noofunqwords)

    # Frequency distribution over purely alphabetic tokens.
    wordfreq = nltk.FreqDist(word for word in tokenizedlcwords if word.isalpha())
    maxfreq = wordfreq.most_common(1)[0][0]
    return noofwords, noofunqwords, wordcov, maxfreq

if __name__ == '__main__':
    textURL = input()

    if not os.path.exists(os.getcwd() + "/nltk_data"):
        with zipfile.ZipFile("nltk_data.zip", 'r') as zip_ref:
            zip_ref.extractall(os.getcwd())

    noofwords, noofunqwords, wordcov, maxfreq = processRawText(textURL)
    print(noofwords)
    print(noofunqwords)
    print(wordcov)
    print(maxfreq)


In [21]:
import re
s = 'Python is cool!!!'
print(re.findall(r'\s\w+\b', s))
pattern = r'\w+'
tokens2_3 = nltk.regexp_tokenize("Python is cool!!!", pattern)
len(tokens2_3)

[' is', ' cool']


3

# Bigrams
* __Bigrams__ are pairs of consecutive words appearing in a text.
* The __bigrams__ function is called on tokenized words, as shown in the following example, to obtain bigrams.

In [22]:
import nltk
s = 'Python is an awesome language.'
tokens = nltk.word_tokenize(s)
list(nltk.bigrams(tokens))

[('Python', 'is'),
 ('is', 'an'),
 ('an', 'awesome'),
 ('awesome', 'language'),
 ('language', '.')]

# Computing Frequent Bigrams
* Now let's find the three most frequently occurring bigrams in the english-kjv collection of the genesis corpus.
* Let's consider only those bigrams in which both words have a length of at least 5.

In [24]:
from nltk.corpus import genesis
eng_tokens = genesis.words('english-kjv.txt')
eng_bigrams = nltk.bigrams(eng_tokens)
filtered_bigrams = [ (w1, w2) for w1, w2 in eng_bigrams if len(w1) >=5 and len(w2) >= 5 ]

* After computing the bigrams, the following code computes their frequency distribution and displays the three most frequent bigrams.

In [25]:
eng_bifreq = nltk.FreqDist(filtered_bigrams)
eng_bifreq.most_common(3)

[(('their', 'father'), 19), (('lived', 'after'), 16), (('seven', 'years'), 15)]

# Determining Frequent After Words
* The following example determines the two most frequent words occurring after __living__.

In [26]:
from nltk.corpus import genesis
eng_tokens = genesis.words('english-kjv.txt')
eng_bigrams = nltk.bigrams(eng_tokens)
eng_cfd = nltk.ConditionalFreqDist(eng_bigrams)
eng_cfd['living'].most_common(2)

[('creature', 7), ('thing', 4)]

# Generating Frequent Next Word
* Now let's define a function named generate, which returns words occurring frequently after a given word.

In [27]:
def generate(cfd, word, n=5):
    n_words = []
    for i in range(n):
        n_words.append(word)
        word = cfd[word].max()
    return n_words

# Generating Most Frequent Next Word
* After defining the function generate, call it with eng_cfd and the word living as arguments.

In [28]:
generate(eng_cfd, 'living')

['living', 'creature', 'that', 'he', 'said']

* The output shows that the word occurring most frequently after __living__ is __creature__.
* Similarly, __that__ is the most frequent word after __creature__, and so on.
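
Following the most frequent next word can run into a word that is never followed by anything, such as the last token of a text. The sketch below is a plain-Python variant that stops early in that case; `build_followers` and `generate_safe` are hypothetical helper names, and the toy sentence is made up for illustration.

```python
from collections import Counter, defaultdict

def build_followers(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    followers = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        followers[current][following] += 1
    return followers

def generate_safe(followers, word, n=5):
    """Chain the most frequent next word, stopping if a word has no followers."""
    words = []
    for _ in range(n):
        words.append(word)
        candidates = followers.get(word)
        if not candidates:
            break
        word = candidates.most_common(1)[0][0]
    return words

toy_tokens = 'the cat sat on the mat'.split()
followers = build_followers(toy_tokens)
print(generate_safe(followers, 'cat', n=3))
print(generate_safe(followers, 'mat'))
```

Because the chain always picks the single most frequent follower, it can also fall into a loop on longer texts; limiting the number of steps with `n` keeps the output bounded.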

# Trigrams
* Similar to __Bigrams__, __Trigrams__ are all sequences of three consecutive words appearing in a text.

In [29]:
s = 'Python is an awesome language.'
tokens = nltk.word_tokenize(s)
list(nltk.trigrams(tokens))

[('Python', 'is', 'an'),
 ('is', 'an', 'awesome'),
 ('an', 'awesome', 'language'),
 ('awesome', 'language', '.')]

# ngrams
* __nltk__ also provides the function __ngrams__, which returns every sequence of __n consecutive words__ appearing in a text.
* The following example displays a list of four consecutive words appearing in the text __s__.

In [31]:
list(nltk.ngrams(tokens, 4))

[('Python', 'is', 'an', 'awesome'),
 ('is', 'an', 'awesome', 'language'),
 ('an', 'awesome', 'language', '.')]
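
The sliding-window idea behind bigrams, trigrams and n-grams can be sketched without NLTK. The helper name `sliding_ngrams` is hypothetical, and the token list is written out by hand to match the sentence used above.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest shifted copy, so no partial windows are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['Python', 'is', 'an', 'awesome', 'language', '.']
four_grams = sliding_ngrams(tokens, 4)
for gram in four_grams:
    print(gram)
```

A list of k tokens yields k - n + 1 windows, which is why six tokens produce three 4-grams.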

# Collocations
* A __collocation__ is a pair of words that occur together unusually often.
* For example, __red wine__ is a collocation.
* One characteristic of a __collocation__ is that the words in it cannot be substituted with words having similar senses.
* For example, the combination __maroon wine__ sounds odd.
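
One common way to quantify "occurring together more often than expected" is pointwise mutual information (PMI). The sketch below computes it in plain Python on a made-up token list; it is a different and much simpler measure than whatever scoring NLTK applies internally, and the function name `pmi` is illustrative.

```python
import math
from collections import Counter

# Made-up toy text, for illustration only.
tokens = 'red wine red wine white wine red car'.split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)
n_bigrams = sum(bigrams.values())

def pmi(w1, w2):
    """log2 of the observed pair probability over the one expected by chance."""
    if bigrams[(w1, w2)] == 0:
        return float('-inf')  # the pair never occurs
    p_pair = bigrams[(w1, w2)] / n_bigrams
    p_w1 = unigrams[w1] / n_tokens
    p_w2 = unigrams[w2] / n_tokens
    return math.log2(p_pair / (p_w1 * p_w2))

for pair in bigrams:
    print(pair, round(pmi(*pair), 2))
```

A positive score means the pair occurs more often than its individual word frequencies would suggest. PMI tends to favour rare words, so on real text it is usually combined with a minimum frequency threshold.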

# Generating Collocations
* Now let's see how to generate collocations from text with the following example.

In [32]:
from nltk.corpus import genesis
tokens = genesis.words('english-kjv.txt')
gen_text = nltk.Text(tokens)
gen_text.collocations()

said unto; pray thee; thou shalt; thou hast; thy seed; years old;
spake unto; thou art; LORD God; every living; God hath; begat sons;
seven years; shalt thou; little ones; living creature; creeping thing;
savoury meat; thirty years; every beast


1. Hands-on - NLP - Python - Bigrams and Collocations

Define a function called `performBigramsAndCollocations`, which takes two parameters. The first parameter, `textcontent`, is a string, and the second parameter is `word`. The function definition code stub is given in the editor. Perform the following tasks:

* Tokenize all the words given in `textcontent`. A word should contain only alphabets, numbers or underscores. Store the tokenized list of words in `tokenizedwords`. (Hint: Use regexp_tokenize)
* Convert all the words into lowercase. Store the result in `tokenizedwords`.
* Compute bigrams of the list `tokenizedwords`. Store the list of bigrams in `tokenizedwordsbigrams`.
* Filter only the bigrams from `tokenizedwordsbigrams` where the words are not part of stopwords. Store the result in `tokenizednonstopwordsbigrams`. (Hint: Use the stopwords corpus)
* Compute the conditional frequency of `tokenizednonstopwordsbigrams`, where condition and event refer to the words. Store the result in `cfd_bigrams`.
* Determine the three most frequent words occurring after the given `word`. Store the result in `mostfrequentwordafter`.
* Generate collocations from `tokenizedwords`. Store the list of collocation words in `collocationwords`.

Return the `mostfrequentwordafter` and `collocationwords` variables from the function.

__Input Format for Custom Testing__

Input from stdin will be processed as follows and passed to the function.

* The first line contains a string `textcontent`. Text content is used to perform mostfrequentbigrams.
* The second line contains a string `word`, used to find the three most frequent words occurring after this word.

__Sample Case__

Sample Input

    STDIN                                                               Function Parameters
    -----                                                               -------------------
    Thirty-five sports disciplines and four cultural activities....  →  textcontent = 'Explain to me again why I shouldn't cheat?" he asked....'
    sports                                                           →  word = 'sports'

Sample Output

    [('fans', 3), ('car', 3), ('disciplines', 1)]
    ['sports car', 'sports fans']

Explanation

* The first line displays the three most frequent words occurring after the given `word`.
* The second line displays all collocation words for the given `textcontent`.

In [None]:
#!/bin/python3

import math
import os
import random
import re
import sys
import zipfile
os.environ['NLTK_DATA'] = os.getcwd() + "/nltk_data"

import nltk


#
# Complete the 'performBigramsAndCollocations' function below.
#
# The function accepts following parameters:
#  1. STRING textcontent
#  2. STRING word
#

def performBigramsAndCollocations(textcontent, word):
    # Write your code here
    from nltk.corpus import stopwords

    # Tokens made of letters, digits or underscores, lower-cased.
    tokenizedwords = [w.lower() for w in nltk.regexp_tokenize(textcontent, pattern=r'\w+')]

    # Materialize the bigrams so the list can be reused.
    tokenizedwordsbigrams = list(nltk.bigrams(tokenizedwords))

    # Keep only bigrams in which neither word is a stop word.
    stop_words = set(stopwords.words('english'))
    tokenizednonstopwordsbigrams = [(w1, w2) for w1, w2 in tokenizedwordsbigrams
                                    if w1 not in stop_words and w2 not in stop_words]

    # Condition = first word of the bigram, event = the word that follows it.
    cfd_bigrams = nltk.ConditionalFreqDist(tokenizednonstopwordsbigrams)
    mostfrequentwordafter = cfd_bigrams[word].most_common(3)

    # collocation_list() is assumed to return (w1, w2) tuples, as the original code did.
    collocationwords = [w1 + " " + w2 for w1, w2 in nltk.Text(tokenizedwords).collocation_list()]

    return mostfrequentwordafter, collocationwords

if __name__ == '__main__':
    textcontent = input()

    word = input()

    if not os.path.exists(os.getcwd() + "/nltk_data"):
        with zipfile.ZipFile("nltk_data.zip", 'r') as zip_ref:
            zip_ref.extractall(os.getcwd())

    mostfrequentwordafter, collocationwords = performBigramsAndCollocations(textcontent, word)
    print(sorted(mostfrequentwordafter, key=lambda element: (element[1], element[0]), reverse=True))
    print(sorted(collocationwords))

In [33]:
s = 'Python is cool!!!'
tokens = nltk.word_tokenize(s)
list(nltk.trigrams(tokens))

[('Python', 'is', 'cool'),
 ('is', 'cool', '!'),
 ('cool', '!', '!'),
 ('!', '!', '!')]