# Introduction to NLTK

Let's start by importing the NLTK library.
We also download its 'punkt' sentence tokenizer (which we will discuss later).

In [None]:
!pip install nltk

In [None]:

import nltk
nltk.download('punkt')

The FILE variable points to the location of the text file we will be analysing.

In [None]:
# configure: define the location of a plain text file for analysis (a relative path here; an absolute path also works)
FILE = 'walden.txt'

Now, let's import all modules from NLTK. 

In [None]:
# import / require the use of the Toolkit
from nltk import *

To open a file, we use the built-in open function. 
The open function returns a file object that contains methods and attributes to perform various operations on the file.

To read a file’s contents, call the .read() method. With no argument it reads the whole file and returns it as a single string; an optional size argument limits how much is read.

In [None]:
# slurp up the given file; display the result
# the with-statement closes the file automatically once reading is done
with open(FILE, 'r') as handle:
    data = handle.read()
print(data)
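
As a small aside on what .read() returns, the sketch below compares a call with a size argument to a call without one. It uses io.StringIO as an in-memory stand-in for a file, which is an assumption made so that the example runs without walden.txt; the sample sentence is illustrative only.

```python
import io

# In-memory stand-in for an opened text file (hypothetical content)
fake_file = io.StringIO("I went to the woods because I wished to live deliberately.")

first_chunk = fake_file.read(6)   # read at most 6 characters
rest = fake_file.read()           # no argument: read everything that is left

print(repr(first_chunk))
print(repr(rest))
```

The second call continues from where the first one stopped, because a file object keeps track of its current position.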

**Tokenizing** is splitting a text into chunks of varying granularity, e.g. sentences, tokens (words, numbers, punctuation marks) or even individual letters.

In [None]:
features = word_tokenize(data)
print(features)



**CODE IT** Can you think of any alternative ways for tokenization? 

Insert your code for an alternative in the cell below marked by #Insert your code here


In [None]:

def word_tokenize_function(text):
    tokens = []
    # Insert your code here
    return tokens

features_alternative = word_tokenize_function(data)
print(features_alternative)

Depending on the task, we may need to **lowercase** all words. Think about how this may affect the analysis: which use cases require lowercasing, and which do not?

In [None]:
# normalize the features to lower case and exclude punctuation
features = [feature for feature in features if feature.isalpha()]
features = [feature.lower() for feature in features]
print( features )
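
To make the effect of lowercasing concrete, here is a minimal sketch on a handful of made-up tokens (they are not taken from walden.txt). It counts the tokens once as they are and once lowercased, using collections.Counter.

```python
from collections import Counter

# Toy tokens (hypothetical, chosen to show both a benefit and a cost of lowercasing)
toy_tokens = ["The", "US", "asked", "us", "about", "the", "Apple", "and", "the", "apple"]

cased_counts = Counter(toy_tokens)
lowered_counts = Counter(token.lower() for token in toy_tokens)

# "The" and "the" are counted separately before lowercasing, together after
print(cased_counts["the"], lowered_counts["the"])

# "US" (the country) and "us" (the pronoun) can no longer be told apart
print(lowered_counts["us"])
```

Merging "The" with "the" is usually what we want for counting, while merging "US" with "us" shows the kind of information that lowercasing can discard.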


**Stopwords** are words that are very common (e.g. 'i', 'and', 'by', 'for', 'haven't' etc.) and of relatively low value for text analysis. They are often removed, and there are various strategies for doing so (from ready-made stopword lists like NLTK's to customised lists based on a particular corpus). There are also use cases where stopwords are not removed. Can you think of cases where removing them may not be desirable?

In [None]:
# create a set of (English) stopwords, and then remove them from the features
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # separate name, so the stopwords module is not overwritten
features = [feature for feature in features if feature not in stop_words]
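
The text above also mentions customised stopword lists based on a particular corpus. One naive way to build such a list, sketched here under the assumption that the most frequent words are the ones to drop, is to take the top few words by count. The toy sentence and the cut-off `top_n` are illustrative choices, not recommendations.

```python
from collections import Counter

# Toy corpus (hypothetical); in the notebook this role is played by the lowercased features
toy_tokens = "the pond and the woods and the cabin and the path".split()

# Treat the top_n most frequent words as corpus-specific stopwords (top_n is arbitrary)
top_n = 2
custom_stopwords = {word for word, _ in Counter(toy_tokens).most_common(top_n)}

filtered = [token for token in toy_tokens if token not in custom_stopwords]

print(sorted(custom_stopwords))
print(filtered)
```

On a real corpus a purely frequency-based list can also catch frequent content words, so such a list is normally inspected by hand before it is used.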

**CODE IT**  Create a list of words which would NOT EXCLUDE stopwords


In [None]:
all_words=[]

#Insert your code here

## Frequency distribution

Sometimes we may want to find how many times each word occurs in the text (that is, its **frequency**).
What insights can this information give us?
It is called a distribution because it tells us how the total number of tokens in the text is distributed across all vocabulary items.
The **vocabulary** is the set of unique words your text contains.

In [None]:
# count & tabulate the features, and then plot the results -- season to taste
frequencies = FreqDist(features)
plot = frequencies.plot(10)
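
NLTK's FreqDist is a subclass of Python's collections.Counter, so the relationship between tokens, vocabulary and frequencies can be sketched with the standard library alone. The tokens below are made up for illustration.

```python
from collections import Counter

# Toy tokens (hypothetical)
toy_tokens = ["pond", "woods", "pond", "cabin", "pond", "woods"]

toy_frequencies = Counter(toy_tokens)
toy_vocabulary = set(toy_frequencies)            # the unique words
total_tokens = sum(toy_frequencies.values())     # frequencies add up to the token count

print(toy_frequencies.most_common(2))            # highest-frequency items first
print(len(toy_vocabulary), total_tokens)         # vocabulary size vs. number of tokens
```

The frequencies of all vocabulary items always sum to the total number of tokens, which is what "distributed across all vocabulary items" means.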



**CODE IT** Create a frequency distribution plot for all words in the text including stopwords using the list you created in the previous exercise


**OBSERVE AND REFLECT** What do you observe comparing the two plots?

_Write your reflection here:_


Words in English texts have a very peculiar distribution.

**Recommended video:** https://www.youtube.com/watch?v=fCn8zs912OE (0:00-3:15 is about **Zipf's Law** in language, but the rest of the video is also worth checking out if you have time!)

Check the frequency of any word in the British National Corpus (100 million words), which was designed to represent an accurate cross-section of current English usage, here: http://www.wordcount.org/main.php

The 50–100 most frequent words typically account for about 50% of the word tokens in a text.
At the same time, about half of the words in the vocabulary of a text occur only once in that text! Such a word is called a **"hapax legomenon"** (or just hapax).
Why do we sometimes want to find them? What can we do with them in the context of NLP?

In [None]:
# create a list of hapaxes (words that occur only once in the features); display them
hapaxes = frequencies.hapaxes()
print(hapaxes)
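
The earlier claim that a small number of top-frequency words covers a large share of a text can be checked numerically. The helper below is a hypothetical sketch (its name and the toy sentence are not part of the notebook); it returns the fraction of all tokens accounted for by the k most frequent words.

```python
from collections import Counter

def coverage_of_top_k(tokens, k):
    """Return the share of all tokens accounted for by the k most frequent words."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)

# Toy tokens (hypothetical); try the notebook's own token list instead
toy_tokens = "to be or not to be that is the question".split()
print(coverage_of_top_k(toy_tokens, 2))
```

Calling it with increasing values of k on a full text shows how quickly the coverage grows for the first few dozen words and how slowly afterwards.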

**CODE IT** Write a function that extracts unique words from the data, without using NLTK built-in functions

In [None]:
def unique_words(data):
    # insert your code here
    return  # add return value here

print(unique_words(data))

In [None]:
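from nltk.util import ngrams
# count & tabulate ngrams from the features -- season to taste; display some
# a separate variable name keeps the ngrams function usable in later cells
bigram_list = list(ngrams(features, 2))  # materialise the lazy iterator so it can be printed
print(bigram_list[:10])
bigram_frequencies = FreqDist(bigram_list)
bigram_frequencies.most_common(10)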
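
What an n-gram extractor computes can be sketched as a sliding window over the token list. The function below is a hypothetical helper written for illustration, not NLTK's implementation; like nltk.util.ngrams it produces its results lazily, so it is wrapped in list() for display.

```python
def sliding_ngrams(tokens, n):
    """Yield every run of n adjacent tokens as a tuple (illustrative helper, not part of NLTK)."""
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])

# Toy tokens (hypothetical)
toy_tokens = ["i", "went", "to", "the", "woods"]
print(list(sliding_ngrams(toy_tokens, 2)))
```

A list of m tokens yields m - n + 1 n-grams, so the number of n-grams shrinks slightly as n grows, and nothing is produced when n exceeds the number of tokens.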

**CODE IT** Print Trigrams, Fourgrams ... 

In [None]:
#insert your code here

What are the most frequent trigrams, fourgrams, ...?
Reflect on the relationship between the size of the n-gram and its frequency.


**CODE IT** Find Bigrams, Trigrams and Fourgrams... using all words i.e. including stop words

**OBSERVE AND REFLECT:** What difference do you observe in the frequency of n-grams if we use all words, including stopwords?

_Write your reflection here:_



In [None]:
# create a list of each token's length, and plot the result; how many "long" words are there?
lengths = [len(feature) for feature in features]
plot = FreqDist(lengths).plot(10)

## Stemming

**Stemming** removes suffixes from a word to reduce it to its stem (root form). The purpose of stemming is to bring variant forms of a word together.

In this tutorial we are using the **Porter stemmer**. It is one of the most widely used algorithms, but others also exist.

In [None]:
# initialize a stemmer, stem the features, count & tabulate, and output
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stems = [stemmer.stem( feature ) for feature in features]
frequencies = FreqDist(stems)
frequencies.most_common(10)

**CODE IT** Can you implement any of the stemming rules for English using only Python code and not the NLTK library? (for example: plural rules)

In [None]:
def chosen_stemming_rule(word):
    stemmed_word = word  # placeholder: the word is returned unchanged until a rule is added
    #Insert your code here
    return stemmed_word

new_stems = [chosen_stemming_rule( feature ) for feature in features]
print(new_stems)

In [None]:
# re-create the features and create an NLTK Text object, so other cool things can be done
features = word_tokenize(data)
text = Text(features)

In [None]:
# count & tabulate, again; list a given word -- season to taste
frequencies = FreqDist( text )
print( frequencies[ 'love' ] )

## Concordances

To examine the context of words in a text we can explore **concordances**.
A concordance view in NLTK shows us every occurrence of a given word, together with some context (words surrounding it).



In [None]:
# do keyword-in-context searching against the text (concordancing)
# concordance() prints its results itself and returns None, so no print() is needed
text.concordance( 'love' )
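
The idea behind a concordance (keyword in context) can be sketched in a few lines of plain Python. The helper below is hypothetical and much simpler than NLTK's version: it counts context in tokens rather than characters, and the case-insensitive matching is a choice made for this sketch.

```python
def keyword_in_context(tokens, keyword, width=3):
    """Return one line per occurrence of keyword, with `width` tokens of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Toy tokens (hypothetical sentence)
toy_tokens = "I love the woods and I love a broad margin to my life".split()
for line in keyword_in_context(toy_tokens, "love", width=2):
    print(line)
```

Each output line shows one occurrence of the keyword in brackets, surrounded by its neighbouring tokens.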



**If you fancy**
Take a look at AntConc and check whether you can use it to find concordances in a text:

**ANTCONC:** https://www.laurenceanthony.net/software/antconc/


**BLOG POST on ANTCONC** https://dhh.uni.lu/2020/08/11/antconc-historians-and-their-diverging-research-methods/


## Dispersion plot

**Lexical dispersion** shows how the occurrences of a word are spread across a corpus.
A dispersion plot marks each occurrence of a word by its word offset, i.e. how many words from the beginning of the corpus it appears.
When can it be useful?
Think of a corpus that covers a long time period: a dispersion plot can be used to analyse how the frequency of specific terms may have changed over time.

In [None]:
# create a dispersion plot of given words
plot = text.dispersion_plot(['love', 'war', 'man', 'god'])
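
The word offsets that a dispersion plot draws can be computed directly. This is a minimal sketch with a hypothetical helper and a made-up token list; it only collects the positions and does not do any plotting.

```python
def word_offsets(tokens, word):
    """Return the positions (counted in tokens from the start) at which word occurs."""
    return [offset for offset, token in enumerate(tokens) if token == word]

# Toy tokens (hypothetical)
toy_tokens = "war then love then war and at last love".split()
print(word_offsets(toy_tokens, "love"))
print(word_offsets(toy_tokens, "war"))
```

In a dispersion plot each of these offsets becomes one mark on the horizontal axis, with one row per word.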

**Bigrams** are sequences of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. 


In [None]:
# output the "most significant" bigrams, considering surrounding words (size of window) -- season to taste
text.collocations(num=10, window_size=4)


In [None]:
# Find contexts where the specified words appear
text.common_contexts(['love', 'war', 'man', 'god'])

**OBSERVE AND REFLECT**

What's the difference between n-grams (bigrams, trigrams) and collocations?

_Write your reflection here_

In [None]:
# list the words (features) most associated with the given word
text.similar( 'love' )

In [None]:
# create a list of sentences, and display one -- season to taste
sentences = sent_tokenize(data)
sentence  = sentences[ 14 ]
print(sentence)




**CODE IT** Can you think of an alternative way to implement sent_tokenize using only Python, without NLTK's built-in functions?

In [None]:
def sent_tokenize_code(text):
    sentences=[]
    # Insert your code here
    return sentences 

alternative_sentences = sent_tokenize_code(data)
print(alternative_sentences[14])

## Part-of-speech tagging

Do you remember from school what lexical categories or parts of speech are?
You have probably heard about nouns, verbs, adjectives, pronouns and so on, and what function they have.

- How are they used in natural language processing?
- What is a good Python data structure for storing words and their categories?
- How can we automatically tag each word of a text with its word class?

The process of classifying words into their parts of speech and labeling them accordingly is known as **part-of-speech tagging, POS-tagging, or simply tagging**. 
**Parts of speech** are also known as **word classes** or **lexical categories**. 
**Tagset** - a collection of tags used for a particular task.

**A part-of-speech tagger, or POS-tagger**, processes a sequence of words and attaches a part of speech tag to each word.
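
Taggers such as NLTK's pos_tag return a list of (word, tag) tuples. The sketch below uses hand-written pairs in that shape (the tags are assumed for illustration and are not actual tagger output) to show two simple ways of working with tagged words: counting the tags, and grouping the words by tag in a dictionary.

```python
from collections import Counter, defaultdict

# Hand-written (word, tag) pairs; the tags are assumptions, not real tagger output
toy_tagged = [
    ("I", "PRP"), ("love", "VBP"), ("the", "DT"), ("quiet", "JJ"),
    ("woods", "NNS"), ("and", "CC"), ("the", "DT"), ("pond", "NN"),
]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in toy_tagged)

# Which words carry each tag?
words_by_tag = defaultdict(list)
for word, tag in toy_tagged:
    words_by_tag[tag].append(word)

print(tag_counts.most_common(1))
print(dict(words_by_tag))
```

A dictionary keyed by tag is only one possible layout; keying by word instead is handy when the question is which tags a given word can take.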



In [None]:
nltk.download('averaged_perceptron_tagger')

In [None]:
# tokenize the sentence and parse it into parts-of-speech, all in one go
sentence = pos_tag( word_tokenize( sentence ) )
print( sentence )


#Find Verbs in the data and print out the stems of the found verbs


What do the tags mean?
You can query NLTK by using nltk.help.upenn_tagset('RB'), or a regular expression, e.g. nltk.help.upenn_tagset('NN.*'). 
Let's do a little exercise: ask NLTK for help with some tags from the output above


In [None]:
nltk.download('tagsets')
#your code goes here: ask NLTK for help with some tags from the output above

## Named entity recognition - NER

**Named-entity recognition (NER)** seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

In information extraction, a **named entity** is a real-world object, such as persons, locations, organizations, products, etc., that can be denoted with a proper name.

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')


In [None]:
# extract named entities from a sentence, and print the results
with open("RTL.txt") as rtl_file:
    rtl_text = rtl_file.read()

sentences=sent_tokenize(rtl_text)

sentence = sentences[1]
sentence = pos_tag( word_tokenize( sentence ) )

from nltk import ne_chunk
entities = ne_chunk(sentence)
print(entities)

**EXERCISE**  

Download the text file containing "The Brothers Karamazov" by Fyodor Dostoyevsky from [this link](https://www.gutenberg.org/files/28054/28054-0.txt), provided by [Project Gutenberg](https://www.gutenberg.org/about/).



Use [this](https://python-graph-gallery.com/260-basic-wordcloud/) code snippet to draw a word cloud for the text of "The Brothers Karamazov":

- Without any preprocessing of the text
- After performing any preprocessing you find helpful to produce a more informative word cloud
- Compare the two word clouds and write a reflection on your observations


