We took a quick tour of Python and saw some of its features earlier in today's class. For the rest of this class, and part of the next, let us see what off-the-shelf NLP tools Python offers.

Before we get into NLP methods in depth, let us first look at how to do basic corpus analysis in Python. That is the goal of this notebook. 

Based on: 
- Chapters 1--3 of the NLTK book. (Chapters 1--7 may be generally useful for folks who are not NLP-focused, IMO)

Other useful resources:
- Python worksheets in Katrin Erk's NLP courses: https://www.katrinerk.com/teaching/resources 

# Why?
The first step before we build any NLP models is to "understand" the data a little bit.

What is "understanding" the data?

- Looking at frequently used words/ngrams in the data
- Looking at collocations (what words go together)
- Visualizing text corpora
- Measuring lexical diversity, linguistic coverage, etc. in the dataset

Why do we need this?

- To get a general idea of what the texts are about
- Identify potentially useful features to use in an NLP model that uses this data
- Identify potential noise in the data
- Understand limitations of the data (e.g., potential bias)

## Installing necessary libraries 
Before proceeding further, we have to install the necessary Python libraries (libraries can be thought of as a collection of pre-implemented utilities we can just use right away)

- nltk
- matplotlib (for plots)

Installing libraries in Python happens through a command called pip.

In [None]:
!pip install nltk
!pip install matplotlib

#the ! symbol lets us run programs that are typically run from a command line tool in a notebook.

In [None]:
#If you are able to import these libraries without errors, that means you installed them correctly
import nltk
import matplotlib

## Download the required language resources for nltk

In [None]:
nltk.download("book")
#Download any data or other non-code resources we need for using the code in the book.

# Elementary Text Analysis 

I will use the NLTK library's existing functions to illustrate some text analysis functions in Python. The following examples are based on Chapter 1 of the NLTK book.

I will start with reading a text file from the web. This is the file: https://www.gutenberg.org/files/2446/2446-0.txt
(I don't have any agenda - I just like this play, so chose it).

In [None]:
from urllib import request
#urllib is a built-in python library for handling web interactions such as accessing a web-page, reading its contents etc.
myurl = "https://www.gutenberg.org/files/2446/2446-0.txt"
response = request.urlopen(myurl)
mytext = ""
for line in response:
    mytext += line.decode("utf-8").strip() + "\n"
    
print(mytext)

In [None]:
from nltk.text import Text
from nltk.tokenize import sent_tokenize, word_tokenize

text_nltk = Text(word_tokenize(mytext.lower()))

In [None]:
text_nltk

There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word "town" in this text. This can be done by calling a pre-implemented utility for NLTK Text objects, called "concordance", as follows:



In [None]:
text_nltk.concordance("town")
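
To get a feel for what a concordance view does, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name `simple_concordance`, the `window` parameter, and the toy sentence are all made up for illustration.

```python
def simple_concordance(tokens, word, window=3):
    """Return each occurrence of `word` with up to `window` tokens of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# A made-up toy sentence, not taken from the play.
toy_tokens = "the town is small but the whole town knows the news".split()
for line in simple_concordance(toy_tokens, "town"):
    print(line)
```

The real `concordance` method aligns the keyword in a fixed-width column, but the underlying idea is the same: find every position of the word and show its neighbours.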

A concordance permits us to see words in context. What other words appear in a similar range of contexts? We can use the "similar" function for a given text:

In [None]:
text_nltk.similar("people")

In [None]:
text_nltk.similar("stockmann")

It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. 

In [None]:
text_nltk.dispersion_plot(["peter", "citizens", "town", "enemy", "morten"])
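
The positional information behind a dispersion plot is just a list of token offsets for each word. A small sketch of that idea, assuming a made-up toy sentence and a hypothetical helper called `word_offsets` (neither is part of NLTK):

```python
def word_offsets(tokens, word):
    """Return the positions (counted from 0) at which `word` occurs in `tokens`."""
    return [i for i, tok in enumerate(tokens) if tok == word]

# A made-up toy sentence, not taken from the play.
toy_tokens = "peter met morten in town and morten left town".split()
print(word_offsets(toy_tokens, "morten"))  # [2, 6]
print(word_offsets(toy_tokens, "town"))    # [4, 8]
```

A dispersion plot draws one mark per offset, with one row per word, so clusters of marks show where in the text a word is concentrated.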

# Counting Vocabulary

Counting words, getting their frequencies, etc. are all useful operations when we start doing NLP-based analyses with text data. Let us see some simple utilities here.

Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols that appear. We use the function len to get the length of something, which we'll apply here to our text_nltk variable:

In [None]:
len(text_nltk) 

So this text has 42789 words and punctuation symbols, or "tokens." A token is the technical name for a sequence of characters — such as hairy, his, or :) — that we want to treat as a group. When we count the number of tokens in a text, say, the phrase to be or not to be, we are counting occurrences of these sequences. Thus, in our example phrase there are two occurrences of to, two of be, and one each of or and not. 
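
The counts for the example phrase can be checked quickly in plain Python. This sketch splits on whitespace instead of using a tokenizer, which is good enough for this phrase; the variable names are my own.

```python
from collections import Counter

# Whitespace splitting is enough for this simple phrase.
phrase_tokens = "to be or not to be".split()

print(len(phrase_tokens))        # total number of tokens: 6
phrase_counts = Counter(phrase_tokens)
print(phrase_counts["to"], phrase_counts["be"])   # 2 2
print(phrase_counts["or"], phrase_counts["not"])  # 1 1
```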

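How many unique tokens does this text have, though? In Python, we can obtain the unique tokens using the set() function, and sorted() arranges them in order. When you display the result, many screens of words will fly past. Taking len() of the set gives the number of unique tokens. Now try the following: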



In [None]:
sorted(set(text_nltk))
len(set(text_nltk))
#Why am I seeing only one output instead of two?

In [None]:
sorted(set(text_nltk))


Now, let's calculate a measure of the **lexical richness** of the text. One way of getting this is to look at the number of unique tokens as a percentage of the total number of tokens in the text. The higher this number, the higher the lexical richness.


In [None]:
100*len(set(text_nltk)) / len(text_nltk)
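
If we want to reuse this calculation for other texts, we can wrap it in a small function. This is only a sketch: the name `lexical_richness` is my own choice, and the two toy phrases are there just to show the extremes.

```python
def lexical_richness(tokens):
    """Percentage of unique tokens among all tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return 100 * len(set(tokens)) / len(tokens)

# 4 unique tokens out of 6, so roughly 66.7
print(lexical_richness("to be or not to be".split()))
# every token is different, so 100.0
print(lexical_richness("the cat sat on a mat".split()))
```

Keep in mind that this percentage tends to drop as a text gets longer, since common words keep repeating, so it is safest to compare texts of similar length.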

Going by this number, we can perhaps say that the vocabulary in this piece of text is "less varied".

Next, let's focus on particular words. We can count how often a word occurs in a text, and compute what percentage of the text is taken up by a specific word:



In [None]:
print(text_nltk.count("petra"))

print(100 * text_nltk.count('stockmann') / len(text_nltk))

# Texts and lists

In NLP, we commonly see a piece of text represented as a list of sentences or tokens, with each sentence in turn represented as a list of tokens. It is important to note that there are no perfect sentence splitters or word tokenizers. There are different options around, and some are tailored to specific use cases (e.g., NLTK has a tweet tokenizer, among others). In this section, we will see a few examples of word and sentence tokenization from nltk.

In [None]:
#NLTK has pre-implemented functions to split a text into sentences, and sentences into tokens. 
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
all_sens = sent_tokenize(mytext) #I am taking the raw mytext variable directly, not text_nltk.
len(all_sens)

In [None]:
all_sens[233] #looking at a random sentence.


In [None]:
words_in_a_sentence = word_tokenize(all_sens[233])
print(words_in_a_sentence)
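
To see why there is no perfect splitter or tokenizer, it helps to try a few naive approaches in plain Python. The sample sentence below is made up, and the regular expression is just one simple possibility, not what NLTK does internally.

```python
import re

# A made-up sample with an abbreviation and a contraction.
sample = "Dr. Stockmann didn't reply. He left."

# Naive sentence splitting on ". " wrongly breaks after the abbreviation "Dr."
naive_sentences = sample.split(". ")
print(naive_sentences)

# Whitespace splitting leaves punctuation glued to the words.
whitespace_tokens = sample.split()
print(whitespace_tokens)

# A simple regex: runs of word characters, or single punctuation symbols.
# It separates punctuation, but also chops "didn't" into three pieces.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(regex_tokens)
```

Each approach gets something wrong, which is why it is worth looking at the output of whichever tokenizer you choose before building on top of it.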

## Computing simple statistics from a text

In [None]:
from nltk import FreqDist

In [None]:
fdist1 = FreqDist(text_nltk)
print(fdist1)

In [None]:
fdist1.most_common(10)

In [None]:
fdist1["petra"]
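
In current NLTK versions, FreqDist builds on Python's built-in collections.Counter, so the same kind of lookups can be tried without NLTK. A small sketch with a made-up toy sentence:

```python
from collections import Counter

# A made-up toy sentence, not taken from the play.
toy_tokens = "the doctor told the mayor that the water was bad".split()
toy_counts = Counter(toy_tokens)

print(toy_counts.most_common(1))  # [('the', 3)]
print(toy_counts["water"])        # 1
print(toy_counts["petra"])        # 0: unseen words count as zero, no error
```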

In [None]:
# Selecting words longer than 15 characters.
myvocab = set(text_nltk)

long_words = [w for w in myvocab if len(w) > 15]

long_words

Exercise: What are other ways of writing the same line of code for long_words?

In [None]:
# Words that appear more than 10 times and fewer than 20 times

mylist = [w for w in myvocab if fdist1[w] > 10 and fdist1[w] < 20]
print(len(mylist))
mylist[:10]

In [None]:
#The following piece of code does the same too:
mylist = [w for w in myvocab if fdist1[w] in range(11,20)]
print(len(mylist))
mylist[:10]

In [None]:
#And, what about this?
mylist = [w for w in myvocab if fdist1[w] in range(10,20)]
print(len(mylist))
mylist[:10]

Question: Why do these two give different results?

## Collocations, ngrams and other counts

N-grams are sequences of n words that appear one after another. Bigrams are the n = 2 case: pairs of adjacent words. Collocations are, roughly, bigrams that occur together unusually often.

In [None]:
from nltk import bigrams
bigrams_list = list(bigrams(text_nltk))

In [None]:
len(bigrams_list)

In [None]:
bigrams_list[22]
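
Bigrams are simply adjacent pairs, so for a small list we can build them by hand with zip. This is a sketch in plain Python with a toy phrase, shown only to make the idea concrete; the variable names are my own.

```python
from collections import Counter

toy_tokens = "to be or not to be".split()

# Pair each token with the one that follows it.
toy_bigrams = list(zip(toy_tokens, toy_tokens[1:]))
print(toy_bigrams)
print(len(toy_bigrams))  # 6 tokens give 5 bigrams

# Counting the pairs shows which bigram is most frequent.
print(Counter(toy_bigrams).most_common(1))  # [(('to', 'be'), 2)]
```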

In [None]:
text_nltk.collocations()

In [None]:
text_nltk.collocation_list()

Counting words is useful, but we can count other things too. For example, we can look at the distribution of word lengths in a text, by creating a FreqDist out of a long list of numbers, where each number is the length of the corresponding word in the text:



In [None]:
dist = FreqDist([len(w) for w in text_nltk]) 

In [None]:
dist

In [None]:
len(dist)

In [None]:
dist[21] #seeing how many 21 character tokens exist

In [None]:
dist.max() #What is this doing? 

In [None]:
100*dist.freq(5)

That is, tokens of length 5 make up about 8% of the text.

In [None]:
100*dist.freq(19)
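
The relative frequency used above is just the count for one length divided by the total number of tokens. A plain-Python sketch of that calculation, with a made-up toy sentence and a hypothetical helper `relative_freq`:

```python
from collections import Counter

# A made-up toy sentence, not taken from the play.
toy_tokens = "the mayor of the town spoke".split()
length_counts = Counter(len(w) for w in toy_tokens)
total = sum(length_counts.values())

def relative_freq(length):
    """Share of tokens that have exactly `length` characters."""
    return length_counts[length] / total

print(100 * relative_freq(5))   # "mayor" and "spoke": 2 of 6 tokens
print(100 * relative_freq(19))  # no tokens this long, so 0.0
```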

In [None]:
dist.plot()

Exercise: Think about how you can extract trigrams and their frequency distribution, based on what you have learnt so far and by exploring NLTK a little!

Let us take a quick 5-minute break here, and I will wrap up today's class with a short overview of another useful concept: regular expressions.