## Caesar Cipher

![title](http://www.maths-resources.net/enrich/codes/caesar/images/caesarwheel3.gif)

The letters on the outer circle represent letters in the original text (message). The letters on the inner circle represent the (encoded) cipher text.

Here, the inner circle is rotated to the left by 3 (`k`=3). 

### The `chr()` and `ord()` Functions

`ord(c)`: Returns an integer representing the Unicode code point of the character `c`.

In [1]:
ord('A')

65

`chr(i)`: Returns a string of one character whose Unicode code point is the integer `i`.

In [2]:
chr(65)

'A'

In [3]:
k = 3

chr(65+k)

'D'

In [4]:
ord(chr(65+k))

68

The `ord()` and `chr()` functions are inverses of each other.

### Caesar cipher in Python

In [6]:
# Define the message text, and key

message = 'Et tu, Brute?'

k = 3

In [7]:
print ('The message is:', message)

print('The key is:', k)

The message is: Et tu, Brute?
The key is: 3


In [None]:
# Encrypt the message, one letter at a time

for letter in message:
    
    print ('Input letter:', letter)
    
    # Retrieve the ASCII code for the letter
    num = ord(letter)
    
    # Add 'k' to that code
    num = num + k
    
    # Retrieve the letter for that ASCII code
    encrypted_letter = chr(num)
    
    print ('Encrypted letter:', encrypted_letter)

Let's encrypt only letters in the alphabet, and ignore special characters, e.g., space, exclamation points.

In [None]:
# Ignore special characters

for letter in message:
    
    print ('Input letter:', letter)
    
    if letter.isalpha():
        num = ord(letter)
        num += k
        encrypted_letter = chr(num)
    else:
        encrypted_letter = letter
    
    print ('Encrypted letter:', encrypted_letter)

Notice how `num = num + k` can also be written as `num += k`.

In [None]:
# Save the encrypted message in a string

# Initialize the output (encrypted) message
encrypted_message = ''

for letter in message:
    
    if letter.isalpha():
        num = ord(letter)
        num += k
        encrypted_message += chr(num)
    else:
        encrypted_message += letter

print ('Input message:', message)
print ('Encrypted message:', encrypted_message)

### Let's create a function

In [None]:
def encrypt_message(in_message):

    ## -- INSERT CODE HERE -- ##

In [None]:
# Call the function to encrypt the message
encrypted_message = encrypt_message(message)

# Print the encrypted message
print ('Encrypted message:', encrypted_message)

Let's modify the function to include `k` as one of its parameters.

In [None]:
def encrypt_message(in_message, key):

    # initialize the output (encrypted) message
    out_message = ''

    for letter in in_message:
        if letter.isalpha():
            num = ord(letter)
            num += key
            out_message += chr(num)
        else:
            out_message += letter

    return out_message

print ('Encrypted message (k=3):', encrypt_message(message, 3))
print ('Encrypted message (k=7):', encrypt_message(message, 7))
print ('Encrypted message (k=0):', encrypt_message(message, 0))

For `k=7`, the letter `t` gets encrypted into the left curly bracket (`{`), because `ord('t') + 7` is 123. See the ASCII table below for reference.

![image](http://www.asciitable.com/index/asciifull.gif)

Let's modify the function to avoid situations where the encrypted message contains non-alphabetic characters.
In other words, force the encryption to "wrap around" to the beginning of the alphabet whenever adding the key takes a letter past `Z` or `z`.

In [None]:
def encrypt_message(in_message, key):

    # Initialize the output (encrypted) message
    out_message = ''

    for letter in in_message:
        
        if letter.isalpha():
            num = ord(letter)
            num += key
            
            if letter.isupper() and num > ord('Z'):
                num -= 26
            elif letter.islower() and num > ord('z'):
                num -= 26
            
            out_message += chr(num)
        else:
            out_message += letter

    return out_message

print ('Encrypted message (k=3):', encrypt_message(message, 3))
print ('Encrypted message (k=7):', encrypt_message(message, 7))
print ('Encrypted message (k=0):', encrypt_message(message, 0))
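
The wrap-around can also be written with modulo arithmetic. The following is a minimal sketch, not the notebook's own method: the name `caesar_shift` is hypothetical, and it assumes only ASCII letters should be shifted. Because the shift is taken modulo 26, it also accepts keys larger than 26 and negative keys.

```python
def caesar_shift(text, key):
    """Shift ASCII letters by `key` positions, wrapping around within the alphabet."""
    shifted = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            # Measure the letter's position relative to 'A' or 'a'
            base = ord('A') if ch.isupper() else ord('a')
            # The modulo keeps the result inside the 26-letter alphabet
            shifted.append(chr((ord(ch) - base + key) % 26 + base))
        else:
            # Leave spaces, punctuation, and non-ASCII characters unchanged
            shifted.append(ch)
    return ''.join(shifted)

print(caesar_shift('Et tu, Brute?', 3))
print(caesar_shift('Hw wx, Euxwh?', -3))
```

With this formulation, a negative key undoes a positive one, so a single function can cover both directions.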

### Function to decode an encrypted message

In [None]:
def decrypt_message(in_message, key):

    # Initialize the output (decrypted) message
    out_message = ''

    for letter in in_message:
        
        if letter.isalpha():
            num = ord(letter)
            
            # For decrypting the message, we need to subtract the key
            num -= key
                
            out_message += chr(num)
        else:
            out_message += letter

    return out_message

In [None]:
encrypted_message = 'Hw wx, Euxwh?'

print ('Decrypted message:', decrypt_message(encrypted_message, 3))

But what if we don't know the key that was used to encrypt the message?

In [None]:
# We can use brute force to decode the message
# Try every non-trivial key k (1 to 25)

for k in range(1, 26):
    print (decrypt_message(encrypted_message, k), f'(key={k})')

Let's modify the `decrypt_message()` function to avoid situations where the decrypted message contains non-alphabetic characters.
In other words, force the decryption to "wrap around" to the end of the alphabet whenever subtracting the key takes a letter past `A` or `a`.

In [None]:
def decrypt_message(in_message, key):

    # Initialize the output (decrypted) message
    out_message = ''

    for letter in in_message:
        if letter.isalpha():
            num = ord(letter)
            num -= key
            
            if letter.isupper():
                if num > ord('Z'):
                    num -= 26
                elif num < ord('A'):
                    num += 26
            elif letter.islower():
                if num > ord('z'):
                    num -= 26
                elif num < ord('a'):
                    num += 26
            
            out_message += chr(num)
        else:
            out_message += letter

    return out_message

for k in range(1, 26):
    print (decrypt_message(encrypted_message, k), f'(key={k})')


The code in this exercise is adapted from [*Invent Your Own Computer Games with Python* by Al Sweigart](http://a.co/d/4LAcqtI)

### [Speech: "Friends, Romans, countrymen, lend me your ears" BY WILLIAM SHAKESPEARE](https://www.poetryfoundation.org/poems/56968/speech-friends-romans-countrymen-lend-me-your-ears)


In [None]:
speech = 'Friends, Romans, countrymen, lend me your ears; \
I come to bury Caesar, not to praise him. \
The evil that men do lives after them; \
The good is oft interred with their bones; \
So let it be with Caesar. The noble Brutus \
Hath told you Caesar was ambitious: \
If it were so, it was a grievous fault, \
And grievously hath Caesar answer’d it. \
Here, under leave of Brutus and the rest– \
For Brutus is an honourable man; \
So are they all, all honourable men– \
Come I to speak in Caesar’s funeral. \
He was my friend, faithful and just to me: \
But Brutus says he was ambitious; \
And Brutus is an honourable man. \
He hath brought many captives home to Rome \
Whose ransoms did the general coffers fill: \
Did this in Caesar seem ambitious? \
When that the poor have cried, Caesar hath wept: \
Ambition should be made of sterner stuff: \
Yet Brutus says he was ambitious; \
And Brutus is an honourable man. \
You all did see that on the Lupercal \
I thrice presented him a kingly crown, \
Which he did thrice refuse: was this ambition? \
Yet Brutus says he was ambitious; \
And, sure, he is an honourable man. \
I speak not to disprove what Brutus spoke, \
But here I am to speak what I do know. \
You all did love him once, not without cause: \
What cause withholds you then, to mourn for him? \
O judgment! thou art fled to brutish beasts, \
And men have lost their reason. Bear with me; \
My heart is in the coffin there with Caesar, \
And I must pause till it come back to me.'

print (encrypt_message(speech, 3))

### Count letter frequency

In [None]:
from collections import Counter

# Convert all letters to uppercase and then count the frequency
letters = Counter(speech.upper())

letters

Sort the results in descending order.

In [None]:
sorted_letters = []
sorted_freq = []

sorted_dict = sorted(letters.items(), key=lambda val: val[1], reverse=True)

for key, val in sorted_dict:
    if key.isalpha():
        sorted_letters.append(key)
        sorted_freq.append(val)

In [None]:
print (sorted_letters, sorted_freq)

Plot the results.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

sns.set(style="darkgrid")
plt.figure().set_size_inches(12, 9)

plt.bar(sorted_letters, sorted_freq)
plt.xlabel('Letter', size=14)
plt.ylabel('Frequency', size=14)
plt.xticks(sorted_letters)

plt.show();
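
Letter counts like these are the basis of a classic attack on the Caesar cipher: since the cipher shifts every letter by the same amount, the most frequent letter in the cipher text probably stands for the most frequent letter in the original language. The sketch below assumes that letter is `E`, which is common for English but not guaranteed for a short message. The name `guess_key` is hypothetical.

```python
from collections import Counter

def guess_key(ciphertext, most_common_plain='E'):
    """Guess the Caesar key by assuming the top cipher letter stands for `most_common_plain`."""
    # Count only the letters A-Z, ignoring case
    counts = Counter(ch for ch in ciphertext.upper() if 'A' <= ch <= 'Z')
    if not counts:
        return 0
    top_letter, _ = counts.most_common(1)[0]
    # Distance from the assumed plain letter to the observed cipher letter
    return (ord(top_letter) - ord(most_common_plain)) % 26

# 'see the tree' shifted by 3
print(guess_key('vhh wkh wuhh'))
```

The guess is only a starting point. For short or unusual texts it can be wrong, in which case trying every key by brute force remains the fallback.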

### Word Frequency

In [None]:
word_freq = Counter(speech.split())

word_freq

#### _Also see:_ ####
    
[Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law): Given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table. So word number n has a frequency proportional to 1/n. 
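
As a rough illustration of the rank-frequency idea, the sketch below compares each word's observed count with the count Zipf's law would predict from the top-ranked word. The sample sentence is made up for this example and is far too small to show the law convincingly. It only demonstrates the bookkeeping.

```python
from collections import Counter

# A made-up sample; a real test of Zipf's law needs a large corpus
sample_text = ('the quick brown fox jumps over the lazy dog '
               'and the fox barks at the dog')

ranked = Counter(sample_text.split()).most_common()
top_freq = ranked[0][1]

for rank, (word, freq) in enumerate(ranked[:5], start=1):
    # Zipf's law predicts a frequency proportional to 1/rank
    predicted = top_freq / rank
    print(f'rank={rank} word={word!r} observed={freq} predicted={predicted:.2f}')
```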

## Text Processing

The most common way to deal with text documents is to first convert them into a numeric vector form (sparse matrix), and then perform additional analysis, such as clustering, classification, and visualization, using those vectors. This is usually referred to as the 'Bag-of-Words' or 'Vector Space' model.
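
To make the vector form concrete, here is a minimal pure-Python sketch of a bag-of-words representation. The two documents are invented for illustration, and real pipelines would normally use a library vectorizer instead.

```python
from collections import Counter

# Two tiny made-up documents
docs = ['the cat sat on the mat', 'the dog sat']
tokenized = [doc.split() for doc in docs]

# The vocabulary defines one vector position per distinct word
vocabulary = sorted({word for words in tokenized for word in words})

# Each document becomes a list of word counts, in vocabulary order
vectors = [[Counter(words)[term] for term in vocabulary] for words in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Word order is discarded: each document is described only by how often each vocabulary word occurs in it. Most entries are zero for realistic vocabularies, which is why sparse matrices are used.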

### 1. Remove Punctuation

In [None]:
import string

string.punctuation

In [None]:
all_punctuations = set(string.punctuation)

all_punctuations

In [None]:
speech_clean = ''.join(l for l in speech if l not in all_punctuations)

speech_clean
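
One caveat: `string.punctuation` contains only ASCII punctuation, so typographic characters such as the curly apostrophe in `answer’d` or the en dash in the speech are left in place. The sketch below shows this with `str.translate()`, an alternative to the join-based approach. The extended character set at the end is an assumption about what one might also want to strip.

```python
import string

# Translation table that deletes every ASCII punctuation character
punctuation_table = str.maketrans('', '', string.punctuation)

ascii_sample = 'Friends, Romans, countrymen!'
curly_sample = 'answer’d it'

print(ascii_sample.translate(punctuation_table))
print(curly_sample.translate(punctuation_table))  # the curly apostrophe survives

# Possible extension (an assumption): also delete the curly apostrophe and en dash
extended_table = str.maketrans('', '', string.punctuation + '’–')
print(curly_sample.translate(extended_table))
```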

### 2. Convert to Upper/Lower-case

In [None]:
speech_clean = speech_clean.lower()

speech_clean

### 3. Remove Stop Words

The `scikit-learn` package provides a list of stop words. 

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

ENGLISH_STOP_WORDS

Let's discard all stop words from the text.

In [None]:
speech_words = [word for word in speech_clean.split() if word not in ENGLISH_STOP_WORDS]

set(speech_words)

### 4. Stemming

Stemming is the process of reducing inflected or derived words to their word stem, base or root form. There are several stemming algorithms available; we will use *Porter* and *Lancaster* stemmers in this exercise.
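
To give a feel for what a stemmer does, here is a deliberately naive suffix-stripping sketch. It is not the Porter or Lancaster algorithm: the function name `naive_stem` and its short suffix list are made up for illustration, and real stemmers apply many more rules.

```python
def naive_stem(word, suffixes=('ing', 'ed', 'ly', 's')):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ['Play', 'Playing', 'Played', 'grievous', 'grievously']:
    print(word, '-->', naive_stem(word))
```

Even this toy version shows the typical trade-off: related forms collapse to one stem, but the result is not always a real word.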

In [None]:
from nltk.stem import PorterStemmer

from nltk.tokenize import word_tokenize

Let's first take a look at an example.

In [None]:
# Define a stemmer
stemmer= PorterStemmer()

# Example 1
for word in ['Play', 'Playing', 'Played']:
    
    stem = stemmer.stem(word)
    
    print ('Word:', word, '\t --> Stem:', stem)

In [None]:
# Example 2

for word in ['grievous', 'grievously']:
    
    stem = stemmer.stem(word)
    
    print ('Word:', word, ' \t --> Stem:', stem)

Let's try another stemmer.

In [None]:
# Import
from nltk.stem import LancasterStemmer

# Define
stemmer = LancasterStemmer()

for word in ['grievous', 'grievously']:
    
    stem = stemmer.stem(word)
    
    print ('Word:', word, ' \t --> Stem:', stem)

Apply stemmer on the speech text.

In [None]:
# Create an empty array to store the results (i.e., stems)
stems = []

for word in speech_words:
    
    # Check if it's a stop word
    if word not in ENGLISH_STOP_WORDS:
        
        # Append the stem for each word to the output array
        stems.append(stemmer.stem(word))
   
set(stems)

Note: [Julie Beth Lovins](https://en.wikipedia.org/wiki/Julie_Beth_Lovins), a computational linguist, published the first-ever stemming algorithm in 1968.

### 5. Lemmatization

For Lemmatization, we need a dictionary, or a corpus reader.

In [None]:
import nltk

nltk.download('wordnet')

In [None]:
# Import
from nltk.stem import WordNetLemmatizer

# Define
lemmatizer = WordNetLemmatizer()

for word in ['know', 'knowing', 'knew', 'knowledge']:
    
    lemma = lemmatizer.lemmatize(word)
    stem = stemmer.stem(word)
    
    print ('Word:', word, '--> Stem:', stem, '--> Lemma:', lemma)

We must provide the context in which we're trying to lemmatize the words. This is referred to as the part of speech (POS). By default, `lemmatize()` treats every word as a noun.

In [None]:
# Import
from nltk.stem import WordNetLemmatizer

# Define
lemmatizer = WordNetLemmatizer()

for word in ['know', 'knowing', 'knew', 'knowledge']:
    
    # Adding pos argument to lemmatize()
    lemma = lemmatizer.lemmatize(word, pos='v')
    stem = stemmer.stem(word)
    
    print ('Word:', word, '--> Stem:', stem, '--> Lemma:', lemma)

In [None]:
# Create an empty array to store the results (i.e., lemmas)
lemmas = []

for word in speech_words:
    
    # Check if it's a stop word
    if word not in ENGLISH_STOP_WORDS:
        
        # Append the lemma for each word to the output list
        lemmas.append(lemmatizer.lemmatize(word, 'v'))
    
set(lemmas)

In [None]:
len(set(lemmas))

In [None]:
len(set(stems))

In [None]:
len(speech_clean.split())

Stemming and lemmatization are closely related. Unlike lemmatization, stemming doesn't take the context (part of speech) into account, but it typically runs faster. In Information Retrieval applications, stemming tends to improve the True Positive Rate (recall) but reduce the True Negative Rate (specificity).

### Bringing it all together

For the next part of this exercise, let's analyze transcripts from a couple of US presidential inaugural addresses.

In [None]:
import string

# NLTK tokenizer (to split a sentence into words)
nltk.download('punkt')

trump_speech_transcript = r"C:\Users\visha\derive Dropbox\Projects\vcu\python\misc\inaugural_speech_trump.txt"
obama_speech_transcript = r"C:\Users\visha\derive Dropbox\Projects\vcu\python\misc\inaugural_speech_obama.txt"

In [None]:
def create_tokens(infile):
    
    with open(infile) as f:
        
        # Read each line from the file and convert it into lowercase
        line = f.read().lower()

        # Remove all punctuations
        line_clean = ''.join(l for l in line if l not in all_punctuations)
        
        # Remove all stop words (this will create a list of words)
        line_words = [word for word in line_clean.split() if word not in ENGLISH_STOP_WORDS]

        # Join all those words to create a line (of text) again
        line_clean = ' '.join(line_words)

        # Tokenize
        tokens = nltk.word_tokenize(line_clean)
        
        return tokens

tokens = create_tokens(trump_speech_transcript)

tokens[:5]

In [None]:
count = Counter(tokens)

print (count.most_common(10))

In [None]:
import re

def create_tokens(infile):
    
    with open(infile) as f:
        
        # Read each line from the file and convert it into lowercase
        line = f.read().lower()

        # Remove all punctuations
        line_clean = ''.join(l for l in line if l not in all_punctuations)
        
        # Remove all stop words (this will create a list of words)
        # In addition, use regex to remove non-alphabetic characters
        line_words = [re.sub("[^a-zA-Z' ]+", '', word) for word in line_clean.split() if word not in ENGLISH_STOP_WORDS]
        
        # Join all those words to create a line (of text) again
        line_clean = ' '.join(line_words)

        # Tokenize
        tokens = nltk.word_tokenize(line_clean)
        
        return tokens

tokens_trump = create_tokens(trump_speech_transcript)

token_count_trump = Counter(tokens_trump)

print (token_count_trump.most_common(10))

Note: For an explanation of how that regex replaces all non-letter characters with '' (nothing), please follow this [link](https://stackoverflow.com/questions/47561298/python-regex-remove-escape-characters-and-punctuation-except-for-apostrophe?rq=1).

In [None]:
tokens_obama = create_tokens(obama_speech_transcript)

token_count_obama = Counter(tokens_obama)

print (token_count_obama.most_common(10))

In [None]:
def create_tokens(infile):
    
    with open(infile) as f:
        
        # Read each line from the file and convert it into lowercase
        line = f.read().lower()

        # Remove all punctuations
        line_clean = ''.join(l for l in line if l not in all_punctuations)
        
        # Remove all stop words (this will create a list of words)
        # In addition, use regex to remove non-alphabetic characters
        line_words = [re.sub("[^a-zA-Z' ]+", '', word) for word in line_clean.split() if word not in ENGLISH_STOP_WORDS
                     and word != 'applause']

        # Join all those words to create a line (of text) again
        line_clean = ' '.join(line_words)

        # Tokenize
        tokens = nltk.word_tokenize(line_clean)
        
        return tokens

tokens_obama = create_tokens(obama_speech_transcript)

token_count_obama = Counter(tokens_obama)

print (token_count_obama.most_common(10))
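
Since the cleaning steps are the same each time, they can be factored into a helper that works on a string and takes the extra words to ignore as a parameter. This is only a sketch: `clean_text` and `SMALL_STOP_WORDS` are hypothetical names, and the tiny stop-word set stands in for `ENGLISH_STOP_WORDS` so that the example runs on its own.

```python
import re
import string

# Hypothetical stand-in for ENGLISH_STOP_WORDS, kept tiny for the example
SMALL_STOP_WORDS = frozenset({'the', 'a', 'an', 'and', 'of', 'to', 'is', 'are'})

def clean_text(text, stop_words=SMALL_STOP_WORDS, extra_stop_words=()):
    """Lowercase, strip punctuation, drop stop words, keep alphabetic tokens."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    ignore = set(stop_words) | set(extra_stop_words)
    # Remove any remaining non-letter characters from each kept word
    words = [re.sub('[^a-z]+', '', word) for word in text.split() if word not in ignore]
    # Discard words that became empty (e.g., pure numbers)
    return [word for word in words if word]

print(clean_text('The people (applause) of America, 2009!', extra_stop_words={'applause'}))
```

Passing transcript-specific words such as `'applause'` as an argument avoids redefining the function for each speech.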

### TF-IDF Vectorization

TF-IDF stands for Term Frequency – Inverse Document Frequency. The idea behind this metric is to rescale the frequency of each word by how often it appears across all documents. Words that are common across all documents are penalized, and as a result, the words that are most distinctive (and frequent) within a document are emphasized more.
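
The arithmetic can be sketched by hand using one common textbook variant, `tf * log(N / df)`. The two documents below are invented for illustration. Note that scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalizes each document vector by default, so its scores will not match these numbers.

```python
import math
from collections import Counter

# Two tiny made-up documents
docs = {'doc_a': 'jobs jobs america', 'doc_b': 'america hope'}
tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for words in tokenized.values() for term in set(words))

def tf_idf(term, words):
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    tf = words.count(term) / len(words)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

for term in ['jobs', 'america']:
    print(term, round(tf_idf(term, tokenized['doc_a']), 3))
```

A word that appears in every document gets an IDF of `log(1) = 0` under this variant, which is the penalty described above.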

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tokens_all = {}

with open(trump_speech_transcript) as f:

    # Read each line from the file and convert it into lowercase
    line = f.read().lower()

    # Remove all punctuations
    line_clean = ''.join(l for l in line if l not in all_punctuations)

    # Remove all stop words (this will create a list of words)
    # In addition, use regex to remove non-alphabetic characters
    line_words = [re.sub("[^a-zA-Z' ]+", '', word) for word in line_clean.split() if word not in ENGLISH_STOP_WORDS
                 and word != 'applause']

    # Join all those words to create a line (of text) again
    line_clean = ' '.join(line_words)

    tokens_all['trump'] = line_clean
    
with open(obama_speech_transcript) as f:

    # Read each line from the file and convert it into lowercase
    line = f.read().lower()

    # Remove all punctuations
    line_clean = ''.join(l for l in line if l not in all_punctuations)

    # Remove all stop words (this will create a list of words)
    # In addition, use regex to remove non-alphabetic characters
    line_words = [re.sub("[^a-zA-Z' ]+", '', word) for word in line_clean.split() if word not in ENGLISH_STOP_WORDS
                 and word != 'applause']

    # Join all those words to create a line (of text) again
    line_clean = ' '.join(line_words)

    tokens_all['obama'] = line_clean

In [None]:
tfidf = TfidfVectorizer()

tfs_matrix = tfidf.fit_transform(tokens_all.values())

print(tfs_matrix)

Note: This is how a SciPy sparse matrix is printed: each line shows a `(row, column)` position and the non-zero value stored there.

In [None]:
# Feature ("column") names

print(tfidf.get_feature_names_out()[:10])

In [None]:
# Convert from sparse matrix to dense matrix

tfs_matrix.todense()

Let's create a `pandas` dataframe.

In [None]:
import pandas as pd

feature_names = tfidf.get_feature_names_out()

scores = tfs_matrix.todense().tolist()

df = pd.DataFrame(scores, columns=feature_names, index=['trump', 'obama'])

df.head()

In [None]:
token_count_trump.most_common(10)

In [None]:
for w, c in token_count_trump.most_common(10):
    print (w, c)

In [None]:
words_trump = []

for w, c in token_count_trump.most_common(10):
    words_trump.append(w)
    
words_trump

In [None]:
df[words_trump].T

In [None]:
words_obama = []

for w, c in token_count_obama.most_common(10):
    words_obama.append(w)
    
words_obama

df[words_obama].T

### Reading Text from Web-pages

The following website contains US Presidential inauguration speeches: http://avalon.law.yale.edu/subject_menus/inaug.asp

We will use the `requests` and `BeautifulSoup` packages to read data directly from this website.

In [None]:
from bs4 import BeautifulSoup
import requests

url = "http://avalon.law.yale.edu/21st_century/obama.asp"

Step 1: Ping the web page for information. This is called making a request.

In [None]:
source_code = requests.get(url)

Step 2: Use Beautiful Soup to parse the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser.

In [None]:
soup = BeautifulSoup(source_code.content)

In [None]:
# View the content

soup

Step 3: Extract the text part of the page.

In [None]:
speech_obama = soup.get_text()

speech_obama

Step 4: Extract a specific portion of the text chunk which contains the actual speech.

Note that the speech starts with 'My fellow citizens', which is immediately preceded by the following: `\r\n\n\n\n`. Let's split the text chunk using `\r\n\n\n\n` as the separator, and then take the second part of the result.

In [None]:
speech_obama = speech_obama.split('\r\n\n\n\n')[1]

speech_obama

Now the speech actually ends with 'And God bless the United States of America.', which is immediately followed by a run of newlines (`\n\n\n\n\n`). Let's split the text chunk using `\n\n\n\n` as the separator, which also matches that longer run, and then take the *first* part of the result.

In [None]:
speech_obama = speech_obama.split('\n\n\n\n')[0]

speech_obama
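
Splitting on runs of newlines depends on the page's exact whitespace. A possible alternative, sketched below, is to cut the text at the speech's known first and last phrases. The helper name `extract_between` is hypothetical, and the sample string is a made-up stand-in for the real page text.

```python
def extract_between(text, start_marker, end_marker):
    """Return the part of `text` from `start_marker` through `end_marker`, inclusive."""
    start = text.find(start_marker)
    if start == -1:
        raise ValueError(f'start marker not found: {start_marker!r}')
    # Look for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        raise ValueError(f'end marker not found: {end_marker!r}')
    return text[start:end + len(end_marker)]

# Made-up stand-in for the page text
page_text = ('Menu\r\n\n\n\nMy fellow citizens: hello. '
             'And God bless the United States of America.\n\n\n\n\nFooter')

print(extract_between(page_text, 'My fellow citizens',
                      'And God bless the United States of America.'))
```

This approach fails loudly with a `ValueError` if either phrase is missing, rather than silently returning the wrong chunk.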

Step 5: Clean and Tokenize!

In [None]:
# Convert the speech text to lowercase
line = speech_obama.lower()

# Remove all punctuations
line_clean = ''.join(l for l in line if l not in all_punctuations)

# Remove all stop words (this will create a list of words)
line_words = [word for word in line_clean.split() if word not in ENGLISH_STOP_WORDS]

# Join all those words to create a line (of text) again
line_clean = ' '.join(line_words)

# Tokenize
tokens = nltk.word_tokenize(line_clean)

tokens[:5]

___

**Applications of Text Mining:**
    
1. Text (or Document) Categorization
2. Text Clustering
3. Sentiment Analysis
4. Document Summarization
5. Topic Extraction
6. Document Associations
7. Etc.

**Resources:**
    
1. [NLTK 3.4 documentation](http://www.nltk.org/index.html)
2. [spaCy API](https://spacy.io/api)
3. [scikit-learn TF-IDF Vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
4. [NLTK's WordNet Interface](http://www.nltk.org/howto/wordnet.html)
5. [Modern NLP in Python by Patrick Harrison | PyData DC 2016](https://www.youtube.com/watch?v=6zm9NC9uRkk) (YouTube)
6. [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/) (Book)
7. [Text Feature Extraction using scikit-learn](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)