## EE 502 P: Analytical Methods for Electrical Engineering
    
# Final project - <span style="color: red">Julia Combs</span>

### Due Thursday, December 16, 2021 at 11:59 PM
Copyright &copy; 2021, University of Washington

<hr>

# Hallucinating the Constitution

Consider the constitution of the United States:

> https://www.usconstitution.net/const.txt .

This document contains upper- and lower-case letters, numbers, and basic punctuation. 

**One letter prediction:**

1. Find the set of all characters used in the document. Call the number of characters $n$. 
2. Create an $n \times n$ matrix whose $i,j$ entry is the probability that the next character is $j$ given that the current character is $i$. Estimate this probability by looking at all occurrences of character $i$ in the document and the number of times character $j$ immediately follows it. 
3. Simulate this system as a Markov chain that starts with an arbitrary capital letter and continues until it gets to a space. Produce $100$ random "words" this way. How many of them are actual words? Use a [Scrabble dictionary](https://scrabble.hasbro.com/en-us/tools#dictionary) if you are not certain whether a given sequence is a word. 
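
The three steps above can be sketched as follows. This is only a minimal illustration on a short stand-in string rather than the full `const.txt`; the helper names `char_transition_matrix` and `random_word` are hypothetical, and the `max_len` cap is an added assumption so that a walk always terminates.

```python
import random
import numpy as np

def char_transition_matrix(text):
    """Return (chars, P) with P[i, j] ~ P(next = chars[j] | current = chars[i])."""
    chars = sorted(set(text))
    index = {c: k for k, c in enumerate(chars)}
    counts = np.zeros((len(chars), len(chars)))
    # count every adjacent (current, next) pair of characters
    for cur, nxt in zip(text, text[1:]):
        counts[index[cur], index[nxt]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # a character with no successor keeps an all-zero row
    return chars, counts / row_sums

def random_word(chars, P, start, rng, max_len=30):
    """Walk the chain from `start` until a space is drawn (or max_len is reached)."""
    word = start
    cur = chars.index(start)
    while len(word) < max_len:
        weights = P[cur].tolist()
        if sum(weights) == 0:  # dead end: no observed successor
            break
        cur = rng.choices(range(len(chars)), weights=weights, k=1)[0]
        if chars[cur] == ' ':
            break
        word += chars[cur]
    return word

# stand-in text (assumption); the real input would be the string read from const.txt
sample = "We the People of the United States, in Order to form a more perfect Union"
chars, P = char_transition_matrix(sample)
rng = random.Random(0)
capitals = [c for c in chars if c.isupper()]
words = [random_word(chars, P, rng.choice(capitals), rng) for _ in range(5)]
print(words)
```

Rows are indexed by the current character, matching the $i,j$ convention in step 2.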

**Two letter prediction:**

1. Create an $n \times n \times n$ tensor whose $i,j,k$ entry is the probability that the next character is $k$ given that the current character is $j$ and the previous character is $i$. Use the document to empirically find these probabilities. 
2. Use this model to construct random words. 
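
One possible shape for the two-letter model is sketched below, again on a stand-in string. The names `char_transition_tensor` and `random_word2` are hypothetical, and seeding the walk with a two-character prefix that occurs in the text is an assumption, since the problem statement does not say how to choose the first pair.

```python
import random
import numpy as np

def char_transition_tensor(text):
    """Return (chars, T) with T[i, j, k] ~ P(next = chars[k] | previous = chars[i], current = chars[j])."""
    chars = sorted(set(text))
    index = {c: m for m, c in enumerate(chars)}
    n = len(chars)
    counts = np.zeros((n, n, n))
    # count every run of three adjacent characters
    for prev, cur, nxt in zip(text, text[1:], text[2:]):
        counts[index[prev], index[cur], index[nxt]] += 1
    totals = counts.sum(axis=2, keepdims=True)
    totals[totals == 0] = 1  # unseen (previous, current) pairs keep an all-zero slice
    return chars, counts / totals

def random_word2(chars, T, first_two, rng, max_len=30):
    """Extend a two-character prefix until a space is drawn (or max_len is reached)."""
    word = first_two
    i, j = chars.index(first_two[0]), chars.index(first_two[1])
    while len(word) < max_len:
        weights = T[i, j].tolist()
        if sum(weights) == 0:  # this pair was never followed by anything
            break
        k = rng.choices(range(len(chars)), weights=weights, k=1)[0]
        if chars[k] == ' ':
            break
        word += chars[k]
        i, j = j, k
    return word

# stand-in text (assumption); the real input would be the string read from const.txt
sample = "We the People of the United States, in Order to form a more perfect Union"
chars, T = char_transition_tensor(sample)
print(random_word2(chars, T, "Un", random.Random(1)))
```

For the full document the tensor has $n^3$ entries, most of them zero, so a dictionary keyed by character pairs may be a lighter alternative.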

**Sentence prediction:**

Do a one word prediction, but use all the unique *words* in the document. Hallucinate sentences. Consider a punctuation mark as a word. 

**Notes:** Use `open` and `file.read` to read in the file as a string. For the sentence prediction, use `replace` to add a space before each punctuation mark and then `split()` to turn the string into a list of words. Use a `DiGraph` from the `networkx` library to store the data. Note that you can make weighted edges by adding data to the edges, as in [this document](https://networkx.github.io/documentation/stable/auto_examples/drawing/plot_weighted_graph.html).
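
The `replace` and `split()` step from the notes might look like the sketch below. The punctuation set and the helper names `tokenize` and `bigram_counts` are assumptions for illustration. The pair counts are kept in a plain `Counter` here; each count is the value one would attach as the `weight` data on the corresponding `DiGraph` edge.

```python
from collections import Counter

PUNCTUATION = ".,;:"  # assumed set of marks to treat as separate words

def tokenize(text):
    """Put a space before each punctuation mark, then split on whitespace."""
    for mark in PUNCTUATION:
        text = text.replace(mark, " " + mark)
    return text.split()

def bigram_counts(tokens):
    """Count how often each (word, next_word) pair occurs."""
    return Counter(zip(tokens, tokens[1:]))

# stand-in sentence (assumption), not taken from const.txt
tokens = tokenize("The Senate shall choose. The Senate shall vote, and the House shall vote.")
edges = bigram_counts(tokens)
print(tokens[:5])
print(edges[("Senate", "shall")])
```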

In [2]:
# import necessary packages

import numpy as np
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import random

In [10]:
# load the file as a single string
with open('const.txt') as f:
    hw = f.read()

# split the text into tokens and also build a copy with English stop words removed
def tokenize_words(text):
    # match words, dollar amounts, or any other run of non-whitespace (punctuation)
    tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
    tokens = tokenizer.tokenize(text)
    stop_words = set(stopwords.words('english'))  # build the stop-word set once
    filtered = [token for token in tokens if token not in stop_words]
    return " ".join(filtered), tokens
                      
hw_pros_inputs, hw_words_tokens = tokenize_words(hw)

hw_inputs = " ".join(hw_words_tokens)
# determine the unique words in the text
hw_words = list(set(hw_words_tokens)) 
print('unique words (hw_words):', hw_words)

# determine the unique characters in the text
# (taken from hw_inputs, the same string that is translated to indices below)
hw_chars = sorted(set(hw_inputs))
print('unique chars (hw_chars):', hw_chars)

# determine the number of unique words
n_words = len(hw_words)
print('Number of unique words', n_words)

# determine the number of unique characters
n_chars = len(hw_chars)
print('Number of unique chars', n_chars)

# create a dictionary for the words 
words2num = dict((c,i) for i, c in enumerate(hw_words))

# create a dictionary for the characters
chars2num = dict((c,i) for i, c in enumerate(hw_chars))

# number of characters in the re-joined text
input_len = len(hw_inputs)

# translate the token sequence into word indices; shape (1, number of tokens)
hw_trans_words = np.asarray([[words2num[w] for w in hw_words_tokens]])

# translate the character sequence into character indices
hw_trans_chars = [[chars2num[c] for c in hw_inputs]]

       

unique words (hw_words): ['Inability', 'Session', '-President,', '14', 'Impeachment', 'for', 'race', 'peaceably', 'first', 'prescribed', 'Meeting', 'emit', 'New', 'William', 'proper', 'vacancy', 'Affirmation', 'Importation', 'debts', 'Jackson', 'Measures', 'original', 'Authors', 'intoxicating', 'Indictment', 'Elections', 'Judgment', 'He', 'granting', 'People', 'Tribes', 'three', 'adding', 'prevent', 'ratification', 'Ministers', 'parts', 'impairing', 'Duties', 'Certificates', 'thirty', 'Delaware', 'Seats', 'reserving', 'Violence', 'confirmation', 'tried', 'Privileges', 'Numbers', 'my', 'solemnly', 'government', 'district', 'Amendment', 'liberty', 'aid', 'counterfeiting', 'Justice', 'Provided', 'next', 'organizing', 'return', 'determines', 'likewise', 'War', 'Johnson', 'inability', 'Ability', 'thereof', 'Class', 'exportation', 'one', 'immunities', 'Union', 'then', 'Carolina', '25', 'Business', 'fourteen', 'ballot', 'Ballot', 'Articles', 'grant', 'Miles', 'value', 'Clauses', 'our', 'numbe

In [11]:
# return every position in xs whose value equals x
def get_indexes(x, xs):
    return [i for i, y in enumerate(xs) if x == y]
#times2 = get_indexes(2, hw_trans_words[0, :])
#print('occurrence locations', times2)

# make a transition matrix for the words:
# TtempData[j, i] counts how often word j immediately follows word i
TtempData = np.zeros((n_words, n_words))
for i in range(0, n_words):

    indexes = get_indexes(i, hw_trans_words[0,:])
    for j in range(len(indexes)):
        tempIndex = indexes[j] + 1 # look at index value after first word
        if tempIndex == len(hw_trans_words[0,:]):
            print('hit the end')
        else:
            val = hw_trans_words[0, tempIndex]
            val = val.item()
            TtempData[val, i] = TtempData[val, i] + 1
            
Tdata = TtempData


hit the end
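
The loop above rescans the whole token sequence once per unique word. The same counts could probably be accumulated in a single pass over adjacent pairs. A minimal sketch using `np.add.at` on a small stand-in sequence follows; `count_transitions` is a hypothetical helper, and it keeps the same convention as the cell above (column = current word, row = next word).

```python
import numpy as np

def count_transitions(seq, n_states):
    """counts[nxt, cur] = number of times state `nxt` immediately follows state `cur`."""
    seq = np.asarray(seq)
    counts = np.zeros((n_states, n_states))
    # np.add.at accumulates correctly even when the same (nxt, cur) pair repeats
    np.add.at(counts, (seq[1:], seq[:-1]), 1)
    return counts

# stand-in for hw_trans_words[0, :] (assumption)
demo = [0, 1, 2, 0, 1, 0]
demo_counts = count_transitions(demo, 3)
print(demo_counts)
```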


In [12]:
# normalize each column so that T[:, i] is the distribution of the word following word i
div = np.sum(Tdata, axis = 0)[np.newaxis, :]
# a word that is never followed by anything has an all-zero column;
# set its divisor to 1 (for every column, including the last) to avoid dividing by zero
div[div == 0] = 1
T = Tdata/div
pd.DataFrame(T)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1352,1353,1354,1355,1356,1357,1358,1359,1360,1361
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1357,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
1358,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
1359,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.035714,0.0,0.0,0.0,0.0,0.0,0.0
1360,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
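
A quick sanity check on the normalization is that every column sums to 1, or to 0 for a word with no observed successor. The sketch below shows the idea on a small stand-in count matrix; `normalize_columns` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def normalize_columns(counts):
    """Divide each column by its sum, leaving all-zero columns as zeros."""
    col_sums = counts.sum(axis=0, keepdims=True)
    safe = np.where(col_sums == 0, 1, col_sums)
    return counts / safe

# stand-in counts (assumption); the last column is deliberately all zeros
counts = np.array([[0., 1., 0.],
                   [2., 0., 0.],
                   [2., 3., 0.]])
T_demo = normalize_columns(counts)
col_totals = T_demo.sum(axis=0)
print(col_totals)
```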


In [14]:
# generate text
new_hw = []
n_desired = 100
# start the chain at the word with index 0
first_num = 0
print('start state:', first_num)
new_hw.append(first_num)

for i in range(n_desired):
    current_word = new_hw[i]
    probs = T[:, current_word]  # column = distribution over the next word
    # draw the index of the next word according to those probabilities
    next_word = random.choices(range(len(probs)), weights = probs, k = 1)[0]
    new_hw.append(next_word)

print('new_hw:', new_hw)
        

start state: 0
new_hw: [0, 1291, 1015, 160, 96, 743, 142, 1019, 1331, 1291, 250, 220, 1291, 74, 806, 264, 1291, 1327, 1015, 245, 114, 540, 835, 1281, 1264, 286, 670, 578, 1065, 5, 344, 458, 697, 384, 1353, 605, 997, 1353, 1048, 181, 301, 1291, 812, 145, 260, 262, 1191, 578, 135, 5, 743, 797, 114, 788, 315, 640, 697, 367, 541, 145, 899, 1291, 1034, 1015, 370, 578, 1015, 249, 1291, 1034, 1271, 395, 1015, 210, 114, 1168, 5, 735, 270, 1291, 729, 938, 1120, 1291, 1327, 1154, 697, 138, 684, 797, 578, 1015, 245, 1291, 729, 1190, 1015, 160, 697, 53, 1229]
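
Because the walk is random, each run of the cell above yields a different sequence. If a repeatable result is wanted, one option is a seeded NumPy generator, sketched below on a small stand-in matrix. `simulate_chain` is a hypothetical helper, and stopping early at a state with no successor is an added assumption.

```python
import numpy as np

def simulate_chain(T, start, n_steps, seed=0):
    """Sample a path where column T[:, cur] holds the next-state probabilities."""
    rng = np.random.default_rng(seed)
    path = [start]
    for _ in range(n_steps):
        probs = T[:, path[-1]]
        if probs.sum() == 0:  # no observed successor: stop early
            break
        path.append(int(rng.choice(len(probs), p=probs)))
    return path

# stand-in column-stochastic matrix (assumption)
T_demo = np.array([[0.0, 1.0, 0.5],
                   [0.5, 0.0, 0.5],
                   [0.5, 0.0, 0.0]])
path = simulate_chain(T_demo, start=0, n_steps=10)
print(path)
```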


In [15]:
# convert the sampled indices back into words

# invert the words2num dictionary
num2words = {value : key for (key, value) in words2num.items()}

# look up each index and join the words with spaces
final_words = ' '.join(num2words[idx] for idx in new_hw)
print('final_words:', final_words)



final_words: Inability , the President within that Purpose shall immediately , or older , then act accordingly , without the Senate may be so construed as to vote of votes for his defence . But in another State in which he fled , over such Reconsideration two -thirds of electors for that House may by appropriate legislation . Immediately after such Acts , at the Militia of the Legislature , at noon on the Congress may adjourn for every Year , but a capital , without due . 3 The House of the Senate , but if the President . Amendment 4
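
The generated text keeps the space that tokenization placed in front of each punctuation mark. A small post-processing step could close those gaps. The sketch below uses a regular expression; `tidy_punctuation` is a hypothetical helper and the set of marks it handles is an assumption.

```python
import re

def tidy_punctuation(text):
    """Remove the whitespace that tokenization left in front of punctuation marks."""
    return re.sub(r"\s+([.,;:])", r"\1", text)

tidied = tidy_punctuation("Inability , the President within that Purpose shall immediately , or older .")
print(tidied)
```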
