(bayes-improvement)=
# Methods for Improvement
There are a few ways we can improve the quality of our sentences:

- [Better Data Preparation](lemmatize)
- [Increase Training Data](increase-data)
- [Unigrams --> n-grams](ngrams)
- [Increase the Order of Markov Chain](increase-order)

As a note, I won't actually apply all of these methods due to compute limitations, but will still explain each method in depth.

In [1]:
import nltk
import string
import pandas as pd
from sklearn.preprocessing import normalize
from tqdm import tqdm
from IPython.display import display, Math, Latex
from nltk.corpus import stopwords
import numpy as np

stop_words = stopwords.words('english')

def create_markov_matrix(text, order = 1):
    print(f'\nCreating Markov matrix of order {order}...\n')
    # Get a list of the tokenized words without punctuation
    tokens = [word.lower() for word in nltk.word_tokenize(text) if word not in string.punctuation and word.isalpha()]
    unique_tokens = list(dict.fromkeys(tokens))
    # print(unique_tokens)

    bigrams = list(nltk.bigrams(tokens))
    # print(bigrams)

    # Create a DataFrame where the rows are words and the columns are words
    df = pd.DataFrame(0, columns=unique_tokens, index=unique_tokens)

    # Loop through each of the bigrams (tuples), locate them in the DF, and add 1
    for i in bigrams:
        df.loc[i[0],i[1]] += 1

    # Convert the DataFrame from raw word counts to probabilities
    w_normalized = normalize(df, norm='l1', axis=1)

    if order > 1:
        w_normalized = np.linalg.matrix_power(w_normalized, order)
        
    df_normalized = pd.DataFrame(w_normalized, columns=unique_tokens, index=unique_tokens)

    return df_normalized

    
# Read the corpus and close the file handle when done
with open('honor_code.txt', 'r') as f:
    text = f.read()

print(text[:100], '...')

# Build our Markov transition matrix
mm = create_markov_matrix(text)
print(f'{mm.shape=}')

GEORGIA TECH HONOR CHALLENGE STATEMENT

I commit to uphold the ideals of honor and integrity by refu ...

Creating Markov matrix of order 1...

mm.shape=(1542, 1542)


(lemmatize)=
## Better Data Preparation
There are a few ways to better prepare our data. For example, we have different variants of the same word ('student' *and* 'students'). Because we don't coalesce these similar words, each variant gets its own row and column in our Markov transition matrix, so the counts (and therefore the probabilities) for what is really one word are split across several entries.

### Lemmatization
Thus, we could do a better job of lemmatizing the words. Lemmatization will:

- Improve the accuracy of this analysis by pooling the counts of words that differ only in their endings, so frequent words are weighted by their combined frequency;
- Decrease computation time by reducing the size of our Markov chain matrices

Here are some examples of how lemmatization works:

- students --> student
- corpora --> corpus
- better --> good
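
As a rough illustration of why this helps, here is a minimal sketch that uses a hand-written lookup table in place of a real lemmatizer. `TOY_LEMMAS`, `toy_lemmatize`, and the sample sentence are hypothetical names and data made up for this example; a real pipeline would call a lemmatizer such as NLTK's `WordNetLemmatizer` instead.

```python
from collections import Counter

# Hypothetical toy lemma table (an assumption for illustration only);
# a real pipeline would use a proper lemmatizer instead of a dict.
TOY_LEMMAS = {"students": "student", "corpora": "corpus", "better": "good"}


def toy_lemmatize(token):
    """Map a token to its lemma, leaving unknown tokens unchanged."""
    return TOY_LEMMAS.get(token, token)


def bigram_counts(tokens):
    """Count adjacent word pairs, the raw input to a transition matrix."""
    return Counter(zip(tokens, tokens[1:]))


tokens = "students must act with honor a student must act with integrity".split()
lemmas = [toy_lemmatize(t) for t in tokens]

raw_counts = bigram_counts(tokens)
lemma_counts = bigram_counts(lemmas)

# The vocabulary (matrix dimension) shrinks once variants are merged
print("vocabulary size:", len(set(tokens)), "->", len(set(lemmas)))

# Before: the evidence for "student(s) -> must" is split across two rows
print(raw_counts[("students", "must")], raw_counts[("student", "must")])

# After: the counts are pooled into a single entry
print(lemma_counts[("student", "must")])
```

The merged vocabulary gives a smaller matrix, and the pooled count gives the shared transition more weight than either variant had on its own.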



In [2]:
def reshape_mm(mat):
    df_reshaped = mat.reset_index() \
                .rename(columns = {'index': 'first_word'}) \
                .melt(id_vars = 'first_word',
                    var_name = 'second_word',
                    value_name = 'p') \
                .sort_values('first_word')

    # Only keep actual words
    df_reshaped = df_reshaped[df_reshaped['first_word'].apply(lambda word: word.isalpha())]
    df_reshaped = df_reshaped[df_reshaped['second_word'].apply(lambda word: word.isalpha())]

    return df_reshaped.loc[(df_reshaped['p'] > 0)]

df_reshaped = reshape_mm(mm)
print(f'{df_reshaped.shape=}')
# df_reshaped.head(10).reset_index(drop = True)

In [3]:
from nltk.stem import WordNetLemmatizer
 
lemmatizer = WordNetLemmatizer()


n_sent = 2
min_words = 7


def find_next_word(mat, word, print_ = True):
    df_reshaped = reshape_mm(mat)
    
    # Filter our dataframe to the word of interest
    df_word = df_reshaped.loc[(df_reshaped['first_word'].str.lower() == word.lower())]

    # Sample using the probability column
    next_word_df = df_word.sample(1, weights = 'p', replace = True)
    next_word = next_word_df['second_word'].iloc[0]
    p = next_word_df['p'].iloc[0]


    if print_:
        print(f'Selected `{next_word}` as the next word to follow `{word}` with probability {p:.3f}.')
        
    return next_word

def build_sents(mm, n_sent = 3, min_words = 8, starter = 'Georgia'):
    for i in range(n_sent):
        if i == 0:
            starting_word = starter
        else:
            starting_word = new_word

        sentence = [starting_word.title()]
        
        # Create boolean conditions for our while loop
        keep_going = True
        idx = 0

        while (keep_going) or (len(sentence) < min_words):
        # for idx, word in enumerate(range(n_words)):
            if idx == 0:
                new_word = starting_word

            new_word = find_next_word(mm, word = new_word, print_ = False)

            sentence.append(new_word)

            # If the new word is a stop word, drop it from the sentence and mark the
            # sentence as ready to end (the loop still runs until min_words is reached)
            if new_word in stop_words:
                keep_going = False
                sentence.pop()
                new_word = find_next_word(mm, word = new_word, print_ = False)

            idx += 1

        print(f"Sentence {i}: {' '.join(sentence)}.")

build_sents(mm, n_sent = n_sent, min_words = min_words, starter = 'Georgia')

Georgia institute required institute official outlined initiation.
Initiation instructor finds ensure conduct investigator means.
Means rather grounds accepted allegations case misconduct.


(increase-data)=
## Increase Training Data
The Georgia Tech honor code text that I imported is ~13k words, roughly 1,500 of which are unique tokens after filtering (the transition matrix above is 1542x1542). This is a good start, but if we want to predict the next word more accurately, we want to increase our training dataset **substantially**. Ideally, we use as much text related to our problem of interest as possible.

### Considerations
This raises the potential for serious issues given the way that the code has been structured. With just 13k words, I've created a 1542x1542 matrix, and the matrix grows with the square of the vocabulary size; with a serious amount of training data, the storage needed for a dense matrix would be unsustainable. If I wanted to go this route, I would have to exploit the fact that most entries are zero and use a sparse representation, like those available in `scipy.sparse`, to cut storage and speed up computation.
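
To make the storage argument concrete, here is a minimal sketch that stores only the transitions that actually occur. It uses a plain dictionary of counters as a stand-in for a sparse matrix format (it is not `scipy.sparse` itself), and `sparse_transition_counts`, `transition_probs`, and the sample tokens are hypothetical names and data chosen for this example.

```python
from collections import Counter, defaultdict


def sparse_transition_counts(tokens):
    """Store only observed (word -> next word) counts, not a full V x V grid."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts


def transition_probs(counts, word):
    """Normalize a single row on demand so its probabilities sum to 1."""
    row = counts.get(word, Counter())
    total = sum(row.values())
    if total == 0:
        return {}  # the word was never followed by anything
    return {nxt: c / total for nxt, c in row.items()}


tokens = "georgia tech honor code georgia tech policy georgia institute".split()
counts = sparse_transition_counts(tokens)

vocab_size = len(set(tokens))
dense_cells = vocab_size ** 2
stored_cells = sum(len(row) for row in counts.values())

print(f"dense matrix cells: {dense_cells}, stored cells: {stored_cells}")
print(transition_probs(counts, "georgia"))
```

Even in this tiny example most of the dense grid would be zeros, and the gap widens quickly as the vocabulary grows.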

*Note: Due to compute limitations, I won't apply this here, but the effects of this approach should be evident.*

(ngrams)=
## Unigrams --> n-grams
Currently, our states are unigrams: each state is a single word token (e.g. 'Georgia'). It *might* improve our performance to use multi-word states like bigrams (e.g. 'Georgia Tech') or trigrams (e.g. 'Georgia Tech rules'). There are even methods to make the Markov chain *simultaneously* look at unigrams, bigrams, and trigrams, as outlined by the process below:

1. If the current word has two words before it, look up that three-word state in the Markov chain of trigrams.
1. If the trigram chain has no match for that state *or* only two words are available, fall back to the Markov chain of bigrams.
1. If the bigram chain has no match for that state *or* only a single word is available, fall back to the Markov chain of unigrams.

*Note: To include options like trigrams, we need to **seriously** increase the size of our training data, because any specific three-word sequence appears far less often in a text than its individual words do.*
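
The back-off steps above can be sketched in plain Python. This is only an illustration under simple assumptions, not how any particular library implements it; `build_chains`, `next_word`, and the sample tokens are hypothetical names and data introduced for this example.

```python
import random
from collections import Counter, defaultdict


def build_chains(tokens, max_state=3):
    """Build one chain per state size: chains[n][state] counts the next words."""
    chains = {n: defaultdict(Counter) for n in range(1, max_state + 1)}
    for n in range(1, max_state + 1):
        for i in range(len(tokens) - n):
            state = tuple(tokens[i:i + n])
            chains[n][state][tokens[i + n]] += 1
    return chains


def next_word(chains, history, rng=random):
    """Try the longest usable state first, then back off to shorter states."""
    for n in sorted(chains, reverse=True):
        if len(history) < n:
            continue  # not enough words available for this state size
        followers = chains[n].get(tuple(history[-n:]))
        if followers:
            words = list(followers)
            weights = list(followers.values())
            # Sample the next word in proportion to its observed count
            return rng.choices(words, weights=weights, k=1)[0], n
    return None, 0  # no chain has seen this context


tokens = "the student shall report the student shall comply".split()
chains = build_chains(tokens)

# Three-word state seen in the text: the trigram chain answers
print(next_word(chains, ["the", "student", "shall"]))

# Unseen three-word state: falls back to the bigram chain
print(next_word(chains, ["a", "student", "shall"]))

# Only one word available: the unigram chain answers
print(next_word(chains, ["the"]))
```

The second element of the returned tuple reports which state size produced the prediction, which makes it easy to see how often the model has to back off on a small corpus.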

To quickly implement this, I'll use a package, [Markovify](https://github.com/jsvine/markovify#advanced-usage), already developed for this purpose.


In [4]:
import markovify

max_state = 3
num_sent = 1

for state in range(max_state):
    print(f'\nUsing {state+1}-gram to predict the next state in the Markov chain.')
    # Build the model.
    text_model = markovify.Text(text, state_size = state+1)

    # Print num_sent randomly-generated sentences of no more than 200 characters
    for i in range(num_sent):
        print('\t', text_model.make_short_sentence(200))



Using 1-gram to predict the next state in the Markov chain.
	 The Institute shall normally be adjudicated by the investigation and/or Degree: The Respondent may be a brief, written or the faculty, and Supplementary Requirements imposed if interim suspension.

Using 2-gram to predict the next state in the Markov chain.
	 In cases where the Respondent fails to complete assigned Sanctions.

Using 3-gram to predict the next state in the Markov chain.
	 All graduate students are involved in research and scholarly activities which occur outside of the classroom.


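The 3-gram results are extremely promising! It is unsurprising that they read more naturally, because a state size of three "forces" each new word to follow a three-word sequence that actually occurs in the text. With a corpus this small, though, a larger state size also makes it more likely that the model reproduces passages of the source nearly verbatim.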

(increase-order)=
## Increase the Order of Markov Chain
We have set up our approach around the memoryless (Markov) property: we only look at the current state to predict the next word. By contrast, methods like [self-attention](intro-attention) look at **all** of the words that come before a given word in a sentence to help predict the next word, with words closer to the word of interest often receiving higher weights.

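If we were to predict the next word in the sequence based on *both* the current and previous states, we would have a Markov chain of order 2. A related quantity is the $t$-step transition probability, where $P$ is the one-step transition matrix and $X_n$ is the state (word) at position $n$:

$$
\mathbb{P}(X_t = j \mid X_0 = i) = \mathbb{P}(X_{n+t} = j \mid X_n = i) = (P^t)_{ij} \text{ for any } n
$$

Note that `create_markov_matrix(text, order = 2)` computes $P^2$, the two-step transition matrix: the probability of a word appearing two positions after the current word. This is a simplification, since a true second-order chain would condition on the pair of previous and current words.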
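
To see what the matrix power does, here is a small sketch on a hypothetical three-word chain. The matrix `P` and the helper functions are made up for illustration and written in plain Python so the arithmetic is visible; `np.linalg.matrix_power`, as used in `create_markov_matrix`, performs the same computation.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def matrix_power(p, t):
    """Return P^t for t >= 1 by repeated multiplication."""
    result = p
    for _ in range(t - 1):
        result = matmul(result, p)
    return result


# Hypothetical one-step transition matrix; every row sums to 1
words = ["georgia", "tech", "honor"]
P = [
    [0.0, 1.0, 0.0],  # georgia -> tech
    [0.5, 0.0, 0.5],  # tech -> georgia or honor
    [1.0, 0.0, 0.0],  # honor -> georgia
]

P2 = matrix_power(P, 2)

# Row i of P^2 gives the distribution of the word two steps after words[i]
for word, row in zip(words, P2):
    print(word, dict(zip(words, row)))
```

In this toy chain `tech` always follows `georgia` in one step, yet it has zero probability two steps after `georgia`, which shows that $P^2$ describes the word after next rather than the next word.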



In [5]:
mm_order2 = create_markov_matrix(text, order = 2)
print(f'{mm_order2.shape=}')

find_next_word(mm_order2, word = 'Georgia')

print('\n')

build_sents(mm_order2, n_sent = 2, min_words = 8, starter = 'Georgia')



Creating Markov matrix of order 2...

mm_order2.shape=(1542, 1542)
Selected `community` as the next word to follow `Georgia` with probability 0.218.
Georgia students scholarly detailed organization revocation misconduct serious.
Serious laboratory even notes unearned honor functions programs.
Programs determining provided victim trying part data limited.


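Compared to what we saw in the previous section, these sentences seem to have a bit more creativity, in the sense that consecutive words are more loosely connected. This is likely because the order-2 matrix holds the probabilities of the word two steps ahead, so each sampled word effectively skips over an intermediate word. Below, we see the probabilities of certain words appearing *after* the current state (`Georgia`):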

In [6]:
mm_georgia = mm.loc['georgia', :]
mm2_georgia = mm_order2.loc['georgia', :]

pd.merge(mm_georgia[mm_georgia > 0].to_frame().reset_index(),
        mm2_georgia[mm2_georgia > 0].to_frame().reset_index(),
        how = 'outer',
        on = 'index',
        suffixes = ('_mm_order1', '_mm_order2')
) \
        .rename(columns = {'index': 'second_word'}) \
        .sort_values(['georgia_mm_order1', 'georgia_mm_order2'], ascending = False) \
        .head(10)


| | second_word | georgia_mm_order1 | georgia_mm_order2 |
|---|---|---|---|
| 0 | tech | 0.702703 | NaN |
| 1 | institute | 0.216216 | 0.001744 |
| 2 | policy | 0.027027 | 0.025079 |
| 3 | s | 0.027027 | 0.015693 |
| 4 | preponderance | 0.027027 | NaN |
| 11 | community | NaN | 0.218294 |
| 9 | and | NaN | 0.063423 |
| 23 | student | NaN | 0.054388 |
| 5 | honor | NaN | 0.054054 |
| 13 | academic | NaN | 0.054054 |


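It seems that the second-order Markov Chain applies a broader brush in determining which words can come after our current state word (`Georgia`): its probability is spread over many more candidate words, and the most likely one (`community`) reaches only about 0.22. The first-order Markov Chain seems to make the most sense based on these associated probabilities, with `tech` and `institute` together accounting for about 92% of the probability.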
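
For comparison, a chain that conditions on *both* the previous and the current word can be sketched by keying the counts on word pairs. This is a minimal illustration on made-up tokens, not the approach used in `create_markov_matrix`; the function names and sample text are hypothetical.

```python
from collections import Counter, defaultdict


def first_order_counts(tokens):
    """Count next words given only the current word."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    return counts


def second_order_counts(tokens):
    """Count next words given the (previous, current) word pair."""
    counts = defaultdict(Counter)
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        counts[(prev, cur)][nxt] += 1
    return counts


def to_probs(counter):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counter.values())
    return {word: count / total for word, count in counter.items()}


tokens = ("georgia tech honor code and georgia tech policy "
          "and academic honor council").split()

order1 = first_order_counts(tokens)
order2 = second_order_counts(tokens)

# Knowing only the current word leaves two equally likely continuations
print(to_probs(order1["honor"]))

# Knowing the previous word as well narrows the choice
print(to_probs(order2[("tech", "honor")]))
print(to_probs(order2[("academic", "honor")]))
```

Here the extra word of context sharpens the distribution instead of broadening it, at the cost of needing many more observations per state.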

## Conclusion
Based on this methodology, we can create a simple word generator. This can help customers who might want unique and creative text generated based on certain findings from the semantic search performed by the language models used in previous sections.