## Applied Machine Learning

### Project

In this class, we have worked to understand both the general principles of machine learning methods and the lower-level details required to get specific machine learning techniques to work in practice. For this project, you will apply what you have learned in this course to formulate and answer your own question using ML methods.

**Important:** Be sure to include code and answers in the correct cells of the notebook. Otherwise you might not get full credit for your work.

Please use this notebook to turn in your work. 

Points: 70

You will be graded based on:
1. __Technical completeness__. _(50 points)_ 

Did you meet the technical requirements for the project? For instance, did you describe hyperparameter tuning and include plots where necessary?

2. __Creativity, imagination and ambition__. _(10 points)_ 

Did you form an interesting, creative and ambitious question, and explain why it is important to answer that question?

_Note: downloading a ready-made dataset you find online (e.g. on Kaggle) and answering a question that is already defined for you will make it hard to get full points for creativity, imagination and ambition. (Unless you do something else interesting and ambitious, e.g. analyze attributes of a specific model very closely.) To get full points for creativity/imagination/ambition you will need to think a little bit more. The bar will be higher for 5604 students._

3. __Presentation__. _(10 points)_ 

Did you do a good job presenting your results? Would your notebook make sense to a person who was not familiar with your project? You should take time to write clearly, simplify your code and explain what you are doing in your notebook. At minimum:

- Make sure your plots are well-labeled and appropriately scaled, delete code that does not work correctly, and be sure to mix code and text so that readers can easily understand your work. 
- Check out [this](https://jakevdp.github.io/blog/2015/07/23/learning-seattles-work-habits-from-bicycle-counts/) blog post for a nice example of how to present a data analysis clearly.
- Put yourself in the reader's shoes. What would be confusing? Annoying? Helpful? Pretend you don't know anything about your project. What information needs to be presented first? What information needs to be presented last?

## Question 

_Using this cell, please write a short, clear paragraph explaining what question you plan to answer in this notebook. Your question can be narrow (e.g. can we predict a dog's height from its weight) or broad (e.g. what features are important or unimportant in predicting the price of a house). Briefly describe why your question is important and how you plan to answer. Be sure to explain what is imaginative, creative or ambitious about your planned work! For instance, will you spend a lot of time defining new features, will you be working with hard-to-get data, will your work inform a major theoretical debate? Be sure to ask a question you can actually start to answer using machine learning techniques!_


The question that I plan to answer in this notebook is: "given one or several words as input, can I predict what I will say next based on my personal writing style?" This question is important and interesting to me because I am curious what a sentence-completion model trained on my personal data will come up with. I have been avidly writing (typing) journal entries since 2016, so I have hundreds of pages of typed-up thoughts about my life, and I want to see which topics come up often and whether they come through in the sentence generation model I create! I think this is imaginative and creative because I have been dreaming of using my personal journal data in some capacity for years; I just wasn't sure what I could do with it until I took this class and had the idea of building a model to predict what I might say next based on what I've said in the past. The data wasn't hard to gather, but I did have to put significant time into sorting it so that it was easier to read, and then into cleaning it and getting it into the format that I needed to create the model.

## Data

_Using this cell, please write a short, clear paragraph explaining what data you will use to answer your question. You do not need to go gather custom datasets for this class, although you are welcome to do so. Just downloading data from Kaggle is fine, although you are highly encouraged to think a little harder and more creatively when you do the project. There are many, many places to find interesting datasets online related to many topics like music, politics, sports, transportation, etc. Data gathering is one way to make your project more creative, but you are not necessarily expected to take on a major data gathering effort. Be sure to describe how you plan to split between the training and test sets, if this is not defined for you. You might want to check out Google's [dataset search](https://datasetsearch.research.google.com/), [data is plural](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0) or Prof. Keegan's [list of datasets](https://medium.com/information-expositions/list-of-lists-of-datasets-c9bf52370755)_.



My dataset for this project is a corpus of text files from my personal journals. I have been typing online journals since 2016, with at least several entries of anywhere from 2-20 pages every single month since then. Since the question I am trying to answer is "given one or several words, can I predict what I will say next based on my personal writing style?", I think this is the perfect dataset for it, because I will be training the model on my own writing over the last 5 years. I will be using 2 different kinds of models with different feature engineering strategies: (1) a simple unigram model, where the features are the probabilities of each word occurring in my corpus; and (2) a bigram Markov chain sentence model, where the features are the probabilities of the next word in a sentence given the word that came before it, calculated from the frequencies of the two-word combinations that I commonly use in my journals. I am curious whether I will be able to personally "audit" or "test" this model outside of traditional performance metrics by checking if the resulting sentences are believable to me as something that I might actually say!

#### Data preprocessing

#### Step 1: Loop through all journal documents and create an array where each element is the words in the documents in order
- This will be used in different ways for both the unigram and the bigram model

In [1]:
import re
## Code for data preprocessing 

# Include your code to load, clean and split data in this cell. You must complete this step in the project.
import docx2txt
months = [1,2,3,4,5,6,7,8,9,10,11,12]
days = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
years= [16,17,18,19,20,21]
word_arr = []
end_words = []
corpus = "" #this is used for the unigram model
for m in months:
    for d in days:
        for y in years:
            try:
                my_text = docx2txt.process("data_files/{0}_{1}_{2}.docx".format(m,d,y))
                for word in my_text.split():
                    word = word.lower()  # make the string lowercase
                    word = re.sub(r'(.)\1+', r'\1\1', word)  # collapse characters repeated more than twice down to two
                    word = re.sub(r'#\([^()]*\)', ' ', word)  # remove "#(...)" groups, keep punctuation
                    word = word.replace(",", "")  # remove commas
                    if word[-1] in ['.','!','?'] and word != '.':
                        # end words go into end_words and the corpus (unigram), but not into word_arr (bigram)
                        end_words.append(word)
                        corpus += word + " "
                        continue
                    word = re.sub('[^a-z]', ' ', word)  # replace anything that is not a letter a-z with a space
                    word = word.replace(" ", "")  # get rid of spaces
                    if word == '':  # get rid of edge cases where the word is an empty string
                        continue
                    word_arr.append(word)
                    corpus += word + " "
            except:
                continue
                
print("Number of words total: ", len(word_arr))

Number of words total:  178877
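
To make the cleaning steps above easier to check by hand, here is a minimal sketch that applies the main regular expressions to a single token. The helper name `clean_token` and the sample tokens are illustrative assumptions, not part of the notebook. Unlike the loop above, the sketch checks for an empty string before looking at the last character.

```python
import re

def clean_token(word):
    """Hypothetical helper mirroring the main per-word cleaning steps above.

    Returns (cleaned_word, is_end_word); cleaned_word is '' when nothing is left.
    """
    word = word.lower()
    # Collapse any character repeated three or more times down to two.
    word = re.sub(r'(.)\1+', r'\1\1', word)
    word = word.replace(",", "")
    # Guard against empty strings before indexing the last character.
    if word and word[-1] in ('.', '!', '?') and word != '.':
        return word, True
    # Keep only the letters a-z (digits and punctuation are dropped).
    word = re.sub(r'[^a-z]', '', word)
    return word, False

for token in ["Sooooo", "happy,", "today!", "2016", ","]:
    print(repr(token), "->", clean_token(token))
```

Running tokens like these through the helper is a quick way to confirm which words end up in the bigram word list and which are treated as sentence endings.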


#### Step 2: CREATE THE DATAFRAME FOR THE BIGRAM MODEL
- Create a pandas dataframe to store all of the words and their subsequent words, with the frequencies

In [2]:
import pandas as pd
dict_df = pd.DataFrame(columns = ['lead', 'follow', 'freq'])
dict_df['lead'] = word_arr
follow = word_arr[1:]
follow.append('EndWord')
dict_df['follow'] = follow

In [3]:
#calculate the frequencies of all of the word pairs and add them to the dataframe
dict_df['freq']= dict_df.groupby(by=['lead','follow'])['follow'].transform('count')
#drop any of the duplicate rows (same word combination)
dict_df = dict_df.drop_duplicates()

In [4]:
dict_df.head()

| | lead | follow | freq |
|---|---|---|---|
| 0 | dear | diary | 98 |
| 1 | diary | what | 4 |
| 2 | what | an | 9 |
| 3 | an | interesting | 26 |
| 4 | interesting | life | 2 |


In [5]:
#Split to training and testing sets
train_dict_df = dict_df[:55811] #about 80% of data
test_dict_df = dict_df[55811:] #about 20% of data
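
Note that the cells above count pair frequencies over the whole corpus and only then split the de-duplicated rows. One possible alternative, sketched here under assumptions and not what the notebook does, is to split the word sequence first and count pairs separately on each part, so that test text does not contribute to the training counts. The function name `bigram_counts` and the toy word list are illustrative.

```python
import pandas as pd

def bigram_counts(words):
    """Count adjacent (lead, follow) pairs in a list of words."""
    pairs = pd.DataFrame({"lead": words[:-1], "follow": words[1:]})
    return pairs.groupby(["lead", "follow"]).size().reset_index(name="freq")

# Toy word list standing in for word_arr (illustrative only).
toy_words = "dear diary i am so happy dear diary i am tired".split()

split_at = int(0.8 * len(toy_words))  # 80/20 split on the word sequence itself
train_counts = bigram_counts(toy_words[:split_at])
test_counts = bigram_counts(toy_words[split_at:])

print(train_counts)
print(test_counts)
```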

In [6]:
train_pivot_df = train_dict_df.pivot(index ='lead', columns='follow', values='freq').fillna(0)

In [7]:
train_pivot_df.head()

| lead | a | abandon | abandoned | abcs | abdomen | abilities | ability | able | about | above | ... | youve | ytt | z | zack | zacks | zen | zero | zip | zone | zoom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| abandon | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| abandoned | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| abcs | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| abdomen | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [8]:
#sanity checking my data here
#"dear diary" is in almost every document, so it makes sense that these two words
#have a high count of occurring one after the other!
train_pivot_df.loc["dear"]['diary'] 

98.0

In [9]:
train_pivot_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7284 entries, a to zoom
Columns: 7284 entries, a to zoom
dtypes: float64(7284)
memory usage: 405.1+ MB
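
The `info()` output shows that the dense 7284 x 7284 table takes about 405 MB even though most of its entries are zero. A minimal sketch of a sparser layout, assuming a plain dictionary of `Counter` objects, stores only the pairs that actually occur. The names `build_bigram_table` and `toy_words` are illustrative and do not appear in the notebook.

```python
from collections import Counter, defaultdict

def build_bigram_table(words):
    """Map each lead word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for lead, follow in zip(words, words[1:]):
        table[lead][follow] += 1
    return table

# Toy word list standing in for word_arr (illustrative only).
toy_words = "dear diary i am so happy dear diary i am tired".split()
table = build_bigram_table(toy_words)

print(table["dear"]["diary"])     # count of the pair ("dear", "diary")
print(table["i"].most_common(1))  # most frequent follower of "i"
```

A lookup for a pair that never occurred simply returns 0, so the zero entries never need to be stored.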


## Model selection and tuning

_Using this cell, please write a short, clear paragraph explaining how you selected and tuned your model for the project. You must answer the following questions in this cell (1) Why is your model an appropriate choice for your data? (2) What hyperparameters does your model have and how did you select them? (3) What features did you choose and why?_

You must, at minimum:
1. Engineer one feature (unigram vs. bigram)
2. Tune one hyperparameter (smoothing)
3. Make a plot or table examining performance of your model under different parameter settings (perplexity with different smoothing for bigram vs. unigram)

### What I did for model selection and tuning:
- I trained two different kinds of models with different feature engineering: a unigram model and a bigram model
- For each of the 2 models, I tested the performance (measuring perplexity as performance) with different choices for the "smoothing", which is the assigned probability for words or combinations of words that don't exist in the training dataset
- I plotted the impact that smoothing had on the perplexity (performance) for each separate model, and then I plotted the two different models together to see which one performed best in the next section.

Your answer here to the other questions:
1. My model was an appropriate choice for my data because, in order to create a language model (LM) that does sentence generation, I need to know the probabilities of the words that I am going to be generating. This can be done through models with different n-grams as the feature engineering step, where the features in the model are the probabilities of words given the preceding n-1 words. For my models, I did unigram feature engineering, where the features were the probabilities of the words based on their occurrence in the entire corpus of training data, and bigram feature engineering, where the features were the probabilities of the words based on their occurrence after the previous word in my corpus of training data.
2. The hyperparameter that I chose to tune was the 'smoothing' value of my models. Smoothing is the value (probability) that is assigned to the words (or combinations of 2 words for the bigram) that the model did not encounter in training. This keeps the perplexity from becoming infinite, since an unseen word would otherwise have a probability of 0 and the perplexity calculation divides by that probability.
3. For the features, I ended up choosing all of the words in my corpus of training data. For the unigram model, the values were the probability of that word occurring in the text, and for the bigram model the values were the probability of that word occurring after the previous word based on the training corpus. So the features for my unigram model looked like a dictionary, and the features for my bigram model looked like a 2-dimensional array, where the rows were the leading words and the columns were the words that could follow them in my corpus.
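
For comparison, a common textbook alternative to assigning a fixed probability to unseen words is add-k (Laplace) smoothing, which adds k to every count so that the probabilities still sum to 1. This is a swapped-in technique shown only as a sketch; it is not the smoothing used in the cells below, and `add_k_unigram`, the `<unk>` token and the toy data are illustrative assumptions.

```python
from collections import Counter

def add_k_unigram(train_words, vocab, k):
    """Return P(word) = (count + k) / (N + k * |V|) for every word in vocab."""
    counts = Counter(train_words)
    total = len(train_words) + k * len(vocab)
    return {w: (counts[w] + k) / total for w in vocab}

train_words = "i am so happy i am tired".split()
vocab = set(train_words) | {"<unk>"}  # "<unk>" stands in for any unseen word

probs = add_k_unigram(train_words, vocab, k=0.5)
print(probs["i"], probs["<unk>"], sum(probs.values()))
```

With this scheme a larger k moves probability mass from seen words to unseen ones, rather than adding mass on top of an already complete distribution.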

In [10]:
## Code for model selection and tuning.  Please include your code for model selection and tuning in this cell

#All code for model selection and tuning is in the cells below!

## UNIGRAM MODEL:
- FEATURE ENGINEERED --> THIS IS USING UNIGRAMS INSTEAD OF BIGRAMS TO ALLOCATE FREQUENCIES IN THE MODEL
- TUNING THE MODEL WITH SMOOTHING
- PLOTTING THE PERPLEXITY AT DIFFERENT LEVELS OF SMOOTHING

In [11]:
#split the corpus into train and test sets
train_corpus = corpus[:625395]
test_corpus = corpus[625395:]

In [12]:
#TESTING THE PERFORMANCE (PERPLEXITY) ON TRAIN AND TEST SETS
from nltk import tokenize
train_sentences = tokenize.sent_tokenize(train_corpus)
test_sentences = tokenize.sent_tokenize(test_corpus)

In [17]:
#UNIGRAM MODEL: come up with a dictionary that has every word in the corpus and its frequency
import collections
def unigramModel(smoothing):
    unigram_model = collections.defaultdict(lambda: smoothing)
    sum_all_counts = len(word_arr)
    for word in train_corpus.split():
        if word in unigram_model:
            unigram_model[word] += 1
        else:
            unigram_model[word] = 1

    for word in unigram_model:
        unigram_model[word] = unigram_model[word]/sum_all_counts
    return unigram_model

In [18]:
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        prob_of_word = model[word]
        perplexity = perplexity + (1/prob_of_word)
    perplexity = pow(perplexity, 1/float(N))
    return perplexity
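
For reference, the usual textbook definition of perplexity is the inverse geometric mean of the word probabilities, which multiplies the inverse probabilities, whereas the function above adds them before taking the N-th root. A minimal sketch of the textbook definition, computed in log space to avoid overflow, is shown below. The name `perplexity_log_space` and the toy model are illustrative, and the results reported in this notebook were not produced with it.

```python
import math

def perplexity_log_space(words, model):
    """Textbook perplexity: exp of the average negative log-probability."""
    if not words:
        raise ValueError("need at least one word")
    log_prob_sum = sum(math.log(model[w]) for w in words)
    return math.exp(-log_prob_sum / len(words))

# Toy model: a uniform distribution over 4 words, whose perplexity should be about 4.
toy_model = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
print(perplexity_log_space("a b c d a".split(), toy_model))
```

Under this definition a uniform model over V words always scores a perplexity of V, which gives a handy sanity check for any implementation.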

In [23]:
import numpy as np
def getPerplexity(sentences, smoothing_params):
    #average the per-sentence perplexity for each smoothing value passed in smoothing_params
    perplexities = []
    for smoothing in smoothing_params:
        smoothedModel = unigramModel(smoothing)
        sum_perplexity = []
        for sentence in sentences:
            sentence_perp = perplexity(sentence, smoothedModel)
            sum_perplexity.append(sentence_perp)
        avg_perplexity = np.asarray(sum_perplexity).mean()
        perplexities.append(avg_perplexity)
    return(perplexities)

In [24]:
smoothing_params = [0.5, 0.1, 0.01, 0.001, 0.0001, 0.000001]
test_perplexities_unigram = getPerplexity(test_sentences, smoothing_params)

In [25]:
## Plot or table #1

# Include a plot or table explaining how you selected and tuned your model. You must complete this step in the project.

In [27]:
%matplotlib inline
import matplotlib.pyplot as plt
import altair as alt
df = pd.DataFrame({"Smoothing Value": smoothing_params,
                   "Perplexity":test_perplexities_unigram})

alt.Chart(df).mark_line().encode(
    alt.X('Smoothing Value', scale=alt.Scale(type='log')),
    y='Perplexity')

## STATIC BIGRAM MODEL:
- FEATURE ENGINEERED --> THIS IS USING BIGRAMS INSTEAD OF UNIGRAMS TO ALLOCATE FREQUENCIES IN THE MODEL
- STATIC BIGRAM MODEL (we choose the next word as the one with the highest probability of being next)
- TUNING THE MODEL WITH SMOOTHING
- PLOTTING THE PERPLEXITY AT DIFFERENT LEVELS OF SMOOTHING

In [28]:
## Feature engineering. Please include your code for feature engineering in this cell. 
#sum all of the frequencies of the following words for each word
sum_words = train_pivot_df.sum(axis=1)
#divide these frequencies by the total frequency for each row to get a probability between 0 and 1
train_pivot_df = train_pivot_df.apply(lambda x: x/sum_words)
#in theory all of the rows should sum up to 1 now!
train_pivot_df.head()

| lead | a | abandon | abandoned | abcs | abdomen | abilities | ability | able | about | above | ... | youve | ytt | z | zack | zacks | zen | zero | zip | zone | zoom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a | 0.000295 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000295 |
| abandon | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| abandoned | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| abcs | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| abdomen | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [29]:
def perplexityStaticBigram(testset, model, smoothing):
    #split the sentence into words and walk over each adjacent pair of words
    words = testset.split()
    perplexity = 1
    N = 0
    for word1, word2 in zip(words, words[1:]):
        N += 1
        try:
            #probability of word2 following word1 in the training data
            prob_of_word = model.loc[word1,word2]
        except KeyError:
            #at least one of the two words never appeared in the training table
            prob_of_word = smoothing
        if prob_of_word == 0:
            #both words are known, but this combination was never seen
            prob_of_word = smoothing
        perplexity = perplexity + (1/prob_of_word)
    if N == 0:
        #sentences with fewer than two words have no bigrams to score
        raise ValueError("need at least two words: {}".format(words))
    perplexity = pow(perplexity, 1/float(N))
    return perplexity

In [30]:
def getPerplexityNgrams(sentences):
    smoothing_params = [0.5, 0.1, 0.01, 0.001, 0.0001, 0.000001]
    perplexities = []
    for smoothing in smoothing_params:
        sum_perplexity = []
        for sentence in sentences:
            if len(sentence.split()) > 1:
                sentence_perp = perplexityStaticBigram(sentence, train_pivot_df, smoothing)
                sum_perplexity.append(sentence_perp)
        avg_perplexity = np.asarray(sum_perplexity).mean()
        perplexities.append(avg_perplexity)
    return(perplexities)

In [31]:
test_perplexities_bigram = getPerplexityNgrams(test_sentences)

In [32]:
## Plot or table #2

# Include a plot or table explaining how you selected and tuned your model. You must complete this step in the project.

In [33]:
df = pd.DataFrame({"Smoothing Value": smoothing_params,
                   "Perplexity":test_perplexities_bigram})

alt.Chart(df).mark_line().encode(
    alt.X('Smoothing Value', scale=alt.Scale(type='log')),
    y='Perplexity')

## Results

_Using this cell, please write a short, clear paragraph explaining your results. In this class, we have mostly focused on accuracy. It is OK to measure your results in another quantitative way (e.g. precision or likelihood). Whatever you pick, make sure you are clear on what you are doing, and make sure you explain why your measurement of success makes sense._

First, I wanted to explain that I am using perplexity as a measure of performance, given that this is an unlabeled and unsupervised language model. The lower the perplexity, the better the model is at predicting held-out words and sequences of words, including ones that don't exist in the training corpus. My results are pretty interesting for these experiments. First, I noticed that my unigram model performed significantly better than my bigram model when my smoothing value was really low (at 0.000001 the unigram perplexity was about 4,161 versus about 20,341 for the bigram), but my bigram model began to outperform my unigram model significantly at a smoothing value of about 0.000015. In general, it makes sense that as my smoothing goes up, the perplexity goes down, because I am assigning a higher probability to words that I haven't encountered yet, so the model performs much better on the test set when it encounters new words. However, choosing a model with a very high smoothing value would end up giving me a model with high test performance but really low training performance, because I would be introducing a lot of bias into the model (though I would be reducing the variance of the model).

Ultimately, I think that my bigram model is better than my unigram model on average across smoothing values, so that is the model that I decided to look deeper into for my error analysis.

In [34]:
## Code 

# Include code showing how you arrived at your results. You must complete this step in the project.
print(test_perplexities_bigram)
print(test_perplexities_unigram)
print(smoothing_params)

[2.5657481207537485, 2.784380622173457, 4.942714793575661, 24.498041491723868, 211.11798141380393, 20340.732900885647]
[476.72558898618354, 476.76010950081377, 477.1224643712439, 480.5881238056932, 514.3037869060355, 4160.859283476815]
[0.5, 0.1, 0.01, 0.001, 0.0001, 1e-06]


In [35]:
## Plot or table 
# Include a plot or table explaining your results. You must complete this step in the project.
#FINAL PLOT DETAILING THE DIFFERENCE BETWEEN THE BIGRAM AND UNIGRAM MODEL WITH DIFFERENT SMOOTHING PARAMS
df = pd.DataFrame({"Smoothing Value": smoothing_params + smoothing_params,
                   "Model Type": ['unigram'] * len(smoothing_params) + ['bigram'] * len(smoothing_params),
                   "Perplexity":test_perplexities_unigram + test_perplexities_bigram})

alt.Chart(df).mark_line().encode(
    alt.X('Smoothing Value', scale=alt.Scale(type='log')),
    color='Model Type',
    y=alt.Y('Perplexity', scale=alt.Scale(type='log')))

| Type of feature/model | Smoothing Params | Perplexity |
|---------|---------------------|---------|
| Bigram   |   0.5                  |2.5657481207537485|
| Bigram   |   0.1                  |2.784380622173457|
| Bigram   |   0.01                  |4.942714793575661|
| Bigram   |   0.001                  |24.498041491723868|
| Bigram   |   0.0001                  |211.11798141380393|
| Bigram   |   0.000001                |20340.732900885647|
| Unigram  |   0.5                  |476.72558898618354|
| Unigram   |   0.1                  |476.76010950081377|
| Unigram   |   0.01                  |477.1224643712439|
| Unigram   |   0.001                  |480.5881238056932|
| Unigram   |   0.0001                  |514.3037869060355|
| Unigram   |   0.000001                |4160.859283476815|
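
The crossover point mentioned in the Results paragraph can be estimated from the printed perplexities. The sketch below assumes straight lines between the measured points on log-log axes, which is how the final plot connects them; the function name `loglog_crossover` is illustrative. Because nothing was measured between 0.000001 and 0.0001, the result is only a rough estimate.

```python
import numpy as np

# Values copied from the printed results above.
smoothing = np.array([0.5, 0.1, 0.01, 0.001, 0.0001, 0.000001])
bigram = np.array([2.5657481207537485, 2.784380622173457, 4.942714793575661,
                   24.498041491723868, 211.11798141380393, 20340.732900885647])
unigram = np.array([476.72558898618354, 476.76010950081377, 477.1224643712439,
                    480.5881238056932, 514.3037869060355, 4160.859283476815])

def loglog_crossover(x, y1, y2):
    """Find where two curves cross, assuming straight segments on log-log axes."""
    log_x = np.log10(x)
    diff = np.log10(y1) - np.log10(y2)
    for i in range(len(x) - 1):
        if diff[i] * diff[i + 1] < 0:  # sign change: the curves cross in this segment
            t = diff[i] / (diff[i] - diff[i + 1])  # fraction of the way along the segment
            return 10 ** (log_x[i] + t * (log_x[i + 1] - log_x[i]))
    return None

crossover = loglog_crossover(smoothing, bigram, unigram)
print("estimated crossover smoothing value: {:.2e}".format(crossover))
```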

## Error analysis

_Using this cell, please write a short, clear paragraph explaining what errors your model seems to be making, and offer a brief explanation based on your code below._ 

You must:
1. Perform some error analysis technique, such as making a confusion matrix or examining model mistakes.


### Description of error analysis technique:

1. First, I decided to analyze the error on my bigram model since it performed better than the unigram model.
2. I also decided to analyze the error manually, since the model is in theory attempting to generate sentences that sound like something I would write, so I can judge whether a generated sentence convinces me that I might have written it.
3. I was also curious whether the static bigram model that I made would produce sentences that were better or worse than a bigram model that samples from the full distribution of probabilities. I ended up creating two different functions based on my bigram model:
- One of the functions assigns the next word in the sentence based only on the word with the highest probability of occurring next. This means that if I input the same intro of a sentence, I will always get the same ending of the sentence back.
- The other function assigns the next word in the sentence by sampling from the distribution of probabilities of next words. This means that if I input the same intro of a sentence, I will usually get a different ending back, because I am randomly choosing each next word according to its probability of following the previous word!
4. Finally, I manually checked both of these models with the same inputs to see if any of the sentences were convincing to me as something that I could have written, and also which type of model I thought was more convincing! Overall, I discovered that my static model was producing some specific errors (such as getting stuck in loops like "i am i am i am") and both models were producing errors by creating sentences that weren't grammatically correct (or sentences that just didn't make sense semantically). I describe my analysis more fully in the summary and conclusions cell at the bottom of this notebook.
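
The looping behaviour noted in item 4 follows from how the static approach works: the next word depends only on the current word, so as soon as a word repeats, everything after it repeats too. The toy sketch below illustrates this with a hypothetical "most likely next word" table chosen to mimic the observed loop; `greedy_generate`, `find_cycle` and `toy_table` are illustrative names, not the notebook's functions or data.

```python
def greedy_generate(next_word_table, start, max_len=12):
    """Always follow the single most likely next word (toy version of the static model)."""
    sentence = [start]
    while len(sentence) < max_len:
        sentence.append(next_word_table[sentence[-1]])
    return sentence

def find_cycle(next_word_table, start):
    """Return the repeating word cycle that greedy generation falls into."""
    seen = []
    word = start
    while word not in seen:
        seen.append(word)
        word = next_word_table[word]
    return seen[seen.index(word):]

# Hypothetical table of most likely followers (illustrative only).
toy_table = {"love": "i", "i": "am", "am": "so", "so": "i"}

print(" ".join(greedy_generate(toy_table, "love")))
print(find_cycle(toy_table, "love"))
```

Sampling from the distribution instead of always taking the top word breaks this determinism, which is why the distributed function does not get stuck in the same way.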



In [None]:
## Error analysis

# Include code for error analysis here, to justify your conclusions. 
# You might make a confusion matrix, sample misclassifier data, analyze learned weights, or use any other method 
# discussed in class, or which makes sense for your model

In [36]:
from numpy.random import choice
all_words = train_pivot_df.columns

def make_a_sentence_static_bigram(start_of_sentence):
    word= start_of_sentence.split()[-1]
    word = word.lower()
    sentence=start_of_sentence.split()
    while len(sentence) < 30:
        #this is the next word with the highest probability of occurring
        next_word = train_pivot_df.loc[word].idxmax()
        #this code is to determine when we have reached a likely end of sentence by encountering an "end-word"
        if next_word == 'EndWord':
            #the most likely next word never changes for a given word, so stop here instead of looping forever
            break
        elif len(sentence) > 10 and (next_word + '!') in end_words:
            sentence.append(next_word + '!')
            break
        elif len(sentence) > 10 and (next_word + '.') in end_words:
            sentence.append(next_word + '.')
            break
        elif len(sentence) > 10 and (next_word + '?') in end_words:
            sentence.append(next_word + '?')
            break
        else:
            sentence.append(next_word)
        word=next_word
    sentence = ' '.join(sentence)
    return sentence

In [37]:
from numpy.random import choice
all_words = train_pivot_df.columns

def make_a_sentence_distributed_bigram(start_of_sentence):
    word= start_of_sentence.split()[-1]
    word = word.lower()
    sentence=start_of_sentence.split()
    while len(sentence) < 30:
        #get the probabilities of the next potential words
        #alternative version where I choose next word randomly based on distribution of probabilities of next words
        probabilities_of_next_word = (train_pivot_df.iloc[train_pivot_df.index ==word].fillna(0).values)[0]
        #choose a random word based on the distribution of probabilities
        next_word = choice(a = all_words, p=probabilities_of_next_word)
        
        #this code is to determine when we have reached a likely end of sentence by encountering an "end-word"
        if next_word == 'EndWord':
                continue
        elif len(sentence) > 10 and (next_word + '!') in end_words:
            sentence.append(next_word + '!')
            break
        elif len(sentence) > 10 and (next_word + '.') in end_words:
            sentence.append(next_word + '.')
            break
        elif len(sentence) > 10 and(next_word + '?') in end_words:
            sentence.append(next_word + '?')
            break
        else :
            sentence.append(next_word)
        word=next_word
    sentence = ' '.join(sentence)
    return sentence

In [42]:
print("STATIC BIGRAM: ", make_a_sentence_static_bigram("how am I supposed to"))
print("DISTRIBUTED BIGRAM: ", make_a_sentence_distributed_bigram("how am I supposed to"))

STATIC BIGRAM:  how am I supposed to be a lot of the i am.
DISTRIBUTED BIGRAM:  how am I supposed to do still havent been finding solace with.


In [43]:
print("STATIC BIGRAM: ", make_a_sentence_static_bigram("This is the AI named Jess"))
print("DISTRIBUTED BIGRAM: ", make_a_sentence_distributed_bigram("This is the AI named Jess"))

STATIC BIGRAM:  This is the AI named Jess age dear diary i am so.
DISTRIBUTED BIGRAM:  This is the AI named Jess dear diary its strange i am.


In [44]:
print("STATIC BIGRAM: ", make_a_sentence_static_bigram("What I want in life is to"))
print("DISTRIBUTED BIGRAM: ", make_a_sentence_distributed_bigram("What I want in life is to"))

STATIC BIGRAM:  What I want in life is to be a lot of the i.
DISTRIBUTED BIGRAM:  What I want in life is to be needing to visualize myself!


In [46]:
print("STATIC BIGRAM: ", make_a_sentence_static_bigram("My biggest goals are to"))
print("DISTRIBUTED BIGRAM: ", make_a_sentence_distributed_bigram("My biggest goals are to"))

STATIC BIGRAM:  My biggest goals are to be a lot of the i am.
DISTRIBUTED BIGRAM:  My biggest goals are to believe i dont know every i should have.


In [47]:
print("STATIC BIGRAM: ", make_a_sentence_static_bigram("I hope to achieve"))
print("DISTRIBUTED BIGRAM: ", make_a_sentence_distributed_bigram("I hope to achieve"))

STATIC BIGRAM:  I hope to achieve some of the i am so i am.
DISTRIBUTED BIGRAM:  I hope to achieve these terrible for quite a quarter do of.


In [66]:
print("STATIC BIGRAM: ", make_a_sentence_static_bigram("I love"))
print("DISTRIBUTED BIGRAM: ", make_a_sentence_distributed_bigram("I love"))

STATIC BIGRAM:  I love i am so i am so i am so i.
DISTRIBUTED BIGRAM:  I love happiness in the appropriate time i told him and how.


In [54]:
print("STATIC BIGRAM: ", make_a_sentence_static_bigram("I will never know"))
print("DISTRIBUTED BIGRAM: ", make_a_sentence_distributed_bigram("I will never know"))

STATIC BIGRAM:  I will never know that i am so i am so i.
DISTRIBUTED BIGRAM:  I will never know that i had only visiting you want to.


In [56]:
print("STATIC BIGRAM: ", make_a_sentence_static_bigram("What if I"))
print("DISTRIBUTED BIGRAM: ", make_a_sentence_distributed_bigram("What if I"))

STATIC BIGRAM:  What if I am so i am so i am so i.
DISTRIBUTED BIGRAM:  What if I love you hide stuff that i dont look in.


In [58]:
print("STATIC BIGRAM: ", make_a_sentence_static_bigram("Dear diary"))
print("DISTRIBUTED BIGRAM: ", make_a_sentence_distributed_bigram("Dear diary"))

STATIC BIGRAM:  Dear diary i am so i am so i am so i.
DISTRIBUTED BIGRAM:  Dear diary holy writing love i couldnt watch something similar to become.


## Summary and conclusion

_Using this cell, please write a short, clear paragraph describing how your results answer or do not answer your question. What are the implications of your findings? What new questions arise from your work?_

What an interesting and insightful project! I discovered that my best performing model (the static bigram model) might have performed pretty well in theory, but in practice it produced sentences that didn't make a lot of sense in English and that I didn't believe I would have written in my diary. This was in part because assigning the next word to *always* be the word with the highest probability of coming next created some unintentional feedback loops in my model. For example, the most common word that comes after the word "I" is "am", which meant that in my static model, every time I encountered the word "I", my sentence would continue with "am", and there were many times that the sentence output was something like "I am [word word] I am [word] I am.." or something similar. In my diary entries in real life, I would definitely not be likely to write a sentence like that. However, in my distributed model, where I chose the next word *randomly* based on the distribution of probabilities of next words, the resulting sentences were actually relatively convincing! Sure, they weren't grammatically perfect, because this is a pretty rudimentary model. However, I definitely believed I could have written some version of a few of them. For example, when I input "I love...", the distributed model returned "I love happiness in the appropriate time i told him and how". This sentence used words that are really common in my day-to-day language, such as 'happiness', and it was almost grammatically correct enough to be believable, using phrases such as 'in the appropriate time i told him'. On the other hand, the static model given this same input produced "I love i am so i am so i am so i.", which did not make sense to me and also fell victim to the feedback loop of 'i am' that I discussed earlier.

Overall, I learned a lot about bigram and unigram language models throughout this project, and I also learned a bit about my writing style and the types of words / combinations of words that I use most often in my written diaries! The implication of these findings is that I am convinced that, if I had more time to make a more robust model, I could make a convincing language model of my own writing style. I am curious what would happen if I were to create a model that used trigrams or even n-grams with a higher n, and whether the generated sentences would be even more convincing, but I will leave that for future work.

##### Side note: if this notebook is run again, my distributed model will produce different results for my sentence generation, since it picks randomly across the distribution of probabilities. Some of the results that I analyzed above might therefore be slightly different on a different run, but this was the way that I analyzed the first round of outputs that I tested.
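
One way to make sampled sentences repeatable between runs is to draw from a seeded random generator. The sketch below assumes NumPy's `default_rng`; the seed value, the helper name `sample_next_word` and the toy probabilities are illustrative, and the sentence functions above do not currently use a seed.

```python
import numpy as np

def sample_next_word(words, probabilities, rng):
    """Draw one next word according to the given probability distribution."""
    return rng.choice(words, p=probabilities)

words = ["am", "love", "will"]   # toy followers of "i" (illustrative)
probabilities = [0.5, 0.3, 0.2]  # toy probabilities; they must sum to 1

rng_a = np.random.default_rng(seed=42)  # the seed value is arbitrary
rng_b = np.random.default_rng(seed=42)
run_a = [sample_next_word(words, probabilities, rng_a) for _ in range(5)]
run_b = [sample_next_word(words, probabilities, rng_b) for _ in range(5)]

print(run_a == run_b)  # two generators with the same seed give the same draws
```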