<a href="https://colab.research.google.com/github/pleabargain/ipynb_notebooks/blob/master/markov_generator_from_google_spreadsheet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# License
MIT License

Copyright (c) 2018 Ashwin M J

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

# To be done
* connect to a google spreadsheet
* save output to new file
* remove duplicates
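
A minimal sketch of the last two items, assuming the generated sentences are first collected in a list rather than printed. The names `unique_in_order`, `save_sentences` and the output file name are hypothetical and not part of the notebook.

```python
import os
import tempfile

def unique_in_order(sentences):
    """Drop repeated sentences, keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(sentences))

def save_sentences(sentences, path):
    """Write one sentence per line to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write(sentence + "\n")

# Stand-in for generator output (made-up sample, not real results)
generated = ["how was i made", "why do people die", "how was i made"]
unique = unique_in_order(generated)

# Hypothetical output location; a temporary directory keeps the demo tidy
output_path = os.path.join(tempfile.mkdtemp(), "generated_sentences.txt")
save_sentences(unique, output_path)
print(unique)
```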


# Nice to have
* visualize the markov chains
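
One possible starting point for the visualization is to export the trained transitions as Graphviz DOT text, which Graphviz can then render as a graph. This is only a sketch: `transitions_to_dot` is a hypothetical helper, and it assumes the dictionary shape used later in this notebook, where each word pair maps to a dictionary of next-word probabilities.

```python
def transitions_to_dot(transitions):
    """Return DOT text with one node per word pair and one edge per transition."""
    lines = ["digraph markov {"]
    for (word0, word1), next_words in transitions.items():
        for next_word, probability in next_words.items():
            source = f"{word0} {word1}"
            # The END marker gets its own node; otherwise the state shifts by one word
            target = "END" if next_word == "END" else f"{word1} {next_word}"
            source = source.replace('"', '\\"')
            target = target.replace('"', '\\"')
            lines.append(f'  "{source}" -> "{target}" [label="{probability:.2f}"];')
    lines.append("}")
    return "\n".join(lines)

# Toy transitions for the single sentence "how was i made"
toy = {
    ("how", "was"): {"i": 1.0},
    ("was", "i"): {"made": 1.0},
    ("i", "made"): {"END": 1.0},
}
dot = transitions_to_dot(toy)
print(dot)
```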



# Word Prediction using Markov Model

This notebook uses a Markov model for word prediction. Specifically, a second-order Markov model is used, so the next word is predicted from the two words that precede it. As an example of a Markov chain, the original attempt was to generate new song lyrics from a collection of Eminem lyrics; here the training text comes from a Google spreadsheet of table topics questions.
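
In symbols, the second-order assumption can be sketched as follows. The notation ($w_t$ for the word at position $t$, $\operatorname{count}$ for the number of occurrences in the training text) is introduced here for illustration and does not appear in the code.

$$
P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-2}, w_{t-1}) \approx \frac{\operatorname{count}(w_{t-2}, w_{t-1}, w_t)}{\operatorname{count}(w_{t-2}, w_{t-1})}
$$

Here $\operatorname{count}(w_{t-2}, w_{t-1})$ counts every occurrence of the pair that is followed by some token, including the end-of-sentence marker.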

In [1]:
# Preamble
!pip install --upgrade 'notebook>=5.7.6'
!pip install --upgrade -q gspread

import string
import numpy as np

Collecting notebook>=5.7.6
  Downloading https://files.pythonhosted.org/packages/87/18/37d8c8fd136b68b0983ce1621ab3c747beef25b46b7767e4f06b8a9a1bf0/notebook-5.7.7-py2.py3-none-any.whl (9.0MB)
    100% |████████████████████████████████| 9.0MB 3.3MB/s
google-colab 1.0.0 has requirement notebook~=5.2.0, but you'll have notebook 5.7.7 which is incompatible.
Installing collected packages: notebook
  Found existing installation: notebook 5.2.2
    Uninstalling notebook-5.2.2:
      Successfully uninstalled notebook-5.2.2
Successfully installed notebook-5.7.7


In [0]:
#code doesn't like a list
#test =(["First sentence"
#       ,"second sentence"])

In [4]:
from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials
import regex as re

gc = gspread.authorize(GoogleCredentials.get_application_default())

worksheet = gc.open('table topics questions').sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
#this is a sanity check

print(rows)

[['A former colleague has asked you to join her new company, what do you do?', '72'], ["Answer the child's question:  What does “we can’t afford it” mean?", '4'], ["Answer the child's question:  When you die who will I live with?", '8'], ["Answer the child's question:  Where did I come from?", '9'], ["Answer the child's question:  Why can’t I stay up as late as you?", '12'], ["Answer the child's question: How was I made?", '1'], ["Answer the child's question: Is Father Christmas Real?", '2'], ["Answer the child's question: What is God?", '6'], ["Answer the child's question: Why do people die?", '13'], ["Convince us that we don't need to go to work tomorrow.", ''], ['Convince us that farmers markets are good for the economy.', ''], ['Convince us that the best way to get a job is through networking.', ''], ['Convince us that you deserve to be the next project manager.', ''], ['Convince us that your late work is acceptable.', ''], ['Convince us to quit our jobs and do what you do.', ''], 
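
Each entry in `rows` is a list of cell values, so the sentence itself is the first element of each inner list. A small sketch of pulling out just the sentences, using two rows copied from the output above plus an empty row as an assumed edge case (the variable name `sentences` is hypothetical):

```python
# Two rows copied from the sanity-check output, plus an empty row
rows = [
    ["Answer the child's question: How was I made?", "1"],
    ["Convince us that your late work is acceptable.", ""],
    [],
]

# Keep the first column of every non-empty row that has some text in it
sentences = [row[0] for row in rows if row and row[0].strip()]

for sentence in sentences:
    print(sentence)
```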

In [0]:
# Training data: rows read from the spreadsheet (each row is a list of cell values)
training_data_file = rows


## Training

### Helper functions

In [0]:
def remove_punctuation(sentence):
    return sentence.translate(str.maketrans('','', string.punctuation))
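
Note that `string.punctuation` only covers ASCII punctuation, while the spreadsheet text contains typographic quotes such as “ ” and ’. The sketch below shows the difference; `remove_punctuation_extended` and `EXTRA_PUNCTUATION` are hypothetical additions, and the set of extra characters is an assumption based on the rows printed earlier.

```python
import string

def remove_punctuation(sentence):
    # Same helper as in the notebook: strips ASCII punctuation only
    return sentence.translate(str.maketrans('', '', string.punctuation))

# Assumption: these are the typographic quotes that occur in the sheet
EXTRA_PUNCTUATION = "“”‘’"

def remove_punctuation_extended(sentence):
    # Hypothetical variant that also strips the typographic quotes
    table = str.maketrans('', '', string.punctuation + EXTRA_PUNCTUATION)
    return sentence.translate(table)

sample = "What does “we can’t afford it” mean?"
print(remove_punctuation(sample))           # the curly quotes survive
print(remove_punctuation_extended(sample))  # the curly quotes are removed
```

Without the extended version, `can’t` and `cant` would be treated as different tokens.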

In [0]:
def add2dict(dictionary, key, value):
    if key not in dictionary:
        dictionary[key] = []
    dictionary[key].append(value)

In [0]:
def list2probabilitydict(given_list):
    probability_dict = {}
    given_list_length = len(given_list)
    for item in given_list:
        probability_dict[item] = probability_dict.get(item, 0) + 1
    for key, value in probability_dict.items():
        probability_dict[key] = value / given_list_length
    return probability_dict
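
To make the behaviour concrete: the function turns a list of observed next words into relative frequencies. The sketch below is an equivalent formulation using `collections.Counter` (the name `list2probabilitydict_counter` is hypothetical), applied to a made-up list.

```python
from collections import Counter

def list2probabilitydict_counter(given_list):
    """Map each distinct item to its share of the list."""
    total = len(given_list)
    return {item: count / total for item, count in Counter(given_list).items()}

probabilities = list2probabilitydict_counter(['i', 'you', 'i', 'i'])
print(probabilities)  # {'i': 0.75, 'you': 0.25}
```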

In [0]:
initial_word = {}
second_word = {}
transitions = {}
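
These three dictionaries hold, respectively, counts of sentence-opening words, the words seen after each opening word, and the words seen after each pair of words. As an illustration only, the sketch below fills them by hand for the single sentence "how was i made", before any normalization; it mirrors what the training function further down records but is not the training code itself.

```python
tokens = "how was i made".split()

initial_word, second_word, transitions = {}, {}, {}

# First word of the sentence
initial_word[tokens[0]] = initial_word.get(tokens[0], 0) + 1

# Word that follows the first word
second_word.setdefault(tokens[0], []).append(tokens[1])

# Every later word is keyed by the two words before it
for i in range(2, len(tokens)):
    transitions.setdefault((tokens[i - 2], tokens[i - 1]), []).append(tokens[i])

# The final pair is followed by the END marker
transitions.setdefault((tokens[-2], tokens[-1]), []).append('END')

print(initial_word)
print(second_word)
print(transitions)
```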

### Training function

In [0]:
# Trains a Markov model on the spreadsheet rows in training_data_file
def train_markov_model():
    for row in training_data_file:
        # Each row is a list of cell values; the sentence is in the first column
        if not row:
            continue
        tokens = remove_punctuation(row[0].rstrip().lower()).split()
        tokens_length = len(tokens)
        # Skip rows with fewer than two words: they would leave an initial
        # word without a second word, which breaks generation
        if tokens_length < 2:
            continue
        for i in range(tokens_length):
            token = tokens[i]
            if i == 0:
                initial_word[token] = initial_word.get(token, 0) + 1
            else:
                prev_token = tokens[i - 1]
                if i == tokens_length - 1:
                    add2dict(transitions, (prev_token, token), 'END')
                if i == 1:
                    add2dict(second_word, prev_token, token)
                else:
                    prev_prev_token = tokens[i - 2]
                    add2dict(transitions, (prev_prev_token, prev_token), token)

    # Normalize the distributions
    initial_word_total = sum(initial_word.values())
    for key, value in initial_word.items():
        initial_word[key] = value / initial_word_total

    for prev_word, next_word_list in second_word.items():
        second_word[prev_word] = list2probabilitydict(next_word_list)

    for word_pair, next_word_list in transitions.items():
        transitions[word_pair] = list2probabilitydict(next_word_list)

    print('Training successful.')

In [11]:
train_markov_model()

### Helper functions

In [0]:
def sample_word(dictionary):
    p0 = np.random.random()
    cumulative = 0
    for key, value in dictionary.items():
        cumulative += value
        if p0 < cumulative:
            return key
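    # Floating-point rounding can leave the cumulative sum slightly below 1;
    # fall back to the last key instead of failing
    return key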
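
The loop above is inverse-transform sampling over a discrete distribution: draw a uniform number and walk the cumulative probabilities until it is exceeded. As a point of comparison, the same draw can be sketched with the standard library's `random.choices`. The name `sample_word_weighted` is hypothetical and this is an alternative formulation, not the notebook's method.

```python
import random

def sample_word_weighted(dictionary):
    """Draw one key with probability proportional to its value."""
    words = list(dictionary.keys())
    weights = list(dictionary.values())
    # random.choices returns a list of k items; take the single draw
    return random.choices(words, weights=weights, k=1)[0]

random.seed(0)  # fixed seed so the demo is repeatable
toy_distribution = {'the': 0.9, 'a': 0.1}
draws = [sample_word_weighted(toy_distribution) for _ in range(1000)]
print(draws.count('the'), draws.count('a'))
```

Because `random.choices` normalizes the weights itself, it also works on raw counts that have not been converted to probabilities.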

### Test functions

In [0]:
number_of_sentences = 15

In [0]:
# Function to generate sample text
def generate():
    for i in range(number_of_sentences):
        sentence = []
        # Initial word
        word0 = sample_word(initial_word)
        sentence.append(word0)
        # Second word
        word1 = sample_word(second_word[word0])
        sentence.append(word1)
        # Subsequent words until END
        while True:
            word2 = sample_word(transitions[(word0, word1)])
            if word2 == 'END':
                break
            sentence.append(word2)
            word0 = word1
            word1 = word2
        print(' '.join(sentence),"\n")

### Testing arena

In [0]:
generate()

All in all, there is some new variation, but not as much as I had anticipated.

The data set might not be large enough.
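
One way to check this hunch is to count how many word pairs have more than one possible successor: a pair with a single successor forces the generator to repeat the training text at that point. The sketch below assumes the trained `transitions` shape used in this notebook (word pair mapped to a dictionary of next-word probabilities); `branching_stats` is a hypothetical helper and the toy data is made up.

```python
def branching_stats(transitions):
    """Return (number of word pairs, number of pairs with more than one successor)."""
    total = len(transitions)
    branching = sum(1 for next_words in transitions.values() if len(next_words) > 1)
    return total, branching

# Toy stand-in for a trained model (made-up probabilities)
toy_transitions = {
    ("how", "was"): {"i": 1.0},
    ("answer", "the"): {"childs": 0.9, "question": 0.1},
}
total, branching = branching_stats(toy_transitions)
print(f"{branching} of {total} word pairs have more than one successor")
```

A low share of branching pairs would support the idea that more training sentences are needed.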