# Text Generation

## Python Pizza Hamburg New Year's 🎉🎉🎉
> This Jupyter Notebook was forked and edited from [this repo](https://github.com/adashofdata/nlp-in-python-tutorial/blob/d66c67a094836e7c79663d5013d4c0152d710f52/5-Text-Generation.ipynb).
> Thanks, Alice! :)

## Introduction

"Markov chains can be used for very basic text generation. Think about every word in a corpus as a state. We can make a simple assumption that the next word is only dependent on the previous word - which is the basic assumption of a Markov chain." (Alice Zhao)

Deep learning models generate more fluent sentences, but a Markov chain is a good first implementation: simple and loads of fun!
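
To make the Markov assumption concrete, here is a minimal sketch that estimates how likely each next word is, given only the current word. The toy sentence and the helper name `transition_probabilities` are made up for illustration and are not part of the original tutorial.

```python
from collections import Counter, defaultdict

# Toy corpus (invented for illustration, not taken from the datasets)
toy_text = "the cat sat on the mat and the cat slept"
words = toy_text.split()

# Count how often each word is followed by each other word
transition_counts = defaultdict(Counter)
for current_word, next_word in zip(words, words[1:]):
    transition_counts[current_word][next_word] += 1

def transition_probabilities(word):
    """Estimate P(next word | current word) from the toy corpus."""
    counts = transition_counts[word]
    total = sum(counts.values())
    return {nxt: n / total for nxt, n in counts.items()}

print(transition_probabilities("the"))
```

In this toy text, "the" is followed by "cat" twice and by "mat" once, so "cat" gets roughly twice the probability of "mat". The dictionary-of-lists approach used later in this notebook encodes the same information implicitly: a word that follows more often simply appears more often in the list.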

### Load the datasets

Here, we use two datasets of news headlines (one about 5G, the other about Trump) as source text for the Markov chain.

For starters, let's load the datasets:

In [9]:
import pandas as pd

# Read the full Trump news dataset (all columns)
df_trump = pd.read_csv('data/trump_news.csv')
df_trump

Unnamed: 0,Headings,Subheadings
0,Barr Leaves a Legacy Defined by Trump,Though he sometimes departed from the presiden...
1,"In Trump’s Corner, With Flashes of Autonomy","Attorney General William P. Barr, who departed..."
2,Clemency Case Shows the Perks Of Trump Ties,"Philip Esformes, a nursing home operator, was ..."
3,"Answering Trump, Democrats Try and Fail to Jam...","In a brief bit of political theater, the House..."
4,"Answering Trump, Democrats Fail to Pass $2,000...",At a news conference after the unsuccessful mo...
...,...,...
95,New York Post to Trump: ‘End This Dark Charade’,Rupert Murdoch’s New York Post put more distan...
96,Republicans in Congress Stay Largely in Line B...,A few top Republicans called for a smooth tran...
97,Few Voices Breaking From G.O.P.’s Wall of Sile...,"Senator Lamar Alexander, Republican of Tenness..."
98,Business and World Leaders Move On as Trump Fi...,President-elect Joseph R. Biden Jr. is seizing...


In [53]:
# Read the full 5G news dataset (all columns)
df_5g = pd.read_csv('data/5g_news.csv')
df_5g

Unnamed: 0,Headlines,Subheadings
0,The Tech That Was Fixed in 2020 and the Tech T...,"From videoconferencing to fitness apps, the be..."
1,Why the 5G Pushiness? Because $$$.,Selling 5G capability is a huge opportunity fo...
2,How to Take On the Tech Barons,Something has to be done about the sector. He...
3,"Apple iPhone 12 Review: Superfast Speed, if Yo...","The new iPhone has an improved design, but it’..."
4,"IPhone 12’s 5G Asterisk: It’s Great, if It’s A...","Apple unveiled the iPhone 12, left, and iPhone..."
...,...,...
75,The New Tech Cold War,Tensions between China and the U.S. are higher...
76,How to Fight That Sinking Feeling,"Markets are expecting increasingly aggressive,..."
77,Really? Is the White House Proposing to Buy Er...,The attorney general suggested that one way of...
78,"India Bans Nearly 60 Chinese Apps, Including T...",The move is part of the tit-for-tat retaliatio...


In [239]:
# Pick a single headline (one row) to use as the source text
# headings = df_5g.Headlines.loc[29]  # 5G dataset
headings = df_trump.Headings.loc[24]  # Trump dataset
headings[:200]

'A Glitch in Trump’s Plan to Live at Mar-a-Lago: A Pact He Signed Says He Can’t'

## Build the Markov Chain Function

Build a simple, beginner-friendly Markov chain function that creates a dictionary:
* The keys are the words in the text (every word that has at least one word after it)
* The values are lists of the words that follow each key

In [240]:
from collections import defaultdict  # dict that creates a default value (here an empty list) for missing keys

def markov_chain(text):
    '''The input is a single string of text and the output is a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''
    
    # Tokenize on whitespace; punctuation stays attached to the words
    words = text.split()
    
    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)
    
    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict

In [241]:
# Create the dictionary for the dataframe and print the output
dict_df = markov_chain(headings)
dict_df

{'A': ['Glitch', 'Pact'],
 'Glitch': ['in'],
 'in': ['Trump’s'],
 'Trump’s': ['Plan'],
 'Plan': ['to'],
 'to': ['Live'],
 'Live': ['at'],
 'at': ['Mar-a-Lago:'],
 'Mar-a-Lago:': ['A'],
 'Pact': ['He'],
 'He': ['Signed', 'Can’t'],
 'Signed': ['Says'],
 'Says': ['He']}
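
The dictionary above comes from a single headline. A minimal sketch of how the same idea could be extended to many headlines at once is shown below. The function name `markov_chain_from_headlines` is hypothetical, and the sketch assumes every item is a plain string (no missing values).

```python
from collections import defaultdict

def markov_chain_from_headlines(headlines):
    """Build one word -> next-words dictionary from an iterable of strings.

    Word pairs are formed within each headline only, so the last word of one
    headline is never linked to the first word of the next.
    """
    chain = defaultdict(list)
    for headline in headlines:
        words = headline.split()
        for current_word, next_word in zip(words, words[1:]):
            chain[current_word].append(next_word)
    return dict(chain)

# Stand-in list built from two headlines shown earlier; with the notebook's
# data, a column such as df_trump.Headings could be passed instead.
sample_headlines = [
    "Barr Leaves a Legacy Defined by Trump",
    "Clemency Case Shows the Perks Of Trump Ties",
]
combined_chain = markov_chain_from_headlines(sample_headlines)
print(combined_chain["Trump"])
```

Here "Trump" ends the first headline and is followed by "Ties" in the second, so its list of next words contains only "Ties". A larger corpus gives each word more possible successors, which makes the generated text less of a replay of one headline.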

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary we have just created
* The number of words we want generated

Here are some examples of generated sentences:

>'Iphone 12 Review: Superfast Speed, if You Can Find It.'

>'5G as Tech Battle Between China and the West Escalates.'

In [242]:
import random

def generate_sentence(chain, count=10):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the maximum number of words you would like to see in your generated sentence.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count-1):
        # Stop early at a word with no recorded next word (e.g. the last word of the text)
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence

In [243]:
generate_sentence(dict_df)

'Live at Mar-a-Lago: A Pact He Signed Says He Can’t.'
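
Because the generator picks words at random, the output changes from run to run. If repeatable output is wanted (for a demo, say), one option is a seeded generator. The sketch below is an illustrative variant, not the tutorial's own function: the name `generate_sentence_seeded` and the `seed` parameter are assumptions.

```python
import random

def generate_sentence_seeded(chain, count=10, seed=None):
    """Random walk over the chain that is repeatable when seed is set."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    word = rng.choice(sorted(chain))  # sorted() gives a stable order of start words
    words = [word]
    # Keep walking until the word limit is reached or the word has no next words
    while len(words) < count and word in chain:
        word = rng.choice(chain[word])
        words.append(word)
    return ' '.join(words) + '.'

# Small hand-written chain in the same format as the dictionary built earlier
small_chain = {'A': ['Glitch', 'Pact'], 'Glitch': ['in'], 'Pact': ['He'], 'He': ['Signed']}
first = generate_sentence_seeded(small_chain, count=5, seed=42)
second = generate_sentence_seeded(small_chain, count=5, seed=42)
print(first == second)  # True: same seed, same sentence
```

Using a separate `random.Random` instance keeps the seeding local, so other cells that rely on the `random` module are not affected.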