<a href="https://colab.research.google.com/github/mowillia/phantom_pen/blob/master/text_generation_function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Text Generation -- Google Colab

This notebook implements the generation functions of the Phantom Pen application. These functions are:

- Simple Generate

- Classify and Generate

- Classify, Extract, and Generate


This code runs most efficiently when a GPU is enabled (in Colab: Runtime > Change runtime type > Hardware accelerator > GPU).


In [0]:
import os
import sys
import pandas as pd
import re
from bs4 import BeautifulSoup


import nltk
import nltk.data # natural language tool kit
from nltk.tokenize import sent_tokenize, word_tokenize # $ pip install nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

import joblib  # sklearn.externals.joblib is deprecated and absent from newer scikit-learn releases
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split


import json
import tensorflow as tf  # the code below uses the TensorFlow 1.x API (tf.Session, tf.placeholder)

import textwrap
import time

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.




In [0]:
#check GPU status
!nvidia-smi

Sun Jun 30 07:42:07 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   67C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [0]:
# clone mowillia repository to get access to encoder, decoder, sample files
!git clone https://github.com/mowillia/gpt-2.git

Cloning into 'gpt-2'...
remote: Enumerating objects: 315, done.
remote: Total 315 (delta 0), reused 0 (delta 0), pack-reused 315
Receiving objects: 100% (315/315), 4.40 MiB | 16.16 MiB/s, done.
Resolving deltas: 100% (172/172), done.


In [0]:
# change directory to GPT2
cd gpt-2

/content/gpt-2


In [0]:
# ensures we load packages needed for the function of program
!pip3 install -r requirements.txt
!pip install regex

Collecting fire>=0.1.3 (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/5a/b7/205702f348aab198baecd1d8344a90748cb68f53bdcd1cc30cbc08e47d3e/fire-0.1.3.tar.gz
Collecting regex==2017.4.5 (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/36/62/c0c0d762ffd4ffaf39f372eb8561b8d491a11ace5a7884610424a8b40f95/regex-2017.04.05.tar.gz (601kB)
     |████████████████████████████████| 604kB 10.4MB/s 
Collecting tqdm==4.31.1 (from -r requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/6c/4b/c38b5144cf167c4f52288517436ccafefe9dc01b8d1c190e18a6b154cd4a/tqdm-4.31.1-py2.py3-none-any.whl (48kB)
     |████████████████████████████████| 51kB 19.4MB/s 
Collecting toposort==1.5 (from -r requirements.txt (line 5))
  Downloading https://files.pythonhosted.org/packages/e9/8a/321cd8ea5f4a22a06e3ba30ef31ec33bea11a3443eeb1d89807640ee6ed4/toposort-1.5-py2.py3-none-any.whl
Building wheel

In [0]:
# Run this cell to mount your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

In [0]:
# Set the Python I/O encoding for this session
# (a `!export` runs in a temporary subshell, so the variable would not persist)
%env PYTHONIOENCODING=UTF-8

In [0]:
# copy trained classifier and essay text data
!cp -r /content/drive/My\ Drive/writrly_proj_files/data_pickle_files/* /content/

# copy models from checkpoint to models folder; necessary for generation
!cp -r /content/drive/My\ Drive/checkpoint/* /content/gpt-2/models

In [0]:
# copy encoder
!cp -r /content/drive/My\ Drive/writrly_proj_files/gpt2_support_files/encoder.py /content/gpt-2/

# copy sample
!cp -r /content/drive/My\ Drive/writrly_proj_files/gpt2_support_files/sample.py /content/gpt-2/

# copy model
!cp -r /content/drive/My\ Drive/writrly_proj_files/gpt2_support_files/model.py /content/gpt-2/

In [0]:
# loads the encoder, sample(r) and model for generation function
## note: these files must be in your content directory
import encoder, sample, model

In [0]:
## maps each topic label to the name of its fine-tuned 345 model

model_dict = {'science': 'atlantic_science_345',
              'entertainment': 'atlantic_entertainment_345',
              'education': 'atlantic_education_345',
              'politics': 'atlantic_politics_345',
              'technology': 'atlantic_technology_345',
              'health': 'atlantic_health_345',
              'ideas': 'atlantic_ideas_345',
              'international': 'atlantic_international_345',
              'business': 'atlantic_business_345',
              'short_story': 'all_short_stories_345'}

### Simple Generate Function

In [0]:
## generate a text continuation for a given input string

def simple_gen(input_string, lens, temp, model_choice):
    """
    Generate text that continues input_string
    :input_string : String, prompt that the model continues
    :lens : Number of tokens in generated text, if None, is
     determined by model hyperparameters
    :temp : Float value controlling randomness in boltzmann
     distribution. Lower temperature results in less random completions. As the
     temperature approaches zero, the model will become deterministic and
     repetitive. Higher temperature results in more random completions.
    :model_choice : String, which model to use (a folder name under models/)

    Values fixed inside the function:
    :seed=None : Integer seed for random number generators, fix seed to reproduce
     results
    :top_k=40 : Integer value controlling diversity. 1 means only 1 word is
     considered for each step (token), resulting in deterministic completions,
     while 40 means 40 words are considered at each step. 0 is a
     special setting meaning no restrictions. 40 generally is a good value.
    :top_p=0.9 : Float value controlling diversity. Implements nucleus sampling,
     overriding top_k if set to a value > 0. A good setting is 0.9.
    """
    model_name = model_choice
    seed = None
    raw_text = input_string
    length = lens
    temperature = temp  # higher values give more diverse output
    top_k = 40
    top_p = 0.9
    
    # produce only a single batch
    batch_size = 1

    # create encoder based on chosen model
    enc = encoder.get_encoder(model_name)
    
    # select hyperparameters based on model
    hparams = model.default_hparams()
    
    with open(os.path.join('models', model_name, 'hparams.json')) as f:
        hparams.override_from_dict(json.load(f))

    if length is None:
        length = hparams.n_ctx // 2
    elif length > hparams.n_ctx:
        raise ValueError("Can't get samples longer than window size: %s" % hparams.n_ctx)

    with tf.Session(graph=tf.Graph()) as sess:
        context = tf.placeholder(tf.int32, [batch_size, None])
        np.random.seed(seed)
        tf.set_random_seed(seed)
        output = sample.sample_sequence(
            hparams=hparams, length=length,
            context=context,
            batch_size=batch_size,
            temperature=temperature, top_k=top_k, top_p=top_p
        )

        saver = tf.train.Saver()
        ckpt = tf.train.latest_checkpoint(os.path.join('models', model_name))
        saver.restore(sess, ckpt)
    
        # encodes raw text for processing
        context_tokens = enc.encode(raw_text)

        # processes text through sampling program
        out = sess.run(output, feed_dict={context: [context_tokens for _ in range(batch_size)]})[:, len(context_tokens):]
        
        # decodes output back into text
        text = enc.decode(out[0])

    return(text)
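
The docstring of `simple_gen` describes three sampling controls: temperature, `top_k`, and `top_p`. The sketch below shows, in plain Python on a toy set of logits, what each control does to the next-token distribution. It is not the code in `sample.py`, which works on TensorFlow tensors; the function names and the toy logits are assumptions made for illustration, and the rule that `top_p` takes precedence over `top_k` is taken from the docstring.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize; k=0 means no restriction."""
    if k <= 0 or k >= len(probs):
        return list(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

def sample_token(logits, temperature=1.0, top_k=0, top_p=0.0, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = softmax(logits, temperature)
    if top_p > 0.0:
        probs = top_p_filter(probs, top_p)  # per the docstring, top_p overrides top_k
    else:
        probs = top_k_filter(probs, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]          # hypothetical scores for four tokens
sharp = softmax(toy_logits, temperature=0.5)  # mass concentrates on the best token
flat = softmax(toy_logits, temperature=2.0)   # mass spreads across tokens
```

With `top_k=1` the sampler always returns the most probable token, which is the deterministic case the docstring mentions.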

In [0]:
Beginning_String = '\n\n\n\nPhysics is a quite cool subject.'

start_time = time.time()
text_result = simple_gen(input_string = Beginning_String, 
              lens = 800, 
              temp = 1.0, 
              model_choice = 'atlantic_technology_345')

print('Run time:', str(time.time()-start_time)+' secs')
print('     ')
print(textwrap.fill(text_result, 60))

Run time: 30.29616117477417 secs
     
 Almost anything can be described using physics. There are
the elementary particles, which are called protons and
neutrons, and electrons. There are the masses of these
particles and their interaction, called electric charge. And
there are the forces that cause the interaction between the
particles and their surroundings.   Physics describes how a
particle interacts with an environment, and can explain,
among other things, how gravity works.   But it is the
atomic world that physics tells us how atoms interact with
one another. That is the world physicists study, and that is
the real world in which humanity lives.   We could all live
in a quantum superposition of carbon and oxygen—which never
gets enough heat to grow—but that would be a nightmare of
itself. In an ideal world, atoms would be totally stable, in
all sense of the word. Our air, water, and land would remain
air, water, and land. Our food would remain food. A bath of
nectar, a steady ra

In [0]:
Beginning_String = '\n\n\n\nToday is the last day of the rest of your life.'

start_time = time.time()
text_result = simple_gen(input_string = Beginning_String, 
              lens = 800, 
              temp = 1.0, 
              model_choice = 'atlantic_ideas_345')

print('Run time:', str(time.time()-start_time)+' secs')
print('     ')
print(textwrap.fill(text_result, 60))

Run time: 37.91648316383362 secs
     
 You'll never forget it. But before you die, why not reflect
on the fleeting moment when you did?  We keep counting down
to the election of Donald J. Trump, the horrible corruption
of our democracy, and I can barely keep track of how many
people are signing my petition, lest we all end up on the
same shortlist for execution. It begins:       To spend too
much time imagining Donald Trump or Ivanka Trump or
Kellyanne Conway or the warm, fuzzy feeling of having made a
massive life decision, or, you know, actually doing
something productive. Which is to say, much of today. Which
means that we can step back from the inescapable and turn
our minds to the juicy details of the administration itself.
(I spend a large part of my Friday reading New York
magazine; I know it's nice. Get it?) For me, at least, the
most important thing about the new health care bill is that
it includes a big bang stabilization fund, funding for which
is already guaranteed by Con

### Pickle Files and Functions for Classification

#### Former Pickle File Load

Loads Pickle file for classification

In [0]:
## cleaning text for classification

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw strings avoid invalid-escape warnings
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = BeautifulSoup(text, "lxml").text # HTML decoding
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwords from text
    return text

In [0]:
#import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# import data frame
masterDF = pd.read_csv('/content/master_df.csv')

# clean essay element
essay = masterDF['essay']
essay = [clean_text(elem) for elem in essay]
masterDF['essay'] = essay

# Split data into training and test sets 
train_x, test_x, train_y, test_y = train_test_split(masterDF['essay'], masterDF['topic'])

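# label encode the target variable 
# Encode labels with value between 0 and n_classes-1.
# (named label_encoder so it does not shadow the GPT-2 encoder module imported above)
label_encoder = preprocessing.LabelEncoder()
train_y = label_encoder.fit_transform(train_y)
# reuse the mapping fitted on the training labels instead of refitting on the test labels
test_y = label_encoder.transform(test_y)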

# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(masterDF['essay'])

# transform the training and validation data using count vectorizer object
xtrain_count =  count_vect.transform(train_x)
xtest_count =  count_vect.transform(test_x)
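
The count-vectorizer step above turns each cleaned essay into a vector of token counts. As a rough illustration of that idea, the following plain-Python sketch builds a vocabulary and count vectors with the same token pattern (`\w{1,}`). It is a simplified stand-in, not scikit-learn's implementation; `fit_vocabulary`, `transform`, and the two toy documents are hypothetical.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r'\w{1,}')  # same token pattern as the cell above

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(docs, vocabulary):
    """Return one count vector per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

docs = ["physics explains atoms", "atoms atoms everywhere"]  # toy corpus
vocabulary = fit_vocabulary(docs)
matrix = transform(docs, vocabulary)
```

Because the vocabulary is fixed at fit time, a new text is described only by the words the classifier has already seen, which is why `predict_class` reuses the fitted `count_vect` rather than fitting a new one.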

In [0]:
import joblib  # sklearn.externals.joblib is deprecated and absent from newer scikit-learn releases

#reverse encode
reverse_encode = ['business',
 'education',
 'entertainment',
 'health',
 'ideas',
 'international',
 'politics',
 'science',
 'short_story',
 'technology']

# select file
joblib_file = "/content/logreg_wordcount_model.pkl"

#load from file
file_model = joblib.load(joblib_file)

k =130

#Calculate accuracy and predictions
predictions = file_model.predict(xtest_count)
print('Accuracy:', metrics.accuracy_score(predictions, test_y))
predict_key = file_model.predict(xtest_count[k])[0]
print('Prediction: ', reverse_encode[predict_key])
print('Actual: ', reverse_encode[test_y[k]])
print('Probability: ', max(file_model.predict_proba(xtest_count[k])[0]))

Accuracy: 0.9540441176470589
Prediction:  technology
Actual:  technology
Probability:  0.9939616411216464
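
The `reverse_encode` list is alphabetical because `LabelEncoder` assigns integer codes in the sorted order of the class names, so position `i` in the list is the topic with code `i`. The plain-Python sketch below imitates that mapping on a handful of labels; the helper names and the sample labels are hypothetical.

```python
def fit_labels(labels):
    """Return the sorted distinct class names; a name's position is its integer code."""
    return sorted(set(labels))

def encode(labels, classes):
    """Replace each class name with its integer code."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels]

def decode(codes, classes):
    """Replace each integer code with its class name."""
    return [classes[c] for c in codes]

topics = ['science', 'business', 'technology', 'business', 'health']
classes = fit_labels(topics)
codes = encode(topics, classes)
```

Decoding the codes with the same class list recovers the original topic names, which is what indexing into `reverse_encode` does for the classifier's predictions.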


In [0]:
# confusion matrix
# note: with the arguments in this order, rows are predicted labels and columns are true labels
from sklearn.metrics import confusion_matrix
confusion_matrix(predictions, test_y) 

array([[42,  0,  0,  0,  0,  0,  0,  0,  0,  3],
       [ 1, 71,  0,  0,  1,  0,  0,  0,  0,  0],
       [ 1,  0, 60,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0, 49,  0,  0,  0,  1,  0,  0],
       [ 1,  0,  0,  1, 51,  1,  1,  0,  1,  1],
       [ 0,  0,  0,  0,  0, 38,  1,  0,  0,  0],
       [ 0,  0,  0,  0,  3,  0, 45,  0,  0,  1],
       [ 0,  0,  0,  0,  0,  0,  0, 73,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0, 42,  0],
       [ 0,  0,  0,  1,  0,  0,  0,  0,  0, 54]])

In [0]:
# Testing prediction
free_x = 'I am a politician and I love talking about economics and crime!. President Obama is a man who once wanted to be a writer.'
free_x = clean_text(free_x)
free_vect = count_vect.transform([free_x])
reverse_encode[file_model.predict(free_vect)[0]]

'international'

In [0]:
# load master data frame 
masterDF = pd.read_csv('/content/master_df.csv')

# clean essay element
essay = masterDF['essay']
essay = [clean_text(elem) for elem in essay]
masterDF['essay'] = essay

# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(masterDF['essay'])

# predicts class of string or file
def predict_class(free_x):
  
  #clean text
  free_x = clean_text(free_x)
  
  free_vect = count_vect.transform([free_x])

  prediction = reverse_encode[file_model.predict(free_vect)[0]]

  return prediction, max(file_model.predict_proba(free_vect)[0])

def predict_class_file(filename):
  
  # read the file, dropping newlines, then classify its contents
  with open(filename, 'r') as file:
      free_x = file.read().replace('\n', '')  
  
  return predict_class(free_x)

In [0]:
## Testing Prediction on sample essay

## text_string
with open('/content/sample_essay.txt', 'r') as file:
    text_file = file.read().replace('\n', '')

predict_class(text_file)    

### Functions for extractive summary from text

In [0]:
def sentence_similarity(sent1, sent2, stopwords=None):
    if stopwords is None:
        stopwords = []
 
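    # note: each sentence is iterated element by element, so plain strings
    # (as returned by sent_tokenize) are compared character by character;
    # pass lists of words to compare at the word level
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]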
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)
 
def build_similarity_matrix(sentences, stop_words):
    # Create an empty similarity matrix
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
 
    for idx1 in range(len(sentences)):
        for idx2 in range(len(sentences)):
            if idx1 == idx2: #ignore if both are same sentences
                continue 
            similarity_matrix[idx1][idx2] = sentence_similarity(sentences[idx1], sentences[idx2], stop_words)

    return similarity_matrix


def generate_summary(file_name, top_n):
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Read text and split it into sentences
    # (raw_sents is not defined in this notebook; it is expected to return
    # the list of sentences in file_name)
    sentences =  raw_sents(file_name)

    # Step 2 - Generate similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in similarity matrix
    # (from_numpy_array replaces from_numpy_matrix, which newer networkx releases no longer provide)
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort the rank and pick top sentences
    ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)    

    # never ask for more sentences than the text contains
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append("".join(ranked_sentence[i][1]))

    # Step 5 - Output the summarized text
    return " ".join(summarize_text)
  
def generate_summary_text(text, top_n):
    stop_words = stopwords.words('english')
    summarize_text = []

    # Step 1 - Split the text into sentences
    sentences =  sent_tokenize(text)

    # Step 2 - Generate similarity matrix across sentences
    sentence_similarity_matrix = build_similarity_matrix(sentences, stop_words)

    # Step 3 - Rank sentences in similarity matrix
    # (from_numpy_array replaces from_numpy_matrix, which newer networkx releases no longer provide)
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)

    # Step 4 - Sort the rank and pick top sentences
    ranked_sentence = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)    

    # never ask for more sentences than the text contains
    for i in range(min(top_n, len(ranked_sentence))):
        summarize_text.append("".join(ranked_sentence[i][1]))

    # Step 5 - Output the summarized text
    return " ".join(summarize_text)
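
The summary functions above follow a TextRank-style recipe: score every pair of sentences by cosine similarity, treat the scores as a weighted graph, and rank sentences with PageRank. The sketch below is a self-contained, word-level variant of that recipe in plain Python. It is an illustration under stated assumptions rather than the notebook's code: it tokenizes with a regular expression instead of NLTK, uses a hand-written power iteration instead of `nx.pagerank`, and every function name in it is hypothetical.

```python
import math
import re
from collections import Counter

def word_counts(sentence, stop_words=frozenset()):
    """Count the words of a sentence, skipping stop words."""
    words = re.findall(r"\w+", sentence.lower())
    return Counter(w for w in words if w not in stop_words)

def cosine_similarity(c1, c2):
    """Cosine similarity between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0

def pagerank(weights, damping=0.85, iterations=50):
    """Power iteration on a weighted similarity graph."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            rank = 0.0
            for i in range(n):
                if out_sums[i] > 0:
                    rank += scores[i] * weights[i][j] / out_sums[i]
                else:
                    rank += scores[i] / n  # sentence with no neighbours: spread evenly
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores

def summarize(sentences, top_n, stop_words=frozenset()):
    """Return the top_n highest-ranked sentences."""
    counts = [word_counts(s, stop_words) for s in sentences]
    n = len(sentences)
    weights = [[0.0 if i == j else cosine_similarity(counts[i], counts[j])
                for j in range(n)] for i in range(n)]
    scores = pagerank(weights)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_n]]

sentences = ["The cat sat on the mat.",
             "The cat lay on the mat.",
             "Stock markets fell sharply today."]
summary = summarize(sentences, 2)
```

In this toy input the third sentence shares no words with the others, so it receives the lowest score and is left out of the two-sentence summary.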

### Classify, Extract, and Generate

In [0]:
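# classify the input text, extract a two-sentence summary, and generate new text
# from the summary with the model fine-tuned on the predicted class
# (uses remove_end_punct, which is defined in a later cell; run that cell first)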

def class_extract_gen_str(input_text,  T):
  
  # predicts class of input text
  old_class, old_prob = predict_class(input_text)
  
  # text_length
  length = min([950, len(word_tokenize(input_text))])
  
  # summary string
  summ_string = generate_summary_text(text = input_text, top_n=2)
  
  # return new text
  new_string =  simple_gen(input_string = '\n\n\n\n'+summ_string, 
              lens = length, 
              temp = T, 
              model_choice = model_dict[old_class])  
  
  # predicts probability of class of new string
  new_class, new_prob = predict_class(new_string)
  
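  # computes cosine similarity between the input text and the generated text
  vect = TfidfVectorizer(min_df=1)
  tfidf = vect.fit_transform([input_text,new_string])
  
  # .toarray() gives the dense 2x2 similarity matrix (the .A shorthand is not available in newer SciPy releases)
  return remove_end_punct(new_string), [old_class, old_prob], [new_class, new_prob], (tfidf * tfidf.T).toarray(), summ_string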

In [0]:
# file-based version of class_extract_gen_str: reads the input text from filename

def class_extract_gen(filename,  T):
  
  with open(filename, 'r') as file:
      input_text = file.read()
      
  return class_extract_gen_str(input_text, T)
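
The similarity score returned above is the cosine similarity between the TF-IDF vectors of the input text and the generated text. The sketch below spells out that computation in plain Python. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which is my understanding of scikit-learn's defaults, so treat it as an approximation of `TfidfVectorizer` rather than a replacement; the function names are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tfidf_vectors(docs):
    """L2-normalized TF-IDF vectors with smoothed idf = ln((1 + n) / (1 + df)) + 1."""
    counts = [Counter(TOKEN_RE.findall(d.lower())) for d in docs]
    vocab = sorted(set().union(*counts))
    n = len(docs)
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for c in counts:
        vec = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm if norm else 0.0 for v in vec])
    return vectors

def similarity_matrix(docs):
    """Pairwise cosine similarities; vectors are unit length, so a dot product suffices."""
    vecs = tfidf_vectors(docs)
    return [[sum(a * b for a, b in zip(u, v)) for v in vecs] for u in vecs]

m = similarity_matrix(["the cat sat", "the dog ran"])
```

The diagonal entries are 1 because each text is compared with itself, and the off-diagonal entry is the number reported as the similarity between the two texts.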

In [0]:
start_time = time.time()
# example implementation
result = class_extract_gen('/content/essay_sample_texts/sample_essay.txt', 1.0)
print('Run time:', str(time.time()-start_time)+' secs')

Run time: 45.80405926704407 secs


In [0]:
print('Old class and probability:', result[1][0], result[1][1])
print('    ')
print('New class and probability:', result[2][0], result[2][1])
print('    ')
print('Cosine Similarity Matrix:', result[3])
print('    ')
print('Summary:', textwrap.fill(result[4], 60))
print('(******)')
print('    ')
print(textwrap.fill(result[0], 60))
print('    ')

Old class and probability: ideas 0.5128328716855207
    
New class and probability: ideas 0.9128141176680875
    
Cosine Similarity Matrix: [[1.         0.83843083]
 [0.83843083 1.        ]]
    
Summary: And although my appreciation of her writing tempered as I
grew older, unlike much of the culture which now
categorically vilifies Rand, I still saw a considerable
potency and relevance in what she had written. Most learned
truths about the world are confused and complicated, bearing
Bohr’s hallmark of a deep truth in which even their
seemingly antithetical statements are also somehow true of
the world.
(******)
    
 What a vast world we live in, how many people and things we
know, how many sides of each story we have to encounter. But
we always must admit that we know so little—that even we
never really see the whole story—which is why it is
necessary to be convinced and enamored by her writings and
performances. There is a vivid, tingling, creative life
here, whether we like it or n

In [0]:
# function that trims any trailing text after the last sentence-ending punctuation mark
def remove_end_punct(string):
    # index of the last '.', '?' or '!' (-1 if none is present)
    last = max(string.rfind(mark) for mark in '.?!')

    # no sentence-ending punctuation: return the text unchanged
    if last == -1:
        return string

    return string[:last + 1]
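
Generated text usually stops mid-sentence because the sampler halts after a fixed number of tokens, so the output is cut back to its last complete sentence. The short check below shows the intended behavior on two inputs, using a standalone copy under the hypothetical name `trim_to_last_sentence`.

```python
def trim_to_last_sentence(text):
    """Drop any trailing fragment after the last '.', '?' or '!'."""
    last = max(text.rfind(mark) for mark in '.?!')
    return text if last == -1 else text[:last + 1]

# a fragment after the final period is removed
complete = trim_to_last_sentence("Atoms are stable. Our food would rem")

# text without sentence-ending punctuation is returned as-is
unchanged = trim_to_last_sentence("no punctuation here")
```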

In [0]:
# insert string
with open('/content/sample_essay.txt', 'r') as file:
    input_text = file.read().replace('\n', '')


### Classify and Generate

In [0]:
# classify the input text and generate a continuation of it
# with the model fine-tuned on the predicted class

def class_and_gen_str(input_text, length,  T):
  
  # predicts class of input text
  old_class, old_prob = predict_class(input_text)
  
  # return new text
  new_string =  simple_gen(input_string = '\n\n\n\n'+input_text, 
              lens = length, 
              temp = T, 
              model_choice = model_dict[old_class])  
  
  # predicts probability of class of new string
  new_class, new_prob = predict_class(new_string)
  
  return remove_end_punct(new_string), [old_class, old_prob], [new_class, new_prob]

In [0]:
start_time = time.time()
# example implementation
result = class_and_gen_str('Can you write an essay about physics?', 500, 1.0)
print('Run time:', str(time.time()-start_time)+' secs')

Run time: 22.83637046813965 secs


In [0]:
print('Old class and probability:', result[1][0], result[1][1])
print('    ')
print('New class and probability:', result[2][0], result[2][1])
print('    ')
print(textwrap.fill(result[0], 60))
print('    ')

Old class and probability: science 0.1295174474194357
    
New class and probability: science 0.48487098061126166
    
 Not exactly. That's because physics was, at least from the
1930s to the 1950s, a lab exercise. It wasn't technically
possible to make useful articles about physics. The Field
Museum’s Modern Physics Lab wasn’t set up to produce
important scientific papers. Instead, it was intended to
test new ideas. “It’s hardly surprising,” Marcello Cahn, a
theoretical physicist at ETH Zurich, told me, that his own
experiments ended up “undecidable.”  In the 1960s, a new
wave of physicists, including those from the University of
Zurich, asked the same question. They decided to combine an
experiment that was easy and dangerous with something that
could be much harder: solving real-world problems. Their aim
was to try and produce modern-day articles on a single
issue—the existence of dark energy.  Cahn said that he used
to regularly argue with physics teachers about the current
state o