#Assemble dataset of George Orwell's prose

This step prepares a dataset for fine-tuning Llama2-7B to transfer the style of George Orwell's prose onto plain, neutral-sounding narrative.

To achieve this, we needed a set of pairs in which the first element is a neutral sentence and the second is the same sentence in Orwell's style. We took sentences from his prose, neutralized them to strip out literary devices and evocative tone, and then paired each neutralized sentence with its original.
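The pairing described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual pipeline: `build_pairs`, the `neutral`/`orwell` keys and the stand-in `fake_neutralize` are hypothetical names introduced here.

```python
import json

def build_pairs(originals, neutralize_fn):
    """Pair each original sentence with its neutralized version."""
    pairs = []
    for original in originals:
        neutral = neutralize_fn(original).strip()
        if not neutral:
            continue  # skip sentences for which neutralization returned nothing
        pairs.append({"neutral": neutral, "orwell": original})
    return pairs

def fake_neutralize(sentence):
    # Hypothetical stand-in for the LLM call; it only lowercases the text
    return sentence.lower()

pairs = build_pairs(["It was a BRIGHT cold day in April."], fake_neutralize)
print(json.dumps(pairs[0]))
```

In a real run, `neutralize_fn` would wrap whichever model does the neutralization.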

To neutralize the sentences, we experimented with [Meta's Llama2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf), [Flan-T5-large](https://huggingface.co/google/flan-t5-large) and [OpenAI's GPT-3.5-turbo](https://platform.openai.com/docs/models/gpt-3-5), trying a range of prompts and in-context learning. The code keeps only the final selected prompt.

- The Flan-based model did poorly at neutralizing the sentences.
- Both Llama2 and GPT-3.5 neutralized the tone of the sentences well. In the end, GPT-3.5 was used.
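The notebook does not include the GPT-3.5 call itself, so the following is only a sketch of what one neutralization request might look like with the `openai` Python client (v1 or later). The prompt wording is borrowed from the Llama2 prompt used later in this notebook; `build_messages`, `neutralize_with_gpt35` and `temperature=0` are assumptions, not the settings actually used.

```python
NEUTRALIZE_PROMPT = (
    "Make the following text in quotes appear neutral. Remove any exaggerations "
    "and language devices, keep vocabulary simple. Stick only to facts. "
    "Have a very neutral output response."
    '\n\nText: "{text}"\n\nAnswer: '
)

def build_messages(text):
    """Build the chat messages for one neutralization request."""
    return [{"role": "user", "content": NEUTRALIZE_PROMPT.format(text=text)}]

def neutralize_with_gpt35(text, client=None):
    """Send one sentence to gpt-3.5-turbo and return the reply text."""
    if client is None:
        # Imported lazily so the helpers above work without the package installed
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=build_messages(text),
        temperature=0,  # assumption: deterministic output suits dataset building
    )
    return response.choices[0].message.content.strip()
```

Calling `neutralize_with_gpt35("It was a bright cold day in April.")` needs a valid API key and network access.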



In [5]:
!pip install -qU transformers accelerate


In [None]:
import os
os.environ['HF_API_KEY'] = ''


##1. Transform dataset

From Project Gutenberg we obtained the [text of George Orwell's novel 1984](http://gutenberg.net.au/ebooks01/0100021.txt), which consists of about 4k sentences. We held out 5% of it for evaluation and concatenated shorter sentences, arriving at a training set of 3376 utterances and an eval set of 177 samples.
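The 5% hold-out can be sketched as below. The notebook does not show how the split was made, so the random shuffle, the `seed` value and the name `split_train_eval` are assumptions. The sizes do line up with the reported numbers: 3376 + 177 = 3553 items.

```python
import random

def split_train_eval(items, eval_fraction=0.05, seed=42):
    """Shuffle a copy of `items` and hold out `eval_fraction` of it for evaluation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

train, eval_set = split_train_eval(range(3553))
print(len(train), len(eval_set))  # 3376 177
```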


In [None]:
!pip install nltk



In [None]:
import csv
import nltk
import pickle
import re

# download the Punkt tokenizer models for sentence splitting
nltk.download('punkt')

def txt_to_files(txt_filepath, filepath):
    with open(txt_filepath, 'r', encoding='utf-8') as txt_file:
        text = txt_file.read()

    cleaned_text = re.sub(r'(?<!\.)\n', ' ', text)
    # Use nltk to split the text into sentences
    sentences = nltk.sent_tokenize(cleaned_text)
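    # str.replace returns a new string, so rebuild the list to actually drop stray newlines
    # (a space is used so that words on either side of the newline are not glued together)
    sentences = [s.replace('\n', ' ') for s in sentences]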

    # Write sentences to CSV
    with open(f'{filepath}.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)

        # Write each sentence as a new row
        for sentence in sentences:
            writer.writerow([sentence])

    # Save sentences to pickle file
    with open(f'{filepath}.pkl', 'wb') as pickle_file:
        pickle.dump(sentences, pickle_file)

    # Return the list so that later cells can use it directly
    return sentences

# Usage
sentences = txt_to_files('/content/1984_text2.txt', '/content/drive/MyDrive/ML/data/1984_sentences2')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
def process_sentences(sentences):
    '''Merge each short sentence (10 words or fewer) with the one that follows it, to make fewer inference calls'''
    MIN_SENTENCE_LENGTH = 10

    new_sentences = []
    i = 0
    while i < len(sentences):
        sentence = sentences[i]

        # Keep long sentences (and the very last one) as they are
        if len(sentence.split()) > MIN_SENTENCE_LENGTH or i == len(sentences) - 1:
            new_sentences.append(sentence)
        else:
            # Short sentence: join it with the next one and skip that one
            new_sentences.append(sentence + ' ' + sentences[i+1])
            i += 1

        i += 1

    return new_sentences

processed_sentences = process_sentences(sentences)
len(sentences), len(processed_sentences), processed_sentences[11]
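The merging cell above joins a short sentence with only one neighbour, so two very short sentences in a row can still produce a short chunk. A possible variant, not used in this notebook, keeps accumulating sentences until the chunk passes the threshold. The name `merge_until_long` and the demo sentences are made up for illustration.

```python
def merge_until_long(sentences, min_words=10):
    """Join consecutive sentences until each chunk has more than `min_words` words."""
    merged, buffer = [], []
    for sentence in sentences:
        buffer.append(sentence)
        if sum(len(s.split()) for s in buffer) > min_words:
            merged.append(' '.join(buffer))
            buffer = []
    if buffer:  # keep any short leftover at the end
        merged.append(' '.join(buffer))
    return merged

demo = [
    'He stopped.',
    'Nobody spoke.',
    'The clock on the wall struck thirteen and then fell silent again.',
    'It was over.',
]
print(merge_until_long(demo))
```

Here the first three sentences end up in one chunk and the last short sentence is kept on its own.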


In [None]:
with open(f'/content/drive/MyDrive/ML/data/1984_sentences.pkl', 'wb') as pickle_file:
    pickle.dump(processed_sentences, pickle_file)

In [None]:
processed_sentences[1004]

'The ideal set up by the Party was something huge, terrible, and glittering--a world of steel and concrete, of monstrous machines and terrifying weapons--a nation of warriors and fanatics, marching forward in perfect unity, all thinking the same thoughts and shouting the same slogans, perpetually working, fighting, triumphing, persecuting--three hundred million people all with the same face.'

##2. Neutralization of sentences

In [6]:
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
%%time
import torch
import transformers
from torch import cuda, bfloat16
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer
import os

def load_model(path = 'meta-llama/Llama-2-7b-chat-hf'):

    if path == 'meta-llama/Llama-2-7b-chat-hf':
        model = AutoModelForCausalLM.from_pretrained(
            path,
            torch_dtype=torch.float16,
            device_map='auto'
        )
    else:
        # Use the `path` argument rather than the global `model_name_flan`
        model = AutoModelForSeq2SeqLM.from_pretrained(
            path,
            torch_dtype=torch.bfloat16,
            device_map='auto'
        )

    tokenizer = AutoTokenizer.from_pretrained(path)
    return model, tokenizer

model, tokenizer = load_model('meta-llama/Llama-2-7b-chat-hf')

neutralize = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=False,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=256,  # number of tokens to generate in the output
    repetition_penalty=1.1
)

CPU times: user 2.4 s, sys: 373 ms, total: 2.77 s
Wall time: 5.51 s


Load Flan-T5-large for comparison:

In [None]:
model_name_flan='google/flan-t5-large'

# Assign only the Flan names so that the Llama2 `model` and `tokenizer` are not overwritten
model_flan, tokenizer_flan = load_model(model_name_flan)


In [None]:
neutralize_flan = transformers.pipeline(
    model=model_flan, tokenizer=tokenizer_flan,
    task='text2text-generation',
    temperature=0.1,
    max_new_tokens=256,
    repetition_penalty=1.1
)

In [None]:
from functools import partial

utterences = ['''It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his
breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not
quickly enough to prevent a swirl of gritty dust from entering along with him.''',
              '''Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.''',
              '''He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the party. His hair was very fair, his face naturally sanguine, his skin roughened by
coarse soap and blunt razor blades and the cold of the winter that had just ended.''',
              '''Down at street level another poster, torn at one corner, flapped fitfully in the wind, alternately covering and uncovering the single word INGSOC. In the far distance a helicopter skimmed down
between the roofs, hovered for an instant like a bluebottle, and darted away again with a curving flight.''',
              '''It was safer, though, as he well knew, even a back can be revealing. A kilometre away the Ministry of Truth, his place of work, towered vast and white above the grimy landscape. This,
he thought with a sort of vague distaste—this was London, chief city of Airstrip One, itself the third most populous of the provinces of Oceania.''']


prompt = '''Make the following text in quotes appear neutral. Remove any exaggerations and language devices, keep vocabulary simple. Stick only to facts. Have a very neutral output response.

Text: "{text}"

{format_instructions}

Answer: '''

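# NOTE: `format_instructions` is created in the LangChain cell further down; run that cell before this line.
p_prompt = partial(prompt.format, format_instructions=format_instructions)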

In [None]:
llama_answers0 = []
for u in utterences:
    print(" --------------- Text:",  u)
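    # The prompt has a {format_instructions} slot; leave it empty for this unstructured run
    answer = neutralize(prompt.format(text = u, format_instructions=''))[0]['generated_text']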
    print("Answer:", answer)
    llama_answers0.append(answer)

 --------------- Text: It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his
breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not
quickly enough to prevent a swirl of gritty dust from entering along with him.
Answer:  "On a cold day in April, the clocks struck 13. Winston Smith entered Victory Mansions through glass doors, but not quickly enough to avoid dust entering with him."
 --------------- Text: Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.
Answer:  Winston headed for the stairs. The elevator was not an option as it was often out of order, and currently the power was off during daytime.
 --------------- Text: He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overal

We use LangChain's parser to format the output so that it returns a dict with a specific key. Without it, Llama2 would sometimes produce more output than needed (for instance, an additional intro sentence).

LMs often return plain strings instead of JSON even when instructed otherwise; LangChain's built-in format instructions work well for getting structured output.
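To make clear what the parsing step has to do, here is a small standard-library stand-in that pulls the `transformed_text` value out of a fenced JSON reply. It is not LangChain's implementation; `parse_transformed_text`, `FENCE_RE` and the sample reply are hypothetical and only illustrate the idea.

```python
import json
import re

# Matches a Markdown code fence (optionally tagged "json") and captures its body
FENCE_RE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)

def parse_transformed_text(reply, key="transformed_text"):
    """Return reply[key] from a fenced JSON snippet, or None if parsing fails."""
    match = FENCE_RE.search(reply)
    candidate = match.group(1) if match else reply
    try:
        return json.loads(candidate)[key]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None

fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
reply = f'Sure, here it is:\n{fence}json\n{{"transformed_text": "Winston took the stairs."}}\n{fence}'
print(parse_transformed_text(reply))           # Winston took the stairs.
print(parse_transformed_text("no json here"))  # None
```

Returning `None` on failure makes it easy to count or retry the replies that did not follow the requested format.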

In [None]:
from langchain.output_parsers import StructuredOutputParser
from langchain.output_parsers import ResponseSchema

r_schema = ResponseSchema(name="transformed_text",description="Text after transformation")
response_schemas = [r_schema]
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
format_instructions = output_parser.get_format_instructions()
format_instructions


'The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":\n\n```json\n{\n\t"transformed_text": string  // Text after transformation\n}\n```'

In [None]:
llama_answers0 = []
for u in utterences:
    answer = neutralize(p_prompt(text = u))[0]['generated_text']
    answer = output_parser.parse(answer)
    llama_answers0.append(answer)

In [None]:
flan_answers = []
print('FLAN-T5-LARGE:')
for u in utterences:
    print(" --------------- Text:",  u)
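    # The prompt has a {format_instructions} slot; leave it empty for this unstructured run
    answer = neutralize_flan(prompt.format(text = u, format_instructions=''))[0]['generated_text']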
    print("Answer:", answer)
    flan_answers.append(answer)

Both `max_new_tokens` (=512) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


FLAN-T5-LARGE:
 --------------- Text: It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his
breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not
quickly enough to prevent a swirl of gritty dust from entering along with him.


Answer: It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him."
 --------------- Text: Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.


Answer: Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.
 --------------- Text: He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the party. His hair was very fair, his face naturally sanguine, his skin roughened by
coarse soap and blunt razor blades and the cold of the winter that had just ended.


Answer: He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the party. His hair was very fair, his face naturally sanguine, his skin roughened by coarse soap and blunt razor blades and the cold of the winter that had just ended.
 --------------- Text: Down at street level another poster, torn at one corner, flapped fitfully in the wind, alternately covering and uncovering the single word INGSOC. In the far distance a helicopter skimmed down
between the roofs, hovered for an instant like a bluebottle, and darted away again with a curving flight.


Answer: Down at street level another poster, torn at one corner, flapped fitfully in the wind, alternately covering and uncovering the single word INGSOC. In the far distance a helicopter skimmed down between the roofs, hovered for an instant like a bluebottle, and darted away again with a curving flight."
 --------------- Text: It was safer, though, as he well knew, even a back can be revealing. A kilometre away the Ministry of Truth, his place of work, towered vast and white above the grimy landscape. This,
he thought with a sort of vague distaste—this was London, chief city of Airstrip One, itself the third most populous of the provinces of Oceania.
Answer: A kilometre away the Ministry of Truth, his place of work, towered vast and white above the grimy landscape. This, he thought with a sort of vague distaste—this was London, chief city of Airstrip One, itself the third most populous of the provinces of Oceania."


Flan mostly repeats its input. The pretrained Llama2 did a much better job.
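The "mostly repeats its input" observation can be checked numerically. The sketch below uses `difflib.SequenceMatcher` on word lists; `copy_ratio` is a hypothetical helper, and this measurement was not part of the original experiment.

```python
from difflib import SequenceMatcher

def copy_ratio(source, output):
    """Word-level similarity in [0, 1]; 1.0 means the output copies the input."""
    # Splitting on whitespace ignores differences in line breaks
    return SequenceMatcher(None, source.split(), output.split()).ratio()

source = "Winston made for the stairs. It was no use trying the lift."
llama_like = "Winston headed for the stairs. The elevator was not an option."

print(copy_ratio(source, source))      # 1.0 for an exact copy
print(copy_ratio(source, llama_like))  # lower for a genuine rewrite
```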

##3. "Orwellizing" the neutral text

Now let's see how the pretrained Llama2-7B rewrites neutral-sounding sentences in the style of Orwell's prose.

We try "Orwellizing" the first few sentences of 1984:

In [9]:
utterances = ['''It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his
breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not
quickly enough to prevent a swirl of gritty dust from entering along with him.''',
              '''Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.''',
              '''He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the party. His hair was very fair, his face naturally sanguine, his skin roughened by
coarse soap and blunt razor blades and the cold of the winter that had just ended.''',
              '''Down at street level another poster, torn at one corner, flapped fitfully in the wind, alternately covering and uncovering the single word INGSOC. In the far distance a helicopter skimmed down
between the roofs, hovered for an instant like a bluebottle, and darted away again with a curving flight.''',
              '''It was safer, though, as he well knew, even a back can be revealing. A kilometre away the Ministry of Truth, his place of work, towered vast and white above the grimy landscape. This,
he thought with a sort of vague distaste—this was London, chief city of Airstrip One, itself the third most populous of the provinces of Oceania.''',
              '''He had little recollection of his sister, only as an infant who was weak and rarely made any sound, but had observant eyes.''',
              '''He easily dealt with the false belief, and he was not at risk of being influenced by it.''',
              '''The boy had a determined look in his eye and seemed to want to hit or kick Winston. He seemed to be aware that he was almost big enough to do so.''']

# Zero-shot: the prompt contains no worked example
prompt_zeroshot = '''Rewrite the following text in the style of George Orwell prose. Make sure to convey the meaning of the utterance.

Text: "{text}"

Answer: '''



Let's try with some zero-shot prompts (a few were tried):

In [8]:
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


model, tokenizer = load_model()

orwellize = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=False,
    task='text-generation',
    temperature=0.4,
    max_new_tokens=256,
    repetition_penalty=1.1
)


In [None]:
llama_orwell = []
for u in utterances:
    answer = orwellize(prompt_zeroshot.format(text = u))[0]['generated_text']
    llama_orwell.append(answer)

In [2]:
origUtterences = ['''It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.''',
              '''Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.''',
              '''The shop had been reduced to a shattered shell of its former self, the windows blown out, the walls pockmarked with holes.''',
              '''He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the party. His hair was very fair, his face naturally sanguine, his skin roughened by coarse soap and blunt razor blades and the cold of the winter that had just ended.''',
              '''Down at street level another poster, torn at one corner, flapped fitfully in the wind, alternately covering and uncovering the single word INGSOC. In the far distance a helicopter skimmed down between the roofs, hovered for an instant like a bluebottle, and darted away again with a curving flight.''',
              '''It was safer, though, as he well knew, even a back can be revealing. A kilometre away the Ministry of Truth, his place of work, towered vast and white above the grimy landscape. This, he thought with a sort of vague distaste—this was London, chief city of Airstrip One, itself the third most populous of the provinces of Oceania.''']

llama2pretrained0shot = llama_orwell

import pandas as pd
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame({ 'Neutral sentence': utterances, 'Llama rewrite (0shot)': llama2pretrained0shot, 'Orwell\'s original': origUtterences })
df

Unnamed: 0,Neutral sentence,Llama rewrite (0shot),Orwell's original
0,"On a cold April day, the clocks struck thirteen. Winston Smith entered Victory Mansions through glass doors, but not quickly enough to avoid dust entering with him.","On a chill April afternoon, the clocks struck an unsettling hour. Winston Smith negotiated his way into Victory Mansions via a pair of gleaming glass doors, yet he was unable to prevent a sprinkling of dust from accompanying him within.","It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him."
1,"Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.","Winston trudged upwards, his footsteps echoing in the oppressive silence of the dimly lit building. The lift, a relic of a bygone era, had been shut down by the powers that be as part of their relentless campaign to reduce the population's mobility and freedom of movement. As he climbed the stairs, Winston couldn't help but feel a sense of resignation wash over him, a reminder of the crushing weight of the totalitarian regime that governed their every move.","Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours."
2,"The shop had been damaged, with broken windows and pockmarked walls.","he shop lay in disarray, its once-pristine façade now marred by shattered glass and pitted stonework. The damage was extensive, a testament to the destructive power of whatever force had ravaged it. The very structure of the building seemed to be crying out in pain, as if the very walls were bearing witness to some great tragedy.","The shop had been reduced to a shattered shell of its former self, the windows blown out, the walls pockmarked with holes."
3,"He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the party. His hair was very fair, his face naturally sanguine, his skin roughened by coarse soap and blunt razor blades and the cold of the winter that had just ended.","He strolled across the room, his tall, lanky figure silhouetted against the dim light. His face was a map of weariness, the fine lines etched into his skin a testament to the hardships he had endured. His cheeks were hollowed out, as if sucked dry by some unseen force, while his eyes seemed to hold a deep-seated sadness, as if they had seen things that no one should ever have to witness. Despite his pallor, there was something almost ethereal about him, as if he belonged to another world altogether.","He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the party. His hair was very fair, his face naturally sanguine, his skin roughened by coarse soap and blunt razor blades and the cold of the winter that had just ended."
4,"Down at street level another poster, torn at one corner, flapped fitfully in the wind, alternately covering and uncovering the single word INGSOC. In the far distance a helicopter skimmed down between the roofs, hovered for an instant like a bluebottle, and darted away again with a curving flight.","As I walked along the sidewalk, a tattered poster was buffeted by the gusty breeze, its edges fluttering like a flag in a storm. For an instant, the word INGSOC was exposed, before being obscured once more by the wind's capricious whims. Meanwhile, a sleek black helicopter patrolled the skies above, its rotors churning out a smooth, curving trajectory as it hunted for its next prey.","Down at street level another poster, torn at one corner, flapped fitfully in the wind, alternately covering and uncovering the single word INGSOC. In the far distance a helicopter skimmed down between the roofs, hovered for an instant like a bluebottle, and darted away again with a curving flight."
5,"The Ministry of Truth, his workplace, stood tall and white a kilometre away. This was London, the chief city of Airstrip One and one of the three most populous provinces of Oceania.","The Ministry of Truth loomed a dismal kilometer distant, its monolithic presence casting an oppressive shadow over the desolate terrain. This was London, the capital of Airstrip One, one of the three most populous provinces of the crushing totalitarian state of Oceania.","It was safer, though, as he well knew, even a back can be revealing. A kilometre away the Ministry of Truth, his place of work, towered vast and white above the grimy landscape. This, he thought with a sort of vague distaste—this was London, chief city of Airstrip One, itself the third most populous of the provinces of Oceania."


The generated output is long, generic and convoluted; it describes the setting of the novel 1984 rather than capturing its style.
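One rough way to quantify "long" is the word-count ratio between a rewrite and Orwell's original. The helper `length_ratio` is hypothetical, and the two strings are shortened excerpts from the table above, so the number is illustrative only.

```python
def length_ratio(rewrite, reference):
    """Word count of the rewrite divided by the word count of the reference."""
    return len(rewrite.split()) / max(len(reference.split()), 1)

reference = "Winston made for the stairs. It was no use trying the lift."
rewrite = ("Winston trudged upwards, his footsteps echoing in the oppressive "
           "silence of the dimly lit building.")

# A ratio above 1 means the rewrite is wordier than the original
print(round(length_ratio(rewrite, reference), 2))
```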

Let's try the same with in-context prompting. We tried a few one-shot prompts that constrain the model and show it what sort of answer we expect:

In [None]:
prompt_oneshot = '''Rewrite the following text in the style of George Orwell prose. Make sure to convey the meaning of the utterance.

Text: "It was a pyramidal structure made of white concrete, with a series of terraces, measuring 300 meters in height."

Answer: "It was an enormous pyramidal structure of glittering white concrete, soaring up, terrace after terrace, 300 metres into the air."

Text: "{text}"

Answer: '''

simpleUtterences = ['''It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.''',
              '''Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.''',
              '''He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the party. His hair was very fair, his face naturally sanguine, his skin roughened by coarse soap and blunt razor blades and the cold of the winter that had just ended.''',
              '''Down at street level another poster, torn at one corner, flapped fitfully in the wind, alternately covering and uncovering the single word INGSOC. In the far distance a helicopter skimmed down between the roofs, hovered for an instant like a bluebottle, and darted away again with a curving flight.''',
              '''It was safer, though, as he well knew, even a back can be revealing. The Ministry of Truth, his workplace, stood tall and white a kilometre away. This was London, the chief city of Airstrip One and one of the three most populous provinces of Oceania.''',
              '''He had little recollection of his sister, only as an infant who was weak and rarely made any sound, but had observant eyes.''',
              '''The boy had a determined look in his eye and seemed to want to hit or kick Winston. He seemed to be aware that he was almost big enough to do so.''']

In [None]:
llama_orwell = []
for u in simpleUtterences:
    answer = orwellize(prompt_oneshot.format(text = u))[0]['generated_text']
    llama_orwell.append(answer)
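The one-shot prompt extends naturally to several examples. The sketch below assembles a prompt from any number of (neutral, Orwell) pairs; `build_fewshot_prompt` is a hypothetical helper, and the example pairs are taken from sentences already shown in this notebook. With a single example it produces the same layout as `prompt_oneshot`.

```python
INSTRUCTION = ("Rewrite the following text in the style of George Orwell prose. "
               "Make sure to convey the meaning of the utterance.")

def build_fewshot_prompt(examples, text):
    """Assemble the instruction, (neutral, orwell) example pairs and the final query."""
    parts = [INSTRUCTION]
    for neutral, orwell in examples:
        parts.append(f'Text: "{neutral}"')
        parts.append(f'Answer: "{orwell}"')
    parts.append(f'Text: "{text}"')
    parts.append('Answer: ')
    return '\n\n'.join(parts)

examples = [
    ("It was a pyramidal structure made of white concrete, with a series of terraces, measuring 300 meters in height.",
     "It was an enormous pyramidal structure of glittering white concrete, soaring up, terrace after terrace, 300 metres into the air."),
    ("He had little recollection of his sister, only as an infant who was weak and rarely made any sound, but had observant eyes.",
     "He did not remember his sister at all, except as a tiny, feeble baby, always silent, with large, watchful eyes."),
]
print(build_fewshot_prompt(examples, "Winston headed for the stairs."))
```

More examples lengthen the prompt, so the number of shots has to be balanced against the model's context window.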

In [3]:
origUtterences = ['''It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.''',
              '''Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.''',
              '''He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the party. His hair was very fair, his face naturally sanguine, his skin roughened by coarse soap and blunt razor blades and the cold of the winter that had just ended.''',
              '''Down at street level another poster, torn at one corner, flapped fitfully in the wind, alternately covering and uncovering the single word INGSOC. In the far distance a helicopter skimmed down between the roofs, hovered for an instant like a bluebottle, and darted away again with a curving flight.''',
              '''It was safer, though, as he well knew, even a back can be revealing. A kilometre away the Ministry of Truth, his place of work, towered vast and white above the grimy landscape. This, he thought with a sort of vague distaste—this was London, chief city of Airstrip One, itself the third most populous of the provinces of Oceania.''',
                  '''He did not remember his sister at all, except as a tiny, feeble baby, always silent, with large, watchful eyes.''',
                  ''' There was a sort of calculating ferocity in the boy's eye, a quite evident desire to hit or kick Winston and a consciousness of being very nearly big enough to do so.''']

llama2pretrained1shot = llama_orwell

import pandas as pd
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame({ 'Neutral sentence': simpleUtterences, 'Llama rewrite (1shot)': llama2pretrained1shot, 'Orwell\'s original': origUtterences })
df

Unnamed: 0,Neutral sentence,Llama rewrite (1shot),Orwell's original
0,"On a cold April day, the clocks struck thirteen. Winston Smith entered Victory Mansions through glass doors, but not quickly enough to avoid dust entering with him.","It was a blisteringly cold April day, and the clocks were striking thirteen. Winston Smith, his face tucked into his chest in a feeble attempt to ward off the biting wind, hastily pushed open the grimy glass doors of Victory Mansions, only managing to admit a noxious cloud of dust in the process.","It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him."
1,"Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.","Winston trudged towards the stairs, knowing full well that attempting to utilize the lift would be futile. In this era of limited power supply, the lift was rarely functional during daylight hours, and now, with the electricity cut off, it was next to impossible to summon its mechanical arms.","Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours."
2,"He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the party. His hair was very fair, his face naturally sanguine, his skin roughened by coarse soap and blunt razor blades and the cold of the winter that had just ended.","He crossed over to the window, a diminutive and emaciated figure, his slender frame accentuated by the drab blue garb that served as the standard attire for our collective. His locks were a pale shade of gold, his countenance bearing the unmistakable imprint of good health, despite the harsh conditions he had endured during the recent winter. The coarseness of his skin, roughened by crude ablutions and the dull blade of a rusty razor, bore eloquent testimony to the privations he had suffered.","He moved over to the window: a smallish, frail figure, the meagreness of his body merely emphasized by the blue overalls which were the uniform of the party. His hair was very fair, his face naturally sanguine, his skin roughened by coarse soap and blunt razor blades and the cold of the winter that had just ended."
3,"Down at street level another poster, torn at one corner, flapped fitfully in the wind, alternately covering and uncovering the single word INGSOC. In the far distance a helicopter skimmed down between the roofs, hovered for an instant like a bluebottle, and darted away again with a curving flight.","A poster, tattered at its edge, fluttered aimlessly in the breeze, obscuring and revealing the single word 'INGSOC' as it whipped about. From a great distance, a helicopter buzzed between the rooftops, hovering momentarily like a gadfly before darting off once more on its sinuous path.","Down at street level another poster, torn at one corner, flapped fitfully in the wind, alternately covering and uncovering the single word INGSOC. In the far distance a helicopter skimmed down between the roofs, hovered for an instant like a bluebottle, and darted away again with a curving flight."
4,"It was safer, though, as he well knew, even a back can be revealing. The Ministry of Truth, his workplace, stood tall and white a kilometre away. This was London, the chief city of Airstrip One and one of the three most populous provinces of Oceania.","It was safer, indeed, for he was well aware that even the most seemingly innocuous features of one's surroundings could betray one's whereabouts. Even at a distance of a kilometer, the imposing form of the Ministry of Truth, his place of employment, loomed large and white above the squalid landscape. London, the capital city of Airstrip One, which boasted the third largest population of the provinces of Oceania.","It was safer, though, as he well knew, even a back can be revealing. A kilometre away the Ministry of Truth, his place of work, towered vast and white above the grimy landscape. This, he thought with a sort of vague distaste—this was London, chief city of Airstrip One, itself the third most populous of the provinces of Oceania."
5,"He had little recollection of his sister, only as an infant who was weak and rarely made any sound, but had observant eyes.","He could barely recall his sister, a tiny infant who was feeble and produced little noise, yet her eyes were always watchful.","He did not remember his sister at all, except as a tiny, feeble baby, always silent, with large, watchful eyes."
6,The boy had a determined look in his eye and seemed to want to hit or kick Winston. He seemed to be aware that he was almost big enough to do so.,"The lad had a fierce glint in his eye, as if he yearned to inflict bodily harm upon Winston. It was clear that he was cognizant of his proximity to Winston, and eager to exploit it.","There was a sort of calculating ferocity in the boy's eye, a quite evident desire to hit or kick Winston and a consciousness of being very nearly big enough to do so."


The sentence structure stays closer to the original than with 0-shot prompting, but it is clear that even with guidance Llama2 produces convoluted and generic elaborations on the original content.

### Query GPT-3.5-turbo

To create a set of neutral utterances out of Orwell's sentences, we tried several prompts and used the best one to prepare the training and eval datasets.

Note: we could have used function calls, but we already had the instruction formatting from LangChain, so we used that instead.

In [None]:
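# The code below uses the legacy `openai.ChatCompletion` interface, which was removed in openai 1.0.
!pip install "openai<1"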

In [None]:
import openai
import os

openai.api_key = os.environ['OPENAI_API_KEY']

In [None]:
from functools import partial

prompt = '''Make the following text in quotes appear neutral. Convey the same meaning but with simpler vocabulary and without any language devices. Have a very neutral output response.

Text: "{text}"

{format_instructions}

Answer: '''

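# Pre-fill the LangChain format instructions; only `text` remains to be supplied per call.
p_prompt = partial(prompt.format, format_instructions=format_instructions)

def query_openai(query):
    # Legacy ChatCompletion interface (openai < 1.0).
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        max_tokens=256,
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": query}
        ]
    )

    reason = res['choices'][0]['finish_reason']

    # Accept normal completions and ones truncated at max_tokens.
    if reason in ('stop', 'length'):
        return res['choices'][0]['message']['content']
    # Any other finish reason: signal that no usable text was returned.
    return 0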



import pickle
def save_results(transformed, misfits, file='1984_transformed.pkl'):
    with open(f'/content/drive/MyDrive/ML/data/{file}', 'wb') as f:
        print('Saving results...')
        pickle.dump({'transformed': transformed,
                     'misfits': misfits
                }, f)

In [None]:
chatgpt_answers = []
misfits = []

In [None]:
import time
start = 0
for i, u in enumerate(processed_sentences[start:], start):
    try:
        # Checkpoint every 10 sentences so an interruption loses little work.
        if i % 10 == 0:
            save_results(chatgpt_answers, misfits, file='1984_transformed.pkl')
        print(f" -----\n Text {i}:", u)
        answer = query_openai(p_prompt(text=u))
        answer = output_parser.parse(answer)
        print("Answer:", answer['transformed_text'])
        chatgpt_answers.append((u, answer['transformed_text']))
    except Exception as e:  # a bare `except:` would also swallow KeyboardInterrupt
        print(f'Failed on text {i}: {e!r}')
        save_results(chatgpt_answers, misfits, file='1984_transformed.pkl')
        print('Save and sleep...')
        time.sleep(20)
        print(f'Adding text {i} to misfits ({len(misfits)})')
        misfits.append(u)

save_results(chatgpt_answers,  misfits, file='1984_transformed.pkl')
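
The loop above sleeps once and then records the sentence as a misfit. A possible refinement, not used in this notebook, is to retry a failed call a few times with exponential backoff before giving up. The sketch below is a minimal version of that idea; `call_with_retries` and its parameters are hypothetical names introduced here for illustration.

```python
import time

def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying on any Exception with exponential backoff.

    Returns (result, None) on success, or (None, last_exception) once
    max_attempts calls have failed. `sleep` is injectable for testing.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs), None
        except Exception as err:
            last_error = err
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts,
            # but do not wait after the final failure.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None, last_error
```

In the loop, one might call `call_with_retries(query_openai, p_prompt(text=u))` and append `u` to `misfits` only when the returned error is not `None`.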

In [None]:
chatgpt_answers[3181]

('The circle of the mask was large enough now to shut out the vision of anything else.',
 'The mask covered everything.')

Note: not all rewrites produced by gpt-3.5-turbo are ideal or even correct. Some are shorter than they should be, others don't strip away all the literary devices or adjectives. Some even change the meaning of the sentence:

1)
 - Original: *Once when they passed in the corridor she gave him a quick sidelong glance which seemed to pierce right into him and for a moment had filled him with black terror.*
 - Transformed: *Once when they passed in the corridor she gave him a quick sidelong glance.*

2)
 - Original: *That, it was true, was very unlikely. Still, he continued to feel a peculiar uneasiness, which had fear mixed up in it as well as hostility, whenever she was anywhere near him.*
 - Transformed: *It was unlikely that he felt uneasy and fearful when she was near him.*
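
One cheap way to surface rewrites that drop too much content is to compare word counts. The sketch below was not part of the original pipeline; `flag_suspicious_pairs` and the `min_ratio` threshold of 0.5 are assumptions chosen for illustration. It flags pairs whose neutral text is much shorter than the original.

```python
def flag_suspicious_pairs(pairs, min_ratio=0.5):
    """Return indices of (original, neutral) pairs whose neutral text has
    fewer than min_ratio times as many whitespace-separated words as the original."""
    flagged = []
    for i, (original, neutral) in enumerate(pairs):
        n_orig = len(original.split())
        n_neutral = len(neutral.split())
        # An empty original is flagged too, since the ratio is undefined.
        if n_orig == 0 or n_neutral / n_orig < min_ratio:
            flagged.append(i)
    return flagged

# Two pairs taken from this notebook's outputs.
pairs = [
    ('On each landing, opposite the lift-shaft, the poster with the enormous face gazed from the wall.',
     'On each landing, across from the lift-shaft, there was a poster with a large face that looked out from the wall.'),
    ('The circle of the mask was large enough now to shut out the vision of anything else.',
     'The mask covered everything.'),
]
print(flag_suspicious_pairs(pairs))  # only the second pair is flagged
```

A length check of this kind only catches truncation. It cannot detect a rewrite that keeps its length but changes the meaning, as in example 2 above, so flagged and unflagged pairs would still need manual review.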

##2. Prepare datasets for fine-tuning

In [None]:
import pickle

with open('/content/drive/MyDrive/ML/data/1984_transformed.pkl', 'rb') as f:
    data = pickle.load(f)
    all_data = data['transformed']


len(all_data), all_data[10]

(3553,
 ('On each landing, opposite the lift-shaft, the poster with the enormous face gazed from the wall.',
  'On each landing, across from the lift-shaft, there was a poster with a large face that looked out from the wall.'))

Select a random subset for evaluation:

In [None]:
PERCENT = 0.05

import random

random.seed(42)
eval_data = random.sample(all_data, int(len(all_data) * PERCENT))
train_data = [x for x in all_data if x not in eval_data]

len(eval_data), len(train_data), train_data[15], eval_data[15]

(177,
 3376,
 ('The instrument (the telescreen, it was called) could be dimmed, but there was no way of shutting it off completely.',
  'The instrument, known as the telescreen, had a dimming function but could not be completely turned off.'),
 ('Did you bring some of that filthy Victory Coffee? I thought you would.',
  'Did you bring some of that Victory Coffee?'))
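
The split above filters the training set by value (`x not in eval_data`). The counts show that this worked here (177 + 3376 = 3553), but if the data contained duplicate pairs, every copy of a sampled pair would be dropped from the training set, and the membership test scans the eval list for each item. A sketch of an index-based alternative follows; `split_by_index` is a hypothetical helper, and it will not reproduce the exact split shown above because it samples indices rather than items.

```python
import random

def split_by_index(items, eval_fraction=0.05, seed=42):
    """Split items into (train, eval) by sampling positions, so duplicates
    are treated as separate samples and membership checks are O(1)."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    n_eval = int(len(items) * eval_fraction)
    eval_idx = set(rng.sample(range(len(items)), n_eval))
    train = [x for i, x in enumerate(items) if i not in eval_idx]
    evaluation = [x for i, x in enumerate(items) if i in eval_idx]
    return train, evaluation
```

With this variant, `len(train) + len(evaluation)` always equals `len(items)`, regardless of duplicates.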

We are going to store the train and eval samples in JSONL format, which we will feed to the Axolotl framework during fine-tuning.


In [None]:
import json

instruction = '''Rewrite the following text in the style of George Orwell prose. Make sure to convey the meaning of the utterance.'''

# Save as JSONL files. Each pair is (Orwell's original, neutral rewrite):
# the model input is the neutral text and the target output is the original.
with open('/content/drive/MyDrive/ML/data/1984_train.jsonl', 'w') as out:
    for (original, neutral) in train_data:
        ddict = {"instruction": instruction, "input": neutral, "output": original}
        jout = json.dumps(ddict) + '\n'
        out.write(jout)


with open('/content/drive/MyDrive/ML/data/1984_eval.jsonl', 'w') as out:
    for (original, neutral) in eval_data:
        ddict = {"instruction": instruction, "input": neutral, "output": original}
        jout = json.dumps(ddict) + '\n'
        out.write(jout)
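
Before handing the files to fine-tuning, it can help to read them back and confirm that every line parses and carries the three fields written above. The following is a minimal sketch of such a check; `parse_jsonl` and `invalid_records` are hypothetical helpers, and the "non-empty string" rule is an assumption made here, not a documented requirement of the training framework.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def parse_jsonl(lines):
    """Parse an iterable of JSONL lines (e.g. an open file), skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

def invalid_records(records):
    """Return indices of records that do not have exactly the required keys,
    each mapped to a non-empty string."""
    bad = []
    for i, rec in enumerate(records):
        ok = set(rec) == REQUIRED_KEYS and all(
            isinstance(rec[k], str) and rec[k].strip() for k in REQUIRED_KEYS
        )
        if not ok:
            bad.append(i)
    return bad

# Usage with a file on disk (path is whatever was written above):
# with open(path) as f:
#     records = parse_jsonl(f)
# print(len(records), invalid_records(records))
```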


Save train-eval split for exploration.

In [None]:
import pickle

with open('/content/drive/MyDrive/ML/data/1984_train_eval.pkl', 'wb') as f:
    print('Saving results...')
    pickle.dump({'train': train_data,
                'eval': eval_data
            }, f)


Saving results...
