#Into the world of George Orwell's prose style

George Orwell once wrote: *“Good prose is like a window pane.”*

He spent considerable energy [advocating clear, tight writing](https://www.openculture.com/2016/05/george-orwells-six-rules-for-writing-clear-and-tight-prose.html), free of hackneyed phrases and of long processions of pretentious vocabulary that only blur the meaning of the words.

In this exercise we are going to put his advice into practice. With the help of [Meta's Llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-hf) model, we will try to teach a language model to write in the concise yet evocative (and perhaps slightly grim) style of George Orwell.

To achieve this, we are going to take samples of Orwell's prose and fine-tune the Llama2 model on them. Then we will compare the writing style of the pretrained and the fine-tuned models by prompting both to rewrite neutral sample sentences in the style of Orwell. Exciting stuff, so let's begin!


##1. Preparing the data

The training set consists of pairs in which the first element is a neutral sentence and the second is the same sentence in Orwell's style. To build it, we took sentences from his prose, neutralized them (stripping the literary devices and the evocative tone), and then matched each neutral version with its original sentence.

Example:


In [None]:
origUtterances = ['''It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.''',
              '''Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.''',
              '''The shop had been reduced to a shattered shell of its former self, the windows blown out, the walls pockmarked with holes.'''
]
neutralUtterances = ['''On a cold April day, the clocks struck thirteen. Winston Smith entered Victory Mansions through glass doors, but not quickly enough to avoid dust entering with him.''',
              '''Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.''',
                     '''The shop had been damaged, with broken windows and pockmarked walls.''']


import pandas as pd
df = pd.DataFrame({ 'Neutral sentence': neutralUtterances,  'Orwell\'s original': origUtterances })
df

Unnamed: 0,Neutral sentence,Orwell's original
0,"On a cold April day, the clocks struck thirteen. Winston Smith entered Victory Mansions through glass doors, but not quickly enough to avoid dust entering with him.","It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him."
1,"Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours.","Winston made for the stairs. It was no use trying the lift. Even at the best of times it was seldom working, and at present the electric current was cut off during daylight hours."
2,"The shop had been damaged, with broken windows and pockmarked walls.","The shop had been reduced to a shattered shell of its former self, the windows blown out, the walls pockmarked with holes."



Data preparation, along with experiments on "neutralizing" the sample sentences and "orwellizing" them with the pretrained Llama2 model, is described and performed in a complementary notebook in the same folder: Orwellian rewrite - getting data.ipynb
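As a rough illustration of how such pairs could be serialized for training, here is a minimal sketch that turns (neutral, original) pairs into JSONL lines. The actual file format is produced in the complementary notebook and is not shown here, so the field names `instruction`, `input` and `output` (an alpaca-style layout) and the helper name `pairs_to_jsonl_lines` are assumptions made for this example only.

```python
import json


def pairs_to_jsonl_lines(neutral_sentences, original_sentences, instruction):
    """Build one JSON line per (neutral, original) pair.

    The field names below are hypothetical (alpaca-style); adjust them to
    whatever dataset type the training config actually expects.
    """
    if len(neutral_sentences) != len(original_sentences):
        raise ValueError("Both lists must have the same length")
    lines = []
    for neutral, original in zip(neutral_sentences, original_sentences):
        record = {
            "instruction": instruction,  # what the model is asked to do
            "input": neutral,            # the neutralized sentence
            "output": original,          # Orwell's original sentence
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


demo_lines = pairs_to_jsonl_lines(
    ["The shop had been damaged, with broken windows and pockmarked walls."],
    ["The shop had been reduced to a shattered shell of its former self, "
     "the windows blown out, the walls pockmarked with holes."],
    "Rewrite the following text in the style of George Orwell prose.",
)
print(demo_lines[0])
```

Writing the lines to disk is then a matter of joining them with newlines and saving the result to a `.jsonl` file.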

##2. Fine-tuning of Llama2-7B

For fine-tuning we are going to use the data in jsonl format prepared earlier (see the second notebook) and [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl), a great, flexible library.
Axolotl includes a lot of neat tricks that speed up training without sacrificing quality.

To save GPU RAM we are going to use 8-bit training, and to maximize GPU utilization we are going to use sample packing.
The training run options are defined in axolotl-training-config.yaml in the same folder.
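To give an intuition for sample packing, the sketch below greedily places tokenized samples into fixed-length sequences so that less of each sequence is wasted on padding. This is a simplified first-fit illustration, not axolotl's actual implementation; the function name `pack_samples` and the example lengths are made up for the demonstration.

```python
def pack_samples(sample_lengths, max_seq_len):
    """Greedy first-fit packing of samples into sequences of max_seq_len tokens.

    Returns a list of bins, where each bin is a list of sample indices that
    would be concatenated into one training sequence.
    """
    bins = []       # sample indices per packed sequence
    remaining = []  # free token capacity per packed sequence
    for idx, length in enumerate(sample_lengths):
        if length > max_seq_len:
            raise ValueError(f"Sample {idx} is longer than max_seq_len")
        for b, free in enumerate(remaining):
            if length <= free:
                # The sample fits into an existing sequence.
                bins[b].append(idx)
                remaining[b] -= length
                break
        else:
            # No existing sequence has room: start a new one.
            bins.append([idx])
            remaining.append(max_seq_len - length)
    return bins


lengths = [30, 50, 20, 60, 40]  # hypothetical token counts
packed = pack_samples(lengths, max_seq_len=100)
print(packed)
```

Without packing, these five samples would occupy five padded sequences; packed, they fit into two.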


In [None]:
!pip install -qU transformers accelerate  einops   xformers   bitsandbytes

In [None]:
%%capture
%pip install peft==0.5.0 python-dotenv==2.0.0

!git clone https://github.com/OpenAccess-AI-Collective/axolotl
%pip install -e "./axolotl[flash-attn]"



In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
!accelerate launch /content/axolotl/scripts/finetune.py /content/axolotl-training-config.yaml

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
                              dP            dP   dP 
                              88            88   88 
   .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88 
   88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88 
   88.  .88  .d88b.  88.  .88 88 88.  .88   88   88 
   `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP 
                                                    
                                                    

[2023-09-15 08:00:01,875] [INFO] [axolotl.normalize_config:87] [PID:22416] [RANK:0] GPU memory usage baseline: 0.000GB (+0.439GB misc)
[2023-09-15 08:00:01,876] [INFO] [axolotl.common.cli.load_model_and_tokenizer:38] [PID:22416] [RANK:0] loading tokenizer... meta-llama/Llama-2-7b-chat-hf

Merge the LoRA adapter into the Llama model, using the helper functions defined in helpers.py.

In [None]:
import transformers
import peft
from helpers import merge_model_lora, merge_model_lora_from_config

final_model_dir = merge_model_lora_from_config(
    "/content/drive/MyDrive/ML/data/style_transfer/models/run1/axolotl-training-config.yaml")

print(f"Final model saved to '{final_model_dir}'")

Loading base model


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading PEFT adapter
Merging and unloading...
Model saved to /content/drive/MyDrive/ML/data/style_transfer/models/run1/merged
Final model saved to '/content/drive/MyDrive/ML/data/style_transfer/models/run1/merged'
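The helper functions in helpers.py are not shown in this notebook, so the following is only a conceptual sketch of what merging a LoRA adapter means numerically: the low-rank update `B @ A`, scaled by `alpha / rank`, is added to the frozen base weight. The pure-Python functions `matmul` and `merge_lora` and the tiny matrices are illustrative assumptions. In practice the "Merging and unloading..." log line suggests the merge is delegated to peft's `merge_and_unload()`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    weight: out_features x in_features
    lora_a: rank x in_features
    lora_b: out_features x rank
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # toy 2x2 base weight
lora_a = [[1.0, 2.0]]              # rank 1, in_features 2
lora_b = [[0.5], [0.0]]            # out_features 2, rank 1
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved and loaded like a plain checkpoint.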


We store two models for further evaluation: one trained for 4 epochs and one trained for 10 epochs.

##3. Evaluate quantitatively the fine-tuned model

We are going to use the Rouge score to measure the similarity between the outputs of the models and Orwell's original sentences on a set of 177 samples from the evaluation set.

We are going to measure Rouge-1 (the overlap of unigrams, i.e. single words, between the generated output and the reference), Rouge-2 (the overlap of bigrams) and Rouge-L (statistics based on the longest common subsequence).
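To make these definitions concrete, here is a small hand-rolled sketch of the three metrics on one sentence pair. It is meant only as an illustration of the idea: the `rouge` package used below has its own tokenization and its own Rouge-L F-score formula, so its numbers may differ from what this sketch produces. The function names are made up for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def f_score(recall, precision):
    return 0.0 if recall + precision == 0 else 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """Recall, precision and F1 of n-gram overlap (whitespace tokenization)."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    return recall, precision, f_score(recall, precision)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """Recall, precision and F1 based on the longest common subsequence."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    recall = lcs / max(len(ref), 1)
    precision = lcs / max(len(cand), 1)
    return recall, precision, f_score(recall, precision)


candidate = "the clocks struck thirteen"
reference = "the clocks were striking thirteen"
print("Rouge-1:", rouge_n(candidate, reference, 1))
print("Rouge-2:", rouge_n(candidate, reference, 2))
print("Rouge-L:", rouge_l(candidate, reference))
```

In this pair three of the five reference words are recovered (recall 0.6), while three of the four generated words are correct (precision 0.75); only one bigram, "the clocks", is shared.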

To generate output we are going to use several methods:
1. beam search with num_beams = 2
2. sampling with different temperatures
3. sampling with different top_p and top_k

For convenience, only some of these trials are gathered in this notebook.
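The sampling options listed above can be sketched on a toy distribution. The code below is a simplified, pure-Python illustration of how temperature, top_k and top_p reshape next-token probabilities before a token is drawn; it is not the transformers implementation, and the function names and example logits are assumptions made for the demonstration.

```python
import math
import random


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Top-k: keep only the k most probable tokens, then renormalize.
    if top_k > 0:
        ranked = ranked[:top_k]
    norm = sum(p for p, _ in ranked)
    ranked = [(p / norm, i) for p, i in ranked]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filtered_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for 4 tokens
print(filtered_distribution(toy_logits, temperature=0.2))
print(filtered_distribution(toy_logits, top_k=4, top_p=0.7))
print(sample_token(toy_logits, random.Random(38), temperature=0.4))
```

Lower temperatures and smaller top_k or top_p values concentrate the probability mass on the most likely tokens, which makes the output closer to greedy decoding.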

In [None]:
!pip install rouge

In [None]:
%%time
import torch
from torch import cuda, bfloat16
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

import random
import pickle

random.seed(38)
with open('/content/drive/MyDrive/ML/data/1984_train_eval.pkl', 'rb') as f:
    data = pickle.load(f)
    eval_data = data['eval']

eval_golds = [e[0] for e in eval_data]


prompt = '''Rewrite the following text in the style of George Orwell prose. Make sure to convey the meaning of the utterance.

Text: "{text}"

Answer: '''


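def load_model(path, quantized=True):
    if quantized:
        # NOTE: get_llama2_quant is not defined in this notebook; it is assumed
        # to be a project helper (e.g. in helpers.py) that loads a quantized model.
        model, tokenizer = get_llama2_quant(model_name = path, for_eval = True)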
    else:
        model = AutoModelForCausalLM.from_pretrained(
            path,
            torch_dtype=torch.float16,
            device_map='auto'
        )
        tokenizer = AutoTokenizer.from_pretrained(path)
    return model, tokenizer



def get_orwellize_pipeline(model, tokenizer, pipe_config):

    config = {
        'model': model,
        'tokenizer': tokenizer,
        'return_full_text': False,
        'task': 'text-generation',
        'temperature': 0.2,
        'do_sample': True,
        'max_new_tokens': 128,  # number of tokens to generate in the output
        'repetition_penalty': 1.1
        }
    config = config | pipe_config

    orwellize = transformers.pipeline(**config)
    interesting = [ 'num_beams', 'top_k', 'top_p', 'do_sample', 'temperature']
    print('Running with config: ')
    for k, v in config.items():
        if k in interesting:
            print(f'   - {k}: {v}')

    return orwellize


import pickle
from rouge import Rouge

def run_scoring_over_experiments(preds, golds):
    r1, p1, f1, r2, p2, f2, rl, pl, fl = 0, 0, 0, 0, 0, 0, 0, 0, 0
    for i in range(len(preds)):
        print(f' === EXPERIMENT {i + 1} ===')
        r1e, p1e, f1e, r2e, p2e, f2e, rle, ple, fle, _ = run_scoring(preds[i], golds)
        r1 += r1e
        p1 += p1e
        f1 += f1e
        r2 += r2e
        p2 += p2e
        f2 += f2e
        rl += rle
        pl += ple
        fl += fle

    r1 /= len(preds)
    p1 /= len(preds)
    f1 /= len(preds)

    r2 /= len(preds)
    p2 /= len(preds)
    f2 /= len(preds)

    rl /= len(preds)
    pl /= len(preds)
    fl /= len(preds)

    print(' **** SUMMARY OVER EXPERIMENTS: ****')
    print(f'Summarized Rouge-1: {r1}, Precision: {p1}, F1: {f1}')
    print(f'Summarized Rouge-2: {r2}, Precision: {p2}, F1: {f2}')
    print(f'Summarized Rouge-L: {rl}, Precision: {pl}, F1: {fl}')
    return r1, p1, f1, r2, p2, f2, rl, pl, fl


def run_scoring(preds, golds):
    rouge = Rouge()
    scores = rouge.get_scores(preds, golds)

    r1, p1, f1, r2, p2, f2, rl, pl, fl = 0, 0, 0, 0, 0, 0, 0, 0, 0
    for s in scores:
        r1 += s['rouge-1']['r']
        p1 += s['rouge-1']['p']
        f1 += s['rouge-1']['f']

        r2 += s['rouge-2']['r']
        p2 += s['rouge-2']['p']
        f2 += s['rouge-2']['f']

        rl += s['rouge-l']['r']
        pl += s['rouge-l']['p']
        fl += s['rouge-l']['f']

    r1 /= len(scores)
    p1 /= len(scores)
    f1 /= len(scores)

    r2 /= len(scores)
    p2 /= len(scores)
    f2 /= len(scores)

    rl /= len(scores)
    pl /= len(scores)
    fl /= len(scores)

    print(f'Summarized Rouge-1: {r1}, Precision: {p1}, F1: {f1}')
    print(f'Summarized Rouge-2: {r2}, Precision: {p2}, F1: {f2}')
    print(f'Summarized Rouge-L: {rl}, Precision: {pl}, F1: {fl}')
    return r1, p1, f1, r2, p2, f2, rl, pl, fl, scores

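def run_predictions(orwellize_pipe, eval_data, save_path = '', experiments_count=1, additional_answers = None):
    # Avoid a shared mutable default: create a fresh list when none is passed.
    if additional_answers is None:
        additional_answers = []
    answers = [[] for _ in range(experiments_count)]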

    for i, ut in enumerate(eval_data):
        if i % 20 == 0:
            print(f'Processing {i}th sample...')
        for e in range(experiments_count):
            answer = orwellize_pipe(prompt.format(text=ut[1]))[0]['generated_text']
            answers[e].append(answer)
            additional_answers.append((ut, answer))
            if i % 20 == 0:
                print(f'Sample {i}th: {ut[1], answer}')

    if save_path != '':
        with open(save_path, 'wb') as f:
            pickle.dump({'eval_data': eval_data,
                    'preds': answers}, f)
    return answers

def show_k_preds(k, preds, eval_data):
     for i, ut in enumerate(eval_data[:k]):
        print('\nSimple text: ', ut[1])
        print('Orwellized: ', [p[i] for p in preds])
        print('Ground truth: ', ut[0])

def load_model_and_generate(model_path='', model=None, tokenizer=None, save_path='', pipe_config=None, experiments_count=3, additional=None):
    # Use None defaults instead of shared mutable {} / [] defaults.
    pipe_config = {} if pipe_config is None else pipe_config
    additional = [] if additional is None else additional
    # Load from disk only when a path is given; otherwise reuse the passed model.
    if model_path:
        model, tokenizer = load_model(model_path, quantized=False)
    orwellize = get_orwellize_pipeline(model, tokenizer, pipe_config)
    preds = run_predictions(orwellize, eval_data, save_path=save_path, experiments_count = experiments_count, additional_answers = additional)
    run_scoring_over_experiments(preds, eval_golds)
    return model, tokenizer


CPU times: user 2.4 s, sys: 373 ms, total: 2.77 s
Wall time: 5.51 s
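The averaging done in `run_scoring` and `run_scoring_over_experiments` can also be expressed more compactly. The sketch below assumes the per-sample score layout used above (a dict per sample, keyed by metric name, each holding `r`, `p` and `f`); the helper name `average_rouge` and the demo values are made up for illustration and are not part of the notebook's pipeline.

```python
def average_rouge(scores):
    """Average a list of per-sample score dicts metric by metric.

    Each element is expected to look like
    {'rouge-1': {'r': ..., 'p': ..., 'f': ...}, 'rouge-2': {...}, ...}.
    """
    if not scores:
        raise ValueError("Cannot average an empty list of scores")
    return {
        metric: {
            key: sum(s[metric][key] for s in scores) / len(scores)
            for key in scores[0][metric]
        }
        for metric in scores[0]
    }


# Made-up values, only to show the shape of the input and output.
demo_scores = [
    {"rouge-1": {"r": 0.5, "p": 0.5, "f": 0.5}, "rouge-l": {"r": 0.25, "p": 0.5, "f": 0.25}},
    {"rouge-1": {"r": 1.0, "p": 1.0, "f": 1.0}, "rouge-l": {"r": 0.75, "p": 1.0, "f": 0.75}},
]
averaged = average_rouge(demo_scores)
print(averaged)
```

The same helper works for averaging over experiments, since the result of one call has the same shape as a single per-sample entry.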


Run prediction on the full eval set. Get ROUGE scores and compare to scores for pre-trained Llama model.

Run inference on 3 models:
- Llama2 trained for 4 epochs
- Llama2 trained for 10 epochs
- pretrained Llama2


###3.1. Llama2-7b trained for 4 epochs


Generating with beam search (num_beams=2) and sampling disabled:

In [None]:
pipe_config = {  'num_beams': 2, 'do_sample': False, 'temperature': 0 }
additional = []
model, tokenizer = load_model_and_generate(model_path='/content/drive/MyDrive/ML/data/style_transfer/models/run_not_packed_4eps/merged',
                        save_path='/content/drive/MyDrive/ML/data/style_transfer/models/run1/preds_eval4.pkl',
                        pipe_config=pipe_config,
                        experiments_count=1, additional=additional)

Summarized Rouge-1: 0.5056361630419934, Precision: 0.5554047715016833, F1: 0.5222260693828104
Summarized Rouge-2: 0.33673850549650236, Precision: 0.37476396294899955, F1: 0.3489258524313733
Summarized Rouge-L: 0.4858580197649641, Precision: 0.533503919873233, F1: 0.5018605605045853


Generating with sampling enabled and temperature=0.2, 3 experiments:

In [None]:
pipe_config = { 'temperature': 0.2, 'do_sample': True }
additional = []
model, tokenizer = load_model_and_generate(model=model, tokenizer=tokenizer,  # keywords: the first positional argument is model_path
                        save_path='/content/drive/MyDrive/ML/data/style_transfer/models/run1/preds_eval2.pkl',
                        pipe_config=pipe_config,
                        experiments_count=3, additional=additional)

 === EXPERIMENT 1 ===
Summarized Rouge-1: 0.43294729518029157, Precision: 0.42469158339694774, F1: 0.42086816524101855
Summarized Rouge-2: 0.24805648454277068, Precision: 0.24944662428909964, F1: 0.24399557849495848
Summarized Rouge-L: 0.40774631090419877, Precision: 0.40147988801945816, F1: 0.3973996777800327
 === EXPERIMENT 2 ===
Summarized Rouge-1: 0.4277183082633654, Precision: 0.4187382907862878, F1: 0.41449562330819795
Summarized Rouge-2: 0.24514830085001887, Precision: 0.24521642815295216, F1: 0.24001815884743202
Summarized Rouge-L: 0.40311207040592995, Precision: 0.39536367926120386, F1: 0.39116046742237537
 === EXPERIMENT 3 ===
Summarized Rouge-1: 0.4425592340818674, Precision: 0.4384768315463181, F1: 0.4329004105187088
Summarized Rouge-2: 0.2577473686324074, Precision: 0.26128119305878006, F1: 0.254882283240641
Summarized Rouge-L: 0.4194078355023315, Precision: 0.41645675845940366, F1: 0.41071489315652143
 **** SUMMARY OVER EXPERIMENTS: ****
Summarized Rouge-1: 0.434408279175

Generation with sampling and temperature=0.4, 3 experiments

In [None]:
pipe_config = { 'temperature': 0.4, 'do_sample': True  }
additional = []
model, tokenizer = load_model_and_generate(model_path='/content/drive/MyDrive/ML/data/style_transfer/models/run_not_packed_4eps/merged',
                        save_path='/content/drive/MyDrive/ML/data/style_transfer/models/run1/preds_eval4.pkl',
                        pipe_config=pipe_config,
                        experiments_count=3, additional=additional)

 === EXPERIMENT 1 ===
Summarized Rouge-1: 0.41502624714843034, Precision: 0.3952010148667599, F1: 0.3963388214010599
Summarized Rouge-2: 0.22326372657877852, Precision: 0.22258362355402236, F1: 0.21837816059331439
Summarized Rouge-L: 0.3878533360590867, Precision: 0.37124005709026703, F1: 0.37143144307663156
 === EXPERIMENT 2 ===
Summarized Rouge-1: 0.4191800471386874, Precision: 0.41069923853160767, F1: 0.4052877369104694
Summarized Rouge-2: 0.2306072734992025, Precision: 0.23307724654371514, F1: 0.2264893987309582
Summarized Rouge-L: 0.39460125540054064, Precision: 0.38691087140154595, F1: 0.38156027954538496
 === EXPERIMENT 3 ===
Summarized Rouge-1: 0.4178149019040617, Precision: 0.4049882825927365, F1: 0.4017473489280015
Summarized Rouge-2: 0.2304838358914898, Precision: 0.2302515333206696, F1: 0.22548223522794167
Summarized Rouge-L: 0.3928671642835993, Precision: 0.38197403431632676, F1: 0.3785683256958294
 **** SUMMARY OVER EXPERIMENTS: ****
Summarized Rouge-1: 0.4173403987303931

Generating with sampling enabled, top_k=4 and top_p=0.7 (the temperature stays at the pipeline default of 0.2, because this config does not override it):

In [None]:
pipe_config = { 'top_k': 4, 'top_p': 0.7, 'do_sample': True }
additional = []
model, tokenizer = load_model_and_generate(model=model, tokenizer=tokenizer,  # keywords: the first positional argument is model_path
                        save_path='/content/drive/MyDrive/ML/data/style_transfer/models/run1/preds_eval2.pkl',
                        pipe_config=pipe_config,
                        experiments_count=1, additional=additional)

Summarized Rouge-1: 0.42833834922305974, Precision: 0.41259333130670006, F1: 0.41067233300229855
Summarized Rouge-2: 0.24268971430686267, Precision: 0.23989056637063344, F1: 0.23548877103244942
Summarized Rouge-L: 0.40364855561047674, Precision: 0.39046300697909103, F1: 0.3878404705804169


###3.2. Llama2-7b trained for 10 epochs

Generation for sampling enabled, temperature=0.2, 3 experiments:

In [None]:
pipe_config = { 'temperature': 0.2, 'do_sample': True }
additional = []
model, tokenizer = load_model_and_generate(model_path='/content/drive/MyDrive/ML/data/style_transfer/models/run_not_packed/merged',
                        save_path='/content/drive/MyDrive/ML/data/style_transfer/models/run_not_packed/preds_eval4.pkl',
                        pipe_config=pipe_config,
                        experiments_count=3, additional=additional)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Running with config: 
   - temperature: 0.2
   - do_sample: True
Processing 0th sample...
Sample 0th: ('You believe there are four. How many fingers, please?', " 'You believe there are four. How many fingers, please?'")
Sample 0th: ('You believe there are four. How many fingers, please?', " 'You believe there are four. How many fingers, please?'")
Sample 0th: ('You believe there are four. How many fingers, please?', " 'You think there's four. How many fingers, eh?'")




Processing 20th sample...
Sample 20th: ('There was no need for them to conspire. They only needed to rise up and shake themselves like a horse shaking off flies.', " 'They did not need to conspire. All they had to do was to stand up on their hind legs and shake themselves, like a horse shaking off flies.'")
Sample 20th: ('There was no need for them to conspire. They only needed to rise up and shake themselves like a horse shaking off flies.', ' There was no need for them to conspire. All they had to do was rise up and shake themselves like a horse shaking off flies.')
Sample 20th: ('There was no need for them to conspire. They only needed to rise up and shake themselves like a horse shaking off flies.', " 'There was no need for them to conspire. They only had to rise up and shake themselves like a horse shaking off flies.'")
Processing 40th sample...
Sample 40th: ("There was no response. 'Julia, are you awake?'", " There was no answer. 'Julia, darling--are you awake?'")
Sample 40th: ("

Generation with beam search (num_beams=2) and sampling disabled:

In [None]:
pipe_config = {  'num_beams': 2, 'do_sample': False, 'temperature': 0 }
additional = []
model, tokenizer = load_model_and_generate(model=model, tokenizer=tokenizer,  # keywords: the first positional argument is model_path
                        save_path='/content/drive/MyDrive/ML/data/style_transfer/models/run1/preds_eval4.pkl',
                        pipe_config=pipe_config,
                        experiments_count=1, additional=additional)

 === EXPERIMENT 1 ===
Summarized Rouge-1: 0.5069467295270829, Precision: 0.5568716861308904, F1: 0.5227043228987635
Summarized Rouge-2: 0.33547824182000713, Precision: 0.3692883799073701, F1: 0.345749111304934
Summarized Rouge-L: 0.4827567780800687, Precision: 0.5306308878564836, F1: 0.4981908203544119
 **** SUMMARY OVER EXPERIMENTS: ****
Summarized Rouge-1: 0.5069467295270829, Precision: 0.5568716861308904, F1: 0.5227043228987635
Summarized Rouge-2: 0.33547824182000713, Precision: 0.3692883799073701, F1: 0.345749111304934
Summarized Rouge-L: 0.4827567780800687, Precision: 0.5306308878564836, F1: 0.4981908203544119


Generation for sampling enabled, temperature=0.4, 3 experiments:

In [None]:
pipe_config = { 'temperature': 0.4, 'do_sample': True }
additional = []
model, tokenizer = load_model_and_generate(model = model, tokenizer = tokenizer,
                        save_path='/content/drive/MyDrive/ML/data/style_transfer/models/run_not_packed/preds_eval5.pkl',
                        pipe_config=pipe_config,
                        experiments_count=3, additional=additional)

Running with config: 
   - temperature: 0.4
   - do_sample: True
Processing 0th sample...
Sample 0th: ('You believe there are four. How many fingers, please?', " 'You believe there are four,' said O'Brien. 'How many fingers, please?'")
Sample 0th: ('You believe there are four. How many fingers, please?', " 'You believe there are four. How many fingers, please?'")
Sample 0th: ('You believe there are four. How many fingers, please?', " 'You think there are four. How many fingers, eh?'")




Processing 20th sample...
Sample 20th: ('There was no need for them to conspire. They only needed to rise up and shake themselves like a horse shaking off flies.', " 'No need for them to conspire. All they had to do was get up and shake themselves like a horse shaking off flies.'")
Sample 20th: ('There was no need for them to conspire. They only needed to rise up and shake themselves like a horse shaking off flies.', ' There was no need for them to conspire. All they had to do was rise up and shake themselves like a horse shaking off flies.')
Sample 20th: ('There was no need for them to conspire. They only needed to rise up and shake themselves like a horse shaking off flies.', " 'No need to conspire,' he said. 'All they had to do was stand up and shake themselves like a horse shaking off flies.'")
Processing 40th sample...
Sample 40th: ("There was no response. 'Julia, are you awake?'", " There was no answer. 'Julia, darling--are you awake?'")
Sample 40th: ("There was no response. 'Jul

###3.3. Pretrained Llama2-7b

Generation with sampling and temperature=0.2, 2 experiments:

In [None]:
pipe_config = { 'temperature': 0.2, 'do_sample': True }
model, tokenizer = load_model_and_generate(model_path = 'meta-llama/Llama-2-7b-chat-hf',
                        save_path='/content/drive/MyDrive/ML/data/style_transfer/models/pretrained/preds_eval1.pkl',
                        pipe_config=pipe_config,
                        experiments_count=2)

 === EXPERIMENT 1 ===
Summarized Rouge-1: 0.37332816386106543, Precision: 0.1214895096977391, F1: 0.17334514474457013
Summarized Rouge-2: 0.09735922960736414, Precision: 0.028657051852571936, F1: 0.041141138351641786
Summarized Rouge-L: 0.33514389733180056, Precision: 0.10762473027517473, F1: 0.15397751994567077
 === EXPERIMENT 2 ===
Summarized Rouge-1: 0.39035973308544575, Precision: 0.12850908889252224, F1: 0.18159644952615062
Summarized Rouge-2: 0.10334949283505217, Precision: 0.03154718445518327, F1: 0.044488052313779766
Summarized Rouge-L: 0.3529249718583327, Precision: 0.1143660719852532, F1: 0.16212633968278012
 **** SUMMARY OVER EXPERIMENTS: ****
Summarized Rouge-1: 0.3818439484732556, Precision: 0.12499929929513068, F1: 0.17747079713536038
Summarized Rouge-2: 0.10035436122120815, Precision: 0.030102118153877602, F1: 0.04281459533271077
Summarized Rouge-L: 0.34403443459506666, Precision: 0.11099540113021397, F1: 0.15805192981422544


Generation with sampling and temperature=0.4, 2 experiments:

In [None]:
pipe_config = { 'temperature': 0.4, 'do_sample': True }
model, tokenizer = load_model_and_generate(model=model, tokenizer=tokenizer,
                        save_path='/content/drive/MyDrive/ML/data/style_transfer/models/pretrained/preds_eval2.pkl',
                        pipe_config=pipe_config,
                        experiments_count=2)

Generation with beam search (num_beams=2), sampling disabled, 1 experiment:

In [None]:
pipe_config = { 'num_beams': 2, 'do_sample': False }
additional = []
model, tokenizer = load_model_and_generate(model_path='meta-llama/Llama-2-7b-chat-hf',
                        save_path='/content/drive/MyDrive/ML/data/style_transfer/models/pretrained/preds_eval3.pkl',
                        pipe_config=pipe_config,
                        experiments_count=1, additional=additional)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Running with config: 
   - temperature: 0.2
   - do_sample: False
   - num_beams: 2
Processing 0th sample...




Sample 0th: ('You believe there are four. How many fingers, please?', '\n\n"Ah, a simple question, yet one that exposes the very essence of the human condition. You see, my dear fellow, the answer to this query is not as straightforward as it may initially seem. For you see, the number of fingers one possesses is not simply a matter of biology or anatomy, but rather a reflection of one\'s very soul. The number of fingers one believes they possess is a reflection of their innermost beliefs, their deepest desires, and their most fervent hopes. It is a measure of their connection to the world around them, and their')




Processing 20th sample...
Sample 20th: ('There was no need for them to conspire. They only needed to rise up and shake themselves like a horse shaking off flies.', "\n\nIt was plain as the nose on one's face that they didn't need to hatch a plot. All they had to do was stand up and give themselves a good shake, much like a horse might do to dislodge those pesky flies that buzz around its ears.")
Processing 40th sample...
Sample 40th: ("There was no response. 'Julia, are you awake?'", ' "Julia, my dear, didst thou hear me? I called out to thee, but there was nary a stir. Art thou asleep, or hath fate dealt thee a cruel hand?"')
Processing 60th sample...
Sample 60th: ('A film theatre in Stepney was affected by a falling object, resulting in casualties and damage to the building.', ' It was a dark and dismal evening in the East End of London when a most unfortunate event occurred at the local film theatre in Stepney. A mysterious object, seemingly from the heavens above, had plummeted fro

### 3.4 Summary

- Both fine-tuned models show quantitatively much higher similarity to Orwell's original sentences than the pretrained model: in the runs shown above, their Rouge-1 F1 is roughly 0.40 to 0.52, against roughly 0.18 for the pretrained model at temperature 0.2.
- Generation with beam search and sampling disabled gives the highest Rouge scores (Rouge-1 approx. 0.5).
- Generation with sampling at temperature 0.2 to 0.4 gives Rouge-1 scores around 0.4.

However, scores are not everything. Next we investigate the outputs of the models qualitatively.

##4. Evaluate qualitatively the fine-tuned model

In [None]:
import pandas as pd
import pickle

with open('/content/drive/MyDrive/ML/data/1984_train_eval.pkl', 'rb') as f:
    data = pickle.load(f)
    eval_data = data['eval']
eval_golds = [e[0] for e in eval_data]



# ft 4 epochs
with open('/content/drive/MyDrive/ML/data/style_transfer/models/run_not_packed_4eps/preds_eval1.pkl', 'rb') as f:
    data = pickle.load(f)
    llama_4e_unpacked_beam = data['preds']

with open('/content/drive/MyDrive/ML/data/style_transfer/models/run_not_packed_4eps/preds_eval4.pkl', 'rb') as f:
    data = pickle.load(f)
    llama_4e_unpacked_temp02 = data['preds']

with open('/content/drive/MyDrive/ML/data/style_transfer/models/run_not_packed_4eps/preds_eval3.pkl', 'rb') as f:
    data = pickle.load(f)
    llama_4e_unpacked_temp04 = data['preds']



#ft 10 epochs
with open('/content/drive/MyDrive/ML/data/style_transfer/models/run_not_packed/preds_eval1.pkl', 'rb') as f:
    data = pickle.load(f)
    llama_10e_unpacked_beam = data['preds']

with open('/content/drive/MyDrive/ML/data/style_transfer/models/run_not_packed/preds_eval4.pkl', 'rb') as f:
    data = pickle.load(f)
    llama_10e_unpacked_temp02 = data['preds']

with open('/content/drive/MyDrive/ML/data/style_transfer/models/run_not_packed/preds_eval5.pkl', 'rb') as f:
    data = pickle.load(f)
    llama_10e_unpacked_temp04 = data['preds']



The evaluation set was analyzed to select longer and more interesting sentences where there is room for improving the style of the narration. For these samples, all models in all generation configurations were analyzed and compared, and the best ones were selected.

Below we present results for a sample of sentences and for selected model configurations:

In [None]:
import random
interesting_samples_all = [6, 11, 22, 26, 39, 51, 58, 66, 70, 88, 99, 112, 119, 137, 147, 160, 1, 9, 16, 29, 35, 49, 56, 61, 67, 71, 75, 96, 129]
# random.sample takes the population first and the sample size second.
interesting_samples = random.sample(interesting_samples_all, 10)

In [None]:

for idx in interesting_samples:
    print(f'\nSimplified input: {eval_data[idx][1]}\n')
    print(f' * 10 epochs unpacked (sampling): {llama_10e_unpacked_temp02[2][idx]}')
    print(f' * 4 epochs unpacked (sampling): {llama_4e_unpacked_temp02[0][idx]}')
    print(f'\nOriginal (Orwell\'s): {eval_data[idx][0]}')
    print('\n=======================\n')



Simplified input: He easily dealt with the false belief, and he was not at risk of being influenced by it.

 * 10 epochs unpacked (sampling):  'He disposed of that piece of lunacy as though it did not matter, and he was immune from its influence.
 * 4 epochs unpacked (sampling): 'He disposed of that false belief as easily as though it had been a fly,' and he was not in danger of being infected by it.

Original (Orwell's): He had no difficulty in disposing of the fallacy, and he was in no danger of succumbing to it.



Simplified input: He had little recollection of his sister, only as an infant who was weak and rarely made any sound, but had observant eyes.

 * 10 epochs unpacked (sampling):  'I hardly remember my sister,' he said. 'She was a tiny thing, weak-voiced, never made much noise, but she had large observant eyes.'
 * 4 epochs unpacked (sampling): 'He could not remember his sister at all except as a tiny, helpless creature who never said anything and had eyes that were always

Based on this investigation, beam search generation, although it gives the highest Rouge scores, seemed to lack diversity and to follow the neutral (simplified) sentences too closely. The sampling methods introduced more interesting unpredictability into the generation and were therefore preferred.

This is not surprising. [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751) argue that high-quality human language does not follow a distribution of high-probability next words. In other words, as humans, we want generated text to surprise us and not to be boring or predictable:

[Figure: Screenshot 2023-09-17 211703.png]

## Summary:

This exercise was about generating sentences that sound simple yet carry the distinct, somewhat grim and evocative flavour of George Orwell's prose. For this task we took Orwell's novel 1984 and converted it into a training set in which simple sentences are matched with Orwell's versions of them, so that the fine-tuned model learns to convey the same thought in the writer's distinctive style.

We used OpenAI's gpt-3.5-turbo to generate simplified sentences from Orwell's original sentences and then trained the chat version of Meta's Llama2-7B on the resulting pairs of simplified and original sentences. The model was fine-tuned on approx. 3370 samples (utterances, most of which were single sentences, though some shorter ones were concatenated). The model was trained in a couple of configurations, which were then compared. The evaluation set consisted of 177 utterances. Rouge scores (Rouge-1, Rouge-2 and Rouge-L) were used for the evaluation; however, qualitative insight into some of the produced samples helped to identify the best of the models.

Conclusions from the training:
1. Fine-tuning corrected the misbehaviour of the pretrained Llama2-7b. Prompting the fine-tuned Llama2-7b to rewrite sentences in the style of Orwell's prose yields much better results than prompting the pretrained model. Before fine-tuning, Llama2-7b seemed simply to apply a generic, lush style of fiction, rambling on and expanding the sentences with little regard for the original content.
2. On the evaluation set, the fine-tuned model adorned the simplified, neutral sentences with occasional similes and replaced adjectives with more evocative and intense ones. It also sometimes changed the structure of the sentences, but less often than it could have.
3. At the same time, the output sticks closely to the formulation of the simplified version. This is its strength (compared to the rambling pretrained Llama2) but also its weakness: the model rarely offers metaphors or expands on the sentences with some flair.
4. The reason for this restrictive output may be the very modest size of the training set. It contains only around 3.4k samples, many of which offered little to rewrite in the first place (they were not filtered out, to avoid reducing the dataset further). The results would probably be better if additional sources were used for fine-tuning (other examples of Orwell's fine prose, such as his essays and other novels).
5. It is also worth noting that OpenAI's gpt-3.5-turbo did not always produce perfectly neutral, simplified sentences for the training dataset (it would strip either too much content or too little style), which made it even harder for the model to learn valuable cues.
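Conclusion 4 notes that pairs with little to rewrite were kept in the dataset. As a possible follow-up, the sketch below shows one way such pairs could be flagged, by measuring how similar the neutral sentence is to the original with `difflib.SequenceMatcher` from the standard library. This filter was not part of the experiment; the function name `is_informative_pair` and the 0.9 threshold are assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher


def is_informative_pair(neutral, original, max_similarity=0.9):
    """Return True when the neutral and original sentences differ enough.

    max_similarity is a hypothetical threshold on the character-level
    similarity ratio (1.0 means the two strings are identical).
    """
    ratio = SequenceMatcher(None, neutral.lower(), original.lower()).ratio()
    return ratio < max_similarity


pairs = [
    # Neutral version identical to the original: nothing to learn from.
    ("Winston made for the stairs. It was no use trying the lift.",
     "Winston made for the stairs. It was no use trying the lift."),
    # Neutral version clearly simpler than the original.
    ("The shop had been damaged, with broken windows and pockmarked walls.",
     "The shop had been reduced to a shattered shell of its former self, "
     "the windows blown out, the walls pockmarked with holes."),
]
kept = [pair for pair in pairs if is_informative_pair(*pair)]
print(f"Kept {len(kept)} of {len(pairs)} pairs")
```

Whether such filtering helps would have to be checked empirically, since it also shrinks an already small training set.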
