In [1]:
from datasets import load_dataset
from transformers import AutoTokenizer, BlenderbotTokenizer
from tqdm import tqdm
from ast import literal_eval

import pandas as pd
import csv
import re
import torch

device = 'cuda' if torch.cuda.is_available else 'cpu'

In [2]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")

## GPT-3

In [17]:
df = pd.read_csv('runs/21/dataset/dialogpt/all/cornell_movie_dataset_all.csv')

In [18]:
newdf = df[df['num_aave_fts'] >= 6]

In [19]:
newdf = newdf.reset_index(drop=True)

In [20]:
def split_turns(s):
    turns = s.split('<|endoftext|>')
    if len(turns) < 3:
        return None
    result = 'Complete the following conversation.\n'
    result += 'A: ' + turns[0] + '\n'
    result += 'B: ' + turns[1] + '\n'
    result += 'A: ' + turns[2] + '\n'
    result += 'B:'
    return result

In [21]:
newdf['history'] = newdf['history'].apply(split_turns)
newdf['history_aave'] = newdf['history_aave'].apply(split_turns)

In [24]:
newdf = newdf.dropna()
newdf = newdf.reset_index(drop=True)

In [27]:
newdf.to_csv('gpt3_dataset.csv', index=False)

## DailyDialog

In [3]:
dataset = load_dataset('daily_dialog')

Using custom data configuration default
Reusing dataset daily_dialog (C:\Users\stwan\.cache\huggingface\datasets\daily_dialog\default\1.0.0\c03444008e9508b8b76f1f6793742d37d5e5f83364f8d573c2747bff435ea55c)
100%|██████████| 3/3 [00:00<00:00, 600.04it/s]


In [4]:
print(len(dataset['train']['dialog']))

11118


In [8]:
for i in tqdm(range(1)):
    dialog = dataset['train']['dialog'][i]
    txt = ''
    if len(dialog) > 3:
        for u in dialog[:-1]:
            txt += u + tokenizer.eos_token
    print(txt)

  9%|▉         | 2/22 [00:00<00:01, 11.12it/s]

Say , Jim , how about going for a few beers after dinner ? <|endoftext|> You know that is tempting but is really not good for our fitness . <|endoftext|> What do you mean ? It will help us to relax . <|endoftext|> Do you really think so ? I don't . It will just make us fat and act silly . Remember last time ? <|endoftext|> I guess you are right.But what shall we do ? I don't feel like sitting at home . <|endoftext|> I suggest a walk over to the gym where we can play singsong and meet some of our friends . <|endoftext|> That's a good idea . I hear Mary and Sally often go there to play pingpong.Perhaps we can make a foursome with them . <|endoftext|> Sounds great to me ! If they are willing , we could ask them to go dancing with us.That is excellent exercise and fun , too . <|endoftext|> Good.Let ' s go now . <|endoftext|>
Can you do push-ups ? <|endoftext|> Of course I can . It's a piece of cake ! Believe it or not , I can do 30 push-ups a minute . <|endoftext|> Really ? I think that's 

 27%|██▋       | 6/22 [00:00<00:01, 11.23it/s]

Are you all right ? <|endoftext|> I will be all right soon . I was terrified when I watched them fall from the wire . <|endoftext|> Don't worry.He is an acrobat 。 <|endoftext|>
Hey John , nice skates . Are they new ? <|endoftext|> Yeah , I just got them . I started playing ice hockey in a community league . So , I finally got myself new skates . <|endoftext|> What position do you play ? <|endoftext|> I ’ m a defender . It ’ s a lot of fun . You don ’ t have to be able to skate as fast on defense . <|endoftext|> Yeah , you ’ re a pretty big guy . I play goalie , myself . <|endoftext|> Oh , yeah ? Which team ? <|endoftext|> The Rockets . <|endoftext|> Really ? I think we play you guys next week . Well , I have to go to practice . See you later . <|endoftext|>
Hey Lydia , what are you reading ? <|endoftext|> I ’ m looking at my horoscope for this month ! My outlook is very positive . It says that I should take a vacation to someplace exotic , and that I will have a passionate summer fling

 36%|███▋      | 8/22 [00:00<00:01, 10.03it/s]

I hear you bought a new house in the northern suburbs . <|endoftext|> That ’ s right , we bought it the same day we came on the market . <|endoftext|> What kind of house is it ? <|endoftext|> It ’ s a wonderful Spanish style . <|endoftext|> Oh , I love the roof tiles on Spanish style houses . <|endoftext|> And it ’ s a bargaining . A house like this in river side costs double the price . <|endoftext|> Great , is it a two bedroom house ? <|endoftext|> No , it has three bedrooms and three beds , and has a living room with a twelve-foot ceiling . There ’ s a two-car garage . <|endoftext|> That ’ s a nice area too . It ’ ll be a good investment for you . <|endoftext|> Yeas , when will you buy a house ? <|endoftext|> Not untill the end of this year , you know , just before my wedding . <|endoftext|> Right , congratulations . <|endoftext|>
Hi , Becky , what's up ? <|endoftext|> Not much , except that my mother-in-law is driving me up the wall . <|endoftext|> What's the problem ? <|endoftext|

 55%|█████▍    | 12/22 [00:01<00:01,  9.85it/s]

How are Zina's new programmers working out ? <|endoftext|> I hate to admit it , but they're good . And fast . The Filipino kid is a genius . <|endoftext|> So you'll make the Stars.com deadline , and have us up and running next week ? <|endoftext|> It'll be close , but we'll make it . <|endoftext|> Good . After Stars.com starts paying us , we won't need Vikam's cash anymore . <|endoftext|>
Do you like cooking ? <|endoftext|> Yes . I like cooking very much . I got this hobby when I was 12 years sold . <|endoftext|> Why do you like it ? <|endoftext|> I have no idea . I like cooking by myself . I like to taste delicious food . <|endoftext|> That's wonderful ! <|endoftext|> And I love trying new recipes , which I usually test with my friends . You can come , too . <|endoftext|> Really ? I hope I can have a chance to taste it . Don't forget to tell me . <|endoftext|>
Anyone home ? Jen ! <|endoftext|> I'm in the kitchen ... let yourself in ! <|endoftext|> Wow ! You're really working up a stor

 64%|██████▎   | 14/22 [00:01<00:00, 10.23it/s]

You look so tan and healthy ! <|endoftext|> Thanks . I just got back from summer camp . <|endoftext|> How was it ? <|endoftext|> Great . I got to try so many things for the first time . <|endoftext|> Like what ? <|endoftext|> I went sailing , fishing , and horseback riding . <|endoftext|> I ’ m so jealous . <|endoftext|>
Diana , do you like the perfume I gave you ? <|endoftext|> It ’ s good . But to tell you the truth , I don ’ t wear perfume . <|endoftext|> I ’ m sorry . I didn ’ t know that . <|endoftext|>
Ah , ah , ah ... <|endoftext|> All right , Bill.Here ' s your daily exercise schedule . You are to jog before breakfast . <|endoftext|> Jog ? <|endoftext|> Then , you are to walk to work . <|endoftext|> Walk ? <|endoftext|> Thirty minutes in gym at lunch time . <|endoftext|> Oh no . <|endoftext|> Use the stairs , never the elevator . <|endoftext|> Oh , dear . <|endoftext|> And three times a week , you can either swim , play racketball , or hand ball . <|endoftext|> Oh no . <|endoft

 77%|███████▋  | 17/22 [00:01<00:00,  8.69it/s]

Hi Bill , I saw your grandma yesterday . <|endoftext|> Oh where was that ? <|endoftext|> I was running around the track at my college and there she was walking around the same track . <|endoftext|> Grannie always tries to stay fit and healthy.She is always making us kids eat the proper foods . <|endoftext|> Well , it pays off for her.How old is she anyway ? <|endoftext|> She will be 86 next month . <|endoftext|>
I would like to register for a class today . <|endoftext|> No problem , what class would you like to take ? <|endoftext|> I would very much enjoy taking a Psychology class.Because I'm crazy . <|endoftext|> There are two classes that are still open . <|endoftext|> Which days are these classes on ? <|endoftext|> The first class is a Tuesday and Thursday class from two to three . <|endoftext|> What about the other class ? <|endoftext|> The other class is on Monday and Wednesday from 10 am - 12 . <|endoftext|> Are you sure there are no more open classes ? <|endoftext|> I'm positive

 95%|█████████▌| 21/22 [00:02<00:00, 10.01it/s]


Can I help you ? <|endoftext|> I hope so . I'm looking for some material for a paper I'm writing , and I'm not quite sure where to look . <|endoftext|> I'll certainly try to help you . What topic is your paper on ? <|endoftext|> My paper is on the influence of television on children . <|endoftext|> There are several possible sources you might use for that topic . I suggest you use the computer and the computer will give you a list of every scientific journal that talks about children and television . <|endoftext|>
Here ’ s your hot dog and beer . What happened ? Did I miss anything ? <|endoftext|> Yeah , Cal Ripen just hit a home run . <|endoftext|> What ’ s the score ? <|endoftext|> Well it was 3 to 4 , but Ripen ’ s home run made it 5 to 4 since another player was on first base . <|endoftext|> So Baltimore is winning ? <|endoftext|> Right . <|endoftext|> This is a really great place to watch a baseball game . <|endoftext|> Yeah , there isn ’ t a bad seat in the place . <|endoftext|>

100%|██████████| 22/22 [00:02<00:00,  9.87it/s]







In [4]:
f = open('runs/13/dailydialog_dataset.csv', 'a', encoding="utf-8")
writer = csv.writer(f)

for i in tqdm(range(len(dataset['train']['dialog']))):
    dialog = dataset['train']['dialog'][i]
    txt = ''
    if len(dialog) > 3:
        for j in range(3):
            txt += dialog[j] + tokenizer.eos_token
        # for u in dialog[:-1]:
        #     txt += u + tokenizer.eos_token
        writer.writerow([txt, dialog[3] + tokenizer.eos_token])

f.close()

100%|██████████| 11118/11118 [16:54<00:00, 10.96it/s]


## PersonaChat

In [3]:
dataset = load_dataset('bavard/personachat_truecased')

No config specified, defaulting to: personachat_truecased/full
Reusing dataset personachat_truecased (C:\Users\stwan\.cache\huggingface\datasets\bavard___personachat_truecased\full\1.0.0\73ee8f1a0d9e42255af5a8301877a2f3ac638e55b1cd9cbccca5ab7e23d2b638)
100%|██████████| 2/2 [00:00<00:00, 166.75it/s]


In [4]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['personality', 'candidates', 'history', 'conv_id', 'utterance_idx'],
        num_rows: 131438
    })
    validation: Dataset({
        features: ['personality', 'candidates', 'history', 'conv_id', 'utterance_idx'],
        num_rows: 7801
    })
})


In [6]:
f = open('runs/13/personachat_raw.csv', 'a', encoding="utf-8")
writer = csv.writer(f)
writer.writerow(['history', 'conv_id'])

history = dataset['train']['history']
conv_id = dataset['train']['conv_id']

for h, c in tqdm(zip(history, conv_id)):
    writer.writerow([h, c])

f.close()

131438it [00:01, 131312.49it/s]


In [7]:
data = pd.read_csv('runs/13/personachat_raw.csv')
data

Unnamed: 0,history,conv_id
0,"[""Hi, how are you doing? I'm getting ready to ...",0
1,"[""Hi, how are you doing? I'm getting ready to ...",0
2,"[""Hi, how are you doing? I'm getting ready to ...",0
3,"[""Hi, how are you doing? I'm getting ready to ...",0
4,"[""Hi, how are you doing? I'm getting ready to ...",0
...,...,...
131433,"['__ SILENCE __.', 'Hello! How are you today?'...",17877
131434,"['__ SILENCE __.', 'Hello! How are you today?'...",17877
131435,"['__ SILENCE __.', 'Hello! How are you today?'...",17877
131436,"['__ SILENCE __.', 'Hello! How are you today?'...",17877


In [8]:
f = open('runs/13/personachat_dataset.csv', 'a', encoding="utf-8")
writer = csv.writer(f)
writer.writerow(['history', 'groundtruth'])

data = pd.read_csv('runs/13/personachat_raw.csv')
data['history'] = data['history'].apply(literal_eval)
for i in tqdm(range(17877)):
    txt = ''
    row = data.loc[data['conv_id'] == i].iloc[-1]['history']
    # if len(row) > 3:
    #     for u in row[:-1]:
    #         txt += u + tokenizer.eos_token
    #     writer.writerow([txt, row[-1] + tokenizer.eos_token])
    if len(row) > 3:
        for j in range(3):
            txt += row[j] + tokenizer.eos_token
        writer.writerow([txt, row[3] + tokenizer.eos_token])

f.close()

 14%|█▎        | 17878/131438 [00:06<00:38, 2922.73it/s]


IndexError: single positional indexer is out-of-bounds

In [32]:
pre_data = pd.read_csv('runs/10_cornell_movie/personachat_dataset.csv')
pre_data.head()

Unnamed: 0,history,groundtruth
0,"Hi, how are you doing? I'm getting ready to do...",I think I will can some jam. Do you also play ...
1,"Hi, how are you doing today?<|endoftext|>I am ...",That's nice. Moms are pretty cool too.<|endoft...
2,"We all live in a yellow submarine, a yellow su...",I prefer mojitos. Watermelon or cucumber.<|end...
3,Hi! I work as a gourmet cook.<|endoftext|>I do...,I work as a gourmet cook who also has a pitch ...
4,How are you doing today.<|endoftext|>What do y...,I watch kids for a living.<|endoftext|>


## Cornell Movie

In [3]:
metadata = pd.read_csv('runs/10_cornell_movie/movie_conversations.txt', delimiter=" \+\+\+\$\+\+\+ ", names=['character1ID', 'character2ID', 'movieID', 'dialog'])
metadata.head()

  return func(*args, **kwargs)


Unnamed: 0,character1ID,character2ID,movieID,dialog
0,u0,u2,m0,"['L194', 'L195', 'L196', 'L197']"
1,u0,u2,m0,"['L198', 'L199']"
2,u0,u2,m0,"['L200', 'L201', 'L202', 'L203']"
3,u0,u2,m0,"['L204', 'L205', 'L206']"
4,u0,u2,m0,"['L207', 'L208']"


In [4]:
dialog_data = pd.read_csv('runs/10_cornell_movie/movie_lines.txt', delimiter=' \+\+\+\$\+\+\+ ', encoding='unicode_escape', names=['lineID', 'characterID', 'movieID', 'characterName', 'dialog'])
dialog_data

  return func(*args, **kwargs)


Unnamed: 0,lineID,characterID,movieID,characterName,dialog
0,L1045,u0,m0,BIANCA,They do not!
1,L1044,u2,m0,CAMERON,They do to!
2,L985,u0,m0,BIANCA,I hope so.
3,L984,u2,m0,CAMERON,She okay?
4,L925,u0,m0,BIANCA,Let's go.
...,...,...,...,...,...
304708,L666371,u9030,m616,DURNFORD,Lord Chelmsford seems to want me to stay back ...
304709,L666370,u9034,m616,VEREKER,I'm to take the Sikali with the main column to...
304710,L666369,u9030,m616,DURNFORD,"Your orders, Mr Vereker?"
304711,L666257,u9030,m616,DURNFORD,"Good ones, yes, Mr Vereker. Gentlemen who can ..."


In [5]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")

In [6]:
f = open('runs/13/cornell_movie_dataset.csv', 'a', encoding="utf-8")
writer = csv.writer(f)
writer.writerow(['history', 'groundtruth'])

for i in tqdm(range(304712)):
    dialog = ''
    metainfo = metadata.iloc[i]['dialog'][1:-1].split(',')
    if len(metainfo) > 3:
        if dialog_data.loc[dialog_data['lineID'] == metainfo[0].strip()[1:-1]].iloc[0]['dialog'] != None and dialog_data.loc[dialog_data['lineID'] == metainfo[1].strip()[1:-1]].iloc[0]['dialog'] != None and dialog_data.loc[dialog_data['lineID'] == metainfo[2].strip()[1:-1]].iloc[0]['dialog'] != None and dialog_data.loc[dialog_data['lineID'] == metainfo[3].strip()[1:-1]].iloc[0]['dialog'] != None:
            for j in range(3):
                idx = metainfo[j]
                u = dialog_data.loc[dialog_data['lineID'] == idx.strip()[1:-1]].iloc[0]['dialog']
                dialog += (u + tokenizer.eos_token)
            writer.writerow([dialog, dialog_data.loc[dialog_data['lineID'] == metainfo[3].strip()[1:-1]].iloc[0]['dialog'] + tokenizer.eos_token])
            # for idx in metainfo[:-1]:
            #     u = dialog_data.loc[dialog_data['lineID'] == idx.strip()[1:-1]].iloc[0]['dialog']
            #     dialog += (u + tokenizer.eos_token)
        

f.close()

 27%|██▋       | 83097/304712 [3:28:18<9:15:33,  6.65it/s]     


IndexError: single positional indexer is out-of-bounds

In [5]:
movie_dataset = pd.read_csv('runs/10_cornell_movie/cornell_movie_dataset.csv')
movie_dataset.head()

Unnamed: 0,history,groundtruth
0,Can we make this quick? Roxanne Korrine and A...,Okay... then how 'bout we try out some French ...
1,"No, no, it's my fault -- we didn't have a prop...",Seems like she could get a date easy enough......
2,C'esc ma tete. This is my head<|endoftext|>Rig...,Forget French.<|endoftext|>
3,She's not a...<|endoftext|>Lesbian? No. I fou...,Who knows? All I've ever heard her say is tha...
4,"Well, no...<|endoftext|>Then that's all you ha...",You always been this selfish?<|endoftext|>


## Replace html

In [4]:
def replace_html(s):
    s = re.sub(r'<a href=.*?<mark>', '', s)
    return re.sub(r'</mark></a>', ' ', s)

In [6]:
df = pd.read_csv('runs/13/orig/cornell_movie_dataset_aave.csv')
df.head()

Unnamed: 0,history,groundtruth,history_html,groundtruth_html
0,Can we make this quick? Roxanne Korrine and A...,Okay... then how 'bout we try out some French ...,Can we make this quick? Roxanne Korrine and A...,Okay... then how 'bout we try out some French ...
1,"No, no, it's my fault -- we ain't have a prope...",Seem like she could get a date easy enough...<...,"No, no, it's my fault -- we didn't have a prop...",<a href='uninflect' title='1'><mark>Seems</mar...
2,C'esc ma tete. This my head<|endoftext|>Right....,That's because it's such a nice one.<|endoftext|>,C'esc ma tete. This <a href='drop_aux' title='...,That's because it's such a nice one.<|endoftext|>
3,Sheain't a...<|endoftext|>Lesbian? don't no. ...,Who know? All I've ever heard her say is that...,She<a href='negative_concord' title='1'><mark>...,Who <a href='uninflect' title='1'><mark>knows<...
4,"Well, no...<|endoftext|>Then that's all you ha...",You always been this selfish?<|endoftext|>,"Well, no...<|endoftext|>Then that's all you ha...",You always been this selfish?<|endoftext|>


In [7]:
df['history_aave'] = df['history_html'].apply(replace_html)
df['groundtruth_aave'] = df['groundtruth_html'].apply(replace_html)
df.head()

Unnamed: 0,history,groundtruth,history_html,groundtruth_html,history_aave,groundtruth_aave
0,Can we make this quick? Roxanne Korrine and A...,Okay... then how 'bout we try out some French ...,Can we make this quick? Roxanne Korrine and A...,Okay... then how 'bout we try out some French ...,Can we make this quick? Roxanne Korrine and A...,Okay... then how 'bout we try out some French ...
1,"No, no, it's my fault -- we ain't have a prope...",Seem like she could get a date easy enough...<...,"No, no, it's my fault -- we didn't have a prop...",<a href='uninflect' title='1'><mark>Seems</mar...,"No, no, it's my fault -- we didn't have a prop...",Seems like she could get a date easy enough.....
2,C'esc ma tete. This my head<|endoftext|>Right....,That's because it's such a nice one.<|endoftext|>,C'esc ma tete. This <a href='drop_aux' title='...,That's because it's such a nice one.<|endoftext|>,C'esc ma tete. This is my head<|endoftext|>Rig...,That's because it's such a nice one.<|endoftext|>
3,Sheain't a...<|endoftext|>Lesbian? don't no. ...,Who know? All I've ever heard her say is that...,She<a href='negative_concord' title='1'><mark>...,Who <a href='uninflect' title='1'><mark>knows<...,She's not a...<|endoftext|>Lesbian? No . I f...,Who knows ? All I've ever heard her say is th...
4,"Well, no...<|endoftext|>Then that's all you ha...",You always been this selfish?<|endoftext|>,"Well, no...<|endoftext|>Then that's all you ha...",You always been this selfish?<|endoftext|>,"Well, no...<|endoftext|>Then that's all you ha...",You always been this selfish?<|endoftext|>


In [8]:
df.to_csv('runs/13/orig/cornell_movie_dataset_full.csv', index=False)

In [10]:
df = pd.read_csv('runs/13/orig/dailydialog_dataset_aave.csv')
df['history_aave'] = df['history_html'].apply(replace_html)
df['groundtruth_aave'] = df['groundtruth_html'].apply(replace_html)
df.to_csv('runs/13/orig/dailydialog_dataset_full.csv', index=False)

In [11]:
df = pd.read_csv('runs/13/orig/personachat_dataset_aave.csv')
df['history_aave'] = df['history_html'].apply(replace_html)
df['groundtruth_aave'] = df['groundtruth_html'].apply(replace_html)
df.to_csv('runs/13/orig/personachat_dataset_full.csv', index=False)

## DialoGPT -> BST

In [2]:
tokenizer = BlenderbotTokenizer.from_pretrained("facebook/blenderbot-1B-distill")
def to_bst(s):
    s = s.replace('<|endoftext|>', tokenizer.eos_token + tokenizer.bos_token)
    return s[:-len(tokenizer.bos_token)]

In [4]:
df = pd.read_csv('runs/21/dataset/dialogpt/all/cornell_movie_dataset_all.csv')
df

Unnamed: 0,history,groundtruth,history_aave,history_html,num_aave_fts,groundtruth_aave
0,Can we make this quick? Roxanne Korrine and A...,Okay... then how 'bout we try out some French ...,Can we make this quick? Roxanne Korrine and A...,Can we make this quick? Roxanne Korrine and A...,1,Ite... then how 'bout we try outt some French ...
1,"No, no, it's my fault -- we didn't have a prop...",Seems like she could get a date easy enough......,"No, no, it's my fault -- we ain't have a prope...","No, no, it's my fault -- we didn't have a prop...",0,Seem like she could get a date easy enough...<...
2,C'esc ma tete. This is my head<|endoftext|>Rig...,That's because it's such a nice one.<|endoftext|>,C'esc ma tete. dis my head<|endoftext|>rite. ...,C'esc ma tete. This <a href='drop_aux' title='...,2,That's becuz it's such a nice one.<|endoftext|>
3,She's not a...<|endoftext|>Lesbian? No. I fou...,Who knows? All I've ever heard her say is tha...,Sheain't a...<|endoftext|>Lesbian? don't no. ...,She<a href='negative_concord' title='1'><mark>...,4,Who knoe? alll I've eva heard her say is that...
4,"Well, no...<|endoftext|>Then that's all you ha...",You always been this selfish?<|endoftext|>,"Well, no...<|endoftext|>Then dat's all you had...","Well, no...<|endoftext|>Then that's all you ha...",0,You always been dis selfish?<|endoftext|>
...,...,...,...,...,...,...
27412,Where are we going?<|endoftext|>To find Rogue....,The brainwaves of mutants are quite different ...,Where are we going?<|endoftext|>To find Rogue....,Where are we going?<|endoftext|>To find Rogue....,0,The brainwaves of mutants quite diffrent than ...
27413,You designed this yourself?<|endoftext|>Actual...,We were friends once... But that was a long t...,Yu done designed dis yourself?<|endoftext|>Act...,You<a href='been_done' title='1'><mark></mark>...,1,We was frens once... But that was a long time...
27414,Why is everybody up at sunrise?<|endoftext|>Th...,She borrowed your power to save her life. Whe...,Why everybody up at sunrise?<|endoftext|>da su...,Why <a href='drop_aux' title='1'><mark>is</mar...,2,She borrowed ur power to save her life. When ...
27415,Is that what you're looking for?<|endoftext|>A...,Enough for a test.<|endoftext|>,Is that wat u're lookin fa?<|endoftext|>A piec...,Is that what you're looking for?<|endoftext|>A...,0,Enough for a test.<|endoftext|>


In [13]:
file_path = 'runs/21/dataset/dialogpt/morp/dailydialog_dataset_morp.csv'
df = pd.read_csv(file_path)
df['history'] = df['history'].apply(to_bst)
df['history_aave'] = df['history_aave'].apply(to_bst)
df.to_csv('runs/21/dataset/bst/morp/dailydialog_dataset_morp.csv', index=False)

## Length Control

### fit the models

#### bst

In [2]:
file_path = 'runs/21/dataset/bst/all/dailydialog_dataset_all.csv'

tokenizer = BlenderbotTokenizer.from_pretrained("facebook/blenderbot-1B-distill")
df = pd.read_csv(file_path)
drop_idx = []
for i, (chat, chat_aave) in tqdm(enumerate(zip(df['history'], df['history_aave']))):
  chat_history_ids = tokenizer.encode(chat, return_tensors='pt').to(device=device)
  chat_history_aave_ids = tokenizer.encode(chat_aave, return_tensors='pt').to(device=device)
  if chat_history_aave_ids.shape[1] > 90 or chat_history_ids.shape[1] > 90:
    drop_idx.append(i)

1it [00:04,  4.86s/it]Token indices sequence length is longer than the specified maximum sequence length for this model (169 > 128). Running this sequence through the model will result in indexing errors
10249it [00:12, 794.71it/s] 


In [4]:
len(drop_idx)

857

In [6]:
file_path = 'runs/21/dataset/bst/morp/dailydialog_dataset_morp.csv'
df = pd.read_csv(file_path)
df = df.drop(drop_idx)
df.to_csv(file_path, index=False)

#### dialogpt

In [18]:
file_path = 'runs/21/dataset/dialogpt/all/cornell_movie_dataset_all.csv'

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
df = pd.read_csv(file_path)
drop_idx = []
for i in tqdm(range(len(df['history']))):
    chat_history_sae_ids = tokenizer.encode(df.iloc[i]['history'], return_tensors='pt').to(device=device)
    chat_history_aave_ids = tokenizer.encode(df.iloc[i]['history_aave'], return_tensors='pt').to(device=device)
    if chat_history_sae_ids.shape[1] >= 1000 or chat_history_aave_ids.shape[1] >= 1000:
        drop_idx.append(i)

100%|██████████| 27417/27417 [00:19<00:00, 1418.65it/s]


### up to n turns

In [12]:
def find_nth_occur(string, n):
    for i, match in enumerate(re.finditer('\<\|endoftext\|\>', string)):
        if (i == n - 1):
            return match.end()

In [20]:
df = pd.read_csv('runs/13/orig/cornell_movie_dataset_full.csv')

In [21]:
idx = []
for i, (sae, aave) in enumerate(zip(df['history'], df['history_aave'])):
    if sae.count('<|endoftext|>') > 3:
        idx.append(i)

for i in idx:
    sae = df.iloc[i]['history']
    third_occur = find_nth_occur(sae, 3)
    fourth_occur = find_nth_occur(sae, 4)
    df.iloc[i]['groundtruth'] = sae[third_occur: fourth_occur]
    df.iloc[i]['history'] = sae[:third_occur]

    aave = df.iloc[i]['history_aave']
    third_occur = find_nth_occur(aave, 3)
    fourth_occur = find_nth_occur(aave, 4)
    df.iloc[i]['groundtruth_aave'] = aave[third_occur: fourth_occur]
    df.iloc[i]['history_aave'] = aave[:third_occur]


In [22]:
df.to_csv('runs/13/orig/cornell_movie_dataset_short.csv', index=False)

## AAVE Density Control

In [7]:
def count_aave_features(s):
    return len(re.findall(r'<a href=.*?<mark>', s))

In [61]:
df = pd.read_csv('runs/13/orig/cornell_movie_dataset_short.csv')
# df = pd.read_csv('runs/13/orig/personachat_dataset_short.csv')
# df = pd.read_csv('runs/13/orig/dailydialog_dataset_short.csv')

drop_idx = []
for i, txt in enumerate(df['history_html']):
    if count_aave_features(txt) <= 2:
        drop_idx.append(i)
df = df.drop(drop_idx)

# df.to_csv('runs/16/orig/cornell_movie_dataset_4fts.csv', index=False)
# df.to_csv('runs/16/orig/personachat_dataset_4fts.csv', index=False)
# df.to_csv('runs/16/orig/dailydialog_dataset_4fts.csv', index=False)

df

Unnamed: 0,history,groundtruth,history_html,groundtruth_html,history_aave,groundtruth_aave
3,Sheain't a...<|endoftext|>Lesbian? don't no. ...,Who know? All I've ever heard her say is that...,She<a href='negative_concord' title='1'><mark>...,Who <a href='uninflect' title='1'><mark>knows<...,She's not a...<|endoftext|>Lesbian? No . I f...,Who knows ? All I've ever heard her say is th...
7,"Oh my God, this mean you're becoming normal?<|...",What do you think?<|endoftext|>,"Oh my God, <a href='drop_aux' title='1'><mark>...",What do you think?<|endoftext|>,"Oh my God, does this mean you're becoming norm...",What do you think?<|endoftext|>
8,"Listen, I know you hate having to sit home bec...",I wish I had that luxury. I'm the only sophomo...,"Listen, I know you hate having to sit home bec...",I wish I had that luxury. I'm the only sophomo...,"Listen, I know you hate having to sit home bec...",I wish I had that luxury. I'm the only sophomo...
13,"Now don't get upset. Daddy, but it's this boy....",Then neither will you. And I'll get to sleep ...,"Now don't get upset. Daddy, but <a href='dey/i...",Then neither will you. And I'll get to sleep ...,"Now don't get upset. Daddy, but there 's this ...",Then neither will you. And I'll get to sleep ...
21,What make you think he'll do it?<|endoftext|>H...,They always let felons sit in on Honors Biolog...,What <a href='uninflect' title='1'><mark>makes...,They always let felons sit in on Honors Biolog...,What makes you think he'll do it?<|endoftext|...,They always let felons sit in on Honors Biolog...
...,...,...,...,...,...,...
27376,You know what a wire transfer is?<|endoftext|>...,Sure. Just chisel some off your heart.<|endoft...,<a href='drop_aux' title='1'><mark>Do</mark></...,Sure. Just chisel some off your heart.<|endoft...,Do you know what a wire transfer is?<|endoftex...,Sure. Just chisel some off your heart.<|endoft...
27379,Who are you?<|endoftext|>We done hung out last...,"Hey, take it easy. I'm just a dude tryna make ...",Who are you?<|endoftext|>We<a href='been_done'...,"Hey, take it easy. I'm just a dude trying to m...",Who are you?<|endoftext|>We hung out last nig...,"Hey, take it easy. I'm just a dude trying to m..."
27395,"My name Slovo, Czech secret police. When you h...","First of all, you should kill whoever sold you...",My name <a href='drop_aux' title='1'><mark>is<...,"First of all, you should kill whoever sold you...","My name is Slovo, Czech secret police. When yo...","First of all, you should kill whoever sold you..."
27399,"To us it mean no walls, no speed limits, no ja...",How you gonna do that with government and rule...,To us it <a href='uninflect' title='1'><mark>m...,How you gonna do that with government and <a h...,"To us it means no walls, no speed limits, no ...",How you gonna do that with government and rule...


In [53]:
new_df = pd.read_csv('runs/16/orig/personachat_dataset_4fts.csv')
new_df

Unnamed: 0,history,groundtruth,history_html,groundtruth_html,history_aave,groundtruth_aave
0,Hello do u like animals I got a kitten named L...,"Oh wow, that sound interesting. Are you really...",Hello do u like animals I <a href='got' title=...,"Oh wow, that <a href='uninflect' title='1'><ma...",Hello do u like animals I have a kitten named...,"Oh wow, that sounds interesting. Are you real..."
1,Hello there. How you? I got off from creating ...,I am studying for my masters. Fortunately mom ...,Hello there. How <a href='drop_aux' title='1'>...,I am studying for my masters. Fortunately mom ...,Hello there. How are you? I got off from creat...,I am studying for my masters. Fortunately mom ...
2,Hi anybody there? Ding dong!<|endoftext|>The w...,I love going shopping when I ain't knitting.<|...,Hi anybody there? Ding dong!<|endoftext|>The w...,I love going shopping when I <a href='negative...,Hi anybody there? Ding dong!<|endoftext|>The w...,I love going shopping when I am not knitting....
3,Hi! I am tryna design a new app to get people ...,"I think it is an important thing to consider, ...",Hi! I am trying to design a new app to get peo...,"I think it is an important thing to consider, ...",Hi! I am trying to design a new app to get peo...,"I think it is an important thing to consider, ..."
4,How are you doing today?<|endoftext|>Great! Ju...,Dialectical behavioral therapy. I use it to he...,How are you doing today?<|endoftext|>Great! Ju...,Dialectical behavioral therapy. I use it to he...,How are you doing today?<|endoftext|>Great! Ju...,Dialectical behavioral therapy. I use it to he...
...,...,...,...,...,...,...
436,"__ SILENCE __.<|endoftext|>Hi, I am a big viki...","If you ain't no viking, what you? It cold in m...","__ SILENCE __.<|endoftext|>Hi, I am a big viki...",If you <a href='negative_concord' title='1'><m...,"__ SILENCE __.<|endoftext|>Hi, I am a big viki...","If you are not a viking , what are you? It is..."
437,__ SILENCE __.<|endoftext|>Hi. Iain't much for...,That sound fun. I hope you meant it. I am a li...,__ SILENCE __.<|endoftext|>Hi. I<a href='negat...,That <a href='uninflect' title='1'><mark>sound...,__ SILENCE __.<|endoftext|>Hi. I'm not much f...,That sounds fun. I hope you meant it. I am a ...
438,__ SILENCE __.<|endoftext|>My name mark and I ...,I do not know no days I feel like it but I do ...,__ SILENCE __.<|endoftext|>My name <a href='dr...,I do not know <a href='None' title='1'><mark>s...,__ SILENCE __.<|endoftext|>My name is mark and...,I do not know some days I feel like it but I ...
439,"__ SILENCE __.<|endoftext|>Hi, I was born in F...",Interesting. I've a large family still live in...,"__ SILENCE __.<|endoftext|>Hi, I was born in F...",Interesting. I've a large family <a href='null...,"__ SILENCE __.<|endoftext|>Hi, I was born in F...",Interesting. I've a large family that still li...


## Stats

In [20]:
df = pd.read_csv('runs/16/orig/cornell_movie_dataset_4fts.csv')
df

Unnamed: 0,history,groundtruth,history_html,groundtruth_html,history_aave,groundtruth_aave
0,What make you think he'll do it?<|endoftext|>H...,They always let felons sit in on Honors Biolog...,What <a href='uninflect' title='1'><mark>makes...,They always let felons sit in on Honors Biolog...,What makes you think he'll do it?<|endoftext|...,They always let felons sit in on Honors Biolog...
1,"You been got it, Verona. I pick up the tab, y...",How much?<|endoftext|>,You<a href='been_done' title='1'><mark></mark>...,How much?<|endoftext|>,"You got it, Verona. I pick up the tab, you d...",How much?<|endoftext|>
2,So he got this huge raging fit about Sarah Law...,That's never ain't no proven<|endoftext|,So he <a href='got' title='1'><mark>has</mark>...,That's <a href='negative_concord' title='1'><m...,So he has this huge raging fit about Sarah La...,That's never been proven<|endoftext|
3,"Katarina Stratford. My, my. You've been terr...",I still maintain that he kicked himself in the...,"Katarina Stratford. My, my. You've been terr...",I still maintain that he kicked himself in the...,"Katarina Stratford. My, my. You've been terr...",I still maintain that he kicked himself in the...
4,How far from here?<|endoftext|>I ain't no seam...,I am afraid this ain't the worst news.<|endoft...,How far from here?<|endoftext|>I <a href='nega...,I am afraid this <a href='negative_concord' ti...,How far from here?<|endoftext|>I am not a sea...,I am afraid this is not the worst news.<|endo...
...,...,...,...,...,...,...
1778,"I know, I know, it's just -- he's back on the ...","If he come here, I'll handle him. Remember --...","I know, I know, it's just -- he's back on the ...",If he <a href='uninflect' title='1'><mark>come...,"I know, I know, it's just -- he's back on the ...","If he comes here, I'll handle him. Remember ..."
1779,Sphinx brand. When I got out of weapons desig...,"Oh sweetheart, just a quick one.<|endoftext|>",Sphinx brand. When I got out of weapons desig...,"Oh sweetheart, just a quick one.<|endoftext|>",Sphinx brand. When I got out of weapons desig...,"Oh sweetheart, just a quick one.<|endoftext|>"
1780,"You've really got me confused, Cage. On the on...",Iain't interested. I've already got a job.<|en...,"You've really got me confused, Cage. On the on...",I<a href='negative_concord' title='1'><mark>'m...,"You've really got me confused, Cage. On the on...",I'm not interested. I've already got a job.<|...
1781,"To us it mean no walls, no speed limits, no ja...",How you gonna do that with government and rule...,To us it <a href='uninflect' title='1'><mark>m...,How you gonna do that with government and <a h...,"To us it means no walls, no speed limits, no ...",How you gonna do that with government and rule...


In [21]:
length = 0
length_aave = 0
for i in tqdm(range(df.shape[0])):
    row = df.iloc[i]
    length += len(tokenizer.tokenize(row['history'])) + len(tokenizer.tokenize(row['groundtruth']))
    length_aave += len(tokenizer.tokenize(row['history_aave'])) + len(tokenizer.tokenize(row['groundtruth_aave']))
print(length / df.shape[0])
print(length_aave / df.shape[0])

 58%|█████▊    | 1026/1783 [00:00<00:00, 1144.92it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1266 > 1024). Running this sequence through the model will result in indexing errors
100%|██████████| 1783/1783 [00:01<00:00, 1111.59it/s]

112.53112731351655
117.33482893998878



