# Notebook for creating the JSONL file needed to fine-tune models

# OpenAI

I split the notebook into headings like this in case I have time to fine-tune more than one model and one of them comes from another platform, such as Google Cloud, AWS, or others, that for whatever reason requires a different file type.

The example uses a CSV file with two columns:
- prompt: the example question posed to the model.
- completion: the example answer the model should give.

Details:
- The completion column must start with a space.
- The completion column must always end with a fixed word or token; in the example it is END.

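The two details above can be applied with a small helper. This is only a minimal sketch and not part of the original notebook: `format_completion` and `STOP_TOKEN` are hypothetical names, and putting a single space before END is an assumption.

```python
import pandas as pd

STOP_TOKEN = "END"  # hypothetical constant; the example uses END as the closing word


def format_completion(text, stop_token=STOP_TOKEN):
    """Return the text with one leading space and the stop token at the end."""
    # Strip surrounding whitespace first so the leading space is added exactly once.
    return f" {str(text).strip()} {stop_token}"


# Toy data standing in for the real prompt/completion table.
df_example = pd.DataFrame({
    "prompt": ["Generate a scene in Car."],
    "completion": ["Marge Simpson: Ooo, careful, Homer."],
})
df_example["completion"] = df_example["completion"].apply(format_completion)
```

The same `apply` call could be run on the completion column of the real table before it is saved.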
## Libraries and loading data

In [331]:
import numpy
import pandas as pd


In [332]:
df_scripts = pd.read_csv('The Saimpsons Archive/in_use/simpsons_script_lines.csv')
df_ep = pd.read_csv('The Saimpsons Archive/in_use/simpsons_episodes.csv')

In [333]:
df_scripts.sample()

Unnamed: 0,id,episode_id,number,raw_text,timestamp_in_ms,speaking_line,character_id,location_id,raw_character_text,raw_location_text,spoken_words,normalized_text,word_count
26735,36449,126,257,"Homer Simpson: It's too late for me, Marge! Se...",1172000,True,2.0,273.0,Homer Simpson,GARAGE,"It's too late for me, Marge! Sell the jeans an...",its too late for me marge sell the jeans and l...,14.0


In [334]:
df_ep.sample()

Unnamed: 0,id,imdb_rating,imdb_votes,number_in_season,number_in_series,original_air_date,original_air_year,production_code,season,title,us_viewers_in_millions,views
473,507,6.7,408.0,21,507,2012-05-13,2012,PABF15,23,Ned 'n Edna's Blend,4.07,47005.0


In [335]:
df_scripts = pd.merge(df_scripts, df_ep[['id', 'season']], left_on='episode_id', right_on='id')
df_scripts.sample()

Unnamed: 0,id_x,episode_id,number,raw_text,timestamp_in_ms,speaking_line,character_id,location_id,raw_character_text,raw_location_text,spoken_words,normalized_text,word_count,id_y,season
40293,45694,161,135,Ned Flanders: (A COUPLE OF EXASPERATED BREATHS),716000,False,11.0,1525.0,Ned Flanders,NEW FLANDERS HOUSE,,,0.0,161,8


## Joining the text and key information for each scene
A scene is defined as everything that happens in one specific location within one specific episode.

In [337]:
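def join_text(text):
    # Concatenate every line of a scene into a single string.
    return ' '.join(text)

def agg_chars(char_id):
    # Collect the distinct values of a scene (set() does not preserve order).
    return list(set(char_id))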

In [338]:
df_scripts_agg = df_scripts[['episode_id',
                            'location_id',
                            'raw_location_text',
                            'character_id',
                            'raw_character_text',
                            'raw_text']].groupby(['episode_id',
                                                'location_id']).agg({'raw_text': join_text,
                                                                    'character_id': agg_chars,
                                                                    'raw_character_text': agg_chars,
                                                                    'raw_location_text': agg_chars}).reset_index()

In [339]:
df_scripts_agg.sample(5)

Unnamed: 0,episode_id,location_id,raw_text,character_id,raw_character_text,raw_location_text
3333,174,3.0,(Springfield Elementary School: INT. SPRINGFIE...,"[3.0, 39.0, 9.0, 14.0, 15.0, nan]","[Lisa Simpson, Kids, Seymour Skinner, nan, Way...",[Springfield Elementary School]
6965,480,3889.0,(OUTDOOR HALF-SHELL STAGE: EXT. outdoor half-s...,"[321.0, nan, 5830.0]","[Cheech, Audience, nan]",[OUTDOOR HALF-SHELL STAGE]
5699,354,8.0,Barber: (ROLLS EYES) You're the boss. Bart Sim...,"[nan, 2.0, 3.0, nan, 4415.0, 101.0, 8.0, 9.0, ...","[Lisa Simpson, Homer Simpson, Barber, Ralph Wi...",[Springfield Mall]
288,13,205.0,(Ye Olde Off-Ramp Inn Motel Room: INT. YE OLDE...,"[1.0, 2.0, nan, nan]","[Homer Simpson, Marge Simpson, nan]",[Ye Olde Off-Ramp Inn Motel Room]
4569,303,5.0,(Simpson Home: INT. simpson house - TV room - ...,"[1.0, 2.0, nan, 8.0, 9.0, nan, nan, nan, 401.0...","[Lisa Simpson, Homer Simpson, Frankenstein, Ke...",[Simpson Home]

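The sample above shows nan repeated inside the character_id lists: NaN values do not compare equal to each other, so `set()` can keep several of them. One possible alternative, sketched here under the assumption that missing ids can simply be discarded, is to drop them before deduplicating. `agg_unique` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def agg_unique(values):
    """Return the distinct non-missing values, in order of first appearance."""
    # dropna() removes NaN first; unique() then keeps one copy of each value.
    return pd.Series(values).dropna().unique().tolist()


# Toy example with repeated ids and repeated missing values.
ids = pd.Series([2.0, float("nan"), 23.0, 2.0, float("nan")])
unique_ids = agg_unique(ids)
```

A helper like this could be passed to `.agg` in place of `agg_chars`; the prompt-building step would then no longer need to filter out nan.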

In [340]:
df_scripts_agg['raw_text'].iloc[21]

'Homer Simpson: (GASPS) Bart, did you hear that? What a name -- "Santa\'s Little Helper". It\'s a sign! It\'s an omen! (PADDOCK: ext. paddock - night) Homer Simpson: Hey, Barney, which one is Whirlwind? Barney Gumble: Number six. That\'s our lucky dog, right over there. He\'s won his last five races. Homer Simpson: What! That scrawny little bag of bones? Bart Simpson: Come on, Dad. They\'re all scrawny little bags of bones. Homer Simpson: (RESIGNED) Yeah, you\'re right. (SIGHS) I guess Whirlwind is our only hope for a Merry Christmas. Announcer: (THRU P.A.) Attention racing fans. We have a late scratch in the fourth race. Number eight, Sir Galahad, will be replaced by Santa\'s Little Helper. Once again, Sir Galahad has been replaced by Santa\'s Little Helper. Bart Simpson: It\'s a coincidence, Dad.'

## Generating prompts for each scene

In [342]:
def prompts(characters, location):
    # Build the prompt for one scene from its character and location names.
    separador_char = ', '
    separador_loc = ''
    # Drop missing values (NaN) before building the text.
    characters = [char for char in characters if str(char) != 'nan']
    location = [loc for loc in location if str(loc) != 'nan']

    if not characters and not location:
        pregunta = "Generate a scene."
    elif not characters:
        pregunta = f"Generate a scene in {separador_loc.join(location)}."
    elif not location:
        pregunta = f"Generate a scene with the characters: {separador_char.join(characters)}."
    else:
        pregunta = f"Generate a scene in {separador_loc.join(location)} with the characters: {separador_char.join(characters)}."

    return pregunta


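A location separator is needed because the groupby was done on the location id, not on the raw location text, so the same place may be referred to by slightly different names within the same episode. Note that `separador_loc` is the empty string, so when a scene has several name variants they are joined with nothing between them.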

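To see how often a scene really carries more than one location name, the length of each name list can be checked. The sketch below uses toy data shaped like `df_scripts_agg`; the `n_names` column and the ` / ` separator are hypothetical choices for illustration only.

```python
import pandas as pd

# Toy frame mimicking df_scripts_agg: each row holds the list of location names of a scene.
df_toy = pd.DataFrame({
    "episode_id": [1, 1, 2],
    "location_id": [5.0, 2.0, 5.0],
    "raw_location_text": [["Simpson Home", "SIMPSON HOME"], ["Car"], ["Simpson Home"]],
})

# Count the names in each scene and keep the scenes with more than one.
df_toy["n_names"] = df_toy["raw_location_text"].apply(len)
multi_name = df_toy[df_toy["n_names"] > 1]

# With an empty separator the variants run together; a visible one keeps them readable.
names = multi_name["raw_location_text"].iloc[0]
joined_empty = "".join(names)
joined_slash = " / ".join(names)
```

If many scenes turn up in `multi_name`, a visible separator (or keeping only the first name) may give cleaner prompts.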
In [343]:
df_scripts_agg.columns

Index(['episode_id', 'location_id', 'raw_text', 'character_id',
       'raw_character_text', 'raw_location_text'],
      dtype='object')

In [344]:
df_scripts_agg['prompts'] = df_scripts_agg.apply(
    lambda row: prompts(
        characters=row['raw_character_text'],
        location=row['raw_location_text']
    ),
    axis=1
)


In [345]:
df_scripts_agg

Unnamed: 0,episode_id,location_id,raw_text,character_id,raw_character_text,raw_location_text,prompts
0,1,1.0,(Street: ext. street - establishing - night) (...,"[2.0, nan, 23.0, nan]","[Homer Simpson, Voice, nan]",[Street],Generate a scene in Street with the characters...
1,1,2.0,"(Car: int. car - night) Marge Simpson: Ooo, ca...","[nan, 2.0, 1.0]","[Marge Simpson, Homer Simpson, nan]",[Car],Generate a scene in Car with the characters: M...
2,1,3.0,(Springfield Elementary School: Ext. springfie...,[nan],[nan],[Springfield Elementary School],Generate a scene in Springfield Elementary Sch...
3,1,4.0,(Auditorium: int. auditorium - night) Marge Si...,"[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, nan]","[Homer Simpson, Dewey Largo, Todd Flanders, Se...",[Auditorium],Generate a scene in Auditorium with the charac...
4,1,5.0,(Simpson Home: int. simpson house - living roo...,"[1.0, 2.0, 8.0, 9.0, 10.0, 11.0, 12.0, 22.0, 2...","[Homer Simpson, Lisa Simpson, Grampa Simpson, ...",[Simpson Home],Generate a scene in Simpson Home with the char...
...,...,...,...,...,...,...,...
7392,568,4455.0,(SKOBO'S: EXT. SKOBO'S - ESTABLISHING) Rev. Ti...,"[192.0, 699.0, 140.0, nan]","[Rev. Timothy Lovejoy, Agnes Skinner, Sideshow...",[SKOBO'S],Generate a scene in SKOBO'S with the character...
7393,568,4456.0,(FLANDERS' BASEMENT: int. flanders' basement -...,"[1.0, 11.0, 140.0, 208.0, 699.0, nan]","[Apu Nahasapeemapetilon, Rev. Timothy Lovejoy,...",[FLANDERS' BASEMENT],Generate a scene in FLANDERS' BASEMENT with th...
7394,568,4457.0,(CASINO FLOOR: Int. casino floor - continuous)...,"[3040.0, 1.0, 2.0, nan, 11.0, 75.0, nan, nan]","[Homer Simpson, Crowd, Casino Manager, Ned Fla...",[CASINO FLOOR],Generate a scene in CASINO FLOOR with the char...
7395,568,4458.0,(BURNED CHURCH: ext. burned church - continuou...,"[1.0, 11.0, 140.0, nan, 699.0]","[Rev. Timothy Lovejoy, Ned Flanders, Marge Sim...",[BURNED CHURCH],Generate a scene in BURNED CHURCH with the cha...


## Creating the CSV of prompts + completions

In [346]:
df_prompts = df_scripts_agg[['prompts', 'raw_text']]
df_prompts.head()

Unnamed: 0,prompts,raw_text
0,Generate a scene in Street with the characters...,(Street: ext. street - establishing - night) (...
1,Generate a scene in Car with the characters: M...,"(Car: int. car - night) Marge Simpson: Ooo, ca..."
2,Generate a scene in Springfield Elementary Sch...,(Springfield Elementary School: Ext. springfie...
3,Generate a scene in Auditorium with the charac...,(Auditorium: int. auditorium - night) Marge Si...
4,Generate a scene in Simpson Home with the char...,(Simpson Home: int. simpson house - living roo...


In [347]:
df_prompts = df_prompts.rename(columns={'prompts': 'prompt',
                                        'raw_text': 'completion'})
df_prompts.sample()

Unnamed: 0,prompt,completion
2798,Generate a scene in SEWER DRAIN with the chara...,(SEWER DRAIN: int. sewer drain - a minute late...


In [348]:
df_prompts['completion'].iloc[5]

'(KITCHEN: int. kitchen - morning) Marge Simpson: Kids, you want to go Christmas shopping? Lisa Simpson: I do! Bart Simpson: All right, the mall! Marge Simpson: Go get your money. Homer Simpson: Spill it, Marge. Where have you been hiding the Christmas money? Marge Simpson: Oh, I have my secrets. Turn around. Marge Simpson: You can look now. Homer Simpson: Oh! Big jar this year.'

In [349]:
df_prompts.to_csv('Modelos/archivos_jsonl/prompts_example_2.csv', index=False)

This is the typical format for the davinci and babbage models.
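
Since the stated goal of the notebook is a JSONL file, the CSV can be turned into one JSON object per line with pandas. This is a minimal sketch on toy data: whether a given platform expects exactly the `prompt` and `completion` keys is an assumption to check against its documentation.

```python
import io
import json

import pandas as pd

# Toy stand-in for the CSV written above (same two columns).
csv_text = 'prompt,completion\n"Generate a scene in Car."," (Car: int. car - night) END"\n'
df_pairs = pd.read_csv(io.StringIO(csv_text))

# orient="records" with lines=True produces one JSON object per line (JSONL).
jsonl_text = df_pairs.to_json(orient="records", lines=True, force_ascii=False)

# Each line parses back into a dict with the two expected keys.
records = [json.loads(line) for line in jsonl_text.splitlines() if line]
```

Passing a file path as the first argument of `to_json` writes the result to disk instead of returning a string.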