<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Fine-tune

---

[source](https://github.com/kjappelbaum/gpt3forchem/blob/main/gpt3forchem/api_wrappers.py#L32){target="_blank" style="float:right; font-size:smaller"}

### get_ft_model_name

>      get_ft_model_name (ft_id, sleep=60)

In [None]:
get_ft_model_name('ft-jK0ziTZX6y5d2DXbB3kdct4w')

'ada:ft-lsmoepfl-2022-08-18-00-13-00'

In [None]:
get_ft_model_name('ft-f5xtILIGM6yjvLrj0J1GH5FQ')

'ada:ft-lsmoepfl-2022-09-02-14-15-28'

---

[source](https://github.com/kjappelbaum/gpt3forchem/blob/main/gpt3forchem/api_wrappers.py#L40){target="_blank" style="float:right; font-size:smaller"}

### fine_tune

>      fine_tune (train_file, valid_file, model:str='ada', n_epochs:int=4,
>                 sleep:int=120)

Run the fine-tuning of a GPT-3 model via the OpenAI API.

This function waits until the fine-tuning job is complete.
A job often sits in the queue first, so the model id is not available right away.
In that case, the job status is checked every `sleep` seconds until the job finishes.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| train_file |  |  | path to json file with training prompts (column names "prompt" and "completion") |
| valid_file |  |  | path to json file with validation prompts (column names "prompt" and "completion") |
| model | str | ada | model type to use. One of "ada", "babbage", "curie", "davinci". "ada" is the default (and cheapest). |
| n_epochs | int | 4 | number of epochs to fine-tune for |
| sleep | int | 120 | number of seconds to wait between checking the status of the fine-tuning task |
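
The waiting logic described above can be sketched as a simple polling loop. This is a minimal illustration, not the library's implementation: `wait_for_model_name`, the `get_job` callable, the `max_checks` limit, and the `"status"` / `"fine_tuned_model"` keys of the job record are all assumptions made for this example.

```python
import time
from typing import Callable, Optional


def wait_for_model_name(
    get_job: Callable[[], dict],
    sleep: float = 120,
    max_checks: Optional[int] = None,
) -> str:
    """Poll `get_job` until the job record contains a fine-tuned model name.

    `get_job` is a hypothetical callable returning the current job record
    as a dict with the (assumed) keys "status" and "fine_tuned_model".
    """
    checks = 0
    while True:
        job = get_job()
        model_name = job.get("fine_tuned_model")
        if model_name is not None:
            return model_name
        # Stop early if the job can no longer produce a model.
        if job.get("status") in ("failed", "cancelled"):
            raise RuntimeError(f"Fine-tuning job ended with status {job['status']!r}")
        checks += 1
        if max_checks is not None and checks >= max_checks:
            raise TimeoutError("Fine-tuning job did not finish in time")
        time.sleep(sleep)


# Usage with a fake job that finishes on the third status check.
_responses = iter(
    [
        {"status": "pending", "fine_tuned_model": None},
        {"status": "running", "fine_tuned_model": None},
        {"status": "succeeded", "fine_tuned_model": "ada:ft-example"},
    ]
)
model_name = wait_for_model_name(lambda: next(_responses), sleep=0)
print(model_name)
```

Passing the status check in as a callable keeps the loop independent of any particular API client, which also makes it easy to test with canned responses.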

## Predict

Some helpers to make it easier to get completions from the API.

---

[source](https://github.com/kjappelbaum/gpt3forchem/blob/main/gpt3forchem/api_wrappers.py#L77){target="_blank" style="float:right; font-size:smaller"}

### query_gpt3

>      query_gpt3 (model:str, df:pandas.core.frame.DataFrame,
>                  temperature:float=0, max_tokens:int=10, sleep:float=5,
>                  one_by_one:bool=False, parallel_max:int=20)

Get completions for all prompts in a dataframe.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| model | str |  | name of the model to use, e.g. "ada:ft-personal-2022-08-24-10-41-29" |
| df | DataFrame |  | hashable dataframe with prompts and expected completions (column names "prompt" and "completion") |
| temperature | float | 0 | temperature, 0 is the default and corresponds to argmax |
| max_tokens | int | 10 | maximum number of tokens to generate |
| sleep | float | 5 | number of seconds to wait between queries |
| one_by_one | bool | False | if True, generate one completion at a time (i.e., do not submit the maximum number of prompts per request) |
| parallel_max | int | 20 | maximum number of prompts that can be sent per request |
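
To illustrate how `one_by_one` and `parallel_max` might interact, the sketch below splits a list of prompts into per-request batches. The helper `chunk_prompts` is hypothetical and only shows the batching idea; how `query_gpt3` actually groups prompts internally is not documented here.

```python
from typing import Iterator, List


def chunk_prompts(
    prompts: List[str], one_by_one: bool = False, parallel_max: int = 20
) -> Iterator[List[str]]:
    """Yield batches of prompts, each meant to be sent in one request."""
    # With one_by_one, every request carries a single prompt;
    # otherwise each request carries up to parallel_max prompts.
    size = 1 if one_by_one else parallel_max
    for start in range(0, len(prompts), size):
        yield prompts[start : start + size]


prompts = [f"prompt {i}" for i in range(45)]
batch_sizes = [len(batch) for batch in chunk_prompts(prompts)]
single_sizes = [len(batch) for batch in chunk_prompts(prompts, one_by_one=True)]
print(batch_sizes)  # [20, 20, 5]
```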

---

[source](https://github.com/kjappelbaum/gpt3forchem/blob/main/gpt3forchem/api_wrappers.py#L130){target="_blank" style="float:right; font-size:smaller"}

### extract_prediction

>      extract_prediction (completion, i:int=0)

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| completion |  |  | dictionary with "choices" key returned by the API |
| i | int | 0 | index of the "choice" (relevant if multiple completions have been returned) |
| **Returns** | **str** |  |  |

In [None]:
example_pred = {
    "choices": [{"finish_reason": "length", "index": 0, "text": " 0@@@@@@@"}]
}

In [None]:
extract_prediction(example_pred)

'0'
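
The example above suggests that everything from the first `@` onwards is padding and that surrounding whitespace is dropped. A minimal sketch of that behaviour, assuming `@` acts as the stop marker (the function name `extract_prediction_sketch` is hypothetical and this is not necessarily how the library implements it):

```python
def extract_prediction_sketch(completion: dict, i: int = 0) -> str:
    """Return the text of choice `i` up to the first "@" (assumed stop marker)."""
    text = completion["choices"][i]["text"]
    # Keep only what precedes the padding, then drop surrounding whitespace.
    return text.split("@")[0].strip()


example_pred = {
    "choices": [{"finish_reason": "length", "index": 0, "text": " 0@@@@@@@"}]
}
result = extract_prediction_sketch(example_pred)
print(repr(result))  # '0'
```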

---

[source](https://github.com/kjappelbaum/gpt3forchem/blob/main/gpt3forchem/api_wrappers.py#L138){target="_blank" style="float:right; font-size:smaller"}

### extract_regression_prediction

>      extract_regression_prediction (completion, i:int=0)

Similar to `extract_prediction`, but returns a float.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| completion |  |  | dictionary with "choices" key returned by the API |
| i | int | 0 | index of the "choice" (relevant if multiple completions have been returned) |
| **Returns** | **float** |  |  |

In [None]:
example_pred = {
    "choices": [{"finish_reason": "length", "index": 0, "text": " -8.2@@@@@@@"}]
}

In [None]:
extract_regression_prediction(example_pred)

-8.2
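
For regression, the extracted text has to be converted to a number, and a model can return text that does not parse. The sketch below shows one way to handle that case, under the same assumption that `@` marks the end of the prediction. Returning `None` on a parse failure is a choice made for this example; the documented function is only stated to return a float.

```python
from typing import Optional


def extract_regression_sketch(completion: dict, i: int = 0) -> Optional[float]:
    """Parse choice `i` as a float; return None if the text is not a number."""
    text = completion["choices"][i]["text"].split("@")[0].strip()
    try:
        return float(text)
    except ValueError:
        # Assumption for this sketch: signal unparsable output with None.
        return None


good = {"choices": [{"finish_reason": "length", "index": 0, "text": " -8.2@@@@@@@"}]}
bad = {"choices": [{"finish_reason": "length", "index": 0, "text": " n/a@@@"}]}
parsed = extract_regression_sketch(good)
unparsed = extract_regression_sketch(bad)
print(parsed, unparsed)  # -8.2 None
```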

---

[source](https://github.com/kjappelbaum/gpt3forchem/blob/main/gpt3forchem/api_wrappers.py#L150){target="_blank" style="float:right; font-size:smaller"}

### extract_inverse_prediction

>      extract_inverse_prediction (completion, i=0)

Extracts the prediction of a molecule/material generative task.

In [None]:
example_inverse_predictions = {
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "text": "CC1=C(C(C)=NN1)/N=N/C2=CC=C(C(F)(F)F)C=C2@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
        },
        {
            "finish_reason": "length",
            "index": 1,
            "text": "CC(C=C(N(CCC#N)CCO)C=C1)=C1/N=N/C2=CC=CC=C2@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
        },
        {
            "finish_reason": "length",
            "index": 2,
            "text": "CC(C=C(N(CCC#N)CCO)C=C1)=C1/N=N/C2=CC=CC=C2@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@",
        },
    ]
}

[extract_inverse_prediction(example_inverse_predictions, i) for i in range(3)]

['CC1=C(C(C)=NN1)/N=N/C2=CC=C(C(F)(F)F)C=C2',
 'CC(C=C(N(CCC#N)CCO)C=C1)=C1/N=N/C2=CC=CC=C2',
 'CC(C=C(N(CCC#N)CCO)C=C1)=C1/N=N/C2=CC=CC=C2']

---

[source](https://github.com/kjappelbaum/gpt3forchem/blob/main/gpt3forchem/api_wrappers.py#L158){target="_blank" style="float:right; font-size:smaller"}

### train_test_loop

>      train_test_loop (df:pandas.core.frame.DataFrame, train_size:int,
>                       prompt_create_fn:callable,
>                       random_state:int, stratify:Optional[str]=None,
>                       test_subset:Optional[int]=None)

Run the full training and testing process for the classification task.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| df | DataFrame |  | dataframe with prompts and expected completions (column names "prompt" and "completion"). Split will be performed within this function. |
| train_size | int |  | number of rows to use for training |
| prompt_create_fn | callable |  | function to create a prompt from a row of the dataframe |
| random_state | int |  | random state for splitting the dataframe |
| stratify | typing.Optional[str] | None | column name to use for stratification |
| test_subset | typing.Optional[int] | None | number of rows to use for testing. If None, use the remainder of the dataframe. |
| **Returns** | **dict** |  |  |
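
The roles of `train_size`, `random_state`, and `test_subset` can be illustrated with a small, dependency-free split. This is a sketch only: `split_rows` is a hypothetical helper that works on a list of dicts instead of a dataframe, and it leaves out stratification.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_rows(
    rows: Sequence[dict],
    train_size: int,
    random_state: int,
    test_subset: Optional[int] = None,
) -> Tuple[List[dict], List[dict]]:
    """Shuffle reproducibly, then split into train and test rows."""
    if train_size > len(rows):
        raise ValueError("train_size exceeds the number of rows")
    shuffled = list(rows)
    random.Random(random_state).shuffle(shuffled)
    train = shuffled[:train_size]
    test = shuffled[train_size:]
    # If test_subset is given, keep only that many test rows;
    # otherwise the test set is the remainder.
    if test_subset is not None:
        test = test[:test_subset]
    return train, test


rows = [{"prompt": f"p{i}", "completion": str(i % 2)} for i in range(10)]
train, test = split_rows(rows, train_size=6, random_state=0, test_subset=2)
print(len(train), len(test))  # 6 2
```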

## Deep ensemble

[Deep ensembles](https://cims.nyu.edu/~andrewgw/deepensembles/) are a powerful technique to make neural networks "Bayesian". They can make the networks more robust and can also be used to obtain uncertainty estimates.

Typically, they rely on the inherent randomness in the training of a model that comes from random initialization. However, when we fine-tune a model, we always start from the same weights. Hence, we anticipate that we'll need to subsample the data to achieve enough randomness.
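
As a rough illustration of this idea, each ensemble member can be given its own random subsample of the training rows. The helper `subsample_training_sets` and its `seed` argument are hypothetical; the defaults of 10 models and a fraction of 0.8 mirror the `num_models` and `subsample` defaults of `ensemble_fine_tune`, but how that function actually draws its samples is an assumption here.

```python
import random
from typing import List, Sequence


def subsample_training_sets(
    rows: Sequence, num_models: int = 10, subsample: float = 0.8, seed: int = 0
) -> List[list]:
    """Draw one random subset (without replacement) per ensemble member."""
    rng = random.Random(seed)
    k = int(len(rows) * subsample)
    return [rng.sample(list(rows), k) for _ in range(num_models)]


rows = list(range(100))
subsets = subsample_training_sets(rows, num_models=3)
print([len(s) for s in subsets])  # [80, 80, 80]
```

Each subset would then be written to its own training file and fine-tuned separately, so that the resulting models differ even though they start from the same weights.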

---

[source](https://github.com/kjappelbaum/gpt3forchem/blob/main/gpt3forchem/api_wrappers.py#L218){target="_blank" style="float:right; font-size:smaller"}

### multiple_fine_tunes

>      multiple_fine_tunes (train_files, valid_files)

---

[source](https://github.com/kjappelbaum/gpt3forchem/blob/main/gpt3forchem/api_wrappers.py#L227){target="_blank" style="float:right; font-size:smaller"}

### ensemble_fine_tune

>      ensemble_fine_tune (train_frame, valid_frame, num_models:int=10,
>                          subsample:float=0.8, run_file_dir:str='run_files',
>                          filename_base_string:str='')

---

[source](https://github.com/kjappelbaum/gpt3forchem/blob/main/gpt3forchem/api_wrappers.py#L261){target="_blank" style="float:right; font-size:smaller"}

### multiple_query_gpt3

>      multiple_query_gpt3 (models:List[str], df:pandas.core.frame.DataFrame,
>                           temperature:float=0, max_tokens:int=10,
>                           sleep:float=5, one_by_one:bool=False,
>                           parallel_max:int=20)

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| models | typing.List[str] |  | names of the models to use, e.g. "ada:ft-personal-2022-08-24-10-41-29" |
| df | DataFrame |  | dataframe with prompts and expected completions (column names "prompt" and "completion") |
| temperature | float | 0 | temperature, 0 is the default and corresponds to argmax |
| max_tokens | int | 10 | maximum number of tokens to generate |
| sleep | float | 5 | number of seconds to wait between queries |
| one_by_one | bool | False | if True, generate one completion at a time (i.e., do not submit the maximum number of prompts per request) |
| parallel_max | int | 20 | maximum number of prompts that can be sent per request |
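
Once every model in the ensemble has been queried, the per-model predictions can be combined into a mean and a spread for each prompt. The sketch below assumes numeric predictions that have already been extracted, arranged as one list per model; `summarize_ensemble` is a hypothetical helper and does not reflect the return format of `multiple_query_gpt3`.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def summarize_ensemble(
    predictions_per_model: Sequence[Sequence[float]],
) -> List[Tuple[float, float]]:
    """Return (mean, standard deviation) across models for each prompt.

    predictions_per_model[m][j] is the prediction of model m for prompt j.
    """
    # zip(*...) regroups the values so that each tuple holds all
    # model predictions for a single prompt.
    per_prompt = zip(*predictions_per_model)
    return [(mean(values), pstdev(values)) for values in per_prompt]


predictions = [[1.0, 2.0], [1.0, 4.0], [1.0, 3.0]]  # 3 models, 2 prompts
summary = summarize_ensemble(predictions)
print(summary[0])  # (1.0, 0.0)
```

A standard deviation of zero means all models agree on that prompt; larger values indicate prompts where the ensemble is less certain.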

## Embeddings

This is not used, because the OpenAI API currently does not allow retrieving the internals of custom models.