R package for exploring samples of text generated by OpenAI's GPT-2 language model

kanishkamisra/gpt2samples

gpt2samples

The goal of gpt2samples is to help users explore the sample texts generated by OpenAI’s GPT-2, a transformer-based language model.

An original implementation of a smaller version of GPT-2 can be found here, and the original sample text files can be found here.

Data

This package contains the following data, stored as tibbles:

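| tibble | description |
| --- | --- |
| conditional | Conditionally generated samples, with context prompts from the WebText test corpus, default settings (temperature 1 and no truncation). |
| conditional-t07 | Conditionally generated samples, with context prompts from the WebText test corpus, with temperature 0.7. |
| conditional-topk40 | Conditionally generated samples, with context prompts from the WebText test corpus, with truncation and top_k 40. |
| unconditional | Unconditionally generated samples, default settings (temperature 1 and no truncation). |
| unconditional-t07 | Unconditionally generated samples, with temperature 0.7. |
| unconditional-topk40 | Unconditionally generated samples, with truncation and top_k 40. |

The names above match the values in the file column of each tibble. The example further down accesses the temperature 0.7 unconditional samples as unconditional_t07, so the hyphenated tibbles appear to be exported with underscores in place of hyphens.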
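The sampling settings named in the table can be illustrated with a small base R sketch. It shows how temperature and top-k truncation reshape a next-token probability distribution. It is not OpenAI's implementation: the function `sample_probs()` and the logits are hypothetical, and it assumes that a `top_k` of 0 means "no truncation".

```r
# Illustrative sketch only: how temperature and top-k truncation change a
# next-token distribution. The function name and the logits are made up.
sample_probs <- function(logits, temperature = 1, top_k = 0) {
  # Temperature below 1 sharpens the distribution; above 1 flattens it.
  scaled <- logits / temperature

  # Top-k truncation: keep only the k highest-scoring tokens.
  # Assumption: top_k = 0 disables truncation.
  if (top_k > 0 && top_k < length(scaled)) {
    cutoff <- sort(scaled, decreasing = TRUE)[top_k]
    scaled[scaled < cutoff] <- -Inf
  }

  # Softmax, shifted by the maximum for numerical stability.
  exp_scaled <- exp(scaled - max(scaled))
  exp_scaled / sum(exp_scaled)
}

logits <- c(a = 2.0, b = 1.0, c = 0.5, d = -1.0)

p_default <- sample_probs(logits)                     # temperature 1, no truncation
p_t07     <- sample_probs(logits, temperature = 0.7)  # sharper distribution
p_top2    <- sample_probs(logits, top_k = 2)          # only "a" and "b" keep mass
```

With a lower temperature, the most likely token takes a larger share of the probability mass. With truncation, tokens outside the top k receive probability zero.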

Additionally, all the generated samples (conditional and unconditional) can be explored by calling all_samples().

Installation

You can install the development version of gpt2samples from GitHub with:

# install.packages("gpt2samples")
# install.packages("devtools")
devtools::install_github("kanishkamisra/gpt2samples")

Example

This is a basic example of exploring the data with dplyr verbs:

library(dplyr)
library(gpt2samples)

conditional %>%
  filter(id == 100)
#> # A tibble: 2 x 4
#>   file         id type     text                                            
#>   <chr>     <int> <chr>    <chr>                                           
#> 1 conditio…   100 sample   the waterbody that you are managing, getting pr…
#> 2 conditio…   100 complet… Permit, WDFW ensures that nonconventional child…

unconditional_t07 %>%
  filter(id == 250)
#> # A tibble: 213 x 3
#>    file              id text                                               
#>    <chr>          <int> <chr>                                              
#>  1 unconditional…   250 This question already has an answer here: How do I…
#>  2 unconditional…   250 ""                                                 
#>  3 unconditional…   250 This is a basic question regarding text editing. T…
#>  4 unconditional…   250 ""                                                 
#>  5 unconditional…   250 (A)                                                
#>  6 unconditional…   250 ""                                                 
#>  7 unconditional…   250 (B)                                                
#>  8 unconditional…   250 ""                                                 
#>  9 unconditional…   250 (A)                                                
#> 10 unconditional…   250 ""                                                 
#> # … with 203 more rows

all_samples() %>%
  filter(file == "conditional") %>%
  tail()
#> # A tibble: 6 x 4
#>   file         id type     text                                            
#>   <chr>     <int> <chr>    <chr>                                           
#> 1 conditio…   500 complet… "BOP will be remembered for it's technically in…
#> 2 conditio…   500 complet… ""                                              
#> 3 conditio…   500 complet… There were literal lap times in running the wat…
#> 4 conditio…   500 complet… ""                                              
#> 5 conditio…   500 complet… ""                                              
#> 6 conditio…   500 complet… I was voiced by legendary actor turns down play…

all_samples() %>%
  group_by(file) %>%
  summarise(total_lines = n())
#> # A tibble: 6 x 2
#>   file                 total_lines
#>   <chr>                      <int>
#> 1 conditional                18067
#> 2 conditional-t07            24081
#> 3 conditional-topk40         20405
#> 4 unconditional              19469
#> 5 unconditional-t07          28841
#> 6 unconditional-topk40       21188

For further exploration, text-analysis packages such as Julia Silge and David Robinson’s tidytext can be used to analyze the text generated by GPT-2.
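A minimal sketch of such an analysis follows. It splits the text into words with tidytext's `unnest_tokens()` and counts the most frequent words per file. So that it runs without the package data, it uses a small hand-made tibble, `toy_samples`, which is hypothetical and only mimics the `file`, `id`, and `text` columns shown in the output above. Replacing it with `all_samples()` should work the same way, assuming those column names.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the package data, with the same column names
# as the tibbles printed above. Replace with all_samples() for real use.
toy_samples <- tibble(
  file = c("unconditional", "unconditional", "unconditional-t07"),
  id   = c(1L, 1L, 2L),
  text = c("The model writes text.", "", "The model writes more text.")
)

word_counts <- toy_samples %>%
  filter(text != "") %>%                  # drop the empty lines present in the samples
  unnest_tokens(word, text) %>%           # one lowercased word per row
  anti_join(stop_words, by = "word") %>%  # remove stop words using tidytext's stop_words data
  count(file, word, sort = TRUE)          # word frequency within each file

word_counts
```

The result has one row per file and word, with the count in column `n`, and can be passed on to further dplyr or ggplot2 steps.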

Contributor Code of Conduct

Please note that the ‘gpt2samples’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
