Your Python installation needs to be version 3.8 or higher.
If you can't be bothered to read all of this, you can just run:

```sh
chmod +x run.sh  # Make run.sh executable
./run.sh         # Run the program
```
This will:
- Install all the required libraries
- Run three epochs of training
- Generate a sample inference
You can then generate more inferences as described below.
This project started out as an RNN I wanted to implement in PyTorch. I had difficulties getting the model to produce coherent output. As I lacked reference values for training, I decided to fine-tune an existing model -- BLOOM. I hoped to learn more about the text-generation process from a top-down perspective, and to gather reference values for training in a "best-case" scenario.
Dependencies can be installed manually with:

```sh
pip install -r requirements.txt
```
❗ This step is only necessary if you want to source the data yourself ❗
The dataset used to train the model is included under /data/. It was collected on September 23, 2022.
The data for this project is gathered through the YouTube Data API v3. Setting up this API can roughly be divided into the following steps:
- Create a Google Developer Account
- Create a new project
- Enable the YouTube Data API v3
- Create credentials
- Make the credentials accessible to your environment
For in-depth guidance, please refer to this excellent HubSpot article.
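If you want a feel for what the last step amounts to in code, building an authenticated client typically looks something like the sketch below. The file name client_secret.json and the read-only scope are illustrative assumptions, not fixed by this repository:

```python
# Hypothetical sketch: building an authenticated YouTube API client.
# Assumes your OAuth credentials were downloaded as client_secret.json.
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/youtube.readonly"]

# Opens a browser window for the OAuth consent flow.
flow = InstalledAppFlow.from_client_secrets_file("client_secret.json", SCOPES)
credentials = flow.run_local_server(port=0)

# The authenticated client used for all subsequent API calls.
youtube = build("youtube", "v3", credentials=credentials)
```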
❗ If you decide to use the data included in the repository, you can skip this section. ❗
Assuming you set up the YouTube API correctly, all you need to do is run youtube/query_api.py. It requires the name of your client_secrets_file. You need to supply the requested channel's playlistId as an argument when launching the program. You can supply multiple playlistIds at once by separating them with a space.
To find a channel's playlistId, you need to:
- Go to the channel
- Find a playlist that includes all of the channel's videos (often the first playlist)
- Click PLAY ALL
- Copy everything after `list=` from the link
Thus, the command to download all the titles for VICE and VICE News is:

```sh
python3 youtube/query_api.py UUn8zNIfYAQNdrFRrr8oibKw PLw613M86o5o7q1cjb26MfCgdxJtshvRZ-
```
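query_api.py handles the API calls for you; conceptually, collecting the titles behind a playlistId boils down to paging through the playlistItems endpoint, roughly like the sketch below. It reuses the authenticated youtube client from the setup sketch above and is not the script's actual code:

```python
# Hypothetical sketch: collecting all video titles from one uploads playlist.
def fetch_titles(youtube, playlist_id):
    titles, page_token = [], None
    while True:
        response = youtube.playlistItems().list(
            part="snippet",
            playlistId=playlist_id,
            maxResults=50,          # API maximum per page
            pageToken=page_token,
        ).execute()
        titles += [item["snippet"]["title"] for item in response["items"]]
        page_token = response.get("nextPageToken")
        if not page_token:          # no more pages left
            return titles
```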
To clean the data, you just need to run preprocess.py. Assuming the file to process is called vice.txt, the command is:

```sh
python3 preprocess.py vice.txt
```
By default, this removes non-English sentences, duplicates, and entries consisting of fewer than three words. The resulting file is automatically split into sets of 80% train and 20% test in /data/.
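For orientation, the cleaning and splitting described above could be sketched as follows. The use of langdetect for language detection and the fixed seed are assumptions for illustration, not necessarily what preprocess.py does:

```python
# Hypothetical sketch of the cleaning rules and train/test split described above.
import random
from langdetect import detect, LangDetectException

def clean(lines):
    seen, kept = set(), []
    for line in (l.strip() for l in lines):
        if len(line.split()) < 3 or line in seen:  # too short or duplicate
            continue
        try:
            if detect(line) != "en":               # non-English entry
                continue
        except LangDetectException:                # undetectable text
            continue
        seen.add(line)
        kept.append(line)
    return kept

def train_test_split(lines, train_frac=0.8, seed=42):
    random.Random(seed).shuffle(lines)
    cut = int(len(lines) * train_frac)
    return lines[:cut], lines[cut:]                # 80% train, 20% test
```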
Training can easily be run by executing main.py. If you have Weights & Biases set up, you can add a flag to activate it:

```sh
python3 main.py --wandb
```
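main.py encapsulates the training loop; at its core, fine-tuning BLOOM on a file of titles looks roughly like the sketch below. The checkpoint bigscience/bloom-560m, the file names, and the hyperparameters are assumptions for illustration, not the script's actual configuration:

```python
# Hypothetical sketch: causal-LM fine-tuning with the transformers Trainer.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "bigscience/bloom-560m"   # assumed small BLOOM variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# One title per line, as produced by preprocess.py (file names assumed).
data = load_dataset("text", data_files={"train": "data/train.txt",
                                        "test": "data/test.txt"})
data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=64),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    # mlm=False makes the collator build labels for causal language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```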
Inference can be run by executing inference.py with the prompt as argument. Furthermore, you can pass certain inference parameters as arguments, e.g.:

```sh
python3 inference.py North Korea --temp 0.42 --top_k 32 --rp 1.3
```
Output:

```
temp=0.42; k=32, p=0.92, rep=1.3:
----------------------------------------------------------------------------------------------------
North Korea's 'Most Humane' Hospital
```
Hugging Face made a great tutorial on different generation strategies, where each inference parameter is explained in depth.
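For reference, the flags from the example above map onto transformers generation arguments roughly as follows; the model directory and the prompt handling are assumptions, not inference.py's actual code:

```python
# Hypothetical sketch: sampling a title with the parameters from the example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "out"  # assumed location of the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

inputs = tokenizer("North Korea", return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        do_sample=True,          # enable sampling instead of greedy decoding
        temperature=0.42,        # --temp
        top_k=32,                # --top_k
        top_p=0.92,              # the default nucleus-sampling value above
        repetition_penalty=1.3,  # --rp
        max_new_tokens=20,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```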
This project gave me valuable insight into text generation from a top-down perspective. While implementing it as a PyTorch RNN, I mostly scrambled around without much understanding of what I was doing.
By fine-tuning BLOOM, I learned how to fine-tune an existing model, how to source data, how to preprocess it correctly, and how to host the resulting model on the Hugging Face Hub with Gradio.
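As a pointer for that last part, a minimal Gradio demo around a Hub-hosted model can be as small as the sketch below; the model id is a placeholder, not this project's real repository name:

```python
# Hypothetical sketch: a minimal Gradio demo for the fine-tuned model.
import gradio as gr
from transformers import pipeline

# Placeholder identifier for the model pushed to the Hugging Face Hub.
generator = pipeline("text-generation", model="your-username/bloom-vice-titles")

def generate(prompt):
    # Return only the generated text for the first (and only) sample.
    return generator(prompt, max_new_tokens=20, do_sample=True)[0]["generated_text"]

gr.Interface(fn=generate, inputs="text", outputs="text").launch()
```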