Your Python installation needs to be version 3.8 or higher.
If you can't be bothered to read all of this, you can just run:

```sh
chmod +x run.sh  # Make run.sh executable
./run.sh         # Run the program
```
This will:
- Install all the required libraries
- Run three epochs of training
- Generate a sample inference
You can then generate more inferences as described below.
This project started out as an RNN I wanted to implement in PyTorch. I had difficulties getting the model to produce coherent output. As I lacked reference values for training, I decided to fine-tune an existing model -- BLOOM. I hoped to learn more about the text-generation process from a top-down perspective, and to gather reference values for training in a "best-case" scenario.
Dependencies can be installed manually with:

```sh
pip install -r requirements.txt
```
❗ This step is only necessary if you want to source the data yourself ❗
The dataset used to train the model is included under /data/. It was collected on September 23, 2022.
The data for this project is gathered through the YouTube Data API v3. Setting up this API can roughly be divided into the following steps:
- Create a Google Developer Account
- Create a new project
- Enable the YouTube Data API v3
- Create credentials
- Make the credentials accessible to your environment
For in-depth guidance, please refer to this excellent HubSpot article.
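If you want a feel for what the last step amounts to in code, building an authenticated client typically looks something like the sketch below. The file name client_secret.json and the read-only scope are illustrative assumptions, not fixed by this repository:

```python
# Hypothetical sketch: building an authenticated YouTube API client.
# Assumes your OAuth credentials were downloaded as client_secret.json.
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/youtube.readonly"]

# Opens a browser window for the OAuth consent flow.
flow = InstalledAppFlow.from_client_secrets_file("client_secret.json", SCOPES)
credentials = flow.run_local_server(port=0)

# The authenticated client used for all subsequent API calls.
youtube = build("youtube", "v3", credentials=credentials)
```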
❗ If you decide to use the data included in the repository, you can skip this section. ❗
Assuming you set up the YouTube API correctly, all you need to do is run youtube/query_api.py. It requires the name of your client_secrets_file. You need to supply the requested channel's playlistId as an argument when launching the program. You can supply multiple playlistIds at once by separating them with a space.
To find a channel's playlistId, you need to:
- Go to the channel
- Find a playlist that includes all of the channel's videos (often the first playlist)
- Click PLAY ALL
- Copy everything after `list=` from the link
Thus, the command to download all the titles for VICE and VICE News is:

```sh
python3 youtube/query_api.py UUn8zNIfYAQNdrFRrr8oibKw PLw613M86o5o7q1cjb26MfCgdxJtshvRZ-
```
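query_api.py handles the API calls for you; conceptually, collecting the titles behind a playlistId boils down to paging through the playlistItems endpoint, roughly like the sketch below. It reuses the authenticated youtube client from the setup sketch above and is not the script's actual code:

```python
# Hypothetical sketch: collecting all video titles from one uploads playlist.
def fetch_titles(youtube, playlist_id):
    titles, page_token = [], None
    while True:
        response = youtube.playlistItems().list(
            part="snippet",
            playlistId=playlist_id,
            maxResults=50,          # API maximum per page
            pageToken=page_token,
        ).execute()
        titles += [item["snippet"]["title"] for item in response["items"]]
        page_token = response.get("nextPageToken")
        if not page_token:          # no more pages left
            return titles
```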
To clean the data, you just need to run preprocess.py. Assuming the file to process is called vice.txt, the command is:

```sh
python3 preprocess.py vice.txt
```
By default, this removes non-English sentences, duplicates, and entries consisting of fewer than three words. The resulting file is automatically split into sets of 80% train and 20% test in /data/.
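For orientation, the cleaning and splitting described above could be sketched as follows. The use of langdetect for language detection and the fixed seed are assumptions for illustration, not necessarily what preprocess.py does:

```python
# Hypothetical sketch of the cleaning rules and train/test split described above.
import random
from langdetect import detect, LangDetectException

def clean(lines):
    seen, kept = set(), []
    for line in (l.strip() for l in lines):
        if len(line.split()) < 3 or line in seen:  # too short or duplicate
            continue
        try:
            if detect(line) != "en":               # non-English entry
                continue
        except LangDetectException:                # undetectable text
            continue
        seen.add(line)
        kept.append(line)
    return kept

def train_test_split(lines, train_frac=0.8, seed=42):
    random.Random(seed).shuffle(lines)
    cut = int(len(lines) * train_frac)
    return lines[:cut], lines[cut:]                # 80% train, 20% test
```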
Training can easily be run by executing main.py. If you have Weights & Biases set up, you can add a flag to activate it:

```sh
python3 main.py --wandb
```
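main.py encapsulates the training loop; at its core, fine-tuning BLOOM on a file of titles looks roughly like the sketch below. The checkpoint bigscience/bloom-560m, the file names, and the hyperparameters are assumptions for illustration, not the script's actual configuration:

```python
# Hypothetical sketch: causal-LM fine-tuning with the transformers Trainer.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "bigscience/bloom-560m"   # assumed small BLOOM variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# One title per line, as produced by preprocess.py (file names assumed).
data = load_dataset("text", data_files={"train": "data/train.txt",
                                        "test": "data/test.txt"})
data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=64),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    # mlm=False makes the collator build labels for causal language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```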
Inference can be run by executing inference.py with the prompt as argument. Furthermore, you can pass certain inference parameters as arguments, e.g.:

```sh
python3 inference.py North Korea --temp 0.42 --top_k 32 --rp 1.3
```
Output:

```
temp=0.42; k=32, p=0.92, rep=1.3:
----------------------------------------------------------------------------------------------------
North Korea's 'Most Humane' Hospital
```
Hugging Face made a great tutorial on different generation strategies, where each inference parameter is explained in depth.
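For reference, the flags from the example above map onto transformers generation arguments roughly as follows; the model directory and the prompt handling are assumptions, not inference.py's actual code:

```python
# Hypothetical sketch: sampling a title with the parameters from the example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "out"  # assumed location of the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

inputs = tokenizer("North Korea", return_tensors="pt")
with torch.no_grad():
    output = model.generate(
        **inputs,
        do_sample=True,          # enable sampling instead of greedy decoding
        temperature=0.42,        # --temp
        top_k=32,                # --top_k
        top_p=0.92,              # the default nucleus-sampling value above
        repetition_penalty=1.3,  # --rp
        max_new_tokens=20,
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```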
This project gave me valuable insight into text generation from a top-down perspective. While implementing it as a PyTorch RNN, I mostly scrambled around without much understanding of what I was doing.
By fine-tuning BLOOM, I learned how to fine-tune an existing model, how to source data, how to preprocess it correctly, and how to host the resulting model on the Hugging Face Hub with Gradio.
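As a pointer for that last part, a minimal Gradio demo around a Hub-hosted model can be as small as the sketch below; the model id is a placeholder, not this project's real repository name:

```python
# Hypothetical sketch: a minimal Gradio demo for the fine-tuned model.
import gradio as gr
from transformers import pipeline

# Placeholder identifier for the model pushed to the Hugging Face Hub.
generator = pipeline("text-generation", model="your-username/bloom-vice-titles")

def generate(prompt):
    # Return only the generated text for the first (and only) sample.
    return generator(prompt, max_new_tokens=20, do_sample=True)[0]["generated_text"]

gr.Interface(fn=generate, inputs="text", outputs="text").launch()
```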