irwinsnet/w266_project
Final Project - W266

Kenneth Pong and Stacy Irwin

5 December 2020

Our project report is in W266_Final_Project_KennethPong_StacyIrwin.pdf, which is in the top folder of this repository.

Our presentation slides are available here

Repository Structure

Model Training Code

The code for fine-tuning our models is in the model_training_code folder. Each model has its own subfolder. Each subfolder has a consistent structure.

  • Create_Encodings.ipynb: The code for tokenizing the articles and preparing the data for training or evaluation is in this Jupyter notebook.
  • train.py: The code for training the model is in this Python module. We ran the training loop from the command line within a tmux session, which let both project team members monitor the status of training and, if necessary, disconnect from and reconnect to the virtual machine hosting the training code.
    • The training code continuously logged losses and other events to a text file.
    • The model weights were saved after every epoch to a .tar file. This allowed us to evaluate the model's performance after any epoch and minimized the progress that would be lost in case of an error. We were unable to upload the .tar files to the GitHub repository due to their large size.
  • log_HHMM_YYYYMMDD_train.txt: These .txt files contain the logs generated by the training loop.
  • evaluate_model.ipynb: Code for evaluating the model on development data is contained in this Jupyter notebook.
  • CSV files: Predictions, logits, and other features were saved to CSV files.
  • JSON files: Performance metrics (e.g., recall, precision) and a few model hyperparameters (e.g., batch size, # of epochs) were saved to JSON files.
  • Files Not Uploaded to GitHub: Training and development data were saved to pickle files, and model weights to .tar files. These files were not uploaded to the GitHub repository due to their large size, generally several hundred megabytes.
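The epoch-checkpointing and logging pattern described above can be sketched as follows. This is an illustrative outline, not the actual contents of train.py; the function signature, checkpoint filename template, and logging interval are assumptions.

```python
import logging

import torch
import torch.nn as nn


def train(model, loader, optimizer, epochs, log_path="log_train.txt",
          ckpt_template="weights_epoch{epoch}.tar"):
    """Train `model`, logging losses to a text file and saving a
    checkpoint to a .tar file after every epoch (names are illustrative)."""
    logging.basicConfig(filename=log_path, level=logging.INFO)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for step, (inputs, labels) in enumerate(loader):
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
            # Continuously log losses so training can be monitored
            # from the tmux session.
            if step % 100 == 0:
                logging.info("epoch %d step %d loss %.4f",
                             epoch, step, loss.item())
        # Save a full checkpoint so the model can be evaluated after any
        # epoch and little progress is lost if the run crashes.
        torch.save({"epoch": epoch,
                    "model_state_dict": model.state_dict(),
                    "optimizer_state_dict": optimizer.state_dict()},
                   ckpt_template.format(epoch=epoch))
```

A checkpoint saved this way can be restored later with `torch.load` followed by `model.load_state_dict`.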

Training Utilities

We relied on a few common utility functions and dataset classes. The Python modules with these items are included in the model_training_code/util folder.

  • log.py: We used the Python Standard Library's logging module. The log.py module provides a custom logger class that logs data to both the console and a text file.
  • data.py: This file contains several helper functions and classes. One of the most important items in this module is FNDataset, our custom subclass of PyTorch's torch.utils.data.Dataset class. In addition to the encodings, labels, and attention masks, this class contains source file names, article lengths, and other data that was not used for training but was available for analysis.
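A dataset class like FNDataset might look roughly as follows. This is a hedged sketch rather than the actual code in data.py; the attribute and key names are assumptions.

```python
import torch
from torch.utils.data import Dataset


class FNDataset(Dataset):
    """Sketch of a Dataset subclass that keeps analysis metadata
    (source files, article lengths) alongside the training tensors.
    Field names are illustrative, not taken from the repository."""

    def __init__(self, encodings, attention_masks, labels,
                 source_files=None, article_lengths=None):
        self.encodings = encodings
        self.attention_masks = attention_masks
        self.labels = labels
        # Per-example metadata: not fed to the model during training,
        # but available when analyzing errors afterwards.
        self.source_files = source_files
        self.article_lengths = article_lengths

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Only the tensors needed for training are returned here;
        # metadata stays on the dataset object for later analysis.
        return {"input_ids": self.encodings[idx],
                "attention_mask": self.attention_masks[idx],
                "labels": self.labels[idx]}
```

Because it subclasses `torch.utils.data.Dataset`, this class can be passed directly to a `DataLoader` for batching.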

Model Test Code

The code for evaluating models on our test dataset is in the model_test_code folder, with a separate subfolder for each model. The contents of these folders are similar to the training code folders. Evaluation of the model on test data was conducted within Jupyter notebooks.
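The output files described above (predictions and logits to CSV, metrics and hyperparameters to JSON) could be produced by a helper along these lines. This is a sketch assuming a binary classification task; the function name, column names, and metric set are assumptions, not the repository's actual code.

```python
import csv
import json

import torch


def save_eval_outputs(logits, labels, csv_path, json_path, hyperparams):
    """Write per-example predictions and logits to a CSV file, and
    summary metrics plus hyperparameters to a JSON file (binary task)."""
    preds = logits.argmax(dim=1)

    # One CSV row per example: predicted class, true label, raw logits.
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prediction", "label", "logit_0", "logit_1"])
        for p, y, row in zip(preds.tolist(), labels.tolist(), logits.tolist()):
            writer.writerow([p, y, *row])

    # Precision and recall for the positive class.
    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())
    metrics = {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        **hyperparams,  # e.g. batch size, number of epochs
    }
    with open(json_path, "w") as f:
        json.dump(metrics, f, indent=2)
    return metrics
```

Keeping the raw logits in the CSV, not just the argmax predictions, makes it possible to recompute metrics at other decision thresholds without rerunning the model.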

About

Final Project for UCB W266 Class on NLP