irwinsnet/w266_project
Final Project - W266

Kenneth Pong and Stacy Irwin

5 December 2020

Our project report is in W266_Final_Project_KennethPong_StacyIrwin.pdf, which is in the top folder of this repository.

Our presentation slides are available here

Repository Structure

Model Training Code

The code for fine-tuning our models is in the model_training_code folder. Each model has its own subfolder. Each subfolder has a consistent structure.

  • Create_Encodings.ipynb: The code for tokenizing the articles and preparing the data for training or evaluation is in this Jupyter notebook.
  • train.py: The code for training the model is in this Python module. We ran the training loop from the command line within a tmux session, which let both project team members monitor the status of training and, if necessary, disconnect from and reconnect to the virtual machine hosting the training code.
    • The training code continuously logged losses and other events to a text file.
    • The model weights were saved after every epoch to a .tar file. This allowed us to evaluate the model's performance after any epoch and minimized the progress that would be lost in case of an error. We were unable to upload the .tar files to the GitHub repository due to their large size.
  • log_HHMM_YYYYMMDD_train.txt: These .txt files contain the logs generated by the training loop.
  • evaluate_model.ipynb: Code for evaluating the model on development data is contained in this Jupyter notebook.
  • CSV files: Predictions, logits, and other features were saved to CSV files.
  • JSON files: Performance metrics (e.g., recall, precision) and a few model hyperparameters (e.g., batch size, # of epochs) were saved to JSON files.
  • Files Not Uploaded to GitHub: Training and development data were saved to pickle files, and model weights to .tar files. These files were not uploaded to the GitHub repository due to their large size, generally several hundred megabytes.
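The epoch-checkpointing and logging pattern described above can be sketched as follows. This is an illustrative outline, not the actual contents of train.py; the function signature, checkpoint filename template, and logging interval are assumptions.

```python
import logging

import torch
import torch.nn as nn


def train(model, loader, optimizer, epochs, log_path="log_train.txt",
          ckpt_template="weights_epoch{epoch}.tar"):
    """Train `model`, logging losses to a text file and saving a
    checkpoint to a .tar file after every epoch (names are illustrative)."""
    logging.basicConfig(filename=log_path, level=logging.INFO)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for step, (inputs, labels) in enumerate(loader):
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
            # Continuously log losses so training can be monitored
            # from the tmux session.
            if step % 100 == 0:
                logging.info("epoch %d step %d loss %.4f",
                             epoch, step, loss.item())
        # Save a full checkpoint so the model can be evaluated after any
        # epoch and little progress is lost if the run crashes.
        torch.save({"epoch": epoch,
                    "model_state_dict": model.state_dict(),
                    "optimizer_state_dict": optimizer.state_dict()},
                   ckpt_template.format(epoch=epoch))
```

A checkpoint saved this way can be restored later with `torch.load` followed by `model.load_state_dict`.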

Training Utilities

We relied on a few common utility functions and dataset classes. The Python modules with these items are included in the model_training_code/util folder.

  • log.py: We used the Python Standard Library's logging module. The log.py module provides a custom logger class that logs data to both the console and a text file.
  • data.py: This file contains several helper functions and classes. One of the most important items in this module is FNDataset, our custom subclass of PyTorch's torch.utils.data.Dataset class. In addition to the encodings, labels, and attention masks, this class contains source file names, article lengths, and other data that was not used for training but was available for analysis.
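A dataset class like FNDataset might look roughly as follows. This is a hedged sketch rather than the actual code in data.py; the attribute and key names are assumptions.

```python
import torch
from torch.utils.data import Dataset


class FNDataset(Dataset):
    """Sketch of a Dataset subclass that keeps analysis metadata
    (source files, article lengths) alongside the training tensors.
    Field names are illustrative, not taken from the repository."""

    def __init__(self, encodings, attention_masks, labels,
                 source_files=None, article_lengths=None):
        self.encodings = encodings
        self.attention_masks = attention_masks
        self.labels = labels
        # Per-example metadata: not fed to the model during training,
        # but available when analyzing errors afterwards.
        self.source_files = source_files
        self.article_lengths = article_lengths

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Only the tensors needed for training are returned here;
        # metadata stays on the dataset object for later analysis.
        return {"input_ids": self.encodings[idx],
                "attention_mask": self.attention_masks[idx],
                "labels": self.labels[idx]}
```

Because it subclasses `torch.utils.data.Dataset`, this class can be passed directly to a `DataLoader` for batching.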

Model Test Code

The code for evaluating models on our test dataset is in the model_test_code folder, with a separate subfolder for each model. The contents of these folders are similar to the training code folders. Evaluation of the model on test data was conducted within Jupyter notebooks.
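The output files described above (predictions and logits to CSV, metrics and hyperparameters to JSON) could be produced by a helper along these lines. This is a sketch assuming a binary classification task; the function name, column names, and metric set are assumptions, not the repository's actual code.

```python
import csv
import json

import torch


def save_eval_outputs(logits, labels, csv_path, json_path, hyperparams):
    """Write per-example predictions and logits to a CSV file, and
    summary metrics plus hyperparameters to a JSON file (binary task)."""
    preds = logits.argmax(dim=1)

    # One CSV row per example: predicted class, true label, raw logits.
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prediction", "label", "logit_0", "logit_1"])
        for p, y, row in zip(preds.tolist(), labels.tolist(), logits.tolist()):
            writer.writerow([p, y, *row])

    # Precision and recall for the positive class.
    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())
    metrics = {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        **hyperparams,  # e.g. batch size, number of epochs
    }
    with open(json_path, "w") as f:
        json.dump(metrics, f, indent=2)
    return metrics
```

Keeping the raw logits in the CSV, not just the argmax predictions, makes it possible to recompute metrics at other decision thresholds without rerunning the model.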

About

Final Project for UCB W266 Class on NLP