# Model Evaluation

This notebook demonstrates how you can run my pre-trained model on unseen test data.

In [None]:
# My virtual environment is tracked using `pipenv`.
# From the top directory of the project, run:
!pipenv install

In [None]:
import sys; sys.path.append('..')

from functools import partial
import tqdm, os, json

import pandas as pd
import torch
import torch.nn as nn
import gvpgnn.datasets as datasets
import gvpgnn.models as models
import gvpgnn.paths as paths
import gvpgnn.data_models as dm
import gvpgnn.embeddings as embeddings
import gvpgnn.train_utils as train_utils
import numpy as np
import torch_geometric
from sklearn.metrics import confusion_matrix
from scripts.parser import parser

## Step 1: Preprocess the Data

For convenience, I map the provided raw data to a format that's easier for my dataloader to use. You'll need to preprocess any unseen test data in the same way.


### Required files:
- I expect the test data in a CSV with the same format as `cath_w_seqs_share.csv` (the filename can be changed below).
- I expect a PDB file for each unseen protein, stored in a folder like `pdb_share` (the folder can be changed below).
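
Before preprocessing, it can help to confirm that every protein in the CSV has a PDB file. The sketch below is not part of this repo. It assumes the CSV has a column of protein IDs (passed as the hypothetical `id_column` argument, since the real column name depends on your file) and that each PDB file is named after its ID.

```python
import csv
from pathlib import Path


def find_missing_pdbs(csv_path, pdb_folder, id_column):
    """Return the IDs listed in the CSV that have no matching file in pdb_folder.

    Assumption: a file matches when its stem (name without extension)
    equals the ID. `id_column` is a hypothetical parameter; pass the
    name of the ID column in your own CSV.
    """
    # Stems of every file in the PDB folder, e.g. "1abc" for "1abc.pdb".
    available = {p.stem for p in Path(pdb_folder).iterdir() if p.is_file()}

    with open(csv_path, newline="") as f:
        ids = [row[id_column] for row in csv.DictReader(f)]

    # Keep CSV order so missing entries are easy to locate.
    return [protein_id for protein_id in ids if protein_id not in available]
```

An empty list means every row has a PDB file. Any IDs returned are the ones to fix before running the preprocessing step.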

### 1a: Preprocess the Dataset

In [None]:
# From the top level of the repo.
# Each `!` line runs in its own subshell, so `cd` must be chained with the command:
!cd scripts/ && python preprocess.py \
  --csv path_to_your_test_data.csv \
  --output-folder ../data/challenge_test_set \
  --pdb-folder ../data/pdb_share

### 1b: Pre-Compute Language Model Embeddings

Next, I precompute language model embeddings for all of the examples in the dataset. These are placed alongside the `JSON` data as `.pt` files. The whole dataset is copied to a new folder to avoid overwriting any of the original data.
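
Once the precompute script has run, a quick check can confirm that every example received an embedding. This is a minimal sketch, not part of the repo. It assumes each `.pt` file shares its name with the `JSON` file next to it, which may not match the actual naming scheme. Pass the folder that the script wrote the copied dataset to.

```python
from pathlib import Path


def find_missing_embeddings(dataset_folder):
    """Return JSON files under dataset_folder that have no sibling .pt file.

    Assumption: the embedding for `example.json` is saved as `example.pt`
    in the same directory.
    """
    root = Path(dataset_folder)
    return sorted(
        json_path
        for json_path in root.rglob("*.json")
        if not json_path.with_suffix(".pt").exists()
    )
```

An empty result suggests the embeddings are complete under that naming assumption.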

In [None]:
# Fetch the pre-trained weights for all language models.
# (`cd` is chained because each `!` line runs in its own subshell.)
!cd scripts/ && python download_esm.py

# Then precompute the embeddings:
!cd scripts/ && python precompute_embeddings.py --in-dataset ../data/challenge_test_set

In [2]:
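# Note: `!cd` only affects its own subshell, so this leaves the kernel's working directory unchanged.
!cd ..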