<a href="https://colab.research.google.com/github/michele1993/Protein_design/blob/main/PropGPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
import sys
sys.path.append('content')
import os
import pandas as pd
import numpy as np


In [5]:
! mkdir dataset

## Data cleaning
We need to clean the data and split it into training and validation sets for fine-tuning ProtGPT2.

In [7]:
# Load data
#root_dir = os.path.dirname(os.path.abspath(__file__))
file_path = os.path.join('dataset','sequences.csv')

dataset = pd.read_csv(file_path)
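
The later cells rely on a `mutated_sequence` column being present in the CSV. A minimal sketch of a loader that checks this up front is shown here; the helper name `load_sequences`, the `activity` column, and the in-memory demo CSV are all hypothetical and only serve the illustration.

```python
import io
import pandas as pd

REQUIRED_COLUMN = "mutated_sequence"  # column name used by the duplicate-removal cell

def load_sequences(source) -> pd.DataFrame:
    """Read the CSV and fail early if the sequence column is missing."""
    df = pd.read_csv(source)
    if REQUIRED_COLUMN not in df.columns:
        raise ValueError(
            f"expected a '{REQUIRED_COLUMN}' column, found {list(df.columns)}"
        )
    return df

# Hypothetical two-row stand-in for dataset/sequences.csv
demo_csv = io.StringIO("mutated_sequence,activity\nMKV,0.1\nMKL,0.2\n")
demo = load_sequences(demo_csv)
print(demo.shape)
```

In the notebook itself the call would be `load_sequences(file_path)`.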

In [8]:
## ------ 1st: Remove any row that contains NaN -----
# identify any row with NaN
data_nan = pd.isna(dataset).sum(axis=1).astype('bool') # sum to identify if there is at least one NaN entry in a row.
# remove rows with NaN entirely
data_cleaned = dataset.loc[~data_nan, :]

# use `not` rather than `~`: on a plain Python bool, `~True` is -2, which is still truthy
assert not pd.isna(data_cleaned).any().any(), "There are NaN entries in the data, need cleaning"
## ----------------------------
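
As a cross-check on the cell above, the row-wise mask should select the same rows as `DataFrame.dropna()`. The sketch below compares the two on a toy frame; the `activity` column name and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sequences.csv; the "activity" column name is hypothetical.
toy = pd.DataFrame({
    "mutated_sequence": ["MKV", None, "MKL", "MRL"],
    "activity": [0.1, 0.2, np.nan, 0.4],
})

# Row-wise mask: True where a row holds at least one NaN.
has_nan = toy.isna().any(axis=1)
by_mask = toy.loc[~has_nan, :]

# dropna() removes the same rows in a single call.
by_dropna = toy.dropna()

print(by_mask.equals(by_dropna))
```

Either form leaves the original index labels in place, so the cleaned frame's index has gaps where rows were removed.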

In [9]:
## ------ 2nd Remove any duplicate entry ----

# try adding a duplicate to test that the removal works:
#data_cleaned = pd.concat([data_cleaned, pd.DataFrame(data_cleaned.iloc[-1,:], columns=data_cleaned.columns)], ignore_index=True)
#data_cleaned.loc[len(data_cleaned.index)] = data_cleaned.iloc[0,:]
#print(data_cleaned.shape)

# find all duplicates:
#duplicates = data_cleaned[data_cleaned.duplicated(subset="mutated_sequence", keep=False)]

# Remove duplicates, keeping only the first occurrence of each
data_cleaned = data_cleaned.drop_duplicates(subset="mutated_sequence", keep="first")
#print(data_cleaned.shape)
## -------------------------------------------
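
To see what the two `keep` settings do, here is a small sketch on toy data (the `activity` column and the sequences are hypothetical): `keep=False` flags every row of a duplicated group for inspection, while `keep="first"` retains one row per sequence.

```python
import pandas as pd

toy = pd.DataFrame({
    "mutated_sequence": ["MKV", "MKL", "MKV", "MRL", "MKL"],
    "activity": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# keep=False marks every member of a duplicated group, useful for inspection.
dupes = toy[toy.duplicated(subset="mutated_sequence", keep=False)]

# keep="first" retains the earliest row of each group and drops the rest.
deduped = toy.drop_duplicates(subset="mutated_sequence", keep="first")

print(len(dupes), list(deduped.index))
```

Note that the surviving index labels are no longer contiguous, which matters for any later step that drops rows by label.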

In [10]:
## ------- 3rd: Prepare data for fine-tuning ProtGPT2 on this dataset -----
# for the moment, ignore the activations and work with the sequences only
data_cleaned_seq = data_cleaned.iloc[:,0]

# 1st: add the "<|endoftext|>" token at the beginning of each seq
special_token = "<|endoftext|>"
data_cleaned_seq = special_token + data_cleaned_seq

# 2nd: split the data into training and validation sets

# Select % of validation seqs (i.e., 90/10)
n_seq = data_cleaned_seq.shape[0]
n_validation = n_seq // 10

# Select n. random positions for validation, without replacement so that none repeats
val_indx = np.random.choice(n_seq, n_validation, replace=False)
# Select validation seqs
val_seq = data_cleaned_seq.iloc[val_indx]
# Select training seqs by eliminating validation seqs; drop by index label,
# since labels are no longer contiguous after the cleaning steps above
training_seq = data_cleaned_seq.drop(index=val_seq.index)

# 3rd: concatenate all strings together and save in a txt file
training_concatenated = ''.join(training_seq)
val_concatenated = ''.join(val_seq)

# to add newline character between each original row's string use
#concatenated = '\n'.join(training_seq)

# Save concatenated strings
with open('training.txt','w') as file:
    file.write(training_concatenated)

with open('validation.txt','w') as file:
    file.write(val_concatenated)
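
For a split that can be reproduced across runs, one option is a seeded generator. The sketch below is an illustration under stated assumptions rather than the notebook's method: the helper name `split_train_val`, the seed, and the toy sequences are hypothetical, and the series index is assumed to hold unique labels.

```python
import numpy as np
import pandas as pd

def split_train_val(seqs: pd.Series, val_fraction: float = 0.1, seed: int = 0):
    """Return (train, val) with no sequence shared between the two."""
    rng = np.random.default_rng(seed)
    n_val = int(len(seqs) * val_fraction)
    # Sample positions without replacement so no row is picked twice.
    val_pos = rng.choice(len(seqs), size=n_val, replace=False)
    val = seqs.iloc[val_pos]
    # Drop by label, taken from the selected rows, not by position.
    train = seqs.drop(index=val.index)
    return train, val

# Toy series with a non-contiguous index, as left behind by the cleaning steps.
toy = pd.Series([f"<|endoftext|>SEQ{i}" for i in range(20)], index=range(100, 120))
train, val = split_train_val(toy)
print(len(train), len(val))
```

Checking that `set(train.index)` and `set(val.index)` are disjoint is a quick way to confirm that no validation sequence leaked into the training file.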

## Now, we can try to fine-tune ProtGPT2 based on our data

First, we need to install some dependencies.

In [11]:
%%capture
! wget https://raw.githubusercontent.com/huggingface/transformers/refs/heads/main/examples/pytorch/language-modeling/run_clm.py

--2024-10-18 18:02:39--  https://raw.githubusercontent.com/huggingface/transformers/refs/heads/main/examples/pytorch/language-modeling/run_clm.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28424 (28K) [text/plain]
Saving to: ‘run_clm.py’


2024-10-18 18:02:39 (20.7 MB/s) - ‘run_clm.py’ saved [28424/28424]



In [12]:
%%capture
!git clone https://github.com/huggingface/transformers.git

Cloning into 'transformers'...
remote: Enumerating objects: 235896, done.
remote: Counting objects: 100% (21606/21606), done.
remote: Compressing objects: 100% (1527/1527), done.
remote: Total 235896 (delta 21130), reused 20112 (delta 20054), pack-reused 214290 (from 1)
Receiving objects: 100% (235896/235896), 241.61 MiB | 13.26 MiB/s, done.
Resolving deltas: 100% (172884/172884), done.


In [13]:
%%capture
!pip install -e transformers/.

Obtaining file:///content/transformers
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting tokenizers<0.21,>=0.20 (from transformers==4.46.0.dev0)
  Downloading tokenizers-0.20.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tokenizers-0.20.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Building wheels for collected packages: transformers
  Building editable for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.46.0.dev0-0.editable-py3-none-any.whl size=17265 sha256=c61d223aa9fdb3e0f32bbf848f6a9a5cbb4c53349b18bdb6f35e9c8f09880b7e
 

In [14]:
%%capture
!pip install datasets evaluate

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

In [None]:
! python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06

2024-10-18 18:03:54.727157: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-18 18:03:54.746677: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-18 18:03:54.752530: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-18 18:03:54.766879: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
10/18/2024 18:04:01 - INFO - __main__ - Train
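
The `run_clm.py` invocation above can also be assembled in Python, which keeps the file paths and learning rate as parameters. This is a minimal sketch: the helper name `build_run_clm_command` is hypothetical, and it uses only the flags that appear in the command above.

```python
import shlex

def build_run_clm_command(train_file, validation_file, output_dir="output",
                          model="nferruz/ProtGPT2", learning_rate=1e-06):
    """Build the shell command string for run_clm.py from its arguments."""
    args = [
        "python", "run_clm.py",
        "--model_name_or_path", model,
        "--train_file", train_file,
        "--validation_file", validation_file,
        "--tokenizer_name", model,
        "--do_train", "--do_eval",
        "--output_dir", output_dir,
        "--learning_rate", str(learning_rate),
    ]
    # shlex.join quotes any argument that contains spaces or shell metacharacters.
    return shlex.join(args)

cmd = build_run_clm_command("training.txt", "validation.txt")
print(cmd)
```

In a notebook cell the resulting string can be run with `!{cmd}`.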