In [1]:
# Install HMMER, its documentation, and NCBI BLAST+
!sudo apt-get install hmmer
!sudo apt-get install hmmer-doc
!sudo apt-get install ncbi-blast+

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
hmmer is already the newest version (3.3.2+dfsg-1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
hmmer-doc is already the newest version (3.3.2+dfsg-1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ncbi-blast+ is already the newest version (2.12.0+ds-3build1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


In [2]:
import pandas as pd

## Get the 3D structure
The CSV files of PF00014 domains come from two different RCSB PDB queries:

```
( Identifier = "PF00014" AND Annotation Type = "Pfam" ) AND Data Collection Resolution < 3 AND Polymer Entity Sequence Length = [ 50 - 80 ] AND Polymer Entity Mutation Count < 10
```
and
```
( Identifier = "PF00014" AND Annotation Type = "Pfam" ) AND Data Collection Resolution < 2 AND Polymer Entity Sequence Length = [ 50 - 80 ] AND Polymer Entity Mutation Count < 2
```
The queries differ in `Resolution` (3 Å vs 2 Å), in `Polymer Entity Mutation Count` (10 vs 2) and, as a final step, in the sequence identity used to group the polymer entities (100% vs 50%).

The stricter criteria (resolution below 2 Å, mutation count below 2, sequence identity of 100%) may yield a smaller sample of potentially higher quality, while the looser criteria (resolution below 3 Å, mutation count below 10, sequence identity of 50%) could yield a larger sample at the risk of including lower-quality data. The stricter rules gave 14 samples and the looser rules gave 28. We also want to assess which set of criteria ultimately gives better results in terms of sample quality and relevance to our research objectives.

## Clean csv file

In [3]:
def clean_csv_file(path: str) -> "str | None":
    """
    Reads and cleans a CSV file, providing options to return the cleaned data as a string or save it into a file.

    Parameters:
        path (str): The path to the CSV file to be cleaned.

    Returns:
        str or None: If the user chooses to get the results as a variable (1),
        the cleaned data is returned as a string. If the user chooses to save the
        results into a file (2), the cleaned data is saved into a file.

    Usage:
        1. `path` should be the path to the CSV file that needs to be cleaned.
        2. The function interactively prompts the user to choose between getting
           the results as a variable or saving them into a file.
        3. If the user selects to save the results into a file, they are further
           prompted to choose between saving only keys or in Fasta format.
        4. If the user selects to save in Fasta format, the data is saved in a
           '.fasta' file with each entry represented as a Fasta sequence.
        5. If the user selects to save only keys, the data is saved in a '.txt'
           file with each key on a separate line.

    Notes:
        - If the user does not provide a valid input for any prompt, they are
          repeatedly prompted until a valid input is provided.
        - If the user does not provide a file name when prompted for the output
          file name, a default name "output_seq" is used.
        - The function utilizes the Pandas library to read and manipulate CSV data.
        - The function utilizes Python's built-in file handling capabilities to
          save the data into text files.
    """
    break_line = '\n------------------------------\n'
    print('Reading the file...')
    df = pd.read_csv(path)
    print('Cleaning the CSV file...')
    df = df.dropna(subset=['Entity ID'])
    df = df.drop(columns=['Unnamed: 3'])
    df['Entity ID'] = df['Entity ID'].str.split('_').str[0] + ':' + df['Auth Asym ID']
    df = df.drop(columns=['Auth Asym ID'])
    df = df.reset_index(drop=True)
    print('Done!', break_line)

    while True:
        which_output = input('Do you want to get the results as variable(1) or file(2)?\nOnly enter the corresponding number[1, 2]: ')
        if which_output in ['1', '2']:
            which_output = int(which_output)
            break
        else:
            print('Invalid input. Please enter either 1 or 2.' + break_line)

    # Save the output into a variable
    if which_output == 1:
        return '\n'.join(df['Entity ID'].values)

    # Save the output into a file
    elif which_output == 2:
        print(break_line)
        output_file_name = input('Enter your output file name without extension (Press Enter for default "output_seq"): ')
        output_file_name = output_file_name.strip()
        if output_file_name == "":
            output_file_name = "output_seq"

        while True:
            with_fasta = input('[1] Only keys \\ [2] As Fasta format: ')
            if with_fasta in ['1', '2']:
                with_fasta = int(with_fasta)
                break
            else:
                print('Invalid input. Please enter either 1 or 2.' + break_line)
        # Save as .fasta
        if with_fasta == 2:
            with open(output_file_name + '.fasta', 'w') as file:
                for idx, row in df.iterrows():
                    # No space after '>' so parsers read the ID as the first header token
                    file.write(f">{row['Entity ID']}\n{row['Sequence']}\n")
            print(break_line, 'Data saved to', output_file_name + '.fasta')
        # Save keys only, one per line, as .txt
        elif with_fasta == 1:
            with open(output_file_name + '.txt', 'w') as f:
                f.write('\n'.join(df['Entity ID'].values))
            print(break_line, 'Data saved to', output_file_name + '.txt')


In [4]:
!wget -O pdb_report.csv "https://github.com/heispv/bioinformatics/raw/master/lab-of-bioinformatics/rcsb_pdb_custom_report_20240411062134.csv"

--2024-05-11 19:27:34--  https://github.com/heispv/bioinformatics/raw/master/lab-of-bioinformatics/rcsb_pdb_custom_report_20240411062134.csv
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/heispv/bioinformatics/master/lab-of-bioinformatics/rcsb_pdb_custom_report_20240411062134.csv [following]
--2024-05-11 19:27:35--  https://raw.githubusercontent.com/heispv/bioinformatics/master/lab-of-bioinformatics/rcsb_pdb_custom_report_20240411062134.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1611 (1.6K) [text/plain]
Saving to: ‘pdb_report.csv’


2024-05-11 19:27:35 (13.3 MB/s) - ‘pdb_report.csv’ sa

In [5]:
path = "/content/pdb_report.csv"

In [6]:
# for default -> select (2 / default / 2)
clean_csv_file(path)

Reading the file...
Cleaning the CSV file...
Done! 
------------------------------



KeyboardInterrupt: Interrupted by user
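
The prompts in `clean_csv_file` wait until the user answers or interrupts them, as in the output above. A non-interactive conversion could be sketched as follows, assuming the report has the `Entity ID`, `Auth Asym ID` and `Sequence` columns used above. `report_to_fasta` is a hypothetical helper that is not part of the original notebook, and it uses the standard `csv` module instead of pandas only to stay self-contained.

```python
import csv
import io


def report_to_fasta(csv_text: str) -> str:
    """Convert CSV report text into FASTA text (hypothetical helper).

    Assumes the columns 'Entity ID', 'Auth Asym ID' and 'Sequence' exist,
    as in the cleaning function above.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        entity_id = (row.get("Entity ID") or "").strip()
        if not entity_id:
            continue  # skip rows without an entity ID, like dropna() above
        pdb_id = entity_id.split("_")[0]
        chain = (row.get("Auth Asym ID") or "").strip()
        records.append(f">{pdb_id}:{chain}\n{row['Sequence']}")
    return "\n".join(records) + "\n"


# Made-up two-row report, for illustration only
example = "Entry ID,Sequence,Entity ID,Auth Asym ID\n1ABC,ACDE,1ABC_1,A\n,,,\n"
print(report_to_fasta(example), end="")
```

Writing the returned text to `output_seq.fasta` would give a file with the same layout that the later `blastp` step reads.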

In [None]:
!cat output_seq.fasta | head -n 30

## Get MSA

In [None]:
# The multiple sequence alignment was obtained on the PDBe website
!wget -O fasta.txt "https://github.com/heispv/bioinformatics/raw/master/lab-of-bioinformatics/fasta.txt"

### Build HMM based on the raw MSA

In [None]:
# Create an HMM model based on the fasta.txt file
!hmmbuild msa_notclean.hmm fasta.txt

In [None]:
!cat msa_notclean.hmm | head -n 30

The file above shows that `hmmbuild`, applied to `fasta.txt`, leaves out the first 20 columns of the alignment: too few sequences have an amino acid at those positions to build Hidden Markov Model (HMM) states for that part of the alignment. We therefore trim the first 20 characters of each sequence and rerun `hmmbuild`.
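
One way to estimate how many leading columns to trim, rather than reading it off the HMM file, is to count the leading alignment columns in which fewer than half of the sequences carry a residue. The sketch below is only an approximation: the 0.5 threshold mirrors the default of `hmmbuild`'s `--symfrac` option, but `hmmbuild` applies it to weighted counts, so the numbers can differ. `leading_sparse_columns` is a hypothetical helper.

```python
def leading_sparse_columns(aligned_seqs, symfrac=0.5, gap_chars="-."):
    """Count leading columns whose residue fraction is below `symfrac`.

    Unweighted approximation; sequences are assumed to be aligned strings.
    """
    n_seqs = len(aligned_seqs)
    n_cols = min(len(s) for s in aligned_seqs)
    count = 0
    for col in range(n_cols):
        residues = sum(1 for s in aligned_seqs if s[col] not in gap_chars)
        if residues / n_seqs >= symfrac:
            break  # first well-populated column: stop counting
        count += 1
    return count


# Toy alignment, for illustration only
toy = ["--AC", "-GAC", "--AC"]
print(leading_sparse_columns(toy))
```

For the toy alignment, the first two columns are mostly gaps, so the function reports two columns to trim.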

## Clean raw MSA

In [None]:
def clean_fasta(path: str, first_clipping_num: int, output_file_name: str) -> None:
    """
    Clean FASTA file by removing specified number of characters from the beginning of each sequence.

    Args:
        path (str): Path to the input FASTA file.
        first_clipping_num (int): Number of characters to remove from the beginning of each sequence.
        output_file_name (str): Name of the output file.

    Returns:
        None

    This function reads a FASTA file, extracts the sequence IDs and sequences, removes the specified
    number of characters from the beginning of each sequence, and writes the cleaned sequences to a new file.
    """
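    with open(path) as f:
        # Records are assumed to be separated by blank lines
        fastas = f.read().split('\n\n')

    clean_list = []
    for fasta in fastas:
        if not fasta.strip():
            continue  # skip empty chunks, e.g. from trailing blank lines
        lines = fasta.strip().split('\n')
        seq_id = lines[0].split()[0]
        sequence = ''.join(lines[1:])
        clean_list.append((seq_id, sequence))

    with open(output_file_name + '.txt', 'w') as f:
        for seq_id, sequence in clean_list:
            # Drop the first `first_clipping_num` characters of each aligned sequence
            f.write(f"{seq_id}\n{sequence[first_clipping_num:]}\n")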

    print(f'Output saved in {output_file_name}.txt')

In [None]:
clean_fasta("/content/fasta.txt", 20, "clean_fasta")

In [None]:
# Check the fasta file before the cleaning
!cat fasta.txt | head -n 20

In [None]:
# Check the clean_fasta.txt file
!cat clean_fasta.txt | head -n 20

### Build HMM based on clean MSA

In [None]:
# Create an HMM model based on the clean_fasta.txt file
!hmmbuild msa.hmm clean_fasta.txt

In [None]:
!cat msa.hmm | head -n 30

* In this new file, the probabilities start from the first amino acid (AA), which indicates that `hmmbuild` no longer cuts anything itself.
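
A quick way to compare the two models without reading the full files is to look at the `LENG` field in the header of each HMMER3 profile, which records the number of match states. The sketch below parses that field from the file text; `hmm_model_length` is a hypothetical helper, and the header used in the example is made up for illustration.

```python
def hmm_model_length(hmm_text: str) -> int:
    """Return the value of the LENG header field of a HMMER3 profile."""
    for line in hmm_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "LENG":
            return int(fields[1])
        if fields and fields[0] == "HMM":
            break  # the header ends where the model body starts
    raise ValueError("no LENG line found in the HMM header")


# Made-up header, for illustration only
sample_header = "HMMER3/f [3.3.2 | Nov 2020]\nNAME  msa\nLENG  58\nALPH  amino\n"
print(hmm_model_length(sample_header))
```

Applied to the contents of `msa_notclean.hmm` and `msa.hmm`, this would show whether trimming changed the number of match states.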

## Get the negative and positive data from UniProt

In [None]:
!wget -O negative.fasta.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28%28reviewed%3Atrue%29+NOT+%28xref%3Apfam-PF00014%29%29"
!zcat -f negative.fasta.gz > negative.fasta
!rm negative.fasta.gz

In [None]:
!wget -O bpti_reviewd.fasta.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28%28xref%3Apfam-PF00014%29+AND+%28reviewed%3Atrue%29%29"
!zcat bpti_reviewd.fasta.gz > bpti_reviewd.fasta
!rm bpti_reviewd.fasta.gz

In [None]:
# Build a BLAST protein database from bpti_reviewd.fasta
!makeblastdb -in bpti_reviewd.fasta -dbtype prot

In [17]:
!cat bpti_reviewd.fasta | wc

   1860    5387  118161


The command below runs a BLASTP search, which compares protein sequences. The query sequences come from "output_seq.fasta" and are compared against the database built from "bpti_reviewd.fasta". The results are saved in "bpti_reviewd.blast" in output format 7 (tabular with comment lines), which is convenient for further analysis.

In [None]:
!blastp -query output_seq.fasta -db bpti_reviewd.fasta -out bpti_reviewd.blast -outfmt 7

The command below filters the BLAST result file "bpti_reviewd.blast". It removes comment lines (those starting with `#`), keeps hits with sequence identity greater than 98% (column 3), and saves the unique subject identifiers (column 2) to "remove.fasta". Despite its extension, this file holds identifiers only, not sequences.
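
The same filter can be sketched in Python, which makes the column logic explicit. `subjects_above_identity` is a hypothetical helper; it assumes the default tabular column order, with the subject ID in column 2 and the percent identity in column 3, and the rows in the example are made up.

```python
def subjects_above_identity(blast_text: str, min_identity: float = 98.0):
    """Return sorted unique subject IDs of hits with identity > min_identity."""
    hits = set()
    for line in blast_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()  # whitespace split, like awk's default
        if float(fields[2]) > min_identity:
            hits.add(fields[1])
    return sorted(hits)


# Made-up hit table, for illustration only
sample = (
    "# BLASTP 2.12.0+\n"
    "q1\tsp|P00974|BPT1_BOVIN\t100.000\t58\n"
    "q1\tsp|O43278|SPIT1_HUMAN\t45.200\t50\n"
    "q2\tsp|P00974|BPT1_BOVIN\t99.100\t58\n"
)
print(subjects_above_identity(sample))
```

Using a set gives the same de-duplication as `sort -u`, although the sort order may differ from the shell's locale-dependent ordering.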

In [7]:
!grep -v "^#" bpti_reviewd.blast | awk '{if ($3 > 98) {print $2}}' | sort -u > remove.fasta

In [8]:
!cat remove.fasta | head -n 5

sp|G9I929|VKTA_MICTN
sp|O43278|SPIT1_HUMAN
sp|O43291|SPIT2_HUMAN
sp|P00974|BPT1_BOVIN
sp|P00980|VKTHA_DENAN


We are only interested in the IDs. The command below extracts the accession, which is the second `|`-separated field, from each line of remove.fasta and saves the results in the `remove.ids` file.
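
For reference, the extraction done by `cut` corresponds to taking the second `|`-separated field of each identifier. A minimal Python sketch, with `accessions` as a hypothetical helper name:

```python
def accessions(identifiers):
    """Take the second '|'-separated field of each non-empty identifier."""
    return [ident.split("|")[1] for ident in identifiers if ident.strip()]


# Two of the identifiers listed in remove.fasta above
print(accessions(["sp|G9I929|VKTA_MICTN", "sp|P00974|BPT1_BOVIN"]))
```

The same field is what `filter_sequences` compares against when it reads the headers of the FASTA file.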

In [9]:
!cat remove.fasta | cut -d "|" -f 2 > remove.ids

In [10]:
!wc remove.ids

 27  27 189 remove.ids


Based on the `remove.ids` file, 27 sequences should be removed from the main data.

In [11]:
def filter_sequences(seq_file_path, ids_file_path, output_file_path):
    """
    Filters sequences from a FASTA file based on a list of excluded sequence IDs and save them in a file.

    Parameters:
    - seq_file_path (str): The file path to the input FASTA file containing sequences to filter.
    - ids_file_path (str): The file path to the input file containing a list of sequence IDs to exclude.
    - output_file_path (str): The file path to save the filtered sequences.

    Returns:
    - None
    """
    # Open the file containing excluded sequence IDs and create a set to store them
    with open(ids_file_path, 'r') as file:
        excluded_ids = {line.strip() for line in file}

    # Open the input FASTA file and extract sequences
    with open(seq_file_path, 'r') as file:
        content = file.read().strip()
        sequences = content.split('>')[1:]

    # Open the output file for writing filtered sequences
    with open(output_file_path, 'w') as outfile:
        for sequence in sequences:
            header = sequence.split('\n', 1)[0]
            seq_id = header.split('|')[1]

            if seq_id not in excluded_ids:
                # rstrip() avoids a blank line after every record
                outfile.write(f'>{sequence.rstrip()}\n')


In [12]:
filter_sequences('bpti_reviewd.fasta', 'remove.ids', 'pos_filtered.fasta')

In [13]:
!cat pos_filtered.fasta | grep ">" | wc

    364    3628   37607


The filtered positive set now contains 364 sequences.
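
Beyond counting headers, it can be worth checking that none of the excluded accessions survived the filtering. The sketch below assumes UniProt-style headers (`>sp|ACCESSION|NAME ...`); `summarize_filtered` and `header_accession` are hypothetical helpers and the sequences in the example are made up.

```python
def header_accession(header_line: str) -> str:
    """Extract the accession from a '>db|ACCESSION|NAME ...' header line."""
    body = header_line[1:].strip()
    parts = body.split("|")
    # Fall back to the first token if the header is not '|'-delimited
    return parts[1] if len(parts) >= 3 else body.split()[0]


def summarize_filtered(fasta_text: str, excluded_ids):
    """Return (number of records, sorted excluded accessions still present)."""
    accs = [header_accession(line)
            for line in fasta_text.splitlines() if line.startswith(">")]
    leaked = sorted(set(accs) & set(excluded_ids))
    return len(accs), leaked


# Made-up records, for illustration only
toy_fasta = ">sp|A1|X_Y desc\nACD\n>sp|B2|Z_W desc\nEFG\n"
print(summarize_filtered(toy_fasta, {"B2"}))
```

An empty second element means the exclusion list was applied completely.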

In [14]:
!cat negative.fasta | grep ">" | wc

 570891 8385714 74362141
