<a href="https://colab.research.google.com/github/paulynamagana/AFDB_notebooks/blob/main/AFDB_FTP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<img src = "https://www.embl.org/about/info/communications/wp-content/uploads/2017/09/Ebi_official_logo.png"
 height="100" align="right">

# Access structures from AlphaFold DB via FTP

FTP, or File Transfer Protocol, is a standard network protocol facilitating the exchange of files between computers.



<br>

As of September 2023, the EMBL-EBI’s FTP area hosts TAR files for proteomes of 48 organisms, including model organisms and WHO pathogens of interest.

We document every data version update in our [CHANGELOG](https://ftp.ebi.ac.uk/pub/databases/alphafold/CHANGELOG.txt).


<br>

Each proteome archive is named following a structured convention, comprising three distinct elements separated by underscores (for example, `UP000000429_85962_HELPY`):

- Reference Proteome (UPID): UP000000429
- Taxonomy ID: 85962
- Organism: HELPY (derived from the first three characters of the genus, "Helicobacter," and the first two characters of the species, "pylori").
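
The naming convention above can be unpacked programmatically. The following is a minimal sketch, assuming the archive names on the FTP server also carry a version suffix and a `.tar` extension (for example `UP000000429_85962_HELPY_v3.tar`). The `ProteomeArchive` class and the `parse_archive_name` function are hypothetical helpers introduced here for illustration; they are not part of any AlphaFold DB tooling.

```python
from typing import NamedTuple


class ProteomeArchive(NamedTuple):
    """Hypothetical container for the parts of an archive name."""
    upid: str
    taxonomy_id: int
    organism: str
    version: str


def parse_archive_name(file_name: str) -> ProteomeArchive:
    """Split a name such as 'UP000000429_85962_HELPY_v3.tar' into its parts."""
    # Drop the '.tar' extension if present
    stem = file_name[:-4] if file_name.endswith(".tar") else file_name
    parts = stem.split("_")
    # Assumption: exactly four underscore-separated parts (UPID, taxonomy ID, organism, version)
    if len(parts) != 4:
        raise ValueError(f"Unexpected archive name: {file_name!r}")
    upid, taxonomy_id, organism, version = parts
    return ProteomeArchive(upid, int(taxonomy_id), organism, version)


print(parse_archive_name("UP000000429_85962_HELPY_v3.tar"))
```

Names that do not follow the convention, such as `README.txt`, raise a `ValueError` instead of being silently misread.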

<br>

For an overview of the available archives, visit the [Downloads tab](https://alphafold.ebi.ac.uk/download).

You can also find the compressed files for Swiss-Prot, which contain 542,378 predicted structures:

|File type|File name|Size|
|---------|---------|----|
|Swiss-Prot (CIF files)|swissprot_cif_v4.tar|37,643 MB|
|Swiss-Prot (PDB files)|swissprot_pdb_v4.tar|26,935 MB|




In [31]:
import ftplib

ftp_server = ftplib.FTP("ftp.ebi.ac.uk")

# Log in as an anonymous user
ftp_server.login("anonymous", "anonymous@")

# Navigate to the directory
ftp_server.cwd("/pub/databases/alphafold/")

# Retrieve and print the list of file names in the directory
file_names = ftp_server.nlst()
for file_name in file_names:
    print(file_name)

CHANGELOG.txt
README.txt
accession_ids.csv
download_metadata.json
latest
sequences.fasta
v1
v2
v3
v4


In [32]:
#@title #See all the available proteomes in one version
folder_navigate = "v3" #@param {type:"string"}
#@markdown `folder_navigate` is the AFDB version directory to list and download from

# Navigate to the selected version directory (absolute path, so the cell can be re-run)
ftp_server.cwd("/pub/databases/alphafold/" + folder_navigate)

# Retrieve and print the list of file names in the directory
file_names = ftp_server.nlst()
for file_name in file_names:
    print(file_name)

UP000000429_85962_HELPY_v3.tar
UP000000437_7955_DANRE_v3.tar
UP000000535_242231_NEIG1_v3.tar
UP000000559_237561_CANAL_v3.tar
UP000000579_71421_HAEIN_v3.tar
UP000000586_171101_STRR6_v3.tar
UP000000589_10090_MOUSE_v3.tar
UP000000625_83333_ECOLI_v3.tar
UP000000799_192222_CAMJE_v3.tar
UP000000803_7227_DROME_v3.tar
UP000000805_243232_METJA_v3.tar
UP000000806_272631_MYCLE_v3.tar
UP000001014_99287_SALTY_v3.tar
UP000001450_36329_PLAF7_v3.tar
UP000001584_83332_MYCTU_v3.tar
UP000001631_447093_AJECG_v3.tar
UP000001940_6239_CAEEL_v3.tar
UP000002059_502779_PARBA_v3.tar
UP000002195_44689_DICDI_v3.tar
UP000002296_353153_TRYCC_v3.tar
UP000002311_559292_YEAST_v3.tar
UP000002438_208964_PSEAE_v3.tar
UP000002485_284812_SCHPO_v3.tar
UP000002494_10116_RAT_v3.tar
UP000002716_300267_SHIDS_v3.tar
UP000005640_9606_HUMAN_v3.tar
UP000006304_1133849_9NOCA1_v3.tar
UP000006548_3702_ARATH_v3.tar
UP000006672_6279_BRUMA_v3.tar
UP000007305_4577_MAIZE_v3.tar
UP000007841_1125630_KLEPH_v3.tar
UP000008153_5671_LEIIN_v3.tar


In [40]:
#@title #Extract specific files from the proteome

from ftplib import FTP
import tarfile
import tempfile
import os
from google.colab import files

def extract_specific_file_from_ftp(folder_navigate, tar_file, file_fragment_names):
    """Download a proteome tar from the AFDB FTP area and save the first member matching each fragment."""
    username = 'anonymous'
    password = 'anonymous'
    ftp_server = "ftp.ebi.ac.uk"
    base_path = 'pub/databases/alphafold/' + folder_navigate + '/'  # Dynamically use folder_navigate
    file_path = base_path + tar_file

    # Convert the comma-separated string of fragments into a list, ignoring extra spaces and empty entries
    file_fragment_names = [name.strip() for name in file_fragment_names.split(',') if name.strip()]

    # Download the whole tar archive into a temporary file
    downloaded = False
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        tar_file_path = tmp_file.name
        try:
            with FTP(ftp_server) as ftp:
                ftp.login(username, password)
                ftp.retrbinary(f'RETR {file_path}', tmp_file.write)
            downloaded = True
        except Exception as e:
            print(f"Error downloading the file: {e}")

    if not downloaded:
        os.remove(tar_file_path)  # Do not leave a partial download behind
        return  # Exit the function if there's an error

    # Open the temporary tar file for reading (it has been closed and flushed by now)
    try:
        with tarfile.open(tar_file_path, mode="r:*") as tar:
            for file_fragment_name in file_fragment_names:  # Iterate over each file fragment name
                extracted = False
                for member in tar.getmembers():
                    if file_fragment_name in member.name:
                        extracted_file = tar.extractfile(member)
                        if extracted_file:
                            content = extracted_file.read()
                            output_filename = member.name.replace('/', '_')
                            with open(output_filename, 'wb') as f_out:
                                f_out.write(content)
                            print(f"Extracted and saved {member.name} as {output_filename}.")
                            extracted = True
                            # Download the file to the local system
                            files.download(output_filename)
                            break  # Stop at the first matching member
                if not extracted:
                    print(f"File containing '{file_fragment_name}' not found in the tar archive.")
    except tarfile.TarError as e:
        print(f"Error reading the tar file: {e}")
    finally:
        # Clean up the temporary file
        os.remove(tar_file_path)

# Example usage
folder_navigate = 'v4'  #@param {type:"string"}
tar_file = 'UP000005640_9606_HUMAN_v4.tar' #@param {type:"string"}
file_fragment_names = 'Q8WZ42-F1, Q8WZ42-F2' #@param {type:"string"}
extract_specific_file_from_ftp(folder_navigate, tar_file, file_fragment_names)

Extracted and saved AF-Q8WZ42-F1-model_v4.cif.gz as AF-Q8WZ42-F1-model_v4.cif.gz.



Extracted and saved AF-Q8WZ42-F2-model_v4.cif.gz as AF-Q8WZ42-F2-model_v4.cif.gz.
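
The saved files are gzip-compressed (`.cif.gz`), so they need to be decompressed before inspection. As a minimal sketch, the hypothetical helper `read_first_lines` below uses Python's standard `gzip` module to peek at the start of such a file without unpacking it to disk. It assumes the file is UTF-8 text, which is an assumption about the content rather than something the notebook states.

```python
import gzip
from itertools import islice


def read_first_lines(path, n=5):
    """Return the first n text lines of a gzip-compressed file."""
    # 'rt' opens the compressed stream in text mode; UTF-8 is an assumption
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Example call, once the file from the cell above exists locally:
# for line in read_first_lines("AF-Q8WZ42-F1-model_v4.cif.gz"):
#     print(line)
```

Reading only the first few lines keeps memory use small, which helps with larger model files.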

