<a href="https://colab.research.google.com/github/paulynamagana/AFDB_notebooks/blob/main/AFDB_FTP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<img src = "https://www.embl.org/about/info/communications/wp-content/uploads/2017/09/Ebi_official_logo.png"
 height="100" align="right">

# Access structures from AlphaFold DB via FTP

FTP, or File Transfer Protocol, is a standard network protocol facilitating the exchange of files between computers.



<br>

As of September 2023, the EMBL-EBI’s FTP area hosts TAR files for proteomes of 48 organisms, including model organisms and WHO pathogens of interest.

We document every data version update in our [CHANGELOG](https://ftp.ebi.ac.uk/pub/databases/alphafold/CHANGELOG.txt).


<br>

The proteome archives are named following a structured convention: three elements separated by underscores, followed by the database version (for example, `UP000000429_85962_HELPY_v6.tar`):

- Reference Proteome (UPID): UP000000429
- Taxonomy ID: 85962
- Organism: HELPY (derived from the first three characters of the genus, "Helicobacter," and the first two characters of the species, "pylori").
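
The naming convention above can be unpacked programmatically. The snippet below is a minimal sketch, not part of the original notebook: `ArchiveName` and `parse_archive_name` are hypothetical helper names, and the code assumes the archive name ends with a version suffix and `.tar`, as in the FTP listings shown in this notebook.

```python
from typing import NamedTuple


class ArchiveName(NamedTuple):
    """Components of a proteome archive name (hypothetical helper type)."""
    upid: str
    taxonomy_id: int
    organism: str
    version: str


def parse_archive_name(name: str) -> ArchiveName:
    """Split a name such as 'UP000000429_85962_HELPY_v6.tar' into its parts."""
    # Drop the .tar extension if it is present
    stem = name[:-len(".tar")] if name.endswith(".tar") else name
    # The version is the last underscore-separated element
    body, version = stem.rsplit("_", 1)
    # The remaining elements are UPID, taxonomy ID and organism mnemonic
    upid, taxonomy_id, organism = body.split("_", 2)
    return ArchiveName(upid, int(taxonomy_id), organism, version)


parsed = parse_archive_name("UP000000429_85962_HELPY_v6.tar")
print(parsed.upid, parsed.taxonomy_id, parsed.organism, parsed.version)
```

Returning a named tuple keeps each element addressable by name, which is convenient when filtering a listing by taxonomy ID or organism.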

<br>

To learn more about what each folder contains, visit the [Downloads tab](https://alphafold.ebi.ac.uk/download).

You can also find the compressed files for Swiss-Prot, which contains 542,378 predicted structures:

|File type|File name|Size|
|---------|--------------|---------------------|
|Swiss-Prot (CIF Files)|swissprot_cif_v4.tar| 37,643 MB|
|Swiss-Prot (PDB files)|swissprot_pdb_v4.tar|26,935 MB|




In [None]:
#@title #Run to see what's in the FTP area
#@markdown Run this block to see what's in the FTP area
import ftplib
import io
import json
from ftplib import FTP
import tarfile
import tempfile
import os
from google.colab import files
import zipfile

ftp_server = ftplib.FTP("ftp.ebi.ac.uk")

# Login as an anonymous user
ftp_server.login("anonymous", "anonymous@")

# Navigate to the directory
ftp_server.cwd("/pub/databases/alphafold/")

# Retrieve and print the list of file names in the directory
file_names = ftp_server.nlst()
for file_name in file_names:
    print(file_name)

CHANGELOG.txt
Outreach
README.txt
Training
accession_ids.csv
diffs.ndjson.gz
download_metadata.json
latest
proteomes
sequences.fasta
v1
v2
v3
v4
v5
v6
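
The listing mixes documentation files, metadata and version directories. As a minimal sketch (not part of the original notebook), the hypothetical helper `version_directories` below picks out the version directories and sorts them numerically. It assumes version directories are always named `v` followed by an integer, as in the listing above.

```python
import re


def version_directories(names):
    """Return entries named like 'v1', 'v2', ... sorted by their numeric part."""
    pattern = re.compile(r"v(\d+)")
    versions = [name for name in names if pattern.fullmatch(name)]
    # Sort numerically so that 'v10' would come after 'v9'
    return sorted(versions, key=lambda name: int(name[1:]))


# Entries copied from the listing printed above
listing = [
    "CHANGELOG.txt", "Outreach", "README.txt", "Training",
    "accession_ids.csv", "diffs.ndjson.gz", "download_metadata.json",
    "latest", "proteomes", "sequences.fasta",
    "v1", "v2", "v3", "v4", "v5", "v6",
]

versions = version_directories(listing)
print("Version directories:", versions)
print("Most recent numbered version:", versions[-1])
```

In the notebook, `listing` could be replaced by the `file_names` list returned by `ftp_server.nlst()`.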


In [None]:
#@title #See all the available proteomes in one version
#@markdown This block retrieves the list of files inside the version directory you choose

folder_navigate = "v6" #@param {type:"string"}
#@markdown `folder_navigate` is the AFDB version whose archives you want to list and download

# Navigate to the chosen version directory (absolute path, so the cell can be re-run safely)
ftp_server.cwd(f"/pub/databases/alphafold/{folder_navigate}")

# Retrieve and print the list of file names in the directory
file_names = ftp_server.nlst()
for file_name in file_names:
    print(file_name)

UP000000429_85962_HELPY_v6.tar
UP000000437_7955_DANRE_v6.tar
UP000000535_242231_NEIG1_v6.tar
UP000000559_237561_CANAL_v6.tar
UP000000579_71421_HAEIN_v6.tar
UP000000586_171101_STRR6_v6.tar
UP000000589_10090_MOUSE_v6.tar
UP000000625_83333_ECOLI_v6.tar
UP000000799_192222_CAMJE_v6.tar
UP000000803_7227_DROME_v6.tar
UP000000805_243232_METJA_v6.tar
UP000000806_272631_MYCLE_v6.tar
UP000001014_99287_SALTY_v6.tar
UP000001450_36329_PLAF7_v6.tar
UP000001584_83332_MYCTU_v6.tar
UP000001631_447093_AJECG_v6.tar
UP000001940_6239_CAEEL_v6.tar
UP000002059_502779_PARBA_v6.tar
UP000002195_44689_DICDI_v6.tar
UP000002296_353153_TRYCC_v6.tar
UP000002311_559292_YEAST_v6.tar
UP000002438_208964_PSEAE_v6.tar
UP000002485_284812_SCHPO_v6.tar
UP000002494_10116_RAT_v6.tar
UP000002716_300267_SHIDS_v6.tar
UP000005640_9606_HUMAN_v6.tar
UP000006304_1133849_9NOCA1_v6.tar
UP000006548_3702_ARATH_v6.tar
UP000006672_6279_BRUMA_v6.tar
UP000007305_4577_MAIZE_v6.tar
UP000007841_1125630_KLEPH_v6.tar
UP000008153_5671_LEIIN_v6.tar
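
The CHANGELOG link at the top of this notebook shows that the same area is reachable over HTTPS. The sketch below builds a direct link to one of the archives listed above. It is an illustration only: `archive_url` is a hypothetical helper, and it assumes the version directories are served under the same base path as the CHANGELOG. It also checks that the archive name ends with the chosen version before building the link.

```python
from urllib.parse import quote

# Base path taken from the CHANGELOG link; assumed to serve the version folders too
BASE_URL = "https://ftp.ebi.ac.uk/pub/databases/alphafold"


def archive_url(version: str, archive_name: str) -> str:
    """Build the HTTPS URL for an archive inside a version directory."""
    # Guard against mixing versions, e.g. a *_v4.tar name with the v6 directory
    if not archive_name.endswith(f"_{version}.tar"):
        raise ValueError(f"{archive_name} does not belong to version {version}")
    return f"{BASE_URL}/{quote(version)}/{quote(archive_name)}"


print(archive_url("v6", "UP000000625_83333_ECOLI_v6.tar"))
```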


In [None]:
#@title #Get the metadata
#@markdown This block retrieves the metadata and prints it. For each tar file you can see the species, the common name and the accompanying metadata.

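# FTP server details (stored under a separate name so the `ftp_server`
# connection opened in the earlier cells is not overwritten by a string)
ftp_host = "ftp.ebi.ac.uk"

try:
    with ftplib.FTP(ftp_host) as ftp: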
        print("Accessing metadata...")
        ftp.login(user="anonymous", passwd="anonymous")
        ftp.cwd("/pub/databases/alphafold")  # Navigate to the directory containing the metadata file

        with io.BytesIO() as bio:
            # Attempt to download the metadata file
            ftp.retrbinary('RETR download_metadata.json', bio.write)
            bio.seek(0)  # Go to the start of the BytesIO buffer
            metadata = json.load(bio)

    # Assuming metadata is a list of dictionaries that all share the same keys
    if metadata:
        # Print the headers
        headers = metadata[0].keys()
        print("\t".join(headers))

        # Print the values for each record
        for record in metadata:
            values = [str(record.get(key, "")) for key in headers]  # Tolerate missing keys
            print("\t".join(values))
    else:
        print("No metadata found.")

except Exception as e:
    print(f"Failed to fetch metadata: {e}")


Accessing metadata...
archive_name	species	common_name	latin_common_name	reference_proteome	num_predicted_structures	size_bytes	type
UP000006548_3702_ARATH_v6.tar	Arabidopsis thaliana	Arabidopsis	True	UP000006548	27402	3877400576	proteome
UP000001940_6239_CAEEL_v6.tar	Caenorhabditis elegans	Nematode worm	False	UP000001940	19700	2777489920	proteome
UP000000559_237561_CANAL_v6.tar	Candida albicans	C. albicans	True	UP000000559	5973	1028541440	proteome
UP000000437_7955_DANRE_v6.tar	Danio rerio	Zebrafish	False	UP000000437	26290	4979466752	proteome
UP000002195_44689_DICDI_v6.tar	Dictyostelium discoideum	Dictyostelium	True	UP000002195	12612	2292386816	proteome
UP000000803_7227_DROME_v6.tar	Drosophila melanogaster	Fruit fly	False	UP000000803	13461	2319755776	proteome
UP000000625_83333_ECOLI_v6.tar	Escherichia coli	E. coli	True	UP000000625	4370	477871104	proteome
UP000008827_3847_SOYBN_v6.tar	Glycine max	Soybean	False	UP000008827	55796	7616606720	proteome
UP000005640_9606_HUMAN_v6.tar	Homo sapi
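
Raw byte counts are hard to compare at a glance. The sketch below, which is not part of the original notebook, formats `size_bytes` in binary units and ranks archives by size. `human_size` and `smallest_archives` are hypothetical helpers, the sample records are copied from the output above, and the code assumes `size_bytes` can be converted with `int()`.

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count using binary units (B, KiB, MiB, GiB, TiB)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"


def smallest_archives(records, limit=3):
    """Return (archive_name, readable size) pairs for the smallest archives."""
    ordered = sorted(records, key=lambda record: int(record["size_bytes"]))
    return [
        (record["archive_name"], human_size(int(record["size_bytes"])))
        for record in ordered[:limit]
    ]


# A few records copied from the metadata printed above (only the keys used here)
sample_records = [
    {"archive_name": "UP000006548_3702_ARATH_v6.tar", "size_bytes": 3877400576},
    {"archive_name": "UP000000559_237561_CANAL_v6.tar", "size_bytes": 1028541440},
    {"archive_name": "UP000000625_83333_ECOLI_v6.tar", "size_bytes": 477871104},
]

for name, size in smallest_archives(sample_records):
    print(f"{name}\t{size}")
```

In the notebook, `sample_records` could be replaced by the `metadata` list loaded from `download_metadata.json`, which is useful for checking how much disk space an archive needs before downloading it.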

In [None]:
#@title #Extract specific file type (mmCIF or PDB) for all fragments for a UniProt accession
#@markdown This block will download all the fragments for a specific UniProt accession <br>
#@markdown <strong>NOTE:</strong>  You will get a pop-up window to download files to your local computer


#Input parameters
database_version = 'v6' #@param {type:"string"}
tar_file = 'UP000005640_9606_HUMAN_v6.tar' #@param {type:"string"}
#@markdown Make sure the version at the end of the tar file name matches `database_version`. For example, the name should end in `_v6.tar` when you search version 6 of the database
UniProt_accession = 'Q8WXH0'  #@param {type:"string"}
file_type = "pdb" #@param {type:"string"}
#@markdown Enter `cif` or `pdb` as the file type, without any whitespace

def extract_files_and_zip_from_ftp(folder_navigate, tar_file, file_fragment_name, file_type):
    username = 'anonymous'
    password = 'anonymous'
    ftp_server = "ftp.ebi.ac.uk"
    base_path = f'pub/databases/alphafold/{folder_navigate}/'
    file_path = base_path + tar_file
    extracted_files = []  # To keep track of extracted file names

    # Connect to the FTP server
    with FTP(ftp_server) as ftp:
        ftp.login(username, password)

        # Use a temporary file to store the tar file
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            tar_file_path = tmp_file.name
            try:
                ftp.retrbinary(f'RETR {file_path}', tmp_file.write)
            except Exception as e:
                print(f"Error downloading the file: {e}")
                tmp_file.close()
                os.remove(tar_file_path)  # Do not leave a partial download behind
                return

        # Open the temporary tar file for reading
        try:
            with tarfile.open(tar_file_path, mode="r:*") as tar:
                # Search the whole tar file for members whose name contains the accession
                # and ends with the requested extension (.cif.gz or .pdb.gz)
                for member in tar.getmembers():
                    if file_fragment_name in member.name and member.name.endswith(f'.{file_type.lower()}.gz'):
                        extracted_file = tar.extractfile(member)
                        if extracted_file:
                            content = extracted_file.read()
                            output_filename = f"{member.name.replace('/', '_')}"
                            with open(output_filename, 'wb') as f_out:
                                f_out.write(content)
                            print(f"Extracted and saved {member.name} as {output_filename}.")
                            extracted_files.append(output_filename)
        except tarfile.TarError as e:
            print(f"Error reading the tar file: {e}")
        finally:
            # Clean up the temporary file
            os.remove(tar_file_path)

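    # Nothing matched: skip creating and downloading an empty archive
    if not extracted_files:
        print(f"No {file_type} files found for {file_fragment_name} in {tar_file}.")
        return

    # Zip the extracted files
    zip_filename = "extracted_files.zip"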
    with zipfile.ZipFile(zip_filename, 'w') as zipf:
        for file in extracted_files:
            zipf.write(file)
            os.remove(file)  # Optional: remove the file after adding it to the zip to save space
    print(f"Created zip archive: {zip_filename}")

    # Download the zip file
    files.download(zip_filename)



extract_files_and_zip_from_ftp(database_version, tar_file, UniProt_accession, file_type)


Extracted and saved AF-Q8WXH0-F1-model_v6.pdb.gz as AF-Q8WXH0-F1-model_v6.pdb.gz.
Extracted and saved AF-Q8WXH0-F10-model_v6.pdb.gz as AF-Q8WXH0-F10-model_v6.pdb.gz.
Extracted and saved AF-Q8WXH0-F11-model_v6.pdb.gz as AF-Q8WXH0-F11-model_v6.pdb.gz.
Extracted and saved AF-Q8WXH0-F12-model_v6.pdb.gz as AF-Q8WXH0-F12-model_v6.pdb.gz.
Extracted and saved AF-Q8WXH0-F13-model_v6.pdb.gz as AF-Q8WXH0-F13-model_v6.pdb.gz.
Extracted and saved AF-Q8WXH0-F14-model_v6.pdb.gz as AF-Q8WXH0-F14-model_v6.pdb.gz.
Extracted and saved AF-Q8WXH0-F15-model_v6.pdb.gz as AF-Q8WXH0-F15-model_v6.pdb.gz.
Extracted and saved AF-Q8WXH0-F16-model_v6.pdb.gz as AF-Q8WXH0-F16-model_v6.pdb.gz.
Extracted and saved AF-Q8WXH0-F17-model_v6.pdb.gz as AF-Q8WXH0-F17-model_v6.pdb.gz.
Extracted and saved AF-Q8WXH0-F18-model_v6.pdb.gz as AF-Q8WXH0-F18-model_v6.pdb.gz.
Extracted and saved AF-Q8WXH0-F19-model_v6.pdb.gz as AF-Q8WXH0-F19-model_v6.pdb.gz.
Extracted and saved AF-Q8WXH0-F2-model_v6.pdb.gz as AF-Q8WXH0-F2-model_v6.pdb.
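
The fragments above are printed in the order they were matched, so `F10` to `F19` appear before `F2`. The sketch below, which is not part of the original notebook, sorts such file names by fragment number. `FRAGMENT_PATTERN` and `fragment_number` are hypothetical names, and the pattern is inferred from the file names printed above, so other file types in the archive may not match it.

```python
import re

# Pattern inferred from names such as 'AF-Q8WXH0-F1-model_v6.pdb.gz'
FRAGMENT_PATTERN = re.compile(
    r"AF-(?P<accession>[A-Z0-9]+)-F(?P<fragment>\d+)"
    r"-model_(?P<version>v\d+)\.(?P<ext>pdb|cif)\.gz"
)


def fragment_number(filename: str) -> int:
    """Return the fragment index (the number after '-F') of a model file name."""
    match = FRAGMENT_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    return int(match.group("fragment"))


names = [
    "AF-Q8WXH0-F1-model_v6.pdb.gz",
    "AF-Q8WXH0-F10-model_v6.pdb.gz",
    "AF-Q8WXH0-F11-model_v6.pdb.gz",
    "AF-Q8WXH0-F2-model_v6.pdb.gz",
]

# Sort by the numeric fragment index instead of alphabetically
ordered = sorted(names, key=fragment_number)
for name in ordered:
    print(name)
```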
