<a href="https://colab.research.google.com/github/rcsb/rcsb-training-resources/blob/master/training-events/2024/utilizing-binary-cif/RCSB_mmCIF_BCIF_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demonstration of working with mmCIF and BCIF using RCSB PDB Python Packages

## Set-up
#### Install packages (and upgrade them regularly)

In [None]:
!pip install --upgrade mmcif
!pip install --upgrade rcsb.utils.io
# Re-run this cell periodically to keep both packages up to date.


##### Package GitHub repositories:
- https://github.com/rcsb/py-mmcif
- https://github.com/rcsb/py-rcsb_utils_io (see README.md for additional details)

## 1. Reading an mmCIF or BCIF File

In [131]:
from rcsb.utils.io.MarshalUtil import MarshalUtil

mU = MarshalUtil()

# Reading mmCIF
# Load from remote URL
dataContainerList = mU.doImport("https://files.rcsb.org/download/4HHB.cif.gz", fmt="mmcif")

# Or, load from a local file (either compressed or uncompressed)
# dataContainerList = mU.doImport("local/path/to/file.cif", fmt="mmcif")

In [132]:
# Reading BCIF (note the URL change)
dataContainerList = mU.doImport("https://models.rcsb.org/4HHB.bcif.gz", fmt="bcif")
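
Both calls return a list of data containers. A failed download or parse may leave that list empty (this is an assumption about how `doImport` reports failures, worth checking against your installed version), so it can help to guard the first index. The sketch below uses a hypothetical helper, `first_container`, which is not part of either package, and stand-in values so it runs without network access.

```python
def first_container(container_list, source="<unknown>"):
    """Return the first item of an import result, or raise a clear error.

    `container_list` is whatever the import call returned; `source` is only
    used in the error message. Hypothetical convenience helper.
    """
    if not container_list:
        raise ValueError(f"No data containers were read from {source}")
    return container_list[0]


# Stand-in list so the sketch runs offline; in the notebook you would pass
# the list returned by mU.doImport(...).
print(first_container(["container_4HHB"], source="4HHB.bcif.gz"))
```

In the notebook itself, this would be called as `first_container(dataContainerList, source="4HHB.bcif.gz")`.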

## 2. Accessing Data Categories

Once the data is loaded, you can access and inspect the data categories:

In [None]:
# Get the first data container (usually there's only one per file)
dataContainer = dataContainerList[0]

# Print the container name
containerName = dataContainer.getName()
print(f"Container Name: {containerName}")

In [None]:
# Get the list of categories
categoryNames = dataContainer.getObjNameList()
print("Categories:", categoryNames)

In [None]:
# Access a specific category and print its first 5 rows
if dataContainer.exists("atom_site"):
    atomSiteObj = dataContainer.getObj("atom_site")
    # Each row is returned as a dictionary keyed by attribute name
    for i in range(min(5, atomSiteObj.getRowCount())):
        rowData = atomSiteObj.getRowAttributeDict(i)
        print(rowData)
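
Because each row comes back as a plain dictionary, ordinary Python tools can summarize a category. The following is a minimal sketch that counts rows per value of one attribute. The helper name `count_by_attribute` is hypothetical, and the sample rows are illustrative placeholders rather than values copied from 4HHB.

```python
from collections import Counter


def count_by_attribute(rows, attribute):
    """Count rows per distinct value of `attribute`, skipping rows without it."""
    return Counter(row[attribute] for row in rows if attribute in row)


# Illustrative rows only (not copied from 4HHB); real rows would come from
# atomSiteObj.getRowAttributeDict(i) and carry many more attributes.
sample_rows = [
    {"group_PDB": "ATOM", "label_asym_id": "A", "type_symbol": "N"},
    {"group_PDB": "ATOM", "label_asym_id": "A", "type_symbol": "C"},
    {"group_PDB": "ATOM", "label_asym_id": "B", "type_symbol": "C"},
]
print(count_by_attribute(sample_rows, "label_asym_id"))
```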

In [None]:
# Creating a dictionary from a list of DataContainers
# Layout: {container name: {category name: [row dictionaries]}}
dcD = {}

for container in dataContainerList:
    eName = container.getName()
    # Use each container's own category list
    for catName in container.getObjNameList():
        dObj = container.getObj(catName)
        for ii in range(dObj.getRowCount()):
            dD = dObj.getRowAttributeDict(ii)
            dcD.setdefault(eName, {}).setdefault(catName, []).append(dD)

# Print the first 5 categories (with their rows) for each container
for k, v in dcD.items():
    print(k)
    for k2, v2 in list(v.items())[:5]:
        print(k2, v2)

In [None]:
# FYI: you can also export (and import) JSON and pickle data.
# (This works for any JSON-serializable or picklable Python object; it doesn't need to be CIF-related.)

# Export as JSON
mU.doExport("4HHB.json", dcD, fmt="json")

# Export as Pickle file
mU.doExport("4HHB.pic", dcD, fmt="pickle")
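
JSON stores only basic types, so a nested dictionary of strings such as `dcD` should survive a round trip unchanged. Below is a minimal sketch of that check using only the standard library, with a small stand-in dictionary and a temporary file so that it does not depend on the files written above. The dictionary contents are illustrative.

```python
import json
import os
import tempfile

# Stand-in for dcD: {container name: {category name: [row dicts]}}.
# The values are illustrative, not taken from 4HHB.
sample = {"DEMO": {"new_category": [{"ordinal": "1", "attribute1": "a"}]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "demo.json")
    # Write the dictionary out as JSON
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(sample, handle)
    # Read it back in
    with open(json_path, encoding="utf-8") as handle:
        restored = json.load(handle)

round_trip_ok = restored == sample
print("Round trip preserved the data:", round_trip_ok)
```

Within the notebook, reading the exported file back with `mU.doImport("4HHB.json", fmt="json")` and comparing the result to `dcD` should serve the same purpose.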

## 3. Manipulating, Deleting, and Adding Categories

In [138]:
# First, let's create a copy of the dataContainerList to work with
from copy import deepcopy
dataContainerListCopy = deepcopy(dataContainerList)
dc = dataContainerListCopy[0]

In [None]:
### Renaming a category
# For example, to rename "citation" to "citation_reference"
dc.rename("citation", "citation_reference")

In [140]:
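### Delete a Category
# For example, to delete all EM-related categories (names starting with "em_").
# 4HHB is an X-ray structure, so this loop finds nothing to remove here.
for catName in list(dc.getObjNameList()):
    if catName.startswith("em_"):
        dc.remove(catName)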

In [141]:
### Add a New Category
# To add a new category to the data container:
from mmcif.api.DataCategory import DataCategory

# Create a new category object
newCategory = DataCategory("new_category", attributeNameList=["ordinal", "attribute1", "attribute2"])

# Add data to the category
newCategory.append([1, "a", "b"])
newCategory.append([2, "c", "d"])
newCategory.append([3, "e", "f"])
newCategory.append([4, "g", "h"])

# Add the new category to the data container
dc.append(newCategory)
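
When the new data starts out as a list of dictionaries rather than ready-made rows, it first has to be flattened into an attribute list plus row lists, which is the shape used above. A minimal sketch follows, assuming that missing values should be written as `?` (the CIF marker for an unknown value). The helper name `records_to_rows` is hypothetical and not part of `mmcif`.

```python
def records_to_rows(records):
    """Convert a list of dicts into (attribute_names, rows).

    Attribute order follows first appearance; values missing from a record
    are filled with "?", the CIF marker for an unknown value.
    """
    attribute_names = []
    for record in records:
        for key in record:
            if key not in attribute_names:
                attribute_names.append(key)
    rows = [[record.get(name, "?") for name in attribute_names] for record in records]
    return attribute_names, rows


records = [
    {"ordinal": 1, "attribute1": "a", "attribute2": "b"},
    {"ordinal": 2, "attribute1": "c"},  # attribute2 is missing here
]
names, rows = records_to_rows(records)
print(names)
print(rows)

# With mmcif installed, the result could then be loaded as in the cell above:
#   newCategory = DataCategory("new_category", attributeNameList=names)
#   for row in rows:
#       newCategory.append(row)
```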

In [None]:
# Now verify the changes above took effect (check for "citation_reference", "new_category", and no "em_*" categories)
categoryNames = dc.getObjNameList()
print("Categories:", categoryNames)

In [None]:
# You can also export it to check (more on this below)
mU.doExport("4HHB_modified.cif", dataContainerListCopy, fmt="mmcif")

In [None]:
### All available Data Container methods:

# In a Python interpreter, type `dc.` and then press Tab to see all available methods:

# >>> dc.
# dc.append(                 dc.getGlobal()             dc.getObjNameList()        dc.invokeDataBlockMethod(  dc.rename(                 dc.setProp(
# dc.copy(                   dc.getName()               dc.getProp(                dc.merge(                  dc.replace(                dc.setType(
# dc.exists(                 dc.getObj(                 dc.getPropCatalog()        dc.printIt(                dc.setGlobal()             dc.toJSON()
# dc.filterObjectNameList(   dc.getObjCatalog()         dc.getType()               dc.remove(                 dc.setName(

# Type `help(method_name)` to get info about a method, e.g.:
help(dc.remove)


## 4. Exporting Data

In [None]:
### Export as mmCIF - One simple step
mU.doExport("4HHB_new.cif", dataContainerList, fmt="mmcif")

In [146]:
### Export as BCIF - A couple extra steps.

# First, create a DictionaryApi provider (only need to do once)
from mmcif.api.DictionaryApi import DictionaryApi
from mmcif.io.IoAdapterPy import IoAdapterPy as IoAdapter

# Include common PDBx/mmCIF dictionary and CSM extension (ModelCIF) dictionary
dictFilePathL = [
    "https://raw.githubusercontent.com/wwpdb-dictionaries/mmcif_pdbx/master/dist/mmcif_pdbx_v5_next.dic",
    "https://raw.githubusercontent.com/ihmwg/ModelCIF/master/dist/mmcif_ma_ext.dic",
]
myIo = IoAdapter(raiseExceptions=True)
dApiContainerList = []
for dictFilePath in dictFilePathL:
    dApiContainerList += myIo.readFile(inputFilePath=dictFilePath)
dictionaryApi = DictionaryApi(containerList=dApiContainerList, consolidate=True)

In [None]:
# After doing the above, you can export as many BCIFs as you wish:
mU.doExport("4HHB_new.bcif", dataContainerList, fmt="bcif", dictionaryApi=dictionaryApi)
mU.doExport("4HHB_new2.bcif", dataContainerList, fmt="bcif", dictionaryApi=dictionaryApi)
mU.doExport("4HHB_new3.bcif", dataContainerList, fmt="bcif", dictionaryApi=dictionaryApi)

In [None]:
### Compress the file with Gzip
from rcsb.utils.io.FileUtil import FileUtil
fU = FileUtil(workPath=".")

# Compress the file
fU.compress("4HHB_new.bcif", "4HHB_new.bcif.gz")

# Remove the uncompressed file
mU.remove("4HHB_new.bcif")
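
Compress-then-delete is safest when the compressed copy is verified before the original is removed. Here is a rough sketch of that idea using only the standard library. It does not use `FileUtil`; `gzip_file` is a hypothetical helper, and the payload is placeholder bytes rather than real BCIF content.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path, dst_path):
    """Write a gzip-compressed copy of src_path to dst_path."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "demo.bcif")
    gz_path = raw_path + ".gz"
    payload = b"placeholder bytes, not real BCIF content" * 100
    with open(raw_path, "wb") as handle:
        handle.write(payload)

    gzip_file(raw_path, gz_path)

    # Verify the compressed copy before deleting the original
    with gzip.open(gz_path, "rb") as handle:
        round_trip_matches = handle.read() == payload
    if round_trip_matches:
        os.remove(raw_path)
    original_removed = not os.path.exists(raw_path)

print("Compressed copy verified:", round_trip_matches)
```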

##### **Note:** The method above for exporting data as BCIF is still being improved, so stay tuned for software updates that simplify the process further (for example, by removing the need to create a DictionaryApi and by compressing the output with GZIP automatically).

##### Follow the GitHub repositories for updates:
- https://github.com/rcsb/py-mmcif
- https://github.com/rcsb/py-rcsb_utils_io

## 5. Working with computed structure models (CSMs)

In [149]:
# Read in a CSM (from AlphaFold DB)
dataContainerList = mU.doImport("https://alphafold.ebi.ac.uk/files/AF-P24854-F1-model_v4.cif", fmt="mmcif")

In [None]:
# Export as BCIF (by specifying the same dictionaryApi object created above)
mU.doExport("AF-P24854-F1.bcif", dataContainerList, fmt="bcif", dictionaryApi=dictionaryApi)