# Microcompartment Database (BMC-DB)

## Project Description

This project provides a database framework for storing and linking information
about **bacterial microcompartment (BMC) proteins**, their genes, modifications,
isoforms, complexes, and external database references.

### From microcomp_db.py:

This script is the main entry point for the database project. It creates the
database, reads data from a CSV file, and links that data to the database. It
uses click to manage the command-line interface and SQLAlchemy to manage the
database connection and operations.

### From db.py:

This script defines the BMC database schema using SQLAlchemy, along with the functions that add data to it.

### From file_and_data.py:

This script links manually added data to the BMC database: it takes the CSV data read by readfile.py and loads it into the database described in db.py.

### From readfile.py:

This script reads the Excel/CSV file used for manual data addition to the database.

## Installation and getting started

BMC-database relies on several other programs, packages, and tools, both for normal use and for development. Many of these dependencies are installed automatically during setup, but some may need to be downloaded and installed separately.

    # Clone the repository
    git clone https://github.com/monserrj/BMC-database
    cd BMC-database

    # Install dependencies (sqlalchemy, click, pandas)
    pip install -r requirements.txt

* Python version 3.13
* SQLite version 3

## Database Structure - db.py file

SQL schema of the database:

![BMC Database Schema](PASTE IMAGE LATER)

Interactive diagram: https://dbdiagram.io/d/DB_STRUCTURE_052025-682494635b2fc4582f9453e4


**Tables**

* Protein: stores protein sequence, accession, structure, canonical/isoform

* Xdatabase: external database descriptions

* Xref: database cross-references

* ProteinXref: protein-to-database link

* Cds: gene details linked to protein

* Origin: original DNA sequence for engineered CDS

* CdsXref: CDS-to-database link

* Modif: modifications of engineered CDS

* CdsModif: CDS-to-modification link

* Name: protein names

* ProteinName: protein-to-name link

* Isoforms: canonical vs isoform proteins

* Complex: BMC protein complex details

* ProteinComplex: protein-to-complex link

* Interaction: interactions between proteins

* Prot_prot_interact: protein-to-protein link via interactions
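
The link tables listed above (ProteinName, ProteinXref, CdsXref, and so on) follow the usual many-to-many pattern. The sketch below illustrates that pattern for Protein, Name, and ProteinName. It uses Python's built-in sqlite3 module instead of the project's SQLAlchemy models so that it runs without extra dependencies. The table and column names, the helper functions, and the alias values are assumptions made for illustration, not the actual schema in db.py.

```python
import sqlite3

# Hypothetical, simplified schema; the real models are defined in db.py.
SCHEMA = """
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    prot_accession TEXT UNIQUE NOT NULL
);
CREATE TABLE name (
    name_id INTEGER PRIMARY KEY,
    prot_name TEXT UNIQUE NOT NULL
);
CREATE TABLE protein_name (
    protein_id INTEGER NOT NULL REFERENCES protein(protein_id),
    name_id INTEGER NOT NULL REFERENCES name(name_id),
    PRIMARY KEY (protein_id, name_id)
);
"""


def names_for(conn, accession):
    """Return all names linked to the protein with this accession."""
    rows = conn.execute(
        """
        SELECT name.prot_name
        FROM protein
        JOIN protein_name ON protein_name.protein_id = protein.protein_id
        JOIN name ON name.name_id = protein_name.name_id
        WHERE protein.prot_accession = ?
        ORDER BY name.prot_name
        """,
        (accession,),
    ).fetchall()
    return [row[0] for row in rows]


def demo_connection():
    """Build an in-memory database with one protein linked to two made-up names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO protein VALUES (1, 'P12345')")
    conn.executemany(
        "INSERT INTO name VALUES (?, ?)", [(1, "alias-1"), (2, "alias-2")]
    )
    conn.executemany("INSERT INTO protein_name VALUES (?, ?)", [(1, 1), (1, 2)])
    return conn
```

Because the link table holds only the two foreign keys, one protein can carry several names and one name can be shared by several proteins without duplicating either record.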

**Functions for addition of data to the database**

These functions should check that the information added to the database is not duplicated and that relationships are added correctly.

<div class="alert alert-block alert-danger"> 
<b>Not tested yet</b>

* protein_addition(protseq, protaccession, struct, canonical)

* xdatabase_addition(xdb_name, href, xdb_type)

* cds_addition(cdsseq, cdsaccession, cds_origin, protein)

* xref_addition(xdb, cds, protein, xrefacc)

* name_addition(protname, protein)
</div>
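
As an illustration of the duplicate check described above, the sketch below looks up a protein by accession before inserting it and reuses the existing row when one is found. It uses the built-in sqlite3 module rather than SQLAlchemy. The function names `add_protein` and `make_demo_db`, the table layout, and the short sequence are hypothetical; this is not the code of protein_addition in db.py.

```python
import sqlite3


def add_protein(conn, protseq, protaccession):
    """Insert a protein unless its accession already exists; return its row id."""
    row = conn.execute(
        "SELECT protein_id FROM protein WHERE prot_accession = ?",
        (protaccession,),
    ).fetchone()
    if row is not None:
        return row[0]  # already stored: reuse the existing record
    cur = conn.execute(
        "INSERT INTO protein (prot_seq, prot_accession) VALUES (?, ?)",
        (protseq, protaccession),
    )
    conn.commit()
    return cur.lastrowid


def make_demo_db():
    """Create an in-memory database with a simplified, assumed protein table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein ("
        "protein_id INTEGER PRIMARY KEY, "
        "prot_seq TEXT NOT NULL, "
        "prot_accession TEXT UNIQUE NOT NULL)"
    )
    return conn
```

Calling `add_protein` twice with the same accession returns the same id and leaves a single row in the table. The UNIQUE constraint acts as a second line of defence if the lookup is ever bypassed.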

**Functions to create the database and obtain the session**
* create_db()
* get_session()

## Addition of data manually to the database

**Module: file_and_data.py**

Links manual data addition to the BMC database using CSV files.

*Functions:* link_db_csv(): adds data from the input file to the database in a loop.

**Module: readfile.py**

Reads Excel/CSV files for manual data addition to the database.

*Functions:* cli_open_csvfile(): prompts for the file path and loads the CSV data.
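
For readers unfamiliar with the loading step, here is a minimal sketch of reading CSV rows into dictionaries keyed by the header line. It uses the standard-library csv module rather than pandas, which the project lists as a requirement. The function names and the two-column layout are assumptions; the real input file and cli_open_csvfile() may differ.

```python
import csv
import io


def read_rows(handle):
    """Return the CSV rows as a list of dicts keyed by the header line."""
    return list(csv.DictReader(handle))


def read_csv_file(path):
    """Open a CSV file at the given path and return its rows."""
    with open(path, newline="") as handle:
        return read_rows(handle)


# Made-up two-column input held in memory; the real file layout may differ.
sample = io.StringIO("prot_accession,prot_seq\nP12345,MTEYK\n")
rows = read_rows(sample)
```

Each element of `rows` maps column names to string values, which makes it straightforward to pass a row's fields to the addition functions one record at a time.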

## microcomp_db.py

Main entry point for project. Handles:

* Database creation

* Reading CSV input

* Linking data to database

Uses click for the CLI and SQLAlchemy for the database connection and operations.

**Running the script**

```bash
python microcomp_db.py --dbpath my_database.db --csvpath data/proteins.csv --force --verbose
```

**Available options**

* --dbpath PATH: Path where the SQLite database will be created.

    * Default: bmc.db (in current folder).

* --csvpath PATH: Path to the CSV file with protein data to be imported.

    * Default: ../data/raw/prot_info/prot_data_minimal_correct_UP.csv.

* --force / --no-force:

    * If --force is set, the script will overwrite an existing database file.
    * If --no-force (default), the script will exit without overwriting.

* --verbose / --no-verbose:

    * If --verbose is set, extra information will be printed during execution.
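
To show how paired on/off flags such as --force / --no-force behave, the sketch below declares the four options listed above. The actual script uses click; this version deliberately swaps in the standard-library argparse module (with `BooleanOptionalAction`, available from Python 3.9) so it runs without extra dependencies. The `build_parser` helper and the `False` default for --verbose are assumptions.

```python
import argparse

DEFAULT_CSV = "../data/raw/prot_info/prot_data_minimal_correct_UP.csv"


def build_parser():
    """Declare the same four options as the README, using argparse."""
    parser = argparse.ArgumentParser(prog="microcomp_db.py")
    parser.add_argument("--dbpath", default="bmc.db",
                        help="Path where the SQLite database will be created.")
    parser.add_argument("--csvpath", default=DEFAULT_CSV,
                        help="Path to the CSV file with protein data.")
    # BooleanOptionalAction generates both --force and --no-force.
    parser.add_argument("--force", action=argparse.BooleanOptionalAction,
                        default=False,
                        help="Overwrite an existing database file.")
    # Assumed default: the README does not state the default for --verbose.
    parser.add_argument("--verbose", action=argparse.BooleanOptionalAction,
                        default=False,
                        help="Print extra information during execution.")
    return parser


# Parse an explicit argument list instead of sys.argv for the demonstration.
args = build_parser().parse_args(["--dbpath", "my_database.db", "--force"])
```

Options that are not given on the command line fall back to their defaults, so `args.csvpath` here is the default CSV path and `args.verbose` is `False`.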

**Workflow**

1. Create or overwrite the database (depending on --force).

2. Read the CSV file using read_file.

3. Link the data into the database using link_db_csv.
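
The three workflow steps can be sketched end to end as follows. This is an illustration under stated assumptions, not the code in microcomp_db.py: it uses the standard-library sqlite3 and csv modules instead of SQLAlchemy and pandas, and the `build_database` function, the single-table layout, and the column names are hypothetical.

```python
import csv
import sqlite3
from pathlib import Path


def build_database(dbpath, csvpath, force=False):
    """Create the database, read the CSV, load its rows; return the row count."""
    dbpath = Path(dbpath)
    # Step 1: create or overwrite the database (depending on force).
    if dbpath.exists():
        if not force:
            raise SystemExit(f"{dbpath} exists; pass --force to overwrite")
        dbpath.unlink()
    conn = sqlite3.connect(dbpath)
    conn.execute(
        "CREATE TABLE protein (prot_accession TEXT UNIQUE, prot_seq TEXT)"
    )
    # Step 2: read the CSV file into a list of dicts.
    with open(csvpath, newline="") as handle:
        rows = list(csv.DictReader(handle))
    # Step 3: link the data into the database, one record per CSV row.
    conn.executemany(
        "INSERT INTO protein VALUES (:prot_accession, :prot_seq)", rows
    )
    conn.commit()
    conn.close()
    return len(rows)
```

Without `force=True`, a second call against the same path exits instead of overwriting, which mirrors the --no-force default described above.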


## Usage examples


### Example 1 - Adding a protein

From db.py → protein_addition

    from db import protein_addition  # assumes db.py is on the import path

    protein_addition(
        protseq="MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPT",
        protaccession="P12345",
        struct="BMC-H",
        canonical=True,
    )

### Example 2 - Adding an external database

From db.py → xdatabase_addition

    from db import xdatabase_addition  # assumes db.py is on the import path

    xdatabase_addition(
        xdb_name="UniProt",
        href="https://www.uniprot.org/",
        xdb_type="seq",
    )

### Example 3 - Create a new database with the default CSV

    python microcomp_db.py

### Example 4 - Create a database with a custom name and CSV file

    python microcomp_db.py --dbpath results/bmc_data.db --csvpath data/raw/complex_data.csv


### Example 5 - Force overwrite of an existing database with verbose logging

    python microcomp_db.py --dbpath bmc.db --csvpath data/raw/complex_data.csv --force --verbose


# PENDING SECTIONS:

## Dictionaries for locked vocabulary and indexes

This will cover naming conventions, how accession numbers work, and the locked vocabulary.

## Contribution Guidelines
## Licence
## Contact Information
## Credits and Acknowledgments
* Leighton
* SQLAlchemy
* Click
* Pandas