This repository contains the Biological Database Integration Project developed as part of MSc Bioinformatics coursework.
It demonstrates the integration of multiple biological data sources (Ensembl, UniProt, KEGG, and miRBase) using a combination of R scripts and SQL relational design.
| File | Description |
|---|---|
| BD_assignment.pdf | Project report documenting the R–SQL integration workflow, methodology, and results. |
| BD_assignment.R | Main R script performing data retrieval and integration using REST APIs, BioMart, and MySQL. |
| BD_assignment.sql | MySQL schema defining the relational structure for genes, proteins, pathways, and miRNAs. |
The project builds an automated data pipeline to:
- Retrieve gene and protein data from Ensembl and UniProt via API.
- Map KEGG pathways and perform functional enrichment using clusterProfiler.
- Integrate miRNA information from miRBase.
- Store all data in a structured MySQL database for downstream querying and analysis.
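The retrieval step can be illustrated with a minimal sketch. It is not taken from `BD_assignment.R`: it assumes the public Ensembl REST `lookup/symbol` endpoint, and the helper names `ensembl_lookup_url` and `fetch_gene` are hypothetical. The URL builder is kept separate from the network call so it can be checked offline.

```r
# Build the Ensembl REST URL for a gene-symbol lookup (no network access needed).
# Function name and default species are illustrative assumptions.
ensembl_lookup_url <- function(symbol, species = "homo_sapiens") {
  sprintf("https://rest.ensembl.org/lookup/symbol/%s/%s",
          species, utils::URLencode(symbol, reserved = TRUE))
}

# Fetch one gene record and parse the JSON response into an R list.
# Requires the httr and jsonlite packages and network access.
fetch_gene <- function(symbol, species = "homo_sapiens") {
  resp <- httr::GET(ensembl_lookup_url(symbol, species),
                    httr::content_type("application/json"))
  httr::stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx
  jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Example call (needs network access), left commented out:
# gene <- fetch_gene("BRCA2")
# gene$id
```

Qualifying calls with `httr::` and `jsonlite::` avoids attaching the packages globally; the actual script may instead load them with `library()`.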

Tools used:
- R – for scripting, data retrieval, and enrichment analysis
- MySQL – for database design and relational integration
- REST APIs – for accessing UniProt and KEGG resources
- biomaRt, RMySQL, httr, jsonlite, clusterProfiler – main R packages used

Skills demonstrated:
- Designing relational schemas for biological data
- Accessing bioinformatics APIs programmatically
- Integrating heterogeneous biological datasets
- Applying R-based functional and pathway analysis
- Bridging R dataframes and SQL tables
Tables included:
- gene_annotations – Ensembl gene information
- kegg_data – KEGG pathway enrichment results
- mirna_data – miRNA mapping
- summary – combined relational table joining all datasets
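
As a rough illustration of how a combined `summary` table could be assembled from the other three, the sketch below joins toy data frames in base R. The column names (`gene_id`, `symbol`, `pathway_id`, `mirna_id`) and all values are hypothetical placeholders, not the project's real schema or results; the actual join keys are defined in `BD_assignment.sql`.

```r
# Toy stand-ins for the three source tables. All column names and values
# are made-up placeholders for illustration only.
gene_annotations <- data.frame(
  gene_id = c("G1", "G2", "G3"),
  symbol  = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)
kegg_data <- data.frame(
  gene_id    = c("G1", "G2"),
  pathway_id = c("path01", "path02"),
  stringsAsFactors = FALSE
)
mirna_data <- data.frame(
  gene_id  = c("G1", "G3"),
  mirna_id = c("mir-x", "mir-y"),
  stringsAsFactors = FALSE
)

# Left joins keep every gene and fill NA where no pathway or miRNA matches.
summary_tbl <- merge(gene_annotations, kegg_data, by = "gene_id", all.x = TRUE)
summary_tbl <- merge(summary_tbl, mirna_data, by = "gene_id", all.x = TRUE)
print(summary_tbl)

# Writing the result to MySQL might look like this (needs DBI, RMySQL, and a
# running server; database name and credentials are placeholders):
# con <- DBI::dbConnect(RMySQL::MySQL(), dbname = "biodb",
#                       user = "<user>", password = "<password>")
# DBI::dbWriteTable(con, "summary", summary_tbl,
#                   overwrite = TRUE, row.names = FALSE)
# DBI::dbDisconnect(con)
```

A left join is used here so that genes without a pathway or miRNA hit still appear in the summary; an inner join would be the alternative if only fully annotated genes are wanted.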