This repository contains the code, models, data, and documentation for the project "Functional and Structural Characterization of a Protein Domain Family". The focus of the project is the Lipase/Vitellogenin domain family (Pfam ID: PF00151), specifically in Homo sapiens. The analysis integrates model building, functional characterization, and motif identification.
- Function/: Contains the outputs generated by the notebook, such as processed data and results.
- Model/:
  - Building/: Contains outputs related to the PSSM and HMM model construction.
  - Evaluation/: Contains evaluation results of the models against SwissProt annotations.
- Taxonomy/: Taxonomic analysis outputs, including lineage data and phylogenetic trees.
- Motifs/: Outputs related to motif discovery within disordered regions.
- BD Report/: Final PDF of the project report.
The aim of the project is to characterize the Lipase/Vitellogenin domain family by:
- Constructing sequence models (PSSM and HMM).
- Evaluating these models against SwissProt database annotations.
- Analyzing taxonomic distribution, functional enrichment, and motif conservation.
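To make the functional-enrichment aim concrete, here is a minimal sketch of how over-representation of a single GO term could be scored with a one-sided hypergeometric test. The README does not say which statistical test the notebook uses, so the choice of test, the function name `enrichment_p_value`, and its parameters are assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """One-sided hypergeometric tail probability P(X >= hits).

    population: proteins in the background set (hypothetical, e.g. all of SwissProt)
    annotated:  background proteins that carry the GO term
    sample:     proteins in the domain-family set
    hits:       family proteins that carry the GO term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    # Sum the probability of observing `hits` or more annotated proteins by chance.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / total
```

A small p-value suggests the term is more frequent in the family than expected from the background. In practice one value is computed per GO term and the results are corrected for multiple testing.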
- Sequence Models: Built using PSI-BLAST and HMMER based on a representative sequence.
- Taxonomy Analysis: Phylogenetic insights into protein distribution across species.
- Functional Insights: Gene Ontology enrichment analysis for biological functions.
- Motif Discovery: Identification of conserved motifs within disordered regions using ProSite and ELM patterns.
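The motif-discovery step above can be sketched as follows: a PROSITE-style pattern is translated into a Python regular expression and matched only inside disordered regions. This is an illustrative sketch, not the notebook's code. The helper names `prosite_to_regex` and `find_motifs`, the toy sequence, and the region coordinates are hypothetical, and the converter handles only the common syntax (it does not cover anchors placed inside square brackets).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex = pattern.strip().rstrip(".")
    regex = regex.replace("-", "")                       # element separators
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = any residue except P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")    # N- and C-terminal anchors
    return regex


def find_motifs(sequence, pattern, regions):
    """Return (start, match) pairs lying fully inside 0-based, end-exclusive regions."""
    # A lookahead lets overlapping occurrences be reported as well.
    compiled = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    hits = []
    for match in compiled.finditer(sequence):
        start = match.start()
        end = start + len(match.group(1))
        if any(lo <= start and end <= hi for lo, hi in regions):
            hits.append((start, match.group(1)))
    return hits


# PROSITE N-glycosylation pattern, used here only as a familiar example.
print(find_motifs("MNGSAANPSQ", "N-{P}-[ST]-{P}.", regions=[(0, 10)]))
```

ELM classes are already distributed as regular expressions, so under the same assumptions they could be passed to `re.compile` directly without the conversion step.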
Software:
- NCBI-BLAST+
- HMMER
- Clustal Omega
- JalView
- Python 3.x
Databases:
- Clone the repository:
  git clone https://github.com/maloooon/Protein_Domain_Family_Characterization.git
  cd Protein_Domain_Family_Characterization
- Open the Jupyter Notebook:
  jupyter notebook BD_Project_notebook.ipynb
- Follow the steps in the notebook:
- Model Building: Generate PSSM and HMM models.
- Evaluation: Assess the models against SwissProt annotations.
- Taxonomy Analysis: Visualize lineage distribution.
- Motif Discovery: Identify conserved motifs in disordered regions.
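For readers who want to see what the model-building step might look like outside the notebook, the sketch below assembles PSI-BLAST and HMMER command lines in Python. The exact commands and parameters used by the notebook are not given in this README, so every file name, the iteration count, and the helper `build_model_commands` are assumptions. Only standard options of `psiblast`, `hmmbuild`, and `hmmsearch` are used.

```python
import shlex


def build_model_commands(seed_fasta, msa_fasta, blast_db, target_fasta, prefix="model"):
    """Return the argument lists for building a PSSM and an HMM and for searching with the HMM."""
    # Iterative search from the representative sequence; the final PSSM is written to disk.
    psiblast = [
        "psiblast", "-query", seed_fasta, "-db", blast_db,
        "-num_iterations", "3", "-out_pssm", f"{prefix}.pssm",
    ]
    # Profile HMM built from a multiple sequence alignment (e.g. one made with Clustal Omega).
    hmmbuild = ["hmmbuild", f"{prefix}.hmm", msa_fasta]
    # Search a sequence file with the HMM and keep the per-domain hit table.
    hmmsearch = [
        "hmmsearch", "--domtblout", f"{prefix}.domtblout",
        f"{prefix}.hmm", target_fasta,
    ]
    return [psiblast, hmmbuild, hmmsearch]


# Hypothetical input files, shown only to illustrate the resulting commands.
for command in build_model_commands("seed.fasta", "family_msa.fasta", "swissprot", "swissprot.fasta"):
    print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once NCBI-BLAST+ and HMMER are installed and the input files exist.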
- Model Performance: PSSM and HMM models evaluated for precision, recall, and other metrics.
- Taxonomic Insights: Visualization of domain conservation across species.
- Functional Enrichment: Key biological processes identified via GO annotations.
- Motif Discovery: Conserved motifs linked to functional hotspots.
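The model-performance figures mentioned above come down to comparing the set of proteins a model retrieves with the set annotated in SwissProt. The following is a minimal sketch of that comparison at the protein level. The function name `classification_metrics` and the accession placeholders are hypothetical, and F1 is included only as one plausible example of the "other metrics", which this README does not enumerate.

```python
def classification_metrics(predicted: set, annotated: set) -> dict:
    """Precision, recall, and F1 for predicted vs. annotated protein accessions."""
    tp = len(predicted & annotated)   # retrieved and annotated with the domain
    fp = len(predicted - annotated)   # retrieved but not annotated
    fn = len(annotated - predicted)   # annotated but missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Placeholder accessions, not real results from the project.
model_hits = {"ACC1", "ACC2", "ACC3"}
swissprot_annotated = {"ACC2", "ACC3", "ACC4"}
print(classification_metrics(model_hits, swissprot_annotated))
```

The same function can be applied to the PSSM hits and to the HMM hits separately, which makes the two models directly comparable.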
This project was developed by:
This project is licensed under the MIT License. See the LICENSE file for details.