# Comparative Analysis of GRB2 Protein Sequences
**Name:** Olívia DiBiasio <br>
**Date:** January 2, 2026

## Abstract:
GRB2 (growth factor receptor–bound protein 2) is an adaptor protein that plays a central role in growth factor–mediated signal transduction by linking activated receptor tyrosine kinases to the RAS/MAPK signaling cascade. This pathway is essential for regulating cell growth, differentiation, and survival and is frequently dysregulated in cancer. In this notebook, computational analyses are used to examine GRB2 protein sequences across vertebrate species and to assess sequence and structural features relevant to signaling function. By characterizing evolutionary conservation across GRB2, this work explores how sequence variation may influence downstream signaling behavior and contribute to disease-associated perturbations.

## How This Notebook Can Be Used:
This notebook uses vertebrate sequence comparisons to establish a baseline map of functional constraint across the GRB2 protein. By quantifying conservation across the full-length sequence and key signaling domains, the analysis provides a framework for prioritizing cancer-associated GRB2 variants that occur in highly conserved regions and are therefore more likely to disrupt RTK–RAS/MAPK signaling. The same conservation-based scoring approach can be readily extended to additional signaling proteins or applied to independent mutation datasets.

## Methods: 
### Method 1: Sequence Acquisition and Parsing 
GRB2 protein sequences from multiple vertebrate species were obtained in amino acid FASTA format. Species included human (Homo sapiens), rat, mouse, chicken, African clawed frog (Xenopus laevis), and zebrafish. FASTA files were parsed programmatically using the Biopython SeqIO module. For each species, the first FASTA record was extracted and stored as a string representing the amino acid sequence. The human GRB2 sequence was designated as the reference sequence for all subsequent comparative analyses.

In [1]:
import matplotlib.pyplot as plt
import numpy as np

from Bio import SeqIO
import os

# Your folder with FASTA files
DATA_DIR = "/Users/livdibiasio/Desktop/Grb2_data"

FASTA_PATHS = {
    "Homo_Sapiens": os.path.join(DATA_DIR, "HomoSapiens_Grb2.fasta"),
    "Rat": os.path.join(DATA_DIR, "Rat_Grb2.fasta"),
    "Mouse": os.path.join(DATA_DIR, "Mouse_Grb2.fasta"),
    "Chicken": os.path.join(DATA_DIR, "Chicken_Grb2.fasta"),
    "Frog": os.path.join(DATA_DIR, "AfricanClawedFrog_Grb2.fasta"),
    "Zebrafish": os.path.join(DATA_DIR, "Zebrafish_Grb2.fasta"),
}

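# Map each species to its amino acid sequence (plain string)
sequences = {}
for species, path in FASTA_PATHS.items():
    # Keep only the first record in each FASTA file
    first_record = next(SeqIO.parse(path, "fasta"))
    sequences[species] = str(first_record.seq)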
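To make "the first FASTA record" concrete, the following is a minimal, dependency-free sketch of what extracting that record involves. It is not how Biopython implements `SeqIO`; the function name `first_fasta_record` and the two-record demo input are hypothetical and are not real GRB2 data.

```python
import io


def first_fasta_record(handle):
    """Return (header, sequence) for the first record in a FASTA stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                break  # a second record begins here, so stop reading
            header = line[1:]
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is None:
        raise ValueError("no FASTA record found")
    return header, "".join(chunks)


# Hypothetical two-record input used only for illustration
demo = io.StringIO(">seq1 demo\nMEAIAK\nYDFKA\n>seq2 demo\nMMMM\n")
header, seq = first_fasta_record(demo)
print(header, seq)
```

Only the first record is returned, so any additional isoforms or entries in the same file would be ignored, as in the parsing step above.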

### Method 2: Validate inputs and show sequence lengths 
As a sanity check on the inputs, each FASTA path was tested to confirm that the file exists, and the length of each parsed sequence was tabulated. Five of the six sequences are 217 amino acids long. The frog (Xenopus laevis) sequence is 229 amino acids long, a 12-residue difference that matters for the position-by-position comparisons used in the later steps.

In [21]:
import pandas as pd

for sp, path in FASTA_PATHS.items():
    print(f"{sp:12s} exists: {os.path.exists(path)} | {path}")

length_df = pd.DataFrame(
    [{"Species": sp, "Length_aa": len(seq)} for sp, seq in sequences.items()]
).sort_values("Species").reset_index(drop=True)

length_df

Homo_Sapiens exists: True | /Users/livdibiasio/Desktop/Grb2_data/HomoSapiens_Grb2.fasta
Rat          exists: True | /Users/livdibiasio/Desktop/Grb2_data/Rat_Grb2.fasta
Mouse        exists: True | /Users/livdibiasio/Desktop/Grb2_data/Mouse_Grb2.fasta
Chicken      exists: True | /Users/livdibiasio/Desktop/Grb2_data/Chicken_Grb2.fasta
Frog         exists: True | /Users/livdibiasio/Desktop/Grb2_data/AfricanClawedFrog_Grb2.fasta
Zebrafish    exists: True | /Users/livdibiasio/Desktop/Grb2_data/Zebrafish_Grb2.fasta


| Index | Species | Length_aa |
|---|---|---|
| 0 | Chicken | 217 |
| 1 | Frog | 229 |
| 2 | Homo_Sapiens | 217 |
| 3 | Mouse | 217 |
| 4 | Rat | 217 |
| 5 | Zebrafish | 217 |
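Because later steps compare sequences position by position, it can help to flag any ortholog whose length differs from the reference. The sketch below does this with the lengths copied from the table above; the variable names are illustrative only.

```python
# Sequence lengths copied from the table above
lengths = {
    "Chicken": 217,
    "Frog": 229,
    "Homo_Sapiens": 217,
    "Mouse": 217,
    "Rat": 217,
    "Zebrafish": 217,
}

reference = "Homo_Sapiens"

# Length difference relative to the reference, for sequences that differ
length_diff = {
    species: n - lengths[reference]
    for species, n in lengths.items()
    if n != lengths[reference]
}

print(length_diff)  # {'Frog': 12}
```

A non-empty result suggests that at least one sequence carries an insertion or deletion relative to the reference, which is worth keeping in mind when reading unaligned identity scores.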


### Method 3: Define domains and extraction
To assess conservation within biologically relevant regions, GRB2 was segmented into its major functional domains using boundaries defined relative to the human reference sequence. These domains included the N-terminal SH3 domain, the central SH2 domain, and the C-terminal SH3 domain. The SH2 domain mediates phosphotyrosine recognition at activated receptor tyrosine kinases, while the SH3 domains facilitate interactions with proline-rich signaling partners. Domain boundaries were applied uniformly across species to enable consistent cross-species comparison.

In [2]:
# Domain boundaries (0-indexed, end-exclusive)
SH2   = slice(59, 151)
SH3_N = slice(0, 57)
SH3_C = slice(155, 214)

domains = {}
for sp, seq in sequences.items():
    domains[sp] = {
        "SH2": seq[SH2],
        "SH3_N": seq[SH3_N],
        "SH3_C": seq[SH3_C],
        "SH3_NC": seq[SH3_N] + seq[SH3_C],  # combined SH3 domains
    }
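Domain annotations are usually reported as 1-based, inclusive residue ranges, whereas the slices above are 0-indexed and end-exclusive. The sketch below shows one way to convert between the two and to confirm the domain lengths implied by the slices. The helper name `residues_to_slice` is hypothetical.

```python
def residues_to_slice(first: int, last: int) -> slice:
    """Convert a 1-based, inclusive residue range to a Python slice."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return slice(first - 1, last)


# Same boundaries as in the cell above
DOMAIN_SLICES = {
    "SH3_N": slice(0, 57),
    "SH2": slice(59, 151),
    "SH3_C": slice(155, 214),
}

# Length of each domain in residues
domain_lengths = {name: s.stop - s.start for name, s in DOMAIN_SLICES.items()}

for name, s in DOMAIN_SLICES.items():
    print(f"{name}: residues {s.start + 1}-{s.stop}, {domain_lengths[name]} aa")
```

With these boundaries the SH2 domain spans 92 residues and the two SH3 domains together span 116 residues, which matches the "Possible" column in the domain-level output.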


### Method 4: Sequence Conservation Scoring
Sequence conservation was quantified as percent identity relative to the human GRB2 reference sequence. Percent identity was calculated as the number of positions with identical amino acids, divided by the length of the compared region and multiplied by 100. Sequences were compared position by position without a prior alignment. When two sequences differed in length, the comparison was restricted to the first N residues of each, where N is the shorter of the two lengths. This keeps the calculation well defined, but an insertion or deletion in one sequence shifts every downstream position and can lower the measured identity.
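Written as a formula, with $h$ the human sequence, $o$ the ortholog, and $L$ the number of positions compared (the symbols are introduced here for illustration):

$$
\text{identity}(h, o) = \frac{100}{L} \sum_{i=1}^{L} \mathbf{1}[h_i = o_i], \qquad L = \min(|h|, |o|)
$$

As a worked example using values reported later in this notebook, 216 identical positions out of 217 compared gives $100 \times 216 / 217 \approx 99.54\%$.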

In [3]:
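def identity_points(human_seq: str, other_seq: str):
    """Count identical residues position by position (no alignment).

    Returns (matches, percent identity, number of positions compared).
    """
    # Compare only the shared prefix so sequences of different lengths can be scored
    min_len = min(len(human_seq), len(other_seq))
    human_seq = human_seq[:min_len]
    other_seq = other_seq[:min_len]

    # One point for each position where both sequences carry the same residue
    points = sum(a == b for a, b in zip(human_seq, other_seq))

    percent = (points / min_len) * 100 if min_len else 0.0
    return points, percent, min_len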
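Position-wise scoring is sensitive to insertions and deletions. The sketch below is a swapped-in technique rather than the method used in this notebook: a basic Needleman-Wunsch global alignment with a linear gap penalty, followed by an identity calculation over the aligned sequences. The scoring values, function names, and toy sequences are all assumptions chosen for illustration. A real analysis would more likely use a substitution matrix and affine gap penalties.

```python
def global_align(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming matrix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def aligned_identity(reference: str, other: str) -> float:
    """Percent of reference residues matched after global alignment."""
    aln_ref, aln_other = global_align(reference, other)
    matches = sum(x == y and x != "-" for x, y in zip(aln_ref, aln_other))
    return 100 * matches / len(reference)


def positional_identity(reference: str, other: str) -> float:
    """Percent identity over the shared prefix, without alignment."""
    n = min(len(reference), len(other))
    return 100 * sum(x == y for x, y in zip(reference[:n], other[:n])) / n


# Toy sequences: `toy_other` is `toy_ref` with three residues inserted
toy_ref = "ACDEFGHIKLMNPQRSTVWY"
toy_other = toy_ref[:8] + "GGG" + toy_ref[8:]

print(positional_identity(toy_ref, toy_other))  # 40.0
print(aligned_identity(toy_ref, toy_other))     # 100.0
```

In this toy case a three-residue insertion drops the position-wise score to 40% even though every reference residue is still present, while the aligned score recovers 100%. The denominator used here is the reference length; dividing by the number of aligned columns is an equally common convention and gives a different value.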

### Method 5: Whole-Protein Conservation Analysis
Whole-protein conservation was assessed by calculating percent identity between the human GRB2 sequence and each non-human ortholog. These pairwise identity values provided a measure of overall evolutionary constraint acting on GRB2 across vertebrates. Results were compiled and ranked from highest to lowest percent identity, establishing a baseline estimate of global sequence constraint for the protein.

In [4]:
human_seq = sequences["Homo_Sapiens"]

scores = {}
for species, seq in sequences.items():
    if species == "Homo_Sapiens":
        continue
    pts, pct, possible = identity_points(human_seq, seq)
    scores[species] = (pts, pct, possible)

ranked = sorted(scores.items(), key=lambda x: x[1][1], reverse=True)

print("Ranked vs Homo_Sapiens (Whole Protein)")
print("-"*45)
print("Species        Points    Possible   Percent")
for species, (pts, pct, possible) in ranked:
    print(f"{species:12s} {pts:6d}   {possible:10d}   {pct:6.2f}%")

Ranked vs Homo_Sapiens (Whole Protein)
---------------------------------------------
Species        Points    Possible   Percent
Rat             217          217   100.00%
Mouse           216          217    99.54%
Chicken         209          217    96.31%
Zebrafish       205          217    94.47%
Frog            155          217    71.43%
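Pairwise percentages summarize each ortholog with a single number. For prioritizing individual variants, a per-position view can be more useful. The sketch below computes, for each reference position, the fraction of orthologs that carry the same residue. It uses the same unaligned, position-wise comparison as above, and the function name and toy sequences are hypothetical.

```python
def conservation_profile(reference: str, orthologs: dict) -> list:
    """Fraction of orthologs matching the reference residue at each position."""
    profile = []
    for i, ref_res in enumerate(reference):
        # Only orthologs long enough to have a residue at this position
        compared = [seq[i] for seq in orthologs.values() if i < len(seq)]
        matches = sum(res == ref_res for res in compared)
        profile.append(matches / len(compared) if compared else 0.0)
    return profile


# Toy data for illustration only, not real GRB2 sequences
toy_reference = "MEAIAK"
toy_orthologs = {"sp1": "MEAIAK", "sp2": "MEAVAK", "sp3": "MDAVAK"}

profile = conservation_profile(toy_reference, toy_orthologs)

# Report positions (1-based) that are not identical in every ortholog
variable = [
    (i + 1, toy_reference[i], round(frac, 2))
    for i, frac in enumerate(profile)
    if frac < 1.0
]
print(variable)  # [(2, 'E', 0.67), (4, 'I', 0.33)]
```

Applied to the real sequences, positions with a fraction of 1.0 would be the ones identical in every species compared, which is one possible way to rank candidate variant sites.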


### Method 6: Domain-Specific Conservation Analysis
To determine whether specific functional interfaces exhibit differential evolutionary constraint, conservation analyses were performed separately for individual GRB2 domains. Percent identity was calculated for the SH2 domain and for the combined SH3 domains relative to the human reference. Domain-level conservation values were summarized by species and compared to whole-protein conservation patterns. Higher conservation within specific domains was interpreted as evidence of stronger functional constraint, consistent with preserved signaling interactions in receptor tyrosine kinase–dependent pathways frequently implicated in cancer.

In [5]:
def get_sh3(seq: str) -> str:
    return seq[SH3_N] + seq[SH3_C]

def rank_domain(domain_name: str, human_domain: str, domain_func):
    domain_scores = {}

    for species, seq in sequences.items():
        if species == "Homo_Sapiens":
            continue

        pts, pct, possible = identity_points(human_domain, domain_func(seq))
        domain_scores[species] = (pts, pct, possible)

    ranked_domain = sorted(domain_scores.items(), key=lambda x: x[1][1], reverse=True)

    print(f"\nRanked vs Homo_Sapiens ({domain_name})")
    print("-"*45)
    print("Species        Points    Possible   Percent")
    for species, (pts, pct, possible) in ranked_domain:
        print(f"{species:12s} {pts:6d}   {possible:10d}   {pct:6.2f}%")

# SH2 ranking
rank_domain("SH2 Domain", human_seq[SH2], lambda s: s[SH2])

# SH3 ranking (N + C combined)
rank_domain("SH3 Domains (N + C)", get_sh3(human_seq), get_sh3)


Ranked vs Homo_Sapiens (SH2 Domain)
---------------------------------------------
Species        Points    Possible   Percent
Rat              92           92   100.00%
Mouse            92           92   100.00%
Frog             91           92    98.91%
Chicken          89           92    96.74%
Zebrafish        88           92    95.65%

Ranked vs Homo_Sapiens (SH3 Domains (N + C))
---------------------------------------------
Species        Points    Possible   Percent
Rat             116          116   100.00%
Mouse           116          116   100.00%
Chicken         112          116    96.55%
Zebrafish       110          116    94.83%
Frog             58          116    50.00%
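Scoring the two SH3 domains separately can show whether low combined identity is spread across both or confined to one. The sketch below uses a purely synthetic reference and a copy with a 12-residue insertion placed after position 152. The insertion site is an arbitrary assumption, and the example does not establish the cause of the frog result; it only shows how a single insertion affects unaligned domain scores.

```python
# Same domain boundaries as defined earlier (0-indexed, end-exclusive)
SH3_N = slice(0, 57)
SH2 = slice(59, 151)
SH3_C = slice(155, 214)


def percent_identity(a: str, b: str) -> float:
    """Position-wise percent identity over the shared prefix."""
    n = min(len(a), len(b))
    return 100 * sum(x == y for x, y in zip(a[:n], b[:n])) / n if n else 0.0


# Synthetic 217-residue reference built from a repeating 20-letter pattern
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
synthetic_ref = (ALPHABET * 11)[:217]

# Synthetic variant: 12 residues inserted after position 152 (arbitrary site)
synthetic_ins = synthetic_ref[:152] + "G" * 12 + synthetic_ref[152:]

per_domain = {
    name: percent_identity(synthetic_ref[region], synthetic_ins[region])
    for name, region in {"SH3_N": SH3_N, "SH2": SH2, "SH3_C": SH3_C}.items()
}
combined_sh3 = percent_identity(
    synthetic_ref[SH3_N] + synthetic_ref[SH3_C],
    synthetic_ins[SH3_N] + synthetic_ins[SH3_C],
)

print(per_domain)
print(round(combined_sh3, 2))
```

In this synthetic case the domains upstream of the insertion score 100% and the downstream SH3 domain scores 0%, giving a combined SH3 value near 49%. Reporting `SH3_N` and `SH3_C` separately for the real sequences would show whether a similar split is present.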


## Results
GRB2 protein sequences were successfully compared across all six vertebrate species included in the analysis. Relative to the human reference, whole-protein identity was 100.00% for rat, 99.54% for mouse, 96.31% for chicken, and 94.47% for zebrafish. The frog sequence scored 71.43%, but it is also the only sequence with a different length (229 versus 217 amino acids), so this value may partly reflect positional offsets in the unaligned comparison rather than true divergence. With that caveat, the high identity observed in species as distant as zebrafish suggests that GRB2 plays a fundamental role in cellular signaling that has been preserved throughout vertebrate evolution.

When conservation was examined at the domain level, clear patterns emerged. The SH2 domain was especially well conserved, with identity between 95.65% and 100.00% in every species, including frog (98.91%). This is consistent with its role in binding phosphorylated receptor tyrosine kinases and initiating downstream signaling. The combined SH3 domains were 94.83% to 100.00% identical in rat, mouse, chicken, and zebrafish, supporting their importance in mediating protein–protein interactions that propagate signaling through pathways such as RAS/MAPK. In frog, the combined SH3 domains were only 50.00% identical.

While small differences in conservation were observed between domains, the major signaling interfaces remained highly preserved in every comparison except the frog SH3 domains. Together, these findings indicate that the regions of GRB2 responsible for assembling and regulating signaling complexes are under strong evolutionary constraint. The consistency of conservation across species reinforces the idea that even modest alterations in these domains could have meaningful effects on signaling behavior.

<img src="figure1_whole_protein.png" width="350">

<p><strong>Figure 1.</strong> Whole-protein conservation of GRB2 across vertebrates. Percent identity of GRB2 orthologs relative to the human reference sequence.</p>

<img src="figure2_domains.png" width="350">

<p><strong>Figure 2.</strong> Domain-specific conservation of GRB2 signaling interfaces across vertebrates. Percent identity of the SH2 and SH3 domains relative to the human GRB2 reference sequence. The SH2 domain shows particularly strong conservation.</p>

## Discussion
This analysis shows that GRB2 is highly conserved across vertebrate species, which aligns with its central role as an adaptor protein in receptor tyrosine kinase signaling. The strong conservation observed at the whole-protein level suggests that GRB2 structure and function have been maintained over evolutionary time, emphasizing how essential this protein is for regulating pathways involved in cell growth, survival, and differentiation.

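Examining individual domains adds important context. The SH2 domain was especially conserved across all species analyzed, consistent with its role in binding phosphorylated receptors and initiating downstream signaling through the RAS/MAPK pathway. The SH3 domains were also highly conserved in rat, mouse, chicken, and zebrafish, supporting their role in mediating protein–protein interactions within signaling complexes. The frog sequence is the exception, at 50.00% identity across the combined SH3 domains. This may reflect increased evolutionary flexibility in this region while still preserving overall GRB2 function. However, because the frog sequence is 12 residues longer than the human reference and the sequences were compared without alignment, the low value may instead be an artifact of shifted positions. An alignment-based comparison would be needed to distinguish these explanations.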

These conservation patterns have important implications for cancer biology. GRB2 is frequently involved in oncogenic signaling downstream of hyperactive receptor tyrosine kinases, and mutations or alterations affecting highly conserved regions are more likely to disrupt normal signaling control. Variants occurring within conserved domains, particularly the SH2 domain, may therefore serve as useful biomarkers for identifying functionally impactful changes in GRB2-associated signaling. By establishing a baseline map of evolutionary constraint, this analysis provides a framework for prioritizing GRB2 variants that may contribute to cancer progression and could inform future studies investigating GRB2 as a potential biomarker or therapeutic target.

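One limitation of this study is that sequence conservation alone does not directly measure functional impact. While highly conserved regions are more likely to be functionally important, experimental validation or integration with functional datasets would be required to confirm the biological consequences of specific GRB2 variants in cancer contexts. A second limitation is methodological: percent identity was computed position by position without sequence alignment, and domain boundaries defined on the human sequence were applied by position to every species. Any insertion or deletion therefore shifts downstream residues out of register, so identity values for sequences that differ in length from the reference, here the frog sequence, should be interpreted with caution.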

## References

1. Lowenstein, E. J., Daly, R. J., Batzer, A. G., et al. (1992). The SH2 and SH3 domain-containing protein GRB2 links receptor tyrosine kinases to ras signaling. *Cell*, 70(3), 431–442.

2. Yarden, Y., & Sliwkowski, M. X. (2001). Untangling the ErbB signalling network. *Nature Reviews Molecular Cell Biology*, 2(2), 127–137.

3. UniProt Consortium. (2023). UniProt: the universal protein knowledgebase. *Nucleic Acids Research*, 51(D1), D523–D531.

4. Durinck, S., Spellman, P. T., Birney, E., & Huber, W. (2009). Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. *Nature Protocols*, 4(8), 1184–1191.

5. Cock, P. J. A., Antao, T., Chang, J. T., et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. *Bioinformatics*, 25(11), 1422–1425.
