### String matching
- Goal: find a fast algorithm that matches query peptide sequences against human protein sequences (the human reference proteome), allowing a small number of mismatches.


```python
# String A example: query peptides
query_sequences = [
    "RDLEAEHVLP",      # Query peptide 1
    "MELSAEXYZW",      # Query peptide 2
    # ...
]

# String B example: proteome, as {accession ID: protein sequence}
proteome = {
    "A0A087WZT3": "MELSAEYLREKLQRDLEAEHVLPSPGGVGQVRGETAASETQLGS",
    # ...
}
```

**Parameters:**
- `k = 9` (fragment length)
- `max_mismatches = 1` (allow 1 mismatch)
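
The notes do not say what counts as a mismatch. The sketch below assumes substitutions only (Hamming distance between two equal-length fragments, no insertions or deletions). The helper name `count_mismatches` is made up for illustration.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ (Hamming distance)."""
    if len(a) != len(b):
        raise ValueError("fragments must have the same length")
    return sum(x != y for x, y in zip(a, b))


# First 9-mer of query peptide 2 vs. the first 9 residues of A0A087WZT3
print(count_mismatches("MELSAEXYZ", "MELSAEYLR"))  # 3, so rejected when max_mismatches = 1
```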

**Step 1: Fragment query peptide 1 into k-mers (k=9)**
```
Position 0-8:   RDLEAEHVL -> fragment 1
Position 1-9:   DLEAEHVLP -> fragment 2
```
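
A minimal sketch of this fragmentation step, using a sliding window with 0-based start positions as in the listing above. The function name `generate_kmers` is hypothetical.

```python
def generate_kmers(peptide, k=9):
    """Return (start, fragment) pairs for every length-k window of the peptide."""
    return [(i, peptide[i:i + k]) for i in range(len(peptide) - k + 1)]


k = 9
for start, fragment in generate_kmers("RDLEAEHVLP", k):
    print(f"Position {start}-{start + k - 1}: {fragment}")
```

A peptide shorter than `k` yields an empty list, so it simply contributes no fragments.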

**Step 2: Search for each fragment in the proteome**

Fragment 1: `RDLEAEHVL`  

Search in protein `A0A087WZT3`:
```
MELSAEYLREKLQRDLEAEHVLPSPGGVGQVRGETAASETQLGS
             ^^^^^^^^^
             Match found at position 13!
```

**Exact match found:** ✓
- Protein ID: `A0A087WZT3`
- Position: `13` (0-based)
- Matched: `RDLEAEHVL`

Repeat this search for the remaining fragments of peptide 1 and for all other query peptides, accepting hits with at most `max_mismatches` mismatches.
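
The exact-match case of Step 2 can be sketched with Python's built-in `str.find`, restarting one position after each hit so that overlapping occurrences are also reported. The helper name `find_exact` is an assumption for illustration.

```python
def find_exact(fragment, sequence):
    """Return all 0-based start positions of fragment in sequence (overlaps included)."""
    positions = []
    start = sequence.find(fragment)
    while start != -1:
        positions.append(start)
        start = sequence.find(fragment, start + 1)
    return positions


protein = "MELSAEYLREKLQRDLEAEHVLPSPGGVGQVRGETAASETQLGS"
print(find_exact("RDLEAEHVL", protein))  # [13]
```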

In [6]:
import pickle

# Load the pickle file
with open('uniprotkb_human_ref_proteome_dict.pkl', 'rb') as f:
    proteome_dict = pickle.load(f)

# Now you can use it
print(f"Number of proteins: {len(proteome_dict)}")

# Look at first few entries
for accession_id, sequence in list(proteome_dict.items())[:3]:
    print(f"ID: {accession_id}")
    print(f"Sequence: {sequence[:50]}...")  # First 50 amino acids
    print(f"Length: {len(sequence)}")
    print()

Number of proteins: 83413
ID: A0A087WV00
Sequence: MDAAGRGCHLLPLPAARGPARAPAAAAAAAASPPGPCSGAACAPSAAAGA...
Length: 1057

ID: A0A087WZT3
Sequence: MELSAEYLREKLQRDLEAEHVLPSPGGVGQVRGETAASETQLGS...
Length: 44

ID: A0A087X1C5
Sequence: MGLEALVPLAMIVAIFLLLVDLMHRHQRWAARYPPGPLPLPGLGNLLHVD...
Length: 515



In [7]:
def find_peptide_overlaps(query_sequences, proteome_dict, k=9, max_mismatches=1):
    """
    Find all occurrences of k-mer fragments from query sequences 
    in the proteome database.
    
    Args:
        query_sequences: List of peptide sequences to search for
        proteome_dict: Dict {protein_id: sequence}
        k: Length of fragments to generate
        max_mismatches: Maximum allowed mismatches (default 1)
    
    Returns:
        Dict mapping each query to list of matches
    """
    pass
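
One possible way to fill in the stub is sketched below; it is not necessarily the approach intended here. It assumes that a mismatch means a substitution, and it uses a masked k-mer index: each query k-mer is stored once per choice of `max_mismatches` masked positions, so two k-mers that differ in at most that many positions share at least one key. The names `masked_keys` and `find_peptide_overlaps_sketch`, and the fields of each hit, are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def masked_keys(kmer, n_mask):
    """Yield (masked_positions, remaining_letters) for each choice of n_mask positions."""
    for positions in combinations(range(len(kmer)), n_mask):
        masked = set(positions)
        rest = "".join(c for i, c in enumerate(kmer) if i not in masked)
        yield positions, rest


def find_peptide_overlaps_sketch(query_sequences, proteome_dict, k=9, max_mismatches=1):
    """Map each query to a list of hit dicts (hypothetical result format)."""
    n_mask = min(max_mismatches, k)

    # Index every masked form of every query k-mer.
    index = defaultdict(set)
    for query in query_sequences:
        for offset in range(len(query) - k + 1):
            for key in masked_keys(query[offset:offset + k], n_mask):
                index[key].add((query, offset))

    results = {query: [] for query in query_sequences}

    # Scan each protein once, looking up the masked forms of every window.
    for protein_id, sequence in proteome_dict.items():
        for pos in range(len(sequence) - k + 1):
            window = sequence[pos:pos + k]
            candidates = set()  # de-duplicates hits found under several masks
            for key in masked_keys(window, n_mask):
                candidates.update(index.get(key, ()))
            for query, offset in sorted(candidates):
                fragment = query[offset:offset + k]
                results[query].append({
                    "protein_id": protein_id,
                    "position": pos,            # 0-based start in the protein
                    "query_offset": offset,     # 0-based start in the query
                    "fragment": fragment,
                    "matched": window,
                    "mismatches": sum(a != b for a, b in zip(fragment, window)),
                })
    return results


proteome = {"A0A087WZT3": "MELSAEYLREKLQRDLEAEHVLPSPGGVGQVRGETAASETQLGS"}
hits = find_peptide_overlaps_sketch(["RDLEAEHVLP", "MELSAEXYZW"], proteome)
for hit in hits["RDLEAEHVLP"]:
    print(hit["protein_id"], hit["position"], hit["matched"], hit["mismatches"])
```

On the toy example this reports the two fragments of peptide 1 at positions 13 and 14 with zero mismatches, and no hits for peptide 2. Indexing the queries rather than the proteome keeps the index small when there are few query peptides, but scanning the full proteome in pure Python may still be slow.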