# Solutions – Week 2
**Course:** NLP and Text Mining  
**Group C:** Choekyel Nyungmartasang, Vincent Gaspoz, Marc Anton Nanzer, Matthias Johannes Peterhans  

## Summary
This week we explored text processing (regex, stemming, lemmatization) and text mining (cosine similarity, Naive Bayes, and neural models).  
We learned how to preprocess medical text, build rule-based and learned NER models, and recognize when sequence modeling is necessary.

___
# Q&A 

___
## Part 1: Text Processing

### 1) Read the subsections 2.7.1-2.7.3 from https://web.stanford.edu/~jurafsky/slp3/2.pdf, which describe regular expressions. Explain how they can be used to find ICD-10 codes in medical texts and what disadvantages this implies.

In [None]:
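import re

# One letter, two digits, then an optional separator plus 1-4 digits.
icd_general = re.compile(
    r"(?i)(?<![A-Z0-9])[A-Z]\s?\d{2}(?:[.\-]?\s?[0-9]{1,4})?(?![A-Z0-9])"
)
text = """
Dx: A41.9 suspected. Prior I10. Label shows A 41 9; PDF split A41
- 9, Also note U07.1 and C50-C54 range.
"""

# Strip spaces inside each match so "A 41 9" is reported as "A419".
codes = [m.group(0).replace(" ", "") for m in icd_general.finditer(text)]
print(codes)
# ['A41.9', 'I10', 'A419', 'A41', 'U07.1', 'C50', 'C54']
# The line-split "A41\n- 9" is only matched as "A41", and the range C50-C54 as two separate codes.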



ICD-10 codes follow a pattern: one letter, two digits, and optionally a decimal point followed by further characters, e.g. E11.9 (type 2 diabetes mellitus without complications).

We can use regex to find these patterns in medical texts. A regex pattern for ICD-10 codes could be something like:

```
(?i)(?<![A-Z0-9])[A-Z]\s?\d{2}(?:[.\-]?\s?[0-9]{1,4})?(?![A-Z0-9])
```
This regex breaks down as follows:
- `(?i)`: Case-insensitive matching.
- `(?<![A-Z0-9])`: Negative lookbehind to ensure the code is not preceded by another letter or digit.
- `[A-Z]`: Matches a single letter (upper- or lowercase, because of `(?i)`).
- `\s?`: Matches an optional whitespace character.
- `\d{2}`: Matches exactly two digits.
- `(?:[.\-]?\s?[0-9]{1,4})?`: Optional non-capturing group that matches an optional decimal point or hyphen, followed by an optional whitespace character and 1 to 4 digits.
- `(?![A-Z0-9])`: Negative lookahead to ensure the code is not followed by another letter or digit.


So, if we run this regex on a hospital note like “Patient diagnosed with E11.9 and later admitted for U07.1”, it will find E11.9 and U07.1 as ICD-10 codes.

One disadvantage is that the regex may pick up strings that look like ICD-10 codes but are not, for example a product code. Regex matching is also context-independent: it cannot handle negation or temporal meaning. In “exclude E11.9”, for example, it still extracts the ICD-10 code.
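
Both limitations can be illustrated with a small sketch. It reuses the pattern from this section; the helper names `find_codes` and `flag_negated`, the cue list, and the 20-character window are assumptions made for illustration, not a validated negation method.

```python
import re

# Pattern from this section: letter, two digits, optional separator plus 1-4 digits.
ICD_PATTERN = re.compile(
    r"(?i)(?<![A-Z0-9])[A-Z]\s?\d{2}(?:[.\-]?\s?[0-9]{1,4})?(?![A-Z0-9])"
)

# Hypothetical cue words; a real system would need a far richer list.
NEGATION_CUES = ("exclude", "ruled out", "no evidence of")


def find_codes(text):
    """Return every substring that looks like an ICD-10 code."""
    return [m.group(0) for m in ICD_PATTERN.finditer(text)]


def flag_negated(text, window=20):
    """Pair each match with True if a cue word occurs shortly before it."""
    results = []
    for m in ICD_PATTERN.finditer(text):
        context = text[max(0, m.start() - window):m.start()].lower()
        negated = any(cue in context for cue in NEGATION_CUES)
        results.append((m.group(0), negated))
    return results


note = "Patient diagnosed with E11.9 and later admitted for U07.1"
print(find_codes(note))

# False positive: "B12" is a vitamin here, not a diagnosis code.
print(find_codes("Vitamin B12 was prescribed."))

# The regex alone cannot see the negation; the cue window is a crude workaround.
print(flag_negated("Findings exclude E11.9 but confirm I10."))
```

The cue window only approximates context. Handling negation scope or temporal expressions reliably needs more than pattern matching.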


### 2) Solve the three exercises in the Jupyter notebook Spacy-excercises.ipynb in the materials of week 2.

In [None]:
## Setup
!python -m spacy download en_core_web_md

### Exercise 2.1: Named Entity Recognition (NER) with spaCy
- Which entities did spaCy detect?
- Which important terms were missed?

In [None]:
import spacy

# Load the medium English pipeline
nlp = spacy.load("en_core_web_md")

# Example medical text
text = "The patient was prescribed 5 mg of Prednisone in Zurich."

doc = nlp(text)

# Print detected entities
for ent in doc.ents:
    print(ent.text, ent.label_)


### Answer 2.1
spaCy detected:
- "5 mg" as QUANTITY
- "Prednisone" as Geopolitical entity (GPE)
- "Zurich" as Geopolitical entity (GPE)

Missed or mislabeled terms:
- "Prednisone" is a drug/medication, not a location (GPE); the general-purpose model has no medication label.
- The text contains no explicit disease mention, so nothing else was tagged.

### Exercise 2.2: Custom Rule-Based Matcher
- Describe what the following code is doing.
- What are problems with coding like that?

In [None]:
# Add EntityRuler properly (v3 syntax)
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Define patterns
dosage_pattern = [{"LIKE_NUM": True}, {"LOWER": {"IN": ["mg", "ml", "g"]}}]
med_suffix_pattern = [{"TEXT": {"REGEX": "(?i).*(ine|ol|pril|sartan|mab)$"}}]
med_list_pattern = [{"LOWER": {"IN": ["prednisone", "ibuprofen", "paracetamol", "metformin"]}}]
route_pattern = [{"LOWER": {"IN": ["po", "iv", "im", "sc"]}}]
freq_pattern = [{"LOWER": {"IN": ["once", "twice"]}}, {"LOWER": {"IN": ["daily", "weekly"]}}]

# Add patterns
ruler.add_patterns([
    {"label": "DOSAGE", "pattern": dosage_pattern},
    {"label": "MEDICATION", "pattern": med_suffix_pattern},
    {"label": "MEDICATION", "pattern": med_list_pattern},
    {"label": "ROUTE", "pattern": route_pattern},
    {"label": "FREQUENCY", "pattern": freq_pattern},
])

# Example text
text = "The patient was prescribed 5 mg of Prednisone in Zurich. Later, they took 10 ml ibuprofen PO twice daily."
doc = nlp(text)

print("Entities:")
for ent in doc.ents:
    print(f"- {ent.text!r:>12}  ->  {ent.label_}")


### Answer 2.2

The code reuses the spaCy English pipeline loaded earlier. (The exercise originally used the small model; since the medium model was already downloaded, we switched to it.)

It adds an `EntityRuler` before the built-in NER component, so the custom rules tag entities first.

Then patterns for specific medical concepts are defined, for example DOSAGE (a number followed by a unit such as mg, ml, or g) and MEDICATION (either a word ending in a typical drug suffix like -ine, -ol, or -pril, or an entry in a custom list of known drugs such as Prednisone and Ibuprofen).

These patterns are then added to the ruler so that matching spans become recognized entities.

The text is then run through the pipeline and the detected entities are printed. This fixes the earlier problem of the medication being tagged as a location.

The problem with coding like this is that the patterns are assumed to be complete: new drug names, unusual dosages, or different frequency phrases are missed if no pattern covers them. The approach also does not scale, because finding and writing all patterns by hand is expensive. Finally, suffix-based patterns risk false positives. For example, a word ending in -ol such as "alcohol" may be wrongly tagged as a medication.
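
The false positives and misses described above can be reproduced without spaCy. The following sketch applies the suffix regex and the drug list from the exercise to single words; the function name `looks_like_medication` and the test words are assumptions chosen for illustration, and the check only approximates how the `EntityRuler` matches tokens.

```python
import re

# Suffix rule and drug list copied from the exercise patterns.
SUFFIX_RULE = re.compile(r"(?i).*(ine|ol|pril|sartan|mab)$")
KNOWN_DRUGS = {"prednisone", "ibuprofen", "paracetamol", "metformin"}


def looks_like_medication(word):
    """Mimic the two MEDICATION patterns on a single token."""
    return bool(SUFFIX_RULE.match(word)) or word.lower() in KNOWN_DRUGS


words = ["lisinopril", "Prednisone", "alcohol", "routine", "aspirin"]
for word in words:
    print(f"{word:>12} -> {looks_like_medication(word)}")
```

"alcohol" and "routine" are accepted because of their suffixes, while "aspirin" is rejected because it matches neither the suffix rule nor the list.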


### Exercise 3.3: Train a tiny custom NER model in spaCy
- Run the code and inspect output.
- Try adding 5–6 new entity types (e.g., SYMPTOM, LAB_TEST, DEVICE, ROUTE, DURATION, FREQUENCY) with just 1–2 examples each to see how training reacts.
- Evaluate qualitatively: What does the model get right/wrong?

In [None]:
# Entities are (start_char, end_char, label); end_char is exclusive.
# Offsets must align with token boundaries, otherwise spaCy ignores the span.
TRAIN_DATA = [
    ("The patient received 5 mg Prednisone.",
     {"entities": [(21, 25, "DOSAGE"), (26, 36, "DRUG")]}),
    ("Pain in the left knee improved after ibuprofen.",
     {"entities": [(12, 21, "BODY_PART"), (37, 46, "DRUG")]}),
    ("He was given 2 ml epinephrine IM.",
     {"entities": [(13, 17, "DOSAGE"), (18, 29, "DRUG")]}),
    ("CT showed a 2 cm lesion in the liver.",
     {"entities": [(12, 16, "MEASUREMENT"), (31, 36, "BODY_PART")]}),
    ("Administer 10 mg morphine intravenously.",
     {"entities": [(11, 16, "DOSAGE"), (17, 25, "DRUG")]}),
    ("MRI confirmed swelling in the brain.",
     {"entities": [(30, 35, "BODY_PART")]}),
    ("Patient reported headache, treated with aspirin.",
     {"entities": [(17, 25, "BODY_PART"), (40, 47, "DRUG")]}),
    ("She received 250 mg amoxicillin for 5 days.",
     {"entities": [(13, 19, "DOSAGE"), (20, 31, "DRUG")]}),
    ("X-ray revealed fracture in the right arm.",
     {"entities": [(31, 40, "BODY_PART")]}),
    ("The doctor prescribed 20 mg omeprazole daily.",
     {"entities": [(22, 27, "DOSAGE"), (28, 38, "DRUG")]}),
    ("Ultrasound detected a 5 cm cyst in the kidney.",
     {"entities": [(22, 26, "MEASUREMENT"), (39, 45, "BODY_PART")]}),
    ("Patient complained of chest pain, given nitroglycerin.",
     {"entities": [(22, 27, "BODY_PART"), (40, 53, "DRUG")]}),
    ("Treatment started with 8 mg dexamethasone.",
     {"entities": [(23, 27, "DOSAGE"), (28, 41, "DRUG")]}),
    ("Examination revealed tumor in the stomach.",
     {"entities": [(34, 41, "BODY_PART")]}),
    ("She was prescribed 50 mg sertraline at night.",
     {"entities": [(19, 24, "DOSAGE"), (25, 35, "DRUG")]}),
]

# One example per new label; offsets are (start_char, end_char, label), end exclusive.
TRAIN_DATA += [
    ("The patient complained of stomachache.",
     {"entities": [(26, 37, "SYMPTOM")]}),

    ("Blood test showed high glucose.",
     {"entities": [(0, 10, "LAB_TEST")]}),

    ("The surgeon implanted a pacemaker.",
     {"entities": [(24, 33, "DEVICE")]}),

    ("The drug was given IV.",
     {"entities": [(19, 21, "ROUTE")]}),

    ("She took antibiotics for 7 days.",
     {"entities": [(25, 31, "DURATION")]}),

    ("Medication was prescribed twice daily.",
     {"entities": [(26, 37, "FREQUENCY")]}),
]
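
Character offsets in this format are easy to get wrong by one, and spaCy ignores spans that do not align with token boundaries. The following sketch is one way to sanity-check annotations before training; the helper `check_offsets` and the sample data are hypothetical, and the boundary test is a rough heuristic rather than spaCy's own alignment logic.

```python
def check_offsets(data):
    """Return (label, snippet) pairs for spans that look misaligned."""
    problems = []
    for text, annotations in data:
        for start, end, label in annotations["entities"]:
            snippet = text[start:end]
            out_of_range = not (0 <= start < end <= len(text))
            # Leading or trailing whitespace means the span is shifted.
            padded = snippet != snippet.strip()
            # A span should not start or end in the middle of a word.
            cuts_left = start > 0 and text[start - 1].isalnum() and snippet[:1].isalnum()
            cuts_right = end < len(text) and text[end].isalnum() and snippet[-1:].isalnum()
            if out_of_range or padded or cuts_left or cuts_right:
                problems.append((label, snippet))
    return problems


# Hypothetical sample: the first annotation is aligned, the second is off by one.
sample = [
    ("He took 5 mg daily.", {"entities": [(8, 12, "DOSAGE")]}),
    ("He took 5 mg daily.", {"entities": [(7, 11, "DOSAGE")]}),
]
print(check_offsets(sample))
```

Running the same check on `TRAIN_DATA` before converting it to `Example` objects lists every span whose text does not match the intended entity.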

In [None]:
from spacy.training.example import Example
import random

# from train_data import TRAIN_DATA

# Blank English pipeline
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Add labels
for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

# Convert to spaCy examples
examples = []
for text, ann in TRAIN_DATA:
    doc = nlp.make_doc(text)
    examples.append(Example.from_dict(doc, ann))

# Training loop
optimizer = nlp.initialize()
for epoch in range(20):
    random.shuffle(examples)
    losses = {}
    for ex in examples:
        nlp.update([ex], sgd=optimizer, losses=losses)
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1}, Losses: {losses}")

# Test on new text
#test_text = "The nurse gave 10 mg morphine for arm pain."
test_text = "The nurse gave 10 mg morphine for arm pain. The patient had stomachache and rested for 5 days. Then the patient worked out once a week. Later he got chest pain"
doc = nlp(test_text)
print([(ent.text, ent.label_) for ent in doc.ents])


### Answer 3.3

Even with a few examples, the model can generalize to unseen sentences, so it learns simple, repetitive patterns well.

The limitation is that the model mostly memorizes patterns and is not robust beyond them. Overlapping labels are another problem: "chest pain" could be both a SYMPTOM and a BODY_PART. Finally, adding new labels with only 1-2 training examples each brought no benefit; at least 5 examples per new label may be needed.

### 3) Discuss briefly why and when to use a stemmer when one can have a lemmatizer.

Use a stemmer if you want fast, rough normalization of words. A stemmer suits tasks such as search engines or indexing, where exact grammar is less important. Example: studies -> studi, running -> run.

Use a lemmatizer if you need the true dictionary base form of a word. A lemmatizer is better for linguistic analysis, information extraction, and ML models that rely on correct word forms.

Example: better -> good, running (verb) -> run, running (noun) -> running.
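
The difference can be made concrete with a toy sketch. Both functions below are deliberately simplified assumptions: `toy_stem` is a naive suffix stripper (not the Porter algorithm), and `toy_lemmatize` is a tiny hand-written dictionary lookup standing in for a real lemmatizer.

```python
SUFFIXES = ("ing", "es", "s")

# Hypothetical mini dictionary keyed by (word, part of speech).
LEMMAS = {
    ("better", "ADJ"): "good",
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
    ("studies", "NOUN"): "study",
}


def toy_stem(word):
    """Strip one common suffix by rule; no dictionary, no part of speech."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    # Undo consonant doubling: "runn" -> "run".
    if len(w) >= 2 and w[-1] == w[-2] and w[-1] not in "aeiou":
        w = w[:-1]
    return w


def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())


for word in ("studies", "running", "better"):
    print(word, "->", toy_stem(word))

print(toy_lemmatize("better", "ADJ"))
print(toy_lemmatize("running", "VERB"), toy_lemmatize("running", "NOUN"))
```

The stemmer is fast and needs no resources but produces non-words such as "studi" and cannot relate "better" to "good". The lemmatizer gets both right, at the price of needing a dictionary and the part of speech.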

---
## Part 2: Text Mining 

### 1) Solve the exercise on slide 10 of week 3. 

- A cosine similarity of 0.3 means the reports overlap only slightly in their symptoms.

- Cosine similarity and normalized Euclidean distance are not the same, but they are mathematically related: for vectors scaled to unit length, the squared Euclidean distance equals 2(1 - cosine similarity). A high cosine similarity therefore means a small Euclidean distance.

- Yes, results improve with SNOMED CT because it normalizes terms and captures relationships between symptoms.
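
The relationship between cosine similarity and Euclidean distance can be checked numerically. The sketch below assumes that "normalized" means both vectors are scaled to unit length before the distance is taken; the two symptom-count vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalized_euclidean(a, b):
    """Euclidean distance after scaling both vectors to unit length."""
    norm_a, norm_b = math.hypot(*a), math.hypot(*b)
    return math.dist([x / norm_a for x in a], [y / norm_b for y in b])


# Hypothetical symptom-count vectors for two reports.
report_a = [1, 1, 0, 1, 0]
report_b = [0, 1, 1, 0, 1]

cos = cosine_similarity(report_a, report_b)
dist = normalized_euclidean(report_a, report_b)
print(round(cos, 3), round(dist, 3))

# The identity dist**2 == 2 * (1 - cos) holds up to floating-point error.
print(math.isclose(dist ** 2, 2 * (1 - cos)))
```

Because the distance is a decreasing function of the cosine, both measures rank report pairs in the same order once the vectors are normalized.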

### 2) Make the code in Sentiment-NB.ipynb in the materials of week 3 run on your machine. The three data text files are in the folder data of week 3. Why is padding the text not useful here?

Padding is not useful because Naïve Bayes with TF-IDF does not process fixed-length sequences the way a neural network does. Instead, it represents each document as a bag of words: an unordered vector of word frequencies or importance weights. The position and order of words are not used, so padding all texts to equal length does not change anything. Padding only matters in models that handle sequential input (e.g., RNNs, LSTMs, Transformers), where word order and length influence the output.
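
A minimal sketch of this argument, using plain counts instead of TF-IDF weights and a made-up vocabulary. It assumes the padding token is not part of the vocabulary, so it is dropped like any other unknown word.

```python
from collections import Counter

# Hypothetical vocabulary, as if learned from the training reviews.
VOCABULARY = ["good", "bad", "not", "movie"]


def bag_of_words(text):
    """Count vocabulary words; order and out-of-vocabulary tokens are ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in VOCABULARY]


original = bag_of_words("not a bad movie")
shuffled = bag_of_words("movie bad a not")
padded = bag_of_words("<pad> <pad> <pad> not a bad movie")

print(original)
print(original == shuffled == padded)
```

All three inputs yield the same feature vector, so the classifier cannot tell them apart and padding adds no information.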

### 3) Tweak the neural network in section "Neural Network" of Sentiment-NB.ipynb such that the accuracy is above 80%. Is GlobalAveragePooling1D useful here?

No, not for this task. It just averages the embeddings and ignores word order, so phrases like “not bad” and “bad not” produce the same representation. An LSTM captures sequence order and is better suited to sentiment classification.
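
The effect of averaging can be shown with a toy sketch. The 2-dimensional embeddings below are invented for illustration, and `average_pool` only imitates what a global average pooling layer does to a sequence of token vectors.

```python
# Hypothetical 2-dimensional word embeddings, chosen only for illustration.
EMBEDDINGS = {
    "not": [-1.0, 0.5],
    "bad": [-0.5, -1.0],
    "good": [0.5, 1.0],
}


def average_pool(tokens):
    """Average the token vectors per dimension, as global average pooling does."""
    vectors = [EMBEDDINGS[token] for token in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


print(average_pool(["not", "bad"]))
print(average_pool(["bad", "not"]))
print(average_pool(["not", "bad"]) == average_pool(["bad", "not"]))
```

Both orderings collapse to the same pooled vector, so any layer after the pooling step receives identical input for them.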