# 02: Preprocessing the Medical FAQ Dataset
This notebook focuses on preparing the data for the upcoming LLM fine-tuning stage. I start with the cleaned and transformed dataset from the previous step, which is stored in Azure Blob Storage as a Parquet file. The goal here is to refine the text further so that it is ready for training a model like DistilGPT-2. This preparation makes the data more consistent and meaningful, which directly supports the project's aim of creating accurate, multilingual responses to patient questions in telehealth. By doing this, I ensure the final tool can reduce staff workload in healthcare settings, where efficient data handling can lead to real cost savings and better patient care.


In [3]:
# Import libraries for text preprocessing, tokenization, and lemmatization
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType



[nltk_data] Downloading package punkt to
[nltk_data]     /home/godhanaravara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/godhanaravara/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/godhanaravara/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


---
## Loading the saved Parquet file

I start this step by retrieving the cleaned dataset I previously saved as `transformed_medquad.parquet` in Azure Blob Storage and preparing the Natural Language Toolkit (NLTK) for text processing. The dataset contains 16,359 rows of medical FAQ data with transformed columns, ready for further refinement.
- I use Spark to read the Parquet file from the mounted container path (`/mnt/faqdata/transformed_medquad.parquet` in the cell below) and display its row count, schema, and first five rows to verify the data. The NLTK resources (punkt, stopwords, wordnet) downloaded in the import cell enable tokenization, stopword removal, and lemmatization.
- The purpose is to pick up where I left off, avoiding the need to repeat basic cleaning every time. I did it this way because Parquet files are efficient for storage and quick to load in Spark, which fits my project's use of Databricks for distributed processing. 
- This helps the problem statement by keeping the workflow smooth and scalable, allowing me to handle larger datasets in the future without performance issues. From a business perspective, it shows how cloud storage integrates with processing tools to cut down on time and resources. This sets the stage for text analysis, supporting the business goal of delivering accurate healthcare FAQs, which can save healthcare providers significant costs (up to 60% versus on-premises solutions, per Azure economics).

In [None]:
# Load the saved parquet file from Azure Blob
df = spark.read.parquet("/mnt/faqdata/transformed_medquad.parquet")
print(f"Loaded Parquet with {df.count()} rows")

print("First 5 rows:")
df.show(5, truncate=35)

Loaded Parquet with 16359 rows
First 5 rows:
+-----------------------------------+-----------------------------------+-----------------+-----------------------------------+
|                 transform_question|                             answer|           source|                         focus_area|
+-----------------------------------+-----------------------------------+-----------------+-----------------------------------+
|what are the symptoms of periphe...|People who have P.A.D. may have ...|  NIHSeniorHealth|Peripheral Arterial Disease (P.A...|
|who is at risk for adult acute m...|Smoking, previous chemotherapy t...|        CancerGov|       Adult Acute Myeloid Leukemia|
|what are the treatments for myel...|Because myelodysplastic /myelopr...|        CancerGov|Myelodysplastic/ Myeloproliferat...|
|what research or clinical trials...|New types of treatment are being...|        CancerGov|                        Skin Cancer|
|   what is are childbirth problems |While childbirth usual

In [None]:
df.printSchema()

root
 |-- transform_question: string (nullable = true)
 |-- answer: string (nullable = true)
 |-- source: string (nullable = true)
 |-- focus_area: string (nullable = true)



---
## Tokenizing the Text

I create a user-defined function (UDF) to tokenize the `transform_question` column, turning each question into a list of words. The purpose is to split sentences into individual tokens, which makes the text easier to filter and suitable for deeper linguistic analysis.
- I implement it as a Spark UDF around NLTK's `word_tokenize` because that distributes the work across the dataset's 16,359 rows, which supports the project's need to handle real-world medical questions efficiently.
- This generates a new `token_question` column for the filtering steps that follow. The `answer` column is not tokenized; it is carried through unchanged as `original_answer`.
- Structured tokens give the later steps cleaner input, which in turn supports better FAQ response quality and patient education.



In [None]:
# Define tokenization
def tokenize_text(text):
    if text:
        return nltk.word_tokenize(text)
    return []

tokenize_udf = udf(tokenize_text, ArrayType(StringType()))

# Apply tokenization
try:
    df_tokenized = df.select(
        "source",
        "focus_area",
        tokenize_udf(df.transform_question).alias("token_question"),
        col("answer").alias("original_answer")
    )
    print(f"Tokenized row count: {df_tokenized.count()}")
    print("First 5 rows after tokenization:")
    df_tokenized.show(5, truncate=35)
except Exception as err:
    print("Error tokenizing text:", err)
    raise

Tokenized row count: 16359
First 5 rows after tokenization:
+-----------------+-----------------------------------+-----------------------------------+-----------------------------------+
|           source|                         focus_area|                     token_question|                    original_answer|
+-----------------+-----------------------------------+-----------------------------------+-----------------------------------+
|  NIHSeniorHealth|Peripheral Arterial Disease (P.A...|[what, are, the, symptoms, of, p...|People who have P.A.D. may have ...|
|        CancerGov|       Adult Acute Myeloid Leukemia|[who, is, at, risk, for, adult, ...|Smoking, previous chemotherapy t...|
|        CancerGov|Myelodysplastic/ Myeloproliferat...|[what, are, the, treatments, for...|Because myelodysplastic /myelopr...|
|        CancerGov|                        Skin Cancer|[what, research, or, clinical, t...|New types of treatment are being...|
|MPlusHealthTopics|                Childbirt

---
## Removing Stopwords

Using another UDF, I filter out common English stopwords (such as "the" and "is") from the tokenized questions to focus on the significant key terms.
- I use NLTK's stopword list because it is reliable, covers English broadly, and ships with NLTK, which avoids extra dependencies in my setup. This creates the `clean_question` column; `original_answer` is passed through untouched.
- This step reduces noise in the data, helping the LLM focus on relevant medical content. It can potentially lower error rates in FAQ responses and save time.
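
One caveat worth checking for medical text: NLTK's English stopword list includes negations such as "no", "not", and "nor", so a question like "is surgery not recommended" loses its negation after filtering. The sketch below shows one possible way to keep selected words. It uses a tiny hand-written stand-in for the NLTK list so that it runs without any downloads, and the names `SAMPLE_STOP_WORDS`, `KEEP_WORDS`, and `filter_tokens` are hypothetical, not part of this notebook.

```python
# Stand-in for set(stopwords.words('english')); the real NLTK list is much longer.
SAMPLE_STOP_WORDS = {"is", "the", "a", "of", "for", "no", "not", "nor"}

# Hypothetical allow-list: words to keep even though they are stopwords.
KEEP_WORDS = {"no", "not", "nor"}


def filter_tokens(tokens, stop_words=SAMPLE_STOP_WORDS, keep=KEEP_WORDS):
    """Drop stopwords case-insensitively, but keep any word listed in `keep`."""
    if not tokens:
        return []
    effective = stop_words - keep  # stopwords that should really be removed
    return [t for t in tokens if t.lower() not in effective]


print(filter_tokens(["Is", "surgery", "not", "recommended"]))
# ['surgery', 'not', 'recommended']
```

Whether negations should be kept depends on how the cleaned questions are used downstream, so this is a design choice to evaluate rather than a required change.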


In [None]:
stop_words = set(stopwords.words('english'))

# Define UDF to remove stopwords
def remove_stopwords(tokens):
    if tokens:
        return [word for word in tokens if word.lower() not in stop_words]
    return []

remove_stopwords_udf = udf(remove_stopwords, ArrayType(StringType()))

# Apply stopwords removal
try:
    df_processed = df_tokenized.select(
        "source",
        "focus_area",
        remove_stopwords_udf(df_tokenized.token_question).alias("clean_question"),
        col("original_answer")
    )
    print(f"Processed row count: {df_processed.count()}")
    print("First 5 rows after stopwords removal:")
    df_processed.show(5, truncate=35)
except Exception as err:
    print("Error removing stopwords:", err)
    raise

Processed row count: 16359
First 5 rows after stopwords removal:
+-----------------+-----------------------------------+-----------------------------------+-----------------------------------+
|           source|                         focus_area|                     clean_question|                    original_answer|
+-----------------+-----------------------------------+-----------------------------------+-----------------------------------+
|  NIHSeniorHealth|Peripheral Arterial Disease (P.A...|[symptoms, peripheral, arterial,...|People who have P.A.D. may have ...|
|        CancerGov|       Adult Acute Myeloid Leukemia|[risk, adult, acute, myeloid, le...|Smoking, previous chemotherapy t...|
|        CancerGov|Myelodysplastic/ Myeloproliferat...|[treatments, myelodysplastic, my...|Because myelodysplastic /myelopr...|
|        CancerGov|                        Skin Cancer|[research, clinical, trials, don...|New types of treatment are being...|
|MPlusHealthTopics|                Chil

---
## Lemmatizing the text

I apply a lemmatization UDF to reduce words to their base form, for example changing "symptoms" to "symptom". This standardizes variations so that the model treats similar words as one.
- I use NLTK's `WordNetLemmatizer` because it integrates well with the tokenization step and keeps the code simple. Called without a part-of-speech argument, as it is here, it treats every token as a noun, so plurals are reduced while verb forms such as "running" are left unchanged.
- I chose lemmatization over stemming to preserve word meaning and improve data consistency, which is vital when training an LLM on medical FAQs where context matters.
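
`WordNetLemmatizer.lemmatize` accepts an optional `pos` argument and defaults to nouns, so verb forms only reduce when `pos="v"` is passed. A minimal sketch of a noun-then-verb fallback follows. To stay runnable without the WordNet download it uses a toy lookup in place of the real lemmatizer; `TOY_LEMMAS`, `toy_lemmatize`, and `lemmatize_with_fallback` are hypothetical names, and in this notebook one would pass `lemmatizer.lemmatize` as the `lemmatize` argument instead.

```python
# Toy stand-in with the same (word, pos) call shape as WordNetLemmatizer.lemmatize.
TOY_LEMMAS = {("symptoms", "n"): "symptom", ("running", "v"): "run"}


def toy_lemmatize(word, pos="n"):
    """Return the toy lemma for (word, pos), or the word itself if unknown."""
    return TOY_LEMMAS.get((word, pos), word)


def lemmatize_with_fallback(tokens, lemmatize=toy_lemmatize, pos_order=("n", "v")):
    """Try each part of speech in order; keep the first lemma that changes the word."""
    if not tokens:
        return []
    result = []
    for word in tokens:
        lemma = word
        for pos in pos_order:
            candidate = lemmatize(word, pos)
            if candidate != word:
                lemma = candidate
                break
        result.append(lemma)
    return result


print(lemmatize_with_fallback(["symptoms", "running", "leukemia"]))
# ['symptom', 'run', 'leukemia']
```

A fallback like this can over-reduce nouns that look like verb forms, so it is worth spot-checking a sample of medical terms before adopting it.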

In [None]:
lemmatizer = WordNetLemmatizer()

# Define lemmatization
def lemmatize_tokens(tokens):
    if tokens:
        return [lemmatizer.lemmatize(word) for word in tokens]
    return []

lemmatize_udf = udf(lemmatize_tokens, ArrayType(StringType()))

# Apply lemmatization
try:
    df_lemmatized = df_processed.select(
        "source",
        "focus_area",
        lemmatize_udf(df_processed.clean_question).alias("lemma_question"),
        col("original_answer")
    )
    print(f"Lemmatized row count: {df_lemmatized.count()}")
    print("First 5 rows after lemmatization:")
    df_lemmatized.show(5, truncate=40)
except Exception as err:
    print("Error lemmatizing text:", err)
    raise

Lemmatized row count: 16359
First 5 rows after lemmatization:
+-----------------+----------------------------------------+----------------------------------------+----------------------------------------+
|           source|                              focus_area|                          lemma_question|                         original_answer|
+-----------------+----------------------------------------+----------------------------------------+----------------------------------------+
|  NIHSeniorHealth|    Peripheral Arterial Disease (P.A.D.)|[symptom, peripheral, arterial, disea...|People who have P.A.D. may have sympt...|
|        CancerGov|            Adult Acute Myeloid Leukemia| [risk, adult, acute, myeloid, leukemia]|Smoking, previous chemotherapy treatm...|
|        CancerGov|Myelodysplastic/ Myeloproliferative N...|[treatment, myelodysplastic, myelopro...|Because myelodysplastic /myeloprolife...|
|        CancerGov|                             Skin Cancer|[research, clinical,

---
## Checking for Bias in `focus_area`

I group the data by `focus_area` to review the distribution of values and calculate each category's percentage, flagging any category that exceeds 50% of rows, since that could skew the dataset or model outcomes.
- The purpose is to identify potential bias early and confirm that the dataset represents a broad range of medical topics.
- I do this with Spark's `groupBy` and `filter` functions because they are efficient for the dataset's size. This step promotes fairness in the project by helping to avoid skewed LLM outputs.
- This approach supports equitable telehealth responses, a key aspect of healthcare access, and balances computational ease with meaningful insight.
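
The 50% rule can be tried on a plain Python list before running it on Spark. The sketch below is an illustration under stated assumptions, not this notebook's code: `share_by_category` and `dominant_categories` are hypothetical helpers, and the sample labels are made up for the demo rather than taken from the real distribution.

```python
from collections import Counter


def share_by_category(labels):
    """Return {category: percent of rows}, ordered from largest to smallest share."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total * 100 for cat, n in counts.most_common()}


def dominant_categories(labels, threshold=50.0):
    """Categories whose share is strictly above `threshold` percent."""
    return [cat for cat, pct in share_by_category(labels).items() if pct > threshold]


# Illustrative labels only; not the real MedQuAD focus_area distribution.
sample = ["Skin Cancer"] * 3 + ["Adult Acute Myeloid Leukemia"]
print(share_by_category(sample))
# {'Skin Cancer': 75.0, 'Adult Acute Myeloid Leukemia': 25.0}
print(dominant_categories(sample))
# ['Skin Cancer']
```

Because at most one category can exceed 50%, looking at the top few shares is usually more informative than the flag alone.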

In [5]:
focus_dist = df_lemmatized.groupBy("focus_area").count().orderBy("count", ascending=False)
print("Focus area distribution:")

# Basic bias check: if one category dominates (>50% of rows)
total_rows = df_lemmatized.count()
focus_dist = focus_dist.withColumn("percent", (col("count") / total_rows * 100))
dominant = focus_dist.filter(col("percent") > 50).count()
if dominant > 0:
    print("Warning: Potential bias detected - one focus_area exceeds 50% of data.")
else:
    print("No bias detected")

Focus area distribution:
No bias detected


---
## Saving the preprocessed data

I save the lemmatized DataFrame back to Azure Blob as 'preprocessed_medquad.parquet' to maintain the refined data for future stages. 
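
Building the output path from a mount-point variable is easy to get subtly wrong: the `f` prefix has to sit outside the quotes, otherwise Python keeps the braces as literal text. A short sketch, assuming a hypothetical `MOUNT_PT` value of `/mnt/faqdata` (the mount used in the load step earlier in this notebook):

```python
import posixpath

MOUNT_PT = "/mnt/faqdata"  # assumption: same mount as the load step

# Prefix inside the quotes: an ordinary string, no substitution happens.
literal = "f{MOUNT_PT}/preprocessed_medquad.parquet"

# Prefix outside the quotes: MOUNT_PT is substituted.
formatted = f"{MOUNT_PT}/preprocessed_medquad.parquet"

# posixpath.join also copes with a trailing slash on the mount point.
joined = posixpath.join(MOUNT_PT + "/", "preprocessed_medquad.parquet")

print(literal)
print(formatted)
print(joined)
```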


In [None]:
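# Saving the preprocessed data to Storage Blob
# MOUNT_PT must hold the mount point; it is assumed to be defined earlier
# (for example "/mnt/faqdata", the mount used in the load step).
output_path = f"{MOUNT_PT}/preprocessed_medquad.parquet"

try:
    df_lemmatized.write.mode("overwrite").parquet(output_path)
    print("Preprocessed data saved to storage blob")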

    # Verifying the save to ensure the data was saved completely
    df_verified = spark.read.parquet(output_path)
    print(f"Verified saved data with {df_verified.count()} rows")
    df_verified.show(5, truncate=35)
except Exception as err:
    print("Error saving data:", err)
    print("Check SAS Token expiry and 'Write' permission")
    raise

Preprocessed data saved to storage blob
Verified saved data with 16359 rows
+-----------------+----------------------------------------+----------------------------------------+----------------------------------------+
|           source|                              focus_area|                          lemma_question|                         original_answer|
+-----------------+----------------------------------------+----------------------------------------+----------------------------------------+
|  NIHSeniorHealth|    Peripheral Arterial Disease (P.A.D.)|[symptom, peripheral, arterial, disea...|People who have P.A.D. may have sympt...|
|        CancerGov|            Adult Acute Myeloid Leukemia| [risk, adult, acute, myeloid, leukemia]|Smoking, previous chemotherapy treatm...|
|        CancerGov|Myelodysplastic/ Myeloproliferative N...|[treatment, myelodysplastic, myelopro...|Because myelodysplastic /myeloprolife...|
|        CancerGov|                             Skin Cancer|[resea

---
## Conclusion
This notebook preprocessed the transformed Medical FAQ dataset (`transformed_medquad.parquet`, 16,359 rows) from Azure Blob Storage by tokenizing the questions, removing stopwords, and lemmatizing the remaining tokens with NLTK and Spark UDFs. It then checked `focus_area` for a dominant category and saved the refined data as `preprocessed_medquad.parquet` for downstream fine-tuning and RAG in the Healthcare FAQ Generator pipeline.

### Key Results

- Row Count: 16,359 rows preserved; preprocessing was applied to the question column only.
- Lemmatization: Questions tokenized, stopword-filtered, lemmatized, and stored as arrays (e.g., the question beginning "who is at risk for adult acute m..." becomes **[risk, adult, acute, myeloid, leukemia]**).
- Bias Check: No focus_area exceeds 50% of rows. This rules out a single dominant category, although it is a coarse check and does not by itself show that topics are evenly represented.
- Runtime: ~5-10 minutes on Databricks Community Edition (CPU).
- Output: preprocessed_medquad.parquet with columns `source, focus_area, lemma_question (array), original_answer (string)`.

### Business Impact

- Prepares consistent, meaningful text data for multilingual FAQ generation, enabling accurate telehealth responses and reducing clinician workload.
- Supports fairness by checking focus areas for a dominant category before training, a first step toward equitable healthcare information access.

### Next Steps

- Model Fine-Tuning: Use the preprocessed Parquet to fine-tune flan-t5-base for FAQ generation in the next notebook (03_train_LLM.ipynb).
- RAG Pipeline: Build retrieval and evaluation with LangChain and FAISS.
- Multilingual Expansion: Integrate GCP Translation API for Spanish/Telugu support.

This step demonstrates scalable NLP preprocessing with PySpark and NLTK, forming a strong foundation for healthcare applications.