This is a lab practical covering ETL, feature stores, and semantic search, using the Stroke Prediction dataset from Kaggle.
The dataset is already provided in the dataset folder and contains data about patients' health and whether they have had a stroke. The lab uses multi-modal data consisting of:
- CSV file (`data/stroke_features.csv`): Patient social and demographic data
- SQLite database (`data/stroke_database.db`): Medical records and history
- Medical documents (`data/medical_corpus.json`): Text corpus for building the knowledge base
For this lab you will have to set up a local Python environment. The required packages are listed in the `requirements.txt` file.
You can install them using pip:

```
pip install -r requirements.txt
```

or directly in a Jupyter notebook cell:

```
pip install pandas numpy matplotlib seaborn scikit-learn scipy chromadb
```

Make sure to use Python 3.8 or higher (this lab has been tested with Python 3.11).
You can also run this lab in Google Colab: upload the provided Jupyter notebook (`*_students.ipynb`) together with the dataset files to your Colab environment.
Additionally, you will need to install the required packages in a Colab notebook cell:

```
!pip install chromadb
```

The other packages are pre-installed in Colab.
All code sections marked with TODO comments must be implemented by students. The notebook provides structure and hints but requires active problem-solving.
- Data loading from CSV and SQLite
- Data cleaning and preprocessing
- Feature engineering functions
- Model training pipeline
- Semantic search query implementation
- Patient report generation
In this part, you will load the patient data from the CSV file and the medical records from the SQLite database. You will then check the data for inconsistencies and outliers. You will have to join the data from the two sources and prepare it for modeling.
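As a rough sketch of this load-and-join step (the table name `medical_records` and the join key `patient_id` are illustrative assumptions, not the actual schema; inspect the real database with a query like `SELECT name FROM sqlite_master`), the flow might look like:

```python
import sqlite3
import pandas as pd

# Toy stand-ins for the real files; in the lab you would read
# data/stroke_features.csv and data/stroke_database.db instead.
features = pd.DataFrame({"patient_id": [1, 2, 3],
                         "age": [67, 45, 80],
                         "smoking_status": ["never smoked", "smokes", "formerly smoked"]})

conn = sqlite3.connect(":memory:")  # lab: sqlite3.connect('data/stroke_database.db')
pd.DataFrame({"patient_id": [1, 2, 3],
              "avg_glucose_level": [228.7, 105.9, 171.2],
              "stroke": [1, 0, 1]}).to_sql("medical_records", conn, index=False)

# Pull the medical records with a SQL query, then join on the shared key
records = pd.read_sql("SELECT * FROM medical_records", conn)
df = features.merge(records, on="patient_id", how="inner")
conn.close()
```

An `inner` join keeps only patients present in both sources; switch to `how="left"` if you want to keep every CSV row and inspect which patients lack medical records.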
In this part, you will create a simple feature store to hold the preprocessed data. You will store basic features from the data, and try to engineer new features that might improve model performance.
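A feature store at its simplest is a set of named feature groups keyed by an entity id, from which you assemble training views. The class below is a hypothetical minimal sketch (the lab's own interface may differ), with `patient_id` assumed as the entity key:

```python
import pandas as pd

class SimpleFeatureStore:
    """Minimal in-memory feature store: named feature groups keyed by entity id."""
    def __init__(self, key):
        self.key = key
        self.groups = {}

    def register(self, name, df):
        # Store a feature group indexed by the entity key
        self.groups[name] = df.set_index(self.key)

    def get_features(self, names, ids):
        # Join the requested groups for the requested entities
        parts = [self.groups[n].loc[ids] for n in names]
        return pd.concat(parts, axis=1).reset_index()

store = SimpleFeatureStore(key="patient_id")
base = pd.DataFrame({"patient_id": [1, 2], "age": [67, 45], "bmi": [36.6, 27.1]})
store.register("demographics", base)

# Engineered features are registered as their own group
flags = base.assign(is_senior=(base["age"] >= 65).astype(int))[["patient_id", "is_senior"]]
store.register("age_flags", flags)

training_view = store.get_features(["demographics", "age_flags"], ids=[1, 2])
```

Keeping engineered features in separate groups makes it easy to train models with and without them and compare performance.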
In this part, you will train a machine learning model to predict whether a patient will have a stroke based on their health data. You will evaluate the model performance using appropriate metrics. You will see how different features affect the model performance.
The models will be evaluated using:
- Accuracy: Overall correctness
- Precision: Proportion of predicted positives that are actually positive
- Recall: Sensitivity to positive cases
- ROC-AUC: Model discrimination ability
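On a toy example (hand-picked labels and scores, not lab data), the four metrics can be computed with scikit-learn like this:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Toy labels and predictions; in the lab these come from the trained model
y_true  = [0, 0, 1, 1]
y_pred  = [0, 1, 1, 1]          # hard class predictions
y_score = [0.1, 0.6, 0.7, 0.8]  # predicted stroke probabilities

print(accuracy_score(y_true, y_pred))   # 0.75 -> 3 of 4 predictions correct
print(precision_score(y_true, y_pred))  # ~0.67 -> 2 of 3 predicted positives are real
print(recall_score(y_true, y_pred))     # 1.0 -> both true strokes were caught
print(roc_auc_score(y_true, y_score))   # 1.0 -> every positive outscores every negative
```

Note that accuracy and ROC-AUC use different inputs: accuracy needs hard predictions, ROC-AUC needs the raw scores. For an imbalanced target like stroke, accuracy alone can look good even for a useless model, so always check recall and ROC-AUC too.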
In this part, you will build a simple semantic search system using ChromaDB to store the medical documents. You will implement a function to query the knowledge base and retrieve relevant documents based on user queries. The queries will relate to the dataset and to medical conditions.
If you encounter any issues during the lab, read the error messages carefully; the snippets below cover some common problems.
```python
# Check that the database file exists and the path is correct
import os
if not os.path.exists('data/stroke_database.db'):
    print("Database not found! Run setup_data_files() first")
```

```python
# Reduce dataset size if needed
df_sample = df_clean.sample(n=2000, random_state=42)
```

```python
# Fallback: use TF-IDF + cosine similarity instead of ChromaDB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])  # embed the query in the same space
similarities = cosine_similarity(query_vector, vectors)
```
- Data Leakage (the most common issue)
  - Fitting scalers/transformers on the full dataset before splitting
  - Filling missing values before splitting
  - Using test-set statistics to transform the test data (use the transformer fitted on the training set instead)
- SQL Mistakes
  - Forgetting JOIN conditions
  - Using the wrong JOIN type (INNER vs LEFT)
  - Not aliasing table names
- Feature Engineering
  - Not applying the same transformations to the test set
  - Creating features that look into the future
  - Not stratifying splits for imbalanced data
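The leakage pitfalls above boil down to one rule: split first, fit on the training set only, then apply the same fitted transformation everywhere. A minimal leakage-safe pattern on synthetic data (the imbalance roughly mimics the stroke target):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))            # toy feature matrix
y = (rng.random(200) < 0.1).astype(int)  # imbalanced labels, like stroke data

# Split FIRST, stratifying to preserve the class ratio in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Fit the scaler on the training data only...
scaler = StandardScaler().fit(X_train)
# ...then apply the same fitted transformation to both splits
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Calling `fit` (or `fit_transform`) on anything that has seen the test rows is leakage; a scikit-learn `Pipeline` enforces this ordering automatically.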
For additional guidance on:
- Pandas: Official Documentation
- Scikit-learn: User Guide
- ChromaDB: Official Docs