This is a lab practical covering ETL, feature stores, and semantic search, using the Stroke Prediction dataset from Kaggle.
The dataset is already provided in the dataset folder and contains data about patients' health and whether they have had a stroke. The lab uses multi-modal data consisting of:
- CSV file (`data/stroke_features.csv`): Patient social and demographic data
- SQLite database (`data/stroke_database.db`): Medical records and history
- Medical documents (`data/medical_corpus.json`): Text corpus for building the knowledge base
For this lab you will have to set up a local Python environment. The required packages are listed in the `requirements.txt` file.
You can install them using pip:

```
pip install -r requirements.txt
```

or directly in a Jupyter notebook cell:

```
pip install pandas numpy matplotlib seaborn scikit-learn scipy chromadb
```

Make sure to use Python 3.8 or higher (this lab has been tested with Python 3.11).
You can also run this lab in Google Colab: upload the provided Jupyter notebook (`*_students.ipynb`) together with the dataset files to your Colab environment.
Additionally, you will need to install the required packages in a Colab notebook cell:

```
!pip install chromadb
```

The other packages are pre-installed in Colab.
All code sections marked with TODO comments must be implemented by students. The notebook provides structure and hints but requires active problem-solving.
- Data loading from CSV and SQLite
- Data cleaning and preprocessing
- Feature engineering functions
- Model training pipeline
- Semantic search query implementation
- Patient report generation
In this part, you will load the patient data from the CSV file and the medical records from the SQLite database. You will then check the data for inconsistencies and outliers. You will have to join the data from the two sources and prepare it for modeling.
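As a rough sketch of this load-and-join step (the table name `medical_records` and the join key `patient_id` are illustrative assumptions, not the actual schema; inspect the real database with a query like `SELECT name FROM sqlite_master`), the flow might look like:

```python
import sqlite3
import pandas as pd

# Toy stand-ins for the real files; in the lab you would read
# data/stroke_features.csv and data/stroke_database.db instead.
features = pd.DataFrame({"patient_id": [1, 2, 3],
                         "age": [67, 45, 80],
                         "smoking_status": ["never smoked", "smokes", "formerly smoked"]})

conn = sqlite3.connect(":memory:")  # lab: sqlite3.connect('data/stroke_database.db')
pd.DataFrame({"patient_id": [1, 2, 3],
              "avg_glucose_level": [228.7, 105.9, 171.2],
              "stroke": [1, 0, 1]}).to_sql("medical_records", conn, index=False)

# Pull the medical records with a SQL query, then join on the shared key
records = pd.read_sql("SELECT * FROM medical_records", conn)
df = features.merge(records, on="patient_id", how="inner")
conn.close()
```

An `inner` join keeps only patients present in both sources; switch to `how="left"` if you want to keep every CSV row and inspect which patients lack medical records.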
In this part, you will create a simple feature store to hold the preprocessed data. You will store basic features from the data, and try to engineer new features that might improve model performance.
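A feature store at its simplest is a set of named feature groups keyed by an entity id, from which you assemble training views. The class below is a hypothetical minimal sketch (the lab's own interface may differ), with `patient_id` assumed as the entity key:

```python
import pandas as pd

class SimpleFeatureStore:
    """Minimal in-memory feature store: named feature groups keyed by entity id."""
    def __init__(self, key):
        self.key = key
        self.groups = {}

    def register(self, name, df):
        # Store a feature group indexed by the entity key
        self.groups[name] = df.set_index(self.key)

    def get_features(self, names, ids):
        # Join the requested groups for the requested entities
        parts = [self.groups[n].loc[ids] for n in names]
        return pd.concat(parts, axis=1).reset_index()

store = SimpleFeatureStore(key="patient_id")
base = pd.DataFrame({"patient_id": [1, 2], "age": [67, 45], "bmi": [36.6, 27.1]})
store.register("demographics", base)

# Engineered features are registered as their own group
flags = base.assign(is_senior=(base["age"] >= 65).astype(int))[["patient_id", "is_senior"]]
store.register("age_flags", flags)

training_view = store.get_features(["demographics", "age_flags"], ids=[1, 2])
```

Keeping engineered features in separate groups makes it easy to train models with and without them and compare performance.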
In this part, you will train a machine learning model to predict whether a patient will have a stroke based on their health data. You will evaluate the model performance using appropriate metrics. You will see how different features affect the model performance.
The models will be evaluated using:
- Accuracy: Overall correctness
- Precision: Proportion of predicted positives that are actually positive
- Recall: Sensitivity to positive cases
- ROC-AUC: Model discrimination ability
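On a toy example (hand-picked labels and scores, not lab data), the four metrics can be computed with scikit-learn like this:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Toy labels and predictions; in the lab these come from the trained model
y_true  = [0, 0, 1, 1]
y_pred  = [0, 1, 1, 1]          # hard class predictions
y_score = [0.1, 0.6, 0.7, 0.8]  # predicted stroke probabilities

print(accuracy_score(y_true, y_pred))   # 0.75 -> 3 of 4 predictions correct
print(precision_score(y_true, y_pred))  # ~0.67 -> 2 of 3 predicted positives are real
print(recall_score(y_true, y_pred))     # 1.0 -> both true strokes were caught
print(roc_auc_score(y_true, y_score))   # 1.0 -> every positive outscores every negative
```

Note that accuracy and ROC-AUC use different inputs: accuracy needs hard predictions, ROC-AUC needs the raw scores. For an imbalanced target like stroke, accuracy alone can look good even for a useless model, so always check recall and ROC-AUC too.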
In this part, you will build a simple semantic search system using ChromaDB to store the medical documents. You will implement a function to query the knowledge base and retrieve relevant documents based on user queries. The queries will relate to the dataset and to medical conditions.
If you encounter any issues during the lab, read the error messages carefully; the snippets below cover some common problems.
```python
# Check that the database file exists and the path is correct
import os
if not os.path.exists('data/stroke_database.db'):
    print("Database not found! Run setup_data_files() first")
```

```python
# Reduce dataset size if needed
df_sample = df_clean.sample(n=2000, random_state=42)
```

```python
# Fallback: use TF-IDF + cosine similarity instead of ChromaDB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])  # embed the query in the same space
similarities = cosine_similarity(query_vector, vectors)
```
- Data Leakage (the most common issue)
  - Fitting scalers/transformers on the full dataset before splitting
  - Filling missing values before splitting
  - Using test-set statistics to transform the test data (use the transformer fitted on the training set instead)
- SQL Mistakes
  - Forgetting JOIN conditions
  - Using the wrong JOIN type (INNER vs LEFT)
  - Not aliasing table names
- Feature Engineering
  - Not applying the same transformations to the test set
  - Creating features that look into the future
  - Not stratifying splits for imbalanced data
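The leakage pitfalls above boil down to one rule: split first, fit on the training set only, then apply the same fitted transformation everywhere. A minimal leakage-safe pattern on synthetic data (the imbalance roughly mimics the stroke target):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))            # toy feature matrix
y = (rng.random(200) < 0.1).astype(int)  # imbalanced labels, like stroke data

# Split FIRST, stratifying to preserve the class ratio in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Fit the scaler on the training data only...
scaler = StandardScaler().fit(X_train)
# ...then apply the same fitted transformation to both splits
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Calling `fit` (or `fit_transform`) on anything that has seen the test rows is leakage; a scikit-learn `Pipeline` enforces this ordering automatically.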
For additional guidance on:
- Pandas: Official Documentation
- Scikit-learn: User Guide
- ChromaDB: Official Docs