[CSC_4SD02_TP - Databases] Lab Exercises

This lab practical covers ETL, feature stores, and semantic search, using the Stroke Prediction dataset from Kaggle.

The dataset is already provided in the dataset folder and contains data about patients' health and whether they have had a stroke. The lab uses multi-modal data consisting of:

  1. CSV File (data/stroke_features.csv): Patient social and demographic data
  2. SQLite Database (data/stroke_database.db): Medical records and history
  3. Medical Documents (data/medical_corpus.json): Text corpus for building the knowledge base

Setup Instructions

Local Environment

For this lab you will have to set up a local Python environment. The required packages are listed in the requirements.txt file. You can install them using pip:

pip install -r requirements.txt

or directly in a Jupyter notebook cell:

%pip install pandas numpy matplotlib seaborn scikit-learn scipy chromadb

Make sure to use Python 3.8 or higher (this lab has been tested with Python 3.11).

Google Colab

You can also run this lab in Google Colab: upload the provided Jupyter notebook *_students.ipynb together with the dataset files to your Colab environment.

Additionally, you will need to install the required packages in a Colab notebook cell:

!pip install chromadb

The other packages are pre-installed in Colab.

Lab Structure

Implementation Guidelines

All code sections marked with TODO comments must be implemented by students. The notebook provides structure and hints but requires active problem-solving.

Code Sections to Complete

  • Data loading from CSV and SQLite
  • Data cleaning and preprocessing
  • Feature engineering functions
  • Model training pipeline
  • Semantic search query implementation
  • Patient report generation

Part 1: Data Loading and Preprocessing (ETL)

In this part, you will load the patient data from the CSV file and the medical records from the SQLite database. You will then check the data for inconsistencies and outliers, join the data from the two sources, and prepare it for modeling.
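The join-and-check step can be sketched as follows. This is a minimal illustration on made-up stand-in frames, not the lab's solution: the column names (`patient_id`, `age`, `stroke`) and the plausible-age range are assumptions.

```python
import pandas as pd

# Hypothetical stand-in frames; in the lab these come from the CSV and SQLite sources.
demographics = pd.DataFrame({"patient_id": [1, 2, 3], "age": [67, 45, 80]})
medical = pd.DataFrame({"patient_id": [1, 2, 4], "stroke": [0, 1, 0]})

# Inner join keeps only patients present in both sources
df = demographics.merge(medical, on="patient_id", how="inner")

# Simple sanity check: flag ages outside a plausible range as outliers
outliers = df[(df["age"] < 0) | (df["age"] > 110)]
```

Whether an inner or left join is appropriate depends on whether every patient is guaranteed a medical record; checking row counts before and after the merge is a quick way to catch silent data loss.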

Part 2: Feature Store

In this part, you will create a simple feature store for the preprocessed data. You will store basic features from the data and engineer new features that might improve model performance.
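As a hedged sketch of the idea, a feature store can be as simple as a mapping from feature names to values keyed by patient. The toy frame, the `bmi` column, and the dict-based store below are illustrative assumptions; the lab's actual store design may differ.

```python
import pandas as pd

# Toy patient frame; real columns come from the joined stroke data.
df = pd.DataFrame({"patient_id": [1, 2], "age": [67, 45], "bmi": [31.2, 22.5]})

# Engineer a new feature: a coarse BMI category (standard WHO-style cut points)
df["bmi_category"] = pd.cut(
    df["bmi"], bins=[0, 18.5, 25, 30, 100],
    labels=["underweight", "normal", "overweight", "obese"],
)

# A minimal feature store: feature name -> values indexed by patient_id
feature_store = {
    "age": df.set_index("patient_id")["age"],
    "bmi_category": df.set_index("patient_id")["bmi_category"],
}
```

Keeping features keyed by a stable entity ID is what lets the same store serve both training and later lookups.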

Part 3: Model Training and Evaluation

In this part, you will train a machine learning model to predict whether a patient will have a stroke based on their health data. You will evaluate the model performance using appropriate metrics. You will see how different features affect the model performance.

The models will be evaluated using:

  • Accuracy: Overall correctness
  • Precision: Fraction of predicted positives that are correct
  • Recall: Fraction of actual positives detected (sensitivity / true positive rate)
  • ROC-AUC: Model discrimination ability

Part 4: Semantic Search

In this part, you will build a simple semantic search system using ChromaDB to store medical documents. You will implement a function that queries the knowledge base and retrieves relevant documents for a user query. The queries will be related to our data and medical conditions.

Troubleshooting

If you encounter any issues during the lab, start from the error messages; common problems and fixes are listed below.

SQLite Connection Issues

# Check if database file exists or the path is correct
import os
if not os.path.exists('data/stroke_database.db'):
    print("Database not found! Run setup_data_files() first")

Memory Issues in Colab

# Reduce dataset size if needed
df_sample = df_clean.sample(n=2000, random_state=42)

ChromaDB Installation Fails

# Fallback: use TF-IDF + cosine similarity instead of ChromaDB embeddings
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(documents)   # documents: list of strings
query_vector = vectorizer.transform([query])    # query: a single string
similarities = cosine_similarity(query_vector, vectors)

Things to be careful about

  1. Data Leakage (the most common issue)

    • Fitting scalers/transformers on the full dataset before splitting
    • Filling missing values before splitting
    • Using statistics computed on the test set to transform the data
  2. SQL Mistakes

    • Forgetting JOIN conditions
    • Wrong JOIN type (INNER vs LEFT)
    • Not aliasing table names
  3. Feature Engineering

    • Not applying same transformations to test set
    • Creating features that look into the future
    • Not stratifying splits for imbalanced data
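A leakage-safe, stratified split (covering items 1 and 3 above) can be sketched as follows; the random features are a stand-in for the real patient data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
X = np.random.rand(100, 3)
y = np.array([0] * 90 + [1] * 10)  # imbalanced, like stroke labels

# Stratify so both splits keep the 10% positive rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

The same pattern applies to imputers and encoders: fit on the training split, transform both splits with the fitted object.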

Learning Resources

For additional guidance on:
