# Data Analysis for Rare Disease Diagnosis

## Introduction
This notebook explores datasets from the UK Biobank and the Rare Diseases Database to understand their feature distributions, missing values, and correlations. The findings guide the preprocessing and modeling steps that follow.

---

## Import Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visualization style
sns.set_theme(style='whitegrid')
%matplotlib inline


## Load the Data
Replace the file paths below with the correct paths to your raw data files.

In [None]:
uk_data_path = '../data/raw/uk_biobank_data.csv'
rare_data_path = '../data/raw/rare_diseases_data.csv'

uk_data = pd.read_csv(uk_data_path)
rare_data = pd.read_csv(rare_data_path)

print("UK Biobank Data Sample:")
display(uk_data.head())

print("Rare Diseases Data Sample:")
display(rare_data.head())
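Because the paths above are placeholders and later cells assume columns such as `age` and `gender`, it can help to fail early with a clear message. The following is a minimal sketch, not part of the original notebook: `load_csv_checked` and its `required_columns` parameter are hypothetical names introduced here for illustration.

```python
from pathlib import Path

import pandas as pd


def load_csv_checked(path, required_columns=()):
    """Load a CSV, raising a clear error if the file or expected columns are missing.

    Hypothetical helper: the name and parameters are illustrative assumptions.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Data file not found: {path.resolve()}")

    df = pd.read_csv(path)

    # Report every absent column at once instead of failing on the first plot.
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise KeyError(f"{path.name} is missing expected columns: {missing}")
    return df
```

For example, `uk_data = load_csv_checked(uk_data_path, required_columns=['age', 'gender'])` would stop at load time if either column were absent, rather than inside a plotting cell.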


## Data Overview
Check for missing values, data types, and general information for both datasets.

In [None]:
print("UK Biobank Data Info:")
uk_data.info()

print("\nMissing values in UK Biobank Data:")
print(uk_data.isnull().sum())

print("\nRare Diseases Data Info:")
rare_data.info()

print("\nMissing values in Rare Diseases Data:")
print(rare_data.isnull().sum())
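Raw counts from `isnull().sum()` are hard to compare across datasets of different sizes. A possible refinement is to report the percentage missing per column and list only the affected columns, sorted from most to least missing. This is a sketch under assumptions: `missing_summary` is a hypothetical helper, and the toy frame with its `bmi` column stands in for the real data.

```python
import pandas as pd


def missing_summary(df):
    """Return missing counts and percentages for columns that have any gaps."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, most-missing first.
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Toy data standing in for uk_data (values and the 'bmi' column are made up).
toy = pd.DataFrame({
    "age": [54, None, 61, 47],
    "gender": ["F", "M", None, None],
    "bmi": [24.1, 27.3, 22.8, 30.0],
})
print(missing_summary(toy))
```

The same function can be called on `uk_data` and `rare_data` to decide which columns need imputation and which might be dropped.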


## Visualize Feature Distributions
Explore key features such as the age and gender distributions, along with any other numeric features in your dataset.

In [None]:
# Example: Age Distribution in UK Biobank Data (assuming 'age' column exists)
plt.figure(figsize=(8,5))
sns.histplot(uk_data['age'], bins=30, kde=True)
plt.title('Age Distribution in UK Biobank')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()


In [None]:
# Example: Gender Distribution in UK Biobank Data (assuming 'gender' column exists)
plt.figure(figsize=(6,4))
sns.countplot(x='gender', data=uk_data)
plt.title('Gender Distribution in UK Biobank')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
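A count plot typically leaves missing values out of the bars, so a tabular breakdown that keeps them can be a useful companion. The sketch below assumes a categorical column like `gender`. `category_breakdown` is a hypothetical helper, and the toy series is invented for illustration.

```python
import pandas as pd


def category_breakdown(series):
    """Return counts and shares per category, keeping missing values as a row."""
    counts = series.value_counts(dropna=False)
    # Derive shares from the same counts so both columns share one index.
    share = (counts / counts.sum()).round(3)
    return pd.DataFrame({"count": counts, "share": share})


# Toy series standing in for uk_data['gender'] (values are made up).
toy_gender = pd.Series(["F", "M", "F", None, "F"], name="gender")
print(category_breakdown(toy_gender))
```

A large share in the missing row suggests the column needs imputation or an explicit "unknown" category before modeling.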


## Correlation Analysis
Check the correlations among numerical variables to spot redundant features and identify potential predictors.

In [None]:
plt.figure(figsize=(10,8))
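# Restrict to numeric columns: recent pandas versions raise an error if corr() meets non-numeric data
sns.heatmap(uk_data.select_dtypes(include='number').corr(), annot=True, fmt='.2f', cmap='coolwarm')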
plt.title('Feature Correlation Heatmap for UK Biobank')
plt.show()
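With many features, a heatmap becomes hard to read, and a ranked list of the strongest pairwise correlations can be easier to act on. The following is a minimal sketch, assuming Pearson correlation on numeric columns only. `top_correlated_pairs` is a hypothetical helper, and the toy columns (`systolic_bp`, `bmi`) are invented for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, n=5):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the upper triangle only (k=1 skips the diagonal), so each pair
    # appears once and self-correlations are excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by magnitude so strong negative correlations are not overlooked.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy data standing in for uk_data (column names and values are made up).
toy = pd.DataFrame({
    "age": [40, 50, 60, 70, 80],
    "systolic_bp": [118, 124, 131, 139, 150],
    "bmi": [31.0, 22.5, 27.0, 24.0, 29.5],
    "gender": ["F", "M", "F", "M", "F"],
})
print(top_correlated_pairs(toy, n=3))
```

Pairs near the top of this list are candidates for dropping one member or combining the two during feature selection.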


## Conclusion
This preliminary analysis surveys both datasets: it flags the missing values that require treatment and gives insight into feature distributions and correlations. The next steps are data preprocessing and feature selection, followed by training the machine learning models.