Reset Colab to /content, cloned my private GitHub repo using my PAT, switched the remote back to tokenless HTTPS for safety, created a standard folder structure, and set my git identity for commits.

In [79]:
# ==== YOUR GITHUB CREDENTIALS ====
from getpass import getpass
GITHUB_USER = "muhammadhussainqureshi"
REPO_NAME   = "heart-disease-ml"
# Prompt for the PAT (repo scope) at runtime so it is never saved in the notebook
TOKEN       = getpass("GitHub PAT: ")

# ---- Hard reset to a valid working dir (fixes getcwd errors) ----
%cd /
%cd /content
!pwd

# ---- Fresh clone into /content/<repo> using PAT (private) ----
import os, shutil
REPO_PATH = f"/content/{REPO_NAME}"
shutil.rmtree(REPO_PATH, ignore_errors=True)
repo_url = f"https://{GITHUB_USER}:{TOKEN}@github.com/{GITHUB_USER}/{REPO_NAME}.git"
!git clone "{repo_url}" "{REPO_PATH}"
%cd "{REPO_PATH}"

# ---- Immediately scrub token from the remote (safety) ----
!git remote set-url origin "https://github.com/{GITHUB_USER}/{REPO_NAME}.git"

# ---- Ensure standard project folders ----
for p in ["data/raw","data/processed","notebooks","reports","src"]:
    os.makedirs(p, exist_ok=True)

# ---- Minimal git identity ----
!git config user.name "{GITHUB_USER}"
!git config user.email "{GITHUB_USER}@users.noreply.github.com"

!git status

/
/content
/content
Cloning into '/content/heart-disease-ml'...
/content/heart-disease-ml
On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)
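
Git tracks files, not directories, so the empty `notebooks`, `reports`, and `src` folders created above will not show up in a commit. A minimal sketch of one common workaround, assuming a hypothetical helper `ensure_dirs` and a placeholder file named `.gitkeep` (a naming convention, not a git feature):

```python
from pathlib import Path

def ensure_dirs(root, subdirs, placeholder=".gitkeep"):
    """Create each subdir under root and add an empty placeholder file
    so git has something to track. `.gitkeep` is only a convention."""
    created = []
    for sub in subdirs:
        d = Path(root) / sub
        d.mkdir(parents=True, exist_ok=True)   # same effect as os.makedirs(..., exist_ok=True)
        keep = d / placeholder
        keep.touch(exist_ok=True)              # no-op if the file already exists
        created.append(keep)
    return created
```

Run from the repo root, this could be called as `ensure_dirs(".", ["data/raw", "data/processed", "notebooks", "reports", "src"])`.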


**Download dataset from Kaggle with kagglehub and save inside the repo**

Used `kagglehub` to download the `redwankarimsony/heart-disease-data` dataset, located a CSV whose name contains `heart`, and copied it into my repo at `data/raw/heart.csv`. Then I loaded the file and previewed the first few rows to confirm it is valid.

In [80]:
# ---- Download the dataset via kagglehub ----
import glob, os, shutil
import kagglehub
import pandas as pd

# Download the latest version; returns the local directory holding the files
path = kagglehub.dataset_download("redwankarimsony/heart-disease-data")
print("Path to dataset files:", path)

# ---- Locate a heart*.csv and copy it into repo as data/raw/heart.csv ----
cands = glob.glob(os.path.join(path, "**", "heart.csv"), recursive=True) \
      + sorted(glob.glob(os.path.join(path, "**", "*heart*.csv"), recursive=True))
assert cands, "No heart*.csv found in the Kaggle dataset."
RAW_PATH = "data/raw/heart.csv"
shutil.copy(cands[0], RAW_PATH)
print("Saved raw CSV →", RAW_PATH)

# ---- Quick preview ----
df = pd.read_csv(RAW_PATH)
print("Loaded:", RAW_PATH, "| shape:", df.shape)
df.head(3)

Using Colab cache for faster access to the 'heart-disease-data' dataset.
Path to dataset files: /kaggle/input/heart-disease-data
Saved raw CSV → data/raw/heart.csv
Loaded: data/raw/heart.csv | shape: (920, 16)


,id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
0,1,63,Male,Cleveland,typical angina,145.0,233.0,True,lv hypertrophy,150.0,False,2.3,downsloping,0.0,fixed defect,0
1,2,67,Male,Cleveland,asymptomatic,160.0,286.0,False,lv hypertrophy,108.0,True,1.5,flat,3.0,normal,2
2,3,67,Male,Cleveland,asymptomatic,120.0,229.0,False,lv hypertrophy,129.0,True,2.6,flat,2.0,reversable defect,1
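
When several files match, `glob` returns them in arbitrary order, so "take the first candidate" may not pick the same file on every run. A minimal sketch of a deterministic choice, assuming a hypothetical helper `pick_heart_csv` that prefers an exact `heart.csv` and otherwise falls back to the alphabetically first CSV with `heart` in its name:

```python
from pathlib import Path

def pick_heart_csv(root):
    """Return one CSV under root whose file name contains 'heart'.

    Preference order: a file named exactly heart.csv, then the first
    match in sorted path order. Raises if nothing matches.
    """
    matches = sorted(
        p for p in Path(root).rglob("*.csv") if "heart" in p.name.lower()
    )
    if not matches:
        raise FileNotFoundError(f"No heart*.csv found under {root}")
    exact = [p for p in matches if p.name.lower() == "heart.csv"]
    return exact[0] if exact else matches[0]
```

With the download above, `pick_heart_csv(path)` would stand in for `cands[0]`.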


**Normalize target to 0/1, basic checks, and save processed CSV**

Standardized the label to 0 (no disease) and 1 (disease), printed class balance, removed duplicate rows, and saved a processed snapshot to `data/processed/heart_day1_clean.csv` for modeling.

In [81]:
import numpy as np

# ---- Ensure a binary target (0/1) regardless of variant ----
target_col = "target" if "target" in df.columns else ("num" if "num" in df.columns else None)
assert target_col, f"No target column found in columns: {df.columns.tolist()}"

# UCI variant sometimes has num in {0..4}; convert to binary (>=1 → 1)
if target_col == "num" and df[target_col].max() > 1:
    df[target_col] = (df[target_col] >= 1).astype(int)

df[target_col] = df[target_col].astype(int)

# ---- Show class balance ----
print("Target column:", target_col)
print("Class balance (%):")
print((df[target_col].value_counts(normalize=True)*100).round(2))

# ---- Light hygiene: drop exact duplicates (keep missing handling for pipeline) ----
before = df.shape[0]
df = df.drop_duplicates().copy()
print("Dropped duplicates:", before - df.shape[0])

# ---- Save a processed snapshot for modeling ----
SNAP = "data/processed/heart_day1_clean.csv"
df.to_csv(SNAP, index=False)
print("Saved processed CSV →", SNAP)

Target column: num
Class balance (%):
num
1    55.33
0    44.67
Name: proportion, dtype: float64
Dropped duplicates: 0
Saved processed CSV → data/processed/heart_day1_clean.csv
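
Because this dataset carries an `id` column, an exact-row `drop_duplicates()` can only match rows that also share an id, so a count of 0 says little about repeated records. A minimal sketch of a check that ignores identifier columns, assuming `id` is a per-row identifier (suggested by the preview, not verified for the full file) and using a toy frame rather than the real data; `count_feature_duplicates` is a hypothetical helper:

```python
import pandas as pd

def count_feature_duplicates(frame, id_cols=("id",)):
    """Count rows that repeat an earlier row once id-like columns are ignored."""
    cols = [c for c in frame.columns if c not in id_cols]
    return int(frame.duplicated(subset=cols).sum())

# Toy data: rows 1 and 3 differ only in `id`
toy = pd.DataFrame({
    "id":  [1, 2, 3],
    "age": [63, 67, 63],
    "num": [0, 1, 0],
})

print(int(toy.duplicated().sum()))       # exact-row duplicates → 0
print(count_feature_duplicates(toy))     # duplicates ignoring id → 1
```

Whether such rows should be dropped is a modeling decision; two patients can legitimately share every recorded value.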


**Save this notebook file, commit, and push to GitHub**

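Copied the notebook into the repo as `notebooks/01_EDA.ipynb` when a `.ipynb` file exists under `/content` (none was found in this run, so the commit below contains only the two CSVs), staged all changes, committed with a clear message, pushed to GitHub using my PAT, and then reset the remote to tokenless HTTPS for safety.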

In [82]:
# ---- Save THIS notebook into the repo so it’s tracked ----
import glob, shutil, os
# find the most recently modified .ipynb under /content and copy it into notebooks/
cands = glob.glob("/content/*.ipynb")
if cands:
    latest = max(cands, key=os.path.getmtime)  # newest by modification time, not by name
    shutil.copy(latest, "notebooks/01_EDA.ipynb")
    print("Saved notebook → notebooks/01_EDA.ipynb")
else:
    print("No .ipynb found under /content; notebook not copied.")

# ---- For private repo: temporarily set remote with token for push, then scrub ----
!git remote set-url origin "https://{GITHUB_USER}:{TOKEN}@github.com/{GITHUB_USER}/{REPO_NAME}.git"
!git add .
!git commit -m "01_EDA: kagglehub download, raw+processed CSVs, EDA notebook"
!git push origin HEAD:main
!git remote set-url origin "https://github.com/{GITHUB_USER}/{REPO_NAME}.git"

print("\nRemotes now:")
!git remote -v

[main (root-commit) 5c06400] 01_EDA: kagglehub download, raw+processed CSVs, EDA notebook
 2 files changed, 1842 insertions(+)
 create mode 100644 data/processed/heart_day1_clean.csv
 create mode 100644 data/raw/heart.csv
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (7/7), 24.22 KiB | 3.03 MiB/s, done.
Total 7 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), done.
To https://github.com/muhammadhussainqureshi/heart-disease-ml.git
 * [new branch]      HEAD -> main

Remotes now:
origin	https://github.com/muhammadhussainqureshi/heart-disease-ml.git (fetch)
origin	https://github.com/muhammadhussainqureshi/heart-disease-ml.git (push)
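
Rewriting `origin` with the token stores the PAT in `.git/config` until the scrub line runs. A minimal sketch of an alternative, assuming hypothetical helpers `auth_url`, `redact`, and `push_head`: pass the authenticated URL directly to `git push` so the configured remote never changes, and mask the token in anything that gets printed.

```python
import subprocess

def auth_url(user, token, repo):
    """Build an HTTPS URL with embedded credentials (kept in memory only)."""
    return f"https://{user}:{token}@github.com/{user}/{repo}.git"

def redact(text, token):
    """Mask every occurrence of the token before text is displayed."""
    return text.replace(token, "***") if token else text

def push_head(user, token, repo, branch="main", cwd="."):
    """Push HEAD to `branch` without modifying the stored remote URL."""
    result = subprocess.run(
        ["git", "push", auth_url(user, token, repo), f"HEAD:{branch}"],
        cwd=cwd, capture_output=True, text=True,
    )
    # git writes push progress to stderr; redact defensively before printing
    print(redact(result.stdout + result.stderr, token))
    return result.returncode
```

From the repo root this would be called as `push_head(GITHUB_USER, TOKEN, REPO_NAME)`. One trade-off: pushing to a URL instead of a named remote does not update the local `origin/main` tracking ref until the next fetch.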
