# Assignment 1 - Wine Quality (UCI) - Classification

**Student:** Justus Izuchukwu Onuh  
**Institution:** Ho Chi Minh City University of Technology  
**Course:** Programming Platform for Data Analysis and Visualization (CO5177)  
**Lecturer:** LE THANH SACH  


---

### Objective 
Perform **Exploratory Data Analysis (EDA)** on the UCI Red Wine Quality dataset and fit a **classification** model to it.  
All code, plots, explanations, and results are contained in this notebook.


### Tasks 
1. Load and explore the dataset  
2. Perform comprehensive EDA with visualizations  
3. Apply a **classification** model (predict `quality_label`) 
4. Evaluate model performance with classification metrics  
5. Present findings and explanations clearly using Markdown and code cells


In [1]:
# Step 1: Setup — imports and download dataset
from pathlib import Path 
import urllib.request
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys

# set_theme is the preferred name; sns.set is an alias kept for backward compatibility
sns.set_theme(style="whitegrid")
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
DATA_PATH = DATA_DIR / "winequality-red.csv"

if not DATA_PATH.exists():
    print("Downloading dataset from UCI repository...")
    urllib.request.urlretrieve(URL, DATA_PATH)
    print("Download complete:", DATA_PATH)
else:
    print("Dataset already present at:", DATA_PATH)

print("Notebook kernel Python:", sys.executable)


Downloading dataset from UCI repository...
Download complete: data/winequality-red.csv
Notebook kernel Python: /opt/homebrew/opt/python@3.11/bin/python3.11
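
The UCI file is semicolon-delimited rather than comma-delimited, so a default `pd.read_csv` call would read each row into a single column. Below is a minimal sketch of the loading step. The helper name `load_wine` is hypothetical, and the two sample rows are made up in the file's layout so that the block runs without the download:

```python
import io
import pandas as pd

def load_wine(source):
    """Read a Wine Quality CSV (hypothetical helper; the UCI files use ';' as delimiter)."""
    return pd.read_csv(source, sep=";")

# Illustrative rows only (not taken from the dataset); same header layout as the UCI file.
SAMPLE = io.StringIO(
    '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";'
    '"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"\n'
    "7.0;0.60;0.10;2.0;0.080;12;40;0.9970;3.40;0.60;9.5;5\n"
    "8.0;0.30;0.40;2.2;0.070;15;45;0.9960;3.30;0.75;11.5;7\n"
)

sample_df = load_wine(SAMPLE)
print(sample_df.shape)   # (2, 12): 11 features plus the quality column
print(sample_df.dtypes)
```

In the notebook itself the equivalent call would be `df = load_wine(DATA_PATH)`, followed by `df.head()` and `df.info()` for a first look.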


### Dataset description

- **Source:** UCI Machine Learning Repository (Wine Quality — Red)  
- **Samples:** 1599 red wine observations  
- **Features (11 numeric):** fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol  
- **Original target:** `quality` (integer score on a 0–10 scale; the values observed in this dataset range from 3 to 8)
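
A quick sanity check can confirm that a loaded frame matches the description above. The sketch below assumes a hypothetical helper `check_schema` and the column names exactly as listed; it is demonstrated on a tiny made-up frame rather than the real data:

```python
import pandas as pd

FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]
TARGET = "quality"

def check_schema(df):
    """Return a list of problems found; an empty list means the frame looks as described."""
    problems = []
    missing = [c for c in FEATURES + [TARGET] if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    present = [c for c in FEATURES if c in df.columns]
    non_numeric = [c for c in present if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        problems.append(f"non-numeric features: {non_numeric}")
    if df.isna().any().any():
        problems.append("frame contains missing values")
    return problems

# One made-up row, used only to exercise the check.
demo = pd.DataFrame([[1.0] * 11 + [5]], columns=FEATURES + [TARGET])
print(check_schema(demo))                       # []
print(check_schema(demo.drop(columns=["pH"])))  # ["missing columns: ['pH']"]
```

On the full file one would additionally expect `len(df) == 1599`, per the sample count stated above.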

**Modeling decision (classification):**  
We will convert `quality` to a binary label `quality_label` where:
- `quality_label = 1` (good) if `quality >= 7`  
- `quality_label = 0` (not good) if `quality < 7`

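This threshold separates higher-quality wines from the rest and gives a simple, interpretable binary target. Because most wines in this dataset score 5 or 6, the resulting label is imbalanced (far fewer "good" wines than "not good"), which matters when choosing evaluation metrics. We will justify and briefly discuss this choice in the notebook results.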
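
The labeling rule above can be sketched as follows. The helper name `add_quality_label` and the toy scores are assumptions for illustration; only the threshold of 7 comes from the text:

```python
import pandas as pd

GOOD_THRESHOLD = 7  # from the rule above: quality >= 7 counts as "good"

def add_quality_label(df, threshold=GOOD_THRESHOLD):
    """Return a copy of df with a binary quality_label column (1 = good, 0 = not good)."""
    out = df.copy()
    out["quality_label"] = (out["quality"] >= threshold).astype(int)
    return out

toy = pd.DataFrame({"quality": [3, 5, 6, 7, 8]})  # made-up scores, not dataset rows
labeled = add_quality_label(toy)
print(labeled["quality_label"].tolist())  # [0, 0, 0, 1, 1]
print(labeled["quality_label"].value_counts(normalize=True))
```

Running `value_counts(normalize=True)` on the real data is a quick way to see the share of each class before fitting a model.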
