# Assignment 5: Adult Income Prediction 💵📊

## 📚 Learning Objectives
- Perform data cleaning and preprocessing (handling missing values, encoding).
- Visualize data distributions.
- Train and compare **Logistic Regression** and **K-Nearest Neighbors (KNN)** classifiers.

## Part 1: Data Loading and Basic Inspection

### Q1
Load the `adult` dataset (version 2) from OpenML. Display its shape, list the feature names, and identify the target variable.

In [None]:
from sklearn.datasets import fetch_openml  # Dataset loader for OpenML
import pandas as pd  # Data manipulation
import numpy as np  # Numerical operations
import matplotlib.pyplot as plt  # Plotting
import seaborn as sns  # Statistical plotting

# Fetch version 2 of the Adult dataset as a pandas DataFrame
adult = fetch_openml(name='adult', version=2, as_frame=True, parser='auto')
df = adult.frame  # Features and target combined in one DataFrame

print("Shape:", df.shape)
print("Features:", df.columns.tolist())
print("Target Variable:", adult.target.name)  # Name of the target Series returned by fetch_openml

### Q2
Display the non-null count and data type for each feature.

In [None]:
df.info()  # Non-null counts and dtypes per column


## Part 2: Descriptive Statistics

### Q3
Generate summary statistics for both numerical and categorical columns.

In [None]:
print("--- Numerical Summary ---")
display(df.describe())  # Summary statistics for numerical columns

print("\n--- Categorical Summary ---")
display(df.describe(include=['object', 'category']))

### Q4
List all categorical columns and their unique values.

In [None]:
cat_cols = df.select_dtypes(include=['object', 'category']).columns  # Select categorical columns by dtype
for col in cat_cols:
    print(f"\nColumn: {col}")
    print(df[col].unique())

### Q5
Check for missing values in the dataset.

In [None]:
print(df.isnull().sum())  # Count missing values per column


### Q5.1
Handle missing values:
1. Calculate the ratio of missing values for the dataset.
2. If the ratio is < 20%, drop rows with missing values.
3. If the ratio is >= 20%, impute missing values (median for numerical, mode for categorical).

In [None]:
total_cells = df.size  # Rows x columns (np.product was removed in NumPy 2.0)
missing_cells = df.isnull().sum().sum()  # Total number of missing cells
ratio = missing_cells / total_cells

print(f"Missing Ratio: {ratio:.4f}")

if ratio < 0.20:
    print("Ratio < 20%. Dropping rows with missing values...")
    df_clean = df.dropna()  # Drop rows with missing values
else:
    print("Ratio >= 20%. Imputing values...")
    # Simple imputation: median for numerical columns, mode for the rest
    df_clean = df.copy()
    for col in df_clean.columns:
        if pd.api.types.is_numeric_dtype(df_clean[col]):
            df_clean[col] = df_clean[col].fillna(df_clean[col].median())
        else:
            df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])

print("New Shape:", df_clean.shape)
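
The ratio above is computed over cells, while `dropna()` removes whole rows, so the share of rows lost can be larger than the cell-level ratio suggests. The sketch below illustrates this on a small made-up frame; `toy_missing` and its values are hypothetical and are not part of the Adult dataset.

```python
import pandas as pd

# Hypothetical toy frame with two missing cells spread over two rows
toy_missing = pd.DataFrame({
    "a": [1.0, None, 3.0, 4.0],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# Cell-level ratio: missing cells divided by all cells
cell_ratio = toy_missing.isna().sum().sum() / toy_missing.size

# Row-level ratio: share of rows that dropna() would remove
row_ratio = toy_missing.isna().any(axis=1).mean()

print(f"Cell-level missing ratio: {cell_ratio:.4f}")
print(f"Row-level missing ratio:  {row_ratio:.4f}")
print("Rows kept after dropna():", len(toy_missing.dropna()))
```

Running the same two lines on `df` before dropping rows shows how many observations the 20% rule actually discards.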

## Part 3: Visualization

### Q6
Plot histograms for all numerical columns.

In [None]:
df_clean.hist(figsize=(12, 10), bins=20, edgecolor='black')  # One histogram per numerical column
plt.suptitle('Histograms of Numerical Columns', fontsize=16)
plt.show()

### Q7
Plot bar charts showing the frequency of each category for all categorical features.

In [None]:
cat_cols = df_clean.select_dtypes(include=['object', 'category']).columns

plt.figure(figsize=(16, 20))
for i, col in enumerate(cat_cols, 1):
    plt.subplot(5, 3, i)  # 5x3 grid holds up to 15 categorical columns
    sns.countplot(y=col, data=df_clean, order=df_clean[col].value_counts().index)  # Bars sorted by frequency
    plt.title(col)
plt.tight_layout()  # Adjust spacing once, after all subplots are drawn
plt.show()

## Part 4: Preprocessing and Modeling

### Q8
Assign the feature columns to `X` and the target column to `y`.

In [None]:
# Target is 'class' in OpenML adult dataset
target_col = 'class'
X = df_clean.drop(columns=[target_col])
y = df_clean[target_col]

### Q9
Apply One-Hot Encoding to the categorical features using `pandas.get_dummies()`.

In [None]:
X_encoded = pd.get_dummies(X, drop_first=True)  # One-hot encode, dropping the first level of each feature
print("Shape after encoding:", X_encoded.shape)
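
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame. The name `toy` and its values are hypothetical; the column names only mimic the Adult dataset.

```python
import pandas as pd

# Hypothetical toy frame with one numerical and two categorical columns
toy = pd.DataFrame({
    "age": [25, 40, 31],
    "sex": ["Male", "Female", "Male"],
    "workclass": ["Private", "State-gov", "Private"],
})

# Numerical columns pass through unchanged; each categorical column
# becomes one indicator column per level, minus the first level.
toy_encoded = pd.get_dummies(toy, drop_first=True)

print(toy_encoded.columns.tolist())
print("Shape after encoding:", toy_encoded.shape)
```

Each categorical column with n levels contributes n - 1 indicator columns, because the dropped level is implied when all of the remaining indicators are zero.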

### Q10
Split the dataset into training (80%) and testing (20%) sets using stratified sampling. Print the shapes of the resulting sets.

In [None]:
from sklearn.model_selection import train_test_split  # Train/test splitting utility

# 80/20 split; stratify=y keeps the class proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, stratify=y, random_state=42)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
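
As a quick check of what stratified sampling does, the sketch below splits a made-up, imbalanced label array and compares the positive-class share in each part. `X_toy` and `y_toy` are hypothetical; the 75/25 imbalance is chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75 samples of one class, 25 of the other
y_toy = np.array(["<=50K"] * 75 + [">50K"] * 25)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratify, both parts keep the original class proportions
train_share = np.mean(y_tr == ">50K")
test_share = np.mean(y_te == ">50K")

print("Share of '>50K' in train:", train_share)
print("Share of '>50K' in test: ", test_share)
```

The same comparison on the real split can be done with `y_train.value_counts(normalize=True)` and `y_test.value_counts(normalize=True)`.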

### Q11
Train a Logistic Regression classifier on the training set.

In [None]:
from sklearn.linear_model import LogisticRegression  # Logistic Regression classifier

# Raise max_iter above the default of 100 to give the solver more room to converge
log_reg = LogisticRegression(max_iter=1000, random_state=42)  # Initialize the model
log_reg.fit(X_train, y_train)  # Train the model


### Q12
Report the accuracy score of the Logistic Regression model on the test set.

In [None]:
score_log = log_reg.score(X_test, y_test)  # Mean accuracy on the test set
print(f"Logistic Regression Accuracy: {score_log:.4f}")

### Q13
Train a K-Nearest Neighbors (KNN) classifier with `k=5`.

In [None]:
from sklearn.neighbors import KNeighborsClassifier  # K-Nearest Neighbors classifier

knn = KNeighborsClassifier(n_neighbors=5)  # Initialize KNN with k=5
knn.fit(X_train, y_train)  # Train KNN (stores the training data)


### Q14
Report the accuracy score of the KNN model on the test set.

In [None]:
score_knn = knn.score(X_test, y_test)  # Mean accuracy on the test set
print(f"KNN Accuracy: {score_knn:.4f}")

### Q15
**Question:** Explain the difference between Logistic Regression and K-Nearest Neighbors (KNN) in two sentences.

**Answer:**
**Logistic Regression** is a parametric linear model that learns a fixed set of coefficients defining a linear decision boundary between the classes, which makes it fast at prediction time and easy to interpret. **KNN** is a non-parametric, instance-based learner that stores the training data and classifies each new point by majority vote among its k nearest neighbors, which makes prediction slower on large datasets but lets it capture non-linear patterns.
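
The contrast described in the answer can be sketched on synthetic data. The example below uses scikit-learn's `make_moons` generator rather than the Adult dataset; the variable names and the noise level are choices made for illustration. On data shaped like this, KNN would typically score higher than Logistic Regression, though the exact numbers depend on the noise and the seed.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two interleaving half-circles: the classes are not linearly separable
X_m, y_m = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_m_tr, X_m_te, y_m_tr, y_m_te = train_test_split(
    X_m, y_m, test_size=0.2, stratify=y_m, random_state=0
)

lr_toy = LogisticRegression().fit(X_m_tr, y_m_tr)
knn_toy = KNeighborsClassifier(n_neighbors=5).fit(X_m_tr, y_m_tr)

# Parametric: the fitted model is summarized by one coefficient per feature
print("Logistic Regression coef_ shape:", lr_toy.coef_.shape)

# Non-parametric: KNN exposes no learned coefficients
print("KNN has coef_:", hasattr(knn_toy, "coef_"))

lr_acc = lr_toy.score(X_m_te, y_m_te)
knn_acc = knn_toy.score(X_m_te, y_m_te)
print(f"Logistic Regression accuracy: {lr_acc:.4f}")
print(f"KNN accuracy: {knn_acc:.4f}")
```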