# Task 2 Baseline Notebook: “Adult Census Income” Classification Project

## 1. Problem Description

The goal of this project is to build a classification model that, given a person’s demographic and socioeconomic attributes, predicts whether their annual income is **greater than** or **less than or equal to** \$50,000. This is a classic binary supervised learning task illustrating the entire data science workflow: data loading and cleaning, preprocessing and feature engineering, model training, evaluation, and interpretation.

## 2. Dataset (“Adult”)

- **Source**: Extracted from the 1994 U.S. Census by Barry Becker and Ron Kohavi, donated to the UCI Machine Learning Repository.  
- **Size**:  
  - Total: 48,842 records (32,561 training, 16,281 evaluation)  
  - After removing missing values: 45,222 (30,162 train, 15,060 evaluation)  
- **Instances**: Individuals over 16 years old with gross income > \$100 and working > 0 hours/week.  
- **Attributes**: 14 features (6 continuous, 8 categorical) + 1 target variable.  
- **Target variable** (`income`):  
  - `>50K` (annual income over \$50,000)  
  - `<=50K` (annual income \$50,000 or less)

### 2.1 Continuous Attributes

- `age`: Age in years  
- `fnlwgt`: Final weight (population expansion factor)  
- `education_num`: Number of years of formal education  
- `capital_gain`: Non‐recurring capital gains (USD)  
- `capital_loss`: Non‐recurring capital losses (USD)  
- `hours_per_week`: Hours worked per week

### 2.2 Categorical Attributes


- `workclass`, `education`, `marital_status`, `occupation`, `relationship`, `race`, `sex`, `native_country` (various categories as documented)

## 3. Workflow


1. **Load & Explore**  
2. **Clean Data**  
3. **Preprocess** (encode categoricals, impute missing)  
4. **Feature Engineering**  
5. **Train/Test Split**  
6. **Modeling** (Random Forest, Gradient Boosting, etc.)  
7. **Evaluation** (Accuracy, AUC, custom Score)  
8. **Submit**

## 4. Prediction Objective


Predict `income` (`>50K` vs. `<=50K`) as accurately as possible, demonstrating skills in data cleaning, modeling, and evaluation of real‐world datasets.

## 5. Baseline solution

### 5.1 - Import libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)

### Importance of Data Cleaning

A machine learning model’s quality depends on the data’s quality. Dirty or incomplete data can:
- **Bias** estimates if certain groups are underrepresented.
- **Reduce** useful data volume by dropping too many rows.
- **Cause errors** or unreliable training results.

---

#### 1. Dropping Missing Rows

The simplest approach is to remove any row with `NaN`:

```python
import pandas as pd

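# The Adult data marks missing values with '?', so tell pandas to read them as NaN
df = pd.read_csv('adult.csv', na_values='?', skipinitialspace=True)
df_clean = df.dropna()
# Warning: dropping many rows discards valuable data, and evaluation rows with missing values still need a prediction.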
```

#### 2. Imputation with Central Tendency

Rather than deleting rows, replace `NaN` with statistical values:

- **Numeric → Median** (robust to outliers):

  ```python
  import numpy as np

  num_cols = df.select_dtypes(include=[np.number]).columns
  for col in num_cols:
      med = df[col].median()
      df[col] = df[col].fillna(med)
  ```

- **Categorical → Mode** (most frequent category):

  ```python
  cat_cols = df.select_dtypes(include=['object', 'category']).columns
  for col in cat_cols:
      mode = df[col].mode()[0]
      df[col] = df[col].fillna(mode)
  ```

**Benefits of Imputation**
- Retains most of your data.
- Reduces bias from dropping entire rows.
- Easy to implement in pandas with a few lines.

Next, explore advanced methods (KNN Imputer, MICE, model-based) once you’re comfortable with these basics.  
See more: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
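
As a first step toward the advanced methods mentioned above, here is a minimal sketch of scikit-learn's `KNNImputer` on a toy numeric array. The arrays `X_train` and `X_eval` and the choice `n_neighbors=2` are illustrative assumptions, not part of the baseline. The imputer is fitted on the training rows and then reused on the evaluation rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric data (hypothetical values): columns could be age and hours_per_week
X_train = np.array([
    [25.0, 40.0],
    [30.0, 45.0],
    [50.0, 60.0],
    [52.0, np.nan],
])
X_eval = np.array([[28.0, np.nan]])

# Each missing value is replaced by the mean of that feature over the
# n_neighbors closest training rows that have the feature present
imputer = KNNImputer(n_neighbors=2)
X_train_filled = imputer.fit_transform(X_train)
X_eval_filled = imputer.transform(X_eval)

print(X_train_filled)
print(X_eval_filled)
```

`KNNImputer` only works on numeric columns, so categorical features would need to be encoded (or imputed separately) first.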


### 5.2 - Load data

In [2]:
# Load train data (set the path to the CSV file)
df_train_full = pd.read_csv('train_task2.csv', na_values='?', skipinitialspace=True)

# Load evaluation data (set the path to the CSV file)
df_eval = pd.read_csv('eval_task2.csv', na_values='?', skipinitialspace=True)

# Store ids
eval_ids = df_eval['id']
# Remove id column
df_eval = df_eval.drop('id', axis=1)

### 5.3 - Count missing values

In [3]:
# Check missing values in train data
missing_train = df_train_full.isnull().sum()
print("Missing values in train data:\n", missing_train[missing_train > 0])

Missing values in train data:
 workclass         1836
occupation        1843
native_country     583
dtype: int64


In [4]:
# Check missing values in evaluation data
missing_eval = df_eval.isnull().sum()
print("Missing values in evaluation data:\n", missing_eval[missing_eval > 0])

Missing values in evaluation data:
 workclass         963
occupation        966
native_country    274
dtype: int64


### 5.4 - Impute training and evaluation data with median/mode

In [5]:
# Identify numeric and categorical columns
numeric_cols     = df_train_full.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df_train_full.select_dtypes(include=['object','category']).columns.tolist()
# 'income' is the target, not a feature, so exclude it
if 'income' in categorical_cols:
    categorical_cols.remove('income')

# Impute missing values: median for numeric columns, mode for categorical ones.
# The fill values are computed on the training set and reused for the evaluation
# set, so both sets receive exactly the same preprocessing.
for col in numeric_cols:
    med_train = df_train_full[col].median()

    df_train_full[col] = df_train_full[col].fillna(med_train)
    df_eval[col] = df_eval[col].fillna(med_train)

for col in categorical_cols:
    mode_train = df_train_full[col].mode()[0]

    df_train_full[col] = df_train_full[col].fillna(mode_train)
    df_eval[col] = df_eval[col].fillna(mode_train)

### 5.5 - Check no missing values left

In [6]:
# Check there are no missing values left
missing_train = df_train_full.isnull().sum()
missing_eval = df_eval.isnull().sum()
if missing_train.sum() == 0 and missing_eval.sum() == 0:
    print("No missing values left in train and evaluation data.")
else:
    print("There are still missing values in the data.")
    print("Train data missing values:\n", missing_train[missing_train > 0])
    print("Evaluation data missing values:\n", missing_eval[missing_eval > 0])

No missing values left in train and evaluation data.


### Handling Numeric and Categorical Features



Real-world tables often mix:

- **Numeric features** (e.g., age, salary, hours worked)  
- **Categorical features** (e.g., gender, marital status, country)

Most ML algorithms require **numeric input**, so we must encode categories without imposing false order or exploding dimensions.

#### Common Encoding Techniques

1. **Label Encoding**  
   - Maps each category to a unique integer.  
   - *Pros:* Simple and fast.  
   - *Cons:* Imposes arbitrary order.

   ```python
   from sklearn.preprocessing import LabelEncoder
   le = LabelEncoder()
   df['color_enc'] = le.fit_transform(df['color'])
   ```

2. **One-Hot Encoding**  
   - Creates one binary column per category.  
   - *Pros:* No order.  
   - *Cons:* High dimensionality if many categories.

   ```python
   df = pd.get_dummies(df, columns=['color'])
   ```

3. **Ordinal Encoding**  
   - Uses integers for naturally ordered categories (e.g., education levels).  
   - *Example:* `Preschool`→0, `HS-grad`→3, `Bachelors`→5, `Masters`→6.

4. **Mean/Target Encoding**  
   - Replaces each category with the average target value for that category.  
   - Useful for high‑cardinality features, but watch for overfitting.
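
Techniques 3 and 4 have no snippet above, so here is a minimal pandas sketch of both. The toy frame `df_demo` and its values are made up for illustration, and the `education_order` mapping only covers the four levels from the example (the gaps stand for levels not listed here).

```python
import pandas as pd

# Toy data (hypothetical values); income is already encoded as 0/1
df_demo = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "Masters", "HS-grad", "Preschool"],
    "occupation": ["Sales", "Tech-support", "Sales", "Sales", "Tech-support"],
    "income": [0, 1, 1, 0, 0],
})

# Ordinal encoding: an explicit mapping that preserves the natural order
education_order = {"Preschool": 0, "HS-grad": 3, "Bachelors": 5, "Masters": 6}
df_demo["education_ord"] = df_demo["education"].map(education_order)

# Target encoding: replace each category with the mean target of that category
occupation_means = df_demo.groupby("occupation")["income"].mean()
df_demo["occupation_te"] = df_demo["occupation"].map(occupation_means)

print(df_demo)
```

In a real project the target means should be computed on the training split only (ideally out-of-fold), otherwise the encoded feature leaks the label and inflates validation scores.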


### 5.6 - Encode categoricals

In [7]:
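# Encode categoricals: refit the encoder on each training column, then apply
# the same mapping to the matching evaluation column
le = LabelEncoder()
for col in categorical_cols:
    df_train_full[col] = le.fit_transform(df_train_full[col])
    df_eval[col] = le.transform(df_eval[col])  # raises ValueError on categories unseen in train

# Encode 'income' in the train set. After this call `le` holds the income
# classes, which is what allows decoding the predictions in 5.10
df_train_full['income'] = le.fit_transform(df_train_full['income'])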
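
`LabelEncoder.transform` raises a `ValueError` when the evaluation data contains a category that was absent from the fitted training column. One possible alternative, sketched here on made-up data, is `OrdinalEncoder` with `handle_unknown="use_encoded_value"`, which maps unseen categories to a sentinel value. The frames `train_cats` and `eval_cats` and the sentinel `-1` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): the evaluation set contains a country never seen in train
train_cats = pd.DataFrame({"native_country": ["United-States", "Mexico", "United-States"]})
eval_cats = pd.DataFrame({"native_country": ["Mexico", "Holand-Netherlands"]})

# Unknown categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train_cats)
eval_codes = encoder.transform(eval_cats)

print(encoder.categories_)
print(eval_codes)
```

`OrdinalEncoder` also encodes several columns at once, so no loop or encoder reuse is needed.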

### Why Feature Scaling Matters



Numeric features often differ widely in range (e.g., age 17–90 vs. population weight tens to hundreds of thousands).  
Models that rely on distances or gradients (linear regression, SVM, neural nets, KNN) can be skewed when one feature dominates:

- **Domination**: large-range features drive the gradient or distance.  
- **Slow or unstable convergence**: gradient steps optimal for one dimension may be too big or small for another.  
- **Uneven regularization**: L1/L2 penalties depend on feature scale.

#### When to Scale
- **Scale-sensitive models**: linear/logistic regression, SVM, k‑NN, neural networks, PCA  
- **Scale-insensitive**: tree-based models (decision trees, random forest, boosting)

#### Common Scaling Methods
1. **Min–Max**: map to a fixed range such as [0, 1] or [1, 100]  
2. **Standard (Z-score)**: zero mean, unit variance  
3. **Robust**: uses median and IQR to resist outliers
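
To make the difference between the three methods concrete, the following sketch applies each scaler to the same toy column. The `hours` values are made up, with one deliberate outlier (99).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy column (hypothetical values) with one outlier
hours = np.array([[20.0], [40.0], [40.0], [45.0], [99.0]])

minmax = MinMaxScaler().fit_transform(hours)      # rescales to [0, 1]
standard = StandardScaler().fit_transform(hours)  # zero mean, unit variance
robust = RobustScaler().fit_transform(hours)      # centers on the median, divides by the IQR

for name, scaled in [("min-max", minmax), ("standard", standard), ("robust", robust)]:
    print(name, np.round(scaled.ravel(), 2))
```

With min-max scaling the outlier compresses the other values toward 0, while the robust scaler keeps the typical values close together and lets the outlier stand apart.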

### 5.7 - Scale numerics

In [8]:
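# Scale numerics to the range [1, 100]; fit on train, reuse on evaluation
scaler = MinMaxScaler(feature_range=(1, 100))
# Cast to float first so the scaled values match the column dtype
df_train_full[numeric_cols] = df_train_full[numeric_cols].astype(float)
df_eval[numeric_cols] = df_eval[numeric_cols].astype(float)
df_train_full.loc[:, numeric_cols] = scaler.fit_transform(df_train_full[numeric_cols])
df_eval.loc[:, numeric_cols] = scaler.transform(df_eval[numeric_cols])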

### Model Choice and Alternatives



While **Logistic Regression** is simple and interpretable, it’s not always best. Consider:

- **Logistic Regression**  
  - *Pros:* Fast, interpretable coefficients, natural for binary labels.  
  - *Cons:* Assumes linear decision boundary, needs manual feature interactions.

- **Decision Trees**  
  - *Pros:* Captures nonlinearity and interactions automatically.  
  - *Cons:* Prone to overfitting without pruning.

- **Ensembles**  
  - **Random Forest**: robust, minimal tuning; native handling of missing data depends on the implementation and version.  
  - **Gradient Boosting** (LightGBM, XGBoost): state-of-the-art for tabular data.

- **SVM**: Effective with kernels, needs scaling, may be slow for large data.  
- **KNN**: Simple neighbor-based, sensitive to scale and data density.  
- **Naive Bayes**: Fast, good with categoricals, assumes feature independence.  
- **Neural Networks**: Flexible but require more data, tuning, compute.
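
A quick way to compare some of these alternatives is cross-validation with the same metric for every candidate. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the Adult data, and the hyperparameters are arbitrary illustrative choices, so the scores it prints say nothing about this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem (an assumption standing in for the real features)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.75, 0.25], random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated F1 for each candidate
cv_f1 = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="f1")
    cv_f1[name] = scores.mean()
    print(f"{name}: mean F1 = {cv_f1[name]:.3f}")
```

To try this on the project data, `X_demo` and `y_demo` would be replaced by the preprocessed training features and the encoded `income` column.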

### 5.8 - Logistic regression model

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Separate features (X) and response variable (y)
X = df_train_full.drop('income', axis=1)
y = df_train_full['income']

# Divide into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.8,
    random_state=42,
    stratify=y
)

# Create and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict and evaluate with F1 on the positive class (>50K, encoded as 1)
y_pred = model.predict(X_test)
f1     = f1_score(y_test, y_pred)

print(f"Test F1: {f1:.3f}\n")

Test F1: 0.562
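
F1 is only one of the metrics listed in the workflow. The sketch below shows how accuracy, AUC, and the confusion matrix can be computed alongside it. The labels and probabilities are hand-made toy values, not outputs of the model above; on the real data the probabilities would come from something like `model.predict_proba(X_test)[:, 1]`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

# Toy labels and predicted probabilities for the positive class (hypothetical values)
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4, 0.2, 0.9]

# Hard predictions with the usual 0.5 threshold
y_hat = [int(p >= 0.5) for p in y_prob]

accuracy = accuracy_score(y_true, y_hat)
f1_demo = f1_score(y_true, y_hat)
auc = roc_auc_score(y_true, y_prob)  # AUC uses the probabilities, not the hard labels
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

print(f"Accuracy: {accuracy:.3f}  F1: {f1_demo:.3f}  AUC: {auc:.3f}")
print(cm)
```

On an imbalanced target such as `income`, accuracy can look good while F1 stays low, so it helps to report several metrics together.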



### 5.9 - Train your model again with all the data

In [10]:
model.fit(X, y)

### 5.10 - Make predictions on the evaluation data and save to csv

In [14]:
# Predict using the trained model
y_eval_pred_encoded = model.predict(df_eval)

# Decode the predictions back to original labels
y_eval_pred = le.inverse_transform(y_eval_pred_encoded)

# Combine the IDs and predictions into a DataFrame
df_submission = pd.DataFrame({
    'id': eval_ids,
    'prediction': y_eval_pred
})

# Save the DataFrame to a CSV file
df_submission.to_csv('predictions_task2.csv', index=False)

print("Predictions saved to predictions_task2.csv!")

Predictions saved to predictions_task2.csv!


## 6. New solution