# **Welcome to our first Python Lab!**

### We'll be using Python to build a simple machine learning model to talk about how AI models make decisions.

## **1. Setting Up the Dataset**

We'll be using the [**Adult Income Dataset** from the UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/2/adult).

The dataset is typically used to predict whether a person earns more than $50K/year based on census attributes such as education, occupation, and marital status.

It raises questions about fairness in predicting people’s income based on demographic characteristics, so it's a good example of the sociotechnical aspects of AI we discussed in lecture.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Load the dataset from UCI (it’s available via many sources or from your own directory)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
                'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
                'hours_per_week', 'native_country', 'income']

# Load the dataset into a pandas DataFrame
df = pd.read_csv(url, names=column_names, na_values=' ?')

# Display the first few rows of the dataset
df.head()

In [None]:
df.columns

In [None]:
len(df)

## **2. Exploratory Data Analysis**

Before we dive into building a machine learning model, it's crucial to perform **exploratory data analysis (EDA)** to understand the dataset better. This will help us uncover trends, anomalies, and potential biases in the data. We'll look at descriptive statistics for numerical features, the distribution of categorical features, and visualize key attributes of the dataset.

In [None]:
# Drop rows with missing values
df = df.dropna()

# Get basic descriptive statistics
df.describe()

In [None]:
len(df)
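Comparing the two `len(df)` values shows how many rows `dropna()` removed. It can also help to see which columns the missing values come from before dropping them. The sketch below assumes a small made-up frame (`toy`, not the real Adult data) so that it runs on its own; on the lab data, the same check would be run on `df` before the `dropna()` call.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Adult data (values are made up for illustration)
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", np.nan, "Private", "Private"],
    "occupation": ["Adm-clerical", np.nan, "Handlers-cleaners", np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows removed by dropna(): any row with at least one missing value
rows_dropped = len(toy) - len(toy.dropna())
print(f"Rows dropped: {rows_dropped}")
```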

Let's now look at the distributions of important categorical features like race, sex, education, and workclass to check for imbalances.

In [None]:
# Categorical feature distributions
categorical_columns = ['relationship', 'native_country', 'race', 'sex', 'education', 'workclass', 'marital_status', 'occupation', 'income']

# Displaying the distribution of each categorical variable
for col in categorical_columns:
    print(f"\nDistribution of {col}:\n")
    print(df[col].value_counts(normalize=True) * 100)

    # Plot each categorical variable in a separate chart
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x=col, order=df[col].value_counts().index)
    plt.title(f'Distribution of {col}')
    plt.xticks(rotation=45)  # Rotate x-axis labels for readability
    plt.show()
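Single-column distributions show imbalance within one feature, but they do not show how the target is distributed inside each group. A minimal sketch of a row-normalized cross-tabulation follows, using a made-up frame (`toy`) with simplified labels; on the lab data the equivalent call would be `pd.crosstab(df['sex'], df['income'], normalize='index')`.

```python
import pandas as pd

# Toy data (made up): six people with a sex and an income label
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Male"],
    "income": [">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

# normalize="index" makes each row sum to 1, so every cell is the share
# of that group falling into each income class
income_by_sex = pd.crosstab(toy["sex"], toy["income"], normalize="index")
print(income_by_sex)
```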

Let’s now explore the relationships between key features (like age, hours per week, and education_num) and the target variable (income) to see how they might influence the outcome.

In [None]:
# Plotting age, education_num, and hours_per_week vs income
fig, ax = plt.subplots(1, 3, figsize=(18, 6))

sns.boxplot(data=df, x='income', y='age', ax=ax[0])
ax[0].set_title('Age vs Income')

sns.boxplot(data=df, x='income', y='education_num', ax=ax[1])
ax[1].set_title('Education Level vs Income')

sns.boxplot(data=df, x='income', y='hours_per_week', ax=ax[2])
ax[2].set_title('Hours Worked per Week vs Income')

plt.tight_layout()
plt.show()

Lastly, we’ll create a correlation heatmap to understand how the numerical features and the one-hot encoded categorical features (including income) are correlated with one another. High correlations between features might indicate multicollinearity (where features are highly related), which we need to account for in modeling.

In [None]:
import numpy as np

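# Perform one-hot encoding for all categorical columns
# (dtype=int yields 0/1 columns; recent pandas versions default to booleans,
# which select_dtypes(include=[np.number]) below would exclude)
df_encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True, dtype=int)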

# Check the column names to ensure 'income' is properly encoded
print(df_encoded.columns)

# Select only numeric columns for the correlation matrix
numeric_df_encoded = df_encoded.select_dtypes(include=[np.number])

# Display the correlation heatmap, including income
plt.figure(figsize=(15, 10))
sns.heatmap(numeric_df_encoded.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title('Correlation Heatmap of Numerical and Encoded Categorical Features (Including Income)')
plt.show()

In [None]:
df_encoded.dtypes
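To see what one-hot encoding with `drop_first=True` does, here is a small sketch on a made-up two-column frame (`toy`). The argument `dtype=int` is used so the indicator columns print as 0/1 rather than booleans.

```python
import pandas as pd

# Toy data (made up): one numeric and one categorical column
toy = pd.DataFrame({
    "hours_per_week": [40, 50, 20],
    "sex": ["Male", "Female", "Male"],
})

# drop_first=True drops the first category in sorted order ("Female"),
# leaving a single indicator column for "Male"
encoded = pd.get_dummies(toy, columns=["sex"], drop_first=True, dtype=int)
print(encoded)
```

The dropped category becomes the baseline: a row with `sex_Male == 0` is implicitly "Female". This is why the lab's target column is named `income_ >50K` and there is no separate column for the lower income class.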

Next, we need to perform a train/test split to prepare for fitting the model.

In [None]:
from sklearn.model_selection import train_test_split

# Separate features (X) and target (y)
X = df_encoded.drop(columns=['income_ >50K'])  # Features (drop the encoded income column)
y = df_encoded['income_ >50K']  # Target (income)

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Check the shape of the resulting datasets
print("Training features shape:", X_train.shape)
print("Testing features shape:", X_test.shape)
print("Training labels shape:", y_train.shape)
print("Testing labels shape:", y_test.shape)
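The split above samples rows at random, so the share of high-income rows can differ slightly between the training and test sets. An optional variation, not used in this lab, is to pass `stratify=y` so both splits keep the same class ratio. The sketch below shows the effect on made-up labels (`X_toy`, `y_toy`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (made up): 80 negatives, 20 positives
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("Positive rate (train):", y_tr.mean())
print("Positive rate (test): ", y_te.mean())
```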

## **3. Building a Simple Machine Learning Model**

Now that the data is ready, let's build a Logistic Regression model to predict whether someone earns more than $50K/year.

In [None]:
from sklearn.linear_model import LogisticRegression

# Create the Logistic Regression model
model = LogisticRegression(max_iter=1000)

# Train the model on the training data
model.fit(X_train, y_train)
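Since this lab is about how models make decisions, it may help to see what a fitted logistic regression actually computes: a weighted sum of the features plus an intercept, passed through the sigmoid function to give a probability. The sketch below checks this on made-up one-feature data (`X_toy`, `y_toy`, and `toy_model` are illustrative names, not part of the lab).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (made up): the label tends to be 1 for larger feature values
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1])

toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Weighted sum of the features plus the intercept (the log-odds)
z = X_toy @ toy_model.coef_.ravel() + toy_model.intercept_[0]

# The sigmoid maps log-odds to a probability between 0 and 1
manual_proba = 1.0 / (1.0 + np.exp(-z))
sklearn_proba = toy_model.predict_proba(X_toy)[:, 1]

print("Coefficient:", toy_model.coef_.ravel())
print("Manual and sklearn probabilities match:",
      np.allclose(manual_proba, sklearn_proba))
```

For the lab's `model`, each entry of `model.coef_` plays the same role for one column of `X_train`: a positive weight pushes the prediction toward the high-income class, a negative weight pushes it away.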

## **4. Evaluating the Model**

After training, we need to evaluate the model to see how well it performs on the test data. We’ll use a **confusion matrix** to visualize the performance of the model.

The confusion matrix helps us understand how many predictions were correct (true positives and true negatives) and how many were incorrect (false positives and false negatives).

This breakdown is crucial for evaluating fairness and bias in the model.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Predict on the test set
y_pred = model.predict(X_test)

# Compute the confusion matrix (rows: actual classes, columns: predicted classes)
cm = confusion_matrix(y_test, y_pred)

# Display the confusion matrix
ConfusionMatrixDisplay(confusion_matrix=cm).plot()

* True Positives (TP): Correctly predicted high income.
* True Negatives (TN): Correctly predicted low income.
* False Positives (FP): Incorrectly predicted high income (overprediction).
* False Negatives (FN): Incorrectly predicted low income (underprediction).
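One way to connect these four counts to fairness is to compute them separately for each demographic group and compare the resulting error rates. The sketch below assumes a made-up results table with hypothetical groups "A" and "B"; on the lab data, the group column could be a feature such as `sex` or `race` aligned with `y_test` and `y_pred`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy results (made up): true labels and predictions for two groups
results = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 1, 0, 1, 1, 0, 0, 0],
})

rates = {}
for name, part in results.groupby("group"):
    # labels=[0, 1] fixes the layout so ravel() returns tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(
        part["y_true"], part["y_pred"], labels=[0, 1]
    ).ravel()
    rates[name] = {
        "TPR": tp / (tp + fn),  # share of actual positives that were found
        "FPR": fp / (fp + tn),  # share of actual negatives flagged as positive
    }

print(pd.DataFrame(rates).T)
```

In this toy example the model finds every high earner in group A but only half of those in group B, a gap that a single overall confusion matrix would hide.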

Generally speaking, the accuracy of a model is the ratio of correctly predicted instances to the total instances.

Formula: (TP + TN) / (TP + TN + FP + FN)

Let's calculate it using code:

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
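The same value can be recovered by hand from the four confusion-matrix counts using the accuracy formula. A minimal sketch on made-up labels (`y_true_toy`, `y_pred_toy`):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels and predictions (made up)
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels 0/1, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Manual accuracy:  {manual_accuracy:.4f}")
print(f"sklearn accuracy: {accuracy_score(y_true_toy, y_pred_toy):.4f}")
```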

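Accuracy is the best-known general performance measure, but it can be misleading when the classes are imbalanced: a model that always predicts the majority class scores high accuracy without ever identifying the minority class. In that case, it's better to break down the confusion matrix further for a more detailed analysis.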

One such measure is **precision**: the proportion of positive predictions that were actually correct. It answers the question: Of all the instances the model labeled positive, how many really were positive?

It’s useful when the cost of false positives is high (e.g., in spam detection).

Formula: TP / (TP + FP)

In [None]:
# Calculate Precision
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.4f}")

Next is **recall**, also known as sensitivity or true positive rate. It's the proportion of true positives out of all actual positives. It answers the question: Of all the actual positive instances, how many were correctly identified?

It focuses on the model's ability to **capture all true positives**.

Formula: TP / (TP + FN)

In [None]:
# Calculate Recall (Sensitivity or True Positive Rate)
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.4f}")
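Recall is also a quick way to expose the weakness of accuracy under class imbalance. The sketch below uses made-up labels with a 90/10 split and a "model" that always predicts the majority class; the names `y_true_toy` and `y_pred_majority` are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy imbalanced labels (made up): 90 negatives, 10 positives
y_true_toy = np.array([0] * 90 + [1] * 10)

# A "model" that always predicts the majority class (0)
y_pred_majority = np.zeros_like(y_true_toy)

acc = accuracy_score(y_true_toy, y_pred_majority)
rec = recall_score(y_true_toy, y_pred_majority)

print(f"Accuracy: {acc:.2f}")  # high, because most labels are 0
print(f"Recall:   {rec:.2f}")  # zero, because no positive is ever found
```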

The F1 score is the harmonic mean of precision and recall. It balances the need for both precision and recall, and is useful when you want to avoid extremes of either metric.

Formula: 2 * (Precision * Recall) / (Precision + Recall)

In [None]:
# Calculate F1-Score
f1 = f1_score(y_test, y_pred)
print(f"F1-Score: {f1:.4f}")
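To tie the three formulas together, the following sketch counts TP, FP, and FN directly on made-up labels and checks the hand-computed precision, recall, and F1 against scikit-learn.

```python
import math
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels and predictions (made up): 4 actual positives, 6 actual negatives
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Count the confusion-matrix cells needed for the formulas
pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"Precision: {precision_manual:.4f}")
print(f"Recall:    {recall_manual:.4f}")
print(f"F1-Score:  {f1_manual:.4f}")

# The hand-computed values agree with scikit-learn
assert math.isclose(precision_manual, precision_score(y_true_toy, y_pred_toy))
assert math.isclose(recall_manual, recall_score(y_true_toy, y_pred_toy))
assert math.isclose(f1_manual, f1_score(y_true_toy, y_pred_toy))
```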

## ROC and AUC


* The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR; what we also referred to as recall or sensitivity earlier) against the False Positive Rate (FPR; the proportion of actual negatives incorrectly identified as positives) at various classification thresholds. It shows how well the model can distinguish between classes.


* The Area Under the Curve (AUC) represents the area under the ROC curve. It’s a single scalar value that summarizes the ROC curve. A higher AUC indicates better model performance in distinguishing between classes.
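Each point on a ROC curve corresponds to one classification threshold. The sketch below uses made-up labels and scores and a hypothetical helper, `tpr_fpr`, to show how TPR and FPR move as the threshold changes.

```python
import numpy as np

# Toy labels and predicted probabilities (made up)
y_true_toy = np.array([0, 0, 1, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

def tpr_fpr(y_true, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    y_hat = (scores >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# A lower threshold labels more cases positive: TPR rises, but so does FPR
for t in (0.3, 0.5, 0.7):
    tpr_t, fpr_t = tpr_fpr(y_true_toy, scores_toy, t)
    print(f"threshold={t:.1f}  TPR={tpr_t:.2f}  FPR={fpr_t:.2f}")
```

`roc_curve` performs this sweep over every distinct score, which is why it returns arrays of FPR and TPR values along with the thresholds that produced them.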

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

# Calculate the probability estimates for the test set
y_proba = model.predict_proba(X_test)[:, 1]

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Compute AUC
auc = roc_auc_score(y_test, y_proba)

# Plot ROC Curve
plt.figure()
plt.plot(fpr, tpr, label=f"ROC Curve (area = {auc:.2f})")
plt.plot([0, 1], [0, 1], color="navy", linestyle="--")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

# Print AUC score
print(f"AUC: {auc:.2f}")

An AUC of 0.88 suggests that the model is not perfect but is effective: it has a good ability to distinguish between the positive and negative classes.

Some general AUC ranges to use as a rule of thumb in analysis:
* AUC = 1.0: Perfect model – the model makes no mistakes in classification.
* AUC = 0.5: Random guessing – the model cannot distinguish between the classes (equivalent to flipping a coin).
* AUC between 0.7 and 0.9: Good model – your model does a good job distinguishing between positive and negative cases.
* AUC < 0.7: Poor model – the model struggles to separate the classes.
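Another way to read these numbers: AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below checks this interpretation on made-up labels and scores by comparing every positive-negative pair (ties, if any, count as half).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (made up)
y_true_toy = np.array([0, 0, 1, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

pos = scores_toy[y_true_toy == 1]  # scores of actual positives
neg = scores_toy[y_true_toy == 0]  # scores of actual negatives

# Score difference for every (positive, negative) pair
diff = pos[:, None] - neg[None, :]

# Fraction of pairs where the positive is ranked above the negative
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(f"Pairwise AUC: {pairwise_auc:.4f}")
print(f"sklearn AUC:  {roc_auc_score(y_true_toy, scores_toy):.4f}")
```

Under this reading, an AUC of 0.5 means the model ranks a positive above a negative only half the time, which is no better than a coin flip.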


## Your Task:

Although this dataset isn't classifying people into high-stakes categories like "offender" or "non-offender," it is used to predict whether someone earns more than $50K a year. From a responsible AI perspective, this can still be problematic. Explain how this type of prediction could negatively impact the individuals being classified by the algorithm, as well as the broader group of stakeholders who might not interact with the algorithm directly but could still be affected by its outcomes.