<a href="https://colab.research.google.com/github/gvern/AgentGPT/blob/main/ResponsibleAI_notebook_PART1_environment_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Responsible AI - Notebook for the entire course

Author: Marie Couvé

# Use case preparation

## Problem statement

Based on the Census data, **determine whether a person makes over $50,000 US dollars a year**.

The dataset is available here: https://archive.ics.uci.edu/dataset/2/adult


## Dataset description

### Numeric Features
*   `age`: The age of the individual in years.
*   `fnlwgt`: Final weight, i.e. the number of individuals the census organization believes this set of observations represents.
*   `education-num`: A numeric encoding of the `education` category. The higher the number, the higher the level of education the individual achieved. For example, an `education-num` of `11` represents `Assoc-voc` (associate degree at a vocational school), an `education-num` of `13` represents `Bachelors`, and an `education-num` of `9` represents `HS-grad` (high school graduate).
*   `capital-gain`: Capital gain made by the individual, represented in US Dollars.
*   `capital-loss`: Capital loss made by the individual, represented in US Dollars.
*   `hours-per-week`: Hours worked per week.

### Categorical Features
*   `workclass`: The individual's type of employer. Examples include: `Private`, `Self-emp-not-inc`, `Self-emp-inc`, `Federal-gov`, `Local-gov`, `State-gov`, `Without-pay`, and `Never-worked`.
*   `education`: The highest level of education achieved for that individual.
*   `marital-status`: Marital status of the individual. Examples include: `Married-civ-spouse`, `Divorced`, `Never-married`, `Separated`, `Widowed`, `Married-spouse-absent`, and `Married-AF-spouse`.
*   `occupation`: The occupation of the individual. Examples include: `Tech-support`, `Craft-repair`, `Other-service`, `Sales`, `Exec-managerial`, and more.
*   `relationship`:  The relationship of each individual in a household. Examples include: `Wife`, `Own-child`, `Husband`, `Not-in-family`, `Other-relative`, and `Unmarried`.
*   `race`: `White`, `Asian-Pac-Islander`, `Amer-Indian-Eskimo`, `Black`, and `Other`.
*   `sex`:  Gender of the individual available only in binary choices: `Female` or `Male`.
*   `native-country`: Country of origin of the individual. Examples include: `United-States`, `Cambodia`, `England`, `Puerto-Rico`, `Canada`, `Germany`, `Outlying-US(Guam-USVI-etc)`, `India`, `Japan`, and more.

### Label
*   `income-per-year` (the 'income' column in the loaded dataframe): Whether the person makes more than $50,000 US dollars annually.


# Let's start coding!

## Package installation





In [None]:
! pip install codecarbon

In [None]:
! pip install dalex

In [None]:
! pip install shap

In [None]:
! pip install fairlearn

In [None]:
! pip install ucimlrepo


## Imports

In [None]:
# Remove warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Generic
import pandas as pd
import numpy as np
from copy import copy

# Dataset
from ucimlrepo import fetch_ucirepo

# ML
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Carbon impact
from codecarbon import EmissionsTracker

# Explainability
import shap

# Fairness
import dalex as dx
from dalex.fairness import resample
from dalex.fairness import roc_pivot

from fairlearn.reductions import ExponentiatedGradient
from fairlearn.reductions import TruePositiveRateParity, DemographicParity, EqualizedOdds
from fairlearn.postprocessing import ThresholdOptimizer, plot_threshold_optimizer



# PART 1 - Environmental impact of AI

We are going to use CodeCarbon to evaluate the carbon impact of AI.


The documentation can be found here: https://mlco2.github.io/codecarbon/


## Carbon tracking initialisation

In [None]:
# Instantiate the tracker from Code Carbon

# Start tracking


## Data preparation

### Load the dataset


In [None]:
# Load data
# Fetch dataset
adult = fetch_ucirepo(id=2)

# Get features and targets
X = adult.data.features
y = adult.data.targets

# Combine in a single dataframe
df_raw = pd.concat([X,y], axis=1)

In [None]:
# First look at the data


### Data exploration

Explore and analyze the data as necessary. You need a good understanding of each feature and its influence on the target.

Some examples:
- Look at the data types
- Look at missing values
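The checks listed above could look like the sketch below. It runs on a small toy frame with made-up values that mimics a few Adult columns, and it assumes that missing values may appear either as `NaN` or as a `?` placeholder. On the real data the same calls would be applied to `df_raw`.

```python
import pandas as pd

# Toy rows (hypothetical values) mimicking a few columns of the Adult dataset.
df_toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", None],
    "income": ["<=50K", "<=50K.", ">50K", ">50K."],
})

# Data types of each column.
print(df_toy.dtypes)

# Missing values encoded as NaN/None.
missing_per_column = df_toy.isna().sum()
print(missing_per_column)

# Missing values encoded with a "?" placeholder (an assumption about the encoding).
placeholder_per_column = df_toy.isin(["?"]).sum()
print(placeholder_per_column)

# Distinct label values, which reveals inconsistent spellings of the same class.
income_values = df_toy["income"].value_counts()
print(income_values)
```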

In [None]:
# Explore the data


In [None]:
# Analyze the data


### Data cleaning

Clean anything you have found during your exploration, such as:

1. Remove the '.' in some of the values in the 'income' column
2. ...
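For the first cleaning step, one possible approach is a literal string replacement on the label column. The sketch below uses a toy series with made-up values. Passing `regex=False` matters because `.` would otherwise be a candidate for pattern matching rather than a literal dot.

```python
import pandas as pd

# Toy label values (hypothetical) showing the trailing '.' problem.
income = pd.Series(["<=50K", "<=50K.", ">50K", ">50K."], name="income")

# Remove the literal '.' and any surrounding whitespace.
income_clean = income.str.replace(".", "", regex=False).str.strip()

print(sorted(income_clean.unique()))
```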

In [None]:
# Clean the income column



### Data splitting and preprocessing

1. Make a copy of the raw dataframe
2. Binarize the target label (we suggest this mapping: >50K:1, <=50K:0). We suggest using `Series.map()` on the label column
3. Split into train and test sets. We suggest: `train_test_split()`
4. Preprocess your data using the preprocessor below. Try to get the column names from the preprocessor and save the preprocessed data in a dataframe

```
X_train_preprocessed = ...
X_test_preprocessed = ...
```

Preprocessor for the data:
```
preprocessor = make_column_transformer(
    ("passthrough", make_column_selector(dtype_include=int)),
    (StandardScaler(), make_column_selector(dtype_include=float)),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), make_column_selector(dtype_include=object))
)
```

What it does:
- If the column is integer, it is passed through without preprocessing
- If the column is float, it is standardized by removing the mean and scaling to unit variance
- If the column is categorical, it is encoded as a one-hot numeric array


In [None]:
# Make a copy of the raw dataset


In [None]:
# Binarize label : >50K:1, <=50K:0


# Drop the initial label column


In [None]:
# Train and test split


In [None]:
# Preprocessing
# Define preprocessor


# Train preprocessor and transform data



## Train a simple model

1. Define and fit a simple model. We suggest a Decision Tree with the parameters: max_depth=10, class_weight="balanced"
```
clf_decisiontree_base = ...
```
2. Assess the algorithmic performance. You can use the `classification_report()` on test data
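A minimal sketch of these two steps follows. It uses synthetic data from `make_classification` as a stand-in for the preprocessed Adult features, so the variable names and the class imbalance are assumptions. In the notebook the same calls would take `X_train_preprocessed` and `X_test_preprocessed`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the preprocessed features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Decision tree with the suggested parameters; class_weight="balanced"
# reweights samples inversely to class frequency.
clf_demo = DecisionTreeClassifier(max_depth=10, class_weight="balanced", random_state=0)
clf_demo.fit(X_tr, y_tr)

# Precision, recall and F1 per class on held-out data.
report = classification_report(y_te, clf_demo.predict(X_te))
print(report)
```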


### Training and evaluation

In [None]:
# Train a decision tree


In [None]:
# Algorithmic performance



## Carbon tracking conclusion


In [None]:
# Stop the carbon tracker and print the carbon emissions



### Your analysis

*How much carbon was emitted? Is it a lot? Compare it to references (km in car, km in plane...)*
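One way to make the comparison concrete is to divide the measured emissions by a per-kilometre emission factor. The factors and the emissions value below are hypothetical placeholders, not measured or sourced figures. Replace them with the value returned by the tracker and with factors from a reference you trust.

```python
# Hypothetical emission factors in kg CO2eq per km; replace with sourced values.
CAR_KG_PER_KM = 0.2
PLANE_KG_PER_PASSENGER_KM = 0.25


def to_equivalent_km(emissions_kg: float, kg_per_km: float) -> float:
    """Convert emissions (kg CO2eq) into an equivalent distance in km."""
    if kg_per_km <= 0:
        raise ValueError("kg_per_km must be positive")
    return emissions_kg / kg_per_km


# Placeholder value; use the number returned by the carbon tracker instead.
emissions_kg = 0.0005

car_km = to_equivalent_km(emissions_kg, CAR_KG_PER_KM)
plane_km = to_equivalent_km(emissions_kg, PLANE_KG_PER_PASSENGER_KM)
print(f"Equivalent to about {car_km * 1000:.1f} m by car")
print(f"Equivalent to about {plane_km * 1000:.1f} m by plane (per passenger)")
```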



## Bonus: Compare the carbon impact of different models

Train other models and compare their performance and their carbon impact.