<a href="https://colab.research.google.com/github/lcbjrrr/algojust/blob/main/NB04_hist_algo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Algorithmic and Historical bias


- **Algorithmic bias**: Bias that arises from the design, implementation, or use of algorithms. This can be due to skewed training data or poor algorithm design and can lead to unfair outcomes in areas like recruiting or criminal justice.

- **Omitted variable bias**: Occurs when important variables are left out of a model or analysis, leading to inaccurate results. For example, analyzing car data without considering mileage or age can lead to inaccurate conclusions about vehicle value.

- **Historical bias**: Occurs when past socio-cultural prejudices and beliefs are reflected in the data and in any analysis built on it. This is particularly challenging when historical data is used to train machine learning models, because the models can perpetuate and amplify these biases. An example is an AI hiring tool that learned from historically biased hiring data and ended up favoring male candidates.
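
The omitted variable bias described above can be illustrated with a small simulation. This is only a sketch on synthetic data: the variable names, coefficients, and noise levels are assumptions chosen for illustration and are not taken from any dataset in this notebook. A car's value is generated from both age and mileage, and the model that leaves mileage out attributes the mileage effect to age.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Synthetic cars (assumed numbers): mileage grows with age
age = rng.uniform(0, 15, n)                          # years
mileage = 12_000 * age + rng.normal(0, 10_000, n)    # miles
value = 30_000 - 800 * age - 0.10 * mileage + rng.normal(0, 500, n)

# Model with both variables vs. model that omits mileage
full = LinearRegression().fit(np.column_stack([age, mileage]), value)
short = LinearRegression().fit(age.reshape(-1, 1), value)

print("age coefficient, mileage included:", full.coef_[0])
print("age coefficient, mileage omitted: ", short.coef_[0])
```

With mileage included, the age coefficient should land near the true value of -800. With mileage omitted, it should move toward roughly -2000, because age now also carries the effect of the missing variable (-0.10 per mile times about 12,000 miles per year).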


*The data used herein is for illustrative purposes only and does not reflect actual real-world data*

#### Here is a trained [model](https://colab.research.google.com/github/lcbjrrr/quantai/blob/main/M2_Py_ML_RegLin.ipynb). Use it to perform predictions.

In [None]:
!wget https://github.com/lcbjrrr/algojust/raw/refs/heads/main/lingreg.pkl




## AI/ML Models

Machine learning (ML) models learn patterns from data and use those patterns to make predictions, without being explicitly programmed for each task. Their accuracy generally improves as they are trained on more data and their parameters are adjusted to fit it.

### Load a Model

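The joblib.load function (from the joblib library) loads Python objects saved to disk, such as trained machine learning models, and is efficient for objects that contain large arrays. .pkl files hold Python objects in a serialized format based on the pickle module so they can be reused later. Because loading a pickle can execute arbitrary code, load .pkl files only from sources you trust.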

In [None]:
import joblib
model = joblib.load('lingreg.pkl')
model.coef_

array([-7.20418133, 39.28951506, -0.31423604, -0.88926581, -0.21726562,
       -0.81547586,  1.4617983 , -0.42905681])
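
The raw coefficient array is easier to read when each value is paired with its feature name. The sketch below copies the coefficients printed above and assumes they follow the same column order used in the training and prediction cells of this notebook; `coef_table` is a name introduced here for illustration.

```python
import pandas as pd

# Column order assumed to match the order used when the model was fitted
features = ['age', 'female', 'vehicle_age', 'claim_last2y',
            'v_Hatchback', 'v_SUV', 'v_Sedan', 'v_Truck']

# Values copied from the model.coef_ output above
coef = [-7.20418133, 39.28951506, -0.31423604, -0.88926581,
        -0.21726562, -0.81547586, 1.4617983, -0.42905681]

# Sort by absolute size so the most influential features come first
coef_table = pd.Series(coef, index=features).sort_values(key=abs, ascending=False)
print(coef_table)
```

In a linear regression, each coefficient is the change in the prediction for a one-unit increase in that feature with the others held fixed, so large coefficients on sensitive attributes are worth a closer look.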

In [None]:
import pandas as pd
test = pd.DataFrame({'age':[33,33,33,33,66,66],
                     'female':[1,0,1,0,1,0],
                     'vehicle_age':[2,2,2,2,4,4],
                     'claim_last2y':[1,1,0,0,0,0],
                     'v_Hatchback':[1,1,1,1,0,0],
                     'v_SUV':[0,0,0,0,1,1],
                     'v_Sedan':[0,0,0,0,0,0],
                     'v_Truck':[0,0,0,0,0,0] })
test

Unnamed: 0,age,female,vehicle_age,claim_last2y,v_Hatchback,v_SUV,v_Sedan,v_Truck
0,33,1,2,1,1,0,0,0
1,33,0,2,1,1,0,0,0
2,33,1,2,0,1,0,0,0
3,33,0,2,0,1,0,0,0
4,66,1,4,0,0,1,0,0
5,66,0,4,0,0,1,0,0


With the trained model, we can perform predictions. Predictions are estimates or decisions made by trained models when given new, unseen data, using patterns learned from historical information.
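
For a linear regression, a prediction is just the intercept plus the weighted sum of the input features. The following sketch checks this on a toy model fitted to synthetic data; the data, the true weights, and the names `toy_model`, `X_new`, and `manual` are assumptions for illustration, not part of this notebook's model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic training data with three features and known weights
X = rng.normal(size=(50, 3))
y = 10 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

toy_model = LinearRegression().fit(X, y)

# Reproduce predict() by hand: intercept + features . coefficients
X_new = rng.normal(size=(4, 3))
manual = toy_model.intercept_ + X_new @ toy_model.coef_

print(np.allclose(manual, toy_model.predict(X_new)))
```

The manual calculation and `predict()` should agree up to floating-point rounding.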



In [None]:
# Store the predictions in a variable so they can be attached to the table below
predictions = model.predict(test[['age','female','vehicle_age','claim_last2y','v_Hatchback','v_SUV','v_Sedan','v_Truck']])
predictions

array([494.87821716, 455.5887021 , 495.76748297, 456.4779679 ,
       256.80281685, 217.51330179])

Here are the predictions. **What kind of bias do you see in them?**

In [None]:
test['predictions']=predictions
test

Unnamed: 0,age,female,vehicle_age,claim_last2y,v_Hatchback,v_SUV,v_Sedan,v_Truck,predictions
0,33,1,2,1,1,0,0,0,494.878217
1,33,0,2,1,1,0,0,0,455.588702
2,33,1,2,0,1,0,0,0,495.767483
3,33,0,2,0,1,0,0,0,456.477968
4,66,1,4,0,0,1,0,0,256.802817
5,66,0,4,0,0,1,0,0,217.513302
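
One way to look for bias in a table like this is to compare rows that are identical except for a single attribute. The sketch below rebuilds the table above (values copied from the output) and pairs each female row with the matching male row; the names `results`, `profile_cols`, and `gap` are introduced here for illustration.

```python
import pandas as pd

# Table copied from the output above
results = pd.DataFrame({
    'age':          [33, 33, 33, 33, 66, 66],
    'female':       [1, 0, 1, 0, 1, 0],
    'vehicle_age':  [2, 2, 2, 2, 4, 4],
    'claim_last2y': [1, 1, 0, 0, 0, 0],
    'v_Hatchback':  [1, 1, 1, 1, 0, 0],
    'v_SUV':        [0, 0, 0, 0, 1, 1],
    'v_Sedan':      [0, 0, 0, 0, 0, 0],
    'v_Truck':      [0, 0, 0, 0, 0, 0],
    'predictions':  [494.878217, 455.588702, 495.767483,
                     456.477968, 256.802817, 217.513302],
})

# Every column except the attribute under inspection defines a "profile"
profile_cols = ['age', 'vehicle_age', 'claim_last2y',
                'v_Hatchback', 'v_SUV', 'v_Sedan', 'v_Truck']

# One row per profile, with the male and female predictions side by side
gap = (results.pivot_table(index=profile_cols, columns='female', values='predictions')
              .rename(columns={0: 'male', 1: 'female'}))
gap['difference'] = gap['female'] - gap['male']
print(gap)
```

A difference that stays the same across otherwise identical profiles points to the model treating that attribute itself as a pricing factor.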


## Training the model

In [3]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/lcbjrrr/algojust/refs/heads/main/car_insurance%20-%20bias.csv")
df.head(3)

Unnamed: 0,age,female,premium,vehicle_age,claim_last2y,v
0,23,1,580,3,0,Sedan
1,24,0,520,5,1,SUV
2,31,1,510,2,0,Hatchback


In [2]:
df = pd.get_dummies(df)  # one-hot encode the categorical column v
df.head(3)

Unnamed: 0,age,female,premium,vehicle_age,claim_last2y,v_Hatchback,v_SUV,v_Sedan,v_Truck
0,23,1,580,3,0,False,False,True,False
1,24,0,520,5,1,False,True,False,False
2,31,1,510,2,0,True,False,False,False


This section shows how the model above was produced: train an AI/ML model (here, a linear regression) on a training dataset, save the trained model to a file, load it back, and use it to make predictions. The process involves importing libraries, fitting the model to data, evaluating its performance, and applying it to new data points to see the predicted outcomes. The score() method measures how well the model fits; for linear regression it returns the coefficient of determination, R². Because the score below is computed on the training data itself, it describes fit to that data rather than accuracy on unseen data.
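
The whole cycle described above (fit, score, save, load, predict) can be sketched end to end on a toy dataset. Everything here is an assumption for illustration: the synthetic data, the column names `x1`, `x2`, and `y`, and the file name `toy_model.pkl`, which is written to a temporary directory so it does not overwrite lingreg.pkl.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

# Synthetic training data: y depends linearly on x1 and x2 plus noise
toy = pd.DataFrame({'x1': rng.uniform(0, 10, 100),
                    'x2': rng.uniform(0, 5, 100)})
toy['y'] = 3 * toy['x1'] - 2 * toy['x2'] + rng.normal(0, 0.5, 100)

features = ['x1', 'x2']
fitted = LinearRegression().fit(toy[features], toy['y'])
r2 = fitted.score(toy[features], toy['y'])   # R² on the training data

# Save the model, then load it back as a separate object
path = os.path.join(tempfile.mkdtemp(), 'toy_model.pkl')
joblib.dump(fitted, path)
restored = joblib.load(path)

# The restored model predicts on new data points like the original
new_points = pd.DataFrame({'x1': [1.0, 5.0], 'x2': [2.0, 0.5]})
print(r2)
print(restored.predict(new_points))
```

The restored model should return the same predictions as the one that was saved, which is what makes it possible to train once and reuse the model elsewhere.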

In [None]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(df[['age','female','vehicle_age','claim_last2y','v_Hatchback','v_SUV','v_Sedan','v_Truck']],df['premium'])
#print(linreg.coef_,linreg.intercept_)
linreg.score(df[['age','female','vehicle_age','claim_last2y','v_Hatchback','v_SUV','v_Sedan','v_Truck']],df['premium'])

0.9824782207152526

In [None]:
linreg.predict(test[['age','female','vehicle_age','claim_last2y','v_Hatchback','v_SUV','v_Sedan','v_Truck']])

array([494.87821716, 455.5887021 , 495.76748297, 456.4779679 ,
       256.80281685, 217.51330179])

In [None]:
predictions

array([494.87821716, 455.5887021 , 495.76748297, 456.4779679 ,
       256.80281685, 217.51330179])

Save the trained linear regression model (linreg) to a file named lingreg.pkl using the joblib.dump function.

In [None]:
joblib.dump(linreg, 'lingreg.pkl')

['lingreg.pkl']

## Linear Regression 101

In [None]:
import pandas as pd
treino = pd.read_csv('https://raw.githubusercontent.com/lcbjrrr/algojust/refs/heads/main/grades%20-%20all.csv')
treino.head(3)

Unnamed: 0,Ex1,Ex2,Ex3
0,100,100,90
1,90,100,90
2,95,100,100


In [None]:
from sklearn.linear_model import LinearRegression
reglin = LinearRegression()
reglin.fit(treino[['Ex1','Ex2']] , treino['Ex3'])
reglin.score(treino[['Ex1','Ex2']] , treino['Ex3'])

0.9818896713333345

In [None]:
test = pd.DataFrame({'Ex1':[80,20,35],
                     'Ex2':[80,40,40],
                     'Ex3':[35,80,90]})
test

Unnamed: 0,Ex1,Ex2,Ex3
0,80,80,35
1,20,40,80
2,35,40,90


In [None]:
preds = reglin.predict(test[['Ex1','Ex2']])
preds

array([73.12693817, 39.75624675, 34.2650204 ])

## Activity: Algorithmic and Historical bias


**Problem**

Even basic models like linear regression can perpetuate and amplify biases present in their training data, leading to unfair or inaccurate predictions. Your task is to select a dataset from Kaggle that is suitable for linear regression, train a linear regression model on it, and generate predictions. Then analyze your predictions for evidence of bias, if any.
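
One possible way to analyze predictions for bias is a counterfactual check: flip a sensitive attribute for every row, keep everything else fixed, and measure how much the prediction moves. The sketch below is a hedged illustration on synthetic data, not a prescribed method for the activity; the columns `experience`, `group`, and `salary` and all numeric values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200

# Synthetic data in which the target depends on a sensitive binary attribute
data = pd.DataFrame({'experience': rng.uniform(0, 20, n),
                     'group': rng.integers(0, 2, n)})
data['salary'] = (30_000 + 1_500 * data['experience']
                  + 4_000 * data['group'] + rng.normal(0, 1_000, n))

features = ['experience', 'group']
audit_model = LinearRegression().fit(data[features], data['salary'])

# Counterfactual copy: same rows, sensitive attribute flipped
flipped = data[features].copy()
flipped['group'] = 1 - flipped['group']
delta = audit_model.predict(flipped) - audit_model.predict(data[features])

# Express every change as "group 1 minus group 0" for the same person
effect = np.where(data['group'] == 0, delta, -delta)
print("average change in prediction when group goes 0 -> 1:", effect.mean())
```

For a linear model this change equals the coefficient of the flipped attribute, so the check mainly confirms what the coefficient already says; its advantage is that the same procedure also works for models whose coefficients are not directly readable.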

**Conclusions**

Summarize the specific biases identified in your model's predictions, detailing their impact. Discuss the effectiveness of any mitigation strategies you attempted or propose, and reflect on the ethical implications of your findings. Your conclusion should articulate the challenges of achieving fairness in machine learning and suggest future research directions or practical steps for developing more equitable predictive models. If you find no signs of bias, provide concrete evidence that supports that conclusion and comment on the finding.