<a href="https://colab.research.google.com/github/lavaman131/responsible-ai-law-ethics-society/blob/main/pre_class3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![banner](https://learn.responsibly.ai/assets/banner.jpg)

# Class 3 - Discrimination & Fairness: Pre-Class Task

https://learn.responsibly.ai

In the third class, we will dive into the challenge of fairness of machine learning models.

In this pre-class task, you will develop a simple classification model that takes a short textual biography of a person and returns its occupation. The model doesn't need to be fancy, you can aim for a "baseline" model with basic preprocessing that achieves reasonable accuracy of at least 80% on the test dataset. For example, Linear Regression on Bag of Words should be sufficient, but feel free to explore more powerful model families. We recommend keeping it simple and use the `sklearn` package. Finally, you will store your model so that you will be able to load it into the notebook in class.

Please go through the whole notebook before you start coding. You could plan your work better if you have first an overview of the task.

If you have any questions, please post them in the `#ds` channel in Discord or join the office hours.

Let's start!

## Setup

In [1]:
%%bash

wget --quiet https://stash.responsibly.ai/3-fairness/activity/data.zip

In [2]:
!unzip -oq data.zip

In [3]:
import pandas as pd
from IPython.display import display, Markdown

## Dataset

In [4]:
train_df = pd.read_csv('./data/train.csv')
test_df = pd.read_csv('./data/test.csv')

The training dataset and the test dataset consists of multiple rows, one for each person, and two columns:
1. `bio` - The biographies as text (i.e., `string`). This is the input to the model.
1. `occupation` - The occupations of each person as text (i.e., `string`). This is the model's output.

In [5]:
train_df.head()

Unnamed: 0,bio,occupation
0,His research and teaching focus on security s...,professor
1,He is currently associated with Shri Datta Ho...,dentist
2,Dr. Swarthout is an analytical environmental ...,professor
3,His work has appeared in major U.S. and Europ...,photographer
4,"He was born in the year 1977 in Rajshahi, hea...",photographer


In [6]:
test_df.head()

Unnamed: 0,bio,occupation
0,"He has worked at the University of Valencia, ...",attorney
1,"Nick Romano is his client, a young man with a...",attorney
2,"In this position, he has written several best...",physician
3,His research focuses on using traditional and...,professor
4,His music to the period supernatural film The...,composer


We used 75%-25% split between the training and the test dataset:

In [7]:
print(f'# train: {len(train_df)}')
print(f'# test: {len(test_df)}')

# train: 280470
# test: 93492


There are 28 occupations:

In [8]:
sorted(train_df['occupation'].unique())

['accountant',
 'architect',
 'attorney',
 'chiropractor',
 'comedian',
 'composer',
 'dentist',
 'dietitian',
 'dj',
 'filmmaker',
 'interior_designer',
 'journalist',
 'model',
 'nurse',
 'painter',
 'paralegal',
 'pastor',
 'personal_trainer',
 'photographer',
 'physician',
 'poet',
 'professor',
 'psychologist',
 'rapper',
 'software_engineer',
 'surgeon',
 'teacher',
 'yoga_teacher']

Each running of the next cell will sample 10 random rows and show their occupations and biographies:

In [9]:
for _, row in train_df.sample(10).iterrows():
    display(Markdown('### Ground-Truth Occupation: ' + row['occupation']))
    display(Markdown(row['bio']))

### Ground-Truth Occupation: professor

 Her research interests include health economics, disability policy, labor economics, and social insurance programs. Her past work includes estimating the effect of the ACA Medicaid expansions on federal disability insurance applications, examining the relationship between rising health insurance costs and employee compensation, and analyzing the time use of people with disabilities. Her work has been published in journals such as Health Economics, Demography, and the Psychiatric Rehabilitation Journal. Before joining George Mason, Anand was a senior researcher at Mathematica Policy Research.

### Ground-Truth Occupation: journalist

 He graduated from Marshall University in 2014 with a degree in journalism. An unabashed Appalachian, Culvyhouse took a hiatus from the news industry to try to carve himself a holler somewhere in the Ohio River Valley.

### Ground-Truth Occupation: model

 She was born in Huntington Beach on August 11, 1982. She is listed on FreeOnes since 2003 and is currently ranked 10050th place. She currently has 21 gallery links in her own FreeOnes section.

### Ground-Truth Occupation: journalist

 She is currently a Financial Reporter for Business Media Group’s Global Islamic Finance Magazine. She was short listed for the Young Journalist of the Year 2011 at the Muslim Writers Awards. For more information visit her website at www.tasnimnazeer.com

### Ground-Truth Occupation: journalist

 He also reports for Al-Qabas, a newspaper based in Kuwait, and for Radio Montecarlo. In this interview, Melhem reviews the evolution of Arab and Muslim nations' perspective on United States policies and actions in the Middle East, especially during the U.S. intervention in Lebanon during the 1980s. He also explores how Osama bin Laden and his contemporaries differ from the militant Islamic movements such as the Lebanese Hezbollah, which confronted the Reagan administration. Interview conducted late September 2001.

### Ground-Truth Occupation: psychologist

 Dr. Summerfield has a Ph.D. in Professional Psychology, and a Masters degree in Counseling and Speech. He specializes in cognitive–behavioral therapy for anxiety, depression, pain management, addiction, and support for cancer patients or survivors. Office locations in Westlake Village, CA. Visit his website at www.lestersummerfield.com or call (805) 496-6992 to schedule an appointment.

### Ground-Truth Occupation: attorney

 He specializes in affordable housing law and helping domestic violence victims reclaim their lives. He was a Fulbright fellow at the University of Alberta Faculty of Law and earned a J.D. in 2006 from the University of Idaho College of Law. He can be reached at (208) 345-0106, ext. 103, or by e-mail at richieeppink@idaholegalaid.org.

### Ground-Truth Occupation: surgeon

 He earned his medical degree at Georgetown University School of Medicine and then went on to complete a general surgery residency program at Columbia University—New York Presbyterian Hospital, where he served as Chief Surgery Resident for Robotic Surgery. Prior to joining MedStar Franklin Square’s team, he pursued a colorectal surgery fellowship at the Cleveland Clinic Florida, learning novel, advanced laparoscopic techniques.

### Ground-Truth Occupation: model

 Her carreer has also linked in with high profile advertising campaigns and also acting and music, videos and films, including Cool As Ice and Miami Rhapsody.

### Ground-Truth Occupation: attorney

 He has lived in Alaska for 21 years, and practiced law for 15. Mr. Peterson has a B.A. in Economics from the University of Alaska Anchorage, and J.D. from the University of San Diego School of Law. Following law school, he worked as a civil litigator specializing in employment law in San Diego before returning to Alaska to serve in the District Attorney’s Office. Mr. Peterson has worked in the Office of Special Prosecutions since 2007, and also co-owns and operates a sport fishing guide service on the Kenai Peninsula. He, his wife, and their two children live in Anchorage.

## Your turn!

### Training

Train a model on the training dataset and ensure that your model achieves at least 80% accuracy on the test dataset. Please use the variable name `model` to hold your model object (e.g., sklearn's LogisticRegression, PyTorch model, ...).

In [10]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib_inline
import seaborn as sns
# get higher quality plots
matplotlib_inline.backend_inline.set_matplotlib_formats('retina')
plt.style.use("fivethirtyeight")

In [11]:
word_lengths = [len(text.split()) for text in train_df.bio]

In [12]:
print("The average text length in the training corpus is", np.mean(word_lengths))
print("The max text length in the training corpus is", np.max(word_lengths))

The average text length in the training corpus is 58.67281349163903
The max text length in the training corpus is 175


In [13]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

X_train = vectorizer.fit_transform(train_df.bio)
X_test = vectorizer.transform(test_df.bio)

In [14]:
from sklearn.preprocessing import LabelEncoder

# preprocessing labels
le = LabelEncoder()
y_train = le.fit_transform(train_df.occupation)
y_test = le.transform(test_df.occupation)

In [15]:
le.classes_

array(['accountant', 'architect', 'attorney', 'chiropractor', 'comedian',
       'composer', 'dentist', 'dietitian', 'dj', 'filmmaker',
       'interior_designer', 'journalist', 'model', 'nurse', 'painter',
       'paralegal', 'pastor', 'personal_trainer', 'photographer',
       'physician', 'poet', 'professor', 'psychologist', 'rapper',
       'software_engineer', 'surgeon', 'teacher', 'yoga_teacher'],
      dtype=object)

In [16]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((280470, 187353), (280470,), (93492, 187353), (93492,))

In [17]:
from sklearn.linear_model import LogisticRegression
# training code
model = LogisticRegression(solver="sag")
model.fit(X_train, y_train)



### Evaluation I

Show the accuracy of the model on the test dataset:

In [18]:
test_accuracy = model.score(X_test, y_test)
print("The test accuracy is", round(test_accuracy, 3))

The test accuracy is 0.916


### `predict` function

Implement the function `predict`, that returns predictions to a sequence of bios given as list/pandas series of strings. You will need to use this function in class.

In [19]:
from typing import List, Union

In [20]:
def predict(model: LogisticRegression,
            bios: Union[List[str], pd.Series],
            vectorizer: CountVectorizer) -> np.ndarray:
            X_test = vectorizer.transform(bios)
            return model.predict(X_test)

Demonstrate that your `predict` function works, **show the inputs and the predicted output** of a few examples:

In [21]:
# using predict function
sample = test_df.bio[:10]
preds = predict(model, sample, vectorizer)
labels = le.inverse_transform(preds)
for i in range(len(sample)):
  print(f"text:\n{sample[i]}")
  print(f"predicted label: {labels[i]}")
  print(f"true label: {test_df.occupation[i]}")
  print()

text:
 He has worked at the University of Valencia, at the European Commission’s Anti-Fraud Office in Brussels, and the Legal Department of the International Monetary Fund’s Financial Integrity Group, and most recently at the Inter-American Development Bank, both in Washington, D.C. He currently works at the Fiscal Department of Uría Menéndez Abogados, S.L.P in Barcelona (Spain). He may be contacted at alberto.gil@uria.com. Contributions include chapter 1: Background and Current Status of FATCA, chapter 15: Framework of Intergovernmental Agreements, chapter 18: The OECD Role in Exchange of Information: the Trace Project, FATCA, and Beyond, and chapter 28: Exchange of Tax Information and the Impact of FATCA for Spain.
predicted label: attorney
true label: attorney

text:
 Nick Romano is his client, a young man with a long string of crimes behind him. Romano, after turning to robbery again, is caught by a cop and Nick pumps all his bullets into him in frustration. Morton's appeal to the 

### Evaluation II

Show the accuracy-per-class on the test dataset:

In [22]:
from sklearn.metrics import accuracy_score
# evaluation code 2

for group_name, df in test_df.groupby("occupation"):
  y_pred = predict(model, df.bio, vectorizer=vectorizer)
  y_true = le.transform(df.occupation)
  acc = accuracy_score(y_true, y_pred)
  print(f"accuracy for {group_name} = {round(acc, 3)}")

accuracy for accountant = 0.927
accuracy for architect = 0.896
accuracy for attorney = 0.962
accuracy for chiropractor = 0.909
accuracy for comedian = 0.915
accuracy for composer = 0.96
accuracy for dentist = 0.979
accuracy for dietitian = 0.953
accuracy for dj = 0.908
accuracy for filmmaker = 0.933
accuracy for interior_designer = 0.915
accuracy for journalist = 0.933
accuracy for model = 0.926
accuracy for nurse = 0.902
accuracy for painter = 0.942
accuracy for paralegal = 0.88
accuracy for pastor = 0.871
accuracy for personal_trainer = 0.891
accuracy for photographer = 0.966
accuracy for physician = 0.89
accuracy for poet = 0.927
accuracy for professor = 0.826
accuracy for psychologist = 0.904
accuracy for rapper = 0.923
accuracy for software_engineer = 0.901
accuracy for surgeon = 0.935
accuracy for teacher = 0.865
accuracy for yoga_teacher = 0.942


### Evaluation III

Now, show the **Acceptance Rate**, **False Positive Rate** and **False Negative Rate** of each occupation.

Bonus: use the Seaborn's `PairPlot` + dot plot to plot it [demo](https://seaborn.pydata.org/examples/pairgrid_dotplot.html).


#### Confusion Matrix

Actual class/Predicted class | P | N
-----------------------------|---|--------------
P       | **TP** | FN
N     | FP | **TN**


#### Metric Definitions


**Acceptance Rate**

${\displaystyle \mathrm {AR}
= {\frac{\mathrm {TP + FP}}{\mathrm {TP+FN+FP+TN}}}}$


**False negative rate (FNR)**

${\displaystyle \mathrm {FNR} = {\frac {\mathrm {FN} }{\mathrm {FN} +\mathrm {TP} }}}$

**False Positive Rate (FPR)**

${\displaystyle \mathrm {FPR} = {\frac {\mathrm {FP} }{\mathrm {FP} +\mathrm {TN} }}}$

In [23]:
from sklearn.metrics import confusion_matrix
# evaluation code 3
y_pred = model.predict(X_test)
cf_matrix = confusion_matrix(y_test, y_pred)
n = cf_matrix.shape[0]

for i in range(n):
  tp = cf_matrix[i][i]
  fn = 0
  fp = 0
  for j in range(n):
    if j == i:
      continue
    fn += cf_matrix[i][j]
    fp = cf_matrix[j][i]
  tn = len(y_pred) - tp - fn - fp

  ar = (tp + fp) / (tp + fn + fp + tn)
  fnr = fn / (fn + tp)
  fpr = fp / (fp + tn)

  class_label = le.classes_[i]
  print(f"Class: {class_label}")
  print(f"Acceptance Rate (AR): {round(ar, 3)}")
  print(f"False Negative Rate (FNR): {round(fnr, 3)}")
  print(f"False Positive Rate (FPR): {round(fpr, 3)}")
  print()

Class: accountant
Acceptance Rate (AR): 0.019
False Negative Rate (FNR): 0.073
False Positive Rate (FPR): 0.0

Class: architect
Acceptance Rate (AR): 0.035
False Negative Rate (FNR): 0.104
False Positive Rate (FPR): 0.0

Class: attorney
Acceptance Rate (AR): 0.118
False Negative Rate (FNR): 0.038
False Positive Rate (FPR): 0.0

Class: chiropractor
Acceptance Rate (AR): 0.007
False Negative Rate (FNR): 0.091
False Positive Rate (FPR): 0.0

Class: comedian
Acceptance Rate (AR): 0.009
False Negative Rate (FNR): 0.085
False Positive Rate (FPR): 0.0

Class: composer
Acceptance Rate (AR): 0.022
False Negative Rate (FNR): 0.04
False Positive Rate (FPR): 0.0

Class: dentist
Acceptance Rate (AR): 0.034
False Negative Rate (FNR): 0.021
False Positive Rate (FPR): 0.0

Class: dietitian
Acceptance Rate (AR): 0.013
False Negative Rate (FNR): 0.047
False Positive Rate (FPR): 0.0

Class: dj
Acceptance Rate (AR): 0.005
False Negative Rate (FNR): 0.092
False Positive Rate (FPR): 0.0

Class: filmmaker
Ac

## That's all!

If you found a mistake / problem in this notebook, or something was unclear, please post at the `#ds` channel.

**Prepare to explain to your team about this data and the model you've trained.**

### Submission

1. Save the notebook as a pdf file (In Colab: File > Print)
2. Upload in Gradescope http://go.responsibly.ai/gradescope