# Project Title: Bias Free Resume Screening



## Problem Statement:

1. The idea of machine learning-based resume screening is to screen a pool of resumes and decide which ones can be accepted based on their skill set. In this project, I aim to screen resumes based on the similarity between the skills they mention and the job description.
2. However, screening can be affected by conscious or unconscious biases concerning sensitive features such as sex, age, and ethnicity. In this project, I also aim to detect and reduce gender bias to make the model fair.

## Methodology 

##### Step 1 - Data Collection: 
1. I used the 'Original Resume Dataset' (https://github.com/JAIJANYANI/Automated-Resume-Screening-System/tree/master/Original_Resumes), which has 19 resumes. The same source provides a job description against which the resumes need to be screened.
2. I also used the 'Kaggle Resume Dataset' (https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset), which has more than 2000 resumes collected by web scraping. The data is classified into different job groups. I manually collected and created a job description for each job group.

##### Step 2 - Data Extraction: 
1. I used 'docx2txt' and 'pdfminer' to extract raw text from docx and pdf files, 'NLTK' to remove stopwords, and 're' to remove URL, email, and special-character patterns from the text.
2. I also tried 'ResumeParser' from 'pyresparser' to extract columns like name, email, skills, education, and total experience, but the extracted values were consistent only for some resume formats. So, I opted for the first method.
 

##### Step 3 - Data Annotation: 
1. For the 'Original Resume Dataset', I manually ranked the resumes. I used this dataset to verify the ranking accuracy of the similarities.
2. For the 'Kaggle Resume Dataset', since there are 2000+ resumes, I opted to verify the cosine similarity by assessing the 'Category' it predicts. I also randomly generated a gender for each resume and prepended it to the extracted text so that the bias could be analysed.

##### Step 4 - Vectorizing Data: 
1. I used 'TfidfVectorizer' to vectorize the resumes and job descriptions.
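
As background for what `TfidfVectorizer` computes, the weighting below is scikit-learn's documented default (`smooth_idf=True`, `norm='l2'`). It assumes the default settings are in effect apart from the arguments shown in Task 2. For a term $t$ in document $d$, with $n$ fitted documents and $\mathrm{df}(t)$ the number of fitted documents that contain $t$:

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\,\left(\ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1\right)
$$

Each document vector is then scaled to unit Euclidean length. Because the vectorizer in Task 2 is fitted on the job descriptions only, $n$ and $\mathrm{df}(t)$ count job descriptions, and resume words that never appear in a job description are dropped.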

##### Step 5 - Similarities: 
1. I used 'cosine_similarity' to find the similarity matrix between resume vectors and jobs vectors.
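
As an illustration of the measure itself, here is a minimal plain-Python sketch of cosine similarity. It is not the scikit-learn implementation; the function name `cosine` and the three-term vectors are hypothetical and not taken from the project data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF-like vectors over a 3-term vocabulary.
resume_vec = [1.0, 2.0, 0.0]
job_vec = [2.0, 4.0, 0.0]        # same direction as resume_vec
unrelated_vec = [0.0, 0.0, 3.0]  # shares no terms with resume_vec

print(cosine(resume_vec, job_vec))        # close to 1: same term proportions
print(cosine(resume_vec, unrelated_vec))  # 0.0: no shared terms
```

The score depends only on the direction of the vectors, so a long resume and a short job description can still match closely if they emphasise the same terms.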

##### Step 6 - Performance Analysis:
1. I used a RandomForest classification model to analyse the accuracy of the similarities.
2. I used aif360 to measure the bias with respect to gender.
3. I used the Reweighing pre-processing technique to try to improve model and bias performance.

##### Note: The implementation of steps 2 to 6 is shown below.

### Task 1 - Load and Extract Resumes and Jobs

#### 1.1 Extracting Resumes from docx and pdf

In [408]:
import docx2txt
from pdfminer.high_level import extract_text

## Extracting text from PDF
def extract_text_from_pdf(pdf_path):
    return extract_text(pdf_path)

#Extracting text from DOCX
def extract_text_from_docx(docx_path):
    txt = docx2txt.process(docx_path)
    if txt:
        return txt.replace('\t', ' ')
    return None

#### 1.2 Cleaning the Extracted Resume text

In [411]:
import re
import nltk

def clean(text):
    """
    Clean the input text by removing URLs, emails, special characters, and stop words.
    
    :param text: The string to be cleaned
    :return: The cleaned string
    """

    # Compile patterns for URLs and emails to speed up cleaning process
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
    
    # Remove URLs
    clean_text = url_pattern.sub('', text)
    
    # Remove emails
    clean_text = email_pattern.sub('', clean_text)
    
    # Remove special characters (keeping only words and whitespace)
    clean_text = re.sub(r'[^\w\s]', '', clean_text)
    
    # Remove stop words by filtering the split words of the text
    stop_words = set(nltk.corpus.stopwords.words('english'))
    clean_text = ' '.join(word for word in clean_text.split() if word.lower() not in stop_words)
    return clean_text
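
To see what the cleaning steps do to a line of resume text, here is a small stand-alone sketch that applies the same three substitutions in the same order. The name `clean_demo`, the sample sentence, and the mini stop-word set are hypothetical; the stop-word set stands in for the NLTK list so that the example runs without downloading the corpus.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
# Hypothetical mini stop-word list standing in for nltk.corpus.stopwords.
STOP_WORDS = {"a", "an", "and", "at", "in", "me", "the", "to"}

def clean_demo(text):
    text = URL_PATTERN.sub('', text)      # 1. drop URLs
    text = EMAIL_PATTERN.sub('', text)    # 2. drop email addresses
    text = re.sub(r'[^\w\s]', '', text)   # 3. drop punctuation and symbols
    return ' '.join(w for w in text.split() if w.lower() not in STOP_WORDS)

sample = ("Email me at jane.doe@example.com, portfolio: "
          "https://example.com/cv. Skilled in C++ and SQL!")
print(clean_demo(sample))
```

One side effect worth knowing: stripping special characters turns a skill such as "C++" into "C", so skills that rely on punctuation lose information before vectorizing.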
    

#### 1.3 Extracting the annotations and manipulating data to add gender

In [414]:
import random

def addGender(clean_text):
    """Prepend a randomly chosen gender token to the text and return (text, gender)."""
    gender = random.choice(["Male", "Female"])
    return gender + " " + clean_text, gender

In [536]:
import os

## This is for "Kaggle Dataset"
def extractingResume(directory_Path, directory_name) :
    mydictList = []
    directory = directory_Path
    files = os.listdir(directory)
    for filename in files :
        file_path = os.path.join(directory, filename)
        if os.path.isfile(file_path) and file_path.endswith('.pdf'):
            resume_text = extract_text_from_pdf(file_path)
        elif os.path.isfile(file_path) and (file_path.endswith('.docx') or file_path.endswith('.DOCX')):
            resume_text = extract_text_from_docx(file_path)
        else:
            print('Unsupported File Format:', filename)
            continue
        # Tag the cleaned text with a random gender and store the same value as a column
        clean_text, gender = addGender(clean(resume_text))
        mydictList.append({"ResumeText" : resume_text, "cleanText" : clean_text, "category": directory_name, "gender": gender})
    return mydictList

## This is for "Original Dataset"
def extractingOriginalResume(directory_Path, gender) :
    mydictList = []
    directory = directory_Path
    files = os.listdir(directory)
    for filename in files :
        file_path = os.path.join(directory, filename)
        if os.path.isfile(file_path) and file_path.endswith('.pdf'):
            resume_text = extract_text_from_pdf(file_path)
            mydictList.append({"ResumeText" : resume_text, "cleanText" : clean(resume_text), "Rank": int(os.path.splitext(filename)[0]), "gender": gender})
        elif os.path.isfile(file_path) and (file_path.endswith('.docx') or file_path.endswith('.DOCX')):
            resume_text = extract_text_from_docx(file_path)
            mydictList.append({"ResumeText" : resume_text, "cleanText" : clean(resume_text), "Rank": int(os.path.splitext(filename)[0]), "gender": gender})
        else:
            print('Unsupported File Format:', filename)
        continue
    return mydictList

In [133]:
import pandas as pd

#### 1.4 Extract "Kaggle Dataset" and job description for each category

In [135]:
dir_path = "./data"

dir_list = os.listdir(dir_path)
print(len(dir_list))
print(dir_list)

22
['ACCOUNTANT', 'ADVOCATE', 'AGRICULTURE', 'APPAREL', 'ARTS', 'AVIATION', 'BANKING', 'BUSINESS-DEVELOPMENT', 'CHEF', 'CONSTRUCTION', 'CONSULTANT', 'DESIGNER', 'DIGITAL-MEDIA', 'ENGINEERING', 'FINANCE', 'FITNESS', 'HEALTHCARE', 'HR', 'INFORMATION-TECHNOLOGY', 'PUBLIC-RELATIONS', 'SALES', 'TEACHER']


In [137]:
data_list = []
for dir_name in dir_list :
    data_item = extractingResume(dir_path+"/"+dir_name, dir_name)
    data_list = data_list + data_item

In [139]:
resumes_df = pd.DataFrame(data_list)
resumes_df.loc[resumes_df['gender'] == "Male", 'Gender_Male'] = 1 
resumes_df.loc[resumes_df['gender'] == "Female", 'Gender_Male'] = 0 
resumes_df = resumes_df.drop(columns = ['gender'])
resumes_df.head()

Unnamed: 0,ResumeText,cleanText,category,Gender_Male
0,ACCOUNTANT\nSummary\n\nFinancial Accountant sp...,Male ACCOUNTANT Summary Financial Accountant s...,ACCOUNTANT,1.0
1,STAFF ACCOUNTANT\nSummary\nHighly analytical a...,Male STAFF ACCOUNTANT Summary Highly analytica...,ACCOUNTANT,1.0
2,ACCOUNTANT\nProfessional Summary\nTo obtain a ...,Female ACCOUNTANT Professional Summary obtain ...,ACCOUNTANT,0.0
3,SENIOR ACCOUNTANT\nExperience\nCompany Name Ju...,Female SENIOR ACCOUNTANT Experience Company Na...,ACCOUNTANT,0.0
4,SENIOR ACCOUNTANT\nProfessional Summary\nSenio...,Female SENIOR ACCOUNTANT Professional Summary ...,ACCOUNTANT,0.0


In [141]:
resumes_df['Gender_Male'].value_counts()

Gender_Male
0.0    1223
1.0    1203
Name: count, dtype: int64

In [143]:
jobs_df = pd.read_csv('./jobdescriptions/jobs.csv')
jobs_df.head()

Unnamed: 0,job,job_details
0,ACCOUNTANT,We are seeking trustworthy candidates who work...
1,ADVOCATE,Advocates play a crucial role in providing sup...
2,AGRICULTURE,An agricultural worker performs various tasks ...
3,APPAREL,Apparel industry is related to clothing and ot...
4,ARTS,An artist's main assignment is to create somet...


#### 1.5 Extract Original Dataset and a single original job description 

In [538]:
original_resumes_list = extractingOriginalResume("./original data/Female","Female") + extractingOriginalResume("./original data/Male","Male")

In [542]:
original_resumes_df = pd.DataFrame(original_resumes_list)
original_resumes_df.head(20)

Unnamed: 0,ResumeText,cleanText,Rank,gender
0,"Choo Gui Yi, Ivy\n\n\n\nAddress: 38 Lorong N T...",Choo Gui Yi Ivy Address 38 Lorong N Telok Kura...,1,Female
1,Curriculum Vitae\n\nOlivia Karina Peter (Ms)\n...,Curriculum Vitae Olivia Karina Peter Ms Mobile...,14,Female
2,TASNEEM NASRULLA\n\nPROFESSIONAL EXPERIENCE\n\...,TASNEEM NASRULLA PROFESSIONAL EXPERIENCE Apr 2...,15,Female
3,Lau Peiwen Erwina \n\n +65 90083183 \n\nEr...,Lau Peiwen Erwina 65 90083183 Date Birth 29th ...,17,Female
4,NOT PROTECTIVELY MARKED\n\n\n\nNOT PROTECTIVEL...,PROTECTIVELY MARKED PROTECTIVELY MARKED Radhik...,18,Female
5,Ocvia Freriana\nSingapore\n\nCompliance profes...,Ocvia Freriana Singapore Compliance profession...,19,Female
6,CURRICULUM VITAE\n\n\n\n \n\nNAME: ...,CURRICULUM VITAE NAME Gloria Cheng Ge Fang EMA...,2,Female
7,"ROHINI PRAKASH, CA \nSINGAPORE |+ 65 98624747 ...",ROHINI PRAKASH CA SINGAPORE 65 98624747 INVEST...,5,Female
8,1\n\n \n\n THE SEARCH SPECIALISTS FOR THE FUND...,1 SEARCH SPECIALISTS FUNDS INDUSTRY LONDON LUX...,8,Female
9,"Trisa Tay, CA\n\n\n\n\n\n ...",Trisa Tay CA TRISA TAY CHARTERED ACCOUNTANT CA...,9,Female


In [430]:
with open('./jobdescriptions/oaktree.txt', 'r') as file:
    original_job_data = [file.read().replace('\n', '')]

print(original_job_data)

['Assist with the implementation of a new Singapore based investment platform that will be used as the primary investment holding platform for Oaktree investments in the APAC region  Manage the accounting and administration function across all the limited partnership structures and Section 13x/R SPV’s in our local Singapore based investment platform and all our SPV’s across the APAC region Serve on the board of directors of SPV’s across the APAC region as and when required Manage the accounting and operations of the local service company set up to provide administration of the Singapore platform   Work closely with and oversee the work carried out by external service providers who will complete the local accounting, legal, tax advisory and compliance and company secretarial work on our Singapore platform and across the APAC region   Assist with the setting up of new investment structures for any new deals ensuring that they are operational in advance of deal completion and with cash re

### Task 2 - vectorizing resumes and jobs

In [146]:
from sklearn.feature_extraction.text import TfidfVectorizer

#### 2.1 Vectorizing Kaggle web-scraped resumes and jobs

In [148]:
resumes_list = resumes_df['cleanText'].to_list()
jobs_list = list(jobs_df['job_details'])
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
# Fit the vocabulary on the job descriptions only, then project the resumes onto it
tfidfvectorizer.fit(jobs_list)
tfidf_X = tfidfvectorizer.transform(resumes_list)
X = tfidf_X.toarray()
tfidf_Y = tfidfvectorizer.transform(jobs_list)
Y = tfidf_Y.toarray()

#### 2.2 Vectorizing original resumes and related job data

In [544]:
original_resumes_list = original_resumes_df['cleanText'].to_list()
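tfidfvectorizer_og = TfidfVectorizer(analyzer='word', stop_words='english')
# The vocabulary is fitted on the Kaggle job descriptions (jobs_list), so terms that
# occur only in the original job description are ignored by transform().
tfidfvectorizer_og.fit(jobs_list)
tfidf_og_X = tfidfvectorizer_og.transform(original_resumes_list)
X_og = tfidf_og_X.toarray()
tfidf_og_Y = tfidfvectorizer_og.transform(original_job_data)
Y_og = tfidf_og_Y.toarray()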

### Task 3 - Assessing Similarity Performance in Original Dataset and analysis

We compute the cosine similarity matrix for the dataset and predict the result for each resume based on its similarity.

#### 3.1 Get Similarity Matrix and create a Resumes dataframe with similarity

In [439]:
from sklearn.metrics.pairwise import cosine_similarity

In [546]:
# this is for original dataset
similarities_og = cosine_similarity(X_og, Y_og)

In [548]:
sim_og_df = pd.DataFrame(similarities_og, columns = ["Similarity"]) 
sim_og_df.head()

Unnamed: 0,Similarity
0,0.255232
1,0.230127
2,0.277726
3,0.354138
4,0.289732


In [556]:
result_original_df = pd.merge(original_resumes_df, sim_og_df, left_index=True, right_index=True)
result_original_df.head()

Unnamed: 0,ResumeText,cleanText,Rank,gender,Similarity
0,"Choo Gui Yi, Ivy\n\n\n\nAddress: 38 Lorong N T...",Choo Gui Yi Ivy Address 38 Lorong N Telok Kura...,1,Female,0.255232
1,Curriculum Vitae\n\nOlivia Karina Peter (Ms)\n...,Curriculum Vitae Olivia Karina Peter Ms Mobile...,14,Female,0.230127
2,TASNEEM NASRULLA\n\nPROFESSIONAL EXPERIENCE\n\...,TASNEEM NASRULLA PROFESSIONAL EXPERIENCE Apr 2...,15,Female,0.277726
3,Lau Peiwen Erwina \n\n +65 90083183 \n\nEr...,Lau Peiwen Erwina 65 90083183 Date Birth 29th ...,17,Female,0.354138
4,NOT PROTECTIVELY MARKED\n\n\n\nNOT PROTECTIVEL...,PROTECTIVELY MARKED PROTECTIVELY MARKED Radhik...,18,Female,0.289732


In [562]:
ranked_original_df = result_original_df.copy()

In [564]:
ranked_original_df.loc[ranked_original_df['Rank'] <= 10 , 'Result'] = "Pass"
ranked_original_df.loc[ranked_original_df['Rank'] > 10 , 'Result'] = "Fail"

ranked_original_df.loc[ranked_original_df['gender'] == "Male", 'Gender_Male'] = 1 
ranked_original_df.loc[ranked_original_df['gender'] == "Female", 'Gender_Male'] = 0 
ranked_original_df = ranked_original_df.drop(columns = ['gender'])

ranked_original_df.head()

Unnamed: 0,ResumeText,cleanText,Rank,Similarity,Result,Gender_Male
0,"Choo Gui Yi, Ivy\n\n\n\nAddress: 38 Lorong N T...",Choo Gui Yi Ivy Address 38 Lorong N Telok Kura...,1,0.255232,Pass,0.0
1,Curriculum Vitae\n\nOlivia Karina Peter (Ms)\n...,Curriculum Vitae Olivia Karina Peter Ms Mobile...,14,0.230127,Fail,0.0
2,TASNEEM NASRULLA\n\nPROFESSIONAL EXPERIENCE\n\...,TASNEEM NASRULLA PROFESSIONAL EXPERIENCE Apr 2...,15,0.277726,Fail,0.0
3,Lau Peiwen Erwina \n\n +65 90083183 \n\nEr...,Lau Peiwen Erwina 65 90083183 Date Birth 29th ...,17,0.354138,Fail,0.0
4,NOT PROTECTIVELY MARKED\n\n\n\nNOT PROTECTIVEL...,PROTECTIVELY MARKED PROTECTIVELY MARKED Radhik...,18,0.289732,Fail,0.0


In [566]:
ranked_original_df = ranked_original_df.drop(columns = ["Rank", "ResumeText", "cleanText"])
ranked_original_df.head()

Unnamed: 0,Similarity,Result,Gender_Male
0,0.255232,Pass,0.0
1,0.230127,Fail,0.0
2,0.277726,Fail,0.0
3,0.354138,Fail,0.0
4,0.289732,Fail,0.0


#### 3.2 Similarity based prediction using RandomForest Classifier

In [623]:
from sklearn.model_selection import train_test_split

df_train_og, df_test_og = train_test_split(ranked_original_df,random_state=42, test_size=0.3, shuffle=True)

X_train_og = df_train_og.drop(columns = ['Result'])
Y_train_og = df_train_og['Result']

X_test_og = df_test_og.drop(columns = ['Result'])
Y_test_og = df_test_og['Result']

In [625]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

clf_og = RandomForestClassifier()
clf_og.fit(X_train_og, Y_train_og)
Y_test_pred_og = clf_og.predict(X_test_og)
Y_train_pred_og = clf_og.predict(X_train_og)
print(classification_report(Y_test_og, Y_test_pred_og))
print(classification_report(Y_train_og, Y_train_pred_og))

              precision    recall  f1-score   support

        Fail       0.67      0.67      0.67         3
        Pass       0.67      0.67      0.67         3

    accuracy                           0.67         6
   macro avg       0.67      0.67      0.67         6
weighted avg       0.67      0.67      0.67         6

              precision    recall  f1-score   support

        Fail       1.00      1.00      1.00         6
        Pass       1.00      1.00      1.00         7

    accuracy                           1.00        13
   macro avg       1.00      1.00      1.00        13
weighted avg       1.00      1.00      1.00        13



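##### Note: The accuracy is 67% on the test data and 100% on the train data. With only 6 test resumes the test figure is a rough estimate, and the gap suggests overfitting.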

#### 3.3 Finding Gender Bias

In [629]:
from aif360.datasets import StandardDataset
from aif360.metrics import BinaryLabelDatasetMetric

In [631]:
privileged_groups = [{'Gender_Male': 1}]
unprivileged_groups = [{'Gender_Male': 0}]

In [633]:
dataset_train_og = StandardDataset(df_train_og, 
                          label_name='Result', 
                          favorable_classes=['Pass'], 
                          protected_attribute_names=['Gender_Male'], 
                          privileged_classes=[[1]])

metric_train_og = BinaryLabelDatasetMetric(dataset_train_og, 
                                             unprivileged_groups=unprivileged_groups,
                                             privileged_groups=privileged_groups)

print("Bias metrics for the original dataset (Pass/Fail labels)")
print("SPD ", metric_train_og.statistical_parity_difference())
print("DI ", metric_train_og.disparate_impact())

Bias metrics for the original dataset (Pass/Fail labels)
SPD  -0.0714285714285714
DI  0.875


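##### Note: SPD < 0 and DI < 1 indicate a slight bias in favour of the privileged group (gender = Male). However, DI = 0.875 is above the 0.8 threshold of the 80% (four-fifths) rule, so by that rule the labels are not considered discriminatory in the hiring process. These metrics are computed on the manually assigned training labels, not on the classifier's predictions, and the training split has only 13 resumes.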
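
A minimal sketch of the two metrics, written without aif360 so that the arithmetic is visible. SPD is the favorable-outcome rate of the unprivileged group minus that of the privileged group, and DI is the ratio of the same two rates. The group counts are an assumption: they are chosen because they reproduce the printed SPD and DI for a 13-row training split, but the notebook itself does not print them.

```python
def selection_rate(labels):
    """Share of favorable outcomes (1 = Pass) within a group."""
    return sum(labels) / len(labels)

def fairness_metrics(unprivileged, privileged):
    """Return (statistical parity difference, disparate impact)."""
    rate_u = selection_rate(unprivileged)
    rate_p = selection_rate(privileged)
    return rate_u - rate_p, rate_u / rate_p

# Assumed counts: 3 of 6 Female and 4 of 7 Male resumes labelled "Pass".
female = [1, 1, 1, 0, 0, 0]
male = [1, 1, 1, 1, 0, 0, 0]

spd, di = fairness_metrics(female, male)
print(f"SPD {spd:.4f}  DI {di:.4f}  passes 80% rule: {di >= 0.8}")
```

With groups this small, moving a single resume from "Fail" to "Pass" shifts DI by a large amount, so the result should be read as indicative only.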

### Task 4 - Assessing similarity performance in Web-scraped Resume Dataset and analysis

#### 4.1 Get Similarity Matrix and create a Resumes dataframe with similarities

In [456]:
# this is for kaggle dataset 
similarities = cosine_similarity(X, Y)

In [458]:
sim_df = pd.DataFrame(similarities, columns = dir_list)
sim_df.head()

Unnamed: 0,ACCOUNTANT,ADVOCATE,AGRICULTURE,APPAREL,ARTS,AVIATION,BANKING,BUSINESS-DEVELOPMENT,CHEF,CONSTRUCTION,...,DIGITAL-MEDIA,ENGINEERING,FINANCE,FITNESS,HEALTHCARE,HR,INFORMATION-TECHNOLOGY,PUBLIC-RELATIONS,SALES,TEACHER
0,0.367191,0.15909,0.126796,0.077386,0.044751,0.075339,0.080641,0.124356,0.097543,0.057657,...,0.03849,0.11342,0.358637,0.052826,0.054857,0.143595,0.138978,0.098829,0.101056,0.121479
1,0.332098,0.107284,0.087919,0.065034,0.035684,0.038768,0.115646,0.20086,0.086771,0.048215,...,0.033048,0.080275,0.366193,0.035228,0.033874,0.043389,0.106536,0.076702,0.179613,0.073416
2,0.152652,0.099753,0.028095,0.035142,0.02287,0.023178,0.169406,0.130659,0.02737,0.024641,...,0.018776,0.071713,0.135129,0.032164,0.065515,0.04138,0.104702,0.07023,0.168573,0.052803
3,0.32302,0.102498,0.027728,0.061147,0.045103,0.094261,0.141295,0.13533,0.051006,0.068489,...,0.027973,0.059533,0.324054,0.053802,0.018356,0.079859,0.104002,0.08297,0.159596,0.131249
4,0.257616,0.095625,0.061164,0.031256,0.030981,0.032045,0.106382,0.084383,0.022516,0.039428,...,0.041765,0.036468,0.216486,0.027416,0.02823,0.041927,0.10106,0.086502,0.071646,0.07935


In [460]:
result_df = pd.merge(resumes_df, sim_df, left_index=True, right_index=True)

In [462]:
result_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2426 entries, 0 to 2425
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ResumeText              2426 non-null   object 
 1   cleanText               2426 non-null   object 
 2   category                2426 non-null   object 
 3   Gender_Male             2426 non-null   float64
 4   ACCOUNTANT              2426 non-null   float64
 5   ADVOCATE                2426 non-null   float64
 6   AGRICULTURE             2426 non-null   float64
 7   APPAREL                 2426 non-null   float64
 8   ARTS                    2426 non-null   float64
 9   AVIATION                2426 non-null   float64
 10  BANKING                 2426 non-null   float64
 11  BUSINESS-DEVELOPMENT    2426 non-null   float64
 12  CHEF                    2426 non-null   float64
 13  CONSTRUCTION            2426 non-null   float64
 14  CONSULTANT              2426 non-null   

In [464]:
result_df = result_df.drop(columns = ['ResumeText', 'cleanText'])
result_df.head()

Unnamed: 0,category,Gender_Male,ACCOUNTANT,ADVOCATE,AGRICULTURE,APPAREL,ARTS,AVIATION,BANKING,BUSINESS-DEVELOPMENT,...,DIGITAL-MEDIA,ENGINEERING,FINANCE,FITNESS,HEALTHCARE,HR,INFORMATION-TECHNOLOGY,PUBLIC-RELATIONS,SALES,TEACHER
0,ACCOUNTANT,1.0,0.367191,0.15909,0.126796,0.077386,0.044751,0.075339,0.080641,0.124356,...,0.03849,0.11342,0.358637,0.052826,0.054857,0.143595,0.138978,0.098829,0.101056,0.121479
1,ACCOUNTANT,1.0,0.332098,0.107284,0.087919,0.065034,0.035684,0.038768,0.115646,0.20086,...,0.033048,0.080275,0.366193,0.035228,0.033874,0.043389,0.106536,0.076702,0.179613,0.073416
2,ACCOUNTANT,0.0,0.152652,0.099753,0.028095,0.035142,0.02287,0.023178,0.169406,0.130659,...,0.018776,0.071713,0.135129,0.032164,0.065515,0.04138,0.104702,0.07023,0.168573,0.052803
3,ACCOUNTANT,0.0,0.32302,0.102498,0.027728,0.061147,0.045103,0.094261,0.141295,0.13533,...,0.027973,0.059533,0.324054,0.053802,0.018356,0.079859,0.104002,0.08297,0.159596,0.131249
4,ACCOUNTANT,0.0,0.257616,0.095625,0.061164,0.031256,0.030981,0.032045,0.106382,0.084383,...,0.041765,0.036468,0.216486,0.027416,0.02823,0.041927,0.10106,0.086502,0.071646,0.07935


#### 4.2 Evaluating cosine-similarity performance using a classifier 

We classify the resumes into categories to check how well the similarities agree with the job categories. I used two classifiers, RandomForest and KNN, to verify the accuracy.
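
For comparison, a simpler baseline than training a classifier is to assign each resume to the job description it is most similar to. The sketch below is not what the notebook does; it only illustrates the idea. The dictionary holds a few of the visible similarity values of the first resume from section 4.1 (the hidden columns are omitted), and `predict_category` is a hypothetical helper.

```python
# A subset of the similarities of the first resume (row 0 of the similarity
# table); columns hidden in the printed output are not included.
row0 = {
    "ACCOUNTANT": 0.367191,
    "ADVOCATE": 0.15909,
    "BANKING": 0.080641,
    "FINANCE": 0.358637,
    "SALES": 0.101056,
}

def predict_category(similarities):
    """Return the category whose job description is most similar."""
    return max(similarities, key=similarities.get)

print(predict_category(row0))
```

Among these columns the closest category is ACCOUNTANT, with FINANCE a near second, which hints at why closely related categories are hard to separate.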

In [177]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(result_df,random_state=42, test_size=0.3, shuffle=True)

In [179]:
X_train = df_train.drop(columns = ['category'])
Y_train = df_train['category']

X_test = df_test.drop(columns = ['category'])
Y_test = df_test['category']

##### 4.2.1 Using RandomForest Classifier

In [181]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

clf2 = RandomForestClassifier()
clf2.fit(X_train, Y_train)
Y_test_pred2 = clf2.predict(X_test)
Y_train_pred2 = clf2.predict(X_train)
print(classification_report(Y_test, Y_test_pred2))
print(classification_report(Y_train, Y_train_pred2))

                        precision    recall  f1-score   support

            ACCOUNTANT       0.61      0.70      0.65        33
              ADVOCATE       0.28      0.32      0.30        41
           AGRICULTURE       0.57      0.16      0.25        25
               APPAREL       0.33      0.17      0.23        29
                  ARTS       0.45      0.14      0.21        37
              AVIATION       0.40      0.42      0.41        40
               BANKING       0.54      0.59      0.57        32
  BUSINESS-DEVELOPMENT       0.47      0.57      0.51        37
                  CHEF       0.87      0.76      0.81        34
          CONSTRUCTION       0.67      0.87      0.75        30
            CONSULTANT       0.15      0.06      0.09        33
              DESIGNER       0.87      0.73      0.79        37
         DIGITAL-MEDIA       0.47      0.72      0.57        25
           ENGINEERING       0.65      0.56      0.60        39
               FINANCE       0.42      

##### Note: The RandomForest classifier reaches 53% accuracy on the test data, whereas its accuracy on the train data is 100%. This gap indicates that the model overfits the training set.

##### 4.2.2 Using KNN classifier

In [183]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
clf.fit(X_train, Y_train)
Y_test_pred = clf.predict(X_test)
Y_train_pred = clf.predict(X_train)

In [185]:
print("Classification report for test data")
print(classification_report(Y_test, Y_test_pred))

Classification report for test data
                        precision    recall  f1-score   support

            ACCOUNTANT       0.54      0.76      0.63        33
              ADVOCATE       0.26      0.46      0.33        41
           AGRICULTURE       0.13      0.12      0.12        25
               APPAREL       0.19      0.28      0.22        29
                  ARTS       0.30      0.16      0.21        37
              AVIATION       0.31      0.40      0.35        40
               BANKING       0.34      0.38      0.36        32
  BUSINESS-DEVELOPMENT       0.51      0.59      0.55        37
                  CHEF       0.84      0.79      0.82        34
          CONSTRUCTION       0.79      0.73      0.76        30
            CONSULTANT       0.08      0.03      0.04        33
              DESIGNER       0.88      0.57      0.69        37
         DIGITAL-MEDIA       0.58      0.60      0.59        25
           ENGINEERING       0.52      0.44      0.47        39
   

In [187]:
print("Classification report for train data")
print(classification_report(Y_train, Y_train_pred))

Classification report for train data
                        precision    recall  f1-score   support

            ACCOUNTANT       0.61      0.89      0.72        85
              ADVOCATE       0.35      0.64      0.45        77
           AGRICULTURE       0.34      0.55      0.42        38
               APPAREL       0.51      0.63      0.56        68
                  ARTS       0.35      0.29      0.31        66
              AVIATION       0.67      0.73      0.70        77
               BANKING       0.63      0.51      0.56        83
  BUSINESS-DEVELOPMENT       0.58      0.66      0.62        83
                  CHEF       0.83      0.75      0.79        84
          CONSTRUCTION       0.82      0.82      0.82        82
            CONSULTANT       0.44      0.27      0.33        82
              DESIGNER       0.85      0.64      0.73        70
         DIGITAL-MEDIA       0.67      0.63      0.65        71
           ENGINEERING       0.67      0.67      0.67        79
  

##### Note: The KNN classifier reaches 46% accuracy on the test data and 63% on the train data.

##### Hence I am going ahead with the RandomForest classifier, since it has the better test accuracy of the two. Its 100% train accuracy is a sign of overfitting rather than of better training.

#### 4.3 - Finding the gender bias in classification

In [190]:
from aif360.datasets import StandardDataset
from aif360.metrics import BinaryLabelDatasetMetric

In [192]:
privileged_groups = [{'Gender_Male': 1}]
unprivileged_groups = [{'Gender_Male': 0}]

In [228]:
dataset_train_ACCOUNTANT = StandardDataset(df_train, 
                          label_name='category', 
                          favorable_classes=['ACCOUNTANT'], 
                          protected_attribute_names=['Gender_Male'], 
                          privileged_classes=[[1]])

metric_train_ACCOUNTANT = BinaryLabelDatasetMetric(dataset_train_ACCOUNTANT, 
                                             unprivileged_groups=unprivileged_groups,
                                             privileged_groups=privileged_groups)

print("Bias metrics for ACCOUNTANT Class ")
print("SPD ", metric_train_ACCOUNTANT.statistical_parity_difference())
print("DI ", metric_train_ACCOUNTANT.disparate_impact())

Bias metrics for ACCOUNTANT Class 
SPD  -0.01025381950612897
DI  0.8145396124108847


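##### Note: SPD < 0 and DI < 1 again indicate a slight imbalance in favour of the privileged group (gender = Male). DI = 0.81 only just clears the 0.8 threshold of the 80% rule, so by that rule the data is not considered discriminatory in the hiring process. Because the gender values in this dataset were assigned at random, the gap most likely reflects sampling noise rather than real hiring bias.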

#### 4.4 - Improving Model and Bias Performance

##### 4.4.1 Using Re-weighing pre-processing technique

Here we are using 'ACCOUNTANT' as the target class.

In [232]:
from aif360.algorithms.preprocessing import Reweighing

In [234]:
dataset_train_mitigation = StandardDataset(df_train, 
                          label_name='category', 
                          favorable_classes=['ACCOUNTANT'], 
                          protected_attribute_names=['Gender_Male'], 
                          privileged_classes=[[1]])

privileged_groups = [{'Gender_Male': 1}]
unprivileged_groups = [{'Gender_Male': 0}]

RW = Reweighing(unprivileged_groups=unprivileged_groups,
                privileged_groups=privileged_groups)
dataset_transf_train = RW.fit_transform(dataset_train_mitigation)

tranf_train_df = dataset_transf_train.convert_to_dataframe()[0]

In [236]:
tranf_train_df['category'].value_counts()

category
0.0    1613
1.0      85
Name: count, dtype: int64
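
Reweighing does not change features or labels; it assigns every (group, label) combination the weight "expected count under independence divided by observed count". The sketch below computes these weights in plain Python on hypothetical toy data. It is an illustration of the idea, not aif360's implementation, and the helper names are made up for this example.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Weight per (group, label) pair: expected count / observed count."""
    n = len(labels)
    n_group = Counter(groups)
    n_label = Counter(labels)
    n_joint = Counter(zip(groups, labels))
    return {
        (g, y): (n_group[g] * n_label[y]) / (n * n_joint[(g, y)])
        for (g, y) in n_joint
    }

def weighted_rate(groups, labels, weights, group):
    """Weighted share of favorable labels (y == 1) within one group."""
    pairs = [(g, y) for g, y in zip(groups, labels) if g == group]
    total = sum(weights[p] for p in pairs)
    favorable = sum(weights[p] for p in pairs if p[1] == 1)
    return favorable / total

# Hypothetical toy data: group 1 = privileged, label 1 = favorable.
groups = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

weights = reweighing_weights(groups, labels)
print(weights)
print(weighted_rate(groups, labels, weights, 1),
      weighted_rate(groups, labels, weights, 0))
```

After weighting, both groups have the same favorable rate, which is what drives the weighted SPD to 0 and DI to 1. In the notebook the corresponding values are stored in `dataset_transf_train.instance_weights`; for them to influence the model they would have to be passed to the classifier, for example through the `sample_weight` argument of `fit`.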

##### 4.4.2 Remodeling after Re-weighing the train dataset

In [498]:
X_train_mitigation = tranf_train_df.drop( columns = ['category'])
Y_train_mitigation = tranf_train_df['category']

df_test2 = df_test.copy()
df_test2.loc[df_test2['category'] == 'ACCOUNTANT', 'category2'] = 1
df_test2.loc[df_test2['category'] != 'ACCOUNTANT', 'category2'] = 0

X_test_mitigation = df_test2.drop( columns = ['category', 'category2'])
Y_test_mitigation = df_test2['category2']

clf_mitigation = RandomForestClassifier(random_state = 42)
clf_mitigation = clf_mitigation.fit(X_train_mitigation, Y_train_mitigation)
Y_pred_mitigation = clf_mitigation.predict(X_test_mitigation)

print(classification_report(Y_test_mitigation, Y_pred_mitigation, digits=4))

              precision    recall  f1-score   support

         0.0     0.9719    0.9957    0.9837       695
         1.0     0.8125    0.3939    0.5306        33

    accuracy                         0.9684       728
   macro avg     0.8922    0.6948    0.7571       728
weighted avg     0.9647    0.9684    0.9631       728



In [499]:
Y_train_pred_mitigation = clf_mitigation.predict(X_train_mitigation)
print(classification_report(Y_train_mitigation, Y_train_pred_mitigation, digits=4))

              precision    recall  f1-score   support

         0.0     1.0000    1.0000    1.0000      1613
         1.0     1.0000    1.0000    1.0000        85

    accuracy                         1.0000      1698
   macro avg     1.0000    1.0000    1.0000      1698
weighted avg     1.0000    1.0000    1.0000      1698



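##### Note: Accuracy on the test data is now 97%, but this figure is not comparable with the earlier 53%: the task changed from 22-category classification to a binary ACCOUNTANT-vs-rest problem, in which 695 of the 728 test resumes belong to the negative class. Recall for the ACCOUNTANT class is only 39%. In addition, the instance weights produced by Reweighing are not passed to the classifier here (`convert_to_dataframe()[0]` returns only features and labels), so the change cannot be attributed to re-weighing.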
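
To put the 97% figure in context, the arithmetic below compares it with a majority-class baseline that labels every resume as "not ACCOUNTANT". The confusion-matrix counts are an inference from the rounded precision, recall, and support values in the test report above; the notebook does not print them.

```python
# Counts that reproduce the printed test report (inferred, not notebook output).
tn, fp, fn, tp = 692, 3, 20, 13

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
majority_baseline = (tn + fp) / total  # always predict "not ACCOUNTANT"
recall_accountant = tp / (tp + fn)
precision_accountant = tp / (tp + fp)

print(f"accuracy             {accuracy:.4f}")
print(f"majority baseline    {majority_baseline:.4f}")
print(f"ACCOUNTANT recall    {recall_accountant:.4f}")
print(f"ACCOUNTANT precision {precision_accountant:.4f}")
```

The classifier beats the trivial baseline by a little over one percentage point, so for an imbalanced target like this the per-class recall and F1 are more informative than accuracy.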

##### 4.4.3 Calculating Bias metrics after Re-weighing

In [268]:
dataset_test_after_mitigation = StandardDataset(df_test2, 
                          label_name='category', 
                          favorable_classes=['ACCOUNTANT'], 
                          protected_attribute_names=['Gender_Male'], 
                          privileged_classes=[[1]])

print("Bias Metrics for testing dataset after using Reweighing Technique")
bias_Metrics_after_mitigation = BinaryLabelDatasetMetric(dataset_test_after_mitigation, 
                                             unprivileged_groups=unprivileged_groups,
                                             privileged_groups=privileged_groups)

print("Bias metrics for ACCOUNTANT Class after re-weighing")
print("SPD ", metric_train_ACCOUNTANT.statistical_parity_difference())
print("DI ", metric_train_ACCOUNTANT.disparate_impact())

Bias Metrics for testing dataset after using Reweighing Technique
Bias metrics for ACCOUNTANT Class after re-weighing
SPD  -0.01025381950612897
DI  0.8145396124108847


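##### Note: The bias metrics are identical to the values before mitigation because the two print statements reuse `metric_train_ACCOUNTANT`, the metric computed on the training data before re-weighing, instead of `bias_Metrics_after_mitigation`. Re-weighing changes only the instance weights, so its effect would show up in a metric computed on `dataset_transf_train`.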

### Task 5 - Observations

1. Resume screening accuracy on the original dataset is 67% on the test split (6 resumes).
2. Resume screening accuracy on the Kaggle web-scraped data is 53% for the 22-category task. The 97% accuracy obtained in section 4.4.2 is for the binary ACCOUNTANT-vs-rest task, so the two numbers are not directly comparable.
3. Bias: there is a slight bias towards the "Male" gender (SPD < 0, DI < 1), but since DI passes the 80% rule in both datasets, I do not consider this resume screening system to be biased.

### Task 6 - What Can be Better?

1. I wanted to analyse the "Original Dataset" after applying the re-weighing technique, but with fewer than 20 samples the re-weighed data produced "NaN" errors.
2. I wanted to extract each entity of the resumes (name, gender, email, work experience, etc.) and train the model on them, but I could not find an extraction tool that does this accurately. Hence I opted to use the entire text.
3. If we could extract features such as age, gender, and ethnicity from the text, there would be more bias-related features to analyse.