# Fairness - Keywords and topics extraction
This notebook takes a list of DOIs from a txt file and processes them as follows:

1. Extracts the DOIs from the txt file
2. Downloads the papers from the ACM Digital Library
3. Extracts the keywords from the PDF text
4. Extracts the topics from the journal
5. Consolidates the data

In [30]:
import getpass
import requests
import sys
import time
from glob import glob
from random import random
sys.path.append("../src")
from paper_reader import Paper
from utils.utils import extract_doi_from_str
import mac_changer.macchanger as mc

## Read the txt file containing the DOIs in the following format:

    https://doi.org/{doi} Title

In [12]:
data_folder_path = "./data/"
dois_file_name = "PapersFairness.txt"
dois_file_rel_path = data_folder_path + dois_file_name
with open(dois_file_rel_path, 'r') as dois_file:
    lines = dois_file.readlines()

## Get DOIs list

In [31]:
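doi_list = []
for idx, line in enumerate(lines):
    line = line.strip()
    # Skip blank lines; indexing an empty string would raise an IndexError.
    if not line:
        continue
    # Skip commented-out lines before trying to extract a DOI from them.
    if line.startswith("#"):
        print(f"Line {idx} - skipped because it's commented.")
        continue
    doi = extract_doi_from_str(line)
    if doi:
        doi_list.append(doi[0])
    else:
        print(f"Line {idx} - DOI not found.")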
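The helper `extract_doi_from_str` is imported from `utils.utils` and is not shown in this notebook. The following is only a minimal sketch of how such a helper might work, assuming it returns a list of matches (the cell above indexes the result with `doi[0]`). The names `DOI_PATTERN`, `extract_doi_sketch`, and `parse_doi_lines`, the regular expression, and the sample titles are all hypothetical and are not the project's actual implementation.

```python
import re

# Assumption: a DOI starts with "10.", a 4-9 digit registrant code, a slash,
# and then runs until the next whitespace, quote, or angle bracket.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def extract_doi_sketch(text):
    """Return every DOI-like substring found in text (empty list if none)."""
    return DOI_PATTERN.findall(text)


def parse_doi_lines(lines):
    """Collect the first DOI of each non-empty, non-commented line."""
    dois = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        found = extract_doi_sketch(line)
        if found:
            dois.append(found[0])
    return dois


# Placeholder lines in the "https://doi.org/{doi} Title" format.
sample_lines = [
    "https://doi.org/10.1145/3593013.3594039 Some title",
    "# https://doi.org/10.1145/3442188.3445876 Commented-out title",
    "",
    "a line without a DOI",
]
parsed = parse_doi_lines(sample_lines)
print(parsed)
```

Because the pattern stops at the first whitespace, the title that follows the URL on each line is never captured as part of the DOI.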

In [14]:
doi_list

['10.1145/3593013.3594039',
 '10.1145/3593013.3594048',
 '10.1145/3442188.3445876',
 '10.1145/3531146.3533226',
 '10.1145/3287560.3287600',
 '10.1145/3442188.3445901',
 '10.1145/3287560.3287586',
 '10.1145/3531146.3533115',
 '10.1145/3531146.3533074',
 '10.1145/3593013.3594007',
 '10.1145/3287560.3287586',
 '10.1145/3351095.3373155',
 '10.1145/3514094.3534137',
 '10.1145/3593013.3594004',
 '10.1145/3442188.3445927',
 '10.1145/3351095.3372878',
 '10.1145/3593013.3594057',
 '10.1145/3287560.3287567',
 '10.1145/3531146.3534635',
 '10.1145/3442188.3445902',
 '10.1145/3593013.3594116',
 '10.1145/3593013.3594037',
 '10.1145/3531146.3534643',
 '10.1145/3351095.3372851',
 '10.1145/3593013.3594045',
 '10.1145/3593013.3594028',
 '10.1145/3531146.3534645',
 '10.1145/3593013.3594008',
 '10.1145/3442188.3445865',
 '10.1145/3593013.3594097',
 '10.1145/3593013.3594106',
 '10.1145/3531146.3533180',
 '10.1145/3531146.3533160',
 '10.1145/3593013.3594075',
 '10.1145/3531146.3533197',
 '10.1145/3531146.35
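The printed list contains `10.1145/3287560.3287586` twice. Since `paper_list` is built once before the download loop, a repeated DOI that is not yet on disk could be requested more than once. A minimal sketch of order-preserving de-duplication follows; the helper name `dedupe_keep_order` is hypothetical and is not part of the notebook.

```python
def dedupe_keep_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict preserves insertion order (Python 3.7+), so this keeps the
    # original ordering of the DOIs.
    return list(dict.fromkeys(items))


# Three entries taken from the list above, including the repeated DOI.
sample_dois = [
    "10.1145/3287560.3287586",
    "10.1145/3531146.3533115",
    "10.1145/3287560.3287586",
]
unique_dois = dedupe_keep_order(sample_dois)
print(unique_dois)
```

Applying the same call to `doi_list` before the download step would be one way to avoid duplicate requests.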

## Download the papers

In [35]:
data_folder = "data/"
paper_list = [path.split('/')[-1] for path in glob(data_folder + "*pdf")]

In [23]:
password = getpass.getpass()

 ·······


In [36]:
# my ip: 172.28.15.94
# dl-support@acm.org
interface = "en0"
max_retries = 5
print(">> Current IP: ", mc.get_current_ip(interface))
k = 0
for doi in doi_list:
    if k and (k % 15 == 0):
        print(f">> Changing mac address, {k} downloads reached.")
        mc.change_mac_random(interface=interface)
        for _ in range(max_retries):
            try:
                new_IP = mc.get_current_ip(interface)
                print(f">> New IP set: {new_IP}")
                break
            except Exception:
                # Wait 10s for the connection to be re-established
                time.sleep(10)
        else:
            # Reached only if no attempt succeeded; stop downloading.
            print(f">> Failed to obtain IP after {max_retries} retries for interface {interface}.")
            break
            
    print(f"Paper ------- {doi}")
    file_name = f"{doi.replace('.', '_').replace('/', '-')}.pdf"
#     check if file already exists.
    if file_name not in paper_list:
        url = f"https://dl.acm.org/doi/pdf/{doi}"
        res = requests.get(url, timeout=60)
        
        if res.ok:
            with open(data_folder + file_name, "wb") as pdf_file:
                pdf_file.write(res.content)
            print(">> Ok")
        else:
            print(">> Failed to download the PDF. Status code:", res.status_code)
        time.sleep(10 + 10 * random())
        k += 1        
    else:
        print(f">> Paper {file_name} already downloaded.")

>> Current IP:  192.168.143.9
Paper ------- 10.1145/3593013.3594039
>> Paper 10_1145-3593013_3594039.pdf already downloaded.
Paper ------- 10.1145/3593013.3594048
>> Paper 10_1145-3593013_3594048.pdf already downloaded.
Paper ------- 10.1145/3442188.3445876
>> Paper 10_1145-3442188_3445876.pdf already downloaded.
Paper ------- 10.1145/3531146.3533226
>> Paper 10_1145-3531146_3533226.pdf already downloaded.
Paper ------- 10.1145/3287560.3287600
>> Paper 10_1145-3287560_3287600.pdf already downloaded.
Paper ------- 10.1145/3442188.3445901
>> Paper 10_1145-3442188_3445901.pdf already downloaded.
Paper ------- 10.1145/3287560.3287586
>> Paper 10_1145-3287560_3287586.pdf already downloaded.
Paper ------- 10.1145/3531146.3533115
>> Paper 10_1145-3531146_3533115.pdf already downloaded.
Paper ------- 10.1145/3531146.3533074
>> Paper 10_1145-3531146_3533074.pdf already downloaded.
Paper ------- 10.1145/3593013.3594007
>> Paper 10_1145-3593013_3594007.pdf already downloaded.
Paper ------- 10.114
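The download cell derives each file name from the DOI by replacing `.` with `_` and `/` with `-`. The sketch below wraps that same mapping in a function and adds a best-effort inverse. Both function names are hypothetical, and the inverse assumes the DOI suffix itself contains no `_` or `-`, which holds for the ACM DOIs listed here but not for DOIs in general.

```python
def doi_to_file_name(doi):
    """Same mapping as the download loop: '.' -> '_', '/' -> '-'."""
    return f"{doi.replace('.', '_').replace('/', '-')}.pdf"


def file_name_to_doi(file_name):
    """Best-effort inverse; lossy if the DOI contained '_' or '-'."""
    stem = file_name[: -len(".pdf")] if file_name.endswith(".pdf") else file_name
    return stem.replace("-", "/").replace("_", ".")


doi = "10.1145/3593013.3594039"
name = doi_to_file_name(doi)
print(name)
print(file_name_to_doi(name))
```

Keeping the mapping in one place makes it easier to match downloaded PDFs back to the DOIs they came from.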

In [48]:
reader = Reader()

NameError: name 'Reader' is not defined

In [3]:
reader.extract_info("3442188.3445864.pdf")

In [4]:
reader.get_metadata()

[{'file_name': '3442188.3445864.pdf',
  'doi': '10.1145/3442188.3445864',
  'title': 'Price Discrimination with Fairness Constraints',
  'authors': 'Cohen, Maxime C.;Elmachtoub, Adam N.;Lei, Xiao',
  'journal': 'Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency',
  'publisher': 'ACM',
  'year': '2021',
  'keywords': 'fairness;price discrimination;personalization;social welfare',
  'topics': 'Price Discrimination with Fairness Constraints;Applied computing;Law, social and behavioral sciences;Economics;Law;Information systems;World Wide Web;Web searching and information discovery;Personalization;Social and professional topics;Professional topics;Management of computing and information systems;Implementation management;Pricing and resource allocation'}]
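In the metadata above, `keywords` and `topics` are stored as single strings with entries separated by `;`. As a rough sketch of the consolidation step, the code below splits such a field and counts keyword frequencies over a list of records shaped like this output. The helper names `split_field` and `count_keywords` are hypothetical, and the second record's empty `keywords` value is an assumption about how a missing field might appear.

```python
from collections import Counter


def split_field(value):
    """Split a ';'-separated field into a list of clean entries."""
    # Anything that is not a string (None, NaN, ...) is treated as missing.
    if not isinstance(value, str):
        return []
    return [part.strip() for part in value.split(";") if part.strip()]


def count_keywords(records):
    """Count how often each lower-cased keyword occurs across records."""
    counts = Counter()
    for record in records:
        counts.update(k.lower() for k in split_field(record.get("keywords")))
    return counts


records = [
    {
        "doi": "10.1145/3442188.3445864",
        "keywords": "fairness;price discrimination;personalization;social welfare",
    },
    # Assumed representation of a paper for which no keywords were extracted.
    {"doi": "10.1145/3531146.3533226", "keywords": ""},
]
keyword_counts = count_keywords(records)
print(keyword_counts.most_common())
```

The same `split_field` helper could be applied to the `topics` field, since it uses the same separator.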

In [5]:
reader.get_metadata(format='dataframe')

Unnamed: 0,file_name,doi,title,authors,journal,publisher,year,keywords,topics
0,3442188.3445864.pdf,10.1145/3442188.3445864,Price Discrimination with Fairness Constraints,"Cohen, Maxime C.;Elmachtoub, Adam N.;Lei, Xiao",Proceedings of the 2021 ACM Conference on Fair...,ACM,2021,fairness;price discrimination;personalization;...,Price Discrimination with Fairness Constraints...


In [5]:
reader.get_data()

[{'doi': '10.1145/3442188.3445864',
  'text': 'price discrimination with fairness constraints\nmaxime c cohen\ndesautels faculty of management\nmcgill university\nmontreal quebec canada\nmaximecohenmcgillca\nadam n elmachtoub\ndepartment of industrial engineering\nand operations research  data\nscience institute columbia university\nnew york new york usa\nadamieorcolumbiaedu\nxiao lei\ndepartment of industrial engineering\nand operations research columbia\nuniversity\nnew york new york usa\nxl2625columbiaedu\nabstract\nprice discrimination  offering different prices to different cus\ntomers  has become common practice while it allows sellers\nto increase their profits it also raises several concerns in terms\nof fairness this topic has received extensive attention from me\ndia industry and regulatory agencies in this paper we consider\nthe problem of setting prices for different groups under fairness\nconstraints\nin this paper we propose a formal framework for pricing with\nfairness i

In [3]:
reader.extract_info_from_bulk()

--------- Extracting --------- 3442188.3445864.pdf
--------- Extracting --------- 3351095.3372828.pdf
--------- Extracting --------- 3086567.3086571.pdf
--------- Extracting --------- 3531146.3533226.pdf
--------- Extracting --------- NIPS-2017-avoiding-discrimination-through-causal-reasoning-Paper.pdf
--------- Extracting --------- 3442188.3445912.pdf
--------- Extracting --------- uj_44430+SOURCE1+SOURCE1.1.pdf
--------- Extracting --------- 261474.pdf
--------- Extracting --------- 3442188.3445876.pdf
--------- Extracting --------- mitchell-et-al-2021-algorithmic-fairness-choices-assumptions-and-definitions.pdf
--------- Extracting --------- 3494672.pdf
--------- Extracting --------- 3018896.3025169.pdf
--------- Extracting --------- 3351095.3372839.pdf
--------- Extracting --------- 2303.16972.pdf
--------- Extracting --------- 2204.06438.pdf
--------- Extracting --------- 3442188.3445901.pdf
--------- Extracting --------- 3457607.pdf
--------- Extracting --------- s41060-017-0058-

In [4]:
reader.get_metadata(format='dataframe')

Unnamed: 0,file_name,doi,title,authors,journal,publisher,year,keywords,topics
0,3442188.3445864.pdf,10.1145/3442188.3445864,Price Discrimination with Fairness Constraints,"Cohen, Maxime C.;Elmachtoub, Adam N.;Lei, Xiao",Proceedings of the 2021 ACM Conference on Fair...,ACM,2021.0,fairness;price discrimination;personalization;...,Price Discrimination with Fairness Constraints...
1,3351095.3372828.pdf,10.1145/3351095.3372828,Mitigating bias in algorithmic hiring,"Raghavan, Manish;Barocas, Solon;Kleinberg, Jon...",Proceedings of the 2020 Conference on Fairness...,ACM,2020.0,algorithmic hiring;discrimination law;algorith...,Mitigating bias in algorithmic hiring: evaluat...
2,3086567.3086571.pdf,10.1145/3086567.3086571,Cloud IaaS for Mass Spectrometry and Proteomics,"Judson, Brenden;McGrath, Garret;Peuchen, Eliza...",Proceedings of the 8th Workshop on Scientific ...,ACM,2017.0,cloud;iaas;distributed computing;data transit;...,Cloud IaaS for Mass Spectrometry and Proteomic...
3,3531146.3533226.pdf,10.1145/3531146.3533226,Demographic-Reliant Algorithmic Fairness: Char...,"Andrus, McKane;Villeneuve, Sarah","2022 ACM Conference on Fairness, Accountabilit...",ACM,2022.0,,
4,NIPS-2017-avoiding-discrimination-through-caus...,,,,,,,,
5,3442188.3445912.pdf,10.1145/3442188.3445912,Bridging Machine Learning and Mechanism Design...,"Finocchiaro, Jessie;Maio, Rol;;Monachou, Faidr...",Proceedings of the 2021 ACM Conference on Fair...,ACM,2021.0,,
6,uj_44430+SOURCE1+SOURCE1.1.pdf,10.1007/978-3-030-85447-8_24,A Systematic Review of Fairness in Artificial ...,"Xivuri, Khensani;Twinomurinzi, Hossana",Lecture Notes in Computer Science,Springer International Publishing,2021.0,ai;machine learning;algorithms;fairness;bias;e...,
7,261474.pdf,10.1177/2053951718756684,Understanding perception of algorithmic decisi...,"Lee, Min Kyung",Big Data & Society,SAGE Publications,2018.0,fairness;transparency;explanations;design;meth...,
8,3442188.3445876.pdf,10.1145/3442188.3445876,Group Fairness,"Räz, Tim",Proceedings of the 2021 ACM Conference on Fair...,ACM,2021.0,,Group Fairness: Independence Revisited;Applied...
9,mitchell-et-al-2021-algorithmic-fairness-choic...,,,,,,,algorithmic fairness;predictive modeling;stati...,


In [6]:
reader.get_metadata()

[{'file_name': '3442188.3445864.pdf',
  'doi': '10.1145/3442188.3445864',
  'title': 'Price Discrimination with Fairness Constraints',
  'authors': 'Cohen, Maxime C.;Elmachtoub, Adam N.;Lei, Xiao',
  'journal': 'Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency',
  'publisher': 'ACM',
  'year': '2021',
  'keywords': 'fairness;price discrimination;personalization;social welfare',
  'topics': 'Price Discrimination with Fairness Constraints;Applied computing;Law, social and behavioral sciences;Economics;Law;Information systems;World Wide Web;Web searching and information discovery;Personalization;Social and professional topics;Professional topics;Management of computing and information systems;Implementation management;Pricing and resource allocation'},
 {'file_name': '3351095.3372828.pdf',
  'doi': '10.1145/3351095.3372828',
  'title': 'Mitigating bias in algorithmic hiring',
  'authors': 'Raghavan, Manish;Barocas, Solon;Kleinberg, Jon;Levy, Karen',
  '

In [6]:
reader.get_data(raw=True)

[{'doi': '10.1145/3351095.3372828',
  'text': 'Mitigating Bias in Algorithmic Hiring: Evaluating Claims and\nPractices\nManish Raghavan\nCornell University\nSolon Barocas\nMicrosoft Research and Cornell University\nJon Kleinberg\nCornell University\nKaren Levy\nCornell University\nABSTRACT\nThere has been rapidly growing interest in the use of algorithms\nin hiring, especially as a means to address or mitigate bias. Yet, to\ndate, little is known about how these methods are used in practice.\nHow are algorithmic assessments built, validated, and examined for\nbias? In this work, we document and analyze the claims and prac-\ntices of companies offering algorithms for employment assessment.\nIn particular, we identify vendors of algorithmic pre-employment\nassessments (i.e., algorithms to screen candidates), document what\nthey have disclosed about their development and validation proce-\ndures, and evaluate their practices, focusing particularly on efforts\nto detect and mitigate bias. 