Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [1]:
NAME = "Maria Baba"
STUDENT_ID = "14201089"

---

# Analyzing Gender Distribution Among Scientific Authors in Computational Social Science

*Objective*: Understand the gender distribution of authors across different scientific disciplines using web scraping and API-based gender identification.

Gender diversity in research is crucial for ensuring diverse perspectives and approaches in scientific inquiry, and for the comprehensiveness and richness of research findings. A balanced gender representation can help challenge systemic biases that might otherwise marginalize or overlook significant areas of study. A diverse research community can also act as a role model, inspiring future generations of all genders to pursue scientific endeavors.

This assignment focuses on the question of the gender distribution of researchers in different disciplines, and on identifying how often women are the first or last author of publications. 

To do so, you will scrape a preprint website, and you will use the API genderize.io to identify the gender of the author based on their name.

1. Prepare: Identify a source and decide a scraping strategy

2. Scrape the list of articles and authors

3. Use API to identify gender 

4. Analyze gender distribution and authorship order

5. Reflect on your findings. 

6. Scrape the paper abstracts

### Setup and requirements
First make sure that you have the needed libraries for Python correctly installed.

In [2]:
# Selenium
# !pip install selenium
# !pip install webdriver-manager
# !pip install webdriver-manager --upgrade
# !pip install packaging
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By

# driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# driver.get("https://www.google.com")

In [3]:
# Request
#%pip install requests
import requests

In [4]:
# Beautifulsoup
#%pip install beautifulsoup4
from bs4 import BeautifulSoup

In [5]:
import pandas as pd
import numpy as np

## 1. Plan and strategize

We first need to decide which site to scrape and our strategy for doing so. We will focus on a preprint repository. Preprint repositories host and disseminate research papers before they are peer-reviewed and published in academic journals. They therefore give a view of the latest research.

There are several repositories that represent different scientific disciplines (e.g., PubMed for life sciences, arXiv for physics and computer science, JSTOR for humanities and social sciences, SocArxiv for social science, etc.) 

We will here focus on arxiv.org, where many Computational Social Scientists publish, often under the category "Computers and Society".

You need to pick a page on arXiv where you can get a representative sample of these research papers, and which you are allowed to scrape.

1. Browse arxiv.org and select a page on the website where you can find a sample of research papers.
2. Check the robots.txt. Are you allowed to scrape the page you selected? (If not, you will have to choose another one!)
3. Decide a strategy for scraping the page as quickly and easily as possible to find the names of the authors for each paper, their titles, and a link to the pages.
4. Choose which Python libraries for scraping that you will use.
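
The robots.txt check in step 2 can also be done programmatically. The following is a minimal sketch using the standard-library `urllib.robotparser`. The robots.txt content, the `example.org` URLs, and the helper name `is_allowed` are made up for illustration; they are not arXiv's actual rules, so the real file still needs to be read before scraping.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
# This is NOT arXiv's actual file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.org/list/recent"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.org/private/data"))
```

For a live site, `RobotFileParser` can instead be pointed at the file with `set_url(...)` followed by `read()`.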

### Question 1: Which library is most suitable?

Given the structure of the website, which Python library for scraping do you think is most appropriate to use? Motivate your choice in a few sentences.

_[Student answers here.]_ The appropriate library for this website is requests, since the website is static and the information is easy to extract.

[Evaluation: This is an open question. Any motivation that makes sense is fine, but in general, requests makes more sense for this page than selenium, since the site in question is not dynamic. Using selenium will be slower and more difficult.]

## 2. Scrape the list of articles and authors 

Implement your scraping strategy. Scrape the page and collect the information about the publication. 

- You will need to get (1) the link to the article, (2) the title of the article, (3) the names of all authors of the paper, in the same order as they appear on the paper. 
- You need to scrape 200 research papers.

- Note that you may need to iterate over multiple pages.
- Note that you need to handle possible exceptions and that your code needs to be able to restart if it crashes.
- Your final result should be a list of dicts, with keys 'title', 'url', and 'authors'. 'authors' should consist of a list where the authors are listed in the order that they were on the paper. 
- You need to clean and validate your data: check that all papers have authors, that all papers have titles, clean the texts to remove empty spaces and similar, etc.
- Store the resulting array persistently as a pickle with the name 'scraping_result.pkl'.

For instance: [{'title': 'How to use Large Language Models for Text Analysis', 'authors': ['Törnberg, Petter'], 'url':'https://arxiv.org/abs/2307.13106' } ...]
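
The cleaning and validation step listed above could look roughly like the sketch below. The helper names `clean_record` and `is_valid` are hypothetical, and the sample record simply follows the dict shape described above, with a line break inside the title of the kind that scraped text often contains.

```python
import re

def clean_record(record: dict) -> dict:
    """Return a copy of record with whitespace normalised in every text field."""
    def squash(text: str) -> str:
        # Collapse line breaks and repeated spaces into single spaces
        return re.sub(r"\s+", " ", text).strip()

    return {
        "title": squash(record.get("title", "")),
        "url": record.get("url", "").strip(),
        "authors": [squash(a) for a in record.get("authors", []) if a and a.strip()],
    }

def is_valid(record: dict) -> bool:
    """A cleaned record is valid if it has a title, a URL, and at least one author."""
    return bool(record["title"]) and bool(record["url"]) and len(record["authors"]) > 0

raw = {
    "title": "Not Just a Matter of Time: Field Differences and the Shaping of\n  "
             "Electronic Media in Supporting Scientific Communication",
    "authors": ["Rob Kling", "Geoffrey McKim"],
    "url": "http://arxiv.org/abs/cs/9909008v2",
}
cleaned = clean_record(raw)
print(cleaned["title"])
print(is_valid(cleaned))
```

Records that fail `is_valid` can be dropped or logged for inspection before pickling the final list.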


In [6]:
#scrape 200 articles from computers and society category

import requests
base_url= 'http://export.arxiv.org/api/query?'
search_query= 'cat:Computers_and_Society'
parameters="start=0&max_results=20"

url= f'{base_url}search_query={search_query}&{parameters}'

response= requests.get(url)
if response.status_code == 200:
    print("Here is the result from the API:")
    print(response.text)
    json_string = response.text
else:
    print("Error")

Here is the result from the API:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3Dcat%3AComputers_and_Society%26id_list%3D%26start%3D0%26max_results%3D20" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=cat:Computers_and_Society&amp;id_list=&amp;start=0&amp;max_results=20</title>
  <id>http://arxiv.org/api/f8gw0sxDLFbZJmxFiqU6UaX9yoA</id>
  <updated>2023-12-08T00:00:00-05:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">20</opensearch:itemsPerPage>
</feed>



In [7]:
import urllib.request as ur
import feedparser
import pickle

base_url= 'http://export.arxiv.org/api/query?'
search_query= 'search_query=cat:cs.CY&start=0&max_results=200'

url= ur.urlopen(base_url+search_query)
response= url.read()
values= feedparser.parse(response)


data_list=[]
for entry in values.entries:
    title=entry.title
    link= entry.id
    n= entry.authors
    names= [d.get('name') for d in n]
    data= {'title':title, 'authors':names, 'url':link }
    data_list.append(data)
print(data_list)
    
# The with-statement closes the file automatically
with open('scraping_result.pkl', 'wb') as f:
    pickle.dump(data_list, f)

[{'title': 'Influencing Software Usage', 'authors': ['Lorrie Faith Cranor', 'Rebecca N. Wright'], 'url': 'http://arxiv.org/abs/cs/9809018v1'}, {'title': 'The Impact of Net Culture on Mainstream Societies: a Global Analysis', 'authors': ['Tapas Kumar Das'], 'url': 'http://arxiv.org/abs/cs/9903013v1'}, {'title': 'Not Just a Matter of Time: Field Differences and the Shaping of\n  Electronic Media in Supporting Scientific Communication', 'authors': ['Rob Kling', 'Geoffrey McKim'], 'url': 'http://arxiv.org/abs/cs/9909008v2'}, {'title': 'Agents of Choice: Tools that Facilitate Notice and Choice about Web Site\n  Data Practices', 'authors': ['Lorrie Faith Cranor'], 'url': 'http://arxiv.org/abs/cs/0001011v1'}, {'title': 'Scientific Collaboratories as Socio-Technical Interaction Networks: A\n  Theoretical Approach', 'authors': ['Rob Kling', 'Geoffrey McKim', 'Joanna Fortuna', 'Adam King'], 'url': 'http://arxiv.org/abs/cs/0005007v1'}, {'title': "What's Fit To Print: The Effect Of Ownership Conce

In [8]:
# Check if keys exists in dictionary
assert 'title' in data_list[0], "Key 'title' not found in dictionary"
assert 'authors' in data_list[0], "Key 'authors' not found in dictionary"
assert 'url' in data_list[0], "Key 'url' not found in dictionary"

## 3. Use Genderize.io to identify author gender

The next step is to identify the gender of the authors. To do so, we will use the free API genderize.io. 

1. Go to https://genderize.io/ and read the API documentation.
2. Do you need to register to use it? Do you need an API key? 
3. How do you call the API? What parameters do you need to send? 
4. What rate limiting is used? How long do you need to wait between calls?

You will use what you learned to carry out the following tasks.


#### Task 1: _identify_gender()_
Write a function _identify_gender(first_name)_ that takes a name, and uses the API to guess the gender. The function should send a request to genderize.io, and parse the resulting json to a dict. The function should return a dict with the data provided by the API.

In [9]:
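import requests

def identify_gender(first_name):
    """Query genderize.io for first_name and return the parsed JSON response as a dict."""
    base = 'https://api.genderize.io'

    # Passing the name via params lets requests URL-encode it (e.g., accented names)
    result = requests.get(base, params={'name': first_name}, timeout=30)

    return result.json()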

# Test
print(identify_gender("Sasha"))

{'count': 14512, 'name': 'Sasha', 'gender': 'female', 'probability': 0.52}


#### Task 2: Identify gender of all authors

Your task is now to use your new function to identify the genders of all authors that you previously scraped. 

To do so, you first need to extract the first name of each author. You need to iterate over these names and use your function to identify the gender of the author.

Your result should be a dataframe with the following columns:

- article_url | author_full_name | first_name | author_order | estimated_gender | gender_probability

Author_order should be a number specifying where the author was in the author list for the publication (e.g., 0 = first author, 1 = second author...) _Estimated_gender_ should contain the API response on gender, and _gender_probability_ the certainty of the gender, according to the API.

Note:
- You will need to transform your dict to the dataframe shown above, with one author per line. (This means that each URL will be associated to multiple author names.)
- Make sure that you respect the rate limiting of the API. 
- Make sure that you handle exceptions and that your function continues 
- Note that you get a maximum of 1,000 free calls per day, so you need to make sure that you do not waste your API calls!
- The API may not have all names stored. For these names, store a _np.nan_ value as the gender.

Pickle the resulting dataframe with the name: 'author_gender.df.pkl'
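
Since the daily quota is limited, one way to avoid wasted calls is to look up each distinct first name only once. The sketch below illustrates that idea with one row per author. The function name `build_author_rows` and the `fake_lookup` stand-in are hypothetical; in real use, `identify_gender` from Task 1 would be passed as `lookup`. It assumes the API returns a missing gender for unknown names, and it numbers authors from 0 as in the task description.

```python
import math
import time

def build_author_rows(papers, lookup, delay=1.0):
    """Return one dict per author, calling lookup only once per distinct first name."""
    cache = {}
    rows = []
    for paper in papers:
        for order, full_name in enumerate(paper["authors"]):
            parts = full_name.split()
            first_name = parts[0] if parts else None
            if first_name is not None and first_name not in cache:
                try:
                    cache[first_name] = lookup(first_name)
                except Exception:
                    cache[first_name] = {}  # failed call: recorded as unknown
                time.sleep(delay)  # pause between API calls
            result = cache.get(first_name, {})
            gender = result.get("gender")
            rows.append({
                "article_url": paper["url"],
                "author_full_name": full_name,
                "first_name": first_name,
                "author_order": order,  # 0 = first author
                "estimated_gender": gender if gender else math.nan,
                "gender_probability": result.get("probability", math.nan) if gender else math.nan,
            })
    return rows

calls = []  # records which names were actually looked up

def fake_lookup(name):
    # Stand-in for identify_gender(): returns canned data instead of calling the API
    calls.append(name)
    table = {
        "Rob": {"gender": "male", "probability": 1.0},
        "Geoffrey": {"gender": "male", "probability": 1.0},
        "Joanna": {"gender": "female", "probability": 0.99},
        "Adam": {"gender": "male", "probability": 1.0},
    }
    return table.get(name, {"gender": None, "probability": 0.0})

papers = [
    {"url": "http://arxiv.org/abs/cs/9909008v2", "authors": ["Rob Kling", "Geoffrey McKim"]},
    {"url": "http://arxiv.org/abs/cs/0005007v1",
     "authors": ["Rob Kling", "Geoffrey McKim", "Joanna Fortuna", "Adam King"]},
]
rows = build_author_rows(papers, fake_lookup, delay=0)
print(len(rows), "rows from", len(calls), "lookups")
```

The resulting list can be turned into the required dataframe with `pd.DataFrame(rows)`. Note that a failed lookup is cached as unknown here and is not retried.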

In [10]:

import pickle
import pandas as pd
import requests
import numpy as np
import time

# Load your scraped data
with open('scraping_result.pkl', 'rb') as f:
    data_list = pickle.load(f)

#extract author name and article url from data
authors= [x['authors'] for x in data_list]
art_url= [x['url'] for x in data_list]

# Build a list with each author's position on their paper (1 = first author),
# using enumerate to get the order
order_authors=[]
for outer_index, inner_list in enumerate(authors, start=1):
    for inner_index, item in enumerate(inner_list, start=1):
        order_authors.append(inner_index)

#create dataframe with author names and url
df1= pd.DataFrame({'article_url':art_url, 'author_full_name': authors})
df2= df1.explode('author_full_name', ignore_index=True)


names= df2['author_full_name'].tolist()

# Extract the first name of each author (np.nan if the name is empty)
f_name = [x.split()[0] if len(x.split()) >= 1 else np.nan for x in names]


# Apply the gender function and extract the gender with its probability
gender = [identify_gender(x) for x in f_name]

estimated_g = [x['gender'] for x in gender]
# A probability of 0.0 means the API did not recognise the name
prob = [x['probability'] if x['probability'] > 0.0 else np.nan for x in gender]

#create final dataframe and store it in a pickle file
df= df2.assign(first_name= f_name, author_order= order_authors, estimated_gender= estimated_g, gender_probability= prob)


df.to_pickle('author_gender.df.pkl')
        

In [11]:
assert 'article_url' in df.columns, "article_url column is missing"
assert 'author_full_name' in df.columns, "author_full_name column is missing"
assert 'first_name' in df.columns, "first_name column is missing"
assert 'author_order' in df.columns, "author_order column is missing"
assert 'estimated_gender' in df.columns, "estimated_gender column is missing"
assert 'gender_probability' in df.columns, "gender_probability column is missing"

with open('author_gender.df.pkl', 'wb') as f:
    pickle.dump(df, f)

display(df.head(10))

Unnamed: 0,article_url,author_full_name,first_name,author_order,estimated_gender,gender_probability
0,http://arxiv.org/abs/cs/9809018v1,Lorrie Faith Cranor,Lorrie,1,female,1.0
1,http://arxiv.org/abs/cs/9809018v1,Rebecca N. Wright,Rebecca,2,female,1.0
2,http://arxiv.org/abs/cs/9903013v1,Tapas Kumar Das,Tapas,1,male,0.98
3,http://arxiv.org/abs/cs/9909008v2,Rob Kling,Rob,1,male,1.0
4,http://arxiv.org/abs/cs/9909008v2,Geoffrey McKim,Geoffrey,2,male,1.0
5,http://arxiv.org/abs/cs/0001011v1,Lorrie Faith Cranor,Lorrie,1,female,1.0
6,http://arxiv.org/abs/cs/0005007v1,Rob Kling,Rob,1,male,1.0
7,http://arxiv.org/abs/cs/0005007v1,Geoffrey McKim,Geoffrey,2,male,1.0
8,http://arxiv.org/abs/cs/0005007v1,Joanna Fortuna,Joanna,3,female,0.99
9,http://arxiv.org/abs/cs/0005007v1,Adam King,Adam,4,male,1.0


## 4. Analyze gender distribution and authorship order

Now that you have gathered the necessary data, you will use this data to answer some research questions about gender equality in CSS research. Note that in calculating this, you need to handle that the API may have failed to identify the gender of some authors.

1. What fraction of the authors included are women? 
2. What happens to this fraction if you only include authors for which the gender_probability is higher than 80%? 
3. Being the first or single author on a research paper is an important status signal among researchers: it often means that you did most of the work. What fraction of the first or single authors are women? 
4. Being the _last_ author on a research paper with _three or more authors_ is also an important status signal: it tends to mean that you were the one to acquire funding or lead the lab. What fraction of the last authors on papers with three or more authors are women?


In [3]:
import pandas as pd
gender_data= pd.read_pickle("author_gender.df.pkl")
display(gender_data)


Unnamed: 0,article_url,author_full_name,first_name,author_order,estimated_gender,gender_probability
0,http://arxiv.org/abs/cs/9809018v1,Lorrie Faith Cranor,Lorrie,1,female,1.00
1,http://arxiv.org/abs/cs/9809018v1,Rebecca N. Wright,Rebecca,2,female,1.00
2,http://arxiv.org/abs/cs/9903013v1,Tapas Kumar Das,Tapas,1,male,0.98
3,http://arxiv.org/abs/cs/9909008v2,Rob Kling,Rob,1,male,1.00
4,http://arxiv.org/abs/cs/9909008v2,Geoffrey McKim,Geoffrey,2,male,1.00
...,...,...,...,...,...,...
370,http://arxiv.org/abs/1004.1224v2,Somayeh Fatahi,Somayeh,1,female,0.99
371,http://arxiv.org/abs/1004.1224v2,Nasser Ghasem-Aghaee,Nasser,2,male,0.99
372,http://arxiv.org/abs/1004.1793v1,S. K. Nayak,S.,1,female,0.50
373,http://arxiv.org/abs/1004.1793v1,S. B. Thorat,S.,2,female,0.50


In [4]:
# Fraction of women authors
genders= gender_data['estimated_gender']
female_nr=gender_data.value_counts(subset=genders)['female']
total= len(gender_data)
fraction= (female_nr)/(total)
print(f'Fraction of female authors is: {fraction}')

Fraction of female authors is: 0.27466666666666667


In [5]:
#fraction of females with probability over 80

females_80 = gender_data[(gender_data['gender_probability'] > 0.80) & (gender_data['estimated_gender']=='female')]
fem= len(females_80)
total= len(gender_data)
fraction= (fem)/(total)
print(f'Fraction of female authors when probability is over 80%: {fraction}')

Fraction of female authors when probability is over 80%: 0.21866666666666668


In [6]:
#fraction of females who are first in the article
females_order = gender_data[(gender_data['author_order'] ==1) &(gender_data['estimated_gender']=='female')]
fem_order= len(females_order)
total= len(gender_data)
fraction= (fem_order)/(total)
print(f'Fraction of female authors who are first in the articles {fraction}')


Fraction of female authors who are first in the articles 0.14933333333333335


In [7]:
#fraction of last author being female
females_order = gender_data[(gender_data['author_order'] >=3) &(gender_data['estimated_gender']=='female')]
fem_order= len(females_order)
total= len(gender_data)
fraction= (fem_order)/(total)
print(f'Fraction of female authors who are last in the article: {fraction}')

Fraction of female authors who are last in the article: 0.048
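
The cells above divide every count by the total number of author rows, and the last cell selects all authors in position three or later rather than only the final author. A possible alternative, sketched below, restricts each denominator to the relevant subgroup (authors with an identified gender, first authors, or last authors on papers with three or more authors). The helper names are hypothetical and the sample rows are a subset of the rows displayed earlier, so the printed values describe this small sample only, not the full dataset.

```python
from collections import defaultdict

def fraction_female(rows):
    """Share of 'female' among rows with an identified gender; None if there are none."""
    known = [r for r in rows if r["estimated_gender"] in ("female", "male")]
    if not known:
        return None
    return sum(r["estimated_gender"] == "female" for r in known) / len(known)

def group_by_paper(rows):
    """Group author rows by article URL."""
    papers = defaultdict(list)
    for r in rows:
        papers[r["article_url"]].append(r)
    return papers

def first_authors(rows):
    """The author with the lowest author_order on each paper."""
    return [min(g, key=lambda r: r["author_order"]) for g in group_by_paper(rows).values()]

def last_authors(rows, min_authors=3):
    """The author with the highest author_order on each paper with >= min_authors authors."""
    return [max(g, key=lambda r: r["author_order"])
            for g in group_by_paper(rows).values() if len(g) >= min_authors]

# (article id, author_order, estimated_gender, gender_probability)
sample = [
    ("cs/9809018v1", 1, "female", 1.00), ("cs/9809018v1", 2, "female", 1.00),
    ("cs/9903013v1", 1, "male", 0.98),
    ("cs/9909008v2", 1, "male", 1.00), ("cs/9909008v2", 2, "male", 1.00),
    ("cs/0001011v1", 1, "female", 1.00),
    ("cs/0005007v1", 1, "male", 1.00), ("cs/0005007v1", 2, "male", 1.00),
    ("cs/0005007v1", 3, "female", 0.99), ("cs/0005007v1", 4, "male", 1.00),
    ("1004.1793v1", 1, "female", 0.50), ("1004.1793v1", 2, "female", 0.50),
]
rows = [dict(article_url=u, author_order=o, estimated_gender=g, gender_probability=p)
        for u, o, g, p in sample]
confident = [r for r in rows if r["gender_probability"] > 0.8]

print(fraction_female(rows))                 # 0.5
print(fraction_female(confident))            # 0.4
print(fraction_female(first_authors(rows)))  # 0.5
print(fraction_female(last_authors(rows)))   # 0.0
```

With the dataframe loaded above, the input rows could be obtained via `gender_data.to_dict('records')`.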


## 5. Reflect on your findings

You have now carried out your analysis of the gender distribution in articles in CSS using scraped data. Reflect on your findings and method, and answer each of the following questions in a few sentences.

1. What are the implications of the observed gender distribution and author order in CSS? How do these distributions compare with your expectations?
2. How accurate do you think your findings are? What are the limitations of determining gender based solely on names? Are there cultural or regional nuances that the API might miss?
3. Reflect on the ethical considerations involved in scraping this data. 


YOUR ANSWER HERE: Women represent a very small percentage of authors. This matches my expectations, as I am aware of the gender biases in the research and scientific world and of the struggle women face to publish their research and establish their legitimacy. I think these findings are somewhat accurate; however, it is difficult to draw firm conclusions from only 200 scraped articles. 

One limitation of determining gender from names is that some names are gender-neutral, and some people have names that are considered more common for the opposite gender. One ethical consideration is that we are using the authors' names without their consent.

## 6. Scrape the paper abstract

Your next task is to get the abstract for each paper. You will use these abstracts in a later exercise in the course, where we will use text analysis to examine whether the content of research papers is a function of the gender of the author. 

To do so, you need to iterate over the papers that you have already identified, and scrape the abstract from the URL listed. 

#### Task 1: scrape_abstract()
Write a function scrape_abstract(url) that goes to the research paper URL, and scrapes the content of the abstract. The function should return the abstract as a string, and nothing else.

In [2]:
import requests
from bs4 import BeautifulSoup

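def scrape_abstract(url):
    """Fetch the paper page at url and return the text of its abstract."""
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, 'html.parser')
    # The abstract is taken from the first <blockquote> element on the page
    description = soup.find('blockquote')

    return description.text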

# Test
url = "https://arxiv.org/abs/2307.13106"
print(scrape_abstract(url))



Abstract:This guide introduces Large Language Models (LLM) as a highly versatile text analysis method within the social sciences. As LLMs are easy-to-use, cheap, fast, and applicable on a broad range of text analysis tasks, ranging from text annotation and classification to sentiment analysis and critical discourse analysis, many scholars believe that LLMs will transform how we do text analysis. This how-to guide is aimed at students and researchers with limited programming experience, and offers a simple introduction to how LLMs can be used for text analysis in your own research project, as well as advice on best practices. We will go through each of the steps of analyzing textual data with LLMs using Python: installing the software, setting up the API, loading the data, developing an analysis prompt, analyzing the text, and validating the results. As an illustrative example, we will use the challenging task of identifying populism in political texts, and show how LLMs move beyond th
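
The output above still carries the "Abstract:" label and leading blank lines. If only the abstract text itself is wanted for later text analysis, a small post-processing step could remove them. This is a sketch with a hypothetical helper name, `clean_abstract`, and the sample string is adapted from the output above, with a line break added for illustration.

```python
import re

def clean_abstract(raw: str) -> str:
    """Collapse runs of whitespace and drop a leading 'Abstract:' label."""
    text = re.sub(r"\s+", " ", raw).strip()
    return re.sub(r"^Abstract:\s*", "", text)

sample = ("\nAbstract:This guide introduces Large Language Models (LLM)\n"
          "  as a highly versatile text analysis method within the social sciences.")
print(clean_abstract(sample))
```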

#### Task 2: Scrape all urls

You will now use your function to scrape all the URLs that you collected in step 2.

The following will provide instructions for how you can go about this task. However, there are several ways to do this, and you are free to choose your preferred method.

Prepare your data:

1. Load your list of dicts from step 2 (scraping_result.pkl)
2. Use it to create a dataframe. 
3. Add a column 'scraped' which should be False for all rows, and a column 'abstract' that should be None for all rows.
4. Store the dataframe persistently (e.g., by pickling it.)

The scraping procedure:

1. Load the persistent pickle as a dataframe (so that if your computer crashes, the function will continue where you were)
2. Repeat the following steps until there are no more rows with scraped == False:
3. Fetch a random row with scraped == False
4. Go to the URL and scrape the abstract.
5. Set abstract column in the dataframe to the resulting abstract, set scraped to True.
6. Store the dataframe persistently as a pickle. 

Remember: 
- You may use another strategy. However, since you will be scraping many pages, you should expect your scraper to encounter problems along the way. You therefore need to make sure that you regularly store the results persistently.
- Make sure to handle any exceptions gracefully.
- Be respectful toward the website owners: wait at least one second between each call. 

Your final result should be a dataframe stored as 'scraped_abstracts.df.pkl', with filled 'abstract' and 'scraped' columns.
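
The procedure described above can be sketched as follows. To keep the example self-contained, it stores the state as a pickled list of dicts rather than a dataframe, and it runs against a fake scraper and placeholder `example.org` URLs. The names `scrape_all`, `fake_scrape`, and the `max_attempts` cap are assumptions added for illustration; in real use, `scrape_abstract` would be passed as `scrape` and `delay` would be at least one second.

```python
import os
import pickle
import random
import tempfile
import time

def scrape_all(records, scrape, checkpoint_path, delay=1.0, max_attempts=3):
    """Scrape every pending record, saving progress to checkpoint_path after each try."""
    # Resume from the checkpoint if one exists; otherwise start fresh
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = [dict(r, scraped=False, abstract=None, attempts=0) for r in records]

    while True:
        pending = [r for r in state
                   if not r["scraped"] and r["attempts"] < max_attempts]
        if not pending:
            break
        row = random.choice(pending)  # fetch a random unscraped row
        row["attempts"] += 1
        try:
            row["abstract"] = scrape(row["url"])
            row["scraped"] = True
        except Exception as exc:
            print(f"Attempt {row['attempts']} failed for {row['url']}: {exc}")
        # Store the state persistently after every attempt
        with open(checkpoint_path, "wb") as f:
            pickle.dump(state, f)
        time.sleep(delay)  # be respectful toward the website
    return state

# Demo with a fake scraper that fails once for one URL
remaining_failures = {"https://example.org/b": 1}

def fake_scrape(url):
    if remaining_failures.get(url, 0) > 0:
        remaining_failures[url] -= 1
        raise RuntimeError("simulated network error")
    return f"abstract of {url}"

demo_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
demo_records = [{"url": "https://example.org/a"}, {"url": "https://example.org/b"}]
result = scrape_all(demo_records, fake_scrape, demo_path, delay=0)
print([r["scraped"] for r in result])
```

Rows that keep failing are left with `scraped == False` after `max_attempts` tries, so they can be inspected afterwards. The final state can be converted with `pd.DataFrame(result)` and pickled under the required file name.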

<!-- [Evaluation: ]
- Load dataframe as df
- Check that the len of df = len of the result list from question 2. 
- Check that each line has an abstract, with len() > 100 e.g.
 -->

In [4]:
import pickle
import pandas as pd
import requests
import numpy as np
from urllib.parse import urlparse, urlunparse
import time

# Load your scraped data
with open('scraping_result.pkl', 'rb') as f:
    data_list = pickle.load(f)
df = pd.DataFrame(data_list)
df.to_pickle('persistent_df.pkl')

# Scrape the abstract for every URL with scrape_abstract()
abstracts = [scrape_abstract(x) for x in df['url']]

# Add the abstracts to the dataframe as a new column
df_abstracts = df.assign(Abstract=abstracts)

#If abstract is in column, write True for Scraped column
df_abstracts['Scraped']= df_abstracts['Abstract'].notna()
display(df_abstracts)

#store in pickle
df_abstracts.to_pickle('scraped_abstracts.df.pkl')


Unnamed: 0,title,authors,url,Abstract,Scraped
0,Influencing Software Usage,"[Lorrie Faith Cranor, Rebecca N. Wright]",http://arxiv.org/abs/cs/9809018v1,\nAbstract: Technology designers often strive...,True
1,The Impact of Net Culture on Mainstream Societ...,[Tapas Kumar Das],http://arxiv.org/abs/cs/9903013v1,\nAbstract: In this work the impact of the In...,True
2,Not Just a Matter of Time: Field Differences a...,"[Rob Kling, Geoffrey McKim]",http://arxiv.org/abs/cs/9909008v2,\nAbstract: The shift towards the use of elec...,True
3,Agents of Choice: Tools that Facilitate Notice...,[Lorrie Faith Cranor],http://arxiv.org/abs/cs/0001011v1,\nAbstract: A variety of tools have been intr...,True
4,Scientific Collaboratories as Socio-Technical ...,"[Rob Kling, Geoffrey McKim, Joanna Fortuna, Ad...",http://arxiv.org/abs/cs/0005007v1,\nAbstract: Collaboratories refer to laborato...,True
...,...,...,...,...,...
195,Profile Popularity in a Business-oriented Onli...,[Thorsten Strufe],http://arxiv.org/abs/1003.0466v1,\nAbstract: Analysing Online Social Networks ...,True
196,Graphically E-Learning introduction and its be...,"[A. Daneshmand Malayeri, J. Abdollahi, R. Rezaei]",http://arxiv.org/abs/1003.3094v1,\nAbstract:E-learning with using multimedia an...,True
197,New designing of E-Learning systems with using...,"[Amin Daneshmand Malayeri, Jalal Abdollahi]",http://arxiv.org/abs/1003.3097v1,\nAbstract:One of the most applied learning in...,True
198,Design and Implementation of an Intelligent Ed...,"[Somayeh Fatahi, Nasser Ghasem-Aghaee]",http://arxiv.org/abs/1004.1224v2,\nAbstract:The Personality and emotions are ef...,True
