## Q3: LSE Department of Statistics (35 marks)


For this question, we want to collect some data about LSE Statistics academic staff via web scraping, and find out the distribution of different types of staff, the gender balance, etc.

Do the following:

1. [Data collection via web scraping] Use web scraping with Python modules `requests` and `BeautifulSoup` to extract the information about the academic staff of the Department of Statistics from https://www.lse.ac.uk/Statistics/People (and the linked pages). For simplicity, we only consider:
    > * All the staff under the tab "Academic faculty" 
    > * Guest lecturers/teachers and LSE fellows under the tab "Academic associates - Emeritus Professors, Guest Lecturers and Visiting Staff" 
        * i.e. we DO NOT consider any of the emeritus or visiting staff, as they are less involved and/or only joining for a short period

    For each staff fulfilling the requirements above, do the following:
    
    * (a) Get the URL of the staff's page from "More information". Example:
    <img src="figs/people.png" width="400"/>
    
    For example, for Dr. James Abdey, the link is https://www.lse.ac.uk/Statistics/People/Dr-James-Abdey
        * You may get some shorter version of the path first (like `Statistics/People/Dr-James-Abdey`), but you can construct the full URL from the short version
    
    * (b) With the URLs collected from (a), use web scraping with python modules `requests` and `BeautifulSoup` to scrape the following information and store the result in a `pd.DataFrame` with 3 columns:
        * Name of the staff (e.g. James Abdey, with or without the title prefix)
        * Job title (e.g. Associate Professorial Lecturer)
        * Gender - depending on the page, you may be able to deduce the gender of the staff from the content of "About me"
            * You may want to use `np.nan` to represent those you cannot find the gender automatically
        
        <img src="figs/james.png" width="400"/>

2. [Data exploration] Calculate the descriptive statistics on title and gender. Based on the descriptive statistics _only_, answer the following:
    * Are there any issues with the data collected? If yes, what are they?
    * Is the gender ratio of academic staff close to 50:50 within the department?
    
3. [Data wrangling] Clean and organise the data:

    * (a) Handle any issues of the data collected from part (1) based on what you have discovered in part (2), including:
        * If you have discovered that you did not collect the data correctly in part (1), fix it
            * If you are not able to fix the code in part (1), at the very least try to fix the data manually so that you can get reasonable results (and marks) in part (3) and (4)
        * If there is any missing data, handle them _appropriately_ and explain your rationale
    * (b) Extract the "role" and the "rank" of each staff from the `pd.DataFrame` in (3a):
        * "Role" of the staff (based on https://info.lse.ac.uk/staff/divisions/Human-Resources/The-recruitment-toolkit/Role-profiles): 
            * "teaching" (the job title should have the words like "Lecturer", "Teacher", "Teaching Fellow" or "LSE Fellow"), e.g. Associate Professorial Lecturer
            * "faculty" (the job title should have the _word_ "Professor"), e.g. Assistant Professor
        * "Rank" of the staff: 
            * "non-tenure" (non tenure-track, should contain the words like "Guest" or "Fellow"), e.g. Guest Teacher, LSE fellow
            * "assistant" (entry-level tenure-track, should contain the word "Assistant"), e.g. Assistant Professor
            * "associate" (mid-level tenure-track, should contain the word "Associate"), e.g. Associate Professorial Lecturer
            * "full" (senior-level tenure-track, does not have any of the keywords above), e.g. Professor
        
    and put the information about the role and rank as two additional columns to the data frame in (3a)
    * (c) Store the `pd.DataFrame` from (3b) into a csv file in the `data` folder, with the name of the file to be `lse_statistics_2022mmdd.csv` with `mm` the month and `dd` the day you have collected the data. This file will help us to verify your result
    
4. [Simple analysis] Use the `pd.DataFrame` from part (3b), find out if the gender ratio depends on the roles and/or the rank
    * Based on your results in this part, do you think the Department of Statistics has a good gender balance among its staff?
    
    
Please state the limitations of your answers.

---
### Note

* Data should be extracted as automatically as possible
* If you are not able to do some parts, you can hard code the values and continue to work on the rest of the question 
    * You will lose marks for not being able to solve the corresponding part, but at least you may get some marks from the other parts of the question
    * For example, if you cannot get the URLs from part 1(a), you may manually figure out the URLs of the staff to allow you to continue to work on part 1(b) and the rest of the question - but of course you will lose marks for part 1(a)
    
---
    
### Hints

* When finding the gender from the content of "About me", what pronouns do you expect to see with higher frequency if the staff is male? If the staff is female?
* You may find [`.str.contains()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html) and/or `.str.split()` useful for (3b)
    
---

In [1]:
## your attempt, please add the code cells and markdown cells for your answers. Make sure you:
## * use the right type of cells
## * state clearly which answer is for which part
## * show the output of the code cells

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from re import search

In [3]:
url = 'https://www.lse.ac.uk/Statistics/People'
r = requests.get(url)
soup = BeautifulSoup(r.content,'lxml')

## 1.
[Data collection via web scraping] Use web scraping with Python modules requests and BeautifulSoup to extract the information about the academic staff of the Department of Statistics from https://www.lse.ac.uk/Statistics/People (and the linked pages). For simplicity, we only consider:

* All the staff under the tab "Academic faculty"
* Guest lecturers/teachers and LSE fellows under the tab "Academic associates - Emeritus Professors, Guest Lecturers and Visiting Staff"
    * i.e. we DO NOT consider any of the emeritus or visiting staff, as they are less involved and/or only joining for a short period

For each staff fulfilling the requirements above, do the following:

### (a) Get the URL of the staff's page from "More information". Example:

In [4]:
prof_links = []
bio = soup.find_all('div', attrs={'class': 'accordion__txt'})

for block in bio:
    #the first p tag after the div has the title
    title = block.find_next('p')
    str_title = str(title)
    
    #skip Emeritus and Visiting
    if search('Emeritus', str_title) is not None or search('Visiting', str_title) is not None:
        pass
    
    #only the first two accordions: stop once the 'Research staff' section is reached
    elif block.find_previous('h2').get_text() == 'Research staff':
        break
        
    #hard code this link which appears at the end of a header
    elif title.find_next('a')['href'] == 'http://www.lse.ac.uk/lseseds': 
        pass
    else:
        #the first link after the title is the website
        link = title.find_next('a')['href']
        prof_links.append(link)

print(prof_links)

['/Statistics/People/Dr-James-Abdey', '/Statistics/People/Dr-Marcos-Barreto', '/Statistics/People/Professor-Pauline-Barrieu', '/Statistics/People/Dr-Erik-Baurdoux', '/Statistics/People/Professor-Wicher-Bergsma', '/Statistics/People/Professor-Umut-Cetin', '/Statistics/People/Dr-Yining-Chen', '/Statistics/People/Yunxiao-Chen', '/Statistics/People/Professor-Angelos-Dassios', '/Statistics/People/Dr-Daniela-Escobar', '/Statistics/People/Professor-Piotr-Fryzlewicz', '/Statistics/People/Dr-Sara-Geneletti', '/Statistics/People/Dr-Kostas-Kalogeropoulos', '/Statistics/People/Professor-Kostas-Kardaras', '/Statistics/People/Professor-Jouni-Kuha', '/Statistics/People/Professor-Clifford-Lam', '/Statistics/People/Dr-Joshua-Loftus', '/Statistics/People/Dr-Gelly-Mitrodima', '/Statistics/People/Professor-Irini-Moustaki', '/Statistics/People/Dr-Francesca-Panero', '/Statistics/People/Dr-Xinghao-Qiao', '/Statistics/People/Dr-Chengchun-Shi', '/Statistics/People/Dr-Andreas-Søjmark', '/Statistics/People/Profe

In [5]:
len(prof_links)

37

### (b) With the URLs collected from (a), use web scraping with python modules requests and BeautifulSoup to scrape the following information and store the result in a pd.DataFrame with 3 columns:

Name of the staff (e.g. James Abdey, with or without the title prefix)

Job title (e.g. Associate Professorial Lecturer)

Gender - depending on the page, you may be able to deduce the gender of the staff from the content of "About me"

You may want to use np.nan to represent those you cannot find the gender automatically

In [6]:
#the scraped paths already start with '/', so the base URL has no trailing slash
academic_faculty = ["https://www.lse.ac.uk" + link for link in prof_links]
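An alternative sketch for building the full URLs uses `urllib.parse.urljoin` from the standard library, which resolves a scraped path against the site root whether or not the path starts with a slash. The names `BASE_URL` and `to_full_url` are illustrative and not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.lse.ac.uk/"  # assumed site root

def to_full_url(path, base=BASE_URL):
    """Resolve a scraped href (with or without a leading slash) against the site root."""
    return urljoin(base, path)

# Both short forms mentioned in the question resolve to the same full URL
example_paths = ["/Statistics/People/Dr-James-Abdey", "Statistics/People/Dr-James-Abdey"]
full_urls = [to_full_url(p) for p in example_paths]
print(full_urls)
```

With the notebook's own data this could be applied as `[to_full_url(p) for p in prof_links]`.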

In [7]:
staff_ls = []

for i in range(len(academic_faculty)):
    temp_ls = []
    temp_url = academic_faculty[i]
    temp_r = requests.get(temp_url)
    temp_soup = BeautifulSoup(temp_r.content,'lxml')
    
    #Find their name on their page
    name_tag = temp_soup.find('h1', class_='people__name')
    name = name_tag.contents[1]
    temp_ls.append(str(name).strip()) #get rid of whitespace around the name
    
    #Find their title on their page
    title_tag = temp_soup.find('h2', class_='people__position')
    title = title_tag.contents[0]
    
    #some titles had an extra space at the end
    temp_ls.append(str(title).strip())
    
    #Find their bio on their page
    bio_tag = temp_soup.find('div', class_='people__bio')
    bio_str = str(bio_tag).lower().replace('\n', ' ')
    
    #https://stackabuse.com/python-check-if-string-contains-substring/
    #Find gender pronouns
    if ' his ' in bio_str or ' he ' in bio_str:
        temp_ls.append('M')
    elif ' her ' in bio_str or ' she ' in bio_str:
        temp_ls.append('F')
    else:
        temp_ls.append(np.nan)
        
    staff_ls.append(temp_ls)

In [8]:
staff_df = pd.DataFrame(staff_ls, columns = ['Name', 'Job Title', 'Gender'])
staff_df

|  | Name | Job Title | Gender |
|---|---|---|---|
| 0 | James Abdey | Associate Professorial Lecturer | M |
| 1 | Marcos Barreto | Assistant Professorial Lecturer | M |
| 2 | Pauline Barrieu | Professor and Head of Department | F |
| 3 | Erik Baurdoux | Associate Professor | M |
| 4 | Wicher Bergsma | Professor and Deputy Head of Department (Teach... | M |
| 5 | Umut Cetin | Professor | M |
| 6 | Yining Chen | Associate Professor | M |
| 7 | Yunxiao Chen | Assistant Professor | M |
| 8 | Angelos Dassios | Professor | M |
| 9 | Daniela Escobar | Assistant Professorial Lecturer | F |


## 2. 
[Data exploration] Calculate the descriptive statistics on title and gender. Based on the descriptive statistics only, answer the following:

Are there any issues with the data collected? If yes, what are they?

- Yes. Gender is missing for 3 of the 37 staff, because no gender pronouns were detected in their bios.

Is the gender ratio of academic staff close to 50:50 within the department?

- No. Among the 34 staff whose gender was detected, about 68% are male and 32% are female, so the ratio is closer to 2/3 male and 1/3 female.

In [9]:
staff_df['Gender'].value_counts(normalize=True) 

M    0.676471
F    0.323529
Name: Gender, dtype: float64

In [10]:
staff_df["Job Title"].value_counts()

Professor                                             9
Associate Professor                                   6
Assistant Professor                                   6
Assistant Professorial Lecturer                       4
Guest Teacher                                         2
LSE Fellow                                            2
Guest Lecturer                                        2
Associate Professorial Lecturer                       1
Professor and Head of Department                      1
Professor and Deputy Head of Department (Teaching)    1
Professor of Data Science                             1
Visiting Fellow                                       1
Guest Lecturer and GTA Mentor                         1
Name: Job Title, dtype: int64
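Because `value_counts()` drops missing values by default, the missing genders do not show up in the proportions above. The sketch below, on a small made-up frame (`toy_df` is illustrative, not the scraped data), shows one way to count missing values explicitly and to flag job titles that the question asks to exclude, such as those containing "Visiting" or "Emeritus".

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; these are not the scraped results
toy_df = pd.DataFrame({
    "Job Title": ["Professor", "Visiting Fellow", "Guest Teacher", "LSE Fellow"],
    "Gender": ["M", np.nan, "F", "M"],
})

# Count missing genders explicitly and keep NaN as its own category
n_missing = int(toy_df["Gender"].isna().sum())
gender_counts = toy_df["Gender"].value_counts(dropna=False)

# Flag titles that should have been filtered out during collection
excluded = toy_df[toy_df["Job Title"].str.contains("Visiting|Emeritus", regex=True)]

print(n_missing)
print(gender_counts)
print(excluded)
```

Running the same checks on `staff_df` would surface both the missing genders and any excluded titles that slipped through, such as the single "Visiting Fellow" in the counts above.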

## 3. [Data wrangling] Cleaning and organise the data:

### (a) Handle any issues of the data collected from part (1) based on what you have discovered in part (2), including:

* If you have discovered that you did not collect the data correctly in part (1), fix it
    * If you are not able to fix the code in part (1), at the very least try to fix the data manually so that you can get reasonable results (and marks) in part (3) and (4)
* If there is any missing data, handle them appropriately and explain your rationale

- I have fixed part 1 so the collected data should be correct, except for the staff whose gender could not be detected from their bio. Clifford Lam and Phil Chen have no pronouns in their bios; going by the societal norms associated with their names, I will manually assign them as male. Qiwei Yao's bio contains "he" once, but at the very beginning of a paragraph. To avoid matching "he" inside a longer word such as "there", the code searches for the pronoun with a space before and after it, so it only matches a standalone word. Even though I tried replacing the newline character with a space, the code still does not pick up this occurrence, so I will also set his gender manually. Since every missing gender is being set to male, a single line of code that fills the NaN values is enough.
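A likely reason the paragraph-initial "he" is missed is that the bio is searched as raw HTML, so the pronoun directly follows a tag such as `<p>` rather than a space. A minimal sketch of an alternative, assuming the bio has first been reduced to plain text (for example with BeautifulSoup's `bio_tag.get_text(' ')`), matches pronouns with regex word boundaries and picks the more frequent set, as suggested in the hints. The function name `infer_gender` and the pronoun lists are illustrative choices.

```python
import re
import numpy as np

# Word boundaries (\b) match standalone pronouns, including at the start of a paragraph
MALE = re.compile(r"\b(he|him|his)\b", re.IGNORECASE)
FEMALE = re.compile(r"\b(she|her|hers)\b", re.IGNORECASE)

def infer_gender(bio_text):
    """Return 'M', 'F', or np.nan based on which pronouns appear more often."""
    n_male = len(MALE.findall(bio_text))
    n_female = len(FEMALE.findall(bio_text))
    if n_male > n_female:
        return "M"
    if n_female > n_male:
        return "F"
    return np.nan  # no pronouns, or a tie

print(infer_gender("He joined LSE and continued his research."))
print(infer_gender("Her work covers latent variable models; she also teaches."))
print(infer_gender("There is nothing here to match."))
```

Pronoun counting remains a heuristic: a bio can mention other people, and staff without pronouns still need manual handling.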

In [11]:
#fill the remaining missing genders (np.NaN is no longer available in NumPy 2.0, so use fillna)
staff_df['Gender'] = staff_df['Gender'].fillna('M')

### (b) Extract the "role" and the "rank" of each staff from the pd.DataFrame in (3a):

* "Role" of the staff (based on https://info.lse.ac.uk/staff/divisions/Human-Resources/The-recruitment-toolkit/Role-profiles):
    * "teaching" (the job title should have the words like "Lecturer", "Teacher", "Teaching Fellow" or "LSE Fellow"), e.g. Associate Professorial Lecturer
    * "faculty" (the job title should have the word "Professor"), e.g. Assistant Professor
* "Rank" of the staff:
    * "non-tenure" (non tenure-track, should contain the words like "Guest" or "Fellow"), e.g. Guest Teacher, LSE fellow
    * "assistant" (entry-level tenure-track, should contain the word "Assistant"), e.g. Assistant Professor
    * "associate" (mid-level tenure-track, should contain the word "Associate"), e.g. Associate Professorial Lecturer
    * "full" (senior-level tenure-track, does not have any of the keywords above), e.g. Professor

and put the information about the role and rank as two additional columns to the data frame in (3a)

In [12]:
#roles
lecturer = staff_df['Job Title'].str.contains('Lecturer')
teacher = staff_df['Job Title'].str.contains('Teacher')
role_fellow = staff_df['Job Title'].str.contains('Fellow')
faculty = staff_df['Job Title'].str.contains('Professor')

#ranks
guest = staff_df['Job Title'].str.contains('Guest')
rank_fellow = staff_df['Job Title'].str.contains('Fellow')
assistant = staff_df['Job Title'].str.contains('Assistant')
associate = staff_df['Job Title'].str.contains('Associate')

In [13]:
#combine boolean arrays
teaching = lecturer | teacher | role_fellow
nontenure = guest | rank_fellow

In [14]:
role = []

for i in range(len(staff_df)):
    #teaching keywords are checked first, e.g. "Associate Professorial Lecturer" is teaching
    if teaching[i]:
        role.append('Teaching')
    elif faculty[i]:
        role.append('Faculty')
        
    #this condition should never be met
    else:
        role.append(np.nan)

In [15]:
rank = []
for i in range(len(staff_df)):
    #checked in priority order: non-tenure, then assistant, then associate
    if nontenure[i]:
        rank.append('Non-Tenure')
    elif assistant[i]:
        rank.append('Assistant')
    elif associate[i]:
        rank.append('Associate')
    else:
        rank.append('Full')

In [16]:
staff_df['Role'] = role
staff_df['Rank'] = rank
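The same keyword rules can also be written as two small functions and applied to the title column, which keeps the priority order of the checks in one place. This is a sketch under the rules stated in the question; `classify_role`, `classify_rank` and the toy titles are illustrative names and data.

```python
import pandas as pd

def classify_role(title):
    """Teaching keywords take priority over 'Professor'."""
    if any(word in title for word in ("Lecturer", "Teacher", "Fellow")):
        return "Teaching"
    if "Professor" in title:
        return "Faculty"
    return None  # no keyword matched

def classify_rank(title):
    """Check non-tenure keywords first, then Assistant, then Associate."""
    if "Guest" in title or "Fellow" in title:
        return "Non-Tenure"
    if "Assistant" in title:
        return "Assistant"
    if "Associate" in title:
        return "Associate"
    return "Full"

# Toy titles taken from the job title counts above
titles = pd.Series([
    "Associate Professorial Lecturer",
    "Assistant Professor",
    "Professor and Head of Department",
    "Guest Lecturer and GTA Mentor",
    "LSE Fellow",
])
roles = titles.apply(classify_role).tolist()
ranks = titles.apply(classify_rank).tolist()
print(list(zip(titles, roles, ranks)))
```

On the notebook's data frame this would read `staff_df['Job Title'].apply(classify_role)` and likewise for the rank.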

In [17]:
staff_df

|  | Name | Job Title | Gender | Role | Rank |
|---|---|---|---|---|---|
| 0 | James Abdey | Associate Professorial Lecturer | M | Teaching | Associate |
| 1 | Marcos Barreto | Assistant Professorial Lecturer | M | Teaching | Assistant |
| 2 | Pauline Barrieu | Professor and Head of Department | F | Faculty | Full |
| 3 | Erik Baurdoux | Associate Professor | M | Faculty | Associate |
| 4 | Wicher Bergsma | Professor and Deputy Head of Department (Teach... | M | Faculty | Full |
| 5 | Umut Cetin | Professor | M | Faculty | Full |
| 6 | Yining Chen | Associate Professor | M | Faculty | Associate |
| 7 | Yunxiao Chen | Assistant Professor | M | Faculty | Assistant |
| 8 | Angelos Dassios | Professor | M | Faculty | Full |
| 9 | Daniela Escobar | Assistant Professorial Lecturer | F | Teaching | Assistant |


### (c) Store the pd.DataFrame from (3b) into a csv file in the data folder, with the name of the file to be lse_statistics_2022mmdd.csv with mm the month and dd the day you have collected the data. This file will help us to verify your result

In [18]:
staff_df.to_csv(path_or_buf = 'data/lse_statistics_2022mmdd.csv')
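The question asks for `mm` and `dd` in the file name to be the month and day of collection. A minimal sketch of building such a name from a date object is shown below; the date used is a made-up example, not the actual collection date, and `dated_csv_path` is an illustrative helper.

```python
from datetime import date
from pathlib import Path

def dated_csv_path(collected_on, folder="data"):
    """Build data/lse_statistics_YYYYmmdd.csv for the given collection date."""
    return Path(folder) / f"lse_statistics_{collected_on:%Y%m%d}.csv"

# Hypothetical collection date, for illustration only
out_path = dated_csv_path(date(2022, 1, 31))
print(out_path.name)
```

The result could then be passed to `staff_df.to_csv(out_path, index=False)`, where `index=False` avoids writing the row index as an extra unnamed column.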

## 4. [Simple analysis] Use the pd.DataFrame from part (3b), find out if the gender ratio depends on the roles and/or the rank

Based on your results in this part, do you think the Department of Statistics has a good gender balance among its staff?
- No. There still appears to be gender inequality in the Department of Statistics across both role and rank. Men outnumber women by 19 to 5 in faculty roles, while teaching roles are close to balanced (7 men, 6 women). Men are also much more common at the higher ranks: 6 to 1 at Associate and 9 to 3 at Full.

In [19]:
gender_ratio_rank = pd.pivot_table(staff_df, values = 'Name', index = 'Gender', columns = 'Rank', aggfunc = 'count')
gender_ratio_rank

| Gender / Rank | Assistant | Associate | Full | Non-Tenure |
|---|---|---|---|---|
| F | 4 | 1 | 3 | 3 |
| M | 6 | 6 | 9 | 5 |


In [20]:
gender_ratio_role = pd.pivot_table(staff_df, values = 'Name', index = 'Gender', columns = 'Role', aggfunc = 'count')
gender_ratio_role

| Gender / Role | Faculty | Teaching |
|---|---|---|
| F | 5 | 6 |
| M | 19 | 7 |
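Raw counts are hard to compare across groups of different sizes, so it can help to convert them to the share of women within each role and rank. The sketch below re-enters the counts from the two pivot tables by hand; `female_share` is an illustrative helper, and the counts include the three genders that were filled in manually.

```python
import pandas as pd

# Counts copied from the pivot tables above
by_role = pd.DataFrame(
    {"Faculty": [5, 19], "Teaching": [6, 7]},
    index=pd.Index(["F", "M"], name="Gender"),
)
by_rank = pd.DataFrame(
    {"Assistant": [4, 6], "Associate": [1, 6], "Full": [3, 9], "Non-Tenure": [3, 5]},
    index=pd.Index(["F", "M"], name="Gender"),
)

def female_share(counts):
    """Proportion of women within each column (role or rank)."""
    return counts.loc["F"] / counts.sum(axis=0)

role_share = female_share(by_role)
rank_share = female_share(by_rank)
print(role_share.round(3))
print(rank_share.round(3))
```

With groups this small, differences in these shares are suggestive rather than conclusive.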
