<h1> <center> Data Acquisition: Departmental Staff Information through Webscraping </center> </h1>

We chose to investigate 12 departments to obtain a good variety. The departments were selected on the basis of data availability, webscraping convenience, and subject diversity, and with an eye to how each might affect our hypotheses differently. 
    
The first step was to inspect the HTML of each staff page and group together the websites whose structure was similar enough that the same code, iterated over their links, would produce satisfactory results. By testing websites and adjusting the code until it was general enough to work across links yet specific enough to collect the required data, we arrived at code that works for the departments of Social Policy, Anthropology, Finance, Mathematics, Statistics, Psychological and Behavioural Science, and International Relations. The remaining departments (Management, Sociology, Geography and Environment, Economic History, and Government) each had a different HTML structure for their staff page, so they required tailored webscraping approaches. 
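
A first-pass structural check of this kind might look like the following minimal sketch. It uses only the standard library and counts two markers that the shared scraping code in this notebook relies on (`div.accordion__panel` and `h2.accordion__title`). The class `AccordionProbe`, the function `uses_shared_layout`, and the sample HTML are hypothetical names introduced for illustration; they are not part of the notebook.

```python
from html.parser import HTMLParser


class AccordionProbe(HTMLParser):
    """Counts the structural markers the shared scraper looks for."""

    def __init__(self):
        super().__init__()
        self.panel_count = 0  # <div class="accordion__panel">
        self.title_count = 0  # <h2 class="accordion__title">

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "accordion__panel" in classes:
            self.panel_count += 1
        elif tag == "h2" and "accordion__title" in classes:
            self.title_count += 1


def uses_shared_layout(html):
    """Return True when the page has at least one panel and one title."""
    probe = AccordionProbe()
    probe.feed(html)
    return probe.panel_count > 0 and probe.title_count > 0


# Hypothetical page fragments, for illustration only
sample_page = """
<section class="accordion">
  <h2 class="accordion__title">Academic staff</h2>
  <div class="accordion__panel"><p><strong>Dr Jane Doe</strong></p></div>
</section>
"""
print(uses_shared_layout(sample_page))       # True
print(uses_shared_layout("<h1>Staff</h1>"))  # False
```

Passing this check is necessary rather than sufficient: a page can use the same accordion containers and still format names differently, in which case it still needs a tailored parser.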
    
The dataframe for each department contains each staff member's name, department, label, and title. The label lets us group each individual as research or non-research based, and the title lets us investigate whether the number of doctors vs. professors vs. neither is relevant. 
    
When dealing with the data, we did not include research/PhD students, as there is a separate publications database for these students and our focus is on staff member publications.
    

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

<h2> <i> <center> Departments of Social Policy, Anthropology, Finance, Mathematics, Statistics, Psychological and Behavioural Science, & International Relations </center> </i> </h2> 

<ul> Some issues that we dealt with:
    <li> Many of the websites put unrelated information in the same format as a staff member's name. To deal with that, we examined the websites and compiled a list of words that appear in name-formatted text but indicate that the text is not a name. The code below checks each piece of text and skips it if any of these words appears in it.  </li>
    <li> To get a person's title (such as Professor or Dr), we look at the first word of the text containing the name. We often encountered titles such as Mr and Ms, which we dropped, as well as names with no title. The code therefore checks whether the first word is a title and, if so, moves it to the title column. </li>
    <li> Some name texts also ended with parentheses indicating the staff member's position, which we had to check for and remove. </li>
    </ul>
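
The three checks listed above can be sketched as one small, pure-Python helper. This is a minimal illustration under stated assumptions, not the notebook's implementation: the function name `parse_name_text` and the shortened word lists are hypothetical, an empty string stands in for "no title", and exclusion uses whole-word matching (a variation on the substring test used in the cell below), so that a word such as "the" does not reject a name like "Matthew".

```python
import re

# Hypothetical, shortened lists for illustration only
EXCLUDE_WORDS = {"about", "advice", "the", "and", "institute", "enquiries"}
KEPT_TITLES = {"professor": "Professor", "dr": "Dr"}
DROPPED_TITLES = {"mr", "ms"}


def parse_name_text(text):
    """Return (name, title), or None when the text is not a staff name."""
    text = text.replace("\xa0", " ").strip()
    # Whole-word check against the exclusion list
    words = set(re.findall(r"[a-z&]+", text.lower()))
    if not text or words & EXCLUDE_WORDS:
        return None
    # Drop a trailing "(...)" or a dangling "(" at the end of the text
    text = re.sub(r"\s*\([^)]*\)?\s*$", "", text)
    parts = text.split()
    if not parts:
        return None
    first = parts[0].lower().rstrip(".")
    if first in KEPT_TITLES:
        # Professor / Dr: move the title into its own field
        return " ".join(parts[1:]), KEPT_TITLES[first]
    if first in DROPPED_TITLES:
        # Mr / Ms: drop the title entirely
        return " ".join(parts[1:]), ""
    return " ".join(parts), ""


print(parse_name_text("Professor Jane Doe (Deputy Head)"))
print(parse_name_text("Mr Matthew Anderson"))
print(parse_name_text("Ask me about admissions"))
```

Whole-word matching trades one failure mode for another: it no longer rejects names that merely contain an excluded word, but it also misses inflected forms (for example "fellows" when only "fellow" is listed), so the word list would need to spell those out.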
    

In [2]:
urls = ["https://www.lse.ac.uk/social-policy/people", "https://www.lse.ac.uk/anthropology/people",
        "https://www.lse.ac.uk/finance/people", "https://www.lse.ac.uk/mathematics/people", 
        "https://www.lse.ac.uk/statistics/people", "https://www.lse.ac.uk/PBS/People", 
        "https://www.lse.ac.uk/international-relations/people"]

department_names = ["Social Policy", "Anthropology", "Finance", "Mathematics", "Statistics", 
               "Psychological and Behavioural Science", "International Relations"]
department_dfs = {}

words_to_exclude = ["about", "advice", "the", "and", "&", "for", "institute", "case", "supervisor", "fellow",
                   "manager", "programme", "department", "leave", "hours", "pb", "enquiries", "upervisor"]

for link, department in zip(urls, department_names):
    # each link is paired with the department name at the same position in department_names
    r = requests.get(link)
    soup = BeautifulSoup(r.content, 'html.parser')
    
    # lists to store data
    names = []
    departments = []
    labels = []
    titles = []

    # iterating over each panel
    for panel in soup.find_all("div", class_="accordion__panel"):
        # getting individuals in the panel
        individuals = panel.find_all(["strong", "b"])
        
        # getting the label for the accordion panel
        ancestor_section = panel.find_parents("section", class_="accordion")
        accordion_label_tag = ancestor_section[0].find("h2", class_="accordion__title")
        accordion_label = accordion_label_tag.text.strip()
        
        # dropping all PhD and Research students
        if 'student' in accordion_label.lower():
            continue
        
        # iterating each individual
        for individual in individuals:
            if individual.text.strip():
                name_tag = individual
                name_text = name_tag.text.strip()
                
                # checking for words that indicate text isn't a name
                # (substring match, so each word also matches inside longer words)
                if any(word in name_text.lower() for word in words_to_exclude):
                    continue
                    
                # deleting a dangling "(" or a trailing "(...)" at the end of some names
                last_opening_parenthesis_index = name_text.rfind('(')
                if last_opening_parenthesis_index != -1 and name_text.endswith(('(', ')')):
                    name_text = name_text[:last_opening_parenthesis_index].strip()
            
                # checking if the first word of the name is in the list of titles
                name_words = name_text.split()
                first_word = name_words[0].lower() if name_words else ""
                if first_word in ["mr", "professor", "ms", "dr"]:
                    if first_word in ['professor', 'dr']:
                        title = first_word.capitalize()
                    else:
                        title = ' '
                    name_parts = name_words[1:] if len(name_words) > 1 else []
                    name = " ".join(name_parts)
                else:
                    title = " "
                    name = name_text
            
                # adding data to lists
                names.append(name)
                departments.append(department)
                labels.append(accordion_label)
                titles.append(title)

    # converting to dataframe
    df = pd.DataFrame({"Name": names, "Department": departments, "Label": labels, "Title": titles})
    department_dfs[department] = df

# combining all the dataframes    
similar_structure_df = pd.concat(department_dfs.values(), ignore_index=True)
similar_structure_df

Unnamed: 0,Name,Department,Label,Title
0,Fabio Battaglia,Social Policy,Academic staff,Dr
1,Liam Beiser-McGrath,Social Policy,Academic staff,Dr
2,Thomas Biegert,Social Policy,Academic staff,Dr
3,Tania Burchardt,Social Policy,Academic staff,Dr
4,Leonidas Cheliotis,Social Policy,Academic staff,Dr
...,...,...,...,...
640,Agnes Yu,International Relations,Graduate teaching assistants (GTAs),
641,Arthur Kilgore,International Relations,Guest teachers,
642,Oksana Levkovych,International Relations,Guest teachers,
643,David Rampton,International Relations,Guest teachers,


<h2> <i> <center> Department of  Management </center> </i> </h2> 

In [3]:
#getting the management webpage
r_mgt=requests.get('https://www.lse.ac.uk/management/people-home')
soup_mgt=BeautifulSoup(r_mgt.content,'lxml')

In [4]:
#getting the name, department, label information as a list
mgt=[]

#tab headings (lower-cased) grouped by page structure
label1_wanted1=['academic staff',
               'other academic and research staff']
label1_wanted2=['professional services staff']

for label1 in soup_mgt.find_all('h1'):
    department='Management'
    
    #getting info under tabs: 'academic staff', 'other academic and research staff'
    #these two have similar structures
    if label1.get_text().lower() in label1_wanted1:
        label=label1.get_text()
        shortcut=label1.find_next('div',attrs={'class':'accordionContainer'})
        for label2 in shortcut.find_all('h2',attrs={'class':'accordion__title'}):
            inlabel2=label2.find_next('div',attrs={'class':'accordion__content'})
            for person in inlabel2.find_all('div',attrs={'class':'accordion__txt'}):
                #2 different cases
                if person.find('p').find('strong'):
                    name=person.find('p').find('strong').get_text()
                    mgt.append([name,department,label]) 
                else:
                    name=person.find('p').find_next('a').get_text()
                    mgt.append([name,department,label])
    
    #getting info under tab: 'professional services staff'
    #this one has a different structure
    if label1.get_text().lower() in label1_wanted2:
        label=label1.get_text()
        shortcut=label1.find_next('div',attrs={'class':'accordionContainer'})
        for label2 in shortcut.find_all('h2',attrs={'class':'accordion__title'}):
            inlabel2=label2.find_next('div',attrs={'class':'accordion__content'})
            for person in inlabel2.find_all('div',attrs={'class':'accordion__txt'}):
                #3 different cases
                if person.find('p').find('strong'):
                    name=person.find('p').find('strong').get_text()
                    mgt.append([name,department,label]) 
                elif person.find('p').find('b'):
                    name=person.find('p').find('b').get_text()
                    mgt.append([name,department,label]) 
                elif person.find('p').find('span'):
                    name=person.find('p').find('span').get_text()
                    mgt.append([name,department,label]) 
                else:
                    print('not all included')

In [5]:
#from the extracted name, decide if the person is prof/dr/non

for i in range(len(mgt)):
    namestr=mgt[i][0].lower().split()
    if 'dr' in namestr:
        mgt[i].append('Dr')
    elif 'professor' in namestr:
        mgt[i].append('Professor')
    else:
        mgt[i].append(' ')
    
    #getting rid of title words as whole tokens, only keep the name
    #(whole-token matching leaves names that merely contain 'Dr' or 'Sir' intact)
    name=[word for word in mgt[i][0].replace('\xa0',' ').split() if word not in ('Dr','Professor','Sir')]
    name=" ".join(name)
    mgt[i][0]=name

In [6]:
#convert list to dataframe

management_df = pd.DataFrame(mgt,columns=['Name','Department','Label','Title'])

In [7]:
#some problems with the webscraped information

print('\nSome names called vacancies are accidentally included, as they directly appear on the webpage.')
display(management_df[management_df['Name']=='Vacancy'])
print('\n\nThere is one name missing because of the abnormal structure of the webpage')
display(management_df[management_df['Name']==''])
display(management_df[148:151])


Some names called vacancies are accidentally included, as they directly appear on the webpage.


Unnamed: 0,Name,Department,Label,Title
75,Vacancy,Management,Professional services staff,
82,Vacancy,Management,Professional services staff,
100,Vacancy,Management,Professional services staff,
122,Vacancy,Management,Professional services staff,
133,Vacancy,Management,Professional services staff,




There is one name missing because of the abnormal structure of the webpage


Unnamed: 0,Name,Department,Label,Title
149,,Management,Other academic and research staff,Dr


Unnamed: 0,Name,Department,Label,Title
148,Michele Fioretti,Management,Other academic and research staff,Dr
149,,Management,Other academic and research staff,Dr
150,Dina Rabie,Management,Other academic and research staff,Dr


In [8]:
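#dealing with the problems

#dropping the placeholder 'Vacancy' rows
management_df['Name'] = management_df['Name'].replace('Vacancy', pd.NA)
management_df=management_df.copy().dropna()
#filling in the missing name by hand (the title is already stored in the Title column)
management_df.loc[management_df['Name']=='','Name']='Henry Hang Shen'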

display(management_df[management_df['Name']=='Vacancy'])
display(management_df[management_df['Name']==''])

Unnamed: 0,Name,Department,Label,Title


Unnamed: 0,Name,Department,Label,Title


In [9]:
#final dataframe
management_df

Unnamed: 0,Name,Department,Label,Title
0,Bethania Antunes,Management,Academic staff,Dr
1,Sarah Ashwin,Management,Academic staff,Professor
2,Jonathan E. Booth,Management,Academic staff,Dr
3,Wafaa Elmezraoui,Management,Academic staff,
4,Karin King,Management,Academic staff,Dr
...,...,...,...,...
168,Paul Willman,Management,Other academic and research staff,Professor
169,Mohamed Abouaziza,Management,Other academic and research staff,Dr
170,Anushri Gupta,Management,Other academic and research staff,Dr
171,Philipp Schoenegger,Management,Other academic and research staff,Dr


<h2> <i> <center> Department of  Sociology </center> </i> </h2> 

In [10]:
url = 'https://www.lse.ac.uk/sociology/people'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# creating lists to store data
names = []
departments = []
labels = []
titles = []

# getting all panels
accordion_panels = soup.find_all("div", class_="accordion__panel")

# iterating over each panel
for panel in accordion_panels:
    # getting individuals in the panel
    individuals = panel.find_all(["strong", "b"])
    
    # getting the label for accordion
    ancestor_section = panel.find_parents("section", class_="accordion")
    accordion_title_tag = ancestor_section[0].find("h2", class_="accordion__title")
    accordion_title = accordion_title_tag.text.strip() 

    # dropping all PhD and Research students
    if 'student' in accordion_title.lower():
        continue
    
    # iterating each individual
    for individual in individuals:
        if individual.text.strip():
            name_tag = individual
            name_text = name_tag.text.strip().replace("\xa0", " ")
            # an "ask me about" section uses the same format as a name, so skip it
            if "ask me about" in name_text.lower():
                continue 
            # separating "Last name, First name (Title)" at the comma
            name_parts = name_text.split(",")
            last_name = name_parts[0].strip()
            
            # getting first name and title; text without exactly one comma is skipped
            if len(name_parts) == 2:
                # dropping any trailing "(...)" and surrounding spaces from the first name
                first_name = name_parts[1].split("(")[0].strip()
                if 'prof' in name_parts[1].lower():
                    title = 'Professor'
                elif 'Dr' in name_parts[1]:
                    title = "Dr"
                else:
                    title = ''
            else:
                continue
   
            # adding data to list
            names.append(f"{first_name} {last_name}")
            departments.append("Sociology") 
            labels.append(accordion_title)
            titles.append(title)

# converting to dataframe
sociology_df = pd.DataFrame({"Name": names, "Department": departments, "Label": labels, "Title": titles})
sociology_df

Unnamed: 0,Name,Department,Label,Title
0,Mahvish Ahmad,Sociology,Academic staff,Dr
1,Suki Ali,Sociology,Academic staff,Dr
2,Robin Archer,Sociology,Academic staff,Dr
3,Chetan Bhatt,Sociology,Academic staff,Professor
4,Chetan Bhatt,Sociology,Academic staff,Professor
...,...,...,...,...
66,Sergey V. Zherebkin,Sociology,Academic visitors,Professor
67,Victoria Mallett,Sociology,Afilliated Research Fellows,Dr
68,Martha McCurdy,Sociology,Afilliated Research Fellows,Dr
69,Dominika Partyga,Sociology,Afilliated Research Fellows,Dr


<h2> <i> <center> Department of  Geography and Environment </center> </i> </h2> 

In [11]:
url_geo_staff = 'https://www.lse.ac.uk/geography-and-environment/our-people'
r = requests.get(url_geo_staff)
soup = BeautifulSoup(r.content,'lxml')

In [12]:
geo=[]

# Deal with special "Type" - Senior Management Team
triggers = soup.find_all('a', {'class': 'accordion__trigger'})
for trigger in triggers:
    department='Geography and Environment'
    label=trigger.get_text().replace('\n','')
    # Find the sibling <div> with class 'accordion__panel'
    panel_div = trigger.find_next_sibling('div', {'class': 'accordion__panel'})
    if panel_div:
        # Find all <div class="accordion__txt"> within the panel_div
        txt_divs = panel_div.find_all('div', {'class': 'accordion__txt'})
        for txt_div in txt_divs:
            # Find all <p> tags within the txt_div
            p_tags = txt_div.find_all('p')
            if len(p_tags) > 1:  # Ensure there is at least a second <p> tag
                second_p = p_tags[1]
                manage_info = second_p.find('strong')
                if manage_info:
                    name = manage_info.get_text().strip().replace('\xa0',' ')
                    if name!='':
                        geo.append([name,department,label]) 
                        
# Deal with 'Academic staff','Teaching staff','Affiliate staff'
wanted=['Academic staff','Teaching staff','Affiliate staff']
for title in soup.find_all('h2'):
    department='Geography and Environment'
    if 'class' in title.attrs:
        continue
    else:
        if title.get_text() in wanted:
            section_div = title.find_next('section', {'class': 'accordion'})
            if section_div:
                txt_divs = section_div.find_all('div', {'class': 'accordion__txt'})
                for txt_div in txt_divs:
                    info = txt_div.find('a')
                    strong_info = txt_div.find('strong')
                    if info:
                        name = info.get_text().strip().replace('\xa0',' ')
                        geo.append([name,department,title.get_text()])                 
                    elif strong_info:
                        name = strong_info.get_text().strip().replace('\xa0',' ')
                        geo.append([name,department,title.get_text()])

                                
# Deal with special "Type" - Professional Services Staff
professional_staff_h2 = soup.find('h2', string='Professional Services Staff')
if professional_staff_h2:
    department='Geography and Environment'
    section_div = professional_staff_h2.find_next('section',{'class': 'accordion'})
    if section_div:
        txt_divs = section_div.find_all('div', {'class': 'accordion__txt'})
        for txt_div in txt_divs:
            info_strong = txt_div.find('strong')
            if info_strong:
                name = info_strong.get_text().strip().replace('\xa0',' ')
                geo.append([name,department,'Professional Services Staff'])
                
# Deal with special "Type" - Visiting Staff                  
visiting_staff_h2 = soup.find('h2', string='Visiting staff')
if visiting_staff_h2:
    department='Geography and Environment'
    section_div = visiting_staff_h2.find_next('div',{'class': 'accordion__content'})
    for person in section_div.find_all('p'):
        if person.find('a'):
            name=person.find('a').get_text().replace('\xa0',' ')
            geo.append([name,department,'Visiting staff'])
        else:
            name=person.get_text().replace('\xa0',' ')
            geo.append([name,department,'Visiting staff'])

In [13]:
#from the extracted name, decide if the person is prof/dr/non

for i in range(len(geo)):
    #lower-cased tokens with trailing dots removed, so 'Dr.' and 'Prof.' are recognised too
    namestr=[word.rstrip('.') for word in geo[i][0].lower().split()]
    if 'dr' in namestr:
        geo[i].append('Dr')
    elif 'prof' in namestr or 'professor' in namestr:
        geo[i].append('Professor')
    else:
        geo[i].append(' ')
    
    #getting rid of other useless strings, only keep the name
    if geo[i][0].startswith('Dr.'):
        name=geo[i][0].replace('Dr.','').strip()
        geo[i][0]=name
    else:
        #'Professor' must be removed before 'Prof', otherwise 'essor' is left behind
        name=geo[i][0].replace('Dr','').replace('.','').replace('Professor','').replace('Prof','').replace('\xa0',' ').split()
        name=" ".join(name)
        geo[i][0]=name

In [14]:
geography_and_environment_df = pd.DataFrame(geo,columns=['Name','Department','Label','Title'])

By inspection, one name is missing because of the abnormal structure of the webpage.

In [15]:
display(geography_and_environment_df[geography_and_environment_df['Name']==''])

Unnamed: 0,Name,Department,Label,Title
11,,Geography and Environment,Academic staff,Dr


In [16]:
#dealing with abnormal data

geography_and_environment_df.loc[geography_and_environment_df['Name']=='','Name']='Aretousa Bloom'

In [17]:
#verifying

display(geography_and_environment_df[geography_and_environment_df['Name']==''])

Unnamed: 0,Name,Department,Label,Title


In [18]:
geography_and_environment_df

Unnamed: 0,Name,Department,Label,Title
0,Hyun Bang Shin,Geography and Environment,Senior Management Team,Professor
1,Christian Hilber,Geography and Environment,Senior Management Team,Professor
2,Claire Mercer,Geography and Environment,Senior Management Team,Professor
3,Simon Dietz,Geography and Environment,Senior Management Team,Professor
4,Nancy Holman,Geography and Environment,Senior Management Team,Dr
...,...,...,...,...
136,Rory Sullivan,Geography and Environment,Visiting staff,
137,Jayaraj Sundaresan,Geography and Environment,Visiting staff,
138,Callum Ward,Geography and Environment,Visiting staff,
139,Jason Wong,Geography and Environment,Visiting staff,


<h2> <i> <center> Department of  Economic History </center> </i> </h2> 

In [19]:
url_eh_staff = 'https://www.lse.ac.uk/Economic-History/People'
r = requests.get(url_eh_staff)
soup = BeautifulSoup(r.content,'lxml')

In [20]:
eh=[]

#senior management team
manage_divs = soup.find_all('div', {'class': 'accordion__txt'})[:4]
for manage_div in manage_divs:
    department='Economic History'
    manage_info = manage_div.find('strong')
    if manage_info:
        name = manage_info.get_text().strip().replace('\xa0',' ').split('-')
        name=name[0].split()
        name=' '.join(name)
        eh.append([name,department,'Senior Management Team'])

#all other tabs have similar structures
triggers = soup.find_all('a', {'class': 'accordion__trigger'})[:6]
for trigger in triggers:
    department='Economic History'
    trigger_text = trigger.get_text().strip().replace('\xa0',' ')
    # Find the sibling <div class="accordion__panel">
    panel_div = trigger.find_next_sibling('div', {'class': 'accordion__panel'})
    if panel_div:
        # Find all <div class="accordion__txt"> within the panel_div
        txt_divs = panel_div.find_all('div', {'class': 'accordion__txt'})
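        for txt_div in txt_divs:
            info = txt_div.find('strong')
            # skipping entries without a <strong> tag, which would otherwise raise an AttributeError
            if info is None:
                continue
            name = info.get_text().strip()
            eh.append([name,department,trigger_text])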

In [21]:
#from the extracted name, decide if the person is prof/dr/non

for i in range(len(eh)):      
    namestr=eh[i][0].lower().split()
    if 'dr' in namestr:
        eh[i].append('Dr')
    elif 'professor' in namestr:
        eh[i].append('Professor')
    else:
        eh[i].append('')
    
    #getting rid of other useless strings, only keep the name
    namestrings=[]
    name=eh[i][0].replace('\xa0',' ').split('(')
    for stri in name[0].split():
        if stri in ['Dr','Mr','Professor']:
            continue
        elif stri=='-':
            break
        else:
            namestrings.append(stri)  
    name=" ".join(namestrings)
    eh[i][0]=name

In [22]:
#converting list to dataframe, final dataset
economic_history_df = pd.DataFrame(eh,columns=['Name','Department','Label','Title'])
economic_history_df

Unnamed: 0,Name,Department,Label,Title
0,Patrick Wallis,Economic History,Senior Management Team,Professor
1,Neil Cummins,Economic History,Senior Management Team,Professor
2,Sara Horrell,Economic History,Senior Management Team,Professor
3,Jennie Stayner,Economic History,Senior Management Team,
4,Olivier Accominotti,Economic History,"Faculty, Fellows and Teachers",Professor
...,...,...,...,...
65,Kamilah Hassan,Economic History,Professional Support Staff,
66,Helena Ivins,Economic History,Professional Support Staff,
67,Tracy Keefe,Economic History,Professional Support Staff,
68,Jennie Stayner,Economic History,Professional Support Staff,


<h2> <i> <center> Department of  Government </center> </i> </h2> 

In [23]:
#getting the government webpage

r_gvt=requests.get('https://www.lse.ac.uk/government/people')
soup_gvt=BeautifulSoup(r_gvt.content,'lxml')

In [24]:
#getting the name, department, label information as a list

gvt=[]

#tab headings (lower-cased) grouped by page structure
label1_wanted1=['academic staff',
               'professional services staff',
                'research staff']
label1_wanted2=['guest teachers and gtas']
label1_wanted3=['emeritus, affiliated & visiting academic staff']

for label1 in soup_gvt.find_all('h2',attrs={'class':'accordion__title'}):
    department='Government'
    
    # the following three tabs have similar structures:
    #'academic staff','professional services staff','research staff'
    if label1.get_text().lower() in label1_wanted1:
        label=label1.get_text()
        shortcut=label1.find_next('div',attrs={'class':'accordion__content'})
        for person in shortcut.find_all('div',attrs={'class':'accordion__txt'}):
            if person.find('p').find('strong'):
                name=person.find('p').find('strong').get_text()
                gvt.append([name,department,label])
            elif person.find('p').find('b'):
                name=person.find('p').find('b').get_text()
                gvt.append([name,department,label])
            else:
                print('not all structure considered')
    
    #different structure for 'guest teachers and gtas'
    if label1.get_text().lower() in label1_wanted2:
        label=label1.get_text()
        shortcut=label1.find_next('div',attrs={'class':'accordion__content'})
        for person in shortcut.find_all('p'):
            if person.find('strong'):
                continue
            else:
                name=person.get_text()
                gvt.append([name,department,label])

    #different structure for 'emeritus, affiliated & visiting academic staff'
    if label1.get_text().lower() in label1_wanted3:
        label=label1.get_text()
        shortcut=label1.find_next('div',attrs={'class':'accordion__content'})
        for people in shortcut.find_all('ul'):
            for person in people.find_all('li'):
                if person.find('p'):
                    if person.find('p').find('strong'):
                        name=person.find('p').find('strong').get_text()
                        gvt.append([name,department,label])
                    else:
                        name=person.find('p').find('span').get_text()
                        gvt.append([name,department,label])
                else:
                    if person.find('strong'):
                        name=person.find('strong').get_text()
                        gvt.append([name,department,label])
                    else:
                        print('Not all structures considered')

In [25]:
#from the extracted name, decide if the person is prof/dr/non

for i in range(len(gvt)):
    namestr=gvt[i][0].lower().split()
    if '(dr)' in namestr:
        gvt[i].append('Dr')
    elif '(prof)' in namestr:
        gvt[i].append('Professor')
    else:
        gvt[i].append(' ')
    
    #getting rid of other useless strings, only keep the name
    namestrings=[]
    name=gvt[i][0].replace('\xa0',' ').split()
    for stri in name:
        string=stri.strip('()')
        if string.startswith('GV'):
            continue
        elif string in ['Dr','Mr','Prof']:
            continue
        else:
            namestrings.append(stri)  
    name=" ".join(namestrings)
    gvt[i][0]=name

In [26]:
#converting list to dataframe, final dataset
government_df = pd.DataFrame(gvt,columns=['Name','Department','Label','Title'])
government_df

Unnamed: 0,Name,Department,Label,Title
0,Victor Agboga,Government,Academic Staff,
1,Elise Antoine,Government,Academic Staff,Dr
2,Paul Apostolidis,Government,Academic Staff,Professor
3,Tom Bailey,Government,Academic Staff,
4,Daniel Berliner,Government,Academic Staff,Dr
...,...,...,...,...
158,Lukas Slothuus,Government,"Emeritus, Affiliated & Visiting Academic Staff",Dr
159,Zeynep Somer Topcu,Government,"Emeritus, Affiliated & Visiting Academic Staff",Dr
160,Christine Stedtnitz,Government,"Emeritus, Affiliated & Visiting Academic Staff",Dr
161,Jill Stuart,Government,"Emeritus, Affiliated & Visiting Academic Staff",Dr


<h2> <i> <center> Merging Data </center> </i> </h2> 

In [27]:
# putting all the dataframes into one
merged_data = pd.concat([similar_structure_df, government_df, economic_history_df, geography_and_environment_df, 
                         sociology_df, management_df], ignore_index=True)
merged_data

Unnamed: 0,Name,Department,Label,Title
0,Fabio Battaglia,Social Policy,Academic staff,Dr
1,Liam Beiser-McGrath,Social Policy,Academic staff,Dr
2,Thomas Biegert,Social Policy,Academic staff,Dr
3,Tania Burchardt,Social Policy,Academic staff,Dr
4,Leonidas Cheliotis,Social Policy,Academic staff,Dr
...,...,...,...,...
1253,Paul Willman,Management,Other academic and research staff,Professor
1254,Mohamed Abouaziza,Management,Other academic and research staff,Dr
1255,Anushri Gupta,Management,Other academic and research staff,Dr
1256,Philipp Schoenegger,Management,Other academic and research staff,Dr


One thing we noticed was that the same staff member sometimes appears under multiple labels. We chose to handle this once in the combined dataframe rather than separately for each departmental dataframe. 

In [28]:
# removing duplicate names (the first occurrence of each name is kept)
merged_data = merged_data.drop_duplicates(subset=["Name"])
merged_data

Unnamed: 0,Name,Department,Label,Title
0,Fabio Battaglia,Social Policy,Academic staff,Dr
1,Liam Beiser-McGrath,Social Policy,Academic staff,Dr
2,Thomas Biegert,Social Policy,Academic staff,Dr
3,Tania Burchardt,Social Policy,Academic staff,Dr
4,Leonidas Cheliotis,Social Policy,Academic staff,Dr
...,...,...,...,...
1253,Paul Willman,Management,Other academic and research staff,Professor
1254,Mohamed Abouaziza,Management,Other academic and research staff,Dr
1255,Anushri Gupta,Management,Other academic and research staff,Dr
1256,Philipp Schoenegger,Management,Other academic and research staff,Dr
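
The deduplication above keys on `Name` alone, so two different people who share a name in different departments would be collapsed into one row. As a hedged variation, the sketch below keys on both name and department and normalises case and whitespace before comparing. It is written in plain Python over a list of dictionaries so it can be read on its own; the function `drop_repeated_staff` and the sample rows are hypothetical and not part of the notebook.

```python
def drop_repeated_staff(rows, key_fields=("Name", "Department")):
    """Keep the first row for each key, ignoring case and extra whitespace."""
    seen = set()
    kept = []
    for row in rows:
        # Normalise each key field: collapse whitespace, then casefold
        key = tuple(" ".join(str(row[field]).split()).casefold() for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical rows, for illustration only
rows = [
    {"Name": "Jane Doe", "Department": "Sociology", "Label": "Academic staff"},
    {"Name": "Jane  Doe", "Department": "Sociology", "Label": "Senior Management Team"},
    {"Name": "Jane Doe", "Department": "Finance", "Label": "Academic staff"},
]
for row in drop_repeated_staff(rows):
    print(row["Name"], "|", row["Department"], "|", row["Label"])
```

In pandas, the corresponding change would be passing both columns to `drop_duplicates`, i.e. `subset=["Name", "Department"]`. Whether that is preferable depends on whether a person listed in two departments should count once or twice in the analysis.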


We also want to classify each staff member as research based or non-research based and store that in a separate column to make analysis easier. We base the classification on the label, treating more academic roles as research and service/teaching roles as non-research. There are also a few unique labels, such as Public Engagement and Impact, which has only one person. To resolve these, we did a quick check of the staff members under these labels, looking at what their role entails and whether their page is related to research. For Public Engagement and Impact, the sole staff member under the label had research listed under their interests and had also published a paper, so we identified them as a researcher. Similar tactics were used for the other unique labels.

In [29]:
unique_labels = merged_data["Label"].unique()
merged_data["Label"].value_counts()

Academic staff                                                                            245
Professional services staff                                                               120
Visiting staff                                                                             53
Academic Staff                                                                             53
Emeritus, Affiliated & Visiting Academic Staff                                             51
Guest Teachers and GTAs                                                                    37
Visiting Fellows                                                                           36
Other academic and research staff                                                          36
Teaching staff                                                                             34
Academic Faculty                                                                           34
Finance faculty                                             

In [30]:
# defining keywords for research and non-research labels
research_keywords = ["academic", "research", "faculty", "emeritus", "visiting", "affiliate", "impact", "committee"]
non_research_keywords = ["service", "management", "admin", "teach", "support", "tutor"]

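# working on an explicit copy so the new column is not written to a slice of the original dataframe
merged_data_copy = merged_data.copy()

# iterating over each row and categorising the label;
# research keywords take precedence, and labels matching neither list are left empty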
for index, row in merged_data_copy.iterrows():
    label = row['Label'].lower()
    if any(keyword in label for keyword in research_keywords):
        merged_data_copy.loc[index, 'Category'] = "Research"
    elif any(keyword in label for keyword in non_research_keywords):
        merged_data_copy.loc[index, 'Category'] = "Non-Research"
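
The loop above leaves `Category` empty for any label that matches neither keyword list. A minimal sketch of the same rule as a pure function, plus a helper that lists such labels for manual review, is shown below. The keyword lists are copied from the cell above; the function names `classify_label` and `unmatched_labels` and the sample labels are hypothetical.

```python
RESEARCH_KEYWORDS = ["academic", "research", "faculty", "emeritus",
                     "visiting", "affiliate", "impact", "committee"]
NON_RESEARCH_KEYWORDS = ["service", "management", "admin", "teach", "support", "tutor"]


def classify_label(label):
    """Return 'Research', 'Non-Research', or None when no keyword matches."""
    text = label.lower()
    # Research keywords are checked first, so they win when both lists match
    if any(keyword in text for keyword in RESEARCH_KEYWORDS):
        return "Research"
    if any(keyword in text for keyword in NON_RESEARCH_KEYWORDS):
        return "Non-Research"
    return None


def unmatched_labels(labels):
    """Return the sorted set of labels that no keyword list covers."""
    return sorted({label for label in labels if classify_label(label) is None})


# Sample labels; the last one is hypothetical and matches no keyword
labels = ["Academic staff", "Professional services staff",
          "Guest Teachers and GTAs", "Honorary Associates"]
for label in labels:
    print(label, "->", classify_label(label))
print(unmatched_labels(labels))
```

Applied to the dataframe, the function could be mapped over the `Label` column instead of looping with `iterrows`. Listing the unmatched labels before saving makes it easier to confirm that every label was deliberately assigned.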

In [31]:
# saving as csv file 
merged_data_copy.to_csv('departmental_staff_data.csv', index=False)