# Harvesting Data from the Web



A common data-harvesting task is building a dataset from information published on a web page. We will work primarily with the HSE page for the School of Sociology.

## 1. Harvesting data on staff 

Go to this [page](https://social.hse.ru/en/soc/persons) and harvest the following information for each staff member:

1. Names
2. Links to personal pages
3. Positions

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import re

In [2]:
url = "https://social.hse.ru/en/soc/persons"
page = requests.get(url)
soup = bs(page.text, "html.parser")
staff = soup.find_all("div", {"class": "fa-person__box"})
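Before running the extraction on the live page, it can help to see what `find_all` returns on a small hand-written snippet. The markup below is hypothetical: it only mimics the class names used in this notebook (`fa-person__box`, `fa-person__name`, `fa-person__info`), and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one staff card; not copied from the real page.
html = """
<div class="fa-person__box">
  <a class="fa-person__name" href="//www.hse.ru/en/staff/example"> Doe, Jane </a>
  <p class="fa-person__info"> Associate Professor </p>
</div>
<div class="other">ignored</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of every <div> whose class matches
cards = soup.find_all("div", {"class": "fa-person__box"})

card = cards[0]
link = card.find("a", class_="fa-person__name")
name = link.text.strip()                                   # visible text, whitespace removed
position = card.find("p", class_="fa-person__info").text.strip()
href = link["href"]                                        # attribute access works like a dict

print(len(cards), name, position, href)
```

Each element of `cards` is itself searchable with `find`, which is why the loop over `staff` can pull the name, position and link out of one card at a time.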

In [3]:
staff_list = {"name": [], "position": [], "url":[]}

In [4]:
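for person in staff:
    # each card holds a name link and a short line with the position
    link = person.find('a', class_='fa-person__name')
    staff_list["name"].append(link.text.strip())
    staff_list["position"].append(person.find('p', class_='fa-person__info').text.strip())
    # hrefs that start with '//' lack a scheme; prepend one and leave other hrefs untouched
    href = link['href']
    staff_list["url"].append('http:' + href if href.startswith('//') else href)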
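The hrefs on the staff page begin with `//` and need a scheme before they can be requested. A more general way to handle this, sketched here with the standard library's `urllib.parse.urljoin`, resolves any kind of href against the page it came from. The helper name `absolute` and the example paths are illustrative assumptions, not part of the notebook.

```python
from urllib.parse import urljoin

base = "https://social.hse.ru/en/soc/persons"

def absolute(href, base_url=base):
    """Resolve an href found on base_url into a full URL (hypothetical helper)."""
    return urljoin(base_url, href)

# protocol-relative link: inherits the scheme of the base page
print(absolute("//www.hse.ru/en/staff/example"))   # https://www.hse.ru/en/staff/example
# root-relative link: inherits scheme and host
print(absolute("/en/soc/news"))                    # https://social.hse.ru/en/soc/news
# already absolute: returned unchanged
print(absolute("https://www.hse.ru/en/"))          # https://www.hse.ru/en/
```

Note that this yields `https://` links because the base URL uses `https`, whereas the notebook's output uses `http://`.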

In [5]:
df = pd.DataFrame.from_dict(staff_list)

In [6]:
pd.set_option("display.max_rows", None, "display.max_columns", None)
df

| | name | position | url |
|---|---|---|---|
| 0 | Lezhnina, Yulia P. | Head | http://www.hse.ru/en/staff/lezhnina |
| 1 | Chepurenko, Alexander | Academic Supervisor | http://www.hse.ru/en/org/persons/63903 |
| 2 | Strebkov, Denis | Deputy School Head | http://www.hse.ru/en/staff/strebkov |
| 3 | Zangieva, Irina | Deputy School Head | http://www.hse.ru/en/org/persons/7531133 |
| 4 | Artamonova, Liudmila | Manager | http://www.hse.ru/en/org/persons/108084 |
| 5 | Davidenko, Maria | Assistant Professor | http://www.hse.ru/en/org/persons/202139458 |
| 6 | Davydov, Sergey G. | Associate Professor | http://www.hse.ru/en/org/persons/8747291 |
| 7 | Di Puppo, Lili | Assistant Professor | http://www.hse.ru/en/org/persons/57316523 |
| 8 | Fröhlich, Christian | Associate Professor | http://www.hse.ru/en/org/persons/127116556 |
| 9 | Kuskova, Valentina | Associate Professor | http://www.hse.ru/en/staff/vkuskova |


1. How many people are there?
2. How many professors are there?
3. How many heads are there?
4. How many managers are there?

In [11]:
counts = df['position'].value_counts()  # number of staff per exact position title
print("There are {} people".format(len(df)))
print("There are {} professors and {} professors who are also department heads".format(counts['Professor'], counts['Department Head, Professor']))
print("There are {} heads".format(counts['Head']))
print("There are {} managers".format(counts['Manager']))

There are 84 people
There are 13 professors and 4 professors who are also department heads
There are 2 heads
There are 2 managers
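Counting by exact title raises a `KeyError` when a title is absent, and it treats "Professor" and "Department Head, Professor" as unrelated categories. The sketch below, on made-up toy data, shows two alternatives: `Series.get` with a default, and splitting combined titles on `", "`. The assumption that combined titles are always separated by a comma and a space comes only from the one example visible in this notebook.

```python
import pandas as pd

# Toy data, not the real staff list
positions = pd.Series(
    ["Professor", "Head", "Manager", "Professor",
     "Department Head, Professor", "Associate Professor"],
    name="position",
)

counts = positions.value_counts()

# .get returns a default instead of raising KeyError for a missing title
n_missing = counts.get("Dean", 0)

# exact match: only rows whose whole title is "Professor"
n_exact = counts.get("Professor", 0)

# split combined titles so "Department Head, Professor" also counts,
# while "Associate Professor" does not
n_any = positions.str.split(", ").apply(lambda parts: "Professor" in parts).sum()

print(n_missing, n_exact, n_any)
```

A plain `str.contains("Professor")` would also match "Associate Professor" and "Assistant Professor", which is why the split is used here.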


## 2. Harvesting data on each faculty member

Now harvest the following information for each staff member:

1. What languages do they know?
2. What is their e-mail?
3. Do they have a link to a Google Scholar profile? If so, harvest the link.
4. Do they teach any courses? If so, list the course names.
5. How many publications do they have?

In [8]:
members = df['url'].tolist()  # personal-page URLs, one per staff member

In [10]:
import sys
# one list per harvested field; each loop iteration appends one value to every list
descr = {"name": [], "languages": [], "email":[], "GS": [], "courses": [], "pub_num": [], "lang_num": [], "course_num": []}
for i, personal_url in enumerate(members):  # `i` avoids shadowing the built-in iter()
    # overwrite a single progress line (the counter is zero-based)
    sys.stdout.write('\r' + 'number of parsed pages: ' + str(i) + '/' + str(len(df) - 1))
    sys.stdout.flush()
    personal_page = requests.get(personal_url)
    cold_soup = bs(personal_page.text, "html.parser")
    # name
    name = re.search(r'-(.*?)—', cold_soup.find('title').text).group(1)
    descr['name'].append(name)
    # number of publications
    pubs = cold_soup.find('ul', class_="g-ul g-list g-list_closer publications")
    # len(pubs.contents) counts every direct child node of the list; 0 if the page has no such list
    num = len(pubs.contents) if pubs is not None else 0
    descr['pub_num'].append(num)
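    # mail: the address appears in the page source as a bracketed string with "-at-" in place of "@"
    mail = re.findall(r'\[.*-at-.*\]', cold_soup.prettify())[0].split('[')[1]
    # keep only letters, dots and the "-at-" marker, then restore "@" (digits and other characters are dropped)
    mail_re = ''.join(re.findall(r'(-at-|[a-zA-Z]|\.+)', mail)).replace('-at-', '@')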
    descr['email'].append(mail_re)
    # link to google scholar
    scholar = cold_soup.find('a', class_="link b")
    if scholar is not None:
        descr['GS'].append(scholar.get('href'))
    else:
        descr['GS'].append('No information')
    # languages
    languages = []
    nlang = 0
    for langs in cold_soup.find('dl').find_all('dd'):
        languages.append(langs.text)
        nlang += 1
    descr['languages'].append(' '.join(map(str, languages)))
    descr['lang_num'].append(nlang)
    # courses
    courses = cold_soup.find("div", {"tab-node": "edu-courses"})
    if courses is None:
        descr['courses'].append('No courses')
        descr['course_num'].append(0)
    else:
        cn = 0
        courses = courses.find('ul')
        if courses is None:
            courses = cold_soup.find("div", {"tab-node": "edu-courses"})
        list_of_courses = []
        for course in courses:
            if course.find('a') is not None:
                list_of_courses.append(course.find('a').text)
                cn += 1
        descr['courses'].append(' '.join(map(str, list_of_courses)))
        descr['course_num'].append(cn)

number of parsed pages: 83/83
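The e-mail step in the loop can be isolated into a small function, which makes it easier to test and to handle pages without an address. This is a minimal sketch that reuses the two regular expressions from the loop; the function name `deobfuscate_email` and the sample input string are assumptions for illustration, and the real page source may encode addresses differently.

```python
import re

def deobfuscate_email(html_text):
    """Return the e-mail hidden in a bracketed '-at-' string, or None if absent.

    Hypothetical helper built on the same regular expressions as the loop.
    """
    match = re.search(r'\[.*-at-.*\]', html_text)
    if match is None:
        return None                       # no obfuscated address on this page
    raw = match.group(0).split('[')[1]    # text after the first opening bracket
    # keep letters, runs of dots and the '-at-' marker; everything else is dropped
    parts = re.findall(r'(-at-|[a-zA-Z]|\.+)', raw)
    return ''.join(parts).replace('-at-', '@')

# Made-up input in the bracketed form the loop expects
sample = '<script>var m = ["jlezhnina","-at-","hse.ru"];</script>'
print(deobfuscate_email(sample))               # jlezhnina@hse.ru
print(deobfuscate_email('<p>no address</p>'))  # None
```

Because the character class is `[a-zA-Z]`, an address containing digits, hyphens or underscores would come out incomplete; widening the class would be needed for such cases.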

In [12]:
df_descr = pd.DataFrame.from_dict(descr)
df_descr

| | name | languages | email | GS | courses | pub_num | lang_num | course_num |
|---|---|---|---|---|---|---|---|---|
| 0 | Yulia P. Lezhnina | English German | jlezhnina@hse.ru | https://scholar.google.com/citations?user=uloD... | Social Policy as an Instrument of Sustainable ... | 7 | 2 | 2 |
| 1 | Alexander Chepurenko | English German Swedish | achepurenko@hse.ru | http://scholar.google.ru/citations?view_op=lis... | Researching Entrepreneurship: How to plan, des... | 24 | 3 | 2 |
| 2 | Denis Strebkov | English (Upper-Intermediate) Russian (Native) | strebkov@hse.ru | https://scholar.google.com/citations?user=uQEn... | Organization, Preparation and Presentation of ... | 61 | 2 | 2 |
| 3 | Irina Zangieva | English Russian | izangieva@hse.ru | https://scholar.google.ru/citations?user=HPmVP... | Applied Statistical Analysis Data Analysis in ... | 2 | 2 | 5 |
| 4 | Liudmila Artamonova | English | LArtamonova@hse.ru | No information | No courses | 0 | 1 | 0 |
| 5 | Maria Davidenko | English Japanese | mdavidenko@hse.ru | https://scholar.google.com/citations?view_op=l... | Academic English Writing Sociology of Gender | 8 | 2 | 2 |
| 6 | Sergey Gennadyevich Davydov | English | sdavydov@hse.ru | https://scholar.google.ru/citations?hl=ru&user... | Analysis of Media Markets and Media Organizati... | 69 | 1 | 9 |
| 7 | Lili Di Puppo | French English German Russian Italian | ldipuppo@hse.ru | https://scholar.google.ru/citations?user=c1nEW... | Methodology and Research Methods in Sociology:... | 5 | 5 | 2 |
| 8 | Christian Fröhlich | English Russian German | cfroehlich@hse.ru | https://scholar.google.com/citations?user=MGOj... | Classical Sociological Theory Contemporary Soc... | 19 | 3 | 7 |
| 9 | Valentina Kuskova | English | vkuskova@hse.ru | https://scholar.google.ru/citations?user=zy4L3... | Analysis of Covariance Models Contemporary Dat... | 30 | 1 | 14 |


## 3. Use your Pandas knowledge to answer the following questions:

1. How many languages does a faculty member know on average?
2. How many courses does a faculty member teach this year on average?
3. How many publications does a faculty member have on average?
4. Is there any difference between visiting lecturers and senior lecturers?
5. Is there any difference between heads and others?
6. Is there any difference between professors and associate professors?

In [13]:
print("A faculty member knows on average {} languages".format(df_descr['lang_num'].mean()))
print("A faculty member has on average {} courses this year".format(df_descr['course_num'].mean()))
print("A faculty member has on average {} publications".format(df_descr['pub_num'].mean()))

A faculty member knows on average 1.7738095238095237 languages
A faculty member has on average 3.2738095238095237 courses this year
A faculty member has on average 22.55952380952381 publications


In [14]:
df_descr = pd.DataFrame.from_dict(descr)
df_descr['position'] = df['position']
df_descr['IsHead'] = df_descr['position'].str.contains('Head')
df_descr.groupby('IsHead').describe()

| IsHead | variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| False | pub_num | 74.0 | 20.662162 | 27.88542 | 0.0 | 3.0 | 6.0 | 31.75 | 157.0 |
| False | lang_num | 74.0 | 1.743243 | 1.060638 | 1.0 | 1.0 | 1.0 | 2.0 | 6.0 |
| False | course_num | 74.0 | 3.364865 | 2.588565 | 0.0 | 1.25 | 3.0 | 5.0 | 14.0 |
| True | pub_num | 10.0 | 36.6 | 43.050874 | 2.0 | 2.5 | 20.5 | 59.0 | 133.0 |
| True | lang_num | 10.0 | 2.0 | 1.247219 | 1.0 | 1.0 | 2.0 | 2.0 | 5.0 |
| True | course_num | 10.0 | 2.6 | 1.837873 | 0.0 | 1.25 | 2.0 | 4.5 | 5.0 |
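The `IsHead` flag is a substring match, so it is worth checking which titles it actually captures. The sketch below uses made-up toy data (the titles are taken from this notebook, the publication counts are invented) and replaces `describe()` with a narrower `agg` call, which can be easier to read when only a few statistics matter.

```python
import pandas as pd

# Toy data: real titles from the staff list, invented publication counts
toy = pd.DataFrame({
    "position": ["Head", "Deputy School Head", "Professor",
                 "Manager", "Department Head, Professor"],
    "pub_num": [10, 20, 5, 0, 30],
})

# substring match: any title containing "Head" is flagged, including deputies
toy["IsHead"] = toy["position"].str.contains("Head")
print(toy[["position", "IsHead"]])

# pick only the statistics of interest for one column
summary = toy.groupby("IsHead")["pub_num"].agg(["count", "mean", "median"])
print(summary)
```

With this flag, "Deputy School Head" and "Department Head, Professor" land in the same group as "Head", which matches the count of 10 in the `True` row of the real data.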


In [15]:
test = df_descr.loc[df_descr['position'].isin(['Senior Lecturer', 'Visiting Lecturer'])]  # filter on df_descr only
test.groupby('position').describe()

| position | variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| Senior Lecturer | pub_num | 10.0 | 18.0 | 20.231988 | 2.0 | 2.5 | 11.0 | 24.75 | 65.0 |
| Senior Lecturer | lang_num | 10.0 | 1.3 | 0.483046 | 1.0 | 1.0 | 1.0 | 1.75 | 2.0 |
| Senior Lecturer | course_num | 10.0 | 2.7 | 1.828782 | 1.0 | 1.25 | 2.5 | 3.0 | 7.0 |
| Visiting Lecturer | pub_num | 11.0 | 11.090909 | 14.250997 | 0.0 | 0.0 | 3.0 | 21.5 | 36.0 |
| Visiting Lecturer | lang_num | 11.0 | 1.636364 | 0.80904 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| Visiting Lecturer | course_num | 11.0 | 2.181818 | 1.990888 | 0.0 | 1.0 | 1.0 | 3.5 | 6.0 |


In [16]:
test1 = df_descr.loc[df_descr['position'].isin(['Associate Professor', 'Professor'])]  # filter on df_descr only
test1.groupby('position').describe()

| position | variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| Associate Professor | pub_num | 25.0 | 34.4 | 28.712657 | 0.0 | 6.0 | 32.0 | 52.0 | 105.0 |
| Associate Professor | lang_num | 25.0 | 1.68 | 0.945163 | 1.0 | 1.0 | 1.0 | 2.0 | 5.0 |
| Associate Professor | course_num | 25.0 | 4.96 | 3.10215 | 1.0 | 2.0 | 4.0 | 7.0 | 14.0 |
| Professor | pub_num | 13.0 | 19.230769 | 42.464012 | 0.0 | 2.0 | 5.0 | 9.0 | 157.0 |
| Professor | lang_num | 13.0 | 1.769231 | 1.091928 | 1.0 | 1.0 | 1.0 | 3.0 | 4.0 |
| Professor | course_num | 13.0 | 3.923077 | 1.497862 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
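In the professor row above, the mean number of publications (19.23) sits far above the median (5.0) because of one value of 157, so the mean alone can mislead when comparing groups. The sketch below shows the same effect on invented numbers, using `isin` to select the two groups and `agg` to put mean and median side by side.

```python
import pandas as pd

# Invented counts chosen to include one extreme value among professors
toy = pd.DataFrame({
    "position": ["Professor", "Professor", "Professor",
                 "Associate Professor", "Associate Professor", "Manager"],
    "pub_num": [2, 5, 150, 30, 34, 0],
})

# keep only the two groups being compared
subset = toy[toy["position"].isin(["Professor", "Associate Professor"])]

# mean is pulled up by the outlier; median is not
stats = subset.groupby("position")["pub_num"].agg(["mean", "median"])
print(stats)
```

In this toy data the professors have the higher mean but the lower median, so which group "publishes more" depends on the statistic chosen.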
