# Homework nr. 1 - data visualization (deadline 25/10/2018)

In short, the main task is to download data on theses defended at CTU from the Internet, store them in a pandas DataFrame, and then visualize some hidden information.
  
> The instructions are not given in detail: It is up to you to come up with ideas on how to fulfill the particular tasks as best you can. Thinking of how to visualize the data is an important part of data visualization! ;)

## What are you supposed to do:

  1. Browse the website https://dspace.cvut.cz/?locale-attribute=en and find out how to download data on Bachelor and Master theses.
  2. Download or scrape the data such that for each thesis you know the following:
    * Faculty name, department name, thesis title, thesis type (bachelor/master), supervisor name, reviewer name, year (or date) of the defence, study programme and discipline, link to a webpage with details.
  3. Store these data in one _csv_ file (should be handed in along with this notebook).
  4. Use tools available for Python to plot charts and tables to visualize/display this information:
    * Number of defended theses per year for CTU/Faculties. Distinguish the type of thesis.
    * Find the departments/study programmes/supervisors/reviewers with the highest numbers of theses and come up with some nice plots and tables to depict their numbers.
    * Mean/median/minimum/maximum number of supervised theses per year for faculties.
    * Number (or fraction) of theses supervised by people with various degrees (Bc./Ing./Ph.D./ ...).

**If you do all this properly, you will obtain 6 points**

To earn **extra two points** you can do some of these:
  * Use http://beakerx.com to make your notebook interactive in a meaningful way.
  * Come up with some other reasonable and interesting views of data.
  * Use your data to create an interactive webpage (HTML + JavaScript).

## Comments

  * Please follow the instructions from https://courses.fit.cvut.cz/MI-PDD/homeworks/index.html.
  * If the reviewing teacher is not satisfied, he can give you another chance to rework your homework and to obtain more points.

## Solution 

### Data downloading
* The data are downloaded from DSpace, which lists all works uploaded at CTU.
* The metadata of each work can be downloaded as a table from the work's URL (https://dspace.cvut.cz/handle/10467/78315?show=full).
* The data are available in several languages. The preferred language and the types of work to download (bachelor, master's, ...) can be specified.
* The script downloads master's and bachelor's theses and takes the data in English where possible.
* The metadata table does not contain the faculty name or people's degrees.
* The faculty name is parsed from the page navigation on each work's own page.
* Degrees of supervisors and reviewers can be downloaded from UserMap (https://usermap.cvut.cz/).
    * There is a problem: UserMap pages are generated by JavaScript and cannot be downloaded with a plain HTTP GET.
    * Downloading the degrees of supervisors and reviewers therefore needs Selenium and the Chrome driver to render the pages. Rendering takes some time. For this reason, I have enclosed the csv file with the downloaded data.
* During the download, the data are flushed to csv after every downloaded page of works. This prevents data loss when the run raises an exception.
* The data are saved in works.csv with the csv header
(,supervisor,author,uri,language,subject,title,type,acceptedDate,
rewiever,discipline,department,programme,faculty,supervisor_degree,rewiever_degree)
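
The flush-per-page idea described above can be sketched in isolation. This is only a minimal illustration with toy data: the helper name `append_batch` and the file name `works_demo.csv` are hypothetical, and unlike the notebook it does not write the index column.

```python
import os
import tempfile

import pandas as pd


def append_batch(batch, path):
    """Append one batch of rows to a CSV file, writing the header only once."""
    write_header = not os.path.isfile(path)
    batch.to_csv(path, mode="a", header=write_header, index=False)


# Toy batches standing in for two downloaded result pages
page1 = pd.DataFrame({"title": ["A", "B"],
                      "type": ["bachelor thesis", "master's thesis"]})
page2 = pd.DataFrame({"title": ["C"], "type": ["bachelor thesis"]})

# Hypothetical output file in a fresh temporary directory
csv_path = os.path.join(tempfile.mkdtemp(), "works_demo.csv")
for batch in (page1, page2):
    append_batch(batch, csv_path)

restored = pd.read_csv(csv_path)
```

Because each batch is written as soon as it is complete, an exception on a later page loses at most the page that was being processed.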




In [2]:
# Imports for downloading

import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import os.path


In [44]:
from urllib.parse import quote

from selenium import webdriver
from selenium.webdriver.common.by import By

# UserMap (usermap.cvut.cz) renders its search results with JavaScript (Ajax),
# so a plain HTTP GET is not enough. A Chrome webdriver is needed to render the
# pages, which makes fetching all people slow.
class People():
    """Looks up people's degrees on UserMap and caches the results."""

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.people = {}

    def end(self):
        self.driver.quit()

    def getDegree(self, name, faculty):
        """Get the degrees of a person who works at the given faculty.

        Results are cached so that no person is downloaded more than once.
        """
        try:
            item = (name, faculty)
            if item in self.people:
                return self.people[item]

            # URL-encode the name (spaces, diacritics) before putting it in the query
            self.driver.get("https://usermap.cvut.cz/search?query=" + quote(name))
            table = self.driver.find_elements(By.ID, "search-results-table")[0]
            rows = table.find_element(By.TAG_NAME, "tbody").find_elements(By.TAG_NAME, "tr")
            for element in rows:
                names = element.find_element(By.TAG_NAME, "a").text
                fac = element.find_element(By.TAG_NAME, "abbr").get_attribute("title").split("-")[0].strip()
                if faculty == fac:
                    # Everything after the name parts is treated as degrees
                    splitName = names.split(",")
                    degrees = ", ".join(splitName[len(name.split(" ")):])
                    self.people[item] = degrees
                    return degrees
        except Exception:
            # Any lookup failure (no results, missing name, driver error) yields no degree
            return None
        return None
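
The caching part of `getDegree` can be tried without a browser. The sketch below is an assumption-laden illustration, not the class above: `DegreeCache`, `fake_fetch`, and the sample name and faculty are all hypothetical, and the slow Selenium lookup is replaced by an injected function.

```python
class DegreeCache:
    """Memoise degree lookups keyed by (name, faculty)."""

    def __init__(self, fetch):
        # `fetch` is a hypothetical callable: (name, faculty) -> degree string or None
        self._fetch = fetch
        self._cache = {}

    def get(self, name, faculty):
        key = (name, faculty)
        if key not in self._cache:
            # Only the first request for a key reaches the slow lookup
            self._cache[key] = self._fetch(name, faculty)
        return self._cache[key]


calls = []


def fake_fetch(name, faculty):
    """Stand-in for the browser-based lookup; records every call."""
    calls.append((name, faculty))
    return "Ing., Ph.D."


cache = DegreeCache(fake_fetch)
first = cache.get("Doe Jane", "Faculty X")
second = cache.get("Doe Jane", "Faculty X")
```

Note one design difference: this sketch also caches `None` results, so a person who cannot be found is not searched for again, whereas `getDegree` caches only successful lookups.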
        
    
    

In [None]:
# Download data - this may take several minutes.
#                 You can edit the number of pages downloaded.
#                 Works are downloaded starting from the newest.


# Main DSpace URL used to find bachelor's (BP) and master's (DP) theses
urlMain = 'https://dspace.cvut.cz{}'
# Url with search form
urlDist = '/discover' 
# Data for specific page to download
data = {
    'rpp' : '100',
    'etal' : '0', 
    'group_by' : 'none', 
    'page' : '0',
    'sort_by' : 'dc.date.issued_dt',
    'order' : 'desc'}

# Preferred language
pref_lang = "eng"
# Download degrees from UserMap -> needs the Chrome driver to render JavaScript.
dPeople = True
work_get = ["bachelor thesis", "master's thesis"]

# Mapping from DSpace field names to the column names used in the csv
newColumns = {'dc.contributor.advisor' : 'supervisor' , 'dc.contributor.author' : 'author', 
                 'dc.identifier.uri' : 'uri', 'dc.date.issued' : 'issued',
       'dc.language.iso' : 'language', 'dc.subject' : 'subject', 'dc.title' : 'title', 'dc.type' : 'type',
       'dc.date.accepted' : 'acceptedDate', 'dc.contributor.referee' :'rewiever',
       'theses.degree.discipline' : 'discipline', 'theses.degree.grantor' : 'department',
       'theses.degree.programme' : 'programme'}

if dPeople: people = People()
# Group rows by field name. When a field occurs several times (e.g. once per
# language), keep only the row in the preferred language, or the first row if
# the preferred language is missing.
def manageColumns(df):
    mp = {}
    for number, lang in enumerate(df[2]):
        if df[0][number] not in mp:
            mp[df[0][number]] = []
        mp[df[0][number]].append((lang, number))
    for i in mp.copy():
        if len(mp[i]) > 1:
            # Reset for every field; otherwise one match would leak into later fields
            rem_flag = False
            for j in mp[i]:
                if j[0] == pref_lang:
                    mp[i].remove(j)
                    rem_flag = True
                    break
            if not rem_flag:
                mp[i].pop(0)
        else:
            del mp[i]
    # Whatever is left in mp are the duplicate rows to drop
    for i in mp:
        for j in mp[i]:
            df = df.drop(j[1], axis=0)
    return df

# Extract a one-row data frame from the HTML metadata table of one work page
def parseDataFromHtmlTablePage(pageText):
    ldf = pd.read_html(pageText.text, header=None, flavor='bs4')
    df = ldf[0]
    df = manageColumns(df)
    df = df.transpose()
    df.columns = df.iloc[0]
    if "dc.type" not in df.columns:
        print("Type not specified.")
        return pd.DataFrame()
    # Drop the field-name row (0) and the language row (2); row 1 holds the values
    df = df.drop(0, axis=0)
    df = df.drop(2, axis=0)

    if str(df['dc.type'][1]).lower() not in work_get:
        return pd.DataFrame()

    # Keep only the fields that are renamed below (column labels, hence axis=1)
    df = df.drop([c for c in df.columns if c not in newColumns], axis=1)
    df.rename(columns=newColumns, inplace=True)

    # Data which are not in the DSpace metadata table
    df["faculty"] = BeautifulSoup(pageText.text, "html.parser").find_all("ul",
                        {"class": "breadcrumb hidden-xs"})[0].find_all("li")[1].get_text().strip()

    if dPeople:
        df["supervisor_degree"] = people.getDegree(df['supervisor'][1], df['faculty'][1])
        df["rewiever_degree"] = people.getDegree(df['rewiever'][1], df['faculty'][1])
    return df

# Data frame with all data
data_all = pd.DataFrame(columns = ['supervisor', 'author', 'issued', 'uri', 'language', 'subject', 'title', 'type', 
                  'acceptedDate', 'rewiever', 'discipline', 'department', 
                                   'programme', 'faculty', 'supervisor_degree', 'rewiever_degree'])

firstPage = requests.get(urlMain.format(urlDist), data)
soup = BeautifulSoup(firstPage.text, "html.parser")
pages = int(soup.find("li", {"class": "last-page-link"}).find("a").get_text())
print("Downloaded first page. Pages with works:", pages, flush=True)

sumTime = 0
# page to start from
fromPage = 1
# go over all pages
for pg in range(fromPage, pages+1):

    data['page'] = pg
    page = requests.get(urlMain.format(urlDist), data)
    soup = BeautifulSoup(page.text, "html.parser")
    
    # go over all items on page
    t1 = time.time()
    for i in soup.findAll("div", {"class": "row ds-artifact-item "}):
        one = requests.get(urlMain.format(i.find("a").get("href")), {'show' : 'full'})
        if one.status_code != 200:
            print("Can't reach the work page. Continuing...")
            continue
        
        df = parseDataFromHtmlTablePage(one)
        if df.shape[0] == 0:
            continue
        if data_all.shape[0] == 0:
            data_all = df.copy()
        else:
            data_all = pd.concat([data_all,df], ignore_index=True, sort=False)
        
    if data_all.shape[0] == 0:
        continue
    # Lower-case the thesis type
    data_all['type'] = data_all['type'].str.lower()
    
    #data_all['acceptedDate'] =  pd.to_datetime(data_all['acceptedDate'], format='%Y-%m-%d')
    
    # Accumulate the elapsed time and report progress.
    # After every page (up to 100 works) the data frame is flushed to csv
    # so that a crash does not lose the data downloaded so far.
    sumTime += time.time()-t1
    
    print("Page:", pg, "/", pages, flush=True)
    print(sumTime, pg, flush=True)
    if os.path.isfile("tmp.csv"):
        data_all.to_csv("tmp.csv", mode='a', sep=',', header=False)
    else:
        data_all.to_csv("tmp.csv", mode='a', sep=',', header=True)
        
    data_all = data_all.iloc[0:0]
       
    print("Remaining :", (sumTime/(pg+1-fromPage))*(pages-pg), flush=True)
    
if dPeople: people.end()
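
The language selection done by `manageColumns` can also be expressed with pandas operations only. The following is a minimal sketch on toy data, assuming the metadata table has been given the hypothetical column names `key`, `value`, and `lang`; the function name `keep_preferred_language` is likewise made up for illustration.

```python
import pandas as pd


def keep_preferred_language(rows, pref_lang="eng"):
    """Keep one row per key: the first one in pref_lang, else the first row."""
    ranked = rows.assign(_not_pref=rows["lang"] != pref_lang)
    # Stable sort: preferred-language rows first, original order otherwise
    ranked = ranked.sort_values("_not_pref", kind="mergesort")
    kept = ranked.drop_duplicates(subset="key", keep="first")
    # Restore the original row order and remove the helper column
    return kept.sort_index().drop(columns="_not_pref")


# Toy metadata rows: field name, value, language code
rows = pd.DataFrame({
    "key":   ["dc.title", "dc.title", "dc.type", "dc.subject", "dc.subject"],
    "value": ["Rodinný dům", "Family house", "bachelor thesis", "dům", "stavba"],
    "lang":  ["cze", "eng", "cze", "cze", "cze"],
})
result = keep_preferred_language(rows)
```

Here `dc.title` keeps its English row, `dc.type` is untouched because it occurs once, and `dc.subject` falls back to its first row because no English variant exists.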



### Data visualization
...
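
One of the assigned views, the number of defended theses per year split by thesis type, could be computed along the following lines. This is a hedged sketch on toy rows, not the notebook's actual result: only the column names `type`, `acceptedDate`, and `Year` come from the csv described above, and the dates are invented for the example.

```python
import pandas as pd

# Toy rows; the real data would come from the downloaded csv
theses = pd.DataFrame({
    "type": ["bachelor thesis", "bachelor thesis", "master's thesis",
             "master's thesis", "bachelor thesis", "master's thesis"],
    "acceptedDate": ["2017-06-20", "2018-06-28", "2018-06-28",
                     None, "2018-06-15", "2018-06-15"],
})
theses["acceptedDate"] = pd.to_datetime(theses["acceptedDate"], format="%Y-%m-%d")
theses["Year"] = theses["acceptedDate"].dt.year

# Rows without a defence date cannot be assigned to a year, so they are left out
counts = (
    theses.dropna(subset=["Year"])
          .astype({"Year": int})
          .groupby(["Year", "type"])
          .size()
          .unstack(fill_value=0)
)
```

The resulting table has one row per year and one column per thesis type, which is a convenient shape for a grouped or stacked bar chart. Adding `faculty` to the grouping keys would give the per-faculty variant.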

In [47]:
#import numpy as np
#import sklearn as skit
#import matplotlib.pyplot as plt
#import seaborn as sns


dataCVUT = pd.read_csv('test.csv', index_col=0)
dataCVUT= dataCVUT.reset_index(drop=True)
dataCVUT['acceptedDate'] =  pd.to_datetime(dataCVUT['acceptedDate'], format='%Y-%m-%d')
dataCVUT['Year'] = dataCVUT['acceptedDate'].dt.year
display(dataCVUT.head())
display(dataCVUT.groupby(['title']).size().sort_values(ascending=False))
display(dataCVUT.groupby(['supervisor_degree']).size().sort_values(ascending=False))

Unnamed: 0,supervisor,author,uri,language,subject,title,type,acceptedDate,rewiever,discipline,department,programme,faculty,supervisor_degree,rewiever_degree,Year
0,Ryjáček Pavel,Ogden Gary,http://hdl.handle.net/10467/78426,ENG,"bridge,steel,strengthening,CFRP,historical",Strengthening of Steel Heritage Bridges,master's thesis,NaT,Kolpaský Ludvík,Advanced Masters in Structural Analysis of Mon...,katedra ocelových a dřevěných konstrukcí,Civil Engineering,Fakulta stavební,,,
1,Jíra Aleš,Tomanec Martin,http://hdl.handle.net/10467/78286,CZE,"tooth,enamel,nanoindentation,afm,micromechanic...",External environment effect on the tooth ename...,bachelor thesis,2018-06-28,Tesárek Pavel,Konstrukce a dopravní stavby,katedra mechaniky,Stavební inženýrství,Fakulta stavební,,,2018.0
2,Pospíchal Václav,Kompas Michal,http://hdl.handle.net/10467/78291,CZE,"construction techniques,low temperature,tempor...",Cold weather construction techniques,bachelor thesis,2018-06-28,Čermák Jan,Realizace pozemních a inženýrských staveb,katedra technologie staveb,Stavitelství,Fakulta stavební,,,2018.0
3,Šulc Rostislav,Sokolová Kristýna,http://hdl.handle.net/10467/78311,CZE,"Fly ash,agglomerate,fly ash stock-pile,propert...",Properties of deposited fly ash from coal powe...,bachelor thesis,2018-06-28,Peterová Adéla,Realizace pozemních a inženýrských staveb,katedra technologie staveb,Stavitelství,Fakulta stavební,,,2018.0
4,Šejnoha Michal,Pavelcová Veronika,http://hdl.handle.net/10467/78283,CZE,"uderground structure,earthquake,dynamic analys...",Evaluation of real underground structure subje...,bachelor thesis,2018-06-28,Šejnoha Jiří,Konstrukce a dopravní stavby,katedra mechaniky,Stavební inženýrství,Fakulta stavební,,,2018.0


title
Family house                                                                                                                                87
Family house in Jizera mountains                                                                                                            13
Family house covered with Soil                                                                                                               9
Family house Hostivař                                                                                                                        8
Family House                                                                                                                                 8
Hotel****                                                                                                                                    8
Liteň Castle Grounds                                                                                                                    

supervisor_degree
 Ing.,  Ph.D.                        731
 Ing.                                316
 doc. Ing.,  Ph.D.                   249
 Ing. arch.                          137
 doc. Ing. arch.                     104
 doc. Ing.,  CSc.                    100
 prof. Ing. arch.                     97
 Ing.,  CSc.                          71
 Ing. arch.,  Ph.D.                   61
 prof. Ing.,  CSc.                    60
 doc. Ing. arch.,  CSc.               40
 prof. Akad. arch.                    32
 MgA.                                 30
 doc. Dr. Ing.                        27
 PhDr.,  Ph.D.                        26
 Mgr.,  Ph.D.                         25
 doc. RNDr. Ing.,  Ph.D.              24
 doc. Ing. arch.,  Ph.D.              24
 Ing. Bc.,  Ph.D.                     20
 prof. Ing. arch. Akad. arch.         17
 RNDr.,  Ph.D.                        14
 MgA.,  Ph.D.                         11
 prof. Ing. arch.,  Hon. FAIA         10
 Ph.D.                                1
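
The raw `supervisor_degree` strings combine several titles, so counting theses per individual title needs one more step. The sketch below is only an illustration of that step: it uses three rows copied from the output above, and the marker list and the helper `has_title` are assumptions made for the example. Titles are matched as whitespace-separated tokens, so "Ing. arch." would also count as "Ing.".

```python
# Three rows copied from the supervisor_degree output: raw string -> number of theses
degree_counts = {
    " Ing.,  Ph.D.": 731,
    " Ing.": 316,
    " doc. Ing.,  Ph.D.": 249,
}

# Hypothetical selection of titles to report on
MARKERS = ["prof.", "doc.", "Ph.D.", "CSc.", "Ing."]


def has_title(raw, marker):
    """Return True if the raw degree string contains the title as a separate token."""
    tokens = raw.replace(",", " ").split()
    return marker in tokens


# Sum the thesis counts of every raw string that contains the given title
per_title = {}
for marker in MARKERS:
    per_title[marker] = sum(
        count for raw, count in degree_counts.items() if has_title(raw, marker)
    )
```

For these three rows, `per_title["Ph.D."]` adds up the first and third row, and `per_title["Ing."]` adds up all three.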