# Homework nr. 1 - data visualization (deadline 25/10/2018)

In short, the main task is to download data on theses defended at CTU from the Internet, store them in a pandas DataFrame, and then visualize information hidden in the data.
  
> The instructions are deliberately not given in detail: it is up to you to come up with ideas on how to fulfill the particular tasks as best you can. Thinking about how to visualize the data is an important part of data visualization! ;)

## What you are supposed to do

  1. Browse the website https://dspace.cvut.cz/?locale-attribute=en and find out how to download data on Bachelor and Master theses.
  2. Download or scrape the data such that for each thesis you know the following:
    * Faculty name, department name, thesis title, thesis type (bachelor/master), supervisor name, reviewer name, year (or date) of the defence, study programme and discipline, link to a webpage with details.
  3. Store these data in one _csv_ file (should be handed in along with this notebook).
  4. Use tools available for Python to plot charts and tables to visualize/display this information:
    * Number of defended theses per year for CTU/Faculties. Distinguish the type of thesis.
    * Find the departments/study programmes/supervisors/reviewers with the highest numbers of theses and come up with some nice plots and tables to depict their numbers.
    * Mean/median/minimum/maximum number of supervised theses per year for faculties.
    * Number (or fraction) of theses supervised by people with various degrees (Bc./Ing./Ph.D./ ...).
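
As a possible starting point for the first chart in task 4, the sketch below counts theses per year and thesis type. It assumes a DataFrame with `acceptedDate` and `type` columns (the names this notebook uses for the scraped data); the sample rows and the function names `theses_per_year` and `plot_theses_per_year` are made up for illustration.

```python
import pandas as pd

# Made-up sample rows, only to make the sketch runnable
sample = pd.DataFrame({
    "acceptedDate": pd.to_datetime([
        "2017-06-20", "2017-06-21", "2018-06-27", "2018-06-27", "2018-01-30",
    ]),
    "type": [
        "bachelor thesis", "master's thesis", "bachelor thesis",
        "bachelor thesis", "master's thesis",
    ],
})

def theses_per_year(df):
    """Return a table of counts with years as rows and thesis types as columns."""
    year = df["acceptedDate"].dt.year.rename("year")
    return df.groupby([year, "type"]).size().unstack(fill_value=0)

def plot_theses_per_year(counts):
    """Draw one group of bars per year, one bar per thesis type."""
    ax = counts.plot(kind="bar")
    ax.set_xlabel("Year of defence")
    ax.set_ylabel("Number of theses")
    return ax

counts = theses_per_year(sample)
print(counts)
```

Calling `plot_theses_per_year(counts)` in a notebook cell draws the chart. Filtering the DataFrame by faculty before calling `theses_per_year` would give the per-faculty variant of the same view.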

**If you do all this properly, you will obtain 6 points**

To earn **two extra points**, you can do some of these:
  * Use http://beakerx.com to make your notebook interactive in a meaningful way.
  * Come up with some other reasonable and interesting views of data.
  * Use your data to create an interactive webpage (HTML + JavaScript).

## Comments

  * Please follow the instructions from https://courses.fit.cvut.cz/MI-PDD/homeworks/index.html.
  * If the reviewing teacher is not satisfied, they may give you another chance to rework your homework and obtain more points.

In [137]:
# Imports

import numpy as np
import pandas as pd
import sklearn as skit
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup
from selenium import webdriver


In [146]:
# Download data - this may take several minutes.
#                 You can edit the number of pages downloaded.
#                 Theses are processed from the newest to the oldest.


# Base DSpace URL used to find bachelor (BP) and master (DP) theses
urlMain = 'https://dspace.cvut.cz{}'
# Path of the search form
urlDist = '/discover'
# Query parameters selecting the results page to download
data = {
    'rpp' : '5',
    'etal' : '0', 
    'group_by' : 'none', 
    'page' : '0',
    'sort_by' : 'dc.date.issued_dt',
    'order' : 'desc'}

# Preferred language
pref_lang = "eng"
work_get = ["bachelor thesis", "master's thesis"]

# Some metadata fields appear once per language. For each duplicated field keep a single
# row: the one in the preferred language if present, otherwise the first one listed.
def manageColumns(df):
    mp = {}
    # Map each field name to the (language, row number) pairs in which it occurs
    for number, lang in enumerate(df[2]):
        if df[0][number] not in mp:
            mp[df[0][number]] = []
        mp[df[0][number]].append((lang, number))
    # Reduce mp to the rows that should be dropped
    for i in mp.copy():
        if len(mp[i]) > 1:
            # Reset for every field; otherwise one match would leak into later fields
            rem_flag = False
            for j in mp[i]:
                if j[0] == pref_lang:
                    mp[i].remove(j)
                    rem_flag = True
                    break
            if not rem_flag:
                mp[i].pop(0)
        else:
            del mp[i]
    for i in mp:
        for j in mp[i]:
            df = df.drop(j[1], axis=0)
    return df

# Parse the full-metadata HTML table of one thesis page into a one-row data frame
def parseDataFromHtmlTablePage(pageText):
    ldf = pd.read_html(pageText.text,header = None, flavor = 'bs4')
    df = ldf[0]
    df = manageColumns(df)
    df = df.transpose()
    df.columns = df.iloc[0]
    if ("dc.type" not in df.columns):
        print("No dc.type field found.")
        return pd.DataFrame()
    df = df.drop(0, axis = 0)
    df = df.drop(2, axis = 0)
   
    if (df['dc.type'][1].lower() not in work_get):
        return pd.DataFrame()
    df = df.drop(['dc.date.accessioned', 'dc.date.available', 'dc.date.issued', 'dc.identifier', 
                  'dc.description.abstract', 'dc.publisher', 'dc.rights'  ], axis = 1)
    df.columns = ['supervisor', 'author', 'uri', 'language', 'subject', 'title', 'type', 
                  'acceptedDate', 'rewiever', 'discipline', 'department', 'programme']
    
    
    # Data that are not part of the DSpace metadata table
    df["faculty"] = BeautifulSoup(pageText.text, "html.parser").find_all("ul", 
                        {"class": "breadcrumb hidden-xs"})[0].find_all("li")[1].get_text().strip()
    
   
    firstPage = requests.get('https://usermap.cvut.cz/search', {"query" : df['supervisor'][1]})
    table = BeautifulSoup(firstPage.text, "html.parser").find('table', 
    {'id' : "search-results-table"})
    table_body = table.find('tbody')
    
    #driver = webdriver.Chrome()
    
    print(table_body)
    print(firstPage.url)
    print(pageText.url)
    print(df['department'][1])
    df["supervisor_degree"] = None
    df["rewiever_degree"] = None
    return df

# Data frame collecting all downloaded theses
data_all = pd.DataFrame()

firstPage = requests.get(urlMain.format(urlDist), data)
soup = BeautifulSoup(firstPage.text, "html.parser")
pages = int(soup.find("li", {"class": "last-page-link"}).find("a").get_text())
print("Download first page. Pages with works:", pages)

# go over all pages
for pg in range(pages+1):
    if (pg%100 == 0):
        print("Page:", pg, "/", pages)
    pg = 10  # debug override: always fetch page 10; remove to walk through all pages
    data['page'] = pg
    page = requests.get(urlMain.format(urlDist), data)
    soup = BeautifulSoup(page.text, "html.parser")
    
    # go over all items on page
    for i in soup.findAll("div", {"class": "row ds-artifact-item "}):
        
        one = requests.get(urlMain.format(i.find("a").get("href")), {'show' : 'full'})
        if one.status_code != 200:
            print("Can't reach the thesis page. Skipping.")
            continue
            
        df = parseDataFromHtmlTablePage(one)
        if df.shape[0] == 0:
            continue
        if data_all.shape[0] == 0:
            data_all = df.copy()
        else:
            data_all = pd.concat([data_all,df], ignore_index=True)
        
    data_all['type'] = data_all['type'].str.lower()
    data_all['acceptedDate'] =  pd.to_datetime(data_all['acceptedDate'], format='%Y-%m-%d')
    print(data_all.info())
    break  # debug: stop after the first page; remove for a full download

data_all.to_csv("test.csv", sep=',')


Download first page. Pages with works: 6462
Page: 0 / 6462
None
https://usermap.cvut.cz/search?query=Posp%C3%ADchal+V%C3%A1clav
https://dspace.cvut.cz/handle/10467/78144?show=full
katedra technologie staveb
None
https://usermap.cvut.cz/search?query=%C4%8Cejka+Tom%C3%A1%C5%A1
https://dspace.cvut.cz/handle/10467/78143?show=full
katedra konstrukcí pozemních staveb
None
https://usermap.cvut.cz/search?query=Posp%C3%ADchal+V%C3%A1clav
https://dspace.cvut.cz/handle/10467/78146?show=full
katedra technologie staveb
None
https://usermap.cvut.cz/search?query=Ryj%C3%A1%C4%8Dek+Pavel
https://dspace.cvut.cz/handle/10467/78147?show=full
katedra ocelových a dřevěných konstrukcí
None
https://usermap.cvut.cz/search?query=R%C5%AF%C5%BEi%C4%8Dka+Jan
https://dspace.cvut.cz/handle/10467/78148?show=full
katedra konstrukcí pozemních staveb
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 15 columns):
supervisor           5 non-null object
author               5 non-null 

In [88]:
dataCVUT = pd.read_csv('test.csv', index_col=0)
display(dataCVUT.head())
dataCVUT.groupby(['title']).size().sort_values(ascending=False)

Unnamed: 0,supervisor,author,uri,language,subject,title,type,acceptedDate,rewiever,discipline,department,programme,faculty,supervisor_title,rewiever_title
0,Pospíchal Václav,Sedloň Zbyněk,http://hdl.handle.net/10467/78144,CZE,"technological standard,schedule of constructio...",Building - technological project - Apartments ...,,2018-06-27,Šulc Rostislav,"Příprava, realizace a provoz staveb",katedra technologie staveb,Stavební inženýrství,,,
1,Čejka Tomáš,Petáková Belinda,http://hdl.handle.net/10467/78143,CZE,"stable,moisture remediation,structural and tec...",Jílové u Prahy Stables - moisture remediation,bachelor thesis,2018-06-27,Novák Michal,Konstrukce pozemních staveb,katedra konstrukcí pozemních staveb,Stavební inženýrství,,,
2,Pospíchal Václav,Střelbová Lenka,http://hdl.handle.net/10467/78146,CZE,"Technical suspervision of the builder,reconstr...",The activities of the technical supervision du...,,2018-06-27,Štorc Vojtěch,"Příprava, realizace a provoz staveb",katedra technologie staveb,Stavební inženýrství,,,
3,Ryjáček Pavel,Stejskal Jakub,http://hdl.handle.net/10467/78147,CZE,"continuously welded rail,CWR,interaction,track...",The application of the DFF300 for the bridges ...,,2018-06-27,Stančík Vojtěch,Konstrukce a dopravní stavby,katedra ocelových a dřevěných konstrukcí,Stavební inženýrství,,,
4,Růžička Jan,Široký Martin,http://hdl.handle.net/10467/78148,CZE,"apartment building,Resby,structure design,envi...",Structure design of apartment building in opti...,,2018-06-27,Veselka Jakub,Konstrukce pozemních staveb,katedra konstrukcí pozemních staveb,Stavební inženýrství,,,


title
The application of the DFF300 for the bridges with the CWR                         1
The activities of the technical supervision during reconstruction of water tank    1
Structure design of apartment building in options using BIM                        1
Jílové u Prahy Stables - moisture remediation                                      1
Building - technological project - Apartments building Zátiší                      1
dtype: int64
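
A possible next step toward task 4 is sketched below: ranking supervisors by number of theses and computing per-faculty yearly statistics. It assumes the `supervisor`, `faculty` and `acceptedDate` columns of `test.csv`; the sample rows and the function names `top_supervisors` and `yearly_stats_per_faculty` are made up for illustration. The statistics follow one reading of the task (theses per year within each faculty), which may not be the intended one.

```python
import pandas as pd

# Made-up rows for illustration; only the column names follow test.csv
sample = pd.DataFrame({
    "supervisor": ["Supervisor A", "Supervisor A", "Supervisor B",
                   "Supervisor C", "Supervisor C", "Supervisor C", "Supervisor D"],
    "faculty": ["Faculty A", "Faculty A", "Faculty A",
                "Faculty B", "Faculty B", "Faculty B", "Faculty B"],
    "acceptedDate": pd.to_datetime([
        "2017-06-20", "2017-06-21", "2018-06-27",
        "2018-06-27", "2018-06-28", "2018-01-30", "2017-02-01",
    ]),
})

def top_supervisors(df, n=10):
    """Return the n supervisors with the most theses, highest count first."""
    return df["supervisor"].value_counts().head(n)

def yearly_stats_per_faculty(df):
    """Return mean/median/min/max of the yearly thesis counts for each faculty."""
    year = df["acceptedDate"].dt.year.rename("year")
    per_year = df.groupby(["faculty", year]).size()
    return per_year.groupby(level="faculty").agg(["mean", "median", "min", "max"])

top = top_supervisors(sample, n=3)
stats = yearly_stats_per_faculty(sample)
print(top)
print(stats)
```

With the real data, the same calls would take `dataCVUT` after converting `acceptedDate` with `pd.to_datetime`, since `read_csv` loads that column as plain strings by default.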