# HW 1 Data mining - Scraper
## Author : Lukáš Bíro

## Table of contents:
   * [Importing libraries](#Importing-libraries)
   * [Defining Scraper class](#Defining-Scraper-class)
   * [Preprocessing](#Preprocessing)
   * [Displaying datasets](#Displaying-datasets)
   * [Final saving of both dataframes to csv files](#Final-saving-of-both-dataframes-to-csv-files)

## Importing libraries

In [4]:
import unicodedata
import requests
import math
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

## Defining Scraper class

In [5]:
class Scraper:
    '''Helpers for scraping specific tables from the volby.cz website.

    Preprocessing that is common to multiple tables is part of the class; the rest is done outside it.'''
    
    def __init__(self, link):
        self.link = link
        self.soup = self.create_soup()
        self.datalist = None
    
    def create_soup (self):
        r = requests.get(self.link)
        return BeautifulSoup(r.text, 'lxml')

    # Scrapes every table row and does very basic preprocessing
    def scrape(self):
        datalist = self.soup.find_all('tr')
        # One string per row, then one list item per line of row text
        datalist = [i.text.strip() for i in datalist]
        datalist = [i.split("\n") for i in datalist]
        # Drop the first element of each data row to avoid double indexing
        for i in datalist[2:]:
            del i[0]
        self.datalist = datalist
    
    # Main preprocessing: separates titles, adds the year and the voter turnout, and stores the result as a dataframe.
    # The candidates table (kand=True) and the parties table (kand=False) are handled differently.
    # stlpce is the list of column names for the resulting dataframe.
    def preprocess(self, year, stlpce, kand=False):
        if kand:
            # For every year except 2002, take the third space-separated token
            # of the name field as the title and insert it as its own column
            if year != 2002:
                for i in self.datalist[2:]:
                    try:
                        i.insert(3, i[2].split(' ')[2])
                    except IndexError:
                        i.insert(3, np.nan)
            table = pd.DataFrame(self.datalist[2:], columns=stlpce)
        else:
            # Voter turnout is read from the summary row above the party rows
            vol_ucast = self.datalist[2][6]
            table = pd.DataFrame(self.datalist[5:], columns=stlpce)
            table['Vol. účast v %'] = vol_ucast

        table['Rok'] = year
        self.table = table
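
The row splitting in `scrape` relies on each table cell sitting on its own line in the page source. The sketch below illustrates this on a small made-up table (the markup is an assumption, not the real volby.cz HTML, and `html.parser` is used only so the example does not need `lxml`). It also shows a cell-based extraction that does not depend on line breaks, as one possible alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one cell per line, as the newline split expects
html = (
    "<table>\n"
    "<tr>\n<th>Číslo</th>\n<th>Strana</th>\n<th>Hlasy</th>\n</tr>\n"
    "<tr>\n<td>1</td>\n<td>Pravý Blok</td>\n<td>958</td>\n</tr>\n"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")

# Same approach as Scraper.scrape: row text, stripped, split on newlines
rows_by_newline = [tr.text.strip().split("\n") for tr in soup.find_all("tr")]

# Alternative: read each header/data cell directly, independent of line breaks
rows_by_cell = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find_all("tr")
]

print(rows_by_newline)
print(rows_by_cell)
```

On this toy input both approaches give the same rows; they would differ if a cell contained a line break or if several cells shared one line.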

## Preprocessing

In [6]:
# Main preprocessing part of the code; see the comments below for details.

# Column names for the candidates table and the parties table
stlpce_kand = ['Kandidátní listina', 'Poř.číslo', 'Příjmení, jméno', 'Titul', 'Věk', 'Navrh.strana', 'Polit.přísl.', 'Absolutní hlasy', 'Hlasy v %', 'Pořadí', 'Mandát']
stlpce_strany = ['Kandidátní listina', 'Absolutní hlasy', 'Hlasy v %', 'Počet kandidátů', 'Přepočtený základ dle počtu kandidátů', 'Přepočtené % plat. hlasů', 'Počet mandátů', 'Podíly hlasů']

tables_kand = []
tables_strany = []

# The main loop: scrape the data for each election year and collect the resulting tables
years = [2002, 2006, 2010, 2014, 2018]
for year in years:
    link_kand = f'https://www.volby.cz/pls/kv{year}/kv21111?xjazyk=CZ&xid=1&xv=11&xdz=3&xnumnuts=4102&xobec=554961&xstrana=0'
    link_strany = f'https://volby.cz/pls/kv{year}/kv1111?xjazyk=CZ&xid=1&xdz=3&xnumnuts=4102&xobec=554961'
    scraper_kand = Scraper(link_kand)
    scraper_strany = Scraper(link_strany)
    scraper_kand.scrape()
    scraper_strany.scrape()
    scraper_kand.preprocess(year, stlpce_kand, kand=True)
    scraper_strany.preprocess(year, stlpce_strany)
    tables_kand.append(scraper_kand.table)
    tables_strany.append(scraper_strany.table)

# Concatenate all years into one dataframe per table type
dataset_kand = pd.concat(tables_kand, ignore_index=True)
dataset_strany = pd.concat(tables_strany, ignore_index=True)

# Remove thousands separators (any whitespace) before converting the vote counts to int
dataset_kand['Absolutní hlasy'] = dataset_kand['Absolutní hlasy'].apply(lambda x: int(''.join(x.split())))
#dataset['Příjmení, jméno'] = dataset['Příjmení, jméno'].apply(lambda x: x.split(' ')[0] + )

# I could not figure out how to adjust the candidate data directly in the pandas dataframe,
# so I convert it to lists, modify the rows, and build the dataframe again.
datalist = dataset_kand.values.tolist()
# Clean up titles and turn blank cells into np.nan. There is an encoding problem that I was not able to avoid any other way.
# I realize this is not an elegant solution.
for i in datalist:
    # Keep the title only if it contains a period; otherwise treat it as missing
    if not (isinstance(i[3], str) and '.' in i[3]):
        i[3] = np.nan
    # The rank must be convertible to an integer; otherwise treat it as missing
    try:
        int(i[9])
    except (TypeError, ValueError):
        i[9] = np.nan
    # Only '*' marks a mandate; anything else is treated as missing
    if i[10] != '*':
        i[10] = np.nan

dataset_kand = pd.DataFrame(datalist, columns = stlpce_kand + ['Rok'])

# The parties table has the same problem: modify the rows as lists and put everything back into a dataframe
dataset_strany = dataset_strany.drop(dataset_strany.index[13])
data_strany = dataset_strany.values.tolist()
for i in data_strany:
    # Only 'X' is a valid marker in 'Podíly hlasů'; anything else is treated as missing
    if i[7] != 'X':
        i[7] = np.nan
dataset_strany = pd.DataFrame(data_strany, columns = stlpce_strany + ['Vol. účast v %', 'Rok'])
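
The list round-trip above could probably be avoided with column-wise pandas operations. The following is only a sketch on a made-up three-row frame (the values are invented, and `\xa0`, a non-breaking space, is an assumed guess at the separator behind the encoding problem); it mirrors the same rules: strip separators from vote counts, keep titles that contain a period, keep numeric ranks, and keep only `*` as the mandate marker.

```python
import pandas as pd

# Toy data; values are invented for illustration only
df = pd.DataFrame({
    "Titul": ["Ing.", "\xa0", ""],
    "Absolutní hlasy": ["419", "3\xa0158", "12 345"],
    "Pořadí": ["4", "-", "11"],
    "Mandát": ["*", "\xa0", ""],
})

# str.split() with no argument splits on any whitespace, including \xa0
df["Absolutní hlasy"] = df["Absolutní hlasy"].map(lambda s: int("".join(s.split())))

# Keep a title only if it contains a literal period; other values become NaN
has_period = df["Titul"].str.contains(".", regex=False, na=False)
df["Titul"] = df["Titul"].where(has_period)

# Non-numeric ranks become NaN
df["Pořadí"] = pd.to_numeric(df["Pořadí"], errors="coerce")

# Only '*' marks a mandate; anything else becomes NaN
df["Mandát"] = df["Mandát"].where(df["Mandát"] == "*")

print(df)
```

One difference from the loop: `pd.to_numeric` converts the rank column to floats, whereas the loop leaves valid ranks as the original strings.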

## Displaying datasets

In [7]:
#First few rows of the candidates dataset
dataset_kand.head()

| | Kandidátní listina | Poř.číslo | Příjmení, jméno | Titul | Věk | Navrh.strana | Polit.přísl. | Absolutní hlasy | Hlasy v % | Pořadí | Mandát | Rok |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Česká pravice | 9 | Adámek Petr | | 54 | ČP | BEZPP | 419 | 2.41 | | | 2002 |
| 1 | Karlovarská koalice | 19 | Andrejkivová Pavla | | 56 | KDU-ČSL | KDU-ČSL | 3158 | 3.14 | 4.0 | * | 2002 |
| 2 | Karlovarská koalice | 20 | Antonik Jozef | | 58 | VPM | BEZPP | 2588 | 2.57 | 11.0 | | 2002 |
| 3 | Komunistická str.Čech a Moravy | 14 | Aubrecht Miroslav | | 73 | KSČM | KSČM | 1858 | 2.58 | 8.0 | | 2002 |
| 4 | Strana zelených | 9 | Balák Libor | | 41 | SZ | BEZPP | 1229 | 4.35 | 2.0 | | 2002 |


In [8]:
#First few rows of the parties dataset
dataset_strany.head()

| | Kandidátní listina | Absolutní hlasy | Hlasy v % | Počet kandidátů | Přepočtený základ dle počtu kandidátů | Přepočtené % plat. hlasů | Počet mandátů | Podíly hlasů | Vol. účast v % | Rok |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pravý Blok | 958 | 0.19 | 11 | 149 073.15 | 0.64 | 0 | | 34.12 | 2002 |
| 1 | Strana za životní jistoty | 906 | 0.18 | 7 | 94 864.73 | 0.95 | 0 | | 34.12 | 2002 |
| 2 | Česká pravice | 17 360 | 3.37 | 38 | 514 980.00 | 3.37 | 0 | | 34.12 | 2002 |
| 3 | Dem.K.V.-S.ODA, N.P.a Zak.č.US | 24 320 | 4.72 | 38 | 514 980.00 | 4.72 | 0 | | 34.12 | 2002 |
| 4 | Komunistická str.Čech a Moravy | 71 746 | 13.93 | 38 | 514 980.00 | 13.93 | 6 | X | 34.12 | 2002 |


## Final saving of both dataframes to csv files

In [9]:
dataset_strany.to_csv('dataset_strany.csv')
dataset_kand.to_csv('dataset_kand.csv')
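
By default `to_csv` also writes the dataframe index as an unnamed first column, which `read_csv` then loads back as `Unnamed: 0`. If the index is not needed in the files, passing `index=False` avoids that. A small sketch on a made-up frame, written to in-memory buffers rather than to the real output files:

```python
import io
import pandas as pd

# Toy frame; values are invented for illustration only
df = pd.DataFrame({"Rok": [2002, 2006], "Hlasy v %": [2.41, 3.14]})

# Default behaviour: the index is written as an extra, unnamed column
buf_with_index = io.StringIO()
df.to_csv(buf_with_index)
buf_with_index.seek(0)
reread_with_index = pd.read_csv(buf_with_index)

# index=False writes only the data columns
buf_without_index = io.StringIO()
df.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
reread_without_index = pd.read_csv(buf_without_index)

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```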

I apologize for mixing English and Slovak in the variable names; I am not very creative with naming.