## IMPORTANT
## Please note: Due date 2021-12-17 23:59

## Exercise 05.01: Subgroup Discovery


Part 1: Revisit subgroup discovery, in particular the pysubgroup implementation
* pysubgroup: https://github.com/flemmerich/pysubgroup
* Further details about implementation: https://link.springer.com/chapter/10.1007/978-3-030-10997-4_46

Part 2: Apply pysubgroup
* Use the credit-g dataset: https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
* Perform the analysis with respect to three target variables: (1) class, (2) credit amount, (3) age
* Distinguish between nominal and numeric targets in your analysis; test discretization for the numeric targets and compare the results
* To do this, preprocess the data: discretize attributes as needed
* Experiment with several quality functions: Briefly compare the results, and discuss your findings
* Experiment with different algorithms (beam search, SD-Map), and compare their runtimes. What do you observe?
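
The discretization step in Part 2 can be done in more than one way. As a minimal sketch, the following contrasts equal-width bins (`pd.cut`) with equal-frequency bins (`pd.qcut`). The values, the bin count, and the name `ages` are illustrative assumptions and are not taken from the credit-g data.

```python
import pandas as pd

# Hypothetical values standing in for a numeric attribute such as age.
ages = pd.Series([19, 22, 25, 27, 31, 35, 38, 44, 52, 67, 70, 75])

# Equal-width: every bin spans the same value range, so counts may differ.
equal_width = pd.cut(ages, bins=3)

# Equal-frequency: every bin holds roughly the same number of rows.
equal_freq = pd.qcut(ages, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

Which variant is preferable probably depends on how skewed the attribute is; for a skewed attribute, equal-width bins can leave the upper bins nearly empty.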

In [33]:
import numpy as np
import pandas as pd
from time import time

# ... add here
import pysubgroup as ps

data = pd.read_csv("./german_with_names.data",delim_whitespace=True)

# restrict the search space to 'age' and 'credit_amount': every other column is ignored
s1 = set(data.columns)
s2 = set(['age', 'credit_amount'])
s3 = s1-s2

# discretize a numeric column into equal-width bins labelled e.g. "20 to 39"
def to_bin_label(column, width):
    lower = (column // width) * width
    return lower.astype(str) + " to " + (lower + width - 1).astype(str)

# whole-column assignment avoids the chained-indexing SettingWithCopyWarning
data["age"] = to_bin_label(data["age"], 20)
data["duration"] = to_bin_label(data["duration"], 20)
data["credit_amount"] = to_bin_label(data["credit_amount"], 2000)

target = ps.BinaryTarget(target_attribute = "class",target_value = 1)
searchspace = ps.create_selectors(data, ignore=list(s3))
task = ps.SubgroupDiscoveryTask (
    data, 
    target, 
    searchspace, 
    result_set_size= 10, 
    depth=4, 
    qf=ps.WRAccQF())

#execute algorithm and measure its runtime
t0 = time()
result = ps.BeamSearch().execute(task)
t_beamsearch = time() - t0

print("BeamsearchTime: ", t_beamsearch)

print("Discretized")

#print results
for quality, rule in result.to_descriptions():
    print("quality: ", quality, "\trule: ", rule)
    

print("undiscretized")
data2 = pd.read_csv("./german_with_names.data",delim_whitespace=True)
#undiscretized data
target2 = ps.BinaryTarget(target_attribute = "class",target_value = 1)
searchspace2 = ps.create_selectors(data2, ignore=list(s3))
task2 = ps.SubgroupDiscoveryTask(
    data2, 
    target2, 
    searchspace2, 
    result_set_size= 10, 
    depth=4, 
    qf=ps.WRAccQF())
result2 = ps.BeamSearch().execute(task2)
for quality, rule in result2.to_descriptions():
    print("quality: ", quality, "\trule: ", rule)

BeamsearchTime:  0.01293182373046875
Discretized
quality:  0.02260000000000002 	rule:  credit_amount=='2000 to 3999'
quality:  0.013100000000000007 	rule:  age=='20 to 39' AND credit_amount=='2000 to 3999'
quality:  0.012700000000000003 	rule:  age=='40 to 59' AND credit_amount=='0 to 1999'
quality:  0.009400000000000014 	rule:  age=='40 to 59'
quality:  0.008600000000000033 	rule:  credit_amount=='0 to 1999'
quality:  0.008300000000000004 	rule:  age=='40 to 59' AND credit_amount=='2000 to 3999'
quality:  0.0028000000000000013 	rule:  age=='60 to 79' AND credit_amount=='0 to 1999'
quality:  0.002300000000000003 	rule:  age=='60 to 79'
quality:  0.0012000000000000003 	rule:  age=='60 to 79' AND credit_amount=='2000 to 3999'
quality:  0.00030000000000000003 	rule:  age=='60 to 79' AND credit_amount=='12000 to 13999'
undiscretized
quality:  0.012600000000000012 	rule:  credit_amount: [1262:1908[
quality:  0.011300000000000011 	rule:  age: [36:45[
quality:  0.010300000000000014 	rule:  ag



In [32]:

data3 = pd.read_csv("./german_with_names.data",delim_whitespace=True)
#undiscretized data
target3 = ps.BinaryTarget(target_attribute = "class",target_value = 1)
searchspace3 = ps.create_selectors(data3, ignore=list(s3))

# run the same beam search once per quality function
quality_functions = [
    ("WRAccQF", ps.WRAccQF()),
    ("CHiSquared", ps.ChiSquaredQF()),
    ("SimpleBinomialQF", ps.SimpleBinomialQF()),
    ("LiftQF", ps.LiftQF()),
]

for name, qf in quality_functions:
    print(name)
    task3 = ps.SubgroupDiscoveryTask(
        data3, 
        target3, 
        searchspace3, 
        result_set_size= 5, 
        depth=2, 
        qf=qf)
    result3 = ps.BeamSearch().execute(task3)
    for quality, rule in result3.to_descriptions():
        print("quality: ", quality, "\trule: ", rule)

WRAccQF
quality:  0.012600000000000012 	rule:  credit_amount: [1262:1908[
quality:  0.011300000000000011 	rule:  age: [36:45[
quality:  0.010300000000000014 	rule:  age>=45
quality:  0.0103 	rule:  age>=45 AND credit_amount: [1262:1908[
quality:  0.008000000000000007 	rule:  credit_amount: [2859:4736[
CHiSquared
quality:  18.601190476190478 	rule:  credit_amount>=4736
quality:  16.36808069556608 	rule:  age<26
quality:  12.8485077491919 	rule:  age>=45 AND credit_amount: [1262:1908[
quality:  10.847107438016526 	rule:  age: [26:30[ AND credit_amount>=4736
quality:  10.847107438016526 	rule:  age<26 AND credit_amount>=4736
SimpleBinomialQF
quality:  0.05086807422745387 	rule:  age>=45 AND credit_amount: [1262:1908[
quality:  0.0288675134594813 	rule:  age>=45 AND credit_amount: [2859:4736[
quality:  0.028034632047869067 	rule:  credit_amount: [1262:1908[
quality:  0.024600119446402134 	rule:  age: [36:45[
quality:  0.022974136442391765 	rule:  age>=45
LiftQF
quality:  0.2512195121951219
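
Three of the quality functions compared above can be read as members of one family, (n/N)^a * (p - p0), where n is the subgroup size, N the dataset size, p the share of the target value in the subgroup and p0 its share in the whole dataset. The sketch below assumes that `WRAccQF`, `SimpleBinomialQF` and `LiftQF` correspond to a = 1, a = 0.5 and a = 0; that mapping is an assumption about pysubgroup and should be checked against the installed version. The function name `standard_quality` and the numbers are hypothetical.

```python
import math

def standard_quality(n_subgroup, n_dataset, p_subgroup, p_dataset, a):
    """Size-weighted gain in target share: (n/N)**a * (p - p0)."""
    return (n_subgroup / n_dataset) ** a * (p_subgroup - p_dataset)

# Hypothetical subgroup: 200 of 1000 rows, 75% positives vs. 70% overall.
n, N, p, p0 = 200, 1000, 0.75, 0.70

wracc = standard_quality(n, N, p, p0, a=1.0)     # weighted relative accuracy
binomial = standard_quality(n, N, p, p0, a=0.5)  # simple binomial
lift = standard_quality(n, N, p, p0, a=0.0)      # subgroup size is ignored

print(wracc, binomial, lift)
```

Under this reading, a smaller exponent puts less weight on subgroup size, so with a = 0 very small subgroups with a high target share can reach the top of the ranking. This may explain why the rankings differ between quality functions.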

In [38]:
#Beamsearch
#time
data3 = pd.read_csv("./german_with_names.data",delim_whitespace=True)
#undiscretized data
target3 = ps.BinaryTarget(target_attribute = "class",target_value = 1)
searchspace3 = ps.create_selectors(data3, ignore=list(s3))
task3 = ps.SubgroupDiscoveryTask(
    data3, 
    target3, 
    searchspace3, 
    result_set_size= 5, 
    depth=2, 
    qf=ps.WRAccQF())

t0 = time()

result3 = ps.BeamSearch().execute(task3)

t1 = time()
t_beamsearch = t1-t0

for quality, rule in result3.to_descriptions():
    print("quality: ", quality, "\trule: ", rule)

print("Beamsearch: ",t_beamsearch)



#Apriori
#time
data3 = pd.read_csv("./german_with_names.data",delim_whitespace=True)
#undiscretized data
target3 = ps.BinaryTarget(target_attribute = "class",target_value = 1)
searchspace3 = ps.create_selectors(data3, ignore=list(s3))
task3 = ps.SubgroupDiscoveryTask(
    data3, 
    target3, 
    searchspace3, 
    result_set_size= 5, 
    depth=2, 
    qf=ps.WRAccQF())

t0 = time()

result3 = ps.Apriori().execute(task3)

t1 = time()
t_apriori = t1-t0

for quality, rule in result3.to_descriptions():
    print("quality: ", quality, "\trule: ", rule)

print("Apriori: ",t_apriori)

quality:  0.012600000000000012 	rule:  credit_amount: [1262:1908[
quality:  0.011300000000000011 	rule:  age: [36:45[
quality:  0.010300000000000014 	rule:  age>=45
quality:  0.0103 	rule:  age>=45 AND credit_amount: [1262:1908[
quality:  0.008000000000000007 	rule:  credit_amount: [2859:4736[
Beamsearch:  0.003988981246948242
Apriori: Using numba for speedup
10
quality:  0.012600000000000012 	rule:  credit_amount: [1262:1908[
quality:  0.011300000000000011 	rule:  age: [36:45[
quality:  0.010300000000000014 	rule:  age>=45
quality:  0.0103 	rule:  age>=45 AND credit_amount: [1262:1908[
quality:  0.008000000000000007 	rule:  credit_amount: [2859:4736[
Apriori:  8.717403411865234
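
A single wall-clock measurement can include one-time costs. The output above mentions numba, so part of the roughly 8.7 s of the second run might be compilation rather than search; this is an assumption and was not verified here. A minimal sketch for separating the first run from later runs follows. The helper `time_runs` and the stand-in `workload` are hypothetical names.

```python
from time import perf_counter

def time_runs(fn, repeats=3):
    """Call fn() `repeats` times and return the duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = perf_counter()
        fn()
        durations.append(perf_counter() - start)
    return durations

# Hypothetical stand-in workload; in the notebook this could be
# lambda: ps.BeamSearch().execute(task3)
def workload():
    return sum(i * i for i in range(100_000))

durations = time_runs(workload, repeats=3)
print("first run:", durations[0])
print("best run: ", min(durations))
```

Comparing the first run with the best run gives a rough idea of how much of a measurement is warm-up cost.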




In [3]:
data2

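| Unnamed: 0 | ex_account | duration | credit_history | purpose | credit_amount | savings_account | present_employment | installment_rate | personal_status | debtors | ... | property | age | other_installment | housing | existing_credits | job | liable_people | telephone | foreign_worker | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A11 | 6 | A34 | A43 | 1169 | A65 | A75 | 4 | A93 | A101 | ... | A121 | 67 | A143 | A152 | 2 | A173 | 1 | A192 | A201 | 1 |
| 1 | A12 | 48 | A32 | A43 | 5951 | A61 | A73 | 2 | A92 | A101 | ... | A121 | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 2 |
| 2 | A14 | 12 | A34 | A46 | 2096 | A61 | A74 | 2 | A93 | A101 | ... | A121 | 49 | A143 | A152 | 1 | A172 | 2 | A191 | A201 | 1 |
| 3 | A11 | 42 | A32 | A42 | 7882 | A61 | A74 | 2 | A93 | A103 | ... | A122 | 45 | A143 | A153 | 1 | A173 | 2 | A191 | A201 | 1 |
| 4 | A11 | 24 | A33 | A40 | 4870 | A61 | A73 | 3 | A93 | A101 | ... | A124 | 53 | A143 | A153 | 2 | A173 | 2 | A191 | A201 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | A14 | 12 | A32 | A42 | 1736 | A61 | A74 | 3 | A92 | A101 | ... | A121 | 31 | A143 | A152 | 1 | A172 | 1 | A191 | A201 | 1 |
| 996 | A11 | 30 | A32 | A41 | 3857 | A61 | A73 | 4 | A91 | A101 | ... | A122 | 40 | A143 | A152 | 1 | A174 | 1 | A192 | A201 | 1 |
| 997 | A14 | 12 | A32 | A43 | 804 | A61 | A75 | 4 | A93 | A101 | ... | A123 | 38 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 1 |
| 998 | A11 | 45 | A32 | A43 | 1845 | A61 | A73 | 4 | A93 | A101 | ... | A124 | 23 | A143 | A153 | 1 | A173 | 1 | A192 | A201 | 2 |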


## Exercise 05.02: Reading/Discussion/Summary

Part 1: Reading:
* Read the following paper: Paulheim (2016) "Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods"
* The paper is available here: http://www.semantic-web-journal.net/system/files/swj1167.pdf   <br>
  (It is also available in the "files/exercises" course folder)

Part 2: Think about the following questions:
* Why are Knowledge Graphs useful?
* How are they constructed (with examples)?
* What is knowledge graph refinement, and how does it work?
* What is knowledge graph completion, and how does it work?
* How do you evaluate those techniques?

Part 3: Discussion, Summary
* Prepare answers to these questions for the practical session on December 14, 2021. You will first discuss them in groups, and then we will discuss them in the plenary meeting.
* After that, summarize your findings (and those of the group discussion) in a short report (max. half a DIN A4 page). For example, you could write 2-3 sentences to answer a specific question.
* Please note that the due date for handing in the assignment is December 17, 23:59 (!)

## Uploading your solution
To hand in your solution, please upload two files:
* The Jupyter notebook file (.ipynb)
* An easily readable PDF (printout/file) of the Jupyter notebook (.pdf) that includes the Python source code
* IMPORTANT: Please add your name (example: MartinAtzmueller) as a suffix to the file names, e.g.:<br>
  KBS-Assignment4_MartinAtzmueller.ipynb, KBS-Assignment4_MartinAtzmueller.pdf