# KBS - Assignment 1

# Numpy

An interesting resource: https://ad-wiki.informatik.uni-freiburg.de/teaching/NumpyCheatSheet

I also recommend to have a look at these resources:
* http://www.numpy.org/
* https://www.scipy.org/scipylib/
* http://wiki.scipy.org/Numpy_Example_List
* https://numpy.org/doc/stable/user/quickstart.html
* https://docs.scipy.org/doc/scipy/reference/tutorial/


### Numpy arrays
Arrays can be created in various ways:

In [None]:
import numpy as np

x = np.array([2,3,1,0])
print(x)

x = np.array([2, 3, 1, 0])
print(x)

x = np.array([[1,2.0],[0,0],(1+1j,3.)])
print(x)

x = np.array([[ 1.+0.j, 2.+0.j], [ 0.+0.j, 0.+0.j], [ 1.+1.j, 3.+0.j]]) 
print(x)

### Creating "prefilled" arrays

In [None]:
import numpy as np

x = np.zeros((2, 3))
print(x)

x = np.ones((2, 3))
print(x)

x = np.arange(10)
print(x)

x = np.arange(2, 10, dtype=np.float)
print(x)

x = np.arange(2, 3, 0.1)
print(x)

x = np.linspace(1., 4., 6)
print(x)

x = np.random.random((2,3))
print(x)

x = np.diag([1,2,3])
print(x)

### reading from a file ==> np.getfromtxt et al.

### Numpy datatypes

In [None]:
import numpy as np

x = np.float32(1.0)
print(x)

x = np.int_([1,2,4])
print(x)

x = np.arange(3, dtype=np.uint8)
print(x)
print(x.dtype)


### Array operations
* Basic operations apply element-wise. The result is a new array with the resulting elements.
* Operations like *= and += will modify the existing array.


In [None]:
import numpy as np

a = np.arange(5)
b = np.arange(5) 
print(a+b)

print(a-b)

print(a**2)

print(a>3)

print(10*np.sin(a))

a = np.zeros(4).reshape(2,2)
a[0,0] = 1
a[1,1] = 1
b = np.arange(4).reshape(2,2) 

print(a*b) 

# dot product (!)
print(np.dot(a,b))


## Exercise 01.01
Create a 1D array. Then, convert this 1D array to a 2D array with 2 rows.

In [6]:
import numpy as np

a = np.arange(10)

# reshape the array and assign it to a
a = a.reshape(2, 5)

# Then, the expected output is
# array([[0, 1, 2, 3, 4],
#        [5, 6, 7, 8, 9]])
a

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

## Exercise 01.02
Import the iris dataset into a numpy array, keeping the correct data format.

In [18]:
import numpy as np

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.loadtxt(url, delimiter=",", dtype=bytes)
iris = iris[0:3].astype(object)

iris
# this should result in
# array([[b'5.1', b'3.5', b'1.4', b'0.2', b'Iris-setosa'],
#        [b'4.9', b'3.0', b'1.4', b'0.2', b'Iris-setosa'],
#        [b'4.7', b'3.2', b'1.3', b'0.2', b'Iris-setosa']], dtype=object)

array([[b'5.1', b'3.5', b'1.4', b'0.2', b'Iris-setosa'],
       [b'4.9', b'3.0', b'1.4', b'0.2', b'Iris-setosa'],
       [b'4.7', b'3.2', b'1.3', b'0.2', b'Iris-setosa']], dtype=object)

## Exercise 01.03
Create an array containing the text column species from iris dataset of the last exercise.

In [22]:
import numpy as np

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

iris = np.loadtxt(url, delimiter=",", dtype=bytes)
species = iris[:, 4]  # access 5th row
species[:5].astype('|S18')
# The output should look like
#> (150,)
#> array([b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa',
#>        b'Iris-setosa'],
#>       dtype='|S18')

array([b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa',
       b'Iris-setosa'], dtype='|S18')

## Exercise 01.04
Start with the iris dataset - read into an array.
Then, create bins of the petal length, i.e. the third column of the iris dataset to form a text array, such that if petal length is:
* Less than 3 –> ‘small’
* 3-5 –> ‘medium’
* ‘>=5 –> ‘large’

In [None]:
import numpy as np

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# ...

# Pandas - Intro

Look at https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html and try out the examples in this notebook below.

In [23]:
from IPython.display import IFrame

IFrame(src="https://pandas.pydata.org/pandas-docs/version/1.3/user_guide/10min.html", width=1024, height=400)

### Pandas - Advanced Recipes

Have a look at https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook ; integrate and play with the code in this notebook below.

In [25]:
from IPython.display import IFrame

IFrame(src="https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook", width=1024, height=400)


In [None]:
# ...


## Exercise 01.05
Given the series *a_series* reshape it into a dataframe with 7 rows and 5 columns.

In [42]:
import numpy as np
import pandas as pd

a_series = pd.Series(np.random.randint(1, 10, 35))

df = pd.DataFrame(a_series.values.reshape(7, 5))
print(df)

# Example expected output
#    0  1  2  3  4
# 0  7  8  3  6  6
# 1  8  8  7  9  3
# 2  8  2  1  8  1
# 3  8  9  6  7  8
# 4  5  5  6  9  6
# 5  9  9  7  9  9
# 6  6  7  3  1  6

   0  1  2  3  4
0  5  4  9  3  7
1  3  4  4  2  2
2  5  6  6  4  2
3  3  6  8  1  8
4  7  7  8  6  5
5  7  1  5  5  1
6  5  4  9  5  5


## Exercise 01.06
Perform normalization of a dataframe *df* in two ways:
* (1) Normalize all columns of the dataframe *df* by subtracting the column mean and divide by the standard deviation.
* (2) Normalize all columns of *df* such that the minimum value in each column is 0 and max is 1.

In [75]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))
# (1)
df = df.astype(float, copy=False)
means = df.mean(axis=0)
for i in range (8):
    for j in range(10):
        df.values[i][j] = (df.values[i][j] - means[j]) / std

print(df)

# (2)
mins = df.min(axis=0)
maxs = df.max(axis=0)
diffs = maxs-mins
for i in range (8):
    for j in range(10):
        df.values[i][j] = (df.values[i][j] - mins[j]) / diffs[j]

print(df)


          0         1         2         3         4         5         6  \
0 -1.217723  0.560245  0.898244 -1.176051 -1.064928 -1.523311  0.588026   
1 -0.995477  0.597286 -0.990846 -1.139010  0.824162 -1.226983 -0.189835   
2 -1.402927 -0.328739 -0.620437  0.416711  1.639064  1.365886 -0.337999   
3 -0.439862 -0.180575 -0.509314 -1.250133 -0.250027 -0.634327  0.810272   
4  1.856680 -0.625067 -0.768601  2.157638 -0.509314 -1.078819  0.291698   
5  0.819532  0.004630  1.416818  0.787121 -0.435232  1.699255 -1.004737   
6  1.634434 -1.106600 -0.916764  1.305695  0.601916  0.921395  0.958436   
7 -0.254657  1.078819  1.490900 -1.101969 -0.805642  0.476903 -1.115860   

          7         8         9  
0  0.375040  0.976956  1.611283  
1 -0.995477  1.199202 -0.722299  
2 -0.106493  0.013890 -0.981586  
3 -0.588026 -1.727036  0.574135  
4  1.634434  1.273284 -1.685365  
5 -1.588132  1.569612 -0.314848  
6  1.671475 -1.986323  1.018627  
7 -0.402821 -1.319585  0.500053  
          0       

## Exercise 01.07
* Read the paper "Martin Atzmueller, Joachim Baumeister, and Frank Puppe. Semi-Automatic Learning of Simple Diagnostic Scores utilizing Complexity Measures. Artificial Intelligence in Medicine, 37(1):19–30, 2006." (also discussed in the lecture): https://www.sciencedirect.com/science/article/pii/S0933365705000862
* Implement a data structure (i.e., a class) for a diagnostic profile - relating to the description in the paper. Obviously, you need to distinguish between diagnoses and (common) attributes/findings.
* The class should have an (internal) method build which takes a dataframe, a list of diagnoses, and list of findings (in the form of the respective attributes) and constructs the according diagnostic profile object. You can, for example, implement this as a static method of the class, or use the respective instance method being called from the constructor of the class (for instantiating the object)
* For this class, also implement a method prune(...) which takes an integer for removing infrequent findings
* Test your implementation, via reading in and applying your implementation on this dataset: 
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer
* Here, you can consider the class attribute for the different diagnoses (i.e., consider the individual values of the class attribute as such).

In [2]:
import numpy as np
import pandas as pd

class DiagnosticProfile:
    