# Webmining - Assignment 0

This **Home Assignment** is to be submitted and you will be given points for each of the tasks. It familiarizes you with basics of *statistics* and basics of the *sklearn* package as well as the general setup for home assignments.
This first home assignment is shorter and also less difficult than upcoming ones.

## Formalities
**Submit in a group of 2-3 people by 11.05.2020, 23:59 CET. The deadline is strict!**

## Evaluation and Grading
General advice for programming exercises at *CSSH*:
Evaluation of your submission is done semi-automatically. Think of it as this notebook being 
executed once. Afterwards, some test functions are appended to this file and executed.

Therefore:
* Submit valid _Python3_ code only!
* Use external libraries only when specified by task.
* Ensure your definitions (functions, classes, methods, variables) follow the specification if
  given. The concrete signature of e.g. a function usually can be inferred from task description, 
  code skeletons and test cases.
* Ensure the notebook does not rely on current notebook or system state!
  * Use `Kernel --> Restart & Run All` to see if you are using any definitions, variables etc. that 
    are not in scope anymore.
  * Double check if your code relies on presence of files or directories other than those mentioned
    in given tasks. Tests run under Linux, hence don't use Windows style paths 
    (`some\path`, `C:\another\path`). Also, use only paths that are relative to and within your
    working directory (OK: `some/path`, `./some/path`; NOT OK: `/home/alice/python`, 
    `../../python`).
* Keep your code idempotent! Running it or parts of it multiple times must not yield different
  results. Minimize usage of global variables.
* Ensure your code / notebook terminates in reasonable time.
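
The idempotence rule above can be illustrated with a minimal sketch. The names `collected`, `collect_bad` and `collect_good` are hypothetical and not part of the assignment; they only contrast a function that depends on global state with one that does not.

```python
# Hypothetical illustration; none of these names belong to the assignment.
collected = []  # global state that survives between calls


def collect_bad(values):
    # Appends to the global list, so every call changes the next result.
    for v in values:
        collected.append(v * 2)
    return collected


def collect_good(values):
    # Builds a fresh local list, so repeated calls give the same result.
    return [v * 2 for v in values]


first = collect_good([1, 2])
second = collect_good([1, 2])
print(first == second)  # True: same input, same output

collect_bad([1, 2])
print(len(collect_bad([1, 2])))  # 4: the second call sees leftovers from the first
```

Re-running a notebook cell that contains `collect_bad` keeps growing `collected`, which is the kind of hidden state that `Restart & Run All` exposes.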

**There's a story behind each of these points! Don't expect us to fix your stuff!**

Regarding the scores, you will get no points for a task if your function:
- throws an unexpected error (e.g. takes the wrong number of arguments)
- gets stuck in an infinite loop
- takes much, much longer than expected (e.g. >1s to compute the mean of two numbers)
- does not produce the desired output (e.g. returns a list sorted in descending order even though we asked for ascending, returns the mean and the std even though we asked for only the mean, or prints an output instead of returning it!)
- ...
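
The difference between printing and returning is easy to miss in a notebook, because both show a value on screen. A minimal sketch, using the hypothetical functions `mean_printed` and `mean_returned`:

```python
# Hypothetical example functions, for illustration only.
def mean_printed(a, b):
    print((a + b) / 2)  # shows the value, but the function returns None


def mean_returned(a, b):
    return (a + b) / 2  # hands the value back to the caller


result_printed = mean_printed(1, 2)
result_returned = mean_returned(1, 2)
print(result_printed is None)  # True
print(result_returned)         # 1.5
```

A test that compares the return value with an expected number can only succeed for the second variant.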



### Isolation
Functions that are expected to run in isolation are marked with an **[Isolation] Warning**. For these, you are additionally **not** allowed to:
- do imports of any kind (also _not_ from the Python standard library)
- use imported stuff (e.g. import numpy somewhere, now use numpy)
- call other functions that you have defined (when you write a variance function you are not allowed to call your previously defined mean function)
- use other global variables/names

Think of these functions as running in a separate script that is not allowed to use import statements of any kind.
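
As a minimal sketch of what isolation could look like, the hypothetical function `my_variance` below recomputes the mean inline instead of calling a separate mean function. It assumes that builtins such as `len` and `sum` are permitted, which the rules above do not forbid.

```python
# Hypothetical example; `my_variance` is not one of the graded functions.
def my_variance(nums):
    # No imports and no calls to other user-defined helpers:
    # the mean is recomputed inline with builtins only.
    n = len(nums)
    mean = sum(nums) / n
    return sum((x - mean) ** 2 for x in nums) / n


print(my_variance([1, 2, 3, 4]))  # 1.25
```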

In [1]:
# credentials of all team members (you may add or remove entries from the list)
team_members = [
    {
        'first_name': 'Alice',
        'last_name': 'Foo',
        'student_id': 12345
    },
    {
        'first_name': 'Bob',
        'last_name': 'Bar',
        'student_id': 54321
    }
]

## Task 1
To refresh your knowledge on basic statistics we are going to implement mean, mode, median and standard deviation. All these functions should leave the input argument intact.

**[Isolation] Warning: We expect all functions for this task to work in isolation!**
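
Leaving the input intact matters mostly when a function needs a sorted copy of the data. A small sketch with the hypothetical helpers `smallest_mutating` and `smallest_safe` shows the difference between `list.sort()` and `sorted()`:

```python
# Hypothetical helper names, for illustration only.
def smallest_mutating(nums):
    nums.sort()  # sorts in place: the caller's list is reordered
    return nums[0]


def smallest_safe(nums):
    ordered = sorted(nums)  # sorted() returns a new list
    return ordered[0]


data = [3, 1, 2]
smallest_safe(data)
print(data)  # [3, 1, 2]: unchanged

smallest_mutating(data)
print(data)  # [1, 2, 3]: the input was modified
```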


### 1a) Mean (0.5 points)
Write a function `my_mean` that takes a list of numeric values and returns the mean.


### 1b) Std (0.5 points)
Write a function `my_std` that takes a list of numeric values and returns the standard deviation. Divide by n and not by n-1.
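
Dividing by n gives what is usually called the population standard deviation. Written out, with $x_1, \dots, x_n$ for the list entries and $\mu$ for their mean (notation introduced here, not taken from the task):

$$
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}
$$

For example, for the list $[1, 2, 3, 4.5]$ the mean is $\mu = 2.625$, the squared deviations sum to $6.6875$, and $\sigma = \sqrt{6.6875 / 4} \approx 1.293$.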


### 1c) Mode (1.0 points)
Write a function `my_mode` that takes a list and returns the mode.
If there is no unique mode, raise a `ValueError`.
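
Raising an exception is different from returning or printing an error message: a caller can only detect it with `try`/`except`. The sketch below shows one way such a check could be written. Both `picky` and `raises_value_error` are hypothetical stand-ins, and the actual test functions used for grading are not known.

```python
# Hypothetical stand-ins; `picky` only mimics a function that raises.
def picky(values):
    if len(values) != 1:
        raise ValueError('expected exactly one value')
    return values[0]


def raises_value_error(func, arg):
    # True only if calling func(arg) raises ValueError.
    try:
        func(arg)
    except ValueError:
        return True
    return False


print(raises_value_error(picky, [1, 2]))  # True
print(raises_value_error(picky, [1]))     # False
```

A function that catches its own `ValueError` and returns it would make this check report `False`.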


### 1d) Median (0.5 points)
Write a function `my_median` that takes a list of numeric values and returns the median.

In [1]:
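def my_mean(nums):
    # an empty list has no mean; raise instead of printing and returning None
    if not nums:
        raise ValueError('cannot compute the mean of an empty list')
    sum_num = 0
    for num in nums:
        sum_num += num
    return sum_num/len(nums)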

In [3]:
#test my_mean
print(my_mean([1,2,3,4.5]))

2.625


In [6]:
def my_std(nums):
    mean = sum(nums)/len(nums)
    var = 0
    for num in nums:
        var += (num - mean)**2
    return (var/(len(nums)))**0.5

In [7]:
#test my_std
print(my_std([1,2,3,4.5]))

1.2930100540985752


In [93]:
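def my_mode(list_value):
    # count how often each value occurs
    hash_table = {}
    for v in list_value:
        if v in hash_table:
            hash_table[v] += 1
        else:
            hash_table[v] = 1
    # an empty list has no mode (max() would fail on an empty sequence)
    if not hash_table:
        raise ValueError('no unique mode')
    # compute the highest count once instead of once per key
    max_count = max(hash_table.values())
    mode = [key for key, value in hash_table.items() if value == max_count]
    # let the ValueError propagate to the caller, as the task requires
    if len(mode) != 1:
        raise ValueError('no unique mode')
    return mode[0]

In [94]:
#test my_mode
print(my_mode(['a','b','a']))
try:
    my_mode([1,2,3,4])
except ValueError as e:
    print('ValueError:', e)

a
ValueError: no unique mode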


In [34]:
def my_median(nums):
    n = len(nums)
    sort_nums = sorted(nums)
    return sort_nums[n//2] if n%2==1 else (sort_nums[n//2-1]+sort_nums[n//2])/2
    

In [39]:
#test my_median
print(my_median([1,2,3,4]))
my_median([1,2,3,4,5,6,7])

2.5


4
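
The test outputs above can be cross-checked against Python's standard `statistics` module. This is a sketch for checking only: the graded Task 1 functions must run in isolation and therefore cannot use these imports themselves.

```python
import math
import statistics

# Reference values from the standard library, for checking only;
# the graded functions themselves must not import anything.
data = [1, 2, 3, 4.5]

ref_mean = statistics.mean(data)
ref_std = statistics.pstdev(data)        # population std: divides by n
ref_sample_std = statistics.stdev(data)  # sample std: divides by n-1
ref_median = statistics.median([1, 2, 3, 4])

print(math.isclose(ref_mean, 2.625))              # True
print(math.isclose(ref_std, 1.2930100540985752))  # True
print(ref_sample_std > ref_std)                   # True
print(ref_median)                                 # 2.5
```

Note that `pstdev`, not `stdev`, matches the "divide by n" convention asked for in 1b.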

## Task 2:
In this task we will explore basic classifiers and the sklearn package.
### 2a) Preprocessing (1 point)
Write a function ```preprocess```. It takes no input.

It does:

- read the credit_g dataset into a pandas dataframe. The file is located in the same folder as the notebook and is called ```credit-g.csv```
- compute the boolean target vector (True if 'class' is 'good')
- remove the target column from the dataframe
- convert the categorical variables to numeric ones using pd.get_dummies
- perform an (80/20) train/test split using sklearn.model_selection.train_test_split with a seed of 123456
- return the results of the train/test split in order
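
The target and encoding steps listed above can be tried on a toy dataframe first. The column names `amount` and `housing` below are made up for illustration; the real columns of `credit-g.csv` may differ.

```python
import pandas as pd

# Toy data with made-up column names; the real credit-g.csv columns may differ.
toy = pd.DataFrame({
    'amount': [100, 200, 300],
    'housing': ['own', 'rent', 'own'],
    'class': ['good', 'bad', 'good'],
})

# boolean target vector: True where 'class' is 'good'
target = [label == 'good' for label in toy['class']]
# drop the target column, then one-hot encode the categorical column
features = pd.get_dummies(toy.drop(columns=['class']))

print(target)                  # [True, False, True]
print(list(features.columns))  # ['amount', 'housing_own', 'housing_rent']
```

Numeric columns pass through `pd.get_dummies` unchanged, while each categorical column is replaced by one indicator column per distinct value.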


### 2b) Train linear SVM classifier (0.5 points)
Write a function ```train_LinearSVM_classifier``` that trains a Linear Support Vector classifier.

It takes two arguments, the first one is the train dataset, the second the target array. It returns the trained classifier.
Use the linear support vector classifier from sklearn with a seed of 123456.


### 2c) Train logistic regression classifier (0.5 points)
Write a function ```train_LogisticRegression_classifier``` that trains a logistic regression classifier.

It takes two arguments, the first one is the train dataset, the second the target array. It returns the trained classifier.
Use the logistic regression classifier from sklearn with a seed of 123456.

### 2d) Evaluate the results  (1 point)
Write a function ```get_scores``` that computes the precision, recall, accuracy and F1 scores.
It takes three arguments. The first one is a trained classifier, the second one is the test dataset to evaluate the classifier on, the third is the ground truth target vector.
The function returns a dictionary like this:

```
{'accuracy' : accuracy,
 'recall' : recall,
 'precision' : precision,
 'F1' : F1}
 ```
 
**[Isolation] Warning! We expect this function (2d) to work in isolation**
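
Because of the isolation requirement, the four scores have to be derived from the counts of true/false positives and negatives without importing `sklearn.metrics`. The sketch below shows the usual definitions, assuming `True` is treated as the positive class. The helper `scores_from_counts` and the counts passed to it are made up for illustration.

```python
# Hypothetical helper and made-up counts, for illustration only.
def scores_from_counts(tp, fp, fn, tn):
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {'accuracy': accuracy, 'recall': recall,
            'precision': precision, 'F1': f1}


scores = scores_from_counts(tp=8, fp=2, fn=4, tn=6)
print(scores['accuracy'])   # 0.7
print(scores['precision'])  # 0.8
```

This sketch does not guard against zero denominators, which occur for instance when a classifier never predicts the positive class.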



### 2e) Bringing it all together  (0.25 points each)
Write two functions: ```run_SVM``` and ```run_Log``` that use the above functions to train and evaluate an SVM classifier and a logistic regression classifier, respectively.
Each function therefore

1. loads the dataset & performs a train test split
2. trains the respective classifier
3. returns the scores dictionary

Use the functions ```preprocess```, ```train_LinearSVM_classifier```, ```train_LogisticRegression_classifier```, ```get_scores``` you defined above.

In [26]:
import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess():
    # read csv file
    df = pd.read_csv('./credit-g.csv', index_col=False)
    # compute boolean target (kept as a plain list of bools)
    target_vector = [target == 'good' for target in df['class']]
    # remove target column 'class'
    df = df.drop(columns=['class'])
    # convert categorical values to numerical values
    df = pd.get_dummies(df)
    # train/test split (80/20) with a fixed seed
    X_train, X_test, y_train, y_test = train_test_split(df, target_vector, test_size=0.2, random_state=123456)
    return X_train, X_test, y_train, y_test

In [69]:
from sklearn.svm import LinearSVC
def train_LinearSVM_classifier(X_train,y_train):
    clf = LinearSVC(random_state=123456)
    clf.fit(X_train,y_train)
    return clf

In [71]:
def get_scores(classifier, X_test, y_test):
    pred = classifier.predict(X_test)
    # count the confusion-matrix cells, treating True as the positive class
    TP, FP, FN, TN = 0, 0, 0, 0
    for p, t in zip(pred, y_test):
        if p and t:
            TP += 1
        elif p and not t:
            FP += 1
        elif not p and t:
            FN += 1
        else:
            TN += 1
    # fall back to 0.0 when a denominator is zero (e.g. no positive predictions)
    precision = TP/(TP+FP) if TP+FP else 0.0
    recall = TP/(TP+FN) if TP+FN else 0.0
    acc = (TP+TN)/(TP+FP+FN+TN)
    f1 = 2*precision*recall/(precision+recall) if precision+recall else 0.0
    return {'accuracy' : acc,
            'recall' : recall,
            'precision' : precision,
            'F1' : f1}

In [73]:
from sklearn.linear_model import LogisticRegression
def train_LogisticRegression_classifier(X_train,y_train):
    clf = LogisticRegression(random_state=123456)
    clf.fit(X_train,y_train)
    return clf

In [91]:
def run_Log():
    X_train, X_test, y_train, y_test = preprocess()
    log_classifier = train_LogisticRegression_classifier(X_train,y_train)
#     print('scores:', log_classifier.score(X_test,y_test))
    scores = get_scores(log_classifier,X_test,y_test)
    return scores

def run_SVM():
    X_train, X_test, y_train, y_test = preprocess()
    SVM_classifier = train_LinearSVM_classifier(X_train,y_train)
#     print('scores:', SVM_classifier.score(X_test,y_test))
    scores = get_scores(SVM_classifier,X_test,y_test)
    return scores

In [92]:
#test case
print(run_Log())
print(run_SVM())

{'accuracy': 0.735, 'recall': 0.8382352941176471, 'precision': 0.7862068965517242, 'F1': 0.8113879003558719}
{'accuracy': 0.35, 'recall': 0.051470588235294115, 'precision': 0.875, 'F1': 0.09722222222222222}


