# Python II - Assignment 1

This **Home Assignment** is to be submitted and you will be given points for each of the tasks. It familiarizes you with basics of *statistics* and basics of the *sklearn* package as well as the general setup for home assignments.
This first home assignment is shorter and also less difficult than upcoming ones.

## Formalities
**Submit in a group of 2-3 people by 01.06.2020 23:59 CET. The deadline is strict!**

## Evaluation and Grading
General advice for programming exercises at *CSSH*:
Evaluation of your submission is done semi-automatically. Think of it as this notebook being 
executed once. Afterwards, some test functions are appended to this file and executed as well.

Therefore:
* Submit valid _Python3_ code only!
* Use external libraries only when specified by task.
* Ensure your definitions (functions, classes, methods, variables) follow the specification if
  given. The concrete signature of e.g. a function usually can be inferred from task description, 
  code skeletons and test cases.
* Ensure the notebook does not rely on current notebook or system state!
  * Use `Kernel --> Restart & Run All` to see if you are using any definitions, variables etc. that 
    are not in scope anymore.
  * Double check if your code relies on presence of files or directories other than those mentioned
    in given tasks. Tests run under Linux, hence don't use Windows style paths 
    (`some\path`, `C:\another\path`). Also, use paths only that are relative to and within your
    working directory (OK: `some/path`, `./some/path`; NOT OK: `/home/alice/python`, 
    `../../python`).
* Keep your code idempotent! Running it or parts of it multiple times must not yield different
  results. Minimize usage of global variables.
* Ensure your code / notebook terminates in reasonable time.
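
Two of the points above (relative paths and idempotent code) can be illustrated with a minimal sketch. All names in it (`DATA_FILE`, `append_mean_bad`, `mean_good`) are hypothetical and only serve as an illustration.

```python
from pathlib import Path

# A relative path inside the working directory (hypothetical file name).
# Forward-slash style paths built with pathlib work under Linux.
DATA_FILE = Path("data") / "numbers.txt"

results = []  # global state that survives between cell executions


def append_mean_bad(values):
    """Not idempotent: every call makes the global list longer."""
    results.append(sum(values) / len(values))
    return results


def mean_good(values):
    """Idempotent: the same input always gives the same output."""
    return sum(values) / len(values)
```

Running a cell that calls `append_mean_bad` twice changes the result, whereas `mean_good` can be re-run any number of times.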

**There's a story behind each of these points! Don't expect us to fix your stuff!**

Regarding the scores, you will get no points for a task if your function:
- throws an unexpected error (e.g. takes the wrong number of arguments)
- gets stuck in an infinite loop
- takes much longer than expected (e.g. >1s to compute the mean of two numbers)
- does not produce the desired output (e.g. returns a list sorted in descending order even though we asked for ascending, returns the mean and the std even though we asked for the mean only, or only prints the output instead of returning it)
- ...

In [1]:
# credentials of all team members (you may add or remove items from the list)
team_members = [
    {
        'first_name': 'Poorya',
        'last_name': 'Khanali Satarerazleghi',
        'student_id': 381198
    },
    {
        'first_name': 'Bob',
        'last_name': 'Bar',
        'student_id': 54321
    }
]

## 1.) Using pandas (2.5 points total)

### a) Load the credit-g dataset (1)

Write a function `load_credit`. It takes no arguments. It returns a dataframe.

Assume there is a file `credit-g.csv` in the working directory and load it into a pandas dataframe. Convert all non-numeric columns to Categorical columns.
Convert the `employment` column to an ordered Categorical column. The correct order is ascending by the length of employment, with `unemployed` being the shortest.
Return this dataframe.
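
One possible way to approach this is sketched below. It rests on several assumptions that are not stated in the task: the employment labels listed in `EMPLOYMENT_ORDER`, the idea that some values in the file carry literal single quotes that should be stripped, and the optional `path` parameter, which is only added so that the sketch can be tried on other input (the function can still be called without arguments).

```python
import pandas as pd

# Assumed labels of the employment column, shortest employment first.
EMPLOYMENT_ORDER = ["unemployed", "<1", "1<=X<4", "4<=X<7", ">=7"]


def load_credit(path="credit-g.csv"):
    df = pd.read_csv(path, na_values="?")
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            # Assumption: values may be wrapped in single quotes, e.g. '<1'.
            df[col] = df[col].str.strip("'").astype("category")
    # Replace the unordered categories by the assumed ordered ones.
    df["employment"] = df["employment"].cat.set_categories(
        EMPLOYMENT_ORDER, ordered=True
    )
    return df
```

With an ordered column, comparisons such as `df["employment"] >= "1<=X<4"` and calls like `df["employment"].min()` become meaningful.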

### b) Basic information (0.5)

Write a function `basic_info` that takes a loaded and preprocessed dataframe as above as argument. It returns a dict.

The dict contains the following information for the provided dataframe:
```python
{'n_rows' : 0, #number of rows
 'n_columns' : 0, #number of columns
 'average_credit' : 0.0, # average credit_amount
 'credit_purposes' : [], # all purposes, each only once, as strings
 'fraction_good' : 0.0, # fraction of instances with 'class'==good
 'fraction_bad' : 0.0} # fraction of instances with 'class'==bad
```
Do not hard code the answers but actually compute them from the dataframe.
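
A minimal sketch of how such a dict could be computed is shown below. It assumes the column names `credit_amount`, `purpose` and `class` and the labels `good` and `bad`, as they appear in the dict description above.

```python
import pandas as pd


def basic_info(df):
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "average_credit": float(df["credit_amount"].mean()),
        # unique() keeps each purpose once; str() turns labels into plain strings
        "credit_purposes": [str(p) for p in df["purpose"].dropna().unique()],
        # the mean of a boolean column is the fraction of True values
        "fraction_good": float((df["class"] == "good").mean()),
        "fraction_bad": float((df["class"] == "bad").mean()),
    }


# Tiny hand-made frame, only for illustration.
demo = pd.DataFrame({
    "credit_amount": [100, 200, 300, 400],
    "purpose": pd.Categorical(["car", "tv", "car", "tv"]),
    "class": pd.Categorical(["good", "good", "good", "bad"]),
})
info = basic_info(demo)
```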

### c) Distribution on subsets (1)

Write a function `subset_info` that takes the same input as in b) and also returns a dict.

Return the ratio of good to bad instances for different subsets of the dataset:
```python
{'young': 0.0, # people below 40
 'old': 0.0, # people with age 40 or greater
 'male' : 0.0, # obvious
 'female' : 0.0, # obvious
 'young_male' : 0.0, # people that are young and male 
 'employed' : 0.0} # people that are employed for at least one year 
```

If you have 10 good instances and 5 bad instances, the ratio is 2.
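
The subsets can be expressed as boolean masks, as in the following sketch. It makes a few assumptions: gender is derived from `personal_status` by checking whether the text contains "female" (everything else counts as male), `employment` is an ordered Categorical with the label `1<=X<4` marking one year or more, and a subset without bad instances yields `nan`. The helper `good_bad_ratio` is hypothetical.

```python
import pandas as pd


def good_bad_ratio(subset):
    n_good = int((subset["class"] == "good").sum())
    n_bad = int((subset["class"] == "bad").sum())
    # Assumption: report nan when the ratio is undefined.
    return n_good / n_bad if n_bad else float("nan")


def subset_info(df):
    young = df["age"] < 40
    female = df["personal_status"].astype(str).str.contains("female")
    male = ~female
    # Works because employment is an ordered Categorical.
    employed = df["employment"] >= "1<=X<4"
    masks = {
        "young": young,
        "old": ~young,
        "male": male,
        "female": female,
        "young_male": young & male,
        "employed": employed,
    }
    return {name: good_bad_ratio(df[mask]) for name, mask in masks.items()}


# Tiny hand-made frame, only for illustration.
order = ["unemployed", "<1", "1<=X<4", "4<=X<7", ">=7"]
demo = pd.DataFrame({
    "age": [25, 30, 45, 50, 35, 60],
    "personal_status": ["male single", "female div/dep/mar", "male single",
                        "female div/dep/mar", "male mar/wid", "male single"],
    "employment": pd.Categorical(
        ["<1", "1<=X<4", ">=7", "unemployed", "4<=X<7", ">=7"],
        categories=order, ordered=True),
    "class": ["good", "bad", "good", "good", "bad", "good"],
})
ratios = subset_info(demo)
```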

In [2]:
def load_credit():
    import pandas as pd
    # "?" marks missing values in the file
    df = pd.read_csv("credit-g.csv", na_values="?")
    return df

In [3]:
load_credit()

Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,...,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,class
0,'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,...,'real estate',67,none,own,2,skilled,1,yes,yes,good
1,'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,...,'real estate',22,none,own,1,skilled,1,none,yes,bad
2,'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,...,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good
3,'<0',42,'existing paid',furniture/equipment,7882,'<100','4<=X<7',2,'male single',guarantor,...,'life insurance',45,none,'for free',1,skilled,2,none,yes,good
4,'<0',24,'delayed previously','new car',4870,'<100','1<=X<4',3,'male single',none,...,'no known property',53,none,'for free',2,skilled,2,none,yes,bad
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,'no checking',12,'existing paid',furniture/equipment,1736,'<100','4<=X<7',3,'female div/dep/mar',none,...,'real estate',31,none,own,1,'unskilled resident',1,none,yes,good
996,'<0',30,'existing paid','used car',3857,'<100','1<=X<4',4,'male div/sep',none,...,'life insurance',40,none,own,1,'high qualif/self emp/mgmt',1,yes,yes,good
997,'no checking',12,'existing paid',radio/tv,804,'<100','>=7',4,'male single',none,...,car,38,none,own,1,skilled,1,none,yes,good
998,'<0',45,'existing paid',radio/tv,1845,'<100','1<=X<4',4,'male single',none,...,'no known property',23,none,'for free',1,skilled,1,yes,yes,bad


In [3]:
def basic_info(df):
    # placeholder: takes the dataframe returned by load_credit
    pass


In [4]:
def subset_info(df):
    # placeholder: takes the dataframe returned by load_credit
    pass

## 2.) Visualizing (3 points total)
In this task you are required to create some visualizations. You can use the `matplotlib` and `seaborn` libraries. Please show each plot here in the notebook and save the figures. We will deduct points if figures lack labels, legends, etc. We will also deduct points if axis labels are unreadable. Titles are not required.

When talking about a figure, always start with a *small description* (1-2 sentences) of what you see. Only thereafter start explaining. Also **save** your explanation strings in identically named **files**. For example, if an explanation is to be stored in the variable `foo`, also write that string to `foo.txt`. Use the `.txt` extension.

### a) Age vs Amount (0.5)

Create a scatterplot that visualizes the distribution of class labels in the age-credit_amount space. Save the plot as `'credit_age.png'`.
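
A minimal sketch with plain `matplotlib` could look as follows. The function name `plot_credit_age`, its `path` parameter, the colors and the marker settings are assumptions made for illustration. The `Agg` backend line is only there so that the sketch also runs without a display and can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_credit_age(df, path="credit_age.png"):
    fig, ax = plt.subplots(figsize=(8, 5))
    # One scatter call per class so that each class gets a legend entry.
    for label, color in {"good": "tab:blue", "bad": "tab:red"}.items():
        part = df[df["class"] == label]
        ax.scatter(part["age"], part["credit_amount"],
                   s=12, alpha=0.6, color=color, label=label)
    ax.set_xlabel("age")
    ax.set_ylabel("credit_amount")
    ax.legend(title="class")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    return fig
```

Semi-transparent markers (`alpha`) help to see overlapping points, which matters when both classes occupy the same region.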

#### Easy to classify? (0.5)

Explain whether it is easy to classify good vs bad using only the age and credit_amount columns. Store your explanation as a string in `explanation_a`.

### b) Distribution by purpose (1.0)

Visualize the distribution of class labels by purpose and credit_amount. Do **not** use a **scatterplot**. Save the plot as `'purpose.png'`.

#### Easy to classify? (1.0)

Using the visualization from b) explain which purposes (if any) are easy to classify given the two attributes. Elaborate on the relevance of your findings.

Store your explanation as a string in `explanation_b`.

In [5]:
explanation_a = "Some explanations"
with open('explanation_a.txt', 'w') as f:
    f.write(explanation_a)

## 3.) Classification on credit-g (3.5 points total)

In this task you should experiment with different classifiers from sklearn.


### a) Preparation (0.5)

Write a function `preparation` that takes a dataframe as produced in task 1 and prepares it for use with sklearn. The required steps are:

- compute the boolean target vector (True if 'class' is 'good')
- remove the target column from the dataframe
- convert the categorical variables to numeric ones using pd.get_dummies

Thereafter return 1) the prepared dataframe and 2) the target vector.
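
The three steps can be sketched as follows. This is a minimal sketch that assumes the target column is named `class` and uses the label `good`, as stated above.

```python
import pandas as pd


def preparation(df):
    # 1) boolean target vector: True where the class is 'good'
    target = df["class"] == "good"
    # 2) remove the target column from the features
    features = df.drop(columns=["class"])
    # 3) one-hot encode the remaining categorical columns
    features = pd.get_dummies(features)
    return features, target


# Tiny hand-made frame, only for illustration.
demo = pd.DataFrame({
    "age": [25, 40, 60],
    "housing": pd.Categorical(["own", "rent", "own"]),
    "class": pd.Categorical(["good", "bad", "good"]),
})
X, y = preparation(demo)
```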

#### Talk about pd.get_dummies (0.75)

Explain what `pd.get_dummies` does. Also discuss the drawbacks and/or advantages of this method, in particular with regard to the `employment` column. Save the explanation in the variable `dummies` and also write it to a file `dummies.txt`. (The same rules for working with explanations apply as in 2.)
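
As a starting point for the discussion, the following sketch contrasts the one-hot encoding with the integer codes of an ordered Categorical. The employment labels used here are assumed, not taken from the task.

```python
import pandas as pd

# Assumed employment labels, shortest employment first.
order = ["unemployed", "<1", "1<=X<4", "4<=X<7", ">=7"]
employment = pd.Series(
    pd.Categorical(["<1", ">=7", "unemployed"], categories=order, ordered=True),
    name="employment",
)

# One indicator column per category; the order of the categories is not
# encoded in these columns.
one_hot = pd.get_dummies(employment, prefix="employment")

# Alternative: a single integer column that keeps the order.
codes = employment.cat.codes
```

Here `one_hot` has one column per category while `codes` is a single column of ranks.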

### b) Generic evaluation (1.5)

Write a function `my_eval` that takes 5 inputs: 1) a prepared dataframe, 2) the target vector, 3) a sklearn classifier class, 4) a dict of parameters used to create a classifier instance, and 5) a dict of parameters passed to the `.fit` method of the classifier.

The function instantiates a new classifier from the given class using the dict from 4). It then performs 10-fold cross-validation of this classifier on the provided dataset and target vector, passing the dict from 5) as fit parameters.

Thereafter the function returns a dict like so:
```python
{'precision': (0,1), # mean first then std
 'recall': (0,1)} # mean first then std
```
The dict contains the mean and the standard deviation of the precision and recall scores over the 10 folds.
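
One way to sketch this is an explicit loop over the folds. The choices below are assumptions, not requirements of the task: stratified folds via `StratifiedKFold`, a fresh classifier instance per fold, `zero_division=0` for folds without positive predictions, and fit parameters that are passed unchanged (array-valued ones such as `sample_weight` would need to be sliced per fold, which this sketch does not do).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold


def my_eval(X, y, classifier_class, init_params, fit_params):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=bool)
    precisions, recalls = [], []
    for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
        clf = classifier_class(**init_params)  # new instance for every fold
        clf.fit(X[train_idx], y[train_idx], **fit_params)
        predicted = clf.predict(X[test_idx])
        precisions.append(precision_score(y[test_idx], predicted, zero_division=0))
        recalls.append(recall_score(y[test_idx], predicted, zero_division=0))
    return {
        "precision": (float(np.mean(precisions)), float(np.std(precisions))),
        "recall": (float(np.mean(recalls)), float(np.std(recalls))),
    }
```

sklearn also offers `cross_validate` for this purpose; the explicit loop is used here only to make each step visible.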

### c) Application (0.75)

Experiment with different classifiers and different parameters for fitting.
As a result, provide a list of tuples. Each tuple is a triplet of a sklearn classifier class, a dict of keyword arguments passed when creating the classifier, and another dict that is passed as the last argument to the function from b).

Store 3 of these triplets as a list in the variable `my_classifiers`. Try to find one triplet with high precision and low recall, one with high recall and low precision, and one that is relatively balanced.
Avoid triplets for which the evaluation in b) takes longer than 120s.
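
The expected shape of the list is sketched below. The classifiers and parameters are arbitrary placeholders chosen for illustration; which of them shows which precision/recall trade-off has to be measured on the data.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# (classifier class, keyword arguments for the constructor, fit parameters)
my_classifiers = [
    (LogisticRegression, {"max_iter": 1000}, {}),
    (DecisionTreeClassifier, {"max_depth": 3, "random_state": 1}, {}),
    (GaussianNB, {}, {}),
]
```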

In [6]:
pass

## 4.) Classification discrimination (4 points total)

Recently there has been considerable interest in the topic of discrimination/fairness in machine learning. In this task you will explore a very *crude* example of evaluating fairness in machine learning.

### a) Preparation (1)

Write a function `prepare_fairness` that takes a loaded credit-g dataframe as in task 1.a) and prepares it for this analysis. To do so, replace the column `'personal_status'` with a column called `'gender'` that has only the two values male and female. (Replace = remove the old column and add a new one.)
Then take all the females and append a *random sample* of males, using a seed of 1, so that there are equally many males and females.
Use `pd.get_dummies` to transform categorical columns to numerical ones.
Finally return the result of a 50/50 train test split, use a seed of 1.
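
The steps above can be sketched as follows. The sketch assumes that gender can be read off `personal_status` by checking for the text "female", that there are at least as many males as females (otherwise `sample` raises an error), and that the target is `class == 'good'` with the split applied to features and target.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def prepare_fairness(df):
    is_female = df["personal_status"].astype(str).str.contains("female")
    df = df.drop(columns=["personal_status"])  # drop returns a new frame
    df["gender"] = pd.Categorical(
        is_female.map({True: "female", False: "male"}),
        categories=["female", "male"],
    )
    females = df[df["gender"] == "female"]
    # Random sample of males with the same size as the female group.
    males = df[df["gender"] == "male"].sample(n=len(females), random_state=1)
    balanced = pd.concat([females, males])
    target = balanced["class"] == "good"
    features = pd.get_dummies(balanced.drop(columns=["class"]))
    # Returns X_train, X_test, y_train, y_test.
    return train_test_split(features, target, test_size=0.5, random_state=1)
```

After `pd.get_dummies`, the gender information lives in the two columns `gender_female` and `gender_male`.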

### b) A crude notion of discrimination (2)

Write a function `eval_fairness` that takes a classifier instance, a dict with arguments passed to the fit method, and the 4 arguments obtained from sklearn `train_test_split` in the same order.

Train the classifier on the training examples. Now on the test examples compute the fraction of females/males that have been predicted 'good' with respect to all females/males. Now swap the gender of all the test instances and let the classifier predict again. Compute the same fractions. 

Return a dict:
```python
{'frac_females' : 0.0, # fraction of 'good' predicted females
 'frac_males' : 0.0, # fraction of 'good' predicted males
 'frac_females_swap' : 0.0, # fraction of 'good' predicted former females (males after swap)
 'frac_males_swap' : 0.0} # fraction of 'good' predicted former males (females after swap)
```

Apply this procedure to the three classifiers from 3c).
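
A minimal sketch of the procedure is given below. It assumes that the prepared data contains the indicator columns `gender_female` and `gender_male` (as produced by `pd.get_dummies` on a `gender` column) and that the classifier predicts boolean labels where True means 'good'. The helper `swap_gender` is hypothetical.

```python
import numpy as np


def swap_gender(X):
    # Exchange the two indicator columns; all other columns stay untouched.
    swapped = X.copy()
    swapped["gender_female"] = X["gender_male"]
    swapped["gender_male"] = X["gender_female"]
    return swapped


def eval_fairness(clf, fit_params, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train, **fit_params)
    females = X_test["gender_female"].to_numpy(dtype=bool)
    males = ~females
    predicted = np.asarray(clf.predict(X_test), dtype=bool)
    predicted_swap = np.asarray(clf.predict(swap_gender(X_test)), dtype=bool)
    # The masks always refer to the original gender of each test instance.
    return {
        "frac_females": float(predicted[females].mean()),
        "frac_males": float(predicted[males].mean()),
        "frac_females_swap": float(predicted_swap[females].mean()),
        "frac_males_swap": float(predicted_swap[males].mean()),
    }
```

Note that `y_test` is part of the signature but is not needed for these fractions, since only predictions are compared.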

#### Is any of the classifiers discriminating? (1)

Argue whether any of the classifiers are discriminating.
Also argue what drawbacks this notion/procedure for evaluating discrimination has.
Store the argument as a string in the variable `discrimination` and also write it to a file `discrimination.txt`.

In [7]:
pass