# Intro to Data Analysis with Python

In this notebook, we will walk through how to solve a data analysis problem end-to-end.

The steps and decisions I took may not be the best or the fastest - the goal is to give you an **overview** of what steps might be necessary.

## Python

#### 1.1. **Comments**

First, how to write comments. Feel free to annotate a lot, so that you remember what you did and why :)

In [None]:
# This is a comment

#### 1.2. Syntax

Below are some examples to get a feel for Python if you've never seen it before. Python is dynamically typed: you don't declare the type of a variable, it simply follows from the value you assign.

In [None]:
# Variables

x = 5
y = 0.25
z = "some text"

# lists
a = [1, 2, 3]
b = [x, y, z]

In [None]:
print(x + y)
print(x * z)
# print(y * z)    # this does not work

print(b)          # lists can have any objects as elements
print(a * x)

Since Python doesn't use braces to delimit code blocks, the indentation is very important. For example, **all the code inside a loop or an if-else branch must have the same indent!**

In [None]:
# Loops

for i in range(len(a)): # range(n) yields the integers 0, 1, ..., n-1
    a[i] += 2           # select element with a[.]

a

In [None]:
# Conditionals 

def is_odd(x):
    if x % 2 == 1:    # a remainder of 1 after dividing by 2 means x is odd
        return True
    else:
        return False

is_odd(5)

Python also has `while`-loops and cool things like *comprehensions*, but let's leave that for now.

#### 1.3. numpy

One of the most important libraries when working with numerical data is `numpy`. It provides support for many numerical operations on scalars, vectors, matrices and higher-dimensional arrays.

In [None]:
import numpy as np   # importing numpy with the alias np

In [None]:
n = np.array([1, 2, 3])                   # initialize array with a list
m = np.array([[1, 2, 3], [4, 5, 6]])      # initialize 2D array with list of lists -> matrix

In [None]:
n

In [None]:
n.shape     # very important attribute, tells us the dimensions of the array
 
# (3,) means that n is a one-dimensional array with 3 elements (neither a row nor a column vector)

In [None]:
n.sum()           # no arguments -> sum of all elements in the array

In [None]:
m

In [None]:
m.shape   # 2 rows, 3 columns

In [None]:
m.sum(axis=0)     # axis=0 means per column, axis=1 means per row
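
To make the `axis` argument concrete, here is a small self-contained sketch that rebuilds the matrix `m` from above and sums it in all three ways. The names `total`, `col_sums` and `row_sums` are only illustrative.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])   # same 2 x 3 matrix as above

total = m.sum()             # all elements: 21
col_sums = m.sum(axis=0)    # collapse the rows -> one sum per column: [5 7 9]
row_sums = m.sum(axis=1)    # collapse the columns -> one sum per row: [ 6 15]

print(total, col_sums, row_sums)
```

One way to remember it: the axis you pass is the one that disappears from the shape.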


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Data Analysis

Three libraries will help us analyse our data - `pandas`, `matplotlib` and `seaborn`.

We use `pandas` to read in tabular data from local files, as well as to clean and transform the data. It also offers some visualization functions, but the freedom there is limited.
`matplotlib` is a rich visualization library, and `seaborn` builds on top of it with ready-made statistical plots.

In [None]:
# Importing libraries

import numpy as np       # vectors and matrices + functions
import pandas as pd      # DataFrame data structure + functions
import matplotlib.pyplot as plt   # visualization
import seaborn as sns             # visualization

# show the visualizations in-line
%matplotlib inline       
sns.set_theme()

_____________________

### Data

The dataset is a simplified version of the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/index.php)'s [Statlog (German Credit Data) Data Set](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29). 

Modified version downloaded from [Kaggle](https://www.kaggle.com/datasets/kabure/german-credit-data-with-risk). [License](https://creativecommons.org/publicdomain/zero/1.0/).

**The data set consists of customers that have taken out a credit. Each credit is classified as low or high risk according to the set of attributes:**

Feature ID| Feature Name |Description | Type | Values
--|-----|-----|----|---
1 | Age | person's age | numeric | age of the bank customer
2 | Sex | person's sex | text | "male" or "female"
3 | Job | type of employment | numeric | 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled
4 | Housing | type of housing | text | "own", "rent", or "free"
5 | Saving accounts | savings account balance category | text | "little", "moderate", "quite rich", "rich"
6 | Checking account | checking account balance category | text | "little", "moderate", "rich"
7 | Credit amount | credit amount | numeric | credit amount in DM (Deutsche Mark)
8 | Duration | credit duration | numeric | duration of the credit in months - 4 to 72 months
9 | Purpose | credit purpose | text | "car", "furniture/equipment", "radio/TV", "domestic appliances", "repairs", "education", "business", "vacation/others"


**The target variable is `Risk`:**

Feature ID| Feature Name |Description | Type | Values
--|-----|-----|----|---
10 | Risk | credit risk | text | "low" risk or "high" risk

_________________________

### Data Loading

In [None]:
data = pd.read_csv('Data/german_credit_data.csv')

data

### General functions

In [None]:
# Show the first n lines (no number gives first 5)

data.head()

In [None]:
# Summary

data.info()

A lot of info here: 

- 1000 entries/people
- 9 features (Age-Purpose)
- Missing values in the features `Saving accounts` and `Checking account`
- Memory usage

In [None]:
# Selection - column

data['Purpose']

In [None]:
# Selection - row

data.iloc[0:3]

In [None]:
# Selection - condition

data[data['Credit amount'] < 400]

In [None]:
# statistical functions

data['Age'].mean()   #  .max()   .min()   .std() 

In [None]:
# unique values and counts

data['Purpose'].value_counts()
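
The selection styles above can also be combined. The following is a minimal sketch on a made-up four-row table (the values in `toy` are invented for illustration and are not taken from the credit data), showing two conditions joined with `&` and a row condition combined with a column selection via `.loc`.

```python
import pandas as pd

# Hypothetical mini data set with the same column names as the credit data
toy = pd.DataFrame({
    "Age": [22, 35, 47, 29],
    "Credit amount": [350, 1200, 5000, 390],
    "Purpose": ["car", "radio/TV", "car", "education"],
})

# Two conditions: each one needs its own parentheses, & means "and"
small_and_young = toy[(toy["Credit amount"] < 400) & (toy["Age"] < 25)]

# Rows by condition, columns by name, in one step
car_loans = toy.loc[toy["Purpose"] == "car", ["Age", "Credit amount"]]

print(small_and_young)
print(car_loans)
```

Note that `|` is the matching "or" operator; the plain Python keywords `and`/`or` do not work element-wise on columns.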

---------------

### Exploratory data analysis (EDA)

My favourite way to get an overview of the features is to visualize how each is distributed. Of course, there are different types of features, numerical, categorical, text etc. For this data set, we need to take only two cases into account - numerical features (integers/floats) and categorical features (small number of distinct values).

In [None]:
# Plot feature distributions

# Set the size of the entire plot (I set this by trial and error :)
fig = plt.figure(figsize=(20, 7))  

# Go through all the columns (there are 10 of them, including the target)
for i, col in enumerate(data.columns):
    
    sp = plt.subplot(2, 5, i+1)  # 2 x 5 is the grid to place the plots in
    
    if len(data[col].value_counts()) > 10:    # If the column has more than 10 distinct values, we
                          # can assume the column is not categorical
        sns.histplot(x=col, data=data, kde=True)    # Nice histogram
        
    else:    # Categorical data
        sns.countplot(x=col, data=data)   # Count plot
    plt.xticks(rotation=90)

plt.subplots_adjust(wspace=0.3, hspace=0.7)

**Some things we notice:**

- Most people in the dataset are 20-40 years old
- The 'Job' column has four values. Looking into the [dataset description](https://www.kaggle.com/datasets/uciml/german-credit), we can see that the meanings are as follows:

| Code | Job type |
| --- | --- |
| 0 | unskilled and non-resident |
| 1 | unskilled and resident |
| 2 | skilled |
| 3 | highly skilled |

- The credit amount was usually 2000-3000 DM
- The credit duration was rarely longer than 60 months, with 12 and 24 months being the most common values
- The target variable `Risk` is unbalanced - 70% of the credits were low risk

---------

Another useful function is `.describe()`, which shows us important summary statistics of the features. We saw some of them in the visualizations above. `.describe()` can be used on the entire `DataFrame`, but also on chosen columns.

In [None]:
data.describe(include='all')

------

Let's try to answer some data science questions.

#### Do people with highly skilled jobs take out larger loans?

We can use a bar plot to answer this question. The height of each bar is the mean credit amount of all people with a particular job level, and the thin line on top of each bar shows the 95% confidence interval of that mean.

In [None]:
sns.barplot(x="Job", y="Credit amount", data=data)

Yes, apparently. The loans taken out by people with job level 3 are 2000 DM higher on average.
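
By default, `sns.barplot` draws the mean of the `y` column for each value of `x`, so the same numbers can be read off with a `groupby`. The sketch below uses a tiny invented table (`toy` and its values are assumptions for illustration), not the real credit data.

```python
import pandas as pd

# Hypothetical mini data set: job level and credit amount in DM
toy = pd.DataFrame({
    "Job": [1, 2, 2, 3, 3],
    "Credit amount": [2000, 1000, 3000, 5000, 7000],
})

# Mean credit amount per job level - these are the bar heights of a bar plot
mean_per_job = toy.groupby("Job")["Credit amount"].mean()
print(mean_per_job)
```

On the real data, `data.groupby("Job")["Credit amount"].mean()` would give the exact values behind the plot.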

#### Is the credit's purpose indicative of the risk?

In [None]:
fig = plt.figure(figsize=(15, 7)) 
sns.boxplot(x="Purpose", y="Credit amount", data=data, hue="Risk")

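In a box plot, the centre line of each box is the median credit amount of that group and the box spans the first to the third quartile. The whiskers reach out to the most extreme points that lie within 1.5 times the interquartile range of the box, and points beyond the whiskers are drawn individually as outliers.

At least for loans with the purpose `vacation/others`, the credit amount is very clearly indicative of how risky the loan is.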

### Data Cleaning

Before we dive into the machine learning part of this notebook, we have some cleaning to do. Machine learning algorithms have problems when dealing with missing values.

**How many `NaN` values are there in the data?**

In [None]:
data.isna().sum()

There are missing values in the `Saving accounts` and `Checking account` features.

#### Missing values

There are several ways to handle missing values:
- Imputation - Replacing the missing values with a real value (e.g. mean or median of the feature)
- Deleting entries with missing values
- Deleting columns/features with missing values
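
The three options above can be sketched in a few lines of `pandas`. The table `toy` below is invented for illustration, and the choice of median (for numbers) and most frequent value (for categories) is just one possible imputation strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set with gaps in two columns
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "Housing": ["own", "rent", None, "own"],
    "Duration": [12, 24, 36, 48],
})

# 1) Imputation: median for the numeric column, most frequent value for the categorical one
imputed = toy.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
imputed["Housing"] = imputed["Housing"].fillna(imputed["Housing"].mode()[0])

# 2) Delete every row that contains at least one missing value
rows_dropped = toy.dropna()

# 3) Delete every column that contains at least one missing value
cols_dropped = toy.dropna(axis=1)

print(imputed, rows_dropped, cols_dropped, sep="\n\n")
```

Even on this toy table the trade-off is visible: dropping rows halves the data, and dropping columns leaves only `Duration`.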

The correct way to handle missing data is not universal. For example, if the number of missing values is too high, imputation will add a lot of noise, which can make model learning very difficult.

If we have a small data set, deleting entries means we would have even fewer data points to train on.

If all features in a data set have missing values, it doesn't make sense to delete those features.

In our case, however, the missing values are in only two of the **categorical** columns - `Saving accounts` and `Checking account`. The missing values presumably correspond to cases where it was unknown whether the person had such an account, or how much money was in it.

Let's check how well these features correspond to the target variable `Risk`. We will treat the `NaN` values as a separate category.

In [None]:
sns.histplot(data=data.fillna('unknown'), y="Saving accounts", hue="Risk", multiple="stack", shrink=.8)

Then, `Checking account`:

In [None]:
sns.histplot(data=data.fillna('unknown'), y="Checking account", hue="Risk", multiple="stack", shrink=.8)

There are many more examples of low risk than high risk when the value of the two features is `unknown`. So this category might help with the prediction of the risk. However, in the real world, this type of missing data is not random and can make our model very biased - for example, the `unknown` category's distribution most closely matches the distribution of the `rich` categories.  

In [None]:
data = data.fillna('unknown')

______________________

#### Convert to numerical

Most machine learning algorithms require numerical data, so we need to convert all categorical features (their values are strings) to numbers. The categorical features have the dtype `object`:

In [None]:
data.head()

In [None]:
data.info()

We can convert these text features to numerical by using **indicator variables**.

The `pandas` function `pd.get_dummies()` does just that, expanding $N$ categories into $N$ indicator (binary) variables.

In [None]:
categorical_features = data.select_dtypes(include='object').columns

print(categorical_features)

num_data = pd.get_dummies(data, columns=categorical_features)
num_data.head()

We now have 28 columns instead of 10, but we can delete some of them in the next step.

Let's make sure they're all numerical.

In [None]:
num_data.info()

_________________

### Correlation

Now that all features are numerical, we can also compute the correlation between each pair of them.

The correlation coefficient is a value between -1 and 1. 
- A coefficient of 0 means that the two variables are not linearly correlated, that is, knowing one variable does not help us predict the other with a straight line.
- Coefficients >0 denote a positive correlation, meaning that an increase in one variable is connected to an increase in the other variable. A value of 1 means the variables are in a perfect increasing linear relationship (not necessarily equal).
- Coefficients <0 denote a negative correlation, meaning that an increase in one variable is connected to a decrease in the other variable. 
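
To get a feel for these values, here is a small sketch with `np.corrcoef` on made-up vectors. It also shows that the coefficient only measures *linear* association: the last pair is perfectly dependent, yet its correlation is 0. All variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

y_up = 2 * x + 1          # increasing straight line, not equal to x
y_down = -3 * x           # decreasing straight line
y_curve = (x - 3) ** 2    # fully determined by x, but not by a straight line

# np.corrcoef returns the 2 x 2 correlation matrix; [0, 1] is the coefficient of the pair
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
r_curve = np.corrcoef(x, y_curve)[0, 1]

print(round(r_up, 2), round(r_down, 2), round(r_curve, 2))
```

`DataFrame.corr()` computes this same (Pearson) coefficient by default for every pair of columns.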

In [None]:
correlation = num_data.corr()

plt.figure(figsize=(20,15))
sns.heatmap(correlation.round(decimals=2), annot=True, vmax=1, vmin=-1)

**Observations:**

- The binary indicator variables for `Sex` and `Risk` have perfect anti-correlation, so we can drop one of each, for example `Sex_male` and `Risk_low`.
- There is a relatively high positive correlation between the variables `Credit amount` and `Duration` (0.62). Of course, larger credit amounts usually have longer credit durations, so no surprise there.
- The variables with the highest correlation with the target variable are `Checking account_unknown` (indicates low risk) and `Checking account_little` (indicates high risk). The features `Credit amount`, `Saving accounts_little` and `Duration` are also related to the `Risk` variable.

In [None]:
num_data = num_data.drop(['Sex_male', 'Risk_low'], axis=1)
num_data.head()

We now have the `Risk_high` variable with 1 meaning high risk and 0 low risk. This is more intuitive and having 'high' be the positive class (index 1) makes evaluation easier, since we usually want to predict the high risk credits anyway.

__________________________

### Train and test data

How do we go about using machine learning to predict whether a loan is low or high risk? 

First, we need to split our data into two separate data sets - a training data set with known labels/risk, and a test data set where we test the learned knowledge.

In [None]:
from sklearn.model_selection import train_test_split

labels = num_data.Risk_high
features = num_data.drop("Risk_high", axis=1)

X_train, X_test, y_train, y_test = train_test_split(features, labels,  random_state=42)

X_train.head()

In [None]:
X_test.shape

We have 750 training samples and 250 test samples.
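
Since the target is imbalanced, one optional refinement (not used in this notebook) is a *stratified* split, which keeps the share of high risk samples the same in both parts. The sketch below uses the `stratify` parameter of `train_test_split` on synthetic labels; `toy_features` and `toy_labels` are placeholders, not the credit data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, 30% positive (1) and 70% negative (0)
toy_labels = np.array([1] * 300 + [0] * 700)
toy_features = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_labels,
    test_size=0.25,         # same 750 / 250 split as the default
    stratify=toy_labels,    # preserve the class proportions in both parts
    random_state=42,
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())   # share of positives in each part
```

Without `stratify`, the split is purely random and the class shares in the two parts can drift apart by a few percentage points.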

----------------------

Next, we need a model. The most important question we need to answer is...



### What are we trying to predict?   

The `Risk_high` variable, which is binary (two categories).

The problem to solve is called [binary classification](https://en.wikipedia.org/wiki/Binary_classification).
There are many algorithms that can solve binary classification problems. We will look at two example models - [decision tree](https://en.wikipedia.org/wiki/Decision_tree_learning) and [logistic regression](https://kambria.io/blog/logistic-regression-for-machine-learning/#:~:text=What%20Is%20Logistic%20Regression%3F,either%20a%200%20or%201.). Let's import the algorithms and some other helpful packages first.

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.linear_model import LogisticRegression

from sklearn import metrics
import graphviz

### Evaluation

After looking into what our model is doing, we can perform a quantitative evaluation. For this, we need to use the learned model to predict the labels of *unseen* data, in this case our test data `X_test`.

Let's first pack everything into a function, since we'll evaluate both models.

In [None]:
def evaluate_model(model, model_name, test_data, test_labels):

    # Use the learned model to make predictions about unseen data
    test_predictions = model.predict(test_data)
    
    # Print the classification report
    print(metrics.classification_report(test_labels, test_predictions, target_names=['low', 'high']))
    
    # Evaluate by comparing the predictions with the true labels of the test data (here, in a confusion matrix)
    confusion_matrix = metrics.confusion_matrix(test_labels,  test_predictions)

    # Turn the confusion matrix into a dataframe
    matrix_df = pd.DataFrame(confusion_matrix)

    # Plot the result
    ax = plt.axes()
    sns.heatmap(matrix_df, annot=True, fmt="g", ax=ax)
    ax.set_title('Confusion Matrix - {}'.format(model_name), fontsize=15)
    ax.set_xlabel("Predicted Risk_high", fontsize=15)
    ax.set_ylabel("Actual Risk_high", fontsize=15)
    plt.show()
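
To see how the numbers in the classification report relate to the confusion matrix, here is a minimal sketch on ten hand-made labels (`y_true` and `y_pred` are invented). It computes accuracy, precision and recall for the positive class by hand and compares them with the `sklearn.metrics` helpers.

```python
from sklearn import metrics

# Hypothetical ground truth and predictions (1 = high risk, 0 = low risk)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# Rows are the actual classes, columns the predicted ones
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of the predicted high risk, how many really are
recall = tp / (tp + fn)                      # of the actual high risk, how many we found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

Recall for the high risk class is the number to watch when missing a high risk customer is the costly mistake.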

<img src="https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png" width=2000/>

## Decision Tree

A decision tree is exactly what it sounds like - starting from the top (or the root) of the tree, which contains all data points, we start branching out by choosing feature ranges that allow us to split the data according to the target variable. Having learned a good tree, unseen data is then classified by following the correct tree branches to a tree leaf.

<img src="images/tree.png" style="margin:auto" width=500/>
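
As a toy illustration of the idea, the sketch below fits a tree with a single split on four invented data points and prints the learned rule with `export_text`. The feature name `Duration` and the values are assumptions chosen only to keep the example readable.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: one feature (credit duration in months), label 1 = high risk
toy_X = [[6], [12], [24], [48]]
toy_y = [0, 0, 1, 1]

# max_depth=1 allows exactly one split (a "decision stump")
toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(toy_X, toy_y)

# Text version of the learned tree: one threshold on Duration, two leaves
print(export_text(toy_tree, feature_names=["Duration"]))

# Unseen data is classified by following the branch that matches its value
toy_predictions = toy_tree.predict([[10], [36]])
print(toy_predictions)
```

A real tree repeats this step recursively, picking at each node the feature and threshold that separate the classes best.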

In [None]:
# Create a classifier instance
model_dtree = DecisionTreeClassifier(random_state=42, class_weight='balanced')    

# Learn from the training data = fit the classifier to the training data
model_dtree = model_dtree.fit(X_train, y_train)

#### **Let's visualize the learned tree.** It will be saved in a pdf file in the notebook's directory.

In [None]:
dot_data = export_graphviz(model_dtree, out_file=None, feature_names=features.columns, 
                                class_names=np.array(['low', 'high']), filled=True, rounded=True, special_characters=True) 
graph = graphviz.Source(dot_data) 
graph.render("CreditRisk") 

#### **Evaluation**

In [None]:
evaluate_model(model_dtree, 'Decision Tree', X_test, y_test)


The confusion matrix compares the predictions with the true test labels. The diagonal shows the correct guesses and above and below are the 'confusions' where our model was wrong.

Generally when doing classification, what we want to see is a diagonal with high numbers. That is not the case here, since the lower right (number of actual high risk customers that were predicted as high risk) is the lowest value. Recognizing the low risk customers, however, seems easy in comparison.

If we think about it, in the real world it is much more important to be able to predict high risk customers than low risk customers. As such, this model is far from optimal, due to the higher cost of classifying a high risk loan as low risk.


<img src="https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png" width=2000/>

Let's take a look at another classifier.

## Logistic Regression

Logistic regression is a different type of classification algorithm. It uses an iterative, gradient-based optimizer to minimize a given error/loss function. It creates a *decision boundary* between positive and negative class samples.

<img src="images/logreg.png" style="margin:auto" width=500/>
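
What a fitted logistic regression does at prediction time can be sketched in plain `numpy`: a weighted sum of the features is squashed into a probability by the sigmoid function, and the decision boundary is where that probability equals 0.5. The weights, bias and samples below are made-up numbers, not learned from the credit data.

```python
import numpy as np

def sigmoid(t):
    """Map any real number to the interval (0, 1)."""
    return 1 / (1 + np.exp(-t))

# Hypothetical model parameters for two features
weights = np.array([0.8, -0.5])
bias = -0.2

# Two hypothetical (already standardized) samples
samples = np.array([[2.0, 1.0],
                    [-1.0, 1.5]])

scores = samples @ weights + bias            # linear part: w . x + b
probabilities = sigmoid(scores)              # probability of the positive class
predictions = (probabilities >= 0.5).astype(int)

print(scores, probabilities, predictions)
```

A score of exactly 0 gives a probability of 0.5, so the boundary is the line where `w . x + b = 0`; training consists of finding the weights and bias that place this line well.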

<br>
Logistic regression and other gradient-based methods work best when the input features are standardized (each rescaled to mean 0 and standard deviation 1).
Let's scale our data. It's very important that the scaling statistics are calculated only on the training data (otherwise we're cheating!).

In [None]:
from sklearn.preprocessing import StandardScaler

# Create scaler instance
scaler = StandardScaler().fit(X_train)  
X_train_std = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test_std = pd.DataFrame(scaler.transform(X_test), columns=X_train.columns)

X_train_std.head()

A result of this transformation is the loss of readability - it's no longer obvious what `Age`=-1.016566 means (it is an age roughly one standard deviation below the mean age of the training data).
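
The scaled values are so-called z-scores: each value minus the training mean, divided by the training standard deviation. The sketch below checks this against `StandardScaler` on a few invented ages (`toy_train` and `toy_test` are placeholders) and shows that the test data is scaled with the *training* statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages; one column, as a 2D array like sklearn expects
toy_train = np.array([[20.0], [30.0], [40.0]])
toy_test = np.array([[30.0], [50.0]])

toy_scaler = StandardScaler().fit(toy_train)     # statistics come from the training data only

# Manual z-score using the training mean and standard deviation
manual = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(toy_scaler.transform(toy_test).ravel())
print(manual.ravel())
```

Both lines print the same numbers; a scaled value of 0 means "equal to the training mean", and a value of -1 means "one standard deviation below it".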

In [None]:
# Create a classifier instance
model_lr = LogisticRegression(class_weight='balanced')    

# Learn from the training data = fit the classifier to the training data
model_lr = model_lr.fit(X_train_std, y_train)      

Let's look at the test data results.

In [None]:
evaluate_model(model_lr, 'Logistic Regression', X_test_std, y_test)

At first glance, performance on the test data doesn't look much different from the Decision Tree model.

However, the recall score for the positive class (`Risk_high`=1) has increased by a good margin, and even the accuracy score (the percentage of correct predictions) has increased slightly.

As we've discussed already, for this specific use case what we care about is recognizing high risk customers (`Risk_high` = 1).
This means that the worst source of error for any model would be classifying a high risk customer as low risk. This scenario corresponds to the lower left corner of the confusion matrix, and we can see that is the lowest value in the graphic by a good margin. Although the model is not perfect, it can be useful to the people deciding whether to approve a credit or not.



![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)