<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Modelling lab: Can we predict risk of bankruptcy?

In this exercise we will try to predict which companies go bankrupt within the next 5 years, based on various numeric financial attributes. The dataset describes Polish companies and their financial accounts; further information is [available here](https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data). Of the five datasets available on the site, we use only the one whose label records bankruptcy within 5 years.

# Part 1 - Exploratory data analysis

### 1. Load the data

Look at the usual things:

- how many rows/columns?
- what are the data types?
- any missing values?
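
A minimal sketch of these checks, assuming the data has been read into a `pandas` DataFrame. The in-memory CSV and the `feat_1`/`feat_2` column names are made-up stand-ins for the real file; only `bankruptcy_label` comes from this lab.

```python
import io

import pandas as pd

# Stand-in for the real file: a tiny CSV held in memory (values are made up).
csv_text = """feat_1,feat_2,bankruptcy_label
0.20,1.5,0
0.05,,1
0.31,2.1,0
"""
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape    # how many rows/columns?
dtypes = df.dtypes           # data type of each column
missing = df.isna().sum()    # number of missing values per column

print(n_rows, n_cols)
print(dtypes)
print(missing)
```

With the real dataset you would pass the file path to `pd.read_csv` instead of the in-memory buffer.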

### 2. Examine the target

The target is the final column, `bankruptcy_label` (1 if the company went bankrupt within 5 years).

What is the distribution of the target?
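
One way to look at the distribution is to count both classes and express them as shares. The sketch below uses a made-up target series as a stand-in for the real `bankruptcy_label` column.

```python
import pandas as pd

# Made-up stand-in for df["bankruptcy_label"]: 8 healthy companies, 2 bankrupt.
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="bankruptcy_label")

counts = target.value_counts()                 # absolute count per class
shares = target.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares)
```

A heavily skewed split here matters later, when choosing a metric and when splitting the data.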

### 3. Create a heatmap showing the correlations within your features

It's generally a good idea to see how your features correlate with each other, to identify columns that perhaps encode the same information. We'll do this by calculating the correlation matrix (how every column correlates with every other) and inspecting it visually as a heatmap.

#### 3. a. Get a correlation matrix of your features (hint: there's a built-in `pandas` function for that!)

#### 3. b. Use `seaborn` to plot a heatmap of your correlation matrix

You will need to play around with the plotting options to get a clear view of where the correlations are.
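
As a rough sketch of steps 3a and 3b together, the example below builds a correlation matrix with `DataFrame.corr()` and plots it with `seaborn.heatmap`. The three synthetic features are assumptions for illustration, and the particular plotting options (`vmin`, `vmax`, `center`, `cmap`, `annot`) are just one reasonable starting point.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in features: feat_b is nearly a copy of feat_a.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
features = pd.DataFrame({
    "feat_a": x,
    "feat_b": x + rng.normal(scale=0.1, size=200),
    "feat_c": rng.normal(size=200),
})

corr = features.corr()  # pairwise Pearson correlation between columns

fig, ax = plt.subplots(figsize=(5, 4))
# Fix the colour scale to [-1, 1] and centre it on 0 so that strong positive
# and strong negative correlations stand out equally.
sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap="coolwarm",
            annot=True, fmt=".2f", square=True, ax=ax)
ax.set_title("Feature correlations (synthetic data)")
fig.tight_layout()
```

With many features, annotations become unreadable, so dropping `annot=True` and enlarging `figsize` is usually a sensible adjustment.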

#### 3. c. Dig into it a bit more, and find the few most strongly correlated column pairs
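
One possible approach, sketched on synthetic stand-in features: take absolute correlations, keep only the upper triangle of the matrix so each pair appears once, then sort.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in features: feat_b is nearly a copy of feat_a.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
features = pd.DataFrame({
    "feat_a": x,
    "feat_b": x + rng.normal(scale=0.1, size=200),
    "feat_c": rng.normal(size=200),
})

abs_corr = features.corr().abs()

# Boolean mask for the strict upper triangle (k=1 excludes the diagonal),
# so self-correlations and mirrored duplicates are dropped.
upper = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)

top_pairs = (
    abs_corr.where(upper)   # everything outside the mask becomes NaN
    .stack()                # long format: (row feature, column feature) -> value
    .dropna()
    .sort_values(ascending=False)
)
print(top_pairs.head(10))
```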

### 4. Look at the distribution of your variables for bankrupt vs. non-bankrupt companies

We're looking for cases where the distribution of a feature is different for bankrupt companies, suggesting that it would be a good feature to separate the classes.

You could start by comparing the average value of each feature for bankrupt vs. non-bankrupt companies or, as a bonus, plot the actual distributions (using histograms, box plots or similar).
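
A small sketch of the comparison of averages, on a made-up DataFrame (only `bankruptcy_label` is a real column name). Scaling the gap between class means by each feature's standard deviation is an assumption made here so that features on different scales can be ranked together.

```python
import pandas as pd

# Made-up data: feat_a differs clearly between classes, feat_b does not.
df = pd.DataFrame({
    "feat_a": [0.9, 1.1, 1.0, 0.2, 0.1],
    "feat_b": [5.0, 4.0, 6.0, 5.0, 5.0],
    "bankruptcy_label": [0, 0, 0, 1, 1],
})

# Mean of every feature within each class.
group_means = df.groupby("bankruptcy_label").mean()

# Absolute gap between the class means, in units of the feature's overall std.
feature_std = df.drop(columns="bankruptcy_label").std()
gap = (group_means.loc[1] - group_means.loc[0]).abs() / feature_std
ranked = gap.sort_values(ascending=False)

print(group_means)
print(ranked)
```

Financial ratios often have extreme outliers, so repeating the comparison with `.median()` may give a more robust picture.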

### 5. Based on findings from above, combined with your domain knowledge, choose 5 features with which to build a first predictive model

It is usually a good idea to choose uncorrelated features, and in the case of classification those that suggest a separation between labels 0 and 1 (i.e. features that are distinctly different for bankrupt and non-bankrupt companies).

# Part 2 - Prediction

### 1. Using the features you selected above, create training and test sets

Ensure that both your training and test sets have the same distribution of bankrupt/non-bankrupt companies. You can use a specific parameter when splitting your data to achieve this.
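
In scikit-learn, the parameter in question is `stratify` on `train_test_split`. The sketch below uses synthetic data with 5% positives as a stand-in for your selected features and target; the `test_size` and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 companies, 5 features, 5% bankrupt.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 950 + [1] * 50)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())  # share of bankrupt companies in each split
```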

### 2. Choose an appropriate metric for evaluating your predictions

Think about:

- the problem itself: is it classification (binary or multi-class) or regression?
- the distribution of your target: does this change which metrics are appropriate?
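
To illustrate why the target distribution matters, here is a small sketch with made-up labels: a "model" that always predicts "not bankrupt" scores well on accuracy but finds no bankrupt companies at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up labels: 95 healthy companies, 5 bankrupt.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)  # always predict "not bankrupt"

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Metrics that focus on the positive class (recall, precision, F1, or the area under the precision-recall curve) are therefore usually more informative on imbalanced targets than plain accuracy.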

### 3. Train *two* different predictive models

- Choose two different models to train
- Think about best practices:
    - Only train using the training set
    - Use cross-validation to get a better estimate of performance on unseen data
    - Use grid search to optimise your models' hyperparameters
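
The practices above can be combined with `GridSearchCV`, as in the following sketch. The choice of logistic regression and a random forest, the parameter grids, the `f1` scoring, and the synthetic data from `make_classification` are all assumptions for illustration, not the expected answer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the five selected features.
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each candidate is a (model, hyperparameter grid) pair.
candidates = {
    "logreg": (
        Pipeline([
            ("scale", StandardScaler()),  # fitted inside each CV fold
            ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ]),
        {"clf__C": [0.1, 1.0, 10.0]},
    ),
    "forest": (
        RandomForestClassifier(class_weight="balanced", random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
}

searches = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="f1", cv=cv)
    search.fit(X_train, y_train)  # only the training set is used here
    searches[name] = search
    print(name, search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the `Pipeline` means it is refitted on each training fold, so no information from the validation fold leaks into preprocessing.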

### 4. Evaluate your models on the *training* set

Go deeper than the single metric you used for training. Try showing the confusion matrix for both models - what can you see?
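
A sketch of one way to do this, on synthetic stand-in data. It uses `cross_val_predict` so that every training row is predicted by a model that did not see it during fitting; that is an assumption about how you want to evaluate, and calling `model.fit(...).predict(...)` on the training set directly would give a more optimistic picture.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic, imbalanced stand-in for the training set.
X_train, y_train = make_classification(n_samples=500, n_features=5, n_informative=3,
                                       n_redundant=0, weights=[0.9, 0.1],
                                       random_state=0)

model = LogisticRegression(max_iter=1000, class_weight="balanced")

# Out-of-fold predictions for every training row.
y_oof = cross_val_predict(model, X_train, y_train, cv=5)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_train, y_oof)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(classification_report(y_train, y_oof, zero_division=0))
```

The off-diagonal cells show which kind of mistake each model makes: `fn` counts bankrupt companies that were missed, `fp` counts healthy companies flagged as bankrupt.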

# Part 3 - Improvements

Time to improve your models. There are many different things you can do here including:

- choosing more/different features
- trying to tune more hyperparameters (if they exist)
- trying a third model
- tuning other aspects of the prediction process, e.g. changing the cutoff for predicting 1 instead of 0. By default the cutoff is 0.5: if your model estimates a probability of bankruptcy of 50% or more, it predicts "bankrupt". Changing this threshold may help improve performance.

For further reading, see for example the article [Learning from Imbalanced Classes](https://www.svds.com/learning-imbalanced-classes/).
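
As an illustration of the cutoff idea from the list above, the sketch below scans a range of thresholds over out-of-fold predicted probabilities and keeps the one with the best F1 score. The model, the threshold grid, the use of F1, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

# Synthetic, imbalanced stand-in for the training set.
X_train, y_train = make_classification(n_samples=500, n_features=5, n_informative=3,
                                       n_redundant=0, weights=[0.9, 0.1],
                                       random_state=0)

model = LogisticRegression(max_iter=1000)

# Out-of-fold probability of class 1 (bankrupt) for every training row.
proba = cross_val_predict(model, X_train, y_train, cv=5,
                          method="predict_proba")[:, 1]

# Try cutoffs from 0.05 to 0.95 in steps of 0.05.
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_train, (proba >= t).astype(int), zero_division=0)
          for t in thresholds]

best_threshold = thresholds[int(np.argmax(scores))]
print(f"best threshold={best_threshold:.2f}, F1={max(scores):.3f}")
```

The threshold should be chosen on training or validation data only; picking it on the test set would make the final evaluation too optimistic.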

### 1. Try something to improve your best performing model from above

### 2. Interpret your best model by extracting feature importances (if possible)
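
For tree-based models in scikit-learn, importances are exposed as the `feature_importances_` attribute after fitting; linear models expose `coef_` instead. A minimal sketch with a random forest on synthetic data (the `feat_*` names are hypothetical):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [f"feat_{i}" for i in range(5)]

# shuffle=False keeps the two informative columns first (feat_0 and feat_1);
# the remaining three columns are pure noise.
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be misleading for correlated features; `sklearn.inspection.permutation_importance` is an alternative worth comparing against.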

### 3. Finally, evaluate your best model on the test set - how did it do?
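
A sketch of the final step on synthetic stand-in data: fit the chosen model on the full training set, then score the held-out test set once. The random forest and its settings stand in for whatever your best model turned out to be.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the selected features and target.
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

best_model = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                    random_state=0)
best_model.fit(X_train, y_train)       # fit on training data only

y_pred = best_model.predict(X_test)    # the test set is used exactly once
test_f1 = f1_score(y_test, y_pred, zero_division=0)
cm = confusion_matrix(y_test, y_pred)

print(cm)
print(classification_report(y_test, y_pred, zero_division=0))
```

Comparing the test score with the cross-validated score from training indicates how well the earlier estimate generalised.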