# Machine Learning / Aprendizagem Automática

## Diogo Soares, André Falcão and Sara C. Madeira, 2020/21

# ML Project  - Learning about Donations

## Logistics 
  
**Students are encouraged to work in teams of 3 people**. 

Projects with smaller teams are allowed, in exceptional cases, but will not have better grades for this reason. 

The quality of the project will dictate its grade, not the number of people working.

**The project's solution should be uploaded in Moodle before the end of December, 18th (last day before Christmas holidays).** 

Students should **upload a `.zip` file** containing all the files necessary for project evaluation. 

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc describing the solution and the results. The notebook is both the solution and the report.**

**Decisions should be justified and results should be critically discussed.**

## Tools

The team should use [Python 3](https://www.python.org) and [Jupyter Notebook](http://jupyter.org), together with **[Scikit-learn](http://scikit-learn.org/stable/)**, **[Orange3](https://orange.biolab.si)**, or **both**.

**[Orange3](https://orange.biolab.si)** can be used through its **[programmatic version](https://docs.orange.biolab.si/3/data-mining-library/)**, by importing and using its packages, or through its **workflow version**. 

**It is up to the team to decide when to use Scikit-learn, Orange, or both.**

In this context, your Jupyter notebook might have a mix of code, results, text explanations, workflow figures, etc. 

In case you use Orange/workflows for some tasks you should also deliver the workflow files and explain the options taken in each widget in your notebook.

**You can use this notebook and the sections below as a template for your work.**

## Dataset

The dataset to be analysed is **`Donors_dataset.csv`**, made available together with this project description. This dataset, downloaded from [Kaggle](https://www.kaggle.com), contains selected data from the following dataset: [Donors-Prediction](https://www.kaggle.com/momohmustapha/donorsprediction/)


**In this project, your team is supposed to use only tabular data (not Images or Image Metadata) and see how far you can go in predicting donations and understanding the donors. You should use both supervised and unsupervised learning to tackle 2 tasks:**

1. **Task 1 (Supervised Learning) - Predicting Donation and Donation Type**
2. **Task 2 (Unsupervised Learning) - Characterizing Donors**

The **`Donors_dataset.csv`** you should learn from has **19,372 instances** described by **50 data fields** that you might use as **categorical/numerical features**.

### File Descriptions

* **Donors_dataset.csv** - Tabular/text data to be used in the machine learning tasks.


### Data Fields

* **CARD_PROM_12** - number of card promotions sent to the individual by the charitable organization in the past 12 months
* **CLUSTER_CODE** - one of 54 possible cluster codes, which are unique in terms of socioeconomic status, urbanicity, ethnicity, and other demographic characteristics
* **CONTROL_NUMBER** - unique identifier of each individual
* **DONOR_AGE** - age as of last year's mail solicitation
* **DONOR_GENDER** - actual or inferred gender
* **FILE_AVG_GIFT** - this variable is identical to LIFETIME_AVG_GIFT_AMT
* **FILE_CARD_GIFT** - lifetime average donation (in \\$) from the individual in response to all card solicitations from the charitable organization
* **FREQUENCY_STATUS_97NK** - based on the period of recency (determined by RECENCY_STATUS_96NK), which is the past 12 months for all groups except L and E. L and E are 13–24 months ago and 25–36 months ago, respectively: 1 if one donation in this period, 2 if two donations in this period, 3 if three donations in this period, and 4 if four or more donations in this period.
* **HOME_OWNER** - H if the individual is a homeowner, U if this information is unknown
* **INCOME_GROUP** - one of 7 possible income level groups based on a number of demographic characteristics
* **IN_HOUSE** - 1 if the individual has ever donated to the charitable organization's In House program, 0 if not
* **LAST_GIFT_AMT** - amount of the most recent donation from the individual to the charitable organization
* **LIFETIME_AVG_GIFT_AMT** - lifetime average donation (in \\$) from the individual to the charitable organization
* **LIFETIME_CARD_PROM** - total number of card promotions sent to the individual by the charitable organization
* **LIFETIME_GIFT_AMOUNT** - total lifetime donation amount (in \\$) from the individual to the charitable organization
* **LIFETIME_GIFT_COUNT** - total number of donations from the individual to the charitable organization
* **LIFETIME_GIFT_RANGE** - maximum donation amount from the individual minus minimum donation amount from the individual
* **LIFETIME_MAX_GIFT_AMT** - maximum donation amount (in \\$) from the individual to the charitable organization
* **LIFETIME_MIN_GIFT_AMT** - minimum donation amount (in \\$) from the individual to the charitable organization
* **LIFETIME_PROM** - total number of promotions sent to the individual by the charitable organization
* **MEDIAN_HOME_VALUE** - median home value (in 100\\$) as determined by other input variables
* **MEDIAN_HOUSEHOLD_INCOME** - median household income (in 100\\$) as determined by other input variables
* **MONTHS_SINCE_FIRST_GIFT** - number of months since the first donation from the individual to the charitable organization
* **MONTHS_SINCE_LAST_GIFT** - number of months since the most recent donation from the individual to the charitable organization
* **MONTHS_SINCE_LAST_PROM_RESP** - number of months since the individual has responded to a promotion by the charitable organization
* **MONTHS_SINCE_ORIGIN** - number of months that the individual has been in the charitable organization's database
* **MOR_HIT_RATE** - total number of known times the donor has responded to a mailed solicitation from a group other than the charitable organization
* **NUMBER_PROM_12** - number of promotions (card or other) sent to the individual by the charitable organization in the past 12 months
* **OVERLAY_SOURCE** - the data source against which the individual was matched: M if Metromail, P if Polk, B if both
* **PCT_ATTRIBUTE1** - percent of residents in the neighborhood in which the individual lives that are males and active military
* **PCT_ATTRIBUTE2** - percent of residents in the neighborhood in which the individual lives that are males and veterans
* **PCT_ATTRIBUTE3** - percent of residents in the neighborhood in which the individual lives that are Vietnam veterans
* **PCT_ATTRIBUTE4** - percent of residents in the neighborhood in which the individual lives that are WWII veterans
* **PCT_OWNER_OCCUPIED** - percent of owner-occupied housing in the neighborhood in which the individual lives
* **PEP_STAR** - 1 if individual has ever achieved STAR donor status, 0 if not
* **PER_CAPITA_INCOME** - per capita income (in \\$) of the neighborhood in which the individual lives
* **PUBLISHED_PHONE** - 1 if the individual's telephone number is published, 0 if not
* **RECENCY_STATUS_96NK** - recency status as of two years ago: A if active donor, S if star donor, N if new donor, E if inactive donor, F if first time donor, L if lapsing donor
* **RECENT_AVG_CARD_GIFT_AMT** - average donation from the individual in response to a card solicitation from the charitable organization since four years ago
* **RECENT_AVG_GIFT_AMT** - average donation (in \\$) from the individual to the charitable organization since four years ago
* **RECENT_CARD_RESPONSE_COUNT** - number of times the individual has responded to a card solicitation from the charitable organization since four years ago
* **RECENT_CARD_RESPONSE_PROP** - proportion of responses to the individual to the number of card solicitations from the charitable organization since four years ago
* **RECENT_RESPONSE_COUNT** - number of times the individual has responded to a promotion (card or other) from the charitable organization since four years ago
* **RECENT_RESPONSE_PROP** - proportion of responses to the individual to the number of (card or other) solicitations from the charitable organization since four years ago
* **RECENT_STAR_STATUS** - 1 if individual has achieved star donor status since four years ago, 0 if not
* **SES** - one of 5 possible socioeconomic codes classifying the neighborhood in which the individual lives
* **TARGET_B** - 1 if individual donated in response to last year's 97NK mail solicitation from the charitable organization, 0 if individual did not
* **TARGET_D** - amount of donation (in \\$) from the individual in response to last year's 97NK mail solicitation from the charitable organization
* **URBANICITY** - classification of the neighborhood in which the individual lives: U if urban, C if city, S if suburban, T if town, R if rural, ? if missing
* **WEALTH_RATING** - one of 10 possible wealth rating groups based on a number of demographic characteristics


### Donation TYPE

You are supposed to create a new column/feature named `DONATION_TYPE`, whose values describe ranges of the donation amount (DA) reported in feature `TARGET_D`:
* `A` - DA >= 50
* `B` - 20 <= DA < 50 
* `C` - 13 <= DA < 20
* `D` - 10 <= DA < 13
* `E` - DA < 10
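
The ranges above can be sketched with `pandas.cut`. This is only an illustration under stated assumptions: the helper name `donation_type` and the toy amounts are hypothetical, and missing amounts are left missing rather than assigned to a class.

```python
import numpy as np
import pandas as pd

# Bin edges and labels follow the ranges listed above; intervals are closed on the left.
DONATION_BINS = [-np.inf, 10, 13, 20, 50, np.inf]
DONATION_LABELS = ['E', 'D', 'C', 'B', 'A']

def donation_type(amounts: pd.Series) -> pd.Series:
    """Map donation amounts to the classes A-E; missing amounts stay missing."""
    return pd.cut(amounts, bins=DONATION_BINS, labels=DONATION_LABELS, right=False)

# Toy amounts (not taken from the dataset) covering every boundary
example = pd.Series([5, 10, 12.99, 13, 20, 49.99, 50, np.nan])
example_types = donation_type(example)
print(example_types.tolist())
```

Using `right=False` makes each interval include its lower bound, which matches the `>=` / `<` pattern of the definition.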


### **Important Notes on Data Cleaning and Preprocessing**

   1. Data can contain **errors/typos**, whose correction might improve the analysis.
   2. Some features can contain **many values**, whose grouping in categories (aggregation into bins) might improve the analysis.
   3. Data can contain **missing values**, that you might decide to fill. You might also decide to eliminate instances/features with high percentages of missing values.
   4. **Not all features are necessarily important** for the analysis.
   5. Depending on the analysis, **some features might have to be excluded**.
   6. Class distribution is an important characteristic of the dataset that should be checked. **Class imbalance** might impair machine learning. 
  
Some potentially useful links:

* Data Cleaning and Preprocessing in Scikit-learn: https://scikit-learn.org/stable/modules/preprocessing.html#
* Data Cleaning and Preprocessing in Orange: https://docs.biolab.si//3/visual-programming/widgets/data/preprocess.html
* Dealing with imbalance datasets: https://pypi.org/project/imbalanced-learn/ and https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets#t7

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import model_selection
from sklearn.utils import class_weight
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn import tree
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder


from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Task 0 (Know your Data) - Exploratory Data Analysis

## Loading Data

In [None]:
raw_data = pd.read_csv("Donors_dataset.csv")
raw_data 

## Understanding Data

In this task you should **understand better the features**, their distribution of values, potential errors, etc and plan/describe what data preprocessing steps should be performed next. Very important also is to check the distribution of values in the target (class distribution). 

Here you can find a notebook with some examples of what you can do in **Exploratory Data Analysis**: https://www.kaggle.com/artgor/exploration-of-data-step-by-step/notebook. You can also use Orange widgets for this.

### Exploratory Data Analysis

First we will print and plot different tables and visualisations to get a feeling and better overview of the data. 

In [None]:
raw_data.info() 

In [None]:
raw_data.describe()

Next we will plot the histograms to check if there are numerical features with a high number of values, that can possibly be binned to be more convenient for the models. 

#### Histogram Plot for numerical features

In [None]:
raw_data.hist(bins=30, figsize=(20, 20))

In [None]:
# sort features in different categories for plotting:
all_features = list(raw_data.columns[2:])
pscontinuous_features = ['CONTROL_NUMBER', 'MONTHS_SINCE_ORIGIN', 'DONOR_AGE', 'PER_CAPITA_INCOME', \
                              'WEALTH_RATING', 'MEDIAN_HOME_VALUE', 'MEDIAN_HOUSEHOLD_INCOME', \
                             'PCT_OWNER_OCCUPIED', 'PCT_ATTRIBUTE2', 'PCT_ATTRIBUTE3', 'PCT_ATTRIBUTE4', \
                              'RECENT_RESPONSE_PROP', 'RECENT_AVG_GIFT_AMT', 'RECENT_CARD_RESPONSE_PROP', \
                              'RECENT_AVG_CARD_GIFT_AMT', 'RECENT_RESPONSE_COUNT', 'RECENT_CARD_RESPONSE_COUNT', \
                              'MONTHS_SINCE_LAST_PROM_RESP', 'LIFETIME_CARD_PROM', 'LIFETIME_PROM', \
                              'LIFETIME_GIFT_COUNT', 'LAST_GIFT_AMT', 'NUMBER_PROM_12', 'MONTHS_SINCE_LAST_GIFT', \
                              'MONTHS_SINCE_FIRST_GIFT', 'FILE_CARD_GIFT','CARD_PROM_12']
categorical_features = ['IN_HOUSE', 'URBANICITY', 'SES', 'HOME_OWNER', 'DONOR_GENDER', 'INCOME_GROUP', \
                        'PUBLISHED_PHONE', 'OVERLAY_SOURCE', 'PEP_STAR', 'RECENCY_STATUS_96NK', \
                        'FREQUENCY_STATUS_97NK']
other_features = ['CLUSTER_CODE', 'MOR_HIT_RATE', 'PCT_ATTRIBUTE1', 'RECENT_STAR_STATUS', 'LIFETIME_GIFT_AMOUNT', \
                  'LIFETIME_AVG_GIFT_AMT', 'LIFETIME_GIFT_RANGE', 'LIFETIME_MAX_GIFT_AMT', 'LIFETIME_MIN_GIFT_AMT', \
                  'FILE_AVG_GIFT']

# check that all features are included and none is counted twice:
feature_lists = pscontinuous_features+categorical_features+other_features
print('All features included and no doubles: ', \
      len(all_features)==len(feature_lists) and set(feature_lists)==set(all_features))

def create_plots(features_to_plot, plottype='violin'):
    '''Creates plots for the given features. Plot types: 'violin', 'count'. We can add more if necessary.
    Careful: if the number of values per feature is high when 'count' is chosen, running time goes up.'''
    print(f'{len(features_to_plot)} plots:')
    ncols = 3
    nrows = int(len(features_to_plot)/ncols)+1
    plt.figure(figsize=(ncols*5,nrows*5))
    for ind, feature in enumerate(features_to_plot):
        plt.subplot(nrows, ncols, ind+1)
        if plottype=='violin':
            # fontsize is not a violinplot argument, so it is only set on the title
            sns.violinplot(x="TARGET_B", y=feature, data=raw_data)
            plt.title(f'TARGET_B by {feature}', fontsize=8)
        else:
            sns.countplot(x='TARGET_B', data=raw_data, hue=feature)
            plt.title(f'{feature} in TARGET_B', fontsize=8)

####  Violin Plots for numerical Data

In [None]:
create_plots(pscontinuous_features, 'violin')

#### Countplots for categorical features 

In [None]:
create_plots(categorical_features, 'count')

In [None]:
plot = raw_data['TARGET_B'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('TARGET_B classes counts')


#### Violin plots for other features 

In [None]:
create_plots(other_features, 'violin')

----

Result of the exploratory analysis: the feature `DONOR_GENDER` contains one erroneous value, a typo ( = "A"). That row can be deleted. 

----
 

In [None]:
display(raw_data[raw_data.DONOR_GENDER == "A"])

In [None]:
# Drop the row(s) with the invalid gender code "A" by condition rather than by a hard-coded index
raw_data = raw_data[raw_data.DONOR_GENDER != "A"].reset_index(drop=True)
display(raw_data) 


In [None]:
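# Correlations are only defined for numeric columns, so select them explicitly
corr = raw_data.select_dtypes(include='number').corr()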

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 14))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(250, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
heatmap = sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);


---

Result of the exploratory analysis: Highly correlated features (red = positively, blue = negatively correlated) can be redundant because they add no additional information and may be useless for the model. 

TODO: consider deleting redundant features. 
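
One possible way to turn the heatmap into a concrete list of candidates is to extract the feature pairs whose absolute correlation exceeds a threshold. The sketch below is an illustration only: the helper `highly_correlated_pairs`, the threshold of 0.9 and the toy frame are assumptions, not part of the project.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.select_dtypes(include='number').corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy frame (hypothetical values): 'b' is a linear function of 'a', 'c' is unrelated
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 10],
                    'c': [5, 1, 4, 2, 3]})
corr_pairs = highly_correlated_pairs(toy)
print(corr_pairs)
```

From each reported pair, one feature could then be dropped; which one to keep remains a modelling decision.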

---

### Transforming donation amount in classes

In [None]:
transformed_data = raw_data.copy()

In [None]:
transformed_data = transformed_data.drop(columns = ["CONTROL_NUMBER"])

In [None]:
def label_donation_type(row):
    if row['TARGET_D'] >= 50:
        return 'A'
    if row['TARGET_D'] >= 20 and row['TARGET_D'] < 50:
        return 'B'
    if row['TARGET_D'] >= 13 and row['TARGET_D'] < 20:
        return 'C'
    if row['TARGET_D'] >= 10 and row['TARGET_D'] < 13:
        return 'D'
    if row['TARGET_D'] < 10:
        return 'E'
    return '?'

transformed_data['DONATION_TYPE'] = transformed_data.apply (lambda row: label_donation_type(row), axis=1)
transformed_data = transformed_data.drop(columns = ('TARGET_D'))

display(transformed_data)

In [None]:
transformed_data['WEALTH_RATING'].min()

# Task 1 (Supervised Learning) - Predicting Donation and Donation Type

In this task you should target 3 classification tasks:
1. **Predicting  Donation (binary classification task)**; 
2. **Predicting Donation TYPE (multiclass classification)**; and
3. **Train specialized models for SES (socioeconomic classification)**.

**You should:**

* Choose **one classifier in each category**: Tree models, Rule models, Linear models, Distance-based models, and Probabilistic models.
* Use cross-validation to evaluate the results. 
* Present and discuss the results for different evaluation measures, present confusion matrices. Remember that not only overall results are important. Check what happens when learning to predict each class.
* Describe the parameters used for each classifier and how their choice impacted or not the results.
* Choose the best classifier and justify your choice.
* **Discuss critically your choices and the results!**
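
The evaluation steps listed above (cross-validation, confusion matrix, per-class measures) can be sketched as follows. This is a minimal example on synthetic data: the generated `X_demo`/`y_demo`, the decision tree and its `max_depth=4` are placeholders chosen for illustration, not the project's data or final model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (an assumption; replace with the
# preprocessed features and the TARGET_B column)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.75, 0.25], random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each sample is predicted by a model that did not see it during training
y_pred = cross_val_predict(clf, X_demo, y_demo, cv=cv)

cm = confusion_matrix(y_demo, y_pred)
print(cm)
# Precision, recall and F1 per class, not only the overall accuracy
print(classification_report(y_demo, y_pred))
```

Stratified folds keep the class proportions similar in every fold, which matters when one class is much rarer than the other.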

## Preprocessing Data for Classification

### Binning numerical data 

Result of Histogram Analysis: 

Attributes worth binning (because they have a high number of values): 

- DONOR_AGE
- LIFETIME_CARD_PROM
- LIFETIME_GIFT_COUNT
- LIFETIME_PROM
- MEDIAN_HOME_VALUE
- MEDIAN_HOUSEHOLD_INCOME
- MONTHS_SINCE_LAST_GIFT
- MONTHS_SINCE_FIRST_GIFT
- PCT_ATTRIBUTE1
- PCT_ATTRIBUTE2
- PCT_ATTRIBUTE3
- PCT_ATTRIBUTE4
- PCT_OWNER_OCCUPIED
- PER_CAPITA_INCOME
- RECENT_RESPONSE_PROP
- MONTHS_SINCE_LAST_PROM_RESP

TODO: create bins 

- either quantile binning (our first approach)
- or Log transform 

---

In [None]:
plot = transformed_data['DONATION_TYPE'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('DONATION_TYPE classes counts')


In [None]:
x = dict(transformed_data.isna().sum())
{k: v for k, v in sorted(x.items(), key=lambda item: item[1])}

In [None]:
plot = transformed_data['WEALTH_RATING'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('WEALTH_RATING classes counts')


--- 

The dataset contains NaN values in some columns. As the models cannot be trained with NaN values, there are two options: either drop the affected rows, which would leave a very small dataset, or replace the NaN values (using different techniques) in order to keep a large dataset and not lose information.

Features with missing values, that need to be replaced are: 

- 'WEALTH_RATING': 8810
- 'DONOR_AGE_label': 4795
- 'INCOME_GROUP': 4392
- 'MONTHS_SINCE_LAST_PROM_RESP': 246

- 'URBANICITY'
- 'SES'
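
A list like the one above can be produced with a small summary helper. The sketch below is an assumption-laden illustration: `missing_summary` is a hypothetical helper, and treating `'?'` as a missing-value placeholder follows the URBANICITY description; whether other columns use the same marker has to be checked in the data.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame, placeholders=('?',)) -> pd.DataFrame:
    """Count missing entries per column, treating placeholder strings as missing."""
    cleaned = df.replace(list(placeholders), np.nan)
    counts = cleaned.isna().sum()
    summary = pd.DataFrame({'n_missing': counts,
                            'pct_missing': (100 * counts / len(df)).round(1)})
    return summary[summary['n_missing'] > 0].sort_values('n_missing', ascending=False)

# Toy frame (hypothetical values) with NaN and '?' as missing markers
toy = pd.DataFrame({'WEALTH_RATING': [1, np.nan, np.nan, 4],
                    'URBANICITY': ['U', '?', 'C', 'S'],
                    'TARGET_B': [0, 1, 0, 0]})
toy_summary = missing_summary(toy)
print(toy_summary)
```

The percentage column helps decide between imputing a feature and dropping it altogether.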

### Plots

In [None]:
plot = transformed_data['WEALTH_RATING'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('WEALTH_RATING classes counts')


In [None]:
#plot = transformed_data['DONOR_AGE_label'].value_counts().sort_index().plot(kind = 'barh')
#plot.set_title('DONOR_AGE_label classes counts')


In [None]:
plot = transformed_data['INCOME_GROUP'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('INCOME_GROUP classes counts')


In [None]:
plot = transformed_data['MONTHS_SINCE_LAST_PROM_RESP'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('MONTHS_SINCE_LAST_PROM_RESP classes counts')



In [None]:
plot = transformed_data['SES'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('SES classes counts')


### Replace invalid values (outliers) with mode of column 

- TODO : How did we recognize the outliers ? 

In [None]:
transformed_data.MOR_HIT_RATE.plot()

---
Result: MOR_HIT_RATE (number of responses to mailings from other organizations) seems too high in some cases; these values will be replaced by the column mode. 


In [None]:
transformed_data.MONTHS_SINCE_LAST_PROM_RESP.describe()

---
Result: MONTHS_SINCE_LAST_PROM_RESP (months since the last response to a promotion) cannot be negative; negative values will be replaced by the column mode. 


In [None]:
# MOR_HIT_RATE : number of responses to other mailings. Values above 100 seem too high and are replaced by the column mode.
mor_hit_rate_mode = transformed_data.MOR_HIT_RATE.mode()[0]
transformed_data.MOR_HIT_RATE = transformed_data.MOR_HIT_RATE.apply(lambda x: mor_hit_rate_mode if x > 100 else x)

# MONTHS_SINCE_LAST_PROM_RESP : months since the last response cannot be negative; negative values are replaced by the column mode.
months_resp_mode = transformed_data.MONTHS_SINCE_LAST_PROM_RESP.mode()[0]
transformed_data.MONTHS_SINCE_LAST_PROM_RESP = transformed_data.MONTHS_SINCE_LAST_PROM_RESP.apply(lambda x: months_resp_mode if x < 0.0 else x)


### Encoding categorical features

In [None]:
### show Columns that are of the type "object" and need to be transformed to numerical values

obj_df = transformed_data.select_dtypes(include=['object']).copy()
obj_df.head()

**In order to clean the data, especially the categorical data, we are encoding the features with the LabelEncoder()** <br>
LabelEncoder(): <br>
LabelEncoder transforms non-numerical labels into integer labels between 0 and n_classes-1. It does not produce a one-hot array; that is what OneHotEncoder does.
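
The difference between label encoding and one-hot encoding can be seen on a toy column. The sketch below uses the URBANICITY codes from the data field list with made-up rows; it is an illustration, not the encoding step of this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column using the URBANICITY codes from the data field list
urbanicity = pd.Series(['U', 'C', '?', 'S', 'U'], name='URBANICITY')

# Label encoding: one integer per category, assigned in sorted order of the labels
encoder = LabelEncoder()
codes = encoder.fit_transform(urbanicity)
print(list(encoder.classes_))   # categories in the order of their integer codes
print(list(codes))

# One-hot encoding: one indicator column per category, no implied ordering
one_hot = pd.get_dummies(urbanicity, prefix='URBANICITY')
print(one_hot.columns.tolist())

# A fitted encoder can map the integers back to the original labels
print(list(encoder.inverse_transform(codes)))
```

Because the integer codes follow the sorted label order, a marker such as `?` ends up as code 0. Keeping one fitted encoder per column preserves the mapping needed to decode each feature later.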

In [None]:
lb_make = LabelEncoder()


transformed_data["URBANICITY"] = lb_make.fit_transform(transformed_data["URBANICITY"])
URB_classes = lb_make.classes_.copy() 
transformed_data["SES"] = lb_make.fit_transform(transformed_data["SES"])
transformed_data["CLUSTER_CODE"] = lb_make.fit_transform(transformed_data["CLUSTER_CODE"])
transformed_data["HOME_OWNER"] = lb_make.fit_transform(transformed_data["HOME_OWNER"])
transformed_data["DONOR_GENDER"] = lb_make.fit_transform(transformed_data["DONOR_GENDER"])
DGE_classes = lb_make.classes_.copy()
transformed_data["OVERLAY_SOURCE"] = lb_make.fit_transform(transformed_data["OVERLAY_SOURCE"])
OLS_classes = lb_make.classes_.copy()
transformed_data["RECENCY_STATUS_96NK"] = lb_make.fit_transform(transformed_data["RECENCY_STATUS_96NK"])
transformed_data["DONATION_TYPE"] = lb_make.fit_transform(transformed_data["DONATION_TYPE"])

DT_LaEnc = lb_make # For unsupervised learning

In [None]:
transformed_data.info()

In [None]:
plot = transformed_data['URBANICITY'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('URBANICITY classes counts')

### TODO: Data Cleaning

1. **add rule based model**
2. **oneHot Encoding Feature for all categorical features ? (?)**


model training : 

3. **redo multiclass classification with onehot encoded data**

5. **Importance analysis of features = package ?**


---

- todo: (optional)
    * Outlier Detection with Standard Deviation
    * https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
- todo: feature engineering:  (parallel zum model training)
    * You might also decide to eliminate instances/features with high percentages of missing values. 
    * Not all features are necessarily important for the analysis.
    * Depending on the analysis, some features might have to be excluded
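
The optional outlier detection with the standard deviation mentioned above could look like the sketch below. The helper name `flag_outliers_std`, the cut-off of three standard deviations and the toy values are assumptions for illustration.

```python
import pandas as pd

def flag_outliers_std(values: pd.Series, n_std: float = 3.0) -> pd.Series:
    """Boolean mask, True where a value lies more than n_std standard deviations from the mean."""
    mean, std = values.mean(), values.std()
    return (values - mean).abs() > n_std * std

# Toy data (hypothetical): twenty ordinary values and one extreme value
toy = pd.Series([10] * 10 + [12] * 10 + [500])
mask = flag_outliers_std(toy)
print(toy[mask])
```

The mean and standard deviation are themselves affected by extreme values, so the cut-off is a heuristic and flagged rows should be inspected before they are replaced.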


### Replace missing Data (NaNs)

In [None]:
x = dict(transformed_data.isna().sum())
{k: v for k, v in sorted(x.items(), key=lambda item: item[1])}
raw_data.MOR_HIT_RATE

The LabelEncoder mapped all missing values to 0. URBANICITY, SES and CLUSTER_CODE have no genuine category 0, so 0 serves as an identifier for missing values.
To impute the data correctly, these 0 values have to be set to NaN again:

In [None]:
transformed_data.loc[transformed_data['URBANICITY'] == 0,'URBANICITY'] = np.nan
transformed_data.loc[transformed_data['SES'] == 0,'SES'] = np.nan
transformed_data.loc[transformed_data['CLUSTER_CODE'] == 0,'CLUSTER_CODE'] = np.nan

In [None]:
impute_mode = ['URBANICITY', 'SES', 'CLUSTER_CODE', 'INCOME_GROUP', 'WEALTH_RATING']
impute_mean = ['MONTHS_SINCE_LAST_PROM_RESP']

Replace NaNs with mode()/most frequent: <br>
    - ['URBANICITY', 'SES', 'CLUSTER_CODE', 'INCOME_GROUP', 'WEALTH_RATING']
   
Replace NaNs with mean(): <br>
    - ['MONTHS_SINCE_LAST_PROM_RESP']



In [None]:
for feature in impute_mode:
    print(transformed_data[feature].mode()[0])
    # assign the result back instead of calling fillna(inplace=True) on a column selection
    transformed_data[feature] = transformed_data[feature].fillna(transformed_data[feature].mode()[0])

for feature in impute_mean:
    print(transformed_data[feature].mean())
    transformed_data[feature] = transformed_data[feature].fillna(transformed_data[feature].mean())

The remaining NaN values will be replaced using the KNNImputer with n_neighbors=20.

In [None]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=20)
transformed_data = pd.DataFrame(imputer.fit_transform(transformed_data), columns=transformed_data.columns)

Check that all NaNs have been replaced:

In [None]:
#plot = df_filled['URBANICITY'].value_counts().sort_index().plot(kind = 'barh')
x = dict(transformed_data.isna().sum())
{k: v for k, v in sorted(x.items(), key=lambda item: item[1])}

In [None]:
transformed_data['MOR_HIT_RATE'].hist()

In [None]:
transformed_data.info()

In [None]:
CL_data = transformed_data.copy() #This data should not be binned, but encoded --> clustering

### Binning

In [None]:
features_for_binning =  ["DONOR_AGE", "LIFETIME_CARD_PROM", "LIFETIME_GIFT_COUNT", "LIFETIME_PROM", "MONTHS_SINCE_ORIGIN", "MEDIAN_HOME_VALUE", "MEDIAN_HOUSEHOLD_INCOME", "MONTHS_SINCE_FIRST_GIFT", "PCT_ATTRIBUTE2", "PCT_ATTRIBUTE3", "PCT_ATTRIBUTE4", "PCT_OWNER_OCCUPIED", "PER_CAPITA_INCOME", "RECENT_RESPONSE_PROP"]
default_quantiles = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
default_labels = [1,2,3,4,5,6,7,8,9,10]

for feature in features_for_binning:
    print(feature)
    if feature=="MONTHS_SINCE_ORIGIN": # fewer bins for this feature because of its high number of zeros
        quantile_list = np.linspace(0, 1.0, 7)
        quantile_labels = np.arange(0,len(quantile_list)-1)
    else:
        # reset to the defaults so the reduced bins do not carry over to the following features
        quantile_list = default_quantiles
        quantile_labels = default_labels

    transformed_data[f'{feature}_label'] = pd.qcut(transformed_data[feature],q=quantile_list,labels=quantile_labels, duplicates='drop')

transformed_data = transformed_data.drop(columns = features_for_binning)    
display(transformed_data)

In [None]:
# This data will be used in Rule Mining
RM_data = transformed_data.copy()

### Convert float64 to int

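Converting to int for better performance.
For columns that hold whole numbers this makes no difference. Columns with genuine decimals that were not binned (for example RECENT_CARD_RESPONSE_PROP and the average gift amounts) are rounded to the nearest integer, which discards their fractional part. 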

In [None]:
tf = transformed_data.select_dtypes(include=['float64'])
#tf.drop(columns=['RECENT_CARD_RESPONSE_PROP','RECENT_AVG_GIFT_AMT', 'RECENT_AVG_CARD_GIFT_AMT', 
#                 'LIFETIME_AVG_GIFT_AMT', 'LIFETIME_GIFT_RANGE', 'LIFETIME_MAX_GIFT_AMT', 'LIFETIME_MIN_GIFT_AMT', 'FILE_AVG_GIFT'])
transformed_data[tf.columns] = transformed_data[tf.columns].round(0).astype(int)

In [None]:
transformed_data.info()

**Tried to use OrdinalEncoder to replace NaNs, but decided to use LabelEncoder and manually replace NaN, by choosing the correct function (mean/mode/KNN)** <br>
**Code:** <br>
from sklearn.preprocessing import OrdinalEncoder
from fancyimpute import KNN
#instantiate both packages to use
encoder = OrdinalEncoder()
imputer = KNN()
**create a list of categorical columns to iterate over** <br>
to_encode = ['WEALTH_RATING','DONOR_AGE_label','INCOME_GROUP', 'URBANICITY','SES' ]


def encode(data):
    pd.set_option('mode.chained_assignment', None)
    '''function to encode non-null data and replace it in the original data'''
    #retains only non-null values
    nonulls = np.array(data.dropna())
    #reshapes the data for encoding
    impute_reshape = nonulls.reshape(-1,1)
    #encode date
    impute_ordinal = encoder.fit_transform(impute_reshape)
    #Assign back encoded values to non-null values
    tdata = data.copy()
    tdata.is_copy = None
    tdata.loc[data.notnull()] = np.squeeze(impute_ordinal)
  
    return tdata

#create a for loop to iterate through each column in the data
for columns in to_encode:
    encode(transformed_data[columns])

## Balancing the Data: Resampling the dataset

In [None]:
transformed_data['TARGET_B'].value_counts().plot(kind='bar', title='Count (TARGET_B)');

In [None]:
transformed_data['DONATION_TYPE'].hist()

**Balancing the Data:** <br>
Class distribution is an important characteristic of the dataset that should be checked. **Class imbalance** might impair machine learning.

After cleaning up the dataset, the class distribution is still imbalanced. The focus is on the target 'TARGET_B'. As the plot above shows, there are nearly 3 times as many "not donated" (0) as "donated" (1) instances.

To fix this, we tried random under- and over-sampling of the dataset. In the end, after optimising, we decided to go with over-sampling, because we suspect that under-sampling discards too much information. 
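
Random over-sampling as described above can be written as a small reusable function. This is a sketch under stated assumptions: `random_oversample` is a hypothetical helper, the toy frame is made up, and a fixed `random_state` is added only to make the result reproducible.

```python
import pandas as pd

def random_oversample(df: pd.DataFrame, target: str, random_state: int = 0) -> pd.DataFrame:
    """Resample every class with replacement up to the size of the largest class."""
    counts = df[target].value_counts()
    n_max = counts.max()
    parts = []
    for label, count in counts.items():
        rows = df[df[target] == label]
        if count < n_max:
            # Draw with replacement so a minority class can reach n_max rows
            rows = rows.sample(n_max, replace=True, random_state=random_state)
        parts.append(rows)
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy frame (hypothetical values): six rows of class 0, two rows of class 1
toy = pd.DataFrame({'feature': range(8), 'TARGET_B': [0, 0, 0, 0, 0, 0, 1, 1]})
balanced = random_oversample(toy, 'TARGET_B')
print(balanced['TARGET_B'].value_counts())
```

Over-sampling duplicates rows, so if it is applied before the data is split, copies of the same row can end up in both the training and the test part. Resampling only the training folds avoids this.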

### Random under-sampling

--> under-sample "0" to the count of "1". In the end we should have a balanced TARGET_B, where count(0) = count(1)

In [None]:
transformed_data

In [None]:
resampled_data = transformed_data.copy()
count_class_0, count_class_1 = resampled_data['TARGET_B'].value_counts()
print(count_class_0,count_class_1)

#Divide
td_class_0 = resampled_data[resampled_data['TARGET_B'] == 0]
td_class_1 = resampled_data[resampled_data['TARGET_B'] == 1]

td_class_0_under = td_class_0.sample(count_class_1)
td_under = pd.concat([td_class_0_under, td_class_1], axis=0)

print('Random under-sampling:')
print(td_under['TARGET_B'].value_counts())

td_under['TARGET_B'].value_counts().plot(kind='bar', title='Count (TARGET_B)');

td_under = td_under.reset_index(drop=True)



### Random over-sampling

--> over-sample "1" to the count of "0". In the end we should have a balanced TARGET_B, where count(0) = count(1)

In [None]:
td_class_1_over = td_class_1.sample(count_class_0, replace=True)
td_over = pd.concat([td_class_0, td_class_1_over], axis=0)

print('Random over-sampling:')
print(td_over['TARGET_B'].value_counts())

td_over['TARGET_B'].value_counts().plot(kind='bar', title='Count (TARGET_B)');

td_over = td_over.reset_index(drop=True)



### Over-sampling DONATION_TYPE

In [None]:
resampled_data_dt = transformed_data.copy()
# value_counts() sorts by frequency, so count_class_0 is the size of the largest class
count_class_0, count_class_1, count_class_2, count_class_3, count_class_4, count_class_5 = resampled_data_dt['DONATION_TYPE'].value_counts()
print(count_class_0,count_class_1)

#Divide (label 0 is assumed to be the largest class and is left unchanged)
td_class_0_dt = resampled_data_dt[resampled_data_dt['DONATION_TYPE'] == 0]
td_class_1_dt = resampled_data_dt[resampled_data_dt['DONATION_TYPE'] == 1]
td_class_2_dt = resampled_data_dt[resampled_data_dt['DONATION_TYPE'] == 2]
td_class_3_dt = resampled_data_dt[resampled_data_dt['DONATION_TYPE'] == 3]
td_class_4_dt = resampled_data_dt[resampled_data_dt['DONATION_TYPE'] == 4]
td_class_5_dt = resampled_data_dt[resampled_data_dt['DONATION_TYPE'] == 5]

# Over-sample classes 1-5 with replacement up to the size of the largest class
td_class_1_over_dt = td_class_1_dt.sample(count_class_0, replace=True)
td_class_2_over_dt = td_class_2_dt.sample(count_class_0, replace=True)
td_class_3_over_dt = td_class_3_dt.sample(count_class_0, replace=True)
td_class_4_over_dt = td_class_4_dt.sample(count_class_0, replace=True)
td_class_5_over_dt = td_class_5_dt.sample(count_class_0, replace=True)

td_over_dt = pd.concat([td_class_0_dt, td_class_1_over_dt, td_class_2_over_dt, td_class_3_over_dt, td_class_4_over_dt, td_class_5_over_dt], axis=0)

print('Random over-sampling:')
print(td_over_dt['DONATION_TYPE'].value_counts())

td_over_dt['DONATION_TYPE'].value_counts().plot(kind='bar', title='Count (DONATION_TYPE)');

td_over_dt = td_over_dt.reset_index(drop=True)
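
The per-class over-sampling above could also be written as a loop over the classes. The following is only a minimal sketch: the helper name `oversample_to_largest` and the toy data are assumptions made for illustration and are not part of the project code.

```python
import pandas as pd

def oversample_to_largest(df: pd.DataFrame, target: str, random_state: int = 0) -> pd.DataFrame:
    """Randomly over-sample every class of `target` up to the size of the largest class."""
    n_max = df[target].value_counts().max()
    parts = []
    for _, group in df.groupby(target):
        if len(group) < n_max:
            # sample with replacement so that small classes can reach n_max rows
            group = group.sample(n_max, replace=True, random_state=random_state)
        parts.append(group)
    return pd.concat(parts, axis=0).reset_index(drop=True)

# toy data standing in for transformed_data (hypothetical values)
toy = pd.DataFrame({"DONATION_TYPE": [0, 0, 0, 0, 1, 1, 2], "x": range(7)})
balanced = oversample_to_largest(toy, "DONATION_TYPE")
print(balanced["DONATION_TYPE"].value_counts())
```

Because the helper looks up the largest class itself, it does not depend on class 0 being the most frequent one.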


### One Hot encoding DONATION_TYPE

In [None]:
td_over['DONATION_TYPE'].hist()


In [None]:
transformed_data = td_over_dt.copy()

### One Hot Encoding
one_hot = pd.get_dummies(transformed_data['DONATION_TYPE'], prefix= 'DONATION_TYPE')
transformed_data = transformed_data.drop('DONATION_TYPE', axis=1)
transformed_data = transformed_data.join(one_hot)
transformed_data



### Python imbalanced learn module

#!pip3 install imblearn
import imblearn

def plot_2d_space(X, y, label='Classes'):   
    colors = ['#1F77B4', '#FF7F0E']
    markers = ['o', 's']
    for l, c, m in zip(np.unique(y), colors, markers):
        plt.scatter(
            X[y==l, 0],
            X[y==l, 1],
            c=c, label=l, marker=m
        )
    plt.title(label)
    plt.legend(loc='upper right')
    plt.show()

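from imblearn.under_sampling import RandomUnderSampler

# Split features and target; recent imblearn releases use fit_resample()
# and expose the kept rows via sample_indices_ (return_indices / fit_sample were removed)
X = resampled_data.drop(columns=['TARGET_B'])
y = resampled_data['TARGET_B']

rus = RandomUnderSampler(random_state=0)
X_rus, y_rus = rus.fit_resample(X, y)
id_rus = rus.sample_indices_

print('Kept indexes:', id_rus)

# plot_2d_space expects NumPy arrays and plots the first two feature columns
plot_2d_space(np.asarray(X_rus), np.asarray(y_rus), 'Random under-sampling')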

**Resampling the training set pushed the accuracy to 0.51.
Conclusion: no resampling of the training set.**

Code: <br>
count_class_0, count_class_1 = train_dataset['TARGET_B'].value_counts()
print(count_class_0,count_class_1)

#Divide
td_class_0 = train_dataset[train_dataset['TARGET_B'] == 0]
td_class_1 = train_dataset[train_dataset['TARGET_B'] == 1]

td_class_0_under = td_class_0.sample(count_class_1)
td_under = pd.concat([td_class_0_under, td_class_1], axis=0)

print('Random under-sampling:')
print(td_under['TARGET_B'].value_counts())

td_under['TARGET_B'].value_counts().plot(kind='bar', title='Count (TARGET_B)');

train_dataset = td_under




---

Class imbalance for DONATION_TYPE

- 

## Creating Training and Test data (Splitting)

### Split for normal classification task

To train and evaluate the models, we need to split the data into training and test sets. 
As a basis for the split we can use one of the base datasets we created before:

- transformed_data (cleaned dataset without class balancing, i.e. imbalanced classes (donates / not donating))
- td_under (cleaned dataset with TARGET_B (donates / not donating) balanced by random under-sampling)
- td_over (cleaned dataset with TARGET_B (donates / not donating) balanced by random over-sampling)
- td_over_dt (cleaned dataset with DONATION_TYPE balanced by random over-sampling)

To select a dataset, uncomment the corresponding line in the following cell.

In [None]:
#transformed_data = transformed_data.copy()
#transformed_data = td_under.copy()
transformed_data = td_over_dt.copy()



For the split, we drop the two target columns ["TARGET_B", "DONATION_TYPE"] from the training and test features (X) and keep them as separate target vectors (y), again for training and test. 
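
As an alternative to sampling and dropping rows by hand, the same 80/20 split could be done with scikit-learn's `train_test_split`. This is only a sketch under assumptions: the toy data frame and the decision to stratify on DONATION_TYPE are illustrative and not taken from the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy data standing in for transformed_data (hypothetical values)
toy = pd.DataFrame({
    "feature": range(20),
    "TARGET_B": [0] * 10 + [1] * 10,
    "DONATION_TYPE": [0] * 10 + [1] * 5 + [2] * 5,
})

targets = ["TARGET_B", "DONATION_TYPE"]
X = toy.drop(columns=targets)   # features only
y = toy[targets]                # both target columns

# 80/20 split; stratifying on DONATION_TYPE keeps the class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=87, stratify=toy["DONATION_TYPE"]
)
print(X_tr.shape, X_te.shape)
```

Stratification matters most for rare donation types, which an unstratified split can leave out of the test set entirely.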


In [None]:
#dt_columns = ["DONATION_TYPE_0", "DONATION_TYPE_1", "DONATION_TYPE_2","DONATION_TYPE_3", "DONATION_TYPE_4", "DONATION_TYPE_5"]

In [None]:
train_dataset = transformed_data.sample(frac=0.8,random_state=87)
test_dataset = transformed_data.drop(train_dataset.index)

X_train = train_dataset.drop(columns = ["TARGET_B", "DONATION_TYPE" ])
X_test = test_dataset.drop(columns = ["TARGET_B", "DONATION_TYPE"])

y_train_target_b = train_dataset.pop("TARGET_B")
y_test_target_b = test_dataset.pop('TARGET_B')

#y_train_donation_type = train_dataset["DONATION_TYPE"].values
#y_test_donation_type = test_dataset["DONATION_TYPE"].values

y_train_donation_type = train_dataset["DONATION_TYPE"]
y_test_donation_type = test_dataset["DONATION_TYPE"]



In [None]:
y_train_donation_type

### Split for classification task for the specific SES classes 

We also want to train and evaluate models for the specific socioeconomic (SES) classes. 
To do so, we split the data by SES class and create a training and a test dataset for each class. 

In [None]:
y_train_donation_type_array = y_train_donation_type.to_numpy()
y_test_donation_type_array = y_test_donation_type.to_numpy()
print(y_train_donation_type_array)

In [None]:
SES_1 = transformed_data[transformed_data.SES == 0]
SES_2 = transformed_data[transformed_data.SES == 1]
SES_3 = transformed_data[transformed_data.SES == 2]
SES_4 = transformed_data[transformed_data.SES == 3]
SES_nan = transformed_data[transformed_data.SES == 4]

def split_sets(df):

    train_dataset = df.sample(frac=0.8,random_state=0)
    test_dataset = df.drop(train_dataset.index)

    X_train = train_dataset.drop(columns = ["TARGET_B", "DONATION_TYPE"])
    X_test = test_dataset.drop(columns = ["TARGET_B", "DONATION_TYPE"])

    y_train_target_b = train_dataset.pop("TARGET_B")
    y_test_target_b = test_dataset.pop('TARGET_B')

    #y_train_donation_type = train_dataset["DONATION_TYPE"].values
    #y_test_donation_type = test_dataset["DONATION_TYPE"].values

    y_train_donation_type = train_dataset.pop("DONATION_TYPE")
    y_test_donation_type = test_dataset.pop('DONATION_TYPE')
    
    return X_train, X_test, y_train_target_b, y_test_target_b, y_train_donation_type, y_test_donation_type


X_train_SES_1, X_test_SES_1, y_train_target_b_SES_1, y_test_target_b_SES_1, y_train_donation_type_SES_1, y_test_donation_type_SES_1 = split_sets(SES_1)
X_train_SES_2, X_test_SES_2, y_train_target_b_SES_2, y_test_target_b_SES_2, y_train_donation_type_SES_2, y_test_donation_type_SES_2 = split_sets(SES_2)
X_train_SES_3, X_test_SES_3, y_train_target_b_SES_3, y_test_target_b_SES_3, y_train_donation_type_SES_3, y_test_donation_type_SES_3 = split_sets(SES_3)
X_train_SES_4, X_test_SES_4, y_train_target_b_SES_4, y_test_target_b_SES_4, y_train_donation_type_SES_4, y_test_donation_type_SES_4 = split_sets(SES_4)
X_train_SES_nan, X_test_SES_nan, y_train_target_b_SES_nan, y_test_target_b_SES_nan, y_train_donation_type_SES_nan, y_test_donation_type_SES_nan = split_sets(SES_nan)
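
The per-class splits can also be kept in a dictionary keyed by the SES code, which avoids copy-paste slips between the individual calls. The sketch below is an illustration only: `split_frame` and the toy data are hypothetical stand-ins for `split_sets` and `transformed_data`.

```python
import pandas as pd

def split_frame(df, targets, frac=0.8, random_state=0):
    """Return train/test features and targets for one data frame."""
    train = df.sample(frac=frac, random_state=random_state)
    test = df.drop(train.index)
    return {
        "X_train": train.drop(columns=targets),
        "X_test": test.drop(columns=targets),
        "y_train": train[targets],
        "y_test": test[targets],
    }

# toy data standing in for transformed_data (hypothetical values)
toy = pd.DataFrame({
    "SES": [0, 1, 2, 3, 4] * 10,
    "feature": range(50),
    "TARGET_B": [0, 1] * 25,
    "DONATION_TYPE": [i % 3 for i in range(50)],
})

# one entry per SES code, each holding its own train/test split
splits = {
    ses: split_frame(group, ["TARGET_B", "DONATION_TYPE"])
    for ses, group in toy.groupby("SES")
}
print({ses: s["X_train"].shape for ses, s in splits.items()})
```

A split is then looked up as, for example, `splits[0]["X_train"]`.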




## Training and Evaluation of the Classifiers

In the following, we introduce five classifiers used to train models for the three given classification tasks:

- Predicting Donation (binary classification task);
- Predicting Donation TYPE (multiclass classification); and
- Train specialized models for SES (socioeconomic classification).

We choose the following classifiers for each category: 

- Tree models: RandomForestClassifier
- Distance-based models: KNeighborsClassifier
- Linear models: Support Vector Machine
- Probabilistic models: Gaussian Naive Bayes
- Rule models: todo

For evaluation, we use 5-fold cross-validation.

TODO: 

1. describe different measures 
2. explain results
3. check not only overall accuracy but also precision and recall for each class !!
4. Describe the parameters used for each classifier and how their choice impacted or not the results.
5. Choose the best classifier and justify your choice.
6. Discuss critically your choices and the results!



--------

The **Metrics** we are evaluating: 

Precision: What proportion of positive identifications was actually correct?

Recall: What proportion of actual positives was identified correctly?

f1-score: The harmonic mean of precision and recall.

Accuracy: The fraction of all predictions the model got right. 

To fully evaluate the effectiveness of a model, we must examine both precision and recall. Unfortunately, precision and recall are often in tension. That is, improving precision typically reduces recall and vice versa. So the f1-score as the harmonic mean of precision and recall is a good metric to look at. 
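
As a small worked example of the definitions above, the metrics can be computed directly from the counts of a binary confusion matrix. The counts below are invented for illustration and do not come from the donors dataset.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# hypothetical counts: true positives, false positives, false negatives, true negatives
tp, fp, fn, tn = 30, 10, 20, 40
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(precision, recall, round(f1, 3), accuracy)
```

With these counts precision is 0.75 and recall is 0.6, so the F1 score (about 0.667) lies between the two but closer to the lower value.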





In [None]:

def run_exps(X_train: pd.DataFrame , y_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.DataFrame, target_names, models ) -> pd.DataFrame:
    '''
    Lightweight script to test many models and find winners
    :param X_train: training split
    :param y_train: training target vector
    :param X_test: test split
    :param y_test: test target vector
    :param target_names: class names shown in the classification report
    :param models: list of (name, estimator) tuples
    :return: DataFrame with the cross-validation results of all models
    '''
    
    dfs = []
    
    # 'precision', 'recall' and 'f1' are only defined for binary targets,
    # so they are added only when y_train has exactly two classes
    scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
    if len(np.unique(y_train)) == 2:
        scoring = ["precision" , "recall" , "f1"] + scoring
    
    for name, model in models:
        kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=90210)
        cv_results = model_selection.cross_validate(model, X_train, y_train, cv=kfold, scoring=scoring)
        clf = model.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(name)
        #print(y_pred)
        print(classification_report(y_test, y_pred, target_names=target_names, zero_division = 0))
        # Generate confusion matrix
        #matrix = plot_confusion_matrix(model, X_test, y_test)#, normalize='true')
        #matrix.plot()
        #plt.rcParams["axes.grid"] = False
        this_df = pd.DataFrame(cv_results)
        this_df['model'] = name
        dfs.append(this_df)

    # combine the per-model results once, after the loop
    final = pd.concat(dfs, ignore_index=True)
    return final

### Hyperparameter tuning for all binary classification models  

Untuned models can perform poorly, and the best hyperparameters depend on the dataset used for training. 

In the following, each of the tunable models is trained with different hyperparameter combinations in order to find the combination that maximizes the chosen metric. Gaussian NB is not tuned because it has almost no hyperparameters (only `var_smoothing`). 

We look at recall (the proportion of actual members of a class that the model identifies), precision (the proportion of predictions for a class that are correct) and overall model accuracy. The Random Forest and kNN grid searches below are scored on F1 (`scoring="f1"`), which combines precision and recall. 

After finding the best parameters for each model, we use the tuned models to create a model comparison. 
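
A single grid search can also track several metrics at once and refit on one of them. The following is a minimal sketch that assumes synthetic data from `make_classification` in place of `X_train` / `y_train_target_b`; the parameter grid is deliberately tiny and not the one used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for X_train / y_train_target_b
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    scoring=["precision", "recall", "f1", "accuracy"],
    refit="f1",  # the metric that decides which parameter set is refitted
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
# every metric is available per parameter combination
print(search.cv_results_["mean_test_recall"])
```

With multi-metric scoring, `refit` has to name one of the metrics; otherwise `best_params_` is not available.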

#### Random Forest: 

In [None]:
# takes 7 min 

rand_forest = RandomForestClassifier(n_jobs=-1)
rand_forest = rand_forest.fit(X_train, y_train_target_b)
y_pred = rand_forest.predict(X_test)
print(classification_report(y_test_target_b, y_pred))

params = {'bootstrap': [True, False],
          'max_depth': [40, 50, 60],
          'max_features': ['sqrt'],  # same as the former 'auto' for classifiers; 'auto' was removed in newer scikit-learn
          'min_samples_leaf': [1, 2, 4],
          'min_samples_split': [2, 5],
          'n_estimators': [400, 600, 800]}

#scoring = ['recall' , 'precision', 'accuracy']

grid_search_cv = GridSearchCV(
    rand_forest,
    params, 
    verbose=1, 
    cv=5,
    n_jobs= -1, 
    scoring="f1" 
    #refit='recall'
)


grid_search_cv.fit(X_train, y_train_target_b)

print("Best params for Rand Forest : " +  str(grid_search_cv.best_params_))

print("Best estimator for Rand Forest : " + str(grid_search_cv.best_estimator_))

print("Best score for Rand Forest : " + str(grid_search_cv.best_score_))

In [None]:
# f1_weighted

Based on the hyperparameter tuning, we use the following parameters: 


In [None]:
rand_forest = RandomForestClassifier(bootstrap=False, max_depth=60, n_estimators=800,n_jobs=-1)
rand_forest = rand_forest.fit(X_train, y_train_target_b)
y_pred = rand_forest.predict(X_test)
print(classification_report(y_test_target_b, y_pred))


#### kNN 

In [None]:
knn = KNeighborsClassifier()
knn = knn.fit(X_train, y_train_target_b)
y_pred = knn.predict(X_test)
print(classification_report(y_test_target_b, y_pred))

params = {'n_neighbors':[50,60,70],
              'leaf_size':[1,3,5],
              'algorithm':['auto', 'kd_tree'],
              'n_jobs':[-1]}


grid_search_cv = GridSearchCV(
    knn,
    params, 
    verbose=1, 
    cv=3,
    n_jobs= -1, 
    scoring="f1"
)


grid_search_cv.fit(X_train, y_train_target_b)

print("Best params for knn : " +  str(grid_search_cv.best_params_))

print("Best estimator for knn : " + str(grid_search_cv.best_estimator_))

print("Best score for knn : " + str(grid_search_cv.best_score_))

Based on the hyperparameter tuning, we use the following parameters: 



In [None]:
knn = KNeighborsClassifier(leaf_size=3, n_jobs=-1, n_neighbors=70)
knn = knn.fit(X_train, y_train_target_b)
y_pred = knn.predict(X_test)
print(classification_report(y_test_target_b, y_pred))


#### SVM 

In [None]:
# training an SVM classifier (SVC defaults to the RBF kernel, not a linear one)

svm_model_linear = SVC().fit(X_train, y_train_target_b) 
svm_predictions = svm_model_linear.predict(X_test) 
print(classification_report(y_test_target_b, svm_predictions))

# Tuning the SVM with a grid search

# defining parameter range 
param_grid = {'C': [0.001, 0.01, 0.1],  
              'gamma': [1, 0.1, 0.01, 0.001], 
              'kernel': ['linear', 'rbf']}  
  
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3, n_jobs = -1) 
  
# fitting the model for grid search 
grid.fit(X_train, y_train_target_b)


print("Best params for SVM : " +  str(grid.best_params_))

print("Best estimator for SVM : " + str(grid.best_estimator_))

print("Best score for SVM : " + str(grid.best_score_))


Based on the hyperparameter tuning, we use the following parameters: 



In [None]:

svc = SVC(C=0.1, gamma=0.01)
svc = svc.fit(X_train, y_train_target_b)
y_pred = svc.predict(X_test)
print(classification_report(y_test_target_b, y_pred))


#### Feature significance based on the Random Forest Classifier

In [None]:
importances = rand_forest.feature_importances_
# standard deviation of the importances across the individual trees
std = np.std([tree.feature_importances_ for tree in rand_forest.estimators_],
             axis=0)

indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X_train.shape[1]):
    print("%d. feature %d : %s (%f)" % (f + 1, indices[f] ,X_train.columns[indices[f]] ,importances[indices[f]]))

# Plot the impurity-based feature importances of the forest
plt.figure( figsize=(20,5))
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), indices)
plt.xlim([-1, X_train.shape[1]])
plt.show()



Result: The analysis shows that some features are more relevant for the model than others. 
As a next step, we could drop the less relevant features to speed up model training. 
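
Dropping the less relevant features could look roughly as follows. This is a sketch under assumptions: it uses synthetic data instead of `X_train`, and the number of features to keep (`k = 5`) is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for X_train / y_train_target_b
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

k = 5  # number of features to keep (arbitrary choice)
# indices of the k features with the highest impurity-based importance
top_k = np.argsort(forest.feature_importances_)[::-1][:k]
X_reduced = X_demo[:, top_k]
print(top_k, X_reduced.shape)
```

For a pandas data frame the same indices can be used as `X_train[X_train.columns[top_k]]`. Whether the reduced feature set costs accuracy would have to be checked by retraining.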

### Hyperparameter tuning for multiclass classification

#### Random Forest: 

In [None]:
# takes 7 min 

rand_forest = RandomForestClassifier(n_jobs=-1)
rand_forest = rand_forest.fit(X_train, y_train_donation_type)
y_pred = rand_forest.predict(X_test)
print(classification_report(y_test_donation_type, y_pred))

params = {'bootstrap': [True, False],
          'max_depth': [10, 20, 30],
          'max_features': ['sqrt'],  # same as the former 'auto' for classifiers; 'auto' was removed in newer scikit-learn
          'min_samples_leaf': [1, 2, 4],
          'min_samples_split': [2, 5],
          'n_estimators': [200, 400, 600]}


grid_search_cv = GridSearchCV(
    rand_forest,
    params, 
    verbose=1, 
    cv=3,
    n_jobs= -1
)


grid_search_cv.fit(X_train, y_train_donation_type)

print("Best params for Rand Forest : " +  str(grid_search_cv.best_params_))

print("Best estimator for Rand Forest : " + str(grid_search_cv.best_estimator_))

print("Best score for Rand Forest : " + str(grid_search_cv.best_score_))

In [None]:
rand_forest = RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1)
rand_forest = rand_forest.fit(X_train, y_train_donation_type)
y_pred = rand_forest.predict(X_test) 
print(classification_report(y_test_donation_type, y_pred))


In [None]:
y_pred

#### kNN 

In [None]:
knn = KNeighborsClassifier()


knn = knn.fit(X_train, y_train_donation_type)
y_pred = knn.predict(X_test)
print(classification_report(y_test_donation_type, y_pred))

params = {'n_neighbors':[50,60,70],
              'leaf_size':[1,3,5],
              'algorithm':['auto', 'kd_tree'],
              'n_jobs':[-1]}


grid_search_cv = GridSearchCV(
    knn,
    params, 
    verbose=1, 
    cv=3,
    n_jobs= -1
)


grid_search_cv.fit(X_train, y_train_donation_type)

print("Best params for knn : " +  str(grid_search_cv.best_params_))

print("Best estimator for knn : " + str(grid_search_cv.best_estimator_))

print("Best score for knn : " + str(grid_search_cv.best_score_))

In [None]:
knn = KNeighborsClassifier(leaf_size=5, n_jobs=-1, n_neighbors=60)
knn = knn.fit(X_train, y_train_donation_type)
y_pred = knn.predict(X_test)
print(classification_report(y_test_donation_type, y_pred))


#### SVM 

In [None]:
# training an SVM classifier (SVC defaults to the RBF kernel, not a linear one)

svm_model_linear = SVC().fit(X_train, y_train_donation_type) 
svm_predictions = svm_model_linear.predict(X_test) 
print(classification_report(y_test_donation_type, svm_predictions))

# Tuning the SVM with a grid search

# defining parameter range 
param_grid = {'C': [0.001, 0.01, 0.1],  
              'gamma': [1, 0.1, 0.01, 0.001], 
              'kernel': ['linear', 'rbf']}  
  
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3, n_jobs = -1) 
  
# fitting the model for grid search 
grid.fit(X_train, y_train_donation_type)


print("Best params for SVM : " +  str(grid.best_params_))

print("Best estimator for SVM : " + str(grid.best_estimator_))

print("Best score for SVM : " + str(grid.best_score_))


In [None]:

svc = SVC()
# multiclass task: train and evaluate on DONATION_TYPE, not on TARGET_B
svc = svc.fit(X_train, y_train_donation_type)
y_pred = svc.predict(X_test)
print(classification_report(y_test_donation_type, y_pred))


Best Classifiers: 

## Binary Classification 

### Without Sampling
- RF : RandomForestClassifier(max_depth=30, min_samples_leaf=4, n_estimators=400,n_jobs=-1)
- kNN : KNeighborsClassifier(leaf_size=1, n_jobs=-1, n_neighbors=60)
- SVM : 

### Oversampling 
- RF : RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1)
- kNN : KNeighborsClassifier(leaf_size=3, n_jobs=-1, n_neighbors=70)
- SVM : SVC(C=0.1, gamma=0.01)

### Undersampling 
- RF : 
- kNN : 
- SVM : 


## Multiclass Classification 

### Without Sampling
- RF : 
- kNN : 
- SVM : 
### Oversampling 
- RF : RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1)
- kNN : KNeighborsClassifier(leaf_size=5, n_jobs=-1, n_neighbors=60)
- SVM : 
### Undersampling 
- RF : 
- kNN : 
- SVM : 


### Train Models for binary Classification

In [None]:
final = run_exps(
    X_train, 
    y_train_target_b, 
    X_test, 
    y_test_target_b, 
    ['wont donate', 'donates'],
    [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1)),
     ('Distance Based Model: KNN', KNeighborsClassifier(leaf_size=3, n_jobs=-1, n_neighbors=70)),
     ('Probabilistic Model: GNB', GaussianNB()),
     ('Linear Model: SVM', SVC(C=0.1, gamma=0.01))
        # Rule Based model: 
    ])

In [None]:
bootstraps = []

for model in list(set(final.model.values)):
    model_df = final.loc[final.model == model]
    bootstrap = model_df.sample(n=30, replace=True)
    bootstraps.append(bootstrap)
        
bootstrap_df = pd.concat(bootstraps, ignore_index=True)
results_long = pd.melt(bootstrap_df,id_vars=['model'],var_name='metrics', value_name='values')

time_metrics = ['fit_time','score_time'] # fit time metrics

## PERFORMANCE METRICS
results_long_nofit = results_long.loc[~results_long['metrics'].isin(time_metrics)] # get df without fit data
results_long_nofit = results_long_nofit.sort_values(by='values')

## TIME METRICS
results_long_fit = results_long.loc[results_long['metrics'].isin(time_metrics)] # df with fit data
results_long_fit = results_long_fit.sort_values(by='values')

In [None]:
plt.figure(figsize=(10, 7))
sns.set(font_scale=1)
g = sns.boxplot(x="model", y="values", hue="metrics", data=results_long_nofit, palette="Set3")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Comparison of Model by Classification Metric')
plt.savefig('./benchmark_models_performance.png',dpi=300)

### Train Models for multiclass classification

After finding the best parameters for each model, we use the tuned models to create a model comparison. 

In [None]:
final = run_exps(
    X_train, 
    y_train_donation_type_array, 
    X_test, 
    y_test_donation_type_array, 
    ['wont donate', 'A', 'B', 'C', 'D', 'E'],
    [
        ('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1)),
        ('Distance Based Model: KNN', KNeighborsClassifier(leaf_size=5, n_jobs=-1, n_neighbors=60)),
        ('Probabilistic Model: GNB', GaussianNB()),
        ('Linear Model: SVM', SVC())
        # Rule Based model: 
    ])

In [None]:
bootstraps = []

for model in list(set(final.model.values)):
    model_df = final.loc[final.model == model]
    bootstrap = model_df.sample(n=30, replace=True)
    bootstraps.append(bootstrap)
        
bootstrap_df = pd.concat(bootstraps, ignore_index=True)
results_long = pd.melt(bootstrap_df,id_vars=['model'],var_name='metrics', value_name='values')

time_metrics = ['fit_time','score_time'] # fit time metrics

## PERFORMANCE METRICS
results_long_nofit = results_long.loc[~results_long['metrics'].isin(time_metrics)] # get df without fit data
results_long_nofit = results_long_nofit.sort_values(by='values')

## TIME METRICS
results_long_fit = results_long.loc[results_long['metrics'].isin(time_metrics)] # df with fit data
results_long_fit = results_long_fit.sort_values(by='values')

In [None]:

plt.figure(figsize=(10, 7))
sns.set(font_scale=1)
g = sns.boxplot(x="model", y="values", hue="metrics", data=results_long_nofit, palette="Set3")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Comparison of Model by Classification Metric')
# separate file name so the binary-classification plot is not overwritten
plt.savefig('./benchmark_models_performance_multiclass.png',dpi=300)

### Binary classifier for SES - classes

Train models for each SES class

---
TODO: we just have to train the best classifier (ONE !!!) from the previous experiments to predict donation and donation type for each SES class = 2 models for 5 SES = experiments 

---

In [None]:
final = run_exps(X_train_SES_1, 
                 y_train_target_b_SES_1, 
                 X_test_SES_1 , 
                 y_test_target_b_SES_1, 
                 ['wont donate', 'donates'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1))])

final = run_exps(X_train_SES_2, 
                 y_train_target_b_SES_2, 
                 X_test_SES_2, 
                 y_test_target_b_SES_2, 
                 ['wont donate', 'donates'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1))])

final = run_exps(X_train_SES_3, 
                 y_train_target_b_SES_3, 
                 X_test_SES_3 , 
                 y_test_target_b_SES_3, 
                 ['wont donate', 'donates'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1))])

final = run_exps(X_train_SES_4, 
                 y_train_target_b_SES_4, 
                 X_test_SES_4 , 
                 y_test_target_b_SES_4, 
                 ['wont donate', 'donates'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1))])

final = run_exps(X_train_SES_nan, 
                 y_train_target_b_SES_nan, 
                 X_test_SES_nan, 
                 y_test_target_b_SES_nan, 
                 ['wont donate', 'donates'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1))])

### Multiclass classifier for SES - classes

In [None]:
final = run_exps(X_train_SES_1, 
                 y_train_donation_type_SES_1, 
                 X_test_SES_1 , 
                 y_test_donation_type_SES_1, 
                 ['wont donate', 'A', 'B', 'C', 'D', 'E'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1))])

final = run_exps(X_train_SES_2, 
                 y_train_donation_type_SES_2, 
                 X_test_SES_2, 
                 y_test_donation_type_SES_2, 
                 ['wont donate', 'A', 'B', 'C', 'D', 'E'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1))])

final = run_exps(X_train_SES_3, 
                 y_train_donation_type_SES_3, 
                 X_test_SES_3 , 
                 y_test_donation_type_SES_3, 
                 ['wont donate', 'A', 'B', 'C', 'D', 'E'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1))])

final = run_exps(X_train_SES_4, 
                 y_train_donation_type_SES_4, 
                 X_test_SES_4 , 
                 y_test_donation_type_SES_4, 
                 ['wont donate', 'A', 'B', 'C', 'D', 'E'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1))])

final = run_exps(X_train_SES_nan, 
                 y_train_donation_type_SES_nan, 
                 X_test_SES_nan, 
                 y_test_donation_type_SES_nan, 
                 ['wont donate', 'A', 'B', 'C', 'D', 'E'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1))])

Result: 

# todo


## Logical Models: Rule models

In [None]:
# after complete processing, all columns should be used
interesting_features = ['TARGET_B', 'HOME_OWNER', 'DONOR_GENDER', 'DONATION_TYPE'] 

# data used for Rule Mining:
RM_transformed_data = transformed_data[interesting_features]
RM_transformed_data

In [None]:
# sources:
# https://scikit-learn.org/stable/modules/preprocessing.html  -   6.3.4. Encoding categorical features
# TP05: 2.2

# def ohe_encode moved to 3.1

cols_not_binary = ['HOME_OWNER', 'DONOR_GENDER', 'DONATION_TYPE']
RM_binary_data = ohe_encode(cols_not_binary)
RM_binary_data

Todo: 
- Binning for all interesting features (first preprocessing section)
- put the OneHotEncoding in first preprocessing section?

## Finding Associations

In [None]:

# calculate frequent patterns:
freq_patterns = apriori(RM_binary_data, min_support=0.05, use_colnames=True) # maybe choose different min_support..
#freq_patterns['size'] = freq_patterns['itemsets'].apply(lambda x: len(x))
#freq_patterns = freq_patterns[freq_patterns['size']>1]
#freq_patterns

In [None]:

# careful here, if we change '?' again maybe...
def check_if_interesting(x):
    # Interesting (maybe different criteria..):
    # - consequents==(TARGET_B or DONATION_TYPE)
    # - TARGET_B, B, D, E not in antecedents
    targets = ['TARGET_B', 'DONATION_TYPE:A', 'DONATION_TYPE:B', 'DONATION_TYPE:C', 
               'DONATION_TYPE:D', 'DONATION_TYPE:E', 'DONATION_TYPE:?']
    # access the row by column label; x[0] / x[1] rely on deprecated positional indexing
    target_is_only_consequents = any(item in targets for item in x['consequents']) and len(x['consequents'])==1
    target_not_in_antecedents = not any(item in targets for item in x['antecedents'])
    return target_is_only_consequents and target_not_in_antecedents

# generate association rules:
as_rules = association_rules(freq_patterns, metric="confidence", min_threshold=0.2)
# filter out uninteresting rules:
as_rules['interesting?'] = as_rules[['antecedents', 'consequents']].apply(check_if_interesting, axis=1)
as_rules = as_rules[as_rules['interesting?']]
as_rules

### Geometric Models: Linear models

#### Predicting Donation (binary classification task)

In [None]:
# example: linear regression or SVM ? 
# first we try SVM classifier
  
# training an SVM classifier (SVC defaults to the RBF kernel, not a linear one)

svm_model_linear = SVC().fit(X_train, y_train_target_b) 
svm_predictions = svm_model_linear.predict(X_test) 
  
# model accuracy for X_test   
accuracy = svm_model_linear.score(X_test, y_test_target_b) 
  
# creating a confusion matrix 
cm = confusion_matrix(y_test_target_b, svm_predictions) 
matrix = plot_confusion_matrix(svm_model_linear, X_test, y_test_target_b)#, normalize='true')


In [None]:
# Tuning the SVM with a grid search

# defining parameter range 
param_grid = {'C': [0.001, 0.01, 0.1, 1],  
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 
              'kernel': ['linear', 'rbf']}  
  
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3, n_jobs = -1) 
  
# fitting the model for grid search (required before best_params_ is available; can take a long time)
grid.fit(X_train, y_train_target_b)


# print best parameter after tuning 
print(grid.best_params_) 
  
# print how our model looks after hyper-parameter tuning 
print(grid.best_estimator_) 

grid_predictions = grid.predict(X_test) 
  
# print classification report 
print(classification_report(y_test_target_b, grid_predictions)) 




### Geometric Models: Distance-based models

In [None]:
# example: knn

# training a KNN classifier 

knn = KNeighborsClassifier(leaf_size=1, n_jobs=-1, n_neighbors=4).fit(X_train, y_train_target_b) 
  
# accuracy on X_test 
accuracy = knn.score(X_test, y_test_target_b) 
print(accuracy)
  
# creating a confusion matrix 
knn_predictions = knn.predict(X_test)  
cm = confusion_matrix(y_test_target_b, knn_predictions) 

matrix = plot_confusion_matrix(knn, X_test, y_test_target_b, normalize='true')



In [None]:
# hyperparameter tuning

#define the model and parameters
knn = KNeighborsClassifier()

parameters = {'n_neighbors':[4,5,6,7],
              'leaf_size':[1,3,5],
              'algorithm':['auto', 'kd_tree'],
              'n_jobs':[-1]}

#Fit the model
model = GridSearchCV(knn, param_grid=parameters)
model.fit(X_train,y_train_target_b)

#predictions on test data
prediction=model.predict(X_test)

# print best parameter after tuning 
print(model.best_params_) 
  
# print how our model looks after hyper-parameter tuning 
print(model.best_estimator_) 



In [None]:

# predictions of the tuned model on X_test
grid_predictions = model.predict(X_test) 

# accuracy on X_test; use the fitted grid search, because `knn` itself was re-created and is not fitted
accuracy = model.score(X_test, y_test_target_b) 
print(accuracy)
  
cm = confusion_matrix(y_test_target_b, grid_predictions) 

# print classification report 
print(classification_report(y_test_target_b, grid_predictions)) 


### Probabilistic models: 

In [None]:
display(as_rules[as_rules['consequents']==frozenset({'DONATION_TYPE:High'})].head())

# example: naive bayes 

# see above. 

# no grid search needed for naive bayes, because no parameters to tune

## Classification - Results and Discussion 

Based on the f1-Score....



# Task 2 (Unsupervised Learning) - Characterizing Donors and Donation Type

In this task you should **use unsupervised learning algorithms and try to characterize donors (people who really did a donation) and their donation type**. You can use:
* **Association rule mining** to find **associations between the features and the target Donation/DonationTYPE**.
* **Clustering algorithms to find similar groups of donors**. Is it possible to find groups of donors with the same/similar DonationTYPE?
* **Be creative and define your own unsupervised analysis!** What would be interesting to find out?

## Preprocessing Data for Association Rule Mining

In [None]:
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.cluster import contingency_matrix

In [None]:
# Better visualization and suppression of warnings:
pd.options.mode.chained_assignment = None
pd.set_option("display.max_columns", None)

In this part, we use the apriori algorithm to find frequent patterns in the data. From these patterns we can derive the association rules we are looking for. 

We can reuse a large part of the preprocessing from the supervised learning task. Replacing NaN values is necessary here as well. So is binning, because no meaningful rules can be learned for features with a wide range of values. However, not all of the data in this dataset is already binned.

We first check for correlations between the features and drop most of the correlated ones. The reason: after the necessary one-hot encoding, we have to deal with a significantly larger number of columns. Together with the already large number of rows, this leads to high memory usage while calculating the frequent patterns, which can crash the program.



At first, extract the data from Task 1:

In [None]:
RM_transformed_data = RM_data.copy().astype(float)
RM_transformed_data

Most of the features can be used directly for the correlation check. However, the features URBANICITY and DONOR_GENDER first receive an encoding that reflects a meaningful order:

In [None]:
# Order URBANICITY from rural (0) to city (4): 
# 0(=?, unknown, treated as suburban)->2 ; 1(=C, city)->4 ; 2(=R, rural)->0 ; 3(=S, suburban)->2 ; 
# 4(=T, town)->1 ; 5(=U, urban)->3
def newEnc_URB(x):
    if x==0 or x==3: return 2
    if x==1: return 4
    if x==2: return 0
    if x==4: return 1
    if x==5: return 3
RM_transformed_data['URB_newEnc'] = RM_transformed_data['URBANICITY'].apply(newEnc_URB)

# Order DONOR_GENDER. Take the average for unknown:
def newEnc_DGE(x):
    if x==0: return 1    # female
    if x==1: return 0    # male
    if x==2: return 0.5  # unknown
RM_transformed_data['DGE_newEnc'] = RM_transformed_data['DONOR_GENDER'].apply(newEnc_DGE)
    
new_enc_features = ['URBANICITY', 'DONOR_GENDER', 'OVERLAY_SOURCE']
RM_transformed_data = RM_transformed_data.drop(new_enc_features, axis=1)

Now, we can search for correlations between the features: The function `find_correlations` returns a list with the correlated features. It also finds groups with strong correlations within them. 

In order to achieve this, the algorithm creates a correlation matrix between all features and looks for the highest value in the matrix. One of the two features belonging to this value will be added to the list of correlated features. This feature will be dropped from the matrix. This will be done over and over again until there is no value in the correlation matrix that is above the given correlation threshold.

In [None]:
def create_corr_groups(corrs):
    '''
    From the given correlating tuples, checks which of them share common features and merges them.
    :param corrs: set containing tuples with correlating features
    :return new_corrs: set with merged tuples, if they shared features
    :return found_new_corr: True if the function found a new correlation between features in two tuples
    '''
    found_new_corr = False
    new_corrs = set()
    for corr1_list in corrs.copy():
        corr1_is_closed = False
        if corr1_list in corrs:
            corr1 = set(corr1_list)
            corrs.remove(corr1_list)
            for corr2_list in corrs.copy():
                corr2 = set(corr2_list)
                if corr1!=corr2 and corr1&corr2 != set():
                    new_corrs.add(tuple(corr1.union(corr2)))
                    corrs.remove(corr2_list)
                    corr1_is_closed = True
                    found_new_corr = True
            if corr1_is_closed==False:
                new_corrs.add(corr1_list)
    return new_corrs, found_new_corr

def find_correlations(data, corr_threshold=0.6, targets=['TARGET_B', 'DONATION_TYPE'], show_hist=False):
    '''
    :param data: dataset whose features shall be checked for correlations
    :param corr_threshold: threshold from which correlating features should be dropped
    :return correlated_features: features that are involved in strong correlations. Should be dropped.
    :return correlations: set containing tuples with correlating features
    :return corr_mat: correlation matrix.
    :return corr_hist: Histogram with distribution of all correlations.
    :return lowest_corr: lowest correlation value. Usually negative. 
    '''
    corr_mat = data.drop(targets, axis=1).corr()
    corr_hist = False
    if show_hist:
        corr_hist = plt.hist(corr_mat.to_numpy().flatten());
        plt.show()
    
    lowest_corr = np.amin(corr_mat.to_numpy().flatten())
    highest_corr_val = corr_threshold
    correlated_features = set() # set with all features with correlations  
    correlation_groups = set() # set with pairs of correlating features

    initiation = True
    while abs(highest_corr_val)>corr_threshold or initiation:
        highest_corr_val = corr_threshold
        highest_corr_name = None
        for col in range(len(corr_mat.columns)):
            for row in range(col):
                if abs(corr_mat.iloc[col, row]) > highest_corr_val:  # also catch strong negative correlations
                    highest_corr_val = abs(corr_mat.iloc[col, row])
                    highest_corr_name = corr_mat.columns[row]
                    correlation_groups.add((highest_corr_name, corr_mat.index[col]))
        if highest_corr_name is None:
            break
        corr_mat = corr_mat.drop(highest_corr_name, axis=1).drop(highest_corr_name, axis=0)

    correlation_groups = list(correlation_groups)
    # create list with all correlated features:
    correlated_features = set()
    for group in correlation_groups:
        for member in group:
            correlated_features.add(member)
    found_new = True
    j = 0
    while found_new==True and j<20:
        temp = correlation_groups.copy()
        correlation_groups, found_new = create_corr_groups(temp)
        j += 1
    print('Cor. Features found:', len(correlated_features))
    return correlated_features, list(correlation_groups), corr_mat, corr_hist, lowest_corr

In [None]:
RM_corr_features, RM_corr_groups,corr_mat,_,_= find_correlations(RM_transformed_data, corr_threshold=0.25) # 0.25 works well

print('\nRM_corr_groups:')
for ind, group in enumerate(RM_corr_groups):
    print('\n', ind, group)

The correlation groups found seem plausible in most cases, although two groups contain a range of features around the topic of recent and general donation activity. Further analysis could improve the accuracy of the feature groups.

With this knowledge, we can drop all correlated features but keep one feature from each group (inc_features), which will represent the group from now on:

In [None]:
inc_features = ['RECENT_RESPONSE_COUNT', 'INCOME_GROUP', 'PCT_ATTRIBUTE2_label', 'MONTHS_SINCE_LAST_GIFT', \
                'PUBLISHED_PHONE', 'LIFETIME_AVG_GIFT_AMT']
# out of interest: keep DONOR_AGE_label as well:
inc_features = inc_features + ['DONOR_AGE_label']

RM_features = list(set(RM_transformed_data.columns)-set(RM_corr_features))+inc_features 
RM_transformed_data = RM_transformed_data[RM_features]
display(RM_transformed_data)

We have already reduced the number of columns significantly. However, the binning performed so far is not sufficient. After some trials, we saw that ten bins per feature are still too many for the apriori algorithm. Therefore, we check in the data which features need further binning. Other features need further processing as well. We perform both steps now: 

In [None]:
# Divide into features to bin and features that are transformed separately:
features_not_to_bin = ['TARGET_B', 'DONATION_TYPE', 'HOME_OWNER', 'DONOR_GENDER_RM', 'PUBLISHED_PHONE']
features_seperate = ['SES',  'OVERLAY_SOURCE', 'PCT_ATTRIBUTE1', 'DGE_newEnc'\
                         ]#'RECENCY_STATUS_96NK', 'RECENT_STAR_STATUS', 'WEALTH_RATING', 'MOR_HIT_RATE', 'PUBLISHED_PHONE']
features_to_bin = list(set(RM_features)-set(features_not_to_bin)-set(features_seperate))

# Backtransform DONATION_TYPE and create new compressed target:
RM_transformed_data["DONATION_TYPE"] = DT_LaEnc.inverse_transform(RM_transformed_data["DONATION_TYPE"].astype('int32'))
def convert_DT(x):
    if x=='?':
        return 'None'
    if x=='A' or x=='B':
        return 'High'
    else:
        return 'Low'
RM_transformed_data.DONATION_TYPE = RM_transformed_data["DONATION_TYPE"].apply(convert_DT)

# Deal with the features that need a separate transformation:
RM_transformed_data["DONOR_GENDER_RM"] = RM_transformed_data["DGE_newEnc"].apply(lambda x: 'F' if x==1 else ('M' if x==0 else 'U'))
RM_transformed_data["SES_RM"] = RM_transformed_data["SES"].apply(lambda x: 'low' if x==0 else 'high')
RM_transformed_data["PCT_ATTRIBUTE1_RM"] = RM_transformed_data["PCT_ATTRIBUTE1"].apply(lambda x: 'low' if x==0 else 'high')
RM_transformed_data = RM_transformed_data.drop([  "PCT_ATTRIBUTE1", "SES", "DGE_newEnc"], axis=1)

RM_transformed_data

Next, we perform binning into three bins: 

In [None]:
RM_quantiles_list = [0, 1./3., 2./3., 1.0]
RM_quantile_labels = [0, 1, 2]
for feature in features_to_bin:
    quantiles = RM_transformed_data[feature].quantile(RM_quantiles_list)
    RM_transformed_data[f'{feature}_RM'] = pd.qcut(RM_transformed_data[feature],q=RM_quantiles_list, \
                                                   labels=RM_quantile_labels, duplicates='drop')

RM_transformed_data = RM_transformed_data.drop(columns = features_to_bin)    
#display(RM_transformed_data)

The following function does the one-hot encoding. After applying it to the data, we obtain one column for each distinct value in each column. 

We can save columns here as well: for features that we just binned into three bins, the function deletes the middle bin. The reason is that we are mainly interested in whether the value of a feature is very high or very low. 
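
The idea of dropping the middle bin can be illustrated on toy data. This is only a minimal sketch: the column `AGE_RM` and its values are hypothetical, and it uses `pandas.get_dummies` instead of the `OneHotEncoder`-based function defined in this notebook.

```python
import pandas as pd

# Hypothetical toy column, already binned into three bins (0=low, 1=middle, 2=high)
toy = pd.DataFrame({"AGE_RM": [0, 1, 2, 2, 0, 1]})

# One indicator column per bin value, named "<feature>:<value>"
encoded = pd.get_dummies(toy["AGE_RM"], prefix="AGE_RM", prefix_sep=":").astype(int)

# Drop the middle bin: only "low" and "high" remain as items for rule mining
encoded = encoded.drop(columns="AGE_RM:1")
print(encoded.columns.tolist())
```

A row that fell into the middle bin then has zeros in both remaining columns, so it simply contributes no item for this feature.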

In [None]:
def ohe_encode(col_names, X=RM_transformed_data):
    '''Takes a list of column names and a DataFrame X with all columns. 
    Encodes features and replaces old columns in X. 
    Deletes middle bin for features ending in "RM" (are divided in three bins).
    :param col_names: columns to encode (format: ['colum1', 'column2', ...])
    :param X: Dataframe with all columns
    :return: OneHotEncoded dataframe
    '''
    enc = OneHotEncoder()
    matrix = X[col_names].to_numpy()
    enc.fit(matrix)
    matrix = enc.transform(matrix).toarray()
    
    categories_new = enc.categories_  # list of arrays (one per column); lengths may differ
    features_new = np.array([])
    for ind1, cat in enumerate(categories_new):
        for ind2, cat_new in enumerate(cat):
            if ind2==1 and (col_names[ind1][-2:]=='RM'): # We do not need these Categories here and will drop them
                features_new = np.append(features_new, 'Drop')
            else:
                features_new = np.append(features_new, col_names[ind1]+':'+str(cat_new))
                
    new_df = pd.DataFrame(matrix)
    new_df.columns = features_new.tolist()

    updated_df = pd.concat([X, new_df], axis=1)
    updated_df = updated_df.drop(col_names, axis=1)
    updated_df = updated_df.drop('Drop', axis=1)
    
    return(updated_df)

columns_to_ohenc = list(RM_transformed_data.columns)
columns_to_ohenc.remove('TARGET_B')
RM_binary_data = ohe_encode(columns_to_ohenc)

Some binary columns are renamed so that their labels are easier to interpret, and some columns carry little information, so we drop them here.
Furthermore, to get apriori to run, we use only the first 10000 rows (about half of the dataset). 

In [None]:
RM_binary_data = RM_binary_data.rename(columns={'PUBLISHED_PHONE:1.0': 'PUBLISHED_PHONE:0', 'PUBLISHED_PHONE:0.0': 'PUBLISHED_PHONE:1'})
RM_binary_data = RM_binary_data.rename(columns={'HOME_OWNER:1.0': 'HOME_OWNER:0', 'HOME_OWNER:0.0': 'HOME_OWNER:1'})
RM_binary_data = RM_binary_data.rename(columns={'PEP_STAR:1.0': 'PEP_STAR:0', 'PEP_STAR:0.0': 'PEP_STAR:1'})
RM_binary_data = RM_binary_data.drop('DONATION_TYPE:Low', axis=1)
RM_binary_data = RM_binary_data.drop('DONOR_GENDER_RM:U', axis=1) 

drop_rows = np.arange(10000, RM_binary_data.shape[0],1)
RM_binary_data = RM_binary_data.drop(drop_rows)
RM_binary_data

## Finding Associations

In order to run apriori, we calculate the minimum support we want to use. It should be significantly smaller than the smallest class-support, but still high enough to get the algorithm to run:

In [None]:
# relative frequency of the rarest DONATION_TYPE class
lowest_class_support = RM_transformed_data["DONATION_TYPE"].value_counts(normalize=True).min()
minimum_support = lowest_class_support / 5
print('minimum_support: ', minimum_support)

Apply apriori and keep only itemsets with fewer than three items. This way, we end up with rules that let us draw conclusions from a single feature to a target:

In [None]:
freq_patterns = apriori(RM_binary_data, min_support=minimum_support, use_colnames=True) 
freq_patterns['length'] = freq_patterns['itemsets'].apply(lambda x: len(x))
freq_patterns = freq_patterns[ freq_patterns['length'] < 3]

We calculate the association rules from the frequent patterns. To do so, we have to choose a metric that determines which rules are interesting for us. We pick the metric "leverage". It gives the difference between the observed joint support of antecedent (A) and consequent (B) and the support that would be expected if they were independent:

$$
\text{leverage}(A \rightarrow B) = \text{support}(A \rightarrow B) - \text{support}(A) \cdot \text{support}(B) \in [-1,1]
$$

This measure tells us whether there is a correlation without being biased by a high antecedent or consequent support.
Otherwise, the set of rules we obtain would be dominated by features with high support.
Values towards -1 correspond to strong anti-correlation, 0 to no correlation and values towards 1 to strong correlation.
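
As a small worked example of the leverage definition, the sketch below evaluates it for three made-up combinations of supports. The helper `leverage` and all numbers are hypothetical and only serve to illustrate the sign of the metric.

```python
def leverage(support_ab, support_a, support_b):
    """Observed joint support minus the joint support expected under independence."""
    return support_ab - support_a * support_b

# Hypothetical supports, chosen only for illustration (expected joint support: 0.3 * 0.2 = 0.06)
independent = leverage(support_ab=0.06, support_a=0.3, support_b=0.2)
positive = leverage(support_ab=0.10, support_a=0.3, support_b=0.2)
negative = leverage(support_ab=0.02, support_a=0.3, support_b=0.2)

print(round(independent, 3), round(positive, 3), round(negative, 3))
```

The first rule has a leverage of (numerically) zero because the items co-occur exactly as often as independence predicts. The second and third co-occur more and less often than expected, giving a positive and a negative leverage respectively.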

The minimum threshold is chosen to be 0.001 to look for correlations:

In [None]:
metric = "leverage"
as_rules_raw = association_rules(freq_patterns, metric=metric, min_threshold=0.001) 

In the end, only rules that relate to the characteristics we are interested in are important for us. We choose to look for rules that indicate:
- high probability to donate (TARGET_B)
- low probability to donate (DONATION_TYPE:None)
- high probability to donate a lot, in case of a donation (DONATION_TYPE:High)

Consequently, we sort out these rules:

In [None]:
def check_if_interesting(x):
    '''
    :param x: row with the columns 'antecedents' and 'consequents'
    :return: True, if rule is interesting, else False
    '''
    # Interesting:
    # - consequents==(TARGET_B or DONATION_TYPE:High or DONATION_TYPE:None)
    # - target not in antecedents
    
    targets = ['TARGET_B', 'DONATION_TYPE:High', 'DONATION_TYPE:Low', 'DONATION_TYPE:None']
    antecedents, consequents = x['antecedents'], x['consequents']
    target_is_only_consequent = len(consequents)==1 and any(item in targets for item in consequents)
    target_not_in_antecedents = not any(item in targets for item in antecedents)
    return target_is_only_consequent and target_not_in_antecedents

In [None]:
as_rules = as_rules_raw.copy()
# filter out uninteresting rules:
as_rules['interesting?'] = as_rules[['antecedents', 'consequents']].apply(check_if_interesting, axis=1)
as_rules = as_rules[ as_rules['interesting?']==True]
as_rules = as_rules.sort_values(by=[metric], ascending=False)
as_rules = as_rules.drop(['conviction', 'lift', 'interesting?', 'confidence'], axis=1)

## Association Rules - Results and Discussion 

The rules are ordered by the type of target for easier analysis:

In [None]:
df1 = as_rules[as_rules['consequents']==frozenset({'TARGET_B'})]
df2 = as_rules[as_rules['consequents']==frozenset({'DONATION_TYPE:High'})]
df3 = as_rules[as_rules['consequents']==frozenset({'DONATION_TYPE:None'})]
for df in [df1, df2, df3]:
    display(df)

Now, we draw conclusions about which features make a person more (*TARGET_B*) or less likely (*DONATION_TYPE:None*) to donate. Apart from that, we can see which features seem to have an impact on the amount that the person donates (*DONATION_TYPE:High*) in case of a donation.

Most conclusions are close to what one would expect:
For example, the features *HOME_OWNER* and *INCOME_GROUP* indicate a high probability to donate, as they stand for general wealth, which is characterised well by the correlation group of *INCOME_GROUP*. 
A high *DONOR_AGE* indicates a high likelihood to donate. A low age indicates the opposite, although young people seem to donate a lot if they decide to donate. 
If not much time has passed since the last gift (*MONTHS_SINCE_LAST_GIFT*), the person is likely to donate again, and vice versa. If a person donates again after a long time, the amount is likely to be high.  
A high *RECENT_RESPONSE_COUNT* indicates that a person is likely to donate, while a low value indicates that a person is less likely to donate. It is not surprising that persons who frequently respond to solicitations are likely to donate. 
An interesting observation is that a person who does not respond frequently is likely to donate a high amount if he or she eventually donates. From the derivation of correlated features, we can see that *RECENT_RESPONSE_COUNT* correlates with other features such as the overall donation activity during the lifetime. This suggests that a person who has rarely considered making a donation is actually likely to donate a lot once he or she decides to donate. This interpretation is closely related to the one for *MONTHS_SINCE_LAST_GIFT*.

Some features appearing in rules are not easy to interpret:
A high *CLUSTER_CODE* is a good indicator for a rarely donating person, a low one indicates frequent and high donations. We have no information about the ordering of the cluster code though and how it is calculated. 
The features *PCT_ATTRIBUTE2* (high percentage of male veterans in the neighborhood) and *PCT_ATTRIBUTE3* (high number of Vietnam veterans) indicate the opposite behavior. However, there is no trivial explanation for this. A closer analysis of the attributes of the neighborhoods with high (low) number of veterans could help to explain this. 

To conclude, the results match the expectation that persons with higher wealth donate more. Persons who have always been donating will donate again. A person who has not donated in a long time or has never donated is less likely to donate, but donates more if he or she eventually does so.

## Preprocessing Data for Clustering

In this part, we look for clusters in the data via k-means and agglomerative clustering. Subsequently, we can pick out the clusters with interesting values of the targets.

Just like in rule mining, we can reuse parts of the preprocessing from other parts of the project. Here, we use data with already replaced NaN values. Unlike with apriori, we do not need to worry about the running time and memory usage of the algorithm. However, we need to remove correlated features to increase the clarity of the results. Otherwise, some groups of features would also receive inappropriate weights in the clustering algorithms. An example is the features that indicate recent activity, of which the dataset contains many: they would have a significantly higher weight than other features, just because there are more of them.

Lastly, it is important to standardise the features (subtract the mean and divide by the standard deviation). This prevents features with large values from receiving larger weights in the clustering algorithms.
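
The effect of standardisation can be checked on a toy example. This is a minimal sketch with made-up numbers (the two columns are hypothetical, e.g. an income and a count); it only shows that `StandardScaler` does the same as subtracting the column mean and dividing by the column standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales
X = np.array([[20000.0, 1.0],
              [50000.0, 3.0],
              [80000.0, 5.0]])

X_std = StandardScaler().fit_transform(X)

# Same result by hand: subtract the column mean, divide by the column standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```

After the transformation both columns contribute on the same scale to Euclidean distances, which is what k-means and Ward linkage rely on.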

In [None]:
clustering_data = CL_data.copy()

In [None]:
# Take encoding back for classes where classes can not be ordered:
# Order Urbanicity and Gender again: 
clustering_data['URB_newEnc'] = clustering_data['URBANICITY'].apply(newEnc_URB)
clustering_data['DGE_newEnc'] = clustering_data['DONOR_GENDER'].apply(newEnc_DGE)
    
new_enc_features = ['URBANICITY', 'DONOR_GENDER', 'OVERLAY_SOURCE']
clustering_data = clustering_data.drop(new_enc_features, axis=1)

We search for correlated features with the function that we already used in rule mining. As already mentioned, the clustering algorithms do not demand a very low number of features, so we can choose a higher correlation threshold.

In [None]:
CL_corr_features, CL_corr_groups, CL_corr_mat,_,_= find_correlations(clustering_data, corr_threshold=0.5)

print('CL_corr_features: \n', CL_corr_features)
print('\nCL_corr_groups:')
for ind, group in enumerate(CL_corr_groups):
    print('\n', ind, group)

The correlation groups here are similar to the ones found in rule mining. It is sometimes hard to assign one common attribute to a whole group. Again, this works quite well in most cases.

We will drop again all but one feature from each group:

In [None]:
inc_features_CL = ['MEDIAN_HOUSEHOLD_INCOME', 'NUMBER_PROM_12', 'LIFETIME_AVG_GIFT_AMT', \
                   'RECENT_RESPONSE_COUNT', 'MONTHS_SINCE_LAST_GIFT']
CL_features = list(set(clustering_data.columns)-set(CL_corr_features))+inc_features_CL 
clustering_data= clustering_data[CL_features]

As the last step, we can standardise the data:

In [None]:
# The data will be standardized (otherwise features with higher values will have higher weights in clustering):
clustering_data_X = clustering_data.drop(['TARGET_B', 'DONATION_TYPE'], axis=1)
scaler = preprocessing.StandardScaler().fit(clustering_data_X)
clustering_data_X = scaler.transform(clustering_data_X)

clustering_data_y = RM_transformed_data["DONATION_TYPE"]

## Finding Groups

### Clustering by k-means partitioning

In k-means clustering, the algorithm randomly chooses n cluster centers in the feature space. The distance from each cluster center to every point is calculated, and each point is assigned to the cluster of its closest center. After that, each cluster center is moved to the mean of all points assigned to it. This is repeated many times, but usually converges quickly. We keep scikit-learn's default limit of 300 iterations, which gives accurate results while avoiding long running times.

The number of clusters is chosen to be 12, which is a good amount of clusters to avoid a lot of small clusters. In the end, we would like to give more general statements about the donation behavior, not only about a small number of donors. Furthermore, with small clusters it becomes possible that statistical fluctuations result in false conclusions. 12 is still high enough though, to be able to characterize the clusters.
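
The iteration described above (random initial centers, assignment to the closest center, moving each center to the mean of its points) can be sketched in plain NumPy. This is an illustrative sketch on hypothetical toy data, not scikit-learn's implementation, which uses a smarter initialisation; the function name `kmeans_sketch` is made up for this example.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, n_iter=300, seed=0):
    """Plain Lloyd iterations; a sketch, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen data points as cluster centers
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break  # converged: the centers no longer move
        centers = new_centers
    return labels, centers

# Hypothetical toy data: two well separated blobs of 20 points each
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centers = kmeans_sketch(X_toy, n_clusters=2)
print(np.bincount(labels))
```

On such clearly separated data the loop stops after a few iterations, long before the iteration limit is reached.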

After these initial thoughts, we can create the classifier, fit it and make predictions:

In [None]:
nClusters = 12 
kmeans_classifier = KMeans(n_clusters=nClusters, random_state=0) 
kmeans_classifier

In [None]:
kmeans_classifier = kmeans_classifier.fit(clustering_data_X)
clustering_data['kmeans_Cluster'] = kmeans_classifier.predict(clustering_data_X)

### Agglomerative Hierarchical Clustering

In hierarchical clustering, the algorithm starts with every data point being its own cluster and repeatedly merges the two clusters that are closest to each other, until the chosen number of clusters (n) remains. We choose n to be 12 as well, for the same reasons as in k-means clustering. 
Another crucial parameter here is the type of linkage, which determines how the distance between two clusters is measured. We pick Ward's method, which uses the increase in the sum of squared errors when the two clusters are merged. It is more robust to outliers than single, complete and average linkage, which did not give reasonable results in this study.
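
To make the Ward criterion concrete, the sketch below computes the increase in the sum of squared errors for two candidate merges. The helper functions and the toy clusters are hypothetical and only illustrate the definition; `AgglomerativeClustering` computes this internally in a more efficient way.

```python
import numpy as np

def sse(points):
    """Sum of squared distances of the points to their centroid."""
    return ((points - points.mean(axis=0)) ** 2).sum()

def ward_cost(a, b):
    """Increase in the total SSE when clusters a and b are merged."""
    return sse(np.vstack([a, b])) - sse(a) - sse(b)

# Hypothetical toy clusters
a = np.array([[0.0, 0.0], [0.0, 1.0]])
b = np.array([[0.0, 3.0], [0.0, 4.0]])    # close to a
c = np.array([[10.0, 0.0], [10.0, 1.0]])  # far from a

print(ward_cost(a, b), ward_cost(a, c))
```

Merging `a` with the nearby cluster `b` increases the error far less than merging it with the distant cluster `c`, so Ward linkage would merge `a` and `b` first.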

Again, we create the classifier, fit it and make predictions:

In [None]:
Agg_classifier = AgglomerativeClustering(linkage ="ward", n_clusters=nClusters)
Agg_classifier

In [None]:
Agg_classifier = Agg_classifier.fit(clustering_data_X)
clustering_data['Agg_Cluster'] = Agg_classifier.labels_

### Compare Agglomerative and kMeans Clustering:

Before taking a closer look at each cluster, we compare the clusters received by k-means and agglomerative clustering in a contingency matrix:

In [None]:
def make_contingency_matrix(targets, classifier_labels): 
    '''
    :param targets: list with targets
    :param classifier_labels: list with targets determined by classifier
    :return: dataframe with contingency matrix
    '''
    contmat = contingency_matrix(targets, classifier_labels)
    df_contmat = pd.DataFrame(contmat)
    targets_list = list(set(targets))
    targets_list.sort()
    df_contmat.index = targets_list
    return df_contmat

cm = make_contingency_matrix(clustering_data.kmeans_Cluster, clustering_data.Agg_Cluster,)

# format the matrix and give names to clusters:
for ind in cm.columns:
    cm = cm.rename(columns={ind: 'agg_'+str(ind)}, index={ind: 'kms_'+str(ind)})
cm['agreement'] = cm.max(axis=1) / cm.sum(axis=1)
cm.loc['agreement'] = cm.max(axis=0) / cm.sum(axis=0)
cm['most similar'] = cm.idxmax(axis=1)
cm = cm.round(2)
cm

We can see that some clusters in k-means clustering have high agreement with clusters in agglomerative clustering. We can drop those, as it will not give us more information about the donors to consider a cluster twice: If clusters have an agreement greater than 80%, we drop the agg.-cluster:

In [None]:
cm["drop?"] = cm['agreement'].apply(lambda x: True if x>0.8 else False)
to_drop = cm[cm['agreement']>0.8]['most similar']
print("Clusters dropped:")
display(to_drop)

### Analyzing the clusters

We will work with the unstandardized data to receive meaningful values for the cluster means in each feature.

First, we create a dataframe for the k-means clusters and one for the agglomerative clusters, each containing the average value of the targets as well as the cluster size. This helps to see which clusters are actually interesting for us. The dataframes are then merged so that we work with a single set of clusters from that point on. After selecting the interesting clusters, we can check which features characterise each cluster, for example which features stand out in a cluster with a very high share of people donating.
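
The following cells compute the column `DONATION_TYPE if TB=1` as the cluster mean of `DONATION_TYPE` divided by the cluster mean of `TARGET_B`. The sketch below shows why this ratio equals the mean donation type among donors only. It assumes that `DONATION_TYPE` is encoded as 0 for non-donors, which is an assumption about the label encoding and not verified here; the toy values are made up.

```python
import pandas as pd

# Hypothetical toy cluster: DONATION_TYPE is 0 whenever TARGET_B is 0 (assumed encoding)
toy = pd.DataFrame({"TARGET_B":      [1, 0, 0, 1, 0],
                    "DONATION_TYPE": [3, 0, 0, 1, 0]})

# Ratio of the two cluster means, as used for 'DONATION_TYPE if TB=1'
ratio_of_means = toy["DONATION_TYPE"].mean() / toy["TARGET_B"].mean()

# Direct computation: mean donation type over donors only
mean_among_donors = toy.loc[toy["TARGET_B"] == 1, "DONATION_TYPE"].mean()

print(ratio_of_means, mean_among_donors)
```

Both values agree because non-donors add nothing to the sum of `DONATION_TYPE`, and the mean of `TARGET_B` is the share of donors in the cluster.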

In [None]:
# Agg.-clustering:

# Use the unstandardized data for further analysis:
cluster_attributes_Agg = clustering_data.groupby(['Agg_Cluster']).mean()
cluster_attributes_Agg['Cluster size'] = clustering_data['Agg_Cluster'].value_counts()

# DONATION_TYPE is replaced by a new column here!
# Use conditional probability here:
cluster_attributes_Agg['DONATION_TYPE if TB=1'] = \
                    cluster_attributes_Agg['DONATION_TYPE'] / cluster_attributes_Agg['TARGET_B']
cluster_attributes_Agg = cluster_attributes_Agg.drop('DONATION_TYPE', axis=1)

# Divide in attributes (Target, size of Cluster) and features:
attributes_Agg = ['TARGET_B', 'DONATION_TYPE if TB=1', 'Cluster size']
cluster_features_Agg = cluster_attributes_Agg[cluster_attributes_Agg.columns.difference(attributes_Agg)]
cluster_features_Agg = cluster_features_Agg.drop('kmeans_Cluster', axis=1)
cluster_attributes_Agg = cluster_attributes_Agg.drop(cluster_attributes_Agg.columns.difference(attributes_Agg), axis=1)

# give proper cluster_names:
for ind in cluster_attributes_Agg.index:
    cluster_attributes_Agg = cluster_attributes_Agg.rename(index={ind: 'agg_'+str(ind)})
    cluster_features_Agg = cluster_features_Agg.rename(index={ind: 'agg_'+str(ind)})
cluster_attributes_Agg.index.name = 'Cluster'
cluster_features_Agg.index.name = 'Cluster'

#drop clusters that match one from kmeans:
cluster_attributes_Agg = cluster_attributes_Agg.drop(to_drop, axis=0) 
cluster_features_Agg = cluster_features_Agg.drop(to_drop, axis=0)

In [None]:
# kmeans (do the same as for agg.-clustering):

cluster_attributes_kmeans = clustering_data.groupby(['kmeans_Cluster']).mean()
cluster_attributes_kmeans['Cluster size'] = clustering_data['kmeans_Cluster'].value_counts()

cluster_attributes_kmeans['DONATION_TYPE if TB=1'] = \
                cluster_attributes_kmeans['DONATION_TYPE'] / cluster_attributes_kmeans['TARGET_B']
cluster_attributes_kmeans = cluster_attributes_kmeans.drop('DONATION_TYPE', axis=1)

attributes = ['TARGET_B', 'DONATION_TYPE if TB=1', 'Cluster size']
cluster_features_kmeans = cluster_attributes_kmeans[cluster_attributes_kmeans.columns.difference(attributes)]
cluster_features_kmeans = cluster_features_kmeans.drop('Agg_Cluster', axis=1)
cluster_attributes_kmeans = cluster_attributes_kmeans.\
                        drop(cluster_attributes_kmeans.columns.difference(attributes), axis=1)

for ind in cluster_attributes_kmeans.index:
    cluster_attributes_kmeans = cluster_attributes_kmeans.rename(index={ind: 'kms_'+str(ind)})
    cluster_features_kmeans = cluster_features_kmeans.rename(index={ind: 'kms_'+str(ind)})
cluster_attributes_kmeans.index.name = 'Cluster'
cluster_features_kmeans.index.name = 'Cluster'

In [None]:
# merge the dataframes attributes and features for Agg and kmeans:
# (pd.concat instead of DataFrame.append, which was removed in pandas 2.0)
cluster_attributes = pd.concat([cluster_attributes_kmeans, cluster_attributes_Agg])
cluster_features = pd.concat([cluster_features_kmeans, cluster_features_Agg])

In [None]:
# calculate average value of TARGET_B and DONATION_TYPE in the whole dataset:
# (important, to check if a cluster differs)
avg_TB = (cluster_attributes['TARGET_B']*cluster_attributes['Cluster size']).sum() \
         /cluster_attributes['Cluster size'].sum()
# weight each cluster by its number of donors (TARGET_B * cluster size):
n_donors = cluster_attributes['TARGET_B']*cluster_attributes['Cluster size']
avg_DT_ifTB1 = (cluster_attributes['DONATION_TYPE if TB=1']*n_donors).sum() / n_donors.sum()

We filter out the clusters that are interesting for us. At least one of the target averages should differ significantly from the one for the whole dataset. The cluster should also not be very small, to exclude statistical fluctuations: 

In [None]:
def find_interesting_clusters(attributes):
    '''
    :param attributes: row with 'TARGET_B', 'Cluster size' and 'DONATION_TYPE if TB=1' of one cluster
    :return: True if interesting, else False
    '''
    # access the values by label instead of by position:
    differing_TB = abs(attributes['TARGET_B']-avg_TB)>0.04
    differing_DT = abs(attributes['DONATION_TYPE if TB=1']-avg_DT_ifTB1)>0.6
    too_small = attributes['Cluster size']<100
    return (differing_TB or differing_DT) and not too_small

cluster_attributes['interesting?'] = cluster_attributes[['TARGET_B', 'Cluster size', 'DONATION_TYPE if TB=1']] \
                                    .apply(find_interesting_clusters, axis=1)

# drop uninteresting clusters and transpose for better visualization:
cluster_features = cluster_features[cluster_attributes['interesting?']==True] 
cluster_attributes = cluster_attributes[cluster_attributes['interesting?']==True]
cluster_attributes = cluster_attributes.transpose()
cluster_features = cluster_features.transpose()

print(f'Interesting clusters: {cluster_attributes.shape[1]}/{2*nClusters}')

display(cluster_attributes)

We cannot include all features in our analysis, as this would be too much for this project. Therefore, we concentrate on the most important ones: we need to filter out the features that characterise each cluster.

As an indicator, we use the difference between the cluster average and the dataset average, divided by the dataset average. A threshold is chosen for this variable. 
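
A small worked example of this indicator, following the code in the next cell, which divides by the dataset average. The feature averages below are made up for illustration, and `threshold` plays the same role as `high_dev_threshold` in this notebook.

```python
import pandas as pd

# Hypothetical feature averages: whole dataset vs. one cluster
dataset_avg = pd.Series({"MONTHS_SINCE_LAST_GIFT": 18.0, "DONOR_AGE": 60.0})
cluster_avg = pd.Series({"MONTHS_SINCE_LAST_GIFT": 9.0, "DONOR_AGE": 63.0})

# Relative deviation of the cluster from the dataset average
deviation = (cluster_avg - dataset_avg) / dataset_avg

threshold = 0.4  # same role as high_dev_threshold in this notebook
characteristic = deviation[deviation.abs() > threshold]
print(characteristic.index.tolist())
```

Only the feature that deviates by more than 40% from the dataset average is kept as characteristic for the cluster. Note that a relative deviation becomes unstable for features whose dataset average is close to zero.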

In [None]:
# Calculate Dataset-averages of all features:
cluster_features["Dataset AVG"] = 1
for feature in cluster_features.index:
    cluster_features.loc[feature, 'Dataset AVG'] = clustering_data[feature].mean()

for cluster in cluster_attributes.columns:
    cluster_features['deviation '+str(cluster)] = (cluster_features[cluster]-cluster_features['Dataset AVG']) \
                                                            / cluster_features['Dataset AVG']
cluster_feature_dev = cluster_features.drop(cluster_attributes.columns, axis=1)
cluster_feature_dev = cluster_feature_dev.drop('Dataset AVG', axis=1)

# Choose a threshold for the deviation from which a feature shall be considered:
high_dev_threshold = 0.4

We make the cut and print out the results:

In [None]:
print('Dataset average for TARGET_B:\t', np.round(avg_TB,2))
print('Dataset average for DONATION_TYPE, if TARGET_B==1:\t', np.round(avg_DT_ifTB1,2))
display(cluster_attributes)
for cluster in cluster_feature_dev:
    no = cluster[len('deviation '):]  # strip the 'deviation ' prefix to recover the cluster label
    print('Cluster: ', no)
    deviating_features = cluster_features[ abs(cluster_feature_dev[cluster]) > high_dev_threshold ]
    display(deviating_features[ [no, 'Dataset AVG']] )

## Clustering - Results and Discussion 

We will now examine the clusters found and relate their behavior (where applicable) to the rules found in the previous section:

*kms_0* is a cluster with a high fraction of donors who do not necessarily donate higher amounts than the dataset average, though. Usually, little time has passed since the person's last gift. This behavior is very similar to the rules we found in the previous section regarding *MONTHS_SINCE_LAST_GIFT*. Additionally, the people in this cluster have received a lot of donation commercials recently. This cluster suggests that people who receive many commercials donate frequently, but do not give high amounts.

In *kms_2* we find a low number of donors. The average donation type is not very low, though. *PCT_ATTRIBUTE1* (the percentage of military members in the area) is rather low. People in this cluster have not been very active recently, but quite a few have had star donor status. This behavior is closely related to the rules concerning the attribute *RECENT_RESPONSE_COUNT*.

In cluster *kms_4* we find a large fraction of donors who donate large amounts. Very interesting is the fact that, although the lifetime average is low, people have been very active recently and donate a lot. These are people who started donating recently, a group that would be worth looking into further.

Cluster *kms_5* is a large group with few donors and small donation amounts. It can be characterised by many features. The very high cluster code is interesting, although it is hard to interpret. In general, these people do not donate to other groups (*MOR_HIT_RATE*), are generally inactive and live in rather rural areas.

Another very large cluster is *agg_0*. The fraction of donors is high and they donate large amounts. It is also characterised by a rather low number of military servants. What stands out is that its members were very active recently.

Cluster *agg_4* is rather small, but stands out in that its members donate very low amounts. They have donated high amounts over their lifetimes, however, and appear to be inactive now (low *RECENT_RESPONSE_COUNT*, low *RECENT_STAR_STATUS*).

In the last cluster (*agg_8*), a large fraction donates large amounts. Its members also donate to other organisations (*MOR_HIT_RATE*). An interesting fact is that they do not usually seem to own a house. The number of military servants in the area is high, and they are more likely to have a published phone number.

We would like to show the different clusters in 2D plots here. However, this is made very hard by the large number of features: the clusters are not necessarily distinguishable from each other in two or three dimensions. Categorical features are not suitable for this kind of graphical display either.

As an example, we show clusters *kms_5* and *kms_9* in the features *MEDIAN_HOUSEHOLD_INCOME* and *DONOR_AGE*:

In [None]:
def clusterplot_2D(feature1, feature2, cluster_no1, cluster_no2, cluster_no3=None, n=1000):
    '''
    Scatter plot of two (optionally three) k-means clusters in two features.
    :param feature1: name of the feature on the x-axis
    :param feature2: name of the feature on the y-axis
    :param cluster_no1: number of the first cluster to compare
    :param cluster_no2: number of the second cluster to compare
    :param cluster_no3: number of an optional third cluster
    :param n: number of sampled data points. Plotting all data points would result in a mess.
    '''
    plotting_data = clustering_data.sample(n=n)
    cluster_plot, cluster_ax = plt.subplots(nrows=1, ncols=1)
    # plot each requested cluster in its own color; skip the optional third one if not given
    for cluster_no, color in zip([cluster_no1, cluster_no2, cluster_no3], ['red', 'blue', 'green']):
        if cluster_no is None:
            continue
        members = plotting_data[plotting_data["kmeans_Cluster"]==cluster_no]
        cluster_ax.scatter(members[feature1], members[feature2], c=color, label=f'kms_{cluster_no}')
    cluster_ax.set_xlabel(feature1)
    cluster_ax.set_ylabel(feature2)
    cluster_ax.legend()
    cluster_ax.set_title('Clusters displayed in two features')
    plt.show()

f1 = 'MEDIAN_HOUSEHOLD_INCOME'
f2 = 'DONOR_AGE'
f3 = 'LIFETIME_AVG_GIFT_AMT'
clusterplot_2D(f1, f2, 5, 9, n=200)

The result matches what one would expect from the summary above: income values are higher for cluster 9, while there is little or no difference in age. There is no visible line that divides the two clusters, because the hyperplane that separates them exists in higher dimensions. In this two-dimensional display, the clusters therefore appear to overlap.
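
The overlap in a two-feature view can be reproduced on synthetic data. The following sketch is not based on the donors dataset: it assumes two hypothetical groups (`group_a`, `group_b`) that differ only in a third feature, and uses a simple, illustrative `separation` measure (distance between group means in units of the pooled standard deviation) to show that the first two features alone carry almost no information about group membership.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two groups with identical distributions in features 0 and 1,
# but clearly different means in feature 2.
n = 500
group_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(n, 3))
group_b = rng.normal(loc=[0.0, 0.0, 6.0], scale=1.0, size=(n, 3))

def separation(a, b):
    """Per-feature distance between group means, in units of the pooled standard deviation."""
    pooled_std = np.sqrt((a.var(axis=0) + b.var(axis=0)) / 2)
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled_std

sep = separation(group_a, group_b)
print(np.round(sep, 2))
```

A scatter plot of features 0 and 1 would show the two groups lying on top of each other, even though they are well separated along feature 2. The same effect can occur when clusters found in many dimensions are displayed in only two of them.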

# Final Comments and Conclusions

### Supervised Learning

### Unsupervised Learning

To summarise, association rules between features and targets were found with the apriori algorithm, and clusters of donors were determined via k-means and agglomerative clustering. The analysis could still be improved by taking a closer look at the correlations between the features during pre-processing. It was strongly limited by the high memory requirements of the apriori algorithm; as a result, a lot of information about the donors was lost during binning and when discarding features.
The two approaches highlighted different aspects, but together they support some main conclusions:

- A short time since the last gift results in a high probability that the person will donate again, although the amount is not necessarily high.
- A person who has not responded much recently will probably not donate. If such a person does donate, the amount is likely to be high.
- In general, higher wealth leads to more and larger donations.
- People in urban areas donate more money than people in rural areas.
- Older people donate more often.
- In areas with a low percentage of male military servants, people donate more.