# Machine Learning / Aprendizagem Automática

## Diogo Soares, André Falcão and Sara C. Madeira, 2020/21

# ML Project  - Learning about Donations

## Logistics

**Students are encouraged to work in teams of 3 people**. 

Projects with smaller teams are allowed, in exceptional cases, but will not have better grades for this reason. 

The quality of the project will dictate its grade, not the number of people working.

**The project's solution should be uploaded to Moodle by the end of December 18th (last day before the Christmas holidays).** 

Students should **upload a `.zip` file** containing all the files necessary for project evaluation. 

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc describing the solution and the results. The notebook is both the solution and the report.**

**Decisions should be justified and results should be critically discussed.**

## Tools

The team should use [Python 3](https://www.python.org) and [Jupyter Notebook](http://jupyter.org), together with **[Scikit-learn](http://scikit-learn.org/stable/)**, **[Orange3](https://orange.biolab.si)**, or **both**.

**[Orange3](https://orange.biolab.si)** can be used through its **[programmatic version](https://docs.orange.biolab.si/3/data-mining-library/)**, by importing and using its packages, or through its **workflow version**. 

**It is up to the team to decide when to use Scikit-learn, Orange, or both.**

In this context, your Jupyter notebook might have a mix of code, results, text explanations, workflow figures, etc. 

In case you use Orange/workflows for some tasks you should also deliver the workflow files and explain the options taken in each widget in your notebook.

**You can use this notebook and the sections below as a template for your work.**

## Dataset

The dataset to be analysed is **`Donors_dataset.csv`**, made available together with this project description. This dataset, downloaded from [Kaggle](https://www.kaggle.com), contains selected data from the following dataset: [Donors-Prediction](https://www.kaggle.com/momohmustapha/donorsprediction/). 


**In this project, your team is supposed to use only tabular data (not Images or Image Metadata) and see how far you can go in predicting donations and understanding the donors. You should use both supervised and unsupervised learning to tackle 2 tasks:**

1. **Task 1 (Supervised Learning) - Predicting Donation and Donation Type**
2. **Task 2 (Unsupervised Learning) - Characterizing Donors**

The **`Donors_dataset.csv`** you should learn from has **19,372 instances** described by **50 data fields** that you might use as **categorical/numerical features**.

### File Descriptions

* **Donors_dataset.csv** - Tabular/text data to be used in the machine learning tasks.


### Data Fields

* **CARD_PROM_12** - number of card promotions sent to the individual by the charitable organization in the past 12 months
* **CLUSTER_CODE** - one of 54 possible cluster codes, which are unique in terms of socioeconomic status, urbanicity, ethnicity, and other demographic characteristics
* **CONTROL_NUMBER** - unique identifier of each individual
* **DONOR_AGE** - age as of last year's mail solicitation
* **DONOR_GENDER** - actual or inferred gender
* **FILE_AVG_GIFT** - this variable is identical to LIFETIME_AVG_GIFT_AMT
* **FILE_CARD_GIFT** - lifetime average donation (in \\$) from the individual in response to all card solicitations from the charitable organization
* **FREQUENCY_STATUS_97NK** - based on the period of recency (determined by RECENCY_STATUS_96NK), which is the past 12 months for all groups except L and E. L and E are 13–24 months ago and 25–36 months ago, respectively: 1 if one donation in this period, 2 if two donations in this period, 3 if three donations in this period, and 4 if four or more donations in this period.
* **HOME_OWNER** - H if the individual is a homeowner, U if this information is unknown
* **INCOME_GROUP** - one of 7 possible income level groups based on a number of demographic characteristics
* **IN_HOUSE** - 1 if the individual has ever donated to the charitable organization's In House program, 0 if not
* **LAST_GIFT_AMT** - amount of the most recent donation from the individual to the charitable organization
* **LIFETIME_AVG_GIFT_AMT** - lifetime average donation (in \\$) from the individual to the charitable organization
* **LIFETIME_CARD_PROM** - total number of card promotions sent to the individual by the charitable organization
* **LIFETIME_GIFT_AMOUNT** - total lifetime donation amount (in \\$) from the individual to the charitable organization
* **LIFETIME_GIFT_COUNT** - total number of donations from the individual to the charitable organization
* **LIFETIME_GIFT_RANGE** - maximum donation amount from the individual minus minimum donation amount from the individual
* **LIFETIME_MAX_GIFT_AMT** - maximum donation amount (in \\$) from the individual to the charitable organization
* **LIFETIME_MIN_GIFT_AMT** - minimum donation amount (in \\$) from the individual to the charitable organization
* **LIFETIME_PROM** - total number of promotions sent to the individual by the charitable organization
* **MEDIAN_HOME_VALUE** - median home value (in 100\\$) as determined by other input variables
* **MEDIAN_HOUSEHOLD_INCOME** - median household income (in 100\\$) as determined by other input variables
* **MONTHS_SINCE_FIRST_GIFT** - number of months since the first donation from the individual to the charitable organization
* **MONTHS_SINCE_LAST_GIFT** - number of months since the most recent donation from the individual to the charitable organization
* **MONTHS_SINCE_LAST_PROM_RESP** - number of months since the individual has responded to a promotion by the charitable organization
* **MONTHS_SINCE_ORIGIN** - number of months that the individual has been in the charitable organization's database
* **MOR_HIT_RATE** - total number of known times the donor has responded to a mailed solicitation from a group other than the charitable organization
* **NUMBER_PROM_12** - number of promotions (card or other) sent to the individual by the charitable organization in the past 12 months
* **OVERLAY_SOURCE** - the data source against which the individual was matched: M if Metromail, P if Polk, B if both
* **PCT_ATTRIBUTE1** - percent of residents in the neighborhood in which the individual lives that are males and active military
* **PCT_ATTRIBUTE2** - percent of residents in the neighborhood in which the individual lives that are males and veterans
* **PCT_ATTRIBUTE3** - percent of residents in the neighborhood in which the individual lives that are Vietnam veterans
* **PCT_ATTRIBUTE4** - percent of residents in the neighborhood in which the individual lives that are WWII veterans
* **PCT_OWNER_OCCUPIED** - percent of owner-occupied housing in the neighborhood in which the individual lives
* **PEP_STAR** - 1 if individual has ever achieved STAR donor status, 0 if not
* **PER_CAPITA_INCOME** - per capita income (in \\$) of the neighborhood in which the individual lives
* **PUBLISHED_PHONE** - 1 if the individual's telephone number is published, 0 if not
* **RECENCY_STATUS_96NK** - recency status as of two years ago: A if active donor, S if star donor, N if new donor, E if inactive donor, F if first time donor, L if lapsing donor
* **RECENT_AVG_CARD_GIFT_AMT** - average donation from the individual in response to a card solicitation from the charitable organization since four years ago
* **RECENT_AVG_GIFT_AMT** - average donation (in \\$) from the individual to the charitable organization since four years ago
* **RECENT_CARD_RESPONSE_COUNT** - number of times the individual has responded to a card solicitation from the charitable organization since four years ago
* **RECENT_CARD_RESPONSE_PROP** - proportion of responses from the individual relative to the number of card solicitations from the charitable organization since four years ago
* **RECENT_RESPONSE_COUNT** - number of times the individual has responded to a promotion (card or other) from the charitable organization since four years ago
* **RECENT_RESPONSE_PROP** - proportion of responses from the individual relative to the number of (card or other) solicitations from the charitable organization since four years ago
* **RECENT_STAR_STATUS** - 1 if individual has achieved star donor status since four years ago, 0 if not
* **SES** - one of 5 possible socioeconomic codes classifying the neighborhood in which the individual lives
* **TARGET_B** - 1 if individual donated in response to last year's 97NK mail solicitation from the charitable organization, 0 if individual did not
* **TARGET_D** - amount of donation (in \\$) from the individual in response to last year's 97NK mail solicitation from the charitable organization
* **URBANICITY** - classification of the neighborhood in which the individual lives: U if urban, C if city, S if suburban, T if town, R if rural, ? if missing
* **WEALTH_RATING** - one of 10 possible wealth rating groups based on a number of demographic characteristics


### Donation TYPE

You are supposed to create a new column/feature named `DONATION_TYPE`, whose values describe ranges of the donation amount (DA) reported in feature `TARGET_D`:
* `A` - DA >= 50
* `B` - 20 <= DA < 50 
* `C` - 13 <= DA < 20
* `D` - 10 <= DA < 13
* `E` - DA < 10
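
The ranges above can be sketched with `pd.cut`. This is only an illustration on made-up amounts: the helper name `donation_type` and the demo values are assumptions, and how to label individuals without a donation (missing `TARGET_D`) is a decision left to the team; here they simply stay missing.

```python
import numpy as np
import pandas as pd

def donation_type(amounts: pd.Series) -> pd.Series:
    """Map donation amounts to the classes A-E; missing amounts stay missing."""
    edges = [-np.inf, 10, 13, 20, 50, np.inf]
    labels = ["E", "D", "C", "B", "A"]
    # right=False makes every interval closed on the left: [10, 13), [13, 20), ...
    return pd.cut(amounts, bins=edges, labels=labels, right=False)

# hypothetical amounts chosen to hit every boundary of the definition
demo_amounts = pd.Series([5, 10, 12.99, 13, 19.5, 20, 49.99, 50, 200, np.nan])
demo_types = donation_type(demo_amounts)
print(pd.DataFrame({"TARGET_D": demo_amounts, "DONATION_TYPE": demo_types}))
```

Checking the boundary values (10, 13, 20, 50) explicitly is a cheap way to confirm that `>=` and `<` were implemented as specified.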


### **Important Notes on Data Cleaning and Preprocessing**

   1. Data can contain **errors/typos**, whose correction might improve the analysis.
   2. Some features can contain **many values**, whose grouping in categories (aggregation into bins) might improve the analysis.
   3. Data can contain **missing values**, that you might decide to fill. You might also decide to eliminate instances/features with high percentages of missing values.
   4. **Not all features are necessarily important** for the analysis.
   5. Depending on the analysis, **some features might have to be excluded**.
   6. Class distribution is an important characteristic of the dataset that should be checked. **Class imbalance** might impair machine learning. 
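
Notes 3 and 6 can be checked with a few lines of pandas. The sketch below uses a tiny made-up table; the helper name `quality_report` and the toy values are assumptions for illustration, not results from `Donors_dataset.csv`.

```python
import numpy as np
import pandas as pd

def quality_report(df: pd.DataFrame, target: str):
    """Return the percentage of missing values per column and the class shares of the target."""
    missing_pct = df.isna().mean().mul(100).round(1).sort_values(ascending=False)
    class_share = df[target].value_counts(normalize=True).sort_index()
    return missing_pct, class_share

# toy data only; the real call would be quality_report(raw_data, "TARGET_B")
toy = pd.DataFrame({
    "WEALTH_RATING": [3, np.nan, 7, np.nan],
    "DONOR_AGE": [45, 60, np.nan, 71],
    "TARGET_B": [0, 0, 0, 1],
})
missing_pct, class_share = quality_report(toy, "TARGET_B")
print(missing_pct)
print(class_share)
```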
  
Some potentially useful links:

* Data Cleaning and Preprocessing in Scikit-learn: https://scikit-learn.org/stable/modules/preprocessing.html#
* Data Cleaning and Preprocessing in Orange: https://docs.biolab.si//3/visual-programming/widgets/data/preprocess.html
* Dealing with imbalance datasets: https://pypi.org/project/imbalanced-learn/ and https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets#t7

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import model_selection
from sklearn.utils import class_weight
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn import tree
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder


from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Task 0 (Know your Data) - Exploratory Data Analysis

## Loading Data

In [None]:
raw_data = pd.read_csv("Donors_dataset.csv")
raw_data

In [None]:
raw_data

## Understanding Data

In this task you should **understand better the features**, their distribution of values, potential errors, etc., and plan/describe what data preprocessing steps should be performed next. It is also very important to check the distribution of values in the target (class distribution). 

Here you can find a notebook with some examples of what you can do in **Exploratory Data Analysis**: https://www.kaggle.com/artgor/exploration-of-data-step-by-step/notebook. You can also use Orange widgets for this.

### Exploratory Data Analysis

First we will print and plot different tables and visualisations to get a feeling and better overview of the data. 

In [None]:
raw_data.info() 

In [None]:
raw_data.describe()

Next we will plot the histograms to check if there are numerical features with a high number of values, that can possibly be binned to be more convenient for the models. 
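
Besides reading the histograms, the number of distinct values per numerical feature can be counted directly. A minimal sketch, assuming a hypothetical helper `binning_candidates` and an arbitrary cut-off of 20 distinct values (both are our assumptions, not part of the project description):

```python
import pandas as pd

def binning_candidates(df: pd.DataFrame, min_unique: int = 20) -> pd.Series:
    """List numerical columns with at least `min_unique` distinct values."""
    n_unique = df.select_dtypes(include="number").nunique()
    return n_unique[n_unique >= min_unique].sort_values(ascending=False)

# toy data only; the real call would be binning_candidates(raw_data)
toy = pd.DataFrame({
    "DONOR_AGE": range(18, 68),                    # 50 distinct values
    "PEP_STAR": [0, 1] * 25,                       # binary flag
    "URBANICITY": ["U", "C", "S", "T", "R"] * 10,  # non-numeric, ignored
})
candidates = binning_candidates(toy)
print(candidates)
```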

#### Histogram Plot for numerical features

In [None]:
raw_data.hist(bins=30, figsize=(20, 20))

In [None]:
# sort features in different categories for plotting:
all_features = list(raw_data.columns[2:])
pscontinuous_features = ['CONTROL_NUMBER', 'MONTHS_SINCE_ORIGIN', 'DONOR_AGE', 'PER_CAPITA_INCOME', \
                              'WEALTH_RATING', 'MEDIAN_HOME_VALUE', 'MEDIAN_HOUSEHOLD_INCOME', \
                             'PCT_OWNER_OCCUPIED', 'PCT_ATTRIBUTE2', 'PCT_ATTRIBUTE3', 'PCT_ATTRIBUTE4', \
                              'RECENT_RESPONSE_PROP', 'RECENT_AVG_GIFT_AMT', 'RECENT_CARD_RESPONSE_PROP', \
                              'RECENT_AVG_CARD_GIFT_AMT', 'RECENT_RESPONSE_COUNT', 'RECENT_CARD_RESPONSE_COUNT', \
                              'MONTHS_SINCE_LAST_PROM_RESP', 'LIFETIME_CARD_PROM', 'LIFETIME_PROM', \
                              'LIFETIME_GIFT_COUNT', 'LAST_GIFT_AMT', 'NUMBER_PROM_12', 'MONTHS_SINCE_LAST_GIFT', \
                              'MONTHS_SINCE_FIRST_GIFT', 'FILE_CARD_GIFT','CARD_PROM_12']
categorical_features = ['IN_HOUSE', 'URBANICITY', 'SES', 'HOME_OWNER', 'DONOR_GENDER', 'INCOME_GROUP', \
                        'PUBLISHED_PHONE', 'OVERLAY_SOURCE', 'PEP_STAR', 'RECENCY_STATUS_96NK', \
                        'FREQUENCY_STATUS_97NK']
other_features = ['CLUSTER_CODE', 'MOR_HIT_RATE', 'PCT_ATTRIBUTE1', 'RECENT_STAR_STATUS', 'LIFETIME_GIFT_AMOUNT', \
                  'LIFETIME_AVG_GIFT_AMT', 'LIFETIME_GIFT_RANGE', 'LIFETIME_MAX_GIFT_AMT', 'LIFETIME_MIN_GIFT_AMT', \
                  'FILE_AVG_GIFT']

# check that all features are included and none is counted twice:
feature_lists = pscontinuous_features+categorical_features+other_features
print('All features included and no doubles: ', \
      len(all_features)==len(feature_lists) and set(feature_lists)==set(all_features))

def create_plots(features_to_plot, plottype='violin'):
    '''Creates plots for given features. Plottypes: 'violin', 'count'. We can add more if necessary.
    Careful: if the number of values per feature is high when 'count' is chosen, running time goes up.'''
    print(f'{len(features_to_plot)} plots:')
    ncols = 3
    nrows = int(len(features_to_plot)/ncols)+1
    newplots = plt.figure(figsize=(ncols*5,nrows*5))
    for ind, feature in enumerate(features_to_plot):
        plt.subplot(nrows, ncols, ind+1)
        if plottype=='violin':
            # violinplot has no fontsize parameter; the font size is set on the title instead
            sns.violinplot(x="TARGET_B", y=feature, data=raw_data)
            plt.title(f'TARGET_B by {feature}', fontsize=8)
        else:
            sns.countplot(x='TARGET_B', data=raw_data, hue=feature)
            plt.title(f'{feature} in TARGET_B', fontsize=8)

####  Violin Plots for numerical Data

In [None]:
create_plots(pscontinuous_features, 'violin')

#### Countplots for categorical features 

In [None]:
create_plots(categorical_features, 'count')

In [None]:
plot = raw_data['TARGET_B'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('TARGET_B classes counts')


#### Violin plots for other features 

In [None]:
create_plots(other_features, 'violin')

----

Result of the exploratory analysis: the feature `DONOR_GENDER` contains one erroneous value, a typo (= "A"). That row can be deleted. 

----
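
Typos like this one tend to show up as categories that occur only once or twice. The following sketch lists such rare values for a categorical column; the helper name `rare_values`, the cut-off `max_count` and the toy values are assumptions for illustration.

```python
import pandas as pd

def rare_values(column: pd.Series, max_count: int = 1) -> pd.Series:
    """Return the categories that occur at most `max_count` times (missing values are ignored)."""
    counts = column.value_counts(dropna=True)
    return counts[counts <= max_count]

# toy data only; the real call would be rare_values(raw_data["DONOR_GENDER"])
toy_gender = pd.Series(["M", "F", "F", "M", "A", "F", None])
suspects = rare_values(toy_gender)
print(suspects)
```

A rare value is not necessarily an error, so each hit still needs to be checked against the field description.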
 

In [None]:
display(raw_data[raw_data.DONOR_GENDER == "A"])

In [None]:
# drop the row with the erroneous gender value instead of relying on a hard-coded row index
raw_data = raw_data[raw_data.DONOR_GENDER != "A"].reset_index(drop=True)
display(raw_data) 


In [None]:
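# correlations are only defined for numerical columns, so select them explicitly
corr = raw_data.select_dtypes(include='number').corr()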

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 14))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(250, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
heatmap = sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);


---

Result of the exploratory analysis: highly correlated features (red = positively correlated, blue = negatively correlated) can be redundant, because they add little or no additional information and may therefore be useless for the model. 

TODO: consider deleting redundant features. 
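
To turn the heatmap into a concrete list of candidates, the strongly correlated pairs can be extracted from the correlation matrix. A minimal sketch on toy data; the helper name `correlated_pairs` and the threshold of 0.9 are assumptions, and which feature of a pair to drop remains a modelling decision.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.select_dtypes(include="number").corr().abs()
    # keep only the strict upper triangle so that each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    return pairs[pairs > threshold].sort_values(ascending=False)

# toy data: FILE_AVG_GIFT is described as identical to LIFETIME_AVG_GIFT_AMT
toy = pd.DataFrame({
    "LIFETIME_AVG_GIFT_AMT": [10, 20, 15, 30, 25],
    "FILE_AVG_GIFT": [10, 20, 15, 30, 25],
    "DONOR_AGE": [70, 30, 55, 42, 61],
})
strong = correlated_pairs(toy)
print(strong)
```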

---

### Transforming donation amount in classes

In [None]:
transformed_data = raw_data.copy()

In [None]:
transformed_data = transformed_data.drop(columns = ["CONTROL_NUMBER"])

In [None]:
def label_donation_type(row):
    if row['TARGET_D'] >= 50:
        return 'A'
    if row['TARGET_D'] >= 20 and row['TARGET_D'] < 50:
        return 'B'
    if row['TARGET_D'] >= 13 and row['TARGET_D'] < 20:
        return 'C'
    if row['TARGET_D'] >= 10 and row['TARGET_D'] < 13:
        return 'D'
    if row['TARGET_D'] < 10:
        return 'E'
    return '?'

transformed_data['DONATION_TYPE'] = transformed_data.apply (lambda row: label_donation_type(row), axis=1)
transformed_data = transformed_data.drop(columns = ('TARGET_D'))

display(transformed_data)

In [None]:
transformed_data['WEALTH_RATING'].min()

# Task 1 (Supervised Learning) - Predicting Donation and Donation Type

In this task you should target 3 classification tasks:
1. **Predicting  Donation (binary classification task)**; 
2. **Predicting Donation TYPE (multiclass classification)**; and
3. **Train specialized models for SES (socioeconomic classification)**.

**You should:**

* Choose **one classifier in each category**: Tree models, Rule models, Linear models, Distance-based models, and Probabilistic models.
* Use cross-validation to evaluate the results. 
* Present and discuss the results for different evaluation measures, present confusion matrices. Remember that not only overall results are important. Check what happens when learning to predict each class.
* Describe the parameters used for each classifier and how their choice impacted or not the results.
* Choose the best classifier and justify your choice.
* **Discuss critically your choices and the results!**
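
The evaluation protocol in the list above could look roughly like the sketch below. It runs on synthetic data from `make_classification`, not on the donors dataset, and the chosen models and parameters are placeholders. As far as we know, scikit-learn does not ship a dedicated rule learner, so the rule-model category is left out here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# synthetic, imbalanced stand-in for the real feature matrix and TARGET_B
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.75, 0.25], random_state=0)

models = {
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "linear": LogisticRegression(max_iter=1000),
    "distance": KNeighborsClassifier(n_neighbors=5),
    "probabilistic": GaussianNB(),
}

# stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    y_pred = cross_val_predict(model, X_demo, y_demo, cv=cv)
    results[name] = confusion_matrix(y_demo, y_pred)
    print(name)
    print(results[name])
    # per-class precision/recall shows what happens for the minority class
    print(classification_report(y_demo, y_pred, zero_division=0))
```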

## Preprocessing Data for Classification

### Binning numerical data 

Result of Histogram Analysis: 

Attributes worth binning (because they have a high number of values): 

- DONOR_AGE
- LIFETIME_CARD_PROM
- LIFETIME_GIFT_COUNT
- LIFETIME_PROM
- MEDIAN_HOME_VALUE
- MEDIAN_HOUSEHOLD_INCOME
- MONTHS_SINCE_LAST_GIFT
- MONTHS_SINCE_FIRST_GIFT
- PCT_ATTRIBUTE1
- PCT_ATTRIBUTE2
- PCT_ATTRIBUTE3
- PCT_ATTRIBUTE4
- PCT_OWNER_OCCUPIED
- PER_CAPITA_INCOME
- RECENT_RESPONSE_PROP
- MONTHS_SINCE_LAST_PROM_RESP

TODO: create bins 

- either quantile binning (our first approach)
- or Log transform 
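
The two options can be compared on toy data as follows. The helper names and the demo series are assumptions for illustration. Note that with many repeated values (for example many zeros) some quantile edges coincide, so `duplicates="drop"` yields fewer bins than requested.

```python
import numpy as np
import pandas as pd

def quantile_bin(values: pd.Series, n_bins: int = 10) -> pd.Series:
    """Equal-frequency binning; returns integer bin codes starting at 0."""
    # labels=False avoids a label/bin-count mismatch when duplicate edges are dropped
    return pd.qcut(values, q=n_bins, labels=False, duplicates="drop")

def log_transform(values: pd.Series) -> pd.Series:
    """log(1 + x) compresses long right tails and is defined for zeros."""
    return np.log1p(values)

uniform_demo = pd.Series(np.arange(1, 101))              # evenly spread values
skewed_demo = pd.Series([0] * 60 + list(range(1, 41)))   # 60% zeros

uniform_bins = quantile_bin(uniform_demo)
skewed_bins = quantile_bin(skewed_demo)
print(uniform_bins.value_counts().sort_index())
print(skewed_bins.value_counts().sort_index())
print(log_transform(skewed_demo).describe())
```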

---


features_for_binning =  ["DONOR_AGE", "LIFETIME_CARD_PROM", "LIFETIME_GIFT_COUNT", "LIFETIME_PROM", "MEDIAN_HOME_VALUE", "MEDIAN_HOUSEHOLD_INCOME", "MONTHS_SINCE_FIRST_GIFT", "PCT_ATTRIBUTE2", "PCT_ATTRIBUTE3", "PCT_ATTRIBUTE4", "PCT_OWNER_OCCUPIED", "PER_CAPITA_INCOME", "RECENT_RESPONSE_PROP"]
quantile_list = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
quantile_labels = [1,2,3,4,5,6,7,8,9,10]

for feature in features_for_binning:
    quantiles = transformed_data[feature].quantile(quantile_list)

    #binned_dataframe[f'{feature}_range'] = pd.qcut(binned_dataframe[feature],q=quantile_list)
    transformed_data[f'{feature}_label'] = pd.qcut(transformed_data[feature],q=quantile_list,labels=quantile_labels, duplicates='drop')

    
transformed_data = transformed_data.drop(columns = features_for_binning)    
display(transformed_data)

In [None]:
plot = transformed_data['DONATION_TYPE'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('DONATION_TYPE classes counts')


### Replace missing Data (NaNs)

In [None]:
x = dict(transformed_data.isna().sum())
{k: v for k, v in sorted(x.items(), key=lambda item: item[1])}

In [None]:
plot = transformed_data['WEALTH_RATING'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('WEALTH_RATING classes counts')


In [None]:
#plot = transformed_data['DONOR_AGE_label'].value_counts().sort_index().plot(kind = 'barh')
#plot.set_title('DONOR_AGE_label classes counts')


--- 

The dataset contains NaN values in some columns. As the models cannot be trained with NaN values, there are two possibilities: either drop the affected rows, which leads to a much smaller dataset, or replace the NaN values with other values (using different techniques) in order to keep a large dataset and not lose information.

Features with missing values that need to be replaced are: 

- 'WEALTH_RATING': 8810
- 'DONOR_AGE_label': 4795
- 'INCOME_GROUP': 4392
- 'MONTHS_SINCE_LAST_PROM_RESP': 246

- 'URBANICITY'
- 'SES'
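
One simple replacement strategy is to fill with the median or the mode while keeping a flag that records where a value was missing, so a model can still use the missingness itself. A minimal sketch on toy values; the helper name `impute_with_indicator` and the `_missing` suffix are assumptions.

```python
import numpy as np
import pandas as pd

def impute_with_indicator(df: pd.DataFrame, column: str, strategy: str = "median") -> pd.DataFrame:
    """Fill NaNs in `column` and add a 0/1 column marking the originally missing rows."""
    out = df.copy()
    out[f"{column}_missing"] = out[column].isna().astype(int)
    if strategy == "median":
        fill_value = out[column].median()
    elif strategy == "mode":
        fill_value = out[column].mode()[0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    out[column] = out[column].fillna(fill_value)
    return out

# toy data only; the real call would use transformed_data and e.g. "WEALTH_RATING"
toy = pd.DataFrame({"WEALTH_RATING": [1, np.nan, 3, np.nan, 9]})
filled = impute_with_indicator(toy, "WEALTH_RATING", strategy="median")
print(filled)
```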

#### OneHot

In [None]:
def ohe_encode(col_names, X):
    '''Takes columns to encode (format: ['column1', 'column2', ...]) and DataFrame X with all columns. 
    Encodes features and replaces old columns in X. Returns updated X'''
    enc = OneHotEncoder()
    matrix = enc.fit_transform(X[col_names].to_numpy()).toarray()

    # one column name per (feature, category) pair; the f-string also handles numeric categories
    new_columns = [f'{col}:{cat}' for col, cats in zip(col_names, enc.categories_) for cat in cats]
    # reuse the index of X so that the concat below aligns the rows correctly
    new_df = pd.DataFrame(matrix, columns=new_columns, index=X.index)

    updated_df = pd.concat([X, new_df], axis=1)
    updated_df = updated_df.drop(col_names, axis=1)
    
    return updated_df

#### OneHot encoding of non-numeric categories

In [None]:
#to_encode = ['WEALTH_RATING','DONOR_AGE_label','INCOME_GROUP', 'URBANICITY','SES' ]
#ohe_encode(to_encode, transformed_data)

#### encode and KNN

In [None]:
#!pip3 install fancyimpute

from sklearn.preprocessing import OrdinalEncoder
from fancyimpute import KNN
#instantiate both packages to use
encoder = OrdinalEncoder()
imputer = KNN()
# create a list of categorical columns to iterate over
to_encode = ['WEALTH_RATING','DONOR_AGE_label','INCOME_GROUP', 'URBANICITY','SES' ]

def encode(data):
    '''function to encode non-null data and replace it in the original data'''
    #retains only non-null values
    nonulls = np.array(data.dropna())
    #reshapes the data for encoding
    impute_reshape = nonulls.reshape(-1,1)
    #encode data
    impute_ordinal = encoder.fit_transform(impute_reshape)
    #Assign back encoded values to non-null values
    tdata = data.copy()
    tdata.loc[data.notnull()] = np.squeeze(impute_ordinal)
    return tdata

#create a for loop to iterate through each column in the data
#note: the encoded column is returned but not assigned back, so transformed_data is left unchanged here
for columns in to_encode:
    encode(transformed_data[columns])

#### TODO: 
#Urbancity
#SES



#possibilities:
#1. any number, like 999 or 0
#2. mean/median
#3. KNN

# how to replace WEALTH_RATING ? 
# its from 0 to 9 , so missing data could be = 10 
# as it now is binned data = label = we create the label 999
# or replace with a number or with mean ? 
# TODO: try out different values 

#transformed_data.WEALTH_RATING =  transformed_data.WEALTH_RATING.fillna(999)
#2. 
#imputer = KNNImputer(n_neighbors=2)
#transformed_data.WEALTH_RATING =  imputer.fit_transform(transformed_data.WEALTH_RATING)


# how to replace DONOR_AGE ? 
# replace with a number or with mean ? 
# as it now is binned data = label = we create the label 999
# first we try mean
# TODO: try out different values : median ? 

#transformed_data.DONOR_AGE =  transformed_data.DONOR_AGE.fillna((transformed_data.DONOR_AGE.mean()))
#transformed_data.DONOR_AGE_label = transformed_data.DONOR_AGE_label.cat.add_categories(999)
#transformed_data.DONOR_AGE_label =  transformed_data.DONOR_AGE_label.fillna(999)


# how to replace INCOME_GROUP ? 
# its from 1 to 7 , so missing data could be = 8
# as it now is binned data = label = we create the label 999

# or replace with a number or with mean ? 
# TODO: try out different values 
# first tried with 0 
# replace with 8
#transformed_data.INCOME_GROUP =  transformed_data.INCOME_GROUP.fillna(999)


# how to replace MONTHS_SINCE_LAST_PROM_RESP ? 
# values normally range from 0 to 36. 
# negative values make no sense here, a person can't respond in a negative amount of time. 
# we can replace with the maximum value (36), with another fixed number, or with the mean. 
# first we try the mean()
# TODO: try out different values :  median ? 

#transformed_data.MONTHS_SINCE_LAST_PROM_RESP =  transformed_data.MONTHS_SINCE_LAST_PROM_RESP.fillna(transformed_data.MONTHS_SINCE_LAST_PROM_RESP.mean())
#transformed_data.MONTHS_SINCE_LAST_PROM_RESP =  transformed_data.MONTHS_SINCE_LAST_PROM_RESP.fillna(999)



#X = [transformed_data['MONTHS_SINCE_LAST_PROM_RESP']]
#print(transformed_data.pscontinuous_features)
#for feature in pscontinuous_features:
#   X.append(transformed_data[feature])

from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)

# np.float was removed from recent NumPy releases; the builtin float selects the same columns
df_numeric = transformed_data.select_dtypes(include=[float]).values
#no_nan = imputer.fit_transform(df_numeric)

df_filled = pd.DataFrame(imputer.fit_transform(df_numeric))

transformed_data.isna().sum()
display(transformed_data.select_dtypes(include=[float])[30:50])


display(df_filled[30:50])
# no_nan is never assigned (see the commented line above), so report the length of df_filled
len(df_filled)

In [None]:
#display(df_filled[30:50])

In [None]:
plot = transformed_data['WEALTH_RATING'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('WEALTH_RATING classes counts')


In [None]:
#plot = transformed_data['DONOR_AGE_label'].value_counts().sort_index().plot(kind = 'barh')
#plot.set_title('DONOR_AGE_label classes counts')


In [None]:
plot = transformed_data['INCOME_GROUP'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('INCOME_GROUP classes counts')


In [None]:
plot = transformed_data['MONTHS_SINCE_LAST_PROM_RESP'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('MONTHS_SINCE_LAST_PROM_RESP classes counts')


In [None]:
plot = transformed_data['URBANICITY'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('URBANICITY classes counts')


In [None]:
plot = transformed_data['SES'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('SES classes counts')


### Replace implausible values (outliers) with the mode of the column 

- TODO : How did we recognize the outliers ? 
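
One common way to make the detection of outliers explicit is the interquartile-range (IQR) rule. It is shown here only as a possible answer to the TODO, not as the rule that was actually applied in this notebook; the helper name and the factor `k=1.5` are assumptions.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaNs are never flagged."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# toy data only; the real call would be e.g. iqr_outlier_mask(transformed_data["MOR_HIT_RATE"])
toy_rates = pd.Series([1, 2, 3, 4, 100])
outliers = iqr_outlier_mask(toy_rates)
print(toy_rates[outliers])
```

For strongly skewed features the IQR rule can flag many legitimate values, so the flagged rows should be inspected before anything is replaced.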

In [None]:
transformed_data.MOR_HIT_RATE.plot()

---
Result: MOR_HIT_RATE: number of responses to mailings from other organizations. The values seem too high in some cases; these will be replaced by the most frequent value (mode) of the column. 


In [None]:
transformed_data.MONTHS_SINCE_LAST_PROM_RESP.describe()

---
Result: MONTHS_SINCE_LAST_PROM_RESP: months since the last response. This value cannot be negative; negative values will be replaced by the most frequent value (mode) of the column. 


In [None]:
# MOR_HIT_RATE: number of responses to mailings from other organizations. Values above 100 seem too high
# and are replaced by the most frequent value (mode) of the column.
mor_mode = transformed_data.MOR_HIT_RATE.mode()[0]
transformed_data.MOR_HIT_RATE = transformed_data.MOR_HIT_RATE.mask(transformed_data.MOR_HIT_RATE > 100, mor_mode)

# MONTHS_SINCE_LAST_PROM_RESP: months since the last response cannot be negative; negative values are replaced by the mode.
resp_mode = transformed_data.MONTHS_SINCE_LAST_PROM_RESP.mode()[0]
transformed_data.MONTHS_SINCE_LAST_PROM_RESP = transformed_data.MONTHS_SINCE_LAST_PROM_RESP.mask(transformed_data.MONTHS_SINCE_LAST_PROM_RESP < 0, resp_mode)


### Encoding categorical features

In [None]:
### show Columns that are of the type "object" and need to be transformed to numerical values

obj_df = transformed_data.select_dtypes(include=['object']).copy()
obj_df.head()

In [None]:
lb_make = LabelEncoder()

# label-encode every remaining text column; "DONOR_AGE_label" is currently left out of the list
label_columns = ["URBANICITY", "SES", "CLUSTER_CODE", "HOME_OWNER", "DONOR_GENDER",
                 "OVERLAY_SOURCE", "RECENCY_STATUS_96NK", "DONATION_TYPE"]
for column in label_columns:
    transformed_data[column] = lb_make.fit_transform(transformed_data[column])


In [None]:
plot = transformed_data['URBANICITY'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('URBANICITY classes counts')

### TODO: Data Cleaning

1. **add rule based model**
2. **oneHot Encoding Feature for all categorical features ? (?)**


model training : 

3. **redo multiclass classification with onehot encoded data**

5. **Importance analysis of features = package ?**


---

- todo: (optional)
    * Outlier Detection with Standard Deviation
    * https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114
- todo: feature engineering (in parallel with model training)
    * You might also decide to eliminate instances/features with high percentages of missing values. 
    * Not all features are necessarily important for the analysis.
    * Depending on the analysis, some features might have to be excluded
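
For the importance analysis, one option is the impurity-based importance of a random forest. The sketch below uses synthetic data in which only the first two features carry signal; the feature names are placeholders, and impurity-based importances are known to favour features with many distinct values, so they should be read as a rough ranking only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data: with shuffle=False the informative features are the first two columns
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=2,
                                     n_redundant=0, shuffle=False, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# importances are normalised to sum to 1
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```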


In [None]:
x = dict(transformed_data.isna().sum())
{k: v for k, v in sorted(x.items(), key=lambda item: item[1])}
raw_data.MOR_HIT_RATE

In [None]:
transformed_data.loc[transformed_data['URBANICITY'] == 0,'URBANICITY'] = np.nan
transformed_data.loc[transformed_data['SES'] == 0,'SES'] = np.nan
transformed_data.loc[transformed_data['CLUSTER_CODE'] == 0,'CLUSTER_CODE'] = np.nan

In [None]:
impute_mode = ['URBANICITY', 'SES', 'CLUSTER_CODE', 'INCOME_GROUP', 'WEALTH_RATING']
impute_mean = ['MONTHS_SINCE_LAST_PROM_RESP']

In [None]:
for feature in impute_mode:
    print(transformed_data[feature].mode()[0])
    # assign the result back instead of using inplace=True on a column selection
    transformed_data[feature] = transformed_data[feature].fillna(transformed_data[feature].mode()[0])

for feature in impute_mean:
    print(transformed_data[feature].mean())
    transformed_data[feature] = transformed_data[feature].fillna(transformed_data[feature].mean())

In [None]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=20)
transformed_data = pd.DataFrame(imputer.fit_transform(transformed_data), columns=transformed_data.columns)

In [None]:
transformed_data.isna().sum()
display(transformed_data['URBANICITY'])

# if KNN
#display(df_filled[30:50])

In [None]:
#plot = df_filled['URBANICITY'].value_counts().sort_index().plot(kind = 'barh')
x = dict(transformed_data.isna().sum())
{k: v for k, v in sorted(x.items(), key=lambda item: item[1])}

In [None]:
transformed_data['MOR_HIT_RATE'].hist()

In [None]:
transformed_data.info()

### Binning

In [None]:
features_for_binning =  ["DONOR_AGE", "LIFETIME_CARD_PROM", "LIFETIME_GIFT_COUNT", "LIFETIME_PROM", "MONTHS_SINCE_ORIGIN", "MEDIAN_HOME_VALUE", "MEDIAN_HOUSEHOLD_INCOME", "MONTHS_SINCE_FIRST_GIFT", "PCT_ATTRIBUTE2", "PCT_ATTRIBUTE3", "PCT_ATTRIBUTE4", "PCT_OWNER_OCCUPIED", "PER_CAPITA_INCOME", "RECENT_RESPONSE_PROP"]
quantile_list = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]
quantile_labels = [1,2,3,4,5,6,7,8,9,10]

for feature in features_for_binning:
    print(feature)
    # use feature-specific quantiles so the reduced bin count does not carry over to the following features
    feature_quantiles, feature_labels = quantile_list, quantile_labels
    if feature=="MONTHS_SINCE_ORIGIN": # fewer bins for this feature because of high number of zeros
        feature_quantiles = np.linspace(0, 1.0, 7)
        feature_labels = np.arange(0,len(feature_quantiles)-1)

    #binned_dataframe[f'{feature}_range'] = pd.qcut(binned_dataframe[feature],q=quantile_list)
    transformed_data[f'{feature}_label'] = pd.qcut(transformed_data[feature],q=feature_quantiles,labels=feature_labels, duplicates='drop')

    
transformed_data = transformed_data.drop(columns = features_for_binning)    
display(transformed_data)

In [None]:
tf = transformed_data.select_dtypes(include=['float64'])
#tf.drop(columns=['RECENT_CARD_RESPONSE_PROP','RECENT_AVG_GIFT_AMT', 'RECENT_AVG_CARD_GIFT_AMT', 
#                 'LIFETIME_AVG_GIFT_AMT', 'LIFETIME_GIFT_RANGE', 'LIFETIME_MAX_GIFT_AMT', 'LIFETIME_MIN_GIFT_AMT', 'FILE_AVG_GIFT'])
transformed_data[tf.columns] = transformed_data[tf.columns].round(0).astype(int)

In [None]:
transformed_data.info()

from sklearn.preprocessing import OrdinalEncoder
from fancyimpute import KNN
#instantiate both packages to use
encoder = OrdinalEncoder()
imputer = KNN()
# create a list of categorical columns to iterate over
to_encode = ['WEALTH_RATING','DONOR_AGE_label','INCOME_GROUP', 'URBANICITY','SES' ]


def encode(data):
    '''Encode the non-null values of a column as ordinals; null values stay NaN.'''
    not_null = data.notnull().to_numpy()
    # reshape the non-null values into a single column for the encoder
    impute_reshape = data[not_null].to_numpy().reshape(-1,1)
    # encode data
    impute_ordinal = encoder.fit_transform(impute_reshape)
    # write the encoded values back into a float copy that keeps the original index
    tdata = pd.Series(np.nan, index=data.index, name=data.name, dtype=float)
    tdata.loc[not_null] = impute_ordinal.ravel()
    return tdata

# encode each categorical column and store the result back in the DataFrame
for column in to_encode:
    transformed_data[column] = encode(transformed_data[column])

plot = transformed_data['WEALTH_RATING'].value_counts().sort_index().plot(kind = 'barh')
plot.set_title('WEALTH_RATING classes counts')

## Balancing the Data: Resampling the dataset

### Random under-sampling

In [None]:
resampled_data = transformed_data.copy()
count_class_0, count_class_1 = resampled_data['TARGET_B'].value_counts()
print(count_class_0,count_class_1)

#Divide
td_class_0 = resampled_data[resampled_data['TARGET_B'] == 0]
td_class_1 = resampled_data[resampled_data['TARGET_B'] == 1]

td_class_0_under = td_class_0.sample(count_class_1)
td_under = pd.concat([td_class_0_under, td_class_1], axis=0)

print('Random under-sampling:')
print(td_under['TARGET_B'].value_counts())

td_under['TARGET_B'].value_counts().plot(kind='bar', title='Count (TARGET_B)');

### Random over-sampling

In [None]:
td_class_1_over = td_class_1.sample(count_class_0, replace=True)
td_over = pd.concat([td_class_0, td_class_1_over], axis=0)

print('Random over-sampling:')
print(td_over['TARGET_B'].value_counts())

td_over['TARGET_B'].value_counts().plot(kind='bar', title='Count (TARGET_B)');

### Python imbalanced learn module

#!pip3 install imblearn
import imblearn

def plot_2d_space(X, y, label='Classes'):   
    colors = ['#1F77B4', '#FF7F0E']
    markers = ['o', 's']
    for l, c, m in zip(np.unique(y), colors, markers):
        plt.scatter(
            X[y==l, 0],
            X[y==l, 1],
            c=c, label=l, marker=m
        )
    plt.title(label)
    plt.legend(loc='upper right')
    plt.show()

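from imblearn.under_sampling import RandomUnderSampler

# separate the features from the two target columns before resampling
X = resampled_data.drop(columns=['TARGET_B', 'DONATION_TYPE'])
y = resampled_data['TARGET_B']

# fit_resample replaces the removed fit_sample; the kept row positions are exposed as sample_indices_
rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_resample(X, y)

print('Kept indexes:', rus.sample_indices_)

# the scatter plot uses the first two feature columns
plot_2d_space(np.asarray(X_rus), np.asarray(y_rus), 'Random under-sampling')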

Resampling the training set brought the accuracy to 0.51.
Conclusion: no resampling of the training set.

Code:
count_class_0, count_class_1 = train_dataset['TARGET_B'].value_counts()
print(count_class_0,count_class_1)

#Divide
td_class_0 = train_dataset[train_dataset['TARGET_B'] == 0]
td_class_1 = train_dataset[train_dataset['TARGET_B'] == 1]

td_class_0_under = td_class_0.sample(count_class_1)
td_under = pd.concat([td_class_0_under, td_class_1], axis=0)

print('Random under-sampling:')
print(td_under['TARGET_B'].value_counts())

td_under['TARGET_B'].value_counts().plot(kind='bar', title='Count (TARGET_B)');

train_dataset = td_under




---

Class imbalance for DONATION_TYPE

- 

## Creating Training and Test data (Splitting)

### Split for normal classification task

To train and evaluate the models, we need to split the data into training and test sets.
As a basis for the split we can use any of the three base datasets created before:

- transformed_data (cleaned dataset but no balancing of classes = imbalanced class (donates / not donating))
- td_under (cleaned dataset with balancing of the class (donates / not donating) with random under sampling)
- td_over (cleaned dataset with balancing of the class (donates / not donating) with random over sampling)

To choose a dataset, uncomment the corresponding line in the following cell:

- transformed_data 
- td_under
- td_over

In [None]:
#transformed_data = transformed_data.copy()
#transformed_data = td_under.copy()
transformed_data = td_over.copy()



For the split we remove the two target columns ["TARGET_B", "DONATION_TYPE"] from the training and test features (X_train, X_test) and keep each of them as a separate target vector (y) for training and test.


In [None]:
train_dataset = transformed_data.sample(frac=0.8,random_state=87)
test_dataset = transformed_data.drop(train_dataset.index)

X_train = train_dataset.drop(columns = ["TARGET_B", "DONATION_TYPE"])
X_test = test_dataset.drop(columns = ["TARGET_B", "DONATION_TYPE"])

y_train_target_b = train_dataset.pop("TARGET_B")
y_test_target_b = test_dataset.pop('TARGET_B')

y_train_donation_type = train_dataset.pop("DONATION_TYPE")
y_test_donation_type = test_dataset.pop('DONATION_TYPE')
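
One caveat when the over-sampled data is split afterwards: copies of the same donor row can end up in both the training and the test part. A possible alternative, sketched below on a made-up toy frame (only the `TARGET_B` column name is taken from this notebook), is to split first with scikit-learn's `train_test_split` and to over-sample the training part only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the cleaned data: 20 rows, 4 of them donors
demo = pd.DataFrame({
    "feature": range(20),
    "TARGET_B": [1] * 4 + [0] * 16,
})

# 1) split first, stratified so that both parts keep the donor rate
train_df, test_df = train_test_split(
    demo, test_size=0.25, stratify=demo["TARGET_B"], random_state=87
)

# 2) over-sample the minority class inside the training part only
majority = train_df[train_df["TARGET_B"] == 0]
minority = train_df[train_df["TARGET_B"] == 1]
minority_over = minority.sample(len(majority), replace=True, random_state=87)
train_balanced = pd.concat([majority, minority_over], axis=0)

print(train_balanced["TARGET_B"].value_counts())
print(test_df["TARGET_B"].value_counts())
```

The test part keeps the original class distribution, so the reported metrics reflect the imbalance the model would face in practice.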



### Split for classification task for the specific SES classes 

We also want to train and evaluate models for the individual socioeconomic (SES) classes. 
To do so, we create one subset per SES class and split each subset into its own training and test datasets. 

In [None]:
SES_1 = transformed_data[transformed_data.SES == 0]
SES_2 = transformed_data[transformed_data.SES == 1]
SES_3 = transformed_data[transformed_data.SES == 2]
SES_4 = transformed_data[transformed_data.SES == 3]
SES_nan = transformed_data[transformed_data.SES == 4]

def split_sets(df):

    train_dataset = df.sample(frac=0.8,random_state=0)
    test_dataset = df.drop(train_dataset.index)

    X_train = train_dataset.drop(columns = ["TARGET_B", "DONATION_TYPE"])
    X_test = test_dataset.drop(columns = ["TARGET_B", "DONATION_TYPE"])

    y_train_target_b = train_dataset.pop("TARGET_B")
    y_test_target_b = test_dataset.pop('TARGET_B')

    #y_train_donation_type = train_dataset["DONATION_TYPE"].values
    #y_test_donation_type = test_dataset["DONATION_TYPE"].values

    y_train_donation_type = train_dataset.pop("DONATION_TYPE")
    y_test_donation_type = test_dataset.pop('DONATION_TYPE')
    
    return X_train, X_test, y_train_target_b, y_test_target_b, y_train_donation_type, y_test_donation_type


X_train_SES_1, X_test_SES_1, y_train_target_b_SES_1, y_test_target_b_SES_1, y_train_donation_type_SES_1, y_test_donation_type_SES_1 = split_sets(SES_1)
X_train_SES_2, X_test_SES_2, y_train_target_b_SES_2, y_test_target_b_SES_2, y_train_donation_type_SES_2, y_test_donation_type_SES_2 = split_sets(SES_2)
X_train_SES_3, X_test_SES_3, y_train_target_b_SES_3, y_test_target_b_SES_3, y_train_donation_type_SES_3, y_test_donation_type_SES_3 = split_sets(SES_3)
X_train_SES_4, X_test_SES_4, y_train_target_b_SES_4, y_test_target_b_SES_4, y_train_donation_type_SES_4, y_test_donation_type_SES_4 = split_sets(SES_4)
X_train_SES_nan, X_test_SES_nan, y_train_target_b_SES_nan, y_test_target_b_SES_nan, y_train_donation_type_SES_nan, y_test_donation_type_SES_nan = split_sets(SES_nan)




## Training and Evaluation of the Classifiers

In the following we introduce five classifiers to train models for the three given classification tasks:

- Predicting Donation (binary classification task);
- Predicting Donation TYPE (multiclass classification); and
- Train specialized models for SES (socioeconomic classification).

We choose the following classifiers for each category: 

- Tree models: RandomForestClassifier
- Distance-based models: KNeighborsClassifier
- Linear models: Support Vector Machine
- Probabilistic models: Gaussian Naive Bayes
- Rule models: todo

For evaluation we use 5-fold cross-validation.
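
How a 5-fold split partitions the samples can be sketched in plain Python. This is an illustration of the idea only, not scikit-learn's `KFold` implementation; the helper name `k_fold_indices` and the sample count are made up.

```python
import random

def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples) into n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # fold i takes every n_splits-th shuffled index, so fold sizes differ by at most one
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=23, n_splits=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Every sample is used for validation exactly once and for training in the remaining four rounds.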

TODO: 

1. describe different measures 
2. explain results
3. check not only overall accuracy but also precision and recall for each class !!
4. Describe the parameters used for each classifier and how their choice impacted or not the results.
5. Choose the best classifier and justify your choice.
6. Discuss critically your choices and the results!



--------

The **Metrics** we are evaluating: 

Precision: What proportion of positive identifications was actually correct?

Recall: What proportion of actual positives was identified correctly?

f1-score: The harmonic mean of precision and recall.

Accuracy: The fraction of all predictions the model got right. 

To fully evaluate the effectiveness of a model, we must examine both precision and recall. Unfortunately, precision and recall are often in tension. That is, improving precision typically reduces recall and vice versa. So the f1-score as the harmonic mean of precision and recall is a good metric to look at. 
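
The four metrics can be computed by hand from the counts of true/false positives and negatives. The sketch below uses made-up labels (1 = donates, 0 = does not donate) and a hypothetical helper `binary_metrics`; it is meant to show why accuracy alone is misleading on imbalanced data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, F1 and accuracy with respect to one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy labels (made up): 3 donors among 10 people
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here the accuracy is 0.7 although only one of the three donors is found (recall of about 0.33), which the F1-score of 0.4 makes visible.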





In [None]:

def run_exps(X_train: pd.DataFrame, y_train: pd.Series, X_test: pd.DataFrame, y_test: pd.Series, target_names, models) -> pd.DataFrame:
    '''
    Lightweight script to test many models and find winners
    :param X_train: training split
    :param y_train: training target vector
    :param X_test: test split
    :param y_test: test target vector
    :param target_names: display names of the classes, in sorted label order
    :param models: list of (name, estimator) tuples
    :return: DataFrame of cross-validation results, one row per fold and model
    '''
    
    dfs = []
    scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
    if len(target_names) == 2:
        # the plain precision/recall/f1 scorers are only defined for binary targets
        scoring += ['precision', 'recall', 'f1']
    
    for name, model in models:
        kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=90210)
        cv_results = model_selection.cross_validate(model, X_train, y_train, cv=kfold, scoring=scoring)
        clf = model.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(name)
        #print(y_pred)
        print(classification_report(y_test, y_pred, target_names=target_names, zero_division = 0))
        # Generate confusion matrix
        #matrix = plot_confusion_matrix(model, X_test, y_test)#, normalize='true')
        #matrix.plot()
        #plt.rcParams["axes.grid"] = False
        this_df = pd.DataFrame(cv_results)
        this_df['model'] = name
        dfs.append(this_df)
    # combine the per-model results once all models have been evaluated
    return pd.concat(dfs, ignore_index=True)

### Hyperparameter tuning for all binary classification models  

Untuned models are likely to perform poorly, so the hyperparameters have to be chosen to fit the dataset used for training. 

In the following, each of the tunable models (Gaussian NB is not tuned here) is trained with different hyperparameter combinations in order to find the combination that maximizes the chosen metric. 

We look at recall (the proportion of actual members of a class that the model identified), precision (the proportion of predictions for a class that truly belong to that class) and overall model accuracy. The Random Forest and kNN grid searches below optimize the F1-score (`scoring="f1"`), which combines precision and recall. 

After finding the best parameters for each model, we use the models with these parameters to create a model comparison. 
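
To track several metrics during one search, `GridSearchCV` accepts a list of scorer names together with `refit`, which names the metric used to pick the final model. The following is a minimal sketch on synthetic data from `make_classification` (not the donor dataset); the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the donor data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 11]},
    scoring=["recall", "precision", "accuracy", "f1"],  # all four are recorded
    refit="f1",                                         # the best model is chosen by F1
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
for metric in ["recall", "precision", "accuracy", "f1"]:
    print(metric, search.cv_results_[f"mean_test_{metric}"].round(3))
```

With this setup the trade-off between precision and recall is visible for every parameter combination, not only for the winning one.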

#### Random Forest: 

In [None]:
# takes 7 min 

rand_forest = RandomForestClassifier(n_jobs=-1)
rand_forest = rand_forest.fit(X_train, y_train_target_b)
y_pred = rand_forest.predict(X_test)
print(classification_report(y_test_target_b, y_pred))

params = {'bootstrap': [True, False],
          'max_depth': [40, 50, 60],
          'max_features': ['sqrt'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
          'min_samples_leaf': [1, 2, 4],
          'min_samples_split': [2, 5],
          'n_estimators': [400, 600, 800]}

#scoring = ['recall' , 'precision', 'accuracy']

grid_search_cv = GridSearchCV(
    rand_forest,
    params, 
    verbose=1, 
    cv=5,
    n_jobs= -1, 
    scoring="f1" 
    #refit='recall'
)


grid_search_cv.fit(X_train, y_train_target_b)

print("Best params for Rand Forest : " +  str(grid_search_cv.best_params_))

print("Best estimator for Rand Forest : " + str(grid_search_cv.best_estimator_))

print("Best score for Rand Forest : " + str(grid_search_cv.best_score_))

In [None]:
# f1_weighted

As the hyperparameter tuning showed, we use the following parameters: 


In [None]:
rand_forest = RandomForestClassifier(bootstrap=False, max_depth=60, n_estimators=800,n_jobs=-1)
rand_forest = rand_forest.fit(X_train, y_train_target_b)
y_pred = rand_forest.predict(X_test)
print(classification_report(y_test_target_b, y_pred))


#### kNN 

In [None]:
knn = KNeighborsClassifier()
knn = knn.fit(X_train, y_train_target_b)
y_pred = knn.predict(X_test)
print(classification_report(y_test_target_b, y_pred))

params = {'n_neighbors':[50,60,70],
              'leaf_size':[1,3,5],
              'algorithm':['auto', 'kd_tree'],
              'n_jobs':[-1]}


grid_search_cv = GridSearchCV(
    knn,
    params, 
    verbose=1, 
    cv=3,
    n_jobs= -1, 
    scoring="f1"
)


grid_search_cv.fit(X_train, y_train_target_b)

print("Best params for knn : " +  str(grid_search_cv.best_params_))

print("Best estimator for knn : " + str(grid_search_cv.best_estimator_))

print("Best score for knn : " + str(grid_search_cv.best_score_))

As the hyperparameter tuning showed, we use the following parameters: 



In [None]:
knn = KNeighborsClassifier(leaf_size=3, n_jobs=-1, n_neighbors=70)
knn = knn.fit(X_train, y_train_target_b)
y_pred = knn.predict(X_test)
print(classification_report(y_test_target_b, y_pred))


#### SVM 

In [None]:
# training an SVM classifier with default parameters (RBF kernel) 

svm_model = SVC().fit(X_train, y_train_target_b) 
svm_predictions = svm_model.predict(X_test) 
print(classification_report(y_test_target_b, svm_predictions))

# Tuning the SVM with a grid search 

# defining parameter range 
param_grid = {'C': [0.001, 0.01, 0.1],  
              'gamma': [1, 0.1, 0.01, 0.001], 
              'kernel': ['linear', 'rbf']}  
  
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3, n_jobs = -1) 
  
# fitting the model for grid search 
grid.fit(X_train, y_train_target_b)


print("Best params for SVM : " +  str(grid.best_params_))

print("Best estimator for SVM : " + str(grid.best_estimator_))

print("Best score for SVM : " + str(grid.best_score_))


As the hyperparameter tuning showed, we use the following parameters: 



In [None]:

svc = SVC(C=0.1, gamma=0.01)
svc = svc.fit(X_train, y_train_target_b)
y_pred = svc.predict(X_test)
print(classification_report(y_test_target_b, y_pred))


#### Feature significance based on the Random Forest Classifier

In [None]:
importances = rand_forest.feature_importances_
# spread of each feature's importance across the individual trees of the forest
std = np.std([tree.feature_importances_ for tree in rand_forest.estimators_],
             axis=0)

indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X_train.shape[1]):
    print("%d. feature %d : %s (%f)" % (f + 1, indices[f] ,X_train.columns[indices[f]] ,importances[indices[f]]))

# Plot the impurity-based feature importances of the forest
plt.figure( figsize=(20,5))
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), indices)
plt.xlim([-1, X_train.shape[1]])
plt.show()



Result: The ranking shows that some features are more relevant for the model than others. 
As a next step, the less relevant features could be dropped to speed up model training. 
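
Dropping the less relevant features could look like the sketch below. It runs on synthetic data, and the cut-off (keep features whose importance is above the mean importance) is an arbitrary assumption, not a choice made in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 10 features, only 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)
importances = forest.feature_importances_

# keep only the features whose importance is above the mean importance
keep = importances > importances.mean()
X_reduced = X_demo[:, keep]
print(X_demo.shape, "->", X_reduced.shape)
```

Whether the reduced feature set costs predictive performance would still have to be checked with the same cross-validation as before.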

### Hyperparameter tuning for multiclass classification

#### Random Forest: 

In [None]:
# takes 7 min 

rand_forest = RandomForestClassifier(n_jobs=-1)
rand_forest = rand_forest.fit(X_train, y_train_donation_type)
y_pred = rand_forest.predict(X_test)
print(classification_report(y_test_donation_type, y_pred))

params = {'bootstrap': [True, False],
          'max_depth': [10, 20, 30],
          'max_features': ['sqrt'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
          'min_samples_leaf': [1, 2, 4],
          'min_samples_split': [2, 5],
          'n_estimators': [200, 400, 600]}


grid_search_cv = GridSearchCV(
    rand_forest,
    params, 
    verbose=1, 
    cv=3,
    n_jobs= -1
)


grid_search_cv.fit(X_train, y_train_donation_type)

print("Best params for Rand Forest : " +  str(grid_search_cv.best_params_))

print("Best estimator for Rand Forest : " + str(grid_search_cv.best_estimator_))

print("Best score for Rand Forest : " + str(grid_search_cv.best_score_))

In [None]:
rand_forest = RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1)
rand_forest = rand_forest.fit(X_train, y_train_donation_type)
y_pred = rand_forest.predict(X_test)
print(classification_report(y_test_donation_type, y_pred))

#### kNN 

In [None]:
knn = KNeighborsClassifier()


knn = knn.fit(X_train, y_train_donation_type)
y_pred = knn.predict(X_test)
print(classification_report(y_test_donation_type, y_pred))

params = {'n_neighbors':[50,60,70],
              'leaf_size':[1,3,5],
              'algorithm':['auto', 'kd_tree'],
              'n_jobs':[-1]}


grid_search_cv = GridSearchCV(
    knn,
    params, 
    verbose=1, 
    cv=3,
    n_jobs= -1
)


grid_search_cv.fit(X_train, y_train_donation_type)

print("Best params for knn : " +  str(grid_search_cv.best_params_))

print("Best estimator for knn : " + str(grid_search_cv.best_estimator_))

print("Best score for knn : " + str(grid_search_cv.best_score_))

In [None]:
knn = KNeighborsClassifier(leaf_size=5, n_jobs=-1, n_neighbors=60)
knn = knn.fit(X_train, y_train_donation_type)
y_pred = knn.predict(X_test)
print(classification_report(y_test_donation_type, y_pred))


#### SVM 

In [None]:
# training an SVM classifier with default parameters (RBF kernel) 

svm_model = SVC().fit(X_train, y_train_donation_type) 
svm_predictions = svm_model.predict(X_test) 
print(classification_report(y_test_donation_type, svm_predictions))

# Tuning the SVM with a grid search 

# defining parameter range 
param_grid = {'C': [0.001, 0.01, 0.1],  
              'gamma': [1, 0.1, 0.01, 0.001], 
              'kernel': ['linear', 'rbf']}  
  
grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3, n_jobs = -1) 
  
# fitting the model for grid search 
grid.fit(X_train, y_train_donation_type)


print("Best params for SVM : " +  str(grid.best_params_))

print("Best estimator for SVM : " + str(grid.best_estimator_))

print("Best score for SVM : " + str(grid.best_score_))


In [None]:

svc = SVC()
svc = svc.fit(X_train, y_train_donation_type)
y_pred = svc.predict(X_test)
print(classification_report(y_test_donation_type, y_pred))


Best Classifiers: 

## Binary Classification 

### Without Sampling
- RF : RandomForestClassifier(max_depth=30, min_samples_leaf=4, n_estimators=400,n_jobs=-1)
- kNN : KNeighborsClassifier(leaf_size=1, n_jobs=-1, n_neighbors=60)
- SVM : 

### Oversampling 
- RF : RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1)
- kNN : KNeighborsClassifier(leaf_size=3, n_jobs=-1, n_neighbors=70)
- SVM : SVC(C=0.1, gamma=0.01)

### Undersampling 
- RF : 
- kNN : 
- SVM : 


## Multiclass Classification 

### Without Sampling
- RF : 
- kNN : 
- SVM : 
### Oversampling 
- RF : RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1)
- kNN : KNeighborsClassifier(leaf_size=5, n_jobs=-1, n_neighbors=60)
- SVM : 
### Undersampling 
- RF : 
- kNN : 
- SVM : 


### Train Models for binary Classification

In [None]:
final = run_exps(
    X_train, 
    y_train_target_b, 
    X_test, 
    y_test_target_b, 
    ['wont donate', 'donates'],
    [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1)),
     ('Distance Based Model: KNN', KNeighborsClassifier(leaf_size=3, n_jobs=-1, n_neighbors=70)),
     ('Probabilistic Model: GNB', GaussianNB()),
     ('Linear Model: SVM', SVC(C=0.1, gamma=0.01))
        # Rule Based model: 
    ])

In [None]:
bootstraps = []

for model in list(set(final.model.values)):
    model_df = final.loc[final.model == model]
    bootstrap = model_df.sample(n=30, replace=True)
    bootstraps.append(bootstrap)
        
bootstrap_df = pd.concat(bootstraps, ignore_index=True)
results_long = pd.melt(bootstrap_df,id_vars=['model'],var_name='metrics', value_name='values')

time_metrics = ['fit_time','score_time'] # fit time metrics

## PERFORMANCE METRICS
results_long_nofit = results_long.loc[~results_long['metrics'].isin(time_metrics)] # get df without fit data
results_long_nofit = results_long_nofit.sort_values(by='values')

## TIME METRICS
results_long_fit = results_long.loc[results_long['metrics'].isin(time_metrics)] # df with fit data
results_long_fit = results_long_fit.sort_values(by='values')

In [None]:
plt.figure(figsize=(10, 7))
sns.set(font_scale=1)
g = sns.boxplot(x="model", y="values", hue="metrics", data=results_long_nofit, palette="Set3")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Comparison of Model by Classification Metric')
plt.savefig('./benchmark_models_performance.png',dpi=300)

### Train Models for multiclass classification

After finding the best parameters for each model, we use the models with these parameters to create a model comparison. 

In [None]:
final = run_exps(
    X_train, 
    y_train_donation_type, 
    X_test, 
    y_test_donation_type, 
    ['wont donate', 'A', 'B', 'C', 'D', 'E'],
    [
        ('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1)),
        ('Distance Based Model: KNN', KNeighborsClassifier(leaf_size=5, n_jobs=-1, n_neighbors=60)),
        ('Probabilistic Model: GNB', GaussianNB()),
        ('Linear Model: SVM', SVC())
        # Rule Based model: 
    ])

In [None]:
bootstraps = []

for model in list(set(final.model.values)):
    model_df = final.loc[final.model == model]
    bootstrap = model_df.sample(n=30, replace=True)
    bootstraps.append(bootstrap)
        
bootstrap_df = pd.concat(bootstraps, ignore_index=True)
results_long = pd.melt(bootstrap_df,id_vars=['model'],var_name='metrics', value_name='values')

time_metrics = ['fit_time','score_time'] # fit time metrics

## PERFORMANCE METRICS
results_long_nofit = results_long.loc[~results_long['metrics'].isin(time_metrics)] # get df without fit data
results_long_nofit = results_long_nofit.sort_values(by='values')

## TIME METRICS
results_long_fit = results_long.loc[results_long['metrics'].isin(time_metrics)] # df with fit data
results_long_fit = results_long_fit.sort_values(by='values')

In [None]:

plt.figure(figsize=(10, 7))
sns.set(font_scale=1)
g = sns.boxplot(x="model", y="values", hue="metrics", data=results_long_nofit, palette="Set3")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Comparison of Model by Classification Metric')
plt.savefig('./benchmark_models_performance_multiclass.png',dpi=300)  # separate file so the binary plot is not overwritten

### Binary classifier for SES - classes

Train models for each SES class

---
TODO: we only have to train the best classifier (just ONE) from the previous experiments to predict donation and donation type for each SES class = 2 models for 5 SES classes = 10 experiments 

---

In [None]:
final = run_exps(X_train_SES_1, 
                 y_train_target_b_SES_1, 
                 X_test_SES_1 , 
                 y_test_target_b_SES_1, 
                 ['wont donate', 'donates'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1))])

final = run_exps(X_train_SES_2, 
                 y_train_target_b_SES_2, 
                 X_test_SES_2, 
                 y_test_target_b_SES_2, 
                 ['wont donate', 'donates'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1))])

final = run_exps(X_train_SES_3, 
                 y_train_target_b_SES_3, 
                 X_test_SES_3 , 
                 y_test_target_b_SES_3, 
                 ['wont donate', 'donates'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1))])

final = run_exps(X_train_SES_4, 
                 y_train_target_b_SES_4, 
                 X_test_SES_4 , 
                 y_test_target_b_SES_4, 
                 ['wont donate', 'donates'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1))])

final = run_exps(X_train_SES_nan, 
                 y_train_target_b_SES_nan, 
                 X_test_SES_nan, 
                 y_test_target_b_SES_nan, 
                 ['wont donate', 'donates'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=50, n_estimators=600,n_jobs=-1))])

### Multiclass classifier for SES - classes

In [None]:
# the multiclass models are trained on the DONATION_TYPE targets of each SES subset
final = run_exps(X_train_SES_1, 
                 y_train_donation_type_SES_1, 
                 X_test_SES_1 , 
                 y_test_donation_type_SES_1, 
                 ['wont donate', 'A', 'B', 'C', 'D', 'E'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1))])

final = run_exps(X_train_SES_2, 
                 y_train_donation_type_SES_2, 
                 X_test_SES_2, 
                 y_test_donation_type_SES_2, 
                 ['wont donate', 'A', 'B', 'C', 'D', 'E'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1))])

final = run_exps(X_train_SES_3, 
                 y_train_donation_type_SES_3, 
                 X_test_SES_3 , 
                 y_test_donation_type_SES_3, 
                 ['wont donate', 'A', 'B', 'C', 'D', 'E'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1))])

final = run_exps(X_train_SES_4, 
                 y_train_donation_type_SES_4, 
                 X_test_SES_4 , 
                 y_test_donation_type_SES_4, 
                 ['wont donate', 'A', 'B', 'C', 'D', 'E'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1))])

final = run_exps(X_train_SES_nan, 
                 y_train_donation_type_SES_nan, 
                 X_test_SES_nan, 
                 y_test_donation_type_SES_nan, 
                 ['wont donate', 'A', 'B', 'C', 'D', 'E'],
                 [('Tree based Model: RF', RandomForestClassifier(bootstrap=False, max_depth=30, n_estimators=600,
                       n_jobs=-1))])

Result: 

# todo


## Logical Models: Rule models

In [None]:
# after complete processing, all columns should be used
interesting_features = ['TARGET_B', 'HOME_OWNER', 'DONOR_GENDER', 'DONATION_TYPE'] 

# data used for Rule Mining:
RM_transformed_data = transformed_data[interesting_features]
RM_transformed_data

In [None]:
# sources:
# https://scikit-learn.org/stable/modules/preprocessing.html  -   6.3.4. Encoding categorical features
# TP05: 2.2

# def ohe_encode moved to 3.1

cols_not_binary = ['HOME_OWNER', 'DONOR_GENDER', 'DONATION_TYPE']
RM_binary_data = ohe_encode(cols_not_binary)
RM_binary_data

Todo: 
- Binning for all interesting features (first preprocessing section)
- put the OneHotEncoding in first preprocessing section?

## ????? Finding Associations

In [None]:

# calculate frequent patterns:
freq_patterns = apriori(RM_binary_data, min_support=0.05, use_colnames=True) # maybe choose different min_support..
#freq_patterns['size'] = freq_patterns['itemsets'].apply(lambda x: len(x))
#freq_patterns = freq_patterns[freq_patterns['size']>1]
#freq_patterns

In [None]:

# careful here, if we change '?' again maybe...
def check_if_interesting(x):
    # Interesting (maybe different criteria..):
    # - consequents==(TARGET_B or DONATION_TYPE)
    # - TARGET_B, B, D, E not in antecedents
    targets = ['TARGET_B', 'DONATION_TYPE:A', 'DONATION_TYPE:B', 'DONATION_TYPE:C', 
               'DONATION_TYPE:D', 'DONATION_TYPE:E', 'DONATION_TYPE:?']
    # access the row by column name instead of by position
    target_is_only_consequent = any(item in targets for item in x['consequents']) and len(x['consequents'])==1
    target_not_in_antecedents = not any(item in targets for item in x['antecedents'])
    return target_is_only_consequent and target_not_in_antecedents

# generate association rules:
as_rules = association_rules(freq_patterns, metric="confidence", min_threshold=0.2)
# filter out uninteresting rules:
as_rules['interesting?'] = as_rules[['antecedents', 'consequents']].apply(check_if_interesting, axis=1)
as_rules = as_rules[as_rules['interesting?']]
as_rules

## Classification - Results and Discussion 

Based on the f1-Score....



# Task 2 (Unsupervised Learning) - Characterizing Donors and Donation Type

In this task you should **use unsupervised learning algorithms and try to characterize donors (people who really did a donation) and their donation type**. You can use:
* **Association rule mining** to find **associations between the features and the target Donation/DonationTYPE**.
* **Clustering algorithms to find similar groups of donors**. Is it possible to find groups of donors with the same/similar DonationTYPE?
* **Be creative and define your own unsupervised analysis!** What would it be interesting to find out ?

## Preprocessing Data for Association Rule Mining

...

In [None]:
#transformed_data

In [None]:
# after complete processing, all columns should be used
interesting_features = ['TARGET_B', 'HOME_OWNER', 'DONOR_GENDER', 'DONATION_TYPE'] 

# data used for Rule Mining:
RM_transformed_data = transformed_data[interesting_features]
RM_transformed_data

In [None]:
# sources:
# https://scikit-learn.org/stable/modules/preprocessing.html  -   6.3.4. Encoding categorical features
# TP05: 2.2

from sklearn.preprocessing import OneHotEncoder

def ohe_encode(col_names, X=None):
    '''Takes columns to encode (format: ['column1', 'column2', ...]) and DataFrame X with all columns. 
    Encodes features and replaces old columns in X. Returns updated X'''
    if X is None:
        # look up the default at call time instead of at definition time
        X = RM_transformed_data
    enc = OneHotEncoder()
    matrix = enc.fit_transform(X[col_names].to_numpy()).toarray()

    # name each new column '<feature>:<category>'
    new_columns = [f'{col}:{cat}' for col, cats in zip(col_names, enc.categories_) for cat in cats]

    # reuse the index of X so that the rows line up when concatenating
    new_df = pd.DataFrame(matrix, columns=new_columns, index=X.index)

    updated_df = pd.concat([X.drop(col_names, axis=1), new_df], axis=1)
    
    return updated_df

cols_not_binary = ['HOME_OWNER', 'DONOR_GENDER', 'DONATION_TYPE']
RM_binary_data = ohe_encode(cols_not_binary)
RM_binary_data

Todo: 
- Binning for all interesting features (first preprocessing section)
- put the OneHotEncoding in first preprocessing section?

## Finding Associations

In [None]:

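from mlxtend.frequent_patterns import apriori, association_rules

# calculate frequent patterns (apriori expects boolean columns):
freq_patterns = apriori(RM_binary_data.astype(bool), min_support=0.05, use_colnames=True) # maybe choose different min_support..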
#freq_patterns['size'] = freq_patterns['itemsets'].apply(lambda x: len(x))
#freq_patterns = freq_patterns[freq_patterns['size']>1]
#freq_patterns

In [None]:

# careful here, if we change '?' again maybe...
def check_if_interesting(x):
    # Interesting (maybe different criteria..):
    # - consequents==(TARGET_B or DONATION_TYPE)
    # - TARGET_B, B, D, E not in antecedents
    targets = ['TARGET_B', 'DONATION_TYPE:A', 'DONATION_TYPE:B', 'DONATION_TYPE:C', 
               'DONATION_TYPE:D', 'DONATION_TYPE:E', 'DONATION_TYPE:?']
    # access the row by column name instead of by position
    target_is_only_consequent = any(item in targets for item in x['consequents']) and len(x['consequents'])==1
    target_not_in_antecedents = not any(item in targets for item in x['antecedents'])
    return target_is_only_consequent and target_not_in_antecedents

# generate association rules:
as_rules = association_rules(freq_patterns, metric="confidence", min_threshold=0.2)
# filter out uninteresting rules:
as_rules['interesting?'] = as_rules[['antecedents', 'consequents']].apply(check_if_interesting, axis=1)
as_rules = as_rules[as_rules['interesting?']]
as_rules

TODO:
- Check other interesting criteria for rules
- evaluate which metric is the most reasonable (confidence, lift, etc.)
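
To compare the candidate metrics, it can help to compute them by hand for one rule. The sketch below uses made-up transactions; the item names follow the `feature:category` pattern of the one-hot columns, but the values are hypothetical and `rule_metrics` is an illustrative helper, not part of mlxtend.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ac / n                             # P(A and C)
    confidence = n_ac / n_a if n_a else 0.0        # P(C | A)
    lift = confidence / (n_c / n) if n_c else 0.0  # P(C | A) / P(C)
    return support, confidence, lift

# Made-up transactions: each set holds the items that are "on" for one person
transactions = [
    {"HOME_OWNER:H", "TARGET_B"},
    {"HOME_OWNER:H", "TARGET_B"},
    {"HOME_OWNER:H"},
    {"HOME_OWNER:H"},
    {"HOME_OWNER:U", "TARGET_B"},
    {"HOME_OWNER:U"},
    {"HOME_OWNER:U"},
    {"HOME_OWNER:U"},
]

support, confidence, lift = rule_metrics(transactions, {"HOME_OWNER:H"}, {"TARGET_B"})
print(support, confidence, round(lift, 3))
```

In this toy example the confidence is only 0.5, but the lift of about 1.33 shows that the antecedent still raises the donation rate above its base rate of 0.375. With a rare target such as a donation, lift may therefore be more informative than confidence alone.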

## Association Rules - Results and Discussion 

...

## Preprocessing Data for Clustering

...

## Finding Groups

...

## Clustering - Results and Discussion 

...

# Final Comments and Conclusions

...