---

<center> <h1> <span style='color:#C7B097'> AltaML Hackathon 2022 </span> </h1> </center>
<center> <h2> <span style='color:#98AFC7'> Charity Classification </span> </h2> </center>
<center> <img src="https://altaml.com/media/ul4ibm45/logo-for-dark.png?mode=pad&width=200&height=70&format=webp&quality=100" alt="altaml" style="width:200px;"> </center>
<center> <img src="https://img-prod-cms-rt-microsoft-com.akamaized.net/cms/api/am/imageFileData/RE1Mu3b?ver=5c31" alt="microsoft" style="width:200px;"> </center>

---

#
<center> <h2> <span style='color:#C7B097'> Table of Contents </span> </h2> </center>

* [1 - Introduction](#1---introduction)
  * [1.1 - Import libraries](#11---import-libraries)
    * [1.1.1 - Configurations](#111---configurations)
* [2 - Data](#2---data)
  * [2.1 - Charity Navigator Dataset](#21---charity-navigator-dataset)
    * [2.1.1 - Data Preparation](#211---data-preparation)
    * [2.1.2 - Data Wrangling](#212---data-wrangling)
    * [2.1.3 - TF-IDF with n-grams](#213---tf-idf-with-n-grams)
    * [2.1.4 - Feature Selection](#214---feature-selection)
    * [2.1.5 - Oversampling and Undersampling for imbalanced data](#215---oversampling-and-undersampling-for-inbalanced-data)
* [3 - Classification](#3---classification)
  * [3.1 - Logistic Regression](#31---logistic-regression)
  * [3.2 - Gradient Boosting (XGBoost)](#32---gradient-boosting-xgboost)
    * [3.2.1 - Feature Importance (First 10 features)](#321---feature-importance-first-10-features)
* [4 - Final Results](#4---final-results)
* [5 - Test loading the pickle file](#5-test-loading-the-pickle-file)

# 1 - Introduction

#### Description: 

This project aims to classify charities into 10 categories. The dataset has information on 8400 different US charities as rated by CharityNavigator.org.

Possible categories for each charity:

1. Animals
2. Arts, Culture, Humanities
3. Community Development
4. Education
5. Environment
6. Health
7. Human Services
8.  Human and Civil Rights
9.  Religion
10. Research and Public Policy


#### Dataset:
Link : 
<a href="https://www.kaggle.com/datasets/katyjqian/charity-navigator-scores-expenses-dataset?resource=download"> Charity Navigator Dataset</a>

Charity Navigator collects and publishes this data as a public service, although the data itself is likely owned by the individual charities. It was web-scraped in May 2019 and uses rating details mostly from 2017.

## 1.1 - Import libraries

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import warnings
from IPython.core.interactiveshell import InteractiveShell
import os

# NLP
import nltk.corpus
nltk.download("stopwords")
from nltk.corpus import stopwords

# Visualization
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
import seaborn as sns
import plotly.express as px


# Features Extraction
from sklearn.feature_extraction.text import TfidfVectorizer

# Feature Selection
from sklearn.feature_selection import SelectKBest, chi2


# Machine Learning
from sklearn.linear_model import LogisticRegression # Logistic Regression
from sklearn.tree import DecisionTreeClassifier, plot_tree # Decision Tree
from sklearn.ensemble import RandomForestClassifier # Random Forest
from xgboost import XGBClassifier, plot_importance # Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier # Gradient Boosting
from sklearn.neural_network import MLPClassifier # Neural Networks
from sklearn.neighbors import KNeighborsClassifier # KNeighbors
from sklearn.svm import SVC # Support Vector Machine 
from sklearn.model_selection import train_test_split # Split the data
import pickle # Format to save the ML model


# Metrics
from sklearn.metrics import (f1_score, accuracy_score, recall_score, 
precision_score, classification_report)
from sklearn.metrics import make_scorer 

# Imbalance treatment
from imblearn.over_sampling import SMOTE, RandomOverSampler
# from imblearn.under_sampling import RandomUnderSampler

# Utility functions
import utils


### 1.1.1 - Configurations

In [None]:
# Show multiple outputs at the same cell
InteractiveShell.ast_node_interactivity = "all"
warnings.filterwarnings("ignore") # Ignore warnings
plt.style.use('fivethirtyeight')
rcParams['figure.figsize'] = 16, 6 # Set the standard plot size
%matplotlib inline

In [None]:
accuracy = make_scorer(accuracy_score)
recall = make_scorer(recall_score, average='macro')
precision = make_scorer(precision_score, average='macro')
f1 = make_scorer(f1_score, average='macro')
os.makedirs('outputs', exist_ok=True)
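The recall, precision, and F1 scorers above use `average='macro'`, which weights every class equally regardless of its size. The following pure-Python sketch (not part of the original pipeline; the function names are illustrative) shows why that matters on imbalanced labels: a classifier that always predicts the majority class keeps a high accuracy but a low macro F1.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels: 8 samples of class 0 and 2 of class 1; the model always predicts 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

This is one reason to compare both `accuracy` and `f1` as the optimization metric when the categories are unevenly represented.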

In [None]:
smote = True
metric = accuracy #f1 # Metric used for optimization
metric_name = 'Accuracy' #'F1'
search_params = True # Search for the best parameters

# 2 - Data

## 2.1 - Charity Navigator Dataset

[id1]: # "Accountability & Transparency Score - %"
[id2]: # "Charity class"
[id3]: # "Mission & Description."
[id4]: # "ID number."
[id5]: # "Total Expenses in $ (Program, Funding, Administrative)."
[id6]: # "Administrative Expenses Percentage (of total expenses)%."
[id7]: # "Funding Efficiency in $ (amount spent to raise $1 in donations)."
[id8]: # "Funding Expenses Percentage (of total expenses)."
[id9]: # "Program Expenses Percentage (of total expenses)."
[id10]: # "Financial Score (out of 100)."
[id11]: # "Name of Leader."
[id12]: # "Compensation of Leader in $."
[id13]: # "Compensation of Leader Percentage."
[id14]: # "Tagline."
[id15]: # "Name of Charity."
[id16]: # "Total Revenue."
[id17]: # "Overall Score (out of 100)."
[id18]: # "State."
[id19]: # "Subcategory."
[id20]: # "Size of Charity (based on Total Expenses)."
[id21]: # "Program Expenses in $ (amount spent on program & services it delivers)."
[id22]: # "Funding Expenses in $ (amount spent on raising money)."
[id23]: # "Administrative Expenses in $ (amount spent on overhead, staff, meeting costs)."

| Index | Feature (Input) | Short Description | Unit/Format |
|:-------:|:-----------------|:------------------------------------------------------|:-------------------------------------|
| 1 | [`ascore`][id1] | Accountability & Transparency Score | % |
| 2 | [`category`][id2] | Charity class (`10 categories that will be used as output`) | dimensionless |
| 3 | [`description`][id3] | Mission & Description (`Used as the only input`) | dimensionless |
| 4 | [`ein`][id4] | ID number | dimensionless |
| 5 | [`tot_exp`][id5] | Total Expenses in $ (Program, Funding, Administrative) | $ |
| 6 | [`admin_exp_p`][id6] | Administrative Expenses Percentage (of total expenses) | % |
| 7 | [`fund_eff`][id7] | Funding Efficiency in $ (amount spent to raise $1 in donations) | $ |
| 8 | [`fund_exp_p`][id8] | Funding Expenses Percentage (of total expenses) | % |
| 9 | [`program_exp_p`][id9] | Program Expenses Percentage (of total expenses) | % |
| 10 | [`fscore`][id10] | Financial Score (out of 100) | dimensionless |
| 11 | [`leader`][id11] | Name of Leader | dimensionless |
| 12 | [`leader_comp`][id12] | Compensation of Leader in $ | $ |
| 13 | [`leader_comp_p`][id13] | Compensation of Leader Percentage | % |
| 14 | [`motto`][id14] | Tagline | dimensionless |
| 15 | [`name`][id15] | Name of Charity | dimensionless |
| 16 | [`tot_rev`][id16] | Total Revenue | $ |
| 17 | [`score`][id17] | Overall Score (out of 100) | dimensionless |
| 18 | [`state`][id18] | State | dimensionless |
| 19 | [`subcategory`][id19] | Subcategory | dimensionless |
| 20 | [`size`][id20] | Size of Charity (based on Total Expenses) | dimensionless |
| 21 | [`program_exp`][id21] | Program Expenses in $ (amount spent on program & services it delivers) | $ |
| 22 | [`fund_exp`][id22] | Funding Expenses in $ (amount spent on raising money) | $ |
| 23 | [`admin_exp`][id23] | Administrative Expenses in $ (amount spent on overhead, staff, meeting costs) | $ |


### 2.1.1 - Data Preparation

In [None]:
# Load data
# import opendatasets as od
# od.download(
#     "https://www.kaggle.com/datasets/katyjqian/charity-navigator-scores-expenses-dataset")

df_charity = pd.read_csv("CLEAN_charity_data.csv")

#### For the purpose of this project, we use only the `description` feature as input and `category` as output

In [None]:
df_charity.columns

In [None]:
# Description sample
df_charity['description'][1]

In [None]:
# Drop all features but category and description
df_charity.drop(columns=['ascore','ein','tot_exp','admin_exp_p','fund_eff',
'fund_exp_p', 'program_exp_p', 'fscore', 'leader',
'leader_comp', 'leader_comp_p', 'motto', 'name', 'tot_rev', 'score',
'state', 'subcategory', 'size', 'program_exp', 'fund_exp', 'admin_exp'], inplace = True)
# Drop the rows with the category `International`; copy so later column assignments do not act on a slice
df_charity = df_charity[df_charity.category != 'International'].copy()

In [None]:
utils.data_check(df_charity)

In [None]:
np.unique(df_charity['category'])

In [None]:
top_prod = df_charity.groupby('category').size().reset_index().rename(
    columns={0: 'total'}).sort_values('total', ascending=False)
fig = px.pie(top_prod, values='total', names='category', width=800, height=800,
             color_discrete_sequence=px.colors.sequential.thermal, 
             title="Charity Categories")
fig.show()

In [None]:
top_prod

#### Observations:
1. Both `description` and `category` have the object dtype
2. There are 6 duplicated rows
3. There are 10 distinct categories (after removing `International`), and `Human Services` is the largest
4. The plot above displays the distribution of all charity categories

### 2.1.2 - Data Wrangling

In [None]:
# Transform the object type into category type for the category target
df_charity["category"] = df_charity["category"].astype("category")
df_charity.info()

In [None]:
df_charity["description_no_punctuation"] = df_charity["description"].apply(utils.remove_punctuation)
df_charity

In [None]:
df_charity.drop_duplicates(inplace=True)
df_charity.reset_index(inplace = True, drop = True)

In [None]:
df_charity.duplicated().sum()

In [None]:
# Removing Stop words
stop_words = stopwords.words("english")
# print(sorted(stop_words))
# print(len(stop_words))

In [None]:
func = lambda x: ' '.join([word for word in x.split() if word not in (stop_words)])
df_charity["description_clean"] = df_charity["description_no_punctuation"].apply(func)

In [None]:
df_charity.drop(["description_no_punctuation"], axis = 1, inplace = True)
df_charity.head()
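The implementation of `utils.remove_punctuation` is not shown in this notebook. The sketch below is a hypothetical stand-in that illustrates the two cleaning steps above on one sentence. It assumes ASCII punctuation only and uses a tiny hand-written stop-word set instead of the NLTK list.

```python
import string

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is"}


def remove_punctuation(text):
    """Hypothetical stand-in for utils.remove_punctuation: drop ASCII punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Keep only words that are not stop words (compared in lower case)."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


raw = "The mission of the shelter is to rescue, heal, and rehome animals."
clean = remove_stop_words(remove_punctuation(raw))
print(clean)
```

Unlike the lambda in the cell above, this sketch compares lower-cased words, so a capitalized `The` is removed as well. In the notebook, capitalized stop words are handled later by `TfidfVectorizer`, which lower-cases the text and applies the same stop-word list.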

### 2.1.3 - TF-IDF with n-grams
(Term Frequency–Inverse Document Frequency)

In [None]:
vectorizer = TfidfVectorizer(
    lowercase = True, 
    stop_words = stop_words, 
    ngram_range = (1, 2), 
    min_df = 0.009, 
    max_df = 0.99
)

X = vectorizer.fit_transform(df_charity["description_clean"])
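To make the weighting concrete, here is a pure-Python sketch of TF-IDF over unigrams and bigrams. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is what `TfidfVectorizer` uses with its default settings. It ignores `min_df`, `max_df`, and stop words, and the two documents are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def tfidf(docs):
    """Return one {term: weight} dict per document, with L2-normalized weights."""
    term_counts = [Counter(ngrams(doc.lower().split())) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for counts in term_counts:
        weights = {t: tf * idf[t] for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = ["animal rescue shelter", "animal health research"]
rows = tfidf(docs)
for term, weight in sorted(rows[0].items()):
    print(f"{term:<16} {weight:.3f}")
```

Terms that appear in only one document (`rescue`, `animal rescue`) get a larger weight than `animal`, which appears in both.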

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out()
print("="*30)
print(f"Number of feature names: {len(feature_names)}")
print("="*30)
# The first 50 features
print(f"The first 50 features:\n{feature_names[:50]}")
print("="*30)
print(f"Number of rows: {X.shape[0]}")
print("="*30)

In [None]:
df_charity_description_tfidf = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())
df_charity_description_tfidf["category"] = df_charity["category"]
df_charity_description_tfidf.head(3)

In [None]:
cat = np.unique(df_charity_description_tfidf.category)
display(cat)
print(f"Number of Categories: {len(cat)}")

In [None]:
# Map each category name in the target to an integer label.
category_dict = {
    'Human Services':0 ,
    'Health' : 1, 
    'Education': 2, 
    'Arts, Culture, Humanities':3 ,
    'Religion' : 4, 
    'Research and Public Policy': 5,
    # 'International':6,
    'Community Development' : 6, 
    'Animals': 7,
    'Human and Civil Rights':8,
    'Environment' : 9}

data = df_charity_description_tfidf.replace(category_dict)
# data.drop(["description", "description_clean"], axis=1, inplace=True)
data

In [None]:
target = "category"
features = list(data.drop(target, axis=1))
# print(f"Feature: \n {features}")

In [None]:
print(f"Target: {target}")

### 2.1.4 - Feature Selection
Let's reduce the number of features.

In [None]:
X,y = data.iloc[:,:-1], data.iloc[:,-1]

In [None]:
X.shape

In [None]:
selector = SelectKBest(chi2, k=1118) # Reducing the features didn't improve the results
X_new = selector.fit_transform(X, y)
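As a rough illustration of how a chi-squared score ranks non-negative features such as TF-IDF weights, the pure-Python sketch below compares, for each feature, the observed per-class sum with the sum expected if the feature were independent of the class. The helper names and the toy matrix are assumptions for illustration; the notebook itself relies on `SelectKBest(chi2, ...)`.

```python
def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative feature column against labels y."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    class_prob = {c: sum(1 for label in y if label == c) / n_samples for c in classes}
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    scores = []
    for j in range(n_features):
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * feature_total[j]
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_k_best(scores, k):
    """Indices of the k highest-scoring features, returned in column order."""
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


# Toy data: features 0 and 1 each occur in one class only; feature 2 occurs everywhere.
X_toy = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
y_toy = [0, 0, 1, 1]

scores = chi2_scores(X_toy, y_toy)
print(scores)
print(select_k_best(scores, k=2))
```

A feature that is spread evenly across classes scores 0 and is the first to be dropped when `k` is smaller than the number of features.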

In [None]:
X_new.shape

In [None]:
cols = selector.get_support(indices=True)
features_new = list(np.array(features)[cols])
X_new = pd.DataFrame(X_new, columns=features_new)

In [None]:
# data = pd.concat([X, y], axis=1)
data_new = pd.concat([X_new, y], axis=1)

In [None]:
# X_new = X
# features_new = features
# data_new = data

In [None]:
# Data split
train, test = train_test_split(
    data_new, test_size=0.20, stratify=data_new[target], random_state=1)
print(data_new.shape)
print(train.shape)
print(test.shape)
print(train.shape[0] + test.shape[0])


In [None]:
print(train[target].value_counts())
print(test[target].value_counts())

In [None]:
data_new['category'].value_counts()

The `stratify` parameter keeps the same class proportions in the target for both the train and test DataFrames.
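The idea behind a stratified split can be sketched in pure Python: split each class separately with the same test fraction, then merge the pieces. This is a simplified illustration with made-up class sizes, not the algorithm `train_test_split` actually implements.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=1):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: a large class and a small one (sizes are made up).
labels = ["Human Services"] * 50 + ["Animals"] * 10
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both splits keep the 5:1 ratio between the two classes, which a purely random split of a small class does not guarantee.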

### 2.1.5 - Oversampling and Undersampling for inbalanced data

Both the training and testing targets are still imbalanced.

`Human Services` has far more charities than the other categories, so several classes are underrepresented. Let's create an alternative training set with oversampling in case our models do not learn the minority classes well from the original training data.

SMOTE synthesizes new rows for the minority classes. Only the training set is resampled; the test set is left untouched.

In [None]:
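if smote:
    # random_state is set here for reproducibility (an addition to the original call)
    oversample = SMOTE(random_state=1)
    X_res, y_res = oversample.fit_resample(train[features_new], train[target])
    # Rebuild `train` from the resampled data: assigning the result back into the
    # existing columns would align on the old index and add no new rows
    train = pd.concat([X_res, y_res], axis=1)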

In [None]:
# Train and Test set after applying oversampling
print(train[target].value_counts())
print(test[target].value_counts())
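The core step of SMOTE is to place a synthetic sample on the line segment between a minority-class sample and one of its nearest minority-class neighbors. The sketch below is a simplified pure-Python illustration that always uses the single nearest neighbor; the real `SMOTE` picks randomly among several nearest neighbors, and the points here are made up.

```python
import random


def nearest_neighbor(point, others):
    """Return the element of `others` with the smallest squared Euclidean distance."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))


def smote_like_sample(minority, rng):
    """Interpolate between a random minority sample and its nearest minority neighbor."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbor = nearest_neighbor(base, others)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]


# Three toy minority-class points in a 2-D feature space.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
rng = random.Random(1)
synthetic = [smote_like_sample(minority, rng) for _ in range(20)]
print(synthetic[:3])
```

Because every synthetic row is an interpolation, the new samples stay inside the region already covered by the minority class instead of being exact duplicates of existing rows.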


# 3 - Classification

## 3.1 - Logistic Regression

In [None]:
test.shape

In [None]:
# %%time
# Training model
model_lr = LogisticRegression()

# Parameters to test
# params = {
#     "solver": ("newton-cg", "lbfgs"),
#     "max_iter": tuple(range(20, 35, 5)),
#     "class_weight": ["balanced", None],
#     'multi_class': ['auto'],
#     'random_state': [1],
# }

# Best parameters
params = {
    'solver': 'newton-cg',
    'max_iter': 8,
    'class_weight': 'balanced',
    'multi_class': 'auto',
    'random_state': 1,
}

pred_lr_test = utils.results(
    model=model_lr, 
    params=params, 
    train=train, 
    test=test, 
    features=features_new,
    target=target,
    metric=metric, 
    metric_name=metric_name,
    search_params=False)
# 0.920263
# 0.842207

In [None]:
pred_prob_lg = model_lr.predict_proba(test[features_new])
utils.pr_curve(
    model_name='Multiclass-Logistic', 
    target=test[target], 
    pred_prob=pred_prob_lg
    )

In [None]:
# Save the logistic regression model as a pickle file in outputs/, where section 5 loads it from
with open('outputs/model_lr.pkl', 'wb') as f:
    pickle.dump(model_lr, f)

### 3.1.1 - Feature importance (First 10 features)

In [None]:
lg_imp = pd.DataFrame(model_lr.coef_[0], columns = ["Weights"], index = features_new)
lg_imp= lg_imp.sort_values("Weights", ascending = False)
lg_imp

In [None]:
logistic_importances = pd.Series(model_lr.coef_[0], index = features_new)

fig, ax = plt.subplots()
logistic_importances.sort_values(ascending = False)[:10][::-1].plot.barh(ax = ax)

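ax.set_title("Feature Importances - Logistic Regression Classifier")
# coef_[0] holds the weights of the first class only; these are coefficients, not impurity decreases
ax.set_xlabel("Coefficient weight (first class only)")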
fig.tight_layout();

## 3.2 - Gradient Boosting (XGBoost)

[XGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier) from the `XGBoost` package.

In [None]:
# %%time
# Training model
model_xgb = XGBClassifier()

# Parameters to test
# params = {
#     'max_depth': [6],
#     'n_estimators': tuple(range(50,56,1)),
#     'tree_method': ["approx"],
#     'random_state': [1],
#     'enable_categorical': [True],
#     'gamma':[i/10.0 for i in range(5)],
#     }

# Best parameters
params = {
    'max_depth': 6,
    'n_estimators': 52,
    'tree_method': 'approx',
    'random_state': 1,
    'enable_categorical': True,
    'gamma': 0.2,
    'lambda': 3,
    'alpha': 2
    }

pred_xgb_test = utils.results(
    model=model_xgb, 
    params=params, 
    train=train, 
    test=test, 
    features=features_new,
    target=target,
    metric=metric, 
    metric_name=metric_name,
    search_params=False)

# 0.967592
# 0.819756

In [None]:
# Save the XGBoost model as a pickle file in outputs/
with open('outputs/model_xgb.pkl', 'wb') as f:
    pickle.dump(model_xgb, f)

### 3.2.1 - Feature importance (First 10 features)

In [None]:
plot_importance(model_xgb, height=0.8, grid=False, max_num_features=10);

# 4 - Final Results

In [None]:
lr  = utils.model_metrics(test[target], pred_lr_test)  # Logistic Regression
xgb = utils.model_metrics(test[target], pred_xgb_test) # Gradient Boosting XGBoost

Merge all model results into a single DataFrame.

In [None]:
models = [lr, xgb]
results = pd.concat(models, ignore_index=True)*100
results.rename(index={
    0:"Logistic Regression", 
    1:"Gradient Boosting (XGBoost)", 
    }, inplace=True)
results.index.name = 'Models'


In [None]:
results.style\
    .format('{:.2f}')\
    .highlight_max(color='green')\
    .highlight_min(color='red')


# 5. Test loading the pickle file

In [None]:
# Load the model from outputs/ and close the file handle afterwards
with open("outputs/model_lr.pkl", 'rb') as f:
    model_lr = pickle.load(f)
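Loading only works if the file was written to the same path. The following sketch shows a save/load round trip that keeps the path handling in one place and stores the feature list next to the model. `save_artifacts`, `load_artifacts`, and the stand-in objects are hypothetical names for illustration, and a temporary directory is used so the example does not touch the real `outputs` folder.

```python
import os
import pickle
import tempfile


def save_artifacts(path, **artifacts):
    """Pickle all keyword arguments together as one dict, creating the folder if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        pickle.dump(artifacts, fh)


def load_artifacts(path):
    """Load the dict written by save_artifacts."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in objects: in the notebook these would be the fitted model and feature list.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "outputs", "model_bundle.pkl")
    save_artifacts(path, model={"weights": [0.1, 0.2]}, features=["animal", "rescue"])
    bundle = load_artifacts(path)

print(sorted(bundle))
print(bundle["features"])
```

Storing the preprocessing artifacts with the model means a later session can rebuild the exact input columns without rerunning the training cells.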

#### Example

In [None]:
n = 1
description = df_charity['description'][n]
category = df_charity['category'][n]

print("Description:\n", description)
print("\nCategory: ", category)

#### Step 1 - Remove punctuation

In [None]:
description_no_punctuation = utils.remove_punctuation(description)

#### Step 2 - Remove stop words

In [None]:
stop_words = stopwords.words("english")
clean_text = lambda x: ' '.join([word for word in x.split() if word not in (stop_words)])
description_clean = clean_text(description_no_punctuation)

#### Step 3 - TF-IDF with n-grams (Vectorize)
(Term Frequency–Inverse Document Frequency)

In [None]:
vectorizer = TfidfVectorizer(
    lowercase = True, 
    stop_words = stop_words, 
    ngram_range = (1, 2), 
    min_df = 0, 
    max_df = 1
)

# X = vectorizer.fit_transform(pd.Series(description_clean))
X = vectorizer.fit_transform([description_clean])

In [None]:
df_charity_description_tfidf = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())
df_charity_description_tfidf.head(3)

In [None]:
# Use the same columns the model was trained on (features_new)
df_all = pd.DataFrame(np.nan, index=[0], columns=features_new)
df_all

In [None]:
data = df_all.fillna(df_charity_description_tfidf)
data = data.fillna(0)
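The two `fillna` calls above align the new document with the training columns: terms the model never saw are dropped, and training terms missing from the document become 0. A minimal pure-Python sketch of that alignment, with made-up feature names and weights:

```python
def align_features(doc_weights, training_features):
    """Return the document's weights in training column order.

    Terms absent from the document get 0.0; terms unknown to the model are ignored.
    """
    return [doc_weights.get(name, 0.0) for name in training_features]


# Hypothetical training vocabulary and one vectorized document.
training_features = ["animal", "animal rescue", "health", "rescue"]
doc_weights = {"animal": 0.5, "rescue": 0.5, "wildlife": 0.7}

row = align_features(doc_weights, training_features)
print(row)
```

Note that a vectorizer fitted on a single document assigns the same IDF to every term, so its weights differ from those seen during training. If the vectorizer fitted in section 2.1.3 is still in memory or was saved, calling its `transform` method on the cleaned text would reuse the training vocabulary and IDF weights and make this manual alignment unnecessary.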

In [None]:
cat = model_lr.predict(data)

In [None]:
category_dict = {
    0: 'Human Services',
    1: 'Health', 
    2: 'Education',
    3: 'Arts, Culture, Humanities',
    4: 'Religion', 
    5: 'Research and Public Policy',
    6: 'Community Development', 
    7: 'Animals',
    8: 'Human and Civil Rights',
    9: 'Environment'
}

result = category_dict[cat[0]]
result
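The id-to-name dictionary above repeats the mapping from section 2.1.3 in reverse. One way to keep the two in sync is to derive the inverse from the forward mapping, as in this small sketch (the variable names are illustrative):

```python
# Forward mapping, as used when encoding the target.
category_to_id = {
    'Human Services': 0,
    'Health': 1,
    'Education': 2,
    'Arts, Culture, Humanities': 3,
    'Religion': 4,
    'Research and Public Policy': 5,
    'Community Development': 6,
    'Animals': 7,
    'Human and Civil Rights': 8,
    'Environment': 9,
}

# Inverse mapping, derived instead of typed by hand.
id_to_category = {idx: name for name, idx in category_to_id.items()}

# The inverse is only valid if every id is unique.
assert len(id_to_category) == len(category_to_id)

predicted_ids = [7, 0]
names = [id_to_category[i] for i in predicted_ids]
print(names)
```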

In [None]:
# print(features, end='')