# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Capstone Project: Donor Leads through Networks


--- 

By: Wenzhe

## EDA and Data Prep

### Overview

This notebook explores the scraped data and prepares it for use in the network analysis.

### Notebook Structure

* [Part 1: EDA](#part-1-eda)
* [Part 2: Data Preparation and Feature Engineering](#part-2-data-preparation-and-feature-engineering)
* [Part 3: Saving the final Dataframes](#part-3-saving-the-final-dataframes)

---

## Part 1: EDA

#### Import Libraries

In [1]:
import pandas as pd
import re
import numpy as np
import matplotlib.pyplot as plt

### Exploring Data

In [2]:
charities = pd.read_csv('../raw_data/charities_info.csv')
persons = pd.read_csv('../raw_data/persons.csv')

In [5]:
persons.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31652 entries, 0 to 31651
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name:         31652 non-null  object
 1   role          31652 non-null  object
 2   designation   31223 non-null  object
 3   charity_name  31652 non-null  object
 4   charity_uen   31652 non-null  object
dtypes: object(5)
memory usage: 1.2+ MB


Rename the `name:` column to `name` (dropping the trailing colon), and convert all names to uppercase for standardisation.

In [6]:
persons.rename(columns={'name:':'name'}, inplace=True)

In [10]:
persons['name'] = persons['name'].str.upper()

In [11]:
persons['name']

0        HO HOU CHIAT, ISAAC
1                LU SHAN-JUI
2                 ZHOU LIHAN
3               SUN YIK CHEN
4           RAMESH S/O KUMAR
                ...         
31647           TANG KOK ENG
31648           TAN BAK PENG
31649            TAN TEO HOO
31650          LIM CHUN YONG
31651           YAP SOON WAN
Name: name, Length: 31652, dtype: object
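
Uppercasing alone does not merge names that differ only in spacing. The sketch below, on made-up names rather than the scraped data, shows one possible extension that also strips and collapses whitespace. Whether the scraped names actually need this is an assumption that has not been checked here.

```python
import pandas as pd

# Toy data: these names are invented for illustration only.
demo = pd.DataFrame({"name:": ["  Tan Ah Kow ", "tan  ah kow", "LIM MEI LING"]})

demo = demo.rename(columns={"name:": "name"})
demo["name"] = (
    demo["name"]
    .str.strip()                            # drop leading/trailing spaces
    .str.replace(r"\s+", " ", regex=True)   # collapse repeated whitespace
    .str.upper()                            # same casing step as the notebook
)

# The first two rows now share one spelling.
n_unique = demo["name"].nunique()
```

Comparing `nunique()` before and after such a step is a quick way to see whether it changes anything on the real data.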

Copy the index into a column to use as a unique identifier for each row.

In [14]:
persons['index'] = persons.index

In [16]:
persons.head()

Unnamed: 0,name,role,designation,charity_name,charity_uen,index
0,"HO HOU CHIAT, ISAAC",Board Member,DIRECTOR,#CHECKED LIMITED,200920810R,0
1,LU SHAN-JUI,Board Member,DIRECTOR,#CHECKED LIMITED,200920810R,1
2,ZHOU LIHAN,Board Member,DIRECTOR,#CHECKED LIMITED,200920810R,2
3,SUN YIK CHEN,Key Officer,CORPORATE SECRETARY,#CHECKED LIMITED,200920810R,3
4,RAMESH S/O KUMAR,Board Member,DIRECTOR,"*SCAPE CO., LTD.",200712761D,4


In [15]:
persons.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31652 entries, 0 to 31651
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          31652 non-null  object
 1   role          31652 non-null  object
 2   designation   31223 non-null  object
 3   charity_name  31652 non-null  object
 4   charity_uen   31652 non-null  object
 5   index         31652 non-null  int64 
dtypes: int64(1), object(5)
memory usage: 1.4+ MB


Make the `charities` column names Pythonic, then engineer and select columns for the final dataframe.

In [4]:
charities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2601 entries, 0 to 2600
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   S/N                   2601 non-null   int64 
 1   Name of Organisation  2601 non-null   object
 2   Type                  2601 non-null   object
 3   UEN                   2601 non-null   object
 4   IPC Period            2599 non-null   object
 5   Sector                2601 non-null   object
 6   Classification        2601 non-null   object
 7   Activities            1798 non-null   object
 8   charity_name          2601 non-null   object
 9   charity_uen           2601 non-null   object
 10  charity_url           2601 non-null   object
 11  contact_person        2365 non-null   object
 12  office_no             2335 non-null   object
 13  fax_no                1052 non-null   object
 14  email                 2350 non-null   object
 15  address               2598 non-null   

In [13]:
charities.columns = [col.lower().replace(" ", "_") for col in charities.columns]
charities.columns

Index(['s/n', 'name_of_organisation', 'type', 'uen', 'ipc_period', 'sector',
       'classification', 'activities', 'charity_name', 'charity_uen',
       'charity_url', 'contact_person', 'office_no', 'fax_no', 'email',
       'address', 'web_url', 'charity_objective', 'charity_vision'],
      dtype='object')
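
The list comprehension above lowercases and replaces spaces, which leaves `s/n` with a slash in its name. As a hedged alternative, the sketch below uses a hypothetical helper, `to_snake_case` (not part of the notebook), that replaces every run of non-alphanumeric characters with an underscore.

```python
import re

def to_snake_case(col: str) -> str:
    """Hypothetical helper: lowercase and replace non-alphanumeric runs with '_'."""
    col = col.strip().lower()
    col = re.sub(r"[^0-9a-z]+", "_", col)
    return col.strip("_")

# A few of the raw column names shown in charities.info() above.
raw_cols = ["S/N", "Name of Organisation", "IPC Period", "charity_uen"]
clean_cols = [to_snake_case(c) for c in raw_cols]
# clean_cols == ['s_n', 'name_of_organisation', 'ipc_period', 'charity_uen']
```

Names without slashes can be accessed as attributes (`charities.s_n`), though the notebook's bracket-style access works either way.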

In [17]:
charities.head()

Unnamed: 0,s/n,name_of_organisation,type,uen,ipc_period,sector,classification,activities,charity_name,charity_uen,charity_url,contact_person,office_no,fax_no,email,address,web_url,charity_objective,charity_vision
0,1,#CHECKED LIMITED,Registered,200920810R,Not Applicable,Others,Environment,,#CHECKED LIMITED,200920810R,https://www.charities.gov.sg/_layouts/15/CPInt...,Cheryl Zhen,68160383,,hello@checked.today,"350 ORCHARD ROAD, #17-07/09, SHAW HOUSE, 238868",https://www.checked.today,A. To match Green innovation ideas with the re...,1. Vision: To be an educational platform for e...
1,2,"*SCAPE CO., LTD.",Registered with IPC,200712761D,From 21/10/2021 to 20/10/2024,Others,Children/Youth,Direct Services,"*SCAPE CO., LTD.",200712761D,https://www.charities.gov.sg/_layouts/15/CPInt...,Tan Kim Noy,65090177,,enquiries@scape.sg,"2 ORCHARD LINK, #04-01, SCAPE, 237978",http://www.scape.sg,(i) To encourage and promote social and cultur...,Vision To be a celebrated talent and resource ...
2,3,=DREAMS (ASIA) LIMITED,Registered,201021998H,Not Applicable,Others,General Charitable Purposes,Grantmaking,=DREAMS (ASIA) LIMITED,201021998H,https://www.charities.gov.sg/_layouts/15/CPInt...,NG SAY LEE,63511006,,saylee@dreamsasia.org,"1 LORONG 2 TOA PAYOH, #07-00, BRADDELL HOUSE, ...",Not Available,"Overseas work for disadvantaged children, comm...",Communities are developed and poverty is allev...
3,4,=DREAMS (SINGAPORE) LIMITED,Registered with IPC,202032457N,From 01/4/2023 to 31/3/2024,Social and Welfare,Children/Youth,"Direct Services Financial assistance, bursarie...",=DREAMS (SINGAPORE) LIMITED,202032457N,https://www.charities.gov.sg/_layouts/15/CPInt...,Kelvin Koh,69922838,,hello@dreamssingapore.org.sg,"99 HAIG ROAD, =DREAMS CAMPUS, 438748",https://www.dreamsasia.org,=DREAMS Singapore is a first-of-its-kind secul...,ABOUT US\n\nWHAT: =DREAMS is a residential mod...
4,6,21C GIRLS LTD.,Registered,201436550G,Not Applicable,Education,Others,Training & education,21C GIRLS LTD.,201436550G,https://www.charities.gov.sg/_layouts/15/CPInt...,Ayesha Khanna,91465794,,ayesha@21cgirls.com,"101 UPPER CROSS STREET, #05-16, PEOPLE'S PARK ...",https://www.21cgirls.com,DELIVER FREE TECHNOLOGY CLASSES AND CAMPS FOR ...,TO TEACH TECHNOLOGY TO GIRLS SO THAT THEY CAN ...


Both datasets contain nulls in some columns, and some columns are not consistently formatted.
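
One way to make the null observation concrete is a per-column summary of null counts and percentages. The sketch below uses a small invented dataframe; the column names mirror the charities data, but the values are placeholders.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the charities dataframe; values are invented.
demo = pd.DataFrame({
    "charity_uen": ["A1", "B2", "C3", "D4"],
    "activities": ["Research", np.nan, np.nan, "Grantmaking"],
    "fax_no": [np.nan, np.nan, np.nan, "61234567"],
})

# Count and percentage of nulls per column, worst columns first.
null_summary = pd.DataFrame({
    "n_null": demo.isna().sum(),
    "pct_null": demo.isna().mean().mul(100).round(1),
}).sort_values("n_null", ascending=False)
```

Applied to `charities` and `persons`, the same two lines would rank columns by how incomplete they are.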

---

## Part 2: Data Preparation and Feature Engineering

This section will further prepare the data and engineer some features that may be useful for the network analysis.

#### IPC Charities

In [19]:
charities['type'].unique()

array(['Registered', 'Registered with IPC', 'Exempt Charity',
       'Exempt Charity with IPC'], dtype=object)

Create a column to indicate whether the charity is an Institution of a Public Character (IPC).

In [31]:
charities['is_ipc'] = charities['type'].apply(lambda x: 1 if 'IPC' in x else 0)

#### Charity Activities

The activities of a charity fall into a list of categories, available from the charity portal. Create dummy columns to indicate the activities of a charity.

In [26]:
list_activities = ['Direct Services', 'Research', 'Financial assistance, bursaries & scholarships', 'Supports other Charities', 'Grantmaking', 'Training & education', 'Public awareness, promotion & advisory']
list_activities_col_name = ['activity_direct_services', 'activity_research', 'activity_financial_assistance', 'activity_support_charities', 'activity_grantmaking', 'activity_training_education', 'activity_public_awareness']

In [33]:
# Flag each activity with 1 if its label appears in the (possibly null) activities string.
for activity, col_name in zip(list_activities, list_activities_col_name):
    charities[col_name] = charities['activities'].apply(lambda x: 1 if pd.notna(x) and activity in x else 0)
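
The same substring test can also be written with the vectorised string accessor. This is a sketch on toy rows, not the notebook's code: the `activity_cols` mapping is a shortened, illustrative version of the two lists above, and `regex=False` is used so that characters such as `&` and `,` in the labels are matched literally.

```python
import numpy as np
import pandas as pd

# Shortened label -> column mapping, for illustration only.
activity_cols = {
    "Direct Services": "activity_direct_services",
    "Grantmaking": "activity_grantmaking",
    "Training & education": "activity_training_education",
}

# Toy rows: one charity with two activities, one with one, one with none.
demo = pd.DataFrame({"activities": [
    "Direct Services Training & education",
    "Grantmaking",
    np.nan,
]})

for label, col in activity_cols.items():
    # na=False treats a missing activities value as "no match".
    demo[col] = demo["activities"].str.contains(label, regex=False, na=False).astype(int)
```

Because this is substring matching, a label that is contained in another label would be flagged for both, so the labels should be checked for overlap before relying on it.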

In [120]:
charities

Unnamed: 0,s/n,name_of_organisation,type,uen,ipc_period,sector,classification,activities,charity_name,charity_uen,...,charity_vision,is_ipc,activity_direct_services,activity_research,activity_financial_assistance,activity_support_charities,activity_grantmaking,activity_training_education,activity_public_awareness,postal_code
0,1,#CHECKED LIMITED,Registered,200920810R,Not Applicable,Others,Environment,,#CHECKED LIMITED,200920810R,...,1. Vision: To be an educational platform for e...,0,0,0,0,0,0,0,0,238868
1,2,"*SCAPE CO., LTD.",Registered with IPC,200712761D,From 21/10/2021 to 20/10/2024,Others,Children/Youth,Direct Services,"*SCAPE CO., LTD.",200712761D,...,Vision To be a celebrated talent and resource ...,1,1,0,0,0,0,0,0,237978
2,3,=DREAMS (ASIA) LIMITED,Registered,201021998H,Not Applicable,Others,General Charitable Purposes,Grantmaking,=DREAMS (ASIA) LIMITED,201021998H,...,Communities are developed and poverty is allev...,0,0,0,0,0,1,0,0,319637
3,4,=DREAMS (SINGAPORE) LIMITED,Registered with IPC,202032457N,From 01/4/2023 to 31/3/2024,Social and Welfare,Children/Youth,"Direct Services Financial assistance, bursarie...",=DREAMS (SINGAPORE) LIMITED,202032457N,...,ABOUT US\n\nWHAT: =DREAMS is a residential mod...,1,1,0,1,0,0,1,0,438748
4,6,21C GIRLS LTD.,Registered,201436550G,Not Applicable,Education,Others,Training & education,21C GIRLS LTD.,201436550G,...,TO TEACH TECHNOLOGY TO GIRLS SO THAT THEY CAN ...,0,0,0,0,0,0,1,0,058357
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2596,3078,Zion Living Streams Community Church,Registered,S98SS0054C,Not Applicable,Religious,Christianity,Direct Services,Zion Living Streams Community Church,S98SS0054C,...,"Praying for salvation of the Nations, looking ...",0,1,0,0,0,0,0,0,555856
2597,3079,Zion Presbyterian Church,Registered,S90SS0056K,Not Applicable,Religious,Christianity,,Zion Presbyterian Church,S90SS0056K,...,To preach the Gospel of Christ and to extend G...,0,0,0,0,0,0,0,0,486975
2598,3080,ZION SERANGOON BIBLE-PRESBYTERIAN CHURCH,Registered,S86SS0063K,Not Applicable,Religious,Christianity,Supports other Charities Training & education,ZION SERANGOON BIBLE-PRESBYTERIAN CHURCH,S86SS0063K,...,As given under (a) above.,0,0,0,0,1,0,1,0,555108
2599,3081,Zonta Singapore- Project Pari Fund,Registered with IPC,T10CC0004L,From 01/5/2021 to 31/1/2024,Social and Welfare,Community,"Financial assistance, bursaries & scholarships",Zonta Singapore- Project Pari Fund,T10CC0004L,...,Please refer to the Rules and Regulations of Z...,1,0,0,1,0,0,0,0,187967


#### Postal Code

Create a postal code feature. Some values will be NaN, either because no address is provided or because the address does not contain a 6-digit postal code.

In [55]:
# Extract the first run of 6 consecutive digits in the address as the postal code.
# Rows with no address or no 6-digit run are left as NaN.
charities['postal_code'] = charities['address'].str.extract(r'(\d{6})')
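
The pattern `(\d{6})` matches the first six digits of any longer digit run as well. The sketch below shows a stricter variant with word boundaries, so only a standalone 6-digit number is accepted. The first address is taken from the data shown earlier; the other two are invented, and whether the stricter pattern changes any real rows has not been verified.

```python
import pandas as pd

addresses = pd.Series([
    "350 ORCHARD ROAD, #17-07/09, SHAW HOUSE, 238868",  # from the data above
    "1 EXAMPLE STREET, TEL 12345678",                   # invented: 8-digit number only
    "2 EXAMPLE AVENUE, 058357",                         # invented: leading zero
])

# \b on both sides rejects digit runs longer than six characters.
postal = addresses.str.extract(r"\b(\d{6})\b", expand=False)
```

The result stays a string column, which preserves leading zeros such as `058357`.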

#### Charity Classification

The `classification` column describes a charity's activities in greater detail, but it is not consistently formatted. The steps below create dummy columns for the more common classifications. `charities_classification` is a dataframe made to hold these dummy columns.

In [129]:
charities_classification = charities.copy()

Prepare the `classification` column before making the dummy columns.

In [130]:
charities_classification['classification'] = charities_classification['classification'].apply(lambda x: x.replace('Others, ', '').replace('Others; ', '').strip().replace(', ', ',').replace(' ', '_').lower())

In [131]:
charities_classification = charities_classification['classification'].str.get_dummies(sep=',').add_prefix('classification_').astype(bool).astype(int)
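
To make the cleaning and `get_dummies` steps easier to follow, here is a minimal sketch on three invented classification strings. It applies a subset of the same replacements using the string accessor; the input values are assumptions about what the raw column might look like.

```python
import pandas as pd

# Invented examples of raw classification strings.
demo = pd.Series(["Others, Community", "Children/Youth, Community", "Visual Arts"])

cleaned = (
    demo.str.replace("Others, ", "", regex=False)  # drop the generic "Others" prefix
        .str.strip()
        .str.replace(", ", ",", regex=False)       # normalise the separator
        .str.replace(" ", "_", regex=False)        # spaces inside a label -> underscore
        .str.lower()
)

# One indicator column per distinct label; a row can have several 1s.
dummies = cleaned.str.get_dummies(sep=",").add_prefix("classification_")
```

Here the second row ends up with a 1 in both `classification_children/youth` and `classification_community`.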

In [132]:
class_sum = charities_classification.sum()

Keep only the columns that are used by more than 5 charities. This is an arbitrary threshold to filter out odd classifications that are rarely used.

In [137]:
col_more_than_5 = class_sum[class_sum > 5]
charities_classification = charities_classification[col_more_than_5.index]
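
Since the threshold is arbitrary, it can help to see which columns a given cut-off would drop. The sketch below uses a toy dummy table and a smaller, hypothetical threshold (`min_count = 2`) so that the effect is visible on four rows.

```python
import pandas as pd

# Toy dummy columns; names and counts are invented.
demo = pd.DataFrame({
    "classification_community": [1, 1, 1, 0],
    "classification_rare": [0, 0, 0, 1],
})

min_count = 2  # hypothetical value for this toy data; the notebook uses 5

# Keep columns whose total count exceeds the threshold.
kept = demo.loc[:, demo.sum() > min_count]

# List what was filtered out, for review.
dropped = demo.columns.difference(kept.columns).tolist()
```

Inspecting `dropped` on the real data would show whether any meaningful classification falls just under the cut-off.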

In [138]:
charities_classification

Unnamed: 0,classification_active_ageing,classification_animal_welfare,classification_buddhism,classification_central,classification_children/youth,classification_christianity,classification_cluster/hospital_funds,classification_community,classification_contemporary_&_ethnic_dance,classification_day_rehabilitation_centre,...,classification_south_west,classification_support_groups,classification_taoism,classification_tcm_clinic,classification_theatre_&_dramatic_arts,classification_think_tanks,classification_traditional_ethnic_performing_arts,classification_training_&_education,classification_trust/research_funds,classification_visual_arts
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2596,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2597,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2598,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2599,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


---

## Part 3: Saving the final Dataframes

Select the engineered features to keep in the final charities dataframe.

In [121]:
charities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2601 entries, 0 to 2600
Data columns (total 28 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   s/n                            2601 non-null   int64 
 1   name_of_organisation           2601 non-null   object
 2   type                           2601 non-null   object
 3   uen                            2601 non-null   object
 4   ipc_period                     2599 non-null   object
 5   sector                         2601 non-null   object
 6   classification                 2601 non-null   object
 7   activities                     1798 non-null   object
 8   charity_name                   2601 non-null   object
 9   charity_uen                    2601 non-null   object
 10  charity_url                    2601 non-null   object
 11  contact_person                 2365 non-null   object
 12  office_no                      2335 non-null   object
 13  fax

In [125]:
charity_cols_keep = ['charity_uen',
                     'charity_name',
                     'address',
                     'postal_code',
                     'charity_objective',
                     'charity_vision',
                     'sector',

                     # Boolean features
                     'is_ipc',
                     'activity_direct_services',
                     'activity_research',
                     'activity_financial_assistance',
                     'activity_support_charities',
                     'activity_grantmaking',
                     'activity_training_education',
                     'activity_public_awareness']
charities_info = charities[charity_cols_keep]

In [127]:
charities_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2601 entries, 0 to 2600
Data columns (total 15 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   charity_uen                    2601 non-null   object
 1   charity_name                   2601 non-null   object
 2   address                        2598 non-null   object
 3   postal_code                    2535 non-null   object
 4   charity_objective              2274 non-null   object
 5   charity_vision                 2115 non-null   object
 6   sector                         2601 non-null   object
 7   is_ipc                         2601 non-null   int64 
 8   activity_direct_services       2601 non-null   int64 
 9   activity_research              2601 non-null   int64 
 10  activity_financial_assistance  2601 non-null   int64 
 11  activity_support_charities     2601 non-null   int64 
 12  activity_grantmaking           2601 non-null   int64 
 13  act

Concatenate the selected charity columns with the classification dummy columns to create the final dataframe.

In [139]:
charities_merged = pd.concat([charities_info, charities_classification], axis=1)
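
`pd.concat(..., axis=1)` aligns rows by index label, not by position. Both dataframes here derive from `charities` and should share its index, but a quick check guards against silent misalignment. The sketch below uses toy frames with invented values.

```python
import pandas as pd

info = pd.DataFrame({"charity_uen": ["A1", "B2", "C3"]})
dummies = pd.DataFrame({"classification_community": [1, 0, 1]})

# Safe case: identical indexes, so rows line up one-to-one.
assert info.index.equals(dummies.index), "indexes differ; concat would misalign rows"
merged = pd.concat([info, dummies], axis=1)

# Counter-example: a shifted index produces extra rows padded with NaN.
shifted = dummies.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([info, shifted], axis=1)
```

In the misaligned case the result has four rows instead of three, which is why comparing row counts before and after the concat is a cheap sanity check.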

In [143]:
charities_merged.to_csv('../data/charities.csv', index=False)
persons.to_csv('../data/persons.csv', index=False)
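
One caveat when these files are loaded again: `read_csv` infers numeric types by default, so a postal code such as `058357` would come back as the integer `58357`. The sketch below demonstrates this with an in-memory buffer instead of a file; the single row reuses values shown earlier in this notebook.

```python
import io
import pandas as pd

demo = pd.DataFrame({"charity_uen": ["201436550G"], "postal_code": ["058357"]})

# Write to an in-memory buffer rather than disk for this illustration.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default read: postal_code is inferred as an integer, losing the leading zero.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Reading with an explicit dtype keeps the postal code as a string.
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={"postal_code": str})
```

Any downstream notebook that reads `../data/charities.csv` could pass the same `dtype` argument to keep postal codes intact.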