## **Veterans Data Marketing Campaign**

Goal: predict which individuals are most likely to donate in response to a marketing campaign.

The response variable is `TARGET_B`, a binary indicator of whether the person donated. There is also `TARGET_D`, which records the donation amount if the person donated. This is historical data.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/martinwg/ISA591/main/data/pva97nk.csv')
df.head()

Unnamed: 0,TARGET_B,ID,TARGET_D,GiftCnt36,GiftCntAll,GiftCntCard36,GiftCntCardAll,GiftAvgLast,GiftAvg36,GiftAvgAll,...,PromCntCardAll,StatusCat96NK,StatusCatStarAll,DemCluster,DemAge,DemGender,DemHomeOwner,DemMedHomeValue,DemPctVeterans,DemMedIncome
0,0,14974,,2,4,1,3,$17.00,$13.50,$9.25,...,13,A,0,0,,F,U,$0,0,$0
1,0,6294,,1,8,0,3,$20.00,$20.00,$15.88,...,24,A,0,23,67.0,F,U,$186800,85,$0
2,1,46110,$4.00,6,41,3,20,$6.00,$5.17,$3.73,...,22,S,1,0,,M,U,$87600,36,$38750
3,1,185937,$10.00,3,12,3,8,$10.00,$8.67,$8.50,...,16,E,1,0,,M,U,$139200,27,$38942
4,0,29637,,1,1,1,1,$20.00,$20.00,$20.00,...,6,F,0,35,53.0,M,U,$168100,37,$71509


In [2]:
## variable types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9686 entries, 0 to 9685
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   TARGET_B          9686 non-null   int64  
 1   ID                9686 non-null   int64  
 2   TARGET_D          4843 non-null   object 
 3   GiftCnt36         9686 non-null   int64  
 4   GiftCntAll        9686 non-null   int64  
 5   GiftCntCard36     9686 non-null   int64  
 6   GiftCntCardAll    9686 non-null   int64  
 7   GiftAvgLast       9686 non-null   object 
 8   GiftAvg36         9686 non-null   object 
 9   GiftAvgAll        9686 non-null   object 
 10  GiftAvgCard36     7906 non-null   object 
 11  GiftTimeLast      9686 non-null   int64  
 12  GiftTimeFirst     9686 non-null   int64  
 13  PromCnt12         9686 non-null   int64  
 14  PromCnt36         9686 non-null   int64  
 15  PromCntAll        9686 non-null   int64  
 16  PromCntCard12     9686 non-null   int64  
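
The non-null counts above suggest that `TARGET_D` is populated only for donors (4,843 of 9,686 rows). A minimal sketch of how this could be checked, using the five rows shown by `df.head()` as a stand-in for the full data; the `sample`, `consistent`, and `donor_share` names are illustrative and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Stand-in for df: the TARGET_B / TARGET_D values from the df.head() output
sample = pd.DataFrame({
    "TARGET_B": [0, 0, 1, 1, 0],
    "TARGET_D": [np.nan, np.nan, "$4.00", "$10.00", np.nan],
})

# TARGET_D should be present exactly when TARGET_B == 1
consistent = (sample["TARGET_D"].notna() == (sample["TARGET_B"] == 1)).all()
print(consistent)

# Share of donors, i.e. the class balance of the response
donor_share = sample["TARGET_B"].mean()
print(donor_share)
```

On the full data the same two lines can be run with `df` in place of `sample`. Because `TARGET_D` is only known once the outcome is known, it should not be used as a predictor of `TARGET_B`.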


## **Automated EDA**

Libraries such as

* AutoViz
* Sweetviz
* summarytools

can speed up exploratory data analysis (EDA) by generating summary reports with a few lines of code.

In [7]:
## sweetviz
!pip install sweetviz



In [9]:
import sweetviz as sv

my_report = sv.analyze(df)
my_report.show_html()

Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


#### Main insights

* Predictors with incorrect types: numeric variables stored as `object` (e.g., the dollar-formatted `GiftAvg*` columns) and categorical variables stored as numeric (e.g., `DemCluster`)
* Distribution shapes: heavily skewed predictors are candidates for a log transformation
* The number of levels of each categorical predictor (with more than 10 levels, consider dimension reduction)
* Extreme values and anomalies
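
The type and shape issues listed above could be addressed along these lines. This is a sketch under assumptions, not the notebook's own cleaning step: the `sample` frame reuses values from the `df.head()` output, and `currency_to_number` and the `DemMedHomeValue_log` column are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for df: values copied from the df.head() output
sample = pd.DataFrame({
    "GiftAvgLast": ["$17.00", "$20.00", "$6.00", "$10.00", "$20.00"],
    "DemMedHomeValue": ["$0", "$186800", "$87600", "$139200", "$168100"],
    "DemCluster": [0, 23, 0, 0, 35],
})

def currency_to_number(s: pd.Series) -> pd.Series:
    """Strip '$' and ',' and convert to numeric; unparseable values become NaN."""
    cleaned = s.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Numeric variables that were read as object
for col in ["GiftAvgLast", "DemMedHomeValue"]:
    sample[col] = currency_to_number(sample[col])

# Cluster codes are labels, not quantities
sample["DemCluster"] = sample["DemCluster"].astype("category")

# log1p = log(1 + x), which stays defined for the zeros in this column
sample["DemMedHomeValue_log"] = np.log1p(sample["DemMedHomeValue"])

print(sample.dtypes)
```

`np.log1p` is used instead of `np.log` because `DemMedHomeValue` contains zeros, for which a plain log is undefined.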



In [10]:
## checking the number of levels
df.DemHomeOwner.value_counts()

DemHomeOwner,count
H,5377
U,4309


In [11]:
## .nunique() returns the number of distinct levels
df.DemHomeOwner.nunique()

2

In [12]:
## number of levels in DemCluster
df.DemCluster.nunique()

54

In [13]:
## for loop to show number of levels for ALL categoricals
for i in df.select_dtypes(include='object'):
  print(f'Variable {i} has {df[i].nunique()}')

Variable TARGET_D has 70
Variable GiftAvgLast has 90
Variable GiftAvg36 has 654
Variable GiftAvgAll has 1584
Variable GiftAvgCard36 has 399
Variable StatusCat96NK has 6
Variable DemGender has 3
Variable DemHomeOwner has 2
Variable DemMedHomeValue has 2533
Variable DemMedIncome has 4463
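
The same counts can be obtained without an explicit loop and then filtered against the 10-level rule of thumb mentioned earlier. A minimal sketch, again using a few columns from the `df.head()` output as a stand-in; `sample`, `level_counts`, `threshold`, and `high_cardinality` are illustrative names:

```python
import pandas as pd

# Stand-in for df: values copied from the df.head() output
sample = pd.DataFrame({
    "StatusCat96NK": ["A", "A", "S", "E", "F"],
    "DemGender": ["F", "F", "M", "M", "M"],
    "DemHomeOwner": ["U", "U", "U", "U", "U"],
    "GiftAvgLast": ["$17.00", "$20.00", "$6.00", "$10.00", "$20.00"],
    "GiftCnt36": [2, 1, 6, 3, 1],
})

# Number of distinct levels per object column, largest first
level_counts = (
    sample.select_dtypes(include="object")
    .nunique()
    .sort_values(ascending=False)
)
print(level_counts)

# Columns exceeding the rule-of-thumb threshold
threshold = 10
high_cardinality = level_counts[level_counts > threshold].index.tolist()
print(high_cardinality)
```

On the full data, most of the columns this flags (`GiftAvg36`, `DemMedIncome`, and so on) are dollar-formatted numbers rather than true categoricals, so their counts are best read after the types have been corrected.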


In [14]:
## summarytools
!pip install summarytools

Collecting summarytools
  Downloading summarytools-0.3.0-py3-none-any.whl.metadata (3.5 kB)
Collecting jedi>=0.16 (from ipython>=7.20.0->summarytools)
  Using cached jedi-0.19.1-py2.py3-none-any.whl.metadata (22 kB)
Downloading summarytools-0.3.0-py3-none-any.whl (12 kB)
Using cached jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, summarytools
Successfully installed jedi-0.19.1 summarytools-0.3.0


In [15]:
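## summarytools runs faster than sweetviz
## but gives only a very basic summary
## for categoricals with many levels it lists the top 10 and groups the rest as "other",
## so the total number of levels may not be shown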
from summarytools import dfSummary
dfSummary(df)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,TARGET_B [int64],1. 0 2. 1,"4,843 (50.0%) 4,843 (50.0%)",,0 (0.0%)
2,ID [int64],Mean (sd) : 97975.5 (56550.2) min < med < max: 12.0 < 99106.0 < 191779.0 IQR (CV) : 99703.2 (1.7),"9,686 distinct values",,0 (0.0%)
3,TARGET_D [object],1. nan 2. $10.00 3. $15.00 4. $20.00 5. $5.00 6. $25.00 7. $12.00 8. $7.00 9. $6.00 10. $11.00 11. other,"4,843 (50.0%) 941 (9.7%) 591 (6.1%) 577 (6.0%) 503 (5.2%) 392 (4.0%) 161 (1.7%) 126 (1.3%) 124 (1.3%) 118 (1.2%) 1,310 (13.5%)",,"4,843 (50.0%)"
4,GiftCnt36 [int64],Mean (sd) : 3.2 (2.1) min < med < max: 0.0 < 3.0 < 16.0 IQR (CV) : 2.0 (1.5),16 distinct values,,0 (0.0%)
5,GiftCntAll [int64],Mean (sd) : 10.5 (9.0) min < med < max: 1.0 < 8.0 < 91.0 IQR (CV) : 11.0 (1.2),69 distinct values,,0 (0.0%)
6,GiftCntCard36 [int64],1. 1 2. 2 3. 0 4. 3 5. 4 6. 5 7. 6 8. 7 9. 8 10. 9,"3,103 (32.0%) 2,143 (22.1%) 1,780 (18.4%) 1,288 (13.3%) 672 (6.9%) 365 (3.8%) 189 (2.0%) 96 (1.0%) 39 (0.4%) 11 (0.1%)",,0 (0.0%)
7,GiftCntCardAll [int64],Mean (sd) : 5.6 (4.7) min < med < max: 0.0 < 4.0 < 41.0 IQR (CV) : 6.0 (1.2),31 distinct values,,0 (0.0%)
8,GiftAvgLast [object],1. $15.00 2. $10.00 3. $20.00 4. $25.00 5. $5.00 6. $12.00 7. $16.00 8. $6.00 9. $7.00 10. $11.00 11. other,"1,726 (17.8%) 1,603 (16.5%) 1,462 (15.1%) 862 (8.9%) 645 (6.7%) 329 (3.4%) 275 (2.8%) 273 (2.8%) 221 (2.3%) 214 (2.2%) 2,076 (21.4%)",,0 (0.0%)
9,GiftAvg36 [object],1. $15.00 2. $20.00 3. $10.00 4. $25.00 5. $12.50 6. $13.00 7. $16.00 8. $12.00 9. $5.00 10. $11.00 11. other,"1,049 (10.8%) 801 (8.3%) 549 (5.7%) 448 (4.6%) 278 (2.9%) 208 (2.1%) 206 (2.1%) 179 (1.8%) 178 (1.8%) 176 (1.8%) 5,614 (58.0%)",,0 (0.0%)
10,GiftAvgAll [object],1. $15.00 2. $20.00 3. $25.00 4. $10.00 5. $12.50 6. $9.00 7. $13.00 8. $11.00 9. $8.00 10. $11.67 11. other,"525 (5.4%) 399 (4.1%) 223 (2.3%) 194 (2.0%) 180 (1.9%) 131 (1.4%) 128 (1.3%) 96 (1.0%) 89 (0.9%) 77 (0.8%) 7,644 (78.9%)",,0 (0.0%)


### **Associations and Multivariate Insights**

* Collinearity
* Correlation matrix
* scatter-plot matrices
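
A correlation matrix is a quick first screen for collinearity among numeric predictors. The sketch below uses four gift-count columns from the `df.head()` output as a stand-in for the data; the 0.8 cutoff and the names `sample`, `corr`, `pairs`, and `high` are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Stand-in for df: values copied from the df.head() output
sample = pd.DataFrame({
    "GiftCnt36": [2, 1, 6, 3, 1],
    "GiftCntAll": [4, 8, 41, 12, 1],
    "GiftCntCard36": [1, 0, 3, 3, 1],
    "GiftCntCardAll": [3, 3, 20, 8, 1],
})

# Pearson correlation matrix of the numeric columns
corr = sample.corr(numeric_only=True)

# Keep each pair once: mask everything except the strict upper triangle
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Flag strongly correlated pairs (cutoff is an arbitrary illustration)
threshold = 0.8
high = pairs[pairs.abs() > threshold].sort_values(ascending=False)
print(high)
```

With the full data, `sns.heatmap(corr, annot=True)` displays the same matrix graphically, and `sns.pairplot` on a handful of columns produces a scatter-plot matrix.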