# Blood Donation Project

<p>Blood transfusion saves lives, from replacing blood lost during major surgery or a serious injury to treating a variety of diseases and blood disorders. Ensuring that sufficient blood is available when needed is a serious challenge for healthcare professionals. According to <a href="https://www.webmd.com/a-to-z-guides/blood-transfusion-what-to-know#1">WebMD</a>, approximately 5 million Americans need blood transfusions every year.</p>

<p>Our data set was taken from a mobile blood donation vehicle in Taiwan.</p>

<p>The data is stored in datasets/transfusion.data and is structured according to the RFMTC marketing model (a variant of RFM).</p>

In this data set:

**RFMTC Components**

1. **Recency (R) - "Recency (months)"**
     - The number of months since the donor's last donation. Generally, donors whose last donation was more recent are more likely to donate again.

2. **Frequency (F) - "Frequency (times)"**
     - The total number of times the donor has donated blood. Donors with a higher count are generally more likely to donate in the future.

3. **Monetary (M), monetary value - "Monetary (c.c. blood)"**
     - The total volume of blood the donor has donated, in c.c. Generally, donors who have donated larger amounts are considered to be of higher value.

4. **Time (T) - "Time (months)"**
     - The number of months since the donor's first donation. This feature can be used to understand how “loyal” the donor has been over the donation period.

5. **Churn (C), leaving - "whether he/she donated blood in March 2007"**
     - A binary flag recording whether the donor gave blood in March 2007 (1 = donated, 0 = did not donate). Churn, in this example, is the probability that the donor will not donate during that period, so a value of 0 corresponds to churn.

**Usage Areas of RFMTC**

1. **Segmentation**: Donors can be grouped into segments using these features. For example, donors with high "F" and low "R" values can be labeled as "Loyal Donors."

2. **Predicting**: The probability of future donation can be predicted using current RFMTC values.

3. **Targeting**: Special campaigns or incentives can be used to target specific donor segments.

4. **Risk Analysis**: Donors with low frequency and high churn rate can be labeled as "Risky", and special strategies can be developed for these donors.
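
As a rough illustration of the segmentation idea, the sketch below labels donors from their recency and frequency values. The column names are shortened to `Recency` and `Frequency`, the four sample rows are copied from the data preview in this notebook, and the `label_segment` helper, its thresholds (`max_recency=6`, `min_frequency=10`) and the "Other" label are hypothetical choices made only for this example.

```python
import pandas as pd

# A few donors in the RFMTC layout (values copied from the data preview).
donors = pd.DataFrame({
    "Recency": [2, 0, 23, 39],
    "Frequency": [50, 13, 2, 1],
})

def label_segment(row, max_recency=6, min_frequency=10):
    """Hypothetical rule: recent AND frequent donors are 'Loyal Donors'."""
    if row["Recency"] <= max_recency and row["Frequency"] >= min_frequency:
        return "Loyal Donors"
    return "Other"

# Apply the rule row by row and store the label in a new column.
donors["Segment"] = donors.apply(label_segment, axis=1)
print(donors)
```

In practice the thresholds would be chosen from the distribution of the data (for example from quartiles) rather than fixed by hand.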

# Importing Libraries

In [1]:
import numpy as np
import pandas as pd

# Exploratory Data Analysis and Visualization

In [2]:
df = pd.read_csv("transfusion.data")
df

Unnamed: 0,Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
743,23,2,500,38,0
744,21,2,500,52,0
745,23,3,750,62,0
746,39,1,250,39,0


## Change the column names (if necessary)

In [3]:
new_column_names = {
    'Recency (months)': 'Recency',
    'Frequency (times)': 'Frequency',
    'Monetary (c.c. blood)': 'Monetary',
    'Time (months)': 'Time',
    'whether he/she donated blood in March 2007': 'Target'
                   }
            
df.rename(columns=new_column_names, inplace=True)

## Get the first 5 rows

In [4]:
df.head(5)

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


## Look at the general information

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Recency    748 non-null    int64
 1   Frequency  748 non-null    int64
 2   Monetary   748 non-null    int64
 3   Time       748 non-null    int64
 4   Target     748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB


## Look at the shape

In [6]:
df.shape

(748, 5)

## Check for missing values

In [7]:
df.isnull()

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
743,False,False,False,False,False
744,False,False,False,False,False
745,False,False,False,False,False
746,False,False,False,False,False


In [8]:
df.isnull().sum()

Recency      0
Frequency    0
Monetary     0
Time         0
Target       0
dtype: int64

## Check for duplicated values

In [9]:
df[df.duplicated()]

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
18,2,6,1500,15,1
20,2,3,750,4,1
23,2,6,1500,16,1
32,4,10,2500,28,1
43,2,5,1250,16,0
...,...,...,...,...,...
735,23,1,250,23,0
736,23,1,250,23,0
737,23,1,250,23,0
738,23,1,250,23,0


In [10]:
df.duplicated().sum()

215

## Check the dtype

In [11]:
df.dtypes

Recency      int64
Frequency    int64
Monetary     int64
Time         int64
Target       int64
dtype: object

## Calculate the basic statistical values

In [12]:
df.describe()

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
count,748.0,748.0,748.0,748.0,748.0
mean,9.506684,5.514706,1378.676471,34.282086,0.237968
std,8.095396,5.839307,1459.826781,24.376714,0.426124
min,0.0,1.0,250.0,2.0,0.0
25%,2.75,2.0,500.0,16.0,0.0
50%,7.0,4.0,1000.0,28.0,0.0
75%,14.0,7.0,1750.0,50.0,0.0
max,74.0,50.0,12500.0,98.0,1.0


In [13]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Recency,748.0,9.506684,8.095396,0.0,2.75,7.0,14.0,74.0
Frequency,748.0,5.514706,5.839307,1.0,2.0,4.0,7.0,50.0
Monetary,748.0,1378.676471,1459.826781,250.0,500.0,1000.0,1750.0,12500.0
Time,748.0,34.282086,24.376714,2.0,16.0,28.0,50.0,98.0
Target,748.0,0.237968,0.426124,0.0,0.0,0.0,0.0,1.0


## Check unique values

In [14]:
df.Recency.unique()

array([ 2,  0,  1,  4,  5,  9,  3, 12,  6, 11, 10, 13,  8, 14,  7, 16, 15,
       23, 21, 18, 22, 26, 35, 38, 40, 74, 20, 17, 25, 39, 72])

In [15]:
df.Recency.nunique()

31

In [16]:
df.Target.unique()

array([1, 0])

In [17]:
df.Target.nunique()

2

## Calculate the average of 'Recency'

In [18]:
df.Recency.mean()

9.506684491978609

## Find the highest value in 'Frequency'

In [19]:
df.Frequency.max()

50

## Calculate the median of 'Time'

In [20]:
df.Time.median()

28.0

## Calculate the standard deviation of 'Monetary'

In [21]:
df.Monetary.std()

1459.8267807725035

## Count the number of unique values in 'Time'

In [22]:
df.Time.unique()

array([98, 28, 35, 45, 77,  4, 14, 22, 58, 47, 15, 11, 48, 49, 16, 40, 34,
       21, 26, 64, 57, 53, 69, 36,  2, 46, 52, 81, 29,  9, 74, 25, 51, 71,
       23, 86, 38, 76, 70, 59, 82, 61, 79, 41, 33, 10, 95, 88, 19, 37, 39,
       78, 42, 27, 24, 63, 43, 75, 73, 50, 60, 17, 72, 62, 30, 31, 65, 89,
       87, 93, 83, 32, 12, 18, 55,  3, 13, 54])

In [23]:
df.Time.nunique()

78

## Calculate the ratio of donors in March 2007 (Target=1) to total donors

In [24]:
df[df["Target"] == 1].count().iloc[0]

178

In [25]:
df[df["Target"] == 1].count().iloc[0] / len(df)

0.23796791443850268

In [26]:
df.Target.value_counts()

Target
0    570
1    178
Name: count, dtype: int64

In [27]:
df.Target.value_counts(normalize=True)

Target
0    0.762032
1    0.237968
Name: proportion, dtype: float64
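
Because the target is coded as 0/1, the share of donors can also be read off as the mean of the column. The sketch below checks this on a small made-up Series (the values are illustrative, not taken from the data set).

```python
import pandas as pd

# Toy 0/1 target column: 1 = donated, 0 = did not donate.
target = pd.Series([1, 0, 0, 1, 0], name="Target")

# Ratio computed by counting the ones explicitly.
ratio_by_count = (target == 1).sum() / len(target)

# Ratio computed as the mean of the 0/1 values.
ratio_by_mean = target.mean()

print(ratio_by_count, ratio_by_mean)
```

On the notebook's DataFrame the equivalent call would be `df["Target"].mean()`.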

## Filter donors with 'Recency' less than 10 months

In [28]:
df[df["Recency"] < 10]

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
653,4,2,500,30,0
656,4,2,500,31,0
669,2,3,750,75,1
670,2,3,750,77,0


In [29]:
len(df[df["Recency"] < 10])

401

## Select donors who donated at least 5 times

In [30]:
df[df["Frequency"] >= 5]

Unnamed: 0,Recency,Frequency,Monetary,Time,Target
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
713,16,6,1500,81,0
715,16,5,1250,71,0
719,23,8,2000,69,0
726,25,6,1500,50,0


In [31]:
len(df[df["Frequency"] >= 5])

329

## Create a new column giving the time between the first donation and the last donation

In [32]:
df["Donation_Period"] = df.Time - df.Recency
df["Donation_Period"]

0      96
1      28
2      34
3      43
4      76
       ..
743    15
744    31
745    39
746     0
747     0
Name: Donation_Period, Length: 748, dtype: int64

In [33]:
df

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_Period
0,2,50,12500,98,1,96
1,0,13,3250,28,1,28
2,1,16,4000,35,1,34
3,2,20,5000,45,1,43
4,1,24,6000,77,0,76
...,...,...,...,...,...,...
743,23,2,500,38,0,15
744,21,2,500,52,0,31
745,23,3,750,62,0,39
746,39,1,250,39,0,0


## Outlier Analysis for 'Frequency'

In [34]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Recency,748.0,9.506684,8.095396,0.0,2.75,7.0,14.0,74.0
Frequency,748.0,5.514706,5.839307,1.0,2.0,4.0,7.0,50.0
Monetary,748.0,1378.676471,1459.826781,250.0,500.0,1000.0,1750.0,12500.0
Time,748.0,34.282086,24.376714,2.0,16.0,28.0,50.0,98.0
Target,748.0,0.237968,0.426124,0.0,0.0,0.0,0.0,1.0
Donation_Period,748.0,24.775401,24.42063,0.0,0.0,19.0,41.0,96.0


In [35]:
q1 = df.Frequency.quantile(0.25)
q1

2.0

In [36]:
q3 = df.Frequency.quantile(0.75)
q3

7.0

In [37]:
iqr = q3 - q1
iqr

5.0

In [38]:
df[df["Frequency"] > 22]

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_Period
0,2,50,12500,98,1,96
4,1,24,6000,77,0,76
9,5,46,11500,98,1,93
10,4,23,5750,58,0,54
115,11,24,6000,64,0,53
341,23,38,9500,98,0,75
500,2,43,10750,86,1,84
502,2,34,8500,77,1,75
503,2,44,11000,98,0,96
504,0,26,6500,76,1,76


In [39]:
len(df[df["Frequency"] > 22])

13

In [40]:
len(df[df["Frequency"] > 14.5])

45
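
The cut-offs 14.5 and 22 used in the two cells above are consistent with Tukey-style upper fences, Q3 + 1.5 × IQR and Q3 + 3 × IQR, computed from the quartiles already shown (Q1 = 2, Q3 = 7, IQR = 5). The notebook does not state this, so treat it as an assumed reading. `upper_fence` is a hypothetical helper name.

```python
def upper_fence(q1, q3, k):
    """Upper outlier fence: Q3 plus k times the interquartile range."""
    return q3 + k * (q3 - q1)

# Quartiles of 'Frequency' as computed in the cells above.
q1, q3 = 2.0, 7.0

mild = upper_fence(q1, q3, 1.5)     # conventional outlier threshold
extreme = upper_fence(q1, q3, 3.0)  # stricter "extreme" outlier threshold

print(mild, extreme)  # 14.5 22.0
```

Deriving the thresholds from `q1` and `q3` instead of hard-coding them keeps the filter correct if the data changes.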

## Create a simple scoring model based on 'Recency' and 'Frequency'

In [41]:
df.head()

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_Period
0,2,50,12500,98,1,96
1,0,13,3250,28,1,28
2,1,16,4000,35,1,34
3,2,20,5000,45,1,43
4,1,24,6000,77,0,76


In [42]:
df["Danotiopn_Score"] = (1 / df.Recency) + df.Frequency
df["Danotiopn_Score"]

0      50.500000
1            inf
2      17.000000
3      20.500000
4      25.000000
         ...    
743     2.043478
744     2.047619
745     3.043478
746     1.025641
747     1.013889
Name: Danotiopn_Score, Length: 748, dtype: float64

In [43]:
df["Danotiopn_Score_1"] = np.where(df["Recency"] == 0, df.Frequency, (1 / df.Recency) + df.Frequency )
df["Danotiopn_Score_1"]

0      50.500000
1      13.000000
2      17.000000
3      20.500000
4      25.000000
         ...    
743     2.043478
744     2.047619
745     3.043478
746     1.025641
747     1.013889
Name: Danotiopn_Score_1, Length: 748, dtype: float64
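
Note that `np.where` evaluates both branches, so `1 / df.Recency` is still computed for rows where `Recency` is 0. One possible alternative, sketched here on three rows copied from the data preview, masks the zeros before dividing. The `toy` frame and the `Score` column name are hypothetical.

```python
import pandas as pd

# First three rows of the data set (Recency and Frequency only).
toy = pd.DataFrame({"Recency": [2, 0, 1], "Frequency": [50, 13, 16]})

# Turn Recency == 0 into NaN so the division never produces inf,
# then treat the missing reciprocal as 0.
safe_recency = toy["Recency"].where(toy["Recency"] != 0)
reciprocal = (1 / safe_recency).fillna(0)

toy["Score"] = reciprocal + toy["Frequency"]
print(toy["Score"].tolist())  # [50.5, 13.0, 17.0]
```

The result matches the `np.where` version for these rows; whether a donor with `Recency` 0 should receive no recency bonus at all is a modelling choice the notebook leaves open.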

In [44]:
df.head()

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_Period,Danotiopn_Score,Danotiopn_Score_1
0,2,50,12500,98,1,96,50.5,50.5
1,0,13,3250,28,1,28,inf,13.0
2,1,16,4000,35,1,34,17.0,17.0
3,2,20,5000,45,1,43,20.5,20.5
4,1,24,6000,77,0,76,25.0,25.0


## Convert Time to Years and Months (Time Series Transformation)

In [45]:
df.Time // 12

0      8
1      2
2      2
3      3
4      6
      ..
743    3
744    4
745    5
746    3
747    6
Name: Time, Length: 748, dtype: int64

In [46]:
df["Years"] = df.Time // 12
df["Years"]

0      8
1      2
2      2
3      3
4      6
      ..
743    3
744    4
745    5
746    3
747    6
Name: Years, Length: 748, dtype: int64

In [47]:
df.Time % 12

0       2
1       4
2      11
3       9
4       5
       ..
743     2
744     4
745     2
746     3
747     0
Name: Time, Length: 748, dtype: int64

In [48]:
df["Months"] = df.Time % 12
df["Months"]

0       2
1       4
2      11
3       9
4       5
       ..
743     2
744     4
745     2
746     3
747     0
Name: Months, Length: 748, dtype: int64

In [49]:
df.head()

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_Period,Danotiopn_Score,Danotiopn_Score_1,Years,Months
0,2,50,12500,98,1,96,50.5,50.5,8,2
1,0,13,3250,28,1,28,inf,13.0,2,4
2,1,16,4000,35,1,34,17.0,17.0,2,11
3,2,20,5000,45,1,43,20.5,20.5,3,9
4,1,24,6000,77,0,76,25.0,25.0,6,5


## Calculate the correlation of 'Target' with other features (Correlation Analysis)

In [50]:
df.corr()["Target"]

Recency             -0.279869
Frequency            0.218633
Monetary             0.218633
Time                -0.035854
Target               1.000000
Donation_Period      0.056986
Danotiopn_Score      0.216745
Danotiopn_Score_1    0.225380
Years               -0.032680
Months              -0.021089
Name: Target, dtype: float64

In [51]:
df.corr()["Target"].sort_values(ascending = False)

Target               1.000000
Danotiopn_Score_1    0.225380
Monetary             0.218633
Frequency            0.218633
Danotiopn_Score      0.216745
Donation_Period      0.056986
Months              -0.021089
Years               -0.032680
Time                -0.035854
Recency             -0.279869
Name: Target, dtype: float64
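
`Frequency` and `Monetary` show exactly the same correlation with `Target`. In every row displayed in this notebook, `Monetary` equals 250 times `Frequency`, and two columns that differ only by a positive constant factor correlate identically with any third column. The sketch below checks this on the first five rows; whether the relationship holds for the whole data set is an assumption that can be verified with the same comparison on `df`.

```python
import pandas as pd

# First five rows of the data set as displayed by df.head().
toy = pd.DataFrame({
    "Frequency": [50, 13, 16, 20, 24],
    "Monetary": [12500, 3250, 4000, 5000, 6000],
    "Target": [1, 1, 1, 1, 0],
})

# Does Monetary equal a fixed 250 c.c. per donation in these rows?
is_multiple = (toy["Monetary"] == 250 * toy["Frequency"]).all()

# Correlation of each column with the target.
corr_freq = toy["Frequency"].corr(toy["Target"])
corr_mon = toy["Monetary"].corr(toy["Target"])

print(is_multiple, corr_freq, corr_mon)
```

If the relationship holds throughout, one of the two columns is redundant for modelling.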

In [52]:
df.head()

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_Period,Danotiopn_Score,Danotiopn_Score_1,Years,Months
0,2,50,12500,98,1,96,50.5,50.5,8,2
1,0,13,3250,28,1,28,inf,13.0,2,4
2,1,16,4000,35,1,34,17.0,17.0,2,11
3,2,20,5000,45,1,43,20.5,20.5,3,9
4,1,24,6000,77,0,76,25.0,25.0,6,5


## Create donor groups based on 'Frequency' (Grouping and Aggregation)

In [53]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Recency,748.0,9.506684,8.095396,0.0,2.75,7.0,14.0,74.0
Frequency,748.0,5.514706,5.839307,1.0,2.0,4.0,7.0,50.0
Monetary,748.0,1378.676471,1459.826781,250.0,500.0,1000.0,1750.0,12500.0
Time,748.0,34.282086,24.376714,2.0,16.0,28.0,50.0,98.0
Target,748.0,0.237968,0.426124,0.0,0.0,0.0,0.0,1.0
Donation_Period,748.0,24.775401,24.42063,0.0,0.0,19.0,41.0,96.0
Danotiopn_Score,748.0,inf,,1.013514,2.071429,4.071429,7.25,inf
Danotiopn_Score_1,748.0,5.739404,5.87564,1.013514,2.071429,4.071429,7.25,50.5
Years,748.0,2.394385,2.038011,0.0,1.0,2.0,4.0,8.0
Months,748.0,5.549465,3.546058,0.0,2.0,4.0,9.0,11.0


In [54]:
bins = [0, 4, 14, 50 ]

group_names = ["Low", "Medium", "High"]


df["Frequency_Group"] = pd.cut(df.Frequency, bins, labels= group_names)


df["Frequency_Group"]

0        High
1      Medium
2        High
3        High
4        High
        ...  
743       Low
744       Low
745       Low
746       Low
747       Low
Name: Frequency_Group, Length: 748, dtype: category
Categories (3, object): ['Low' < 'Medium' < 'High']
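
By default `pd.cut` builds intervals that are open on the left and closed on the right, so the bins above correspond to (0, 4], (4, 14] and (14, 50]. The sketch below runs the same bins on a few hand-picked edge values to show where each boundary falls.

```python
import pandas as pd

bins = [0, 4, 14, 50]
group_names = ["Low", "Medium", "High"]

# Values sitting on and just past each bin edge.
edge_values = pd.Series([1, 4, 5, 14, 15, 50])
groups = pd.cut(edge_values, bins, labels=group_names)

print(groups.tolist())
# ['Low', 'Low', 'Medium', 'Medium', 'High', 'High']
```

A value equal to the lowest edge (0 here) would fall outside every interval and become NaN, which matters for columns that can contain 0.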

In [55]:
df.sample(10)

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_Period,Danotiopn_Score,Danotiopn_Score_1,Years,Months,Frequency_Group
374,14,4,1000,40,0,26,4.071429,4.071429,3,4,Low
145,4,3,750,16,1,12,3.25,3.25,1,4,Low
554,2,4,1000,23,0,21,4.5,4.5,1,11,Low
106,0,8,2000,59,0,59,inf,8.0,4,11,Medium
438,23,14,3500,93,0,70,14.043478,14.043478,7,9,Medium
538,2,8,2000,38,1,36,8.5,8.5,3,2,Medium
36,2,12,3000,47,1,45,12.5,12.5,3,11,Medium
210,4,1,250,4,0,0,1.25,1.25,0,4,Low
283,8,2,500,16,0,8,2.125,2.125,1,4,Low
485,23,4,1000,53,0,30,4.043478,4.043478,4,5,Low


## Create a new categorical variable based on 'Recency'

In [56]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Recency,748.0,9.506684,8.095396,0.0,2.75,7.0,14.0,74.0
Frequency,748.0,5.514706,5.839307,1.0,2.0,4.0,7.0,50.0
Monetary,748.0,1378.676471,1459.826781,250.0,500.0,1000.0,1750.0,12500.0
Time,748.0,34.282086,24.376714,2.0,16.0,28.0,50.0,98.0
Target,748.0,0.237968,0.426124,0.0,0.0,0.0,0.0,1.0
Donation_Period,748.0,24.775401,24.42063,0.0,0.0,19.0,41.0,96.0
Danotiopn_Score,748.0,inf,,1.013514,2.071429,4.071429,7.25,inf
Danotiopn_Score_1,748.0,5.739404,5.87564,1.013514,2.071429,4.071429,7.25,50.5
Years,748.0,2.394385,2.038011,0.0,1.0,2.0,4.0,8.0
Months,748.0,5.549465,3.546058,0.0,2.0,4.0,9.0,11.0


In [57]:
bins = [-1, 12, 24, 36, 75 ]

group_names = ["0-12 Months", "13-24 Months", "25-36 Months", "37-74 Months"]


df["Recency_Group"] = pd.cut(df.Recency, bins, labels= group_names)


df["Recency_Group"]

0       0-12 Months
1       0-12 Months
2       0-12 Months
3       0-12 Months
4       0-12 Months
           ...     
743    13-24 Months
744    13-24 Months
745    13-24 Months
746    37-74 Months
747    37-74 Months
Name: Recency_Group, Length: 748, dtype: category
Categories (4, object): ['0-12 Months' < '13-24 Months' < '25-36 Months' < '37-74 Months']

In [58]:
df.head()

Unnamed: 0,Recency,Frequency,Monetary,Time,Target,Donation_Period,Danotiopn_Score,Danotiopn_Score_1,Years,Months,Frequency_Group,Recency_Group
0,2,50,12500,98,1,96,50.5,50.5,8,2,High,0-12 Months
1,0,13,3250,28,1,28,inf,13.0,2,4,Medium,0-12 Months
2,1,16,4000,35,1,34,17.0,17.0,2,11,High,0-12 Months
3,2,20,5000,45,1,43,20.5,20.5,3,9,High,0-12 Months
4,1,24,6000,77,0,76,25.0,25.0,6,5,High,0-12 Months


## Check the distribution of the 'Target' variable

In [59]:
df.Target.value_counts(normalize=True)

Target
0    0.762032
1    0.237968
Name: proportion, dtype: float64

In [60]:
df.Frequency_Group.value_counts()

Frequency_Group
Low       419
Medium    284
High       45
Name: count, dtype: int64

In [61]:
df.Frequency_Group.value_counts(normalize=True)

Frequency_Group
Low       0.560160
Medium    0.379679
High      0.060160
Name: proportion, dtype: float64

In [62]:
df.Frequency_Group.unique()

['High', 'Medium', 'Low']
Categories (3, object): ['Low' < 'Medium' < 'High']

In [63]:
df.Frequency_Group.nunique()

3

## Feature Analysis

In [64]:
output_data = []

for col in df.columns:
    
    # If the number of unique values in the column is less than or equal to 5
    if df.loc[:, col].nunique() <= 5:
        # Get the unique values in the column
        unique_values = df.loc[:, col].unique()
        # Append the column name, number of unique values, unique values, and data type to the output data
        output_data.append([col, df.loc[:, col].nunique(), unique_values, df.loc[:, col].dtype])
    else:
        # Otherwise, append only the column name, number of unique values, and data type to the output data
        output_data.append([col, df.loc[:, col].nunique(),"-", df.loc[:, col].dtype])

output_df = pd.DataFrame(output_data, columns=['Column Name', 'Number of Unique Values', 'Unique Values', 'Data Type'])

output_df

Unnamed: 0,Column Name,Number of Unique Values,Unique Values,Data Type
0,Recency,31,-,int64
1,Frequency,33,-,int64
2,Monetary,33,-,int64
3,Time,78,-,int64
4,Target,2,"[1, 0]",int64
5,Donation_Period,90,-,int64
6,Danotiopn_Score,184,-,float64
7,Danotiopn_Score_1,186,-,float64
8,Years,9,-,int64
9,Months,12,-,int64


## Classify DataFrame Columns into Categorical and Numeric Types

In [65]:
def grab_col_names(dataframe, cat_th=10):

    # cat_cols
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]    
    cat_cols = cat_cols + num_but_cat
    

    # num_cols
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in cat_cols]

    print(f"Features: {dataframe.shape[1]}")
    print(f'Number of Categorical Features: {len(cat_cols)}')
    print(f'Number of Numeric Features: {len(num_cols)}')
    print(f"Categorical Features: {cat_cols}") 
    print(f"Numeric Features: {num_cols}")
    
    return cat_cols, num_cols

In [66]:
cat_cols, num_cols = grab_col_names(df)

Features: 12
Number of Categorical Features: 4
Number of Numeric Features: 8
Categorical Features: ['Target', 'Years', 'Frequency_Group', 'Recency_Group']
Numeric Features: ['Recency', 'Frequency', 'Monetary', 'Time', 'Donation_Period', 'Danotiopn_Score', 'Danotiopn_Score_1', 'Months']
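
In `grab_col_names`, the `== "O"` test only matches object columns. The `category` columns created with `pd.cut` end up in the categorical list because they have fewer than `cat_th` unique values. As a possible alternative, columns can be selected directly by dtype with `select_dtypes`. This is a sketch on a small hypothetical frame (`toy`), not a replacement for the function above.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "Recency": [2, 0, 23],
    "Frequency_Group": pd.Categorical(["High", "Medium", "Low"]),
})

# Columns stored with the category dtype.
cat_like = toy.select_dtypes(include=["category"]).columns.tolist()

# Columns stored with any numeric dtype.
numeric = toy.select_dtypes(include="number").columns.tolist()

print(cat_like, numeric)
```

A dtype-based split does not catch low-cardinality integer columns such as `Target`, so the unique-value threshold used in `grab_col_names` is still useful for those.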


## DataFrame Summary Statistics

In [67]:
def summary(df):
    # Per-column overview: dtype, non-null count, unique values, nulls, min and max
    types = df.dtypes
    counts = df.count()
    uniques = df.apply(lambda x: x.unique().shape[0])
    nulls = df.isnull().sum()
    col_min = df.min()
    col_max = df.max()
    print('Data shape:', df.shape)

    # Combine the per-column Series into one table, one row per column of df
    result = pd.concat([types, counts, uniques, nulls, col_min, col_max], axis=1, sort=True)
    result.columns = ['Types', 'Counts', 'Uniques', 'Nulls', 'Min', 'Max']

    print('___________________________\nData Types:')
    print(result.Types.value_counts())
    print('___________________________')
    return result

summary(df)

Data shape: (748, 12)
___________________________
Data Types:
Types
int64       8
float64     2
category    1
category    1
Name: count, dtype: int64
___________________________


Unnamed: 0,Types,Counts,Uniques,Nulls,Min,Max
Danotiopn_Score,float64,748,184,0,1.013514,inf
Danotiopn_Score_1,float64,748,186,0,1.013514,50.5
Donation_Period,int64,748,90,0,0,96
Frequency,int64,748,33,0,1,50
Frequency_Group,category,748,3,0,Low,High
Monetary,int64,748,33,0,250,12500
Months,int64,748,12,0,0,11
Recency,int64,748,31,0,0,74
Recency_Group,category,748,4,0,0-12 Months,37-74 Months
Target,int64,748,2,0,0,1


## Check and Remove Duplicate Rows

In [68]:
def duplicate_values(df):
    print("Duplicate check...")
    num_duplicates = df.duplicated(subset=None, keep='first').sum()
    if num_duplicates > 0:
        print("There are", num_duplicates, "duplicated observations in the dataset.")
        df.drop_duplicates(keep='first', inplace=True)
        print(num_duplicates, "duplicates were dropped!")
        print('*' * 100)
    else:
        print("There are no duplicated observations in the dataset.")

In [69]:
duplicate_values(df)

Duplicate check...
There are 215 duplicated observations in the dataset.
215 duplicates were dropped!
****************************************************************************************************


In [70]:
duplicate_values(df)

Duplicate check...
There are no duplicated observations in the dataset.


In [71]:
df.shape

(533, 12)

## Missing Value Analysis in DataFrame

In [72]:
def missing_values(df):
    # Number and fraction of missing values per column, largest first
    missing_number = df.isnull().sum().sort_values(ascending = False)
    missing_percent = (df.isnull().sum() / len(df)).sort_values(ascending = False)
    result = pd.concat([missing_number, missing_percent], axis = 1, keys = ['Missing_Number', 'Missing_Percent'])
    # Keep only the columns that actually contain missing values
    return result[result['Missing_Number'] > 0]

In [73]:
missing_values(df)

Unnamed: 0,Missing_Number,Missing_Percent


## Column Value Distribution Analysis

In [74]:
def value_cnt(df, column_name):
    # Absolute and relative frequency of each value in the column
    vc = df[column_name].value_counts()
    vc_norm = df[column_name].value_counts(normalize=True).round(3)

    vc = vc.rename_axis(column_name).reset_index(name='counts')
    vc_norm = vc_norm.rename_axis(column_name).reset_index(name='norm_counts')

    df_result = pd.concat([vc[column_name], vc['counts'], vc_norm['norm_counts']], axis=1)
    
    return df_result

In [75]:
# Target incidence is the number of occurrences of each Target value in the data set,
# i.e. how many 1's versus how many 0's there are in the Target column.
# It gives us an idea of how balanced (or unbalanced) the data set is.

value_cnt(df, 'Target')

Unnamed: 0,Target,counts,norm_counts
0,0,384,0.72
1,1,149,0.28
