## Import & Load Data

In [5]:
import pandas as pd
import numpy as np

In [6]:
df = pd.read_csv("bank_transactions_cleaned.csv")

In [7]:
df.head()

| | Customer ID | DOB | Gender | City/District | Account Balance | Transaction Date | Transaction Amount |
|---|---|---|---|---|---|---|---|
| 0 | C1010011 | 1992-08-19 | F | NOIDA | 32500.73 | 2016-09-26 | 4750.0 |
| 1 | C1010014 | 1992-06-04 | F | MUMBAI | 38377.14 | 2016-08-01 | 1205.0 |
| 2 | C1010018 | 1990-05-29 | F | OTHER CITY | 496.18 | 2016-09-15 | 30.0 |
| 3 | C1010028 | 1988-08-25 | F | DELHI | 296828.37 | 2016-08-29 | 557.0 |
| 4 | C1010038 | 1992-07-13 | F | OTHER CITY | 1290.76 | 2016-09-07 | 100.0 |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 984259 entries, 0 to 984258
Data columns (total 7 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Customer ID         984259 non-null  object 
 1   DOB                 984259 non-null  object 
 2   Gender              984259 non-null  object 
 3   City/District       984259 non-null  object 
 4   Account Balance     984259 non-null  float64
 5   Transaction Date    984259 non-null  object 
 6   Transaction Amount  984259 non-null  float64
dtypes: float64(2), object(5)
memory usage: 52.6+ MB


In [9]:
df['DOB'] = pd.to_datetime(df['DOB'])
df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])
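
The conversion above lets pandas infer the date format. As a minimal sketch, assuming the ISO `YYYY-MM-DD` strings seen in the preview, the format can be passed explicitly and unparseable values coerced to `NaT` so they can be counted. The `sample` frame and its bad value are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring the ISO-formatted date strings in the preview
sample = pd.DataFrame({
    "DOB": ["1992-08-19", "1992-06-04", "not a date"],
    "Transaction Date": ["2016-09-26", "2016-08-01", "2016-09-15"],
})

for col in ["DOB", "Transaction Date"]:
    # An explicit format skips inference; errors="coerce" turns bad values into NaT
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d", errors="coerce")

n_bad_dob = int(sample["DOB"].isna().sum())
print(sample.dtypes)
print("Unparseable DOB values:", n_bad_dob)
```

Counting the `NaT` values after coercion gives a quick check that no dates were silently lost.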

## Aggregate Customer Data

### Subtask:
Group the DataFrame by 'Customer ID' and compute 'recency' (date of the most recent transaction), 'frequency' (number of transactions), 'monetary' (total transaction amount), 'gender', 'city' (city/district), 'balance' (account balance, taken from the customer's first row), and 'dob' (date of birth).


In [10]:
customer_df = df.groupby('Customer ID').agg(
    recency=('Transaction Date', 'max'),
    frequency=('Transaction Amount', 'count'),
    monetary=('Transaction Amount', 'sum'),
    gender=('Gender', 'first'),
    city=('City/District', 'first'),
    balance=('Account Balance', 'first'),
    dob=('DOB', 'first')
).reset_index()

print("Aggregated customer DataFrame created:")
print(customer_df.head())

Aggregated customer DataFrame created:
  Customer ID    recency  frequency  monetary gender        city   balance  \
0    C1010011 2016-09-26          2    5106.0      F       NOIDA  32500.73   
1    C1010012 2016-08-14          1    1499.0      M      MUMBAI  24204.49   
2    C1010014 2016-08-07          2    1455.0      F      MUMBAI  38377.14   
3    C1010018 2016-09-15          1      30.0      F  OTHER CITY    496.18   
4    C1010024 2016-08-18          1    5000.0      M     KOLKATA  87058.65   

         dob  
0 1992-08-19  
1 1994-07-28  
2 1992-06-04  
3 1990-05-29  
4 1965-06-21  
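
Using `first` for gender, city, balance, and date of birth implicitly assumes these values do not vary across a customer's transactions. A minimal sketch of how that assumption could be checked is shown below, on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical transactions: C1 has two balances, C2 has two genders
toy = pd.DataFrame({
    "Customer ID": ["C1", "C1", "C2", "C2", "C3"],
    "Gender": ["F", "F", "M", "F", "M"],
    "Account Balance": [100.0, 250.0, 80.0, 80.0, 10.0],
})

# Count distinct values per customer for each attribute column
distinct = toy.groupby("Customer ID")[["Gender", "Account Balance"]].nunique()

# Customers whose attributes take more than one value across their rows
inconsistent = distinct[(distinct > 1).any(axis=1)]
print(inconsistent)
```

If the real data shows many such customers, `first` picks whichever row happens to come first, so a different rule (for example `last` or `mean` for the balance) may be worth considering.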


**Reasoning**:
The next step is to display the information of the newly created `customer_df` DataFrame to understand its structure, data types, and non-null counts, as requested by the main task.



In [11]:
customer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 838335 entries, 0 to 838334
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Customer ID  838335 non-null  object        
 1   recency      838335 non-null  datetime64[ns]
 2   frequency    838335 non-null  int64         
 3   monetary     838335 non-null  float64       
 4   gender       838335 non-null  object        
 5   city         838335 non-null  object        
 6   balance      838335 non-null  float64       
 7   dob          838335 non-null  datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(1), object(3)
memory usage: 51.2+ MB


In [12]:
customer_df.head()

| | Customer ID | recency | frequency | monetary | gender | city | balance | dob |
|---|---|---|---|---|---|---|---|---|
| 0 | C1010011 | 2016-09-26 | 2 | 5106.0 | F | NOIDA | 32500.73 | 1992-08-19 |
| 1 | C1010012 | 2016-08-14 | 1 | 1499.0 | M | MUMBAI | 24204.49 | 1994-07-28 |
| 2 | C1010014 | 2016-08-07 | 2 | 1455.0 | F | MUMBAI | 38377.14 | 1992-06-04 |
| 3 | C1010018 | 2016-09-15 | 1 | 30.0 | F | OTHER CITY | 496.18 | 1990-05-29 |
| 4 | C1010024 | 2016-08-18 | 1 | 5000.0 | M | KOLKATA | 87058.65 | 1965-06-21 |


**Reasoning**:
The final step of the main task is to create a histogram of the 'monetary' column from the `customer_df` DataFrame to visualize the distribution of total transaction amounts.
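
No plotting cell accompanies this note. As a rough numeric stand-in, the sketch below bins the ten `Monetary` values that appear in the `customer_df.head(10)` preview later in this notebook. The bin edges are an assumption chosen for illustration, and `np.histogram` only produces the counts that a histogram plot would draw.

```python
import numpy as np

# Monetary values of the first ten customers, as shown in the head(10) preview
monetary = np.array([5106.0, 1499.0, 1455.0, 30.0, 5000.0,
                     557.0, 1864.0, 750.0, 208.0, 19680.0])

# Hypothetical bin edges; the final bin includes its right edge
edges = [0, 1000, 5000, 20000]
counts, _ = np.histogram(monetary, bins=edges)

for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:>6} to {hi:>6}: {n}")
```

Because transaction totals tend to be right-skewed, wider bins at the high end (or a log scale) may give a more readable picture than equal-width bins.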



# Task
Filter the original DataFrame `df` to include only transactions between '2016-08-01' and '2016-09-30', ignoring rows outside this date range for subsequent RFM calculations.

## Filter Transactions by Date

### Subtask:
Filter the original transaction DataFrame (`df`) to include only transactions within the date range '2016-08-01' to '2016-09-30'. All rows outside this date range will be ignored for subsequent RFM calculations.


In [13]:
start_date = '2016-08-01'
end_date = '2016-09-30'
df = df[(df['Transaction Date'] >= start_date) & (df['Transaction Date'] <= end_date)]

print("DataFrame filtered to include transactions from 2016-08-01 to 2016-09-30.")
print(df.head())

DataFrame filtered to include transactions from 2016-08-01 to 2016-09-30.
  Customer ID        DOB Gender City/District  Account Balance  \
0    C1010011 1992-08-19      F         NOIDA         32500.73   
1    C1010014 1992-06-04      F        MUMBAI         38377.14   
2    C1010018 1990-05-29      F    OTHER CITY           496.18   
3    C1010028 1988-08-25      F         DELHI        296828.37   
4    C1010038 1992-07-13      F    OTHER CITY          1290.76   

  Transaction Date  Transaction Amount  
0       2016-09-26              4750.0  
1       2016-08-01              1205.0  
2       2016-09-15                30.0  
3       2016-08-29               557.0  
4       2016-09-07               100.0  
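
An equivalent way to express the same inclusive filter is `Series.between`, which includes both endpoints by default. The sketch below uses a hypothetical four-row frame (`toy`) with dates on either side of the boundaries. Taking a `.copy()` of the result is an optional choice that avoids chained-assignment warnings if columns are added later.

```python
import pandas as pd

# Hypothetical rows straddling both ends of the date window
toy = pd.DataFrame({
    "Customer ID": ["C1", "C2", "C3", "C4"],
    "Transaction Date": pd.to_datetime(
        ["2016-07-31", "2016-08-01", "2016-09-30", "2016-10-01"]
    ),
})

start_date = pd.Timestamp("2016-08-01")
end_date = pd.Timestamp("2016-09-30")

# between() is inclusive on both ends by default, matching >= start and <= end
mask = toy["Transaction Date"].between(start_date, end_date)
filtered = toy.loc[mask].copy()

print(filtered)
```

Note that if the timestamps carried a time of day, a transaction at 2016-09-30 10:00 would fall outside `<= '2016-09-30'`. The dates in this dataset appear to be date-only, so that case should not arise here.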


## Re-aggregate Customer Data

### Subtask:
Re-aggregate the customer data by 'Customer ID' from the filtered DataFrame. Compute 'recency' (date of the most recent transaction), 'frequency' (number of transactions), 'monetary' (total transaction amount), 'gender', 'city', 'balance', and 'dob' from the filtered transaction data.


In [14]:
customer_df = df.groupby('Customer ID').agg(
    recency=('Transaction Date', 'max'),
    frequency=('Transaction Amount', 'count'),
    monetary=('Transaction Amount', 'sum'),
    gender=('Gender', 'first'),
    city=('City/District', 'first'),
    balance=('Account Balance', 'first'),
    dob=('DOB', 'first')
).reset_index()

print("Re-aggregated customer DataFrame created from filtered data:")
print(customer_df.head())

Re-aggregated customer DataFrame created from filtered data:
  Customer ID    recency  frequency  monetary gender        city   balance  \
0    C1010011 2016-09-26          2    5106.0      F       NOIDA  32500.73   
1    C1010012 2016-08-14          1    1499.0      M      MUMBAI  24204.49   
2    C1010014 2016-08-07          2    1455.0      F      MUMBAI  38377.14   
3    C1010018 2016-09-15          1      30.0      F  OTHER CITY    496.18   
4    C1010024 2016-08-18          1    5000.0      M     KOLKATA  87058.65   

         dob  
0 1992-08-19  
1 1994-07-28  
2 1992-06-04  
3 1990-05-29  
4 1965-06-21  


## Calculate Recency in Days

### Subtask:
Compute 'recency' in days for each customer by subtracting the customer's last transaction date from the most recent transaction date in the entire filtered dataset.


In [15]:
latest_transaction_date = customer_df['recency'].max()
customer_df['recency'] = (latest_transaction_date - customer_df['recency']).dt.days

print("Recency in days calculated and updated in customer_df:")
print(customer_df.head())

Recency in days calculated and updated in customer_df:
  Customer ID  recency  frequency  monetary gender        city   balance  \
0    C1010011        4          2    5106.0      F       NOIDA  32500.73   
1    C1010012       47          1    1499.0      M      MUMBAI  24204.49   
2    C1010014       54          2    1455.0      F      MUMBAI  38377.14   
3    C1010018       15          1      30.0      F  OTHER CITY    496.18   
4    C1010024       43          1    5000.0      M     KOLKATA  87058.65   

         dob  
0 1992-08-19  
1 1994-07-28  
2 1992-06-04  
3 1990-05-29  
4 1965-06-21  
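
The first three rows can be reproduced by hand. The sketch below assumes the latest transaction date in the filtered data is 2016-09-30, which is consistent with the filter's end date and with a recency of 4 days for a last transaction on 2016-09-26.

```python
import pandas as pd

# Last transaction dates for the first three customers in the output above
last_txn = pd.Series(
    pd.to_datetime(["2016-09-26", "2016-08-14", "2016-08-07"]),
    index=["C1010011", "C1010012", "C1010014"],
)

# Assumed reference date: the most recent transaction in the filtered data
snapshot = pd.Timestamp("2016-09-30")

# Subtracting gives timedeltas; .dt.days extracts whole days
recency_days = (snapshot - last_txn).dt.days
print(recency_days)
```

A customer who transacted on the reference date itself gets a recency of 0, so lower values always mean more recent activity.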


## Calculate Recency Score (R_score)

### Subtask:
Compute 'R_score' from the 'recency' column (in days) using quantiles (1-3). A score of 3 is assigned to the lowest recency values (most recent), 2 to the middle group, and 1 to the highest.


In [16]:
r_labels = [3, 2, 1]
customer_df['R_score'] = pd.qcut(customer_df['recency'], q=3, labels=r_labels, duplicates='drop')

print("R_score calculated and added to customer_df:")
print(customer_df.head())

R_score calculated and added to customer_df:
  Customer ID  recency  frequency  monetary gender        city   balance  \
0    C1010011        4          2    5106.0      F       NOIDA  32500.73   
1    C1010012       47          1    1499.0      M      MUMBAI  24204.49   
2    C1010014       54          2    1455.0      F      MUMBAI  38377.14   
3    C1010018       15          1      30.0      F  OTHER CITY    496.18   
4    C1010024       43          1    5000.0      M     KOLKATA  87058.65   

         dob R_score  
0 1992-08-19       3  
1 1994-07-28       1  
2 1992-06-04       1  
3 1990-05-29       3  
4 1965-06-21       2  
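
One caveat with `duplicates='drop'`: if heavy ties make two quantile edges equal, pandas drops a bin and the three fixed labels no longer fit, which raises a `ValueError`. That did not happen on this dataset. A minimal sketch of a common workaround, ranking the values first so every edge is unique, is shown below on a hypothetical tie-heavy series. It illustrates an alternative and is not the method used in this notebook.

```python
import pandas as pd

# Hypothetical recency values with many ties at zero
recency = pd.Series([0, 0, 0, 0, 5, 9])

try:
    pd.qcut(recency, q=3, labels=[3, 2, 1], duplicates="drop")
    collapsed = False
except ValueError:
    # Two edges coincide at 0, leaving 2 bins for 3 labels
    collapsed = True

# rank(method="first") breaks ties by order of appearance, giving unique values
ranked = recency.rank(method="first")
r_score = pd.qcut(ranked, q=3, labels=[3, 2, 1])

print("Direct qcut failed:", collapsed)
print(r_score.tolist())
```

The trade-off is that tied customers can land in different score groups purely because of row order, so this approach suits cases where equal-sized groups matter more than identical treatment of ties.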


## Calculate Frequency Score (F_score)

### Subtask:
Compute 'F_score' from the 'frequency' column using a manual rule: score 1 if the transaction frequency equals 1, and score 2 if the transaction frequency is 2 or more.


In [17]:
customer_df['F_score'] = np.where(customer_df['frequency'] >= 2, 2, 1)

print("F_score calculated and added to customer_df:")
print(customer_df.head())

F_score calculated and added to customer_df:
  Customer ID  recency  frequency  monetary gender        city   balance  \
0    C1010011        4          2    5106.0      F       NOIDA  32500.73   
1    C1010012       47          1    1499.0      M      MUMBAI  24204.49   
2    C1010014       54          2    1455.0      F      MUMBAI  38377.14   
3    C1010018       15          1      30.0      F  OTHER CITY    496.18   
4    C1010024       43          1    5000.0      M     KOLKATA  87058.65   

         dob R_score  F_score  
0 1992-08-19       3        2  
1 1994-07-28       1        1  
2 1992-06-04       1        2  
3 1990-05-29       3        1  
4 1965-06-21       2        1  


**Reasoning**:
To calculate the 'M_score' based on the 'monetary' column, I will use `pd.qcut` to divide the column into three quantiles and assign scores (1, 2, 3) to represent the monetary value, with higher monetary values getting a higher score.



In [18]:
m_labels = [1, 2, 3]
customer_df['M_score'] = pd.qcut(customer_df['monetary'], q=3, labels=m_labels, duplicates='drop')

print("M_score calculated and added to customer_df:")
print(customer_df.head())

M_score calculated and added to customer_df:
  Customer ID  recency  frequency  monetary gender        city   balance  \
0    C1010011        4          2    5106.0      F       NOIDA  32500.73   
1    C1010012       47          1    1499.0      M      MUMBAI  24204.49   
2    C1010014       54          2    1455.0      F      MUMBAI  38377.14   
3    C1010018       15          1      30.0      F  OTHER CITY    496.18   
4    C1010024       43          1    5000.0      M     KOLKATA  87058.65   

         dob R_score  F_score M_score  
0 1992-08-19       3        2       3  
1 1994-07-28       1        1       3  
2 1992-06-04       1        2       3  
3 1990-05-29       3        1       1  
4 1965-06-21       2        1       3  


## Display RFM Scores

### Subtask:
Display the first few rows of the updated `customer_df` DataFrame, including the 'recency', 'frequency', 'monetary', 'R_score', 'F_score', and 'M_score' columns, to verify the scoring results.


**Reasoning**:
To verify the RFM scoring, I will display the first few rows of the `customer_df` DataFrame, explicitly selecting the 'recency', 'frequency', 'monetary', 'R_score', 'F_score', and 'M_score' columns as requested.



In [19]:
customer_df[['recency', 'frequency', 'monetary', 'R_score', 'F_score', 'M_score']].head()

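| | recency | frequency | monetary | R_score | F_score | M_score |
|---|---|---|---|---|---|---|
| 0 | 4 | 2 | 5106.0 | 3 | 2 | 3 |
| 1 | 47 | 1 | 1499.0 | 1 | 1 | 3 |
| 2 | 54 | 2 | 1455.0 | 1 | 2 | 3 |
| 3 | 15 | 1 | 30.0 | 3 | 1 | 1 |
| 4 | 43 | 1 | 5000.0 | 2 | 1 | 3 |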


## Transform Customer Data

### Subtask:
Change the value 'F' to 'Female' and 'M' to 'Male' in the `gender` column.

In [20]:
# Replace gender initials with full labels
customer_df['gender'] = customer_df['gender'].replace({'F': 'Female', 'M': 'Male'})

print("Column 'gender' has been updated:")
print(customer_df['gender'].head())

Column 'gender' has been updated:
0    Female
1      Male
2    Female
3    Female
4      Male
Name: gender, dtype: object


### Subtask:
Round the `monetary` and `balance` columns to two decimal places.

In [21]:
# Round the 'monetary' and 'balance' columns to two decimal places
customer_df['monetary'] = customer_df['monetary'].round(2)
customer_df['balance'] = customer_df['balance'].round(2)

print("Columns 'monetary' and 'balance' have been rounded:")
print(customer_df[['monetary', 'balance']].head())

Columns 'monetary' and 'balance' have been rounded:
   monetary   balance
0    5106.0  32500.73
1    1499.0  24204.49
2    1455.0  38377.14
3      30.0    496.18
4    5000.0  87058.65
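
As the output shows, `round(2)` limits stored precision but does not force two digits on display: `5106.0` still prints with one decimal. If a fixed two-decimal presentation is the goal, one option is to format at output time. The sketch below uses a two-row frame (`toy`) built from values in the output above. Treat it as an illustration of the difference, not as a required step.

```python
import pandas as pd

# Two rows taken from the rounded output above
toy = pd.DataFrame({"monetary": [5106.0, 30.0], "balance": [32500.73, 496.18]})

# Option 1: format as strings for display (the column is no longer numeric)
display_monetary = toy["monetary"].map("{:.2f}".format)

# Option 2: keep floats in memory and fix the format only when writing CSV
csv_text = toy.to_csv(index=False, float_format="%.2f")

print(display_monetary.tolist())
print(csv_text)
```

Option 2 is usually preferable when the column is still needed for arithmetic, because the data stays numeric and only the exported text changes.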


### Subtask:
Rename the columns and reorder them as requested.

In [22]:
# Rename columns as requested
customer_df = customer_df.rename(columns={
    'dob': 'DoB',
    'gender': 'Gender',  # Rename lowercase 'gender' to capitalized 'Gender'
    'city': 'City',
    'recency': 'Recency',
    'frequency': 'Frequency',
    'monetary': 'Monetary',
    'R_score': 'R Score',
    'F_score': 'F Score',
    'M_score': 'M Score',
    'balance': 'Account Balance'
})

# Reorder columns as requested
desired_columns = [
    'Customer ID', 'DoB', 'Gender', 'City',
    'Recency', 'Frequency', 'Monetary',
    'R Score', 'F Score', 'M Score', 'Account Balance'
]
customer_df = customer_df[desired_columns]

print("DataFrame after renaming and reordering columns:")
print(customer_df.head())

DataFrame after renaming and reordering columns:
  Customer ID        DoB  Gender        City  Recency  Frequency  Monetary  \
0    C1010011 1992-08-19  Female       NOIDA        4          2    5106.0   
1    C1010012 1994-07-28    Male      MUMBAI       47          1    1499.0   
2    C1010014 1992-06-04  Female      MUMBAI       54          2    1455.0   
3    C1010018 1990-05-29  Female  OTHER CITY       15          1      30.0   
4    C1010024 1965-06-21    Male     KOLKATA       43          1    5000.0   

  R Score  F Score M Score  Account Balance  
0       3        2       3         32500.73  
1       1        1       3         24204.49  
2       1        2       3         38377.14  
3       3        1       1           496.18  
4       2        1       3         87058.65  


In [23]:
customer_df.head(10)

| | Customer ID | DoB | Gender | City | Recency | Frequency | Monetary | R Score | F Score | M Score | Account Balance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C1010011 | 1992-08-19 | Female | NOIDA | 4 | 2 | 5106.0 | 3 | 2 | 3 | 32500.73 |
| 1 | C1010012 | 1994-07-28 | Male | MUMBAI | 47 | 1 | 1499.0 | 1 | 1 | 3 | 24204.49 |
| 2 | C1010014 | 1992-06-04 | Female | MUMBAI | 54 | 2 | 1455.0 | 1 | 2 | 3 | 38377.14 |
| 3 | C1010018 | 1990-05-29 | Female | OTHER CITY | 15 | 1 | 30.0 | 3 | 1 | 1 | 496.18 |
| 4 | C1010024 | 1965-06-21 | Male | KOLKATA | 43 | 1 | 5000.0 | 2 | 1 | 3 | 87058.65 |
| 5 | C1010028 | 1988-08-25 | Female | DELHI | 32 | 1 | 557.0 | 2 | 1 | 2 | 296828.37 |
| 6 | C1010031 | 1988-06-09 | Male | TRICHY | 57 | 2 | 1864.0 | 1 | 2 | 3 | 8646.21 |
| 7 | C1010035 | 1980-06-09 | Male | NAVI MUMBAI | 34 | 2 | 750.0 | 2 | 2 | 2 | 378013.09 |
| 8 | C1010036 | 1996-02-26 | Male | GURGAON | 35 | 1 | 208.0 | 2 | 1 | 1 | 355430.17 |
| 9 | C1010037 | 1981-09-13 | Male | BANGALORE | 52 | 1 | 19680.0 | 1 | 1 | 3 | 95859.17 |


## Export Dataframe to CSV

### Subtask:
Export the `customer_df` DataFrame to a CSV file named `rfm_segmentations.csv`.

In [24]:
# Export customer_df to a CSV file
customer_df.to_csv('rfm_segmentations.csv', index=False)

print("DataFrame 'customer_df' has been exported to 'rfm_segmentations.csv'.")

DataFrame 'customer_df' has been exported to 'rfm_segmentations.csv'.
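
CSV stores plain text, so dtypes such as `datetime64` and `category` are not preserved in the file. The sketch below round-trips a hypothetical two-row frame through an in-memory buffer (standing in for the file on disk) to show what a later reader would need to restore.

```python
import io
import pandas as pd

# Hypothetical two-row frame with a datetime and an ordered categorical column
toy = pd.DataFrame({
    "Customer ID": ["C1010011", "C1010018"],
    "DoB": pd.to_datetime(["1992-08-19", "1990-05-29"]),
    "R Score": pd.Categorical([3, 3], categories=[3, 2, 1], ordered=True),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Dates must be re-parsed on load; the score comes back as a plain integer
restored = pd.read_csv(buffer, parse_dates=["DoB"])
print(restored.dtypes)
```

Anyone loading `rfm_segmentations.csv` later would therefore pass `parse_dates` for the date column and treat the score columns as integers.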


In [25]:
print("Unique values for R Score:", customer_df['R Score'].unique())
print("Unique values for F Score:", customer_df['F Score'].unique())
print("Unique values for M Score:", customer_df['M Score'].unique())

Unique values for R Score: [3, 1, 2]
Categories (3, int64): [3 < 2 < 1]
Unique values for F Score: [2 1]
Unique values for M Score: [3, 1, 2]
Categories (3, int64): [1 < 2 < 3]
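
Note the category order for R Score: `[3 < 2 < 1]`. Because `pd.qcut` assigns labels to bins from lowest to highest recency, the resulting ordered categorical treats label 1 as the "largest" category. The sketch below, using the six recency values from the earlier preview, shows how ordered comparisons behave and how casting to integers restores the numeric order of the scores.

```python
import pandas as pd

# Recency values (days) of the first six customers shown earlier
recency = pd.Series([4, 47, 54, 15, 43, 32])

# Labels follow bin order: lowest recency gets 3, highest gets 1
r_score = pd.qcut(recency, q=3, labels=[3, 2, 1])

# max() follows the category order 3 < 2 < 1, not the numeric value
categorical_max = r_score.max()

# Casting to int makes the scores behave like ordinary numbers
r_int = r_score.astype(int)
numeric_max = r_int.max()

print("Categorical max:", categorical_max)
print("Numeric max:", numeric_max)
```

Casting to integers before sorting, comparing, or summing the scores avoids results that silently follow the bin order instead of the score value.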


In [26]:
print("Value counts for R Score:\n", customer_df['R Score'].value_counts())
print("\nValue counts for F Score:\n", customer_df['F Score'].value_counts())
print("\nValue counts for M Score:\n", customer_df['M Score'].value_counts())

Value counts for R Score:
 R Score
3    285400
1    277387
2    273127
Name: count, dtype: int64

Value counts for F Score:
 F Score
1    708110
2    127804
Name: count, dtype: int64

Value counts for M Score:
 M Score
2    287849
1    278819
3    269246
Name: count, dtype: int64


In [28]:
customer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 835914 entries, 0 to 835913
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   Customer ID      835914 non-null  object        
 1   DoB              835914 non-null  datetime64[ns]
 2   Gender           835914 non-null  object        
 3   City             835914 non-null  object        
 4   Recency          835914 non-null  int64         
 5   Frequency        835914 non-null  int64         
 6   Monetary         835914 non-null  float64       
 7   R Score          835914 non-null  category      
 8   F Score          835914 non-null  int64         
 9   M Score          835914 non-null  category      
 10  Account Balance  835914 non-null  float64       
dtypes: category(2), datetime64[ns](1), float64(2), int64(3), object(3)
memory usage: 59.0+ MB
