## 1. Data Loading and Initial Overview

This section focuses on importing the RBI digital payments dataset using Python and obtaining an initial understanding of its structure, size, and basic characteristics. The objective is to familiarize ourselves with the dataset before proceeding to data preprocessing and analysis.


In [2]:
import pandas as pd

file_path = r"C:\Users\krish\Downloads\rbi_daily_digital_payments.csv"
df = pd.read_csv(file_path)
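
The `Unnamed: 0` column that appears in the outputs of this notebook usually means the row index was written into the CSV when the file was saved. The following is a minimal sketch, assuming that is the case here, of reading that column as the index instead. The inline `sample_csv` is a hypothetical stand-in for the real file, with values copied from the first two rows shown by `df.head()`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV (assumption: the file has a
# leading unnamed index column, which pandas loads as "Unnamed: 0").
sample_csv = io.StringIO(
    ",date,rtgs_vol,rtgs_val\n"
    "0,01-06-2020,4.85,436996.69\n"
    "1,02-06-2020,4.54,361878.87\n"
)

# index_col=0 uses the first column as the row index rather than
# keeping it as a separate data column.
sample_df = pd.read_csv(sample_csv, index_col=0)

print(sample_df.columns.tolist())  # ['date', 'rtgs_vol', 'rtgs_val']
```

If the first column turns out to carry real information, it should be kept and renamed rather than absorbed into the index.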


### Dataset Dimensions

The shape of the dataset provides information on the total number of observations (rows) and variables (columns). This helps in understanding the scale and complexity of the data.


In [4]:
df.shape


(1706, 50)

### Data Types of Variables

Knowing the data type of each column is essential for correct analysis. `df.info()` helps distinguish numerical, categorical, and date variables and highlights incorrect type assignments. In the output below, for example, `date` is read as `object` (text) rather than as a datetime.


In [5]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1706 entries, 0 to 1705
Data columns (total 50 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Unnamed: 0                             1706 non-null   int64  
 1   date                                   1706 non-null   object 
 2   rtgs_vol                               1662 non-null   float64
 3   rtgs_val                               1662 non-null   float64
 4   neft_vol                               1706 non-null   float64
 5   neft_val                               1706 non-null   float64
 6   aeps_vol                               1706 non-null   float64
 7   aeps_val                               1706 non-null   float64
 8   upi_vol                                1706 non-null   float64
 9   upi_val                                1706 non-null   float64
 10  imps_vol                               1706 non-null   float64
 11  imps

### Initial Observations

Initial exploration helps in identifying missing values, unusual patterns, and the overall distribution of data. Functions like `head()` and `describe()` provide a snapshot of the dataset.


In [6]:
df.head()


,Unnamed: 0,date,rtgs_vol,rtgs_val,neft_vol,neft_val,aeps_vol,aeps_val,upi_vol,upi_val,...,credit_card_at_e_commerce_vol,credit_card_at_e_commerce_val,debit_card_at_pos_vol,debit_card_at_pos_val,debit_card_at_e_commerce_vol,debit_card_at_e_commerce_val,ppis_card_at_pos_vol,ppis_card_at_pos_val,ppis_card_at_e_commerce_vol,ppis_card_at_e_commerce_val
0,0,01-06-2020,4.85,436996.69,172.11,104275.13,0.44,7.68,476.97,10413.11,...,,,,,,,,,,
1,1,02-06-2020,4.54,361878.87,100.07,65259.02,0.44,7.67,476.78,9951.3,...,,,,,,,,,,
2,2,03-06-2020,4.3,330632.89,100.36,62985.75,0.44,7.48,456.26,9622.38,...,,,,,,,,,,
3,3,04-06-2020,4.35,329072.45,94.66,63148.29,0.45,7.32,463.05,9639.5,...,,,,,,,,,,
4,4,05-06-2020,4.56,365468.95,111.26,68932.72,0.48,7.32,464.79,9539.52,...,,,,,,,,,,


In [7]:
df.describe()


,Unnamed: 0,rtgs_vol,rtgs_val,neft_vol,neft_val,aeps_vol,aeps_val,upi_vol,upi_val,imps_vol,...,credit_card_at_e_commerce_vol,credit_card_at_e_commerce_val,debit_card_at_pos_vol,debit_card_at_pos_val,debit_card_at_e_commerce_vol,debit_card_at_e_commerce_val,ppis_card_at_pos_vol,ppis_card_at_pos_val,ppis_card_at_e_commerce_vol,ppis_card_at_e_commerce_val
count,1706.0,1662.0,1662.0,1706.0,1706.0,1706.0,1706.0,1706.0,1706.0,1706.0,...,1219.0,1219.0,1219.0,1219.0,1219.0,1219.0,1219.0,1219.0,1219.0,1219.0
mean,852.5,6.693712,424483.2,159.832298,93722.640387,0.579812,16.949971,2534.319408,39532.997989,141.083359,...,36.718302,2412.577137,44.371313,997.789467,21.931386,503.284315,1.905816,27.560722,2.868532,65.999467
std,492.624096,3.450098,286650.9,86.456235,55898.712237,0.208948,6.082552,1544.973019,21570.239669,30.777404,...,12.176996,1060.856869,15.415501,287.904139,10.773602,165.738797,1.188726,15.036837,2.598883,33.723551
min,0.0,0.0,0.12,6.06,2093.33,0.16,4.31,289.0,4333.74,45.95,...,11.04,426.0,18.55,448.22,6.5,205.57,0.48,7.64,0.52,7.67
25%,426.25,4.69,145040.7,101.1425,54258.775,0.43,13.115,1105.4175,19857.6725,122.92,...,29.1,1662.49,31.505,792.295,13.11,384.02,0.97,17.29,1.12,43.775
50%,852.5,7.58,471274.6,150.425,100410.525,0.57,16.88,2280.215,37306.855,146.885,...,35.53,2275.61,42.62,950.05,19.4,490.74,1.46,20.74,1.55,59.32
75%,1278.75,9.18,628701.4,214.2325,128357.65,0.7,20.3775,3828.06,56383.285,161.8,...,42.08,3005.865,54.585,1121.195,28.66,598.105,2.51,34.89,3.875,79.415
max,1705.0,18.76,1589480.0,579.28,396593.31,1.68,44.22,6441.18,95915.61,256.87,...,107.59,10554.45,124.92,2679.66,76.72,1328.25,6.07,99.67,13.59,241.63


## 2. Data Pre-processing

Data preprocessing involves cleaning and preparing the dataset for analysis. This includes handling missing values, removing duplicates, correcting data types, and creating derived variables to support meaningful analysis.


In [8]:
df.isnull().sum()


Unnamed: 0                                  0
date                                        0
rtgs_vol                                   44
rtgs_val                                   44
neft_vol                                    0
neft_val                                    0
aeps_vol                                    0
aeps_val                                    0
upi_vol                                     0
upi_val                                     0
imps_vol                                    0
imps_val                                    0
nach_credit_vol                            95
nach_credit_val                            95
nach_debit_vol                             98
nach_debit_val                             98
netc_vol                                    0
netc_val                                    0
bbps_vol                                    0
bbps_val                                    0
cts_vol                                   366
cts_val                           
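
Raw counts are easier to judge as a share of all rows. With 1,706 rows, the 44 missing RTGS values are about 2.6% of the column, while the 366 missing CTS values are about 21.5%. A minimal sketch of computing that share per column, using a hypothetical toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column with a gap, one fully populated.
toy = pd.DataFrame({
    "rtgs_vol": [1.0, np.nan, 3.0, 4.0],
    "neft_vol": [1.0, 2.0, 3.0, 4.0],
})

# isnull() yields booleans, so the column mean is the fraction missing.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

print(missing_pct)
```

Applied to `df`, the same expression ranks all columns by how incomplete they are, which can help decide whether filling or dropping is more appropriate for each.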

In [12]:
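# 1. Drop columns in which every value is missing.
#    Note: 'Unnamed: 0' is fully populated (see df.info()), so it is not dropped here.
df = df.dropna(axis=1, how='all')

# 2. Forward-fill remaining gaps with the last observed value (rows are in date order).
#    Values missing at the very start of a column have no earlier value and remain NaN.
df = df.ffill()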
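
Forward-filling only copies values forward in time. The card columns are empty in the first rows shown by `df.head()`, so those leading gaps would not be filled by this step. A minimal sketch of the behaviour on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: missing at the start and once in the middle.
toy = pd.DataFrame({"vol": [np.nan, np.nan, 3.0, np.nan, 5.0]})

filled = toy.ffill()

# The gap at position 3 is filled with the previous value (3.0),
# but the two leading NaNs have nothing before them and stay missing.
remaining = filled.isnull().sum()
print(remaining)
```

Re-running `df.isnull().sum()` after the fill is a quick way to confirm which columns, if any, still contain missing values.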


### Removing Duplicate Records

Duplicate records can distort analysis results. This step ensures that each observation in the dataset is unique.


In [13]:
df.duplicated().sum()


np.int64(0)

In [14]:
df = df.drop_duplicates()
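
`duplicated()` with no arguments flags only rows that match in every column. For daily data it can also be worth checking whether any date appears more than once, which is an additional check and not part of the original notebook. A sketch on hypothetical toy rows (the third value is made up):

```python
import pandas as pd

# Hypothetical toy rows: the same date appears twice with different values.
toy = pd.DataFrame({
    "date": ["01-06-2020", "02-06-2020", "02-06-2020"],
    "upi_vol": [476.97, 476.78, 476.80],  # last value is invented
})

# Full-row comparison finds nothing, because the values differ.
full_row_dupes = toy.duplicated().sum()

# Restricting the comparison to 'date' reveals the repeated day.
same_date_dupes = toy.duplicated(subset="date").sum()

print(full_row_dupes, same_date_dupes)  # 0 1
```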


### Correcting Data Types

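The `date` column is stored as text (`object`), so it is converted to a datetime type to enable time-series operations such as extracting the year or grouping by period. The values are day-first (01-06-2020 is 1 June 2020, as the consecutive rows in `df.head()` show), hence `dayfirst=True`.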


In [15]:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
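
An alternative to `dayfirst=True` is to state the expected layout explicitly, so that a value in any other layout raises an error instead of being guessed. A minimal sketch, assuming every date follows the `DD-MM-YYYY` pattern seen in `df.head()`; the third sample date is hypothetical:

```python
import pandas as pd

# Sample strings in the DD-MM-YYYY layout (third one is made up).
dates = pd.Series(["01-06-2020", "02-06-2020", "13-06-2020"])

# An explicit format removes any ambiguity between day and month.
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

print(parsed.dt.month.tolist())  # [6, 6, 6]
```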


### Creating Derived Columns

Derived variables support deeper analysis. Here, `year` and `month` are extracted from `date` so that trends can be studied over time.


In [16]:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
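
Other time-based columns can be derived the same way. The sketch below adds a combined year-month period and a calendar quarter; the names `year_month` and `quarter` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical toy dates spanning a year boundary.
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2020-06-01", "2020-12-31", "2021-01-01"])}
)

toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month

# A monthly Period keeps year and month together, which avoids
# mixing, say, June 2020 with June 2021 when grouping by month.
toy["year_month"] = toy["date"].dt.to_period("M")
toy["quarter"] = toy["date"].dt.quarter

print(toy)
```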


### Filtering and Aggregating Data

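Aggregation helps in understanding long-term trends and comparative growth across payment systems. Here the data is grouped by year and the daily UPI and NEFT figures are summed. The series starts on 1 June 2020, so the 2020 total covers only part of the year, and the much smaller 2025 total suggests that year is also incomplete; neither should be compared directly with the full years 2021 to 2024.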


In [17]:
yearly_summary = df.groupby('year')[['upi_vol', 'upi_val', 'neft_vol', 'neft_val']].sum()
yearly_summary


year,upi_vol,upi_val,neft_vol,neft_val
2020,127692.87,2372994.97,18061.32,14977711.95
2021,387331.41,7157613.3,38008.83,27678182.28
2022,740396.89,12594493.43,49480.22,32795142.43
2023,1176087.55,18287634.94,65871.24,37313078.34
2024,1722080.18,24682520.82,92684.34,43278676.55
2025,169960.01,2348037.11,8567.95,3848032.95
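
Because yearly sums depend on how many days each year contains, it can help to report the number of observations and a daily average alongside the total. A minimal sketch on hypothetical toy values (not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: two days in 2020, one day in 2021.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-30", "2020-12-31", "2021-01-01"]),
    "upi_vol": [10.0, 20.0, 30.0],  # made-up values
})
toy["year"] = toy["date"].dt.year

# Named aggregation: days observed, yearly total, and mean per day.
summary = toy.groupby("year")["upi_vol"].agg(
    days="count", total="sum", daily_mean="mean"
)

print(summary)
```

Here both years have the same total, but the daily mean shows that activity per day doubled, which is the kind of distortion a partial year can introduce.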


### Interpretation

The preprocessing steps ensure that the dataset is clean, consistent, and suitable for exploratory data analysis. The cleaned data allows meaningful comparison across payment systems and supports reliable visualization and insight generation in later stages of the project.
