# Homework 4 - Getting to know your customers

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import random
from sympy import *
import time


In [3]:
pd.options.mode.chained_assignment = None # to avoid warning messages 

## 1. Finding Similar Customers

The idea behind this work is to implement the *Locality-Sensitive Hashing (LSH)* algorithm to find the users most similar to each query.


We combined different approaches, taking inspiration mainly from [this site](https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/) and Chapter 3 of the [book](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) *Mining of Massive Datasets*. The work is composed of the following steps:


<ol>
  <li>Set up the data</li>
  <li>Fingerprint hashing</li>
  <li>Locality Sensitive Hashing</li>
</ol>

### 1.1 Set up the data

For this first part, not all columns are necessary, since comparing every field individually would be quite time-consuming.

We have therefore decided to keep only some columns of the dataset and to create a new one, *CustGroupsBalance*, in order to reduce the number of distinct values to consider and improve the execution time of the algorithm in the next phase.


In [6]:
bank_transactions = pd.read_csv(r"C:\Users\Marina\OneDrive\Desktop\ADM_HW4\bank_transactions.csv")
bank_transactions = bank_transactions.dropna() # remove missing values.
bank_transactions = bank_transactions.reset_index(drop=True)

We group the transactions according to their account balance into *lowBalance* (below 5000), *mediumBalance* (from 5000 up to, but excluding, 50000) and *highBalance* (50000 and above) with the *createGroupsBalance(df)* function.

In [7]:
def createGroupsBalance(df):
    '''
    This function creates a new column in the dataframe holding the group
    to which each transaction belongs, based on the account balance.
    '''
    groupsBalance = []

    for balance in df['CustAccountBalance']:
        if balance < 5000:
            groupsBalance.append('lowBalance')
        elif 5000 <= balance < 50000:
            groupsBalance.append('mediumBalance')
        else:
            groupsBalance.append('highBalance')
            

    df['CustGroupsBalance'] = groupsBalance

In [8]:
createGroupsBalance(bank_transactions)
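
The same three-way split can also be sketched without an explicit Python loop. The snippet below is only a minimal sketch, assuming that `pd.cut` with left-closed bins reproduces the thresholds used above (5000 and 50000); the `toy` dataframe and its values are made up for illustration, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column name comes from the dataset above.
toy = pd.DataFrame({'CustAccountBalance': [120.0, 5000.0, 49999.99, 50000.0, 1e6]})

# right=False makes each bin left-closed:
# [-inf, 5000), [5000, 50000), [50000, inf)
toy['CustGroupsBalance'] = pd.cut(
    toy['CustAccountBalance'],
    bins=[-np.inf, 5000, 50000, np.inf],
    labels=['lowBalance', 'mediumBalance', 'highBalance'],
    right=False,
).astype(str)

print(toy)
```

One difference worth keeping in mind: in the loop version a missing balance falls through to the `else` branch and is labelled *highBalance*, whereas `pd.cut` leaves it missing.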

After the pre-processing phase, this is the dataset we will work with.

In [9]:
transactions = bank_transactions[['CustomerDOB',  'CustGender', 'CustLocation', 'CustGroupsBalance', 'TransactionAmount (INR)', 'TransactionDate']]
transactions

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustGroupsBalance,TransactionAmount (INR),TransactionDate
0,10/1/94,F,JAMSHEDPUR,mediumBalance,25.0,2/8/16
1,4/4/57,M,JHAJJAR,lowBalance,27999.0,2/8/16
2,26/11/96,F,MUMBAI,mediumBalance,459.0,2/8/16
3,14/9/73,F,MUMBAI,highBalance,2060.0,2/8/16
4,24/3/88,F,NAVI MUMBAI,mediumBalance,1762.5,2/8/16
...,...,...,...,...,...,...
1041609,8/4/90,M,NEW DELHI,mediumBalance,799.0,18/9/16
1041610,20/2/92,M,NASHIK,mediumBalance,460.0,18/9/16
1041611,18/5/89,M,HYDERABAD,highBalance,770.0,18/9/16
1041612,30/8/78,M,VISAKHAPATNAM,mediumBalance,1000.0,18/9/16


### 1.2 Fingerprint hashing

We implemented **our MinHash function**: the goal was to replace a large set of values with a smaller *"signature"* that still preserves the underlying similarity metric.

We established the following pipeline:

1. First, we create the shingles matrix, where a shingle is an individual element of the dataset, such as a character, an integer or a date;
2. We randomly permute the shingles matrix;
3. For each transaction, we start from the top, find the position of the first matching shingle and use that index to represent the transaction (one entry of its signature);
4. We repeat this several times, each time appending the result to the signature matrix.

Since an explicit permutation is computationally heavy, especially for large datasets, we use a hash function that simulates the reordering of the elements with a simple arithmetic operation:

 $$ h(x) = (ax + b) \bmod c $$

 where *a* and *b* are random integers smaller than *c* (with *a* different from zero, otherwise every element would be mapped to *b*) and *c* is the **first prime number higher** than the total number of different elements in the dataset (that is, the number of elements in the shingles matrix).

In [10]:
def permute(x, a, b, c):
    '''
    This function simulates a permutation.
    '''
    return (a*x + b) % c
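
To see why the choice of *a* matters, the following minimal sketch checks the formula on a small prime. The value `c = 11` and the pairs `(3, 4)` and `(0, 4)` are arbitrary illustrative choices, not values used in the notebook.

```python
def permute(x, a, b, c):
    # Same formula as above: a linear map modulo c.
    return (a * x + b) % c

c = 11  # a small prime, standing in for the padded length of the shingles matrix

good = [permute(x, 3, 4, c) for x in range(c)]  # a = 3 is not a multiple of c
bad = [permute(x, 0, 4, c) for x in range(c)]   # a = 0 collapses every index to b

# A valid permutation hits every index in 0..c-1 exactly once.
print(sorted(good) == list(range(c)))
print(set(bad))
```

When *c* is prime and *a* is not a multiple of *c*, every index is hit exactly once; with *a = 0* all indexes collapse onto a single value.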

In [11]:
def createShinglesMatrix(df):
    '''
    This function computes the shingles matrix using the Pandas method
    unique(), which returns the distinct values of each column of the
    dataframe (computed with a hash table).
    '''
    shingles_matrix = []

    for column in df:
        for shingle in df[column].unique():
            shingles_matrix.append(shingle)

    # we need to fix the length of the matrix to be as long as the first prime 
    # number higher than the value of the original length of the matrix
    diff = nextprime(len(shingles_matrix)) - len(shingles_matrix)
    for _ in range(diff):
        shingles_matrix.append('NA')

    return shingles_matrix

In [15]:
def createListOfPermutations(singlesMatrix, permutations, n):
    '''
    This function computes a list of n different permutations in order to
    randomly permute the shingles matrix.
    '''
    listOfPermutations = []
    for j in range(n):
        l = {}
        for i,item in enumerate(singlesMatrix):
            l[item] = permute(i, permutations[j][0], permutations[j][1], len(singlesMatrix))

        listOfPermutations.append(l)

    return listOfPermutations

In [16]:
def createSignature(df, listOfPermutations, nPermutations):
    ''' 
    This function returns a matrix which contains the signature list for each 
    transaction. We start from the top and find the indexes of each shingle 
    that matches and use the minimum index (the first occurrence) to represent 
    the transaction.
    '''
    signature_matrix=[]

    for row in df.iterrows():
        signature_k =[]
        for j in range(nPermutations):
            index=[listOfPermutations[j][row[1][s]] for s in range(6)]
            signature_k.append(min(index))
        
        signature_matrix.append(signature_k)

    return signature_matrix
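
The pipeline can be tried on a toy example before running it on the full dataset. This is a minimal sketch under stated assumptions: the universe of shingles, the two toy transactions `t1` and `t2`, the helper `minhash_signature` and the number of hash functions (200) are all hypothetical. It compares the true Jaccard similarity of two small sets with the fraction of signature positions on which they agree.

```python
import random

def minhash_signature(items, universe_index, hashes, c):
    """Minimum permuted index of the items, once per (a, b) pair."""
    return [min((a * universe_index[s] + b) % c for s in items) for a, b in hashes]

# Hypothetical universe of shingles; its size (7) is already prime, so no padding.
universe = ['F', 'M', 'MUMBAI', 'DELHI', 'lowBalance', 'mediumBalance', 'highBalance']
universe_index = {s: i for i, s in enumerate(universe)}
c = 7

rng = random.Random(0)
# a is drawn from 1..c-1 so that each hash is a genuine permutation of the indexes.
hashes = [(rng.randint(1, c - 1), rng.randint(0, c - 1)) for _ in range(200)]

t1 = {'F', 'MUMBAI', 'mediumBalance'}
t2 = {'F', 'MUMBAI', 'highBalance'}

sig1 = minhash_signature(t1, universe_index, hashes, c)
sig2 = minhash_signature(t2, universe_index, hashes, c)

true_jaccard = len(t1 & t2) / len(t1 | t2)
estimate = sum(x == y for x, y in zip(sig1, sig2)) / len(hashes)
print(true_jaccard, estimate)
```

The estimate is only approximate, because linear hash functions are not fully random permutations.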

In [17]:
# computation of the shingles matrix
singlesMatrix = createShinglesMatrix(transactions) 
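# computation of random values for "a" and "b" parameters: "a" is drawn from 1..c-1
# (a multiple of c would map every index to b), "b" from 0..c-1, with c = len(singlesMatrix)
permutations = [(random.randint(1, len(singlesMatrix) - 1), random.randint(0, len(singlesMatrix) - 1)) for _ in range(30)] 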
# computation of the list of 30 permutations
listOfPermutations = createListOfPermutations(singlesMatrix, permutations, 30) 
# computation of the signature matrix (only the first 10 of the 30 permutations are used)
signatureA = createSignature(transactions, listOfPermutations, 10) 

In [19]:
# initialization of the signature column in transactions dataframe
transactions['signature'] = signatureA

In [20]:
transactions

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustGroupsBalance,TransactionAmount (INR),TransactionDate,signature
0,10/1/94,F,JAMSHEDPUR,mediumBalance,25.0,2/8/16,"[13593, 1235, 4943, 6722, 29474, 21420, 2294, ..."
1,4/4/57,M,JHAJJAR,lowBalance,27999.0,2/8/16,"[24156, 1235, 14501, 39998, 2613, 11297, 17277..."
2,26/11/96,F,MUMBAI,mediumBalance,459.0,2/8/16,"[20621, 1235, 4943, 39998, 26295, 21420, 15762..."
3,14/9/73,F,MUMBAI,highBalance,2060.0,2/8/16,"[4133, 1235, 4943, 37232, 26295, 21420, 15762,..."
4,24/3/88,F,NAVI MUMBAI,mediumBalance,1762.5,2/8/16,"[23053, 1235, 4105, 6075, 27643, 13295, 15762,..."
...,...,...,...,...,...,...,...
1041609,8/4/90,M,NEW DELHI,mediumBalance,799.0,18/9/16,"[23053, 51831, 29639, 20896, 32653, 11297, 473..."
1041610,20/2/92,M,NASHIK,mediumBalance,460.0,18/9/16,"[2988, 11245, 672, 23058, 17430, 11297, 5962, ..."
1041611,18/5/89,M,HYDERABAD,highBalance,770.0,18/9/16,"[5547, 10978, 29639, 23058, 30146, 11297, 1571..."
1041612,30/8/78,M,VISAKHAPATNAM,mediumBalance,1000.0,18/9/16,"[23053, 55468, 926, 23058, 32653, 11297, 31571..."


In [21]:
signatureAdf = pd.DataFrame(signatureA)
# we transpose the signature matrix to be able to work on the bands
trasposed_signatureA = signatureAdf.T.copy(deep = True) 
trasposed_signatureA.index = [i for i in range(1,11)]
trasposed_signatureA

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1041604,1041605,1041606,1041607,1041608,1041609,1041610,1041611,1041612,1041613
1,13593,24156,20621,4133,23053,11161,24156,24156,1701,24156,...,23944,23053,18184,12460,20976,23053,2988,5547,23053,12460
2,1235,1235,1235,1235,1235,1235,1235,1235,1235,1235,...,1418,11154,31590,56032,38552,51831,11245,10978,55468,2067
3,4943,14501,4943,4943,4105,1219,4943,5998,4943,10777,...,4943,4943,29639,2120,9392,29639,672,29639,926,29639
4,6722,39998,39998,37232,6075,1061,39998,31571,414,39998,...,6162,9616,13889,23058,8326,20896,23058,23058,23058,23058
5,29474,2613,26295,26295,27643,23116,24464,26295,21285,26295,...,37180,24154,30021,10094,20915,32653,17430,30146,32653,13579
6,21420,11297,21420,21420,13295,3172,4719,11297,21420,11297,...,14962,5075,11297,8788,11297,11297,11297,11297,11297,8788
7,2294,17277,15762,15762,15762,6295,15762,19763,15762,17277,...,4840,15762,38581,17277,9573,4733,5962,15715,31571,12721
8,29990,27895,25800,23705,21610,19515,17420,15325,13230,11135,...,39589,35966,44796,13111,2582,11313,46750,9023,46750,15239
9,9473,33624,22671,10169,9323,10169,32778,10169,8627,8477,...,1013,24363,24363,7781,24318,24363,1859,10169,21392,7781
10,14712,54487,14712,15428,14712,17896,15786,55561,14712,16144,...,17896,14712,14712,54487,54487,14712,14712,57671,11254,26079


The signature matrix above is now divided into bands of *b* rows each. In our case we set *b = 2*, so the 10 rows form 5 bands, and two transactions are treated as candidates for similarity whenever they agree on both rows of at least one band. The larger we make *b*, the less likely it is that another transaction matches all the permutations of a band.

We computed the buckets for each band with the *createBuckets(b, trasposed_signature)* function, which takes as input the band size *b* and the transposed signature matrix and returns a dictionary whose keys are the bucket ids and whose values are the lists of transaction indexes that were mapped to that bucket: $$ bucket_{id} = [transaction1_{idx}, transaction2_{idx}, \dots] $$

In [22]:
def listAlphabet():
  '''
  This function returns a list of all the letters of the alphabet.
  '''
  return list(map(chr, range(97, 123)))

In [24]:
def createBuckets(b, trasposed_signature): 
    ''' 
    This function returns a dictionary containing as key the bucket ids and as value 
    a list of the transactions indexes that were mapped to that bucket for the 
    given band b.
    '''
    buckets = {}
    letters = listAlphabet()

    for i in range(1, len(trasposed_signature), b):
        sub = pd.DataFrame(trasposed_signature.loc[i:i+b-1].copy(deep = True))
        for col in sub:
            arr = sub[col].to_numpy(copy = True)
            
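            # one letter per band, derived from b instead of a hard-coded 2
            k = letters[i // b]
            for j in range(len(arr)):
                # the separator avoids key collisions such as (12, 345) vs (123, 45)
                k += '_' + str(arr[j])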

            if k in buckets:
                buckets[k].append(col)

            else:
                buckets[k] = [col]


    return buckets

In [25]:
# computation of the buckets for transactions dataframe
bucketsA = createBuckets(2, trasposed_signatureA)
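
The banding step can be illustrated on a handful of short signatures. The sketch below is a hypothetical variant, not the notebook's function: `band_buckets` works directly on a list of signatures (no transpose) and uses `(band number, tuple of values)` as the bucket key instead of a concatenated string. The toy signatures are invented.

```python
from collections import defaultdict

def band_buckets(signatures, rows_per_band):
    """Map (band number, band values) -> list of item indexes."""
    buckets = defaultdict(list)
    for idx, sig in enumerate(signatures):
        for start in range(0, len(sig), rows_per_band):
            key = (start // rows_per_band, tuple(sig[start:start + rows_per_band]))
            buckets[key].append(idx)
    return dict(buckets)

# Hypothetical 4-row signatures, i.e. 2 bands of 2 rows each.
toy_signatures = [
    [13593, 1235, 4943, 6722],
    [24156, 1235, 4943, 6722],   # shares the second band with item 0
    [20621, 9999, 4105, 6075],
]

toy_buckets = band_buckets(toy_signatures, rows_per_band=2)
print(toy_buckets[(1, (4943, 6722))])
```

Items 0 and 1 differ in their first band but land in the same bucket for the second one, which is enough to make them a candidate pair.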

### 1.3 Locality Sensitive Hashing

At first, we pre-process the query dataset as we did for the original dataset.

What we are going to do is compare pairs of signatures, one from the query and one from the original dataset of transactions.

We need to focus our attention only on pairs that are likely to *be similar*, without investigating every pair. The general technique that provides such focus is **locality-sensitive hashing (LSH)**, an approximate nearest-neighbor search technique often used in recommendation systems.

Here we will be focusing on **Jaccard Index metric** to compute the similarity.

In [26]:
query = pd.read_csv(r"C:\Users\Marina\OneDrive\Desktop\ADM_HW4\query_users.csv")

In [27]:
createGroupsBalance(query)

In [28]:
final_query = query[['CustomerDOB',  'CustGender', 'CustLocation', 'CustGroupsBalance', 'TransactionAmount (INR)', 'TransactionDate']]

In [29]:
# computation of the signature matrix
signatureB = createSignature(final_query, listOfPermutations, 10)
# initialization of the signature column in query dataframe
final_query['signature'] = signatureB

In [20]:
final_query

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustGroupsBalance,TransactionAmount (INR),TransactionDate,signature
0,27/7/78,M,DELHI,highBalance,65.0,2/9/16,"[27773, 27326, 17002, 10405, 52077, 4369, 2061..."
1,6/11/92,M,PANCHKULA,mediumBalance,6025.0,2/9/16,"[13488, 7582, 3056, 10405, 356, 7053, 5749, 41..."
2,14/8/91,M,PATNA,mediumBalance,541.5,10/8/16,"[13488, 19797, 8419, 10405, 356, 4385, 5749, 5..."
3,3/1/87,M,CHENNAI,highBalance,1000.0,29/8/16,"[8051, 22275, 13862, 10405, 9552, 17361, 24768..."
4,4/1/95,M,GURGAON,highBalance,80.0,25/9/16,"[12545, 10258, 12456, 7064, 17574, 17361, 3304..."
5,10/1/81,M,WORLD TRADE CENTRE BANGALORE,mediumBalance,303.0,11/9/16,"[9235, 2492, 22487, 8373, 356, 13983, 5749, 39..."
6,20/9/76,F,CHITTOOR,mediumBalance,20.0,28/8/16,"[13488, 12769, 27888, 2625, 356, 13166, 5749, ..."
7,10/4/91,M,MOHALI,lowBalance,50.0,2/8/16,"[10620, 24815, 838, 3782, 20395, 17361, 17564,..."
8,19/3/90,M,MOHALI,lowBalance,300.0,26/8/16,"[12103, 24815, 32401, 3782, 19879, 17361, 7431..."
9,19/12/70,M,SERAMPORE,highBalance,299.0,27/8/16,"[14093, 9548, 42980, 5136, 6957, 3425, 19721, ..."


In [30]:
signatureBdf = pd.DataFrame(signatureB)
# we transpose the signature matrix to be able to work on the bands
trasposed_signatureB = signatureBdf.T.copy(deep = True)
trasposed_signatureB.index = [i for i in range(1,11)]
trasposed_signatureB

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
1,5667,23053,20554,3012,12050,5034,23053,24156,7170,44009,...,10205,23053,56402,12448,10205,56402,3723,120,12460,4900
2,41586,10114,25237,6900,15678,104,34204,1235,3677,23331,...,15410,49741,34204,9765,4938,34204,56032,3660,5574,9931
3,7371,31019,26539,5490,24059,46862,4943,19573,34049,10154,...,711,46862,3796,4943,711,4943,14171,3908,4943,5744
4,2653,40987,15047,42850,29264,1833,32949,39998,30528,14460,...,69036,3814,5990,20896,35990,8884,24593,21929,8884,69036
5,2754,25373,24194,32653,14064,24866,19989,19937,19937,4227,...,32653,17836,23293,18279,863,4042,21015,32653,12886,4042
6,1707,4227,11297,11297,11297,11297,21420,11297,2123,11297,...,11297,5075,2207,21420,11297,1918,11297,11297,8788,9737
7,8483,1164,3697,27375,45957,64149,15762,17277,17277,1199,...,17277,36865,15762,15762,2295,7087,31766,4899,15762,26926
8,13224,8575,15773,1108,42844,3686,12027,13514,44939,2171,...,5298,7393,42396,1695,5298,9324,11583,39736,47034,23844
9,10169,12813,4777,10169,8627,25209,30917,55387,9936,5138,...,6830,29136,10169,32823,6830,10169,30390,908,7781,26055
10,20894,14712,14712,18388,32612,14712,14712,18970,18970,9945,...,50683,14712,13167,16598,19451,16860,14712,7613,7501,14712


In [32]:
# computation of the buckets for query dataframe
bucketsB = createBuckets(2, trasposed_signatureB)

The *findNeighbors(bucketsA, bucketsB)* function takes as input two bucket dictionaries, one built from the original dataset of transactions and one built from the query, and returns, for each element of the query, the transactions of the original dataset that ended up in the same bucket.

In [33]:
def findNeighbors(bucketsA, bucketsB):
    '''
    This function returns a dictionary, neighbors, in which the keys are the 
    query indexes and the values are the indexes of the transactions that end up 
    in the same bucket as the query element.
    '''
    neighbors = {}
    for k,v in bucketsB.items():
        if k in bucketsA:
            for elemA in v:
                for elemB in bucketsA[k]:
                    if elemA in neighbors:
                        neighbors[elemA].append(elemB)
                    else:
                        neighbors[elemA] = [elemB]
    return neighbors

The *checkNeighbors(dfA, dfB, neighbors, threshold)* function takes as input the original dataframe, the query dataframe and the dictionary of neighbors computed above, and checks how closely their signatures match by computing their similarity with the *jaccard_similarity(signA, signB)* function. It returns a dictionary, *similars*, in which the keys are the query indexes and the values are the IDs of the customers whose similarity meets the threshold.

Those dissimilar pairs that do map to the same bucket are false positives; we hope these will be only a small fraction of all pairs.
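
How often a pair becomes a candidate can be estimated with the standard banding formula from the book: if each signature row agrees with probability *s*, a pair shares at least one band with probability $1 - (1 - s^r)^n$, where *r* is the number of rows per band and *n* the number of bands. The sketch below assumes 5 bands of 2 rows, as produced by calling `createBuckets(2, ...)` on a 10-row signature; `candidate_probability` is a hypothetical helper.

```python
def candidate_probability(s, rows_per_band, n_bands):
    """Probability that two items with similarity s share at least one band."""
    return 1 - (1 - s ** rows_per_band) ** n_bands

# Probabilities for a few similarity levels with 5 bands of 2 rows each.
probabilities = {
    s: candidate_probability(s, rows_per_band=2, n_bands=5)
    for s in (0.2, 0.4, 0.8)
}

for s, p in probabilities.items():
    print(s, round(p, 3))
```

With these settings even fairly dissimilar pairs have a noticeable chance of sharing a bucket, which is why the candidates are filtered again with an explicit similarity threshold.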

In [34]:
def jaccard_similarity(signA, signB):
    ''' 
    This function simply computes the Jaccard similarity from its definition.
    '''
    return len(set(signA).intersection(signB))/len(set(signA).union(signB))
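
Note that this function treats each signature as a set of values, ignoring positions. The textbook MinHash estimate instead counts the positions on which two signatures agree. The sketch below contrasts the two on a pair of made-up signatures; `positional_agreement` is a hypothetical helper that is not part of this notebook.

```python
def jaccard_similarity(signA, signB):
    # Set-based Jaccard, as defined above.
    return len(set(signA).intersection(signB)) / len(set(signA).union(signB))

def positional_agreement(signA, signB):
    """Fraction of positions where the two signatures hold the same value."""
    return sum(x == y for x, y in zip(signA, signB)) / len(signA)

# Hypothetical 5-entry signatures.
sig_a = [13593, 1235, 4943, 6722, 29474]
sig_b = [20621, 1235, 4943, 39998, 26295]

set_based = jaccard_similarity(sig_a, sig_b)        # 2 shared values out of 8 distinct
position_based = positional_agreement(sig_a, sig_b)  # 2 matching positions out of 5
print(set_based, position_based)
```

The two measures generally differ, so a threshold tuned for one does not transfer directly to the other.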

In [35]:
def checkNeighbors(dfA, dfB, neighbors, threshold):
    ''' 
    This function gets, for each query index, the IDs of the similar customers
    that respect the threshold value.
    The similarity measure here is based on the Jaccard Index.
    '''
    similars = {}
    for k,v in neighbors.items():
        signB = dfB.loc[k]['signature']
        for elem in v:
            signA = dfA.loc[elem]['signature']
            sim = jaccard_similarity(signA, signB)
            if sim >= threshold:
                if k in similars:
                    similars[k].append(bank_transactions.loc[elem]['CustomerID'])
                else:
                    similars[k] = [bank_transactions.loc[elem]['CustomerID']]

    return similars

In [36]:
# computation of the dictionary of neighbors for each element of the query
neighbors = findNeighbors(bucketsA, bucketsB)

## Report: computation of LSH algorithm with several thresholds

After several attempts we noticed that the hashing method works properly with various thresholds, but only with a threshold value less than or equal to 0.4 are we able to find a match for every element of the query.

In terms of computational time, the algorithm works quite well, as its execution time remains under 3 minutes.

### Threshold 0.4

In [68]:
start_time = time.time()
similars = checkNeighbors(transactions, final_query, neighbors, 0.4)
end_time = time.time()


In [69]:
print("LSH Similarity with a threshold value of 0.4 computes in: " + str((end_time - start_time)) + " seconds." )

similars = dict(sorted(similars.items()))
for k,v in similars.items():
    print("For query " + str(k) + " we find " + str(len(similars[k])) + " similars.")
    
noSimilars = len(final_query) - len(similars.keys())
print("For " + str(noSimilars) + " elements of the query we don't find any similars.")

LSH Similarity with a threshold value of 0.4 computes in: 132.01143097877502 seconds.
For query 0 we find 127 similars.
For query 1 we find 9 similars.
For query 2 we find 47 similars.
For query 3 we find 1981 similars.
For query 4 we find 29 similars.
For query 5 we find 893 similars.
For query 6 we find 3207 similars.
For query 7 we find 738 similars.
For query 8 we find 151 similars.
For query 9 we find 9 similars.
For query 10 we find 1082 similars.
For query 11 we find 241 similars.
For query 12 we find 275 similars.
For query 13 we find 5 similars.
For query 14 we find 107 similars.
For query 15 we find 1520 similars.
For query 16 we find 874 similars.
For query 17 we find 427 similars.
For query 18 we find 140 similars.
For query 19 we find 378 similars.
For query 20 we find 611 similars.
For query 21 we find 101 similars.
For query 22 we find 11 similars.
For query 23 we find 5 similars.
For query 24 we find 16 similars.
For query 25 we find 49 similars.
For query 26 we find 572

### Threshold 0.8

In [70]:
start_time = time.time()
similars = checkNeighbors(transactions, final_query, neighbors, 0.8)
end_time = time.time()

In [72]:
print("LSH Similarity with a threshold value of 0.8 computes in: " + str((end_time - start_time)) + " seconds." )

similars = dict(sorted(similars.items()))
for k,v in similars.items():
    print("For query " + str(k) + " we find " + str(len(similars[k])) + " similars.")
    
noSimilars = len(final_query) - len(similars.keys())
print("For " + str(noSimilars) + " elements of the query we don't find any similars.")

LSH Similarity with a threshold value of 0.8 computes in: 127.49004077911377 seconds.
For query 0 we find 9 similars.
For query 1 we find 5 similars.
For query 2 we find 5 similars.
For query 3 we find 5 similars.
For query 4 we find 5 similars.
For query 5 we find 5 similars.
For query 6 we find 9 similars.
For query 7 we find 5 similars.
For query 8 we find 5 similars.
For query 9 we find 5 similars.
For query 10 we find 9 similars.
For query 11 we find 5 similars.
For query 12 we find 5 similars.
For query 13 we find 5 similars.
For query 14 we find 5 similars.
For query 15 we find 38 similars.
For query 16 we find 5 similars.
For query 17 we find 5 similars.
For query 18 we find 10 similars.
For query 19 we find 5 similars.
For query 20 we find 5 similars.
For query 21 we find 5 similars.
For query 22 we find 5 similars.
For query 23 we find 5 similars.
For query 24 we find 5 similars.
For query 25 we find 5 similars.
For query 26 we find 19 similars.
For query 27 we find 5 similars

### Threshold 1.0: perfect match

In [74]:
start_time = time.time()
similars = checkNeighbors(transactions, final_query, neighbors, 1.0)
end_time = time.time()

In [77]:
print("LSH Similarity with a threshold value of 1.0 computes in: " + str((end_time - start_time)) + " seconds." )

similars = dict(sorted(similars.items()))
for k,v in similars.items():
    print("For query " + str(k) + " we find " + str(len(similars[k])) + " similars.")
    
noSimilars = len(final_query) - len(similars.keys())
print("For " + str(noSimilars) + " elements of the query we don't find any similars.")

LSH Similarity with a threshold value of 1.0 computes in: 123.34179210662842 seconds.
For query 0 we find 5 similars.
For query 1 we find 5 similars.
For query 2 we find 5 similars.
For query 3 we find 5 similars.
For query 4 we find 5 similars.
For query 5 we find 5 similars.
For query 6 we find 5 similars.
For query 7 we find 5 similars.
For query 8 we find 5 similars.
For query 9 we find 5 similars.
For query 10 we find 5 similars.
For query 11 we find 5 similars.
For query 12 we find 5 similars.
For query 13 we find 5 similars.
For query 14 we find 5 similars.
For query 15 we find 10 similars.
For query 16 we find 5 similars.
For query 17 we find 5 similars.
For query 18 we find 10 similars.
For query 19 we find 5 similars.
For query 20 we find 5 similars.
For query 21 we find 5 similars.
For query 22 we find 5 similars.
For query 23 we find 5 similars.
For query 24 we find 5 similars.
For query 25 we find 5 similars.
For query 26 we find 15 similars.
For query 27 we find 5 similars

## Command Line

In [3]:
bank_transactions = pd.read_csv(r"C:\Users\Marina\OneDrive\Desktop\ADM_HW4\bank_transactions.csv")

### 1. In which location has the maximum number of purchases been made?

In [4]:
bank_transactions.groupby(['CustLocation'])['CustLocation'].count().sort_values(ascending=False).head(1)

CustLocation
MUMBAI    103595
Name: CustLocation, dtype: int64

### 2. In the dataset provided, did females spend more than males, or vice versa?

In [12]:
avgFemalesTransactions = bank_transactions.loc[bank_transactions.CustGender == "F"]['TransactionAmount (INR)'].mean()
avgMalesTransactions = bank_transactions.loc[bank_transactions.CustGender == "M"]['TransactionAmount (INR)'].mean()

if avgFemalesTransactions > avgMalesTransactions:
    print("We can conclude that in the dataset provided females spent on average more than males.")
else:
    print("We can conclude that in the dataset provided males spent on average more than females.")


We can conclude that in the dataset provided females spent on average more than males.
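
The comparison above is based on the average transaction amount. Total spending can point the other way when one group has more transactions, so it may be worth looking at both. The following is a minimal sketch on invented numbers (the `toy` dataframe is hypothetical; only the column names come from the dataset).

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset.
toy = pd.DataFrame({
    'CustGender': ['F', 'F', 'M', 'M', 'M'],
    'TransactionAmount (INR)': [100.0, 300.0, 50.0, 150.0, 250.0],
})

# Total, average and number of transactions per gender in one pass.
summary = toy.groupby('CustGender')['TransactionAmount (INR)'].agg(['sum', 'mean', 'count'])
print(summary)
```

In this toy case females have the higher average while males have the higher total, which shows why the conclusion should state which of the two is meant.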


### 3. Report the customer with the highest average transaction amount in the dataset.

In [154]:
bank_transactions.groupby('CustomerID')['TransactionAmount (INR)'].mean().sort_values(ascending=False).head(1)

CustomerID
C7319271    1560034.99
Name: TransactionAmount (INR), dtype: float64