# Homework 4 - Getting to know your customers

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import random
from sympy import nextprime  # used to pad the shingles list to a prime length

## 1. Finding Similar Customers

The idea behind the work is to implement the *Locality-Sensitive Hashing (LSH)* algorithm to find the users most similar to the query users.


We combined different approaches, drawing mainly on [this tutorial](https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/). The work is structured in the following steps:


<ol>
  <li>Set up the data</li>
  <li>Fingerprint hashing</li>
  <li>Locality Sensitive Hashing</li>
</ol>

### 1.1 Set up the data

For this first part not all columns are necessary, and comparing every single field would be quite time-consuming.

We have therefore decided to keep only some columns of the dataset and to create a new one, *CustGroupsBalance*, in order to decrease the number of distinct values to consider and improve the execution time of the algorithm in the next phase.

The dataset we work with after this pre-processing phase is shown below.


In [146]:
bank_transactions = pd.read_csv(r"C:\Users\Marina\OneDrive\Desktop\ADM_HW4\bank_transactions.csv")
bank_transactions = bank_transactions.dropna()
bank_transactions = bank_transactions.reset_index(drop=True)

We group the transactions according to their account balance: *lowBalance* (below 5000), *mediumBalance* (from 5000 up to, but not including, 50000) and *highBalance* (50000 or more).

In [3]:
groupBalance = []

for balance in bank_transactions['CustAccountBalance']:
    if balance < 5000:
        groupBalance.append('lowBalance')
    elif 5000 <= balance < 50000:
        groupBalance.append('mediumBalance')
    else:
        groupBalance.append('highBalance')
        

bank_transactions['CustGroupsBalance'] = groupBalance
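
The same binning can also be written without an explicit Python loop. The snippet below is a minimal sketch using `pd.cut` on a small made-up balance column (the `toy` frame and its values are illustrative, not taken from the dataset, and it assumes there are no missing balances). Passing `right=False` makes each interval closed on the left, which matches the `<` comparisons in the loop above.

```python
import numpy as np
import pandas as pd

# Toy balances (illustrative values only)
toy = pd.DataFrame({'CustAccountBalance': [100.0, 4999.99, 5000.0, 49999.0, 50000.0]})

# With right=False the bins are [-inf, 5000), [5000, 50000), [50000, inf)
toy['CustGroupsBalance'] = pd.cut(
    toy['CustAccountBalance'],
    bins=[-np.inf, 5000, 50000, np.inf],
    labels=['lowBalance', 'mediumBalance', 'highBalance'],
    right=False,
).astype(str)

print(toy)
```

On a column with about a million rows this kind of vectorized call is usually much faster than appending to a list one element at a time.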

In [6]:
transactions = bank_transactions[['CustomerDOB',  'CustGender', 'CustLocation', 'CustGroupsBalance', 'TransactionAmount (INR)', 'TransactionDate']]
transactions

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustGroupsBalance,TransactionAmount (INR),TransactionDate
0,10/1/94,F,JAMSHEDPUR,mediumBalance,25.0,2/8/16
1,4/4/57,M,JHAJJAR,lowBalance,27999.0,2/8/16
2,26/11/96,F,MUMBAI,mediumBalance,459.0,2/8/16
3,14/9/73,F,MUMBAI,highBalance,2060.0,2/8/16
4,24/3/88,F,NAVI MUMBAI,mediumBalance,1762.5,2/8/16
...,...,...,...,...,...,...
1041609,8/4/90,M,NEW DELHI,mediumBalance,799.0,18/9/16
1041610,20/2/92,M,NASHIK,mediumBalance,460.0,18/9/16
1041611,18/5/89,M,HYDERABAD,highBalance,770.0,18/9/16
1041612,30/8/78,M,VISAKHAPATNAM,mediumBalance,1000.0,18/9/16


### 1.2 Fingerprint hashing

We implemented **our own MinHash function**: the goal is to replace a large set of values with a much smaller *"signature"* that still preserves the underlying similarity between transactions.

We established the following pipeline:

1. First, we create the shingles matrix, where a shingle is an individual element of the dataset, such as a character, an integer or a date;
2. We randomly permute the shingles matrix;
3. For each transaction, we start from the top and find the position of the first shingle with a 1 in its cell, and we use this shingle number to represent the transaction (its signature);
4. We repeat steps 2 and 3 several times, each time appending the result to the signature matrix.
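
To make the pipeline above concrete, here is a minimal, self-contained sketch on toy data. The sets `t1` and `t2`, the prime `c = 101` and the 200 coefficient pairs are illustrative assumptions, not values from the notebook; an affine map modulo a prime stands in for each random permutation.

```python
import random

def jaccard(s1, s2):
    # Exact Jaccard similarity of two sets
    return len(s1 & s2) / len(s1 | s2)

def minhash_signature(item, hash_params, c):
    # For each (a, b) pair keep the smallest permuted shingle index in the item
    return [min((a * x + b) % c for x in item) for a, b in hash_params]

random.seed(0)   # fixed seed so the sketch is repeatable
c = 101          # prime larger than the number of toy shingles (indices 0..99)
hash_params = [(random.randint(1, c - 1), random.randint(0, c - 1)) for _ in range(200)]

# Two toy "transactions", each described by the indices of its shingles;
# they share 40 of 80 distinct shingles, so their Jaccard similarity is 0.5
pool = random.sample(range(100), 80)
t1 = set(pool[:60])
t2 = set(pool[20:])

sig1 = minhash_signature(t1, hash_params, c)
sig2 = minhash_signature(t2, hash_params, c)

# The fraction of matching signature positions approximates the exact value
estimate = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
print(jaccard(t1, t2), estimate)
```

The two printed numbers should be roughly comparable. The agreement is only approximate, because affine maps are a cheap substitute for truly random permutations and because the signature has finite length.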

Since an explicit permutation is computationally heavy, especially for large datasets, we use a hashing/mapping function that reorders the elements with a simple arithmetic operation:

$$ h(x) = (ax + b) \bmod c $$

where $a$ and $b$ are random integers smaller than $c$ ($a$ should be nonzero, otherwise every element is mapped to $b$), and $c$ is the first prime number greater than the total number of distinct elements in the dataset; the shingles matrix is padded with placeholder entries up to length $c$.
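
As a quick sanity check of why a prime modulus is used, the sketch below applies the same affine map to every index for a small prime. The values `c = 11`, `a = 4` and `b = 7` are arbitrary choices for illustration.

```python
def permute(x, a, b, c):
    # Same affine map as in the text: h(x) = (a*x + b) mod c
    return (a * x + b) % c

c = 11          # a small prime, standing in for the padded number of shingles
a, b = 4, 7     # illustrative coefficients with 1 <= a < c and 0 <= b < c

mapped = [permute(x, a, b, c) for x in range(c)]
print(mapped)

# Every index 0..c-1 appears exactly once, so the map is a true permutation
is_permutation = sorted(mapped) == list(range(c))
print(is_permutation)

# With a = 0 every index collapses onto b, which is why a = 0 should be avoided
collapsed = {permute(x, 0, b, c) for x in range(c)}
print(collapsed)
```

Because $c$ is prime, any $a$ that is not a multiple of $c$ gives a one-to-one reordering of the indices, so no two shingles end up in the same position.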

In [151]:
def permute(x, a, b, c):
    return (a*x + b) % c

In [150]:
def createShinglesMatrix(df):
    shingles_matrix = []

    for column in df:
        for shingle in df[column].unique():
            shingles_matrix.append(shingle)

    diff = nextprime(len(shingles_matrix)) - len(shingles_matrix)
    for _ in range(diff):
        shingles_matrix.append('NA')

    return shingles_matrix

In [149]:
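def createListOfPermutations(arr, permutations):
    listOfPermutations = []
    c = len(arr)  # prime length of the padded shingles list
    # one dictionary per (a, b) pair instead of a hard-coded count of 30
    for a, b in permutations:
        # map each shingle to its permuted position
        l = {item: permute(i, a, b, c) for i, item in enumerate(arr)}
        listOfPermutations.append(l)

    return listOfPermutations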

In [13]:
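def createSignature(df, listOfPermutations, nPermutations):
    # df is expected to contain only the shingle columns
    signature_matrix = []

    # itertuples avoids building a Series per row and the deprecated
    # positional lookup row[1][s]
    for row in df.itertuples(index=False, name=None):
        signature_k = []
        for j in range(nPermutations):
            # MinHash value: smallest permuted position among the row's shingles
            signature_k.append(min(listOfPermutations[j][value] for value in row))

        signature_matrix.append(signature_k)

    return signature_matrix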

In [14]:
listOfValues = createShinglesMatrix(transactions) # initialization of the shingles matrix
# random values for "a" and "b": 1 <= a < c (a = 0 would map every shingle to b) and 0 <= b < c
c = len(listOfValues)
permutations = [(random.randint(1, c - 1), random.randint(0, c - 1)) for _ in range(30)]
listOfPermutations = createListOfPermutations(listOfValues, permutations) # initialization of the list of permutations
signatureA = createSignature(transactions, listOfPermutations, 10) # only the first 10 permutations are used
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
transactions = transactions.assign(signature=signatureA) # initialization of the signature column in the dataframe


In [15]:
transactions

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustGroupsBalance,TransactionAmount (INR),TransactionDate,signature
0,10/1/94,F,JAMSHEDPUR,mediumBalance,25.0,2/8/16,"[3194, 1103, 7071, 10371, 28368, 6531, 1793, 2..."
1,4/4/57,M,JHAJJAR,lowBalance,27999.0,2/8/16,"[3, 1103, 29793, 34650, 11496, 6531, 1793, 152..."
2,26/11/96,F,MUMBAI,mediumBalance,459.0,2/8/16,"[3194, 1103, 16413, 10371, 8171, 6531, 1793, 1..."
3,14/9/73,F,MUMBAI,highBalance,2060.0,2/8/16,"[39794, 1103, 64020, 13860, 17704, 6531, 981, ..."
4,24/3/88,F,NAVI MUMBAI,mediumBalance,1762.5,2/8/16,"[3194, 1103, 33011, 1343, 39032, 6531, 1793, 5..."
...,...,...,...,...,...,...,...
1041609,8/4/90,M,NEW DELHI,mediumBalance,799.0,18/9/16,"[3, 10043, 22873, 10371, 4062, 12012, 3591, 27..."
1041610,20/2/92,M,NASHIK,mediumBalance,460.0,18/9/16,"[3, 4478, 39270, 10371, 4062, 2758, 59285, 416..."
1041611,18/5/89,M,HYDERABAD,highBalance,770.0,18/9/16,"[3, 2582, 49408, 23443, 2495, 30543, 45535, 18..."
1041612,30/8/78,M,VISAKHAPATNAM,mediumBalance,1000.0,18/9/16,"[3, 10043, 60802, 10371, 1365, 32761, 4103, 26..."


In [16]:
signatureAdf = pd.DataFrame(signatureA)
trasposed_signatureA = signatureAdf.T.copy(deep = True)
trasposed_signatureA.index = [i for i in range(1,11)]
trasposed_signatureA

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1041604,1041605,1041606,1041607,1041608,1041609,1041610,1041611,1041612,1041613
1,3194,3,3194,39794,3194,30627,21494,3,3194,3,...,21494,3194,3,3,3,3,3,3,3,3
2,1103,1103,1103,1103,1103,1103,1103,1103,1103,1103,...,10043,10043,8367,10043,10043,10043,4478,2582,10043,10043
3,7071,29793,16413,64020,33011,2002,11344,64020,60802,29793,...,3192,38718,32150,13323,12735,22873,39270,49408,60802,51578
4,10371,34650,10371,13860,1343,13860,13860,22133,219,23257,...,3194,10371,10371,60735,7620,10371,10371,23443,10371,21347
5,28368,11496,8171,17704,39032,832,7040,22160,36933,22160,...,4062,4062,4062,4062,4062,4062,4062,2495,1365,4062
6,6531,6531,6531,6531,6531,6531,6531,6531,2410,6531,...,43368,14689,54573,61937,71171,12012,2758,30543,32761,59283
7,1793,1793,1793,981,1793,1793,1793,1793,1793,1793,...,8212,20835,23914,543,21379,3591,59285,45535,4103,16215
8,28641,15261,18800,1881,5420,1881,15261,1881,428,15261,...,15261,14351,19529,15261,15261,275,4166,1881,26230,1881
9,39130,6023,25363,11765,25669,11765,17950,11765,12208,25363,...,44872,13739,38824,38824,28138,20926,29366,11765,34460,11765
10,15913,15913,15913,15913,11802,1701,15913,15913,15913,15913,...,13256,44995,40022,3427,44995,31908,10524,44995,4941,17175


The signature matrix above is now divided into *b* bands of *r* rows each, and each band of every transaction is mapped to a bucket. In our case we pass 2 as the band size, so the 10 signature rows form *b = 5* bands of *r = 2* rows, and two transactions are treated as candidates for similarity when they agree on both rows of at least one band. The more rows a band contains, the less likely it is that another transaction matches all of them.
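
The effect of the banding parameters can be quantified with the standard LSH formula $1 - (1 - s^r)^b$, the probability that two items whose signatures agree on a fraction $s$ of positions share at least one bucket. The sketch below evaluates it for 5 bands of 2 rows, which is what splitting the 10-row signature into pairs of rows gives; the helper name `candidate_probability` is introduced here only for illustration.

```python
def candidate_probability(s, r, b):
    # Probability that two items with signature similarity s agree on all r rows
    # of at least one of the b bands
    return 1 - (1 - s ** r) ** b

r, b = 2, 5   # 10 signature rows split into 5 bands of 2 rows each
for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s = {s:.1f} -> P(candidate) = {candidate_probability(s, r, b):.3f}")
```

Increasing `r` pushes the curve to the right (fewer, stricter candidates), while increasing `b` pushes it to the left (more candidates, including less similar ones).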


In [17]:
def listAlphabet():
  return list(map(chr, range(97, 123)))

In [18]:
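def create_bucktes(b, trasposed_signature): 
    # b is the number of signature rows per band
    buckets = {}
    letters = listAlphabet()

    # + 1 so that the last row is not skipped when b does not divide the row count evenly
    for i in range(1, len(trasposed_signature) + 1, b):
        sub = trasposed_signature.loc[i:i+b-1]
        # one letter per band, so equal values in different bands never share a bucket
        prefix = letters[i // b + 1]
        for col in sub:
            # join with a separator so that e.g. (12, 3) and (1, 23) give different keys
            k = prefix + '_'.join(str(v) for v in sub[col])
            buckets.setdefault(k, []).append(col)

    return buckets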

In [137]:
bucketsA = create_bucktes(2, trasposed_signatureA)

### 1.3 Locality Sensitive Hashing

We repeat the procedure above for the query and we create two new columns:
<ol>
    <li> neighbors: a list of customer ids that end up in the same bucket as the query; </li>
    <li> similars: a list of customer ids whose signature has a similarity with the query of at least a certain threshold. </li>
</ol>
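
The difference between the two columns can be seen on a toy example. Everything in the sketch below is hypothetical: the customer ids, the four-value signatures and the bucket key format are made up for illustration and are not taken from the dataset.

```python
def matching_fraction(sig_a, sig_b):
    # Share of positions where the two signatures agree
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Toy data (hypothetical ids and signatures)
dataset_signatures = {
    'C1': [3, 11, 7, 5],
    'C2': [3, 11, 9, 5],
    'C3': [8, 2, 7, 1],
}
# Buckets for two bands of two rows each (hypothetical key format)
dataset_buckets = {
    'b3_11': ['C1', 'C2'], 'b8_2': ['C3'],
    'c7_5': ['C1'], 'c9_5': ['C2'], 'c7_1': ['C3'],
}

query_signature = [3, 11, 7, 6]
query_buckets = ['b3_11', 'c7_6']

# neighbors: every customer sharing at least one bucket with the query
neighbors = sorted({cid for key in query_buckets for cid in dataset_buckets.get(key, [])})

# similars: customers whose signature agrees with the query on enough positions
threshold = 0.6
similars = sorted(cid for cid, sig in dataset_signatures.items()
                  if matching_fraction(query_signature, sig) >= threshold)

print(neighbors, similars)
```

Here `C2` is a neighbor because it shares the first band with the query, but it is not among the similars because only half of its signature matches.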

In [20]:
query = pd.read_csv(r"C:\Users\Marina\OneDrive\Desktop\ADM_HW4\query_users.csv")

In [21]:
groupBalance = []

for balance in query['CustAccountBalance']:
    if balance < 5000:
        groupBalance.append('lowBalance')
    elif 5000 <= balance < 50000:
        groupBalance.append('mediumBalance')
    else:
        groupBalance.append('highBalance')
        

query['CustGroupsBalance'] = groupBalance

In [22]:
final_query = query[['CustomerDOB',  'CustGender', 'CustLocation', 'CustGroupsBalance', 'TransactionAmount (INR)', 'TransactionDate']]


In [24]:
# reuse the permutations built on the dataset, so that query and dataset signatures are comparable
signatureB = createSignature(final_query, listOfPermutations, 10)
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
final_query = final_query.assign(signature=signatureB)


In [25]:
final_query

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustGroupsBalance,TransactionAmount (INR),TransactionDate,signature
0,27/7/78,M,DELHI,highBalance,65.0,2/9/16,"[3, 25400, 34530, 61031, 22160, 6879, 32121, 1..."
1,6/11/92,M,PANCHKULA,mediumBalance,6025.0,2/9/16,"[3, 2407, 3540, 10371, 22160, 6879, 30687, 253..."
2,14/8/91,M,PATNA,mediumBalance,541.5,10/8/16,"[3, 21176, 23249, 10371, 22160, 62612, 3323, 1..."
3,3/1/87,M,CHENNAI,highBalance,1000.0,29/8/16,"[3, 25400, 68846, 20321, 1365, 18593, 43656, 1..."
4,4/1/95,M,GURGAON,highBalance,80.0,25/9/16,"[3, 7063, 49165, 281, 22160, 2410, 45535, 1881..."
5,10/1/81,M,WORLD TRADE CENTRE BANGALORE,mediumBalance,303.0,11/9/16,"[3, 30655, 5680, 10371, 22160, 17492, 67812, 6..."
6,20/9/76,F,CHITTOOR,mediumBalance,20.0,28/8/16,"[3194, 31679, 20447, 10371, 27836, 887, 20748,..."
7,10/4/91,M,MOHALI,lowBalance,50.0,2/8/16,"[3, 1103, 29793, 16648, 13721, 6531, 1793, 152..."
8,19/3/90,M,MOHALI,lowBalance,300.0,26/8/16,"[3, 21110, 19845, 54568, 3028, 36725, 10439, 5..."
9,19/12/70,M,SERAMPORE,highBalance,299.0,27/8/16,"[3, 5573, 37370, 14451, 22160, 6978, 2857, 188..."


In [26]:
signatureBdf = pd.DataFrame(signatureB)
trasposed_signatureB = signatureBdf.T.copy(deep = True)
trasposed_signatureB.index = [i for i in range(1,11)]
trasposed_signatureB

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
1,3,3,3,3,3,3,3194,3,3,3,...,3,3,12191,8623,3,8521,3,3,3194,3
2,25400,2407,21176,25400,7063,30655,31679,1103,21110,5573,...,17136,30189,24107,38453,12655,25400,25708,25400,34659,7815
3,34530,3540,23249,68846,49165,5680,20447,29793,19845,37370,...,29793,52051,15776,22873,43744,45624,20292,31620,32784,15486
4,61031,10371,10371,20321,281,10371,10371,16648,54568,14451,...,51448,10371,6459,13860,29534,13860,10371,13698,10371,10371
5,22160,22160,22160,1365,22160,22160,27836,13721,3028,22160,...,22160,5426,15605,15605,22160,77143,22160,22160,39032,22160
6,6879,6879,62612,18593,2410,17492,887,6531,36725,6978,...,887,2832,30742,50049,24353,29648,71693,23967,53722,26142
7,32121,30687,3323,43656,45535,67812,20748,1793,10439,2857,...,21343,61511,506,28492,45535,21336,42682,17426,5314,25868
8,1881,25341,19295,1881,1881,6763,5419,15261,5393,1881,...,15261,2432,1881,15261,1881,1881,28641,1060,2513,7452
9,11765,4737,28942,4626,11765,38824,28291,11902,11902,5132,...,4626,13739,11765,26686,11765,6881,4626,11765,71931,37733
10,47530,5111,53554,2572,21812,14257,2353,15913,52492,20793,...,38287,50967,2108,31908,29620,51378,4928,857,680,23662


In [90]:
bucketsB = create_bucktes(2, trasposed_signatureB)

In [115]:
dic = {}

for k,v in bucketsB.items():
    if k in bucketsA:
        for el in v:
            for el2 in bucketsA[k]:
                if el in dic:
                    dic[el].append(bank_transactions.loc[el2]['CustomerID'])
                else:
                    dic[el] = [bank_transactions.loc[el2]['CustomerID']]

In [129]:
# build the whole column at once: chained assignment (final_query['neighbors'].loc[k] = ...) may write to a temporary copy
final_query = final_query.assign(neighbors=pd.Series({k: sorted(v) for k, v in dic.items()}))


In [107]:
def jaccard_similarity(signA, signB):
    # fraction of positions where the two MinHash signatures agree,
    # which estimates the Jaccard similarity of the underlying shingle sets
    return sum([1 for a, b in zip(signA, signB) if a == b]) / len(signA)

def find_similars(transactions, query, threshold):
    similars = {idx:[] for idx, val in enumerate(query.signature)}
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for idx1, signA in query['signature'].items():
        for idx2, signB in transactions['signature'].items():
            sim = jaccard_similarity(signA, signB)
            if sim >= threshold:
                similars[idx1].append(bank_transactions.loc[idx2]['CustomerID'])
            
    return similars
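
The double loop above compares every query signature with every dataset signature in pure Python, which is slow on about a million rows. One possible alternative, sketched here on made-up signatures, is to store the signatures in a NumPy array and compare a query against all rows at once; the array contents and variable names are illustrative assumptions.

```python
import numpy as np

# Toy signature matrix: one row per transaction, one column per permutation
dataset_sigs = np.array([
    [3, 11, 7, 5, 2],
    [3, 11, 9, 6, 2],
    [8, 2, 7, 1, 4],
])
query_sig = np.array([3, 11, 7, 5, 9])

# Broadcasting compares the query with every row at once;
# the row-wise mean is the fraction of matching positions
similarity = (dataset_sigs == query_sig).mean(axis=1)

threshold = 0.6
similar_rows = np.flatnonzero(similarity >= threshold)
print(similarity, similar_rows)
```

The resulting row positions can then be mapped back to `CustomerID` values in a single indexing step instead of one `.loc` lookup per match.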

In [136]:
similars = find_similars(transactions, final_query, 0.6)

In [130]:
# build the whole column at once: chained assignment (final_query['similars'].loc[k] = ...) may write to a temporary copy
final_query = final_query.assign(similars=pd.Series({k: sorted(v) for k, v in similars.items()}))


In [131]:
final_query

Unnamed: 0,CustomerDOB,CustGender,CustLocation,CustGroupsBalance,TransactionAmount (INR),TransactionDate,signature,neighbors,similars
0,27/7/78,M,DELHI,highBalance,65.0,2/9/16,"[3, 25400, 34530, 61031, 22160, 6879, 32121, 1...","[C1010011, C1010075, C1010085, C1010089, C1010...","[C1015370, C1016550, C1016639, C1018774, C1022..."
1,6/11/92,M,PANCHKULA,mediumBalance,6025.0,2/9/16,"[3, 2407, 3540, 10371, 22160, 6879, 30687, 253...","[C1010636, C1010670, C1011124, C1011670, C1011...","[C2124065, C2424017, C2424023, C3224067, C3524..."
2,14/8/91,M,PATNA,mediumBalance,541.5,10/8/16,"[3, 21176, 23249, 10371, 22160, 62612, 3323, 1...","[C1010835, C1016144, C1019214, C1019969, C1020...","[C1632230, C2038950, C3617630, C4311226, C4918..."
3,3/1/87,M,CHENNAI,highBalance,1000.0,29/8/16,"[3, 25400, 68846, 20321, 1365, 18593, 43656, 1...","[C1010011, C1010075, C1010085, C1010089, C1010...","[C1140113, C1314618, C1327427, C1761138, C1880..."
4,4/1/95,M,GURGAON,highBalance,80.0,25/9/16,"[3, 7063, 49165, 281, 22160, 2410, 45535, 1881...","[C1010035, C1010036, C1010051, C1010112, C1010...","[C1012250, C1013392, C1016748, C1017723, C1018..."
5,10/1/81,M,WORLD TRADE CENTRE BANGALORE,mediumBalance,303.0,11/9/16,"[3, 30655, 5680, 10371, 22160, 17492, 67812, 6...","[C1010780, C1011266, C1011464, C1012629, C1012...","[C1016630, C1121450, C1417377, C1473957, C1559..."
6,20/9/76,F,CHITTOOR,mediumBalance,20.0,28/8/16,"[3194, 31679, 20447, 10371, 27836, 887, 20748,...","[C1010729, C1010761, C1010820, C1010891, C1010...","[C1010891, C1141256, C1239323, C1311731, C1336..."
7,10/4/91,M,MOHALI,lowBalance,50.0,2/8/16,"[3, 1103, 29793, 16648, 13721, 6531, 1793, 152...","[C1010243, C1010314, C1010352, C1010766, C1010...","[C1010771, C1012432, C1012679, C1016516, C1017..."
8,19/3/90,M,MOHALI,lowBalance,300.0,26/8/16,"[3, 21110, 19845, 54568, 3028, 36725, 10439, 5...","[C1010136, C1010136, C1010245, C1010551, C1010...","[C1228171, C1236486, C1335022, C1414084, C1624..."
9,19/12/70,M,SERAMPORE,highBalance,299.0,27/8/16,"[3, 5573, 37370, 14451, 22160, 6978, 2857, 188...","[C1010046, C1010249, C1010470, C1011112, C1011...","[C1314883, C1414814, C1814840, C2914879, C6014..."


In [135]:
query.loc[49]

CustomerDOB                       5/1/87
CustGender                             M
CustLocation                       DELHI
CustAccountBalance              10989.03
TransactionDate                   4/9/16
TransactionTime                   113240
TransactionAmount (INR)            240.0
CustGroupsBalance          mediumBalance
Name: 49, dtype: object

In [138]:
bank_transactions[bank_transactions.CustomerID == 'C1011688']

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR),CustGroupsBalance
268624,T270535,C1011688,13/10/88,M,GURGAON,66251.44,12/8/16,201440,240.0,highBalance
755189,T760478,C1011688,26/8/91,M,BANGALORE,13051.11,1/9/16,161906,297.0,mediumBalance


In [139]:
bank_transactions[bank_transactions.CustomerID == 'C1032730']

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,CustGender,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR),CustGroupsBalance
874300,T880289,C1032730,7/2/86,M,THANE,17361.99,8/9/16,225450,240.0,mediumBalance


In [141]:
len(set(final_query.loc[0]['neighbors']).intersection(set(final_query.loc[0]['similars'])))

1423

In [142]:
len(set(final_query.loc[1]['neighbors']).intersection(set(final_query.loc[1]['similars'])))

13

## Command Line

### 1. In which location has the maximum number of purchases been made?

In [152]:
bank_transactions.groupby(['CustLocation'])['CustLocation'].count().sort_values(ascending=False).head(1)

CustLocation
MUMBAI    101997
Name: CustLocation, dtype: int64

### 2. In the dataset provided, did females spend more than males, or vice versa?

In [153]:
avgFemalesTransactions = bank_transactions.loc[bank_transactions.CustGender == "F"]['TransactionAmount (INR)'].mean()
avgMalesTransactions = bank_transactions.loc[bank_transactions.CustGender == "M"]['TransactionAmount (INR)'].mean()
print("Total average spent by females: ", avgFemalesTransactions)
print("Total average spent by males: ", avgMalesTransactions)

Total average spent by females:  1643.9584570348936
Total average spent by males:  1537.3411848042588


### 3. Report the customer with the highest average transaction amount in the dataset.

In [154]:
bank_transactions.groupby('CustomerID')['TransactionAmount (INR)'].mean().sort_values(ascending=False).head(1)

CustomerID
C7319271    1560034.99
Name: TransactionAmount (INR), dtype: float64