# Transactions Narrative Features Extractions

In this notebook we are going to work on transaction narrative, we will consider three kind of feaures:
1. Frequency of a given payment
2. Activity associated with a given transaction (i.e. if paying in Casino in Las Vagas then we flag a Gambler, 'PayPay' => Internet shopper and so on)
3. Country of transaction, to flag if transaction is domestic or international

In [1]:
import pandas as pd
import numpy as np
import hashlib
import math

import matplotlib.pyplot as plt
%matplotlib inline 

from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
df_train_abt = pd.read_csv("../specs/clean/Model Build - AbastractBaseTable - Validated.csv", encoding='latin-1', index_col=False)
df_train_abt.head(3)

df_train_abt['LastTransactionNarrative'] = df_train_abt['LastTransactionNarrative'].apply(str)
df_train_abt.dtypes



Unnamed: 0,ClientID,Age,Gender,County,IncomeGroup,HeldLoanPreviously,NumberOfProductsInbank,AverageTXNAmount,NumTransactions,LastTXNAmount,MerchantCode,LastTransactionNarrative,LoanFlag
0,1,36,Female,Cork,10001 - 40000,1,4,58.0,0.0,,,,0
1,2,43,Female,Cavan,0 - 10000,0,4,2.663,17.0,83.66,7211.0,THE BRIDGE LAUNDRY WICKLOW TOWN,0
2,3,32,Male,Dublin,10001 - 40000,0,2,46.0,25.0,526.18,3667.0,LUXOR HOTEL/CASINO LAS VEGAS NV,0


ClientID                      int64
Age                           int64
Gender                       object
County                       object
IncomeGroup                  object
HeldLoanPreviously            int64
NumberOfProductsInbank        int64
AverageTXNAmount            float64
NumTransactions             float64
LastTXNAmount               float64
MerchantCode                float64
LastTransactionNarrative     object
LoanFlag                      int64
dtype: object

## 1. Frequency of a given payment

First we need to construct a histogram of activity payments then we assign each customer the histogram value, making him part of these payment group

In [3]:
hist = {}
for n in list(df_train_abt['LastTransactionNarrative']):
    if not isinstance(n, str):
        continue
    if n == 'nan':
        continue
        
    hash_digest = hashlib.md5(n.encode('utf-8').strip().upper()).hexdigest()
    
    if hash_digest in hist:
        hist[hash_digest] += 1
    else:
        hist[hash_digest] = 1

df_train_narrative_hist = pd.DataFrame(list(sorted(hist.items())), columns=['TXN', 'freq'])
df_train_narrative_hist.sort_values(by=["freq"], ascending=False).head(10)



Unnamed: 0,TXN,freq
1222,438e7d777b4277f173f5e4649bc3fb29,17
3030,a58b4ec61b405f1a7ce44bcd77095f5d,16
4072,dffd447533587dcf45fa73c67b0618f0,16
192,0ac2367e83239039aa9ca9f57abda250,15
327,13029851894e840cebb0408695b6480d,14
1339,4910109b7f77c2b91cb432b5c3898738,13
519,1e2d1e8649e8abd754944f679916f3d2,13
2271,7e3ec882e9ae965eedbc8a8486e474c5,13
3725,ccfcedc7fc16d1857a83af029fc522f3,13
3181,af3cd0d6c4fb373aa6ec47868f1bd750,13


In [4]:
df_txn_features = df_train_abt.copy()[['ClientID', 'LastTransactionNarrative']]

def txn_rank(x):
    hash_digest = hashlib.md5(x.encode('utf-8').strip().upper()).hexdigest()
    return hist.get(hash_digest)
    #return df_train_narrative_hist[df_train_narrative_hist.TXN.str.contains(x)]

df_txn_features['Rank'] = df_txn_features.LastTransactionNarrative.apply(txn_rank)
df_txn_features


Unnamed: 0,ClientID,LastTransactionNarrative,Rank
0,1,,
1,2,THE BRIDGE LAUNDRY WICKLOW TOWN,2.0
2,3,LUXOR HOTEL/CASINO LAS VEGAS NV,9.0
3,4,HARVEY NORMAN CARRICKMINES,1.0
4,5,PAYPAL *PETEWOODWAR 35314369001,1.0
5,6,METROPARK HTL KLN HK10900HONG KONG,1.0
6,7,AIR FRANCE E AIRFRANCE.FR,2.0
7,8,CATHAY PACIFIC AIR L35100HONG KONG,1.0
8,9,DUBLIN MINT OFFICE LONDON,3.0
9,10,AUSTRIAN AI 2572136619496TICKET MAILED,1.0


## 2. Activity associated with a given transaction

The feature set trying to construct a payment behavior for each client, for example a Casino payment traction would associate a gambler client flag and a Paypal transaction an online shopper client and so on.
As this step is done manually, for now, we inspect the most used keywords across all transactions by tokenized each transation narrative and build words histogram

In [5]:
words_hist = {}
for n in list(df_train_abt['LastTransactionNarrative']):
    if not isinstance(n, str):
        continue
    if n == 'nan':
        continue
    for token in str.split(n):
        if token in words_hist:
            words_hist[token] += 1
        else:
            words_hist[token] = 1

df_words_hist = pd.DataFrame(list(sorted(words_hist.items(), reverse=True)), columns=['Words', 'Freq'])
df_words_hist = df_words_hist.sort_values(by=["Freq"], ascending=False)

print("Top 10 and last 10 words in TXN narrative")
df_words_hist.head(10)
df_words_hist.tail(10)



Top 10 and last 10 words in TXN narrative


Unnamed: 0,Words,Freq
4388,DUBLIN,702
3497,HOTEL,497
2771,LTD,295
2135,NV,294
6053,AIR,276
402,VEGAS,263
7337,&,222
1864,PAYPAL,211
2953,LAS,210
6520,35314369001,180


Unnamed: 0,Words,Freq
3096,KILDA,1
3094,"KILDEASA,",1
3093,KILKENNKILKENNY,1
3091,KILL,1
3089,KILLINARDEN,1
3087,KILLRUSH,1
3085,KILMACUD,1
3084,KILMIHILL,1
3083,KILMORE,1
3675,Geneve,1


In [6]:


dic_features = {
    # Feature => Keyworkds
    'gamber': ['CASINO', 'VEGAS', 'LAS', 'HOTEL/CASINO'],
    'luxurious': ['INN', 'SUITES', 'PLAZA', 'HILTON', 'ROYAL', 'HYATT', 'MARRIOTT', 'FAIRMONT', 'RESORT-WDW', 'RESORT/CASINLAS','RESORTS'],
    'golfer': ['GOLF'],
    'traveler': ['AIR', 'AIRWAY', 'Limitersd-travel.ie', 'Airport', 'EASYJET.COM', 'RYANAIR'],
    'gamer': ['PLAYSTATIONNETWORK'],
    'shopper': ['STORES', 'AMAZON.CO.UK', 'Amazon'],
    'cinephilia': ['NETFLIX.COM', 'MOVIE', 'MOVIES-AT.IE', 'MOVIEPLEX', 'MOVIES', 'ITUNES.COM/BILL'],
    'car_renter' : ['RENT', 'RENT-A-CAR']
}

def extract_payment_type(feture, narrative_tokens):
    for t in narrative_tokens:
        for f in dic_features[feture]:
            if f == t:
                return 1
    return 0    
    
df_payment_patterns = pd.DataFrame(columns=['ClientID', 
                                            'gamber', 
                                            'luxurious', 
                                            'golfer', 
                                            'traveler', 
                                            'gamer', 
                                            'shopper', 
                                            'cinephilia', 
                                            'car_renter'])
index = 0
for index, row in df_train_abt.iterrows():
    client_id = row['ClientID']
    txn_narrative_tokens = str.split(row['LastTransactionNarrative'])
    
    gamber = extract_payment_type('gamber', txn_narrative_tokens)
    luxurious = extract_payment_type('luxurious', txn_narrative_tokens)
    golfer = extract_payment_type('golfer', txn_narrative_tokens)
    traveler = extract_payment_type('traveler', txn_narrative_tokens)
    gamer = extract_payment_type('gamber', txn_narrative_tokens)
    shopper = extract_payment_type('shopper', txn_narrative_tokens)
    cinephilia = extract_payment_type('cinephilia', txn_narrative_tokens)
    car_renter = extract_payment_type('car_renter', txn_narrative_tokens)
    
    df_payment_patterns.loc[index] = [client_id, gamber, luxurious, golfer, traveler, gamer, shopper, cinephilia, car_renter]
    index += 1
 
print('All the golfer customers')
df_payment_patterns[df_payment_patterns['golfer'] == 1]


All the golfer customers


Unnamed: 0,ClientID,gamber,luxurious,golfer,traveler,gamer,shopper,cinephilia,car_renter
492,503.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2161,2175.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3407,3421.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3652,3666.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3660,3676.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4219,4236.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4820,4837.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5025,5042.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5123,5140.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5326,5343.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


# 3. Domestic or international pay

In this feature we determine if the transaction is domistic in Ireland or international or online
But due to time constraint I'm skipping this feature

# Merging Tansactions Features

In [9]:
df_train_txn = pd.merge(df_txn_features, df_payment_patterns, on='ClientID')
df_train_txn

Unnamed: 0,ClientID,LastTransactionNarrative,Rank,gamber,luxurious,golfer,traveler,gamer,shopper,cinephilia,car_renter
0,1,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,THE BRIDGE LAUNDRY WICKLOW TOWN,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,LUXOR HOTEL/CASINO LAS VEGAS NV,9.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,4,HARVEY NORMAN CARRICKMINES,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,PAYPAL *PETEWOODWAR 35314369001,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,6,METROPARK HTL KLN HK10900HONG KONG,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,7,AIR FRANCE E AIRFRANCE.FR,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
7,8,CATHAY PACIFIC AIR L35100HONG KONG,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
8,9,DUBLIN MINT OFFICE LONDON,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,10,AUSTRIAN AI 2572136619496TICKET MAILED,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Persist transaction features

In [10]:
pd.DataFrame.to_csv(df_train_txn, "../specs/clean/Model Build - AbastractBaseTable - Transactions.csv", encoding='utf-8', index=False)