# Machine Learning Challenge: Card Transactions!

Summary
- It is a classification problem
- Features include numerical values and categorical values
- Need to impute missing numerical values
- Need to impute missing categorical values
- Categorical features need to be converted to numerical values (LabelEncoder and OneHotEncoder, get_dummies, or DictVectorizer)


Dataset source
- The dataset can be downloaded from:
- https://github.com/msarker000/ml-group-project/blob/master/documents/transactions.txt.zip

### Library

In [2]:
import numpy as np
import pandas as pd
import json
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


### Data Preparation

#### Load Data

In [3]:
def read_data(path):
    '''
    Read a line-delimited JSON file (one JSON object per line)
    and return its records as a pandas DataFrame.
    '''
    records = []
    with open(path) as f:
        for line in f:
            # each line holds one complete JSON record
            records.append(json.loads(line))
    return pd.DataFrame(records)
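
As a hedged alternative to the manual loop, pandas can parse line-delimited JSON itself with `pd.read_json(..., lines=True)`. The sketch below writes two small sample records to a temporary file (the file and the variable names are illustrative only) and reads them both ways. Note that `read_json` may infer column types differently from `json.loads`, for example for strings that contain only digits, so the resulting dtypes should be checked before swapping one approach for the other.

```python
import json
import os
import tempfile

import pandas as pd

# Two small sample records, written one JSON object per line.
records = [
    {"accountNumber": "733493772", "merchantName": "Lyft", "transactionAmount": 111.33},
    {"accountNumber": "733493772", "merchantName": "Uber", "transactionAmount": 24.75},
]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
    sample_path = tmp.name

# Manual approach: parse each line with json.loads.
with open(sample_path) as f:
    df_manual = pd.DataFrame([json.loads(line) for line in f])

# pandas approach: lines=True treats every line as one record.
df_pandas = pd.read_json(sample_path, lines=True)

os.remove(sample_path)

# Compare the inferred dtypes of the two results.
print(df_manual.dtypes)
print(df_pandas.dtypes)
```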

In [4]:
df = read_data('transactions.txt')

In [5]:
df.head()

| | accountNumber | accountOpenDate | acqCountry | availableMoney | cardCVV | cardLast4Digits | cardPresent | creditLimit | currentBalance | currentExpDate | ... | merchantName | merchantState | merchantZip | posConditionCode | posEntryMode | posOnPremises | recurringAuthInd | transactionAmount | transactionDateTime | transactionType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 733493772 | 2014-08-03 | US | 5000.0 | 492 | 9184 | False | 5000.0 | 0.0 | 04/2020 | ... | Lyft | | | 1 | 5 | | | 111.33 | 2016-01-08T19:04:50 | PURCHASE |
| 1 | 733493772 | 2014-08-03 | US | 4888.67 | 492 | 9184 | False | 5000.0 | 111.33 | 06/2023 | ... | Uber | | | 1 | 9 | | | 24.75 | 2016-01-09T22:32:39 | PURCHASE |
| 2 | 733493772 | 2014-08-03 | US | 4863.92 | 492 | 9184 | False | 5000.0 | 136.08 | 12/2027 | ... | Lyft | | | 1 | 5 | | | 187.4 | 2016-01-11T13:36:55 | PURCHASE |
| 3 | 733493772 | 2014-08-03 | US | 4676.52 | 492 | 9184 | False | 5000.0 | 323.48 | 09/2029 | ... | Lyft | | | 1 | 2 | | | 227.34 | 2016-01-11T22:47:46 | PURCHASE |
| 4 | 733493772 | 2014-08-03 | US | 4449.18 | 492 | 9184 | False | 5000.0 | 550.82 | 10/2024 | ... | Lyft | | | 1 | 2 | | | 0.0 | 2016-01-16T01:41:11 | ADDRESS_VERIFICATION |


In [6]:
df.shape

(641914, 29)

This dataset has 641,914 records (observations) and 29 columns: 28 features plus the target, 'isFraud'.

## Exploratory Data Analysis

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 641914 entries, 0 to 641913
Data columns (total 29 columns):
accountNumber               641914 non-null object
accountOpenDate             641914 non-null object
acqCountry                  641914 non-null object
availableMoney              641914 non-null float64
cardCVV                     641914 non-null object
cardLast4Digits             641914 non-null object
cardPresent                 641914 non-null bool
creditLimit                 641914 non-null float64
currentBalance              641914 non-null float64
currentExpDate              641914 non-null object
customerId                  641914 non-null object
dateOfLastAddressChange     641914 non-null object
echoBuffer                  641914 non-null object
enteredCVV                  641914 non-null object
expirationDateKeyInMatch    641914 non-null bool
isFraud                     641914 non-null bool
merchantCategoryCode        641914 non-null object
merchantCity             

#### Replacing blank values with NaN

In [7]:
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
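
To make the effect of the regular expression concrete, here is a minimal sketch on a toy frame. The value `"NY"` and the name `toy` are made up for illustration; only the pattern and the `replace` call mirror the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame: one real value, one empty string, one whitespace-only string.
toy = pd.DataFrame({"merchantState": ["NY", "", "   "]})

# ^\s*$ matches strings that are empty or contain only whitespace.
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)

# Count how many entries are now treated as missing.
n_missing = int(cleaned["merchantState"].isna().sum())
print(n_missing)
```

Once blanks are real NaN values, `isna()`, `dropna()`, and `describe()` all treat them as missing instead of as valid strings.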

In [8]:
df.describe()

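| | availableMoney | creditLimit | currentBalance | echoBuffer | merchantCity | merchantState | merchantZip | posOnPremises | recurringAuthInd | transactionAmount |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 641914.0 | 641914.0 | 641914.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 641914.0 |
| mean | 6652.828573 | 10697.210608 | 4044.382035 | NaN | NaN | NaN | NaN | NaN | NaN | 135.162497 |
| std | 9227.132275 | 11460.359133 | 5945.510224 | NaN | NaN | NaN | NaN | NaN | NaN | 147.053302 |
| min | -1244.93 | 250.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 |
| 25% | 1114.97 | 5000.0 | 502.4425 | NaN | NaN | NaN | NaN | NaN | NaN | 32.32 |
| 50% | 3578.165 | 7500.0 | 2151.86 | NaN | NaN | NaN | NaN | NaN | NaN | 85.8 |
| 75% | 8169.185 | 15000.0 | 5005.89 | NaN | NaN | NaN | NaN | NaN | NaN | 189.03 |
| max | 50000.0 | 50000.0 | 47496.5 | NaN | NaN | NaN | NaN | NaN | NaN | 1825.25 |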


### Observations:

There are 641,914 instances in the dataset. Some attributes are numerical, such as 'availableMoney' and 'creditLimit', and some are categorical, such as 'merchantCategoryCode' and 'merchantName'. A few attributes contain no values at all:

- echoBuffer
- merchantCity
- merchantState
- merchantZip
- posOnPremises
- recurringAuthInd

We can drop these columns. Other attributes are only partly missing; 'acqCountry', for example, has 3913 missing values. We need to handle these missing values first.
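
The distinction between entirely missing and partly missing columns can also be checked programmatically rather than read off `describe()`. The following is a minimal sketch on a toy frame; the values and the names `toy`, `empty_cols`, and `missing_counts` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "echoBuffer": [np.nan, np.nan, np.nan],     # entirely missing
    "acqCountry": ["US", np.nan, "US"],         # partly missing
    "creditLimit": [5000.0, 5000.0, 7500.0],    # complete
})

# isna().all() is True only for columns where every value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()

# isna().sum() counts the missing values per column.
missing_counts = toy.isna().sum()

print(empty_cols)
print(missing_counts)
```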


#### Handling missing values

We have a few options:

- Drop those attributes from the data entirely.
- Drop those records (remove the rows where these attributes are missing).
- Set the missing entries to some value. For numerical attributes, we can use the mean or median; for categorical attributes, we can use the most frequent category.
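
The three options above can be sketched on a toy frame as follows. The values (including the country code `"CAN"`) and the variable names are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "acqCountry": ["US", "US", np.nan, "CAN"],
    "transactionAmount": [10.0, np.nan, 30.0, 50.0],
})

# Option 1: drop the attribute entirely.
dropped_col = toy.drop(columns=["acqCountry"])

# Option 2: drop the records where the attribute is missing.
dropped_rows = toy.dropna(subset=["acqCountry"])

# Option 3: impute. Median for numerical, most frequent value for categorical.
imputed = toy.copy()
imputed["transactionAmount"] = imputed["transactionAmount"].fillna(
    imputed["transactionAmount"].median()
)
imputed["acqCountry"] = imputed["acqCountry"].fillna(
    imputed["acqCountry"].mode()[0]
)

print(imputed)
```

Dropping is the simplest choice when few rows are affected, while imputation keeps every record at the cost of introducing estimated values.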


#### Dropping the attributes that are entirely missing

In [8]:
df = df.drop(['echoBuffer', 'merchantCity', 'merchantState', 'merchantZip',
              'posOnPremises', 'recurringAuthInd'], axis=1)

In [9]:
df.shape

(641914, 23)

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 641914 entries, 0 to 641913
Data columns (total 23 columns):
accountNumber               641914 non-null object
accountOpenDate             641914 non-null object
acqCountry                  638001 non-null object
availableMoney              641914 non-null float64
cardCVV                     641914 non-null object
cardLast4Digits             641914 non-null object
cardPresent                 641914 non-null bool
creditLimit                 641914 non-null float64
currentBalance              641914 non-null float64
currentExpDate              641914 non-null object
customerId                  641914 non-null object
dateOfLastAddressChange     641914 non-null object
enteredCVV                  641914 non-null object
expirationDateKeyInMatch    641914 non-null bool
isFraud                     641914 non-null bool
merchantCategoryCode        641914 non-null object
merchantCountryCode         641290 non-null object
merchantName             

#### Convert to Numeric

In [10]:
# numerical columns
num_cols = ['availableMoney', 'creditLimit', 'currentBalance', 'transactionAmount']

In [11]:
# categorical columns
cate_cols = df.columns.drop('isFraud').drop(num_cols)
# display categorical columns
cate_cols

Index(['accountNumber', 'accountOpenDate', 'acqCountry', 'cardCVV',
       'cardLast4Digits', 'cardPresent', 'currentExpDate', 'customerId',
       'dateOfLastAddressChange', 'enteredCVV', 'expirationDateKeyInMatch',
       'merchantCategoryCode', 'merchantCountryCode', 'merchantName',
       'posConditionCode', 'posEntryMode', 'transactionDateTime',
       'transactionType'],
      dtype='object')

In [12]:
# convert numerical columns; values that cannot be parsed become NaN
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')
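
As a small illustration of what `errors='coerce'` does, the sketch below converts a toy string column that contains one unparseable entry. The entry `"n/a"` and the name `toy` are assumptions made for the example.

```python
import pandas as pd

toy = pd.DataFrame({"transactionAmount": ["111.33", "24.75", "n/a"]})

# errors='coerce' turns values that cannot be parsed into NaN instead of raising.
toy["transactionAmount"] = pd.to_numeric(toy["transactionAmount"], errors="coerce")

print(toy["transactionAmount"].dtype)
print(toy["transactionAmount"].isna().sum())
```

Because coercion can silently create new missing values, it is worth rerunning `df.isna().sum()` after the conversion.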

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 641914 entries, 0 to 641913
Data columns (total 23 columns):
accountNumber               641914 non-null object
accountOpenDate             641914 non-null object
acqCountry                  638001 non-null object
availableMoney              641914 non-null float64
cardCVV                     641914 non-null object
cardLast4Digits             641914 non-null object
cardPresent                 641914 non-null bool
creditLimit                 641914 non-null float64
currentBalance              641914 non-null float64
currentExpDate              641914 non-null object
customerId                  641914 non-null object
dateOfLastAddressChange     641914 non-null object
enteredCVV                  641914 non-null object
expirationDateKeyInMatch    641914 non-null bool
isFraud                     641914 non-null bool
merchantCategoryCode        641914 non-null object
merchantCountryCode         641290 non-null object
merchantName             

#### Categorical Feature Unique Values

In [13]:
# check the number of unique values
df[cate_cols].nunique()


accountNumber                 5000
accountOpenDate               1826
acqCountry                       4
cardCVV                        899
cardLast4Digits               5134
cardPresent                      2
currentExpDate                 165
customerId                    5000
dateOfLastAddressChange       2186
enteredCVV                     980
expirationDateKeyInMatch         2
merchantCategoryCode            19
merchantCountryCode              4
merchantName                  2493
posConditionCode                 3
posEntryMode                     5
transactionDateTime         635472
transactionType                  3
dtype: int64

In [14]:
df.isna().sum()

accountNumber                  0
accountOpenDate                0
acqCountry                  3913
availableMoney                 0
cardCVV                        0
cardLast4Digits                0
cardPresent                    0
creditLimit                    0
currentBalance                 0
currentExpDate                 0
customerId                     0
dateOfLastAddressChange        0
enteredCVV                     0
expirationDateKeyInMatch       0
isFraud                        0
merchantCategoryCode           0
merchantCountryCode          624
merchantName                   0
posConditionCode             287
posEntryMode                3345
transactionAmount              0
transactionDateTime            0
transactionType              589
dtype: int64

#### Dropping the rows that contain NaN values

In [15]:
df.dropna(how='any', subset=['acqCountry'], inplace=True)

In [16]:
df.dropna(how='any', subset=['merchantCountryCode'], inplace=True)

In [17]:
df.dropna(how='any', subset=['posEntryMode', 'posConditionCode', 'transactionType'], inplace=True)
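
Because `how='any'` removes a row as soon as one listed column is NaN, the separate calls are equivalent to a single call over the combined column list. A minimal sketch on toy data, which also reports how many rows were removed (the values and the names `toy`, `cols_with_nan`, and `rows_dropped` are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "acqCountry": ["US", np.nan, "US", "US"],
    "posEntryMode": ["05", "09", np.nan, "02"],
    "transactionType": ["PURCHASE", "PURCHASE", "PURCHASE", np.nan],
})

cols_with_nan = ["acqCountry", "posEntryMode", "transactionType"]

rows_before = len(toy)
# how='any' drops a row if at least one of the listed columns is NaN.
toy = toy.dropna(how="any", subset=cols_with_nan)
rows_dropped = rows_before - len(toy)

print(rows_dropped)
```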

In [18]:
df.isna().sum()

accountNumber               0
accountOpenDate             0
acqCountry                  0
availableMoney              0
cardCVV                     0
cardLast4Digits             0
cardPresent                 0
creditLimit                 0
currentBalance              0
currentExpDate              0
customerId                  0
dateOfLastAddressChange     0
enteredCVV                  0
expirationDateKeyInMatch    0
isFraud                     0
merchantCategoryCode        0
merchantCountryCode         0
merchantName                0
posConditionCode            0
posEntryMode                0
transactionAmount           0
transactionDateTime         0
transactionType             0
dtype: int64

In [55]:
df.head()

| | accountNumber | accountOpenDate | acqCountry | availableMoney | cardCVV | cardLast4Digits | cardPresent | creditLimit | currentBalance | currentExpDate | ... | expirationDateKeyInMatch | isFraud | merchantCategoryCode | merchantCountryCode | merchantName | posConditionCode | posEntryMode | transactionAmount | transactionDateTime | transactionType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 733493772 | 2014-08-03 | US | 5000.0 | 492 | 9184 | False | 5000.0 | 0.0 | 04/2020 | ... | False | True | rideshare | US | Lyft | 1 | 5 | 111.33 | 2016-01-08T19:04:50 | PURCHASE |
| 1 | 733493772 | 2014-08-03 | US | 4888.67 | 492 | 9184 | False | 5000.0 | 111.33 | 06/2023 | ... | False | False | rideshare | US | Uber | 1 | 9 | 24.75 | 2016-01-09T22:32:39 | PURCHASE |
| 2 | 733493772 | 2014-08-03 | US | 4863.92 | 492 | 9184 | False | 5000.0 | 136.08 | 12/2027 | ... | False | False | rideshare | US | Lyft | 1 | 5 | 187.4 | 2016-01-11T13:36:55 | PURCHASE |
| 3 | 733493772 | 2014-08-03 | US | 4676.52 | 492 | 9184 | False | 5000.0 | 323.48 | 09/2029 | ... | False | True | rideshare | US | Lyft | 1 | 2 | 227.34 | 2016-01-11T22:47:46 | PURCHASE |
| 4 | 733493772 | 2014-08-03 | US | 4449.18 | 492 | 9184 | False | 5000.0 | 550.82 | 10/2024 | ... | False | False | rideshare | US | Lyft | 1 | 2 | 0.0 | 2016-01-16T01:41:11 | ADDRESS_VERIFICATION |


#### Encoding categorical features into numerical values

Categorical features must be encoded as numerical values before modeling. As the table shows, we have several categorical features. I am going to try LabelEncoder and OneHotEncoder, which help with encoding categorical features. First, we need to extract the categorical features using a boolean mask.

In [19]:
X, y = df.drop(['isFraud'],axis=1), df['isFraud']

In [20]:
# Categorical boolean mask
categorical_feature_mask = X.dtypes==object
# filter categorical columns using mask and turn it into a list
categorical_cols = X.columns[categorical_feature_mask].tolist()

In [21]:
categorical_feature_mask

accountNumber                True
accountOpenDate              True
acqCountry                   True
availableMoney              False
cardCVV                      True
cardLast4Digits              True
cardPresent                 False
creditLimit                 False
currentBalance              False
currentExpDate               True
customerId                   True
dateOfLastAddressChange      True
enteredCVV                   True
expirationDateKeyInMatch    False
merchantCategoryCode         True
merchantCountryCode          True
merchantName                 True
posConditionCode             True
posEntryMode                 True
transactionAmount           False
transactionDateTime          True
transactionType              True
dtype: bool

#### Using LabelEncoder to convert each class of a given feature to a numerical value

In [22]:
# import labelencoder
from sklearn.preprocessing import LabelEncoder
# instantiate labelencoder object
le = LabelEncoder()

#### Then apply LabelEncoder to each of the categorical columns:

In [23]:
# apply le on categorical feature columns
X[categorical_cols] = X[categorical_cols].apply(lambda col: le.fit_transform(col))
X[categorical_cols].tail(10)

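| | accountNumber | accountOpenDate | acqCountry | cardCVV | cardLast4Digits | currentExpDate | customerId | dateOfLastAddressChange | enteredCVV | merchantCategoryCode | merchantCountryCode | merchantName | posConditionCode | posEntryMode | transactionDateTime | transactionType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 641904 | 440 | 1769 | 3 | 27 | 2536 | 11 | 440 | 1976 | 108 | 12 | 3 | 43 | 1 | 1 | 522768 | 1 |
| 641905 | 440 | 1769 | 3 | 27 | 2536 | 65 | 440 | 1976 | 108 | 13 | 3 | 160 | 0 | 0 | 529723 | 1 |
| 641906 | 440 | 1769 | 3 | 27 | 2536 | 148 | 440 | 1976 | 108 | 13 | 3 | 605 | 0 | 2 | 538010 | 1 |
| 641907 | 440 | 1769 | 3 | 27 | 2536 | 148 | 440 | 1976 | 108 | 15 | 3 | 2470 | 1 | 2 | 546930 | 1 |
| 641908 | 440 | 1769 | 3 | 27 | 2536 | 60 | 440 | 1976 | 108 | 13 | 3 | 1201 | 0 | 1 | 549757 | 1 |
| 641909 | 440 | 1769 | 3 | 27 | 2536 | 11 | 440 | 1976 | 108 | 12 | 3 | 43 | 1 | 1 | 578560 | 1 |
| 641910 | 440 | 1769 | 3 | 27 | 2536 | 61 | 440 | 1976 | 108 | 13 | 3 | 161 | 0 | 2 | 587149 | 1 |
| 641911 | 440 | 1769 | 3 | 27 | 2536 | 124 | 440 | 1976 | 108 | 13 | 3 | 604 | 0 | 0 | 600451 | 1 |
| 641912 | 440 | 1769 | 3 | 27 | 2536 | 148 | 440 | 1976 | 108 | 15 | 3 | 2470 | 1 | 2 | 605513 | 1 |
| 641913 | 440 | 1769 | 3 | 27 | 2536 | 102 | 440 | 1976 | 108 | 13 | 3 | 1201 | 0 | 2 | 623058 | 1 |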
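
The cell above reuses a single `le` object, which is refit for every column, so after the `apply` only the mapping of the last column is still available. If the integer codes need to be inspected or mapped back to labels, one possible variation is to keep one encoder per column. This is a sketch on toy data; the dictionary `encoders` and the frame `toy` are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "merchantName": ["Lyft", "Uber", "Lyft"],
    "transactionType": ["PURCHASE", "PURCHASE", "ADDRESS_VERIFICATION"],
})

# One fitted encoder per column, so each mapping can be inspected or inverted later.
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# classes_ lists the original labels in the order of their integer codes.
print(encoders["merchantName"].classes_)

# inverse_transform maps the integer codes back to the original labels.
restored = encoders["merchantName"].inverse_transform(encoded["merchantName"])
print(list(restored))
```

Keep in mind that the integer codes follow the sorted order of the labels and carry no meaningful ranking, which is one reason a one-hot step usually follows for linear models.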


In [24]:
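# NOTE: `categorical_features` was deprecated in scikit-learn 0.20 and removed in 0.22,
# so this cell only runs on older releases.
enc = preprocessing.OneHotEncoder(categorical_features=categorical_feature_mask)

X_ohe = enc.fit_transform(X)  # returns a SciPy sparse matrix by default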

Because LabelEncoder has already converted the categories to integers, OneHotEncoder can be applied to those columns directly.
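
On recent scikit-learn releases, where `OneHotEncoder` no longer accepts `categorical_features`, a comparable result can be obtained with `ColumnTransformer`, and the encoder accepts string categories without a prior LabelEncoder step. The following is a minimal sketch on toy data, not the notebook's original method; the values `"CAN"` and `"REVERSAL"` and the names `toy`, `ct`, and `X_encoded` are assumptions made for illustration.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "acqCountry": ["US", "CAN", "US", "US"],
    "transactionType": ["PURCHASE", "PURCHASE", "REVERSAL", "PURCHASE"],
    "transactionAmount": [111.33, 24.75, 187.40, 227.34],
})
categorical_cols = ["acqCountry", "transactionType"]

# One-hot encode the listed columns; pass the remaining columns through unchanged.
ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)
X_encoded = ct.fit_transform(toy)

# The result may be sparse or dense depending on its density; densify the toy result.
if sparse.issparse(X_encoded):
    X_encoded = X_encoded.toarray()

print(X_encoded.shape)
```

For columns with very many distinct values, such as 'transactionDateTime' with 635472 unique entries, one-hot encoding creates one column per value, so converting the result to a dense array can require a very large amount of memory. Keeping the sparse matrix, or excluding such columns, may be preferable.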


In [25]:
X_new = X_ohe.toarray()

In [26]:
X_new

array([[   0.  ,    0.  ,    0.  , ...,    0.  ,    0.  ,  111.33],
       [   0.  ,    0.  ,    0.  , ...,  111.33,    0.  ,   24.75],
       [   0.  ,    0.  ,    0.  , ...,  136.08,    0.  ,  187.4 ],
       ...,
       [   0.  ,    0.  ,    0.  , ..., 5155.05,    0.  ,  138.42],
       [   0.  ,    0.  ,    0.  , ..., 5293.47,    0.  ,   16.31],
       [   0.  ,    0.  ,    0.  , ..., 5309.78,    0.  ,   32.53]])

### Train Test Validate Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y, 
                                                     test_size=0.3, 
                                                     random_state=0)
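
The heading mentions a validation set, while the cell above produces only training and test sets. One possible way to obtain all three is to call `train_test_split` twice. The sketch below uses toy data and assumes a 60/20/20 split and an imbalanced label, as fraud labels often are; `stratify` keeps the class proportions similar in each part. The names `X_toy`, `y_toy`, `X_rest`, and `X_val` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 10 of them positive (an imbalanced label like isFraud).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

# First hold out 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
# Then split the remainder 75/25 into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```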