In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from datetime import datetime

%matplotlib inline

In [None]:
creditCardData = pd.read_csv("../input/creditcard.csv")
creditCardData.head()

In [None]:
print ('# of columns: %s'%(len(creditCardData.columns)))
creditCardData.describe()

So as per the description there are 284807 rows with 28 transformed feature columns V1-V28, 2 original features, Time and Amount, and a Class label.

In [None]:
#Checking for missing data
creditCardData.isnull().any().sum()

In [None]:
#Plotting a heatmap to visualize the correlation between the variables
sns.heatmap(creditCardData.corr())

The features V1-V28 are uncorrelated with one another, which is expected because they were obtained by performing PCA on the original dataset.
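
To illustrate why PCA output shows no correlation between components, here is a minimal sketch on synthetic data (not the credit card dataset). The data, the `mixing` matrix and the helper `max_offdiag_corr` are all made up for this illustration, and PCA is done with a plain SVD rather than whatever procedure was used to produce V1-V28.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated data: 3 features built from 2 latent factors plus noise.
latent = rng.normal(size=(1000, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.3, 1.0, 0.8]])
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 3))

# PCA via SVD of the centered data: project onto the right singular vectors.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Xc @ Vt.T

def max_offdiag_corr(data):
    """Largest absolute correlation between two different columns."""
    corr = np.corrcoef(data, rowvar=False)
    off_diagonal = corr - np.diag(np.diag(corr))
    return np.abs(off_diagonal).max()

raw_max = max_offdiag_corr(X)           # clearly non-zero for the raw features
pca_max = max_offdiag_corr(components)  # zero up to floating point error
print(raw_max, pca_max)
```

The same helper could be applied to the V1-V28 columns of `creditCardData` to check the heatmap numerically.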

## Analysis 1: When do people shop
We will visualize when people shop and when credit card fraud happens, and check whether there is a pattern. For this, however, we need to convert the time from seconds to days, hours and weeks.

In [None]:
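# The time is given in seconds, so we can treat it as seconds since the epoch because we don't care about the year.
# Note: datetime.fromtimestamp converts to the machine's local timezone, so the resulting hours depend on where this runs.
def convert_totime(seconds):
    return datetime.fromtimestamp(seconds)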

timeAnalysis = creditCardData[['Time', 'Amount', 'Class']].copy()
timeAnalysis['datetime'] = timeAnalysis.Time.apply(convert_totime)
# The max time is 172792 seconds, and 172792 / (60*60) is about 48 hours, so we only have 2 days of data
# and only plotting against the hour of the day makes sense
timeAnalysis['hour of the day'] = timeAnalysis.datetime.dt.hour
timeAnalysisGrouped = timeAnalysis.groupby(['Class', 'hour of the day'])['Amount'].count()

In [None]:
plt.figure(figsize = (10, 6))
validTransactions = timeAnalysisGrouped[0].copy()
validTransactions.name = 'Number of transactions'
validTransactions.plot.bar(title = '# of legitimate credit card transactions per hour', legend = True)

Note: An interesting thing happened here. When I did this calculation on my laptop, the distribution did not look right, so I added 7 hours to each transaction and got something like the figure above. I think this is because Kaggle's server must be running in UTC while my system is in MST (the difference between UTC and MST is 7 hours).
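
One way to avoid this dependence on the machine's timezone is to compute the hour directly from the elapsed seconds, or to pin the conversion to UTC. The sketch below assumes that `Time` holds seconds elapsed since the first transaction. The function names and the `start_hour` parameter are hypothetical, introduced only for this illustration.

```python
from datetime import datetime, timezone

SECONDS_PER_HOUR = 3600

def hour_of_day(seconds, start_hour=0):
    """Hour of the day (0-23) for a transaction `seconds` after the first one.

    start_hour is an assumption: the clock hour at which the first
    transaction is believed to have happened.
    """
    return (int(seconds) // SECONDS_PER_HOUR + start_hour) % 24

def hour_of_day_utc(seconds):
    """Same idea via datetime, with the timezone pinned to UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).hour

print(hour_of_day(172792))                # 23
print(hour_of_day(172792, start_hour=7))  # 6
print(hour_of_day_utc(172792))            # 23
```

Both versions give the same result on any machine, so the 7 hour shift becomes an explicit assumption (`start_hour=7`) instead of a side effect of where the notebook runs.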

In [None]:
## Run this section only if your distribution looks off, for example if it shows most transactions
## happening during the night
timeDelta = datetime.utcnow() - datetime.now() 
plt.figure(figsize = (10, 6))
timeAnalysis['hour of the day'] = timeAnalysis.datetime + timeDelta
timeAnalysis['hour of the day'] = timeAnalysis['hour of the day'].dt.hour
timeAnalysisGrouped = timeAnalysis.groupby(['Class', 'hour of the day'])['Amount'].count()
validTransactions = timeAnalysisGrouped[0].copy()
validTransactions.name = 'Number of transactions'
validTransactions.plot.bar(title = '# of legitimate credit card transactions per hour', legend = True)

In [None]:
plt.figure(figsize = (10, 6))
fraudTransactions = timeAnalysisGrouped[1].copy()
fraudTransactions.name = 'Number of transactions'
fraudTransactions.plot.bar(title = '# of fraud credit card transactions per hour', legend = True)

2 A.M. has an unusual uptick in the number of frauds committed. But it could also be that my assumption that the first transaction happened at 7 A.M. is incorrect. One thing is clear though: the fraudulent transactions are more evenly spread out than the legitimate transactions. This may be because there are very few fraudulent transactions, so they don't show a clear trend like the legitimate transactions do.


## Analysis 2: Are fraudulent transactions of higher value than normal transactions
It would be interesting to see whether fraudulent transactions are in general of higher value than normal transactions. To check this, let's set up a hypothesis test. Let's define our null and alternative hypotheses:

- H<sub>0</sub> : Fraudulent transactions are of similar or lower value than normal transactions
- H<sub>A</sub> : Fraudulent transactions are of higher value than normal transactions

I took H<sub>0</sub> to be similar or lower because H<sub>0</sub> and H<sub>A</sub> should together cover all the possibilities.

Before we begin, let's first look at the distribution of transaction amounts.

In [None]:
# Valid Transactions
timeAnalysis[timeAnalysis.Class == 0].Amount.plot.hist(title = 'Histogram of valid transactions')

In [None]:
# Most transactions seem to be worth no more than about 2K - 2.5K, so let's limit the data further
timeAnalysis[(timeAnalysis.Class == 0) & (timeAnalysis.Amount <= 4000)].Amount.plot.hist(title = 'Histogram of valid transactions clipped at 4K')

In [None]:
# Now let's look at the fraudulent transactions
timeAnalysis[timeAnalysis.Class == 1].Amount.plot.hist(title = 'Histogram of fraudulent transactions')

Hmmmm, there doesn't appear to be any difference visually. But let's wait until we perform the hypothesis test to draw the final conclusion.

For the hypothesis test I will be performing a Z-test, with the valid transactions acting as the population. A t-test could also be performed, but given that our sample (the fraudulent transactions) is of size 492 there shouldn't be any practical difference, as for sample sizes >= 30 the t distribution and the z distribution are nearly the same.

Let's start. We will be performing a one-tailed test at the 99% confidence level (a significance level of 0.01), which means that we need a z-score of at least 2.326 to reject the null hypothesis. If someone does not know the formula for the z-score, here it is:

$$ z = \frac{\bar{x} - \mu}{S.E} $$

Where
- $\bar{x}$ : mean of the sample
- $\mu$ : population mean
- $S.E$ : standard error

The standard error in our case is given by the formula $S.E = \sigma/\sqrt{n}$, where $\sigma$ is the standard deviation of the population and $n$ is the sample size.
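
As a quick sanity check of the formula and of the 2.326 cutoff, here is a minimal sketch using only the standard library. The numbers passed to `z_score` are made up for illustration and are not taken from the dataset.

```python
from statistics import NormalDist

def z_score(sample_mean, population_mean, population_std, n):
    """z = (x_bar - mu) / (sigma / sqrt(n))"""
    standard_error = population_std / n ** 0.5
    return (sample_mean - population_mean) / standard_error

# One-tailed critical value for a 0.01 significance level.
critical_z = NormalDist().inv_cdf(0.99)
print(round(critical_z, 3))  # 2.326

# Illustrative numbers only: sample mean 110, population mean 100,
# population standard deviation 50, sample size 100.
z = z_score(sample_mean=110.0, population_mean=100.0, population_std=50.0, n=100)
p_value = 1 - NormalDist().cdf(z)  # one-tailed p-value
reject_null = z > critical_z
print(z, round(p_value, 4), reject_null)  # 2.0 0.0228 False
```

In this made-up case the z-score of 2.0 falls short of the cutoff, so the null hypothesis would not be rejected at the 0.01 level.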

In [None]:
population = timeAnalysis[timeAnalysis.Class == 0].Amount
sample = timeAnalysis[timeAnalysis.Class == 1].Amount
sampleMean = sample.mean()
populationStd = population.std()
populationMean = population.mean()

In [None]:
z_score = (sampleMean - populationMean) / (populationStd / sample.size ** 0.5)
z_score

As the z-score is more than 2.326 we reject the null hypothesis at the 0.01 significance level. So the amount spent on fraudulent transactions is on average significantly higher than on normal transactions. But as we observed in the histograms, the largest individual amounts occur among normal transactions.

## Conclusion
The amount spent on fraudulent transactions is on average significantly higher than on normal transactions, but the largest individual amounts are spent on valid transactions. This means we can't really create an additional boolean feature such as 'amount spent is higher than a given value'. On the other hand, there is a significant difference in the average amount spent, so maybe it can be used to help identify frauds.
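
The toy example below shows how a group can have a higher mean and still be missed by a simple threshold rule. The amounts and the threshold are invented for illustration and do not come from the dataset.

```python
from statistics import mean

# Made-up amounts: many small legitimate purchases plus a few very large ones,
# and a handful of mid-sized fraudulent purchases.
legit = [10] * 95 + [2000] * 5
fraud = [150] * 10

print(mean(legit), max(legit))  # 109.5 2000
print(mean(fraud), max(fraud))  # 150 150

# A hypothetical boolean feature: "amount is higher than 1000".
threshold = 1000
flagged_legit = sum(amount > threshold for amount in legit)
flagged_fraud = sum(amount > threshold for amount in fraud)
print(flagged_legit, flagged_fraud)  # 5 0
```

Here the fraudulent group has the higher mean, yet the threshold flags only legitimate transactions, which is the situation described above.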

Also, according to my calculation, the fraudulent transactions are more evenly spread out over the day than normal transactions. Maybe scrutinizing late night transactions can lead to a better detection rate.