@ Puran Zhang
## Description
One of the greatest challenges in fraud detection, and more generally in any area of data science concerned with catching illegal activity, is that you often find yourself one step behind.

Your model is trained on past data. If users come up with a totally new way to commit fraud, it often takes some time before you can react. By the time you have data on the new fraud strategy and have retrained the model, many frauds have already been committed.

One way to overcome this is to use unsupervised machine learning instead of supervised learning. With this approach, you don't need examples of a given fraud pattern in order to make a prediction. Often, this works by looking at the data and identifying sudden clusters of unusual activity.
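As a minimal sketch of the idea, and not necessarily the method used later in this notebook, the example below fits scikit-learn's `IsolationForest` (imported in the Data Loading cell) without any fraud labels. The synthetic data and the column names `amount`, `long`, and `lat` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic transactions, invented for illustration: 200 ordinary purchases
# near one location plus 2 large purchases far away.
normal = pd.DataFrame({
    "amount": rng.normal(60, 15, 200),
    "long": rng.normal(-80.2, 0.05, 200),
    "lat": rng.normal(40.3, 0.05, 200),
})
odd = pd.DataFrame({
    "amount": [950.0, 990.0],
    "long": [10.0, 120.0],
    "lat": [-30.0, 60.0],
})
X = pd.concat([normal, odd], ignore_index=True)

# No labels are passed to fit: the forest scores each point by how easily
# random splits isolate it from the rest of the data.
model = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
labels = model.fit_predict(X)        # -1 = flagged as anomaly, 1 = normal
scores = model.decision_function(X)  # lower = more anomalous

# `contamination` fixes the share of points flagged, so a few ordinary
# points may be flagged alongside the genuinely unusual ones.
flagged = X[labels == -1]
print(flagged)
```

The `contamination` value here is an arbitrary assumption. On real data it acts as a budget for how many transactions are worth sending for review.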

This is the goal of this challenge. You have a dataset of credit card transactions and you have to identify unusual events that have a high chance of being fraud.

## Goal
Company XYZ is a major credit card company. It has information about all the transactions that users make with their credit card.
Your boss asks you to do the following:

* Your boss wants to identify the users in your dataset who never went above their monthly credit card limit (calendar month). The goal is to automatically increase their limit. Can you send her the list of IDs?


* On the other hand, she wants you to implement an algorithm that triggers an alert as soon as a user goes above her monthly limit, so that the user can be notified. We assume here that at the beginning of each new month, the user's total spend is reset to zero (i.e. she pays the card off in full at the end of each month). Build a function that, for each day, returns a list of users who went above their monthly credit card limit on that day.


* Finally, your boss is very concerned about fraud because it is a huge cost for credit card companies. She wants you to implement an unsupervised algorithm that flags unusual transactions with a high chance of being fraud.
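
The first two tasks both reduce to a per-card total within each calendar month. The sketch below is one possible reading, assuming a merged frame with the columns `credit_card`, `date`, `transaction_dollar_amount`, and `credit_card_limit`. The toy values and the helper names `never_over_limit` and `over_limit_on` are hypothetical, and "went above the limit on that day" is interpreted as the day the month-to-date total first crossed the limit.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "credit_card": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime([
        "2015-08-01 10:00", "2015-08-15 12:00", "2015-09-02 09:00",
        "2015-08-03 08:00", "2015-08-20 18:00", "2015-09-05 11:00",
    ]),
    "transaction_dollar_amount": [400.0, 700.0, 300.0, 100.0, 200.0, 150.0],
    "credit_card_limit": [1000, 1000, 1000, 2000, 2000, 2000],
})


def never_over_limit(df):
    """Ids of cards whose spend stayed within the limit in every calendar month."""
    monthly = (
        df.assign(month=df["date"].dt.to_period("M"))
          .groupby(["credit_card", "month"])
          .agg(spent=("transaction_dollar_amount", "sum"),
               limit=("credit_card_limit", "first"))
    )
    over = monthly["spent"] > monthly["limit"]
    # A card qualifies only if none of its months is over the limit.
    ever_over = over.groupby(level="credit_card").any()
    return ever_over[~ever_over].index.tolist()


def over_limit_on(df, day):
    """Ids of cards whose month-to-date spend first crossed the limit on `day`."""
    day = pd.Timestamp(day).normalize()
    df = df.sort_values("date")
    month = df["date"].dt.to_period("M").rename("month")
    # Running total per card that restarts at the start of each calendar month.
    running = df.groupby(["credit_card", month])["transaction_dollar_amount"].cumsum()
    crossed = df[running > df["credit_card_limit"]]
    if crossed.empty:
        return []
    crossed_month = crossed["date"].dt.to_period("M").rename("month")
    # The earliest over-limit transaction in each card-month marks the alert day.
    first_cross = crossed.groupby(["credit_card", crossed_month])["date"].min()
    hits = first_cross[first_cross.dt.normalize() == day]
    return hits.index.get_level_values("credit_card").unique().tolist()


print(never_over_limit(toy))
print(over_limit_on(toy, "2015-08-15"))
```

Cards with no transactions at all would not appear in a transaction-level frame, so a full answer to the first task may also need to add them from the card table.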

## Data Loading

In [2]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

import os
import warnings
warnings.simplefilter('ignore')

%matplotlib inline

In [4]:
# Card-level information: one row per credit card
path1 = '/Users/puran/Downloads/Data Science/Take_home_challenge/Credit Card Transactions/credit_card/cc_info.csv'
info = pd.read_csv(path1)
# Transaction-level data; parse the `date` column as datetime on load
path2 = '/Users/puran/Downloads/Data Science/Take_home_challenge/Credit Card Transactions/credit_card/transactions.csv'
transaction = pd.read_csv(path2, parse_dates=['date'])

In [8]:
info.head()

Unnamed: 0,credit_card,city,state,zipcode,credit_card_limit
0,1280981422329509,Dallas,PA,18612,6000
1,9737219864179988,Houston,PA,15342,16000
2,4749889059323202,Auburn,MA,1501,14000
3,9591503562024072,Orlando,WV,26412,18000
4,2095640259001271,New York,NY,10001,20000


In [5]:
info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 984 entries, 0 to 983
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   credit_card        984 non-null    int64 
 1   city               984 non-null    object
 2   state              984 non-null    object
 3   zipcode            984 non-null    int64 
 4   credit_card_limit  984 non-null    int64 
dtypes: int64(3), object(2)
memory usage: 38.6+ KB


In [6]:
info.describe()

Unnamed: 0,credit_card,zipcode,credit_card_limit
count,984.0,984.0,984.0
mean,5410432000000000.0,17895.316057,12321.138211
std,2545234000000000.0,23778.651105,7398.449174
min,1003715000000000.0,690.0,2000.0
25%,3316062000000000.0,3280.0,7000.0
50%,5365218000000000.0,5820.0,10000.0
75%,7562153000000000.0,18101.25,16000.0
max,9999757000000000.0,98401.0,55000.0


In [7]:
transaction.head()

Unnamed: 0,credit_card,date,transaction_dollar_amount,Long,Lat
0,1003715054175576,2015-09-11 00:32:40,43.78,-80.174132,40.26737
1,1003715054175576,2015-10-24 22:23:08,103.15,-80.19424,40.180114
2,1003715054175576,2015-10-26 18:19:36,48.55,-80.211033,40.313004
3,1003715054175576,2015-10-22 19:41:10,136.18,-80.174138,40.290895
4,1003715054175576,2015-10-26 20:08:22,71.82,-80.23872,40.166719


In [9]:
transaction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294588 entries, 0 to 294587
Data columns (total 5 columns):
 #   Column                     Non-Null Count   Dtype         
---  ------                     --------------   -----         
 0   credit_card                294588 non-null  int64         
 1   date                       294588 non-null  datetime64[ns]
 2   transaction_dollar_amount  294588 non-null  float64       
 3   Long                       294588 non-null  float64       
 4   Lat                        294588 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(1)
memory usage: 11.2 MB


In [10]:
transaction.describe()

Unnamed: 0,credit_card,transaction_dollar_amount,Long,Lat
count,294588.0,294588.0,294588.0,294588.0
mean,5424562000000000.0,86.008036,-76.235238,40.937669
std,2555803000000000.0,124.655954,20.135015,5.391695
min,1003715000000000.0,0.01,-179.392887,-68.046553
25%,3344214000000000.0,29.97,-80.209708,40.487726
50%,5353426000000000.0,58.47,-73.199737,42.403066
75%,7646245000000000.0,100.4,-72.091933,43.180015
max,9999757000000000.0,999.97,179.917513,78.91433


## Data Processing

In [11]:
# Attach card-level info (city, state, zipcode, limit) to every transaction
data = pd.merge(left=transaction, right=info, on='credit_card', how='left')
data.head()

Unnamed: 0,credit_card,date,transaction_dollar_amount,Long,Lat,city,state,zipcode,credit_card_limit
0,1003715054175576,2015-09-11 00:32:40,43.78,-80.174132,40.26737,Houston,PA,15342,20000
1,1003715054175576,2015-10-24 22:23:08,103.15,-80.19424,40.180114,Houston,PA,15342,20000
2,1003715054175576,2015-10-26 18:19:36,48.55,-80.211033,40.313004,Houston,PA,15342,20000
3,1003715054175576,2015-10-22 19:41:10,136.18,-80.174138,40.290895,Houston,PA,15342,20000
4,1003715054175576,2015-10-26 20:08:22,71.82,-80.23872,40.166719,Houston,PA,15342,20000
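
A left join like this can silently duplicate rows if a card appears more than once in the card table, or leave missing limits if a card is absent from it. The sketch below shows one way to check both, using the `validate` and `indicator` parameters of `pd.merge`. The tiny frames are invented stand-ins for `transaction` and `info`.

```python
import pandas as pd

# Tiny invented stand-ins for `transaction` and `info`.
transaction = pd.DataFrame({
    "credit_card": [111, 111, 222],
    "transaction_dollar_amount": [43.78, 103.15, 48.55],
})
info = pd.DataFrame({
    "credit_card": [111, 222],
    "credit_card_limit": [20000, 6000],
})

# validate="many_to_one" raises MergeError if a card appears twice in `info`;
# indicator=True adds a `_merge` column recording where each row matched.
checked = pd.merge(transaction, info, on="credit_card", how="left",
                   validate="many_to_one", indicator=True)

# Transactions whose card was not found in `info` are marked "left_only".
unmatched = int((checked["_merge"] == "left_only").sum())
print(f"rows: {len(checked)}, transactions without card info: {unmatched}")
```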


In [14]:
print('min:{0} \t max:{1}'.format(data['date'].min(), data['date'].max()))

min:2015-07-31 09:39:48 	 max:2015-10-30 10:54:58


In [None]:
# extract month, weekday, and hour information
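
The cell above is empty in the source. A minimal sketch of that extraction, assuming the `date` column is already a datetime (as it is after `parse_dates=['date']`), could use the pandas `.dt` accessor. The two-row frame is a hypothetical stand-in for `data`, and the new column names `month`, `weekday`, and `hour` are assumptions.

```python
import pandas as pd

# Hypothetical two-row stand-in for the merged `data` frame.
data = pd.DataFrame({
    "date": pd.to_datetime(["2015-09-11 00:32:40", "2015-10-24 22:23:08"]),
})

data["month"] = data["date"].dt.month        # 1-12
data["weekday"] = data["date"].dt.dayofweek  # Monday=0 ... Sunday=6
data["hour"] = data["date"].dt.hour          # 0-23

print(data)
```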
