# Fraud case study



## Day 1: building a fraud model

## Day 2: building an app/dashboard

## Tips for success

You will quickly run out of time:

*  Use CRISP-DM workflow to analyze data and build a model
*  Iterate quickly, test often, commit often
*  Build deadlines for your work so you stay on track
*  You should have a model by the end of day 1
*  Start the app once the model is working

### CRISP-DM workflow

Follow the [CRISP-DM](https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) steps:

1.  Business understanding
2.  Data understanding
3.  Data preparation
4.  Modeling
5.  Evaluation
6.  Deployment

# Introduction to case study: data & problem

Let's look at the data.  What format is the data in?  How do you extract it?

In [None]:
ls -lh data

Unzip the data so you can load it into Python:

In [None]:
!unzip data/data.zip -d data

Work with a subset at first in order to iterate quickly.  However, the file is one giant line of JSON:

In [None]:
!wc data/data.json

Write a quick and dirty script to pull out the first 100 records so we can get code working quickly.

In [None]:
%%writefile subset_json.py
"""subset_json.py - extract the first few records from a huge json file.

Syntax: python subset_json.py < infile.json > outfile.json
"""

import sys

start_char = '{'
stop_char = '}'
n_records = 100
level_nesting = 0

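while n_records != 0:
    ch = sys.stdin.read(1)
    if not ch:
        # End of input reached before n_records records were found
        break
    sys.stdout.write(ch)
    if ch == start_char:
        level_nesting += 1
    if ch == stop_char:
        level_nesting -= 1
        if level_nesting == 0:
            # A top-level record just closed
            n_records -= 1
if n_records == 0:
    # We stopped early, so close the array opened by the input's leading '['
    sys.stdout.write(']')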


In [None]:
!python subset_json.py < data/data.json > data/subset.json
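Before trusting the subset, it can help to check the brace-counting idea on a tiny input. The sketch below reimplements the same logic as a function over an in-memory string; `head_json` and the sample string are hypothetical, made up for illustration. Like the script, it assumes the input is a JSON array of objects and that no string value contains `{` or `}`.

```python
import json


def head_json(text, n_records=100):
    """Return a JSON array string with the first n_records top-level objects.

    Hypothetical helper mirroring the brace-counting script above.
    Assumes text holds at least n_records objects.
    """
    out = []
    depth = 0
    for ch in text:
        if n_records == 0:
            break
        out.append(ch)
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                # A top-level object just closed
                n_records -= 1
    out.append(']')  # close the array opened by the leading '['
    return ''.join(out)


# Made-up sample: three records, the first with a nested object
sample = '[{"a": 1, "b": {"c": 2}}, {"a": 3}, {"a": 4}]'
subset = json.loads(head_json(sample, n_records=2))
print(subset)
```

If `json.loads` succeeds and returns the expected number of records, the nesting logic is handling nested objects correctly.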

In [1]:
import pandas as pd

df = pd.read_json('data/data.json')

In [None]:
df.head().T

Some of the data is text (and HTML), which will require feature engineering:

* TF-IDF
* Feature hashing
* n-grams

etc.
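As a rough illustration of the techniques listed above, here is a minimal sketch using scikit-learn's text vectorizers. The three example strings are made up, not taken from the case-study data, and the parameter choices (`ngram_range=(1, 2)`, `n_features=2**10`) are arbitrary starting points rather than recommendations.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Made-up stand-ins for event descriptions
docs = [
    "cheap tickets for the big party",
    "annual charity dinner and auction",
    "cheap cheap tickets buy now",
]

# TF-IDF over unigrams and bigrams (n-grams with n = 1 and 2)
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)

# Feature hashing: fixed-width output with no vocabulary to store
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False)
X_hash = hasher.transform(docs)

print(X_tfidf.shape, X_hash.shape)
```

Both vectorizers return sparse matrices with one row per document. TF-IDF keeps a vocabulary you can inspect (`tfidf.vocabulary_`), while hashing trades that interpretability for a fixed memory footprint.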

You will also need to construct a target from `acct_type`.  Fraud events start with `fraud`.  How you define fraud depends on how you define the business problem.

In [None]:
df.acct_type.value_counts(dropna=False)

In [None]:
df.info()

Is missing data a problem?  What are your options for handling missing data?
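A few common options, sketched on a toy frame: measure how much is missing, drop incomplete rows, or impute and keep an indicator column. The column names `venue_latitude` and `country` appear in the data, but the values below are made up, and the median and `"unknown"` fill values are assumptions, not the only reasonable choices.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "venue_latitude": [37.7, np.nan, 40.7, np.nan],
    "country": ["US", None, "GB", "US"],
})

# How much is missing? Fraction of missing values per column
missing_share = toy.isna().mean()
print(missing_share)

# Option 1: drop rows with any missing value (simple, but loses data)
dropped = toy.dropna()

# Option 2: impute, and record where values were missing
imputed = toy.copy()
imputed["venue_latitude_missing"] = imputed["venue_latitude"].isna().astype(int)
imputed["venue_latitude"] = imputed["venue_latitude"].fillna(
    imputed["venue_latitude"].median()
)
imputed["country"] = imputed["country"].fillna("unknown")
```

The indicator column matters when missingness itself may be informative, for example if fraudulent accounts tend to leave fields blank.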

In [None]:
df.describe().T

In [4]:
import numpy as np
# Flag the three fraud account types as the positive class
df['fraud'] = df['acct_type'].isin(['fraudster', 'fraudster_event', 'fraudster_att']).astype(int)

In [None]:
df['org_fb_twitter'] = df['org_facebook'] + df['org_twitter']

In [None]:
df.boxplot('org_fb_twitter', 'fraud')

In [None]:
boxes = ['delivery_method', 'has_logo', 'name_length', 'org_facebook', 'org_twitter', 'user_age']

In [None]:
for box in boxes:
    df.boxplot(box, 'fraud')

In [None]:
df.groupby(['payout_type','fraud'])['fraud'].count().unstack(0).plot.bar()

In [26]:
# Note: df_for_models is created in the modeling cells below (In [15], In [16]); run those first
df_for_models._get_numeric_data().columns

Index(['body_length', 'channels', 'delivery_method', 'event_created',
       'event_published', 'fb_published', 'has_analytics', 'has_header',
       'has_logo', 'name_length', 'num_order', 'object_id', 'org_facebook',
       'org_twitter', 'show_map', 'user_age', 'user_created', 'user_type',
       'venue_latitude', 'venue_longitude'],
      dtype='object')

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

drop_list = ['acct_type', 'approx_payout_date', 'event_end', 'event_start', 'gts', 'num_payouts', 'payout_type', 'sale_duration', 'sale_duration2', 'ticket_types']
df_for_models = df.drop(drop_list, axis=1)
df_for_models.fillna(0, inplace=True)

In [16]:
y = df_for_models.pop('fraud').values
X = df_for_models._get_numeric_data().values
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

In [17]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf.predict_proba(X_test)

array([[0.99, 0.01],
       [0.5 , 0.5 ],
       [0.99, 0.01],
       ...,
       [0.98, 0.02],
       [0.99, 0.01],
       [1.  , 0.  ]])

In [18]:
import pickle
# Pickle files are binary, so open with 'wb'; the with-block closes the file
with open('data/model.pkl', 'wb') as f:
    pickle.dump(rf, f)

In [19]:
from sklearn.metrics import f1_score
f1_score(y_test, rf.predict(X_test))

0.7926988265971316
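A single F1 number hides how the errors split between missed fraud and false alarms. The sketch below, run on synthetic imbalanced data from `make_classification` rather than the case-study data, shows one way to look at the confusion matrix, precision, and recall at two probability thresholds. The thresholds 0.5 and 0.3 and all variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 10% positives); NOT the fraud data
X_demo, y_demo = make_classification(
    n_samples=2000, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.33, random_state=42
)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_tr, y_tr)

# Probability of the positive class (second column of predict_proba)
proba = clf.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    results[threshold] = {
        "confusion_matrix": confusion_matrix(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
    }
    print(threshold, results[threshold])
```

Lowering the threshold can only flag more cases, so recall never drops, usually at the cost of precision. Which trade-off is right depends on the business cost of a missed fraud versus a false alarm.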

In [None]:
df.info()

In [11]:
rand_list = np.random.randint(0, len(df), 10)
rand_list

array([  701,  9880, 10352, 12846,  5267, 12349,  1041, 11687, 10088,
        8670])

In [23]:
test_examples_df = df_for_models.loc[rand_list]

In [24]:
test_examples_df.to_csv('data/test_script_examples.csv')

In [None]:
# Copy the slice so that adding columns below does not trigger SettingWithCopyWarning
df_mw = df[525:551].copy()

In [None]:
pd.set_option('display.max_columns', None)
df_mw.head()

In [None]:
df_mw['email_domain'].value_counts()

In [None]:
df_mw['email_._loc'] = df_mw['email_domain'].str.find('.')

In [None]:
df_mw['email_domain'].str[-3:].value_counts()

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(df_mw['description'][525], 'html.parser')

In [None]:
p_list = soup.find_all('p')

In [None]:
for p in p_list:
    print(p.get_text())

In [None]:
df_mw['fraud'].count()

In [None]:
df_mw[df_mw['fraud'] == 1]

In [None]:
df['fraud'].sum()

In [None]:
df[df['fraud'] == 1]['ticket_types'][0]

In [None]:
df_mw['ticket_types'][527]

In [None]:
(df_mw['event_created'] - df_mw['user_created']) / df_mw['user_age']

In [None]:
df_mw['user_age']

In [None]:
(df_mw['event_end'] - df_mw['event_start'])/60000

In [None]:
df_mw['event_published'] - df_mw['event_created']

In [None]:
df_mw.head()

In [None]:
df['user_age'][df['fraud'] == 1].value_counts()

In [None]:
df['email_domain'].str[-3:][df['fraud'] == 1].value_counts()

In [None]:
pd.set_option('display.max_rows', None)
df['email_domain'].str[-3:].value_counts()

In [None]:
df['country'].value_counts()

In [None]:
df['description'][2]