# Fraud case study



## Day 1: building a fraud model

## Day 2: building an app/dashboard

## Tips for success

You will quickly run out of time, so plan accordingly:

*  Use the CRISP-DM workflow to analyze the data and build a model
*  Iterate quickly, test often, commit often
*  Set deadlines for your work so you stay on track
*  You should have a working model by the end of day 1
*  Start the app once the model is working

### CRISP-DM workflow

Follow the [CRISP-DM](https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) steps:

1.  Business understanding
2.  Data understanding
3.  Data preparation
4.  Modeling
5.  Evaluation
6.  Deployment

# Introduction to case study: data & problem

Let's look at the data.  What format is the data in?  How do you extract it?

In [None]:
ls -lh data

Unzip the data so you can load it into Python:

In [None]:
!unzip data/data.zip -d data

Work with a subset at first so you can iterate quickly.  However, the file is one giant line of JSON, so line-based tools such as `head` will not help:

In [None]:
!wc data/data.json

Write a quick and dirty script to pull out the first 100 records so we can get code working quickly.

In [None]:
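%%writefile subset_json.py
"""subset_json.py - extract the first few records from a huge JSON file.

Syntax: python subset_json.py < infile.json > outfile.json

Assumes the input is a JSON array of objects.  Braces inside string
values are counted too, so this is only a quick and dirty approximation.
"""

import sys

start_char = '{'
stop_char = '}'
n_records = 100
level_nesting = 0

while n_records > 0:
    ch = sys.stdin.read(1)
    if not ch:
        # End of input reached before n_records records were found
        break
    sys.stdout.write(ch)
    if ch == start_char:
        level_nesting += 1
    elif ch == stop_char:
        level_nesting -= 1
        if level_nesting == 0:
            # A top-level object just closed, so one record is complete
            n_records -= 1

# Close the array only if we stopped early; at end of input the
# original closing bracket has already been copied
if n_records == 0:
    sys.stdout.write(']')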


In [None]:
!python subset_json.py < data/data.json > data/subset.json
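
The quick and dirty script counts every brace, including braces inside string values, so an unbalanced `{` or `}` in a text field would throw off the record count. If that becomes a problem, one alternative is to let the standard library's JSON decoder find the record boundaries. The sketch below assumes the file is a JSON array of objects; `first_records` and `chunk_size` are hypothetical names chosen for illustration, not part of the case study materials.

```python
import io
import json


def first_records(stream, n_records, chunk_size=65536):
    """Return the first n_records elements of a JSON array of objects.

    Hypothetical helper: reads the stream in chunks until enough
    records have been parsed.
    """
    decoder = json.JSONDecoder()
    records = []
    buf = ""
    eof = False
    while len(records) < n_records:
        # Drop whitespace, the opening bracket, and commas before the next object
        buf = buf.lstrip(" \t\r\n[,")
        if buf.startswith("]"):
            break  # end of the array
        try:
            obj, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            if eof:
                break  # nothing left to read
            # The buffer holds an incomplete object: read more and retry
            chunk = stream.read(chunk_size)
            eof = not chunk
            buf += chunk
            continue
        records.append(obj)
        buf = buf[end:]
    return records


# Small in-memory demo; the first record has braces inside a string
demo = io.StringIO('[{"a": "}{"}, {"b": {"c": 1}}, {"d": 2}]')
subset = first_records(demo, 2, chunk_size=4)
print(json.dumps(subset))
```

With the real data, you would pass an open file handle for `data/data.json` instead of the `StringIO` object and write the result to `data/subset.json` with `json.dump`.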

In [None]:
import pandas as pd

df = pd.read_json('data/subset.json')

In [None]:
df.head().T

Some of the data is text (and HTML), which will require feature engineering, for example:

* TF-IDF
* Feature hashing
* n-grams
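
To make two of these ideas concrete, here is a minimal standard-library sketch of word n-grams and feature hashing applied to an HTML fragment. The helper names (`strip_html`, `word_ngrams`, `hash_features`), the bucket count, and the sample HTML are all assumptions made for illustration; in practice you would more likely reach for a library such as scikit-learn, which provides `TfidfVectorizer` and `HashingVectorizer`.

```python
import zlib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def word_ngrams(text, n_max=2):
    """Return all word n-grams for n = 1..n_max, lowercased."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def hash_features(grams, n_buckets=16):
    """Count n-grams into a fixed-length vector using a stable hash."""
    vec = [0] * n_buckets
    for gram in grams:
        bucket = zlib.crc32(gram.encode("utf-8")) % n_buckets
        vec[bucket] += 1
    return vec


# Made-up HTML, not taken from the data set
html = "<p>Free <b>tickets</b> today</p>"
grams = word_ngrams(strip_html(html))
vec = hash_features(grams)
print(grams)
print(vec)
```

Feature hashing keeps the vector length fixed no matter how large the vocabulary grows, at the cost of occasional collisions between unrelated n-grams.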

You will also need to construct a target from `acct_type`.  Values that start with `fraud` mark fraud events.  Exactly which values you count as fraud depends on how you define the business problem.

In [None]:
df.acct_type.value_counts(dropna=False)
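
One possible way to build a binary target, assuming you decide that every `acct_type` starting with `fraud` counts as fraud. The `acct_type` values in the toy frame and the `fraud` column name are hypothetical, chosen only so the example runs on its own; substitute the real `df` and whatever definition fits your business problem.

```python
import pandas as pd

# Toy frame with made-up acct_type values, for illustration only
toy = pd.DataFrame(
    {"acct_type": ["premium", "fraudster", None, "fraudster_event"]}
)

# na=False treats a missing acct_type as "not fraud" (an assumption)
toy["fraud"] = toy["acct_type"].str.startswith("fraud", na=False).astype(int)
print(toy)
```

Whether a missing or ambiguous account type should count as fraud is itself a business decision, so revisit the `na=False` choice once you have settled on a definition.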

In [None]:
df.info()

Is missing data a problem?  What are your options for handling missing data?
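
A few common options, sketched on a toy frame: measure how much is missing, drop incomplete rows, or impute values while keeping an indicator column so the model can still see that the value was absent. The column names (`amount`, `country`) and the fill values are assumptions for illustration, not columns confirmed to exist in the data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up columns, for illustration only
toy = pd.DataFrame(
    {
        "amount": [10.0, np.nan, 30.0, np.nan],
        "country": ["US", None, "GB", "US"],
    }
)

# How much is missing in each column?
missing_share = toy.isna().mean()

# Option 1: drop any row with a missing value (can discard a lot of data)
dropped = toy.dropna()

# Option 2: impute, and record where the value was missing
imputed = toy.copy()
imputed["amount_missing"] = imputed["amount"].isna().astype(int)
imputed["amount"] = imputed["amount"].fillna(imputed["amount"].median())
imputed["country"] = imputed["country"].fillna("unknown")

print(missing_share)
print(imputed)
```

For fraud in particular, the fact that a field is missing may be a useful signal on its own, which is one reason to keep an indicator column rather than silently filling the gap.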

In [None]:
df.describe().T