In this notebook I will try to build a model that can predict whether a transaction is fraudulent.
This is my first notebook on Kaggle, so any remarks are welcome.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # statistical visualization library built on top of matplotlib


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
# first, loading our data
data = pd.read_csv("../input/PS_20174392719_1491204439457_log.csv")

General look at our data

In [None]:
data.head()

In [None]:
data.describe()

This one was not very informative

In [None]:
data.info()

So, we have 11 columns overall.
type is a categorical variable which we will need in our model.
However, the account names of the sender and the receiver (nameOrig and nameDest) don't look very helpful. I would also mention the last column, called isFlaggedFraud. It looks like there is some sort of internal detection for fraudulent transactions.

In [None]:
data['isFlaggedFraud'].sum()

We see that, out of over 6 million transactions, this "detector" flagged only 16 as fraud. I do not consider this column very useful in our dataset, so I will drop it.
I will also drop the nameOrig and nameDest columns.
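Before dropping the flag, one quick way to see how it relates to the true label is a cross-tabulation. The sketch below runs on a small made-up frame (`toy` is hypothetical and is not the Kaggle data), so the counts are only illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "isFraud":        [0, 0, 1, 1, 1, 0],
    "isFlaggedFraud": [0, 0, 0, 1, 0, 0],
})

# Rows: true label, columns: built-in flag
flag_table = pd.crosstab(toy["isFraud"], toy["isFlaggedFraud"])
print(flag_table)

# Share of true frauds that the flag caught (the recall of the flag)
flag_recall = toy.loc[toy["isFraud"] == 1, "isFlaggedFraud"].mean()
print(flag_recall)
```

On the real data the same two lines would use `data` instead of `toy`.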

In [None]:
data.drop(['nameOrig','nameDest','isFlaggedFraud'], axis = 1, inplace=True)

Some visualizations (I am not very good at them, though)

In [None]:
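# pass column names as keywords; recent seaborn versions no longer accept x as a positional argument
sns.countplot(x='type', hue='isFraud', data=data)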

It looks like Payment and Debit transactions are safe: there is no fraud there. Cash in is also safe, which is understandable.
Below, we see the number of fraudulent transactions per type.

In [None]:
data[data['isFraud']==1].groupby('type').count()
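The raw counts are easier to compare next to the total number of transactions of each type. A minimal sketch of that idea, again on a made-up frame (`toy` and the column names `n_fraud`, `n_total`, `fraud_rate` are my own, not part of the dataset):

```python
import pandas as pd

# Hypothetical toy data, not the Kaggle file
toy = pd.DataFrame({
    "type":    ["PAYMENT", "TRANSFER", "CASH_OUT", "TRANSFER", "DEBIT", "CASH_OUT"],
    "isFraud": [0, 1, 0, 0, 0, 1],
})

# For each type: number of frauds, number of transactions, and their ratio
fraud_by_type = toy.groupby("type")["isFraud"].agg(["sum", "count", "mean"])
fraud_by_type.columns = ["n_fraud", "n_total", "fraud_rate"]
print(fraud_by_type)
```

Unlike filtering first and then counting, this keeps the types with zero frauds in the table.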

Now we can proceed to our algorithm.
First things first: in cases like this one, the features are very skewed, so we would be better off if we standardize the data first. Because only the features should be standardized, I will call them X.

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# numeric columns to standardize
X_toScale = data[['amount', 'oldbalanceOrg', 'newbalanceOrig',
                  'oldbalanceDest', 'newbalanceDest']]
# fit() returns the fitted scaler itself, so fit and transform in one step
X_scaled = sc.fit_transform(X_toScale)
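To make explicit what the scaler computes: for each column it subtracts the mean and divides by the (population) standard deviation. The sketch below checks this on a few hypothetical amounts; the array `amounts` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values with one large outlier, as a single feature column
amounts = np.array([[10.0], [20.0], [30.0], [1000.0]])

scaled = StandardScaler().fit_transform(amounts)

# Manual z-score; np.std uses ddof=0 by default, like StandardScaler
manual = (amounts - amounts.mean(axis=0)) / amounts.std(axis=0)

print(np.allclose(scaled, manual))
```

Note that this only changes the location and scale of a feature, not the shape of its distribution, so a skewed column is still skewed after standardizing.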

In [None]:
#creating our dataframe with scaled values

scaled_df = pd.DataFrame(X_scaled, columns=['amount', 'oldbalanceOrg', 'newbalanceOrig',
      'oldbalanceDest', 'newbalanceDest'
       ])


In [None]:
# we also have a categorical variable called type. Let's convert it to dummies and then add them to our final dataframe
dummy_df = pd.get_dummies(data['type']) # get_dummies already returns a DataFrame
# now, the final dataframe (both frames share the same default index, so the rows line up)
final_df = scaled_df.join(dummy_df, how = 'outer')

In [None]:
final_df.head(5)
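For anyone new to dummies, here is what the encoding step does on a tiny made-up frame (`toy` and `features` are hypothetical names): each distinct value of type becomes its own indicator column, placed next to the numeric feature.

```python
import pandas as pd

# Hypothetical toy data, not the Kaggle file
toy = pd.DataFrame({
    "amount": [1.0, 2.0, 3.0],
    "type":   ["PAYMENT", "TRANSFER", "PAYMENT"],
})

# One indicator column per distinct value of type
dummies = pd.get_dummies(toy["type"])

# Put the numeric column and the indicators side by side
features = pd.concat([toy[["amount"]], dummies], axis=1)
print(features)
```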

Looks like we are good to go. Now I will use RandomForestClassifier to build my model.
First I will import the model and split the data into train and test sets, then fit the model, and finally predict the labels.

In [None]:
from sklearn.ensemble import RandomForestClassifier
# train_test_split lives in sklearn.model_selection; the old sklearn.cross_validation module has been removed
from sklearn.model_selection import train_test_split
rfc = RandomForestClassifier() # using default values
# splitting our dataset
X = final_df # dataset that we scaled and preprocessed
y = data['isFraud'] # the column from our original dataset will be our label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # fixed random state so the split is reproducible
# training our model
model = rfc.fit(X_train, y_train)
# predicting our labels
predictions = model.predict(X_test)
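Because fraud is rare, a plain random split can give the train and test sets slightly different shares of fraud. One variation I could try (this is an assumption on my part, not what the cell above does) is a stratified split together with class_weight='balanced'. The sketch uses synthetic data; `X_toy` and `y_toy` are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data: 20 positives out of 1000 rows
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1
X_toy[:20] += 3.0  # shift the positives so they are learnable

# stratify=y_toy keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# class_weight='balanced' reweights classes inversely to their frequency
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42)
clf.fit(X_tr, y_tr)
toy_predictions = clf.predict(X_te)

print(y_tr.mean(), y_te.mean())
```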


Time to evaluate our model

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test,predictions))
print(confusion_matrix(y_test, predictions))

It looks like we have a very good model when it comes to detecting non-fraud transactions. Nevertheless, when detecting fraudulent transactions we still make some errors. I am now working on other models to use for this problem, and if you have any suggestions, please let me know in the comments below.
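The gap between the two classes is easy to miss if one only looks at accuracy. A small sketch with made-up labels (`y_true` and `y_pred` are hypothetical, not my model's output) shows how accuracy can stay high while recall on the fraud class is much lower:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 95 non-fraud, 5 fraud; the model misses 2 of the 5 frauds
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 95 + [1, 1, 1, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / len(y_true)
precision = precision_score(y_true, y_pred)  # of predicted frauds, how many are real
recall = recall_score(y_true, y_pred)        # of real frauds, how many were caught

print(accuracy, precision, recall)
```

For a problem like this one, recall and precision on the fraud class are probably more telling than overall accuracy.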