# RANDOM FOREST CLASSIFIER MODEL

In this notebook, we build a simple random forest classifier to predict whether a user will download an app after clicking a mobile app ad.

Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs simply by clicking on ads at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic.

TalkingData, China's largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to preventing click fraud for app developers is to measure the journey of a user's click across their portfolio and flag IP addresses that produce lots of clicks but never end up installing apps. With this information, they've built an IP blacklist and a device blacklist.
You're challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad.
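
The blacklist idea described above (many clicks, no installs) can be sketched on a toy click log. This is only an illustration, not TalkingData's actual method: the data values and the `MIN_CLICKS` threshold are made up, and only the `ip` and `is_attributed` column names come from the competition data.

```python
import pandas as pd

# Toy click log with made-up values; one row per click.
click_log = pd.DataFrame({
    "ip": [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "is_attributed": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

MIN_CLICKS = 4  # hypothetical threshold, chosen only for this example

# Count clicks and downloads per IP address.
per_ip = click_log.groupby("ip")["is_attributed"].agg(clicks="count", downloads="sum")

# Flag IPs that click a lot but never download.
suspicious = per_ip[(per_ip["clicks"] >= MIN_CLICKS) & (per_ip["downloads"] == 0)]
flagged_ips = sorted(suspicious.index.tolist())
print(flagged_ips)
```

Here IPs 1 and 3 are flagged, while IP 2 is not because one of its clicks led to a download.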

# IMPORT REQUIRED LIBRARIES

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
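# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer as Imputer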
from sklearn.model_selection import cross_val_score

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# LOAD DATA

In [2]:

train_sample = pd.read_csv('../input/train_sample.csv')
test = pd.read_csv('../input/test.csv')
sample_submission = pd.read_csv('../input/sample_submission.csv')

np.random.seed(0)

# EXPLORATORY DATA ANALYSIS

In [3]:
train_sample.head()

In [4]:
train_sample.info()

In [13]:
train_sample.describe(include='all')

In [5]:
test.head()

# SELECT PREDICTORS AND TARGET

In [31]:
predictors = ['ip', 'app', 'device', 'os', 'channel']
y = train_sample['is_attributed']

In [32]:
x_train = train_sample[predictors]
x_test = test[predictors]

# MAKE PIPELINE

In [23]:
my_pipeline = make_pipeline(Imputer(), RandomForestClassifier())


# CROSS VALIDATION

In [33]:

scores = cross_val_score(my_pipeline, x_train, y, scoring='roc_auc', cv=5)
print(scores)
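
The per-fold scores are easier to read with a mean and a spread. The sketch below shows one way to do that on synthetic data, so it runs without the competition files; the dataset, `n_estimators=50`, and the shuffled `StratifiedKFold` are assumptions made for illustration. Passing `cv=5` with a classifier already produces stratified folds in scikit-learn, so the explicit splitter here only adds shuffling and a fixed seed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the click data (roughly 5% positives).
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Stratified folds keep the positive rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)

print("AUC per fold:", np.round(scores, 3))
print("mean = %.3f, std = %.3f" % (scores.mean(), scores.std()))
```

A large standard deviation across folds would suggest the estimate is unstable, which is plausible when positives are rare.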

# MODEL CREATION


In [36]:
my_pipeline.fit(x_train, y)
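# The model is scored with ROC AUC above, which ranks continuous scores,
# so use the positive-class probability instead of hard 0/1 labels
prediction = my_pipeline.predict_proba(x_test)[:, 1]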
print(prediction)
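
Since the cross-validation step scores the model with ROC AUC, it can help to see how hard labels and probabilities differ under that metric. The following is a minimal sketch on synthetic data (the dataset and the hold-out split are assumptions, not part of the original notebook): `predict` returns 0/1 labels, while `predict_proba(...)[:, 1]` returns the estimated probability of the positive class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the click data.
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

labels = model.predict(X_va)              # hard 0/1 predictions
probs = model.predict_proba(X_va)[:, 1]   # probability of class 1

auc_labels = roc_auc_score(y_va, labels)
auc_probs = roc_auc_score(y_va, probs)
print("AUC from hard labels:   %.3f" % auc_labels)
print("AUC from probabilities: %.3f" % auc_probs)
```

Hard labels collapse every prediction to one of two values, so they carry less ranking information than probabilities; on imbalanced data the gap between the two AUC values is often noticeable.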

# MAKE SUBMISSION FILE

In [37]:
my_submission = pd.DataFrame({'click_id': test.click_id, 'is_attributed': prediction})
my_submission.to_csv('submission.csv', index=False)