# <div style="text-align: center">A Data Science Framework for Elo </div>
### <div align="center"><b>Quite Practical and Far from any Theoretical Concepts</b></div>
<div style="text-align:center">last update: <b>11/28/2018</b></div>
<img src='http://s8.picofile.com/file/8344134250/KOpng.png'>
You can fork and run this kernel on **GitHub**:
> ###### [GitHub](https://github.com/mjbahmani/10-steps-to-become-a-data-scientist)


 <a id="1"></a> <br>
## 1- Introduction
**[Elo](https://www.cartaoelo.com.br/)** has launched a competition on **Kaggle**, with a realistic and attractive data set for data scientists.
In this notebook, I will provide a **comprehensive** approach to solving the Elo recommendation problem.

I am open to getting your feedback for improving this **kernel**.

<a id="top"></a> <br>
## Notebook  Content
1. [Introduction](#1)
1. [Data Science Workflow for Elo](#2)
1. [Problem Definition](#3)
    1. [Business View](#4)
        1. [Real world Application Vs Competitions](#31)
1. [Problem feature](#7)
    1. [Aim](#8)
    1. [Variables](#9)
    1. [ Inputs & Outputs](#10)
1. [Select Framework](#11)
    1. [Import](#12)
    1. [Version](#13)
    1. [Setup](#14)
1. [Exploratory data analysis](#15)
    1. [Data Collection](#16)
        1. [Features](#17)
        1. [Explore Dataset](#18)
    1. [Data Cleaning](#19)
    1. [Data Preprocessing](#20)
    1. [Data Visualization](#23)
        1. [countplot](#61)
        1. [pie plot](#62)
        1. [Histogram](#63)
        1. [violin plot](#64)
        1. [kdeplot](#65)
1. [Apply Learning](#24)
1. [Conclusion](#25)
1. [References](#26)

-------------------------------------------------------------------------------------------------------------

 **I hope you find this kernel helpful and some <font color="red"><b>UPVOTES</b></font> would be very much appreciated**
 
 -----------

<a id="2"></a> <br>
## 2- A Data Science Workflow for Elo
Of course, no single solution fits every problem, so the best approach is to create a **general framework** and adapt it to each new problem.

**You can see my workflow in the image below**:

 <img src="http://s8.picofile.com/file/8342707700/workflow2.png"  />

**You should feel free to adjust this checklist to your needs**
###### [Go to top](#top)

<a id="3"></a> <br>
## 3- Problem Definition
I think one of the most important things when you start a new machine learning project is defining your problem. That means you should understand the business problem (**Problem Formalization**).
> **We will be predicting a loyalty score for each card_id.**

## 3-1 About Elo
 [Elo](https://www.cartaoelo.com.br/) is one of the largest **payment brands** in Brazil and has built partnerships with merchants in order to offer promotions or discounts to cardholders. But:
1. Do these promotions work for either the consumer or the merchant?
1. Do customers enjoy their experience?
1. Do merchants see repeat business?

**Personalization is key**.
<a id="4"></a> <br>
## 3-2 Business View 
Elo has built machine learning models to understand the most important aspects and preferences in their customers’ lifecycle, from food to shopping. But so far none of them is specifically tailored for an individual or profile. This is where you come in.
<a id="31"></a> <br>
### 3-2-1 Real world Application Vs Competitions
A simple comparison between real-world applications and competitions:
<img src="http://s9.picofile.com/file/8339956300/reallife.png" height="600" width="500" />

###### [Go to top](#top)

<a id="7"></a> <br>
## 4- Problem Feature
Describing the problem features takes three steps:

1. Aim
1. Variables
1. Inputs & Outputs





<a id="8"></a> <br>
### 4-1 Aim
Develop algorithms to identify and serve the most relevant opportunities to individuals, by uncovering signal in customer loyalty.
We are predicting a **loyalty score** for each card_id represented in test.csv and sample_submission.csv.

<a id="9"></a> <br>
### 4-2 Variables
The data is formatted as follows:

train.csv and test.csv contain card_ids and information about the card itself - the first month the card was active, etc. train.csv also contains the target.

historical_transactions.csv and new_merchant_transactions.csv are designed to be joined with train.csv, test.csv, and merchants.csv. They contain information about transactions for each card, as described above.

merchants.csv can be joined with the transaction sets to provide additional merchant-level information.
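
The join described above can be sketched on a tiny in-memory example. This is only an illustration under stated assumptions: the rows are made up, and apart from `card_id` and `target` the column names (`merchant_id`, `purchase_amount`) are assumed for the demo rather than taken from this notebook.

```python
import pandas as pd

# Tiny stand-ins for train.csv and historical_transactions.csv (values are made up).
train_demo = pd.DataFrame({
    "card_id": ["C_1", "C_2", "C_3"],
    "target": [0.5, -1.2, 0.1],
})
transactions_demo = pd.DataFrame({
    "card_id": ["C_1", "C_1", "C_2"],
    "merchant_id": ["M_1", "M_2", "M_1"],      # assumed column name
    "purchase_amount": [10.0, 5.0, 7.5],       # assumed column name
})

# Collapse the many-rows-per-card transaction table to one row per card.
card_stats = (
    transactions_demo.groupby("card_id")["purchase_amount"]
    .agg(["count", "sum"])
    .rename(columns={"count": "n_transactions", "sum": "total_amount"})
    .reset_index()
)

# A left join keeps every card in train, even one with no transactions (C_3).
train_joined = train_demo.merge(card_stats, on="card_id", how="left")
print(train_joined)
```

Aggregating before merging keeps one row per `card_id`, so the joined frame has the same number of rows as the training set.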


<a id="10"></a> <br>
### 4-3 Inputs & Outputs
We use train.csv and test.csv as input, and we should upload a submission.csv as output.
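
As a rough sketch of the output side, a submission can be assembled as one row per test `card_id` with a predicted score. The two-column layout (`card_id`, `target`) is an assumption here and should be checked against sample_submission.csv. The predictions are placeholders, and the CSV is written to an in-memory buffer rather than to disk.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for the card_ids found in test.csv (made-up values).
test_demo = pd.DataFrame({"card_id": ["C_10", "C_11", "C_12"]})

# Placeholder predictions; a real model would produce these.
predictions = np.zeros(len(test_demo))

submission = pd.DataFrame({
    "card_id": test_demo["card_id"],
    "target": predictions,
})

# On Kaggle you would call submission.to_csv('submission.csv', index=False).
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```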

### 4-4 Evaluation
Submissions are scored on the Root Mean Squared Error (RMSE), defined as:
<img src='https://www.includehelp.com/ml-ai/Images/rmse-1.jpg'>
where ŷ is the predicted loyalty score for each card_id, and y is the actual loyalty score assigned to a card_id.
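
Written out, the metric is $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$. A minimal NumPy sketch of this formula follows; the helper name `rmse` and the sample numbers are illustrative only, not part of the competition code.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Errors of 3 and 4 give sqrt((9 + 16) / 2) = sqrt(12.5).
print(rmse([0.0, 0.0], [3.0, 4.0]))
```

Because the errors are squared before averaging, a few very large errors weigh on the score much more than many small ones.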

**<< Note >>**
> You must answer the following question:
How does your company expect to use and benefit from **your model**?
###### [Go to top](#top)

<a id="11"></a> <br>
## 5- Select Framework
After defining the problem and its features, we should select our framework to solve the problem.
By framework, we mean the programming language you use and the modules with which the problem will be solved.
###### [Go to top](#top)

<a id="12"></a> <br>
## 5-2 Import

In [None]:
# scikit-learn helpers
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# plotting
import matplotlib.pylab as pylab
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import matplotlib
# data handling
from pandas import get_dummies
import pandas as pd
import numpy as np
import sklearn
import scipy
# standard library
import warnings
import string
import json
import sys
import csv
import os

<a id="13"></a> <br>
## 5-3 Version

In [None]:
print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))


<a id="14"></a> <br>
## 5-4 Setup

A few tiny adjustments for better **code readability**
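
The adjustments themselves are not shown in this notebook, so the following is only a sketch of typical display settings, not the original setup. Every option value below is an assumption chosen for illustration.

```python
import warnings

import numpy as np
import pandas as pd

# Hide deprecation chatter from libraries so cell output stays readable.
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Show more columns and fewer decimals when printing DataFrames.
pd.set_option("display.max_columns", 100)
pd.set_option("display.precision", 4)

# Print NumPy arrays with 4 decimals and without scientific notation.
np.set_printoptions(precision=4, suppress=True)
```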

<a id="15"></a> <br>
## 6- EDA
 In this section, you'll learn how to use graphical and numerical techniques to begin uncovering the structure of your data. 
 
* Which variables suggest interesting relationships?
* Which observations are unusual?
* Analysis of the features!

By the end of the section, you'll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. Then we will review analytical and statistical operations:

1. Data Collection
1. Visualization
1. Data Cleaning
1. Data Preprocessing
<img src="http://s9.picofile.com/file/8338476134/EDA.png">

 ###### [Go to top](#top)

<a id="16"></a> <br>
## 6-1 Data Collection
**Data collection** is the process of gathering and measuring data, information or any variables of interest in a standardized and established manner that enables the collector to answer or test hypothesis and evaluate outcomes of the particular collection.[techopedia]

I start data collection by loading the training and testing datasets into **Pandas DataFrames**.
###### [Go to top](#top)

In [None]:
train = pd.read_csv('../input/train.csv', parse_dates=["first_active_month"] )
test = pd.read_csv('../input/test.csv' ,parse_dates=["first_active_month"] )
merchants=pd.read_csv('../input/merchants.csv')
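
To see what `parse_dates` does in the cell above, here is a small self-contained sketch that reads CSV text from memory. The rows and the exact date format are made up for the demo, and the derived columns `active_year` and `active_month` are hypothetical names.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a few rows of train.csv.
csv_text = """first_active_month,card_id,target
2017-06-01,C_1,0.5
2016-11-01,C_2,-1.2
"""

# parse_dates turns the listed column into datetime64 instead of plain text.
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["first_active_month"])
print(demo.dtypes)

# Once parsed, date parts are available through the .dt accessor.
demo["active_year"] = demo["first_active_month"].dt.year
demo["active_month"] = demo["first_active_month"].dt.month
print(demo)
```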


**<< Note 1 >>**

* Each **row** is an observation (also known as : sample, example, instance, record).
* Each **column** is a feature (also known as: Predictor, attribute, Independent Variable, input, regressor, Covariate).
###### [Go to top](#top)

## data_dictionary Analysis

In [None]:
data_dictionary_train=pd.read_excel('../input/Data_Dictionary.xlsx',sheet_name='train')
data_dictionary_history=pd.read_excel('../input/Data_Dictionary.xlsx',sheet_name='history')
data_dictionary_new_merchant_period=pd.read_excel('../input/Data_Dictionary.xlsx',sheet_name='new_merchant_period')
data_dictionary_merchant=pd.read_excel('../input/Data_Dictionary.xlsx',sheet_name='merchant')

### data_dictionary_train

In [None]:
data_dictionary_train.head(10)

### data_dictionary_history

In [None]:
data_dictionary_history.head(10)

### data_dictionary_new_merchant_period

In [None]:
data_dictionary_new_merchant_period.head(10)

### data_dictionary_merchant

In [None]:
data_dictionary_merchant.head(30)

## Train Analysis

In [None]:
train.sample(1) 

In [None]:
test.sample(1) 

Or you can use other commands to explore the dataset, such as:

In [None]:
train.tail(1)

<a id="17"></a> <br>
## 6-1-1 Features
Features can be from following types:
* numeric
* categorical
* ordinal
* datetime
* coordinates

Find the type of features in the **Elo dataset**!

To get some information about the dataset, you can use the **info()** method.

In [None]:
print(train.info())

In [None]:
print(test.info())
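
Besides reading the `info()` output, the feature types can be grouped programmatically. The sketch below runs on a small made-up frame whose column names mirror those used in this notebook; treating a low-cardinality integer column as categorical is an assumption, not a rule.

```python
import pandas as pd

# Made-up rows with the same kinds of columns as train.csv.
demo = pd.DataFrame({
    "first_active_month": pd.to_datetime(["2017-06-01", "2016-11-01", "2017-01-01"]),
    "card_id": ["C_1", "C_2", "C_3"],
    "feature_1": [5, 4, 5],
    "target": [0.5, -1.2, 0.1],
})

# Group columns by dtype.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
datetime_cols = demo.select_dtypes(include="datetime").columns.tolist()
other_cols = [c for c in demo.columns if c not in numeric_cols + datetime_cols]

print("numeric:", numeric_cols)
print("datetime:", datetime_cols)
print("other:", other_cols)

# Few distinct values in an integer column hints at a categorical or ordinal feature.
print(demo[numeric_cols].nunique())
```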

<a id="18"></a> <br>
## 6-1-2 Explore Dataset
1- Dimensions of the dataset.

2- Peek at the data itself.

3- Statistical summary of all attributes.

4- Breakdown of the data by the class variable.

Don’t worry, each look at the data is **one command**. These are useful commands that you can use again and again on future projects.
###### [Go to top](#top)

In [None]:
# shape for train and test
print('Shape of train:',train.shape)
print('Shape of test:',test.shape)

In [None]:
# total number of elements (rows * columns)
train.size

After loading the data via **pandas**, we should check the type of the loaded objects and their statistical description with the following commands:

In [None]:
type(train)

In [None]:
type(test)

In [None]:
train.describe()

To display 5 random rows from the data set, we can use the **sample(5)** function and inspect the feature types.

In [None]:
train.sample(5) 
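
The fourth look listed at the start of this section, a breakdown by class, has no direct equivalent here because the target is a continuous score. One possible substitute, sketched on made-up data, is to break the rows down by a categorical feature. It assumes that `feature_1` is categorical.

```python
import pandas as pd

# Made-up rows; feature_1 is assumed to be a categorical feature.
demo = pd.DataFrame({
    "feature_1": [1, 2, 2, 3, 3, 3],
    "target": [0.4, -0.2, 0.6, 1.0, -1.0, 0.3],
})

# How many rows fall into each category?
counts = demo["feature_1"].value_counts().sort_index()
print(counts)

# Average target per category, as a first hint of a relationship.
mean_target = demo.groupby("feature_1")["target"].mean()
print(mean_target)
```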

<a id="19"></a> <br>
## 6-2 Data Cleaning
When dealing with real-world data, dirty data is the norm rather than the exception. We continuously need to predict correct values, impute missing ones, and find links between various data artefacts such as schemas and records. We need to stop treating data cleaning as a piecemeal exercise (resolving different types of errors in isolation), and instead leverage all signals and resources (such as constraints, available statistics, and dictionaries) to accurately predict corrective actions.

The primary goal of data cleaning is to detect and remove errors and **anomalies** to increase the value of data in analytics and decision making. While it has been the focus of many researchers for several years, individual problems have been addressed separately. These include missing value imputation, outliers detection, transformations, integrity constraints violations detection and repair, consistent query answering, deduplication, and many other related problems such as profiling and constraints mining.[4]
###### [Go to top](#top)

How many NA elements are there in every column?

Good news: it is zero!

To check how many null values are in the dataset, we can use **isnull().sum()**.

In [None]:
train.isnull().sum()

But if we had any, we could simply use **dropna()** (be careful: sometimes you should not do this!)

In [None]:
# remove rows that have NA's
print('Before dropping', train.shape)
train = train.dropna()
print('After dropping', train.shape)
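
Dropping rows throws away information, which is why the warning above matters. As a hedged alternative, the sketch below fills missing feature values with the column median. The frame and the `feature_2` column are made up, and the median is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Made-up feature table with two missing values.
demo = pd.DataFrame({
    "feature_1": [1.0, np.nan, 3.0, 5.0],
    "feature_2": [2.0, 2.0, np.nan, 4.0],   # hypothetical column
})

print(demo.isnull().sum())

# Option 1: drop every row that has at least one missing value.
dropped = demo.dropna()

# Option 2: keep all rows and fill each gap with that column's median.
filled = demo.fillna(demo.median())

print("rows after dropna:", len(dropped))
print("rows after fillna:", len(filled))
```

Here `dropna()` keeps only half of the rows, while the filled frame keeps all of them.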


We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

To print the dataset **columns**, we can use the columns attribute.

In [None]:
train.columns

You can see the unique values of the target with the command below:

In [None]:
train_target = train['target'].values

np.unique(train_target)

To check the first 5 rows of the data set, we can use head(5).

In [None]:
train.head(5) 

Or, to check the last 5 rows of the data set, we use the tail() function.

In [None]:
train.tail() 

To give a **statistical summary** about the dataset, we can use **describe()**


In [None]:
train.describe() 

As you can see, by default this command summarizes the numerical columns and skips text columns such as card_id.
**describe() is more useful for numerical data sets**

<a id="20"></a> <br>
## 6-3 Data Preprocessing
**Data preprocessing** refers to the transformations applied to our data before feeding it to the algorithm.
 
Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other words, whenever the data is gathered from different sources it is collected in raw format which is not feasible for the analysis.
There are plenty of steps for data preprocessing; we list some general ones here (not specific to Elo):
1. Removing the id column
1. Sampling (without replacement)
1. Making part of a dataset unbalanced and balancing it (with undersampling and SMOTE)
1. Introducing missing values and treating them (replacing by average values)
1. Noise filtering
1. Data discretization
1. Normalization and standardization
1. PCA analysis
1. Feature selection (filter, embedded, wrapper)
1. Etc.
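
Two of the steps listed above, normalization/standardization and encoding a categorical column, can be sketched with pandas alone. The data is made up, `amount` is a hypothetical numeric column, and `feature_1` is assumed to be categorical.

```python
import pandas as pd

# Made-up data: one categorical-like column and one numeric column.
demo = pd.DataFrame({
    "feature_1": [1, 2, 3, 2],
    "amount": [10.0, 20.0, 30.0, 40.0],   # hypothetical column
})

# Standardization: subtract the mean and divide by the standard deviation.
amount_mean = demo["amount"].mean()
amount_std = demo["amount"].std(ddof=0)
demo["amount_standardized"] = (demo["amount"] - amount_mean) / amount_std

# Min-max normalization: rescale to the range [0, 1].
amount_min = demo["amount"].min()
amount_max = demo["amount"].max()
demo["amount_normalized"] = (demo["amount"] - amount_min) / (amount_max - amount_min)

# One-hot encoding: one indicator column per category of feature_1.
encoded = pd.get_dummies(demo, columns=["feature_1"], prefix="feature_1")
print(encoded)
```

In practice the mean, standard deviation, minimum and maximum should be computed on the training data only and then reused for the test data.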

What methods of preprocessing can we run on the Elo data?
###### [Go to top](#top)

**<< Note 2 >>**
In a pandas DataFrame you can perform queries such as "where":

In [None]:
# count the non-null values per column among rows with a positive target
train.where(train['target'] > 0).count()

As you can see below, it is easy to run queries on a DataFrame in Python:

In [None]:
train[train['target']<-32]
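
The query above isolates rows with a very low target. Assuming such rows are rare extreme values, one option is to flag them rather than delete them, as in this sketch. The data is made up, the threshold of -32 is simply reused from the query, and `is_outlier` is a hypothetical column name.

```python
import pandas as pd

# Made-up scores; the second one stands in for an extreme value.
demo = pd.DataFrame({
    "card_id": ["C_1", "C_2", "C_3", "C_4"],
    "target": [0.4, -33.2, 1.1, -0.7],
})

# Flag rows below the threshold used in the query above.
demo["is_outlier"] = (demo["target"] < -32).astype(int)

# Share of flagged rows, and the remaining rows without them.
outlier_share = demo["is_outlier"].mean()
inliers = demo[demo["is_outlier"] == 0]

print(demo)
print("share of outliers:", outlier_share)
```

Keeping the flag as a column lets you compare models trained with and without these rows.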

Some examples of cards with a positive loyalty score:

In [None]:
train[train['target'] > 0].head(5)

**<< Note >>**
>**Preprocessing and generation pipelines depend on a model type**

<a id="23"></a> <br>
## 6-4 Data Visualization
**Data visualization**  is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.

> * Two **important rules** for data visualization:
>     1. Do not put too little information
>     1. Do not put too much information

###### [Go to top](#top)

<a id="63"></a> <br>
## 6-4-3  Histogram

In [None]:
train["target"].hist();

In [None]:
sns.distplot(train['target'])

In [None]:
sns.violinplot(data=train, x="feature_1", y='target')

<a id="24"></a> <br>
## 7- Apply Learning
How do we find the best way to solve our problem?

The answer is always "**It depends**." It depends on the **size**, **quality**, and **nature** of the **data**. It depends on what you want to do with the answer. It depends on how the **math** of the algorithm was translated into instructions for the computer you are using. And it depends on how much **time** you have. Even the most **experienced data scientists** can't tell which algorithm will perform best before trying them (see a nice [cheatsheet](https://github.com/mjbahmani/10-steps-to-become-a-data-scientist/blob/master/cheatsheets/microsoft-machine-learning-algorithm-cheat-sheet-v7.pdf) for this section).

**Categorize the problem**

The next step is to categorize the problem by input and by output, and then to review your constraints and the algorithms available to you.

1. **Categorize by input**:
    1. If you have labelled data, it’s a supervised learning problem.
    1. If you have unlabelled data and want to find structure, it’s an unsupervised learning problem.
    1. If you want to optimize an objective function by interacting with an environment, it’s a reinforcement learning problem.
1. **Categorize by output**.
    1. If the output of your model is a number, it’s a regression problem.
    1. If the output of your model is a class, it’s a classification problem.
    1. If the output of your model is a set of input groups, it’s a clustering problem.
    1. Do you want to detect an anomaly? That’s anomaly detection.
1. **Understand your constraints**
    1. What is your data storage capacity? Depending on the storage capacity of your system, you might not be able to store gigabytes of classification/regression models or gigabytes of data to cluster. This is the case, for instance, for embedded systems.
    1. Does the prediction have to be fast? In real time applications, it is obviously very important to have a prediction as fast as possible. For instance, in autonomous driving, it’s important that the classification of road signs be as fast as possible to avoid accidents.
    1. Does the learning have to be fast? In some circumstances, training models quickly is necessary: sometimes, you need to rapidly update, on the fly, your model with a different dataset.
1. **Find the available algorithms**
    1. Now that you have a clear understanding of where you stand, you can identify the algorithms that are applicable and practical to implement using the tools at your disposal. Some of the factors affecting the choice of a model are:
        1. Whether the model meets the business goals
        1. How much preprocessing the model needs
        1. How accurate the model is
        1. How explainable the model is
        1. How fast the model is: how long it takes to build the model, and how long the model takes to make predictions
        1. How scalable the model is
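
Applied to this competition, the categorization above points to supervised regression: the data is labelled and the output is a number. Before comparing algorithms, it helps to have a baseline to beat. The sketch below is only an illustration on synthetic data, not the approach of this kernel; it compares a predict-the-mean baseline with an ordinary least squares fit, using RMSE on a hold-out split.

```python
import numpy as np

# Synthetic regression data: two features and a noisy linear target.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Simple hold-out split: first 80% for training, the rest for validation.
split = int(0.8 * n)
X_train, X_valid = X[:split], X[split:]
y_train, y_valid = y[:split], y[split:]


def rmse(y_true, y_pred):
    """Root mean squared error, the competition metric."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Baseline: always predict the mean of the training target.
baseline_pred = np.full_like(y_valid, y_train.mean())

# Ordinary least squares with an intercept column.
A_train = np.column_stack([X_train, np.ones(len(X_train))])
A_valid = np.column_stack([X_valid, np.ones(len(X_valid))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
linear_pred = A_valid @ coef

rmse_baseline = rmse(y_valid, baseline_pred)
rmse_linear = rmse(y_valid, linear_pred)
print("baseline RMSE:", rmse_baseline)
print("linear RMSE:", rmse_linear)
```

Any candidate model should at least beat the mean baseline on held-out data; how much accuracy to trade for speed or explainability depends on the factors listed above.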




#### This kernel is not complete and will be updated soon!