**Hello, guys. This is my first Kernel on Kaggle. I hope you find it useful.**

Here are the steps of the problem-solving process:

1.  Understand the problem set; 
    - If you don't have any domain knowledge, do more research online or ask your friends for help
2.  Analyze each data set before using them;
    - Check relationships within the data
    - Drop any redundant data
    - Create new features from the current features
3.  Merge data sets by primary keys; 
    - Understand the relations among datasets
4.  Select a suitable model;
    - Prepare data for the model
    - Set up the model
    - Tune parameters
 

As we know, a data scientist spends much of the time analyzing data. More importantly, he or she needs to understand the problem first. The next questions are what can be learned from the given data, how to use it, what useful features can be created from the given data sets, and which models are worth trying. 

#### Reference:  
[Elo world by FabienDaniel](https://www.kaggle.com/fabiendaniel/elo-world)

## **Analyze each data set before using them**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import datetime
from datetime import date
import calendar
import dateutil
import os

In [None]:
## Check files
print(os.listdir("../input/elo-merchant-category-recommendation"))

The list above shows 7 files in total. In addition, Data_Dictionary.xlsx explains the meaning of each column in the provided data. 

***Train.csv***

In [None]:
file_path = '../input/elo-merchant-category-recommendation/train.csv'
train = pd.read_csv(file_path)
train.head()

In [None]:
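## check the correlation of each feature and target
## (numeric_only=True skips text columns such as card_id; it needs pandas 1.5 or later)
train.corr(numeric_only=True)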

The table above shows that feature_1 correlates more strongly with feature_3 than with the other columns, while all features correlate only weakly with the target. This means we need additional features to predict the target accurately. 
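
To read such a table faster, the features can be ranked by the absolute value of their correlation with the target. This is a minimal sketch on made-up numbers; the column names mirror train.csv, but the values are purely illustrative.

```python
import pandas as pd

# Made-up data: feature_1 follows the target closely, feature_2 does not.
toy = pd.DataFrame({
    "feature_1": [1, 2, 3, 4, 5],
    "feature_2": [2, 1, 2, 1, 2],
    "target": [1.1, 1.9, 3.2, 3.9, 5.1],
})

# Take the 'target' column of the correlation matrix, drop the self-correlation,
# and sort by strength regardless of sign.
corr_with_target = (
    toy.corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the real data every value in this ranking would be small, which is exactly the signal that more features are needed.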

In [None]:
train.target.describe()

The mean of the target is about 0, and the 25% and 75% quantiles are close to 0, but the min and max values are much further from 0 than that. We should pay attention to these extreme values, since they might be outliers or wrong labels. 
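
One possible way to look at these extreme values is an interquartile-range rule. The sketch below uses a made-up series in place of train.target, and the factor 3 is an arbitrary assumption rather than a value taken from this kernel.

```python
import pandas as pd

# Toy stand-in for train['target']; the values are made up for illustration.
target = pd.Series([-0.4, 0.1, 0.3, -0.2, 0.0, 0.5, -33.2, 0.2, -0.1, 17.9])

# Interquartile-range rule: flag values far outside the middle 50% of the data.
q1, q3 = target.quantile(0.25), target.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 3 * iqr, q3 + 3 * iqr

is_outlier = (target < lower) | (target > upper)
print(target[is_outlier])
print("share of outliers:", is_outlier.mean())
```

Whether flagged rows should be dropped, clipped, or modelled separately is a design decision that this sketch does not settle.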

***Test.csv***

In [None]:
file_path = '../input/elo-merchant-category-recommendation/test.csv'
test = pd.read_csv(file_path)
test.head()

Apart from the target column, the train and test data sets have the same columns with the same names. The column card_id is the key that links them to the other provided files. 

***Historical_transactions.csv***

In [None]:
history = pd.read_csv('../input/elo-merchant-category-recommendation/historical_transactions.csv')
history.head()

In the historical_transactions file, authorized_flag, category_1 and category_3 are categorical features (in addition to the identifiers card_id and merchant_id), so they should be transformed into numerical data or dummies later. 
Our job is to help understand customer loyalty. At first glance, purchase_amount, month_lag, purchase_date, city_id, and state_id look important for this task. Let's check their relations. 
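
The conversion mentioned above could look roughly like this. The sample rows are made up, and the level names (Y/N and A/B/C) are assumptions about what the columns contain.

```python
import pandas as pd

# Tiny made-up sample with the same column names as historical_transactions.csv.
sample = pd.DataFrame({
    "authorized_flag": ["Y", "N", "Y"],
    "category_1": ["N", "N", "Y"],
    "category_3": ["A", "B", "C"],
})

# Two-level columns: map straight to 0/1.
for col in ["authorized_flag", "category_1"]:
    sample[col] = sample[col].map({"Y": 1, "N": 0})

# Multi-level column: one indicator (dummy) column per level.
encoded = pd.get_dummies(sample, columns=["category_3"], prefix="category_3")
print(encoded)
```

Mapping works well for two levels; dummies avoid implying an order between levels that may not exist.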

In [None]:
## Month_lag
fig = plt.figure(figsize=(14,6))
ax = sns.distplot(history.month_lag)
ax.set_xlabel('Month_lag', size=12)
ax.set_ylabel('Frequency', size=12)
ax.set_title('Month_lag', size=15)

In [None]:
history['month_lag'].describe()

The graph and the table show that about 75% of the month_lag values are -2 or lower (the 75% row of describe()).

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(16,8))
temp1 = history.groupby('city_id')['purchase_amount'].mean()
temp2 = history.groupby('state_id')['purchase_amount'].mean()

ax = [ax1, ax2]
temp = [temp1, temp2]
x_labels = ['city_id', 'state_id']
titles = ['Average City purchase amount', 'Average State purchase amount']
scale = [100, 10000]

for i in range(2):
    ax[i].scatter(x=temp[i].index, y=temp[i].values, c=temp[i].values, s=temp[i].values*scale[i], alpha=0.8)
    ax[i].set_xlabel(x_labels[i], size=12)
    ax[i].set_ylabel('Purchase amount', size=12)
    ax[i].set_title(titles[i], size=15)

These two graphs show that the average purchase amount differs noticeably between cities and between states. So, I bet city_id and state_id are quite important to the project. 

In [None]:
## parse purchase_date and derive the weekday and month names to see how important they are
history['purchase_weekday'] = pd.to_datetime(history['purchase_date']).dt.day_name()
history['purchase_month'] = pd.to_datetime(history['purchase_date']).dt.month_name()

In [None]:
## Define a day session
## Morning: 5am to 12pm (05:00 to 11:59)
## Afternoon: 12pm to 5pm (12:00 to 16:59)
## Evening: 5pm to 9pm (17:00 to 20:59)
## Night: 9pm to 5am (21:00 to 04:59)

def time_session(time):
    
    if time >= 5 and time < 12:
        return 'Morning'
    elif time >=12 and time < 17:
        return 'Afternoon'
    elif time >=17 and time < 21:
        return 'Evening'
    else:
        return 'Night'

In [None]:
history['temp'] = pd.to_datetime(history['purchase_date']).dt.hour
history['purchase_time_session'] = history['temp'].apply(lambda x : time_session(x))
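
For a table as large as the history file, a row-by-row apply can be slow. A vectorized alternative is sketched below with pd.cut; the hours are made up, and shifting the clock by 5 hours is just one way to handle the 'Night' session that wraps around midnight.

```python
import pandas as pd

hours = pd.Series([0, 4, 5, 11, 12, 16, 17, 20, 21, 23])

# Shift the clock by 5 hours so that 'Night' (21:00-04:59) becomes one contiguous
# range instead of wrapping around midnight: 05:00 -> 0, 21:00 -> 16, 04:00 -> 23.
shifted = (hours - 5) % 24

sessions = pd.cut(
    shifted,
    bins=[0, 7, 12, 16, 24],  # right-open bins: [0,7), [7,12), [12,16), [16,24)
    labels=["Morning", "Afternoon", "Evening", "Night"],
    right=False,
)
print(sessions.tolist())
```

The result is already an ordered categorical, so it follows the same session order as the labels list.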

In [None]:
## Give the categorical data a specific order
weekday_labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
history['purchase_weekday'] = pd.Categorical(history['purchase_weekday'], categories=weekday_labels, ordered=True)

month_labels = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August',
                'September', 'October', 'November', 'December']
history['purchase_month'] = pd.Categorical(history['purchase_month'], categories=month_labels, ordered=True)

session_labels = ['Morning', 'Afternoon', 'Evening', 'Night']
history['purchase_time_session'] = pd.Categorical(history['purchase_time_session'], categories=session_labels, ordered=True)

In [None]:
f, (ax1, ax2, ax3) = plt.subplots(nrows=3, ncols=1, figsize=(14,10))
temp1 = history.groupby('purchase_weekday')['purchase_amount'].mean()
temp2 = history.groupby('purchase_month')['purchase_amount'].mean()
temp3 = history.groupby('purchase_time_session')['purchase_amount'].mean()

## the means are already aggregated, so no data= argument is needed
sns.lineplot(x=temp2.index, y=temp2.values, ax=ax1)
sns.lineplot(x=temp1.index, y=temp1.values, ax=ax2)
sns.lineplot(x=temp3.index, y=temp3.values, ax=ax3)

## label every subplot, not only the last one
for ax, label in zip((ax1, ax2, ax3), ('Purchase month', 'Purchase weekday', 'Purchase time session')):
    ax.set_xlabel(label, size=12)
    ax.set_ylabel('Purchase amount', size=12)
f.suptitle('Time Series Analysis', size=15)

The graphs above show that the purchase date contains some useful information. It is worth finding a way to convert the purchase date into categorical or numerical features. 
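
One way to turn a purchase date into numbers is sketched here. The two timestamps are made up, and the sine/cosine (cyclic) encoding of the hour is an extra idea that this kernel does not use.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(pd.Series(["2017-06-25 15:33:07", "2018-01-01 00:00:00"]))

feats = pd.DataFrame({
    "month": dates.dt.month,        # 1..12
    "weekday": dates.dt.dayofweek,  # Monday=0 .. Sunday=6
    "hour": dates.dt.hour,          # 0..23
})

# Cyclic encoding: hour 23 and hour 0 end up close together on the unit circle.
feats["hour_sin"] = np.sin(2 * np.pi * feats["hour"] / 24)
feats["hour_cos"] = np.cos(2 * np.pi * feats["hour"] / 24)
print(feats)
```

Plain integers are usually enough for tree models such as LightGBM; the cyclic columns matter more for models that rely on distances.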

***Merchants.csv***

In [None]:
mer = pd.read_csv('../input/elo-merchant-category-recommendation/merchants.csv')
mer.head()

The merchants.csv file contains some columns that can also be found in the history file. In addition, it has some new columns, like avg_sales_lag3. 

In [None]:
f, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize=(16,8))
temp1 = mer.groupby('city_id')['avg_sales_lag3'].sum()
temp2 = mer.groupby('city_id')['avg_sales_lag6'].sum()
temp3 = mer.groupby('city_id')['avg_sales_lag12'].sum()

ax = [ax1, ax2, ax3]
temp = [temp1, temp2, temp3]
y_labels = ['Total avg sales lag3', 'Total avg sales lag6', 'Total avg sales lag12']

for i in range(3):
    ax[i].scatter(x=temp[i].index, y=temp[i].values, s=temp[i].values/100, c=temp[i].values, alpha=0.8)
    ax[i].set_xlabel('City id', fontsize=12)
    ax[i].set_ylabel(y_labels[i])
    ax[i].set_title(y_labels[i] + ' in each city')

In [None]:
f, (ax1, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize=(16,8))
temp1 = mer.groupby('city_id')['avg_purchases_lag3'].sum()
temp2 = mer.groupby('city_id')['avg_purchases_lag6'].sum()
temp3 = mer.groupby('city_id')['avg_purchases_lag12'].sum()

ax = [ax1, ax2, ax3]
temp = [temp1, temp2, temp3]
y_labels = ['Total avg purchases lag3', 'Total avg purchases lag6', 'Total avg purchases lag12']

for i in range(3):
    ax[i].scatter(x=temp[i].index, y=temp[i].values, s=temp[i].values/100, c=temp[i].values, alpha=0.8)
    ax[i].set_xlabel('City id', fontsize=12)
    ax[i].set_ylabel(y_labels[i])
    ax[i].set_title(y_labels[i] + ' in each city')

According to the average sales graphs, some cities have a much higher total avg_sales_lag than other cities. For the majority of cities, however, the total avg_purchases_lag is low. 

***new_merchant_transactions.csv***

In [None]:
new_mer = pd.read_csv('../input/elo-merchant-category-recommendation/new_merchant_transactions.csv')
new_mer.head()

The table above shows that the new merchant transactions contain some useful features, like purchase amount and purchase date. 

#### Because the Kaggle kernel died many times while merging the datasets, I had to process the data in a Jupyter Notebook, save the final files, and upload them to Kaggle.  

#### Please check the appendix of this kernel for details. 
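
A kernel that dies during a large merge is often running out of memory. One common mitigation, which is not what this kernel did, is to store numeric columns in smaller dtypes before merging. The sketch below uses made-up data, and downcast_numeric is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Return a copy of df with integer and float columns stored in smaller dtypes."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Made-up table with two of the column names from the transaction files.
toy = pd.DataFrame({
    "city_id": np.arange(1000, dtype="int64") % 300,
    "purchase_amount": np.random.default_rng(0).normal(size=1000),
})

small = downcast_numeric(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(before, "->", after)
```

Downcasting floats to 32 bits loses some precision, so it is worth checking that aggregated features do not change in a way that matters.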

## Select a suitable model

In this kernel, we focus on ***LightGBM***, a gradient boosting framework that uses tree-based learning algorithms. To better understand this algorithm, you should do more research about it. More importantly, you should try to understand the theory behind it and play with it. 

#### Reference
* [LightGBM’s documentation!](https://lightgbm.readthedocs.io/en/latest/)
* [What is LightGBM, How to implement it? How to fine tune the parameters?](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html)
* [LightGBM, Light Gradient Boosting Machine](https://github.com/Microsoft/LightGBM)

**For Faster Speed:**
* Use bagging by setting bagging_fraction and bagging_freq
* Use feature sub-sampling by setting feature_fraction
* Use small max_bin
* Use save_binary to speed up data loading in future learning
* Use parallel learning, refer to parallel learning guide.

**For better accuracy:**
* Use large max_bin (may be slower)
* Use small learning_rate with large num_iterations
* Use large num_leaves (may cause over-fitting)
* Use bigger training data
* Try dart
* Try to use categorical feature directly

**To deal with over-fitting:**
* Use small max_bin
* Use small num_leaves
* Use min_data_in_leaf and min_sum_hessian_in_leaf
* Use bagging by setting bagging_fraction and bagging_freq
* Use feature sub-sampling by setting feature_fraction
* Use bigger training data
* Try lambda_l1, lambda_l2 and min_gain_to_split for regularization
* Try max_depth to avoid growing deep tree
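
As an illustration, the over-fitting advice above can be collected into one parameter dictionary. The parameter names are LightGBM's own, but every value here is a hypothetical starting point and not tuned for this competition.

```python
# Hypothetical starting point that applies the over-fitting advice above.
# The values are illustrative, not tuned for this competition.
conservative_params = {
    "objective": "regression",
    "metric": "rmse",
    "max_bin": 63,                    # small max_bin
    "num_leaves": 31,                 # small num_leaves
    "max_depth": 6,                   # avoid growing deep trees
    "min_data_in_leaf": 100,
    "min_sum_hessian_in_leaf": 1e-2,
    "bagging_fraction": 0.8,          # row sub-sampling ...
    "bagging_freq": 1,                # ... is only active when bagging_freq > 0
    "feature_fraction": 0.8,          # column sub-sampling
    "lambda_l1": 1.0,
    "lambda_l2": 1.0,
    "min_gain_to_split": 0.01,
}

# A tree of depth d has at most 2 ** d leaves, so a larger num_leaves
# could never be reached under this max_depth.
assert conservative_params["num_leaves"] < 2 ** conservative_params["max_depth"]
print(conservative_params)
```

A dictionary like this would be passed to lgb.train in place of lgb_params and then relaxed step by step while watching the validation score.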

## Prepare features

In [None]:
train = pd.read_csv('../input/elo-combined-data/X.csv')
train.drop('Unnamed: 0', axis=1, inplace=True)

In [None]:
test = pd.read_csv('../input/elo-combined-data/X_test.csv')
test.drop('Unnamed: 0', axis=1, inplace=True)

In [None]:
y = pd.read_csv('../input/elo-combined-data/y.csv', header=None)
y.drop(0, axis=1, inplace=True)
y.rename({1: 'target'}, axis=1, inplace=True)

In [None]:
not_use_col = ['first_active_month', 'card_id']
use_cols = [col for col in train.columns if col not in not_use_col]
X = train[use_cols]
X_test = test[use_cols]

In [None]:
features = list(train[use_cols].columns)
categorical_feat = [col for col in features if 'feature_' in col]

In [None]:
def model():
    lgb_params = {
              'objective': 'regression',
              'metric': 'rmse',
              'max_depth': 11,
              # 'min_child_samples' is an alias of 'min_data_in_leaf', so only one of them is set
              'min_data_in_leaf': 200,
              'reg_alpha': 1,
              'reg_lambda': 1,
              'num_leaves': 140,
              'learning_rate': 0.07,
              'subsample': 0.8,  # has no effect unless 'subsample_freq' is set to a value > 0
              'colsample_bytree': 0.9,
              'verbosity': -1}
    
    folds = KFold(n_splits=10, shuffle=True, random_state=1)
    oof = np.zeros(len(X))                # out-of-fold predictions for the training rows
    predictions = np.zeros(len(X_test))   # test predictions averaged over the folds
    
    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X.values, y.values)):
        print("LGB" + str(fold_) + '*' * 50)
        trn_data = lgb.Dataset(X.iloc[trn_idx][use_cols], label=y.iloc[trn_idx], categorical_feature=categorical_feat)
        val_data = lgb.Dataset(X.iloc[val_idx][use_cols], label=y.iloc[val_idx], categorical_feature=categorical_feat)

        num_round = 1000

        # LightGBM >= 3.3 uses callbacks; older versions take
        # verbose_eval=100 and early_stopping_rounds=600 as keyword arguments instead
        clf = lgb.train(lgb_params, trn_data, num_boost_round=num_round,
                        valid_sets=[trn_data, val_data],
                        callbacks=[lgb.log_evaluation(100), lgb.early_stopping(600)])
        oof[val_idx] = clf.predict(X.iloc[val_idx][use_cols], num_iteration=clf.best_iteration)
        predictions += clf.predict(X_test[use_cols], num_iteration=clf.best_iteration) / folds.n_splits

    print("CV score: {:<8.5f}".format(mean_squared_error(y, oof)**0.5))
    return oof, predictions

In [None]:
oof, predictions = model()
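
The out-of-fold scheme used in this cell can be seen in isolation in the following sketch. It runs on synthetic data, and ordinary least squares stands in for LightGBM, so only the fold bookkeeping matches the cell; the names X_toy, y_toy and X_new are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_new = rng.normal(size=(50, 3))    # plays the role of the test set

n_splits = 5
fold_ids = np.array_split(rng.permutation(len(X_toy)), n_splits)

oof = np.zeros(len(X_toy))          # one held-out prediction per training row
predictions = np.zeros(len(X_new))  # fold-averaged prediction per test row

for val_idx in fold_ids:
    trn_idx = np.setdiff1d(np.arange(len(X_toy)), val_idx)
    # Ordinary least squares stands in for the LightGBM model here.
    coef, *_ = np.linalg.lstsq(X_toy[trn_idx], y_toy[trn_idx], rcond=None)
    oof[val_idx] = X_toy[val_idx] @ coef
    predictions += X_new @ coef / n_splits

cv_rmse = np.sqrt(np.mean((oof - y_toy) ** 2))
print("CV RMSE: {:.4f}".format(cv_rmse))
```

Every training row is predicted exactly once by a model that never saw it, which is why the RMSE over oof is a fair estimate of the error on unseen data.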

## Appendix

### Data preprocessing and cleaning:

### Convert categorical data into numerical data if there are only two classes
history['authorized_flag'] = history['authorized_flag'].map({'Y': 1, 'N': 0}) 

new_mer['authorized_flag'] = new_mer['authorized_flag'].map({'Y': 1, 'N': 0})

### Change type of purchase date, so that we can get extra useful information from them
history['purchase_date'] = pd.DatetimeIndex(history['purchase_date'])

new_mer['purchase_date'] = pd.DatetimeIndex(new_mer['purchase_date'])

### **Group merchants dataset**

def aggregate_merchants(data):
    
    agg_fun={
        'avg_sales_lag3': ['mean'],
        'avg_purchases_lag3': ['mean'],
        'active_months_lag3': ['mean'],
        'avg_sales_lag6': ['mean'],
        'avg_purchases_lag6': ['mean'],
        'active_months_lag6': ['mean'],
        'avg_sales_lag12': ['mean'],
        'avg_purchases_lag12': ['mean']
    }
    
    agg_mer = data.groupby('merchant_id').agg(agg_fun)
    agg_mer.columns = ['mer_' + '_'.join(col).strip() for col in agg_mer.columns.values]
    
    agg_mer.reset_index(inplace=True)
    
    return agg_mer

mer_agg = aggregate_merchants(mer)
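
The column renaming inside aggregate_merchants is easier to follow on a small example. The table below is made up; it shows that groupby().agg() with a dictionary of lists returns (column, statistic) tuples, which are then joined into flat names.

```python
import pandas as pd

toy = pd.DataFrame({
    "merchant_id": ["M_1", "M_1", "M_2"],
    "avg_sales_lag3": [1.0, 3.0, 5.0],
    "avg_purchases_lag3": [2.0, 4.0, 6.0],
})

agg = toy.groupby("merchant_id").agg({
    "avg_sales_lag3": ["mean"],
    "avg_purchases_lag3": ["mean", "max"],
})
print(agg.columns.tolist())  # tuples such as ('avg_sales_lag3', 'mean')

# Join each (column, statistic) tuple into one flat, prefixed name.
agg.columns = ["mer_" + "_".join(col) for col in agg.columns]
agg = agg.reset_index()
print(agg)
```

The prefix keeps the origin of each feature visible after several aggregated tables have been merged together.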

#### Merge the history dataset with the mer_agg dataset by using 'merchant_id'
history = pd.merge(history, mer_agg, on='merchant_id', how='left', suffixes=('_hist', '_mer'))

def aggregate_history_transactions(history):
    
    agg_fun = {
        'authorized_flag': ['sum', 'mean'],
        'city_id': ['nunique'],
        'installments': ['sum', 'max', 'mean', 'std'],
        'merchant_category_id': ['nunique'],
        'month_lag': ['mean', 'min'],
        'purchase_amount': ['sum', 'mean', 'max', 'min', 'std'],
        'state_id': ['nunique'],
        'subsector_id': ['mean', 'max', 'min'],
        'purchase_date': ['max', 'min', lambda x: max(x) - min(x)],
        'category_1': ['nunique'],
        'category_3': ['nunique'],
        'category_2': ['mean'],
        'mer_avg_sales_lag3_mean': ['sum', 'mean'],
        'mer_avg_purchases_lag3_mean': ['sum', 'mean'],
        'mer_active_months_lag3_mean': ['sum'],
        'mer_avg_sales_lag6_mean': ['sum', 'mean'], 
        'mer_avg_purchases_lag6_mean': ['sum', 'mean'],
        'mer_active_months_lag6_mean': ['sum'], 
        'mer_avg_sales_lag12_mean': ['sum', 'mean'],
        'mer_avg_purchases_lag12_mean': ['sum', 'mean']
    }
    
    agg_history = history.groupby(['card_id']).agg(agg_fun)
    agg_history.columns = ['hist_' + '_'.join(col).strip() for col in agg_history.columns.values]
    agg_history.reset_index(inplace=True)
    
    df = (history.groupby('card_id').size().reset_index(name='hist_transactions_count'))
    
    agg_history = pd.merge(df, agg_history, on='card_id', how='left')
    
    return agg_history

history_agg = aggregate_history_transactions(history)

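#### Convert the time data to numbers so that the model can use them. 

Note: .dt.day returns the day of the month, so hist_purchase_date_diff_day is a day-of-month difference; the full time span is kept in seconds in hist_purchase_date_diff.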
history_agg['hist_purchase_date_diff_day'] = pd.to_datetime(history_agg['hist_purchase_date_max']).dt.day - pd.to_datetime(history_agg['hist_purchase_date_min']).dt.day

history_agg.drop(['hist_purchase_date_max', 'hist_purchase_date_min'], axis=1, inplace=True)

history_agg = history_agg.rename(columns={"hist_purchase_date_<lambda>": "hist_purchase_date_diff"})

history_agg['hist_purchase_date_diff'] = history_agg['hist_purchase_date_diff'].dt.total_seconds()
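
The difference between a day-of-month subtraction and the elapsed time between two dates can be checked on one made-up pair of dates:

```python
import pandas as pd

first = pd.Series(pd.to_datetime(["2017-01-28"]))
last = pd.Series(pd.to_datetime(["2017-03-02"]))

day_of_month_diff = last.dt.day - first.dt.day       # 2 - 28
elapsed_days = (last - first).dt.days                # full span in days
elapsed_seconds = (last - first).dt.total_seconds()  # full span in seconds

print(day_of_month_diff.iloc[0], elapsed_days.iloc[0], elapsed_seconds.iloc[0])
```

The first number can be negative even though the second date is later, so the two columns carry different information.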

### **Group new merchant transactions dataset**

def aggregate_new_merchant_transaction(data):
    
    agg_fun={
        'authorized_flag': ['sum'],
        'merchant_id': ['nunique'],
        'installments': ['sum', 'max', 'mean'],
        'month_lag': ['sum', 'mean'],
        'purchase_amount': ['sum', 'mean', 'max', 'min'],
        'purchase_date': ['max', 'min', lambda x : max(x) - min(x)]
    }
    
    agg_new_mer = data.groupby('card_id').agg(agg_fun)
    agg_new_mer.columns = ['new_' + '_'.join(col).strip() for col in agg_new_mer.columns.values]
    
    agg_new_mer.reset_index(inplace=True)
    
    return agg_new_mer

new_trans_agg = aggregate_new_merchant_transaction(new_mer)

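#### Convert the time data to numbers so that the model can use them. 

Note: .dt.day returns the day of the month, so new_purchase_date_diff_days is a day-of-month difference; the full time span is kept in seconds in new_purchase_date_diff.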
new_trans_agg['new_purchase_date_diff_days'] = pd.to_datetime(new_trans_agg['new_purchase_date_max']).dt.day - pd.to_datetime(new_trans_agg['new_purchase_date_min']).dt.day

new_trans_agg.drop(['new_purchase_date_max', 'new_purchase_date_min'], axis=1, inplace=True)

new_trans_agg = new_trans_agg.rename(columns={'new_purchase_date_<lambda>': 'new_purchase_date_diff'})

new_trans_agg['new_purchase_date_diff'] = new_trans_agg['new_purchase_date_diff'].dt.total_seconds()

#### Process the train and test data sets
def read_data(input_file):
    df = pd.read_csv(input_file)

    df['first_active_month'] = pd.to_datetime(df['first_active_month'])

    # days between the card's first active month and the reference date 2018-02-01
    df['elapsed_time'] = (pd.Timestamp(2018, 2, 1) - df['first_active_month']).dt.days

    return df
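
On a toy frame, the elapsed_time feature behaves as follows. The two month strings are made up, and subtracting from a pd.Timestamp is one possible way to compute the day count.

```python
import pandas as pd

toy = pd.DataFrame({"first_active_month": ["2017-06", "2018-01"]})
toy["first_active_month"] = pd.to_datetime(toy["first_active_month"])

# Days between the month a card became active and the fixed reference date.
reference = pd.Timestamp(2018, 2, 1)
toy["elapsed_time"] = (reference - toy["first_active_month"]).dt.days
print(toy)
```

A month string such as 2017-06 is parsed as the first day of that month, so the feature counts days from the start of the first active month.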

#### Separate the target values from the train dataset and delete them from it.
y = train['target']

del train['target']

#### Merge all the aggregated files with the train and test datasets, so that both contain the same features.
train = pd.merge(train, history_agg, on='card_id', how='left')

test = pd.merge(test, history_agg, on='card_id', how='left')

train = pd.merge(train, new_trans_agg, on='card_id', how='left')

test = pd.merge(test, new_trans_agg, on='card_id', how='left')
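
A left merge keeps every card, including cards that have no rows in the aggregated table. The sketch below uses made-up ids and shows how to count such cards with the indicator option of pd.merge.

```python
import pandas as pd

cards = pd.DataFrame({"card_id": ["C_1", "C_2", "C_3"]})
new_agg = pd.DataFrame({
    "card_id": ["C_1", "C_3"],
    "new_purchase_amount_sum": [-1.2, 0.4],
})

merged = pd.merge(cards, new_agg, on="card_id", how="left", indicator=True)
print(merged)

# Rows marked 'left_only' had no match, so their aggregate columns are NaN.
missing = merged["_merge"].eq("left_only")
print("cards without new-merchant transactions:", int(missing.sum()))
```

LightGBM can train on missing values directly, so these NaN entries do not have to be filled before modelling.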