KDD 2009 Telecom Data Science Challenge

My Results

In the end, the ROC-AUCs on the training set (Gradient Boosting) were:

  • Churn: 0.73202
  • Upselling: 0.86373
  • Appetency: 0.82624
  • Average: 0.80733

With these numbers, I would place around position 34 in the Fast Track Rankings. In hindsight, I should have used stacked classifiers instead of only Gradient Boosting (GB), which can be deceiving, since it tends to overfit much more than Random Forests. Anyway, this performance seems acceptable as a first iteration.
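
For a later iteration, a stacked ensemble could blend GB with a less overfit-prone learner. Below is a minimal sketch using scikit-learn's StackingClassifier; the base estimators and hyperparameters are illustrative assumptions, not what the notebooks actually use:

```python
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression

# Base learners: GB plus a Random Forest to temper GB's tendency to overfit.
base_learners = [
    ("gb", GradientBoostingClassifier(n_estimators=200, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)),
]

# A logistic regression blends the base learners' out-of-fold probabilities.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=5,                         # out-of-fold predictions avoid leakage
    stack_method="predict_proba",
)

# Usage (X_train, y_train, etc. would come from the train/test split):
# stack.fit(X_train, y_train)
# scores = stack.predict_proba(X_test)[:, 1]
```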

Main Files

  • Notebooks:
    • Churn Training.ipynb
    • Churn Testing.ipynb
  • Python Scripts:
    • churn_hekima_v2.py - training
    • hekima_small_test.py - testing

Outline

  1. Preprocessing
    1. Imports
    2. Opening the Data
    3. Verifying Consistency
    4. Feature Scaling
    5. Deleting Vars with too many NaNs
    6. Filling NaNs
    7. Deleting Vars with too many Categories
    8. Feature Selection with Decision Trees
  2. Modelling
    1. Imports
    2. Train Test Split
    3. Evaluating Models' Performances (ROC-AUC)
  3. Best Model Optimization (Gradient Boosting)
    1. Separate Optimization
    2. Global Optimization
    3. Final Model's ROC-AUC
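
As a rough companion to the preprocessing steps above, here is a condensed sketch of what they amount to in pandas/scikit-learn. The thresholds, sentinel choices, and order of operations are illustrative assumptions, not the exact values used in the notebooks:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def preprocess(df: pd.DataFrame,
               nan_threshold: float = 0.5,
               max_categories: int = 30) -> pd.DataFrame:
    df = df.copy()

    # Delete variables with too many NaNs.
    df = df.loc[:, df.isna().mean() < nan_threshold]

    # Delete categorical variables with too many categories.
    cat_cols = df.select_dtypes(include="object").columns
    df = df.drop(columns=[c for c in cat_cols
                          if df[c].nunique() > max_categories])

    # Fill the remaining NaNs (median for numeric, mode for categorical).
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    for c in df.select_dtypes(include="object").columns:
        df[c] = df[c].fillna(df[c].mode().iloc[0])

    # Feature scaling on the numeric columns.
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])
    return df

def select_features(X: pd.DataFrame, y, top_k: int = 50) -> pd.DataFrame:
    # Feature selection via decision-tree importances
    # (assumes X is fully numeric, i.e. categoricals already encoded).
    tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X, y)
    ranked = tree.feature_importances_.argsort()[::-1]
    return X.iloc[:, ranked[:top_k]]
```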

Short Notes

The targets are very sparse, which causes many problems: a model that predicts all zeros already achieves very high accuracy.
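
A quick illustration of why plain accuracy is misleading here (the ~7% positive rate is an assumption for the example, not the exact KDD figure):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(50_000) < 0.07).astype(int)  # ~7% positives (illustrative)

all_zeros = np.zeros_like(y)
print(accuracy_score(y, all_zeros))   # ~0.93: high accuracy, zero skill

# ROC-AUC exposes the trick: a constant score has no ranking power.
print(roc_auc_score(y, all_zeros.astype(float)))  # 0.5, chance level
```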

To mitigate this effect, besides using ROC-AUC, FPR, and TPR as criteria, we could resample the targets. In practice, however, this was not very effective: neither subsampling the negative class nor using SMOTE to oversample the positive class worked well.
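
For reference, a minimal sketch of the two resampling strategies with imbalanced-learn (the toy data and sampling ratios are illustrative):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced data standing in for the real training split.
X_train, y_train = make_classification(
    n_samples=10_000, weights=[0.93, 0.07], random_state=0
)

# SMOTE: synthesize new positive samples until the classes are balanced.
X_over, y_over = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Subsampling: discard negatives until the classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)
```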

Another item to try in a later attempt is to calculate mutual information scores (analogous to Pearson correlations, but applicable to classes), in order to measure how much information even the variables with many NaNs carry. In this notebook, I deliberately deleted variables with too many NaNs instead.
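
A sketch of how that could look with scikit-learn; the NaN sentinel is an arbitrary assumption, and the features are assumed numeric:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def mutual_info_ranking(X: pd.DataFrame, y) -> pd.Series:
    # Encode NaNs with a sentinel so even high-NaN columns can be scored
    # (-999 is an arbitrary choice; X is assumed to be numeric).
    scores = mutual_info_classif(X.fillna(-999), y, random_state=0)
    return pd.Series(scores, index=X.columns).sort_values(ascending=False)
```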

I've also tested SVMs, but, aside from taking an unbearably long time to train, they do not perform very well, sitting between the Logistic Regression's and the Neural Network's performances. The Neural Network also has a lot of room for improvement, since the number of hidden layers and their sizes still have to be optimized -- I tried 4 hidden layers of 300 neurons each (number of variables / 2); deeper networks would likely do better.
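
The network described above maps to scikit-learn's MLPClassifier roughly as follows (a sketch; the solver defaults and other hyperparameters are assumptions):

```python
from sklearn.neural_network import MLPClassifier

# 4 hidden layers of 300 neurons each, as described above.
mlp = MLPClassifier(
    hidden_layer_sizes=(300, 300, 300, 300),
    early_stopping=True,   # hold out a validation slice to curb overfitting
    max_iter=200,
    random_state=0,
)
# mlp.fit(X_train, y_train)
```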

Further Reading

I've also written a post for my website, where I add some further, hopefully enlightening, comments.