## **Train LightGBM in GPU mode**

* GPU training can deliver substantial acceleration on large, dense datasets.
* The GPU implementation can scale to datasets more than 10x larger, thanks to the memory optimizations in LightGBM's GPU code.
* Larger datasets (which use more GPU memory) generally see better speedups, because the overhead of invoking GPU functions is significant when the dataset is small.

https://github.com/microsoft/LightGBM/blob/master/docs/GPU-Performance.rst
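
The last point can be illustrated with a toy cost model. The constants below are hypothetical and are not measurements: the CPU is assumed to pay only a per-row cost, while the GPU pays a fixed invocation overhead plus a smaller per-row cost.

```python
# Toy cost model. All constants are hypothetical and chosen only to show the
# shape of the trade-off between fixed GPU overhead and per-row throughput.
CPU_PER_ROW = 1.0e-6    # assumed seconds per row on the CPU
GPU_PER_ROW = 1.0e-7    # assumed seconds per row on the GPU
GPU_OVERHEAD = 5.0e-3   # assumed fixed cost of invoking GPU functions

def modelled_speedup(n_rows):
    """Return modelled CPU time divided by modelled GPU time."""
    cpu_time = CPU_PER_ROW * n_rows
    gpu_time = GPU_OVERHEAD + GPU_PER_ROW * n_rows
    return cpu_time / gpu_time

for n_rows in (1_000, 100_000, 10_000_000):
    print(f"{n_rows:>10,} rows -> modelled speedup {modelled_speedup(n_rows):.2f}x")
```

Under these assumptions the small dataset is slower on the GPU, and the speedup approaches the per-row ratio only once the fixed overhead is amortized over many rows.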

## **Load Data - Fake News dataset**

**Download data from Kaggle**
https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [2]:
!cp '/gdrive/My Drive/Lightgbm_GPU_Training/data/Fake.csv.zip' .
!cp '/gdrive/My Drive/Lightgbm_GPU_Training/data/True.csv.zip' .

In [3]:
!unzip Fake.csv.zip
!unzip True.csv.zip

Archive:  Fake.csv.zip
  inflating: Fake.csv

Archive:  True.csv.zip
  inflating: True.csv

In [4]:
!ls

Fake.csv  Fake.csv.zip	sample_data  True.csv  True.csv.zip
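
If shell commands are not available, the same extraction can be done from Python with the standard `zipfile` module. This is a sketch of an alternative, not what the notebook ran; `extract_archives` is a hypothetical helper name.

```python
import zipfile

def extract_archives(archive_paths, target_dir="."):
    """Extract each zip archive into target_dir and return the member names."""
    extracted = []
    for archive_path in archive_paths:
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target_dir)
            extracted.extend(archive.namelist())
    return extracted

# Usage in this notebook (assumes both archives are in the working directory):
# extract_archives(["Fake.csv.zip", "True.csv.zip"])
```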


## **Import Packages**

In [6]:
import pandas as pd
import time
import string

## Model building packages
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # the stop_words module was removed in newer scikit-learn
from sklearn.model_selection import train_test_split

## Text pre-processing packages
import nltk
from nltk.stem import SnowballStemmer

## Variables used during analysis and model building
punctuation = string.punctuation
stemmer = SnowballStemmer("english")
STOPLIST = set(ENGLISH_STOP_WORDS)
SYMBOLS = list(string.punctuation)  # one entry per punctuation character

import warnings
warnings.filterwarnings("ignore")

## **Load Datasets**

In [7]:
true = pd.read_csv("True.csv")
true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [8]:
fake = pd.read_csv("Fake.csv")
fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [9]:
true.shape, fake.shape

((21417, 4), (23481, 4))
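
The two row counts are enough to work out the class balance and the sizes of the 80/20 split performed later in this notebook. The sketch below is plain arithmetic; it assumes scikit-learn's behavior of rounding the test size up and giving the remainder to the training set.

```python
import math

n_true, n_fake = 21417, 23481          # row counts reported by .shape above
n_total = n_true + n_fake

share_true = n_true / n_total
print(f"total rows: {n_total}, share of true articles: {share_true:.3f}")

# With test_size=0.20 the test set is rounded up and training gets the rest.
test_size = 0.20
n_test = math.ceil(test_size * n_total)
n_train = n_total - n_test
print(f"train rows: {n_train}, test rows: {n_test}")
```

The training row count can be compared with the "Number of data points in the train set" line in the training log at the end of the notebook.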

## **Prepare data for training**

In [10]:
true['category'] = 1
fake['category'] = 0

data = pd.concat([true,fake])
data.head()

Unnamed: 0,title,text,subject,date,category
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017",1
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017",1
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017",1
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017",1
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017",1


In [11]:
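# Join the fields with spaces so the last word of one field does not fuse with the first word of the next
data["text"] = data["title"] + " " + data["text"] + " " + data["subject"]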
data = data[["text", "category"]]
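
When text columns are concatenated, the separator matters: without one, the last word of one field fuses with the first word of the next and becomes a token that appears nowhere else. A minimal sketch with a made-up row (the field values are illustrative only):

```python
# Made-up example row; the values are illustrative only.
title = "Budget fight looms"
body = "WASHINGTON (Reuters) - The head of a committee"
subject = "politicsNews"

fused = title + body + subject                # no separator between fields
spaced = " ".join([title, body, subject])     # single space between fields

print(fused.split()[:4])
print(spaced.split()[:4])
```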

In [12]:
def preprocessing_text(text):
    """Lower-case the text, drop stop words and punctuation tokens, then stem each token."""
    tokens = text.lower().split()
    tokens = [token for token in tokens if token not in STOPLIST]
    tokens = [token for token in tokens if token not in SYMBOLS]
    tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(tokens)
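
Because the function tokenizes with `str.split()`, punctuation stays attached to words, and the `SYMBOLS` filter removes only tokens that consist of a single punctuation character. One possible refinement, shown as a sketch and not as the notebook's method, is to delete punctuation with `str.translate` before splitting. The helper name `strip_punctuation` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Lower-case the text, delete punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()

sample = "As U.S. budget fight looms, Republicans flip - again!"
print(sample.lower().split())        # punctuation remains attached to words
print(strip_punctuation(sample))     # punctuation removed before tokenizing
```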

In [13]:
# Text pre-processing
text_features = data['text']
text_features = text_features.apply(preprocessing_text)

In [14]:
# Split data in train and test
train_x, test_x, train_y, test_y = train_test_split(text_features, data['category'], test_size=0.20, random_state=0)

# Perform text feature transformation on train data
vectorizer = TfidfVectorizer(max_features=5000)
transformed_train_features = vectorizer.fit_transform(train_x)
transformed_valid_features = vectorizer.transform(test_x)
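
The vectorizer is fitted on the training text only and then reused on the test text, so the test set cannot influence the vocabulary. The toy sketch below mimics that fit/transform contract with plain token counts. It is a simplified stand-in, not `TfidfVectorizer` itself: it applies no TF-IDF weighting, and `fit_vocabulary` and `transform` are hypothetical helper names.

```python
from collections import Counter

def fit_vocabulary(train_docs, max_features):
    """Keep the max_features most frequent training tokens, mapped to column indices."""
    counts = Counter(token for doc in train_docs for token in doc.split())
    top_tokens = sorted(token for token, _ in counts.most_common(max_features))
    return {token: index for index, token in enumerate(top_tokens)}

def transform(docs, vocabulary):
    """Count vocabulary tokens per document; tokens unseen during fit are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.split():
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

train_docs = ["trump budget vote", "budget vote senat", "senat vote"]
test_docs = ["budget vote shutdown"]   # "shutdown" never appears in training

vocab = fit_vocabulary(train_docs, max_features=3)
print(vocab)
print(transform(test_docs, vocab))
```

The test document's unseen token simply gets no column, which is also what happens to out-of-vocabulary words when a fitted vectorizer transforms new text.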

## **Install GPU-enabled LightGBM**

### **Download the LightGBM source code**

In [15]:
!git clone --recursive https://github.com/Microsoft/LightGBM

Cloning into 'LightGBM'...
remote: Enumerating objects: 161, done.
remote: Counting objects: 100% (161/161), done.
remote: Compressing objects: 100% (120/120), done.
remote: Total 21120 (delta 105), reused 68 (delta 41), pack-reused 20959
Receiving objects: 100% (21120/21120), 16.67 MiB | 26.67 MiB/s, done.
Resolving deltas: 100% (15451/15451), done.
Submodule 'include/boost/compute' (https://github.com/boostorg/compute) registered for path 'external_libs/compute'
Submodule 'eigen' (https://gitlab.com/libeigen/eigen.git) registered for path 'external_libs/eigen'
Submodule 'external_libs/fast_double_parser' (https://github.com/lemire/fast_double_parser.git) registered for path 'external_libs/fast_double_parser'
Submodule 'external_libs/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'external_libs/fmt'
Cloning into '/content/LightGBM/external_libs/compute'...
remote: Enumerating objects: 21728, done.        
remote: Total 21728 (delta 0), reused 0 (delta 0), pac

## **Change directory to compile LightGBM for GPU**

### Check out LightGBM and compile it with GPU support
https://lightgbm.readthedocs.io/en/latest/GPU-Tutorial.html

In [16]:
%cd /content/LightGBM

/content/LightGBM


In [17]:
!mkdir build

### **Configure the build with CMake (GPU enabled) and compile with make**

In [18]:
!cmake -DUSE_GPU=1 .
!make -j$(nproc)

-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "4.5") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.2") 
-- OpenCL include directory: /usr/include
-- Boost version: 1.65.1
-- Found the following Boost libraries:
--
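
After `make` finishes, the compiled shared library should exist somewhere in the source tree. A quick check is sketched below, assuming the Linux library file name `lib_lightgbm.so` and the `/content/LightGBM` checkout used above; `find_built_library` is a hypothetical helper.

```python
from pathlib import Path

def find_built_library(repo_root, name="lib_lightgbm.so"):
    """Return all paths under repo_root whose file name matches the shared library."""
    return sorted(Path(repo_root).rglob(name))

# Usage in this notebook; an empty list would suggest the build did not complete:
# print(find_built_library("/content/LightGBM"))
```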

### **Get python-pip to install packages**

In [19]:
!sudo apt-get -y install python-pip

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libpython-all-dev python-all python-all-dev python-asn1crypto
  python-cffi-backend python-crypto python-cryptography python-dbus
  python-enum34 python-gi python-idna python-ipaddress python-keyring
  python-keyrings.alt python-pip-whl python-pkg-resources python-secretstorage
  python-setuptools python-six python-wheel python-xdg
Suggested packages:
  python-crypto-doc python-cryptography-doc python-cryptography-vectors
  python-dbus-dbg python-dbus-doc python-enum34-doc python-gi-cairo
  gnome-keyring libkf5wallet-bin gir1.2-gnomekeyring-1.0 python-fs
  python-gdata python-keyczar python-secretstorage-doc python-setuptools-doc
The following NEW packages will be installed:
  libpython-all-dev python-all python-all-dev python-asn1crypto
  python-cffi-backend python-crypto python-cryptography python-dbus
  python-enum34 python-gi python-

In [20]:
!sudo -H pip install setuptools pandas numpy scipy scikit-learn -U

Collecting setuptools
  Downloading https://files.pythonhosted.org/packages/15/0e/255e3d57965f318973e417d5b7034223f1223de500d91b945ddfaef42a37/setuptools-53.0.0-py3-none-any.whl (784kB)

## **Install Python Interface for LightGBM**

### **Change directory to python-package**

In [21]:
%cd /content/LightGBM/python-package

/content/LightGBM/python-package


### **Python setup for LightGBM**

In [22]:
!sudo python setup.py install --precompile

running install
running build
running build_py
INFO:root:Generating grammar tables from /usr/lib/python3.6/lib2to3/Grammar.txt
INFO:root:Generating grammar tables from /usr/lib/python3.6/lib2to3/PatternGrammar.txt
creating build
creating build/lib
creating build/lib/lightgbm
copying lightgbm/callback.py -> build/lib/lightgbm
copying lightgbm/dask.py -> build/lib/lightgbm
copying lightgbm/__init__.py -> build/lib/lightgbm
copying lightgbm/compat.py -> build/lib/lightgbm
copying lightgbm/sklearn.py -> build/lib/lightgbm
copying lightgbm/engine.py -> build/lib/lightgbm
copying lightgbm/basic.py -> build/lib/lightgbm
copying lightgbm/plotting.py -> build/lib/lightgbm
copying lightgbm/libpath.py -> build/lib/lightgbm
running egg_info
creating lightgbm.egg-info
writing lightgbm.egg-info/PKG-INFO
writing dependency_links to lightgbm.egg-info/dependency_links.txt
writing requirements to lightgbm.egg-info/requires.txt
writing top-level names to lightgbm.egg-info/top_level.txt
writing manifest f
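
Once the install step completes, it can be worth confirming that Python can locate the package before moving on. A small sketch using only the standard library; `is_importable` is a hypothetical helper name.

```python
import importlib.util

def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

print("lightgbm importable:", is_importable("lightgbm"))
```

This only confirms that a `lightgbm` package is on the path. Whether it was built with GPU support shows up later, when training with `device_type` set to `gpu` either starts or raises an error.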

## **GPU LightGBM Trainer**

* When training on a GPU, it is advisable to set `max_bin` to 63 rather than the default 255, because it can speed up training significantly without noticeably affecting accuracy.
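
LightGBM trains on histograms: each feature value is replaced by a small integer bin index, and `max_bin` caps how many bins a feature may use. The sketch below uses simple equal-width binning, which is an assumption made for brevity and not LightGBM's actual bin-finding algorithm, to show that 63 bins produce far fewer distinct indices (and therefore smaller histograms) than 255.

```python
def to_bins(values, max_bin):
    """Map each value to an integer bin index in [0, max_bin - 1] using equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / max_bin or 1.0   # fall back to 1.0 for constant features
    return [min(int((v - low) / width), max_bin - 1) for v in values]

values = [i / 999 for i in range(1000)]     # synthetic feature values in [0, 1]

for max_bin in (63, 255):
    bins = to_bins(values, max_bin)
    print(f"max_bin={max_bin}: {len(set(bins))} distinct bin indices, largest index {max(bins)}")
```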

In [23]:
import lightgbm as lgb

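lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.01,
    'num_leaves': 16,
    'max_depth': 4,
    'max_bin': 63,             # fewer bins, as recommended above for GPU training
    'subsample': 0.6,          # only takes effect when subsample_freq > 0
    'colsample_bytree': 0.4,
    'verbose': 1,
    'seed': 1983,
    'device_type': 'gpu',      # use the GPU trainer compiled above
}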
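
To check whether the GPU actually helps on a given dataset, one option is to train a second time with identical settings on the CPU and compare wall-clock times. A minimal sketch of deriving the CPU configuration follows; `gpu_params` is a shortened stand-in for `lgb_params`, and `cpu_params` is a hypothetical name.

```python
# Shortened stand-in for the lgb_params dictionary defined above.
gpu_params = {
    "objective": "binary",
    "metric": "auc",
    "max_bin": 63,
    "device_type": "gpu",
}

# Copy the settings and switch only the device, so both runs are otherwise identical.
cpu_params = {**gpu_params, "device_type": "cpu"}

changed = {key for key in gpu_params if gpu_params[key] != cpu_params[key]}
print(changed)
```

Building a new dictionary instead of mutating the original keeps the GPU configuration intact for later cells.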

In [24]:
# Wrap the TF-IDF matrices in LightGBM datasets
trn_data = lgb.Dataset(transformed_train_features, label=train_y)
val_data = lgb.Dataset(transformed_valid_features, label=test_y)
num_round = 50
start_time = time.time()

# early_stopping_rounds equals num_round here, so early stopping cannot trigger
lgb_clf = lgb.train(lgb_params,
                    trn_data,
                    num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=50,
                    early_stopping_rounds=50)

end_time = time.time()
time_taken = end_time - start_time
print(f"Time taken to train a lightgbm GPU model is {time_taken}")

[LightGBM] [Info] Number of positive: 17107, number of negative: 18811
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 305077
[LightGBM] [Info] Number of data points in the train set: 35918, number of used features: 5000
[LightGBM] [Info] Using GPU Device: Tesla P100-PCIE-16GB, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 9 dense feature groups (0.41 MB) transferred to GPU in 0.001317 secs. 1 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.476279 -> initscore=-0.094954
[LightGBM] [Info] Start training from score -0.094954
Training until validation scores don't improve for 50 rounds
[50]	training's auc: 1	valid_1's auc: 0.999998
Did not meet early stopping. Best iteration is:
[40]	training's auc: 1	valid_1's auc: 0.999999
Time taken to train a lightgbm GPU model is 13.852726697921753
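
The `BoostFromScore` line in the log can be reproduced by hand: for the binary objective, the starting score is the log-odds of the average training label. The sketch below uses the positive and negative counts from the first log line and should agree with the logged `pavg` and `initscore` up to rounding.

```python
import math

n_pos, n_neg = 17107, 18811                 # counts from the first log line

pavg = n_pos / (n_pos + n_neg)              # average label in the training set
initscore = math.log(pavg / (1.0 - pavg))   # log-odds of the positive class

print(f"pavg={pavg:.6f} -> initscore={initscore:.6f}")
```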
