<a href="https://colab.research.google.com/github/harshv47/r-india-Flair-Detector/blob/master/Jupyter%20Notebooks/r_india_flair_predictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing Packages
These imports come first because later cells call functions from these libraries.

I generally import the usual libraries up front and import model- or preprocessing-specific libraries in the cells that use them, which makes those cells easier to copy and test elsewhere.

The list of libraries imported here is also largely the same across all my projects.

In [0]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
import os
import datetime
from scipy import stats

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Praw and libs for downloading the dataset
To download the dataset from Reddit I use PRAW, a Python wrapper for the Reddit API. I had already built a project with PRAW, so I recycled many parts of that code.

PRAW has built-in methods to access and download various parts of a subreddit, several of which are used here.

In [0]:
!pip3 install praw
import praw
import os
import datetime



## Set-up for Praw module
PRAW requires the username and password of a Reddit account as well as a client ID, a client secret and a user_agent name; the latter three belong to an app that has to be registered as a script in the Reddit preferences.

I re-used the credentials from my previous project ([reddit_back](https://github.com/harshv47/Reddit-Background)) in this one.

This function returns an object that can then be used to crawl Reddit; by convention it is named reddit.

In [0]:
def setUp(usrnm, passwd, cl_id, cl_sc):
    print('Getting Reddit Data...')
    reddit = praw.Reddit(client_id=cl_id, client_secret=cl_sc, user_agent='reddit_back', username=usrnm, password=passwd)

    return reddit

## Main Extract data function
This is the function that crawls r/india and collects the dataset.

The names of all the flairs are saved in the *flairs* list, and the column names of the eventual dataset are stored in *topics*.
Then we navigate to r/india using the *subreddit* method.

While collecting data, I loop over *flairs* and, for each one, search r/india with the flair name as the query and download up to 500 of the resulting posts. Capping every flair at the same limit was meant to reduce bias, so the dataset contains a similar number of posts for each flair, with a few exceptions.

For comments, I used only top-level comments: these are the parent comments of each thread on Reddit. I concatenate all of a post's top-level comments into a single column, which is preprocessed later.

Finally, everything is written to a CSV file so that it can be reused without downloading the data every time.

In [0]:
def extractData(reddit):
    reddit.read_only = True
    subreddit = reddit.subreddit('india')

    flairs = ["Politics",
            "Non-Political",
            "[R]eddiquette",
            "AskIndia",
            "Policy/Economy",
            "Business/Finance",
            "Science/Technology",
            "Scheduled",
            "Sports",
            "Food",
            "AMA",
            "Photography",
            "CAA-NRC-NPR",
            "Coronavirus"]

    topics = { "author":[],
                    "body":[],
                    "comments":[],
                    "comms_num":[],
                    "flair": [],
                    "id": [],
                    "score":[],
                    "title": [],
                    "url": [],
                    "created":[]}
    print('Collecting Flair Data...')
    flair_count = 1
    for flair in flairs:
        submissions = subreddit.search(flair, limit=500)
        for submission in submissions:
            
            topics["flair"].append(flair)
            topics["title"].append(submission.title)
            topics["score"].append(submission.score)
            topics["id"].append(submission.id)
            topics["url"].append(submission.url)
            topics["comms_num"].append(submission.num_comments)
            topics["created"].append(submission.created)
            topics["body"].append(submission.selftext)
            topics["author"].append(submission.author)
            
            #   Remove comments that are accessed by clicking on More Comments, limit = 0 implies no clicking
            submission.comments.replace_more(limit=0)
            comment = ''
            #   Only using top level comments
            for top_level_comment in submission.comments:
                comment = comment + ' ' + top_level_comment.body
            topics["comments"].append(comment)
        print('Collected flair: ',flair_count,' ',flair,' out of 12')
        flair_count = flair_count + 1

    topics_df = pd.DataFrame(topics)
    #   'created' is a Unix timestamp; it is written to the CSV unchanged
    print('Done Collecting Data, writing as csv')
    topics_df.to_csv('dataset.csv', index=False)
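
The *created* column holds Unix timestamps (seconds since 1970-01-01 UTC). If a readable date is ever needed, the conversion could look like the minimal sketch below. It uses only the standard library, and the sample value is illustrative rather than taken from a specific post.

```python
from datetime import datetime, timezone

# Illustrative Unix timestamp of the kind stored in the 'created' column
created = 1587063000.0

# Interpret the value as seconds since the epoch, in UTC
created_utc = datetime.fromtimestamp(created, tz=timezone.utc)
print(created_utc.isoformat())
```

Passing an explicit time zone avoids results that depend on the machine's local settings.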

## Calling all praw functions and credentials
In this block all the credentials are stored in one place, and then the functions defined above are called.

Note that the running time of the *extractData* function is fairly high. I actually downloaded the data with almost the same code, albeit in a single Python file stored at **/dataset/procure.py**. That file works a bit differently: it asks for the credentials once and stores them on the local host. That method is not shown here because Colab cannot read files from my local machine directly, and getting input inline in Colab is a bit clumsy.

In [57]:

client_id = '#'
client_secret = '#'
username = '$'
password = '$'

#   Set up the Reddit client:
reddit = setUp(username, password, client_id, client_secret)
#   Main function call:
extractData(reddit)
print('Done')

Getting Reddit Data...
Collecting Flair Data...
Collected flair:  1   Politics  out of 12
Collected flair:  2   Non-Political  out of 12
Collected flair:  3   [R]eddiquette  out of 12
Collected flair:  4   AskIndia  out of 12
Collected flair:  5   Policy/Economy  out of 12
Collected flair:  6   Business/Finance  out of 12
Collected flair:  7   Science/Technology  out of 12
Collected flair:  8   Scheduled  out of 12
Collected flair:  9   Sports  out of 12
Collected flair:  10   Food  out of 12
Collected flair:  11   AMA  out of 12
Collected flair:  12   Photography  out of 12
Collected flair:  13   CAA-NRC-NPR  out of 12
Collected flair:  14   Coronavirus  out of 12
Done Collecting Data, writing as csv
Done
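
As an alternative to typing the credentials into the notebook, they could be read from environment variables. This is only a sketch of that idea, not what **procure.py** does: the helper name `load_credentials` and the four variable names are hypothetical, and the returned keys are chosen to match the parameter names of *setUp*.

```python
import os

def load_credentials(env=os.environ):
    """Collect the four values setUp() expects from environment variables."""
    # Hypothetical variable names; map each setUp() parameter to one variable
    names = {
        "usrnm": "REDDIT_USERNAME",
        "passwd": "REDDIT_PASSWORD",
        "cl_id": "REDDIT_CLIENT_ID",
        "cl_sc": "REDDIT_CLIENT_SECRET",
    }
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {missing}")
    return {param: env[var] for param, var in names.items()}

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {
    "REDDIT_USERNAME": "user",
    "REDDIT_PASSWORD": "secret",
    "REDDIT_CLIENT_ID": "abc",
    "REDDIT_CLIENT_SECRET": "xyz",
}
creds = load_credentials(demo_env)
print(sorted(creds))
```

With the real variables set, the call would be `reddit = setUp(**load_credentials())`.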


## Importing saved dataset
The dataset is imported as a DataFrame using the pandas *read_csv* function.

I also checked it once with the *head()* method of the DataFrame to see whether the dataset looks correct.

In [0]:
dataset_df = pd.read_csv('dataset.csv')
dataset_df.head()

Unnamed: 0,author,body,comments,comms_num,flair,id,score,title,url,created
0,aaluinsonaout,I don't know if it is the same situation in ot...,Our society thrives on abuse of power. We let...,82,Politics,g2ct57,405,A polite request to all Indians here,https://www.reddit.com/r/india/comments/g2ct57...,1587063000.0
1,HairLikeWinterFire,TLDR: My (unqualified) opinion is that dalit p...,"I don't really get along with ""government upl...",20,Politics,g76o5f,30,The real loser in India's errupting Islamaphob...,https://www.reddit.com/r/india/comments/g76o5f...,1587756000.0
2,chillinvillain122,First of all let me start by saying it was stu...,Our country is just too far in at the moment ...,73,Politics,futac9,194,Pitting a community against a political party ...,https://www.reddit.com/r/india/comments/futac9...,1586034000.0
3,aaluinsonaout,,This looks like an IIPM ad 1. Where did they ...,146,Politics,ff8sth,738,A new political party gave a full front page a...,https://i.redd.it/yjo9wpy38el41.jpg,1583678000.0
4,hipporama,,"Well, Some people really deserve to die. ~~/s...",67,Politics,fpaj1w,407,Hit by backlash over posts on lack of medical ...,https://theprint.in/india/hit-by-backlash-over...,1585254000.0


## Additional Info #1
I used the *shape* attribute of the DataFrame to see its number of rows and features, and the *dtypes* attribute to see the data type of each feature in the dataset.


In [0]:
print(dataset_df.shape)
print(dataset_df.dtypes)

(2883, 10)
author        object
body          object
comments      object
comms_num      int64
flair         object
id            object
score          int64
title         object
url           object
created      float64
dtype: object


## Additional Info #2
Here we see that not all flairs have the same number of examples, with Coronavirus having the highest number, likely due to the recent pandemic.

I printed this because, had we gotten 500 examples for each flair, we would have 7000 examples across all 14 flairs. But the cell above shows that we only have 2883 examples.

In [0]:
dataset_df['flair'].value_counts()

Coronavirus           248
Politics              247
Food                  242
Scheduled             234
Business/Finance      233
Sports                231
AskIndia              231
Photography           222
Science/Technology    221
Policy/Economy        220
Non-Political         216
AMA                   213
CAA-NRC-NPR           107
[R]eddiquette          18
Name: flair, dtype: int64

# Preprocessing

## pip Requirements
I need to use *tldextract* and *catboost* but they do not come by default in colab, so I have to install them using pip.

In [0]:
!pip3 install tldextract
!pip install catboost



## Getting Domains of urls
A URL in itself carries little meaning, but the domain it comes from can be valuable.

This function extracts domain names from URLs.

In [0]:
import tldextract
def getDomain(url):
  ext = tldextract.extract(url)
  return ext.domain

## Handling illegal words
Body, comments and title may contain words and symbols that are not useful to us.

So here I used Beautiful Soup to strip HTML markup, and regex to replace these symbols (`[/(){}\[\]\|@,;]`) with a space and to remove every character that matches `[^0-9a-z #+_]`.

I also remove stopwords.

In [0]:
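from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

# Compile the patterns and build the stopword set once, not on every call
SPACE_SUB = re.compile(r'[/(){}\[\]\|@,;]')      # symbols replaced by a space
REMOVE_BAD_SYM = re.compile(r'[^0-9a-z #+_]')    # characters removed outright
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    text = BeautifulSoup(text, "lxml").text      # strip any HTML markup
    text = text.lower()
    text = SPACE_SUB.sub(' ', text)
    text = REMOVE_BAD_SYM.sub('', text)
    # Drop stopwords and rejoin the remaining words with single spaces
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text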
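
To make the effect of the two patterns concrete, here is a small standalone sketch that applies the same regex steps to a made-up sentence. It skips the Beautiful Soup step and uses a tiny hypothetical stopword set instead of the NLTK list, so it illustrates the regex logic only.

```python
import re

SPACE_SUB = re.compile(r'[/(){}\[\]\|@,;]')    # these symbols become a space
REMOVE_BAD_SYM = re.compile(r'[^0-9a-z #+_]')  # anything else outside this set is dropped
DEMO_STOPWORDS = {"the", "is", "a", "to"}      # hypothetical subset, for illustration only

def clean_text_demo(text):
    text = text.lower()
    text = SPACE_SUB.sub(' ', text)
    text = REMOVE_BAD_SYM.sub('', text)
    return ' '.join(word for word in text.split() if word not in DEMO_STOPWORDS)

sample = "The C++ tag (#python) is a link to r/india!"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Note that `+`, `#` and `_` survive, so tokens such as `c++` and `#python` stay intact, while `r/india` is split into two words.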

## Convert to string
Initially the body, comments and title columns have the generic object dtype (and may contain missing values), so they have to be converted to strings for the *clean_text()* function to run. Later preprocessing also needs them to be strings.

This function converts them to string (str).

In [0]:
def to_str(text):
  return str(text)

## Calling previous functions
In this block the previous functions are called on the features that have to be converted to string and cleaned.

The url is also converted to its domain name.

All of this is done using the *apply()* method of a pandas Series.

In [0]:
cols_to_string = ['body', 'comments', 'title']
for col in cols_to_string:
  dataset_df[col] = dataset_df[col].apply(to_str)
  print(col,"->" ,isinstance(dataset_df[col][0], str))
  dataset_df[col] = dataset_df[col].apply(clean_text)

dataset_df['url'] = dataset_df['url'].apply(getDomain)

body -> True


comments -> True
title -> True


## Breaking dataset_df into X_train and Y_train
The dataset eventually has to be split into X and Y. The Y part contains the *flair* column, as that is our target, and for the X part we drop *flair*, *id* and *created*.

*id* is dropped because it carries no meaningful information: Reddit ids are just part of a post's permalink and serve to identify the post uniquely sitewide.

*created* is the Unix timestamp of the post's creation time. I first trained with it, but it had an insignificant bearing on the final classification, so I decided to remove it. It can be included.

In [0]:
Y_train = dataset_df['flair']
X_train = dataset_df.drop(['flair', 'id', 'created'], axis = 1)

print("X_train", X_train.shape)
print("Y_train", Y_train.shape)

X_train (2883, 7)
Y_train (2883,)


## Normalization for Number of Comments and score
The *score* and *comms_num* columns contain numbers that can range from 0 to very large values, so we have to normalize them.

I wrote the normalization by hand. Subtracting the mean and dividing by the range (max minus min) gives values in (-1, 1). Adding one shifts the range to (0, 2), and dividing by 2 brings the effective range to (0, 1).

In [0]:
# Normalization for comms_num and score
from sklearn import preprocessing

cols_to_normalize = ['comms_num', 'score']
for col in cols_to_normalize:
  X_train[col] = ((X_train[col] - X_train[col].mean()) / (X_train[col].max() - X_train[col].min()) + 1)/2
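
The arithmetic of this normalization can be checked on a handful of numbers. The sketch below re-implements the same formula on a plain Python list; the function name and the sample values are made up for illustration.

```python
def mean_range_normalize(values):
    """Apply ((x - mean) / (max - min) + 1) / 2 to every value."""
    mean = sum(values) / len(values)
    span = max(values) - min(values)
    return [((v - mean) / span + 1) / 2 for v in values]

scores = [0, 10, 20, 70]  # made-up scores
normalized = mean_range_normalize(scores)
print(normalized)
```

Two properties follow from the formula: a value equal to the mean maps to 0.5, and the normalized values always span a width of exactly 0.5, so they stay strictly inside (0, 1) without covering the whole interval.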


Checking X_train for the previous block

In [0]:
X_train.head()

Unnamed: 0,author,body,comments,comms_num,score,title,url
0,aaluinsonaout,dont know situation countries india seen lot o...,society thrives abuse power let many idiots ab...,0.501207,0.507448,polite request indians,reddit
1,HairLikeWinterFire,tldr unqualified opinion dalit political movem...,dont really get along government uplifting bac...,0.498336,0.496972,real loser indias errupting islamaphobia caste...,reddit
2,chillinvillain122,first let start saying stupid whatever muslims...,country far moment theres turning back best ho...,0.500791,0.501554,pitting community political party fucking stupid,reddit
3,aaluinsonaout,,looks like iipm ad 1 get funds full page ads 2...,0.504171,0.516752,new political party gave full front page ad po...,redd
4,hipporama,,well people really deserve die country fucking...,0.500513,0.507504,hit backlash posts lack medical gear doctors g...,theprint


## Preproccesing for *author* and *url*
The *author* and *url* features can be treated as categorical fields.

Many Reddit users follow and post mainly about one kind of topic, such as politics. This is especially true of r/india, where much of the shared news is political or economic. So there are many users who consistently post under a particular flair.

*url* is also a strong signal. Sites like livelaw or space.com will usually map to a single *flair*; it is highly unlikely that space.com will carry political news. This is not always the case, e.g. theprint could appear under political, non-political and other flairs, but it is an important feature nonetheless.

Therefore, I used *get_dummies()* and deleted the original columns. I could have used *OneHotEncoder* for the same purpose; it would not matter much. *get_dummies* creates one additional column per class, holding 1 if the row belongs to that class and 0 otherwise.

Also, due to an [issue](https://stackoverflow.com/questions/58639104/duplicate-columns-from-pandas-get-dummies), some columns in author are repeated. This can create problems later on, so I removed the duplicated columns.

*X_train.columns.duplicated()* returns True for any column name that has already appeared. Since *.loc* keeps only the columns whose mask value is True, we invert the boolean mask with *~* and thereby keep only the first occurrence of each column name.



In [0]:
X_train = pd.concat([X_train, pd.get_dummies(X_train['url'], prefix='url_')], axis=1)
X_train = pd.concat([X_train, pd.get_dummies(X_train['author'], prefix='author_')], axis=1)

X_train = X_train.drop(['url', 'author'], axis=1)
X_train = X_train.loc[:,~X_train.columns.duplicated()]
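
The two steps above can be tried on a toy frame. This is a minimal sketch with made-up data: it one-hot encodes a three-row *url* column, artificially repeats one of the dummy columns, and then removes the repeat with the same inverted mask.

```python
import pandas as pd

demo = pd.DataFrame({"url": ["reddit", "theprint", "reddit"]})

# prefix='url_' plus the default separator '_' yields names like 'url__reddit'
dummies = pd.get_dummies(demo["url"], prefix="url_")
print(list(dummies.columns))

# Simulate a repeated column name, as in the linked issue
wide = pd.concat([dummies, dummies[["url__reddit"]]], axis=1)
mask = wide.columns.duplicated()  # True for every repeat after the first occurrence

# Keep only the columns that are NOT flagged as duplicates
deduped = wide.loc[:, ~mask]
print(list(deduped.columns))
```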

## Checking for duplicates
I check for any duplicate features. *value_counts()* sorts in descending order by default, so the first entry is the largest count. Since it is 1, there are no duplicate columns.

In [0]:
X_train.columns.T.value_counts()

author__akimera             1
author__bhodrolok           1
author__DisposableMAYBE     1
author__kanchudeep          1
author__teninchclitoris     1
                           ..
author__ExaltFibs24         1
author__nou_kar             1
author__aguyfrominternet    1
url__weather                1
author__xx_yariel_xx        1
Length: 2031, dtype: int64

## Creating a Corpus
The body, comments and title features are still just strings.

No classifier can take raw strings as input. They must be transformed into a numeric form that a machine (classifier) can read.

A very good way of doing that is to apply TF-IDF. To do that we must first build a corpus. Our corpus will consist of all the words from body, comments and title, so we create a new Series by concatenating all three features.

Also, since I will be adding TF-IDF features, these three features need to be removed.

In [0]:
corpus = X_train['body'] + X_train['comments'] + X_train['title']
corpus.head()
X_train = X_train.drop(['body', 'comments', 'title'], axis=1)
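
One detail worth knowing about the `+` concatenation: it joins the strings without any separator, so the last word of one field fuses with the first word of the next. The sketch below shows the effect on made-up strings and a possible variant that joins with a space; whether the fused tokens matter for the final scores was not measured here.

```python
# Made-up, already-cleaned field values
body, comments, title = "polite request", "society thrives", "indians"

glued = body + comments + title              # same style of concatenation as the cell above
spaced = " ".join([body, comments, title])   # variant with a space separator

print(glued.split())
print(spaced.split())
```

With pandas Series, the spaced variant would be `X_train['body'] + ' ' + X_train['comments'] + ' ' + X_train['title']`.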

## Applying TF-IDF
I built the corpus on the train set and then used *fit_transform* to turn the text of the three features into TF-IDF vectors.

The resulting features were then added to *X_train* by concatenation.

At the end, I checked that it worked by printing the shapes; a peek at *X_train* confirms that it did.

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf_vect = TfidfVectorizer(max_features=10000, stop_words='english', min_df=1, binary=0, use_idf=1, smooth_idf=0, sublinear_tf=1)

tfidf_vect_vectors = tfidf_vect.fit_transform(corpus.values.astype('U'))

# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
col_names = ['tfidf_' + s for s in tfidf_vect.get_feature_names_out()]

col_tfidf_df = pd.DataFrame(tfidf_vect_vectors.todense(), columns=col_names)


X_train = pd.concat([X_train, col_tfidf_df], axis=1)

print(X_train.shape, Y_train.shape)
X_train.head()

(2883, 12028) (2883,)


Unnamed: 0,comms_num,score,url__164.100.47.4,url__500px,url__aljazeera,url__altnews,url__ampproject,url__aninews,url__article-14,url__asiasociety,url__baatbiharki,url__bangaloremirror,url__barandbench,url__bbc,url__behance,url__betootaadvocate,url__bloomberg,url__bloombergquint,url__bnewsindia,url__business-standard,url__businessinsider,url__businesstoday,url__businessweek,url__caravanmagazine,url__cbc,url__circleofcricket,url__cisce,url__cnbc,url__cnbctv18,url__cnn,url__convozine,url__cpim,url__crefacto,url__dailyo,url__deccanchronicle,url__deccanherald,url__delhishelterboard,url__dnaindia,url__dw,url__ecimpey,...,tfidf_yep,tfidf_yes,tfidf_yesterday,tfidf_yesterdays,tfidf_yield,tfidf_yields,tfidf_yo,tfidf_yoga,tfidf_yogendra,tfidf_yogi,tfidf_yojana,tfidf_york,tfidf_you1,tfidf_youd,tfidf_youll,tfidf_young,tfidf_younger,tfidf_youngsters,tfidf_youre,tfidf_yourstory,tfidf_yourstorycom,tfidf_youth,tfidf_youths,tfidf_youtube,tfidf_youve,tfidf_yr,tfidf_yrs,tfidf_yt,tfidf_yup,tfidf_zealand,tfidf_zee,tfidf_zero,tfidf_zerodha,tfidf_zerorating,tfidf_zindagi,tfidf_zomato,tfidf_zombies,tfidf_zone,tfidf_zones,tfidf_zyada
0,0.501207,0.507448,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.053323,0.0,0.020054,0.0,0.0,0.0,0.0,0.0,0.026012,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.498336,0.496972,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0.0,0.026478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026948,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.500791,0.501554,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.077111,0.0,0.0,0.035774,0.0,0.0,0.0,0.0,0.043516,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.504171,0.516752,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0.0,0.025904,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.500513,0.507504,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
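
For reference, this is how I understand the weighting that scikit-learn documents for the settings used above (`sublinear_tf=1`, `smooth_idf=0`, and the default L2 normalization). The symbols are my own notation: $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{count}(t,d)$ the raw count of $t$ in document $d$.

$$
\mathrm{tf}(t,d) = 1 + \ln \mathrm{count}(t,d) \quad \text{for } \mathrm{count}(t,d) > 0,
\qquad
\mathrm{idf}(t) = \ln\frac{n}{\mathrm{df}(t)} + 1
$$

$$
w(t,d) = \frac{\mathrm{tf}(t,d)\,\mathrm{idf}(t)}{\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t',d)\,\mathrm{idf}(t')\bigr)^{2}}}
$$

Terms that do not occur in a document get weight 0, which is why most entries in the table above are 0.0.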


## Splitting dataset for training and cross-eval
For cross-eval (hold-out validation), the dataset must be split into two parts: one for training and one for evaluation.
I held out 20% of the dataset for cross-eval.



In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test_cv, Y_train, Y_test_cv = train_test_split(X_train, Y_train, test_size=0.2, random_state=10)

Value_counts for Cross-eval Y set

In [0]:
Y_test_cv.value_counts()

Politics              65
Photography           53
Science/Technology    50
Business/Finance      48
Sports                46
Policy/Economy        46
Food                  45
Scheduled             44
Coronavirus           41
AMA                   40
Non-Political         39
AskIndia              36
CAA-NRC-NPR           21
[R]eddiquette          3
Name: flair, dtype: int64

# Models

## Decision Tree
Decision trees are a tree-based classification algorithm. They are simple but quite powerful on classification problems.

In [0]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=10)
dt.fit(X_train, Y_train)
dt.score(X_test_cv, Y_test_cv)

0.6603119584055459

## Random Forest
I used Random Forest because, as an ensemble learning algorithm, it is well suited to this type of classification, since models on this kind of dataset overfit quite easily.

Random Forests are much less prone to overfitting than a single tree, and their testing performance does not generally decrease (due to overfitting) as the number of trees increases.

That is why I get a better score than with the Decision Tree.

In [0]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=10)
rf.fit(X_train, Y_train)
rf.score(X_test_cv, Y_test_cv)

0.7435008665511266

## SGDClassifier
This is Logistic Regression optimized by stochastic gradient descent.

As we can see, the accuracy is lower than that of Random Forest, so it is a little worse for this task.

In [0]:
from sklearn.linear_model import SGDClassifier

# loss='log_loss' is the name on scikit-learn >= 1.1; older releases call it loss='log'
sgd = SGDClassifier(penalty='l2', loss='log_loss', random_state=10)
sgd.fit(X_train, Y_train)
sgd.score(X_test_cv, Y_test_cv)

0.6811091854419411

## Linear SVC
I have also used a linear SVC.

Generally, SVMs and their derivatives perform quite well on text classification problems. However, we find that its score is lower than that of LR. I surmise that this is caused by the *X_train* matrix being quite sparse due to *get_dummies* and *TF-IDF*. The relatively longer training time is consistent with this.

In [0]:
from sklearn import svm
svm_linear = svm.SVC(kernel='linear',random_state = 10).fit(X_train, Y_train)
svm_linear.score(X_test_cv, Y_test_cv)

0.6707105719237435

## Code for Ensemble
This code takes the mean of the predicted class probabilities across all the models and then maps the highest mean probability to the corresponding prediction class.

It works with any classification algorithm that provides a *predict_proba* method. Notably, SGD with hinge loss does not have this method. And to enable it in svm (for Linear SVC), one has to pass the parameter *probability=True*.

In [0]:
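def ensemble_predictions(members, n_members, testX):
  # class probabilities from every model, shape (n_models, n_samples, n_classes)
  # n_members is kept so that existing calls still work, but it is not used
  yhats = [model.predict_proba(testX) for model in members]
  yhats = np.array(yhats)
  
  # average the probabilities across ensemble members
  mean_proba = np.mean(yhats, axis=0)
  # argmax across classes, mapped back to labels (assumes every member has the same classes_ order)
  result = [members[0].classes_[np.argmax(row)] for row in mean_proba]
  
  return result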
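
The averaging step is easier to follow on concrete numbers. Below is a minimal sketch with two made-up probability matrices standing in for the outputs of *predict_proba*; the class labels and values are illustrative only.

```python
import numpy as np

classes = np.array(["AskIndia", "Food", "Politics"])

# Made-up probabilities for two samples from two hypothetical models
proba_a = np.array([[0.6, 0.3, 0.1],
                    [0.2, 0.5, 0.3]])
proba_b = np.array([[0.1, 0.2, 0.7],
                    [0.1, 0.6, 0.3]])

# Average over models, then pick the most probable class per sample
mean_proba = np.mean([proba_a, proba_b], axis=0)
predicted = classes[np.argmax(mean_proba, axis=1)]
print(predicted)
```

For the first sample, model A alone would pick AskIndia, but the averaged probabilities favour Politics. This shows how one member can pull the ensemble away from another member's prediction.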
 

## Ensemble
The accuracy of the ensemble is notably lower than that of Random Forest alone, which suggests that sgd pulls the predictions away from those of the Random Forest.

In [0]:
from sklearn.metrics import accuracy_score
accuracy_score(Y_test_cv, ensemble_predictions([rf, sgd], 2, X_test_cv))

0.7261698440207972

## CatBoost
Until now the only ensemble I have used is a bagging one (Random Forest); CatBoost is a boosting ensemble algorithm.

It is very powerful and, when given an evaluation set, keeps track of the best iteration for the task.

However, it has the downside of taking a good amount of time to train and, being a boosting ensemble, it can overfit.

It does provide the best cross-eval score, though that score may be somewhat optimistic, since the cross-eval split is also passed as *eval_set* during training.

In [0]:
from catboost import CatBoostClassifier, Pool, cv
catb=CatBoostClassifier(
    iterations = 200,
    random_seed = 10
    )
catb.fit(X_train,Y_train,eval_set=(X_test_cv, Y_test_cv))
catb.score(X_test_cv,Y_test_cv)

Learning rate set to 0.203045
0:	learn: 2.2797853	test: 2.3260453	best: 2.3260453 (0)	total: 5.42s	remaining: 17m 58s
1:	learn: 2.0845554	test: 2.1274863	best: 2.1274863 (1)	total: 10.1s	remaining: 16m 43s
2:	learn: 1.9057763	test: 1.9407153	best: 1.9407153 (2)	total: 14.8s	remaining: 16m 10s
3:	learn: 1.7448231	test: 1.8057989	best: 1.8057989 (3)	total: 19.4s	remaining: 15m 51s
4:	learn: 1.6570032	test: 1.7052055	best: 1.7052055 (4)	total: 24.2s	remaining: 15m 44s
5:	learn: 1.5751966	test: 1.6438718	best: 1.6438718 (5)	total: 29.1s	remaining: 15m 40s
6:	learn: 1.4991610	test: 1.5761534	best: 1.5761534 (6)	total: 33.9s	remaining: 15m 35s
7:	learn: 1.4340519	test: 1.5200192	best: 1.5200192 (7)	total: 38.9s	remaining: 15m 32s
8:	learn: 1.3995713	test: 1.4829034	best: 1.4829034 (8)	total: 43.7s	remaining: 15m 26s
9:	learn: 1.3515166	test: 1.4338222	best: 1.4338222 (9)	total: 48.5s	remaining: 15m 21s
10:	learn: 1.3199014	test: 1.3979953	best: 1.3979953 (10)	total: 53.5s	remaining: 15m 18s


0.7764298093587522

# Saving the model
I saved the best-performing models using pickle.

In [0]:
# Save Model
import pickle
# 'with' closes each file once the model has been written
with open('rf_model.sav', 'wb') as f:
    pickle.dump(rf, f)
with open('catb_model.sav', 'wb') as f:
    pickle.dump(catb, f)
# To load a model use: with open(filename, 'rb') as f: model = pickle.load(f)
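
A full save-and-load round trip could look like the sketch below. It uses a plain dictionary as a stand-in for a fitted model and a temporary directory for the file, both of which are assumptions made so that the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable Python object behaves the same way
stand_in_model = {"name": "rf", "random_state": 10}

path = os.path.join(tempfile.mkdtemp(), "demo_model.sav")

# Write the object to disk; the file is closed when the block ends
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Read it back into a new object
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

Pickle files should only be loaded from sources you trust, because unpickling can execute arbitrary code.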