<a href="https://colab.research.google.com/github/panchamdesai777/Hackathons/blob/master/hackathon_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Message Polarity Prediction**

All of us receive a ton of messages and emails on a daily basis. Collectively, that is a lot of data, which can provide useful insights about the messages each of us gets. What if you could know whether a message brings good news or bad news before opening it? In this challenge, we will use machine learning to do exactly that.

![alt text](https://www.machinehack.com/wp-content/uploads/2020/04/wk3-banner_2-1536x864.jpg)

## **Objective**

Given 53 distinguishing factors that help in understanding the polarity (good or bad) of a message, your objective as a data scientist is to build a machine learning model that predicts whether a text message has brought you good news or bad news.

## **Dataset Description**

You are provided with the normalized frequencies of 50 words/emojis (Freq_Of_Word_1 to Freq_Of_Word_50) along with 3 engineered features listed below:

* TotalEmojiCharacters: Total number of individual emoji characters, normalized (e.g. 🙂)
* LengthOFFirstParagraph: Total length of the first paragraph in words, normalized
* StylizedLetters: Total number of letters or characters with a styling element, normalized

* Target Variable: IsGoodNews
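
The column layout described above can be written down and checked in a few lines. This is a minimal sketch: `expected_feature_columns`, `TARGET`, and the one-row `toy_train` frame are illustrative names and placeholder data, not part of the original notebook.

```python
import pandas as pd

# 50 word/emoji frequency columns, named as in the dataset description
word_columns = [f"Freq_Of_Word_{i}" for i in range(1, 51)]
# 3 engineered features, spelled exactly as they appear in the dataset
engineered_columns = ["TotalEmojiCharacters", "LengthOFFirstParagraph", "StylizedLetters"]
expected_feature_columns = word_columns + engineered_columns
TARGET = "IsGoodNews"

# Toy one-row frame standing in for Train.csv (all values are placeholders)
toy_train = pd.DataFrame([[0.0] * 53 + [1]], columns=expected_feature_columns + [TARGET])

# Any name listed here would indicate a mismatch between the description and the file
missing = [c for c in expected_feature_columns + [TARGET] if c not in toy_train.columns]
print(len(expected_feature_columns), missing)
```

Running the same check against the real training frame is a cheap way to catch a renamed or dropped column before modelling.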

## **Dataset Overview**

* Train data
![alt text](https://www.machinehack.com/wp-content/uploads/2020/04/Screenshot-2020-04-28-at-4.29.36-PM-1024x257.png)


* Test data

 ![alt text](https://www.machinehack.com/wp-content/uploads/2020/04/Screenshot-2020-04-28-at-4.29.56-PM-1024x279.png)


In [0]:
#importing all the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style # for styling the graphs
# style.available (to know the available list of styles)
style.use('ggplot') # chosen style
plt.rc('xtick',labelsize=13) # to globally set the tick size
plt.rc('ytick',labelsize=13) # to globally set the tick size
# To print multiple outputs together
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Change column display number during print
pd.set_option('display.max_columns', 500)
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
# To display float with 2 decimal, avoid scientific printing
pd.options.display.float_format = '{:.2f}'.format
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, auc, classification_report, confusion_matrix,
                             f1_score, mean_squared_error, precision_recall_curve,
                             precision_score, recall_score, roc_auc_score, roc_curve)
import itertools
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

In [77]:
from google.colab import files
uploaded = files.upload()

Saving Train.csv to Train (2).csv


In [78]:
#Loading The Dataset
import io
# Read the uploaded CSV bytes into a DataFrame
train_df = pd.read_csv(io.BytesIO(uploaded['Train.csv']))
train_df.head()

Unnamed: 0,Freq_Of_Word_1,Freq_Of_Word_2,Freq_Of_Word_3,Freq_Of_Word_4,Freq_Of_Word_5,Freq_Of_Word_6,Freq_Of_Word_7,Freq_Of_Word_8,Freq_Of_Word_9,Freq_Of_Word_10,Freq_Of_Word_11,Freq_Of_Word_12,Freq_Of_Word_13,Freq_Of_Word_14,Freq_Of_Word_15,Freq_Of_Word_16,Freq_Of_Word_17,Freq_Of_Word_18,Freq_Of_Word_19,Freq_Of_Word_20,Freq_Of_Word_21,Freq_Of_Word_22,Freq_Of_Word_23,Freq_Of_Word_24,Freq_Of_Word_25,Freq_Of_Word_26,Freq_Of_Word_27,Freq_Of_Word_28,Freq_Of_Word_29,Freq_Of_Word_30,Freq_Of_Word_31,Freq_Of_Word_32,Freq_Of_Word_33,Freq_Of_Word_34,Freq_Of_Word_35,Freq_Of_Word_36,Freq_Of_Word_37,Freq_Of_Word_38,Freq_Of_Word_39,Freq_Of_Word_40,Freq_Of_Word_41,Freq_Of_Word_42,Freq_Of_Word_43,Freq_Of_Word_44,Freq_Of_Word_45,Freq_Of_Word_46,Freq_Of_Word_47,Freq_Of_Word_48,Freq_Of_Word_49,Freq_Of_Word_50,TotalEmojiCharacters,LengthOFFirstParagraph,StylizedLetters,IsGoodNews
0,-0.35,2.62,1.25,-0.04,-0.47,-0.35,-0.3,-0.24,-0.32,-0.35,-0.32,-0.64,-0.31,-0.18,-0.19,-0.33,-0.32,-0.36,-0.71,1.07,0.95,-0.13,-0.29,0.76,-0.33,-0.3,-0.21,-0.23,-0.16,-0.23,-0.15,1.31,-0.18,-0.15,-0.24,-0.24,-0.33,-0.06,-0.18,-0.19,-0.13,-0.17,-0.21,-0.12,-0.32,-0.21,-0.08,-0.12,0.08,0.16,-0.03,-0.05,0.22,1
1,-0.35,-0.32,-0.56,-0.04,-0.47,-0.35,-0.3,3.84,-0.32,-0.35,-0.32,-0.64,-0.31,-0.18,-0.19,-0.33,-0.32,-0.36,-0.97,-0.19,-0.71,-0.13,-0.29,-0.2,-0.33,-0.3,-0.21,-0.23,-0.16,-0.23,-0.15,-0.15,-0.18,-0.15,-0.24,-0.24,3.68,-0.06,-0.18,-0.19,-0.13,-0.17,15.21,-0.12,-0.32,-0.21,-0.08,-0.12,-0.15,-0.45,-0.11,-0.2,-0.41,0
2,-0.35,-0.32,-0.56,-0.04,-0.47,-0.35,-0.3,-0.24,-0.32,-0.35,-0.32,-0.64,-0.31,-0.18,-0.19,-0.33,-0.32,-0.36,-0.97,-0.19,-0.71,-0.13,-0.29,-0.2,-0.33,-0.3,-0.21,-0.23,-0.16,-0.23,-0.15,-0.15,-0.18,-0.15,-0.24,-0.24,-0.33,-0.06,-0.18,-0.19,-0.13,-0.17,-0.21,-0.12,-0.32,-0.21,-0.08,-0.12,-0.15,-0.45,-0.11,-0.19,-0.39,0
3,1.21,2.68,1.29,-0.04,0.22,-0.35,-0.3,0.86,-0.32,2.37,-0.32,2.03,-0.31,1.29,-0.19,0.29,-0.32,-0.36,0.13,-0.19,3.98,-0.13,-0.29,-0.2,-0.33,-0.3,-0.21,-0.23,-0.16,-0.23,-0.15,-0.15,-0.18,-0.15,-0.24,-0.24,-0.33,-0.06,-0.18,-0.19,-0.13,-0.17,-0.21,-0.12,-0.32,-0.21,-0.08,-0.12,-0.15,0.34,1.33,2.27,0.6,1
4,-0.35,-0.32,-0.56,-0.04,-0.47,-0.35,-0.3,-0.24,-0.32,-0.35,-0.32,-0.64,-0.31,-0.18,-0.19,-0.33,-0.32,-0.36,-0.97,-0.19,-0.71,-0.13,-0.29,-0.2,-0.33,-0.3,0.1,-0.23,0.64,-0.23,-0.15,-0.15,-0.18,-0.15,-0.24,-0.24,-0.33,-0.06,-0.18,-0.19,-0.13,-0.17,-0.21,-0.12,-0.32,-0.21,-0.08,-0.12,-0.15,0.93,-0.03,-0.11,-0.13,0


In [79]:
from google.colab import files
uploaded = files.upload()

Saving Test.csv to Test (2).csv


In [80]:
#Loading The Dataset
import io
# Read the uploaded CSV bytes into a DataFrame
test_df = pd.read_csv(io.BytesIO(uploaded['Test.csv']))
test_df.head()

Unnamed: 0,Freq_Of_Word_1,Freq_Of_Word_2,Freq_Of_Word_3,Freq_Of_Word_4,Freq_Of_Word_5,Freq_Of_Word_6,Freq_Of_Word_7,Freq_Of_Word_8,Freq_Of_Word_9,Freq_Of_Word_10,Freq_Of_Word_11,Freq_Of_Word_12,Freq_Of_Word_13,Freq_Of_Word_14,Freq_Of_Word_15,Freq_Of_Word_16,Freq_Of_Word_17,Freq_Of_Word_18,Freq_Of_Word_19,Freq_Of_Word_20,Freq_Of_Word_21,Freq_Of_Word_22,Freq_Of_Word_23,Freq_Of_Word_24,Freq_Of_Word_25,Freq_Of_Word_26,Freq_Of_Word_27,Freq_Of_Word_28,Freq_Of_Word_29,Freq_Of_Word_30,Freq_Of_Word_31,Freq_Of_Word_32,Freq_Of_Word_33,Freq_Of_Word_34,Freq_Of_Word_35,Freq_Of_Word_36,Freq_Of_Word_37,Freq_Of_Word_38,Freq_Of_Word_39,Freq_Of_Word_40,Freq_Of_Word_41,Freq_Of_Word_42,Freq_Of_Word_43,Freq_Of_Word_44,Freq_Of_Word_45,Freq_Of_Word_46,Freq_Of_Word_47,Freq_Of_Word_48,Freq_Of_Word_49,Freq_Of_Word_50,TotalEmojiCharacters,LengthOFFirstParagraph,StylizedLetters
0,-0.35,-0.32,-0.56,-0.04,-0.47,-0.35,-0.3,-0.24,-0.32,-0.35,-0.32,0.8,-0.31,-0.18,-0.19,1.37,-0.32,2.17,-0.97,-0.19,2.74,-0.13,-0.29,2.49,-0.33,-0.3,-0.21,-0.23,-0.16,-0.23,-0.15,-0.15,-0.18,-0.15,-0.24,-0.24,-0.33,-0.06,-0.18,-0.19,-0.13,-0.17,-0.21,-0.12,-0.32,1.28,-0.08,-0.12,-0.15,-0.45,-0.0,-0.01,-0.28
1,-0.35,-0.32,-0.56,-0.04,0.01,-0.35,-0.3,-0.24,-0.32,0.57,-0.32,-0.64,-0.31,-0.18,-0.19,-0.33,-0.32,-0.36,-0.6,-0.19,-0.42,-0.13,-0.29,-0.2,3.95,2.71,-0.21,-0.23,-0.16,-0.23,-0.15,-0.15,0.48,-0.15,-0.24,-0.24,2.64,-0.06,-0.18,-0.19,8.11,0.21,-0.21,0.29,-0.32,-0.21,-0.08,-0.12,1.07,1.15,-0.05,-0.15,0.05
2,0.01,-0.32,-0.35,-0.04,-0.31,0.03,-0.3,-0.24,3.43,-0.35,-0.32,-0.26,-0.31,-0.18,-0.19,-0.33,-0.32,-0.36,-0.72,-0.19,-0.52,-0.13,-0.29,-0.2,-0.01,0.48,-0.15,-0.23,-0.16,-0.23,-0.15,-0.15,0.5,-0.15,-0.24,-0.24,0.17,-0.06,-0.18,-0.19,-0.13,-0.17,-0.21,-0.12,-0.32,-0.21,-0.08,-0.12,0.12,0.07,-0.05,-0.06,0.32
3,-0.35,-0.32,-0.56,-0.04,-0.47,-0.35,2.09,-0.24,-0.32,-0.35,-0.32,0.51,-0.31,-0.18,-0.19,-0.33,1.92,-0.36,1.1,-0.19,1.12,-0.13,-0.29,-0.2,-0.33,-0.3,-0.21,-0.23,-0.16,-0.23,-0.15,-0.15,-0.18,-0.15,-0.24,-0.24,-0.33,-0.06,-0.18,-0.19,-0.13,-0.17,-0.21,-0.12,-0.32,-0.21,-0.08,-0.12,0.07,0.3,-0.05,0.01,-0.2
4,0.62,0.61,-0.56,-0.04,0.39,0.69,-0.3,1.13,-0.32,-0.35,-0.32,-0.64,-0.31,-0.18,-0.19,-0.33,-0.32,-0.36,-0.97,-0.19,-0.71,-0.13,-0.29,-0.2,0.01,0.04,-0.21,-0.23,0.28,-0.23,0.54,-0.15,-0.18,-0.15,0.44,0.49,-0.33,-0.06,-0.18,-0.19,-0.13,-0.17,1.09,-0.12,-0.32,-0.21,-0.08,-0.12,-0.15,-0.45,-0.11,-0.19,-0.31


In [0]:
train, test = train_df,test_df
target = train['IsGoodNews']
features = [c for c in train.columns if c not in ['IsGoodNews']]

In [82]:
import datetime
import time
from sklearn.model_selection import *
from sklearn.metrics import *
def custom_metric(y_true, y_pred):
    return f'F1_score: {f1_score(y_true, np.round(y_pred))}'
N_FOLDS = 5

training_start_time = time.time()

max_iter = N_FOLDS
folds = StratifiedKFold(n_splits = max_iter)
oofs = np.zeros(len(train))
preds_test = np.zeros(len(test))

feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, target.values)):
    
    print(f'\n---- Fold {fold_} -----\n')
    
    fold_start_time = time.time()
    
    X_trn, y_trn = train.iloc[trn_idx][features], target.iloc[trn_idx]
    X_val, y_val = train.iloc[val_idx][features], target.iloc[val_idx]
    X_test = test[features]
    
    print(X_trn.shape)
    
    # Column subsampling is left at its default; bagging uses 80% of the rows at every iteration.
    clf = LGBMClassifier(n_estimators = 4000, learning_rate = 0.01, num_leaves=200, reg_alpha=0.5, reg_lambda=0.5, 
                        bagging_freq=1, bagging_fraction=0.8, max_bin=50,random_state=9)
    # The log below reports binary_logloss, so early stopping is driven by that metric.
    # `verbose` and `early_stopping_rounds` in fit() need LightGBM < 4.0; newer releases use callbacks.
    _ = clf.fit(X_trn, y_trn, eval_set = [(X_trn, y_trn), (X_val, y_val)], eval_metric = 'f1', verbose = 100, early_stopping_rounds = 200)
    
    oofs[val_idx] = clf.predict_proba(X_val)[:, 1]
    current_test_pred = clf.predict_proba(X_test)[:, 1]
    preds_test += current_test_pred/max_iter
    
    print(f'\n Fold {custom_metric(y_val, oofs[val_idx])}')

    fold_importance_df = pd.DataFrame({'feature': X_trn.columns.tolist(), 'importance': clf.feature_importances_})
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    fold_end_time = time.time()
    total_fold_time = int(fold_end_time - fold_start_time)
    
    print(f"\n->-> Fold ran for {(total_fold_time)//60} minutes {(total_fold_time)%60} seconds")
    

print(f'\nOOF val score: {custom_metric(target, oofs)}')
training_end_time = time.time()
total_training_time = int(training_end_time - training_start_time)

print(f'\n->-> Total training time: {(total_training_time)//60} minutes {(total_training_time)%60} seconds')


---- Fold 0 -----

(757, 53)
Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.34926	valid_1's binary_logloss: 0.370118
[200]	training's binary_logloss: 0.225559	valid_1's binary_logloss: 0.257285
[300]	training's binary_logloss: 0.164053	valid_1's binary_logloss: 0.209273
[400]	training's binary_logloss: 0.127549	valid_1's binary_logloss: 0.188804
[500]	training's binary_logloss: 0.103651	valid_1's binary_logloss: 0.178463
[600]	training's binary_logloss: 0.0867687	valid_1's binary_logloss: 0.173888
[700]	training's binary_logloss: 0.0744715	valid_1's binary_logloss: 0.171306
[800]	training's binary_logloss: 0.0652444	valid_1's binary_logloss: 0.16986
[900]	training's binary_logloss: 0.0582853	valid_1's binary_logloss: 0.170341
Early stopping, best iteration is:
[797]	training's binary_logloss: 0.0655028	valid_1's binary_logloss: 0.169675

 Fold F1_score: 0.8918918918918919

->-> Fold ran for 0 minutes 0 seconds

---- Fold 1 -----

(757
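
The loop above follows the usual out-of-fold pattern: each training row is scored by a model that never saw it, and the test predictions are averaged over the folds. A stripped-down sketch of that pattern follows. It uses synthetic data from `make_classification` and a plain `LogisticRegression` in place of LightGBM; both are stand-ins chosen to keep the example small, and `X_test` here is only a placeholder slice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data; the notebook itself uses Train.csv and Test.csv
X, y = make_classification(n_samples=300, n_features=10, random_state=9)
X_test = X[:50]  # placeholder "test" rows, for illustration only

n_folds = 5
folds = StratifiedKFold(n_splits=n_folds)
oofs = np.zeros(len(X))             # one out-of-fold probability per training row
preds_test = np.zeros(len(X_test))  # fold-averaged test probabilities

for trn_idx, val_idx in folds.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[trn_idx], y[trn_idx])
    # Rows in val_idx were not used to fit this fold's model
    oofs[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    # Each fold contributes 1/n_folds of the final test probability
    preds_test += model.predict_proba(X_test)[:, 1] / n_folds

print(f"OOF F1: {f1_score(y, np.round(oofs)):.3f}")
```

Because every row receives exactly one out-of-fold prediction, the final F1 is computed over the whole training set rather than averaged across folds.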

In [83]:
#Feature importance for lgb (fold_importance_df holds the last fold only)
fold_importance_df 

Unnamed: 0,feature,importance
0,Freq_Of_Word_1,35
1,Freq_Of_Word_2,133
2,Freq_Of_Word_3,426
3,Freq_Of_Word_4,0
4,Freq_Of_Word_5,395
5,Freq_Of_Word_6,66
6,Freq_Of_Word_7,340
7,Freq_Of_Word_8,150
8,Freq_Of_Word_9,96
9,Freq_Of_Word_10,154
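
The training loop also concatenates every fold's table into `feature_importance_df`, which allows the importances to be averaged across folds instead of read from the last fold alone. A minimal sketch of that aggregation, assuming a small made-up table (the `fold` column and the second fold's values are illustrative, and the "mean importance above zero" rule is a hypothetical selection criterion, not the one used in the notebook):

```python
import pandas as pd

# Toy stand-in for the per-fold tables concatenated in the training loop.
# The importance values for fold 1 are made up for illustration.
feature_importance_df = pd.DataFrame({
    "feature": ["Freq_Of_Word_3", "Freq_Of_Word_4", "Freq_Of_Word_5"] * 2,
    "importance": [426, 0, 395, 400, 0, 410],
    "fold": [0, 0, 0, 1, 1, 1],
})

# Average each feature's split count over the folds, highest first
mean_importance = (
    feature_importance_df.groupby("feature")["importance"]
    .mean()
    .sort_values(ascending=False)
)

# Hypothetical rule: keep features that the trees actually split on
selected = mean_importance[mean_importance > 0].index.tolist()
print(mean_importance)
print(selected)
```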


In [84]:
#Selecting Features with high feature importance
df1=train_df[['Freq_Of_Word_1', 'Freq_Of_Word_2', 'Freq_Of_Word_3', 
       'Freq_Of_Word_5', 'Freq_Of_Word_6', 'Freq_Of_Word_7', 'Freq_Of_Word_8',
       'Freq_Of_Word_9', 'Freq_Of_Word_10', 'Freq_Of_Word_11',
       'Freq_Of_Word_12', 'Freq_Of_Word_13', 'Freq_Of_Word_14',
       'Freq_Of_Word_16', 'Freq_Of_Word_17',
       'Freq_Of_Word_18', 'Freq_Of_Word_19', 'Freq_Of_Word_20',
       'Freq_Of_Word_21', 'Freq_Of_Word_22', 'Freq_Of_Word_23',
       'Freq_Of_Word_24', 'Freq_Of_Word_25', 'Freq_Of_Word_26',
       'Freq_Of_Word_27', 'Freq_Of_Word_28', 'Freq_Of_Word_29',
       'Freq_Of_Word_30', 
       'Freq_Of_Word_33', 'Freq_Of_Word_36', 'Freq_Of_Word_37',
       'Freq_Of_Word_42', 'Freq_Of_Word_44',
       'Freq_Of_Word_45', 'Freq_Of_Word_46','Freq_Of_Word_49', 'Freq_Of_Word_50',
       'TotalEmojiCharacters', 'LengthOFFirstParagraph', 'StylizedLetters',
       'IsGoodNews']]
df1

Unnamed: 0,Freq_Of_Word_1,Freq_Of_Word_2,Freq_Of_Word_3,Freq_Of_Word_5,Freq_Of_Word_6,Freq_Of_Word_7,Freq_Of_Word_8,Freq_Of_Word_9,Freq_Of_Word_10,Freq_Of_Word_11,Freq_Of_Word_12,Freq_Of_Word_13,Freq_Of_Word_14,Freq_Of_Word_16,Freq_Of_Word_17,Freq_Of_Word_18,Freq_Of_Word_19,Freq_Of_Word_20,Freq_Of_Word_21,Freq_Of_Word_22,Freq_Of_Word_23,Freq_Of_Word_24,Freq_Of_Word_25,Freq_Of_Word_26,Freq_Of_Word_27,Freq_Of_Word_28,Freq_Of_Word_29,Freq_Of_Word_30,Freq_Of_Word_33,Freq_Of_Word_36,Freq_Of_Word_37,Freq_Of_Word_42,Freq_Of_Word_44,Freq_Of_Word_45,Freq_Of_Word_46,Freq_Of_Word_49,Freq_Of_Word_50,TotalEmojiCharacters,LengthOFFirstParagraph,StylizedLetters,IsGoodNews
0,-0.35,2.62,1.25,-0.47,-0.35,-0.30,-0.24,-0.32,-0.35,-0.32,-0.64,-0.31,-0.18,-0.33,-0.32,-0.36,-0.71,1.07,0.95,-0.13,-0.29,0.76,-0.33,-0.30,-0.21,-0.23,-0.16,-0.23,-0.18,-0.24,-0.33,-0.17,-0.12,-0.32,-0.21,0.08,0.16,-0.03,-0.05,0.22,1
1,-0.35,-0.32,-0.56,-0.47,-0.35,-0.30,3.84,-0.32,-0.35,-0.32,-0.64,-0.31,-0.18,-0.33,-0.32,-0.36,-0.97,-0.19,-0.71,-0.13,-0.29,-0.20,-0.33,-0.30,-0.21,-0.23,-0.16,-0.23,-0.18,-0.24,3.68,-0.17,-0.12,-0.32,-0.21,-0.15,-0.45,-0.11,-0.20,-0.41,0
2,-0.35,-0.32,-0.56,-0.47,-0.35,-0.30,-0.24,-0.32,-0.35,-0.32,-0.64,-0.31,-0.18,-0.33,-0.32,-0.36,-0.97,-0.19,-0.71,-0.13,-0.29,-0.20,-0.33,-0.30,-0.21,-0.23,-0.16,-0.23,-0.18,-0.24,-0.33,-0.17,-0.12,-0.32,-0.21,-0.15,-0.45,-0.11,-0.19,-0.39,0
3,1.21,2.68,1.29,0.22,-0.35,-0.30,0.86,-0.32,2.37,-0.32,2.03,-0.31,1.29,0.29,-0.32,-0.36,0.13,-0.19,3.98,-0.13,-0.29,-0.20,-0.33,-0.30,-0.21,-0.23,-0.16,-0.23,-0.18,-0.24,-0.33,-0.17,-0.12,-0.32,-0.21,-0.15,0.34,1.33,2.27,0.60,1
4,-0.35,-0.32,-0.56,-0.47,-0.35,-0.30,-0.24,-0.32,-0.35,-0.32,-0.64,-0.31,-0.18,-0.33,-0.32,-0.36,-0.97,-0.19,-0.71,-0.13,-0.29,-0.20,-0.33,-0.30,0.10,-0.23,0.64,-0.23,-0.18,-0.24,-0.33,-0.17,-0.12,-0.32,-0.21,-0.15,0.93,-0.03,-0.11,-0.13,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
942,-0.25,-0.29,-0.28,-0.34,-0.25,-0.30,-0.17,-0.21,-0.20,-0.32,-0.36,0.01,-0.03,-0.29,-0.21,-0.30,-0.90,-0.19,-0.58,-0.13,-0.10,-0.20,-0.33,-0.30,-0.21,-0.23,-0.16,-0.23,3.51,0.02,1.72,-0.16,-0.08,-0.32,-0.10,-0.02,0.13,-0.08,0.20,4.02,0
943,-0.35,-0.32,-0.56,0.36,-0.35,-0.30,-0.24,-0.32,-0.35,-0.32,-0.64,-0.31,-0.18,-0.33,-0.32,-0.36,-0.48,-0.19,-0.45,-0.13,-0.29,-0.20,-0.17,0.03,-0.04,0.95,-0.16,-0.23,-0.18,-0.24,-0.33,-0.17,-0.12,-0.32,-0.21,-0.15,0.22,0.21,1.15,0.18,0
944,-0.19,-0.16,0.20,0.02,-0.35,-0.30,-0.24,1.76,-0.28,-0.32,-0.33,0.01,-0.18,-0.11,-0.32,-0.36,-0.39,-0.06,0.10,-0.13,-0.29,-0.10,-0.33,-0.30,-0.21,-0.23,-0.09,-0.23,-0.18,-0.24,-0.33,-0.04,-0.12,-0.28,-0.21,-0.09,-0.15,-0.03,0.15,1.41,1
945,-0.35,-0.32,0.41,-0.47,1.41,-0.30,-0.24,-0.32,-0.35,-0.32,-0.64,1.15,-0.18,0.33,-0.32,-0.36,-0.39,-0.19,-0.71,-0.13,-0.29,-0.20,-0.33,-0.30,-0.21,-0.23,-0.16,-0.23,-0.18,-0.24,-0.33,-0.17,-0.12,-0.32,-0.21,-0.15,-0.45,-0.11,-0.20,-0.37,0
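
The hand-written column list is repeated for the train and test frames, which makes it easy for the two to drift apart. One possible alternative is to build the list once from the word indices that were left out. In this sketch `dropped_word_ids` and `selected_features` are illustrative names; the indices are the ones missing from the list above.

```python
# Word indices that do not appear in the hand-written column list
dropped_word_ids = {4, 15, 31, 32, 34, 35, 38, 39, 40, 41, 43, 47, 48}

# Keep every other word frequency column, then append the engineered features
selected_features = [
    f"Freq_Of_Word_{i}" for i in range(1, 51) if i not in dropped_word_ids
] + ["TotalEmojiCharacters", "LengthOFFirstParagraph", "StylizedLetters"]

print(len(selected_features))  # 40 feature columns
```

With such a list, the two frames could be selected as `train_df[selected_features + ['IsGoodNews']]` and `test_df[selected_features]`.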


In [0]:
#Training Lgbm with good feature importance
train, test = df1,test_df[['Freq_Of_Word_1', 'Freq_Of_Word_2', 'Freq_Of_Word_3', 
       'Freq_Of_Word_5', 'Freq_Of_Word_6', 'Freq_Of_Word_7', 'Freq_Of_Word_8',
       'Freq_Of_Word_9', 'Freq_Of_Word_10', 'Freq_Of_Word_11',
       'Freq_Of_Word_12', 'Freq_Of_Word_13', 'Freq_Of_Word_14',
       'Freq_Of_Word_16', 'Freq_Of_Word_17',
       'Freq_Of_Word_18', 'Freq_Of_Word_19', 'Freq_Of_Word_20',
       'Freq_Of_Word_21', 'Freq_Of_Word_22', 'Freq_Of_Word_23',
       'Freq_Of_Word_24', 'Freq_Of_Word_25', 'Freq_Of_Word_26',
       'Freq_Of_Word_27', 'Freq_Of_Word_28', 'Freq_Of_Word_29',
       'Freq_Of_Word_30', 
       'Freq_Of_Word_33', 'Freq_Of_Word_36', 'Freq_Of_Word_37',
       'Freq_Of_Word_42', 'Freq_Of_Word_44',
       'Freq_Of_Word_45', 'Freq_Of_Word_46','Freq_Of_Word_49', 'Freq_Of_Word_50',
       'TotalEmojiCharacters', 'LengthOFFirstParagraph', 'StylizedLetters']]
target = df1['IsGoodNews']
features = [c for c in train.columns if c not in ['IsGoodNews']]

In [86]:
#Function to perform training of lgb with cross validation
import datetime
import time
from sklearn.model_selection import *
from sklearn.metrics import *
def custom_metric(y_true, y_pred):
    return f'F1_score: {f1_score(y_true, np.round(y_pred))}'
N_FOLDS = 5

training_start_time = time.time()

max_iter = N_FOLDS
folds = StratifiedKFold(n_splits = max_iter)
oofs = np.zeros(len(train))
preds_test = np.zeros(len(test))

feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, target.values)):
    
    print(f'\n---- Fold {fold_} -----\n')
    
    fold_start_time = time.time()
    
    X_trn, y_trn = train.iloc[trn_idx][features], target.iloc[trn_idx]
    X_val, y_val = train.iloc[val_idx][features], target.iloc[val_idx]
    X_test = test[features]
    
    print(X_trn.shape)
    
    # Column subsampling is left at its default; bagging uses 80% of the rows at every iteration.
    clf_1 = LGBMClassifier(n_estimators = 4000, learning_rate = 0.01, num_leaves=200, reg_alpha=0.5, reg_lambda=0.5, 
                        bagging_freq=1, bagging_fraction=0.8, max_bin=50,random_state=9)
    # The log below reports binary_logloss, so early stopping is driven by that metric.
    # `verbose` and `early_stopping_rounds` in fit() need LightGBM < 4.0; newer releases use callbacks.
    _ = clf_1.fit(X_trn, y_trn, eval_set = [(X_trn, y_trn), (X_val, y_val)], eval_metric = 'f1', verbose = 100, early_stopping_rounds = 200)
    
    oofs[val_idx] = clf_1.predict_proba(X_val)[:, 1]
    current_test_pred = clf_1.predict_proba(X_test)[:, 1]
    preds_test += current_test_pred/max_iter
    
    print(f'\n Fold {custom_metric(y_val, oofs[val_idx])}')

    #fold_importance_df = pd.DataFrame({'feature': X_trn.columns.tolist(), 'importance': clf.feature_importances_})
    #feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    fold_end_time = time.time()
    total_fold_time = int(fold_end_time - fold_start_time)
    
    print(f"\n->-> Fold ran for {(total_fold_time)//60} minutes {(total_fold_time)%60} seconds")
    

print(f'\nOOF val score: {custom_metric(target, oofs)}')
training_end_time = time.time()
total_training_time = int(training_end_time - training_start_time)

print(f'\n->-> Total training time: {(total_training_time)//60} minutes {(total_training_time)%60} seconds')


---- Fold 0 -----

(757, 40)
Training until validation scores don't improve for 200 rounds.
[100]	training's binary_logloss: 0.34926	valid_1's binary_logloss: 0.370118
[200]	training's binary_logloss: 0.225559	valid_1's binary_logloss: 0.257285
[300]	training's binary_logloss: 0.164053	valid_1's binary_logloss: 0.209273
[400]	training's binary_logloss: 0.127554	valid_1's binary_logloss: 0.188457
[500]	training's binary_logloss: 0.103603	valid_1's binary_logloss: 0.178185
[600]	training's binary_logloss: 0.086663	valid_1's binary_logloss: 0.173196
[700]	training's binary_logloss: 0.0743391	valid_1's binary_logloss: 0.170642
[800]	training's binary_logloss: 0.0650923	valid_1's binary_logloss: 0.169049
[900]	training's binary_logloss: 0.0581935	valid_1's binary_logloss: 0.169704
Early stopping, best iteration is:
[797]	training's binary_logloss: 0.0653434	valid_1's binary_logloss: 0.168908

 Fold F1_score: 0.8918918918918919

->-> Fold ran for 0 minutes 0 seconds

---- Fold 1 -----

(757

In [0]:
#Applying Logistic Regression
train, test = train_df,test_df
target = train_df['IsGoodNews']
features = [c for c in train.columns if c not in ['IsGoodNews']]

In [88]:
import datetime
import time
from sklearn.model_selection import *
from sklearn.metrics import *
def custom_metric(y_true, y_pred):
    return f'F1_score: {f1_score(y_true, np.round(y_pred))}'
N_FOLDS = 5

training_start_time = time.time()

max_iter = N_FOLDS
folds = StratifiedKFold(n_splits = max_iter)
oofs = np.zeros(len(train))
preds_test = np.zeros(len(test))

feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, target.values)):
    
    print(f'\n---- Fold {fold_} -----\n')
    
    fold_start_time = time.time()
    
    X_trn, y_trn = train.iloc[trn_idx][features], target.iloc[trn_idx]
    X_val, y_val = train.iloc[val_idx][features], target.iloc[val_idx]
    X_test = test[features]
    
    print(X_trn.shape)
    
    clf_2 = LogisticRegression(random_state=9)
    _ = clf_2.fit(X_trn, y_trn) 
    
    oofs[val_idx] = clf_2.predict_proba(X_val)[:, 1]
    current_test_pred = clf_2.predict_proba(X_test)[:, 1]
    preds_test += current_test_pred/max_iter
    
    print(f'\n Fold {custom_metric(y_val, oofs[val_idx])}')

    #fold_importance_df = pd.DataFrame({'feature': X_trn.columns.tolist(), 'importance': clf.feature_importances_})
    #feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    fold_end_time = time.time()
    total_fold_time = int(fold_end_time - fold_start_time)
    
    print(f"\n->-> Fold ran for {(total_fold_time)//60} minutes {(total_fold_time)%60} seconds")
    

print(f'\nOOF val score: {custom_metric(target, oofs)}')
training_end_time = time.time()
total_training_time = int(training_end_time - training_start_time)

print(f'\n->-> Total training time: {(total_training_time)//60} minutes {(total_training_time)%60} seconds')


---- Fold 0 -----

(757, 53)

 Fold F1_score: 0.9078947368421053

->-> Fold ran for 0 minutes 0 seconds

---- Fold 1 -----

(757, 53)

 Fold F1_score: 0.875

->-> Fold ran for 0 minutes 0 seconds

---- Fold 2 -----

(758, 53)

 Fold F1_score: 0.8695652173913043

->-> Fold ran for 0 minutes 0 seconds

---- Fold 3 -----

(758, 53)

 Fold F1_score: 0.8759124087591241

->-> Fold ran for 0 minutes 0 seconds

---- Fold 4 -----

(758, 53)

 Fold F1_score: 0.8873239436619719

->-> Fold ran for 0 minutes 0 seconds

OOF val score: F1_score: 0.8835904628330996

->-> Total training time: 0 minutes 0 seconds
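
`custom_metric` turns probabilities into labels with `np.round` before computing F1, which amounts to a 0.5 threshold (a probability of exactly 0.5 rounds to 0, because NumPy rounds halves to the nearest even value). A small worked example with made-up labels and probabilities:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels and predicted probabilities, made up for illustration
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6])

# Rounding maps each probability to the nearer of 0 and 1
y_pred = np.round(y_prob).astype(int)
score = f1_score(y_true, y_pred)

# TP = 2, FP = 1, FN = 1, so precision = recall = 2/3 and F1 = 2/3
print(y_pred.tolist(), round(score, 4))
```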


In [0]:
#Applying voting classifier
train, test = train_df,test_df
target = train_df['IsGoodNews']
features = [c for c in train.columns if c not in ['IsGoodNews']]

In [95]:
import datetime
import time
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import *
from sklearn.metrics import *
def custom_metric(y_true, y_pred):
    return f'F1_score: {f1_score(y_true, np.round(y_pred))}'
N_FOLDS = 5

training_start_time = time.time()

max_iter = N_FOLDS
folds = StratifiedKFold(n_splits = max_iter)
oofs = np.zeros(len(train))
preds_test = np.zeros(len(test))

feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, target.values)):
    
    print(f'\n---- Fold {fold_} -----\n')
    
    fold_start_time = time.time()
    
    X_trn, y_trn = train.iloc[trn_idx][features], target.iloc[trn_idx]
    X_val, y_val = train.iloc[val_idx][features], target.iloc[val_idx]
    X_test = test[features]
    
    print(X_trn.shape)
    
    # VotingClassifier clones clf_2 and clf_1 and refits both on this fold's training data
    clf_vote = VotingClassifier(estimators =[('log',clf_2),('lgb',clf_1)], voting = 'soft')
    _ = clf_vote.fit(X_trn, y_trn) 
    
    oofs[val_idx] = clf_vote.predict_proba(X_val)[:, 1]
    current_test_pred = clf_vote.predict_proba(X_test)[:, 1]
    preds_test += current_test_pred/max_iter
    
    print(f'\n Fold {custom_metric(y_val, oofs[val_idx])}')

    #fold_importance_df = pd.DataFrame({'feature': X_trn.columns.tolist(), 'importance': clf.feature_importances_})
    #feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    fold_end_time = time.time()
    total_fold_time = int(fold_end_time - fold_start_time)
    
    print(f"\n->-> Fold ran for {(total_fold_time)//60} minutes {(total_fold_time)%60} seconds")
    

print(f'\nOOF val score: {custom_metric(target, oofs)}')
training_end_time = time.time()
total_training_time = int(training_end_time - training_start_time)

print(f'\n->-> Total training time: {(total_training_time)//60} minutes {(total_training_time)%60} seconds')


---- Fold 0 -----

(757, 53)

 Fold F1_score: 0.9150326797385621

->-> Fold ran for 0 minutes 2 seconds

---- Fold 1 -----

(757, 53)

 Fold F1_score: 0.9103448275862069

->-> Fold ran for 0 minutes 2 seconds

---- Fold 2 -----

(758, 53)

 Fold F1_score: 0.9444444444444445

->-> Fold ran for 0 minutes 1 seconds

---- Fold 3 -----

(758, 53)

 Fold F1_score: 0.9090909090909091

->-> Fold ran for 0 minutes 1 seconds

---- Fold 4 -----

(758, 53)

 Fold F1_score: 0.9115646258503401

->-> Fold ran for 0 minutes 2 seconds

OOF val score: F1_score: 0.9180327868852459

->-> Total training time: 0 minutes 10 seconds
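
With `voting = 'soft'` and no weights, the ensemble's probability is the plain average of its members' probabilities. The sketch below checks this on synthetic data. A `DecisionTreeClassifier` stands in for LightGBM so the example depends only on scikit-learn; that swap and the data are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=200, n_features=8, random_state=9)

vote = VotingClassifier(
    estimators=[
        ("log", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=9)),
    ],
    voting="soft",
)
vote.fit(X, y)  # fits a fresh clone of each member

# Average the fitted members' class probabilities by hand
manual = np.mean([est.predict_proba(X) for est in vote.estimators_], axis=0)
print(np.allclose(manual, vote.predict_proba(X)))
```

Since `fit` works on clones, any training done on the member estimators beforehand is not carried into the ensemble.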


In [0]:
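# Predict with the voting classifier fitted on the last fold, using only the feature columns
test_df['IsGoodNews'] = clf_vote.predict(test_df[features])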

In [92]:
# Keep only the predicted label and write it to an Excel file for download
A = test_df[['IsGoodNews']]
A
A.to_excel('lgb_log_model_1.xlsx')

Unnamed: 0,IsGoodNews
0,1
1,0
2,0
3,1
4,0
...,...
522,0
523,1
524,0
525,0
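
The labels above come from the voting classifier fitted on the last fold. The cross-validation loop also accumulates fold-averaged probabilities in `preds_test`, and those could be thresholded instead. A minimal sketch, assuming placeholder probabilities and the same 0.5 cut-off that `custom_metric` applies:

```python
import numpy as np
import pandas as pd

# Placeholder fold-averaged probabilities; in the notebook this would be preds_test
preds_test = np.array([0.91, 0.08, 0.35, 0.77, 0.12])

# Label a message as good news when its averaged probability exceeds 0.5
submission = pd.DataFrame({"IsGoodNews": (preds_test > 0.5).astype(int)})
print(submission["IsGoodNews"].tolist())
```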


In [0]:
from google.colab import files
files.download('lgb_log_model_1.xlsx')