## Overview

We build a machine learning model that predicts the category of a news title (Business, Entertainment, Medical, or Technology). The model is also available through an API; see the API documentation at the end of this page.

## Import Libraries

In [1]:
import pandas as pd

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

## Read Excel File

In [3]:
df = pd.read_excel('News Title.xls')
df = df.drop('No', axis=1)

In [4]:
df

Unnamed: 0,News Title,Category
0,Google+ rolls out 'Stories' for tricked out ph...,Technology
1,Dov Charney's Redeeming Quality,Business
2,White God adds Un Certain Regard to the Palm Dog,Entertainment
3,"Google shows off Androids for wearables, cars,...",Technology
4,China May new bank loans at 870.8 bln yuan,Business
...,...,...
65530,Xbox One Homebrew Will Likely Be a Reality in ...,Technology
65531,Maker Recalls 1.9 Million Rear-Facing Infant S...,Technology
65532,Watch first 'Ninja Turtles' trailer,Entertainment
65533,23/05/2014Dogs triumph in Cannes as canine thr...,Entertainment


In [5]:
df1 = df['Category'].value_counts().to_frame()

In [6]:
df1

Unnamed: 0,Category
Entertainment,23961
Business,17707
Technology,16776
Medical,7091


NOTE: 
* The categories are imbalanced, so we calculate a class weight for each category (total rows divided by the category's row count) to use with Random Forest.

In [7]:
df1['Class Weight'] = df1['Category'].apply(
    lambda row : round(df.shape[0]/row,2)
)

In [8]:
df1

Unnamed: 0,Category,Class Weight
Entertainment,23961,2.74
Business,17707,3.7
Technology,16776,3.91
Medical,7091,9.24
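
The weights above are plain inverse frequencies. As a minimal sketch, the arithmetic can be reproduced from the category counts alone. The `balanced` variant is added here only for comparison: it follows the `n_samples / (n_classes * count)` heuristic that scikit-learn uses for `class_weight='balanced'`, and it differs from the notebook's weights only by a constant factor. The variable names are illustrative and not part of the notebook.

```python
# Category counts copied from the value_counts() output above.
counts = {
    "Entertainment": 23961,
    "Business": 17707,
    "Technology": 16776,
    "Medical": 7091,
}
total = sum(counts.values())

# Notebook approach: total rows divided by the rows in each category.
inverse_freq = {label: round(total / n, 2) for label, n in counts.items()}

# Comparison only: the "balanced" heuristic divides additionally by the number of classes.
balanced = {label: round(total / (len(counts) * n), 2) for label, n in counts.items()}

print(inverse_freq)
print(balanced)
```

Because Random Forest only uses the relative size of the weights, both variants penalize mistakes on the Medical class most heavily.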


In [9]:
df.head(20)

Unnamed: 0,News Title,Category
0,Google+ rolls out 'Stories' for tricked out ph...,Technology
1,Dov Charney's Redeeming Quality,Business
2,White God adds Un Certain Regard to the Palm Dog,Entertainment
3,"Google shows off Androids for wearables, cars,...",Technology
4,China May new bank loans at 870.8 bln yuan,Business
5,Firefox Windows 8 Metro Browser Development Ca...,Technology
6,Destiny Beta Kicks Off In July,Technology
7,Apple & Google's Motorola end legal battle,Technology
8,UPDATE 2-Facebook Q1 revenue grows 72 percent ...,Business
9,"Selena Gomez, Justin Bieber Spotted at the Sam...",Entertainment


In [10]:
import string

In [11]:
def processing_text_news(data):
    # remove punctuation characters
    nopunc = [char for char in data if char not in string.punctuation]
    data_ = ''.join(nopunc)
    
    # tokenize with spaCy
    data1 = nlp(data_)
    data2 = [item for item in data1]
    
    # lemmatize each token
    data3 = [item.lemma_ for item in data2]
    
    # remove stop words
    data4 = [item for item in data3 if item not in nlp.Defaults.stop_words]
    
    return ' '.join(data4)

In [12]:
df['News Title Processed'] = df['News Title'].apply(processing_text_news)

In [13]:
df.head(10)

Unnamed: 0,News Title,Category,News Title Processed
0,Google+ rolls out 'Stories' for tricked out ph...,Technology,Google roll story tricked photo playback
1,Dov Charney's Redeeming Quality,Business,Dov charney redeem quality
2,White God adds Un Certain Regard to the Palm Dog,Entertainment,White God add Un Certain Regard Palm Dog
3,"Google shows off Androids for wearables, cars,...",Technology,Google android wearable car tv
4,China May new bank loans at 870.8 bln yuan,Business,China May new bank loan 8708 bln yuan
5,Firefox Windows 8 Metro Browser Development Ca...,Technology,Firefox Windows 8 Metro Browser Development ca...
6,Destiny Beta Kicks Off In July,Technology,Destiny Beta kick July
7,Apple & Google's Motorola end legal battle,Technology,Apple Googles Motorola end legal battle
8,UPDATE 2-Facebook Q1 revenue grows 72 percent ...,Business,UPDATE 2Facebook Q1 revenue grow 72 percent ri...
9,"Selena Gomez, Justin Bieber Spotted at the Sam...",Entertainment,Selena Gomez Justin Bieber spot Same Recording...
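
The processed column shows a side effect of removing punctuation before tokenization: `870.8` becomes `8708` and `Google's` becomes `Googles`. The sketch below isolates that first step with `str.translate`, which should behave like the character filter in `processing_text_news`. The helper name `strip_punctuation` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DROP_PUNCT = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Remove ASCII punctuation, leaving all other characters untouched."""
    return text.translate(_DROP_PUNCT)


print(strip_punctuation("China May new bank loans at 870.8 bln yuan"))
print(strip_punctuation("Apple & Google's Motorola end legal battle"))
```

If decimal numbers or possessives matter for the classifier, one option is to tokenize first and drop only the tokens that consist entirely of punctuation.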


### Exploratory Data Analysis

In [14]:
list_word_1 = []
dict_word_1 = {}
for item1 in df[df['Category'] == 'Technology']['News Title Processed']:
    for item2 in item1.split(' '):
        if item2 not in list_word_1:
            list_word_1.append(item2)
            dict_word_1[item2] = 1
        else:
            dict_word_1[item2] += 1
df1 = pd.DataFrame([[key,val] for key, val in dict_word_1.items()], columns=['Technology','#'])
df1 = df1[(df1['Technology'] != '') & (df1['Technology'] != '-PRON-') & (df1['Technology'] != '\t')]
df1 = df1.sort_values('#', ascending=False).head(10)
df1 = df1.reset_index(drop=True)

In [15]:
list_word_2 = []
dict_word_2 = {}
for item1 in df[df['Category'] == 'Business']['News Title Processed']:
    for item2 in item1.split(' '):
        if item2 not in list_word_2:
            list_word_2.append(item2)
            dict_word_2[item2] = 1
        else:
            dict_word_2[item2] += 1
df2 = pd.DataFrame([[key,val] for key, val in dict_word_2.items()], columns=['Business','#'])
df2 = df2[(df2['Business'] != '') & (df2['Business'] != '-PRON-') & (df2['Business'] != '\t')]
df2 = df2.sort_values('#', ascending=False).head(10)
df2 = df2.reset_index(drop=True)

In [16]:
list_word_3 = []
dict_word_3 = {}
for item1 in df[df['Category'] == 'Entertainment']['News Title Processed']:
    for item2 in item1.split(' '):
        if item2 not in list_word_3:
            list_word_3.append(item2)
            dict_word_3[item2] = 1
        else:
            dict_word_3[item2] += 1
df3 = pd.DataFrame([[key,val] for key, val in dict_word_3.items()], columns=['Entertainment','#'])
df3 = df3[(df3['Entertainment'] != '') & (df3['Entertainment'] != '-PRON-') & (df3['Entertainment'] != '\t')]
df3 = df3.sort_values('#', ascending=False).head(10)
df3 = df3.reset_index(drop=True)

In [17]:
list_word_4 = []
dict_word_4 = {}
for item1 in df[df['Category'] == 'Medical']['News Title Processed']:
    for item2 in item1.split(' '):
        if item2 not in list_word_4:
            list_word_4.append(item2)
            dict_word_4[item2] = 1
        else:
            dict_word_4[item2] += 1
df4 = pd.DataFrame([[key,val] for key, val in dict_word_4.items()], columns=['Medical','#'])
df4 = df4[(df4['Medical'] != '') & (df4['Medical'] != '-PRON-') & (df4['Medical'] != '\t')]
df4 = df4.sort_values('#', ascending=False).head(10)
df4 = df4.reset_index(drop=True)
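
The four cells above differ only in the category name, and the `not in list_word_N` check scans a list for every token. A more compact alternative, sketched here under the assumption that each processed title is a space-separated string, counts tokens with `collections.Counter`. The function name `top_words` and the sample titles are hypothetical.

```python
from collections import Counter

# Tokens that the cells above filter out before ranking.
IGNORED = {"", "-PRON-", "\t"}


def top_words(titles, n=10):
    """Return the n most frequent tokens across an iterable of processed titles."""
    counts = Counter()
    for title in titles:
        counts.update(tok for tok in title.split(" ") if tok not in IGNORED)
    return counts.most_common(n)


# Hypothetical sample titles, for illustration only.
sample = [
    "Google roll story",
    "Google android wearable car tv",
    "Apple Googles Motorola end legal battle",
]
print(top_words(sample, n=3))
```

With the notebook's data, the same helper could be called as `top_words(df[df['Category'] == 'Technology']['News Title Processed'])`.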

### Top Words by Category

In [18]:
pd.concat([df1, df2, df3, df4], axis=1)

Unnamed: 0,Technology,#,Business,#.1,Entertainment,#.2,Medical,#.3
0,Google,1539,US,1646,Kim,887,Ebola,574
1,Apple,1180,stock,728,New,846,study,370
2,Samsung,1052,China,576,Kardashian,783,cancer,351
3,Galaxy,876,rise,516,2014,759,US,338
4,Microsoft,764,rate,504,"""",758,case,246
5,Facebook,693,price,498,video,697,health,244
6,new,675,sale,480,Game,621,FDA,227
7,Android,615,deal,459,new,611,new,213
8,New,575,high,456,season,603,find,203
9,launch,485,billion,443,Star,591,death,203


NOTE: 
* Most of the top 10 words are distinctive to their category. The exceptions are "US", which appears in both Business and Medical, and "new"/"New", which appears in Technology, Entertainment, and Medical. Next, let's look at the top bigrams by category.

In [19]:
import re
from itertools import islice
from collections import Counter

In [20]:
list_word_1 = []
dict_word_1 = {}
for item1 in df[df['Category'] == 'Technology']['News Title Processed']:
    words = re.findall(r"\w+",item1)
    for item in zip(words, islice(words, 1, None)):
        bigram = item[0] + '_' + item[1]
        if bigram not in list_word_1:
            list_word_1.append(bigram)
            dict_word_1[bigram] = 1 
        else:
            dict_word_1[bigram] += 1
df1 = pd.DataFrame([[key,val] for key, val in dict_word_1.items()], columns=['Technology','#'])
df1 = df1[(df1['Technology'] != '') & (df1['Technology'] != '-PRON-') & (df1['Technology'] != '\t')]
df1 = df1.sort_values('#', ascending=False).head(10)
df1 = df1.reset_index(drop=True)

In [21]:
list_word_2 = []
dict_word_2 = {}
for item1 in df[df['Category'] == 'Business']['News Title Processed']:
    words = re.findall(r"\w+",item1)
    for item in zip(words, islice(words, 1, None)):
        bigram = item[0] + '_' + item[1]
        if bigram not in list_word_2:
            list_word_2.append(bigram)
            dict_word_2[bigram] = 1 
        else:
            dict_word_2[bigram] += 1
df2 = pd.DataFrame([[key,val] for key, val in dict_word_2.items()], columns=['Business','#'])
df2 = df2[(df2['Business'] != '') & (df2['Business'] != '-PRON-') & (df2['Business'] != '\t')]
df2 = df2.sort_values('#', ascending=False).head(10)
df2 = df2.reset_index(drop=True)

In [22]:
list_word_3 = []
dict_word_3 = {}
for item1 in df[df['Category'] == 'Entertainment']['News Title Processed']:
    words = re.findall(r"\w+",item1)
    for item in zip(words, islice(words, 1, None)):
        bigram = item[0] + '_' + item[1]
        if bigram not in list_word_3:
            list_word_3.append(bigram)
            dict_word_3[bigram] = 1 
        else:
            dict_word_3[bigram] += 1
df3 = pd.DataFrame([[key,val] for key, val in dict_word_3.items()], columns=['Entertainment','#'])
df3 = df3[(df3['Entertainment'] != '') & (df3['Entertainment'] != '-PRON-') & (df3['Entertainment'] != '\t')]
df3 = df3.sort_values('#', ascending=False).head(10)
df3 = df3.reset_index(drop=True)

In [23]:
list_word_4 = []
dict_word_4 = {}
for item1 in df[df['Category'] == 'Medical']['News Title Processed']:
    words = re.findall(r"\w+",item1)
    for item in zip(words, islice(words, 1, None)):
        bigram = item[0] + '_' + item[1]
        if bigram not in list_word_4:
            list_word_4.append(bigram)
            dict_word_4[bigram] = 1 
        else:
            dict_word_4[bigram] += 1
df4 = pd.DataFrame([[key,val] for key, val in dict_word_4.items()], columns=['Medical','#'])
df4 = df4[(df4['Medical'] != '') & (df4['Medical'] != '-PRON-') & (df4['Medical'] != '\t')]
df4 = df4.sort_values('#', ascending=False).head(10)
df4 = df4.reset_index(drop=True)
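
The bigram cells pair each word with its successor and join the pair with an underscore. As a minimal sketch of that pairing step, the version below uses `zip(words, words[1:])` together with `collections.Counter`. The function name `top_bigrams` and the sample titles are hypothetical.

```python
import re
from collections import Counter


def top_bigrams(titles, n=10):
    """Return the n most frequent adjacent word pairs, joined as 'first_second'."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"\w+", title)
        # zip stops at the shorter input, so the last word never starts a pair.
        counts.update(f"{first}_{second}" for first, second in zip(words, words[1:]))
    return counts.most_common(n)


# Hypothetical sample titles, for illustration only.
sample = [
    "Samsung Galaxy s5 launch",
    "new Samsung Galaxy Tab",
    "Galaxy s5 price",
]
print(top_bigrams(sample, n=2))
```

A title with a single word produces no bigrams, which matches the behavior of the `islice`-based loop above.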

### Top Bigrams by Category

In [24]:
pd.concat([df1, df2, df3, df4], axis=1)

Unnamed: 0,Technology,#,Business,#.1,Entertainment,#.2,Medical,#.3
0,Samsung_Galaxy,564,Wall_Street,167,Kim_Kardashian,597,Ebola_outbreak,105
1,Galaxy_s5,347,US_stock,140,Game_Thrones,571,West_Nile,85
2,Google_Glass,277,Malaysia_Airlines,97,Miley_Cyrus,436,West_Africa,78
3,climate_change,221,SP_500,82,Star_Wars,363,breast_cancer,65
4,iPhone_6,182,New_York,76,Justin_Bieber,353,study_find,63
5,gas_price,153,gas_price,75,Kanye_West,330,Saudi_Arabia,61
6,Galaxy_Tab,128,candy_crush,75,Jay_Z,276,death_toll,52
7,Galaxy_Note,114,home_sale,73,Selena_Gomez,205,blood_test,49
8,Pro_3,107,Hong_Kong,73,Kardashian_Kanye,191,skin_cancer,48
9,HTC_One,103,interest_rate,71,Captain_America,182,Ebola_virus,46


NOTE: 
* The top 10 bigrams are almost entirely distinct across categories; only "gas_price" appears twice (Technology and Business).

### Vectorization

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
messages_bow = CountVectorizer().fit_transform(df['News Title Processed'])

### TF-IDF

In [26]:
from sklearn.feature_extraction.text import TfidfTransformer
messages_tfidf = TfidfTransformer().fit_transform(messages_bow)
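
To make the two steps above concrete, here is a small pure-Python sketch of the weighting that `TfidfTransformer` applies with the defaults shown later in the pipeline output (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`): raw term counts are multiplied by `ln((1 + n) / (1 + df)) + 1` and each document vector is scaled to unit length. The toy corpus and the function name `tfidf` are hypothetical, and the sketch skips the vectorizer's lowercasing and token pattern.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-processed titles.
docs = ["google launch phone", "apple launch phone", "ebola outbreak"]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def tfidf(doc):
    """Return an L2-normalized TF-IDF vector (as a dict) for one tokenized document."""
    term_counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vector = tfidf(tokenized[0])
for term, weight in sorted(vector.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {weight:.3f}")
```

In this corpus "google" occurs in one document and "launch" in two, so "google" receives the larger weight.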

### Splitting

In [27]:
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(
    df['News Title Processed'], df['Category'],
    test_size = .2
)

## Naive Bayes

### Create Pipeline

In [28]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

modelNB = Pipeline([
    ('bow', CountVectorizer()), 
    ('tfidf', TfidfTransformer()), 
    ('classifier', MultinomialNB()),
])

### Model Training

In [29]:
modelNB.fit(xTrain, yTrain)

Pipeline(memory=None,
         steps=[('bow',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('classifier',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

### Model Evaluation

#### Train Evaluation

In [30]:
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, roc_auc_score, accuracy_score, recall_score, precision_score, f1_score

In [31]:
prediction = modelNB.predict(xTrain)
print(classification_report(yTrain,prediction))
print('Accuracy = ', round(accuracy_score(yTrain,prediction)*100,2))
pd.DataFrame(confusion_matrix(yTrain, prediction))

               precision    recall  f1-score   support

     Business       0.90      0.92      0.91     14112
Entertainment       0.94      0.98      0.96     19151
      Medical       0.99      0.79      0.88      5725
   Technology       0.91      0.92      0.92     13440

     accuracy                           0.93     52428
    macro avg       0.94      0.90      0.92     52428
 weighted avg       0.93      0.93      0.93     52428

Accuracy =  92.71


Unnamed: 0,0,1,2,3
0,13009,345,37,721
1,172,18750,17,212
2,479,496,4513,237
3,768,325,11,12336


#### Test Evaluation

In [32]:
prediction = modelNB.predict(xTest)
print(classification_report(yTest,prediction))
print('Accuracy = ', round(accuracy_score(yTest,prediction)*100,2))
pd.DataFrame(confusion_matrix(yTest, prediction))

               precision    recall  f1-score   support

     Business       0.87      0.90      0.88      3595
Entertainment       0.92      0.97      0.95      4810
      Medical       0.97      0.73      0.84      1366
   Technology       0.89      0.88      0.88      3336

     accuracy                           0.90     13107
    macro avg       0.91      0.87      0.89     13107
 weighted avg       0.90      0.90      0.90     13107

Accuracy =  90.23


Unnamed: 0,0,1,2,3
0,3223,112,16,244
1,67,4671,8,64
2,146,151,1002,67
3,261,140,4,2931
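
The confusion matrix rows and columns are unlabeled. With scikit-learn's conventions (rows are true classes, columns are predicted classes, labels in sorted order) they should correspond to Business, Entertainment, Medical, and Technology, and the row sums match the support column of the report. The sketch below recomputes the reported metrics from the matrix. The variable names are illustrative.

```python
labels = ["Business", "Entertainment", "Medical", "Technology"]

# Naive Bayes test confusion matrix from above: rows = true class, columns = predicted class.
cm = [
    [3223, 112, 16, 244],
    [67, 4671, 8, 64],
    [146, 151, 1002, 67],
    [261, 140, 4, 2931],
]

# Recall: correct predictions divided by all true members of the class (row sum).
recall = {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}

# Precision: correct predictions divided by all predictions of the class (column sum).
precision = {
    label: cm[i][i] / sum(row[i] for row in cm) for i, label in enumerate(labels)
}

accuracy = sum(cm[i][i] for i in range(len(labels))) / sum(sum(row) for row in cm)
macro_recall = sum(recall.values()) / len(labels)

for label in labels:
    print(f"{label}: precision={precision[label]:.2f} recall={recall[label]:.2f}")
print(f"accuracy={accuracy * 100:.2f} macro recall={macro_recall:.2f}")
```

The Medical row explains the low recall for that class: most of its errors are titles predicted as Business (146) or Entertainment (151).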


#### Consistency Evaluation

Create the evaluation functions:

In [33]:
def evaluation(X, Y, model):
    prediction = model.predict(X)
    precision = precision_score(Y, prediction, average='macro') * 100
    recall = recall_score(Y, prediction, average='macro') * 100
    f1 = f1_score(Y, prediction, average='macro') * 100
    accuracy = accuracy_score(Y,prediction) * 100
    return {
        "f1" : f1,
        "precision" : precision,
        "recall" : recall,
        "accuracy" : accuracy
    }

def calculation_metrics(xTrain, yTrain, xTest, yTest, model):
    model.fit(xTrain, yTrain)
    train_error = evaluation(xTrain, yTrain, model)
    validation_error = evaluation(xTest, yTest, model)
    return train_error, validation_error

Run the evaluation:

In [34]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True, random_state=42)
data = df['News Title Processed']
target = df['Category']
train_error = []
validation_error = []
for trainIndex, valIndex in kf.split(data, target):
    xTrain, xTest = data.iloc[trainIndex], data.iloc[valIndex]
    yTrain, yTest = target.iloc[trainIndex], target.iloc[valIndex]
    
    trainError, valError = calculation_metrics(xTrain, yTrain, xTest, yTest, modelNB)
    
    train_error.append(trainError)
    validation_error.append(valError)
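
Note that the loop reassigns `xTrain`, `xTest`, `yTrain`, and `yTest`, so after it finishes they hold the split of the last fold. The sketch below shows the fold sizes to expect for 65,535 rows, assuming the documented `KFold` rule that the first `n_samples % n_splits` folds get one extra sample. The helper name `fold_sizes` is hypothetical.

```python
def fold_sizes(n_samples, n_splits):
    """Validation-fold sizes: the first (n_samples % n_splits) folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_samples = 65535  # total rows, the sum of the four category counts
sizes = fold_sizes(n_samples, 10)

last_validation = sizes[-1]
last_train = n_samples - last_validation
print(sizes)
print(last_train, last_validation)
```

These last-fold sizes agree with the support totals (58982 and 6553) in the Random Forest and LinearSVC reports further down, which reuse the variables left over from this loop.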

In [35]:
dfKFold = pd.DataFrame({
    "Train Precision" : [item['precision'] for item in train_error],
    "Train Recall" : [item['recall'] for item in train_error],
    "Train F1 Score" : [item['f1'] for item in train_error],
    "Train Accuracy" : [item['accuracy'] for item in train_error],
    "-" : ['-' for item in train_error],
    "Test Precision" : [item['precision'] for item in validation_error],
    "Test Recall" : [item['recall'] for item in validation_error],
    "Test F1 Score" : [item['f1'] for item in validation_error],
    "Test Accuracy" : [item['accuracy'] for item in validation_error],
})
additional = []
for item in dfKFold:
    if item != '-':
        additional.append(dfKFold[item].mean())
    else:
        additional.append('-')
dfKFold = pd.concat([dfKFold,pd.DataFrame(
    [additional],
    index=['Average'], columns=dfKFold.columns
)])
dfKFold

Unnamed: 0,Train Precision,Train Recall,Train F1 Score,Train Accuracy,-,Test Precision,Test Recall,Test F1 Score,Test Accuracy
0,93.60794,90.317284,91.690165,92.821417,-,91.344666,87.186872,88.837091,90.341776
1,93.494421,90.196505,91.573797,92.689171,-,91.490494,87.227714,88.897341,90.418065
2,93.473626,90.160268,91.542717,92.66713,-,91.791245,87.71737,89.332314,90.799512
3,93.629213,90.431155,91.774967,92.840067,-,90.709453,86.105106,87.863518,89.517852
4,93.533118,90.304472,91.659844,92.740035,-,90.968871,86.457902,88.171255,89.868782
5,93.506105,90.19148,91.576739,92.694381,-,91.069886,86.806554,88.456725,90.050359
6,93.569093,90.329853,91.688686,92.767285,-,91.172702,86.839065,88.505276,90.309782
7,93.518176,90.233668,91.60936,92.716422,-,91.77442,87.755571,89.362018,90.752327
8,93.588763,90.285149,91.664092,92.780848,-,90.962095,86.498312,88.240467,89.882497
9,93.548891,90.222823,91.608333,92.733376,-,91.443169,87.578594,89.156541,90.340302


## Random Forest

### Selecting Best Estimator

In [36]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

searchRF = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid={
        # compare the class weights calculated earlier against no weighting
        'class_weight': [
            {'Entertainment': 2.74, 'Business': 3.70, 'Technology': 3.91, 'Medical': 9.24},
            None,
        ]
    },
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
)

In [37]:
searchRF.fit(messages_tfidf, df['Category'])



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid

In [38]:
searchRF.best_estimator_

RandomForestClassifier(bootstrap=True,
                       class_weight={'Business': 3.7, 'Entertainment': 2.74,
                                     'Medical': 9.24, 'Technology': 3.91},
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=10, n_jobs=None, oob_score=False,
                       random_state=None, verbose=0, warm_start=False)

### Build Pipeline

In [39]:
from sklearn.ensemble import RandomForestClassifier

modelRF = Pipeline([
    ('bow', CountVectorizer()), 
    ('tfidf', TfidfTransformer()), 
    ('classifier', searchRF.best_estimator_),
])

### Model Training

In [40]:
modelRF.fit(xTrain, yTrain)

Pipeline(memory=None,
         steps=[('bow',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None...
                                                      'Entertainment': 2.74,
                                                      'Medical': 9.24,
                                                      'Technology': 3.91},
                                        criterion='gini', max_depth=None,
                                       

### Model Evaluation

#### Train Evaluation

In [41]:
prediction = modelRF.predict(xTrain)
print(classification_report(yTrain,prediction))
print('Accuracy = ', round(accuracy_score(yTrain,prediction)*100,2))
pd.DataFrame(confusion_matrix(yTrain, prediction))

               precision    recall  f1-score   support

     Business       0.99      0.99      0.99     15910
Entertainment       0.99      1.00      0.99     21570
      Medical       1.00      0.99      0.99      6408
   Technology       1.00      0.99      0.99     15094

     accuracy                           0.99     58982
    macro avg       0.99      0.99      0.99     58982
 weighted avg       0.99      0.99      0.99     58982

Accuracy =  99.24


Unnamed: 0,0,1,2,3
0,15815,43,12,40
1,48,21498,8,16
2,31,44,6326,7
3,99,91,11,14893


#### Test Evaluation

In [42]:
prediction = modelRF.predict(xTest)
print(classification_report(yTest,prediction))
print('Accuracy = ', round(accuracy_score(yTest,prediction)*100,2))
pd.DataFrame(confusion_matrix(yTest, prediction))

               precision    recall  f1-score   support

     Business       0.85      0.84      0.84      1797
Entertainment       0.87      0.95      0.91      2391
      Medical       0.91      0.78      0.84       683
   Technology       0.88      0.82      0.85      1682

     accuracy                           0.87      6553
    macro avg       0.88      0.85      0.86      6553
 weighted avg       0.87      0.87      0.87      6553

Accuracy =  86.92


Unnamed: 0,0,1,2,3
0,1509,133,23,132
1,63,2264,20,44
2,56,74,536,17
3,150,132,13,1387


#### Consistency Evaluation

In [43]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True, random_state=42)
data = df['News Title Processed']
target = df['Category']
train_error = []
validation_error = []
for trainIndex, valIndex in kf.split(data, target):
    xTrain, xTest = data.iloc[trainIndex], data.iloc[valIndex]
    yTrain, yTest = target.iloc[trainIndex], target.iloc[valIndex]
    
    trainError, valError = calculation_metrics(xTrain, yTrain, xTest, yTest, modelRF)
    
    train_error.append(trainError)
    validation_error.append(valError)

In [44]:
dfKFold = pd.DataFrame({
    "Train Precision" : [item['precision'] for item in train_error],
    "Train Recall" : [item['recall'] for item in train_error],
    "Train F1 Score" : [item['f1'] for item in train_error],
    "Train Accuracy" : [item['accuracy'] for item in train_error],
    "-" : ['-' for item in train_error],
    "Test Precision" : [item['precision'] for item in validation_error],
    "Test Recall" : [item['recall'] for item in validation_error],
    "Test F1 Score" : [item['f1'] for item in validation_error],
    "Test Accuracy" : [item['accuracy'] for item in validation_error],
})
additional = []
for item in dfKFold:
    if item != '-':
        additional.append(dfKFold[item].mean())
    else:
        additional.append('-')
dfKFold = pd.concat([dfKFold,pd.DataFrame(
    [additional],
    index=['Average'], columns=dfKFold.columns
)])
dfKFold

Unnamed: 0,Train Precision,Train Recall,Train F1 Score,Train Accuracy,-,Test Precision,Test Recall,Test F1 Score,Test Accuracy
0,99.24569,99.141241,99.192369,99.260779,-,86.388236,84.98196,85.612725,86.985047
1,99.331641,99.214639,99.272269,99.30147,-,86.811729,85.011923,85.793283,87.076594
2,99.254847,99.132664,99.192588,99.259083,-,86.843716,85.069504,85.815386,87.137626
3,99.313481,99.17428,99.242744,99.284515,-,86.985184,85.125199,85.913713,87.000305
4,99.212078,98.994565,99.101491,99.167529,-,87.489156,85.014542,86.038517,86.8935
5,99.261255,99.093106,99.175763,99.248923,-,86.020341,84.747875,85.266985,86.433694
6,99.305587,99.119327,99.210976,99.267573,-,86.640203,84.223875,85.248086,86.677857
7,99.305524,99.150232,99.22679,99.2947,-,87.114176,85.326317,86.093862,87.059362
8,99.266077,99.104423,99.18361,99.240446,-,86.922,85.149718,85.925443,86.967801
9,99.281652,99.206153,99.24283,99.284527,-,86.807777,84.915958,85.749799,86.8915


## Linear SVC

### Build Pipeline

In [45]:
from sklearn.svm import LinearSVC

modelSVC = Pipeline([
    ('bow', CountVectorizer()), 
    ('tfidf', TfidfTransformer()), 
    ('classifier', LinearSVC()),
])

### Model Training

In [46]:
modelSVC.fit(xTrain, yTrain)

Pipeline(memory=None,
         steps=[('bow',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('classifier',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
               

### Model Evaluation

#### Train Evaluation

In [47]:
prediction = modelSVC.predict(xTrain)
print(classification_report(yTrain,prediction))
print('Accuracy = ', round(accuracy_score(yTrain,prediction)*100,2))
pd.DataFrame(confusion_matrix(yTrain, prediction))

               precision    recall  f1-score   support

     Business       0.97      0.97      0.97     15910
Entertainment       0.99      0.99      0.99     21570
      Medical       0.99      0.98      0.98      6408
   Technology       0.97      0.97      0.97     15094

     accuracy                           0.98     58982
    macro avg       0.98      0.98      0.98     58982
 weighted avg       0.98      0.98      0.98     58982

Accuracy =  98.07


Unnamed: 0,0,1,2,3
0,15410,56,46,398
1,61,21446,14,49
2,70,32,6287,19
3,314,64,13,14703


#### Test Evaluation

In [48]:
prediction = modelSVC.predict(xTest)
print(classification_report(yTest,prediction))
print('Accuracy = ', round(accuracy_score(yTest,prediction)*100,2))
pd.DataFrame(confusion_matrix(yTest, prediction))

               precision    recall  f1-score   support

     Business       0.91      0.88      0.90      1797
Entertainment       0.96      0.97      0.97      2391
      Medical       0.93      0.90      0.91       683
   Technology       0.89      0.92      0.91      1682

     accuracy                           0.93      6553
    macro avg       0.92      0.92      0.92      6553
 weighted avg       0.93      0.93      0.93      6553

Accuracy =  92.61


Unnamed: 0,0,1,2,3
0,1587,44,23,143
1,29,2316,12,34
2,39,15,614,15
3,92,26,12,1552


#### Consistency Evaluation

In [49]:
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True, random_state=42)
data = df['News Title Processed']
target = df['Category']
train_error = []
validation_error = []
for trainIndex, valIndex in kf.split(data, target):
    xTrain, xTest = data.iloc[trainIndex], data.iloc[valIndex]
    yTrain, yTest = target.iloc[trainIndex], target.iloc[valIndex]
    
    trainError, valError = calculation_metrics(xTrain, yTrain, xTest, yTest, modelSVC)
    
    train_error.append(trainError)
    validation_error.append(valError)

In [50]:
dfKFold = pd.DataFrame({
    "Train Precision" : [item['precision'] for item in train_error],
    "Train Recall" : [item['recall'] for item in train_error],
    "Train F1 Score" : [item['f1'] for item in train_error],
    "Train Accuracy" : [item['accuracy'] for item in train_error],
    "-" : ['-' for item in train_error],
    "Test Precision" : [item['precision'] for item in validation_error],
    "Test Recall" : [item['recall'] for item in validation_error],
    "Test F1 Score" : [item['f1'] for item in validation_error],
    "Test Accuracy" : [item['accuracy'] for item in validation_error],
})
additional = []
for item in dfKFold:
    if item != '-':
        additional.append(dfKFold[item].mean())
    else:
        additional.append('-')
dfKFold = pd.concat([dfKFold,pd.DataFrame(
    [additional],
    index=['Average'], columns=dfKFold.columns
)])
dfKFold

Unnamed: 0,Train Precision,Train Recall,Train F1 Score,Train Accuracy,-,Test Precision,Test Recall,Test F1 Score,Test Accuracy
0,98.07833,97.992379,98.034884,98.085824,-,92.917284,91.818848,92.336836,92.889838
1,98.042789,97.939964,97.99091,98.046829,-,92.903415,92.054842,92.461756,92.996643
2,98.076351,97.968115,98.021736,98.072261,-,93.318331,92.553798,92.920031,93.454379
3,98.055213,97.961369,98.007895,98.070565,-,92.36405,91.703183,92.021599,92.538908
4,98.083407,97.983213,98.032875,98.089215,-,92.503927,91.835493,92.158099,92.783033
5,98.037804,97.931009,97.983874,98.051948,-,92.855027,92.270461,92.553167,92.949794
6,98.104026,98.000691,98.051916,98.09942,-,92.462066,92.014083,92.230701,92.919274
7,98.01861,97.919784,97.968791,98.029907,-,92.676439,91.919255,92.282903,92.842973
8,98.050947,97.962334,98.006237,98.072293,-,92.369544,91.670808,92.008426,92.58355
9,98.067423,97.950938,98.0086,98.073989,-,92.295409,91.836428,92.047813,92.61407


## Summary

Of the three models we built, LinearSVC has the highest test precision, recall, and accuracy across the 10 folds, so we choose LinearSVC as our model.
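
As a quick check of this comparison, the sketch below averages the per-fold test accuracy values copied from the three consistency tables. The dictionary keys are shorthand labels chosen here, not identifiers from the notebook.

```python
from statistics import mean

# Per-fold test accuracy (%) copied from the consistency evaluation tables.
test_accuracy = {
    "MultinomialNB": [90.341776, 90.418065, 90.799512, 89.517852, 89.868782,
                      90.050359, 90.309782, 90.752327, 89.882497, 90.340302],
    "RandomForest": [86.985047, 87.076594, 87.137626, 87.000305, 86.8935,
                     86.433694, 86.677857, 87.059362, 86.967801, 86.8915],
    "LinearSVC": [92.889838, 92.996643, 93.454379, 92.538908, 92.783033,
                  92.949794, 92.919274, 92.842973, 92.58355, 92.61407],
}

averages = {name: mean(scores) for name, scores in test_accuracy.items()}
best = max(averages, key=averages.get)

for name, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {avg:.2f}")
print("best:", best)
```

Random Forest has the highest training scores but the lowest test scores, which suggests it overfits the training folds more than the other two models.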

## API Documentation

You can request predictions through this link: http://khakimh-nlp.herokuapp.com/

> NOTE: The first request might take a while because the app runs on Heroku's free hosting, which puts the app to sleep after 30 minutes without use.

You can send a JSON body in this format:
```json
{
    "text" : "<your news title>",
    "type" : "news"
}
```

It will return something like:
```json
{
    "prediction": "<predicted category>"
}
```
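
A minimal client sketch using only the standard library is shown below. It assumes the API accepts a `POST` request with a JSON body at the root URL given above; the HTTP method, the path, and the `Content-Type` header are assumptions, since the text only specifies the payload and response shapes. The function names `build_request` and `predict` are hypothetical.

```python
import json
import urllib.request

API_URL = "http://khakimh-nlp.herokuapp.com/"  # link given in the text above


def build_request(title):
    """Build the request for one news title (assumes POST with a JSON body)."""
    payload = json.dumps({"text": title, "type": "news"}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(title, timeout=60.0):
    """Send the title to the API and return the value of the 'prediction' field."""
    # A generous timeout leaves room for the app to wake up from sleep.
    with urllib.request.urlopen(build_request(title), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["prediction"]


# Example call (requires network access and a running app):
# print(predict("Watch first 'Ninja Turtles' trailer"))
```

Separating `build_request` from `predict` makes it possible to inspect the payload without contacting the server.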