# Model Interpretation

Having selected the SVM as our preferred model, we will now study its behaviour by analyzing the articles it misclassifies. This should give us some insight into how the model works.

In [5]:
import pickle
import pandas as pd
import numpy as np
import random

Let's load what we need:

In [20]:
# Dataframe
path_df = "/Users/jimcody/Documents/2021Python/nlp/data/Pickles/df.pickle"
with open(path_df, 'rb') as data:
    df = pickle.load(data)
    
# X_train
path_X_train = "/Users/jimcody/Documents/2021Python/nlp/data/Pickles/X_train.pickle"
with open(path_X_train, 'rb') as data:
    X_train = pickle.load(data)

# X_test
path_X_test = "/Users/jimcody/Documents/2021Python/nlp/data/Pickles/X_test.pickle"
with open(path_X_test, 'rb') as data:
    X_test = pickle.load(data)

# y_train
path_y_train = "/Users/jimcody/Documents/2021Python/nlp/data/Pickles/y_train.pickle"
with open(path_y_train, 'rb') as data:
    y_train = pickle.load(data)

# y_test
path_y_test = "/Users/jimcody/Documents/2021Python/nlp/data/Pickles/y_test.pickle"
with open(path_y_test, 'rb') as data:
    y_test = pickle.load(data)

# features_train
path_features_train = "/Users/jimcody/Documents/2021Python/nlp/data/Pickles/features_train.pickle"
with open(path_features_train, 'rb') as data:
    features_train = pickle.load(data)

# labels_train
path_labels_train = "/Users/jimcody/Documents/2021Python/nlp/data/Pickles/labels_train.pickle"
with open(path_labels_train, 'rb') as data:
    labels_train = pickle.load(data)

# features_test
path_features_test = "/Users/jimcody/Documents/2021Python/nlp/data/Pickles/features_test.pickle"
with open(path_features_test, 'rb') as data:
    features_test = pickle.load(data)

# labels_test
path_labels_test = "/Users/jimcody/Documents/2021Python/nlp/data/Pickles/labels_test.pickle"
with open(path_labels_test, 'rb') as data:
    labels_test = pickle.load(data)
    
# SVM Model
path_model = "/Users/jimcody/Documents/2021Python/nlp/data/Pickles/best_svc.pickle"
with open(path_model, 'rb') as data:
    svc_model = pickle.load(data)
    
# Category mapping dictionary
category_codes = {
    'business': 0,
    'entertainment': 1,
    'politics': 2,
    'sport': 3,
    'tech': 4
}

category_names = {
    0: 'business',
    1: 'entertainment',
    2: 'politics',
    3: 'sport',
    4: 'tech'
}

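FileNotFoundError: [Errno 2] No such file or directory: '/Users/jimcody/Documents/2021Python/nlp/data/Pickles/best_svc.pickle'

This error means `best_svc.pickle` was not found at `path_model`. The cells below need `svc_model`, so make sure the file exists at that path before continuing.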
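The repeated load blocks in the cell above could be condensed into a single helper. The following is a minimal sketch, assuming every object is stored as `<name>.pickle` in one directory; `load_pickles` is a hypothetical helper, not part of the original notebook. It reports missing files instead of stopping at the first one, and the demo uses a temporary directory rather than the real path.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickles(directory, names):
    """Load `<name>.pickle` for each name; collect missing paths instead of raising."""
    loaded, missing = {}, []
    for name in names:
        path = Path(directory) / f"{name}.pickle"
        if path.exists():
            with open(path, "rb") as handle:
                loaded[name] = pickle.load(handle)
        else:
            missing.append(path)
    return loaded, missing


# Demo in a temporary directory: one file exists, one does not.
with tempfile.TemporaryDirectory() as tmp:
    with open(Path(tmp) / "category_names.pickle", "wb") as handle:
        pickle.dump({0: "business"}, handle)
    objects, missing = load_pickles(tmp, ["category_names", "best_svc"])

print(sorted(objects))               # names that were loaded
print([p.name for p in missing])     # files that were not found
```

In the notebook, the same call would take the Pickles directory and the list of names used above (`df`, `X_test`, `features_test`, `best_svc`, and so on), and a non-empty `missing` list would flag the problem before any prediction is attempted.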

Let's get the predictions on the test set:

In [7]:
predictions = svc_model.predict(features_test)

Now we'll create the Test Set dataframe with the actual and predicted categories:

In [8]:
# Index labels of the test set
index_X_test = X_test.index

# Select the matching rows from the original df; .copy() avoids
# pandas' SettingWithCopyWarning when we add columns below
df_test = df.loc[index_X_test].copy()

# Add the predicted category codes
df_test['Prediction'] = predictions

# Decode the predicted codes into category names
df_test['Category_Predicted'] = df_test['Prediction'].map(category_names)

# Keep only the columns needed for the analysis
df_test = df_test[['Content', 'Category', 'Category_Predicted']]

In [9]:
df_test.head()

Index,Content,Category,Category_Predicted
1691,Ireland call up uncapped Campbell\n\nUlster sc...,sport,sport
1103,Gurkhas to help tsunami victims\n\nBritain has...,politics,business
477,Egypt and Israel seal trade deal\n\nIn a sign ...,business,business
197,Cairn shares up on new oil find\n\nShares in C...,business,business
475,Saudi NCCI's shares soar\n\nShares in Saudi Ar...,business,business


Let's get the misclassified articles:

In [10]:
condition = (df_test['Category'] != df_test['Category_Predicted'])

df_misclassified = df_test[condition]

df_misclassified.head(3)

Index,Content,Category,Category_Predicted
1103,Gurkhas to help tsunami victims\n\nBritain has...,politics,business
1880,Half-Life 2 sweeps Bafta awards\n\nPC first pe...,tech,entertainment
627,REM concerts blighted by illness\n\nUS rock ba...,entertainment,sport
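Before reading individual articles, it can help to see which category pairs are confused most often. The sketch below is one way to do that with `pd.crosstab`; it uses a toy frame built only from the seven rows displayed in the outputs above (`toy_test` is a stand-in name), so the counts are illustrative and not the real test-set results.

```python
import pandas as pd

# Toy stand-in for df_test, built from the rows shown in the outputs above.
toy_test = pd.DataFrame(
    {
        "Category": ["sport", "politics", "business", "business",
                     "business", "tech", "entertainment"],
        "Category_Predicted": ["sport", "business", "business", "business",
                               "business", "entertainment", "sport"],
    },
    index=[1691, 1103, 477, 197, 475, 1880, 627],
)

# Confusion table: rows are actual categories, columns are predicted ones.
confusion = pd.crosstab(toy_test["Category"], toy_test["Category_Predicted"])

# Keep only the errors and count each (actual, predicted) pair.
errors = toy_test[toy_test["Category"] != toy_test["Category_Predicted"]]
error_pairs = (
    errors.groupby(["Category", "Category_Predicted"])
    .size()
    .sort_values(ascending=False)
)

print(confusion)
print(error_pairs)
```

Applied to the full `df_test`, the same two calls would show whether the errors concentrate on a few pairs of categories or are spread evenly.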


Let's look at a sample of 3 misclassified articles. We'll define a helper function that prints the actual category, the predicted category and the text of an article:

In [11]:
def output_article(row_article):
    """Print the actual category, predicted category and text of one article row."""
    print(f"Actual Category: {row_article['Category']}")
    print(f"Predicted Category: {row_article['Category_Predicted']}")
    print('-------------------------------------------')
    print('Text: ')
    print(row_article['Content'])

We'll draw three random index labels from the misclassified articles, fixing the seed so that the sample is reproducible:

In [16]:
random.seed(8)
list_samples = random.sample(list(df_misclassified.index), 3)
list_samples

[2003, 1339, 431]
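Fixing the seed is what makes the sample repeatable across runs. A small sketch of the same idea follows, using a private `random.Random` generator so the global random state is left untouched; `draw_sample` is a hypothetical helper, and the labels are the misclassified indexes that appear in the outputs above, not the full list.

```python
import random

# Index labels shown as misclassified above; the real list would be
# list(df_misclassified.index).
misclassified_index = [1103, 1880, 627, 2003, 1339, 431]


def draw_sample(labels, k, seed):
    """Draw k labels without replacement using a private, seeded generator."""
    rng = random.Random(seed)  # does not touch the global random state
    return rng.sample(labels, k)


first = draw_sample(misclassified_index, 3, seed=8)
second = draw_sample(misclassified_index, 3, seed=8)

# The same seed and the same input list give the same sample.
print(first == second)
```

Because this toy list is shorter than the real index, the labels drawn here will generally differ from the ones in the notebook output.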

First case:

In [17]:
output_article(df_misclassified.loc[list_samples[0]])

Actual Category: tech
Predicted Category: business
-------------------------------------------
Text: 
US blogger fired by her airline

A US airline attendant suspended over "inappropriate images" on her blog - web diary - says she has been fired.

Ellen Simonetti, known as Queen of the Sky, wrote an anonymous semi-fictional account of her life in the sky. She was suspended by Delta in September. In a statement, she said she was initiating legal action against the airline for "wrongful termination". A Delta spokesperson confirmed on Wednesday that Ms Simonetti was no longer an employee. Delta has repeatedly declined to elaborate on what it calls "internal employee matters". A spokesperson reiterated this position on Wednesday, confirming only that Ms Simonetti was no longer with the company. The spokesperson also confirmed that there were "very clear rules" attached to the unauthorised use of Delta branding, including uniforms. Ms Simonetti announced on her blog she had been fired on 1 

Second case:

In [18]:
output_article(df_misclassified.loc[list_samples[1]])

Actual Category: sport
Predicted Category: entertainment
-------------------------------------------
Text: 
Holmes feted with further honour

Double Olympic champion Kelly Holmes has been voted European Athletics (EAA) woman athlete of 2004 in the governing body's annual poll.

The Briton, made a dame in the New Year Honours List for taking 800m and 1,500m gold, won vital votes from the public, press and EAA member federations. She is only the second British woman to land the title after- Sally Gunnell won for her world 400m hurdles win in 1993. Swedish triple jumper Christian Olsson was voted male athlete of the year. The accolade is the latest in a long list of awards that Holmes has received since her success in Athens. In addition to becoming a dame, she was also named the BBC Sports Personality of the Year in December. Her gutsy victory in the 800m also earned her the International Association of Athletics Federations' award for the best women's performance in the world for 2004. 

Third case:

In [19]:
output_article(df_misclassified.loc[list_samples[2]])

Actual Category: business
Predicted Category: politics
-------------------------------------------
Text: 
BA to suspend two Saudi services

British Airways is to halt its flights from London Heathrow to Jeddah and Riyadh in Saudi Arabia from 27 March.


BA will now suspend the Saudi flights - which it says will remain "under constant review" - from 27 March. "The decision to suspend flights between the UK and Saudi Arabia is a difficult one to make as we have enjoyed a long history of flying between the two countries," said BA director of commercial planning, Robert Boyle. "However, the routes don't currently make a profitable contribution to our business and we are unable to sustain them while this remains the case." Passengers with flights booked after the suspension date will be contacted by BA for alternative arrangements to be made.


We can see that in all three cases the category is not clear-cut, since each article contains concepts from both the actual and the predicted category. Errors of this kind will always happen, and we are not aiming to be 100% accurate on them.
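One way to check whether an article is genuinely borderline is to look at the gap between the two highest class scores returned by the SVM's `decision_function`. The sketch below illustrates the idea on a hypothetical toy corpus with a freshly fitted `SVC`; it assumes nothing about how `svc_model` was actually configured, and in the notebook the equivalent call would be `svc_model.decision_function(features_test)`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus; labels reuse the codes from category_codes
# (0 = business, 2 = politics, 3 = sport).
texts = [
    "shares profit market airline", "bank profit shares growth",
    "election minister parliament vote", "minister government vote policy",
    "match goal champion athlete", "athlete olympic gold match",
]
labels = [0, 0, 2, 2, 3, 3]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)
toy_svc = SVC(kernel="linear", decision_function_shape="ovr").fit(features, labels)

# An article that mixes business and politics vocabulary.
mixed = vectorizer.transform(["airline profit government policy"])
scores = toy_svc.decision_function(mixed)[0]  # one score per class

# Gap between the two highest scores: a small gap suggests an ambiguous article.
top_two = np.sort(scores)[-2:]
margin = top_two[1] - top_two[0]

# Classes ordered from most to least likely according to the scores.
ranked = toy_svc.classes_[np.argsort(scores)[::-1]]
print(ranked, margin)
```

Sorting the misclassified articles by this margin would separate near-ties like the three cases above from confident mistakes, which are the ones more likely to point at a real weakness in the features.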