In [1]:
# Imports
import gzip
import os
from helper import *
import pandas as pd
import numpy as np
# Standard plotly imports
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
# Using plotly + cufflinks in offline mode
import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)

Vocabulary constructed


# Data Screening

## label2answer 

We start by comparing files whose names differ only by 'raw' or 'token', to see how their contents differ. We begin with the label2answer files.

In [2]:
# Read label2answer files, raw and token
raw_label2answer = read_data(PATH + 'InsuranceQA.label2answer.raw.encoded.gz', "label2answer")
token_label2answer = read_data(PATH + 'InsuranceQA.label2answer.token.encoded.gz', "label2answer")

Reading label2answer data. 
This data format is: <Answer Label><TAB><Answer Text> 

Reading label2answer data. 
This data format is: <Answer Label><TAB><Answer Text> 



These files contain two columns: the answer label and the answer text, i.e. every answer is assigned a number.<br>
Each answer is stored as a list of indexes, each of which is a key in the vocabulary dictionary. In helper.py we defined functions that manipulate these indexes and map them to words.
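
As an illustration of the index-to-word mapping, here is a minimal sketch. The names `toy_vocab` and `lookup_words` are hypothetical and are not the helper.py implementation; the four index/word pairs are the ones visible in the first answer shown later in this notebook.

```python
# Hypothetical toy vocabulary: index key -> word.
# The real vocabulary is built in helper.py; this is only an illustration.
toy_vocab = {"idx_1": "Coverage", "idx_2": "follows", "idx_3": "the", "idx_12": "car"}

def lookup_words(idx_list, vocab, unknown="<UNK>"):
    """Map a list of index keys to words, using a placeholder for missing keys."""
    return [vocab.get(idx, unknown) for idx in idx_list]

decoded = lookup_words(["idx_1", "idx_2", "idx_3", "idx_12"], toy_vocab)
print(" ".join(decoded))
```

Returning a placeholder for unknown keys is a design choice of this sketch; the helper functions may handle missing keys differently.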

In [3]:
# Select an example from the raw_label2answer data
' '.join(convert_from_idx_str(raw_label2answer[100][1]))

'If the primary beneficiary dies before the policy owner dies, the contingent beneficiary would be the next in line for benefits. If the policy owner is well, he or she would simply call the insurance company and request a "change of beneficiary" form and rename a new primary beneficiary. If it is the contingent who becomes primary then the owner would rename a new contingent beneficiary.'

In [4]:
# Select an example from the token_label2answer data
' '.join(convert_from_idx_str(token_label2answer[100][1]))

"If the primary beneficiary dies before the policy owner dies , the contingent beneficiary would be the next in line for benefits . If the policy owner is well , he or she would simply call the insurance company and request a `` change of beneficiary '' form and rename a new primary beneficiary . If it is the contingent who becomes primary then the owner would rename a new contingent beneficiary ."

As you can see above, the 'raw' file contains the data as entered by the users. The 'token' file contains the processed data (cleaned, with words separated from punctuation), ready for vectorization, i.e. ready for embedding.
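
To give a feel for what separating words from punctuation involves, here is a rough sketch using a regular expression. It is an assumption-laden approximation, not the tokenizer that produced the 'token' files: it does not reproduce conventions seen in that output such as `` and '' for quotes.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (rough approximation)."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

raw = "If the policy owner is well, he or she would simply call the insurance company."
tokens = simple_tokenize(raw)
print(" ".join(tokens))
```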

For visualisation and easier manipulation of the data, we will use pandas DataFrames for our analysis.

In [5]:
# DataFrame that contains the answer labels and the answer indexes
# We will consider the data in the 'token' files
l2a = pd.DataFrame(read_data(PATH + 'InsuranceQA.label2answer.token.encoded.gz', "label2answer"),
                   columns=['answer_label', 'answer_idx'])
# To keep everything in the same DataFrame, add the decoded answer text as a column
l2a['answer_text'] = l2a['answer_idx'].apply(convert_from_idx_str)

l2a.head()

Reading label2answer data. 
This data format is: <Answer Label><TAB><Answer Text> 



Unnamed: 0,answer_label,answer_idx,answer_text
0,1,"[idx_1, idx_2, idx_3, idx_12, idx_1305, idx_5,...","[Coverage, follows, the, car, ., Example, 1, :..."
1,2,"[idx_124, idx_107, idx_11, idx_125, idx_757, i...","[That, is, a, great, question, !, One, I, 'm, ..."
2,3,"[idx_7, idx_8, idx_77, idx_292, idx_97, idx_66...","[If, you, are, applying, for, Medicaid, ,, lif..."
3,4,"[idx_315, idx_3, idx_294, idx_20, idx_316, idx...","[Calling, the, life, insurance, company, throu..."
4,5,"[idx_76, idx_341, idx_41, idx_11, idx_342, idx...","[The, cost, of, a, Medigap, plan, is, differen..."


In the helper.py file we defined a function that converts a list of indexes into the associated sentence (using the vocabulary dictionary).

**Stats and visualization on the label2answer data**

In [6]:
print("The number of answers is ", len(l2a))

The number of answers is  27413


In [7]:
# Adding the length of the answers (number of tokens)
l2a['answer_length'] = l2a['answer_text'].apply(len)
l2a.head()

Unnamed: 0,answer_label,answer_idx,answer_text,answer_length
0,1,"[idx_1, idx_2, idx_3, idx_12, idx_1305, idx_5,...","[Coverage, follows, the, car, ., Example, 1, :...",235
1,2,"[idx_124, idx_107, idx_11, idx_125, idx_757, i...","[That, is, a, great, question, !, One, I, 'm, ...",424
2,3,"[idx_7, idx_8, idx_77, idx_292, idx_97, idx_66...","[If, you, are, applying, for, Medicaid, ,, lif...",71
3,4,"[idx_315, idx_3, idx_294, idx_20, idx_316, idx...","[Calling, the, life, insurance, company, throu...",66
4,5,"[idx_76, idx_341, idx_41, idx_11, idx_342, idx...","[The, cost, of, a, Medigap, plan, is, differen...",215


In [8]:
# Summary statistics of the answer_length column
l2a[['answer_length']].describe()

Unnamed: 0,answer_length
count,27413.0
mean,111.826214
std,77.166811
min,16.0
25%,67.0
50%,87.0
75%,125.0
max,1335.0


The mean answer length is 111.8 tokens and the minimum is 16, which means the answers are quite long. Let's investigate the distribution further.

In [9]:
# Scatter plot of the answer lengths
l2a['answer_length'].iplot(kind='scatter', xTitle='answer', yTitle='length',
                           title='Answers length scatter plot')

The scatter plot above suggests that the answers are quite long. This may be due to the nature of the dataset: answers about insurance need to be precise and well formulated in order to explain things clearly to clients. It may also be because collaborators use many formal expressions when talking to clients.

In [10]:
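# The distribution of the length of answers
# histnorm='density' normalizes the histogram, so the y axis shows a density rather than a raw count
l2a['answer_length'].iplot(kind='hist', xTitle='length', histnorm='density',
                           yTitle='density', title='Answers length Distribution')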

The distribution is skewed to the right, which is consistent with the mean length (111.8) being larger than the median (87). Besides, the data points on the far right of the distribution are barely visible, which suggests they may be outliers. Let's use a box plot with the 'suspectedoutliers' option to highlight them.
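
The relationship between right skew, mean and median can be checked numerically. The sketch below uses a toy series made of five values taken from the summary statistics above (min, quartiles, max); it is not the real data, only an illustration of the check.

```python
import pandas as pd

# Toy sample (NOT the real data): min, 25%, 50%, 75% and max from the describe() output
toy_lengths = pd.Series([16, 67, 87, 125, 1335])

toy_mean = toy_lengths.mean()      # pulled upwards by the single very long answer
toy_median = toy_lengths.median()  # robust to that extreme value
toy_skew = toy_lengths.skew()      # sample skewness; positive means a longer right tail

print(toy_mean, toy_median, toy_skew > 0)
```

On the real data, the same three calls on `l2a['answer_length']` would give the actual values.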

In [11]:
# Box plot of the answer lengths with boxpoints set to 'suspectedoutliers'
l2a[['answer_length']].iplot(kind='box', mode='lines', boxpoints = 'suspectedoutliers')

Judging from the box plot, we do have outliers. Let's take a closer look.
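
A common numeric counterpart to the box plot is the 1.5 × IQR rule. The sketch below applies it to the quartiles reported by `describe()` above and to a small toy series. The function name `iqr_outliers` is hypothetical, and the 1.5 factor is the usual convention, which is not necessarily the threshold Plotly uses for 'suspectedoutliers'.

```python
import pandas as pd

# Quartiles taken from the describe() output above
q1, q3 = 67.0, 125.0
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(lower_fence, upper_fence)  # -20.0 212.0

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    s_q1, s_q3 = series.quantile(0.25), series.quantile(0.75)
    s_iqr = s_q3 - s_q1
    return series[(series < s_q1 - k * s_iqr) | (series > s_q3 + k * s_iqr)]

# Toy series (NOT the real data) with one extreme value
toy = pd.Series([60, 70, 80, 90, 100, 1335])
print(list(iqr_outliers(toy)))
```

Under this rule, any answer longer than 212 tokens would be flagged, which includes the maximum of 1335.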

In [12]:
# Let's look at the answers with length > 1000
list(l2a[l2a['answer_length'] > 1000]['answer_text'].apply(lambda q: ' '.join(q)))[0]

'Gap insurance is a coverage that is offered either on your auto policy or through the auto finance company as an option -LRB- or requirement -RRB- on your loan . Basically , what it does is provide you a clean slate if you total a car with a loan on it . Lets look at an example : Erica buys a 2009 Volkswagon Jetta from a used car shop for $ 14,000 . The used car value could be 12,500 depending on what form you use -LRB- blackbook value , NADA , Kelly Bluebook value , etc. -RRB- The form doesnt matter here other than answering the question of is this car worth less than the loan I have on it ? Erica is required to have other-than-collision coverage -LRB- often called comprehensive coverage , which is a misnomer -RRB- and collision coverage as a stipulation for the loan she is getting with the bank or finance company . That coverage states that it will repair her car for specific reasons in the policy OR pay her the actual cash value of the car if it costs less than the repairs it would

As the example above shows, insurance answers can be quite technical, so to make sure the information is conveyed clearly to clients, collaborators use examples and create scenarios.

In [13]:
# Let's take an answer with small length
list(l2a[l2a['answer_length'] < 40]['answer_text'].apply(lambda q: ' '.join(q)))[0]

'Disability Insurance is paycheck protection . When you insure yourself against sickness or injury , youre valuing your ability to earn money for your family and/or business partners , who can benefit if youre unable to work .'

Other questions are straightforward, so the answer is relatively short.
<br>
However, the answer is always related to the question, so a very long answer may mean that the question covers a large number of subjects. Let's look at the questions data.

## label2question

In [14]:
# Reading the questions files, raw and token 
raw_label2question = read_data(PATH + 'InsuranceQA.question.anslabel.raw.encoded.gz',"question.anslabel")
token_label2question = read_data(PATH + 'InsuranceQA.question.anslabel.token.encoded.gz',"question.anslabel")

Reading questions.anslabel data.
This data foramt is: <Domain><TAB><QUESTION><TAB><Groundtruth>

Reading questions.anslabel data.
This data foramt is: <Domain><TAB><QUESTION><TAB><Groundtruth>



In [15]:
raw_label2question[0]

['medicare-insurance',
 ['idx_1285', 'idx_1010', 'idx_467', 'idx_47610', 'idx_18488', 'idx_65760'],
 ['16696']]

Here each label2question record contains, in the first position, the domain of the question, followed by a list of indexes (again, keys of words in the vocabulary dictionary) and a list of ground truths, which are the labels of the answers. This means that a question can have multiple answers.
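
The record layout can be illustrated with a small parser for one line in the `<Domain><TAB><QUESTION><TAB><Groundtruth>` format. This is a sketch, not the `read_data` implementation: the function name `parse_question_line` is hypothetical, and it assumes that indexes and labels inside a field are separated by spaces.

```python
def parse_question_line(line):
    """Parse one '<Domain>TAB<QUESTION>TAB<Groundtruth>' line into [domain, idx list, label list]."""
    domain, question, ground_truth = line.rstrip("\n").split("\t")
    # Assumption: items inside a field are whitespace-separated
    return [domain, question.split(), ground_truth.split()]

# Hypothetical sample line built from the first record shown above (question shortened)
sample = "medicare-insurance\tidx_1285 idx_1010 idx_467\t16696\n"
record = parse_question_line(sample)
print(record)
```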

In [16]:
' '.join(convert_from_idx_str(raw_label2question[0][1]))

'What Does Medicare IME Stand For?'

In [17]:
' '.join(convert_from_idx_str(token_label2question[0][1]))

'What Does Medicare IME Stand For ?'

The same goes for the label2question files: the 'raw' files contain the questions as entered by clients and the 'token' files contain the versions processed for machine learning purposes.

In [18]:
# Construct a DataFrame from the tokenized label2question data
l2q = pd.DataFrame(read_data(PATH + 'InsuranceQA.question.anslabel.token.encoded.gz',"question.anslabel"), 
                                               columns = ['domain', 'questions_idx', 'groundTruth_labels'])
l2q.head()

Reading questions.anslabel data.
This data foramt is: <Domain><TAB><QUESTION><TAB><Groundtruth>



Unnamed: 0,domain,questions_idx,groundTruth_labels
0,medicare-insurance,"[idx_1285, idx_1010, idx_467, idx_47610, idx_1...",[16696]
1,long-term-care-insurance,"[idx_3815, idx_604, idx_605, idx_891, idx_136,...",[10277]
2,health-insurance,"[idx_3019, idx_55039, idx_27647, idx_60975, id...",[12076]
3,medicare-insurance,"[idx_3815, idx_467, idx_34801, idx_1655, idx_7...","[25578, 6215]"
4,medicare-insurance,"[idx_1010, idx_467, idx_21593, idx_64564, idx_...",[22643]


To make the data easier to read, we will append the plain text associated with the indexes.

In [19]:
# Add the questions plain text 
l2q['questions_text'] = l2q['questions_idx'].apply(lambda q: ' '.join(convert_from_idx_str(q)))
l2q.head()

Unnamed: 0,domain,questions_idx,groundTruth_labels,questions_text
0,medicare-insurance,"[idx_1285, idx_1010, idx_467, idx_47610, idx_1...",[16696],What Does Medicare IME Stand For ?
1,long-term-care-insurance,"[idx_3815, idx_604, idx_605, idx_891, idx_136,...",[10277],Is Long Term Care Insurance Tax Free ?
2,health-insurance,"[idx_3019, idx_55039, idx_27647, idx_60975, id...",[12076],Can Husband Drop Wife From Health Insurance ?
3,medicare-insurance,"[idx_3815, idx_467, idx_34801, idx_1655, idx_7...","[25578, 6215]",Is Medicare Run By The Government ?
4,medicare-insurance,"[idx_1010, idx_467, idx_21593, idx_64564, idx_...",[22643],Does Medicare Cover Co-Pays ?


Next, we will handle groundTruth_labels entries with multiple values. We will explode those rows into one row per label and duplicate the values of the remaining columns.

In [20]:
def split_data_frame_list(df, target_column):
    """
    Splits a column with lists into rows
    
    Keyword arguments:
        df -- dataframe
        target_column -- name of column that contains lists        
    """
    # create a new dataframe with each list item in a separate column, dropping rows with missing values
    col_df = pd.DataFrame(df[target_column].dropna().tolist(), index=df[target_column].dropna().index)

    # stack the columns into rows, giving a series with a two-level index
    stacked = col_df.stack()

    # wrap the stacked series in a dataframe whose single column is named after target_column
    new_df = stacked.to_frame(name=target_column)
    return new_df

In [21]:
# Explode the rows that contain more than one element in the groundTruth_labels list
df_groundTruth = split_data_frame_list(l2q, 'groundTruth_labels').reset_index().drop('level_1', axis = 1)
df_groundTruth.head()

Unnamed: 0,level_0,groundTruth_labels
0,0,16696
1,1,10277
2,2,12076
3,3,25578
4,3,6215


In [22]:
# Merge df_groundTruth with the l2q dataframe
l2q = df_groundTruth.merge(l2q, left_on = 'level_0', right_index = True).drop(['level_0'], axis = 1)
l2q.head()

Unnamed: 0,groundTruth_labels_x,domain,questions_idx,groundTruth_labels_y,questions_text
0,16696,medicare-insurance,"[idx_1285, idx_1010, idx_467, idx_47610, idx_1...",[16696],What Does Medicare IME Stand For ?
1,10277,long-term-care-insurance,"[idx_3815, idx_604, idx_605, idx_891, idx_136,...",[10277],Is Long Term Care Insurance Tax Free ?
2,12076,health-insurance,"[idx_3019, idx_55039, idx_27647, idx_60975, id...",[12076],Can Husband Drop Wife From Health Insurance ?
3,25578,medicare-insurance,"[idx_3815, idx_467, idx_34801, idx_1655, idx_7...","[25578, 6215]",Is Medicare Run By The Government ?
4,6215,medicare-insurance,"[idx_3815, idx_467, idx_34801, idx_1655, idx_7...","[25578, 6215]",Is Medicare Run By The Government ?


Now every possible answer to a question is treated as a separate question-answer pair.
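
For comparison, recent pandas versions (0.25 and later) offer `DataFrame.explode`, which produces one row per list element in a single call. The sketch below uses a toy frame with two of the questions shown above; it is an alternative to the split-and-merge approach, not the method used in this notebook.

```python
import pandas as pd

# Toy frame mirroring the structure of l2q (two questions, one with two ground truths)
toy = pd.DataFrame({
    "questions_text": ["Is Medicare Run By The Government ?", "Does Medicare Cover Co-Pays ?"],
    "groundTruth_labels": [["25578", "6215"], ["22643"]],
})

# One row per (question, ground-truth label) pair; other columns are duplicated
exploded = toy.explode("groundTruth_labels").reset_index(drop=True)
print(exploded)
```

With this approach the exploded column keeps its original name, so no `_x` / `_y` suffixes appear.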

In [23]:
l2q['groundTruth_labels_x'].unique()

array(['16696', '10277', '12076', ..., '6453', '4693', '13116'],
      dtype=object)

In [24]:
# Answer labels start at 1, so the answer with label t is stored at list position t - 1
l2q['groundTruth_text'] = l2q['groundTruth_labels_x'].apply(lambda t:
                                ' '.join(convert_from_idx_str(token_label2answer[int(t)-1][1])))

In [25]:
l2q.head()

Unnamed: 0,groundTruth_labels_x,domain,questions_idx,groundTruth_labels_y,questions_text,groundTruth_text
0,16696,medicare-insurance,"[idx_1285, idx_1010, idx_467, idx_47610, idx_1...",[16696],What Does Medicare IME Stand For ?,According to the Centers for Medicare and Medi...
1,10277,long-term-care-insurance,"[idx_3815, idx_604, idx_605, idx_891, idx_136,...",[10277],Is Long Term Care Insurance Tax Free ?,"As a rule , if you buy a tax qualified long te..."
2,12076,health-insurance,"[idx_3019, idx_55039, idx_27647, idx_60975, id...",[12076],Can Husband Drop Wife From Health Insurance ?,Can a spouse drop another spouse from health i...
3,25578,medicare-insurance,"[idx_3815, idx_467, idx_34801, idx_1655, idx_7...","[25578, 6215]",Is Medicare Run By The Government ?,Medicare Part A and Part B is provided by the ...
4,6215,medicare-insurance,"[idx_3815, idx_467, idx_34801, idx_1655, idx_7...","[25578, 6215]",Is Medicare Run By The Government ?,Definitely . It is ran by the Center for Medic...


In [26]:
# Reorder the columns for better comprehension
l2q = l2q[['domain', 'questions_idx', 'groundTruth_labels_y', 'groundTruth_labels_x', 'questions_text', 'groundTruth_text']]

In [27]:
l2q.head()

Unnamed: 0,domain,questions_idx,groundTruth_labels_y,groundTruth_labels_x,questions_text,groundTruth_text
0,medicare-insurance,"[idx_1285, idx_1010, idx_467, idx_47610, idx_1...",[16696],16696,What Does Medicare IME Stand For ?,According to the Centers for Medicare and Medi...
1,long-term-care-insurance,"[idx_3815, idx_604, idx_605, idx_891, idx_136,...",[10277],10277,Is Long Term Care Insurance Tax Free ?,"As a rule , if you buy a tax qualified long te..."
2,health-insurance,"[idx_3019, idx_55039, idx_27647, idx_60975, id...",[12076],12076,Can Husband Drop Wife From Health Insurance ?,Can a spouse drop another spouse from health i...
3,medicare-insurance,"[idx_3815, idx_467, idx_34801, idx_1655, idx_7...","[25578, 6215]",25578,Is Medicare Run By The Government ?,Medicare Part A and Part B is provided by the ...
4,medicare-insurance,"[idx_3815, idx_467, idx_34801, idx_1655, idx_7...","[25578, 6215]",6215,Is Medicare Run By The Government ?,Definitely . It is ran by the Center for Medic...


## anslabel

In [None]:
raw_anslabel = read_data(PATH + 'InsuranceQA.question.anslabel.raw.100.pool.solr.test.encoded.gz', "anslabel")

In [None]:
raw_anslabel[0][3][:2]

In [None]:
# Question 
' '.join(convert_from_idx_str(raw_anslabel[0][1]))

In [None]:
# Ground truth
# Again, the list index is the answer label minus 1
' '.join(convert_from_idx_str(raw_label2answer[98][1]))

In [None]:
# Pool: again, the list index is the answer label minus 1
' '.join(convert_from_idx_str(raw_label2answer[15812][1]))

The candidate answer pool includes ground_truth and also randomly selected negative answers.
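
Because answer labels start at 1 while list positions start at 0, a dictionary keyed by label can make lookups less error-prone than subtracting 1 each time. The sketch below uses hypothetical toy data (`toy_label2answer`, `ground_truth`, `pool`) and assumes labels are stored as strings; it also checks that every ground-truth label appears in the candidate pool.

```python
# Hypothetical toy data in the [label, index list] layout of the label2answer records
toy_label2answer = [["1", ["idx_1", "idx_2"]], ["2", ["idx_3"]], ["3", ["idx_12"]]]

# Look answers up by label directly, instead of by list position (label - 1)
answer_by_label = {label: idxs for label, idxs in toy_label2answer}

# Hypothetical ground truth and candidate pool for one question
ground_truth = ["2"]
pool = ["3", "2", "1"]

# Ground-truth labels that are missing from the pool (expected to be empty)
missing = [label for label in ground_truth if label not in pool]
# Pool entries that are not ground truth, i.e. the negative answers
negatives = [label for label in pool if label not in ground_truth]

print(answer_by_label["2"], missing, negatives)
```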

In [None]:
# Load the data read above into a dataframe (raw_anslabel is the variable defined earlier)
raw_al = pd.DataFrame(raw_anslabel, columns = ['domain', 'Questions', 'groundTruth', 'pool'])

raw_al.head()