In [1]:
import pandas as pd
import difflib  as dl
import numpy as np

In [2]:
# Read a sample transcript that was previously saved as a data frame
transcript = pd.read_excel("../res/processed_transcript_from_json_to_dataframe.xlsx")

In [3]:
transcript.head()

Unnamed: 0.1,Unnamed: 0,start_time,end_time,speaker,text,confidence
0,0,0:00:00.0,9.13,spk_0,And resources at the National University of Si...,"[{'text': 'And', 'confidence': 0.9411, 'start_..."
1,1,0:00:09.13,1.23,spk_1,[PII] nice to be here.,"[{'text': '[PII]', 'confidence': 0.9999, 'star..."
2,2,0:00:10.37,3.37,spk_0,Dr [PII] on general practitioner from Dr anywh...,"[{'text': 'Dr', 'confidence': 0.7912, 'start_t..."
3,3,0:00:13.75,2.07,spk_2,Thank you very much for inviting us here today.,"[{'text': 'Thank', 'confidence': 1.0, 'start_t..."
4,4,0:00:15.83,4.04,spk_0,And dr medical director at Felix Medical group.,"[{'text': 'And', 'confidence': 1.0, 'start_tim..."


### Assumptions:
1. spk_0 is always the interviewer.
2. All other speakers are interviewees.
3. In this project we combine spk_1 + spk_2 + spk_3 into one "person", a "collective interviewee".

In [4]:
transcript['speaker'] = transcript['speaker'].astype(str)

In [5]:
transcript['speaker'] = transcript['speaker'].apply(lambda x:x if x=='spk_0' else 'spk_1')

In [6]:
transcript = transcript[['speaker','text']]
transcript.head(10)

Unnamed: 0,speaker,text
0,spk_0,And resources at the National University of Si...
1,spk_1,[PII] nice to be here.
2,spk_0,Dr [PII] on general practitioner from Dr anywh...
3,spk_1,Thank you very much for inviting us here today.
4,spk_0,And dr medical director at Felix Medical group.
5,spk_1,Good afternoon [PII] and everyone else.
6,spk_0,"Okay gentlemen, thanks so much for joining us...."
7,spk_1,I think healthier S. G. Is a whole rethinking ...
8,spk_0,And what do you think? Do you agree with that?
9,spk_1,Absolutely. I think [PII] has got the summary ...


#### Reference list of questions for this interview
The reference list contains high-level questions for different topics and may differ from the questions the interviewer actually asked. Sometimes the interviewer asks several questions on the same topic, or gives some background before asking a question. We will identify the transcript lines that correspond to the reference questions, and then map the answers to these questions.

In [7]:
list_of_questions =[
    "From your point of view What is Healthier SG",
    "when you look at our health care system, what was wrong with it? I mean, why are we doing this now? Why are we shifting in this direction?  ",
    "The idea of having family doctors is something so common elsewhere in the world. Why are we only doing it now?",
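    "Will this also add on to increasing burdens for GPs? What would Healthier SG mean for GPs?"
    # note: there is no comma above, so this string and the next are concatenated into a single question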
    "How dispence of medications will affect your operations?",
    "Fees will be paid to GPS based on the health risk profiles of patients and depending on the outcomes.How are you going to measure this when outcome is going to be so different for everyone?",
    "As a national savings for the healthcare system, will we save money is this is running correctly?",
    "Should the typical patient be concerned about rising costs when they go see the doctor world prices increase over time?",
    "If there's one thing that you could you know change right now fixed right now with regards to our health care system, what would it be?"
]

## Step 1: Map reference questions to the interviewer (spk_0) transcript


In [8]:
# Select only interviewer's lines:
df_spk0 = transcript[transcript.speaker=="spk_0"]

In [9]:
df_spk0=df_spk0.reset_index()

In [10]:
df_spk0.head()

Unnamed: 0,index,speaker,text
0,0,spk_0,And resources at the National University of Si...
1,2,spk_0,Dr [PII] on general practitioner from Dr anywh...
2,4,spk_0,And dr medical director at Felix Medical group.
3,6,spk_0,"Okay gentlemen, thanks so much for joining us...."
4,8,spk_0,And what do you think? Do you agree with that?


Some lines in the interviewer's transcript contain a reference question, while others are just part of the conversation. Our goal is to identify only the lines that are most similar to the questions in the reference list. We use the difflib library for this: for each interviewer line and each reference question we calculate a "matching" score. The line with the highest score is labeled as the line where that reference question was asked.

    
    for each question in the list of questions: 
        create a new column in the data frame
        for each line in the interviewer's transcript:
            compute a score for how similar the line is to the question
        find the line with the highest matching score
        label that line as the line where the reference question was asked
        

In [11]:
import difflib  as dl

In [12]:
possibilities = df_spk0.text.tolist()
# get max length for padding:
padding_len = max([len(val) for val in possibilities])
print(padding_len)

786


In [13]:
def get_similarity(s1, s2, padding_len):
    """
    Return the difflib similarity ratio between a transcript line and a reference question.

    s1: current line of the speaker
    s2: question from the reference list of questions
    padding_len: length to which s1 is right-padded with '0' characters, so that
    all transcript lines have the same length. This helps ensure a fair comparison.
    """
    # pad s1 on the right with '0' up to padding_len
    s1 = s1.ljust(padding_len, '0')
    return dl.SequenceMatcher(None, s2, s1).ratio()
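To see what the padding changes, here is a minimal sketch on made-up strings (the question, the two lines, and the pad length of 80 are illustrative assumptions, not taken from the transcript). `SequenceMatcher.ratio()` returns 2*M/T, where M is the number of matched characters and T is the combined length of both strings, so padding every line to the same length makes T identical for all lines.

```python
import difflib as dl

# Hypothetical strings for illustration only
question = "what is healthier sg"
short_line = "what is healthier sg"
long_line = "so from your point of view what is healthier sg all about"
pad = 80  # stands in for padding_len; any value >= the longest line works


def ratio(a, b):
    # Same call pattern as get_similarity: question first, transcript line second
    return dl.SequenceMatcher(None, a, b).ratio()


# Without padding, a long line is penalized for its extra length
print(ratio(question, short_line), ratio(question, long_line))

# With padding, both lines contain the full question and get the same score
padded_short = ratio(question, short_line.ljust(pad, "0"))
padded_long = ratio(question, long_line.ljust(pad, "0"))
print(padded_short, padded_long)
```

After padding, the score depends only on how many characters of the question are matched, not on how long the interviewer's line is. One caveat worth checking on real data: when the second sequence has 200 or more characters, `SequenceMatcher` applies its automatic junk heuristic by default (`autojunk=True`), which can ignore very frequent characters; with a padding length of 786 this may influence the scores.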
    
    

Compute a similarity score for each question against each line in the interviewer's transcript:

In [14]:
for q_i in list_of_questions:
    # one score column per reference question, named after the question text
    df_spk0[q_i] = df_spk0["text"].apply(lambda x: get_similarity(str(x), q_i, padding_len))

In [15]:
df_spk0

Unnamed: 0,index,speaker,text,From your point of view What is Healthier SG,"when you look at our health care system, what was wrong with it? I mean, why are we doing this now? Why are we shifting in this direction?",The idea of having family doctors is something so common elsewhere in the world. Why are we only doing it now?,Will this also add on to increasing burdens for GPs? What would Healthier SG mean for GPs?How dispence of medications will affect your operations?,Fees will be paid to GPS based on the health risk profiles of patients and depending on the outcomes.How are you going to measure this when outcome is going to be so different for everyone?,"As a national savings for the healthcare system, will we save money is this is running correctly?",Should the typical patient be concerned about rising costs when they go see the doctor world prices increase over time?,"If there's one thing that you could you know change right now fixed right now with regards to our health care system, what would it be?"
0,0,spk_0,And resources at the National University of Si...,0.024096,0.041037,0.058036,0.075107,0.041026,0.056625,0.046409,0.026059
1,2,spk_0,Dr [PII] on general practitioner from Dr anywh...,0.026506,0.032397,0.03125,0.049356,0.053333,0.038505,0.050829,0.015201
2,4,spk_0,And dr medical director at Felix Medical group.,0.033735,0.034557,0.033482,0.040773,0.045128,0.029445,0.048619,0.021716
3,6,spk_0,"Okay gentlemen, thanks so much for joining us....",0.074699,0.019438,0.020089,0.025751,0.022564,0.015855,0.024309,0.004343
4,8,spk_0,And what do you think? Do you agree with that?,0.026506,0.036717,0.046875,0.027897,0.047179,0.03624,0.024309,0.030402
5,10,spk_0,"So from the sounds of it, it is quite a fundam...",0.031325,0.300216,0.069196,0.021459,0.028718,0.088335,0.035359,0.093377
6,12,spk_0,"it really is a preventive approach. You know, ...",0.048193,0.021598,0.013393,0.023605,0.043077,0.01359,0.030939,0.017372
7,14,spk_0,"why wait this long, is it only now? Because we...",0.024096,0.064795,0.245536,0.038627,0.028718,0.047565,0.050829,0.015201
8,16,spk_0,"[PII], what do you think to",0.024096,0.036717,0.033482,0.017167,0.036923,0.02718,0.01989,0.02823
9,18,spk_0,But but do we really have that ecosystem? I've...,0.021687,0.041037,0.055804,0.038627,0.041026,0.03624,0.041989,0.043431


Map each reference question to the line with the maximum similarity score

In [16]:
# Create a new column
df_spk0['reference_q_id']=""

We print each reference question together with the transcript line it has been mapped to

In [17]:
for i,q_i in enumerate(list_of_questions):
    # get row index with max score in each column:
    max_score_index = df_spk0[q_i].idxmax()
    print(q_i)
    print(max_score_index)
    print(df_spk0.at[max_score_index,'text'])
    df_spk0.at[max_score_index,'reference_q_id'] = q_i
    print('-----')

From your point of view What is Healthier SG
3
Okay gentlemen, thanks so much for joining us. Let's start off talking about what's different about this plan and the importance of pivoting. So what is healthier S. G. All about first. Let me ask the Gps perhaps from your point of view, what is healthier S. G. And is it better [PII]? So
-----
when you look at our health care system, what was wrong with it? I mean, why are we doing this now? Why are we shifting in this direction?  
5
So from the sounds of it, it is quite a fundamental shift because it's about me going to see the doctor when I actually don't need to see the doctor. So it's preventive, right? I mean, you know, for those years as director of medical services, when you look at our health care system, what was wrong with it? I mean, why are we doing this now? Why are we shifting in this direction? But
-----
The idea of having family doctors is something so common elsewhere in the world. Why are we only doing it now?
7
why wait 
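Because `idxmax` always returns some row, a reference question that was never asked would still be mapped to its (weak) best match. A possible guard, sketched here on toy scores, is to accept a match only above a minimum score. The threshold `MIN_SCORE` and the toy data are assumptions for illustration and would need tuning on real transcripts.

```python
import pandas as pd

# Toy similarity scores: one column per (hypothetical) reference question
scores = pd.DataFrame({
    "text": ["line a", "line b", "line c"],
    "question 1": [0.05, 0.30, 0.07],
    "question 2": [0.02, 0.03, 0.04],
})

MIN_SCORE = 0.1  # hypothetical cut-off, not part of the original notebook

mapped = {}
for q in ["question 1", "question 2"]:
    best = scores[q].idxmax()
    # Keep the match only if its score is high enough, otherwise mark as not found
    mapped[q] = int(best) if scores.at[best, q] >= MIN_SCORE else None

print(mapped)
```

Here "question 2" stays unmapped because none of its scores reaches the threshold.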

In [18]:
df_spk0=df_spk0[['speaker','text','reference_q_id']]

### Map our results back to the original transcript

In [19]:
transcript = pd.merge(transcript,df_spk0,how='left',on=['speaker','text'])
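Merging on `speaker` and `text` works as long as every interviewer line is unique. If the same short line (for example "Right.") occurs more than once, a left merge on text duplicates rows. The sketch below reproduces this on toy data and shows one possible alternative that maps labels back through the original row index kept by `reset_index()`; it is an illustration under these assumptions, not the method used in this notebook.

```python
import pandas as pd

# Toy transcript with a repeated interviewer line
transcript = pd.DataFrame({
    "speaker": ["spk_0", "spk_1", "spk_0", "spk_1"],
    "text": ["Right.", "Yes.", "Right.", "No."],
})

# reset_index() keeps the original row number in the 'index' column
df_spk0 = transcript[transcript.speaker == "spk_0"].reset_index()
df_spk0["reference_q_id"] = ["", "Q1"]

# Merge on text: each "Right." row matches both interviewer rows
merged = pd.merge(
    transcript,
    df_spk0[["speaker", "text", "reference_q_id"]],
    how="left",
    on=["speaker", "text"],
)
print(len(transcript), len(merged))

# Alternative: align labels on the original row index instead of the text
by_index = transcript.copy()
by_index["reference_q_id"] = df_spk0.set_index("index")["reference_q_id"]
print(by_index)
```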

In [20]:
transcript.tail(20)

Unnamed: 0,speaker,text,reference_q_id
46,spk_0,two decades down the road. 20 years from now.,
47,spk_1,I feel that it takes a generation. That's the ...,
48,spk_0,[PII] generally,
49,spk_1,I think for complications 10 to 15 years you s...,
50,spk_0,"perform. Maybe, let me ask you, I imagine in t...",
51,spk_1,We should be seeing some of that. But I think ...,
52,spk_0,"And for the typical patient, should they be co...",Should the typical patient be concerned about ...
53,spk_1,over time. We will see a general increase in c...,
54,spk_0,one more question related to that. This is a g...,
55,spk_1,"there's always a worry, there's always a risk....",


## Step 2: Identify answers to the reference questions
We first find the row index of each line that contains a reference question. Then all records belonging to the interviewee (spk_1) that lie between two consecutive reference questions are considered answers to the first question of that pair.

In [21]:
import numpy as np

In [22]:
# 1. Find the row index of each question:
question_index_dict ={}
question_index_list = []
for q_i in list_of_questions:
    
    index_q_i = np.where(transcript['reference_q_id'].isin([q_i]))
    question_index_dict[q_i]=index_q_i[0][0]
    question_index_list.append(index_q_i[0][0])
    

In [23]:
question_index_list

[6, 10, 14, 23, 36, 44, 52, 58]

In [24]:
question_index_dict

{'From your point of view What is Healthier SG': 6,
 'when you look at our health care system, what was wrong with it? I mean, why are we doing this now? Why are we shifting in this direction?  ': 10,
 'The idea of having family doctors is something so common elsewhere in the world. Why are we only doing it now?': 14,
 'Will this also add on to increasing burdens for GPs? What would Healthier SG mean for GPs?How dispence of medications will affect your operations?': 23,
 'Fees will be paid to GPS based on the health risk profiles of patients and depending on the outcomes.How are you going to measure this when outcome is going to be so different for everyone?': 36,
 'As a national savings for the healthcare system, will we save money is this is running correctly?': 44,
 'Should the typical patient be concerned about rising costs when they go see the doctor world prices increase over time?': 52,
 "If there's one thing that you could you know change right now fixed right now with regards 

In [25]:
# 2. Map answers: rows between two consecutive question rows
index_answers = dict()
for i in range(len(question_index_list)-1):
    start = question_index_list[i]+1
    end = question_index_list[i+1]
    index_answers[list_of_questions[i]]=list(range(start,end))
# last question and answers:
index_answers[list_of_questions[-1]] = list(range(question_index_list[-1]+1,transcript.shape[0]))
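The range logic can be checked on small numbers. The helper below is a hypothetical restatement of the loop above (the name `answer_ranges` is not part of the notebook), and it assumes the question rows are sorted in ascending order, i.e. the questions were asked in the same order as in the reference list.

```python
def answer_ranges(question_rows, n_rows):
    """Map each question row to the row indices that follow it, up to the next question row."""
    # Append the total row count so the last question runs to the end of the transcript
    bounds = list(question_rows) + [n_rows]
    return {
        start: list(range(start + 1, end))
        for start, end in zip(bounds, bounds[1:])
    }


# Three questions asked at rows 6, 10 and 14 of an 18-row transcript
ranges = answer_ranges([6, 10, 14], 18)
print(ranges)
```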


In [26]:
transcript.shape[0]


66

In [27]:
transcript['responce_to_question']='empty'

In [28]:
for q_i in list_of_questions:
    response_indexes = index_answers.get(q_i)
    for i in response_indexes:
        transcript.at[i,'responce_to_question'] = q_i

In [29]:
transcript.head(40)

Unnamed: 0,speaker,text,reference_q_id,responce_to_question
0,spk_0,And resources at the National University of Si...,,empty
1,spk_1,[PII] nice to be here.,,empty
2,spk_0,Dr [PII] on general practitioner from Dr anywh...,,empty
3,spk_1,Thank you very much for inviting us here today.,,empty
4,spk_0,And dr medical director at Felix Medical group.,,empty
5,spk_1,Good afternoon [PII] and everyone else.,,empty
6,spk_0,"Okay gentlemen, thanks so much for joining us....",From your point of view What is Healthier SG,empty
7,spk_1,I think healthier S. G. Is a whole rethinking ...,,From your point of view What is Healthier SG
8,spk_0,And what do you think? Do you agree with that?,,From your point of view What is Healthier SG
9,spk_1,Absolutely. I think [PII] has got the summary ...,,From your point of view What is Healthier SG


## Step 3: Concatenate rows with answers by question

In [30]:

# Select only spk_1 rows and their text; rows with 'empty' in responce_to_question are dropped in the next cell
transcript_responses = transcript[transcript.speaker=='spk_1'][['text','responce_to_question']]

In [31]:
transcript_responses=transcript_responses[transcript_responses.responce_to_question!='empty']

In [32]:
transcript_responses_agg = transcript_responses.groupby(['responce_to_question'])['text'].apply(lambda x: ','.join(x)).reset_index()

In [34]:
# drop all ',':
transcript_responses_agg['responce_to_question'] = transcript_responses_agg['responce_to_question'].apply(
    lambda x:x.replace(',',''))
transcript_responses_agg['text'] = transcript_responses_agg['text'].apply(
    lambda x:x.replace(',',''))
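Joining with ',' and then removing every comma leaves consecutive replies glued together without a separator, and `groupby` sorts the questions alphabetically. If readable text and the original question order matter downstream, one possible variant (an assumption about what is wanted, shown on toy data) joins replies with a space and passes `sort=False`:

```python
import pandas as pd

# Toy responses: two replies to Q2 followed by one reply to Q1
responses = pd.DataFrame({
    "responce_to_question": ["Q2", "Q2", "Q1"],
    "text": ["First part.", "Second part.", "Only part."],
})

agg = (
    responses
    .groupby("responce_to_question", sort=False)["text"]  # keep first-appearance order
    .apply(lambda parts: " ".join(parts))                 # separate replies with a space
    .reset_index()
)
print(agg)
```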

In [35]:
transcript_responses_agg.head(20)

Unnamed: 0,responce_to_question,text
0,As a national savings for the healthcare syste...,if this is running correctly? And we get buy i...
1,Fees will be paid to GPS based on the health r...,start first. What I used to hear a lot of is t...
2,From your point of view What is Healthier SG,I think healthier S. G. Is a whole rethinking ...
3,If there's one thing that you could you know c...,for me I think information to unify all the in...
4,Should the typical patient be concerned about ...,over time. We will see a general increase in c...
5,The idea of having family doctors is something...,I don't think we're only doing it now. I think...
6,Will this also add on to increasing burdens fo...,be honest and say that it is going to be certa...
7,when you look at our health care system what w...,[PII] I would actually just extend a little bi...


In [37]:
# drop all ',' in the full transcript as well:
transcript['responce_to_question'] = transcript['responce_to_question'].str.replace(',', '', regex=False)
transcript['text'] = transcript['text'].str.replace(',', '', regex=False)

### Saving processed results
These results are ready for text processing with AWS.

In [39]:
transcript.to_csv("../res/transcript_with_mapped_questions_and_answers_all.csv",index=False)
transcript_responses_agg.to_csv("../res/transcript_with_mapped_questions_and_answers.csv", index=False)

moving to AWS SageMaker