## Speaker classification pipeline
---
This Jupyter notebook contains a pipeline for classifying the speakers in an interview transcript into two categories: "interviewer" and "interviewee". The pipeline uses a reference list of interview questions and calculates matching scores between each row in the transcript and every question in the reference list. The scores are then aggregated at the average level for each speaker, and the speaker with the highest matching score with the questions in the transcript is defined as the "interviewer".

### Pipeline steps

The pipeline consists of the following steps:

1. __Loading the data__: The interview transcript and reference list are loaded into the notebook.

2. __Preprocessing__: The transcript is preprocessed to remove any unnecessary elements, such as timestamps or speaker tags. The reference list is also preprocessed to remove any duplicate questions or irrelevant information.

3. __Calculating matching scores__: For each speaker*i*:  for each speaker *i*'s records in the transcript and each question in the reference list, a matching score is calculated using a text similarity metric. The matching score reflects the degree of similarity between the row and the question. Then for each question we select the maximum score.

4. __Aggregating scores by speaker__: The matching scores are aggregated at the average level for each speaker across questions' scores, creating a summary table that shows the average score for each speaker.

5. __Identifying the interviewer__: The speaker with the highest average score is identified as the "interviewer", while the other speaker(s) are classified as "interviewee(s)".

### Conclusion

This pipeline provides a simple yet effective way of identifying the interviewer in an interview transcript based on the matching scores between the transcript and a reference list of questions. It can be useful for various applications, such as analyzing interview sentiments.

---

In [1]:
import pandas as pd
import difflib  as dl
import numpy as np

### Loading the data

In [4]:
# read a sample transcript which we saved in data frame
transcript = pd.read_excel("../res/processed_transcript_from_json_to_dataframe.xlsx")

In [5]:
transcript.head()

Unnamed: 0.1,Unnamed: 0,start_time,end_time,speaker,text,confidence
0,0,0:00:00.0,9.13,spk_0,And resources at the National University of Si...,"[{'text': 'And', 'confidence': 0.9411, 'start_..."
1,1,0:00:09.13,1.23,spk_1,[PII] nice to be here.,"[{'text': '[PII]', 'confidence': 0.9999, 'star..."
2,2,0:00:10.37,3.37,spk_0,Dr [PII] on general practitioner from Dr anywh...,"[{'text': 'Dr', 'confidence': 0.7912, 'start_t..."
3,3,0:00:13.75,2.07,spk_2,Thank you very much for inviting us here today.,"[{'text': 'Thank', 'confidence': 1.0, 'start_t..."
4,4,0:00:15.83,4.04,spk_0,And dr medical director at Felix Medical group.,"[{'text': 'And', 'confidence': 1.0, 'start_tim..."


#### Assumptions:
1. In this project we combine spk_1 + spk_2 + spk_3 into one "person" as a "collective interviewee"

In [6]:
transcript['speaker'] = transcript['speaker'].astype(str)

In [7]:
transcript['speaker'] = transcript['speaker'].apply(lambda x:x if x=='spk_0' else 'spk_1')

In [8]:
transcript = transcript[['speaker','text']]
transcript.head(10)

Unnamed: 0,speaker,text
0,spk_0,And resources at the National University of Si...
1,spk_1,[PII] nice to be here.
2,spk_0,Dr [PII] on general practitioner from Dr anywh...
3,spk_1,Thank you very much for inviting us here today.
4,spk_0,And dr medical director at Felix Medical group.
5,spk_1,Good afternoon [PII] and everyone else.
6,spk_0,"Okay gentlemen, thanks so much for joining us...."
7,spk_1,I think healthier S. G. Is a whole rethinking ...
8,spk_0,And what do you think? Do you agree with that?
9,spk_1,Absolutely. I think [PII] has got the summary ...


#### This is a reference list of questions, for this interview
The reference list represents high level question for different topics and may varies from all the questioned the interviewer asked. Sometimes interviewer asks more questions for the same topic, or gives some background before asks a question. We will identify and map lines in transcrip to the reference questions and answers to these questions. 

In [9]:
list_of_questions =[
    "From your point of view What is Healthier SG",
    "when you look at our health care system, what was wrong with it? I mean, why are we doing this now? Why are we shifting in this direction?  ",
    "The idea of having family doctors is something so common elsewhere in the world. Why are we only doing it now?",
    "Will this also add on to increasing burdens for GPs? What would Healthier SG mean for GPs?"
    "How dispence of medications will affect your operations?",
    "Fees will be paid to GPS based on the health risk profiles of patients and depending on the outcomes.How are you going to measure this when outcome is going to be so different for everyone?",
    "As a national savings for the healthcare system, will we save money is this is running correctly?",
    "Should the typical patient be concerned about rising costs when they go see the doctor world prices increase over time?",
    "If there's one thing that you could you know change right now fixed right now with regards to our health care system, what would it be?"
]

### Calculating matchin score for *spk_0*


In [8]:
# Select only spk_0 lines:
df_spk0 = transcript[transcript.speaker=="spk_0"]

In [9]:
df_spk0=df_spk0.reset_index()

In [10]:
df_spk0.head()

Unnamed: 0,index,speaker,text
0,0,spk_0,And resources at the National University of Si...
1,2,spk_0,Dr [PII] on general practitioner from Dr anywh...
2,4,spk_0,And dr medical director at Felix Medical group.
3,6,spk_0,"Okay gentlemen, thanks so much for joining us...."
4,8,spk_0,And what do you think? Do you agree with that?


In [11]:
import difflib  as dl

In [12]:
possibilities = df_spk0.text.tolist()
# get max length for padding:
padding_len = max([len(val) for val in possibilities])
print(padding_len)

786


In [13]:
def get_similarity(s1,s2, padding_len):
    """
    s1: current line of speaker
    s2: is for question from the reference list of questions
    padding_len: all lines in transcript are aligned to the left side by inserting 
    padding to the right end of the string. This help to ensure a fair comparision.
    """
    # s1 need to be padded with 0:
    s1=s1.ljust(padding_len, '0')
    return dl.SequenceMatcher(None, s2, s1).ratio()
    
    

#### a) Getting a similarity scores for each question vs each line in transcript:

In [14]:
for i,q_i in enumerate(list_of_questions):
    df_spk0[q_i] = 0
    df_spk0[q_i] = df_spk0["text"].apply(lambda x:get_similarity(str(x), q_i,padding_len ))

#### b) For each Reference question (column in our case) calculate the max score and then take the mean:

In [20]:
spk_0_avg_score = np.mean(df_spk0.max().tolist()[3:])
spk_0_avg_score

0.20355017080803853

#### c) create a new column in the transcript and fill out this score as score for the spk_0:

In [26]:
# add this score as a column to the transcript
transcript['spk_0_score']=spk_0_avg_score

### Calculating matchin score for *spk_1*


In [21]:
# Select only interviewer's lines:
df_spk1 = transcript[transcript.speaker=="spk_1"]
df_spk1=df_spk1.reset_index()
df_spk1.head()

Unnamed: 0,index,speaker,text
0,1,spk_1,[PII] nice to be here.
1,3,spk_1,Thank you very much for inviting us here today.
2,5,spk_1,Good afternoon [PII] and everyone else.
3,7,spk_1,I think healthier S. G. Is a whole rethinking ...
4,9,spk_1,Absolutely. I think [PII] has got the summary ...


#### a) Getting a similarity scores for each question vs each line in transcript:

In [22]:
for i,q_i in enumerate(list_of_questions):
    df_spk1[q_i] = 0
    df_spk1[q_i] = df_spk1["text"].apply(lambda x:get_similarity(str(x), q_i,padding_len ))

In [23]:
df_spk1.head()

Unnamed: 0,index,speaker,text,From your point of view What is Healthier SG,"when you look at our health care system, what was wrong with it? I mean, why are we doing this now? Why are we shifting in this direction?",The idea of having family doctors is something so common elsewhere in the world. Why are we only doing it now?,Will this also add on to increasing burdens for GPs? What would Healthier SG mean for GPs?How dispence of medications will affect your operations?,Fees will be paid to GPS based on the health risk profiles of patients and depending on the outcomes.How are you going to measure this when outcome is going to be so different for everyone?,"As a national savings for the healthcare system, will we save money is this is running correctly?",Should the typical patient be concerned about rising costs when they go see the doctor world prices increase over time?,"If there's one thing that you could you know change right now fixed right now with regards to our health care system, what would it be?"
0,1,spk_1,[PII] nice to be here.,0.014458,0.019438,0.024554,0.015021,0.022564,0.024915,0.024309,0.015201
1,3,spk_1,Thank you very much for inviting us here today.,0.040964,0.045356,0.046875,0.038627,0.043077,0.038505,0.046409,0.043431
2,5,spk_1,Good afternoon [PII] and everyone else.,0.021687,0.019438,0.022321,0.040773,0.041026,0.03624,0.033149,0.023887
3,7,spk_1,I think healthier S. G. Is a whole rethinking ...,0.012048,0.025918,0.013393,0.023605,0.026667,0.02265,0.01768,0.021716
4,9,spk_1,Absolutely. I think [PII] has got the summary ...,0.01626,0.0059,0.003604,0.00823,0.005734,0.001211,0.001195,0.00355


#### b) For each Reference question (column in our case) calculate the max score and then take the mean:

In [24]:
spk_1_avg_score = np.mean(df_spk1.max().tolist()[3:])
spk_1_avg_score

0.05568236160918105

#### c) create a new column in the transcript and fill out this score as score for the spk_1:

In [25]:
# add this score as a column to the transcript
transcript['spk_1_score']=spk_1_avg_score

### Identifying the interviewer
Compare the score for spk_0 and spk_1.
Those who has the highest score is "interviewer", because it's the transcript's lines are best matched with the list of reference question

In [29]:
transcript.head(5)

Unnamed: 0,speaker,text,spk_1_score,spk_0_score
0,spk_0,And resources at the National University of Si...,0.055682,0.20355
1,spk_1,[PII] nice to be here.,0.055682,0.20355
2,spk_0,Dr [PII] on general practitioner from Dr anywh...,0.055682,0.20355
3,spk_1,Thank you very much for inviting us here today.,0.055682,0.20355
4,spk_0,And dr medical director at Felix Medical group.,0.055682,0.20355


In [30]:
def who_is_interviewer(df,i):
    # if the i-line is for spk_0 and spk_0_score>spk_1_score then spk_0 is interviewer
    if df['speaker'][i]=="spk_0":
        if df["spk_0_score"][i]>=df["spk_1_score"][i]:
            return "interviewer"
        else:
            return "interviewee"
    else: # the line i belongs to spk_1
        if df["spk_0_score"][i]>=df["spk_1_score"][i]:
            return "interviewee"
        else:
            return "interviewer"

In [33]:
# create a new column to keep values who is interviewee/interviewer:
transcript['role']=""
for ind in transcript.index:
    transcript['role'][ind] = who_is_interviewer(transcript, ind)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


In [34]:
transcript

Unnamed: 0,speaker,text,spk_1_score,spk_0_score,role
0,spk_0,And resources at the National University of Si...,0.055682,0.20355,interviewer
1,spk_1,[PII] nice to be here.,0.055682,0.20355,interviewee
2,spk_0,Dr [PII] on general practitioner from Dr anywh...,0.055682,0.20355,interviewer
3,spk_1,Thank you very much for inviting us here today.,0.055682,0.20355,interviewee
4,spk_0,And dr medical director at Felix Medical group.,0.055682,0.20355,interviewer
...,...,...,...,...,...
61,spk_1,I will say vaccinations. You know the accessib...,0.055682,0.20355,interviewee
62,spk_0,also the flick of switch and everyone is tomorrow,0.055682,0.20355,interviewer
63,spk_1,somehow.,0.055682,0.20355,interviewee
64,spk_1,"Well, I wish we had done all of this five or s...",0.055682,0.20355,interviewee


### Saving processed results
This results are ready for text processing with AWS 

In [35]:
transcript.to_csv("../res/transcript_with_differentiated_interviewer_interviewee.csv",index=False)

moving to AWS SageMaker