# Using Metadata to Improve Artificial Intelligence Medical Image Diagnostic Accuracy
**Purpose and Background**
Conduct a descriptive analysis of crowdsourced data extracted from user interactions with a mobile application in which users were tasked with making a binary (yes or no) decision on whether a medical image shows an abnormality.

Two user categories were differentiated: medical experts, who were hired to interact with the application, and the crowd, meaning anyone who downloaded and used the application.



### Import datasets

In [188]:
import pandas as pd
import numpy as np
results = pd.read_csv('1345_customer_results.csv') #medical case results
admin = pd.read_csv('1345_admin_reads.csv') #raw individual read

### Inspect Customer Results

In [189]:
results.dtypes
results = results.set_index('Case ID')

**Preliminary filtering for security purposes**


In [190]:
results = results.dropna(subset=['Origin']) 
results["Expert: Abnormal Votes"] = results["Origin"].str.extract(r'vote(\d)').astype(float)
results = results.drop(['Origin Created At','Origin','Content ID','URL'],axis=1)
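
The extraction step above can be illustrated on a toy Series. This is a minimal sketch: the `Origin` strings below are hypothetical (the real format is not shown in this notebook), and only the `vote(\d)` pattern is taken from the cell above. Passing `expand=False` is an optional variation that returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical Origin strings; the real values are not shown in this notebook.
origin = pd.Series(
    [
        "https://example.com/cases/vote3/img1.png",
        "https://example.com/cases/vote0/img2.png",
        "https://example.com/cases/img3.png",  # no vote token
    ],
    name="Origin",
)

# The capture group keeps the single digit that follows "vote".
# Rows without a match come back as NaN, which is why a dropna follows.
votes = origin.str.extract(r"vote(\d)", expand=False).astype(float)
print(votes.tolist())
```

Note that `\d` captures exactly one digit, so a vote count of 10 or more would be truncated to its first digit.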

In [191]:
results.head(2)

Case ID,Labeling State,Series,Series Index,Patch,Qualified Reads,Correct Label,Majority Label,Difficulty,Agreement,First Choice Answer,First Choice Votes,First Choice Weight,Second Choice Answer,Second Choice Votes,Second Choice Weight,Internal Notes,Comments,Explanation,Expert: Abnormal Votes
5888087,Gold Standard,,,,2,'no','no',0.0,1.0,'no',2,1.54,'yes',0,0.0,,[],,2.0
5888088,Gold Standard,,,,3,'no','no',0.0,1.0,'no',3,2.34,'yes',0,0.0,,[],,0.0


Rows with no expert vote string in the URL (i.e. NA after extraction) are dropped.

In [192]:
results = results.dropna(subset=["Expert: Abnormal Votes"])

**Inspect NaN Columns for Content**

In [193]:
results.loc[results['Series'].notna()| results['Series Index'].notna() | results['Patch'].notna() | results['Internal Notes'].notna() | results['Explanation'].notna()]

Case ID,Labeling State,Series,Series Index,Patch,Qualified Reads,Correct Label,Majority Label,Difficulty,Agreement,First Choice Answer,First Choice Votes,First Choice Weight,Second Choice Answer,Second Choice Votes,Second Choice Weight,Internal Notes,Comments,Explanation,Expert: Abnormal Votes


The resulting DataFrame is empty, so the inspected columns will be dropped

In [194]:
results = results.drop(['Series','Series Index','Patch','Internal Notes','Explanation'],axis=1)

**Inspect Comments for Relevance**

In [195]:
results[results['Comments'] != '[]']


Case ID,Labeling State,Qualified Reads,Correct Label,Majority Label,Difficulty,Agreement,First Choice Answer,First Choice Votes,First Choice Weight,Second Choice Answer,Second Choice Votes,Second Choice Weight,Comments,Expert: Abnormal Votes
5892332,Gold Standard,1,'no','no',0.0,1.0,'no',1,0.8,'yes',0,0.0,['There was rapid and spiky rates so why am I ...,3.0
5894116,Gold Standard,5,'no','yes',1.0,1.0,'yes',5,4.0,'no',0,0.0,['Can someone explain why the answer is “no”?'],0.0
5896433,Gold Standard,3,'yes','no',1.0,1.0,'no',3,2.32,'yes',0,0.0,['??'],5.0
5899520,Gold Standard,2,'yes','no',1.0,1.0,'no',2,1.58,'yes',0,0.0,"[""i can't see any spike in this question so wh...",5.0
5900998,Gold Standard,2,'no','yes',1.0,1.0,'yes',2,1.56,'no',0,0.0,['There is obviously a peak happened in there'],3.0
5901914,Gold Standard,6,'yes','no',1.0,1.0,'no',6,4.72,'yes',0,0.0,['No spike present'],5.0
5902040,Gold Standard,2,'no','yes',1.0,1.0,'yes',2,1.58,'no',0,0.0,['How?'],3.0
5904120,Gold Standard,1,'yes','no',1.0,1.0,'no',1,0.78,'yes',0,0.0,['How? '],6.0
5904413,Gold Standard,3,'yes','no',1.0,1.0,'no',3,2.46,'yes',0,0.0,['Multiple?'],6.0
5904472,Gold Standard,3,'yes','yes',0.333,0.667,'yes',2,1.58,'no',1,0.78,[' Wtf'],5.0


None of the comments seem relevant; comments column will be dropped

In [196]:
results = results.drop(['Comments'],axis=1)

There should be only 8 experts in total; drop cases whose expert vote count is greater than 8

In [197]:
results = results[results["Expert: Abnormal Votes"] <= 8]

### Important columns for analysis; original metadata
Each row corresponds to a medical case 

**Identifiers:** 

Case ID: unique identifier will serve as index

Labeling State: identifies whether an expert consensus has been achieved (yes = Gold Standard, no = In Progress)

URL: the expert vote count was extracted from the URL

**Reads and Annotations**

Qualified Reads: total crowd vote count

Expert: Abnormal Votes: number of experts who thought the case was abnormal

(note: the total number of experts voting is always 8)

Correct Label: overall expert consensus 

{yes=case is abnormal, no=case is normal, NaN=no consensus}

Majority Label: overall crowd consensus on each case

**Measures of Confidence**

Difficulty: Qualified Reads *without the Correct Label* divided by total Qualified Reads.

Agreement: Qualified Reads *with the Majority Label* divided by total Qualified Reads.

Nth Choice Answer: crowd answer (First Choice is the Majority Label)
        
Nth Choice Votes: number of crowd votes per answer
        
Nth Choice Weight:
        
        
        



### Add Relevant Columns and Optimize Dataframe




#### Cluster cases categorically based on difficulty

In [198]:
bins=[0,0.2,0.4,0.6,0.8,1]
labels=['very easy','easy','moderate','challenging','very challenging']
results['Difficulty Category'] = pd.cut(results['Difficulty'],bins=bins,labels=labels,include_lowest=True)
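
To make the bin boundaries concrete, here is a small sketch that applies the same `bins` and `labels` to a few sample difficulty values (the sample values are illustrative). With `pd.cut`, intervals are closed on the right by default, and `include_lowest=True` makes the first interval include 0.

```python
import pandas as pd

bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
labels = ['very easy', 'easy', 'moderate', 'challenging', 'very challenging']

# Illustrative difficulty values, including both edges and a missing value.
sample = pd.Series([0.0, 0.2, 0.333, 1.0, float('nan')])
cats = pd.cut(sample, bins=bins, labels=labels, include_lowest=True)
print(cats.tolist())
# ['very easy', 'very easy', 'easy', 'very challenging', nan]
```

A difficulty of exactly 0.2 falls in 'very easy', and cases with no difficulty value (no expert consensus yet) stay uncategorized.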

#### Expert: Normal Votes:
I subtracted the number of experts who voted the case as abnormal from the total number of experts.

#### Expert Agreement: 
I divided the number of experts who voted the case as abnormal by the total number of experts to get the proportion of experts who agree that the case is abnormal.

#### Error Rate: 
I extracted the indexes for each expert majority category and calculated the "error rate" as the proportion of experts who did not vote with the expert majority.

#### Consensus:
I indicated cases where the expert majority and the crowd majority agree.

### I will rename some of the original columns for clarity

   #### {Original column --> Renamed Column}
    
    Correct Label --> Expert Majority

    Majority Label --> Crowd Majority

    Difficulty --> Expert/Crowd Disagreement

    Agreement --> Crowd Agreement

In [199]:
df = results.copy()  # copy so that adding columns to df does not also modify results
df["Expert Majority"] = results["Correct Label"]
df["Crowd Majority"] = results["Majority Label"]
df["Expert/Crowd Disagreement"] = results["Difficulty"] 
df["Crowd Agreement"] = results["Agreement"] 
df = df.drop(columns= ["Correct Label","Majority Label","Difficulty","Agreement"])

#### Expert/Crowd Disagreement 
is the proportion of the crowd disagreeing with the expert consensus (i.e. difficulty)

#### Crowd Agreement
is the proportion of the crowd agreeing with the crowd consensus (i.e. agreement)
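
The two ratios can be checked by hand on a single case. The sketch below uses a toy list of reads (6 'no', 1 'yes') with an expert label of 'no', which mirrors the vote counts shown for case 5888100; the variable names are illustrative.

```python
# Toy case: 7 qualified reads, expert (correct) label is 'no'.
reads = ["no", "no", "no", "no", "no", "no", "yes"]
correct_label = "no"

total = len(reads)
majority_label = max(set(reads), key=reads.count)

# Expert/Crowd Disagreement: reads WITHOUT the correct label / total reads.
disagreement = sum(r != correct_label for r in reads) / total
# Crowd Agreement: reads WITH the majority label / total reads.
agreement = sum(r == majority_label for r in reads) / total

print(round(disagreement, 3), round(agreement, 3))  # 0.143 0.857
```

When the crowd majority matches the expert label, the two values sum to 1; when they differ, both can be high at the same time.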

#### Error rate of experts
I extracted the indexes for each expert majority category and calculated the "error rate" as the proportion of experts who did not vote with the expert majority. Cases without an expert majority keep a missing error rate.

In [200]:
expert_count = 8
df["Expert: Normal Votes"] = (expert_count - results["Expert: Abnormal Votes"])
df["Expert Agreement"] = df["Expert: Abnormal Votes"]/expert_count
df['Consensus'] = np.where(df['Expert Majority'] == df['Crowd Majority'],'yes','no')

In [201]:
EM_yes = df.index[df['Expert Majority'] == "'yes'"].tolist()

EM_no = df.index[df['Expert Majority'] == "'no'"].tolist()

df.loc[EM_yes,"Error Rate"]= df['Expert: Normal Votes'][EM_yes]/expert_count
df.loc[EM_no,"Error Rate"]= df['Expert: Abnormal Votes'][EM_no]/expert_count
#df.fillna('', inplace=True)
# Show the derived columns, selected by label instead of a hardcoded position
df.loc[:, 'Crowd Agreement':]


Case ID,Crowd Agreement,Expert: Normal Votes,Expert Agreement,Consensus,Error Rate
5888087,1.000,6.0,0.250,yes,0.250
5888088,1.000,8.0,0.000,yes,0.000
5888089,1.000,8.0,0.000,yes,0.000
5888090,1.000,8.0,0.000,yes,0.000
5888091,0.571,4.0,0.500,no,
...,...,...,...,...,...
5918375,1.000,6.0,0.250,no,0.250
5918376,0.667,5.0,0.375,no,0.375
5918377,1.000,4.0,0.500,no,
5918378,,3.0,0.625,no,0.375
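
The error-rate logic above can also be written without index lists. The following is a sketch on a toy frame, assuming the same quoted label strings (`"'yes'"`, `"'no'"`) and 8 experts; the use of `np.select` is a swapped-in alternative, not the method used in the cell above.

```python
import numpy as np
import pandas as pd

expert_count = 8
toy = pd.DataFrame(
    {
        "Expert Majority": ["'yes'", "'no'", np.nan],  # NaN = no consensus
        "Expert: Abnormal Votes": [6.0, 2.0, 4.0],
    },
    index=[1, 2, 3],
)
toy["Expert: Normal Votes"] = expert_count - toy["Expert: Abnormal Votes"]

# Votes cast against the expert majority: normal votes when the majority
# is 'yes', abnormal votes when it is 'no', missing when there is none.
minority_votes = np.select(
    [toy["Expert Majority"] == "'yes'", toy["Expert Majority"] == "'no'"],
    [toy["Expert: Normal Votes"], toy["Expert: Abnormal Votes"]],
    default=np.nan,
)
toy["Error Rate"] = minority_votes / expert_count
print(toy["Error Rate"].tolist())  # [0.25, 0.25, nan]
```

A 4-4 expert split has no majority, so its error rate stays missing, which matches the blank entries in the output above.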


In [202]:
filt_df = df[df['Qualified Reads'] >= 5]
print(sum(df['Expert: Normal Votes']+df['Expert: Abnormal Votes'])-sum(df['Qualified Reads']))
print(sum(filt_df['Expert: Normal Votes']+filt_df['Expert: Abnormal Votes'])-sum(filt_df['Qualified Reads']))


78020.0
16432.0


When cases are restricted to those with 5 or more qualified reads, the disparity between the expert vote count and the crowd read count decreases significantly (from about 78,000 to about 16,400);
#### so only cases with 5 or more crowd reads will be analyzed further

In [203]:
filt_df

Case ID,Labeling State,Qualified Reads,First Choice Answer,First Choice Votes,First Choice Weight,Second Choice Answer,Second Choice Votes,Second Choice Weight,Expert: Abnormal Votes,Difficulty Category,Expert Majority,Crowd Majority,Expert/Crowd Disagreement,Crowd Agreement,Expert: Normal Votes,Expert Agreement,Consensus,Error Rate
5888091,In Progress,7,'yes',4,3.28,'no',3,2.32,4.0,,,'yes',,0.571,4.0,0.500,no,
5888093,Gold Standard,6,'no',6,4.94,'yes',0,0.00,0.0,very easy,'no','no',0.000,1.000,8.0,0.000,yes,0.000
5888099,Gold Standard,5,'no',5,4.05,'yes',0,0.00,0.0,very easy,'no','no',0.000,1.000,8.0,0.000,yes,0.000
5888100,Gold Standard,7,'no',6,4.92,'yes',1,0.80,0.0,very easy,'no','no',0.143,0.857,8.0,0.000,yes,0.000
5888102,Gold Standard,6,'no',6,4.66,'yes',0,0.00,0.0,very easy,'no','no',0.000,1.000,8.0,0.000,yes,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5918367,Gold Standard,5,'yes',5,3.98,'no',0,0.00,3.0,very challenging,'no','yes',1.000,1.000,5.0,0.375,no,0.375
5918368,Gold Standard,8,'yes',8,6.50,'no',0,0.00,5.0,very easy,'yes','yes',0.000,1.000,3.0,0.625,yes,0.375
5918372,Gold Standard,7,'yes',7,5.78,'no',0,0.00,6.0,very easy,'yes','yes',0.000,1.000,2.0,0.750,yes,0.250
5918377,In Progress,6,'yes',6,4.78,'no',0,0.00,4.0,,,'yes',,1.000,4.0,0.500,no,


# Exploratory Analysis 

In [204]:
#%pip install jupyter-dash
import plotly.express as px
import plotly.io as pio
import plotly.figure_factory as ff
pio.renderers.default='notebook'
import matplotlib.pyplot as plt
import plotly.graph_objects as go

In [209]:

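filt_df = filt_df.copy()  # work on an independent copy rather than a slice of df
index = filt_df.index[filt_df['Crowd Majority']=="'no'"].tolist()
filt_df.loc[index, 'Crowd Agreement'] -= 0.5  # .loc avoids chained assignment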
filt_df

Case ID,Labeling State,Qualified Reads,First Choice Answer,First Choice Votes,First Choice Weight,Second Choice Answer,Second Choice Votes,Second Choice Weight,Expert: Abnormal Votes,Difficulty Category,Expert Majority,Crowd Majority,Expert/Crowd Disagreement,Crowd Agreement,Expert: Normal Votes,Expert Agreement,Consensus,Error Rate
5888091,In Progress,7,'yes',4,3.28,'no',3,2.32,4.0,,,'yes',,0.571,4.0,0.500,no,
5888093,Gold Standard,6,'no',6,4.94,'yes',0,0.00,0.0,very easy,'no','no',0.000,0.500,8.0,0.000,yes,0.000
5888099,Gold Standard,5,'no',5,4.05,'yes',0,0.00,0.0,very easy,'no','no',0.000,0.500,8.0,0.000,yes,0.000
5888100,Gold Standard,7,'no',6,4.92,'yes',1,0.80,0.0,very easy,'no','no',0.143,0.357,8.0,0.000,yes,0.000
5888102,Gold Standard,6,'no',6,4.66,'yes',0,0.00,0.0,very easy,'no','no',0.000,0.500,8.0,0.000,yes,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5918367,Gold Standard,5,'yes',5,3.98,'no',0,0.00,3.0,very challenging,'no','yes',1.000,1.000,5.0,0.375,no,0.375
5918368,Gold Standard,8,'yes',8,6.50,'no',0,0.00,5.0,very easy,'yes','yes',0.000,1.000,3.0,0.625,yes,0.375
5918372,Gold Standard,7,'yes',7,5.78,'no',0,0.00,6.0,very easy,'yes','yes',0.000,1.000,2.0,0.750,yes,0.250
5918377,In Progress,6,'yes',6,4.78,'no',0,0.00,4.0,,,'yes',,1.000,4.0,0.500,no,


0.5 was subtracted from the 'no' subsection of crowd agreements to match how expert agreement was calculated.
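
For comparison, the sketch below contrasts the subtraction used above with one possible alternative, under the assumption that the goal is to express crowd agreement as the share of reads voting 'yes' (abnormal), which is the scale Expert Agreement uses. The column names `Minus 0.5` and `Share voting yes` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Crowd Majority": ["'no'", "'no'", "'yes'", "'yes'"],
        "Crowd Agreement": [1.0, 0.6, 0.6, 1.0],  # share voting with majority
    }
)
is_no = toy["Crowd Majority"] == "'no'"

# Rescaling used above: shift the 'no' rows down by 0.5.
toy["Minus 0.5"] = toy["Crowd Agreement"].where(
    ~is_no, toy["Crowd Agreement"] - 0.5
)
# Assumed alternative: for 'no' rows, the share voting 'yes' is the complement.
toy["Share voting yes"] = toy["Crowd Agreement"].where(
    ~is_no, 1 - toy["Crowd Agreement"]
)
print(toy)
```

Under the subtraction, a unanimous 'no' case (agreement 1.0) lands at 0.5, whereas the complement places it at 0.0; which mapping fits depends on how the shared axis is meant to be read.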

## When experts and crowd disagree, what's the extent of their division?

In [210]:
consensus_no = filt_df.index[filt_df['Consensus']=='no'].tolist()
split_conses = (df[df["Expert Agreement"]==0.5])


fig2 = px.histogram(filt_df.loc[consensus_no], 
                    x='Expert Agreement',color='Expert Majority',
                    marginal='violin',color_discrete_map={"'yes'":'purple',"'no'":'red'}, 
                    labels={'x' : 'Agreement Ratio', 'y' : 'Count'}, text_auto=True
                   )
fig1 = px.histogram(filt_df.loc[consensus_no], 
                    x='Crowd Agreement',color='Crowd Majority',
                    marginal='violin', color_discrete_map={"'yes'":'green',"'no'":'yellow'},
                   text_auto=True)
fig3 = px.histogram(split_conses, x='Expert Agreement',color='Expert Majority', color_discrete_map={'nan':'blue'},
                   text_auto=True) #won't change color

fig1.update_xaxes(dtick=0.2)
fig1.update_traces(opacity=0.32)
fig2.update_layout(title_text='Overall Vote Distribution',
    title_x=0.5, showlegend=True,
    legend_title=None)
fig1.data[0].name="Crowd Majority: Yes"
fig2.data[0].name="Expert Majority: No"
fig1.data[2].name="Crowd Majority: No"
fig2.data[2].name="Expert Majority: Yes"
fig3.data[0].name="Expert Majority: NaN"
#########################
#NORMAL DISTRIBUTION: 
#https://stackoverflow.com/questions/63865209/plotly-how-to-show-both-a-normal-distribution-and-a-kernel-density-estimation-i

#group_labels = ['distplot']
#fig3 = ff.create_distplot(filt_df.loc[consensus_no], group_labels, curve_type = 'normal')
#normal_x = fig3.data[1]['x']
#normal_y = fig3.data[1]['y']

#fig2.add_traces(go.Scatter(x=normal_x, y=normal_y, mode = 'lines',
                        #  line = dict(color='rgba(0,255,0, 0.6)',
                                      #dash = 'dash'
                           #           width = 1),
                         # name = 'normal'
                        # ))
                    
############################                    
fig2.add_trace(fig1.data[0])
fig2.add_trace(fig1.data[1])
fig2.add_trace(fig1.data[2])
fig2.add_trace(fig1.data[3])
fig2.add_trace(fig3.data[0])
#fig1.show()

fig2.update_layout(barmode='overlay')
fig2.update_xaxes(dtick=0.2)
fig2.update_traces(opacity=0.32)
import plotly.graph_objects as go
fig2.add_shape(type="rect",x0=0.5,x1=0.55,y0=0,y1=750,line_width=1,line_dash='dot')
#fig2.add_trace(go.Scatter(filt_df.loc[consensus_no], x)
fig2.show()
fig2.update_layout(title_text='Vote Distribution: Normal',
    title_x=0.5, showlegend=True,
    legend_title=None)
fig2.update_xaxes(range=[-0.04,0.46])
fig2.update_yaxes(range=[0,850])
fig2.show()
fig2.update_xaxes(range=[0.55,1.04])
fig2.update_layout(title_text='Vote Distribution: Abnormal',
    title_x=0.5, showlegend=True,
    legend_title=None)
fig2.show()


The axis range represents the extent of agreement within a group: 0.5 is an even split; 1 is full group agreement that the case is abnormal; 0 is full group agreement that the case is normal. The distribution curve for each category shows that the crowd is unanimous in its decision more frequently than the experts are. This is evident because the expert curve peaks towards the middle of the graph, while the crowd curve peaks towards the axis bounds, showing group agreement with the expert majority conclusion. **This shows that the crowd agrees with the expert majority more than the experts themselves agree with the expert majority**

## Exploring in more detail what split and unanimous agreement look like

In [216]:

fig1 = px.histogram(filt_df, x='Crowd Agreement',color='Crowd Majority', color_discrete_map={"'yes'":'green',"'no'":'yellow'},
                   text_auto=True)

fig1.data[1].name="Crowd Majority: No"


fig1.data[0].name="Crowd Majority: Yes"


from plotly.subplots import make_subplots
fig = make_subplots(rows=1,cols=2)

fig.add_trace(fig1.data[1],row=1,col=1)
fig.update_xaxes(title_text='Crowd Majority: No',row=1,col=1)
fig.update_xaxes(range=[0.48,0.52])

fig.add_trace(fig1.data[0],row=1,col=2)

fig.update_xaxes(title_text='Crowd Majority: Yes',range=[0.48,0.52],row=1,col=2)

fig.update_yaxes(range=[0,3000])
fig.update_traces(opacity=0.6)
fig.update_layout(title_text="Number of Cases Where Crowd is Completely Split")
fig.add_shape(type="rect",x0=0.493,x1=0.507,y0=0,y1=2950,line_width=1,line_dash='dot', row=1,col=1)
fig.add_shape(type="rect",x0=0.493,x1=0.507,y0=0,y1=2950,line_width=1,line_dash='dot', row=1,col=2)
fig.show()

print(3003/(3003+186)*100)
print((743+116)/len(filt_df.index)*100)


94.16745061147695
5.770135017129039


The crowd is significantly more split when deciding which cases are abnormal, with less dispute when choosing normal. Of the 3,189 cases at the 0.5 split point (3003 + 186), 94% (3003) were indecisive on whether the case is normal/healthy.

Additionally, 743 cases were confidently labeled by the crowd as abnormal (=1) and 116 as normal (=0).
### Only 5.77% of cases were labeled by the crowd with full confidence
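
The counts in the cell above are typed in by hand from the histogram. A sketch of deriving the same kind of figures directly from a column is shown below, using a toy `Crowd Agreement` column with made-up values; applying it to `filt_df` is left as an assumption about the intended workflow.

```python
import pandas as pd

# Made-up agreement values on the rescaled 0-to-1 axis.
toy = pd.DataFrame(
    {"Crowd Agreement": [0.0, 0.5, 0.5, 0.571, 1.0, 1.0, 0.857, 0.5]}
)

n_split = int((toy["Crowd Agreement"] == 0.5).sum())        # at the midpoint
n_full = int(toy["Crowd Agreement"].isin([0.0, 1.0]).sum())  # at either bound
pct_full = n_full / len(toy) * 100

print(n_split, n_full, pct_full)  # 3 3 37.5
```

Computing the counts from the data keeps the percentages in sync if the filtering threshold or the rescaling changes.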



## Are experts reliable? Turns out, yes.


In [208]:
#fig = px.histogram(filt_df.loc[experts_split], x='Crowd Agreement',color='Crowd Majority',marginal='box')
#fig.show()


# Case IDs where the experts are evenly split (4 of 8 voted abnormal)
split_conses = filt_df.index[filt_df["Expert Agreement"]==0.5].tolist()
# Use .loc to select rows by index label; plain [] with a list looks up columns
fig3 = px.histogram(filt_df.loc[split_conses], x='Crowd Agreement',color='Crowd Majority')
fig3.show()

Seems uniform; which cases fall into the category? Are they the harder ones?
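
Selecting the split-expert cases relies on looking rows up by their Case ID labels. The sketch below, on a toy frame with three Case IDs taken from the tables above and illustrative values, shows why a list of index labels must go through `.loc`: plain `[]` indexing with a list is interpreted as a list of column names.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Expert Agreement": [0.5, 0.0, 0.5], "Crowd Agreement": [0.571, 0.5, 1.0]},
    index=pd.Index([5888091, 5888093, 5918377], name="Case ID"),
)
split_ids = toy.index[toy["Expert Agreement"] == 0.5].tolist()

try:
    toy[split_ids]  # the list is treated as column labels
    raised = False
except KeyError:
    raised = True

split_rows = toy.loc[split_ids]                  # label-based row selection
same_rows = toy[toy["Expert Agreement"] == 0.5]  # equivalent boolean mask

print(raised, split_rows.index.tolist())  # True [5888091, 5918377]
```

The boolean-mask form skips the intermediate list entirely and is often the simpler choice when the IDs are not needed elsewhere.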

Make these into bubble charts to show proportion: https://plotly.com/python/bubble-charts/