# Using Metadata to Improve Artificial Intelligence Medical Image Diagnostic Accuracy
**Purpose and Background**
Conduct a descriptive analysis of crowdsourced data extracted from user interactions with a mobile application in which users were tasked with identifying abnormalities in medical images.

Two user categories were differentiated: medical experts, who were hired to interact with the application, and the crowd, meaning anyone who downloaded and used the application.

**Show that the crowd agrees with the expert majority more than individual experts agree with the expert majority**


### Import datasets

In [768]:
import pandas as pd
import numpy as np
results = pd.read_csv('1345_customer_results.csv') #medical case results
admin = pd.read_csv('1345_admin_reads.csv') #raw individual read

### Inspect Customer Results

In [769]:
results.dtypes
#results = results.set_index('Case ID')

Case ID                   int64
Origin                   object
Origin Created At        object
Content ID                int64
URL                      object
Labeling State           object
Series                  float64
Series Index            float64
Patch                   float64
Qualified Reads           int64
Correct Label            object
Majority Label           object
Difficulty              float64
Agreement               float64
First Choice Answer      object
First Choice Votes        int64
First Choice Weight     float64
Second Choice Answer     object
Second Choice Votes       int64
Second Choice Weight    float64
Internal Notes          float64
Comments                 object
Explanation             float64
dtype: object

**Preliminary filtering for security purposes**


In [770]:
results = results.dropna(subset=['Origin']) 
results["Expert: Abnormal Votes"] = results["Origin"].str.extract(r'vote(\d)').astype(float)
results = results.drop(['Origin Created At','Origin','Content ID','URL'],axis=1)

In [771]:
results.head(2)

Unnamed: 0,Case ID,Labeling State,Series,Series Index,Patch,Qualified Reads,Correct Label,Majority Label,Difficulty,Agreement,First Choice Answer,First Choice Votes,First Choice Weight,Second Choice Answer,Second Choice Votes,Second Choice Weight,Internal Notes,Comments,Explanation,Expert: Abnormal Votes
0,5888087,Gold Standard,,,,2,'no','no',0.0,1.0,'no',2,1.54,'yes',0,0.0,,[],,2.0
1,5888088,Gold Standard,,,,3,'no','no',0.0,1.0,'no',3,2.34,'yes',0,0.0,,[],,0.0


#### Correct number of experts

Rows with no expert-vote string in the URL (i.e. NA) were dropped.

There should be only 8 experts in total, so cases with an expert count above 8 were dropped.

(0 means all experts thought the case was normal) 
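
As a minimal sketch of the extraction and filtering steps described above, the toy example below runs the same `vote(\d)` pattern over a few made-up `Origin` strings. The strings are hypothetical (the real `Origin` format is not shown in this notebook); only the pattern and the threshold of 8 come from the analysis itself.

```python
import pandas as pd

# Hypothetical Origin values; the real format is not shown in this notebook.
toy = pd.DataFrame(
    {"Origin": ["set1_vote3.png", "set1_vote0.png", "set1_vote9.png", "set1_unlabeled.png"]}
)

# expand=False returns a Series; rows without a match become NaN.
toy["Expert: Abnormal Votes"] = (
    toy["Origin"].str.extract(r"vote(\d)", expand=False).astype(float)
)

# NaN <= 8 evaluates to False, so this comparison removes both the
# unmatched row and the row whose count is above 8.
kept = toy[toy["Expert: Abnormal Votes"] <= 8]
print(kept["Expert: Abnormal Votes"].tolist())  # [3.0, 0.0]
```

Because `\d` matches a single digit, this pattern can only ever yield values from 0 to 9.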

In [772]:
results = results[results["Expert: Abnormal Votes"] <= 8]

In [773]:
results = results.dropna(subset=["Expert: Abnormal Votes"])
#results["Expert: Abnormal Votes"].isnull().any()

**Inspect NaN Columns for Content**

In [774]:
results.loc[results['Series'].notna()| results['Series Index'].notna() | results['Patch'].notna() | results['Internal Notes'].notna() | results['Explanation'].notna()]

Unnamed: 0,Case ID,Labeling State,Series,Series Index,Patch,Qualified Reads,Correct Label,Majority Label,Difficulty,Agreement,First Choice Answer,First Choice Votes,First Choice Weight,Second Choice Answer,Second Choice Votes,Second Choice Weight,Internal Notes,Comments,Explanation,Expert: Abnormal Votes


The DataFrame is empty: none of the inspected columns contain any data, so they can be dropped.

In [775]:
results = results.drop(['Series','Series Index','Patch','Internal Notes','Explanation'],axis=1)

**Inspect Comments for Relevance**

In [776]:
results[results['Comments'] != '[]']


Unnamed: 0,Case ID,Labeling State,Qualified Reads,Correct Label,Majority Label,Difficulty,Agreement,First Choice Answer,First Choice Votes,First Choice Weight,Second Choice Answer,Second Choice Votes,Second Choice Weight,Comments,Expert: Abnormal Votes
4245,5892332,Gold Standard,1,'no','no',0.0,1.0,'no',1,0.8,'yes',0,0.0,['There was rapid and spiky rates so why am I ...,3.0
6029,5894116,Gold Standard,5,'no','yes',1.0,1.0,'yes',5,4.0,'no',0,0.0,['Can someone explain why the answer is “no”?'],0.0
8346,5896433,Gold Standard,3,'yes','no',1.0,1.0,'no',3,2.32,'yes',0,0.0,['??'],5.0
11433,5899520,Gold Standard,2,'yes','no',1.0,1.0,'no',2,1.58,'yes',0,0.0,"[""i can't see any spike in this question so wh...",5.0
12911,5900998,Gold Standard,2,'no','yes',1.0,1.0,'yes',2,1.56,'no',0,0.0,['There is obviously a peak happened in there'],3.0
13827,5901914,Gold Standard,6,'yes','no',1.0,1.0,'no',6,4.72,'yes',0,0.0,['No spike present'],5.0
13953,5902040,Gold Standard,2,'no','yes',1.0,1.0,'yes',2,1.58,'no',0,0.0,['How?'],3.0
16033,5904120,Gold Standard,1,'yes','no',1.0,1.0,'no',1,0.78,'yes',0,0.0,['How? '],6.0
16326,5904413,Gold Standard,3,'yes','no',1.0,1.0,'no',3,2.46,'yes',0,0.0,['Multiple?'],6.0
16385,5904472,Gold Standard,3,'yes','yes',0.333,0.667,'yes',2,1.58,'no',1,0.78,[' Wtf'],5.0


None of the comments seem relevant, so the Comments column will be dropped

In [777]:
results = results.drop(['Comments'],axis=1)
results

Unnamed: 0,Case ID,Labeling State,Qualified Reads,Correct Label,Majority Label,Difficulty,Agreement,First Choice Answer,First Choice Votes,First Choice Weight,Second Choice Answer,Second Choice Votes,Second Choice Weight,Expert: Abnormal Votes
0,5888087,Gold Standard,2,'no','no',0.000,1.000,'no',2,1.54,'yes',0,0.00,2.0
1,5888088,Gold Standard,3,'no','no',0.000,1.000,'no',3,2.34,'yes',0,0.00,0.0
2,5888089,Gold Standard,2,'no','no',0.000,1.000,'no',2,1.70,'yes',0,0.00,0.0
3,5888090,Gold Standard,1,'no','no',0.000,1.000,'no',1,0.82,'yes',0,0.00,0.0
4,5888091,In Progress,7,,'yes',,0.571,'yes',4,3.28,'no',3,2.32,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30288,5918375,Gold Standard,2,'no','yes',1.000,1.000,'yes',2,1.56,'no',0,0.00,2.0
30289,5918376,Gold Standard,3,'no','yes',0.667,0.667,'yes',2,1.56,'no',1,0.76,3.0
30290,5918377,In Progress,6,,'yes',,1.000,'yes',6,4.78,'no',0,0.00,4.0
30291,5918378,Gold Standard,0,'yes',,,,'yes',0,0.00,'no',0,0.00,5.0


### Important columns for analysis; original metadata
Each row corresponds to a medical case 

**Identifiers:** 

Case ID: unique identifier will serve as index

Labeling State: identifies whether an expert consensus has been achieved (yes = Gold Standard, no = In Progress)

URL: the expert vote count was extracted from within the URL

**Reads and Annotations**

Qualified Reads: total crowd vote count

Expert: Abnormal Votes: number of experts who thought the case was abnormal

(Note: the total number of experts voting is always 8.)

Correct Label: overall expert consensus 

{yes=case is abnormal, no=case is normal, NaN=no consensus}

Majority Label: overall crowd consensus on each case

**Measures of Confidence**

Difficulty: Qualified Reads *without the Correct Label* divided by total Qualified Reads.

Agreement: Qualified Reads *with the Majority Label* divided by total Qualified Reads.

Nth Choice Answer: crowd answer (First Choice is the Majority Label)
        
Nth Choice Votes: number of crowd votes per answer
        
Nth Choice Weight:
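
The Difficulty and Agreement definitions above can be checked by hand on a single case. The sketch below is illustrative only: the helper names `difficulty` and `agreement` are hypothetical, and both assume the case has at least one qualified read. The vote counts are those of case 5918376 from the results table (2 'yes' reads, 1 'no' read, Correct Label 'no').

```python
def difficulty(votes, correct_label):
    """Share of qualified reads that did NOT pick the correct label."""
    total = sum(votes.values())
    return (total - votes.get(correct_label, 0)) / total


def agreement(votes):
    """Share of qualified reads that picked the most-voted (majority) label."""
    total = sum(votes.values())
    return max(votes.values()) / total


# Case 5918376: 2 'yes' reads, 1 'no' read, Correct Label 'no'.
votes = {"yes": 2, "no": 1}
print(round(difficulty(votes, "no"), 3))  # 0.667
print(round(agreement(votes), 3))         # 0.667
```

Both values match the Difficulty and Agreement reported for that case.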
        
        
        



### Add Additional Relevant Columns 




In [778]:
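data = results  # alias, not a copy: the in-place dropna below also changes results
data.dropna(inplace = True)
perc = [.2,.4,.6,.8]  # percentiles to report alongside the default median
desc = data['Difficulty'].describe(percentiles = perc)
print(desc)
str(data['Difficulty'][2])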

count    23758.000000
mean         0.244252
std          0.334227
min          0.000000
20%          0.000000
40%          0.000000
50%          0.000000
60%          0.200000
80%          0.500000
max          1.000000
Name: Difficulty, dtype: float64


'0.0'

In [781]:
df = results 
expert_count = 8

#first digit

FirstDigit = {
'0' : 'very easy',
'1' : 'very challenging'
}

#second digit
SecondDigit = {
    '1' : 'very easy',
    '2' : 'easy',
    '3' : 'easy',
    '4' : 'moderate',
    '5' : 'moderate',
    '6' : 'moderately challenging',
    '7' : 'moderately challenging',
    '8' : 'challenging',
    '9' : 'challenging'
}


#df['Difficulty Category'] = df['Difficulty'].apply(lambda num: SecondDigit[str(num)[3]])
#df['Difficulty Category'] = df['Difficulty'].apply(lambda num: FirstDigit[str(num)[1]])

df["Expert: Normal Votes"] = (expert_count - results["Expert: Abnormal Votes"])
df["Expert: Abnormal Agreement"] = df["Expert: Abnormal Votes"]/expert_count
df['Crowd/Expert Consensus'] = np.where(df['Correct Label'] == df['Majority Label'],'yes','no')
df['Split_Expert Consensus'] = np.where(df["Expert: Abnormal Agreement"]==0.5, 'yes', 'no')

EM_yes = df.index[df['Correct Label'] == "'yes'"].tolist() #EM=Expert Majority

EM_no = df.index[df['Correct Label'] == "'no'"].tolist()

df.loc[EM_yes,"Error Rate"]= df['Expert: Normal Votes'][EM_yes]/expert_count
df.loc[EM_no,"Error Rate"]= df['Expert: Abnormal Votes'][EM_no]/expert_count
df.fillna('', inplace=True)

df

Unnamed: 0,Case ID,Labeling State,Qualified Reads,Correct Label,Majority Label,Difficulty,Agreement,First Choice Answer,First Choice Votes,First Choice Weight,Second Choice Answer,Second Choice Votes,Second Choice Weight,Expert: Abnormal Votes,Expert: Normal Votes,Expert Abnormal Agreement,Crowd/Expert Consensus,Split_Expert Consensus,Error Rate,Expert: Abnormal Agreement
0,5888087,Gold Standard,2,'no','no',0.000,1.000,'no',2,1.54,'yes',0,0.00,2.0,6.0,0.250,yes,no,0.250,0.250
1,5888088,Gold Standard,3,'no','no',0.000,1.000,'no',3,2.34,'yes',0,0.00,0.0,8.0,0.000,yes,no,0.000,0.000
2,5888089,Gold Standard,2,'no','no',0.000,1.000,'no',2,1.70,'yes',0,0.00,0.0,8.0,0.000,yes,no,0.000,0.000
3,5888090,Gold Standard,1,'no','no',0.000,1.000,'no',1,0.82,'yes',0,0.00,0.0,8.0,0.000,yes,no,0.000,0.000
5,5888092,Gold Standard,4,'no','no',0.000,1.000,'no',4,3.30,'yes',0,0.00,0.0,8.0,0.000,yes,no,0.000,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30286,5918373,Gold Standard,1,'yes','yes',0.000,1.000,'yes',1,0.80,'no',0,0.00,5.0,3.0,0.625,yes,no,0.375,0.625
30287,5918374,Gold Standard,4,'yes','yes',0.000,1.000,'yes',4,3.18,'no',0,0.00,5.0,3.0,0.625,yes,no,0.375,0.625
30288,5918375,Gold Standard,2,'no','yes',1.000,1.000,'yes',2,1.56,'no',0,0.00,2.0,6.0,0.250,no,no,0.250,0.250
30289,5918376,Gold Standard,3,'no','yes',0.667,0.667,'yes',2,1.56,'no',1,0.76,3.0,5.0,0.375,no,no,0.375,0.375


For each case:

Expert: Normal Votes: I subtracted the number of experts who voted the case as abnormal from the total number of experts.

Expert: Abnormal Agreement: I divided the number of experts who voted the case as abnormal by the total number of experts to get the proportion of experts who agree that the case is abnormal.

Error Rate: I extracted the indexes for each Correct Label category ('yes' and 'no') and calculated the "error rate" as the proportion of experts who did not vote with the expert majority.
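
As a hedged alternative to the index-list approach, the same Error Rate can be sketched in one vectorized step with `np.where`. The toy rows reuse vote counts that appear in the table above. Note one assumption: any label other than `'yes'` is treated as `'no'`, which is only safe once rows without a Correct Label have been dropped.

```python
import numpy as np
import pandas as pd

expert_count = 8  # total number of experts per case

toy = pd.DataFrame(
    {
        "Correct Label": ["'no'", "'yes'", "'no'"],
        "Expert: Abnormal Votes": [2.0, 5.0, 3.0],
    }
)
toy["Expert: Normal Votes"] = expert_count - toy["Expert: Abnormal Votes"]

# Experts who voted against the expert majority:
# normal votes when the consensus is 'yes', abnormal votes otherwise.
dissenting = np.where(
    toy["Correct Label"] == "'yes'",
    toy["Expert: Normal Votes"],
    toy["Expert: Abnormal Votes"],
)
toy["Error Rate"] = dissenting / expert_count
print(toy["Error Rate"].tolist())  # [0.25, 0.375, 0.375]
```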



## Exploratory Analysis 1

In [None]:
#%pip install jupyter-dash
import plotly.express as px
import plotly.io as pio
import plotly.figure_factory as ff
pio.renderers.default='notebook'
import matplotlib.pyplot as plt
#ax = plt.subplot()

### Exploratory Analysis

In [787]:
fig = px.histogram(df,x= "Qualified Reads", color="Error Rate")
fig.update_layout(xaxis_range=[0,15])
fig.show()

In [790]:
print(len(df[df['Qualified Reads'] < 5].value_counts()))
filt_df = df[df['Qualified Reads'] >= 5]

11715


#### How reliable are the individual experts on average?

In [788]:
fig = px.density_heatmap(df, x="Correct Label", y='Majority Label',text_auto=True)
fig.show()
df.std(axis=0, numeric_only=True)  # per-column standard deviation (numeric columns only)






Case ID                       7972.932186
Qualified Reads                  2.186893
Difficulty                       0.334227
Agreement                        0.162701
First Choice Votes               2.127848
First Choice Weight              1.703385
Second Choice Votes              0.815450
Second Choice Weight             0.648536
Expert: Abnormal Votes           2.740012
Expert: Normal Votes             2.740012
Expert Abnormal Agreement        0.342502
Error Rate                       0.139776
Expert: Abnormal Agreement       0.342502
dtype: float64

When the experts are undecided (N=4) on the case prognosis, the crowd appears to have a more unified opinion on the case. Let's make a histogram examining the cases where there is a lack of consensus.

With 8 experts per case, keeping only cases with 5 or more qualified reads makes the crowd and expert vote counts more proportional.

In [792]:

filt_fig = px.density_heatmap(filt_df, x="Correct Label", y='Majority Label',text_auto=True)
filt_fig.show()  # show the filtered heatmap, not the earlier unfiltered fig
filt_df.std(axis=0, numeric_only=True)





Case ID                       8066.388925
Qualified Reads                  1.555279
Difficulty                       0.281597
Agreement                        0.149788
First Choice Votes               1.755511
First Choice Weight              1.407069
Second Choice Votes              0.969948
Second Choice Weight             0.770991
Expert: Abnormal Votes           2.718787
Expert: Normal Votes             2.718787
Expert Abnormal Agreement        0.339848
Error Rate                       0.139738
Expert: Abnormal Agreement       0.339848
dtype: float64

In [793]:
# x: share of experts voting abnormal; y: crowd agreement with its own majority label
fig = px.scatter(filt_df,x= "Expert: Abnormal Agreement", y="Agreement")
#fig.update_layout(yaxis_range=[0.4,1.1])
fig.show()

This doesn't tell us much; perhaps calculate an average?
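
One way to calculate such an average, sketched here on made-up values, is to group the cases by the share of experts voting abnormal and take the mean crowd agreement within each group. The column names are the notebook's own; the numbers are invented purely for illustration.

```python
import pandas as pd

# Toy rows using the notebook's column names; the values are made up.
toy = pd.DataFrame(
    {
        "Expert: Abnormal Agreement": [0.0, 0.0, 0.5, 0.5, 1.0],
        "Agreement": [1.0, 0.8, 0.6, 0.7, 1.0],
    }
)

# Mean crowd agreement at each level of expert agreement.
mean_agreement = toy.groupby("Expert: Abnormal Agreement")["Agreement"].mean()
print(mean_agreement)
```

Plotting `mean_agreement` as a line or bar chart would give one point per expert-agreement level instead of one point per case.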

In [None]:
# Cases where the crowd majority disagrees with the expert consensus
consensus_no = filt_df.index[filt_df['Crowd/Expert Consensus']=='no'].tolist()
fig = px.histogram(filt_df.loc[consensus_no], x='Agreement',color='Majority Label',marginal='box')
fig.update_layout(xaxis_range=[-0.1,1.1])
fig.show()

Move red to left side of the graph

In [None]:
fig = px.histogram(filt_df.loc[consensus_no], x='Expert: Abnormal Agreement',color='Correct Label',marginal='box')
fig.show()

### Let's check the trends when experts are split in their opinion
What about when the crowd is split?

In [None]:
experts_split = filt_df.index[filt_df['Expert: Abnormal Agreement']==0.500].tolist()
fig = px.histogram(filt_df.loc[experts_split], x='Agreement',color='Majority Label',marginal='box')
fig.update_layout(xaxis_range=[0.45,1.05])
fig.show()

## Trying to aggregate case ids

In [None]:

filt_df = df[df['Qualified Reads'] > 5]
fig = px.histogram(filt_df,x= "Case ID")
#fig.update_layout(yaxis_range=[0.4,1.1])
fig.show()

Per case ID:
sum the instances and show the proportions of each error rate (i.e. aggregate by case ID).
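
A minimal sketch of that aggregation, assuming each Case ID appears in exactly one row (as the column description suggests), is to count how many cases fall at each Error Rate level and normalize the counts. The toy values below are made up.

```python
import pandas as pd

# Toy data: one row per case, with made-up error rates.
toy = pd.DataFrame(
    {"Case ID": [1, 2, 3, 4], "Error Rate": [0.0, 0.25, 0.25, 0.375]}
)

# Share of cases at each error-rate level; the shares sum to 1.
proportions = toy["Error Rate"].value_counts(normalize=True).sort_index()
print(proportions)
```

If a Case ID could repeat, a `groupby('Case ID')` step would be needed first to reduce each case to a single error rate.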

In [None]:
import statistics
#filt_df['aggreg_error_rate'] = filt_df.groupby(['Case ID'], as_index = False)['Error Rate'].mean()
#filt_df['aggreg_error_rate'] = filt_df.groupby(['Case ID'], as_index = False).apply(lambda x: statistics.mean(x))
def average(x):
    # Despite the name, this returns each value's share of the column total
    return x/sum(x)
#print(average(df['Expert: Normal Votes']))
#caseid_agg = df.groupby(['Case ID'])['Expert Agreement'].aggregate{}
print(df['Case ID'].unique())  # call the method; without () only the bound method is printed
#fig = px.histogram(caseid_agg)
#fig.show()
#print(sum(caseid_agg))
print(df.index[-1] - df.index[1])


In [None]:

filt_df = df[df['Qualified Reads'] > 5]
fig = px.histogram(filt_df,x="Case ID", color="Error Rate")
#fig.update_layout(yaxis_range=[0.45,1.1])
fig.show()

Seems uniform; which cases fall into the category? Are they the harder ones?

Make these into bubble charts to show proportion: https://plotly.com/python/bubble-charts/
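
A rough sketch of how such a bubble chart could be built is shown below, assuming the goal is one bubble per (Correct Label, Majority Label) pair sized by the number of cases. The helper names `bubble_table` and `bubble_chart` and the `Cases` column are hypothetical, and the toy labels are made up.

```python
import pandas as pd


def bubble_table(frame, x="Correct Label", y="Majority Label"):
    """Count cases per (x, y) pair; the count becomes the bubble size."""
    return frame.groupby([x, y]).size().reset_index(name="Cases")


def bubble_chart(counts, x="Correct Label", y="Majority Label"):
    """Draw one bubble per (x, y) pair, sized by the number of cases."""
    import plotly.express as px  # imported lazily so the table works without plotly

    return px.scatter(counts, x=x, y=y, size="Cases", size_max=60)


# Toy labels, quoted the same way as in the notebook's data.
toy = pd.DataFrame(
    {
        "Correct Label": ["'no'", "'no'", "'no'", "'yes'", "'yes'"],
        "Majority Label": ["'no'", "'no'", "'yes'", "'yes'", "'no'"],
    }
)
counts = bubble_table(toy)
print(counts)
# bubble_chart(counts).show()  # uncomment in a notebook to render the figure
```

Dividing `Cases` by `counts["Cases"].sum()` before plotting would size the bubbles by proportion rather than raw count.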