Hi all, this is my **medical review** of all **143 features** in the NBME competition.

**Hope you find this medical exploratory analysis useful!**

In [1]:
# Imports

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

train = pd.read_csv("../input/nbme-score-clinical-patient-notes/train.csv")
patient_notes = pd.read_csv("../input/nbme-score-clinical-patient-notes/patient_notes.csv")


# Table of Contents <a class="anchor" id="toc"></a>

* [Introduction](#intro)
* [Case #0: Palpitations](#case0)
* [Case #1: Right lower quadrant (RLQ) abdominal pain](#case1)
* [Case #2: Multiple symptoms of menopause](#case2)
* [Case #3: Pain in stomach area (epigastric pain)](#case3)
* [Case #4: Excessive nervousness](#case4)
* [Case #5: Episodes of palpitations and feelings of impending doom](#case5)
* [Case #6: Chest pain in a young adult](#case6)
* [Case #7: Irregular menstrual cycles with heavy bleeding](#case7)
* [Case #8: Symptoms of grief](#case8)
* [Case #9: Headache with additional symptoms](#case9)
* [Conclusions](#conclusions)

# Introduction <a class="anchor" id="intro"></a>

During my exploratory data analysis, I realized that all of the patient notes within a given case number have the same number of features (including so-called "missing" ('[]') features):
* All patients in Case 0 have 13 features.
* All patients in Case 1 have 13 features.
* All patients in Case 2 have 17 features, etc...
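This observation can be checked directly. The following is a minimal sketch, assuming `train` has the `case_num`, `pn_num` and `feature_num` columns used elsewhere in this notebook. The helper name `features_per_note` and the toy rows are made up for illustration.

```python
import pandas as pd

def features_per_note(train: pd.DataFrame) -> pd.DataFrame:
    """For each case, report the min and max number of features per patient note."""
    # Number of distinct features listed for every (case, note) pair
    per_note = train.groupby(["case_num", "pn_num"])["feature_num"].nunique()
    # If min == max for a case, every note in that case has the same number of features
    return per_note.groupby(level="case_num").agg(["min", "max"])

# Toy stand-in for train.csv (made-up rows, same column names)
toy_train = pd.DataFrame({
    "case_num":    [0, 0, 0, 0, 1, 1, 1],
    "pn_num":      [10, 10, 11, 11, 20, 20, 20],
    "feature_num": [0, 1, 0, 1, 100, 101, 102],
})

summary = features_per_note(toy_train)
print(summary)
```

On the real data, `features_per_note(train)` should show identical `min` and `max` values in every row if the observation holds.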

That made me think about the essence of this competition: the data consists of patient notes written by examinees while they were being tested on their competence in a physician-patient encounter. It seems that the features in this competition are pieces of medical information that always appear in the same order for each case. If a piece of information does not appear in the patient note, that feature is "missing" ('[]').
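To make the "missing" marker concrete, here is a small sketch, assuming the `annotation` and `location` columns store Python-style lists as strings, so that an absent feature is the literal string `'[]'`. The toy rows and the `annotation_list` and `is_missing` column names are invented for illustration.

```python
import ast
import pandas as pd

# Toy rows (made up); assumed to mimic how train.csv stores annotations as strings
toy = pd.DataFrame({
    "annotation": ["['chest pressure']", "[]"],
    "location":   ["['10 24']", "[]"],
})

# Turn the string representation into a real Python list
toy["annotation_list"] = toy["annotation"].apply(ast.literal_eval)

# A feature is "missing" when its annotation list is empty
toy["is_missing"] = toy["annotation_list"].apply(len) == 0
print(toy[["annotation", "is_missing"]])
```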

This is the explanation from the competition's description -<br>
*In this competition, you’ll identify specific clinical concepts in patient notes. Specifically, you'll develop an automated method to map clinical concepts from an exam rubric (e.g., “diminished appetite”) to various ways in which these concepts are expressed in clinical patient notes written by medical students (e.g., “eating less,” “clothes fit looser”). Great solutions will be both accurate and reliable.*

**The features are required details about the clinical cases.** 
I will review each case and present the medical information required in each case. 

**You are welcome to write any questions in the notebook's comments!**

Also, don't forget to upvote if you find this notebook useful. Enjoy!

## Notebook overview

Each case number is explored in order to describe the required clinical information for each feature.
Cases are described in the following way:
1. Present a patient note with a minimal number of missing annotations from the training data.
2. Describe the medical information that each feature should contain.
3. Validate the feature description with a random patient note.

The required answers are listed by feature number.
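The display cells used for every case below follow the same pattern, so they could be wrapped in one helper. This is only a sketch: `show_patient` is a hypothetical name, the toy frames are made up, and the column names are assumed to match `train.csv` and `patient_notes.csv`.

```python
import pandas as pd

BLUE, GREEN, RESET = "\033[94m", "\033[92m", "\033[0m"

def show_patient(pn_num, train, patient_notes):
    """Print one patient note followed by its annotations; return both."""
    note = patient_notes.loc[patient_notes["pn_num"] == pn_num, "pn_history"].iloc[0]
    annotations = train.loc[train["pn_num"] == pn_num, "annotation"].tolist()
    print(f"{BLUE}Patient Notes - ")
    print(note)
    print(f"{RESET}------------")
    print(f"{GREEN}Annotations:")
    for annotation in annotations:
        print(annotation)
    print(RESET, end="")
    return note, annotations

# Toy stand-ins for the two CSV files (made-up content)
toy_notes = pd.DataFrame({"pn_num": [1], "pn_history": ["17 yo M with palpitations"]})
toy_train = pd.DataFrame({"pn_num": [1, 1], "annotation": ["['17 yo']", "[]"]})

note, annotations = show_patient(1, toy_train, toy_notes)
```

Because the note and the annotations are looked up from the same `pn_num` argument, the two can never come from different patients.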


### [Back to table of contents](#toc)

# Case #0: Palpitations <a class="anchor" id="case0"></a>

Palpitations are perceived abnormalities of the heartbeat characterized by hard, fast or irregular beats.

Assuming the features are correct answers for the test - let's find patient notes with the minimal number of missing features. 

In [2]:
# Return the pn_num of the note with the most non-empty annotations for this case_num
best_answer_pn_num = train.query("`case_num` == 0 and `location` != '[]'").groupby(['pn_num']).size().idxmax()

The following cell was copied from SANSKAR HASIJA's excellent EDA -

https://www.kaggle.com/odins0n/nbme-detailed-eda

**Go check it out and give it an upvote!**

In [3]:
# The following cell was copied from SANSKAR HASIJA's excellent EDA
PATIENT_IDX = best_answer_pn_num

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

# Required answers for Case #0
*Listed by feature_num*

0 - Father had a myocardial infarction.<br>
1 - Mother had thyroid disease.<br>
2 - Complains of chest pressure.<br>
3 - "Episode(s)": may refer to discrete episodes of palpitations, as opposed to a continuous feeling of palpitations.<br>
4 - Complains of lightheadedness.<br>
5 - Denies symptoms of thyroid disease (such as heat/cold intolerance, hair loss).<br>
6 - Reports the use of Adderall (a medication commonly prescribed for ADHD).<br>
7 - Complains of shortness of breath (SOB).<br>
8 - Reports drinking caffeinated drinks.<br>
9 - Reports the feeling of palpitations.<br>
10 - Time since symptoms started (2-3 months).<br>
11 - Patient's age is 17 (yo = year-old).<br>
12 - Patient is a male (M).<br>

Let's compare this to a random patient and see if the features stay in the same order.


In [4]:
# Extract feature numbers for this case_num
train.query("`case_num` == 0")['feature_num'].unique()
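Whether the features really stay in the same order can also be tested for all notes at once rather than one random patient. A minimal sketch, assuming the row order in `train` reflects the order of interest. `feature_order_is_consistent` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

def feature_order_is_consistent(train: pd.DataFrame, case_num: int) -> bool:
    """True if every note of the case lists its feature_num values in the same order."""
    case_rows = train[train["case_num"] == case_num]
    # One tuple of feature numbers per note, in row order
    orders = case_rows.groupby("pn_num")["feature_num"].apply(tuple)
    return orders.nunique() == 1

# Toy data: case 0 is consistent, case 1 is deliberately shuffled
toy_train = pd.DataFrame({
    "case_num":    [0, 0, 0, 0, 1, 1, 1, 1],
    "pn_num":      [10, 10, 11, 11, 20, 20, 21, 21],
    "feature_num": [0, 1, 0, 1, 100, 101, 101, 100],
})

print(feature_order_is_consistent(toy_train, 0))
print(feature_order_is_consistent(toy_train, 1))
```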

# Validation of Case #0

Review the features of a random patient to see if they match the required answers.

In [5]:
# Choose a random patient from this case_num 
random_id = train[train['case_num'] == 0].sample(random_state=12)['pn_num']
display(random_id.iloc[0])

In [6]:
# Taken from 
PATIENT_IDX = random_id.iloc[0]

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

### [Back to table of contents](#toc)


# Case #1: Right lower quadrant (RLQ) abdominal pain <a class="anchor" id="case1"></a>


In [7]:
# Return the pn_num of the note with the most non-empty annotations for this case_num
best_answer_pn_num = train.query("`case_num` == 1 and `location` != '[]'").groupby(['pn_num']).size().idxmax()

# The following cell was copied from SANSKAR HASIJA's excellent EDA
PATIENT_IDX = best_answer_pn_num

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

**The first answer is missing; let's see if we can find it in another patient -**

In [8]:
train.query("`case_num` == 1 and `location` != '[]' and `feature_num` == 100").head()
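The same lookup can be generalized so that every feature missing from the chosen note gets a few example annotations from other notes. This is a sketch under the assumption that a missing feature is marked by `location == '[]'`. `examples_for_missing_features` is a hypothetical helper and the toy rows are invented.

```python
import pandas as pd

def examples_for_missing_features(train, case_num, pn_num, n=3):
    """Map each feature missing in note pn_num to up to n annotations from other notes."""
    case_rows = train[train["case_num"] == case_num]
    # Features that the chosen note does not mention
    is_chosen = case_rows["pn_num"] == pn_num
    missing = case_rows.loc[is_chosen & (case_rows["location"] == "[]"), "feature_num"]
    # Rows of the same case where an annotation exists
    annotated = case_rows[case_rows["location"] != "[]"]
    return {
        int(f): annotated.loc[annotated["feature_num"] == f, "annotation"].head(n).tolist()
        for f in missing
    }

# Toy data (made up): note 10 lacks feature 100, note 11 has it
toy_train = pd.DataFrame({
    "case_num":    [1, 1, 1, 1],
    "pn_num":      [10, 10, 11, 11],
    "feature_num": [100, 101, 100, 101],
    "annotation":  ["[]", "['weight loss']", "['no vaginal discharge']", "[]"],
    "location":    ["[]", "['5 16']", "['30 50']", "[]"],
})

examples = examples_for_missing_features(toy_train, case_num=1, pn_num=10)
print(examples)
```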

# Required answers for Case #1
*Listed by feature_num*

100 - Patient denies having vaginal discharge. <br>
101 - Reports a decrease in weight. <br>
102 - Had sexual relations 9 months ago.<br>
103 - Reports episodes of diarrhea.<br>
104 - Age - 20 years old.<br>
105 - Denies blood in stool.<br>
106 - Recurrent episodes of abdominal pain.<br>
107 - Pain in right lower quadrant (RLQ) of abdomen.<br>
108 - Denies urinary symptoms or change in urinary habits.<br>
109 - Reports decreased appetite.<br>
110 - Last menstrual period (LMP) was 2 weeks ago.<br>
111 - Onset of pain 8-10 hours ago.<br>
112 - Patient is a female (F).<br>


In [9]:
# Extract feature numbers for this case_num
train.query("`case_num` == 1")['feature_num'].unique()

# Validation of Case #1

In [10]:
# Choose a random patient from this case_num 
random_id = train[train['case_num'] == 1].sample(random_state=12)['pn_num']
display(random_id.iloc[0])

# Show the note and the annotations of the same randomly chosen patient
PATIENT_IDX = random_id.iloc[0]

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

### [Back to table of contents](#toc)

# Case #2: Multiple symptoms of menopause <a class="anchor" id="case2"></a>



In [11]:
# Return the pn_num of the note with the most non-empty annotations for this case_num
best_answer_pn_num = train.query("`case_num` == 2 and `location` != '[]'").groupby(['pn_num']).size().idxmax()

# The following cell was copied from SANSKAR HASIJA's excellent EDA
PATIENT_IDX = best_answer_pn_num

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

Find the missing pieces of information -

In [12]:
# Extract feature numbers for this case_num
feature_num_for_case = train.query("`case_num` == 2")['feature_num'].unique()
display(feature_num_for_case)

In [13]:
# Question 7 - feature_num 206
display('Question 7 - ')
display(train.query("`case_num` == 2 and `location` != '[]' and `feature_num` == 206").head())

# Question 8 - feature_num 207
display('Question 8 - ')
display(train.query("`case_num` == 2 and `location` != '[]' and `feature_num` == 207").head())

# Question 10 - feature_num 209
display('Question 10 - ')
display(train.query("`case_num` == 2 and `location` != '[]' and `feature_num` == 209").head())

# Question 11 - feature_num 210
display('Question 11 - ')
display(train.query("`case_num` == 2 and `location` != '[]' and `feature_num` == 210").head())

# Required answers for Case #2:
*Listed by feature_num*

200 - Description of past periods.<br>
201 - Last Papanicolaou test (also known as PAP, Pap smear, Pap test).<br>
202 - Has an intrauterine device (IUD) in order to avoid pregnancy.<br>
203 - Sexually active.<br>
204 - Reports vaginal dryness.<br>
205 - Complains of irregular menstrual cycles.<br>
206 - Reports nausea and vomiting 1 week ago with flu-like symptoms.<br>
207 - Decreased premenstrual syndrome (PMS) symptoms.<br>
208 - Patient is a female (F).<br>
209 - Reports feeling stressed.<br>
210 - Last menstrual period was 2 months ago.<br>
211 - Reports hot flashes.<br>
212 - Details of irregular menstrual cycles - timing, duration, blood flow.<br>
213 - Time since symptoms started - 3 years.<br>
214 - Excessive sweating.<br>
215 - Sleeping difficulties.<br>
216 - Age - 44 years old (yo).<br>


# Validation of Case #2

In [14]:
PATIENT_IDX = 21054

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

### [Back to table of contents](#toc)

# Case #3: Pain in stomach area (epigastric pain) <a class="anchor" id="case3"></a>


In [15]:
# Return the pn_num of the note with the most non-empty annotations for this case_num
best_answer_pn_num = train.query("`case_num` == 3 and `location` != '[]'").groupby(['pn_num']).size().idxmax()

# The following cell was copied from SANSKAR HASIJA's excellent EDA
PATIENT_IDX = best_answer_pn_num

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

# Required answers for Case #3:
*Listed by feature_num*

300 - Uncle had a gastric ulcer.<br>
301 - Reports epigastric pain.<br>
302 - Reports darker stool, suggestive of bleeding in stomach.<br>
303 - Reports the use of Motrin (ibuprofen), a non-steroidal anti-inflammatory drug (NSAID).<br>
304 - Reports a burning sensation, so-called heartburn.<br>
305 - Bloating after eating.<br>
306 - Frequency and course of symptoms.<br>
307 - Drinks alcohol.<br>
308 - Patient is a male.<br>
309 - Onset of symptoms was 2 months ago.<br>
310 - Pain waking him up from his sleep.<br>
311 - No fresh blood in stools.<br>
312 - Episodic pain - intermittent episodes (not ongoing).<br>
313 - Alleviating factors (Tums, an antacid).<br>
314 - Reports nausea.<br>
315 - Age - 35 years old.<br>



In [16]:
# Extract feature numbers for this case_num
feature_num_for_case = train.query("`case_num` == 3")['feature_num'].unique()
display(feature_num_for_case)

# Validation of Case #3

In [17]:
# Choose a random patient from this case_num 
random_id = train[train['case_num'] == 3].sample(random_state=12)['pn_num']
display(random_id.iloc[0])

# Show the note and the annotations of the same randomly chosen patient
PATIENT_IDX = random_id.iloc[0]

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

### [Back to table of contents](#toc)

# Case #4: Excessive nervousness <a class="anchor" id="case4"></a>


In [18]:
# Return the pn_num of the note with the most non-empty annotations for this case_num
best_answer_pn_num = train.query("`case_num` == 4 and `location` != '[]'").groupby(['pn_num']).size().idxmax()

# The following cell was copied from SANSKAR HASIJA's excellent EDA
PATIENT_IDX = best_answer_pn_num

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

# Required answers for Case #4:
*Listed by feature_num*

400 - Denies heat intolerance, changes in bowel habits, palpitations (symptoms related to thyroid disease).<br>
401 - Reports nervousness.<br>
402 - Reports family related stress.<br>
403 - Reports excessive coffee drinking.<br>
404 - Denies other psychological symptoms.<br>
405 - Denies weight change.<br>
406 - Has difficulty falling asleep.<br>
407 - Patient is a female.<br>
408 - Reports decreased appetite.<br>
409 - Age - the patient is 45 years old.<br>


In [19]:
# Extract feature numbers for this case_num
feature_num_for_case = train.query("`case_num` == 4")['feature_num'].unique()
display(feature_num_for_case)

# Validation of Case #4

In [20]:
# Choose a random patient from this case_num 
random_id = train[train['case_num'] == 4].sample(random_state=1)['pn_num']
display(random_id.iloc[0])

# Show the note and the annotations of the same randomly chosen patient
PATIENT_IDX = random_id.iloc[0]

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

### [Back to table of contents](#toc)

# Case #5: Episodes of palpitations and feelings of impending doom <a class="anchor" id="case5"></a>

In [21]:
# Return the pn_num of the note with the most non-empty annotations for this case_num
best_answer_pn_num = train.query("`case_num` == 5 and `location` != '[]'").groupby(['pn_num']).size().idxmax()

# The following cell was copied from SANSKAR HASIJA's excellent EDA
PATIENT_IDX = best_answer_pn_num

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

Find the missing pieces of information -

In [22]:
# Extract feature numbers for this case_num
feature_num_for_case = train.query("`case_num` == 5")['feature_num'].unique()
display(feature_num_for_case)

In [23]:
# Question 6 - feature_num 505
display('Question 6 - ')
display(train.query("`case_num` == 5 and `location` != '[]' and `feature_num` == 505").head())

# Question 7 - feature_num 506
display('Question 7 - ')
display(train.query("`case_num` == 5 and `location` != '[]' and `feature_num` == 506").head())

# Question 11 - feature_num 510
display('Question 11 - ')
display(train.query("`case_num` == 5 and `location` != '[]' and `feature_num` == 510").head())

# Question 16 - feature_num 515
display('Question 16 - ')
display(train.query("`case_num` == 5 and `location` != '[]' and `feature_num` == 515").head())


# Required answers for Case #5
*Listed by feature_num*

500 - Onset of symptoms - 5 years ago.<br>
501 - Patient is female.<br>
502 - Does not consume caffeinated drinks.<br>
503 - Reports shortness of breath (SOB).<br>
504 - Reports palpitations.<br>
505 - Was in the emergency department (ED or ER) 2 weeks ago and all tests were normal (blood tests, ECG).<br>
506 - Denies chest pain (CP).<br>
507 - Denies drug use.<br>
508 - Reports nausea during episodes.<br>
509 - Worsened in last 3 weeks.<br>
510 - Feels she is going to die during an episode.<br>
511 - Episode Duration - Lasts for 15-30 minutes.<br>
512 - Feels tightness in her throat.<br>
513 - Feeling hot, clammy.<br>
514 - Numbness of fingers during episode.<br>
515 - Feels tired, has difficulty concentrating.<br>
516 - Experiences stress due to unemployment and buying new house.<br>
517 - Age - 26 years old.<br>


# Validation of Case #5

Review the features of a random patient to see if they match the required answers.

In [24]:
# Choose a random patient from this case_num 
random_id = train[train['case_num'] == 5].sample(random_state=1)['pn_num']
display(random_id.iloc[0])

# Show the note and the annotations of the same randomly chosen patient
PATIENT_IDX = random_id.iloc[0]

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

### [Back to table of contents](#toc)

# Case #6: Chest pain in a young adult <a class="anchor" id="case6"></a>

In [25]:
# Return the pn_num of the note with the most non-empty annotations for this case_num
best_answer_pn_num = train.query("`case_num` == 6 and `location` != '[]'").groupby(['pn_num']).size().idxmax()

# The following cell was copied from SANSKAR HASIJA's excellent EDA
PATIENT_IDX = best_answer_pn_num

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

# Required answers for Case #6:
*Listed by feature_num*

600 - Subjective feeling of fever.<br>
601 - Patient is male.<br>
602 - Age - 17 years old.<br>
603 - Reports nasal congestion, rhinorrhea.<br>
604 - Description of "pleuritic" chest pain.<br>
605 - Reports exercise induced asthma.<br>
606 - Complains of chest pain (CP).<br>
607 - Onset of pain - yesterday.<br>
608 - Denies shortness of breath (SOB).<br>
609 - Engages in rock climbing.<br>
610 - Asthma medication (albuterol inhaler) did not help the pain.<br>
611 - Quality and intensity of pain - sharp, stabbing, 8/10.


In [26]:
# Extract feature numbers for this case_num
feature_num_for_case = train.query("`case_num` == 6")['feature_num'].unique()
display(feature_num_for_case)

# Validation of Case #6

Review the features of a random patient to see if they match the required answers.

In [27]:
# Choose a random patient from this case_num 
random_id = train[train['case_num'] == 6].sample(random_state=1)['pn_num']
display(random_id.iloc[0])

# Show the note and the annotations of the same randomly chosen patient
PATIENT_IDX = random_id.iloc[0]

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

### [Back to table of contents](#toc)

# Case #7: Irregular menstrual cycles with heavy bleeding <a class="anchor" id="case7"></a>

In [28]:
# Return the pn_num of the note with the most non-empty annotations for this case_num
best_answer_pn_num = train.query("`case_num` == 7 and `location` != '[]'").groupby(['pn_num']).size().idxmax()

# The following cell was copied from SANSKAR HASIJA's excellent EDA
PATIENT_IDX = best_answer_pn_num

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

# Required answers for Case #7
*Listed by feature_num*

700 - Patient is female.<br>
701 - Reports weight gain.<br>
702 - Irregular menstrual cycles with heavy bleeding.<br>
703 - Last period was 2 months ago.<br>
704 - Does not use contraceptives.<br>
705 - Feels fatigue.<br>
706 - History of infertility.<br>
707 - Age - 35 years old.<br>
708 - Onset of symptoms - 5-6 months.<br>


In [29]:
# Extract feature numbers for this case_num
feature_num_for_case = train.query("`case_num` == 7")['feature_num'].unique()
display(feature_num_for_case)

# Validation of Case #7

Review the features of a random patient to see if they match the required answers.

In [30]:
# Choose a random patient from this case_num 
random_id = train[train['case_num'] == 7].sample(random_state=1)['pn_num']
display(random_id.iloc[0])

# Show the note and the annotations of the same randomly chosen patient
PATIENT_IDX = random_id.iloc[0]

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

### [Back to table of contents](#toc)

# Case #8: Symptoms of grief <a class="anchor" id="case8"></a>

In [31]:
# Return the pn_num of the note with the most non-empty annotations for this case_num
best_answer_pn_num = train.query("`case_num` == 8 and `location` != '[]'").groupby(['pn_num']).size().idxmax()

# The following cell was copied from SANSKAR HASIJA's excellent EDA
PATIENT_IDX = best_answer_pn_num

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

Find the missing pieces of information -

In [32]:
# Extract feature numbers for this case_num
feature_num_for_case = train.query("`case_num` == 8")['feature_num'].unique()
display(feature_num_for_case)

In [33]:
# Question 8 - feature_num 807
display('Question 8 - ')
display(train.query("`case_num` == 8 and `location` != '[]' and `feature_num` == 807").head())

# Question 10 - feature_num 809
display('Question 10 - ')
display(train.query("`case_num` == 8 and `location` != '[]' and `feature_num` == 809").head())



# Required answers for Case #8
*Listed by feature_num*

800 - Reports eating more than usual.<br>
801 - 3 weeks since her son's death.<br>
802 - Gender - female.<br>
803 - Auditory hallucinations - heard a party next door.<br>
804 - Tossing and turning at night.<br>
805 - Age - 67 years old.<br>
806 - Difficulty falling asleep.<br>
807 - Took sleep medication (Ambien) from a friend.<br>
808 - Onset of symptoms - 3 weeks ago.<br>
809 - Cannot sleep during the day.<br>
810 - Sleep medication (Ambien) did not help her.<br>
811 - Feels drained and tired.<br>
812 - Lost normal interests.<br>
813 - Visual hallucinations - saw her son in the kitchen.<br>
814 - Family history (FMx) of depression.<br>
815 - Wakes up early in the morning.<br>
816 - Does not feel suicidal.<br>
817 - Reports getting too little sleep - 4-5 hours a day.<br>


# Validation of Case #8

Review the features of a random patient to see if they match the required answers.

In [34]:
# Choose a random patient from this case_num 
random_id = train[train['case_num'] == 8].sample(random_state=12)['pn_num']
display(random_id.iloc[0])

# Show the note and the annotations of the same randomly chosen patient
PATIENT_IDX = random_id.iloc[0]

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

### [Back to table of contents](#toc)

# Case #9: Headache with additional symptoms <a class="anchor" id="case9"></a>

In [35]:
# Return the pn_num of the note with the most non-empty annotations for this case_num
best_answer_pn_num = train.query("`case_num` == 9 and `location` != '[]'").groupby(['pn_num']).size().idxmax()

# The following cell was copied from SANSKAR HASIJA's excellent EDA
PATIENT_IDX = best_answer_pn_num

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

Find the missing pieces of information -

In [36]:
# Extract feature numbers for this case_num
feature_num_for_case = train.query("`case_num` == 9")['feature_num'].unique()
display(feature_num_for_case)

In [37]:
# Question 6 - feature_num 905
display('Question 6 - ')
display(train.query("`case_num` == 9 and `location` != '[]' and `feature_num` == 905").head())

# Question 11 - feature_num 910
display('Question 11 - ')
display(train.query("`case_num` == 9 and `location` != '[]' and `feature_num` == 910").head())

# Question 12 - feature_num 911
display('Question 12 - ')
display(train.query("`case_num` == 9 and `location` != '[]' and `feature_num` == 911").head())


# Required answers for Case #9
*Listed by feature_num*

900 - Over the counter pain medications (Ibuprofen, Tylenol) have not helped.<br>
901 - Age - 22 years old.<br>
902 - Onset of symptoms - yesterday.<br>
903 - Reports muscle aches (myalgias).<br>
904 - Location of pain - whole head.<br>
905 - Reports neck stiffness or neck pain.<br>
906 - Reports vomiting.<br>
907 - Does not have a rash.<br>
908 - Reports feeling nausea.<br>
909 - Reports upper respiratory tract infection symptoms - rhinorrhea, throat symptoms.<br>
910 - Lives with roommate.<br>
911 - Did not get meningitis vaccination shot.<br>
912 - Family history (FH) of migraines.<br>
913 - Gender - female.<br>
914 - Reports photophobia.<br>
915 - No contact with sick people.<br>
916 - Subjective feeling of fever.


# Validation of Case #9

Review the features of a random patient to see if they match the required answers.

In [38]:
# Choose a random patient from this case_num 
random_id = train[train['case_num'] == 9].sample(random_state=2)['pn_num']
display(random_id.iloc[0])

# Show the note and the annotations of the same randomly chosen patient
PATIENT_IDX = random_id.iloc[0]

patient_df = train[train['pn_num'] == PATIENT_IDX]
print(f"\033[94mPatient Notes - ")
print(f'\033[94m',patient_notes[patient_notes["pn_num"] == PATIENT_IDX]["pn_history"].iloc[0])
print("------------")
print(f'\033[92mAnnotations:')
for i in range(len(patient_df)):
    print(f'\033[92m',patient_df["annotation"].iloc[i])

### [Back to table of contents](#toc)

# Conclusions <a class="anchor" id="conclusions"></a>

In this notebook we reviewed all of the features for all cases, and described their medical meaning. 
Understanding that the features are a predefined list of medical information which may or may not appear in the patient note changed my perspective on the machine learning pipeline required for this competition.
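One way to see how often each predefined feature actually appears is to compute the share of notes in which it is annotated. This is a rough sketch, again assuming that `location == '[]'` marks an absent feature. `feature_coverage` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

def feature_coverage(train: pd.DataFrame) -> pd.Series:
    """Fraction of annotated notes in which each feature_num is present."""
    present = train["location"] != "[]"
    # The mean of a boolean column is the share of True values
    return present.groupby(train["feature_num"]).mean()

# Toy data (made up): feature 0 appears in one of two notes, feature 1 in both
toy_train = pd.DataFrame({
    "pn_num":      [10, 10, 11, 11],
    "feature_num": [0, 1, 0, 1],
    "location":    ["['1 2']", "['5 9']", "[]", "['3 8']"],
})

coverage = feature_coverage(toy_train)
print(coverage)
```

Features with low coverage are the ones most often "missing", which may matter when deciding how to treat empty annotations in a model.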

I hope it will help you in this competition. **Good luck!**

### [Back to table of contents](#toc)