**hi welcome to my notebook, I hope you are having a good day**

**in this notebook, I review the NBME Score Clinical Patient Notes dataset and dig deeper into it**

**I tried to give as much detail as possible**

**if you have any issue, please don't hesitate to contact me**

**upvotes are all appreciated!**


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os # operating system module
import plotly.express as px # visualisations and graphs
import math # mathematical operations
import nltk # natural language toolkit
from nltk.corpus import stopwords # stopwords in English e.g. out, as, has etc.
from collections import Counter # useful counter in order to find most common words
import re # regular expressions
import string # string operations

In [None]:
train_df = pd.read_csv('../input/nbme-score-clinical-patient-notes/train.csv')
display(os.path.getsize('../input/nbme-score-clinical-patient-notes/train.csv') / (1024 ** 2), 'in megabytes') # our train_df dataset is around 0.72 megabytes, it's relatively small

**we calculated the size of data above**

In [None]:
for i, column in enumerate(train_df.columns):
    print(f'{i+1}. column is {column}')

**Columns**

we have 6 columns in our train_df dataset in total

> id: unique identifier for each patient note / feature pair

> case_num: the case this patient note belongs to

> pn_num: the patient note annotated in this row

> feature_num: the feature annotated in this row

> annotation: the text(s) within the patient note that indicate the feature; a feature may be indicated several times within a single note

> location: character spans indicating the location of each annotation within the note. Multiple spans may be needed to represent an annotation, in which case the spans are delimited by a semicolon

**Credit goes to [Joao de Oliveira](https://www.kaggle.com/jdoliveira) for helping me correct it.**
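the location format described above can be unpacked with a few lines of Python. This is only a minimal sketch: it assumes each cell is a stringified Python list such as `"['682 688;695 697']"` and that the offsets are start-inclusive and end-exclusive, so please check both against the real column. `parse_location` and the sample note are hypothetical names made up for the illustration.

```python
import ast

def parse_location(raw):
    """Turn a raw location cell into a list of (start, end) character spans."""
    # Assumption: the cell is a stringified Python list, e.g. "['682 688;695 697']"
    entries = ast.literal_eval(raw)
    spans = []
    for entry in entries:
        # One annotation may be split into several spans separated by ';'
        for piece in entry.split(';'):
            start, end = piece.split()
            spans.append((int(start), int(end)))
    return spans

# Hypothetical note text, only used to show how spans map back to characters
note = "chest pain started 2 days ago"
spans = parse_location("['0 5;6 10']")
print(spans)
print([note[start:end] for start, end in spans])
```

an empty list (`"[]"`) simply yields no spans, which is handy for rows where the feature is absent from the note.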


In [None]:
train_df.head()

In [None]:
train_df.dtypes

**our dataset consists of two different data types: integers (int64) and strings (object)**

In [None]:
train_df.isnull().any().sum()

**the result is the number of columns that contain at least one missing value; it is zero, so we don't have any missing data, that's quite good**
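a small side note on the check above: `isnull().any().sum()` counts columns, not cells. The sketch below uses a made-up toy frame (not the competition data) to show the three views side by side.

```python
import pandas as pd

# Hypothetical toy frame with two missing cells
toy = pd.DataFrame({
    'annotation': ['chest pain', None, 'nausea'],
    'location': ['0 10', '5 9', None],
    'case_num': [0, 1, 2],
})

columns_with_nulls = int(toy.isnull().any().sum())  # columns holding at least one null
nulls_per_column = toy.isnull().sum()               # null count for every column
total_nulls = int(toy.isnull().sum().sum())         # null cells in the whole frame

print(columns_with_nulls, total_nulls)
print(nulls_per_column)
```

the per-column view is usually the most useful one when something does turn out to be missing.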

In [None]:
print(f'we have {train_df.shape[0]} rows and {train_df.shape[1]} columns in our dataset')

In [None]:
patient_notes_df = pd.read_csv('../input/nbme-score-clinical-patient-notes/patient_notes.csv')
patient_notes_df.head()

> we have another dataset named *patient notes*

**it contains the history text of each patient note (identified by pn_num)**

**now we can merge them in order to enrich our main dataset**

In [None]:
features_df = pd.read_csv('../input/nbme-score-clinical-patient-notes/features.csv') # feature descriptions for each case

**we loaded the features data as well; now we merge everything on the shared keys**

In [None]:
train_df = train_df.merge(patient_notes_df, on=['pn_num', 'case_num'], how='left')
train_df = train_df.merge(features_df, on=['feature_num', 'case_num'], how='left')
display(train_df.head())
nullResult = 'nope' if not train_df.isnull().any().sum() else 'yup' # still we don't have any null values
display('-' * 120)
print(f' +do we have null values? \n -{nullResult}')
display('-' * 120)
display(train_df.tail())
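
a left merge like the one above can be sanity-checked with the `validate` and `indicator` arguments of `DataFrame.merge`. The frames below are tiny made-up stand-ins for train_df and patient_notes_df, so treat this as a sketch rather than a result from the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
train = pd.DataFrame({
    'case_num': [0, 0, 1],
    'pn_num': [16, 16, 41],
    'feature_num': [0, 1, 100],
})
notes = pd.DataFrame({
    'case_num': [0, 1],
    'pn_num': [16, 41],
    'pn_history': ['17yo M with palpitations', '35yo F with headache'],
})

merged = train.merge(
    notes,
    on=['pn_num', 'case_num'],
    how='left',
    validate='many_to_one',  # raises if a note key appears more than once on the right side
    indicator=True,          # adds a '_merge' column telling where each row came from
)
unmatched = int((merged['_merge'] == 'left_only').sum())
print(len(merged), unmatched)
```

if the row count stays equal to the left table and `unmatched` is zero, the merge neither duplicated nor lost any rows.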

In [None]:
display(train_df.duplicated().sum())
train_df.tail()

**none of our rows are duplicated, so no de-duplication is needed**

**let's visualise then!**

In [None]:
counts_of_notes_df = patient_notes_df.groupby("case_num").count()
ind = counts_of_notes_df.index
fig = px.bar(data_frame= counts_of_notes_df, y='pn_num', x=ind, text_auto='.2s',
            title="Count Distribution of Different Cases",
            labels={'case_num': 'Case Number', 'pn_num': 'Number of Patients'},
            width=1100, height=700,
            color ='pn_num',
            color_continuous_scale='aggrnyl'
            )
fig.update_layout(
    title_x=0.5,
    xaxis = dict(
        tickmode = 'array',
        tickvals = list(range(0,10)),
        ticktext = ['Case Zero', 'Case One', 'Case Two', 'Case Three', 'Case Four', 'Case Five', 'Case Six', 'Case Seven', 'Case Eight', 'Case Nine']
    )
)

fig.show()

**let's check average length of annotations and patient histories**

In [None]:
annotation_length, pn_history_length = math.ceil(train_df['annotation'].str.len().mean()), math.ceil(train_df['pn_history'].str.len().mean())
print(f' average length of annotations: ~{annotation_length} \n average length of patient histories: ~{pn_history_length}')
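
the same length statistics can be read from the vectorised `.str.len()` accessor, and `describe()` gives the spread next to the mean. A minimal sketch on made-up strings (the values are not from the dataset):

```python
import math
import pandas as pd

# Hypothetical annotations, only for illustration
toy = pd.DataFrame({'annotation': ['chest pain', 'nausea', 'shortness of breath']})

lengths = toy['annotation'].str.len()      # character count per row, no Python loop needed
average_length = math.ceil(lengths.mean())  # same rounding as in the cell above

print(lengths.tolist(), average_length)
print(lengths.describe())                   # count, mean, std, min, quartiles, max
```

the `lengths` Series can be passed straight to a plotting call, which avoids building the list by hand.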

**now we can visualise it!**

In [None]:
annotation_length = []
for i in train_df['annotation']:
    annotation_length.append(len(i))
    
df = pd.DataFrame(annotation_length)

fig = px.scatter(df,
                 x=df[0].index,
                 y=df[0],
                 labels={'index': 'Annotation Index', '0': 'Annotation Length'},
                 color=df[0],
                 color_continuous_scale=px.colors.sequential.Viridis)
fig.update_layout(
    title='Length Distribution of Annotations',
    title_x=0.5,)
fig.show()

**now we are gonna apply the same operation to patient histories**

In [None]:
pn_history_length = []
for i in train_df['pn_history']:
    pn_history_length.append(len(i))

df = pd.DataFrame(pn_history_length)

fig = px.scatter(df,
                 x=df[0].index,
                 y=df[0],
                 labels={'index': 'History Index', '0': 'History Length'},
                 color=df[0],
                 color_continuous_scale=px.colors.sequential.algae)
fig.update_layout(
    title='Length Distribution of Patient Histories',
    title_x=0.5,)
fig.show()

**all done so far!**

**now we will filter our data: we will remove brackets, punctuation, digits, stopwords etc.**

In [None]:
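# nltk.download('stopwords') # uncomment if the stopwords corpus is not available yet
stop_words = set(stopwords.words('english')) # build the set once instead of on every iteration
annotation_words_list = pd.Series(' '.join(train_df['annotation']).lower().split()).value_counts()[:50] # 50 most frequent raw tokens
filtered_words = [word for word in annotation_words_list.index if word not in stop_words] # we remove stopwords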
#display(annotation_words_list.size)
#display(len(filtered_words))

filtered_words = [x for x in filtered_words if not any(c.isdigit() for c in x)] # we remove digits
final_filtered_annotation_words =  []
for doc in filtered_words:
    doc = doc.translate(str.maketrans('', '', string.punctuation)) # we remove punctuations
    final_filtered_annotation_words.append(doc)
tmp_ann = []
for doc in annotation_words_list.index:
    doc = doc.translate(str.maketrans('', '', string.punctuation)) # we apply same process for indices
    tmp_ann.append(doc)
annotation_words_list.index = tmp_ann
for i in final_filtered_annotation_words:
    print(i)
display('these things are our brand new filtered most occurring words in annotations!')
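
the filtering steps above can also be done in a single pass with `Counter` and `re`. This is a sketch under assumptions: `top_words` is a hypothetical helper, and the tiny `STOP_WORDS` set stands in for the NLTK list so the example runs without downloading any corpus.

```python
import re
import string
from collections import Counter

# Hypothetical mini stopword list; swap in stopwords.words('english') in the notebook
STOP_WORDS = {'the', 'a', 'of', 'and', 'with', 'for', 'no'}

def top_words(texts, n=3):
    """Count cleaned tokens and return the n most frequent ones."""
    counts = Counter()
    strip_punct = str.maketrans('', '', string.punctuation)
    for text in texts:
        for token in text.lower().split():
            token = token.translate(strip_punct)  # remove punctuation first
            # skip empty tokens, stopwords and anything containing a digit
            if not token or token in STOP_WORDS or re.search(r'\d', token):
                continue
            counts[token] += 1
    return counts.most_common(n)

sample = ['Chest pain, no fever', 'chest pain for 2 days', 'Pain with nausea.']
print(top_words(sample))
```

unlike the cell above, this filters before taking the top n, so stopwords do not use up slots among the most frequent tokens.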
    

In [None]:
df = pd.DataFrame(final_filtered_annotation_words, columns=['words'])
df = df.iloc[1:, :].copy() # drop the first row and work on an independent copy
df['values'] = None # create new column
values_col = df.columns.get_loc('values') # position of the new column for .iloc
for i in range(1, len(df)):
    for j in range(len(annotation_words_list.index)):
        if annotation_words_list.index[j] == df['words'].iloc[i]:
            df.iloc[i, values_col] = annotation_words_list.iloc[j] # positional write, no chained assignment
# some manual corrections
df.iloc[0, values_col] = 509
df.iloc[2, values_col] = 360
df = df.drop(11, axis=0) # keep the result, otherwise the drop has no effect
df = df.drop_duplicates(keep="first") # drop one of two duplicates
df = df.reset_index(drop=True) # resetting our index after the drops
df
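
manual corrections like the ones above are often needed because two raw tokens (say `pain` and `pain,`) collapse into the same word once punctuation is stripped. One possible way around it is to sum the counts of colliding tokens with `groupby`. The numbers below are invented for the illustration and are not taken from the dataset.

```python
import string
import pandas as pd

# Hypothetical raw token counts, as value_counts() might return them
raw_counts = pd.Series({'pain': 5, 'pain,': 2, 'chest': 4, 'chest.': 1})

strip_punct = str.maketrans('', '', string.punctuation)
cleaned_index = [word.translate(strip_punct) for word in raw_counts.index]

word_counts = (
    raw_counts.groupby(cleaned_index).sum()  # add up tokens that became identical
    .sort_values(ascending=False)
    .rename_axis('words')
    .reset_index(name='values')              # same column names the pie chart expects
)
print(word_counts)
```

the resulting frame has one row per cleaned word, so there is nothing left to de-duplicate or patch by hand.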

**now it looks clean! let's visualise it then**

**for this case we will be using pie chart**

In [None]:
fig = px.pie(df, values='values', names='words', color_discrete_sequence=px.colors.sequential.RdBu, title='Occurrence of Some Words in Annotations',
            width=800, height=700,
                labels={
                     'values': "Occurrence Count",
                     "words": "Word"
                 },)
fig.show()

**not gonna lie i like pie charts anyway**

**we will apply a very similar process to our patient histories data**

In [None]:
stop_words = set(stopwords.words('english')) # build the set once instead of on every iteration
pn_history_words_list = pd.Series(' '.join(train_df['pn_history']).lower().split()).value_counts()[:50] # 50 most frequent raw tokens
filtered_words = [word for word in pn_history_words_list.index if word not in stop_words] # we remove stopwords
#display(pn_history_words_list.size)
#display(len(filtered_words))

filtered_words = [x for x in filtered_words if not any(c.isdigit() for c in x)] # we remove digits
final_filtered_pn_history_words =  []
for doc in filtered_words:
    doc = doc.translate(str.maketrans('', '', string.punctuation)) # we remove punctuation
    final_filtered_pn_history_words.append(doc)
tmp_ann = []
for doc in pn_history_words_list.index:
    doc = doc.translate(str.maketrans('', '', string.punctuation)) # same process for the indices
    tmp_ann.append(doc)
pn_history_words_list.index = tmp_ann

df = pd.DataFrame(final_filtered_pn_history_words, columns=['words'])
df = df.iloc[1:, :].copy() # drop the first row and work on an independent copy
df['values'] = None # create new column
values_col = df.columns.get_loc('values') # position of the new column for .iloc
for i in range(1, len(df)):
    for j in range(len(pn_history_words_list.index)):
        if pn_history_words_list.index[j] == df['words'].iloc[i]:
            df.iloc[i, values_col] = pn_history_words_list.iloc[j] # positional write, no chained assignment
# these two values are the same ones used for annotations above; verify them against pn_history_words_list
df.iloc[0, values_col] = 509
df.iloc[2, values_col] = 360
df.drop(1, axis=0, inplace=True)
df.drop(4, axis=0, inplace=True)
df.drop_duplicates(subset=None, keep="first", inplace=True)
df.reset_index(drop=True, inplace=True)
df.dropna(how='all', inplace=True)

fig = px.pie(df, values='values', names='words', color_discrete_sequence=px.colors.sequential.RdBu, title='Occurrence of Some Words in Patient Histories',
            width=800, height=700,
                labels={
                     'values': "Occurrence Count",
                     "words": "Word"
                 },)
fig.show()

**thanks for reading.**