# Nym & 8200Bio Data Challenge
*__The goal__ is to predict the medical specialty from the transcription field; that is, for each file (a unique transcription) you need to predict whether it belongs to each label (medical specialty).* <br/><br/>
The process consists of three parts:
* This notebook, for cleaning the data and converting it to features.
* The __imbalanced data handling.ipynb__ file for making the data more balanced.
* The __Training & Testing.ipynb__ file for training and testing an ML classification model on our data.
So let's start.

# Text cleaning & processing 

In [90]:
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(stopwords.words('english'))

## Clean the text
__My text cleaning pipeline:__
* Convert the text into a list of words (tokens) as simply as possible.
* Drop the stop words, which carry little meaning, from the list.
* Strip punctuation marks and similar characters, and lowercase every word, to maintain uniformity.

In [36]:
df = pd.read_csv('Data/data.csv')

In [37]:
df.sample(5)

Unnamed: 0,transcription,specialty
602,"PREOPERATIVE DIAGNOSIS: , Appendicitis.,POSTOP...",Gastroenterology
2060,"CC:, Fluctuating level of consciousness.,HX:, ...",Radiology
2732,"PREOPERATIVE DIAGNOSIS: , Scalp lacerations.,P...",Surgery
560,"EXAM: , CT scan of the abdomen and pelvis with...",Gastroenterology
2800,"PREOPERATIVE DIAGNOSIS:, Carcinoma of the lef...",Surgery


In [38]:
#Level 1: Convert the text into a set of words as simply as possible

df['transcription'] = df['transcription'].apply(lambda x: nltk.word_tokenize(x))

In [39]:
# Keep negation words ("not" and the "n't" contractions) out of the stop-word list,
# because removing them can change the meaning of the text drastically.
# A set keeps the membership test in the next step fast.

stop_words = set(stop_words) | {',', ':', '.', ';', '', ' ', "'", "'s"}
stop_words = {word for word in stop_words if "n't" not in word and word != 'not'}
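
To make the effect of this rule concrete, here is a minimal sketch on a hypothetical miniature stop-word list (`toy_stop_words` and the sample tokens are made up for illustration; the real list comes from NLTK):

```python
# Hypothetical miniature stop-word list; the real one comes from NLTK.
toy_stop_words = ["the", "is", "not", "no", "nor", "isn't", "don't"]

# Same rule as in the cell above: take "not" and any "n't" contraction
# out of the stop-word list so they survive the later filtering step.
remaining_stop_words = [w for w in toy_stop_words if "n't" not in w and w != "not"]

tokens = ["The", "patient", "is", "not", "febrile", "no", "rash"]
filtered = [t for t in tokens if t.lower() not in remaining_stop_words]

print(remaining_stop_words)  # ['the', 'is', 'no', 'nor']
print(filtered)              # ['patient', 'not', 'febrile', 'rash']
```

Under this rule "not" is preserved, but other negations such as "no" or "nor" are still removed whenever the stop-word list contains them.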

In [40]:
#Level 2: Drop the meaningless words from the array

df['transcription'] = df['transcription'].apply(lambda x: list(filter(lambda word: word.lower() not in stop_words, x)))

In [41]:
def clean_array(array):
    array = list(map(lambda x: x.replace(".",""), array))
    array = list(map(lambda x: x.replace("'",""), array))
    array = list(map(lambda x: x.replace("`",""), array))
    array = list(map(lambda x: x.replace("-"," "), array))
    array = list(map(lambda x: x.strip(), array))
    array = list(map(lambda x: x.lower(), array))
    return array
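
As a quick illustration of what these cleaning steps do to individual tokens, the sketch below applies the same replacements to a few made-up tokens. The name `clean_tokens` and the sample list are hypothetical; the logic is meant to mirror `clean_array`.

```python
def clean_tokens(tokens):
    """Apply the same steps as clean_array, one token at a time."""
    cleaned = []
    for token in tokens:
        # Remove periods, apostrophes and backticks.
        for mark in (".", "'", "`"):
            token = token.replace(mark, "")
        # Turn hyphens into spaces, trim whitespace, lowercase.
        token = token.replace("-", " ").strip().lower()
        cleaned.append(token)
    return cleaned

sample_tokens = ["5-year-old", "Dr.", " Male ", "patient's", "--"]
cleaned_tokens = clean_tokens(sample_tokens)
print(cleaned_tokens)  # ['5 year old', 'dr', 'male', 'patients', '']
```

Two things are worth noticing: hyphenated tokens become multi-word strings such as `5 year old`, and a token made only of removed characters becomes an empty string that stays in the list.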

In [42]:
#Level 3: Clean punctuation marks etc. to maintain uniformity

df['transcription'] = df['transcription'].apply(lambda x: clean_array(x))

In [88]:
df

Unnamed: 0,transcription,specialty
0,"[admitting, diagnosis, kawasaki, disease, disc...",Allergy / Immunology
1,"[chief, complaint, 5 year old, male, presents,...",Allergy / Immunology
2,"[history, 34 year old, male, presents, today, ...",Allergy / Immunology
3,"[history, pleasure, meeting, evaluating, patie...",Allergy / Immunology
4,"[subjective, 42 year old, white, female, comes...",Allergy / Immunology
...,...,...
3687,"[preoperative, diagnoses, ,1, urinary, retenti...",Urology
3688,"[preoperative, diagnosis, clinical, stage, ta,...",Urology
3689,"[preoperative, diagnosis, bilateral, vesicoure...",Urology
3690,"[preoperative, diagnosis, inguinal, hernia, po...",Urology


In [89]:
df.to_csv('Data/final_data.csv', index=False)

Since neither of the two text-vectorization methods (count vectors and TF-IDF) has a consistent advantage over the other, I decided to build features with both and use both when training the model.
<br/><br/>
In addition, I created a binary column for each specialty, so that every label can be trained and evaluated on its own.
<br/><br/>
*Please note: the resulting feature tables are far larger than the original data set, so saving them to CSV files may take some time.*
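
The binary label columns can be pictured with a small pure-Python sketch. The `labels` list is hypothetical and stands in for the `specialty` column; the notebook itself builds these columns with pandas further down.

```python
# Hypothetical labels standing in for df['specialty'].
labels = ["Urology", "Surgery", "Urology", "Radiology"]

# Unique labels in order of first appearance.
unique_labels = list(dict.fromkeys(labels))

# One binary column per label: 1 where the row carries that label, else 0.
binary_columns = {
    label: [1 if value == label else 0 for value in labels]
    for label in unique_labels
}

for label, column in binary_columns.items():
    print(label, column)
```

Each row has exactly one `1` across the label columns, because every transcription carries a single specialty.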

## Text vectorization, first version: Count vectors

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
transcriptions = list(df['transcription'])

In [12]:
transcriptions = list(map(lambda x: ' '.join(x),transcriptions))

In [13]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(transcriptions)

In [14]:
X = X.toarray()
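
For intuition about what the count matrix contains, here is a pure-Python sketch that imitates `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) on two made-up documents. It is an illustration under those assumptions, not the library's implementation.

```python
import re
from collections import Counter

# Toy documents standing in for the joined transcriptions (hypothetical data).
docs = ["5 year old male presents", "male patient presents today"]

# Default token pattern: runs of two or more word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Vocabulary: sorted unique tokens; the position is the column index.
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary term.
count_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    count_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)    # ['male', 'old', 'patient', 'presents', 'today', 'year']
print(count_matrix)  # [[1, 1, 0, 1, 0, 1], [1, 0, 1, 1, 1, 0]]
```

Because the cleaned tokens are joined with spaces before vectorizing, a multi-word token such as `5 year old` is split again, and the single-character `5` is dropped by the default token pattern.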

In [10]:
def is_specialty(x,specialty):
    if x == specialty:
        return 1
    return 0

In [16]:
features_df = pd.DataFrame(X)
features_df['specialty'] = df['specialty']

for specialty in features_df['specialty'].unique():
    features_df[specialty] = features_df['specialty'].apply(lambda x: is_specialty(x, specialty))

features_df.to_csv('Data/Clean_data_1.csv', index=False)

In [17]:
features_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,Pediatrics - Neonatal,Physical Medicine - Rehab,Podiatry,Psychiatry / Psychology,Radiology,Rheumatology,Sleep Medicine,Surgery,Urology,speciality
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3687,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3688,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3689,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3690,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


## Text vectorization, second version: TF-IDF Score

In [64]:
transcriptions = list(df['transcription'])
transcriptions = list(map(lambda x: ' '.join(x),transcriptions))

In [65]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [66]:
vec = TfidfVectorizer()
X = vec.fit_transform(transcriptions)

In [67]:
X = X.toarray()
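
To show roughly how the TF-IDF scores arise, the sketch below computes them by hand for a toy count matrix, assuming scikit-learn's default settings (smoothed idf and L2 row normalization). The counts are hypothetical, and this is a simplified illustration rather than the library's code.

```python
import math

# Toy term counts (rows = documents, columns = terms); hypothetical data.
counts = [
    [1, 1, 0],
    [1, 0, 2],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: how many documents contain each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

# Weight each count by its idf, then scale every row to unit L2 norm.
tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm if norm else 0.0 for v in weighted])

for row in tfidf:
    print([round(v, 3) for v in row])
```

A term that appears in every document gets the minimum idf of 1, while rarer terms are weighted more heavily; terms absent from a document keep a score of 0.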

In [68]:
features_df = pd.DataFrame(X)

# Shift every TF-IDF score by 0.1 (zero scores become 0.1, as the output below shows)
features_df += 0.1
features_df['specialty'] = df['specialty']

for specialty in features_df['specialty'].unique():
    features_df[specialty] = features_df['specialty'].apply(lambda x: is_specialty(x, specialty))

features_df.to_csv('Data/Clean_data_2.csv', index=False)

In [69]:
features_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,Pediatrics - Neonatal,Physical Medicine - Rehab,Podiatry,Psychiatry / Psychology,Radiology,Rheumatology,Sleep Medicine,Surgery,Urology,speciality
0,0.1,0.1,0.100000,0.1,0.1,0.1,0.1,0.1,0.1,0.1,...,0,0,0,0,0,0,0,0,0,0
1,0.1,0.1,0.100000,0.1,0.1,0.1,0.1,0.1,0.1,0.1,...,0,0,0,0,0,0,0,0,0,0
2,0.1,0.1,0.100000,0.1,0.1,0.1,0.1,0.1,0.1,0.1,...,0,0,0,0,0,0,0,0,0,0
3,0.1,0.1,0.100000,0.1,0.1,0.1,0.1,0.1,0.1,0.1,...,0,0,0,0,0,0,0,0,0,0
4,0.1,0.1,0.100000,0.1,0.1,0.1,0.1,0.1,0.1,0.1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3687,0.1,0.1,0.100000,0.1,0.1,0.1,0.1,0.1,0.1,0.1,...,0,0,0,0,0,0,0,0,1,0
3688,0.1,0.1,0.100000,0.1,0.1,0.1,0.1,0.1,0.1,0.1,...,0,0,0,0,0,0,0,0,1,0
3689,0.1,0.1,0.100000,0.1,0.1,0.1,0.1,0.1,0.1,0.1,...,0,0,0,0,0,0,0,0,1,0
3690,0.1,0.1,0.100000,0.1,0.1,0.1,0.1,0.1,0.1,0.1,...,0,0,0,0,0,0,0,0,1,0


Both data frames are now saved for training, but before training we have to deal with the class imbalance in the data.