**[CAT 2 Assessment]()**

Note: This notebook has all the libraries needed already imported. You are thus limited to using:

*   Tensforlow for builing your ANN
*   Pandas for reading data (already done for you)
*   Sklearn for Data spliting
*   Matplotlib for plots


You need to upload the provided csv nmed 'CAT2_data.csv' file to this runtime. This file can be downloaded from:

[E_learning](https://elearning.strathmore.edu/mod/resource/view.php?id=223539) OR
[Google Drive](https://drive.google.com/file/d/1xJwoSy9KTC1ETodnKNFetgNOWuJqxR6s/view?usp=sharing)

For those working offline, ensure you know the file path to the downloaded csv file. The easiest way to do this would be to have both the Notebook and CSV files in the same folder. You can also run ```python % ls ``` on an empty cell to confirm.

Additionals:

*   Train for at least 30 epochs, 50 recommended
*   This does not need a GPU. *Training 30 epochs on a Dual Core 2.4 Ghz CPU takes about 2 minutes (could be less)*

-------

In [64]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Embedding, LSTM, Dense

### Data Aquisition

In [65]:
df = pd.read_csv('../Data/CAT2_data.csv')                                      # read data

df['X'].iloc[np.random.choice(df.index, 50, replace=False)] = np.nan            # randomly drop 50 rows
df = df.sample(frac=1).reset_index(drop=True)                                   # shuffle rows

labels = df['Y'].unique()                                                       # get unique labels

df = df[df['Y'].isin(np.random.choice(labels, 2, replace=False))]               # randomly choose 2 labels
df = df.reset_index(drop=True)                                                  # reset index

print('Your will be using the two labels below with value counts as shown. Runing this cell again will generate another randomised dataframe for you to use')
print(df['Y'].value_counts())                                                   # check value counts
print(df.shape)                                                                 # check shape, rows and columns

df.head()

Your will be using the two labels below with value counts as shown. Runing this cell again will generate another randomised dataframe for you to use
 Radiology           273
 Gastroenterology    230
Name: Y, dtype: int64
(503, 2)


Unnamed: 0,X,Y
0,Upper endoscopy with foreign body removal (Pe...,Gastroenterology
1,Iron deficiency anemia. Diverticulosis in th...,Gastroenterology
2,CT Abdomen & Pelvis W&WO Contrast,Gastroenterology
3,Generalized abdominal pain with swelling at t...,Radiology
4,Patient with dysphagia.,Gastroenterology


### Data Preparation / Preprocessing

In [66]:
df.dropna(inplace=True) # Handle missing  values

In [67]:
def enc(y):
    if (y == 'Gastroenterology'):
       return 1
    if (y == 'General Medicine'):
      return 2
    if (y == 'Orthopedic'):
      return 3
    if (y == 'Neurology'):
      return 4
    else:
      return 0

In [68]:
df['Label'] = df['Y'].apply(lambda x: enc(x))

In [69]:
df['X'] = df['X'].str.replace('[^\w\s]', '') # removing punctuation
df['X'] = df['X'].str.replace('\d+', '') # removing numbers
df['X'] = df['X'].str.replace('@', '') # removing leading whitespace '@'
df['X'] = df['X'].str.replace('#', '') # removing trailing '#'
df['X'] = df['X'].str.replace('\s+', ' ') # removing extra whitespace
df['X'] = df['X'].str.lower() # lowercase

df.head()

Unnamed: 0,X,Y,Label
0,upper endoscopy with foreign body removal pen...,Gastroenterology,0
1,iron deficiency anemia diverticulosis in the ...,Gastroenterology,0
2,ct abdomen pelvis wwo contrast,Gastroenterology,0
3,generalized abdominal pain with swelling at t...,Radiology,0
4,patient with dysphagia,Gastroenterology,0


In [70]:
train_texts, test_texts, train_labels, test_labels = train_test_split(df['X'], df['Y'], test_size=0.2, random_state=42)

### Modeling

In [71]:
# Tokenize the texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts)
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

In [72]:
# Pad sequences to have the same length
max_sequence_length = max(max(map(len, train_sequences)), max(map(len, test_sequences))) # max from train and test samples
train_data = pad_sequences(train_sequences, maxlen=max_sequence_length)
test_data = pad_sequences(test_sequences, maxlen=max_sequence_length)

In [73]:
# Create an embedding layer with dimension 300
embedding_dim = 300
vocab_size = len(tokenizer.word_index) + 1

embedding_layer = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                                            input_length=max_sequence_length)


In [74]:
# Define the (LSTM) model
model = Sequential()
model.add(Embedding(len(tokenizer.word_index) + 1, 100, input_length=max_sequence_length))
model.add(LSTM(128))
model.add(Dense(128, activation='ReLU'))
model.add(Dense(1, activation='sigmoid'))

In [75]:
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train the model
history = model.fit(train_data, train_labels, validation_data=(test_data, test_labels), epochs=30, batch_size=64)

Epoch 1/30


UnimplementedError: ignored

### Evaluation

In [None]:
# plot loss and accuracy
def plot(metric):
    plt.plot(history.history[metric])
    plt.xlabel('Epochs')
    plt.ylabel(metric)
    plt.legend([metric, 'val_'+metric])
    plt.show()

In [None]:
plot(history, 'accuracy')

In [None]:
plot(history, 'loss')

### Save Files

In [None]:
model.save('CAT2.h5')

In [None]:
model2 = load_model('CAT2.h5')

In [None]:
model2.summary()