# Data prep for Rasa blog post

This notebook outlines the preliminary data prep done for my blog post on creating a medical diagnosis bot using Rasa.

Part 1 describes how to prepare the input data for the diagnosis portion of the project, while Part 2 describes how I generated training data for Rasa's NLU model.

In [13]:
import pandas as pd
import spacy
import random
import re

nlp = spacy.load('en_core_web_md')

# Part 1: Vector data prep 

## Functions

Here we just have the one function, `get_sentence_vectors`. It takes some text and a spaCy `nlp` object and converts the text to a 300D vector representation (the vector size of `en_core_web_md`), which spaCy computes by averaging the token vectors.

In [2]:
def get_sentence_vectors(text, nlp):
    
    # run the text through spaCy; Doc.vector is the average of the token vectors
    embedding = nlp(text).vector.tolist()
    
    # return the averaged vector as a plain Python list
    return embedding
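
To make the averaging concrete, here is a minimal sketch that uses made-up 3D token vectors in place of spaCy's 300D ones. The `toy_vectors` table and the `mean_vector` helper are hypothetical and are not part of the project; they only illustrate that spaCy's `Doc.vector` defaults to the average of the token vectors.

```python
# Hypothetical toy embeddings; real spaCy vectors have 300 dimensions.
toy_vectors = {
    "sore": [1.0, 0.0, 2.0],
    "throat": [3.0, 2.0, 0.0],
}


def mean_vector(text, vectors):
    """Average the vectors of the whitespace-separated tokens in `text`."""
    token_vectors = [vectors[token] for token in text.lower().split()]
    dims = len(token_vectors[0])
    return [
        sum(vec[i] for vec in token_vectors) / len(token_vectors)
        for i in range(dims)
    ]


sentence_vector = mean_vector("sore throat", toy_vectors)
print(sentence_vector)  # [2.0, 1.0, 1.0]
```

Because the result is a plain average, word order is lost: "sore throat" and "throat sore" map to the same vector in this sketch.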

# Import and merge data

We only care about 3 files downloaded from Kaggle: `dia_t.csv` (a list of illnesses with IDs), `sym_t.csv` (a list of symptoms with IDs), and `diffsydiw.csv` (a mapping of illness IDs to symptom IDs).

Once imported, we can join everything together on the ID columns and do some basic cleaning.
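
The join can be sketched on a few made-up rows. The column names (`did`, `syd`, `diagnose`, `symptom`) match the ones used in this notebook, but the values are invented for illustration, and `dropna(subset=...)` is used here as one possible way to drop rows with missing names.

```python
import pandas as pd

# Made-up rows that mimic the column names used in this notebook.
toy_illness = pd.DataFrame({"did": [1, 2], "diagnose": ["Flu", None]})
toy_symptom = pd.DataFrame({"syd": [10, 11], "symptom": ["Fever", "Cough"]})
toy_links = pd.DataFrame({"did": [1, 1, 2], "syd": [10, 11, 10]})

# Inner joins on the ID columns attach names to each illness/symptom link.
toy_merged = toy_links.merge(toy_illness, on="did").merge(toy_symptom, on="syd")

# Drop rows where either name is missing.
toy_clean = toy_merged.dropna(subset=["diagnose", "symptom"])
print(toy_clean[["diagnose", "symptom"]])
```

In this toy data the link to illness 2 survives the merge but is dropped afterwards, because that illness has no name.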

In [3]:
illness_df = pd.read_csv('data/dia_t.csv')
symptom_df = pd.read_csv('data/sym_t.csv')
links_df = pd.read_csv('data/diffsydiw.csv')

In [4]:
source_data = (links_df
               .merge(illness_df, on="did")
               .merge(symptom_df, on="syd"))

# remove any missing data and select columns we need
source_data = source_data.loc[~(source_data['symptom'].isna())
                             & ~(source_data['diagnose'].isna()),
                             ['did', 'syd', 'diagnose', 'symptom']]
source_data.columns = ['illness_id', 'symptom_id', 'illness', 'symptom']


# tidy up some messy characters (vertical tabs)
source_data['illness'] = source_data['illness'].str.replace('\x0b', ' ')
source_data['symptom'] = source_data['symptom'].str.replace('\x0b', ' ')

## Use our function to convert each symptom to a vector representation, then save the symptom data to the data folder

In [5]:
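# copy the filtered rows so the column assignment below does not trigger a SettingWithCopyWarning
symptom_df = symptom_df.loc[~symptom_df['symptom'].isna()].copy()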
symptom_df['embedding'] = symptom_df.apply(lambda row: get_sentence_vectors(row['symptom'], nlp), axis = 1)
symptom_df.columns = ['symptom_id', 'symptom', 'symptom_vector']

# remove any messy characters
symptom_df['symptom'] = symptom_df['symptom'].str.replace('\x0b', ' ')

In [6]:
source_data.to_pickle('data/source_data.pkl')
symptom_df.to_pickle('data/symptoms.pkl')

# Create a vector of symptoms for each illness described in the data

Here we loop through each illness in the dataset and flag (with a 1 or 0) whether each symptom belongs to that illness. We will then compare a vector of symptoms flagged by the user against these illness vectors using a cosine similarity function to diagnose a potential illness.
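
As a rough illustration of that comparison, the sketch below scores a user's flagged symptoms against two illness vectors. The symptom order, the illness names, and the `cosine_similarity` helper are all hypothetical stand-ins; the actual project may compute the similarity differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no symptoms flagged on one side
    return dot / (norm_a * norm_b)


# Hypothetical symptom order and illness vectors, for illustration only.
all_symptoms = ["fever", "cough", "rash", "headache"]
illness_vectors = {
    "illness A": [1, 1, 0, 1],
    "illness B": [0, 0, 1, 0],
}

# Flag the symptoms the user reported, in the same order as all_symptoms.
reported = {"fever", "cough"}
user_vector = [1 if s in reported else 0 for s in all_symptoms]

scores = {
    name: cosine_similarity(user_vector, vec)
    for name, vec in illness_vectors.items()
}
best_match = max(scores, key=scores.get)
print(best_match)  # illness A
```

The key requirement is that the user's vector and every illness vector list the symptoms in the same order, which is why the notebook builds each illness vector from the same `symptom_df`.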

In [7]:
# list of illnesses
illnesses = list(source_data['illness'].drop_duplicates())

# list we will use to store our illness vectors
symptom_vectors = []

for illness in illnesses:
    
    illness_symptoms = list(source_data.loc[source_data["illness"]==illness, 'symptom'].drop_duplicates())
    
    symptom_df["related_to_illness"] = 0
    symptom_df.loc[symptom_df["symptom"].isin(illness_symptoms), "related_to_illness"] = 1
    
    
    symptom_vectors.append(list(symptom_df["related_to_illness"]))
    
diagnosis_data = pd.DataFrame({"illness":illnesses,
                              "illness_vector": symptom_vectors})

In [None]:
diagnosis_data.to_pickle('data/diagnosis_data.pkl')

# Part 2: Generate training samples with flagged entities

In this part we generate some training examples so the NLU model can learn how our users will describe symptoms to the chat bot. To do this we simply loop through our symptoms, sometimes combining them, and append them to different sentence beginnings.

We also make sure to tag every symptom description as a symptom entity, which will be understood by Rasa's NLU model. This is done by wrapping each symptom in square brackets and appending `(symptom)`, for example `[headache](symptom)`.

The output of this is inserted into nlu.md within the main project.
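
The tagging step can be sketched with two small helpers. Both `tag_symptom` and `build_example` are hypothetical names introduced here for illustration, and the symptom strings are made up. Note that this sketch joins any list of three or more symptoms with a final ", and", which differs slightly from the four-symptom template in the cell below.

```python
import re


def tag_symptom(symptom):
    """Strip parenthetical notes and wrap the text as a symptom entity."""
    cleaned = re.sub(r"\([^)]+\)", "", symptom).strip().lower()
    return f"[{cleaned}](symptom)"


def build_example(beginning, symptoms):
    """Join one or more tagged symptoms onto a sentence beginning."""
    tagged = [tag_symptom(s) for s in symptoms]
    if len(tagged) == 1:
        joined = tagged[0]
    elif len(tagged) == 2:
        joined = " and ".join(tagged)
    else:
        joined = ", ".join(tagged[:-1]) + ", and " + tagged[-1]
    return f"- {beginning} {joined}"


print(build_example("I have", ["Headache", "Pain (chest)"]))
# - I have [headache](symptom) and [pain](symptom)
```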

In [14]:
number_of_symptoms = [1, 2, 3, 4]
start_of_description = [
    "I have",
    "I'm suffering from",
    "I have really bad",
    "My symptoms are",
    "For the last few days I have had",
    "My husband is suffering from" ,
    "My wife is suffering from",
    "My son is suffering from",
    "My daughter is suffering from",
    "My child is suffering from",
    "I don't feel well, I have"
]

# get some examples of users describing different numbers of symptoms
for symptons_count in number_of_symptoms:
    
    # make 100 examples of each number of symptoms
    for ex in range(1, 101):
    
        description_beginning = random.choice(start_of_description)
        
        # collect some random symptoms
        symptom_1 = symptom_df['symptom'].sample(1).iloc[0].lower()
        symptom_2 = symptom_df['symptom'].sample(1).iloc[0].lower()
        symptom_3 = symptom_df['symptom'].sample(1).iloc[0].lower()
        symptom_4 = symptom_df['symptom'].sample(1).iloc[0].lower()
        
        symptoms = [symptom_1, symptom_2, symptom_3, symptom_4]
        symptoms_entity = []
        
        # remove parentheses from symptoms and add the necessary entity tags
        for symptom in symptoms:
            symptom = re.sub(r"\([^)]+\)", "", symptom).strip()
            symptom = f"[{symptom}](symptom)"
            symptoms_entity.append(symptom)
            
        symptom_1 = symptoms_entity[0]
        symptom_2 = symptoms_entity[1]
        symptom_3 = symptoms_entity[2]
        symptom_4 = symptoms_entity[3]
        
        # create the training sample string
        if symptons_count == 1:
            
            symptom_string = f"- {description_beginning} {symptom_1}"
            
        if symptons_count == 2:
            
            symptom_string = f"- {description_beginning} {symptom_1} and {symptom_2}"
            
        if symptons_count == 3:
            
            symptom_string = f"- {description_beginning} {symptom_1}, {symptom_2}, and {symptom_3}"
            
        if symptons_count == 4:
            
            symptom_string = f"- {description_beginning} {symptom_1}, {symptom_2}, {symptom_3}, {symptom_4}"
        
        print(symptom_string)

- For the last few days I have had [groin pain](symptom)
- My wife is suffering from [mouth ulcers](symptom)
- My wife is suffering from [suicidal tendencies](symptom)
- My child is suffering from [blister](symptom)
- For the last few days I have had [puncture wound](symptom)
- I have [armpit pain](symptom)
- I have [muscle spasm](symptom)
- I'm suffering from [neck ache or pain](symptom)
- My son is suffering from [excessive appetite](symptom)
- My wife is suffering from [anxiety](symptom)
- My wife is suffering from [nipple discharge](symptom)
- I have [ear swelling](symptom)
- I have [headache](symptom)
- I don't feel well, I have [unusual color or odor of urine](symptom)
- I have [abdominal swelling](symptom)
- My husband is suffering from [skin trauma](symptom)
- My husband is suffering from [vaginal pain](symptom)
- My son is suffering from [headache after trauma](symptom)
- I don't feel well, I have [diarrhea](symptom)
- My son is suffering from [penile discharge](symptom)
- I d

- My symptoms are [hemoptysis](symptom), [vaginal itching](symptom), and [nasal congestion](symptom)
- I don't feel well, I have [shin pain](symptom), [pulling at ears](symptom), and [memory loss](symptom)
- My child is suffering from [speech problem](symptom), [hypothermia](symptom), and [fever](symptom)
- I don't feel well, I have [flank pain](symptom), [liver failure](symptom), and [low blood pressure](symptom)
- My daughter is suffering from [blister](symptom), [bleeding in brain](symptom), and [double vision](symptom)
- I have really bad [fainting](symptom), [toe pain](symptom), and [excessive appetite](symptom)
- My child is suffering from [learning difficulties](symptom), [eye swelling](symptom), and [arm redness](symptom)
- My husband is suffering from [increased facial hair](symptom), [toe pain](symptom), and [chest pain](symptom)
- My wife is suffering from [scrotal swelling](symptom), [sinus pain and pressure](symptom), and [agitated](symptom)
- My wife is suffering from [sc