# Process Handbook

This notebook processes answers to questions about COVID-19 and pregnancy from a CSV file (Flubert_pregnancy_answers.csv). It uses a simple heuristic: if a line of text with fewer than 5 words is followed by paragraphs of text, then assume that the short line is a topic (i.e. the topic of a question an employee might ask) and that the paragraphs are the answer to that question (called action_text for lack of a better term).

When a topic and its action_text are found, they are stored in Cloud Datastore as a key-value pair, with the topic as the key and the action_text as the value.
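
The heuristic described above can be sketched in plain Python. This is only a minimal illustration, not code from this notebook: the function name `split_topics`, the `max_topic_words` parameter, and the sample lines are assumptions made up for the example.

```python
def split_topics(lines, max_topic_words=5):
    """Pair each short line (a topic) with the longer lines that follow it."""
    pairs = {}
    topic = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank separator lines
        if len(line.split()) < max_topic_words:
            # fewer than max_topic_words words: treat the line as a new topic
            topic = line
            pairs[topic] = []
        elif topic is not None:
            # longer line: part of the action_text for the current topic
            pairs[topic].append(line)
    # keep only topics that are actually followed by text
    return {t: "\n".join(p) for t, p in pairs.items() if p}


# Hypothetical sample input, invented for this sketch
sample = [
    "Travel",
    "We recommend avoiding all travel at this time.",
    "",
    "Testing",
    "Call your doctor's office before visiting.",
    "Testing availability varies by location.",
]
topic_to_action_text = split_topics(sample)
for topic, action_text in topic_to_action_text.items():
    print(topic, "->", action_text)
```

A short line that is not followed by any longer text is dropped, which matches the "followed by paragraphs of text" condition in the description.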

In [1]:
!pip uninstall -y google-cloud-datastore

Uninstalling google-cloud-datastore-1.12.0:
  Successfully uninstalled google-cloud-datastore-1.12.0


In [2]:
!pip install google-cloud-datastore

Collecting google-cloud-datastore
  Using cached https://files.pythonhosted.org/packages/27/e9/1132b0e4dce7d96df7620ce7c465bd99803cc62a467396ea93fee3a82931/google_cloud_datastore-1.12.0-py2.py3-none-any.whl
Installing collected packages: google-cloud-datastore
Successfully installed google-cloud-datastore-1.12.0


Hit Reset Session > Restart, then resume with the following cells. 

In [1]:
from google.cloud import datastore

In [2]:
datastore_client = datastore.Client()

In [None]:
#!conda update -n base -c defaults conda

Collecting package metadata: done
Solving environment: done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - conda


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    conda-4.8.3                |           py27_0         3.0 MB  defaults
    conda-package-handling-1.6.0|   py27h7b6447c_0         865 KB  defaults
    openssl-1.1.1g             |       h7b6447c_0         3.8 MB  defaults
    tqdm-4.47.0                |             py_0          62 KB  defaults
    ------------------------------------------------------------
                                           Total:         7.7 MB

The following NEW packages will be INSTALLED:

  _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
  conda-package-han~ pkgs/main/linux-64::conda-package-handling-1.6.0-py27h7b6447c_0
  tqdm               pkgs/main/noarch::tqdm-4.47.0-py_0

The following p

In [None]:
#y

In [None]:
#!conda install -y pandas
#!conda install -y scikit-learn

In [3]:
import pandas as pd

In [4]:
p_content = pd.read_csv('Flubert_pregnancy_answers_UTF8.csv')
#pregnancy_topics = open('Flubert_pregnancy_answers.csv', 'r')


In [5]:
p_content.head()

Unnamed: 0,Question,Answer
0,What can I do to protect myself from catching ...,"To protect yourself from catching coronavirus,..."
1,Can I travel for my baby-moon?,"We recommend avoiding all travel at this time,..."
2,Should I reschedule my baby shower because of ...,While a baby shower is a joyous and important ...
3,What should I do if I have a fever or cough?,"If you have a fever or cough, the first step i..."
4,What if I’ve traveled from a place where the v...,The first step is to call your doctor’s office...


In [6]:
p_content.dropna(inplace=True)

In [7]:
#to see how many articles
len(p_content)

19

In [8]:
#to see article, index row
print(p_content['Answer'][17])

Miscarriage can occur in any pregnancy. Studies have not been done to see if having COVID-19 during pregnancy could increase the chance of miscarriage.


In [9]:
#Before LDA need to preprocess data

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
#max_df ignores terms with a high document frequency (really common, e.g. show up in 90% of docs)
#max_df and min_df can each be given as a proportion of docs, or as an integer for the actual number of docs
#CountVectorizer can also remove stop words
cv = CountVectorizer(max_df=0.8, min_df=1, stop_words='english')
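
To make the effect of max_df and min_df concrete, here is a pure-Python approximation of document-frequency filtering. It is a sketch under assumptions, not scikit-learn's implementation: it tokenizes on whitespace only, skips stop-word removal, and the helper name `filter_by_doc_freq` and the toy documents are hypothetical.

```python
from collections import Counter


def filter_by_doc_freq(docs, max_df=0.8, min_df=1):
    """Return the sorted vocabulary whose document frequency is within bounds."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # count each term once per document
        doc_freq.update(set(doc.lower().split()))
    # integers are absolute document counts, floats are proportions of n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    return sorted(t for t, c in doc_freq.items() if low <= c <= high)


# Hypothetical toy corpus: 'covid' appears in every document
docs = [
    "covid risk pregnancy",
    "covid travel advice",
    "covid testing advice",
    "covid fever",
    "covid mask",
]
vocab = filter_by_doc_freq(docs)
print(vocab)
```

With max_df=0.8 and five documents, a term may appear in at most four of them, so 'covid' is dropped while every other term is kept.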

In [12]:
#create a document term matrix using the Answers
dtm = cv.fit_transform(p_content['Answer'])

In [13]:
#get dimensions of dtm and confirm is a sparse matrix

In [14]:
dtm 

<19x386 sparse matrix of type '<class 'numpy.int64'>'
	with 625 stored elements in Compressed Sparse Row format>

In [15]:
#got 386 terms in the 19 articles

In [16]:
from sklearn.decomposition import LatentDirichletAllocation

In [38]:
LDA = LatentDirichletAllocation(n_components=8, random_state=42)

In [39]:
LDA.fit(dtm)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=8, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=42,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [40]:
# Grab vocab of words
len(cv.get_feature_names())

386

In [41]:
type(cv.get_feature_names())

list

In [42]:
# getting words at index locations
cv.get_feature_names()[317]

'small'

In [43]:
# to print out a random word from the vocabulary

import random

# randint includes both endpoints, so randrange is used to stay within the valid indices
random_word_id = random.randrange(len(cv.get_feature_names())) #generates a random index number
cv.get_feature_names()[random_word_id]

'mothers'

In [44]:
# Grab the topics
len(LDA.components_)

8

In [45]:
LDA.components_.shape # components by number of words

(8, 386)

In [46]:
single_topic = LDA.components_[0]

In [47]:
single_topic.argsort() #we don't yet know what this topic represents; argsort returns the index positions that would sort this array in ascending order

array([158, 376, 337, 224, 260, 332, 142, 135,   9, 354, 253, 234,  66,
       313, 371,  78, 244, 117, 310, 375, 338, 204, 163, 161, 285, 258,
        60, 205,  75, 296,  86, 278, 257, 230, 373, 372, 150, 119, 169,
        85,  22, 123, 381, 287,  21, 292, 269, 159,  10, 370, 333, 357,
       364, 164, 349, 283, 110, 351,  80,  19, 303, 228, 120, 291, 109,
       343, 116,  36,  76, 178, 213,  98, 190, 223, 263, 267, 144, 360,
        82, 132, 284, 103, 195, 206, 155,  56, 272,  92,  32, 301, 308,
       221, 149, 314, 141, 298, 286,  43, 173, 323,  16, 174, 304, 277,
        95,  77, 199, 179, 138,  14, 276, 225, 363, 347, 201, 299, 345,
       148, 136, 202,   6, 233,  28, 342, 240, 317,  50, 369,  61, 327,
       167,  31,  51,  38, 266,  57, 331,  83,  37,  27,  47,  17, 348,
       254, 374, 302, 133,   2, 275,  48, 207, 321,  70, 293,  39,  62,
       209, 325, 211, 170, 197, 385,   1, 189, 282, 365,  26, 145, 236,
       274, 214, 268,   3, 261, 104,  65, 115, 112,  13, 377,  4

In [48]:
single_topic.argsort()[-10:]
#argsort sorts in ascending order, so its last 10 values are the index positions of the 10 highest-weighted words

array([152, 222, 220, 134, 248,  69, 382, 227, 341, 129])
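
The argsort slicing used here can be illustrated on a tiny example. This sketch uses a pure-Python stand-in for numpy's argsort so that it runs without any dependencies; the helper `argsort`, the five-word vocabulary, and the weights are all made up for illustration.

```python
def argsort(values):
    """Pure-Python stand-in for numpy.argsort: indices that sort values ascending."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Hypothetical vocabulary and topic weights
vocab = ["birth", "covid", "mask", "travel", "virus"]
weights = [0.2, 3.1, 0.7, 1.5, 2.4]

order = argsort(weights)      # index positions, lowest weight first
top_three = order[-3:]        # last three positions hold the three largest weights
top_words = [vocab[i] for i in reversed(top_three)]  # strongest word first

print(order)
print(top_three)
print(top_words)
```

Because the slice keeps ascending order, the strongest word comes last in `top_three`; reversing it gives a strongest-first list.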

In [49]:
top_ten_words = single_topic.argsort()[-10:]

for index in top_ten_words:
    print(cv.get_feature_names()[index]) #10 highest proba words in first topic

gynecologists
mothertobaby
mother
feed
passing
college
workers
newborns
tested
eyes


In [50]:
top_twenty_words = single_topic.argsort()[-20:]
for index in top_twenty_words:
    print(cv.get_feature_names()[index]) #20 highest proba words in first topic

paternal
novel
woman
report
determine
restrict
38
proper
trimester
just
gynecologists
mothertobaby
mother
feed
passing
college
workers
newborns
tested
eyes


In [30]:
def f(string):
    # (!) Using globals is bad bad bad
    return string.format(**globals())

# Use as follows:
ans = 'SPAM'
print(f('we love {ans}'))

we love SPAM


In [51]:
# Grab the highest probability words per topic

for i, topic in enumerate(LDA.components_):
    print('The top 20 words for Topic {}'.format(i))
    print([cv.get_feature_names()[index] for index in topic.argsort()[-20:]])
    print('\n')
    print('\n')

The top 20 words for Topic 0
['paternal', 'novel', 'woman', 'report', 'determine', 'restrict', '38', 'proper', 'trimester', 'just', 'gynecologists', 'mothertobaby', 'mother', 'feed', 'passing', 'college', 'workers', 'newborns', 'tested', 'eyes']




The top 20 words for Topic 1
['exposed', 'trying', 'emergencies', 'people', 'currently', 'present', 'physician', 'covid', '19', 'minimize', 'obstetrics', 'contingency', 'ask', 'health', 'obstetric', 'birth', 'plans', 'staff', 'team', 'hospital']




The top 20 words for Topic 2
['cov', 'disease', 'care', 'time', 'birth', 'sars', 'tested', 'study', 'case', 'travel', 'evidence', 'newborns', 'risk', 'infection', 'virus', 'infected', 'pregnant', 'women', '19', 'covid']




The top 20 words for Topic 3
['obstetricians', 'step', 'health', 'exposure', 'help', 'pregnant', 'mask', 'spread', 'protect', 'obstetrician', 'office', 'pregnancy', 'limit', 'vaccines', 'important', 'doctor', 'virus', 'symptoms', '19', 'covid']




The top 20 words for Topic 

In [52]:
#now attach topic numbers to original articles

In [53]:
topic_results = LDA.transform(dtm)

In [54]:
topic_results.shape #number of articles with probabilities of each topic

(19, 8)

In [55]:
#get example topic results for document at index 1, rounded to 2 decimals
topic_results[1].round(2)

array([0.01, 0.01, 0.94, 0.01, 0.01, 0.01, 0.01, 0.01])

In [56]:
#now connect back to original dataframe
#use ARGMAX to return index position of highest probability

p_content['Topic'] = topic_results.argmax(axis = 1)
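
As a small illustration of what the argmax step does, the sketch below assigns each row to the column with the highest probability. It is written in plain Python rather than numpy; the helper `argmax` is a hypothetical stand-in, the first row is the one printed for the document at index 1, and the second row is invented.

```python
def argmax(row):
    """Index of the largest value in row (first one wins on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


doc_topic = [
    [0.01, 0.01, 0.94, 0.01, 0.01, 0.01, 0.01, 0.01],  # row shown for document 1
    [0.05, 0.05, 0.05, 0.65, 0.05, 0.05, 0.05, 0.05],  # hypothetical row
]

assigned_topics = [argmax(row) for row in doc_topic]
print(assigned_topics)
```

Each document ends up with exactly one topic number, even when a second topic has a sizeable probability; that information is discarded by this step.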

In [57]:
p_content.head(21)

Unnamed: 0,Question,Answer,Topic
0,What can I do to protect myself from catching ...,"To protect yourself from catching coronavirus,...",3
1,Can I travel for my baby-moon?,"We recommend avoiding all travel at this time,...",2
2,Should I reschedule my baby shower because of ...,While a baby shower is a joyous and important ...,6
3,What should I do if I have a fever or cough?,"If you have a fever or cough, the first step i...",3
4,What if I’ve traveled from a place where the v...,The first step is to call your doctor’s office...,3
5,What is my risk of becoming very ill if I do h...,"Given that this is a novel virus, little is kn...",2
6,Does becoming ill with COVID-19 increase risk ...,An increased risk of miscarriage or fetal malf...,2
7,"If I become sick, what is the risk of passing ...",The risk of passing the infection to a fetus a...,2
8,Is it safe to take medicine if I have a temper...,If a woman has an infection with a high fever ...,5
9,Can I be tested for COVID-19?,If you’re worried you have COVID-19covid 19 or...,4


In [58]:
#create a dictionary of topics

my_topic_dict = {0:'research', 1:'visiting the hospital', 2:'prenatal risks', 3:'transmission', 4:'testing', 5:'increased risks', 6: 'baby shower', 7: 'prenatal care'}
p_content['Topic_Label'] = p_content['Topic'].map(my_topic_dict)

In [59]:
p_content.head()

Unnamed: 0,Question,Answer,Topic,Topic_Label
0,What can I do to protect myself from catching ...,"To protect yourself from catching coronavirus,...",3,transmission
1,Can I travel for my baby-moon?,"We recommend avoiding all travel at this time,...",2,prenatal risks
2,Should I reschedule my baby shower because of ...,While a baby shower is a joyous and important ...,6,baby shower
3,What should I do if I have a fever or cough?,"If you have a fever or cough, the first step i...",3,transmission
4,What if I’ve traveled from a place where the v...,The first step is to call your doctor’s office...,3,transmission


In [60]:
dictionary_df = p_content[['Question','Topic_Label']].copy()
dictionary_df.head()

Unnamed: 0,Question,Topic_Label
0,What can I do to protect myself from catching ...,transmission
1,Can I travel for my baby-moon?,prenatal risks
2,Should I reschedule my baby shower because of ...,baby shower
3,What should I do if I have a fever or cough?,transmission
4,What if I’ve traveled from a place where the v...,transmission


In [61]:
grouped_answers = dictionary_df.groupby('Topic_Label')

In [63]:
type(grouped_answers)

pandas.core.groupby.DataFrameGroupBy

In [69]:
topic_q_dict = grouped_answers['Question'].apply(lambda s: s.tolist()).to_dict()
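
For readers less familiar with pandas, the groupby / apply / to_dict chain builds a plain dictionary from topic label to list of questions. A rough pure-Python equivalent is sketched below; the helper name `group_questions` is hypothetical, and the three rows are taken from the first questions shown in this notebook.

```python
from collections import defaultdict

# (question, topic label) pairs, as in dictionary_df
rows = [
    ("What can I do to protect myself from catching the coronavirus?", "transmission"),
    ("Can I travel for my baby-moon?", "prenatal risks"),
    ("What should I do if I have a fever or cough?", "transmission"),
]


def group_questions(pairs):
    """Collect the questions under each topic label, preserving row order."""
    grouped = defaultdict(list)
    for question, label in pairs:
        grouped[label].append(question)
    return dict(grouped)


topic_to_questions = group_questions(rows)
for label, questions in topic_to_questions.items():
    print(label, "->", questions)
```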

In [70]:
type(topic_q_dict)

dict

In [71]:
for key, value in topic_q_dict.items():
  print(key, '->', value)


increased risks -> ['Is it safe to take medicine if I have a temperature?', 'If a man has COVID-19, could it affect his fertility (ability to get partner pregnant) or increase the chance of birth defects?']
prenatal risks -> ['Can I travel for my baby-moon?', 'What is my risk of becoming very ill if I do have COVID-19?', 'Does becoming ill with COVID-19 increase risk of miscarriage or other complications?', 'If I become sick, what is the risk of passing the virus on to my fetus or newborn?', 'Does having COVID-19 in pregnancy cause long-term problems in behavior or learning for the baby?']
transmission -> ['What can I do to protect myself from catching the coronavirus?', 'What should I do if I have a fever or cough? ', 'What if I’ve traveled from a place where the virus is widespread or have been in contact with a person confirmed to have COVID-19?', 'If a coronavirus vaccine comes out, can I take it when pregnant?', 'If I test positive for COVID-19, can I breastfeed my baby?', 'Can ha

In [79]:
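for k, value in topic_q_dict.items():
    #the topic label is used as the entity kind; no name is given, so the key is incomplete,
    #Datastore assigns a numeric ID on put, and topic.key.name stays None
    kind = k
    topic_key = datastore_client.key(kind)

    topic = datastore.Entity(key=topic_key)
    topic['action_text'] = value #value is the list of questions grouped under this topic

    datastore_client.put(topic)

    print('Saved {}: {}'.format(topic.key.name, topic['action_text']))
    print('\n')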
    print('\n')

Saved None: ['Is it safe to take medicine if I have a temperature?', 'If a man has COVID-19, could it affect his fertility (ability to get partner pregnant) or increase the chance of birth defects?']


Saved None: ['Can I travel for my baby-moon?', 'What is my risk of becoming very ill if I do have COVID-19?', 'Does becoming ill with COVID-19 increase risk of miscarriage or other complications?', 'If I become sick, what is the risk of passing the virus on to my fetus or newborn?', 'Does having COVID-19 in pregnancy cause long-term problems in behavior or learning for the baby?']


Saved None: ['What can I do to protect myself from catching the coronavirus?', 'What should I do if I have a fever or cough? ', 'What if I’ve traveled from a place where the virus is widespread or have been in contact with a person confirmed to have COVID-19?', 'If a coronavirus vaccine comes out, can I take it when pregnant?', 'If I test positive for COVID-19, can I breastfeed my baby?', 'Can having COVID-19
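
The output above prints `Saved None` because each key is built from a kind only and therefore has no name. One possible alternative, sketched here under assumptions, is to use a single kind and pass the topic label as the key name, which matches the "topic as the key" description at the top of this notebook. The kind name `'Topic'` and the helper `save_topics` are hypothetical; the client and entity class are passed in as arguments so that the sketch can be run without a Datastore connection.

```python
def save_topics(client, entity_cls, topic_dict, kind="Topic"):
    """Store each topic under a named key and return the key names used.

    client     -- object with key(kind, name) and put(entity), e.g. datastore.Client()
    entity_cls -- entity class accepting key=..., e.g. datastore.Entity
    kind       -- hypothetical shared kind name, an assumption of this sketch
    """
    saved_names = []
    for label, value in topic_dict.items():
        key = client.key(kind, label)   # complete key: the name is the topic label
        entity = entity_cls(key=key)
        entity["action_text"] = value
        client.put(entity)
        saved_names.append(key.name)
    return saved_names


# Intended use in this notebook (requires Datastore credentials, so left as a comment):
# save_topics(datastore_client, datastore.Entity, topic_q_dict)
```

With named keys, running the cell again overwrites the existing entity for a topic instead of creating a new one with a fresh numeric ID.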