## Notebook Overview: Identify Users Self-Reporting Symptoms of Interest

This notebook processes all Facebook posts, comments, and replies to identify users who self-report symptoms of interest. It uses the `sentence_transformers` module to load a pretrained BERT-style model (`all-MiniLM-L6-v2`) that creates text embeddings for sentences. The results feed directly into the self-reporting positive users notebook ([notebook link](nlp_sentence_transformer_positive.ipynb)). The basic sequence of steps is:
- Load BERT model
- Create a symptom map between symptom categories and the keywords associated with those categories
- Create dummy phrases for self-reporting symptoms of interest using this symptom map
- Convert dummy phrases to text embeddings using the BERT model
- Iterate through all Facebook posts/sentences
    - Compare dummy sentences to Facebook sentences
    - If a match is detected, save the relevant symptom/user information as an entry
- Save final dataset of self-reporting users with referenced date of infection

In [59]:
from sentence_transformers import SentenceTransformer, util
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize

import pandas as pd
import sqlite3
import numpy as np
import os

In [2]:
# load pretrained BERT model for phrase/sentence similarity
model = SentenceTransformer('all-MiniLM-L6-v2')

In [66]:

# generate map of symptom category (key) and corresponding list of search terms (values)
symptom_map = {
    "blood clot":["clot","blood clot"],
    "heart":["heart problem","heart issue","palpitation","rapid heartbeat","fast heartbeat", "increased heart rate"],
    "stroke":["stroke"],
    "dvt":["deep vein thrombosis"],
    "pe":["pulmonary embolism"],
    "breathing":["out of breath","shortness of breath","trouble breathing"],
    "lightheaded":["lightheaded","lightheadedness","faint","dizzy","vertigo"],
    "leg":["leg pain","leg swelling"],
    "skin":["clammy skin","skin discoloration","cyanosis"],
}
# generate list of common self-report symptom phrases
self_report_phrases = ["I had <symptom>","I experienced <symptom>","I felt <symptom>","I suffer from <symptom>",
"my symptoms included <symptom>","dealing with <symptom>","has anyone else <symptom>"]

# map each symptom category (key) to a list of [self-report phrase, text embedding] pairs, one per search term and template
symptom_embedding_map = {}
for symptom_key in symptom_map:
    symptom_embedding_map[symptom_key] = []
    for symptom in symptom_map[symptom_key]:
        for phrase in self_report_phrases:
            phrase = phrase.replace("<symptom>",symptom)
            embedding = model.encode(phrase, convert_to_tensor=True)
            symptom_embedding_map[symptom_key].append([phrase,embedding])
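
The template expansion above can also be checked on its own, before any embeddings are computed. The following is a minimal sketch, not the notebook's method: `build_phrase_table`, `demo_map`, and `demo_templates` are hypothetical names introduced here for illustration. It skips templates that lack the `<symptom>` placeholder and drops repeated phrases, both of which would otherwise produce redundant or symptom-free dummy phrases.

```python
from typing import Dict, List, Sequence, Tuple


def build_phrase_table(
    symptom_map: Dict[str, List[str]],
    templates: Sequence[str],
    placeholder: str = "<symptom>",
) -> List[Tuple[str, str]]:
    """Return unique (category, phrase) pairs for every search term and template."""
    table = []
    seen = set()
    for category, terms in symptom_map.items():
        for term in terms:
            for template in templates:
                # a template without the placeholder would match any symptom, so skip it
                if placeholder not in template:
                    continue
                phrase = template.replace(placeholder, term)
                if (category, phrase) not in seen:
                    seen.add((category, phrase))
                    table.append((category, phrase))
    return table


# hypothetical toy inputs, including one template with no placeholder and one duplicate
demo_map = {"stroke": ["stroke"], "leg": ["leg pain", "leg swelling"]}
demo_templates = ["I had <symptom>", "I felt <symptom>", "I suffer from", "I felt <symptom>"]
demo_table = build_phrase_table(demo_map, demo_templates)
print(len(demo_table))
```

Since `SentenceTransformer.encode` also accepts a list of sentences, the phrases in such a table could be embedded in a single call rather than one call per phrase.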

In [23]:
# read in all post data from sqlite DB
datadir = r"C:\Users\keatu\Regis_archive\practicum_data"
dbfile = os.path.join(datadir,"Facebook.db")
con = sqlite3.connect(dbfile)
posts = pd.read_sql("select * from posts",con)
comments = pd.read_sql("select * from comments", con)
replies = pd.read_sql("Select * from replies",con)
con.close()

# grab text/id fields from each data type--treating them all like unique posts
all_text = pd.concat([
                    posts[["user_id","post_id","text"]],
                    comments[["commenter_id","comment_id","comment_text"]].rename(columns={"commenter_id":"user_id","comment_id":"post_id","comment_text":"text"}),
                    replies[["commenter_id","comment_id","comment_text"]].rename(columns={"commenter_id":"user_id","comment_id":"post_id","comment_text":"text"})
                    ], sort = False)
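
If any rows in the combined table have missing or empty text, sentence tokenization later in the notebook would fail or waste time on them. The sketch below shows one way such rows might be filtered out first; `clean_text_frame` and the `demo` frame are hypothetical and are not part of the original notebook. Resetting the index is also useful here because `pd.concat` keeps the original row labels of each source table, so labels can repeat.

```python
import pandas as pd


def clean_text_frame(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Drop rows whose text is missing or blank and return a freshly indexed copy."""
    out = df[df[text_col].notna()].copy()
    out[text_col] = out[text_col].astype(str).str.strip()
    out = out[out[text_col] != ""]
    return out.reset_index(drop=True)


# hypothetical toy data standing in for the concatenated posts/comments/replies
demo = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "post_id": [10, 11, 12, 13],
    "text": ["I felt dizzy.", None, "   ", "Leg pain again"],
})
cleaned = clean_text_frame(demo)
print(cleaned)
```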

In [72]:
len(all_text)

195682

In [82]:
# compare each input sentence embedding to every dummy-phrase embedding
# and use a cosine similarity threshold to identify matches
threshold = 0.75 # cosine similarity threshold
matches = [] # collect rows in a list; DataFrame.append was removed in pandas 2.0
i = 0
for idx, row in all_text.iterrows():
    i += 1
    if (i % 10000) == 0:
        print("{} completed of {}".format(i, len(all_text)))
    if not isinstance(row['text'], str):
        continue # skip rows with missing text, which sent_tokenize cannot handle
    for sent in sent_tokenize(row['text']):
        sent_embed = model.encode(sent, convert_to_tensor=True)
        for term in symptom_embedding_map:
            # track the best-matching dummy phrase within this symptom category
            top_score = 0
            top_match = ""
            for (match_sentence, match_embedding) in symptom_embedding_map[term]:
                cos_score = util.cos_sim(sent_embed, match_embedding).item()
                if cos_score > top_score:
                    top_score = cos_score
                    top_match = match_sentence
            if top_score > threshold:
                matches.append({'user_id':row['user_id'],"post_id":row['post_id'],'sentence':sent,"match_sentence":top_match,"symptom":term,"cos_similarity":top_score})
self_report_df = pd.DataFrame(matches)
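
The matching loop above scores one phrase at a time. The same per-category, best-match-over-threshold logic can be expressed as a single matrix product, sketched here with NumPy and toy 3-dimensional vectors in place of real embeddings. `best_matches` and all of the demo values are hypothetical; this is an illustration of the technique under those assumptions, not the notebook's implementation.

```python
import numpy as np


def best_matches(sent_vec, phrase_matrix, categories, phrases, threshold=0.75):
    """Return {category: (phrase, score)} for categories whose best cosine score exceeds threshold."""
    # normalize to unit length so a dot product equals cosine similarity
    sent_unit = sent_vec / np.linalg.norm(sent_vec)
    phrase_unit = phrase_matrix / np.linalg.norm(phrase_matrix, axis=1, keepdims=True)
    scores = phrase_unit @ sent_unit  # one score per dummy phrase
    best = {}
    for cat, phrase, score in zip(categories, phrases, scores):
        if score > threshold and (cat not in best or score > best[cat][1]):
            best[cat] = (phrase, float(score))
    return best


# toy vectors standing in for dummy-phrase embeddings (one row per phrase)
phrase_matrix = np.array([[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 1.0, 0.0]])
categories = ["stroke", "stroke", "leg"]
phrases = ["I had stroke", "I felt stroke", "I had leg pain"]
sent_vec = np.array([0.95, 0.05, 0.0])  # toy sentence embedding

result = best_matches(sent_vec, phrase_matrix, categories, phrases)
print(result)
```

With real embeddings, `util.cos_sim` accepts 2-D tensors and returns the full similarity matrix, so the phrase embeddings could be stacked once and compared against each sentence in one call.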

In [19]:
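# remove duplicates using post_id field, keeping the first match recorded for each post
# ("Unnamed: 0" is presumably an index column left over from reloading saved results; ignore it if absent)
self_reportdf = self_report_df.groupby(["post_id"]).aggregate("first").reset_index().drop(columns="Unnamed: 0", errors="ignore").sort_values("user_id")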
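
Keeping the first row per `post_id` means that a post matching several symptoms is reduced to whichever match happened to be recorded first. If the strongest match per post is preferred instead, one possible alternative is to sort by score before dropping duplicates. The sketch below uses a small hypothetical `demo` frame with the same column names as the match table; it is an illustration of the alternative, not a change the original analysis made.

```python
import pandas as pd

# hypothetical match table: post p1 matched two different symptom categories
demo = pd.DataFrame({
    "post_id": ["p1", "p1", "p2"],
    "user_id": ["u1", "u1", "u2"],
    "symptom": ["heart", "breathing", "stroke"],
    "cos_similarity": [0.78, 0.91, 0.80],
})

# sort so the highest-scoring match comes first, then keep one row per post
best_per_post = (
    demo.sort_values("cos_similarity", ascending=False)
        .drop_duplicates(subset="post_id", keep="first")
        .sort_values("user_id")
        .reset_index(drop=True)
)
print(best_per_post)
```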

In [66]:
# grab all unique user_ids from this subset of self-reporting individuals
user_ids = self_reportdf["user_id"].unique().tolist()
user_ids = [str(i) for i in user_ids]

In [68]:
# get subset of posts, comments, and replies for only these users
sr_posts = posts[posts["user_id"].isin(user_ids)]
sr_comments = comments[comments["commenter_id"].isin(user_ids)]
sr_replies = replies[replies["commenter_id"].isin(user_ids)]

In [69]:
# create a new database containing posts only for users self-reporting symptoms of interest
# this database has an additional table for the self-reporting symptoms of interest entries
outcon = sqlite3.connect(r"C:\Users\keatu\Regis_archive\practicum_data\Facebook_Self_Report.db")
sr_posts.astype(str).to_sql("posts",con=outcon)
sr_comments.astype(str).to_sql("comments",con=outcon)
sr_replies.astype(str).to_sql("replies",con=outcon)
self_reportdf.astype(str).to_sql("self_reporting",con=outcon)
outcon.close()
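
By default, `to_sql` raises an error if the target table already exists and also writes the DataFrame index as an extra column, so rerunning the export cell against an existing database file fails. The sketch below shows one way to make the write repeatable and to confirm what was stored. It uses an in-memory database and a hypothetical `demo` frame rather than the real output file.

```python
import sqlite3

import pandas as pd

# hypothetical stand-in for the self-reporting table
demo = pd.DataFrame({"user_id": ["u1", "u2"], "symptom": ["heart", "stroke"]})

con = sqlite3.connect(":memory:")
try:
    # if_exists="replace" overwrites the table; index=False omits the index column
    demo.to_sql("self_reporting", con=con, if_exists="replace", index=False)
    demo.to_sql("self_reporting", con=con, if_exists="replace", index=False)  # a rerun does not fail
    row_count = int(pd.read_sql("select count(*) as n from self_reporting", con)["n"].iloc[0])
    columns = pd.read_sql("select * from self_reporting", con).columns.tolist()
finally:
    con.close()

print(row_count, columns)
```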