## Notebook Overview: Identifying Positive Cases and Initial Date of Infection

This notebook processes posts from the subset of users who self-reported symptoms of interest ([notebook link](nlp_sentence_transformer_self_report.ipynb)). It also uses the `sentence_transformers` module to load a pretrained BERT model that converts sentences into text embeddings. The basic sequence of steps is:
- Load BERT model
- Create dummy phrases for self-reporting a positive case and convert them to text embeddings using the model
- Iterate through all relevant Facebook posts/sentences
    - Compare dummy sentences to Facebook sentences
    - If a match is detected, save the post/user information in a dataset of possible positive users
- Manually update a "date_reported" field for this dataset (i.e., find a date reference within the respective post, such as "I got covid last August")
- Save final dataset of self-reporting users with referenced date of infection
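
The matching step in the list above can be illustrated without loading the model. The following is a minimal sketch, assuming toy 3-dimensional vectors in place of real sentence embeddings; `cosine_similarity`, `best_match`, and the example vectors are hypothetical names and values used only for illustration. It shows the rule this notebook applies: keep the single best-scoring phrase for each sentence, and record it only if its score exceeds the threshold.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(sentence_vec, phrase_vecs, threshold=0.75):
    """Return (phrase, score) for the closest phrase, or None if it does not exceed the threshold."""
    top_phrase, top_score = None, 0.0
    for phrase, vec in phrase_vecs.items():
        score = cosine_similarity(sentence_vec, vec)
        if score > top_score:
            top_phrase, top_score = phrase, score
    if top_score > threshold:
        return top_phrase, top_score
    return None

# toy stand-ins for embeddings (in the notebook these come from model.encode)
phrase_vecs = {
    "I tested positive": [1.0, 0.0, 0.0],
    "I had covid": [0.0, 1.0, 0.0],
}
close = best_match([0.9, 0.1, 0.0], phrase_vecs)  # nearly parallel to the first phrase
far = best_match([0.0, 0.0, 1.0], phrase_vecs)    # orthogonal to both phrases
```

Here `close` resolves to the "I tested positive" phrase with a score just under 1, while `far` is `None` because neither phrase clears the threshold.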

In [24]:
from sentence_transformers import SentenceTransformer, util
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize

import pandas as pd
import sqlite3
import os
import re

In [25]:
# load pretrained BERT model for phrase/sentence similarity
model = SentenceTransformer('all-MiniLM-L6-v2')

In [29]:

# list of "I tested positive" phrases
covid_synonyms = ["covid-19","covid19","covid","coronavirus","corona","rona"]
positive_phrases = [
    "I tested positive",
    "since my <covid> diagnosis",
    "since my positive diagnosis",
    "after my positive diagnosis",
    "after testing positive",
    "I had <covid>",
    "I got <covid>",
    "diagnosed with <covid>",
    "it has been since I had <covid>",
]
# create list of covid positive phrases using list of commonly used synonyms
all_phrases = []
for phrase in positive_phrases:
    for syn in covid_synonyms:
        all_phrases.append(phrase.replace("<covid>",syn))

# create text/vector embeddings for each phrase
positive_embedding_map = {}
for phrase in all_phrases:
    embedding = model.encode(phrase, convert_to_tensor=True)
    positive_embedding_map[phrase] = embedding
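
One detail of the expansion loop above: templates without the `<covid>` placeholder (such as "I tested positive") are appended once per synonym, so `all_phrases` contains repeats. The dictionary keys absorb the duplicates, but each copy is still encoded. A minimal sketch of an order-preserving, de-duplicated expansion follows; `expand_templates` is a hypothetical helper, not part of the original notebook.

```python
def expand_templates(templates, synonyms, placeholder="<covid>"):
    """Substitute each synonym into each template, keeping first-seen order and dropping repeats."""
    seen = set()
    phrases = []
    for template in templates:
        for syn in synonyms:
            phrase = template.replace(placeholder, syn)
            if phrase not in seen:
                seen.add(phrase)
                phrases.append(phrase)
    return phrases

phrases = expand_templates(
    ["I tested positive", "I had <covid>"],
    ["covid", "rona"],
)
# phrases == ["I tested positive", "I had covid", "I had rona"]
```

The resulting list could be used in place of `all_phrases`. `model.encode` also accepts a list of sentences, so the phrases could be encoded in one call rather than one at a time.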

In [45]:
# read in the data from sqlite DB
datadir = r"C:\Users\keatu\Regis_archive\practicum_data"
dbfile = os.path.join(datadir,"Facebook_self_report.db")
con = sqlite3.connect(dbfile)
posts = pd.read_sql("select * from posts",con)
comments = pd.read_sql("select * from comments", con)
replies = pd.read_sql("Select * from replies",con)
con.close()

# grab text/id fields from each data type--treating them all like unique posts
all_text = pd.concat([
                    posts[["user_id","post_id","text"]],
                    comments[["commenter_id","comment_id","comment_text"]].rename(columns={"commenter_id":"user_id","comment_id":"post_id","comment_text":"text"}),
                    replies[["commenter_id","comment_id","comment_text"]].rename(columns={"commenter_id":"user_id","comment_id":"post_id","comment_text":"text"})
                    ], sort = False)

In [38]:
# compare each sentence embedding to every positive phrase embedding and keep matches above a cosine similarity threshold
threshold = 0.75 # cosine similarity threshold
matches = [] # collect matched rows in a list--DataFrame.append was removed in pandas 2.0
i = 0

# iterate through all facebook posts
for idx, row in all_text.iterrows():
    i += 1 # fixed: "i=+1" reset the counter to 1 on every pass, so progress was never printed
    if (i%1000)==0:
        print("{} completed of {}".format(i,len(all_text)))
    # iterate through each sentence within each post
    for sent in sent_tokenize(row['text']):
        sent_embed = model.encode(sent, convert_to_tensor=True)
        # only keep the best-matching phrase for each Facebook sentence--each sentence is recorded at most once
        top_score = 0
        top_match = ""
        for phrase in positive_embedding_map:
            cos_score = util.cos_sim(sent_embed, positive_embedding_map[phrase]).item()
            if cos_score > top_score:
                top_score = cos_score
                top_match = phrase
        if top_score > threshold:
            matches.append({'user_id':row['user_id'],"post_id":row['post_id'],'sentence':sent,"match_sentence":top_match,"cos_similarity":top_score})

# build the dataframe once from the collected rows
positive_report_df = pd.DataFrame(matches, columns=["user_id","post_id","sentence","match_sentence","cos_similarity"])

In [47]:
# create a new dataframe containing these possibly positive users
text_with_time = pd.concat([
                    posts[["user_id","post_id","text","time"]],
                    comments[["commenter_id","comment_id","comment_text","comment_time"]].rename(columns={"commenter_id":"user_id","comment_id":"post_id","comment_text":"text","comment_time":"time"}),
                    replies[["commenter_id","comment_id","comment_text","comment_time"]].rename(columns={"commenter_id":"user_id","comment_id":"post_id","comment_text":"text","comment_time":"time"})
                    ], sort = False)
positive_report_with_time = pd.merge(positive_report_df,text_with_time[["post_id","time"]], on="post_id", how="left").sort_values("user_id")

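# save results to a csv file--the "date_reported" column was then added to this file by hand
# (re-running this cell overwrites the file, including any manually entered dates)
positive_report_with_time.to_csv(os.path.join(datadir,"positive_reporting.csv"))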
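
The `date_reported` column was filled in by hand. To speed up that review, sentences containing an explicit time reference could be flagged first. The following is a minimal sketch, assuming English month names and "N weeks ago"-style expressions are the cues of interest; `find_date_hints` and its pattern are hypothetical and were not used to produce the results in this notebook.

```python
import re

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")

# matches e.g. "last August", "since March", or "3 weeks ago" (case-insensitive)
DATE_HINT = re.compile(
    r"\b(?:(?:last|this|in|since)\s+(?:" + MONTHS + r")"
    r"|\d+\s+(?:days?|weeks?|months?|years?)\s+ago)\b",
    re.IGNORECASE,
)

def find_date_hints(sentence):
    """Return the time-reference substrings found in a sentence, in order of appearance."""
    return [m.group(0) for m in DATE_HINT.finditer(sentence)]

hints = find_date_hints("I got covid last August and again 3 weeks ago")
# hints == ["last August", "3 weeks ago"]
```

A flag like this only narrows down which rows to read; converting a phrase such as "last August" into an actual date still depends on the post's timestamp and on human judgment.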

In [73]:
# read in the same file--I added a "date_reported" column manually
positive_reporting = pd.read_csv(os.path.join(datadir,"positive_reporting.csv"))

# save all positive reporting users for which an infection date could be extracted
outcon = sqlite3.connect(dbfile) # same database file that was read in above
positive_reporting[positive_reporting["date_reported"].notna()].astype(str).to_sql("positive_reporting",con=outcon)
outcon.close()