## Parsing Survey Results
This notebook builds the final datasets from two inputs: the text and relationships originally scraped from SNPedia, and the results of the survey. From these it constructs the relationships between genes, SNPs, and two kinds of text: the snippets highlighted by survey takers and the context sentences that contain them. It also reports how much of the dataset was covered by a given set of survey results.

In [1]:
import pandas as pd
import numpy as np
import json
import math
import random
import datetime
import re
from collections import Counter
from utils import flatten
from nltk.tokenize import sent_tokenize

### Defining paths to input files
The first two files should always stay the same. The first defines the project-wide mapping between unique strings and IDs. The second is the starting dataset of text that was scraped and cleaned from SNPedia, so changing it would affect everything downstream in the creation of the dataset. The paths to the Qualtrics results files can be changed, and all output files are produced with respect to just those results. Running with different result sets should produce separate files that can be stacked after the fact.
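
As a rough illustration of stacking the outputs of several runs, the sketch below concatenates every CSV that matches a glob pattern. The helper name `stack_result_files` and the `source_file` column are hypothetical, and the pattern in the usage comment is only an assumption based on the output naming scheme defined in this notebook.

```python
import glob

import pandas as pd


def stack_result_files(pattern):
    """Concatenate every CSV matching `pattern` into one dataframe."""
    frames = []
    for path in sorted(glob.glob(pattern)):
        frame = pd.read_csv(path)
        # Remember which run each row came from (hypothetical extra column).
        frame["source_file"] = path
        frames.append(frame)
    if not frames:
        # Nothing matched the pattern, so return an empty dataframe.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Hypothetical usage, assuming several runs have written snippet files:
# stacked = stack_result_files("../data/S*_snps_and_snippets_*.csv")
```

Rows from different runs may repeat the same gene, SNP, and snippet, so dropping duplicates after stacking may or may not be appropriate depending on how the stacked file is used.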

In [2]:
# This is the original data with duplicate text strings because IDs refer to particular SNPs from particular genes.
ORIGINAL_DATA_PATH = "../data/3_snps_and_cleaned_text_with_ids.csv"

# This is what the survey was built from. The IDs in it refer to unique text strings, so duplicates weren't part of it.
SURVEY_SOURCE_DATA_PATH = "../data/6_part_2_binned_and_blocked_texts.csv"

# Raw CSV export from Qualtrics, and the JSON file from parsing the results of the Qualtrics survey designed for extracting phenotypes from text.
QUALTRICS_RAW_CSV_EXPORT_PATH = "../qualtrics/part_2_results_raw_qualtrics_exports/SNPedia_Survey__Version_2_October_1_2020_22_14_Completed_and_in_Progress.csv"
QUALTRICS_RESULTS_JSON_PATH = "../qualtrics/part_2_results_processed_json_files/snpedia_survey_version_2_october_1_2020_22_14_completed_and_in_progress.json"

### Defining paths to output files
These should be left alone. They are named automatically according to when the notebook was run, and files from the same run share a prefix so that they are grouped together.

In [3]:
# Where to send the resulting output dataframes and some other information about coverage.
random_six_digit_number = random.randrange(100000,999999)
datetime_str = datetime.datetime.now().strftime('%m_%d_%Y_h%Hm%Ms%S')
FILENAMES_OUTPUT_PATH = "../data/S{}_input_file_paths_{}.txt".format(random_six_digit_number, datetime_str)
RESPONSES_OUTPUT_PATH = "../data/S{}_processed_survey_responses_{}.csv".format(random_six_digit_number, datetime_str)
SURVEY_TAKERS_OUTPUT_PATH = "../data/S{}_processed_survey_takers_{}.csv".format(random_six_digit_number, datetime_str)
SNPS_AND_SNIPPETS_OUTPUT_PATH = "../data/S{}_snps_and_snippets_{}.csv".format(random_six_digit_number, datetime_str)
SNPS_AND_CONTEXTS_OUTPUT_PATH = "../data/S{}_snps_and_contexts_{}.csv".format(random_six_digit_number, datetime_str)
NUM_ANNOTATIONS_PATH = "../data/S{}_number_of_annotations_per_id_{}.csv".format(random_six_digit_number, datetime_str)

In [4]:
# Create an output file that makes a note of which input files were used for these results.
with open(FILENAMES_OUTPUT_PATH, "w") as f:
    filenames = [
        ORIGINAL_DATA_PATH,
        SURVEY_SOURCE_DATA_PATH,
        QUALTRICS_RAW_CSV_EXPORT_PATH,
        QUALTRICS_RESULTS_JSON_PATH]
    filenames = "\n".join(filenames)
    f.write(filenames)

In [5]:
# Make sure the mapping between the unique texts used to create the survey and the original SNP data exists.
original_df = pd.read_csv(ORIGINAL_DATA_PATH)
#survey_df = pd.read_csv(SURVEY_SOURCE_DATA_PATH)
#text_to_unique_text_id = dict(zip(survey_df.text, survey_df.id))
#unique_text_id_to_text = dict(zip(survey_df.id, survey_df.text))
#original_df["id"] = original_df["text"].map(text_to_unique_text_id)

id_to_unique_text = dict(zip(original_df["id"].values, original_df["text"].values))


original_df.head(20)

Unnamed: 0,gene,snp,text,id
0,AANAT,Rs28936679,"rs28936679, also known as Ala129Thr or A129T (...",1
1,AANAT,Rs3760138,Genetic differences in human circadian clock g...,2
2,AANAT,Rs4238989,Genetic differences in human circadian clock g...,3
3,ABCA1,Rs1800977,The -14C->T polymorphism rs1800977 of the ABCA...,4
4,ABCA1,Rs1883025,Apolipoprotein E levels in cerebrospinal fluid...,5
5,ABCA1,Rs2020927,"rs2297404, rs2230808, and rs2020927 haplotype ...",6
6,ABCA1,Rs2066714,Apolipoprotein E levels in cerebrospinal fluid...,7
7,ABCA1,Rs2066715,Apolipoprotein E levels in cerebrospinal fluid...,8
8,ABCA1,Rs2230806,"rs2230806, also known as Arg219Lys or R219K, i...",9
9,ABCA1,Rs2230808,"rs2297404, rs2230808, and rs2020927 haplotype ...",10


In [6]:
original_df.isnull().sum(axis = 0)

gene    0
snp     0
text    0
id      0
dtype: int64

In [7]:
original_df.shape

(6535, 4)

In [8]:
# Create a dictionary using the JSON data output from a qualtrics survey.
with open(QUALTRICS_RESULTS_JSON_PATH) as f:
    responses = json.load(f)
responses

[{'response_id': 'R_2eUMgNrFozgjPf3',
  'recorded_datetime': '10/1/20 13:40',
  'status': 'IP Address',
  'progress': 100,
  'is_finished': 'TRUE',
  'duration': 973,
  'hilights': [{'qid': 3279,
    'selection': 'prone to miscalls',
    'selection_index': 0},
   {'qid': 957,
    'selection': 'vaccine-induced immunity to HBV',
    'selection_index': 25},
   {'qid': 5071,
    'selection': "increased risk for Crohn's disease",
    'selection_index': 16},
   {'qid': 1807, 'selection': 'risk of breast cancer', 'selection_index': 55},
   {'qid': 1103,
    'selection': 'predictive of Cystic Fibrosis',
    'selection_index': 22},
   {'qid': 4536, 'selection': 'schizophrenia', 'selection_index': 104},
   {'qid': 4536,
    'selection': 'developmental disorders',
    'selection_index': 122},
   {'qid': 3015,
    'selection': 'hyperalphalipoproteinemia',
    'selection_index': 59},
   {'qid': 2178, 'selection': 'Krabbe disease', 'selection_index': 66},
   {'qid': 2393, 'selection': 'oxidative str

In [9]:
# Load the original source CSV that was used in creating the survey. This will be used to check against the results.
# We want to make sure there is no discrepancy in which IDs are referring to which texts.
source_df = pd.read_csv(SURVEY_SOURCE_DATA_PATH)
#source_df.reset_index(drop=False, inplace=True)

# Question IDs in the actual survey will use the 
#source_df["qid"] = source_df["index"]
#source_df.drop(labels=["index"], axis=1, inplace=True)
#qid_to_source_text = {i:text for i,text in zip(source_df["qid"].values, source_df["text"].values)}
#qid_to_unique_text_id = {qid:i for qid,i in zip(source_df["qid"].values, source_df["id"].values)}
source_df.head(10)

Unnamed: 0,id,text,bin_id,bin_size,block_id,block_size,block_sample
0,1169,smoking,1,630,1,210,5
1,1913,Achondroplasiars121913105,1,630,1,210,5
2,540,c.348-9_351del,1,630,1,210,5
3,3309,haplogroups,1,630,1,210,5
4,3712,Phenylketonuriars62508588,1,630,1,210,5
5,1446,Warfarin,1,630,1,210,5
6,3711,Phenylketonuriars62516095,1,630,1,210,5
7,3698,Phenylketonuriars5030851,1,630,1,210,5
8,3697,Phenylketonuriars5030850,1,630,1,210,5
9,3696,Phenylketonuriars5030847,1,630,1,210,5


In [10]:
print(source_df.shape)
print(len(pd.unique(source_df["id"])))

(2429, 7)
2429
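
One way to check that the IDs refer to the same texts in both files is sketched below. It assumes that both dataframes have `id` and `text` columns and that the texts are expected to be identical strings; if the survey source was cleaned differently, exact comparison would be too strict. The helper name `find_id_text_mismatches` and the toy data are made up for illustration.

```python
import pandas as pd


def find_id_text_mismatches(original_df, source_df):
    """Return rows of source_df whose text differs from the original text for the same ID."""
    id_to_text = dict(zip(original_df["id"], original_df["text"]))
    # IDs that are missing from the original data map to NaN and are reported as mismatches.
    expected = source_df["id"].map(id_to_text)
    return source_df[expected != source_df["text"]]


# Toy data: ID 2 has a different text and ID 3 does not exist in the original data.
toy_original = pd.DataFrame({"id": [1, 1, 2], "text": ["a", "a", "b"]})
toy_source = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "B", "c"]})
mismatches = find_id_text_mismatches(toy_original, toy_source)
print(mismatches)
```

An empty result would indicate that every ID in the survey source points at the same text as in the original data.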


In [11]:
# Put the responses into a dataframe.
row_tuples = []
for response in responses:
    
    # Metadata associated with this response.
    response_id = response["response_id"]
    recorded_datetime = response["recorded_datetime"]
    status = response["status"]
    progress = response["progress"]
    is_finished = response["is_finished"]
    duration = response["duration"]


    # The actual highlighted text strings from this response.
    for hilight in response["hilights"]:

        # The information about this one particular highlight.
        qid = hilight["qid"]
        hilighted_text = hilight["selection"]
        index_of_first_selected_char = hilight["selection_index"]
        source_text = id_to_unique_text[qid]        

        # First check that the question IDs are correct so we know what the source text was for this question.
        # Then additionally make sure the location of the highlight makes sense as well.
        #assert hilighted_text in source_text
        #assert source_text[index_of_first_selected_char:index_of_first_selected_char+len(hilighted_text)] == hilighted_text
       
        # Those asserts don't pass 100% of the time due to special cases with the text.
        # Save them to variables instead and check how often they don't pass, should be very infrequently.
        # Look at the special cases by hand.
        text_match = (hilighted_text in source_text)
        indices_match = (source_text[index_of_first_selected_char:index_of_first_selected_char+len(hilighted_text)] == hilighted_text)
        
        # Add this as a row.
        row_tuples.append((response_id, recorded_datetime, status, progress, is_finished, duration, qid, hilighted_text, text_match, indices_match))

        
        
# Create the dataframe that holds all this information.
columns = ["response_id", "recorded_datetime", "status", "progress", "is_finished", "duration", "id", "snippet", "text_match", "idx_match"]
df = pd.DataFrame(row_tuples, columns=columns)


# Add another column indicating how many times a particular ID was annotated by a survey taker.
snippets_per_id = dict(df.groupby("id").size())
df["num_ann"] = df["id"].map(lambda x: snippets_per_id[x])


# Save the dataframe to a file.
df.to_csv(RESPONSES_OUTPUT_PATH, index=False)
df.head(10)

Unnamed: 0,response_id,recorded_datetime,status,progress,is_finished,duration,id,snippet,text_match,idx_match,num_ann
0,R_2eUMgNrFozgjPf3,10/1/20 13:40,IP Address,100,True,973,3279,prone to miscalls,True,True,1
1,R_2eUMgNrFozgjPf3,10/1/20 13:40,IP Address,100,True,973,957,vaccine-induced immunity to HBV,True,True,2
2,R_2eUMgNrFozgjPf3,10/1/20 13:40,IP Address,100,True,973,5071,increased risk for Crohn's disease,True,True,1
3,R_2eUMgNrFozgjPf3,10/1/20 13:40,IP Address,100,True,973,1807,risk of breast cancer,True,True,1
4,R_2eUMgNrFozgjPf3,10/1/20 13:40,IP Address,100,True,973,1103,predictive of Cystic Fibrosis,True,True,1
5,R_2eUMgNrFozgjPf3,10/1/20 13:40,IP Address,100,True,973,4536,schizophrenia,True,True,3
6,R_2eUMgNrFozgjPf3,10/1/20 13:40,IP Address,100,True,973,4536,developmental disorders,True,True,3
7,R_2eUMgNrFozgjPf3,10/1/20 13:40,IP Address,100,True,973,3015,hyperalphalipoproteinemia,True,True,1
8,R_2eUMgNrFozgjPf3,10/1/20 13:40,IP Address,100,True,973,2178,Krabbe disease,True,True,2
9,R_2eUMgNrFozgjPf3,10/1/20 13:40,IP Address,100,True,973,2393,oxidative stress,True,True,2


In [12]:
# How many times did the asserts not evaluate to true?
print(Counter(df["text_match"].values))
print(Counter(df["idx_match"].values))

Counter({True: 6665, False: 70})
Counter({True: 6232, False: 503})
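
To look at the special cases by hand, the failing rows can be selected with `df[~df["text_match"]]` and `df[~df["idx_match"]]`. The sketch below shows one way to describe how a single highlight relates to its source text; `locate_snippet` is a hypothetical helper, and it only reports the first occurrence of the snippet.

```python
def locate_snippet(snippet, source_text, reported_index):
    """Describe how a highlighted snippet relates to its source text."""
    # str.find returns -1 when the snippet does not occur in the text at all.
    actual_index = source_text.find(snippet)
    end = reported_index + len(snippet)
    return {
        "text_match": actual_index != -1,
        "idx_match": source_text[reported_index:end] == snippet,
        "actual_index": actual_index,
    }


# Toy example: the snippet is present, but the reported index is off by two characters.
example = locate_snippet("risk", "Higher risk of disease", 5)
print(example)  # {'text_match': True, 'idx_match': False, 'actual_index': 7}
```

Comparing `actual_index` with the reported index for the failing rows may show whether the index mismatches follow a pattern, such as a constant offset.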


In [13]:
# Preparing a dataset of context sentences that include the highlighted text snippets.
def get_contexts(i, snippet, text):
    sentences = sent_tokenize(text)
    sentences_with_snippet = [s for s in sentences if snippet in s]
    
    # Some additional processing to clean up the context sentences. Can insert more steps here if needed.
    
    # One way to do substitutions with regex. re.sub(old, new, string). The old pattern matches a SNP name.
    # That pattern includes () to define a group. In the new pattern, you can reference whatever is in that 
    # group with \1, as long as it's a raw string. A second group would be referenced with \2, etc.
    sentences_with_snippet = [re.sub(r"(rs[0-9]+)", r" \1 ", s) for s in sentences_with_snippet]
    
    
    
    
    
    
    # Make sure that bracketed text has spaces around it, this might add extra whitespace, will be removed later.
    sentences_with_snippet = [s.replace("("," (") for s in sentences_with_snippet]
    sentences_with_snippet = [s.replace(")",") ") for s in sentences_with_snippet]
    
    # Replace all instances of whitespace with a single space.
    whitespace_pattern = re.compile(r"\s+")
    sentences_with_snippet = [whitespace_pattern.sub(r" ", s) for s in sentences_with_snippet]
    
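    # Remove whitespace that occurs before periods or commas. The first pattern also turns a semicolon that is
    # preceded by whitespace into a period, so the semicolon in the second pattern never gets a chance to match.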
    whitespace_before_period_pattern = re.compile(r"\s+[\.\;]")
    whitespace_before_comma_pattern = re.compile(r"\s+[\,\;]")
    sentences_with_snippet = [whitespace_before_period_pattern.sub(r".", s) for s in sentences_with_snippet]
    sentences_with_snippet = [whitespace_before_comma_pattern.sub(r",", s) for s in sentences_with_snippet]
    
    # Strip leading and trailing whitespace.
    sentences_with_snippet = [s.strip() for s in sentences_with_snippet]
    
    # Capitalize each sentence.
    sentences_with_snippet = ["{}{}".format(s[0].upper(),s[1:]) for s in sentences_with_snippet]
    
    # End each sentence with a single period.
    add_end_character = lambda x: "{}.".format(x[:len(x)-1]) if (x[len(x)-1]==".") or (x[len(x)-1]==";") else "{}.".format(x)
    sentences_with_snippet = [add_end_character(s) for s in sentences_with_snippet]
    
    # Remove all quotes from the sentences.
    sentences_with_snippet = [s.replace('"', '') for s in sentences_with_snippet]
    sentences_with_snippet = [s.replace("'", "") for s in sentences_with_snippet]
    
    
    # Notes to save about how to do this.
    # Another regex substitution trick. The repl (new) argument can be a function, which has to always take a single
    # parameter, which is the regex match. You can then capture groups inside that function, do any transformations,
    # and return, which would have been less straight forward or not possible to do in the repl raw string.
    def make_lowercase(match):
        group = match.group(1)
        group = group.lower()
        return(group)
    
    # Fix a problem where SNP names were being capitalized if they were the first token in the sentence.
    sentences_with_snippet = [re.sub(r"(Rs[0-9]+)", make_lowercase, s) for s in sentences_with_snippet]
    
    # Done.
    return((i,sentences_with_snippet))


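# Create a mapping between text IDs and lists of the context sentences that relevant snippets were found in.
# Note that dict() keeps only the last (id, sentences) pair for each ID, so when an ID has several snippets,
# only the context sentences found for the last one are kept.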
obj = df.apply(lambda row: get_contexts(row.id, row.snippet, id_to_unique_text[row.id]), axis=1)
id_to_context_sentences = dict(obj.values)
id_to_context_sentences

{3279: ['Prone to miscalls (false positives) ? rs267608099.'],
 957: ['Host genetic factors and vaccine-induced immunity to HBV infection: haplotype analysis.'],
 5071: ['Associated with increased risk for Crohns disease in a study of 380 Korean patients.'],
 1807: ['Association of genetic polymorphisms of EXO1 gene with risk of breast cancer in Taiwan.'],
 1103: ['Previously considered predictive of Cystic Fibrosis, 23andMe has removed it from their reports as unreliable.'],
 4536: ['Also known as c.1272delC and p.Tyr425Thrfs Rare loss-of-function variants in SETD1A are associated with schizophrenia and developmental disorders.'],
 3015: ['Association of an intronic haplotype of the LIPC gene with hyperalphalipoproteinemia in two independent populations.'],
 2178: ['Aka c.1592G>A, p.Arg531HisIdentified in ClinVar as pathogenic for Krabbe disease (when inherited recessively or as a compound heterozygote).'],
 2393: ['Mistakenly mentioned as rs18006688 in Lead exposure, polymorphisms in
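
Building a dictionary directly from (id, sentences) pairs keeps only the last pair for each ID. If the goal were to keep the context sentences for every snippet of an ID, one possible alternative is sketched below. `collect_contexts` is a hypothetical helper, not part of this notebook, and using it would change the number of context rows produced downstream.

```python
from collections import defaultdict


def collect_contexts(pairs):
    """Merge (id, sentences) pairs into one order-preserving list of unique sentences per ID."""
    merged = defaultdict(list)
    for text_id, sentences in pairs:
        bucket = merged[text_id]  # creates an empty list the first time an ID is seen
        for sentence in sentences:
            if sentence not in bucket:
                bucket.append(sentence)
    return dict(merged)


# Toy pairs: ID 8 appears three times, with one sentence repeated.
pairs = [(8, ["Sentence one."]), (8, ["Sentence two."]), (8, ["Sentence one."]), (12, ["Other."])]
print(dict(pairs))              # the last pair for each ID wins
print(collect_contexts(pairs))  # sentences from every pair are kept once
```

With the objects in this notebook, the call would look like `collect_contexts(obj.values)`.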

In [14]:
# We need the unprocessed CSV file exported from Qualtrics for some of the following information.
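# Specifically, let's use this to get the mapping between response IDs (from Qualtrics) and user IDs (from Prolific).
exported_df = pd.read_csv(QUALTRICS_RAW_CSV_EXPORT_PATH)
# Drop the first two rows, which hold extra header information in Qualtrics exports rather than responses.
exported_df.drop(exported_df.index[[0,1]], inplace=True)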
prolific_pid_column = "Q564"
response_id_column = "ResponseId"
response_id_to_prolific_pid = dict(zip(exported_df[response_id_column], exported_df[prolific_pid_column]))
response_id_to_prolific_pid

{'R_2eUMgNrFozgjPf3': '5ecba4b6ef75d53505406d57',
 'R_utbyJnQ1UKOs1mp': '5cdc5a76fa9e8d0001bcb934',
 'R_3nAyzwespvUg02B': '5bf5adc9d944c30001263593',
 'R_32XGNR4nwZcoLZI': '5ebf84f79a8ebe3610e2c79b',
 'R_3oYnbK47VvX8jrt': '5bf9c3137618a60001608942',
 'R_3rDF5fdF3oDgOuR': '5f3a820565d604693631d6fb',
 'R_YWGiyoyuJx2FDIR': '5ec5ca6abbd42f64282295d7',
 'R_1CkzzlePivYgMZr': '5e67d5ca0ec2fc0801580430',
 'R_3EFFKjKVN3PQ0In': '5f2be37b1e52920009dadba6',
 'R_32JAcjfY3df6fzg': '5e9f3b738bee310bed8379ad',
 'R_2Yt7JOaXOeN5jJ2': '5e032eefdfdc95e5e0c31dfd',
 'R_1IXCuSnWZ7cyfh3': '5f74e009929e1012397c31e9',
 'R_32MejXkqXetplmo': '5e98bfdee3aef30fc9a99e28',
 'R_2U4sdhbvxYpTR5O': '5865dd647fbbcd00013973b8',
 'R_1PZ7OJP8e7aeIjO': '5be385ba24f70e00018b0024',
 'R_3GpR4HRv21NuLtm': '5e93301515768a1680c9705a',
 'R_1pWmg4wjRUOa542': '5ec6b3ce04e80f0c9a211c5b',
 'R_R8j0APpr9g9xfc5': '5df7eeb4cc6822000ac81b07',
 'R_1H0Opfec4YYU02C': '5edede64402f7d18f040266e',
 'R_RUdmhInXoPo1WwN': '5d2b153afa24ba00173e8ee0',


In [15]:
exported_df

Unnamed: 0,StartDate,EndDate,Status,IPAddress,Progress,Duration (in seconds),Finished,RecordedDate,ResponseId,RecipientLastName,...,LocationLongitude,DistributionChannel,UserLanguage,Q564,hilights,PROLIFIC_PID,STUDY_ID,SESSION_ID,client_errors,navigator_info
2,10/1/20 13:24,10/1/20 13:40,IP Address,78.154.94.55,100,973,TRUE,10/1/20 13:40,R_2eUMgNrFozgjPf3,,...,19.76759338,anonymous,EN,5ecba4b6ef75d53505406d57,"[[""3279"",""prone to miscalls"",0],[""957"",""vaccin...",5ecba4b6ef75d53505406d57,5f7618a006b78606c8f1c588,5f762cd1f653f2096918d838,[],"{""userAgent"":""Mozilla/5.0 (Windows NT 10.0; Wi..."
3,10/1/20 13:32,10/1/20 13:43,IP Address,93.56.134.178,100,653,TRUE,10/1/20 13:43,R_utbyJnQ1UKOs1mp,,...,8.944396973,anonymous,EN,5cdc5a76fa9e8d0001bcb934,"[[""4315"",""Hirschsprung disease"",32],[""2617"",""H...",5cdc5a76fa9e8d0001bcb934,5f7618a006b78606c8f1c588,5f762eccceb01f09da5a4f41,[],"{""userAgent"":""Mozilla/5.0 (Windows NT 10.0; Wi..."
4,10/1/20 13:36,10/1/20 13:44,IP Address,85.202.110.140,100,454,TRUE,10/1/20 13:44,R_3nAyzwespvUg02B,,...,22.27319336,anonymous,EN,5bf5adc9d944c30001263593,"[[""499"","" 2T>"",11],[""614"",""Brain-derived"",0],[...",5bf5adc9d944c30001263593,5f7618a006b78606c8f1c588,5f762fbfa6eafb0afbdf038c,[],"{""userAgent"":""Mozilla/5.0 (Windows NT 6.3; Win..."
5,10/1/20 13:37,10/1/20 13:47,IP Address,83.6.53.20,100,615,TRUE,10/1/20 13:47,R_32XGNR4nwZcoLZI,,...,21.00149536,anonymous,EN,5ebf84f79a8ebe3610e2c79b,"[[""838"",""adult bronchial asthma"",78],[""3384"",""...",5ebf84f79a8ebe3610e2c79b,5f7618a006b78606c8f1c588,5f762ff476442d0a541dc2d6,[],"{""userAgent"":""Mozilla/5.0 (Windows NT 10.0; Wi..."
6,10/1/20 13:23,10/1/20 13:49,IP Address,37.30.39.19,100,1524,TRUE,10/1/20 13:49,R_3oYnbK47VvX8jrt,,...,20.92379761,anonymous,EN,5bf9c3137618a60001608942,[],5bf9c3137618a60001608942,5f7618a006b78606c8f1c588,5f762cb772893609ceca96ae,[],"{""userAgent"":""Mozilla/5.0 (Windows NT 10.0; Wi..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,10/1/20 13:37,10/1/20 14:09,IP Address,83.28.238.92,1,1,FALSE,,FS_Xk4jjhD7d54HjJT,,...,-79.43430328,anonymous,EN,5f23f9035554f32008f8a35e,"[[""3433"",""aka c.3126T>G"",0],[""1636"",""Can"",11],...",5f23f9035554f32008f8a35e,5f7618a006b78606c8f1c588,5f763009fb2b4b0a21271d19,[],"{""userAgent"":""Mozilla/5.0 (Windows NT 6.1; Win..."
65,10/1/20 13:40,10/1/20 15:16,IP Address,99.225.196.73,1,1,FALSE,,FS_1dBlMB6Zi7jyiCJ,,...,-79.43430328,anonymous,EN,5cffec8eb5f15d0017b5dbca,"[[""3712"",""Phenylketonuria"",0],[""2415"",""contrib...",5cffec8eb5f15d0017b5dbca,5f7618a006b78606c8f1c588,5f7630953923650a6c5d318b,[],"{""userAgent"":""Mozilla/5.0 (Macintosh; Intel Ma..."
66,10/1/20 13:41,10/1/20 17:00,IP Address,89.68.167.54,1,1,FALSE,,FS_3QFNaOgC6T3oekz,,...,-79.43430328,anonymous,EN,5f6b3dbc7a71cb2393732fa5,[],5f6b3dbc7a71cb2393732fa5,5f7618a006b78606c8f1c588,5f7630f275b83a09be15de74,[],"{""userAgent"":""Mozilla/5.0 (Linux; Android 9; F..."
67,10/1/20 13:45,10/1/20 13:45,IP Address,188.83.156.96,1,1,FALSE,,FS_1CkGU9TH8KZkAxg,,...,-79.43430328,anonymous,EN,5eb2c8d889123905858e3c20,[],5eb2c8d889123905858e3c20,5f7618a006b78606c8f1c588,5f76302fc3a3e109fc642a48,[],"{""userAgent"":""Mozilla/5.0 (Windows NT 10.0; Wi..."


In [16]:
# Breaking it down by specific users who took the survey.
users_df = df.copy(deep=True)[["response_id","recorded_datetime","status","progress","is_finished","duration"]]
users_df.drop_duplicates(inplace=True)
users_df.reset_index(inplace=True, drop=True)

# Some extra information to know: the duration in minutes, the total number of highlights in each response, and the Prolific IDs.
users_df["duration_min"] = users_df["duration"].map(lambda x: math.ceil(x/60))
response_id_to_num_snippets = dict(df.groupby("response_id").size())
users_df["num_snippets"] = users_df["response_id"].map(response_id_to_num_snippets)
users_df["prolific_id"] = users_df["response_id"].map(response_id_to_prolific_pid)
users_df.to_csv(SURVEY_TAKERS_OUTPUT_PATH, index=False)
users_df.tail(50)

Unnamed: 0,response_id,recorded_datetime,status,progress,is_finished,duration,duration_min,num_snippets,prolific_id
11,R_2U4sdhbvxYpTR5O,10/1/20 13:59,IP Address,100,True,1511,26,97,5865dd647fbbcd00013973b8
12,R_1PZ7OJP8e7aeIjO,10/1/20 14:00,IP Address,100,True,1160,20,108,5be385ba24f70e00018b0024
13,R_3GpR4HRv21NuLtm,10/1/20 14:01,IP Address,100,True,1310,22,169,5e93301515768a1680c9705a
14,R_1pWmg4wjRUOa542,10/1/20 14:01,IP Address,100,True,1343,23,151,5ec6b3ce04e80f0c9a211c5b
15,R_R8j0APpr9g9xfc5,10/1/20 14:01,IP Address,100,True,1229,21,42,5df7eeb4cc6822000ac81b07
16,R_1H0Opfec4YYU02C,10/1/20 14:02,IP Address,100,True,1696,29,87,5edede64402f7d18f040266e
17,R_RUdmhInXoPo1WwN,10/1/20 14:02,IP Address,100,True,1457,25,88,5d2b153afa24ba00173e8ee0
18,R_1LGAwDpbJTS9frP,10/1/20 14:03,IP Address,100,True,1181,20,68,5d6c12352e6bbb001a122cef
19,R_2yjBPHu6DffFx8u,10/1/20 14:04,IP Address,100,True,1748,30,112,5f2097ac626bc33b28c2cbf9
20,R_1rk9ABpZImGhqxT,10/1/20 14:05,IP Address,100,True,1347,23,65,5d722d644fee74001adb559d


In [17]:
# Create a mapping from unique text IDs to all of the text snippets that were highlighted in these survey results.
unique_text_id_to_snippet_list = {}
for unique_text_id,row_labels in df.groupby("id").groups.items():
    # The groups hold index labels, so select with .loc rather than .iloc.
    hilighted_texts_list = list(df.loc[row_labels, "snippet"].values)
    unique_text_id_to_snippet_list[unique_text_id] = hilighted_texts_list
print(unique_text_id_to_snippet_list)

{1: ['sleep disorder', 'sleep phase syndrome'], 8: ['Apolipoprotein E levels in cerebrospinal fluid ', 'effects of ABCA1 polymorphisms.', 'HDL cholesterol level', 'ABCA1 sequence variation confirms association with dementia.'], 12: ['dementia'], 14: ['C allele', 'HDL cholesterol'], 16: ['aka c.7093G>', 'p.Asp2365Asn'], 18: ['rare mutation', 'surfactant metabolism dysfunction', 'pulmonary', 'type 3', 'disorder involving severe neonatal distress', 'death of a newborn', 'ambiguous flip confusion'], 20: ["late-onset Alzheimer's disease", 'could result in defective protein function'], 23: ["late onset Alzheimer's disease"], 24: ["Alzheimer's disease", 'ABCA7', 'rs72973581(A)', ' p.G215S'], 30: ['adverse side effects', 'depression ', 'depression ', 'depression '], 33: ['intracerebral concentrations of certain drugs', 'relationships for selective serotonin reuptake inhibitors', 'antidepressant drugs'], 34: ['were more likely to remit', 'depression', 'major depression'], 38: ["A Danish case-co

In [18]:
# Use that mapping to create a version of the original dataframe with just the highlighted text snippets.
subset_df = original_df.copy(deep=True)[original_df["id"].isin(unique_text_id_to_snippet_list.keys())]
subset_df["n"] = subset_df["id"].map(lambda x: len(unique_text_id_to_snippet_list[x]))
text_snippets = flatten([unique_text_id_to_snippet_list[i] for i in subset_df["id"].values])

# Extend the dataframe to duplicate each row n times where n is the number of text highlight results from the surveys.
modified_df = subset_df.reindex(np.repeat(subset_df.index.values, subset_df["n"]), method="ffill")

# Make sure that the extension occurred as expected based on the number of text snippets, and add them as a new column.
assert len(modified_df) == len(text_snippets)
modified_df["snippet"] = text_snippets
modified_df.head(20)

Unnamed: 0,gene,snp,text,id,n,snippet
0,AANAT,Rs28936679,"rs28936679, also known as Ala129Thr or A129T (...",1,2,sleep disorder
0,AANAT,Rs28936679,"rs28936679, also known as Ala129Thr or A129T (...",1,2,sleep phase syndrome
7,ABCA1,Rs2066715,Apolipoprotein E levels in cerebrospinal fluid...,8,4,Apolipoprotein E levels in cerebrospinal fluid
7,ABCA1,Rs2066715,Apolipoprotein E levels in cerebrospinal fluid...,8,4,effects of ABCA1 polymorphisms.
7,ABCA1,Rs2066715,Apolipoprotein E levels in cerebrospinal fluid...,8,4,HDL cholesterol level
7,ABCA1,Rs2066715,Apolipoprotein E levels in cerebrospinal fluid...,8,4,ABCA1 sequence variation confirms association ...
11,ABCA1,Rs363717,Examining the effect of linkage disequilibrium...,12,1,dementia
13,ABCA1,Rs4149274,C allele is associated with 1.51mg/dl increase...,14,2,C allele
13,ABCA1,Rs4149274,C allele is associated with 1.51mg/dl increase...,14,2,HDL cholesterol
19,ABCA12,Rs726070,aka c.7093G>A (p.Asp2365Asn)The variant allele...,16,2,aka c.7093G>
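
The same row duplication can also be expressed with `DataFrame.explode`, which emits one row per element of a list-valued column. This is a sketch of an alternative to the `reindex` approach used above, shown on made-up toy data.

```python
import pandas as pd

# Toy stand-ins for subset_df and unique_text_id_to_snippet_list.
toy_df = pd.DataFrame({"gene": ["AANAT", "ABCA1"], "snp": ["Rs1", "Rs2"], "id": [1, 8]})
id_to_snippets = {1: ["sleep disorder", "sleep phase syndrome"], 8: ["HDL cholesterol level"]}

# Attach the list of snippets to each row, then emit one row per list element.
with_lists = toy_df.assign(snippet=toy_df["id"].map(id_to_snippets))
exploded = with_lists.explode("snippet")
print(exploded)
```

As with the `reindex` approach, the original index labels are repeated for the duplicated rows.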


In [19]:
# Check all IDs for unique texts and see how many times they were annotated here.
id_to_num_annotations = {}
for i in original_df["id"].values:
    # This ID was annotated at least once, it was in the results dataframe for this survey.
    if i in snippets_per_id:
        id_to_num_annotations[i] = snippets_per_id[i]
    else:
        id_to_num_annotations[i] = 0

# Save that information to a file.
num_ann_df = pd.DataFrame({"id":list(id_to_num_annotations.keys()),"num_highlights":list(id_to_num_annotations.values())})
num_ann_df.to_csv(NUM_ANNOTATIONS_PATH, index=False)
num_ann_df.head(10)

Unnamed: 0,id,num_highlights
0,1,2
1,2,0
2,3,0
3,4,0
4,5,0
5,6,0
6,7,0
7,8,4
8,9,0
9,10,0
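
A coverage summary can be computed from this table. The sketch below counts how many unique text IDs received at least one highlight; `summarize_coverage` is a hypothetical helper and the toy dataframe only mimics the `id` and `num_highlights` columns shown above.

```python
import pandas as pd


def summarize_coverage(num_ann_df):
    """Count how many IDs have at least one highlight and what fraction of all IDs that is."""
    total = len(num_ann_df)
    annotated = int((num_ann_df["num_highlights"] > 0).sum())
    fraction = annotated / total if total else 0.0
    return {"total_ids": total, "annotated_ids": annotated, "fraction_annotated": fraction}


# Toy data: two of the four IDs have at least one highlight.
toy_num_ann_df = pd.DataFrame({"id": [1, 2, 3, 4], "num_highlights": [2, 0, 0, 4]})
summary = summarize_coverage(toy_num_ann_df)
print(summary)  # {'total_ids': 4, 'annotated_ids': 2, 'fraction_annotated': 0.5}
```

In the notebook, the call would be `summarize_coverage(num_ann_df)`.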


In [20]:
# Save that dataframe as a new CSV file with just the final cleaned text snippets for each gene and SNP.
modified_df = modified_df[["gene","snp","snippet"]]
modified_df.sort_values(by="gene", inplace=True)
modified_df.to_csv(SNPS_AND_SNIPPETS_OUTPUT_PATH, index=False)
modified_df.head(20)

Unnamed: 0,gene,snp,snippet
0,AANAT,Rs28936679,sleep disorder
0,AANAT,Rs28936679,sleep phase syndrome
7,ABCA1,Rs2066715,Apolipoprotein E levels in cerebrospinal fluid
7,ABCA1,Rs2066715,effects of ABCA1 polymorphisms.
7,ABCA1,Rs2066715,HDL cholesterol level
7,ABCA1,Rs2066715,ABCA1 sequence variation confirms association ...
11,ABCA1,Rs363717,dementia
13,ABCA1,Rs4149274,C allele
13,ABCA1,Rs4149274,HDL cholesterol
19,ABCA12,Rs726070,aka c.7093G>


In [21]:
# How many genes had at least one snippet associated with them in these results?
len(modified_df["gene"].unique())

814

In [22]:
# Create a version of the dataset that maps genes and SNPs to context sentences.
subset_df = original_df.copy(deep=True)[original_df["id"].isin(id_to_context_sentences.keys())]
subset_df["n"] = subset_df["id"].map(lambda x: len(id_to_context_sentences[x]))
sentences = flatten([id_to_context_sentences[i] for i in subset_df["id"].values])

# Extend the dataframe to duplicate each row n times where n is the number of context sentences found for that ID.
modified_df = subset_df.reindex(np.repeat(subset_df.index.values, subset_df["n"]), method="ffill")

# Make sure that the extension occurred as expected based on the number of context sentences, and add them as a new column.
assert len(modified_df) == len(sentences)
modified_df["context"] = sentences
modified_df.head(20)

Unnamed: 0,gene,snp,text,id,n,context
0,AANAT,Rs28936679,"rs28936679, also known as Ala129Thr or A129T (...",1,1,Its activity increases 10- to 100-fold at nigh...
7,ABCA1,Rs2066715,Apolipoprotein E levels in cerebrospinal fluid...,8,1,A survey of ABCA1 sequence variation confirms ...
11,ABCA1,Rs363717,Examining the effect of linkage disequilibrium...,12,1,A survey of ABCA1 sequence variation confirms ...
13,ABCA1,Rs4149274,C allele is associated with 1.51mg/dl increase...,14,1,C allele is associated with 1.51mg/dl increase...
19,ABCA12,Rs726070,aka c.7093G>A (p.Asp2365Asn)The variant allele...,16,1,Aka c.7093G>A (p.Asp2365Asn) The variant allel...
21,ABCA3,Rs149989682,"rs149989682, also known as c.875A>T, p.Glu292V...",18,1,"rs149989682, also known as c.875A>T, p.Glu292V..."
23,ABCA7,Rs115550680,"A 2013 meta-analysis comprising a total of ~2,...",20,1,The deleted allele could result in defective p...
26,ABCA7,Rs4147929,rs4147929 is a SNP in the ATP-binding cassette...,23,1,rs4147929 is a SNP in the ATP-binding cassette...
27,ABCA7,Rs72973581,A sequencing study of 332 sporadic Alzheimer's...,24,1,A sequencing study of 332 sporadic Alzheimers ...
33,ABCB1,Rs11983225,rs11983225 is a SNP in the ABCB1 gene (also kn...,30,3,"According to a recent review, ten studies repo..."


In [23]:
# Save that dataframe as a new CSV file with just the contexts for each gene and SNP.
modified_df = modified_df[["gene","snp","context"]]
modified_df.sort_values(by="gene", inplace=True)
modified_df.to_csv(SNPS_AND_CONTEXTS_OUTPUT_PATH, index=False)
modified_df.head(20)

Unnamed: 0,gene,snp,context
0,AANAT,Rs28936679,Its activity increases 10- to 100-fold at nigh...
7,ABCA1,Rs2066715,A survey of ABCA1 sequence variation confirms ...
11,ABCA1,Rs363717,A survey of ABCA1 sequence variation confirms ...
13,ABCA1,Rs4149274,C allele is associated with 1.51mg/dl increase...
19,ABCA12,Rs726070,Aka c.7093G>A (p.Asp2365Asn) The variant allel...
21,ABCA3,Rs149989682,"rs149989682, also known as c.875A>T, p.Glu292V..."
23,ABCA7,Rs115550680,The deleted allele could result in defective p...
26,ABCA7,Rs4147929,rs4147929 is a SNP in the ATP-binding cassette...
27,ABCA7,Rs72973581,A sequencing study of 332 sporadic Alzheimers ...
44,ABCB1,Rs4148740,"A well-run, double-blind study of ABCB1 substr..."
