## Parsing Survey Results
This notebook builds the intended datasets from two inputs: the original files of text and relationships scraped from SNPedia, and the results of the survey. It uses them to construct the relationships between genes, SNPs, and the different types of text snippets or context sentences. It also measures how much of the dataset a given set of survey results covers.

In [1]:
import pandas as pd
import numpy as np
import json
import math
import random
import datetime
import re
from collections import Counter
from utils import flatten
from nltk.tokenize import sent_tokenize

### Defining paths to input files
These first two files should always stay the same. The first defines the project-wide mapping between unique strings and IDs. The second is the starting dataset of text that was scraped and cleaned from SNPedia, so changing it would affect everything downstream in the creation of the dataset. The paths to the Qualtrics results files can be changed, however, and all output files are produced with respect to just those results. Running the notebook with different result sets should produce separate files that can be stacked afterward.
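As a rough illustration of stacking outputs from separate runs, the sketch below tags each run's dataframe with a run label and concatenates them. The helper name `stack_runs`, the `run` column, and the toy rows are assumptions made for this example; they are not part of the notebook.

```python
import pandas as pd

def stack_runs(frames_by_run):
    """Concatenate per-run dataframes, recording which run each row came from."""
    tagged = [frame.assign(run=run) for run, frame in frames_by_run.items()]
    return pd.concat(tagged, ignore_index=True)

# Toy stand-ins for two snippet files produced by two different result sets.
run_a = pd.DataFrame({"gene": ["AANAT"], "snp": ["Rs3760138"], "snippet": ["autism"]})
run_b = pd.DataFrame({"gene": ["ABCA1"], "snp": ["Rs2230808"], "snippet": ["dementia"]})

stacked = stack_runs({"S123456": run_a, "S654321": run_b})
print(stacked)
```

In practice each dataframe would come from `pd.read_csv` on one run's output file, with the run label taken from the file name prefix.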

In [2]:
# This is the original data with duplicate text strings because IDs refer to particular SNPs from particular genes.
ORIGINAL_DATA_PATH = "../data/3_snps_and_cleaned_text_with_ids.csv"

# The data the survey was built from. IDs here refer to unique text strings, so duplicates were not part of the survey.
SURVEY_SOURCE_DATA_PATH = "../data/6_part_1_binned_and_blocked_texts.csv"

# Raw CSV export from Qualtrics, and the JSON file from parsing the results of the Qualtrics survey designed for extracting phenotypes from text.
QUALTRICS_RAW_CSV_EXPORT_PATH = "../qualtrics/results_raw_qualtrics_exports_v1/SNPedia_Survey_September_24_2020_22_24_Completed_and_In_Progress.csv"
QUALTRICS_RESULTS_JSON_PATH = "../qualtrics/results_processed_json_files_v1/snpedia_survey_september_24_2020_h22_m24_completed_and_in_progress.json"

### Defining paths to output files
These should be left alone. The files are named automatically according to when the notebook was run, and files from the same run share a prefix so that they are grouped together.

In [3]:
# Where to send the resulting output dataframes and some other information about coverage.
random_six_digit_number = random.randrange(100000,999999)
datetime_str = datetime.datetime.now().strftime('%m_%d_%Y_h%Hm%Ms%S')
FILENAMES_OUTPUT_PATH = "../data/S{}_input_file_paths_{}.txt".format(random_six_digit_number, datetime_str)
RESPONSES_OUTPUT_PATH = "../data/S{}_processed_survey_responses_{}.csv".format(random_six_digit_number, datetime_str)
SURVEY_TAKERS_OUTPUT_PATH = "../data/S{}_processed_survey_takers_{}.csv".format(random_six_digit_number, datetime_str)
SNPS_AND_SNIPPETS_OUTPUT_PATH = "../data/S{}_snps_and_snippets_{}.csv".format(random_six_digit_number, datetime_str)
SNPS_AND_CONTEXTS_OUTPUT_PATH = "../data/S{}_snps_and_contexts_{}.csv".format(random_six_digit_number, datetime_str)
NUM_ANNOTATIONS_PATH = "../data/S{}_number_of_annotations_per_id_{}.csv".format(random_six_digit_number, datetime_str)
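The naming scheme above could also be wrapped in a small helper so that every path from one run is guaranteed to share the same prefix and timestamp. This is only a sketch: `make_run_paths` and its parameters are hypothetical, and it draws the run number with an upper bound of `1000000` so that `999999` is included, which differs slightly from the cell above.

```python
import datetime
import random

def make_run_paths(names, directory="../data", rng=random, now=None):
    """Build one output path per name, all sharing a run number and timestamp."""
    now = now or datetime.datetime.now()
    run_id = rng.randrange(100000, 1000000)  # any six-digit number
    stamp = now.strftime("%m_%d_%Y_h%Hm%Ms%S")
    return {
        name: "{}/S{}_{}_{}.csv".format(directory, run_id, name, stamp)
        for name in names
    }

# Fixed seed and time so the example is repeatable.
paths = make_run_paths(
    ["snps_and_snippets", "snps_and_contexts"],
    rng=random.Random(0),
    now=datetime.datetime(2020, 9, 24, 22, 24, 0),
)
for name, path in paths.items():
    print(name, "->", path)
```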

In [4]:
# Create an output file that makes a note of which input files were used for these results.
with open(FILENAMES_OUTPUT_PATH, "w") as f:
    filenames = [
        ORIGINAL_DATA_PATH,
        SURVEY_SOURCE_DATA_PATH,
        QUALTRICS_RAW_CSV_EXPORT_PATH,
        QUALTRICS_RESULTS_JSON_PATH]
    filenames = "\n".join(filenames)
    f.write(filenames)

In [5]:
# Make sure the mapping between the unique texts used to create the survey and the original SNP data exists.
original_df = pd.read_csv(ORIGINAL_DATA_PATH)
#survey_df = pd.read_csv(SURVEY_SOURCE_DATA_PATH)
#text_to_unique_text_id = dict(zip(survey_df.text, survey_df.id))
#unique_text_id_to_text = dict(zip(survey_df.id, survey_df.text))
#original_df["id"] = original_df["text"].map(text_to_unique_text_id)

id_to_unique_text = dict(zip(original_df["id"].values, original_df["text"].values))


original_df.head(20)

Unnamed: 0,gene,snp,text,id
0,AANAT,Rs28936679,"rs28936679, also known as Ala129Thr or A129T (...",1
1,AANAT,Rs3760138,Genetic differences in human circadian clock g...,2
2,AANAT,Rs4238989,Genetic differences in human circadian clock g...,3
3,ABCA1,Rs1800977,The -14C->T polymorphism rs1800977 of the ABCA...,4
4,ABCA1,Rs1883025,Apolipoprotein E levels in cerebrospinal fluid...,5
5,ABCA1,Rs2020927,"rs2297404, rs2230808, and rs2020927 haplotype ...",6
6,ABCA1,Rs2066714,Apolipoprotein E levels in cerebrospinal fluid...,7
7,ABCA1,Rs2066715,Apolipoprotein E levels in cerebrospinal fluid...,8
8,ABCA1,Rs2230806,"rs2230806, also known as Arg219Lys or R219K, i...",9
9,ABCA1,Rs2230808,"rs2297404, rs2230808, and rs2020927 haplotype ...",10


In [6]:
original_df.isnull().sum(axis = 0)

gene    0
snp     0
text    0
id      0
dtype: int64

In [7]:
original_df.shape

(6535, 4)

In [8]:
# Create a dictionary using the JSON data output from a qualtrics survey.
with open(QUALTRICS_RESULTS_JSON_PATH) as f:
    responses = json.load(f)
responses

[{'response_id': 'R_eOSosrnczqYqE9P',
  'recorded_datetime': '9/22/20 10:37',
  'status': 'IP Address',
  'progress': 100,
  'is_finished': 'TRUE',
  'duration': 1252,
  'hilights': [{'qid': 4769, 'selection': 'Disease ', 'selection_index': 13},
   {'qid': 5162,
    'selection': 'Focal segmental glomerulosclerosis',
    'selection_index': 0},
   {'qid': 2451, 'selection': ' Alpha Thalassemia', 'selection_index': 23},
   {'qid': 3964,
    'selection': 'Congenital Disorder of Glycosylation',
    'selection_index': 0},
   {'qid': 5448, 'selection': 'schizophrenia.', 'selection_index': 54},
   {'qid': 3027, 'selection': 'glaucoma', 'selection_index': 29},
   {'qid': 2287,
    'selection': 'Nonsyndromic Sensorineural',
    'selection_index': 33},
   {'qid': 2287, 'selection': 'Connexin', 'selection_index': 13},
   {'qid': 1783, 'selection': 'schizophrenia', 'selection_index': 10},
   {'qid': 628, 'selection': 'cholesterol', 'selection_index': 4},
   {'qid': 628, 'selection': 'cholesterol', 

In [9]:
# Load the original source CSV that was used in creating the survey. This will be used to check against the results.
# We want to make sure there is no discrepancy in which IDs are referring to which texts.
source_df = pd.read_csv(SURVEY_SOURCE_DATA_PATH)
#source_df.reset_index(drop=False, inplace=True)

# Question IDs in the actual survey will use the 
#source_df["qid"] = source_df["index"]
#source_df.drop(labels=["index"], axis=1, inplace=True)
#qid_to_source_text = {i:text for i,text in zip(source_df["qid"].values, source_df["text"].values)}
#qid_to_unique_text_id = {qid:i for qid,i in zip(source_df["qid"].values, source_df["id"].values)}
source_df.head(10)

Unnamed: 0,id,text,bin_id,bin_size,block_id,block_size,block_sample
0,1169,smoking,1,988,1,247,5
1,3702,Phenylketonuriars62514952,1,988,1,247,5
2,3701,Phenylketonuriars5030860,1,988,1,247,5
3,3700,Phenylketonuriars5030859,1,988,1,247,5
4,3699,Phenylketonuriars5030856,1,988,1,247,5
5,3698,Phenylketonuriars5030851,1,988,1,247,5
6,3697,Phenylketonuriars5030850,1,988,1,247,5
7,3696,Phenylketonuriars5030847,1,988,1,247,5
8,3695,Phenylketonuriars5030846,1,988,1,247,5
9,3694,Phenylketonuriars5030843,1,988,1,247,5


In [10]:
print(source_df.shape)
print(len(pd.unique(source_df["id"])))

(5081, 7)
5081


In [11]:
# Put the responses into a dataframe.
row_tuples = []
for response in responses:
    
    # Metadata associated with this response.
    response_id = response["response_id"]
    recorded_datetime = response["recorded_datetime"]
    status = response["status"]
    progress = response["progress"]
    is_finished = response["is_finished"]
    duration = response["duration"]


    # The actual highlighted text strings from this response.
    for hilight in response["hilights"]:

        # The information about this one particular highlight.
        qid = hilight["qid"]
        hilighted_text = hilight["selection"]
        index_of_first_selected_char = hilight["selection_index"]
        source_text = id_to_unique_text[qid]        

        # First check that the question IDs are correct so we know what the source text was for this question.
        # Then additionally make sure the location of the highlight makes sense as well.
        #assert hilighted_text in source_text
        #assert source_text[index_of_first_selected_char:index_of_first_selected_char+len(hilighted_text)] == hilighted_text
       
        # Those asserts don't pass 100% of the time due to special cases with the text.
        # Save them to variables instead and check how often they don't pass, should be very infrequently.
        # Look at the special cases by hand.
        text_match = (hilighted_text in source_text)
        indices_match = (source_text[index_of_first_selected_char:index_of_first_selected_char+len(hilighted_text)] == hilighted_text)
        
        # Add this as a row.
        row_tuples.append((response_id, recorded_datetime, status, progress, is_finished, duration, qid, hilighted_text, text_match, indices_match))

        
        
# Create the dataframe that holds all this information.
columns = ["response_id", "recorded_datetime", "status", "progress", "is_finished", "duration", "id", "snippet", "text_match", "idx_match"]
df = pd.DataFrame(row_tuples, columns=columns)


# Add another column indicating how many times a particular ID was annotated by a survey taker.
snippets_per_id = dict(df.groupby("id").size())
df["num_ann"] = df["id"].map(lambda x: snippets_per_id[x])


# Save the dataframe to a file.
df.to_csv(RESPONSES_OUTPUT_PATH, index=False)
df.head(10)

Unnamed: 0,response_id,recorded_datetime,status,progress,is_finished,duration,id,snippet,text_match,idx_match,num_ann
0,R_eOSosrnczqYqE9P,9/22/20 10:37,IP Address,100,True,1252,4769,Disease,True,True,2
1,R_eOSosrnczqYqE9P,9/22/20 10:37,IP Address,100,True,1252,5162,Focal segmental glomerulosclerosis,True,True,2
2,R_eOSosrnczqYqE9P,9/22/20 10:37,IP Address,100,True,1252,2451,Alpha Thalassemia,True,True,6
3,R_eOSosrnczqYqE9P,9/22/20 10:37,IP Address,100,True,1252,3964,Congenital Disorder of Glycosylation,True,True,2
4,R_eOSosrnczqYqE9P,9/22/20 10:37,IP Address,100,True,1252,5448,schizophrenia.,True,True,1
5,R_eOSosrnczqYqE9P,9/22/20 10:37,IP Address,100,True,1252,3027,glaucoma,True,True,5
6,R_eOSosrnczqYqE9P,9/22/20 10:37,IP Address,100,True,1252,2287,Nonsyndromic Sensorineural,True,True,2
7,R_eOSosrnczqYqE9P,9/22/20 10:37,IP Address,100,True,1252,2287,Connexin,True,True,2
8,R_eOSosrnczqYqE9P,9/22/20 10:37,IP Address,100,True,1252,1783,schizophrenia,True,True,2
9,R_eOSosrnczqYqE9P,9/22/20 10:37,IP Address,100,True,1252,628,cholesterol,True,True,4


In [12]:
# How many times did the asserts not evaluate to true?
print(Counter(df["text_match"].values))
print(Counter(df["idx_match"].values))

Counter({True: 18027, False: 109})
Counter({True: 17101, False: 1035})
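To make the by-hand review of the failing cases easier, each highlight could first be sorted into a rough category. The sketch below is one way to do that; the function name and the four categories are assumptions for illustration, since the notebook does not say what the special cases actually are.

```python
def classify_highlight(source_text, snippet, start):
    """Describe how a highlighted snippet relates to its source text."""
    end = start + len(snippet)
    if source_text[start:end] == snippet:
        return "exact"        # found at the recorded index
    if snippet in source_text:
        return "shifted"      # present, but not at the recorded index
    if " ".join(snippet.split()) in " ".join(source_text.split()):
        return "whitespace"   # present once runs of whitespace are collapsed
    return "missing"          # not found at all

source = "Linked to schizophrenia with rs7598440."
print(classify_highlight(source, "schizophrenia", 10))       # exact
print(classify_highlight(source, "schizophrenia", 0))        # shifted
print(classify_highlight("Hearing  loss", "Hearing loss", 0))  # whitespace
print(classify_highlight(source, "glaucoma", 0))             # missing
```

Counting these categories over all rows would show whether the mismatches come mostly from index offsets or from text differences.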


In [13]:
# Preparing a dataset of context sentences that include the highlighted text snippets.
def get_contexts(i, snippet, text):
    sentences = sent_tokenize(text)
    sentences_with_snippet = [s for s in sentences if snippet in s]
    
    # Some additional processing to clean up the context sentences. Can insert more steps here if needed.
    
    # One way to do substitutions with regex. re.sub(old, new, string). The old pattern matches a SNP name.
    # That pattern includes () to define a group. In the new pattern, you can reference whatever is in that 
    # group with \1, as long as it's a raw string. A second group would be referenced with \2, etc.
    sentences_with_snippet = [re.sub(r"(rs[0-9]+)", r" \1 ", s) for s in sentences_with_snippet]
    
    
    
    
    
    
    # Make sure that bracketed text has spaces around it, this might add extra whitespace, will be removed later.
    sentences_with_snippet = [s.replace("("," (") for s in sentences_with_snippet]
    sentences_with_snippet = [s.replace(")",") ") for s in sentences_with_snippet]
    
    # Replace all instances of whitespace with a single space.
    whitespace_pattern = re.compile(r"\s+")
    sentences_with_snippet = [whitespace_pattern.sub(r" ", s) for s in sentences_with_snippet]
    
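    # Remove whitespace that occurs before periods or commas. Both patterns also match semicolons, and because
    # the period pattern runs first, whitespace followed by a semicolon is replaced with a period.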
    whitespace_before_period_pattern = re.compile(r"\s+[\.\;]")
    whitespace_before_comma_pattern = re.compile(r"\s+[\,\;]")
    sentences_with_snippet = [whitespace_before_period_pattern.sub(r".", s) for s in sentences_with_snippet]
    sentences_with_snippet = [whitespace_before_comma_pattern.sub(r",", s) for s in sentences_with_snippet]
    
    # Strip leading and trailing whitespace.
    sentences_with_snippet = [s.strip() for s in sentences_with_snippet]
    
    # Capitalize each sentence.
    sentences_with_snippet = ["{}{}".format(s[0].upper(),s[1:]) for s in sentences_with_snippet]
    
    # End each sentence with a single period.
    add_end_character = lambda x: "{}.".format(x[:len(x)-1]) if (x[len(x)-1]==".") or (x[len(x)-1]==";") else "{}.".format(x)
    sentences_with_snippet = [add_end_character(s) for s in sentences_with_snippet]
    
    # Remove all quotes from the sentences.
    sentences_with_snippet = [s.replace('"', '') for s in sentences_with_snippet]
    sentences_with_snippet = [s.replace("'", "") for s in sentences_with_snippet]
    
    
    # Notes to save about how to do this.
    # Another regex substitution trick. The repl (new) argument can be a function, which has to always take a single
    # parameter, which is the regex match. You can then capture groups inside that function, do any transformations,
    # and return, which would have been less straight forward or not possible to do in the repl raw string.
    def make_lowercase(match):
        group = match.group(1)
        group = group.lower()
        return(group)
    
    # Fix a problem where SNP names were being capitalized if they were the first token in the sentence.
    sentences_with_snippet = [re.sub(r"(Rs[0-9]+)", make_lowercase, s) for s in sentences_with_snippet]
    
    # Done.
    return((i,sentences_with_snippet))


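# Create a mapping between text IDs and lists of the context sentences that relevant snippets were found in.
# Note that dict() keeps one entry per key, so an ID with several highlighted snippets keeps only the
# contexts from its last row.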
obj = df.apply(lambda row: get_contexts(row.id, row.snippet, id_to_unique_text[row.id]), axis=1)
id_to_context_sentences = dict(obj.values)
id_to_context_sentences

{4769: ['Niemann-Pick Disease Type A rs120074117.'],
 5162: ['Focal segmental glomerulosclerosis 2 rs121434393.'],
 2451: ['rs281864889 see HBA2 and Alpha Thalassemia.'],
 3964: ['Congenital Disorder of Glycosylation Type 1a rs28936415.'],
 5448: ['Evidence of sex-modulated association of ZNF804A with schizophrenia.'],
 3027: ['Association with exfoliation glaucoma; possible genoset partner with rs2165241.'],
 2287: ['Hearing loss?Connexin 26-Related Nonsyndromic Sensorineural Hearing Losssee rs80338939.'],
 1783: ['Linked to schizophrenia with rs7598440, rs839523, and rs707284.'],
 628: ['LDL cholesterol and total cholesterol levels being the quantitative trait associated with in.'],
 2515: ['Hemochromatosis related; better known in dbSNP nomenclature as rs1800730 / discussion at 23andMe.'],
 3684: ['rs1621005 increases susceptibility to Rheumatoid Arthritis 1.89 times for carriers of the G allele.'],
 4613: ['rs2073838 and rs3792876 replicated for rheumatoid arthritis in Japanese, bu
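Because a dictionary built from `(id, sentences)` pairs keeps only the last pair for each ID, contexts found for earlier snippets of the same text are dropped. If keeping every distinct context sentence per ID is preferred, the pairs could be merged instead. The sketch below shows one way, using a hypothetical helper `merge_contexts` and made-up sentences; whether the notebook's author wants this behavior is not stated.

```python
from collections import defaultdict

def merge_contexts(pairs):
    """Collect every distinct context sentence for each text ID, in first-seen order."""
    merged = defaultdict(list)
    for text_id, sentences in pairs:
        for sentence in sentences:
            if sentence not in merged[text_id]:
                merged[text_id].append(sentence)
    return dict(merged)

# Made-up pairs in the shape returned by get_contexts.
pairs = [
    (2287, ["First sentence mentioning snippet A."]),
    (2287, ["Second sentence mentioning snippet B."]),
    (628, ["Sentence mentioning cholesterol."]),
    (628, ["Sentence mentioning cholesterol."]),
]

print(dict(pairs)[2287])            # only the last pair survives
print(merge_contexts(pairs)[2287])  # both sentences are kept
```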

In [14]:
# We need the unprocessed CSV file exported from Qualtrics for some of the following information.
# Specifically, let's use this to get the mapping between response IDs (from Qualtrics) and user IDs (from Prolific).
exported_df = pd.read_csv(QUALTRICS_RAW_CSV_EXPORT_PATH)
# Drop the first two rows of the export, using this dataframe's own index.
exported_df.drop(exported_df.index[[0,1]], inplace=True)
prolific_pid_column = "Q564"
response_id_column = "ResponseId"
response_id_to_prolific_pid = dict(zip(exported_df[response_id_column], exported_df[prolific_pid_column]))
response_id_to_prolific_pid

{'R_eOSosrnczqYqE9P': '5f56f670e492b316bbd2a47c',
 'R_2XjJU6SMJAlU4I2': '5e703f3dd6d4336135b8818d',
 'R_25R9SJyLb4p6iv7': '5f68eb472f60ad099401f587',
 'R_1CpRIVeWQ8AGNqi': '5f206e18fb9d281e91500fcd',
 'R_2wQNxJNyQdjSeq9': '5d3cc9010e510a00013df6f6',
 'R_qPe5vObZ349P3I5': '5ec5925cb3d6035afea34d20',
 'R_2Bs2kBIw93NmaCj': '5e7e6df49867d653e250ade8',
 'R_2wodESRdFsJZ5pq': '5f3e31e2769ffb267b9d1a24',
 'R_PUpvfu7l6QKh6mJ': '5f4da07b144b1ca7f76c51ed',
 'R_3p2QR3KSEVmy0d6': nan,
 'R_1mOvl29nViC00Om': nan,
 'R_3oHIVnFYPhAiXmm': '5e6ce6ff1b327528ec9bf6ec',
 'R_27OU6CzuLPmcd3m': '5e5695b653966101689727e6',
 'R_3RqnqavYAULCLRa': '5ccd9d0c95e4a30016d4652b',
 'R_2YVCzat5Edfkexn': '5c587fb58fba0a000189f4d8',
 'R_W2vxxai4v7q8uUF': '5cfa65c24b639a0019a45c8e',
 'R_un8BFK0QoCkgDIJ': '5c8becf18f5b6400158e1214',
 'R_RQRoTbUD7VIhjAl': '5eab7c4c10a7c01634f3b4ef',
 'R_sp56F93gtkQUdoJ': '5d98bb1424a6e302b056c77e',
 'R_3HMHg1x9JldKGTZ': '5e9b725d85157c0816f1f0dd',
 'R_1NkwPymSEyk4XxD': '5c726ae51384ee0001115f7
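An alternative to dropping rows after loading is to skip them while reading. The sketch below assumes the two rows directly under the header are extra descriptive rows rather than responses, which is what the drop of rows 0 and 1 suggests. The in-memory CSV and every value in it are made up for the example; only the column names `ResponseId` and `Q564` come from the notebook.

```python
import io
import pandas as pd

# A tiny stand-in for the raw export: one header row, two extra rows, two responses.
raw_export = io.StringIO(
    "ResponseId,Q564\n"
    "Response ID,Prolific ID question text\n"
    "ImportId_recordId,ImportId_QID564_TEXT\n"
    "R_example_1,pid_example_1\n"
    "R_example_2,\n"
)

# skiprows takes 0-based file line numbers, so the header (line 0) is kept.
exported = pd.read_csv(raw_export, skiprows=[1, 2])
mapping = dict(zip(exported["ResponseId"], exported["Q564"]))
print(mapping)

# Responses without a Prolific ID can be set aside explicitly if needed.
with_pid = exported.dropna(subset=["Q564"])
print(len(with_pid), "of", len(exported), "responses have a Prolific ID")
```

Reading this way leaves the row index starting at 0 instead of 2, which only matters if later code relies on the original row labels.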

In [15]:
exported_df

Unnamed: 0,StartDate,EndDate,Status,IPAddress,Progress,Duration (in seconds),Finished,RecordedDate,ResponseId,RecipientLastName,...,ExternalReference,LocationLatitude,LocationLongitude,DistributionChannel,UserLanguage,Q564,hilights,PROLIFIC_PID,STUDY_ID,SESSION_ID
2,9/22/20 10:16,9/22/20 10:37,IP Address,174.16.161.244,100,1252,TRUE,9/22/20 10:37,R_eOSosrnczqYqE9P,,...,,39.73010254,-104.9077988,anonymous,EN,5f56f670e492b316bbd2a47c,"[[""4769"",""Disease "",13],[""5162"",""Focal segment...",5f56f670e492b316bbd2a47c,5f64e6570956ab016016af94,5f6a23566fb13d080b7c3f69
3,9/22/20 10:22,9/22/20 10:40,IP Address,96.63.207.134,100,1074,TRUE,9/22/20 10:40,R_2XjJU6SMJAlU4I2,,...,,40.11199951,-88.2365036,anonymous,EN,5e703f3dd6d4336135b8818d,"[[""4622"",""Skin color"",0],[""5199"",""cardiac amyl...",5e703f3dd6d4336135b8818d,5f64e6570956ab016016af94,5f6a24bc329c8d08bec90645
4,9/22/20 10:36,9/22/20 10:42,IP Address,107.77.235.23,100,361,TRUE,9/22/20 10:42,R_25R9SJyLb4p6iv7,,...,,33.74850464,-84.38710022,anonymous,EN,5f68eb472f60ad099401f587,[],5f68eb472f60ad099401f587,5f64e6570956ab016016af94,5f6a2802f5f82209600309eb
5,9/22/20 10:12,9/22/20 10:45,IP Address,185.241.197.34,100,2002,TRUE,9/22/20 10:45,R_1CpRIVeWQ8AGNqi,,...,,54.19020081,18.68330383,anonymous,EN,5f206e18fb9d281e91500fcd,"[[""1115"",""Cystic Fibrosisrs"",0],[""1100"",""Cysti...",5f206e18fb9d281e91500fcd,5f64e6570956ab016016af94,5f6a2229ebc2b407cd210f87
6,9/22/20 10:55,9/22/20 11:19,IP Address,73.103.85.226,100,1438,TRUE,9/22/20 11:20,R_2wQNxJNyQdjSeq9,,...,,40.44439697,-86.92559814,anonymous,EN,5d3cc9010e510a00013df6f6,"[[""1092"",""Cystic Fibrosisrs"",0],[""3416"",""famil...",5d3cc9010e510a00013df6f6,5f64e6570956ab016016af94,5f6a2b5951380a0a0049815a
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81,9/24/20 17:36,9/24/20 18:26,IP Address,177.237.183.200,1,1,FALSE,9/24/20 16:23,R_in_progress_31,,...,,44.31359863,-78.23999786,anonymous,EN,5ed9fe560adb8b5a860d8d80,"[[""1883"",""Marfan syndrome"",0],[""1101"",""Cystic ...",5ed9fe560adb8b5a860d8d80,5f6ced9888653511ed8d2192,5f6d1f8534ac6f1923dd9414
82,9/24/20 17:40,9/24/20 18:33,IP Address,81.111.140.54,1,1,FALSE,9/24/20 16:24,R_in_progress_32,,...,,47.49839783,19.04040527,anonymous,EN,5c3dcf4fc1c7e30001e4e735,"[[""666"",""breast cancer"",15],[""454"",""A deficien...",5c3dcf4fc1c7e30001e4e735,5f6ced9888653511ed8d2192,5f6d200e3a85c5187676c901
83,9/24/20 17:52,9/24/20 18:41,IP Address,193.137.168.32,1,1,FALSE,9/24/20 16:24,R_in_progress_33,,...,,52.11579895,21.271698,anonymous,EN,5ec554706960444f4a1768de,"[[""1089"",""Cystic Fibrosisrs"",0],[""3311"",""allel...",5ec554706960444f4a1768de,5f6ced9888653511ed8d2192,5f6d233ac179911989a6d742
84,9/24/20 18:22,9/24/20 20:03,IP Address,90.94.55.188,1,1,FALSE,9/24/20 16:25,R_in_progress_34,,...,,20.67599487,-103.3358002,anonymous,EN,5f3fa50ad4ad3d025f3bca8e,"[[""3705"",""Phenylketonuria"",0],[""276"",""heredita...",5f3fa50ad4ad3d025f3bca8e,5f6ced9888653511ed8d2192,5f6d24f9850e4918e5f2653f


In [16]:
# Breaking it down by specific users who took the survey.
users_df = df.copy(deep=True)[["response_id","recorded_datetime","status","progress","is_finished","duration"]]
users_df.drop_duplicates(inplace=True)
users_df.reset_index(inplace=True, drop=True)

# Some extra information to record: the duration in minutes, the total number of highlights in each response, and the Prolific ID.
users_df["duration_min"] = users_df["duration"].map(lambda x: math.ceil(x/60))
response_id_to_num_snippets = dict(df.groupby("response_id").size())
users_df["num_snippets"] = users_df["response_id"].map(response_id_to_num_snippets)
users_df["prolific_id"] = users_df["response_id"].map(response_id_to_prolific_pid)
users_df.to_csv(SURVEY_TAKERS_OUTPUT_PATH, index=False)
users_df.tail(50)

Unnamed: 0,response_id,recorded_datetime,status,progress,is_finished,duration,duration_min,num_snippets,prolific_id
30,R_3qrMJOeN0LPyHJz,9/24/20 16:24,IP Address,100,True,4069,68,284,5eb44a4d4e9bd8289b01ccb5
31,R_1IbIXHu7uanaL2Y,9/24/20 16:25,IP Address,100,True,3699,62,162,5dd6e561bd95ad000d495788
32,R_3kqcIc6gI78SxIM,9/24/20 16:26,IP Address,100,True,3522,59,238,5ee3a20ec682de03c0234f67
33,R_22FdhiHG7WuRMCQ,9/24/20 16:29,IP Address,100,True,4120,69,189,5cfba9da04fd2c001941a7e3
34,R_3qfBaSVHtzNpMaS,9/24/20 16:33,IP Address,100,True,4198,70,208,5ee111bcc53e0f4bbf967e40
35,R_2tmciRl13xTlpOK,9/24/20 16:48,IP Address,100,True,5340,89,328,59c1a5c0e3a73b00011a89ca
36,R_2QyrBITmGtNXaJt,9/24/20 16:50,IP Address,100,True,5316,89,330,5d9ad74c0bc1550016bf4daf
37,R_1gBBeWaS2oVWQE8,9/24/20 16:54,IP Address,100,True,5087,85,363,5f08f3e3ff3278174e40069b
38,R_10D9ysIiSEHlmBY,9/24/20 16:54,IP Address,100,True,5189,87,240,5f63b9d3a83a7d0ecdda4eb2
39,R_W28XHcuJqnpaYlb,9/24/20 16:55,IP Address,100,True,5226,88,219,5e2a5541e179570b9a3dcbc8


In [17]:
# Create a mapping from unique text IDs to all of the text snippets that were highlighted in these survey results.
unique_text_id_to_snippet_list = {}
for unique_text_id,row_labels in df.groupby("id").groups.items():
    # The groups hold index labels, so select the rows with .loc rather than .iloc.
    hilighted_texts_list = list(df.loc[row_labels, "snippet"].values)
    unique_text_id_to_snippet_list[unique_text_id] = hilighted_texts_list
print(unique_text_id_to_snippet_list)



In [18]:
# Use that mapping to create a version of the original dataframe with just the hilighted text snippets.
subset_df = original_df.copy(deep=True)[original_df["id"].isin(unique_text_id_to_snippet_list.keys())]
subset_df["n"] = subset_df["id"].map(lambda x: len(unique_text_id_to_snippet_list[x]))
text_snippets = flatten([unique_text_id_to_snippet_list[i] for i in subset_df["id"].values])

# Extend the dataframe to duplicate each row n times where n is the number of text hilight results from the surveys.
modified_df = subset_df.reindex(np.repeat(subset_df.index.values, subset_df["n"]), method="ffill")

# Make sure that the extension occurred as expected based on the number of text snippets, and add them as a new column.
assert len(modified_df) == len(text_snippets)
modified_df["snippet"] = text_snippets
modified_df.head(20)

Unnamed: 0,gene,snp,text,id,n,snippet
1,AANAT,Rs3760138,Genetic differences in human circadian clock g...,2,2,autism
1,AANAT,Rs3760138,Genetic differences in human circadian clock g...,2,2,autism
3,ABCA1,Rs1800977,The -14C->T polymorphism rs1800977 of the ABCA...,4,2,polymorphism
3,ABCA1,Rs1800977,The -14C->T polymorphism rs1800977 of the ABCA...,4,2,therothrombotic cerebral infarction
4,ABCA1,Rs1883025,Apolipoprotein E levels in cerebrospinal fluid...,5,2,risk for carotid artery disease
4,ABCA1,Rs1883025,Apolipoprotein E levels in cerebrospinal fluid...,5,2,Mexican dyslipidemic
5,ABCA1,Rs2020927,"rs2297404, rs2230808, and rs2020927 haplotype ...",6,17,Alzheimer's disease
5,ABCA1,Rs2020927,"rs2297404, rs2230808, and rs2020927 haplotype ...",6,17,atherogenic dyslipidaemia
5,ABCA1,Rs2020927,"rs2297404, rs2230808, and rs2020927 haplotype ...",6,17,Type 2 diabetes
5,ABCA1,Rs2020927,"rs2297404, rs2230808, and rs2020927 haplotype ...",6,17,haplotype
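The same row expansion can be sketched with `DataFrame.explode`, which repeats a row once per element of a list-valued column. This is an alternative shown for comparison, not the notebook's method, and the small dataframe and mapping below are toy values.

```python
import pandas as pd

# Toy versions of original_df and unique_text_id_to_snippet_list.
original = pd.DataFrame({
    "gene": ["AANAT", "ABCA1", "ABCA1"],
    "snp": ["Rs3760138", "Rs1800977", "Rs1883025"],
    "id": [2, 4, 5],
})
snippets_by_id = {2: ["autism", "autism"], 4: ["polymorphism"]}

# Keep only annotated IDs, attach each ID's list of snippets, then expand.
subset = original[original["id"].isin(list(snippets_by_id))].copy()
subset["snippet"] = subset["id"].map(snippets_by_id.get)
expanded = subset.explode("snippet")
print(expanded)
```

As with the `reindex` approach, the expanded rows keep the index label of the row they were copied from.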


In [19]:
# Check all IDs for unique texts and see how many times they were annotated here.
id_to_num_annotations = {}
for i in original_df["id"].values:
    # This ID was annotated at least once; it appears in the results dataframe for this survey.
    if i in snippets_per_id:
        id_to_num_annotations[i] = snippets_per_id[i]
    else:
        id_to_num_annotations[i] = 0

# Save that information to a file.
num_ann_df = pd.DataFrame({"id":list(id_to_num_annotations.keys()),"num_highlights":list(id_to_num_annotations.values())})
num_ann_df.to_csv(NUM_ANNOTATIONS_PATH, index=False)
num_ann_df.head(10)

Unnamed: 0,id,num_highlights
0,1,0
1,2,2
2,3,0
3,4,2
4,5,2
5,6,17
6,7,3
7,8,0
8,9,10
9,10,37
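Once every ID has a count, coverage can be summarized as the share of IDs with at least one highlight. The sketch below uses `Series.reindex` with `fill_value=0` in place of the loop and a hypothetical `coverage` helper; the five IDs and their counts are toy values, so the printed fraction says nothing about the real dataset.

```python
import pandas as pd

def coverage(num_highlights):
    """Return (annotated IDs, total IDs, fraction annotated)."""
    annotated = int((num_highlights > 0).sum())
    total = len(num_highlights)
    return annotated, total, annotated / total

# Counts for IDs that appear in the survey results (toy values).
annotated_counts = pd.Series({2: 2, 4: 2, 5: 2})

# Every ID in the dataset; IDs missing from the results get a count of 0.
all_ids = [1, 2, 3, 4, 5]
counts = annotated_counts.reindex(all_ids, fill_value=0)

annotated, total, fraction = coverage(counts)
print("{} of {} IDs annotated ({:.0%})".format(annotated, total, fraction))
```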


In [20]:
# Save that dataframe as a new CSV file with just the final cleaned text snippets for each gene and SNP.
modified_df = modified_df[["gene","snp","snippet"]]
modified_df.sort_values(by="gene", inplace=True)
modified_df.to_csv(SNPS_AND_SNIPPETS_OUTPUT_PATH, index=False)
modified_df.head(20)

Unnamed: 0,gene,snp,snippet
1,AANAT,Rs3760138,autism
1,AANAT,Rs3760138,autism
9,ABCA1,Rs2230808,Alzheimer's disease
9,ABCA1,Rs2230808,Alzheimer's disease
9,ABCA1,Rs2230808,polymorphisms and Alzheimer's disease.
9,ABCA1,Rs2230808,Alzheimer's disease
9,ABCA1,Rs2230808,schizophrenia and related brain changes.
9,ABCA1,Rs2230808,polymorphism
9,ABCA1,Rs2230808,dementia
9,ABCA1,Rs2230808,polymorphisms.


In [21]:
# How many genes had at least one snippet associated with them in these results?
len(modified_df["gene"].unique())

1076

In [22]:
# Create a version of the dataset that maps genes and SNPs to context sentences.
subset_df = original_df.copy(deep=True)[original_df["id"].isin(id_to_context_sentences.keys())]
subset_df["n"] = subset_df["id"].map(lambda x: len(id_to_context_sentences[x]))
sentences = flatten([id_to_context_sentences[i] for i in subset_df["id"].values])

# Extend the dataframe to duplicate each row n times where n is the number of context sentences found for that ID.
modified_df = subset_df.reindex(np.repeat(subset_df.index.values, subset_df["n"]), method="ffill")

# Make sure that the extension occurred as expected based on the number of context sentences, and add them as a new column.
assert len(modified_df) == len(sentences)
modified_df["context"] = sentences
modified_df.head(20)

Unnamed: 0,gene,snp,text,id,n,context
1,AANAT,Rs3760138,Genetic differences in human circadian clock g...,2,1,Examination of association of genes in the ser...
3,ABCA1,Rs1800977,The -14C->T polymorphism rs1800977 of the ABCA...,4,1,The -14C->T polymorphism rs1800977 of the ABCA...
4,ABCA1,Rs1883025,Apolipoprotein E levels in cerebrospinal fluid...,5,1,Investigation of variants identified in caucas...
5,ABCA1,Rs2020927,"rs2297404, rs2230808, and rs2020927 haplotype ...",6,1,"rs2297404, rs2230808, and rs2020927 haplotype ..."
6,ABCA1,Rs2066714,Apolipoprotein E levels in cerebrospinal fluid...,7,1,Association of genetic variants with chronic k...
8,ABCA1,Rs2230806,"rs2230806, also known as Arg219Lys or R219K, i...",9,1,Increase in HDL-C concentration by a dietary p...
9,ABCA1,Rs2230808,"rs2297404, rs2230808, and rs2020927 haplotype ...",10,1,A polymorphism of the ABCA1 gene confers susce...
10,ABCA1,Rs2297404,"rs2297404, rs2230808, and rs2020927 haplotype ...",11,1,"rs2297404, rs2230808, and rs2020927 haplotype ..."
12,ABCA1,Rs4149268,G allele is associated with 0.82mg/dl increase...,13,1,G allele is associated with 0.82mg/dl increase...
14,ABCA12,Rs28940268,This is a recessive SNP for congenital Lamella...,15,1,This is a recessive SNP for congenital Lamella...


In [23]:
# Save that dataframe as a new CSV file with just the contexts for each gene and SNP.
modified_df = modified_df[["gene","snp","context"]]
modified_df.sort_values(by="gene", inplace=True)
modified_df.to_csv(SNPS_AND_CONTEXTS_OUTPUT_PATH, index=False)
modified_df.head(20)

Unnamed: 0,gene,snp,context
1,AANAT,Rs3760138,Examination of association of genes in the ser...
10,ABCA1,Rs2297404,"rs2297404, rs2230808, and rs2020927 haplotype ..."
9,ABCA1,Rs2230808,A polymorphism of the ABCA1 gene confers susce...
8,ABCA1,Rs2230806,Increase in HDL-C concentration by a dietary p...
12,ABCA1,Rs4149268,G allele is associated with 0.82mg/dl increase...
5,ABCA1,Rs2020927,"rs2297404, rs2230808, and rs2020927 haplotype ..."
4,ABCA1,Rs1883025,Investigation of variants identified in caucas...
3,ABCA1,Rs1800977,The -14C->T polymorphism rs1800977 of the ABCA...
6,ABCA1,Rs2066714,Association of genetic variants with chronic k...
14,ABCA12,Rs28940268,This is a recessive SNP for congenital Lamella...
