## Initial Web Scraping Investigation

The real challenge is finding COVID-positive people and then following those users to see whether they report cardiovascular symptoms. Existing studies have used the term "COVID long-haulers", and the social media pages associated with it, to identify people who have had COVID, build a timeline for each of them, and tie symptoms to that timeline. This project is more challenging because we are interested not just in COVID-positive patients: we are narrowing the pool to those who have reported specific cardiovascular events after infection.
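
The two-stage narrowing described above can be sketched as a keyword filter. This is only a rough illustration: the post format (`user`, `date`, `text`), the keyword lists, and the function names are all hypothetical and not taken from any scraper output.

```python
import re

# Hypothetical patterns; a real study would use a vetted symptom list.
COVID_PATTERNS = [r"\btested positive\b", r"\blong[- ]?hauler", r"\bhad covid\b"]
CARDIO_TERMS = ["blood clot", "stroke", "rapid heartbeat", "embolism"]


def mentions_any(text, patterns):
    """True if any regex pattern matches the text (case-insensitive)."""
    return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)


def cardio_reports_after_covid(posts):
    """posts: list of dicts with 'user', 'date' (ISO string) and 'text'."""
    # Stage 1: earliest self-reported COVID post per user
    first_covid = {}
    for post in sorted(posts, key=lambda p: p["date"]):
        if mentions_any(post["text"], COVID_PATTERNS):
            first_covid.setdefault(post["user"], post["date"])
    # Stage 2: cardiovascular mentions by those users on or after that date
    cardio_patterns = [re.escape(t) for t in CARDIO_TERMS]
    return [
        p for p in posts
        if p["user"] in first_covid
        and p["date"] >= first_covid[p["user"]]
        and mentions_any(p["text"], cardio_patterns)
    ]
```

ISO date strings compare correctly as plain strings, which keeps the timeline check simple. Keyword matching of this kind will miss paraphrases and pick up false positives, so it is at best a first pass.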

In [51]:
from proxycrawl.proxycrawl_api import ProxyCrawlAPI
import pandas as pd

In [52]:
js_token = "Gzb-YK7SoijLHCDpqKMc0g"
normal_token = "H-EOaYFwXN76Wg4C6VvLrA"

In [53]:
# Crawling helper: fetch a URL through ProxyCrawl with a named scraper
def crawl(token, url, scraper, scroll=False, scroll_interval=0):
    api = ProxyCrawlAPI({'token': token})
    options = {'scraper': scraper}
    if scroll:
        # ask the API to scroll the page before scraping
        options.update({'scroll': 'true', 'scroll_interval': scroll_interval})
    return api.get(url, options)

## Facebook
The hard part with Facebook is getting past permissions. The best approach, in my opinion, is to look at specific groups of people, ideally COVID support groups or long-hauler groups. Facebook seemed like a good option, but I can't get any usable data from the Facebook group API.

In [21]:
#while response['status_code'] != 200:
api = ProxyCrawlAPI({'token': js_token})
response = api.get('https://www.facebook.com/Amazon/', {
    'scraper':'facebook-page', 'scroll':'true','scroll_interval':15
})
print(response['status_code'])

520


In [95]:
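# NOTE: 'facbook-group' is likely a typo for 'facebook-group'; the status below was recorded with the misspelled name
response = crawl(js_token,'https://www.facebook.com/groups/373920943948661','facbook-group', True, 58)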
print(response['status_code'])

400


In [90]:
response['body']

''

In [82]:
with open('facebook-generic.json','wb+') as f:
    f.write(response['body'])

## Instagram-Hashtag
Targeting specific groups or hashtags seems to be the best way of getting positive-diagnosis information. The #longhaulers hashtag is the starting point.

In [20]:
api = ProxyCrawlAPI({'token': js_token})
response = api.get('https://www.instagram.com/explore/tags/longhaulers/', {
    'scraper':'instagram-hashtag','scroll':'true','scroll_interval': 10
})
print(response['status_code'])

520


In [15]:
with open('instagram-hashtag-longhaulers3-js.json', 'wb+') as f:
    f.write(response['body'])

## Quora-serp
The initial idea here is to find users asking questions about symptoms after COVID. This could lend itself well to finding individuals who have contracted COVID, and terms from the symptom list can be included in the search queries.

Down the line, there is also a scraper (`quora-question`) that can dig into the responses to these questions. It could be a good idea to create recursive searches and data subsets driven by COVID-positive individuals asking questions of interest (i.e. questions pertaining to cardiovascular issues with symptoms from our symptom list).
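
As a minimal sketch of putting symptom terms into the search queries, the helper below builds one Quora search URL per term. The function name `build_search_urls` and the `suffix` parameter are made up for illustration; only the URL pattern comes from the searches in this notebook. It uses `urllib.parse.quote` so that any character needing escaping is handled, not just spaces.

```python
from urllib.parse import quote

BASE_URL = "https://www.quora.com/search?q={}"


def build_search_urls(symptoms, suffix="after covid"):
    """Return {symptom: percent-encoded Quora search URL}."""
    return {
        s: BASE_URL.format(quote("{} {}".format(s, suffix)))
        for s in symptoms
    }


urls = build_search_urls(["blood clot", "stroke"])
for term, url in urls.items():
    print(term, url)
```

Following up on individual questions with `quora-question` would need the question URLs from the saved search results; the layout of that JSON is not assumed here.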

In [43]:
response = crawl(js_token,'https://www.quora.com/search?q=heart%20problems%20after%20covid','quora-serp',True,60)
#api = ProxyCrawlAPI({'token': js_token})
#response = api.get('https://www.quora.com/search?q=heart%20problems%20after%20covid', {
#    'scraper':'quora-serp','scroll':'true','scroll_interval':60
#})
print(response['status_code'])

200


In [38]:
print(response['body'])




In [50]:
# Symptom and treatment terms to pair with "after covid" in each search
search_terms = ["blood clot","heart","cardiovascular","stroke","deep vein thrombosis","embolism","breathing","heparin","warfarin","rapid heartbeat","lightheaded","sweating","fever","leg pain","leg swelling", "leg swollen","clammy skin","discolor skin","cyanosis"]
base_url = "https://www.quora.com/search?q={}%20after%20covid"
# the first three terms are skipped, so this run starts at "stroke"
for term in search_terms[3:]:
    print("Searching {}".format(term))
    term_no_space = term.replace(" ","%20")  # percent-encode spaces for the URL
    url = base_url.format(term_no_space)
    try:
        response = crawl(js_token,url,'quora-serp',True,60)
    except Exception as e:
        # e.g. read timeouts: log the error and move on to the next term
        print(e)
        continue
    if response['status_code'] == 200:
        # save the raw response body, one file per term
        with open('quora-serp-{}.json'.format(term.replace(" ","_")), 'wb+') as f:
            f.write(response['body'])
    else:
        print("{} failed with status {}".format(term,response['status_code']))

Searching stroke
stroke failed with status 520
Searching deep vein thrombosis
The read operation timed out
Searching embolism
Searching breathing
Searching heparin
Searching warfarin
Searching rapid heartbeat
Searching lightheaded
Searching sweating
Searching fever
Searching leg pain
The read operation timed out
Searching leg swelling
Searching leg swollen
leg swollen failed with status 520
Searching clammy skin
Searching discolor skin
Searching cyanosis
cyanosis failed with status 520
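
Several terms above failed with a 520 status or a read timeout. One possible way to handle that is to retry with exponential backoff, sketched below. `crawl_with_retry` is a hypothetical helper, not part of the ProxyCrawl library, and it assumes these failures are transient, which has not been verified here. It takes any callable that returns a dict with a `status_code` key, so it can wrap the `crawl` function defined earlier.

```python
import time


def crawl_with_retry(fetch, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or attempts run out.

    fetch is any callable returning a dict with a 'status_code' key, e.g.
    lambda u: crawl(js_token, u, 'quora-serp', True, 60).
    Returns the last response received, or None if every attempt raised.
    """
    response = None
    for attempt in range(attempts):
        try:
            response = fetch(url)
            if response['status_code'] == 200:
                return response
        except Exception as e:  # e.g. read timeouts
            print("attempt {} raised: {}".format(attempt + 1, e))
        if attempt < attempts - 1:
            # exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return response
```

The `sleep` parameter exists only so the delay can be swapped out when trying the helper without waiting. Each retry is another API request, so the attempt count should stay small.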


In [None]:
url = "https://www.quora.com/search?q={}%20after%20covid".format("blood%20clot")
response = crawl(js_token,url,'quora-serp',True,20)