## Scraping Patient Info Data for Medical Research

This project involves extracting data from the Patient Info website, a platform offering medical advice, articles, and user discussions on various health conditions. The aim is to gather and process this data for applications such as medical research, patient sentiment analysis, and healthcare trend monitoring.

#### Objectives
- **Data Collection**: Scrape patient discussions, medical articles, and FAQs from specific categories on Patient Info (e.g., chronic illnesses, lifestyle, and medications).
- **Data Processing**: Preprocess the gathered data, including cleaning text and standardizing formats for analysis.
- **Data Storage**: Save the extracted data in a structured format like CSV, JSON, or a database for future use.
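
As a rough illustration of the processing and storage objectives, the sketch below cleans a small hand-made sample shaped like the post records collected later in this notebook (`postTitle`, `postLink`, `commentsCount`). The sample rows, the made-up link paths, and the `clean_posts` helper are assumptions for illustration only, not data or code from Patient Info.

```python
import pandas as pd

def clean_posts(raw_posts):
    """Return a cleaned DataFrame from a list of post dictionaries (hypothetical helper)."""
    frame = pd.DataFrame(raw_posts)
    # Collapse runs of whitespace (including newlines) and trim each title
    frame["postTitle"] = (
        frame["postTitle"].str.replace(r"\s+", " ", regex=True).str.strip()
    )
    # Take the first run of digits as the comment count; treat missing counts as 0
    frame["commentsCount"] = (
        frame["commentsCount"]
        .astype(str)
        .str.extract(r"(\d+)", expand=False)
        .fillna("0")
        .astype(int)
    )
    # Drop posts that were collected more than once, keeping the first copy
    return frame.drop_duplicates(subset="postLink").reset_index(drop=True)

# Made-up sample rows, not real forum content
sample_posts = [
    {"postTitle": "  Back pain\n after running ", "postLink": "/forums/discuss/a-1", "commentsCount": "12 "},
    {"postTitle": "Back pain after running", "postLink": "/forums/discuss/a-1", "commentsCount": "12"},
    {"postTitle": "Sleep   problems", "postLink": "/forums/discuss/b-2", "commentsCount": ""},
]

cleaned = clean_posts(sample_posts)
# Without a path, to_csv returns the CSV text; pass a file path to write to disk instead
csv_text = cleaned.to_csv(index=False)
print(csv_text)
```

Passing `index=False` keeps the row index out of the file, so reloading it with `pd.read_csv` gives back the same columns.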

#### Tools and Technologies
- **Python**: The main programming language for implementing the web scraping workflow.
- **Beautiful Soup**: A library for parsing HTML and XML documents to extract relevant information.
- **Selenium**: For handling dynamic web pages and automating the browser. It is installed here, although the cells in this notebook fetch pages with `urllib.request`.
- **Pandas**: For organizing, analyzing, and exporting the collected data into a structured format.

#### Getting Started
1. Set up the environment by installing the necessary Python libraries.
2. Identify the target URLs based on categories of interest, such as "Diabetes" or "Mental Health."
3. Implement the scraping logic, including functions to retrieve article titles, discussion threads, and summaries while handling pagination and errors.
4. Run the scraper and monitor it so that issues such as CAPTCHAs or IP blocking can be addressed as they appear.
5. Process and analyze the collected data, cleaning and organizing it using Pandas for further exploration.
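
Steps 3 and 4 mention handling errors while the scraper runs. One possible approach, sketched below, is to retry a failed request after a growing pause. The `fetch_with_retry` helper and its `retries`, `delay`, `opener`, and `sleep` parameters are hypothetical names introduced here; the notebook's own functions call `urlopen` directly without retrying.

```python
import time
from urllib.error import URLError
from urllib.request import Request, urlopen

def fetch_with_retry(url, retries=3, delay=2.0, opener=urlopen, sleep=time.sleep):
    """Fetch url, retrying on network errors with a growing pause (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    for attempt in range(1, retries + 1):
        try:
            with opener(request) as response:
                return response.read()
        except URLError as error:  # HTTPError is a subclass of URLError
            if attempt == retries:
                # Out of attempts: let the caller see the original error
                raise
            wait = delay * attempt
            print(f"Attempt {attempt} failed ({error}); retrying in {wait} seconds")
            sleep(wait)
```

The `opener` and `sleep` parameters exist only so the helper can be exercised without network access or real waiting; in normal use the defaults apply and the call is simply `fetch_with_retry(url)`.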

#### Ethical Considerations
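- Check the website's terms of use and `robots.txt` before scraping, and keep the request rate low enough not to burden the server.
- Use the data responsibly: forum posts are written by real patients, so protect user privacy (for example, by anonymizing author names) and store the data securely.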
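
One practical way to respect a site's crawling rules is to consult its `robots.txt` before requesting a page. The sketch below uses the standard library's `urllib.robotparser` on a made-up set of rules; the rules, the `research-bot` agent name, and the `example.org` URLs are assumptions for illustration and do not describe Patient Info's actual policy.

```python
from urllib.robotparser import RobotFileParser

def build_robots_checker(robots_lines):
    """Build a parser from the lines of a robots.txt file (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser

# Made-up rules for illustration; a real run would load the site's own file
# with parser.set_url(...) followed by parser.read()
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

checker = build_robots_checker(example_rules)
allowed = checker.can_fetch("research-bot", "https://example.org/forums/index-a")
blocked = checker.can_fetch("research-bot", "https://example.org/private/profile")
delay = checker.crawl_delay("research-bot")
print(allowed, blocked, delay)
```

A scraper could skip any URL for which `can_fetch` returns `False` and sleep for the reported crawl delay between requests.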

#### Conclusion
This project provides a practical application of web scraping for healthcare research. By leveraging Patient Info's resources, researchers can gain valuable insights into patient experiences, emerging trends, and medical discussions.


In [None]:
!pip install beautifulsoup4
!pip install pandas
!pip install selenium

In [1]:
import re
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.request import urlopen, Request

### Scraping Health-Related Topics


In [4]:
# Patient Info communities are grouped alphabetically: each index-<letter> page lists all topics starting with that letter.
# For example, https://patient.info/forums/index-b contains every topic starting with "b", such as "baby and infants", "baclofen", and "backache".
featuresIndexes=[f"index-{chr(i)}" for i in range(97,123)]
topics=[]

In [None]:
def scrapIndexedGroups(url):
    # Collect every topic listed on one alphabetical index page into the global `topics` list
    print(f"Currently collecting topics starting with {url.split('index-')[1]}")
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
    request = Request(url, headers=headers)

    # Fetch page content
    with urlopen(request) as response:
        page_source = response.read()
    soup = BeautifulSoup(page_source, 'html.parser')
    sourceElement=soup.find("table",class_="table")
    if sourceElement is None:
        # No topic table on this index page, so there is nothing to collect
        return
    tdElements=sourceElement.find_all("td")
    for element in tdElements:
        link=element.find("a")
        if link is None:
            # Skip table cells that do not contain a topic link
            continue
        topic={}
        topic["topicLink"]="https://patient.info"+link.get("href")
        topic["topicName"]=link.get_text()
        topics.append(topic)

for feature in featuresIndexes:
    scrapIndexedGroups(f"https://patient.info/forums/{feature}")

In [6]:
import pandas as pd 
topics=pd.DataFrame(topics)

In [7]:
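# index=False keeps the row index out of the file, so reloading does not add an "Unnamed: 0" column
topics.to_csv("../data/patientInfoTopics.csv",index=False)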

### Scraping Health-Related Posts for Each Topic


In [4]:
import re
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.request import urlopen, Request

In [1]:
import pandas as pd
topics=pd.read_csv("../data/patientInfoTopics.csv")
posts=[]

In [2]:
def periodicSave(data,index):
    # Write every post collected so far to disk so an interrupted run does not lose them
    print(f"Periodic save: {index}")
    data=pd.DataFrame(data)
    data.to_csv("../data/patientInfosPosts2.csv",index=False)

def scrapTopicUrl(url,topic):
    # Scrape one page of a topic into the global `posts` list, then follow the pagination link
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
    request = Request(url, headers=headers)
    print(f"Navigating to {url}")
    # Fetch page content
    with urlopen(request) as response:
        page_source = response.read()
    soup = BeautifulSoup(page_source, 'html.parser')
    elements=soup.find_all("article",class_="post thread")
    for element in elements:
        try:
            postDetails=element.find("h3","post__title")
            postAuthor=element.find("div","post__actions")
            post={}
            post["postTitle"]=postDetails.find("a").get_text()
            post["postLink"]=postDetails.find("a").get("href")
            post["postType"]=postDetails.find("a").get("rel")[0]
            post["postTopic"]=topic
            post["postAuthor"]=(postAuthor.find("a").get_text().replace(" ","").replace("\n",""))
            # Note: this records the time of scraping, not the date the post was written
            post["createdAt"]=datetime.now()
            post["commentsCount"]=element.find("div",class_="actions").find("span").get_text()
            posts.append(post)
            if(len(posts)%500==0):
                periodicSave(posts,(len(posts))//500)
        except Exception:
            # Skip posts whose markup is missing one of the expected elements
            continue
    try:
        nextLink=(soup.find("a",class_="reply__control reply-ctrl-last link").get("href"))

        if(nextLink):
            print("Scraped the current page, moving to the next one")
            scrapTopicUrl(nextLink,topic)
    except Exception:
        # No pagination link was found, or a following page failed to load
        print("Finished the current topic, moving to the next topic")
        return


In [None]:
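# The start index 870 appears to resume an earlier, interrupted run; use range(len(topics)) to scrape every topic
for index in range(870,len(topics)):
    topicName=topics.iloc[index]["topicName"]
    topicLink=topics.iloc[index]["topicLink"]
    scrapTopicUrl(topicLink,topicName)
# Final save so posts collected after the last 500-post checkpoint are not lost
periodicSave(posts,len(posts))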
  