## Scraping Patient Info Data for Medical Research

This project involves extracting data from the Patient Info website, a platform offering medical advice, articles, and user discussions on various health conditions. The aim is to gather and process this data for applications such as medical research, patient sentiment analysis, and healthcare trend monitoring.

#### Objectives
- **Data Collection**: Scrape patient discussions, medical articles, and FAQs from specific categories on Patient Info (e.g., chronic illnesses, lifestyle, and medications).
- **Data Processing**: Preprocess the gathered data, including cleaning text and standardizing formats for analysis.
- **Data Storage**: Save the extracted data in a structured format like CSV, JSON, or a database for future use.

#### Tools and Technologies
- **Python**: The main programming language for implementing the web scraping workflow.
- **Beautiful Soup**: A library for parsing HTML and XML documents to extract relevant information.
- **Selenium**: For handling dynamic web pages and automating the browser.
- **Pandas**: For organizing, analyzing, and exporting the collected data into a structured format.

#### Getting Started
1. Set up the environment by installing the necessary Python libraries.
2. Identify the target URLs based on categories of interest, such as "Diabetes" or "Mental Health."
3. Implement the scraping logic, including functions to retrieve article titles, discussion threads, and summaries while handling pagination and errors.
4. Run the scraper to collect the data and ensure the process is monitored to address issues like CAPTCHA or IP blocking.
5. Process and analyze the collected data, cleaning and organizing it using Pandas for further exploration.

#### Ethical Considerations
- Ensure compliance with the website's terms of use and avoid violating ethical guidelines.
- Use the data responsibly, ensuring user privacy and data security.

#### Conclusion
This project provides a practical application of web scraping for healthcare research. By leveraging Patient Info's resources, researchers can gain valuable insights into patient experiences, emerging trends, and medical discussions.


In [None]:
!pip install bs4
!pip install selenium




[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


### Scraping Health-Related Drugs


In [22]:
import re
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.request import urlopen, Request
import pandas as pd 
import warnings
warnings.filterwarnings("ignore")

In [7]:
drugsList=pd.read_csv("../data/patientsInfos/drugs.csv")

In [None]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
drugs=[]
def collectDrugDetails(drug):
    try :
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
        request = Request(drug["drugLink"], headers=headers)
        
        try:
            # Open the URL and read the page content
            with urlopen(request) as response:
                page_source = response.read()
            
            soup = BeautifulSoup(page_source, 'html.parser')
            
            
            # Find the drug description div using the correct class
            description_div = soup.find("div", class_=lambda x: x and x.startswith('Markup_markup_'))
            
            if description_div:
                # Find all <p> elements within the div
                paragraphs = description_div.find_all("p")
                
                # Merge the text from each <p> element
                merged_description = " ".join([para.get_text(strip=True) for para in paragraphs])
                
                drug["drugDescription"] = merged_description
            else:
                drug["drugDescription"] = "Description not found"
            
            return drug
        
        except Exception as e:
            print(f"An error occurred while collecting drug details: {e}")
    except:
        return 


In [None]:
drugsAnnotated=[]
for index in range(len(drugsList)):
    drug=drugsList.iloc[index]
    drugAnnoated=collectDrugDetails(drug)
    drugsAnnotated.append(drugAnnoated)
    

In [None]:
drugsAnnotated=pd.DataFrame(drugsAnnotated)

In [15]:
drugsAnnotated.to_csv("../data/patientsInfos/drugs.csv")

### Collection of drugs Details 

In [32]:
import pandas as pd 
drugs=pd.read_csv("../data/patientsInfos/drugs.csv")

In [None]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def collecDrugDetails(drugUrl, drug):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
    request = Request(drug["drugLink"], headers=headers)
    with urlopen(request) as response:
        page_source = response.read() 
    soup = BeautifulSoup(page_source, 'html.parser')
    
    # Locate the table
    table_element = soup.find("table")
    if not table_element:
        print(f"No table found for {drug['drugLink']}")
        return
    
    # Initialize the data structure
    drug_details = {
        "Type of medicine": None,
        "Used for": None,
        "Also called": None,
        "Available as": None
    }
    
    # Iterate over table rows
    for row in table_element.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 2:  # Ensure row has exactly two cells (label and value)
            label = cells[0].get_text(strip=True)
            value = cells[1].get_text(strip=True)
            if label in drug_details:
                drug_details[label] = value
    
    # Map the collected data to the drug dictionary
    drug["drugType"] = drug_details["Type of medicine"]
    drug["usedFor"] = drug_details["Used for"]
    drug["alsoCalled"] = drug_details["Also called"].split(";") if drug_details["Also called"] else []
    drug["drugFormat"] = drug_details["Available as"].split(",") if drug_details["Available as"] else []
    
    return (drug)


In [38]:
refinedDrugs=[]
for  drugIndex in range(len(drugs)):
    drug=drugs.iloc[drugIndex]
    refinedDrugs.append(collecDrugDetails(drug["drugLink"],drug))


Unnamed: 0.1                                                       0
Unnamed: 0                                                         0
drugLink           https://patient.info/medicine/calcium-and-ergo...
drugName                                                    A1 Cal-E
drugDescription    Calcium and ergocalciferol tablets are a miner...
drugType                              Mineral and vitamin supplement
usedFor            To promote healthy bones and to prevent osteop...
alsoCalled         [Calcium and vitamin D,  (Ergocalciferol is al...
drugFormat                                                 [Tablets]
Name: 0, dtype: object
Unnamed: 0.1                                                       1
Unnamed: 0                                                         1
drugLink           https://patient.info/medicine/calcium-suppleme...
drugName                                                      A1-Cal
drugDescription    Calcium supplements are available as different...
drugType   