## Scraping Patient Info Data for Medical Research

This project involves extracting data from the Patient Info website, a platform offering medical advice, articles, and user discussions on various health conditions. The aim is to gather and process this data for applications such as medical research, patient sentiment analysis, and healthcare trend monitoring.

#### Objectives
- **Data Collection**: Scrape patient discussions, medical articles, and FAQs from specific categories on Patient Info (e.g., chronic illnesses, lifestyle, and medications).
- **Data Processing**: Preprocess the gathered data, including cleaning text and standardizing formats for analysis.
- **Data Storage**: Save the extracted data in a structured format like CSV, JSON, or a database for future use.
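
The processing and storage objectives can be illustrated with a small sketch. It uses only the standard library so that it runs without extra dependencies; the record fields (`title`, `body`) and the `clean_text` helper are hypothetical and stand in for whatever the scraper actually returns.

```python
import csv
import io
import json
import re


def clean_text(text: str) -> str:
    """Collapse runs of whitespace (including newlines and tabs) and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical scraped records; real field names depend on the pages being scraped.
raw_records = [
    {"title": "  Managing  Diabetes ", "body": "Diet\nand   exercise\thelp."},
    {"title": "Sleep hygiene", "body": " Keep a regular   schedule. "},
]

cleaned = [
    {key: clean_text(value) for key, value in record.items()}
    for record in raw_records
]

# CSV output; swap the StringIO buffer for open(path, "w", newline="") to write a file.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["title", "body"])
writer.writeheader()
writer.writerows(cleaned)

# JSON output of the same records.
json_text = json.dumps(cleaned, indent=2)
```

The same cleaning function could be applied to a Pandas column with `Series.map` once the data is loaded into a DataFrame.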

#### Tools and Technologies
- **Python**: The main programming language for implementing the web scraping workflow.
- **Beautiful Soup**: A library for parsing HTML and XML documents to extract relevant information.
- **Selenium**: For handling dynamic web pages and automating the browser.
- **Pandas**: For organizing, analyzing, and exporting the collected data into a structured format.

#### Getting Started
1. Set up the environment by installing the necessary Python libraries.
2. Identify the target URLs based on categories of interest, such as "Diabetes" or "Mental Health."
3. Implement the scraping logic, including functions to retrieve article titles, discussion threads, and summaries while handling pagination and errors.
4. Run the scraper and monitor it so that issues such as CAPTCHA challenges or IP blocking are caught early.
5. Process and analyze the collected data, cleaning and organizing it using Pandas for further exploration.
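
Steps 3 and 4 mention pagination and error handling. The sketch below shows one possible shape for that logic, not the project's actual implementation. The `?page=N` query parameter, the `USER_AGENT` string, and the function names are assumptions; the real site's pagination scheme would need to be checked first.

```python
import time
from typing import Callable, Iterator, Optional
from urllib.request import Request, urlopen

USER_AGENT = "Mozilla/5.0 (research scraper sketch)"  # hypothetical value


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download one page and return its body as text."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_retry(
    url: str,
    fetch: Callable[[str], str] = fetch_html,
    retries: int = 3,
    delay: float = 1.0,
) -> Optional[str]:
    """Try `fetch(url)` up to `retries` times; return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except OSError as error:
            # URLError, HTTPError and socket timeouts are all OSError subclasses.
            print(f"Attempt {attempt}/{retries} failed for {url}: {error}")
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer each time
    return None


def iter_pages(
    base_url: str,
    fetch: Callable[[str], str] = fetch_html,
    max_pages: int = 5,
    delay: float = 1.0,
) -> Iterator[str]:
    """Yield the HTML of page 1, 2, ... until a page fails or max_pages is reached."""
    for page in range(1, max_pages + 1):
        html = fetch_with_retry(f"{base_url}?page={page}", fetch=fetch, delay=delay)
        if html is None:
            break
        yield html
        time.sleep(delay)  # pause between pages to limit load on the server
```

Passing the download function in as `fetch` keeps the retry and pagination logic testable without network access.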

#### Ethical Considerations
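- Comply with the website's terms of use and its `robots.txt` rules, and throttle requests so the scraper does not overload the server.
- Use the data responsibly: protect user privacy (for example, by removing usernames and other identifying details from discussion posts) and store the collected data securely.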
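
One concrete compliance check is to consult the site's `robots.txt` before fetching a page. The sketch below uses the standard library's `urllib.robotparser`. The rules in `EXAMPLE_ROBOTS`, the example URLs, and the `research-scraper` agent name are made up for illustration and are not the real site's policy.

```python
from urllib.robotparser import RobotFileParser


def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Create a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /account/
Crawl-delay: 5
"""

parser = build_robot_parser(EXAMPLE_ROBOTS)
agent = "research-scraper"  # hypothetical user agent name

allowed_article = parser.can_fetch(agent, "https://example.org/medicine/aspirin")
allowed_account = parser.can_fetch(agent, "https://example.org/account/settings")
crawl_delay = parser.crawl_delay(agent)

print(allowed_article, allowed_account, crawl_delay)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would load the real file instead of the inline example.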

#### Conclusion
This project provides a practical application of web scraping for healthcare research. By leveraging Patient Info's resources, researchers can gain valuable insights into patient experiences, emerging trends, and medical discussions.


In [None]:
!pip install beautifulsoup4 selenium pandas






### Scraping Health-Related Drugs


In [6]:
import re
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.request import urlopen, Request
import pandas as pd 

In [7]:
drugsList=pd.read_csv("../data/patientsInfos/drugs.csv")

In [8]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with every request
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}

def collectDrugDetails(drug):
    """Return a copy of `drug` with a "drugDescription" field scraped from drug["drugLink"]."""
    # Work on a copy so the assignments below do not trigger pandas' SettingWithCopyWarning
    drug = drug.copy()
    try:
        request = Request(drug["drugLink"], headers=HEADERS)
        # Open the URL and read the page content
        with urlopen(request) as response:
            page_source = response.read()
    except Exception as e:
        # Keep the row, with no description, when the request fails
        print(f"An error occurred while collecting drug details: {e}")
        drug["drugDescription"] = None
        return drug

    soup = BeautifulSoup(page_source, 'html.parser')

    # Find the description div; its class name starts with "Markup_markup_"
    description_div = soup.find("div", class_=lambda x: x and x.startswith('Markup_markup_'))

    if description_div:
        # Merge the text of every <p> element inside the div
        paragraphs = description_div.find_all("p")
        drug["drugDescription"] = " ".join(para.get_text(strip=True) for para in paragraphs)
    else:
        drug["drugDescription"] = "Description not found"

    return drug
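
The parsing step in the cell above can be exercised without network access by separating it from the download. The following is a minimal sketch: `extract_description` is a hypothetical helper, and the HTML sample is made up (the class suffix `abc123` is invented; only the `Markup_markup_` prefix comes from the notebook).

```python
from bs4 import BeautifulSoup


def extract_description(page_source, class_prefix="Markup_markup_"):
    """Join the text of all <p> tags inside the first div whose class starts with class_prefix.

    Returns None when no such div exists.
    """
    soup = BeautifulSoup(page_source, "html.parser")
    container = soup.find("div", class_=lambda c: c and c.startswith(class_prefix))
    if container is None:
        return None
    return " ".join(p.get_text(strip=True) for p in container.find_all("p"))


# Made-up page fragment for illustration.
SAMPLE_HTML = """
<html><body>
  <div class="Markup_markup_abc123">
    <p> Used to relieve pain. </p>
    <p>Take with food.</p>
  </div>
</body></html>
"""

description = extract_description(SAMPLE_HTML)
print(description)
```

Matching on a class prefix rather than the full class name is a guess at why the notebook uses a lambda: generated class names often carry a suffix that changes between site builds.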


In [12]:
drugsAnnotated = []
for _, drug in drugsList.iterrows():
    # Pass a copy of each row so the original DataFrame is left untouched
    drugsAnnotated.append(collectDrugDetails(drug.copy()))


In [None]:
drugsAnnotated=pd.DataFrame(drugsAnnotated)

In [15]:
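# index=False keeps the row index out of the file, so re-reading it does not add an "Unnamed: 0" column
drugsAnnotated.to_csv("../data/patientsInfos/drugs.csv", index=False)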
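
Because the notebook writes `drugs.csv` and later reads it back, the handling of the row index matters. The sketch below uses an in-memory buffer and a one-row made-up DataFrame to show how the columns differ after a round trip with and without `index=False`. The helper name `round_trip_columns` is hypothetical.

```python
import io

import pandas as pd

# Made-up row using the two column names that appear in the notebook.
frame = pd.DataFrame(
    {
        "drugLink": ["https://example.org/aspirin"],
        "drugDescription": ["Pain relief."],
    }
)


def round_trip_columns(frame, **to_csv_kwargs):
    """Write the frame to CSV in memory, read it back, and return the column names."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)


with_index = round_trip_columns(frame)  # the index is written as an unnamed first column
without_index = round_trip_columns(frame, index=False)

print(with_index)
print(without_index)
```

Each additional write and read cycle without `index=False` would add another unnamed column to the file.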

### Collecting Drug Details

In [2]:
import pandas as pd 
data=pd.read_csv("../data/patientsInfos/drugs.csv")

In [None]:
def collecDrugDetails(drugUrl):
    