# Scraping Health-Related Forums and Articles

This project focuses on extracting posts and articles from health-related forums using web scraping techniques. These forums host a wide range of user-generated content on diverse health topics, making them valuable for applications like sentiment analysis, trend monitoring, and topic modeling.

## Objectives
- **Data Collection**: Retrieve posts and articles from specific health-related categories based on topic criteria (e.g., mental health, fitness).
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.
- **Data Storage**: Store the scraped data in a structured format, such as CSV or a database, for further analysis.

## Tools and Technologies
- **Python**: The primary programming language for web scraping.
- **Beautiful Soup**: A library for parsing HTML and extracting data.
- **Requests**: A library for making HTTP requests to access web pages.
- **Pandas**: A data manipulation library to handle and analyze the scraped data.

## Getting Started
1. **Set Up the Environment**: Install the necessary libraries using `pip`.
2. **Define Scraping Logic**: Write functions to scrape data from specific health categories on the forum.
3. **Run the Scraper**: Execute the scraping script and monitor the data collection process.
4. **Analyze the Data**: Use Pandas to analyze the collected posts and articles for insights.

## Conclusion
This project provides hands-on experience in web scraping and data analysis using Python, leveraging real-world data from online health communities for deeper insights.


<p style="color:#FE4406;text-align:center;font-size:30px"> Scraping PLOS Articles</p>

In [4]:
!pip install bs4
!pip install selenium




[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [5]:
# importing packages
import requests
from bs4 import BeautifulSoup

### Scraping Health Boards

In [1]:
## importing libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup

<p style="color:#FFC107;text-align:left;font-size:20px"> Searching for PLOS's health related Articles   </p>

In [2]:
articles=[]

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup

# Set up the Selenium WebDriver
driver = webdriver.Chrome()  # Ensure the correct WebDriver is installed and in PATH

# URL to scrape
url = 'https://journals.plos.org/plosone/browse/medicine_and_health_sciences?resultView=list&page=1520'

def scrape_current_page(url):
    # Open the URL and wait for it to load
    driver.get(url)
    time.sleep(5)  # Adjust the sleep time as needed

    # Get the page source and parse with BeautifulSoup
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')

    # Find all articles on the page
    communities_elements = soup.find_all('ul', id="search-results")

    # Extract article data
    for community_element in communities_elements:
        try:
            article = {
                "authors": [span.get_text(strip=True) for span in community_element.find("p", class_="authors").find_all("span")],
                "articleTitle": community_element.find("a")["title"],
                "articleLink": community_element.find("a")["href"]
            }
            articles.append(article)
        except Exception as e:
            print(f"Error processing article: {e}")

    # Pagination handling
    current_links = soup.find("nav", id="article-pagination").find_all("a")
    next_link = None
    for index, link in enumerate(current_links):
        if "active" in link.get("class", []):
            # Check if there is a next element after the "active" one
            if index + 1 < len(current_links):
                next_link = current_links[index + 1]
            break

    # If there's a next page, recursively scrape it
    if next_link:
        next_url = "https://journals.plos.org" + next_link['href']
        scrape_current_page(next_url)
    else:
        return  # Exit condition if no next link is found

# Start scraping from the initial URL
scrape_current_page(url)



# Close the driver
driver.quit()


AttributeError: 'NoneType' object has no attribute 'find_all'

In [4]:
import pandas as pd 
previouslyCollected=pd.read_csv("../data/PLOSArticles.csv")
articles=pd.DataFrame(articles)
data=pd.concat([previouslyCollected,articles])
data.to_csv("../data/PLOSArticles.csv")

In [5]:
len(data)

20280