# Scraping Health-Related Forums and Articles

This project focuses on extracting posts and articles from health-related forums using web scraping techniques. These forums host a wide range of user-generated content on diverse health topics, making them valuable for applications like sentiment analysis, trend monitoring, and topic modeling.

## Objectives
- **Data Collection**: Retrieve posts and articles from specific health-related categories based on topic criteria (e.g., mental health, fitness).
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.
- **Data Storage**: Store the scraped data in a structured format, such as CSV or a database, for further analysis.

## Tools and Technologies
- **Python**: The primary programming language for web scraping.
- **Beautiful Soup**: A library for parsing HTML and extracting data.
- **Requests**: A library for making HTTP requests to access web pages.
- **Pandas**: A data manipulation library to handle and analyze the scraped data.
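
As a rough illustration of how these libraries fit together, the sketch below parses a small hand-written HTML fragment with Beautiful Soup. The markup, the `SAMPLE_HTML` constant, and the `parse_results` helper are assumptions invented for this example and are not taken from any real site; a live scraper would first fetch the page over HTTP and adapt the selectors to the site's actual structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example only.
SAMPLE_HTML = """
<ul id="search-results">
  <li>
    <a href="/article?id=1" title="First title">First title</a>
    <p class="authors"><span>A. Author,</span><span>B. Writer</span></p>
  </li>
  <li>
    <a href="/article?id=2" title="Second title">Second title</a>
    <p class="authors"><span>C. Researcher</span></p>
  </li>
</ul>
"""


def parse_results(html):
    """Return one dict per <li> found in the results list."""
    soup = BeautifulSoup(html, "html.parser")
    results = soup.find("ul", id="search-results")
    if results is None:
        return []
    records = []
    for item in results.find_all("li"):
        link = item.find("a")
        authors_tag = item.find("p", class_="authors")
        authors = []
        if authors_tag is not None:
            # Drop the trailing comma that separates names in the markup.
            authors = [
                span.get_text(strip=True).rstrip(",")
                for span in authors_tag.find_all("span")
            ]
        records.append(
            {"title": link["title"], "link": link["href"], "authors": authors}
        )
    return records


records = parse_results(SAMPLE_HTML)
for record in records:
    print(record)
```

A list of dictionaries like `records` can be passed directly to `pandas.DataFrame(records)` for further analysis.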

## Getting Started
1. **Set Up the Environment**: Install the necessary libraries using `pip`.
2. **Define Scraping Logic**: Write functions to scrape data from specific health categories on the forum.
3. **Run the Scraper**: Execute the scraping script and monitor the data collection process.
4. **Analyze the Data**: Use Pandas to analyze the collected posts and articles for insights.
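
Steps 2 and 3 can be sketched as an iterative crawl loop. In this sketch, `crawl`, `fetch_page`, and `parse_page` are hypothetical names; the fetch and parse functions are injected so the loop can be exercised offline with canned pages, and a real run would plug in an HTTP fetch and an HTML parser instead. The delay and the page cap are assumed politeness settings, not values taken from this project.

```python
import time


def crawl(start_url, fetch_page, parse_page, max_pages=100, delay_seconds=1.0):
    """Follow next-page links iteratively and collect all records."""
    records = []
    url = start_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        html = fetch_page(url)
        # parse_page returns (records on this page, next URL or None).
        page_records, next_url = parse_page(html)
        records.extend(page_records)
        pages_seen += 1
        url = next_url
        # Pause between requests to avoid hammering the server.
        if url is not None and delay_seconds > 0:
            time.sleep(delay_seconds)
    return records


# Offline demonstration: each canned "page" already holds its parsed form.
FAKE_SITE = {
    "page1": (["post A", "post B"], "page2"),
    "page2": (["post C"], None),
}


def fake_fetch(url):
    return FAKE_SITE[url]


def fake_parse(page):
    return page


posts = crawl("page1", fake_fetch, fake_parse, delay_seconds=0)
print(posts)
```

A loop keeps memory use flat across long listings, whereas a recursive crawler adds one stack frame per page.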

## Conclusion
This project provides hands-on experience in web scraping and data analysis with Python, using real-world data from online health communities.


<p style="color:#FE4406;text-align:center;font-size:30px"> Scraping PLOS Articles</p>

In [1]:
!pip install bs4
!pip install selenium



In [1]:
# Import libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request


<p style="color:#FFC107;text-align:left;font-size:20px"> Searching for PLOS Health-Related Articles </p>

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import csv
pageIndex = 19851  # Page of the browse listing to start from
# URL of the first listing page to scrape
url = 'https://journals.plos.org/plosone/browse/biology_and_life_sciences?resultView=list&page=' + str(pageIndex)

# List to store scraped articles
articles = []
 

# Output file for the scraped articles
OUTPUT_PATH = "../data/PLOSArticlesPartIIII.csv"

# Function to save articles periodically; each call rewrites the whole file
def save_articles_to_file():
    try:
        with open(OUTPUT_PATH, "w", encoding="utf-8", newline="") as f:
            writer = csv.writer(f)
            # Mode "w" truncates the file, so the header is written on every save
            writer.writerow(["Authors", "Article Title", "Article Link"])
            for article in articles:
                writer.writerow([
                    # Scraped names carry a trailing comma; strip it before joining
                    ", ".join(name.rstrip(",") for name in article["authors"]),
                    article["articleTitle"],
                    article["articleLink"]
                ])
        print(f"Data saved to {OUTPUT_PATH}")
    except OSError as e:
        print(f"Error saving data: {e}")

def scrape_current_page(url, pageIndex):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
    request = Request(url, headers=headers)

    # Open the URL and read the page content
    with urlopen(request) as response:
        page_source = response.read()
    soup = BeautifulSoup(page_source, 'html.parser')

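    # Find all articles on the page by iterating over the items of the results list
    # Assumption: each article is a direct <li> child of <ul id="search-results">
    results_list = soup.find('ul', id="search-results")
    communities_elements = results_list.find_all("li", recursive=False) if results_list else []

    # Extract article data
    for community_element in communities_elements: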
        try:
            title_link = community_element.find("a")
            article = {
                "authors": [span.get_text(strip=True) for span in community_element.find("p", class_="authors").find_all("span")],
                "articleTitle": title_link["title"],
                "articleLink": title_link["href"]
            }
            articles.append(article)
            print(article)

            # Save data periodically, every 25 articles
            if len(articles) % 25 == 0:
                print(f"Last seen page {pageIndex}")
                save_articles_to_file()

        # A missing tag raises AttributeError or TypeError; a missing attribute raises KeyError
        except (AttributeError, TypeError, KeyError) as e:
            print(f"Error processing article: {e}")
            continue

    # Pagination handling
    pagination = soup.find("nav", id="article-pagination")
    current_links = pagination.find_all("a") if pagination else []
    next_link = None
    for index, link in enumerate(current_links):
        if "active" in link.get("class", []):
            # Check if there is a next element after the "active" one
            if index + 1 < len(current_links):
                next_link = current_links[index + 1]
            break

    # If there's a next page, recursively scrape it.
    # Each page adds a stack frame, so Python's recursion limit
    # (1000 by default) bounds how many pages a single run can follow.
    if next_link:
        next_url = "https://journals.plos.org" + next_link['href']
        print(f"Currently fetching page {pageIndex + 1}")
        scrape_current_page(next_url, pageIndex + 1)  # Pass incremented pageIndex
    else:
        save_articles_to_file()  # Final save when scraping is complete

# Start scraping from the initial URL
scrape_current_page(url, pageIndex)


{'authors': ['Florian Kurth,', 'Sabine Bélard,', 'Ghyslain Mombo-Ngoma,', 'Katharina Schuster,', 'Ayola A. Adegnika,', 'Marielle K. Bouyou-Akotet,', 'Peter G. Kremsner,', 'Michael Ramharter'], 'articleTitle': 'Adolescence As Risk Factor for Adverse Pregnancy Outcome in Central Africa – A Cross-Sectional Study', 'articleLink': '/plosone/article?id=10.1371/journal.pone.0014367'}
{'authors': ['Jeffrey A. Longmate,', 'Garrett P. Larson,', 'Theodore G. Krontiris,', 'Steve S. Sommer'], 'articleTitle': 'Three Ways of Combining Genotyping and Resequencing in Case-Control Association Studies', 'articleLink': '/plosone/article?id=10.1371/journal.pone.0014318'}
{'authors': ['Suma Ghosh,', 'Jane Heffernan'], 'articleTitle': 'Influenza Pandemic Waves under Various Mitigation Strategies with 2009 H1N1 as a Case Study', 'articleLink': '/plosone/article?id=10.1371/journal.pone.0014307'}
{'authors': ['Melanie Norgate,', 'Adam Southon,', 'Mark Greenough,', 'Michael Cater,', 'Ashley Farlow,', 'Philip Bat