### Scraping Stack Exchange Health Posts and Articles

In this project, we extract posts and discussions from Stack Exchange's health-focused Q&A site, **Medical Sciences** (`medicalsciences.stackexchange.com`), using web scraping techniques. The site hosts a large body of user-generated questions and answers on health topics, which makes it a useful resource for applications such as sentiment analysis, trend monitoring, and topic modeling.

#### Objectives
- **Data Collection**: Retrieve posts and answers from specific health-related categories on Stack Exchange based on topic criteria (e.g., mental health, nutrition, fitness).  
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.  
- **Data Storage**: Store the scraped data in a structured format, such as CSV or a database, for further analysis.
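
As a hedged illustration of the data-processing objective, the sketch below normalizes a scraped question title by collapsing whitespace and stripping a trailing status marker such as `[closed]`. The function name `clean_title` and the list of markers are assumptions made for illustration, not part of the project code.

```python
import re

# Status suffixes assumed for illustration; extend as needed.
STATUS_MARKERS = ("closed", "duplicate")
_STATUS_RE = re.compile(r"\s*\[(?:%s)\]\s*$" % "|".join(STATUS_MARKERS), re.IGNORECASE)

def clean_title(raw):
    """Collapse runs of whitespace and drop a trailing status marker."""
    collapsed = re.sub(r"\s+", " ", raw).strip()
    return _STATUS_RE.sub("", collapsed)

print(clean_title("  Confusion about upper   limb CT Session [closed] "))
```

Titles without a marker pass through unchanged apart from the whitespace cleanup.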

#### Tools and Technologies
- **Python**: The primary programming language for web scraping.  
- **Beautiful Soup**: A library for parsing HTML and extracting data.  
- **urllib.request**: The standard-library module used in this notebook for making HTTP requests to access web pages.  
- **Pandas**: A data manipulation library to handle and analyze the scraped data.  

#### Getting Started
1. **Set Up the Environment**: Install the necessary libraries using `pip`.  
2. **Define Scraping Logic**: Write functions to scrape data from specific health-related categories on Stack Exchange.  
3. **Run the Scraper**: Execute the scraping script and monitor the data collection process.  
4. **Analyze the Data**: Use Pandas to analyze the collected posts and discussions for insights.  
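
Long scraping runs tend to hit occasional network errors, so the "run and monitor" step may benefit from retries. The following is a minimal sketch, assuming a hypothetical helper `fetch_with_retry` that wraps any fetching callable; the retry count and delay are illustrative values, and the flaky fetcher below is a stand-in for a real HTTP call.

```python
import time

def fetch_with_retry(url, fetch, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with a linearly growing pause.

    `fetch` is any callable that returns the page content or raises.
    `retries` must be at least 1.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            if attempt < retries:
                sleep(delay * attempt)  # wait longer after each failure
    raise last_error

# Stand-in fetcher that fails twice before succeeding (no network involved).
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise OSError("temporary failure")
    return "<html>ok</html>"

page = fetch_with_retry(
    "https://medicalsciences.stackexchange.com/tags",
    flaky_fetch,
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(page, len(calls))
```

Passing the fetcher and the sleep function as arguments keeps the helper independent of any particular HTTP library and easy to exercise without network access.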

#### Conclusion
This project serves as a practical introduction to web scraping and data analysis using Python, providing valuable experience in handling real-world data from an online health community.


<p style="color:#FE4406;text-align:center;font-size:30px"> Scraping the Health Board's Posts and Articles </p>

In [34]:
!pip install bs4
!pip install selenium






In [24]:
import re
from bs4 import BeautifulSoup
import pandas as pd
import time
from urllib.request import urlopen, Request

### Scraping Stack Exchange

In [25]:
## importing libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup

<p style="color:#FFC107;text-align:center;font-size:20px"> Searching for Stack Exchange's health-related topics </p>

In [26]:
import re
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.request import urlopen, Request
import pandas as pd

# Accumulates one dict per tag across all scraped pages
tags = []

# Function to save the collected tags to CSV
def saveTags(tags):
    try:
        df = pd.DataFrame(tags)
        df.to_csv("../data/stackXchange/stackXchangeTags.csv", index=False)
        print(f"Saved {len(tags)} tags.")
    except Exception as e:
        print(f"Error saving tags: {e}")

# Function to scrape one tag-listing page and append its tags to the global `tags` list
def collectStackTags(url):
    try:
        # Send a browser-like User-Agent header to reduce the chance of being blocked
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
        request = Request(url, headers=headers)

        # Fetch page content
        with urlopen(request) as response:
            page_source = response.read()

        soup = BeautifulSoup(page_source, 'html.parser')
        # Match <a> elements whose class attribute is exactly "s-tag post-tag"
        flexItems = soup.find_all("a", class_="s-tag post-tag")
        for aElement in flexItems:
            tag = {}
            tag["tagLink"] = aElement.get("href")   # relative link, e.g. /questions/tagged/<name>
            tag["tagName"] = aElement.get_text()
            tags.append(tag)
        # Echo the matched anchors (verbose; mainly useful while debugging)
        print(flexItems)
    except Exception as e:
        print(f"Error scraping URL {url}: {e}")




In [27]:
# Page 1 is the plain tag listing; pages 2-42 use the explicit "popular" tab
tagsUrl = "https://medicalsciences.stackexchange.com/tags"
links = [tagsUrl] + [f"{tagsUrl}?page={page}&tab=popular" for page in range(2, 43)]
for link in links:
    collectStackTags(link)

[<a aria-label="show questions tagged 'covid-19'" aria-labelledby="tag-covid-19-tooltip-container" class="s-tag post-tag" data-tag-menu-origin="Unknown" href="/questions/tagged/covid-19" rel="tag" title="show questions tagged 'covid-19'">covid-19</a>, <a aria-label="show questions tagged 'nutrition'" aria-labelledby="tag-nutrition-tooltip-container" class="s-tag post-tag" data-tag-menu-origin="Unknown" href="/questions/tagged/nutrition" rel="tag" title="show questions tagged 'nutrition'">nutrition</a>, <a aria-label="show questions tagged 'medications'" aria-labelledby="tag-medications-tooltip-container" class="s-tag post-tag" data-tag-menu-origin="Unknown" href="/questions/tagged/medications" rel="tag" title="show questions tagged 'medications'">medications</a>, <a aria-label="show questions tagged 'vaccination'" aria-labelledby="tag-vaccination-tooltip-container" class="s-tag post-tag" data-tag-menu-origin="Unknown" href="/questions/tagged/vaccination" rel="tag" title="show questions

KeyboardInterrupt: 

In [41]:
saveTags(tags)

Saved 1493 tags.
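
Because `tags` is a global list, re-running the scraping cell (for example after an interruption) appends the same tags again, and the stored `tagLink` values are relative paths. A minimal sketch of a cleanup step, assuming a hypothetical helper `normalize_tags` that removes repeated tag names and makes links absolute; the duplicated sample row is included only to illustrate the effect.

```python
from urllib.parse import urljoin

BASE_URL = "https://medicalsciences.stackexchange.com"

def normalize_tags(tags, base_url=BASE_URL):
    """Return tags with absolute links, keeping the first occurrence of each tag name."""
    seen = set()
    result = []
    for tag in tags:
        name = tag["tagName"].strip()
        if name in seen:
            continue  # skip repeats from re-run cells
        seen.add(name)
        result.append({"tagName": name, "tagLink": urljoin(base_url, tag["tagLink"])})
    return result

sample = [
    {"tagLink": "/questions/tagged/covid-19", "tagName": "covid-19"},
    {"tagLink": "/questions/tagged/nutrition", "tagName": "nutrition"},
    {"tagLink": "/questions/tagged/covid-19", "tagName": "covid-19"},  # illustrative duplicate
]
cleaned = normalize_tags(sample)
for tag in cleaned:
    print(tag["tagName"], tag["tagLink"])
```

The result could then be passed to `saveTags` in place of the raw list.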


### Collecting Post Titles and Metadata

In [28]:
linksToScrap=[f"https://medicalsciences.stackexchange.com/questions?tab=newest&page={index}" for index in range(1,533)]

In [29]:

# Accumulates one dict per question summary across all listing pages
posts = []

def periodicSave(data, index):
    # Write the DataFrame `data` to CSV and report the count passed in as `index`
    data.to_csv("../data/stackXchange/stackXchangeQuestions.csv", index=False)
    print(f"Periodic save completed: {index} new post texts have been saved.")

def scrapCurrentPage(url):
    try:
        # Set up a request with a user-agent header to avoid blocking
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
        request = Request(url, headers=headers)

        # Open the URL and read the page content
        with urlopen(request) as response:
            page_source = response.read()

        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(page_source, 'html.parser')
    except Exception as e:
        print(f"Error fetching page {url}: {e}")
        return

    # Each question summary is a <div> whose id contains "question-summary"
    elements = soup.find_all("div", id=re.compile(r"question-summary"))
    for element in elements:
        try:
            titleLink = element.find("h3").find("a")
            post = {}
            post["postText"] = titleLink.get_text()
            post["postLink"] = titleLink.get("href")
            post["postTags"] = [liElement.find("a").get_text() for liElement in element.find("ul", class_="js-post-tag-list-wrapper").find_all("li")]
            post["postUser"] = element.find("div", class_="s-user-card--link").find("a").get_text()
            print(post)
            posts.append(post)
        except AttributeError as e:
            # A missing sub-element (find() returned None) skips this post only, not the whole page
            print(f"Error processing post: {e}")


In [None]:
for link in linksToScrap:
    scrapCurrentPage(link)

{'postText': 'Confusion about upper limb CT Session [closed]', 'postLink': '/questions/34393/confusion-about-upper-limb-ct-session', 'postTags': ['medical-imaging', 'orthopedics', 'radiology', 'ct-scans', 'radioactivity'], 'postUser': 'Tokugava'}
{'postText': 'What is the risk factor of exposure to radon gas?', 'postLink': '/questions/34390/what-is-the-risk-factor-of-exposure-to-radon-gas', 'postTags': ['cancer', 'lungs', 'radioactivity'], 'postUser': 'Ray Butterworth'}
{'postText': 'Is there any serious ongoing research into preventing the body from developing a tolerance to given drugs?', 'postLink': '/questions/34387/is-there-any-serious-ongoing-research-into-preventing-the-body-from-developing-a', 'postTags': ['drug-metabolism', 'drug-tolerance'], 'postUser': 'Nethesis'}
{'postText': 'Treating inflammation necessary or optional?', 'postLink': '/questions/34386/treating-inflammation-necessary-or-optional', 'postTags': ['inflammation', 'anti-inflammatory', 'nsaids-pain-meds', 'evolut

In [4]:
posts=pd.DataFrame(posts)

In [6]:
posts.to_csv("../data/stackXchange/medicalStackXchangePosts.csv", index=False)  # index=False avoids an extra unnamed index column
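
Because `postTags` holds Python lists, pandas writes each cell to CSV as the list's string representation, so the column comes back as plain strings when the file is reloaded. The following is a minimal sketch of turning such a cell back into a list with the standard library's `ast.literal_eval`; the helper name `parse_tags` is an assumption for illustration.

```python
import ast

def parse_tags(cell):
    """Convert a stringified list such as "['cancer', 'lungs']" back into a list of strings."""
    value = ast.literal_eval(cell)  # safely evaluates Python literals only
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return [str(item) for item in value]

print(parse_tags("['cancer', 'lungs', 'radioactivity']"))
```

After reloading the CSV with `pd.read_csv`, the helper could be applied column-wise, for example `df["postTags"].apply(parse_tags)` for a DataFrame `df`.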

### Collecting Full Post Texts from Medical Sciences Stack Exchange

In [1]:
import pandas as pd
dataset=pd.read_csv("../data/stackXchange/medicalStackXchangePosts.csv")

In [None]:
import pandas as pd
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import os
import time

baseUrl = "https://medicalsciences.stackexchange.com"

def scrapPostText(postUrl):
    """Fetch one question page and return the text of its first post body."""
    postUrl = baseUrl + postUrl
    print(postUrl)
    try:
        # Set up a request with a user-agent header to avoid blocking
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
        request = Request(postUrl, headers=headers)

        # Open the URL and read the page content
        with urlopen(request) as response:
            page_source = response.read()
        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(page_source, 'html.parser')
        # find() returns only the first matching element on the page
        postBodyElement = soup.find("div", class_="js-post-body")
        print(f"navigating to {postUrl}")
        return postBodyElement.get_text(separator='\n', strip=True)

    except Exception as e:
        print(f"Post not found: {e}")
        return "post's text is not available"

def save_new_posts(dataframe, file_path):
    """Appends new posts to the CSV file, writing the header only when the file is created."""
    if not dataframe.empty:
        dataframe.to_csv(file_path, mode='a', header=not os.path.exists(file_path), index=False)
        print(f"Saved {len(dataframe)} new posts to {file_path}")
    else:
        print("No new posts to save.")

# Full texts go to their own file (assumed name) so these two-column rows are not
# appended to the summaries CSV that `dataset` was read from
file_path = "../data/stackXchange/medicalStackXchangePostTexts.csv"

# Links already scraped in a previous run, so the job can resume where it stopped
if os.path.exists(file_path):
    existing_links = set(pd.read_csv(file_path)["postLink"])
else:
    existing_links = set()

# Process new posts
processed_posts = []
for postLink in dataset["postLink"]:
    if postLink in existing_links:
        continue  # already saved earlier
    postText = scrapPostText(postLink)
    processed_posts.append({"postLink": postLink, "postFullText": postText})

    # Save in batches of 50
    if len(processed_posts) >= 50:
        save_new_posts(pd.DataFrame(processed_posts), file_path)
        processed_posts = []  # Clear batch after saving

# Save whatever is left in the final partial batch
save_new_posts(pd.DataFrame(processed_posts), file_path)


https://medicalsciences.stackexchange.com/questions/34368/does-strep-lay-dormant-in-the-body-throat-after-a-successful-course-of-antibioti
navigating to https://medicalsciences.stackexchange.com/questions/34368/does-strep-lay-dormant-in-the-body-throat-after-a-successful-course-of-antibioti
https://medicalsciences.stackexchange.com/questions/34365/how-do-i-know-if-my-neck-is-hypermobile-what-is-its-normal-range-of-movement
navigating to https://medicalsciences.stackexchange.com/questions/34365/how-do-i-know-if-my-neck-is-hypermobile-what-is-its-normal-range-of-movement
https://medicalsciences.stackexchange.com/questions/34364/quais-s%c3%a3o-as-melhores-op%c3%a7%c3%b5es-para-tratamento-de-depend%c3%aancia-qu%c3%admica-em-sp
Post not found: HTTP Error 404: Not Found
https://medicalsciences.stackexchange.com/questions/34362/what-distances-for-plus-and-minus-eyeglasses-which-are-for-middle-distance
navigating to https://medicalsciences.stackexchange.com/questions/34362/what-distances-for-p
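
Failed requests, such as the 404 in the log above, are stored with the placeholder string that `scrapPostText` returns on error. The following is a minimal sketch of separating those rows so they can be inspected or retried later; the function name `split_failed` and the sample rows are illustrative assumptions, not data from the actual run.

```python
PLACEHOLDER = "post's text is not available"  # value returned by scrapPostText on error

def split_failed(rows, placeholder=PLACEHOLDER):
    """Split row dicts into (succeeded, failed) based on the placeholder text."""
    succeeded = [row for row in rows if row["postFullText"] != placeholder]
    failed = [row for row in rows if row["postFullText"] == placeholder]
    return succeeded, failed

# Hypothetical rows for illustration only
rows = [
    {"postLink": "/questions/1/example-a", "postFullText": "Example body text."},
    {"postLink": "/questions/2/example-b", "postFullText": PLACEHOLDER},
    {"postLink": "/questions/3/example-c", "postFullText": "Another body."},
]
succeeded, failed = split_failed(rows)
print(len(succeeded), "succeeded;", len(failed), "failed")
print([row["postLink"] for row in failed])
```

The list of failed links could then be fed back into the scraping loop once the cause of the errors is understood.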