### Scraping Stack Exchange Health Posts and Articles

In this project, we extract posts and discussions from **Medical Sciences Stack Exchange** (`medicalsciences.stackexchange.com`), the health-focused Stack Exchange site, using web scraping techniques. The site hosts user-generated Q&A content on a wide range of health topics, which makes it a useful source for applications such as sentiment analysis, trend monitoring, and topic modeling.

#### Objectives
- **Data Collection**: Retrieve posts and answers from specific health-related categories on Stack Exchange based on topic criteria (e.g., mental health, nutrition, fitness).  
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.  
- **Data Storage**: Store the scraped data in a structured format, such as CSV or a database, for further analysis.

#### Tools and Technologies
- **Python**: The primary programming language for web scraping.  
- **Beautiful Soup**: A library for parsing HTML and extracting data.  
- **urllib.request**: The standard-library module used here to make HTTP requests to access web pages.  
- **Pandas**: A data manipulation library to handle and analyze the scraped data.  

#### Getting Started
1. **Set Up the Environment**: Install the necessary libraries using `pip`.  
2. **Define Scraping Logic**: Write functions to scrape data from specific health-related categories on Stack Exchange.  
3. **Run the Scraper**: Execute the scraping script and monitor the data collection process.  
4. **Analyze the Data**: Use Pandas to analyze the collected posts and discussions for insights.  

#### Conclusion
This project serves as a practical introduction to web scraping and data analysis using Python, providing valuable experience in handling real-world data from an online health community.


<p style="color:#FE4406;text-align:center;font-size:30px"> Scraping Health Board Posts and Articles </p>

In [34]:
!pip install bs4
!pip install selenium






In [24]:
import re
from bs4 import BeautifulSoup
import pandas as pd
import time
from urllib.request import urlopen, Request

### Scraping Stack Exchange

In [25]:
## importing libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup

<p style="color:#FFC107;text-align:center;font-size:20px"> Collecting Stack Exchange's health-related tags </p>

In [26]:
import re
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.request import urlopen, Request
import pandas as pd

# List to store tags
tags = []

# Function to save the collected tags to a CSV file
def saveTags(tags):
    try:
        df = pd.DataFrame(tags)
        df.to_csv("../data/stackXchange/stackXchangeTags.csv", index=False)
        print(f"Saved {len(tags)} tags.")
    except Exception as e:
        print(f"Error saving tags: {e}")

# Function to scrape the current page
def collectStackTags(url):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
        request = Request(url, headers=headers)
        
        # Fetch page content
        with urlopen(request) as response:
            page_source = response.read()

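        soup = BeautifulSoup(page_source, 'html.parser')
        # Each tag on the listing page is an <a> element with this exact class string
        flexItems = soup.find_all("a", class_="s-tag post-tag")
        for aElement in flexItems:
            tag = {}
            tag["tagLink"] = aElement.get("href")
            tag["tagName"] = aElement.get_text()
            tags.append(tag)
        # Report a count rather than dumping every matched element
        print(f"Collected {len(flexItems)} tags from {url}")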
    except Exception as e:
        print(f"Error scraping URL {url}: {e}")
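
The `class_="s-tag post-tag"` filter used in `collectStackTags` compares against the full class string, so an anchor that lists the same classes in another order, or carries an extra class, would be missed. The sketch below illustrates the difference on a small hand-written HTML snippet. The markup, including the `extra-class` name, is an assumption made for illustration and is not copied from the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the live page may differ.
SAMPLE_HTML = """
<div>
  <a class="s-tag post-tag" href="/questions/tagged/covid-19">covid-19</a>
  <a class="post-tag s-tag" href="/questions/tagged/nutrition">nutrition</a>
  <a class="s-tag post-tag extra-class" href="/questions/tagged/sleep">sleep</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact-string match: only anchors whose class attribute is literally "s-tag post-tag".
exact = [a.get_text(strip=True) for a in soup.find_all("a", class_="s-tag post-tag")]

# CSS selector: any anchor carrying both classes, in any order, with or without extras.
flexible = [a.get_text(strip=True) for a in soup.select("a.s-tag.post-tag")]

# Build the same record shape that collectStackTags appends to `tags`.
records = [
    {"tagLink": a.get("href"), "tagName": a.get_text(strip=True)}
    for a in soup.select("a.s-tag.post-tag")
]
```

If the tag count saved later looks lower than expected, switching to the CSS selector form is one thing worth trying.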




In [None]:
# Tag listing pages: page 1 uses the bare /tags URL, pages 2-42 add query parameters
tagsUrl = "https://medicalsciences.stackexchange.com/tags"
links = [tagsUrl] + [f"{tagsUrl}?page={page}&tab=popular" for page in range(2, 43)]
for link in links :
    collectStackTags(link)
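
The loop above sends 42 requests back to back, and a single network error leaves that page out of the results. A minimal sketch of a fetch helper with retries and a growing pause is shown below. The name `fetch_html`, its parameters, and the shortened `User-Agent` value are assumptions for illustration and are not part of the notebook.

```python
import time
from urllib.error import URLError
from urllib.request import Request, urlopen

# Shortened placeholder header, not the notebook's full User-Agent string.
HEADERS = {"User-Agent": "Mozilla/5.0"}

def fetch_html(url, opener=urlopen, retries=3, pause=1.0):
    """Return the response body for `url`, retrying when a URLError occurs.

    `opener` is injectable so the helper can be exercised without network
    access. `retries` is expected to be at least 1.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(Request(url, headers=HEADERS)) as response:
                return response.read()
        except URLError as error:  # HTTPError is a subclass of URLError
            last_error = error
            if attempt < retries:
                time.sleep(pause * attempt)  # wait a little longer each time
    raise last_error
```

Inside `collectStackTags`, the `urlopen` block could be swapped for a call to such a helper, with a short `time.sleep` between links to keep the request rate modest.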

In [41]:
saveTags(tags)

Saved 1493 tags.


### Collecting post titles and metadata

In [2]:
import pandas as pd 
tagsData=pd.read_csv("../data/stackXchange/stackXchangeTags.csv")


In [67]:
import re
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.request import urlopen, Request
import pandas as pd

def periodicSave(data, topicTag):
    """Write every post collected for one tag to that tag's CSV file."""
    data = pd.DataFrame(data)
    data.to_csv(f"../data/stackXchange/stackXchangePosts{topicTag}.csv", index=False)
    print(f"Periodic save completed: {len(data)} posts have been saved for tag '{topicTag}'.")

def scrapCurrentPage(pageIndex, topicTag, posts):
    """Scrape one results page for `topicTag`, then recurse until the last page."""
    url = f"https://medicalsciences.stackexchange.com/questions/tagged/{topicTag}?tab=newest&page={pageIndex}&pagesize=50"

    # Set up a request with a user-agent header to avoid blocking
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
    request = Request(url, headers=headers)

    # Open the URL and read the page content
    with urlopen(request) as response:
        page_source = response.read()

    # Parse the page with BeautifulSoup
    soup = BeautifulSoup(page_source, 'html.parser')

    # Read the highest page number from the pagination links; the last link
    # is "Next" on every page except the final one
    paginationLinks = soup.find_all("a", class_="s-pagination--item")
    try:
        if paginationLinks[-1].get_text().strip() == "Next":
            maxPage = int(paginationLinks[-2].get_text().strip())
        else:
            maxPage = int(paginationLinks[-1].get_text().strip())
    except (IndexError, ValueError):
        maxPage = 0  # no usable pagination: treat the tag as a single page

    # Each question summary is a <div> whose id contains "question-summary"
    elements = soup.find_all("div", id=re.compile(r"question-summary.*"))
    for element in elements:
        try:
            titleLink = element.find("h3").find("a")
            post = {}
            post["postText"] = titleLink.get_text()
            post["postLink"] = titleLink.get("href")
            post["postTags"] = [liElement.find("a").get_text() for liElement in element.find("ul", class_="js-post-tag-list-wrapper").find_all("li")]
            post["postUser"] = element.find("div", class_="s-user-card--link").find("a").get_text()
            post["createdAt"] = element.find("span", class_="relativetime").get_text()
            print(post)
            posts.append(post)
        except AttributeError:
            # find() returned None for an expected element; skip this post
            continue
    pageIndex += 1
    if pageIndex <= maxPage:
        scrapCurrentPage(pageIndex, topicTag, posts)
    else:
        print("Last page reached. Topic scraping done.")
        periodicSave(posts, topicTag)
        return
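
`scrapCurrentPage` calls itself once per results page, so a tag with very many pages adds one stack frame per page. The same stopping rule can be written as a loop. The sketch below assumes a hypothetical callable `fetch_page(page_index)` that returns `(posts_on_page, max_page)` for one page; neither that callable nor `collect_all_pages` exists in the notebook.

```python
def collect_all_pages(fetch_page, first_page=1):
    """Collect posts page by page without recursion.

    `fetch_page(page_index)` is a hypothetical callable returning
    `(posts_on_page, max_page)` for a single results page.
    """
    posts = []
    page_index = first_page
    while True:
        page_posts, max_page = fetch_page(page_index)
        posts.extend(page_posts)
        # Same rule as the recursive version: stop once the next page
        # would exceed the highest page number seen on this page.
        if page_index >= max_page:
            break
        page_index += 1
    return posts
```

A `max_page` of 0, which the scraper uses when no pagination links are found, ends the loop after the first page.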
  


In [68]:
# Starts at tag index 59 rather than 0, e.g. to resume after an interrupted run
for index in range(59,len(tagsData)):
    pageIndex=1
    topicTag=tagsData.iloc[index]["tagName"]
    scrapCurrentPage(pageIndex, topicTag,[])


{'postText': 'Basic Questions regarding Long-COVID from COVID Symptom Study app', 'postLink': '/questions/25181/basic-questions-regarding-long-covid-from-covid-symptom-study-app', 'postTags': ['covid-19', 'lasting-effects-duration', 'covid-19-datasets'], 'postUser': 'surfmuggle', 'createdAt': 'Nov 23, 2020 at 23:51'}
{'postText': 'Could COVID-19 be less "dangerous" by the pass of the time?', 'postLink': '/questions/22995/could-covid-19-be-less-dangerous-by-the-pass-of-the-time', 'postTags': ['covid-19', 'virus', 'lasting-effects-duration', 'sars-cov-2', 'evolution'], 'postUser': 'I likeThatMeow', 'createdAt': 'Apr 3, 2020 at 23:22'}
{'postText': 'What are the long-term consequences of binge drinking?', 'postLink': '/questions/17302/what-are-the-long-term-consequences-of-binge-drinking', 'postTags': ['risks', 'alcohol', 'lasting-effects-duration', 'overdose', 'substance-abuse'], 'postUser': 'got trolled too much this week', 'createdAt': 'Aug 31, 2018 at 12:55'}
{'postText': 'Is the incr
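
The `createdAt` values in the output above are plain strings such as `'Nov 23, 2020 at 23:51'`. For trend analysis they need to become datetimes. The sketch below parses that one format with the standard library. The function name `parse_created_at` is an assumption, the month abbreviations are assumed to be English (`%b` depends on the locale), and the result carries no time zone. Strings in any other shape, for example a relative label if the site shows one for recent posts, come back as `None`.

```python
from datetime import datetime

CREATED_AT_FORMAT = "%b %d, %Y at %H:%M"  # e.g. "Nov 23, 2020 at 23:51"

def parse_created_at(text):
    """Parse a scraped `createdAt` string; return None if it has another shape."""
    try:
        return datetime.strptime(text.strip(), CREATED_AT_FORMAT)
    except ValueError:
        return None
```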

### Concatenating all per-tag files into a single file before collecting full texts

In [None]:
import os
import glob
import pandas as pd

# Define the folder path
folder_path = "../data/stackXchange"

# Find all CSV files in the folder that start with "stackXchangePosts"
csv_files = glob.glob(os.path.join(folder_path, "stackXchangePosts*.csv"))

# Initialize an empty list to store dataframes
dataframes = []
total=0
# Iterate over each CSV file and read it into a dataframe
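for file in csv_files:
    try:
        print(f"Reading {file}...")
        df = pd.read_csv(file)
        total+=len(df)
        dataframes.append(df)
    except Exception as e:
        # Report unreadable files instead of skipping them silently
        print(f"Skipping {file}: {e}")
        continue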
""" 
# Concatenate all dataframes into one
merged_df = pd.concat(dataframes, ignore_index=True)

# Save the merged dataframe to a single CSV file
output_file = "../merged_stackXchangePosts.csv"
merged_df.to_csv(output_file, index=False)

print(f"All files starting with 'stackXchangePosts' merged into {output_file}")
 """
print(f"total number of rows is {total}")
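
One detail to keep in mind when reading these files back: `postTags` was a Python list when it was saved, so in the CSV it is stored as text such as `"['covid-19', 'virus']"`. A minimal sketch of turning it back into a list and counting tags is shown below, using only the standard library. The sample rows and the helper name `count_tags` are made up for illustration.

```python
import ast
import csv
import io
from collections import Counter

# Two rows in the shape written by periodicSave; the values are invented.
SAMPLE_CSV = '''postText,postLink,postTags
Example question A,/questions/1/a,"['covid-19', 'virus']"
Example question B,/questions/2/b,"['covid-19', 'alcohol']"
'''

def count_tags(csv_file):
    """Count how often each tag occurs across all rows of a posts CSV."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # The column holds the text "['a', 'b']"; literal_eval rebuilds the list.
        counts.update(ast.literal_eval(row["postTags"]))
    return counts

tag_counts = count_tags(io.StringIO(SAMPLE_CSV))
```

With Pandas, the same conversion can be applied to the column with `apply(ast.literal_eval)` before further analysis.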

### Collecting full post texts from Medical Sciences Stack Exchange

In [73]:
import pandas as pd
dataset=pd.read_csv("../data/stackXchange/stackXchangePosts.csv")

In [77]:
import pandas as pd
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import os
import time

baseUrl = "https://medicalsciences.stackexchange.com"

def scrapPostText(postUrl):
    postUrl = baseUrl + postUrl
    
    try:
        # Set up a request with a user-agent header to avoid blocking
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
        request = Request(postUrl, headers=headers)
        
        # Open the URL and read the page content
        with urlopen(request) as response:
            page_source = response.read()
        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(page_source, 'html.parser')
        postBodyElement = soup.find("div", class_="js-post-body")
        print(f"navigating to {postUrl}")
        return postBodyElement.get_text(separator='\n', strip=True)

    except Exception as e: 
        print(f"Post not found: {e}")
        return "post's text is not available"

def save_new_posts(dataframe, file_path):
    """Appends new posts to the CSV file."""
    if not dataframe.empty:
        dataframe.to_csv(file_path, mode='a', header=not os.path.exists(file_path), index=False)
        print(f"Saved {len(dataframe)} new posts to {file_path}")
    else:
        print("No new posts to save.")

# Path to the CSV file
file_path = "../data/stackXchange/medicalStackXchangePostsFullText.csv"

# Load existing data if the file exists
if os.path.exists(file_path):
    existing_data = pd.read_csv(file_path)
    existing_links = set(existing_data["postLink"])
else:
    existing_data = pd.DataFrame(columns=["postLink", "postFullText"])
    existing_links = set()


# Process new posts, skipping links already saved by a previous run
processed_posts = []
for i in range(len(dataset)):
    postLink = dataset.iloc[i]["postLink"]
    if postLink not in existing_links:
        postText = scrapPostText(postLink)
        processed_posts.append({"postLink": postLink, "postFullText": postText})

    # Save in batches of 50, and flush whatever remains after the last row
    if len(processed_posts) >= 50 or (i == len(dataset) - 1 and processed_posts):
        batch_df = pd.DataFrame(processed_posts)
        save_new_posts(batch_df, file_path)
        processed_posts = []  # Clear batch after saving


navigating to https://medicalsciences.stackexchange.com/questions/5797/can-a-pregnant-woman-travel-30-minutes-when-she-is-only-10-days-pregnant
navigating to https://medicalsciences.stackexchange.com/questions/3631/any-risk-to-the-fetus-if-alcohol-consumption-in-only-12-months-pregnancy
navigating to https://medicalsciences.stackexchange.com/questions/1632/which-of-these-things-is-needed-for-a-first-time-pregnancy-checkup
navigating to https://medicalsciences.stackexchange.com/questions/4460/what-is-the-best-probability-of-recurrent-stillbirth-with-the-given-data
navigating to https://medicalsciences.stackexchange.com/questions/3522/how-to-treat-a-pregnant-woman-with-a-very-low-platelet-count-in-her-9th-month-of
navigating to https://medicalsciences.stackexchange.com/questions/34084/is-trauma-to-the-abdominal-wall-classed-as-somatic-pain-or-visceral-pain
navigating to https://medicalsciences.stackexchange.com/questions/26316/fluid-in-lungs-from-pressure-during-bowel-movement
navigating

In [1]:
import pandas as pd 
data=pd.read_csv("../data/stackXchange/medicalStackXchangePostsFullText.csv")

In [2]:
len(data)

26707
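
Not every one of these rows holds usable text: `scrapPostText` writes the placeholder `"post's text is not available"` when a page cannot be read, and the body text keeps the line breaks from the page. A minimal cleaning sketch for the Data Processing step follows. The helper name `clean_post_text` and the choice to collapse all whitespace into single spaces are assumptions, not steps taken in the notebook.

```python
import re

PLACEHOLDER = "post's text is not available"  # returned by scrapPostText on failure

def clean_post_text(text):
    """Return whitespace-normalised text, or None for missing or placeholder rows."""
    if not isinstance(text, str):
        return None  # e.g. NaN read back from an empty CSV cell
    text = re.sub(r"\s+", " ", text).strip()
    if not text or text == PLACEHOLDER:
        return None
    return text
```

Applied to the loaded frame, this could look like `data["postFullText"].map(clean_post_text)` followed by `dropna()` on the result.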