### Scraping Health Board's Posts and Articles

In this project, we extract posts and articles from the Health Board forums (healthboards.com) using web scraping. Health Board hosts a large body of user-generated content on health topics, which makes it a useful source for sentiment analysis, trend monitoring, and topic modeling.

#### Objectives
- **Data Collection**: Retrieve posts and articles from specific health-related categories on Health Board based on topic criteria (e.g., mental health, fitness).
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.
- **Data Storage**: Store the scraped data in a structured format, such as CSV or a database, for further analysis.
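
As a small illustration of the Data Processing objective, the sketch below decodes HTML entities and collapses whitespace in a scraped post. It uses only the standard library, and `clean_post_text` is a hypothetical helper name that does not appear in the notebook.

```python
import html
import re


def clean_post_text(raw: str) -> str:
    """Decode HTML entities and collapse runs of whitespace into single spaces."""
    decoded = html.unescape(raw)
    return re.sub(r"\s+", " ", decoded).strip()


# Invented sample text, used only to exercise the helper
sample = "a muscle &amp; artery\n   was taken   from the upper thigh"
print(clean_post_text(sample))
```

Decoding entities before collapsing whitespace keeps sequences such as `&amp;` from ending up in the stored text.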

#### Tools and Technologies
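- **Python**: The primary programming language for web scraping.
- **Selenium**: Drives a Chrome browser to load the topic and thread-list pages.
- **Beautiful Soup**: A library for parsing HTML and extracting data.
- **Requests / urllib**: Libraries for making HTTP requests; the notebook imports Requests and fetches individual post pages with `urllib.request`.
- **Pandas**: A data manipulation library to handle and analyze the scraped data.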

#### Getting Started
1. **Set Up the Environment**: Install the necessary libraries using pip.
2. **Define Scraping Logic**: Write functions to scrape data from specific health categories on Health Board.
3. **Run the Scraper**: Execute the scraping script and monitor the data collection process.
4. **Analyze the Data**: Use Pandas to analyze the collected posts and articles for insights.
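
To make step 2 concrete, here is a minimal sketch of link extraction. It is an assumption-laden stand-in: it uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without third-party packages, and the class name `TopicLinkParser`, the sample HTML, and the `example.com` URLs are all hypothetical.

```python
from html.parser import HTMLParser


class TopicLinkParser(HTMLParser):
    """Collect a name and href for every <a> nested inside an <li>."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.current_href = None
        self.topics = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
        elif tag == "a" and self.in_li:
            self.current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # The first non-empty text inside the link is taken as the topic name
        if self.current_href and data.strip():
            self.topics.append(
                {"topicName": data.strip(), "topicLink": self.current_href}
            )
            self.current_href = None

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False
        elif tag == "a":
            self.current_href = None


# Hypothetical markup; the real category page may be structured differently
sample_html = (
    '<ul><li><a href="https://example.com/arthritis">Arthritis</a></li>'
    '<li><a href="https://example.com/diabetes">Diabetes</a></li></ul>'
)
parser = TopicLinkParser()
parser.feed(sample_html)
print(parser.topics)
```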

#### Conclusion
This project serves as a practical introduction to web scraping and data analysis using Python, providing valuable experience in handling real-world data from an online health community.


<p style="color:#FE4406;text-align:center;font-size:30px"> Scraping Health Board's Posts and Articles </p>

In [34]:
!pip install bs4
!pip install selenium






In [1]:
# importing packages
import requests
from bs4 import BeautifulSoup

### Scraping Health Boards

In [2]:
## importing libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup

<p style="color:#FFC107;text-align:left;font-size:20px"> Searching for Health Board's health-related topics </p>

In [41]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup

# Set up the Selenium WebDriver
driver = webdriver.Chrome()  # Ensure you have the correct WebDriver
communities = []

# URL to scrape
url = 'https://www.healthboards.com/boards/hbcategory.php'
driver.get(url)

# Allow the page to load
time.sleep(5)

def scrape_current_page():
    """Collect the topics listed under each category; return True if any were added."""
    # Get the page source
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    new_content_found = False
    # The code treats every <font> tag as a category heading
    titlesList = soup.find_all("font")

    for community_element in titlesList:
        try:
            # Heading text; read but not stored in the output
            category = community_element.contents[0]
            # The topics follow the heading's great-grandparent as <li> siblings
            aElement = community_element.parent.parent.parent
            for sibling in aElement.next_siblings:
                if str(sibling).startswith("<li>"):
                    topic = {
                        "topicName": sibling.a.contents[0],
                        "topicLink": sibling.a["href"],
                    }
                    # Skip topics that were already collected
                    if topic not in communities:
                        communities.append(topic)
                        new_content_found = True
        except Exception as e:
            print(f"Error processing category: {e}")

    return new_content_found

# Scrape content from the current page
content_found = scrape_current_page()
# Close the driver when done
driver.quit()



In [44]:
# convert the topicLinks list to a dataset 
import pandas as pd
communities=pd.DataFrame(communities)
communities.head()

| | topicName | topicLink |
|---|---|---|
| 0 | Addison's Disease | https://www.healthboards.com/boards/forumdispl... |
| 1 | Arthritis | https://www.healthboards.com/boards/forumdispl... |
| 2 | Autoimmune Disorders | https://www.healthboards.com/boards/forumdispl... |
| 3 | Chronic Fatigue | https://www.healthboards.com/boards/forumdispl... |
| 4 | Diabetes | https://www.healthboards.com/boards/forumdispl... |


In [45]:
# to_csv returns None when given a path, so its result is not assigned back
communities.to_csv("../data/healthBoardsTopics.csv")

In [21]:
## Scraping topics from the topics list 
import pandas as pd 
topicsList=pd.read_csv("../data/healthBoardsTopics.csv")

In [None]:
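# Topic tags found in the first 8468 rows of the posts file collected so far
topicsNotDone=pd.read_csv("../data/healthBordsPosts.csv")[:8468]["topicTag"]
# Deduplicate and strip spaces so the tags can be compared with topicsList names
topicsNotDone=[element.replace(" ","") for element in set(topicsNotDone)]
indexes=[]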

In [62]:
# Record the row index of every topic whose name appears in topicsNotDone
for index in range(len(topicsList)):
    if topicsList.iloc[index]["topicName"].replace(" ","") in topicsNotDone:
        indexes.append(index)
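
The name matching in the two cells above can be sketched as a standalone function. This is only an illustration: `normalize` and `matching_indexes` are hypothetical names, and the sample lists are invented.

```python
def normalize(name: str) -> str:
    """Remove spaces so 'Chronic Fatigue' and 'ChronicFatigue' compare equal."""
    return name.replace(" ", "")


def matching_indexes(topic_names, collected_tags):
    """Return the positions of topic names that occur among the collected tags."""
    wanted = {normalize(tag) for tag in collected_tags}
    return [i for i, name in enumerate(topic_names) if normalize(name) in wanted]


# Invented sample data
topic_names = ["Addison's Disease", "Arthritis", "Chronic Fatigue"]
collected_tags = ["ChronicFatigue", "Arthritis", "Arthritis"]
print(matching_indexes(topic_names, collected_tags))
```

Using a set for the lookup avoids scanning the whole tag list once per topic.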

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import re
from bs4 import BeautifulSoup
from datetime import datetime

# Set up the Selenium WebDriver
driver = webdriver.Chrome()

posts = []

def get_total_pages(soup):
    page_info = soup.find("td", class_="vbmenu_control", style="font-weight:normal")
    if page_info:
        match = re.search(r"Page \d+ of (\d+)", page_info.text)
        if match:
            return int(match.group(1))
    return 1

def scrape_current_page(url, topicTag):
    """Scrape one thread-list page; return True if any post was collected."""
    driver.get(url)
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "threadslist")))
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    found = False
    tableElement = soup.find("table", id="threadslist")
    if tableElement:
        trElements = tableElement.find_all("tr")
        # Thread rows start at index 5; the earlier rows are skipped
        for row in trElements[5:]:
            try:
                author = row.find("div", class_="smallfont").contents[0].strip().replace(" ", "")
                if author != "Administrator":
                    # Look up the thread-title link once and reuse it
                    titleLink = row.find("a", id=re.compile(".*thread_title.*"))
                    element = {
                        "commentsLink": titleLink['href'],
                        "postTitle": titleLink.contents[0],
                        "authorId": author,
                        "postId": titleLink["id"].replace("thread_title_", ""),
                        "commentsCount": row.find_all("td", class_="alt2")[1]["title"].split("Replies:")[1].split(",")[0].strip(),
                        "createdAt": row.find_all("div", class_="smallfont")[1].contents[0].strip().replace(" ", "") + " " + row.find("span", class_="time").contents[0].strip().replace(" ", ""),
                        "collectedAt": datetime.now(),
                        "topicTag": topicTag
                    }
                    print(element)
                    posts.append(element)
                    found = True
            except Exception as e:
                print(f"Error processing post: {e}")
                continue
    return found

# indexes[2:] skips the first two matched topics
for index in indexes[2:]:
    topic_url = topicsList.loc[index, "topicLink"]
    topic_tag = topicsList.loc[index, "topicName"]

    # Load the first page once to read the total page count
    driver.get(topic_url)
    WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, "threadslist")))
    first_page_soup = BeautifulSoup(driver.page_source, 'html.parser')
    total_pages = get_total_pages(first_page_soup)
    print("The number of total pages is ", total_pages)

    for page_number in range(1, total_pages + 1):
        # Page 1 is the topic URL itself; later pages append index<N>.html
        if page_number == 1:
            current_page_url = topic_url
        else:
            current_page_url = f"{topic_url}index{page_number}.html"

        print(f"Navigating to: {current_page_url}")
        scrape_current_page(current_page_url, topic_tag)

driver.quit()
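
The pagination logic above can be checked without a browser. The sketch below reuses the same regular expression and URL pattern on plain strings; `parse_total_pages` and `page_urls` are hypothetical helper names, and the `example.com` URL is a placeholder rather than a real Health Board address.

```python
import re


def parse_total_pages(text: str) -> int:
    """Read N from a 'Page X of N' label; assume a single page if it is absent."""
    match = re.search(r"Page \d+ of (\d+)", text)
    return int(match.group(1)) if match else 1


def page_urls(topic_url: str, total_pages: int):
    """Build one URL per page, following the index<N>.html pattern used above."""
    return [
        topic_url if page == 1 else f"{topic_url}index{page}.html"
        for page in range(1, total_pages + 1)
    ]


print(parse_total_pages("Page 1 of 12"))
print(page_urls("https://example.com/boards/arthritis/", 3))
```

Keeping these two steps as pure functions makes it easy to confirm the URL pattern before starting a long scraping run.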


In [72]:
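import pandas as pd
# Append the newly scraped posts to those collected in earlier runs
posts=pd.DataFrame(posts)
# The first column of the file is the index written by the previous to_csv call
previousCollected=pd.read_csv("../data/healthBordsPosts.csv", index_col=0)
data=pd.concat([posts,previousCollected], ignore_index=True)
data.to_csv("../data/healthBordsPosts.csv")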
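
Because the scraper can be run several times over the same topics, the merged file may contain the same thread more than once. The following is a minimal sketch of deduplicating by `postId`, written with plain dictionaries so it runs without pandas; `merge_without_duplicates` is a hypothetical helper and the sample rows are invented.

```python
def merge_without_duplicates(new_rows, old_rows, key="postId"):
    """Concatenate two lists of post dicts, keeping the first row seen per key."""
    seen = set()
    merged = []
    for row in list(new_rows) + list(old_rows):
        # Compare as strings: freshly scraped ids are strings, while ids
        # read back from a CSV file may have been parsed as integers.
        identifier = str(row[key])
        if identifier not in seen:
            seen.add(identifier)
            merged.append(row)
    return merged


# Invented sample rows
new_rows = [
    {"postId": "101", "postTitle": "New thread"},
    {"postId": "102", "postTitle": "Seen before"},
]
old_rows = [
    {"postId": 102, "postTitle": "Seen before"},
    {"postId": 99, "postTitle": "Older thread"},
]
merged = merge_without_duplicates(new_rows, old_rows)
print([row["postId"] for row in merged])
```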



### Collecting Post Texts

In [1]:
import pandas as pd 
postsDataset=pd.read_csv("../data/healthBordsPosts.csv")

In [None]:
import re
from bs4 import BeautifulSoup
import pandas as pd
import time
from urllib.request import urlopen, Request

def periodicSave(data):
    """Write the dataset back to disk so progress survives an interruption."""
    data.to_csv("../data/healthBordsPosts.csv")

def scrape_current_page_with_urllib(url):
    try:
        # Set up a request with a user-agent header to avoid blocking
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
        request = Request(url, headers=headers)
        
        # Open the URL and read the page content
        with urlopen(request) as response:
            page_source = response.read()
        
        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(page_source, 'html.parser')
        
        # Locate the element with the desired ID pattern
        element = soup.find("div", id=re.compile(r"post_message_.*"))
        if element:
            # Clean and return the extracted text
            text = element.get_text()
            cleaned_text = re.sub(r"\s+", " ", text).strip()
            return cleaned_text
        else:
            return "This post has been deleted by the author."
    except Exception as e:
        print(f"Error processing post: {e}")
        return "An error occurred while scraping the post."

# Function to handle retries if scraping fails
def safe_scrape_with_urllib(url, retries=3):
    for attempt in range(retries):
        result = scrape_current_page_with_urllib(url)
        if result != "An error occurred while scraping the post.":
            return result
        else:
            print(f"Retrying ({attempt + 1}/{retries})...")
            time.sleep(5)  # Wait before retrying
    return "Failed after multiple attempts."

# Iterate over postsDataset
for index in range(len(postsDataset)):
    try:
        link = postsDataset.iloc[index]["commentsLink"]

        # Use the safe_scrape_with_urllib function to avoid interruptions
        resultScr = safe_scrape_with_urllib(link)
        print(resultScr)
        postsDataset.loc[index, "postText"] = resultScr
        # Checkpoint every 500 posts
        if index % 500 == 0:
            periodicSave(postsDataset)

    except Exception as e:
        postsDataset.loc[index, "postText"] = f"An error occurred: {e}"
        continue

# Save once more so rows scraped after the last checkpoint are not lost
periodicSave(postsDataset)
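
The retry helper above signals failure with fixed strings, which then end up in the `postText` column next to real post text. One possible alternative, shown here as a sketch and not as the notebook's method, is to let the fetch function raise and return `None` once every attempt has failed. `fetch_with_retries`, `flaky_fetch`, and `always_fail` are hypothetical names, and the fetch functions are fakes so the example runs offline.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=0.0):
    """Call fetch(url) up to `retries` times; return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:
            print(f"Attempt {attempt}/{retries} failed: {error}")
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    return None


# Fake fetchers used only for demonstration
calls = {"count": 0}


def flaky_fetch(url):
    """Fail on the first two calls, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"text of {url}"


def always_fail(url):
    raise ValueError("page unavailable")


print(fetch_with_retries(flaky_fetch, "https://example.com/thread/1"))
print(fetch_with_retries(always_fail, "https://example.com/thread/2"))
```

Storing `None` for failed rows keeps them distinguishable from genuine posts, so they can be retried or filtered out later.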



Hi everyone, I am starting this thread to post some information that may be helpful to you. TF
had half of thyroid out 19 years ago, multinodular goiter. have antibodies for hashimotos quit smoking in Feb, now my thyroid went the other way, TSH is .01, and FT3 is elevated. I believe my multinodular goiter is now toxic. seeing endo next week and ENT next month. i'm concerned because my heartbeat is very noticeable after i go up 3 flights of stairs. i have extreme fatigue. I hope I will be ok in the meantime until we get this sorted out
I am posting this question for my husband. 2 Yrs ago, he had a bad accident at work where he almost lost his leg. 9 "Leg salvage" surgeries where a muscle & artery was taken from upper thigh & sewn in lower leg, tibia nail, screws, skin flap, bone graft, he now has drop foot and our biggest fear is his coming down with Diabetes with his LEG LIKE THIS. It's basically a "Frankenstein" leg now. He recently had Blood work (will see Doc in 2 weeks) his

KeyboardInterrupt: 