### Scraping Health Board's Posts and Articles

In this project, we use web scraping to extract posts and articles from the Health Board forums (healthboards.com). The site hosts a large volume of user-generated content on health topics, which makes it a useful source for applications such as sentiment analysis, trend monitoring, and topic modeling.

#### Objectives
- **Data Collection**: Retrieve posts and articles from specific health-related categories on Health Board based on topic criteria (e.g., mental health, fitness).
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.
- **Data Storage**: Store the scraped data in a structured format, such as CSV or a database, for further analysis.

#### Tools and Technologies
- **Python**: The primary programming language for web scraping.
- **Beautiful Soup**: A library for parsing HTML and extracting data.
- **Requests**: A library for making HTTP requests to access web pages.
- **Pandas**: A data manipulation library to handle and analyze the scraped data.
- **Selenium**: A browser automation library, used here to load the category page before parsing it.

#### Getting Started
1. **Set Up the Environment**: Install the necessary libraries using pip.
2. **Define Scraping Logic**: Write functions to scrape data from specific health categories on Health Board.
3. **Run the Scraper**: Execute the scraping script and monitor the data collection process.
4. **Analyze the Data**: Use Pandas to analyze the collected posts and articles for insights.
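
The collection, processing, and storage steps above can be sketched end to end on a small inline HTML snippet. This is only an illustration, not the project's scraper: it swaps in the standard library's `html.parser` for Beautiful Soup and `csv` for Pandas so that it runs without extra installs, and the sample markup, the `TopicLinkParser` class, and the `topics_to_csv` helper are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample markup; the real Health Board pages are structured differently.
SAMPLE_HTML = """
<ul>
  <li><a href="https://example.com/boards/arthritis/">Arthritis</a></li>
  <li><a href="https://example.com/boards/diabetes/">  Diabetes </a></li>
</ul>
"""

class TopicLinkParser(HTMLParser):
    """Collect (name, link) pairs from every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.topics = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Processing step: collapse surrounding whitespace in the link text
            name = " ".join("".join(self._text).split())
            self.topics.append({"topicName": name, "topicLink": self._href})
            self._href = None

def topics_to_csv(topics):
    """Storage step: serialize the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["topicName", "topicLink"])
    writer.writeheader()
    writer.writerows(topics)
    return buffer.getvalue()

parser = TopicLinkParser()
parser.feed(SAMPLE_HTML)
csv_text = topics_to_csv(parser.topics)
print(csv_text)
```

The column names `topicName` and `topicLink` match the ones used by the notebook cells that follow.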

#### Conclusion
This project serves as a practical introduction to web scraping and data analysis using Python, providing valuable experience in handling real-world data from an online health community.


<p style="color:#FE4406;text-align:center;font-size:30px"> Scraping Health Board's Posts and Articles </p>

In [34]:
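# beautifulsoup4 is the actual package name; bs4 on PyPI is only a thin wrapper around it
!pip install beautifulsoup4 selenium requests pandas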




[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip







In [5]:
# importing packages
import requests
from bs4 import BeautifulSoup

### Scraping Health Boards

In [6]:
## importing libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup

<p style="color:#FFC107;text-align:left;font-size:20px"> Searching for Health Board's health-related topics </p>

In [41]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup

# Set up the Selenium WebDriver
driver = webdriver.Chrome()  # Ensure you have the correct WebDriver
communities = []

# URL to scrape
url = 'https://www.healthboards.com/boards/hbcategory.php'
driver.get(url)

# Allow the page to load
time.sleep(5)

def scrape_current_page():
    """Collect topic names and links from the category page into `communities`."""
    # Parse the HTML rendered by the browser
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    new_content_found = False
    # Category titles sit in <font> tags; the topic <li> items follow their container as siblings
    titlesList = soup.find_all("font")

    for community_element in titlesList:
        try:
            aElement = community_element.parent.parent.parent
            for sibling in aElement.next_siblings:
                if str(sibling).startswith("<li>"):
                    topic = {
                        "topicName": sibling.a.contents[0],
                        "topicLink": sibling.a["href"],
                    }
                    # Skip topics that were already collected
                    if topic not in communities:
                        communities.append(topic)
                        new_content_found = True
        except Exception as e:
            print(f"Error processing category: {e}")

    return new_content_found

# Scrape content from the current page, closing the driver even if scraping fails
try:
    content_found = scrape_current_page()
finally:
    driver.quit()



In [44]:
# Convert the list of topic dictionaries to a DataFrame
import pandas as pd
communities=pd.DataFrame(communities)
communities.head()

Unnamed: 0,topicName,topicLink
0,Addison's Disease,https://www.healthboards.com/boards/forumdispl...
1,Arthritis,https://www.healthboards.com/boards/forumdispl...
2,Autoimmune Disorders,https://www.healthboards.com/boards/forumdispl...
3,Chronic Fatigue,https://www.healthboards.com/boards/forumdispl...
4,Diabetes,https://www.healthboards.com/boards/forumdispl...


In [None]:
import time
import pandas as pd
import requests


# List to store the final URLs
urls = []

# Function to get the final URL after redirects
def get_final_url(url):
    try:
        # Follow any redirects; the timeout keeps a stalled request from hanging the loop
        response = requests.get(url, allow_redirects=True, timeout=30)
        # Return the final URL after all redirects
        return response.url
    except requests.exceptions.RequestException as e:
        print(f"Error while accessing {url}: {e}")
        return None

# Loop through each topic and extract the URL after redirects
for index, row in communities.iterrows():
    topic_url = row['topicLink']  # 'topicLink' column of the communities DataFrame
    final_url = get_final_url(topic_url)
    communities.loc[index,'topicLink']=final_url
    print(final_url)
    urls.append(final_url)

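# Save to the path that the next cell reads from; index=False avoids an extra "Unnamed: 0" column
communities.to_csv("../data/healthBoards/healthBoardsTopics.csv", index=False)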

In [4]:
## Scraping topics from the topics list 
import pandas as pd 
topicsList=pd.read_csv("../data/healthBoards/healthBoardsTopics.csv")

In [7]:
import re
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.request import urlopen, Request
import pandas as pd

# List to store posts
posts = []

# Function to extract total pages
def get_total_pages(soup):
    try:
        page_info = soup.find("td", class_="vbmenu_control", style="font-weight:normal")
        if page_info:
            match = re.search(r"Page \d+ of (\d+)", page_info.text)
            if match:
                return int(match.group(1))
        return 1
    except Exception as e:
        print(f"Error determining total pages: {e}")
        return 1

# Function to scrape the current page
def scrape_current_page(url, topicTag):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
        request = Request(url, headers=headers)
        
        # Fetch page content
        with urlopen(request) as response:
            page_source = response.read()

        soup = BeautifulSoup(page_source, 'html.parser')
        table_element = soup.find("table", id="threadslist")

        if table_element:
            tr_elements = table_element.find_all("tr")
            for i in range(5, len(tr_elements)):
                try:
                    content_found = tr_elements[i]
                    author = content_found.find("div", class_="smallfont").contents[0].strip().replace(" ", "")

                    if author != "Administrator":
                        # Look up the thread title link once and reuse it
                        title_link = content_found.find("a", id=re.compile(".*thread_title.*"))
                        element = {
                            "commentsLink": title_link['href'],
                            "postTitle": title_link.contents[0],
                            "authorId": author,
                            "postId": title_link["id"].replace("thread_title_", ""),
                            "commentsCount": content_found.find_all("td", class_="alt2")[1]["title"].split("Replies:")[1].split(",")[0].strip(),
                            "createdAt": content_found.find_all("div", class_="smallfont")[1].contents[0].strip().replace(" ", "") + " " + content_found.find("span", class_="time").contents[0].strip().replace(" ", ""),
                            "collectedAt": datetime.now(),
                            "topicTag": topicTag
                        }
                        print(element)
                        posts.append(element)

                        # Save periodically
                        if len(posts) % 500 == 0:
                            save_posts()
                except Exception as e:
                    print(f"Error processing post: {e}")
                    continue
    except Exception as e:
        print(f"Error scraping URL {url}: {e}")

# Function to save posts periodically
def save_posts():
    try:
        df = pd.DataFrame(posts)
        df.to_csv("../data/healthBoards/healthBordsPosts.csv", index=False)
        print(f"Saved {len(posts)} posts.")
    except Exception as e:
        print(f"Error saving posts: {e}")

# Main script
try:
    for index in range(0, len(topicsList)):
        try:
            topic_url = topicsList.loc[index, "topicLink"]
            topic_tag = topicsList.loc[index, "topicName"]

            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
            request = Request(topic_url, headers=headers)

            # Fetch first page to determine total pages
            with urlopen(request) as response:
                page_source = response.read()

            first_page_soup = BeautifulSoup(page_source, 'html.parser')
            total_pages = get_total_pages(first_page_soup)
            print(f"The number of total pages is {total_pages}")

            for page_number in range(1, total_pages + 1):
                try:
                    if page_number == 1:
                        current_page_url = topic_url
                    else:
                        current_page_url = f"{topic_url}index{page_number}.html"

                    print(f"Navigating to: {current_page_url}")
                    scrape_current_page(current_page_url, topic_tag)
                except Exception as e:
                    # The for loop still advances, so a failing page cannot be retried forever
                    print(f"Error processing page {page_number}: {e}")
                    continue

        except Exception as e:
            print(f"Error processing topic index {index}: {e}")
            continue
except Exception as e:
    print(f"Critical error in main loop: {e}")

# Final save
try:
    save_posts()
except Exception as e:
    print(f"Error during final save: {e}")


The number of total pages is 60
Navigating to: https://www.healthboards.com/boards/addisons-disease/
{'commentsLink': 'https://www.healthboards.com/boards/addisons-disease/1046249-addisons-d-lactose-gluten-intolerance-what-medications-can-i-take.html', 'postTitle': "Addison's D. + Lactose+gluten intolerance. What medications can I take?", 'authorId': 'Ivva', 'postId': '1046249', 'commentsCount': '3', 'createdAt': '05-10-2022 12:20PM', 'collectedAt': datetime.datetime(2024, 12, 8, 9, 15, 4, 534124), 'topicTag': "Addison's Disease"}
{'commentsLink': 'https://www.healthboards.com/boards/addisons-disease/1053265-high-heart-rate-sweating.html', 'postTitle': 'High heart rate and sweating', 'authorId': 'NormW', 'postId': '1053265', 'commentsCount': '0', 'createdAt': '10-08-2021 07:36PM', 'collectedAt': datetime.datetime(2024, 12, 8, 9, 15, 4, 534124), 'topicTag': "Addison's Disease"}
{'commentsLink': 'https://www.healthboards.com/boards/addisons-disease/1044377-testing-diagnosis.html', 'postT

KeyboardInterrupt: 
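
The pagination logic used in the cell above (read the page count from text such as "Page 1 of 60", then request `index2.html`, `index3.html`, and so on) can be isolated into two small helpers. The following is a sketch under assumptions: the names `parse_total_pages` and `page_urls` are hypothetical, and the `index{n}.html` pattern is taken from the scraping cell rather than from any site documentation.

```python
import re
from urllib.parse import urljoin

def parse_total_pages(pager_text):
    """Return N from text like 'Page 1 of N', or 1 when no pager text is found."""
    match = re.search(r"Page \d+ of (\d+)", pager_text or "")
    return int(match.group(1)) if match else 1

def page_urls(topic_url, total_pages):
    """List the URL of every page of a topic, first page included."""
    # Normalize to a trailing slash so urljoin appends instead of replacing the last segment
    base = topic_url if topic_url.endswith("/") else topic_url + "/"
    urls = [base]
    for page_number in range(2, total_pages + 1):
        urls.append(urljoin(base, f"index{page_number}.html"))
    return urls

total = parse_total_pages("Page 1 of 3")
for url in page_urls("https://www.healthboards.com/boards/addisons-disease", total):
    print(url)
```

Normalizing the trailing slash means a topic link saved without one still produces well-formed page URLs.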

### Collecting Post Texts

In [1]:
import pandas as pd 
postsDataset=pd.read_csv("../data/healthBoards/healthBordsPosts.csv")

In [2]:
import re
from bs4 import BeautifulSoup
import pandas as pd
import time
from datetime import datetime
from urllib.request import urlopen, Request

# Function to append multiple records to the file
def append_records_to_file(records):
    try:
        records.to_csv(
            "../data/healthBoards/healthBordsPostsData.csv",
            mode="a",
            header=False,
            index=False,
            errors="ignore"  # Ignores problematic characters
        )
        print(f"{len(records)} records appended to the file.")
    except Exception as e:
        # Report the actual error instead of printing the exception class
        print(f"Error appending records: {e}")
        

# Function to scrape a single post
def scrape_current_page_with_urllib(url):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
        request = Request(url, headers=headers)
        with urlopen(request) as response:
            page_source = response.read()
        soup = BeautifulSoup(page_source, 'html.parser')
        element = soup.find("div", id=re.compile(r"post_message_.*"))
        if element:
            text = element.get_text()
            cleaned_text = re.sub(r"\s+", " ", text).strip()
            return cleaned_text
        else:
            return "This post has been deleted by the author."
    except Exception as e:
        print(f"Error processing post: {e}")
        return "An error occurred while scraping the post."

# Function to handle retries
def safe_scrape_with_urllib(url, retries=3):
    for attempt in range(retries):
        result = scrape_current_page_with_urllib(url)
        if result != "An error occurred while scraping the post.":
            return result
        else:
            print(f"Retrying ({attempt + 1}/{retries})...")
            time.sleep(5)
    return "Failed after multiple attempts."
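
The retry helper above signals failure through sentinel strings, which end up stored as post text. One alternative, sketched here under assumptions, is to raise an exception after the last attempt and to wait longer between attempts (exponential backoff). The names `ScrapeError`, `fetch_with_retries`, and `flaky_fetch` are hypothetical, and the fetch function is passed in so the example runs without network access.

```python
import time

class ScrapeError(Exception):
    """Raised when a page cannot be fetched after all retries."""

def fetch_with_retries(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt and try again."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as error:  # in real use, catch narrower exception types
            last_error = error
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise ScrapeError(f"{url} failed after {retries} attempts") from last_error

# Usage with a stand-in fetcher that fails twice, then succeeds
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"content of {url}"

delays = []  # record the requested delays instead of actually sleeping
result = fetch_with_retries(flaky_fetch, "https://example.com/post", sleep=delays.append)
print(result, delays)
```

Raising an exception keeps error markers out of the `postText` column, so failed posts can be logged and retried separately.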


In [None]:
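import os

# Make sure the header includes the postText column that the loop below fills in
if "postText" not in postsDataset.columns:
    postsDataset["postText"] = ""

# Check if file exists; write header only if file does not exist
file_path = "../data/healthBoards/healthBordsPostsData.csv"
if not os.path.exists(file_path):
    postsDataset.head(0).to_csv(file_path, mode="w", index=False)  # Write header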

# Temporary DataFrame to store processed posts
temp_data = []

# Process the dataset
# 305549 is a hard-coded start offset (the output below begins at post 305550)
for index in range(305549, len(postsDataset)):
    try:
        post = postsDataset.iloc[index]
        link = post["commentsLink"]
        resultScr = safe_scrape_with_urllib(link)
        print(f"Processed post {index + 1}: {resultScr}")

        # Update the current row with the scraped text
        postsDataset.loc[index, "postText"] = resultScr
        temp_data.append(postsDataset.loc[index])

        # Save every 100 posts
        if len(temp_data) >= 100:
            append_records_to_file(pd.DataFrame(temp_data))
            temp_data = []  # Clear the temporary data
    except Exception as e:
        postsDataset.loc[index, "postText"] = f"An error occurred: {e}"
        continue

# Save any remaining posts in temp_data
if temp_data:
    append_records_to_file(pd.DataFrame(temp_data))


Processed post 305550: I am 48 years old and starting to show all the symptoms of perimenopause; hair growing in places I don't want it to, hot flashes, difficulty sleeping, weepiness, crankiness, brain fog, vaginal dryness, etc. preventing proper sleep, and putting a crimp on all other wonderful things that make life great. I had an endometrial ablation 5 years ago to deal with heavy and painful periods. The benefits ended up being extremely light periods 1.5 days requiring only a pantyliner and my period becoming "regular" at 26 days. Before that my period was all over the map at anywhere from 30 to 55 days. The only thing that's changed is that now my period is 7 days long, still very light, but 7 days nonetheless. Since I am having regular period, my Dr. is reluctant to prescribe a small amount of natural hormones to eliminate some of the symptoms so I can sleep. At my insistance, my Dr. is sending me for a blood test to establish my FSH level. What is the best time of the monthly 

In [4]:
append_records_to_file(pd.DataFrame(temp_data))

24 records appended to the file.
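
Because records are appended in batches, a resumed run needs to know how many posts are already on disk. The processing loop uses a hard-coded start index; a possible alternative is to count the rows already saved. The sketch below is hypothetical (the `count_saved_rows` name and the demo file are not part of the notebook) and assumes that rows are appended in dataset order with no gaps, which does not hold if some posts were skipped after an exception.

```python
import csv
import os
import tempfile

def count_saved_rows(file_path):
    """Count data rows (excluding the header) in a CSV file; 0 if the file is missing."""
    if not os.path.exists(file_path):
        return 0
    with open(file_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        # csv.reader keeps quoted multi-line fields together as one row
        return sum(1 for _ in reader)

# Demonstration on a temporary file containing one multi-line post text
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "posts.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["postId", "postText"])
        writer.writerow(["1", "first line\nsecond line"])
        writer.writerow(["2", "short post"])
    start_index = count_saved_rows(demo_path)
    missing = count_saved_rows(os.path.join(folder, "absent.csv"))

print(start_index, missing)
```

Counting with `csv.reader` rather than counting raw lines matters here because post texts may contain line breaks.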


In [4]:
import pandas as pd

# Handler that logs bad lines; pandas only accepts a callable here with engine="python"
def bad_line_handler(line):
    print(f"Bad line: {line}")

# Warn about malformed rows and skip them
data = pd.read_csv(
    "../data/healthBoards/healthBordsPostsData.csv",
    on_bad_lines='warn',  # or pass bad_line_handler together with engine="python"
)

print(data.head())


Skipping line 206906: expected 9 fields, saw 10
Skipping line 206907: expected 9 fields, saw 10
Skipping line 206908: expected 9 fields, saw 10
Skipping line 206909: expected 9 fields, saw 10
Skipping line 206910: expected 9 fields, saw 10
Skipping line 206911: expected 9 fields, saw 10
Skipping line 206912: expected 9 fields, saw 10
Skipping line 206913: expected 9 fields, saw 10
Skipping line 206914: expected 9 fields, saw 10
Skipping line 206915: expected 9 fields, saw 10
Skipping line 206916: expected 9 fields, saw 10
Skipping line 206917: expected 9 fields, saw 10
Skipping line 206918: expected 9 fields, saw 10
Skipping line 206919: expected 9 fields, saw 10
Skipping line 206920: expected 9 fields, saw 10
Skipping line 206921: expected 9 fields, saw 10
Skipping line 206922: expected 9 fields, saw 10
Skipping line 206923: expected 9 fields, saw 10
Skipping line 206924: expected 9 fields, saw 10
Skipping line 206925: expected 9 fields, saw 10
Skipping line 206926: expected 9 fields,

ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
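
The warnings above report rows with more fields than the header. Before deciding how to repair the file, it can help to list those rows. The following sketch uses the standard `csv` module on an in-memory sample; the `find_bad_rows` name and the sample data are hypothetical, and the sketch does not diagnose why the real file contains such rows.

```python
import csv
import io

def find_bad_rows(handle):
    """Yield (record_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(handle)
    header = next(reader)
    # Record numbers start at 2 because the header is record 1
    for record_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            yield record_number, len(row)

sample = io.StringIO(
    "postId,postTitle,postText\n"
    "1,Hello,fine row\n"
    "2,Oops,extra,field\n"
    '3,Quoted,"commas, inside quotes are fine"\n'
)
bad_rows = list(find_bad_rows(sample))
print(bad_rows)
```

To run it against a real file, open the file with `newline=""` and pass the file handle in place of `sample`.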

In [4]:
data.head()

Unnamed: 0,postId,commentsLink,postTitle,authorId,postId.1,commentsCount,createdAt,collectedAt,topicTag,postText
0,https://www.healthboards.com/boards/addisons-d...,Addison's D. + Lactose+gluten intolerance. Wha...,Ivva,1046249,3,05-10-2022 12:20PM,2024-12-05 09:24:38.891737,Addison's Disease,Hi everyone! I was diagnosed with Addison's mo...,
1,https://www.healthboards.com/boards/addisons-d...,High heart rate and sweating,NormW,1053265,0,10-08-2021 07:36PM,2024-12-05 09:24:38.891737,Addison's Disease,"Hello, I constantly have a high heart rate and...",
2,https://www.healthboards.com/boards/addisons-d...,Testing and Diagnosis,RaleighMom,1044377,1,10-09-2018 03:44PM,2024-12-05 09:24:38.892736,Addison's Disease,"Hi, I'm new here but hoping to get some advice...",
3,https://www.healthboards.com/boards/addisons-d...,Need advice on colon issues with Addison's,Ambersmom09,1043202,1,10-09-2018 03:38PM,2024-12-05 09:24:38.893736,Addison's Disease,I have a question. My mother was recently diag...,
4,https://www.healthboards.com/boards/addisons-d...,Addisons Disease for 62 years,huey35,1045227,2,10-09-2018 03:24PM,2024-12-05 09:24:38.893736,Addison's Disease,I am 82 and have been primary Addisons Disease...,
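
The `createdAt` values in the table above are stored as strings such as `05-10-2022 12:20PM`. For time-based analysis they can be converted to `datetime` objects. This is a sketch under a stated assumption: the dates are taken to be month-day-year, which the data shown here does not confirm, and `parse_created_at` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed layout: month-day-year, then a 12-hour time with no space before AM/PM
CREATED_AT_FORMAT = "%m-%d-%Y %I:%M%p"

def parse_created_at(value):
    """Parse a scraped createdAt string; return None if it does not match the format."""
    try:
        return datetime.strptime(value.strip(), CREATED_AT_FORMAT)
    except (AttributeError, ValueError):
        # AttributeError covers non-string input such as None or NaN
        return None

parsed = parse_created_at("05-10-2022 12:20PM")
print(parsed)
print(parse_created_at("Yesterday 07:36PM"))
```

Returning `None` for unparseable values makes it straightforward to count and inspect them before choosing how to handle them.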
