# Web Scraping The Hill News website  
## www.thehill.com
The Hill is a news outlet that covers politics, policy, and government affairs in the United States. It has a reputation for in-depth analysis and reporting on the latest political developments, which makes it a valuable source of information for researchers, journalists, and political analysts. In this project we scrape articles from different topic categories and compare how sentiment differs across them. The topics included are News, Policy, Business, Opinion, Events, and Jobs. We expect that some categories, such as Opinion, will yield different sentiment results than others. Grouping articles into categories lets us analyze each category independently as well as all articles together.

### Web scraping

Libraries Needed

In [5]:
import os
import requests
from bs4 import BeautifulSoup

The entire web scraping script is below. It saves each article's HTML file into a folder named after the category the article belongs to.

Note: The script takes over an hour to run.

The first article on many pages caused issues, so we decided to skip those articles.
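
The skip rule can be expressed as a small link-resolution helper. This is only a sketch, not the code used for the scrape: the name `resolve_article_link` and the list of rejected prefixes are assumptions made for illustration, and the example path is made up. It uses `urljoin` from the standard library so that relative links resolve correctly whether or not they start with a slash.

```python
from urllib.parse import urljoin


def resolve_article_link(base_url, href):
    """Return an absolute URL for href, or None if it is missing or unusable.

    Hypothetical helper for illustration; not part of the original script.
    """
    if not href or not href.strip():
        return None  # no link to follow, so the caller should skip this article
    href = href.strip()
    if href.startswith(("#", "javascript:", "mailto:")):
        return None  # in-page anchors and non-HTTP schemes are not articles
    # urljoin leaves absolute URLs untouched and resolves relative ones
    return urljoin(base_url.rstrip("/") + "/", href)


# Made-up example path, only to show the call shape
print(resolve_article_link("https://thehill.com", "/policy/example-article/"))
print(resolve_article_link("https://thehill.com", None))
```

A caller would skip any article for which the helper returns `None`, which covers the case of an anchor tag with no usable `href`.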

In [8]:
# Function for scraping
def scrape_section_articles(base_url, section, folder_name, start_page=1, end_page=1):
    section_url = f"{base_url}/{section}/"
    section_folder = os.path.join(folder_name, section)

    # Create the section folder if it does not exist yet
    os.makedirs(section_folder, exist_ok=True)

    # This outer for loop gets the html response from each page
    for page_num in range(start_page, end_page + 1):
        page_url = f"{section_url}?page_num={page_num}"
        response = requests.get(page_url)
        soup = BeautifulSoup(response.text, 'html.parser')

        articles = soup.find_all('article', class_='archive__item')

        # This inner for loop goes into the html response from each page
        # and gets the href link for each article on the page, and then writes it to a new file in our folder

        for idx, article in enumerate(articles):
            article_anchor = article.find('a')
            # .get avoids a KeyError when the anchor tag has no href attribute
            if article_anchor and article_anchor.get('href'):
                article_link = article_anchor['href']
                if not article_link.startswith('http'):
                    article_link = base_url + article_link

                article_response = requests.get(article_link)
                article_soup = BeautifulSoup(article_response.text, 'html.parser')

                file_name = f"{section}_page{page_num}_article{idx+1}.html"
                with open(os.path.join(section_folder, file_name), 'w', encoding='utf-8') as f:
                    f.write(str(article_soup))

            # Some articles would not scrape correctly, specifically the first article on each page, so skip these
            else:
                print(f"Skipped article {idx+1} on {section} page {page_num} due to missing or invalid URL.")

# Setting search parameters
base_url = 'https://thehill.com'
sections = ['news', 'policy', 'business', 'opinion'] # All Categories of Articles
folder_name = "the_hill_HTML"

# Create the output folder if it does not exist yet
os.makedirs(folder_name, exist_ok=True)

# Run the scraper for each section, printing progress so problems are easier to trace
for section in sections:
    print(f"Scraping articles from {section} section...")
    scrape_section_articles(base_url, section, folder_name, start_page=1, end_page=100)

print(f'Articles saved as HTML files in the "{folder_name}" folder.')


Scraping articles from news section...
Skipped article 1 on news page 1 due to missing or invalid URL.
Skipped article 1 on news page 2 due to missing or invalid URL.
Skipped article 1 on news page 3 due to missing or invalid URL.
Skipped article 1 on news page 4 due to missing or invalid URL.
Skipped article 1 on news page 5 due to missing or invalid URL.
Skipped article 1 on news page 6 due to missing or invalid URL.
Skipped article 1 on news page 7 due to missing or invalid URL.
Skipped article 1 on news page 8 due to missing or invalid URL.
Skipped article 1 on news page 9 due to missing or invalid URL.
Skipped article 1 on news page 10 due to missing or invalid URL.
Skipped article 1 on news page 11 due to missing or invalid URL.
Skipped article 1 on news page 12 due to missing or invalid URL.
Skipped article 1 on news page 13 due to missing or invalid URL.
Skipped article 1 on news page 14 due to missing or invalid URL.
Skipped article 1 on news page 15 due to missing or invalid 
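
Because a full run takes over an hour, it may be worth making the scrape resumable so that an interrupted run does not start from scratch. The sketch below shows one way this could look. The names `save_if_missing`, `fetch`, and `delay_seconds` are hypothetical and do not appear in the script above, and the one-second default pause is an arbitrary choice.

```python
import os
import time


def save_if_missing(url, path, fetch, delay_seconds=1.0):
    """Download url to path unless the file already exists.

    fetch is any callable that takes a URL and returns the page text. It is
    passed in so the sketch does not depend on a specific HTTP library.
    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(path):
        return False  # already saved by an earlier run, so nothing to do
    html = fetch(url)
    # Make sure the target folder exists before writing
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    time.sleep(delay_seconds)  # pause between requests to go easy on the server
    return True
```

With the libraries imported earlier, `fetch` could be something like `lambda url: requests.get(url, timeout=30).text`, where the 30-second timeout is again an assumption. Rerunning the scrape would then only request articles whose files are not yet on disk.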

### Counting how many articles were scraped

In [11]:
folder_name = "the_hill_HTML"
total_files = 0

# Iterate through each subfolder in the folder
for subfolder in os.listdir(folder_name):
    subfolder_path = os.path.join(folder_name, subfolder)

    # Check if the subfolder_path is a directory
    if os.path.isdir(subfolder_path):
        subfolder_files = 0
        # Iterate through each file in the subfolder
        for file in os.listdir(subfolder_path):
            file_path = os.path.join(subfolder_path, file)

            # Check if the file_path is a file, not a directory
            if os.path.isfile(file_path):
                subfolder_files += 1

        print(f'There are {subfolder_files} files in the "{subfolder}" subfolder.')
        total_files += subfolder_files

print(f'There are {total_files} files in total in the "{folder_name}" folder.')

There are 1897 files in the "business" subfolder.
There are 3400 files in the "opinion" subfolder.
There are 1600 files in the "policy" subfolder.
There are 1600 files in the "news" subfolder.
There are 8497 files in total in the "the_hill_HTML" folder.
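
The same count can be written more compactly with `pathlib`. This is an alternative sketch rather than the code that produced the numbers above, and the function name `count_files_by_subfolder` is a hypothetical one chosen for illustration.

```python
from pathlib import Path


def count_files_by_subfolder(root):
    """Return {subfolder name: number of regular files directly inside it}."""
    counts = {}
    for sub in sorted(Path(root).iterdir()):
        if sub.is_dir():
            # Count only files, ignoring any nested directories
            counts[sub.name] = sum(1 for p in sub.iterdir() if p.is_file())
    return counts


folder_name = "the_hill_HTML"
if Path(folder_name).is_dir():  # only report if the scrape output is present
    counts = count_files_by_subfolder(folder_name)
    for name, n in counts.items():
        print(f'There are {n} files in the "{name}" subfolder.')
    print(f'There are {sum(counts.values())} files in total in the "{folder_name}" folder.')
```

Returning a dictionary instead of printing inside the loop makes the per-category counts available for later steps, such as checking that each category has enough articles before analysis.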
