Python Scripts for Academic Web Scraping of WSJ Articles: Database Setup, Crawl, and Scrape.

Scraping Newspaper Articles:

This GitHub repository is part of my work as a Research Assistant. It hosts the code and scripts employed in the research. The scraped data is used strictly for research purposes.

Data Availability:

Please note that the data itself is not provided in this repository.

Note: The data collected will never be shared or used for any purpose other than the research for which it was intended.

Prerequisites:

  • A .env file needs to be created to host passwords and user credentials (see the example .env sketch after this list).
  • A folder named article_titles_json should be created if you wish to additionally save article titles in JSON format.
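A minimal sketch of the .env file is shown below; the variable names are placeholders chosen for illustration, so use whichever keys the scripts actually read via dotenv:

```
# Example .env file -- variable names are illustrative, not necessarily the ones the scripts expect
WSJ_USERNAME=your_wsj_username
WSJ_PASSWORD=your_wsj_password
```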

Warning: When you run the script for the first time, you may encounter cookie consent popups on the website. Dismiss these manually according to your preference before continuing with the web scraping.


Files

First run createDB.py to create the database, then run crawl.py to collect the links of the existing web pages, and finally run web_scrap.py to download the content of those pages.

crawl.py:

The code performs web scraping of articles from the Wall Street Journal (WSJ) for a specific year. It extracts the article details and stores them in an SQLite database and also saves the information in JSON files. It uses libraries such as requests, BeautifulSoup, json, and sqlite3. The code is organized into distinct classes and methods:

  • ManagementDB class: Handles SQLite database operations, such as inserting article details and tracking page explorations.
  • WebScrap class: Manages the web scraping logic, extracting article details like headlines, timestamps, and links.
  • Uses the requests library to make HTTP requests and fetch web pages.
  • Stores the scraped article information both in an SQLite database and JSON files.
  • Supports pagination by incrementing the page number until no more articles are found.

The script also includes rate-limiting measures via sleep timings and prints debugging information such as page status codes and the URLs being explored.
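A simplified sketch of this pagination loop is shown below. The archive URL pattern, the HTML selectors, and the table columns are illustrative assumptions rather than the exact ones used in crawl.py:

```python
import time
import sqlite3
import requests
from bs4 import BeautifulSoup

# Hypothetical archive URL pattern -- the real one lives in crawl.py.
BASE_URL = "https://www.wsj.com/news/archive/{year}?page={page}"

conn = sqlite3.connect("articles.db")
page = 1
while True:
    response = requests.get(BASE_URL.format(year=2020, page=page),
                            headers={"User-Agent": "Mozilla/5.0"})
    print(page, response.status_code)           # debugging info: status code per page
    soup = BeautifulSoup(response.text, "html.parser")
    articles = soup.find_all("article")          # assumed selector
    if not articles:                             # pagination stops when a page has no articles
        break
    for art in articles:
        headline = art.get_text(strip=True)
        anchor = art.find("a")
        link = anchor["href"] if anchor else None
        conn.execute("INSERT INTO articles_index (headline, link) VALUES (?, ?)",
                     (headline, link))
    conn.commit()
    page += 1
    time.sleep(2)                                # rate limiting between pages
conn.close()
```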


createDB.py:

This script is responsible for initializing the SQLite database that will store the scraped articles and related information. It creates three tables: articles_index, article, and exploration. Each table is designed to hold specific types of data that are crucial for the web scraping and data analysis process.

The articles_index table contains metadata about the articles such as the headline, publication time, and link.

The article table stores detailed information about each article, including any associated images and the text corpus.

The exploration table logs details about the web scraping process itself, including the URLs that have been checked, the number of articles found on each page, and whether or not values were extracted from that page.
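For illustration, the schema could look roughly like the sketch below; the column names beyond those described above are guesses rather than the exact definitions in createDB.py:

```python
import sqlite3

conn = sqlite3.connect("articles.db")
cur = conn.cursor()

# Metadata about each article found while crawling.
cur.execute("""CREATE TABLE IF NOT EXISTS articles_index (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    headline TEXT,
    publication_time TEXT,
    link TEXT
)""")

# Full content scraped for each article.
cur.execute("""CREATE TABLE IF NOT EXISTS article (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    link TEXT,
    images TEXT,
    corpus TEXT
)""")

# Log of the crawl itself: pages checked and what was found on each.
cur.execute("""CREATE TABLE IF NOT EXISTS exploration (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT,
    n_articles INTEGER,
    values_extracted INTEGER
)""")

conn.commit()
conn.close()
```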


delete_duplicates.py:

This script is designed to eliminate duplicate records from the articles_index table in the SQLite database. It does so by identifying articles with the same link and retaining only one instance of each while deleting the others. The number of deleted elements is displayed at the end.
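A common way to express this in SQLite is to keep the lowest rowid per link, as in the sketch below (illustrative, not necessarily the exact query the script uses):

```python
import sqlite3

conn = sqlite3.connect("articles.db")
cur = conn.cursor()
before = cur.execute("SELECT COUNT(*) FROM articles_index").fetchone()[0]

# Keep one row per link (the one with the smallest rowid) and delete the rest.
cur.execute("""
    DELETE FROM articles_index
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM articles_index GROUP BY link)
""")
conn.commit()

after = cur.execute("SELECT COUNT(*) FROM articles_index").fetchone()[0]
print(f"Deleted {before - after} duplicate rows")
conn.close()
```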


delete_jumplines.py:

This script is aimed at cleaning the headline field in the articles_index table of the SQLite database. Specifically, it removes newline characters ('\n') that may have been inadvertently inserted during the scraping process. Modified headlines are updated in the database, and a log is printed to indicate the IDs of the updated records.
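The cleaning step amounts to something like the following sketch (the actual script may differ in the query and logging details):

```python
import sqlite3

conn = sqlite3.connect("articles.db")
cur = conn.cursor()

# Find headlines containing newline characters and rewrite them without them.
rows = cur.execute(
    "SELECT id, headline FROM articles_index WHERE headline LIKE '%' || char(10) || '%'"
).fetchall()

for row_id, headline in rows:
    cleaned = headline.replace("\n", " ").strip()
    cur.execute("UPDATE articles_index SET headline = ? WHERE id = ?", (cleaned, row_id))
    print(f"Updated headline for id {row_id}")   # log the IDs of updated records

conn.commit()
conn.close()
```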

web_scrap.py:

This script performs automated web scraping of Wall Street Journal articles. It randomly picks n article links from the articles_index table that have not yet been scraped and uses Selenium for browser automation and navigation. The results are saved in the article table. The script is divided into different classes and functions for better modularity:

  • Scraper class initializes a web driver and prepares the browser for scraping.
  • Search4Articles class inherits from Scraper and provides specific functionalities for WSJ such as signing in and scraping articles.
  • The script leverages the .env file to securely manage sensitive information like login credentials.
  • It fetches unscanned article links from an SQLite database, scrapes their content, and then updates the database.
  • Logs are generated using Python's built-in logging module.
  • Scraped articles, along with their metadata, are inserted back into the SQLite database.

For performance and rate-limiting, the script includes various sleep timings. It also prints out debugging and state information during execution.
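A condensed sketch of this workflow is shown below. The Scraper and Search4Articles classes are only referenced in spirit here; the database queries, selectors, environment variable names, and sign-in step are placeholders rather than the script's actual implementation:

```python
import os
import time
import random
import sqlite3
import logging
from dotenv import load_dotenv
from selenium import webdriver
from selenium.webdriver.common.by import By

load_dotenv()                          # credentials come from the .env file
logging.basicConfig(level=logging.INFO)

conn = sqlite3.connect("articles.db")
# Links in articles_index with no corresponding row in article (i.e. not yet scraped).
links = [r[0] for r in conn.execute(
    "SELECT link FROM articles_index WHERE link NOT IN (SELECT link FROM article)"
)]
n = 5
sample = random.sample(links, min(n, len(links)))

driver = webdriver.Chrome()            # the Scraper class would normally set this up
# ... sign in here using os.getenv("WSJ_USERNAME") / os.getenv("WSJ_PASSWORD") ...

for link in sample:
    driver.get(link)
    time.sleep(3)                      # rate limiting between articles
    paragraphs = driver.find_elements(By.TAG_NAME, "p")   # assumed selector
    corpus = "\n".join(p.text for p in paragraphs)
    conn.execute("INSERT INTO article (link, corpus) VALUES (?, ?)", (link, corpus))
    conn.commit()
    logging.info("Scraped %s", link)

driver.quit()
conn.close()
```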

Additional Information

Repository Guidelines:

The source code for this project is publicly available on GitHub for educational and illustrative purposes. However, it is crucial to note:

  • The code should not be used for any illegal or unethical activities.
  • Respect the rate-limiting and scraping guidelines of the websites you are targeting.
  • The author reserves the right to delete the GitHub repository or make it private without prior notice if it is found to be misused.

Library Prerequisites:

These scripts rely on several Python libraries for their functionality. Ensure you have the following libraries installed before running them:

  • requests - For making HTTP requests.
  • BeautifulSoup from bs4 - For parsing HTML content.
  • json - For handling JSON data.
  • datetime - For manipulating and formatting dates and times.
  • sqlite3 - For database management.
  • time - For sleep timings.
  • dotenv - For environment variable management.
  • numpy - For numerical computations.
  • os - For system and environment manipulation.
  • selenium - For web browser automation.
