This script provides a comprehensive solution for web scraping and summarization using Python, Selenium, BeautifulSoup, Trafilatura, and OpenAI's GPT-3.5 Turbo. It automates the process of extracting relevant content from web pages, filtering and cleaning the data, and generating concise summaries using advanced natural language processing techniques.
- Proxy Management: Automatically loads and manages a list of proxies to avoid IP bans and ensure continuous scraping.
- Headless Browser: Uses Selenium with headless Chrome for efficient and seamless web scraping.
- Content Extraction: Employs Trafilatura to extract and clean textual content from web pages.
- URL Filtering: Filters out irrelevant URLs based on predefined criteria to ensure only relevant content is processed.
- Summarization: Utilizes OpenAI's GPT-3.5 Turbo to generate concise summaries of extracted content.
- Logging: Provides detailed logging for monitoring script execution and debugging purposes.
- Configuration Management: Stores configurable parameters such as API keys, file paths, and directories in a JSON settings file for easy modification.
Before running the script, ensure you have the following installed:
- Python 3.6+
- Selenium
- Selenium Wire
- BeautifulSoup4
- Trafilatura
- OpenAI
- Chromedriver (compatible with your version of Chrome)
- Retry (for retrying failed requests)
-
Clone the Repository:
git clone https://github.com/iamBlogPro/Scrape-Summarize.git cd Scrape-Summarize
-
Install Required Packages:
pip install -r requirements.txt
-
Download Chromedriver:
Download Chromedriver from here and place it in the same directory as the script.
-
Edit the
settings.json
:Edit a
settings.json
file in the root directory with the following content:{ "openai_api_key": "your_openai_api_key", "output_directory": "Output_Folder_Name", "keywords_file": "keywords.csv", "proxy_file": "proxylist.txt" }
Replace
"your_openai_api_key"
with your actual OpenAI API key
-
Prepare Input Files:
- Keywords File (
keywords.csv
): A CSV file containing keywords to search for. Each keyword should be on a separate line without a header. - Proxy File (
proxylist.txt
): A text file containing proxies in the formatip:port:username:password
, one per line.
- Keywords File (
-
Run the Script:
python magic.py
-
Initialization:
- Loads settings from
settings.json
. - Sets up logging to capture script execution details.
- Loads settings from
-
Proxy Management:
- Loads proxies from
proxylist.txt
.
- Loads proxies from
-
Keyword Processing:
- Reads keywords from
keywords.csv
. - For each keyword, it performs a Google search using Selenium.
- Reads keywords from
-
Content Extraction:
- Retrieves the page source and uses BeautifulSoup to extract URLs.
- Filters URLs based on predefined criteria (e.g., domain, extensions, keywords).
-
Data Extraction and Summarization:
- Extracts content from each URL using Trafilatura.
- Cleans the extracted content.
- Summarizes the content using OpenAI's GPT-3.5 Turbo.
-
Output:
- Saves the summarized content to JSON files in the specified output directory.
- Updates the
keywords.csv
file to remove processed keywords.
- The script generates log files in the
logs
directory, which can be used to monitor its execution and troubleshoot any issues.
- Modify the
settings.json
file to update the OpenAI API key, output directory, keywords file, and proxy file as needed. - Adjust filtering criteria in the
get_links_with_beautifulsoup
function to refine URL selection based on your requirements.