**Before starting, make sure GeckoDriver is installed and on your PATH. The GeckoDriver install guide is included alongside this file. The code below checks whether it can be found.**
Google's robots.txt file disallows scraping of certain pages, including Google Search and Google News. This notebook is strictly for personal use.

In [None]:
import subprocess

def check_geckodriver():
    try:
        # Execute the command to get the GeckoDriver version
        result = subprocess.run(['geckodriver', '--version'], capture_output=True, text=True)
        
        # Check if the command was successful
        if result.returncode == 0:
            print("GeckoDriver is on PATH.")
            print("Version:", result.stdout.strip())
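        else:
            # The executable was found but exited with a non-zero status
            print("GeckoDriver was found but returned an error:", result.stderr.strip())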
    except FileNotFoundError:
        print("GeckoDriver is not found. Please ensure it is installed and in your PATH.")

# Call the function to check GeckoDriver status
check_geckodriver()


# Setting Up Your Firefox Browser Profile

1. First, go to `about:profiles` in your Firefox browser.
2. Click **Create a New Profile**.
3. Click **Next**.
4. Enter a new unique profile name (e.g., `selenium`).
5. Scroll down to find your newly created profile.
6. Click **Launch Profile in New Browser**.

Keep using your **main profile** for everyday browsing; the profile you just created is only for the Python bot. A profile cannot be open in a normal Firefox window while the bot is using it, so close any window running this profile before starting the script. Other Firefox profiles can stay open at the same time.

# Changing the User Agent in Firefox (for your selenium profile)

1. Go to [User-Agent Switcher and Manager](https://addons.mozilla.org/en-US/firefox/addon/user-agent-string-switcher/).
2. Click **Add to Firefox**.
3. Click the puzzle icon beside the address bar.
4. Click on the extension **User-Agent Switcher and Manager**.
5. In the drop-down menu, select **Firefox** (it initially shows Chrome), then select **Android** (it initially shows Windows or Mac), and finally, choose the latest user-agent.
6. Click **Apply (container)**.

If it says "User-Agent is set," the change was successful. You can click **Reset (container)** to revert.


# Turning Off JavaScript in Firefox

1. Go to [Disable JavaScript Add-on](https://addons.mozilla.org/en-US/firefox/addon/disable-javascript/).
2. Click **Add to Firefox** to install the extension.
3. After installation, click the puzzle icon beside the address bar.
4. Right-click the **Disable JavaScript** icon and select **Pin to Toolbar** to make it easily accessible.
5. Right-click the **Disable JavaScript** icon again, click **Open Disable JS Settings**, then set **Default State** to **JS off**.

To enable JavaScript again, adjust the settings or click the icon as needed.
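
As an alternative to configuring the two extensions by hand, the same two settings can in principle be applied from Python through Firefox preferences. The sketch below is an illustration under assumptions, not part of the original script: `build_prefs` and `apply_prefs` are hypothetical helper names, the user-agent string is only an example value, and it assumes the `general.useragent.override` and `javascript.enabled` preferences take effect in your Firefox version.

```python
# Hypothetical helpers (not part of the original script).
# Example value only: replace with the user-agent string you actually want.
EXAMPLE_ANDROID_UA = (
    "Mozilla/5.0 (Android 13; Mobile; rv:120.0) Gecko/120.0 Firefox/120.0"
)


def build_prefs(user_agent=EXAMPLE_ANDROID_UA, javascript_enabled=False):
    """Return the Firefox preferences that mirror the two extension settings."""
    return {
        "general.useragent.override": user_agent,   # assumed pref for the user agent
        "javascript.enabled": javascript_enabled,   # assumed pref for toggling JS
    }


def apply_prefs(options, prefs):
    """Copy each preference onto an object that offers set_preference(name, value)."""
    for name, value in prefs.items():
        options.set_preference(name, value)
    return options
```

With Selenium installed, this would be used as `apply_prefs(Options(), build_prefs())`, where `Options` comes from `selenium.webdriver.firefox.options`. Whether these preferences fully replace the extensions on a given site is untested here.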


# Important Settings for Firefox Profile

## Profile Information

| Setting            | Value                        |
|--------------------|------------------------------|
| Default Profile     | no                           |
| Root Directory      | `C:\Users\YourUserName\AppData\Roaming\Mozilla\Firefox\Profiles\xxxxxxxx.selenium` |
| Local Directory      | `C:\Users\YourUserName\AppData\Roaming\Mozilla\Firefox\Profiles\xxxxxxxx.selenium` |

*Your profile path is the one listed under Root Directory. `xxxxxxxx` is a unique code for your profile; treat it as sensitive and do not make it public.*
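
If copying the path from `about:profiles` is inconvenient, a short script can list candidate profile folders instead. This is a minimal sketch that assumes the Windows layout shown in the table above and a profile named `selenium`; `find_profile_dirs` and `default_windows_profiles_root` are hypothetical helper names, not part of the original script.

```python
import glob
import os


def default_windows_profiles_root():
    """Build the Profiles folder path from %APPDATA% (assumes the Windows layout above)."""
    appdata = os.environ.get("APPDATA", "")
    return os.path.join(appdata, "Mozilla", "Firefox", "Profiles")


def find_profile_dirs(profiles_root, profile_name="selenium"):
    """Return folders under profiles_root whose name ends with '.<profile_name>'."""
    pattern = os.path.join(profiles_root, f"*.{profile_name}")
    return sorted(path for path in glob.glob(pattern) if os.path.isdir(path))
```

Calling `find_profile_dirs(default_windows_profiles_root())` returns a list of matching folders, or an empty list if none exist. On other operating systems the root folder differs, so pass it explicitly.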

## Disable JavaScript Settings

### Settings

| Setting                     | Option     |
|-----------------------------|------------|
| Default state               | JS on / **JS off** |
| Disable behavior             | By domain / By tab |
| Enable shortcuts            | Yes / No   |
| Enable context menu item    | Yes / No   |

Set **Default state** to **JS off**.


# TOOL 1: Google News Search Scraping

# Web Scraping with Selenium and Tkinter

This Python script performs web scraping on Google News using Selenium and provides a user-friendly GUI for input. The main functionalities include scraping news articles based on a user-defined search term and saving the results to an Excel file. The user can also control the scraping process through the GUI.

## Key Components

### Libraries Used

- **Pandas**: For handling and saving data in Excel format.
- **Selenium**: For automating web browser interaction.
- **Tkinter**: For creating the GUI for user input and status updates.
- **Datetime**: For formatting timestamps in the output filenames.
- **Threading**: For running the scraping process in a separate thread to keep the GUI responsive.

### Function Definitions

1. **`human_scroll(driver)`**: 
   - Simulates human-like scrolling on the webpage to load dynamic content.
   - Introduces random pauses between scrolls for a more natural behavior.

2. **`maybe_idle()`**: 
   - Randomly simulates idle time to mimic a user who occasionally pauses.

3. **`process_google_search_links(link)`**: 
   - Processes the links obtained from Google search results to extract the actual URLs from Google redirects.

4. **`scrape_google_news(driver, search_url)`**: 
   - Scrapes the news articles from the provided search URL.
   - Extracts titles, links, publishers, dates, and bylines from the search results.
   - Handles any exceptions that occur during the scraping process.

5. **`run_scraping(driver, encoded_query)`**: 
   - Manages the overall scraping process.
   - Iterates through search results, scraping news articles, and saving progress to an Excel file every 10 pages.
   - Checks if scraping should continue based on user input.

6. **`get_profile_path()`**: 
   - Prompts the user to enter the path to their Selenium profile, validating the input.

7. **`get_search_term()`**: 
   - Prompts the user to enter a search term and confirms the choice.

8. **`stop_scraping()`**: 
   - Stops the scraping process when called by the user through the GUI.
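
To make the redirect handling described for `process_google_search_links` concrete, here is a small standalone sketch. It assumes Google wraps outbound links as `/url?q=<percent-encoded target>&...`; the name `unwrap_google_redirect` is hypothetical, and it uses `urllib.parse` rather than the string splitting in the script.

```python
from urllib.parse import parse_qs, urlparse


def unwrap_google_redirect(link):
    """Return the target of a Google '/url?q=...' redirect, or the link unchanged."""
    parsed = urlparse(link)
    if parsed.path == "/url":
        # parse_qs percent-decodes the values for us
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return link  # Already a direct article link
```

For example, `unwrap_google_redirect("https://www.google.com/url?q=https://example.com/a%3Fb%3D1&sa=U")` returns `https://example.com/a?b=1`, while a direct link is returned as is.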

### GUI Implementation

- The script creates a Tkinter GUI that:
  - Asks for the Selenium profile path.
  - Asks for the search term and confirms the user's intention to search.
  - Displays the scraping progress and provides a button to stop the process.

### Scraping Logic

1. The user is prompted to enter the path to their Selenium profile.
2. The user is then asked for a search term to scrape from Google News.
3. The program initiates the scraping process in a separate thread to keep the GUI responsive.
4. As the scraping runs, the program processes news articles and saves them to an Excel file.
5. The user can stop the scraping process at any time, and the progress will be saved.
6. Finally, the program saves the results to a final output file and closes the browser.
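
The worker-thread-plus-stop-button pattern in the steps above can be sketched without Selenium or Tkinter. This sketch swaps the script's global boolean for a `threading.Event`, which is an alternative technique rather than what the script does; `run_worker` and the `fetch_page` callback are hypothetical names, and `max_pages` is an added safety limit.

```python
import threading


def run_worker(stop_event, results, fetch_page, max_pages=100):
    """Collect pages until stop_event is set, a page comes back empty, or max_pages is hit."""
    page = 0
    while not stop_event.is_set() and page < max_pages:
        items = fetch_page(page)   # stand-in for scraping one results page
        if not items:
            break                  # no more results
        results.extend(items)
        page += 1
    return results


def start_worker(fetch_page):
    """Start run_worker in a background thread and return (thread, stop_event, results)."""
    stop_event = threading.Event()
    results = []
    thread = threading.Thread(
        target=run_worker, args=(stop_event, results, fetch_page), daemon=True
    )
    thread.start()
    return thread, stop_event, results
```

A stop button's callback would call `stop_event.set()`; the loop then exits after the page in progress, and whatever is in `results` can be saved once `thread.join()` returns.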

### File Naming Convention

- The output file is named based on the search query and the current timestamp:
  - Temporary files are named as `{query}-temp.xlsx`.
  - The final results are saved as `{query}-{current_time}-final.xlsx`.
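
The naming scheme can be expressed as two small helpers. This is a sketch only: the helper names are hypothetical, and replacing characters that Windows forbids in filenames is an added assumption that the script itself does not make. The timestamp format `%d%m%Y%H%M` matches the one used in the script.

```python
import re
from datetime import datetime


def safe_name(query):
    """Replace characters that are invalid in Windows filenames with underscores."""
    cleaned = re.sub(r'[<>:"/\\|?*]+', "_", query.strip())
    return cleaned or "query"


def temp_filename(query):
    """Name of the intermediate progress file."""
    return f"{safe_name(query)}-temp.xlsx"


def final_filename(query, now=None):
    """Name of the final results file, stamped as day-month-year-hour-minute."""
    stamp = (now or datetime.now()).strftime("%d%m%Y%H%M")
    return f"{safe_name(query)}-{stamp}-final.xlsx"
```

For a query of `climate change`, `temp_filename` gives `climate change-temp.xlsx`; a query such as `a/b` becomes `a_b` so that the slash is not read as a folder separator.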

## Conclusion

This script scrapes news articles from Google News while letting the user control the process through a GUI. Randomized scrolling and idle pauses make the Selenium session look more like human browsing, and Tkinter provides a simple interface for input and status updates.


# Important Notes for Running the Web Scraping Script

## Managing Firefox Sessions

- **Check Task Manager**: 
  - Before running the web scraping script, make sure multiple Firefox sessions are not running. Open Task Manager and look for any active Firefox instances.
  - If you find extra sessions, close them to prevent conflicts so that the script can run smoothly.

## Stopping the Kernel in Jupyter Notebook

- **Interrupting the Scraping Process**: 
  - If you choose to interrupt the scraping process while the program is running, make sure to stop the kernel in Jupyter Notebook.
  - This will help free up resources and ensure that no lingering processes are running in the background that could affect future executions of the script.

By following these steps, you can ensure a smoother experience when running the web scraping script and avoid potential issues related to multiple browser sessions and active kernels.


Install dependencies

In [None]:
!pip install pandas selenium openpyxl


In [None]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
import time
import random
import tkinter as tk
from tkinter import simpledialog, messagebox
from datetime import datetime
import threading

# Global variable to control the scraping process
scraping_active = True

# Function to perform human-like scrolling
def human_scroll(driver):
    scroll_pause = random.uniform(2, 5)  # Random pause between 2 to 5 seconds
    scroll_height = random.randint(200, 800)  # Scroll by a random height
    driver.execute_script(f"window.scrollBy(0, {scroll_height});")  # Scroll the page
    time.sleep(scroll_pause)  # Wait for the pause duration

# Function to simulate idle time
def maybe_idle():
    if random.random() < 0.2:  # 20% chance of idle
        idle_time = random.uniform(2, 5)  # Idle for 2 to 5 seconds
        print(f"Idling for {idle_time:.2f} seconds...")
        time.sleep(idle_time)  # Pause the script to simulate being idle

# Function to process Google search links and extract real article URLs
def process_google_search_links(link):
    from urllib.parse import unquote  # Local import keeps this function self-contained

    if link and 'url?q=' in link:
        # Extract the real URL from the Google redirect and percent-decode it
        actual_link = link.split('url?q=')[1].split('&')[0]
        return unquote(actual_link)
    else:
        return link  # It's a direct article link, so return as is

# Function to scrape Google News using consistent nth-child selectors
def scrape_google_news(driver, search_url):
    driver.get(search_url)
    
    # Simulate human scrolling
    human_scroll(driver)

    # Maybe idle for some time
    maybe_idle()

    # Fetch the titles, links, publishers, dates, and bylines using nth-child selectors
    articles = []
    try:
        # Loop through div:nth-child(5) to div:nth-child(14)
        for i in range(5, 15):  # Adjust the range based on your needs
            if not scraping_active:  # Check if scraping is still active
                break
            
            try:
                # Dynamically generate the CSS selector for each nth-child
                selector = f"#main > div:nth-child({i}) > div:nth-child(1) > a:nth-child(1)"
                link_element = driver.find_element(By.CSS_SELECTOR, selector)
                raw_link = link_element.get_attribute('href')

                # Process the link to handle both Google redirects and direct article links
                processed_link = process_google_search_links(raw_link)

                # Extract the title
                title_element = link_element.find_element(By.TAG_NAME, 'h3')
                title_text = title_element.text

                # Assuming publisher, date, and byline exist and have a standard structure
                publisher_element = driver.find_element(By.CSS_SELECTOR, f'#main > div:nth-child({i}) div.BNeawe.UPmit.AP7Wnd')
                date_element = driver.find_element(By.CSS_SELECTOR, f'#main > div:nth-child({i}) span.r0bn4c.rQMQod')
                byline_element = driver.find_element(By.CSS_SELECTOR, f'#main > div:nth-child({i}) div.BNeawe.s3v9rd.AP7Wnd')

                publisher = publisher_element.text
                date = date_element.text
                byline = byline_element.text

                # Add the collected info
                articles.append((title_text, processed_link, publisher, date, byline))

            except Exception as e:
                print(f"Error processing div:nth-child({i}): {e}")
                continue 

    except Exception as e:
        print(f"Error occurred: {e}")

    return articles

# Function to run the scraping in a separate thread
def run_scraping(driver, encoded_query):
    global scraping_active
    all_articles = []
    start_index = 0

    while scraping_active:
        search_url = f"https://www.google.com/search?q={encoded_query}&tbm=nws&start={start_index}"
        print(f"Processing URL: {search_url}")
        
        articles = scrape_google_news(driver, search_url)
        
        if not articles:
            print("No more articles found. Stopping the scrape.")
            break

        all_articles.extend(articles)

        # Update status message
        pages_done = start_index // 10 + 1
        status_var.set(f"Processed {pages_done} pages...")

        # Save progress to an Excel file every 10 pages
        # (start_index % 10 is always 0, so test the page count instead)
        if pages_done % 10 == 0:
            output_file_path = f"{query}-temp.xlsx"
            articles_df = pd.DataFrame(all_articles, columns=['Title', 'Link', 'Publisher', 'Date', 'Byline'])
            articles_df.to_excel(output_file_path, index=False)
            status_var.set(f"Saved progress to {output_file_path}")

        # Increment start_index for the next batch of results
        start_index += 10

    # Final save after exiting the loop
    current_time = datetime.now().strftime("%d%m%Y%H%M")
    output_file_path = f"{query}-{current_time}-final.xlsx"
    articles_df = pd.DataFrame(all_articles, columns=['Title', 'Link', 'Publisher', 'Date', 'Byline'])
    articles_df.to_excel(output_file_path, index=False)
    status_var.set(f"Scraping completed! Final results saved to {output_file_path}")

    # Quit driver after saving
    driver.quit()

# Set up the GUI for profile path and search term
def get_profile_path():
    import os  # Local import: used only to validate the path

    while True:
        profile_path = simpledialog.askstring("Input", "Enter the path to your Selenium profile:")
        if profile_path and os.path.isdir(profile_path.strip()):
            # The folder exists, so hand it to Firefox
            return profile_path.strip()
        elif profile_path:
            messagebox.showerror("Error", "Invalid path. Please try again.")
        else:
            messagebox.showinfo("Info", "Please enter a valid path.")

# Set up the GUI for search term confirmation
def get_search_term():
    while True:
        query = simpledialog.askstring("Input", "Enter your search query:", initialvalue="")
        if query and query.strip():
            query = query.strip()
            confirm = messagebox.askyesno("Confirm", f"Are you sure you want to search for '{query}'?")
            if confirm:
                return query
        else:
            # Covers both Cancel (None) and an empty or whitespace-only entry
            messagebox.showinfo("Info", "Please enter a valid search term.")

# Function to stop scraping
def stop_scraping():
    global scraping_active
    scraping_active = False
    status_var.set("Scraping stopped. Saving progress...")

# Set up Firefox options to use the logged-in profile
profile_path = get_profile_path()
firefox_options = Options()
firefox_options.add_argument("-profile")
firefox_options.add_argument(profile_path)

# Initialize the WebDriver (Firefox)
driver = webdriver.Firefox(options=firefox_options)

# Set up the status message variable
root = tk.Tk()
root.withdraw()  # Hide the main window
status_var = tk.StringVar()
status_window = tk.Toplevel()
status_window.title("Scraping Status")
status_label = tk.Label(status_window, textvariable=status_var, wraplength=300)
status_label.pack(pady=10)

# Create a button to stop scraping
stop_button = tk.Button(status_window, text="Stop Scraping", command=stop_scraping)
stop_button.pack(pady=5)

# Get the search term
query = get_search_term()
if query:
    # Encode the query for the URL; quote_plus also escapes characters such as & and #
    from urllib.parse import quote_plus
    encoded_query = quote_plus(query)
    
    # Start the scraping in a separate thread
    scraping_thread = threading.Thread(target=run_scraping, args=(driver, encoded_query))
    scraping_thread.start()

# Start the Tkinter main loop
status_window.mainloop()

# Close the browser once done (moved inside the run_scraping function)


# TOOL 2: Article Scraping

In [None]:
!pip install newspaper3k pandas openpyxl


Before running the cell below, create a new folder to hold the article text files it produces.
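
The folder can also be created from Python. This is a minimal sketch; `ensure_output_dir` is a hypothetical helper and the folder name `articles` in the usage note is only an example.

```python
import os


def ensure_output_dir(path):
    """Create the folder (and any missing parents) if needed and return its absolute path."""
    os.makedirs(path, exist_ok=True)  # exist_ok avoids an error if it already exists
    return os.path.abspath(path)
```

For instance, `ensure_output_dir("articles")` creates an `articles` folder next to the notebook, which can then be selected in the directory dialog.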

In [None]:
import os
import pandas as pd
from newspaper import Article
import tkinter as tk
from tkinter import filedialog, messagebox

# Function to extract article content from a URL
def extract_article_content(url):
    try:
        article = Article(url)
        article.download()
        article.parse()
        return article.text
    except Exception as e:
        print(f"Error fetching article from {url}: {e}")
        return None

# Function to load an Excel file with a file dialog
def load_excel_file():
    root = tk.Tk()
    root.withdraw()  # Hide the main window
    # Tk expects multiple patterns separated by spaces, not semicolons
    file_path = filedialog.askopenfilename(title="Select Excel File", filetypes=[("Excel Files", "*.xlsx *.xls")])
    return file_path

# Function to choose directory for saving articles
def choose_save_directory():
    root = tk.Tk()
    root.withdraw()  # Hide the main window
    directory = filedialog.askdirectory(title="Select Directory to Save Articles")
    return directory

# Load Excel file with the list of URLs
file_path = load_excel_file()  # User selects the file
if not file_path:
    messagebox.showerror("Error", "No file selected. Exiting.")
    raise SystemExit("No file selected.")  # Stops the cell here

df = pd.read_excel(file_path)

# Choose directory to save articles
save_dir = choose_save_directory()  # User selects directory
if not save_dir:
    messagebox.showerror("Error", "No directory selected. Exiting.")
    raise SystemExit("No directory selected.")  # Stops the cell here

# Add a new column to track status of each article extraction
df['Status'] = ''

# Loop through the URLs, fetch the article content, and save to txt files
for index, row in df.iterrows():
    url = row['Link']  # Assuming the column with URLs is named 'Link'
    article_text = extract_article_content(url)
    
    if article_text:
        file_name = os.path.join(save_dir, f"article_{index + 1}.txt")
        with open(file_name, "w", encoding="utf-8") as file:
            file.write(article_text)
        df.at[index, 'Status'] = 'Successful'
        print(f"Article {index + 1} saved to {file_name}")
    else:
        df.at[index, 'Status'] = 'Failed'

# Save the Excel file with the status column
base_name = os.path.basename(file_path)  # Get the base name of the file
name_without_ext = os.path.splitext(base_name)[0]  # Remove the file extension
output_file_path = os.path.join(os.path.dirname(file_path), f"{name_without_ext}_with_status.xlsx")
df.to_excel(output_file_path, index=False)
print(f"Excel file saved with status column at {output_file_path}")
