The structure of CDN or external links can vary depending on the specific resource being loaded. However, here are a few common structures you may encounter:

1. CSS or JavaScript files:
   - Relative path: `<link href="/path/to/file.css" rel="stylesheet">`
   - Absolute path: `<link href="https://cdn.example.com/file.css" rel="stylesheet">`

2. Image files:
   - Relative path: `<img src="/path/to/image.jpg" alt="Image">`
   - Absolute path: `<img src="https://cdn.example.com/image.jpg" alt="Image">`

3. Font files:
   - Relative path: `@font-face { font-family: 'Font'; src: url('/path/to/font.woff2') format('woff2'); }`
   - Absolute path: `@font-face { font-family: 'Font'; src: url('https://cdn.example.com/font.woff2') format('woff2'); }`

4. External libraries or frameworks:
   - Script tag: `<script src="https://cdn.example.com/library.js"></script>`
   - Link tag (stylesheet): `<link href="https://cdn.example.com/framework.css" rel="stylesheet">`

These examples illustrate the general structure of CDN or external links. However, please note that the actual structure can vary based on the specific CDN or external resource you are referencing. It's always important to refer to the documentation or specific resource provider for accurate information on how to include their resources in your project.

In [39]:
import os
import re
import sys
import pandas as pd
from tqdm.auto import tqdm
from datetime import datetime

In [2]:
def find_external_links_in_file(file_path):
    cannot_read_files = []
    try:
        with open(file_path, 'r') as file:
            content = file.read()
            
        # Find all <link> tags with href attributes
        link_tags = re.findall(r'<link[^>]+href=["\'](.*?)["\']', content)

        # Find all <img> tags with src attributes
        img_tags = re.findall(r'<img[^>]+src=["\'](.*?)["\']', content)

        # Find all @font-face declarations with src attributes
        font_face_tags = re.findall(r'@font-face[^}]+src:\s*url\(["\'](.*?)["\']', content)

        # Find all <script> tags with src attributes
        script_tags = re.findall(r'<script[^>]+src=["\'](.*?)["\']', content)

        # Filter out the external/CDN links
        external_links = [link for link in link_tags + img_tags + font_face_tags + script_tags
                          if link.startswith(('http://', 'https://'))]

        return external_links
    except:
        print(f"Can't read {file_path}")
        cannot_read_files.append(file_path)

In [48]:
# len(os.listdir(project_path))

# import glob
# glob.glob(r'C:\xampp\htdocs\glob\Code\html2/*/*.html')

# project_path = r"C:\xampp\htdocs\glob\Code\html2\packages\Webkul\Admin\src\Resources\views"
project_path = r"C:\xampp\htdocs\glob\Code\html2"

number_of_files = 0
for dirpath, dirnames, filenames in os.walk(project_path):
        # for filename in filenames:
        #     if any(filename.endswith(ext) for ext in ['.html', '.php', '.blade.php']):
        number_of_files += len(dirnames)

print(number_of_files)

12814


In [43]:
def search_project_files_for_external_links(project_path):
    # Define the file extensions to search
    file_extensions = ['.html', '.php', '.blade.php']
    
    # tqdm progress bar
    pbar = tqdm(total=number_of_files, file=sys.stdout, colour='green')

    # Collect external links and file information
    results = []
    for dirpath, dirnames, filenames in os.walk(project_path):
        for filename in filenames:
            if any(filename.endswith(ext) for ext in file_extensions):
                # update tqdm progress bar
                pbar.update(1)
                pbar.refresh()
                
                file_path = os.path.join(dirpath, filename)
                file_external_links = find_external_links_in_file(file_path)
                
                # Get file information
                # file_name = os.path.splitext(filename)[0]
                # file_extension = os.path.splitext(filename)[1]
                file_name = filename
                file_parent_folder = os.path.basename(dirpath)
                
                # Append results for each external link found in the file
                if file_external_links is not None:
                    for link in file_external_links:
                        results.append({
                            'Domain': re.search('https?://([A-Za-z_0-9.-]+).*', link).group(1),
                            'External Link': link,
                            'File Name': file_name,
                            # 'File Extension': file_extension,
                            'File Parent Folder': file_parent_folder,
                            'File Path': file_path
                        })
    pbar.close()
    return results

In [44]:
external_links_results = search_project_files_for_external_links(project_path)

  0%|          | 0/323773 [00:00<?, ?it/s]

Can't read C:\xampp\htdocs\glob\Code\html2\database\seeders\ComplaintOptionTranslationsTableSeeder.php
Can't read C:\xampp\htdocs\glob\Code\html2\database\seeders\SectionSeeder.php
Can't read C:\xampp\htdocs\glob\Code\html2\packages\Glob\CustomStyle\src\Resources\lang\ar\app.php
Can't read C:\xampp\htdocs\glob\Code\html2\packages\Glob\Pickup\src\resources\lang\ar\app.php
Can't read C:\xampp\htdocs\glob\Code\html2\packages\Glob\Pickup\src\resources\lang\fa\app.php
Can't read C:\xampp\htdocs\glob\Code\html2\packages\RKREZA\Contact\src\Resources\lang\ar\app.php
Can't read C:\xampp\htdocs\glob\Code\html2\packages\Webkul\Admin\src\Resources\lang\ar\app.php
Can't read C:\xampp\htdocs\glob\Code\html2\packages\Webkul\Admin\src\Resources\lang\de\app.php
Can't read C:\xampp\htdocs\glob\Code\html2\packages\Webkul\Admin\src\Resources\lang\en\app.php
Can't read C:\xampp\htdocs\glob\Code\html2\packages\Webkul\Admin\src\Resources\lang\es\app.php
Can't read C:\xampp\htdocs\glob\Code\html2\packages\Web

In [45]:
df = pd.DataFrame(data=external_links_results, 
                  columns=['Domain', 'External Link', 'File Name', 'File Parent Folder', 'File Path'])
df.head()

Unnamed: 0,Domain,External Link,File Name,File Parent Folder,File Path
0,cdnjs.cloudflare.com,https://cdnjs.cloudflare.com/ajax/libs/intl-te...,index.blade.php,shop,C:\xampp\htdocs\glob\Code\html2\packages\RKREZ...
1,cdnjs.cloudflare.com,https://cdnjs.cloudflare.com/ajax/libs/intl-te...,index.blade.php,shop,C:\xampp\htdocs\glob\Code\html2\packages\RKREZ...
2,cdnjs.cloudflare.com,https://cdnjs.cloudflare.com/ajax/libs/intl-te...,index.blade.php,shop,C:\xampp\htdocs\glob\Code\html2\packages\RKREZ...
3,cdnjs.cloudflare.com,https://cdnjs.cloudflare.com/ajax/libs/font-aw...,edit.blade.php,account,C:\xampp\htdocs\glob\Code\html2\packages\Webku...
4,cdnjs.cloudflare.com,https://cdnjs.cloudflare.com/ajax/libs/intl-te...,create.blade.php,customers,C:\xampp\htdocs\glob\Code\html2\packages\Webku...


In [46]:
df.shape

(229, 5)

In [23]:
now = datetime.utcnow().strftime('%Y-%m-%d')
now

'2023-06-26'

In [24]:
df.to_excel(f"external-links_{now}.xlsx", index=False)

In [25]:
df.to_csv(f"external-links_{now}.csv", index=False)

The Python script provided offers a practical and effective way to search for external links within your project files. However, depending on your specific requirements and preferences, you might consider using specialized tools or libraries that offer more advanced features for analyzing and extracting information from HTML files. Here are a few alternatives worth exploring:

1. Beautiful Soup with Requests: You can combine the power of the Beautiful Soup library, which simplifies HTML parsing, with the Requests library for making HTTP requests. This approach allows you to retrieve HTML content directly from URLs and then search for external links using Beautiful Soup's flexible querying capabilities.

2. Web scraping frameworks: Frameworks like Scrapy provide a complete solution for web scraping tasks, including parsing HTML and extracting relevant information. With Scrapy, you can define rules and selectors to navigate and extract data from web pages, which could be beneficial if you have more complex requirements beyond finding external links.

3. Custom HTML parser: If you have a deep understanding of the HTML structure in your project files and need fine-grained control, you can develop a custom HTML parser using Python's built-in libraries, such as `html.parser`. This approach requires more manual coding but provides flexibility to extract specific elements and attributes based on your project's needs.

4. IDE or code editor search functionality: Some integrated development environments (IDEs) or advanced code editors have powerful search functionality that supports regular expressions. You can leverage these features to search for specific patterns (e.g., external link structures) within your project files.

Consider the complexity of your project, the volume of files, and the desired level of automation when choosing the approach that best suits your needs. While the Python script provides a good starting point, exploring specialized tools or libraries might provide additional advantages for your specific use case.

Yes, you can run the script on a schedule by utilizing task scheduling tools available on your operating system. Here are the steps to schedule the script to run automatically:

1. Determine the operating system: Identify the operating system on which you want to schedule the script. The steps for scheduling tasks may vary depending on the operating system (e.g., Windows, macOS, Linux).

2. Choose a task scheduler: Select a task scheduler suitable for your operating system. Here are a few options:

   - Windows: Task Scheduler is a built-in task scheduling tool in Windows.
   - macOS: Launchd is the default task scheduler for macOS.
   - Linux: Cron is a commonly used task scheduler for Linux systems.

3. Configure the task scheduler: Once you've chosen a task scheduler, configure it to run your Python script on the desired schedule. The specific steps may vary depending on the task scheduler you're using. Here's a general outline:

   - Identify the Python interpreter: Determine the path to the Python interpreter you want to use to run the script. For example, `/usr/bin/python` or `/usr/bin/env python3`.
   - Specify the script: Provide the full path to your Python script that you want to schedule.
   - Set the schedule: Configure the task scheduler to run the script at the desired frequency (e.g., daily, weekly) and specify the time or interval.

4. Test the scheduled task: After setting up the task scheduler, test it to ensure that the script runs as expected at the scheduled time. Verify that the script executes without errors and that the results are generated as desired.

By scheduling the script, you can automate the process of searching for external links in your project files, allowing it to run at regular intervals without manual intervention. This can be particularly useful if you want to periodically monitor and update the list of external links in your project.