# **Guided Lab 341.4.2 - Regular Expression - Match URL pattern in Python**

---



## **Learning Objectives:**

By the end of this lab, you will be able to:

- Describe the basics of regular expressions.
- Use regex to validate URL patterns, ensuring they conform to expected formats.
- Extract URLs from text using regex, identifying web addresses within larger content.
- Extract domain names from URLs using regex, isolating the core website address.

## **Lab Objective:**

This lab demonstrates how to validate and match URLs using regular expressions (regex) in Python. By the end of this lab, learners will be able to use regex to validate and match URL patterns in Python.

## **Lab Structure:**

The lab is divided into three main examples:

- **Validate the URL:** This example focuses on using regex to check if a given URL is valid. It introduces the `validate_url` function, which employs a complex regex pattern to ensure URLs adhere to standard structures.
- **Match the URL from the string:** This example demonstrates how to extract URLs from a given text string. It utilizes the `extract_urls` function and a regex pattern designed to identify URLs within text.
- **Match the URL from the string and print URL:** This example expands on the previous one by not only extracting URLs but also extracting the domain names from those URLs. It uses the `extract_urls_and_domains` function and two regex patterns for URL and domain extraction.

### **Key Concepts:**

- **Regular Expressions:** A sequence of characters that defines a search pattern, used for pattern matching within text.
- **URL Validation:** The process of checking if a URL is syntactically correct and conforms to expected standards.
- **URL Extraction:** The process of identifying and extracting URLs from larger text content.
- **Domain Extraction:** The process of isolating the domain name (e.g., google.com) from a URL.
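
As a quick warm-up for these concepts, here is a minimal sketch using Python's built-in `re` module. The sample string and the simple digit pattern are made up for illustration and are not part of the lab's examples.

```python
import re

# Hypothetical sample text, used only for this illustration
text = "Order 66 shipped on day 12"

# A pattern describes the text to look for: \d+ means "one or more digits"
digits = re.compile(r'\d+')

first = digits.search(text)          # first match anywhere in the string, or None
all_numbers = digits.findall(text)   # every non-overlapping match, as a list of strings
whole = digits.fullmatch("2024")     # succeeds only if the entire string fits the pattern

print(first.group())      # 66
print(all_numbers)        # ['66', '12']
print(whole is not None)  # True
```

The three examples in this lab rely on the same ideas: `match` for validation, and `findall` and `search` for extraction.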

# **Example 1: Validate the URL**

The code below defines a Python function, `validate_url(url)`, that checks whether a given URL is valid. It compiles a fairly long regular expression and returns `True` if the URL matches the pattern and `False` otherwise.

In [None]:
import re

def validate_url(url):
    pattern = re.compile(r'''^https?://(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\b|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))(:[0-9]{1,4})?(?:/?|[/?]\S+)$''', re.IGNORECASE)
    return pattern.match(url) is not None

print(validate_url("http://www.google.com")) # Output: True
print(validate_url("www.google.com")) # Output: False

True
False


### **Here's an explanation of the code:**

**def validate_url(url):** This line defines the validate_url function, which takes a single parameter, url, representing the URL to be validated.

**pattern = re.compile(r'^https?://(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\b|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))(:[0-9]{1,4})?(?:/?|[/?]\S+)$', re.IGNORECASE):** This line defines a regular expression pattern to validate URLs. The pattern checks for the following components in a URL:

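- **^https?://:** The URL must start with either "http://" or "https://". The `?` makes the `"s"` in "https" optional.

- **(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\b|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)):** This part of the pattern matches the host, which can be either a domain name or an IPv4 address. In the domain branch, each dot-separated label is 1 to 63 characters long and the top-level domain is 2 to 6 letters; in the IP branch, each of the four numbers is limited to the range 0 to 255.

- **(:[0-9]{1,4})?:** This matches an optional port number (e.g., :8080). The `?` makes the entire group optional. Because `{1,4}` allows at most four digits, five-digit ports such as :65535 are rejected.

- **(?:/?|[/?]\S+)$:** This part of the pattern matches whatever follows the host and port: either nothing or a single trailing slash (`/?`), or a `/` or `?` followed by one or more non-whitespace characters (a path or query string). The `$` requires the match to reach the end of the string.

- **return pattern.match(url) is not None:** This line calls the compiled pattern's `match` method, which tries to match the pattern starting at the beginning of the string. If it matches, the function returns True, indicating that the URL is valid. If it doesn't match, the function returns False, indicating that the URL is not valid.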

- The code then tests the validate_url function with two sample URLs and prints the results.

- print(validate_url("http://www.google.com")): This URL is valid, so the function returns True.

- print(validate_url("www.google.com")): This URL is not valid because it lacks the "http://" or "https://" prefix, so the function returns False.
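
A long one-line pattern like this can be hard to read. As a sketch only, the same pattern can be written with the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows `#` comments, so each component described above sits on its own line. The name `URL_PATTERN` and the extra test URLs are assumptions added for this illustration; they are not part of the original lab code.

```python
import re

# Same pattern as in the lab, split across lines with re.VERBOSE
URL_PATTERN = re.compile(
    r"""
    ^https?://                                            # scheme
    (?:
        (?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+       # dot-terminated labels
        [A-Z]{2,6}\b                                      # top-level domain, 2-6 letters
      |
        (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3} # first three IPv4 numbers
        (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)          # last IPv4 number
    )
    (:[0-9]{1,4})?                                        # optional port, up to 4 digits
    (?:/?|[/?]\S+)$                                       # optional slash, path, or query
    """,
    re.IGNORECASE | re.VERBOSE,
)

def validate_url(url):
    return URL_PATTERN.match(url) is not None

print(validate_url("http://www.google.com"))          # True
print(validate_url("https://192.168.1.1:8080/path"))  # True: IP address, port, and path
print(validate_url("http://example.com:65535"))       # False: port has five digits
print(validate_url("http://example.technology"))      # False: TLD longer than six letters
```

The last two results show limits of this particular pattern rather than problems with the URLs themselves, which is worth keeping in mind before using it for real validation.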

# **Example 2: Match the URL from the string**

In [None]:
import re

def extract_urls(text):
    # Regular expression for matching URLs
    url_pattern = r'\bhttps?://[^\s/$.?#].[^\s]*\b'  # no capturing group, so re.findall returns the full URLs


    # Find all URLs in the text
    urls = re.findall(url_pattern, text)

    return urls

text = "Visit my website at https://www.example.com for more information. You can also check http://www.example.org/products"

# Call the function to extract URLs
urls = extract_urls(text)

print("URLs found in the text:")
print(urls)
for url in urls:
    print(url)

URLs found in the text:
['https://www.example.com', 'http://www.example.org/products']
https://www.example.com
http://www.example.org/products


This regular expression is used to find URLs inside a longer string. It looks for text starting with "http://" or "https://" and then takes every following non-whitespace character, so the host, the path, and the query string (if any) are all part of the match. The pattern contains no capturing group, so re.findall returns each complete URL; with a capturing group such as `(/\S*)?`, re.findall would return only the text captured by that group.
We use re.findall to find all URLs in the given text. The URLs found in the text are then printed.

Here's a breakdown:

- **url_pattern = ...:** This part assigns the regex pattern to a variable named url_pattern for later use.
- **r'...':** The r before the opening quote indicates a raw string literal, preventing Python from interpreting backslashes (\) in the pattern as escape sequences.
- **`\b` (leading):** This is a word boundary. It requires that "http" does not directly follow a letter, digit, or underscore.
- **`https?://`:** This matches the beginning of a URL, looking for either "http://" or "https://". The ? makes the "s" in "https" optional, allowing for both secure and non-secure URLs.
- **`[^\s/$.?#]`:** This matches exactly one character that is not whitespace, `/`, `$`, `.`, `?`, or `#`. It is the first character of the host, so something like "http://." is not accepted.
- **`.`:** This matches any single character except a newline.
- **`[^\s]*`:** This matches zero or more (*) non-whitespace characters (equivalent to `\S*`). It covers the rest of the host as well as any path and query string (e.g., "/products", "/blog?id=123").
- **`\b` (trailing):** The match must end at a word boundary, so trailing punctuation such as the period at the end of a sentence is left out of the URL.
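
To see how `re.findall` treats capturing groups, and what the trailing `\b` in the lab's pattern does, the following sketch runs three patterns on the same sentence. The sentence and the variable names are made up for this illustration.

```python
import re

# Sample sentence made up for this sketch; note the period right after the first URL
text = "Read https://www.example.com. Then open http://www.example.org/products today"

# Capturing group (...): findall returns only what the group captured
with_group = re.findall(r'https?://[a-zA-Z0-9.-]+(/\S*)?', text)

# Non-capturing group (?:...): findall returns the whole match
without_group = re.findall(r'https?://[a-zA-Z0-9.-]+(?:/\S*)?', text)

# The pattern used in Example 2: the trailing \b drops the sentence-ending period
lab_pattern = re.findall(r'\bhttps?://[^\s/$.?#].[^\s]*\b', text)

print(with_group)     # ['', '/products']
print(without_group)  # ['https://www.example.com.', 'http://www.example.org/products']
print(lab_pattern)    # ['https://www.example.com', 'http://www.example.org/products']
```

The first result contains only the group's text (an empty string where the group did not take part in the match), the second keeps the stray period, and the third returns clean URLs.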

# **Example 3: Match the URL from the string and print URL**

In [None]:
import re

def extract_urls_and_domains(text):
    # Regular expression for matching URLs
    url_pattern = r'https?://\S+'

    # Regular expression for extracting domain names from URLs
    domain_pattern = r'https?://([a-zA-Z0-9.-]+)'

    # Find all URLs in the text
    urls = re.findall(url_pattern, text)
    print(urls)  # show the raw list of matched URLs

    # Extract domain names from the URLs
    domains = [re.search(domain_pattern, url).group(1) for url in urls]

    return urls, domains

text = "Visit my website at https://www.example.com for more information. https://www.second_example.com"

# Call the function to extract URLs and domains
urls, domains = extract_urls_and_domains(text)

print("URLs found in the text:")
for url in urls:
    print(url)

print("\nExtracted domain names:")
for domain in domains:
    print(domain)

['https://www.example.com', 'https://www.second_example.com']
URLs found in the text:
https://www.example.com
https://www.second_example.com

Extracted domain names:
www.example.com
www.second
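
In the output above, the second domain is printed as `www.second` rather than `www.second_example.com`. The character class `[a-zA-Z0-9.-]` in `domain_pattern` does not include the underscore, so the match stops at that character. Underscores are generally not valid in host names, so the sample URL is itself unusual, but the truncated result is easy to miss. Two possible alternatives are sketched below: the standard library's `urllib.parse.urlparse`, and a broader regex that captures everything up to the first `/`, `:`, `?`, `#`, or whitespace. The names `host_pattern` and `extract_domain` are hypothetical and not part of the lab code.

```python
import re
from urllib.parse import urlparse

urls = ["https://www.example.com", "https://www.second_example.com"]

# Option 1: let the standard library split the URL and read the host from it
hosts_parsed = [urlparse(url).hostname for url in urls]

# Option 2: a broader regex (an assumption for this sketch, not the lab's pattern)
host_pattern = re.compile(r'https?://([^/\s:?#]+)')

def extract_domain(url):
    """Return the host part of url, or None if the pattern does not match."""
    match = host_pattern.search(url)
    return match.group(1) if match else None

hosts_regex = [extract_domain(url) for url in urls]

print(hosts_parsed)  # ['www.example.com', 'www.second_example.com']
print(hosts_regex)   # ['www.example.com', 'www.second_example.com']
```

Checking the result of `search` before calling `group(1)` also avoids an `AttributeError` when a string does not match the pattern.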


## **Submission**
- Submit your completed lab using the Start Assignment button on the assignment page in Canvas.
- Your submission can include:
  - If you are using a notebook, all tasks should be written and submitted in a single notebook file, for example: **(your_name_labname.ipynb)**.
  - If you are using a Python script file, all tasks should be written and submitted in a single Python script file, for example: **(your_name_labname.py)**.
- Add appropriate comments and any additional instructions if required.
