# Python NLP Regular Expression (regex) Exercise

This notebook demonstrates how to extract structured patterns from unstructured text using **Python's `re` module**. We'll cover three real-world examples:

1. Extract Instagram handles from URLs.
2. Extract types of risk from labeled text.
3. Extract financial reporting periods (quarterly/semi-annual) from financial data.

In [1]:
import re

## 1. Extract Instagram Handles

We want to extract Instagram handles from a given block of text. For this exercise, a handle is the run of alphanumeric characters and underscores that follows `https://instagram.com/`. Real handles may also contain periods, which this simplified pattern stops at.

In [2]:
text = '''
Check out our design influencer at https://instagram.com/design_with_mia,
and our product updates at https://www.producthunt.com/. Also, here are other
key Instagram accounts:
https://instagram.com/creative_genius
https://instagram.com/daily.design
https://instagram.com/user123_test
'''

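# Extract handles. The dots are escaped so they match literal periods,
# and "www." is optional instead of being matched by a greedy ".*".
pattern = r'https://(?:www\.)?instagram\.com/([a-zA-Z0-9_]+)'
handles = re.findall(pattern, text)
print(handles)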

['design_with_mia', 'creative_genius', 'daily', 'user123_test']
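
The output above shortens `daily.design` to `daily` because the character class excludes periods. A minimal sketch of one possible variant follows, assuming handles may contain periods but should not end with one. The names `sample`, `handle_pattern`, and `handles_with_dots` are illustrative and not part of the original notebook.

```python
import re

# Hypothetical sample text mirroring the URLs used in this section.
sample = '''
Follow https://instagram.com/design_with_mia,
https://instagram.com/daily.design
and https://www.instagram.com/user123_test/. More text follows.
'''

# \w with re.ASCII is equivalent to [a-zA-Z0-9_].
# The handle must start and end with a word character, so periods are
# accepted inside the handle but trailing punctuation is left out.
handle_pattern = re.compile(
    r'https://(?:www\.)?instagram\.com/(\w(?:[\w.]*\w)?)',
    re.ASCII,
)

handles_with_dots = handle_pattern.findall(sample)
print(handles_with_dots)
# ['design_with_mia', 'daily.design', 'user123_test']
```

Requiring a word character at both ends is a design choice for this sketch: it keeps a sentence-ending period or comma out of the captured handle.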


## 2. Extract Risk Types

We want to extract different types of risk from text labeled with `Type of Risk:`. This helps us identify and categorize potential risks mentioned in documents.

In [3]:
text = '''
Type of Risk: Operational Risk
Our infrastructure depends on multiple cloud vendors...

Type of Risk: Liquidity Risk
We may face challenges in converting assets into cash...

Type of Risk: Regulatory Risk
Changes in compliance requirements can affect operations...
'''

# Extract risk types
pattern = r'Type of Risk: (.*)'
risks = re.findall(pattern, text)
print(risks)

['Operational Risk', 'Liquidity Risk', 'Regulatory Risk']
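
To support the categorization mentioned above, each risk type can also be paired with the line that follows its label. This is a rough sketch, assuming the description always sits on the line directly after the label. The names `doc`, `risk_pattern`, and `risk_descriptions` are illustrative.

```python
import re

# Hypothetical sample in the same layout as the text above.
doc = '''
Type of Risk: Operational Risk
Our infrastructure depends on multiple cloud vendors...

Type of Risk: Liquidity Risk
We may face challenges in converting assets into cash...
'''

# re.MULTILINE makes ^ match at the start of every line, so the label is
# only recognized at the beginning of a line.
# Group 1 is the risk type (surrounding spaces trimmed);
# group 2 is the next line, taken here as the description.
risk_pattern = re.compile(
    r'^Type of Risk:[ \t]*(.+?)[ \t]*\n(.+)',
    re.MULTILINE,
)

risk_descriptions = {
    m.group(1): m.group(2).strip() for m in risk_pattern.finditer(doc)
}

for risk, description in risk_descriptions.items():
    print(f'{risk}: {description}')
```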


## 3. Extract Financial Reporting Periods

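This example extracts fiscal periods such as `FY2022 Q3` or `FY2023 S1`, covering quarters Q1–Q4 and half-years S1–S2. The alternation is wrapped in a non-capturing group, `(?:...)`, so `re.findall` returns only the outer group (year plus period) as a single string per match.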

In [4]:
text = '''
Apple's total revenue in FY2022 Q3 was $90 billion.
Samsung reported earnings in FY2023 S1 amounting to $120 billion.
Google saw an increase in ad revenue in FY2021 Q4.
'''

# Extract FY periods
pattern = r'FY(\d{4} (?:Q[1-4]|S[1-2]))'
periods = re.findall(pattern, text)
print(periods)

['2022 Q3', '2023 S1', '2021 Q4']
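
The sketch below illustrates why the non-capturing group matters for `re.findall`, and shows one optional way to split the year from the period with named groups. The sample sentence and the names `report`, `period_pattern`, and `parsed` are assumptions made for illustration.

```python
import re

# Hypothetical sample sentence in the same format as the text above.
report = "Revenue rose in FY2022 Q3 and again in FY2023 S1."

# One capturing group: findall returns one string per match.
single = re.findall(r'FY(\d{4} (?:Q[1-4]|S[12]))', report)
print(single)   # ['2022 Q3', '2023 S1']

# Two capturing groups: findall returns a tuple per match instead.
tuples = re.findall(r'FY(\d{4} (Q[1-4]|S[12]))', report)
print(tuples)   # [('2022 Q3', 'Q3'), ('2023 S1', 'S1')]

# Named groups keep the year and the period as separate fields.
period_pattern = re.compile(r'FY(?P<year>\d{4}) (?P<period>Q[1-4]|S[12])')
parsed = [
    (int(m['year']), m['period']) for m in period_pattern.finditer(report)
]
print(parsed)   # [(2022, 'Q3'), (2023, 'S1')]
```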


## Summary

- We extracted Instagram handles by capturing text that follows `https://instagram.com/` using a pattern that allows alphanumeric characters and underscores.

- We used regex to extract risk types that appear after the label `Type of Risk:`.

- We extracted quarterly and semi-annual fiscal periods (Q1–Q4 and S1–S2) using an alternation inside a non-capturing group, so each match is returned as a single string such as `2022 Q3`.

These exercises show how regular expressions can extract structured information from raw text.