<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Exercise: Regular expressions
© ExploreAI Academy

In this notebook, we test some of the concepts we've learned about regular expressions.

## Learning objectives

In this train, we will:
- Reinforce the understanding of basic regular expression syntax and special characters in Python.
- Demonstrate the ability to apply regular expression methods in Python.

## Exercises

You have been tasked with a data science project where the goal is to analyse social media posts related to forest conservation. The project involves extracting specific information from the compilation of social media posts below.

In [1]:
social_media_posts = """
Great news! The GreenWood Project has successfully planted 10000 trees in the Amazon Rainforest #GreenEarth #Conservation
Update: ForestCoverApp shows a 12% increase in forest cover in the last 5 years. #TechForGood
Sad to see illegal logging in Madagascan rainforests. We need stricter laws! #SaveForests #ActNow
Celebrating World Environment Day with a pledge to plant 20000 trees. Join us! #EnvironmentDay #GoGreen
Interesting study published in NatureJournal: Rainforest biodiversity is crucial for ecological balance. #ScienceForNature
"""

### Exercise 1

**Scenario**: Your first task is to analyse the hashtags used in these posts, as they can give insights into popular environmental campaigns.

**Exercise**: Write a Python function called `extract_hashtags` to extract all unique hashtags from the social media posts.

In [2]:
import re

In [3]:
def extract_hashtags(text):
    pattern = re.compile(r'#\w+')
    # The regular expression '#\w+' matches any word (\w+) that follows the '#' character, capturing hashtags.
    hashtags = pattern.findall(text)
    return set(hashtags)  # Using a set to get unique hashtags
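
The other exercises end with a quick test, so a similar check may help here. The following is a minimal sketch on a hypothetical two-post sample (`sample_posts` is an assumption, not part of the exercise data). It restates `extract_hashtags` in compact form so that it runs on its own, and it sorts the result because sets are unordered.

```python
import re

def extract_hashtags(text):
    # '#\w+' matches a '#' followed by one or more word characters.
    return set(re.findall(r'#\w+', text))

# Hypothetical sample; '#GoGreen' appears twice on purpose.
sample_posts = """
Plant a tree today! #GoGreen #ActNow
Join the pledge. #GoGreen
"""

# Sets are unordered, so sort before printing to get a stable result.
print(sorted(extract_hashtags(sample_posts)))  # ['#ActNow', '#GoGreen']
```

The duplicate `#GoGreen` appears only once in the output, which is the effect of returning a set.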

### Exercise 2

**Scenario**: You want to quantify the impact of these conservation efforts. For this, extracting numerical data from the posts will be helpful.

**Exercise**: Write a Python function called `extract_numbers` to find all numbers mentioned in the posts.


In [4]:
def extract_numbers(text):
    pattern = r'\b\d+\b'
    # The regex \b\d+\b matches whole numbers (\d+) that appear as separate words (bounded by \b).
    return re.findall(pattern, text)

# Test with the provided text
print(extract_numbers(social_media_posts))

['10000', '12', '5', '20000']
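
To illustrate what the word boundaries contribute, the sketch below compares `\d+` with `\b\d+\b` on a hypothetical sentence (`sample` is an assumption, not taken from the posts). It also shows one possible way to convert the matched strings to integers, since `re.findall` returns strings.

```python
import re

# Hypothetical sample containing a digit inside a word ("CO2").
sample = "CO2 levels rose 12% in 5 years"

# Without boundaries, the '2' inside "CO2" is also picked up.
without_boundaries = re.findall(r'\d+', sample)
# With boundaries, only numbers that stand alone are matched.
with_boundaries = re.findall(r'\b\d+\b', sample)

print(without_boundaries)  # ['2', '12', '5']
print(with_boundaries)     # ['12', '5']

# findall returns strings; convert them before doing arithmetic.
numbers = [int(n) for n in with_boundaries]
print(numbers)  # [12, 5]
```

Note that `12` is still matched in `12%`, because `%` is not a word character and so a boundary exists between `2` and `%`.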


### Exercise 3

**Scenario**: To understand public sentiment, you need to count how often words related to negative impacts (like "illegal", "logging") appear.

**Exercise**: Write a function called `count_specific_words` to count the occurrences of the words "illegal" and "logging" in the posts.

In [5]:
def count_specific_words(text):
    pattern = r'\billegal\b|\blogging\b'
    # The regex \billegal\b|\blogging\b uses the alternation operator | to match either "illegal" or "logging",
    # each as a separate word.
    return len(re.findall(pattern, text))

# Test with the provided text
print(count_specific_words(social_media_posts))

2
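
The pattern above is case-sensitive, so a capitalised "Illegal" at the start of a sentence would not be counted. The sketch below is one possible variation, assuming case-insensitive matching and per-word counts are wanted. The helper name `count_words_per_term` and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def count_words_per_term(text):
    # Hypothetical helper, not part of the exercise solution.
    # The non-capturing group lets both words share one pair of \b anchors.
    pattern = r'\b(?:illegal|logging)\b'
    matches = re.findall(pattern, text, flags=re.IGNORECASE)
    # Lowercase each match so "Illegal" and "illegal" are counted together.
    return Counter(m.lower() for m in matches)

sample = "Illegal logging is illegal. Logging must stop."
counts = count_words_per_term(sample)
print(counts['illegal'], counts['logging'])  # 2 2
```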


### Exercise 4

**Scenario**: For geographical analysis, you need to extract mentioned locations such as "Amazon Rainforest" and "Madagascan rainforests".

**Exercise**: Write a function called `extract_locations` to extract proper names that refer to locations.

1. **Regular expression pattern**: The regex `\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*(?:\s[rR]ainforests?)?\b` is designed to capture phrases that typically represent location names. It starts by matching any word that begins with a capital letter, which is common for proper nouns such as place names. The pattern allows for additional capitalised words to form multi-word names. The optional part `(?:\s[rR]ainforests?)?` is included to specifically capture terms like "rainforest" or "rainforests", which are relevant in the environmental context of our dataset.

2. **Filtering out false positives**: On its own, the pattern returns false positives such as "Great", "Project", and the standalone word "Rainforest". This happens because the regex is general enough to capture any capitalised word, and "Rainforest" begins a sentence in the last post ("Rainforest biodiversity is crucial..."). To rectify this, we've added specific conditions in the filtering process:
   * `not loc.endswith(':')` is a defensive check that excludes matches ending with a colon. Because the pattern matches only letters and whitespace, no match can end with a colon, so this condition does not remove anything for this dataset.
   * `'rainforest' in loc.lower()` ensures the phrase includes "rainforest" in any letter case, keeping the focus on relevant environmental terms.
   * `loc != 'Rainforest'` explicitly excludes the standalone word "Rainforest" to avoid this specific false positive.

In [14]:
def extract_locations(text):
    # Regular expression to match potential location names
    pattern = r'\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*(?:\s[rR]ainforests?)?\b'
    # \b: Word boundary anchor
    # [A-Z]: Uppercase letter at the beginning
    # [a-z]+: One or more lowercase letters
    # (?:\s[A-Z][a-z]+)*: Non-capturing group, repeated zero or more times, for additional capitalised words (whitespace, one uppercase letter, then lowercase letters)
    # (?:\s[rR]ainforests?)?: Non-capturing group for an optional " Rainforest(s)" or " rainforest(s)" at the end
    # \b: Word boundary anchor at the end

    # Find all matches
    potential_locations = re.findall(pattern, text)

    # Filtering based on context and avoiding false positives
    locations = [loc for loc in potential_locations if not loc.endswith(':') and 'rainforest' in loc.lower() and loc != 'Rainforest']

    return locations

# Test with the provided text
print(extract_locations(social_media_posts))

['Amazon Rainforest', 'Madagascan rainforests']
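
To see why the filtering step is needed, it can help to inspect the raw matches before filtering. The following is a minimal sketch on a hypothetical one-line sample (`sample` is an assumption, not part of the exercise data), using the same pattern as `extract_locations`.

```python
import re

# Same pattern as in extract_locations.
pattern = r'\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*(?:\s[rR]ainforests?)?\b'

# Hypothetical sample with a sentence-initial "Rainforest".
sample = ("Sad news from the Amazon Rainforest. "
          "Rainforest biodiversity is crucial. We must act.")

# Every capitalised word is a candidate, not only locations.
raw_matches = re.findall(pattern, sample)
print(raw_matches)  # ['Sad', 'Amazon Rainforest', 'Rainforest', 'We']

# Keep phrases that mention a rainforest but are more than the bare word.
filtered = [m for m in raw_matches
            if 'rainforest' in m.lower() and m != 'Rainforest']
print(filtered)  # ['Amazon Rainforest']
```

One caveat with this pattern: `\s` also matches a newline, so a capitalised word at the end of one post can be joined to a capitalised word at the start of the next. The filter happens to discard such matches in the exercise data, but the approach is tuned to this dataset rather than being a general location extractor.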



<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>