# Regex Exercises

## Exercise 1

### Problem: Write a regex to match the word "hello" in a given string.

In [1]:
import re
text = "hello world"
pattern = r'hello'
match = re.search(pattern, text)
print(match.group())

hello
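
Note that `r'hello'` matches the characters `hello` anywhere, including inside a longer word. If the goal is to match it only as a standalone word, one option is to wrap the pattern in `\b` word boundaries. A minimal sketch, with sample strings that are made up for illustration:

```python
import re

# Hypothetical sample strings, not part of the original exercise.
samples = ["hello world", "othello is a play", "say hello!"]

# \b asserts a word boundary, so the "hello" inside "othello" is rejected.
whole_word = re.compile(r"\bhello\b")

results = [bool(whole_word.search(s)) for s in samples]
print(results)  # [True, False, True]
```

Checking the result of `re.search` before calling `.group()` also avoids an `AttributeError` when there is no match.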


## Exercise 2

### Problem: Write a regex to find all digits in a string.

In [2]:
text = "My phone number is 12345."
pattern = r'\d'
matches = re.findall(pattern, text)
print(matches)


['1', '2', '3', '4', '5']
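
The pattern `\d` returns each digit as a separate match. If whole numbers are wanted instead, adding the `+` quantifier groups consecutive digits into one match. A small sketch on the same sentence:

```python
import re

text = "My phone number is 12345."

# \d matches one digit at a time; \d+ matches each run of consecutive digits.
single_digits = re.findall(r"\d", text)
digit_runs = re.findall(r"\d+", text)

print(single_digits)  # ['1', '2', '3', '4', '5']
print(digit_runs)     # ['12345']
```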


## Exercise 3

### Problem: Write a regex to extract all words that end with "ing".

In [3]:
text = "I am running and jumping while singing."
pattern = r'\b\w+ing\b'
matches = re.findall(pattern, text)
print(matches)

['running', 'jumping', 'singing']
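
Two properties of `\b\w+ing\b` are worth seeing on a trickier input: `\w+` requires at least one character before `ing`, and the pattern is purely textual, so it also matches words like "king" that are not verb forms. The sketch below uses a made-up sentence and adds `re.IGNORECASE` (an optional flag, not used in the cell above) to catch uppercase words:

```python
import re

# Hypothetical test sentence, chosen to include edge cases.
text = "SINGING a song, the king is running. ing"

# \w+ needs at least one character before "ing", so a bare "ing" is skipped.
pattern = re.compile(r"\b\w+ing\b", re.IGNORECASE)
matches = pattern.findall(text)

print(matches)  # ['SINGING', 'king', 'running']
```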


# Tokenization Exercises

## Exercise 1


### Problem: Write a Python script that tokenizes a simple sentence into individual words using NLTK

In [5]:
import nltk
nltk.download('punkt')
# Use nltk.word_tokenize() on any sentence

[nltk_data] Downloading package punkt to /home/mohamed/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Exercise 2

### Problem: Write a Python script that tokenizes a paragraph into individual sentences using NLTK

In [6]:
import nltk
nltk.download('punkt')
# Use nltk.sent_tokenize() on any paragraph

[nltk_data] Downloading package punkt to /home/mohamed/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Exercise 3

### Problem: Write a Python script that tokenizes a sentence into individual words using NLTK. Then, use a regular expression with grouping to extract pairs of the format "Year-Month" (e.g., "2023-08").

In [7]:
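text = "The project started in 2022-05-10 and was completed by 2023-08."
# Two capture groups: (year)-(month); findall returns one tuple per match
pattern = r'(\d{4})-(\d{2})'
# Note: this also matches the "2022-05" prefix of the full date 2022-05-10
matches = re.findall(pattern, text)
print("Year-Month:", matches)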

Year-Month: [('2022', '05'), ('2023', '08')]
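
The result above includes `('2022', '05')`, which comes from the full date `2022-05-10` rather than from a standalone Year-Month value. One way to avoid that is to tokenize first, as the problem statement asks, and then require each token to match the pattern in full. The sketch below is an assumption-laden stand-in: it replaces NLTK's `word_tokenize` with a plain whitespace split plus punctuation stripping so that it runs without extra downloads.

```python
import re

text = "The project started in 2022-05-10 and was completed by 2023-08."

# Stand-in for NLTK's word_tokenize (an assumption made to keep the sketch
# dependency-free): split on whitespace and strip surrounding punctuation.
tokens = [tok.strip(".,;:!?") for tok in text.split()]

# fullmatch requires the whole token to be Year-Month, so the full date
# "2022-05-10" is rejected instead of yielding a partial ('2022', '05').
pattern = re.compile(r"(\d{4})-(\d{2})")
pairs = [m.groups() for tok in tokens if (m := pattern.fullmatch(tok))]

print("Year-Month:", pairs)  # Year-Month: [('2023', '08')]
```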


# Stemming and Lemmatization Exercises

## Exercise 1

### Problem: Write a Python script that uses the PorterStemmer to stem a list of words. Display the original words and their stemmed forms.

In [8]:
import nltk
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# List of words to stem
words = ["running", "jumps", "easily", "faster", "runner"]
# Stem each word and print the results
for word in words:
    stemmed_word = stemmer.stem(word)
    print(f"Original: {word}, Stemmed: {stemmed_word}")

Original: running, Stemmed: run
Original: jumps, Stemmed: jump
Original: easily, Stemmed: easili
Original: faster, Stemmed: faster
Original: runner, Stemmed: runner


## Exercise 2

### Problem: Write a Python script that uses the WordNetLemmatizer to lemmatize a list of words, considering both nouns and verbs. Display the original words and their lemmatized forms. Note that for the `pos` argument, `n` stands for noun and `v` for verb.

In [9]:
import nltk
from nltk.stem import WordNetLemmatizer
# Download WordNet data if not already downloaded
nltk.download('wordnet')
nltk.download('omw-1.4')
# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# List of words to lemmatize
words = ["running", "jumps", "leaves", "better", "easily"]
# Lemmatize each word as a verb and noun and print the results
for word in words:
    lemmatized_verb = lemmatizer.lemmatize(word, pos='v')  # Verb
    lemmatized_noun = lemmatizer.lemmatize(word, pos='n')  # Noun
    print(f"Original: {word}, Lemmatized (Verb): {lemmatized_verb}, Lemmatized (Noun): {lemmatized_noun}")

[nltk_data] Downloading package wordnet to /home/mohamed/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/mohamed/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Original: running, Lemmatized (Verb): run, Lemmatized (Noun): running
Original: jumps, Lemmatized (Verb): jump, Lemmatized (Noun): jump
Original: leaves, Lemmatized (Verb): leave, Lemmatized (Noun): leaf
Original: better, Lemmatized (Verb): better, Lemmatized (Noun): better
Original: easily, Lemmatized (Verb): easily, Lemmatized (Noun): easily


## Exercise 3

### Problem: Write a Python script that compares the results of stemming and lemmatization for a list of words. Analyze cases where the results differ significantly.

In [10]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# List of words to compare
words = ["running", "wolves", "studies", "children", "better", "flying"]
# Compare stemming and lemmatization
for word in words:
    stemmed_word = stemmer.stem(word)
    lemmatized_word = lemmatizer.lemmatize(word, pos='n')  # Noun as default POS
    print(f"Original: {word}, Stemmed: {stemmed_word}, Lemmatized: {lemmatized_word}")

Original: running, Stemmed: run, Lemmatized: running
Original: wolves, Stemmed: wolv, Lemmatized: wolf
Original: studies, Stemmed: studi, Lemmatized: study
Original: children, Stemmed: children, Lemmatized: child
Original: better, Stemmed: better, Lemmatized: better
Original: flying, Stemmed: fli, Lemmatized: flying


[nltk_data] Downloading package wordnet to /home/mohamed/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/mohamed/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
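
The problem also asks for an analysis of where the two methods differ. The Porter stemmer strips suffixes by rule, so it can produce non-words (`wolv`, `studi`, `fli`) and it leaves the irregular plural `children` untouched. The WordNet lemmatizer looks up dictionary forms, but with `pos='n'` it treats `running` and `flying` as nouns and leaves them unchanged. The sketch below works on the results printed above, hard-coded as tuples so that it runs without NLTK:

```python
# (word, Porter stem, WordNet noun lemma) triples copied from the output above.
results = [
    ("running", "run", "running"),
    ("wolves", "wolv", "wolf"),
    ("studies", "studi", "study"),
    ("children", "children", "child"),
    ("better", "better", "better"),
    ("flying", "fli", "flying"),
]

# Words where the stemmer and the lemmatizer disagree.
differing = [word for word, stem, lemma in results if stem != lemma]
# Words each method returned unchanged.
unchanged_by_stemmer = [word for word, stem, _ in results if stem == word]
unchanged_by_lemmatizer = [word for word, _, lemma in results if lemma == word]

print("Differ:", differing)
print("Unchanged by stemmer:", unchanged_by_stemmer)
print("Unchanged by lemmatizer:", unchanged_by_lemmatizer)
```

Only `better` comes out the same from both methods; every other word shows one of the two failure modes described above.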


# POS Tagging

## Exercise 1

### Problem: Write a Python script that tags the parts of speech for each word in a simple sentence using spaCy. Display the word along with its POS tag.

In [11]:
import spacy
nlp = spacy.load('en_core_web_sm')
sentence = "The quick brown fox jumps over the lazy dog."
doc = nlp(sentence)
for token in doc:
    print(f"{token.text}: {token.pos_}")

The: DET
quick: ADJ
brown: ADJ
fox: NOUN
jumps: VERB
over: ADP
the: DET
lazy: ADJ
dog: NOUN
.: PUNCT


## Exercise 2

### Problem: Write a Python script that tags the parts of speech for each word in a paragraph using spaCy. Then, count how many nouns, verbs, adjectives, and adverbs are present in the paragraph.

In [12]:
import spacy
from collections import Counter
nlp = spacy.load('en_core_web_sm')
paragraph = "The quick brown fox jumps over the lazy dog. The fox is very quick and agile."
doc = nlp(paragraph)
pos_counts = Counter(token.pos_ for token in doc)
nouns = pos_counts['NOUN']
verbs = pos_counts['VERB']
adjectives = pos_counts['ADJ']
adverbs = pos_counts['ADV']
print(f"Nouns: {nouns}, Verbs: {verbs}, Adjectives: {adjectives}, Adverbs: {adverbs}")


Nouns: 3, Verbs: 1, Adjectives: 5, Adverbs: 1
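
The direct lookups such as `pos_counts['ADV']` are safe because a `Counter` returns 0 for keys it has never seen instead of raising `KeyError`. The sketch below illustrates this with the tags from the single-sentence output earlier in this section, hard-coded as a list so that it runs without spaCy:

```python
from collections import Counter

# POS tags copied from the single-sentence output earlier in this section
# ("The quick brown fox jumps over the lazy dog.").
tags = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP", "DET", "ADJ", "NOUN", "PUNCT"]

pos_counts = Counter(tags)

# Counter returns 0 for missing keys, so "ADV" does not raise KeyError.
summary = {tag: pos_counts[tag] for tag in ("NOUN", "VERB", "ADJ", "ADV")}
print(summary)  # {'NOUN': 2, 'VERB': 1, 'ADJ': 3, 'ADV': 0}
```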


## Exercise 3

### Problem: Write a Python script that tags the parts of speech for each word in a sentence using spaCy. Then, extract and display all the nouns and verbs separately.

In [11]:
sentence = "The programmer quickly wrote code and fixed bugs in the software."