# Wrap up Regex and Markdown

v.ekc


## Table of Contents

- [1. Regex: Grouping Dates](#continuation)
  - [1.1 Capturing month/day/year with groups](#capturing-with-groups)
  - [1.2 Reordering with `re.sub()`](#reordering-with-resub)
  - [1.3 Named groups with `?P<>` and `\g<>`](#named-groups)
  - [Check-in 1](#regex-checkin-1)
  - [Check-in 2](#regex-checkin-2)
- [2. Markdown Essentials](#markdown-essentials)
  - [2.1 Headings](#headings)
  - [2.2 Lists: numbered + bulleted](#lists)
  - [2.3 Emphasis: bold, italics, strikethrough](#emphasis)
  - [2.4 Code formatting](#code-formatting)
  - [2.5 Math formatting](#math-formatting)
  - [2.6 Links](#links)
  - [2.7 Images](#images)
  - [Markdown Mini Example](#markdown-mini-example)
  - [Check-in 3](#markdown-checkin-3)
  - [Check-in 4](#markdown-checkin-4)

# 1 Continuation from Monday <a id="continuation"></a>

## Regex Quick Reference (Python)

| Pattern | Name | What it Matches | Example Match |
|-------|------|----------------|---------------|
| `.` | Period | Any single character (except newline) | `a`, `7`, `!` |
| `\d` | Digit | Any digit 0–9 | `5` |
| `\D` | Non-digit | Any character that is NOT a digit | `a` |
| `\w` | Word character | Letters, digits, underscore | `a`, `7`, `_` |
| `\W` | Non-word character | Anything NOT a word character | `@`, `!` |
| `\s` | Whitespace | Space, tab, newline | ` ` |
| `\S` | Non-whitespace | Any non-space character | `a` |
| `[abc]` | Set of characters | Characters that are only a, b, or c | `b` |
| `[^abc]` | Negated set of characters | Characters except a, b, or c | `z` |
| `[a-z]` | Set of a range | Any lowercase letter | `k` |
| `[A-Z]` | Set of a range | Any uppercase letter | `K` |
| `[0-9]` | Set of a digit range | Any digit | `7` |
| `^` | Start anchor / negation | Start of string; as the first character inside `[ ]`, excludes the set | `^Hello` |
| `$` | End anchor | End of string | `world$` |
| `*` | Zero or more | 0 or more of previous | `wo*` |
| `+` | One or more | 1 or more of previous | `wo+` |
| `?` | Optional | 0 or 1 of previous | `colou?r` |
| `{n}` | Exact count | Exactly n repetitions | `\d{4}` |
| `{n,}` | At least n | n or more repetitions | `\d{2,}` |
| `{n,m}` | Between | Between n and m repetitions | `\d{2,4}` |
| `()` | Group | Capture part of pattern | `(ab)` |
| `(?: )` | Non-capturing group | Group without capturing | `(?:ab)` |
| `\1` | Backreference | Refers to group 1 | `(ab)\1` |
| `(?P<name>)` | Named group | Capture with a name | `(?P<year>\d{4})` |
| `\g<name>` | Named backreference | Refer to named group | `\g<year>` |
| `\|` | OR | Either pattern | `cat\|dog` |
| `re.findall()` | Function | Return all matches | — |
| `re.search()` | Function | First match only | — |
| `re.sub()` | Function | Replace matches | — |


In [2]:
import re
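
A few rows of the quick reference can be tried directly. This is a minimal sketch; the string `sample` and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen to exercise a few table entries
sample = "colour color 2024 abab cat dog"

# `?` makes the previous character optional
spellings = re.findall(r"colou?r", sample)

# `{n}` requires exactly n repetitions
years = re.findall(r"\d{4}", sample)

# `\1` refers back to whatever group 1 captured
repeated = re.search(r"(ab)\1", sample).group(0)

# `|` matches either alternative
pets = re.findall(r"cat|dog", sample)

print(spellings, years, repeated, pets)
```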

<a id="capturing-with-groups"></a>
## 1.1 Capturing with multiple groups

We want to extract information from the following string:

```statement = 'Mary has 3 cats. Ben had 2 dogs. Maya has 14 chickens, and April has 1 alpaca.'```

The overall sentence structure can be described as:

```name + verb + number + pet```

```letters chunk/space/letters chunk/space/digit chunk/space/letters chunk```

In [6]:
# Using grouping for collections of info
statement = 'Mary has 3 cats. Ben had 2 dogs. Maya has 14 chickens, and April has 1 alpaca.'
statement

'Mary has 3 cats. Ben had 2 dogs. Maya has 14 chickens, and April has 1 alpaca.'

We can pull out the sentence pattern (without splitting on the period):

In [8]:
# get all the statements in the form "person has or had # pets"
re.findall(r'[A-Za-z]+\s[A-Za-z]+\s\d+\s[A-Za-z]+',statement)

['Mary has 3 cats',
 'Ben had 2 dogs',
 'Maya has 14 chickens',
 'April has 1 alpaca']

Now that we have the sentence pattern down, we can select out the sections of interest such as the ```person```:

```(name) + verb + number + pet```

In [11]:
# if I only care about the people, group by the first part
re.findall(r'([A-Za-z]+)\s[A-Za-z]+\s\d+\s[A-Za-z]+',statement)

['Mary', 'Ben', 'Maya', 'April']

If we want the ```person``` and their ```pet```:

```(name) + verb + number + (pet)```

The output will be a list of matches, and each element of the list will be a tuple of ```(name, pet)```.

In [13]:
# if I care about the people and the type of pet

re.findall(r'([A-Za-z]+)\s[A-Za-z]+\s\d+\s([A-Za-z]+)',statement)

[('Mary', 'cats'), ('Ben', 'dogs'), ('Maya', 'chickens'), ('April', 'alpaca')]

In [335]:
# if I care about the people and the number of pets and the type of pet

re.findall(r'([A-Za-z]+)\s[A-Za-z]+\s(\d+)\s([A-Za-z]+)',statement)

[('Mary', '3', 'cats'),
 ('Ben', '2', 'dogs'),
 ('Maya', '14', 'chickens'),
 ('April', '1', 'alpaca')]
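
Since `re.findall()` returns every group as a string, a common next step is to convert the tuples into something easier to work with. A minimal sketch, assuming we want a dictionary keyed by name (the names `pets_by_person` and `total_pets` are illustrative):

```python
import re

statement = 'Mary has 3 cats. Ben had 2 dogs. Maya has 14 chickens, and April has 1 alpaca.'
pattern = r'([A-Za-z]+)\s[A-Za-z]+\s(\d+)\s([A-Za-z]+)'

# Unpack each (name, count, pet) tuple and convert the count to an int
pets_by_person = {
    name: (int(count), pet)
    for name, count, pet in re.findall(pattern, statement)
}

# With real integers we can do arithmetic on the counts
total_pets = sum(count for count, _ in pets_by_person.values())

print(pets_by_person)
print(total_pets)
```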

<a id="reordering-with-resub"></a>
## 1.2 Example with grouping dates:

We can reorder groups and name their references for better code readability!

#### Question: how can we pull out the month/day/year into a tuple (month, day, year) using grouping?

```dates = '12-25-2025 01-01-2023 11-14-2022' ```

#### Answer

In [3]:
dates = '12-25-2025 01-01-2023 11-14-2022'

re.findall(r'(\d{2})-(\d{2})-(\d{4})', dates)

[('12', '25', '2025'), ('01', '01', '2023'), ('11', '14', '2022')]

#### Example with dates: switch the order of the groups!

We can switch the order of the groups using ```re.sub(old, new, text)```

Right now we have month-day-year, but we will change the group ordering to year-month-day, using group references to switch the order.

In [14]:
# Reference the capture groups to switch from MM-DD-YYYY to YYYY-MM-DD

print(f'''Convert to month-day-year -> year-month-day
Use function re.sub():
''')

re.sub(r'(\d{2})-(\d{2})-(\d{4})', r'\3-\1-\2', dates)

Convert to month-day-year -> year-month-day
Use function re.sub():



'2025-12-25 2023-01-01 2022-11-14'

**Groups**

1: month

2: day

3: year

So the pattern has the shape ```(1)-(2)-(3)``` and the replacement ```\3-\1-\2``` puts the year first: ```re.sub(r'(1)-(2)-(3)', r'\3-\1-\2', text)```
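
The same group references can produce any ordering or separator. A minimal sketch, assuming we want day/month/year with slashes (the name `day_first` is illustrative):

```python
import re

dates = '12-25-2025 01-01-2023 11-14-2022'

# Group 1 = month, group 2 = day, group 3 = year.
# The replacement puts the day first and swaps hyphens for slashes.
day_first = re.sub(r'(\d{2})-(\d{2})-(\d{4})', r'\2/\1/\3', dates)

print(day_first)
```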

<a id="named-groups"></a>
## 1.3 Example with dates: create our own label for groups

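Groups can be named with ```(?P<new_label>...)``` in the pattern. In the replacement string of ```re.sub()```, refer to them with ```\g<new_label>```; the ```\g<>``` reference only works there.

In [17]:
# You can name capture groups with ?P<name> to make them easier to reference; use \g<name> when referencing them
# Note: the \g<name> reference only works in the replacement string of re.sub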

print(f'''Labeling groups and reordering groups
Use function, re.sub:
''')

re.sub(r'(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})',r'\g<year>-\g<month>-\g<day>', dates)

Labeling groups and reordering groups
Use function, re.sub:



'2025-12-25 2023-01-01 2022-11-14'
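
Named groups can also be read back from a match object by name; only the `\g<name>` form is specific to replacement strings. A minimal sketch using `re.search()` and `re.finditer()`, which are standard `re` functions (`re.finditer()` is not in the quick reference table above):

```python
import re

dates = '12-25-2025 01-01-2023 11-14-2022'
pattern = r'(?P<month>\d{2})-(?P<day>\d{2})-(?P<year>\d{4})'

# re.search returns a match object for the first date
first = re.search(pattern, dates)
first_year = first.group('year')     # look up one group by its name
first_parts = first.groupdict()      # all named groups as a dict

# re.finditer yields one match object per date
all_years = [m.group('year') for m in re.finditer(pattern, dates)]

print(first_year, first_parts, all_years)
```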

#### Example with dates: referencing the groups with labels and ```re.sub()```

If we want to change out a certain group, we can use ```re.sub()``` and target the specific group.

In this example, the date is 'messy': the day is only one digit. We match the pattern by finding the day group, which is sandwiched between the two hyphens.

```regex = r'-(\d)-'``` -> grabs the one-digit day -> group 1 = ```\1```

Then we can use ```re.sub()``` and the reference to the group to add a zero before the one-digit day!

```re.sub(r'-(\d)-', r'-0\1-', messy_dates)```

In [27]:
# This type of technique is helpful for cleaning data
#   We find a one-digit day between hyphens and pad it with a leading zero

messy_dates = "dates: 12-25-2025 01-4-2023 11-14-2022"

print(f'''messy dates:
{messy_dates}

cleaned dates:
{re.sub(r'-(\d)-', r'-0\1-', messy_dates)}
''')



messy dates:
dates: 12-25-2025 01-4-2023 11-14-2022

cleaned dates:
dates: 12-25-2025 01-04-2023 11-14-2022
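
The same idea extends to one-digit months. A minimal sketch, assuming a hypothetical string `messier_dates` and using `\b` (a word boundary, which is not in the quick reference table above) so that only digits standing alone are padded:

```python
import re

# Hypothetical input with one-digit months and days
messier_dates = "1-4-2023 11-14-2022 12-5-2025"

# \b(\d)- : a single digit that starts at a word boundary and is
# followed by a hyphen. Two-digit numbers fail because the character
# after their first digit is another digit, not a hyphen.
cleaned = re.sub(r'\b(\d)-', r'0\1-', messier_dates)

print(cleaned)
```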



## Check-ins

<a id="regex-checkin-1"></a>
### Question 1: Find all occurrences of 3 digits in a row

```string_with_nums = "123 432 543 578 443 444 757 577 777 974 199"```

#### Answer

In [349]:
string_with_nums = "123 432 543 578 443 444 757 577 777 974 199"

In [350]:
# find all occurrences of 3 digits in a row
re.findall(r'\d{3}',string_with_nums)

['123', '432', '543', '578', '443', '444', '757', '577', '777', '974', '199']

### We want to find all occurrences of an identical digit repeated 3 times in a row and print out the repeated digit

```string_with_nums = "123 432 543 578 443 444 757 577 777 974 199"```

In [394]:
# find all occurrences of an identical digit repeated 3 times in a row
re.findall(r'(\d)\1{2}',string_with_nums)

['4', '7']

This pattern matches a digit repeated 3 times in a row and returns that digit:

```regex = r'(\d)\1{2}'```

- ```(\d)```: we capture one digit and it is tagged group 1, ```\1```
- ```\1{2}```: we match group 1 two more times
- because the pattern contains a group, ```re.findall()``` returns only the digit saved in group 1
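
Backreferences are not limited to digits. A minimal sketch that finds accidentally doubled words, assuming a made-up sentence and using `\b` (a word boundary, not listed in the quick reference table) to avoid matching inside longer words:

```python
import re

# Hypothetical sentence with two doubled words
sentence = "the the cat sat on on the mat"

# (\w+) captures a word as group 1, \s matches the space between,
# and \1 requires the exact same word again
doubled = re.findall(r'\b(\w+)\s\1\b', sentence)

print(doubled)
```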

### With 2 groups

Here is a different way to get the repeats, this time with two groups.

In [395]:
# it takes some work to show the repeats
matches = re.findall(r'((\d)\2{2})', string_with_nums)
#[tup[0] for tup in matches]

matches

[('444', '4'), ('777', '7')]

This pattern matches the 3 repeats in a row and returns the triplet **and** the repeated digit:

```regex = r'((\d)\2{2})'```

Question: how does the group numbering work?

Answer: groups are numbered by the order of their opening parentheses, from left to right.

- ```((\d)\2{2})```: the outermost parentheses open first, so the full triplet is group 1, ```\1```
- ```(\d)```: the inner parentheses hold the single digit, which is group 2, ```\2```
- we return group 1, the triplet, and group 2, the single digit

If we just want the triplets, we can use a list comprehension to pull them out of the output list:

```[tup[0] for tup in matches]```
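
Another option is to keep the single-group pattern and ask each match object for the full match. A minimal sketch using `re.finditer()`, a standard `re` function not listed in the quick reference table, where `group(0)` is the whole matched text:

```python
import re

string_with_nums = "123 432 543 578 443 444 757 577 777 974 199"

# group(0) is the entire match, so no outer group is needed
triplets = [m.group(0) for m in re.finditer(r'(\d)\1{2}', string_with_nums)]

print(triplets)
```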

<a id="regex-checkin-2"></a>
### Question 2: Extract all the emoticons such as `:)` or `:-)` etc.



*Hint* 
Each face has at least eyes and a mouth. There is an optional nose or tear :'(

Break up the face into three possible sets!

In [17]:
greeting = """
Hi! :D 
It is so nice to meet you! :-) 
I wish I could stay and chat :P but I have to go. :( 
Bye bye. D,:
"""

### Answer

In [371]:
regex = r'[:D][\-,]?[D\)\(\:]'
re.findall(regex, greeting)

[':D', ':-)', ':(', 'D,:']

Each face has at least eyes and a mouth. There is an optional nose or tear :'(

The first set has the eyes from the first three emojis ```:``` and the mouth from the last emoji ```D```:

```[:D]```

The second set is optional and has a nose ```-``` and a tear ```,```:

```[\-,]?```

The last set has the mouths from the first three ```D ) (``` and the eyes from the last emoji ```:```

```[D\)\(\:]```
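
The pattern above does not pick up `:P`, because `P` is not in the last set. A minimal sketch of one way to include it, assuming we simply add `P` to the mouth characters (the name `regex_with_tongue` is illustrative):

```python
import re

greeting = """
Hi! :D
It is so nice to meet you! :-)
I wish I could stay and chat :P but I have to go. :(
Bye bye. D,:
"""

# Same three sets as before, with P added to the last one.
# Inside [ ] the characters ) ( : need no escaping.
regex_with_tongue = r'[:D][\-,]?[DP)(:]'
faces = re.findall(regex_with_tongue, greeting)

print(faces)
```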

### Extract the year, month, and day for each date in the list. The dates are in the form MM-DD-YYYY.

*Hint*
Note that we have a list of strings.

You could try grabbing the first 2 characters with ```^``` for the month, or use grouping.

Lastly, recall that ```re.findall()``` returns a list for each string we inspect. We do not want a list of lists!

```
dates = ['01-31-2001','02-28-2002','03-30-2003','04-29-2004','05-28-2005','06-27-2006',
         '07-07-2007','08-08-2008','09-09-2009','10-10-2010','11-11-2011','12-12-2012']
```

#### Answer

In [4]:
dates = ['01-31-2001','02-28-2002','03-30-2003','04-29-2004','05-28-2005','06-27-2006',
         '07-07-2007','08-08-2008','09-09-2009','10-10-2010','11-11-2011','12-12-2012']

year = [re.findall(r'\d{4}', date)[0] for date in dates]
month = [re.findall(r'(\d{2})-\d{2}', date)[0] for date in dates]
day = [re.findall(r'\d{2}-(\d{2})', date)[0] for date in dates]

print(year)
print(month)
print(day)

['2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012']
['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
['31', '28', '30', '29', '28', '27', '07', '08', '09', '10', '11', '12']
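
The three lists can also be built in one pass with a single grouped pattern. A minimal sketch on a shortened list, assuming every string is exactly one MM-DD-YYYY date (`re.compile()`, `fullmatch()`, and `zip()` are standard Python but are not introduced above):

```python
import re

dates = ['01-31-2001', '02-28-2002', '03-30-2003']

# One pattern with three groups: month, day, year
pattern = re.compile(r'(\d{2})-(\d{2})-(\d{4})')

# fullmatch requires the whole string to fit the pattern;
# groups() returns the three captured pieces as a tuple
parts = [pattern.fullmatch(date).groups() for date in dates]

# zip(*parts) regroups the tuples column by column
month, day, year = (list(column) for column in zip(*parts))

print(year)
print(month)
print(day)
```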


# 2 Markdown in Jupyter Notebooks <a id="markdown-essentials"></a>

In [None]:
# This is a code cell

This is a Markdown cell. Markdown is a language for creating formatted text. Jupyter notebooks integrate Markdown cells, which allow us to describe what we're doing with nice clean formatting in between our code cells. Let's see what we can do with it.

<a id="headings"></a>
## 2.1 Headings

Start a line with one or more `#` symbols followed by a space. More `#` symbols give a smaller heading.

# Heading 1
## Heading 2
### Heading 3
#### Heading 4

<a id="lists"></a>
## 2.2 Use Markdown to Describe Reproducible Work
Example: The following cell contains a function to double a number. The function accepts one argument and multiplies the input by 2. This function can be used if anyone asks you to double something.

You can use this function:
1. in data science
    1. 111
    2. 271
2. in math
3. in stats

In no particular order, I've used this function
- at home
- at school
    - in office
    - in classroom

In [None]:
# Function that doubles
def a_function(a):
    return 2*a

a_function(2)

<a id="emphasis"></a>
## 2.3 Emphasis: bold, italics, strikethrough...

**To make text bold, enclose it in double stars**

*To italicize, enclose text in single stars*

~~To strike through text, enclose it in double tildes~~

<a id="code-formatting"></a>
## 2.4 Formatting code

When I give you problems in labs, I say 

**QuestionX** Do a thing. Assign it to variable `variable`.

For example:

```python
def doing_a_thing(arg):
    return arg

variable = doing_a_thing(4)
```

<a id="math-formatting"></a>
## 2.5 Formatting math

Inline: $\frac{1}{2}$

Display:
$$\frac{1}{2}\cdot1$$

Common mathy things: $\theta, \lambda,\pi$

Multiple lines:
\begin{align*}
f(x) &= x^2+3x+2\\
&= (x+2)(x+1)
\end{align*}

Matrix: 
\begin{bmatrix}
1&2\\
3&4
\end{bmatrix}



To learn more, look up LaTeX.

<a id="links"></a>
## 2.6 Adding Links

We are at [Cal Poly Humboldt](https://www.humboldt.edu/)

The text we want to link goes in square brackets and the link goes in parentheses. 
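
Because the link syntax is so regular, the regex groups from section 1 can pull the text and URL back out of a Markdown string. A minimal sketch, assuming a single well-formed link and no nested brackets (the name `link_pattern` is illustrative):

```python
import re

markdown_text = "We are at [Cal Poly Humboldt](https://www.humboldt.edu/)"

# \[ ... \]  : literal square brackets around the link text (group 1)
# \( ... \)  : literal parentheses around the URL (group 2)
# [^\]]+ and [^)]+ mean "anything up to the closing bracket/parenthesis"
link_pattern = r'\[([^\]]+)\]\(([^)]+)\)'
links = re.findall(link_pattern, markdown_text)

print(links)
```

Note that this pattern would also match the bracketed part of an image, since image syntax is the same with a leading `!`.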

<a id="images"></a>
## 2.7 Add images
Add pics with the following syntax:
```markdown
![alt text](image_url)
```

Example: 

![humboldt logo](https://www.times-standard.com/wp-content/uploads/2022/02/spiritSeal-interim-calPolyHumboldt.jpg?w=575)

If you need more flexibility (such as resizing), you can use HTML syntax:

```markdown
<img src="image_path" width="300"/>
```
Example:

<img src="gus_fat.JPG" width="600" />

# My Mini Lab Notes <a id="markdown-mini-example"></a>

Today we used **regex groups** to reformat dates.

Steps:
1. Write a pattern with groups
2. Test with `re.findall()`
3. Replace with `re.sub()`

Inline code example: `r"(\d{2})/(\d{2})/(\d{4})"`
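
The three steps can be sketched with the inline pattern above. This assumes a made-up string `sample` of slash-separated dates:

```python
import re

# Step 1: write a pattern with groups (month, day, year)
pattern = r"(\d{2})/(\d{2})/(\d{4})"

# Hypothetical test string
sample = "12/25/2025 01/01/2023"

# Step 2: test with re.findall()
found = re.findall(pattern, sample)

# Step 3: replace with re.sub(), reordering to year-month-day
reformatted = re.sub(pattern, r"\3-\1-\2", sample)

print(found)
print(reformatted)
```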

**Triangle Inequality**

$|a+b| \le |a| + |b|$

# $$|a+b| \le |a| + |b| $$



Link: [Course repo](https://github.com/mlekimchi/data271_sp26/)