### April 24

# REGULAR EXPRESSIONS (regex) 🤖

Regular expressions (regex) are a pattern language that lets us **search**, **match**, and **manipulate** text based on patterns rather than on exact words or phrases.

In Python, regex support comes from the built-in *re* module.


### Basic Functions
| **Function**   | **What it does**                                             |
|----------------|--------------------------------------------------------------|
| `re.match()`   | Checks whether the beginning of a string matches a pattern.  |
| `re.search()`  | Looks for the first match anywhere in the string.            |
| `re.findall()` | Returns all non-overlapping matches in the string as a list. |
| `re.sub()`     | Replaces every match with a replacement string.              |
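
A minimal sketch of the four functions on a made-up string (the sample text and variable names are illustrative, not taken from the exercise files):

```python
import re

text = "cat bat cat"

# re.match only succeeds if the pattern matches at the start of the string
at_start = re.match(r"cat", text)       # a match object
not_at_start = re.match(r"bat", text)   # None, because "bat" is not at position 0

# re.search returns the first match anywhere in the string
first_bat = re.search(r"bat", text)

# re.findall returns every non-overlapping match as a list of strings
all_at_words = re.findall(r"\wat", text)

# re.sub returns a new string with every match replaced
replaced = re.sub(r"cat", "dog", text)

print(at_start.group(), not_at_start, first_bat.start(), all_at_words, replaced)
```

Note that `re.match()` and `re.search()` return a match object (or `None`), while `re.findall()` returns plain strings.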

### Basic Syntax
| **Symbol** | **Meaning**                                       | **Example** | **What it matches**                          |
|------------|---------------------------------------------------|-------------|----------------------------------------------|
| `.`        | Any character except newline                      | `a.b`       | `acb`, `a_b`, `a7b`                          |
| `\d`       | A digit (0-9)                                      | `\d\d`      | `23`, `45`                                  |
| `\w`       | A word character (letter, digit, or underscore)    | `\w+`       | `abc`, `hello_123`                           |
| `\s`       | A whitespace character (space, tab, newline)       | `\s`        | `' '` (a space)                              |
| `\.`       | A literal dot (the backslash escapes it)           | `a\.c`      | `a.c` (but not `abc`)                        |
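
The symbols in the table can be tried one at a time with `re.findall()`. This is a small sketch on an invented sample string; the variable names are only for illustration:

```python
import re

sample = "a7b hello_123 site.com"

any_char = re.findall(r"a.b", sample)      # '.' stands for any single character
two_digits = re.findall(r"\d\d", sample)   # two consecutive digits
words = re.findall(r"\w+", sample)         # runs of letters, digits, or underscores
spaces = re.findall(r"\s", sample)         # each whitespace character
dots = re.findall(r"\.", sample)           # a literal dot only

print(any_char, two_digits, words, spaces, dots)
```

Comparing `any_char` with `dots` shows the difference between `.` (any character) and `\.` (a real dot).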

### 💬 Matching and Capturing
📍 Matching = the regex found something that fits your pattern.
- `[^>]*` = match any characters that are NOT `>`, as many as possible (including none)

🔄 Capturing = you told the regex to "save this part" using parentheses `()`
- `([^<]+)` = captures the inner text (one or more characters other than `<`)

**So only the part inside the parentheses `(...)` is captured and returned by `group(1)`; `group(0)` is still the whole match.**
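
The difference between the whole match and the captured part can be seen in a short sketch. The HTML snippet below is invented for illustration and is not taken from the exercise file:

```python
import re

html = '<a href="/people/x">Doe, Jane</a>'  # invented sample

m = re.search(r'<a[^>]*>([^<]+)</a>', html)

whole_match = m.group(0)   # everything the pattern matched, tags included
captured = m.group(1)      # only the text inside the parentheses

# When the pattern has one capture group, findall returns just the captured text
names_only = re.findall(r'<a[^>]*>([^<]+)</a>', html)

print(whole_match)
print(captured)
print(names_only)
```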

## Task 1

In [1]:
# Library

import re
from re import findall

In [2]:
# Open the document and read its contents (the file is closed automatically)

with open('ex1.txt', 'r') as mails:
    content = mails.read()
print(content)


In [3]:
# Extract only the mails found in the txt document ex1

R = re.findall(r'[\w.-]+@\w+\.\w+', content)

In [4]:
print(R)
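
To see what the pattern in `In [3]` accepts, here is a minimal sketch on an invented string. The addresses and the `broader` pattern are illustrative assumptions, not part of the exercise:

```python
import re

text = "Contact: ana.perez@mail.example.org, bob-smith@site.com."

# Pattern from the task: the domain part is exactly "word.word"
original = re.findall(r'[\w.-]+@\w+\.\w+', text)

# A slightly broader variant (an assumption, not the task's pattern):
# it allows hyphens in the domain and any number of ".label" parts
broader = re.findall(r'[\w.-]+@[\w-]+(?:\.[\w-]+)+', text)

print(original)
print(broader)
```

With the original pattern, a domain that has more than two parts is cut off after the second one, which may or may not matter depending on the addresses in `ex1.txt`.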

## Task 2

In [5]:
# link = https://www.stsci.edu/stsci-research/research-directory

code = open('mails stsci.txt','r')

In [6]:
redirectory = code.read()
print(redirectory)

**2.1 Create a list with mails**

In [7]:
part_mails = re.findall(r'>\w+<br', redirectory)

In [8]:
print(part_mails)

In [9]:
# Remove the surrounding '>' and '<br' with re.sub, leaving only the username

clean_mail = [re.sub(r'>|<br', '', mail) for mail in part_mails]

# re.sub replaces each match with '' (an empty string), which deletes it
# | means "or" in regex, so the pattern >|<br removes both delimiters in a single call
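
As an alternative sketch, the find-then-clean steps can be merged by capturing only the username. The HTML below is an invented sample, and `one_step` is a hypothetical name:

```python
import re

html = '<td>jdoe<br>Room 1</td><td>asmith<br>Room 2</td>'  # invented sample

# Two steps, as above: match with the delimiters, then strip them
with_delims = re.findall(r'>\w+<br', html)
two_step = [re.sub(r'>|<br', '', s) for s in with_delims]

# One step: keep the delimiters in the pattern but capture only the username
one_step = re.findall(r'>(\w+)<br', html)

print(two_step)
print(one_step)
```

Both approaches give the same list; the capture group simply avoids the extra cleaning pass.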

In [64]:
clean_mail

['kalatalo',
 'aloisi',
 'jayander',
 'tbeck',
 'bellini',
 'boeker',
 'bohlin',
 'mboyer',
 'cbritt',
 'tbrown',
 'calamida',
 'jcarlberg',
 'stefano',
 'chayer',
 'cchen',
 'marcoc',
 'carolc',
 'dcoe',
 'gderosa',
 'debes',
 'dixon',
 'ferguson',
 'flanagan',
 'afox',
 'ofox',
 'friedman',
 'fruchter',
 'fullerton',
 'gennaro',
 'kgilbert',
 'golim',
 'kgordon',
 'goudfroo',
 'jgreen',
 'nagrogin',
 'nhathi',
 'ahenry',
 'hines',
 'bholler',
 'bjames',
 'kassin',
 'keyes',
 'koekemoer',
 'gak',
 'slamassa',
 'dlaw',
 'lawton',
 'leitherer',
 'nlevenson',
 'long',
 'lubow',
 'mackenty',
 'duccio',
 'emarcucci',
 'maca',
 'pmcc',
 'mclean',
 'amaya',
 'muzerol',
 'nelan',
 'anoriega',
 'norman',
 'pogle',
 'osten',
 'bosullivan',
 'jegpeek',
 'molly',
 'mperrin',
 'npirzkal',
 'postman',
 'proffitt',
 'pueyo',
 'mrafelski',
 'mregan',
 'inr',
 'arest',
 'ariess',
 'robberto',
 'duval',
 'rryan',
 'ksahu',
 'rsankrit',
 'sargent',
 'sembach',
 'sirianni',
 'anand',
 'gcsloan',
 'lsmith

In [11]:
# add @stsci.com to the string

real_mail= [name+"@stsci.com" for name in clean_mail]
real_mail

In [62]:
print("Total Mails = " + str(len(real_mail)))

Total Mails = 245


**2.2 Create a list with names**

In this case, I identified three different patterns to extract the names. I created three separate lists based on these patterns, and after cleaning the data, I merged them into a single list.

List # 1

In [13]:
part_names1 = re.findall(r'\w+\,\s*\w+</a></td>', redirectory)

In [14]:
part_names1

['Alatalo, Katey</a></td>',
 'Aloisi, Alessandra</a></td>',
 'Anderson, Jay</a></td>',
 'Beck, Tracy</a></td>',
 'Bellini, Andrea</a></td>',
 'Boeker, Torsten</a></td>',
 'Boyer, Martha</a></td>',
 'Britt, Christopher</a></td>',
 'Brown, Robert</a></td>',
 'Calamida, Annalisa</a></td>',
 'Carlberg, Joleen</a></td>',
 'Chen, Christine</a></td>',
 'Chiaberge, Marco</a></td>',
 'Christian, Carol</a></td>',
 'Coe, Dan</a></td>',
 'Fox, Andrew</a></td>',
 'Fox, Ori</a></td>',
 'Fullerton, Alex</a></td>',
 'Gennaro, Mario</a></td>',
 'Gilbert, Karoline</a></td>',
 'Gordon, Karl</a></td>',
 'Goudfrooij, Paul</a></td>',
 'Hathi, Nimish</a></td>',
 'Henry, Alaina</a></td>',
 'Hines, Dean</a></td>',
 'Holler, Bryan</a></td>',
 'James, Bethan</a></td>',
 'Kassin, Susan</a></td>',
 'Koekemoer, Anton</a></td>',
 'Kriss, Gerard</a></td>',
 'LaMassa, Stephanie</a></td>',
 'Law, David</a></td>',
 'Lawton, Brandon</a></td>',
 'Leitherer, Claus</a></td>',
 'Levenson, Nancy</a></td>',
 'Long, Knox</a></t

In [15]:
names_1 = [re.sub(r'</a></td>', '', name1) for name1 in part_names1] 

In [16]:
names_1 

['Alatalo, Katey',
 'Aloisi, Alessandra',
 'Anderson, Jay',
 'Beck, Tracy',
 'Bellini, Andrea',
 'Boeker, Torsten',
 'Boyer, Martha',
 'Britt, Christopher',
 'Brown, Robert',
 'Calamida, Annalisa',
 'Carlberg, Joleen',
 'Chen, Christine',
 'Chiaberge, Marco',
 'Christian, Carol',
 'Coe, Dan',
 'Fox, Andrew',
 'Fox, Ori',
 'Fullerton, Alex',
 'Gennaro, Mario',
 'Gilbert, Karoline',
 'Gordon, Karl',
 'Goudfrooij, Paul',
 'Hathi, Nimish',
 'Henry, Alaina',
 'Hines, Dean',
 'Holler, Bryan',
 'James, Bethan',
 'Kassin, Susan',
 'Koekemoer, Anton',
 'Kriss, Gerard',
 'LaMassa, Stephanie',
 'Law, David',
 'Lawton, Brandon',
 'Leitherer, Claus',
 'Levenson, Nancy',
 'Long, Knox',
 'Lubow, Steve',
 'MacKenty, John',
 'Marcucci, Emma',
 'McCullough, Peter',
 'McLean, Brian',
 'Martin, Amaya',
 'Page, James',
 'Crespo, Alberto',
 'Ogle, Patrick',
 'Osten, Rachel',
 'Peek, Joshua',
 'Peeples, Molly',
 'Pirzkal, Nor',
 'Postman, Marc',
 'Proffitt, Charles',
 'Pueyo, Laurent',
 'Rafelski, Marc',
 'R

List # 2

In [17]:
part_names2 = re.findall(r'\w+\,\s*\w+</td>', redirectory)

In [18]:
part_names2 

['Casertano, Stefano</td>',
 'Chayer, Pierre</td>',
 'Rosa, Gisella</td>',
 'Flanagan, Kathy</td>',
 'Fruchter, Andrew</td>',
 'Gilliland, Ronald</td>',
 'Golimowski, David</td>',
 'Grogin, Norman</td>',
 'Nebulae, Dust</td>',
 'JWST, HST</td>',
 'distributions, Ultraviolet</td>',
 'Galaxies, dust</td>',
 'Evolution, Calibrations</td>',
 'Macchetto, Duccio</td>',
 'Marin, Macarena</td>',
 'Nelan, Ed</td>',
 'Norman, Colin</td>',
 'Sullivan, Brian</td>',
 'Regan, Mike</td>',
 'Riess, Adam</td>',
 'Sirianni, Marco</td>',
 'Sivaramakrishnan, Anand</td>',
 'Soummer, Remi</td>',
 'Stansberry, John</td>',
 'Stockman, Peter</td>',
 'Plate, Maurice</td>',
 'Valenti, Jeff</td>',
 'Whitmore, Brad</td>',
 'Dieterich, Serge</td>',
 'Espinoza, Nestor</td>',
 'Rowlands, Kate</td>',
 'Exoplanets, Instrumentation</td>',
 'Bennet, Paul</td>',
 'Gull, Ted</td>',
 'Ntampaka, Michelle</td>',
 'Pierel, Justin</td>',
 'Petric, Andreea</td>',
 'Schreier, Ethan</td>',
 'Willman, Beth</td>',
 'Murray, Claire</

In [19]:
names_2 = [re.sub(r'</td>', '', name2) for name2 in part_names2] 

In [20]:
names_2

['Casertano, Stefano',
 'Chayer, Pierre',
 'Rosa, Gisella',
 'Flanagan, Kathy',
 'Fruchter, Andrew',
 'Gilliland, Ronald',
 'Golimowski, David',
 'Grogin, Norman',
 'Nebulae, Dust',
 'JWST, HST',
 'distributions, Ultraviolet',
 'Galaxies, dust',
 'Evolution, Calibrations',
 'Macchetto, Duccio',
 'Marin, Macarena',
 'Nelan, Ed',
 'Norman, Colin',
 'Sullivan, Brian',
 'Regan, Mike',
 'Riess, Adam',
 'Sirianni, Marco',
 'Sivaramakrishnan, Anand',
 'Soummer, Remi',
 'Stansberry, John',
 'Stockman, Peter',
 'Plate, Maurice',
 'Valenti, Jeff',
 'Whitmore, Brad',
 'Dieterich, Serge',
 'Espinoza, Nestor',
 'Rowlands, Kate',
 'Exoplanets, Instrumentation',
 'Bennet, Paul',
 'Gull, Ted',
 'Ntampaka, Michelle',
 'Pierel, Justin',
 'Petric, Andreea',
 'Schreier, Ethan',
 'Willman, Beth',
 'Murray, Claire',
 'Larson, Kirsten',
 'Lucy, Adrian',
 'Aguilar, Jonathan',
 'Avila, Roberto',
 'Burger, Matthew',
 'Cara, Mihai',
 'Conroy, Kyle',
 'Filippazzo, Joe',
 'Hami, Maryam',
 'Hargis, Jonathan',
 'Jon

List # 3

In [21]:
part_names3 = re.findall(r'\w+\,\s*\w+\s*\w+\.</a></td>', redirectory)

part_names3 

In [22]:
names_3 = [re.sub(r'</a></td>', '', name3) for name3 in part_names3] 

In [23]:
names_3

['Bohlin, Ralph C.',
 'Brown, Thomas M.',
 'Debes, John H.',
 'Ferguson, Henry C.',
 'Friedman, Scott D.',
 'Green, Joel D.',
 'Hauser, Michael G.',
 'Perrin, Marshall D.',
 'Reid, Neill I.',
 'Sahu, Kailash C.',
 'Tollerud, Erik J.',
 'Marel, Roeland P.',
 'Welty, Daniel E.',
 'Blair, William P.',
 'Girard, Julien H.',
 'Mullally, Susan E.',
 'Lucas, Ray A.',
 'Wu, John F.',
 'Por, Emiel H.',
 'Tea, Mason V.',
 'Johnson, Christian I.']
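
The three patterns above each target a different cell shape. The sketch below runs them on invented rows (not the real directory) to show which pattern picks up which shape, and why the second one also collects cells that are not names, such as `'Nebulae, Dust'` in the output of List # 2:

```python
import re

# Invented cells that mimic the three shapes, plus one non-name cell
html = (
    '<td><a href="#">Doe, Jane</a></td>'
    '<td>Roe, Rick</td>'
    '<td><a href="#">Poe, Edgar A.</a></td>'
    '<td>Galaxies, dust</td>'
)

linked = re.findall(r'\w+,\s*\w+</a></td>', html)                # name inside a link
plain = re.findall(r'\w+,\s*\w+</td>', html)                     # name as plain text
with_initial = re.findall(r'\w+,\s*\w+\s*\w+\.</a></td>', html)  # name with a middle initial

print(linked)
print(plain)
print(with_initial)
```

The `plain` pattern matches any "word, word" cell, so topic lists in the table pass through it as well and would need to be filtered out separately.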

Merge the three lists

In [24]:
names= names_1 + names_2 + names_3
names

['Alatalo, Katey',
 'Aloisi, Alessandra',
 'Anderson, Jay',
 'Beck, Tracy',
 'Bellini, Andrea',
 'Boeker, Torsten',
 'Boyer, Martha',
 'Britt, Christopher',
 'Brown, Robert',
 'Calamida, Annalisa',
 'Carlberg, Joleen',
 'Chen, Christine',
 'Chiaberge, Marco',
 'Christian, Carol',
 'Coe, Dan',
 'Fox, Andrew',
 'Fox, Ori',
 'Fullerton, Alex',
 'Gennaro, Mario',
 'Gilbert, Karoline',
 'Gordon, Karl',
 'Goudfrooij, Paul',
 'Hathi, Nimish',
 'Henry, Alaina',
 'Hines, Dean',
 'Holler, Bryan',
 'James, Bethan',
 'Kassin, Susan',
 'Koekemoer, Anton',
 'Kriss, Gerard',
 'LaMassa, Stephanie',
 'Law, David',
 'Lawton, Brandon',
 'Leitherer, Claus',
 'Levenson, Nancy',
 'Long, Knox',
 'Lubow, Steve',
 'MacKenty, John',
 'Marcucci, Emma',
 'McCullough, Peter',
 'McLean, Brian',
 'Martin, Amaya',
 'Page, James',
 'Crespo, Alberto',
 'Ogle, Patrick',
 'Osten, Rachel',
 'Peek, Joshua',
 'Peeples, Molly',
 'Pirzkal, Nor',
 'Postman, Marc',
 'Proffitt, Charles',
 'Pueyo, Laurent',
 'Rafelski, Marc',
 'R

In [25]:
print("Total Names = " + str(len(names)))

Total Names = 253


**2.3 Match each name with its corresponding email**

Create a username from each name so that it can be matched with its email later.

In [26]:
# {name[0]}{last} -> This expression combines the first letter of the first name with the last name into a single string

match = []

for n in names:
    last, name = n.split(',')
    last = last.strip().lower()
    name = name.strip().lower()
    
    for m in real_mail:
        username = m.split('@')[0]
        if username == f"{name[0]}{last}":
            match.append(f"{last}, {name}; {m}")
            break


In [27]:
print(match)

['alatalo, katey; kalatalo@stsci.com', 'beck, tracy; tbeck@stsci.com', 'boyer, martha; mboyer@stsci.com', 'britt, christopher; cbritt@stsci.com', 'carlberg, joleen; jcarlberg@stsci.com', 'chen, christine; cchen@stsci.com', 'coe, dan; dcoe@stsci.com', 'fox, andrew; afox@stsci.com', 'fox, ori; ofox@stsci.com', 'gilbert, karoline; kgilbert@stsci.com', 'gordon, karl; kgordon@stsci.com', 'hathi, nimish; nhathi@stsci.com', 'henry, alaina; ahenry@stsci.com', 'holler, bryan; bholler@stsci.com', 'james, bethan; bjames@stsci.com', 'lamassa, stephanie; slamassa@stsci.com', 'law, david; dlaw@stsci.com', 'levenson, nancy; nlevenson@stsci.com', 'marcucci, emma; emarcucci@stsci.com', 'ogle, patrick; pogle@stsci.com', 'pirzkal, nor; npirzkal@stsci.com', 'rafelski, marc; mrafelski@stsci.com', 'rest, armin; arest@stsci.com', 'ryan, russell; rryan@stsci.com', 'sankrit, ravi; rsankrit@stsci.com', 'smith, linda; lsmith@stsci.com', 'bradley, larry; lbradley@stsci.com', 'kendrew, sarah; skendrew@stsci.com', 

In [28]:
len(match)

137

In [29]:
match2 = []

for n in names:
    last, name = n.split(',')
    last = last.strip().lower()
    name = name.strip().lower()
    
    for m in real_mail:
        username2 = m.split('@')[0]
        if username2 == f"{last}":
            match2.append(f"{last}, {name}; {m}")
            break

In [30]:
print(match2)

['aloisi, alessandra; aloisi@stsci.com', 'bellini, andrea; bellini@stsci.com', 'boeker, torsten; boeker@stsci.com', 'calamida, annalisa; calamida@stsci.com', 'fullerton, alex; fullerton@stsci.com', 'gennaro, mario; gennaro@stsci.com', 'hines, dean; hines@stsci.com', 'kassin, susan; kassin@stsci.com', 'koekemoer, anton; koekemoer@stsci.com', 'lawton, brandon; lawton@stsci.com', 'leitherer, claus; leitherer@stsci.com', 'long, knox; long@stsci.com', 'lubow, steve; lubow@stsci.com', 'mackenty, john; mackenty@stsci.com', 'mclean, brian; mclean@stsci.com', 'osten, rachel; osten@stsci.com', 'postman, marc; postman@stsci.com', 'proffitt, charles; proffitt@stsci.com', 'pueyo, laurent; pueyo@stsci.com', 'robberto, massimo; robberto@stsci.com', 'duval, julia; duval@stsci.com', 'sargent, beth; sargent@stsci.com', 'sembach, kenneth; sembach@stsci.com', 'tumlinson, jason; tumlinson@stsci.com', 'volk, kevin; volk@stsci.com', 'fleming, scott; fleming@stsci.com', 'mutchler, maximilian; mutchler@stsci.c

In [31]:
len(match2)

53

In [32]:
match3 = []

for n in names:
    last, name = n.split(',')
    last = last.strip().lower()
    name = name.strip().lower()
    
    for m in real_mail:
        username3 = m.split('@')[0]
        if username3 == f"{name}":
            match3.append(f"{last}, {name}; {m}")

In [33]:
print(match3)

['martin, amaya; amaya@stsci.com', 'peeples, molly; molly@stsci.com', 'casertano, stefano; stefano@stsci.com', 'grogin, norman; norman@stsci.com', 'grogin, norman; norman@stsci.com', 'macchetto, duccio; duccio@stsci.com', 'sivaramakrishnan, anand; anand@stsci.com']


In [34]:
len(match3)

7

In [35]:
match4 = []

for n in names:
    last, name = n.split(',')
    last = last.strip().lower()
    name = name.strip().lower()
    
    for m in real_mail:
        username4 = m.split('@')[0]
        if username4 == f"{name[:3]}{last[:5]}":
            match4.append(f"{last}, {name}; {m}")
            break

In [36]:
print(match4)

['anderson, jay; jayander@stsci.com']


In [37]:
mail_name= match + match2 + match3

In [38]:
len(mail_name)

197
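
The four loops above differ only in the rule that builds the username, so they could be folded into a single pass. This is a minimal sketch under stated assumptions: `candidate_usernames` and `match_names` are hypothetical helpers (not part of the notebook), every name has the form `'Last, First'` with a non-empty first name, and the addresses at example.org are invented:

```python
def candidate_usernames(full_name):
    """Return possible usernames for a 'Last, First' string, in priority order."""
    last, first = (part.strip().lower() for part in full_name.split(',', 1))
    return [
        f"{first[0]}{last}",        # first initial + last name
        last,                       # last name only
        first,                      # first name only
        f"{first[:3]}{last[:5]}",   # 3 letters of first name + 5 of last name
    ]


def match_names(names, mails):
    """Map each name to the first email whose username fits one of the rules."""
    by_user = {m.split('@')[0]: m for m in mails}  # username -> full address
    matched = {}
    for full_name in names:
        for user in candidate_usernames(full_name):
            if user in by_user:
                matched[full_name] = by_user[user]
                break  # stop at the first rule that fits
    return matched


sample_names = ['Doe, Jane', 'Roe, Rick', 'Anderson, Jay', 'Peeples, Molly', 'Nobody, Nick']
sample_mails = ['jdoe@example.org', 'roe@example.org',
                'jayander@example.org', 'molly@example.org']

matched = match_names(sample_names, sample_mails)
print(matched)
```

The dictionary lookup replaces the inner loop over `real_mail`. The rule order matters: a username made of a first name alone can also be someone else's last name, so the looser rules are tried last.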

**2.4 Extract names and emails again, using capture groups**

In [42]:
match1 = re.search(r'<a[^>]*>([^<]+)</a></td>', redirectory)
#result1 = match1.group(1) if match1 else None
print(match1.group(1))

Alatalo, Katey


In [43]:
match_name = re.search(r'<td[^>]*style="[^"]*width:\s*15%;[^"]*"[^>]*>\s*<a[^>]*>([^<]+)</a></td>', redirectory)
match_mail= re.search(r'<td[^>]*>([^<]+)<br\s*/?>', redirectory)

print(
    f"{match_name.group(1) if match_name else ''};{match_mail.group(1) if match_mail else ''}"
)
      
      

Alatalo, Katey;kalatalo


In [44]:

# Find all name matches
names = re.findall(r'<td[^>]*style="[^"]*width:\s*15%;[^"]*"[^>]*>\s*<a[^>]*>([^<]+)</a>', redirectory)

# Find all email matches
emails = re.findall(r'<td[^>]*>([^<]+)<br\s*/?>', redirectory)

# Combine name-email pairs
for name, email in zip(names, emails):
    print(f"{name};{email}")

Alatalo, Katey;kalatalo
Aloisi, Alessandra;aloisi
Anderson, Jay;jayander
Beck, Tracy;tbeck
Bellini, Andrea;bellini
Boeker, Torsten;boeker
Bohlin, Ralph C.;bohlin
Boyer, Martha;mboyer
Britt, Christopher;tbrown
Brown, Robert;calamida
Brown, Thomas M.;jcarlberg
Calamida, Annalisa;stefano
Carlberg, Joleen;chayer
Chen, Christine;cchen
Chiaberge, Marco;marcoc
Christian, Carol;carolc
Coe, Dan;dcoe
Debes, John H.;gderosa
Dixon, William Van Dyke;debes
Ferguson, Henry C.;dixon
Fox, Andrew;ferguson
Fox, Ori;flanagan
Friedman, Scott D.;afox
Fullerton, Alex;ofox
Gennaro, Mario;friedman
Gilbert, Karoline;fruchter
Gordon, Karl;fullerton
Goudfrooij, Paul;gennaro
Green, Joel D.;kgilbert
Hathi, Nimish;golim
Hauser, Michael G.;kgordon
Henry, Alaina;goudfroo
Hines, Dean;jgreen
Holler, Bryan;nagrogin
James, Bethan;nhathi
Kassin, Susan;ahenry
Keyes, Charles Tony;hines
Koekemoer, Anton;bholler
Kriss, Gerard;bjames
LaMassa, Stephanie;kassin
Law, David;keyes
Lawton, Brandon;koekemoer
Leitherer, Claus;gak
Leven

In [45]:
len(names)

151
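
The output of `In [44]` pairs `Britt, Christopher` with `tbrown`, which suggests the two lists drifted out of step. `zip()` pairs items purely by position, so one missing entry shifts every pair after it. A small sketch with invented data:

```python
names = ['Doe, Jane', 'Roe, Rick', 'Poe, Edgar']
emails = ['jdoe', 'epoe']   # Rick has no email in this invented example

pairs = list(zip(names, emails))
same_length = len(names) == len(emails)

print(pairs)        # Rick is paired with Edgar's username, and Edgar is dropped
print(same_length)
```

Checking the lengths before zipping is a cheap safeguard. On Python 3.10 and later, `zip(names, emails, strict=True)` raises a `ValueError` when the lengths differ.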

In [83]:

# Pattern to capture the name

name_re = re.compile(
    r'<td[^>]*style="[^"]*width:\s*15%;[^"]*"[^>]*>\s*(?:<a[^>]*>)?([^<]+)(?:</a>)?\s*</td>',
    re.IGNORECASE
)

# Pattern to capture the email, whether in <td> or in <p>

mail_re = re.compile(
    r'<(?:td[^>]*|p[^>]*)>\s*([^<]+)\s*<br\s*/?>',
    re.IGNORECASE
)

# Prepare the list
results = []

# Split into rows and process each one
rows = re.findall(r'<tr.*?>.*?</tr>', redirectory, flags=re.DOTALL|re.IGNORECASE)
for html_row in rows:
    nm = name_re.search(html_row)
    ml = mail_re.search(html_row)
    name = nm.group(1) if nm else ''
    mail = ml.group(1) if ml else ''
    email = f"{mail}@gmail.com" if mail else ''
    

    #results.append(f"{name};{mail}@stsci.com")
    results.append(f"{name};{email}")
    
    

In [84]:
results 

[';',
 'Alatalo, Katey;kalatalo@gmail.com',
 'Aloisi, Alessandra;aloisi@gmail.com',
 'Anderson, Jay;jayander@gmail.com',
 'Beck, Tracy;tbeck@gmail.com',
 'Bellini, Andrea;bellini@gmail.com',
 'Boeker, Torsten;boeker@gmail.com',
 'Bohlin, Ralph C.;bohlin@gmail.com',
 'Bond, Howard E.;',
 'Boyer, Martha;mboyer@gmail.com',
 'Britt, Christopher;cbritt@gmail.com',
 'Brown, Robert;',
 'Brown, Thomas M.;tbrown@gmail.com',
 'Calamida, Annalisa;calamida@gmail.com',
 'Carlberg, Joleen;jcarlberg@gmail.com',
 'Casertano, Stefano;stefano@gmail.com',
 'Chayer, Pierre;chayer@gmail.com',
 'Chen, Christine;cchen@gmail.com',
 'Chiaberge, Marco;marcoc@gmail.com',
 'Christian, Carol;carolc@gmail.com',
 'Coe, Dan;dcoe@gmail.com',
 'de Rosa, Gisella;gderosa@gmail.com',
 'Debes, John H.;debes@gmail.com',
 'Dixon, William Van Dyke;dixon@gmail.com',
 'Ferguson, Henry C.;ferguson@gmail.com',
 'Flanagan, Kathy;flanagan@gmail.com',
 'Fox, Andrew;afox@gmail.com',
 'Fox, Ori;ofox@gmail.com',
 'Friedman, Scott D.;fr
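
Processing one `<tr>` at a time keeps each name next to its own email, so a missing email stays empty instead of shifting the rest. The sketch below shows the idea on invented rows; the two patterns are simplified versions of `name_re` and `mail_re`, written for this sample only:

```python
import re

# Invented rows: linked name with email, plain name without email, email inside <p>
html = """
<tr><td style="width: 15%;"><a href="#">Doe, Jane</a></td><td>jdoe<br>410</td></tr>
<tr><td style="width: 15%;">Roe, Rick</td><td>Emeritus</td></tr>
<tr><td style="width: 15%;"><a href="#">Poe, Edgar</a></td><td><p>epoe<br>411</p></td></tr>
"""

# Name: text of the 15%-wide cell, with or without a surrounding <a>
name_re = re.compile(r'<td[^>]*width:\s*15%[^>]*>\s*(?:<a[^>]*>)?([^<]+)', re.IGNORECASE)
# Email username: text directly before a <br>, inside a <td> or <p>
mail_re = re.compile(r'<(?:td|p)[^>]*>\s*([^<]+?)\s*<br\s*/?>', re.IGNORECASE)

rows = re.findall(r'<tr.*?>.*?</tr>', html, flags=re.DOTALL | re.IGNORECASE)

pairs = []
for row in rows:
    nm = name_re.search(row)
    ml = mail_re.search(row)
    pairs.append((nm.group(1).strip() if nm else '', ml.group(1) if ml else ''))

print(pairs)
```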

In [79]:
# add @stsci.com to the string

real_mail= [name+"@stsci.com" for name in clean_mail]
real_mail

['kalatalo@stsci.com',
 'aloisi@stsci.com',
 'jayander@stsci.com',
 'tbeck@stsci.com',
 'bellini@stsci.com',
 'boeker@stsci.com',
 'bohlin@stsci.com',
 'mboyer@stsci.com',
 'cbritt@stsci.com',
 'tbrown@stsci.com',
 'calamida@stsci.com',
 'jcarlberg@stsci.com',
 'stefano@stsci.com',
 'chayer@stsci.com',
 'cchen@stsci.com',
 'marcoc@stsci.com',
 'carolc@stsci.com',
 'dcoe@stsci.com',
 'gderosa@stsci.com',
 'debes@stsci.com',
 'dixon@stsci.com',
 'ferguson@stsci.com',
 'flanagan@stsci.com',
 'afox@stsci.com',
 'ofox@stsci.com',
 'friedman@stsci.com',
 'fruchter@stsci.com',
 'fullerton@stsci.com',
 'gennaro@stsci.com',
 'kgilbert@stsci.com',
 'golim@stsci.com',
 'kgordon@stsci.com',
 'goudfroo@stsci.com',
 'jgreen@stsci.com',
 'nagrogin@stsci.com',
 'nhathi@stsci.com',
 'ahenry@stsci.com',
 'hines@stsci.com',
 'bholler@stsci.com',
 'bjames@stsci.com',
 'kassin@stsci.com',
 'keyes@stsci.com',
 'koekemoer@stsci.com',
 'gak@stsci.com',
 'slamassa@stsci.com',
 'dlaw@stsci.com',
 'lawton@stsci.

In [71]:
len(results)

260

In [61]:

# Match names inside <td style="width: 15%">, with or without <a>
name_re = re.compile(
    r'<td[^>]*style="[^"]*width:\s*15%;[^"]*"[^>]*>\s*(?:<a[^>]*>)?([^<]+)(?:</a>)?\s*</td>',
    re.IGNORECASE
)

# Match emails in any of these:
# - <td style="width: 10%;">email</td>
# - <td>email<br>
# - <p>email<br>
mail_re = re.compile(
    r'(?:<td[^>]*style="[^"]*width:\s*10%;[^"]*"[^>]*>|<td[^>]*>|<p[^>]*>)\s*([^<]+)',
    re.IGNORECASE
)

# Extract rows
rows = re.findall(r'<tr.*?>.*?</tr>', redirectory, flags=re.DOTALL | re.IGNORECASE)

results2 = []

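for row in rows:
    name_match = name_re.search(row)
    # Look for the email only after the name cell: the generic <td[^>]*>
    # alternative would otherwise match the name cell itself when the name
    # is not wrapped in <a>, and the name would be returned as the email
    start = name_match.end() if name_match else 0
    mail_match = mail_re.search(row, start)
    name = name_match.group(1).strip() if name_match else ''
    mail = mail_match.group(1).strip() if mail_match else ''
    results2.append(f"{name};{mail}")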
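
The `results` list shown earlier starts with `';'` (a row with no name) and contains entries such as `'Bond, Howard E.;'` with no email. A possible clean-up step, sketched on invented entries (`complete` and `missing_mail` are hypothetical names):

```python
results = [
    ';',
    'Doe, Jane;jdoe@example.org',
    'Roe, Rick;',
    'Poe, Edgar;epoe@example.org',
]

complete = []       # (name, email) pairs that have both parts
missing_mail = []   # names for which no email was found

for line in results:
    name, _, mail = line.partition(';')
    if not name:
        continue    # header or empty row: nothing to keep
    if mail:
        complete.append((name, mail))
    else:
        missing_mail.append(name)

print(complete)
print(missing_mail)
```

Keeping the names without an email in their own list makes it easy to check those rows by hand afterwards.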

