# Scientific Programming: A Crash Course

## Class 3 – Good Housekeeping

So far we've learned all the core structures of programming: the basic data types (ints, lists, strings, etc.), the fundamental control structures (for-loops, while-loops, if-statements), and higher level abstractions (functions and objects). This is a powerful set of tools that will let you accomplish a lot of stuff. Even if you decide not to stick with Python, the core principles we've covered so far are pretty universal, and they should prove useful wherever your coding journey takes you.

Before we put these tools to more practical scientific uses, there are a bunch of things that I still want to cover, so today's class is a little bit of a jumble of stuff, including reading/writing files, project and data organization, handling errors, and a few other bits and pieces.

## Reading and Writing Files

So far, everything we've been doing has been very self-contained inside the notebook. But, in the real world, we very often need to interact with other files, either by reading data in or writing data out. Let's first look at how to read a file in Python:

In [None]:
with open('some_file.txt') as file_object:
    content_of_the_file = file_object.read()
    
print(content_of_the_file)

When you run this code, you will get a `FileNotFoundError` error. Obviously this is because there is no file called `some_file.txt` on your computer right now. So, first, I want you to go and create a plain text file (it should have the file name extension `.txt`) and save it to the same folder as this notebook. Write a message inside the file like, `Hello world!`. You will need to use a text editor app like TextEdit (Mac) or Notepad (Windows), or some other text editor that you like. Once you've created the file, try running the code again and check that you can recover the message that you saved in the file.

Now let's look at the new syntax. Here we are using a **context manager** – the `with` statement. Context managers are not used very often (opening files is the most common use-case), so if you don't really get it, don't worry... just learn the syntax above so that you know how to read and write files.

To understand what's going on, we need to rewind back to the old days of Python, when you would open a file like this:

```python
file_object = open('some_file.txt')
content_of_the_file = file_object.read()
file_object.close()
```

The first line, where we create the file object, essentially initializes a connection to the file. In the second line, we use the file object's `.read()` method to read the content of the file, which we then assign to the variable `content_of_the_file`. Finally, we need to explicitly close the connection to the file by calling the file object's `.close()` method. Closing a file connection can be important because this releases Python's lock on the file, so that some other program can access it.

The context manager syntax simplifies this a little:

```python
with open('some_file.txt') as file_object:
    content_of_the_file = file_object.read()
```

As you can see, we are using the same bits of code – the `open()` function and the `.read()` method – and we have the same two variables – `file_object` and `content_of_the_file`. However, the context manager automates the process of closing the file connection. Any code indented inside the with-statement is run while the file connection is open. Once you exit the with-statement, the connection is automatically closed.
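
If you want to convince yourself that the connection really is closed, you can check the file object's `closed` attribute. Here's a minimal sketch; the file name `demo.txt` and the temporary folder are just assumptions for illustration, so that nothing on your computer gets overwritten:

```python
import os
import tempfile

# work in a throwaway folder so that no real files are touched
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, 'demo.txt')  # hypothetical file name

    # create a small file to read back
    with open(demo_path, 'w') as file_object:
        file_object.write('Hello world!')

    with open(demo_path) as file_object:
        content_of_the_file = file_object.read()
        closed_inside = file_object.closed   # the connection is still open here

    closed_outside = file_object.closed      # closed once we leave the with-statement

print(content_of_the_file, closed_inside, closed_outside)
```

The file is also closed if an error occurs inside the with-statement, which is the main reason this syntax is preferred over calling `.close()` yourself.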

Now let's try writing a file:

In [None]:
secret_message = 'Python is cool 😎'

with open('some_file.txt', 'w') as file:
    file.write(secret_message)

As you can see, the syntax is pretty similar. The only differences are (1) we need an extra `'w'` in the call to the `open()` function, which requests that the file be opened in write mode, and (2) we need to use the file object's `.write()` method. Go ahead and verify that the message was correctly stored by opening `some_file.txt` in another app. Note, also, that the original content of the file is completely overwritten. Be careful when working with files: Python will quite happily overwrite data without any warnings. In fact, even just opening a connection to the file in `'w'` mode will erase the content of the file, even if you don't actually write anything:

In [None]:
with open('some_file.txt', 'w') as file:
    pass
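
If overwriting is not what you want, `open()` has two other modes that may help: `'a'` appends to the end of an existing file, and `'x'` refuses to open a file that already exists. The sketch below tries both in a throwaway folder. The file name is made up, and the `try`/`except` syntax is something we only get to later in this class, so don't worry about that part yet.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, 'demo.txt')  # hypothetical file name

    with open(demo_path, 'w') as file:
        file.write('first line\n')

    # 'a' (append) adds to the end of the file instead of erasing it
    with open(demo_path, 'a') as file:
        file.write('second line\n')

    # 'x' (exclusive creation) raises an error if the file already exists
    try:
        with open(demo_path, 'x') as file:
            file.write('this is never written')
        refused = False
    except FileExistsError:
        refused = True

    with open(demo_path) as file:
        final_content = file.read()

print(final_content)
print(refused)
```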

A useful feature of the file object is that it is iterable. When you iterate over a file object, you get access to each line of the file. This could be useful, for example, if you had a file that contains a list of stimuli and you wanted to read the file line by line. Before running the following code, edit your `some_file.txt` so that it contains multiple lines of stuff.

In [None]:
with open('some_file.txt') as file:
    for line in file:
        print(line)
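
You may notice blank lines in the printed output. This is because each `line` still ends with its newline character, and `print()` adds another one. Here's a small sketch of one way to deal with that. I'm using `io.StringIO` as a stand-in for a real file so that the example runs anywhere, and the stimuli are made up:

```python
import io

# io.StringIO behaves like an open text file; it stands in for some_file.txt here
file = io.StringIO('apple\nbanana\ncherry\n')

stimuli = []
for line in file:
    stimuli.append(line.rstrip('\n'))  # drop the trailing newline character

print(stimuli)
```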

## Structured Data Formats

Storing data in plain text – as we just did above – is a great idea. Plain text is lightweight, accessible, platform agnostic, and unencumbered by any kind of proprietary lock-in. A plain text file made 50 years ago can still be opened today, and will – hopefully – still be openable in another 50 years' time. I highly encourage you to stop using proprietary, closed formats wherever you can, especially in science, where we want to maximize the accessibility, transparency, and longevity of our datasets.

Typically, our datasets are highly structured – for example, your data may be naturally arranged in a table or in a hierarchical structure. This means that storing such data as plain text can get a bit messy. For example, let's say you were running an experiment, and your script writes the data to plain text. You might decide that the first line of the file is the subject ID, the second line is the condition ID, then the next *n* lines contain practice trials, and the next *m* lines contain real trials, and the last line records the time the experiment ended. Each of the lines that records an individual trial might contain two pieces of information – the test stimulus and the participant's response, separated by a dash. This is messy, right? Anyone looking at this file will not be able to easily interpret it, and it's also difficult for you yourself to work with computationally.

Here's my advice: Don't try to reinvent the wheel by inventing your own ad-hoc plain text data format; instead, use a standard format to store your data. The two most popular are CSV and JSON. You've probably encountered CSV files before, but maybe not JSON. Both are useful, but they have different philosophies and ideal use-cases. Crucially, both formats are actually just extensions of plain text. CSV and JSON files are simply plain text files that follow certain conventions to indicate their own structure. This means we get the benefits of plain text, as well as the benefits of something more structured.

### CSV files

The CSV (comma-separated values) format mirrors the basic structure of a spreadsheet with rows and columns. For example, you might have a CSV file in which the raw plain text looks like this:

```
subject,test_type,system,correct
1,production,size,1
1,production,size,1
1,production,size,0
1,production,size,1
1,production,size,1
...
240,comprehension,shape,0
240,comprehension,shape,1
240,comprehension,shape,1
240,comprehension,shape,0
240,comprehension,shape,0
```

And here's what it looks like if I tidy up the alignment a little so that you can read it better:

```
subject, test_type,     system, correct
1,       production,    size,   1
1,       production,    size,   1
1,       production,    size,   0
1,       production,    size,   1
1,       production,    size,   1
...
240,     comprehension, shape,  0
240,     comprehension, shape,  1
240,     comprehension, shape,  1
240,     comprehension, shape,  0
240,     comprehension, shape,  0
```

The details of the dataset don't matter; the main point is we have a tabular structure. Each row is on a new line and each cell within a row is separated by a comma. Deciding how to organize your data in a table would require an entire class in itself, so I won't go into too much detail here; instead, I'd recommend you read up on the "Tidy Data" philosophy here: https://tidyr.tidyverse.org/articles/tidy-data.html (perhaps bookmark it for the weekend – the code is in R, but the advice is universal).

Anyhow, we could try to parse this CSV file manually. For example, we could iterate over the lines in the file and split each line into parts based on the comma character. If you're feeling adventurous, try writing a function to parse a CSV file; hint: strings have a `.split()` method that should be useful:

In [None]:
def my_csv_reader(file_name):
    # write your function here
    return

However, there are already functions available to you for reading CSV files; better to use something that is well-tested than to roll your own!
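
To see why, here's a small sketch of a case where the simple `.split()` approach goes wrong: a cell that itself contains a comma. The data is made up, and `io.StringIO` stands in for a real file. The `csv` module from the standard library (introduced next) understands the quotation marks, whereas a naive split does not.

```python
import csv
import io

# a tiny, made-up CSV file in which one cell contains a comma
raw_text = 'subject,comment\n1,"easy, but a bit long"\n'

# naive approach: split every line on the comma character
naive_rows = []
for line in raw_text.splitlines():
    naive_rows.append(line.split(','))

# well-tested approach: let the csv module handle the quoting rules
parsed_rows = list(csv.reader(io.StringIO(raw_text)))

print(naive_rows[1])   # the comment has been split into two broken pieces
print(parsed_rows[1])  # the comment survives as a single cell
```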

Let's import the `csv` module from the Python standard library:

In [None]:
import csv

To use the module, you are expected to open the file yourself and use the relevant function to parse the CSV. This will result in code like this (note, make sure you downloaded the two example files from [the GitHub repo](https://github.com/jwcarr/sciprog22)):

In [None]:
with open('example_data.csv', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

This is somewhat helpful – at least it does the parsing for us – but still not much better than just manually iterating over the lines of the raw file. A better option is `DictReader()` where each row will be represented as a dictionary like this:

```python
{'subject': '1', 'test_type': 'production', 'category_system': 'size', 'correct': '1'}
```

which can make it easier to pull out the info you need. Have a look at the following code and try to understand what it does before you run it.

In [None]:
n_correct = 0
total_trials = 0

with open('example_data.csv', newline='') as file:
    reader = csv.DictReader(file)
    for row in reader:
        if row['subject'] == '1' and row['test_type'] == 'production':
            if row['correct'] == '1':
                n_correct += 1
            total_trials += 1

accuracy = n_correct / total_trials
print(accuracy)

So, here, I'm manually isolating the rows where `subject == 1` and `test_type == production` and then calculating the accuracy. This is still not a great solution, however. I mean, it looks pretty messy, right?

A much better solution would be to use a full-fledged data frame to handle this kind of data. We won't look at data frames today – we'll return to this topic tomorrow – but to give you a quick preview, here's how you would do it using the `pandas` package. Pandas is not part of the Python Standard Library, although it is included with Anaconda; if you are not using Anaconda, you may need to install it first to try out the following code. Installing packages is also something I will come back to later, so if you're not sure how to do it yet, don't worry, you can skip over this for now.

In [None]:
import pandas as pd

# open the CSV file
df = pd.read_csv('example_data.csv')

# filter the rows of the data frame for subject==1 and test_type == production
subject_1 = df[ (df['subject'] == 1) & (df['test_type'] == 'production') ]

# calculate the accuracy
subject_1['correct'].mean()

For those of you coming from an R background, this will probably look very familiar. In any case, the main point is that, if you want to work with CSV files, it's probably best to use a package that's designed to work with tabular style data; the built-in CSV module is a bit too low-level for our typical use-cases. Whenever there are multiple options available to you, it's always a good idea to explore a few of them to see what's going to work best for your needs.

### JSON files

JSON stands for JavaScript Object Notation. Although the format originally came from the JavaScript world, it is now very widely used across many languages; there is nothing about it that is specific to JavaScript. JSON is a great format to use when you are working with data that is *not* naturally tabular. Let's jump right in and have a look at a JSON file from an experiment I ran recently:

In [None]:
import json

with open('example_data.json') as file:
    data_set = json.load(file)
    
print(json.dumps(data_set, indent=4))

The `json` module is part of the Python Standard Library, so we don't need to install anything. As with the `csv` module, we need to open the file ourselves and then let the `json` module deal with parsing and reading the raw plain text. The resulting variable `data_set` is a Python dictionary containing all the data. In the last line I'm using the `json.dumps()` function to print the dictionary with nice indentation to make it more human-readable, but this is not strictly necessary.

If you scroll through the data, you should be able to get a very rough idea of the design of the experiment. For example, the `trial_sequence` part lists all the trials that the participant did: first, a consent form, then a calibration, then some instructions, and then a training block, etc... And at the end of the file you will find the comments that the participant made and some information about what languages they speak.

A nice feature of JSON is that it is self-documenting. Because each value is accompanied by a key, it is relatively easy to understand what each piece of data means. The format is also very flexible in terms of the overall structure, allowing us to mirror the natural hierarchical structure of the experiment. It would be quite hard to force all this data into a table structure based on rows and columns. In which row and column would you put the participant's comments for example?

Since `data_set` is a Python dictionary, we can easily access the things that are of interest:

In [None]:
print(data_set['comments'])

In [None]:
print(data_set['other_languages'])

In [None]:
print(data_set['creation_time'])
print(data_set['modified_time'])
print(data_set['modified_time'] - data_set['creation_time'])

Incidentally, the timestamps above are expressed in [UNIX time](https://en.wikipedia.org/wiki/Unix_time) – the number of seconds since 00:00:00 on 1st January 1970. This is a universal format – not just in Python – for representing dates and times without having to deal with all the awkwardness of non-decimal calendars and clocks. Here, `modified_time` is the last time the data was modified, so the difference between the two times tells us how long the participant took to complete the experiment (in seconds).
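
If you ever need to turn a UNIX timestamp into a human-readable date, the `datetime` module from the standard library can do it. This is just a sketch: the two timestamps below are invented stand-ins for `creation_time` and `modified_time`, not values from the example file.

```python
from datetime import datetime, timezone

# hypothetical timestamps (seconds since 1st January 1970)
creation_time = 1_700_000_000
modified_time = 1_700_001_830

# convert to datetime objects, expressed in UTC
started = datetime.fromtimestamp(creation_time, tz=timezone.utc)
finished = datetime.fromtimestamp(modified_time, tz=timezone.utc)

# subtracting two datetimes gives a timedelta (a duration)
duration = finished - started

print(started.isoformat())
print(duration)
```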

With a little bit of code, we can pull out the test trials and calculate accuracy:

In [None]:
accuracy_by_position = [0, 0, 0, 0, 0, 0, 0]

for trial in data_set['responses']:
    if trial['test_type'] == 'controlled_fixation_test':
        correct = trial['object'] == trial['selected_object']
        accuracy_by_position[trial['fixation_position']] += correct

print(accuracy_by_position)

The details of the experiment don't matter; I mostly just want to give you the idea that you can work with JSON data quite naturally in Python and pull out the things that are interesting to you. JSON is also easy to use from the perspective of the experiment script; as the experiment is running, you simply store data in a Python dictionary, and then, at the end of the experiment, you write that dictionary to JSON – its structure will be perfectly preserved.
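
Here's a minimal sketch of what that writing step might look like, using `json.dump()` (the counterpart of `json.load()`). The dictionary, its keys, and the file name are all invented for illustration, and I'm writing to a throwaway folder so that nothing real gets overwritten:

```python
import json
import os
import tempfile

# hypothetical data collected while an experiment is running
data_set = {
    'subject_id': 'S01',
    'responses': [
        {'trial': 1, 'object': 'cat', 'selected_object': 'cat'},
        {'trial': 2, 'object': 'dog', 'selected_object': 'cow'},
    ],
    'comments': 'It was fun!',
}

with tempfile.TemporaryDirectory() as folder:
    json_path = os.path.join(folder, 'S01.json')  # hypothetical file name

    # write the dictionary to a JSON file
    with open(json_path, 'w') as file:
        json.dump(data_set, file, indent=4)

    # read it back in to check that nothing was lost
    with open(json_path) as file:
        reloaded = json.load(file)

print(reloaded == data_set)
```

One caveat: JSON only knows about a few basic types, so things like tuples come back as lists, and dictionary keys always come back as strings.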

That being said, it is often easier to work with numbers when they are arranged in a table – especially, when it comes to the statistical part of your analysis pipeline. It may, therefore, be useful to combine both formats. You could use JSON to store the raw data that comes directly out of your experiment, and then you could write a script to transform that raw data into a clean CSV file that can be passed into your statistical code. This CSV file would just contain the numbers relevant to the analyses; all the metadata, like timestamps and comments, would be left in the JSON in case you ever needed to refer back to it.
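
A transformation script along these lines could be quite short. The sketch below assumes a made-up raw data structure (the keys `subject_id`, `responses`, `object`, and `selected_object` are hypothetical) and uses `io.StringIO` in place of a real output file:

```python
import csv
import io

# hypothetical raw data, as it might be loaded from a JSON file
data_set = {
    'subject_id': 'S01',
    'responses': [
        {'trial': 1, 'object': 'cat', 'selected_object': 'cat'},
        {'trial': 2, 'object': 'dog', 'selected_object': 'cow'},
    ],
    'comments': 'It was fun!',
}

# keep only the numbers needed for the analysis, one row per trial
rows = []
for trial in data_set['responses']:
    rows.append({
        'subject': data_set['subject_id'],
        'trial': trial['trial'],
        'correct': int(trial['object'] == trial['selected_object']),
    })

# io.StringIO stands in for open('S01.csv', 'w', newline='')
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=['subject', 'trial', 'correct'])
writer.writeheader()
writer.writerows(rows)

csv_text = output.getvalue()
print(csv_text)
```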

The main message I want to give you is that you should spend some time thinking about your data organization. Try to put yourself in the shoes of another scientist who wants to use your data in the future. Can they easily understand it and work with it to answer their own questions? Will it still be possible to open and process the data in ten years? Do you need to have access to specialist software to open the files? Will this software still exist in the future, and, if so, will it still run on computers of the future? These kinds of questions can be particularly challenging in the context of, for example, imaging data or eye tracking data, and there aren't necessarily good answers to these questions – you just have to do the best you can.

## Pain with Paths

I'm sure all of you are very experienced with working with folders on your computer, and I'm sure all of you are super organized! Nevertheless, as your projects grow, things can quickly get out of control and messy. It is therefore also a good idea to think about not just the organization of your data, but also the organization of your entire project. Do you have data, manuscripts, stimuli, and code scattered all over the place with crazy long file names? Then it's time to get organized!

I'm not going to spend a lot of time here giving organizational advice; instead, I would suggest that you go read this website: https://goodresearch.dev (another one for the weekend). It is full of really excellent advice on how to keep your projects organized, particularly in the context of Python (although a lot of the advice is more general).

What I do want to focus on here is handling file paths. We have already used a very simple kind of path above. When opening a file, like `some_file.txt`, we are specifying the path to where the file can be found. In this case, since the file is in the same folder as the code (i.e. this notebook), we only need to specify the name of the file; the rest of the path is implied. However, as your project grows, you will need to organize data into subfolders and subsubfolders, so it's therefore really useful to understand how paths are handled in Python.

First, let's rewind a little and check that we're all on the same page in terms of what a path is. A full path is something like this:

```
/Users/jon/Code/sciprog23/some_file.txt
```

It describes where a file is located on your computer. Of course, the path needs to be stored as a string; otherwise, it would be interpreted as the variable `Users` divided by the variable `jon` divided by the variable `Code` etc... After many years of programming, I still frequently forget to put paths inside quotation marks and then Python freaks out trying to divide a bunch of variables that don't exist. So, your path should be in quotation marks like this:

```python
"/Users/jon/Code/sciprog23/some_file.txt"
```

For people on Windows, your paths will look a little different, maybe something like this:

```python
"C:\Users\jon\Code\sciprog23\some_file.txt"
```

In particular, note that Windows uses the backslash (`\`) as the path separator. This is problematic in Python (and other languages) because the backslash has a special meaning in strings – it is used as an ["escape character"](https://en.wikipedia.org/wiki/Escape_character). For example, `\n` represents a new line and `\t` represents a TAB character. Therefore, to specify paths on Windows you need to escape the escape character (!) by using double backslashes, so that you end up with something like this:

```python
"C:\\Users\\jon\\Code\\sciprog23\\some_file.txt"
```

An alternative solution is to use "raw strings", which are created similarly to the f-strings that we've already been using. In a raw string, which you create by placing an `r` in front of the opening quotation mark, backslashes are not treated as special escape characters, allowing you to specify the path literally:

```python
r"C:\Users\jon\Code\sciprog23\some_file.txt"
```
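
To make the difference concrete, here's a quick sketch comparing the three ways of writing the same Windows-style path. The path itself is made up; I picked folder names starting with `t` and `n` on purpose so that the unescaped version goes wrong.

```python
escaped = "C:\\temp\\new_data.txt"  # double backslashes
raw = r"C:\temp\new_data.txt"       # raw string
broken = "C:\temp\new_data.txt"     # oops: \t becomes a TAB and \n becomes a new line

print(escaped == raw)         # the first two are the same string
print(len(raw), len(broken))  # the broken version is two characters shorter
print(repr(broken))           # repr() shows the hidden TAB and newline
```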

Anyway, to read in the file we created earlier, we can specify the full path:

In [None]:
with open('/Users/jon/Code/sciprog23/some_file.txt') as file_object:
    content_of_the_file = file_object.read()
    
print(content_of_the_file)

Naturally, the code above won't work for you because the path is specific to my computer – it contains my username, for example. Go ahead and edit the path to make sure you can open the file correctly using its full path.

Hardcoding full paths like this is a bad idea, but, sadly, it's very common practice in science. What happens if a colleague tries to run your code? It won't work until they fix the path(s). And what happens if you move between Windows and Mac? For example, I code my experiments on my Mac but then I run them on a Windows computer in the lab, which can sometimes lead to annoying problems, like having to change all the forward slashes to backslashes.

There are two things you can do to avoid these kinds of issues, and it's a good idea to get into these two habits early on. First, you should always try to specify paths relative to where the code is located. For example, let's say your project is organized like this:

```
project
|- code
|  |- analysis.py
|- data
|  |- exp1
|  |  |- data.csv
|  |- exp2
|  |  |- data.csv
```

Let's assume the Python script (`analysis.py`) is run from inside the `code` directory (relative paths are interpreted relative to the current working directory, i.e. the folder you run the script from). Therefore, to access the data files, the code needs to move *up* into the `project` directory and then *down* into the data directory and then *down* again into, for example, the `exp1` directory. This is how we express the path from the perspective of the analysis script:

```python
"../data/exp1/data.csv"
```

or on Windows

```python
r"..\data\exp1\data.csv"
```

Note that the two dots (`..`) mean "move up one directory." Thus, if someone downloads your entire project repository, they can run the `analysis.py` script and it will be able to locate `data.csv` with no issues. They don't need to fiddle around with your code, finding and changing all the paths. Unless, of course, this person is using a different operating system...

This brings me to the second thing you should always do. Instead of writing paths as strings, use `Path` objects, which are designed specifically for representing paths. For example, to represent the path above, you would do this:

```python
Path('..') / 'data' / 'exp1' / 'data.csv'
```

Because we are using a `Path` object here, the forward slash `/`, which usually means divide, is automatically redefined to mean concatenate. (This might sound a bit crazy, but if you followed the bonus class on object-oriented programming, then you will recognize this as operator overloading.) Underlyingly, the path will automatically use the appropriate path separator depending on the OS you happen to be using. This allows you to specify paths in an OS-agnostic way.
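
If you're curious what this looks like on the "other" operating system, `pathlib` also provides `PurePosixPath` and `PureWindowsPath`, which let you preview each flavor of path without touching the file system. The following is just a sketch to illustrate the idea; in real code you would normally stick to plain `Path`.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

# the same path as above, built with the overloaded / operator
data_path = Path('..') / 'data' / 'exp1' / 'data.csv'
print(data_path.parts)  # the individual components of the path

# preview how the path would be written on Mac/Linux and on Windows
posix_version = PurePosixPath('..') / 'data' / 'exp1' / 'data.csv'
windows_version = PureWindowsPath('..') / 'data' / 'exp1' / 'data.csv'
print(posix_version)
print(windows_version)
```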

To use the `Path` object, it needs to be imported from the `pathlib` module. The `Path` object also allows us to do much more than simply *representing* paths. For example, it allows you to iterate over all the files in a directory:

In [None]:
from pathlib import Path

home_dir = Path.home()
print(f'My home directory is {home_dir}')

print('These are the files in my home directory:')
for item in home_dir.iterdir():
    if item.is_file():
        print(item)
        
print('These are the folders in my home directory:')
for item in home_dir.iterdir():
    if item.is_dir():
        print(item)

This could be useful if you needed to iterate over all CSV files in a particular directory. Maybe you have a separate data file for each experiment you've done, and you need to iterate through them all to do some kind of processing. Here, for example, I am iterating over all the CSV files in the same directory as this notebook:

In [None]:
current_working_directory = Path.cwd()

for item in current_working_directory.iterdir():
    if item.suffix == '.csv':
        print(item)

`Path` objects have lots more handy methods and attributes for working with the file system. These include:

- `exists()` - Returns `True` if the path exists
- `mkdir()` - Create the path on the file system
- `rename()` - Rename the file
- `parent` - The parent directory
- `stem` - The file name without the file name extension
- `suffix` - The file name extension

If you've ever needed to do batch renaming of 100s of files, you know how tedious it can be. The tools contained in `pathlib` should help you out. For a comprehensive review of all the things you can do, check out this guide on Real Python (https://realpython.com/python-pathlib/). My main pieces of advice are (1) to get into the habit of using `Path` objects and (2) to try to specify paths relatively rather than absolutely. Make sure that, if someone downloads your repository, they can just press run and immediately generate output without any fuss.
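
To give a flavor of batch renaming, here's a sketch that adds a prefix to every CSV file in a folder. Everything here is hypothetical (the folder, the file names, and the `exp1_` prefix), and it all happens inside a temporary directory so you can run it safely. It also uses two `Path` methods not listed above, `glob()` and `with_name()`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    data_dir = Path(folder) / 'data'
    data_dir.mkdir()

    # create three empty-ish data files to play with
    for number in range(1, 4):
        (data_dir / f'subject{number}.csv').write_text('subject,correct\n')

    # glob('*.csv') finds all CSV files; sorted() collects them before we start renaming
    for old_path in sorted(data_dir.glob('*.csv')):
        new_path = old_path.with_name(f'exp1_{old_path.stem}{old_path.suffix}')
        old_path.rename(new_path)

    renamed = sorted(item.name for item in data_dir.iterdir())

print(renamed)
```

When you try something like this on real files, it's a good idea to first print the old and new names without calling `rename()`, just to check that the result is what you expect.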

## List Comprehensions

Okay! That's enough data organization. For the rest of this class, let's play around with some more interesting coding topics. First, a very elegant feature of Python is the **list comprehension**, which allows you to generate lists in an intuitive and concise way. List comprehensions are very ["pythonic"](https://towardsdatascience.com/how-to-be-pythonic-and-why-you-should-care-188d63a5037e). You don't *have* to use list comprehensions, but they are quite nice and widely used, so you are bound to encounter them pretty soon. Here's a list comprehension to generate the first ten square numbers:

In [None]:
squares = [x**2 for x in range(1, 11)]
print(squares)

As you can see, this mixes together some of the syntax of a for-loop with some of the syntax of a list. This is how we would write the code if we were just using a regular for-loop:

In [None]:
squares = []
for x in range(1, 11):
    squares.append(x**2)
print(squares)

We get exactly the same output, but with the list comprehension the code is shorter and more elegant. List comprehensions can even incorporate an if-statement. For example, let's make a list of squares that are also even:

In [None]:
even_squares = [x**2 for x in range(1, 11) if x**2 % 2 == 0]
print(even_squares)

To get some practice, try rewriting the list comprehension above as an ordinary for-loop and if-statement:

Now try rewriting the following code as a list comprehension, and check that you get the same output:

In [None]:
numbers = []

for x in range(100):
    if x % 2 == 0:
        numbers.append(x**2)
print(numbers)

You should notice that the list comprehension is basically just a reordering of all the syntax into a more compact form. There are also **dictionary comprehensions**:

In [None]:
squares = {x: x**2 for x in range(20)}
print(squares)

Does the syntax make sense? Try writing a dictionary comprehension where the keys are the numbers 1 through 26 and the values are the corresponding letter of the alphabet, `1:'A'`, `2:'B'`, etc... (hint: the `chr()` function might prove useful).

Overall, list and dictionary comprehensions provide some very elegant syntax for quickly generating simple lists and dictionaries. However, they should not be abused. It's actually possible to create really crazy list comprehensions that combine multiple loops and conditions, and the code can quickly become totally unreadable. Never forget that one of the most important things about your code is that it is readable – not just to others but also to your future self. **Readability is always more important than brevity, cleverness, or shaving milliseconds off the compute time.**
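
As a small illustration of how quickly things can get out of hand, here's a made-up comprehension with two loops and two conditions, followed by the same logic written as ordinary loops. Both versions produce the same list, but I suspect most people find the second one easier to follow.

```python
# compact, but you probably need to read it a few times to see what it does
pairs = [(x, y) for x in range(4) for y in range(4) if x < y if (x + y) % 2 == 1]

# the same logic written out: all pairs where x is less than y and the sum is odd
pairs_loop = []
for x in range(4):
    for y in range(x + 1, 4):
        if (x + y) % 2 == 1:
            pairs_loop.append((x, y))

print(pairs == pairs_loop)
```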

## Error Handling

Over the past week, I'm sure you've bumped into lots of errors, and maybe you sometimes feel frustrated with them. But one thing I want you to remember is that **errors are a good thing!** Errors inform you that something fishy is going on. It is much better to be aware of a potential problem than for the problem to go silently undetected.

### Exception handling with `try`/`except`

Sometimes, if you expect a particular type of error to occur, you might be able to take corrective action. Exception handling using the `try` and `except` statements can do exactly this. Here's an example:

In [None]:
# pick a number...
number = 0

colors = {1:'red', 2:'green', 3:'blue'}

try:
    picked_color = colors[number]
except KeyError:
    picked_color = 'black'
    
print(picked_color)

Python will attempt to run the code inside the `try` block. If that code runs fine with no errors, then we just continue as normal. However, if that code fails, specifically if it produces a `KeyError`, the code in the `try` block is abandoned and the code inside the `except` block will be run instead. In this case, if you pick the numbers `1`, `2`, or `3`, the code will run fine and you will get the corresponding color. If you choose some other number that is not a valid key in the `colors` dictionary, the code will fail with a `KeyError`, causing the variable `picked_color` to be set to `black`. Exception handling is especially useful if you expect a particular type of error to occur and you know in advance how to correct it.
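
The same idea applies to the `FileNotFoundError` we bumped into at the start of this class. Here's a sketch of a small function that falls back to a default message if the file is missing. The function name, the default text, and the file name are all made up for illustration.

```python
def read_message(file_name, default='(no message yet)'):
    # try to read the file; if it does not exist, return the default instead
    try:
        with open(file_name) as file:
            return file.read()
    except FileNotFoundError:
        return default


# assuming no file with this name exists in the current folder
message = read_message('a_file_that_does_not_exist.txt')
print(message)
```

Note that only `FileNotFoundError` is caught here. Any other kind of error will still be reported as normal, which is usually what you want.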

### Raising errors with `raise`

The inverse of exception handling is explicitly forcing errors to occur. This is useful because it alerts you to a particular problem. To raise an error, you use the `raise` statement. For example, here we will perform a check to see if `number` is set to `1`, `2`, or `3` and raise a `ValueError` if it is not. As you can see we can also write a custom error message.

In [None]:
# pick a number...
number = 0

if number not in [1, 2, 3]:
    raise ValueError('You must pick either 1, 2, or 3!')

colors = {1:'red', 2:'green', 3:'blue'}
picked_color = colors[number]
print(picked_color)
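
Raising errors is especially handy inside functions, because whoever calls the function can then decide what to do about the problem. Here's a sketch that wraps the example above in a hypothetical function, `pick_color()`, and then catches the error with `try`/`except`:

```python
def pick_color(number):
    colors = {1: 'red', 2: 'green', 3: 'blue'}
    if number not in colors:
        # stop immediately and explain what went wrong
        raise ValueError(f'You must pick either 1, 2, or 3, not {number!r}')
    return colors[number]


print(pick_color(2))

try:
    pick_color(0)
except ValueError as error:
    # the custom message is available on the error object
    error_message = str(error)
    print(error_message)
```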

### Asserting things to be `True` with `assert`

Another way to approach this is to write an assertion with the `assert` statement. This allows you to assert that something is `True`, and if it turns out to be `False`, the code will fail.

In [None]:
# pick a number...
number = 0

assert number in [1, 2, 3]

colors = {1:'red', 2:'green', 3:'blue'}
picked_color = colors[number]
print(picked_color)

A very smart thing to do is to put lots of `assert` statements all over your code. Make lots of obvious assertions – things that should obviously be `True`; one day one of those assertions will be `False`, forcing you to realize that something dodgy is going on. For example, in a script I wrote recently, I needed to pair up experimental trial data stored in one file with the corresponding eye tracker recording data in another file. I put in some `assert` statements which assert that the trial ID in the response data should match the trial ID in the eye tracker recording. In principle, everything should be fine and the two data sources should line up correctly. However, the assert statements give me extra peace of mind; if there's ever a mismatch for some obscure, unpredictable reason, I will see big red errors.
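
This is not the actual script, but a sketch of the general idea might look something like the following. The two lists and their keys are invented stand-ins for the response data and the eye tracker recordings. Note that you can add a message after the comma, which will be shown if the assertion fails.

```python
# hypothetical data, standing in for the contents of the two files
responses = [{'trial_id': 1, 'response': 'cat'}, {'trial_id': 2, 'response': 'dog'}]
recordings = [{'trial_id': 1, 'n_fixations': 5}, {'trial_id': 2, 'n_fixations': 3}]

# obvious assertion number one: both files should contain the same number of trials
assert len(responses) == len(recordings), 'Different number of trials in the two files'

paired = []
for response, recording in zip(responses, recordings):
    # obvious assertion number two: the trial IDs should line up
    assert response['trial_id'] == recording['trial_id'], 'Trial ID mismatch'
    paired.append((response['response'], recording['n_fixations']))

print(paired)
```

One thing to be aware of: `assert` statements are skipped entirely if Python is run in optimized mode (`python -O`), so they are best used as sanity checks rather than as a way of validating user input.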

## Regular Expressions

I hated regular expressions for many years, and, honestly, I still don't love them very much. But they are important, especially if you need to do any kind of text processing. Furthermore, regular expressions (also known as regex) are not specific to Python, so learning the basics should prove useful in other contexts too.

First, what is regex? Regular expressions are a way to express textual patterns. These patterns can then be searched for in a text, or they can be the basis for string validation. A classic example is checking that an email address is valid. Here's how I could write some code to check that an email address is a valid SISSA address and, if so, extract the username:

In [None]:
import re

# note the \. in the pattern: an unescaped dot would match any character
sissa_email = re.compile(r"^(\w+)@sissa\.it$")

def extract_username(email):
    if match := sissa_email.match(email):
        extracted_email = match[0]
        extracted_username = match[1]
        print(f'The email is {extracted_email} and the username is {extracted_username}')
    else:
        print('Not a valid SISSA address')


extract_username('jcarr@sissa.it')

First we need to import the `re` module, which is part of the Python Standard Library. Next, we "compile" a regex pattern. This means we specify what a valid email address should look like. Notice, first, that we are using a raw string here; the opening quotation mark is preceded by an `r`. When defining regex patterns, it's often a good idea to use raw strings because the backslash character is common in the regex language. Now, let's unpack the pattern:

- `^` matches the start of the string
- `(\w+)` matches one or more "word" characters (i.e. letters, digits, and the underscore). `\w` means a word character, and `+` means one or more
- `@sissa\.it` matches literally `@sissa.it`; the dot is escaped as `\.` because an unescaped `.` would match any character
- `$` matches the end of the string

So we're essentially saying that a valid email consists of one or more word characters followed by `@sissa.it`. By placing `\w+` inside parentheses, we additionally allow that portion of the email address (i.e. the username) to be captured separately. (By the way, not regex-related but did you notice the `:=`? This is called the "walrus operator," and it was added to the language in Python 3.8 – look it up if you're curious what it does.)
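Here is a small sketch of how captured groups are accessed, along with the same kind of check written with and without the walrus operator. The two-group pattern `email_pattern` is a made-up variation for illustration, not a serious email validator.

```python
import re

# Hypothetical pattern with two capture groups: the username and the domain name
email_pattern = re.compile(r"^(\w+)@(\w+)\.it$")

# With the walrus operator: assign the match and test it in one step
if match := email_pattern.match('jcarr@sissa.it'):
    whole_match = match[0]  # the entire matched string
    username = match[1]     # the first set of parentheses
    domain = match[2]       # the second set of parentheses

# Without the walrus operator: assign first, then test
match = email_pattern.match('not an email')
if match is None:
    result = 'no match'
else:
    result = match[0]

print(whole_match, username, domain, result)
```

When the pattern does not match, `.match()` returns `None`, which is why the match object can be used directly as the condition of an `if` statement.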

Test the code with various emails to see if it works correctly. Then, try writing a new function to validate a phone number. (Hints: `\d` matches a digit, `\s` matches a whitespace character such as a space or tab, and `?` makes the preceding item optional.) A super handy resource is https://regex101.com where you can type regex patterns and see visually how they match strings. Whenever I'm designing a regex pattern, I always go straight to this website.
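To see how those three hints behave before tackling the phone number, here is a minimal sketch using an unrelated, made-up pattern (`length_pattern`) that accepts lengths written like `12 cm` or `12cm`.

```python
import re

# Hypothetical pattern: one or more digits, an optional whitespace character, then "cm"
length_pattern = re.compile(r"^\d+\s?cm$")

examples = ['12 cm', '12cm', '12  cm', 'twelve cm']

# Record whether each example string matches the pattern
results = {example: bool(length_pattern.match(example)) for example in examples}
print(results)
```

The version with two spaces fails because `\s?` allows at most one whitespace character, and the last example fails because `\d` does not match letters.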

The second useful thing you can do with regex is find occurrences of a pattern in a given text. For example, let's say we have some piece of Italian text and we want to pick out all the masculine and feminine words. This problem is, of course, a pretty complex one, but let's just define masculine words as anything ending in *-o* and feminine words as anything ending in *-a*.

In [None]:
import re

text = '''Trieste è un comune italiano di 200 480 abitanti, capoluogo della
regione italiana a statuto speciale Friuli-Venezia Giulia, affacciato
sull'omonimo golfo nella parte più settentrionale dell'Alto Adriatico, fra la
penisola italiana e l'Istria, a qualche chilometro dal confine con la
Slovenia nella regione storica della Venezia Giulia. Già capoluogo
dell'omonima provincia, è sede dell'omonimo ente di decentramento regionale
(EDR), istituito con Legge regionale 29 novembre 2019, n. 21
("Esercizio coordinato di funzioni e servizi tra gli enti locali del Friuli
Venezia Giulia e istituzione degli Enti di decentramento regionale"), ed
operativo dal 1º luglio 2020. Rappresenta da secoli un ponte tra l'Europa
centrale e quella meridionale, mescolando caratteri mediterranei,
mitteleuropei e slavi ed è il comune più popoloso e densamente popolato della
regione. Il porto di Trieste dal 2016 è il porto italiano col maggior
traffico merci ed è uno dei più importanti nel sud Europa.'''

msc_pattern = re.compile(r"\b\w+o\b")
fem_pattern = re.compile(r"\b\w+a\b")

print('The masculine words are:')
for match in msc_pattern.finditer(text):
    print('-', match[0])

print('The feminine words are:')
for match in fem_pattern.finditer(text):
    print('-', match[0])

There are a few new things here:

- The triple quote `'''` allows us to create multiline strings
- `\b` matches a word boundary
- The `.finditer()` method of regex objects iterates over all matches in the text
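The effect of `\b` is easiest to see by comparing a pattern with and without it. This is a small sketch on a made-up two-word string (`sample`). It also uses `re.findall`, a module-level function that returns the matched strings as a list.

```python
import re

sample = 'golfo golfista'

# Without word boundaries, the pattern can match part of a longer word
without_boundaries = re.findall(r"\w+o", sample)
print(without_boundaries)

# With word boundaries, only whole words ending in -o are matched
with_boundaries = re.findall(r"\b\w+o\b", sample)
print(with_boundaries)

# finditer yields match objects, which also record where each match was found
positions = [(m[0], m.start(), m.end()) for m in re.finditer(r"\b\w+o\b", sample)]
print(positions)

# Without the r prefix, "\b" is a single backspace character, not a word boundary
print(len("\b"), len(r"\b"))
```

The last line is a reminder of why raw strings matter for regex patterns: in a normal string, Python converts `\b` into a backspace character before the `re` module ever sees it.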

Try making this code more organized by putting it into a function. The first argument to the function should be a text and the second argument should allow the user to specify either masculine or feminine words. The function should return a list of words rather than printing them out.

Now write a function that counts how many times each word occurs. For example, the word *porto* occurs twice. The function should take in a list of words and return a dictionary of counts, like this: `{'porto':2, 'golfo':1}`.

## More...

Still want more...? Check out this page of 100 tips and tricks in Python: https://holypython.com/100-python-tips-tricks/ Peruse the list and try out the ones that seem interesting to you.

Spend some time thinking about the organization of your current projects. Do you need to do some housekeeping? Is your project well documented? Is your data open and accessible? Are the results reproducible with minimal effort? Check out https://goodresearch.dev/ and https://tidyr.tidyverse.org/articles/tidy-data.html for solid advice.