# Lab 4

*Written by Jack Bullen*

*Monday, September 18th, 2023*

## Objectives

#### 1. Reading and writing files
#### 2. JSON and CSV file formats
#### 3. Basic regular expressions
#### 4. Built-in libraries collections, time, and json

In [1]:
# Libraries we will use

from collections import Counter, defaultdict
import pandas as pd
import time
import json
import csv
import re

# Reading and writing files

I/O (input/output) in Python revolves around the built-in function `open()`. The most basic usage of this function involves passing it two arguments:

1. A string that contains a path to the file you want to open.
2. A string that contains the mode that we wish to open the file with. 

The first argument should make immediate sense. The mode will make more sense after reading further.

## Reading

To read a file, call the `open()` function with the mode set to `'r'`.

```python
with open('file.txt', 'r') as file:
    text = file.read()
```

This opens the file file.txt and saves its entire contents into the variable text as a string.

In [2]:
# Example 1: .read() 

with open('random_text.txt', 'r') as f:
    random_text = f.read()
print(random_text)

here is some random text on the first line
some new line characters \n \n
these are escaped so python knows whats what
here is some more on the fourth line


this is the 7th line

8th line is followed by three new lines





In [3]:
# Example 2: .readlines()

with open('random_text.txt', 'r') as f:
    random_text = f.readlines()
print(random_text)

# notice the \\n vs \n

['here is some random text on the first line\n', 'some new line characters \\n \\n\n', 'these are escaped so python knows whats what\n', 'here is some more on the fourth line\n', '\n', '\n', 'this is the 7th line\n', '\n', '8th line is followed by three new lines\n', '\n', '\n']


We can also close the file manually. **Be careful doing it this way in a Jupyter notebook!**

Make sure to close files in the same cell you open them in if you choose to do this.

In [4]:
# Example 3: looping over the file

f = open('random_text.txt', 'r')

for line in f:
    if line=='\n':
        #skip empty lines
        continue
    print(line)

f.close()

# Notice the extra blank line after each line compared to before. Why is this happening?

here is some random text on the first line

some new line characters \n \n

these are escaped so python knows whats what

here is some more on the fourth line

this is the 7th line

8th line is followed by three new lines



The mode `'a'` can be used to append to the end of an existing file.
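
A minimal sketch of append mode follows. The file name `log.txt` and the temporary directory are placeholders chosen for this illustration, not files from the lab.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched (an assumption of this sketch).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'log.txt')  # hypothetical file name

with open(path, 'w') as f:   # 'w' creates the file (or empties an existing one)
    f.write('first line\n')

with open(path, 'a') as f:   # 'a' keeps the existing contents and writes at the end
    f.write('second line\n')

with open(path, 'r') as f:
    contents = f.read()

print(contents)
```

Both lines should be in the file afterwards, because the second `open()` did not wipe what the first one wrote.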

## Writing

To write to a file, pass `open()` the file path and set the mode to `'w'` or `'x'`.

```python
text = 'hello world!'
with open('file.ext', 'w') as f:
    f.write(text)
```

This will create (or overwrite) the file file.ext and write hello world! to it.

The mode `'x'` is useful when you do not want to overwrite data, as it will throw an error if the file already exists.
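
One possible way to use `'x'` in practice is to catch the error and carry on. This is only a sketch: the helper `write_once` and the file name `results.txt` are made up for illustration.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()                 # throwaway directory for this sketch
path = os.path.join(tmp_dir, 'results.txt')  # hypothetical file name

def write_once(path, text):
    """Write text to path only if the file does not exist yet; return True on success."""
    try:
        with open(path, 'x') as f:
            f.write(text)
        return True
    except FileExistsError:
        # The file is already there, so leave it untouched
        return False

first = write_once(path, 'original data')          # file is created
second = write_once(path, 'this would overwrite')  # file exists, nothing is written

with open(path, 'r') as f:
    saved = f.read()

print(first, second, saved)
```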

In [5]:
# Example 4: reading then writing as text and as bytes

with open('random_text.txt', 'r') as f:
    text = f.readlines()

# writing random_text.txt as text in one line
with open('one_line.txt', 'w') as f:
    # rstrip('\n') removes only a trailing newline, so a last line without one is left intact
    f.write(' '.join([line.rstrip('\n') for line in text]))

# writing the original text as bytes
with open('bytes', 'wb') as f:
    f.write(bytes(''.join(text), 'utf-8'))

In [6]:
# Example 5: 'x' mode throwing error when writing to an existing file

with open('random_text.txt', 'x') as f:
    f.write('this wont work') 

FileExistsError: [Errno 17] File exists: 'random_text.txt'

Take a look at the various files in /Labs/Week\ 2/Lab-4-Sep-18/: 
- `random_text.txt`
- `one_line.txt`
- `bytes`

Notice how the bytes file is no different from random_text.txt.
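
One way to see why the two files match: in text mode Python encodes the string to bytes before writing, so encoding it yourself and writing in `'wb'` mode can produce the same bytes. The sketch below uses placeholder file names in a temporary directory, and passes `encoding='utf-8'` and `newline=''` explicitly (an assumption of this sketch, so the result does not depend on platform defaults).

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
text_path = os.path.join(tmp_dir, 'as_text.txt')   # hypothetical file names
bytes_path = os.path.join(tmp_dir, 'as_bytes')

text = 'first line\nsecond line\n'

# Text mode: Python encodes the string for us
with open(text_path, 'w', encoding='utf-8', newline='') as f:
    f.write(text)

# Binary mode: we encode the string ourselves
with open(bytes_path, 'wb') as f:
    f.write(text.encode('utf-8'))

# Read both files back as raw bytes and compare
with open(text_path, 'rb') as f:
    raw_from_text = f.read()
with open(bytes_path, 'rb') as f:
    raw_from_bytes = f.read()

print(raw_from_text == raw_from_bytes)
```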

**If you would like more detailed information on I/O, check out the [Python docs section 7.2](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files)**

This link further explains some of the [open() modes](https://stackoverflow.com/questions/1466000/difference-between-modes-a-a-w-w-and-r-in-built-in-open-function):

 ``r`` -  Open text file for reading.  The stream is positioned at the
         beginning of the file.

 ``r+`` - Open for reading and writing.  The stream is positioned at the
         beginning of the file.

 ``w`` -  Truncate file to zero length or create text file for writing.
         The stream is positioned at the beginning of the file.

 ``w+``  -  Open for reading and writing.  The file is created if it does not
         exist, otherwise it is truncated.  The stream is positioned at
         the beginning of the file.

 ``a`` -  Open for writing.  The file is created if it does not exist.  The
         stream is positioned at the end of the file.  Subsequent writes
         to the file will always end up at the then current end of file,
         irrespective of any intervening fseek(3) or similar.

 ``a+``  -  Open for reading and writing.  The file is created if it does not
         exist.  The stream is positioned at the end of the file.  Subsequent
         writes to the file will always end up at the then current end of
         file, irrespective of any intervening fseek(3) or similar.

There are others. `'rb'` and `'wb'` open the file in binary mode, where you read and write `bytes` objects instead of strings.
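
A small sketch contrasting `'r+'` (existing contents are kept) with `'w+'` (contents are wiped on open). The file name `modes.txt` and the temporary directory are placeholders for this illustration.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'modes.txt')  # hypothetical file name

with open(path, 'w') as f:
    f.write('abc')

# 'r+' starts at the beginning and does not truncate, so this overwrites only the 'a'
with open(path, 'r+') as f:
    f.write('X')

with open(path, 'r') as f:
    after_r_plus = f.read()

# 'w+' truncates the file first; seek(0) moves back to the start before reading
with open(path, 'w+') as f:
    f.write('hi')
    f.seek(0)
    after_w_plus = f.read()

print(after_r_plus)  # Xbc
print(after_w_plus)  # hi
```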

## JSON and CSV

JSON and CSV are common file formats for storing data.

### JSON: JavaScript Object Notation (.json)
- Commonly used for web applications
- What is typically served by APIs 
- Essentially a Python dictionary in plain text
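
As a rough illustration of the "dictionary in plain text" idea, the sketch below converts a dictionary to a JSON string and back with `json.dumps()` and `json.loads()`. The dictionary contents are made up for this example.

```python
import json

# Made-up data, loosely shaped like a record you might find in a JSON file
building = {'name': 'Example Hall', 'lat': 48.5, 'rooms': ['A101', 'A102']}

as_text = json.dumps(building)     # dictionary -> JSON string
round_trip = json.loads(as_text)   # JSON string -> dictionary

print(as_text)
print(round_trip == building)

# JSON is close to Python syntax but not identical:
# True/False/None are written as true/false/null
special = json.dumps({'ok': True, 'missing': None})
print(special)  # {"ok": true, "missing": null}
```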


### CSV: Comma Separated Values (.csv)
- Text that contains data separated by commas and new lines.
- The first line contains the column headers and the rest is the data.
- There is also TSV, which uses tabs instead of commas.
- CSV files can be imported easily with the pandas library.
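
To make the header-plus-rows layout concrete, here is a small sketch using the built-in `csv` module instead of pandas. The CSV text is a made-up sample held in a string; `io.StringIO` stands in for an open file.

```python
import csv
import io

# Made-up CSV text: the first line is the header, the rest are data rows
csv_text = 'CODE,CITY,LAT\nYYJ,Victoria,48.65\nYHZ,Halifax,44.88\n'

# DictReader uses the first line as keys for every following row
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

codes = [row['CODE'] for row in rows]
first_lat = rows[0]['LAT']   # note: values come back as strings, not numbers

print(codes)
print(first_lat, type(first_lat))
```

With a real file you would pass the file object from `open()` to `csv.DictReader` instead of the `io.StringIO` wrapper.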

In [7]:
# Example 6: Reading csv data with pandas function .read_csv()

airports = pd.read_csv('data/airports.csv')

airports.head() # .head() is a method on the pandas DataFrame object. It gives you a preview of the data.

Unnamed: 0,ID,CODE,C_CODE,TYPE,AROPORT_T,TYPE_CODE,AIRPORT_AROPORT,CITY,PROV,PROVINCE_FR,LAT,LONG
0,177,YXY,CYXY,Control Tower,Tour de contrôle,1,Erik Nielsen Whitehorse,Whitehorse,Yukon,Yukon,60.716666,-135.066666
1,185,YYJ,CYYJ,Control Tower,Tour de contrôle,1,Victoria International,Victoria,British Columbia,Colombie-Britannique,48.65,-123.433333
2,188,YYT,CYYT,Control Tower,Tour de contrôle,1,St John's Intl,Saint John,Newfoundland and Labrador,Terre-Neuve-et-Labrador,47.616666,-52.75
3,121,YQX,CYQX,Control Tower,Tour de contrôle,1,Gander International,Gander,Newfoundland and Labrador,Terre-Neuve-et-Labrador,48.933333,-54.566666
4,71,YHZ,CYHZ,Control Tower,Tour de contrôle,1,Halifax Robert L. Stanfield International,Halifax,Nova Scotia,Nouvelle-Écosse,44.883333,-63.516666


In [8]:
# Example 7: Reading json data with json library.

with open('data/buildings.json', 'r') as f:
    buildings = json.load(f)

CLE = buildings['Clearihue Building']
print(f"({CLE['lat']}, {CLE['long']})\nNumber of rooms = {len(CLE['rooms'])}")

(48.464284484363276, -123.3099313530692)
Number of rooms = 45


[Here is this (latitude, longitude) in Google Maps](https://www.google.com/maps/place/48%C2%B027'51.4%22N+123%C2%B018'35.8%22W/@48.4642845,-123.3125063,17z/data=!3m1!4b1!4m4!3m3!8m2!3d48.4642845!4d-123.3099314?entry=ttu)

## Regular Expressions

- Regular expressions are a collection of tools for pattern matching. They can look very scary at first glance.

- They can be difficult to read, but basic ones aren't difficult to write, and they can be very useful.

- Python has the `re` library for regular expressions.

- We will only look at one function, `re.findall()`, from this library, and the exercises only require a basic regex.

To use this function, we pass a regular expression as the first argument and the text we want to search as the second argument.

```python
re.findall(r'REGEX_GOES_HERE', text)
```

Notice that the first argument has an `r` in front of the string. This is called a raw string, and the notation is similar to f-strings. When we use regular expressions it is important to write them as raw strings. [Don't worry about why, just do it](https://note.nkmk.me/en/python-raw-string-escape/#:~:text=In%20Python%2C%20raw%20strings%20are,paths%20or%20regular%20expression%20patterns.).
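
For the curious, a short sketch of what the `r` prefix changes. The sample text is made up. `\b` is used here because it means "word boundary" in a regex but "backspace character" in a normal Python string.

```python
import re

normal = '\n'   # one character: a newline
raw = r'\n'     # two characters: a backslash and the letter n
print(len(normal), len(raw))  # 1 2

text = 'cat concat cat'

# Raw string: the regex engine receives \b and treats it as a word boundary
with_raw = re.findall(r'\bcat\b', text)

# Normal string: Python turns \b into a backspace character before the regex sees it
without_raw = re.findall('\bcat\b', text)

print(with_raw)     # ['cat', 'cat']
print(without_raw)  # []
```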

In [9]:
# Example 8: Find all occurrences of two or more of the same character in a row

text = "Hello, this is some text we will look at. We are looking for occurences of two of the same letter. The word occurence has two c's \
        so we have two atleast. There are even some more. Also i will write the word three a few times because it has two e's \
        three three three niiiiiiiine niiiiiiiine"

print(re.findall(r'(\w)\1+', text))

['l', 'l', 'o', 'o', 'c', 't', 'c', 'l', 'e', 'e', 'e', 'e', 'i', 'i']


The above example uses the regular expression `(\w)\1+` and matches it against the text in the string `text`.

I won't lie, I did not come up with that regex. It uses something called groups, as [explained here](https://stackoverflow.com/questions/644714/what-regex-can-match-sequences-of-the-same-character). Don't worry about how it works. I will stick to more basic regex.

#### The important things to understand are:

1. How to import the `re` module and call `re.findall()` with a particular regex and text to match against.
2. That you can look on the internet for the regex you want. ChatGPT does surprisingly well at this, but I'd recommend Google first, as most regexes you will ever need have already been asked about on Stack Overflow.

#### Now a few basic regex patterns
#### They will make sense once you start using them

- `[A-Z]` matches capital letters

- `[a-z]` matches lower-case letters

- `[a-zA-Z]` matches both upper-case and lower-case letters

- `\d` matches a digit 0-9
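
A quick sketch of these four patterns with `re.findall()`. The sample string is made up for illustration; each pattern matches one character at a time.

```python
import re

sample = 'CSC 110 and Math 248'  # made-up sample text

capitals = re.findall(r'[A-Z]', sample)
lowers = re.findall(r'[a-z]', sample)
letters = re.findall(r'[a-zA-Z]', sample)
digits = re.findall(r'\d', sample)

print(capitals)  # ['C', 'S', 'C', 'M']
print(lowers)    # ['a', 'n', 'd', 'a', 't', 'h']
print(letters)   # ['C', 'S', 'C', 'a', 'n', 'd', 'M', 'a', 't', 'h']
print(digits)    # ['1', '1', '0', '2', '4', '8']
```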

### ? (optional)

You can put a `?` after any of the above to indicate that it is optional (it may appear once or not at all)

- `[A-Z][a-z]?` matches one capital letter that may or may not be followed by a lower-case letter

### * (kleene star)
You can put a `*` after any of the above to indicate that you want 0 or more occurrences of it
- `\d*` matches 0 or more digits

### + (kleene star but not 0)
You can put a `+` after any of the above to indicate that you want 1 or more occurrences of it
- `\d+` matches 1 or more digits

### {} (kind of like a finite kleene star)

You can put `{}` after any of the above and put in a single number or a range of numbers to indicate you want to match that many occurrences
- `\d{5,8}` matches 5 to 8 occurrences of digits
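
The quantifiers `?`, `*`, `+`, and `{}` can be compared side by side in a short sketch. The sample string is made up, and the specific patterns are just illustrations of each quantifier.

```python
import re

sample = 'Math 248 and CSC 110 in 2023'  # made-up sample text

# ? : a capital letter, optionally followed by ONE lower-case letter
optional = re.findall(r'[A-Z][a-z]?', sample)

# * : a capital letter followed by ZERO OR MORE lower-case letters
star = re.findall(r'[A-Z][a-z]*', sample)

# + : ONE OR MORE digits in a row
plus = re.findall(r'\d+', sample)

# {4} : EXACTLY four digits in a row
exact = re.findall(r'\d{4}', sample)

print(optional)  # ['Ma', 'C', 'S', 'C']
print(star)      # ['Math', 'C', 'S', 'C']
print(plus)      # ['248', '110', '2023']
print(exact)     # ['2023']
```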

**This is more than what you need to answer the below exercise.**

[Here are some more examples and more in-depth explanations](https://cs.lmu.edu/~ray/notes/regex/)

# Exercises

#### 1. Open the data in `./data/courses.json` and store it to a variable.

#### 2. Find the Math 248 course. What is the pid for this course?

#### 3. Find your degree in `./data/degrees.json` *(don't just open the json file and Ctrl+f...)*

#### 4. Determine the number of required courses from each program (Math, CSC, etc)

#### 5. Parse the requirements to remove the HTML tags and store them in a file called requirements.txt