
# Section 2 Reading and Parsing Data From a File

So far, you have worked with very short lists of data or generated your own. In data-driven science, however, you will often work with large data sets that come in a variety of formats: comma-separated, tab-separated, and space-separated values, to name a few. Individual quantities can also have their own formats; for example, a list of times might be written in hours, minutes, and seconds, like this: 10:03:20. Whatever format your data comes in, you must be able to read it in and separate it into components that are easier to analyze. That is exactly what parsing data means.

## 2.1 Reading Data From a File

The first step is to learn how to use Python to directly read data from a file. Let's say we have a data file called `planet_data.txt` containing the solar system planet names and their masses, that looks something like this:

| Planet name        | Mass ($10^{24}$ kg)  |
| :----------------- | :------- |
| __Mercury__       |   0.330  |
| __Venus__         | 4.87  |
| __Earth__         | 5.97  |
| __Mars__          | 0.642  |
| __Jupiter__       | 1898 |
| __Saturn__        | 568  |
| __Uranus__        | 86.8  |
| __Neptune__       | 102 |


To open and read this data file, we use the commands:

```python
with open('planet_data.txt') as planet_file:
    data = planet_file.read()
```

Here, `planet_file` is a file object, which lets you access and manipulate a file on disk. Once the file object is created (the first line of code), you can use it to work with the file through methods such as `read`, `readline`, `readlines`, `write`, `seek`, and `close`.
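To make a few of these methods concrete, here is a minimal sketch. It assumes nothing about `planet_data.txt`; instead it writes a small throwaway file (the name `demo.txt` and its contents are made up for illustration) and then reads it back with `readline`, `readlines`, and `seek`.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# The file name and contents are hypothetical, not part of the course data.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as demo_file:
    demo_file.write('first line\nsecond line\nthird line\n')

with open(path) as demo_file:
    first = demo_file.readline()    # reads one line, keeping its '\n'
    rest = demo_file.readlines()    # reads the remaining lines into a list
    demo_file.seek(0)               # jump back to the start of the file
    again = demo_file.readline()    # reads the first line once more

print(repr(first))
print(rest)
print(first == again)
```

Because the file is opened in a `with` block, there is no explicit call to `close`; the file is closed when the block ends.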

The second line reads the entire contents of the file into a single string. When you print that string, you will see that the first row of the file is a comment explaining the meanings of the columns, which is helpful to anyone looking at the file.


In [None]:
# Running in Google Colab? Run this cell
!wget https://raw.githubusercontent.com/CIERA-Northwestern/REACHpy/main/Module_3/data/planet_data.txt

# If you're not running in Colab, this file should be in the data directory.
# Change the loading path of the file to include 'data/' when the file is loaded

--2025-06-09 21:21:55--  https://raw.githubusercontent.com/CIERA-Northwestern/REACHpy/main/Module_3/data/planet_data.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 206 [text/plain]
Saving to: ‘planet_data.txt’


2025-06-09 21:21:55 (3.68 MB/s) - ‘planet_data.txt’ saved [206/206]



In [None]:
with open('planet_data.txt') as planet_file:
    data = planet_file.read()

# Exiting the 'with' indented block closes the file
# Print the string with the full file contents.
print(data)

#PlanetName      Mass(1e24 kg)
Mercury          0.330
Venus            4.87
Earth            5.97
Mars             0.642
Jupiter          1898
Saturn           568
Uranus           86.8
Neptune          102


When we open the file this way, Python opens it in read mode by default. Sometimes you might instead want to write or append to the file. You can specify the mode by passing `'r'`, `'w'`, or `'a'` (read, write, and append) as the second argument to `open`, like this:
```python
with open('planet_data.txt', 'r') as planet_file:
    data = planet_file.read()
```
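The three modes behave differently when a file already has contents. The sketch below illustrates this with a temporary file (the name `notes.txt` is a hypothetical placeholder), so that it does not touch `planet_data.txt`.

```python
import os
import tempfile

# Hypothetical scratch file used only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'notes.txt')

with open(path, 'w') as notes:     # 'w' creates the file (or empties an existing one)
    notes.write('Mercury 0.330\n')

with open(path, 'a') as notes:     # 'a' adds to the end of the file
    notes.write('Venus 4.87\n')

with open(path, 'r') as notes:     # 'r' reads without modifying
    after_append = notes.read()

with open(path, 'w') as notes:     # 'w' again: the previous contents are discarded
    notes.write('Earth 5.97\n')

with open(path) as notes:          # no mode given, so read mode is used
    after_overwrite = notes.read()

print(repr(after_append))
print(repr(after_overwrite))
```

Be careful with `'w'`: opening an existing file in write mode erases what was in it.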

When you use `readlines()` instead of `read()`, the result is a list of strings, one per line of the file, rather than a single string. Printing that list reveals some characters you might not expect: each line (except possibly the last) ends with the "newline" symbol (represented by `\n`), which is part of the original file's structure. If the file contained tab characters, they would appear as `\t`. Here is the example:

In [None]:
with open('planet_data.txt', 'r') as planet_file:
    data = planet_file.readlines()

# Print the list
print(data)

['#PlanetName      Mass(1e24 kg)\n', 'Mercury          0.330\n', 'Venus            4.87\n', 'Earth            5.97\n', 'Mars             0.642\n', 'Jupiter          1898\n', 'Saturn           568\n', 'Uranus           86.8\n', 'Neptune          102']


If you want to split the output into a list of lines without the additional newline characters (`\n`), you can use `read().splitlines()`:

```python
with open('planet_data.txt','r') as planet_file:
    data = planet_file.read().splitlines()

print(data)
```

In [None]:
# Try the code snippet here
with open('planet_data.txt', 'r') as planet_file:
    data = planet_file.read().splitlines()

print(data)

['#PlanetName      Mass(1e24 kg)', 'Mercury          0.330', 'Venus            4.87', 'Earth            5.97', 'Mars             0.642', 'Jupiter          1898', 'Saturn           568', 'Uranus           86.8', 'Neptune          102']
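Another common pattern is to loop over the file object itself, which yields one line at a time. The sketch below assumes an `io.StringIO` object as a stand-in for the open file (it behaves like a text file but lives in memory), filled with the first three lines of the planet data so the example runs on its own.

```python
import io

# io.StringIO stands in for an open text file so no file on disk is needed
planet_file = io.StringIO(
    '#PlanetName      Mass(1e24 kg)\n'
    'Mercury          0.330\n'
    'Venus            4.87\n'
)

lines = []
for line in planet_file:              # a file object yields one line per iteration
    lines.append(line.rstrip('\n'))   # remove only the trailing newline

print(lines)
```

This gives the same kind of list as `read().splitlines()`, but it never holds the whole file in one string, which can help with very large files.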


## 2.2 Parsing Data

There are multiple ways to break down the content of a file while reading it into Python. Let's think about the structure of our planet data file.

First, you may have noticed that the first line of the data file is a comment indicating what the different columns represent, which is very good practice. We want to skip that line when reading in our data, and we can do this with the `startswith` method.

Next, the entries on each line are separated by whitespace, so we can use the `split` method to separate the columns. The block of code below shows how this comes together to parse the data file. Analyze it, then evaluate the cell and inspect the results.

In [None]:
planets, masses = [], []   # Notice, we can define multiple things on a single line
with open('planet_data.txt', 'r') as planet_file:
    data = planet_file.read().splitlines()

for line in data:
    # Only continue if the line is NOT a comment
    if not line.startswith('#'):
        fields = line.split()
        planets.append(fields[0])
        masses.append(float(fields[1]))

# Print the lists we created of names and masses
print(planets)
print(masses)


['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']
[0.33, 4.87, 5.97, 0.642, 1898.0, 568.0, 86.8, 102.0]


Here's what the above for loop did:

1. First, we looped over each line in `data` (one string per line of the file), skipping the comment line that starts with `#`. <br>

2. Then, we split each line at the whitespace between its entries, creating a list with two fields, `fields[0]` and `fields[1]`: <br>
['Mercury', '0.330']<br>
['Venus', '4.87']<br>
['Earth', '5.97']<br>
['Mars', '0.642']<br>
['Jupiter', '1898']<br>
['Saturn', '568']<br>
['Uranus', '86.8']<br>
['Neptune', '102']<br>

3. Finally, we append the planet name and the mass from each line to the lists we created at the beginning of the block of code.

Note that there are two columns in `planet_data.txt`: planet names and masses. For each line, the `fields` list contains both values. Its first element, `fields[0]`, is the planet name, and its second element, `fields[1]`, is the mass. If the file had another column, such as radius, we would read it with `fields[2]`.
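As a minimal sketch of the three-column case, the example below parses a short in-memory table. The radius column is an assumption added for illustration only (approximate mean radii in km); it is not part of `planet_data.txt`.

```python
# Hypothetical three-column data: the radius column is NOT in planet_data.txt
# and is included only to illustrate fields[2].
text = """#PlanetName  Mass(1e24 kg)  Radius(km)
Mercury      0.330          2440
Venus        4.87           6052
Earth        5.97           6371
"""

planets, masses, radii = [], [], []
for line in text.splitlines():
    if line.startswith('#'):
        continue                        # skip the comment line
    fields = line.split()               # split on whitespace
    planets.append(fields[0])           # first column: name
    masses.append(float(fields[1]))     # second column: mass
    radii.append(float(fields[2]))      # third column: radius

print(planets)
print(masses)
print(radii)
```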


### 2.2.1 Advanced Parsing


As we mentioned, data can come in many different formats, using many different *delimiters*. A delimiter is a sequence of one or more characters used to specify the boundary between separate entries in a file.

Luckily, we can use the `split` method to separate a string at any given delimiter (a string of one or more characters) by passing the delimiter as an argument, e.g., `split(',')`, `split(':')`. Each call handles one delimiter, so a string that mixes several delimiters needs several calls.
With nothing inside the parentheses, e.g., `split()`, Python splits on any run of whitespace (this is what we did for the planets data). Now, let's look at a more complex string.

In [None]:
challenge = 'Do:you think you:can-parse; this string?'

See if you can break this down to just the word parse in the cell below. We'll get you started.

In [None]:
# You know what this does already! The result is a list of 5 strings
a = challenge.split()
print('a=', a)

# Now we take the third item and split it off at the dash (-)
b = a[2].split('-')
print('b=', b)

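# Split the first item of a at the colon (:)
c = a[0].split(':')
print(c)

# Split the first piece of b at the colon as well
d = b[0].split(':')
print(d)

# Slice off the trailing semicolon, leaving just 'parse'
e = b[1][0:5]

# Collect the words in their original order
final = []
final.extend(c)
final.append(a[1])
final.extend(d)
final.append(e)
final.append(a[3])
final.append(a[4])

# Join the words with single spaces to rebuild the sentence
print(' '.join(final))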

a= ['Do:you', 'think', 'you:can-parse;', 'this', 'string?']
b= ['you:can', 'parse;']
['Do', 'you']
['you', 'can']
Do you think you can parse this string?
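For comparison, here is a short sketch of single-delimiter splits next to a different tool, `re.split` from Python's standard `re` module, which can split on several delimiters in one call. The regular-expression approach is not the method used in the cell above; it is shown only as an alternative, and the sample strings other than `challenge` are made up for illustration.

```python
import re

# Single-delimiter splits with str.split
time_string = '10:03:20'
hours, minutes, seconds = time_string.split(':')

row = 'Mercury,0.330'            # hypothetical comma-separated line
name, mass = row.split(',')

# Multiple delimiters at once with re.split: the pattern matches any run of
# colons, semicolons, whitespace, or dashes.
challenge = 'Do:you think you:can-parse; this string?'
words = re.split(r'[:;\s-]+', challenge)

print(hours, minutes, seconds)
print(name, mass)
print(words)
print(words[5])
```

Note that `split` always returns strings, so numeric fields such as `hours` or `mass` still need `int()` or `float()` before you can do arithmetic with them.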


## Practice

Follow the instructions in the cell below to practice reading in and parsing data from a file. The data file, `seismic.txt`, contains a small snippet of seismic wave data (detection times and wave phases). You'll read in a few lines of that data and then extract the arrival times of the signals, which are in a format that looks like year-month-day"T"hour:minute:seconds"Z". Parse out the detection times for only the waves that have the "Pn" seismic phase (not the ones with phase "P").
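Before working with the whole file, it can help to take apart a single timestamp. The sketch below uses one arrival time in the format described above (this value appears in `seismic.txt`) and splits it step by step. The `datetime.strptime` call at the end is an optional cross-check using the standard library, not a required part of the exercise.

```python
from datetime import datetime

timestamp = '2015-09-16T22:57:12.65Z'

# Separate the date from the time at the 'T'
date_part, time_part = timestamp.split('T')

# Drop the trailing 'Z', then split hours, minutes, and seconds at the colons
time_part = time_part.rstrip('Z')
h, m, s = time_part.split(':')

# Convert the strings to numbers
hour, minute, second = int(h), int(m), float(s)
print(hour, minute, second)

# Optional cross-check: let datetime parse the full timestamp
parsed = datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%S.%fZ')
print(parsed.hour, parsed.minute, parsed.second, parsed.microsecond)
```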

In [None]:
# Running in Google Colab? Run this cell
!wget https://raw.githubusercontent.com/CIERA-Northwestern/REACHpy/main/Module_3/data/seismic.txt

# If you're not running in Colab, this file should be in the data directory.
# Change the loading path of the file to include 'data/' when the file is loaded

--2025-06-09 21:41:17--  https://raw.githubusercontent.com/CIERA-Northwestern/REACHpy/main/Module_3/data/seismic.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1345 (1.3K) [text/plain]
Saving to: ‘seismic.txt’


2025-06-09 21:41:17 (62.5 MB/s) - ‘seismic.txt’ saved [1345/1345]



In [None]:
# seismic.txt contains the data you'll use in this exercise

# Open and read that data file using read().splitlines()

with open('seismic.txt', 'r') as file:
    data = file.read().splitlines()

print(data)

# For all lines that contain Pn (not P) in column 7 (counting whitespace-separated fields),
#    isolate the time portion of the line (hrs:mins:secs)

# Save hours, minutes, and seconds to three separate lists.
# Hours and minutes should be stored as integers, and seconds as floats.
# Remember to skip over the comment line!

hours = []
minutes = []
seconds = []

for line in data:
    #skips comment lines
    if not line.startswith('#'):
        fields = line.split()

        #make sure line has enough columns to access index 7
        if len(fields) >= 8 and fields[6] == 'Pn':
            #the arrival time is the field right after the phase
            time_str_with_z = fields[7]
            time_str = time_str_with_z.split('T')[1].replace('Z', '')

            #split time string by colon
            h, m, s = time_str.split(':')

            #convert and append to appropriate list
            hours.append(int(h))
            minutes.append(int(m))
            seconds.append(float(s))


# Print your lists to check that you were successful

print(f'Hours: {hours}')
print(f'Minutes: {minutes}')
print(f'Seconds: {seconds}')

['#Channel\t    Distance Azimuth Phase Arrival_Time(Date,Time) Status\tResidual Weight', 'CX PB08  BHZ --\t11.6076\t11.8566\tPn\t2015-09-16T22:57:12.65Z\tmanual\t-4.50\t0.0000', 'C GO07   BHZ --\t11.6284\t187.245\tPn\t2015-09-16T22:57:14.33Z\tmanual\t-2.70\t0.7400', 'CX PSGCX BHZ --\t12.0071\t7.04986\tPn\t2015-09-16T22:57:17.9Z\tmanual\t-4.40\t0.0000', 'BL ITQB  HHZ --\t13.0996\t85.5512\tPn\t2015-09-16T22:57:35.29Z\tmanual\t-1.80\t0.6400', 'GT CPUP  BHZ 00\t13.5978\t71.0392\tPn\t2015-09-16T22:57:41.97Z\tmanual\t-1.90\t0.6100', 'BL PLTB  HHZ --\t15.3914\t95.4734\tPn\t2015-09-16T22:58:05.1Z\tmanual\t-3.10\t0.6100', 'GT LPAZ  BHZ 00\t15.5554\t12.7937\tPn\t2015-09-16T22:58:08.88Z\tmanual\t-2.00\t0.6200', 'BL AQDB  HHZ --\t18.0934\t56.2093\tPn\t2015-09-16T22:58:40.06Z\tmanual\t-2.20\t0.8200', 'BL TRCB  HHZ --\t19.04\t  67.3597\tP\t  2015-09-16T22:58:51.59Z\tmanual\t-1.30\t0.6200', 'BR PTLB  HHZ --\t19.7142\t38.3762\tP\t  2015-09-16T22:58:59.27Z\tmanual\t-0.90\t0.6400', 'II NNA   BHZ 10\t20.

## Takeaways

- Real data can come in many different formats, some more complex than others. You must be able to read in and parse your data before you can extract the quantities needed to do your calculations.<br>
- There are many ways to read in files. One of the simplest is with the methods of Python's built-in file objects, including `read`, `readline`, and `readlines`, which return a string or a list of strings that you can then manipulate.<br>
- Use the `split` method to break up a string into its individual fields based on the specific delimiter used in the string, e.g., `split()`, `split(':')`, and `split(',')`.<br>