### **Intro**

In this notebook, we will practice reading files (tables) using the method we learned in the video. Additionally, we will explore other useful techniques for handling these types of files.

## **Open and explore file with the built-in open function**

## Example 1: exploring the content of a file

Make sure to have the *example.txt* file in the same directory as this notebook.

In this example, we are going to explore the content of a simple text file.

#### Read and print the content of a file

In [3]:
filename = 'example.txt'

with open(filename, 'r') as file: ## open the file in reading mode
    
    # we can read and print the content of the file
    print(file.read())
    
    

Hello
This is the second line of this example file
This is a random address
Randomgatan 22, Albanova, Stockholm, Sweden
And this is a random phone number
0819338970


In [5]:
## we can read just a few characters
with open(filename, 'r') as file: ## open the file in reading mode
    
    # we can read and print the first 10 characters of the file
    print(file.read(10))

Hello
This
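
Reading part of a file moves the current position forward, so a second call to `read` continues from where the first one stopped. The sketch below illustrates this on a small temporary file; *demo.txt* is a hypothetical stand-in for *example.txt*, and the variable names are ours.

```python
import os
import tempfile

# Hypothetical stand-in for example.txt, written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w') as file:
    file.write('Hello\nThis is the second line\n')

with open(demo_path, 'r') as file:
    first = file.read(10)   # first 10 characters, the newline counts as one
    rest = file.read()      # continues right after those 10 characters
    file.seek(0)            # move the position back to the start of the file
    again = file.read(5)    # reads the beginning of the file once more

print(repr(first))
print(repr(rest))
print(repr(again))
```

Using `repr` makes the newline characters visible in the printed output.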


In [13]:
## or we can read single lines of the file
with open(filename, 'r') as file: ## open the file in reading mode
    
    print(file.readline())
    print(file.readline())

Hello

This is the second line of this example file



In [19]:
## or all the lines

with open(filename, 'r') as file: ## open the file in reading mode
    
    print(file.readlines())
    

['Hello\n', 'This is the second line of this example file\n', 'This is a random address\n', 'Randomgatan 22, Albanova, Stockholm, Sweden\n', 'And this is a random phone number\n', '0819338970']
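
Each item in the list above keeps its trailing `\n` (except the last one, which has none in this file). That is why printing the result of `readline()` leaves an empty line after each line. A minimal sketch of looping over a file line by line and removing the newline is shown below; `io.StringIO` stands in for an opened file, and its content is made up for illustration.

```python
import io

# io.StringIO behaves like an opened text file (hypothetical content)
file = io.StringIO('Hello\nSecond line\nLast line')

clean_lines = []
for line in file:                           # a file object yields one line per iteration
    clean_lines.append(line.rstrip('\n'))   # drop the trailing newline, if any

print(clean_lines)
```

Looping over the file object avoids loading all lines into memory at once, which can matter for large catalogues.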


In [27]:
## Let's store the output of readlines in a variable 
with open(filename, 'r') as file: ## open the file in reading mode
    
    lines=file.readlines()
    print(type(lines))

<class 'list'>


#### **The output of readlines() is a list where each item corresponds to a line in the file**

In [32]:
## Let's read the 4th line

print(lines[3])

## we can also split the line if there is a separator like in this case (the comma)

lines_split = lines[3].split(',')

print('First item of the split line:',lines_split[0])

Randomgatan 22, Albanova, Stockholm, Sweden

First item of the split line: Randomgatan 22
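
After splitting, the pieces still carry the spaces that followed each comma, and the last piece keeps the newline. As a small sketch (the variable `fields` is our own name), `strip()` can be used to clean them:

```python
line = 'Randomgatan 22, Albanova, Stockholm, Sweden\n'

# strip() on the line removes the trailing newline,
# strip() on each piece removes the surrounding spaces
fields = [item.strip() for item in line.strip().split(',')]

print(fields)
```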


## Example 2: writing a file

Let's write a new file with the same content as *example.txt*. We call it *example_write.txt*.

In [33]:
filename = 'example_write.txt'  

with open(filename, 'w') as file:  ## this time we open the file in writing mode
    ## let's write the first line
    file.write('Hello\n') ## we need the chars \n to go to a new line
    ##second line
    file.write('This is the second line of this example file\n')


You can check the newly created file in your directory.
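
Keep in mind that opening an existing file with `'w'` empties it first, whereas `'a'` (append mode) adds to the end. The sketch below shows the difference on a temporary file; *demo_write.txt* is a hypothetical name used only for this illustration.

```python
import os
import tempfile

# Hypothetical file in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_write.txt')

with open(demo_path, 'w') as file:
    file.write('first version\n')

with open(demo_path, 'w') as file:   # 'w' starts again from an empty file
    file.write('Hello\n')

with open(demo_path, 'a') as file:   # 'a' keeps the content and writes at the end
    file.write('This is the second line\n')

with open(demo_path, 'r') as file:
    content = file.read()

print(content)
```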

#### **Quick Exercise: complete the writing of the file**

### We can also write a file that contains the values of a given list

In [35]:
## example: write the numbers from 1 to 10 in a file, with each number on a new line.

x = [1,2,3,4,5,6,7,8,9,10]

filename = 'numbers.txt'
with open(filename, 'w') as file: ## we open the file in writing mode
    
    for num in x: ## we loop over the list x
        
        file.write(str(num)) ## file.write takes as input a string! So we need to convert the content of num (integer) to string!
        file.write("\n") ## new line
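
As a possible variation, an f-string converts the number and adds the newline in a single call to `write`. The sketch below also reads the file back to check its content; the temporary path and the name `numbers_back` are assumptions made for this illustration.

```python
import os
import tempfile

x = list(range(1, 11))   # the numbers from 1 to 10

# Hypothetical output file in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'numbers_demo.txt')

with open(demo_path, 'w') as file:
    for num in x:
        file.write(f'{num}\n')   # the f-string turns num into text and appends the newline

# Read the file back: int() ignores the trailing newline of each line
with open(demo_path, 'r') as file:
    numbers_back = [int(line) for line in file]

print(numbers_back)
```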
        




## Example 3: exploring an astronomical data catalogue

Make sure to have the file *catalog.csv* in the same directory as this notebook.

In [1]:
filename = 'catalog.csv' ## if the file is not in the cwd, write the path: '/path/catalog.csv'

In [15]:
##let's open the file and read its lines

with open(filename, 'r') as file:
    data = file.readlines()


In [16]:
data

["# The astronomer's digital toolbox course\n",
 '# Self-study course\n',
 '# Astronomical catalogue\n',
 '# Units\n',
 '# pix,pix,pix2,pix,Jy,- \n',
 'x_center,y_center,area,radius,flux,morph_flag\n',
 '3829.693158884559,4027.481130845276,193.98625491064396,7.85797319426752,8.744416112112191e-05,2\n',
 '4172.080650731051,4635.916871816943,105.67947705920724,5.799898474514772,6.216117771761661e-05,3\n',
 '1836.0422664038128,1230.0160104786758,137.084535186882,6.6057068352216675,2.231008536700474e-05,1\n',
 '3681.824284818905,402.30133432114013,77.85744302103727,4.978231998067923,1.9609296037152936e-05,2\n',
 '1140.403963327656,7266.510628803315,142.1429492586025,6.726477978878255,7.382578491658577e-05,1\n',
 '2975.39525542118,6904.070081406163,133.71202617896333,6.523945112769563,8.20561097029589e-05,1\n',
 '1275.6753592406023,6706.07861287055,42.74485594283435,3.688646124271903,9.355385886752786e-05,1\n',
 '4216.21373756767,6018.579573856334,2.927994579276061,0.9654064539229676,2.2818

This file is a bit more complex to handle compared to the one presented in the video. As you can see, it starts with a multi-line explanatory section, followed by the column labels. The actual data begins on the line right after the labels.

We'll need to be mindful of this structure as we work with the file, ensuring that we correctly read and process the data. Let's dive in and explore how to manage this format efficiently.

In [17]:
## we can skip the first 6 rows by slicing the list data

header_rows = 6
data_sliced = data[header_rows:]   
print(data_sliced)

['3829.693158884559,4027.481130845276,193.98625491064396,7.85797319426752,8.744416112112191e-05,2\n', '4172.080650731051,4635.916871816943,105.67947705920724,5.799898474514772,6.216117771761661e-05,3\n', '1836.0422664038128,1230.0160104786758,137.084535186882,6.6057068352216675,2.231008536700474e-05,1\n', '3681.824284818905,402.30133432114013,77.85744302103727,4.978231998067923,1.9609296037152936e-05,2\n', '1140.403963327656,7266.510628803315,142.1429492586025,6.726477978878255,7.382578491658577e-05,1\n', '2975.39525542118,6904.070081406163,133.71202617896333,6.523945112769563,8.20561097029589e-05,1\n', '1275.6753592406023,6706.07861287055,42.74485594283435,3.688646124271903,9.355385886752786e-05,1\n', '4216.21373756767,6018.579573856334,2.927994579276061,0.9654064539229676,2.2818394004391607e-05,3\n', '1135.965445469591,4540.846120370255,197.57706028788047,7.930367682066398,7.20161387608199e-05,3\n', '1855.4476479021225,3296.9804094299943,88.20485162454501,5.298724023901138,3.92487808
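
Slicing with a fixed number works when we know how long the header is. If we do not, one option is to use the fact that the explanatory lines in this catalogue start with `#`. The sketch below uses a shortened, hypothetical version of the list `data` (fewer lines and columns, rounded values) and the names `content`, `labels` and `rows` are ours.

```python
# A shortened, hypothetical version of the list returned by readlines()
data = [
    "# The astronomer's digital toolbox course\n",
    '# Units\n',
    '# pix,pix,- \n',
    'x_center,y_center,morph_flag\n',
    '3829.69,4027.48,2\n',
    '4172.08,4635.91,3\n',
]

# Keep only the lines that are not explanatory comments
content = [line for line in data if not line.startswith('#')]

labels = content[0].strip().split(',')   # first remaining line: the column labels
rows = content[1:]                       # everything after it: the data

print(labels)
print(len(rows))
```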

Now that we have removed the header, we can store the content of the file in variables

In [41]:
## we define variables according to the header labels

x = []
y = []
area = []
radius = []
flux = []
flag = []


for line in data_sliced:
    
    fields = line.split(',')  ## split each line only once and reuse the pieces
    
    x.append(float(fields[0]))
    y.append(float(fields[1]))
    area.append(float(fields[2]))
    radius.append(float(fields[3]))
    flux.append(float(fields[4]))
    flag.append(int(fields[5]))  ## the flag is 1, 2, 3 so it's better to use int
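
The standard library also offers the `csv` module, which does the splitting for us. The following is only a sketch on a hypothetical three-column excerpt held in memory with `io.StringIO`, not the way the catalogue is read in this notebook.

```python
import csv
import io

# Hypothetical excerpt: a label row followed by two data rows
text = 'x_center,y_center,morph_flag\n3829.69,4027.48,2\n4172.08,4635.91,3\n'

x, y, flag = [], [], []

reader = csv.reader(io.StringIO(text))   # each row comes back as a list of strings
next(reader)                             # skip the label row
for row in reader:
    x.append(float(row[0]))
    y.append(float(row[1]))
    flag.append(int(row[2]))

print(x, y, flag)
```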




### **numpy.genfromtxt**

An alternative to the built-in function is the genfromtxt function in the NumPy library (https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html). 

In this case, both the delimiter and the number of rows to skip are specified when calling the function. Check the documentation for more parameters you may specify.

In [43]:
import numpy as np

In [61]:
x,y,area,radius,flux,flag = np.genfromtxt(filename, delimiter = ',', skip_header=6, unpack = True)

*Note that in this case the outputs are NumPy arrays, not lists.*
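
To see what `unpack = True` does, the sketch below reads a small hypothetical table from memory (via `io.StringIO`) with and without it. Without `unpack`, genfromtxt returns a single 2D array with one row per line of the file; with it, the array is transposed so that each column can be assigned to its own variable. Note that every column, including the flag, comes back as floats unless *dtype* is specified.

```python
import io
import numpy as np

# Hypothetical mini-catalogue: two explanatory lines, one label row, four data rows
text = """# hypothetical explanatory line
# pix,pix,-
x_center,y_center,morph_flag
1.5,2.5,2
3.5,4.5,3
5.5,6.5,1
7.5,8.5,2
"""

# Without unpack: one 2D array, shape (number of rows, number of columns)
table = np.genfromtxt(io.StringIO(text), delimiter=',', skip_header=3)
print(table.shape)

# With unpack=True: one 1D array per column
x, y, flag = np.genfromtxt(io.StringIO(text), delimiter=',', skip_header=3, unpack=True)
print(x)
print(flag.dtype)
```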

**Quick exercise:** Use the *dtype* parameter to ensure that the *flag* column is read as integers, while keeping the other columns as floats.

https://www.python4data.science/en/24.1.0/workspace/numpy/dtype.html

### **Working with Tables in Pandas**

A useful package to work with tables and catalogues is Pandas (https://pandas.pydata.org/docs/index.html).

A possible way to read a table is the function *read_csv*.

Pandas takes the column names from the first row it reads, so we use the parameter *skiprows* to skip the five explanatory lines that precede the labels.

In [62]:
import pandas as pd

In [77]:
data = pd.read_csv(filename, delimiter = ",", skiprows = 5) 

In [78]:
data

|  | x_center | y_center | area | radius | flux | morph_flag |
|---|---|---|---|---|---|---|
| 0 | 3829.693159 | 4027.481131 | 193.986255 | 7.857973 | 0.000087 | 2 |
| 1 | 4172.080651 | 4635.916872 | 105.679477 | 5.799898 | 0.000062 | 3 |
| 2 | 1836.042266 | 1230.016010 | 137.084535 | 6.605707 | 0.000022 | 1 |
| 3 | 3681.824285 | 402.301334 | 77.857443 | 4.978232 | 0.000020 | 2 |
| 4 | 1140.403963 | 7266.510629 | 142.142949 | 6.726478 | 0.000074 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 4340.909467 | 666.554511 | 192.193151 | 7.821571 | 0.000084 | 2 |
| 196 | 749.342297 | 2574.565538 | 25.167104 | 2.830360 | 0.000099 | 1 |
| 197 | 2157.340646 | 5532.973091 | 143.999360 | 6.770260 | 0.000022 | 2 |
| 198 | 4881.235781 | 2745.636089 | 155.129320 | 7.027033 | 0.000041 | 1 |


The result is the table above. The function *read_csv* infers the type of each column from its values. However, as with genfromtxt, the user can set *dtype* either for all columns or for individual columns.
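
As a sketch of setting the type of a single column, the example below reads a small hypothetical table from memory and asks for the flag to be stored as 8-bit integers, leaving the other columns to be inferred. The table content, the choice of `int8` and the name `table` are assumptions made for this illustration.

```python
import io
import pandas as pd

# Hypothetical mini-catalogue: two explanatory lines, then labels and data
text = """# hypothetical explanatory line
# pix,pix,-
x_center,y_center,morph_flag
1.5,2.5,2
3.5,4.5,3
"""

# dtype accepts a dictionary {column name: type}; other columns are inferred
table = pd.read_csv(io.StringIO(text), skiprows=2, dtype={'morph_flag': 'int8'})

print(table.dtypes)
```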

We can obtain information on the dataframe:

In [79]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   x_center    200 non-null    float64
 1   y_center    200 non-null    float64
 2   area        200 non-null    float64
 3   radius      200 non-null    float64
 4   flux        200 non-null    float64
 5   morph_flag  200 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 9.5 KB


#### **Read the content of a column**

In [80]:
data['morph_flag']

0      2
1      3
2      1
3      2
4      1
      ..
195    2
196    1
197    2
198    1
199    2
Name: morph_flag, Length: 200, dtype: int64

#### **Store the content of a column in a variable**

In [81]:
data['morph_flag'].values ## to access the values

array([2, 3, 1, 2, 1, 1, 1, 3, 3, 2, 1, 2, 2, 1, 2, 3, 1, 1, 3, 1, 3, 3,
       1, 3, 1, 3, 3, 2, 1, 3, 1, 1, 3, 1, 3, 3, 2, 2, 3, 1, 1, 3, 2, 1,
       1, 3, 1, 3, 2, 3, 1, 3, 3, 2, 1, 3, 2, 3, 2, 3, 2, 2, 2, 2, 1, 3,
       2, 3, 1, 1, 2, 2, 2, 2, 1, 3, 2, 1, 1, 1, 3, 2, 2, 3, 1, 1, 3, 2,
       1, 2, 2, 1, 2, 1, 3, 3, 2, 1, 2, 3, 2, 2, 3, 2, 2, 2, 3, 2, 2, 3,
       1, 3, 1, 2, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 3, 1, 1, 3, 2, 3,
       2, 2, 3, 1, 1, 2, 1, 2, 3, 3, 2, 1, 1, 3, 1, 1, 2, 1, 1, 3, 2, 2,
       1, 2, 2, 2, 2, 3, 3, 2, 1, 3, 1, 1, 3, 1, 1, 1, 2, 1, 1, 1, 1, 3,
       2, 3, 3, 1, 3, 2, 1, 2, 3, 3, 2, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 2,
       1, 2])
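
Besides `.values`, a column can be converted with `to_numpy()` (the accessor recommended in the Pandas documentation) or with `tolist()` if a plain Python list is preferred. A minimal sketch on a hypothetical two-row dataframe:

```python
import pandas as pd

# Hypothetical mini-catalogue with two rows
table = pd.DataFrame({'x_center': [1.5, 3.5], 'morph_flag': [2, 3]})

flag_array = table['morph_flag'].to_numpy()   # NumPy array
flag_list = table['morph_flag'].tolist()      # plain Python list

print(type(flag_array).__name__, flag_array)
print(type(flag_list).__name__, flag_list)
```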

**Quick exercise:** Store the content of the columns in the respective variables *x, y, area, radius, flux, flag*.

## **Exercise:**

Save the columns x and y in the file *output_coordinates.out* using the built-in method (open the file in writing mode).

Use "," as delimiter.



Add an explanatory header with the following format:

*Name and Surname*

*Date*

*Title of the course*

*Units*




Add the label of each column, x and y.


*hint:* file.write(THIS MUST BE A STRING). If you have a float like x[i], you can change the type using str(x[i])