# I/O, storing and reading data

In [1]:
import numpy as np

### The Input/Output (I/O) part is usually the slowest part of a code, so be careful how many I/O operations you do


### for example, compare the time it takes to read this:

In [2]:
cf = np.zeros(10663)

In [3]:
%%time

with open('10k_file.dat', 'r') as f:
    j = 0
    for line in f:
        p = line.split()
        cf[ j ] = p[4]
        j = j + 1
        


CPU times: total: 15.6 ms
Wall time: 10 ms


### versus: 

In [4]:
%%time
f = open('10k_file.dat','r')  
my_text = f.read()               
f.close()                        

CPU times: total: 0 ns
Wall time: 2.03 ms
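The two cells above differ in how much work happens per line read. A minimal sketch of two ways to get one column out of a file with a single read is shown here. The stand-in file (10 rows, 5 columns) is made up for illustration, since the layout of `10k_file.dat` is not shown; only the use of column index 4 is taken from the cell above.

```python
import os
import tempfile

import numpy as np

# Build a small stand-in file: 5 whitespace-separated columns (assumed layout).
rows = np.arange(50, dtype=float).reshape(10, 5)
path = os.path.join(tempfile.mkdtemp(), "demo_file.dat")   # hypothetical file name
np.savetxt(path, rows, fmt="%.3f")

# Option 1: read the whole file in one go, then parse the text in memory.
with open(path, "r") as f:
    text = f.read()
col_from_text = np.array([float(line.split()[4]) for line in text.splitlines()])

# Option 2: let numpy do the reading and parsing, keeping only column 4.
col_from_numpy = np.loadtxt(path, usecols=(4,))

print(np.allclose(col_from_text, col_from_numpy))
```

Which option is faster depends on the file, so it is worth timing both with `%%time` on your own data.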


### You can open a file to:

read: f = open('my_file', 'r')

write: f = open('my_file', 'w')

append: f = open('my_file', 'a')

read/write: f = open('my_file', 'r+')



#### -----------------------

### After you open a file you can do things with the data you read in:


### read everything in one go: f.read()


### read line by line: f.readline()

### Don’t forget to eventually close your file: f.close()


### Note that after closing the file, the variable f still exists (you can still print it), but trying to read from or write to it will give you an error (ValueError).


### Better practice for working with a file:

with open('my_file') as f: <br>
------ do things here <br>

file is closed
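A minimal sketch of what "file is closed" means in practice. The file name is made up and the file lives in a temporary folder, so nothing in your working directory is touched.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")   # hypothetical file name

with open(path, "w") as f:
    f.write("one line\n")

print(f.closed)          # the with block closed the file for us

try:
    f.write("more")      # I/O on a closed file is not allowed
except ValueError as err:
    message = str(err)
    print("ValueError:", message)
```

The closing happens even if an error occurs inside the indented block, which is the main reason to prefer `with` over a manual `f.close()`.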


### Let's try it out:

#### Let's start by reading in a random text file, all in one go:

In [None]:
f = open('text_from_your_book.txt','r')  # open the file in read ('r') mode under name 'f'
my_text = f.read()               # read all of f ( f.read() ) in variable my_text
f.close()                        # close the file f

In [None]:
#let's see what we read:

print(my_text)

In [None]:
#try:
print( f )

### it worked!

### Now let's try the readline():

In [None]:
f = open('text_from_your_book.txt','r')   # open the file in read ('r') mode under name 'f'
 
print( f.readline() )             # read a line of f ( f. readline() ) and print it
print( f.readline() )             # read another line of f

f.close()                         # close the file f

In [None]:
f.readline()

In [None]:
#What does readline() do?

### now let's open it with the *with open() as f* and read it line by line:

In [None]:
with open("text_from_your_book.txt", "r") as f:   # we again open the file as f
    for line in f:                        # we now loop f line by line
        print( line )                       # and we print the line we just read

### What if we want to split the lines into their elements to use them for some reason?

In [None]:
with open("text_from_your_book.txt", "r") as f:  # open file as f
    for line in f:                       # start looping the file line by line 
        q = line.split()                 # split the line into its parts ( split() - default delimiter: whitespace )
        print ( q )                      # print it

        if q and q[0] == 'they':         # if the line is not empty and its first word is 'they' :
            break                        # break out of the loop


In [None]:
# what happens if I ask it to print:
print( q[ 0 ], q[ 3 ] )
# and why?

### Now let's try to write our first file:

In [None]:
f = open('my_first_writen_file.txt','w')                # open a file in write mode ( 'w' )

f.write('This is my first written and saved line! \n')  # write something in the file 

f.close()                                               # close the file

### try to open the file to read it in and see what you just did:

In [None]:
with open('my_first_writen_file.txt','r') as f:
    a = f.read()  

print( a )   # it worked!

### Now let's go back to this file and open it in read/write mode:

In [None]:
f = open('my_first_writen_file.txt','r+')   # open the file in read/write mode ('r+')

my_text = f.read()                          # read the file into my_text

f.write('oh oh! what did I just do? \n ')   # write something in the file; \n for new line

f.close()                                   # close the file


print(my_text)                              # print my_text

### let's open it again and see what we did:

In [None]:
f = open('my_first_writen_file.txt','r')
 
my_text = f.read() 

print(my_text)

f.close()  
#it worked! we read a file and wrote something at the end.

### why do we need the \n ? Let's open the file again and add 2 new lines:

In [None]:
f = open('my_first_writen_file.txt','r+')   # open the file in read/write mode ('r+')

my_text = f.read()                          # read the file into my_text

f.write( 'What would happen if I forget ' )
f.write( 'to add a new line ? ')
f.close()                                   # close the file


### now let's read it in again and see what we did:
f = open('my_first_writen_file.txt','r')
my_text = f.read() 
f.close()  
print( my_text )

### oops! I wanted it in 2 lines and it just wrote it in 1 ! Let's try again with the \n :

In [None]:
f = open('my_first_writen_file.txt','r+')   # open the file in read/write mode ('r+')

my_text = f.read()                          # read the file into my_text

f.write( "What would happen if I wouldn't forget \n" )
f.write( 'to add a new line ? \n ')
f.close()                                   # close the file


### now let's read it in again and see what we did:
f = open('my_first_writen_file.txt','r')
my_text = f.read() 
f.close()  
print(my_text)

### See what happened? 

### How you open a file is crucial. Always check the mode ('r', 'w', 'a', 'r+') before you run your code.

e.g.,

In [None]:
#Let's open it one more time to write on it again:
f = open('my_first_writen_file.txt','w')
f.write('And I will add this line as well now! \n')
f.close()

# and let's open and read it again:
f = open('my_first_writen_file.txt','r')
my_text = f.read() 
f.close()   

print(my_text) 

### !!!oops! I completely erased the previous text! Can you see why that happened?

In [None]:
#### Let's fall back to the initial file:

f = open('my_first_writen_file.txt','w')      # open a file in write mode ( 'w' )

f.write('This is my first written line! \n')  # write something in the file 
f.write( 'oh oh! what did I just do? \n' )
f.close()                                     # close the file

### now let's open the file again to append a line:

In [None]:
f = open ('my_first_writen_file.txt','a')         # open the file in appending ('a') mode
f.write('and this is the other line I wrote! \n') # write a line
f.close()                                         # close the file

In [None]:
# and let's open and read it again:
f = open('my_first_writen_file.txt','r')
my_text = f.read() 
f.close()   

print(my_text) 
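If you are worried about erasing a file by accident with 'w', Python also has an exclusive-creation mode, 'x', which is not covered above. A minimal sketch, using a made-up file name in a temporary folder:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "precious_notes.txt")  # hypothetical file name

with open(path, "x") as f:        # 'x' creates the file; it must not exist yet
    f.write("first line\n")

overwrite_blocked = False
try:
    with open(path, "x") as f:    # second attempt: the file already exists
        f.write("in 'w' mode this would have replaced the file\n")
except FileExistsError:
    overwrite_blocked = True
    print("file already exists, nothing was overwritten")

with open(path, "r") as f:
    content = f.read()
print(content)
```

The second `open` raises `FileExistsError` instead of silently emptying the file, so the original line survives.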

### Example 1: Open and read the first 3 lines of file emma.txt; split the lines in the words they are made of. 
- Discuss: how do we open the file to read the first 3 lines only?
- how do we split the lines in the words they are made of?
- code!

### Example 2: Now read all of Emma and split it into the words it's made of. Then count how many 'the' and how many 'Emma' exist in the book.
- Discuss: how do we read the whole book in one go?
- how do we split the whole book in words?
- how do we count how many 'the' and 'Emma's exist?

# -----------------------

## np.savetxt()

### What about having to deal with data instead of text?

In [None]:
#### Let's create a 3 by 3 data array of 1s

data = np.ones ( ( 3 ,3 ) )


### How do we write the data in the file?

In [None]:
# one way to do it would be to add it as an appended string:

f = open('my_first_writen_file.txt','a')

for i in range(3):
    f.write(str(data[i,:])+'\n')   #note that with the f.write you need to convert the data to a string
                                   #if you keep it an array the code will crash/complain
f.close()


In [None]:
# and let's open and read it again:
f = open('my_first_writen_file.txt','r')
my_text = f.read() 
f.close()   

print( my_text ) 

### what if we want to save the real data as numbers? 
### We saw how easy it is to read data with numpy (loadtxt or genfromtxt). How does it work to write data?  np.savetxt()

In [None]:
#Let's try it out first on its own:

np.savetxt('my_first_numpy_array_saved.txt', data, 
           fmt='%.2f', delimiter=' ')              #notice that you can use fmt to format your output
                    

#let's check what we did:
f = open('my_first_numpy_array_saved.txt','r')
a = f.read()
f.close()

print( a ) 

In [None]:
#what happens if you want to append the numbers to a text file using savetxt?
# first make a backup copy of my_first_writen_file.txt (trust me)

#try the following. Do you think that it would work? Why/why not?

f = open("my_first_writen_file.txt", "a")
np.savetxt('my_first_writen_file.txt', data, fmt='%.2f', delimiter=' ')
f.close()


#let's see what we did:

f = open('my_first_writen_file.txt','r')
a = f.read()
f.close()

print(a) #!! ouch! it erased your entire file! 

In [None]:
#what you can do (but you normally will not need to, unless you are doing something unusual) is
#open your file in binary append mode ('ab') and pass the open file f, not the file name, to savetxt:

f=open('my_first_writen_file.txt','ab')

np.savetxt(f,data, fmt='%.2f', delimiter=' ')

f.close()

f = open('my_first_writen_file.txt','r')
a = f.read()
f.close()

print(a) #it worked! 

### numpy.savetxt is an OK method to store your small arrays. For larger sets you need to delve into pickles/csv ...

In [None]:
# FYI: you can also use headers with savetxt:
np.savetxt('test_2.dat',data, fmt='%.2f', delimiter=' ', 
           header= 'These are my random data')
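A minimal sketch of a full round trip with a header, assuming a made-up file name in a temporary folder. It shows how the header is stored and that `np.loadtxt` still reads the numbers back.

```python
import os
import tempfile

import numpy as np

data = np.ones((3, 3))
path = os.path.join(tempfile.mkdtemp(), "roundtrip.dat")   # hypothetical file name

np.savetxt(path, data, fmt="%.2f", delimiter=" ", header="These are my data")

with open(path, "r") as f:
    first_line = f.readline()
print(first_line)             # the header line, written with a leading '# '

data_back = np.loadtxt(path)  # lines starting with '#' are skipped as comments
print(np.array_equal(data, data_back))
```

Because the header is written as a comment line, it documents the file without getting in the way when you load the data again.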

### Example 3: Make a 5 by 5 numpy array *data_2d* of zeros. Set variable *x* equal to a numpy range from 5 to 10; and variable *y* to the exponential of x. Populate *data_2d* following $z_{ij} = x_i * y_j$ . Save  *data_2d* to a file named my_random_data.dat. Add a header like "These are my random data" and format the output to have 4-digit accuracy.

# ------------------

## Pickles

### Pickles: the pickle module uses a binary protocol to serialize (save) a Python object and to deserialize (load, or unpickle) it

### **“Never unpickle data received from an untrusted or unauthenticated source”**

### pickle module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again
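A minimal sketch of what "keeps track of the objects it has already serialized" means. It uses `pickle.dumps` and `pickle.loads`, the in-memory counterparts of `dump` and `load`, so no file is needed; the variable names are made up.

```python
import pickle

shared = [1, 2, 3]
container = {"a": shared, "b": shared}   # two references to the same list

restored = pickle.loads(pickle.dumps(container))  # round trip in memory

print(restored["a"] is restored["b"])    # still one shared object after unpickling
restored["a"].append(4)
print(restored["b"])                     # the change is visible through both keys
```

The list is stored once and both keys point to it again after loading, so the relationship between objects survives, not just their values.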


### Convenient way to store data: you can reconstruct complete Python objects (pickling serializes, it does not compress)

### ----
Some may be familiar with JSON and pickles may sound comparable but keep in mind:

Pickles are not human readable

Pickles are Python specific
### ----
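To make the comparison with JSON concrete, here is a minimal sketch using Python's built-in `json` module on a small made-up nested dictionary. Unlike a pickle, the result is plain text you can read.

```python
import json

all_my_cars = {
    "car1": {"model": "Escape", "make": "Ford"},
    "car2": {"model": "500", "make": "Fiat"},
}

as_text = json.dumps(all_my_cars, indent=2)   # a plain, readable string
print(as_text)

restored = json.loads(as_text)                # back to nested dictionaries
print(restored["car1"]["make"])
```

JSON only handles basic types (dictionaries, lists, strings, numbers, booleans, None), whereas pickle can store most Python objects.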


### When trying to unpickle a dataset the version of Python used may come into play: pickles are not always compatible between Python 2 and 3!


### Things to remember:
 
import pickle <br>
pickle.dump() <br>
pickle.load() <br>


In [None]:
#Pickles

import pickle


In [None]:

#Let's get a dictionary from our previous demo :

car1 = {
  "model" : "Escape",
  "make" : "Ford"
}
car2 = {
  "model" : "500",
  "make" : "Fiat"
}
car3 = {
  "model" : "Tucson",
  "make" : "Hyundai"
}


all_my_cars = {
  "car1" : car1,
  "car2" : car2,
  "car3" : car3
}


In [None]:
#and now let's open a file to save the data in:

f = open('my_first_pickle.pickle','wb')       # open file in binary mode 

pickle.dump(all_my_cars,f)                    # let's dump our pickled dictionary in there:

f.close()                                     # close the file

### congrats! you saved your first pickle!

In [None]:
#Let's now see what we saved actually:

with open('my_first_pickle.pickle','rb') as pickle_read:   # open the pickle in binary read mode
    example_dict = pickle.load(pickle_read)                # load the pickle in example_dict

In [None]:
# test:

print( example_dict.items()  )
print( example_dict.keys()   )
print( example_dict.values() )


example_dict['car1']['model']

#it worked!

### Saving the dictionary in a pickle preserved its dictionary nature.

### Sometimes you know a priori what data a pickle contains (how many columns, what each column is...)

In [None]:
# let's make some data up
f_in = np.arange(10)
q_in = f_in**2
u_in = f_in**3

# open a file and store your second pickled data:

f = open('my_second_pickle.pickle','wb')

pickle.dump([f_in, q_in, u_in],f)

f.close()

In [None]:
# now that I know that my data are f_in, q_in, u_in let's unpickle it in one go:

with open( 'my_second_pickle.pickle','rb' ) as pickle_file:
    f, q, u = pickle.load( pickle_file )


In [None]:
# test:
print( f == f_in )
print( q == q_in )
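The `==` test above prints one True/False per element. A minimal sketch of getting a single answer for the whole array with `np.array_equal`, using an in-memory round trip so no file is needed:

```python
import pickle

import numpy as np

f_in = np.arange(10)
q_in = f_in**2

f, q = pickle.loads(pickle.dumps([f_in, q_in]))   # in-memory round trip

print(f == f_in)                  # element-wise: an array of True/False
print(np.array_equal(q, q_in))    # a single True/False for the whole array
```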

# ------------------

## CSV

### CSV: Comma-Separated Values --> used to save tabular data such as a spreadsheet or a database


### Things to remember:

import csv  <br>
csv.reader() <br>
csv.writer()<br>


In [None]:
#Let's now do some csv reading/writing:

import csv 

In [None]:
## as a test case we will use the dictionary from before:

with open('my_first_csv.csv', 'w', newline='') as f:   # open the file to write in it; newline='' avoids blank rows on Windows

    writer = csv.writer(f)                   # you will use the csv module to write the data
    
    for key, value in example_dict.items():  # loop over items in your dictionary 
        
        writer.writerow([key, value])        # write the items

In [None]:
#open the file from your Jupyter notebook tree: it is human readable (unlike the pickle...)

#let's read it back in:

with open('my_first_csv.csv', newline='') as f:
    reader = csv.reader(f)
    my_csved_dict = dict(reader)

    
print(my_csved_dict.items())

print(type(my_csved_dict))



In [None]:
#note though that in this example we have lost the structure of the nested dictionary:
#my_csved_dict['car1']['make'] will give you an error:

print( my_csved_dict['car1']['make'] )
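One way to keep the structure in a CSV file is to flatten it: one row per car, one column per field. A minimal sketch with `csv.DictWriter` and `csv.DictReader`, where the column name "name" and the file name are choices made for this example:

```python
import csv
import os
import tempfile

all_my_cars = {
    "car1": {"model": "Escape", "make": "Ford"},
    "car2": {"model": "500", "make": "Fiat"},
}
path = os.path.join(tempfile.mkdtemp(), "cars_flat.csv")   # hypothetical file name

# One row per car, one column per field.
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "model", "make"])
    writer.writeheader()
    for name, car in all_my_cars.items():
        writer.writerow({"name": name, **car})

# Rebuild the nested dictionary from the rows.
with open(path, newline="") as f:
    rebuilt = {row["name"]: {"model": row["model"], "make": row["make"]}
               for row in csv.DictReader(f)}

print(rebuilt["car1"]["make"])
```

Everything read back from a CSV file is a string, so numerical columns would still need converting with `float()` or `int()`.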

### Example 4: Store the data from *data_2d* in a csv file.

# ------------------

## Pandas

### A great tool for data analysis and modeling of large datasets (think ML sizes….). It’s a software library that can read big amounts of data and analyze it fast


### Can read in CSV files, SQL databases and create a Python object with rows and columns (data frame) out of that – makes working on such data faster than using tuples/dictionaries





In [None]:
## Last but not least, let's try Pandas out
import pandas as pd 

In [None]:
# we will use the pickled data from above here:

pd.read_pickle('my_first_pickle.pickle')  # reads it in and shows you the nested dictionaries


In [None]:
pd.read_csv('my_first_csv.csv', header=None)   # reads it in and shows you the table; header=None because the file has no header row

In [None]:
#Now let's make a pandas DataFrame out of the pickle:

pd.DataFrame(pd.read_pickle('my_first_pickle.pickle'))


In [None]:
# let's read in a dataframe the second pickle with the random data

pd.DataFrame( pd.read_pickle('my_second_pickle.pickle') )  #each of the three stored arrays becomes one row of the table

In [None]:
# read the data in a dataframe named df:
df = pd.DataFrame(pd.read_pickle('my_second_pickle.pickle')) 

#ask it to describe() your data
df.describe()  # summary statistics for numerical columns ; not very useful here but 
               # imagine what you can do with LARGE datasets!

In [None]:
#get the mean of all columns:
print( df.mean() )

In [None]:
#you can get correlations between columns: 
#print( df.corr() )

#max (& min) of each column:
#df.max()   #& df.min()

#get the standard deviation of each column:
#df.std()

#and much much more...   If you are interested in playing around with big data there's plenty 
#of Open Access databases
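A minimal sketch of the same kind of summary on a DataFrame with named columns, built directly from a dictionary instead of a pickle. The column names "f", "q" and "u" are chosen here to mirror the made-up data above.

```python
import numpy as np
import pandas as pd

f_in = np.arange(10)
df = pd.DataFrame({"f": f_in, "q": f_in**2, "u": f_in**3})

print(df.shape)          # (rows, columns)
print(df.mean())         # one mean per named column
print(df["q"].max())     # pick a single column by its name
```

Passing a dictionary makes each array a column, so the statistics come out per quantity rather than per position.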

## -------------------------
### Revisiting concepts part 1

### - 1. Functions: 

def function_name( input_parameter1, input_parameter2,... ): 

    """ very_informative_docstring
    Input: input_parameter1, input_parameter2,.. [units]
    Output: output_parameters [units] """

    <do calculations here>
    
    return output_parameters
    
    
    


### When you have written the function in the same notebook/code part you can call it directly without importing

In [1]:
def add_one( number ):
    """Takes a number and adds one to it. 
    Input: number [float or integer]
    Output: number plus one """
    
    
    number = number + 1
    
    return number 

In [2]:
#call the function as: name( here_goes_input)

m = add_one( 1 )
print( m )

2


In [3]:
m2 = add_one( 8 )
print( m2 )

9


### When you have it written in a different .py (NOT .ipynb) file you need to import it:

    from filename_without_py_extension import function_name

### & then you can call it
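A minimal sketch of the whole import workflow in one place. It writes a tiny module to a temporary folder and imports from it; the module name `my_functions` is made up, and normally you would simply save the .py file next to your notebook.

```python
import os
import sys
import tempfile

# Write a tiny module to disk; 'my_functions' is a hypothetical file name.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "my_functions.py"), "w") as f:
    f.write(
        "def add_one(number):\n"
        '    """Return number plus one."""\n'
        "    return number + 1\n"
    )

# Python looks for modules in the folders listed in sys.path.
# A .py file next to your notebook is found without this extra step.
sys.path.insert(0, module_dir)

from my_functions import add_one

print(add_one(1))
```

If you edit the .py file after importing it in a notebook, you need to restart the kernel (or reload the module) to pick up the changes.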

### Practice 1. Make a simple function that takes as input the name of a file with data, reads it in using one of the numpy functions and returns the data to the main code. Store the function in a separate file named "practice_functions.py". Import it here and call the function for *my_first_numpy_array_saved.txt*

### - 2. Numpy arrays (access elements, slices, data manipulation): 

### Numpy has a wealth of functions you can use to do anything you need to solve a numerical problem. If it doesn't exist in numpy, scipy will have it (we will see scipy functions later on). We import numpy at the start of our code with:

In [5]:
import numpy as np

### Numpy is there to make numerical things easy for us. Take array data1 here:

In [6]:
data1 = np.array( [ 12, 8, 4, 4, 0, -4, -4, 0  ] ) 

### How do we multiply all elements of data1 with 5, or how do we add 10 to all elements of data1?
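A minimal sketch of one answer: numpy applies arithmetic with a single number to every element, so no loop is needed. The names `times_five` and `plus_ten` are made up for this example.

```python
import numpy as np

data1 = np.array([12, 8, 4, 4, 0, -4, -4, 0])

times_five = data1 * 5    # every element multiplied by 5
plus_ten = data1 + 10     # 10 added to every element

print(times_five)
print(plus_ten)
```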

### Often you need to access a small part of the data you have to do something (slicing). To do this you need to think where the data you want 'lives' in the array

<img src="houses_to_arrays_loc.png" width=300 height=500 />

### Instead of having an address of Libra_Dr 4111 you have Libra_Dr[ 4111 ]

### If we have 2D arrays, numpy reads [row, column]; if we have 3D: [x,y,z], etc. If at some point you want all elements of one dimension you just use ":"  (think [0, : ] )

### Consider the data1 array. Is it a 'house' or a 'flat' address? What is the address of 8? What is the address of the first -4?

In [None]:
data1 = np.array( [ 12, 8, 4, 4, 0, -4, -4, 0  ] ) 

### We have this array:

In [8]:
data2 = np.array( [1, 3, 5, 7, 9, 2, 4, 8, 10, -2, -4, -10] ).reshape( (3,4) ) 

### Consider the data2 array. Is it a 'house' or a 'flat' address (if it helps you, print it)? What is the address of 8 now? What is the address of  -4?
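A minimal sketch of how to explore addresses in a 2D array. It uses `np.argwhere` to look up where a value lives; the value 9 is picked here so that the questions above stay open for you to answer.

```python
import numpy as np

data2 = np.array([1, 3, 5, 7, 9, 2, 4, 8, 10, -2, -4, -10]).reshape((3, 4))

print(data2)
print(data2[0, :])                 # all columns of the first row
print(data2[:, 1])                 # all rows of the second column
print(np.argwhere(data2 == 9))     # [row, column] of every element equal to 9
```

Remember that counting starts at 0, so the first row is row 0 and the second column is column 1.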

### Practice 2. We have already imported numpy, so here we will practice reading data into an array and manipulating it. Get file *practice_numpy_f22.dat* and read it into the variable *student_info*. Open the file to check it out. 
- Assign the HW grades of all students to a variable *hw_grades* and the exam grades to variable *exam_grades*. 
- Assign all grades of student 2 to variable *stud_2*
- Assuming that the average grade of a student is 25% their HW grade, 25% the Midterm, 25% their final exam and 25% their Quiz grade can you calculate the final grade of each student? Use a for loop to loop through all students and print their final grade.

### Practice 3. *space_travel.dat* contains the distances of an interplanetary trip between four bodies of the solar system. For simplicity, the data are stored in a 4 by 4 table with every line being the distance from a body to another body in AU (indicated by the columns; see also the file). 
- Read the data from *space_travel.dat* in a variable *space_trip*. 
- Your spaceship takes passengers from Mercury. You ask them where they want to go. 
    - If trip is possible you tell them how far the destination is and how long it takes if they travel with a speed of 0.05 AU/day. 
    - Discuss: how do we ask our passengers where they want to go?
    - how do we find how long their trip will take?
    - code!
    - try it out for a trip to the Earth
    
    
- What is the longest trip we can do from the Earth?
    - Discuss: how do we find the longest trip?
    - print an informative statement for the longest distance you can fly from the Earth in AU.


### - 2b. Numpy vs FOR

### FOR loops are for things you want to do repeatedly. They work well with lists, tuples, dictionaries or sometimes ND arrays. When you can use numpy though, you should prefer it over a FOR loop. It is *much* faster:

### Create list that is the square of list check_list:


In [18]:
check_list = [ 1, 3, 5, 7, 9 ]

# can I do check_list = check_list**2? 
 

In [None]:
check_list = check_list **2

In [16]:
for i in range( len( check_list ) ):
    check_list[ i ] = check_list[ i ] **2
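Two other ways to square the list, sketched under the assumption that you want to keep the original list unchanged: a list comprehension (still a plain Python loop) and a conversion to a numpy array.

```python
import numpy as np

check_list = [1, 3, 5, 7, 9]

squared_list = [value**2 for value in check_list]   # result is still a plain list
squared_array = np.array(check_list) ** 2           # result is a numpy array

print(squared_list)
print(squared_array)
```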

### Create numpy array cube_ar that is the array ar1 to the power of 3

In [10]:
ar1 = np.arange( 0.1, 1e7, 0.2 )

In [11]:
%%time
#compare this:
#(use the cell magic %%time to time the whole cell; %time on a line by itself times nothing)

cube_ar = ar1**3



In [12]:
%%time
#to this:

cube_loop = np.empty_like( ar1 )          # empty array of the same size to hold the results
for i in range( len( ar1 ) ):
    cube_loop[ i ] = ar1[ i ] ** 3



KeyboardInterrupt: 
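A minimal sketch of the same comparison that finishes quickly and also runs outside a notebook. It uses `time.perf_counter` instead of `%%time` and a much shorter array (`ar_small`, a made-up name) so the loop completes; the actual times depend on your machine.

```python
import time

import numpy as np

ar_small = np.arange(0.1, 1e5, 0.2)    # much shorter than ar1, so the loop finishes

start = time.perf_counter()
cube_vectorized = ar_small**3          # numpy does the whole array at once
t_vectorized = time.perf_counter() - start

start = time.perf_counter()
cube_loop = np.empty_like(ar_small)    # same computation, one element at a time
for i in range(len(ar_small)):
    cube_loop[i] = ar_small[i] ** 3
t_loop = time.perf_counter() - start

print(f"vectorized: {t_vectorized:.5f} s, loop: {t_loop:.5f} s")
print(np.allclose(cube_vectorized, cube_loop))
```

Both versions give the same numbers; only the time they take differs.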

### Practice 4. You have a dictionary ***classroom*** with the names of students and their average grades. You need to scan through the names and print a letter grade for every student (assume plain ABC with A if grade > 0.90 ; B if 0.90 > grade > 0.80 and C if grade < 0.80).


In [13]:
classroom = { 'Jane': 0.92 , 'Joe': 0.85, 'Petra': 0.78, 'Peter': 0.81 }