## CM4044: AI In Chemistry
## Semester 1 2020/21

<hr>

## Tutorial 1b: Introduction to Numpy Part II
## Objectives
### $\bullet$ Data Types
### $\bullet$ Binary Operations
### $\bullet$ Comparison and Logic Operations
### $\bullet$ Sorting, Searching and Counting
### $\bullet$ Methods to Change Array Shape
### $\bullet$ Read and Save Data


<hr>



In [1]:
import numpy as np

## 1. Data Types of Numpy Array

The interface of `numpy.array()` is below:
```python
    numpy.array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
```
    
The argument `dtype` decides the data type of numpy.array. 

The common Numpy `dtype` arguments are listed below:

|Basic Type|**Numpy** keyword|Comment|
|--|--|--|
|boolean|`bool`|1 byte|
|integer|`int8, int16, int32, int64, int`|`int` same as `long` in **C** language|
|unsigned integer|`uint8, uint16, uint32, uint64, uint`|`uint` same as `unsigned long` in **C** language|
|floating point number|`float16, float32, float64, float, longfloat`|default type is `float64`; `longfloat` depends on machine|
|complex number|`complex64, complex128, complex, longcomplex`|default type is `complex128`|
|string|`string, unicode`|`dtype='S4'` represents a 4-byte string|
|object|`object`|any type of value|
|Records|`void`||
|time|`datetime64, timedelta64`||

**Numpy** can also detect the data type of the input object automatically when it creates an `ndarray`, so the `dtype` argument may be omitted when `numpy.array()` is called.

More details about the allowed data types and their memory sizes can be found [here](https://numpy.org/doc/1.18/reference/index.html).

In [2]:
a = np.array([0,4,-4])
print(a.dtype)
a = np.array([0,4,-4],dtype = int)
print(a.dtype)
a = np.array([0,4,-4],dtype = complex)
print(a)
print(a.dtype)

a = np.array([1,1.2,'hello', [10,20,30]], 
          dtype=object)
print(a)
print(a.dtype)

int32
int32
[ 0.+0.j  4.+0.j -4.+0.j]
complex128
[1 1.2 'hello' list([10, 20, 30])]
object
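
As a small supplementary sketch, the memory cost implied by a `dtype` can be inspected through the `itemsize` and `nbytes` attributes, and `astype()` converts an existing array to another type. The variable names `small` and `wide` are chosen here only for illustration.

```python
import numpy as np

# The same values stored with an explicit 16-bit integer dtype.
small = np.array([0, 4, -4], dtype=np.int16)
print(small.dtype, small.itemsize, small.nbytes)   # int16 2 6

# astype() returns a new array converted to the requested dtype.
wide = small.astype(np.float64)
print(wide.dtype, wide.itemsize, wide.nbytes)      # float64 8 24
```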


One advantage of the **Numpy** array is that the user can define a customized data type for the content kept in an `ndarray` object, for example,

In [3]:
a = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)], dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])
a

array([('Rex', 9, 81.), ('Fido', 3, 27.)],
      dtype=[('name', '<U10'), ('age', '<i4'), ('weight', '<f4')])

Here `a` is a one-dimensional array of length two whose data type is a structure with three fields: (1) a string of length 10 or less named 'name', (2) a 32-bit integer named 'age', and (3) a 32-bit float named 'weight'.

If you index `a` at position 1, you get a structure:

In [4]:
a[1]

('Fido', 3, 27.)

You can access and modify individual fields of a structured array by indexing with the field name:

In [5]:
a['age']

array([9, 3])
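
The field access above can also be used for modification and for filtering. The following is a minimal sketch; the array is rebuilt under the illustrative name `pets` so that the block runs on its own.

```python
import numpy as np

pets = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
                dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])

pets['age'] += 1                    # modify one field for every record
print(pets['age'])                  # [10  4]

heavy = pets[pets['weight'] > 50]   # boolean mask built from one field
print(heavy['name'])                # ['Rex']
```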

## 2. Binary Operations

## 2.1 Arithmetic Operations

If `a` and `b` are two `ndarray` data objects, the binary operations are listed in the following table:

Operation|Numpy Function
--- | --- 
`a + b` | `add(a,b)`
`a - b` | `subtract(a,b)`
`a * b` | `multiply(a,b)`
`a / b` | `divide(a,b)`
`a ** b` | `power(a,b)`
`a % b` | `remainder(a,b)`

These operations are performed in the **<font color=red>element-wise </font>** style, see below:

In [6]:
a = np.array([0,1,2,3,4])
b = np.array([5,6,7,8,9])
c = a + b          # element wise sum, same as c = np.add(a, b)
print(c)
c = a * b          # element wise multiplication, same as c = np.multiply(a,b)
print(c)           
c = np.multiply(a,b)  # the equivalent calculation to multiplication
print(c)
c = a ** b # same as c = np.power(a,b)
print(c)

[ 5  7  9 11 13]
[ 0  6 14 24 36]
[ 0  6 14 24 36]
[     0      1    128   6561 262144]


In fact, the **Numpy** binary arithmetic functions accept a third array argument, which is the array that receives the results of the operation, for example,

In [7]:
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
np.multiply(a,b,a)    # the results from binary multiplication are kept in array a
a

array([ 5, 12, 21, 32])
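
The third argument can also be passed by keyword as `out=`. The sketch below writes the result into a separately allocated array (named `result` here for illustration) so that neither input is overwritten.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

result = np.empty_like(a)        # preallocated output buffer
np.multiply(a, b, out=result)    # same as passing result as the third argument

print(result)   # [ 5 12 21 32]
print(a)        # [1 2 3 4]
```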

## 2.2 Comparison and Logic Operations

Operation|Numpy Function
--- | --- 
`==` | `equal`
`!=` | `not_equal`
`>` | `greater`
`>=` | `greater_equal`
`<` | `less`
`<=` | `less_equal`
 | `logical_and`
 | `logical_or`
 | `logical_xor`
 | `logical_not`
`&` | `bitwise_and`
`\|` | `bitwise_or`
`^` | `bitwise_xor`
`~` | `invert`
`>>` | `right_shift`
`<<` | `left_shift`

These operations are also **<font color=red>element-wise</font>** operations, for example,

In [8]:
a = np.array([[1,2,3,4],
              [2,3,4,5]])
b = np.array([[1,2,5,4],
              [1,3,4,5]])
a == b

array([[ True,  True, False,  True],
       [False,  True,  True,  True]])

Examples of logical and bitwise operations follow. Again, these operations are element-wise operations.

In [9]:
a = np.array([0,1,2])
b = np.array([0,10,0])

print(np.logical_and(a, b))

a = np.array([1,2,3,4], np.uint8)
b = ~a
print(b)

b = a << 3
print(b)

a = np.array([1,2,4,8])
b = np.array([16,32,64,128])

c = (a > 3) & (b < 100)
print(c)

[False  True False]
[254 253 252 251]
[ 8 16 24 32]
[False False  True False]


Note that `bitwise_and` (`&`) has higher precedence than the comparison operators `<` and `>` in the above example, so the parentheses are needed to make the comparisons run first.
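
To see why the parentheses matter, the sketch below repeats the comparison without them. Python then reads the expression as the chained comparison `a > (3 & b) < 100`, which needs a single truth value for a whole array and therefore fails. The flag `raised` is only an illustrative helper.

```python
import numpy as np

a = np.array([1, 2, 4, 8])
b = np.array([16, 32, 64, 128])

try:
    c = a > 3 & b < 100        # parsed as a > (3 & b) < 100
    raised = False
except ValueError:
    raised = True              # truth value of an array is ambiguous
print(raised)                  # True

# An equivalent form that avoids the precedence question altogether.
c = np.logical_and(a > 3, b < 100)
print(c)                       # [False False  True False]
```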

## 2.3 Boolean Indexing

Besides slices, Numpy arrays can be indexed with boolean arrays (**masks**) or integer arrays. This method is called fancy indexing. Unlike slicing, it creates a **new** Numpy array rather than a view.


In [10]:
np.random.seed(3)
a = np.random.randint(0, 21, 15)   # use random number generator to generate integer numbers
print(a)

mask = (a % 3 == 0)   # True where the element of a is divisible by 3
print(mask)

extract_from_a = a[mask]  # extract the elements divisible by 3

print(extract_from_a)

[10  3  8  0 19 10 11  9 10  6  0 20 12  7 14]
[False  True False  True False False False  True False  True  True False
  True False False]
[ 3  0  9  6  0 12]
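
The following sketch, using a short hand-written array instead of random numbers, illustrates that the extracted array is a copy, that assignment through a mask changes the original array in place, and that an integer array selects elements by position.

```python
import numpy as np

a = np.array([10, 3, 8, 0, 19, 9])
mask = (a % 3 == 0)

picked = a[mask]        # a copy holding 3, 0 and 9
picked[0] = -1          # changing the copy ...
print(a)                # ... leaves a unchanged: [10  3  8  0 19  9]

a[mask] = 0             # assignment through a mask modifies a in place
print(a)                # [10  0  8  0 19  0]

idx = np.array([0, 4])  # an integer index array selects by position
print(a[idx])           # [10 19]
```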


The example below finds the prime numbers below 100 by using a Boolean array.

In [11]:

is_prime = np.ones((100,), dtype=bool)
print('is_prime at the start:\n', is_prime)

# Cross out 0 and 1 which are not primes:
is_prime[:2] = 0
print('After cross out 0 and 1:\n', is_prime)

# Cross out the higher multiples of each candidate factor (sieve of Eratosthenes):
nmax = int(np.sqrt(len(is_prime)))
print(nmax)   # equals 10
for i in range(2, nmax + 1):      # loop over the possible factors from 2 to 10
    is_prime[2*i::i] = False

print(np.nonzero(is_prime))

is_prime at the start:
 [ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True]
After cross out 0 and 1:
 [False False  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True
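
The same sieve can be wrapped in a reusable function. This is only a sketch: the name `primes_below` is hypothetical, and starting the slice at `i*i` instead of `2*i` is a common refinement rather than part of the example above.

```python
import numpy as np

def primes_below(n):
    """Return all primes smaller than n (illustrative helper)."""
    is_prime = np.ones(n, dtype=bool)
    is_prime[:2] = False                      # 0 and 1 are not primes
    for i in range(2, int(np.sqrt(n)) + 1):
        if is_prime[i]:                       # skip factors already crossed out
            is_prime[i*i::i] = False          # smaller multiples were removed earlier
    return np.nonzero(is_prime)[0]

print(primes_below(30))        # [ 2  3  5  7 11 13 17 19 23 29]
print(len(primes_below(100)))  # 25
```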

## 3. Sorting, Searching and Counting

## 3.1 sort()

The content of an `ndarray` object can be sorted with the function `numpy.sort()` or with the method `ndarray.sort()` of the array object. The sorting criterion depends on the data type of the content. The difference between the two is that `np.sort()` creates a new array for the sorted result, whereas `ndarray.sort()` sorts the array in place. Compare the two methods in the following example,

In [12]:
names = np.array(['bob', 'sue', 'jan', 'ad'])
weights = np.array([20.8, 93.2, 53.4, 61.8])

print(np.sort(weights))    # np.sort(weights) returned a new sorted array, weights unchanged
print(weights)   

weights.sort()    # sort() method of ndarray object, which is "weights" in this line, sorts itself.
print(weights)    # the order of content changed after sorting.

print()   # print a new line

print(np.sort(names))  # np.sort(names) sorts the string array in alphabetical order and stores the result in a new array
print(names)    # the order of content in names unchanged

names.sort()    #sort() method of ndarray object, which is "names" in this line, sorts itself.
print(names)    # the order of content changed after sorting.


[20.8 53.4 61.8 93.2]
[20.8 93.2 53.4 61.8]
[20.8 53.4 61.8 93.2]

['ad' 'bob' 'jan' 'sue']
['bob' 'sue' 'jan' 'ad']
['ad' 'bob' 'jan' 'sue']


For a multidimensional `ndarray` object, sorting can be performed along a selected axis, which is passed to the sort function as an argument. By default, the selected axis is the last one. For example,

In [13]:
a = np.array([
        [.2, .1, .5], 
        [.4, .8, .3],
        [.9, .6, .7]
    ])
print(a)
print()

b = np.sort(a)    # a is a two-dimensional array, sorting along last dimension axis, which is axis = 1.
print(b)

print()

b = np.sort(a, axis=0)
print(b)     # sorting along axis = 0

[[0.2 0.1 0.5]
 [0.4 0.8 0.3]
 [0.9 0.6 0.7]]

[[0.1 0.2 0.5]
 [0.3 0.4 0.8]
 [0.6 0.7 0.9]]

[[0.2 0.1 0.3]
 [0.4 0.6 0.5]
 [0.9 0.8 0.7]]


The general interfaces of these two functions are:
```python
       numpy.sort(a, axis=-1, kind='quicksort', order=None)
       ndarray.sort(axis=-1, kind='quicksort', order=None)
```

More details about the sorting functions can be found in the **Numpy** documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html
and
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.sort.html#numpy.ndarray.sort
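
Two common variations are sketched below: a descending sort obtained by reversing the sorted result, and the `order` argument, which selects the field to sort by in a structured array. The array `pets` is an illustrative structured array defined only for this sketch.

```python
import numpy as np

weights = np.array([20.8, 93.2, 53.4, 61.8])
print(np.sort(weights)[::-1])       # descending: [93.2 61.8 53.4 20.8]

pets = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
                dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])
by_age = np.sort(pets, order='age') # sort the records by the 'age' field
print(by_age['name'])               # ['Fido' 'Rex']
```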

## 3.2 argsort()

Another useful sorting function is `numpy.argsort(a)`. It returns an array of indices, of the same shape as `a`, that would put the data along the given axis in sorted order. For example,

In [14]:
a = np.array([3, 1, 2])
print(a)
print(np.argsort(a))    # return an index array for sorting purpose based on elements' index in the input array a

print()

a = np.array([[0, 3], [2, 2]])
print(a)
print()
print(np.argsort(a, axis=0))   # return an index array for sorting along axis 0, based on elements' index in the input array a

print()

print(np.argsort(a, axis=1))   # return an index array for sorting along axis 1, based on elements' index in the input array a

[3 1 2]
[1 2 0]

[[0 3]
 [2 2]]

[[0 1]
 [1 0]]

[[0 1]
 [0 1]]
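
A typical use of `argsort()` is to reorder several related arrays consistently. The sketch below reuses the `names` and `weights` data from section 3.1; the variable `order` is an illustrative name.

```python
import numpy as np

names = np.array(['bob', 'sue', 'jan', 'ad'])
weights = np.array([20.8, 93.2, 53.4, 61.8])

order = np.argsort(weights)   # indices that sort weights in ascending order
print(order)                  # [0 2 3 1]
print(weights[order])         # [20.8 53.4 61.8 93.2]
print(names[order])           # ['bob' 'jan' 'ad' 'sue']
```

Indexing both arrays with the same `order` keeps each name paired with its weight.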


The general interface of `argsort()` is below:

    numpy.argsort(a, axis=-1, kind='quicksort', order=None)

More information can be found here https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

## 3.3 searchsorted()

This function accepts two arguments: the first is a **sorted** array and the second is one or more values.

    searchsorted(sorted_array, values)
    
It finds the indices at which the values should be inserted into the sorted array to maintain its order. For example,

In [15]:
sorted_array = np.linspace(0,1,5)
values = np.array([.1,.8,.3,.12,.5,.25])
np.searchsorted(sorted_array, values)   # return the insertion index for each value

array([1, 4, 2, 1, 2, 1], dtype=int64)

sorted array:

|0|1|2|3|4|
|-|-|-|-|-|
|0.0|0.25|0.5|0.75|1.0|

values:

|value|0.1|0.8|0.3|0.12|0.5|0.25|
|-|-|-|-|-|-|-|
|insert index|1|4|2|1|2|1|

`searchsorted` returns the indices at which the values must be inserted to maintain the ascending order.

For example, `0.1` lies in (0.0, 0.25], so it is inserted at index 1 and the return value is `1`.
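
When a value is already present in the sorted array, the optional `side` argument decides whether the insertion index falls before (`'left'`, the default) or after (`'right'`) the equal entry. A minimal sketch:

```python
import numpy as np

sorted_array = np.linspace(0, 1, 5)   # 0.0, 0.25, 0.5, 0.75, 1.0

left = np.searchsorted(sorted_array, 0.25)                 # before the existing 0.25
right = np.searchsorted(sorted_array, 0.25, side='right')  # after the existing 0.25
print(left, right)    # 1 2
```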

Another example,

In [16]:
from numpy.random import rand
data = rand(100)
data.sort()

bounds = .4, .6   # tuple of low and high bounds

low_idx, high_idx = np.searchsorted(data, bounds)  # one insertion index for each of the two bounds

print('low_idx = ', low_idx)
print('high_idx = ', high_idx)

print(data[low_idx:high_idx])

low_idx =  38
high_idx =  63
[0.40651992 0.4151012  0.44045372 0.44514505 0.45462208 0.45527936
 0.4576864  0.46894025 0.47508861 0.48023996 0.48358553 0.48509423
 0.48887324 0.51403506 0.54101967 0.54359433 0.54464902 0.55284457
 0.55327773 0.55784076 0.55885409 0.57279387 0.5862529  0.59086282
 0.59666377]
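
As a cross-check, the slice selected through `searchsorted()` should contain exactly the values that a boolean mask picks out. The sketch below assumes a seeded generator from `np.random.default_rng` so that the run is repeatable; the seed value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator (arbitrary seed)
data = np.sort(rng.random(100))

low, high = 0.4, 0.6
low_idx, high_idx = np.searchsorted(data, (low, high))

by_slice = data[low_idx:high_idx]                 # uses the sorted order
by_mask = data[(data >= low) & (data < high)]     # tests every element
print(np.array_equal(by_slice, by_mask))          # True
```

On sorted data the `searchsorted()` route only needs a binary search, whereas the mask compares every element.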


More information about `searchsorted()` can be found at https://docs.scipy.org/doc/numpy/reference/generated/numpy.searchsorted.html#numpy.searchsorted

## 3.5 nonzero()

`numpy.nonzero(a)` returns the indices of the nonzero elements in an array `a`. The result is a tuple of arrays, one for each dimension of `a`, containing the indices of the nonzero elements in that dimension. For example,

In [17]:
a = np.array([[3, 0, 0], [0, 4, 0], [5, 6, 0]])      
print(a)

print()

print(np.nonzero(a))

[[3 0 0]
 [0 4 0]
 [5 6 0]]

(array([0, 1, 2, 2], dtype=int64), array([0, 1, 0, 1], dtype=int64))
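
The tuple returned by `nonzero()` can be used directly as an index to extract the nonzero values. When one coordinate pair per element is more convenient, `np.argwhere()` gives the same information row by row. A short sketch:

```python
import numpy as np

a = np.array([[3, 0, 0], [0, 4, 0], [5, 6, 0]])

rows, cols = np.nonzero(a)
print(a[rows, cols])              # [3 4 5 6]
print(np.argwhere(a).tolist())    # [[0, 0], [1, 1], [2, 0], [2, 1]]
```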


## 3.6 count_nonzero() 

`numpy.count_nonzero(a)` returns the number of nonzero elements in array a.

In [18]:
a = np.array([[3, 0, 0], [0, 4, 0], [5, 6, 0]])      
print(a)

print()

print(np.count_nonzero(a))  # the number of nonzero elements in array a is 4.

[[3 0 0]
 [0 4 0]
 [5 6 0]]

4
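
`count_nonzero()` also accepts an `axis` argument, and because `True` counts as nonzero it can count how many elements satisfy a condition. A brief sketch:

```python
import numpy as np

a = np.array([[3, 0, 0], [0, 4, 0], [5, 6, 0]])

print(np.count_nonzero(a, axis=0))   # per column: [2 2 0]
print(np.count_nonzero(a, axis=1))   # per row:    [1 1 2]
print(np.count_nonzero(a > 3))       # elements greater than 3: 3
```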


## 4. Methods to Change Array Shape

## 4.1 reshape()

In the first part of the tutorial, we saw the example below,

In [19]:
a = np.arange(3*4*5)  # this function creates a 1D ndarray
print(a)
print()
a.shape = 3,4,5       # change the 1D array to a 3D array
print(a)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59]

[[[ 0  1  2  3  4]
  [ 5  6  7  8  9]
  [10 11 12 13 14]
  [15 16 17 18 19]]

 [[20 21 22 23 24]
  [25 26 27 28 29]
  [30 31 32 33 34]
  [35 36 37 38 39]]

 [[40 41 42 43 44]
  [45 46 47 48 49]
  [50 51 52 53 54]
  [55 56 57 58 59]]]


In the above example, we assign the size of every dimension to the `shape` attribute of the Numpy array `a` and change it to a 3D array. The same task can be achieved by calling the function `numpy.reshape()`. The general interface of this function is,

    numpy.reshape(a, newshape, order='C')
   
The function needs at least two arguments, because the last argument defaults to `'C'`, the C-style (row-major) index order. For example,

In [20]:
a = np.arange(3*4)  # this function creates a 1D ndarray
print(a)
print()

# reshape a to a 3 by 4 data view
# it is the same as a.reshape(3,4)
b = np.reshape(a, (3, 4)) 
#b = a.reshape(3,4)
print(b)
print()

b = np.reshape(a, (4, 3)) # the shape of a itself does not change, unlike setting a.shape directly
#b = a.reshape(4,3)
print(b)
print()

b = np.reshape(a, (3, -1))  # same as (3, 4); -1 means the last dimension is inferred by Numpy
#b = a.reshape(3,-1)
print(b)


[ 0  1  2  3  4  5  6  7  8  9 10 11]

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
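
One point worth checking is that `reshape()` usually returns a view that shares memory with the original array, so a change made through the reshaped array is visible in the original. The sketch below uses `np.shares_memory()` to confirm this and `copy()` to obtain an independent array.

```python
import numpy as np

a = np.arange(12)
b = a.reshape(3, 4)
print(np.shares_memory(a, b))   # True

b[0, 0] = 99                    # write through the reshaped view
print(a[0])                     # 99

c = a.reshape(3, 4).copy()      # explicit copy, independent of a
c[0, 0] = -1
print(a[0])                     # 99
```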


The index `-1` is very convenient for converting between matrices and vectors. More examples are:

In [21]:
a = np.array([[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]])
print(a.shape)      # 3 rows by 4 columns

a2 = a.reshape(-1)  # 1 row and 12 columns, a (row) vector
print(a2.shape)     # (12,)
print(a2)

a3 = a.reshape(-1,1)   # 12 rows and 1 column, a column vector
print(a3.shape)        # (12,1)
print(a3)

(3, 4)
(12,)
[ 1  2  3  4  5  6  7  8  9 10 11 12]
(12, 1)
[[ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]
 [11]
 [12]]


We can think of the index `-1` as an unknown size for one axis whose value is to be determined by Numpy. The sizes of all the other axes must be provided so that the unknown size can be calculated from the total number of elements.
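
The rule can be sketched as follows: the unknown size is the total number of elements divided by the product of the given sizes, so the request fails when that division is not exact or when more than one `-1` is used. The list `failed` is only an illustrative helper.

```python
import numpy as np

a = np.arange(24)
print(a.reshape(2, -1, 4).shape)    # (2, 3, 4), because 24 / (2 * 4) = 3

failed = []
for bad_shape in [(5, -1), (-1, -1)]:
    try:
        a.reshape(bad_shape)
    except ValueError:
        failed.append(bad_shape)    # 24 is not divisible by 5; two unknowns are ambiguous
print(failed)                       # [(5, -1), (-1, -1)]
```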

## 4.2 transpose()

`numpy.transpose(a)` permutes the dimensions of an array.

The interface is,
    
    numpy.transpose(a, axes=None)
   
The argument `a` is the input array and `axes` is optional. `axes` is a tuple giving the new order of the axes. If `axes` is not given, the order of the axes is reversed; for a 3D array this swaps axis 0 and axis 2.

For example,

In [22]:
a = np.arange(4).reshape((2,2))
print(a)
print()

b = np.transpose(a)
print(b)
print()

a = np.arange(24).reshape((2,3,4))   # a three dimensional array
print(a)
print()

b = a
print(np.transpose(b))  # reverses the axes: permutation between axis 0 and axis 2
print()

print(np.transpose(b, (1,0,2))) # permutation between axis 0 and axis 1

[[0 1]
 [2 3]]

[[0 2]
 [1 3]]

[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]

[[[ 0 12]
  [ 4 16]
  [ 8 20]]

 [[ 1 13]
  [ 5 17]
  [ 9 21]]

 [[ 2 14]
  [ 6 18]
  [10 22]]

 [[ 3 15]
  [ 7 19]
  [11 23]]]

[[[ 0  1  2  3]
  [12 13 14 15]]

 [[ 4  5  6  7]
  [16 17 18 19]]

 [[ 8  9 10 11]
  [20 21 22 23]]]


The `transpose` of array a has a shorthand `a.T`.
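
The effect of `transpose()` is easiest to follow through the shapes. The sketch below checks that `a.T` reverses the axis order and that the transposed array shares memory with the original.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

print(a.T.shape)                                        # (4, 3, 2)
print(np.transpose(a, (1, 0, 2)).shape)                 # (3, 2, 4)
print(np.array_equal(a.T, np.transpose(a, (2, 1, 0))))  # True
print(np.shares_memory(a, a.T))                         # True
```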

## 4.3 flatten()

`flatten()` is a method of the `ndarray` object. It returns a copy of the array collapsed into one dimension. For example,

In [23]:
a = np.array([[0,1],
           [2,3]])
b = a.flatten()
b

array([0, 1, 2, 3])

Because array `b` is a copy of array `a`, any change in `b` does not affect `a`. For example,

In [24]:
b[0] = 1
print(b)
print(a)

[1 1 2 3]
[[0 1]
 [2 3]]


## 4.4 ravel()

`ravel()` is a method of the `ndarray` object. It returns a one-dimensional view of the array whenever possible; a copy is made only when a view cannot be created.

In [25]:
a = np.array([[0,1],
           [2,3]])
b = a.ravel()
b

array([0, 1, 2, 3])

The difference between `flatten()` and `ravel()` is that a change made through the view returned by `ravel()` also changes the original array. For example,

In [26]:
b[0] = 1
a

array([[1, 1],
       [2, 3]])
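
Whether a result is a view or a copy can be tested with `np.shares_memory()`. The sketch below also shows a case in which `ravel()` has to copy: the transposed array is not stored in row-major order, so its flattened form cannot be a view.

```python
import numpy as np

a = np.array([[0, 1],
              [2, 3]])

print(np.shares_memory(a, a.flatten()))   # False, flatten() always copies
print(np.shares_memory(a, a.ravel()))     # True, ravel() returns a view here
print(a.T.ravel())                        # [0 2 1 3]
print(np.shares_memory(a, a.T.ravel()))   # False, a copy was needed
```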

## 4.5 concatenate()

Arrays whose shapes match in every dimension except the one along the joining axis can be concatenated. The general interface of this function is,

    numpy.concatenate((a1, a2, ...), axis=0, out=None)
    
This function joins a sequence of arrays along a specified axis and returns the concatenated array. For example,

In [27]:
x = np.array([
        [0,1,2],
        [10,11,12]
    ])
y = np.array([
        [50,51,52],
        [60,61,62]
    ])
print(x.shape)
print(y.shape)
print()

z = np.concatenate((x,y))      # join along axis 0, the default axis
print(z)
print()

z = np.concatenate((x,y), axis=1) # join along axis 1.
print(z)

(2, 3)
(2, 3)

[[ 0  1  2]
 [10 11 12]
 [50 51 52]
 [60 61 62]]

[[ 0  1  2 50 51 52]
 [10 11 12 60 61 62]]
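
The shapes only need to agree away from the joining axis. In the sketch below a (2, 3) array and a (1, 3) array can be joined along axis 0 but not along axis 1; the names `w` and `mismatch` are illustrative.

```python
import numpy as np

x = np.array([[0, 1, 2],
              [10, 11, 12]])        # shape (2, 3)
w = np.array([[20, 21, 22]])        # shape (1, 3)

z = np.concatenate((x, w), axis=0)  # row counts may differ along axis 0
print(z.shape)                      # (3, 3)

try:
    np.concatenate((x, w), axis=1)  # row counts must match for axis 1
    mismatch = False
except ValueError:
    mismatch = True
print(mismatch)                     # True
```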


## 4.6 np.newaxis and None

Simply put, each use of `numpy.newaxis` in an index increases the number of dimensions of an existing array by one.

For example,


In [28]:
a = np.arange(4)
print(a.shape)  # 1D array with four elements (4,)
print(a)

a_new = a[:,np.newaxis]  # a_new is increased to 2D array (4,1)
print(a_new.shape)       # (4,1)
print(a_new)

a2_new = a[np.newaxis,:] # a2_new is 2D array (1,4)
print(a2_new.shape)      # (1,4)
print(a2_new)

(4,)
[0 1 2 3]
(4, 1)
[[0]
 [1]
 [2]
 [3]]
(1, 4)
[[0 1 2 3]]


In fact, `numpy.newaxis` is the same object as `None`:

In [29]:
np.newaxis is None

True

So the previous examples that use `np.newaxis` can be rewritten as follows:

In [30]:
a_new = a[:,None]  # a_new is increased to 2D array (4,1)
print(a_new.shape)       # (4,1)
print(a_new)

a2_new = a[None,:] # a2_new is 2D array (1,4)
print(a2_new.shape)      # (1,4)
print(a2_new)

(4, 1)
[[0]
 [1]
 [2]
 [3]]
(1, 4)
[[0 1 2 3]]
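
A frequent reason for adding an axis is to combine a column vector with a row vector so that every pair of elements is processed at once. The sketch below computes all pairwise distances between three illustrative one-dimensional positions.

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0])            # illustrative positions on a line

# (3, 1) minus (1, 3) gives a (3, 3) table of differences.
dist = np.abs(x[:, None] - x[None, :])
print(dist.shape)   # (3, 3)
print(dist)
# [[0. 1. 3.]
#  [1. 0. 2.]
#  [3. 2. 0.]]
```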


## 5. Reading and Writing Arrays

## 5.1 Read formatted data from text file by `numpy.loadtxt()`

First, let us create a formatted text data file with a magic command in Jupyter Notebook.

In [31]:
%%writefile myfile.txt
2.1 2.3 3.2 1.3 3.1
6.1 3.1 4.2 2.3 1.8

Overwriting myfile.txt


Now, we want to read the data from 'myfile.txt' and load them into an `ndarray` object. A naive and slow way is below:

In [32]:
data = []

with open('myfile.txt') as f:
    # read one line each time
    for line in f:
        fields = line.split()
        row_data = [float(x) for x in fields]
        data.append(row_data)

data = np.array(data)

print(data)

[[2.1 2.3 3.2 1.3 3.1]
 [6.1 3.1 4.2 2.3 1.8]]


A much simpler method is to call `numpy.loadtxt()` on the data file.

In [33]:
data = np.loadtxt('myfile.txt')
print(data)

[[2.1 2.3 3.2 1.3 3.1]
 [6.1 3.1 4.2 2.3 1.8]]


In a .csv format data file, the delimiter is not a space but `,`. For example, the data file is:

In [34]:
%%writefile myfile.txt
2.1,2.3,3.2,1.3,3.1
6.1,3.1,4.2,2.3,1.8

Overwriting myfile.txt


We can modify the call to `numpy.loadtxt()` by adding one more argument, `delimiter=','`.

In [35]:
data = np.loadtxt('myfile.txt', delimiter=',')
print(data)

[[2.1 2.3 3.2 1.3 3.1]
 [6.1 3.1 4.2 2.3 1.8]]


The general interface of `numpy.loadtxt()` is below:

```python
    loadtxt(fname, dtype=float, 
            comments='#', delimiter=None, 
            converters=None, skiprows=0, 
            usecols=None, unpack=False, ndmin=0)
```

`loadtxt` has many optional arguments. `delimiter` sets the separator.

`skiprows` is the number of leading rows to skip; it can be used to handle a data file with a text header.

`usecols` allows reading data from specified columns.

`comments` is the character or list of characters that marks the start of a comment.

In [36]:
%%writefile myfile.txt
X Y Z MAG ANG
2.1,2.3,3.2,1.3,3.1
6.1,3.1,4.2,2.3,1.8

Overwriting myfile.txt


In [37]:
b = np.loadtxt('myfile.txt', delimiter= ',', skiprows=1)
print(b)

[[2.1 2.3 3.2 1.3 3.1]
 [6.1 3.1 4.2 2.3 1.8]]
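
The `usecols` and `unpack` arguments can be combined to read selected columns straight into separate variables. The sketch below reads the same content from an in-memory `io.StringIO` object instead of a file so that it is self-contained; the variable names are illustrative.

```python
import numpy as np
from io import StringIO

text = "X Y Z MAG ANG\n2.1,2.3,3.2,1.3,3.1\n6.1,3.1,4.2,2.3,1.8\n"

# Read only the X and MAG columns; unpack=True returns one array per column.
x, mag = np.loadtxt(StringIO(text), delimiter=',', skiprows=1,
                    usecols=(0, 3), unpack=True)
print(x)     # [2.1 6.1]
print(mag)   # [1.3 2.3]
```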


Let us look at a more complicated data file. First, we create the file.

In [38]:
%%writefile myfile.txt
 -- BEGINNING OF THE FILE
% Day, Month, Year, Skip, Power
01, 01, 2000, x876, 13 % wow!
% we don't want have Jan 03rd
04, 01, 2000, xfed, 55

Overwriting myfile.txt


We want to read data from the first, second, third and fifth columns because only these columns contain numeric values.

In [39]:
data = np.loadtxt('myfile.txt', 
                  skiprows=1,         # skip the first line
                  dtype=int,         # data type (the np.int alias was removed in newer Numpy)
                  delimiter=',',     # delimiter is comma
                  usecols=(0,1,2,4), # read columns 0, 1, 2, and 4
                  comments='%'       # text after a % sign is a comment
                 )
print(data)

[[   1    1 2000   13]
 [   4    1 2000   55]]


We can also define a data converter to handle more complex data conversion in `numpy.loadtxt()`. Let us say we create the file below.

In [40]:
%%writefile myfile.txt
2010-11-01 2.3 3.2
2011-11-01 6.1 3.1

Overwriting myfile.txt


So we use the code below to read the data and load to an array.

In [41]:
import datetime

def date_converter(s):
    # older Numpy versions pass bytes to converters, newer ones pass str
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    
    return datetime.datetime.strptime(s, "%Y-%m-%d")

data = np.loadtxt('myfile.txt',
                  dtype=object, # data type in the ndarray (np.object was removed in newer Numpy)
                  converters={0:date_converter,  # self-defined method to convert the data for first column
                              1:float,           # floating point number in second and third column
                              2:float})

print(data)

[[datetime.datetime(2010, 11, 1, 0, 0) 2.3 3.2]
 [datetime.datetime(2011, 11, 1, 0, 0) 6.1 3.1]]


A more powerful function to read and load data into an `ndarray` object is `numpy.genfromtxt()`. The general interface is:

    genfromtxt(fname, dtype=float, comments='#', delimiter=None, 
               skiprows=0, skip_header=0, skip_footer=0, converters=None, 
               missing='', missing_values=None, filling_values=None, usecols=None, 
               names=None, excludelist=None, deletechars=None, replace_space='_', 
               autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, 
               usemask=False, loose=True, invalid_raise=True)
               
The details can be found from https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt
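
One thing `genfromtxt()` handles that `loadtxt()` does not is missing entries. The sketch below is a minimal illustration on made-up data held in an `io.StringIO` object: an empty field is read as `nan`, and `names=True` takes the field names from the header line.

```python
import numpy as np
from io import StringIO

text = "x,y,z\n1.0,,3.0\n4.0,5.0,6.0\n"   # the y value of the first row is missing

data = np.genfromtxt(StringIO(text), delimiter=',', skip_header=1)
print(data.shape)              # (2, 3)
print(np.isnan(data[0, 1]))    # True, the empty field became nan

table = np.genfromtxt(StringIO(text), delimiter=',', names=True)
print(table.dtype.names)       # ('x', 'y', 'z')
print(table['x'])              # [1. 4.]
```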

## 5.2 Save formatted data to text file by `numpy.savetxt()`

`numpy.savetxt()` saves 1D or 2D `ndarray` to a file. By default, the file is text based and values are saved in scientific format.

In [42]:
a = np.array([[1,2], 
            [3,4]])

np.savetxt('out.txt', a)   # save the array a to the file out.txt

with open('out.txt') as f:
    for line in f:
        print(line)
        

1.000000000000000000e+00 2.000000000000000000e+00

3.000000000000000000e+00 4.000000000000000000e+00



The general interface of `numpy.savetxt()` is
```python
    savetxt(fname, 
            X, 
            fmt='%.18e', 
            delimiter=' ', 
            newline='\n', 
            header='', 
            footer='', 
            comments='# ')
```
So we can use more arguments to save the data in a nice format.

In [43]:
data = np.array([[1,2], 
                 [3,4]])

np.savetxt('out.txt', data, fmt="%d")  # save the data as integer

with open('out.txt') as f:
    for line in f:
        print(line)
        

1 2

3 4



Another example,

In [44]:
data = np.array([[1,2], 
                 [3,4]])

np.savetxt('out2.txt', data, fmt="%.2f", delimiter=',') #save data in floating point number, two decimal, delimiter is comma

with open('out2.txt') as f:
    for line in f:
        print(line)    

1.00,2.00

3.00,4.00
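
The `header` argument writes a title line that `loadtxt()` later skips, because the line is prefixed with the default comment marker `# `. The sketch below checks the round trip in a temporary directory; the file name `table.csv` and the column labels are illustrative.

```python
import os
import tempfile
import numpy as np

data = np.array([[1.5, 2.0],
                 [3.25, 4.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'table.csv')
    np.savetxt(path, data, fmt='%.2f', delimiter=',', header='x,y')

    with open(path) as f:
        first_line = f.readline().rstrip()    # the header written by savetxt
    loaded = np.loadtxt(path, delimiter=',')  # the '#' line is treated as a comment

print(first_line)                     # # x,y
print(np.array_equal(loaded, data))   # True
```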



## 5.3 Save/Read Data in Binary Format File

An `ndarray` object can be saved in a binary `.npy` file. A `.npy` file keeps not only the values of an array but also information such as its `dtype` and `shape`, so the array object is easy to reconstruct in memory when the file is read.


To save:

- `save(file, arr)` saves a single array in `.npy` format
- `savez(file, *args, **kwds)` saves several arrays in uncompressed `.npz` format
- `savez_compressed(file, *args, **kwds)` saves several arrays in compressed `.npz` format

To read:

- `load(file, mmap_mode=None)` returns an array for a `.npy` file; for a `.npz` file it returns a dictionary-like object of name-array pairs.


In [45]:
a = np.array([[1.0,2.0], [3.0,4.0]])

fname = 'afile.npy'
np.save(fname, a)

aa = np.load(fname)
print(aa)

[[1. 2.]
 [3. 4.]]


In [46]:
a = np.array([[1.0,2.0], 
              [3.0,4.0]])
b = np.arange(1000)

np.savez('data.npz', a=a, b=b) # save the two arrays a and b in the file 'data.npz'; an .npz file is an archive of .npy files

data = np.load('data.npz')

print(data['a'])

[[1. 2.]
 [3. 4.]]
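
The object returned for an `.npz` file lists the stored array names in its `files` attribute. The sketch below uses `savez_compressed()` in a temporary directory and reads the arrays back inside a `with` block so that the file is closed afterwards; the variable names are illustrative.

```python
import os
import tempfile
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.arange(1000)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.npz')
    np.savez_compressed(path, a=a, b=b)   # keyword names become the array names

    with np.load(path) as data:
        names = sorted(data.files)        # names of the stored arrays
        a_loaded = data['a']
        b_loaded = data['b']

print(names)                          # ['a', 'b']
print(np.array_equal(a_loaded, a))    # True
print(np.array_equal(b_loaded, b))    # True
```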
