# I/O with NumPy

NumPy provides several functions to create arrays from tabular data. We focus here on the `genfromtxt` function. In a nutshell, `genfromtxt` runs two main loops. The first loop converts each line of the file into a sequence of strings. The second loop converts each string to the appropriate data type. This mechanism is slower than a single loop, but gives more flexibility.
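The two-loop idea can be illustrated with a few lines of plain Python. This is only a simplified sketch of the concept, not NumPy's actual implementation; the helper name `two_pass_parse` and its parameters are hypothetical.

```python
def two_pass_parse(text, delimiter=None, convert=float):
    """Hypothetical helper illustrating the two-loop idea (not NumPy code)."""
    # First loop: split every non-empty line into a list of strings.
    rows = [line.split(delimiter) for line in text.splitlines() if line.strip()]
    # Second loop: convert every string to the requested type.
    return [[convert(item) for item in row] for row in rows]


parsed = two_pass_parse("1 2 3\n4 5 6")
print(parsed)
```

Keeping the splitting and the conversion separate is what makes it possible to customize each step independently, at the cost of passing over the data twice.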

In [None]:
import numpy as np
from io import StringIO

### Defining the input
The only mandatory argument of `genfromtxt` is the source of the data. It can be a string, a list of strings, a generator or an open file-like object with a read method, for example, a file or `io.StringIO` object. If a single string is provided, it is assumed to be the name of a local or remote file.
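As a minimal sketch of the input types listed above, the same two rows can be passed as a list of strings, as a generator, or as a file-like object. The variable names are illustrative only.

```python
import numpy as np
from io import StringIO

lines = ["1 2 3", "4 5 6"]

# A list of strings: each element is treated as one line of input.
from_list = np.genfromtxt(lines)

# A generator that yields one line at a time.
from_generator = np.genfromtxt(line for line in lines)

# A file-like object wrapping the same text.
from_file_like = np.genfromtxt(StringIO("\n".join(lines)))

print(from_list)
```

All three calls should produce the same 2x3 array of floats.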

### Splitting the lines into columns
Once the file is defined and open for reading, `genfromtxt` splits each non-empty line into a sequence of strings. Empty or commented lines are skipped. The `delimiter` keyword defines how the splitting takes place.


In [2]:
import numpy as np
from io import StringIO
data = "1, 2, 3\n4, 5, 6"
np.genfromtxt(StringIO(data), delimiter=",")

array([[1., 2., 3.],
       [4., 5., 6.]])

Another common separator is `"\t"`, the tab character. We are not limited to a single character, however: any string will do. By default, `genfromtxt` assumes `delimiter=None`, meaning that the line is split along white spaces (including tabs) and that consecutive white spaces are treated as a single white space.

Alternatively, `delimiter` can be an integer. The file is then treated as fixed-width, with each column spanning that many characters:

In [3]:
data = "  1  2  3\n  4  5 67\n890123  4"
np.genfromtxt(StringIO(data), delimiter=3)

array([[  1.,   2.,   3.],
       [  4.,   5.,  67.],
       [890., 123.,   4.]])
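The other `delimiter` forms can be sketched in the same way: a tab separator, the default whitespace splitting, and, in the spirit of the `delimiter=3` example, a sequence of integers that gives each fixed-width column its own size. The sample strings are made up for illustration.

```python
import numpy as np
from io import StringIO

# Tab-separated values.
tabbed = np.genfromtxt(StringIO("1\t2\t3\n4\t5\t6"), delimiter="\t")

# Default delimiter=None: runs of spaces and tabs count as one separator.
spaced = np.genfromtxt(StringIO("1   2\t3\n4 5    6"))

# Fixed-width columns of different sizes: 4, 3 and 2 characters.
fixed = np.genfromtxt(StringIO("123456789\n   4  7 9\n   4567 9"),
                      delimiter=(4, 3, 2))
print(fixed)
```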

### The autostrip argument
By default, when a line is decomposed into a series of strings, the individual entries are not stripped of leading or trailing white spaces. This behavior can be changed by setting the optional argument `autostrip` to `True`:


In [5]:
data = "1, abc , 2\n 3, xxx, 4"
x = np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5")
print("Without autostrip \n", x)
print("-" * 20)
y = np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5", autostrip=True)
print("With autostrip \n", y)

Without autostrip 
 [['1' ' abc ' ' 2']
 ['3' ' xxx' ' 4']]
--------------------
With autostrip 
 [['1' 'abc' '2']
 ['3' 'xxx' '4']]


### The comments argument
The optional argument `comments` is used to define a character string that marks the beginning of a comment. By default, `genfromtxt` assumes `comments='#'`. The comment marker may occur anywhere on the line.

> [NOTE!]
> There is one notable exception to this behavior: if the optional argument `names=True` is set, the first commented line will be examined for names.

In [6]:
data = """#
# Skip me !
# Skip me too !
1, 2
3, 4
5, 6 #This is the third line of the data
7, 8
# And here comes the last line
9, 0
"""
s = np.genfromtxt(StringIO(data), comments="#", delimiter=",")
print(s)

[[1. 2.]
 [3. 4.]
 [5. 6.]
 [7. 8.]
 [9. 0.]]
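Two further cases can be sketched briefly: a comment marker other than `#`, and the `names=True` exception mentioned in the note, where a commented first line supplies the column names. The sample data and the `%` marker are assumptions chosen for illustration.

```python
import numpy as np
from io import StringIO

# A non-default comment marker: everything from "%" onward is ignored.
data = "% measurement log\n1, 2\n3, 4 % trailing remark\n"
values = np.genfromtxt(StringIO(data), comments="%", delimiter=",")
print(values)

# With names=True, the first line is read as column names
# even though it starts with the comment marker.
named = np.genfromtxt(StringIO("# a b c\n1 2 3\n4 5 6"), names=True)
print(named.dtype.names)
```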


### The skip_header and skip_footer arguments
The presence of a header in the file can hinder data processing. In that case, we need to use the `skip_header` optional argument. The value of this argument must be an integer corresponding to the number of lines to skip at the beginning of the file, before any other action is performed. Similarly, we can skip the last `n` lines of the file by setting the `skip_footer` argument to `n`:


In [8]:
data = "\n".join(str(i) for i in range(10))
c = np.genfromtxt(StringIO(data))
print(c)
print("-" * 20)
x = np.genfromtxt(StringIO(data),
              skip_header=3, skip_footer=5)
print(x)

[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
--------------------
[3. 4.]
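A slightly more realistic sketch: a small table with two descriptive header lines and a trailing totals row that should not be loaded. The layout of the sample text is an assumption made for illustration.

```python
import numpy as np
from io import StringIO

# Two header lines (column labels and units), two data rows, one totals row.
data = "x y\nm s\n1 2\n3 4\n4 6"

# Drop the two header lines and the final totals line.
table = np.genfromtxt(StringIO(data), skip_header=2, skip_footer=1)
print(table)
```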


### The usecols argument
In some cases, we are not interested in all the columns of the data but only a few of them. We can select which columns to import with the `usecols` argument. This argument accepts a single integer or a sequence of integers corresponding to the indices of the columns to import. Remember that by convention, the first column has an index of 0. Negative integers behave like regular Python negative indices.

In [10]:
data = "1 2 3\n4 5 6"
k = np.genfromtxt(StringIO(data), usecols=(0, -1))
print(k)

[[1. 3.]
 [4. 6.]]
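For completeness, here is a short sketch of the single-integer form of `usecols`, next to a selection that uses only negative indices. Selecting a single column returns a one-dimensional array.

```python
import numpy as np
from io import StringIO

data = "1 2 3\n4 5 6"

# A single integer selects one column; the result is 1-D.
middle = np.genfromtxt(StringIO(data), usecols=1)
print(middle)

# Negative indices count from the last column.
last_two = np.genfromtxt(StringIO(data), usecols=(-2, -1))
print(last_two)
```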


If the columns have names, we can also select which columns to import by giving their names to the `usecols` argument, either as a sequence of strings or as a comma-separated string:

In [11]:
data = "1 2 3\n4 5 6"
p = np.genfromtxt(StringIO(data),
              names="a, b, c", usecols=("a", "c"))
print(p)
print("-" * 20)
x = np.genfromtxt(StringIO(data),
              names="a, b, c", usecols=("a, c"))
print(x)

[(1., 3.) (4., 6.)]
--------------------
[(1., 3.) (4., 6.)]


### Choosing the data type
The main way to control how the sequences of strings we have read from the file are converted to other types is to set the `dtype` argument. Acceptable values for this argument are:

- a single type, such as `dtype=float`. The output will be 2D with the given dtype, unless a name has been associated with each column with the use of the `names` argument (see below). Note that `dtype=float` is the default for `genfromtxt`.

- a sequence of types, such as `dtype=(int, float, float)`.
- a comma-separated string, such as `dtype="i4,f8,|U3"`.
- a dictionary with two keys `'names'` and `'formats'`.
- a sequence of tuples `(name, type)`, such as `dtype=[('A', int), ('B', float)]`.
- an existing `numpy.dtype` object.
- the special value None. In that case, the type of the columns will be determined from the data itself (see below).
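A few of the forms listed above can be compared on the same input. This is a minimal sketch: the sample data and field names are illustrative, and `encoding="utf-8"` is passed explicitly so that text columns are read as Python strings regardless of the NumPy version's default.

```python
import numpy as np
from io import StringIO

data = "1 2.5 abc\n4 5.5 xyz"

# Comma-separated string: one format per column, default names f0, f1, f2.
a = np.genfromtxt(StringIO(data), dtype="i4,f8,U3")

# Sequence of (name, type) tuples: names and types given together.
b = np.genfromtxt(StringIO(data),
                  dtype=[("id", int), ("value", float), ("tag", "U3")])

# dtype=None: the type of each column is inferred from the data.
c = np.genfromtxt(StringIO(data), dtype=None, encoding="utf-8")

print(a.dtype)
print(b["tag"])
print(c.dtype)
```

In all three cases the result is a one-dimensional structured array with one field per column.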

### The names argument
A natural approach when dealing with tabular data is to allocate a name to each column. A first possibility is to use an explicit structured dtype, as mentioned previously:


In [12]:
data = StringIO("1 2 3\n 4 5 6")
x = np.genfromtxt(data, dtype=[(_, int) for _ in "abc"])
print(x)

[(1, 2, 3) (4, 5, 6)]


A simpler alternative is to use the `names` keyword with a sequence of strings or a comma-separated string:



In [13]:
data = StringIO("1 2 3\n 4 5 6")
x = np.genfromtxt(data, names="A, B, C")
print(x)

[(1., 2., 3.) (4., 5., 6.)]
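Once columns are named, they can be retrieved by name from the resulting structured array. The sketch below also combines `names` with a single `dtype`, which is then applied to every field.

```python
import numpy as np
from io import StringIO

data = StringIO("1 2 3\n4 5 6")

# names supplies the field names; dtype=int is applied to each field.
x = np.genfromtxt(data, names="A, B, C", dtype=int)

print(x.dtype.names)  # field names
print(x["B"])         # one column, selected by name
print(x[0])           # one row, as a record
```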


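### The filling_values argument
We can get finer control over the conversion of missing values with the `filling_values` optional argument. Like `missing_values`, which specifies the strings that mark missing data, this argument accepts different kinds of values: a single value used as the default for all columns, a sequence of values with one entry per column, or a dictionary whose keys are column indices or column names: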

In [19]:
data = "N/A, 2, 3\n4, ,???"
kwargs = dict(delimiter=",",
              dtype=int,
              names="a,b,c",
              missing_values={0:"N/A", 'b':" ", 2:"???"},
              filling_values={0:0, 'b':0, 2:-999}, autostrip=True)
x = np.genfromtxt(StringIO(data), **kwargs)
print(x)

[(0, 2,    3) (4, 0, -999)]
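The simplest case is a single fill value applied to every column. The sketch below contrasts it with the default behavior for float data, where empty fields become `nan`. The sample data is made up for illustration.

```python
import numpy as np
from io import StringIO

# The second field of row 1 and the last field of row 2 are empty.
data = "1,,3\n4,5,"

# Default: empty fields in float columns are read as nan.
default = np.genfromtxt(StringIO(data), delimiter=",")
print(default)

# A single filling value replaces every missing entry.
filled = np.genfromtxt(StringIO(data), delimiter=",", filling_values=-1)
print(filled)
```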
