# NumPy: Importing Data with genfromtxt

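We can use genfromtxt to create an array from delimited or fixed-width text. The examples wrap each string in BytesIO so that genfromtxt can read it like a file.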

In [1]:
import numpy as np
from io import BytesIO

## Comma-delimited

In [2]:
data = "1, 2, 3\n4, 5, 6"
np.genfromtxt(BytesIO(data.encode()), delimiter=",")

array([[1., 2., 3.],
       [4., 5., 6.]])
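The same call works for other separators. As a minimal sketch, the snippet below assumes tab-separated input (the sample string is made up for illustration) and checks the shape and dtype of the result:

```python
import numpy as np
from io import BytesIO

# Hypothetical tab-separated sample, mirroring the comma-delimited data above.
tab_data = "1\t2\t3\n4\t5\t6"
tab_array = np.genfromtxt(BytesIO(tab_data.encode()), delimiter="\t")

print(tab_array.shape)  # (2, 3)
print(tab_array.dtype)  # float64
```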

## Fixed width
genfromtxt can also read fixed-width columns. In that case the delimiter parameter gives the column width in characters instead of a separator string.

If all the columns have the same width:

In [5]:
data = "  1  2  3\n  4  5 67\n890123  4"
np.genfromtxt(BytesIO(data.encode()), delimiter=3)

array([[  1.,   2.,   3.],
       [  4.,   5.,  67.],
       [890., 123.,   4.]])

If the columns have different widths, we can pass a sequence of widths as the delimiter.

In [6]:
data = "123456789\n   4  7 9\n   4567 9"
np.genfromtxt(BytesIO(data.encode()), delimiter=(4, 3, 2))

array([[1234.,  567.,   89.],
       [   4.,    7.,    9.],
       [   4.,  567.,    9.]])
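To see how a sequence of widths maps onto a line, the sketch below cuts the first line by hand with plain string slicing and compares it with what genfromtxt returns. The names line, widths, fields and parsed are only illustrative.

```python
import numpy as np
from io import BytesIO

line = "123456789"
widths = (4, 3, 2)

# Walk through the line, cutting one field per width.
fields = []
start = 0
for width in widths:
    fields.append(line[start:start + width])
    start += width

print(fields)  # ['1234', '567', '89']

# genfromtxt applies the same cuts and then converts each field to float.
parsed = np.genfromtxt(BytesIO(line.encode()), delimiter=widths)
print(parsed)
```

Because the input here has a single line, the result is a one-dimensional array with one entry per field.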

## Autostrip

By default, leading and trailing spaces are kept in each field. Setting autostrip=True strips them:

In [7]:
data = "1, abc , 2\n 3, xxx, 4"
# Without autostrip
np.genfromtxt(BytesIO(data.encode()), delimiter=",", dtype="S5")

array([[b'1', b' abc ', b' 2'],
       [b'3', b' xxx', b' 4']], dtype='|S5')

In [8]:
# With autostrip
np.genfromtxt(BytesIO(data.encode()), delimiter=",", dtype="S5", autostrip=True)

array([[b'1', b'abc', b'2'],
       [b'3', b'xxx', b'4']], dtype='|S5')

## Comments

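Everything from the comments character to the end of the line is ignored, whether the comment fills the whole line or follows the data: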

In [9]:
data = """#
# Skip me !
# Skip me too !
1, 2
3, 4
5, 6 #This is the third line of the data
7, 8
# And here comes the last line
9, 0
"""
np.genfromtxt(BytesIO(data.encode()), comments="#", delimiter=",")

array([[1., 2.],
       [3., 4.],
       [5., 6.],
       [7., 8.],
       [9., 0.]])

## Skipping headers and footers

In [10]:
data = """Header1
Header2
Header3
1
3
5
7
Footer1
Footer2
"""
np.genfromtxt(BytesIO(data.encode()), skip_header=3, skip_footer=2)

array([1., 3., 5., 7.])

## Choosing specific columns

In [11]:
data = "1 2 3\n4 5 6"
np.genfromtxt(BytesIO(data.encode()), usecols=(0, 2))

array([[1., 3.],
       [4., 6.]])
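Two variations on usecols, sketched here with the same data: a negative index counts from the last column, and when the columns are named, usecols can select by name. The variable names first_and_last and named_cols are only for illustration.

```python
import numpy as np
from io import BytesIO

data = "1 2 3\n4 5 6"

# -1 refers to the last column, so this picks the first and last columns.
first_and_last = np.genfromtxt(BytesIO(data.encode()), usecols=(0, -1))
print(first_and_last)

# With named columns, usecols can also select by name.
named_cols = np.genfromtxt(BytesIO(data.encode()),
                           names="a, b, c", usecols=("a", "c"))
print(named_cols.dtype.names)  # ('a', 'c')
```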

## Naming the columns

Name the columns in code:

In [12]:
data = "1 2 3\n 4 5 6"
np.genfromtxt(BytesIO(data.encode()), names="A, B, C")

array([(1., 2., 3.), (4., 5., 6.)],
      dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])

Take the names from the first line of the data by passing names=True:

In [13]:
data = "a b c\n1 2 3\n 4 5 6"
np.genfromtxt(BytesIO(data.encode()), names=True)

array([(1., 2., 3.), (4., 5., 6.)],
      dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])
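Naming the columns gives a structured array, so each column can be pulled out by its name. A short sketch using the same data (the variable name table is just for illustration):

```python
import numpy as np
from io import BytesIO

data = "a b c\n1 2 3\n 4 5 6"
table = np.genfromtxt(BytesIO(data.encode()), names=True)

# The field names come from the first line of the data.
print(table.dtype.names)  # ('a', 'b', 'c')

# Index with a name to get one column, or with an integer to get one row.
print(table["b"])  # values of column b
print(table[0])    # first row as a record
```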

## Overriding the conversion

In the example below, the second column cannot be converted to float because of the trailing %, so its values become nan:

In [14]:
data = "1, 2.3%, 45.\n6, 78.9%, 0"
np.genfromtxt(BytesIO(data.encode()), delimiter=",")

array([[ 1., nan, 45.],
       [ 6., nan,  0.]])

So we can pass a converter function for that column that strips the % and converts the rest to a float (here also divided by 100 to give a fraction):

In [61]:
convertfunc = lambda x: float(x.strip("%".encode()))/100.
np.genfromtxt(BytesIO(data.encode()), delimiter=",", converters={1: convertfunc})

array([[  1.00000000e+00,   2.30000000e-02,   4.50000000e+01],
       [  6.00000000e+00,   7.89000000e-01,   0.00000000e+00]])
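The converter can also be written as a named function. The sketch below assumes that, depending on the NumPy version and the encoding argument, the field may be handed to the converter as either bytes or str, so it accepts both. The function name percent_to_fraction is made up for this example.

```python
import numpy as np
from io import BytesIO

def percent_to_fraction(field):
    """Turn a field such as ' 2.3%' into the fraction 0.023."""
    # Assumption: the field may arrive as bytes or as str.
    if isinstance(field, bytes):
        field = field.decode()
    return float(field.strip().rstrip("%")) / 100.0

data = "1, 2.3%, 45.\n6, 78.9%, 0"
result = np.genfromtxt(BytesIO(data.encode()), delimiter=",",
                       converters={1: percent_to_fraction})
print(result)
```

A named function is easier to test on its own than a lambda, for example by calling percent_to_fraction("78.9%") directly.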

## Missing values

Entries that genfromtxt treats as missing are replaced by a filling value. By default, this value is determined from the expected dtype according to this table:

<table>
<tr><th>Expected type</th><th>Default</th></tr>
<tr><td>bool</td><td>False</td></tr>
<tr><td>int</td><td>-1</td></tr>
<tr><td>float</td><td>np.nan</td></tr>
<tr><td>complex</td><td>np.nan+0j</td></tr>
<tr><td>string</td><td>'???'</td></tr>
</table>

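We can override these defaults with the filling_values argument, which accepts:<br/>
**a single value**<br/>
This will be the default for all columns<br/>
**a sequence of values**<br/>
Each entry will be the default for the corresponding column<br/>
**a dictionary**<br/>
Each key can be a column index or a column name, and the corresponding value should be a single object. We can use the special key None to define a default for all columns.<br/>

In the example below, missing_values lists the string that marks a missing entry in each column, and filling_values gives the replacement for each column.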

In [15]:
data = "N/A, 2, 3\n4, ,???"
kwargs = dict(delimiter=",",
              dtype=int,
              names="a,b,c",
              missing_values={0:"N/A", 'b':" ", 2:"???"},
              filling_values={0:0, 'b':0, 2:-999})
np.genfromtxt(BytesIO(data.encode()), **kwargs)

array([(0, 2,    3), (4, 0, -999)],
      dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4')])
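The single-value form of filling_values can be sketched as follows. The sample data is made up for illustration and relies on empty fields being treated as missing.

```python
import numpy as np
from io import BytesIO

# Hypothetical sample with one empty field per row.
data = "1,,3\n,5,6"

# With the default float dtype, missing entries become nan.
default_fill = np.genfromtxt(BytesIO(data.encode()), delimiter=",")
print(default_fill)

# A single filling value is used for every column.
zero_fill = np.genfromtxt(BytesIO(data.encode()), delimiter=",",
                          filling_values=0)
print(zero_fill)
```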