# Importing data with genfromtxt

## Splitting the lines into columns
In essence, each line of the input becomes one row of the resulting array; every \n in the string starts the next row.

In [1]:
import numpy as np
from io import StringIO
data = "1, 2, 3\n4, 5, 6"
np.genfromtxt(StringIO(data), delimiter=",")

array([[1., 2., 3.],
       [4., 5., 6.]])

In [2]:
data = "  1  2  3\n  4  5 67\n890123  4"
np.genfromtxt(StringIO(data), delimiter=3)

array([[  1.,   2.,   3.],
       [  4.,   5.,  67.],
       [890., 123.,   4.]])

In [3]:
data = "123456789\n   4  7 9\n   4567 9"
np.genfromtxt(StringIO(data), delimiter=(4, 3, 2))  # fixed-width fields: 4, then 3, then 2 characters

array([[1234.,  567.,   89.],
       [   4.,    7.,    9.],
       [   4.,  567.,    9.]])
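When no delimiter is passed at all, genfromtxt splits each line on runs of whitespace, so tabs and repeated spaces behave like a single separator. A minimal sketch of that default, using a made-up input string:

```python
import numpy as np
from io import StringIO

# Mixed spaces and a tab between the fields; no delimiter is given,
# so each line is split on any run of whitespace.
data = "1 2\t3\n4    5 6"
arr = np.genfromtxt(StringIO(data))
print(arr)
```

Both lines should parse into three float columns, giving a 2x3 array.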

## The autostrip argument

With autostrip=True, leading and trailing whitespace is stripped from each field.

In [4]:
data = "1, abc , 2\n 3, xxx, 4"
np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5")

array([['1', ' abc ', ' 2'],
       ['3', ' xxx', ' 4']], dtype='<U5')

In [5]:
np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5", autostrip=True)  # whitespace around each field is stripped

array([['1', 'abc', '2'],
       ['3', 'xxx', '4']], dtype='<U5')

## The comments argument

In [None]:
data = """#
# Skip me !
# Skip me too !
1, 2
3, 4
5, 6 #This is the third line of the data
7, 8
# And here comes the last line
9, 0
"""
np.genfromtxt(StringIO(data), comments="#", delimiter=",")  # everything after a # on a line is ignored
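As a quick check of how comments behaves, here is a shortened, made-up version of the same kind of input. Lines that contain only a comment should disappear entirely, while a trailing comment on a data line should only remove the part after the marker:

```python
import numpy as np
from io import StringIO

data = """#
# Skip me !
1, 2
3, 4 # trailing comment on a data line
# another full-line comment
5, 6
"""
# Text after "#" is dropped; lines left empty are skipped.
arr = np.genfromtxt(StringIO(data), comments="#", delimiter=",")
print(arr)
```

Only the three data lines should survive, so the result has shape (3, 2).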

# Skipping lines and choosing columns

## The skip_header and skip_footer arguments

In [2]:
import numpy as np
from io import StringIO
data = "\n".join(str(i) for i in range(10))
a = np.genfromtxt(StringIO(data))

b = np.genfromtxt(StringIO(data),
                  skip_header=3, skip_footer=5)  # skip the first 3 lines and the last 5 lines
print(a)
print(b)

[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[3. 4.]


## The usecols argument

In [3]:
data = "1 2 3\n4 5 6"
np.genfromtxt(StringIO(data), usecols=(0, -1))  # first and last columns

array([[1., 3.],
       [4., 6.]])
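If the columns have names, usecols can also select them by name instead of by index. A small sketch, assuming the illustrative names a, b and c for the three columns:

```python
import numpy as np
from io import StringIO

data = "1 2 3\n4 5 6"
# Name the three columns, then keep only "a" and "c".
arr = np.genfromtxt(StringIO(data), names="a, b, c", usecols=("a", "c"))
print(arr.dtype.names)
print(arr)
```

The result is a structured array that contains only the selected fields.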

## The names argument

In [5]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, dtype=[(_, int) for _ in "abc"])

array([(1, 2, 3), (4, 5, 6)],
      dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4')])

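The dtype argument defines the structure of the array. Here it is built with a list comprehension: iterating over the string "abc" yields the characters a, b and c, and each character (_) produces a tuple (_, int) that gives a field name and its data type. The resulting dtype is [('a', int), ('b', int), ('c', int)], so the array has three fields, a, b and c, all of type int.

genfromtxt then reads the contents of data and converts them into a structured array according to this dtype. Each row is parsed and its values are assigned to the fields in order: in the first row, the numbers 1, 2 and 3 go to the fields a, b and c. In other words, every column is given a name (a, b, c).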
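Once the columns carry names, a whole column can be pulled out by field name and a single record by row index. A minimal sketch reusing the same input (the variable names col_a and first_row are only for illustration):

```python
import numpy as np
from io import StringIO

data = StringIO("1 2 3\n 4 5 6")
arr = np.genfromtxt(data, dtype=[(name, int) for name in "abc"])

col_a = arr["a"]    # every value of field "a", one per row
first_row = arr[0]  # the first record, with fields a, b and c
print(col_a)
print(first_row)
```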

In [10]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, names="A, B, cD")

array([(1., 2., 3.), (4., 5., 6.)],
      dtype=[('A', '<f8'), ('B', '<f8'), ('cD', '<f8')])

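When the names keyword is given as a string, each column is assigned a name automatically; the names are separated by commas.
Every row of the input must contain the same number of fields.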
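To see what happens when a row has too few fields, the sketch below feeds genfromtxt a deliberately ragged, made-up input. With the default settings this is expected to raise a ValueError that reports the offending line:

```python
import numpy as np
from io import StringIO

ragged = "1 2 3\n4 5"  # the second line has only two fields
try:
    np.genfromtxt(StringIO(ragged), names="A, B, C")
    raised = False
except ValueError as exc:
    raised = True
    print("genfromtxt rejected the input:", exc)
```

genfromtxt also accepts invalid_raise=False, which turns the error into a warning and skips the inconsistent lines instead.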

We may sometimes need to define the column names from the data itself. In that case, we must use the names keyword with a value of True. The names will then be read from the first line (after the skip_header ones), even if the line is commented out:

In other words, to use the header line of the data itself as the field names, pass names=True.

In [13]:
data = StringIO("So it goes\n#a b c\n1 2 3\n 4 5 6")
np.genfromtxt(data, skip_header=1, names=True)  # the first line is skipped; the names are read from the second line even though it is commented out

array([(1., 2., 3.), (4., 5., 6.)],
      dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])

In [14]:
data = StringIO("1 2 3\n 4 5 6")
ndtype=[('a',int), ('b', float), ('c', int)]
names = ["A", "B", "C"]
np.genfromtxt(data, names=names, dtype=ndtype)
# the names keyword overrides the field names given in dtype

array([(1, 2., 3), (4, 5., 6)],
      dtype=[('A', '<i4'), ('B', '<f8'), ('C', '<i4')])

## The defaultfmt argument

In [15]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, dtype=(int, float, int))  # by default the fields are named f0, f1, f2, ...

array([(1, 2., 3), (4, 5., 6)],
      dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<i4')])

In [17]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, dtype=(int, float, int), names="a,b")  # names may cover only some fields; the rest get default names

array([(1, 2., 3), (4, 5., 6)],
      dtype=[('a', '<i4'), ('b', '<f8'), ('f0', '<i4')])
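The pattern used for these automatic names can be changed with defaultfmt, a format string that receives the field index. A short sketch, where the pattern "var_%02i" is an arbitrary choice for illustration:

```python
import numpy as np
from io import StringIO

data = StringIO("1 2 3\n 4 5 6")
# Unnamed fields are labelled var_00, var_01, ... instead of f0, f1, ...
arr = np.genfromtxt(data, dtype=(int, float, int), defaultfmt="var_%02i")
print(arr.dtype.names)
```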

# Tweaking the conversion
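## The converters argument

Usually, a dtype is enough to define how the strings are converted. When more control is needed, the converters argument maps a column (by index or by name) to a conversion function. For example, we may want to make sure that a date in the format YYYY/MM/DD is converted to a datetime object, or that a string like xx% is properly converted to a float between 0 and 1.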

In [10]:
import numpy as np
from io import StringIO
convertfunc = lambda x: float(x.strip("%"))/100.

data = "1, 2.3%, 45.\n6, 78.9%, 0"
names = ("i", "p", "n")
# General case .....
np.genfromtxt(StringIO(data), delimiter=",", names=names)

array([(1., nan, 45.), (6., nan,  0.)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

In [11]:
# Converted case ...
np.genfromtxt(StringIO(data), delimiter=",", names=names,
              converters={1: convertfunc}, encoding='utf-8')

array([(1., 0.023, 45.), (6., 0.789,  0.)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

In [13]:
# the converter can also be keyed by the column name, here "p"
np.genfromtxt(StringIO(data), delimiter=",", names=names,
              converters={"p": convertfunc}, encoding='utf-8')

array([(1., 0.023, 45.), (6., 0.789,  0.)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])

In [14]:
data = "1, , 3\n 4, 5, 6"
convert = lambda x: float(x.strip() or -999)  # return -999 if the field is empty
np.genfromtxt(StringIO(data), delimiter=",",
              converters={1: convert})

array([[   1., -999.,    3.],
       [   4.,    5.,    6.]])

# Using missing and filling values
## missing_values

![default_value_of_filling_values](default_value_of_filling_values.png)
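When no filling_values are given, genfromtxt falls back to a default that depends on the column type. As a small sketch (the one-line input is made up for illustration), an empty field is expected to become -1 for an integer dtype and nan for a float dtype:

```python
import numpy as np
from io import StringIO

row = "1,,3"  # the middle field is empty

as_int = np.genfromtxt(StringIO(row), delimiter=",", dtype=int)
as_float = np.genfromtxt(StringIO(row), delimiter=",", dtype=float)

print(as_int)    # the empty field is filled with the integer default
print(as_float)  # the empty field is filled with the float default
```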

In [2]:
import numpy as np
from io import StringIO
data = "N/A, 2, 3\n4, ,???"
kwargs = dict(delimiter=",",
              dtype=int,
              names="a,b,c",
              missing_values={0:"N/A", 'b':" ", 2:"???"},  # column -> string that marks a missing value in that column
              filling_values={0:0, 'b':0, 2:-999})  # column -> value substituted for missing data in that column
np.genfromtxt(StringIO(data), **kwargs)

array([(0, 2,    3), (4, 0, -999)],
      dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4')])
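When the same replacement should apply to every column, filling_values can also be a single value rather than a dictionary. A minimal sketch with a made-up input, where empty fields are the only missing entries:

```python
import numpy as np
from io import StringIO

data = "1,,3\n,5,6"  # one empty field in each row
# A scalar filling_values is used for all columns.
arr = np.genfromtxt(StringIO(data), delimiter=",", filling_values=0)
print(arr)
```

The dictionary form shown above remains the better choice when each column needs its own marker or its own fill value.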