# 0. Before you start

How to learn numpy and pandas.

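First, be clear about one thing: mastering every feature before putting anything into practice is extremely inefficient.
A more efficient approach is to learn the most basic concepts first, get a sense of which features numpy and pandas offer, and then, when a real need comes up in practice, consult the documentation or a search engine to learn how to write the specific code.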

# 1. numpy

Official documentation:

https://numpy.org/doc/stable/user

In [56]:
import numpy as np

## 1.1 Introduction to numpy

NumPy, short for Numerical Python, is introduced in the official documentation as follows:

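NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.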

### Fast!

In [218]:
ls1 = [i for i in range(1000000)]
ls2 = [i for i in range(1000000)]

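# Note: the arrays hold 10,000,000 elements, ten times as many as the lists above,
# so numpy is doing ten times the work in the timings below
arr1 = np.arange(10000000)
arr2 = np.arange(10000000)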

In [219]:
%%time
ls3 = [ls1[i] + ls2[i] for i in range(len(ls1))]

CPU times: total: 141 ms
Wall time: 220 ms


In [220]:
%%time
arr3 = arr1+arr2

CPU times: total: 15.6 ms
Wall time: 23 ms
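
For a like-for-like comparison, the following sketch times both approaches on inputs of the same length with the standard-library `timeit` module. The length `n` and the repeat count are arbitrary choices made for illustration, and timings vary by machine, so none are quoted here.

```python
import timeit

import numpy as np

n = 1_000_000  # same length for the lists and the arrays (arbitrary choice)
ls1 = list(range(n))
ls2 = list(range(n))
arr1 = np.arange(n)
arr2 = np.arange(n)

# Element-wise addition with a Python-level loop over two lists
t_list = timeit.timeit(lambda: [x + y for x, y in zip(ls1, ls2)], number=3)

# Element-wise addition with a single vectorized numpy operation
t_arr = timeit.timeit(lambda: arr1 + arr2, number=3)

print(f"list: {t_list:.4f} s, ndarray: {t_arr:.4f} s")

# Both approaches compute the same values
same = (arr1 + arr2).tolist() == [x + y for x, y in zip(ls1, ls2)]
print(same)
```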


### Convenient!

In [221]:
# Replace each element with its natural logarithm, using a list and a numpy array respectively
ls = [i for i in range(1,11)]
arr = np.array(ls)

In [222]:
import math
for i in range(10):
    ls[i] = math.log(ls[i])
ls

[0.0,
 0.6931471805599453,
 1.0986122886681098,
 1.3862943611198906,
 1.6094379124341003,
 1.791759469228055,
 1.9459101490553132,
 2.0794415416798357,
 2.1972245773362196,
 2.302585092994046]

In [223]:
np.log(arr)

array([0.        , 0.69314718, 1.09861229, 1.38629436, 1.60943791,
       1.79175947, 1.94591015, 2.07944154, 2.19722458, 2.30258509])

Vectorization is implemented in precompiled C code, so no explicit loops or indexing are needed.


## 1.2 ndarray and basic usage

NumPy's main object is the ndarray. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers.

### Attributes and methods

In [67]:
arr = np.array([[ 1., 0., 0.],
 [ 0., 1., 2.]]
)

In [68]:
# ndarray.ndim - the number of axes (dimensions) of the array
print(arr.ndim)

2


In [69]:
# ndarray.shape - a tuple of integers giving the size of the array along each dimension
print(arr.shape)

(2, 3)


In [70]:
# ndarray.size - the total number of elements in the array
print(arr.size)

6


In [71]:
# ndarray.dtype - the type of the elements in the array
print(arr.dtype)

float64


In [72]:
# ndarray.itemsize - the size in bytes of each element
print(arr.itemsize)

8


In [79]:
# Transpose
arr.T

array([[1., 0.],
       [0., 1.],
       [0., 2.]])

In [82]:
# Change the shape
arr.reshape(6, 1)

array([[1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [2.]])

In [83]:
# Convert a multidimensional array into a one-dimensional array
arr.flatten()

array([1., 0., 0., 0., 1., 2.])
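
One detail worth knowing about these two methods: `flatten` always returns a copy, while `reshape` returns a view of the same data whenever the memory layout allows it. The sketch below illustrates the difference on the same array; the values written (`99.0` and `42.0`) are arbitrary.

```python
import numpy as np

arr = np.array([[1., 0., 0.],
                [0., 1., 2.]])

flat = arr.flatten()        # always a copy
reshaped = arr.reshape(-1)  # -1 lets numpy infer the length; a view for this contiguous array

flat[0] = 99.0       # changes only the copy
reshaped[1] = 42.0   # writes through to arr

print(arr[0, 0], arr[0, 1])
print(np.shares_memory(arr, flat), np.shares_memory(arr, reshaped))
```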

### Creating arrays

In [84]:
# Use np.array to convert sequence types such as lists and tuples into an array
np.array([(1.5,2,3), (4,5,6)])

array([[1.5, 2. , 3. ],
       [4. , 5. , 6. ]])

In [86]:
# Create arrays with the zeros and ones functions (a notebook cell displays only the last expression)
np.zeros((3, 4))
np.ones((2, 3))

array([[1., 1., 1.],
       [1., 1., 1.]])

In [88]:
# Create an array with the arange function
np.arange(2, 9, 2)

array([2, 4, 6, 8])

In [89]:
# The linspace and logspace functions
np.linspace(0, 1, 11)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

In [90]:
np.logspace(0, 2, 5)

array([  1.        ,   3.16227766,  10.        ,  31.6227766 ,
       100.        ])
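
The two functions are closely related: with the default base of 10, `np.logspace(start, stop, num)` yields 10 raised to the points that `np.linspace(start, stop, num)` produces. A small check of this, as a sketch:

```python
import numpy as np

exponents = np.linspace(0, 2, 5)  # evenly spaced exponents from 0 to 2
powers = np.logspace(0, 2, 5)     # 10**0 through 10**2, evenly spaced on a log scale

same = np.allclose(powers, 10 ** exponents)
print(same)
```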

### Slicing and indexing

Retrieve part of an array or a single element.

In [104]:
# Slicing a one-dimensional array works much like slicing a list
a = np.array([1, 2, 3, 4, 5])
print(a[1:3])

[2 3]


In [103]:
# Slicing a multidimensional array
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(a[0:2, 1:3])

[[2 3]
 [5 6]]


In [107]:
# Indexing a multidimensional array
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(a[1, 2])

6


In [110]:
# Integer array indexing
a = np.array([1, 2, 3, 4, 5])
print(a[[0, 2, 4]])

[1 3 5]


In [111]:
# One index list per axis: the lists are paired up, selecting elements (0, 1) and (2, 2)
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(a[[0, 2], [1, 2]])

[2 9]


In [112]:
# Boolean array indexing: keep the elements where the mask is True
a = np.array([1, 2, 3, 4, 5])
b = np.array([True, False, True, False, True])
print(a[b])

[1 3 5]


In [113]:
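# One boolean array per axis: each is converted to the positions of its True values,
# so this again selects elements (0, 1) and (2, 2)
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])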
b = np.array([True, False, True])
c = np.array([False, True, True])
print(a[b, c])

[2 9]
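
Because paired index arrays select individual elements rather than a rectangular block, it can help to compare them with `np.ix_`, which crosses the row and column selections, and with a boolean mask built from a condition. A minimal sketch using the same array:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Paired index arrays pick the elements at (0, 1) and (2, 2)
paired = a[[0, 2], [1, 2]]

# np.ix_ crosses rows 0 and 2 with columns 1 and 2, giving a 2x2 block
block = a[np.ix_([0, 2], [1, 2])]

# A mask built from a condition selects the matching elements as a 1-D array
large = a[a > 5]

print(paired)
print(block)
print(large)
```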


## Broadcasting

In [114]:
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([1, 2, 3])
c = a + b
print(c)

[[2 4 6]
 [5 7 9]]


In [115]:
a = np.array([1, 2, 3])
b = 2
c = a + b
print(c)

[3 4 5]
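
Both examples follow NumPy's broadcasting rule: shapes are compared starting from the trailing dimension, and two dimensions are compatible when they are equal or one of them is 1. The sketch below shows a column combined with a row, and a case where the shapes are incompatible; the array values are arbitrary.

```python
import numpy as np

col = np.array([[0], [10], [20]])  # shape (3, 1)
row = np.array([1, 2, 3])          # shape (3,), treated as (1, 3)

grid = col + row                   # both operands are stretched to shape (3, 3)
print(grid)

# Shapes (2, 3) and (2,) do not match on the trailing dimension (3 vs 2),
# so numpy raises a ValueError instead of broadcasting
try:
    np.ones((2, 3)) + np.ones(2)
    failed = False
except ValueError:
    failed = True
print(failed)
```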


## Statistical functions

1. np.mean()

2. np.median()

3. np.std()

4. np.var()

5. np.min()

6. np.max()

7. np.sum()

8. np.prod()

9. np.percentile()

10. np.any()

11. np.all()
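
A short sketch of how a few of these functions are called. Most of them accept an `axis` argument: with the default `axis=None` they reduce over all elements, while `axis=0` and `axis=1` reduce a 2-D array down the rows and across the columns respectively. The sample data is arbitrary.

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(np.mean(data))            # mean of all elements
print(np.sum(data, axis=0))     # one sum per column
print(np.max(data, axis=1))     # one maximum per row
print(np.percentile(data, 50))  # the 50th percentile equals the median
print(np.any(data > 5), np.all(data > 0))
```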

# 2. pandas

In [117]:
import pandas as pd

Official documentation:

https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials

## 2.1 Introduction to pandas

In [119]:
# Example from the official documentation
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
df

Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


## 2.2 Series and basic usage

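A Series is a one-dimensional array-like structure made up of two parts: an index and values. The index is an array of labels used to identify the data. The values can be of any data type, such as integers, floats, or strings. A Series supports operations similar to those of a NumPy array as well as operations similar to those of a Python dict.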

In [121]:
# A Series can be created from a Python list, a NumPy array, or a dict
s1 = pd.Series([1, 2, 3, 4, 5])
print(s1)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [131]:
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'],name = 'S')
print(s)

a    1
b    2
c    3
d    4
e    5
Name: S, dtype: int64


In [132]:
# Index
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [135]:
# Data type
s.dtype

dtype('int64')

In [134]:
s.name

'S'
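
The array-like and dict-like behaviour described at the start of this section can be sketched as follows. The labels and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index
s_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a NumPy array with an explicit index
s_arr = pd.Series(np.array([10, 20, 30]), index=["a", "b", "c"])

# Array-like: vectorized arithmetic and boolean filtering
doubled = s_dict * 2
filtered = s_dict[s_dict > 1]

# Dict-like: access by label and membership test on the index
value = s_dict["b"]
has_c = "c" in s_dict

# Arithmetic between two Series aligns on index labels
total = s_dict + s_arr
print(total)
```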

## 2.3 DataFrame and basic usage

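A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean). It has both a row index and a column index, and can be thought of as a dict of Series that share a single index.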

In [125]:
# A DataFrame can be created from a list, an ndarray, a dict, a Series, or by reading a file directly
# Example from the official documentation
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
df

Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


### Inspecting data

In [137]:
df = pd.DataFrame(
np.random.randn(100,100)
)

In [138]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-0.211341,1.266878,-0.519553,-0.15774,-1.048931,-0.969402,1.496221,1.455252,0.903164,-0.158307,...,1.326783,1.931043,-0.848628,-0.170511,0.112376,-0.00612,-0.67363,-1.272305,-2.535614,-1.47391
1,1.146732,1.64219,-2.02314,-0.530335,0.713935,0.081807,-0.928176,1.3419,0.815401,1.159356,...,-1.366692,0.380955,-0.243676,-0.742537,-0.24193,0.974334,-0.789168,1.435104,-0.566427,-0.434579
2,-0.677553,0.27308,0.09343,-1.630971,1.251458,-1.285941,-1.113786,1.042323,-0.663038,0.58274,...,-0.341559,0.573504,0.448969,-0.793272,1.301704,2.702121,1.011581,1.272719,0.766406,1.450022
3,-0.653425,0.779035,1.764546,-2.54889,-1.900348,-0.089847,-2.43074,0.3568,-1.210981,-0.020634,...,-0.999337,1.258833,2.248874,-0.835083,-0.420628,1.935721,0.86002,0.139981,0.065963,1.578092
4,0.293547,-0.145902,0.414384,0.741449,-0.682276,0.691347,0.557779,0.839143,-1.23513,0.896661,...,1.313399,-0.293,0.86073,-1.808694,1.112225,-0.431822,-0.118751,1.444311,-0.743363,-0.502585


In [139]:
df.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
95,-0.126195,0.628935,-2.02063,0.46968,-2.006142,0.556262,0.013624,-0.282477,-0.283504,-0.424563,...,-0.260222,-1.890613,-0.301673,-0.863274,0.145628,0.235904,-2.119966,2.474978,-0.838864,0.297548
96,1.146325,-1.407622,2.225578,-0.153179,0.257631,0.531647,0.673955,1.014788,-0.929241,-0.519408,...,-0.071461,-0.564621,-0.011323,0.584443,0.843134,1.584683,0.801846,-0.356269,-1.344782,-0.671292
97,0.804172,0.313313,-1.335136,-0.722656,0.781495,0.402835,-0.445196,0.948714,0.881059,-0.731188,...,1.553129,0.146049,0.262724,0.250462,-0.249268,2.069833,-0.820261,-0.284279,0.374985,-0.948242
98,-0.887304,-0.715361,-0.658254,0.477384,1.066583,-0.752495,-0.301363,2.160708,-0.767394,-0.752097,...,-1.161535,-1.170235,0.202089,-0.498331,-0.456414,0.297166,0.453974,-0.524772,-1.432274,0.026494
99,1.478664,-1.370092,1.372437,0.063392,1.095827,1.020567,-0.41231,-0.264136,0.877478,1.204812,...,-0.921508,-1.824297,0.949535,-0.240234,-0.472351,-0.921514,0.914621,0.092921,-0.274738,0.097578


In [140]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 100 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       100 non-null    float64
 1   1       100 non-null    float64
 2   2       100 non-null    float64
 3   3       100 non-null    float64
 4   4       100 non-null    float64
 5   5       100 non-null    float64
 6   6       100 non-null    float64
 7   7       100 non-null    float64
 8   8       100 non-null    float64
 9   9       100 non-null    float64
 10  10      100 non-null    float64
 11  11      100 non-null    float64
 12  12      100 non-null    float64
 13  13      100 non-null    float64
 14  14      100 non-null    float64
 15  15      100 non-null    float64
 16  16      100 non-null    float64
 17  17      100 non-null    float64
 18  18      100 non-null    float64
 19  19      100 non-null    float64
 20  20      100 non-null    float64
 21  21      100 non-null    float64
 22  22

In [141]:
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
count,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
mean,0.002047,-0.071748,-0.122162,0.094967,0.041669,-0.017967,-0.042264,0.13595,-0.069067,-0.062455,...,0.031082,0.028692,0.00098,-0.00299,0.186306,0.109799,0.008113,0.133903,0.021077,0.024363
std,1.138545,0.988379,0.975479,0.900838,0.996905,0.979857,0.989307,0.934369,0.958098,1.039842,...,0.966443,0.943793,0.970567,1.084303,0.968691,1.048866,0.917225,1.00684,0.986295,0.956048
min,-2.509886,-2.393475,-2.265902,-2.54889,-2.211559,-2.098111,-2.43074,-2.745156,-2.460143,-2.899123,...,-2.252404,-2.543039,-3.340014,-2.542408,-1.954963,-2.025566,-2.119966,-1.966888,-2.535614,-2.227089
25%,-0.740638,-0.703169,-0.78617,-0.524524,-0.580263,-0.717926,-0.650665,-0.374661,-0.683249,-0.72669,...,-0.502893,-0.549929,-0.618548,-0.815204,-0.455542,-0.628265,-0.618051,-0.48449,-0.642959,-0.573856
50%,-0.065399,-0.039723,-0.054513,0.057971,-0.052958,-0.028342,-0.101071,0.262808,-0.148785,-0.117928,...,0.059374,0.016661,0.067891,-0.120216,0.129002,0.157743,0.033025,0.127729,0.030734,-0.03741
75%,0.814675,0.513253,0.485417,0.728426,0.723186,0.464728,0.602233,0.745486,0.539971,0.702384,...,0.599677,0.6867,0.672813,0.840238,0.845926,0.76168,0.523708,0.771234,0.881585,0.728125
max,2.975575,2.309191,2.225578,1.888831,3.417377,3.393114,2.327243,2.29926,2.866724,2.845365,...,2.417018,2.260812,2.248874,2.861471,2.883765,2.972538,2.53774,4.001221,1.888895,2.256035


In [143]:
df.shape # shape as (rows, columns)
df.index # row index
df.columns # column index (variable names)
df.values # the data as a 2-D ndarray

array([[-0.21134097,  1.26687787, -0.51955316, ..., -1.27230522,
        -2.53561365, -1.47391021],
       [ 1.14673157,  1.64218951, -2.02314023, ...,  1.4351042 ,
        -0.56642651, -0.43457862],
       [-0.67755257,  0.2730802 ,  0.09342978, ...,  1.27271929,
         0.76640613,  1.45002184],
       ...,
       [ 0.80417248,  0.31331264, -1.33513559, ..., -0.28427853,
         0.37498466, -0.94824214],
       [-0.88730368, -0.71536056, -0.65825438, ..., -0.52477212,
        -1.43227427,  0.02649421],
       [ 1.47866355, -1.37009175,  1.37243746, ...,  0.09292102,
        -0.27473834,  0.09757771]])

### Querying, deleting, adding, and modifying

In [145]:
# loc and iloc
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
df

Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


In [159]:
# loc selects by index label; label slices include the end point, so :2 returns rows 0 through 2
df.loc[:2,:'Age']

Unnamed: 0,Name,Age
0,"Braund, Mr. Owen Harris",22
1,"Allen, Mr. William Henry",35
2,"Bonnell, Miss. Elizabeth",58


In [160]:
# iloc selects by integer position; the end point of a slice is excluded
df.iloc[:2,:1]

Unnamed: 0,Name
0,"Braund, Mr. Owen Harris"
1,"Allen, Mr. William Henry"


In [161]:
df.iloc[:2,1:]

Unnamed: 0,Age,Sex
0,22,male
1,35,male
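
The difference between the two accessors is easier to see when the index is not the default 0, 1, 2. The sketch below uses a small frame with made-up string labels (`"x"`, `"y"`, `"z"`) to contrast label-based and position-based selection.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Age": [22, 35, 58], "Sex": ["male", "male", "female"]},
    index=["x", "y", "z"],  # hypothetical string labels
)

by_label = frame.loc["x":"y", "Age"]  # label slice: both "x" and "y" are included
by_position = frame.iloc[0:1, 0]      # position slice: the stop position 1 is excluded
over_30 = frame.loc[frame["Age"] > 30, ["Age", "Sex"]]  # loc also accepts a boolean mask

print(by_label.tolist())
print(by_position.tolist())
print(over_30.index.tolist())
```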


In [164]:
# Plain indexing and fancy indexing
# Plain indexing with a single column name returns a Series
df['Age']

0    22
1    35
2    58
Name: Age, dtype: int64

In [165]:
# Fancy indexing with a list of column names returns a DataFrame
df[['Age']]

Unnamed: 0,Age
0,22
1,35
2,58


In [None]:
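# Parameters of df.drop shown with their default values (for reference only:
# a real call must specify at least one of labels, index, or columns)
df.drop(
    labels=None, # index or column labels to drop
    axis=0,  # 0 drops rows (the default), 1 drops columns
    index=None, # row labels to drop
    columns=None, # column labels to drop
    level=None, # which index level to use (for a MultiIndex)
    inplace=False, # whether to modify the original data in place
    )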
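
A minimal sketch of actual `drop` calls. The frame `demo` and its contents are made up for illustration; by default `drop` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

demo = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Age": [22, 35, 58], "Sex": ["male", "male", "female"]}
)

no_first_row = demo.drop(index=0)          # drop the row labelled 0
no_sex = demo.drop(columns="Sex")          # drop a column by name
same_as_no_sex = demo.drop("Sex", axis=1)  # equivalent call using labels and axis

print(no_first_row)
print(no_sex)
```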

In [174]:
# Modify a column by assigning a new list of values
df["Age"] = [1,2,3]
df

Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",1,male
1,"Allen, Mr. William Henry",2,male
2,"Bonnell, Miss. Elizabeth",3,female


In [178]:
# Add a row by assigning to an index label that does not exist yet
df.loc[3,:] = ["Bonnell, Miss. Elizabeth",100,"female"]
df

Unnamed: 0,Name,Age,Sex
0,"Braund, Mr. Owen Harris",1.0,male
1,"Allen, Mr. William Henry",2.0,male
2,"Bonnell, Miss. Elizabeth",3.0,female
3,"Bonnell, Miss. Elizabeth",100.0,female


In [180]:
# Add a column by assigning to a column name that does not exist yet
df['lwm'] = [4,4,4,4]
df

Unnamed: 0,Name,Age,Sex,lwm
0,"Braund, Mr. Owen Harris",1.0,male,4
1,"Allen, Mr. William Henry",2.0,male,4
2,"Bonnell, Miss. Elizabeth",3.0,female,4
3,"Bonnell, Miss. Elizabeth",100.0,female,4


### Combining data

In [182]:
data1 = df.iloc[:1]
data2 = df.iloc[1:]
data3 = pd.concat([data1, data2], axis=0)

In [184]:
data1

Unnamed: 0,Name,Age,Sex,lwm
0,"Braund, Mr. Owen Harris",1.0,male,4


In [185]:
data2

Unnamed: 0,Name,Age,Sex,lwm
1,"Allen, Mr. William Henry",2.0,male,4
2,"Bonnell, Miss. Elizabeth",3.0,female,4
3,"Bonnell, Miss. Elizabeth",100.0,female,4


In [186]:
data3

Unnamed: 0,Name,Age,Sex,lwm
0,"Braund, Mr. Owen Harris",1.0,male,4
1,"Allen, Mr. William Henry",2.0,male,4
2,"Bonnell, Miss. Elizabeth",3.0,female,4
3,"Bonnell, Miss. Elizabeth",100.0,female,4


In [None]:
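# SQL-style merging, similar to JOIN in SQL.
# pd.merge needs at least two DataFrames, e.g. pd.merge(left, right, on="key"),
# where left, right, and "key" are placeholder names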
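
A minimal sketch of `pd.merge` in use. The two tables and the shared `"key"` column are hypothetical; `how` selects the join type, with `"inner"` as the default.

```python
import pandas as pd

# Hypothetical tables sharing a "key" column
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

inner = pd.merge(left, right, on="key")               # keys present in both tables
outer = pd.merge(left, right, on="key", how="outer")  # keys present in either table

print(inner)
print(outer)
```

In the outer join, rows whose key appears in only one table get NaN in the columns that come from the other table.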

### Data cleaning

In [187]:
# Remove duplicate rows and rows containing NaN
df.drop_duplicates(inplace=True)
df.dropna(axis=0, inplace=True)
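
The effect of these two calls is easier to see on data that actually contains a duplicate and a missing value. The frame `raw` below is made up for illustration. Without `inplace=True`, both methods return a new DataFrame and leave the original untouched.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame(
    {"name": ["x", "x", "y", "z"], "score": [1.0, 1.0, np.nan, 4.0]}
)

deduped = raw.drop_duplicates()   # removes the second, identical "x" row
cleaned = deduped.dropna(axis=0)  # removes the row whose score is NaN

print(cleaned)
print(len(raw))  # the original frame still has all of its rows
```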

### Computation

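Basic aggregation functions such as max, min, mean, and sum work much like their numpy counterparts, but pay attention to the axis parameter: DataFrame methods default to axis=0 (one result per column), whereas the numpy functions default to axis=None (one result for the whole array).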

In [228]:
df1 = pd.DataFrame(
[
    [1,2,3],
    [4,5,6]
])

In [229]:
df1.mean()  # default axis=0: the mean of each column

0    2.5
1    3.5
2    4.5
dtype: float64

In [230]:
df1.mean(axis=1)  # axis=1: the mean of each row

0    2.0
1    5.0
dtype: float64

Using map, apply, and applymap (very important)

In [195]:
# map() applies a custom function to each element of a Series
df['if_old'] =  df['Age'].map(lambda x:"A" if x<3 else "B")
df

Unnamed: 0,Name,Age,Sex,lwm,if_old
0,"Braund, Mr. Owen Harris",1.0,male,4,A
1,"Allen, Mr. William Henry",2.0,male,4,A
2,"Bonnell, Miss. Elizabeth",3.0,female,4,B
3,"Bonnell, Miss. Elizabeth",100.0,female,4,B


In [196]:
# apply() applies a function to each row or each column of a DataFrame

In [205]:
df2 = pd.DataFrame(np.ones((3,3)),columns = [1,2,3])
df2.apply(lambda x:x+x.name)

Unnamed: 0,1,2,3
0,2.0,3.0,4.0
1,2.0,3.0,4.0
2,2.0,3.0,4.0


In [206]:
# applymap() applies a function to every element of a DataFrame
# (deprecated since pandas 2.1 in favor of DataFrame.map)
df2.applymap(lambda x:str(x)+"A")

Unnamed: 0,1,2,3
0,1.0A,1.0A,1.0A
1,1.0A,1.0A,1.0A
2,1.0A,1.0A,1.0A
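
The example above uses `apply` with its default `axis=0`, where the function receives one column at a time. The sketch below contrasts that with `axis=1`, where the function receives one row at a time, alongside `Series.map`. The frame `people`, its column names, and the 175 cm threshold are made up for illustration.

```python
import pandas as pd

people = pd.DataFrame({"height_cm": [170, 180], "weight_kg": [65, 81]})  # hypothetical data

# Series.map: one value in, one value out
size = people["height_cm"].map(lambda h: "tall" if h >= 175 else "short")

# DataFrame.apply with the default axis=0: the function receives each column
col_max = people.apply(lambda col: col.max())

# DataFrame.apply with axis=1: the function receives each row
bmi = people.apply(lambda r: r["weight_kg"] / (r["height_cm"] / 100) ** 2, axis=1)

print(size.tolist())
print(col_max.tolist())
print(bmi.round(1).tolist())
```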


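This introduction covers only the tip of the iceberg: a handful of the most common and basic operations. Proficiency comes from using and learning them continually in practice.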

# Extensions