Ref: https://www.dataquest.io/blog/settingwithcopywarning/

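## Prologue

While learning R, I heard that R always passes variables by value, which surprised me a little.

Python data science textbooks usually take care to tell the reader when a given package passes by value and when it passes by reference.
They tend to close with an explanation: "When Python is used to process big data, copying values on a large scale is neither practical nor sensible."

Python's approach seemed the cleverer one, until I ran into SettingWithCopyWarning......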

## view_or_copy

The rules about when a view on the data is returned are entirely dependent on NumPy.

Whenever an array of labels or a boolean vector is involved in the indexing operation,
the result will be a copy.

With single label / scalar indexing and slicing, e.g. df.iloc[3:6] or df.loc[:, 'A'],
a view will be returned. (Older documentation writes these examples with df.ix, which was removed in pandas 1.0.)
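
The NumPy behaviour these rules rest on can be checked directly. The sketch below is an added illustration, not part of the original notebook: it uses `np.shares_memory` to test whether an indexing result still points at the base array's buffer. The names `base`, `sliced`, `masked` and `fancy` are arbitrary.

```python
import numpy as np

base = np.arange(30).reshape((6, 5))

# Basic slicing returns a view that shares memory with the base array.
sliced = base[3:6]
# Boolean-mask indexing and integer-array (fancy) indexing return copies.
masked = base[base[:, 4] >= 20]
fancy = base[[0, 2, 4]]

print(np.shares_memory(base, sliced))  # True
print(np.shares_memory(base, masked))  # False
print(np.shares_memory(base, fancy))   # False
```

This only demonstrates plain NumPy arrays; how pandas maps its own indexers onto these cases can vary between versions.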

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [7]:
df = pd.DataFrame(np.arange(30).reshape((6, 5)), index=np.arange(6), columns=list("abcde"))
df

    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24
5  25  26  27  28  29


- Assignment 

Operations that set the value of something, for example data = pd.read_csv('xbox-3-day-auctions.csv'). Often referred to as a set.

- Access 

Operations that return the value of something, such as the examples of indexing and chaining below. Often referred to as a get.

- Indexing 

Any assignment or access method that references a subset of the data; for example data[1:5].

Indexing methods include attribute access (e.g. data.bidderrate), .loc[], .iloc[], and the now-removed .ix[].

- Chaining 

The use of more than one indexing operation back-to-back; for example data[1:5][1:3].

- Chained assignment

An assignment (set) made through chained indexing; for example df[df.e >= 20]['d'] = 100.
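
The terms above can be lined up against concrete statements. This is a small added sketch on a made-up frame; the name `data` stands in for the auction data mentioned in the definitions, and the values are arbitrary.

```python
import numpy as np
import pandas as pd

# Assignment (set): bind a new DataFrame to the name `data`.
data = pd.DataFrame(np.arange(30).reshape((6, 5)), columns=list("abcde"))

# Access (get): return a column by label.
col_a = data["a"]

# Indexing: reference a subset, here the rows at positions 1 to 4.
rows = data[1:5]

# Chaining: two indexing operations back-to-back.
chained = data[1:5][1:3]

print(list(rows.index))     # [1, 2, 3, 4]
print(list(chained.index))  # [2, 3]
```

The second slice in the chain is applied to the result of the first, which is why it selects rows 2 and 3 of the original frame.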


In [9]:
# Goal: where column e is at least 20, set column d to 100.

# First, look at the rows where column e is at least 20.
df[df.e >= 20]
# This is an access method (get operation)
# that returns a new DataFrame containing the desired rows.

    a   b   c   d   e
4  20  21  22  23  24
5  25  26  27  28  29


In [12]:
df[df.e >= 20]['d'] = 100
df
# We get a SettingWithCopyWarning,
# and column d of df has not changed.
# We are not operating on the original DataFrame at all.

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24
5  25  26  27  28  29


In [15]:
# The solution is simple: combine the chained operations into a single operation using 
# loc so that pandas can ensure the original DataFrame is set.
df.loc[df.e >= 20,'d'] = 100
df

    a   b   c    d   e
0   0   1   2    3   4
1   5   6   7    8   9
2  10  11  12   13  14
3  15  16  17   18  19
4  20  21  22  100  24
5  25  26  27  100  29
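
The two forms can be compared side by side in one self-contained script. This is an added sketch, not the notebook's own cell. It silences warnings because the exact warning class raised by the chained form depends on the pandas version, and it only inspects the effect on the data. The name `after_chained` is introduced here for illustration.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape((6, 5)), columns=list("abcde"))

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Chained assignment: the set lands on a temporary copy, not on df.
    df[df.e >= 20]["d"] = 100
after_chained = df["d"].tolist()
print(after_chained)      # [3, 8, 13, 18, 23, 28]

# Single .loc operation: the set lands on df itself.
df.loc[df.e >= 20, "d"] = 100
print(df["d"].tolist())   # [3, 8, 13, 18, 100, 100]
```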


### Hidden chaining
Chained indexing that is not obvious at first glance.

In [23]:
# Where column e is at least 20, column c should become 200.
df2 = df.loc[df.e >= 20]
# df2 was created as the output of a get operation,
# it might be a copy of the original DataFrame or it might not be 
print(df2)
print("-------------------------")
df2.loc[4,"c"] = 200
# Indexing into df2 here already amounts to chained indexing.

    a   b   c    d   e
4  20  21  22  100  24
5  25  26  27  100  29
-------------------------


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [27]:
# Suppose that here we want df2 to be a copy rather than a view.

df2 = df.loc[df.e >= 20].copy()
df2.loc[4,"c"] = 200
print(df2.loc[4,"c"])
print(df.loc[4,"c"])

200
22


If you want to change the original, use a single assignment operation.

If you want a copy, make sure you force pandas to do just that.

There are two possible ways to access a subset of a DataFrame. pandas could either
- create a reference to the original data in memory (a view) or 
- copy the subset into a new, smaller DataFrame (a copy). 

A view is a way of looking at a particular portion of the original data,
whereas a copy is a clone of that data to a new location in memory. 

Modifying a view will modify the original variable but modifying a copy will not.
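
The difference is easiest to see on a plain NumPy array, where slicing is known to return a view. This is an added sketch with arbitrary names (`original`, `view`, `copy`), standing in for the pandas case rather than reproducing it.

```python
import numpy as np

original = np.arange(6)

view = original[:3]          # basic slice: a view onto original
copy = original[:3].copy()   # explicit copy: separate memory

copy[0] = -1                 # does not touch original
print(original.tolist())     # [0, 1, 2, 3, 4, 5]

view[0] = 99                 # writes through to original
print(original.tolist())     # [99, 1, 2, 3, 4, 5]
```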

Whether a ‘get’ operation in pandas returns a view or a copy is not guaranteed.
Either could be returned when you index a pandas data structure,
which means get operations on a DataFrame return a new DataFrame that can contain either:
- A copy of data from the original object.
- A reference to the original object’s data without making a copy.

In [29]:
df1 = pd.DataFrame(np.arange(6).reshape((3,2)), columns=list('AB'))
print(df1)
df2 = df1.loc[:1]
print(df2)

   A  B
0  0  1
1  2  3
2  4  5
   A  B
0  0  1
1  2  3
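
One way to probe whether a result like df2 shares data with df1 is to compare the underlying NumPy buffers. This is an added sketch under assumptions: it relies on `Series.to_numpy()` exposing the column's buffer without copying for this integer data, and the first result is deliberately left unstated because it depends on the pandas version and settings. The names `shares` and `df3` are introduced here.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape((3, 2)), columns=list("AB"))
df2 = df1.loc[:1]

# True would mean df2's column "A" is backed by the same buffer as df1's.
shares = bool(np.shares_memory(df1["A"].to_numpy(), df2["A"].to_numpy()))
print(shares)  # depends on the pandas version and settings

# Forcing a copy breaks any link to df1's data.
df3 = df1.loc[:1].copy()
print(bool(np.shares_memory(df1["A"].to_numpy(), df3["A"].to_numpy())))  # False
```

Because the first answer is not something to rely on, the practical rule stays the same: use a single .loc assignment to change the original, and call .copy() when an independent frame is wanted.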
