## Pandas Data Wrangling: Avoiding that 'SettingWithCopyWarning'

If you use Python for data analysis, you probably use Pandas for data munging. And if you use Pandas, you’ve probably come across the warning below:

```
SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
```

### A Simple Reproducible Example of The Warning

Here’s where this issue pops up. Say you have some data:

In [1]:
import pandas as pd
df = pd.DataFrame({'Number' : [100,200,300,400,500], 'Letter' : ['a','b','c', 'd', 'e']})
df

| | Letter | Number |
|---|---|---|
| 0 | a | 100 |
| 1 | b | 200 |
| 2 | c | 300 |
| 3 | d | 400 |
| 4 | e | 500 |


…and you want to filter it on some criteria. Pandas makes that easy with Boolean Indexing:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing

In [2]:
criteria = df['Number']>300
criteria

0    False
1    False
2    False
3     True
4     True
Name: Number, dtype: bool

In [3]:
#Keep only rows which correspond to 'Number'>300 ('True' in the 'criteria' vector above)
df[criteria]

| | Letter | Number |
|---|---|---|
| 3 | d | 400 |
| 4 | e | 500 |


This works great, right? Unfortunately not, because once we:
1. Use that filtering code to create a new Pandas DataFrame, and 
2. Assign a new column or change an existing column in that DataFrame

like so…

In [4]:
#Create a new DataFrame based on filtering criteria
df_2 = df[criteria]

#Assign a new column and print output
df_2['new column'] = 'new value'
df_2

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


| | Letter | Number | new column |
|---|---|---|---|
| 3 | d | 400 | new value |
| 4 | e | 500 | new value |


There’s the warning.
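
To see what the warning is guarding against, the sketch below repeats the steps above while recording warnings with Python's standard `warnings` module, then inspects the original `df`. Whether anything is recorded depends on the installed pandas version (newer releases that use Copy-on-Write may not emit this warning at all), so the code prints the count instead of assuming it. The variable names `caught` and `copy_warnings` are illustrative only.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'Number': [100, 200, 300, 400, 500],
                   'Letter': ['a', 'b', 'c', 'd', 'e']})
criteria = df['Number'] > 300

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')  # record every warning, even repeats
    df_2 = df[criteria]
    df_2['new column'] = 'new value'

# Match on the class name so this also runs on pandas versions
# that no longer define the warning class.
copy_warnings = [w for w in caught
                 if w.category.__name__ == 'SettingWithCopyWarning']
print(len(copy_warnings), 'SettingWithCopyWarning(s) recorded')

# The assignment landed in df_2 only; the original DataFrame is untouched.
print('new column' in df_2.columns)  # True
print('new column' in df.columns)    # False
```

The last two lines are the point of the warning: the new column exists in `df_2` but not in `df`, which may or may not be what you intended.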

So what should we have done differently? The warning suggests using “.loc[row_indexer, col_indexer]”. So let’s try subsetting the DataFrame the same way as before, but this time using the df.loc[ ] indexer.

### Re-Creating Our New DataFrame Using .loc[]

In [5]:
df.loc[criteria, :]

| | Letter | Number |
|---|---|---|
| 3 | d | 400 |
| 4 | e | 500 |


In [6]:
#Create New DataFrame Based on Filtering Criteria
df_2 = df.loc[criteria, :]

In [7]:
#Add a New Column to the DataFrame
df_2.loc[:, 'new column'] = 'new value'
df_2

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


| | Letter | Number | new column |
|---|---|---|---|
| 3 | d | 400 | new value |
| 4 | e | 500 | new value |


Two warnings this time!

### OK, So What’s Going On?

Recall that our “criteria” variable is a Pandas Series of Boolean True/False values, indicating whether each row of ‘df’ meets our Number>300 criterion.

In [8]:
criteria

0    False
1    False
2    False
3     True
4     True
Name: Number, dtype: bool

The Pandas Docs say a “common operation is the use of boolean vectors to filter the data”, as we’ve done here. A boolean vector is a valid “row_indexer”, so switching to .loc[] with the same mask changes nothing: either way, Pandas flags the new DataFrame as a possible copy of a slice of ‘df’, and assigning into it triggers the warning. What did work here was creating the new DataFrame with .loc[] and a vector of row numbers (technically, “row labels”, which here are numbers).

http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing

In [9]:
df_2 = df[criteria]

We first grab the indices of that filtered DataFrame using `.index`…

In [10]:
criteria_row_indices = df[criteria].index
criteria_row_indices

Int64Index([3, 4], dtype='int64')

Then we pass those index labels to `.loc[]` to create our new DataFrame:

In [11]:
new_df = df.loc[criteria_row_indices, :]
new_df

| | Letter | Number |
|---|---|---|
| 3 | d | 400 |
| 4 | e | 500 |
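
As a quick sanity check, the sketch below compares the two ways of selecting rows used so far: the boolean filter and the label-based `.loc[]` lookup. It rebuilds the example data so it can run on its own; the names `by_mask`, `row_labels`, and `by_label` are illustrative and do not appear in the cells above.

```python
import pandas as pd

df = pd.DataFrame({'Number': [100, 200, 300, 400, 500],
                   'Letter': ['a', 'b', 'c', 'd', 'e']})
criteria = df['Number'] > 300

by_mask = df[criteria]            # boolean filtering
row_labels = df[criteria].index   # labels of the rows that passed the filter
by_label = df.loc[row_labels, :]  # the same rows, selected by label

print(list(row_labels))           # [3, 4]
print(by_mask.equals(by_label))   # True
```

Both selections hold the same data; the difference that matters here is only in how Pandas treats later assignments into the result.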


Now we can add a new column without triggering the warning:

In [12]:
new_df['New Column'] = 'New Value'
new_df

| | Letter | Number | New Column |
|---|---|---|---|
| 3 | d | 400 | New Value |
| 4 | e | 500 | New Value |
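
Putting the steps together, here is a minimal sketch that wraps the label-based approach in a small helper. The function name `subset_by_label` is hypothetical and not part of Pandas. For comparison, the last lines use an explicit `.copy()` of the filtered rows, which is an alternative not covered in the walkthrough above and is included only as an assumption about what many readers will also want to try.

```python
import pandas as pd


def subset_by_label(frame, mask):
    """Return the rows of `frame` where `mask` is True, selected by row label.

    Hypothetical helper: it only wraps the two steps shown above.
    """
    row_labels = frame[mask].index
    return frame.loc[row_labels, :]


df = pd.DataFrame({'Number': [100, 200, 300, 400, 500],
                   'Letter': ['a', 'b', 'c', 'd', 'e']})

new_df = subset_by_label(df, df['Number'] > 300)
new_df['New Column'] = 'New Value'

print('New Column' in new_df.columns)  # True
print('New Column' in df.columns)      # False

# Alternative (not from the walkthrough): take an explicit copy of the filtered rows.
copied = df[df['Number'] > 300].copy()
copied['New Column'] = 'New Value'
print(copied.equals(new_df))           # True
```

Either way, the new column is added to the subset only, and the original `df` keeps its two columns.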
