# Indexing - Selecting subsets of data
**Indexing**, or **selecting subsets of data**, is one of the most confusing aspects of pandas, which is unfortunate because it is something done so frequently. There is so much to selecting subsets of data that I have dedicated a [7-part series](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c) to it.

Reasons why selecting subsets of data is confusing:
* **Indexing** is a confusing term. **Selecting subsets of data** is more descriptive
* Each row and column may be referenced by either their **label** or **integer location**
* This dual reference capability is powerful but confusing
* The documentation uses the term **position** instead of the more descriptive **integer location**. The indexer **`.iloc`** is an abbreviation of integer location. 
* There are six indexers, **`[]`**, **`.iloc`**, **`.loc`**, **`.ix`**, **`.at`**, **`.iat`**, that each do something different
* **`.ix`** was deprecated in version 0.20 in favor of **`.loc`** and **`.iloc`** (and removed entirely in pandas 1.0), but old Stack Overflow answers and tutorials still show it. New questions that use this horrible indexer still get asked each day.
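
To make the label versus integer location distinction concrete, here is a minimal sketch on a made-up three-row DataFrame (the data and variable names are hypothetical, not from the college dataset). It fetches the same value with each of the non-deprecated indexers.

```python
import pandas as pd

# Hypothetical toy data: three people indexed by name.
df = pd.DataFrame(
    {"age": [30, 2, 12], "score": [4.6, 8.3, 9.0]},
    index=["Jane", "Niko", "Aaron"],
)

by_label = df.loc["Niko", "score"]      # label-based: row 'Niko', column 'score'
by_position = df.iloc[1, 1]             # integer location: second row, second column
scalar_label = df.at["Niko", "score"]   # scalar access by label
scalar_position = df.iat[1, 1]          # scalar access by integer location
column = df["score"]                    # [] with a column label selects the whole column

print(by_label, by_position, scalar_label, scalar_position)
print(column.tolist())
```

All four scalar lookups return the same value; only the way the row and column are referenced differs.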

### Getting help on `[]`, `.iloc`, `.loc`
If you struggle with these indexers, I encourage you to read my detailed notebook with many practice exercises. [Selecting subsets of Data in Pandas with Exercises](../Selecting%20Subsets%20of%20Data%20in%20Pandas%20part%201.ipynb)

# `.ix` deprecation
One of the largest changes to the pandas API was the deprecation of the **`.ix`** indexer in version 0.20. The **`.ix`** indexer was versatile but ambiguous and confusing, and personally I was very excited to see it deprecated.

**`.ix`** was able to select subsets of data by both label and integer location. For instance, if we wanted to select the math and verbal SAT scores for the schools at integer locations 5, 505, and 1301, we could do the following:

In [1]:
import pandas as pd
import numpy as np

In [32]:
college = pd.read_csv('../data/college.csv', index_col=0)

In [33]:
# This is deprecated. NEVER DO THIS!!
college.ix[[5, 505, 1301], ['SATMTMID', 'SATVRMID']]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  


INSTNM,SATMTMID,SATVRMID
The University of Alabama,565.0,555.0
Santa Clara University,665.0,635.0
University of Northern Iowa,524.0,549.0


### Alternatives to simultaneous selection by labels and integer location
In the rare event that you need to do simultaneous selection by label and integer location, you can do one of the following:

* Convert the integer locations to labels and use **`.loc`**
* Convert the labels to integer locations and use **`.iloc`**

#### Convert integer locations to labels
To convert integer locations to labels, use the index or columns to select the correct labels.

In [5]:
labels = college.index[[5, 505, 1301]]
labels

Index(['The University of Alabama', 'Santa Clara University',
       'University of Northern Iowa'],
      dtype='object', name='INSTNM')

In [6]:
college.loc[labels, ['SATMTMID', 'SATVRMID']]

INSTNM,SATMTMID,SATVRMID
The University of Alabama,565.0,555.0
Santa Clara University,665.0,635.0
University of Northern Iowa,524.0,549.0


#### Convert labels to integer locations
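Converting labels to integer locations is a bit trickier. The documentation suggests the Index method **`get_loc`** for a single label and **`get_indexer`** for multiple labels. These are only straightforward when the Index contains unique values: **`get_indexer`** requires a unique Index, and **`get_loc`** returns a slice or boolean mask rather than a single integer when the label is duplicated.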

In [7]:
cols = ['SATMTMID', 'SATVRMID']
ilocs = college.columns.get_indexer(cols)
ilocs

array([7, 6])

In [8]:
college.iloc[[5, 505, 1301], ilocs]

INSTNM,SATMTMID,SATVRMID
The University of Alabama,565.0,555.0
Santa Clara University,665.0,635.0
University of Northern Iowa,524.0,549.0


If you were getting a single column:

In [9]:
college.iloc[[5, 505, 1301], college.columns.get_loc('SATMTMID')]

INSTNM
The University of Alabama      565.0
Santa Clara University         665.0
University of Northern Iowa    524.0
Name: SATMTMID, dtype: float64
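
Both conversions can also be tried without the college file. The following is a minimal sketch on a hypothetical three-row stand-in (the school names and values are made up). It also shows two behaviors worth knowing: **`get_indexer`** returns -1 for labels it cannot find, and **`get_loc`** on an Index with duplicates returns something other than a single integer.

```python
import pandas as pd

# Hypothetical toy frame standing in for the college data.
df = pd.DataFrame(
    {"CITY": ["Normal", "Birmingham", "Montgomery"],
     "SATVRMID": [424.0, 570.0, 595.0],
     "SATMTMID": [420.0, 565.0, 590.0]},
    index=pd.Index(["School A", "School B", "School C"], name="INSTNM"),
)

# Integer locations -> labels: index the Index itself, then use .loc.
row_labels = df.index[[0, 2]]
by_label = df.loc[row_labels, ["SATMTMID", "SATVRMID"]]

# Labels -> integer locations: get_indexer for several labels, then use .iloc.
col_ilocs = df.columns.get_indexer(["SATMTMID", "SATVRMID"])
by_position = df.iloc[[0, 2], col_ilocs]

# get_loc handles a single label.
one_col = df.columns.get_loc("SATMTMID")

# get_indexer marks labels it cannot find with -1 instead of raising.
missing = df.columns.get_indexer(["SATMTMID", "NOT_A_COLUMN"])

# On a non-unique Index, get_loc no longer returns a single integer.
dup = pd.Index(["a", "b", "a"])
dup_loc = dup.get_loc("a")  # a boolean mask here, because the duplicates are not adjacent

print(by_label.equals(by_position))
print(col_ilocs, one_col, missing, dup_loc)
```

Because a -1 passed to **`.iloc`** silently selects the last column, it is worth checking the result of **`get_indexer`** before using it.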

# Assigning new values to subsets of data
The **`SettingWithCopy`** warning is a common and important nuisance that pops up for nearly everyone from time to time. It is typically triggered when you select a subset of data and then, in a separate operation, assign new values to that subset. The documentation refers to this as [chained indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#why-does-assignment-fail-when-using-chained-indexing).

The warning arises because pandas cannot determine whether you want the original DataFrame modified or just the subset. Furthermore, sometimes the original DataFrame is modified and sometimes it isn't, which makes it hard to know what is going on. Let's look at some simple examples that trigger the **`SettingWithCopy`** warning but have different effects on the original data.

Make two fresh copies of the college dataset so we don't modify the original.

In [10]:
c1 = college.copy()
c2 = college.copy()

Select a column as a Series and as a DataFrame

In [11]:
city1 = c1['CITY']   
city2 = c2[['CITY']]

In [12]:
city1.head()

INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Alabama State University               Montgomery
Name: CITY, dtype: object

In [13]:
city2.head()

INSTNM,CITY
Alabama A & M University,Normal
University of Alabama at Birmingham,Birmingham
Amridge University,Montgomery
University of Alabama in Huntsville,Huntsville
Alabama State University,Montgomery


Make the assignment that triggers the warning

In [16]:
city1.iloc[2:5] = 'NEW CITY'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


In [17]:
city2.iloc[2:5] = 'NEW CITY'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Confirm that our subset has new values

In [18]:
city1.head()

INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                       NEW CITY
University of Alabama in Huntsville      NEW CITY
Alabama State University                 NEW CITY
Name: CITY, dtype: object

In [19]:
city2.head()

INSTNM,CITY
Alabama A & M University,Normal
University of Alabama at Birmingham,Birmingham
Amridge University,NEW CITY
University of Alabama in Huntsville,NEW CITY
Alabama State University,NEW CITY


Look at original data to see what has changed

In [20]:
c1['CITY'].head()

INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                       NEW CITY
University of Alabama in Huntsville      NEW CITY
Alabama State University                 NEW CITY
Name: CITY, dtype: object

In [21]:
c2['CITY'].head()

INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Alabama State University               Montgomery
Name: CITY, dtype: object

#### Conclusion
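Selecting a single column with **`c1['CITY']`** returned a Series that is a **view** of the original, so the assignment also modified **`c1`**. Selecting with a list, **`c2[['CITY']]`**, returned a DataFrame that is a **copy** of the original, so **`c2`** was left untouched. This is the behavior of the pandas version used for this notebook.

A view is just a pointer to the original data. A copy is a fresh new object not connected to the original.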
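
The view versus copy distinction is easiest to see in plain NumPy, where the rules are simple and stable. The sketch below is only an analogy for the definitions above, not a description of pandas internals: basic slicing of an array returns a view, while indexing with a list returns a copy.

```python
import numpy as np

original = np.arange(6)

view = original[2:5]         # basic slicing returns a view
copy = original[[2, 3, 4]]   # indexing with a list returns a copy

view[:] = -1                 # writes through to `original`
copy[:] = 99                 # only changes the copy

print(original)
print(np.shares_memory(original, view))
print(np.shares_memory(original, copy))
```

Writing to the view changes `original`, while writing to the copy does not, which mirrors what happened to `c1` and `c2`.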

### Double check that we get the same results with `.loc`

In [None]:
idx = ['University of Alabama at Birmingham', 'Amridge University']

In [None]:
city1.loc[idx] = 'NEW CITY 99999'

In [None]:
city2.loc[idx] = 'NEW CITY 99999'

In [None]:
c1['CITY'].head()

In [None]:
c2[['CITY']].head()

#### Conclusion
The indexer used on the view/copy is irrelevant. The only thing that matters is whether the very first selection creates a copy or a view.

## Example: select rows first with `.iloc`
Let's select some rows with **`.iloc`** and then set some values on this subset.

In [34]:
c1 = college.copy()
df1 = c1.iloc[:5]  # df1 is our new subset
df1

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


Make a couple different assignments

In [23]:
df1.iloc[:2] = 99
df1['STABBR'] = 'TX'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [24]:
c1.head()

INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,99,TX,99.0,99.0,99.0,99,99.0,99.0,99.0,99.0,...,99.0,99.0,99.0,99.0,99,99.0,99.0,99.0,99,99.0
University of Alabama at Birmingham,99,TX,99.0,99.0,99.0,99,99.0,99.0,99.0,99.0,...,99.0,99.0,99.0,99.0,99,99.0,99.0,99.0,99,99.0
Amridge University,Montgomery,TX,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,TX,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,TX,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


#### Conclusion
It appears that slicing rows with **`.iloc`** creates a view, since our original data is modified.

### Exercise 1
<span style="color:green; font-size:16px">Use **`.iloc`** to select the first 5 rows and columns. Is a view or copy created?</span>

In [43]:
c1 = college.copy()
df1 = c1.iloc[:5, :5]

df1.iloc[:2] = 99
df1['STABBR'] = 'TX'
print(df1._is_view)

c1.head()

False


INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### Exercise 2
<span style="color:green; font-size:16px">Select a few rows and all the columns with **`.loc`**. Is a view or copy created?</span>

In [44]:
c2 = college.copy()
df2 = c2.loc[['Alabama A & M University', 'University of Alabama at Birmingham'], :]

df2.iloc[:2] = 99
df2['STABBR'] = 'TX'
print(df2._is_view)

c2.head()

False


INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


# Two common scenarios when assigning subsets of data
1. You want to make an assignment to a particular subset of your DataFrame but want to keep doing analysis on the entire DataFrame
2. You want to select a subset of data and store it as its own variable and modify that subset without modifying your original data.

We will cover how to properly handle each scenario without getting the **`SettingWithCopy`** warning.

## Scenario 1: Assign to a particular subset, but keep using the original DataFrame
For scenario 1, you won't be creating a new variable to reference the subset of data. Instead, you will use a single set of brackets to make the assignment. Let's look at a smaller sample dataset to make this clear.

In [45]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


Let's say we want to change the color for Aaron and Dean to **`PURPLE`**. Doing this idiomatically with a single set of brackets would look like this:

In [51]:
# This is the right thing to do
df.loc[['Aaron', 'Dean'], 'color'] = 'PURPLE'
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,PURPLE,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,PURPLE,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


#### Incorrect versions
It's good to look at improper ways to do this so you can spot this in other code. 

In [47]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df['color'][['Aaron', 'Dean']] = 'PURPLE'
df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,PURPLE,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,PURPLE,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


This was an example of **chained indexing**, which you should never do. The first indexing happens with **`df['color']`**. This creates a temporary object, which is then indexed again with **`[['Aaron', 'Dean']]`**.

#### Another incorrect version

**WidjiTip**: Back-to-back brackets like `][` are usually problems waiting to happen.

In [52]:
# NO WARNING, and the assignment is silently lost!
# .loc with a list returns a copy, so 'color' is set on a temporary object
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df.loc[['Aaron', 'Dean']]['color'] = 'PURPLE'
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2
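
The silent failure above is easier to follow when the chained statement is split into its two steps. This is a minimal sketch on a trimmed-down, inline version of the sample data; the names `temp`, `temp_colors`, `before`, and `after` are introduced only for illustration. Depending on the pandas version, step 2 may itself emit a warning.

```python
import pandas as pd

# A trimmed-down, inline version of the sample data.
df = pd.DataFrame(
    {"state": ["NY", "TX", "FL", "AL", "AK"],
     "color": ["blue", "green", "red", "white", "gray"]},
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean"],
)

# Step 1: the first pair of brackets builds a new, temporary DataFrame.
temp = df.loc[["Aaron", "Dean"]]

# Step 2: the second pair of brackets assigns into that temporary object.
temp["color"] = "PURPLE"

# The temporary object changed, the original did not.
temp_colors = temp["color"].tolist()
before = df.loc[["Aaron", "Dean"], "color"].tolist()

# A single set of brackets assigns directly into df.
df.loc[["Aaron", "Dean"], "color"] = "PURPLE"
after = df.loc[["Aaron", "Dean"], "color"].tolist()

print(temp_colors, before, after)
```

In the chained version the temporary object is never given a name, so the change is made and then immediately thrown away.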


### Using boolean indexing
Change the **`score`** of all the people under 30 to 100.

In [53]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)

df.loc[df['age'] < 30, 'score'] = 100
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,100.0
Aaron,FL,red,Mango,12,120,100.0
Penelope,AL,white,Apple,4,80,100.0
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### Exercise 1
<span style="color:green; font-size:16px">Turn to your neighbors. Explain to them why the following did not work. 
Write another incorrect version of it as well as a correct version.</span>

In [50]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)

df[df.age < 30]['score'] = 100
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### Exercise 2
<span style="color:green; font-size:16px">Make the following idiomatic</span>

In [57]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)

df.loc[df['color'].isin(['red', 'green']), 'score'] = 99
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,99.0
Aaron,FL,red,Mango,12,120,99.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,99.0


In [None]:
# your code here

### Exercise 3
<span style="color:green; font-size:16px">Make the following idiomatic</span>

In [58]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)

df[df['state'] == 'TX'][df['age'] > 30][['height', 'score']] = 99



In [59]:
criteria = (df['state'] == 'TX') & (df['age'] > 30)
df.loc[criteria, ['height', 'score']] = 99
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,99,99.0
Cornelia,TX,red,Beans,69,99,99.0


### Exercise 4
<span style="color:green; font-size:16px">Make the following idiomatic</span>

In [None]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)

df.iloc[[6,4,2,3]].iloc[:, :3] = 'CHANGED'

### Summary of Scenario 1:
* Use exactly one set of brackets to make the assignment 
* You know you've made a mistake when you see back-to-back brackets like this: **`][`**
* Separate the row and column selections with a comma within the same set of brackets
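
The points above can be sketched as follows, using part of the sample data rebuilt inline so the snippet runs on its own. The particular selections are arbitrary and chosen only to show the pattern with both **`.loc`** and **`.iloc`**.

```python
import pandas as pd

# Part of the sample data shown earlier, rebuilt inline.
df = pd.DataFrame(
    {"state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
     "age": [30, 2, 12, 4, 32, 33, 69],
     "height": [165, 70, 120, 80, 180, 172, 150],
     "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2]},
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)

# Build the row selection first, combining conditions with & and parentheses.
rows = (df["state"] == "TX") & (df["age"] < 30)

# One set of brackets with .loc: rows, then a comma, then the column label.
df.loc[rows, "score"] = 100

# The same pattern with .iloc: row locations, a comma, then the column location.
df.iloc[:2, df.columns.get_loc("height")] = 0

print(df)
```

Each assignment uses exactly one set of brackets, so pandas writes straight into `df` and there is no temporary object to lose the change.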

## Scenario 2: Modify the subset and keep using it
Scenario 2 exists when you take a subset of data and want to keep working with just that subset. You may not care at all about the original DataFrame, but you probably won't want to change its data. 

In this scenario, you will use the **`copy`** method to create a fresh independent copy of your subset and then make changes to that.

In [None]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

Let's say we are only interested in the **`food`** and **`score`** columns and store those columns as a new variable:

In [None]:
food_score = df[['food', 'score']]

Now, if we wanted to change the scores of the rows with food equal to steak or lamb to 99 we would do the following, which triggers the warning.

In [None]:
criteria = food_score['food'].isin(['Steak', 'Lamb'])
food_score.loc[criteria, 'score'] = 99
food_score

The root cause of this warning is the very first step, where the subset was created without an explicit copy:

```
>>> food_score = df[['food', 'score']]
```

The warning will be triggered for any change to this dataset. For instance, let's add a new column:

In [None]:
food_score['newcol'] = -1
food_score

Let's look at our original DataFrame to see if it has changed

In [None]:
df

It is still intact, but pandas does not know whether you wanted it that way. To divorce this new dataset from its original, use the **`copy`** method:

In [None]:
food_score = df[['food', 'score']].copy()

criteria = food_score['food'].isin(['Steak', 'Lamb'])
food_score.loc[criteria, 'score'] = 99
food_score

In [None]:
food_score['newcol'] = -1
food_score

Pandas now knows that this is independent from any other DataFrame, so no warning appears.
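
Here is the same idea as a self-contained sketch, using three rows of the sample data rebuilt inline. It confirms that changes to the copied subset never reach the original DataFrame.

```python
import pandas as pd

# Three rows of the sample data, rebuilt inline.
df = pd.DataFrame(
    {"food": ["Steak", "Lamb", "Mango"], "score": [4.6, 8.3, 9.0]},
    index=["Jane", "Niko", "Aaron"],
)

# Force an independent copy of the subset before modifying it.
food_score = df[["food", "score"]].copy()

criteria = food_score["food"].isin(["Steak", "Lamb"])
food_score.loc[criteria, "score"] = 99
food_score["newcol"] = -1

print(food_score)
print(df)  # unchanged: original scores, no 'newcol'
```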

### Exercise 5
<span style="color:green; font-size:16px">Select the first three rows and first three columns into a new variable and then change all the values to **`CHANGED`** without getting a **`SettingWithCopy`** warning.</span>

In [60]:
# run this first
df = pd.read_csv('../data/sample_data.csv', index_col=0)

In [61]:
# your code here
data = df.iloc[:3, :3].copy()
data.iloc[:, :] = 'CHANGED'

print(data)
print(df)

         state    color     food
Jane   CHANGED  CHANGED  CHANGED
Niko   CHANGED  CHANGED  CHANGED
Aaron  CHANGED  CHANGED  CHANGED
          state  color    food  age  height  score
Jane         NY   blue   Steak   30     165    4.6
Niko         TX  green    Lamb    2      70    8.3
Aaron        FL    red   Mango   12     120    9.0
Penelope     AL  white   Apple    4      80    3.3
Dean         AK   gray  Cheese   32     180    1.8
Christina    TX  black   Melon   33     172    9.5
Cornelia     TX    red   Beans   69     150    2.2


## No need to memorize
I don't know the rules for when pandas creates a view or a copy. I am always in either scenario 1 or 2, so I either change my original DataFrame directly or force a copy of the subset and continue with that.

# Summary
* Make sure you know how to use **`[]`**, **`.iloc`**, **`.loc`**
* Never use the deprecated **`.ix`**
* If you need to simultaneously select by label and integer location (rare), convert integers to labels with **`df.index[integers]`**, or labels to integers with **`Index.get_loc/get_indexer`**, and then use **`.loc`** or **`.iloc`**
* The **`SettingWithCopy`** warning is triggered when you select a subset of data and then, in a separate operation, try to set a value on this subset.
* To avoid the warning, use either an assignment with a single set of brackets or the **`copy`** method, depending on which scenario you are in
* See this [excellent blog post](https://www.dataquest.io/blog/settingwithcopywarning/) from Benjamin Pryke on everything you ever wanted to know about the **`SettingWithCopy`** warning