# Assigning Subsets of Data

In previous chapters, we learned how to select subsets of data and create new columns with the assignment statement. In this chapter, we assign new values to subsets of data, overwriting the old data in place. Let's begin by reading in our sample DataFrame.

In [1]:
import pandas as pd
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


## Setting new data with `loc`

The `loc` indexer simultaneously selects rows and columns from a DataFrame using labels. We covered this in great detail in previous chapters. Let's review this by selecting the age of Niko, Aaron, and Dean as a Series.

In [2]:
rows = ['Niko', 'Aaron', 'Dean']
df.loc[rows, 'age']

name
Niko      2
Aaron    12
Dean     32
Name: age, dtype: int64

We can assign new values to this selection with a list or an array of the same length, or with a single scalar value. Let's use the assignment statement to assign a list of new ages.

In [3]:
df.loc[rows, 'age'] = [4, 13, 34]

Let's verify that the assignment happened correctly.

In [4]:
df

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,4,70,8.3
Aaron,FL,red,Mango,13,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,34,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2
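
The three kinds of values mentioned above (a list, an array of the same length, and a single scalar) can be compared side by side. The following is a minimal sketch on a small hypothetical DataFrame named `toy`, built inline rather than read from `sample_data.csv`. It also assumes that a sequence of the wrong length is rejected with a `ValueError`, which is the behavior expected from pandas here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the book's sample_data.csv
toy = pd.DataFrame(
    {'age': [30, 2, 12, 4]},
    index=['Jane', 'Niko', 'Aaron', 'Penelope'],
)
rows = ['Niko', 'Aaron']

# A scalar is broadcast to every selected cell
toy.loc[rows, 'age'] = 50
after_scalar = toy['age'].tolist()

# A NumPy array of matching length is assigned element by element
toy.loc[rows, 'age'] = np.array([4, 13])
after_array = toy['age'].tolist()

# A sequence whose length differs from the selection is rejected
try:
    toy.loc[rows, 'age'] = [1, 2, 3]
    length_mismatch_raised = False
except ValueError:
    length_mismatch_raised = True
```

The scalar form is the one used most often in practice, since it does not require knowing how many rows were selected.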


You can even use one of the augmented assignment operators (`+=`, `-=`, etc.) to operate on the selection itself. Here, we increase each of these three ages by 2.

In [5]:
df.loc[rows, 'age'] += 2
df

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,6,70,8.3
Aaron,FL,red,Mango,15,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,36,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2
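
An augmented assignment on a `loc` selection can be read as shorthand for selecting the values, operating on them, and assigning the result back to the same selection. The sketch below illustrates this on a hypothetical DataFrame named `toy`; the data is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'age': [30, 2, 12]}, index=['Jane', 'Niko', 'Aaron'])
rows = ['Niko', 'Aaron']

# Shorthand form: add 2 to the selected ages
toy.loc[rows, 'age'] += 2

# Explicit form of the same operation: select, add, assign back
toy.loc[rows, 'age'] = toy.loc[rows, 'age'] + 2

# Each selected age has now been increased twice, by 4 in total
ages = toy['age'].tolist()
```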


### Set new row values

It's possible to modify values from a single row with `loc`. Here, we change the food and height column values for the row labeled with 'Niko'.

In [6]:
cols = ['food', 'height']
df.loc['Niko', cols] = ['PIZZA', 82]
df

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,PIZZA,6,82,8.3
Aaron,FL,red,Mango,15,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,36,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2
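
The same idea extends to several rows and several columns at once. As a minimal sketch on a hypothetical DataFrame named `toy`, a nested list with one inner list per selected row can be assigned to a rectangular selection.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {'age': [30, 2, 12], 'height': [165, 70, 120]},
    index=['Jane', 'Niko', 'Aaron'],
)
rows = ['Niko', 'Aaron']
cols = ['age', 'height']

# One inner list per selected row, one value per selected column
toy.loc[rows, cols] = [[6, 82], [15, 125]]

new_ages = toy['age'].tolist()
new_heights = toy['height'].tolist()
```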


## Setting new data with `iloc`

Setting new data with the `iloc` indexer works analogously, except that rows and columns are selected by integer location. We begin by setting a single cell of data. The assignment below changes the value in the first row and last column.

In [7]:
df

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,PIZZA,6,82,8.3
Aaron,FL,red,Mango,15,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,36,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


In [8]:
df.iloc[0, -1] = 99.999
df

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,99.999
Niko,TX,green,PIZZA,6,82,8.3
Aaron,FL,red,Mango,15,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,36,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


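The `iloc` indexer can take a single integer, a list of integers, or a slice. Below, we slice the rows from integer location 3 to the end and use a single integer for the column, which assigns new values to the last four rows of the `height` column.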

In [9]:
df

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,99.999
Niko,TX,green,PIZZA,6,82,8.3
Aaron,FL,red,Mango,15,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,36,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


In [10]:
df.iloc[3:, 4] = [155, 205, 195, 165]
df

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,99.999
Niko,TX,green,PIZZA,6,82,8.3
Aaron,FL,red,Mango,15,120,9.0
Penelope,AL,white,Apple,4,155,3.3
Dean,AK,gray,Cheese,36,205,1.8
Christina,TX,black,Melon,33,195,9.5
Cornelia,TX,red,Beans,69,165,2.2
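
Hard-coded column positions such as `4` are easy to get wrong when columns are added or reordered. One possible way to avoid them, sketched here on a hypothetical DataFrame named `toy`, is to look up the integer location from the column name with `Index.get_loc` and pass the result to `iloc`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {'state': ['NY', 'TX', 'FL'], 'height': [165, 70, 120]},
    index=['Jane', 'Niko', 'Aaron'],
)

# Translate the column label into its integer location
height_pos = toy.columns.get_loc('height')

# Slice the rows by position and assign a list of matching length
toy.iloc[1:, height_pos] = [82, 125]
heights = toy['height'].tolist()
```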


## Boolean selection assignment

Typically, you will not be manually setting rows and columns as shown above. A more common procedure is to select a portion of the DataFrame with boolean selection and assign new values to that selection. Let's see some examples with the employee dataset.

In [11]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black


Let's say we want to raise the minimum salary for all police department employees to 60,000. Before making the assignment, let's find the number of police department employees currently making less than this.

In [12]:
filt1 = emp['salary'] < 60_000
filt2 = emp['dept'] == 'Police'
filt = filt1 & filt2
filt.sum()

np.int64(2190)

Use the `loc` indexer to select just the employees that meet the conditions for the above filter and reassign their salary.

In [13]:
emp.loc[filt, 'salary'] = 60_000

Let's use our same filter to select those employees and verify that their salary has now changed.

In [14]:
emp[filt].head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
38,Police,POLICE OFFICER,2017-07-17,60000.0,Male,White
54,Police,POLICE TRAINEE,2018-09-04,60000.0,Male,Asian
59,Police,ADMINISTRATIVE SPECIALIST,1998-12-11,60000.0,Female,Hispanic
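
The filter-then-assign pattern above can be sketched end to end on a small hypothetical DataFrame named `toy` (the rows are made up, not taken from `employee.csv`). The sketch also shows `Series.mask`, an alternative that returns a new Series instead of modifying the DataFrame in place; it is included only for comparison and is not the approach used in this chapter.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset
toy = pd.DataFrame({
    'dept': ['Police', 'Police', 'Other'],
    'salary': [42000.0, 87545.38, 30000.0],
})

# Parentheses are needed because & binds more tightly than == and <
filt = (toy['dept'] == 'Police') & (toy['salary'] < 60_000)

# Series.mask replaces values where the filter is True and returns a new Series
raised = toy['salary'].mask(filt, 60_000.0)
original_untouched = toy['salary'].tolist()

# loc assigns in place, modifying toy itself
toy.loc[filt, 'salary'] = 60_000.0
in_place = toy['salary'].tolist()
```

Only the first row meets both conditions, so it is the only salary that changes in either approach.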


## Improper Assignment

The above assignment is often done improperly, in a way that has no effect. Let's read the dataset in again under a new variable name, `emp2`, and recreate our filters.

In [15]:
emp2 = pd.read_csv('../data/employee.csv')
filt1 = emp2['salary'] < 60000
filt2 = emp2['dept'] == 'Police'
filt = filt1 & filt2
emp2[filt].head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
38,Police,POLICE OFFICER,2017-07-17,56956.64,Male,White
54,Police,POLICE TRAINEE,2018-09-04,42000.0,Male,Asian
59,Police,ADMINISTRATIVE SPECIALIST,1998-12-11,51407.0,Female,Hispanic


The last expression above returns a DataFrame, on which we can use *just the brackets* again to select the `salary` column.

In [16]:
emp2[filt]['salary'].head(3)

38    56956.64
54    42000.00
59    51407.00
Name: salary, dtype: float64

If we try to use an assignment statement with the above syntax, no change takes place and a `SettingWithCopyWarning` is emitted. Let's attempt the assignment and trigger the warning. Note that this is a warning and not an error: the statement runs to completion.

In [17]:
emp2[filt]['salary'] = 60000

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  emp2[filt]['salary'] = 60000


Selecting the employees whose salary we were hoping to change exposes the improper assignment.

In [18]:
emp2[filt].head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
38,Police,POLICE OFFICER,2017-07-17,56956.64,Male,White
54,Police,POLICE TRAINEE,2018-09-04,42000.0,Male,Asian
59,Police,ADMINISTRATIVE SPECIALIST,1998-12-11,51407.0,Female,Hispanic


### What went wrong?

Executing `emp2[filt]['salary']` is called **chained indexing** in the pandas documentation or, in the terminology of this book, **chained selections**. It consists of two consecutive selections: boolean selection with `[filt]`, followed immediately by single-column selection with `['salary']`.

The issue is that the first selection, `emp2[filt]`, creates a completely new DataFrame with its own copy of the data in memory, independent of the original DataFrame. From this new DataFrame, we select the `salary` column and attempt to reassign each value, so what we have actually done is set the salary on this temporary copy. pandas is nice enough to warn us that we might not have accomplished what we intended. In this example, the warning proved correct, and our original DataFrame was not modified. To properly assign to a subset of data using boolean selection along a column, use `loc`, which performs a single selection (one set of brackets) and assigns directly to the original DataFrame rather than to an intermediate copy. Fully understanding the `SettingWithCopyWarning` requires a deeper discussion, which will be presented in a later chapter.
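
The difference between the two approaches can be reproduced on a small hypothetical DataFrame named `toy` (made-up rows, not the employee dataset). In this sketch, the warning from the chained assignment is deliberately silenced with the standard library `warnings` module so that the two results can be compared directly; in normal work the warning should be left visible.

```python
import warnings
import pandas as pd

# Hypothetical toy data standing in for the employee dataset
toy = pd.DataFrame({
    'dept': ['Police', 'Police', 'Other'],
    'salary': [42000.0, 87545.38, 30000.0],
})
filt = (toy['dept'] == 'Police') & (toy['salary'] < 60_000)

# Chained selection: toy[filt] is a new DataFrame, so the assignment
# lands on a temporary copy and toy is left unchanged
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the chained-assignment warning
    toy[filt]['salary'] = 60_000.0
after_chained = toy['salary'].tolist()

# Single selection with loc: the assignment modifies toy itself
toy.loc[filt, 'salary'] = 60_000.0
after_loc = toy['salary'].tolist()
```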

## Exercises

Use the bikes dataset for all of the following exercises.

In [52]:
bikes = pd.read_csv('../data/bikes.csv')

In [23]:
bikes

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy
3,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,19.0,Clark St & Randolph St,31.0,72.0,16.1,mostlycloudy
4,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,19.0,Damen Ave & Pierce Ave,19.0,73.0,17.3,partlycloudy
...,...,...,...,...,...,...,...,...,...,...,...
50084,Male,2017-12-30 13:07:00,2017-12-30 13:34:00,1625,State St & Pearson St,27.0,Clark St & Elm St,27.0,5.0,16.1,partlycloudy
50085,Male,2017-12-30 13:34:00,2017-12-30 13:44:00,585,Halsted St & 35th St (*),16.0,Union Ave & Root St,11.0,5.0,16.1,partlycloudy
50086,Male,2017-12-30 13:34:00,2017-12-30 13:48:00,824,Kingsbury St & Kinzie St,31.0,Halsted St & Blackhawk St (*),20.0,5.0,16.1,partlycloudy
50087,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178,Clinton St & Lake St,23.0,Kingsbury St & Kinzie St,31.0,7.0,11.5,partlycloudy


### Exercise 1

<span style="color:green; font-size:16px">Change the values of `events` to 'HEAT WAVE' for all rides where `temperature` is above 95. Verify this by outputting just the `events` and `temperature` columns that meet the condition.</span>

In [None]:
filt1 = bikes['temperature'] > 95

bikes.loc[filt1, 'events'] = 'HEAT WAVE'

In [55]:
bikes.loc[filt1, ['events', 'temperature']]

Unnamed: 0,events,temperature
395,HEAT WAVE,96.1
396,HEAT WAVE,96.1
397,HEAT WAVE,96.1


### Exercise 2

<span style="color:green; font-size:16px">Increase the trip duration by 50% for all the rides that took place with a wind speed above 40. Output just the trip duration and wind speed columns both before and after the assignment.</span>

In [37]:
bikes.loc[:,['tripduration','wind_speed']].head(5)

Unnamed: 0,tripduration,wind_speed
0,993,12.7
1,623,6.9
2,1040,16.1
3,667,16.1
4,130,17.3


In [38]:
bikes.loc[bikes['wind_speed'] > 40,['tripduration','wind_speed']].head(5)

Unnamed: 0,tripduration,wind_speed
22306,130,42.6
22307,528,42.6
22308,358,42.6
22309,221,41.4


In [56]:
# cast to float first so that fractional durations fit in the column
bikes['tripduration'] = bikes['tripduration'].astype('float64')
bikes.loc[bikes['wind_speed'] > 40, 'tripduration'] *= 1.5

In [57]:
bikes.loc[bikes['wind_speed'] > 40,['tripduration','wind_speed']].head(5)

Unnamed: 0,tripduration,wind_speed
22306,195.0,42.6
22307,792.0,42.6
22308,537.0,42.6
22309,331.5,41.4


### Exercise 3

<span style="color:green; font-size:16px">Change the trip duration for the first two rows to 0.</span>

In [47]:
bikes.iloc[:2,3] = 0

In [48]:
bikes

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,0.0,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,0.0,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040.0,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy
3,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667.0,Carpenter St & Huron St,19.0,Clark St & Randolph St,31.0,72.0,16.1,mostlycloudy
4,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130.0,Damen Ave & Pierce Ave,19.0,Damen Ave & Pierce Ave,19.0,73.0,17.3,partlycloudy
...,...,...,...,...,...,...,...,...,...,...,...
50084,Male,2017-12-30 13:07:00,2017-12-30 13:34:00,1625.0,State St & Pearson St,27.0,Clark St & Elm St,27.0,5.0,16.1,partlycloudy
50085,Male,2017-12-30 13:34:00,2017-12-30 13:44:00,585.0,Halsted St & 35th St (*),16.0,Union Ave & Root St,11.0,5.0,16.1,partlycloudy
50086,Male,2017-12-30 13:34:00,2017-12-30 13:48:00,824.0,Kingsbury St & Kinzie St,31.0,Halsted St & Blackhawk St (*),20.0,5.0,16.1,partlycloudy
50087,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178.0,Clinton St & Lake St,23.0,Kingsbury St & Kinzie St,31.0,7.0,11.5,partlycloudy
