# Editing Data in DataFrames 

## Outline
* Creating, replacing and deleting columns
* Transforming columns
* Setting data with `loc[]`



After creating a DataFrame, we will often want to modify it.

In [1]:
import pandas as pd
import numpy as np

original_df = pd.read_csv('data/employee_attrition.csv')

original_df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,3,4,1,6,3,3,2,2,2,2


**Note about the code:**  Throughout the examples, you will see the code sections start with `df = original_df.copy()`.  The rest of the example will then typically work with the `df` variable.  All this does is copy the contents of our "original dataframe" (`original_df`) to a local variable so that the examples don't interfere with each other!

## Adding a new column

Every column of a DataFrame can be read with `.` or `[]` (for example `df.Age` or `df['Age']`). To create a new column, assign to it with `[]`; attribute-style assignment with `.` only updates columns that already exist. A DataFrame can be thought of as a collection of Series (columns), and the pandas library supports adding or replacing them in the DataFrame.


In [2]:
df = original_df.copy()

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,3,4,1,6,3,3,2,2,2,2


In [3]:
new_column = range(len(df))  # one value per row (1470 rows in this dataset)

df["new_column"] = new_column

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,new_column
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,1,0,8,0,1,6,4,0,5,0
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,1,10,3,3,10,7,1,7,1
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,2,0,7,3,3,0,0,0,0,2
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,0,8,3,3,8,7,3,0,3
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,4,1,6,3,3,2,2,2,2,4
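
To make the difference between `[]` and `.` concrete, here is a minimal sketch on a tiny made-up DataFrame. The names `df_demo`, `age_next_year`, and `bonus` are hypothetical and are not part of the employee dataset. It shows that `[]` adds a column, while assigning to a new attribute name leaves the columns untouched (pandas also emits a warning in that case, which is silenced here to keep the output clean).

```python
import warnings

import pandas as pd

# Tiny illustrative frame (hypothetical data, not the employee dataset)
df_demo = pd.DataFrame({"Age": [41, 49, 37]})

# Bracket assignment creates a real column
df_demo["age_next_year"] = df_demo["Age"] + 1

# Attribute assignment to a NEW name only sets a Python attribute;
# pandas warns about this, so the warning is suppressed for the demo
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df_demo.bonus = [1, 2, 3]

print(list(df_demo.columns))
```

Only `Age` and `age_next_year` appear in the column list, which is why `[]` is the safer habit when creating columns.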


### Replace an existing column

Replacing a column works exactly the same way as creating a new one!

In [5]:
df = original_df.copy()

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,3,4,1,6,3,3,2,2,2,2


In [6]:
df.Gender = df.Age

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,41,94,3,2,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,49,61,2,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,37,92,2,1,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,33,56,3,1,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,27,40,3,1,...,3,4,1,6,3,3,2,2,2,2



### Removing columns

To explicitly remove a column, we can use the DataFrame's `.drop()` method. Note that `drop()` returns a new copy of the DataFrame without the dropped columns; it doesn't mutate the original, so assign the result to a variable if you want to keep it.

In [10]:
df = original_df.copy()

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,3,4,1,6,3,3,2,2,2,2


In [11]:
df = df.drop(columns = ["Attrition", "Age"])

df

Unnamed: 0,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobSatisfaction,MaritalStatus,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,Travel_Rarely,1102,1,2,Female,94,3,2,4,Single,...,3,1,0,8,0,1,6,4,0,5
1,Travel_Frequently,279,8,3,Male,61,2,2,2,Married,...,4,4,1,10,3,3,10,7,1,7
2,Travel_Rarely,1373,2,4,Male,92,2,1,3,Single,...,3,2,0,7,3,3,0,0,0,0
3,Travel_Frequently,1392,3,4,Female,56,3,1,3,Married,...,3,3,0,8,3,3,8,7,3,0
4,Travel_Rarely,591,2,1,Male,40,3,1,2,Married,...,3,4,1,6,3,3,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,Travel_Frequently,884,23,3,Male,41,4,2,4,Married,...,3,3,1,17,3,3,5,2,0,3
1466,Travel_Rarely,613,6,4,Male,42,2,3,1,Married,...,3,1,1,9,5,3,7,7,1,7
1467,Travel_Rarely,155,4,2,Male,87,4,2,2,Married,...,4,2,1,6,0,3,6,2,0,3
1468,Travel_Frequently,1023,2,4,Male,63,2,2,2,Married,...,3,4,0,17,3,2,9,6,0,8
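
The copy-versus-mutate behaviour of `drop()` is easy to check on a small frame. The sketch below uses hypothetical names (`small`, `trimmed`) and made-up rows; it only illustrates that the original DataFrame keeps all of its columns.

```python
import pandas as pd

# Small illustrative frame (hypothetical data)
small = pd.DataFrame({
    "Age": [41, 49],
    "Attrition": ["Yes", "No"],
    "Gender": ["Female", "Male"],
})

# drop() returns a new DataFrame; `small` itself is left unchanged
trimmed = small.drop(columns=["Attrition"])

print(list(small.columns))    # still includes "Attrition"
print(list(trimmed.columns))  # "Attrition" is gone
```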


## Transforming columns

It is often useful to build new columns from existing data, but you will usually want to transform that data first. Suppose you want to standardize the Gender column in this example: instead of 'Male' and 'Female', we just want 'm' or 'f'. How can we change the data in that column to match what we want?

### `.map()`
Map is a common concept in programming: it takes a collection as input, applies a function to each element, and returns the results as a new collection. In our case, we'd like a function that turns each value in the Gender column into either 'm' or 'f' and gives us back a new column of data. The `.map()` method on the column Series does just that for us!

In [18]:
df = original_df.copy()

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,3,4,1,6,3,3,2,2,2,2


In [19]:
# create our new column: 'Female' becomes 'f', every other value becomes 'm'
new_gender = df.Gender.map(lambda g: 'f' if g == 'Female' else 'm')

df.Gender = new_gender

df

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,f,94,3,2,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,m,61,2,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,m,92,2,1,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,f,56,3,1,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,m,40,3,1,...,3,4,1,6,3,3,2,2,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,23,3,m,41,4,2,...,3,3,1,17,3,3,5,2,0,3
1466,39,No,Travel_Rarely,613,6,4,m,42,2,3,...,3,1,1,9,5,3,7,7,1,7
1467,27,No,Travel_Rarely,155,4,2,m,87,4,2,...,4,2,1,6,0,3,6,2,0,3
1468,49,No,Travel_Frequently,1023,2,4,m,63,2,2,...,3,4,0,17,3,2,9,6,0,8
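
As an alternative sketch, `Series.map()` also accepts a dictionary instead of a function. The example below uses a small made-up Series (`genders` and the value `"Unknown"` are hypothetical) to show one difference worth knowing: with a dictionary, values that have no matching key become `NaN`, whereas the lambda above would silently turn them into 'm'.

```python
import pandas as pd

# Made-up values, including one the mapping does not cover
genders = pd.Series(["Female", "Male", "Male", "Unknown"])

# Dictionary-based mapping: unmatched values become NaN
short = genders.map({"Female": "f", "Male": "m"})

# Function-based mapping from the cell above: everything else becomes 'm'
short_lambda = genders.map(lambda g: "f" if g == "Female" else "m")

print(short.tolist())
print(short_lambda.tolist())
```

Which behaviour is preferable depends on the data; the dictionary form makes unexpected values visible instead of hiding them.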


### Arithmetic operations with columns

Let's look at another example of transforming data.  Let's say we want to identify people who have worked their entire career at this company.  If their "TotalWorkingYears" is equal to their "YearsAtCompany" value, then we want the value in a new "lifer" column to equal True, otherwise False.

We can do this by comparing two columns as if they were single values:

In [20]:
df = original_df.copy()
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,3,4,1,6,3,3,2,2,2,2


In [21]:
lifer_col = (df.TotalWorkingYears == df.YearsAtCompany)

df["lifer"] = lifer_col

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,lifer
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,1,0,8,0,1,6,4,0,5,False
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,1,10,3,3,10,7,1,7,True
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,2,0,7,3,3,0,0,0,0,False
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,0,8,3,3,8,7,3,0,True
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,4,1,6,3,3,2,2,2,2,False
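
The same comparison can be tried on a handful of rows to see what it produces. This is a minimal sketch with made-up values; the frame name `careers` and the counter `n_lifers` are hypothetical. The comparison is evaluated row by row and yields a boolean Series, and because `True` counts as 1, summing that Series counts the matching rows.

```python
import pandas as pd

# Three made-up employees
careers = pd.DataFrame({
    "TotalWorkingYears": [8, 10, 7],
    "YearsAtCompany": [6, 10, 0],
})

# Element-wise comparison: one True/False per row
careers["lifer"] = careers["TotalWorkingYears"] == careers["YearsAtCompany"]

# Summing booleans counts the True values
n_lifers = int(careers["lifer"].sum())

print(careers)
print(n_lifers)
```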


Comparisons are not the only option; we can apply all sorts of arithmetic operations to columns:

In [22]:
df = original_df.copy()

rate_col = df.DailyRate * df.HourlyRate

df["rate"] = rate_col

df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,DistanceFromHome,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,...,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,rate
0,41,Yes,Travel_Rarely,1102,1,2,Female,94,3,2,...,1,0,8,0,1,6,4,0,5,103588
1,49,No,Travel_Frequently,279,8,3,Male,61,2,2,...,4,1,10,3,3,10,7,1,7,17019
2,37,Yes,Travel_Rarely,1373,2,4,Male,92,2,1,...,2,0,7,3,3,0,0,0,0,126316
3,33,No,Travel_Frequently,1392,3,4,Female,56,3,1,...,3,0,8,3,3,8,7,3,0,77952
4,27,No,Travel_Rarely,591,2,1,Male,40,3,1,...,4,1,6,3,3,2,2,2,2,23640
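
Column arithmetic also works between a column and a single number, which is applied to every row. The sketch below uses a small hypothetical frame (`orders`, `unit_price`, `quantity`, `total`, and `half_total` are illustrative names, not columns of the employee dataset).

```python
import pandas as pd

# Hypothetical order data
orders = pd.DataFrame({
    "unit_price": [2.5, 4.0, 1.0],
    "quantity": [4, 1, 10],
})

# Column-with-column arithmetic is applied row by row
orders["total"] = orders["unit_price"] * orders["quantity"]

# Column-with-scalar arithmetic applies the scalar to every row
orders["half_total"] = orders["total"] / 2

print(orders)
```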


## Setting data with `loc[]`

Remember how useful the `loc[]` indexer was for reading data from a DataFrame? It turns out it is just as useful for setting data within a DataFrame.


First, let's build a smaller DataFrame to demonstrate how this works!


In [23]:
columns = list('abcdef') # Create a list of chars from a string (yay python)
data = [[False for j in columns] for i in range(0, 10)] # create a list of lists of "False"s for our dataframe
sdf = pd.DataFrame(data, columns=columns)

sdf

Unnamed: 0,a,b,c,d,e,f
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,False,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False


In [24]:
sdf.loc[0, 'a'] = True

sdf

Unnamed: 0,a,b,c,d,e,f
0,True,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,False,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False


Instead of assigning a single value, you can pass data that matches the shape of your selection to set those exact values:

In [28]:
sdf.loc[9] = [True, True, True, True, True, True]

sdf

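Unnamed: 0,a,b,c,d,e,f
0,True,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,False,False
8,False,False,False,False,False,False
9,True,True,True,True,True,True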
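
For reference, here is a self-contained sketch of setting data with `loc[]` on a fresh small frame (the name `grid` is hypothetical). It assigns one scalar to a block of cells and a list to a whole row. Note that label-based slices in `loc[]` include both endpoints, so `1:2` selects rows 1 and 2.

```python
import pandas as pd

# A 4x3 frame filled with False
cols = list("abc")
grid = pd.DataFrame([[False] * 3 for _ in range(4)], columns=cols)

# A single value is broadcast to every selected cell (rows 1 and 2, columns a and b)
grid.loc[1:2, ["a", "b"]] = True

# A list whose length matches the number of columns sets a whole row
grid.loc[3] = [True, False, True]

print(grid)
```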

### Summary

- You can add a new column with `[]`, and replace an existing column with `.` or `[]`
- You can use the `drop()` method to remove one or more columns
- You can use the `map()` method to transform the values of an existing column or to create a new one
- You can perform comparisons and arithmetic operations on columns
- `loc[]` can also be used to set data values in a DataFrame