# Data Analysis in Python - IX: Modifying DataFrames

## Introduction


In this lesson, we will learn how to add, modify, and remove the columns and rows of a DataFrame.


In [2]:
# create data frames

import pandas as pd

students = pd.DataFrame(
    [
        [9000, 'Amir', 'A1@psu.edu'],
        [9001, 'Biko', 'b10@psu.edu'],
        [9002, 'Chen', 'C2@psu.edu'],
        [9003, 'Darren', 'd@psu.edu'],
        [9004, 'Elena', 'e@psu.edu'],
    ], 
    columns = ['ID','Name','Email']
)

students

Unnamed: 0,ID,Name,Email
0,9000,Amir,A1@psu.edu
1,9001,Biko,b10@psu.edu
2,9002,Chen,C2@psu.edu
3,9003,Darren,d@psu.edu
4,9004,Elena,e@psu.edu


### Adding a new column to a DataFrame

A new column can be added by assigning a list of values, one per row, to a new column name.

In [52]:
# add a new column called "StartYear" to the students data frame
students["StartYear"] = [2020, 2019, 2020, 2021, 2019]

# add a new column called "StartMonth"
students["StartMonth"] = [8,8,8,1,1]

students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0,9000,Amir,A1@psu.edu,2020,8
1,9001,Biko,b10@psu.edu,2019,8
2,9002,Chen,C2@psu.edu,2020,8
3,9003,Darren,d@psu.edu,2021,1
4,9004,Elena,e@psu.edu,2019,1
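
Two details of column assignment are worth a quick sketch: a single value is repeated for every row, and a list must contain exactly one value per row. The small `demo` table and the `Campus` column below are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Small stand-in table (not the lesson's full students DataFrame).
demo = pd.DataFrame({"ID": [9000, 9001, 9002], "Name": ["Amir", "Biko", "Chen"]})

# A single value is repeated for every row.
demo["Campus"] = "UP"  # hypothetical column name and value

# A list must have one value per row; a shorter list raises ValueError.
try:
    demo["StartYear"] = [2020, 2019]  # only 2 values for 3 rows
    length_error = None
except ValueError as err:
    length_error = err

print(demo["Campus"].tolist())
print(type(length_error).__name__)
```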


A new column can also be computed from existing columns, constants, or both.

In [53]:
# years in program = current year - start year + 1

students["YearsInProgram"] = 2023 - students["StartYear"] + 1

students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth,YearsInProgram
0,9000,Amir,A1@psu.edu,2020,8,4
1,9001,Biko,b10@psu.edu,2019,8,5
2,9002,Chen,C2@psu.edu,2020,8,4
3,9003,Darren,d@psu.edu,2021,1,3
4,9004,Elena,e@psu.edu,2019,1,5
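
As an alternative sketch, the same calculation can be written with the `assign` method, which returns a new DataFrame instead of changing the original. The names `students_demo`, `with_years` and `CURRENT_YEAR` are assumptions made for this example; the lesson itself hard-codes 2023.

```python
import pandas as pd

# Small stand-in for the students table.
students_demo = pd.DataFrame({
    "ID": [9000, 9001, 9002],
    "StartYear": [2020, 2019, 2020],
})

CURRENT_YEAR = 2023  # hypothetical constant; matches the year used in the lesson

# assign() returns a new DataFrame and leaves students_demo unchanged.
with_years = students_demo.assign(
    YearsInProgram=CURRENT_YEAR - students_demo["StartYear"] + 1
)

print(with_years["YearsInProgram"].tolist())      # [4, 5, 4]
print("YearsInProgram" in students_demo.columns)  # False
```

This form is convenient when the original table should stay untouched, for example while trying out a formula.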


### Modifying a column

To modify a column, compute the new values from the existing column and assign the result back to the same column.

In [54]:
# preview the emails in lowercase (the DataFrame itself is not modified yet)
students["Email"].str.lower()

0     a1@psu.edu
1    b10@psu.edu
2     c2@psu.edu
3      d@psu.edu
4      e@psu.edu
Name: Email, dtype: object

In [55]:
# modify the column to change email to lowercase
students["Email"] = students["Email"].str.lower()
students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth,YearsInProgram
0,9000,Amir,a1@psu.edu,2020,8,4
1,9001,Biko,b10@psu.edu,2019,8,5
2,9002,Chen,c2@psu.edu,2020,8,4
3,9003,Darren,d@psu.edu,2021,1,3
4,9004,Elena,e@psu.edu,2019,1,5


#### An aside on accessors

Columns of a DataFrame can hold different data types. Pandas provides type-specific methods through 'accessors'; 'str' is the accessor for string columns. As a more advanced example, consider the split method, documented at https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html.

In [56]:
# split email
students["Email"].str.split("@")

0     [a1, psu.edu]
1    [b10, psu.edu]
2     [c2, psu.edu]
3      [d, psu.edu]
4      [e, psu.edu]
Name: Email, dtype: object

In [57]:
# extract access id
students["Email"].str.split("@").str[0]

0     a1
1    b10
2     c2
3      d
4      e
Name: Email, dtype: object

In [7]:
# split email and expand the parts into separate columns
students["Email"].str.split("@", expand = True)

Unnamed: 0,0,1
0,a1,psu.edu
1,b10,psu.edu
2,c2,psu.edu
3,d,psu.edu
4,e,psu.edu
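
The expanded result can be given column names, and other data types have accessors of their own, such as `dt` for datetime columns. The sketch below is only an illustration: the column names `AccessID` and `Domain` and the two dates are made up for this example.

```python
import pandas as pd

emails = pd.Series(["a1@psu.edu", "b10@psu.edu", "c2@psu.edu"], name="Email")

# expand=True returns a DataFrame with one column per piece.
parts = emails.str.split("@", expand=True)
parts.columns = ["AccessID", "Domain"]  # hypothetical column names

# The dt accessor plays the same role for datetime columns.
start_dates = pd.to_datetime(pd.Series(["2020-08-24", "2021-01-11"]))  # made-up dates
start_years = start_dates.dt.year

print(parts["AccessID"].tolist())  # ['a1', 'b10', 'c2']
print(start_years.tolist())        # [2020, 2021]
```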


### Removing a column

To remove a column, call the drop method with the column name and axis = 1. Pass a list of names to remove several columns at once.

In [59]:
# remove years in program column

students.drop('YearsInProgram', axis = 1, inplace = True)
students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0,9000,Amir,a1@psu.edu,2020,8
1,9001,Biko,b10@psu.edu,2019,8
2,9002,Chen,c2@psu.edu,2020,8
3,9003,Darren,d@psu.edu,2021,1
4,9004,Elena,e@psu.edu,2019,1
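
A minimal sketch of removing several columns at once, using a small stand-in table named `demo`. It uses the `columns` keyword of `drop`, which is equivalent to passing the names with `axis = 1`, and it omits `inplace` so that the original table is left unchanged.

```python
import pandas as pd

demo = pd.DataFrame({
    "ID": [9000, 9001],
    "Name": ["Amir", "Biko"],
    "StartYear": [2020, 2019],
    "StartMonth": [8, 8],
})

# Without inplace=True, drop() returns a new DataFrame.
trimmed = demo.drop(columns=["StartYear", "StartMonth"])

print(list(trimmed.columns))  # ['ID', 'Name']
print(list(demo.columns))     # ['ID', 'Name', 'StartYear', 'StartMonth']
```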


### Adding a new row to a data frame

A new row can be added using the concat function or the loc property. 

In [60]:
# build a one-row DataFrame to add to the students dataframe with concat
new_student = pd.DataFrame([
[9006, 'Harper', 'h@psu.edu', 2020, 8]
], columns = ['ID', 'Name', 'Email', 'StartYear', 'StartMonth'])

students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0,9000,Amir,a1@psu.edu,2020,8
1,9001,Biko,b10@psu.edu,2019,8
2,9002,Chen,c2@psu.edu,2020,8
3,9003,Darren,d@psu.edu,2021,1
4,9004,Elena,e@psu.edu,2019,1


In [61]:
new_student

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0,9006,Harper,h@psu.edu,2020,8


In [62]:
# concat keeps each frame's original index labels, so the new row arrives with label 0
students = pd.concat([students, new_student])
students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0,9000,Amir,a1@psu.edu,2020,8
1,9001,Biko,b10@psu.edu,2019,8
2,9002,Chen,c2@psu.edu,2020,8
3,9003,Darren,d@psu.edu,2021,1
4,9004,Elena,e@psu.edu,2019,1
0,9006,Harper,h@psu.edu,2020,8


In [63]:
# renumber the index from 0 and discard the old labels
students.reset_index(inplace = True, drop = True)
students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0,9000,Amir,a1@psu.edu,2020,8
1,9001,Biko,b10@psu.edu,2019,8
2,9002,Chen,c2@psu.edu,2020,8
3,9003,Darren,d@psu.edu,2021,1
4,9004,Elena,e@psu.edu,2019,1
5,9006,Harper,h@psu.edu,2020,8
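
The concatenate-then-reset pattern can also be done in one step with the `ignore_index` parameter of `pd.concat`. The sketch below uses small stand-in tables (`students_demo`, `new_student_demo`) rather than the lesson's full data.

```python
import pandas as pd

cols = ["ID", "Name"]
students_demo = pd.DataFrame([[9000, "Amir"], [9001, "Biko"]], columns=cols)
new_student_demo = pd.DataFrame([[9006, "Harper"]], columns=cols)

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
combined = pd.concat([students_demo, new_student_demo], ignore_index=True)

print(combined.index.tolist())    # [0, 1, 2]
print(combined["Name"].tolist())  # ['Amir', 'Biko', 'Harper']
```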


In [64]:
# insert a new row using loc; the label 4.5 sorts between 4 and 5 (the index becomes float)
students.loc[4.5] = [9005, 'Falguni', 'f@psu.edu', 2021, 1]
students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0.0,9000,Amir,a1@psu.edu,2020,8
1.0,9001,Biko,b10@psu.edu,2019,8
2.0,9002,Chen,c2@psu.edu,2020,8
3.0,9003,Darren,d@psu.edu,2021,1
4.0,9004,Elena,e@psu.edu,2019,1
5.0,9006,Harper,h@psu.edu,2020,8
4.5,9005,Falguni,f@psu.edu,2021,1


In [65]:
# sort the rows by index label so the new row moves between 4.0 and 5.0
students.sort_index(inplace = True)
students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0.0,9000,Amir,a1@psu.edu,2020,8
1.0,9001,Biko,b10@psu.edu,2019,8
2.0,9002,Chen,c2@psu.edu,2020,8
3.0,9003,Darren,d@psu.edu,2021,1
4.0,9004,Elena,e@psu.edu,2019,1
4.5,9005,Falguni,f@psu.edu,2021,1
5.0,9006,Harper,h@psu.edu,2020,8


In [66]:
# reset the index to consecutive integers
students.reset_index(inplace = True, drop = True)
students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0,9000,Amir,a1@psu.edu,2020,8
1,9001,Biko,b10@psu.edu,2019,8
2,9002,Chen,c2@psu.edu,2020,8
3,9003,Darren,d@psu.edu,2021,1
4,9004,Elena,e@psu.edu,2019,1
5,9005,Falguni,f@psu.edu,2021,1
6,9006,Harper,h@psu.edu,2020,8


### Modifying values in a row

Use the loc property to overwrite the values in a row, selected by its index label.

In [67]:
# replace every value in the row labeled 6
students.loc[6] = [9007, "Harper", "h@psu.edu", 2021, 1]
students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0,9000,Amir,a1@psu.edu,2020,8
1,9001,Biko,b10@psu.edu,2019,8
2,9002,Chen,c2@psu.edu,2020,8
3,9003,Darren,d@psu.edu,2021,1
4,9004,Elena,e@psu.edu,2019,1
5,9005,Falguni,f@psu.edu,2021,1
6,9007,Harper,h@psu.edu,2021,1
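
When only some values in a row need to change, `loc` also accepts a list of column names. A minimal sketch on a small stand-in table named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({
    "ID": [9000, 9001],
    "Name": ["Amir", "Biko"],
    "StartYear": [2020, 2019],
    "StartMonth": [8, 8],
})

# Update two columns of the row labeled 1; the other columns keep their values.
demo.loc[1, ["StartYear", "StartMonth"]] = [2021, 1]

print(demo.loc[1].tolist())  # [9001, 'Biko', 2021, 1]
```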


### Deleting a row

Use the drop method with axis = 0 to delete a row by its index label. Pass a list of labels to delete several rows at once.

In [68]:
students.drop(3, axis = 0, inplace = True)
students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0,9000,Amir,a1@psu.edu,2020,8
1,9001,Biko,b10@psu.edu,2019,8
2,9002,Chen,c2@psu.edu,2020,8
4,9004,Elena,e@psu.edu,2019,1
5,9005,Falguni,f@psu.edu,2021,1
6,9007,Harper,h@psu.edu,2021,1
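
A minimal sketch of deleting several rows at once, on a small stand-in table named `demo`. The `index` keyword of `drop` is equivalent to passing the labels with `axis = 0`. Note that the remaining rows keep their original labels.

```python
import pandas as pd

demo = pd.DataFrame({"Name": ["Amir", "Biko", "Chen", "Darren", "Elena"]})

# Drop the rows labeled 1 and 3; drop() returns a new DataFrame here.
remaining = demo.drop(index=[1, 3])

print(remaining.index.tolist())    # [0, 2, 4]
print(remaining["Name"].tolist())  # ['Amir', 'Chen', 'Elena']
```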


### Modifying specific cells

Individual cells can be modified by selecting them with loc (labels), iloc (integer positions), or a Boolean condition.

In [69]:
# change one cell, located by its row label and column name
students.loc[4, "StartMonth"] = 8
students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0,9000,Amir,a1@psu.edu,2020,8
1,9001,Biko,b10@psu.edu,2019,8
2,9002,Chen,c2@psu.edu,2020,8
4,9004,Elena,e@psu.edu,2019,8
5,9005,Falguni,f@psu.edu,2021,1
6,9007,Harper,h@psu.edu,2021,1


In [70]:
# change one cell, located by integer position (row 3, column 4, both counted from 0)
students.iloc[3, 4] = 1
students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0,9000,Amir,a1@psu.edu,2020,8
1,9001,Biko,b10@psu.edu,2019,8
2,9002,Chen,c2@psu.edu,2020,8
4,9004,Elena,e@psu.edu,2019,1
5,9005,Falguni,f@psu.edu,2021,1
6,9007,Harper,h@psu.edu,2021,1


In [71]:
# set StartMonth to 8 for every student who started after 2020
students.loc[students["StartYear"] > 2020, "StartMonth"] = 8
students

Unnamed: 0,ID,Name,Email,StartYear,StartMonth
0,9000,Amir,a1@psu.edu,2020,8
1,9001,Biko,b10@psu.edu,2019,8
2,9002,Chen,c2@psu.edu,2020,8
4,9004,Elena,e@psu.edu,2019,1
5,9005,Falguni,f@psu.edu,2021,8
6,9007,Harper,h@psu.edu,2021,8
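
After a row has been deleted, labels and positions no longer line up, so `loc[4]` and `iloc[4]` can point to different rows. The sketch below illustrates this on a small stand-in table whose index skips label 3, and also shows how two conditions can be combined with `&`. The value 9 is arbitrary and chosen only for the illustration.

```python
import pandas as pd

# Stand-in table whose index skips label 3, as after a row deletion.
demo = pd.DataFrame(
    {
        "Name": ["Amir", "Biko", "Chen", "Elena", "Falguni"],
        "StartYear": [2020, 2019, 2020, 2019, 2021],
        "StartMonth": [8, 8, 8, 1, 1],
    },
    index=[0, 1, 2, 4, 5],
)

by_label = demo.loc[4, "Name"]   # row whose label is 4
by_position = demo.iloc[4, 0]    # fifth row, first column

# Combine conditions with & and wrap each one in parentheses.
mask = (demo["StartYear"] >= 2020) & (demo["StartMonth"] == 8)
demo.loc[mask, "StartMonth"] = 9  # arbitrary value for illustration

print(by_label, by_position)        # Elena Falguni
print(demo["StartMonth"].tolist())  # [9, 8, 9, 1, 1]
```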
