# pandas: Selecting and modifying data

In [1]:
import pandas as pd

## Conditional selection

Let's load the file *data_ex9_2.csv*, which contains data about some students' grades.

`pd.read_csv()` accepts a URL directly, so we can read the file straight from the course GitHub page:

In [2]:
url = "https://raw.githubusercontent.com/mkozturk/AI111/refs/heads/main/data_ex9_2.csv"
scores = pd.read_csv(url)
scores.head()

Unnamed: 0,name,gender,group,math,physics,literature,art
0,Herschel Arendsen,2,3,58,67,60,39
1,Julita Cumesky,1,3,50,48,33,54
2,Mada Form,1,1,60,51,49,55
3,Barr Sapsford,2,4,54,74,61,29
4,Jackie Dict,1,4,52,53,51,61


Select students in group 3.

In [3]:
scores[scores["group"]==3]

Unnamed: 0,name,gender,group,math,physics,literature,art
0,Herschel Arendsen,2,3,58,67,60,39
1,Julita Cumesky,1,3,50,48,33,54
7,Keslie Alfonso,1,3,47,69,56,87
19,Maxie Yakobovicz,1,3,49,76,51,40
27,Brok Frowing,2,3,70,22,50,86
40,Burtie Shuttell,2,3,55,49,23,42
44,Idette Hendrik,2,3,48,79,46,75
47,Meg Spurgeon,1,3,41,54,54,67
50,Bone Dellenty,1,3,38,73,37,67
51,Ardeen Watsam,2,3,50,68,39,71
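
The expression inside the brackets is itself a Series of booleans, and the brackets keep only the rows where it is `True`. A minimal sketch on a made-up four-row frame (the `toy` data and its names are hypothetical, not taken from the course file):

```python
import pandas as pd

# Hypothetical toy data standing in for the course file
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cem", "Dee"],
    "group": [3, 1, 3, 2],
})

mask = toy["group"] == 3   # element-wise comparison gives a boolean Series
print(mask.tolist())       # [True, False, True, False]

in_group_3 = toy[mask]     # keep only the rows where the mask is True
print(in_group_3["name"].tolist())  # ['Ann', 'Cem']
```

Storing the mask in a variable first is optional, but it makes longer conditions easier to read.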


Select female (gender=1) students in group 3.

In [4]:
scores[(scores["group"]==3) & (scores["gender"]==1)]

Unnamed: 0,name,gender,group,math,physics,literature,art
1,Julita Cumesky,1,3,50,48,33,54
7,Keslie Alfonso,1,3,47,69,56,87
19,Maxie Yakobovicz,1,3,49,76,51,40
47,Meg Spurgeon,1,3,41,54,54,67
50,Bone Dellenty,1,3,38,73,37,67
59,Viki McCritichie,1,3,43,47,42,61
67,Tandi Muggleton,1,3,57,62,34,79
68,Tan Tucknott,1,3,50,63,38,44
72,Denver Bedwell,1,3,56,50,33,44
81,Stefanie Osband,1,3,58,85,30,56
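
The parentheses around each comparison matter: in Python, `&` binds more tightly than `==`, so leaving them out changes how the expression is parsed. One way to sidestep the issue is to name each condition first. A small sketch with hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data, using the same column names as the course file
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cem", "Dee"],
    "gender": [1, 2, 1, 1],
    "group": [3, 3, 1, 3],
})

# Name each condition, then combine them with & (element-wise AND)
in_group_3 = toy["group"] == 3
is_female = toy["gender"] == 1
both = toy[in_group_3 & is_female]

print(both["name"].tolist())  # ['Ann', 'Dee']
```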


Select students with a literature score greater than 60 or an art score greater than 80.

In [5]:
scores[(scores["literature"]>60) | (scores["art"]>80)]

Unnamed: 0,name,gender,group,math,physics,literature,art
3,Barr Sapsford,2,4,54,74,61,29
7,Keslie Alfonso,1,3,47,69,56,87
16,Sophey Truggian,2,4,64,60,66,71
22,Eran Goldthorp,2,2,50,41,66,29
23,Andris Donnan,2,1,52,44,61,92
26,Garwin Sieb,2,4,56,75,68,80
27,Brok Frowing,2,3,70,22,50,86
29,Eilis Andrieu,2,1,68,59,51,96
37,Candace Di Angelo,1,4,49,56,43,83
38,Chaim Mulqueen,2,2,59,45,58,91


Select students with a literature score greater than 70 or in group 1, but not both.

In [6]:
scores[(scores["literature"]>70) ^ (scores["group"]==1)] # exclusive OR

Unnamed: 0,name,gender,group,math,physics,literature,art
2,Mada Form,1,1,60,51,49,55
12,Petronella Ceci,1,1,41,68,48,62
20,Nanete Dounbare,1,1,57,51,32,63
21,Alic Looker,2,1,28,54,51,38
23,Andris Donnan,2,1,52,44,61,92
25,Brodie Sympson,2,1,49,64,24,46
29,Eilis Andrieu,2,1,68,59,51,96
45,Myer Bull,1,1,58,38,62,48
46,Neale Wale,2,1,33,50,48,55
49,Lonnie Banaszkiewicz,1,1,36,70,52,63
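
To see what `^` does, it can help to compare it with the longer form "(A or B) and not (A and B)". The sketch below uses hypothetical toy data and checks that the two masks agree:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cem", "Dee"],
    "group": [1, 1, 2, 2],
    "literature": [75, 50, 80, 40],
})

high_lit = toy["literature"] > 70
in_group_1 = toy["group"] == 1

xor_mask = high_lit ^ in_group_1   # True where exactly one condition holds
# The same thing spelled out: (A or B) and not (A and B)
spelled_out = (high_lit | in_group_1) & ~(high_lit & in_group_1)

print(xor_mask.tolist())             # [False, True, True, False]
print(xor_mask.equals(spelled_out))  # True
```

Ann satisfies both conditions and Dee satisfies neither, so both are excluded.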


Select students that are NOT in group 1 with an art score greater than 80.

In [7]:
scores[(scores["art"]>80) & ~(scores["group"]==1)]

Unnamed: 0,name,gender,group,math,physics,literature,art
7,Keslie Alfonso,1,3,47,69,56,87
27,Brok Frowing,2,3,70,22,50,86
37,Candace Di Angelo,1,4,49,56,43,83
38,Chaim Mulqueen,2,2,59,45,58,91
52,Weston Dupree,2,2,66,49,44,103
63,Jacqueline Kondratyuk,1,4,48,48,44,91


## String methods in pandas Series

Regular string methods, and more, are available on Series objects through the `.str` accessor. Each method is applied to every element and returns a new Series.

In [8]:
scores["name"].str.lower()  # convert to lowercase

0     herschel arendsen
1        julita cumesky
2             mada form
3         barr sapsford
4           jackie dict
            ...        
95       jud bengefield
96      catrina suggate
97     demott carragher
98          jess legier
99         roda rillett
Name: name, Length: 100, dtype: object

In [9]:
scores["name"].str.len()

0     17
1     14
2      9
3     13
4     11
      ..
95    14
96    15
97    16
98    11
99    12
Name: name, Length: 100, dtype: int64
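
Because each `.str` method returns a new Series, calls can be chained. A short sketch, using three names from the table above in a standalone Series:

```python
import pandas as pd

names = pd.Series(["Mada Form", "Barr Sapsford", "Jackie Dict"])

upper = names.str.upper()          # uppercase every element
first = names.str.split().str[0]   # split on whitespace, then take the first piece

print(upper.tolist())  # ['MADA FORM', 'BARR SAPSFORD', 'JACKIE DICT']
print(first.tolist())  # ['Mada', 'Barr', 'Jackie']
```

Here `.str[0]` indexes into each element's list of words, not into the Series itself.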

Boolean output based on some condition:

In [10]:
scores["name"].str.startswith("A")

0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Name: name, Length: 100, dtype: bool

We can use this output for selection:

In [11]:
scores[scores["name"].str.startswith("A")]

Unnamed: 0,name,gender,group,math,physics,literature,art
14,Aretha McQuade,1,4,62,51,55,59
17,Ashley Weblin,2,4,51,75,21,40
21,Alic Looker,2,1,28,54,51,38
23,Andris Donnan,2,1,52,44,61,92
48,Angelico Brandassi,1,2,27,58,44,57
51,Ardeen Watsam,2,3,50,68,39,71
62,Alexandre Oosthout de Vree,1,4,40,78,63,57
94,Ava Mendonca,2,3,50,51,59,63


Select students whose last name begins with "A":

In [12]:
scores[scores["name"].str.contains(" A")] # look for "A" following a space

Unnamed: 0,name,gender,group,math,physics,literature,art
0,Herschel Arendsen,2,3,58,67,60,39
7,Keslie Alfonso,1,3,47,69,56,87
29,Eilis Andrieu,2,1,68,59,51,96
37,Candace Di Angelo,1,4,49,56,43,83
41,Penrod Andres,1,2,45,65,49,52
61,Teddie Aleksashin,1,1,42,58,34,84
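
Looking for `" A"` matches an "A" after any space, so it would also pick up a middle name. If that matters, one possible alternative is to test only the final word of each name. The sketch below assumes the last whitespace-separated word is the last name; "Mary Ann Jones" is a hypothetical name added to show the difference:

```python
import pandas as pd

names = pd.Series(["Herschel Arendsen", "Aretha McQuade", "Mary Ann Jones"])

contains_mask = names.str.contains(" A")   # "A" following any space
last_word = names.str.split().str[-1]      # final word of each name
last_word_mask = last_word.str.startswith("A")

print(contains_mask.tolist())   # [True, False, True]
print(last_word_mask.tolist())  # [True, False, False]
```

Neither approach is perfect: with the last-word rule, "Candace Di Angelo" still matches, but a name like "Alexandre Oosthout de Vree" is judged only by "Vree".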


## isin(): Check if element is in a list of values

Suppose we want to select the rows where the names have particular values. One (bad) way to do this is:

In [13]:
selected = ( (scores["name"]=="Julita Cumesky") | 
            (scores["name"]=="Catrina Suggate") |
            (scores["name"]=="Weston Dupree")
           )
scores[selected]

Unnamed: 0,name,gender,group,math,physics,literature,art
1,Julita Cumesky,1,3,50,48,33,54
52,Weston Dupree,2,2,66,49,44,103
96,Catrina Suggate,1,4,53,31,26,44


However, this is inconvenient, and it does not scale when we need to check many values.

The `.isin()` Series method does the same task in a short and flexible way.

In [14]:
selected = scores["name"].isin(["Julita Cumesky", "Catrina Suggate", "Weston Dupree"])
scores[selected]

Unnamed: 0,name,gender,group,math,physics,literature,art
1,Julita Cumesky,1,3,50,48,33,54
52,Weston Dupree,2,2,66,49,44,103
96,Catrina Suggate,1,4,53,31,26,44
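
`.isin()` works on any column, not only strings, and the resulting mask can be inverted with `~` to select everything *except* the listed values. A small sketch with hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cem", "Dee"],
    "group": [1, 2, 3, 4],
})

wanted = toy["group"].isin([1, 4])     # True where group is 1 or 4

print(toy[wanted]["name"].tolist())    # ['Ann', 'Dee']
print(toy[~wanted]["name"].tolist())   # ['Bob', 'Cem']
```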


## Adding, removing, and changing columns

To add a new column to an existing DataFrame, we simply assign to a new column name:

In [15]:
student_averages = scores[["math","physics","literature","art"]].mean(axis=1)
scores["average"] = student_averages
scores

Unnamed: 0,name,gender,group,math,physics,literature,art,average
0,Herschel Arendsen,2,3,58,67,60,39,56.00
1,Julita Cumesky,1,3,50,48,33,54,46.25
2,Mada Form,1,1,60,51,49,55,53.75
3,Barr Sapsford,2,4,54,74,61,29,54.50
4,Jackie Dict,1,4,52,53,51,61,54.25
...,...,...,...,...,...,...,...,...
95,Jud Bengefield,1,4,35,49,57,43,46.00
96,Catrina Suggate,1,4,53,31,26,44,38.50
97,Demott Carragher,1,1,39,64,53,45,50.25
98,Jess Legier,2,4,47,87,47,45,56.50
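
The `axis=1` argument is what makes `.mean()` work across the columns of each row. With `axis=0` (the default) we would get one mean per column instead. A sketch using the first two rows of the table above in a standalone frame:

```python
import pandas as pd

# First two rows of the scores table, typed in by hand
toy = pd.DataFrame({
    "math": [58, 50],
    "physics": [67, 48],
    "literature": [60, 33],
    "art": [39, 54],
})

row_means = toy.mean(axis=1)   # one value per row (per student)
col_means = toy.mean(axis=0)   # one value per column (per subject)

print(row_means.tolist())      # [56.0, 46.25]
print(col_means["math"])       # 54.0
```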


A column's name can be changed with the `.rename()` method.

Note that this returns a new DataFrame, without changing the original.

In [16]:
scores.rename(columns={"average":"average score"})

Unnamed: 0,name,gender,group,math,physics,literature,art,average score
0,Herschel Arendsen,2,3,58,67,60,39,56.00
1,Julita Cumesky,1,3,50,48,33,54,46.25
2,Mada Form,1,1,60,51,49,55,53.75
3,Barr Sapsford,2,4,54,74,61,29,54.50
4,Jackie Dict,1,4,52,53,51,61,54.25
...,...,...,...,...,...,...,...,...
95,Jud Bengefield,1,4,35,49,57,43,46.00
96,Catrina Suggate,1,4,53,31,26,44,38.50
97,Demott Carragher,1,1,39,64,53,45,50.25
98,Jess Legier,2,4,47,87,47,45,56.50
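
Since `.rename()` returns a new DataFrame, the result has to be assigned to a name if we want to keep it. A minimal sketch with a hypothetical one-row frame:

```python
import pandas as pd

# Hypothetical one-row frame
toy = pd.DataFrame({"name": ["Ann"], "average": [56.0]})

renamed = toy.rename(columns={"average": "average score"})

print(list(toy.columns))      # ['name', 'average']  (original is unchanged)
print(list(renamed.columns))  # ['name', 'average score']
```

Writing `toy = toy.rename(...)` instead would make the change stick under the original name.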


To delete a column, we use the `.drop()` method.

Note that this returns a new DataFrame, without changing the original.

In [17]:
scores.drop("average", axis=1) # to drop a column, specify axis=1

Unnamed: 0,name,gender,group,math,physics,literature,art
0,Herschel Arendsen,2,3,58,67,60,39
1,Julita Cumesky,1,3,50,48,33,54
2,Mada Form,1,1,60,51,49,55
3,Barr Sapsford,2,4,54,74,61,29
4,Jackie Dict,1,4,52,53,51,61
...,...,...,...,...,...,...,...
95,Jud Bengefield,1,4,35,49,57,43
96,Catrina Suggate,1,4,53,31,26,44
97,Demott Carragher,1,1,39,64,53,45
98,Jess Legier,2,4,47,87,47,45


Alternatively, we can use the *columns* parameter.

In [18]:
scores.drop(columns=["average"])

Unnamed: 0,name,gender,group,math,physics,literature,art
0,Herschel Arendsen,2,3,58,67,60,39
1,Julita Cumesky,1,3,50,48,33,54
2,Mada Form,1,1,60,51,49,55
3,Barr Sapsford,2,4,54,74,61,29
4,Jackie Dict,1,4,52,53,51,61
...,...,...,...,...,...,...,...
95,Jud Bengefield,1,4,35,49,57,43
96,Catrina Suggate,1,4,53,31,26,44
97,Demott Carragher,1,1,39,64,53,45
98,Jess Legier,2,4,47,87,47,45
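
The same `.drop()` method removes rows when it is given index labels, because its default is `axis=0`. The sketch below, on hypothetical toy data, shows both uses and confirms that the original frame is left alone:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "math": [58, 50],
    "average": [56.0, 46.25],
})

no_average = toy.drop(columns=["average"])  # drop a column by name
no_first_row = toy.drop(0)                  # default axis=0: drop the row labeled 0

print(list(no_average.columns))     # ['name', 'math']
print(no_first_row.index.tolist())  # [1]
print(list(toy.columns))            # ['name', 'math', 'average']
```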


## Changing values conditionally

Suppose we discover that we made a grading mistake in the physics exam for group 3. We should add 5 points to each student in this group.

In [19]:
scores[scores["group"]==3].head()

Unnamed: 0,name,gender,group,math,physics,literature,art,average
0,Herschel Arendsen,2,3,58,67,60,39,56.0
1,Julita Cumesky,1,3,50,48,33,54,46.25
7,Keslie Alfonso,1,3,47,69,56,87,64.75
19,Maxie Yakobovicz,1,3,49,76,51,40,54.0
27,Brok Frowing,2,3,70,22,50,86,57.0


Select the relevant rows and make the replacements. This operation changes the original data frame.

In [20]:
selected = (scores["group"]==3)
scores.loc[selected, "physics"] += 5

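For such replacements, *pandas* requires `.loc[]` indexing rather than plain brackets `[]`. Selecting rows with plain brackets produces a separate object, so an assignment made through it would not reach the original data frame.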
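
To illustrate the difference, the sketch below uses a hypothetical three-row frame. It first modifies a row selection taken with plain brackets (made an explicit copy here to keep the example free of warnings), then does the same update through `.loc[]`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"group": [3, 1, 3], "physics": [67, 51, 48]})
selected = toy["group"] == 3

# A row selection is a separate object; changing it leaves `toy` untouched.
subset = toy[selected].copy()
subset["physics"] += 5
print(toy["physics"].tolist())   # [67, 51, 48]

# .loc addresses rows and column in one step and writes into `toy` itself.
toy.loc[selected, "physics"] += 5
print(toy["physics"].tolist())   # [72, 51, 53]
```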

Check the data frame to verify the change.

In [21]:
scores[scores["group"]==3].head()

Unnamed: 0,name,gender,group,math,physics,literature,art,average
0,Herschel Arendsen,2,3,58,72,60,39,56.0
1,Julita Cumesky,1,3,50,53,33,54,46.25
7,Keslie Alfonso,1,3,47,74,56,87,64.75
19,Maxie Yakobovicz,1,3,49,81,51,40,54.0
27,Brok Frowing,2,3,70,27,50,86,57.0


Suppose we want to replace the gender value 1 with "female" and 2 with "male", for clarity.

We can filter each case separately and assign the new values to selected elements.

Note that these operations do change the original data frame.

In [22]:
scores["gender"] = scores["gender"].astype(str) # change the data type from integer to string
scores.loc[scores.gender=="1", "gender"] = "female"  # the values are now strings, so compare with "1"
scores.loc[scores.gender=="2", "gender"] = "male"
scores

Unnamed: 0,name,gender,group,math,physics,literature,art,average
0,Herschel Arendsen,male,3,58,72,60,39,56.00
1,Julita Cumesky,female,3,50,53,33,54,46.25
2,Mada Form,female,1,60,51,49,55,53.75
3,Barr Sapsford,male,4,54,74,61,29,54.50
4,Jackie Dict,female,4,52,53,51,61,54.25
...,...,...,...,...,...,...,...,...
95,Jud Bengefield,female,4,35,49,57,43,46.00
96,Catrina Suggate,female,4,53,31,26,44,38.50
97,Demott Carragher,female,1,39,64,53,45,50.25
98,Jess Legier,male,4,47,87,47,45,56.50
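
When every value in a column is to be replaced according to a fixed table, a possible shortcut is the Series `.map()` method with a dictionary. This is an alternative to the filtering approach, sketched here on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data with the same integer coding for gender
toy = pd.DataFrame({"name": ["Ann", "Bob", "Cem"], "gender": [1, 2, 1]})

labels = {1: "female", 2: "male"}          # old value -> new value
toy["gender"] = toy["gender"].map(labels)  # look up each element in the dictionary

print(toy["gender"].tolist())  # ['female', 'male', 'female']
```

With `.map()`, no separate type conversion is needed, but any value missing from the dictionary becomes `NaN`, so the dictionary should cover every code that appears in the column.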
