In [1]:
import pandas as pd

# Python Crash Course IDSP 2025 - Lecture 05

* seeding of randomness
* type checking with isinstance()

More pandas:

* recap pandas so far
* tilde operator
* finding unique values
* filtering out missing values
* changing values in a cell
* how not to do it ("A value is trying to be set on a copy...")
* setting a data type for a series
* groupby!!

**Binary search algorithm** (if we get to it)

# More coding hacks

* type checking with `isinstance()`
* seeding of randomness with `random.seed()`

In [2]:
# type checking
# how to check if an object is of a certain type?
type(3)==int # ok, but not ideal
isinstance(3, int) # MUCH BETTER!

True
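
Why is `isinstance()` the better choice? A minimal sketch of one case where the two checks disagree: `bool` is a subclass of `int`, and `isinstance()` takes inheritance into account while the `type(...) ==` comparison does not. The variable names are just for illustration.

```python
# bool is a subclass of int, so the two checks give different answers for True
flag = True
exact_match = type(flag) == int          # False: the exact type is bool
subclass_aware = isinstance(flag, int)   # True: bool inherits from int

# isinstance() also accepts a tuple of types ("is it any of these?")
is_number = isinstance(3.5, (int, float))  # True

print(exact_match, subclass_aware, is_number)
```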

In [3]:
# random seeding
import random
# random.seed(1312)
random.choice(range(100))

36
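
With the seed line commented out, the cell above gives a different number on each run. A small sketch of what seeding changes: setting the same seed before each draw makes the "random" pick repeatable. The seed value 1312 is taken from the cell above; the variable names are made up here.

```python
import random

random.seed(1312)                        # fix the starting state of the generator
first_draw = random.choice(range(100))

random.seed(1312)                        # reset to the same starting state
second_draw = random.choice(range(100))

print(first_draw == second_draw)         # same seed, same pick
```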

# Pandas recap

# pandas functions

```python
pd.read_csv()
pd.DataFrame()
pd.concat()
```

# pandas DataFrame attributes
```python
.index
.columns
.shape
```

# pandas methods for DataFrames

```python
.head()
.tail()
.describe()
.sort_values()
.drop()
.reset_index()
```

# Boolean indexing (filtering by condition)

##### `df[columnlabel]` can be combined with comparison `> < == !=`  operators
##### `conditions` can be combined with "bitwise and"  `&`, "bitwise or" `|`  operators
##### `df[condition]` returns only those rows where condition is True
##### `df[(condition1) & (condition2)]` returns only those rows where both conditions are True
##### `df[(condition1) | (condition2)]` returns only those rows where at least one of the conditions is True
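
The rules above can be tried out on a tiny example. This is only a sketch: `toy` is a made-up data frame (not the Titanic data), and the column names were chosen to resemble it.

```python
import pandas as pd

# hypothetical toy data, not the Titanic data set
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Age": [70, 65, 30, 10],
})

is_female = toy["Sex"] == "female"     # boolean Series, one value per row
is_over_60 = toy["Age"] > 60

both = toy[is_female & is_over_60]     # rows where both conditions are True
either = toy[is_female | is_over_60]   # rows where at least one condition is True

print(len(both), len(either))
```

Storing each condition in its own variable first keeps the combined line readable and avoids forgetting the parentheses around each condition.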

***
# Let's load our Titanic data set again

In [4]:
df = pd.read_csv("./data/titanic.csv")

# Try it out yourself!

* read in the data one more time (we messed around with the old data frame) with `pd.read_csv()`
* filter by 2 conditions: `"Sex"=="female"` and `"Age">60`
* save a COPY of the filtered data set to the variable `old_ladies`
* how many old ladies were on the Titanic? (`len()`)
* how many of the old ladies were badass ladies that survived? `"Survived"==1`
* what is the mean fare that the old ladies paid? (`.describe()`, or calculate it manually, or with `.mean()`)

In [17]:
# YOUR CODE HERE
old_women = df[(df["Sex"] == "female") & (df["Age"] >60)].copy()
old_women


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
275,276,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S
483,484,1,3,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


***

## UNIQUE VALUES

In [None]:
# df.head(3)
# set(df.Survived) # option 1
# df["Survived"].unique() # option 2
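
A short sketch of the options for unique values on a made-up column (`ports` is hypothetical). Besides `.unique()`, it uses `.nunique()` and `.value_counts()`, which are not in the cell above but are standard pandas Series methods.

```python
import pandas as pd

# hypothetical toy column with one missing value
ports = pd.Series(["S", "C", "S", "Q", None, "C"])

distinct = ports.unique()        # distinct values; the missing value is included
n_distinct = ports.nunique()     # number of distinct values; missing values are not counted by default
counts = ports.value_counts()    # occurrences per value; missing values are left out by default

print(distinct)
print(n_distinct)
print(counts)
```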

***

## Tilde operator `~` in pandas means `not` ("inversion" operator)

In [18]:
# everyone with an Age value below 18 (NA are not considered)
df[df["Age"] < 18]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.1250,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
850,851,0,3,"Andersson, Master. Sigvard Harald Elias",male,4.0,4,2,347082,31.2750,,S
852,853,0,3,"Boulos, Miss. Nourelain",female,9.0,1,1,2678,15.2458,,C
853,854,1,1,"Lines, Miss. Mary Conover",female,16.0,0,1,PC 17592,39.4000,D28,S
869,870,1,3,"Johnson, Master. Harold Theodor",male,4.0,1,1,347742,11.1333,,S


In [None]:
# everyone with an Age value above or equal to 18 (NA are NOT considered)
df[df["Age"] >= 18]

In [None]:
# everyone with an Age value NOT below 18 (NA are CONSIDERED!)
df[~(df["Age"] < 18)]
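
Why do the last two cells give different results? A minimal sketch on a made-up frame with one missing age: any comparison with `NaN` is `False`, so `>= 18` drops that row, while `~(... < 18)` flips the `False` into `True` and keeps it.

```python
import pandas as pd

# hypothetical toy data with one missing Age
toy = pd.DataFrame({"Age": [5.0, 40.0, None]})

adults = toy[toy["Age"] >= 18]         # NaN >= 18 is False, so the NaN row is dropped
not_minors = toy[~(toy["Age"] < 18)]   # NaN < 18 is False, ~False is True, so the NaN row is kept

print(len(adults), len(not_minors))
```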

***

## MISSING VALUES

### Filtering out missing values (NaNs) with boolean conditions

* `NaN` .... "not a number"
* pandas methods: `.isna()` and `.notna()` (`isna` = "is NA", i.e. not available; `notna` = "not NA", i.e. available)
* Note: for pandas, "not available" includes both `NaN` and `None`
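
A quick sketch showing that both `NaN` and `None` count as missing. The Series `values` is made up for illustration.

```python
import numpy as np
import pandas as pd

# None in a numeric Series is stored as NaN
values = pd.Series([1.5, np.nan, None])

missing = values.isna()          # True where the value is not available
present = values.notna()         # the exact opposite
n_missing = int(missing.sum())   # True counts as 1, so the sum is the number of missing values

print(missing.tolist(), n_missing)
```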

In [None]:
df.head()
# in rows with index 0,2,4 the Cabin data is missing

In [None]:
df["Cabin"].isna()
# returns for each row whether data is NOT AVAILABLE (True) or available (False)

In [None]:
df["Cabin"].notna()
# returns for each row whether data is AVAILABLE (True) or not available (False)

### How to use `isna` and `notna` to clean data?

In [None]:
# boolean indexing to only have rows where we know the Cabin value:
my_condition = df["Cabin"].notna()
df[my_condition]
# or, shorter: df[df["Cabin"].notna()]

In [None]:
# filter by available data in *both* Age and Cabin columns 
df[(df['Age'].notna()) & (df['Cabin'].notna())]

In [None]:
# calling notna on entire dataframe:
df.notna()

### What if we don't want missing data ANYwhere?

`.dropna()`

has the optional parameter `axis`, default=0

In [None]:
# calling docstring for help! Note we have to spell out the object
# that the dropna method is applied to:
#?dropna # doesn't work
#?pd.dropna # doesn't work
?pd.DataFrame.dropna 
# works!

In [None]:
df.dropna() # drops all ROWS with missing values (in ANY of the columns)

In [None]:
df.dropna(axis=0) # drops all ROWS with missing values (in ANY of the columns) 
# (same as cell before; axis=0 is default value of optional parameter "axis")

In [None]:
df.dropna(axis=1) # drops all COLUMNS with missing values (in ANY of the rows)
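
The `axis` variants side by side on a made-up frame, as a sketch. It also uses the `subset` parameter of `.dropna()`, which is not covered above: it restricts the check to the listed columns. Note that none of these calls changes `toy` itself; each returns a new data frame.

```python
import pandas as pd

# hypothetical toy data with missing values in two columns
toy = pd.DataFrame({
    "Age":   [22.0, None, 35.0],
    "Cabin": ["C85", "B42", None],
    "Fare":  [7.25, 30.0, 8.05],
})

complete_rows = toy.dropna()              # keeps only rows without any missing value
complete_cols = toy.dropna(axis=1)        # keeps only columns without any missing value
age_known = toy.dropna(subset=["Age"])    # only checks the "Age" column

print(complete_rows.shape, complete_cols.shape, age_known.shape)
```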

# How to *replace* missing values?

rather than dropping entire rows/columns...

`.fillna()`

In [19]:
df.fillna(0)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,0,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,0,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,0,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,0,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,0.0,1,2,W./C. 6607,23.4500,0,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [20]:
df.fillna("no-info")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,no-info,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,no-info,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,no-info,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,no-info,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,no-info,1,2,W./C. 6607,23.4500,no-info,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
df["Cabin"] = df["Cabin"].fillna("no-cabin") # this line of code changes the df!!
df
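
Filling every column with the same value (like `0` or `"no-info"` above) mixes numbers and text. As a sketch of an alternative, `.fillna()` also accepts a dict that maps column labels to fill values. The toy data and the choice of the mean age as fill value are assumptions made for illustration only.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    "Age":   [22.0, None, 35.0],
    "Cabin": ["C85", None, None],
})

# dict: column label -> fill value, so each column gets a fitting replacement
filled = toy.fillna({"Age": toy["Age"].mean(), "Cabin": "no-cabin"})

print(filled)
```

As before, `toy` stays unchanged unless the result is assigned back.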

## Changing values in a cell

#### `df.loc[rowlabel, columnlabel] = new_value`

In [None]:
# this will change the value in the cell of the first row, column "Name"
df.loc[0, "Name"] = "S.O.S" # assigning CHANGES the object!
df.head()

In [None]:
# this will change the value in the cells of the first 3 rows, column "Name"
df.loc[[0,1,2], "Name"] = "Not Me Please!" # assigning CHANGES the object!
df.head()
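
One thing to keep in mind: `.loc` works with row *labels*, not positions. With the default index the label `0` happens to be the first row, but after filtering or sorting that is no longer guaranteed. A sketch with a made-up frame whose index labels are 10, 20, 30:

```python
import pandas as pd

# hypothetical toy data with a non-default index
toy = pd.DataFrame({"Name": ["A", "B", "C"]}, index=[10, 20, 30])

toy.loc[20, "Name"] = "S.O.S"                   # label 20 is the second row
toy.loc[[10, 30], "Name"] = "Not Me Please!"    # a list of labels sets several cells at once

print(toy)
```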

***

## Replacing values "dynamically"

`.replace()`

In [None]:
?replace # doesn't work as written: correct this line to see the docstring!

In [21]:
df.replace(to_replace=0, value="zero")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,zero,3,"Braund, Mr. Owen Harris",male,22.0,1,zero,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,zero,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,zero,zero,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,zero,113803,53.1,C123,S
4,5,zero,3,"Allen, Mr. William Henry",male,35.0,zero,zero,373450,8.05,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,zero,2,"Montvila, Rev. Juozas",male,27.0,zero,zero,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,zero,zero,112053,30.0,B42,S
888,889,zero,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,zero,zero,111369,30.0,C148,C
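
The call above replaces `0` in *every* column. A sketch of a more targeted variant: calling `.replace()` on a single column with a dict that maps old values to new ones. The toy data and the labels `"no"`/`"yes"` are made up for illustration.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({"Survived": [0, 1, 0], "Parch": [0, 2, 0]})

# replace only within one column, mapping several old values at once
labels = toy["Survived"].replace({0: "no", 1: "yes"})

print(labels.tolist())
print(toy["Parch"].tolist())   # the other column (and toy itself) is untouched
```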


***

## SettingWithCopyWarning

... and how to debug

In [None]:
# let's add a column, "lifeboat"...
df["lifeboat"] = None
# and let's assume all kids were taken to a lifeboat:
df[df["Age"]<18] # for these rows, we want lifeboat=True
df[df["Age"]<18]["lifeboat"] # so we need to set all these values to True
df[df["Age"]<18]["lifeboat"] = True
df.head(20) # why didn't it work?

# Try it out yourself!

This line of code 

```python
df[df["Age"]<18]["lifeboat"] = True
```

raised this warning:
```
/var/folders/66/3jkth_7d5gggg6pyr8yywwt40000gn/T/ipykernel_93168/2070943703.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df["Age"]<18]["lifeboat"] = True
```

Use the documentation (link in the warning message!) & google/StackOverflow to figure out how to *correct* the line of code.

In [None]:
# YOUR DEBUGGED CODE HERE
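
One possible fix, sketched on a made-up frame rather than the Titanic data: follow the hint in the warning and use a single `.loc[row_indexer, col_indexer]` call, so that rows and column are selected in one step and the assignment reaches the original data frame instead of a temporary copy.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({"Age": [4.0, 30.0, 16.0]})
toy["lifeboat"] = None

# one .loc call: boolean mask for the rows, label for the column
toy.loc[toy["Age"] < 18, "lifeboat"] = True

print(toy)
```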

***

## GROUPING 😱


In [None]:
grouping = df.groupby("Sex")
grouping

In [None]:
for group in grouping:
    print(group)
    #print(len(group))
    #print(type(group))

In [None]:
for a, b in grouping:
    print(a)

In [None]:
for a, b in grouping:
    print(b)

In [None]:
group_list = []
for a, b in grouping:
    group_list.append(b)
group_list[0]
group_list[1]

In [None]:
grouping["Age"].mean()

In [None]:
grouping["Fare"].mean()

In [None]:
grouping.Age.max()
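
A compact sketch of the two ideas above on made-up data: iterating over a grouping yields `(key, sub-DataFrame)` pairs, and a selected column can be summarised per group. The `.agg()` call with a list of function names is an addition not shown in the cells above; it computes several statistics at once.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    "Sex":  ["female", "male", "female", "male"],
    "Age":  [20.0, 30.0, 40.0, 50.0],
    "Fare": [10.0, 20.0, 30.0, 40.0],
})

grouping = toy.groupby("Sex")

# each iteration yields a (key, sub-DataFrame) tuple
sizes = {key: len(group) for key, group in grouping}

# several statistics for one column in a single call
age_stats = grouping["Age"].agg(["mean", "max"])

print(sizes)
print(age_stats)
```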

# Try it out yourself!

Use `.groupby()` to compute the mean ages of Survived vs. not-Survived passengers.

There are many ways to solve this, but you could do it with one single line of code!

In [23]:
# YOUR CODE HERE

grouping = df.groupby("Survived")["Age"].mean()
grouping 

Survived
0    30.626179
1    28.343690
Name: Age, dtype: float64

***

# Binary search...

(PDF slides & `.py` file shown in class)
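
For reference, a minimal sketch of an iterative binary search on a sorted list. This is not necessarily the version from the slides or the `.py` file; the function name and the convention of returning `-1` for "not found" are assumptions made here.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2            # middle of the remaining range
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1                  # target can only be in the right half
        else:
            high = mid - 1                 # target can only be in the left half
    return -1


numbers = [2, 5, 8, 13, 21, 34]            # must already be sorted
found_at = binary_search(numbers, 13)
not_found = binary_search(numbers, 7)
print(found_at, not_found)
```

Each step halves the remaining range, which is why the list has to be sorted beforehand.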