# Introduction to Python
# Module 5 -  Data Manipulation and Analysis with Pandas

Instructor: Suyong Song 

Topics to be covered:
- Pandas series (+ exercises)
- Loading a dataset as a Pandas dataframe
- Element selection from a dataframe (+ exercises)
- Handling null values in a dataframe (+ exercises)
- Iteration over a dataframe
- Aggregation and grouping of a dataframe  (+ exercises)

Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation package in Python.

## Data Structures in Pandas

In [1]:
from IPython.display import Image

Image(url="https://cdn-images-1.medium.com/max/800/0*PWbW0OdJJw49kxMt.png")

A series is a one-dimensional labeled array. Its array of labels is called the index.

In [2]:
Image(url="https://cdn-images-1.medium.com/max/800/0*dddYH8GijZanG4dO.png")

A dataframe extends a series to two dimensions. It has two index arrays: a row index, called the index, and a column index, called the columns. A dataframe is, in effect, a collection of multiple series that share the same index.
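As a minimal sketch of this idea, the cell below builds a dataframe from two series that share an index. The labels `p1` to `p3` and the values are made up for illustration only.

```python
import pandas as pd

# Two series with the same index labels (hypothetical values).
age = pd.Series([22.0, 38.0, 26.0], index=["p1", "p2", "p3"])
fare = pd.Series([7.25, 71.28, 7.93], index=["p1", "p2", "p3"])

# Each dictionary key becomes a column label; the shared labels become the row index.
people = pd.DataFrame({"age": age, "fare": fare})

print(people.index.tolist())     # row labels
print(people.columns.tolist())   # column labels
print(type(people["age"]))       # a single column is a series
```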

## Importing the Pandas package

In [3]:
# ! pip install --user --upgrade pandas

In [1]:
import pandas as pd

## Pandas Series

In [2]:
data = range(10, 101, 10)
data

range(10, 101, 10)

In [3]:
series = pd.Series(data=data)     # Create a Pandas series from a range object.
series

0     10
1     20
2     30
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64

pandas.Series: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

The right-hand column shows the values of the series, and the left-hand column shows its index. If you do not specify an index when creating a series, Pandas assigns integers starting at 0 and increasing by 1 as the index by default.

In [4]:
type(series)

pandas.core.series.Series

In [5]:
series.index

RangeIndex(start=0, stop=10, step=1)

In [6]:
list(series.index)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [7]:
series.values

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100], dtype=int64)

## Selecting Elements from a Series

Selecting elements from a series with the default integer index works much like selecting elements from a NumPy array.

In [8]:
series

0     10
1     20
2     30
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64

In [9]:
series[0]

10

In [13]:
series[:3]

0    10
1    20
2    30
dtype: int64

In [14]:
series[-3:]

7     80
8     90
9    100
dtype: int64

In [15]:
for num in series:
    print(num)

10
20
30
40
50
60
70
80
90
100


When iterating over a series, only the values are exposed. The index is not exposed, which means the index is only used for accessing elements in a series. 

In [16]:
index = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]     
series2 = pd.Series(data=data, index=index)                  # Use letters of the alphabet as the index of a series.
series2

a     10
b     20
c     30
d     40
e     50
f     60
g     70
h     80
i     90
j    100
dtype: int64

Often it is preferable to create a series using meaningful labels, instead of numbers, in order to distinguish and identify each item regardless of the order in which they were inserted into the series. 

In [17]:
series2.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

In [18]:
series2.values

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])

In [19]:
series2["a"]

10

You can select an individual element by specifying its index label.

In [20]:
series2.a

10

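`series.a` is equivalent to `series["a"]` when the label `a` is a string that is a valid Python identifier and does not clash with the name of an existing series attribute or method.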

In [21]:
series2[0]

10

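Selecting by integer position still works here, although newer versions of Pandas discourage plain integer keys on a label-indexed series; `series2.iloc[0]` makes the positional intent explicit.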
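To keep label-based and position-based selection apart, a series also offers the `loc` and `iloc` accessors. The sketch below uses a small stand-in series rather than `series2`.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s.loc["b"]            # look up by index label
by_position = s.iloc[1]          # look up by 0-based position

label_slice = s.loc["a":"c"]     # label slices include the end label
position_slice = s.iloc[0:2]     # position slices exclude the end position

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

Note the difference between the two slices: slicing by label keeps the end label, while slicing by position follows the usual Python convention and stops before the end.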

In [22]:
series2[:3]

a    10
b    20
c    30
dtype: int64

In [23]:
series2[["a", "b", "c"]]

a    10
b    20
c    30
dtype: int64

Fancy indexing, that is, passing a list of labels, also works for Pandas series and dataframes.

In [24]:
for num in series2:
    print(num)

10
20
30
40
50
60
70
80
90
100


In [25]:
index = ["a", "b", "c", "d", "e", "a", "b", "c", "d", "e"]
series3 = pd.Series(data=data, index=index)
series3

a     10
b     20
c     30
d     40
e     50
a     60
b     70
c     80
d     90
e    100
dtype: int64

Index labels do not have to be unique. Selecting a duplicated label returns every element that carries it.

In [26]:
series3["a"]

a    10
a    60
dtype: int64
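Because a duplicated label changes what a lookup returns, it can help to check the index before relying on it. The following is a small sketch on a stand-in series, not on `series3` itself.

```python
import pandas as pd

s = pd.Series([10, 20, 60], index=["a", "b", "a"])

print(s.index.is_unique)       # False, because "a" appears twice
print(s.index.duplicated())    # marks each repeat of an earlier label

unique_result = s.loc["b"]     # a single value: the label occurs once
dup_result = s.loc["a"]        # a series: the label occurs twice
print(type(dup_result))
```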

## Exercises for Selecting Elements from a Series

## Loading a Dataset as a Pandas Dataframe

In [10]:
from seaborn import load_dataset
df = load_dataset("titanic")
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [11]:
df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=10, random_state=1)  # Reduce the data: keep 5 columns and 10 randomly sampled rows (random seed = 1).
df

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5
680,0,3,female,,8.1375
535,1,2,female,7.0,26.25
623,0,3,male,21.0,7.8542
148,0,2,male,36.5,26.0
3,1,1,female,35.0,53.1
34,0,1,male,28.0,82.1708
241,1,3,female,,15.5


In [12]:
df.shape     # 10 rows and 5 columns

(10, 5)

In [13]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'fare'], dtype='object')

In [14]:
df.index

Int64Index([862, 223, 84, 680, 535, 623, 148, 3, 34, 241], dtype='int64')

The index labels are not in order because the 10 rows were randomly selected.

In [15]:
df.values

array([[1, 1, 'female', 48.0, 25.9292],
       [0, 3, 'male', nan, 7.8958],
       [1, 2, 'female', 17.0, 10.5],
       [0, 3, 'female', nan, 8.1375],
       [1, 2, 'female', 7.0, 26.25],
       [0, 3, 'male', 21.0, 7.8542],
       [0, 2, 'male', 36.5, 26.0],
       [1, 1, 'female', 35.0, 53.1],
       [0, 1, 'male', 28.0, 82.1708],
       [1, 3, 'female', nan, 15.5]], dtype=object)

In [16]:
len(df)

10

The length of a dataframe refers to the number of rows in the dataframe. 

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 862 to 241
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  10 non-null     int64  
 1   pclass    10 non-null     int64  
 2   sex       10 non-null     object 
 3   age       7 non-null      float64
 4   fare      10 non-null     float64
dtypes: float64(2), int64(2), object(1)
memory usage: 480.0+ bytes


pandas.DataFrame.info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

The <b>info</b> method shows a concise summary of a dataframe.

In [18]:
df.head()          # Returns the first 5 rows.

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5
680,0,3,female,,8.1375
535,1,2,female,7.0,26.25


pandas.DataFrame.head: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html

In [19]:
df.head(10)

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5
680,0,3,female,,8.1375
535,1,2,female,7.0,26.25
623,0,3,male,21.0,7.8542
148,0,2,male,36.5,26.0
3,1,1,female,35.0,53.1
34,0,1,male,28.0,82.1708
241,1,3,female,,15.5


In [20]:
df.tail()          # Returns the last 5 rows. 

Unnamed: 0,survived,pclass,sex,age,fare
623,0,3,male,21.0,7.8542
148,0,2,male,36.5,26.0
3,1,1,female,35.0,53.1
34,0,1,male,28.0,82.1708
241,1,3,female,,15.5


pandas.DataFrame.tail: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html

When you have a new dataset, it is always a good idea to start by looking at the first and last few rows to get a sense of what the entire dataset would look like.

## Selecting Elements from a Dataframe

In [21]:
df["age"]          # Returns all rows in the column age.

862    48.0
223     NaN
84     17.0
680     NaN
535     7.0
623    21.0
148    36.5
3      35.0
34     28.0
241     NaN
Name: age, dtype: float64

NaN stands for "Not a Number" and marks a null (missing) value in a series.

In [22]:
df.age

862    48.0
223     NaN
84     17.0
680     NaN
535     7.0
623    21.0
148    36.5
3      35.0
34     28.0
241     NaN
Name: age, dtype: float64

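`df.a` is equivalent to `df["a"]` when the column label `a` is a string that is a valid Python identifier and does not clash with the name of an existing dataframe attribute or method.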

In [23]:
type(df.age)

pandas.core.series.Series

A column in a dataframe is in fact a series.

In [24]:
df.age.index

Int64Index([862, 223, 84, 680, 535, 623, 148, 3, 34, 241], dtype='int64')

In [25]:
df.age.values

array([48. ,  nan, 17. ,  nan,  7. , 21. , 36.5, 35. , 28. ,  nan])

In [26]:
for num in df.age:
    print(num)

48.0
nan
17.0
nan
7.0
21.0
36.5
35.0
28.0
nan


In [27]:
df[0]

KeyError: 0

In [28]:
df[:3]                 # Returns the first 3 rows.

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5


In [29]:
df[-3:]

Unnamed: 0,survived,pclass,sex,age,fare
3,1,1,female,35.0,53.1
34,0,1,male,28.0,82.1708
241,1,3,female,,15.5


In [46]:
df["age"][862]         # Returns the element with row label 862 in the age column.

48.0

Note that you should look up the column label first, followed by the row label, each in its own pair of brackets.

In [47]:
df[862]["age"]

KeyError: 862
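An alternative that avoids chaining two pairs of brackets is the `loc` accessor, which takes the row label and the column label together. The sketch below uses a small stand-in dataframe with three of the rows shown above.

```python
import pandas as pd

small = pd.DataFrame(
    {"age": [48.0, None, 17.0], "fare": [25.9292, 7.8958, 10.5]},
    index=[862, 223, 84],
)

# loc takes labels: row label first, then column label.
value = small.loc[862, "age"]

# iloc takes positions: row position first, then column position.
first = small.iloc[0, 0]

print(value, first)
```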

In [30]:
df["age"][:3]          # Returns the first 3 rows in the age column.

862    48.0
223     NaN
84     17.0
Name: age, dtype: float64

In [31]:
df.iloc[0]             # Returns the first row.

survived          1
pclass            1
sex          female
age            48.0
fare        25.9292
Name: 862, dtype: object

<b>iloc</b> stands for integer location: it selects by position rather than by label. If there is only one argument inside the square brackets, it refers to the row position.

In [32]:
type(df.iloc[0])

pandas.core.series.Series

A row in a dataframe is a series too, just as a column in a dataframe is a series. In other words, a dataframe is a 2D collection of series. 

In [33]:
df.iloc[:, 0]          # Returns all rows in the first column.

862    1
223    0
84     1
680    0
535    1
623    0
148    0
3      1
34     0
241    1
Name: survived, dtype: int64

If there are two arguments inside the square brackets, the first refers to the rows and the second to the columns. Note that with <b>iloc</b> you give the row positions first and then the column positions, both inside a single pair of square brackets.

In [34]:
df.iloc[:, :2]         # Returns all rows in the first 2 columns. 

Unnamed: 0,survived,pclass
862,1,1
223,0,3
84,1,2
680,0,3
535,1,2
623,0,3
148,0,2
3,1,1
34,0,1
241,1,3


In [35]:
df.iloc[:3, :2]        # Returns the first 3 rows in the first 2 columns.

Unnamed: 0,survived,pclass
862,1,1
223,0,3
84,1,2


In [36]:
df.iloc[-3:, -2:]      # Returns the last 3 rows in the last 2 columns.

Unnamed: 0,age,fare
3,35.0,53.1
34,28.0,82.1708
241,,15.5


In [37]:
df[df.age > 30]        # Sets a condition for filtering.

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
148,0,2,male,36.5,26.0
3,1,1,female,35.0,53.1


Masking, or Boolean indexing, is used to select a subset of rows from a dataframe. 

In [38]:
df[age > 30]

NameError: name 'age' is not defined

Make sure to put `df.` before the column name `age`. 

In [39]:
df[(df.age > 30) & (df.pclass == 1)]

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
3,1,1,female,35.0,53.1


In [40]:
df[(df.age > 30) | (df.pclass == 1)]

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
148,0,2,male,36.5,26.0
3,1,1,female,35.0,53.1
34,0,1,male,28.0,82.1708
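The conditions above are ordinary Boolean series, so they can be stored in variables and combined with `&` (and), `|` (or) and `~` (not). The sketch below uses a small stand-in dataframe; the variable names are illustrative.

```python
import pandas as pd

small = pd.DataFrame(
    {"pclass": [1, 3, 2, 1], "age": [48.0, None, 36.5, 28.0]},
    index=[862, 223, 148, 34],
)

older = small["age"] > 30            # a comparison with NaN evaluates to False
first_class = small["pclass"] == 1

both = small[older & first_class]    # rows where both conditions hold
either = small[older | first_class]  # rows where at least one condition holds
not_first = small[~first_class]      # rows where the condition does not hold

print(both.index.tolist())
print(either.index.tolist())
print(not_first.index.tolist())
```

When the conditions are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than comparison operators such as `>` and `==`.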


In [41]:
df[["survived", "sex"]]

Unnamed: 0,survived,sex
862,1,female
223,0,male
84,1,female
680,0,female
535,1,female
623,0,male
148,0,male
3,1,female
34,0,male
241,1,female


To select all rows in certain columns, pass a list of column names inside the brackets.

In [60]:
df["survived", "sex"]

KeyError: ('survived', 'sex')

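Make sure to pass the column names as a list, which requires a second pair of brackets; otherwise Pandas looks for a single column labeled with the tuple `('survived', 'sex')`.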

In [42]:
df.drop("fare", axis=1)    # Returns a new copy of df with the column fare dropped. 

Unnamed: 0,survived,pclass,sex,age
862,1,1,female,48.0
223,0,3,male,
84,1,2,female,17.0
680,0,3,female,
535,1,2,female,7.0
623,0,3,male,21.0
148,0,2,male,36.5
3,1,1,female,35.0
34,0,1,male,28.0
241,1,3,female,


pandas.DataFrame.drop: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

The <b>drop</b> method removes rows or columns by label. The <b>axis</b> parameter specifies whether to drop labels from the index (0 or ‘index’) or from the columns (1 or ‘columns’). In the example above, the axis must be set to 1 so that the method looks for `fare` among the column labels. Note that the <b>drop</b> method returns a new dataframe and leaves the target dataframe unchanged.
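The following sketch, on a small stand-in dataframe, shows that <b>drop</b> leaves the original untouched. It also uses the `columns=` and `index=` keywords, which are an alternative to passing `axis`.

```python
import pandas as pd

small = pd.DataFrame({"age": [48.0, 17.0], "fare": [25.9, 10.5]}, index=[862, 84])

without_fare = small.drop(columns="fare")   # same as small.drop("fare", axis=1)
without_row = small.drop(index=862)         # same as small.drop(862, axis=0)

print(small.columns.tolist())          # the original still has both columns
print(without_fare.columns.tolist())
print(without_row.index.tolist())
```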

In [43]:
df.drop(862, axis=0)    # Returns a new copy of df with the row 862 dropped. 

Unnamed: 0,survived,pclass,sex,age,fare
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5
680,0,3,female,,8.1375
535,1,2,female,7.0,26.25
623,0,3,male,21.0,7.8542
148,0,2,male,36.5,26.0
3,1,1,female,35.0,53.1
34,0,1,male,28.0,82.1708
241,1,3,female,,15.5


In [63]:
from IPython.display import Image
Image(url="https://i.stack.imgur.com/DL0iQ.jpg")

In NumPy and Pandas, axis 0 refers to the row axis, while axis 1 to the column axis.

In [44]:
df.drop(["fare", "sex"], axis=1)

Unnamed: 0,survived,pclass,age
862,1,1,48.0
223,0,3,
84,1,2,17.0
680,0,3,
535,1,2,7.0
623,0,3,21.0
148,0,2,36.5
3,1,1,35.0
34,0,1,28.0
241,1,3,


In [45]:
# df = df.drop("fare", axis=1)

To actually delete the column `fare` from `df`, save the resulting dataframe from the <b>drop</b> method back in `df`. 

In [46]:
del df["fare"]

Another way to delete a column is to use the <b>del</b> statement. Unlike <b>drop</b>, this modifies the dataframe in place.

In [47]:
df.columns

Index(['survived', 'pclass', 'sex', 'age'], dtype='object')

## Exercises for Selecting Elements from a Dataframe

## Handling Missing Values in a Dataframe

In many cases, you need to handle null values, or missing values, in a dataframe. There are two approaches to consider:
- Drop the rows with null values (easy, but you lose data)
- Fill the null values with something else (no data is lost, but choose the fill value carefully)

In [68]:
df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=10, random_state=3)
df

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,,221.7792
157,0,3,male,30.0,8.05


In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 395 to 157
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  10 non-null     int64  
 1   pclass    10 non-null     int64  
 2   sex       10 non-null     object 
 3   age       8 non-null      float64
 4   fare      10 non-null     float64
dtypes: float64(2), int64(2), object(1)
memory usage: 480.0+ bytes


In [70]:
df.isnull()

Unnamed: 0,survived,pclass,sex,age,fare
395,False,False,False,False,False
85,False,False,False,False,False
201,False,False,False,True,False
542,False,False,False,False,False
702,False,False,False,False,False
51,False,False,False,False,False
237,False,False,False,False,False
548,False,False,False,False,False
527,False,False,False,True,False
157,False,False,False,False,False


pandas.DataFrame.isnull: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html

The <b>isnull</b> method returns which entries in a dataframe are null.

In [71]:
df.isnull().any(axis=1)

395    False
85     False
201     True
542    False
702    False
51     False
237    False
548    False
527     True
157    False
dtype: bool

pandas.DataFrame.any: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html

The <b>any</b> method returns whether any element is True over the specified axis.

In [72]:
mask = df.isnull().any(axis=1)
df[mask]     # Returns all rows with any null values.

Unnamed: 0,survived,pclass,sex,age,fare
201,0,3,male,,69.55
527,0,1,male,,221.7792
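Since True counts as 1 when summed, the same Boolean results can also be used to count null values. The following is a minimal sketch on a stand-in dataframe.

```python
import pandas as pd

small = pd.DataFrame(
    {"sex": ["male", "female", "male"], "age": [22.0, None, None]},
    index=[395, 85, 201],
)

nulls_per_column = small.isnull().sum()            # number of nulls in each column
rows_with_null = small.isnull().any(axis=1).sum()  # number of rows with at least one null

print(nulls_per_column)
print(rows_with_null)
```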


In [73]:
df1 = df.copy()
df1

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,,221.7792
157,0,3,male,30.0,8.05


pandas.DataFrame.copy: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html

The <b>copy</b> method makes a copy of a dataframe.

In [74]:
df1.dropna()

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
157,0,3,male,30.0,8.05


pandas.DataFrame.dropna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

The <b>dropna</b> method drops all rows with any null values in a dataframe. Note that it returns a copy of the target dataframe.

In [75]:
df1 = df1.dropna()

Note that the <b>dropna</b> method drops the entire row if there is any null value in the row. 

In [76]:
len(df1)

8

In [77]:
df2 = df.copy()
df2

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,,221.7792
157,0,3,male,30.0,8.05


In [78]:
df2 = df2.dropna(how="all")
df2

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,,221.7792
157,0,3,male,30.0,8.05


If you set the parameter `how` to *all*, a row is dropped only when every value in it is null, so no rows are dropped here. The default is *any*.
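To make the difference between the two settings visible, the sketch below uses a stand-in dataframe that contains one entirely empty row (the label 999 is made up). It also shows the `subset` parameter, which limits the null check to the listed columns.

```python
import pandas as pd

small = pd.DataFrame(
    {"age": [22.0, None, 30.0, None], "fare": [7.8, 69.55, None, None]},
    index=[395, 201, 527, 999],
)

drop_any = small.dropna()                  # default how="any": drop rows with any null
drop_all = small.dropna(how="all")         # drop only rows where every value is null
drop_age = small.dropna(subset=["age"])    # only check the age column for nulls

print(drop_any.index.tolist())
print(drop_all.index.tolist())
print(drop_age.index.tolist())
```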

In [79]:
df3 = df.copy()
df3

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,,221.7792
157,0,3,male,30.0,8.05


In [80]:
df3.age = df3.age.fillna(value=0)        # Targeted at a specific column
df3

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,0.0,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,0.0,221.7792
157,0,3,male,30.0,8.05


pandas.DataFrame.fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

The <b>fillna</b> method fills null values. The `value` parameter specifies the replacement value.

Also, you need to determine whether you are going to handle the whole dataframe or one or more columns. 

In [81]:
df4 = df.copy()
df4

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,,221.7792
157,0,3,male,30.0,8.05


In [82]:
df4 = df4.fillna(value={"survived": df.survived.mean(), "pclass": df.pclass.mean(), "sex": "unknown",
                        "age": df.age.mean(), "fare": df.fare.mean()})
df4

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,22.0,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,22.0,221.7792
157,0,3,male,30.0,8.05


Instead of filling all null values with the same value, you can fill with different values depending on the column, specifying one by one the columns and the values to be replaced in a dictionary. 
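The fill values in the dictionary do not have to be means. As a sketch on a stand-in dataframe, the cell below fills a text column with a placeholder and a numeric column with its median; the choice of the median is only an illustration.

```python
import pandas as pd

small = pd.DataFrame(
    {"sex": ["male", None, "female"], "age": [22.0, 30.0, None]},
)

# One fill value per column; columns not listed would be left as they are.
fill_values = {"sex": "unknown", "age": small["age"].median()}
filled = small.fillna(value=fill_values)

print(filled)
```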

In [83]:
df5 = df.copy()
df5

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,,221.7792
157,0,3,male,30.0,8.05


In [84]:
df5 = df5.fillna(method="ffill")
df5

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,33.0,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,33.0,221.7792
157,0,3,male,30.0,8.05


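If you set the parameter `method` to *ffill*, which means forward fill, each null value is replaced by the last non-null observation above it. Setting it to *bfill*, which means backward fill, uses the next non-null observation below it instead. Newer versions of Pandas deprecate the `method` parameter of <b>fillna</b> in favor of the dedicated <b>ffill</b> and <b>bfill</b> methods, for example `df5.ffill()`.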
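One detail worth noting is that a null value at the very start has nothing to fill forward from, and one at the very end has nothing to fill backward from. The sketch below uses the <b>ffill</b> and <b>bfill</b> methods on a stand-in series to show this.

```python
import pandas as pd

ages = pd.Series([None, 33.0, None, 11.0, None], dtype="float64")

forward = ages.ffill()    # the leading null stays null
backward = ages.bfill()   # the trailing null stays null

print(forward.tolist())
print(backward.tolist())
```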

In [85]:
df6 = df.copy()
df6

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,,221.7792
157,0,3,male,30.0,8.05


In [86]:
df6 = df6.fillna(method="bfill")
df6

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,11.0,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,30.0,221.7792
157,0,3,male,30.0,8.05


## Exercises for Handling Null Values

## Iteration over a Dataframe

There are multiple ways to iterate over a dataframe:  
- Using <b>iloc</b>
- Using the <b>iterrows</b> method to iterate over the rows as (index, series) pairs
- Using the <b>itertuples</b> method to iterate over the rows as named tuples

You can choose any of the three above depending on how you want to retrieve data from a dataframe.

In [87]:
df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=10, random_state=5)
df

Unnamed: 0,survived,pclass,sex,age,fare
126,0,3,male,,7.75
354,0,3,male,,7.225
590,0,3,male,35.0,7.125
509,1,3,male,26.0,56.4958
769,0,3,male,32.0,8.3625
545,0,1,male,64.0,26.0
759,1,1,female,33.0,86.5
261,1,3,male,3.0,31.3875
329,1,1,female,16.0,57.9792
349,0,3,male,42.0,8.6625


In [88]:
# Iterates by row
for i in range(len(df)):
    print(i, df.iloc[i].values)

0 [0 3 'male' nan 7.75]
1 [0 3 'male' nan 7.225]
2 [0 3 'male' 35.0 7.125]
3 [1 3 'male' 26.0 56.4958]
4 [0 3 'male' 32.0 8.3625]
5 [0 1 'male' 64.0 26.0]
6 [1 1 'female' 33.0 86.5]
7 [1 3 'male' 3.0 31.3875]
8 [1 1 'female' 16.0 57.9792]
9 [0 3 'male' 42.0 8.6625]


In [89]:
# Iterates over the rows as (index, series) pairs
for idx, series in df.iterrows():
    print(idx)
    print(series)
    print()

126
survived       0
pclass         3
sex         male
age          NaN
fare        7.75
Name: 126, dtype: object

354
survived        0
pclass          3
sex          male
age           NaN
fare        7.225
Name: 354, dtype: object

590
survived        0
pclass          3
sex          male
age          35.0
fare        7.125
Name: 590, dtype: object

509
survived          1
pclass            3
sex            male
age            26.0
fare        56.4958
Name: 509, dtype: object

769
survived         0
pclass           3
sex           male
age           32.0
fare        8.3625
Name: 769, dtype: object

545
survived       0
pclass         1
sex         male
age         64.0
fare        26.0
Name: 545, dtype: object

759
survived         1
pclass           1
sex         female
age           33.0
fare          86.5
Name: 759, dtype: object

261
survived          1
pclass            3
sex            male
age             3.0
fare        31.3875
Name: 261, dtype: object

329
survived        

In [90]:
# Iterates over the rows as (index, series) pairs
for idx, series in df.iterrows():
    print(idx)
    
    survived = series.survived
    pclass = series.pclass
    sex = series.sex
    age = series.age
    fare = series.fare
    print(survived, pclass, sex, age, fare)
    print()

126
0 3 male nan 7.75

354
0 3 male nan 7.225

590
0 3 male 35.0 7.125

509
1 3 male 26.0 56.4958

769
0 3 male 32.0 8.3625

545
0 1 male 64.0 26.0

759
1 1 female 33.0 86.5

261
1 3 male 3.0 31.3875

329
1 1 female 16.0 57.9792

349
0 3 male 42.0 8.6625



You can decompose each series at each iteration into a set of variables.

In [91]:
# Iterates over the rows as named tuples
for t in df.itertuples():
    print(t)

Pandas(Index=126, survived=0, pclass=3, sex='male', age=nan, fare=7.75)
Pandas(Index=354, survived=0, pclass=3, sex='male', age=nan, fare=7.225)
Pandas(Index=590, survived=0, pclass=3, sex='male', age=35.0, fare=7.125)
Pandas(Index=509, survived=1, pclass=3, sex='male', age=26.0, fare=56.4958)
Pandas(Index=769, survived=0, pclass=3, sex='male', age=32.0, fare=8.3625)
Pandas(Index=545, survived=0, pclass=1, sex='male', age=64.0, fare=26.0)
Pandas(Index=759, survived=1, pclass=1, sex='female', age=33.0, fare=86.5)
Pandas(Index=261, survived=1, pclass=3, sex='male', age=3.0, fare=31.3875)
Pandas(Index=329, survived=1, pclass=1, sex='female', age=16.0, fare=57.9792)
Pandas(Index=349, survived=0, pclass=3, sex='male', age=42.0, fare=8.6625)


In [92]:
# Iterates over the rows as a tuple
for index, survived, pclass, sex, age, fare in df.itertuples():
    print(index, survived, pclass, sex, age, fare)

126 0 3 male nan 7.75
354 0 3 male nan 7.225
590 0 3 male 35.0 7.125
509 1 3 male 26.0 56.4958
769 0 3 male 32.0 8.3625
545 0 1 male 64.0 26.0
759 1 1 female 33.0 86.5
261 1 3 male 3.0 31.3875
329 1 1 female 16.0 57.9792
349 0 3 male 42.0 8.6625


You can decompose each tuple at each iteration into a set of variables.
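Because <b>itertuples</b> yields named tuples, the fields can also be read by name instead of being unpacked by position. The following is a minimal sketch on a stand-in dataframe with two of the rows shown above.

```python
import pandas as pd

small = pd.DataFrame(
    {"sex": ["male", "female"], "fare": [7.75, 86.5]},
    index=[126, 759],
)

lines = []
for row in small.itertuples():
    # row.Index holds the row label; the other fields are named after the columns.
    lines.append(f"{row.Index}: {row.sex} paid {row.fare}")

print(lines)
```

Reading fields by name keeps the loop working even if columns are later added or reordered, which positional unpacking does not.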

## Aggregation and Grouping of a Dataframe

In [93]:
df = load_dataset("titanic")
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [94]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [95]:
df.tail()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


In [96]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [97]:
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


pandas.DataFrame.describe: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

The <b>describe</b> method generates various descriptive summary statistics of a series or a dataframe.

In [98]:
df.mean(numeric_only=True)        # Returns the mean of each numeric or Boolean column; newer Pandas versions need numeric_only=True to skip text columns.

survived       0.383838
pclass         2.308642
age           29.699118
sibsp          0.523008
parch          0.381594
fare          32.204208
adult_male     0.602694
alone          0.602694
dtype: float64

In [99]:
df.age.mean()                     # Returns the mean age. 

29.69911764705882

In [100]:
df["age"].mean()

29.69911764705882

You can specify the column you are interested in.

In [101]:
df.age.min()                      # Returns the minimum age.

0.42

In [102]:
df.age.max()                      # Returns the maximum age.

80.0

In [103]:
df.age.std()                      # Returns the standard deviation of age.

14.526497332334044

In [104]:
df[df.pclass == 1].age.mean()     # Returns the mean age of the first class passengers. 

38.233440860215055

If you are only interested in a subset of rows, first select the rows using a mask and then apply the aggregation to the result.

In [105]:
df[(df.pclass == 1) & (df.survived == 0)].fare.mean()  # Returns the mean fare of the first class passengers who died. 

64.68400750000002

In [106]:
df.groupby("sex").mean(numeric_only=True)

sex,survived,pclass,age,sibsp,parch,fare,adult_male,alone
female,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818,0.0,0.401274
male,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893,0.930676,0.712305


pandas.DataFrame.groupby: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

The <b>groupby</b> method groups a dataframe by one or more columns. It must be followed by the operation you want to apply to each group.
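The operation after grouping does not have to be a single statistic. As a sketch, the <b>agg</b> method below computes several statistics per group on a stand-in dataframe; the output column names `mean_age`, `max_fare` and `passengers` are illustrative choices.

```python
import pandas as pd

small = pd.DataFrame(
    {
        "sex": ["male", "female", "male", "female"],
        "age": [22.0, 38.0, 35.0, 26.0],
        "fare": [7.25, 71.28, 8.05, 7.93],
    }
)

# Each keyword is an output column: (input column, aggregation function).
summary = small.groupby("sex").agg(
    mean_age=("age", "mean"),
    max_fare=("fare", "max"),
    passengers=("age", "count"),
)

print(summary)
```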

In [107]:
df.groupby("class").mean(numeric_only=True)

class,survived,pclass,age,sibsp,parch,fare,adult_male,alone
First,0.62963,1.0,38.233441,0.416667,0.356481,84.154687,0.550926,0.50463
Second,0.472826,2.0,29.87763,0.402174,0.380435,20.662183,0.538043,0.565217
Third,0.242363,3.0,25.14062,0.615071,0.393075,13.67555,0.649695,0.659878


In [108]:
df.groupby("class").age.mean()

class
First     38.233441
Second    29.877630
Third     25.140620
Name: age, dtype: float64

In [109]:
df.groupby(["sex", "class"]).age.mean()

sex     class 
female  First     34.611765
        Second    28.722973
        Third     21.750000
male    First     41.281386
        Second    30.740707
        Third     26.507589
Name: age, dtype: float64

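You can group by multiple columns. In this case, the order of the columns matters: the first column becomes the outer level of the resulting index and the second column the inner level.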
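The sketch below groups a stand-in dataframe by the same two columns in both orders. The group means are identical, but the levels of the resulting index are swapped.

```python
import pandas as pd

small = pd.DataFrame(
    {
        "sex": ["male", "female", "male", "female"],
        "pclass": [1, 1, 3, 3],
        "age": [40.0, 30.0, 20.0, 10.0],
    }
)

by_sex_class = small.groupby(["sex", "pclass"])["age"].mean()
by_class_sex = small.groupby(["pclass", "sex"])["age"].mean()

print(by_sex_class.index.tolist())   # sex is the outer level
print(by_class_sex.index.tolist())   # pclass is the outer level
```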

In [110]:
df.sort_values(by="fare", ascending=False)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
258,1,1,female,35.0,0,0,512.3292,C,First,woman,False,,Cherbourg,yes,True
737,1,1,male,35.0,0,0,512.3292,C,First,man,True,B,Cherbourg,yes,True
679,1,1,male,36.0,0,1,512.3292,C,First,man,True,B,Cherbourg,yes,False
88,1,1,female,23.0,3,2,263.0000,S,First,woman,False,C,Southampton,yes,False
27,0,1,male,19.0,3,2,263.0000,S,First,man,True,C,Southampton,no,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
633,0,1,male,,0,0,0.0000,S,First,man,True,,Southampton,no,True
413,0,2,male,,0,0,0.0000,S,Second,man,True,,Southampton,no,True
822,0,1,male,38.0,0,0,0.0000,S,First,man,True,,Southampton,no,True
732,0,2,male,,0,0,0.0000,S,Second,man,True,,Southampton,no,True


pandas.DataFrame.sort_values: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html

The <b>sort_values</b> method sorts a dataframe by one or more columns.

In [111]:
df.sort_values(by=["fare", "age"], ascending=[False, True])

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
258,1,1,female,35.0,0,0,512.3292,C,First,woman,False,,Cherbourg,yes,True
737,1,1,male,35.0,0,0,512.3292,C,First,man,True,B,Cherbourg,yes,True
679,1,1,male,36.0,0,1,512.3292,C,First,man,True,B,Cherbourg,yes,False
27,0,1,male,19.0,3,2,263.0000,S,First,man,True,C,Southampton,no,False
88,1,1,female,23.0,3,2,263.0000,S,First,woman,False,C,Southampton,yes,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
481,0,2,male,,0,0,0.0000,S,Second,man,True,,Southampton,no,True
633,0,1,male,,0,0,0.0000,S,First,man,True,,Southampton,no,True
674,0,2,male,,0,0,0.0000,S,Second,man,True,,Southampton,no,True
732,0,2,male,,0,0,0.0000,S,Second,man,True,,Southampton,no,True


You can sort by multiple columns. Again, the order of the columns matters: later columns are only used to break ties in earlier ones.
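The tie-breaking role of the second column is easier to see on a small stand-in dataframe, where two rows share the same fare:

```python
import pandas as pd

small = pd.DataFrame(
    {"fare": [263.0, 512.3, 263.0], "age": [23.0, 35.0, 19.0]},
    index=[88, 258, 27],
)

# Highest fare first; among equal fares, youngest first.
ordered = small.sort_values(by=["fare", "age"], ascending=[False, True])

print(ordered.index.tolist())
```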

In [112]:
df.sample(n=10, replace=False, random_state=0)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
495,0,3,male,,0,0,14.4583,C,Third,man,True,,Cherbourg,no,True
648,0,3,male,,0,0,7.55,S,Third,man,True,,Southampton,no,True
278,0,3,male,7.0,4,1,29.125,Q,Third,child,False,,Queenstown,no,False
31,1,1,female,,1,0,146.5208,C,First,woman,False,B,Cherbourg,yes,False
255,1,3,female,29.0,0,2,15.2458,C,Third,woman,False,,Cherbourg,yes,False
298,1,1,male,,0,0,30.5,S,First,man,True,C,Southampton,yes,True
609,1,1,female,40.0,0,0,153.4625,S,First,woman,False,C,Southampton,yes,True
318,1,1,female,31.0,0,2,164.8667,S,First,woman,False,C,Southampton,yes,False
484,1,1,male,25.0,1,0,91.0792,C,First,man,True,B,Cherbourg,yes,False
367,1,3,female,,0,0,7.2292,C,Third,woman,False,,Cherbourg,yes,True


pandas.DataFrame.sample: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html

The <b>sample</b> method returns a random sample of rows from a dataframe. The default value for the `replace` parameter is False, which means each row can be drawn at most once.
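As a sketch of how `replace` and `random_state` behave, the cell below samples from a five-row stand-in dataframe. Drawing eight rows from five is only possible with replacement, so repeats are guaranteed in that case.

```python
import pandas as pd

small = pd.DataFrame({"x": range(5)})

without = small.sample(n=5, replace=False, random_state=0)   # a shuffle: every row exactly once
with_repl = small.sample(n=8, replace=True, random_state=0)  # rows may repeat
same = small.sample(n=5, random_state=0)                     # same seed, same rows

print(without.index.tolist())
print(with_repl.index.tolist())
```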

## Exercises for Aggregation and Grouping