# Data Programming in Python | BAIS:6040
# Pandas Basics

Instructor: Jeff Hendricks 

Topics to be covered:
- Pandas series (+ exercises)
- Loading a dataset as a Pandas dataframe
- Element selection from a dataframe (+ exercises)
- Handling null values in a dataframe (+ exercises)
- Iteration over a dataframe
- Aggregation and grouping of a dataframe  (+ exercises)

References: 
- Pandas official website (http://pandas.pydata.org/) 
- Python Data Science Handbook by Jake VanderPlas (http://shop.oreilly.com/product/0636920034919.do)
- Python for Data Analysis by Wes McKinney (https://www.oreilly.com/library/view/python-for-data/9781491957653/)

## ▪ Data Structures in Pandas

In [None]:
from IPython.display import Image

Image(url="https://cdn-images-1.medium.com/max/800/0*PWbW0OdJJw49kxMt.png")

A series is a one-dimensional labeled array. Its labels are stored in an index array, which is called simply the index.

In [None]:
Image(url="https://cdn-images-1.medium.com/max/800/0*dddYH8GijZanG4dO.png")

A dataframe extends a series to two dimensions. It has two index arrays: a row index, called simply the index, and a column index, called columns. A dataframe is in fact a collection of multiple series that share the same index.

In [None]:
from IPython.display import Image
Image(url="https://i.stack.imgur.com/DL0iQ.jpg")

In NumPy and Pandas, axis 0 refers to the row axis, while axis 1 to the column axis.
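
To make the axis convention concrete, here is a minimal sketch on a made-up 2 x 3 table (the name `table` and its values are only for illustration). Passing `axis=0` to an aggregation works down the rows and gives one result per column, while `axis=1` works across the columns and gives one result per row.

```python
import numpy as np
import pandas as pd

# A small 2-row, 3-column table used only for illustration.
table = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=["x", "y", "z"])

# axis=0 collapses the rows: one sum per column.
col_sums = table.sum(axis=0)

# axis=1 collapses the columns: one sum per row.
row_sums = table.sum(axis=1)

print(col_sums)
print(row_sums)
```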

## Import the Pandas package

In [None]:
import pandas as pd

## Create a Pandas series

In [None]:
import numpy as np

In [None]:
data = np.arange(10, 101, 10)
data

In [None]:
series = pd.Series(data=data)
series

pandas.Series: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

In the output above, the right-hand column holds the values of the series, while the left-hand column holds its index. If you do not specify an index when creating a series, Pandas assigns integers increasing from 0 as the index by default.

In [None]:
series.index

In [None]:
list(series.index)

In [None]:
series.index.values

In [None]:
series.values

In [None]:
type(series.values)

## Select elements in a series

Element selection from a series with the default integer index works the same way as selection from a NumPy array.

In [None]:
series[0]

In [None]:
series[:3]

In [None]:
for num in series:
    print(num)

When you iterate over a series directly, only the values are returned. The index labels are not part of the loop; they are used for selecting elements in a series.
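
If you need the labels as well as the values, the <b>items()</b> method of a series yields (label, value) pairs. A minimal sketch with made-up data (the names `prices` and `pairs` are only for illustration):

```python
import pandas as pd

# Hypothetical series with string labels.
prices = pd.Series([10, 20, 30], index=["a", "b", "c"])

pairs = []
for label, value in prices.items():
    # Each step of the loop receives one index label and its value.
    pairs.append((label, value))
    print(label, value)
```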

In [None]:
index = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]

series2 = pd.Series(data=data, index=index)
series2

Often it is preferable to create a series using meaningful labels, instead of numbers, in order to distinguish and identify each item regardless of the order in which they were inserted into the series. 

In [None]:
series2.index

In [None]:
series2.values

In [None]:
series2["a"]

You can select an individual element by specifying its index label.

In [None]:
series2.a

series2.a is equivalent to series2["a"].

In [None]:
series2[0]

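Element selection by integer position still works on a label-indexed series in older Pandas versions. This fallback is deprecated in pandas 2.x, so `series2.iloc[0]` is the explicit way to select by position.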

In [None]:
series2[["a", "b", "c"]]

In [None]:
series2[:3]

In [None]:
for num in series2:
    print(num)

In [None]:
index = ["a", "b", "c", "d", "e", "a", "b", "c", "d", "e"]

series3 = pd.Series(data=data, index=index)
series3

The labels in an index do not have to be unique. 

In [None]:
series3["a"]
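
With repeated labels, the type of the result depends on how many rows match. The sketch below uses a made-up series (`dup` is a hypothetical name) to show that a repeated label returns a series of all matches, while a unique label returns a single value.

```python
import pandas as pd

# Hypothetical series in which the label "a" appears twice.
dup = pd.Series([10, 20, 30, 40], index=["a", "b", "a", "c"])

repeated = dup["a"]   # label appears twice: a series with both matches
single = dup["b"]     # label appears once: a single value

print(type(repeated), len(repeated))
print(single)
print(dup.index.is_unique)   # tells you whether all labels are distinct
```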

## Exercises for selecting elements from a series (5 questions)

In [None]:
data = np.arange(10)
index = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]

series = pd.Series(data=data, index=index)
series

1\. Get the first element in <i>series</i>. 

In [None]:
# Your answer here


2\. Get the last 3 elements in <i>series</i>. 

In [None]:
# Your answer here


3\. Get the element in <i>series</i> that the index label 'c' refers to.

In [None]:
# Your answer here


4\. Get the elements in <i>series</i> that the index labels 'a', 'c', and 'e' refer to.

In [None]:
# Your answer here


5\. Print all elements in <i>series</i> with a tab between elements.

In [None]:
# Your answer here


## Create a New DataFrame

### Define DataFrame as a List of Rows

In [None]:
import pandas as pd

df = pd.DataFrame(data = [10,20,30,40,50]
                  ,columns = ['Numbers']
                  ,index = [1,2,3,4,5])

In [None]:
type(df)

In [None]:
df.info()

The <b>info()</b> method shows a concise summary of a dataframe.

pandas.DataFrame.info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

In [None]:
df.head()   # shows first 5 rows by default

In [None]:
df.columns

pandas.DataFrame.head: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html

In [None]:
data = [['Tom',10], ['Jane',20], ['Alfred',60]]
df2 = pd.DataFrame(data = data
                   ,columns=['Name', 'Age']
                   ,index = ['a','b','c'])
df2.head()

### Define DataFrame as a List of Tuples

In [None]:
dataTup = [('Tom',10,True), ('Jane',20,True), ('Alfred',60,False)]
df3 = pd.DataFrame(data = dataTup
                   ,columns=['Name', 'Age','Student']
                   ,index = ['a','b','c'])
df3.head()

### Define DataFrame From Dictionary

In [None]:
dataDict = {'Name': ['Tom','Jane','Alfred']
            ,'Age': [10,20,60]
            ,'Student' : [True,True,False]}

df4 = pd.DataFrame(data = dataDict
                   #,columns=['Names', 'Age','Student']
                   ,index = ['a','b','c'])
df4.head()

### Create DataFrame From Numpy Ndarray
- A 10 x 5 NumPy array drawn from a normal distribution with mean 0 and standard deviation 1

In [None]:
import pandas as pd
import numpy as np

dataNp = np.random.normal(0,1,(10,5))

df5 = pd.DataFrame(data = dataNp
                  ,columns = ['Col1','Col2','Col3','Col4','Col5'])

df5.head()

In [None]:
df5.shape

In [None]:
dataNp.ndim

In [None]:
dataNp.shape

## Load a Dataset as a Pandas DataFrame

In [None]:
from seaborn import load_dataset

df = load_dataset("titanic")
#df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=10, replace=False, random_state=1)  # data reduction

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
list(df.index)

In [None]:
df.values

In [None]:
len(df)

The length of a dataframe is the number of rows in the dataframe. 

In [None]:
df.info()

In [None]:
df.head(1)          # Returns only the first row (with no argument, the first 5 rows).

In [None]:
df.head(3)

In [None]:
df.tail(5)          # Returns the last 5 rows. 

pandas.DataFrame.tail: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html

When you load a new dataset, it is a good idea to start by looking at the first and last few rows to get a sense of what the dataset looks like.

### Use Sampling for Data Reduction
- Subset of Columns
- Sampled subset of Rows

In [7]:
from seaborn import load_dataset

df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=10, replace=False, random_state=1)

df.shape

(10, 5)

In [8]:
df.head()

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5
680,0,3,female,,8.1375
535,1,2,female,7.0,26.25


### Load a CSV as Pandas DataFrame

In [None]:
import os
import pandas as pd

filepath=os.path.join(os.getcwd(), 'data', 'BoyNames.csv')
nameData = pd.read_csv(filepath)

In [None]:
filepath

In [None]:
nameData = pd.read_csv(filepath)
nameData.head()

In [None]:
nameData = pd.read_csv('data/BoyNames.csv')

In [None]:
nameData = pd.read_excel('data/BoyNames.xlsx')

In [None]:
type(nameData)

In [None]:
nameData.head()

## Select elements from a dataframe

In [9]:
df.head()

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5
680,0,3,female,,8.1375
535,1,2,female,7.0,26.25


In [10]:
df["age"]          # Returns all rows under the column age.

862    48.0
223     NaN
84     17.0
680     NaN
535     7.0
623    21.0
148    36.5
3      35.0
34     28.0
241     NaN
Name: age, dtype: float64

NaN stands for "Not a Number" and marks a null (missing) value in a series.

In [11]:
df.age

862    48.0
223     NaN
84     17.0
680     NaN
535     7.0
623    21.0
148    36.5
3      35.0
34     28.0
241     NaN
Name: age, dtype: float64

df.age is equivalent to df["age"].
- Be careful with column names that contain spaces. Attribute access does not work for them, so you need to use the [] option or rename the column to use an underscore instead.
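
Both options for a column name with a space can be sketched as follows. The dataframe `people` and its column `first name` are made up for illustration; <b>rename()</b> returns a new dataframe with the column relabeled.

```python
import pandas as pd

# Hypothetical dataframe with a space in one column name.
people = pd.DataFrame({"first name": ["Tom", "Jane"], "age": [10, 20]})

# Option 1: bracket access works for any column name.
by_brackets = people["first name"]

# Option 2: rename the column, after which attribute access works.
renamed = people.rename(columns={"first name": "first_name"})
by_attribute = renamed.first_name

print(by_brackets)
print(by_attribute)
```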

In [12]:
type(df.age)

pandas.core.series.Series

A column in a dataframe is in fact a series.

In [13]:
df.age.index

Int64Index([862, 223, 84, 680, 535, 623, 148, 3, 34, 241], dtype='int64')

In [14]:
df.age.values

array([48. ,  nan, 17. ,  nan,  7. , 21. , 36.5, 35. , 28. ,  nan])

In [15]:
df.survived.value_counts()

1    5
0    5
Name: survived, dtype: int64

In [16]:
df.survived.value_counts(normalize=True)

1    0.5
0    0.5
Name: survived, dtype: float64

In [None]:
for num in df.age:
    print(num)

In [None]:
df[:]

In [17]:
df.shape

(10, 5)

In [19]:
df[:5]                 # Returns the first five rows.

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5
680,0,3,female,,8.1375
535,1,2,female,7.0,26.25


In [20]:
df["age"][862]         # Returns the element with row index number 862 under the column age. 

48.0

Note that you should look up the __column label first, followed by the row index number__, each in separate matching brackets.

In [21]:
df[862]["age"]

KeyError: 862

In [22]:
df["age"][:3]          # Returns the first three rows under the column age.

862    48.0
223     NaN
84     17.0
Name: age, dtype: float64

In [24]:
df.head()

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5
680,0,3,female,,8.1375
535,1,2,female,7.0,26.25


In [23]:
df.iloc[0]             # Returns the first row.

survived          1
pclass            1
sex          female
age              48
fare        25.9292
Name: 862, dtype: object

<b>iloc</b> means integer location: it selects by position, not by index label. If there is only one argument inside the matching square brackets, that argument is the row position.

In [25]:
type(df.iloc[0])

pandas.core.series.Series

A row in a dataframe is in fact a series too, just as a column in a dataframe was a series. In other words, a dataframe is a 2D collection of series. 

In [26]:
df.iloc[0].index

Index(['survived', 'pclass', 'sex', 'age', 'fare'], dtype='object')

In [27]:
df.iloc[0].values

array([1, 1, 'female', 48.0, 25.9292], dtype=object)

In [28]:
df.iloc[:, 0]          # Returns all rows under the first column.

862    1
223    0
84     1
680    0
535    1
623    0
148    0
3      1
34     0
241    1
Name: survived, dtype: int64

If there are two arguments inside the matching square brackets, the first one is for the row index while the second for the column index. Note that when using <b>iloc you should look up the row numbers first, followed by the column numbers</b>, all in matching square brackets.
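
Because <b>iloc</b> works with positions, it ignores the index labels entirely. As a complement that is not covered above, Pandas also provides <b>loc</b>, which takes labels in the same row-first, column-second order. The sketch below uses a three-row excerpt of the sample (the name `small` is only for illustration).

```python
import pandas as pd

# Three rows taken from the sample above, keeping their original row labels.
small = pd.DataFrame(
    {"age": [48.0, 17.0, 7.0], "fare": [25.9292, 10.5, 26.25]},
    index=[862, 84, 535],
)

by_position = small.iloc[0, 1]      # first row, second column (positions)
by_label = small.loc[862, "fare"]   # row labeled 862, column labeled "fare"

print(by_position, by_label)
```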

In [None]:
df.iloc[:, :2]         # Returns all rows under the first two columns. 

In [None]:
df.iloc[:3, :2]        # Returns the first three rows under the first two columns.

In [None]:
df.iloc[-3:, -2:]      # Returns the last three rows under the last two columns.

In [29]:
df.age > 30

862     True
223    False
84     False
680    False
535    False
623    False
148     True
3       True
34     False
241    False
Name: age, dtype: bool

In [30]:
df[df.age > 30]        # Set a condition for filtering

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
148,0,2,male,36.5,26.0
3,1,1,female,35.0,53.1


Make sure to put df. before the column name. 

In [31]:
df[age > 30]

NameError: name 'age' is not defined

You can also use the <b>query()</b> method to select rows that meet a condition.

In [None]:
df.query('age > 30')

In [None]:
df[(df.age > 30) & (df.pclass == 1)]

In [None]:
df[(df.age > 30) | (df.pclass == 1)]
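
The filters above combine conditions with `&` (and) and `|` (or), and each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on made-up data (`small` is a hypothetical name) showing that a boolean mask and an equivalent <b>query()</b> string select the same rows:

```python
import pandas as pd

# Hypothetical data with the same column names as the Titanic sample.
small = pd.DataFrame({"age": [48.0, 36.5, 17.0, 35.0], "pclass": [1, 2, 2, 1]})

# Boolean mask: wrap each condition in parentheses before combining.
mask_result = small[(small.age > 30) & (small.pclass == 1)]

# The same filter written as a query string, where "and" replaces "&".
query_result = small.query("age > 30 and pclass == 1")

print(mask_result)
```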

In [None]:
cols=["survived", "sex"]

In [None]:
type(cols)

In [None]:
df[["survived", "sex"]]                  # Equivalent to df.iloc[:, [0, 2]]

In [None]:
df[cols]

To select all rows under certain columns, pass a list of column names inside the square brackets.

In [None]:
df.drop("fare", axis=1)

In [None]:
df.head()

pandas.DataFrame.drop: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

To delete an entire column and all its content, use the <b>drop()</b> method. The axis must be set to 1, so the method can find the column on the column axis. Note that the <b>drop</b> method returns a new copy, not changing the content of the dataframe provided.
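
The copy behavior can be checked directly. In this sketch on made-up data (`small` is a hypothetical name), the original dataframe still has both columns after the call; the `columns=` keyword is an equivalent way to say the same thing without `axis=1`.

```python
import pandas as pd

# Hypothetical two-column dataframe.
small = pd.DataFrame({"age": [48.0, 17.0], "fare": [25.9, 10.5]})

without_fare = small.drop("fare", axis=1)   # new dataframe; small is untouched
same_result = small.drop(columns="fare")    # equivalent keyword form

print(list(small.columns))
print(list(without_fare.columns))
```

To keep the change, assign the result back, for example `small = small.drop("fare", axis=1)`.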

In [None]:
del df["fare"]

Another way to delete a column is to use the <b>del</b> statement. This actually deletes the column in the dataframe.

In [None]:
df.columns

## Exercises for selecting elements from a dataframe (15 questions)

Let's continue to use the Titanic dataset, but with a different sample this time.

In [None]:
df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=20, replace=False, random_state=2)
df

1\. Get the columns of <i>df</i>. 

In [None]:
# Your answer here


2\. Get the index of <i>df</i>. 

In [None]:
# Your answer here


3\. Get the shape of <i>df</i>. 

In [None]:
# Your answer here


4\. Get the number of rows, or records, in <i>df</i>. 

In [None]:
# Your answer here


5\. Select all rows under the column <i>survived</i>. 

In [None]:
# Your answer here


6\. Select the first 3 rows. 

In [None]:
# Your answer here


7\. Select the last 3 rows.

In [None]:
# Your answer here


8\. Select the element with row index number 615 under the column <i>fare</i>.

In [None]:
# Your answer here


9\. Select the element in the third row and the fifth column.

In [None]:
# Your answer here


10\. Select all rows under the last column, not specifying the column label.

In [None]:
# Your answer here


11\. Select all rows under the last 2 columns, not specifying the column labels.

In [None]:
# Your answer here


12\. Select the first 5 rows under the last 2 columns, not specifying the column labels.

In [None]:
# Your answer here


13\. Select all rows with their column <i>sex</i> being male.

In [None]:
# Your answer here


14\. Select all rows with the column <i>fare</i> being between 50 (inclusive) and 100 (exclusive).

In [None]:
# Your answer here


15\. Print all values under the column <i>fare</i> with a tab between the values. 

In [None]:
# Your answer here


## Handle Null Values in a DataFrame

In many cases, you should take care of the null values, or missing values, in a dataframe. There are two approaches you can consider to handle null values:
- Drop the rows with null values
- Fill the null values with something else

In [32]:
df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=10, replace=False, random_state=3)
df

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,,221.7792
157,0,3,male,30.0,8.05


In [33]:
df.isnull()

Unnamed: 0,survived,pclass,sex,age,fare
395,False,False,False,False,False
85,False,False,False,False,False
201,False,False,False,True,False
542,False,False,False,False,False
702,False,False,False,False,False
51,False,False,False,False,False
237,False,False,False,False,False
548,False,False,False,False,False
527,False,False,False,True,False
157,False,False,False,False,False


pandas.DataFrame.isnull: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html

The <b>isnull()</b> method returns which entries in a dataframe are null.

In [34]:
df.isnull().any(axis=0)

survived    False
pclass      False
sex         False
age          True
fare        False
dtype: bool

pandas.DataFrame.any: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html

The <b>any()</b> method returns whether any element is True, potentially over an axis.

In [35]:
df[df.isnull().any(axis=1)]       # Returns all rows with any null values.

Unnamed: 0,survived,pclass,sex,age,fare
201,0,3,male,,69.55
527,0,1,male,,221.7792


In [36]:
df.age.isnull().any()

True

In [37]:
df1 = df.copy()
df1

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,,221.7792
157,0,3,male,30.0,8.05


pandas.DataFrame.copy: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html

The <b>copy()</b> method makes a copy of a dataframe.

In [38]:
df1.dropna(inplace=True)

pandas.DataFrame.dropna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

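The <b>dropna()</b> method drops null values in a dataframe. By default it returns a new copy and leaves the original unchanged; with `inplace=True`, as in the cell above, it modifies the dataframe directly and returns None.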

In [39]:
df1 = df1.dropna()

In [None]:
len(df1)

Note that the <b>dropna</b> method drops the entire row if there is any null value in the row.

In [None]:
df2 = df.copy()

In [None]:
df2 = df2.dropna(how="all")
df2

If you set the parameter `how` to all, it drops the row when all values in the row are null. Default is any.
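
The sample above has no row that is entirely null, so the difference between the two settings is easier to see on a made-up dataframe (`small` is a hypothetical name) with one partly null row and one fully null row:

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 has one null, row 2 is entirely null.
small = pd.DataFrame({"age": [22.0, np.nan, np.nan], "fare": [7.8, 69.55, np.nan]})

drop_any = small.dropna()            # how="any" is the default
drop_all = small.dropna(how="all")   # only fully null rows are removed

print(len(small), len(drop_any), len(drop_all))
```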

In [40]:
df3 = df.copy()

pandas.DataFrame.fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

The <b>fillna()</b> method fills the null values in a dataframe. The `value` parameter specifies the value that replaces them.

In [41]:
df3.age.fillna(value=0, inplace=True)        # Targeted at a specific column (newer pandas versions warn about this in-place form; the next cell shows the assignment form)
df3

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,0.0,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,0.0,221.7792
157,0,3,male,30.0,8.05


In [42]:
df3 = df.copy()
df3.age = df3.age.fillna(value=0)        # Targeted at a specific column
df3

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,0.0,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,0.0,221.7792
157,0,3,male,30.0,8.05


In [43]:
df4 = df.copy()

In [44]:
df4 = df4.fillna(value={"survived": df.survived.mean(), "pclass": df.pclass.mean(), "sex": "unknown",
                        "age": df.age.mean(), "fare": df.fare.mean()})
df4

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,22.0,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,22.0,221.7792
157,0,3,male,30.0,8.05


Instead of filling all null values with the same value, you can fill with different values depending on the column, specifying one by one the columns and the values to be replaced. 
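
A smaller sketch of the same idea, on made-up data (`small` and the misspelled key `agee` are hypothetical). It also shows one thing to watch for: a dictionary key that matches no column is skipped without an error, so a misspelled column name leaves that column unfilled.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one null in each column.
small = pd.DataFrame({"sex": ["male", None, "female"], "age": [22.0, np.nan, 11.0]})

# Each key names a column; each value is the replacement for that column.
filled = small.fillna(value={"sex": "unknown", "age": small["age"].mean()})

# A key that matches no column is ignored, so "age" keeps its null here.
unchanged = small.fillna(value={"agee": 0})

print(filled)
print(unchanged["age"].isnull().sum())
```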

In [45]:
df5 = df.copy()

In [46]:
df5 = df5.fillna(method="ffill")
df5

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,33.0,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,33.0,221.7792
157,0,3,male,30.0,8.05


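If you set the parameter `method` to ffill, which means forward fill, each null value takes the last non-null observation before it. Setting it to bfill, which means backward fill, fills each null value from the next non-null observation instead. In pandas 2.1 and later the `method` parameter is deprecated; the <b>ffill()</b> and <b>bfill()</b> methods do the same thing.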
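
The two directions can be compared on a short made-up series (`ages` is a hypothetical name) using the <b>ffill()</b> and <b>bfill()</b> methods. Note that a null at the very end has no later value to borrow, so backward fill leaves it null; the same holds for a leading null under forward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap in the middle and a gap at the end.
ages = pd.Series([22.0, np.nan, 11.0, np.nan])

forward = ages.ffill()    # each gap takes the previous non-null value
backward = ages.bfill()   # each gap takes the next non-null value

print(forward.tolist())
print(backward.tolist())
```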

In [47]:
df6 = df.copy()

In [48]:
df6.fillna(method="bfill", inplace=True)
df6

Unnamed: 0,survived,pclass,sex,age,fare
395,0,3,male,22.0,7.7958
85,1,3,female,33.0,15.85
201,0,3,male,11.0,69.55
542,0,3,female,11.0,31.275
702,0,3,female,18.0,14.4542
51,0,3,male,21.0,7.8
237,1,2,female,8.0,26.25
548,0,3,male,33.0,20.525
527,0,1,male,30.0,221.7792
157,0,3,male,30.0,8.05


## Exercises for handling null values (6 questions)

Suppose you have <i>df1</i>, <i>df2</i>, <i>df3</i>, <i>df4</i>, and <i>df5</i>, each of which is a copy of <i>df</i>.

In [None]:
df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=10, replace=False, random_state=4)
df.iloc[1, 0] = None
df.iloc[7, 1] = None
df.iloc[3, 4] = None
df.iloc[[5, 9], :] = None

df

In [None]:
df1, df2, df3, df4, df5 = df.copy(), df.copy(), df.copy(), df.copy(), df.copy()

1\. Select all rows in <i>df</i> with any null values.

In [None]:
# Your answer here


2\. Drop the rows in <i>df1</i> that have any null values. Make sure to assign the resulting dataframe back to <i>df1</i> to actually change <i>df1</i>.

In [None]:
# Your answer here


3\. Drop the rows in <i>df2</i> in which all values are null. Make sure to assign the resulting dataframe back to <i>df2</i> to actually change <i>df2</i>.

In [None]:
# Your answer here


4\. Fill the missing values under the column <i>sex</i> in <i>df3</i> with 'unknown'.

In [None]:
# Your answer here


5\. Fill the missing values under the columns <i>pclass</i>, <i>age</i>, and <i>fare</i> in <i>df4</i> with the minimum value of their column.

In [None]:
# Your answer here


6\. Fill all missing values in <i>df5</i> using backward fill, that is, with the next non-null observation.

In [None]:
# Your answer here


## Working with Dataframe Indices
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html

In [None]:
import pandas as pd

sampleTs = pd.read_csv('data/sampleTs.csv')

sampleTs.head(2)

In [None]:
sampleTs.set_index('TimeStamp')

In [None]:
sampleTs.head(2)

In [None]:
sampleTs = sampleTs.set_index('TimeStamp')
sampleTs.head(2)

In [None]:
sampleTs = pd.read_csv('data/sampleTs.csv')
sampleTs.set_index('TimeStamp', inplace=True)
sampleTs.head(2)

In [None]:
sampleTs.reset_index()

In [None]:
sampleTs.head(2)

In [None]:
sampleTs.reset_index(inplace=True)
sampleTs.head(2)

In [None]:
sampleTs = pd.read_csv('data/sampleTs.csv')
sampleTs.set_index('TimeStamp', inplace=True)
sampleTs.reset_index(inplace=True, drop=True)
sampleTs.head(2)
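
The cells above depend on the file data/sampleTs.csv. If you do not have it, the same behavior can be sketched with a small in-memory dataframe. Only the column name `TimeStamp` comes from the cells above; the `Value` column, the dates, and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the CSV file.
ts = pd.DataFrame({"TimeStamp": ["2020-01-01", "2020-01-02"], "Value": [1.5, 2.5]})

indexed = ts.set_index("TimeStamp")          # the column becomes the row index
restored = indexed.reset_index()             # the index moves back to a column
discarded = indexed.reset_index(drop=True)   # the index is thrown away instead

print(list(ts.columns))         # ts itself is unchanged without inplace=True
print(list(indexed.columns))
print(list(discarded.columns))
```

Without `inplace=True`, both methods return a new dataframe, which is why the first `set_index` cell above appears to have no effect on `sampleTs`.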

## Iterate over a dataframe

There are four ways to iterate over a dataframe:  
- Using <b>iloc</b>
- Using <b>items()</b> (named <b>iteritems()</b> before pandas 2.0) to iterate over the columns as (column label, series) pairs
- Using <b>iterrows()</b> to iterate over the rows as (index,series) pairs
- Using <b>itertuples()</b> to iterate over the rows as named tuples

You can choose any of the four above depending on how you want to retrieve data from a dataframe.

In [None]:
df

In [None]:
# Iterates by row
for i in range(len(df)):
    print(i, df.iloc[i].values)

In [None]:
# Iterates by column
for i in range(len(df.columns)):
    print(i, df.iloc[:, i].values)

In [None]:
# Iterates over the columns as (column label, series) pairs
# (iteritems() was removed in pandas 2.0; items() is its replacement)
for key, val in df.items():
    print(key)
    print(val)

In [None]:
# Iterates over the rows as (index, series) pairs
for idx, series in df.iterrows():
    print(idx)
    print(series)

In [None]:
# Iterates over the rows as named tuples
for row in df.itertuples():
    print(row)
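
Each named tuple from <b>itertuples()</b> exposes the row label as `Index` and every column as an attribute. A minimal sketch on two made-up rows (`small` and `lines` are hypothetical names):

```python
import pandas as pd

# Hypothetical two-row dataframe with non-default row labels.
small = pd.DataFrame({"sex": ["male", "female"], "age": [22.0, 33.0]}, index=[395, 85])

lines = []
for row in small.itertuples():
    # row.Index holds the row label; columns are available by name.
    lines.append(f"{row.Index}: {row.sex}, {row.age}")

print(lines)
```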

## Aggregate and group a dataframe

In [49]:
df = load_dataset("titanic")
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [50]:
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


pandas.DataFrame.describe: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

The <b>describe()</b> method generates various descriptive summary statistics of a series or a dataframe.

In [51]:
df.mean(numeric_only=True)        # Returns the mean value of each numeric column.

survived       0.383838
pclass         2.308642
age           29.699118
sibsp          0.523008
parch          0.381594
fare          32.204208
adult_male     0.602694
alone          0.602694
dtype: float64

In [52]:
df.age.mean()                     # Returns the mean age. 

29.69911764705882

In [53]:
df['age'].mean()

29.69911764705882

You can specify the column you are interested in.

In [None]:
df.age.min()                      # Returns the minimum age.

In [None]:
df.age.max()                      # Returns the maximum age.

In [None]:
df.age.std()                      # Returns the standard deviation of age.

In [54]:
df[df.pclass == 1].age.mean()     # Returns the mean age of the first class passengers. 

38.233440860215055

If you are only interested in a subset of rows, first select the rows using filtering and then do what you want. 

In [55]:
df[(df.pclass == 1) & (df.survived == 0)].fare.mean() # Returns the mean fare of the first class passengers who died. 

64.68400750000002

In [56]:
df.groupby("sex").mean(numeric_only=True)

Unnamed: 0_level_0,survived,pclass,age,sibsp,parch,fare,adult_male,alone
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
female,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818,0.0,0.401274
male,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893,0.930676,0.712305


pandas.DataFrame.groupby: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

The <b>groupby()</b> method groups a dataframe by one or more columns. It should be followed by the operation you want to apply to each group, such as <b>mean()</b>.
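
The group-then-aggregate pattern can be sketched on a tiny made-up dataframe (`small` is a hypothetical name) where the result is easy to check by hand. The <b>agg()</b> method, not used elsewhere in this notebook, applies several aggregations at once.

```python
import pandas as pd

# Hypothetical data: two male and two female passengers.
small = pd.DataFrame(
    {"sex": ["male", "female", "male", "female"], "age": [20.0, 30.0, 40.0, 50.0]}
)

# One aggregation: the mean age within each group.
mean_age = small.groupby("sex")["age"].mean()

# Several aggregations at once: one output column per function.
summary = small.groupby("sex")["age"].agg(["min", "max", "mean"])

print(mean_age)
print(summary)
```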

In [63]:
df.groupby("class").mean(numeric_only=True)

Unnamed: 0_level_0,survived,pclass,age,sibsp,parch,fare,adult_male,alone
class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
First,0.62963,1.0,38.233441,0.416667,0.356481,84.154687,0.550926,0.50463
Second,0.472826,2.0,29.87763,0.402174,0.380435,20.662183,0.538043,0.565217
Third,0.242363,3.0,25.14062,0.615071,0.393075,13.67555,0.649695,0.659878


In [58]:
df.groupby("class").age.mean()

class
First     38.233441
Second    29.877630
Third     25.140620
Name: age, dtype: float64

In [59]:
df.groupby(["sex", "class"]).age.mean()

sex     class 
female  First     34.611765
        Second    28.722973
        Third     21.750000
male    First     41.281386
        Second    30.740707
        Third     26.507589
Name: age, dtype: float64

You can group by multiple columns. The order of the columns matters: the first column forms the outer groups, and each later column splits them further.

In [60]:
df.sort_values(by="fare", ascending=False)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
258,1,1,female,35.0,0,0,512.3292,C,First,woman,False,,Cherbourg,yes,True
737,1,1,male,35.0,0,0,512.3292,C,First,man,True,B,Cherbourg,yes,True
679,1,1,male,36.0,0,1,512.3292,C,First,man,True,B,Cherbourg,yes,False
88,1,1,female,23.0,3,2,263.0000,S,First,woman,False,C,Southampton,yes,False
27,0,1,male,19.0,3,2,263.0000,S,First,man,True,C,Southampton,no,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
633,0,1,male,,0,0,0.0000,S,First,man,True,,Southampton,no,True
413,0,2,male,,0,0,0.0000,S,Second,man,True,,Southampton,no,True
822,0,1,male,38.0,0,0,0.0000,S,First,man,True,,Southampton,no,True
732,0,2,male,,0,0,0.0000,S,Second,man,True,,Southampton,no,True


pandas.DataFrame.sort_values: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html

The <b>sort_values()</b> method sorts a dataframe by one or more columns.

In [61]:
df.sort_values(by=["fare", "age"], ascending=[False, True])

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
258,1,1,female,35.0,0,0,512.3292,C,First,woman,False,,Cherbourg,yes,True
737,1,1,male,35.0,0,0,512.3292,C,First,man,True,B,Cherbourg,yes,True
679,1,1,male,36.0,0,1,512.3292,C,First,man,True,B,Cherbourg,yes,False
27,0,1,male,19.0,3,2,263.0000,S,First,man,True,C,Southampton,no,False
88,1,1,female,23.0,3,2,263.0000,S,First,woman,False,C,Southampton,yes,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
481,0,2,male,,0,0,0.0000,S,Second,man,True,,Southampton,no,True
633,0,1,male,,0,0,0.0000,S,First,man,True,,Southampton,no,True
674,0,2,male,,0,0,0.0000,S,Second,man,True,,Southampton,no,True
732,0,2,male,,0,0,0.0000,S,Second,man,True,,Southampton,no,True


You can sort by multiple columns. The order of the columns matters: rows are sorted by the first column, and later columns only break ties.
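
To see why the order matters, this sketch sorts the same made-up rows (`small` is a hypothetical name) with the two columns in each order. The `ascending` list is matched to the `by` list position by position.

```python
import pandas as pd

# Hypothetical data with a tie in the fare column.
small = pd.DataFrame({"fare": [263.0, 512.0, 263.0], "age": [23.0, 35.0, 19.0]})

# Fare first (descending); age (ascending) only breaks the tie at 263.0.
fare_first = small.sort_values(by=["fare", "age"], ascending=[False, True])

# Age first (ascending); fare would only break ties in age.
age_first = small.sort_values(by=["age", "fare"], ascending=[True, False])

print(list(fare_first.index))
print(list(age_first.index))
```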

In [None]:
df.sample(n=10, replace=False, random_state=0)

pandas.DataFrame.sample: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html

The <b>sample()</b> method returns a random sample of items from a dataframe. 
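
The `random_state` argument used throughout this notebook makes the sample reproducible: the same seed on the same dataframe selects the same rows. A minimal sketch on made-up data (`small` is a hypothetical name):

```python
import pandas as pd

# Hypothetical dataframe with 100 rows.
small = pd.DataFrame({"value": range(100)})

# Same seed, same dataframe: the two samples are identical.
first = small.sample(n=10, replace=False, random_state=0)
second = small.sample(n=10, replace=False, random_state=0)

print(first.equals(second))
print(first.index.is_unique)   # replace=False never picks a row twice
```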

## Exercises for aggregation and grouping (8 questions)

Let's continue to use the entire Titanic dataframe <i>df</i>.

1\. What was the highest fare?

In [None]:
# Your answer here


2\. What was the lowest fare?

In [None]:
# Your answer here


3\. What was the mean age of the female passengers? (Select the female passengers first.)

In [None]:
# Your answer here


4\. Select the rows, or passengers, who were under the age of ten and died. (Put two conditions for filtering.)

In [None]:
# Your answer here


5\. What were the mean ages for those who survived and who died, respectively? In other words, group the dataframe by <i>survived</i> and get the mean age of each group. 

In [None]:
# Your answer here


6\. Group the dataframe by <i>survived</i> and then by <i>pclass</i> and get the mean fare of each group. 

In [None]:
# Your answer here


7\. Get a copy of <i>df</i> in which the entire dataframe is sorted by <i>age</i> and then by <i>pclass</i>, both in descending order.

In [None]:
# Your answer here


8\. Create a random sample of 50 rows with no duplicates from <i>df</i>. 

In [None]:
# Your answer here
