# Data Programming in Python | BAIS:6040
# Pandas Basics

Instructor: Jeff Hendricks 

Topics to be covered:
- Pandas series (+ exercises)
- Loading a dataset as a Pandas dataframe
- Element selection from a dataframe (+ exercises)
- Handling null values in a dataframe (+ exercises)
- Iteration over a dataframe
- Aggregation and grouping of a dataframe  (+ exercises)

References: 
- Pandas official website (http://pandas.pydata.org/) 
- Python Data Science Handbook by Jake VanderPlas (http://shop.oreilly.com/product/0636920034919.do)
- Python for Data Analysis by Wes McKinney (https://www.oreilly.com/library/view/python-for-data/9781491957653/)

## Data Structures in Pandas

In [1]:
from IPython.display import Image

Image(url="https://cdn-images-1.medium.com/max/800/0*PWbW0OdJJw49kxMt.png")

A series is a one-dimensional labeled array. Its array of labels is called the index.

In [2]:
Image(url="https://cdn-images-1.medium.com/max/800/0*dddYH8GijZanG4dO.png")

A dataframe extends a series to two dimensions. It has two index arrays: a row index, called the index, and a column index, called the columns. A dataframe is in fact a collection of multiple series that share the same index.

In [3]:
from IPython.display import Image
Image(url="https://i.stack.imgur.com/DL0iQ.jpg")

In NumPy and Pandas, axis 0 refers to the row axis, while axis 1 to the column axis.
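As a small illustration of the two axes, the sketch below sums a toy dataframe along each of them. The dataframe `toy` and its column names are made up for this example and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the axis argument.
toy = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

col_sums = toy.sum(axis=0)   # Works down the rows: one result per column.
row_sums = toy.sum(axis=1)   # Works across the columns: one result per row.

print(col_sums)
print(row_sums)
```

A useful way to remember this: the axis you pass is the one that gets collapsed.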

## Import the Pandas package

In [4]:
import pandas as pd

## Create a Pandas series

In [5]:
import numpy as np

In [6]:
data = np.arange(10, 101, 10)
data

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])

In [7]:
series = pd.Series(data=data)
series

0     10
1     20
2     30
3     40
4     50
5     60
6     70
7     80
8     90
9    100
dtype: int64

pandas.Series: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

The right-hand column of the output shows the values of the series, and the left-hand column shows its index. If you do not specify an index when creating a series, Pandas assigns integer labels starting from 0 by default.

In [8]:
series.index

RangeIndex(start=0, stop=10, step=1)

In [9]:
list(series.index)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [10]:
series.index.values

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [11]:
series.values

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])

In [12]:
type(series.values)

numpy.ndarray

## Select elements in a series

Element selection from a series with the default integer index works much like selection from a NumPy array.

In [13]:
series[0]

10

In [14]:
series[:3]

0    10
1    20
2    30
dtype: int64

In [15]:
for num in series:
    print(num)

10
20
30
40
50
60
70
80
90
100


When iterating over a series, only the values are exposed. The index is not exposed, which means the index is only used for selecting elements in a series. 
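If you need the index labels during iteration as well, the <b>items()</b> method is one option. The sketch below uses a small made-up series `s` with the default integer index.

```python
import pandas as pd

s = pd.Series([10, 20, 30])   # Hypothetical small series with the default index.

# items() yields (index label, value) pairs.
for label, value in s.items():
    print(label, value)

pairs = list(s.items())
```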

In [16]:
index = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]

series2 = pd.Series(data=data, index=index)
series2

a     10
b     20
c     30
d     40
e     50
f     60
g     70
h     80
i     90
j    100
dtype: int64

Often it is preferable to create a series using meaningful labels, instead of numbers, in order to distinguish and identify each item regardless of the order in which they were inserted into the series. 

In [17]:
series2.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

In [18]:
series2.values

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])

In [19]:
series2["a"]

10

You can select an individual element by specifying its index label.

In [20]:
series2.a

10

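series2.a is equivalent to series2["a"]. Attribute access works only when the label is a valid Python identifier and does not clash with an existing series attribute or method.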

In [21]:
series2[0]

10

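Selecting by integer position still works here because the labels are not integers. Recent versions of Pandas deprecate this fallback, so series2.iloc[0] is the safer way to select by position.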
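To make the choice between labels and positions explicit, a series offers <b>loc</b> (label-based) and <b>iloc</b> (position-based). The sketch below uses a made-up three-element series `s`; note how the two kinds of slices treat the end point differently.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])   # Hypothetical small series.

by_label = s.loc["a"]       # Label-based lookup.
by_position = s.iloc[0]     # Position-based lookup, independent of the labels.
first_two = s.iloc[:2]      # Positional slice: the end position is excluded.
a_to_b = s.loc["a":"b"]     # Label slice: the end label is included.
```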

In [22]:
series2[["a", "b", "c"]]

a    10
b    20
c    30
dtype: int64

In [23]:
series2[:3]

a    10
b    20
c    30
dtype: int64

In [24]:
for num in series2:
    print(num)

10
20
30
40
50
60
70
80
90
100


In [25]:
index = ["a", "b", "c", "d", "e", "a", "b", "c", "d", "e"]

series3 = pd.Series(data=data, index=index)
series3

a     10
b     20
c     30
d     40
e     50
a     60
b     70
c     80
d     90
e    100
dtype: int64

The labels in an index do not have to be unique. 

In [26]:
series3["a"]

a    10
a    60
dtype: int64
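A repeated label changes what a lookup returns. The sketch below, on a made-up series `s`, checks whether an index is unique and compares a lookup on a repeated label with one on a unique label.

```python
import pandas as pd

# Hypothetical series in which the label "a" appears twice.
s = pd.Series([10, 20, 30], index=["a", "b", "a"])

print(s.index.is_unique)    # False, because "a" is repeated.

repeated = s["a"]           # A series holding both matching elements.
single = s["b"]             # A single value, because "b" appears once.
```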

## Exercises for selecting elements from a series (5 questions)

In [27]:
data = np.arange(10)
index = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]

series = pd.Series(data=data, index=index)
series

a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int64

1\. Get the first element in <i>series</i>. 

In [28]:
# Your answer here
series.a

0

2\. Get the last 3 elements in <i>series</i>. 

In [29]:
# Your answer here
series[-3:]

h    7
i    8
j    9
dtype: int64

3\. Get the element in <i>series</i> that the index label 'c' refers to.

In [30]:
# Your answer here
series.c

2

4\. Get the elements in <i>series</i> that the index labels 'a', 'c', and 'e' refer to.

In [31]:
# Your answer here
series[['a','c','e']]

a    0
c    2
e    4
dtype: int64

5\. Print all elements in <i>series</i> with a tab between elements.

In [32]:
# Your answer here
for elem in series:
    print(elem, end='\t')

0	1	2	3	4	5	6	7	8	9	

## Create a New DataFrame

### Define DataFrame as a List of Rows

In [33]:
import pandas as pd

df = pd.DataFrame(data = [10,20,30,40,50]
                  ,columns = ['Numbers']
                  ,index = [1,2,3,4,5])

In [34]:
type(df)

pandas.core.frame.DataFrame

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 1 to 5
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Numbers  5 non-null      int64
dtypes: int64(1)
memory usage: 80.0 bytes


The <b>info()</b> method shows a concise summary of a dataframe.

pandas.DataFrame.info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

In [36]:
df.head()   # shows first 5 rows by default

Unnamed: 0,Numbers
1,10
2,20
3,30
4,40
5,50


In [37]:
df.columns

Index(['Numbers'], dtype='object')

pandas.DataFrame.head: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html

In [38]:
data = [['Tom',10], ['Jane',20], ['Alfred',60]]
df2 = pd.DataFrame(data = data
                   ,columns=['Name', 'Age']
                   ,index = ['a','b','c'])
df2.head()

Unnamed: 0,Name,Age
a,Tom,10
b,Jane,20
c,Alfred,60


### Define DataFrame as a List of Tuples

In [39]:
dataTup = [('Tom',10,True), ('Jane',20,True), ('Alfred',60,False)]
df3 = pd.DataFrame(data = dataTup
                   ,columns=['Name', 'Age','Student']
                   ,index = ['a','b','c'])
df3.head()

Unnamed: 0,Name,Age,Student
a,Tom,10,True
b,Jane,20,True
c,Alfred,60,False


### Define DataFrame From Dictionary

In [40]:
dataDict = {'Name': ['Tom','Jane','Alfred']
            ,'Age': [10,20,60]
            ,'Student' : [True,True,False]}

df4 = pd.DataFrame(data = dataDict
                   #,columns=['Names', 'Age','Student']
                   ,index = ['a','b','c'])
df4.head()

Unnamed: 0,Name,Age,Student
a,Tom,10,True
b,Jane,20,True
c,Alfred,60,False


### Create DataFrame From Numpy Ndarray
- A 10 x 5 NumPy array of random draws from a normal distribution with mean 0 and standard deviation 1

In [41]:
import pandas as pd
import numpy as np

dataNp = np.random.normal(0,1,(10,5))

df5 = pd.DataFrame(data = dataNp
                  ,columns = ['Col1','Col2','Col3','Col4','Col5'])

df5.head()

Unnamed: 0,Col1,Col2,Col3,Col4,Col5
0,-1.043281,-1.508654,-1.678576,0.069069,-0.579774
1,0.358991,0.12121,0.31755,1.449969,-0.453575
2,-0.435615,0.465963,-0.020989,-0.565792,0.678359
3,1.102762,1.311979,2.034645,-0.812104,-1.346705
4,0.235819,-1.427554,-0.062069,1.620858,-0.478419


In [42]:
df5.shape

(10, 5)

In [43]:
dataNp.ndim

2

In [44]:
dataNp.shape

(10, 5)

## Load a Dataset as a Pandas DataFrame

In [45]:
from seaborn import load_dataset

df = load_dataset("titanic")
#df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=10, replace=False, random_state=1)  # data reduction

In [46]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [47]:
df.shape

(891, 15)

In [48]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [49]:
list(df.index)

[0, 1, 2, ..., 888, 889, 890]


In [50]:
df.values

array([[0, 3, 'male', ..., 'Southampton', 'no', False],
       [1, 1, 'female', ..., 'Cherbourg', 'yes', False],
       [1, 3, 'female', ..., 'Southampton', 'yes', True],
       ...,
       [0, 3, 'female', ..., 'Southampton', 'no', False],
       [1, 1, 'male', ..., 'Cherbourg', 'yes', True],
       [0, 3, 'male', ..., 'Queenstown', 'no', True]], dtype=object)

In [51]:
len(df)

891

The length of a dataframe is the number of rows in the dataframe. 

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB


In [53]:
df.head(1)          # Returns the first row only.

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False


In [54]:
df.head(3)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True


In [55]:
df.tail(5)          # Returns the last 5 rows. 

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


pandas.DataFrame.tail: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html

When you load a new dataset, it is a good idea to start by looking at the first and last few rows to get a sense of what the dataset looks like.

### Use Sampling for Data Reduction
- Subset of Columns
- Sampled subset of Rows

In [56]:
from seaborn import load_dataset

df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=10, replace=False, random_state=1)

df.shape

(10, 5)

In [57]:
df.head()

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5
680,0,3,female,,8.1375
535,1,2,female,7.0,26.25


### Load a CSV as Pandas DataFrame

In [58]:
import os
import pandas as pd

# filepath=os.path.join(os.getcwd(), 'data', 'BoyNames.csv')
filepath=os.path.join('/home/jerjacob/Data', 'BoyNames.csv')
nameData = pd.read_csv(filepath)

In [59]:
filepath

'/home/jerjacob/Data/BoyNames.csv'

In [60]:
nameData = pd.read_csv('../../Data/BoyNames.csv')

In [61]:
nameData = pd.read_excel('../../Data/BoyNames.xlsx')

In [62]:
type(nameData)

pandas.core.frame.DataFrame

In [63]:
nameData.head()

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
1,2,1980,Jason,2389
2,3,1980,Christopher,2273
3,4,1980,Matthew,2112
4,5,1980,David,2088


## Select elements from a dataframe

In [64]:
df.head()

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5
680,0,3,female,,8.1375
535,1,2,female,7.0,26.25


In [65]:
df["age"]          # Returns all rows under the column age.

862    48.0
223     NaN
84     17.0
680     NaN
535     7.0
623    21.0
148    36.5
3      35.0
34     28.0
241     NaN
Name: age, dtype: float64

NaN stands for "Not a Number". Pandas uses it to mark a null (missing) value in a series.

In [66]:
df.age

862    48.0
223     NaN
84     17.0
680     NaN
535     7.0
623    21.0
148    36.5
3      35.0
34     28.0
241     NaN
Name: age, dtype: float64

df.age is equivalent to df["age"].
- Be careful with column names that contain spaces. Attribute access does not work for them, so either use the [] notation or rename the column, for example by replacing the space with an underscore.
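The sketch below illustrates the point about spaces in column names. The dataframe `people` and its column "first name" are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe whose first column name contains a space.
people = pd.DataFrame({"first name": ["Tom", "Jane"], "age": [10, 20]})

names = people["first name"]    # Bracket notation works with any column label.
# people.first name             # Attribute access would be a SyntaxError here.

# After renaming, attribute access becomes possible.
renamed = people.rename(columns={"first name": "first_name"})
same_names = renamed.first_name
```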

In [67]:
type(df.age)

pandas.core.series.Series

A column in a dataframe is in fact a series.

In [68]:
df.age.index

Int64Index([862, 223, 84, 680, 535, 623, 148, 3, 34, 241], dtype='int64')

In [69]:
df.age.values

array([48. ,  nan, 17. ,  nan,  7. , 21. , 36.5, 35. , 28. ,  nan])

In [70]:
df.survived.value_counts()

1    5
0    5
Name: survived, dtype: int64

In [71]:
df.survived.value_counts(normalize=True)

1    0.5
0    0.5
Name: survived, dtype: float64

In [72]:
for num in df.age:
    print(num)

48.0
nan
17.0
nan
7.0
21.0
36.5
35.0
28.0
nan


In [73]:
df[:]

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5
680,0,3,female,,8.1375
535,1,2,female,7.0,26.25
623,0,3,male,21.0,7.8542
148,0,2,male,36.5,26.0
3,1,1,female,35.0,53.1
34,0,1,male,28.0,82.1708
241,1,3,female,,15.5


In [74]:
df.shape

(10, 5)

In [75]:
df[:3]                 # Returns the first three rows.

Unnamed: 0,survived,pclass,sex,age,fare
862,1,1,female,48.0,25.9292
223,0,3,male,,7.8958
84,1,2,female,17.0,10.5


In [76]:
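df["age"][862]         # Returns the element with row index label 862 under the column age. 

48.0

Note that with this notation you look up the __column label first, followed by the row index label__, each in its own pair of brackets. Here 862 is a row label, not a position. df.loc[862, "age"] selects the same element in a single lookup, with the row label first.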
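For comparison, here is a minimal sketch of the bracket-by-bracket lookup next to a single <b>loc</b> lookup. The dataframe `toy` is a hypothetical stand-in with two rows that keep non-consecutive row labels.

```python
import pandas as pd

# Hypothetical stand-in for sampled rows with non-consecutive row labels.
toy = pd.DataFrame({"age": [48.0, 17.0], "fare": [25.9292, 10.5]}, index=[862, 84])

chained = toy["age"][862]       # Column label first, then row label (two lookups).
single = toy.loc[862, "age"]    # Row label first, then column label (one lookup).
```

The single <b>loc</b> lookup is generally the safer habit, especially when you later assign values to the selected element.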

In [77]:
# df[862]["age"]       # Raises a KeyError because df[862] looks for a column labeled 862.

KeyError: 862

In [None]:
df["age"][:3]          # Returns the first three rows under the column age.

In [None]:
df.iloc[0]             # Returns the first row.

<b>iloc</b> stands for integer location: it selects by position rather than by label. If there is only one argument inside the square brackets, it refers to the row position.

In [None]:
type(df.iloc[0])

A row in a dataframe is in fact a series too, just as a column in a dataframe was a series. In other words, a dataframe is a 2D collection of series. 

In [None]:
df.iloc[0].index

In [None]:
df.iloc[0].values

In [None]:
df.iloc[:, 0]          # Returns all rows under the first column.

If there are two arguments inside the square brackets, the first selects rows and the second selects columns. Note that when using <b>iloc you should give the row positions first, followed by the column positions</b>, all inside a single pair of square brackets.

In [None]:
df.iloc[:, :2]         # Returns all rows under the first two columns. 

In [None]:
df.iloc[:3, :2]        # Returns the first three rows under the first two columns.

In [None]:
df.iloc[-3:, -2:]      # Returns the last three rows under the last two columns.

In [None]:
df.age > 30

In [None]:
df[df.age > 30]        # Set a condition for filtering

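Make sure to put df. before the column name. Without it, Python looks for a variable named age and fails.

In [None]:
df[age > 30]           # Raises a NameError because age is not defined as a variable.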

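You can also use the <b>query()</b> method to select rows that meet a condition. The condition is written as a string, and column names are used directly without the df. prefix.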

In [None]:
df.query('age > 30')
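A query string can also refer to a Python variable by prefixing its name with @, and it can combine conditions with `and` / `or`. The sketch below uses a made-up dataframe `toy` and a made-up variable `min_age`.

```python
import pandas as pd

# Hypothetical data, used only to illustrate query().
toy = pd.DataFrame({"age": [48.0, 17.0, 36.5], "pclass": [1, 2, 2]})

min_age = 30
older = toy.query("age > @min_age")                     # @ refers to a Python variable.
older_second = toy.query("age > 30 and pclass == 2")    # Conditions can be combined.
```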

In [None]:
df[(df.age > 30) & (df.pclass == 1)]

In [None]:
df[(df.age > 30) | (df.pclass == 1)]

In [None]:
cols=["survived", "sex"]

In [None]:
type(cols)

In [None]:
df[["survived", "sex"]]                  # Equivalent to df.iloc[:, [0, 2]]

In [None]:
df[cols]

To select all rows under certain columns, put a list of column names inside the square brackets.

In [None]:
df.drop("fare", axis=1)

In [None]:
df.head()

pandas.DataFrame.drop: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

To delete an entire column and all its content, use the <b>drop()</b> method. The axis must be set to 1, so the method can find the column on the column axis. Note that the <b>drop</b> method returns a new copy, not changing the content of the dataframe provided.
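The following sketch shows that <b>drop()</b> leaves the original dataframe untouched. It uses a made-up dataframe `toy` and the `columns` keyword, which is an alternative to passing axis=1.

```python
import pandas as pd

toy = pd.DataFrame({"age": [48.0, 17.0], "fare": [25.9, 10.5]})   # Hypothetical data.

without_fare = toy.drop(columns="fare")    # Same as toy.drop("fare", axis=1).

print(list(without_fare.columns))    # The returned copy has no fare column.
print(list(toy.columns))             # The original still has both columns.
```

To keep the result, assign it to a name, as done with `without_fare` here.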

In [None]:
del df["fare"]

Another way to delete a column is to use the <b>del</b> statement. This deletes the column from the dataframe itself.

In [None]:
df.columns

## Exercises for selecting elements from a dataframe (15 questions)

Let's continue to use the Titanic dataset, but with a different sample this time.

In [None]:
df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=20, replace=False, random_state=2)
df

1\. Get the columns of <i>df</i>. 

In [None]:
# Your answer here
df.columns

2\. Get the index of <i>df</i>. 

In [None]:
# Your answer here
df.index

3\. Get the shape of <i>df</i>. 

In [None]:
# Your answer here
df.shape

4\. Get the number of rows, or records, in <i>df</i>. 

In [None]:
# Your answer here
len(df.index)

5\. Select all rows under the column <i>survived</i>. 

In [None]:
# Your answer here
df['survived'][:]

6\. Select the first 3 rows. 

In [None]:
# Your answer here
df[:3]

7\. Select the last 3 rows.

In [None]:
# Your answer here
df[-3:]

8\. Select the element with row index number 615 under the column <i>fare</i>.

In [None]:
# Your answer here
df['fare'][615]

9\. Select the element in the third row and the fifth column.

In [None]:
# Your answer here
df.iloc[2, 4]          # Positions start at 0, so the third row is at position 2 and the fifth column at position 4.

10\. Select all rows under the last column, not specifying the column label.

In [None]:
# Your answer here
df.iloc[:,-1]

11\. Select all rows under the last 2 columns, not specifying the column labels.

In [None]:
# Your answer here
df.iloc[:,-2:]

12\. Select the first 5 rows under the last 2 columns, not specifying the column labels.

In [None]:
# Your answer here
df.iloc[:5,-2:]

13\. Select all rows with their column <i>sex</i> being male.

In [None]:
# Your answer here
df[df.sex=='male']

14\. Select all rows with the column <i>fare</i> being between 50 (inclusive) and 100 (exclusive).

In [None]:
# Your answer here
df[(df.fare>=50) & (df.fare<100)]

15\. Print all values under the column <i>fare</i> with a tab between the values. 

In [None]:
# Your answer here
for fare in df.fare:
    print(fare, end='\t') 

## Handle Null Values in a DataFrame

In many cases, you should take care of the null values, or missing values, in a dataframe. There are two approaches you can consider to handle null values:
- Drop the rows with null values
- Fill the null values with something else

In [None]:
df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=10, replace=False, random_state=3)
df

In [None]:
df.isnull()

pandas.DataFrame.isnull: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html

The <b>isnull()</b> method returns a dataframe of the same shape filled with booleans: True where an entry is null, False otherwise.

In [None]:
df.isnull().any(axis=0)

pandas.DataFrame.any: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html

The <b>any()</b> method returns whether any element is True, potentially over an axis.
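Because True counts as 1 when summed, the boolean results of <b>isnull()</b> and <b>any()</b> can also be used to count null values. A minimal sketch on a made-up dataframe `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two null ages.
toy = pd.DataFrame({"age": [48.0, np.nan, np.nan], "fare": [25.9, 7.9, 10.5]})

null_counts = toy.isnull().sum()                    # Number of nulls in each column.
rows_with_null = toy.isnull().any(axis=1).sum()     # Number of rows containing a null.
```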

In [None]:
df[df.isnull().any(axis=1)]       # Returns all rows with any null values.

In [None]:
df.age.isnull().any()

In [None]:
df1 = df.copy()
df1

pandas.DataFrame.copy: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html

The <b>copy()</b> method makes a copy of a dataframe.

In [None]:
df1.dropna(inplace=True)

pandas.DataFrame.dropna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

The <b>dropna()</b> method drops rows that contain null values. By default it returns a new copy and leaves the original dataframe unchanged; with inplace=True it modifies the dataframe itself and returns None.

In [None]:
df1 = df1.dropna()

In [None]:
len(df1)

Note that the <b>dropna()</b> method drops the entire row if there is any null value in the row. 

In [None]:
df2 = df.copy()

In [None]:
df2 = df2.dropna(how="all")
df2

If you set the parameter `how` to "all", a row is dropped only when all of its values are null. The default is "any".
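The difference between the two settings is easiest to see on a tiny example. The sketch below uses a made-up dataframe `toy` in which one row is partly null and one row is entirely null; it also shows the `subset` parameter, which restricts the check to certain columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is partly null, row 2 is entirely null.
toy = pd.DataFrame({
    "age": [48.0, np.nan, np.nan],
    "fare": [25.9, 7.9, np.nan],
})

drop_any = toy.dropna()                    # how="any" is the default.
drop_all = toy.dropna(how="all")           # Drops only the entirely null row.
drop_fare = toy.dropna(subset=["fare"])    # Considers nulls in fare only.
```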

In [None]:
df3 = df.copy()

pandas.DataFrame.fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

The <b>fillna()</b> method fills null values. The `value` parameter specifies the value that replaces each null.

In [None]:
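# Recent Pandas versions warn that in-place filling on a selected column may not update df3; see the next cell.
df3.age.fillna(value=0, inplace=True)        # Targeted at a specific column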
df3

In [None]:
df3 = df.copy()
df3.age = df3.age.fillna(value=0)        # Targeted at a specific column
df3

In [None]:
df4 = df.copy()

In [None]:
df4 = df4.fillna(value={"survived": df.survived.mean(), "pclass": df.pclass.mean(), "sex": "unknown",
                        "age": df.age.mean(), "fare": df.fare.mean()})
df4

Instead of filling all null values with the same value, you can pass a dictionary that maps each column name to its own fill value.
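A minimal sketch of column-specific filling, assuming a made-up dataframe `toy` with one missing value in each column:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one null in each column.
toy = pd.DataFrame({"sex": ["male", None, "female"], "age": [20.0, np.nan, 40.0]})

# Each key is a column name; each value is the fill value for that column.
filled = toy.fillna(value={"sex": "unknown", "age": toy.age.mean()})
```

Columns that are not listed in the dictionary are left as they are.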

In [None]:
df5 = df.copy()

In [None]:
df5 = df5.fillna(method="ffill")
df5

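If you set the parameter `method` to "ffill", which means forward fill, the last non-null observation is propagated forward. Setting it to "bfill", which means backward fill, propagates the next non-null observation backward. Recent Pandas versions deprecate the `method` parameter in favor of the equivalent <b>ffill()</b> and <b>bfill()</b> methods.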
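The direction of each fill is easier to see on a short series. The sketch below uses the <b>ffill()</b> and <b>bfill()</b> methods on made-up data; it also shows that a leading null has nothing before it to carry forward.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])   # Hypothetical series with a gap.

forward = s.ffill()     # The gap takes the value before it.
backward = s.bfill()    # The gap takes the value after it.

# A null at the start stays null under forward fill.
leading = pd.Series([np.nan, 2.0]).ffill()
```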

In [None]:
df6 = df.copy()

In [None]:
df6.fillna(method="bfill", inplace=True)
df6

## Exercises for handling null values (6 questions)

Suppose you have <i>df1</i>, <i>df2</i>, <i>df3</i>, <i>df4</i>, and <i>df5</i>, each of which is a copy of <i>df</i>.

In [None]:
df = load_dataset("titanic")
df = df[["survived", "pclass", "sex", "age", "fare"]].sample(n=10, replace=False, random_state=4)
df.iloc[1, 0] = None
df.iloc[7, 1] = None
df.iloc[3, 4] = None
df.iloc[[5, 9], :] = None

df

In [None]:
df1, df2, df3, df4, df5 = df.copy(), df.copy(), df.copy(), df.copy(), df.copy()

1\. Select all rows in <i>df</i> with any null values.

In [None]:
# Your answer here
df[df.isnull().any(axis=1)]

2\. Drop the rows in <i>df1</i> that have any null values. Make sure to assign the resulting dataframe back to <i>df1</i> to actually change <i>df1</i>.

In [None]:
# Your answer here
df1=df1.dropna()

3\. Drop the rows in <i>df2</i> in which all values are null. Make sure to assign the resulting dataframe back to <i>df2</i> to actually change <i>df2</i>.

In [None]:
# Your answer here
df2 = df2.dropna(how='all')
df2

4\. Fill the missing values under the column <i>sex</i> in <i>df3</i> with 'unknown'.

In [None]:
# Your answer here
df3["sex"] = df3["sex"].fillna(value='unknown')   # Assign back; fillna() returns a new series.
df3

5\. Fill the missing values under the columns <i>pclass</i>, <i>age</i>, and <i>fare</i> in <i>df4</i> with the minimum value of their column.

In [None]:
# Your answer here
df4 = df4.fillna({'pclass': df4.pclass.min(), 'age': df4.age.min(), 'fare': df4.fare.min()})
df4

6\. Fill all missing values in <i>df5</i> using backward fill, that is, with the next non-null observation.

In [None]:
# Your answer here
df5 = df5.bfill()      # Equivalent to df5.fillna(method="bfill")
df5

## Working with Dataframe Indices
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html

In [None]:
import pandas as pd

sampleTs = pd.read_csv('../../Data/sampleTs.csv')

sampleTs.head(2)

In [None]:
sampleTs.set_index('TimeStamp')      # Returns a new dataframe; sampleTs itself is not changed.

In [None]:
sampleTs.head(2)

In [None]:
sampleTs = sampleTs.set_index('TimeStamp')
sampleTs.head(2)

In [None]:
sampleTs = pd.read_csv('../../Data/sampleTs.csv')
sampleTs.set_index('TimeStamp', inplace=True)
sampleTs.head(2)

In [None]:
sampleTs.reset_index()               # Also returns a new dataframe; sampleTs keeps TimeStamp as its index.

In [None]:
sampleTs.head(2)

In [None]:
sampleTs.reset_index(inplace=True)
sampleTs.head(2)

In [None]:
sampleTs = pd.read_csv('../../Data/sampleTs.csv')
sampleTs.set_index('TimeStamp', inplace=True)
sampleTs.reset_index(inplace=True, drop=True)   # drop=True discards the old index instead of restoring it as a column.
sampleTs.head(2)
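The cells above depend on a local CSV file. As a self-contained sketch of the same steps, the example below builds a small in-memory dataframe instead. The column `Value` and the two rows are assumptions made for illustration; only the column name `TimeStamp` comes from the cells above.

```python
import pandas as pd

# Hypothetical stand-in for sampleTs.csv; only the TimeStamp column name is taken from the notebook.
ts = pd.DataFrame({"TimeStamp": ["2020-01-01", "2020-01-02"], "Value": [1.5, 2.5]})

indexed = ts.set_index("TimeStamp")          # TimeStamp becomes the row index.
restored = indexed.reset_index()             # TimeStamp returns as a regular column.
dropped = indexed.reset_index(drop=True)     # TimeStamp is discarded instead.
```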

## Iterate over a dataframe

There are four ways to iterate over a dataframe:  
- Using <b>iloc</b>
- Using <b>items()</b> (called <b>iteritems()</b> before Pandas 2.0) to iterate over the columns as (column label, series) pairs
- Using <b>iterrows()</b> to iterate over the rows as (index, series) pairs
- Using <b>itertuples()</b> to iterate over the rows as named tuples

You can choose any of the four above depending on how you want to retrieve data from a dataframe.

In [None]:
df

In [None]:
# Iterates by row
for i in range(len(df)):
    print(i, df.iloc[i].values)

In [None]:
# Iterates by column
for i in range(len(df.columns)):
    print(i, df.iloc[:, i].values)

In [None]:
# Iterates over the columns as (column label, series) pairs
for key, val in df.items():       # items() replaces iteritems(), which was removed in Pandas 2.0
    print(key)
    print(val)

In [None]:
# Iterates over the rows as (index, series) pairs
for idx, series in df.iterrows():
    print(idx)
    print(series)

In [None]:
# Iterates over the rows as named tuples
for row in df.itertuples():
    print(row)
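Each named tuple exposes the row label as `Index` and every column as an attribute. The sketch below uses a made-up two-row dataframe `toy` to show this.

```python
import pandas as pd

# Hypothetical data with non-consecutive row labels.
toy = pd.DataFrame({"age": [48.0, 17.0], "fare": [25.9, 10.5]}, index=[862, 84])

for row in toy.itertuples():
    # row.Index is the row label; row.age and row.fare are the column values.
    print(row.Index, row.age, row.fare)

labels = [row.Index for row in toy.itertuples()]
ages = [row.age for row in toy.itertuples()]
```

Attribute access of this kind works only for column names that are valid Python identifiers.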

## Aggregate and group a dataframe

In [None]:
df = load_dataset("titanic")
df

In [None]:
df.describe()

pandas.DataFrame.describe: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

The <b>describe()</b> method generates various descriptive summary statistics of a series or a dataframe.

In [None]:
df.mean(numeric_only=True)        # Returns the mean value of each numeric column.  

In [None]:
df.age.mean()                     # Returns the mean age. 

In [None]:
df['age'].mean()

You can specify the column you are interested in.

In [None]:
df.age.min()                      # Returns the minimum age.

In [None]:
df.age.max()                      # Returns the maximum age.

In [None]:
df.age.std()                      # Returns the standard deviation of age.

In [None]:
df[df.pclass == 1].age.mean()     # Returns the mean age of the first class passengers. 

If you are only interested in a subset of rows, first select the rows using filtering and then do what you want. 

In [None]:
df[(df.pclass == 1) & (df.survived == 0)].fare.mean() # Returns the mean fare of the first class passengers who died. 

In [None]:
df.groupby("sex").mean(numeric_only=True) 

pandas.DataFrame.groupby: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

The <b>groupby()</b> method groups the rows of a dataframe by the values in one or more columns. It should be followed by the operation you want to apply to each group, such as mean().

In [None]:
df.groupby("class").mean(numeric_only=True)

In [None]:
df.groupby("class").age.mean()

In [None]:
df.groupby(["sex", "class"]).age.mean()

You can group by multiple columns. The order of the columns matters: rows are grouped by the first column, and then by the second column within each group.
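Grouping by two columns produces a result with a two-level index. The sketch below, on a made-up four-row dataframe `toy`, shows how to pick one group out of that result and how <b>agg()</b> can compute several statistics at once.

```python
import pandas as pd

# Hypothetical data, used only to illustrate grouping.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "pclass": [1, 1, 3, 3],
    "age": [40.0, 30.0, 20.0, 10.0],
})

by_sex_class = toy.groupby(["sex", "pclass"]).age.mean()    # Indexed by (sex, pclass).
female_first = by_sex_class.loc[("female", 1)]              # One group, selected with a tuple.

summary = toy.groupby("sex").age.agg(["mean", "min", "max"])    # Several statistics per group.
```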

In [None]:
df.sort_values(by="fare", ascending=False)

pandas.DataFrame.sort_values: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html

The <b>sort_values()</b> method sorts a dataframe by the values in one or more columns and returns a sorted copy.

In [None]:
df.sort_values(by=["fare", "age"], ascending=[False, True])

You can sort by multiple columns. The order of the columns matters: rows are sorted by the first column, and ties are broken by the second.
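A tie in the first column makes the role of the second column visible. In the sketch below, the made-up dataframe `toy` has two rows with the same fare, so their order is decided by age.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 tie on fare.
toy = pd.DataFrame({"fare": [30.0, 30.0, 7.75], "age": [26.0, 19.0, 32.0]})

# Highest fare first; within equal fares, youngest first.
sorted_toy = toy.sort_values(by=["fare", "age"], ascending=[False, True])

print(sorted_toy)
```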

In [None]:
df.sample(n=10, replace=False, random_state=0)

pandas.DataFrame.sample: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html

The <b>sample()</b> method returns a random sample of items from a dataframe. 
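Fixing `random_state` makes a sample reproducible, which is why the earlier cells in this notebook set it. The sketch below checks this on a made-up 100-row dataframe `toy` and also shows the `frac` parameter, which samples a fraction of the rows instead of a fixed number.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(100)})   # Hypothetical data.

first = toy.sample(n=5, replace=False, random_state=0)
second = toy.sample(n=5, replace=False, random_state=0)    # Same seed, same rows.

tenth = toy.sample(frac=0.1, random_state=0)    # 10% of the rows.
```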

## Exercises for aggregation and grouping (8 questions)

Let's continue to use the entire Titanic dataframe <i>df</i>.

1\. What was the highest fare?

In [None]:
# Your answer here
df.fare.max()

2\. What was the lowest fare?

In [None]:
# Your answer here
df.fare.min()

3\. What was the mean age of the female passengers? (Select the female passengers first.)

In [None]:
# Your answer here
df[df.sex == 'female'].age.mean()

4\. Select the rows, or passengers, who were under the age of ten and died. (Put two conditions for filtering.)

In [None]:
# Your answer here
df[(df.age<10) & (df.alive=='no')]


5\. What were the mean ages for those who survived and who died, respectively? In other words, group the dataframe by <i>survived</i> and get the mean age of each group. 

In [None]:
# Your answer here
df.groupby("survived").age.mean()

6\. Group the dataframe by <i>survived</i> and then by <i>pclass</i> and get the mean fare of each group. 

In [None]:
# Your answer here
df.groupby(["survived","pclass"]).fare.mean()

7\. Get a copy of <i>df</i> in which the entire dataframe is sorted by <i>age</i> and then by <i>pclass</i> in descending order, respectively. 

In [None]:
# Your answer here
df.sort_values(by=['age', 'pclass'], ascending=[False, False])

8\. Create a random sample of 50 rows with no duplicates from <i>df</i>. 

In [None]:
# Your answer here
df.sample(n=50,replace=False)