# Introduction to Pandas

**Pandas is one of the most popular Python libraries, perhaps because it is the workhorse for data analysts using Python.**

## Pandas = Panel + Data

Etymologically, Pandas is a portmanteau of ‘Panel’ and ‘Data’. Panel data is a term for data sets that contain repeated observations of the same subjects or individuals over a period of time.

## Pandas functionalities

Pandas covers many common data analysis tasks, including:

- loading your data from a file to start your analysis
- previewing your data to understand the data
- manipulating your data for gathering better insights
- transforming your data from one format to another

As with any Python module, you need to import Pandas before you can use it in your program.

## Importing Pandas

As with NumPy, you must import the package before you can use Pandas and the functions it provides. This is done with the following code:


In [None]:
import pandas as pd

By convention, Pandas is imported under the alias ‘pd’. Any other name would technically work, but ‘pd’ is the alias used throughout the Pandas documentation and in most Python code, so following it keeps your code readable to others.

## How is it different from NumPy?

NumPy is powerful, but it lacks some of the high-level functions and abstractions that a data analyst needs for everyday work with structured data tables. For example:

- Labelled columns: labelled data is especially useful for explicit data alignment when loading data into Pandas.
- Reading from files: datasets stored in CSV and XLSX (among other formats) can be read easily using Pandas.

Pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and data manipulation tool for Python. It is built on top of NumPy and makes NumPy’s capabilities easier to apply in data analytics applications. That said, it is still important and useful to learn NumPy, because many core features of Pandas and other packages are based on NumPy functionality.
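To make the point about labels concrete, here is a minimal sketch that stores the same numbers in a NumPy array and in a DataFrame. The row labels, column names, and values are illustrative assumptions, not data from a real source.

```python
import numpy as np
import pandas as pd

# Plain NumPy: rows and columns are addressed by position only.
arr = np.array([[1.0, 11.0], [2.5, 20.0]])
pop_first_row = arr[0, 0]   # the reader must remember that column 0 means "pop"

# Pandas: the same numbers carry row and column labels (hypothetical names).
df = pd.DataFrame(arr, index=["WA", "VIC"], columns=["pop", "GDP"])
pop_wa = df.loc["WA", "pop"]   # access by label instead of position
```

Both lookups return the same number; the Pandas version simply documents itself through the labels.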

## Pandas data structures

There are two fundamental data structures that Pandas provides:

- Pandas Series
- Pandas DataFrame

Together, these two data structures provide a solid basis for most of the applications that you would build in Python. A Pandas Series is a one-dimensional, Array-like object consisting of an Array of data and an associated Array of labels, known as its index. A DataFrame is a two-dimensional structure that represents tabular, spreadsheet-like data as an ordered collection of columns, with an index for both rows and columns. Let’s explore these data structures in some detail in the subsequent sections.

# Pandas data structure: Pandas Series

Pandas Series:

- is a one-dimensional Array that can hold data of any type
- contains an Array of data
- contains an associated Array of labels, which is also known as its index.

## Creating Pandas Series

The simplest way to create a Series is to pass an Array of data to the Series function, as shown in the code snippet below:

Code:

In [None]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

In [None]:
ser1 = Series([1,2,3,4,5,6,7])
ser1

0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64

Looking at the output of the code above, we can infer that:

- the second column contains the Array data
- the first column contains the default labels (i.e., the default index assigned to the data).

## Accessing data and index of Series
While describing Series, we said that it contains both an Array of data and an Array of labels. Two attributes expose these parts: values returns the Array representation of the data, and index returns the index object of the Series. A third attribute, dtype, reports the data type of the values. The following code snippet highlights this functionality:

Code:

In [None]:
ser1.values

array([1, 2, 3, 4, 5, 6, 7])

In [None]:
ser1.dtype

dtype('int64')

In [None]:
ser1.index

RangeIndex(start=0, stop=7, step=1)

## Creating Series with explicit labelling

It is often necessary to create a Series with a labelled index for each data point. This is achieved by passing the labels to the Series function through the index parameter, as shown in the code snippet below:

Code:

In [None]:
ser2 = Series([1,2,3,4,5,6,7], index=['a','b','c','d','e','f','g'])
ser2

a    1
b    2
c    3
d    4
e    5
f    6
g    7
dtype: int64

In [None]:
ser2.values

array([1, 2, 3, 4, 5, 6, 7])

In [None]:
ser2.dtype

dtype('int64')

In [None]:
ser2.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')

## Accessing values using indexes (Default index or Explicit index)
Values in the Series can be accessed using the index notation. We can either pass a single index or multiple indexes (Default index or Explicit index) to access the values. The code snippet below highlights the behaviour:

Code:

In [None]:
ser2

a    1
b    2
c    3
d    4
e    5
f    6
g    7
dtype: int64

In [None]:
ser2['a'], ser2['f']

(1, 6)

In [None]:
ser2[['a','b','f']]

a    1
b    2
f    6
dtype: int64
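The examples above use explicit labels. As a small sketch of the same access pattern with the default index, the following uses a made-up Series; with a default index, the integers passed in square brackets are the labels themselves.

```python
from pandas import Series

ser_default = Series([10, 20, 30, 40])   # default index 0, 1, 2, 3

single = ser_default[2]          # the value stored under label 2
several = ser_default[[0, 3]]    # a list of labels returns a new Series
```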

## Creating Series from Python dictionaries
A Pandas Series can also be created by passing a dictionary object to the Series function. In this case, the keys of the dictionary become the labels of the Series, and the values of the dictionary become its data. The code snippet below highlights the behaviour.

We first create a dictionary of key:value pairs, where:

- key = country code
- value = description of the currency.

We then create the Pandas Series by passing the dictionary to the Series() function:

- the values from the dictionary = the data for the Series
- the keys from the dictionary = the index for the Series.

Code:

In [None]:
dict_1 = {'AU':'Australian Dollar',
     	'US': 'US Dollar',
     	'IN': 'Indian Ruppees',
     	'DK': 'Danish Krones',
     	'SW': 'Swiss Francs'}
dict_1

{'AU': 'Australian Dollar',
 'DK': 'Danish Krones',
 'IN': 'Indian Ruppees',
 'SW': 'Swiss Francs',
 'US': 'US Dollar'}

In [None]:
ser1 = Series(dict_1)
ser1

AU    Australian Dollar
US            US Dollar
IN       Indian Ruppees
DK        Danish Krones
SW         Swiss Francs
dtype: object
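A dictionary can also be combined with the index parameter. The sketch below, which uses a shortened version of the dictionary and a hypothetical extra key 'NZ', suggests how the index then selects and orders the keys, and how a key that is absent from the dictionary ends up as a missing value.

```python
from pandas import Series

currencies = {'AU': 'Australian Dollar', 'US': 'US Dollar'}

# index= selects and orders the keys; 'NZ' is not in the dictionary,
# so its value is filled with a missing-value marker.
ser_sel = Series(currencies, index=['US', 'NZ', 'AU'])
missing = ser_sel.isna()
```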

## Name attribute for Series and index
Both the Pandas Series and its index have the ‘name’ attribute, and this is used at various places during programming with Pandas. We will see its usage in some exercises during the course.

For now, it’s good to remember that such an attribute exists and its value can be accessed programmatically.

The following code snippet highlights the ‘name’ attribute. You can see that when we display:

- the Series, the name of the Series is displayed along with the data
- the index, the name of the index is displayed along with the labels.

Code:

In [None]:
ser1.name = "Currency"
ser1.index.name="Country"
ser1

Country
AU    Australian Dollar
US            US Dollar
IN       Indian Ruppees
DK        Danish Krones
SW         Swiss Francs
Name: Currency, dtype: object

Code:

In [None]:
ser1.index

Index(['AU', 'US', 'IN', 'DK', 'SW'], dtype='object', name='Country')
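One place where these names are used is when a Series is converted into a table. As a minimal sketch (with a shortened version of the currency data), reset_index() moves the index into a regular column, and the two names become the column headers.

```python
from pandas import Series

ser = Series({'AU': 'Australian Dollar', 'US': 'US Dollar'})
ser.name = 'Currency'
ser.index.name = 'Country'

# The index name and the Series name become the column headers
# of the resulting DataFrame.
df_currency = ser.reset_index()
columns = list(df_currency.columns)
```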

# Pandas data structure: Pandas DataFrame

A Pandas DataFrame:

- is a tabular, spreadsheet-like data structure
- contains an ordered list of columns, and each can have different data types
- has indexes or labels for both rows and columns.

Let’s explore it through examples.

## Creating Pandas DataFrame

There are various ways to create a DataFrame using the function DataFrame(), but one of the most common is to pass a dictionary of equal-length lists or NumPy Arrays. In this case, the keys of the dictionary become the column labels, and the rows receive a default integer index. The following code snippet highlights this behaviour:

Code:

In [None]:
data = {
	'state':['WA', 'SA', 'VIC', 'NSW', 'ACT', 'QLD', 'NT'],
	'pop': [1,1,2.5,2.7,0.5,1.5,0.4],
	'TZ': ['GMT+8', 'GMT+9.30','GMT+10', 'GMT+10', 'GMT+10', 'GMT+10', 'GMT+9.30']
}

df_states = DataFrame(data)
df_states

Unnamed: 0,state,pop,TZ
0,WA,1.0,GMT+8
1,SA,1.0,GMT+9.30
2,VIC,2.5,GMT+10
3,NSW,2.7,GMT+10
4,ACT,0.5,GMT+10
5,QLD,1.5,GMT+10
6,NT,0.4,GMT+9.30


Later in the course, we will see how to populate DataFrames using CSV files and databases.
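The dictionary values can be NumPy Arrays instead of lists, as long as every column has the same length. The following sketch uses a shortened, illustrative version of the state data and also shows the error raised when the lengths differ.

```python
import numpy as np
from pandas import DataFrame

data_np = {
    'state': np.array(['WA', 'SA', 'VIC']),
    'pop': np.array([1.0, 1.0, 2.5]),
}
df_np = DataFrame(data_np)
shape = df_np.shape

# Columns of unequal length cannot be combined into one table.
length_error = None
try:
    DataFrame({'state': ['WA', 'SA'], 'pop': [1.0]})
except ValueError as err:
    length_error = str(err)
```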

## Specifying the sequence of columns

We can also specify the order of the columns for a DataFrame, and the data will be arranged in the column order specified. See the code snippet below for a demonstration:

Code:

In [None]:
df_states=DataFrame(data, columns=['state','TZ','pop'])
df_states

Unnamed: 0,state,TZ,pop
0,WA,GMT+8,1.0
1,SA,GMT+9.30,1.0
2,VIC,GMT+10,2.5
3,NSW,GMT+10,2.7
4,ACT,GMT+10,0.5
5,QLD,GMT+10,1.5
6,NT,GMT+9.30,0.4


If a column name is specified but no data is supplied for that column, the DataFrame fills the column with missing values (NaN).

Code:

In [None]:
df_states1=DataFrame(data, columns=['state','TZ','pop','GDP'])
df_states1

Unnamed: 0,state,TZ,pop,GDP
0,WA,GMT+8,1.0,
1,SA,GMT+9.30,1.0,
2,VIC,GMT+10,2.5,
3,NSW,GMT+10,2.7,
4,ACT,GMT+10,0.5,
5,QLD,GMT+10,1.5,
6,NT,GMT+9.30,0.4,


## Specifying the row labels

As we have seen in the case of Series, we can specify explicit indexes for the rows of a DataFrame. This is done by passing the index parameter to the DataFrame function. The index parameter takes the list of row labels, one per row of data. (To use one of the existing columns as the index instead, call the set_index() method on the DataFrame after creating it.)

See the code example below for a demonstration:

Code:

In [None]:
#"Specifying the Row Labels, or Indexes"
data = {
	'state':['Western Australia', 'Southern Australia', 'Victoria', 'New South Wales', 'Australian Capital Territory', 'Queensland', 'Northern Territory'],
	'pop': [1,1,2.5,2.7,0.5,1.5,0.4],
	'TZ': ['GMT+8', 'GMT+9.30','GMT+10', 'GMT+10', 'GMT+10', 'GMT+10', 'GMT+9.30']
}

row_labels = ['WA', 'SA','VIC','NSW','ACT','QLD','NT']
df_states_lbl = DataFrame(data, columns=['state','TZ','pop'], index=row_labels)
df_states_lbl

Unnamed: 0,state,TZ,pop
WA,Western Australia,GMT+8,1.0
SA,Southern Australia,GMT+9.30,1.0
VIC,Victoria,GMT+10,2.5
NSW,New South Wales,GMT+10,2.7
ACT,Australian Capital Territory,GMT+10,0.5
QLD,Queensland,GMT+10,1.5
NT,Northern Territory,GMT+9.30,0.4
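When the labels already exist as a column in the data, that column can be promoted to the row index with set_index(). This is a minimal sketch; the 'code' column is a hypothetical addition to a shortened version of the data above.

```python
from pandas import DataFrame

data_codes = {
    'code': ['WA', 'SA', 'VIC'],   # hypothetical column holding the labels
    'state': ['Western Australia', 'Southern Australia', 'Victoria'],
    'pop': [1, 1, 2.5],
}

# set_index returns a new DataFrame whose row index is the 'code' column.
df_by_code = DataFrame(data_codes).set_index('code')
```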


## Retrieving columns from DataFrame

Individual columns in DataFrame are stored as Pandas Series. We can access the data in the individual columns by either:

- using index notation and passing the column name as the index, e.g. df['state']
- using attribute notation (df.state), where the column name itself is the name of the attribute.

While retrieving the individual columns, the index of DataFrame is retained in the retrieved Series. The following code snippet showcases this behaviour:

Code:

In [None]:
df_states_lbl

Unnamed: 0,state,TZ,pop
WA,Western Australia,GMT+8,1.0
SA,Southern Australia,GMT+9.30,1.0
VIC,Victoria,GMT+10,2.5
NSW,New South Wales,GMT+10,2.7
ACT,Australian Capital Territory,GMT+10,0.5
QLD,Queensland,GMT+10,1.5
NT,Northern Territory,GMT+9.30,0.4


In [None]:
series_states = df_states_lbl['state']
series_states

WA                Western Australia
SA               Southern Australia
VIC                        Victoria
NSW                 New South Wales
ACT    Australian Capital Territory
QLD                      Queensland
NT               Northern Territory
Name: state, dtype: object
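Attribute notation has a limitation worth knowing: it only works when the column name does not clash with an existing DataFrame attribute or method. The sketch below uses a shortened version of the data; note that 'pop' is also the name of a DataFrame method, so df.pop does not return the column.

```python
from pandas import DataFrame, Series

df = DataFrame({'state': ['Western Australia', 'Victoria'], 'pop': [1.0, 2.5]},
               index=['WA', 'VIC'])

by_attr = df.state                     # attribute notation
same = by_attr.equals(df['state'])     # identical to index notation

# df.pop is the DataFrame method of that name, not the 'pop' column,
# so index notation is required for this column.
pop_is_column = isinstance(df.pop, Series)
pop_col = df['pop']
```

Index notation works for every column name, which makes it the safer default.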

## Retrieving rows from DataFrame

Individual rows can also be accessed from the DataFrame by using the loc indexer and passing the row label in square brackets.

Code:

In [None]:
df_states_lbl

Unnamed: 0,state,TZ,pop
WA,Western Australia,GMT+8,1.0
SA,Southern Australia,GMT+9.30,1.0
VIC,Victoria,GMT+10,2.5
NSW,New South Wales,GMT+10,2.7
ACT,Australian Capital Territory,GMT+10,0.5
QLD,Queensland,GMT+10,1.5
NT,Northern Territory,GMT+9.30,0.4


In [None]:
wa_data = df_states_lbl.loc['WA']
wa_data

state    Western Australia
TZ                   GMT+8
pop                      1
Name: WA, dtype: object

If we don’t want to use the explicit row label, we can use the default positional index of the rows instead. For such scenarios, there is the iloc indexer, which takes an integer position in square brackets and returns the corresponding row.

Code:

In [None]:
wa_data1 = df_states_lbl.iloc[0]
wa_data1

state    Western Australia
TZ                   GMT+8
pop                      1
Name: WA, dtype: object

In [None]:
wa_data1 = df_states_lbl.iloc[1]
wa_data1

state    Southern Australia
TZ                 GMT+9.30
pop                       1
Name: SA, dtype: object
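Both indexers accept more than a single row. As a sketch on a shortened version of the data, loc can take a list of labels or a row and column pair, while iloc accepts negative positions and slices in the same way as Python lists.

```python
from pandas import DataFrame

df = DataFrame({'TZ': ['GMT+8', 'GMT+9.30', 'GMT+10'], 'pop': [1.0, 1.0, 2.5]},
               index=['WA', 'SA', 'VIC'])

two_rows = df.loc[['WA', 'VIC']]   # list of labels -> DataFrame
one_cell = df.loc['VIC', 'pop']    # row label and column label -> single value
last_row = df.iloc[-1]             # positional, counted from the end
first_two = df.iloc[0:2]           # positional slice, end excluded
```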

## Changing data in a particular column

The values in a particular column can be changed by assigning either a scalar value or a list of values (with as many elements as there are rows in the DataFrame). The following code snippet shows this behaviour:

Code:

In [None]:
data = {
	'state':['Western Australia', 'Southern Australia', 'Victoria', 'New South Wales', 'Australian Capital Territory', 'Queensland', 'Northern Territory'],
	'pop': [1,1,2.5,2.7,0.5,1.5,0.4],
	'TZ': ['GMT+8', 'GMT+9.30','GMT+10', 'GMT+10', 'GMT+10', 'GMT+10', 'GMT+9.30']
}

row_labels = ['WA', 'SA','VIC','NSW','ACT','QLD','NT']

df_states = DataFrame(data, columns=['state','TZ','pop','GDP'], index=row_labels)
df_states

Unnamed: 0,state,TZ,pop,GDP
WA,Western Australia,GMT+8,1.0,
SA,Southern Australia,GMT+9.30,1.0,
VIC,Victoria,GMT+10,2.5,
NSW,New South Wales,GMT+10,2.7,
ACT,Australian Capital Territory,GMT+10,0.5,
QLD,Queensland,GMT+10,1.5,
NT,Northern Territory,GMT+9.30,0.4,


In [None]:
#Access GDP column and assign a constant value
df_states['GDP'] = 16
df_states

Unnamed: 0,state,TZ,pop,GDP
WA,Western Australia,GMT+8,1.0,16
SA,Southern Australia,GMT+9.30,1.0,16
VIC,Victoria,GMT+10,2.5,16
NSW,New South Wales,GMT+10,2.7,16
ACT,Australian Capital Territory,GMT+10,0.5,16
QLD,Queensland,GMT+10,1.5,16
NT,Northern Territory,GMT+9.30,0.4,16


In the example below, we pass a list of values instead of a constant value:

In [None]:
gdp_data = [11,8,20,22,15,18,8]
df_states['GDP'] = gdp_data
df_states

Unnamed: 0,state,TZ,pop,GDP
WA,Western Australia,GMT+8,1.0,11
SA,Southern Australia,GMT+9.30,1.0,8
VIC,Victoria,GMT+10,2.5,20
NSW,New South Wales,GMT+10,2.7,22
ACT,Australian Capital Territory,GMT+10,0.5,15
QLD,Queensland,GMT+10,1.5,18
NT,Northern Territory,GMT+9.30,0.4,8
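Two details of column assignment are easy to miss, and the following sketch (on a shortened, illustrative table) tries to show both: a list must match the number of rows exactly, whereas a Series is matched to the rows by label.

```python
from pandas import DataFrame, Series

df = DataFrame({'pop': [1.0, 1.0, 2.5]}, index=['WA', 'SA', 'VIC'])

# A Series is aligned on the row labels, not on position; rows without
# a matching label receive NaN.
df['GDP'] = Series({'VIC': 20, 'WA': 11})

# A list must contain exactly one value per row.
try:
    df['GDP2'] = [1, 2]
    length_ok = True
except ValueError:
    length_ok = False
```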


## Adding a column to DataFrame

Using the same convention, if we specify a column name that doesn’t exist in DataFrame and pass the list of values or a scalar, it results in the addition of a new column to the DataFrame.

See the below example for a demonstration:

Code:

In [None]:
df_states

Unnamed: 0,state,TZ,pop,GDP
WA,Western Australia,GMT+8,1.0,11
SA,Southern Australia,GMT+9.30,1.0,8
VIC,Victoria,GMT+10,2.5,20
NSW,New South Wales,GMT+10,2.7,22
ACT,Australian Capital Territory,GMT+10,0.5,15
QLD,Queensland,GMT+10,1.5,18
NT,Northern Territory,GMT+9.30,0.4,8


In [None]:
df_states['area'] = "TBD"
df_states

Unnamed: 0,state,TZ,pop,GDP,area
WA,Western Australia,GMT+8,1.0,11,TBD
SA,Southern Australia,GMT+9.30,1.0,8,TBD
VIC,Victoria,GMT+10,2.5,20,TBD
NSW,New South Wales,GMT+10,2.7,22,TBD
ACT,Australian Capital Territory,GMT+10,0.5,15,TBD
QLD,Queensland,GMT+10,1.5,18,TBD
NT,Northern Territory,GMT+9.30,0.4,8,TBD
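A new column can also be computed from existing columns. The sketch below assumes a shortened version of the table; the column name gdp_per_pop is hypothetical, and the units of the source figures are not stated, so the result is only illustrative.

```python
from pandas import DataFrame

df = DataFrame({'pop': [1.0, 2.5], 'GDP': [11, 20]}, index=['WA', 'VIC'])

# Element-wise division of two columns creates the new column row by row.
df['gdp_per_pop'] = df['GDP'] / df['pop']
```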


So far, we have seen some of the basic operations related to the Series and DataFrame. In the subsequent sections, let’s explore some of the essential functionalities for Pandas, as these will form the foundation of the data wrangling activities for any data analytics projects that you would be involved in.

# QUIZ

1. Which of the following is incorrect about Pandas functionalities?
  - Pandas helps you load your data.
  - It can preview your data.
  - It can help gather better insights into your data.
  - **It works only with your numerical data.**

2. When importing Pandas, what is the commonly used reference name?
  - **pd**
  - np
  - pn
  - nt

3. Which of the following best describes Pandas Series?
  - You can create a series by passing a pandas function.
  - It is a two-dimensional object.
  - It contains only an ordered list of columns.
  - **It contains both Array of data and Array of labels.**

4. Which of the following functions will you use if you want to change the sequence of the columns in your existing table from ‘name’, ‘ID’, ‘phone number’ to ‘ID’, ‘phone number’, ‘name’?
  - df_customerinfo=DataFrame(data, columns=['name','ID','phone number'])
  - df_customerinfo=DataFrame(data, rows=['ID','phone number','name'])
  - **df_customerinfo=DataFrame(data, columns=['ID','phone number','name'])**
  - df_customerinfo=DataFrame(data, series=['ID','phone number','name'])



# Pandas: Essential operations

**Similar to NumPy, Pandas also has some essential operations that you would need to explore interacting with the data stored in Series and DataFrame.**

## Reindexing

An important operation that we perform on the Pandas data structure is reindexing, which means to create a new object and rearrange the data in the Pandas data structure, conforming to the new index.

While doing so, if data is not present for some index in the original data, missing values are added, corresponding to those indexes.

Code:

In [None]:
a = Series(np.random.randn(10), index=['a','b','c','d','e','f','g','h','i','j'])
a

a    0.671328
b    0.764802
c    0.804547
d   -0.219320
e    0.518468
f    0.000156
g   -0.547678
h   -0.338730
i   -0.319640
j    1.727902
dtype: float64

In [None]:
new_index = ['a','A1','b','B1','c','C1','d','e','f','g','h','i','j']
a_new = a.reindex(new_index)
a_new

a     0.671328
A1         NaN
b     0.764802
B1         NaN
c     0.804547
C1         NaN
d    -0.219320
e     0.518468
f     0.000156
g    -0.547678
h    -0.338730
i    -0.319640
j     1.727902
dtype: float64

Imagine a situation where you are processing employee records. However, many of the employees have supplied incomplete information. You need a way to handle these cases and highlight the gaps to follow up with them. Perhaps you could insert ‘Unknown’ into all the empty fields, to make the missing values easy to identify. There are various ways the missing values can be handled during reindexing.

We can:

- either specify a particular value to be filled, and we do this by adding a parameter to the reindex method i.e. `fill_value = <pre defined value to be filled>`

- or we can specify the pre-defined options by passing a parameter `method = <pre defined method values>`. This is especially useful in case we need to do operations like interpolation, forward fill, backward fill, and so on for time series data analysis.

These two methods are demonstrated by the code snippet below:

1. Specifying fill_value

Code:

In [None]:
a_fillvalue = a.reindex(new_index, fill_value=0)
a_fillvalue

a     0.671328
A1    0.000000
b     0.764802
B1    0.000000
c     0.804547
C1    0.000000
d    -0.219320
e     0.518468
f     0.000156
g    -0.547678
h    -0.338730
i    -0.319640
j     1.727902
dtype: float64
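The fill value does not have to be numeric. Returning to the employee records scenario, the following sketch uses made-up employee IDs and phone numbers to mark the missing entries as 'Unknown'.

```python
from pandas import Series

# Hypothetical records: only two of three employees supplied a phone number.
phones = Series({'emp1': '555-0100', 'emp3': '555-0101'})
all_staff = ['emp1', 'emp2', 'emp3']

phones_full = phones.reindex(all_staff, fill_value='Unknown')
```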

2. Specifying ‘method’ parameter

Code:

In [None]:
a = Series(np.random.randn(10), index=[0,2,4,6,8,10,12,14,16,18])
a

0     1.096819
2     0.629804
4     0.482115
6    -0.826161
8    -1.432868
10    0.193929
12    1.669338
14   -0.789566
16    1.617491
18    0.105089
dtype: float64

In [None]:
## Reindex so that indexes 1,3,5... are introduced in the series
a_new = a.reindex(range(20))
a_new

0     1.096819
1          NaN
2     0.629804
3          NaN
4     0.482115
5          NaN
6    -0.826161
7          NaN
8    -1.432868
9          NaN
10    0.193929
11         NaN
12    1.669338
13         NaN
14   -0.789566
15         NaN
16    1.617491
17         NaN
18    0.105089
19         NaN
dtype: float64

In [None]:
## Perform similar reindex but with forward fill method specific for null values
a_ffill = a.reindex(range(20), method='ffill')
a_ffill

0     1.096819
1     1.096819
2     0.629804
3     0.629804
4     0.482115
5     0.482115
6    -0.826161
7    -0.826161
8    -1.432868
9    -1.432868
10    0.193929
11    0.193929
12    1.669338
13    1.669338
14   -0.789566
15   -0.789566
16    1.617491
17    1.617491
18    0.105089
19    0.105089
dtype: float64

Observe indexes 1, 3, and 5: each has been populated with the value from the preceding index. For the complete list of parameters of the reindex method, refer to the documentation at the following links:

Read: [Pandas documentation for Series reindexing](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html)

Read: [Pandas documentation for DataFrame reindexing](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html)
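Backward fill works the same way as forward fill but takes the value from the next existing index. The sketch below uses small made-up Series and also shows one constraint of the method parameter: the original index has to be sorted (monotonic), otherwise reindex raises an error.

```python
from pandas import Series

s = Series([10.0, 20.0, 30.0], index=[0, 3, 6])

# Each new index takes the value of the next existing index.
back = s.reindex(range(7), method='bfill')

# Fill methods need a sorted index to decide what "previous" and "next" mean.
unsorted = Series([1.0, 2.0, 3.0], index=[5, 1, 3])
try:
    unsorted.reindex(range(6), method='ffill')
    monotonic_required = False
except ValueError:
    monotonic_required = True
```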

## Dropping entries from axis

We often need to delete data from a Pandas Series or DataFrame. You can do this using the drop method, which is available on both Series and DataFrame. This method accepts the index label, or the list of index labels, to be dropped.

This method creates a new object with only the required values. Note that the operation is not performed in place (i.e., the original Pandas Series or DataFrame is preserved and still available after the drop operation). In practical terms, the method creates a selective copy of the data.

1. Drop single index

Code:

In [None]:
b = Series(np.arange(10), index=['a','b','c','d','e','f','g','h','i','j'])
b

a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int64

In [None]:
#Dropping index b
new_series = b.drop('b')
new_series

a    0
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int64

2. Drop multiple indexes

In [None]:
# Dropping multiple indexes, e.g. a, g, j
new_series_1 = b.drop(['a','g','j'])
new_series_1

b    1
c    2
d    3
e    4
f    5
h    7
i    8
dtype: int64

In the case of DataFrame, labels can be dropped from either axis: row labels (the default, or by using the index parameter) and column names (by using the columns parameter, or by passing axis=1). The following code snippets demonstrate this behaviour:

3. Removing a row from DataFrame

Code:

In [None]:
df_states

Unnamed: 0,state,TZ,pop,GDP,area
WA,Western Australia,GMT+8,1.0,11,TBD
SA,Southern Australia,GMT+9.30,1.0,8,TBD
VIC,Victoria,GMT+10,2.5,20,TBD
NSW,New South Wales,GMT+10,2.7,22,TBD
ACT,Australian Capital Territory,GMT+10,0.5,15,TBD
QLD,Queensland,GMT+10,1.5,18,TBD
NT,Northern Territory,GMT+9.30,0.4,8,TBD


In [None]:
df_states_noNT = df_states.drop('NT')
df_states_noNT

Unnamed: 0,state,TZ,pop,GDP,area
WA,Western Australia,GMT+8,1.0,11,TBD
SA,Southern Australia,GMT+9.30,1.0,8,TBD
VIC,Victoria,GMT+10,2.5,20,TBD
NSW,New South Wales,GMT+10,2.7,22,TBD
ACT,Australian Capital Territory,GMT+10,0.5,15,TBD
QLD,Queensland,GMT+10,1.5,18,TBD


4. Removing multiple columns from DataFrame by passing a sequence of column index and axis = 1

Code:

In [None]:
df_states

Unnamed: 0,state,TZ,pop,GDP,area
WA,Western Australia,GMT+8,1.0,11,TBD
SA,Southern Australia,GMT+9.30,1.0,8,TBD
VIC,Victoria,GMT+10,2.5,20,TBD
NSW,New South Wales,GMT+10,2.7,22,TBD
ACT,Australian Capital Territory,GMT+10,0.5,15,TBD
QLD,Queensland,GMT+10,1.5,18,TBD
NT,Northern Territory,GMT+9.30,0.4,8,TBD


In [None]:
df1 = df_states.drop(['state','area'], axis=1)
df1

Unnamed: 0,TZ,pop,GDP
WA,GMT+8,1.0,11
SA,GMT+9.30,1.0,8
VIC,GMT+10,2.5,20
NSW,GMT+10,2.7,22
ACT,GMT+10,0.5,15
QLD,GMT+10,1.5,18
NT,GMT+9.30,0.4,8


In [None]:
df_states

Unnamed: 0,state,TZ,pop,GDP,area
WA,Western Australia,GMT+8,1.0,11,TBD
SA,Southern Australia,GMT+9.30,1.0,8,TBD
VIC,Victoria,GMT+10,2.5,20,TBD
NSW,New South Wales,GMT+10,2.7,22,TBD
ACT,Australian Capital Territory,GMT+10,0.5,15,TBD
QLD,Queensland,GMT+10,1.5,18,TBD
NT,Northern Territory,GMT+9.30,0.4,8,TBD


Observe how the original DataFrame is always preserved, and whenever we use drop() method, a new DataFrame is created.
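The drop method also accepts the index and columns keyword parameters, which some readers find clearer than the axis argument. A minimal sketch on a shortened version of the table:

```python
from pandas import DataFrame

df = DataFrame({'state': ['Victoria', 'Queensland'],
                'pop': [2.5, 1.5],
                'area': ['TBD', 'TBD']},
               index=['VIC', 'QLD'])

no_qld = df.drop(index='QLD')          # same result as df.drop('QLD')
no_area = df.drop(columns=['area'])    # same result as df.drop(['area'], axis=1)

# errors='ignore' skips labels that do not exist instead of raising an error.
unchanged = df.drop(columns=['GDP'], errors='ignore')
```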

# Indexing, Selection, and Filtering

## Indexing, Selection, and Filtering

**We have already seen various examples of indexing being used. Let’s explore a little more about indexing features available to Pandas.**

### Pandas Series

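Indexing for Pandas Series works similarly to NumPy ndarrays, with one additional feature: we can use the labelled index as well as the implicit positional index. Recent Pandas releases deprecate using a plain integer in square brackets as a position when the index holds labels, so the iloc indexer is the safer way to write positional access. The following are some examples of this behaviour: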

Code:

In [None]:
ob1 = Series(np.random.randn(10), index=['a','b','c','d','e','f','g','h','i','j'])
ob1

a   -0.916779
b   -0.180231
c    0.804683
d   -0.457195
e    0.633762
f   -0.938861
g    1.369232
h   -0.896382
i    1.812082
j   -0.169518
dtype: float64

In [None]:
ob1.iloc[0], ob1['a']

(-0.9167785538101799, -0.9167785538101799)

In [None]:
ob1.iloc[5], ob1['f']

(-0.9388605166220878, -0.9388605166220878)

Slicing with labels works a little differently from normal slicing: both endpoints are inclusive, whereas in the case of **normal slicing** the **endpoint** is **not inclusive**. The following are some examples of this behaviour:

Code:

In [None]:
# normal slicing
ob1[0:5]

a   -0.916779
b   -0.180231
c    0.804683
d   -0.457195
e    0.633762
dtype: float64

In [None]:
# Slicing with labels
ob1['a':'f']

a   -0.916779
b   -0.180231
c    0.804683
d   -0.457195
e    0.633762
f   -0.938861
dtype: float64
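The same two kinds of slice can be written with the explicit indexers, which makes the intent unambiguous. A small sketch with a made-up Series:

```python
import numpy as np
from pandas import Series

s = Series(np.arange(10), index=list('abcdefghij'))

by_position = s.iloc[0:5]    # positions 0 to 4, endpoint excluded
by_label = s.loc['a':'f']    # labels 'a' to 'f', endpoint included
```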

### Pandas DataFrame

As we have already seen, we use indexing to retrieve a particular subset of data along the rows and columns of a DataFrame, by passing either a single value or a sequence of indexes. The following examples demonstrate these features again:

Code:

In [None]:
df_states

Unnamed: 0,state,TZ,pop,GDP,area
WA,Western Australia,GMT+8,1.0,11,TBD
SA,Southern Australia,GMT+9.30,1.0,8,TBD
VIC,Victoria,GMT+10,2.5,20,TBD
NSW,New South Wales,GMT+10,2.7,22,TBD
ACT,Australian Capital Territory,GMT+10,0.5,15,TBD
QLD,Queensland,GMT+10,1.5,18,TBD
NT,Northern Territory,GMT+9.30,0.4,8,TBD


In [None]:
df_states['state']

WA                Western Australia
SA               Southern Australia
VIC                        Victoria
NSW                 New South Wales
ACT    Australian Capital Territory
QLD                      Queensland
NT               Northern Territory
Name: state, dtype: object

In [None]:
df_states[['state', 'TZ']]

Unnamed: 0,state,TZ
WA,Western Australia,GMT+8
SA,Southern Australia,GMT+9.30
VIC,Victoria,GMT+10
NSW,New South Wales,GMT+10
ACT,Australian Capital Territory,GMT+10
QLD,Queensland,GMT+10
NT,Northern Territory,GMT+9.30


This mechanism of supplying indexes lets us select data in a variety of ways.

For example:

1. by passing row slices

Code:

In [None]:
df_states[:2]

Unnamed: 0,state,TZ,pop,GDP,area
WA,Western Australia,GMT+8,1.0,11,TBD
SA,Southern Australia,GMT+9.30,1.0,8,TBD


2. by passing a Boolean Array (filter Array) to select data that meets certain conditions.

Code:

In [None]:
df_states[df_states['GDP']>8]

Unnamed: 0,state,TZ,pop,GDP,area
WA,Western Australia,GMT+8,1.0,11,TBD
VIC,Victoria,GMT+10,2.5,20,TBD
NSW,New South Wales,GMT+10,2.7,22,TBD
ACT,Australian Capital Territory,GMT+10,0.5,15,TBD
QLD,Queensland,GMT+10,1.5,18,TBD
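Boolean filters can be combined. The sketch below, on a shortened version of the table, joins two conditions with & and negates one with ~; each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
from pandas import DataFrame

df = DataFrame({'pop': [1.0, 1.0, 2.5, 0.4], 'GDP': [11, 8, 20, 8]},
               index=['WA', 'SA', 'VIC', 'NT'])

# Rows that satisfy both conditions.
mask = (df['GDP'] > 8) & (df['pop'] >= 1)
selected = df[mask]

# ~ negates a filter; loc can select a column at the same time.
small = df.loc[~(df['GDP'] > 8), 'pop']
```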


# Data alignment, mapping, and sorting

## Data alignment

One of the interesting features of Pandas operations is data alignment. We have already seen some analogous behaviours. For example:

- The creation of a DataFrame (i.e., if data is not present for one of the specified columns, the missing values are filled in automatically as NaN).

- Reindexing when data doesn’t exist for a supplied index. By default, NaN is filled in for those indexes, and we also have the option of passing fill values and fill methods to the reindex method.

Along the same lines, if we perform mathematical operations between Pandas objects with different indexes, Pandas aligns the data by label in the resulting Pandas object.

Code:

In [None]:
df1 = DataFrame(np.arange(9).reshape(3,3), columns=['a','b','c'], index=['SA', 'VIC', 'NSW'])
df1

Unnamed: 0,a,b,c
SA,0,1,2
VIC,3,4,5
NSW,6,7,8


In [None]:
df2 = DataFrame(np.arange(12).reshape(4,3), columns=['a','b','e'], index=['SA', 'VIC', 'NSW', 'ACT'])
df2

Unnamed: 0,a,b,e
SA,0,1,2
VIC,3,4,5
NSW,6,7,8
ACT,9,10,11


In the case of addition, if the indexes of the two objects are not the same, the resulting Pandas object has an index that is the union of both original indexes. A value is computed only where a label is present in both objects; everywhere else the result is NaN.

Code:

In [None]:
df1+df2

Unnamed: 0,a,b,c,e
ACT,,,,
NSW,12.0,14.0,,
SA,0.0,2.0,,
VIC,6.0,8.0,,


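We can also control how missing values are handled during this alignment by using the add() method with the fill_value parameter. A value that is missing in only one of the two objects is then replaced by fill_value before the addition; a position that is missing in both objects remains NaN.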

Code:

In [None]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,e
ACT,9.0,10.0,,11.0
NSW,12.0,14.0,8.0,8.0
SA,0.0,2.0,2.0,2.0
VIC,6.0,8.0,5.0,5.0
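Alignment works in the same way for Series. A minimal sketch with two made-up Series whose indexes only partly overlap:

```python
from pandas import Series

s1 = Series([1, 2, 3], index=['SA', 'VIC', 'NSW'])
s2 = Series([10, 20], index=['VIC', 'ACT'])

plain = s1 + s2                      # union of labels; unmatched labels give NaN
filled = s1.add(s2, fill_value=0)    # a label missing on one side counts as 0
```

Only 'VIC' appears in both Series, so it is the only label with a value in the plain sum.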


## Mapping

We often want to change or manipulate the values in a particular row or column by applying a function to each value in the selection. For example, think of a data set that captures information about a large collection of products (represented as columns in the data set). These products update to a new version every year, so you need a way to update all the version numbers quickly and easily. This process is known as mapping, and we do it by using the .apply() method, which takes the following parameters:

- a function (often a lambda function) that specifies the transformation to be applied
- for a DataFrame, an axis parameter, which defaults to 0 so that the function is applied to each column; pass axis=1 to apply it to each row. When .apply() is called on a single column (a Series), the function is applied to each value.

The following code snippets demonstrate this behaviour:

Code:

In [None]:
df_states

Unnamed: 0,state,TZ,pop,GDP,area
WA,Western Australia,GMT+8,1.0,11,TBD
SA,Southern Australia,GMT+9.30,1.0,8,TBD
VIC,Victoria,GMT+10,2.5,20,TBD
NSW,New South Wales,GMT+10,2.7,22,TBD
ACT,Australian Capital Territory,GMT+10,0.5,15,TBD
QLD,Queensland,GMT+10,1.5,18,TBD
NT,Northern Territory,GMT+9.30,0.4,8,TBD


In [None]:
f = lambda x:x.upper()

In [None]:
df_states['state'] = df_states['state'].apply(f)
df_states

Unnamed: 0,state,TZ,pop,GDP,area
WA,WESTERN AUSTRALIA,GMT+8,1.0,11,TBD
SA,SOUTHERN AUSTRALIA,GMT+9.30,1.0,8,TBD
VIC,VICTORIA,GMT+10,2.5,20,TBD
NSW,NEW SOUTH WALES,GMT+10,2.7,22,TBD
ACT,AUSTRALIAN CAPITAL TERRITORY,GMT+10,0.5,15,TBD
QLD,QUEENSLAND,GMT+10,1.5,18,TBD
NT,NORTHERN TERRITORY,GMT+9.30,0.4,8,TBD


Note that there are other methods available for doing column wise transformations, and we will cover some of those in detail during the data wrangling sections of this course.
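To illustrate the axis parameter, the sketch below calls apply() on a whole DataFrame, and then uses the related Series method map() with a dictionary. The table is a shortened version of the one above, and the time zone lookup dictionary is a hypothetical example.

```python
from pandas import DataFrame, Series

df = DataFrame({'pop': [1.0, 2.5], 'GDP': [11, 20]}, index=['WA', 'VIC'])

col_max = df.apply(lambda col: col.max())           # axis=0: once per column
row_sum = df.apply(lambda row: row.sum(), axis=1)   # axis=1: once per row

# map() on a Series can also translate values through a dictionary.
tz = Series(['GMT+8', 'GMT+10'], index=['WA', 'VIC'])
offset = tz.map({'GMT+8': 8, 'GMT+10': 10})
```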

## Sorting

Many kinds of data need to be sorted to be meaningful and useful. Think of a video on-demand streaming service that needs to know which TV series in its catalogue are the most popular, so that it can decide which of them to renew for another season. The series titles need to be sorted by the most watched. Sorting is also one of the important operations that we perform on the data in Pandas.

### Sorting the indexes / labels

To sort lexicographically (i.e., in dictionary order) by row or column index, we use the sort_index() method. Note that this method returns a new object, sorted according to the criteria specified, and leaves the original unchanged. The demonstration below shows the original DataFrame first, followed by the sorted results:

Code:

In [None]:
df_states

Unnamed: 0,state,TZ,pop,GDP,area
WA,WESTERN AUSTRALIA,GMT+8,1.0,11,TBD
SA,SOUTHERN AUSTRALIA,GMT+9.30,1.0,8,TBD
VIC,VICTORIA,GMT+10,2.5,20,TBD
NSW,NEW SOUTH WALES,GMT+10,2.7,22,TBD
ACT,AUSTRALIAN CAPITAL TERRITORY,GMT+10,0.5,15,TBD
QLD,QUEENSLAND,GMT+10,1.5,18,TBD
NT,NORTHERN TERRITORY,GMT+9.30,0.4,8,TBD


In [None]:
# DataFrame sorted by row index
df_states.sort_index()

Unnamed: 0,state,TZ,pop,GDP,area
ACT,AUSTRALIAN CAPITAL TERRITORY,GMT+10,0.5,15,TBD
NSW,NEW SOUTH WALES,GMT+10,2.7,22,TBD
NT,NORTHERN TERRITORY,GMT+9.30,0.4,8,TBD
QLD,QUEENSLAND,GMT+10,1.5,18,TBD
SA,SOUTHERN AUSTRALIA,GMT+9.30,1.0,8,TBD
VIC,VICTORIA,GMT+10,2.5,20,TBD
WA,WESTERN AUSTRALIA,GMT+8,1.0,11,TBD


In [None]:
# DataFrame sorted by columns (lexicographically)
df_states.sort_index(axis=1)

Unnamed: 0,GDP,TZ,area,pop,state
WA,11,GMT+8,TBD,1.0,WESTERN AUSTRALIA
SA,8,GMT+9.30,TBD,1.0,SOUTHERN AUSTRALIA
VIC,20,GMT+10,TBD,2.5,VICTORIA
NSW,22,GMT+10,TBD,2.7,NEW SOUTH WALES
ACT,15,GMT+10,TBD,0.5,AUSTRALIAN CAPITAL TERRITORY
QLD,18,GMT+10,TBD,1.5,QUEENSLAND
NT,8,GMT+9.30,TBD,0.4,NORTHERN TERRITORY
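
The following sketch illustrates two points from this section: sort_index() returns a new object rather than modifying the original, and the `ascending` parameter reverses the order. The `demo` DataFrame and its values are made up for the illustration.

```python
import pandas as pd

# Hypothetical small DataFrame with an unsorted row index
demo = pd.DataFrame({"GDP": [11, 15, 20]}, index=["WA", "ACT", "VIC"])

# Ascending (dictionary) order of the row labels; returns a new DataFrame
sorted_demo = demo.sort_index()

# Descending order of the row labels
reversed_demo = demo.sort_index(ascending=False)

print(list(demo.index))           # the original order is unchanged
print(list(sorted_demo.index))
print(list(reversed_demo.index))
```

To keep the sorted result, assign it to a name, as done here with `sorted_demo`.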


### Sorting by values

Instead of sorting by index labels, we can also sort the data by the actual values in one or more columns. The sort_values() method does this: it orders the rows on the basis of column values instead of labels, in ascending order by default. In the example below, we arrange the rows by the GDP column:

Code:

In [None]:
df_states

Unnamed: 0,state,TZ,pop,GDP,area
WA,WESTERN AUSTRALIA,GMT+8,1.0,11,TBD
SA,SOUTHERN AUSTRALIA,GMT+9.30,1.0,8,TBD
VIC,VICTORIA,GMT+10,2.5,20,TBD
NSW,NEW SOUTH WALES,GMT+10,2.7,22,TBD
ACT,AUSTRALIAN CAPITAL TERRITORY,GMT+10,0.5,15,TBD
QLD,QUEENSLAND,GMT+10,1.5,18,TBD
NT,NORTHERN TERRITORY,GMT+9.30,0.4,8,TBD


In [None]:
df_states.sort_values('GDP')

Unnamed: 0,state,TZ,pop,GDP,area
SA,SOUTHERN AUSTRALIA,GMT+9.30,1.0,8,TBD
NT,NORTHERN TERRITORY,GMT+9.30,0.4,8,TBD
WA,WESTERN AUSTRALIA,GMT+8,1.0,11,TBD
ACT,AUSTRALIAN CAPITAL TERRITORY,GMT+10,0.5,15,TBD
QLD,QUEENSLAND,GMT+10,1.5,18,TBD
VIC,VICTORIA,GMT+10,2.5,20,TBD
NSW,NEW SOUTH WALES,GMT+10,2.7,22,TBD
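
In the output above, SA and NT share the same GDP value, so their relative order is determined by the sorting algorithm rather than by the data. The sketch below, which uses a hypothetical `demo` DataFrame, shows how to sort in descending order and how to pass a second column to break such ties.

```python
import pandas as pd

# Hypothetical subset of the states data
demo = pd.DataFrame(
    {"pop": [1.0, 1.0, 0.4, 2.7], "GDP": [11, 8, 8, 22]},
    index=["WA", "SA", "NT", "NSW"],
)

# Largest GDP first
by_gdp_desc = demo.sort_values("GDP", ascending=False)

# Sort by GDP, then use pop to order rows that have the same GDP
by_gdp_then_pop = demo.sort_values(["GDP", "pop"])

print(by_gdp_desc)
print(by_gdp_then_pop)
```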


# Pandas: Mathematical and statistical methods

**It would be very laborious to calculate statistics on large datasets by hand. Fortunately, there are various mathematical and statistical methods in Pandas to automate some of the calculations. Available methods include summation, finding maximum and minimum values, and generating descriptive statistics.
Let’s explore these methods with some code examples:**

## Sum method

Summation is a common step in calculating statistics such as the mean and standard deviation. By default, the sum() method adds up the values in each column and returns the result as a Pandas Series; passing axis=1 adds up the values in each row instead.

Code:

In [None]:
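import numpy as np
import pandas as pd

# 4 x 6 DataFrame of random values in [0, 1); the numbers differ on every run
new_df = pd.DataFrame(np.random.rand(24).reshape((4, 6)),
                      index=['r1', 'r2', 'r3', 'r4'],
                      columns=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'])
new_df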

Unnamed: 0,c1,c2,c3,c4,c5,c6
r1,0.342765,0.145512,0.453728,0.804428,0.94671,0.832559
r2,0.640571,0.83611,0.139126,0.554305,0.49695,0.968878
r3,0.288859,0.645866,0.02395,0.509962,0.903352,0.798825
r4,0.237727,0.909543,0.176025,0.4924,0.391355,0.189794


In [None]:
# column sum
new_df.sum()

c1    1.509922
c2    2.537030
c3    0.792830
c4    2.361094
c5    2.738368
c6    2.790056
dtype: float64

In [None]:
# row sum
new_df.sum(axis=1)

r1    3.525701
r2    3.635940
r3    3.170814
r4    2.396844
dtype: float64
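
To illustrate why summation underlies other statistics, the sketch below computes the column means by hand from the column sums and compares them with the built-in mean() method. The `demo` DataFrame and the fixed seed are assumptions made for reproducibility.

```python
import numpy as np
import pandas as pd

# Hypothetical 4 x 3 DataFrame of random values; seeded so it is reproducible
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((4, 3)),
                    index=["r1", "r2", "r3", "r4"],
                    columns=["c1", "c2", "c3"])

# Column sums, returned as a Series indexed by column label
col_sum = demo.sum()

# Mean computed by hand: sum of each column divided by the number of rows
col_mean_manual = col_sum / len(demo)

# The manual result agrees with the built-in mean() method
matches = np.allclose(col_mean_manual, demo.mean())
print(matches)
```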

## Min, max

As the names suggest, these methods return the minimum and maximum values respectively. By default they operate on every column; with axis=1 they operate on every row. These methods form the basis for other statistics, such as the range of a set of values.

Code:

In [None]:
new_df

Unnamed: 0,c1,c2,c3,c4,c5,c6
r1,0.342765,0.145512,0.453728,0.804428,0.94671,0.832559
r2,0.640571,0.83611,0.139126,0.554305,0.49695,0.968878
r3,0.288859,0.645866,0.02395,0.509962,0.903352,0.798825
r4,0.237727,0.909543,0.176025,0.4924,0.391355,0.189794


In [None]:
# column min
new_df.min()

c1    0.237727
c2    0.145512
c3    0.023950
c4    0.492400
c5    0.391355
c6    0.189794
dtype: float64

In [None]:
# row min
new_df.min(axis=1)

r1    0.145512
r2    0.139126
r3    0.023950
r4    0.176025
dtype: float64

In [None]:
# column max
new_df.max()

c1    0.640571
c2    0.909543
c3    0.453728
c4    0.804428
c5    0.946710
c6    0.968878
dtype: float64

In [None]:
# row max
new_df.max(axis=1)

r1    0.946710
r2    0.968878
r3    0.903352
r4    0.909543
dtype: float64
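
As a small illustration of the range mentioned above, the sketch below subtracts the minimum from the maximum, per column and per row. The `demo` DataFrame uses made-up integer values so that the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical DataFrame with easy-to-verify values
demo = pd.DataFrame({"c1": [2, 6, 3], "c2": [9, 1, 5]},
                    index=["r1", "r2", "r3"])

# Range of each column: maximum minus minimum
col_range = demo.max() - demo.min()

# Range of each row: use axis=1 for both methods
row_range = demo.max(axis=1) - demo.min(axis=1)

print(col_range)
print(row_range)
```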

## Describe method

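The describe method provides summary statistics for the columns of the DataFrame: the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum. By default, only numeric columns are summarised. The range is not reported directly, but it can be computed from the min and max rows. See the code snippet below for an example: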

Code:

In [None]:
new_df

Unnamed: 0,c1,c2,c3,c4,c5,c6
r1,0.342765,0.145512,0.453728,0.804428,0.94671,0.832559
r2,0.640571,0.83611,0.139126,0.554305,0.49695,0.968878
r3,0.288859,0.645866,0.02395,0.509962,0.903352,0.798825
r4,0.237727,0.909543,0.176025,0.4924,0.391355,0.189794


In [None]:
new_df.describe()

Unnamed: 0,c1,c2,c3,c4,c5,c6
count,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.37748,0.634258,0.198207,0.590273,0.684592,0.697514
std,0.180561,0.344254,0.182245,0.145126,0.281519,0.346371
min,0.237727,0.145512,0.02395,0.4924,0.391355,0.189794
25%,0.276076,0.520777,0.110332,0.505571,0.470552,0.646567
50%,0.315812,0.740988,0.157576,0.532133,0.700151,0.815692
75%,0.417216,0.854468,0.245451,0.616835,0.914192,0.866639
max,0.640571,0.909543,0.453728,0.804428,0.94671,0.968878
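
The sketch below shows how individual statistics can be read out of the describe() result with loc, and how they relate to the stand-alone methods. The `demo` DataFrame, including its text column `label`, is hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with one numeric and one text column
demo = pd.DataFrame({"c1": [1.0, 2.0, 3.0, 4.0],
                     "label": ["a", "b", "c", "d"]})

# By default, only the numeric column is summarised
summary = demo.describe()

# Individual statistics are rows of the result
mean_c1 = summary.loc["mean", "c1"]
median_c1 = summary.loc["50%", "c1"]

# The range can be derived from the max and min rows
value_range = summary.loc["max"] - summary.loc["min"]

print(summary)
print(mean_c1, median_c1, value_range["c1"])
```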


For an exhaustive list of methods, refer to the [Pandas API documentation](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) [1]. We will also explore some additional methods by way of demonstration throughout the course.

# QUIZ

1. Which of the following functions will allow you to change or manipulate the selected piece of data?
  - pandas.series()
  - pandas.df()
  - **pandas.apply()**
  - pandas.attribute()

2. Choose the incorrect statement.
  - By default, rows and columns with missing values are filled with NaN.
  - **The sort_values() function sorts on the basis of labels instead of values.**
  - The sort_index() function sorts your row or column indexes lexicographically.
  - The describe() method provides summary statistics for the data columns in the DataFrame.

3. Reindexing doesn’t:
  - change the order of rows in the dataframe
  - use the reindex() function to perform this action
  - change the order of columns in the dataframe
  - **filter rows based on selection criteria as per the requirements**

4. Which method will you use to sort the row or column index in dictionary order?
  - sort_values()
  - **sort_index()**
  - sort_indexes()
  - sort_rows()

