# Tutorial: Getting started with Pandas

Since becoming an open-source project in 2010, Pandas has matured into a large library applicable in a broad set of real-world use cases.

Pandas contains data structures and data manipulation tools designed to make data processing efficient in Python. This tutorial introduces essential functionality in Pandas for representing and accessing data.

Learning objectives:
1. Generate sample datasets 
2. Build DataFrames from Python built-in data structures
3. Retrieve data from a dataset
4. Sort datasets
5. Compute descriptive statistics

## Numerical Python

**NumPy**, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Most libraries providing data science resources use NumPy's data structures as the standard for data exchange.

## Importing Pandas and NumPy

Throughout the rest of the tutorials, we will use the following import convention for pandas and numpy:

In [1]:
import pandas as pd
import numpy as np

## 1. Generating sample data

NumPy provides utilities for generating sample data that we can use to construct DataFrames. Generating sample data is useful when learning and experimenting with DataFrames.

Generate a range of 10 integers: 

In [2]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

We can use the *reshape* method to build a matrix that we can later use to construct a DataFrame. Using reshape, construct a 5x2 matrix (5 rows, 2 columns):

In [3]:
np.arange(10).reshape((5, 2))

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

Generate a 5x2 matrix filled with random values:

In [7]:
np.around(np.random.randn(5, 2), 2)

array([[-1.43,  0.57],
       [-0.59, -0.65],
       [ 0.02,  0.63],
       [-1.44, -0.59],
       [ 1.22, -1.04]])

> np.random.randn returns values sampled at random from the standard normal distribution. np.around rounds them to the given number of decimals (two here).

Each call draws a new sample. Generate another matrix, again rounded to two decimals:

In [8]:
np.around(np.random.randn(5, 2), 2)

array([[ 0.49,  1.12],
       [-1.61,  0.9 ],
       [ 0.62, -2.37],
       [ 0.08,  0.25],
       [ 0.76,  0.86]])
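Because the values are drawn at random, the two matrices above differ. A minimal sketch of reproducible sampling, assuming NumPy's Generator API (`np.random.default_rng`) is available; the seed value 42 is an arbitrary choice made for this illustration:

```python
import numpy as np

# Seed 42 is an arbitrary value chosen for this example
rng = np.random.default_rng(42)
sample = np.around(rng.standard_normal((5, 2)), 2)

# A second generator created with the same seed reproduces the same matrix
rng_again = np.random.default_rng(42)
sample_again = np.around(rng_again.standard_normal((5, 2)), 2)

print(sample.shape)                          # (5, 2)
print(np.array_equal(sample, sample_again))  # True
```

Fixing the seed is handy when you want the same sample data every time a notebook is re-run.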

## 2. Building DataFrames

**Series** and **DataFrame** constitute the two workhorse data structures in Pandas. Before getting started with DataFrames, we will begin with Series.

### Series

A **Series** is a one-dimensional array-like object containing a sequence of values and an associated array of *data labels*, called its **index**.

Create a Series from a list:

In [9]:
series = pd.Series([4, 7, -5, 3])
series

0    4
1    7
2   -5
3    3
dtype: int64

The output above shows the index on the left and the values on the right. Since we did not specify an index for the data, default labels consisting of the integers 0 through N - 1 (where N is the length of the data) are created.

> In the same way that ordinary Python objects have an associated type, the values stored in a Series have a data type, or **dtype**.
>
> The numerical dtypes follow a common naming scheme: a type name, like float or int, followed by the number of bits per element:
> - int8, int16, int32, int64
> - uint8, uint16, uint32, uint64
> - float16, float32, float64, float128
> - complex64, complex128, complex256
>
> (float128 and complex256 are only available on some platforms.)
>
> Series objects can also contain non-numerical values:
> - bool
> - object
> - string_
> - unicode_
>
> *A Series has a single dtype. Values of mixed types are stored with the generic object dtype.*
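As a rough illustration of how dtypes behave, the sketch below inspects the dtype of a Series, converts it with `astype`, and shows what happens with mixed values. The variable names are made up for this example:

```python
import pandas as pd

# Request the dtype explicitly so the example does not depend on inference
ints = pd.Series([4, 7, -5, 3], dtype="int64")
print(ints.dtype)    # int64

# astype returns a new Series converted to the requested dtype
floats = ints.astype("float32")
print(floats.dtype)  # float32

# A mix of integers, strings, and floats is stored with the object dtype
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)   # object
```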

You can get an array representation of the values and the index object of a series via its *values* and *index* attributes:

In [10]:
series.values

array([ 4,  7, -5,  3])

In [11]:
series.index

RangeIndex(start=0, stop=4, step=1)

We can also create a series with an index that associates each data point with a label:

In [12]:
series2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
series2

d    4
b    7
a   -5
c    3
dtype: int64

In [13]:
series2.index

Index(['d', 'b', 'a', 'c'], dtype='object')
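With a labeled index, values can be looked up by label rather than by position. A short sketch using the same data as series2 (the `subset` name is only for illustration):

```python
import pandas as pd

series2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# A single label returns a single value
print(series2["a"])        # -5

# A list of labels returns a new Series in the requested order
subset = series2[["c", "a", "d"]]
print(subset.tolist())     # [3, -5, 4]

# The in operator checks the index labels, not the values
print("b" in series2)      # True
```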

### DataFrames

A DataFrame represents a rectangular table of data and contains an ordered collection of columns. 

In contrast to a Series:
- A DataFrame supports different data types, as each column can have its own dtype.
- A DataFrame has two indexes:
    - A row index that contains labels for the rows. The row index is also referred to as *axis 0*.
    - A column index that contains labels for the columns. The column index is also referred to as *axis 1*.
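The two points above can be checked on a small frame. This is only a sketch with made-up data; it uses the `dtypes` and `axes` attributes of a DataFrame:

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "population": [1.5, 1.7]})

# Each column carries its own dtype
print(frame.dtypes)

# axes holds the row index (axis 0) followed by the column index (axis 1)
row_index, column_index = frame.axes
print(list(row_index))     # [0, 1]
print(list(column_index))  # ['year', 'population']
```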

### Creating a DataFrame from Python lists

Create a DataFrame from a nested list:

In [14]:
lists = [
    [1,2],
    [3,4],
    [5,6]
]
frame = pd.DataFrame(lists)
frame

Unnamed: 0,0,1
0,1,2
1,3,4
2,5,6


> If not specified, Pandas automatically creates the labels in the row and column indexes.

We can specify the column labels:

In [17]:
lists = [
    ['Ohio', 2000, 1.5],
    ['Ohio', 2001, 1.7],
    ['Ohio', 2002, 3.6],
    ['Nevada', 2001, 2.4],
    ['Nevada', 2002, .9],
    ['Nevada', 2003, 3.2]    
]

frame = pd.DataFrame(
    lists,
    # columns refers to the column index or axis 1
    columns=["state", "year", "population"] 
)
frame

Unnamed: 0,state,year,population
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,0.9
5,Nevada,2003,3.2


We can also specify the row index:

In [18]:
frame = pd.DataFrame(
    lists,
    columns=["state", "year", "population"],
    # index refers to the row index
    index=["zero", "one", "two", "three", "four", "five"] 
)
frame

Unnamed: 0,state,year,population
zero,Ohio,2000,1.5
one,Ohio,2001,1.7
two,Ohio,2002,3.6
three,Nevada,2001,2.4
four,Nevada,2002,0.9
five,Nevada,2003,3.2


We can also access the *index* (row labels), *columns* (column labels), and *values* properties of a DataFrame:

In [19]:
frame.index

Index(['zero', 'one', 'two', 'three', 'four', 'five'], dtype='object')

In [20]:
frame.columns

Index(['state', 'year', 'population'], dtype='object')

In [21]:
frame.values

array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 0.9],
       ['Nevada', 2003, 3.2]], dtype=object)

Note that each column in a DataFrame is a Series object:

In [22]:
type(frame['state'])

pandas.core.series.Series

> Internally, a DataFrame object consists of a collection of Series objects, each with a dtype (data type) associated.

### Creating a DataFrame from dictionaries

Another way to construct a dataframe is from a dictionary of equal-length lists.

In this way, *Pandas automatically takes the keys in the dictionary as the column labels*:

In [23]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'population': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,population
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


You can also use a dictionary of dictionaries.

In this case, *Pandas takes the keys of the inner dictionaries as the row labels*:

In [24]:
data2 = {'Nevada': {2001: 2.4, 2002: 2.9}, 
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame2 = pd.DataFrame(data2)
frame2

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5
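The empty cell for Nevada in 2000 is a missing value (NaN), because the inner Nevada dictionary has no 2000 key. A minimal sketch of how one might detect it and put the row labels in order, using `isna` and `sort_index`:

```python
import pandas as pd

data2 = {"Nevada": {2001: 2.4, 2002: 2.9},
         "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame2 = pd.DataFrame(data2)

# Count the missing entries in the Nevada column
print(frame2["Nevada"].isna().sum())        # 1

# sort_index returns a new frame with the row labels in ascending order
print(frame2.sort_index().index.tolist())   # [2000, 2001, 2002]
```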


### Inspecting the structure of a DataFrame

We can use the head method to inspect the first rows of a DataFrame (five by default):

In [26]:
frame.head()

Unnamed: 0,state,year,population
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


The shape attribute returns a tuple representing the dimensionality of the DataFrame: the number of rows and the number of columns.

In [27]:
frame.shape

(6, 3)
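A few related inspection calls, sketched on the same data: head accepts the number of rows to return, tail is its counterpart for the last rows, and the shape tuple can be unpacked directly. The variable names are illustrative only:

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "population": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
})

first_two = frame.head(2)      # first two rows
last_two = frame.tail(2)       # last two rows
n_rows, n_cols = frame.shape   # unpack the (rows, columns) tuple

print(first_two.index.tolist())  # [0, 1]
print(last_two.index.tolist())   # [4, 5]
print(n_rows, n_cols)            # 6 3
```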

### Manipulating columns in a DataFrame

In [28]:
frame = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
})

frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


Columns can be modified by assignment using the *=* operator. We can assign a scalar value, which assigns the same value to all rows:

In [29]:
frame['debt'] = 16.5
frame

Unnamed: 0,state,year,pop,debt
0,Ohio,2000,1.5,16.5
1,Ohio,2001,1.7,16.5
2,Ohio,2002,3.6,16.5
3,Nevada,2001,2.4,16.5
4,Nevada,2002,2.9,16.5
5,Nevada,2003,3.2,16.5


We can also assign sequence-like objects. In this case, the sequence’s length must match the length of the DataFrame: 

In [30]:
frame['debt'] = [16.5, 17.5, 16, 18, 15, 16.2]
frame

Unnamed: 0,state,year,pop,debt
0,Ohio,2000,1.5,16.5
1,Ohio,2001,1.7,17.5
2,Ohio,2002,3.6,16.0
3,Nevada,2001,2.4,18.0
4,Nevada,2002,2.9,15.0
5,Nevada,2003,3.2,16.2
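The length requirement can be seen by trying a list that is too short. The sketch below, on a reduced three-row frame with hypothetical column names `bad` and `debt_per_pop`, also shows that a new column can be computed from existing ones:

```python
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6], "debt": [16.5, 17.5, 16.0]})

# A list whose length differs from the number of rows is rejected
try:
    frame["bad"] = [1, 2]
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)          # True

# A column can also be derived element-wise from other columns
frame["debt_per_pop"] = frame["debt"] / frame["pop"]
print(frame.columns.tolist())   # ['pop', 'debt', 'debt_per_pop']
```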


We can retrieve the columns of a DataFrame using either attribute or dictionary-like notation.

Retrieve the state column by attribute notation:

In [34]:
frame.state

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

Retrieve the state column by dictionary-like notation:

In [37]:
frame['state']

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

You can also retrieve more than one column at once by passing a list of column names:

In [38]:
frame[['state', 'year']]

Unnamed: 0,state,year
0,Ohio,2000
1,Ohio,2001
2,Ohio,2002
3,Nevada,2001
4,Nevada,2002
5,Nevada,2003


> When passing a list to the [] operator, Pandas matches each value in the list against the column labels and returns a new DataFrame with the matched columns. When passing a single label, it returns that column as a Series.
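A quick check of what the [] operator returns for a single label versus a one-element list, sketched on a small two-row frame:

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "year": [2000, 2001]})

single = frame["state"]       # single label
as_frame = frame[["state"]]   # list containing one label

print(type(single).__name__)    # Series
print(type(as_frame).__name__)  # DataFrame
print(as_frame.shape)           # (2, 1)
```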

We can use the drop method to remove entries from either the row or column axes (indexes). 

In [39]:
frame = pd.DataFrame(
    np.arange(10).reshape((5, 2)),
    columns=['col1', 'col2'],
    index=['a', 'b', 'c', 'd', 'e'])

frame

Unnamed: 0,col1,col2
a,0,1
b,2,3
c,4,5
d,6,7
e,8,9


The drop method returns a new object with the indicated value or values deleted from an axis.

By default, drop removes elements using the row index:

In [40]:
new_frame = frame.drop('a') 
new_frame

Unnamed: 0,col1,col2
b,2,3
c,4,5
d,6,7
e,8,9


To remove a column instead, pass axis=1:

In [41]:
new_frame = frame.drop('col1', axis=1) 
new_frame

Unnamed: 0,col2
a,1
b,3
c,5
d,7
e,9


In [42]:
# frame.drop('a') is equivalent to: 

new_frame = frame.drop('a', axis=0) 
new_frame

Unnamed: 0,col1,col2
b,2,3
c,4,5
d,6,7
e,8,9
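drop also accepts a list of labels, and it offers a `columns` keyword as an alternative to `axis=1`. The sketch below additionally confirms that the original frame is left untouched:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(10).reshape((5, 2)),
                     columns=["col1", "col2"],
                     index=["a", "b", "c", "d", "e"])

fewer_rows = frame.drop(["a", "b"])         # drop several row labels at once
fewer_cols = frame.drop(columns=["col1"])   # same effect as drop('col1', axis=1)

print(fewer_rows.index.tolist())    # ['c', 'd', 'e']
print(fewer_cols.columns.tolist())  # ['col2']
print(frame.shape)                  # (5, 2), the original is unchanged
```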


## 3. Retrieving data

We can retrieve rows from a DataFrame using three approaches:

A) **Boolean filtering**: useful for retrieving rows based on a criterion  
B) **Slicing notation**: useful for retrieving rows based on position ranges of the data  
C) **loc and iloc**: useful for retrieving rows based on labels (loc) or integer positions (iloc)

In [57]:
frame = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    # Refers to population growth
    'population': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
    },
     index=list(range(1,7))
)

frame

Unnamed: 0,state,year,population
1,Ohio,2000,1.5
2,Ohio,2001,1.7
3,Ohio,2002,3.6
4,Nevada,2001,2.4
5,Nevada,2002,2.9
6,Nevada,2003,3.2


### A) Boolean filtering

We can filter rows in a DataFrame by passing a *list of boolean values* to []. This approach takes the positions of the boolean values in the list and retrieves only those rows corresponding to the True values.

The example below retrieves the second and fourth rows:

In [58]:
frame[[False, True, False, True, False, False]]

Unnamed: 0,state,year,population
2,Ohio,2001,1.7
4,Nevada,2001,2.4


*Boolean filtering is useful for retrieving rows based on criteria specified on column values.*

First, we specify a condition using the column labels. Specify a condition that checks if a record is from the state of Nevada: 

In [59]:
filter = frame['state'] == 'Nevada'
filter

1    False
2    False
3    False
4     True
5     True
6     True
Name: state, dtype: bool

We can then use the resulting boolean Series to perform boolean filtering:

In [60]:
frame[filter]

Unnamed: 0,state,year,population
4,Nevada,2001,2.4
5,Nevada,2002,2.9
6,Nevada,2003,3.2


We can specify the condition in a single expression:

In [61]:
frame[frame['state'] == 'Nevada']

Unnamed: 0,state,year,population
4,Nevada,2001,2.4
5,Nevada,2002,2.9
6,Nevada,2003,3.2
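Several conditions can be combined in one expression. In the sketch below, `&` means "and", `|` means "or", and `~` negates a condition; each condition needs its own parentheses because these operators bind more tightly than comparisons:

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "population": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}, index=list(range(1, 7)))

# Nevada records after 2001
both = frame[(frame["state"] == "Nevada") & (frame["year"] > 2001)]
# Ohio records, or any record with population above 3
either = frame[(frame["state"] == "Ohio") | (frame["population"] > 3)]
# Every record that is not from Ohio
not_ohio = frame[~(frame["state"] == "Ohio")]

print(both.index.tolist())      # [5, 6]
print(either.index.tolist())    # [1, 2, 3, 6]
print(not_ohio.index.tolist())  # [4, 5, 6]
```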


### B) Slicing notation

We can also use list-slicing notation *start:stop:step* to filter rows in a DataFrame. The slicing notation operates over the positions of the row index:

Retrieve the first two rows:

In [65]:
frame[:2]

Unnamed: 0,state,year,population
1,Ohio,2000,1.5
2,Ohio,2001,1.7


Retrieve the last two rows:

In [66]:
frame[-2:]

Unnamed: 0,state,year,population
5,Nevada,2002,2.9
6,Nevada,2003,3.2


Retrieve all rows except the first and last:

In [67]:
frame[1:-1]

Unnamed: 0,state,year,population
2,Ohio,2001,1.7
3,Ohio,2002,3.6
4,Nevada,2001,2.4
5,Nevada,2002,2.9


Retrieve all rows in reverse order:

In [68]:
frame[::-1]

Unnamed: 0,state,year,population
6,Nevada,2003,3.2
5,Nevada,2002,2.9
4,Nevada,2001,2.4
3,Ohio,2002,3.6
2,Ohio,2001,1.7
1,Ohio,2000,1.5


We can also combine boolean filtering with slicing. Select the first two rows with a population of at least 2:

In [69]:
frame[frame['population'] >= 2][0:2]

Unnamed: 0,state,year,population
3,Ohio,2002,3.6
4,Nevada,2001,2.4


### C) Retrieving rows with *loc* and *iloc*

For label-based and position-based selection, we can use the special indexing operators *loc* and *iloc*.

They enable us to select a subset of rows and columns:
- *loc* operates on the index labels
- *iloc* operates on the index positions. In many cases, *iloc* is equivalent to the slicing notation.

In [70]:
frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

frame

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


**Indexing with *loc*:**

Retrieve the rows corresponding to the Colorado and Ohio labels, in that order:

In [71]:
frame.loc[['Colorado', 'Ohio']]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Ohio,0,1,2,3


When retrieving a single record, loc and iloc return the record as a Series object. Retrieve the row corresponding to the Colorado label:

In [72]:
frame.loc['Colorado']

one      4
two      5
three    6
four     7
Name: Colorado, dtype: int64

Use [[]] to return a DataFrame:

In [73]:
frame.loc[['Colorado']]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7


We can also combine *loc* with the slicing notation. Retrieve all rows from the Colorado label onward (label slices include the start label):

In [74]:
frame.loc['Colorado':]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


**Indexing with *iloc*:**

Retrieve the first two rows using iloc:

In [75]:
frame.iloc[[0,1]]

# Equivalent to frame[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


Both *loc* and *iloc* can also operate over the column index by adding a second parameter for the column:
- loc: row label, column label
- iloc: row position, column position 

Retrieve column 'two' from the Ohio record:

In [76]:
frame.loc['Ohio', 'two']

1

Retrieve the value corresponding to the 4th row (New York) and 3rd column (three). Positions start at 0, so these are positions 3 and 2:

In [77]:
frame.iloc[3, 2]

14
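One difference worth keeping in mind when slicing on both axes: label slices with loc include the end label, while position slices with iloc exclude the end position. A short sketch on the same frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["Ohio", "Colorado", "Utah", "New York"],
                     columns=["one", "two", "three", "four"])

# Label slices: 'Utah' and 'two' are both included
by_label = frame.loc["Ohio":"Utah", "one":"two"]
# Position slices: position 2 is excluded on both axes
by_position = frame.iloc[0:2, 0:2]

print(by_label.shape)               # (3, 2)
print(by_position.shape)            # (2, 2)
print(by_label.index.tolist())      # ['Ohio', 'Colorado', 'Utah']
```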

## 4. Sorting

Sorting a DataFrame by some criterion is another important feature in Pandas.

In [78]:
frame = pd.DataFrame(
    {
        'salary': [2000, 5000, 7000, 4000, 2000, 2000],
        'age': [25, 34, 33, 45, 28, 30]
    },
    index=['John', 'Mary', 'Joseph', 'Peter', 'Louis', 'Jessy'],
    columns=['salary', 'age']
)

frame

Unnamed: 0,salary,age
John,2000,25
Mary,5000,34
Joseph,7000,33
Peter,4000,45
Louis,2000,28
Jessy,2000,30


To sort lexicographically by the row or column labels, use the *sort_index* method, which returns a new, sorted DataFrame:

In [79]:
frame.sort_index()

Unnamed: 0,salary,age
Jessy,2000,30
John,2000,25
Joseph,7000,33
Louis,2000,28
Mary,5000,34
Peter,4000,45


Pass axis=1 to sort by the column labels instead:

In [80]:
frame.sort_index(axis=1)

Unnamed: 0,age,salary
John,25,2000
Mary,34,5000
Joseph,33,7000
Peter,45,4000
Louis,28,2000
Jessy,30,2000


sort_index sorts the data in ascending order by default, but it can also sort in descending order: 

In [81]:
frame.sort_index(ascending=False)

Unnamed: 0,salary,age
Peter,4000,45
Mary,5000,34
Louis,2000,28
Joseph,7000,33
John,2000,25
Jessy,2000,30


We can also sort a DataFrame by the values in one or more columns. To do so, pass one or more column names to the *by* parameter of the *sort_values* method:

In [82]:
frame.sort_values(by='salary')

Unnamed: 0,salary,age
John,2000,25
Louis,2000,28
Jessy,2000,30
Peter,4000,45
Mary,5000,34
Joseph,7000,33


In [83]:
frame.sort_values(by=['salary', 'age'])

Unnamed: 0,salary,age
John,2000,25
Louis,2000,28
Jessy,2000,30
Peter,4000,45
Mary,5000,34
Joseph,7000,33
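When sorting by several columns, the `ascending` parameter of sort_values also accepts a list with one flag per column. The sketch below sorts by salary in ascending order and breaks ties by age in descending order (the `ordered` name is only for illustration):

```python
import pandas as pd

frame = pd.DataFrame(
    {"salary": [2000, 5000, 7000, 4000, 2000, 2000],
     "age": [25, 34, 33, 45, 28, 30]},
    index=["John", "Mary", "Joseph", "Peter", "Louis", "Jessy"],
)

# salary ascending, then age descending within equal salaries
ordered = frame.sort_values(by=["salary", "age"], ascending=[True, False])
print(ordered.index.tolist())
# ['Jessy', 'Louis', 'John', 'Peter', 'Mary', 'Joseph']
```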


## 5. Computing descriptive statistics

Pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from the rows or columns of a DataFrame.

In [84]:
frame = pd.DataFrame(
    {
        'salary': [2000, 5000, 7000, 4000, 2000, 2000],
        'age': [25, 34, 33, 45, 28, 30]
    },
    index=['John', 'Mary', 'Joseph', 'Peter', 'Louis', 'Jessy'],
    columns=['salary', 'age']
)

frame

Unnamed: 0,salary,age
John,2000,25
Mary,5000,34
Joseph,7000,33
Peter,4000,45
Louis,2000,28
Jessy,2000,30


Calculate the sum of a column:

In [85]:
frame['salary'].sum()

22000

Calculate the mean of a column:

In [86]:
frame['salary'].mean()

3666.6666666666665

Methods like *idxmin* and *idxmax* return indirect statistics, such as the index label at which the minimum or maximum value occurs:

In [87]:
frame['salary'].idxmax()

'Joseph'

We can combine idxmax with loc to obtain the full record of the employee with the highest salary:

In [88]:
frame.loc[[frame['salary'].idxmax()]]

Unnamed: 0,salary,age
Joseph,7000,33


Calculate the mean for every column:

In [89]:
frame.mean()

salary    3666.666667
age         32.500000
dtype: float64

Get the labels of the highest values in every column:

In [90]:
frame.idxmax()

salary    Joseph
age        Peter
dtype: object

*describe* produces multiple summary statistics in one shot:

In [63]:
frame.describe()

Unnamed: 0,salary,age
count,6.0,6.0
mean,3666.666667,32.5
std,2065.591118,6.94982
min,2000.0,25.0
25%,2000.0,28.5
50%,3000.0,31.5
75%,4750.0,33.75
max,7000.0,45.0
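A few further reductions, sketched on the same data: `median` for a single column, `value_counts` to count how often each value occurs, and `agg` to request a chosen set of statistics for every column at once:

```python
import pandas as pd

frame = pd.DataFrame(
    {"salary": [2000, 5000, 7000, 4000, 2000, 2000],
     "age": [25, 34, 33, 45, 28, 30]},
    index=["John", "Mary", "Joseph", "Peter", "Louis", "Jessy"],
)

# Median salary (matches the 50% row reported by describe)
print(frame["salary"].median())        # 3000.0

# Number of employees per salary value
counts = frame["salary"].value_counts()
print(counts.loc[2000])                # 3

# Selected statistics for every column, one row per statistic
summary = frame.agg(["min", "max"])
print(summary.loc["max", "age"])       # 45
```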
