# ![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white) **CHAPTER 05 - BASICS OF PANDAS**

---

In [1]:
import numpy as np
import pandas as pd


## **1. INTRODUCTION TO PANDAS OBJECTS:**

### **1.1. SERIES OBJECT:**

A `Series` is a one-dimensional, array-like object whose values carry an associated set of labels, called the `index`.

In [4]:
# The simplest way to create a Series object.
s = pd.Series([1, 2, 3, 4])

In [5]:
# Access to the Series values.
s.values

array([1, 2, 3, 4], dtype=int64)

In [6]:
# Access to the Series index.
s.index

RangeIndex(start=0, stop=4, step=1)

In [11]:
# Create a Series with the defined index.
s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

In [16]:
# Indexing by position (`iloc`) and by label.
s.iloc[0], s["a"]

(1, 1)

In [17]:
# Fancy indexing operations.
s[["a", "c", "b"]]

a    1
c    3
b    2
dtype: int64

In [18]:
# Create a mask.
s[s > 3]

d    4
dtype: int64

In [19]:
# Arithmetical operations.
s * 2

a    2
b    4
c    6
d    8
dtype: int64

In [21]:
# NumPy functions work on a Series and preserve its index.
np.exp(s)

a     2.718282
b     7.389056
c    20.085537
d    54.598150
dtype: float64

A `Series` also behaves like a dictionary: its index labels play the role of keys.

In [22]:
# Create a Series from a dictionary.
tmp = {"Ohio": 3500, "Texas": 7100, "Oregon": 1600, "Utah": 5000}
s = pd.Series(tmp)
s

Ohio      3500
Texas     7100
Oregon    1600
Utah      5000
dtype: int64

In [23]:
# Membership test, just like checking for a key in a dictionary.
"Texas" in s

True

In [25]:
# Create a Series from a dictionary with an explicit index.
# A label missing from the dictionary ("California") gets NaN;
# a key not listed in the index ("Utah") is dropped.
tmp = {"Ohio": 3500, "Texas": 7100, "Oregon": 1600, "Utah": 5000}
s = pd.Series(tmp, index=["California", "Ohio", "Oregon", "Texas"])
s

California       NaN
Ohio          3500.0
Oregon        1600.0
Texas         7100.0
dtype: float64

In [27]:
# We can check if there are NaN values.
pd.isnull(s)  # By a function.

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [30]:
s.isnull()  # By a method.

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [38]:
# We can assign a new index and an index name in place.
s.index = ["Bob", "Steve", "Jeff", "Ryan"]  # type: ignore
s.index.name = "person"  # type: ignore
s

person
Bob         NaN
Steve    3500.0
Jeff     1600.0
Ryan     7100.0
dtype: float64

Arithmetic between `Series` objects automatically aligns values by index label.

In [32]:
# Adding two Series aligns them by label; labels present in only one Series yield NaN.
s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([1, 2, 3], index=list("bcd"))
s1 + s2

a    NaN
b    3.0
c    5.0
d    NaN
dtype: float64
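
If NaN is not the desired result for labels that appear in only one operand, the method form of the operator can be used instead. A minimal sketch, assuming we want a missing operand to count as 0:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([1, 2, 3], index=list("bcd"))

# The `+` operator aligns on the union of both indices;
# labels present in only one operand produce NaN.
plain = s1 + s2

# `add` with `fill_value=0` substitutes 0 for the missing operand first.
filled = s1.add(s2, fill_value=0)
print(filled)
```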

---

### **1.2. DATAFRAME OBJECT:**

#### **BASIC OPERATIONS:**

A `DataFrame` is a two-dimensional tabular data structure. Each column is a `Series`, and all columns share the same index.

In [39]:
# Create a DataFrame from a dictionary.
tmp = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}

df = pd.DataFrame(tmp)  # With default index -> 0, 1, 2, ...
df

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [40]:
# First rows of the DataFrame; `head()` returns the first 5 rows by default.
df.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [41]:
# A column listed in `columns` but missing from the data ("debt") is filled with NaN.
df = pd.DataFrame(tmp, columns=["year", "state", "pop", "debt"],
                  index=["one", "two", "three", "four", "five", "six"])
df

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [42]:
# Access to the given column.
df["pop"]  # Series object.

one      1.5
two      1.7
three    3.6
four     2.4
five     2.9
six      3.2
Name: pop, dtype: float64

In [43]:
# Second method: attribute access.
# Works only when the column name is a valid Python identifier
# that does not clash with a DataFrame attribute or method.
# For example, `df.pop` returns the `pop` method, not the column.
df.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [44]:
# Access to the given row.
df.loc["three"]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [45]:
# Assign a scalar to every row of the given column.
df["debt"] = 16
df.head()

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16
two,2001,Ohio,1.7,16
three,2002,Ohio,3.6,16
four,2001,Nevada,2.4,16
five,2002,Nevada,2.9,16


In [46]:
# Assign an array of values to the given column.
df["debt"] = np.arange(len(df))
df.head()

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2001,Nevada,2.4,3
five,2002,Nevada,2.9,4


In [48]:
# When we assign a Series, its values are aligned by index label;
# rows without a matching label get NaN.
values = pd.Series([-1, -2, -3], index=["two", "four", "five"])
df["debt"] = values
df

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.0
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-2.0
five,2002,Nevada,2.9,-3.0
six,2003,Nevada,3.2,


In [51]:
# When we add a new column we have to do this with brackets [].
df["new"] = (df["state"] == "Ohio")  # Cannot be `df.new`.
df

Unnamed: 0,year,state,pop,debt,new
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.0,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-2.0,False
five,2002,Nevada,2.9,-3.0,False
six,2003,Nevada,3.2,,False


In [53]:
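# The selected column can refer to the same data as the DataFrame,
# so modifying it may also change `df` (pandas warns about this below).
# The exact behaviour depends on the pandas version.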
tmp = df["new"]
tmp["one"] = -1
df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp["one"] = -1


Unnamed: 0,year,state,pop,debt,new
one,2000,Ohio,1.5,,-1
two,2001,Ohio,1.7,-1.0,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-2.0,False
five,2002,Nevada,2.9,-3.0,False
six,2003,Nevada,3.2,,False


In [56]:
# To change the values without touching the DataFrame, work on an explicit copy.
tmp = df["new"].copy()
tmp["one"] = "Something New"
df

Unnamed: 0,year,state,pop,debt,new
one,2000,Ohio,1.5,,-1
two,2001,Ohio,1.7,-1.0,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-2.0,False
five,2002,Nevada,2.9,-3.0,False
six,2003,Nevada,3.2,,False
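
Because the view-versus-copy behaviour of a selected column has changed between pandas versions, it is safer not to rely on it. The sketch below shows two patterns that should behave the same regardless of version: an explicit `copy()` for scratch work, and assignment through `.loc` on the DataFrame itself when the original is meant to change. The small DataFrame is a made-up stand-in for the one used above.

```python
import pandas as pd

df = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]},
                  index=["one", "two"])

# Work on an explicit copy: the original DataFrame is left untouched.
tmp = df["pop"].copy()
tmp["one"] = -1.0

# To change the original, assign through `.loc` on the DataFrame itself.
df.loc["two", "pop"] = 9.9

print(df)
```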


In [57]:
# Transpose DataFrame.
df.T

Unnamed: 0,one,two,three,four,five,six
year,2000,2001,2002,2001,2002,2003
state,Ohio,Ohio,Ohio,Nevada,Nevada,Nevada
pop,1.5,1.7,3.6,2.4,2.9,3.2
debt,,-1.0,,-2.0,-3.0,
new,-1,True,True,False,False,False


In [58]:
# Set index and columns names.
df.index.name = "year"
df.columns.name = "state"
df

state,year,state,pop,debt,new
year,,,,,
one,2000,Ohio,1.5,,-1
two,2001,Ohio,1.7,-1.0,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-2.0,False
five,2002,Nevada,2.9,-3.0,False
six,2003,Nevada,3.2,,False


In [59]:
# The `values` attribute returns a 2D NumPy array.
df.values

array([[2000, 'Ohio', 1.5, nan, -1],
       [2001, 'Ohio', 1.7, -1.0, True],
       [2002, 'Ohio', 3.6, nan, True],
       [2001, 'Nevada', 2.4, -2.0, False],
       [2002, 'Nevada', 2.9, -3.0, False],
       [2003, 'Nevada', 3.2, nan, False]], dtype=object)

#### **DIFFERENT WAYS TO CREATE DATAFRAME:**

In [60]:
# 2D numpy array.
tmp = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
pd.DataFrame(tmp, columns=["x1", "x2", "x3"])

Unnamed: 0,x1,x2,x3
0,1,2,3
1,4,5,6
2,7,8,9


In [62]:
# Dictionary of lists, tuples, arrays, series, etc.
tmp = {
    "x1": [1, 2, 3],
    "x2": (1, 2, 3),
    "x3": np.array([1, 2, 3]),
    "x4": pd.Series([1, 2, 3]),
    "x5": np.random.rand(3),
}
pd.DataFrame(tmp)

Unnamed: 0,x1,x2,x3,x4,x5
0,1,1,1,1,0.58728
1,2,2,2,2,0.653208
2,3,3,3,3,0.423588


In [63]:
# Nested dictionary: outer keys become columns, inner keys become index labels.
tmp = {
    "x1": {  # Will be a column.
        1: "a",
        2: "b",
    },
    "x2": {  # Will be a column.
        1: "a",
        2: "b",
        3: "c",
    },
}
pd.DataFrame(tmp)

Unnamed: 0,x1,x2
1,a,a
2,b,b
3,,c


In [64]:
# List of dictionaries (or Series): each element becomes a row.
tmp = [
    {"x1": [1, 2, 3]},
    {"x2": [4, 5, 6]},
]
pd.DataFrame(tmp)

Unnamed: 0,x1,x2
0,"[1, 2, 3]",
1,,"[4, 5, 6]"
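
In the example above every dictionary holds a whole list, so each cell ends up containing a list. A more common use of this constructor is one dictionary per record. A minimal sketch, with made-up records, showing that keys become columns and a key missing from a record yields NaN:

```python
import pandas as pd

# One dictionary per row; the third record has no "x2" key.
rows = [
    {"x1": 1, "x2": 4},
    {"x1": 2, "x2": 5},
    {"x1": 3},
]
records = pd.DataFrame(rows)
print(records)
```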


In [65]:
# Nested list.
tmp = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
pd.DataFrame(tmp)

Unnamed: 0,0,1,2
0,1,2,3
1,4,5,6
2,7,8,9


In [70]:
# Other DataFrame object.
tmp = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df1 = pd.DataFrame(tmp)
df2 = pd.DataFrame(df1, index=[0, 1, 4])  # Index 4 does not exist in the df1.
df2

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,4.0,5.0,6.0
4,,,


---

### **1.3. INDEX OBJECT:**

An `Index` object stores the axis labels, along with metadata such as the axis name.

In [74]:
# Access to the Index object.
s = pd.Series(range(3), index=["a", "b", "c"])
index_s = s.index
index_s

Index(['a', 'b', 'c'], dtype='object')

In [72]:
# An Index can be sliced much like a list.
index_s[:1]

Index(['a'], dtype='object')

In [91]:
# An Index is immutable.
# index_s[1] = 0  # Raises TypeError.

In [75]:
# Create an Index.
labels = pd.Index(range(3))
labels

RangeIndex(start=0, stop=3, step=1)

In [79]:
s = pd.Series([1, 2, 3], index=labels)
s

0    1
1    2
2    3
dtype: int64

In [78]:
s.index is labels

True

In [80]:
# Watch out for duplicate labels: selecting one returns a Series instead of a scalar.
labels = pd.Index(["a", "a", "b", "c"])
s = pd.Series([1, 2, 3, 4], index=labels)
s["a"]

a    1
a    2
dtype: int64
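
Duplicate labels are easy to miss, so it can help to check for them explicitly. A minimal sketch, assuming we want to keep only the first occurrence of each label (other policies, such as aggregating the duplicates, are equally valid):

```python
import pandas as pd

labels = pd.Index(["a", "a", "b", "c"])
s = pd.Series([1, 2, 3, 4], index=labels)

print(s.index.is_unique)            # False, because "a" appears twice.

# `duplicated` marks every repeat of a label after its first occurrence.
repeats = s.index.duplicated(keep="first")
deduplicated = s[~repeats]
print(deduplicated)
```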

An `Index` object supports set-like and list-like operations. Some useful methods and properties:
- `append`,
- `difference`,
- `intersection`,
- `union`,
- `isin`,
- `delete`,
- `drop`,
- `insert`,
- `is_monotonic_increasing`,
- `is_unique`,
- `unique`.
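
A few of the operations listed above can be sketched as follows. The two indices `left` and `right` are made up for illustration.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

both = left.union(right)                 # Labels from either index.
common = left.intersection(right)        # Labels present in both.
only_left = left.difference(right)       # Labels in `left` but not in `right`.
extended = left.append(pd.Index(["a"]))  # Concatenation; duplicates are kept.

print(both)
print(extended.is_unique)                # False, because "a" appears twice.
print(extended.unique())
print(left.isin(["a", "z"]))             # Boolean array, one entry per label.
```

Each call returns a new object, since an `Index` cannot be modified in place.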

---

## **2. BASIC FUNCTIONALITIES:**

### **2.1. INDEX UPDATE:**

In [81]:
# Let's create a simple Series.
s = pd.Series([4.5, 1.2, -4.1, 3.4], index=["b", "d", "a", "c"])
s

b    4.5
d    1.2
a   -4.1
c    3.4
dtype: float64

In [82]:
# We can change the index with `reindex`.
s.reindex(["a", "b", "c", "d"])

a   -4.1
b    4.5
c    3.4
d    1.2
dtype: float64

In [83]:
s = pd.Series(["blue", "yellow", "green"], index=[0, 2, 4])
s

0      blue
2    yellow
4     green
dtype: object

In [85]:
s.reindex(range(6), method="ffill")  # Forward fill.

0      blue
1      blue
2    yellow
3    yellow
4     green
5     green
dtype: object

In [86]:
# Let's create a simple DataFrame.
df = pd.DataFrame(np.arange(9).reshape(3, 3), index=["a", "b", "c"],
                  columns=["Ohio", "Texas", "California"])
df

Unnamed: 0,Ohio,Texas,California
a,0,1,2
b,3,4,5
c,6,7,8


In [87]:
# When we pass a label that does not exist yet ("d"), its row is filled with NaN.
df.reindex(["a", "b", "c", "d"])

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,3.0,4.0,5.0
c,6.0,7.0,8.0
d,,,


In [91]:
# Reindex the columns. A column that does not exist yet ("Utah") is filled with NaN.
df.reindex(columns=["Texas", "Utah", "California"])

Unnamed: 0,Texas,Utah,California
a,1,,2
b,4,,5
c,7,,8
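
Rows and columns can also be reindexed in one call, and the value used for new labels can be chosen. A minimal sketch, assuming 0 is an acceptable placeholder for the new cells:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), index=["a", "b", "c"],
                  columns=["Ohio", "Texas", "California"])

# Reindex both axes at once; new cells get `fill_value` instead of NaN.
expanded = df.reindex(index=["a", "b", "c", "d"],
                      columns=["Texas", "Utah", "California"],
                      fill_value=0)
print(expanded)
```

As with the examples above, `reindex` returns a new object and leaves `df` unchanged.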


### **2.2. DROP ELEMENTS:**

In [97]:
s = pd.Series(np.arange(5), index=["a", "b", "c", "d", "e"])
s

a    0
b    1
c    2
d    3
e    4
dtype: int32

In [98]:
# The `drop` method creates a new object.
val = s.drop("b")
val

a    0
c    2
d    3
e    4
dtype: int32

In [100]:
# We can drop several labels at once by passing a list.
val = s.drop(["b", "c"])
val

a    0
d    3
e    4
dtype: int32

In [108]:
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=["Ohio", "Colorado", "Utah", "New York"],
                  columns=["one", "two", "three", "four"])
df

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [104]:
# By default `drop()` works on rows.
df.drop(["Colorado", "Ohio"])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [105]:
# We can pass the axis argument to drop columns.
df.drop(["one", "three"], axis=1)

Unnamed: 0,two,four
Ohio,1,3
Colorado,5,7
Utah,9,11
New York,13,15


In [109]:
# Inplace drop modifies the original DataFrame.
df.drop(["one", "three"], axis=1, inplace=True)
df

Unnamed: 0,two,four
Ohio,1,3
Colorado,5,7
Utah,9,11
New York,13,15
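
As an alternative to the `axis` argument, `drop` accepts the `index` and `columns` keywords, which state the intent more explicitly. A minimal sketch; the label `"missing"` is a deliberately non-existent column used to show `errors="ignore"`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=["Ohio", "Colorado", "Utah", "New York"],
                  columns=["one", "two", "three", "four"])

# Drop a row and two columns in one call; `df` itself is not modified.
smaller = df.drop(index=["Ohio"], columns=["one", "three"])

# By default a non-existent label raises KeyError; `errors="ignore"` skips it.
same = df.drop(columns=["missing"], errors="ignore")

print(smaller)
```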


### **2.3. INDEXING, SELECTING, FILTERING:**

In [110]:
# Example Series.
s = pd.Series(np.arange(4), index=["a", "b", "c", "d"])
s

a    0
b    1
c    2
d    3
dtype: int32

In [111]:
s.iloc[1]  # Select by position.

1

In [112]:
s["b"]

1

In [113]:
s[2:4]

c    2
d    3
dtype: int32

In [114]:
s[2:3]

c    2
dtype: int32

In [115]:
s[["a", "c", "a"]]

a    0
c    2
a    0
dtype: int32

In [116]:
s.iloc[[0, 2, 0]]  # Select by a list of positions.

a    0
c    2
a    0
dtype: int32

In [117]:
s["a":"c"]  # Unlike positional slices, label-based slices include the end label.

a    0
b    1
c    2
dtype: int32

In [118]:
s["c":"d"]

c    2
d    3
dtype: int32

In [119]:
s[s < 2]

a    0
b    1
dtype: int32

In [120]:
# Example DataFrame.
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=["Ohio", "Colorado", "Utah", "New York"],
                  columns=["one", "two", "three", "four"])
df

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [122]:
# Return a Series object.
df["two"]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [123]:
isinstance(df["two"], pd.Series)

True

In [124]:
# Fancy indexing with a list of columns returns a DataFrame.
df[["two", "four"]]

Unnamed: 0,two,four
Ohio,1,3
Colorado,5,7
Utah,9,11
New York,13,15


In [126]:
# Selecting rows slice.
df[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [127]:
# Create a mask.
df["three"] > 5

Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool

In [128]:
# Apply a mask.
df[df["three"] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [129]:
# A mask for the whole DataFrame.
df < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [133]:
# Assign a value using a mask.
df[df < 5] = 0
df

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15
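
Masks can be combined before they are applied. The sketch below is one way to do it, assuming two arbitrary conditions chosen for illustration. Each condition needs its own parentheses, and `&` / `|` are used instead of `and` / `or`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=["Ohio", "Colorado", "Utah", "New York"],
                  columns=["one", "two", "three", "four"])

# Rows where "three" is above 5 AND "one" is below 12.
mask = (df["three"] > 5) & (df["one"] < 12)

# Apply the mask to the rows and pick two columns at the same time.
subset = df.loc[mask, ["one", "four"]]
print(subset)
```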


### **2.4. LOC, ILOC PROPERTIES:**

In [134]:
# Example DataFrame.
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=["Ohio", "Colorado", "Utah", "New York"],
                  columns=["one", "two", "three", "four"])
df

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [135]:
# df.loc[labels]
# Select a row or rows based on labels.
df.loc[["Ohio", "Utah"]]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Utah,8,9,10,11


In [136]:
# df.loc[:, labels]
# Select a column or columns based on labels.
df.loc[:, ["two", "four"]]

Unnamed: 0,two,four
Ohio,1,3
Colorado,5,7
Utah,9,11
New York,13,15


In [137]:
# df.loc[labels, labels]
# Select row/rows and col/cols based on labels.
df.loc["Colorado", ["two", "three"]]

two      5
three    6
Name: Colorado, dtype: int32

In [138]:
# df.iloc[integers]
# Select a row or rows based on integer value.
df.iloc[0]

one      0
two      1
three    2
four     3
Name: Ohio, dtype: int32

In [133]:
# df.iloc[:, integers]
# Select a column or columns based on integer value.
df.iloc[:, 1]  # Column number 1 - "two".

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [134]:
# df.iloc[integers, integers]
# Select row/rows and col/cols based on integer values.
df.iloc[[1, 2], [3, 1]]

Unnamed: 0,four,two
Colorado,7,5
Utah,11,9


In [135]:
df.iloc[1, 3]

7

In [136]:
# df.at[label, label]
# Select a specific value based on labels.
df.at["Ohio", "two"]

1

In [137]:
# df.iat[integer, integer]
# Select a specific value based on integer positions.
df.iat[0, 2]

2

In [139]:
# Example DataFrame for simple unit tests.
df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=["Ohio", "Colorado", "Utah", "New York"],
                  columns=["one", "two", "three", "four"])
df

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [140]:
# Several equivalent expressions.
assert np.all(df["two"] == df.loc[:, "two"])
assert np.all(df["two"] == df.iloc[:, 1])
assert np.all(df.loc[:, "two"] == df.iloc[:, 1])
assert np.all(df.at["Ohio", "three"] == df.iat[0, 2])
assert np.all(df.iloc[1, 3] == df.at["Colorado", "four"])
assert np.all(df.loc[["Ohio", "New York"]] == df.iloc[[0, 3]])
assert np.all(df.loc[["Ohio", "Utah"], ["two", "three"]] == df.iloc[[0, 2], [1, 2]])
assert np.all(df[:2] == df.loc[:"Colorado"])
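
The last assertion above relies on a difference that is worth spelling out: positional slices exclude the end, while label slices include it. A minimal sketch on the same DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=["Ohio", "Colorado", "Utah", "New York"],
                  columns=["one", "two", "three", "four"])

by_position = df.iloc[0:2]         # Positions 0 and 1; the end is excluded.
by_label = df.loc["Ohio":"Utah"]   # "Ohio" through "Utah"; the end is included.

# The same rule applies to columns.
cols_by_label = df.loc[:, "two":"three"]
cols_by_position = df.iloc[:, 1:3]

print(len(by_position), len(by_label))
```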

### **2.5. MATH OPERATIONS:**

In [141]:
# Add two Series objects.
s1 = pd.Series([7.3, 2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4.0, 3.1], index=["a", "b", "e", "d", "f"])
s1 + s2

a    5.2
b    NaN
c    NaN
d    7.4
e    0.0
f    NaN
dtype: float64

In [142]:
df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("abc"),
                   index=["Ohio", "Texas", "Colorado"])
df1

Unnamed: 0,a,b,c
Ohio,0,1,2
Texas,3,4,5
Colorado,6,7,8


In [143]:
df2 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("bcd"),
                   index=["Utah", "Ohio", "Texas", "Oregon"])
df2

Unnamed: 0,b,c,d
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [145]:
# Add two DataFrame objects. Only cells whose row and column labels
# exist in both are summed; all other cells become NaN.
df1 + df2

Unnamed: 0,a,b,c,d
Colorado,,,,
Ohio,,4.0,6.0,
Oregon,,,,
Texas,,10.0,12.0,
Utah,,,,


In [161]:
# With `fill_value`, a cell missing from one DataFrame is treated as 0;
# cells missing from both DataFrames stay NaN.
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d
Colorado,6.0,7.0,8.0,
Ohio,0.0,4.0,6.0,5.0
Oregon,,9.0,10.0,11.0
Texas,3.0,10.0,12.0,8.0
Utah,,0.0,1.0,2.0


In [165]:
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("abc"),
                  index=["Utah", "Ohio", "Texas", "Oregon"])
df

Unnamed: 0,a,b,c
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [166]:
s = pd.Series([0, 1, 2], index=list("abc"))
s

a    0
b    1
c    2
dtype: int64

In [168]:
# The Series index is matched against the column labels,
# and the subtraction is applied to every row.
df - s

Unnamed: 0,a,b,c
Utah,0,0,0
Ohio,3,3,3
Texas,6,6,6
Oregon,9,9,9


In [172]:
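# With `axis=0` the Series is matched against the row labels instead,
# so column "b" is subtracted from every column.
df.sub(df["b"], axis=0)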

Unnamed: 0,a,b,c
Utah,-1,0,1
Ohio,-1,0,1
Texas,-1,0,1
Oregon,-1,0,1
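
The two broadcasting directions shown above can be compared side by side. A minimal sketch, using a row and a column of the same DataFrame as the operands:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("abc"),
                  index=["Utah", "Ohio", "Texas", "Oregon"])

row = df.loc["Utah"]   # Indexed by column labels: a, b, c.
col = df["b"]          # Indexed by row labels: Utah, Ohio, ...

# Default: match the Series against the columns, repeat for every row.
by_row = df - row

# axis="index": match the Series against the rows, repeat for every column.
by_col = df.sub(col, axis="index")

print(by_row)
print(by_col)
```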


### **2.6. APPLY AND MAP:**

In [173]:
df = pd.DataFrame(np.random.randn(4, 3), columns=list("abc"), index=range(4))
df

Unnamed: 0,a,b,c
0,-0.143593,-1.621547,-0.549848
1,0.487871,-0.586205,0.327658
2,-0.105875,1.204674,-0.181266
3,-0.240028,-1.465012,0.71548


In [175]:
# Calculate max - min for each column.
df.apply(lambda x: x.max() - x.min(), axis=0)

a    0.727899
b    2.826221
c    1.265328
dtype: float64

In [176]:
# Calculate max - min for each row.
df.apply(lambda x: x.max() - x.min(), axis=1)

0    1.477954
1    1.074076
2    1.385940
3    2.180492
dtype: float64

In [177]:
# We can pass different functions to apply.
def f(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])

df.apply(f)

Unnamed: 0,a,b,c
min,-0.240028,-1.621547,-0.549848
max,0.487871,1.204674,0.71548


In [179]:
# Format the values to two decimal places (the result is a Series of strings).
df["a"].apply(lambda x: f"{x:.2f}")

0    -0.14
1     0.49
2    -0.11
3    -0.24
Name: a, dtype: object

In [181]:
# `map` applies a function element-wise as well (here with %-style formatting).
df["a"].map(lambda x: "%.2f" % x)

0    -0.14
1     0.49
2    -0.11
3    -0.24
Name: a, dtype: object
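
Many `apply` calls have a built-in, vectorized equivalent that keeps numeric dtypes. The sketch below uses a small made-up DataFrame to show this for rounding and for the max - min spread, and also shows that `map` accepts a dictionary as a lookup table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.1234, -1.5678], "b": [2.0, 3.5]})

# Numeric rounding: the result stays float, unlike string formatting.
rounded = df.round(2)

# Same result as df.apply(lambda x: x.max() - x.min()), without a lambda.
spread = df.max() - df.min()

# `map` with a dictionary replaces each value by its looked-up counterpart.
codes = pd.Series(["x", "y", "x"])
names = codes.map({"x": "first", "y": "second"})

print(rounded)
print(spread)
print(names)
```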

### **2.7. SORTING:**

In [207]:
s = pd.Series(range(4), index=list("dbca"))
s

d    0
b    1
c    2
a    3
dtype: int64

In [182]:
s.sort_index()

a    3
b    1
c    2
d    0
dtype: int64

In [183]:
s.sort_values(ascending=False)

a    3
c    2
b    1
d    0
dtype: int64

In [184]:
df = pd.DataFrame(np.arange(8).reshape(2, 4), index=["three", "one"], columns=list("dabc"))
df

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [185]:
df.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [186]:
df.sort_index(axis="columns")

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [187]:
df.sort_index(axis="columns", ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


In [188]:
df = pd.DataFrame({"d": [4, 7, -1, 2], "a": [0, 1, 0, 2]})
df

Unnamed: 0,d,a
0,4,0
1,7,1
2,-1,0
3,2,2


In [189]:
df.sort_values(by="d")

Unnamed: 0,d,a
2,-1,0
3,2,2
0,4,0
1,7,1


In [223]:
df.sort_values(by=["a", "d"])

Unnamed: 0,d,a
2,-1,0
0,4,0
1,7,1
3,2,2
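
When sorting by several columns, each column can have its own direction. A minimal sketch on the same DataFrame, assuming we want "a" ascending and "d" descending within ties:

```python
import pandas as pd

df = pd.DataFrame({"d": [4, 7, -1, 2], "a": [0, 1, 0, 2]})

# One boolean per sort key, in the same order as `by`.
ordered = df.sort_values(by=["a", "d"], ascending=[True, False])

# The original row labels are kept; reset them if a fresh 0..n-1 index is wanted.
renumbered = ordered.reset_index(drop=True)

print(ordered)
```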


---

## **3. STATISTICS:**

#### **DATAFRAME SUMMARY:**

In [194]:
tmp = [
    [1, 2, 3, 4, np.nan],
    [5, 6, 7, 8, 9],
    [np.nan, np.nan, np.nan, np.nan, np.nan],
    [-5, -4, -3, -2, -1],
]

df = pd.DataFrame(tmp, index=list("abcd"), columns=["x1", "x2", "x3", "x4", "x5"])
df

Unnamed: 0,x1,x2,x3,x4,x5
a,1.0,2.0,3.0,4.0,
b,5.0,6.0,7.0,8.0,9.0
c,,,,,
d,-5.0,-4.0,-3.0,-2.0,-1.0


In [195]:
# Series with a sum of values in columns.
df.sum()

x1     1.0
x2     4.0
x3     7.0
x4    10.0
x5     8.0
dtype: float64

In [196]:
# Series with a sum of values in rows.
df.sum(axis="columns")

a    10.0
b    35.0
c     0.0
d   -15.0
dtype: float64

In [197]:
# NaN values are skipped by default; pass `skipna=False` to propagate them.
df.mean(axis="columns", skipna=False)

a    NaN
b    7.0
c    NaN
d   -3.0
dtype: float64

In [198]:
# Indices with maximal values in columns.
df.idxmax()

x1    b
x2    b
x3    b
x4    b
x5    b
dtype: object

In [199]:
# Indices with minimal values in columns.
df.idxmin()

x1    d
x2    d
x3    d
x4    d
x5    d
dtype: object

In [200]:
# Cumulative sum.
df.cumsum()

Unnamed: 0,x1,x2,x3,x4,x5
a,1.0,2.0,3.0,4.0,
b,6.0,8.0,10.0,12.0,9.0
c,,,,,
d,1.0,4.0,7.0,10.0,8.0


In [201]:
# Statistical description on each column.
df.describe()

Unnamed: 0,x1,x2,x3,x4,x5
count,3.0,3.0,3.0,3.0,2.0
mean,0.333333,1.333333,2.333333,3.333333,4.0
std,5.033223,5.033223,5.033223,5.033223,7.071068
min,-5.0,-4.0,-3.0,-2.0,-1.0
25%,-2.0,-1.0,0.0,1.0,1.5
50%,1.0,2.0,3.0,4.0,4.0
75%,3.0,4.0,5.0,6.0,6.5
max,5.0,6.0,7.0,8.0,9.0


Methods for summary and descriptive statistics:
- `count`,
- `describe`,
- `min`, `max`,
- `argmin`, `argmax`,
- `idxmin`, `idxmax`,
- `quantile`,
- `sum`,
- `mean`,
- `median`,
- `mad` (removed in pandas 2.0),
- `prod`,
- `var`,
- `std`,
- `skew`,
- `kurt`,
- `cumsum`,
- `cummin`, `cummax`,
- `cumprod`,
- `diff`,
- `pct_change`.
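
Several of the methods listed above can be computed in one pass with `agg`. The sketch below uses a reduced, made-up version of the DataFrame from this section; it also shows `min_count`, which makes `sum` return NaN for an all-NaN row instead of 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [1.0, 5.0, np.nan, -5.0],
                   "x5": [np.nan, 9.0, np.nan, -1.0]},
                  index=list("abcd"))

# One row per statistic, one column per original column; NaN is skipped.
summary = df.agg(["min", "max", "mean"])

# Require at least one valid value per row; otherwise the sum is NaN.
row_sums = df.sum(axis="columns", min_count=1)

print(summary)
print(row_sums)
```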