# 20 Pandas Functions to Boost Data Analysis


In [1]:
# Importing the libraries
import numpy as np
import pandas as pd

In [None]:
# Sample DataFrame
# Random integers in [0, 10); the values differ on every run unless a seed is set
values_1 = np.random.randint(10, size=10)
values_2 = np.random.randint(10, size=10)
years = np.arange(2010, 2020)
groups = ['A', 'A', 'B', 'A', 'B', 'B', 'C', 'A', 'C', 'C']

df = pd.DataFrame({'group': groups, 'year': years, 'value_1': values_1, 'value_2': values_2})

df

## 1. Query
`query` filters a DataFrame using a condition written as a string, which is often easier to read than building a boolean mask.

In [3]:
df.query('value_1 < value_2')

| | group | year | value_1 | value_2 |
|---|---|---|---|---|
| 1 | A | 2011 | 2 | 9 |
| 6 | C | 2016 | 4 | 9 |
| 7 | A | 2017 | 8 | 9 |
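
As a minimal sketch of how `query` relates to a boolean mask, the example below uses a small DataFrame with made-up values (the name `demo` and the variable `min_year` are assumptions for illustration, not part of the notebook). It also shows the `@` prefix, which lets the query string refer to a local Python variable.

```python
import pandas as pd

# Small fixed DataFrame (hypothetical values) so the result is deterministic
demo = pd.DataFrame({
    "group": ["A", "A", "B", "C"],
    "year": [2010, 2011, 2012, 2013],
    "value_1": [9, 2, 9, 4],
    "value_2": [5, 9, 9, 9],
})

# Column-to-column comparison, as in the cell above
lower = demo.query("value_1 < value_2")

# Refer to a local variable with the @ prefix and combine conditions
min_year = 2012
recent = demo.query("year >= @min_year and group != 'B'")

# The same filter written as a boolean mask
mask = demo[(demo["year"] >= min_year) & (demo["group"] != "B")]

print(lower)
print(recent.equals(mask))  # True
```

Both forms select the same rows; which one to use is largely a matter of readability.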


## 2. Insert
The `insert` function adds a new column to a DataFrame at any position.

It takes the position as an integer **index** for the first argument, the **column name** as the second argument and a **Series** or **array-like** object holding the values as the third argument.

In [4]:
# New Column
new_col = np.random.randn(10)

# Insert the new column at position 2
df.insert(2, 'new_col', new_col)
df

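| | group | year | new_col | value_1 | value_2 |
|---|---|---|---|---|---|
| 0 | A | 2010 | 0.481033 | 9 | 5 |
| 1 | A | 2011 | -0.816668 | 2 | 9 |
| 2 | B | 2012 | -0.711052 | 9 | 9 |
| 3 | A | 2013 | -0.021827 | 4 | 3 |
| 4 | B | 2014 | -1.428239 | 9 | 6 |
| 5 | B | 2015 | 0.459038 | 2 | 1 |
| 6 | C | 2016 | -2.011737 | 4 | 9 |
| 7 | A | 2017 | 2.477067 | 8 | 9 |
| 8 | C | 2018 | 0.673653 | 9 | 1 |
| 9 | C | 2019 | -0.018974 | 5 | 3 |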
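
Two details of `insert` are easy to miss: it modifies the DataFrame in place and returns `None`, and by default it refuses to add a column whose name already exists. The sketch below illustrates both on a tiny DataFrame; `demo` and the `flag` column are hypothetical names chosen for the example.

```python
import pandas as pd

demo = pd.DataFrame({"group": ["A", "B"], "year": [2010, 2011], "value_1": [3, 7]})

# insert() changes demo in place and returns None
result = demo.insert(1, "flag", [True, False])
print(result)               # None
print(list(demo.columns))   # ['group', 'flag', 'year', 'value_1']

# Inserting a column name that already exists raises ValueError by default
try:
    demo.insert(0, "flag", [0, 1])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(duplicate_rejected)   # True
```

Because of the in-place behavior, re-running a notebook cell that calls `insert` with the same column name will fail the second time.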


## 3. Cumsum
`cumsum` calculates the cumulative sum. Paired with `groupby`, it gives a running total within each group.

In [5]:
# Cumulative sum of value_2 within each group
df['cumsum_2'] = df.groupby('group')['value_2'].cumsum()
df

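| | group | year | new_col | value_1 | value_2 | cumsum_2 |
|---|---|---|---|---|---|---|
| 0 | A | 2010 | 0.481033 | 9 | 5 | 5 |
| 1 | A | 2011 | -0.816668 | 2 | 9 | 14 |
| 2 | B | 2012 | -0.711052 | 9 | 9 | 9 |
| 3 | A | 2013 | -0.021827 | 4 | 3 | 17 |
| 4 | B | 2014 | -1.428239 | 9 | 6 | 15 |
| 5 | B | 2015 | 0.459038 | 2 | 1 | 16 |
| 6 | C | 2016 | -2.011737 | 4 | 9 | 9 |
| 7 | A | 2017 | 2.477067 | 8 | 9 | 26 |
| 8 | C | 2018 | 0.673653 | 9 | 1 | 10 |
| 9 | C | 2019 | -0.018974 | 5 | 3 | 13 |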
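
To make the difference between a grouped and an ungrouped cumulative sum concrete, here is a small sketch on fixed, made-up values (`demo` and `running_total` are names assumed for the example). The grouped version restarts the total for each group, while the plain version runs through all rows.

```python
import pandas as pd

demo = pd.DataFrame({
    "group": ["A", "A", "B", "A", "B"],
    "value_2": [5, 9, 9, 3, 6],
})

# Running total within each group; the result is aligned to demo's index
demo["cumsum_2"] = demo.groupby("group")["value_2"].cumsum()

# Plain cumsum ignores the groups and accumulates over all rows
demo["running_total"] = demo["value_2"].cumsum()

print(demo["cumsum_2"].tolist())       # [5, 14, 9, 17, 15]
print(demo["running_total"].tolist())  # [5, 14, 23, 26, 32]
```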


## 4. Sample
`sample` selects rows or values at random from a **Series** or **DataFrame**. It is useful when we want to draw a random sample from a distribution.

In [7]:
# Taking a sample 
sample1 = df.sample(n = 3)
sample1

| | group | year | new_col | value_1 | value_2 | cumsum_2 |
|---|---|---|---|---|---|---|
| 6 | C | 2016 | -2.011737 | 4 | 9 | 9 |
| 9 | C | 2019 | -0.018974 | 5 | 3 | 13 |
| 7 | A | 2017 | 2.477067 | 8 | 9 | 26 |


We specify the number of rows with the `n` parameter, or pass a ratio to the `frac` parameter instead. For instance, `frac=0.5` returns half of the rows.

In [8]:
# Taking another sample
sample2 = df.sample(frac = 0.5)
sample2

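| | group | year | new_col | value_1 | value_2 | cumsum_2 |
|---|---|---|---|---|---|---|
| 9 | C | 2019 | -0.018974 | 5 | 3 | 13 |
| 4 | B | 2014 | -1.428239 | 9 | 6 | 15 |
| 2 | B | 2012 | -0.711052 | 9 | 9 | 9 |
| 5 | B | 2015 | 0.459038 | 2 | 1 | 16 |
| 3 | A | 2013 | -0.021827 | 4 | 3 | 17 |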


To obtain a reproducible sample, we can pass a fixed value to the `random_state` parameter.
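
A short sketch of the reproducibility point: two calls with the same `random_state` return the same rows. The DataFrame `demo` and the seed values 42 and 0 are arbitrary choices for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"year": range(2010, 2020), "value": range(10)})

# Same seed, same rows: random_state makes the draw repeatable
first = demo.sample(n=3, random_state=42)
second = demo.sample(n=3, random_state=42)
print(first.equals(second))  # True

# frac takes a fraction of the rows; 0.5 of 10 rows gives 5 rows
half = demo.sample(frac=0.5, random_state=0)
print(len(half))  # 5
```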

## 5. Where
`where` replaces the values in rows or columns that do not satisfy a condition. The default replacement value is NaN, but we can also specify the value to use as the replacement.

In [9]:
df['new_col'].where(df['new_col'] > 0, 0)

0    0.481033
1    0.000000
2    0.000000
3    0.000000
4    0.000000
5    0.459038
6    0.000000
7    2.477067
8    0.673653
9    0.000000
Name: new_col, dtype: float64
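
The cell above passes `0` as the replacement. As a minimal sketch of the default behavior, the example below leaves the replacement out, so the entries that fail the condition become NaN; the Series `s` holds made-up values.

```python
import pandas as pd

s = pd.Series([0.5, -0.8, 2.4, -0.1])

# No replacement given: entries that fail the condition become NaN
masked = s.where(s > 0)

# Explicit replacement value for the entries that fail the condition
clipped = s.where(s > 0, 0)

print(masked.isna().tolist())  # [False, True, False, True]
print(clipped.tolist())        # [0.5, 0.0, 2.4, 0.0]
```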

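One important point is that `where` in Pandas and `where` in NumPy are not exactly the same. We can achieve the same result, but with slightly different syntax. With **DataFrame.where**, the values that satisfy the condition are kept as is and the other values are replaced with the specified value. **np.where** also requires us to specify the value for the entries that satisfy the condition. The two calls below show the syntax side by side. Note that, as written, the `np.where` call tests `< 0`, so it keeps the negative values and replaces the rest with 0, the opposite of the `where` call; with the condition `> 0`, both would return the same values: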

In [10]:
df['new_col'].where(df['new_col'] > 0 , 0)

np.where(df['new_col'] < 0, df['new_col'], 0)

array([ 0.        , -0.81666769, -0.71105243, -0.02182694, -1.42823921,
        0.        , -2.01173727,  0.        ,  0.        , -0.01897364])
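
For the Pandas and NumPy forms to agree, the condition has to be the same in both calls. The sketch below checks this on a short Series of made-up values (`s`, `kept_pd` and `kept_np` are names assumed for the example). It also shows that the return types differ: the Pandas call returns a Series, the NumPy call a plain array.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.5, -0.8, 2.4, -0.1])

# Pandas: keep values where the condition holds, replace the rest with 0
kept_pd = s.where(s > 0, 0)

# NumPy: same condition, but both branches are spelled out
kept_np = np.where(s > 0, s, 0)

print(np.array_equal(kept_pd.to_numpy(), kept_np))     # True
print(type(kept_pd).__name__, type(kept_np).__name__)  # Series ndarray
```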