### **7 Pandas Functions That I Use the Most**
Pandas is a prominent data analysis and manipulation library. It provides numerous functions and methods that expedite the data analysis and exploration process.  
Within the rich selection of functions and methods, some of them are more commonly used. They provide a quick way to obtain a basic understanding of the data at hand.  
In this post, I will go over 7 pandas functions that I most commonly use in my data analysis projects.
I will use a diabetes dataset available on Kaggle. Let’s first read the dataset into a pandas dataframe.

## 1. Head and Tail

In [1]:
import pandas as pd
import numpy as np
diabetes = pd.read_csv("input/diabetes.csv")

In [2]:
diabetes.head()  # first 5 rows by default; diabetes.head(n) returns the first n rows

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
diabetes.tail()  # last 5 rows by default; diabetes.tail(n) returns the last n rows

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0


## 2. Nunique

When working with categorical data or features that have discrete values, it is very important to know the number of unique values. It is an essential step towards data exploration.  
One way is to use the value_counts function, which returns a pandas series with the unique values in a column and the number of occurrences of each value. The length of this series is the number of unique values.  

In [4]:
len(diabetes.Pregnancies.value_counts())

17

In [5]:
#I find nunique easier to use:
diabetes.Pregnancies.nunique()

17

In [6]:
diabetes.nunique()

Pregnancies                  17
Glucose                     136
BloodPressure                47
SkinThickness                51
Insulin                     186
BMI                         248
DiabetesPedigreeFunction    517
Age                          52
Outcome                       2
dtype: int64

## 3. Describe

The describe function gives a quick overview of numerical columns by providing basic statistics such as the mean, the median (the 50% row), and the standard deviation.

In [7]:
diabetes.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


## 4. Isna
- Handling missing values is a critical step to build a robust data analysis process. The missing values should be a top priority since they have a significant effect on the accuracy of any analysis.
- Isna function returns a dataframe filled with boolean values with true indicating missing values.


In [9]:
diabetes.isna().any()

Pregnancies                 False
Glucose                     False
BloodPressure               False
SkinThickness               False
Insulin                     False
BMI                         False
DiabetesPedigreeFunction    False
Age                         False
Outcome                     False
dtype: bool

In [10]:
diabetes.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
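
Because this dataset has no missing values, both outputs above are all False or zero. As a minimal sketch, the same calls on a small made-up dataframe (the `toy` frame and its values are hypothetical, not part of the diabetes data) show what they return when values are actually missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries, for illustration only
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BMI": [33.6, 26.6, np.nan],
    "Age": [50, 31, 32],
})

mask = toy.isna()            # boolean dataframe, True where a value is missing
has_missing = mask.any()     # per column: is at least one value missing?
missing_counts = mask.sum()  # per column: how many values are missing?

print(has_missing)
print(missing_counts)
```

Here `any()` flags the Glucose and BMI columns, and `sum()` reports one missing value in each of them.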

## 5. Groupby
The pandas groupby function is a great tool for exploring the data. It makes it easier to unveil the underlying relationships among variables. The figure below shows an overview of what the groupby function does.
![Collected by Nguyen Duong Hung](img/Groupby.png)


Assume we have two features. One is color which is a categorical feature and the other one is a numerical feature, values. We want to group values by color and calculate the mean (or any other aggregation) of values for different colors. Then finally sort the colors based on average values.
You can, of course, create more complex grouping operations but the concept will be the same.
Let’s create a simple grouping operation with our dataset. The following code will show us the average glucose and BMI values of diabetes positive and negative people.

In [26]:
diabetes[['Outcome','Glucose','BMI']].groupby('Outcome').mean()

Outcome,Glucose,BMI
0,109.98,30.3042
1,141.257463,35.142537
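
The color and values scenario described earlier can be sketched in the same way. The dataframe below is made up for illustration (the column names `color` and `values` and all numbers are assumptions); it groups by color, averages the values, and sorts the colors by that average:

```python
import pandas as pd

# Hypothetical data: a categorical feature (color) and a numerical one (values)
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "values": [10, 4, 20, 7, 6, 9],
})

# Group by color, average the values, then sort colors by that average
mean_by_color = (
    df.groupby("color")["values"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_by_color)
```

Any other aggregation, such as `median()` or `max()`, can be swapped in for `mean()` without changing the rest of the chain.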


## 6. Dtypes and Astype
We need to have the values stored in an appropriate data type. Otherwise, we may encounter errors. For large datasets, memory usage is greatly affected by correct data type selection. For example, the “category” data type is more appropriate than the “object” data type for categorical data, especially when the number of categories is much smaller than the number of rows.
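
To make the memory point concrete, here is a rough sketch on a made-up column (the `labels` series is hypothetical and not part of the diabetes data). It stores the same three labels once as object and once as category, then compares the memory each version uses:

```python
import pandas as pd

# Hypothetical column: 3 distinct labels repeated over many rows
labels = pd.Series(["low", "medium", "high"] * 10_000)

as_object = labels.astype("object")
as_category = labels.astype("category")

# deep=True also counts the memory of the Python strings themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The exact numbers depend on the pandas version and platform, but the category version should be considerably smaller because each row stores only a small integer code.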

**Dtypes shows the data type of each column**

In [27]:
diabetes.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

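In the output above, every column already has a numeric type. If a numeric column such as “BloodPressure” were instead stored as object, that would not be appropriate: it should be int or float. The object data type can be used to store strings or categorical data.
We can easily change the data type with the astype function.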

In [28]:
diabetes.BloodPressure = diabetes.BloodPressure.astype('int64')
diabetes.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

## 7. Shape and Size

Shape can be used on numpy arrays, pandas series and dataframes. It shows the number of dimensions as well as the size in each dimension.
Since dataframes are two-dimensional, shape returns the number of rows and columns. It is a measure of how much data we have and a key input to the data analysis process.
Furthermore, the ratio of rows to columns is very important when designing and implementing a machine learning model. If we do not have enough observations (rows) with respect to features (columns), we may need to apply some pre-processing techniques such as dimensionality reduction or feature extraction.
Size returns the total number of elements, which for a dataframe is the number of rows multiplied by the number of columns.


In [34]:
diabetes.shape

(768, 9)

In [35]:
diabetes.size

6912

# **_8 Python Iteration Skills That Data Scientists Shouldn’t Miss Out_**

In theory, the basic form of the for loop can address all iteration-related needs, but in many cases our code becomes more concise if we take advantage of existing functionality that Python has to offer. In this article, I’d like to review 8 useful techniques that we should consider in our data science projects.
To illustrate the usefulness of these techniques, I’ll contrast them with code that only uses the most basic form. From these comparisons, you can see a noticeable improvement in code readability.

# **_1. Track Iteration With enumerate()_**  
Suppose that we need to keep a counter during the iteration. In other words, we want to know which pass of the loop we are currently on. In this case, we should consider the enumerate() function.

In [1]:
# An iterable to start with
numbers = ['one', 'two', 'three']

# The basic way
for i in range(len(numbers)):
    print(f"# {i + 1}: {numbers[i]}")
    
# Use enumerate()
for i, number in enumerate(numbers, 1):
    print(f"# {i}: {number}")

# 1: one
# 2: two
# 3: three
# 1: one
# 2: two
# 3: three


-  To get the index of each item of the sequence, the basic way involves creating a range object, because the typical way (i.e., for item in iterable) doesn’t carry index-related information. Although we can find the index using the index() method of a list, it returns the index of the first matching element by default. Thus, when there are duplicate items, it will give unintended information.
-  The enumerate() function creates an enumerate object, which is an iterator. It can take an optional argument start, which specifies the start of the counter. By default, it starts counting from 0. In our case, we start counting from 1 at the first element. As you can see, the enumerate() function directly gives us the counter and the element.
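
The duplicate-item pitfall of index() mentioned above can be seen in a short sketch. The list below is made up for illustration and contains the same item twice:

```python
# A list with a duplicate item
numbers = ['one', 'two', 'one']

# index() always reports the first match, so the third item gets position 0
positions_with_index = [numbers.index(number) for number in numbers]

# enumerate() yields the actual position of each item
positions_with_enumerate = [i for i, number in enumerate(numbers)]

print(positions_with_index)      # [0, 1, 0]
print(positions_with_enumerate)  # [0, 1, 2]
```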

# **_2. Pair Iterables With zip()_**  
When we have a few iterables to begin with and need to retrieve the items at the same positions from each of them, we should consider the zip() function, as shown in this example.  
- To get the elements at the same index in the basic way, we create the index by using the range() function, as we did in the previous section. It’s a little tedious to use indexing to retrieve the element from each iterable.
- The zip() function can join multiple iterables, and in each loop it produces a tuple object that comprises the elements from each iterable at the same index. We can unpack the tuple object to retrieve the elements very conveniently. The code looks much cleaner, doesn’t it?
- Another thing to note is that the zip() function stops at the end of the shortest iterable. If you want the zipping to match the longest iterable, you should use the zip_longest() function in the itertools module.

In [3]:
# Two iterables
students = ["John", "David", "Ashley"]
scores = [95, 93, 94]

# The basic way
for i in range(len(students)):
    student = students[i]
    score = scores[i]
    print(f"Student {student}: {score}")

# Use zip()
for student, score in zip(students, scores):
    print(f"Student {student}: {score}")

Student John: 95
Student David: 93
Student Ashley: 94
Student John: 95
Student David: 93
Student Ashley: 94
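
The difference between zip() and zip_longest() only shows when the iterables have different lengths. The sketch below assumes, purely for illustration, that one student has no score recorded yet; the placeholder string "n/a" is an arbitrary choice:

```python
from itertools import zip_longest

# Hypothetical data: one student has no score recorded yet
students = ["John", "David", "Ashley"]
scores = [95, 93]

# zip() stops at the shortest iterable, so Ashley is dropped
pairs_short = list(zip(students, scores))

# zip_longest() pads the shorter iterable with fillvalue (None by default)
pairs_long = list(zip_longest(students, scores, fillvalue="n/a"))

print(pairs_short)  # [('John', 95), ('David', 93)]
print(pairs_long)   # [('John', 95), ('David', 93), ('Ashley', 'n/a')]
```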


# 3. Reverse Iteration With reversed()  
When you need to iterate over a sequence of elements in reverse order, it’s best to use the reversed() function. Suppose that students arrive at the classroom at slightly different times and you want to check their assignments in reverse order, so the first student that arrived gets checked last.  
-  If you stick with the range() function, you’ll use negative indexing of the sequence. In other words, we use -1 to refer to the last item of the list and so on.
-  Alternatively, we can reverse the list using [::-1] and then iterate over the newly created list object.
-  The best way is simply to use the reversed() function. It is a very flexible function, because it can take other sequence data, such as tuples and strings.

In [4]:
# The students arrival records
students_arrived = ["John", "David", "Ashley"]

# The typical ways
for i in range(1, len(students_arrived)+1):
    print(students_arrived[-i])

for student in students_arrived[::-1]:
    print(student)

# Use reversed()
for student in reversed(students_arrived):
    print(student)

Ashley
David
John
Ashley
David
John
Ashley
David
John
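
As a small sketch of that flexibility, reversed() can be applied to a string, a tuple, and a range object as well. The example also shows that it returns an iterator and leaves the original list untouched, whereas [::-1] builds a reversed copy:

```python
# reversed() accepts any sequence, not only lists
for char in reversed("abc"):
    print(char)

reversed_tuple = list(reversed((1, 2, 3)))
reversed_range = list(reversed(range(3)))
print(reversed_tuple)   # [3, 2, 1]
print(reversed_range)   # [2, 1, 0]

# It returns an iterator and does not modify the original list
students_arrived = ["John", "David", "Ashley"]
iterator = reversed(students_arrived)
first_checked = next(iterator)
print(first_checked)      # Ashley
print(students_arrived)   # ['John', 'David', 'Ashley']
```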


# 4. Filter Elements With filter()
You don’t always need to use all the items in the iterable. In these cases, we can usually check if items satisfy particular criteria before we apply the needed operations. Such condition evaluation and creation of the needed iterator can be easily integrated into one function call — filter(). Let’s see how it works in comparison to the typical way.  
- The typical way involves evaluating each element.
- The filter() function evaluates the elements lazily and yields only those that pass the test. In other words, the function returns an iterator that can be used directly in the for loop.
- Depending on your needs, you can consider other filter functions, such as filterfalse() in the itertools module, which does the opposite operation (i.e., keeps the elements that evaluate to False).

In [5]:
# A list of numbers to process
numbers = [1, 3, 4, 8, 9]

# The typical way
for number in numbers:
    if number % 2:
        print(f"Do operations with odd number: {number}")
        
# Use filter()
for number in filter(lambda x: x % 2, numbers):
    print(f"Do operations with odd number: {number}")

Do operations with odd number: 1
Do operations with odd number: 3
Do operations with odd number: 9
Do operations with odd number: 1
Do operations with odd number: 3
Do operations with odd number: 9
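
To compare the two directly, the sketch below applies the same predicate with filter() and with filterfalse(). The helper name `is_odd` is made up for this example:

```python
from itertools import filterfalse

numbers = [1, 3, 4, 8, 9]

def is_odd(x):
    """Hypothetical helper: True for odd integers."""
    return x % 2 == 1

# filter() keeps the items for which the predicate is true
odd_numbers = list(filter(is_odd, numbers))

# filterfalse() keeps the items for which the predicate is false
even_numbers = list(filterfalse(is_odd, numbers))

print(odd_numbers)   # [1, 3, 9]
print(even_numbers)  # [4, 8]
```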


# 5. Chain Iterables With chain()
- In a previous section, we talked about how to work with multiple iterables using the zip() function, which you can think of as joining iterables side by side. If you want to join iterables head to tail, you should use the chain() function in the itertools module. Specifically, when you have multiple iterables and want to iterate over each of them sequentially, chain() is a good fit.
- The typical way involves concatenating the iterables manually, such as using an intermediate list. If you work with other iterables, such as dictionaries and sets, you need to know how to concatenate them.
- The chain() function can chain any number of iterables and make another iterator that produces elements sequentially from each of the iterables. You don’t need to manage another temporary object that holds these elements.

In [6]:
from itertools import chain

# A few iterables to begin with
odd_numbers = [1, 3]
even_numbers = [2, 4]

# The typical way
numbers = odd_numbers + even_numbers
for number in numbers:
    print(f"Operate with number: {number}")
    
# Use chain()
for number in chain(odd_numbers, even_numbers):
    print(f"Operate with number: {number}")

Operate with number: 1
Operate with number: 3
Operate with number: 2
Operate with number: 4
Operate with number: 1
Operate with number: 3
Operate with number: 2
Operate with number: 4
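
Because chain() only requires iterables, it can also mix different container types without converting them first. The following sketch uses made-up names; note that iterating a dictionary yields its keys. It also shows chain.from_iterable(), which takes a single iterable of iterables:

```python
from itertools import chain

# Hypothetical iterables of different types
names_list = ["John", "David"]
names_tuple = ("Ashley",)
names_dict = {"Danny": 95}   # iterating a dict yields its keys

all_names = list(chain(names_list, names_tuple, names_dict))
print(all_names)  # ['John', 'David', 'Ashley', 'Danny']

# chain.from_iterable() takes one iterable of iterables instead
nested = [[1, 3], [2, 4]]
flat = list(chain.from_iterable(nested))
print(flat)  # [1, 3, 2, 4]
```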


# 6. Iterate Dictionaries
Dictionaries are a very common data type that stores data in the form of key-value pairs. Because of the implementation using hashes, it’s very fast to look up and retrieve items from dictionaries, and thus they’re the favorite data structure for many developers. The storage of key-value pairs gives us different options to iterate dictionaries.  
- To iterate over the keys, we’ll just use the keys() method on the dictionary object. Alternatively, we can just use the dictionary object itself as the iterable, which is syntactic sugar for iterating over the view object created by the keys() method.
- To iterate over the values, we’ll just use the values() method.
- To iterate over the items in the form of key-value pairs, we’ll use the items() method.
- Notably, the objects created by these methods are dictionary view objects, which are pretty much like SQL views. In other words, these view objects get updated when the dict object is updated, and a trivial example is shown below.


In [10]:
# The dictionary object
grades = {"John": 99, "Danny": 95, "Ashley": 98}
# Current keys
names = grades.keys()
print(f"Before updating: {names}")
#Before updating: dict_keys(['John', 'Danny', 'Ashley'])
# Add a new item and check the same view object
grades['Jennifer'] = 97
print(f"After updating: {names}")
#After updating: dict_keys(['John', 'Danny', 'Ashley', 'Jennifer'])

Before updating: dict_keys(['John', 'Danny', 'Ashley'])
After updating: dict_keys(['John', 'Danny', 'Ashley', 'Jennifer'])
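
The three iteration options listed above can be sketched with the same kind of dictionary:

```python
grades = {"John": 99, "Danny": 95, "Ashley": 98}

# Keys: iterating the dict itself is equivalent to iterating grades.keys()
for name in grades:
    print(name)

# Values
for grade in grades.values():
    print(grade)

# Key-value pairs, unpacked into two loop variables
for name, grade in grades.items():
    print(f"{name}: {grade}")
```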


# 7. Consider Comprehensions As Alternatives
If the purpose of the iteration is to create a new list, dictionary, or set object from the iterable, we should consider the comprehension technique, which is more performant and more concise.  
- The list comprehension has the following format: [expr for item in iterable], which is the preferred way to create a list object compared to the for loop.
- The dictionary comprehension has the following format: {key_expr: value_expr for item in iterable}. Similarly, it’s the preferred way to create a dict object from an iterable.
- The set comprehension has the following format: {expr for item in iterable}, which is the preferred way to create a set object from an iterable compared to the for loop.

In [12]:
# A list of numbers
primes = [2, 3, 5]

# List Comprehension
# Instead of the following
squares_list0 = list()
for prime in primes:
    squares_list0.append(prime * prime)
# Do this
squares_list1 = [x * x for x in primes]

# Dictionary Comprehension
# Instead of the following
squares_dict0 = dict()
for prime in primes:
    squares_dict0[prime] = prime*prime
# Do this
squares_dict1 = {x: x*x for x in primes}

# Set Comprehension
# Instead of the following
squares_set0 = set()
for prime in primes:
    squares_set0.add(prime * prime)
# Do this
squares_set1 = {x*x for x in primes}
print(squares_set1)

{9, 4, 25}


# 8. Consider the else Clause
Last but not least is the else clause of the for loop. It should be noted that it’s not the most intuitive technique to use, as many people don’t even know that the for loop can have an else clause. The following case shows you a trivial example.  
Contrary to what some people mistakenly think, the code in the else clause runs after the for loop in regular situations. However, if execution encounters a break statement, the code in the else clause is skipped. As shown in the first function call, the else clause didn’t execute.


In [17]:
def place_group_order(ordered_items):
    menu_items = ['beef', 'pork', 'sausage', 'chicken']
    for name, item in ordered_items.items():
        if item not in menu_items:
            print(f"Your group order can't be served, because {name}'s {item} isn't available.")
            break
    else:
        # Belongs to the for loop: runs only if the loop ended without break
        print("Your group order can be served.")

print("Group 0")
group0_items = {"John": "beef", "Jack": "tuna", "Jacob": "chicken"}
place_group_order(group0_items)
print("\nGroup 1")
group1_items = {"Aaron": "beef", "Ashley": "pork", "Anna": "sausage"}
place_group_order(group1_items)

Group 0
Your group order can't be served, because Jack's tuna isn't available.

Group 1
Your group order can be served.
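
A common use of the for...else pattern is a search that needs a fallback when nothing is found. The sketch below is only an illustration; the function name `first_unavailable` and its arguments are made up. Note that the else keyword is aligned with for, not with if:

```python
def first_unavailable(ordered_items, menu_items):
    """Return the first ordered item that is not on the menu, or None."""
    for item in ordered_items:
        if item not in menu_items:
            result = item
            break          # leaving via break skips the else clause
    else:
        result = None      # runs only when the loop finished without break
    return result

menu = ['beef', 'pork', 'sausage', 'chicken']
print(first_unavailable(['beef', 'tuna', 'chicken'], menu))  # tuna
print(first_unavailable(['beef', 'pork'], menu))             # None
print(first_unavailable([], menu))                           # None
```

The last call shows that the else clause also runs when the iterable is empty, because the loop ends without hitting a break.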
