### pandas unique values multiple columns
https://stackoverflow.com/questions/26977076/pandas-unique-values-multiple-columns

In [32]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})

df

|   | Col1 | Col2 | Col3 |
|---|------|------|------|
| 0 | Bob | Joe | 0.524077 |
| 1 | Joe | Steve | 0.907524 |
| 2 | Bill | Bob | 0.404884 |
| 3 | Mary | Bob | 0.046687 |
| 4 | Joe | Steve | 0.203101 |


In [33]:
for col in df:
    print(df[col].unique())

['Bob' 'Joe' 'Bill' 'Mary']
['Joe' 'Steve' 'Bob']
[0.52407681 0.90752445 0.40488407 0.04668726 0.2031009 ]


In [34]:
pd.Series({col:df[col].unique() for col in df})

Col1                               [Bob, Joe, Bill, Mary]
Col2                                    [Joe, Steve, Bob]
Col3    [0.5240768126243682, 0.9075244480908424, 0.404...
dtype: object

pd.unique returns the unique values of a one-dimensional input (an array, a DataFrame column, or an Index), in order of first appearance.

The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then view the values in a flattened NumPy array. The whole operation looks like this:

In [6]:
pd.unique(df[['Col1', 'Col2']].values.ravel('K'))

array(['Bob', 'Joe', 'Bill', 'Mary', 'Steve'], dtype=object)

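Note that ravel() is an array method that returns a flattened, one-dimensional view of a multidimensional array when possible, and a copy otherwise.

The argument 'K' tells the method to flatten the array in the order the elements are stored in memory (pandas typically stores underlying arrays in Fortran-contiguous order; columns before rows). This can be significantly faster than using the method's default 'C' order. Because pd.unique keeps values in order of first appearance, it is also why the result above lists every new value from Col1 before any new value from Col2.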
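To make the effect of the flattening order concrete, here is a minimal sketch. It assumes a column-major layout, which it forces with np.asfortranarray so that the result does not depend on how a particular pandas version stores the selected columns. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

# Assumption: force a Fortran-contiguous (column-major) array so the demo
# does not rely on pandas' internal storage layout.
arr = np.asfortranarray(df[['Col1', 'Col2']].to_numpy())

row_major = arr.ravel('C')     # row by row: Bob, Joe, Joe, Steve, ...
memory_order = arr.ravel('K')  # as stored: all of Col1, then all of Col2

# pd.unique keeps the order of first appearance, so the two flattening
# orders give the same set of values in a different sequence.
uniques_c = pd.unique(row_major)
uniques_k = pd.unique(memory_order)

print(uniques_c)
print(uniques_k)
```

Both calls find the same five names. Only the sequence differs, so the choice of order matters if downstream code depends on the position of each value.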

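An alternative way is to select the columns and pass them to np.unique, which flattens the input itself and returns the unique values in sorted order rather than in order of appearance: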

In [7]:
np.unique(df[['Col1', 'Col2']].values)

array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)

In [9]:
df1 = pd.concat([df]*100000, ignore_index=True) # DataFrame with 500000 rows
%timeit np.unique(df1[['Col1', 'Col2']].values)

893 ms ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
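The timing above covers only np.unique. A rough way to compare it with the pd.unique approach outside a notebook is sketched below using the standard timeit module. The value of n_copies is an assumption chosen to keep the run short (the cell above uses 100000), and no timings are quoted here because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

# Assumption: a small multiplier so the sketch runs quickly; raise it to
# 100000 to match the 500000-row frame used above.
n_copies = 1_000
df1 = pd.concat([df] * n_copies, ignore_index=True)


def with_pd_unique():
    # Flatten in memory order, then deduplicate in order of appearance.
    return pd.unique(df1[['Col1', 'Col2']].to_numpy().ravel('K'))


def with_np_unique():
    # np.unique flattens the 2-D input itself and sorts the result.
    return np.unique(df1[['Col1', 'Col2']].to_numpy())


runs = 5
t_pd = timeit.timeit(with_pd_unique, number=runs)
t_np = timeit.timeit(with_np_unique, number=runs)

print(f"pd.unique: {t_pd / runs:.4f} s per call")
print(f"np.unique: {t_np / runs:.4f} s per call")
```

Both functions return the same set of names; np.unique returns them sorted, while pd.unique does not sort.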


In [29]:
df = df.drop_duplicates(subset=['Col1','Col2','Col3'])

KeyError: Index(['Col1', 'Col2', 'Col3'], dtype='object')
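The execution counters suggest that this cell (In [29]) ran after df had been reassigned to the FirstName/LastName/Age frame used further down, which would explain the KeyError: drop_duplicates raises it when a name in subset is not a column of the frame. That reading is an assumption based on the cell numbering, not something the output states. The sketch below reproduces both situations with two separately named frames (df_cols and df_names are hypothetical names introduced here).

```python
import pandas as pd

# Frame that actually has the columns named in `subset`.
df_cols = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                        'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

# Keeps the first occurrence of each (Col1, Col2) pair.
unique_rows = df_cols.drop_duplicates(subset=['Col1', 'Col2'])
print(unique_rows)

# Frame without those columns, standing in for the reassigned `df`.
df_names = pd.DataFrame({'FirstName': ['Arun', 'Navneet'],
                         'LastName': ['Singh', 'Yadav']})

try:
    df_names.drop_duplicates(subset=['Col1', 'Col2'])
    raised = False
except KeyError as exc:
    raised = True
    print(f"KeyError: {exc}")
```

Note that drop_duplicates answers a different question from pd.unique: it returns unique rows (combinations of values), not the unique values pooled across columns.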

#### Method 1: Using pandas unique() and concat()

A pandas Series (a single DataFrame column) has a unique() method that returns the distinct values of that column in order of first appearance. The first output below shows the unique FirstName values. To extend this to several columns, use pd.concat() to stack the desired columns into one Series, then call unique() on the result.

In [11]:
import pandas as pd
import numpy as np
 
# Creating a custom dataframe.
df = pd.DataFrame({'FirstName': ['Arun', 'Navneet', 'Shilpa',
                                 'Prateek', 'Pyare', 'Prateek'],
                    
                   'LastName': ['Singh', 'Yadav', 'Yadav', 'Shukla',
                                'Lal', 'Mishra'],
                    
                   'Age': [26, 25, 25, 27, 28, 30]})

df

|   | FirstName | LastName | Age |
|---|-----------|----------|-----|
| 0 | Arun | Singh | 26 |
| 1 | Navneet | Yadav | 25 |
| 2 | Shilpa | Yadav | 25 |
| 3 | Prateek | Shukla | 27 |
| 4 | Pyare | Lal | 28 |
| 5 | Prateek | Mishra | 30 |


In [27]:
# To get unique values in 1 series/column
print("Unique FN:\n" f"{df['FirstName'].unique()}")
print()
# Extending the idea from 1 column to multiple columns
print("Unique Values from 3 Columns:\n"
f"{pd.concat([df['FirstName'],df['LastName'],df['Age']]).unique()}")

Unique FN:
['Arun' 'Navneet' 'Shilpa' 'Prateek' 'Pyare']

Unique Values from 3 Columns:
['Arun' 'Navneet' 'Shilpa' 'Prateek' 'Pyare' 'Singh' 'Yadav' 'Shukla'
 'Lal' 'Mishra' 26 25 27 28 30]
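The same idea can be wrapped in a small helper so the column list is not hard-coded. This is a minimal sketch; unique_across is a hypothetical name introduced here, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({'FirstName': ['Arun', 'Navneet', 'Shilpa',
                                 'Prateek', 'Pyare', 'Prateek'],
                   'LastName': ['Singh', 'Yadav', 'Yadav', 'Shukla',
                                'Lal', 'Mishra'],
                   'Age': [26, 25, 25, 27, 28, 30]})


def unique_across(frame, columns):
    """Return the unique values pooled over `columns`, in order of appearance."""
    # Stack the selected columns end to end into a single Series.
    stacked = pd.concat([frame[col] for col in columns], ignore_index=True)
    return stacked.unique()


uniques = unique_across(df, ['FirstName', 'LastName', 'Age'])
print(uniques)
```

Because the columns are stacked one after another, the values from FirstName come first, then any new values from LastName, then those from Age, matching the output above.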


#### Method 2: Using numpy.unique()

np.unique() returns the sorted unique values of the array passed to it, so the selected columns can be handed over as a single array.

Note: This approach has one limitation. Because np.unique() sorts the values, it cannot combine str and numeric columns: Python cannot order a string against a number. If columns of different data types need to be combined, use Method 1 instead.
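A minimal sketch of Method 2 on the same frame is shown below. The first call pools two string columns. The second mixes a string column with Age to illustrate the limitation; on the NumPy versions I am familiar with this raises a TypeError during sorting, but treat the exact behaviour as an assumption and check it against your installed version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'FirstName': ['Arun', 'Navneet', 'Shilpa',
                                 'Prateek', 'Pyare', 'Prateek'],
                   'LastName': ['Singh', 'Yadav', 'Yadav', 'Shukla',
                                'Lal', 'Mishra'],
                   'Age': [26, 25, 25, 27, 28, 30]})

# String columns only: np.unique flattens the 2-D array and sorts the result.
name_uniques = np.unique(df[['FirstName', 'LastName']].to_numpy())
print(name_uniques)

# Mixed str and int columns: sorting needs to compare a str with an int.
try:
    mixed = np.unique(df[['FirstName', 'Age']].to_numpy())
    mixed_error = None
    print(mixed)
except TypeError as exc:
    mixed_error = exc
    print(f"TypeError: {exc}")
```

Unlike Method 1, the result here is sorted alphabetically rather than kept in order of appearance.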