We present a relatively comprehensive review of sorting on different types of objects in Python. 

In [1]:
import numpy as np
import pandas as pd

#### I. Sorting on DataFrame Objects in Pandas

In pandas, the two main sorting methods for DataFrame objects are sort_index() and sort_values(). We first focus on sort_index(), which sorts a DataFrame by its labels rather than by its data. The axis argument chooses which labels are used: axis=0 (the default) sorts the rows by row index, and axis=1 sorts the columns by column name. The ascending argument (True by default) sets the sort direction, and inplace=True modifies the object itself instead of returning a sorted copy. In the example below, we first create a DataFrame object from a Python list and assign its row index and column names. We then create two new datasets, one obtained by sorting the rows by index and the other by sorting the columns by name:

In [2]:
students = [ ('Jack', 34, 'Sydney') ,
             ('Riti', 31, 'Delhi' ) ,
             ('Aadi', 16, 'New York') ,
             ('Riti', 32, 'Delhi' ) ,
             ('Riti', 33, 'Delhi' ) ,
             ('Riti', 35, 'Mumbai' )
              ]
df=pd.DataFrame(students, columns=['Name','Marks','City'], index=['b', 'a', 'f', 'e', 'd', 'c'])
df

|  | Name | Marks | City |
|---|---|---|---|
| b | Jack | 34 | Sydney |
| a | Riti | 31 | Delhi |
| f | Aadi | 16 | New York |
| e | Riti | 32 | Delhi |
| d | Riti | 33 | Delhi |
| c | Riti | 35 | Mumbai |


In [3]:
df2=df.sort_index(ascending=False, axis=0)
df2

|  | Name | Marks | City |
|---|---|---|---|
| f | Aadi | 16 | New York |
| e | Riti | 32 | Delhi |
| d | Riti | 33 | Delhi |
| c | Riti | 35 | Mumbai |
| b | Jack | 34 | Sydney |
| a | Riti | 31 | Delhi |


In [4]:
df3=df.sort_index(ascending=True, axis=1)
df3

|  | City | Marks | Name |
|---|---|---|---|
| b | Sydney | 34 | Jack |
| a | Delhi | 31 | Riti |
| f | New York | 16 | Aadi |
| e | Delhi | 32 | Riti |
| d | Delhi | 33 | Riti |
| c | Mumbai | 35 | Riti |


One can also change the original dataset using the 'inplace' argument:

In [5]:
df.sort_index(inplace=True)
df

|  | Name | Marks | City |
|---|---|---|---|
| a | Riti | 31 | Delhi |
| b | Jack | 34 | Sydney |
| c | Riti | 35 | Mumbai |
| d | Riti | 33 | Delhi |
| e | Riti | 32 | Delhi |
| f | Aadi | 16 | New York |
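
The difference between the two calling styles can be sketched as follows. The frame `demo` and its values are made up for illustration: without `inplace=True`, `sort_index()` returns a new sorted DataFrame and leaves the original alone, whereas with `inplace=True` it reorders the original and returns `None`.

```python
import pandas as pd

# Hypothetical toy frame, used only for this illustration
demo = pd.DataFrame({"Marks": [34, 31, 16]}, index=["b", "a", "c"])

# Default call: a new, sorted DataFrame is returned; demo is unchanged
sorted_copy = demo.sort_index()
print(list(demo.index))         # ['b', 'a', 'c']
print(list(sorted_copy.index))  # ['a', 'b', 'c']

# In-place call: demo itself is reordered and the call returns None
result = demo.sort_index(inplace=True)
print(result)                   # None
print(list(demo.index))         # ['a', 'b', 'c']
```

A common slip is writing `demo = demo.sort_index(inplace=True)`, which rebinds `demo` to `None`.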


The sort_values() method sorts by the values along either axis, which lets you sort by one or more variables. Its main arguments are by (a str or a list of str), axis (0 by default, which sorts the rows by column values; 1 sorts the columns by the values in a row), ascending (True or False, or a list with one entry per key in by), and na_position ('last' by default, which puts NaNs at the end; 'first' puts them at the beginning). Unless inplace=True is set, the method returns a new DataFrame object. Like sort_index(), sort_values() can work along either axis, but sorting the rows is far more common when we are dealing with a dataset. Below is an example:

In [6]:
students = [ ('Jack', 34, 'Sydney') ,
             ('Riti', 31, 'Delhi' ) ,
             ('Aadi', 16, 'New York') ,
             ('Riti', 32, 'Delhi' ) ,
             ('Riti', np.nan, 'Delhi' ) ,  # np.nan: the np.NaN alias was removed in NumPy 2.0
             (np.nan, 35, 'Mumbai' )
              ]
df=pd.DataFrame(students, columns=['Name','Marks','City'], index=['b', 'a', 'f', 'e', 'd', 'c'])
df

|  | Name | Marks | City |
|---|---|---|---|
| b | Jack | 34.0 | Sydney |
| a | Riti | 31.0 | Delhi |
| f | Aadi | 16.0 | New York |
| e | Riti | 32.0 | Delhi |
| d | Riti | NaN | Delhi |
| c | NaN | 35.0 | Mumbai |


In [7]:
df.sort_values(by=['Name','Marks','City'], ascending=[True, True, False]) # sorting by row by default

|  | Name | Marks | City |
|---|---|---|---|
| f | Aadi | 16.0 | New York |
| b | Jack | 34.0 | Sydney |
| a | Riti | 31.0 | Delhi |
| e | Riti | 32.0 | Delhi |
| d | Riti | NaN | Delhi |
| c | NaN | 35.0 | Mumbai |
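
The example above relies on the default placement of missing values. As a minimal sketch of the na_position argument, using a made-up frame called `scores` (not part of the original data), the rows with a missing sort key can be moved to the top instead:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing mark
scores = pd.DataFrame(
    {"Name": ["Jack", "Riti", "Aadi"], "Marks": [34.0, np.nan, 16.0]},
    index=["b", "a", "f"],
)

# Default (na_position='last'): the row with a missing mark goes to the end
nan_last = scores.sort_values(by="Marks")
print(list(nan_last.index))   # ['f', 'b', 'a']

# na_position='first' moves it to the top instead
nan_first = scores.sort_values(by="Marks", na_position="first")
print(list(nan_first.index))  # ['a', 'f', 'b']
```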


#### II. Sorting Lists and Dictionaries

We now turn to sorting lists and dictionaries. To start with, notice that in-place sorting does not apply to Python objects such as tuples and sets, because tuples are immutable and sets by definition have no inherent order.
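
Even so, the built-in sorted() function accepts any iterable, including tuples and sets, and returns a new list. A short sketch, with illustrative variable names that do not appear elsewhere in this review:

```python
marks_tuple = (34, 31, 16)
cities_set = {"Sydney", "Delhi", "New York"}

# sorted() returns a new list and leaves its input untouched
print(sorted(marks_tuple))   # [16, 31, 34]
print(sorted(cities_set))    # ['Delhi', 'New York', 'Sydney']

# Convert back if a tuple is needed
sorted_tuple = tuple(sorted(marks_tuple, reverse=True))
print(sorted_tuple)          # (34, 31, 16)
```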

For sorting lists, the associated method is sort(), which sorts the list in place and takes an optional argument 'reverse':

In [8]:
numberlist=[3,4,1]
numberlist.sort(reverse = True) 
numberlist

[4, 3, 1]

When the list elements are more complicated, things become more interesting. For instance, if we sort a list like the 'students' object defined above, we get the results below. Notice that this object is a list of tuples, and each tuple has 3 elements. The sort() method compares tuples element by element: the first elements are compared first, the second elements are used only to break ties among equal first elements, and so on:

In [9]:
students = [ ('Jack', 34, 'Sydney') ,
             ('Riti', 31, 'Delhi' ) ,
             ('Aadi', 16, 'New York') ,
             ('Riti', 32, 'Delhi' ) ,
              ]
students.sort()
students

[('Aadi', 16, 'New York'),
 ('Jack', 34, 'Sydney'),
 ('Riti', 31, 'Delhi'),
 ('Riti', 32, 'Delhi')]

In [10]:
students.sort(reverse=True)
students

[('Riti', 32, 'Delhi'),
 ('Riti', 31, 'Delhi'),
 ('Jack', 34, 'Sydney'),
 ('Aadi', 16, 'New York')]
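
When the default tuple ordering is not what is wanted, sort() also accepts a 'key' function that extracts the value to compare. The following is a minimal sketch, assuming we want to order the same kind of student tuples by marks rather than by name:

```python
from operator import itemgetter

students = [
    ("Jack", 34, "Sydney"),
    ("Riti", 31, "Delhi"),
    ("Aadi", 16, "New York"),
    ("Riti", 32, "Delhi"),
]

# Sort by the second field (marks), highest first
students.sort(key=itemgetter(1), reverse=True)
print([s[0] for s in students])   # ['Jack', 'Riti', 'Riti', 'Aadi']

# Sort by name ascending, then by marks descending, using a tuple key
students.sort(key=lambda s: (s[0], -s[1]))
print(students[-2:])              # [('Riti', 32, 'Delhi'), ('Riti', 31, 'Delhi')]
```

Negating the marks inside the key is one simple way to mix sort directions; it only works for numeric fields.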

We can also sort dictionaries. A dictionary is implemented as a hash table that locates elements by hashing their keys. Since Python 3.7 a dictionary preserves insertion order, but it has no sort() method, so we cannot sort it in place. Instead, we pass its contents to sorted(), which returns a new sorted list. We can sort by dictionary keys, by dictionary values, or by both together (items):

In [11]:
dict0 = {
    "hello": 56,
    "at" : 23 ,
    "test" : 43,
    "this" : 100
    }
print(dict0)

{'hello': 56, 'at': 23, 'test': 43, 'this': 100}


In [12]:
sorted(dict0.keys(), reverse=True) # dict0.keys() is an iterable, and this turns into a list

['this', 'test', 'hello', 'at']

In [13]:
sorted(dict0.values(), reverse=True) # dict0.values() is an iterable, and this turns into a list

[100, 56, 43, 23]

In [14]:
sorted(dict0.items(), reverse=True) # this turns into a list of tuples

[('this', 100), ('test', 43), ('hello', 56), ('at', 23)]
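
Sorting the items as above orders the pairs by key. If the goal is to order the pairs by value while keeping each key attached, one possible sketch is to pass a 'key' function to sorted() and rebuild a dictionary from the result:

```python
dict0 = {"hello": 56, "at": 23, "test": 43, "this": 100}

# Sort the (key, value) pairs by value rather than by key
pairs_by_value = sorted(dict0.items(), key=lambda item: item[1])
print(pairs_by_value)
# [('at', 23), ('test', 43), ('hello', 56), ('this', 100)]

# Rebuild a dict; insertion order is preserved in Python 3.7+
dict_by_value = dict(pairs_by_value)
print(list(dict_by_value))   # ['at', 'test', 'hello', 'this']
```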

#### III. Deduplicating DataFrame Objects

We now focus on removing duplicates in datasets. The method drop_duplicates(subset=None, keep='first', inplace=False) removes duplicate rows from a DataFrame object. The argument 'subset' takes a column label or a list of column labels. Its default value is None, which means all columns are used. When columns are passed, only those columns are considered when identifying duplicates. This is the same as the 'by' variable in the SAS procedure proc sort with the nodupkey option. The argument 'keep' controls which of the duplicated rows are retained. It takes one of three values, and the default is 'first'. If 'first', the first occurrence is kept and the remaining duplicates are dropped. If 'last', the last occurrence is kept and the others are dropped. If False, every row that has a duplicate is dropped. Below is an example:

In [15]:
students2 = [('Jack', 34, 'Sydney', 'Ramsey'),
             ('Riti', 31, 'Delhi', 'Ramsey'),
             ('Aadi', 16, 'New York','Ramsey'),
             ('Riti', 32, 'Delhi', 'Ramsey'),
             ('Riti', 33, 'Delhi', 'Doe'),
             ('Riti', 35, 'Mumbai', 'Black'),
             ('Riti', 35, 'Mumbai', 'Black'),
             ('Riti', 35, 'Mumbai', 'Smith'),
            ]
df=pd.DataFrame(students2, columns=['Name','Marks','City', 'Ancestry'])
df

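|  | Name | Marks | City | Ancestry |
|---|---|---|---|---|
| 0 | Jack | 34 | Sydney | Ramsey |
| 1 | Riti | 31 | Delhi | Ramsey |
| 2 | Aadi | 16 | New York | Ramsey |
| 3 | Riti | 32 | Delhi | Ramsey |
| 4 | Riti | 33 | Delhi | Doe |
| 5 | Riti | 35 | Mumbai | Black |
| 6 | Riti | 35 | Mumbai | Black |
| 7 | Riti | 35 | Mumbai | Smith |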


We first deduplicate on all columns. The surviving rows keep their original index labels, so we call reset_index() to renumber them. By default, reset_index() stores the old index in a new column named 'index':

In [16]:
D1=df.drop_duplicates().reset_index()
D1

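|  | index | Name | Marks | City | Ancestry |
|---|---|---|---|---|---|
| 0 | 0 | Jack | 34 | Sydney | Ramsey |
| 1 | 1 | Riti | 31 | Delhi | Ramsey |
| 2 | 2 | Aadi | 16 | New York | Ramsey |
| 3 | 3 | Riti | 32 | Delhi | Ramsey |
| 4 | 4 | Riti | 33 | Delhi | Doe |
| 5 | 5 | Riti | 35 | Mumbai | Black |
| 6 | 7 | Riti | 35 | Mumbai | Smith |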


Now let's deduplicate on a single column, 'Ancestry':

In [17]:
D2=df.drop_duplicates(subset=['Ancestry'])
D2

|  | Name | Marks | City | Ancestry |
|---|---|---|---|---|
| 0 | Jack | 34 | Sydney | Ramsey |
| 4 | Riti | 33 | Delhi | Doe |
| 5 | Riti | 35 | Mumbai | Black |
| 7 | Riti | 35 | Mumbai | Smith |


We can pass drop=True to reset_index() to avoid the old index being added as a column:

In [18]:
D3=df.drop_duplicates(subset=['Name','Ancestry'], keep='last').reset_index(drop=True)
D3

|  | Name | Marks | City | Ancestry |
|---|---|---|---|---|
| 0 | Jack | 34 | Sydney | Ramsey |
| 1 | Aadi | 16 | New York | Ramsey |
| 2 | Riti | 32 | Delhi | Ramsey |
| 3 | Riti | 33 | Delhi | Doe |
| 4 | Riti | 35 | Mumbai | Black |
| 5 | Riti | 35 | Mumbai | Smith |
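
The examples so far use keep='first' (the default) and keep='last'. To illustrate the third option, here is a small sketch on a made-up frame called `toy`, where keep=False removes every row that has a duplicate:

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are exact duplicates
toy = pd.DataFrame(
    {"Name": ["Riti", "Riti", "Jack"], "City": ["Mumbai", "Mumbai", "Sydney"]}
)

# keep='first' retains one copy of the duplicated row
print(list(toy.drop_duplicates(keep="first").index))  # [0, 2]

# keep=False drops every row that has a duplicate
print(list(toy.drop_duplicates(keep=False).index))    # [2]

# duplicated() returns the boolean mask behind these results
print(toy.duplicated(keep=False).tolist())            # [True, True, False]
```

Inspecting the duplicated() mask before dropping anything is a cheap way to check which rows would be affected.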


There are also times when we need to get rid of duplicate columns. This usually happens when we merge two tables with similar columns and are not careful about removing the redundant ones. Here is an example. Let's append a redundant column to the existing dataset. Below, we see that there are two columns sharing the same name:

In [19]:
newdata=['Jack','Riti','Aadi','Riti','Riti','Riti','Riti']
D1['Name2']=pd.Series(newdata)
D1.rename({'Name2':'Name'}, axis=1, inplace=True)
D1.drop(['index'], axis=1, inplace=True) # getting rid of redundant column 'index'
D1

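|  | Name | Marks | City | Ancestry | Name |
|---|---|---|---|---|---|
| 0 | Jack | 34 | Sydney | Ramsey | Jack |
| 1 | Riti | 31 | Delhi | Ramsey | Riti |
| 2 | Aadi | 16 | New York | Ramsey | Aadi |
| 3 | Riti | 32 | Delhi | Ramsey | Riti |
| 4 | Riti | 33 | Delhi | Doe | Riti |
| 5 | Riti | 35 | Mumbai | Black | Riti |
| 6 | Riti | 35 | Mumbai | Smith | Riti |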


To get rid of the duplicated column, we transpose the DataFrame, drop the duplicate rows, and transpose back. Note that this compares the contents of the columns rather than their names:

In [20]:
new_D=D1.T.drop_duplicates().T
new_D

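|  | Name | Marks | City | Ancestry |
|---|---|---|---|---|
| 0 | Jack | 34 | Sydney | Ramsey |
| 1 | Riti | 31 | Delhi | Ramsey |
| 2 | Aadi | 16 | New York | Ramsey |
| 3 | Riti | 32 | Delhi | Ramsey |
| 4 | Riti | 33 | Delhi | Doe |
| 5 | Riti | 35 | Mumbai | Black |
| 6 | Riti | 35 | Mumbai | Smith |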
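
One side effect of the double transpose is that a frame with mixed column types comes back with object dtype in every column. As an alternative sketch (not the approach used above), duplicate columns can be dropped by label with columns.duplicated(), which leaves the dtypes alone. The frame `dup` below is made up for illustration:

```python
import pandas as pd

# Hypothetical frame with two columns that share the label 'Name'
dup = pd.DataFrame(
    [["Jack", 34, "Jack"], ["Riti", 31, "Riti"]],
    columns=["Name", "Marks", "Name"],
)

# columns.duplicated() flags the second and later occurrences of a label
mask = dup.columns.duplicated()
print(mask.tolist())              # [False, False, True]

# Keep only the first occurrence of each label
deduped = dup.loc[:, ~mask]
print(list(deduped.columns))      # ['Name', 'Marks']
print(deduped.dtypes["Marks"])    # int64
```

This variant matches on column names only, so it would also drop a same-named column whose contents differ; the transpose approach makes the opposite trade-off.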


References:
   - https://thispointer.com/pandas-sort-a-dataframe-based-on-column-names-or-row-index-labels-using-dataframe-sort_index/
   - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
   