# Handy Python Pandas Techniques for Removing Duplicates, Reformatting Data, Renaming, and Reordering Columns

__Data Cleaning & Data Preparation Series: <code>drop_duplicates(), to_datetime(), strftime(), apply(), rename()</code>__

__1. Removing duplicates__

Duplicate rows can skew counts, totals, and averages, so they are usually removed before analysis. In pandas, the <code>drop_duplicates()</code> method does this. By default it compares all columns, keeps the first occurrence of each repeated row, and returns a new dataframe without changing the original. The <code>subset</code> argument restricts the comparison to specific columns.

In [11]:
import pandas as pd

# create a sample dataframe
df = pd.DataFrame({'name':['John', 'Alice', 'Mary', 'John'],
                   'age':[25, 30, 27, 25],
                   'gender':['M', 'F', 'F', 'M']})

# remove duplicates based on all columns
df1 = df.drop_duplicates()

In [12]:
print(df1)

    name  age gender
0   John   25      M
1  Alice   30      F
2   Mary   27      F


In [13]:
# remove duplicates based on a specific column
# (rows that share a name count as duplicates, even if other columns differ)
df2 = df.drop_duplicates(subset=['name'])

In [14]:
print(df2)

    name  age gender
0   John   25      M
1  Alice   30      F
2   Mary   27      F
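Before dropping anything, it can help to see which rows pandas treats as duplicates and to choose which copy survives. The following is a minimal sketch on the same sample data; the variable names (`dup_mask`, `last_kept`, `none_kept`, `clean`) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Alice', 'Mary', 'John'],
                   'age': [25, 30, 27, 25],
                   'gender': ['M', 'F', 'F', 'M']})

# flag rows that repeat an earlier row (the first occurrence is not flagged)
dup_mask = df.duplicated()
print(dup_mask)

# keep the last occurrence of each duplicated row instead of the first
last_kept = df.drop_duplicates(keep='last')
print(last_kept)

# drop every row that has a duplicate, keeping none of the copies
none_kept = df.drop_duplicates(keep=False)
print(none_kept)

# drop duplicates, then renumber the index so it has no gaps
clean = df.drop_duplicates().reset_index(drop=True)
print(clean)
```

With `keep='last'` the surviving John row is the one at index 3, and with `keep=False` both John rows are removed.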


__2. Reformatting data__

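The same value can be stored in many formats, and it often needs to be reformatted to meet a requirement or to be easier to read. The example below parses date strings into datetime values with <code>pd.to_datetime()</code>, writes them back in a different layout with <code>dt.strftime()</code>, and formats a numeric column as currency with <code>apply()</code>.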

In [26]:
# create a sample dataframe
df = pd.DataFrame({'date':['01-01-2021', '02-01-2021', '03-01-2021'],
                   'sales':[500, 700, 900]})

print("Raw data\n")
print(df)

# convert date to datetime format
# (the explicit format reads the strings as month-day-year)
print("\nConverting date to datetime format\n")
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%Y')
print(df)

# change the format of the date
print("\nChanging format of the date\n")
df['date'] = df['date'].dt.strftime('%d/%m/%Y')
print(df)

# format the sales column with a currency symbol
# (the column holds strings afterwards, not numbers)
print("\nformatting and adding currency symbol\n")

df['sales'] = df['sales'].apply(lambda x: '${:,.2f}'.format(x))
print(df)

Raw data

         date  sales
0  01-01-2021    500
1  02-01-2021    700
2  03-01-2021    900

Converting date to datetime format

        date  sales
0 2021-01-01    500
1 2021-02-01    700
2 2021-03-01    900

Changing format of the date

         date  sales
0  01/01/2021    500
1  01/02/2021    700
2  01/03/2021    900

formatting and adding currency symbol

         date    sales
0  01/01/2021  $500.00
1  01/02/2021  $700.00
2  01/03/2021  $900.00
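Two details in the output above are easy to miss. A string such as `02-01-2021` can mean 1 February or 2 January, and the output shows it was read month first. Also, once values are formatted as strings they can no longer be summed or compared as numbers. The sketch below illustrates both points; the names `raw`, `sales`, and `sales_label` are made up for this example, and keeping a separate display column is just one possible approach.

```python
import pandas as pd

raw = pd.Series(['01-01-2021', '02-01-2021', '03-01-2021'])

# read the strings as month-day-year: January 1, February 1, March 1
month_first = pd.to_datetime(raw, format='%m-%d-%Y')

# read the same strings as day-month-year: January 1, 2 and 3
day_first = pd.to_datetime(raw, format='%d-%m-%Y')

print(month_first.dt.strftime('%d %B %Y').tolist())
print(day_first.dt.strftime('%d %B %Y').tolist())

# keep the numeric column and put the formatted text in a separate column
sales = pd.DataFrame({'sales': [500, 700, 900]})
sales['sales_label'] = sales['sales'].map('${:,.2f}'.format)

# arithmetic still works on the numeric column
total = sales['sales'].sum()
print(sales)
print(total)
```

Passing `format` explicitly documents which reading is intended instead of leaving it to pandas to infer.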


__3. Renaming and Reordering columns__

Column names are not always descriptive, and the column order may not suit a report or a downstream tool. In pandas, <code>rename()</code> takes a dictionary that maps old column names to new ones, and selecting the columns with a list of names returns them in that order.

In [27]:
# create a sample dataframe
df = pd.DataFrame({'name':['John', 'Alice', 'Mary'],
                   'age':[25, 30, 27],
                   'gender':['M', 'F', 'F']})
print("\nRaw data\n")
print(df)

# rename columns
df = df.rename(columns={'name':'First Name', 'age':'Age', 'gender':'Gender'})
print("\nrenaming column\n")
print(df)

# reorder columns
df = df[['Gender', 'Age', 'First Name']]
print("\nreordering columns\n")
print(df)


Raw data

    name  age gender
0   John   25      M
1  Alice   30      F
2   Mary   27      F

renaming column

  First Name  Age Gender
0       John   25      M
1      Alice   30      F
2       Mary   27      F

reordering columns

  Gender  Age First Name
0      M   25       John
1      F   30      Alice
2      F   27       Mary
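When there are many columns, typing out every name becomes tedious. As a rough sketch of some alternatives, `rename()` also accepts a function, and a list comprehension can move one column to the front while leaving the rest in place. The names `titled`, `front`, `reordered`, and `explicit` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Alice', 'Mary'],
                   'age': [25, 30, 27],
                   'gender': ['M', 'F', 'F']})

# rename every column with a function instead of a dictionary
titled = df.rename(columns=str.title)
print(titled.columns.tolist())

# move one column to the front and keep the others in their current order
front = ['gender']
reordered = df[front + [c for c in df.columns if c not in front]]
print(reordered.columns.tolist())

# reindex() is another way to set the column order explicitly
explicit = df.reindex(columns=['gender', 'age', 'name'])
print(explicit.columns.tolist())
```

Each of these calls returns a new dataframe, so `df` itself keeps its original names and order unless the result is assigned back to it.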


__In summary__, removing duplicates, reformatting data, and renaming and reordering columns are essential steps in data processing and can be easily accomplished in Python using the pandas library.