In [65]:
import pandas as pd
import numpy as np

In our example, we assume that we are a software company and that the demo data represents a list of users of our software. We store their data to reach them via email for admin and marketing purposes, to address them by name, and to create a customized marketing and promotion experience based on age and gender information.

In [66]:
df_users = pd.read_csv("documentpath.csv", index_col=0)

First of all, we get rid of all fully duplicated rows, as there is no need for duplicate records.

In [67]:
df_users = df_users.drop_duplicates()

Secondly, we want to drop all rows with only NaN values.

In [68]:
df_users = df_users.dropna(how="all")
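
As a small illustration of this step, the sketch below uses a made-up three-row frame (the column names and values are assumptions, not the demo data) to show that only rows in which every column is missing are removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the actual demo file.
toy = pd.DataFrame(
    {
        "email": ["a@example.com", np.nan, np.nan],
        "age": [30, 41, np.nan],
    }
)

# how="all" drops a row only if every column in it is NaN.
toy_clean = toy.dropna(how="all")
print(toy_clean)
```

The second row keeps its age even though the email is missing, so it survives this step.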

We want to clean the age column. To do so, we first identify the values that cannot be parsed as ints. Since we only need rough age information, we turn phrases like "old", as well as negative values whose absolute value is in range, into valid values.

In [69]:
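# Collect the positions of all age entries that cannot be parsed as an int.
non_int_positions = []
for position, cell in enumerate(df_users["age"]):
    try:
        int(cell)
    except ValueError:
        non_int_positions.append(position)

# Among those entries, set the ones equal to "old" to 70.
# A single .loc call avoids chained assignment, which may silently fail.
non_int_ages = df_users["age"].iloc[non_int_positions]
old_labels = non_int_ages[non_int_ages == "old"].index
df_users.loc[old_labels, "age"] = 70

# This conversion assumes every remaining entry can be parsed as an int.
df_users["age"] = df_users["age"].astype(int)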
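
The same idea can be sketched without an explicit loop. The example below is a possible vectorized alternative, not the notebook's method; it runs on a made-up Series (the values are assumptions) and relies on `pd.to_numeric` with `errors="coerce"` to flag unparseable entries.

```python
import pandas as pd

# Hypothetical toy column mixing numeric strings and a phrase.
ages = pd.Series(["60", "old", "-26", "17"], name="age")

# errors="coerce" turns anything that is not a number into NaN.
parsed = pd.to_numeric(ages, errors="coerce")

# Entries that were present but could not be parsed.
unparseable = ages[parsed.isna() & ages.notna()]
print(unparseable.tolist())

# Map "old" to 70 (the same placeholder age used above), then convert.
cleaned = pd.to_numeric(ages.replace("old", "70")).astype(int)
print(cleaned.tolist())
```

Printing the unparseable entries first makes it easier to decide which phrases deserve a replacement value before converting the column.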

Next, we check whether any values are negative. If a value is negative but its absolute value is in range, we make it positive, as we assume the customer entered the "-" by accident. Age information only serves marketing purposes anyway.

In [70]:
df_users["age"] = df_users["age"].abs()
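
The cell above flips every negative value without testing whether the result is in range. A minimal sketch of an explicit range check follows; the upper bound `MAX_AGE = 120` and the toy values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data; MAX_AGE is an assumed upper bound, not from the demo data.
MAX_AGE = 120
ages = pd.Series([60, -26, -400, 17], name="age")

# Flip the sign only where the absolute value is a plausible age.
flip = (ages < 0) & (ages.abs() <= MAX_AGE)
fixed = ages.where(~flip, -ages)

# Anything still outside 0..MAX_AGE needs a manual look.
out_of_range = fixed[(fixed < 0) | (fixed > MAX_AGE)]
print(fixed.tolist())
print(out_of_range.tolist())
```

Keeping the out-of-range entries separate leaves the decision about them (drop, cap, or ask the user) to a later step.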

Customer information is only relevant if an email address is available, so we drop all rows with no email address.

In [71]:
# Restrict the check to the email column; rows missing other fields are kept.
df_users = df_users.dropna(subset=["email"])
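
Whether a row is dropped for any missing field or only for a missing email makes a difference as soon as other columns have gaps. The sketch below contrasts the two on made-up data (column values are assumptions, not the demo data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the actual demo file.
toy = pd.DataFrame(
    {
        "email": ["a@example.com", np.nan, "c@example.com"],
        "gender": ["Female", "Male", np.nan],
    }
)

# Drops a row if any column is missing: only the first row survives.
any_missing_dropped = toy.dropna()

# Drops a row only if the email is missing: the first and third rows survive.
email_missing_dropped = toy.dropna(subset=["email"])
print(len(any_missing_dropped), len(email_missing_dropped))
```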

We are now done with this small data cleaning exercise and are left with a clean table. The last thing to do before analyzing the data is to reset the index, so that rows can be accessed by consecutive position.

In [72]:
df_users = df_users.reset_index(drop=True)
df_users

Unnamed: 0,full_name,first_name,last_name,email,gender,age
0,Mariel Finnigan,Mariel,Finnigan,mfinnigan0@usda.gov,Female,60
1,Kenyon Possek,Kenyon,Possek,kpossek1@ucoz.com,Male,12
2,Lalo Manifould,Lalo,Manifould,lmanifould2@pbs.org,Male,26
3,Nickola Carous,Nickola,Carous,ncarous3@phoca.cz,Male,4
4,Norman Dubbin,Norman,Dubbin,ndubbin4@wikipedia.org,Male,17
5,Franz Castello,Franz,Castello,fcastello6@1688.com,Male,25
6,Jorge Tarney,Jorge,Tarney,jtarney7@ft.com,Male,77
7,Eunice Blakebrough,Eunice,Blakebrough,eblakebrough8@sohu.com,Female,45
8,Kristopher Frankcombe,Kristopher,Frankcombe,kfrankcombe9@slate.com,Male,70
9,Palm Domotor,Palm,Domotor,pdomotora@github.io,Male,6
