# Managing Bad Data Entries

Bad (or dirty) data is information that is erroneous, misleading, or inconsistently formatted.  Bad data can distort metrics and invalidate reports.  To improve data quality, you first need to identify exactly what is wrong with the data.

Let's examine a sample dataset that contains a company's total daily sales for the years 2010 and 2011, broken down by product category and by whether the sales were online purchases.  This data is stored as a CSV file on the Math@Work server.

In [69]:
import pandas as pd
sales = pd.read_csv('https://mathatwork.org/DATA/bad-data.csv')
print(sales.head())

         date         category total_sales is_online
0    2/5/2010           Cereal       205.7     FALSE
1   2/12/2010  Office Supplies      651.21      TRUE
2  02/19/2010           Fruits        9.33     FALSE
3  02/26/2010  Office Supplies      651.21     FALSE
4    3/5/2010        Baby Food      255.28     FALSE


To check the quality of this dataset, we will need to explore the data in every data column.  Let's start with the *date* column.  This column appears to contain dates.  Let's verify this by checking the data type of this column.

In [70]:
print(sales.date.dtypes)

object


The code confirms that the data in the *date* column is of type string (or object).  It is a good idea to convert this column to type datetime, which is more appropriate for the data it contains.  The Pandas **to_datetime( )** function performs the conversion.  We also import **datetime** from Python's *datetime* library; it is used later in this lesson to build a replacement date.

In [71]:
from datetime import datetime
sales.date = pd.to_datetime(sales.date)
print(sales.date.dtypes)

datetime64[ns]


The printed data type shows that the *date* column is now *datetime*.  Since the data is supposed to cover only the years 2010 and 2011, use the Pandas **.dt.year** accessor to select any rows whose year is neither 2010 nor 2011.

In [72]:
sales[(sales['date'].dt.year != 2010) & (sales['date'].dt.year != 2011)]

          date    category total_sales is_online
84  1945-09-09  Vegetables      154.06      True


The filter reveals a bad date in row 84.  In practice, you would need to learn more about this particular sale to determine how to fix the date.  For this lesson, let's assume the year should be 2011.

In [74]:
sales.loc[84, 'date'] = datetime.strptime('09-09-2011', '%m-%d-%Y')

Python's **datetime.strptime( )** parses the new date string '09-09-2011' into a *datetime* value before it replaces the bad date in row 84.  The Pandas **.loc[ ]** indexer selects the row by its index label, which here is the row number 84.
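
As a minimal sketch of single-cell assignment, the snippet below uses a small made-up DataFrame (not the lesson's dataset) whose index labels include 84.  It passes both the row label and the column name to **.loc[ ]** in one step, and it changes only the year of the existing date.  The name `demo` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical three-row frame; the index labels mimic rows 83-85 of a larger table.
demo = pd.DataFrame(
    {"date": pd.to_datetime(["2011-09-02", "1945-09-09", "2011-09-16"])},
    index=[83, 84, 85],
)

# Select one cell by row label and column name, then overwrite it.
# Timestamp.replace keeps the month and day and changes only the year.
demo.loc[84, "date"] = demo.loc[84, "date"].replace(year=2011)

print(demo.loc[84, "date"])
```

Passing the row and column together means the assignment is made directly on the DataFrame rather than on an intermediate selection of the column.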

In [75]:
sales[(sales['date'].dt.year != 2010) & (sales['date'].dt.year != 2011)]

date category total_sales is_online


Applying the same filter again returns no rows, so there are no more out-of-range dates in the dataset.
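
The date checks above can also be bundled into one reusable step.  The sketch below is one possible way to do that, not part of the original lesson: the helper name `find_bad_dates`, the sample strings, and the assumption that the raw dates are written as month/day/year are all illustrative.  It flags entries that either cannot be parsed or fall outside the expected years.

```python
import pandas as pd

def find_bad_dates(dates, valid_years):
    """Return a Boolean mask: True where a date is unparseable or outside valid_years."""
    # errors="coerce" turns strings that cannot be parsed into NaT instead of raising.
    parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
    # NaT has no year, so .dt.year yields NaN there and isin() reports False.
    in_range = parsed.dt.year.isin(valid_years)
    return ~in_range

# Hypothetical raw values: two valid dates, one wrong year, one unparseable entry.
raw = pd.Series(["2/5/2010", "02/19/2010", "9/9/1945", "not a date"])
mask = find_bad_dates(raw, [2010, 2011])

print(raw[mask])
```

Using **.isin( )** with a list of years keeps the condition short and makes it easy to extend if more years are added later.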

Let's move on to the *category* column. This column appears to contain text. Let's verify this by checking the data type of this column.

In [76]:
print(sales.category.dtypes)

object


Great.  The code confirms that the data in the *category* column is indeed of type string (or object). Intuitively, the *category* data column should not be a *"free for all"* data entry field.  Instead, it should contain a small set of standardized categories.  Let's explore the data in this column by using the Pandas **.unique( )** method to give us a listing of all the unique values in this column.

In [77]:
import numpy as np
print(np.sort(sales.category.unique()))

['Baby Food' 'Beverages' 'Cereal' 'Clothes' 'Cosmetics' 'Fruits'
 'Household' 'Households' 'Meat' 'Office Supplies' 'Personal Care' 'Snacks'
 'Vegetables']


Sorting the listing makes similar categories easier to spot.  The NumPy library was imported so that its **sort( )** function could be applied to the array of unique values.  For more details regarding NumPy's **sort( )** function, review the Python for Data Science workshop.

The sorted listing shows both a *Household* and a *Households* category.  This is probably a typo.  In practice, you would need to learn more about this category to determine which it should be.  For this lesson, let's assume the category should be *Household*.

In [78]:
sales['category'] = sales['category'].replace('Households', 'Household')

In the above code, we applied the Pandas **.replace( )** method to the *category* column in order to replace the category *Households* with *Household*.  Because **.replace( )** returns a new Series, we assigned the result back to the *category* column so that the change is stored in the original *sales* DataFrame and not only in a copy.  A quick check of unique categories verifies the change.

In [79]:
print(np.sort(sales.category.unique()))

['Baby Food' 'Beverages' 'Cereal' 'Clothes' 'Cosmetics' 'Fruits'
 'Household' 'Meat' 'Office Supplies' 'Personal Care' 'Snacks' 'Vegetables']
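
A listing of unique values shows which labels exist but not how often each one occurs.  As a rough sketch, **.value_counts( )** can complement it: a label that appears only once or twice next to a nearly identical, frequent label is a likely typo.  The Series below is made up for illustration and is not the lesson's data.

```python
import pandas as pd

# Hypothetical category values, including one likely typo.
categories = pd.Series(
    ["Household", "Household", "Households", "Meat", "Meat", "Meat"]
)

# value_counts() tallies each distinct value; sort_index() orders the labels
# alphabetically so that near-identical spellings end up next to each other.
counts = categories.value_counts().sort_index()

print(counts)
```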


Great. Let's move on to the *total_sales* column. This column appears to contain numerical data. Let's verify this by checking the data type of this column.

In [80]:
print(sales.total_sales.dtypes)

object


The code confirms that the data in the *total_sales* column is of type string (or object). It is a good idea to convert this column to a numeric type, which is more appropriate for the data it contains.

In [81]:
sales.total_sales = pd.to_numeric(sales.total_sales)

ValueError: Unable to parse string "FALSE" at position 50

Oops!  The error message above lets us know that there is invalid data in this column at row 50.  Let's take a look at this row.

In [83]:
sales.loc[50]

date           2011-01-14 00:00:00
category                   Clothes
total_sales                  FALSE
is_online                   109.28
Name: 50, dtype: object

Looking at the row, we see that *total_sales* is FALSE while *is_online* is 109.28.  These two values were probably switched or entered in the wrong fields.  In practice, you would need to learn more about this particular row to determine whether the two fields were truly switched.  For this lesson, let's assume *total_sales* should be 109.28 and *is_online* should be FALSE.
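
The error message reports only the first value that cannot be parsed.  One way to list every non-numeric entry at once is to pass *errors='coerce'* to **pd.to_numeric( )**, as in the sketch below.  The frame `demo` and its values are hypothetical; only the column names are borrowed from the lesson.

```python
import pandas as pd

# Hypothetical frame with the same column names as the lesson's data.
demo = pd.DataFrame(
    {
        "total_sales": ["205.7", "FALSE", "9.33", "n/a"],
        "is_online": ["FALSE", "109.28", "TRUE", "TRUE"],
    }
)

# errors="coerce" converts entries that cannot be parsed into NaN instead of raising.
as_numbers = pd.to_numeric(demo["total_sales"], errors="coerce")

# Rows that held a value originally but became NaN are the non-numeric entries.
not_numeric = as_numbers.isna() & demo["total_sales"].notna()

print(demo[not_numeric])
```

The coerced values are used only to build the filter here, so the original column stays untouched until each flagged row has been reviewed.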

### Exercise

At row 50, replace the data in *total_sales* with 109.28 and in *is_online* with FALSE.  After making your replacement, be sure to display this row again to check the changes you made.

With the invalid data fixed, we should now be able to convert the *total_sales* column to the appropriate numeric data type.  A quick check of this column's data type after conversion will confirm the change.

In [85]:
sales.total_sales = pd.to_numeric(sales.total_sales)
print(sales.total_sales.dtypes)

float64


Great.  The *total_sales* column is now of datatype *float* with double precision.

Let's move on to the *is_online* column. This column appears to contain text. Let's verify this by checking the data type of this column.

In [86]:
print(sales.is_online.dtypes)

object


Great. The code confirms that the data in the *is_online* column is indeed of type string (or object). Furthermore, the *is_online* column appears to contain only the values TRUE or FALSE. Let's explore the data in this column by using the Pandas **.unique( )** method to give us a listing of all the unique values in this column.

In [89]:
print(sales.is_online.unique())

['FALSE' 'TRUE' 'FLASE']


The code confirms that the data contains a misspelling.

### Exercise

**1)** Filter to locate the misspelling in the *is_online* column.

**2)** Replace all occurrences of the misspelling with FALSE.  After making your replacement, be sure to filter this row again to check the changes you made.
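
Once the column holds only the strings TRUE and FALSE, it could also be converted to a Boolean type, in the same spirit as the *date* and *total_sales* conversions earlier.  The following is a minimal sketch on a made-up Series, not the lesson's data, and it assumes the misspelling has already been fixed.

```python
import pandas as pd

# Hypothetical, already-cleaned values (strings, as they would be read from a CSV).
is_online = pd.Series(["FALSE", "TRUE", "FALSE"])

# Map each string to its Boolean counterpart; anything unmapped would become NaN.
as_bool = is_online.map({"TRUE": True, "FALSE": False})

# Fail loudly if an unexpected value slipped through, then fix the dtype.
assert as_bool.notna().all(), "unexpected value in is_online"
as_bool = as_bool.astype(bool)

print(as_bool.dtype)
```

An explicit mapping is used because any leftover misspelling then shows up as a missing value instead of being silently converted.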

Awesome.  All the columns have now been checked (and managed) for bad data.  We are now ready to check for duplicated rows.  Intuitively, in this dataset all rows should be unique.  That is, no two rows should contain exactly the same data entries.  Use the Pandas **.duplicated( )** method to perform this check.

In [100]:
duplicates = pd.DataFrame(sales.duplicated(), columns=['is_duplicate'])
duplicates[duplicates.is_duplicate == True]

    is_duplicate
39          True


It is important to note that **.duplicated( )** returns a Boolean value for every row: True if the row repeats an earlier row and False otherwise, so the first occurrence of a repeated row is not flagged.  In the above code, we stored these values in the column *is_duplicate* of a DataFrame named *duplicates*.  A quick filter on True reveals that row 39 is a duplicate of an earlier row.
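
To see a flagged row side by side with the earlier row it repeats, **.duplicated( )** accepts *keep=False*, which marks every copy rather than only the later ones.  The sketch below uses a small hypothetical frame, not the lesson's data, to contrast the two behaviors.

```python
import pandas as pd

# Hypothetical frame in which the last row repeats the first.
demo = pd.DataFrame(
    {
        "category": ["Cereal", "Fruits", "Cereal"],
        "total_sales": [205.7, 9.33, 205.7],
    }
)

# Default (keep="first"): only the later copy is flagged.
later_copies = demo.duplicated()

# keep=False: every row that has an identical twin is flagged.
all_copies = demo.duplicated(keep=False)

print(demo[all_copies])
```

Inspecting all copies before dropping anything helps confirm that the rows really are identical entries and not two legitimate sales.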

Apply the Pandas **.drop( )** method to the *sales* DataFrame and pass in 39 to drop row 39 from the dataset.

In [101]:
sales.drop(39, inplace=True)

duplicates = pd.DataFrame(sales.duplicated(), columns=['is_duplicate'])
duplicates[duplicates.is_duplicate == True]

is_duplicate


Recreating the *duplicates* table and then filtering it on True confirms that the duplicate row was indeed dropped.
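
When there are many duplicated rows, dropping them one label at a time becomes tedious.  As an alternative sketch, the Pandas **.drop_duplicates( )** method removes all later copies in one call.  The frame `demo` below is hypothetical and stands in for the lesson's data.

```python
import pandas as pd

# Hypothetical frame in which the last row repeats the first.
demo = pd.DataFrame(
    {
        "category": ["Cereal", "Fruits", "Cereal"],
        "total_sales": [205.7, 9.33, 205.7],
    }
)

# Keeps the first copy of each repeated row and drops the later ones.
deduped = demo.drop_duplicates()

print(deduped.index.tolist())
```

The remaining rows keep their original index labels, so gaps can appear in the index; calling **.reset_index(drop=True)** afterwards renumbers the rows if that is preferred.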

Nice.  At this point, every column has been checked for bad data entries and the duplicated row has been removed.