# Data Cleansing
Data cleansing deals with the problem of dirty data.  Data is nearly always dirty, and you should expect to spend a good deal of time cleaning it up.  Dirty data not only makes your machine learning activities more difficult, it can also lead to poor or misleading results.

Let's walk through the cleanup of a very small data set.

In [None]:
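from dasi_library import *
import numpy as np  # np.nan is used below; imported explicitly in case dasi_library does not expose np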

In [None]:
df = readCsv("cleaning.csv")
df

You can probably see some issues with the data.  

**How many can you see? >>**

Let's start by inspecting the data types of the columns:

In [None]:
df.dtypes

**Do these data types look correct? >>**

Now, let's check to see if we have any null values in our data.  The following function tells us the percentage of nulls in each column:

In [None]:
checkForNulls(df)
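
The implementation of `checkForNulls` is not shown in this notebook. As a rough sketch of how a per-column null percentage can be computed in plain pandas (the `sample` frame and its values are made up for illustration and are not the contents of cleaning.csv):

```python
import numpy as np
import pandas as pd

# Illustrative data only; not the contents of cleaning.csv.
sample = pd.DataFrame({
    "location": ["Leeds", "York", "Hull", "Bath"],
    "price": [250000.0, np.nan, 180000.0, np.nan],
})

# isna() marks each null cell as True. The mean of those booleans per column
# is the fraction of nulls, so multiplying by 100 gives a percentage.
null_pct = sample.isna().mean() * 100
print(null_pct)
```

Here `location` has no nulls and half of the `price` values are null.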

**How will these null values affect our analysis? >>**

Let's deal with each column in turn.

## location
The location column looks fine.  Its data type is object (i.e. string) and there are no nulls, so there is nothing we need to do.

## date_of_sale

The date_of_sale column clearly holds dates, but it has come through as an object (string).  We need to convert it to a date type; otherwise date functionality won't work.

In [None]:
df["date_of_sale"] = convertToDateTime(df, "date_of_sale")

In [None]:
df.dtypes
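
For reference, a conversion like this can be sketched in plain pandas with `pd.to_datetime`. This is not necessarily what `convertToDateTime` does internally; the sample values and the ISO date format are assumptions made for illustration.

```python
import pandas as pd

# Illustrative data only; the real file may use a different date format.
sample = pd.DataFrame({"date_of_sale": ["2021-03-15", "2021-04-01", "not a date"]})

# errors="coerce" turns anything that cannot be parsed into NaT (a null date)
# instead of raising an exception.
sample["date_of_sale"] = pd.to_datetime(sample["date_of_sale"], errors="coerce")

print(sample.dtypes)
# Once the column is a datetime type, the .dt accessor becomes available.
print(sample["date_of_sale"].dt.year)
```

If the dates were written day first (for example 15/03/2021), `pd.to_datetime` would need `dayfirst=True` or an explicit `format` to parse them reliably.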

## number_of_bedrooms

This should be numeric, but has come through as object (string) because there are some non-numeric values.  Let's get a list of such values:

In [None]:
non_nums = listNonNumeric(df, 'number_of_bedrooms')
non_nums
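
One way to find such values in plain pandas is to attempt a numeric conversion and see which entries fail. The sketch below is an assumption about how a helper like `listNonNumeric` could work, using made-up sample values:

```python
import pandas as pd

# Illustrative data only; not the contents of cleaning.csv.
sample = pd.DataFrame({"number_of_bedrooms": ["3", "2", "unknown", "4", "three"]})
col = sample["number_of_bedrooms"]

# to_numeric with errors="coerce" returns NaN for entries it cannot parse.
as_number = pd.to_numeric(col, errors="coerce")

# Non-numeric values are those that failed to parse but were not null to begin with.
non_numeric = col[as_number.isna() & col.notna()].unique().tolist()
print(non_numeric)
```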

And let's replace them with nulls:

In [None]:
df = replaceValues(df, "number_of_bedrooms", non_nums, np.nan)
df

With the non-numeric values replaced, we can convert the column to float:

In [None]:
df['number_of_bedrooms'] = convertToFloat(df, 'number_of_bedrooms') 

In [None]:
df.dtypes
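
The replace-then-convert steps above can be sketched in plain pandas as follows. This is an illustration under assumed sample values, not the library's own code:

```python
import pandas as pd

# Illustrative data only; not the contents of cleaning.csv.
sample = pd.DataFrame({"number_of_bedrooms": ["3", "2", "unknown", "4"]})
bad_values = ["unknown"]

col = sample["number_of_bedrooms"]
# mask() replaces entries where the condition is True with NaN.
col = col.mask(col.isin(bad_values))
# The remaining strings are all numeric, so the column can be cast to float.
sample["number_of_bedrooms"] = col.astype(float)

print(sample.dtypes)
```

Float is used rather than int because NaN is a floating-point value and cannot be stored in a plain integer column.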

## price

Again, we would expect price to be numeric, but it contains non-numeric characters (the £ sign and thousands separators), which must be removed.

In [None]:
df = replaceSubstrings(df, 'price', ['£', ','], '')
df['price']

We can then convert to float:

In [None]:
df["price"] = convertToFloat(df, 'price')

In [None]:
df.dtypes

We still have a problem: £0 is not a valid price, so replace 0 with null.

In [None]:
df = replaceValues(df, "price", [0], np.nan)

In [None]:
df["price"]
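
Taken together, the price clean-up might look like this in plain pandas. It is a sketch with made-up values; the helper functions above may be implemented differently.

```python
import numpy as np
import pandas as pd

# Illustrative data only; not the contents of cleaning.csv.
sample = pd.DataFrame({"price": ["£250,000", "£0", "£1,200,000"]})

price = sample["price"]
# Strip each unwanted substring. regex=False treats the token as literal text.
for token in ["£", ","]:
    price = price.str.replace(token, "", regex=False)

# The cleaned strings are now purely numeric, so they can be cast to float.
price = price.astype(float)

# A price of 0 is treated as missing rather than as a real value.
sample["price"] = price.replace(0, np.nan)
print(sample["price"])
```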

## type
At first glance it looks OK, but:


In [None]:
listUnique(df, 'type')

We can see a misspelling of 'terraced'.  Let's replace that:


In [None]:
df = replaceValues(df, "type", ['teraced'], 'terraced')
df

In [None]:
df.dtypes
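
A plain-pandas sketch of the same inspect-and-fix pattern is shown below. Apart from 'teraced' and 'terraced', the category names are invented for illustration.

```python
import pandas as pd

# Illustrative data only; not the contents of cleaning.csv.
sample = pd.DataFrame({"type": ["terraced", "detached", "teraced", "semi-detached"]})

# Listing the distinct values makes misspellings easy to spot.
print(sample["type"].unique())

# Map each misspelling to its corrected form; other values are left untouched.
sample["type"] = sample["type"].replace({"teraced": "terraced"})
print(sample["type"].unique())
```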

## Dealing with nulls

Now we need to turn our attention to the nulls in the data.

In [None]:
checkForNulls(df)

Drop all columns containing nulls:

In [None]:
df_dropcols = dropNullCols(df)
df_dropcols

Drop all rows containing nulls:

In [None]:
df_droprows = dropNullRows(df)
df_droprows
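
Both kinds of drop correspond to pandas' `dropna`. A minimal sketch, assuming made-up sample values (the helpers above may wrap it differently):

```python
import numpy as np
import pandas as pd

# Illustrative data only; not the contents of cleaning.csv.
sample = pd.DataFrame({
    "location": ["Leeds", "York", "Hull"],
    "number_of_bedrooms": [3.0, np.nan, 2.0],
    "price": [250000.0, 180000.0, np.nan],
})

# Drop every column that contains at least one null.
dropped_cols = sample.dropna(axis="columns")

# Drop every row that contains at least one null.
dropped_rows = sample.dropna(axis="index")

print(dropped_cols)
print(dropped_rows)
```

In this sample only the `location` column survives the column drop, and only one of the three rows survives the row drop, which shows how much data either approach can discard.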

Impute with mean:


In [None]:
df_imputemean = imputeNullWithMean(df, "price")
df_imputemean

Impute with median:

In [None]:
df_imputemedian = imputeNullWithMedian(df, "price")
df_imputemedian
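
Mean and median imputation can both be sketched with `fillna` in plain pandas. The values below are made up so that the two results differ visibly; this is not necessarily how the helpers above are written.

```python
import numpy as np
import pandas as pd

# Illustrative data only; one price is much larger than the others.
sample = pd.DataFrame({"price": [100000.0, 150000.0, np.nan, 950000.0]})

# mean() and median() skip nulls by default, so they use the three known prices.
mean_filled = sample["price"].fillna(sample["price"].mean())
median_filled = sample["price"].fillna(sample["price"].median())

print(mean_filled)
print(median_filled)
```

With these values the null is filled with 400000.0 using the mean and 150000.0 using the median.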

**What are the pros and cons of these approaches to handling nulls? >>**

## Other data cleansing tasks

There are also other, more sophisticated approaches to imputation.

Other data cleansing tasks you may need to do include:

Outliers
- Remove erroneous outliers

Duplicates
- Remove them
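
As a rough illustration of both tasks in plain pandas: duplicates can be removed with `drop_duplicates`, and candidate outliers can be flagged with a simple rule. The 1.5 × IQR rule used here is one common convention chosen for this sketch, not a method prescribed by this notebook, and the data is made up.

```python
import pandas as pd

# Illustrative data only; the first two rows are exact duplicates.
sample = pd.DataFrame({
    "location": ["Leeds", "Leeds", "York", "Hull", "Bath", "Hull"],
    "price": [250000.0, 250000.0, 180000.0, 210000.0, 99000000.0, 230000.0],
})

# Remove rows that are identical in every column, keeping the first occurrence.
deduped = sample.drop_duplicates()

# Flag prices lying more than 1.5 * IQR outside the middle 50% of values.
q1 = deduped["price"].quantile(0.25)
q3 = deduped["price"].quantile(0.75)
iqr = q3 - q1
in_range = deduped["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

outliers = deduped[~in_range]
print(outliers)
```

A flagged value is only a candidate: it should be checked before removal, since a genuine but unusual value is not an error.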

