# Cleaning data

In this Notebook you will start to learn about some of the ways in which we can clean a dataset programmatically.

There is no way we can hope to be exhaustive! Instead, this is intended as a glimpse only, although hopefully enough of a glimpse to enable you to get started on working with your own dirty datasets.  Indeed, some of the examples given in this Notebook are either stubs, or fragmentary notes only - you are encouraged to expand on them as you find you need to use them and learn more about them.

This Notebook is intended to be skimmed rather than studied in depth, giving you pointers to some common tips and tricks for cleaning data.

As you start working with real datasets, you are likely to encounter problems with the data that will mean it needs cleaning. This Notebook can act as a first point of reference if you encounter a new problem or issue. As you develop your own data-cleaning habits, it makes sense to keep a note of them somewhere. Feel free to extend this Notebook with your own examples of useful data-cleaning tips and tricks, copy and extend this Notebook on a per-topic basis, or create one or more of your own Notebooks to record your notes. You will gradually produce a Notebook covering a range of approaches and techniques you can apply when working on new projects.

For cleaning problems regarding *character and file encodings*, refer to the Notebook `02.2.0 Data file formats - file encodings`. 

In [None]:
import pandas as pd

## Coping with whitespace

You will often find that strings start or end with unwanted whitespace.

In [None]:
coursedata_df = pd.DataFrame({ 'coursecode': [' TM351', 'TU100 ', ' M269 '],
                               'points': [30, 60, 30],
                               'level': ['3', '1', '2']
                              })
# Pull out the course codes as a list, and then join them with underscores to show the spaces.
"_".join(coursedata_df.coursecode)

We can use the `strip()` string method to remove whitespace from the start and the end of a string. Alternatively, use `lstrip()` or `rstrip()` to remove whitespace from the left or right-hand end of the string respectively.

In [None]:
coursedata_df.coursecode = coursedata_df.coursecode.str.strip()
"_".join(coursedata_df.coursecode)
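
As a brief sketch of the one-sided variants mentioned above, the following cell applies `str.lstrip()` and `str.rstrip()` to a fresh copy of the same course codes. The variable names (`codes`, `left_stripped`, `right_stripped`) are made up for this illustration.

```python
import pandas as pd

# The same untidy course codes as above, held in a standalone Series.
codes = pd.Series([' TM351', 'TU100 ', ' M269 '])

# Remove whitespace from the left-hand end only.
left_stripped = codes.str.lstrip()

# Remove whitespace from the right-hand end only.
right_stripped = codes.str.rstrip()

# Join with underscores so that any remaining spaces are visible.
print("_".join(left_stripped))
print("_".join(right_stripped))
```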

## Coping with case

We can change the case of elements within a column of a *pandas* DataFrame by applying the `str.upper()` or `str.lower()` method to it.

In [None]:
coursedata_df.coursecode.str.lower()

In [None]:
coursedata_df.coursecode.str.lower().str.upper()

## Type casting

If necessary, we can cast the type of a *pandas* DataFrame column to another type using the `astype()` method.

In [None]:
# Check the datatypes of each column.
coursedata_df.dtypes

In [None]:
# Here we recast the level and points values to be 64 bit floating point numbers.
coursedata_df[ ['level', 'points'] ] = coursedata_df[ ['level', 'points'] ].astype(float)
coursedata_df.dtypes

In [None]:
coursedata_df

In [None]:
# Now we switch the level values to integer type.
coursedata_df.level = coursedata_df.level.astype(int)
coursedata_df.dtypes

There are also several *pandas* methods available to handle cases where collections contain mixed types. For example, if you need to cast a Series or DataFrame column to a numeric type, but there are likely to be some elements that aren't castable and need replacing with `NaN` (the not-a-number marker), use `pd.to_numeric()` with the `errors='coerce'` parameter to generate `NaN` for those values.
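
A minimal sketch of that approach is shown below. The sample values and the variable names (`mixed`, `numeric`) are invented for illustration only.

```python
import pandas as pd

# A column of strings in which one entry cannot be read as a number.
mixed = pd.Series(['30', '60', 'unknown', '12.5'])

# errors='coerce' replaces anything that cannot be parsed with NaN
# rather than raising an error.
numeric = pd.to_numeric(mixed, errors='coerce')
print(numeric)
print(numeric.dtype)
```

Because `NaN` is a floating-point value, the resulting column has a float type even where the remaining values look like integers.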

## Rounding numbers

Sometimes you may be presented with floating-point numbers that have lost precision, such as financial amounts that should run to pennies but appear as something like 1.0000001, rather than 1.00.

The `round(value, precision)` function rounds `value` to `precision` decimal places.

In [None]:
# Round to 2 decimal places.
round(157248.22334673467, 2)

If the `precision` is not specified, then `round()` rounds to the nearest whole number.

In [None]:
# Round to an integer value.
round(157248.22334673467)

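If `precision` is a negative value, `round()` rounds to the left of the decimal point, treating the precision as a power of ten: for example, a precision of `-3` rounds to the nearest thousand.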

In [None]:
# Round to the nearest thousand (10^3).
round(157248.22334673467, -3)
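
The same idea can be applied to a whole column at once. The sketch below assumes the amounts are held in a *pandas* Series (the name `amounts` and its values are made up for this example) and uses the Series `round()` method. It also shows that Python 3's built-in `round()` sends exact halves to the nearest even number, which is worth bearing in mind when checking results by hand.

```python
import pandas as pd

# Hypothetical financial amounts that have picked up spurious extra digits.
amounts = pd.Series([1.0000001, 2.4999999, 157248.22334673467])

# Round every value in the Series to 2 decimal places.
rounded = amounts.round(2)
print(rounded)

# The built-in round() rounds exact halves to the nearest even integer.
print(round(2.5), round(3.5))
```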

## Splitting data in one column across several columns

The following example is taken from the Stack Overflow question [Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries](http://stackoverflow.com/questions/23317342/pandas-dataframe-split-column-into-multiple-columns-right-align-inconsistent-c).

In [None]:
addresses_df = pd.DataFrame( 
    {'City, State, Country':['HUN', 'ESP', 'GBR', 'ESP', 'FRA', 'ID, USA', 'GA, USA',
                             'Hoboken, NJ, USA', 'NJ, USA', 'AUS'] })
addresses_df

You can see that all rows contain a country code, some also have a state with the country, and some have a city, a state and a country. 

Suppose we want to reshape this as three columns in a DataFrame - one column each for country, state and city.

We can split the cell entries on the comma character by applying a string `split()` method to each string in the DataFrame.

In [None]:
# Split each cell entry on the comma, and assign to a new Series: 
columnssplitter = lambda x: pd.Series([i for i in (x.split(','))])
splitaddresses_df = addresses_df['City, State, Country'].apply(columnssplitter)
splitaddresses_df


Notice, however, that this is where the original questioner ran into the 'right aligned' problem.  Look at column 0: it's a mix of countries, states and cities.

To resolve this, we need to reverse the list of split items, so that countries appear in column 0, states in column 1 and cities in column 2.

In [None]:
# Split each cell entry on the comma, strip the leftover whitespace from each item,
# reverse the split list, and assign to new Series columns.
splitter = lambda x: pd.Series([i.strip() for i in reversed(x.split(','))])
splitaddresses_df = addresses_df['City, State, Country'].apply(splitter)
splitaddresses_df

In [None]:
# Now rename the columns.
splitaddresses_df.rename(columns = {0:'Country',1:'State',2:'City'}, inplace=True)
splitaddresses_df

## Techniques for recognising and parsing time

Being able to work with time-related objects *as time-based data* is a very powerful technique. But first this means we need to be able to recognise strings as representing time, date or datetime objects.

Many programming languages offer libraries that support the parsing of time-related strings as date, time or datetime objects. Different time elements (for example, day of the week, month of the year, hour of the day in 12 or 24-hour clock format, along with a PM or AM modifier) can be parsed using conventional directives.

Let's look at one or two examples.

In [None]:
# Create a DataFrame containing some date and datetime data.
timedata_df = pd.DataFrame( 
            { 'item': ['A','B','C'],
              'date':['12-5-12','30-08-11','17-10-10'],
              'datetime':['May 7, 2010, 11.14','April 22, 2011, 22.06','October 7, 2013, 00.01']
             } )

In [None]:
# There's nothing up my sleeves ... as you can see these are simply stored as strings
timedata_df


In [None]:
# We can cast a column to a datetime object by specifying the way 
#    the date or datetime string element is formatted.

# In this case, we parse a date.
pd.to_datetime(timedata_df.date, format='%d-%m-%y')

# In this format string we are saying that the date column uses the format day-month-year
# The result shows this converted to a datetime datatype, in which datetime elements are
# displayed as year-month-day.

In [None]:
# Here's another example: this time we parse a date and a time.
pd.to_datetime(timedata_df.datetime, format='%B %d, %Y, %H.%M')

Some common datetime format elements are:

    %a - The abbreviated weekday name (e.g. 'Sun')
    %A - The  full  weekday  name (e.g. 'Sunday')
    %b - The abbreviated month name (e.g. 'Jan')
    %B - The  full  month  name (e.g. 'January')
    %d - Day of the month (01..31)
    %H - Hour of the day, 24-hour clock (00..23)
    %I - Hour of the day, 12-hour clock (01..12)
    %j - Day of the year (001..366)
    %m - Month of the year (01..12)
    %M - Minute of the hour (00..59)
    %p - Meridian indicator (e.g. 'AM' or 'PM')
    %S - Second of the minute (00..60)
    %U - Week number of the current year, starting with the first Sunday as the first day of the first week (00..53)
    %W - Week number of the current year, starting with the first Monday as the first day of the first week (00..53)
    %w - Day of the week (Sunday is 0, 0..6)
    %y - Year without a century (00..99)
    %Y - Year with century (e.g. 2015)

For a full list of time-related codes, see [Python's strftime directives](http://strftime.org/).
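
To illustrate a few of the directives not used so far, here is a small sketch using Python's standard `datetime` module rather than *pandas*. The date string is made up for this example, and the code assumes English day and month names (that is, an English or default locale).

```python
from datetime import datetime

# Hypothetical string: weekday name, day, month name, year, 12-hour time with AM/PM.
raw = 'Sunday 4 January 2015, 03:45 PM'

# %A = full weekday name, %B = full month name,
# %I = hour on the 12-hour clock, %p = AM/PM indicator.
parsed = datetime.strptime(raw, '%A %d %B %Y, %I:%M %p')
print(parsed)

# The same directives can be used to format a datetime back into a string.
day_of_year = parsed.strftime('%j')
weekday_abbrev = parsed.strftime('%a')
print(day_of_year, weekday_abbrev)
```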

If a string is not matched by the format string, an error will be raised. You can force unmatched strings to the `NaT` (*not a time*) value by setting `errors='coerce'`.

In [None]:
# Create a DataFrame containing something that is not a date.
timedata2_df = pd.DataFrame( 
            { 'item': ['A','B','C'],
              'date':['66-65-64','30-08-11','17-10-10'],
             } )

pd.to_datetime(timedata2_df.date, format='%d-%m-%y', errors='coerce')

*pandas* can also parse dates and datetimes when reading in dates from CSV files.
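
As a minimal sketch, the `parse_dates` parameter of `pd.read_csv()` can be given a list of column names to parse as dates. The CSV content here is invented and is read from an in-memory string (via `io.StringIO`) rather than from a real file, purely to keep the example self-contained.

```python
import io
import pandas as pd

# A small, made-up CSV "file" held in memory.
csv_text = "item,date\nA,2012-05-12\nB,2011-08-30\nC,2010-10-17\n"

# Ask pandas to parse the 'date' column as datetimes while reading.
dated_df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
print(dated_df.dtypes)
```

Dates written in the unambiguous year-month-day form, as here, are the easiest case. For other layouts you may prefer to read the column in as strings and then apply `pd.to_datetime()` with an explicit `format`, as shown earlier.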

For general information on handling time in *pandas*, see the *pandas* documentation: [Time Series / Date functionality](http://pandas.pydata.org/pandas-docs/stable/timeseries.html).

### Exercise
Try making up some of your own date/time strings and see if you can cast them to datetime objects.

In [None]:
# YOUR EXAMPLES HERE.

What would you do if a date string included days in the form *7th* or *22nd*?

## A glimpse at regular expressions

*Regular expressions* will be covered in detail in Part 4 of the module when we consider non-numeric data analysis.

Regular expressions are included here as they are a very valuable way of constructing patterns to recognise specific dirty data appearing in strings, and to apply the results of the pattern recognition to rebuild a cleaner string.

Regular expressions are another area where we could give a book's worth of examples, but instead we will show just two:

- a method for cleaning alphabetic characters from a numeric value
- a method for extracting elements from a string.

As you come up with your own useful regular expression cleaning tricks and examples, feel free to add them to this Notebook, and you can share them on the module forum.

### Cleaning alphabetic characters from a numeric value

In [None]:
# Let's tidy up the following number representations.
messynumbers_df = pd.DataFrame({'messyvals': ['£40000', "UKP 25,000", '25000 pounds Sterling'] })
messynumbers_df

In [None]:
# First remove any commas:
messynumbers_df['cleanvals'] = messynumbers_df.messyvals.str.replace(',', '')
messynumbers_df

In [None]:
# Now apply a regular expression to get rid of the non-numeric characters 
#   left and right of the digits we want to keep.
# The bracketed term '([\d]*)' in the middle of the complex pattern string 
#   is the one we are extracting as '\1'.
# The pattern is written as a raw string (r"...") so that Python passes '\d' through unchanged.
messynumbers_df.replace({'cleanvals' : r"^[^\d]*([\d]*)[^\d]*$"}, {'cleanvals' : r'\1'}, regex=True)

In the Notebook on regular expressions in Part 4 you will be shown how this regular expression is shaped.

This is the regular expression part:   `"^[^\d]*([\d]*)[^\d]*$"` 

It reads: match the start of the string (`^`) followed by zero or more (`*`) occurrences of any non-digit character `[^\d]`, then match as a usable pattern `()` any string of zero or more (`*`) digits `[\d]`, followed by zero or more (`*`) non-digit characters `[^\d]` and the end of the string (`$`).

The 'usable' pattern is then referenced as `\1` in the replacement string.
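
To see the pattern at work outside *pandas*, the following sketch applies the same regular expression to plain Python strings using the standard `re` module. The sample strings mirror the ones above with the commas already removed; the variable names are made up for this example.

```python
import re

# The same pattern as above, written as a raw string.
pattern = r'^[^\d]*([\d]*)[^\d]*$'

samples = ['£40000', 'UKP 25000', '25000 pounds Sterling']

# Replace each whole string with just the captured run of digits (group 1).
cleaned = [re.sub(pattern, r'\1', s) for s in samples]
print(cleaned)

# Once only digits remain, the strings can be cast to integers.
as_ints = [int(s) for s in cleaned]
print(as_ints)
```

Note that the pattern only matches strings containing a single run of digits; a string with two separate numbers in it would be left unchanged.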

### Extracting elements from a string

In the next example we are given a list of web pages; the plan is to extract the domain into a new column and the filetype into a new column.

In [None]:
urls_df = pd.DataFrame({'url':['http://this.example.com/path/file.html',
                               'http://another.example.com/longer/path/file.json']})
urls_df

In [None]:
# Let's create new columns for each of the extracts, based on the original:
urls_df['domain'] = urls_df['url']
urls_df['filetype'] = urls_df['url']

# We can pull out the first item - the domain - in this simple example easily enough:
# Simply find everything between the http:// and the next /
urls_df.replace({'domain' : r"^http://([^/]*).*$"}, {'domain' : r'\1'} , regex=True, inplace=True)

# We could extend the same regular expression, with a match term for everything after the last '.'
# This second match term is the one we extract to give the filetype:
urls_df.replace({'filetype' : r"^http://([^/]*).*\.([^\.]*)$"}, {'filetype' : r'\2'},
                regex=True, inplace=True)
urls_df

The following snippet shows how we can call on both of those extracted values and reorder them:

In [None]:
urls_df.url.replace(r"^http://([^/]*).*\.([^\.]*)$", r'We got a(n) \2 file from \1.', regex=True)
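
An alternative worth knowing about, offered here as a sketch rather than as the approach used above, is the *pandas* `str.extract()` method. Given a pattern with named groups, it returns a DataFrame with one column per group. The group names `domain` and `filetype` are chosen for this example.

```python
import pandas as pd

urls = pd.Series(['http://this.example.com/path/file.html',
                  'http://another.example.com/longer/path/file.json'])

# (?P<name>...) defines a named group; each named group becomes a column.
parts = urls.str.extract(r'^http://(?P<domain>[^/]*).*\.(?P<filetype>[^.]*)$')
print(parts)
```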

## Fuzzy matching

The OpenRefine application provides a set of clustering tools that attempt to group together partially matching strings in order to support data normalisation as part of a data-cleaning process. 

Note that the OpenRefine service publishes an API that can be accessed from a Notebook, although we will not explore it in this module. For further information, see https://github.com/PaulMakepeace/refine-client-py/ and this example Notebook: http://nbviewer.ipython.org/gist/trevormunoz/6265360

There are several fuzzy matching packages available for Python, such as https://github.com/seatgeek/fuzzywuzzy or https://pypi.python.org/pypi/Fuzzy (which includes phonetic matching), as described in http://www.informit.com/articles/article.aspx?p=1848528.

Other than noting the existence of these libraries, we will not explicitly explore them further in this module, although you are welcome to use them in your own data explorations.
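
If you would like a quick taste of the idea without installing anything, Python's standard library includes the `difflib` module, which offers a simple form of approximate string matching. This is not one of the packages listed above, and the names and strings in the sketch below are invented for illustration.

```python
import difflib

# A hypothetical list of 'clean' names we would like to normalise against.
known_names = ['Open University', 'Oxford University', 'University of Edinburgh']

# A misspelt value of the kind that might turn up in a dirty dataset.
dirty_value = 'Opne University'

# Return the single closest match from the list of known names.
best_match = difflib.get_close_matches(dirty_value, known_names, n=1)
print(best_match)
```

`get_close_matches()` returns a list, which will be empty if nothing in the list of candidates is similar enough to the value being checked.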

## What next?

In this Notebook, you have seen examples of several data-cleaning techniques, albeit in quite a cursory form.

Data cleaning is something that benefits from a build up of case knowledge and experience. Feel free to add to this Notebook as you come up with your own data-cleaning recipes.

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `03.2 Selecting and projecting, sorting and limiting`.