# [AHA! Activity Health Analytics](http://casas.wsu.edu/)
[Center for Advanced Studies of Adaptive Systems (CASAS)](http://casas.wsu.edu/)

[Washington State University](https://wsu.edu)
# L1 Data Cleaning: Part 1

## Learner Objectives
At the conclusion of this lesson, participants should have an understanding of:
* Data cleaning 
* Real-world applications of data cleaning

## Acknowledgments
Content used in this lesson is based upon information in the following sources:
* [Pandas website](http://pandas.pydata.org/)

## Data Cleaning Overview
Data analysts spend a surprising amount of time preparing data for analysis. In fact, a survey found that cleaning big data is the most time-consuming and least enjoyable task data scientists do!
<img src="https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg" width="700">
(image from [https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg](https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg))

Data preparation includes, but is not limited to, tasks such as:
* Loading data into an appropriate data structure
* Merging multiple data sets
* Cleaning the data
    * Reshaping data, transforming data, changing data type
    * Replacing values, removing duplicates
    * Performing data binning/discretization
    * Handling missing values
    * Detecting outliers
    * Standardizing/scaling data
* Many others!
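
To make a few of the cleaning tasks above concrete, here is a minimal sketch on a toy data frame. The data, the column names (`name`, `age`), and the bin edges are invented for illustration and are not part of this lesson's dataset.

```python
import pandas as pd

# Toy data invented for illustration (not from the lesson's dataset).
toy = pd.DataFrame({
    "name": ["a", "b", "b", "c"],
    "age": [54, 62, 62, 78],
})

# Removing duplicates: keep the first copy of each repeated row.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Binning/discretization: group ages into decades.
# right=False makes each bin include its left edge, e.g. [50, 60).
deduped["age_group"] = pd.cut(
    deduped["age"],
    bins=[50, 60, 70, 80],
    labels=["50s", "60s", "70s"],
    right=False,
)

# Standardizing/scaling: min-max scaling maps ages onto the range [0, 1].
age = deduped["age"]
deduped["age_scaled"] = (age - age.min()) / (age.max() - age.min())

print(deduped)
```

Min-max scaling is only one of several scaling choices; it is used here because it is easy to check by hand.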

### Missing Values
It is not uncommon to have datasets with missing values. Missing values are usually coded as an out-of-range value, such as an empty string in a text field, -1 in a numeric field that is normally positive, or 0 in a numeric field that cannot take on the value of 0. In the SciPy ecosystem, the common value `NaN` (not a number) is used to denote missing data, and the SciPy libraries handle `NaN` specially. For example, the Pandas function [`isnull()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html) returns a Boolean array that flags the `NaN` values element-wise, and [`dropna()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) removes `NaN` values from a series or data frame:

In [1]:
import numpy as np
import pandas as pd
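x = np.arange(0, 10)
# use a float dtype so the series can hold NaN without an implicit upcast
ser = pd.Series(x, dtype=float)
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
ser[1] = np.nan
ser[5] = np.nan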
nans = ser.isnull()
# count the number of missing values
print(nans.sum())
print(ser)
ser.dropna(inplace=True)
print(ser)

2
0    0.0
1    NaN
2    2.0
3    3.0
4    4.0
5    NaN
6    6.0
7    7.0
8    8.0
9    9.0
dtype: float64
0    0.0
2    2.0
3    3.0
4    4.0
6    6.0
7    7.0
8    8.0
9    9.0
dtype: float64


Note: you can learn more about missing data by reading the [Pandas documentation on missing data](https://pandas.pydata.org/pandas-docs/stable/missing_data.html).
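
When missing values are coded with an out-of-range value rather than `NaN`, one option is to recode them first so the `NaN`-aware functions apply. The sketch below assumes a hypothetical numeric series in which -1 marks a missing reading; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -1 is a sentinel meaning "missing".
raw = pd.Series([12.0, -1.0, 15.0, -1.0, 18.0])

# Recode the sentinel as NaN so pandas treats it as missing.
clean = raw.replace(-1.0, np.nan)

# Count the missing values.
print(clean.isnull().sum())

# Summary statistics such as mean() skip NaN by default.
print(clean.mean())

# Instead of dropping the rows, fill the gaps with the mean of the observed values.
filled = clean.fillna(clean.mean())
print(filled)
```

Whether to drop or fill missing values depends on the analysis; filling with the mean is shown here only as one simple alternative to `dropna()`.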

By learning how to use the Pandas library, we have the skills to perform many of the tasks listed above. In this lesson, we are going to focus on *data cleaning*: modifying the data to make it sufficiently accurate and structured to support the analysis you want to perform. We are going to learn about data cleaning by working through an example!

## Data Cleaning Example
We are going to work with the [pd_hoa_activities.csv](https://raw.githubusercontent.com/gsprint23/aha/master/lessons/files/pd_hoa_activities.csv) dataset. This dataset contains information from a smart home study where participants performed 9 activities of daily living (ADLs) in a smart home environment:
1. Water plants
1. Fill medication dispenser
1. Wash counter top
1. Sweep and dust
1. Cook
1. Wash hands
1. Perform the [Timed Up and Go (TUG)](http://www.rehabmeasures.org/Lists/RehabMeasures/DispForm.aspx?ID=903) test
1. Perform TUG with questions being asked
1. A day out task

Note: you can read more about the design of this study and the various tasks in [Cook et al., 2015](http://ieeexplore.ieee.org/document/7181652/). 

The activities were timed and the duration is recorded for each participant in the dataset. The participants of the study include individuals with Parkinson's disease (PD) and age-matched, healthy older adults (HOA). For each participant in the study, the dataset includes a participant id (pid), age, and their class (PD or HOA). The data has been de-identified. For the purposes of our analysis today, we are interested in aggregating this data into PD and HOA groups to investigate the effect of PD on older adults' ability to perform the above tasks.

Here is a sample of the format of the data:

|pid|task|duration|age|class|
|-|-|-|-|-|
|0|1|146|72|HOA|
|0|2|210|72|HOA|
|0|3|241|72|HOA|
|0|4|328|72|HOA|
|0|5|229|72|HOA|
|0|6|38|72|HOA|
|0|7|10|72|HOA|
|0|8|10|72|HOA|
|0|dot|680|72|HOA|
|1|1|63|54|HOA|
|...|...|...|...|...|

Let's take a look at each column in the data and how the data needs to be cleaned:
* pid (integer): ID of the participant. Counting numbers starting at 0. Each pid appears on multiple rows, one per task.
* task (integer): ID of the task the participant performed.
    * Clean: Decode the integer task label to the plain text string task label.
    * Example: 1 will be decoded to "Water plants".
* duration (integer): Number of seconds it took the participant to perform the task.
    * Clean: Ensure this data is a numeric data type.
* age (integer): Age of the participant.
    * Clean: Ensure this data is a numeric data type.
* class (string): Population the participant belongs to: HOA or PD.
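
The cleaning steps listed above could be sketched as follows on a few hand-typed rows. This is an illustrative sketch, not the lesson's final code: the source only confirms that 1 decodes to "Water plants", so the rest of the `task_names` mapping (following the order of the task list, with "dot" taken to mean the day out task) is an assumption, as are the shortened label strings.

```python
import pandas as pd

# A few hand-typed rows in the format shown above. The last duration is
# deliberately not a number, to show what the numeric conversion does with it.
sample = pd.DataFrame({
    "pid": [0, 0, 73],
    "task": ["1", "dot", "7"],
    "duration": ["146", "680", "?"],
    "age": [72, 72, 70],
    "class": ["HOA", "HOA", "PD"],
})

# Assumed mapping from task code to task label (only "1" is confirmed).
task_names = {
    "1": "Water plants",
    "2": "Fill medication dispenser",
    "3": "Wash counter top",
    "4": "Sweep and dust",
    "5": "Cook",
    "6": "Wash hands",
    "7": "Timed Up and Go (TUG)",
    "8": "TUG with questions",
    "dot": "Day out task",
}

# Decode the task codes to plain text labels.
sample["task"] = sample["task"].map(task_names)

# Ensure duration is numeric; entries that cannot be parsed become NaN.
sample["duration"] = pd.to_numeric(sample["duration"], errors="coerce")

print(sample)
```

Using `errors="coerce"` turns unparseable entries into `NaN` instead of raising an error, which keeps them visible to `isnull()` and `dropna()`.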

### Load the Data
First we are going to load the data into a `pandas` `DataFrame` object. The header row is the first row in the file. We are not going to set an index column for the data because no column in the CSV file contains unique values.

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import pandas as pd
import numpy as np

# forward slashes work on Windows, macOS, and Linux
fname = "files/pd_hoa_activities.csv"
df = pd.read_csv(fname, header=0)
print(df.shape)
print("Number of participants:", df.shape[0] // 9)

(675, 5)
Number of participants: 75
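
A quick check that is often worth running right after loading is whether each column was read as a numeric type. The sketch below does not read the lesson's file; it assumes a small in-memory stand-in (`csv_text`) with three rows in the same format, so it can be run anywhere.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file: same header, three rows.
# One duration is "?" and one task is "dot", as in the real file.
csv_text = """pid,task,duration,age,class
0,1,146,72,HOA
0,dot,680,72,HOA
73,7,?,70,PD
"""

sample = pd.read_csv(io.StringIO(csv_text), header=0)

# A column that contains any non-numeric text is not loaded as a number.
for col in sample.columns:
    print(col, pd.api.types.is_numeric_dtype(sample[col]))
```

A column that should hold numbers but is reported as non-numeric is a sign that it contains entries needing cleaning.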


### Explore the Data
Now, let's take a look at some of the data points.

In [3]:
print(df.head(n=5))
print(df.tail(n=5))
print(df[660:670])
print(df[7:10])
print(df[25:28])

   pid task duration  age class
0    0    1      146   72   HOA
1    0    2      210   72   HOA
2    0    3      241   72   HOA
3    0    4      328   72   HOA
4    0    5      229   72   HOA
     pid task duration  age class
670   74    5      235   78    PD
671   74    6       41   78    PD
672   74    7       11   78    PD
673   74    8        9   78    PD
674   74  dot     1532   78    PD
     pid task duration  age class
660   73    4       30   70    PD
661   73    5      666   70    PD
662   73    6      162   70    PD
663   73    7        ?   70    PD
664   73    8        ?   70    PD
665   73  dot        ?   70    PD
666   74    1      180   78    PD
667   74    2      254   78    PD
668   74    3      280   78    PD
669   74    4      417   78    PD
   pid task duration  age class
7    0    8       10   72   HOA
8    0  dot      680   72   HOA
9    1    1       63   54   hoa
    pid task duration  age        class
25    2    8       11   62  parkinson's
26    2  dot      921 

If we only look at the first 5 rows and the last 5 rows of the dataset, the columns look well formed with no missing values; however, we see that the class column has inconsistent labels for our two classes (HOA and PD), and in rows 663, 664, and 665 (among others) there is a "?" denoting a missing value. In fact, if we count the number of "?" entries in the duration column, we see that there are 10 tasks with missing durations:

In [4]:
print(df["duration"].value_counts()["?"])

10
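
The same kind of counting can be used to inspect the other issues spotted above. The sketch below uses a hand-typed stand-in with values like those in the printed rows rather than the full file, so the counts it prints describe only this small sample.

```python
import pandas as pd

# Hand-typed stand-in using values seen in the printed rows above.
sample = pd.DataFrame({
    "duration": ["10", "680", "63", "11", "?", "?"],
    "class": ["HOA", "HOA", "hoa", "parkinson's", "PD", "PD"],
})

# Count "?" entries in every column at once.
print((sample == "?").sum())

# List each distinct class label with its frequency to reveal variants.
print(sample["class"].value_counts())
```

Listing the distinct labels first makes it easier to decide how each variant should be mapped onto the two intended classes.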


In the next lesson, we will write code to handle these missing values, as well as clean other columns of this dataset.