# Lesson Outline
Data wrangling process:

* Gather
* Assess
* **Clean (this lesson)**

**Cleaning** your data is the third step in data wrangling. It is where you fix the quality and tidiness issues that you identified in the assess step. In this lesson, you'll clean all of the issues you identified in Lesson 3 using Python and pandas.

## This lesson will be structured as follows:

* You'll get remotivated (if you aren't already) to clean the dataset used in Lessons 3 and 4: Phase II clinical trial data comparing the efficacy and safety of a new oral insulin with those of injectable insulin for treating diabetes
* You'll learn about the data cleaning process: defining, coding, and testing
* You'll address the missing data first (and learn why it is usually important to address these completeness issues first)
* You'll tackle the tidiness issues next (and learn why this is usually the next logical step)
* And finally, you'll clean up the quality issues

This lesson will consist primarily of Jupyter Notebooks, of which there will be two types: one quiz notebook that you'll work with throughout the whole lesson (i.e. your work will carry over from page to page) and three solution notebooks. I'll pop in and out to introduce the larger conceptual bits.

You will use the most common cleaning functions and methods in the pandas library to clean the nineteen quality issues and four tidiness issues identified in Lesson 3. Because you already have pandas experience and this isn't a course on pandas, these functions and methods won't be covered in detail. Even so, with that experience and your research and documentation skills, you can be confident that you'll leave this course able to clean any form of dirty and/or messy data that comes your way.

# Data Cleaning Process
## The Process
The very first thing to do before any cleaning occurs is to make a copy of each piece of data. All of the cleaning operations will be conducted on this copy so you can still view the original dirty and/or messy dataset later. Copying DataFrames in pandas is done using the `copy` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html). If the original DataFrame is called `df`, the soon-to-be clean copy of the dataset could be named `df_clean`.

```python
df_clean = df.copy()
```

Note that simply assigning a DataFrame to a new variable name leaves the original DataFrame vulnerable to modifications, as explained in the answers to this Stack Overflow question: ["Why should I make a copy of a DataFrame in pandas?"](https://stackoverflow.com/questions/27673231/why-should-i-make-a-copy-of-a-data-frame-in-pandas)
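
The difference between plain assignment and `copy` can be seen in a minimal sketch. The DataFrame contents and the `weight_kg` column below are made up for illustration and are not part of the clinical trial dataset.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({"weight_kg": [70.0, 82.5, 91.0]})

df_alias = df          # plain assignment: a second name for the same object
df_clean = df.copy()   # an independent copy (deep by default)

# Modify each one through its own name.
df_alias.loc[0, "weight_kg"] = -1.0   # this also changes df
df_clean.loc[1, "weight_kg"] = -2.0   # this leaves df untouched

print(df_alias is df)   # the alias and the original are one object
print(df)               # row 0 was overwritten, row 1 still holds 82.5
```

Because `df_alias` and `df` refer to the same object, any edit made through one name shows up under the other. Only `df_clean` can be modified without touching the original.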

# Address Missing Data First

**Imputing** means filling in missing data values with other values, using some appropriate method. Here is more information on the topic of imputation: [Wikipedia: Imputation](https://goo.gl/3bgcc8)
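
As a rough sketch, one simple form of imputation is to fill missing values in a numeric column with that column's mean. The `height_in` column and its values are hypothetical, and mean imputation is only one of many possible methods; whether it is appropriate depends on the data and the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing heights.
df = pd.DataFrame({"height_in": [65.0, np.nan, 71.0, np.nan, 68.0]})
df_clean = df.copy()

# Series.mean() skips NaN values by default.
mean_height = df_clean["height_in"].mean()

# Fill the missing entries with the column mean.
df_clean["height_in"] = df_clean["height_in"].fillna(mean_height)

print(df_clean["height_in"].isna().sum())   # count of remaining missing values
```

The original `df` still contains the missing values, so you can compare it against `df_clean` at any time.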

# You Can Iterate
As mentioned at the end of Lesson 3, the concept of iterating isn't that applicable for clinical trials given the rigor involved in their planning. But, theoretically, the following situations could arise that require iteration:

* Maybe you (as the data analyst or data scientist on the clinical trial research team) realized your statistical power calculations were wrong, and you needed to recruit more patients to give your study enough power to detect a statistically significant effect. You'd also have to revisit *gathering* in this scenario.
* Maybe you realized you were missing a key piece of patient information, like patient blood type (again, unlikely given the rigor of clinical trials, but mistakes happen) because you discovered new research that related insulin resistance to blood type. You'd also have to revisit *gathering* in this scenario.
* Maybe you finished assessing, started cleaning, and spotted another data quality issue. Revisiting *assessing* to add this issue to your notes is fine.

# Cleaning Summary
Cleaning is the third step in the data wrangling process:

* Gather
* Assess
* Clean

There are two types of cleaning:

* Manual (not recommended unless the issues are one-off occurrences)
* Programmatic

The programmatic data cleaning process:

1. Define: convert your assessments into defined cleaning tasks. These definitions also serve as an instruction list so others (or yourself in the future) can look at your work and reproduce it.
1. Code: convert those definitions to code and run that code.
1. Test: test your dataset, visually or with code, to make sure your cleaning operations worked.
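
The three steps above can be sketched for a single, made-up quality issue. The `state` column, its values, and the `state_abbrev` mapping are hypothetical and only illustrate the define, code, test pattern; they are not taken from the lesson's notebooks.

```python
import pandas as pd

# Hypothetical dirty data: states recorded both as full names and abbreviations.
df = pd.DataFrame({"state": ["California", "CA", "New York", "NY"]})
df_clean = df.copy()

# Define
# Convert full state names in the `state` column to two-letter abbreviations.

# Code
state_abbrev = {"California": "CA", "New York": "NY"}
df_clean["state"] = df_clean["state"].replace(state_abbrev)

# Test
# Visually: inspect the remaining distinct values.
print(df_clean["state"].value_counts())
# With code: fail loudly if any full name survived the cleaning step.
assert set(df_clean["state"]) == {"CA", "NY"}
```

Writing the definition as a comment directly above the code keeps the instruction list and its implementation together, which makes the work easier to reproduce.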

Always make copies of the original pieces of data before cleaning!