# Assessing
**Assessing** your data is the second step in data wrangling. When assessing, you're like a detective at work, inspecting your dataset for two things: data quality issues (i.e. content issues) and lack of tidiness (i.e. structural issues).

Assessing is the precursor to cleaning. You can't clean something that you don't know exists! In this lesson, you'll learn to identify and categorize common data quality and tidiness issues. This lesson is the shortest and most "hands-off" code-wise of all four in the course because of the passive nature of assessing relative to gathering and cleaning. We have tried to include quizzes wherever possible.

## This lesson will be structured as follows:

* You'll get motivated to assess (and later clean) the dataset for lessons 3 and 4: Phase II clinical trial data that compares the efficacy and safety of a new oral insulin to treat diabetes
* You'll learn to distinguish between dirty data and messy data
* You'll assess the data visually and programmatically to identify:
    * Data quality issues
    * Tidiness issues
* You'll learn about data quality dimensions and categorize each of the data quality issues identified above into its appropriate dimension

To begin, I want to introduce you to the dataset you will be assessing in this lesson.

# Dataset: Oral Insulin Phase II Clinical Trial Data
## DISCLAIMER: This Data Isn't "Real"
Auralin and Novodra are **not** real insulin products. This clinical trial data was fabricated for the sake of this course. When assessing this data, the issues that you'll detect (and later clean) are meant to simulate real-world data quality and tidiness issues.

That said:

* This dataset was constructed with the consultation of real doctors to ensure plausibility.
* This clinical trial data for an alternative insulin was inspired by and closely mimics this real [clinical trial for a new inhaled insulin called Afrezza](http://care.diabetesjournals.org/content/38/12/2266.long).
* The data quality issues in this dataset mimic real, [common data quality issues in healthcare data](http://media.hypersites.com/clients/1446/filemanager/Articles/DocCenter_Problem_with_data.pdf). These issues impact quality of care, patient registration, and revenue.
* The patients in this dataset were created using this [fake name generator](http://www.fakenamegenerator.com/order.php) and do not include real names, addresses, phone numbers, emails, etc.

The video above is only a short preview of the dataset that is intended to motivate. So don't worry if the details don't all make sense right now. You'll get intimately familiar with each column in each table in the dataset shortly. If you want to dive deeper into the data now, hop ahead to the **Visual Assessment: Acquaint Yourself** page where the data files are provided in a Jupyter Notebook workspace. (You can also download the files from there if you'd like by clicking the Jupyter logo in the workspace then selecting and downloading each file.)

# Assessment: Types vs. Steps
Why should we first document the issues we observe, rather than just writing down what we need to do to fix them? When your data's issues get complicated, writing how to fix each one can get confusing, lengthy, and time-consuming. It can be overwhelming to think through how to clean something complicated immediately after documenting it.

If you are separating the assessing and cleaning steps of data wrangling, as we are in this lesson, writing only observations as a first step is good practice.

If you choose to assess an issue then immediately clean that issue (which is very much allowed), you can skip the observation and go straight to defining how to clean it (which is part of the *Define-Code-Test* cleaning framework you’ll see in Lesson 4).

*Note: Visualizing your data (i.e., creating plots) is part of Programmatic rather than Visual Assessment. Tricky! This is because plotting data requires coding, or programming.*

## In this Lesson
You're going to start with visual assessment in the first part of this lesson to identify data quality issues. You'll then use programmatic assessment to identify some more data quality issues.

Toward the end of the lesson, you'll use visual (first) and programmatic assessment (second) to identify tidiness issues.

# Quality: Visual Assessment 1
**More Information**
* [Stack Overflow: Is it a good idea to use an integer column for storing US ZIP codes in a database?](https://stackoverflow.com/questions/893454/is-it-a-good-idea-to-use-an-integer-column-for-storing-us-zip-codes-in-a-databas)

# Assessing vs. Exploring
In the context of this dataset, **assessing** is everything you just identified, like spotting:

* Missing HbA1c changes
* Poorly formatted zip codes (e.g., four digits and float data type instead of five digits and string or object data type)
* Multiple state formats (e.g., NY and New York)
* Incorrect patient height values (e.g., 27 inches instead of 72 inches)
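
The kinds of issues listed above can also be surfaced with a few lines of pandas. The following is a minimal sketch on a handful of hypothetical rows (the column names, values, and the 48-inch height threshold are assumptions for illustration, not the course's actual tables):

```python
import pandas as pd

# Hypothetical rows that imitate the issues listed above (not the course data).
patients = pd.DataFrame({
    "zip_code": [92390.0, 8110.0, 10011.0, 48823.0],
    "state": ["CA", "NJ", "New York", "MI"],
    "height": [66, 27, 72, 70],               # inches
    "hba1c_change": [0.43, None, 0.32, None],
})

# Missing values per column (e.g., missing HbA1c changes).
missing = patients.isnull().sum()

# Zip codes stored as floats lose their leading zero, leaving four digits.
zip_is_float = patients["zip_code"].dtype == "float64"
short_zips = patients[patients["zip_code"] < 10000]

# Mixed state formats: two-letter abbreviations next to full names.
full_names = patients[patients["state"].str.len() > 2]

# Implausible adult heights (the 48-inch cutoff is an assumed threshold).
short_heights = patients[patients["height"] < 48]

print(missing)
print(short_zips)
print(full_names)
print(short_heights)
```

Each check only flags candidate issues. Deciding whether a flagged value is truly wrong (27 inches versus 72 inches, for instance) still takes judgment about the data's context.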

**Assessing** is also identifying structural (tidiness) issues that make analysis difficult.

The discovery of these data quality and tidiness issues helps ensure that the analysis can be executed, which for this clinical trial data includes calculating average patient metrics (e.g. age, weight, height, and BMI) and calculating the confidence interval for the difference in HbA1c change means between Novodra and Auralin patients.
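
To make the confidence interval mentioned above concrete, here is a rough sketch using only the Python standard library. The HbA1c change values are made up, the helper name `diff_in_means_ci` is hypothetical, and the interval uses a normal approximation, which is an assumption: with samples this small, a t-based interval would normally be preferred.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

# Hypothetical HbA1c changes for each treatment group (not real trial data).
novodra = [0.41, 0.38, 0.45, 0.36, 0.40]
auralin = [0.39, 0.42, 0.35, 0.37, 0.44]


def diff_in_means_ci(a, b, confidence=0.95):
    """Normal-approximation confidence interval for mean(a) - mean(b)."""
    diff = mean(a) - mean(b)
    # Standard error of the difference, using sample variances.
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    # Two-sided critical value from the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se


lower, upper = diff_in_means_ci(novodra, auralin)
print(lower, upper)
```

A calculation like this silently produces misleading results if HbA1c changes are missing or mistyped, which is why assessing comes first.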

**Exploring**, in the context of this dataset, might be:

* Using summary statistics like `count` on the state column or `mean` on the weight column to see if patients from certain states or of certain weights are more likely to have diabetes, which we can use to exclude certain patients from the analysis and make it less biased
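
As a small illustration of that kind of exploring, the sketch below counts patients per state and averages weight on hypothetical rows. Using `value_counts` to count per state is one possible reading of "`count` on the state column", and the values are invented:

```python
import pandas as pd

# Hypothetical patients (not the course data).
patients = pd.DataFrame({
    "state": ["CA", "NY", "CA", "TX", "NY", "CA"],
    "weight": [180.0, 150.5, 210.0, 165.0, 172.5, 190.0],  # lbs
})

# How many patients come from each state?
state_counts = patients["state"].value_counts()

# Average patient weight.
mean_weight = patients["weight"].mean()

print(state_counts)
print(mean_weight)
```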

Exploring, in the context of a clinical trial, is less likely to happen given that clinical trials are expensive and involve extensive pre-planning. So exploring on this dataset would most likely happen only before the treatments and adverse_reactions tables were created, i.e., before the clinical trial was conducted.

# Quality: Visual Assessment 2
**More Information**
* [ABBYY: Optical Character Recognition](https://www.abbyy.com/en-ca/finereader/what-is-ocr/)

# Data Quality Dimensions 1
Data quality dimensions help guide your thought process while assessing and also cleaning. The four main data quality dimensions are:

* **Completeness**: do we have all of the records that we should? Do we have missing records or not? Are there specific rows, columns, or cells missing?
* **Validity**: we have the records, but they're not valid, i.e., they don't conform to a defined schema. A schema is a defined set of rules for data. These rules can be real-world constraints (e.g. negative height is impossible) and table-specific constraints (e.g. unique key constraints in tables).
* **Accuracy**: inaccurate data is wrong data that is valid. It adheres to the defined schema, but it is still incorrect. Example: a patient's weight that is 5 lbs too heavy because the scale was faulty.
* **Consistency**: inconsistent data is both valid and accurate, but there are multiple correct ways of referring to the same thing. Consistency, i.e., a standard format, in columns that represent the same data across tables and/or within tables is desired.
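
One way to keep the dimensions straight is to pair each with a simple check. The sketch below does this on a hypothetical table (column names and values are assumptions). Accuracy has no check here on purpose: a value like a weight that is 5 lbs too heavy looks perfectly valid, so catching it generally needs an outside reference.

```python
import pandas as pd

# Hypothetical records (not the course data).
records = pd.DataFrame({
    "patient_id": [1, 2, 2, 4],
    "height": [66.0, -5.0, 72.0, None],        # inches
    "state": ["NY", "New York", "CA", "CA"],
})

# Completeness: is any height missing?
incomplete = records["height"].isnull().any()

# Validity: a real-world constraint (height must be positive) and a
# table-specific constraint (patient_id must be unique).
invalid_heights = records[records["height"] <= 0]
duplicate_ids = records[records["patient_id"].duplicated(keep=False)]

# Consistency: count how many different string lengths appear in `state`;
# more than one suggests abbreviations mixed with full names.
state_formats = records["state"].str.len().nunique()

print(incomplete, len(invalid_heights), len(duplicate_ids), state_formats)
```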

Regarding the other data quality research mentioned in the video, the additional dimensions are super specific cases of the four dimensions listed above. Example: currency, defined as the degree to which data is current with the world that it models. Currency can measure how up-to-date data is. Currency is a specific case of accuracy in the sense that out-of-date data is (usually) valid but wrong. In other words, our definition of accuracy can include currency.

**More Information**
* The inconsistent data quality dimension research mentioned in the video: [source 1 (PDF)](http://www.damauk.org/RWFilePub.php?&cat=403&dx=2&ob=3&rpn=catviewleafpublic403&id=106193), [source 2](http://www.informit.com/articles/article.aspx?p=399325&seqNum=3), [source 3](http://searchdatamanagement.techtarget.com/definition/data-quality), and [source 4](https://www.youtube.com/watch?v=dPsx8_Fcr-U)

# Quality: Programmatic Assessment 1
In the above video, I state that having the **country** column as the data type object (string) is fine, while I argue that **state** should be the category data type. This topic deserves a little bit more discussion.

**state** is categorical because its values are a finite set of options without order. **country**, for all intents and purposes, also has a finite set of values and therefore could be argued to be of categorical type as well. It seems there isn't much freedom of values in **country** to deserve classifying it as a string.

So why use object here for the data type for **country**? Well, **country** does still have a lot of values. Categorical data with tons of categories isn't that useful. Another reason for using object here is situational, i.e., it depends on the context in which you'd like to use the **country** column. In this dataset, all of the clinical trial patients are from the United States, so there are no advantages gained from switching the data type from object to category. The **country** column won't be used for analysis.

A more general scenario outside of this dataset is as follows. If you had only one to a few observations from each **country**, it would probably be best to treat country like a string and group observations on a larger unit, like **world_region** (Africa, Asia, Central America, etc.). If you had a lot of observations from a few countries, like test scores from students sampled in a handful of countries, making **country** categorical would be more appropriate.

The answer to a lot of questions in data analysis and data science is "it depends." This is what makes wrangling tricky sometimes since you have to understand the context of your data to make the best decision. Data scientists in a workplace should often consult with others on the team who know the data context best, or who will use the results of analysis later, like business analysts or product owners.
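
For reference, the data type choice discussed in this section might look like the following in pandas. This is a sketch on hypothetical rows, not the course's table: **state** is converted to the category data type and **country** is left untouched.

```python
import pandas as pd

# Hypothetical patients, all from one country (not the course data).
df = pd.DataFrame({
    "state": ["CA", "NY", "CA", "TX", "NY", "CA"],
    "country": ["United States"] * 6,
})

# state: a finite set of options without order, so make it categorical.
df["state"] = df["state"].astype("category")

# country: left as-is, since it has a single value here and isn't analyzed.
print(df.dtypes)
print(df["state"].cat.categories)
```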

# How Data Gets Dirty and Messy
## Sources of Dirty Data
*Dirty data = low quality data = content issues*

There are lots of sources of dirty data. Basically, anytime humans are involved, there's going to be dirty data. There are lots of ways in which we touch data we work with.

* We're going to have user entry errors.
* In some situations, we won't have any data coding standards, or where we do have standards they'll be poorly applied, causing problems in the resulting data.
* We might have to integrate data where different schemas have been used for the same type of item.
* We'll have legacy data systems, where data was coded when disk and memory constraints were much more restrictive than they are now. Over time systems evolve. Needs change, and data changes.
* Some of our data won't have the unique identifiers it should.
* Other data will be lost in transformation from one format to another.
* And then, of course, there's always programmer error.
* And finally, data might have been corrupted in transmission or storage by cosmic rays or other physical phenomena. So hey, one that's not our fault.

## Sources of Messy Data
*Messy data = untidy data = structural issues*

Messy data is usually the result of poor data planning or a lack of awareness of the benefits of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). Fortunately, messy data is usually much easier to address than most of the sources of dirty data mentioned above.
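
To give a feel for how addressable a structural issue can be, here is a minimal sketch that reshapes a hypothetical messy table with pandas' `melt`. The table, its column names, and its values are invented for illustration. The treatment variable is hidden in two column headers, and melting turns it into a single column:

```python
import pandas as pd

# Hypothetical messy table: one variable (treatment) spread over two headers.
messy = pd.DataFrame({
    "patient": ["A", "B"],
    "auralin_dose": [41, None],
    "novodra_dose": [None, 33],
})

# Reshape so that each variable forms a column and each observation a row.
tidy = (
    messy.melt(id_vars="patient", var_name="treatment", value_name="dose")
         .dropna(subset=["dose"])
         .reset_index(drop=True)
)

# Strip the "_dose" suffix left over from the old column headers.
tidy["treatment"] = tidy["treatment"].str.replace("_dose", "", regex=False)

print(tidy)
```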

# You Can Iterate!
The concept of iterating isn't that applicable for clinical trials given the rigor involved in their planning. But, theoretically, the following situations could arise that require iteration:

* Maybe you (as the data analyst or data scientist on the clinical trial research team) realized your statistical power calculations were wrong, and you needed to recruit more patients to give your study adequate statistical power. You'd also have to revisit gathering in this scenario.
* Maybe you realized you were missing a key piece of patient information, like patient blood type (again, unlikely given the rigor of clinical trials, but mistakes happen) because you discovered new research that related insulin resistance to blood type. You'd also have to revisit gathering in this scenario.
* Maybe you finished assessing, started cleaning, and spotted another data quality issue. Revisiting assessing to add this assessment to your notes is fine.

# Assessing Summary
Assessing is the second step in the data wrangling process:

* Gather
* **Assess**
* Clean

You can assess data for:

* Quality: issues with content. Low quality data is also known as dirty data.
* Tidiness: issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data requirements:
    1. Each variable forms a column.
    1. Each observation forms a row.
    1. Each type of observational unit forms a table.
    
...using two types of assessment:

* Visual assessment: scrolling through the data in your preferred software application (Google Sheets, Excel, a text editor, etc.).
* Programmatic assessment: using code to view specific portions and summaries of the data (pandas' `head`, `tail`, and `info` methods, for example).
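
A minimal sketch of those three pandas methods on a hypothetical table follows. The data is invented, and `info` is written to a string buffer here only so its output can be captured; calling `patients.info()` directly prints the same summary.

```python
import io

import pandas as pd

# Hypothetical patients (not the course data).
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "state": ["CA", "NY", "New York", "TX", "MI"],
    "height": [66, 27, 72, 70, 64],   # inches
})

first_rows = patients.head(2)   # first two rows
last_rows = patients.tail(2)    # last two rows

# info() summarizes column names, non-null counts, and data types.
buffer = io.StringIO()
patients.info(buf=buffer)
summary = buffer.getvalue()

print(first_rows)
print(last_rows)
print(summary)
```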