# Process Data from Dirty to Clean

Notes from this course: https://www.coursera.org/learn/process-data

## Module 1: The importance of integrity

### Learning log

#### Focus on integrity
- A strong analysis depends on the integrity of the data

#### Data integrity and analytics objectives
- Data integrity
    - The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle
- One missing piece can make all of your data useless
- Ways data can be compromised
    - Replicated
        - There's a chance the copies of the data will be out of sync
        - Example: One analyst copies a large dataset to check the dates. But because of memory issues, only part of the dataset is actually copied. The analyst would be verifying and standardizing incomplete data. That partial dataset would be certified as compliant but the full dataset would still contain dates that weren't verified. Two versions of a dataset can introduce inconsistent results. A final audit of results would be essential to reveal what happened and correct all dates.
    - Transferred
        - Might end up with incomplete data if data transfer is interrupted
        - Example: Another analyst checks the dates in a spreadsheet and chooses to import the validated and standardized data back to the database. But suppose the date field from the spreadsheet was incorrectly classified as a text field during the data import (transfer) process. Now some of the dates in the database are stored as text strings. At this point, the data needs to be cleaned to restore its integrity.
    - Manipulated
        - Manipulation is meant to make analysis more efficient, but an error during the process can compromise that efficiency
        - Example: When checking dates, another analyst notices what appears to be a duplicate record in the database and removes it. But it turns out that the analyst removed a unique record for a company’s subsidiary and not a duplicate record for the company. Your dataset is now missing data and the data must be restored for completeness.
- Data replication
    - Is the process of storing data in multiple locations
- Data transfer
    - The process of copying data from a storage device to memory, or from one computer to another
- Data manipulation
    - The process of changing data to make it more organized and easier to read
- Other threats to data integrity
    - Human error
    - Viruses
    - Malware
    - Hacking
    - System failures
- Data constraints and examples
    - Data type
        - Values must be of a certain type: date, number, percentage, Boolean, etc.
        - Example: If the data type is a date, a single number like 30 would fail the constraint and be invalid
    - Data range
        - Values must fall between predefined maximum and minimum values
        - Example: If the data range is 10-20, a value of 30 would fail the constraint and be invalid
    - Mandatory
        - Values can’t be left blank or empty
        - Example: If age is mandatory, that value must be filled in
    - Unique
        - Values can’t have a duplicate
        - Example: Two people can’t have the same mobile phone number within the same service area
    - Regular expression (regex) patterns
        - Values must match a prescribed pattern
        - Example: A phone number must match ###-###-#### (no other characters allowed)
    - Cross-field validation
        - Certain conditions for multiple fields must be satisfied
        - Example: Values are percentages and values from multiple fields must add up to 100%
    - Primary-key
        - (Databases only) values must be unique within the column
        - Example: A database table can’t have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique.
    - Set-membership
        - (Databases only) values for a column must come from a set of discrete values 
        - Example: Value for a column must be set to Yes, No, or Not Applicable
    - Foreign-key
        - (Databases only) values for a column must match values from a column in another table, where each of those values is unique
        - Example: In a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table
    - Accuracy
        - The degree to which the data conforms to the actual entity being measured or described
        - Example: If values for zip codes are validated by street location, the accuracy of the data goes up.
    - Completeness
        - The degree to which the data contains all desired components or measures
        - Example: If data for personal profiles required hair and eye color, and both are collected, the data is complete.
    - Consistency
        - The degree to which the data is repeatable from different points of entry or collection
        - Example: If a customer has the same address in the sales and repair databases, the data is consistent.
- It's important to check that the data you use aligns with the business objective
- Well-aligned objectives and data
    - You can gain powerful insights and make accurate conclusions when data is well-aligned to business objectives
    - As a data analyst, alignment is something you will need to judge
    - Good alignment means that the data is relevant and can help you solve a business problem or determine a course of action to achieve a given business objective
- Clean data + alignment to business objective = accurate conclusions
- Alignment to business objective + additional data cleaning = accurate conclusions
- Alignment to business objective + newly discovered variables + constraints = accurate conclusions
- VLOOKUP
    - Spreadsheet function that searches for a certain value in a column to return a related piece of information
- DATEDIF
    - Spreadsheet function that calculates the number of days, months, or years between two dates
- When there is clean data and good alignment, you can get accurate insights and make conclusions the data supports
- If there is good alignment but the data needs to be cleaned, clean the data before you perform your analysis
- If the data only partially aligns with an objective, think about how you could modify the objective, or use data constraints to make sure that the subset of data better aligns with the business objective
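
Several of the data constraints listed above (data type, mandatory, data range, regex pattern, set-membership, and unique/primary-key) can be sketched as simple checks in Python. This is only a minimal illustration, not something from the course: the records, field names, the 0-120 age range, and the `check_record` and `validate` helpers are all hypothetical.

```python
import re
from datetime import date

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"id": 1, "age": 34, "phone": "555-123-4567",
     "signup": date(2023, 1, 15), "subscribed": "Yes"},
    {"id": 2, "age": None, "phone": "555-987-6543",
     "signup": date(2023, 2, 1), "subscribed": "No"},
    {"id": 2, "age": 150, "phone": "5551234567",
     "signup": 30, "subscribed": "Maybe"},
]

PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")          # ###-###-####
ALLOWED_SUBSCRIBED = {"Yes", "No", "Not Applicable"}      # set-membership


def check_record(rec, seen_ids):
    """Return a list of constraint violations for one record."""
    problems = []
    # Data type: signup must be a date (a bare number like 30 fails)
    if not isinstance(rec["signup"], date):
        problems.append("signup: not a date")
    # Mandatory: age can't be left empty
    if rec["age"] is None:
        problems.append("age: missing")
    # Data range: age must fall between an assumed minimum and maximum
    elif not 0 <= rec["age"] <= 120:
        problems.append("age: out of range")
    # Regex pattern: the whole phone value must match, no extra characters
    if not PHONE_PATTERN.fullmatch(rec["phone"]):
        problems.append("phone: bad pattern")
    # Set-membership: value must come from a fixed set of allowed values
    if rec["subscribed"] not in ALLOWED_SUBSCRIBED:
        problems.append("subscribed: not in allowed set")
    # Unique / primary-key: no two records may share the same id
    if rec["id"] in seen_ids:
        problems.append("id: duplicate")
    seen_ids.add(rec["id"])
    return problems


def validate(rows):
    """Check every record, tracking ids seen so far to detect duplicates."""
    seen_ids = set()
    return [check_record(rec, seen_ids) for rec in rows]


results = validate(records)
for rec, problems in zip(records, results):
    print(rec["id"], problems or "ok")
```

The first record passes every check, the second fails only the mandatory constraint, and the third fails all of the others. In a real database most of these rules would be declared as column constraints rather than checked by hand.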
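
To make the VLOOKUP and DATEDIF notes above more concrete, here is a rough Python analogue of what the two spreadsheet functions do. It is a sketch under assumptions: the `vlookup` and `days_between` helpers and the sample table are hypothetical, `vlookup` only mimics an exact-match lookup, and `days_between` only mimics DATEDIF with the day unit.

```python
from datetime import date

# Hypothetical lookup table: (product code, name, unit price)
products = [
    ("A100", "Widget", 2.50),
    ("B200", "Gadget", 7.25),
]


def vlookup(search_key, table, index):
    """Find the first row whose first column equals search_key and
    return the value in column `index` (1-based, as in a spreadsheet).
    Returns None when there is no match (a spreadsheet would show #N/A)."""
    for row in table:
        if row[0] == search_key:
            return row[index - 1]
    return None


def days_between(start, end):
    """Number of whole days from start to end."""
    return (end - start).days


price = vlookup("B200", products, 3)
gap = days_between(date(2024, 1, 1), date(2024, 3, 1))
print(price)  # 7.25
print(gap)    # 60 (2024 is a leap year: 31 days in January + 29 in February)
```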

#### Overcoming the challenges of insufficient data

#### Testing your data

#### Consider the margin of error

#### Glossary


---

## Module 2: Sparkling-clean data

---

## Module 3: Cleaning data with SQL

---

## Module 4: Verify and report on your cleaning results

---

## Module 5: Adding data to your resume