# Process Data from Dirty to Clean

Notes from this course: https://www.coursera.org/learn/process-data

## Module 1: The importance of integrity

### Learning log

#### Focus on integrity
- A strong analysis depends on the integrity of the data

#### Data integrity and analytics objectives
- Data integrity
    - The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle
- One missing piece can make all of your data useless
- Ways data can be compromised
    - Replicated
        - There's a chance the data will be out of sync
        - Example: One analyst copies a large dataset to check the dates. But because of memory issues, only part of the dataset is actually copied. The analyst would be verifying and standardizing incomplete data. That partial dataset would be certified as compliant but the full dataset would still contain dates that weren't verified. Two versions of a dataset can introduce inconsistent results. A final audit of results would be essential to reveal what happened and correct all dates
    - Transferred
        - Might end up with incomplete data if data transfer is interrupted
        - Example: Another analyst checks the dates in a spreadsheet and chooses to import the validated and standardized data back to the database. But suppose the date field from the spreadsheet was incorrectly classified as a text field during the data import (transfer) process. Now some of the dates in the database are stored as text strings. At this point, the data needs to be cleaned to restore its integrity.
    - Manipulated
        - An error during the process can compromise the efficiency
        - Example: When checking dates, another analyst notices what appears to be a duplicate record in the database and removes it. But it turns out that the analyst removed a unique record for a company’s subsidiary and not a duplicate record for the company. Your dataset is now missing data and the data must be restored for completeness.
- Data replication
    - Is the process of storing data in multiple locations
- Data transfer
    - The process of copying data from a storage device to memory, or from one computer to another
- Data manipulation
    - The process of changing data to make it more organized and easier to read
- Other threats to data integrity
    - Human error
    - Viruses
    - Malware
    - Hacking
    - System failures
- Data constraints and examples
    - Data type
        - Values must be of a certain type: date, number, percentage, Boolean, etc
        - Example: If the data type is a date, a single number like 30 would fail the constraint and be invalid
    - Data range
        - Values must fall between predefined maximum and minimum values
        - Example: If the data range is 10-20, a value of 30 would fail the constraint and be invalid
    - Mandatory
        - Values can’t be left blank or empty
        - Example: If age is mandatory, that value must be filled in
    - Unique
        - Values can’t have a duplicate
        - Example: Two people can’t have the same mobile phone number within the same service area
    - Regular expression (regex) patterns
        - Values must match a prescribed pattern
        - Example: A phone number must match ###-###-#### (no other characters allowed)
    - Cross-field validation
        - Certain conditions for multiple fields must be satisfied
        - Example: Values are percentages and values from multiple fields must add up to 100%
    - Primary-key
        - (Databases only) value must be unique per column
        - Example: A database table can’t have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique.
    - Set-membership
        - (Databases only) values for a column must come from a set of discrete values 
        - Example: Value for a column must be set to Yes, No, or Not Applicable
    - Foreign-key
        - (Databases only) values for a column must match unique values coming from a column in another table
        - Example: In a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table
    - Accuracy
        - The degree to which the data conforms to the actual entity being measured or described
        - Example: If values for zip codes are validated by street location, the accuracy of the data goes up.
    - Completeness
        - The degree to which the data contains all desired components or measures
        - Example: If data for personal profiles required hair and eye color, and both are collected, the data is complete.
    - Consistency
        - The degree to which the data is repeatable from different points of entry or collection
        - Example: If a customer has the same address in the sales and repair databases, the data is consistent.
- It's important to check that the data you use aligns with the business objective
- Well-aligned objectives and data
    - You can gain powerful insights and make accurate conclusions when data is well-aligned to business objectives
    - As a data analyst, alignment is something you will need to judge
    - Good alignment means that the data is relevant and can help you solve a business problem or determine a course of action to achieve a given business objective
- Clean data + alignment to business objective = accurate conclusions
- Alignment to business objective + additional data cleaning = accurate conclusions
- Alignment to business objective + newly discovered variables + constraints = accurate conclusions
- VLOOKUP
    - Spreadsheet function that searches for a certain value in a column to return a related piece of information
- DATEDIF
    - Spreadsheet function that calculates the difference between two dates, such as the number of days between them
- When there is clean data and good alignment, you can get accurate insights and make conclusions the data supports
- If there is good alignment but the data needs to be cleaned, clean the data before you perform your analysis
- If the data only partially aligns with an objective, think about how you could modify the objective, or use data constraints to make sure that the subset of data better aligns with the business objective
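
The data constraints listed above can be sketched as simple checks in code. This is a minimal illustration, not something taken from the course: the field names (`age`, `score`, `signup_date`, `phone`, `answer`, `shares`) and the `check_record` helper are hypothetical, and the 10-20 range and `###-###-####` pattern are borrowed from the examples in the notes.

```python
import re
from datetime import date

# Regex constraint: ###-###-#### with no other characters allowed
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")
# Set-membership constraint
ALLOWED_ANSWERS = {"Yes", "No", "Not Applicable"}


def check_record(record, seen_phones):
    """Return a list of constraint violations for one record (a dict).

    `seen_phones` is a set shared across calls so uniqueness can be checked.
    """
    problems = []

    # Mandatory: age can't be left blank
    if record.get("age") is None:
        problems.append("age is mandatory")

    # Data type + data range: score must be a number between 10 and 20
    score = record.get("score")
    if not isinstance(score, (int, float)) or not 10 <= score <= 20:
        problems.append("score must be between 10 and 20")

    # Data type: a single number like 30 is not a valid date
    if not isinstance(record.get("signup_date"), date):
        problems.append("signup_date must be a date")

    # Regex pattern, then uniqueness of the phone number
    phone = str(record.get("phone") or "")
    if not PHONE_PATTERN.fullmatch(phone):
        problems.append("phone must match ###-###-####")
    elif phone in seen_phones:
        problems.append("phone must be unique")
    else:
        seen_phones.add(phone)

    # Set membership: only a fixed set of discrete values is allowed
    if record.get("answer") not in ALLOWED_ANSWERS:
        problems.append("answer must be Yes, No, or Not Applicable")

    # Cross-field validation: percentage fields must add up to 100
    if abs(sum(record.get("shares", [])) - 100) > 1e-9:
        problems.append("shares must add up to 100")

    return problems
```

Calling `check_record` on each row with one shared `seen_phones` set gives a per-row list of violations. An empty list means the row passed every check in this sketch.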

#### Overcoming the challenges of insufficient data
- Challenges are bound to come up, but once you know your business objective you'll be able to recognize whether you have enough data. And if you don't, you'll be able to deal with it before you start your analysis
- Types of insufficient data
    - Data from only one source
    - Data that keeps updating
    - Outdated data
    - Geographically-limited data
- Ways to address insufficient data
    - Identify trends with the available data
    - Wait for more data if time allows
    - Talk with stakeholders to adjust your objective
    - Look for a new dataset
- What to do when you find an issue with your data
    - No data
        - Solution 1
            - Gather the data on a small scale to perform a preliminary analysis and then request additional time to complete the analysis after you have collected more data. 
            - Example: If you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data from all employees.
        - Solution 2
            - If there isn’t time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround
            - Example: If you are analyzing peak travel times for commuters but don’t have the data for a particular city, use the data from another city with a similar size and demographic. 
    - Too little data
        - Solution 1
            - Do the analysis using proxy data along with actual data
            - Example: If you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors
        - Solution 2
            - Adjust your analysis to align with the data you already have
            - Example: If you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only
    - Wrong data, including data with errors
        - Solution 1
            - If you have the wrong data because requirements were misunderstood, communicate the requirements again.
            - Example: If you need the data for female voters and received the data for male voters, restate your needs.
        - Solution 2
            - Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors
            - Example: If your data is in a spreadsheet and there is a conditional statement or boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values
        - Solution 3
            - If you can’t correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won’t cause systematic bias
            - Example: If your dataset was translated from a different language and some of the translations don’t make sense, ignore the data with bad translation and go ahead with the analysis of the other data
- Decision tree on how to deal with data errors or not enough data
![decision tree](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/nubavN6IS5mm2rzeiFuZgw_1204106238b34cff9a89859772cdfaa1_Screen-Shot-2021-03-05-at-10.36.19-AM.png?expiry=1700352000000&hmac=9EmnpUoPqBTwZOISDCVSiX9LHAnf-9DIzUvFOAtNkrQ)
- Population
    - All possible data values in a certain dataset
    - The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company
- Sample
    - A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population
- Margin of error
    - Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population. 
- Confidence level
    - How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study. 
- Confidence interval
    - The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error.
- Statistical significance
    - The determination of whether your result could be due to random chance or not. The greater the significance, the less likely it is that the result is due to chance.
- Sample size
    - A part of a population that is representative of the population
    - Helps ensure the degree to which you can be confident that your conclusions accurately represent the population
    - Cost effective and takes less time
- Downside of sample size
    - When you only use a small sample of a population, it can lead to uncertainty
    - You can't be 100% sure that your statistics are a complete and accurate representation of the population. This can lead to sampling bias
- Sampling bias
    - A sample which isn't representative of the population as a whole
- Random sampling
    - A way of selecting a sample from a population so that every possible type of sample has an equal chance of being chosen
- Companies usually create sample sizes before data analysis so analysts know that the resulting dataset is representative of a population.
- Things to remember when determining the size of your sample
    - Don’t use a sample size less than 30. A common rule of thumb is that 30 is the smallest sample size where an average result of a sample starts to represent the average result of a population.
    - The confidence level most commonly used is 95%, but 90% can work in some cases. 
- Increase the sample size to meet specific needs of your project:
    - For a higher confidence level, use a larger sample size
    - To decrease the margin of error, use a larger sample size
    - For greater statistical significance, use a larger sample size
- Sample size calculators use statistical formulas to determine a sample size
- Why a minimum sample of 30?
    - This recommendation is based on the Central Limit Theorem (CLT) in the field of probability and statistics. As sample size increases, the distribution of results from a large number of samples more closely resembles the normal (bell-shaped) distribution. A sample of 30 is commonly treated as the smallest sample size for which the CLT approximation still holds reasonably well. Researchers who rely on regression analysis (statistical methods to determine the relationships between controlled and dependent variables) also prefer a minimum sample of 30.
- Sample sizes vary by business problem
    - Sample size will vary based on the type of business problem you are trying to solve
    - For example, if you live in a city with a population of 200,000 and get 180,000 people to respond to a survey, that is a large sample size. But without actually doing that, what would an acceptable, smaller sample size look like? Would 200 be alright if the people surveyed represented every district in the city? 
        - Answer:
            - It depends on the stakes. 
            - A sample size of 200 might be large enough if your business problem is to find out how residents felt about the new library
            - A sample size of 200 might not be large enough if your business problem is to determine how residents would vote to fund the library
            - You could probably accept a larger margin of error surveying how residents feel about the new library versus surveying residents about how they would vote to fund it. For that reason, you would most likely use a larger sample size for the voter survey
- Larger sample sizes have a higher cost
    - You also have to weigh the cost against the benefits of more accurate results with a larger sample size. Someone who is trying to understand consumer preferences for a new line of products wouldn’t need as large a sample size as someone who is trying to understand the effects of a new drug. For drug safety, the benefits outweigh the cost of using a larger sample size. But for consumer preferences, a smaller sample size at a lower cost could provide good enough results. 
- Knowing the basics is helpful
    - Knowing the basics will help you make the right choices when it comes to sample size. You can always raise concerns if you come across a sample size that is too small. A sample size calculator is also a great tool for this. 
    - Sample size calculators let you enter a desired confidence level and margin of error for a given population size. They then calculate the sample size needed to statistically achieve those results
- Complete the following tasks before analyzing data:
    - Determine data integrity by assessing the overall accuracy, consistency, and completeness of the data
    - Connect objectives to data by understanding how your business objectives can be served by an investigation into the data
    - Know when to stop collecting data
- Pre-cleaning activities
    - Data analysts perform pre-cleaning activities to complete the steps above
    - Pre-cleaning activities help you determine and maintain data integrity
    - Pre-cleaning activities are important because they increase the efficiency and success of your data analysis tasks
- Problems that might occur if you don't follow pre-cleaning steps
    - You may find that you are working with inaccurate or missing data, which can cause misleading results in your analysis
    - If you don’t connect objectives with the data, your analysis may not be relevant to the stakeholders
    - Finally, not understanding when to stop collecting data can lead to unnecessary delays in completing tasks
- If an analyst does not have the data needed to meet a business objective, they should gather related data on a small scale and request additional time. Then, they can find more complete data or perform the analysis by finding and using proxy data from other datasets.
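
The relationship between confidence level, margin of error, and sample size described above can be sketched in code. This is a rough illustration of what a sample size calculator might do, not the formula of any specific calculator mentioned in the course. It assumes a survey measuring a proportion, uses the normal approximation with a finite population correction, and takes 0.5 as the most conservative proportion. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def z_score(confidence_level):
    """Two-sided critical value of the standard normal (about 1.96 for 0.95)."""
    return NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)


def sample_size(population, confidence_level=0.95, margin=0.05, proportion=0.5):
    """Estimate the sample size needed for a given margin of error."""
    z = z_score(confidence_level)
    # Sample size for a very large (effectively infinite) population
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    # Finite population correction shrinks the requirement for small populations
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def margin_for_sample(sample, population, confidence_level=0.95, proportion=0.5):
    """Estimate the margin of error obtained from a given sample size."""
    z = z_score(confidence_level)
    fpc = math.sqrt((population - sample) / (population - 1))
    return z * math.sqrt(proportion * (1 - proportion) / sample) * fpc


if __name__ == "__main__":
    # City of 200,000 residents, as in the library survey example
    print(sample_size(200_000))                      # 95% confidence, 5% margin
    print(round(margin_for_sample(200, 200_000), 3)) # margin with 200 responses
```

Under these assumptions, the city of 200,000 needs a sample of 384 for a 95% confidence level and a 5% margin of error. A sample of 200 gives a margin of error of roughly 7%. That may be acceptable for gauging how residents feel about the library, but probably not for predicting a funding vote.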

##### Further reading
- [Central Limit Theorem (CLT)](https://www.investopedia.com/terms/c/central_limit_theorem.asp)
- [Sample Size Formula](https://www.statisticssolutions.com/dissertation-resources/sample-size-calculation-and-sample-size-justification/sample-size-formula/)
- [Determine the Best Sample Size](https://www.coursera.org/learn/process-data/lecture/mSj5A/determine-the-best-sample-size)
- [Sample Size Calculator](https://www.coursera.org/learn/process-data/supplement/ZqcDw/sample-size-calculator)

#### Testing your data

#### Consider the margin of error

#### Glossary


---

## Module 2: Sparkling-clean data

---

## Module 3: Cleaning data with SQL

---

## Module 4: Verify and report on your cleaning results

---

## Module 5: Adding data to your resume