# Process Data from Dirty to Clean

Notes from this course: https://www.coursera.org/learn/process-data

## Module 1: The importance of integrity

### Learning log

#### Focus on integrity
- A strong analysis depends on the integrity of the data

#### Data integrity and analytics objectives
- Data integrity
    - The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle
- One missing piece can make all of your data useless
- Ways data can be compromised
    - Replicated
        - There's a chance the copies of the data will be out of sync
        - Example: One analyst copies a large dataset to check the dates. But because of memory issues, only part of the dataset is actually copied. The analyst would be verifying and standardizing incomplete data. That partial dataset would be certified as compliant but the full dataset would still contain dates that weren't verified. Two versions of a dataset can introduce inconsistent results. A final audit of results would be essential to reveal what happened and correct all dates
    - Transferred
        - Might end up with incomplete data if data transfer is interrupted
        - Example: Another analyst checks the dates in a spreadsheet and chooses to import the validated and standardized data back to the database. But suppose the date field from the spreadsheet was incorrectly classified as a text field during the data import (transfer) process. Now some of the dates in the database are stored as text strings. At this point, the data needs to be cleaned to restore its integrity.
    - Manipulated
        - An error during the process can compromise the efficiency
        - Example: When checking dates, another analyst notices what appears to be a duplicate record in the database and removes it. But it turns out that the analyst removed a unique record for a company’s subsidiary and not a duplicate record for the company. Your dataset is now missing data and the data must be restored for completeness.
- Data replication
    - Is the process of storing data in multiple locations
- Data transfer
    - The process of copying data from a storage device to memory, or from one computer to another
- Data manipulation
    - The process of changing data to make it more organized and easier to read
- Other threats to data integrity
    - Human error
    - Viruses
    - Malware
    - Hacking
    - System failures
- Data constraints and examples
    - Data type
        - Values must be of a certain type: date, number, percentage, Boolean, etc
        - Example: If the data type is a date, a single number like 30 would fail the constraint and be invalid
    - Data range
        - Values must fall between predefined maximum and minimum values
        - Example: If the data range is 10-20, a value of 30 would fail the constraint and be invalid
    - Mandatory
        - Values can’t be left blank or empty
        - Example: If age is mandatory, that value must be filled in
    - Unique
        - Values can’t have a duplicate
        - Example: Two people can’t have the same mobile phone number within the same service area
    - Regular expression (regex) patterns
        - Values must match a prescribed pattern
        - Example: A phone number must match ###-###-#### (no other characters allowed)
    - Cross-field validation
        - Certain conditions for multiple fields must be satisfied
        - Example: Values are percentages and values from multiple fields must add up to 100%
    - Primary-key
        - (Databases only) value must be unique per column
        - Example: A database table can’t have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique.
    - Set-membership
        - (Databases only) values for a column must come from a set of discrete values 
        - Example: Value for a column must be set to Yes, No, or Not Applicable
    - Foreign-key
        - (Databases only) values for a column must be unique values coming from a column in another table
        - Example: In a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table
    - Accuracy
        - The degree to which the data conforms to the actual entity being measured or described
        - The degree of conformity of a measure to a standard or true value
        - Example: If values for zip codes are validated by street location, the accuracy of the data goes up.
    - Completeness
        - The degree to which the data contains all desired components or measures
        - The degree to which all required measures are known
        - Example: If data for personal profiles required hair and eye color, and both are collected, the data is complete.
    - Consistency
        - The degree to which the data is repeatable from different points of entry or collection
        - The degree to which a set of measures is equivalent across systems
        - Example: If a customer has the same address in the sales and repair databases, the data is consistent.
    - Validity
        - The concept of using data integrity principles to ensure measures conform to defined business rules or constraints
- It's important to check that the data you use aligns with the business objective
- Well-aligned objectives and data
    - You can gain powerful insights and make accurate conclusions when data is well-aligned to business objectives
    - As a data analyst, alignment is something you will need to judge
    - Good alignment means that the data is relevant and can help you solve a business problem or determine a course of action to achieve a given business objective
- Clean data + alignment to business objective = accurate conclusions
- Alignment to business objective + additional data cleaning = accurate conclusions
- Alignment to business objective + newly discovered variables + constraints = accurate conclusions
- VLOOKUP
    - Spreadsheet function that searches for a certain value in a column to return a related piece of information
- DATEDIF
    - Calculate the difference between the dates
    - Calculate the number of days between two dates
- When there is clean data and good alignment, you can get accurate insights and make conclusions the data supports
- If there is good alignment but the data needs to be cleaned, clean the data before you perform your analysis
- If the data only partially aligns with an objective, think about how you could modify the objective, or use data constraints to make sure that the subset of data better aligns with the business objective
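
Several of the data constraints listed above can be checked with a few lines of code. The following is a minimal sketch in Python, not a full validation framework. The field names (`age`, `signup`, `phone`, `answer`, `shares`), the sample records, and the helper `check_record` are all hypothetical and exist only for illustration. It also assumes `age` is already numeric when present.

```python
import re
from datetime import date

# Hypothetical constraint settings, chosen to mirror the examples in the notes.
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")    # ###-###-####
AGE_RANGE = (10, 20)                                 # data range 10-20
ALLOWED_ANSWERS = {"Yes", "No", "Not Applicable"}    # set-membership


def check_record(record, seen_phones):
    """Return a list of constraint violations for one record (a dict)."""
    problems = []

    # Mandatory: the value can't be left blank or empty.
    age = record.get("age")
    if age in (None, ""):
        problems.append("age is mandatory")
    # Data range: the value must fall between the minimum and maximum.
    elif not AGE_RANGE[0] <= age <= AGE_RANGE[1]:
        problems.append("age is out of range")

    # Data type: the value must be a date, so a bare number like 30 fails.
    if not isinstance(record.get("signup"), date):
        problems.append("signup is not a date")

    # Regex pattern: fullmatch rejects any extra characters.
    phone = record.get("phone") or ""
    if not PHONE_PATTERN.fullmatch(phone):
        problems.append("phone does not match ###-###-####")
    # Unique: no two records may share a phone number.
    elif phone in seen_phones:
        problems.append("phone is a duplicate")
    else:
        seen_phones.add(phone)

    # Set-membership: the value must come from a fixed set of values.
    if record.get("answer") not in ALLOWED_ANSWERS:
        problems.append("answer is not an allowed value")

    # Cross-field validation: the percentage fields must add up to 100.
    if sum(record.get("shares", [])) != 100:
        problems.append("shares do not add up to 100%")

    return problems


good = {"age": 15, "signup": date(2023, 1, 5), "phone": "555-123-4567",
        "answer": "Yes", "shares": [60, 40]}
bad = {"age": 30, "signup": 30, "phone": "555-123-4567",
       "answer": "Maybe", "shares": [60, 30]}

seen_phones = set()
print(check_record(good, seen_phones))
print(check_record(bad, seen_phones))
```

In a database, the primary-key and foreign-key constraints would normally be enforced by the database itself rather than by a script like this.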

#### Overcoming the challenges of insufficient data
- Challenges are bound to come up, but once you know your business objective you'll be able to recognize whether you have enough data. And if you don't, you'll be able to deal with it before you start your analysis
- Types of insufficient data
    - Data from only one source
    - Data that keeps updating
    - Outdated data
    - Geographically-limited data
- Ways to address insufficient data
    - Identify trends with the available data
    - Wait for more data if time allows
    - Talk with stakeholders to adjust your objective
    - Look for a new dataset
- What to do when you find an issue with your data
    - No data
        - Solution 1
            - Gather the data on a small scale to perform a preliminary analysis and then request additional time to complete the analysis after you have collected more data. 
            - Example: If you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data from all employees.
        - Solution 2
            - If there isn’t time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround
            - Example: If you are analyzing peak travel times for commuters but don’t have the data for a particular city, use the data from another city with a similar size and demographic. 
    - Too little data
        - Solution 1
            - Do the analysis using proxy data along with actual data
            - Example: If you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors
        - Solution 2
            - Adjust your analysis to align with the data you already have
            - Example: If you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only
    - Wrong data, including data with errors
        - Solution 1
            - If you have the wrong data because requirements were misunderstood, communicate the requirements again.
            - Example: If you need the data for female voters and received the data for male voters, restate your needs.
        - Solution 2
            - Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors
            - Example: If your data is in a spreadsheet and there is a conditional statement or boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values
        - Solution 3
            - If you can’t correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won’t cause systematic bias
            - Example: If your dataset was translated from a different language and some of the translations don’t make sense, ignore the data with bad translation and go ahead with the analysis of the other data
- Decision tree on how to deal with data errors or not enough data
![decision tree](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/nubavN6IS5mm2rzeiFuZgw_1204106238b34cff9a89859772cdfaa1_Screen-Shot-2021-03-05-at-10.36.19-AM.png?expiry=1700352000000&hmac=9EmnpUoPqBTwZOISDCVSiX9LHAnf-9DIzUvFOAtNkrQ)
- Population
    - All possible data values in a certain dataset
    - The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company
    - This is the total number you hope to pull your sample from
- Sample
    - A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population
- Margin of error
    - The maximum amount that the sample results are expected to differ from those of the actual population
    - Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population. 
- Confidence level
    - The probability that your sample size accurately reflects the greater population
    - How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study.
    - Having a 99% confidence level is ideal, but most industries hope for at least a 90% or 95% confidence level
- Confidence interval
    - The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error.
- Statistical significance
    - The determination of whether your result could be due to random chance or not. The greater the significance, the less likely the result is due to chance.
- Sample size
    - A part of a population that is representative of the population
    - Helps ensure the degree to which you can be confident that your conclusions accurately represent the population
    - Cost effective and takes less time
- Downside of sample size
    - When you only use a small sample of a population, it can lead to uncertainty
    - Can't be 100% sure that your statistics are a complete and accurate representation of the population. This leads to sampling bias
- Sampling bias
    - A sample which isn't representative of the population as a whole
- Random sampling
    - A way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen
- Companies usually create sample sizes before data analysis so analysts know that the resulting dataset is representative of a population.
- Things to remember when determining the size of your sample
    - Don’t use a sample size less than 30. It has been statistically proven that 30 is the smallest sample size where an average result of a sample starts to represent the average result of a population.
    - The confidence level most commonly used is 95%, but 90% can work in some cases. 
- Increase the sample size to meet specific needs of your project:
    - For a higher confidence level, use a larger sample size
    - To decrease the margin of error, use a larger sample size
    - For greater statistical significance, use a larger sample size
- Sample size calculators use statistical formulas to determine a sample size
- Why a minimum sample of 30?
    - This recommendation is based on the Central Limit Theorem (CLT) in the field of probability and statistics. As sample size increases, the results more closely resemble the normal (bell-shaped) distribution from a large number of samples. A sample of 30 is the smallest sample size for which the CLT is still valid. Researchers who rely on regression analysis – statistical methods to determine the relationships between controlled and dependent variables – also prefer a minimum sample of 30
- Sample sizes vary by business problem
    - Sample size will vary based on the type of business problem you are trying to solve
    - For example, if you live in a city with a population of 200,000 and get 180,000 people to respond to a survey, that is a large sample size. But without actually doing that, what would an acceptable, smaller sample size look like? Would 200 be alright if the people surveyed represented every district in the city? 
        - Answer:
            - It depends on the stakes. 
            - A sample size of 200 might be large enough if your business problem is to find out how residents felt about the new library
            - A sample size of 200 might not be large enough if your business problem is to determine how residents would vote to fund the library
            - You could probably accept a larger margin of error surveying how residents feel about the new library versus surveying residents about how they would vote to fund it. For that reason, you would most likely use a larger sample size for the voter survey
- Larger sample sizes have a higher cost
    - You also have to weigh the cost against the benefits of more accurate results with a larger sample size. Someone who is trying to understand consumer preferences for a new line of products wouldn’t need as large a sample size as someone who is trying to understand the effects of a new drug. For drug safety, the benefits outweigh the cost of using a larger sample size. But for consumer preferences, a smaller sample size at a lower cost could provide good enough results. 
- Knowing the basics is helpful
    - Knowing the basics will help you make the right choices when it comes to sample size. You can always raise concerns if you come across a sample size that is too small. A sample size calculator is also a great tool for this. 
    - Sample size calculators let you enter a desired confidence level and margin of error for a given population size. They then calculate the sample size needed to statistically achieve those results
- Complete the following tasks before analyzing data:
    - Determine data integrity by assessing the overall accuracy, consistency, and completeness of the data
    - Connect objectives to data by understanding how your business objectives can be served by an investigation into the data
    - Know when to stop collecting data
- Pre-cleaning activities
    - Data analysts perform pre-cleaning activities to complete the steps above
    - Pre-cleaning activities help you determine and maintain data integrity
    - Pre-cleaning activities are important because they increase the efficiency and success of your data analysis tasks
- Problems that might occur if you don't follow pre-cleaning steps
    - You may find that you are working with inaccurate or missing data, which can cause misleading results in your analysis
    - If you don’t connect objectives with the data, your analysis may not be relevant to the stakeholders
    - Finally, not understanding when to stop collecting data can lead to unnecessary delays in completing tasks
- If an analyst does not have the data needed to meet a business objective, they should gather related data on a small scale and request additional time. Then, they can find more complete data or perform the analysis by finding and using proxy data from other datasets.
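
The notes above say that sample size calculators take a confidence level, a margin of error, and a population size. One common way to do this is Cochran's formula with a finite population correction, sketched below. This is an assumption about how such calculators typically work, not a description of any specific calculator linked in these notes. The function name `sample_size` and the default `proportion=0.5` (the most conservative choice) are illustrative.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin_of_error=0.05, proportion=0.5):
    """Minimum sample size for estimating a proportion.

    Uses Cochran's formula with a finite population correction.
    """
    # z-score for a two-sided confidence level, about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    # Shrink it to account for the finite population.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# City of 200,000 residents, as in the example above.
print(sample_size(200_000))                          # 95% confidence, 5% margin -> 384
print(sample_size(200_000, confidence=0.99))         # higher confidence needs more people
print(sample_size(200_000, margin_of_error=0.03))    # smaller margin needs more people
```

Under these assumptions, the sample of 200 from the library example would fall short of a 95% confidence level with a 5% margin of error, which fits the point that its adequacy depends on the stakes.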

#### Testing your data
- Statistical power
    - Is the probability of getting meaningful results from a test
    - The larger the sample size, the greater the chance you'll have statistically significant results with your test
    - Usually shown as a value out of one
        - Example: If your statistical power is 0.6, that's the same thing as saying 60%
        - Meaning: There's a 60% chance of getting a statistically significant result
    - Can be calculated and reported for a completed experiment to comment on the confidence one might have in the conclusions drawn from the results of the study. It can also be used as a tool to estimate the number of observations or sample size required in order to detect an effect in an experiment
- Hypothesis testing
    - A way to see if a survey or experiment has meaningful results
- Statistically significant
    - Is a term that is used in statistics
    - If a test is statistically significant, it means the results of the test are real and not an error caused by random chance
    - Usually, you need a statistical power of at least 0.8 or 80% to consider your results statistically significant
- What to do when there is no data
    - Proxy data examples:
        - Scenario 1
            - Business scenario: A new car model was just launched a few days ago and the auto dealership can’t wait until the end of the month for sales data to come in. They want sales projections now.
            - Proxy: The analyst proxies the number of clicks to the car specifications on the dealership’s website as an estimate of potential sales at the dealership.
        - Scenario 2
            - Business scenario: A brand new plant-based meat product was only recently stocked in grocery stores and the supplier needs to estimate the demand over the next four years. 
            - Proxy: The analyst proxies the sales data for a turkey substitute made out of tofu that has been on the market for several years.
        - Scenario 3
            - Business scenario: The Chamber of Commerce wants to know how a tourism campaign is going to impact travel to their city, but the results from the campaign aren’t publicly available yet.
            - Proxy: The analyst proxies the historical data for airline bookings to the city one to three months after a similar campaign was run six months earlier.
- Confidence level and margin of error don't have to add up to 100%. They're independent of each other.
- Sample size calculator
    - Tells you how many people you need to interview (or things you need to test) to get results that represent the target population
- Estimated response rate
    - If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received the survey
    - If you need a sample size of 100 individuals and your estimated response rate is 10%, you will need to send your survey to 1,000 individuals to get the 100 responses you need for your analysis
- The calculated sample size is the minimum number to achieve what you input for confidence level and margin of error
- In order for an experiment to be statistically significant, the results should be real and not caused by random chance
- In order to have a high confidence level in a customer survey, the sample size should accurately reflect the entire population.
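
Two of the ideas above can be checked with quick arithmetic: how many surveys to send given an estimated response rate, and how statistical power grows with sample size. The sketch below is illustrative only. The power calculation assumes a one-sided z-test with the effect size expressed in standard-deviation units, which is a simplification introduced here and not something the course specifies. Both function names are hypothetical.

```python
import math
from fractions import Fraction
from statistics import NormalDist


def invitations_needed(sample_size, response_rate):
    """Number of surveys to send so that about `sample_size` people respond."""
    # Fraction avoids floating-point surprises (0.1 is not exact in binary).
    return math.ceil(Fraction(sample_size) / Fraction(str(response_rate)))


def power_one_sided_z(effect_size, n, alpha=0.05):
    """Approximate power of a one-sided z-test.

    effect_size is the true difference in standard-deviation units,
    an assumption made for this sketch.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z_crit - effect_size * math.sqrt(n))


# 100 responses needed at a 10% response rate, as in the notes.
print(invitations_needed(100, 0.10))   # 1000

# Power rises as the sample grows (effect size 0.5 is a made-up value).
for n in (10, 25, 50):
    print(n, round(power_one_sided_z(0.5, n), 2))
```

With these made-up inputs, power crosses the usual 0.8 threshold at around 25 observations.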

#### Consider the margin of error
- Based on the sample size, the resulting margin of error will tell us how different the results might be compared to the results if we had surveyed the entire population
- Margin of error helps you understand how reliable the data from your hypothesis testing is
- The closer to zero the margin of error, the closer your results from your sample would match results from the overall population
- Example: 
    - Let's say you completed a nationwide survey using a sample of the population. You asked people who work five-day workweeks whether they like the idea of a four-day workweek. So your survey tells you that 60% prefer a four-day workweek. The margin of error was 10%, which tells us that between 50 and 70% like the idea. So if we were to survey all five-day workers nationwide, between 50 and 70% would agree with our results.
    - Keep in mind that our range is between 50 and 70%. That's because the margin of error is counted in both directions from the survey results of 60%. If you set up a 95% confidence level for your survey, there'll be a 95% chance that the entire population's responses will fall between 50 and 70% saying, yes, they want a four-day workweek.
    - Since your margin of error overlaps with that 50% mark, you can't say for sure that the public likes the idea of a four-day workweek. In that case, you'd have to say your survey was inconclusive.
- If you've already been given the sample size, you can calculate the margin of error yourself. Then you can decide yourself how much of a chance your results have of being statistically significant based on your margin of error. In general, the more people you include in your survey, the more likely your sample is representative of the entire population.
- Increasing the sample size reduces the margin of error. Decreasing the confidence level would also reduce it, but that would make it less likely that your survey is accurate.
- To calculate margin of error you need:
    - Population size
    - Sample size
    - Confidence level
- Search "margin of error calculator" online
- A/B testing (or split testing) 
    - Tests two variations of the same web page to determine which page is more successful in attracting user traffic and generating revenue
- Conversion rate
    - User traffic that gets monetized is known as the conversion rate
- Margin of error helps you understand and interpret survey or test results in real-life
- Calculating the margin of error is particularly helpful when you are given the data to analyze
- After using a calculator to calculate the margin of error, you will know how much the sample results might differ from the results of the entire population
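
The three inputs listed above (population size, sample size, confidence level) are enough to estimate the margin of error for a survey proportion. The sketch below uses the standard normal-approximation formula with a finite population correction. The population of 1,000,000 and the sample of 96 are hypothetical values picked so the result lands near the 10% margin in the four-day workweek example. The notes do not give the real figures.

```python
import math
from statistics import NormalDist


def margin_of_error(population, sample, confidence=0.95, proportion=0.5):
    """Margin of error for a surveyed proportion (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = math.sqrt(proportion * (1 - proportion) / sample)
    # Finite population correction: matters when the sample is a
    # large share of the population.
    fpc = math.sqrt((population - sample) / (population - 1))
    return z * standard_error * fpc


def confidence_interval(result, moe):
    """Sample result plus and minus the margin of error."""
    return result - moe, result + moe


moe = margin_of_error(population=1_000_000, sample=96)
low, high = confidence_interval(0.60, moe)
print(f"margin of error: {moe:.1%}")
print(f"interval: {low:.1%} to {high:.1%}")

# The survey is only conclusive if the whole interval sits above 50%.
print("conclusive:", low > 0.50)
```

Raising the sample to 1,000 in this sketch shrinks the margin to roughly 3%, which would put the whole interval above 50%.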

##### Datasets
- CSV: [Credit card customers](https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers)
- JSON: [Trending YouTube videos](https://www.kaggle.com/datasets/datasnaek/youtube-new)
- SQLite: [U.S. wildfire data](https://www.kaggle.com/datasets/rtatman/188-million-us-wildfires)
- BigQuery: [Google Analytics 360](https://www.kaggle.com/datasets/bigquery/google-analytics-sample)

##### Further reading
- [Central Limit Theorem (CLT)](https://www.investopedia.com/terms/c/central_limit_theorem.asp)
- [Sample Size Formula](https://www.statisticssolutions.com/dissertation-resources/sample-size-calculation-and-sample-size-justification/sample-size-formula/)
- [Determine the Best Sample Size](https://www.coursera.org/learn/process-data/lecture/mSj5A/determine-the-best-sample-size)
- [Sample Size Calculator](https://www.coursera.org/learn/process-data/supplement/ZqcDw/sample-size-calculator)
- [Survey Monkey Sample Size Calculator](https://www.surveymonkey.com/mp/sample-size-calculator/)
- [Raosoft Sample Size Calculator](http://www.raosoft.com/samplesize.html)
- [A Gentle Introduction to Statistical Power and Power Analysis in Python](https://machinelearningmastery.com/statistical-power-and-power-analysis-in-python/)
- [Is There a Difference Between Open Data and Public Data?](https://medium.com/thinkdata/is-there-a-difference-between-open-data-and-public-data-2b8d5608b2f1)
- [4 Examples of Business Analytics In Action](https://online.hbs.edu/blog/post/business-analytics-examples)
- [To Get To The Root Of A Hard Problem, Just Ask “Why” Five Times](https://www.fastcompany.com/1669738/to-get-to-the-root-of-a-hard-problem-just-ask-why-five-times)

#### Glossary
https://docs.google.com/document/d/1iPgiXhYGya72hWKw3eQpf68PV7Ta3Ud317ad-xX_n1E/template/preview

---

## Module 2: Sparkling-clean data

### Learning log

#### Data cleaning is a must
- Yearly cost of poor-quality data in the US alone is $3.1 trillion USD according to IBM
- Human error is the #1 cause of poor quality data
- Dirty data can be the result of
    - Typing a piece of data incorrectly
    - Inconsistent formatting
    - Blank fields
    - Duplicate data
- Dirty data
    - Data that is incomplete, incorrect, or irrelevant to the problem you're trying to solve
    - When you work with dirty data, you can't be sure that your results are correct. In fact, you can pretty much bet they won't be
    - It can't be used in a meaningful way
- Clean data
    - Data that is complete, correct, and relevant to the problem that you're trying to solve
    - Cleaning data is something you should do, and do properly, because otherwise dirty data can cause serious problems
    - This allows you to understand and analyze information and identify important patterns, connect related information, and draw useful conclusions
- Data engineers
    - Transform data into a useful format for analysis and give it a reliable infrastructure
    - They develop, maintain, and test databases, data processors, and related systems
- Data warehousing specialists
    - Develop processes and procedures to effectively store and organize data
    - They make sure that data is available, secure, and backed up to prevent loss
- It's important to remember that no dataset is perfect. It's always a good idea to examine and clean data before beginning analysis
- Null
    - An indication that a value does not exist in a dataset
    - It's not the same as zero
    - To do your analysis, you would need to first clean this data. Step 1 is to decide what to do with nulls. Either filter them out and communicate that you now have smaller sample size, or keep them in and learn from the fact that customers did not provide responses
        - Reasons why this could happen:
            - Maybe your survey questions weren't written as well as they could be
            - Maybe they were confusing or biased
- Types of dirty data
    - Duplicate data
        - Any data record that shows up more than once
        - Cause: Manual data entry, batch data imports, or data migration
        - Effect: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval
    - Outdated data
        - Any data that is old which should be replaced with newer and more accurate information
        - Cause: People changing roles or companies, or software and systems becoming obsolete
        - Effect: Inaccurate insights, decision-making, and analytics
    - Incomplete data
        - Any data that is missing important fields
        - Cause: Improper data collection or incorrect data entry
        - Effect: Decreased productivity, inaccurate insights, or inability to complete essential services
    - Incorrect/inaccurate data
        - Any data that is complete but inaccurate
        - Cause: Human error inserted during data input, fake information, or mock data
        - Effect: Inaccurate insights or decision-making based on bad information resulting in revenue loss
    - Inconsistent data
        - Any data that uses different formats to represent the same thing
        - Cause: Data stored incorrectly or errors inserted during data transfer
        - Effect: Contradictory data points leading to confusion or inability to classify or segment customers
- Field
    - A single piece of information from a row or column of a spreadsheet
- Field length
    - A tool for determining how many characters can be keyed into a field
- Data validation
    - A tool for checking the accuracy and quality of data before adding or importing it
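
The difference between a null and a zero, and the choice between filtering nulls out or keeping them, can be seen in a small example. This is a minimal sketch using made-up survey ratings, where Python's `None` stands in for a null.

```python
# Hypothetical survey ratings: None marks a null (no response),
# while 0 is a real answer.
ratings = [4, None, 0, 5, None, 3]

# Option 1: filter nulls out and report the smaller sample size.
answered = [r for r in ratings if r is not None]
null_count = len(ratings) - len(answered)
print(f"sample size dropped from {len(ratings)} to {len(answered)}")

# Treating nulls as zeros would quietly drag the average down.
mean_ignoring_nulls = sum(answered) / len(answered)
mean_nulls_as_zero = sum(r or 0 for r in ratings) / len(ratings)
print(mean_ignoring_nulls, mean_nulls_as_zero)

# Option 2: keep the nulls and learn from them, for example by
# reporting the share of respondents who skipped the question.
print(f"{null_count / len(ratings):.0%} did not respond")
```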
    
#### Begin cleaning data
- It's always a good practice to make a copy of the dataset
- Removing unwanted data
    - Remove duplicates
    - Remove irrelevant data (data that doesn't relate to the problem you're trying to solve)
    - Remove extra spaces and blanks
    - Fix misspellings
    - Fix inconsistent capitalization
    - Fix incorrect punctuation and other typos
    - Remove formatting
- Merger
    - An agreement that unites two organizations into a single new one
- Data merging
    - The process of combining two or more datasets into a single dataset
- Compatibility
    - How well two or more datasets are able to work together
- Questions
    - Do I have all the data I need?
    - Does the data I need exist within these datasets?
    - Do the datasets need to be cleaned, or are they ready for me to use?
    - Are the datasets cleaned to the same standard?
- Common data-cleaning pitfalls
    - Not checking for spelling errors
        - Misspellings can be as simple as typing or input errors. Most of the time the wrong spelling or common grammatical errors can be detected, but it gets harder with things like names or addresses
    - Forgetting to document errors
        - Documenting your errors can be a big time saver, as it helps you avoid those errors in the future by showing you how you resolved them
    - Not checking for misfielded values
        - A misfielded value happens when the values are entered into the wrong field. These values might still be formatted correctly, which makes them harder to catch if you aren’t careful
    - Overlooking missing values
        - Missing values in your dataset can create errors and give you inaccurate conclusions
    - Looking at a subset of data and not the whole picture
        - It is important to think about all of the relevant data when you are cleaning. This helps make sure you understand the whole story the data is telling, and that you are paying attention to all possible errors
    - Losing track of business objective
        - When you are cleaning data, you might make new and interesting discoveries about your dataset, but you don't want those discoveries to distract you from the task at hand
    - Not fixing the source of the error
        - Fixing the error itself is important. But if that error is actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will have to keep fixing that same error over and over again
    - Not analyzing the system prior to data cleaning
        - If you want to clean your data and avoid future errors, you need to understand the root cause of your dirty data
    - Not backing up your data prior to data cleansing
        - It is always good to be proactive and create your data backup before you start your data clean-up. If your program crashes, or if your changes cause a problem in your dataset, you can always go back to the saved version and restore it. The simple procedure of backing up your data can save you hours of work and, most importantly, a headache.
    - Not accounting for data cleansing in your deadlines/process
        - All good things take time, and that includes data cleaning. It is important to keep that in mind when going through your process and looking at your deadlines. When you set aside time for data cleaning, it helps you get a more accurate estimate for ETAs for stakeholders, and can help you know when to request an adjusted ETA
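
Two of the mistakes above (not backing up data and overlooking missing values) are easy to guard against with a short script. The following is a minimal sketch, assuming the data lives in a CSV file; the file name `customers.csv`, its columns, and the helper names `backup_file` and `count_missing` are hypothetical and only serve the illustration.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def backup_file(path: Path) -> Path:
    """Copy the file next to the original with a .bak suffix and return the copy's path."""
    backup_path = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup_path)
    return backup_path


def count_missing(path: Path) -> dict[str, int]:
    """Count empty or whitespace-only cells per column in a CSV file."""
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        missing = {name: 0 for name in reader.fieldnames or []}
        for row in reader:
            for name in missing:
                if not (row.get(name) or "").strip():
                    missing[name] += 1
    return missing


# Hypothetical sample data, written to a temporary directory for the demo.
workdir = Path(tempfile.mkdtemp())
source = workdir / "customers.csv"
source.write_text(
    "name,city,signup_date\n"
    "Ana,Lisbon,2021-03-01\n"
    "Ben,,2021-04-15\n"
    ",Oslo,\n"
)

backup = backup_file(source)            # back up first, then inspect and clean
missing_counts = count_missing(source)  # one missing value in each column here
print(backup.name)
print(missing_counts)
```

Taking the backup before any cleaning step means a failed or mistaken change can be undone by restoring the `.bak` copy.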

#### Cleaning data in spreadsheets
- Conditional formatting
    - A spreadsheet tool that changes how cells appear when values meet specific conditions
- Remove duplicates
    - A tool that automatically searches for and eliminates duplicate entries from a spreadsheet
- Text string
    - A group of characters within a cell, most often composed of letters
- Split
    - A tool that divides a text string around the specified character and puts each fragment into a new and separate cell
- Concatenate
    - A function that joins multiple text strings into a single string
- Syntax
    - A predetermined structure that includes all required information and its proper placement
- What can be automated?
    - Communicating with your team and stakeholders
        - No
        - Communication is key to understanding the needs of your team and stakeholders as you complete the tasks you are working on. There is no replacement for person-to-person communications. 
    - Presenting your findings
        - No
        - Presenting your data is a big part of your job as a data analyst. Making data accessible and understandable to stakeholders and creating data visualizations can’t be automated for the same reasons that communications can’t be automated.
    - Preparing and cleaning data
        - Partially
        - Some tasks in data preparation and cleaning can be automated by setting up specific processes, like using a programming script to automatically detect missing values.
    - Data exploration
        - Partially
        - Sometimes the best way to understand data is to see it. Luckily, there are plenty of tools available that can help automate the process of visualizing data. These tools can speed up the process of visualizing and understanding the data, but the exploration itself still needs to be done by a data analyst.
    - Modeling the data
        - Yes
        - Data modeling is a difficult process that involves lots of different factors; luckily there are tools that can completely automate the different stages.
- More about automating data cleaning
    - One of the most important ways you can streamline your data cleaning is to clean data where it lives. This will benefit your whole team, and it also means you don’t have to repeat the process over and over.
- Pivot table
    - A data summarization tool that is used in data processing.
    - Pivot tables sort, reorganize, group, count, total or average data stored in the database.
    - In data cleaning, pivot tables are used to give you a quick, clutter-free view of your data.
    - You can choose to look at the specific parts of the data set that you need to get a visual in the form of a pivot table.
- VLOOKUP
    - Vertical Lookup
    - A function that searches for a certain value in a column to return a corresponding piece of information
    - VLOOKUP searches for the value in the first argument in the leftmost column of the specified location
- Plotting
    - When you plot data, you put it in a graph, chart, table, or other visual to help you quickly see what it looks like
    - Plotting is very useful when trying to identify any skewed data or outliers
- Data mapping
    - The process of matching fields from one database to another.
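
Several of the spreadsheet tools above have close analogues in a scripting language, which is one way to automate part of the cleaning. The sketch below is not how a spreadsheet implements these tools; it is a plain-Python illustration of the same ideas (trimming, Remove duplicates, Split, Concatenate, and an exact-match VLOOKUP). The sample rows, the `cities` table, and the helper names `clean_text` and `lookup` are made up for the example.

```python
# Hypothetical raw rows: (customer ID, full name), with stray spaces and a duplicate.
raw_rows = [
    ("  C-101 ", "Ana Silva"),
    ("C-102", "Ben  Okafor"),
    ("C-101", "Ana Silva"),
]


def clean_text(value: str) -> str:
    """Strip leading/trailing spaces and collapse repeated inner spaces."""
    return " ".join(value.split())


cleaned = [(clean_text(cid), clean_text(name)) for cid, name in raw_rows]

# Remove duplicates: keep the first occurrence of each row, preserving order.
deduped = list(dict.fromkeys(cleaned))

# Split: divide each name around the space character.
split_names = [name.split(" ") for _, name in deduped]

# Concatenate: join several strings into one.
labels = [f"{cid}: {name}" for cid, name in deduped]

# VLOOKUP analogue (exact match): search a key column, return the matching value.
cities = {"C-101": "Lisbon", "C-102": "Lagos"}


def lookup(key: str, table: dict[str, str], default: str = "#N/A") -> str:
    """Return the value stored for key, or a placeholder when the key is absent."""
    return table.get(key, default)


print(deduped)
print(split_names)
print(labels)
print(lookup("C-102", cities), lookup("C-999", cities))
```

Note that the duplicate is only detected after trimming: `"  C-101 "` and `"C-101"` are different strings until the extra spaces are removed.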


##### Further reading
- Data cleaning
    - [Top ten ways to clean your data](https://support.microsoft.com/en-us/office/top-ten-ways-to-clean-your-data-2844b620-677c-47a7-ac3e-c2e157d1db19)
    - [10 Google Workspace tips to clean up data](https://support.google.com/a/users/answer/9604139?hl=en#zippy=)
- Automation
    - [Automating Scientific Data Analysis](https://towardsdatascience.com/automating-scientific-data-analysis-part-1-c9979cd0817e)
    - [Automating Big-Data Analysis](https://news.mit.edu/2016/automating-big-data-analysis-1021)
    - [10 of the Best Options for Workflow Automation Software](https://technologyadvice.com/blog/information-technology/top-10-workflow-automation-software/)

#### Glossary
https://docs.google.com/document/d/1bdwobtiQ4NJGkaMghHPlj4z6V-knT6B-o6GGI4xsiDM/template/preview


---

## Module 3: Cleaning data with SQL

### Learning log

#### Using SQL to clean data
- SQL is the primary way data analysts extract data from databases
- Spreadsheet functions and formulas or SQL queries?
    - First, analysts have to think about where the data lives
        - If it is stored in a database, then SQL is the best tool for the job
        - If it is stored in a spreadsheet, then they will have to perform their analysis in that spreadsheet
- Features of Spreadsheets
    - Smaller data sets
    - Enter data manually
    - Create graphs and visualizations in the same program
    - Built-in spell check and other useful functions
    - Best when working solo on a project
- Features of SQL Databases 
    - Larger datasets
    - Access tables across a database
    - Prepare data for further analysis in another software
    - Fast and powerful functionality
    - Great for collaborative work and tracking queries run by all users

#### Learn basic SQL queries
- Be super curious about whatever data set you're given
- Spend a lot of time, even before you touch your keyboard, thinking about your data set and what insights you can get from it. And then start having fun.
- As with any programming language, the hard part is not writing the syntax but deciding what question you want to ask of your data
- DISTINCT
    - Equivalent to "Remove duplicates" tool in spreadsheet
- LENGTH
    - Use to check if data has correct length
- TRIM
    - Remove extra spaces
- SUBSTR
    - Select sequence of characters within a string
- Steps to clean data using SQL
    - Step 1: Inspect columns
        - The first thing you want to do is inspect the data in your table so you can find out if there is any specific cleaning that needs to be done
    - Step 2: Inspect the length column 
        - Next, you will inspect a column with numerical data.
    - Step 3: Fill in missing data
        - Missing values can create errors or skew your results during analysis. You’re going to want to check your data for null or missing values
    - Step 4: Identify potential errors
        - Once you have finished ensuring that there aren’t any missing values in your data, you’ll want to check for other potential errors
    - Step 5: Ensure consistency
        - Finally, you want to check your data for any inconsistencies that might cause errors. These inconsistencies can be tricky to spot — sometimes even something as simple as an extra space can cause a problem
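
The four functions above can be tried out without a database server. The sketch below is an illustration only: it uses Python's built-in `sqlite3` module and an in-memory SQLite database, which is an assumption (the course itself may use a different SQL dialect), and the `customers` table and its values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("C101", "US"),
        ("C101", "US"),    # exact duplicate row
        ("C102", "USA"),   # country code longer than two characters
        ("C103", " US "),  # extra spaces around the value
    ],
)

# DISTINCT: return each customer ID only once.
distinct_ids = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT customer_id FROM customers ORDER BY customer_id"
    )
]

# LENGTH: find country values that are not exactly two characters long.
bad_length = conn.execute(
    "SELECT customer_id, country FROM customers "
    "WHERE LENGTH(country) <> 2 ORDER BY customer_id"
).fetchall()

# TRIM removes the extra spaces; SUBSTR keeps the first two characters.
normalized = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT SUBSTR(TRIM(country), 1, 2) FROM customers"
    )
]

print(distinct_ids)
print(bad_length)
print(normalized)
```

After trimming and taking the first two characters, every row agrees on a single country code, which is the kind of consistency check described in Step 5.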

#### Transforming data
- CAST
    - Can be used to convert anything from one data type to another
    - CAST(field AS TYPE)
    - Useful for cleaning data
- Typecasting
    - Converting data from one type to another
- Float
    - Number that contains a decimal
- CONCAT
    - Adds strings together to create new text strings that can be used as unique keys
    - CONCAT(field1, field2)
- COALESCE
    - Returns the first non-null value in a list of arguments, which makes it useful for filling in missing values from a fallback column
    - COALESCE(field1, field2)
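
A short sketch of these transformations, again using Python's built-in `sqlite3` module as an assumed stand-in for the course's database. The `purchases` table and its values are hypothetical. Because some SQLite versions do not provide a `CONCAT` function, the sketch uses the standard `||` concatenation operator instead; `CAST` and `COALESCE` are used as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (product_code TEXT, color TEXT, price TEXT, nickname TEXT)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        ("couch", "grey", "89.5", None),
        ("lamp", "blue", "9.5", "Desk lamp"),
        ("couch", "red", "120", None),
    ],
)

# Prices stored as text sort alphabetically, which gives a misleading order.
text_sorted = [
    row[0] for row in conn.execute("SELECT price FROM purchases ORDER BY price DESC")
]

# CAST converts the text to a number, so the sort becomes numeric.
numeric_sorted = [
    row[0]
    for row in conn.execute(
        "SELECT CAST(price AS REAL) FROM purchases ORDER BY CAST(price AS REAL) DESC"
    )
]

# Concatenation (|| here, CONCAT in other dialects) builds a unique key per row.
keys = [
    row[0]
    for row in conn.execute(
        "SELECT product_code || '-' || color FROM purchases ORDER BY rowid"
    )
]

# COALESCE falls back to product_code whenever nickname is NULL.
display_names = [
    row[0]
    for row in conn.execute(
        "SELECT COALESCE(nickname, product_code) FROM purchases ORDER BY rowid"
    )
]

print(text_sorted)
print(numeric_sorted)
print(keys)
print(display_names)
```

The difference between `text_sorted` and `numeric_sorted` shows why typecasting matters: the same values rank differently depending on the data type they are stored as.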

##### Further reading
- SQL dialects and their uses
    - [What Is a SQL Dialect, and Which one Should You Learn?](https://learnsql.com/blog/what-sql-dialect-to-learn/)
    - [Difference Between SQL Vs MySQL Vs SQL Server](https://www.softwaretestinghelp.com/sql-vs-mysql-vs-sql-server/)
    - [SQL Server, PostgreSQL, MySQL... what's the difference? Where do I start?](https://www.datacamp.com/blog/sql-server-postgresql-mysql-whats-the-difference-where-do-i-start)
    - [SQLite Window Functions](https://sqlite.org/windowfunctions.html)
    - [What Is SQL](https://www.sqltutorial.org/what-is-sql/)
    - [Datacamp Data Engineering](https://www.datacamp.com/learn/popular/data-engineering)
    
#### Glossary
https://docs.google.com/document/d/1wSW8yJuNz6UQC0Q8SurFzeoFQA-UzHygkPPXQLOj7Gw/template/preview

---

## Module 4: Verify and report on your cleaning results

### Learning log

#### Manually cleaning data
- Verification
    - A process to confirm that a data-cleaning effort was well-executed and the resulting data is accurate and reliable
    - Rechecking clean dataset
    - Manual cleanups
    - Think about the original purpose of the project
    - Making sure data is properly verified allows you to double-check the work you did to clean up your data was thorough and accurate
    - Lets you catch mistakes before you begin analysis
    - Without it, any insights you gain from analysis can't be trusted for decision-making. You might even risk misrepresenting populations or damaging the outcome of a product you're actually trying to improve
    - Without it, you have no way of knowing that your insights can be relied on for data-driven decision-making
    - Stamp of approval
    - It's like car companies running tons of tests to make sure a car is safe before it hits the road
- Open communication is a lifeline for any data analytics project
- Reports
    - Super effective way to show your team that you're being 100% transparent about your data cleaning
    - Great opportunity to show stakeholders that you're accountable, build trust with your team, and make sure you're all on the same page of important project details
- Changelog
    - A file containing a chronologically ordered list of modifications made to a project
    - Usually organized by version and includes the date followed by a list of added, improved, and removed features
    - Very useful for keeping track of how a dataset evolved over the course of a project
- Verification steps
    - Going back to your original unclean data set and comparing to what you have now
        - Review the dirty data and try to identify any common problems
    - Taking a big picture view of your project
        - This is an opportunity to confirm you're actually focusing on the business problem that you need to solve and the overall project goals and to make sure that your data is actually capable of solving that problem and achieving those goals
        - It's important to take the time to reset and focus on the big picture because projects can sometimes evolve or transform over time without us even realizing it
- Taking a big picture view of your project involves doing 3 things:
    - Consider the business problem
        - If you've lost sight of the problem, you have no way of knowing what data belongs in your analysis
        - Taking a problem-first approach to analytics is essential at all stages of any project
        - You need to be certain that your data will actually make it possible to solve your business problem
    - Consider the goal
        - It's not enough just to know that your company wants to analyze customer feedback about a product
        - What you really need to know is that the goal of getting this feedback is to make improvements to that product
        - You also need to know whether the data you've collected and cleaned will actually help your company achieve that goal
    - Consider the data
        - You need to consider whether your data is capable of solving the problem and meeting the project objectives
        - Thinking about where the data came from and testing your data collection and cleaning processes
        - Sometimes data analysts can be too familiar with their own data, which makes it easier to miss something or make assumptions. Asking a teammate to review your data from a fresh perspective and getting feedback from others is very valuable in this stage
        - This is also the time to notice if anything sticks out to you as suspicious or potentially problematic in your data
- Verifying your data ensures that the insights you gain from analysis can be trusted. It's an essential part of data-cleaning that helps companies avoid big mistakes. This is another place where data analysts can save the day.
- When it comes to data cleaning verification, there is no one-size-fits-all approach or a single checklist that can be universally applied to all projects. Each project has its own organization and data requirements that lead to a unique list of things to run through for verification.
- Data-cleaning verification: A checklist
    - Correct the most common problems
        - Sources of errors: Did you use the right tools and functions to find the source of the errors in your dataset?
        - Null data: Did you search for NULLs using conditional formatting and filters?
        - Misspelled words: Did you locate all misspellings?
        - Mistyped numbers: Did you double-check that your numeric data has been entered correctly?
        - Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM function?
        - Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?
        - Mismatched data types: Did you check that numeric, date, and string data are typecast correctly?
        - Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and meaningful?
        - Messy (inconsistent) date formats: Did you format the dates consistently throughout your dataset?
        - Misleading variable labels (columns): Did you name your columns meaningfully?
        - Truncated data: Did you check for truncated or missing data that needs correction?
        - Business Logic: Did you check that the data makes sense given your knowledge of the business?
    - Review the goal of your project
        - Confirm the business problem 
        - Confirm the goal of the project
        - Verify that data can solve the problem and is aligned to the goal 
- COUNTA
    - A function that counts the total number of values within a specified range
- CASE statement
    - The CASE statement goes through one or more conditions and returns a value as soon as a condition is met.
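
As an illustration of the last two items, the sketch below uses a `CASE` statement to correct a known misspelling and then counts values to verify the result. It runs on an in-memory SQLite database through Python's `sqlite3` module, which is an assumption made for the example; the `customers` table and the misspelling `Tnoy` are made-up sample data. SQL's `COUNT(column)` counts non-null values, which is roughly comparable to what COUNTA does for non-empty cells in a spreadsheet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, first_name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Tony"), (2, "Tnoy"), (3, None), (4, "Maria")],
)

# CASE checks each condition in order and returns the first matching result;
# rows that match no condition fall through to ELSE and keep their value.
cleaned = conn.execute(
    """
    SELECT customer_id,
           CASE
               WHEN first_name = 'Tnoy' THEN 'Tony'
               ELSE first_name
           END AS cleaned_name
    FROM customers
    ORDER BY customer_id
    """
).fetchall()

# Verification: compare the total row count with the count of non-null names.
total_rows, non_null_names = conn.execute(
    "SELECT COUNT(*), COUNT(first_name) FROM customers"
).fetchone()

# Verification: confirm the misspelling no longer appears in the cleaned result.
misspelled_after = sum(1 for _, name in cleaned if name == "Tnoy")

print(cleaned)
print(total_rows, non_null_names, misspelled_after)
```

The gap between `total_rows` and `non_null_names` points to a missing value that the `CASE` statement did not address, so it still needs a decision before analysis.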

#### Documenting results and the cleaning process
- Documentation
    - The process of tracking changes, additions, deletions and errors involved in your data cleaning effort
- Having a record of how a data set evolved does three very important things
    - Recover data-cleaning errors
        - Instead of scratching our heads, trying to remember what we might have done three months ago, we have a cheat sheet to rely on if we come across the same errors again later
    - Inform other users of changes you've made
        - If you ever go on vacation or get promoted, the analyst who takes over for you will have a reference sheet to check in with
    - Determine the quality of the data
- Changelog
    - Spreadsheet - Version history
    - BigQuery - Query history
- What do engineers, writers, and data analysts have in common? 
    - Change
    - Engineering Change Orders (ECO)
        - Engineers use engineering change orders (ECOs) to keep track of new product design details and proposed changes to existing products
    - Document revision histories
        - Writers use document revision histories to keep track of changes to document flow and edits
    - Changelogs
        - Data analysts use changelogs to keep track of data transformation and cleaning
- Version histories record what was done in a data change for a project, but don't tell us why
- Changelogs are super useful for helping us understand the reasons changes have been made
- Changelogs have no set format and you can even make your entries in a blank document. But if you are using a shared changelog, it is best to agree with other data analysts on the format of all your log entries.
- Typically, a changelog records this type of information:  
    - Data, file, formula, query, or any other component that changed
    - Description of what changed
    - Date of the change
    - Person who made the change
    - Person who approved the change 
    - Version number 
    - Reason for the change
- By following up, you would ensure data integrity outside your project. You would also be showing personal integrity as someone who can be trusted with data. That is the power of a changelog!
- Best practices for changelogs
    - Changelogs are for humans, not machines, so write legibly.
    - Every version should have its own entry.
    - Each change should have its own line.
    - Group the same types of changes. For example, Fixed should be grouped separately from Added.
    - Versions should be ordered chronologically starting with the latest.
    - The release date of each version should be noted.
- Types of changes usually fall into one of the following categories:
    - Added: new features introduced
    - Changed: changes in existing functionality
    - Deprecated: features about to be removed
    - Removed: features that have been removed
    - Fixed: bug fixes
    - Security: lowering vulnerabilities
- Sample Changelog
    - Version 1.0.0 (02-23-2019)
    - New
        - Added column classifiers (Date, Time, PerUnitCost, TotalCost, etc. )
        - Added Column “AveCost” to track average item cost
    - Changes 
        - Changed date format to MM-DD-YYYY
        - Removal of whitespace (cosmetic)
    - Fixes
        - Fixed misalignment in Column "TotalCost" where some rows did not match with correct dates
        - Fixed SUM to run over entire column instead of partial
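
Since changelogs have no set format, the following is only one possible sketch of how the best practices above could be applied in a script: one entry per version, one line per change, changes grouped by type, and versions listed with the latest first. The class names `Change` and `Version`, the function `render_changelog`, the author names, the reasons, and version 1.1.0 are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Change:
    kind: str         # type of change, e.g. "Added", "Changed", "Fixed"
    description: str  # what changed
    author: str       # person who made the change
    reason: str       # why the change was made


@dataclass
class Version:
    number: str
    released: date
    changes: list[Change]


def render_changelog(versions: list[Version]) -> str:
    """Render versions newest first, with each version's changes grouped by type."""
    lines = []
    for version in sorted(versions, key=lambda v: v.released, reverse=True):
        lines.append(f"Version {version.number} ({version.released:%m-%d-%Y})")
        # Keep each change type once, in the order it first appears.
        kinds = list(dict.fromkeys(change.kind for change in version.changes))
        for kind in kinds:
            lines.append(f"  {kind}")
            for change in version.changes:
                if change.kind == kind:
                    lines.append(
                        f"    - {change.description} "
                        f"({change.author}; reason: {change.reason})"
                    )
    return "\n".join(lines)


# Hypothetical entries for the demo.
versions = [
    Version("1.0.0", date(2019, 2, 23), [
        Change("Added", "Added column AveCost", "A. Analyst", "track average item cost"),
        Change("Changed", "Changed date format to MM-DD-YYYY", "A. Analyst", "consistency"),
        Change("Fixed", "Fixed SUM to run over entire column", "B. Reviewer", "partial total"),
    ]),
    Version("1.1.0", date(2019, 3, 10), [
        Change("Fixed", "Removed duplicate rows", "A. Analyst", "double-counted sales"),
    ]),
]

changelog_text = render_changelog(versions)
print(changelog_text)
```

Storing the reason alongside each change is the part a spreadsheet's version history does not capture on its own.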

#### Glossary
https://docs.google.com/document/d/1WicDnBSCvoxP-hu3ZyM8tH6XRupW7HAeIiA2Vpfbycs/template/preview

---

## Module 5: Adding data to your resume

### Learning log

#### The data analyst hiring process
- Make sure it's a competitive offer before you sign. Remember, if they reach out to you with an offer, that means they want you as much as you want them.
- There's usually some room to negotiate your salary, vacation days, or something else. Keep in mind, you'll need to find a balance between what you want, what they want to give you, and what's fair.
- Know your own worth but also understand that the company hiring you has already placed a certain value on your role.
- Give yourself at least two weeks before you officially start. Why? Well, if you're already employed somewhere else during your job search, it's customary and polite to give at least a two-week notice at your old job before starting at the new one.
- When you take a picture, you usually try to capture lots of different things in one image. Maybe you're taking a picture of the sunset and want to capture the clouds, the tree line and the mountains. Basically, you want a snapshot of that entire moment. You can think of building a resume in the same way. You want your resume to be a snapshot of all that you've done both in school and professionally.
- Contact information
    - Name
    - Address
    - Phone number
    - Email address
- Accomplished [x] as measured by [y] by doing [z]
- If you are transitioning from a different career and don’t yet have relevant work experience, then you may want to pick a format that highlights your technical skills and portfolio projects. Some resume formats include a Summary or Goals section at the top to help candidates add context to their application, while other resume formats avoid these sections completely and save that space for sections such as Skills and Experience.
- Whatever format you pick, make sure to follow the one-page rule and keep the completed version on just a single page. 
- If the one-page rule seems limiting, think about the purpose resumes serve in the hiring process overall.
- Resumes are short documents designed to communicate the most pertinent information about yourself to recruiters and hiring managers at a glance.
- These are different from longer, multi-page curricula vitae (CVs) that exhaustively list every relevant thing the candidate has ever done. 
- If an employer wants a detailed history of your past work experiences and accolades, they might specifically request a CV (curriculum vitae) instead. If they don’t, always assume they prefer a resume.
- While it is generally considered acceptable for resumes of applicants with extensive work history applying for senior technical roles to have two-page resumes, these are the exception rather than the rule. When applying for a data analyst position, keep it to one page!

#### Understand the elements of a data analyst resume

#### Highlighting experiences on resumes

#### Exploring areas of interest