# Data integrity & analytics objective

### Data integrity
The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle

### Data replication
The process of storing data in multiple locations

**Data replication compromising data integrity:** Continuing with the example, imagine you ask your international counterparts to verify dates and stick to one format. One analyst copies a large dataset to check the dates. But because of memory issues, only part of the dataset is actually copied. The analyst would be verifying and standardizing incomplete data. That partial dataset would be certified as compliant but the full dataset would still contain dates that weren't verified. Two versions of a dataset can introduce inconsistent results. A final audit of results would be essential to reveal what happened and correct all dates. 
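One way to catch a partial copy before anyone verifies it is to compare the copy against the source. The sketch below is only an illustration: the `fingerprint` and `copy_is_complete` helpers and the sample rows are hypothetical, and it assumes each record can be represented as a line of text.

```python
import hashlib

def fingerprint(rows):
    """Return (row_count, SHA-256 hex digest) for an iterable of text rows."""
    digest = hashlib.sha256()
    count = 0
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")  # separator so row boundaries affect the hash
        count += 1
    return count, digest.hexdigest()

def copy_is_complete(source_rows, copied_rows):
    """True only if the copy has the same rows, in the same order, as the source."""
    return fingerprint(source_rows) == fingerprint(copied_rows)

# Hypothetical data: three records in the source, two in the truncated copy.
source = ["2021-01-05,ACME", "2021-02-11,Globex", "2021-03-19,Initech"]
partial_copy = source[:2]  # simulates a copy cut short by memory issues

print(copy_is_complete(source, list(source)))  # True
print(copy_is_complete(source, partial_copy))  # False
```

Comparing the row count alone would already flag this case; the hash additionally catches copies where rows were altered but the count stayed the same.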

### Data transfer
The process of copying data from a storage device to memory, or from one computer to another

**Data transfer compromising data integrity:** Another analyst checks the dates in a spreadsheet and chooses to import the validated and standardized data back to the database. But suppose the date field from the spreadsheet was incorrectly classified as a text field during the data import (transfer) process. Now some of the dates in the database are stored as text strings. At this point, the data needs to be cleaned to restore its integrity. 
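Cleaning this kind of problem usually means finding the values stored as text and converting them back to real dates. The sketch below is one possible approach, not a definitive one: the `to_date` helper, the list of accepted formats, and the sample column are all assumptions, and it treats `03/06/2021` as month/day/year.

```python
from datetime import date, datetime

# Assumed text formats; adjust to whatever the import actually produced.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def to_date(value):
    """Return a date for a date object or a recognized text date, else None."""
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return value.date()
    if isinstance(value, date):
        return value
    if not isinstance(value, str):
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue  # try the next format
    return None

# Hypothetical column after a bad import: one real date, the rest text.
column = [date(2021, 3, 4), "2021-03-05", "03/06/2021", "not a date"]
cleaned = [to_date(v) for v in column]
needs_review = [v for v, d in zip(column, cleaned) if d is None]

print(cleaned)
print(needs_review)  # values that could not be converted automatically
```

Values that end up in `needs_review` should be checked by hand rather than guessed at.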

### Data manipulation
The process of changing data to make it more organized and easier to read

**Data manipulation compromising data integrity:** When checking dates, another analyst notices what appears to be a duplicate record in the database and removes it. But it turns out that the analyst removed a unique record for a company’s subsidiary and not a duplicate record for the company. Your dataset is now missing data and the data must be restored for completeness.
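A safer habit is to compare records on every identifying field before deleting anything. The sketch below assumes hypothetical records with `company`, `entity`, and `city` fields; the `true_duplicates` helper is illustrative only.

```python
from collections import defaultdict

# Hypothetical records: same company name, but record 2 is a subsidiary.
records = [
    {"id": 1, "company": "Acme", "entity": "parent", "city": "Austin"},
    {"id": 2, "company": "Acme", "entity": "subsidiary", "city": "Leeds"},
    {"id": 3, "company": "Acme", "entity": "parent", "city": "Austin"},
]

def true_duplicates(rows, key_fields):
    """Group row ids by their key_fields values; return groups with 2+ rows."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        groups[key].append(row["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

# Matching on company name alone wrongly flags the subsidiary.
print(true_duplicates(records, ["company"]))                    # [[1, 2, 3]]
# Matching on all identifying fields flags only the real duplicate pair.
print(true_duplicates(records, ["company", "entity", "city"]))  # [[1, 3]]
```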



### Other threats to data integrity
- Human error
- Viruses
- Malware
- Hacking
- System failures

# Dealing with insufficient data

### Types of insufficient data 
- Data from only one source
- Data that keeps updating
- Outdated data
- Geographically-limited data

### Ways to address insufficient data
- Identify trends with the available data
- Wait for more data if time allows
- Talk with stakeholders and adjust your objective
- Look for a new dataset

# The importance of sample size

### Population
All possible data values in a certain dataset


### Sample size
The number of items in a sample, where a sample is a part of a population that is representative of the population

### Sampling bias
When a sample isn't representative of the population as a whole, for example because some groups are over- or under-represented

### Random sampling
A way of selecting a sample from a population so that every member of the population has an equal chance of being chosen
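As a minimal sketch, Python's standard `random` module can draw a simple random sample. The population of customer IDs and the sample size of 50 are made up for illustration, and the fixed seed is there only so the example is repeatable.

```python
import random

# Hypothetical population: 1,000 customer IDs.
population = list(range(1, 1001))

rng = random.Random(42)                # fixed seed only for repeatability
sample = rng.sample(population, k=50)  # draws 50 IDs without replacement

print(len(sample))       # 50
print(len(set(sample)))  # 50, since no ID is picked twice
```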

### Margin of error
Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. The margin of error is the maximum amount by which the sample’s results are expected to differ from those of the entire population. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population.

### Confidence level
How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study. 

### Confidence interval
The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error.
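The arithmetic is simple enough to sketch directly. The survey numbers below (60% with a 5-point margin of error) are hypothetical, and the `confidence_interval` helper is just an illustration of "sample result +/- margin of error".

```python
def confidence_interval(sample_result, margin_of_error):
    """Return (low, high): the sample result minus and plus the margin of error."""
    return sample_result - margin_of_error, sample_result + margin_of_error

# Hypothetical survey: 60% prefer option A, with a margin of error of 5 points.
low, high = confidence_interval(60.0, 5.0)
print(f"Between {low}% and {high}%")  # Between 55.0% and 65.0%
```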

### Statistical significance
The determination of whether your result could be due to random chance or not. The greater the significance, the less likely the result is due to chance.

### Things to remember when determining the size of your sample
When figuring out a sample size, here are things to keep in mind:

- Don’t use a sample size less than 30. As a rule of thumb, 30 is the smallest sample size at which the average result of a sample starts to represent the average result of a population.

    This recommendation is based on the **Central Limit Theorem (CLT)** in the field of probability and statistics. As sample size increases, the distribution of sample averages more closely resembles the normal (bell-shaped) distribution. A sample of 30 is commonly treated as the smallest size at which this approximation is good enough for practical use. Researchers who rely on regression analysis (statistical methods to determine the relationships between independent and dependent variables) also prefer a minimum sample of 30.


- The confidence level most commonly used is 95%, but 90% can work in some cases. 

Increase the sample size to meet specific needs of your project:

- For a higher confidence level, use a larger sample size

- To decrease the margin of error, use a larger sample size

- For greater statistical significance, use a larger sample size
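The first two relationships above can be checked with a common sample size formula for estimating a proportion. This is a sketch under stated assumptions, not the only way to size a sample: it uses the normal approximation, a finite population correction, and the conservative proportion of 0.5. The function name and the example population of 500 are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(population_size, confidence_level, margin_of_error,
                         proportion=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    # z-score for a two-sided interval at the given confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # size needed for an effectively infinite population
    n0 = z**2 * proportion * (1 - proportion) / margin_of_error**2
    # shrink it for a finite population, then round up to a whole respondent
    return ceil(n0 / (1 + (n0 - 1) / population_size))

baseline = required_sample_size(500, 0.95, 0.05)
higher_confidence = required_sample_size(500, 0.99, 0.05)
smaller_margin = required_sample_size(500, 0.95, 0.03)

print(baseline, higher_confidence, smaller_margin)
```

Both `higher_confidence` and `smaller_margin` come out larger than `baseline`, which matches the guidance in the list.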

# Using statistical power


### Hypothesis testing
A way to see if a survey or experiment has meaningful results


If a test is statistically significant, it means the results of the test are real and not an error caused by random chance

Usually, you need a statistical power of at least 0.8 (80%) to consider your results statistically significant
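To see how sample size affects power, here is a rough sketch that treats power as the probability that a test detects an effect that is really there. It assumes a two-sided, one-sample z-test with a known standard deviation, which is a simplification. The `z_test_power` helper and the effect size of 0.5 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect_size is the true difference divided by the known standard deviation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # rejection threshold
    shift = effect_size * sqrt(n)               # how far the true effect moves the test statistic
    # probability the statistic lands beyond either threshold
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Hypothetical effect size of 0.5 standard deviations.
print(z_test_power(0.5, 20) >= 0.8)  # False: too few observations
print(z_test_power(0.5, 32) >= 0.8)  # True: enough to reach 80% power
```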

# What to do when there is no data
### Proxy data examples
Sometimes the data to support a business objective isn’t readily available. This is when proxy data is useful.
- Business scenario:
    A new car model was just launched a few days ago and the auto dealership can’t wait until the end of the month for sales data to come in. They want sales projections now.
- How proxy data can be used:
    The analyst proxies the number of clicks to the car specifications on the dealership’s website as an estimate of potential sales at the dealership.

### Open (public) datasets
If you are part of a large organization, you might have access to lots of sources of data. But if you are looking for something specific or a little outside your line of business, you can also make use of open or public datasets. (You can refer to this Towards Data Science article for a brief explanation of the difference between open and public data.)

### CSV, JSON, SQLite, and BigQuery datasets

# Evaluate the reliability of your data
### To calculate margin of error
- Population size
- Sample size
- Confidence level
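Given those three inputs, one common way to compute the margin of error for a survey proportion looks like the sketch below. It assumes the normal approximation, a finite population correction, and the conservative proportion of 0.5. The function name and the example numbers are hypothetical, and online calculators may use slightly different formulas.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(population_size, sample_size, confidence_level,
                    proportion=0.5):
    """Margin of error for a proportion, as a fraction (0.05 means 5 points)."""
    # z-score for a two-sided interval at the given confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    standard_error = sqrt(proportion * (1 - proportion) / sample_size)
    # finite population correction: sampling most of a small population helps
    fpc = sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc

# Hypothetical survey: 218 responses from a population of 500, 95% confidence.
moe = margin_of_error(500, 218, 0.95)
print(f"{moe:.1%}")  # roughly 5%
```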
