**SAMPLING**

**CENTRAL LIMIT THEOREM**

The central limit theorem states that, under fairly general conditions, the sum (or mean) of many independent random variables tends toward a normal distribution as the number of variables increases, even when the variables themselves are not normally distributed. This makes it possible to apply normal-distribution reasoning to sample statistics in order to estimate population parameters.
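As a rough illustration of the theorem, the sketch below draws repeated samples from a strongly skewed stand-in population and measures how skewed the resulting sample means are. The exponential population, the `skewness` helper, and the chosen sample sizes are assumptions made for this example, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def skewness(x):
    """Sample skewness: roughly 0 for a symmetric, bell-shaped distribution."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Stand-in population: exponential values, which are strongly right-skewed.
population = rng.exponential(scale=2.0, size=100_000)

# For each sample size n, take 5,000 samples with replacement and keep their means.
skews = {}
for n in (1, 5, 30, 200):
    means = rng.choice(population, size=(5_000, n), replace=True).mean(axis=1)
    skews[n] = skewness(means)
    print(f"n={n:>3}  skewness of sample means = {skews[n]:.2f}")
```

The skewness of the sample means should move toward 0 as n grows, which is the behaviour the theorem describes.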

![image.png](attachment:image.png)

**DETECTING NON-NORMAL DATA SETS**

Before we can make use of the normal distribution, we first need to check whether our data is normally distributed. If it is not, we can rely on the Central Limit Theorem and build a sample distribution of sample means, which will be approximately normal when the samples are large enough.

There are two main ways to check whether a sample follows the normal distribution. The easiest is to plot the data and check visually whether it follows a normal curve.

*USE SNS.HISTPLOT TO SEE GRAPH (SNS.DISTPLOT IS DEPRECATED AS OF SEABORN 0.11)*

For a more formal check, we can use a statistical test. There are many statistical tests for normality, but we'll keep it simple and use the normaltest() function from scipy.stats, which we imported as st. See the documentation if you have questions about how to use this function.

The function tests whether the sample passed in differs from a normal distribution. The null hypothesis is that the data comes from a normal distribution. We typically reject the null hypothesis, and conclude that the data is non-normal, if the p-value is less than 0.05.
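A minimal sketch of the test in use, assuming scipy is installed and imported as `st` as described above. The two arrays are synthetic stand-ins (one skewed, one drawn from a normal distribution), not the dataset from these notes.

```python
import numpy as np
from scipy import stats as st  # same alias the text uses

rng = np.random.default_rng(42)
skewed = rng.exponential(scale=1.0, size=1_000)    # clearly non-normal
bell = rng.normal(loc=0.0, scale=1.0, size=1_000)  # drawn from a normal

alpha = 0.05  # significance threshold used in the text
for name, values in [("skewed", skewed), ("bell", bell)]:
    statistic, p_value = st.normaltest(values)
    if p_value < alpha:
        verdict = "reject the null: data looks non-normal"
    else:
        verdict = "fail to reject the null: no evidence against normality"
    print(f"{name}: statistic={statistic:.2f}, p={p_value:.4f} -> {verdict}")
```

Failing to reject the null does not prove the data is normal; it only means the test found no strong evidence against normality.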

Since our dataset is non-normal, that means we'll need to use the **Central Limit Theorem.**

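To create a sample distribution from the dataset (using create_sample_distribution, defined in Step Three below):

        # 10 sample means, each computed from a sample of size 3
        sample_dist_10 = create_sample_distribution(data, 10, 3)
        # histplot replaces the deprecated distplot
        sns.histplot(sample_dist_10, kde=True);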

**STEP ONE: SAMPLING WITH REPLACEMENT**

In order to create a Sample Distribution of Sample Means, we first need a function that can sample with replacement.

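import numpy as np

def get_sample(data, n):
    """Draw n values from data at random, with replacement."""
    # np.random.choice samples with replacement by default;
    # replace=True is spelled out to make the intent explicit.
    return list(np.random.choice(data, size=n, replace=True))

# `data` is the dataset loaded earlier in the notebook
test_sample = get_sample(data, 30)
print(test_sample[:5])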

**STEP TWO: SAMPLE MEAN**

def get_sample_mean(sample):
    """Return the arithmetic mean of a sample."""
    return sum(sample) / len(sample)

**STEP THREE: SAMPLE DISTRIBUTION OF SAMPLE MEANS**
    
Now that we have helper functions for sampling with replacement and calculating sample means, we can bring them together in a function that creates a sample distribution of sample means.

def create_sample_distribution(data, dist_size=100, n=30):
    """Return dist_size sample means, each computed from a sample of size n."""
    sample_distribution = []
    for _ in range(dist_size):
        sample = get_sample(data, n)  # draw n values with replacement
        sample_distribution.append(get_sample_mean(sample))

    return sample_distribution

test_sample_dist = create_sample_distribution(data)
print(test_sample_dist[:5])
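For larger datasets, the three steps above can also be collapsed into a single vectorized NumPy call. The sketch below is one possible alternative, not the notebook's own method: the function name `create_sample_distribution_vectorized`, the `seed` parameter, and the exponential stand-in data are all assumptions made for illustration.

```python
import numpy as np

def create_sample_distribution_vectorized(data, dist_size=100, n=30, seed=None):
    """Return dist_size sample means, each from n draws with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(data, dtype=float)
    # One row per sample: shape (dist_size, n), drawn with replacement.
    samples = rng.choice(values, size=(dist_size, n), replace=True)
    # Averaging across each row gives one sample mean per sample.
    return samples.mean(axis=1)

# Stand-in data, since the notebook's dataset is not shown here.
demo_data = np.random.default_rng(1).exponential(scale=3.0, size=10_000)
demo_means = create_sample_distribution_vectorized(
    demo_data, dist_size=1_000, n=30, seed=2
)
print(demo_means[:5])
print("mean of sample means:", demo_means.mean())
print("mean of the data:    ", demo_data.mean())
```

The mean of the sample means should land close to the mean of the underlying data, which is what makes the sample mean useful as an estimate.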

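**Point estimates**: estimates of population parameters computed from a sample (for example, a sample mean used to estimate the population mean).

Point estimates behave predictably: across repeated samples, the values of a point estimate form their own probability distribution. For the sample mean, this is the sample distribution of sample means built above.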

*To take a sample*

    sample = df.sample(n=50, random_state=22)

*To look at percent error*

    err = np.abs(sample.Age.mean() - df.Age.mean())
    per_err = err / df.Age.mean() * 100  # relative error expressed as a percentage
    print(per_err)
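To see how the error of a point estimate depends on sample size, the sketch below repeats the calculation over many samples and averages the result. The `demo_df` DataFrame with random ages, the `mean_relative_error` helper, and the sample sizes are hypothetical stand-ins, since the real `df` is not shown in these notes.

```python
import numpy as np
import pandas as pd

# Stand-in DataFrame with an Age column; the real df is not shown here.
rng = np.random.default_rng(22)
demo_df = pd.DataFrame({"Age": rng.integers(1, 80, size=5_000)})
population_mean = demo_df["Age"].mean()

def mean_relative_error(frame, n, trials=200):
    """Average relative error of the sample mean over repeated samples of size n."""
    errors = []
    for trial in range(trials):
        # A different random_state per trial gives a different sample each time.
        sample = frame.sample(n=n, random_state=trial)
        errors.append(abs(sample["Age"].mean() - population_mean) / population_mean)
    return float(np.mean(errors))

avg_errors = {n: mean_relative_error(demo_df, n) for n in (10, 50, 500)}
for n, avg_err in avg_errors.items():
    print(f"n={n:>3}: average relative error = {avg_err:.2%}")
```

Larger samples should give a smaller average error, although any single sample can still miss by more than the average.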
    
