### Q1. What are the three measures of central tendency?

The three measures of central tendency are:

1. Mean: It is calculated by summing up all the values in a dataset and dividing by the total number of values. It is the most commonly used measure of central tendency.

2. Median: It is the middle value in a dataset when the values are arranged in order. If the dataset has an even number of values, then the median is the average of the two middle values.

3. Mode: It is the value that appears most frequently in a dataset. A dataset may have one mode (unimodal) or more than one mode (multimodal).

### Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

The mean, median, and mode are all measures of central tendency used to describe the "average" value in a dataset, but they are calculated differently and have different interpretations:

1. Mean is the arithmetic average of the dataset. Because it uses every value, it is sensitive to outliers or extreme values.

   To find the mean, you add up all the values and divide by the number of values. For example, in the dataset 2, 4, 6, 8, 10, the mean is (2+4+6+8+10)/5 = 6.

2. Median is the middle value of the ordered data. It is much less affected by extreme values or outliers than the mean, making it a better measure of central tendency for datasets with skewed distributions.

   To find the median, you first arrange the values in the dataset in order from lowest to highest (or highest to lowest). Then, if the dataset has an odd number of values, the median is the middle value. For example, in the dataset 2, 4, 6, 8, 10, the median is 6. If the dataset has an even number of values, the median is the average of the two middle values. For example, in the dataset 2, 4, 6, 8, the median is (4+6)/2 = 5.

3. Mode is useful for describing the most common value or category in a dataset, especially for categorical or nominal data.

   To find the mode, you identify the value or values that occur most frequently in the dataset. For example, in the dataset 2, 4, 6, 8, 8, the mode is 8, because it appears twice and no other value appears more than once.

Each of these measures summarizes the typical value in a dataset, which is useful for describing the distribution of the data and for making comparisons between different datasets.
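The three calculations can be checked with a short sketch using Python's standard `statistics` module. The small datasets and variable names here are illustrative assumptions, and `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

# Small illustrative datasets (assumed for this sketch)
odd_data = [2, 4, 6, 8, 10]   # odd number of values
even_data = [2, 4, 6, 8]      # even number of values
repeated = [2, 4, 6, 8, 8]    # one value occurs more than once

mean_value = statistics.mean(odd_data)       # sum of values / number of values
median_odd = statistics.median(odd_data)     # the single middle value
median_even = statistics.median(even_data)   # average of the two middle values
modes = statistics.multimode(repeated)       # list of every most frequent value

print("Mean:", mean_value)
print("Median (odd count):", median_odd)
print("Median (even count):", median_even)
print("Mode(s):", modes)
```

`multimode` returns a list, so it also covers datasets with more than one mode.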

### Q3. Calculate the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [1]:
df = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [10]:
import numpy as np
from scipy import stats

In [20]:
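def centen(data):
    # Compute the three measures of central tendency for a sequence of numbers
    mean = np.mean(data)
    median = np.median(data)
    mode = stats.mode(data)  # returns a ModeResult with the mode and its count
    
    print(f"Mean Value of Data : {mean}\nMedian value of Data : {median}\nMode Value of Data : {mode}")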

In [21]:
centen(df)

Mean Value of Data : 177.01875
Median value of Data : 177.0
Mode Value of Data : ModeResult(mode=array([177.]), count=array([3]))
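One detail the `ModeResult` output hides is that this dataset has a tie: both 177 and 178 occur three times, and `stats.mode` reports a single value. As a hedged sketch that is not part of the original notebook, the standard library's `statistics.multimode` (Python 3.8 or later) lists every tied value. The names `heights` and `all_modes` are assumptions made for this example.

```python
import statistics

# Same height data as above, under a hypothetical name
heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns every value that shares the highest frequency
all_modes = statistics.multimode(heights)
print("All modes:", sorted(all_modes))
```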




### Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [22]:
std = np.std(df)
print("Standard Deviation of Data :",std)

Standard Deviation of Data : 1.7885814036548633
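Note that `np.std` divides by n by default (population standard deviation, `ddof=0`). If the heights are treated as a sample, the n - 1 version is usually wanted instead. A minimal sketch of both, with `heights` as an assumed name for the same data:

```python
import numpy as np

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

population_std = np.std(heights)       # ddof=0: divide by n
sample_std = np.std(heights, ddof=1)   # ddof=1: divide by n - 1

print("Population standard deviation:", population_std)
print("Sample standard deviation    :", sample_std)
```

The sample value is always slightly larger, and the gap shrinks as the number of observations grows.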


### Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

Measures of dispersion such as range, variance, and standard deviation are used to describe the spread or variability of a dataset. They help to provide additional information beyond the central tendency measures by indicating how much the values in a dataset deviate from the central value or mean.

1. Range: The range is the difference between the highest and lowest values in a dataset. It is a simple measure of dispersion that can be useful for quickly assessing the spread of a dataset.
   
   For example, in the dataset 2, 4, 6, 8, 10, the range is 10-2=8.

2. Variance: The variance measures how far the values in a dataset spread out from the mean. The sample variance is calculated by subtracting the mean from each value, squaring the differences, adding them up, and dividing by the number of values minus one.
   
   For example, consider the dataset 2, 4, 6, 8, 10. The mean is (2+4+6+8+10)/5=6. The differences between each value and the mean are -4, -2, 0, 2, and 4. Squaring these differences gives 16, 4, 0, 4, and 16. Adding these up gives 40. Dividing by 4 (the number of values minus one) gives a variance of 10.

3. Standard deviation: The standard deviation is the square root of the variance. It is a commonly used measure of dispersion that indicates how much the values in a dataset are spread out around the mean.
   
   For the same dataset used in the variance example, the standard deviation would be the square root of 10, which is approximately 3.16.

Together, these measures are useful for assessing the variability in a dataset and for comparing the spread of different datasets.
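The worked example for the dataset 2, 4, 6, 8, 10 can be reproduced with NumPy. This is a small sketch; passing `ddof=1` selects the n - 1 (sample) formulas used in the text, and the variable names are assumptions.

```python
import numpy as np

data = np.array([2, 4, 6, 8, 10])

data_range = data.max() - data.min()     # highest value minus lowest value
sample_variance = np.var(data, ddof=1)   # sum of squared deviations / (n - 1)
sample_std = np.std(data, ddof=1)        # square root of the sample variance

print("Range:", data_range)
print("Sample variance:", sample_variance)
print("Sample standard deviation:", sample_std)
```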

### Q6. What is a Venn diagram?

A Venn diagram is a visual tool used to illustrate the relationships between sets of data. It consists of overlapping circles or other shapes, with each circle representing a set and the overlapping regions representing the intersection between sets.

### Q7. For the two given sets A = {2,3,4,5,6,7} and B = {0,2,6,8,10}, find:

1. A ∩ B = {2, 6}
2. A ∪ B = {0, 2, 3, 4, 5, 6, 7, 8, 10}
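Both results can be verified with Python's built-in `set` type. In this short sketch, `&` computes the intersection and `|` the union.

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B   # elements that are in both A and B
union = A | B          # elements that are in A, in B, or in both

print("A ∩ B =", sorted(intersection))
print("A ∪ B =", sorted(union))
```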

### Q8. What do you understand about skewness in data?

Skewness is a measure of the asymmetry of the distribution of values in a dataset. A dataset is said to be skewed if its values are not spread symmetrically around the mean. In a skewed distribution, the tail is longer on one side than on the other.

Skewness can be positive, negative, or zero. Positive (right) skew means the longer tail extends toward larger values, negative (left) skew means the longer tail extends toward smaller values, and a symmetric distribution has zero skewness.

### Q9. If data is right-skewed, what is the position of the median with respect to the mean?

If a dataset is right-skewed, the median will typically be less than the mean.

This is because the right-skewness is caused by a few large values in the dataset, which pull the mean towards the right or positive side of the distribution. As a result, the mean will be higher than the median, which represents the middle value of the dataset.
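A small right-skewed dataset illustrates this. The values below are made up for this sketch, and the skewness is computed by hand as the third central moment divided by the second central moment raised to the power 1.5, which is one common definition among several.

```python
import numpy as np

# Hypothetical right-skewed data: most values are small, one is large
values = np.array([1, 2, 2, 3, 3, 3, 4, 20])

mean_value = np.mean(values)
median_value = np.median(values)

# Moment-based skewness: m3 / m2**1.5, using population (divide-by-n) moments
deviations = values - mean_value
m2 = np.mean(deviations ** 2)
m3 = np.mean(deviations ** 3)
skewness = m3 / m2 ** 1.5

print("Mean    :", mean_value)
print("Median  :", median_value)
print("Skewness:", skewness)
```

The single large value pulls the mean above the median, and the skewness is positive.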

### Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance and correlation are two measures used to describe the relationship between two variables in a dataset:

1. Covariance measures how much two variables change together. Its sign gives the direction of the linear relationship, but its magnitude depends on the units of the variables, so it does not by itself indicate how strong the relationship is.
   
   If two variables have a positive covariance, it means they tend to increase or decrease together. If two variables have a negative covariance, it means they tend to move in opposite directions.

2. Correlation is a standardized form of covariance: the covariance divided by the product of the two standard deviations. This makes it unit-free and bounded between -1 and 1, so relationships between variables with different units of measurement can be compared.
   
   A correlation of 1 indicates a perfect positive linear relationship between two variables, a correlation of -1 indicates a perfect negative linear relationship, and a correlation of 0 indicates no linear relationship (the variables may still be related in a nonlinear way).
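As a hedged illustration, the sketch below computes both measures with NumPy for a made-up dataset. The variables `hours_studied` and `exam_score` are hypothetical. It also checks that the Pearson correlation equals the covariance divided by the product of the two sample standard deviations.

```python
import numpy as np

# Hypothetical paired observations
hours_studied = np.array([1, 2, 3, 4, 5])
exam_score = np.array([52, 55, 61, 64, 70])

# np.cov returns a 2x2 matrix; the off-diagonal entry is the sample covariance
covariance = np.cov(hours_studied, exam_score)[0, 1]

# np.corrcoef returns the matrix of Pearson correlation coefficients
correlation = np.corrcoef(hours_studied, exam_score)[0, 1]

# Correlation as standardized covariance (sample standard deviations, ddof=1)
standardized = covariance / (np.std(hours_studied, ddof=1) * np.std(exam_score, ddof=1))

print("Covariance :", covariance)
print("Correlation:", correlation)
print("Cov / (sx * sy):", standardized)
```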

### Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

The formula for calculating the sample mean is:

sample mean = sum of all data points / number of data points

For example, let's say we have the following dataset:

3, 5, 6, 7, 2, 8, 9, 1

To calculate the sample mean, we would first add up all the data points:

3 + 5 + 6 + 7 + 2 + 8 + 9 + 1 = 41

Next, we would divide by the number of data points, which in this case is 8:

41 / 8 = 5.125

Therefore, the sample mean for this dataset is 5.125.
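The same calculation in plain Python, as a quick check of the arithmetic:

```python
data = [3, 5, 6, 7, 2, 8, 9, 1]

total = sum(data)         # sum of all data points
n = len(data)             # number of data points
sample_mean = total / n   # sample mean = total / n

print("Sum:", total)
print("Count:", n)
print("Sample mean:", sample_mean)
```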

### Q12. For normally distributed data, what is the relationship between its measures of central tendency?

For a normal distribution, the measures of central tendency (mean, median, and mode) are all equal to each other, because the distribution is symmetric and has a single peak at its center.

### Q13. How is covariance different from correlation?

Covariance and correlation are both measures that describe the relationship between two variables in a dataset. However, they differ in their scale, interpretation, and usefulness.

1. Scale: Covariance is not standardized, while correlation is standardized and ranges from -1 to 1.
2. Interpretation: Covariance measures the degree to which two variables co-vary, while correlation measures the strength and direction of the linear relationship between two variables.
3. Usefulness: Covariance can be difficult to interpret and compare across datasets because it is affected by the units of measurement of the variables, while correlation is a more useful and commonly used measure of the relationship between two variables because it is standardized and can be easily compared across datasets.
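The effect of units can be seen in a short sketch. The height and weight values below are invented for illustration. Converting heights from centimeters to meters changes the covariance by a factor of 100 but leaves the correlation unchanged.

```python
import numpy as np

# Hypothetical measurements for five people
height_cm = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weight_kg = np.array([55.0, 62.0, 66.0, 74.0, 80.0])
height_m = height_cm / 100   # same heights expressed in meters

cov_cm = np.cov(height_cm, weight_kg)[0, 1]
cov_m = np.cov(height_m, weight_kg)[0, 1]

corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]

print("Covariance  (cm, kg):", cov_cm)
print("Covariance  (m, kg) :", cov_m)
print("Correlation (cm, kg):", corr_cm)
print("Correlation (m, kg) :", corr_m)
```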

### Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers can have a significant impact on measures of central tendency and dispersion, especially on the mean and standard deviation.

Let's take an example. Suppose we have a dataset of payouts: 25,000, 35,000, 40,000, 36,000, 59,000

The mean of the payouts is:

mean = (25,000 + 35,000 + 40,000 + 36,000 + 59,000)/5 = 39,000

Now add one more payout, an outlier, to the dataset: 25,000, 35,000, 40,000, 36,000, 59,000, 1,019,000

The mean of the new dataset is:

mean = (25,000 + 35,000 + 40,000 + 36,000 + 59,000 + 1,019,000)/6 ≈ 202,333.33

A single outlier pulls the mean from 39,000 to about 202,333, which is larger than five of the six payouts. The median, by contrast, only moves from 36,000 to 38,000.

In [24]:
data = (25000,35000,40000,36000,59000)
np.mean(data)

39000.0

In [25]:
data1 = (25000,35000,40000,36000,59000,1019000)  # outlier included as a sixth value
np.mean(data1)

202333.33333333334
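The question also asks about dispersion. The sketch below compares several summary statistics with and without the outlier. The helper `summarize` and its dictionary keys are hypothetical names introduced for this example.

```python
import numpy as np

def summarize(values):
    """Return central tendency and dispersion measures for a sequence of numbers."""
    values = np.asarray(values, dtype=float)
    return {
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "range": float(values.max() - values.min()),
        "std": float(np.std(values)),   # population standard deviation (ddof=0)
    }

payouts = [25000, 35000, 40000, 36000, 59000]
payouts_with_outlier = payouts + [1019000]

without_outlier = summarize(payouts)
with_outlier = summarize(payouts_with_outlier)

for name in ("mean", "median", "range", "std"):
    print(f"{name:>6}: {without_outlier[name]:>12.2f} -> {with_outlier[name]:>12.2f}")
```

The mean, range, and standard deviation all change drastically, while the median shifts only slightly, which is why the median is often preferred when outliers are present.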