In [2]:
import pandas as pd
import numpy as np
import seaborn as sns

#### Q1. What are the three measures of central tendency?

Mean, Median, Mode

#### Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

The mean, median, and mode are three different measures of central tendency used to describe the center of a dataset. The main differences between them are:

1. Mean: The mean is the arithmetic average of a set of data. It is calculated by summing up all the values in the dataset and dividing by the number of values. The mean is affected by extreme values, or outliers, and is generally used to describe datasets with a symmetrical distribution.

2. Median: The median is the middle value of a set of data when the values are arranged in order of magnitude. If the dataset has an even number of values, the median is the average of the two middle values. The median is not affected by extreme values and is generally used to describe datasets with a skewed distribution.

3. Mode: The mode is the value that occurs most frequently in a set of data. A dataset can have multiple modes, or no mode at all if all values occur with the same frequency. The mode is not affected by extreme values and is generally used to describe datasets with a categorical or nominal variable.

These measures are used to measure the central tendency of a dataset because they provide information about where the center of the data is located. For example, if the mean, median, and mode are all close in value, it suggests that the data is clustered around a central value. If the mean is much larger or smaller than the median or mode, it suggests that the data is skewed in one direction or the other. The choice of measure to use depends on the type of data and the research question being addressed.
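The effect described above can be sketched with Python's standard `statistics` module. The `values` list is a hypothetical sample chosen for illustration (it is not from the assignment data), with one large value included to show how the mean reacts while the median and mode do not.

```python
from statistics import mean, median, multimode

# Hypothetical sample with one large outlier (illustrative values only).
values = [2, 3, 3, 4, 5, 5, 5, 40]

sample_mean = mean(values)        # pulled upward by the outlier 40
sample_median = median(values)    # average of the two middle sorted values
sample_modes = multimode(values)  # most frequent value(s), returned as a list

print("mean:", sample_mean)
print("median:", sample_median)
print("mode(s):", sample_modes)
```

Here the mean (8.375) is well above the median (4.5) and the mode (5), which is the pattern expected when a dataset is skewed to the right.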

#### Q3. Measure the three measures of central tendency for the given height data:

In [1]:
height = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [3]:
df = pd.DataFrame(height)

In [4]:
df.mean()

0    177.01875
dtype: float64

In [5]:
df.median()

0    177.0
dtype: float64

In [6]:
df.mode()

       0
0  177.0
1  178.0


#### Q4. Find the standard deviation for the given data:

In [7]:
df.std()

0    1.847239
dtype: float64

#### Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

Measures of dispersion such as range, variance, and standard deviation are used to describe the spread of a dataset.

1. Range: The range is the difference between the maximum and minimum values in a dataset. It provides a quick estimate of the spread of the data, but it is highly influenced by extreme values.

2. Variance: The variance measures how far the values in a dataset deviate from the mean. It is calculated by summing the squared deviations of each value from the mean and dividing by the number of values n (population variance) or by n - 1 (sample variance, which is what pandas `var()` and `std()` use by default). A higher variance indicates that the data is more spread out, while a lower variance indicates that the data is clustered closer to the mean.

3. Standard deviation: The standard deviation is the square root of the variance. It is a commonly used measure of dispersion and is more interpretable than the variance, as it is expressed in the same units as the original data. A higher standard deviation indicates that the data is more spread out, while a lower standard deviation indicates that the data is clustered closer to the mean.

For example, consider the following dataset of test scores for a class: 75, 80, 85, 90, 95. The range of this dataset is 20, the population variance is 50, and the population standard deviation is approximately 7.1 (the sample versions, which divide by n - 1, are 62.5 and approximately 7.9). The range shows that the scores span 20 points from lowest to highest, while the standard deviation shows that a typical score lies about 7 points from the mean of 85.
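The figures for the test-score example can be checked with a short sketch using the standard `statistics` module. It computes both the population versions (dividing by n) and the sample versions (dividing by n - 1), since the two are easy to confuse. The variable names are illustrative.

```python
from statistics import pstdev, pvariance, stdev, variance

scores = [75, 80, 85, 90, 95]

score_range = max(scores) - min(scores)  # 95 - 75

pop_var = pvariance(scores)    # squared deviations divided by n
pop_sd = pstdev(scores)        # square root of the population variance

sample_var = variance(scores)  # squared deviations divided by n - 1
sample_sd = stdev(scores)      # square root of the sample variance

print("range:", score_range)
print("population variance / sd:", pop_var, round(pop_sd, 2))
print("sample variance / sd:", sample_var, round(sample_sd, 2))
```

The values 50 and roughly 7.1 are the population results. The `df.std()` call used in Q4 follows the sample convention, so the same data passed to pandas would give roughly 7.9 instead.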

#### Q6. What is a Venn diagram?

A Venn diagram is a graphical representation of the relationships between sets, often used in mathematics, logic, and statistics. It consists of overlapping circles or other shapes, with each circle representing a set and the overlapping areas representing the intersections of those sets. The Venn diagram was developed by John Venn in the late 19th century as a way of visualizing set theory concepts.

#### Q7. For the two given sets A = {2,3,4,5,6,7} & B = {0,2,6,8,10}. Find:
(i) A intersection B
(ii) A union B

(i): A ∩ B = {2,6}
(ii): A ∪ B = {0,2,3,4,5,6,7,8,10}
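The same result can be reproduced with Python's built-in `set` type, where `&` is intersection and `|` is union. This is a small check of the answer, not part of the original assignment.

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B  # elements present in both sets
union = A | B         # elements present in at least one set

# Sets are unordered, so sort before printing for a stable display.
print(sorted(intersection))  # [2, 6]
print(sorted(union))         # [0, 2, 3, 4, 5, 6, 7, 8, 10]
```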

#### Q8. What do you understand about skewness in data?

Skewness is a measure of the asymmetry of a distribution: it indicates how far a dataset deviates from symmetry around its mean. A symmetric distribution has a skewness of zero. A distribution is right (positively) skewed when its longer tail extends toward larger values, and left (negatively) skewed when its longer tail extends toward smaller values.

#### Q9. If data is right skewed, what will be the position of the median with respect to the mean?

If the data is right skewed (positively skewed), the tail of the distribution extends to the right, toward larger values. Those large values pull the mean upward more than they move the median, so the median is typically less than the mean.
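A minimal sketch of this relationship, using a hypothetical right-skewed sample (the values are made up for illustration). Skewness is computed here as the simple moment coefficient m3 / m2^1.5, which is an assumption of this example. pandas' `skew()` applies a bias correction, so it would report a somewhat different number with the same sign.

```python
from statistics import mean, median

# Hypothetical right-skewed sample: most values are small, one is large.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

m = mean(data)
med = median(data)

# Moment coefficient of skewness (no bias correction).
n = len(data)
m2 = sum((x - m) ** 2 for x in data) / n  # second central moment
m3 = sum((x - m) ** 3 for x in data) / n  # third central moment
skewness = m3 / m2 ** 1.5

print("mean:", m)
print("median:", med)
print("skewness is positive:", skewness > 0)
```

The mean (4.7) sits above the median (3) and the skewness is positive, as expected for a long right tail.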

#### Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

Covariance and correlation are two measures used in statistics to describe the relationship between two variables.

Covariance measures the extent to which two variables vary together. It is a measure of the linear association between two variables, indicating whether they tend to increase or decrease together. A positive covariance indicates that the two variables tend to move in the same direction, while a negative covariance indicates that they tend to move in opposite directions. However, the magnitude of the covariance depends on the units of the variables, so on its own it does not indicate how strong the relationship is.

Correlation, on the other hand, measures the strength and direction of the linear relationship between two variables. It is a standardized version of covariance (the covariance divided by the product of the two standard deviations), with values ranging from -1 to +1. A correlation coefficient of +1 indicates a perfect positive linear relationship, while a coefficient of -1 indicates a perfect negative linear relationship. A coefficient of 0 indicates no linear relationship between the variables, although a non-linear relationship may still exist. Correlation provides a more interpretable measure of association, as it is not affected by differences in scale or units between the variables.

Both covariance and correlation are used in statistical analysis to understand the relationship between two variables. Covariance can be used to identify whether two variables are associated, and whether the relationship is positive or negative. Correlation provides additional information about the strength and direction of the relationship, allowing researchers to make more informed decisions about the nature of the relationship and the appropriate statistical tests to use.
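The difference in scale sensitivity can be sketched as follows. The data (`hours` studied and exam `scores`) and the helper functions `sample_covariance` and `pearson_correlation` are hypothetical names written out by hand for illustration. They use the sample (n - 1) convention.

```python
from math import sqrt
from statistics import mean

def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sd_x = sqrt(sample_covariance(xs, xs))
    sd_y = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)

# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

cov = sample_covariance(hours, scores)
corr = pearson_correlation(hours, scores)

# Re-express the same study time in minutes: only the units change.
minutes = [h * 60 for h in hours]
cov_minutes = sample_covariance(minutes, scores)
corr_minutes = pearson_correlation(minutes, scores)

print("covariance (hours):", cov)
print("covariance (minutes):", cov_minutes)
print("correlation (hours):", round(corr, 4))
print("correlation (minutes):", round(corr_minutes, 4))
```

Changing the units multiplies the covariance by 60 (10.25 becomes 615) while the correlation stays at about 0.994, which is why correlation is the easier quantity to compare across datasets.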

#### Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

The sample mean is the sum of all values divided by the number of values in the dataset: x̄ = (x₁ + x₂ + ... + xₙ) / n, where n is the number of values.

For example, suppose we have the following dataset:

3, 6, 8, 2, 5

To calculate the sample mean, we add up all the values in the dataset and divide by the total number of values:

x̄ = (3 + 6 + 8 + 2 + 5) / 5 = 4.8

Therefore, the sample mean for this dataset is 4.8.
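The same calculation in plain Python, as a quick check of the arithmetic:

```python
data = [3, 6, 8, 2, 5]

# Sample mean: sum of the values divided by the number of values.
sample_mean = sum(data) / len(data)

print(sample_mean)  # 4.8
```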

#### Q12. For normally distributed data, what is the relationship between its measures of central tendency?

For a normal distribution, the measures of central tendency (mean, median, and mode) are all equal. This means that the peak of the distribution (mode), the middle of the distribution (median), and the arithmetic average of all the values in the distribution (mean) are all at the same point.
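This can be illustrated by simulation. The sketch below draws a large sample from a normal distribution and compares its mean and median. The parameters (mean 50, standard deviation 10), the seed, and the sample size are arbitrary assumptions chosen for the example. The mode is left out because continuous samples rarely repeat a value exactly, so estimating it would require binning.

```python
import random
from statistics import fmean, median

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed parameters for illustration: mean 50, standard deviation 10.
sample = [rng.gauss(50, 10) for _ in range(100_000)]

sample_mean = fmean(sample)
sample_median = median(sample)

print("mean:", round(sample_mean, 2))
print("median:", round(sample_median, 2))
print("difference:", round(abs(sample_mean - sample_median), 2))
```

In a finite sample the two values will not match exactly, but they should both land close to 50 and close to each other.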

#### Q13. How is covariance different from correlation?

As described in Q10, the two measures differ mainly in scale and interpretability:

1. Covariance indicates only the direction of the linear relationship between two variables. Its value is unbounded and is expressed in the product of the two variables' units, so its magnitude changes when the variables are rescaled and cannot be compared across datasets.

2. Correlation is the covariance divided by the product of the two standard deviations. It is unitless and always lies between -1 and +1, so it indicates both the direction and the strength of the linear relationship. A coefficient of 0 indicates no linear relationship.

#### Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

Outliers are data points that are significantly different from other data points in a dataset. Outliers can have a significant impact on measures of central tendency and dispersion, as they can skew the distribution and affect the calculation of summary statistics.

In terms of measures of central tendency, outliers can significantly affect the mean, which is particularly sensitive to extreme values. However, outliers are less likely to affect the median or mode, which are less sensitive to extreme values.

In terms of measures of dispersion, outliers can significantly affect the range, variance, and standard deviation. The range, for example, is directly affected by outliers because it is the difference between the largest and smallest values in the dataset. The variance and standard deviation, which measure the spread of the data around the mean, are also strongly affected: the outlier shifts the mean, and because deviations are squared, a single extreme value contributes disproportionately to the total.

For example, consider a dataset of salaries of employees in a company, with the following values: 50,000, 60,000, 55,000, 58,000, 54,000, and 1,000,000. The outlier salary of 1,000,000 raises the mean to about 212,833, far above what any of the other employees earns, while the median stays at 56,500. It also increases the range from 10,000 to 950,000 and inflates the variance and standard deviation. If the outlier is removed, the mean drops to 55,400 and the median to 55,000, so both measures of central tendency, along with the measures of dispersion, become representative of the typical salaries.
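The salary example can be worked through in code. This sketch uses the standard `statistics` module and the sample standard deviation (n - 1). The helper name `summarize` is hypothetical.

```python
from statistics import mean, median, stdev

salaries = [50_000, 60_000, 55_000, 58_000, 54_000, 1_000_000]
without_outlier = salaries[:-1]  # drop the 1,000,000 salary

def summarize(values):
    """Return mean, median, range and sample standard deviation."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "stdev": stdev(values),
    }

with_stats = summarize(salaries)
without_stats = summarize(without_outlier)

for label, stats in (("with outlier", with_stats), ("without outlier", without_stats)):
    print(label, {name: round(value, 2) for name, value in stats.items()})
```

The median moves only from 56,500 to 55,000 when the outlier is dropped, while the mean, range, and standard deviation change by large factors.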