#### Q1. What are the three measures of central tendency?
    Ans. Mean, median, and mode.

#### Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?
    Ans. 
    1. Mean:
    The mean is calculated by summing up all the values in the dataset and then dividing that sum by the number of data points. It represents the average value of the dataset. The formula for calculating the mean is:
    Mean = (Sum of all values) / (Number of data points)

    The mean is sensitive to extreme values, so a few outliers in your dataset can heavily influence it. For example, in the dataset [1, 2, 3, 4, 100], the mean is 110 / 5 = 22, which is far larger than four of the five values because of the single outlier 100.

    
    2. Median:
    The median is the middle value of a dataset when the values are arranged in ascending or descending order. If the dataset has an odd number of data points, the median is the middle value. If the dataset has an even number of data points, the median is the average of the two middle values. The median is largely unaffected by extreme values, which makes it a robust measure of central tendency.
    To find the median, first arrange the data in ascending or descending order, then find the middle value(s).

    
    3. Mode:
    The mode is the value that appears most frequently in the dataset. In some datasets, there may be multiple modes (bimodal, trimodal, etc.) if more than one value occurs with the highest frequency. Unlike the mean and median, the mode can be used for both numerical and categorical data.
    
    
    
    How to choose the appropriate measure:

    Mean: Use the mean when the data is roughly symmetric and does not have significant outliers.
    Median: Use the median when the data contains outliers or is skewed (as it is not sensitive to extreme values).
    Mode: Use the mode when you want to identify the most frequently occurring value in a dataset, especially in categorical data.
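
The effect of an outlier on each measure can be checked with a minimal sketch using Python's standard `statistics` module. The dataset is the one from the mean example above; the `colors` list and the variable names are made up for illustration.

```python
from statistics import mean, median, multimode

data = [1, 2, 3, 4, 100]   # dataset from the mean example, outlier included
trimmed = [1, 2, 3, 4]     # the same data without the outlier 100

mean_with = mean(data)             # 110 / 5 = 22
mean_without = mean(trimmed)       # 10 / 4 = 2.5
median_with = median(data)         # middle value of the sorted data: 3
median_without = median(trimmed)   # average of 2 and 3: 2.5

# The mode also works on categorical data (hypothetical example values).
colors = ["red", "blue", "red", "green", "blue", "red"]
color_modes = multimode(colors)    # every value tied for the highest count

print("mean:", mean_without, "->", mean_with)
print("median:", median_without, "->", median_with)
print("mode of colors:", color_modes)
```

The mean jumps from 2.5 to 22 when the outlier is added, while the median only moves from 2.5 to 3.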

#### Q3. Compute the three measures of central tendency for the given height data:
    [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [15]:
import numpy as np
from scipy import stats
import warnings

warnings.filterwarnings("ignore")
arr1 = np.array([178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5])
np.mean(arr1), np.median(arr1), stats.mode(arr1)

(177.01875, 177.0, ModeResult(mode=array([177.]), count=array([3])))
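
The output above reports a single mode, but 177 and 178 each occur three times in this data, so the dataset is bimodal. A minimal sketch, assuming only the standard library, that lists every value tied for the highest count with `statistics.multimode` (the variable names are illustrative):

```python
from statistics import multimode

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns all values that share the highest frequency
modes = multimode(heights)
print(sorted(modes))
```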

#### Q4. Find the standard deviation for the given data:
    [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [16]:
arr2 = np.array([178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5])
np.std(arr2)

1.7885814036548633
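
Note that `np.std` defaults to `ddof=0`, so the value above is the population standard deviation (dividing by N). If the heights are treated as a sample, the sample standard deviation (dividing by N - 1) is slightly larger. A small sketch of the difference using the standard `statistics` module; the variable names are illustrative:

```python
from statistics import pstdev, stdev

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

population_sd = pstdev(heights)  # divides by N, like np.std(heights)
sample_sd = stdev(heights)       # divides by N - 1, like np.std(heights, ddof=1)

print(population_sd, sample_sd)
```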

#### Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.
    Ans. Measures of dispersion, such as range, variance, and standard deviation, are used to quantify how spread out or dispersed the values in a dataset are. They provide valuable information about the variability or spread of the data points from the central tendency (mean, median, or mode).
    1. Range:
    The range is the simplest measure of dispersion and is calculated as the difference between the maximum and minimum values in the dataset. It provides a quick overview of how widely the data values are spread.
    Range = Maximum value - Minimum value
    
    2. Variance:
    The variance is a measure of how much the data points deviate from the mean of the dataset. It is calculated by taking the average of the squared differences between each data point and the mean. A higher variance indicates a greater spread or dispersion of data points from the mean.
    Variance = Σ((x - μ)^2) / N

    where:
    x = individual data points
    μ = mean of the dataset
    N = number of data points
    
    3. Standard Deviation:
    The standard deviation is the square root of the variance. It provides a more interpretable measure of dispersion, as it is in the same unit as the original data, making it easier to compare with the mean.
    Standard Deviation = sqrt(Variance)
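
As an example, the three measures can be computed for a small dataset. This is a minimal sketch using the population formulas given above (dividing by N); the dataset is made up for illustration.

```python
from statistics import pvariance, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical example values

data_range = max(data) - min(data)   # 9 - 2 = 7
variance = pvariance(data)           # mean is 5; squared deviations sum to 32; 32 / 8 = 4
std_dev = pstdev(data)               # sqrt(4) = 2

print("range:", data_range)
print("variance:", variance)
print("standard deviation:", std_dev)
```

The standard deviation of 2 is in the same units as the data, so it can be read directly as "values typically lie about 2 units from the mean of 5".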

#### Q6. What is a Venn diagram?
    Ans. A Venn diagram is a graphical representation of the relationships between different sets. It uses circles (or other closed shapes) to depict these sets and their overlapping regions, illustrating the common elements shared among the sets and the unique elements present in each set. Venn diagrams are widely used to visualize set theory and the relationships between various categories or groups.

    Key components of a Venn diagram:

        Sets: Each circle in the diagram represents a set. A set is a collection of objects or elements that share certain characteristics or properties.

        Overlapping regions: When two or more sets have elements in common, their circles overlap, creating regions that represent the shared elements among those sets.

        Non-overlapping regions: The portions of the circles that do not overlap with any other circle represent the elements that are unique to each individual set.

        Universal set: The space within the rectangle or boundary of the Venn diagram is the universal set, which includes all possible elements relevant to the context of the diagram.

    Uses of Venn diagrams:

        Venn diagrams are used in various fields, including mathematics, statistics, logic, and data analysis. Some common applications include:

        Set theory: Venn diagrams are commonly used to depict the relationships between sets, intersections, unions, and complements.

        Logic and reasoning: They are used to illustrate logical relationships and identify contradictions or overlaps in arguments.

        Probability: Venn diagrams can help visualize probabilities and understand events' intersections and unions in probability theory.

        Data analysis: In data science, Venn diagrams are used to compare data sets, identify common elements, and understand the relationships between different data categories.

#### Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:
    (i) A ⋂ B
    Ans. A.intersection(B): {2, 6}
    (ii) A ⋃ B
    Ans. A.union(B): {0, 2, 3, 4, 5, 6, 7, 8, 10}
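
Both results can be reproduced with Python's built-in `set` type. A short sketch (the variable names are illustrative):

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B   # same as A.intersection(B): elements in both sets
union = A | B          # same as A.union(B): elements in either set

print(sorted(intersection))
print(sorted(union))
```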

#### Q8. What do you understand about skewness in data?
    Ans. Skewness is a statistical measure that describes the asymmetry of the probability distribution of a dataset. In other words, it measures the degree to which a distribution deviates from symmetry about its center (the bell-shaped normal distribution is one example of a symmetric distribution, with zero skewness). Skewness provides insight into the shape of the data distribution.

    There are three main types of skewness:

        Positive Skewness (Right Skewness):
        In a positively skewed distribution, the majority of the data is concentrated on the left side, while the right tail is longer. The mean is typically greater than the median, and the median is typically greater than the mode. This happens when there are a few extremely high values that pull the mean towards the right.
        Example of a positively skewed distribution:
        [5, 6, 7, 8, 9, 10, 50]

        Negative Skewness (Left Skewness):
        In a negatively skewed distribution, the majority of the data is concentrated on the right side, while the left tail is longer. The mean is typically less than the median, and the median is less than the mode. This occurs when there are a few extremely low values that pull the mean towards the left.
        Example of a negatively skewed distribution:
        [5, 45, 46, 47, 48, 49, 50]

        Zero Skewness (Symmetrical):
        In a symmetrical distribution, the data is evenly distributed around the mean, and the left and right tails are roughly equal in length. The mean, median, and mode are all approximately the same value.
        Example of a symmetrical distribution:
        [10, 20, 30, 40, 50]

    The concept of skewness is essential in understanding the underlying patterns and characteristics of data. It helps in selecting appropriate statistical methods and models for analysis. For instance, in the presence of skewness, using the mean as a measure of central tendency may not be ideal, as it can be heavily influenced by outliers. In such cases, the median is often a more robust choice.

    Skewness is a valuable tool in data analysis, especially when assessing the distribution of variables in research, finance, economics, and other fields. Understanding the skewness of a dataset can lead to more accurate interpretations and predictions.
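
The sign of the skewness can be checked numerically. The sketch below assumes one common definition, the moment coefficient of skewness (the third central moment divided by the second central moment raised to the power 1.5); the helper `skewness` and the mirrored `left_skewed` list are illustrative, not part of the original answer.

```python
from statistics import mean, median

def skewness(values):
    """Moment coefficient of skewness, using population moments."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [5, 6, 7, 8, 9, 10, 50]     # long right tail
symmetric = [10, 20, 30, 40, 50]           # evenly spread around the mean
left_skewed = [-x for x in right_skewed]   # mirror image: long left tail

for name, values in [("right", right_skewed),
                     ("symmetric", symmetric),
                     ("left", left_skewed)]:
    print(name, "skewness:", skewness(values),
          "mean:", mean(values), "median:", median(values))
```

A positive result goes with a mean above the median, a negative result with a mean below the median, and a symmetric dataset gives zero.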

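#### Q9. If the data is right-skewed, what is the position of the median with respect to the mean?
    Ans. In right-skewed data the median is typically less than the mean, because the long right tail pulls the mean upward: mode < median < mean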

#### Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?
    Ans. Covariance and correlation are both measures used to describe the relationship between two variables in statistical analysis, but they have some key differences in their interpretation and scale:

    Covariance:
    Covariance measures the degree to which two variables change together. It indicates whether the two variables tend to increase or decrease at the same time. If the covariance is positive, it means that when one variable increases, the other tends to increase as well, and when one variable decreases, the other tends to decrease. If the covariance is negative, it means that one variable tends to increase when the other decreases, and vice versa.
    However, the magnitude of the covariance does not provide a standardized measure of the strength of the relationship between the variables. It can be influenced by the scale of the variables, making it difficult to interpret its value in isolation.

    The formula for covariance between two variables X and Y with n data points is given as:

    Cov(X, Y) = Σ[(X_i - mean(X)) * (Y_i - mean(Y))] / n

    Correlation:
    Correlation is a standardized measure of the relationship between two variables. It quantifies both the strength and direction of the linear relationship between the variables, and its value always falls within the range of -1 to 1.
    A correlation of +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship (though there could be other types of relationships present). Correlation allows for easier comparison across different datasets, as it is not influenced by the scale of the variables.

    The most commonly used measure of correlation is the Pearson correlation coefficient, which is defined as:

    Corr(X, Y) = Cov(X, Y) / (σ_X * σ_Y)

    where Cov(X, Y) is the covariance between X and Y, and σ_X and σ_Y are the standard deviations of X and Y, respectively.

    Usage in statistical analysis:

    Covariance: Covariance is used to understand the direction of the relationship between two variables. However, since the covariance is not standardized, it is challenging to compare the strength of relationships across different datasets or variable scales.

    Correlation: Correlation is widely used in statistical analysis to assess the strength and direction of the linear relationship between two variables. It provides a standardized measure that allows for easy comparison. Correlation is commonly used in fields such as economics, finance, social sciences, and data analysis to identify patterns, model relationships, and make predictions.
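
The scale dependence of covariance, and the scale independence of correlation, can be seen in a short sketch that implements the two formulas above directly. The data (study hours and exam scores) and the helper names `covariance` and `correlation` are hypothetical, chosen only for illustration.

```python
import math

def covariance(xs, ys):
    """Population covariance: mean of the products of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))   # Cov(X, X) is the variance of X
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)

hours = [1, 2, 3, 4, 5]            # hypothetical study time in hours
scores = [52, 55, 61, 70, 72]      # hypothetical exam scores
minutes = [h * 60 for h in hours]  # the same study time in minutes

cov_hours = covariance(hours, scores)
cov_minutes = covariance(minutes, scores)
corr_hours = correlation(hours, scores)
corr_minutes = correlation(minutes, scores)

print("covariance:", cov_hours, "vs", cov_minutes)
print("correlation:", corr_hours, "vs", corr_minutes)
```

Changing the unit from hours to minutes multiplies the covariance by 60 but leaves the correlation unchanged.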

#### Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.
    Ans. The formula for calculating the sample mean (also known as the sample average) is relatively straightforward. It is the sum of all data points divided by the number of data points in the sample. If we have a dataset with 'n' data points represented as x1, x2, x3, ..., xn, then the sample mean (denoted as "x̄") is calculated as follows:

    Sample Mean (x̄) = (x1 + x2 + x3 + ... + xn) / n

    Example calculation:

    Let's consider a dataset of exam scores for a class of 10 students:

    {85, 78, 92, 89, 76, 95, 88, 84, 91, 82}

    To find the sample mean, we sum up all the scores and then divide by the number of data points (which is 10 in this case):

    Sample Mean (x̄) = (85 + 78 + 92 + 89 + 76 + 95 + 88 + 84 + 91 + 82) / 10

    Sample Mean (x̄) = 860 / 10

    Sample Mean (x̄) = 86

    So, the sample mean for this dataset is 86. This means that, on average, the students in the class scored 86 on the exam. The sample mean is a useful measure of central tendency that represents the typical value of the data and is commonly used in various statistical analyses.
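
The arithmetic in this calculation can be checked in a few lines of Python. A minimal sketch (the variable names are illustrative):

```python
scores = [85, 78, 92, 89, 76, 95, 88, 84, 91, 82]

total = sum(scores)      # x1 + x2 + ... + xn
n = len(scores)          # number of data points
sample_mean = total / n  # sample mean formula

print(total, n, sample_mean)
```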

#### Q12. For normally distributed data, what is the relationship between its measures of central tendency?
    Ans. mean = median = mode

#### Q13. How is covariance different from correlation?
    Ans. Covariance and correlation are both measures that describe the relationship between two variables, but they have some key differences:

    Definition:
    Covariance: Covariance is a measure of how two variables change together. It indicates the degree to which the values of one variable increase or decrease in relation to the corresponding values of the other variable. A positive covariance indicates a direct relationship (both variables increase or decrease together), while a negative covariance indicates an inverse relationship (one variable increases as the other decreases).
    Correlation: Correlation, on the other hand, is a standardized measure of the linear relationship between two variables. It quantifies both the strength and direction of the linear association between the variables. Correlation values range from -1 to 1, where +1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship (though other relationships might exist).
    
    Scale:
    Covariance: The magnitude of the covariance is not standardized and depends on the scales of the variables involved. Therefore, it is challenging to interpret the absolute value of covariance, and comparing covariances from different datasets can be difficult.
    Correlation: Correlation standardizes the relationship by dividing the covariance by the product of the standard deviations of the two variables. This makes correlation values independent of the scale of the variables and allows for easier comparison of the strength of relationships across different datasets.
    
    Interpretation:
    Covariance: The sign of the covariance (+ or -) indicates the direction of the relationship. The magnitude, however, depends on the units of the variables, so on its own it does not give a clear indication of the strength of the relationship, especially when comparing datasets with different scales.
    Correlation: Correlation values, being standardized, provide a clear indication of the strength and direction of the linear relationship. A correlation of +1 or -1 implies a perfect linear relationship, while values closer to 0 indicate weaker or no linear relationships.
    
    Range of Values:
    Covariance: Covariance can take any real value, from negative infinity to positive infinity.
    Correlation: Correlation values are always between -1 and +1, inclusive.

#### Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.
    Ans. Outliers can have a significant impact on measures of central tendency and dispersion in a dataset. An outlier is an extreme value that differs greatly from the other data points in the dataset. Outliers can distort the typical pattern of the data and affect the overall summary statistics.

    Measures of Central Tendency (Mean, Median, Mode):
    Mean: The mean is sensitive to outliers because it takes into account the value of each data point. When there are extreme values (outliers) in the dataset, they can pull the mean towards their direction. Consequently, the mean can be significantly affected by outliers, leading to a distorted representation of the central tendency if the dataset is not well-distributed.

    Median: The median is less sensitive to outliers compared to the mean. It represents the middle value of the dataset when it is arranged in ascending or descending order. Since the median only considers the position of the data points, outliers do not influence it as much as they do the mean.

    Mode: The mode is the most frequent value in the dataset. Outliers do not affect the mode, as it is determined by the frequency of values, not their actual magnitude.

    Example:
    Consider the following dataset of exam scores:

    {85, 78, 92, 89, 76, 95, 88, 84, 91, 150}

    In this dataset, 150 is an outlier since it is significantly larger than the other scores. Let's calculate the mean, median, and mode:

    Mean: (85 + 78 + 92 + 89 + 76 + 95 + 88 + 84 + 91 + 150) / 10 = 928 / 10 = 92.8
    Median: Sorted, the scores are 76, 78, 84, 85, 88, 89, 91, 92, 95, 150. The middle value is the average of the 5th and 6th scores, which are 88 and 89. So, (88 + 89) / 2 = 177 / 2 = 88.5
    Mode: Every score appears exactly once, so this dataset has no single mode.
    As you can see, the outlier 150 pulls the mean up to 92.8, above every score except 95 and 150 itself. The median stays at 88.5, close to the typical scores (without the outlier, the remaining nine scores have a median of 88 and a mean of about 86.4).
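
The comparison can be reproduced with the standard `statistics` module. This is a minimal sketch; the `without_outlier` list and the variable names are introduced only for illustration.

```python
from statistics import mean, median, multimode

scores = [85, 78, 92, 89, 76, 95, 88, 84, 91, 150]
without_outlier = [s for s in scores if s != 150]   # drop the outlier

mean_all, mean_trimmed = mean(scores), mean(without_outlier)
median_all, median_trimmed = median(scores), median(without_outlier)
modes = multimode(scores)   # lists every value tied for the highest count

print("mean:", mean_trimmed, "->", mean_all)
print("median:", median_trimmed, "->", median_all)
print("values tied for most frequent:", len(modes))
```

Removing the outlier changes the mean by several points but moves the median by only half a point. Because no score repeats, `multimode` returns every score, which is another way of saying the data has no single mode.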

    Measures of Dispersion (Range, Variance, Standard Deviation):
    Outliers can also have a significant impact on measures of dispersion, which quantify the spread or variability of data points.
    Range: Outliers can substantially affect the range, which is simply the difference between the maximum and minimum values in the dataset. Outliers that are very large or very small can widen the range.

    Variance and Standard Deviation: Both variance and standard deviation are influenced by outliers because they involve squaring the differences between data points and the mean. Since outliers have large differences from the mean, squaring them amplifies their effect on these measures.

    Example:
    Let's consider the same dataset as before, but with an additional outlier:

    {85, 78, 92, 89, 76, 95, 88, 84, 91, 150, 50}

    Calculating the variance and standard deviation for this dataset will show a significant increase due to the presence of the outlier 150 and the low-value outlier 50.
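
A minimal sketch of that calculation, assuming the population formulas (dividing by N) and treating 150 and 50 as the outliers; the `core` list and the helper `spread` are illustrative names:

```python
from statistics import pvariance, pstdev

scores = [85, 78, 92, 89, 76, 95, 88, 84, 91, 150, 50]
core = [s for s in scores if s not in (150, 50)]   # scores without the two outliers

def spread(values):
    """Return (range, population variance, population standard deviation)."""
    return max(values) - min(values), pvariance(values), pstdev(values)

range_all, var_all, sd_all = spread(scores)
range_core, var_core, sd_core = spread(core)

print("range:", range_core, "->", range_all)
print("variance:", var_core, "->", var_all)
print("standard deviation:", sd_core, "->", sd_all)
```

All three measures grow once the outliers are included, with the variance growing the most because each deviation is squared.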