# Q1. What are the three measures of central tendency?

   a. **Mean (Arithmetic Average)**:
      - The mean is calculated by adding up all the values in the dataset and dividing by the total number of values.
      - It represents the **balance point** of the data.
      - Formula: \(\text{Mean} = \frac{\sum \text{values}}{\text{number of values}}\)
      - Use cases:
        - When data is **approximately symmetric** (e.g., heights, test scores).
      - Caveat: the mean is sensitive to extreme values (outliers), so a single very large or very small value can shift it noticeably.

   b. **Median**:
      - The median is the **middle value** when the data is arranged in order.
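      - It is largely unaffected by extreme values, because it depends on the order of the data rather than on how far the extremes lie from the center.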
      - Use cases:
        - When data has **skewed distributions** (e.g., income, house prices).
        - Robust to outliers.

   c. **Mode**:
      - The mode is the **most frequent value** in the dataset.
      - It can be used for both **categorical** and **numerical** data.
      - Use cases:
        - Identifying the most common response in a survey.
        - Handling categorical data.

# Q2. What is the difference between the mean, median, and mode? How are they used to measure the central tendency of a dataset?

   a. **Mean (Arithmetic Average)**:
      - The mean is calculated by adding up all the values in the dataset and dividing by the total number of values.
      - It represents the **balance point** of the data.
      - Formula: \(\text{Mean} = \frac{\sum \text{values}}{\text{number of values}}\)
      - Use cases:
        - When data is **approximately symmetric** (e.g., heights, test scores).
      - Caveat: the mean is sensitive to extreme values (outliers), so a single very large or very small value can shift it noticeably.

   b. **Median**:
      - The median is the **middle value** when the data is arranged in order.
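      - It is largely unaffected by extreme values, because it depends on the order of the data rather than on how far the extremes lie from the center.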
      - Use cases:
        - When data has **skewed distributions** (e.g., income, house prices).
        - Robust to outliers.
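      - Formula (for \(N\) values sorted in ascending order):
        - \(N\) odd: the median is the value at position \(\frac{N+1}{2}\).
        - \(N\) even: the median is the average of the values at positions \(\frac{N}{2}\) and \(\frac{N}{2} + 1\).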

   c. **Mode**:
      - The mode is the **most frequent value** in the dataset.
      - It can be used for both **categorical** and **numerical** data.
      - Use cases:
        - Identifying the most common response in a survey.
        - Handling categorical data.
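
The median rule can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: for sorted data with odd \(N\) it picks the value at 1-based position \((N+1)/2\), and for even \(N\) it averages the values at positions \(N/2\) and \(N/2 + 1\). The function name `median_by_position` and the sample values are hypothetical.

```python
def median_by_position(values):
    """Median via sorted positions (1-based in the comments, 0-based in the code)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        # Odd N: the single middle value, at 1-based position (N + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even N: average of the values at 1-based positions N / 2 and N / 2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_data = [7, 1, 5]       # hypothetical values, sorted: 1, 5, 7
even_data = [7, 1, 5, 3]   # hypothetical values, sorted: 1, 3, 5, 7

print(median_by_position(odd_data))   # 5
print(median_by_position(even_data))  # 4.0
```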

# Q3. Measure the three measures of central tendency for the given height data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [1]:
import numpy as np
from scipy import stats as st

# Height data from the question
heights = [178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]
array = np.array(heights)
print(array.mean())        # mean
print(np.median(array))    # median
print(st.mode(array))      # mode (a single value and its count)

177.01875
177.0
ModeResult(mode=array([177.]), count=array([3]))
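
Note that the `ModeResult` reports a single value, 177, with a count of 3, but 178 also appears three times in this data, so the dataset is bimodal. One way to list every tied mode is `statistics.multimode` from the Python standard library; the sketch below is an optional cross-check, not part of the original cell.

```python
from statistics import multimode

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

# multimode returns every value tied for the highest frequency,
# in the order each first appears in the data.
modes = multimode(heights)
print(modes)  # [178, 177]
```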




# Q4. Find the standard deviation for the given data:
[178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5]

In [2]:
a=np.array([178,177,176,177,178.2,178,175,179,180,175,178.9,176.2,177,172.5,178,176.5])
a.std()

1.7885814036548633
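
The value above is the population standard deviation, because NumPy's `std` divides by \(n\) by default (`ddof=0`); passing `ddof=1` divides by \(n - 1\) and gives the sample standard deviation instead. As a hedged cross-check using only the standard library, `statistics.pstdev` and `statistics.stdev` compute the two variants:

```python
import statistics

heights = [178, 177, 176, 177, 178.2, 178, 175, 179, 180, 175,
           178.9, 176.2, 177, 172.5, 178, 176.5]

population_sd = statistics.pstdev(heights)  # divides by n, like a.std()
sample_sd = statistics.stdev(heights)       # divides by n - 1, like a.std(ddof=1)

print(population_sd)
print(sample_sd)
```

Which one is appropriate depends on whether the 16 heights are treated as the whole population or as a sample from a larger one; the question does not say.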

# Q5. How are measures of dispersion such as range, variance, and standard deviation used to describe the spread of a dataset? Provide an example.

1. **Range**:
   - The **range** is the simplest measure of dispersion. It represents the difference between the maximum and minimum values in a dataset.
   - Formula: Range = Maximum Value - Minimum Value
   - Example: Suppose we have a dataset of daily temperatures (in degrees Celsius) for a week: {20, 22, 18, 25, 21, 23, 19}. The range would be:
     \[ \text{Range} = 25 - 18 = 7 \]

2. **Variance**:
   - **Variance** quantifies how much individual data points deviate from the mean (average) of the dataset.
   - Formula (population variance): \(\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n}\)
     where \(y_i\) represents each data point, \(\bar{y}\) is the mean, and \(n\) is the number of data points. For a sample variance, the denominator is \(n - 1\) instead of \(n\).
   - Example: Consider the dataset {10, 12, 15, 11, 14}. The mean is \(\bar{y} = \frac{10 + 12 + 15 + 11 + 14}{5} = 12.4\). The variance is:
     \[ \sigma^2 = \frac{(10-12.4)^2 + (12-12.4)^2 + (15-12.4)^2 + (11-12.4)^2 + (14-12.4)^2}{5} = \frac{5.76 + 0.16 + 6.76 + 1.96 + 2.56}{5} = \frac{17.2}{5} = 3.44 \]

3. **Standard Deviation**:
   - The **standard deviation** is the square root of the variance. Because it is expressed in the same units as the data, it is easier to interpret than the variance.
   - Formula: \(\sigma = \sqrt{\sigma^2}\)
   - Example: Using the same dataset as above, the standard deviation is:
     \[ \sigma = \sqrt{3.44} \approx 1.85 \]
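
The three measures can be computed together. The sketch below applies them to the temperature dataset from the range example, treating the week as a full population (so the variance divides by \(n\)); that choice is an assumption, not something the example states.

```python
import math
import statistics

temperatures = [20, 22, 18, 25, 21, 23, 19]  # daily temperatures in degrees Celsius

data_range = max(temperatures) - min(temperatures)
variance = statistics.pvariance(temperatures)  # population variance (divide by n)
std_dev = math.sqrt(variance)                  # same units as the data

print(data_range)  # 7
print(variance)
print(std_dev)
```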

# Q6. What is a Venn diagram?

A **Venn diagram** is a widely used diagram style that shows the logical relationship between sets. It was popularized by **John Venn** in the 1880s. These diagrams are valuable tools for visualizing and understanding set theory, logic, and various other fields. Here are the key points about Venn diagrams:

1. **Purpose**:
   - A Venn diagram helps us **visually represent the differences and similarities** between two or more concepts or sets.
   - It uses intersecting and non-intersecting circles (or other closed figures like squares) to denote the relationships between sets.

2. **Components**:
   - **Universal Set**: Before drawing a Venn diagram, we consider a larger set called the **universal set** (denoted by \(E\) or sometimes \(U\)). The universal set contains all elements from the sets being considered.
   - **Circles (or Figures)**: Each set is represented by a circle (or closed figure) within the universal set. The circles may intersect or remain separate.
   - **Subset Relationship**: A subset is a set contained entirely within another set. For example, if set \(A\) is entirely within set \(B\), we say \(A\) is a subset of \(B\) (symbolically represented as \(A \subseteq B\)).

3. **Example**:
   - Let's consider an example. Suppose we have two sets:
     - Set \(A\) contains even numbers from 1 to 25.
     - Set \(B\) contains numbers in the 5x table from 1 to 25.
   - The Venn diagram shows that 10 and 20 are both even numbers and multiples of 5 between 1 and 25 (the intersecting part).

4. **Applications**:
   - Venn diagrams are used in:
     - Teaching elementary set theory.
     - Illustrating set relationships in probability, logic, statistics, linguistics, and computer science.

# Q7. For the two given sets A = (2,3,4,5,6,7) & B = (0,2,6,8,10). Find:
## (i) (A ∩ B)
## (ii) A ⋃ B

1. **Intersection (A ∩ B)**:
   - The intersection of sets \(A\) and \(B\) consists of elements that are common to both sets.
   - We find the common elements between \(A\) and \(B\).
   - \(A = \{2, 3, 4, 5, 6, 7\}\)
   - \(B = \{0, 2, 6, 8, 10\}\)
   - The common elements are \(2\) and \(6\).
   - Therefore, \(A \cap B = \{2, 6\}\).

2. **Union (A ∪ B)**:
   - The union of sets \(A\) and \(B\) contains all distinct elements from both sets.
   - \(A \cup B = \{0, 2, 3, 4, 5, 6, 7, 8, 10\}\).

In summary:
- \(A \cap B = \{2, 6\}\)
- \(A \cup B = \{0, 2, 3, 4, 5, 6, 7, 8, 10\}\)
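
Both results can be checked with Python's built-in `set` type, where `&` is intersection and `|` is union. A minimal sketch:

```python
A = {2, 3, 4, 5, 6, 7}
B = {0, 2, 6, 8, 10}

intersection = A & B   # elements present in both sets
union = A | B          # elements present in at least one set

print(sorted(intersection))
print(sorted(union))
```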



# Q8. What do you understand about skewness in data?

**Skewness** is a measure of the **asymmetry** of a distribution. When we analyze data, we often encounter distributions that are not perfectly symmetrical. Here are the key points about skewness:

1. **Definition**:
   - Skewness quantifies how far a distribution departs from symmetry, that is, whether one tail is longer or heavier than the other.
   - It indicates whether the distribution is **lopsided** or not.
   - A distribution can have three types of skewness:
     - **Right (Positive) Skew**: The right tail is longer than the left tail. Data points are concentrated on the left side, with a few extreme values on the right.
     - **Left (Negative) Skew**: The left tail is longer than the right tail. Data points are concentrated on the right side, with a few extreme values on the left.
     - **Zero Skew**: The distribution is perfectly symmetrical (left and right sides mirror each other).

2. **Interpretation**:
   - **Zero Skew**: When skewness is zero, the distribution is **symmetrical**. Normal distributions have zero skew, but other symmetrical distributions (like uniform distributions) can also have zero skew.
   - **Right Skew (Positive Skew)**: A right-skewed distribution has a long tail on the right side. It indicates that extreme values are more frequent on the right.
   - **Left Skew (Negative Skew)**: A left-skewed distribution has a long tail on the left side. It indicates that extreme values are more frequent on the left.

3. **Practical Use**:
   - **Descriptive Statistics**: Skewness helps describe the shape of a dataset alongside other statistics.
   - **Model Assumptions**: Checking skewness is essential for verifying assumptions in statistical models (e.g., normality assumptions).

4. **Example**:
   - Imagine a dataset of exam scores. If most students score in a low-to-middle range and a few score much higher, the distribution is right-skewed.
   - Conversely, if most students score high and a few score much lower, the distribution is left-skewed.
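
As a rough illustration, skewness can be computed from the second and third central moments, \(m_3 / m_2^{3/2}\). The sketch below uses this moment-based (population) definition, which is one of several conventions; the function name `skewness` and the datasets are hypothetical.

```python
def skewness(values):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 15]   # hypothetical: long right tail
left_skewed = [-x for x in right_skewed]   # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
print(skewness(symmetric))     # 0.0
```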

# Q9. If a data is right skewed then what will be the position of median with respect to mean?

In a **right-skewed** dataset, the position of the **median** with respect to the **mean** is as follows:

1. **Mean**:
   - The **mean** (average) tends to be **greater** than the median in a right-skewed distribution.
   - This is because the right tail (containing larger values) pulls the mean toward higher values.

2. **Median**:
   - The **median** is typically **less** than the mean in a right-skewed dataset.
   - The median is less affected by extreme values, so it remains closer to the center of the data.

In summary:
- **Mean > Median** in a right-skewed distribution.
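
A small numeric check makes the relationship concrete. The incomes below are hypothetical values chosen to have a long right tail:

```python
import statistics

incomes = [30, 32, 35, 38, 40, 45, 200]  # hypothetical incomes, in thousands

mean_income = statistics.mean(incomes)      # pulled up by the value 200
median_income = statistics.median(incomes)  # middle value of the sorted data

print(mean_income, median_income)   # 60 38
print(mean_income > median_income)  # True
```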


# Q10. Explain the difference between covariance and correlation. How are these measures used in statistical analysis?

1. **Covariance**:
   - **Definition**: Covariance measures the **extent to which two random variables change together**. It indicates the direction of the linear relationship between variables.
   - **Calculation**: For two variables \(X\) and \(Y\), the sample covariance is given by:
     \[ \text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} \]
     where \(x_i\) and \(y_i\) are data values, \(\bar{x}\) and \(\bar{y}\) are their respective means, and \(n\) is the sample size.
   - **Range**: Covariance values can lie between \(-\infty\) and \(+\infty\).
   - **Units**: The units of covariance are derived from the product of the units of the variables.

2. **Correlation**:
   - **Definition**: Correlation measures both the **strength and direction** of the linear relationship between two variables. It is a standardized form of covariance.
   - **Calculation**: The sample correlation coefficient (\(r\)) is obtained by dividing the covariance of the variables by the product of their standard deviations:
     \[ r = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \]
     where \(\sigma_X\) and \(\sigma_Y\) are the sample standard deviations of \(X\) and \(Y\), computed with the same \(n-1\) denominator as the covariance.
   - **Range**: Correlation values fall within the range of \(-1\) to \(+1\).
   - **Units**: Correlation is **non-dimensional**; it doesn't have any specific units.

3. **Differences**:
   - **Interpretation**:
     - Covariance indicates the **direction** of the linear relationship between variables; its magnitude depends on the units of measurement, so it is hard to interpret on its own.
     - Correlation indicates both the **strength and direction** of the linear relationship on a unit-free scale.
   - **Standardization**:
     - Correlation is a **normalized** version of covariance.
     - Covariance values are not standardized.
   - **Effect of Scale Change**:
     - Covariance is affected by changes in the scale of variables (e.g., multiplying all values by a constant).
     - Correlation is unaffected by adding a constant to a variable or multiplying it by a positive constant; multiplying by a negative constant only flips its sign.
   - **Range**:
     - Covariance values can be large or small, without a specific upper or lower limit.
     - Correlation values are bounded between \(-1\) and \(+1\).

4. **Applications**:
   - **Covariance**:
     - Used in portfolio theory to assess the relationship between asset returns.
     - Helps understand how variables move together (e.g., stock prices and interest rates).
   - **Correlation**:
     - Widely used in finance, economics, and social sciences.
     - Determines the strength and direction of relationships (e.g., correlation between GDP and unemployment rate).
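
The effect of a scale change can be seen in a short sketch. The helper functions and the data below are hypothetical; they implement the sample covariance and correlation formulas given above, then re-express one variable in different units.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Covariance divided by the product of the sample standard deviations."""
    sd_x = math.sqrt(sample_covariance(xs, xs))
    sd_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)


hours = [1, 2, 3, 4, 5]            # hypothetical study time in hours
scores = [52, 55, 61, 64, 70]      # hypothetical exam scores
minutes = [h * 60 for h in hours]  # same quantity, different unit

# Covariance grows by the factor 60; correlation stays the same.
print(sample_covariance(hours, scores), sample_covariance(minutes, scores))
print(sample_correlation(hours, scores), sample_correlation(minutes, scores))
```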

# Q11. What is the formula for calculating the sample mean? Provide an example calculation for a dataset.

The **sample mean** is the average of the values in a sample. It is commonly used to estimate the mean of the whole population without measuring every individual in that population. The formula is:

\[ \text{Sample Mean} = \frac{\text{Sum of Terms}}{\text{Number of Terms}} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Where:
- \(x_i\) represents each data point in the sample.
- \(n\) is the number of terms in the sample.

Let's work through three examples:

1. **Example 1**:
   - Given data: 60, 57, 109, 50
   - Sum of terms: \(60 + 57 + 109 + 50 = 276\)
   - Number of terms: 4
   - Using the sample mean formula:
     \[ \text{Mean} = \frac{\text{Sum of terms}}{\text{Number of terms}} = \frac{276}{4} = 69 \]
   - Answer: The sample mean of 60, 57, 109, and 50 is **69**.

2. **Example 2**:
   - Heights of five friends: 110 units, 115 units, 109 units, 112 units, and 114 units.
   - Sum of heights: \(110 + 115 + 109 + 112 + 114 = 560\)
   - Number of people: 5
   - Using the sample mean formula:
     \[ \text{Mean} = \frac{\text{Sum of heights}}{\text{Number of people}} = \frac{560}{5} = 112 \]
   - Answer: The sample mean height of the five friends is **112 units**.

3. **Example 3**:
   - Homework completion times for five children: 30 minutes, 60 minutes, 45 minutes, 40 minutes, and 90 minutes.
   - Sum of time slots: \(30 + 40 + 45 + 60 + 90 = 265\)
   - Number of children: 5
   - Using the sample mean formula:
     \[ \text{Mean} = \frac{\text{Sum of time slots}}{\text{Number of children}} = \frac{265}{5} = 53 \]
   - Answer: The sample mean time for the five children is **53 minutes**.
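
The same three calculations can be reproduced with a small helper. The function name `sample_mean` is a hypothetical choice for this sketch; `statistics.mean` from the standard library would give the same values.

```python
def sample_mean(values):
    """Sum of the terms divided by the number of terms."""
    if not values:
        raise ValueError("sample mean of an empty sample is undefined")
    return sum(values) / len(values)


print(sample_mean([60, 57, 109, 50]))          # 69.0
print(sample_mean([110, 115, 109, 112, 114]))  # 112.0
print(sample_mean([30, 60, 45, 40, 90]))       # 53.0
```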


# Q12. For a normal distribution data what is the relationship between its measure of central tendency?

In a **normal distribution**, all three common measures of central tendency—**mean**, **median**, and **mode**—have a specific relationship due to the symmetric nature of the distribution:

1. **Mean**:
   - The **mean** (average) of a normal distribution is equal to the **x-value of the peak** (the highest point) of the bell-shaped curve.
   - Mathematically, for a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), the mean is given by:
     \[ \text{Mean} = \mu \]

2. **Median**:
   - The **median** of a normal distribution is also equal to the **x-value of the peak**.
   - Since the normal distribution is symmetric, the median coincides with the mean.
   - Therefore, for a normal distribution:
     \[ \text{Median} = \mu \]

3. **Mode**:
   - The **mode** of a normal distribution is the **x-value of the global maximum** of the bell-shaped curve.
   - Setting the derivative of the normal distribution function to zero, we find that the mode occurs at \(x = \mu\).
   - Thus, for a normal distribution:
     \[ \text{Mode} = \mu \]
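
In short, Mean = Median = Mode = \(\mu\) for a normal distribution. A quick way to see this is with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The parameters below are hypothetical, and the drawn sample only approximates the equality because it is finite.

```python
from statistics import NormalDist, mean, median

dist = NormalDist(mu=100, sigma=15)       # hypothetical parameters
print(dist.mean, dist.median, dist.mode)  # 100.0 100.0 100.0

# A finite sample from the same distribution: mean and median are close
# to 100 and to each other, but not exactly equal.
sample = dist.samples(10_000, seed=42)
print(mean(sample), median(sample))
```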


# Q13. How is covariance different from correlation?

1. **Covariance**:
   - **Definition**: Covariance measures the **extent to which two random variables change together**. It indicates the direction of the linear relationship between variables.
   - **Calculation**: For two variables \(X\) and \(Y\), the sample covariance is given by:
     \[ \text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} \]
     where \(x_i\) and \(y_i\) are data values, \(\bar{x}\) and \(\bar{y}\) are their respective means, and \(n\) is the sample size.
   - **Range**: Covariance values can lie between \(-\infty\) and \(+\infty\).
   - **Units**: The units of covariance are derived from the product of the units of the variables.

2. **Correlation**:
   - **Definition**: Correlation measures both the **strength and direction** of the linear relationship between two variables. It is a standardized form of covariance.
   - **Calculation**: The sample correlation coefficient (\(r\)) is obtained by dividing the covariance of the variables by the product of their standard deviations:
     \[ r = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \]
     where \(\sigma_X\) and \(\sigma_Y\) are the sample standard deviations of \(X\) and \(Y\), computed with the same \(n-1\) denominator as the covariance.
   - **Range**: Correlation values fall within the range of \(-1\) to \(+1\).
   - **Units**: Correlation is **non-dimensional**; it doesn't have any specific units.

3. **Differences**:
   - **Interpretation**:
     - Covariance indicates the **direction** of the linear relationship between variables; its magnitude depends on the units of measurement, so it is hard to interpret on its own.
     - Correlation indicates both the **strength and direction** of the linear relationship on a unit-free scale.
   - **Standardization**:
     - Correlation is a **normalized** version of covariance.
     - Covariance values are not standardized.
   - **Effect of Scale Change**:
     - Covariance is affected by changes in the scale of variables (e.g., multiplying all values by a constant).
     - Correlation is unaffected by adding a constant to a variable or multiplying it by a positive constant; multiplying by a negative constant only flips its sign.
   - **Range**:
     - Covariance values can be large or small, without a specific upper or lower limit.
     - Correlation values are bounded between \(-1\) and \(+1\).


# Q14. How do outliers affect measures of central tendency and dispersion? Provide an example.

1. **Measures of Central Tendency**:
   - **Mean (Average)**:
     - Outliers can **pull the mean** towards their extreme values.
     - If there are outliers with large values, the mean can be **inflated**.
     - Conversely, outliers with small values can **lower** the mean.
     - Example: Consider a dataset of exam scores: {80, 85, 90, 95, 200}. The outlier (200) significantly affects the mean.
       - Without the outlier: Mean = (80 + 85 + 90 + 95) / 4 = 87.5
       - With the outlier: Mean = (80 + 85 + 90 + 95 + 200) / 5 = 110

   - **Median**:
     - The median is **less affected by outliers**.
     - It represents the middle value when data is sorted.
     - Example: In the same dataset, the median barely moves: it is 87.5 without the outlier (the average of 85 and 90) and 90 with it, whereas the mean jumps from 87.5 to 110.

   - **Mode**:
     - The mode (most frequent value) can also be influenced by outliers.
     - If an outlier occurs frequently, it can become the mode.
     - Example: If the dataset includes more occurrences of the outlier (200), it becomes the mode.

2. **Measures of Dispersion**:
   - **Range**:
     - Outliers can significantly **widen the range**.
     - The range is the difference between the maximum and minimum values.
     - Example: In the exam scores dataset, the range without the outlier is 95 - 80 = 15, but with the outlier, it becomes 200 - 80 = 120.

   - **Standard Deviation and Variance**:
     - Outliers can increase the **standard deviation** and **variance**.
     - These measures quantify the spread of data around the mean.
     - Example: In the exam scores dataset, the population standard deviation rises from about 5.59 without the outlier to about 45.28 with it.
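
The exam-score example can be reproduced with the standard library. This sketch prints each measure with and without the outlier so the size of each change can be compared directly; it uses the population standard deviation (`pstdev`), which is an assumption about how the scores are treated.

```python
import statistics

scores = [80, 85, 90, 95]
with_outlier = scores + [200]

for label, data in (("without outlier", scores), ("with outlier", with_outlier)):
    print(label)
    print("  mean  :", statistics.mean(data))
    print("  median:", statistics.median(data))
    print("  range :", max(data) - min(data))
    print("  pstdev:", round(statistics.pstdev(data), 2))
```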