### Q1. Explain the different types of data (qualitative and quantitative) and provide examples of each. Discuss nominal, ordinal, interval, and ratio scales.

**Types of Data:**

1. **Qualitative Data (Categorical):**
    - Describes qualities or characteristics.
    - Cannot be measured numerically, only categorized.
    - **Examples:** Colors (red, blue), Gender (male, female), Types of cuisine (Italian, Chinese).

2. **Quantitative Data (Numerical):**
    - Represents quantities and can be measured numerically.
    - **Examples:** Height (170 cm), Age (25 years), Number of students (30).

**Scales of Measurement:**

1. **Nominal Scale:**
    - Categories with no inherent order.
    - **Example:** Types of fruit (apple, banana, orange).

2. **Ordinal Scale:**
    - Categories with a meaningful order, but intervals are not equal.
    - **Example:** Survey ratings (satisfied, neutral, dissatisfied), Education level (high school, bachelor, master).

3. **Interval Scale:**
    - Numeric values with equal intervals between them, but no true zero, so differences are meaningful while ratios are not.
    - **Example:** Temperature in Celsius or Fahrenheit.

4. **Ratio Scale:**
    - Ordered, equal intervals, and has a true zero point.
    - **Example:** Weight (0 kg means no weight), Height, Age, Income.
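
The difference between the interval and ratio scales can be checked numerically. The sketch below is only an illustration: the two temperatures are made-up values, and the Kelvin conversion (not mentioned above) is brought in as an assumption because Kelvin has a true zero.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return c + 273.15

# Hypothetical readings in degrees Celsius.
t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale and survive the unit change.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are only meaningful on a ratio scale.
naive_ratio = t2_c / t1_c  # 2.0, but 20 C is not "twice as hot" as 10 C
true_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(f"Difference: {diff_c:.1f} degrees C = {diff_k:.1f} K")
print(f"Celsius ratio: {naive_ratio:.2f}, Kelvin ratio: {true_ratio:.3f}")
```

The difference is 10 degrees on both scales, while the ratio changes from 2.0 to roughly 1.035 once a true zero is used.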

### Q2. What are the measures of central tendency, and when should you use each? Discuss the mean, median, and mode with examples and situations where each is appropriate.

**Measures of Central Tendency:**

1. **Mean (Average):**
    - Calculated by adding all values and dividing by the number of values.
    - **Example:** For the data set [2, 4, 6, 8], the mean is (2+4+6+8)/4 = 5.
    - **When to use:** Use the mean for data without extreme outliers and when all values are equally important.

2. **Median:**
    - The middle value when the data are ordered from smallest to largest. If there is an even number of values, it is the average of the two middle values.
    - **Example:** For [1, 3, 5], the median is 3. For [1, 3, 5, 7], the median is (3+5)/2 = 4.
    - **When to use:** Use the median when data has outliers or is skewed, as it is less affected by extreme values.

3. **Mode:**
    - The value that appears most frequently in the data set.
    - **Example:** For [2, 2, 3, 4], the mode is 2.
    - **When to use:** Use the mode for categorical data or when you want to know the most common value.

**Summary Table:**

| Measure | Best Used When... | Example Data | Result |
|---------|-------------------|--------------|--------|
| Mean    | Data is symmetric, no outliers | [2, 4, 6, 8] | 5 |
| Median  | Data is skewed or has outliers | [1, 2, 2, 100] | 2 |
| Mode    | Most frequent value is needed, categorical data | [red, blue, blue, green] | blue |
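
The rows of the table can be reproduced with Python's standard `statistics` module. This is a minimal sketch using the same example data; the variable names are chosen here for illustration.

```python
import statistics

symmetric = [2, 4, 6, 8]
skewed = [1, 2, 2, 100]
colors = ["red", "blue", "blue", "green"]

mean_symmetric = statistics.mean(symmetric)  # symmetric data: the mean is a fair summary
mean_skewed = statistics.mean(skewed)        # pulled far upward by the single value 100
median_skewed = statistics.median(skewed)    # barely affected by the outlier
mode_colors = statistics.mode(colors)        # works for categorical data as well

print(f"Mean of {symmetric}: {mean_symmetric}")
print(f"Mean of {skewed}: {mean_skewed}, median: {median_skewed}")
print(f"Mode of {colors}: {mode_colors}")
```

For the skewed data the mean is 26.25 while the median is 2, which is why the median is preferred when outliers are present.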

### Q3. Explain the concept of dispersion. How do variance and standard deviation measure the spread of data?

**Dispersion** refers to the extent to which data values vary around the central value (such as the mean). It shows how spread out the data points are in a dataset.

**Variance and Standard Deviation:**

For a dataset \( x_1, x_2, \ldots, x_n \), **variance** and **standard deviation** quantify how spread out the values are.

- **Variance** is the average squared deviation of the data points from the mean; a higher variance means the points are more spread out. For a full population it is calculated as:

  \[
  \text{Variance} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2
  \]

  where:
  - \( n \) is the number of data points
  - \( x_i \) is each individual data point
  - \( \mu \) is the mean of the dataset

- **Standard Deviation** is simply the square root of the variance. Since it’s in the same units as the original data, it’s easier to interpret.

  \[
  \text{Standard Deviation} = \sqrt{\text{Variance}}
  \]

**Example:**  
For the data [2, 4, 4, 4, 5, 5, 7, 9]:
- Mean = 5
- Variance = 4
- Standard Deviation = 2

**Interpretation:**  
A small standard deviation means data points are close to the mean; a large standard deviation means they are more spread out.
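
The example values can be checked with the standard `statistics` module. The sketch below uses the population formulas (dividing by \( n \)), which match the formula above. It also prints the sample variance (dividing by \( n - 1 \)), which is not discussed above and is shown only as a point of comparison.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
pop_variance = statistics.pvariance(data)   # divides by n, as in the formula above
pop_std = statistics.pstdev(data)           # square root of the population variance
sample_variance = statistics.variance(data) # divides by n - 1 (sample estimate)

print(f"Mean: {mean}")
print(f"Population variance: {pop_variance}")
print(f"Population standard deviation: {pop_std}")
print(f"Sample variance: {sample_variance:.3f}")
```

The population figures agree with the worked example (variance 4, standard deviation 2); the sample variance is slightly larger at 32/7.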

### Q4. What is a box plot, and what can it tell you about the distribution of data?

A **box plot** (or box-and-whisker plot) is a graphical representation of the distribution of a dataset. It displays the dataset’s minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum values.

**What a box plot shows:**
- **Median (Q2):** The line inside the box shows the median value.
- **Quartiles (Q1 and Q3):** The edges of the box represent the 25th (Q1) and 75th (Q3) percentiles.
- **Interquartile Range (IQR):** The length of the box (Q3 - Q1) shows the spread of the middle 50% of the data.
- **Whiskers:** Lines extending from the box to the smallest and largest values that are not outliers.
- **Outliers:** Data points more than 1.5 × IQR below Q1 or above Q3 are plotted individually.

**What you can learn from a box plot:**
- The central value (median) of the data.
- The spread and variability (IQR).
- Skewness (if the median is not centered in the box or whiskers are uneven).
- Presence of outliers.

**Example:**
A box plot can quickly show if data is symmetric, skewed, or contains outliers, making it useful for comparing distributions across groups.
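
The numbers a box plot draws can be computed without plotting anything. The sketch below is one possible way to do it with the standard library, on a hypothetical dataset. It assumes the "inclusive" quartile method; plotting tools may use a different quartile convention and give slightly different box edges.

```python
import statistics

# Hypothetical data with one extreme value.
data = sorted([2, 4, 4, 4, 5, 5, 7, 9, 30])

# "inclusive" interpolates linearly between the closest data points.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Whiskers run from {min(inside)} to {max(inside)}")
print(f"Outliers: {outliers}")
```

Here the box spans 4 to 7, the whiskers reach 2 and 9, and the value 30 is flagged as an outlier. The longer upper whisker also hints at right skew.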

### Q5. Discuss the role of random sampling in making inferences about populations.

**Random sampling** selects members of a population by chance; in its simplest form (simple random sampling), every member has an equal chance of being selected. This is crucial for making valid inferences about populations because it helps ensure that the sample is representative of the entire population.

**Role in Inference:**
- **Reduces Bias:** Random sampling minimizes selection bias, making the results more generalizable.
- **Enables Statistical Analysis:** Many statistical methods assume random sampling; it allows for the use of probability theory to estimate population parameters and calculate margins of error.
- **Supports Valid Conclusions:** With a representative sample, conclusions drawn about the population (such as means, proportions, or relationships) are more likely to be accurate.

**Example:**  
If you want to estimate the average height of students in a university, randomly selecting students from the entire student body will give a more reliable estimate than choosing only students from one class or group.
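
A small simulation can make the height example concrete. Everything in the sketch below is hypothetical: the population sizes, the two subgroups, and the means and standard deviations are invented purely to contrast a random sample with a convenience sample.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical university: 9,000 students around 170 cm and 1,000 athletes around 190 cm.
general = [random.gauss(170, 7) for _ in range(9000)]
athletes = [random.gauss(190, 6) for _ in range(1000)]
population = general + athletes

true_mean = statistics.mean(population)

# Simple random sample: every student has the same chance of selection.
random_sample = random.sample(population, 200)

# Convenience sample: only one group, e.g. surveying a single sports class.
biased_sample = random.sample(athletes, 200)

random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)

print(f"Population mean:    {true_mean:.1f} cm")
print(f"Random sample mean: {random_mean:.1f} cm")
print(f"Biased sample mean: {biased_mean:.1f} cm")
```

Under these assumptions the random sample lands close to the population mean, while the single-group sample overestimates it by a wide margin no matter how many students it contains.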

### Q6. Explain the concept of skewness and its types. How does skewness affect the interpretation of data?

**Skewness** measures the asymmetry of a data distribution around its mean.

- **Types of Skewness:**
    1. **Symmetrical Distribution:** The left and right sides of the distribution are mirror images. For a unimodal distribution, Mean = Median = Mode.
    2. **Positive Skew (Right Skew):** The right tail is longer; most data are concentrated on the left. Typically, Mean > Median > Mode.
    3. **Negative Skew (Left Skew):** The left tail is longer; most data are concentrated on the right. Typically, Mean < Median < Mode.

**Effect on Interpretation:**
- Skewness indicates whether data are spread more to one side of the mean.
- In positively skewed data, the mean is pulled higher by large values; in negatively skewed data, the mean is pulled lower by small values.
- Skewness affects which measure of central tendency (mean, median, mode) is most appropriate to use.
- Understanding skewness helps in choosing the right statistical methods and in accurately interpreting data summaries.

**Example:**  
Income data are often right-skewed: most people earn modest incomes, but a few very high earners pull the mean above the median.
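
The income example can be illustrated with a short calculation. The figures below are hypothetical, and the skewness coefficient used is the population moment coefficient (mean cubed deviation divided by the cubed standard deviation), which is one common definition among several.

```python
import statistics

# Hypothetical annual incomes in thousands; one very high earner creates a long right tail.
incomes = [28, 30, 32, 35, 36, 38, 40, 42, 45, 250]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

def moment_skewness(values):
    """Population moment skewness: mean cubed deviation divided by sd cubed."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return sum((x - mu) ** 3 for x in values) / (len(values) * sigma ** 3)

skew = moment_skewness(incomes)

print(f"Mean: {mean_income}, median: {median_income}")
print(f"Skewness: {skew:.2f} (positive means right-skewed)")
```

The mean (57.6) sits well above the median (37), and the positive coefficient confirms the right skew; a symmetric dataset such as `[1, 2, 3, 4, 5]` gives a coefficient of zero.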

### Q7. What is the interquartile range (IQR), and how is it used to detect outliers?

**Interquartile Range (IQR):**  
The IQR is a measure of statistical dispersion and represents the range within which the central 50% of the data lie. It is calculated as:


\[
\text{IQR} = Q3 - Q1
\]


where:
- \( Q1 \) is the first quartile (25th percentile)
- \( Q3 \) is the third quartile (75th percentile)

**Detecting Outliers with IQR:**  
Outliers are typically defined as data points that fall below \( Q1 - 1.5 \times \text{IQR} \) or above \( Q3 + 1.5 \times \text{IQR} \).

**Example:**  
If \( Q1 = 10 \) and \( Q3 = 20 \), then \( \text{IQR} = 10 \).  
- Lower bound: \( 10 - 1.5 \times 10 = -5 \)
- Upper bound: \( 20 + 1.5 \times 10 = 35 \)

Any data point outside \([-5, 35]\) is considered an outlier.
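
The rule can be wrapped in a small helper. The function name `iqr_fences` and the list of values are hypothetical, chosen only to apply the bounds from the example above.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier bounds for given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(10, 20)

# Hypothetical observations to screen against the bounds.
values = [-8, 3, 12, 18, 25, 40]
outliers = [v for v in values if v < lower or v > upper]

print(f"Bounds: [{lower}, {upper}]")
print(f"Outliers: {outliers}")
```

With Q1 = 10 and Q3 = 20 the bounds are -5 and 35, so only -8 and 40 are flagged.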

### Q8. Discuss the conditions under which the binomial distribution is used.

**Binomial Distribution Conditions:**

The binomial distribution models the number of successes in a fixed number of independent trials, each with the same probability of success. It is used when the following conditions are met:

1. **Fixed Number of Trials (n):** The experiment is repeated a set number of times.
2. **Two Possible Outcomes:** Each trial results in either "success" or "failure".
3. **Constant Probability (p):** The probability of success is the same for each trial.
4. **Independence:** The outcome of one trial does not affect the others.

**Examples:**
- Flipping a coin 10 times and counting the number of heads.
- Testing 20 light bulbs and counting how many work.

If these conditions are satisfied, the binomial distribution can be used to calculate probabilities of obtaining a certain number of successes.
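
As an illustration, the coin example can be worked through with the binomial probability mass function \( P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k} \). The sketch below assumes a fair coin (\( p = 0.5 \)); the helper name `binomial_pmf` is chosen here and does not come from any library.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # 10 flips of a fair coin

p_six_heads = binomial_pmf(6, n, p)
p_at_least_eight = sum(binomial_pmf(k, n, p) for k in range(8, n + 1))
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(f"P(exactly 6 heads)  = {p_six_heads:.4f}")
print(f"P(at least 8 heads) = {p_at_least_eight:.4f}")
print(f"Sum over all k      = {total:.4f}")
```

The probability of exactly 6 heads is 210/1024, about 0.2051, and the probabilities over all possible counts sum to 1.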

### Q9. Explain the properties of the normal distribution and the empirical rule (68-95-99.7 rule).

**Normal Distribution:**
- The normal distribution is a continuous, symmetric, bell-shaped distribution centered around the mean.
- **Properties:**
    - Symmetrical about the mean (\( \mu \)).
    - Mean, median, and mode are all equal.
    - Defined by two parameters: mean (\( \mu \)) and standard deviation (\( \sigma \)).
    - The total area under the curve is 1.
    - Tails extend infinitely in both directions but never touch the horizontal axis.

**Empirical Rule (68-95-99.7 Rule):**
- Describes how data are distributed in a normal distribution:
    - About **68%** of data falls within **1 standard deviation** (\( \mu \pm 1\sigma \)) of the mean.
    - About **95%** falls within **2 standard deviations** (\( \mu \pm 2\sigma \)).
    - About **99.7%** falls within **3 standard deviations** (\( \mu \pm 3\sigma \)).

**Example:**
If test scores are normally distributed with a mean of 70 and a standard deviation of 10:
- About 68% of scores are between 60 and 80.
- About 95% are between 50 and 90.
- About 99.7% are between 40 and 100.

The empirical rule helps quickly estimate the spread and identify outliers in normally distributed data.
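
The three percentages can be verified from the normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the standard library and assumes the test-score example is exactly normal with mean 70 and standard deviation 10.

```python
from statistics import NormalDist

scores = NormalDist(mu=70, sigma=10)

shares = {}
for k in (1, 2, 3):
    low, high = 70 - k * 10, 70 + k * 10
    # Area under the curve between low and high.
    shares[k] = scores.cdf(high) - scores.cdf(low)
    print(f"Within {k} SD ({low} to {high}): {shares[k]:.4f}")
```

The exact areas are about 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.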

### Q10. Provide a real-life example of a Poisson process and calculate the probability for a specific event.

**Real-life example:** the number of emails received per hour.

Suppose you receive an average of 5 emails per hour (λ = 5). What is the probability of receiving exactly 3 emails in the next hour?

```python
from scipy.stats import poisson

lam = 5  # average rate (emails per hour)
k = 3    # number of emails

prob = poisson.pmf(k, lam)
print(f"Probability of receiving exactly {k} emails in the next hour: {prob:.4f}")
```
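
The same probability can be reproduced directly from the Poisson formula \( P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!} \) without SciPy. The sketch below is a cross-check using only the standard library; the helper name `poisson_pmf` is chosen here for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return lam**k * exp(-lam) / factorial(k)

lam = 5  # average emails per hour

p_exactly_3 = poisson_pmf(3, lam)
p_at_most_3 = sum(poisson_pmf(i, lam) for i in range(4))

print(f"P(X = 3)  = {p_exactly_3:.4f}")
print(f"P(X <= 3) = {p_at_most_3:.4f}")
```

With λ = 5 the probability of exactly 3 emails is about 0.1404, and the probability of 3 or fewer is about 0.2650.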

### Q11. Explain what a random variable is and differentiate between discrete and continuous random variables.

**Random Variable:**  
A random variable is a numerical quantity whose value depends on the outcome of a random phenomenon. It assigns a real number to each outcome in a sample space.

**Types of Random Variables:**

1. **Discrete Random Variable:**
    - Takes on a countable number of distinct values.
    - Typically arises from counting processes.
    - **Examples:** Number of heads in 10 coin tosses, number of students in a class.

2. **Continuous Random Variable:**
    - Can take on any value within a given range (possibly infinite).
    - Typically arises from measuring processes.
    - **Examples:** Height of students, time taken to run a race, temperature.

**Summary Table:**

| Type      | Values Taken         | Example                      |
|-----------|---------------------|------------------------------|
| Discrete  | Countable           | Number of cars in a parking lot |
| Continuous| Any value in interval| Weight of a person           |
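
The practical difference shows up in how probabilities are computed: a discrete variable has a probability for each individual value, while a continuous variable only has probabilities for intervals. The sketch below illustrates both cases. The coin example uses 3 tosses to keep the enumeration small, and the normal model for height (mean 170 cm, standard deviation 10 cm) is a hypothetical assumption.

```python
from collections import Counter
from itertools import product
from statistics import NormalDist

# Discrete: number of heads in 3 fair coin tosses.
# Enumerate all 8 equally likely outcomes and count heads in each.
outcomes = list(product("HT", repeat=3))
counts = Counter(seq.count("H") for seq in outcomes)
pmf = {k: counts[k] / len(outcomes) for k in sorted(counts)}
print(f"PMF of heads in 3 tosses: {pmf}")

# Continuous: height modelled as a normal distribution (hypothetical parameters).
height = NormalDist(mu=170, sigma=10)

# The probability of any single exact value is zero; only intervals carry probability.
p_interval = height.cdf(175) - height.cdf(165)
print(f"P(165 < height < 175) = {p_interval:.4f}")
```

The discrete probabilities are 1/8, 3/8, 3/8, and 1/8 for 0 to 3 heads, while the continuous case yields a probability of about 0.383 for the 165 to 175 cm interval.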

### Q12. Provide an example dataset, calculate both covariance and correlation, and interpret the results.

```python
import pandas as pd

# Example dataset
data = {
    'Hours_Studied': [2, 4, 6, 8, 10],
    'Exam_Score': [65, 70, 75, 80, 85]
}
df = pd.DataFrame(data)

# Calculate covariance
covariance = df['Hours_Studied'].cov(df['Exam_Score'])

# Calculate correlation
correlation = df['Hours_Studied'].corr(df['Exam_Score'])

print(f"Covariance: {covariance:.2f}")
print(f"Correlation: {correlation:.2f}")

# Interpretation
if correlation > 0:
    print("There is a positive relationship: as hours studied increase, exam scores tend to increase.")
elif correlation < 0:
    print("There is a negative relationship: as hours studied increase, exam scores tend to decrease.")
else:
    print("There is no linear relationship between hours studied and exam scores.")
```

**Output:**

```text
Covariance: 25.00
Correlation: 1.00
There is a positive relationship: as hours studied increase, exam scores tend to increase.
```

**Interpretation:**  
The covariance of 25 is positive, so the two variables move in the same direction, but its size depends on the units (hours × points). The correlation of 1.00 is unit-free and indicates a perfect positive linear relationship: in this dataset, every additional 2 hours of study corresponds to exactly 5 more points.
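
To see where these two numbers come from, the calculation can be written out by hand. The sketch below uses the sample formulas (dividing by \( n - 1 \)), which is what pandas applies by default, and obtains the correlation by dividing the covariance by the product of the two standard deviations.

```python
import math

hours = [2, 4, 6, 8, 10]
scores = [65, 70, 75, 80, 85]
n = len(hours)

mean_h = sum(hours) / n
mean_s = sum(scores) / n

# Sample covariance: average product of paired deviations, dividing by n - 1.
cov = sum((h - mean_h) * (s - mean_s) for h, s in zip(hours, scores)) / (n - 1)

# Sample standard deviations of each variable.
sd_h = math.sqrt(sum((h - mean_h) ** 2 for h in hours) / (n - 1))
sd_s = math.sqrt(sum((s - mean_s) ** 2 for s in scores) / (n - 1))

# Correlation rescales the covariance to the range [-1, 1].
corr = cov / (sd_h * sd_s)

print(f"Covariance: {cov:.2f}")
print(f"Correlation: {corr:.2f}")
```

The paired deviation products sum to 100, giving a covariance of 100 / 4 = 25, and dividing by the standard deviations removes the units to give a correlation of 1.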
