# **STATISTICS BASICS ASSIGNMENT**

**1. Explain the different types of data (qualitative and quantitative) and provide examples of each. Discuss nominal, ordinal, interval, and ratio scales.**
- ## **Types of Data in Statistics**

### **1. Qualitative (Categorical) Data**
Qualitative data represents categories or labels that describe characteristics but do not have numerical meaning. This type of data is often non-numeric, though it can sometimes be assigned numerical codes for analysis.

#### **Examples:**
- **Gender:** Male, Female, Other
- **Blood Type:** A, B, AB, O
- **Eye Color:** Brown, Blue, Green
- **Marital Status:** Single, Married, Divorced

#### **Types of Qualitative Data:**
1. **Nominal Scale:**  
   - Data is categorized without a specific order.
   - Example: Eye color (Brown, Blue, Green), Blood Type (A, B, AB, O)
   
2. **Ordinal Scale:**  
   - Data is categorized with a meaningful order, but differences between categories are not measurable.
   - Example: Satisfaction levels (Low, Medium, High), Education level (Primary, Secondary, Graduate)



### **2. Quantitative (Numerical) Data**
Quantitative data represents measurable quantities and can be analyzed using mathematical operations.

#### **Examples:**
- **Height (in cm):** 160 cm, 175 cm
- **Temperature (in °C):** 25°C, 30°C
- **Weight (in kg):** 55 kg, 70 kg
- **Test Scores:** 85, 90, 95

#### **Types of Quantitative Data:**
1. **Interval Scale:**  
   - Data has a meaningful order, and the differences between values are measurable, but there is no true zero.  
   - Example: Temperature in Celsius or Fahrenheit (0°C doesn’t mean "no temperature"), IQ scores.

2. **Ratio Scale:**  
   - Data has a meaningful order, measurable differences, and a true zero, allowing for meaningful ratios.  
   - Example: Height, Weight, Age, Income (₹0 means no income).  
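
The practical difference between the scales can be sketched in a few lines of Python. The rank mapping for satisfaction levels and the sample values are illustrative assumptions, not data from a real survey.

```python
# Ordinal data: categories have an order, encoded here with an assumed rank map.
satisfaction_rank = {"Low": 1, "Medium": 2, "High": 3}
responses = ["High", "Low", "Medium", "Low"]
sorted_responses = sorted(responses, key=satisfaction_rank.get)

# Interval data: differences are meaningful, ratios are not.
# 20 °C is not "twice as hot" as 10 °C, because 0 °C is an arbitrary zero.
celsius_ratio = 20 / 10                       # 2.0, but not physically meaningful
kelvin_ratio = (20 + 273.15) / (10 + 273.15)  # about 1.035 on a true-zero scale

# Ratio data: a true zero makes ratios meaningful (70 kg is twice 35 kg).
weight_ratio = 70 / 35

print(sorted_responses)
print(celsius_ratio, round(kelvin_ratio, 3), weight_ratio)
```

Sorting works for the ordinal responses only because an order was supplied; nominal categories such as blood type have no equivalent ranking.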



**2. What are the measures of central tendency, and when should you use each? Discuss the mean, median, and mode with examples and situations where each is appropriate.**
-  **Measures of Central Tendency**  

**1. Mean (Average)**  
The mean is the sum of all values in a dataset divided by the total number of values. It provides a useful overall measure but can be distorted by extreme values.  

**Example:**  
In a company, the monthly salaries of five employees are **₹30,000, ₹35,000, ₹40,000, ₹45,000, and ₹10,00,000**. The mean salary is **₹2,30,000**, far higher than what four of the five employees actually earn, because the one extremely high salary pulls it up.  

**Best for:** Normally distributed numerical data, such as average test scores or daily temperatures.  
**Avoid when:** Data contains extreme values, as it may not accurately represent the majority.  

 **2. Median (Middle Value)**  
The median is the middle value in an ordered dataset. It is useful when data has outliers because it is not influenced by extreme values.  

**Example:**  
In the same company, the sorted salaries are **₹30,000, ₹35,000, ₹40,000, ₹45,000, ₹10,00,000**. The median salary is **₹40,000**, which better represents what most employees earn compared to the mean.  

**Best for:** Skewed distributions, such as income, property prices, or waiting times.  
**Less useful when:** Data is roughly symmetric with no outliers; the mean then uses every value and is usually the more informative summary.  

 **3. Mode (Most Frequent Value)**  
The mode is the most frequently occurring value in a dataset. It is useful for identifying common categories or repeated values.  

**Example:**  
A shoe store records shoe sizes sold in a day: **7, 8, 8, 9, 10, 10, 10, 11**. The mode is **10**, as it appears most often, indicating the most popular shoe size.  

**Best for:** Categorical data like customer preferences, voting results, or frequently purchased items.  
**Avoid when:** Data has no repeated values or multiple frequent values without a clear pattern.  
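
The three measures can be checked against the salary and shoe-size examples above with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import statistics

salaries = [30_000, 35_000, 40_000, 45_000, 1_000_000]  # rupees, from the salary example
shoe_sizes = [7, 8, 8, 9, 10, 10, 10, 11]               # from the shoe store example

mean_salary = statistics.mean(salaries)      # pulled up by the one very high salary
median_salary = statistics.median(salaries)  # middle value of the sorted list
mode_size = statistics.mode(shoe_sizes)      # most frequently sold size

print(mean_salary, median_salary, mode_size)
```

The mean comes out at 230000 while the median stays at 40000, which is the gap the salary example describes.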



**3. Explain the concept of dispersion. How do variance and standard deviation measure the spread of data?**
- **Concept of Dispersion:**
Dispersion refers to the spread or variability of data points in a dataset. It helps measure how much individual values differ from the central tendency (mean, median, or mode). A high dispersion indicates that values are widely spread, while low dispersion means they are closely clustered.

**Variance and Standard Deviation:**
- Variance: the average squared deviation from the mean, showing how spread out the data is. A higher variance means more variability.

- Standard Deviation: the square root of the variance. It measures dispersion in the same unit as the original data, making it easier to interpret.

Example:
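Consider the test scores of five students: 60, 62, 85, 90, 95. The mean is 78.4 and the population standard deviation is about 14.57, so a typical score lies roughly 15 points from the mean.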

- A high standard deviation means scores vary widely.

- A low standard deviation means scores are close to the mean.
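
As a rough check, the variance and standard deviation of the test scores can be computed with the standard `statistics` module. The sketch shows both the population version (divide by n) and the sample version (divide by n - 1), since the text does not say which one is intended.

```python
import statistics

scores = [60, 62, 85, 90, 95]

mean_score = statistics.mean(scores)

# Population formulas divide by n; sample formulas divide by n - 1.
pop_variance = statistics.pvariance(scores)
pop_std = statistics.pstdev(scores)
sample_variance = statistics.variance(scores)
sample_std = statistics.stdev(scores)

print(f"mean = {mean_score}")
print(f"population variance = {pop_variance:.2f}, std = {pop_std:.2f}")
print(f"sample variance     = {sample_variance:.2f}, std = {sample_std:.2f}")
```

The standard deviation is in points, the same unit as the scores, while the variance is in squared points.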



**4. What is a box plot, and what can it tell you about the distribution of data?**
-  Box Plot and Data Distribution:
A box plot (box-and-whisker plot) visually represents the distribution of data, highlighting key statistics such as:

- Minimum: The smallest value (excluding outliers).

- First Quartile (Q1): The 25th percentile, where 25% of data falls below.

- Median (Q2): The 50th percentile (middle value).

- Third Quartile (Q3): The 75th percentile, where 75% of data falls below.

- Maximum: The largest value (excluding outliers).

- Outliers: Data points that lie far from the rest.



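How to read the plot:

- If the median is centered in the box: The middle 50% of the data is roughly symmetric.

- If the median is closer to Q1 or Q3: The data is skewed (closer to Q1 suggests right skew, closer to Q3 suggests left skew).

- If the whiskers are unequal: The longer whisker marks the side on which the data has the longer tail.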

Example:
A box plot of house prices may show a long upper whisker, indicating a few expensive properties pushing the prices up.
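
The numbers behind such a plot can be sketched without drawing it. The house prices below are hypothetical, and the sketch assumes the common convention that whiskers extend to the most extreme values within 1.5 × IQR of the box; plotting libraries may use slightly different quartile rules.

```python
import statistics

# Hypothetical house prices (in lakhs of rupees).
prices = [40, 45, 50, 55, 60, 65, 70, 90, 150]

# Quartiles: the three cut points that split the sorted data into four parts.
q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers end at the most extreme values still inside the 1.5 * IQR fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [p for p in prices if lower_fence <= p <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Here the upper whisker (70 to 90) is twice as long as the lower one (40 to 50), and 150 is drawn as a separate outlier point.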



 **5. Discuss the role of random sampling in making inferences about populations.**
- Role of Random Sampling in Population Inference:
Random sampling is a technique where individuals are selected from a population randomly, ensuring that every member has an equal chance of being chosen. This helps in making unbiased inferences about the whole population.

Importance:
- Reduces Bias: Ensures that sample data represents the population fairly.

- Improves Accuracy: Larger random samples provide better approximations of population characteristics.

- Allows Generalization: Findings from a sample can be extended to the whole population with statistical confidence.

Example:
A pharmaceutical company testing a new drug selects random patients from different age groups and backgrounds to ensure fair results.
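
A small simulation can illustrate why a random sample supports inference. The population of patient ages below is made up for the sketch, and the fixed seed is only there to make the run repeatable.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 10,000 patients.
population = [random.randint(18, 90) for _ in range(10_000)]

# Simple random sample: every patient has the same chance of selection.
sample = random.sample(population, k=200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean age = {population_mean:.2f}")
print(f"sample mean age     = {sample_mean:.2f}")
```

With only 200 of the 10,000 patients, the sample mean should land close to the population mean, and larger samples would typically land closer still.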

**6. Explain the concept of skewness and its types. How does skewness affect the interpretation of data?**
- Skewness is a statistical measure that describes the asymmetry of a probability distribution. It indicates whether the data points are concentrated on one side of the mean. There are three types of skewness:

- Positive Skewness (Right Skewed): The right tail of the distribution is longer or fatter than the left. The mean is typically greater than the median.
- Negative Skewness (Left Skewed): The left tail is longer or fatter than the right. The mean is typically less than the median.
- Zero Skewness (Symmetrical): The distribution is symmetric about its center, and the mean and median are equal.

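Skewness affects interpretation because the mean is pulled toward the longer tail while the median stays near the bulk of the data. For a skewed distribution the two measures can therefore suggest different "typical" values, and the median is usually the more representative one.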
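
One common way to put a number on skewness is the moment coefficient, the third central moment divided by the 1.5 power of the second. The text does not name a formula, so the choice here is an assumption; the sketch reuses the salary figures from question 2 as a right-skewed example.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_skewed = [30_000, 35_000, 40_000, 45_000, 1_000_000]  # one long right tail
symmetric = [1, 2, 3, 4, 5]

print(round(skewness(right_skewed), 3))  # positive for a right-skewed dataset
print(skewness(symmetric))               # 0.0 for a symmetric dataset
```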

**7. What is the interquartile range (IQR), and how is it used to detect outliers?**
- The interquartile range (IQR) is a measure of statistical dispersion and is calculated as the difference between the third quartile (Q3) and the first quartile (Q1) of a dataset (IQR = Q3 - Q1). It represents the range within which the central 50% of the data lies.

To detect outliers using the IQR, the following steps are taken:

1) Calculate Q1 and Q3.

2) Compute the IQR.

3) Determine the lower bound as Q1 - 1.5 * IQR and the upper bound as Q3 + 1.5 * IQR.

4) Any data points outside these bounds are considered outliers.
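
The four steps can be sketched as a small function. The waiting times are hypothetical, and `iqr_outliers` is a helper name chosen for this example; different quartile conventions can shift the bounds slightly.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_bound, upper_bound, outliers) using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")  # step 1
    iqr = q3 - q1                                                      # step 2
    lower = q1 - 1.5 * iqr                                             # step 3
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]           # step 4
    return lower, upper, outliers

# Hypothetical waiting times in minutes.
waiting_times = [10, 12, 13, 14, 15, 16, 18, 45]
lower, upper, outliers = iqr_outliers(waiting_times)
print(lower, upper, outliers)
```

For this data the bounds are 7.125 and 22.125, so the 45-minute wait is the only value flagged.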

**8. Discuss the conditions under which the binomial distribution is used.**
- The binomial distribution is used under the following conditions:

1) Fixed Number of Trials (n): The experiment is conducted a specific number of times.

2) Two Possible Outcomes: Each trial results in one of two outcomes, often termed "success" and "failure."

3) Constant Probability (p): The probability of success remains the same for each trial.

4) Independent Trials: The outcome of one trial does not affect the outcome of another.
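
When all four conditions hold, the probability of exactly k successes follows the standard binomial formula C(n, k) · p^k · (1 − p)^(n − k). The formula is not stated above, so the sketch below brings it in as background, using 10 fair coin flips as an assumed example.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with constant success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 5 heads in 10 fair coin flips.
p_five_heads = binomial_pmf(5, n=10, p=0.5)
print(round(p_five_heads, 4))  # 0.2461

# The probabilities over all possible outcomes k = 0..n sum to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(total)
```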


**9. Explain the properties of the normal distribution and the empirical rule (68-95-99.7 rule).**
- The normal distribution is a continuous probability distribution characterized by the following properties:

1) Symmetrical: The distribution is symmetric around the mean.

2) Bell-Shaped Curve: The graph of the distribution forms a bell shape.

3) Mean, Median, Mode: All three measures of central tendency are equal and located at the center of the distribution.

4) Asymptotic: The tails of the distribution approach the horizontal axis but never touch it.

The empirical rule (68-95-99.7 rule) states that for a normal distribution:

 - Approximately 68% of the data falls within one standard deviation (σ) of the mean (μ).
 - About 95% falls within two standard deviations.
 - About 99.7% falls within three standard deviations.
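
The three percentages can be verified numerically with `statistics.NormalDist` from the standard library. This is a quick check on the standard normal distribution (mean 0, standard deviation 1), which is enough because the rule is stated in units of σ.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability that a value lies within k standard deviations of the mean.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, prob in coverage.items():
    print(f"within {k} standard deviation(s): {prob:.4f}")
```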

**10. Provide a Real-Life Example of a Poisson Process and Calculate the Probability for a Specific Event**
  - Real-Life Example:

Suppose a call center receives an average of 3 calls per minute ($\lambda = 3$). What is the probability of receiving exactly 5 calls in a minute?

 - Poisson Formula:

The probability of $k$ events occurring in a fixed interval is:

$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}$$

 - Calculation:

For $\lambda = 3$ and $k = 5$:

$$P(X = 5) = \frac{e^{-3} \cdot 3^5}{5!} \approx 0.1008$$
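
The same calculation can be reproduced in Python using only the `math` module. `poisson_pmf` is a helper name introduced for this sketch.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with average rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Probability of exactly 5 calls in a minute when the average is 3.
p_five_calls = poisson_pmf(5, lam=3)
print(round(p_five_calls, 4))  # 0.1008
```

So there is roughly a 10% chance of receiving exactly 5 calls in any given minute.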

**11. Explain what a random variable is and differentiate between discrete and continuous random variables.**
 - A random variable is a variable that takes on numerical values determined by the outcome of a random experiment.

Types of Random Variables:

- Discrete Random Variable:

Takes on countable values (e.g., 0, 1, 2, ...).
Example: Number of heads in 10 coin flips.

- Continuous Random Variable:

Takes on any value within a range or interval.
Example: The height of students in a class.
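
The distinction can be illustrated by simulating one value of each kind. The normal distribution for heights (mean 165 cm, standard deviation 8 cm) is an assumption made only for this sketch.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Discrete: number of heads in 10 fair coin flips (only 0, 1, ..., 10 are possible).
heads = sum(random.random() < 0.5 for _ in range(10))

# Continuous: a simulated height in cm, drawn from an assumed normal
# distribution with mean 165 and standard deviation 8.
height_cm = random.gauss(165, 8)

print(heads, round(height_cm, 2))
```

The count of heads is always a whole number, while the height can take any real value in a range and is only rounded for display.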

**12. Provide an Example Dataset, Calculate Both Covariance and Correlation, and Interpret the Results**
  - Example Dataset:

X = 2, 4, 6, 8, 10

Y = 3, 7, 9, 13, 15

- Calculations in Python:



```
import numpy as np
import pandas as pd

# Example data
data = pd.DataFrame({
    'X': [2, 4, 6, 8, 10],
    'Y': [3, 7, 9, 13, 15]
})

# Sample covariance (np.cov divides by n - 1 by default); [0, 1] selects cov(X, Y)
covariance = np.cov(data['X'], data['Y'])[0, 1]

# Pearson correlation coefficient; [0, 1] selects corr(X, Y)
correlation = np.corrcoef(data['X'], data['Y'])[0, 1]

print(f"Covariance: {covariance}")
print(f"Correlation: {correlation}")

```

- Interpretation:

The sample covariance is 15.0. Its positive sign shows that X and Y tend to increase together, but its size depends on the units of X and Y, so it is hard to judge on its own. The correlation is about 0.993, which is close to +1 and indicates a very strong positive linear relationship between X and Y.

