# Statistics Basics


1. Explain the different types of data (qualitative and quantitative) and provide examples of each. Discuss
nominal, ordinal, interval, and ratio scales.
- Data is categorized into **qualitative** and **quantitative** types:

* **Qualitative Data (Categorical)** – Descriptive, non-numeric data.
   - **Nominal Scale**: Labels without a meaningful order.  
     *Example: Gender (Male, Female), Eye Color (Blue, Green, Brown).*
   - **Ordinal Scale**: Ordered categories, but differences are not measurable.  
     *Example: Education Level (High School, Bachelor’s, Master’s), Satisfaction Rating (Low, Medium, High).*

* **Quantitative Data (Numerical)** – Numeric values that can be measured.
   - **Interval Scale**: Ordered data with meaningful differences but no true zero.  
     *Example: Temperature in Celsius or Fahrenheit, IQ Scores.*
   - **Ratio Scale**: Ordered data with meaningful differences and a true zero.  
     *Example: Height, Weight, Age, Income.*
2. What are the measures of central tendency, and when should you use each? Discuss the mean, median,
and mode with examples and situations where each is appropriate.
- ### **Measures of Central Tendency**  
These are ways to find the "center" of a dataset.  

* **Mean (Average)**  
   - Formula: **Sum of values ÷ Number of values**  
   - **Best for:** Roughly symmetric data with no extreme outliers, such as normally distributed data.  
   - **Example:** Average salary of employees in a company.  
   - **Avoid if:** Data has extreme values (outliers), as they can skew the mean.  
* **Median (Middle Value)**  
   - **Best for:** Skewed data or when there are outliers.  
   - **Example:** Median house price (helps avoid distortion by very high/low prices).  
   - **How to find:** Arrange the data in order and pick the middle value (or the average of the two middle values if the number of values is even).  

* **Mode (Most Frequent Value)**  
   - **Best for:** Categorical data or when you want to know the most common value.  
   - **Example:** Most common shoe size sold in a store.  
   - **Can be used for:** Both numerical and non-numerical data.
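
As a minimal sketch of the three measures, the snippet below uses Python's standard `statistics` module on made-up data. The salary figures and shoe sizes are hypothetical, chosen only to show how a single outlier pulls the mean away from the median while the median and mode stay put.

```python
import statistics

# Hypothetical monthly salaries (in thousands); the last value is an extreme outlier.
salaries = [30, 32, 35, 35, 38, 40, 250]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # middle value of the sorted data
mode_salary = statistics.mode(salaries)      # most frequent value

# The mode also works for non-numeric (categorical) data.
shoe_sizes = ["M", "L", "M", "S", "M", "L"]  # hypothetical sizes sold
most_common_size = statistics.mode(shoe_sizes)

print(mean_salary, median_salary, mode_salary, most_common_size)
```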
3. Explain the concept of dispersion. How do variance and standard deviation measure the spread of data?
- Concept of Dispersion
Dispersion describes how spread out the data is, that is, how much the values differ from the average (mean).

* **Variance (σ² for a population, s² for a sample)**
   - Measures the average squared difference from the mean.
   - Higher variance = more spread-out data.
   - Example: If students’ test scores vary a lot, the variance is high.

* **Standard Deviation (σ or s)**
   - Square root of the variance (brings the units back to the original scale).
   - Shows how much values typically deviate from the mean.
   - Example: If the standard deviation of salaries in a company is low, most employees earn close to the average salary.
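
To make the two measures concrete, here is a small sketch using Python's standard `statistics` module. The scores are hypothetical. `pvariance` and `pstdev` treat the data as a whole population (dividing by n), while `variance` and `stdev` treat it as a sample (dividing by n - 1).

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical test scores, mean = 5

# Population versions: average squared deviation from the mean, and its square root.
pop_var = statistics.pvariance(scores)
pop_std = statistics.pstdev(scores)

# Sample versions: divide by (n - 1) instead of n.
sample_var = statistics.variance(scores)
sample_std = statistics.stdev(scores)

print(pop_var, pop_std, sample_var, sample_std)
```

For this data the squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the population standard deviation is 2, in the same units as the scores.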

4. What is a box plot, and what can it tell you about the distribution of data?
- Box Plot (Box-and-Whisker Plot)
A box plot is a visual way to show the spread and distribution of a dataset using 5 key values:

* Minimum

* 1st Quartile (Q1)

* Median (Q2)

* 3rd Quartile (Q3)

* Maximum

What it tells you:
* Center of data (via median)

* Spread of data (via IQR: Q3 - Q1)

* Skewness (if the median isn’t in the center of the box)

* Outliers (shown as dots beyond whiskers)
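
The five values behind a box plot can be computed without drawing anything. The sketch below is one way to do it with the standard `statistics` module; the helper name `five_number_summary` and the data are hypothetical, and the "inclusive" quartile method is an assumption, since plotting libraries may use a slightly different quartile rule.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data
minimum, q1, median, q3, maximum = five_number_summary(values)
iqr = q3 - q1  # width of the box

print(minimum, q1, median, q3, maximum, iqr)
```

In many plotting tools the whiskers stop at the most extreme values that are not flagged as outliers, so the drawn whisker ends can differ from the raw minimum and maximum.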

5. Discuss the role of random sampling in making inferences about populations.
- Role of Random Sampling
Random sampling means selecting individuals from a population by chance, so every member has an equal chance of being chosen.

Why it’s important:
* Represents the whole population fairly

* Reduces bias in data collection

* Allows us to make valid inferences or generalizations about the population using a small sample

* Essential for reliable surveys, experiments, and statistical analysis
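
A quick simulation can illustrate the idea. This is only a sketch under stated assumptions: the population is a made-up list of 10,000 numbers, the sample size of 500 is arbitrary, and the seed is fixed so the run is repeatable.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

population = list(range(1, 10001))       # hypothetical population of 10,000 values
sample = random.sample(population, 500)  # simple random sample, without replacement

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# The sample mean should land close to the population mean, though not exactly on it.
print(population_mean, sample_mean)
```

Because every member is equally likely to be picked, the sample mean is an unbiased estimate of the population mean, and larger samples tend to land closer to it.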

6. Explain the concept of skewness and its types. How does skewness affect the interpretation of data?
- Concept of Skewness
Skewness describes the asymmetry of a data distribution, that is, whether the longer tail lies to the left or to the right.

Types of Skewness:
 Positive Skew (Right-skewed)

* Tail is longer on the right

* Most values are low, a few high outliers

* Mean > Median

Example: Income distribution (most earn low, few earn very high)

 Negative Skew (Left-skewed)

* Tail is longer on the left

* Most values are high, a few low outliers

* Mean < Median

Example: Age at retirement (most people retire at older ages, few early)

 No Skew (Symmetrical)

* Mean = Median = Mode (when the distribution also has a single peak)

* Data is distributed evenly on both sides of the center

Impact on Interpretation:
* Skewness helps you decide which measure of central tendency to use

* Affects how you interpret the mean (it can be misleading in skewed data)

* Can indicate the presence of outliers and data imbalance
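
The sign of the skewness can be checked numerically. The sketch below uses one common definition, the moment-based (Fisher-Pearson) coefficient, which divides the third central moment by the second central moment raised to the power 1.5. The function name `skewness` and the three datasets are hypothetical.

```python
import statistics

def skewness(data):
    """Moment-based skewness: m3 / m2**1.5 (positive = right-skewed)."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # one large value stretches the right tail
left_skewed = [-x for x in right_skewed]     # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, skewness(data), statistics.fmean(data), statistics.median(data))
```

In the right-skewed data the mean (4.75) sits above the median (3), and the mirror-image data shows the opposite ordering, matching the Mean > Median and Mean < Median rules above.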

7. What is the interquartile range (IQR), and how is it used to detect outliers?
- Interquartile Range (IQR)
IQR is the range of the middle 50% of data. It shows how spread out the central values are.

Formula:
IQR = Q3 - Q1

Q1 (1st Quartile): 25% of data below it

Q3 (3rd Quartile): 75% of data below it

Detecting Outliers:
An outlier is a value far from the rest of the data.

Outlier Rule:

Lower Bound = Q1 - 1.5 × IQR

Upper Bound = Q3 + 1.5 × IQR

Any value below the lower bound or above the upper bound is considered an outlier.

 Example:
If IQR = 20, Q1 = 30, Q3 = 50

Lower Bound = 30 - (1.5×20) = 0

Upper Bound = 50 + (1.5×20) = 80
→ Any value < 0 or > 80 is an outlier.
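
The rule translates directly into code. In this sketch the helper name `iqr_bounds` and the list of values are hypothetical; the quartiles are the ones from the example above.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) outlier bounds for the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(30, 50)  # Q1 = 30, Q3 = 50, so IQR = 20

values = [-5, 10, 35, 50, 79, 80, 95]  # hypothetical observations
outliers = [v for v in values if v < lower or v > upper]

print(lower, upper, outliers)
```

Values that fall exactly on a bound (80 here) are not flagged, because the rule only counts values strictly below the lower bound or strictly above the upper bound.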

8. Discuss the conditions under which the binomial distribution is used.
- ### **Binomial Distribution**  
Used when you’re counting **successes or failures** in a fixed number of trials.

###  **Conditions to Use Binomial Distribution:**

1. **Fixed number of trials (n)**  
   - Example: Flip a coin **10 times**

2. **Only two outcomes per trial**  
   - Success or failure (like win/lose, yes/no, heads/tails)

3. **Same probability (p) for each trial**  
   - Example: Chance of getting heads is always **0.5**

4. **Trials are independent**  
   - One outcome doesn’t affect the others

### **Example:**  
What’s the probability of getting **exactly 3 heads** in **5 coin flips**?  
→ Binomial distribution applies here.
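
The coin-flip question can be answered with the binomial probability formula, P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). The sketch below implements it with the standard library; the function name `binomial_pmf` is hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly 3 heads in 5 fair coin flips.
prob_3_heads = binomial_pmf(3, 5, 0.5)

# Sanity check: the probabilities over all possible counts add up to 1.
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))

print(prob_3_heads, total)
```

Here C(5, 3) = 10 and each specific sequence has probability 0.5^5 = 1/32, so the answer is 10/32 = 0.3125.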

9. Explain the properties of the normal distribution and the empirical rule (68-95-99.7 rule).
- ### **Normal Distribution**  
A **bell-shaped**, symmetrical curve where most values cluster around the **mean**.

### **Key Properties:**
- **Symmetrical** around the mean  
- **Mean = Median = Mode**  
- Tails extend infinitely in both directions  
- Often a good model for **natural data** such as height, weight, and test scores



### **Empirical Rule (68-95-99.7 Rule):**

For a normal distribution:

- **68%** of data lies within **±1 standard deviation (σ)** from the mean  
- **95%** within **±2σ**  
- **99.7%** within **±3σ**


 **Example:**  
If a test has a mean score of 70 and σ = 10:  
- About 68% of students scored between **60 and 80**  
- About 95% scored between **50 and 90**  
- About 99.7% scored between **40 and 100**
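
The three percentages can be checked against the normal curve itself. This sketch uses `statistics.NormalDist` from the Python standard library (3.8+) with the mean and σ from the example; the helper name `prob_within` is hypothetical.

```python
from statistics import NormalDist

scores = NormalDist(mu=70, sigma=10)  # test scores from the example

def prob_within(dist, k):
    """Probability that a value falls within k standard deviations of the mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)

p1 = prob_within(scores, 1)  # between 60 and 80
p2 = prob_within(scores, 2)  # between 50 and 90
p3 = prob_within(scores, 3)  # between 40 and 100

print(round(p1, 4), round(p2, 4), round(p3, 4))
```

The exact values are roughly 0.6827, 0.9545, and 0.9973, which the empirical rule rounds to 68%, 95%, and 99.7%.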

10. Provide a real-life example of a Poisson process and calculate the probability for a specific event.
- ### **Real-Life Example of a Poisson Process:**  
**Example:** Number of customer calls received by a call center per hour.

Let’s say the center gets **on average 5 calls/hour** (λ = 5).  
We want to find the probability of receiving **exactly 3 calls in an hour**.



### **Poisson Formula:**  
\[
P(X = k) = \frac{e^{-λ} \cdot λ^k}{k!}
\]

Where:  
- \( λ = 5 \) (average rate)  
- \( k = 3 \) (desired number of events)  
- \( e ≈ 2.718 \)

---

### **Calculation:**  
\[
P(X = 3) = \frac{e^{-5} \cdot 5^3}{3!} = \frac{2.718^{-5} \cdot 125}{6}
\]

\[
≈ \frac{0.006738 \cdot 125}{6} ≈ \frac{0.8422}{6} ≈ 0.1404
\]



###  **Final Answer:**  
**Probability of getting exactly 3 calls in an hour ≈ 14.04%**
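
The same formula can be evaluated directly at full floating-point precision, which avoids rounding intermediate values such as e^{-5}. A minimal sketch, with the hypothetical function name `poisson_pmf`:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam per interval."""
    return exp(-lam) * lam**k / factorial(k)

# Exactly 3 calls in an hour when the center averages 5 calls per hour.
p_three_calls = poisson_pmf(3, 5)

print(round(p_three_calls, 4))
```

Evaluated this way, the probability comes out at about 0.1404.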


11. Explain what a random variable is and differentiate between discrete and continuous random variables.
- A **random variable** assigns a number to each possible outcome of a **random experiment**, so the value it takes depends on chance.



### **Types of Random Variables:**

#### 1. **Discrete Random Variable**  
- Takes **specific, countable values**  
- **Examples:**  
  - Number of students in a class (0, 1, 2, ...)  
  - Number of goals in a match  

#### 2. **Continuous Random Variable**  
- Can take **any value** within a range (uncountably many possible values)  
- Measured, not counted  
- **Examples:**  
  - Height of students (like 5.4 ft, 5.41 ft...)  
  - Time taken to finish a race  
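
The difference shows up clearly in simulated data. In this sketch the die rolls stand in for a discrete random variable and the race times for a continuous one; both setups, including the 9.5 to 12.0 second range, are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Discrete: a die roll can only be one of six countable values.
die_rolls = [random.randint(1, 6) for _ in range(10)]

# Continuous: a race time can be any real number in the range.
race_times = [random.uniform(9.5, 12.0) for _ in range(10)]

print(die_rolls)
print(race_times)
```

The die rolls repeat the same few integers, while the race times are almost never exactly equal to one another, which is why probabilities for continuous variables are stated for intervals rather than single values.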
  
12. Provide an example dataset, calculate both covariance and correlation, and interpret the results.
- Interpretation:
Covariance = 30 → Shows a positive relationship, but hard to interpret alone.

Correlation ≈ 0.92 → Strong positive linear relationship between study hours and exam scores.
→ As students study more, scores tend to increase.
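
The dataset behind the figures quoted above is not shown in this section. As a stand-in, the sketch below uses a small made-up dataset of study hours and exam scores purely to illustrate the calculation, so its results differ from the covariance and correlation quoted above. The function names are hypothetical, and the sample (n - 1) versions of the formulas are assumed.

```python
import math

def sample_covariance(xs, ys):
    """Average product of paired deviations from the means, dividing by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_correlation(xs, ys):
    """Covariance rescaled by both standard deviations; always between -1 and 1."""
    cov = sample_covariance(xs, ys)
    std_x = math.sqrt(sample_covariance(xs, xs))  # covariance of x with itself = variance
    std_y = math.sqrt(sample_covariance(ys, ys))
    return cov / (std_x * std_y)

# Hypothetical data: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

cov = sample_covariance(hours, scores)
r = pearson_correlation(hours, scores)

print(cov, round(r, 2))
```

For this made-up data the covariance is 14 and the correlation is about 0.99. The covariance depends on the units of both variables, which is why it is hard to interpret alone, while the correlation is unit-free and indicates a strong positive linear relationship.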
