# Descriptive Statistics for Numeric Variables

In this section, we will calculate and interpret key summary statistics for a numeric variable: **Age**. These statistics provide insights into the central tendency, variability, and overall distribution of the data.

### What We Will Cover
1. **Mean**: The average value, showing the central tendency of the data.
2. **Median**: The middle value when the data is sorted, representing the central point.
3. **Mode**: The most frequently occurring value(s) in the dataset.
4. **Range**: The difference between the maximum and minimum values, showing the spread.
5. **Variance**: The average squared deviation from the mean, measuring variability.
6. **Standard Deviation**: The square root of the variance, indicating the typical distance of values from the mean in the data's original units.

### Why This Matters
- Summary statistics help us understand the data at a glance.
- They reveal patterns, such as whether the data is centered, widely spread, or skewed.
- These metrics are the foundation for more advanced statistical analysis.

### Variable in Focus: **Age**
We will explore the `Age` variable from our dataset, which represents the age of respondents in years. By calculating these statistics, we will learn how the ages are distributed across the sample and identify patterns in the data.

### What to Expect
- We will calculate the statistics using Python's `pandas` library.
- We will visualize the distribution of the `Age` variable using a histogram for better understanding.
- Interpretation of each statistic will be provided to connect the calculations to real-world insights.


### Note
- We're using a synthetic dataset created in a previous repository.

In [None]:

import pandas as pd

# Load the dataset
file_path = r"C:\Users\nadzi\Documents\Personal\Research Methods 3\Synthetic_Democracy_Dataset.csv"
data = pd.read_csv(file_path)

# Preview the first five rows
data.head()

| Unnamed: 0 | Age | Gender | Education_Level | Income_Level | Democracy_Rating | Trust_in_Government | Voted_Last_Election | Political_Knowledge | Media_Consumption | Social_Equality_Support |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | Male | High School | High | 10 | 5 | Yes | 8 | Daily | 5 |
| 1 | 69 | Female | High School | Middle | 5 | 2 | Yes | 10 | Weekly | 2 |
| 2 | 46 | Male | Bachelor's | Middle | 6 | 4 | No | 7 | Daily | 5 |
| 3 | 32 | Female | High School | Middle | 3 | 2 | Yes | 10 | Daily | 2 |
| 4 | 60 | Female | High School | High | 9 | 1 | Yes | 10 | Daily | 2 |


### Summary Statistics with `describe()` Method

The `describe()` method in pandas is a quick and convenient way to generate summary statistics for a numerical or categorical column in a dataset. When applied to a numerical column (e.g., `data['Age']`), it provides the following descriptive statistics:

1. **Count**:
   - The total number of non-missing observations in the column.

2. **Mean**:
   - The arithmetic average of the data values, representing the central tendency of the column.

3. **Standard Deviation (std)**:
   - Measures the dispersion or spread of the data. It indicates how much the individual data points deviate from the mean.

4. **Minimum (min)**:
   - The smallest value in the column.

5. **25th Percentile (25%)**:
   - Also known as the first quartile (Q1), it represents the value below which 25% of the data lies.

6. **50th Percentile (50%)**:
   - Also known as the median, it represents the middle value of the dataset when sorted.

7. **75th Percentile (75%)**:
   - Also known as the third quartile (Q3), it represents the value below which 75% of the data lies.

8. **Maximum (max)**:
   - The largest value in the column.
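The percentile rows can also be computed directly, which helps confirm what `describe()` reports. The following is a minimal sketch using `Series.quantile`; `toy_ages` is a made-up Series for illustration, not the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy values, not taken from the dataset
toy_ages = pd.Series([10, 20, 30, 40, 50])

# quantile() accepts a list of probabilities and returns a Series
quartiles = toy_ages.quantile([0.25, 0.50, 0.75])

# describe() reports the same three values under the labels 25%, 50%, 75%
summary = toy_ages.describe()

print(quartiles)
print(summary[["25%", "50%", "75%"]])
```

Both printouts should show the same three numbers, and the `50%` row should match `toy_ages.median()`.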

### Why Use `describe()`?
The `describe()` method is a fundamental tool in **descriptive statistics** and **exploratory data analysis (EDA)** because:
- It provides a comprehensive snapshot of the distribution and variability of a column.
- It helps flag potential outliers and data-entry errors, because implausible `min` or `max` values stand out immediately.
- It informs about the central tendency and spread of the data, which can guide further analysis.
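The list of statistics above applies to numeric columns. For a categorical (text) column, `describe()` returns a different set of rows. Here is a small sketch on a made-up column; `toy_gender` is hypothetical and only mirrors the kind of values found in a column such as `Gender`.

```python
import pandas as pd

# Hypothetical toy column, not the dataset's actual values
toy_gender = pd.Series(["Male", "Female", "Female", "Male", "Female"])

# For non-numeric data, describe() reports count, unique, top, and freq
cat_summary = toy_gender.describe()

print(cat_summary)
```

Here `top` is the most frequent category and `freq` is how many times it occurs, so the output plays a role similar to the mode for numeric data.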



In [None]:
data['Age'].describe()

count    1000.000000
mean       49.857000
std        18.114267
min        18.000000
25%        35.000000
50%        50.000000
75%        66.000000
max        79.000000
Name: Age, dtype: float64

### Calculating Summary Statistics Individually

While the `describe()` method provides a quick and comprehensive summary of the statistics for a numerical column, we can also calculate individual summary statistics separately for more detailed control over our analysis.

In this example, we are calculating the following statistics for the `Age` variable:

1. **Mean**:
   - The average value of the `Age` column, calculated using `age.mean()`. This tells us the central tendency of the data.

2. **Median**:
   - The middle value of the `Age` column when sorted, calculated using `age.median()`. The median is less sensitive to outliers compared to the mean.

3. **Mode**:
   - The most frequently occurring value in the `Age` column, calculated using `age.mode()`. Because `mode()` returns every tied value in ascending order, `[0]` selects the smallest one when multiple modes exist.

4. **Range**:
   - The difference between the maximum and minimum values, calculated with `age.max() - age.min()`. The range gives us an idea of how spread out the values are.

5. **Variance**:
   - The average of the squared differences from the mean, calculated with `age.var()`. By default pandas computes the sample variance, dividing by `n - 1` rather than `n`. Variance measures how far individual values deviate from the mean.

6. **Standard Deviation**:
   - The square root of the variance, calculated using `age.std()`. Standard deviation is another measure of dispersion that is easier to interpret since it's in the same units as the data.
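One detail worth checking when computing variance and standard deviation by hand is the divisor. The sketch below compares the pandas default (`ddof=1`, the sample formula) with the population formula (`ddof=0`) on a small made-up Series; `toy` and its values are assumptions chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical toy values with a mean of exactly 5
toy = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

n = len(toy)
squared_dev = ((toy - toy.mean()) ** 2).sum()  # sum of squared deviations

sample_var = toy.var()            # default ddof=1: divides by n - 1
population_var = toy.var(ddof=0)  # divides by n

print(sample_var, squared_dev / (n - 1))
print(population_var, squared_dev / n)
print(toy.std(ddof=0))            # square root of the population variance
```

With 1,000 observations, as in the `Age` column, the two divisors give nearly identical results, but the gap is noticeable on small samples like this one.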

### Why Calculate Statistics Individually?

- **Detailed Analysis**: This approach gives finer control over the analysis; sometimes we need only one specific measure rather than all of them at once.



In [None]:
# Selecting variable "Age"
age = data['Age']

# Calculate summary statistics
mean_age = age.mean()
median_age = age.median()
mode_age = age.mode()[0]  # mode() returns all tied modes in ascending order; [0] takes the smallest
range_age = age.max() - age.min()
variance_age = age.var()
std_dev_age = age.std()

# Display the results
print("Summary Statistics for Age:")
print(f"Mean: {mean_age:.2f}")
print(f"Median: {median_age}")
print(f"Mode: {mode_age}")
print(f"Range: {range_age}")
print(f"Variance: {variance_age:.2f}")
print(f"Standard Deviation: {std_dev_age:.2f}")


Summary Statistics for Age:
Mean: 49.86
Median: 50.0
Mode: 79
Range: 61
Variance: 328.13
Standard Deviation: 18.11


# Explanation of Summary Statistics for Age

The following summary statistics provide insights into the distribution and variability of the **Age** variable:

---

### 1. **Mean (49.86)**
- The **mean** is the average age of the respondents.
- **Interpretation**: On average, the respondents are **49.86 years old**.
- This value summarizes the central tendency of the dataset: it is the balance point of the ages, not necessarily the value around which most observations cluster.

---

### 2. **Median (50.0)**
- The **median** is the middle value when all ages are sorted in ascending order.
- **Interpretation**: Roughly half of the respondents are **50 years** old or younger, and roughly half are 50 or older.
- Because the median is close to the mean (49.86), the distribution appears roughly symmetric, with no strong skew.
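Comparing the mean and the median works as a skew check because extreme values pull the mean but barely move the median. The sketch below illustrates this on two made-up samples; both Series are hypothetical and unrelated to the dataset.

```python
import pandas as pd

# Hypothetical samples for illustration only
symmetric = pd.Series([20, 30, 40, 50, 60])
right_skewed = pd.Series([20, 22, 25, 27, 30, 95])  # one very large value

for name, values in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    # A mean well above the median hints at a long right tail
    print(f"{name}: mean={values.mean():.1f}, median={values.median():.1f}")
```

In the symmetric sample the two statistics coincide, while in the skewed sample the single large value lifts the mean well above the median.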

---

### 3. **Mode (79)**
- The **mode** is the most frequently occurring value in the dataset.
- **Interpretation**: The age **79 years** appears most frequently among respondents.
- **Why Important?**:
  - The mode helps identify clusters or peaks in the data.
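  - In this case, the mode of **79** coincides with the maximum age. Because the ages are spread across many distinct values, the mode may lead other ages by only a few observations, so the frequency counts should be checked before reading it as a distinct group of older respondents.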
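To judge how much weight a mode deserves, it helps to look at the frequency of every value rather than the top value alone. A minimal sketch using `value_counts()` on made-up data follows; `toy_ages` is hypothetical and deliberately contains a tie.

```python
import pandas as pd

# Hypothetical toy values with two ages tied for most frequent
toy_ages = pd.Series([18, 25, 25, 40, 40, 40, 79, 79, 79])

counts = toy_ages.value_counts()  # frequency of each distinct value
modes = toy_ages.mode()           # all tied modes, sorted ascending

print(counts)
print(modes.tolist())
print(modes[0])                   # the smallest of the tied modes
```

Here 40 and 79 occur equally often, so taking only `[0]` would report 40 and hide the tie. Applying the same check to the real `Age` column is one way to see whether 79 clearly stands out.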

---

### 4. **Range (61)**
- The **range** is the difference between the maximum and minimum values.
- Formula: \( \text{Range} = \text{Max} - \text{Min} \).
- **Interpretation**: The youngest respondent is **18 years old**, and the oldest is **79 years old**, resulting in a range of **61 years**.
- **Why Important?**:
  - The range shows the total spread of the data but is sensitive to outliers.

---

### 5. **Variance (328.13)**
- The **variance** measures the average squared deviation of each age from the mean.
- **Interpretation**: A variance of **328.13** (in squared years) indicates substantial spread around the mean; its square root, the standard deviation, expresses the same spread in years.
- **Why Important?**:
  - Variance gives a sense of variability in the data, but since it uses squared units, it’s harder to interpret directly.
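Written out, and assuming the sample form with an \( n - 1 \) divisor that pandas uses by default, the variance and its link to the standard deviation are:

\[
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2} = \sqrt{328.13} \approx 18.11
\]

Here \( x_i \) is the age of respondent \( i \), \( \bar{x} \) is the mean age (49.86), and \( n \) is the number of observations (1000).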

---

### 6. **Standard Deviation (18.11)**
- The **standard deviation** is the square root of the variance and represents the typical distance of an age from the mean.
- **Interpretation**: Respondents' ages typically differ from the mean age of **49.86 years** by about **18.11 years**.
- The interval within **1 standard deviation of the mean** runs from approximately **31.75 years** to **67.97 years**. For roughly bell-shaped data about 68% of values fall in such an interval, but the share can be lower for flatter distributions.
- **Why Important?**:
  - Standard deviation is easier to interpret than variance because it is in the same unit as the original data (years).
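The share of values lying within one standard deviation of the mean depends on the shape of the distribution, so it is worth measuring rather than assuming. The sketch below does this for a made-up, perfectly flat set of ages; `toy_ages` is an assumption for illustration, not the actual `Age` column.

```python
import pandas as pd

# Hypothetical flat distribution: every age from 18 to 79 appears once
toy_ages = pd.Series(range(18, 80))

mean, std = toy_ages.mean(), toy_ages.std()
lower, upper = mean - std, mean + std

# between() is inclusive on both ends by default;
# the mean of a boolean Series is the fraction of True values
share_within = toy_ages.between(lower, upper).mean()

print(f"{lower:.2f} to {upper:.2f}: {share_within:.1%}")
```

For this flat toy sample the share is about 58%, lower than the 68% expected for bell-shaped data. Replacing `toy_ages` with `data['Age']` would give the corresponding figure for the real dataset.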

---

### Overall Insights:
- The ages in this dataset are centered around 50 years, with a typical deviation of about 18 years from this average.
- The range shows a wide age span (18 to 79), and the standard deviation of about 18 years, together with quartiles at 35 and 66, shows that ages are spread broadly across that span rather than concentrated near the mean.
- The mode of 79 is the single most common age, but on its own it does not establish a cluster of older respondents; the frequency counts would show how far it leads other ages.

These summary statistics provide a clear and comprehensive understanding of the Age variable's central tendency and variability.
