# Marginal and conditional distributions

[Marginal and conditional distributions | Analyzing categorical data | AP Statistics | Khan Academy | YouTube](https://www.youtube.com/watch?v=Iw9fEYIpPMA)

# Types of Distributions in the Classroom Example  
*(Study Time vs. Percent Correct on a Test)*

| Percent Correct →<br>Time Studied ↓ | 80–100% | 60–79% | 40–59% | 20–39% | 0–19% | **Row Total** |
|-------------------------------------|---------|--------|--------|--------|-------|---------------|
| 61–80 minutes                       | 30      | 20     | 15     | 5      | 0     | 70            |
| 41–60 minutes                       | 10      | 30     | 35     | 10     | 1     | 86            |
| 21–40 minutes                       | 0       | 10     | 15     | 5      | 0     | 30            |
| 0–20 minutes                        | 0       | 0      | 5      | 0      | 9     | 14            |
| **Column Total**                    | 40      | 60     | 70     | 20     | 10    | **200**       |

## 1. Joint Distribution
- **Definition**: The complete two-way table of **counts** (joint frequencies) for every combination of the two categorical variables. Dividing each count by the grand total gives the joint relative frequencies.
- **Interpretation**: Each cell tells you exactly how many students fall into that specific combination.
- **Example**: 20 students scored 60–79% **and** studied 61–80 minutes, which is 20/200 = 10% of all students.
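
As a minimal sketch of the joint distribution, the two-way table can be held in a pandas DataFrame and each cell divided by the grand total. The variable names (`counts`, `joint`, `cell`) are illustrative choices, not part of the original notes.

```python
import pandas as pd

# Two-way table of counts: rows = time studied, columns = percent correct
counts = pd.DataFrame(
    [[30, 20, 15, 5, 0],
     [10, 30, 35, 10, 1],
     [0, 10, 15, 5, 0],
     [0, 0, 5, 0, 9]],
    index=["61–80 minutes", "41–60 minutes", "21–40 minutes", "0–20 minutes"],
    columns=["80–100%", "60–79%", "40–59%", "20–39%", "0–19%"],
)

grand_total = int(counts.to_numpy().sum())  # all students in the table

# Joint relative frequency: each cell as a share of all students
joint = counts / grand_total

# The cell used in the example: scored 60–79% and studied 61–80 minutes
cell = counts.loc["61–80 minutes", "60–79%"]
print(cell, joint.loc["61–80 minutes", "60–79%"])
```

Because every cell is divided by the same grand total, the entries of `joint` add up to 1.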

## 2. Marginal Distribution
- **Definition**: Distribution of **one variable alone**, obtained by summing across the other variable (row totals or column totals).
- Usually expressed as percentages of the grand total (200 students).

### Marginal Distribution of Percent Correct
| Percent Correct | Count | Percentage |
|-----------------|-------|------------|
| 80–100%         | 40    | 20%        |
| 60–79%          | 60    | 30%        |
| 40–59%          | 70    | 35%        |
| 20–39%          | 20    | 10%        |
| 0–19%           | 10    | 5%         |
| **Total**       | 200   | 100%       |

### Marginal Distribution of Time Studied
| Time Studied     | Count | Percentage |
|------------------|-------|------------|
| 61–80 min        | 70    | 35%        |
| 41–60 min        | 86    | 43%        |
| 21–40 min        | 30    | 15%        |
| 0–20 min         | 14    | 7%         |
| **Total**        | 200   | 100%       |
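
The two marginal tables above can be reproduced with a short pandas sketch: sum the counts along one axis, then express each total as a percentage of the grand total. The names `marginal_score` and `marginal_time` are assumptions made for this example.

```python
import pandas as pd

# Two-way table of counts: rows = time studied, columns = percent correct
counts = pd.DataFrame(
    [[30, 20, 15, 5, 0],
     [10, 30, 35, 10, 1],
     [0, 10, 15, 5, 0],
     [0, 0, 5, 0, 9]],
    index=["61–80 minutes", "41–60 minutes", "21–40 minutes", "0–20 minutes"],
    columns=["80–100%", "60–79%", "40–59%", "20–39%", "0–19%"],
)

grand_total = counts.to_numpy().sum()

# Marginal distribution of Percent Correct: column totals as % of all students
marginal_score = counts.sum(axis=0) * 100 / grand_total

# Marginal distribution of Time Studied: row totals as % of all students
marginal_time = counts.sum(axis=1) * 100 / grand_total

print(marginal_score.round(1))
print(marginal_time.round(1))
```

Summing over `axis=0` collapses the time-studied rows, and summing over `axis=1` collapses the percent-correct columns, which is what "summing across the other variable" means in practice.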

## 3. Conditional Distribution
- **Definition**: The distribution of one variable **given** a fixed value of the other variable.
- Reported as **percentages within the conditioned group**: each count is divided by that group's row or column total, not by the grand total.

### Example 1: Percent Correct given 41–60 minutes studied (n = 86)
| Percent Correct | Count | Conditional % |
|-----------------|-------|---------------|
| 80–100%         | 10    | ≈ 11.6%       |
| 60–79%          | 30    | ≈ 34.9%       |
| 40–59%          | 35    | ≈ 40.7%       |
| 20–39%          | 10    | ≈ 11.6%       |
| 0–19%           | 1     | ≈ 1.2%        |
| **Total**       | 86    | 100%          |

### Example 2: Time Studied given students scored 80–100% (n = 40)
| Time Studied     | Count | Conditional % |
|------------------|-------|---------------|
| 61–80 min        | 30    | 75%           |
| 41–60 min        | 10    | 25%           |
| 21–40 min        | 0     | 0%            |
| 0–20 min         | 0     | 0%            |
| **Total**        | 40    | 100%          |
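
Both examples follow the same recipe: pick one row or column and divide it by its own total. A sketch in pandas, with illustrative variable names that do not come from the original notes:

```python
import pandas as pd

# Two-way table of counts: rows = time studied, columns = percent correct
counts = pd.DataFrame(
    [[30, 20, 15, 5, 0],
     [10, 30, 35, 10, 1],
     [0, 10, 15, 5, 0],
     [0, 0, 5, 0, 9]],
    index=["61–80 minutes", "41–60 minutes", "21–40 minutes", "0–20 minutes"],
    columns=["80–100%", "60–79%", "40–59%", "20–39%", "0–19%"],
)

# Example 1: Percent Correct given 41–60 minutes studied (row / row total)
row = counts.loc["41–60 minutes"]
score_given_time = row * 100 / row.sum()

# Example 2: Time Studied given a score of 80–100% (column / column total)
col = counts["80–100%"]
time_given_score = col * 100 / col.sum()

# Every row-conditional distribution at once: divide each row by its own total
all_rows = counts.div(counts.sum(axis=1), axis=0) * 100

print(score_given_time.round(1))
print(time_given_score.round(1))
```

Each row of `all_rows` sums to 100, since every row is scaled by its own total rather than by the grand total.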

## Key Takeaways
- **Joint** → full table of counts (both variables together)  
- **Marginal** → row or column totals (one variable alone)  
- **Conditional** → percentages **within** a specific row or column (one variable given the other)

These three types of distributions are the foundation for analyzing relationships in any two-way table!
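
If the data were stored as one record per student rather than as a finished table, `pd.crosstab` could produce all three views directly. The sketch below rebuilds such records from the counts purely for illustration; the `records` frame and its column names `time` and `score` are hypothetical.

```python
import pandas as pd

time_bins = ["61–80 minutes", "41–60 minutes", "21–40 minutes", "0–20 minutes"]
score_bins = ["80–100%", "60–79%", "40–59%", "20–39%", "0–19%"]
table = [[30, 20, 15, 5, 0],
         [10, 30, 35, 10, 1],
         [0, 10, 15, 5, 0],
         [0, 0, 5, 0, 9]]

# Hypothetical per-student records rebuilt from the counts: one row per student
records = pd.DataFrame(
    [(t, s)
     for t, row in zip(time_bins, table)
     for s, n in zip(score_bins, row)
     for _ in range(n)],
    columns=["time", "score"],
)

# Joint: every cell as a share of all students
joint = pd.crosstab(records["time"], records["score"], normalize="all")

# Marginal: margins=True appends an "All" row and column holding the totals
with_totals = pd.crosstab(records["time"], records["score"], margins=True)

# Conditional: normalize="index" divides each row by its own total
score_given_time = pd.crosstab(records["time"], records["score"], normalize="index")

print(with_totals)
```

Note that `pd.crosstab` sorts the category labels, so the rows and columns may appear in a different order than in the table at the top of these notes.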

```python
import pandas as pd

# Load the Excel file
file_name = "time_correct.studied.xlsx"
df = pd.read_excel(file_name)

# Coerce every column to numeric: strings that represent numbers become
# numbers, and cells that cannot be converted (such as text labels) become NaN
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Row totals: sum across the columns of each row (NaN values are skipped)
row_totals = df.sum(axis=1, skipna=True)

# Column totals: sum down the rows of each column
column_totals = df.sum(axis=0, skipna=True)

# Print results
print("Row Totals:")
print(row_totals)

print("\nColumn Totals:")
print(column_totals)
```

```text
Row Totals:
0     0.0
1    70.0
2    86.0
3    30.0
4    14.0
dtype: float64

Column Totals:
Percent Correct →     0.0
80–100%              40.0
60–79%               60.0
40–59%               70.0
20–39%               20.0
0–19%                10.0
dtype: float64
```

