## Multivariate Statistics

Multivariate statistics involve the observation and analysis of more than one statistical outcome variable at a time. This tutorial covers key concepts, mathematical background, and numerical examples.

### 1. Multiple Linear Regression

Multiple linear regression models the relationship between one dependent variable and two or more independent variables.

*Example:*

Suppose we want to predict students' test scores ($y$) based on both study hours ($x_1$) and attendance ($x_2$). We have the following data:

| Study Hours ($x_1$) | Attendance ($x_2$) | Test Scores ($y$) |
|---------------------|---------------------|-------------------|
| 2                   | 90                  | 65                |
| 3                   | 95                  | 70                |
| 4                   | 100                 | 75                |
| 5                   | 105                 | 80                |

The multiple linear regression equation is:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon $$

Where:
- $y$ = Test Scores (dependent variable)
- $x_1$ = Study Hours (independent variable)
- $x_2$ = Attendance (independent variable)
- $\beta_0$ = Intercept
- $\beta_1$, $\beta_2$ = Slopes for $x_1$ and $x_2$
- $\epsilon$ = Error term
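
As a minimal sketch of how this model could be fitted to the table above, the snippet below builds the design matrix and calls NumPy's least-squares solver. The variable names and the `rcond=1e-10` cutoff are assumptions made for illustration. In this particular table, attendance is an exact linear function of study hours ($x_2 = 80 + 5x_1$), so the fit recovers the scores but the individual slopes are not uniquely determined.

```python
import numpy as np

# Data copied from the table above.
x1 = np.array([2.0, 3.0, 4.0, 5.0])        # study hours
x2 = np.array([90.0, 95.0, 100.0, 105.0])  # attendance
y = np.array([65.0, 70.0, 75.0, 80.0])     # test scores

# Design matrix: a column of ones for the intercept, then the two predictors.
X = np.column_stack([np.ones_like(x1), x1, x2])

# rcond=1e-10 is an explicit (assumed) cutoff so that the numerically zero
# singular value is treated as zero when the rank is reported.
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=1e-10)
y_hat = X @ beta

print("rank of X:", rank)  # 2 rather than 3, because x2 = 80 + 5 * x1 exactly
print("fitted values:", np.round(y_hat, 6))

# Two different coefficient vectors reproduce the same fitted values,
# so beta_1 and beta_2 are not individually identified by this table.
beta_a = np.array([55.0, 5.0, 0.0])   # y = 55 + 5 * x1
beta_b = np.array([-25.0, 0.0, 1.0])  # y = -25 + x2
print(np.allclose(X @ beta_a, X @ beta_b))
```

With predictors that are not exactly collinear, the same call returns a unique coefficient vector and `rank` equals the number of columns of `X`.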

**Key Properties:**

1. **Interpretation of Coefficients**: Each $\beta_i$ represents the expected change in $y$ for a one-unit increase in $x_i$, holding the other independent variables constant.
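2. **Multicollinearity**: When independent variables are highly correlated, the coefficient estimates become unstable and hard to interpret individually. In the table above the problem is extreme: $x_2 = 80 + 5x_1$ exactly, so $\beta_1$ and $\beta_2$ cannot be estimated separately from these four rows.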
3. **Model Fit**: Evaluated using $R^2$, adjusted $R^2$, and other goodness-of-fit measures.
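
The two fit measures named above can be computed directly from observed and fitted values. The following is a small sketch; the function name `r_squared` and the five observed/fitted values are hypothetical and only serve to illustrate the arithmetic.

```python
import numpy as np

def r_squared(y, y_hat, n_predictors):
    """Return (R^2, adjusted R^2) for observed y and fitted y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    n = y.size
    # Adjusted R^2 penalizes each additional predictor.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj_r2

# Hypothetical observed and fitted values for a model with two predictors.
y_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_fit = [1.1, 1.9, 3.2, 3.9, 4.9]
r2, adj_r2 = r_squared(y_obs, y_fit, n_predictors=2)
print(round(r2, 3), round(adj_r2, 3))  # 0.992 0.984
```

Adjusted $R^2$ is never larger than $R^2$, and the gap widens as predictors are added without a matching improvement in fit.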

### 2. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of variables in a dataset while retaining most of the variability.

*Example:*

Suppose we have a dataset with four variables: $X_1$, $X_2$, $X_3$, and $X_4$. The goal is to reduce these four variables into two principal components.

1. **Standardize the Data**: Center the data around the mean and scale it by the standard deviation.

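2. **Compute the Covariance Matrix**: Calculate the covariance matrix of the standardized data. Because the variables are standardized, this equals the correlation matrix of the original data.

3. **Compute the Eigenvalues and Eigenvectors**: The eigenvectors of the covariance matrix give the directions (loadings) of the principal components, and each eigenvalue is the variance explained by its component. Sort the eigenvectors by decreasing eigenvalue.

4. **Form the Principal Components**: Multiply the standardized data by the two eigenvectors with the largest eigenvalues to obtain the two principal component scores.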
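
The four steps can be sketched with NumPy as follows. The function name `pca` and the randomly generated four-variable dataset are assumptions for illustration, not data from this tutorial.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigen-decomposition of the covariance of standardized data."""
    X = np.asarray(X, dtype=float)
    # Step 1: center each column and scale it by its standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the standardized data.
    C = np.cov(Z, rowvar=False)
    # Step 3: eigh suits symmetric matrices; it returns eigenvalues in
    # ascending order, so reorder them from largest to smallest.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: project the standardized data onto the leading eigenvectors.
    directions = eigvecs[:, :n_components]
    scores = Z @ directions
    explained = eigvals / eigvals.sum()  # proportion of variance per component
    return scores, directions, explained

# Hypothetical data: 50 observations of four variables, two of them correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += 2.0 * X[:, 0]

scores, directions, explained = pca(X, n_components=2)
print(scores.shape)
print(np.round(explained, 3))
```

The `explained` vector sums to 1, and its first two entries give the share of total variance retained by the two-component reduction.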

**Key Properties:**

1. **Variance Explained**: The proportion of the dataset's total variance explained by each principal component.
2. **Orthogonality**: The principal component directions are mutually orthogonal, and the resulting component scores are uncorrelated with each other.
3. **Dimensionality Reduction**: PCA reduces the dimensionality of the data while preserving as much variance as possible.

### 3. Factor Analysis

Factor Analysis is used to identify underlying relationships between observed variables by modeling each one as a linear combination of a smaller number of unobserved (latent) factors plus a variable-specific error term.

*Example:*

Suppose we have a dataset with four observed variables: $X_1$, $X_2$, $X_3$, and $X_4$. The goal is to identify two underlying factors.

1. **Extract Initial Factors**: Use methods like Principal Axis Factoring or Maximum Likelihood to estimate the initial factors.

2. **Rotate the Factors**: Apply rotations (e.g., Varimax) to make the factor loadings more interpretable.

3. **Interpret the Factors**: Examine the rotated factor loadings to identify the underlying factors.
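
The snippet below is a rough sketch of the extract-then-rotate workflow. For brevity it uses principal-component extraction of the loadings rather than Principal Axis Factoring or Maximum Likelihood, followed by a common SVD-based Varimax iteration. The function names and the simulated two-factor dataset are assumptions for illustration.

```python
import numpy as np

def pc_loadings(R, n_factors):
    """Initial loadings from the leading eigenpairs of a correlation matrix."""
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:n_factors]
    return eigvecs[:, order] * np.sqrt(eigvals[order])

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal Varimax rotation; returns rotated loadings and the rotation."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated**3 - rotated @ np.diag(np.sum(rotated**2, axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt  # product of orthogonal matrices, hence orthogonal
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation, rotation

# Hypothetical data: four observed variables driven by two latent factors.
rng = np.random.default_rng(0)
F = rng.normal(size=(200, 2))
noise = 0.5 * rng.normal(size=(200, 4))
X = np.column_stack([F[:, 0], F[:, 0], F[:, 1], F[:, 1]]) + noise

R = np.corrcoef(X, rowvar=False)
L = pc_loadings(R, n_factors=2)
L_rot, rotation = varimax(L)
communalities = (L_rot**2).sum(axis=1)  # variance share explained per variable
print(np.round(L_rot, 2))
print(np.round(communalities, 2))
```

An orthogonal rotation changes the individual loadings but leaves each variable's communality unchanged, which is why rotation can aid interpretation without altering how much variance the factors explain.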

**Key Properties:**

1. **Factor Loadings**: The coefficients that represent the relationship between the observed variables and the factors.
2. **Communality**: The proportion of each variable's variance explained by the factors.
3. **Factor Scores**: Estimated values of the factors for each observation.

### 4. Canonical Correlation Analysis (CCA)

Canonical Correlation Analysis (CCA) examines the relationships between two sets of variables.

*Example:*

Suppose we have two sets of variables:
- Set 1: $X_1$, $X_2$, $X_3$
- Set 2: $Y_1$, $Y_2$

The goal is to find linear combinations of the $X$ variables and the $Y$ variables that are maximally correlated.

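1. **Compute Canonical Correlations**: Find successive pairs of linear combinations of the $X$ and $Y$ variables (the canonical variates) that maximize the correlation between the sets, each pair uncorrelated with the earlier ones. With three $X$ variables and two $Y$ variables there are at most $\min(3, 2) = 2$ such pairs.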

2. **Interpret the Canonical Variates**: Examine the canonical variates to understand the relationships between the variable sets.
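
One way to compute the canonical correlations is to whiten each set with a Cholesky factor of its covariance matrix and take the singular values of the whitened cross-covariance. The sketch below follows that route; the function name and the simulated data are assumptions for illustration, and it presumes both covariance matrices are positive definite.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Return canonical correlations and weight matrices for X and Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    # Cholesky factors: Sxx = Lx Lx^T and Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    # Whitened cross-covariance M = Lx^{-1} Sxy Ly^{-T}.
    M = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    U, rho, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U)     # columns: weights for the X variates
    B = np.linalg.solve(Ly.T, Vt.T)  # columns: weights for the Y variates
    return rho, A, B

# Hypothetical data: three X variables, two Y variables, one shared signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = np.column_stack([X[:, 0] + 0.5 * rng.normal(size=100),
                     rng.normal(size=100)])

rho, A, B = canonical_correlations(X, Y)
u = (X - X.mean(axis=0)) @ A[:, 0]  # first canonical variate of the X set
v = (Y - Y.mean(axis=0)) @ B[:, 0]  # first canonical variate of the Y set
print(np.round(rho, 3))
print(round(float(np.corrcoef(u, v)[0, 1]), 3))
```

The sample correlation between `u` and `v` matches the first entry of `rho`, which is a convenient check that the weights were recovered correctly.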

**Key Properties:**

1. **Canonical Correlation**: The correlation between the canonical variates.
2. **Redundancy Index**: The proportion of variance in one set of variables explained by the canonical variates of the other set.
3. **Significance Testing**: Tests whether the canonical correlations are significantly different from zero.

### 5. Cluster Analysis

Cluster Analysis groups observations into clusters such that observations within each cluster are more similar to each other than to those in other clusters.

#### K-means Clustering

*Example:*

Suppose we have a dataset with observations on two variables. We want to group the observations into three clusters.

1. **Initialize Centroids**: Randomly select three initial centroids.

2. **Assign Clusters**: Assign each observation to the nearest centroid.

3. **Update Centroids**: Calculate the mean of the observations in each cluster and update the centroids.

4. **Iterate**: Repeat steps 2 and 3 until the centroids no longer change significantly.
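
The four steps above correspond to the following NumPy sketch. The function name `kmeans`, its optional `init` argument, and the three simulated groups of points are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, init=None, max_iter=100, tol=1e-8, seed=0):
    """Basic k-means; returns labels, centroids, and the total WCSS."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    def assign(centroids):
        # Step 2: index of the nearest centroid for every observation.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return dists.argmin(axis=1)

    # Step 1: pick k distinct observations as starting centroids.
    if init is None:
        centroids = X[rng.choice(len(X), size=k, replace=False)]
    else:
        centroids = np.asarray(init, dtype=float).copy()

    for _ in range(max_iter):
        labels = assign(centroids)
        # Step 3: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        # Step 4: stop once the centroids have effectively stopped moving.
        if shift < tol:
            break

    labels = assign(centroids)
    wcss = float(np.sum((X - centroids[labels]) ** 2))
    return labels, centroids, wcss

# Hypothetical data: three well-separated groups of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.normal(size=(20, 2)) for c in centers])

labels, centroids, wcss = kmeans(X, k=3)
print(np.round(centroids, 2))
print(round(wcss, 3))
```

Because the starting centroids are random, a single run can settle in a poor local optimum; in practice the algorithm is usually restarted several times and the run with the lowest WCSS is kept.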

**Key Properties:**

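1. **Within-Cluster Sum of Squares (WCSS)**: The sum of squared distances from each observation to the centroid of its cluster; lower values indicate tighter clusters. K-means seeks the assignment that minimizes this quantity.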
2. **Number of Clusters**: Determined using methods like the Elbow Method.
3. **Cluster Centroids**: The mean of the observations in each cluster.

#### Hierarchical Clustering

*Example:*

Suppose we have a dataset with observations on two variables. We want to group the observations into a hierarchical structure.

1. **Compute Distance Matrix**: Calculate the pairwise distances between observations.

2. **Linkage Method**: Use a linkage method (e.g., single, complete, average) to determine the distance between clusters.

3. **Merge Clusters**: Iteratively merge the closest clusters until all observations are in a single cluster.

4. **Dendrogram**: Visualize the hierarchical structure using a dendrogram.
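
A naive agglomerative procedure covering steps 1 to 3 might look like the sketch below. The function name `agglomerative` and the six sample points are assumptions for illustration; the returned merge history is the information a dendrogram would display, and plotting is left out.

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="average"):
    """Merge the closest clusters until n_clusters remain."""
    X = np.asarray(X, dtype=float)
    # Step 1: pairwise Euclidean distances between observations.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 2: the linkage method turns point distances into a cluster distance.
    reduce_fn = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    clusters = [[i] for i in range(len(X))]
    merges = []  # (members of cluster a, members of cluster b, merge distance)
    # Step 3: repeatedly merge the two closest clusters.
    while len(clusters) > n_clusters:
        best_d, best_a, best_b = np.inf, None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = reduce_fn(D[np.ix_(clusters[a], clusters[b])])
                if d < best_d:
                    best_d, best_a, best_b = d, a, b
        merges.append((clusters[best_a], clusters[best_b], float(best_d)))
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels, merges

# Hypothetical data: two tight groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, merges = agglomerative(X, n_clusters=2)
print(labels)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

Calling the function with `n_clusters=1` records the full merge sequence, and stopping earlier is equivalent to cutting the dendrogram at the corresponding height. This version is cubic in the number of observations, so it is only suitable for small datasets.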

**Key Properties:**

1. **Linkage Criteria**: Method used to calculate the distance between clusters (e.g., single, complete, average).
2. **Dendrogram**: A tree-like diagram showing the hierarchical structure of clusters.
3. **Cut-off Point**: The height at which the dendrogram is cut; it determines how many clusters are obtained.

### 6. Discriminant Analysis

Discriminant Analysis is used to classify observations into predefined groups based on predictor variables.

*Example:*

Suppose we have data on students' test scores in math and science and want to classify them into pass or fail groups.

1. **Estimate Discriminant Functions**: Find linear combinations of the predictor variables that maximize the separation between the groups.

2. **Classify Observations**: Assign observations to the group with the highest discriminant score.

3. **Evaluate Accuracy**: Assess the classification accuracy using methods like cross-validation.
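
For the two-group case, these steps can be sketched with Fisher's linear discriminant. The math and science scores below are hypothetical, as are the function names, and the midpoint threshold assumes equal prior probabilities for the two groups.

```python
import numpy as np

def fit_lda(X, y):
    """Fisher's two-class discriminant: weight vector w and cut-off."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-group scatter matrix.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    # Direction that maximizes between-group relative to within-group spread.
    w = np.linalg.solve(Sw, mu1 - mu0)
    threshold = w @ (mu0 + mu1) / 2.0  # midpoint of the projected group means
    return w, threshold

def predict_lda(X, w, threshold):
    """Assign group 1 when the discriminant score exceeds the cut-off."""
    return (np.asarray(X, dtype=float) @ w > threshold).astype(int)

def loo_accuracy(X, y):
    """Leave-one-out cross-validated classification accuracy."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    hits = 0
    for i in range(len(X)):
        w, t = fit_lda(np.delete(X, i, axis=0), np.delete(y, i))
        hits += int(predict_lda(X[i:i + 1], w, t)[0] == y[i])
    return hits / len(X)

# Hypothetical (math, science) scores; 0 = fail, 1 = pass.
X = np.array([[40, 45], [50, 40], [45, 55], [55, 50],
              [70, 75], [80, 70], [75, 85], [85, 80]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

w, threshold = fit_lda(X, y)
train_acc = float(np.mean(predict_lda(X, w, threshold) == y))
print(np.round(w, 3), round(float(threshold), 3))
print(train_acc, loo_accuracy(X, y))
```

Training accuracy tends to be optimistic, which is why the cross-validated figure is the more honest estimate of how the rule would perform on new students.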

**Key Properties:**

1. **Discriminant Functions**: Linear combinations of predictor variables that best separate the groups.
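2. **Wilks' Lambda**: The ratio of within-group variation to total variation, used to measure how well the discriminant functions separate the groups. It ranges from 0 to 1, and smaller values indicate better separation.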
3. **Classification Accuracy**: The proportion of correctly classified observations.
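
Wilks' Lambda can be computed as $\Lambda = \det(W) / \det(T)$, where $W$ is the pooled within-group scatter matrix and $T$ is the total scatter matrix. The sketch below applies that definition; the function name and the pass/fail scores are hypothetical.

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks' Lambda = det(within-group scatter) / det(total scatter)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    grand_mean = X.mean(axis=0)
    T = (X - grand_mean).T @ (X - grand_mean)  # total scatter
    W = np.zeros_like(T)                       # pooled within-group scatter
    for g in np.unique(y):
        Xg = X[y == g]
        W += (Xg - Xg.mean(axis=0)).T @ (Xg - Xg.mean(axis=0))
    return float(np.linalg.det(W) / np.linalg.det(T))

# Hypothetical (math, science) scores; 0 = fail, 1 = pass.
X = np.array([[40, 45], [50, 40], [45, 55], [55, 50],
              [70, 75], [80, 70], [75, 85], [85, 80]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lam = wilks_lambda(X, y)
print(round(lam, 4))
```

For these well-separated groups the value is close to 0; groups with identical means would give a value close to 1.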

### 7. Summary

Multivariate statistics analyze several variables simultaneously and reveal relationships that one-variable-at-a-time analyses miss. The techniques covered here serve different purposes: multiple linear regression predicts an outcome from several predictors, PCA and factor analysis reduce or explain the dimensionality of a dataset, CCA relates two sets of variables, cluster analysis finds groups without predefined labels, and discriminant analysis classifies observations into known groups.
