# 03 &mdash; Hypothesis Testing

*(Data Analysis and Visualization 505067 &mdash; Final Report)*

**Authored by:** Nguyen Phuc Toan

## Prerequisites

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

### Loading data

In [2]:
df = pd.read_csv("../data/salaries.csv")

### Preprocessing

**Explanation:**

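1. Drop duplicate rows with `df.drop_duplicates()`.
2. Log-transform the salary with `np.log1p` to reduce right skew and bring the distribution closer to normal.
3. Remove outliers using the modified z-score based on the median absolute deviation (MAD), keeping rows whose absolute score is below 3.5.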
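
The MAD-based outlier step can be sketched on a toy array as follows. This is a minimal illustration, assuming the same modified z-score convention (constant 0.6745, cutoff 3.5) that the notebook cell uses; the helper name `mad_keep_mask` and the sample values are hypothetical.

```python
import numpy as np


def mad_keep_mask(values, cutoff=3.5):
    """Return a boolean mask that is True for values kept as non-outliers.

    Hypothetical helper: scores each value with the modified z-score
    0.6745 * (x - median) / MAD and keeps those with |score| < cutoff.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # A MAD of zero would cause division by zero; keep everything in this sketch.
        return np.ones(values.shape, dtype=bool)
    mod_z = 0.6745 * (values - median) / mad
    return np.abs(mod_z) < cutoff


# Made-up sample: seven similar values and one extreme value.
sample = np.array([10, 11, 12, 11, 10, 12, 11, 100])
mask = mad_keep_mask(sample)
kept = sample[mask]
print(kept)
```

Because the median and MAD are barely affected by the extreme value, the score of that value stays large and it is filtered out, which is the main reason to prefer this over a mean and standard deviation rule.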

In [3]:
df = df.drop_duplicates()
df.shape

(27311, 11)

In [4]:
target_col = "salary_in_usd"

df[target_col] = np.log1p(df[target_col])
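
The effect of the log transform on skewness can be checked with a quick sketch. The data below are synthetic (drawn from a log-normal distribution as a stand-in for right-skewed salaries), not the report's dataset, and the distribution parameters are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed values standing in for salaries (assumed parameters).
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=11.0, sigma=0.5, size=5000)

# log1p(x) = log(1 + x); it stays defined at x = 0, unlike a plain log.
logged = np.log1p(raw)

skew_raw = stats.skew(raw)
skew_logged = stats.skew(logged)
print(f"skewness before = {skew_raw:.3f}, after = {skew_logged:.3f}")

# np.expm1 inverts np.log1p, so results can be mapped back to the original scale.
restored = np.expm1(logged)
```

Note that after this transform, every mean compared later in the notebook is a mean of log-salaries, not of salaries in USD.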

In [5]:
target_col = "salary_in_usd"

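# Modified z-score: 0.6745 * (x - median) / MAD
median = df[target_col].median()
mad = np.median(np.abs(df[target_col] - median))
mod_z = 0.6745 * (df[target_col] - median) / mad

# Keep only rows whose absolute modified z-score is below 3.5
df = df[np.abs(mod_z) < 3.5]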
df.shape

(27220, 11)

## Hypothesis Testing

### Proposed question

*"Is working fully remotely associated with a higher salary than on-site or hybrid work?"*

### Hypotheses

* $H_0$: The mean (log-transformed) salary is the same across all remote ratios.
* $H_1$: At least one remote ratio has a different mean (log-transformed) salary.

### One-way ANOVA Test

**Explanation:**

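One-way ANOVA tests whether all group means are equal by comparing the variance between groups with the variance within groups. At significance level $\alpha = 0.05$, $\text{p–value} < 0.05$ leads us to reject $H_0$ and conclude that at least one group mean differs.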
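
To make the F-statistic concrete, the sketch below computes it by hand and compares it with `stats.f_oneway`. The three groups are synthetic and the helper name `one_way_f` is hypothetical; this illustrates the textbook formula rather than the notebook's own computation.

```python
import numpy as np
from scipy import stats

# Three synthetic groups with assumed means and unequal sizes.
rng = np.random.default_rng(42)
toy_groups = [
    rng.normal(loc=m, scale=1.0, size=n)
    for m, n in [(0.0, 30), (0.2, 40), (1.0, 35)]
]


def one_way_f(groups):
    """Hypothetical helper: F = (SS_between / (k - 1)) / (SS_within / (N - k))."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    # Upper-tail probability of the F distribution gives the p-value.
    p = stats.f.sf(f_value, df_between, df_within)
    return f_value, p


f_manual, p_manual = one_way_f(toy_groups)
f_scipy, p_scipy = stats.f_oneway(*toy_groups)
print(f_manual, f_scipy)
```

A large F means the group means are spread out far more than the within-group noise would explain, which is what drives the p-value toward zero.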

In [6]:
groups = [g["salary_in_usd"] for _, g in df.groupby("remote_ratio")]

f_stat, p_value = stats.f_oneway(*groups)
print(
    f"F-statistic = {f_stat}",
    f"\np-value = {p_value}",
)

F-statistic = 199.2627168301049 
p-value = 1.2267506437579236e-86


**Summary:**

* $\text{p–value} \approx 1.23 \times 10^{-86} \ll 0.05$.
* We reject $H_0$ in favor of $H_1$: the data provide strong evidence that at least one remote ratio has a different mean log-salary.

## Conclusion

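Mean log-salaries are not equal across remote ratios: at least one group (on-site, hybrid, or fully remote) differs significantly from the others. ANOVA alone does not identify which group differs or in which direction, so this result does not by itself show that fully remote work is associated with higher pay.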
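
One possible follow-up to the proposed question is a post hoc pairwise comparison such as Tukey's HSD. The sketch below is not part of the original analysis: it runs on synthetic log-salaries with made-up group means and neutral labels, and it assumes SciPy 1.8 or later, where `scipy.stats.tukey_hsd` is available.

```python
import numpy as np
from scipy import stats

# Synthetic log-salaries for three hypothetical groups (values are assumptions,
# not results from the report's dataset).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=11.60, scale=0.40, size=200)
group_b = rng.normal(loc=11.00, scale=0.40, size=200)
group_c = rng.normal(loc=11.55, scale=0.40, size=200)

# Tukey's HSD compares every pair of groups while controlling the
# family-wise error rate.
result = stats.tukey_hsd(group_a, group_b, group_c)

labels = ["A", "B", "C"]
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        # statistic[i, j] is the difference mean(group i) - mean(group j).
        diff = result.statistic[i, j]
        p = result.pvalue[i, j]
        print(f"{labels[i]} vs {labels[j]}: diff = {diff:.3f}, p = {p:.3g}")
```

On the real data, the groups would be the per-`remote_ratio` series already built for the ANOVA cell, and the sign of each pairwise difference would indicate which arrangement has the higher mean log-salary.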