Oracle AI Data Platform v1.0

Copyright © 2025, Oracle and/or its affiliates.

Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/

# Linear Mixed Effects Model (LME) demo

This demo fits a Linear Mixed Effects (LME) model to showcase AI Data Platform as a platform for advanced analytics. It illustrates how you can use **PySpark**, **pandas**, and **statsmodels** on AI Data Platform for mixed effects modeling.

**Overview of Use Case**

This demo is centered on educational test scores. In the notebook, we analyze the test scores of students from multiple schools, accounting for both fixed effects (e.g., gender, study hours) and random effects (e.g., variation between schools).
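To make the fixed/random split concrete, the random-intercept model fitted later in this notebook can be written as below. The symbols are illustrative assumptions rather than notation from the platform or from statsmodels: $y_{ij}$ is the test score of student $i$ in school $j$, and $u_j$ is the school-level random intercept.

```latex
y_{ij} = \beta_0
       + \beta_1 \,\mathrm{study\_hours}_{ij}
       + \beta_2 \,\mathrm{gender}_{ij}
       + u_j + \varepsilon_{ij},
\qquad
u_j \sim \mathcal{N}(0, \sigma_u^2),
\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

The $\beta$ terms are the fixed effects shared by all students. $u_j$ shifts the baseline score for every student in school $j$, which is what `(1 | school)` denotes in the model comment further down.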

**Prerequisites**

Install the following libraries, listed in requirements.txt, on your cluster:

* statsmodels
* seaborn
* pandas
* matplotlib
* numpy==1.26.4



## Load Sample Data

Simulate or load a dataset with repeated measures per group (e.g., students nested within schools).


In [1]:
import pandas as pd
import numpy as np

np.random.seed(42)

# Simulate data
n_schools = 10
n_students_per_school = 30

schools = np.repeat([f"School_{i}" for i in range(n_schools)], n_students_per_school)
gender = np.random.choice(["M", "F"], size=n_schools * n_students_per_school)
study_hours = np.random.normal(loc=5, scale=2, size=n_schools * n_students_per_school)
school_effect = np.repeat(np.random.normal(0, 2, n_schools), n_students_per_school)
test_score = 70 + (study_hours * 2) + (gender == "F") * 3 + school_effect + np.random.normal(0, 5, n_schools * n_students_per_school)

df = pd.DataFrame({
    "school": schools,
    "gender": gender,
    "study_hours": study_hours,
    "test_score": test_score
})


## Convert to Spark DataFrame

In [1]:
df_spark = spark.createDataFrame(df)
df_spark.createOrReplaceTempView("student_scores")
df_spark.show()

## Explore with Spark SQL

In [1]:
spark.sql("""
    SELECT school, COUNT(*) AS num_students, AVG(test_score) AS avg_score
    FROM student_scores
    GROUP BY school
    ORDER BY avg_score DESC
""").show()
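For comparison, the same per-school aggregation can be sketched in plain pandas. This is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not the notebook's data), so it runs without a Spark session.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 schools with 4 students each
rs = np.random.RandomState(0)
toy = pd.DataFrame({
    "school": np.repeat(["School_0", "School_1", "School_2"], 4),
    "test_score": rs.normal(80, 5, 12),
})

# Mirrors the SQL: COUNT(*) and AVG(test_score) per school, highest average first
summary = (
    toy.groupby("school")
       .agg(num_students=("test_score", "size"), avg_score=("test_score", "mean"))
       .sort_values("avg_score", ascending=False)
       .reset_index()
)
print(summary)
```

On the notebook's own data, replacing `toy` with `df` should give the same table as the Spark SQL query, up to floating-point formatting.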


## Fit Linear Mixed Effects Model


### First, with Python on the driver node of the cluster

In [1]:
import statsmodels.formula.api as smf

# Back to pandas for modeling (toPandas() collects all rows onto the driver)
df_pandas = df_spark.toPandas()

# Fit LME model: test_score ~ study_hours + gender + (1 | school)
model = smf.mixedlm("test_score ~ study_hours + gender", df_pandas, groups=df_pandas["school"])
result = model.fit()

print(result.summary())
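Beyond the printed summary, the fitted result exposes the individual estimates. The sketch below is self-contained: it rebuilds the simulated data with a local `RandomState(42)`, which should reproduce the frame created earlier, and then reads the fixed effects, the variance components, and the per-school intercepts. The names `demo_df`, `fit`, `icc`, and `school_intercepts` are illustrative. The intraclass correlation (ICC) is computed here as the school variance divided by the total variance, one common definition for a random-intercept model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Rebuild the simulated data locally so this cell runs on its own
rs = np.random.RandomState(42)
n_schools, n_per_school = 10, 30
n = n_schools * n_per_school
gender = rs.choice(["M", "F"], size=n)
study_hours = rs.normal(loc=5, scale=2, size=n)
school_effect = np.repeat(rs.normal(0, 2, n_schools), n_per_school)
test_score = 70 + 2 * study_hours + 3 * (gender == "F") + school_effect + rs.normal(0, 5, n)
demo_df = pd.DataFrame({
    "school": np.repeat([f"School_{i}" for i in range(n_schools)], n_per_school),
    "gender": gender,
    "study_hours": study_hours,
    "test_score": test_score,
})

fit = smf.mixedlm("test_score ~ study_hours + gender", demo_df, groups=demo_df["school"]).fit()

# Fixed effects. "F" is the reference level, so the gender term is
# reported as gender[T.M] (the M-minus-F difference).
fixed_effects = fit.fe_params

# Variance components: between-school intercept variance and residual variance
var_school = float(fit.cov_re.iloc[0, 0])
var_resid = float(fit.scale)
icc = var_school / (var_school + var_resid)

# Estimated random intercept (deviation from the overall intercept) per school
school_intercepts = {name: float(re.iloc[0]) for name, re in fit.random_effects.items()}

print(fixed_effects)
print(f"school variance={var_school:.2f}, residual variance={var_resid:.2f}, ICC={icc:.3f}")
```

Because the data were simulated with a school standard deviation of 2 and a residual standard deviation of 5, the estimates can be compared against those known values. With only 10 schools, the school variance estimate is expected to be noisy.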


### Then with pyspark.pandas, distributed across the cluster

## Visualize the Results with Seaborn


In [1]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x="school", y="test_score", data=df_pandas)
plt.xticks(rotation=45)
plt.title("Test Scores by School")
plt.show()
