# Read and Process Data

Our data are the 2022 PISA student responses from Canada and the United States. PISA releases the responses for all countries together in SPSS `.sav` format, so the first steps in our analysis are to read and process the data in the following stages:

1. Read data in
2. Filter to just US/Canada responses
3. Calculate missingness proportions for our items of interest
4. Perform imputation on missing values

I performed steps 1 and 2 locally in RStudio because the Jupyter kernel could not handle the 1.9 GB original file. The filtered USA/CAN file is around 130 MB.
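For readers who would rather stay in Python, steps 1 and 2 could in principle be done without loading the whole file at once by reading the `.sav` file in chunks. The following is a minimal sketch, not the R code that was actually used. The input file name `pisa2022_full.sav`, the chunk size, and both helper names are assumptions; it relies on `pyreadstat.read_file_in_chunks`, which yields `(DataFrame, metadata)` pairs.

```python
import pandas as pd


def filter_chunks(chunks, countries=("USA", "CAN")):
    """Keep only rows whose CNT code is in `countries`.

    `chunks` is any iterable of (DataFrame, metadata) pairs, which is
    the shape that pyreadstat.read_file_in_chunks yields.
    """
    kept = [df[df["CNT"].isin(countries)] for df, _meta in chunks]
    return pd.concat(kept, ignore_index=True)


def read_filtered_sav(path, chunksize=50_000):
    """Read a large .sav file chunk by chunk, keeping only USA/CAN rows."""
    import pyreadstat  # imported lazily so filter_chunks works without it

    reader = pyreadstat.read_file_in_chunks(
        pyreadstat.read_sav, path, chunksize=chunksize
    )
    return filter_chunks(reader)


# Hypothetical usage; the input file name is an assumption:
# usacan = read_filtered_sav("pisa2022_full.sav")
# usacan.to_csv("pisa2022_usacan.csv")
```

Because each chunk is filtered before it is kept, only the USA/CAN rows accumulate in memory, which is the part of the file we need for the rest of the notebook.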

# Load Dependencies

You may need to install some packages first; check the session info to see which versions we are running.

In [10]:
# you might need to run these to install packages
# !pip install pyreadstat
# !pip install session_info
# !pip install polars

Collecting pyreadstat
  Downloading pyreadstat-1.2.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.0 kB)
Downloading pyreadstat-1.2.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB)
Installing collected packages: pyreadstat
Successfully installed pyreadstat-1.2.7


In [3]:
# import dependencies
import session_info
import numpy as np
import pandas as pd
import polars as pl

In [4]:
# show session info
session_info.show()

# Load and Inspect Data

Recall that the data have already been filtered to only the USA and CAN country codes.

In [5]:
dat = pl.read_csv("pisa2022_usacan.csv", null_values=["NA", "null"])

In [6]:
# Explore data
print(dat.head())
print(f'Data shape: {dat.shape}')
print(dat.describe())
print(dat.columns)

shape: (5, 1_280)
┌─────┬─────┬─────────┬──────────┬───┬──────────┬─────────┬────────────────────┬──────┐
│     ┆ CNT ┆ CNTRYID ┆ CNTSCHID ┆ … ┆ PV10MPRE ┆ SENWT   ┆ VER_DAT            ┆ test │
│ --- ┆ --- ┆ ---     ┆ ---      ┆   ┆ ---      ┆ ---     ┆ ---                ┆ ---  │
│ i64 ┆ str ┆ i64     ┆ i64      ┆   ┆ f64      ┆ f64     ┆ str                ┆ str  │
╞═════╪═════╪═════════╪══════════╪═══╪══════════╪═════════╪════════════════════╪══════╡
│ 1   ┆ CAN ┆ 124     ┆ 12400376 ┆ … ┆ 517.929  ┆ 0.1297  ┆   02MAY23:16:37:57 ┆ null │
│ 2   ┆ CAN ┆ 124     ┆ 12400020 ┆ … ┆ 528.554  ┆ 0.13573 ┆   02MAY23:16:38:03 ┆ null │
│ 3   ┆ CAN ┆ 124     ┆ 12400906 ┆ … ┆ 303.109  ┆ 0.03326 ┆   02MAY23:16:38:02 ┆ null │
│ 4   ┆ CAN ┆ 124     ┆ 12400726 ┆ … ┆ 698.58   ┆ 0.34707 ┆   02MAY23:16:37:59 ┆ null │
│ 5   ┆ CAN ┆ 124     ┆ 12400103 ┆ … ┆ 590.067  ┆ 0.11075 ┆   02MAY23:16:37:59 ┆ null │
└─────┴─────┴─────────┴──────────┴───┴──────────┴─────────┴────────────────────┴──────┘
Data shape: (1

In [7]:
# just the outcome
print(dat["ST352Q06JA"].head())

# how many rows are there in the data?
print(f'Length of data: {len(dat)}')

# number of NA values for outcome
print(f'All values for outcome: {len(dat["ST352Q06JA"])}')
print(f'Non-null values for outcome: {dat.select(pl.count("ST352Q06JA")).to_numpy().astype(int)}')
print(f'Number of null values for outcome: {dat["ST352Q06JA"].is_null().sum()}')
print(f'Proportion of outcome data missing: {dat["ST352Q06JA"].is_null().sum() / len(dat["ST352Q06JA"])}')

shape: (10,)
Series: 'ST352Q06JA' [i64]
[
	2
	null
	2
	4
	null
	null
	null
	4
	2
	4
]
Length of data: 11737
All values for outcome: 11737
Non-null values for outcome: [[4605]]
Number of null values for outcome: 7132
Proportion of outcome data missing: 0.6076510181477379


# Filter Data

We now restrict the data to our features of interest (the predictors, or independent variables).

MISSSC (i64): 0, 1, 1, 0, 0, …, 0, 0, 0, 0
