In [1]:
import pandas as pd
import numpy as np

## Load Data

In [8]:
df_cog = pd.read_csv("./data/0_raw/cy07_msu_stu_cog.csv")

In [9]:
df_cog.shape

(606627, 3590)

### Calculate cognitive question scores

- While going through the codebook file I realized that some questions award partial credit.
- Most of them are scored true/false (1/0); some have a single partial-credit code, and some have several.
- Luckily, there is a pattern to it:


- if score starts with 0 (0, 01, 02..) it means no credit
- if score is not 1 but starts with 1 (11, 12, 13..) it means partial credit
- if score is 1 or starts with 2 (1, 2, 21, 22..) it means full credit
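
The pattern above can be sketched as a small helper. This is only an illustration: `classify_credit` is a hypothetical name, and it assumes (like the scoring function further down in this notebook) that a missing value means the question was not answered and that any code with another leading digit earns no credit.

```python
import math


def classify_credit(code):
    """Map a raw item code to 'none', 'partial' or 'full' credit.

    Returns None for missing values (NaN), which are treated as unanswered.
    """
    if code is None or (isinstance(code, float) and math.isnan(code)):
        return None
    text = str(int(code))
    if text == "1" or text.startswith("2"):
        return "full"      # 1, 2, 21, 22, ...
    if text.startswith("1"):
        return "partial"   # 11, 12, 13, ...
    return "none"          # 0 and, by assumption, every other code


for raw in (0.0, 1.0, 11.0, 12.0, 2.0, 21.0, float("nan")):
    print(raw, classify_credit(raw))
```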


- Cognitive item IDs were collected from the compendia here: https://webfs.oecd.org/pisa2018/Compendia_Cognitive.zip


In [10]:
# Math questions
math_question_ids = []
with open("./data/reference/math_question_ids.txt") as math_file:
    for line in math_file:
        math_question_ids.append(line.strip())

In [11]:
# Reading questions
reading_question_ids = []
with open("./data/reference/reading_question_ids.txt") as reading_file:
    for line in reading_file:
        reading_question_ids.append(line.strip())

In [12]:
# Science questions
science_question_ids = []
with open("./data/reference/science_question_ids.txt") as science_file:
    for line in science_file:
        science_question_ids.append(line.strip())

In [13]:
# Processing the df row by row in a Jupyter notebook requires 64 GB+ of memory,
# so drop the columns we do not plan to use to free memory
cog_cols = math_question_ids + reading_question_ids + science_question_ids
useful_cols = ['CNT','CNTRYID', 'CNTSCHID', 'CNTSTUID', 'STRATUM', 'RCORE_PERF', 'RCO1S_PERF', 'LANGTEST_COG']
df_cog.drop(columns=df_cog.columns.difference(cog_cols + useful_cols), inplace=True)
df_cog.shape

(606627, 795)

- Each question has a difficulty; however, this difficulty is not provided within the dataset.
- Among the 400 reading questions, only ~30 of them are graded: http://www.oecd.org/pisa/test/PISA2018_Released_REA_Items_12112019.pdf
- This publication describes how the leveling system changed in 2018 and what the proficiency levels mean: https://www.oecd-ilibrary.org/sites/5f07c754-en/1/2/6/index.html?itemId=/content/publication/5f07c754-en&_csp_=6aa84fb981b29e81b35b3f982f80670e&itemIGO=oecd&itemContentType=book#s49
- According to the document, only a limited number of items are released to the public: https://www.oecd-ilibrary.org/sites/5f07c754-en/1/2/14/index.html?itemId=/content/publication/5f07c754-en&_csp_=6aa84fb981b29e81b35b3f982f80670e&itemIGO=oecd&itemContentType=book#mh199
- I couldn't find the difficulties of the questions in the data explorer, codebook, compendia, or the dataset. I spent quite a lot of time on this; I will get back to it after some time.
- With difficulties I could have gotten more accurate scoring :(

--

- 2 days later: while examining the data manually I found a cluster of values in the 300-500 range (which is the avg score).
- Apparently the student questionnaire file has the scores for the cognitive tests.
- After finding the variable I read about how the scores are calculated and the reasoning behind this weighted approach here: https://www.oecd.org/pisa/data/pisa2018technicalreport/PISA2018%20TecReport-Ch-19-Data-Products.pdf
- Going through hundreds of columns in the codebook, I didn't expect the scores to be in the questionnaire file, and at the end of it.

--

- 1 more day later
- There was actually a link about "How to prepare and analyse the PISA database" 🤦‍♂️
- http://www.oecd.org/pisa/data/httpoecdorgpisadatabase-instructions.htm
- There is a lengthy explanation of how scores are calculated:
  - Rasch Item Response Theory was used before to calculate the weights of different problems (likelihood and Bayesian models are used).
  - This response theory is good for questionnaires, but for cognitive tasks PISA uses plausible values (PVs).
  - PVs are used to split students into proficiency levels and to evaluate them within their bucket more accurately. The method looks at the population (country/economy), creates samplings, builds regression models, and averages them. I kind of get the main idea, but I couldn't follow how it is calculated or what the main difference between PV1 and PV2 is.
  - I will be using these PV values.
  - One of the documents mentions that correlations between domains (math, reading, science) should be calculated with the same plausible value (PV1Math vs PV1Reading): since PVs are calculated from conditional posterior distributions, mixing them is not right.
  - In a contest PISA ran, they also mention that for exploration it is fine to use only PV1: https://www.oecd.org/education/datavisualizationcontest.htm
  - Creating a std of PV1-PV10 doesn't seem right, but there is also value in the other PVs, which I have yet to discover. I will probably use PV1 for now.
  - Maybe correlating PVs within the same domain could give more insight: if their correlation is similar across all PVs, then using only one would be enough.
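
The last two ideas (same-index PVs across domains, and comparing PVs within one domain) could be checked roughly as below. This is a hedged sketch, not the official PISA procedure: the column names (`PV1MATH` ... `PV10MATH`, `PV1READ`, ...) are an assumed naming scheme that should be verified against the codebook, the helper names are hypothetical, the data is synthetic, and sampling weights are ignored.

```python
import numpy as np
import pandas as pd


def pv_columns(domain, n_pvs=10):
    # Assumed naming scheme: PV1MATH ... PV10MATH (check the codebook).
    return [f"PV{i}{domain}" for i in range(1, n_pvs + 1)]


def within_domain_pv_corr(df, domain, n_pvs=10):
    # Pairwise Pearson correlations among the PVs of one domain.
    return df[pv_columns(domain, n_pvs)].corr()


def cross_domain_corr(df, domain_a, domain_b, n_pvs=10):
    # Correlate same-index PVs only (PV1 with PV1, PV2 with PV2, ...),
    # then average the per-PV results.
    per_pv = [
        df[f"PV{i}{domain_a}"].corr(df[f"PV{i}{domain_b}"])
        for i in range(1, n_pvs + 1)
    ]
    return float(np.mean(per_pv)), per_pv


# Synthetic stand-in data so the sketch runs without the PISA files.
rng = np.random.default_rng(0)
ability = rng.normal(500, 100, size=1000)
demo = pd.DataFrame({
    f"PV{i}{d}": ability + rng.normal(0, 40, size=1000)
    for d in ("MATH", "READ")
    for i in range(1, 11)
})

print(within_domain_pv_corr(demo, "MATH").round(2).iloc[:3, :3])
mean_corr, per_pv = cross_domain_corr(demo, "MATH", "READ")
print(round(mean_corr, 3))
```

If the within-domain matrix is nearly uniform on real data, that would support using PV1 alone for exploration.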

In [15]:
# Given the column ids in question_ids, this function calculates the score of a row.
# Full-credit answers are given 10 points, partial-credit answers 5 points.
def calculate_score_for_columns(row, question_ids):
    student_score = 0.0
    student_answered_questions = 0

    for question in question_ids:
        score = row[question]
        if not np.isnan(score):
            score_str = str(int(score))
            student_answered_questions += 1
            if score_str == "1" or score_str.startswith("2"):
                student_score += 10
            elif score_str.startswith("1"):
                student_score += 5  # partial credit: half of the 10 points
    return student_score, student_answered_questions

In [16]:
def calculate_test_score(row):
    m_score, m_count = calculate_score_for_columns(row, math_question_ids)
    r_score, r_count = calculate_score_for_columns(row, reading_question_ids)
    s_score, s_count = calculate_score_for_columns(row, science_question_ids)
    return pd.Series([m_score, m_count, r_score, r_count, s_score, s_count])

In [17]:
# Calculate scores and answer counts
score_columns = ["math_score", "math_answered", "reading_score", "reading_answered", "science_score", "science_answered"]
df_cog[score_columns] = df_cog.apply(calculate_test_score, axis=1)
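
Since the row-by-row `apply` is what drives the memory and time cost, a vectorized variant may be worth trying. The sketch below is an untested-on-PISA alternative under the same scoring rule (10 points for full credit, 5 for partial, NaN means unanswered); `score_block` is a hypothetical helper and the demo frame is made up.

```python
import numpy as np
import pandas as pd


def score_block(df, question_ids, full_points=10, partial_points=5):
    """Vectorized score and answered-count for one domain's item columns."""
    codes = df[question_ids].to_numpy(dtype=float)
    answered = ~np.isnan(codes)
    # Replace NaN with 0 so the integer cast is safe; masked out again below.
    whole = np.where(answered, codes, 0).astype(np.int64)
    # Leading digit of each code: 21 -> 2, 12 -> 1, 1 -> 1, 0 -> 0.
    lead = whole.copy()
    while (lead >= 10).any():
        lead = np.where(lead >= 10, lead // 10, lead)
    full = answered & ((whole == 1) | (lead == 2))
    partial = answered & (lead == 1) & (whole != 1)
    score = full.sum(axis=1) * full_points + partial.sum(axis=1) * partial_points
    return pd.DataFrame(
        {"score": score.astype(float), "answered": answered.sum(axis=1)},
        index=df.index,
    )


# Made-up item columns to exercise the helper.
demo_items = pd.DataFrame({
    "Q1": [1, 21, np.nan],
    "Q2": [11, 0, np.nan],
    "Q3": [2, 12, 1],
})
print(score_block(demo_items, ["Q1", "Q2", "Q3"]))
```

It would be called once per domain, for example `score_block(df_cog, math_question_ids)`, and the two result columns assigned to the corresponding score and answered columns.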

In [18]:
columns_to_keep = useful_cols + score_columns

In [19]:
# drop unused columns
df_cog.drop(columns=df_cog.columns.difference(columns_to_keep), inplace=True)

In [20]:
df_cog.sample(5)

Unnamed: 0,CNTRYID,CNT,CNTSCHID,CNTSTUID,STRATUM,LANGTEST_COG,RCORE_PERF,RCO1S_PERF,math_score,math_answered,reading_score,reading_answered,science_score,science_answered
130012,170.0,COL,17000251.0,17003049.0,COL0410,156.0,2.0,2.0,0.0,0.0,350.0,55.0,160.0,36.0
59298,70.0,BIH,7000040.0,7004870.0,BIH0028,192.0,3.0,3.0,0.0,0.0,540.0,57.0,280.0,38.0
539152,784.0,ARE,78400114.0,78415262.0,ARE0659,313.0,3.0,3.0,0.0,0.0,380.0,56.0,260.0,39.0
94044,124.0,CAN,12400300.0,12406464.0,CAN0547,493.0,2.0,2.0,125.0,22.0,365.0,57.0,0.0,0.0
167411,214.0,DOM,21400196.0,21404617.0,DOM0003,156.0,2.0,1.0,0.0,0.0,280.0,58.0,90.0,39.0


In [21]:
df_cog.to_csv("./data/1_processed/cognitive_scores.csv",index=False)