In [1]:
"""
Note about project:  This project is exploratory. I plan to run regression on students' performance and their 
attitude towards education. Other factors aside from attitude, such as socioeconomic status, definitely affect 
education performance regardless of students' attitude. 

Goal of this project: Test which questions strongly correlate with performance to help with the future data collection 
by TDRI.  If we have a general idea on which specific attitudes correlate with education, we can design a survey 
that better captures the result of education interventions. 
"""
# Importing all the necessary libraries. 
import pandas as pd
import numpy as np
import statsmodels.api as sm


In [2]:
"""
Input: List of numbers
Output: The average of members within the list.
Description: This function helps when I apply conditions to the dataframe, such as separating students into those
that attend public and private schools. In some instances, the length of the dataframe is 0, and dividing 0 gives
an error. Because I prefer to abstract from checking the type of elements within each array, I create this function. 
""" 

def avg(lst):
    count = 0
    sum_val = 0
    for member in lst:
        try:
            temp = float(member)
            count += 1
            sum_val += temp
        except:
            temp = 0
    try:
        return sum_val/count
    except:
        return 0

In [3]:
# Reading stata file. Note that columns with all N/A cell will be dropped from this dataset. 
# I have experimented with dropping columns with any N/A cell, but the final result was too many dataset is dropped.
"""
Notes on data:
1. This data is a survey administered to approximately 8600 students attending 290 different schools. Some of
the columns contain categorical variables, and others ordinal/numerical variables. This suggests the need to carefully
select which datafield will be included in the final analysis.
2. Some of the cells contain No Response/Not available. I must decide later on how to impute/remove these data
before running the regression model. 
"""

thadf = pd.read_stata("PISA_2018_THA.dta").dropna(how = "all", axis = 1).fillna(method = "ffill").fillna(method = "bfill")

In [4]:
# Creatinga a dictionary whose key is column name and value is column interpretation. 
# This method helps because when I tried reading .dta files directly, some of the values change. 
"""
Credit: https://stackoverflow.com/questions/44809696/is-there-a-way-to-read-stata-labels-in-python
"""

reader = pd.io.stata.StataReader("PISA_2018_THA.dta")
header = reader.variable_labels()
for var in header:
        name  = var
        label = header[name]

In [5]:
# Eliminating the keys of columns we have already dropped so that we can take advantage of .columns.tolist() library.

header_temp = {}
for key in list(thadf.columns.tolist()):
    header_temp[key] = header[key]

header = header_temp 

In [6]:
# Creating a flipped dictionary so that we can search for the column name if we have an aspect of data we are 
# interested in. 

header_flip = {}
for key in list(header.keys()):
    header_flip[header[key]] = key

In [7]:
# Creating three new columns -- math, read, scie, glcm -- to calculate the average statistics for each student.  
"""
Note: Global competency (GLCM) is dropped because there is no clear way to calculate the competency. For instance,
some questions have a lot of No Responses; others including questions that are not taught in Thai curriculum. 
Therefore, I focused on mathematics, science, and reading because these subjects are less context-dependent.
"""

num_row = thadf.shape[0]
thadf = thadf.assign(math = [0]*num_row, read = [0]*num_row, scie = [0]*num_row) 
col_lst = thadf.columns.tolist()

In [8]:
# Calculating the summary statistics for mathematics and science by averaging scores for all 10 levels in the subject. 

subject_lst = ["math", "scie"]
temp = "pv"
for index in thadf.index.tolist():
    for subject in subject_lst:
        sum_score = 0
        for i in range(1,11):
            col_name = temp + str(i) + subject
            sum_score += thadf.loc[index, col_name]
        avg_score = sum_score/10
        thadf.loc[index, subject] = avg_score

In [9]:
# Calculating the summary statistics for reading by averaging scores for all 10 levels across different areas of
# reading. 
"""
Each subfield of reading is as follows:
1.) rtsn = Single Text Structure. 
2.) rcer = Evaluate and Reflect 
3.) Multiple Text Structure
4.) rcun = Reading comprehenseion 
5.) rcli = Locate information  

Note: This analysis could be more detailed. For instance, we could treat reading comprehension and locating 
information as separate fields of data. However, I chose to average their assessment score to understand a larger
trend of Thai students' reading skills. 
"""
read_lst = ["rtsn", "rcer","rtml", "rcun", "rcli"]
for index in thadf.index.tolist():
    sum_score = 0
    for read in read_lst:
        for i in range(1,11):
            col_name = temp + str(i) + read
            sum_score += thadf.loc[index, col_name]
    avg_score = sum_score/50 # Average over 50 types of reading
    thadf.loc[index, "read"] = avg_score
subject_lst = subject_lst + ["read"]

In [10]:
"""
At this point, I will run a regression project on students' attitude.

Note: Expected maximum education attainment has some idiosyncrasy. There are 6000+ students who check bachelor's
degree, but do not check middle school and high school. Therefore, I chose to exclude this data from the analysis.
"""

# Step 1: Create a list of questions pertaining to attitude by grouping relevant questions altogether into 
# different groups. 

content = ["st016q01na"]
school_try = ["st036q05ta","st036q06ta","st036q08ta"]
max_edu = ["st225q01ha","st225q02ha","st225q03ha", "st225q04ha","st225q06ha"]
compete = ["st181q02ha", "st181q03ha", "st181q04ha"]
failure = ["st183q01ha","st183q02ha","st183q03ha"]
growth = ["st184q01ha"]
life_meaning = ["st185q01ha","st185q02ha","st185q03ha"]
mental_health = []
for i in range(1,10):
    temp = "st186q0"
    if (i != 4):
        temp += str(i) + "ha"
        mental_health.append(temp)
mental_health.append("st186q10ha")
mastery = ["st208q01ha","st208q02ha","st208q04ha"]
self_manage = []
for i in range(1,8):
    temp = "st188q0"
    if (i != 5 and i != 4):
        temp += str(i) + "ha"
        self_manage.append(temp)
school_exp = []
for i in range(1,7):
    temp = "st034q0"
    if (True):
        temp += str(i) + "ta"
        school_exp.append(temp)     

In [11]:
# Merging each list of question groups altogether. 
attdf_questions = content + school_try + max_edu + compete + failure + growth + life_meaning + mental_health + mastery
attdf_questions += self_manage + school_exp

In [12]:
# Step 2: Create a dictionary where we check how many misses/no responses are there in each variable.
"""
Note: no_resp_Qkey is a dictionary whose key is the question_id and value is the number of total 'No Response' from 
every student. no_resp_Skey is a dictionary whose key is the student_id and value the number of 'No Response' that
student provides in total. 
"""

attdf = thadf[attdf_questions]

no_resp_Qkey = {} 
no_resp_Skey = {}

for question in attdf_questions:
    no_respdf = attdf[attdf[question] == "No Response"]
    no_resp_Qkey[question] = no_respdf.shape[0]

for index in attdf.index.tolist(): 
    no_respdf = attdf.loc[index]
    for question in attdf_questions:
        try:
            no_resp_Skey[index] += 1
        except:
            no_resp_Skey[index] = 1

In [13]:
"""
After checking the no_resp_Qkey, I decided not to drop any questions because questions which have the highest
'No Response' perform pretty well with approximate 250 misses out of the total 8600 responses. Therefore, I will
drop students who provide a lot of 'No Response' instead, which in this case, I chose to drop any student with
more than 1 'No Response'.
"""

count = 0
temp = 2
for mem in list(no_resp_Skey.values()):
    if mem >= temp:
        count += 1

print(count)

8633


In [36]:
# Step 3: Create a new dictionary for each student whereby agree/disagree is converted to 1-4 scale where
# 1 = Strongly disagree/Never; 2 = Disagree/Rarely; 3 = Agree/Sometimes; 4 = Strongly agree/Always.

"""
Note: Because responses to each question is ordinal (1-4/1-5 scale on disagree-strongly agree), I created a new dataframe 
to run linear regression. I specifically used the to_dict method to take advantage of the fast computation
of the to_list method on each column. To deal with 'No Response', I chose to impute the cell with the responses 
that appear the most commonly among the responses given. Because there are not many 'No Response' in the first place,
this method would not significantly affect the final result of the regression model. 
"""

data = {}
# I created an expected_edu_lst to exclude these responses from the regression model. The reason is that some of the 
expected_edu_lst = ["st225q01ha","st225q02ha","st225q03ha","st225q04ha","st225q06ha"]

for question in attdf_questions: 
    # This question 'st016q01na' is a numerical variable., so we can directly copy the value into a new array 
    # or impute. 
    if question == 'st016q01na':
        temp_lst = []
        for response in attdf[question].tolist():
            # Directly adding the value to the list if the student provides some response.
            if response != "No Response":
                temp_lst.append(response)
            else:
            # Else, impute the No Response with the most common answers that appeared in the column.  
                most_common = attdf[question].value_counts().index.tolist()[0]
                temp_lst.append(most_common)
    # If the question is ordinal variable, we need to convert into 1-4/1-5 scale first. 
    elif question not in subject_lst and question not in expected_edu_lst:
        temp_lst = []
        for response in attdf[question].tolist():
            if response == "No Response":
                #  Impute the No Response with the most common answers that appeared in the column.  
                response = attdf[question].value_counts().index.tolist()[0]
            if response in ["Strongly disagree","Never","Not at all true of me"]:
                temp_lst.append(1)
            elif response in ["Disagree", "Rarely", "Slightly true of me"]:
                temp_lst.append(2)
            elif response in ["Agree", "Sometimes", "Moderately true of me"]:
                temp_lst.append(3)
            elif response in ["Strongly agree", "Always","Very true of me"]:
                temp_lst.append(4) 
            elif response == "Extremely true of me":
                temp_lst.append(5)
    data[question] = temp_lst

for subject in subject_lst:
    data[subject] = thadf[subject].tolist()

In [37]:
# Create a new dataframe where every attitude-related question is converted into numbers. 
replace_attdf = pd.DataFrame.from_dict(data)

In [39]:
# Removing the list of expected level of education 
replace_attdf.drop(expected_edu_lst, axis = 1, inplace = True)

In [41]:
temp_lst = []
for question in attdf_questions:
    if question not in expected_edu_lst and question not in subject_lst:
        temp_lst.append(question)
attdf_questions = temp_lst

In [228]:
# Step 4: Replace ordinal variables on disagree-agree. 
# Result: Running the regression model with all variables yields an R-squared of 0.317. 
"""
Note: The constant is added because every student should have a based intelligent even before entering school. 
"""

X = replace_attdf[attdf_questions]
y = replace_attdf["read"]
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
predictions = model.predict(X) 

model.summary()

0,1,2,3
Dep. Variable:,read,R-squared:,0.384
Model:,OLS,Adj. R-squared:,0.381
Method:,Least Squares,F-statistic:,144.7
Date:,"Sun, 26 Jan 2020",Prob (F-statistic):,0.0
Time:,15:38:35,Log-Likelihood:,-48959.0
No. Observations:,8633,AIC:,97990.0
Df Residuals:,8595,BIC:,98260.0
Df Model:,37,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,346.5665,10.400,33.324,0.000,326.180,366.953
st016q01na,-3.5188,0.388,-9.066,0.000,-4.280,-2.758
st036q05ta,-12.8438,1.751,-7.335,0.000,-16.276,-9.411
st036q06ta,12.3844,1.754,7.061,0.000,8.946,15.823
st036q08ta,2.5028,1.738,1.440,0.150,-0.905,5.911
st181q02ha,-14.9442,1.186,-12.598,0.000,-17.270,-12.619
st181q03ha,-8.1848,1.278,-6.405,0.000,-10.690,-5.680
st181q04ha,20.9648,1.348,15.556,0.000,18.323,23.607
st183q01ha,8.5158,1.379,6.176,0.000,5.813,11.219

0,1,2,3
Omnibus:,80.048,Durbin-Watson:,1.278
Prob(Omnibus):,0.0,Jarque-Bera (JB):,82.298
Skew:,0.231,Prob(JB):,1.35e-18
Kurtosis:,3.123,Cond. No.,269.0


In [224]:
# Step 5: Choosing which variables to include in the final model.
keep_lst = []
# st036q05ta yields R-squared of only 0.001. st036q06ta (entering good university) yields the highest coefficient
# and highest R-squared
motivation = ["st036q06ta"]
# st181q02ha = Enjoyment during competition (-); st181q04ha = Strive to do better in face of competition
compete = ["st181q02ha","st181q04ha"]
fear_of_failure = ["st183q02ha"]
growth_mindset = ["st184q01ha"] # Super important: R-squared of 0.108
anxiety = ["st186q02ha"] 
happy = ["st186q05ha"]
# Unhappy emotion (i.e. "st186q 06/08/10 ha" has insignificant R-squared)
learning_goal = ["st208q04ha"]
accomplishment = ["st188q02ha"] # st188q02ha = I feel content when I can accomplish something. 
negative_school_exp = ["st034q01ta"] 
# Note: Positive school experience does not add much, but negative school experience means a lot


keep_lst = motivation + compete + fear_of_failure + growth_mindset + anxiety + happy + learning_goal
keep_lst += accomplishment + negative_school_exp

X = replace_attdf[keep_lst]
y = replace_attdf["read"]
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
predictions = model.predict(X) 

model.summary()

0,1,2,3
Dep. Variable:,read,R-squared:,0.323
Model:,OLS,Adj. R-squared:,0.322
Method:,Least Squares,F-statistic:,410.7
Date:,"Sun, 26 Jan 2020",Prob (F-statistic):,0.0
Time:,15:38:06,Log-Likelihood:,-49368.0
No. Observations:,8633,AIC:,98760.0
Df Residuals:,8622,BIC:,98830.0
Df Model:,10,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,246.0985,8.836,27.851,0.000,228.777,263.420
st036q06ta,4.5318,1.227,3.693,0.000,2.126,6.937
st181q02ha,-21.3093,1.141,-18.679,0.000,-23.546,-19.073
st181q04ha,19.4202,1.312,14.803,0.000,16.848,21.992
st183q02ha,12.4425,1.096,11.348,0.000,10.293,14.592
st184q01ha,-26.3940,0.904,-29.211,0.000,-28.165,-24.623
st186q02ha,18.3660,1.099,16.708,0.000,16.211,20.521
st186q05ha,12.4283,1.303,9.541,0.000,9.875,14.982
st208q04ha,3.5666,0.953,3.743,0.000,1.698,5.435

0,1,2,3
Omnibus:,145.74,Durbin-Watson:,1.204
Prob(Omnibus):,0.0,Jarque-Bera (JB):,152.803
Skew:,0.322,Prob(JB):,6.6e-34
Kurtosis:,3.097,Cond. No.,107.0


In [227]:
# Step 5: Summarize results
# Summary of methodology and result. 
"""
Methodology: 
First, I loaded the PISA 2018 data on Thai students' education achievement. Because people have previously studied 
the impact of students' socioeconomic background on their education achievement, I decided to focus on questions
pertaining to students' attitude towards education (referring to P.47-P.60 of the student questionnaire). 

Second, I looked at the data and observed that virtually every question is on an Agree/Disagree scale except for 
their expected education achievement. After looking at the expected education level, I concluded that the data 
was not properly filled since it added up to more than 8600 (the total number of students participating in this
study). Therefore, I decided to remove it for the regression analysis. Moreover, we experienced several problems
with No Response, so we filter out every student whose responses miss at least 2 spots. Not much data is missed 
given that only 227 out of 8700 students have at least 2 spots missed; for others, their data is complete or 
misses only one spot. 

Third, I converted the Agree/Disagree questionnaire into 1-4 scale and ran multiple linear regression. The result
will be shown above. Only the variables with p-value less than 0.05 will be shown. 

Fourth, group the variables by positive/negative coefficient and pick only the strongest coefficients. 
That is, pick variables that have high explanatory power.

Limitation: Reverse causality is not handled. 
"""

# Summarizing factors that negatively and positively correlate with education achievement.
negative_factor = ["st181q02ha","st184q01ha","st034q01ta"]
positive_factor = [member for member in keep_lst if member not in negative_factor]

# Selecting the factors that strongly associate with education achievement. 
strong_factor = ["st034q01ta","st188q02ha","st184q01ha","st181q02ha","st181q04ha"] 

# Displaying result
print("Positive factors are as follows: \n")
for factor in positive_factor:
    print(header[factor])
    
print("\nNegative factors are as follows: \n")
for factor in negative_factor:
    print(header[factor])
    
print("\nStrong factors are as follows: \n")
for factor in strong_factor:
    print(header[factor])

Positive factors are as follows: 

Thinking about your school: Trying hard at school will help me get into a good <
Agree: I try harder when I’m in competition with other people.
Agree: When I am failing, I am afraid that I might not have enough talent.
Thinking about yourself and how you normally feel: how often do you feel as desc
Thinking about yourself and how you normally feel: how often do you feel as desc
How true for you: My goal is to understand the content of my classes as thorough
Agree: I feel proud that I have accomplished things.

Negative factors are as follows: 

Agree: I enjoy working in situations involving competition with others.
Agree: Your intelligence is something about you that you can't change very much.
Thinking about your school: I feel like an outsider (or left out of things) at s

Strong factors are as follows: 

Thinking about your school: I feel like an outsider (or left out of things) at s
Agree: I feel proud that I have accomplished things.
Agree: Your 