Python & R script of randomForest analysis using PISA2018 dataset

huni1023/PISA2018-RandomForest

PISA 2018 - RandomForest

This repository contains Python and R scripts for running a random forest analysis on PISA 2018 data.
The analysis focuses on students in South Korea and the U.S. The dependent variable is academic resilience, which is related to academic achievement and ESCS (economic, social, and cultural status).
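
This README does not spell out how academic resilience is derived. As a rough illustration only, the sketch below assumes a quartile rule: a student counts as resilient when their ESCS is in the bottom quarter and their achievement score is in the top quarter of the same group. The function name `resilience_labels`, the quartile cutoffs, and the sample values are all assumptions made for this example; the rule used by the scripts in this repository may differ.

```python
from statistics import quantiles


def resilience_labels(escs, scores):
    """Label each student 1 (resilient) or 0 (not resilient).

    Assumed rule, not taken from this repository:
    ESCS at or below the 25th percentile AND
    achievement at or above the 75th percentile.
    """
    escs_q1 = quantiles(escs, n=4)[0]      # 25th percentile of ESCS
    score_q3 = quantiles(scores, n=4)[2]   # 75th percentile of achievement
    return [int(e <= escs_q1 and s >= score_q3) for e, s in zip(escs, scores)]


# Made-up values for eight students, for illustration only.
escs = [-1.5, -1.2, -0.4, 0.0, 0.3, 0.6, 0.9, 1.4]
scores = [560, 410, 480, 500, 520, 450, 590, 540]
labels = resilience_labels(escs, scores)
print(labels)
```

Only the first student is both disadvantaged and high-achieving under this rule, so only that student is labeled 1. A real analysis would compute the cutoffs per country and take survey weights into account.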

Prerequisite

  • program versions
    • Python 3.8.18
    • R 4.2.2
  • requirements
    • install the dependency packages
      • pip install -r requirements.txt
    • PISA 2018 dataset (student, school, and teacher files only), which you can download from here (PISA2018 database)
    • write the list of variables into codebook.xlsx. [Note: PISA Codebook], [Note: see also codebook(sample).xlsx]
    • this project is developed as a module, so run the shell command below to add the project directory to the Python path
      • for PowerShell,
        ./init_env.ps1
        
      • on other systems, add the repository directory to sys.path
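
On systems without PowerShell, the same effect can be achieved from Python. A minimal sketch, assuming the interpreter is started from the repository root; `add_repo_to_path` is a hypothetical helper written for this example and is not part of the project:

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir):
    """Put repo_dir at the front of sys.path unless it is already listed."""
    resolved = str(Path(repo_dir).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Assumption: the current working directory is the repository root.
repo_root = add_repo_to_path(".")
print(repo_root)
```

Setting the `PYTHONPATH` environment variable to the repository directory before starting Python is an alternative that avoids touching `sys.path` in code.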

Run Analysis

1. Load and Explore data

  • this part is done in Python
  • open a shell in the repository directory
  • unzip the data file, slice it, and convert it to pickle
python main.py --load
  • preprocess and explore a single plausible value (PV); the --visualize argument is optional
python main.py --eda --PV 1 --visualize
  • preprocess and explore all PVs
  • after running this command, you get 10 Excel files and a set of visualization results
python main.py --eda --loop --visualization
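
To make the flag combinations above easier to follow, here is a hypothetical `argparse` sketch. It is not the repository's actual `main.py`. The flag names come from the commands above; how they are wired together, the helper `pvs_to_process`, and the choice to accept both `--visualize` and `--visualization` as the same switch are assumptions.

```python
import argparse

N_PVS = 10  # PISA 2018 provides 10 plausible values (PVs) per domain


def build_parser():
    """Hypothetical parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--load", action="store_true", help="unzip, slice, pickle")
    parser.add_argument("--eda", action="store_true", help="preprocess and explore")
    parser.add_argument("--PV", type=int, choices=range(1, N_PVS + 1), help="single PV")
    parser.add_argument("--loop", action="store_true", help="process every PV")
    # Assumption: both spellings seen above map to the same switch.
    parser.add_argument("--visualize", "--visualization", dest="visualize",
                        action="store_true", help="also write plots")
    return parser


def pvs_to_process(args):
    """Return the PV indices that an --eda run would cover."""
    if not args.eda:
        return []
    if args.loop:
        return list(range(1, N_PVS + 1))
    if args.PV is None:
        raise SystemExit("--eda needs either --PV N or --loop")
    return [args.PV]


args = build_parser().parse_args(["--eda", "--PV", "1", "--visualize"])
print(pvs_to_process(args), args.visualize)
```

In this sketch `--loop` takes precedence over `--PV`, which matches the two documented invocations but is otherwise a design guess.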

2. Run RandomForest analysis

  • this part is done by the R script Analysis.r
  • the script mainly implements the following random forest (RF) runs:
    1. run RF once for one PV
    2. run RF 5 times for one PV
    3. run RF once for each of the 10 PVs
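
The three run modes listed above differ only in which PVs are fitted and how often. The sketch below enumerates them in Python rather than R, purely as an illustration; `run_plan`, the mode names, and the idea that repeated runs differ by random seed are assumptions and do not come from Analysis.r.

```python
def run_plan(mode, pv=1, n_repeats=5, n_pvs=10, base_seed=0):
    """Return (pv, seed) pairs, one per random forest fit.

    mode: "single"   -> 1 fit on one PV
          "repeated" -> n_repeats fits on one PV, each with its own seed
          "all_pvs"  -> 1 fit on each of the n_pvs PVs
    """
    if mode == "single":
        return [(pv, base_seed)]
    if mode == "repeated":
        return [(pv, base_seed + i) for i in range(n_repeats)]
    if mode == "all_pvs":
        return [(p, base_seed) for p in range(1, n_pvs + 1)]
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("single", "repeated", "all_pvs"):
    print(mode, len(run_plan(mode)))
```

Each pair would correspond to one fitted model, so the three modes produce 1, 5, and 10 models respectively.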

Expected Result

  • descriptive statistics
  • confusion matrix of each RF model
  • variable importance plot
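
For readers unfamiliar with confusion matrices, the following sketch shows how one is tallied for a two-class outcome. It uses plain Python and made-up labels; the helper names and the layout (rows are actual classes, columns are predicted classes) are choices made for this example, and the R scripts produce their own output.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Share of cases that fall on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels for illustration only: 1 = resilient, 0 = not resilient.
actual = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]
matrix = confusion_matrix(actual, predicted, labels=[0, 1])
print(matrix, accuracy(matrix))
```

Here 4 non-resilient and 2 resilient students are classified correctly, and one student of each class is misclassified, giving an accuracy of 0.75.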

Sample result

- comparison of the 10 analysis results
