04. Importing, Joining, & Altering the Datasets
After downloading the data (here), it must be imported into RStudio for exploration.
The code to import, merge, and clean the dataset can be found (here).
There are two datasets used in this analysis:
- HarvardX - MITx student course data: This dataset includes anonymized student-level data of students who registered for one of thirteen HarvardX (n = 5) or MITx (n = 7) courses between Fall 2012 and Summer 2013.
- Course reference data: This dataset includes reference details about the course, including institution, course code, course title, and semester. At this point, we delete the field `institution` from the reference data to avoid duplication in the left join below.
New fields must be created within the student course data to allow joining of the two datasets. Specifically, the course code is needed as the key between the datasets. The course code is embedded in the course id in the student-level data. Regular expressions were used to parse the `course_id` into `institution`, `course_code`, `year_term`, `year`, and `term`. You can learn more about RegEx, as well as test code, here.
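The parsing step can be sketched in base R as follows. This is a minimal illustration, not the project's actual code: the sample ids and the `institution/course_code/year_term` layout of `course_id` are assumptions made for the example.

```r
# Hypothetical sample ids; the "institution/course_code/year_term" layout
# is an assumption for this sketch.
course_id <- c("HarvardX/CB22x/2013_Spring", "MITx/6.002x/2012_Fall")

# Groups 1-3 capture the slash-separated pieces; groups 4 and 5 split
# year_term into a four-digit year and a term name.
pattern <- "^([^/]+)/([^/]+)/(([0-9]{4})_([A-Za-z]+))$"

parsed <- data.frame(
  course_id   = course_id,
  institution = sub(pattern, "\\1", course_id),
  course_code = sub(pattern, "\\2", course_id),
  year_term   = sub(pattern, "\\3", course_id),
  year        = as.integer(sub(pattern, "\\4", course_id)),
  term        = sub(pattern, "\\5", course_id),
  stringsAsFactors = FALSE
)

print(parsed)
```

Using one anchored pattern with several capture groups keeps all five fields consistent with each other, since every field is read from the same match.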
Other fields that are useful in the analysis were also created. These include converting the `grade` into `letter_grade` and creating indicator variables for `nevents`, `ndays_act`, `nplay_video`, `nchapters`, and `nforum_posts`.
Once the new fields are created, it is possible to perform a simple left join on `course_code`.
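A left join of this kind might look like the sketch below, which uses base R's `merge()` with `all.x = TRUE`. The two small data frames, the `student` column, and the placeholder titles are hypothetical stand-ins for the real datasets.

```r
# Hypothetical stand-in for the student course data.
students <- data.frame(
  student     = c("s1", "s2", "s3"),
  course_code = c("CB22x", "6.002x", "XX00x"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the course reference data
# (the institution field has already been dropped).
courses <- data.frame(
  course_code  = c("CB22x", "6.002x"),
  course_title = c("Title A", "Title B"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every student row; rows without a matching
# course_code receive NA in the reference columns.
joined <- merge(students, courses, by = "course_code", all.x = TRUE)

print(joined)
```

Because every student row is kept, comparing `nrow(joined)` with `nrow(students)` is a quick check that the reference data has no duplicated `course_code` values inflating the result.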
Other required alterations to this dataset will control for:
- Duplicate cases
- Outliers
- Null values
Translating the grade decimal into a letter grade provides for easier reporting.
- A: grade >= 90
- B: grade >= 80 and grade < 90
- C: grade >= 70 and grade < 80
- D: grade >= 60 and grade < 70
- F: grade < 60
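The grade bands above can be expressed with `cut()`. This sketch assumes the grade is already on a 0 to 100 scale, as the thresholds suggest; if the raw field is stored as a proportion between 0 and 1, it would need to be multiplied by 100 first. The sample values are made up.

```r
# Made-up grades on a 0-100 scale (an assumption; see the note above).
grade <- c(95, 90, 89.9, 70, 65, 59, NA)

letter_grade <- cut(
  grade,
  breaks = c(-Inf, 60, 70, 80, 90, Inf),
  labels = c("F", "D", "C", "B", "A"),
  right  = FALSE  # intervals are [lower, upper), so exactly 90 maps to "A"
)

print(data.frame(grade, letter_grade))
```

Setting `right = FALSE` is what makes the lower bound of each band inclusive, matching the `>=` comparisons in the list. Missing grades stay `NA` rather than being assigned a letter.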
Each of the following fields was converted into an indicator variable (the second name in each pair) that records the absence or presence of events, activities, video plays, chapters read, or forum posts.
nevents -> nevents_ind
ndays_act -> ndays_act_ind
nplay_video -> nplay_video_ind
nchapters -> nchapters_ind
nforum_posts -> nforum_posts_ind
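One way to build these indicators is sketched below. It assumes that "presence" means a count greater than zero and that missing counts should stay `NA`; both are assumptions for the example, and the sample data frame is hypothetical.

```r
# Hypothetical activity counts for three students.
activity <- data.frame(
  nevents      = c(0, 12, NA),
  nforum_posts = c(0, 0, 3)
)

count_fields <- c("nevents", "nforum_posts")

# For each count field, add a <field>_ind column: 1 if the count is
# positive, 0 if it is zero, NA if the count is missing.
for (field in count_fields) {
  activity[[paste0(field, "_ind")]] <- as.integer(activity[[field]] > 0)
}

print(activity)
```

The same loop extends to `ndays_act`, `nplay_video`, and `nchapters` by adding them to `count_fields`.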
Bins will be useful for fields with wide or long-tailed distributions:
nevents
ndays_act
nplay_video
nchapters
nforum_posts
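As an illustration of binning a long-tailed count, the sketch below groups `nevents` into order-of-magnitude buckets. The bin edges, labels, and the `nevents_bin` name are hypothetical choices for the example, not values taken from the analysis.

```r
# Made-up event counts with a long right tail.
nevents <- c(0, 1, 7, 45, 300, 12000)

# Hypothetical order-of-magnitude bins; each interval is [lower, upper).
nevents_bin <- cut(
  nevents,
  breaks = c(0, 1, 10, 100, 1000, Inf),
  labels = c("0", "1-9", "10-99", "100-999", "1000+"),
  right  = FALSE
)

print(table(nevents_bin))
```

Widening bins by powers of ten keeps the few very active students from stretching the axis while still separating the many students with little or no activity.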
- `start_time_DI` -> `start_time_ym`: the datestamp configuration of `YYYY-MM-DD` was truncated to `YYYY-MM` to allow for cleaner analysis. Some level of detail was lost; however, the trend remains consistent.
- `last_event_DI` -> `last_event_ym`: the datestamp configuration of `YYYY-MM-DD` was truncated to `YYYY-MM` to allow for cleaner analysis. Some level of detail was lost; however, the trend remains consistent.
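The date truncation could be done along these lines. The sample dates are made up, and the sketch assumes the source fields are character strings in `YYYY-MM-DD` form.

```r
# Made-up registration dates, including one missing value.
start_time_DI <- c("2012-12-19", "2013-02-08", NA)

# Parse as Date, then keep only the year and month.
start_time_ym <- format(as.Date(start_time_DI), "%Y-%m")

print(start_time_ym)
```

Going through `as.Date()` instead of trimming the string means a malformed date raises an error or becomes `NA` instead of silently passing through. The same call would apply to `last_event_DI`.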
Null values (set as `NA`) are currently being addressed in individual ggplots contained in the Exploratory Data Analysis code located here.
Learning R by Doing - A Learning Experiment in RStudio and GitHub