Skip to content

04. Importing, Joining, & Altering the Datasets

Matthew Hendrickson edited this page Apr 8, 2018 · 6 revisions

Importing the Dataset into RStudio

After downloading the data (here), it must be imported into RStudio for exploration.

Associated Code

The code to import, merge, and clean the dataset can be found (here)

Step 1: Import the data

There are two datasets used in this analysis:

  1. HarvardX - MITx student course data This dataset includes anonymized student-level data of students who registered for one of thirteen HarvardX (n = 5) or MITx (n = 7) courses between Fall 2012 and Summer 2013.
  2. Course reference data This dataset includes reference details about the course, including institution, course code, course title, and semester. At this point, we delete the field 'institution' from the reference data to avoid duplication in the left join below.

Step 2: Create new fields

New fields must be created within the student course data to allow joining of the two datasets. Specifically, the course code is needed as the primary key between the datasets. The course code is embedded in the course id in the student level data. Regular Expressions were used to parse the course_id into institution, course_code, year_term, year, and term. You can learn more about RegEx, as well as test code, here.

Other fields that are useful in the analysis were also created. These include converting the grade into letter_grade and creating indicator variables for nevents, ndays_act, nplay_video, nchapters, and nforum_posts.

Step 3: Joining the datasets

Once the new fields are created it is possible to perform a simple left join on `course_code.

Step 4: Other alterations

Other required alterations to this dataset will control for:

  1. Duplicate cases
  2. Outliers
  3. Null values

Create field to assign letter_grade

Translating the grade decimal into a letter grade provides for easier reporting.

  • A grade >= 90
  • B grade >=80 and grade <90
  • C grade >= 70 and grade < 80
  • D grade >= 60 and grade < 70
  • F grade < 60

Indicator variables to check for activity as a binary field

The following fields were transformed to the latter in the pair based on absence/presence of events, activities, video plays, chapters read, or forum posts.

  1. nevents -> nevents_ind
  2. ndays_act -> ndays_act_ind
  3. nplay_video -> nplay_video_ind
  4. nchapters -> nchapters_ind
  5. nforum_posts -> nforum_posts_ind

Binning

Bins will be useful to address fields with wide or long tail distributions.

  1. nevents
  2. ndays_act
  3. nplay_video
  4. nchapters
  5. nforum_posts
  6. start_time_DI -> start_time_ym - the datestamp configuration of YYYY-MM-DD was truncated to YYYY-MM to allow for cleaner analysis. Some level of detail was lost, however, the trend remains consistent.
  7. last_event_DI -> last_event_ym - the datestamp configuration of YYYY-MM-DD was truncated to YYYY-MM to allow for cleaner analysis. Some level of detail was lost, however, the trend remains consistent.

Null Values

Null values (set as NA) are currently being addressed in individual ggplots contained in the Exploratory Data Analysis code located here

Duplicate Cases

Outliers