04. Importing, Joining, & Altering the Datasets

Importing the Dataset into RStudio

After downloading the data (here), it must be imported into RStudio for exploration.

Associated Code

The code to import, merge, and clean the dataset can be found (here)

Step 1: Import the data

There are two datasets used in this analysis:

HarvardX - MITx student course data This dataset includes anonymized student-level data of students who registered for one of thirteen HarvardX (n = 5) or MITx (n = 7) courses between Fall 2012 and Summer 2013.
Course reference data This dataset includes reference details about the course, including institution, course code, course title, and semester. At this point, we delete the field 'institution' from the reference data to avoid duplication in the left join below.

Step 2: Create new fields

New fields must be created within the student course data to allow joining of the two datasets. Specifically, the course code is needed as the primary key between the datasets. The course code is embedded in the course id in the student level data. Regular Expressions were used to parse the course_id into institution, course_code, year_term, year, and term. You can learn more about RegEx, as well as test code, here.

Other fields that are useful in the analysis were also created. These include converting the grade into letter_grade and creating indicator variables for nevents, ndays_act, nplay_video, nchapters, and nforum_posts.

Step 3: Joining the datasets

Once the new fields are created it is possible to perform a simple left join on `course_code.

Step 4: Other alterations

Other required alterations to this dataset will control for:

Duplicate cases
Outliers
Null values

Create field to assign `letter_grade`

Translating the grade decimal into a letter grade provides for easier reporting.

A grade >= 90
B grade >=80 and grade <90
C grade >= 70 and grade < 80
D grade >= 60 and grade < 70
F grade < 60

Indicator variables to check for activity as a binary field

The following fields were transformed to the latter in the pair based on absence/presence of events, activities, video plays, chapters read, or forum posts.

nevents -> nevents_ind
ndays_act -> ndays_act_ind
nplay_video -> nplay_video_ind
nchapters -> nchapters_ind
nforum_posts -> nforum_posts_ind

Binning

Bins will be useful to address fields with wide or long tail distributions.

nevents
ndays_act
nplay_video
nchapters
nforum_posts
start_time_DI -> start_time_ym - the datestamp configuration of YYYY-MM-DD was truncated to YYYY-MM to allow for cleaner analysis. Some level of detail was lost, however, the trend remains consistent.
last_event_DI -> last_event_ym - the datestamp configuration of YYYY-MM-DD was truncated to YYYY-MM to allow for cleaner analysis. Some level of detail was lost, however, the trend remains consistent.

Null Values

Null values (set as NA) are currently being addressed in individual ggplots contained in the Exploratory Data Analysis code located here

Duplicate Cases

Outliers

Learning R by Doing - A Learning Experiment in RStudio and GitHub

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

04. Importing, Joining, & Altering the Datasets

Importing the Dataset into RStudio

Associated Code

Step 1: Import the data

Step 2: Create new fields

Step 3: Joining the datasets

Step 4: Other alterations

Create field to assign `letter_grade`

Indicator variables to check for activity as a binary field

Binning

Null Values

Duplicate Cases

Outliers

Clone this wiki locally

04. Importing, Joining, & Altering the Datasets

Importing the Dataset into RStudio

Associated Code

Step 1: Import the data

Step 2: Create new fields

Step 3: Joining the datasets

Step 4: Other alterations

Create field to assign letter_grade

Indicator variables to check for activity as a binary field

Binning

Null Values

Duplicate Cases

Outliers

Clone this wiki locally

Create field to assign `letter_grade`