# Data Analytics for MOOC Providers

#### Created by [Miguel Ballesteros](mailto:mabm1e15@soton.ac.uk)

### *Notebook list*

1. [Introduction](index.ipynb)
2. **[The Downloaded CSV Files](downloaded_data.ipynb)**
3. [Data Processing Pipeline](data_pipeline.ipynb)
4. [Clean Data Exploration](clean_data_exploration.ipynb)
5. [Understanding MOOC behaviors](understanding_mooc.ipynb)
6. [Predicting Dropouts](predicting_dropouts.ipynb)
7. [Conclusions](conclusions.ipynb)
8. [Future Work](future_work.ipynb)

### *In this notebook...*
1. [The CSV Files](#The-CSV-Files)
2. [Manually Created CSV Files](#Manually-Created-CSV-Files)


## The CSV Files
FutureLearn allows course designers to download the activity data for each course. These files are mainly the result of parsing web server logs, combined with custom collected actions. Activities are separated into a set of predefined categories.

* **Enrolments:** Entries for the participants registered in the course, including some demographic data.
* **Step Activity:** The most important course activity file, containing one record for each course step visited by a participant. This is a consolidated version of the visits with two date fields: the first visited date, which is usually present, and the last completed date, which is set only if the participant chose to mark the step as complete.
* **Comments:** As a social learning platform, FutureLearn encourages interaction between participants in forums. This file contains the forum interactions classified per course, week and step.
* **Peer Review Assignments:** Some courses have a step in which participants are required to complete an assignment that is later reviewed by a peer. This file contains the assignment submitted by the participant along with the data needed to classify it correctly among the courses, weeks and steps.
* **Peer Review Reviews:** The assignment reviews are stored in this separate CSV file, which also references the corresponding entry in the assignments file.
* **Question Response:** Some course steps include a quick set of questions to assess the topic as the participant progresses. This file contains the response attempts for each question along with the outcome (correct/incorrect).
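
As an illustration of how the two date fields in the Step Activity file can be used, the following is a minimal sketch that counts visits and completions per step. The column names (`learner_id`, `step`, `first_visited_at`, `last_completed_at`) and the sample rows are assumptions made for this example and should be checked against the header of the downloaded file.

```python
import csv
import io

# Hypothetical excerpt of a step-activity file; column names and rows are assumed.
SAMPLE = """learner_id,step,first_visited_at,last_completed_at
a1,1.1,2016-01-04 09:00:00 UTC,2016-01-04 09:12:00 UTC
a1,1.2,2016-01-04 09:13:00 UTC,
b2,1.1,2016-01-05 18:30:00 UTC,2016-01-05 18:41:00 UTC
"""


def completion_by_step(csv_text):
    """Return {step: (visits, completions)} from step-activity rows."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        visits, completions = counts.get(row["step"], (0, 0))
        # An empty last_completed_at means the step was visited
        # but never marked as complete by the participant.
        completed = 1 if (row["last_completed_at"] or "").strip() else 0
        counts[row["step"]] = (visits + 1, completions + completed)
    return counts


step_counts = completion_by_step(SAMPLE)
```

With the sample rows above, step `1.1` has two visits and two completions, while step `1.2` has one visit and no completion.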

In general, the file names follow a convention set by FutureLearn. Each file name consists of a short course prefix that includes the run number, followed by the activity type listed above. For example, the course Research Project has the following files for run 1.

<img src="Images/research_project_csvs.png">
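
The naming convention described above can be sketched as a small parser that splits a file name into course prefix, run number and activity type. The exact pattern used here (`<course-prefix>-<run>_<activity-type>.csv`) and the example file name are assumptions for illustration, so the regular expression may need adjusting to the names of the files actually downloaded.

```python
import re

# Assumed pattern: <course-prefix>-<run>_<activity-type>.csv
FILE_NAME = re.compile(r"^(?P<course>.+)-(?P<run>\d+)_(?P<activity>[a-z-]+)\.csv$")


def parse_file_name(name):
    """Return (course, run, activity), or None if the name does not match."""
    match = FILE_NAME.match(name)
    if match is None:
        return None
    return match["course"], int(match["run"]), match["activity"]


# Hypothetical file name following the assumed pattern.
parsed = parse_file_name("research-project-1_step-activity.csv")
```

Returning `None` for names that do not match makes it easy to skip unrelated files when scanning a download folder.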

## Manually Created CSV Files
Following previous work in Learning Analytics, there are three main dimensions to consider in these scenarios: **Participant**, **Course** and **Interaction**. The CSV files provided by FutureLearn give the Interaction dimension enough data to study some patterns and behaviors. The enrolments data is limited in terms of demographics: useful fields are defined, but only a tiny portion of the entries contain relevant values, which leaves the learner ID as the only field supporting the Participant dimension. For the Course dimension there is no data to download, so it has to be created manually. This manual process is not complex, and it significantly increases the range of analyses that can be run on the whole dataset.

The manually created CSV files for course information are designed to be easy to complete with a minimum set of data across two schemas. The course list contains a single entry for each course and is the main reference when running the whole pipeline. The course details file contains one entry per step, so each course has as many entries as it has steps. The idea is to list all steps along with the time slot in which the content is expected to be consumed, the expected time to be spent on each step, and the type of content.
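
The relationship between the two schemas can be sketched as a simple consistency check: every course in the course list should have as many rows in the course details as it has steps. All field names below (`course_id`, `total_steps`, `week`, `expected_minutes`, `content_type`) and the sample values are hypothetical. The real columns are the ones shown in the sample files, and the course list may not carry a step count at all.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Course:
    """One row of the course list (field names are assumed)."""
    course_id: str
    total_steps: int


@dataclass
class StepDetail:
    """One row of the course details (field names are assumed)."""
    course_id: str
    week: int
    step: str
    expected_minutes: int
    content_type: str


def courses_with_missing_steps(courses, details):
    """Return the IDs of courses whose detail rows do not match total_steps."""
    listed = Counter(d.course_id for d in details)
    return [c.course_id for c in courses if listed[c.course_id] != c.total_steps]


# Hypothetical sample data for two courses.
courses = [Course("research-project-1", 3), Course("other-course-1", 2)]
details = [
    StepDetail("research-project-1", 1, "1.1", 5, "video"),
    StepDetail("research-project-1", 1, "1.2", 10, "article"),
    StepDetail("research-project-1", 2, "2.1", 15, "discussion"),
    StepDetail("other-course-1", 1, "1.1", 5, "video"),
]

mismatched = courses_with_missing_steps(courses, details)
```

Here `other-course-1` is reported because only one of its two steps is listed in the details.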

You can download the sample files for [course_list](Data_Downloaded/course-list-sample.txt) and [course_details](Data_Downloaded/course-details-sample.txt), and modify them to match the downloaded data to be analyzed.

#### Course List
[<img src="Images/course_list_csv.png">](Data_Downloaded/course-list-sample.txt)

#### Course Details
[<img src="Images/course_details_csv.png">](Data_Downloaded/course-details-sample.txt)


## Next...
[Data Processing Pipeline](data_pipeline.ipynb)