# Data Analytics for MOOC Providers

#### Created by [Miguel Ballesteros](mailto:mabm1e15@soton.ac.uk)

### *Notebook list*

1. [Introduction](index.ipynb)
2. [The Downloaded CSV Files](downloaded_data.ipynb)
3. **[Data Processing Pipeline](data_pipeline.ipynb)**
4. [Clean Data Exploration](clean_data_exploration.ipynb)
5. [Understanding MOOC behaviors](understanding_mooc.ipynb)
6. [Predicting Dropouts](predicting_dropouts.ipynb)
7. [Conclusions](conclusions.ipynb)
8. [Future Work](future_work.ipynb)

### *In this notebook...*
1. [The Data Pipeline](#The-Data-Pipeline)
2. [Pipeline Execution](#Pipeline-Execution)
3. [Run and Monitor](#Run-and-Monitor)

## The Data Pipeline
Many authors and companies have converged on a very similar logical sequence of data-processing steps. Although the sequence is broadly the same, individual steps are emphasized or broken down further depending on the goal pursued. The generic steps followed in this project are shown in the following image.

#### *General Data Pipeline*
<img src="Images/general_pipeline.png">

For this project, the data is provided by FutureLearn, which collects it while the website is in use and then combines the different logs to produce the final CSV files, compliant with the Data Privacy Act. The *Collect* step therefore does not apply. The *Communicate* step is also not part of the project, since the deliverables include only methodological documentation and no formal business report. However, where applicable, some recommendations are given to improve the data collection step and increase the potential of the data for future course runs. The following image shows the steps covered in this project.

#### *FutureLearn Data Pipeline*
<img src="Images/fl_pipeline.png">


## Pipeline Execution
The implementation of the data pipeline in R consists of a main file that calls the necessary methods for the steps to be executed, and a separate file holding the operations of each step. In addition, a *00_config.r* file contains name-value pairs for the configuration settings as well as the log initialization.

#### *Execution Sequence*
<img src="Images/execution_sequence.png">
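The structure described above can be illustrated with a minimal sketch. It is not the project's actual code: the switch names (`run_load`, `run_clean`, `run_transform`) and the stand-in step functions are hypothetical, chosen only to show how a main file might call each step according to TRUE/FALSE settings.

```r
# Hypothetical stage switches, in the spirit of the name-value pairs in 00_config.r.
# These names are illustrative assumptions, not the project's real settings.
config <- list(
  run_load      = TRUE,
  run_clean     = TRUE,
  run_transform = FALSE
)

# Stand-ins for the per-step files; each function represents one pipeline step.
steps <- list(
  run_load      = function() "load done",
  run_clean     = function() "clean done",
  run_transform = function() "transform done"
)

# Main driver: call only the steps whose switch is TRUE, in the declared order,
# and return the names of the steps that were executed.
run_pipeline <- function(config, steps) {
  executed <- character(0)
  for (name in names(steps)) {
    if (isTRUE(config[[name]])) {
      steps[[name]]()
      executed <- c(executed, name)
    }
  }
  executed
}

executed <- run_pipeline(config, steps)
print(executed)
```

Keeping the switches in one list means a single step can be re-run by toggling one value, without editing the driver itself.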



## Run and Monitor

As described before, the pipeline that produces the transformed dataset with baseline data for the models can be executed by selecting the desired stages (TRUE/FALSE) in the config file and then sourcing the file *00_main.r*. **Please ensure that the 00_main.r file has the right values for each execution step before running the following cell, and that the package *log4r* is installed on this machine.**
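One way to check the *log4r* prerequisite before starting a long run is sketched below. The helper name `package_available` is hypothetical; `requireNamespace` is the base R call that tests whether a package can be loaded.

```r
# Returns TRUE when the named package can be loaded on this machine.
# `package_available` is a hypothetical helper name used for illustration.
package_available <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Warn early instead of failing part-way through the pipeline.
if (!package_available("log4r")) {
  message("log4r is not installed; run install.packages(\"log4r\") first.")
}
```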

**No progress is displayed within the notebook during the execution, but the text *"(busy)"* is shown in the Kernel status and the page title. This status may last several minutes, until the process completes. To verify the last executed step, check the log file in the project's root folder.**
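Since the kernel stays busy during the run, the log is easiest to inspect from outside the running notebook, for example from a separate R session. The following is a minimal sketch that assumes the log is a plain text file; the file name `pipeline.log` and the helper `tail_log` are hypothetical and should be replaced with the actual log file in the project's root folder.

```r
# Return the last `n` lines of a text log file.
tail_log <- function(path, n = 10) {
  if (!file.exists(path)) {
    stop("Log file not found: ", path)
  }
  utils::tail(readLines(path, warn = FALSE), n)
}

# Hypothetical file name; replace it with the real log file of the project.
log_path <- "pipeline.log"
if (file.exists(log_path)) {
  print(tail_log(log_path, n = 5))
}
```

The last `START` entry without a matching `END` entry indicates the step that is currently running.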

```r
print(paste(date(), "Running..."))
source("00_main.r")
```

[1] "INFO ***** MAIN_EXECUTION START *****"
[1] "INFO - START - SECTION - Load_Downloaded_Data"
[1] "INFO - START - load_from_csv_course_list"
[1] "DEBUG - READING - Local CSV ./Data_Downloaded/course-list.csv"
[1] "DEBUG - COMPLETE - Local CSV ./Data_Downloaded/course-list.csv - Elapsed: 0 s"
[1] "INFO - END - load_from_csv_course_list - Elapsed: 0.0299999999999727 s"
[1] "INFO - START - load_from_csv_course_details"
[1] "DEBUG - READING - Local CSV ./Data_Downloaded/course-details.csv"
[1] "DEBUG - COMPLETE - Local CSV ./Data_Downloaded/course-details.csv - Elapsed: 0.00999999999999091 s"
[1] "INFO - END - load_from_csv_course_details - Elapsed: 0.0299999999999727 s"
[1] "INFO - START - load_downloaded_comments"
[1] "DEBUG - READING - Downloaded CSV ./Data_Downloaded/research-project-1_comments.csv"
[1] "DEBUG - COMPLETE - Downloaded CSV ./Data_Downloaded/research-project-1_comments.csv - Elapsed: 0.420000000000073 s"
[1] "DEBUG - READING - Downloaded CSV ./Data_Downloaded/research-p