# MOOC Data Processing

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 05/19/2025   | Martin | Create  | Notebook created for MOOC data processing. Loaded in data | 
| 05/22/2025   | Martin | Update  | Started processing EdX, Coursera, Udemy dataset | 

# Content

* [Load Data](#load-data)
* [Preprocessing Dataset 1](#preprocessing-dataset-1)

# Load Data

Loading various selected datasets from our collection of MOOC datasets. Here we outline the preprocessing steps applied to each dataset such that they can be combined to one large dataset.

1. __EdX, Coursera, and Udemy Course Data__ - Serves as the base dataset that all other datasets will combine to. Select relevant columns
2. __Dataset of 15,00 Courses__ - Datasets are separated into individual sheets that each have their own preprocessing steps which is explained in detail below

In [93]:
import pandas as pd
import numpy as np

from langdetect import detect

In [204]:
# Dataset 1 - Edx, Coursera, Udemy
df1 = pd.read_json("data/mooc/EdX, Coursera, and Udemy Course Data/combined_dataset.json")

# Preprocessing Dataset 1

The following preprocessing steps were undertaken:

1. Select relevant columns
2. Select first organisation in list 
3. Set "No rating" to None
4. Process description column
5. Process reviews column

__1. Select relevant columns__

This dataset serves as the base for other datasets, so we select only columns that are relevant for our analysis

In [205]:
# Preprocessing Dataset 1
# 1. Select only the relevant columns
df1_cols = [
  'type',
  'course_name',
  'organization',
  'instructor',
  'rating',
  'description',
  'skills',
  'level',
  'Duration',
  'reviews',
  'enrollments',
  'subject',
  'provider'
]
df1 = df1[df1_cols]

__2. Select first organisation in list__

The `organization` column is a list of organisations that created the course. Only a small number of courses have multiple organisations that participated in curating the course (2.1%). Therefore, we assume that the first organisation in the list is the main provider and use that as the primary organisation

In [206]:
# 2. Take only the first organization
df1['organization'] = df1['organization'].str[0]

__3. Set No rating to None__

`rating` column contains both "No rating" and `NoneType`. Make it consistent as `NoneType`

In [207]:
# 3. Change "No rating" to None
df1.loc[df1['rating'] == 'No rating', 'rating'] = None

__4. Process description column__

First we remove the leading "Description: " and trailing newlines on some columns. Then apply language detection to see if the course was conducted in english. We use the description language here to determine this because it seems like a reasonable proxy.

In [208]:
# Remove Description: and \n
df1['description'] = df1['description'].str.removeprefix('Description: ').str.rstrip('\n')

# Create an is_english column that indicates if course is conducted in english
# (1 = english, 0 = not english)
df1['is_english'] = df1['description'].apply(lambda x: 1 if detect(x) == 'en' else 0)

__5. Process reviews column__

We replace empty lists with `np.nan`. Then split the populated reviews into the 2 separate columns, `reviews_comments` and `reviews_stars` representing the comments and stars portion of each review. 

We also create 2 new columns that represent the mean star ratings and the number of reviews left on each course. This new count column replaces the original `nu_reviews` column which was suppose to represent the same metric, because it was inconsistent with the `reviews` column.

In [209]:
def process_reviews(val: list) -> (list, float):
  reviews = []
  stars = []
  for rev in val:
    reviews.append(rev['comment'])
    stars.append(rev['stars'])
  return reviews, stars

# Replace empty lists with nan
df1.loc[df1['reviews'].str.len() == 0, 'reviews'] = np.nan

# Create 2 empty columns to separate reviews and the average star rating
df1[['reviews_comment', 'reviews_stars']] = np.nan, np.nan

# Fill values if they contain comments
expanded = df1.loc[~df1['reviews'].isna(), 'reviews'].apply(lambda x: pd.Series(process_reviews(x)))
expanded.columns = ['reviews_comment', 'reviews_stars']
# reset the index for proper assignment by position
df1.loc[~df1['reviews'].isna(), ['reviews_comment', 'reviews_stars']] = expanded.reset_index(drop=True)

# Create column for average review score
df1['reviews_avg_stars'] = df1['reviews_stars'].apply(np.mean)

# Create column for number of reviews
df1['num_reviews'] = df1['reviews_stars'].str.len()

 list(['Estaría bien que se pudiera obtener certificación de la universidad', 'me parecio un excelente curso, pero sobre todas las cosas me dio una seguridad inmensa  la hora de desarrollarme en el area de emergencias que tan frecuente se hace esta tematica', 'Curso practico para conocer sobre la patología del ICTUS y el método RACE para la evaluación y conocer si estamos ante una posible lesion de vasos cerebrales afectados', 'Excelente curso, es dinámico, fácilmente entendible y aún más con los casos clínicos mostrados, me voy con una idea bastante buena de que es y como evaluar un ictus agudo sobretodo en emergencias', 'Excelente aproximación al ACV de manera prehospitalaria, la escala RACE una excelente herramienta y el curso es muy práctico. Recomendado para el área de la salud!', 'Un curso bastante completo sobre las enfermedades cerebrovasculares. Completamente recomendado para quienes quieren saber más sobre el tema.', 'Es muy bueno el contenido y la organización de este curso,

Remove skills that have nan lists

In [210]:
temp = df1.copy()

In [211]:
temp.head()

Unnamed: 0,type,course_name,organization,instructor,rating,description,skills,level,Duration,reviews,enrollments,subject,provider,is_english,reviews_comment,reviews_stars,reviews_avg_stars,num_reviews
0,course,AWS Lambda إنشاء صورة مصغرة بإستخدام السيرفرل...,Coursera Project Network,Omar Fathy,,هذا المشروع التفاعلي -إنشاء صورة مصغرة بإستخدا...,"[AWS Identity And Access Management (IAM), Clo...",Intermediate,2.0,,,,coursera,0,,,,
1,course,Assisting Public Sector Decision Makers With ...,University of Michigan,Christopher Brooks,4.8,Develop data analysis skills that support publ...,"[Simulations, Statistical Analysis, Predictive...",Intermediate,16.0,[{'comment': 'This course was very good at get...,,,coursera,1,[wonderful],[5],5.0,1.0
2,course,Advanced Strategies for Sustainable Business,University of Colorado Boulder,Joel Hartter,,This course focuses on integrating sustainabil...,"[Circular Economy, Sustainable Business, Stake...",Beginner,6.0,,,,coursera,1,,,,
3,course,Applying Machine Learning to Your Data with G...,Google Cloud,Google Cloud Training,,"Dans ce cours, nous définirons ce qu'est le ma...",[NaN],Beginner,10.0,,,,coursera,0,,,,
4,project,Automate Blog Advertisements with Zapier,Coursera Project Network,Carmen Rojas,,Zapier is the industry leader in task automati...,"[Advertising, Social Media, Blogging, Marketing]",Intermediate,2.0,"[{'comment': 'wonderful', 'stars': 5}]",,,coursera,1,"[Very good way of teaching., Good, Good]","[5, 5, 5]",5.0,3.0
