# MOOC Data Processing

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 05/19/2025   | Martin | Create  | Notebook created for MOOC data processing. Loaded in data | 
| 05/22/2025   | Martin | Update  | Started processing EdX, Coursera, Udemy dataset | 
| 05/23/2025   | Martin | Update  | Completed Dataset 1 preprocessing. Started preprocessing 15,000 Coursera courses dataset | 
| 05/27/2025   | Martin | Update  | Processed alison, futurelearn and harvard datasets. Left coursera and pluralsight datasets from Dataset 2 | 
| 05/28/2025   | Martin | Update  | Completed processing dataset 2 and 3, started on dataset 4 | 
| 05/29/2025   | Martin | Update  | Completed data processing for MOOC dataset | 
| 05/30/2025   | Martin | Fix  | Some NaN values inside lists in udemy data caused errors when trying to import the combined dataset during analysis. Fixed by removing those nan values | 

# Content

* [Load Data](#load-data)
* [Preprocessing Dataset 1](#preprocessing-dataset-1)
* [Preprocessing Dataset 2](#preprocessing-dataset-2)
* [Preprocessing Dataset 3](#preprocessing-dataset-3)
* [Preprocessing Dataset 4](#preprocessing-dataset-4)
* [Putting Them Together](#putting-them-together)

# Load Data

Loading various selected datasets from our collection of MOOC datasets. Here we outline the preprocessing steps applied to each dataset such that they can be combined to one large dataset.

1. __EdX, Coursera, and Udemy Course Data__ - Serves as the base dataset that all other datasets will combine to. Select relevant columns
2. __Dataset of 15,00 Courses__ - Datasets are separated into individual sheets that each have their own preprocessing steps which is explained in detail below

In [1]:
import pandas as pd
import numpy as np
import re

from langdetect import detect
from iso639 import Lang

In [2]:
# Dataset 1 - Edx, Coursera, Udemy
df1 = pd.read_json("data/mooc/EdX, Coursera, and Udemy Course Data/combined_dataset.json")

# Preprocessing Dataset 1

Dataset 1: EdX, Coursera, and Udemy Courses

The following preprocessing steps were undertaken:

1. Select relevant columns
2. Select first organisation in list 
3. Set "No rating" to None
4. Process description column
5. Process reviews column
6. Replace empty lists with nan in skills column
7. Process level column
8. Simplify type column to have 2 values
9. Replace negative enrollment values with positive ones
10. Add premium column

__1. Select relevant columns__

This dataset serves as the base for other datasets, so we select only columns that are relevant for our analysis

In [3]:
# Preprocessing Dataset 1
# 1. Select only the relevant columns
df1_cols = [
  'type',
  'course_name',
  'organization',
  'rating',
  'description',
  'skills',
  'level',
  'Duration',
  'reviews',
  'enrollments',
  'subject',
  'provider'
]
df1 = df1[df1_cols]

__2. Select first organisation in list__

The `organization` column is a list of organisations that created the course. Only a small number of courses have multiple organisations that participated in curating the course (2.1%). Therefore, we assume that the first organisation in the list is the main provider and use that as the primary organisation

In [4]:
# 2. Take only the first organization
df1['organization'] = df1['organization'].str[0]

__3. Set No rating to None__

`rating` column contains both "No rating" and `NoneType`. Make it consistent as `NoneType`

In [5]:
# 3. Change "No rating" to None
df1.loc[df1['rating'] == 'No rating', 'rating'] = None

__4. Process description column__

First we remove the leading "Description: " and trailing newlines on some columns. Then apply language detection to see if the course was conducted in english. We use the description language here to determine this because it seems like a reasonable proxy.

In [6]:
# Remove Description: and \n
df1['description'] = df1['description'].str.removeprefix('Description: ').str.rstrip('\n')

# Create an is_english column that indicates if course is conducted in english
# (1 = english, 0 = not english)
df1['language'] = df1['description'].apply(lambda x: detect(x))

__5. Process reviews column__

We replace empty lists with `np.nan`. Then split the populated reviews into the 2 separate columns, `reviews_comments` and `reviews_stars` representing the comments and stars portion of each review. 

We also create 2 new columns that represent the mean star ratings and the number of reviews left on each course. This new count column replaces the original `nu_reviews` column which was suppose to represent the same metric, because it was inconsistent with the `reviews` column.

In [7]:
def process_reviews(val: list) -> (list, float):
  reviews = []
  stars = []
  for rev in val:
    reviews.append(rev['comment'])
    stars.append(rev['stars'])
  return reviews, stars

# Replace empty lists with nan
df1.loc[df1['reviews'].str.len() == 0, 'reviews'] = np.nan

# Create 2 empty columns to separate reviews and the average star rating
df1[['reviews_comments', 'reviews_stars']] = np.nan, np.nan

# Fill values if they contain comments
expanded = df1.loc[~df1['reviews'].isna(), 'reviews'].apply(lambda x: pd.Series(process_reviews(x)))
expanded.columns = ['reviews_comments', 'reviews_stars']
# reset the index for proper assignment by position
df1.loc[~df1['reviews'].isna(), ['reviews_comments', 'reviews_stars']] = expanded.reset_index(drop=True)

# Create column for average review score
df1['reviews_avg_stars'] = df1['reviews_stars'].apply(np.mean)

# Create column for number of reviews
df1['num_reviews'] = df1['reviews_stars'].str.len()

 list(['Estaría bien que se pudiera obtener certificación de la universidad', 'me parecio un excelente curso, pero sobre todas las cosas me dio una seguridad inmensa  la hora de desarrollarme en el area de emergencias que tan frecuente se hace esta tematica', 'Curso practico para conocer sobre la patología del ICTUS y el método RACE para la evaluación y conocer si estamos ante una posible lesion de vasos cerebrales afectados', 'Excelente curso, es dinámico, fácilmente entendible y aún más con los casos clínicos mostrados, me voy con una idea bastante buena de que es y como evaluar un ictus agudo sobretodo en emergencias', 'Excelente aproximación al ACV de manera prehospitalaria, la escala RACE una excelente herramienta y el curso es muy práctico. Recomendado para el área de la salud!', 'Un curso bastante completo sobre las enfermedades cerebrovasculares. Completamente recomendado para quienes quieren saber más sobre el tema.', 'Es muy bueno el contenido y la organización de este curso,

__6. Replace empty lists with nan in skills column__

Some lists only contain NaN values in them, we replace these with actual `np.nan` values

In [8]:
mask = ["NaN" in l for l in df1['skills']]
df1.loc[mask, 'skills'] = np.nan

__7. Process level column__

`level` column contains a mixture of string and list type variables. To unify the data types, lists that contain multiple items are replaced by the string "Mixed". All "mixed" values are also converted to "Mixed" for consistency.

The new `level` column contains only "Beginner", "Intermediate", "Advanced" and "Mixed" values.

In [9]:
# Replace "mixed" with "Mixed" values in column
df1.loc[df1['level'] == 'mixed', 'level'] = 'Mixed'

# Replace list type with string type, those with multiple levels are replaced as "Mixed"
levels_list = df1[df1['level'].apply(lambda x: isinstance(x, list))]['level']
levels_list = levels_list.apply(lambda x: x[0] if len(x) == 1 else "Mixed")
df1.loc[levels_list.index, 'level'] = levels_list

__8. Simplify type column to have 2 values__

The `type` column contains mainly 2 values: "courses" and "projects" taken from the Coursera classification. To simplify the data, we will set all datasets to follow this categorisation, although some other online programs from different providers might have different definitions.

In [10]:
df1 = df1[df1['type'] != 'unknown']
type_map = {'Program': 'course'}
df1['type'] = df1['type'].apply(lambda x: type_map[x] if x in type_map else x)

__9. Replace negative enrollment values with positive ones__

We make the assumption that the negative values in the `enrollment` column might have been a data entry error, so we replace negative values with positive values.

In [11]:
df1['enrollments'] = np.where(df1['enrollments'] < 0, -df1['enrollments'], df1['enrollments'])

__10. Add premium column__

Add a column to indicate if the course is premium or free

In [12]:
df1['premium'] = np.nan

In [13]:
# Housekeeping
df1 = df1.fillna(np.nan)
df1.columns = [x.lower() for x in df1.columns]

In [14]:
df1.head()

Unnamed: 0,type,course_name,organization,rating,description,skills,level,duration,reviews,enrollments,subject,provider,language,reviews_comments,reviews_stars,reviews_avg_stars,num_reviews,premium
0,course,AWS Lambda إنشاء صورة مصغرة بإستخدام السيرفرل...,Coursera Project Network,,هذا المشروع التفاعلي -إنشاء صورة مصغرة بإستخدا...,"[AWS Identity And Access Management (IAM), Clo...",Intermediate,2.0,,,,coursera,ar,,,,,
1,course,Assisting Public Sector Decision Makers With ...,University of Michigan,4.8,Develop data analysis skills that support publ...,"[Simulations, Statistical Analysis, Predictive...",Intermediate,16.0,[{'comment': 'This course was very good at get...,,,coursera,en,[wonderful],[5],5.0,1.0,
2,course,Advanced Strategies for Sustainable Business,University of Colorado Boulder,,This course focuses on integrating sustainabil...,"[Circular Economy, Sustainable Business, Stake...",Beginner,6.0,,,,coursera,en,,,,,
3,course,Applying Machine Learning to Your Data with G...,Google Cloud,,"Dans ce cours, nous définirons ce qu'est le ma...",,Beginner,10.0,,,,coursera,fr,,,,,
4,project,Automate Blog Advertisements with Zapier,Coursera Project Network,,Zapier is the industry leader in task automati...,"[Advertising, Social Media, Blogging, Marketing]",Intermediate,2.0,"[{'comment': 'wonderful', 'stars': 5}]",,,coursera,en,"[Very good way of teaching., Good, Good]","[5, 5, 5]",5.0,3.0,


---

# Preprocessing Dataset 2

Dataset 2: 15,000 Coursera Courses

Because this dataset contains a wide variety of individual CSV files, each one will have slightly different preprocessing steps

In [15]:
base = "data/mooc/dataset-of-15000-coursera-courses"

__Alison dataset__

1. Take the average for duration
2. Change all values in `type` column to "course" to match base dataframe
3. Convert vales in `category` column to list and clean strings
4. Create additional columns to match base dataframe

In [16]:
# Load alison dataset
alison = pd.read_csv(f"{base}/alison.csv")

# Select and remap columns
alison = alison[['Name Of The Course ', 'Institute', 'Duration', 'Number of learners', 'Skills', 'Type', 'Category']]
col_map = {
  'Name Of The Course ': 'course_name',
  'Institute': 'organization',
  'Duration': 'duration',
  'Number of learners': 'enrollments',
  'Skills': 'description',
  'Type': 'type',
  'Category': 'subject'
}
alison = alison.rename(col_map, axis=1)

# Set duration to the the average value
durations = alison['duration'].str.extractall(r'(\d+)\s*-\s*(\d+)')
durations = durations.astype(int)
alison['duration'] = durations.mean(axis=1).reset_index(drop=True)

# Set all type values to "course"
alison['type'] = "course"

# Format Category column 
def format_subject(val: str) -> list:
  # Replace special character with space
  val = val.replace("-", " ")

  # Make capitalise the first letter of each word
  val = val.title()

  # Replace business and management with combined categorjoy
  if val == 'Business' or val == 'management':
    val = 'Business & Management'
  return [val]

alison['subject'] = alison['subject'].apply(lambda x: format_subject(x))

alison['enrollments'] = alison['enrollments'].str.replace(',','').astype(float)

# Add additional columns
alison['provider'] = "alison"
alison['language'] = alison['description'].apply(lambda x: detect(x))

In [17]:
alison.head()

Unnamed: 0,course_name,organization,duration,enrollments,description,type,subject,provider,language
0,bendrasis duomenu apsaugos reglamentas bdar,Advance Learning - Human Resources (HR),2.5,815.0,Paaiškinkite Bendrojo duomenų apsaugos reglame...,course,[Business & Management],alison,lt
1,internal auditing information security managem...,Exoexcellence Consultants,2.5,581.0,Recognize the significant considerations regar...,course,[Information Technology],alison,en
2,anti money laundering and customer verificatio...,Training Facility UK,3.5,7710.0,Define money laundering and give a brief overv...,course,[Business & Management],alison,en
3,comptia cloud advanced,Workforce Academy Partnership,3.5,23385.0,"Define testing and tools, Describe security m...",course,[Information Technology],alison,en
4,itil 4 fundamentals essentials of it service m...,Exoexcellence Consultants,2.5,83244.0,Define the term ‘ITIL framework’ and describe ...,course,[Information Technology],alison,en


__FutureLearn dataset__

1. Extract the numeric value from `Review` column for number of reviews
2. Multiply `Duration` with `Spend time per week` to get the full course duration
3. Remap `Category` based on which courses are premium

In [18]:
# Load futurelearn dataset
fl = pd.read_csv(f"{base}/futurelearn.csv")

# Select columns and rename
fl = fl.drop('Link', axis=1)
fl = fl.rename({
  'Institution': 'organization',
  'Name': 'course_name',
  'Rating': 'rating',
  'Review': 'num_reviews',
  'Duration': 'duration',
  'Category': 'premium',
  'Type': 'type'
}, axis=1)

# Extract numeric value from review
fl['num_reviews'] = fl['num_reviews'].str.extract(r"(\d+)").astype(float)

# Get course full runtime
fl['duration'] = fl['duration'].str.extract(r"(\d+)").astype(float)
fl['Spend time per week'] = fl['Spend time per week'].str.extract(r"(\d+)").astype(float)
fl['duration'] = fl['duration'] * fl['Spend time per week']
fl = fl.drop('Spend time per week', axis=1)

# Remap premium column to indicate which courses must be paid
fl['premium'] = fl['premium'].map({
  'Included in Unlimited': 1,
  'Premium Course': 1,
  'Free digital upgrade': 0
})
fl['premium'] = fl['premium'].fillna(0)

# Convert type to course
fl['type'] = "course"

# Add additional columns
fl['provider'] = "futurelearn"
fl['language'] = fl['course_name'].apply(lambda x: detect(x))

In [19]:
fl.head()

Unnamed: 0,organization,course_name,rating,num_reviews,duration,premium,type,provider,language
0,University of Padova,4 0 shades of digitalisation for the chemical ...,,,20.0,1.0,course,futurelearn,en
1,International Culinary Studio,a beginners guide to basic cooking skills,,,25.0,0.0,course,futurelearn,en
2,BoxPlay & FutureLearn,a beginners guide to data analytics,,,9.0,0.0,course,futurelearn,en
3,Packt & FutureLearn,a beginners guide to data handling and managem...,,,9.0,0.0,course,futurelearn,nl
4,Packt & FutureLearn,a beginners guide to docker,,,6.0,0.0,course,futurelearn,nl


__harvard dataset__

1. Convert values in `subject` to list type
2. Convert values in `Price` to indicator if they are premium or not
3. Convert values in `Duration` to hours

In [20]:
# Load futurelearn dataset
hv = pd.read_csv(f"{base}/Harvard_university.csv")

# Select columns and rename
hv = hv.drop(['Link to course', 'Category link', 'Mode', 'Availability', 'Offered By'], axis=1)
hv = hv.rename({
  'subject ': 'subject',
  'Name': 'course_name',
  'About': 'description',
  'Price': 'premium',
  'Duration': 'duration'
}, axis=1)

# Convert subject values to list type
hv['subject'] = hv['subject'].apply(lambda x: [x])

# Convert values in Price to indicate if it's a premium course
hv['premium'] = np.select([ hv['premium'] == 'Free' ], [0], default=1)

# Convert duration to hours
def convert_str_hours(val: str) -> float:
  if 'week' in val or 'weeks' in val:
    digit = float(re.search(r'(\d+)', val).group(1))
    return 168 * digit
  elif 'day' in val or 'days' in val:
    digit = float(re.search(r'(\d+)', val).group(1))
    return 24 * digit
  else:
    return np.nan

hv['duration'] = hv['duration'].apply(lambda x: convert_str_hours(x))

hv['type'] = "course"

# Add additional columns
hv['provider'] = "harvard online"
hv['language'] = hv['description'].apply(lambda x: detect(x))

In [21]:
hv.head()

Unnamed: 0,subject,course_name,description,premium,duration,type,provider,language
0,[Humanities],PredictionX: Lost Without Longitude,"Explore the history of navigation, from stars ...",0,168.0,course,harvard online,en
1,[Social Sciences],Nonprofit Financial Stewardship Webinar: Intro...,The Introduction to Nonprofit Accounting and F...,0,,course,harvard online,en
2,[Programming],CS50: Introduction to Computer Science,An introduction to the intellectual enterprise...,0,1848.0,course,harvard online,en
3,[Social Sciences],Public Leadership Credential,"Developed by Harvard Kennedy School faculty, t...",1,,course,harvard online,en
4,[Health & Medicine],Cognitive Fitness,This online course from Harvard Health Publish...,1,336.0,course,harvard online,en


__PluralSight dataset__

1. Change `Type` column values to match base dataset
2. Convert `Duration` column to hours

In [22]:
# Load dataset
ps = pd.read_csv(f"{base}/pluralsight.csv")

# Drop the nan row
ps = ps.dropna(axis=0, how='all')

# Select and rename columns
ps = ps[['Type', 'Name', 'Level', 'Duration', 'Rating']]
ps = ps.rename({
  'Type': 'type',
  'Name': 'course_name',
  'Level': 'level',
  'Duration': 'duration',
  'Rating': 'num_reviews'
}, axis=1)

# Change values in type column
type_map = {'labs': 'project', 'courses': 'course'}
ps['type'] = ps['type'].apply(lambda x: type_map[x] if x in type_map else x)

# Convert Duration
def ps_duration(val: str) -> float:
  # Extract all numbers from string
  extracted_number = [int(num) for num in re.findall(r'\d+', val)]

  # Process those with hours and minutes differently from those with just minutes
  if len(extracted_number) == 1:
    return round(extracted_number[0] / 60, 2)
  else:
    return extracted_number[0] + round(extracted_number[1] / 60, 2)

ps['duration'] = ps['duration'].apply(ps_duration)

# Add additional columns
ps['provider'] = "pluralsight"
ps['language'] = ps['course_name'].apply(lambda x: detect(x))

In [23]:
ps.head()

Unnamed: 0,type,course_name,level,duration,num_reviews,provider,language
0,course,Oracle Database 12c Fundamentals,Beginner,3.7,357.0,pluralsight,ca
1,course,Oracle Database 12c Disaster Recovery and Data...,Intermediate,3.43,65.0,pluralsight,en
2,course,Oracle Database 12c: Installation and Upgrade,Beginner,2.78,152.0,pluralsight,en
3,course,Microsoft Azure Solutions Architect: Design fo...,Advanced,1.17,35.0,pluralsight,en
4,course,Microsoft Azure Solutions Architect: Implement...,Advanced,1.28,35.0,pluralsight,en


---

# Preprocessing Dataset 3

Dataset 3: Udacity Courses

Contains a small collection of Udacity courses. The following are the processing steps taken:

1. Create the `premium` column
2. Make `type` column consistent
3. Capitalise level column
4. Convert duration to hours
5. Make skills column into lists
6. Take the first value as organization value
7. Add the additional columns

In [24]:
# Load data
udacity = pd.read_csv('data/mooc/udacity-course-catalog/all_courses.csv')

In [25]:
# Rename and select columns
udacity = udacity.drop(['Prerequisites', 'URL'], axis=1)
udacity = udacity.rename({
  'Title': 'course_name',
  'Type': 'type',
  'Description': 'description',
  'Level': 'level',
  'Duration': 'duration',
  'Rating': 'rating',
  'Review Count': 'num_reviews',
  'Skills Covered': 'skills',
  'Affiliates': 'organization'
}, axis=1)

__1. Create the premium column__

The original `Type` column indicates both the type of course is being run but also which courses are free. Using this column, we derive the `premium` column that indicates which courses are offered for free.

In [26]:
# Indicate which courses are premium based on the type column
udacity['premium'] = np.select([udacity['type'] == 'free'], [1], 0)

__2. Make type column consistent__

To align the dataset with the base dataset, we make all entries "course" since there are not "project" types.

In [27]:
# convert all types to courses
udacity['type'] = 'course'

__3. Capitalise level column__

Capitalise the first letter for the `level` column

In [28]:
# Capitalise the first letter for level column
udacity['level'] = udacity['level'].str.title()

__4. Convert duration to hours__

Again to make the duration column consistent with the base dataset, we convert all instances to hours

In [29]:
# Convert duration to hours
def udacity_duration(val: str) -> float:
  try:
    extracted_number = [int(num) for num in re.findall(r'\d+', val)]
  except Exception:
    return val
  
  if 'Hour' in val:
    return extracted_number[0]
  elif 'Day' in val or 'Days' in val:
    return round(extracted_number[0] * 24, 2)
  elif 'Week' in val or 'Weeks' in val:
    return round(extracted_number[0] * 168, 2)
  else:
    return round(extracted_number[0] * 730, 2)

udacity['duration'] = udacity['duration'].apply(udacity_duration)

__5. Make skills column into lists__

The `skills` column is already nicely formatted as a comma separated variable, so we just convert them to a list datatype

In [30]:
# Convert skills to list
udacity['skills'] = udacity['skills'].str.split(',')

__6. Take the first value as organization value__

Similar to the base dataset, we take the first value in the comma separated string in the `organization` column. The same assumption applies here, where the first value is assumed to be the main organisation

In [31]:
# Take first item in list as the organisation
temp = udacity['organization'].str.split(',')
udacity['organization'] = temp.apply(lambda x: x[0] if type(x) == list else x)

__7. Add the additional columns__

Add the remaining columns to match the base dataset

In [32]:
# Add additional columns
udacity['provider'] = "udacity"
udacity['language'] = udacity['description'].apply(lambda x: detect(x) if type(x) == str else x)

In [33]:
udacity.head()

Unnamed: 0,course_name,type,description,level,duration,rating,num_reviews,skills,organization,premium,provider,language
0,Data Engineering with AWS,course,"Learn to design data models, build data wareho...",Intermediate,2920.0,4.6,1802.0,"[AWS Glue, Amazon S3, AWS Data Warehouse, ...",,0,udacity,en
1,Product Manager,course,Envision and execute the development of indust...,Beginner,2920.0,4.7,864.0,"[Product Strategy, Product Design, Product D...",,0,udacity,en
2,C++,course,Get hands-on experience building five real-wor...,Intermediate,2920.0,4.5,1126.0,"[Data Structures & Algorithms, Memory Managem...",,0,udacity,en
3,Business Analytics,course,Gain foundational data skills like analyzing d...,Beginner,2190.0,4.8,2649.0,"[Excel & Spreadsheets, SQL, Data Visualizati...",Mode,0,udacity,en
4,Data Scientist,course,"Build effective machine learning models, run d...",Advanced,2920.0,4.7,1212.0,"[Machine Learning, Deep Learning, Software E...",Bertelsmann,0,udacity,en


---

In [34]:
test = pd.concat([df1, alison, fl, hv, ps, udacity], axis=0)

# Preprocessing Dataset 4

Dataset 4 - Udemy Courses

A very comprehensive dataset containing thousands of courses from the Udemy platform. It is split into 2 dataframes which contains the course information (Course_info.csv) and user comments (Comments.csv). The processing steps will merge the 2 dataframes and transform them to match the base dataset.

1. Subset data
2. Combine comments and stars
3. Select and rename columns
4. Remap boolean values to integer on is_paid column
5. Processing rating column
6. Convert content_length to hours
7. Change subject and skills to lists
8. Remap language to ISO-639 shorthand
9. Add remaining columns and housekeeping

In [35]:
# Load both datasets
udemy = pd.read_csv("data/mooc/udemy-courses/Course_info.csv")
udemy_comments = pd.read_csv("data/mooc/udemy-courses/Comments.csv")

__1. Subset data__

The udemy dataset contains 200,000+ courses, which is much larger than all the other dataset. To not oversaturate the dataset with a single provider, we randomly select 15,000 courses from the platform to be added. This portion is seeded to ensure repeatability.

In [36]:
# Take a subset of the udemy data
np.random.seed(42)
NUM_ENTRIES = 15_000

selected = np.random.choice(range(udemy.shape[0]), size=NUM_ENTRIES)
udemy = udemy.iloc[selected].reset_index(drop=True)

__2. Combine comments and stars__

We merge the comments and stars from the Comments.csv with the course information based on the `course_id`. Multiple comments and stars are stored as list, matching the base dataframe's format.

In [37]:
# Combine comments and stars the have the same id into a list
udemy_comments = udemy_comments.dropna(subset=['rate', 'comment'])
comments = udemy_comments.groupby('course_id')['comment'].apply(list).reset_index(name='comments')
stars = udemy_comments.groupby('course_id')['rate'].apply(list).reset_index(name='stars')

# Convert id column in udemy to int
udemy['id'] = udemy['id'].astype(int)

# Merge the comments and stars dataframes to the main udemy dataframe
udemy = udemy.merge(
  comments,
  left_on='id',
  right_on='course_id',
  how='left'
)

udemy = udemy.merge(
  stars,
  left_on='id',
  right_on='course_id',
  how='left'
)

__3. Select and rename columns__

In [38]:
# Select and rename columns
udemy = udemy.drop([
  'id',
  'price',
  'num_reviews',
  'num_comments',
  'num_lectures',
  'published_time',
  'last_update_date',
  'subcategory',
  'course_url',
  'instructor_name',
  'instructor_url',
  'course_id_x',
  'course_id_y'
], axis=1)
udemy = udemy.rename({
  'title': 'course_name',
  'is_paid': 'premium',
  'headline': 'description',
  'num_subscribers': 'enrollments',
  'avg_rating': 'rating',
  'content_length_min': 'duration',
  'category': 'subject',
  'topic': 'skills',
}, axis=1)

__4. Remap boolean values to integer on is_paid column__

In [39]:
# Remap premium to 1s and 0s
udemy['premium'] = udemy['premium'].astype(int)

__5. Processing rating column__

To maintain consistency with the base dataframe, values that do not have ratings (0.0) are replaced with nan, then all values are rounded to 2 decimal places

In [40]:
# Replace 0 ratings with nan and round values to 2 decimal places
udemy['rating'] = round(udemy['rating'].replace(0, np.nan), 2)

__6. Convert content_length to hours__

In [41]:
# Convert content_length to hours
udemy['duration'] = round(udemy['duration'] / 60, 2)

__7. Change subject and skills to lists__

In [42]:
# Change data type of subject and skills to lists
udemy['subject'] = udemy['subject'].apply(lambda x: [x] if not pd.isna(x) else x)
udemy['skills'] = udemy['skills'].apply(lambda x: [x] if not pd.isna(x) else x)

__8. Remap language to ISO-639 shorthand__

The base dataset used the `detect` function which returns the language detected in ISO-639 format. To maintain consistency, we convert the full name of languages to the same format

In [43]:
# Convert language to iso shorthands
mapper = {
  'Greek': 'el',
  'Azeri': 'az',
  'Simplified Chinese': 'zh-cn',
  'Traditional Chinese': 'zh-tw'
}

def udemy_language(val: str, mapper: dict) -> str:
  # Try to use package to convert, else use the manual mapper
  try:
    return Lang(val).pt1
  except Exception:
    return mapper[val]

udemy['language'] = udemy['language'].apply(lambda x: udemy_language(x, mapper))

__9. Add remaining columns and housekeeping__

In [44]:
# Add additional columns
udemy['provider'] = "udemy"
udemy['type'] = "course"
udemy['num_reviews'] = udemy['comments'].str.len()
udemy['reviews_avg_stars'] = round(udemy['stars'].map(np.mean), 2)

# Housekeeping
udemy = udemy.rename({
  'comments': 'reviews_comments',
  'stars': 'reviews_stars'
}, axis=1)

In [45]:
udemy.head()

Unnamed: 0,course_name,premium,description,enrollments,rating,duration,subject,skills,language,reviews_comments,reviews_stars,provider,type,num_reviews,reviews_avg_stars
0,كيفية تطوير استراتيجية المحتوى:دليل من البدايه...,0,كل ما تحتاج لفهم انواع تطوير المحتوى التسويقي ...,1406.0,4.25,1.98,[Marketing],[Marketing Strategy],ar,[جزاك الله خير],[5.0],udemy,course,1.0,5.0
1,MATLAB 4 Everyone,1,Learning MATLAB was never this easy. Lead the ...,1.0,,2.73,[IT & Software],[MATLAB],en,,,udemy,course,,
2,Learn How To Become a Nurse Practitioner:,0,Do you have what it takes?,715.0,4.0,1.47,[Personal Development],[Nursing],en,[Thank you for your great lessons.],[4.5],udemy,course,1.0,4.5
3,The Complete Duolingo English Test Success Cou...,1,الدليل الشامل لاختبار دوولينجو الدولي للغة الإ...,411.0,3.85,0.52,[Teaching & Academics],[English Language],ar,"[It was Excellent, Yes, it was a good match fo...","[5.0, 4.0, 1.5, 5.0]",udemy,course,4.0,3.88
4,Basteln & Malen für Beginner & Kids: 3 tolle B...,1,Basteln und Malen lernen mit Styrodur: 3 einfa...,16.0,5.0,3.47,[Lifestyle],[Art for Kids],de,[Habe mit meiner Tochter zusammen viel Spaß be...,[5.0],udemy,course,1.0,5.0


---

# Putting Them Together

Combine the output into a single large dataframe to be used for analysis

In [46]:
# Concatenate all processed dataframes
df = pd.concat([df1, alison, fl, hv, ps, udacity, udemy], axis=0).reset_index(drop=True)

# Adjust the dtypes for some columns
df = df.astype({
  'rating': float,
  'enrollments': float,
})

In [47]:
df.head()

Unnamed: 0,type,course_name,organization,rating,description,skills,level,duration,reviews,enrollments,subject,provider,language,reviews_comments,reviews_stars,reviews_avg_stars,num_reviews,premium
0,course,AWS Lambda إنشاء صورة مصغرة بإستخدام السيرفرل...,Coursera Project Network,,هذا المشروع التفاعلي -إنشاء صورة مصغرة بإستخدا...,"[AWS Identity And Access Management (IAM), Clo...",Intermediate,2.0,,,,coursera,ar,,,,,
1,course,Assisting Public Sector Decision Makers With ...,University of Michigan,4.8,Develop data analysis skills that support publ...,"[Simulations, Statistical Analysis, Predictive...",Intermediate,16.0,[{'comment': 'This course was very good at get...,,,coursera,en,[wonderful],[5],5.0,1.0,
2,course,Advanced Strategies for Sustainable Business,University of Colorado Boulder,,This course focuses on integrating sustainabil...,"[Circular Economy, Sustainable Business, Stake...",Beginner,6.0,,,,coursera,en,,,,,
3,course,Applying Machine Learning to Your Data with G...,Google Cloud,,"Dans ce cours, nous définirons ce qu'est le ma...",,Beginner,10.0,,,,coursera,fr,,,,,
4,project,Automate Blog Advertisements with Zapier,Coursera Project Network,,Zapier is the industry leader in task automati...,"[Advertising, Social Media, Blogging, Marketing]",Intermediate,2.0,"[{'comment': 'wonderful', 'stars': 5}]",,,coursera,en,"[Very good way of teaching., Good, Good]","[5, 5, 5]",5.0,3.0,


In [48]:
df.to_csv('combined_mooc.csv', index=False)