# Data Exploration and Preparation for Coursera Dashboard


### Load CSV and Preview First 5 Rows

In [15]:
# Import the pandas library for data manipulation
import pandas as pd

# Load the CSV file into a DataFrame
path = "C:/Users/SANDRINE/Downloads/Cousera Courses Metadata for Analytics 2025/archive/courses_en.csv"
df = pd.read_csv(path)

# Display the first 5 rows to check the data
df.head()


Unnamed: 0,url,name,category,what_you_learn,skills,language,instructors,content
0,https://www.coursera.org/learn/-network-security,Network Security,Information Technology,,"Computer Networking, Network Planning And Desi...",English,['~31081695'],Welcome to course 4 of 5 of this Specializatio...
1,https://www.coursera.org/learn/-security-princ...,Security Principles,Information Technology,,"Cyber Security Policies, Data Integrity, Cyber...",English,['~31081695'],Welcome to course 1 of 5 of this Specializatio...
2,https://www.coursera.org/learn/21st-century-en...,21st Century Energy Transition: how do we make...,Physical Science And Engineering,Understand the complexity of systems supplying...,"Electric Power Systems, Environmental Policy, ...",English,['brad-hayes'],NOTE: “21 st Century Energy Transition – How d...
3,https://www.coursera.org/learn/360-vr-video-pr...,VR and 360 Video Production,Arts And Humanities,,"Virtual Reality, Videography, Media Production...",English,['googlearvr'],Welcome to the Google AR & VR Virtual Reality ...
4,https://www.coursera.org/learn/3d-anatomy-phys...,Foundations of Human Anatomy and Physiology,Health,Learners will understand how body structure su...,"Vital Signs, Basic Patient Care, Anatomy, Heal...",English,"['~167016541', '~166856472', '~166856442', '~1...",This course provides a foundational understand...


### Exploring the Coursera Courses Metadata Dataset

In [16]:
# Display general information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5411 entries, 0 to 5410
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   url             5411 non-null   object
 1   name            5411 non-null   object
 2   category        5411 non-null   object
 3   what_you_learn  3005 non-null   object
 4   skills          5411 non-null   object
 5   language        5411 non-null   object
 6   instructors     5411 non-null   object
 7   content         5411 non-null   object
dtypes: object(8)
memory usage: 338.3+ KB


In [17]:
# Check for missing values in each column
df.isnull().sum()

url                  0
name                 0
category             0
what_you_learn    2406
skills               0
language             0
instructors          0
content              0
dtype: int64
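
The counts above can also be read as shares: 2406 of 5411 rows (about 44.5%) have no `what_you_learn` value. A minimal sketch of that calculation, using a made-up `toy` DataFrame in place of the real `df`:

```python
import pandas as pd

# Hypothetical stand-in for the course table; values are invented for illustration
toy = pd.DataFrame({
    "url": ["a", "b", "c", "d"],
    "what_you_learn": ["Learn X", None, None, "Learn Y"],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)
```

Running the same expression on `df` should show which columns need a fallback (for example an empty string) before they reach the dashboard.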

In [18]:
# Count unique values for each column 
df.nunique()


url               5411
name              5370
category            11
what_you_learn    2979
skills            5408
language             1
instructors       2593
content           5380
dtype: int64
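
Note that `url` has 5411 unique values but `name` only 5370, so some course names are shared by several URLs. One way to list them is sketched below on a made-up `toy` frame (the column names match the dataset; the rows are invented):

```python
import pandas as pd

# Hypothetical rows: two different URLs share the same course name
toy = pd.DataFrame({
    "url": ["u1", "u2", "u3"],
    "name": ["Intro to SQL", "Intro to SQL", "Network Security"],
})

# keep=False flags every member of a duplicated group, not only the repeats
shared_names = toy[toy.duplicated("name", keep=False)].sort_values("name")
print(shared_names)
```

This is also why grouping by `url` rather than `name` is the safer choice for a per-course summary.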

In [19]:
# Show a random sample of 5 rows to inspect data variety
df.sample(5)

Unnamed: 0,url,name,category,what_you_learn,skills,language,instructors,content
3231,https://www.coursera.org/learn/mandarin-chinese-1,Mandarin Chinese 1: Chinese for Beginners,Language Learning,,"Oral Comprehension, Cultural Sensitivity, Gram...",English,"['wangjun', 'an-na']",Mandarin Chinese 1: Chinese for beginners is a...
874,https://www.coursera.org/learn/clinicalsimulat...,Essentials in Clinical Simulations Across the ...,Health,,"Patient Education And Counseling, Clinical Ass...",English,"['kristina-dreifuerst', 'crystel-farina', 'pam...",This 7-week course provides you with key strat...
5248,https://www.coursera.org/learn/virtual-school,Foundations of Virtual Instruction,Social Sciences,,"Learning Management Systems, End User Training...",English,['cindyc'],Welcome to Foundations of Virtual Instruction!...
3085,https://www.coursera.org/learn/lesson-small-ta...,Lesson | Small Talk & Conversational Vocabulary,Language Learning,,"Interpersonal Communications, Communication, V...",English,['amaliabstephens'],"This lesson is part of a full course, Speak En..."
3545,https://www.coursera.org/learn/neurobiology,Understanding the Brain: The Neurobiology of E...,Health,,"Coordination, Communication Systems, Biology, ...",English,['~4454153'],Learn how the nervous system produces behavior...


### Exploding the `skills` Column

In this step, we **explode the `skills` column** so that each row corresponds to **a single skill**.  
In the raw data, each course lists all of its skills in one comma-separated string.

In [20]:
# Split the 'skills' string into a list using comma as separator
df['skills'] = df['skills'].str.split(',')

# Create one row per skill
df = df.explode('skills')

# Remove extra spaces around each skill
df['skills'] = df['skills'].str.strip()

# Remove empty strings
df = df[df['skills'] != '']

# Display the first 5 rows
df.head()



Unnamed: 0,url,name,category,what_you_learn,skills,language,instructors,content
0,https://www.coursera.org/learn/-network-security,Network Security,Information Technology,,Computer Networking,English,['~31081695'],Welcome to course 4 of 5 of this Specializatio...
0,https://www.coursera.org/learn/-network-security,Network Security,Information Technology,,Network Planning And Design,English,['~31081695'],Welcome to course 4 of 5 of this Specializatio...
0,https://www.coursera.org/learn/-network-security,Network Security,Information Technology,,TCP/IP,English,['~31081695'],Welcome to course 4 of 5 of this Specializatio...
0,https://www.coursera.org/learn/-network-security,Network Security,Information Technology,,Threat Detection,English,['~31081695'],Welcome to course 4 of 5 of this Specializatio...
0,https://www.coursera.org/learn/-network-security,Network Security,Information Technology,,Cyber Attacks,English,['~31081695'],Welcome to course 4 of 5 of this Specializatio...
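
As the output shows, `explode` repeats the original index label (here `0`) for every row derived from the same course. The sketch below reproduces the split, explode, strip, and filter sequence on a made-up `toy` frame and then resets the index; whether a unique index is needed depends on later steps, so treat the reset as optional:

```python
import pandas as pd

# Hypothetical input: one comma-separated skills string per course
toy = pd.DataFrame({
    "url": ["u1", "u2"],
    "skills": ["Python, SQL ,", "Statistics"],
})

# Split into lists, then create one row per list element
exploded = toy.assign(skills=toy["skills"].str.split(",")).explode("skills")

# Trim whitespace and drop the empty string left by the trailing comma
exploded["skills"] = exploded["skills"].str.strip()
exploded = exploded[exploded["skills"] != ""]

# Index labels still repeat per source row at this point
print(exploded.index.tolist())

# Optional: give every row its own index label
exploded = exploded.reset_index(drop=True)
print(exploded)
```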


### Cleaning and Exploding the `instructors` Column

The `instructors` column stores one or more instructor identifiers per course as a list-like string, for example `['wangjun', 'an-na']`. The cleaning steps are:

1. **Remove the `~` prefix** from instructor identifiers.  
2. **Remove brackets and quotes** left over from the list-like string.  
3. **Split** values on commas into a list when multiple instructors are present.  
4. **Explode** the column so each instructor appears on a separate row.  
5. **Trim spaces** for clean and consistent formatting.  
6. **Remove empty values** after cleaning.



In [21]:
# Clean and explode the 'instructors' column

# 1. Remove the "~" prefix
df['instructors'] = df['instructors'].str.replace('~', '', regex=False)

# 2. Remove brackets [] and quotes '  (in case the values look like lists)
df['instructors'] = df['instructors'].str.replace(r"[\[\]']", "", regex=True)

# 3. Split multiple instructors into a list (comma-separated)
df['instructors'] = df['instructors'].str.split(',')

# 4. Explode the list
df_instructors = df.explode('instructors')

# 5. Remove leading/trailing spaces
df_instructors['instructors'] = df_instructors['instructors'].str.strip()

# 6. Remove empty values
df_instructors = df_instructors[df_instructors['instructors'] != ""]
df = df_instructors

# Show the first 5 rows
df.head()




Unnamed: 0,url,name,category,what_you_learn,skills,language,instructors,content
0,https://www.coursera.org/learn/-network-security,Network Security,Information Technology,,Computer Networking,English,31081695,Welcome to course 4 of 5 of this Specializatio...
0,https://www.coursera.org/learn/-network-security,Network Security,Information Technology,,Network Planning And Design,English,31081695,Welcome to course 4 of 5 of this Specializatio...
0,https://www.coursera.org/learn/-network-security,Network Security,Information Technology,,TCP/IP,English,31081695,Welcome to course 4 of 5 of this Specializatio...
0,https://www.coursera.org/learn/-network-security,Network Security,Information Technology,,Threat Detection,English,31081695,Welcome to course 4 of 5 of this Specializatio...
0,https://www.coursera.org/learn/-network-security,Network Security,Information Technology,,Cyber Attacks,English,31081695,Welcome to course 4 of 5 of this Specializatio...
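
Because the raw values look like Python list literals, an alternative to the regex cleanup is to parse them with `ast.literal_eval` from the standard library. This is only a sketch on a made-up `toy` frame, and it assumes every value is a well-formed list literal; the regex approach used above is more tolerant of malformed strings:

```python
import ast
import pandas as pd

# Hypothetical rows shaped like the raw 'instructors' column
toy = pd.DataFrame({
    "url": ["u1", "u2"],
    "instructors": ["['~31081695']", "['wangjun', 'an-na']"],
})

# Parse each list-like string into a real Python list
toy["instructors"] = toy["instructors"].apply(ast.literal_eval)

# One row per instructor, then drop the leading "~" where present
by_instructor = toy.explode("instructors")
by_instructor["instructors"] = by_instructor["instructors"].str.lstrip("~")
print(by_instructor)
```

Parsing avoids splitting on commas altogether, which would matter if an identifier ever contained one.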


In [22]:
# Grouping by course (url)
df_summary = df.groupby('url').agg(
    name=('name', 'first'),  # keep the course name
    category=('category', 'first'),  # keep the category
    language=('language', 'first'),  # keep the language
    num_unique_skills=('skills', 'nunique'),  # count unique skills
    num_unique_instructors=('instructors', 'nunique'),  # count unique instructors
    what_you_learn=('what_you_learn', 'first'),  # keep 'what you will learn' as is for now
    content=('content', 'first')  # keep the content as is
).reset_index()

# Display the first rows of the summarized dataframe
df_summary.head()


Unnamed: 0,url,name,category,language,num_unique_skills,num_unique_instructors,what_you_learn,content
0,https://www.coursera.org/learn/-network-security,Network Security,Information Technology,English,15,1,,Welcome to course 4 of 5 of this Specializatio...
1,https://www.coursera.org/learn/-security-princ...,Security Principles,Information Technology,English,11,1,,Welcome to course 1 of 5 of this Specializatio...
2,https://www.coursera.org/learn/21st-century-en...,21st Century Energy Transition: how do we make...,Physical Science And Engineering,English,14,1,Understand the complexity of systems supplying...,NOTE: “21 st Century Energy Transition – How d...
3,https://www.coursera.org/learn/360-vr-video-pr...,VR and 360 Video Production,Arts And Humanities,English,9,1,,Welcome to the Google AR & VR Virtual Reality ...
4,https://www.coursera.org/learn/3d-anatomy-phys...,Foundations of Human Anatomy and Physiology,Health,English,11,4,Learners will understand how body structure su...,This course provides a foundational understand...


In [23]:
# Check the structure of the summarized DataFrame (one row per course)
df_summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5411 entries, 0 to 5410
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   url                     5411 non-null   object
 1   name                    5411 non-null   object
 2   category                5411 non-null   object
 3   language                5411 non-null   object
 4   num_unique_skills       5411 non-null   int64 
 5   num_unique_instructors  5411 non-null   int64 
 6   what_you_learn          3005 non-null   object
 7   content                 5411 non-null   object
dtypes: int64(2), object(6)
memory usage: 338.3+ KB
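
After both explodes, `df` holds one row per (course, skill, instructor) combination, so a plain row count would overstate both measures; `nunique` is what keeps the summary correct. A small sketch with made-up data (`exploded` and its values are hypothetical) that also checks the one-row-per-course property:

```python
import pandas as pd

# Hypothetical exploded table: course u1 has 2 skills x 2 instructors = 4 rows
exploded = pd.DataFrame({
    "url": ["u1", "u1", "u1", "u1", "u2"],
    "skills": ["Python", "Python", "SQL", "SQL", "Statistics"],
    "instructors": ["a", "b", "a", "b", "c"],
})

summary = exploded.groupby("url").agg(
    num_unique_skills=("skills", "nunique"),
    num_unique_instructors=("instructors", "nunique"),
).reset_index()

# Sanity checks: exactly one summary row per course URL
assert summary["url"].is_unique
assert len(summary) == exploded["url"].nunique()
print(summary)
```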


In [24]:
# Keep only the course URL and skill columns
df_skills = df[['url', 'skills']]

# Display the first 5 rows
df_skills.head()


Unnamed: 0,url,skills
0,https://www.coursera.org/learn/-network-security,Computer Networking
0,https://www.coursera.org/learn/-network-security,Network Planning And Design
0,https://www.coursera.org/learn/-network-security,TCP/IP
0,https://www.coursera.org/learn/-network-security,Threat Detection
0,https://www.coursera.org/learn/-network-security,Cyber Attacks


In [25]:
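# Keep only the course URL and instructor columns
# (rows repeat because df has one row per skill-instructor pair)
df_instructors = df[['url', 'instructors']]

# Display the first 5 rows
df_instructors.head()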

Unnamed: 0,url,instructors
0,https://www.coursera.org/learn/-network-security,31081695
0,https://www.coursera.org/learn/-network-security,31081695
0,https://www.coursera.org/learn/-network-security,31081695
0,https://www.coursera.org/learn/-network-security,31081695
0,https://www.coursera.org/learn/-network-security,31081695
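
The repeated instructor rows above come from the earlier skill explode: every (url, instructor) pair appears once per skill, and every (url, skill) pair once per instructor. If the dashboard expects one row per pair, dropping duplicates before export is one option. A minimal sketch on made-up data (`exploded`, `skills_lookup`, and `instructors_lookup` are hypothetical names):

```python
import pandas as pd

# Hypothetical exploded table: course u1 has 2 skills x 2 instructors = 4 rows
exploded = pd.DataFrame({
    "url": ["u1", "u1", "u1", "u1", "u2"],
    "skills": ["Python", "Python", "SQL", "SQL", "Statistics"],
    "instructors": ["a", "b", "a", "b", "c"],
})

# Keep each (url, skill) pair once
skills_lookup = (
    exploded[["url", "skills"]].drop_duplicates().reset_index(drop=True)
)

# Keep each (url, instructor) pair once
instructors_lookup = (
    exploded[["url", "instructors"]].drop_duplicates().reset_index(drop=True)
)

print(skills_lookup)
print(instructors_lookup)
```

Without this step, counts built from the exported lookup tables would be inflated by the number of skills or instructors per course.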


In [26]:

# Export the three tables to CSV without the index column
df_summary.to_csv("coursera_data.csv", index=False)
df_skills.to_csv("coursera_skills.csv", index=False)
df_instructors.to_csv("coursera_instructors.csv", index=False)