## Data Cleaning for Coursera Course Dataset

### Introduction

This notebook covers data cleaning, the first phase of the project *Data Visualization with Python*. It uses the Coursera Course Dataset, which describes courses offered on the Coursera platform. The dataset includes the following columns:

1. `course_title`: The title of the course.
2. `course_organization`: The organization offering the course.
3. `course_Certificate_type`: The type of certification available for the course.
4. `course_rating`: The rating of the course.
5. `course_difficulty`: The difficulty level of the course.
6. `course_students_enrolled`: The number of students enrolled in the course.

Additionally, the original CSV file begins with an unnamed column of row identifiers, which Pandas loads as `Unnamed: 0`.

Data source and documentation: [Kaggle: Coursera Course Dataset](https://www.kaggle.com/datasets/siddharthm1698/coursera-course-dataset)

### Objectives

The primary objectives of this notebook are:

1. **Loading and Inspecting the Data**: Importing the dataset using Pandas and getting an overview of the dataset by examining its structure and summary statistics.
2. **Cleaning the Data**: Performing necessary data cleaning steps, including:
   - Removing unnecessary columns.
   - Converting data types to appropriate formats.
   - Checking for missing values.
   - Standardizing the format of certain columns (e.g., converting student enrollment numbers to numerical values).

### Data Loading and Initial Inspection

The dataset is loaded from a CSV file using Pandas, and an initial inspection is performed to understand the dataset's structure and contents.

1. **Displaying the First Few Rows**:

In [1]:
import pandas as pd
from convert_num_function import convert_to_number

df = pd.read_csv("coursea_data.csv")
df.head()

| | Unnamed: 0 | course_title | course_organization | course_Certificate_type | course_rating | course_difficulty | course_students_enrolled |
|---|---|---|---|---|---|---|---|
| 0 | 134 | (ISC)² Systems Security Certified Practitioner... | (ISC)² | SPECIALIZATION | 4.7 | Beginner | 5.3k |
| 1 | 743 | A Crash Course in Causality: Inferring Causal... | University of Pennsylvania | COURSE | 4.7 | Intermediate | 17k |
| 2 | 874 | A Crash Course in Data Science | Johns Hopkins University | COURSE | 4.5 | Mixed | 130k |
| 3 | 413 | A Law Student's Toolkit | Yale University | COURSE | 4.7 | Mixed | 91k |
| 4 | 635 | A Life of Happiness and Fulfillment | Indian School of Business | COURSE | 4.8 | Mixed | 320k |


2. **Dropping Unnecessary Columns**: Removing the `Unnamed: 0` column as it is not needed for analysis.

In [2]:
df = df.drop(columns="Unnamed: 0")
df.head()

| | course_title | course_organization | course_Certificate_type | course_rating | course_difficulty | course_students_enrolled |
|---|---|---|---|---|---|---|
| 0 | (ISC)² Systems Security Certified Practitioner... | (ISC)² | SPECIALIZATION | 4.7 | Beginner | 5.3k |
| 1 | A Crash Course in Causality: Inferring Causal... | University of Pennsylvania | COURSE | 4.7 | Intermediate | 17k |
| 2 | A Crash Course in Data Science | Johns Hopkins University | COURSE | 4.5 | Mixed | 130k |
| 3 | A Law Student's Toolkit | Yale University | COURSE | 4.7 | Mixed | 91k |
| 4 | A Life of Happiness and Fulfillment | Indian School of Business | COURSE | 4.8 | Mixed | 320k |
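
Since the dropped column only repeats a row identifier, it could also be kept out of the columns at load time. The following is a minimal sketch, assuming the first CSV column is nothing more than that identifier. It uses a two-row inline sample with made-up titles instead of the real file, and the names `sample_csv` and `sample_df` are illustrative only.

```python
import io

import pandas as pd

# Tiny inline stand-in for the real CSV (assumption: the first,
# unnamed column holds only row identifiers).
sample_csv = io.StringIO(
    ",course_title,course_rating,course_students_enrolled\n"
    "134,Course A,4.7,5.3k\n"
    "743,Course B,4.7,17k\n"
)

# index_col=0 turns the first column into the index, so no "Unnamed: 0"
# column is created; reset_index(drop=True) then discards the identifiers.
sample_df = pd.read_csv(sample_csv, index_col=0).reset_index(drop=True)

print(list(sample_df.columns))
```

Dropping the column after loading, as done in this notebook, gives the same set of columns; the choice is mostly a matter of preference.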


3. **Inspecting Column Types and Non-Null Counts**:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   course_title              891 non-null    object 
 1   course_organization       891 non-null    object 
 2   course_Certificate_type   891 non-null    object 
 3   course_rating             891 non-null    float64
 4   course_difficulty         891 non-null    object 
 5   course_students_enrolled  891 non-null    object 
dtypes: float64(1), object(5)
memory usage: 41.9+ KB
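
The `df.info()` output reports 891 non-null entries in each of the six columns, so this dataset has no missing values to fill or drop. An explicit check might look like the sketch below, which runs on a small made-up frame (`demo`) with one deliberately missing rating so that the result is visible.

```python
import pandas as pd

# Made-up frame with one missing rating, for illustration only.
demo = pd.DataFrame(
    {
        "course_title": ["Course A", "Course B", "Course C"],
        "course_rating": [4.7, None, 4.5],
    }
)

# Count missing values per column; every count is 0 in a complete dataset.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# True only if no cell in the frame is missing.
has_no_missing = not demo.isna().any().any()
print(has_no_missing)
```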


4. **Checking Dataset Shape and Counting Unique Values**:

In [4]:
rows, columns = df.shape
print(f"Number of rows: {rows}, and columns: {columns}\n")

unique_course_title = df["course_title"].nunique()
unique_course_organization = df["course_organization"].nunique()
unique_course_Certificate_type = df["course_Certificate_type"].nunique()
unique_course_difficulty = df["course_difficulty"].nunique()

print(
    f"Count of unique: \n"
    f"course_title: {unique_course_title},\n"
    f"course_organization: {unique_course_organization},\n"
    f"course_Certificate_type: {unique_course_Certificate_type},\n"
    f"course_difficulty: {unique_course_difficulty}"
)

Number of rows: 891, and columns: 6

Count of unique: 
course_title: 888,
course_organization: 154,
course_Certificate_type: 3,
course_difficulty: 4
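
With 891 rows but only 888 unique titles, some titles appear more than once. The sketch below shows one way such rows might be listed. It uses made-up data rather than the real dataset, and the frame and variable names are illustrative.

```python
import pandas as pd

# Made-up rows: "Course A" appears twice under different organizations.
demo = pd.DataFrame(
    {
        "course_title": ["Course A", "Course B", "Course A"],
        "course_organization": ["Org X", "Org Y", "Org Z"],
    }
)

# keep=False marks every occurrence of a repeated title, not just the later ones.
repeated_titles = demo[demo.duplicated(subset="course_title", keep=False)]
print(repeated_titles)

# nunique() on the whole frame returns the per-column unique counts in one call.
unique_counts = demo.nunique()
print(unique_counts)
```

Whether repeated titles are true duplicates or distinct courses that happen to share a name has to be judged from the other columns.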


5. **Displaying Unique Values in `course_Certificate_type` and `course_difficulty` Columns**:

In [5]:
print(
    f"Course certification types: {df['course_Certificate_type'].unique()},\nand courses difficulty: {df['course_difficulty'].unique()}"
)

Course certification types: ['SPECIALIZATION' 'COURSE' 'PROFESSIONAL CERTIFICATE'],
and courses difficulty: ['Beginner' 'Intermediate' 'Mixed' 'Advanced']
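
Beyond listing the distinct values, it can help to see how many rows fall into each one. A minimal sketch using `value_counts()` on made-up rows (the counts below are not taken from the real dataset):

```python
import pandas as pd

# Made-up difficulty labels, for illustration only.
demo = pd.DataFrame(
    {"course_difficulty": ["Beginner", "Mixed", "Beginner", "Advanced"]}
)

# value_counts() returns the number of rows per distinct value,
# sorted from most to least frequent.
difficulty_counts = demo["course_difficulty"].value_counts()
print(difficulty_counts)
```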


6. **Displaying `course_students_enrolled` Column**:

In [9]:
df["course_students_enrolled"]

0      5.3k
1       17k
2      130k
3       91k
4      320k
       ... 
886     52k
887     21k
888     30k
889    9.8k
890     38k
Name: course_students_enrolled, Length: 891, dtype: object

### Converting `course_students_enrolled` to Numeric

The `course_students_enrolled` column contains enrollment numbers in string format with suffixes like 'k' for thousands and 'm' for millions. To facilitate numerical analysis, this column needs to be converted to a numeric format.

1. **Importing the Conversion Function**: The `convert_to_number` function, imported at the top of the notebook from the local `convert_num_function` module, handles the conversion.
2. **Applying Conversion**: Applying the `convert_to_number` function to the `course_students_enrolled` column.
3. **Verifying Conversion**: Displaying the first few rows of the DataFrame to verify the change.

In [2]:
df["course_students_enrolled"] = df["course_students_enrolled"].apply(convert_to_number)
df.head()

| | Unnamed: 0 | course_title | course_organization | course_Certificate_type | course_rating | course_difficulty | course_students_enrolled |
|---|---|---|---|---|---|---|---|
| 0 | 134 | (ISC)² Systems Security Certified Practitioner... | (ISC)² | SPECIALIZATION | 4.7 | Beginner | 5300.0 |
| 1 | 743 | A Crash Course in Causality: Inferring Causal... | University of Pennsylvania | COURSE | 4.7 | Intermediate | 17000.0 |
| 2 | 874 | A Crash Course in Data Science | Johns Hopkins University | COURSE | 4.5 | Mixed | 130000.0 |
| 3 | 413 | A Law Student's Toolkit | Yale University | COURSE | 4.7 | Mixed | 91000.0 |
| 4 | 635 | A Life of Happiness and Fulfillment | Indian School of Business | COURSE | 4.8 | Mixed | 320000.0 |


This conversion ensures that the `course_students_enrolled` column is in a numeric format, enabling accurate numerical analysis and visualization in subsequent steps.
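
The implementation of `convert_to_number` lives in the separate `convert_num_function` module and is not shown in this notebook. The following is only a sketch of how such a helper might be written, under the assumption that values look like `5.3k`, `130k`, or `1.2m`. The name `convert_to_number_sketch` is hypothetical and is used to avoid confusion with the real function.

```python
def convert_to_number_sketch(value):
    """Convert strings such as '5.3k' or '1.2m' to a float.

    Hypothetical stand-in for the notebook's convert_to_number helper.
    """
    multipliers = {"k": 1_000, "m": 1_000_000}
    text = str(value).strip().lower()
    if text and text[-1] in multipliers:
        # Strip the suffix and scale the remaining number.
        return float(text[:-1]) * multipliers[text[-1]]
    # No recognized suffix: interpret the text as a plain number.
    return float(text)


for raw in ["5.3k", "17k", "320k", "1.5m", "850"]:
    print(raw, "->", convert_to_number_sketch(raw))
```

Returning a float matches the `5300.0`-style values shown in the output above. Values that are not numeric at all would raise a `ValueError` in this sketch, which makes unexpected formats easy to spot.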

### Finalizing Data Cleaning

After converting the `course_students_enrolled` column to numeric format, the final steps involve ensuring the data type is consistent and saving the cleaned dataset.

1. **Converting to Integer**: Converting the `course_students_enrolled` column to integer type for consistency and ease of analysis.
2. **Saving Cleaned Dataset**: Saving the cleaned dataset to a new CSV file for future use.


In [7]:
# Cast the float enrollment counts to integers in one vectorized step.
df["course_students_enrolled"] = df["course_students_enrolled"].astype(int)

df.to_csv("coursea_data_cleaned.csv", index=False)

These steps ensure that the dataset is clean, with consistent data types, and is saved for subsequent analysis and visualization tasks.
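
One detail worth keeping in mind: casting a float to an integer truncates toward zero, so a value that ended up slightly below a whole number after the multiplication would lose one unit. A defensive variant, sketched here on made-up values, rounds first and then checks the result after a CSV round trip. An in-memory buffer stands in for the real output file, and all names are illustrative.

```python
import io

import pandas as pd

# Made-up enrollment values; the second one sits just below 17000.
demo = pd.DataFrame({"course_students_enrolled": [5300.0, 16999.999999, 320000.0]})

# Round before casting so 16999.999999 becomes 17000 rather than 16999.
demo["course_students_enrolled"] = (
    demo["course_students_enrolled"].round().astype("int64")
)

# Write to an in-memory buffer (stand-in for a CSV file) and read it back.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.dtypes)
print(reloaded["course_students_enrolled"].tolist())
```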