
mahikkaaa/End_2_End_Data_Science_Project


End to End Data Science Project using the Udemy Dataset

This repository contains data analytics insights from the Udemy courses dataset, which covers four selected major domains.

The dataset is taken from Kaggle: https://www.kaggle.com/andrewmvd/udemy-courses

This dataset includes 3683 courses from Udemy in 4 areas: business finance, graphic design, musical instruments, and web design. Udemy is an online platform for massive open online courses (MOOCs) that offers both free and paid courses. Udemy's business model lets anyone create a course, which is how it has grown to host hundreds of thousands of courses. Online courses and digital learning are becoming increasingly popular, and more students, teachers, and professionals are taking classes online through sites like Udemy and Coursera. This analysis therefore looks at how many people sign up for courses on the Udemy platform and which factors relate to those sign-ups.

From the insights developed, I answer the following questions:

Questions

  • Course Title

    • What are the most frequent words in course titles
    • Longest/Shortest course title
    • How can we build a recommendation system from titles using similarity
    • Most popular courses by number of subscribers
  • Subjects/Category

    • What is the distribution of subjects
    • How many courses per subject
    • Distribution of subjects per year
    • How many people purchase courses in a particular subject
    • Which subject is the most popular
  • Published Year

    • Number of courses per year
    • Which year has the highest number of courses
    • What is the trend of courses per year
  • Levels

    • How many levels do we have
    • What is the distribution of courses per level
    • Which subject has the highest levels
    • How many subscribers per level
    • How many courses per level
  • Duration of Course

    • Which courses have the longest duration (paid or free)
    • Which courses have longer durations
    • Duration vs number of subscribers
  • Subscribers

    • Which course has the highest number of subscribers
    • Average number of subscribers
    • Number of subscribers per Subject
    • Number of subscribers per year
  • Price

    • What is the average price of a course
    • What is the min/max price
    • How much does Udemy earn
    • The most profitable courses
  • Correlation Questions

    • Does the number of subscribers depend on:
      • number of reviews
      • price
      • number of lectures
      • content duration
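
For the question about recommending courses by title similarity, one possible approach is bag-of-words cosine similarity. The sketch below is only an illustration and not necessarily the method used in the notebook: the sample titles and the helper names (`tokenize`, `cosine_similarity`, `recommend`) are hypothetical, and it uses only the Python standard library so it runs without the dataset.

```python
import math
import re
from collections import Counter

# Hypothetical sample titles; in the real analysis these would come
# from the course title column of the Kaggle CSV.
TITLES = [
    "Beginner Piano Lessons",
    "Piano Lessons for Kids",
    "Web Design Basics",
    "Financial Modeling Basics",
]


def tokenize(title):
    """Lowercase a title and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query, titles, top_n=2):
    """Return the top_n titles most similar to query, excluding the query itself."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine_similarity(query_vec, Counter(tokenize(t))), t)
        for t in titles
        if t != query
    ]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _score, title in scored[:top_n]]


print(recommend("Beginner Piano Lessons", TITLES, top_n=1))
```

Raw word counts treat every word equally, so common words can dominate the score; removing stopwords or weighting terms (for example with TF-IDF) is a common refinement.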

The insights that answer all of the above questions were developed with pandas, NumPy, and Matplotlib.
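
As a rough illustration of the correlation questions, the sketch below computes the Pearson correlation coefficient between subscriber counts and two other columns. The rows are made-up sample values, the `pearson` helper is hypothetical, and the column names (`num_subscribers`, `num_reviews`, `price`) are assumed to match the Kaggle CSV. It uses only the standard library; with pandas, `DataFrame.corr()` gives the same kind of result.

```python
import math

# Made-up rows for illustration only; column names are assumed
# to match the Kaggle CSV.
courses = [
    {"num_subscribers": 1000, "num_reviews": 50, "price": 20},
    {"num_subscribers": 2000, "num_reviews": 100, "price": 50},
    {"num_subscribers": 3000, "num_reviews": 150, "price": 20},
    {"num_subscribers": 4000, "num_reviews": 200, "price": 50},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one series is constant
    return cov / math.sqrt(var_x * var_y)


subscribers = [c["num_subscribers"] for c in courses]
for column in ("num_reviews", "price"):
    values = [c[column] for c in courses]
    print(column, round(pearson(subscribers, values), 3))
```

A coefficient near 1 or -1 indicates a strong linear relationship and a value near 0 a weak one; correlation alone does not show that one quantity causes the other.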

I also performed keyword extraction, which includes removing stopwords. A stopword is a word that is automatically omitted from a computer-generated concordance or index. Examples of stopwords in English are "a", "the", "is", and "are". Stopwords are commonly removed in text mining and natural language processing (NLP) because they occur so often that they carry very little useful information.
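
A minimal sketch of stopword removal followed by a word-frequency count is shown below. The stopword set here is a tiny hand-written list and the titles are made-up samples, both assumptions for illustration; the libraries listed in this README provide their own tooling, and the `keywords` helper is hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption, not a complete list).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "for", "and", "in", "of", "with"}

# Made-up sample titles for illustration only.
titles = [
    "Learn to Play the Piano",
    "The Complete Guide to Web Design",
    "Piano for Beginners",
    "Web Design with HTML and CSS",
]


def keywords(title):
    """Lowercase a title, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z]+", title.lower())
    return [w for w in words if w not in STOPWORDS]


# Count how often each remaining keyword appears across all titles.
counts = Counter(word for title in titles for word in keywords(title))
print(counts.most_common(3))
```

Counting after stopword removal keeps words such as "the" and "to" from crowding out the subject terms that actually describe the courses.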

Libraries used: pandas, numpy, matplotlib, seaborn, warnings, datetime, neattext, Counter (from collections), rake.

If you find this insightful, feel free to star the repository. Any issues can be reported to me.

If you want to work with this analysis, you can clone or fork the repository and then make changes as you wish.
