# DataCamp Course Scheduler

In [19]:
# Import Libraries - Web Scraping
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import TimeoutException
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time
import pandas as pd
import traceback
# Import Libraries - Named Entity Recognition
import spacy
from spacy.training.example import Example
from spacy.util import minibatch, compounding
import re
import string
from collections import Counter
import random
# Import Libraries - User Interface
from IPython.display import display, HTML
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
# Import Libraries - Kahn's Algorithm
import networkx as nx

## Project Introduction

DataCamp is an online learning platform specializing in courses for Data Science, Data Engineering, and Data Analysis. The platform hosts 582 courses covering Data Cleaning, Data Discovery, Data Modeling, AI tools, and much more. DataCamp courses are often linked through prerequisite relationships, where certain courses must be completed before others can be started.

DataCamp offers track-based options for people who want to develop the skills of a Data Scientist or Data Analyst, but experienced Data Scientists who want to upskill on ad-hoc topics have few options beyond searching the course catalog. A catalog search is not a great solution, because it only points the Data Scientist toward a course that covers the topic. The Data Scientist must still identify that course's prerequisites manually, and each of those could have prerequisites of its own.

My project allows the Data Scientist to select common Data Science and Machine Learning topics from a drop-down menu and see the DataCamp courses that would help in learning the desired topic. Once the Data Scientist selects the courses they would like to complete, my project generates a recommended course sequence that includes every level of prerequisite required for those courses and places each prerequisite before the course that requires it.
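
The ordering requirement described above can be sketched with Kahn's algorithm, which the imports name as the intended approach. This is a minimal, standard-library illustration rather than the project's actual implementation: the function name `order_courses` and the example prerequisite mapping are assumptions made for this sketch.

```python
from collections import deque


def order_courses(prerequisites):
    """Return a course order in which every prerequisite precedes its dependents.

    `prerequisites` maps each course to the list of courses it requires.
    """
    # Collect every course that appears as a key or as a prerequisite
    courses = set(prerequisites)
    for required in prerequisites.values():
        courses.update(required)

    # in_degree[c] counts the prerequisites of c that are not yet scheduled
    in_degree = {course: 0 for course in courses}
    dependents = {course: [] for course in courses}
    for course, required in prerequisites.items():
        for prereq in required:
            in_degree[course] += 1
            dependents[prereq].append(course)

    # Start with courses that have no prerequisites (sorted for a stable result)
    ready = deque(sorted(c for c in courses if in_degree[c] == 0))
    ordered = []
    while ready:
        course = ready.popleft()
        ordered.append(course)
        for dependent in sorted(dependents[course]):
            in_degree[dependent] -= 1
            if in_degree[dependent] == 0:
                ready.append(dependent)

    # Any course left unscheduled is part of a prerequisite cycle
    if len(ordered) != len(courses):
        raise ValueError("Prerequisite cycle detected")
    return ordered


# Hypothetical prerequisite data, for illustration only
example = {
    "Supervised Learning with scikit-learn": ["Intermediate Python"],
    "Intermediate Python": ["Introduction to Python"],
    "Introduction to Python": [],
}
schedule = order_courses(example)
print(schedule)
```

Raising an error on a cycle is a design choice for this sketch; scraped prerequisite data could plausibly contain one, and failing loudly is easier to debug than a silently incomplete schedule.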

## Generate Course List from DataCamp Website

The available DataCamp API did not expose prerequisites as a field, so it did not meet the needs of this project. In its place, I have used BeautifulSoup and Selenium to scrape the needed data from the DataCamp website.

In [3]:
def get_course_header(url, df):
    """Scrape one catalog page and append each course's name, URL, and skill level to df."""
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
    finally:
        # Close the browser even if the page fails to load
        driver.quit()
    # Each course card is identified by a site-generated CSS class
    for tr in soup.find_all(class_='css-gqd6cf'):
        course_name = [td.text for td in tr.find_all(class_='css-1k3iole')]
        course_link = [td.get('href') for td in tr.find_all('a')]
        skill_level = [td.text for td in tr.find_all(class_='css-ext011')]
        if not course_link or course_link[0] is None:
            continue  # skip cards that carry no course link
        url_name = 'https://www.datacamp.com' + course_link[0]
        new_data = pd.DataFrame([[course_name, url_name, skill_level]], 
                        columns=['course_name', 'course_url', 'skill_level'])
        df = pd.concat([df, new_data], ignore_index=True)
    return df

In [9]:
# Fill in Course Info List
total_pages = 20
output_df = pd.DataFrame(columns=['course_name', 'course_url', 'skill_level'])

for i in range(total_pages):
    input_df = output_df
    page_link = 'https://www.datacamp.com/courses-all/page/' + str(i+1)
    print(page_link)
    output_df = get_course_header(page_link, input_df)
    time.sleep(10)

https://www.datacamp.com/courses-all/page/1
https://www.datacamp.com/courses-all/page/2
https://www.datacamp.com/courses-all/page/3
https://www.datacamp.com/courses-all/page/4
https://www.datacamp.com/courses-all/page/5
https://www.datacamp.com/courses-all/page/6
https://www.datacamp.com/courses-all/page/7
https://www.datacamp.com/courses-all/page/8
https://www.datacamp.com/courses-all/page/9
https://www.datacamp.com/courses-all/page/10
https://www.datacamp.com/courses-all/page/11
https://www.datacamp.com/courses-all/page/12
https://www.datacamp.com/courses-all/page/13
https://www.datacamp.com/courses-all/page/14
https://www.datacamp.com/courses-all/page/15
https://www.datacamp.com/courses-all/page/16
https://www.datacamp.com/courses-all/page/17
https://www.datacamp.com/courses-all/page/18
https://www.datacamp.com/courses-all/page/19
https://www.datacamp.com/courses-all/page/20


In [10]:
output_df.head(2)

| Unnamed: 0 | course_name | course_url | skill_level |
|---|---|---|---|
| 0 | [Introduction to Python] | https://www.datacamp.com/courses/intro-to-pyth... | [BasicSkill Level] |
| 1 | [Introduction to SQL] | https://www.datacamp.com/courses/introduction-... | [BasicSkill Level] |


In [14]:
# write course info DF to CSV
output_df.to_csv('output_df.csv', index = False)

In [213]:
# input course info DF from CSV
output_df = pd.read_csv('output_df.csv')
output_df.head(2)

| Unnamed: 0 | course_name | course_url | skill_level |
|---|---|---|---|
| 0 | ['Introduction to Python'] | https://www.datacamp.com/courses/intro-to-pyth... | ['BasicSkill Level'] |
| 1 | ['Introduction to SQL'] | https://www.datacamp.com/courses/introduction-... | ['BasicSkill Level'] |
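
Note that after the CSV round trip the list-valued cells come back as strings such as `"['Introduction to Python']"`. As an alternative to stripping brackets and quotes by hand, one option is to parse them back into lists with the standard library. This is only a sketch; the helper name `parse_list_cell` is hypothetical and not part of the notebook.

```python
import ast


def parse_list_cell(cell):
    """Turn a string such as "['Introduction to Python']" back into a list."""
    if isinstance(cell, str) and cell.startswith("[") and cell.endswith("]"):
        try:
            # literal_eval only evaluates Python literals, so it is safe on scraped text
            return ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            return cell  # leave malformed cells untouched
    return cell


raw = "['Introduction to Python']"
parsed = parse_list_cell(raw)
print(parsed[0])  # Introduction to Python
```

Applied to a column, this would look like `output_df['course_name'].apply(parse_list_cell)`, assuming the cells hold valid Python list literals.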


## Get Course Prerequisites and Descriptions

Each DataCamp course can have up to 3 prerequisites, each of which could have prerequisites of its own. This process opens the link for each course and returns its prerequisites, its course attributes (including the language), and its description. The full course description only appears after clicking a "Read More" button.

In [214]:
def get_course_prerequisites(data):
    """Visit each course page and collect its prerequisites, attributes, and description."""
    columns = ['course_name', 'course_url', 'skill_level', 'prerequisites', 'course_attributes', 'description']
    output_data = pd.DataFrame(columns=columns)
    driver = None  # defined up front so the error handler can always reference it
    try:
        for index, row in data.iterrows():
            driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
            course_name = row['course_name']
            skill_level = row['skill_level']
            url = row['course_url']
            print(url)
            print(index/len(data))  # fraction of courses processed so far
            driver.get(url)
            wait = WebDriverWait(driver, 15)
            try:
                # wait for Read More button to appear, then click it to expand the description
                element = wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[2]/div/main/div[1]/div/div/div/div/button")))
                actions = ActionChains(driver)
                actions.move_to_element(element).click().perform()
            except TimeoutException:
                print("Button did not appear within 15 seconds, continuing with rest of code")
            subpage_source = driver.page_source
            subpage_soup = BeautifulSoup(subpage_source, 'html.parser')
            # The CSS class names below are site-generated and may change over time
            prerequisites = [td.text for td in subpage_soup.find_all(True, {'class':['css-uhxy31', 'css-45ksdk']})]
            course_attributes = [td.text for td in subpage_soup.find_all(class_='css-171kpsz')]
            description = [td.text for td in subpage_soup.find_all(class_='css-775ab')]
            new_data = pd.DataFrame([[course_name, url, skill_level, prerequisites, course_attributes, description]], 
                        columns=columns)
            output_data = pd.concat([output_data, new_data], ignore_index=True)
            driver.quit()
            driver = None  # this browser is closed; avoid quitting it twice
            time.sleep(8)  # pause between requests
    except Exception as e:
        if driver is not None:
            driver.quit()
        print(f"An unexpected error occurred: {e}")
        traceback.print_exc()
        return None
    return output_data
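
Because a prerequisite can have prerequisites of its own, the scraped data eventually has to be expanded transitively for whichever courses a user selects. A minimal sketch of that expansion follows. The function name `collect_required_courses` and the course names are placeholders for illustration, not taken from the notebook.

```python
def collect_required_courses(selected, prerequisites):
    """Return the selected courses plus every direct and indirect prerequisite."""
    required = set()
    stack = list(selected)
    while stack:
        course = stack.pop()
        if course in required:
            continue  # already visited; this also protects against cycles
        required.add(course)
        # Courses missing from the mapping are treated as having no prerequisites
        stack.extend(prerequisites.get(course, []))
    return required


# Hypothetical prerequisite mapping, for illustration only
prereq_map = {
    "Course C": ["Course B"],
    "Course B": ["Course A"],
    "Course A": [],
    "Course D": [],
}
needed = collect_required_courses(["Course C"], prereq_map)
print(sorted(needed))  # ['Course A', 'Course B', 'Course C']
```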

In [17]:
# Get Course Prerequisites and Description for each course
df_small_full = get_course_prerequisites(output_df)

https://www.datacamp.com/courses/intro-to-python-for-data-science
0.0
https://www.datacamp.com/courses/introduction-to-sql
0.001718213058419244
https://www.datacamp.com/courses/introduction-to-power-bi
0.003436426116838488
https://www.datacamp.com/courses/understanding-artificial-intelligence
0.005154639175257732
https://www.datacamp.com/courses/intermediate-sql
0.006872852233676976
https://www.datacamp.com/courses/free-introduction-to-r
0.00859106529209622
https://www.datacamp.com/courses/intermediate-python
0.010309278350515464
https://www.datacamp.com/courses/understanding-data-science
0.012027491408934709
https://www.datacamp.com/courses/introduction-to-excel
0.013745704467353952
https://www.datacamp.com/courses/joining-data-in-sql
0.015463917525773196
https://www.datacamp.com/courses/introduction-to-python-for-developers
0.01718213058419244
https://www.datacamp.com/courses/supervised-learning-with-scikit-learn
0.018900343642611683
https://www.datacamp.com/courses/data-manipulation

In [196]:
df_small_full.head(2)

| Unnamed: 0.1 | Unnamed: 0 | course_name | course_url | skill_level | prerequisites | course_attributes | description | language | prerequisites_text | description_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Introduction to Python | https://www.datacamp.com/courses/intro-to-pyth... | BasicSkill Level | [['There are no prerequisites for this course']] | [['Python', 'Programming', '4 hr', '11 vide... | An Introduction to Python \nPython has grown t... | Python | There are no prerequisites for this course | An Introduction to Python \nPython has grown t... |
| 1 | 1 | Introduction to SQL | https://www.datacamp.com/courses/introduction-... | BasicSkill Level | [['There are no prerequisites for this course']] | [['SQL', 'Data Manipulation', '2 hr', '7 vi... | "Get an Introduction to SQL in Two Hours \nMuc... | SQL | There are no prerequisites for this course | "Get an Introduction to SQL in Two Hours \nMuc... |


In [20]:
#df_small_full.to_csv('test_full.csv')

In [7]:
df_small_full = pd.read_csv('test_full.csv')
df_small_full = df_small_full.rename(columns={'language': 'course_attributes'})

## Text and Dataframe Formatting

In [8]:
# Split string columns into list
df_small_full['prerequisites'] = df_small_full['prerequisites'].str.split(',')
df_small_full['course_attributes'] = df_small_full['course_attributes'].str.split(',')

In [9]:
df_small_full.head(2)

| Unnamed: 0.1 | Unnamed: 0 | course_name | course_url | skill_level | prerequisites | course_attributes | description |
|---|---|---|---|---|---|---|---|
| 0 | 0 | ['Introduction to Python'] | https://www.datacamp.com/courses/intro-to-pyth... | ['BasicSkill Level'] | [['There are no prerequisites for this course']] | [['Python', 'Programming', '4 hr', '11 vide... | ['An Introduction to Python \nPython has grown... |
| 1 | 1 | ['Introduction to SQL'] | https://www.datacamp.com/courses/introduction-... | ['BasicSkill Level'] | [['There are no prerequisites for this course']] | [['SQL', 'Data Manipulation', '2 hr', '7 vi... | ["Get an Introduction to SQL in Two Hours \nMu... |


In [10]:
# Get target items from list fields
df_small_full['language'] = df_small_full['course_attributes'].str.get(0)
df_small_full['prerequisites_text'] = df_small_full['prerequisites'].str.get(0)
df_small_full['description_text'] = df_small_full['description']

In [11]:
# Text processing to remove newline characters in description field
df_small_full['description_text'] = df_small_full['description_text'].astype(str)
df_small_full['description_text'] = df_small_full['description_text'].str.replace('\\n', '..')
df_small_full['description_text'] = df_small_full['description_text'].str.replace('\\\n', '...')
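
Text that has been written to CSV as a stringified list can contain the two-character sequence backslash plus `n` instead of a real newline character, and the two need different patterns. The following sketch handles both cases with plain string replacement. The helper name `flatten_newlines` is hypothetical, and whether the scraped descriptions contain one form or both is an assumption.

```python
def flatten_newlines(text, replacement=".."):
    """Replace both literal backslash-n sequences and real newlines."""
    # "\\n" is the two characters backslash + n; "\n" is one newline character
    return text.replace("\\n", replacement).replace("\n", replacement)


with_literal = "An Introduction to Python \\nPython has grown"
with_real = "An Introduction to Python \nPython has grown"
print(flatten_newlines(with_literal))  # An Introduction to Python ..Python has grown
print(flatten_newlines(with_real))     # An Introduction to Python ..Python has grown
```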

In [12]:
def remove_outer_apostrophes(text):
    if isinstance(text, str) and text.startswith("'") and text.endswith("'"):
        return text[1:-1]
    return text

In [13]:
# DataFrame Formatting
for column_name in df_small_full.columns:
    try:
        # regex=False treats "[" and "]" as literal characters rather than regex syntax
        df_small_full[column_name] = df_small_full[column_name].str.replace("[", '', regex=False).str.replace("]", '', regex=False)
        df_small_full[column_name] = df_small_full[column_name].apply(remove_outer_apostrophes)
        df_small_full[column_name] = df_small_full[column_name].str.replace("Read Less", "", regex=False)
        df_small_full[column_name] = df_small_full[column_name].str.replace(',', '', regex=False)
    except Exception:
        # Columns that do not support string operations (e.g. numeric columns) are skipped
        continue
df_small_full.head(2)

| Unnamed: 0.1 | Unnamed: 0 | course_name | course_url | skill_level | prerequisites | course_attributes | description | language | prerequisites_text | description_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Introduction to Python | https://www.datacamp.com/courses/intro-to-pyth... | BasicSkill Level | [['There are no prerequisites for this course']] | [['Python', 'Programming', '4 hr', '11 vide... | An Introduction to Python \nPython has grown t... | Python | There are no prerequisites for this course | An Introduction to Python ..Python has grown t... |
| 1 | 1 | Introduction to SQL | https://www.datacamp.com/courses/introduction-... | BasicSkill Level | [['There are no prerequisites for this course']] | [['SQL', 'Data Manipulation', '2 hr', '7 vi... | "Get an Introduction to SQL in Two Hours \nMuc... | SQL | There are no prerequisites for this course | "Get an Introduction to SQL in Two Hours ..Muc... |


## Named Entity Recognition Model

Each course description names the Python packages, software, Data Science concepts, and programming languages that the course covers. The description is far more thorough than the course title and often runs to multiple paragraphs. I'm using a custom Named Entity Recognition (NER) model to pull relevant entities from each course description, which can then be used to filter for courses that mention specific entities. The model returns entities from 4 categories: Python packages (`PYTHON_PKG`), software (`SOFTWARE`), Data Science concepts (`DATA_CONCEPT`), and programming languages (`PROG_LANG`).
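
To make the filtering idea concrete, here is a small sketch of how extracted entities might be used to select courses. The data structure (course name mapped to a list of `(entity text, label)` pairs), the function name `courses_mentioning`, and the entity values are all assumptions for illustration; the real model output may be shaped differently.

```python
def courses_mentioning(course_entities, wanted_entity):
    """Return the names of courses whose extracted entities include wanted_entity."""
    wanted = wanted_entity.lower()
    return [
        name
        for name, entities in course_entities.items()
        if wanted in {text.lower() for text, _label in entities}
    ]


# Hypothetical NER output: course name -> list of (entity text, label) pairs
extracted = {
    "Introduction to Python": [("Python", "PROG_LANG"), ("NumPy", "PYTHON_PKG")],
    "Introduction to SQL": [("SQL", "PROG_LANG"), ("relational databases", "DATA_CONCEPT")],
    "Intermediate Python": [("Python", "PROG_LANG"), ("pandas", "PYTHON_PKG")],
}
print(courses_mentioning(extracted, "python"))
# ['Introduction to Python', 'Intermediate Python']
```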

### Generate Training Data for NER Model

In [53]:
# 27 Train Rows
training_data = [
            ("This course will provide you an understanding of how to use built-in PostgreSQL functions in your SQL queries to manipulate different types of data including strings, character, numeric and date/time. We'll travel back to a time where Blockbuster video stores were on every corner and if you wanted to watch a movie, you actually had to leave your house to rent a DVD! You'll also get an introduction into the robust full-text search capabilities which provides a powerful tool for indexing and matching keywords in a PostgreSQL document. And finally, you'll learn how to extend these features by using PostgreSQL extensions.",
             {"entities": [(69, 79, "PROG_LANG"), (518, 528, "PROG_LANG"), (603, 613, "PROG_LANG"), (80, 89, "DATA_CONCEPT"), (482, 490, "DATA_CONCEPT"), (614, 624, "DATA_CONCEPT")]}),
            
            ("An Introduction to Python \nPython has grown to become the market leader in programming languages and the language of choice for data analysts and data scientists.",
             {"entities": [(19, 25, "PROG_LANG"), (27, 33, "PROG_LANG")]}),

            ("Discover the Python Basics \nThis is a Python course for beginners and we designed it for people with no prior Python experience.",
              {"entities": [(13, 19, "PROG_LANG"), (38, 44, "PROG_LANG"), (110, 116, "PROG_LANG")]}),

            ("You will cover the basics of Python helping you understand common everyday functions and applications including how to use Python as a calculator understanding variables and types and building Python lists.",
              {"entities": [(29, 35, "PROG_LANG"), (123, 129, "PROG_LANG"), (193, 199, "PROG_LANG"), (75, 84, "DATA_CONCEPT")]}),

            ("Explore Python Functions and Packages \nThe second half of the course starts with a view of how you can use functions methods and packages to use code that other Python developers have written.",
              {"entities": [(8, 14, "PROG_LANG"), (161, 167, "PROG_LANG"), (107, 116, "DATA_CONCEPT"), (15, 24, "DATA_CONCEPT")]}),

            ("Get Started with NumPy \nNumPy is an essential Python package for data science. You’ll finish this course by learning to use some of the most popular tools in the NumPy array and start exploring data in Python.",
              {"entities": [(46, 52, "PROG_LANG"), (202, 208, "PROG_LANG"), (17, 22, "PYTHON_PKG"), (24, 29, "PYTHON_PKG"), (162, 167, "PYTHON_PKG")]}),

            ("Get an Introduction to SQL in Two Hours \nMuch of the world's raw data—from electronic medical records to customer transaction histories—lives in organized collections of tables called relational databases.",
              {"entities": [(23, 26, "PROG_LANG"), (184, 204, "DATA_CONCEPT")]}),

            ("Being able to wrangle and extract data from these databases using SQL is an essential skill within the data industry and in increasing demand.",
              {"entities": [(66, 69, "PROG_LANG"), (50, 59, "DATA_CONCEPT")]}),

            ("Learn how Relational Databases are Organized \nSQL is an essential language for building and maintaining relational databases which opens the door to a range of careers in the data industry and beyond. You’ll start this course by covering data organization tables and best practices for database construction.",
              {"entities": [(10, 30, "DATA_CONCEPT"), (104, 124, "DATA_CONCEPT"), (286, 307, "DATA_CONCEPT")]}),

            ("Write Your First SQL Queries \nThe second half of this course looks at creating SQL queries for selecting data that you need from your database.",
              {"entities": [(17, 20, "PROG_LANG"), (79, 82, "PROG_LANG"), (83, 90, "DATA_CONCEPT")]}),

            ("Understand the Difference Between PostgreSQL and SQL Server \nPostgreSQL and SQL Server are two of the most popular SQL flavors. By the end of the course you’ll have some hands-on experience in learning SQL and the grounding to start applying it to projects or continue your learning in a more specialized direction.",
              {"entities": [(34, 44, "PROG_LANG"), (61, 71, "PROG_LANG"), (49, 59, "SOFTWARE"), (76, 86, "SOFTWARE"), (115, 118, "PROG_LANG"), (202, 205, "PROG_LANG")]}),

            ("A Thorough Introduction to Power BI\nIn this 3-hour course you’ll gain a 360° overview of the Power BI basics and learn how to use the tool to build impactful reports. In this course you’ll go from zero to hero as you discover how to use this popular business intelligence platform through hands-on exercises.",
              {"entities": [(27, 35, "SOFTWARE"), (93, 101, "SOFTWARE"), (250, 271, "DATA_CONCEPT")]}),

            ("Before diving into creating visualizations using Power BI's drag-and-drop functionality you’ll first learn how to confidently load and transform data using Power Query and the importance of data models. Learn the Power BI Basics\nYou’ll start by looking at some of the fundamentals of Power BI getting to grips with Data Model and Report views.",
              {"entities": [(49, 57, "SOFTWARE"), (213, 221, "SOFTWARE"), (284, 292, "SOFTWARE"), (156, 167, "SOFTWARE")]}),

            ("You’ll learn to load data sets build a data model and discover how to shape and transform your data with Power Query Editor. Create Powerful Data Visualizations\nOnce you've covered Power BI in general you can dig into its options for data visualization.",
              {"entities": [(39, 49, "DATA_CONCEPT"), (105, 116, "SOFTWARE"), (141, 160, "DATA_CONCEPT"), (234, 252, "DATA_CONCEPT")]}),

            ("Explore the basics of Artificial Intelligence\nAI is transforming our economy media industries and society. Designed for beginners with no coding knowledge required gain a straightforward introduction to the world of AI. \n\nYou’ll demystify common AI buzzwords—machine learning deep learning generative AI etc.—in a way that is easy to understand and apply.",
              {"entities": [(22, 45, "DATA_CONCEPT"), (46, 48, "DATA_CONCEPT"), (216, 218, "DATA_CONCEPT"), (246, 248, "DATA_CONCEPT"), (290, 303, "DATA_CONCEPT"), (259, 275, "DATA_CONCEPT"), (276, 289, "DATA_CONCEPT")]}),

            ("Overview of AI applications\nWe\'ll journey through AI\'s practical applications in daily life discover how organizations can integrate AI and discuss the societal impact arising from AI\'s relentless advancement. We’ll explore also tasks that AI can solve from helping run traffic more smoothly to optimizing booking systems. \n\nUpon completion you\'ll have a comprehensive understanding of AI\'s capabilities and its influence on society.",
              {"entities": [(12, 14, "DATA_CONCEPT"), (50, 52, "DATA_CONCEPT"), (133, 135, "DATA_CONCEPT"), (181, 183, "DATA_CONCEPT"), (240, 242, "DATA_CONCEPT"), (386, 388, "DATA_CONCEPT")]}),

            ("SQL is widely recognized as the most popular language for turning raw data stored in a database into actionable insights. This course uses a films database to teach how to navigate and extract insights from the data using SQL. \nDiscover Filtering with SQL\nYou'll discover techniques for filtering and comparing data enabling you to extract specific information to gain insights and answer questions about the data.",
              {"entities": [(0, 3, "PROG_LANG"), (222, 225, "PROG_LANG"), (252, 255, "PROG_LANG")]}),

            ("Get Acquainted with Aggregation\nNext you'll get a taste of aggregate functions essential for summarizing data effectively and gaining valuable insights from large datasets. You'll also combine this with sorting and grouping data adding another layer of meaning to your insights and analysis.",
              {"entities": [(20, 31, "DATA_CONCEPT"), (59, 78, "DATA_CONCEPT")]}),

            ("Learn R Programming   R programming language is a useful tool for data scientists analysts and statisticians especially those working in academic settings.  This introduction to R course covers the basics of this open source language including vectors factors lists and data frames. You’ll gain useful coding skills and be ready to start your own data analysis in R.\n\nGain an Introduction to R",
              {"entities": [(6, 7, "PROG_LANG"), (22, 23, "PROG_LANG"), (178, 179, "PROG_LANG"), (392, 393, "PROG_LANG")]}), 

            ("You’ll get started with basic operations like using the console as a calculator and understanding basic data types in R. Once you’ve had a chance to practice you’ll move on to creating vectors and try out your new R skills on a data set based on betting in Las Vegas.\n \nNext you’ll learn how to work with matrices in R learning how to create them and perform calculations with them. You’ll also examine how R uses factors to store categorical data.",
            {"entities": [(214, 215, "PROG_LANG"), (317, 318, "PROG_LANG"), (407, 408, "PROG_LANG")]}),
            
            ("Finally you’ll explore how to work with R data frames and lists.\n\nMaster the R Basics for Data Analysis \nBy the time you’ve completed our Introduction to R course you’ll be able to use R for your own data analysis. These sought-after skills can help you progress in your career and set you up for further learning. This course is part of several tracks including Data Analyst with R Data Scientist with R and R Programming all of which can help you develop your knowledge.",
            {"entities": [(40, 41, "PROG_LANG"), (77, 78, "PROG_LANG"), (154, 155, "PROG_LANG"), (185, 186, "PROG_LANG"), (381, 382, "PROG_LANG"), (403, 404, "PROG_LANG"), (409, 410, "PROG_LANG")]}),
    
            ("Improve Your Python Skills\nLearning Python is crucial for any aspiring data science practitioner. Learn to visualize real data with Matplotlib’s functions and get acquainted with data structures such as the dictionary and pandas DataFrame. This four-hour intermediate course will help you to build on your existing Python skills and explore new Python applications and functions that expand your repertoire and help you work more efficiently.",
              {"entities": [(13, 19, "PROG_LANG"), (36, 42, "PROG_LANG"), (315, 321, "PROG_LANG"), (345, 351, "PROG_LANG"), (132, 142, "PYTHON_PKG"), (222, 228, "PYTHON_PKG")]}),

            ("Learn to Use Python Dictionaries and pandas\nDictionaries offer an alternative to Python lists while the pandas dataframe is the most popular way of working with tabular data. In the second chapter of this course you’ll find out how you can create and manipulate datasets and how to access them using these structures. Hands-on practice throughout the course will build your confidence in each area.",
              {"entities": [(13, 19, "PROG_LANG"), (81, 87, "PROG_LANG"), (37, 43, "PYTHON_PKG"), (104, 110, "PYTHON_PKG"), (20, 32, "DATA_CONCEPT"), (44, 56, "DATA_CONCEPT")]}),

            ("Explore Python Boolean Logic and Python Loops \nIn the second half of this course you’ll look at logic control flow filtering and loops. These functions work to control decision-making in Python programs and help you to perform more operations with your data including repeated statements.", 
              {"entities": [(8, 14, "PROG_LANG"), (33, 39, "PROG_LANG"), (187, 193, "PROG_LANG"), (15, 28, "DATA_CONCEPT"), (40, 45, "DATA_CONCEPT"), (129, 134, "DATA_CONCEPT"), (142, 151, "DATA_CONCEPT")]}),

            ("Getting started with Excel\nIn this Excel course you’ll learn the fundamentals needed to have you analyzing data in spreadsheets before you know it. This course focuses on helping you navigate Excel and prepare your data for basic analysis. You’ll learn how to manage tables and apply calculations to your data to provide new insights.",
              {"entities": [(21, 26, "SOFTWARE"), (35, 40, "SOFTWARE"), (192, 197, "SOFTWARE")]}),

            ("Learn Functions\nYou’ll learn about the many available functions in Excel that can help you to perform calculations analyze data and in some cases automate tasks. We’ll start your journey into understanding the full potential of Excel’s functions and leverage built-ins such as SUM AVERAGE COUNT UPPER LEFT and more.",
              {"entities": [(67, 72, "SOFTWARE"), (228, 233, "SOFTWARE"), (54, 63, "DATA_CONCEPT"), (236, 245, "DATA_CONCEPT"), (6, 15, "DATA_CONCEPT")]}),

            ("Visualizing data\nWe’ll finish this course by explaining the basics of using Excel to effectively communicate data through compelling visuals. You’ll explore the versatility of area column and pie charts to help tell a story to bring your analysis to life.",
              {"entities": [(76, 81, "SOFTWARE"), (0, 16, "DATA_CONCEPT")]}),
        ]
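
Hand-written character offsets like the ones above are easy to get wrong, so it can help to check them before training. The sketch below flags spans that fall outside the text, overlap an earlier span, or include surrounding whitespace. The function name `find_span_problems` and the two sample rows are made up for illustration, and the checks reflect the general expectation that NER spans are in range and non-overlapping rather than any specific spaCy validation routine.

```python
def find_span_problems(examples):
    """Return (example_index, start, end, label, reason) for each suspicious span."""
    problems = []
    for i, (text, annotations) in enumerate(examples):
        previous_end = 0
        # Sort by start offset so overlaps can be detected in one pass
        for start, end, label in sorted(annotations["entities"]):
            if not (0 <= start < end <= len(text)):
                problems.append((i, start, end, label, "out of range"))
                continue
            if start < previous_end:
                problems.append((i, start, end, label, "overlaps previous span"))
            if text[start:end] != text[start:end].strip():
                problems.append((i, start, end, label, "leading/trailing whitespace"))
            previous_end = max(previous_end, end)
    return problems


# Made-up rows in the same (text, {"entities": [...]}) format as training_data
sample = [
    ("Get Started with NumPy", {"entities": [(17, 22, "PYTHON_PKG")]}),
    ("Learn SQL", {"entities": [(6, 9, "PROG_LANG"), (40, 45, "SOFTWARE")]}),
]
for problem in find_span_problems(sample):
    print(problem)  # (1, 40, 45, 'SOFTWARE', 'out of range')
```

Running `find_span_problems(training_data)` and printing `text[start:end]` for each span is a quick way to confirm that every offset selects the intended word.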

### Generate Test Data for NER Model

In [None]:
# 25 Test Rows
("This course will provide you an understanding of how to use built-in PostgreSQL functions in your SQL queries to manipulate different types of data including strings, character, numeric and date/time. We'll travel back to a time where Blockbuster video stores were on every corner and if you wanted to watch a movie, you actually had to leave your house to rent a DVD! You'll also get an introduction into the robust full-text search capabilities which provides a powerful tool for indexing and matching keywords in a PostgreSQL document. And finally, you'll learn how to extend these features by using PostgreSQL extensions.",
    {"entities": [(69, 79, "PROG_LANG"), (518, 528, "PROG_LANG"), (603, 613, "PROG_LANG"), (80, 89, "DATA_CONCEPT"), (482, 490, "DATA_CONCEPT"), (614, 624, "DATA_CONCEPT")]}),
            
("Are you ready to dive deep into the world of financial modeling and hone your Excel skills? In this comprehensive course, you'll embark on a journey that will empower you with the knowledge and tools to excel in financial analysis.Discover the Power of Cash Flows\nBegin your journey by unraveling the intricacies of cash flows. You'll construct a financial model, calculate net income, and gain a solid understanding of what makes up an income statement. These skills will form the foundation of your financial modeling prowess.\nMaster Scenario Analysis Techniques\nTake your modeling skills to the next level by diving into scenario analysis. You'll learn how to forecast multiple outcomes, conduct sensitivity analysis, and effortlessly manipulate growth rates using Excel's versatile tools.\nUnlock the Secrets of Time Value of Money\nUnderstand the critical concept of the time value of money and its pivotal role in decision-making. You'll calculate present and future values and apply these skills with precision.\nBecoming a Capital Budgeting Pro\nEmpower yourself with the ability to make data-driven decisions using metrics like net present value, internal rate of return. You'll compare results against benchmarks and provide actionable recommendations that drive success in financial scenarios.\n\nUpon completing this course, you'll be equipped with the expertise to excel in financial modeling using Excel. You'll have a deep understanding of financial concepts and the practical skills to apply them effectively. Join us on this journey and unlock your potential as a modeling master!",
    {"entities": [(45, 63, "DATA_CONCEPT"), (501, 519, "DATA_CONCEPT"), (1381, 1399, "DATA_CONCEPT"), (78, 83, "SOFTWARE"), (768, 773, "SOFTWARE"), (1406, 1411, "SOFTWARE"), (212, 230, "DATA_CONCEPT"), (347, 362, "DATA_CONCEPT"), (624, 641, "DATA_CONCEPT"), (699, 719, "DATA_CONCEPT"), (874, 893, "DATA_CONCEPT")]}),
            
("Managing the end-to-end lifecycle of a Machine Learning application can be a daunting task for data scientists, engineers, and developers. Machine Learning applications are complex and have a proven track record of being difficult to track, hard to reproduce, and problematic to deploy.\nIn this course, you will learn what MLflow is and how it attempts to simplify the difficulties of the Machine Learning lifecycle such as tracking, reproducibility, and deployment. After learning MLflow, you will have a better understanding of how to overcome the complexities of building Machine Learning applications and how to navigate different stages of the Machine Learning lifecycle.\nThroughout the course, you will deep dive into the four major components that make up the MLflow platform. You will explore how to track models, metrics, and parameters with MLflow Tracking, package reproducible ML code using MLflow Projects, create and deploy models using MLflow Models, and store and version control models using Model Registry.\nAs you progress through the course, you will also learn best practices of using MLflow for versioning models, how to evaluate models, add customizations to models, and how to build automation into training runs. This course will prepare you for success in managing the lifecycle of your next Machine Learning application.",
    {"entities": [(39, 55, "DATA_CONCEPT"), (1317, 1333, "DATA_CONCEPT"), (323, 329, "SOFTWARE"), (482, 488, "SOFTWARE"), (767, 773, "SOFTWARE"), (851, 857, "SOFTWARE"), (903, 909, "SOFTWARE"), (951, 957, "SOFTWARE"), (1105, 1111, "SOFTWARE"), (1206, 1216, "DATA_CONCEPT"), (389, 415, "DATA_CONCEPT"), (649, 675, "DATA_CONCEPT"), (139, 168, "DATA_CONCEPT"), (575, 604, "DATA_CONCEPT"), (1142, 1157, "DATA_CONCEPT")]}),
            
("Get an introduction to the programming language Scala. You'll learn why and how companies like Netflix, Airbnb, and Morgan Stanley are choosing Scala for large-scale applications and data engineering infrastructure. You'll learn the basics of the language, including syntax and style, focusing on the most commonly used features in the Scala standard library. You'll learn by writing code for a real program that plays a computer version of the popular card game Twenty-One. You’ll get a taste of the value of a hybrid object-oriented and functional programming language, of which Scala is the foremost example. We recommend this course for learners with intermediate-level programming experience, which can be acquired in the listed prerequisites",
    {"entities": [(48, 53, "PROG_LANG"), (144, 149, "PROG_LANG"), (336, 341, "PROG_LANG"), (581, 586, "PROG_LANG")]}),
            
("Grow your machine learning skills with scikit-learn and discover how to use this popular Python library to train models using labeled data. In this course, you'll learn how to make powerful predictions, such as whether a customer is will churn from your business, whether an individual has diabetes, and even how to tell classify the genre of a song. Using real-world datasets, you'll find out how to build predictive models, tune their parameters, and determine how well they will perform with unseen data.",
    {"entities": [(39, 51, "PYTHON_PKG"), (407, 424, "DATA_CONCEPT"), (126, 138, "DATA_CONCEPT")]}),

("Time series data is ubiquitous. Whether it be stock market fluctuations, sensor data recording climate change, or activity in the brain, any signal that changes over time can be described as a time series. Machine learning has emerged as a powerful method for leveraging complexity in data in order to generate predictions and insights into the problem one is trying to solve. This course is an intersection between these two worlds of machine learning and time series data, and covers feature engineering, spectograms, and other advanced techniques in order to classify heartbeat sounds and predict stock prices.",
    {"entities": [(193, 204, "DATA_CONCEPT"), (457, 468, "DATA_CONCEPT"), (436, 452, "DATA_CONCEPT"), (206, 222, "DATA_CONCEPT"), (486, 505, "DATA_CONCEPT"), (507, 518, "DATA_CONCEPT")]}),
            
("Do you know the basics of supervised learning and want to use state-of-the-art models on real-world datasets? Gradient boosting is currently one of the most popular techniques for efficient modeling of tabular datasets of all sizes. XGboost is a very fast, scalable implementation of gradient boosting, with models using XGBoost regularly winning online data science competitions and being used at scale across different industries. In this course, you'll learn how to use this powerful library alongside pandas and scikit-learn to build and tune supervised learning models. You'll work with real-world datasets to solve classification and regression problems",
    {"entities": [(26, 45, "DATA_CONCEPT"), (547, 566, "DATA_CONCEPT"), (284, 301, "DATA_CONCEPT"), (321, 328, "PYTHON_PKG"), (233, 240, "PYTHON_PKG"), (621, 635, "DATA_CONCEPT"), (640, 650, "DATA_CONCEPT"), (505, 511, "PYTHON_PKG"), (516, 528, "PYTHON_PKG")]}),
            
("Deep learning with TensorFlow and Keras for neural network development",
    {"entities": [(0, 13, "DATA_CONCEPT"), (19, 29, "PYTHON_PKG"), (34, 39, "PYTHON_PKG"), (44, 58, "DATA_CONCEPT")]}),

# Annotations are still missing for the three examples below; an empty list
# labels every token as a non-entity.
("Joining data is an essential skill in data analysis enabling you to draw information from separate tables together into a single meaningful set of results. In this comprehensive course on joining data you'll delve into the intricacies of table joins and relational set theory learning how to optimize your queries for efficient data retrieval.\nUnderstand Data Joining Fundamentals",
    {"entities": []}),

("You will learn how to work with multiple tables in SQL by navigating and extracting data from various tables within a SQL database using various join types including inner joins outer joins and cross joins. With practice you'll gain the knowledge of how to select the appropriate join method.\nExplore Advanced Data Manipulation Techniques\n\nNext up you'll explore set theory principles such as unions intersects and except clauses as well as discover the power of nested queries in SQL.",
    {"entities": []}),

("Grow your machine learning skills with scikit-learn and discover how to use this popular Python library to train models using labeled data. Using real-world datasets you'll find out how to build predictive models tune their parameters and determine how well they will perform with unseen data.",
    {"entities": []}),
#11
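
Character offsets in hand-annotated data are easy to get wrong, for example when spans are copied from a longer description or left as empty placeholders. The sketch below is one possible pre-training check, not part of the original notebook; `find_invalid_spans` is a hypothetical helper name, and the two sample entries are illustrative.

```python
def find_invalid_spans(training_data):
    """Return (example_index, span, reason) for every annotation that cannot be valid."""
    problems = []
    for i, (text, annotations) in enumerate(training_data):
        seen = []  # spans accepted so far in this example
        for span in annotations.get("entities", []):
            if len(span) != 3:
                problems.append((i, span, "not a (start, end, label) triple"))
                continue
            start, end, label = span
            if not (0 <= start < end <= len(text)):
                problems.append((i, span, "offsets outside the text"))
            elif text[start:end] != text[start:end].strip():
                problems.append((i, span, "leading or trailing whitespace"))
            elif any(start < e and s < end for s, e, _ in seen):
                problems.append((i, span, "overlaps another span"))
            else:
                seen.append(span)
    return problems


# Illustrative sample: one out-of-range span and one empty placeholder
sample = [
    ("Deep learning with TensorFlow and Keras for neural network development",
     {"entities": [(0, 13, "DATA_CONCEPT"), (19, 29, "PYTHON_PKG"), (621, 635, "DATA_CONCEPT")]}),
    ("Joining data is an essential skill", {"entities": [()]}),
]
for index, span, reason in find_invalid_spans(sample):
    print(index, span, reason)
```

Running a check like this before `train_model` surfaces bad annotations as a readable list rather than as an error or a silent misalignment during training.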

## Create And Train NER Model

In [79]:
# Custom Named Entity Recognition for Course Descriptions
class CourseNER:
    def __init__(self):
        self.nlp = spacy.blank("en") # Initialize blank English model
        
        # Add NER component
        if "ner" not in self.nlp.pipe_names:
            ner = self.nlp.add_pipe("ner", last=True)
        else:
            ner = self.nlp.get_pipe("ner")
        
        # Define custom entity labels
        self.labels = [
            "SOFTWARE",      # Jupyter, Excel, Tableau, etc.
            "PROG_LANG",     # Python, R, SQL, JavaScript, etc.
            "DATA_CONCEPT",  # Machine Learning, Statistics, etc.
            "PYTHON_PKG"     # pandas, numpy, scikit-learn, etc.
        ]
        
        # Add labels to NER
        for label in self.labels:
            ner.add_label(label)

    def extract_entities_from_dataframe(self, df, title_column='course_title', description_column='course_description'):
        """Extract named entities from a dataframe with course titles and descriptions"""
        results = []
    
        for index, row in df.iterrows():
            # Combine title and description for entity extraction
            title = str(row[title_column]) if pd.notna(row[title_column]) else ""
            description = str(row[description_column]) if pd.notna(row[description_column]) else ""
        
            # Combine both texts with a separator
            combined_text = f"{title}. {description}" if title and description else title or description
        
            if combined_text.strip():  # Only process if there's actual text
                doc = self.nlp(combined_text)
                entities = []
            
                for ent in doc.ents:
                    entities.append({
                        'text': ent.text,
                        'label': ent.label_,
                        'start': ent.start_char,
                        'end': ent.end_char,
                        # Placeholder: no 'score' extension is registered, so this is always 1.0
                        'confidence': ent._.get('score') if ent.has_extension('score') else 1.0
                    })
            
                results.append({
                    'course_name': title,
                    'entities': entities,
                    'entity_count': len(entities)
                })
    
        return results

    def train_model(self, training_data, iterations=100):
        """Train the NER model"""
        print(f"Training custom NER model for {iterations} iterations...")
        
        # Get the NER component
        ner = self.nlp.get_pipe("ner")
        
        # Disable other pipes during training
        pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
        unaffected_pipes = [pipe for pipe in self.nlp.pipe_names if pipe not in pipe_exceptions]
        
        # Convert training data to Example objects
        examples = []
        for text, annotations in training_data:
            doc = self.nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            examples.append(example)
        
        # Training loop
        with self.nlp.select_pipes(disable=unaffected_pipes):
            # Initialize the model weights and labels from the examples
            self.nlp.initialize(lambda: examples)
            
            for iteration in range(iterations):
                random.shuffle(examples)
                losses = {}
                
                # Update model
                batches = minibatch(examples, size=compounding(4.0, 32.0, 1.001))
                for batch in batches:
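                    # Use self.dropout when it has been set (e.g. by a hyperparameter search); default 0.5
                    self.nlp.update(batch, losses=losses, drop=getattr(self, "dropout", 0.5))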
                
                if iteration % 20 == 0:
                    print(f"Iteration {iteration}, Losses: {losses}")
        
        print("Training completed!")
    
    def extract_entities(self, texts):
        """Extract named entities from course descriptions"""
        results = []
        
        for text in texts:
            doc = self.nlp(text)
            entities = []
            
            for ent in doc.ents:
                entities.append({
                    'text': ent.text,
                    'label': ent.label_,
                    'start': ent.start_char,
                    'end': ent.end_char,
                    # Placeholder: no 'score' extension is registered, so this is always 1.0
                    'confidence': ent._.get('score') if ent.has_extension('score') else 1.0
                })
            
            results.append({
                'course_description': text,
                'entities': entities,
                'entity_count': len(entities)
            })
        
        return results
    
    def entities_to_dataframe(self, extraction_results):
        """Convert extraction results to a clean DataFrame"""
        rows = []
        
        for result in extraction_results:
            course_desc = result['course_description']
            
            if result['entities']:
                for entity in result['entities']:
                    rows.append({
                        'course_description': course_desc,
                        'entity_text': entity['text'],
                        'entity_type': entity['label'],
                        'start_pos': entity['start'],
                        'end_pos': entity['end']
                    })
            else:
                # Add row even if no entities found
                rows.append({
                    'course_description': course_desc,
                    'entity_text': 'No entities found',
                    'entity_type': 'NONE',
                    'start_pos': -1,
                    'end_pos': -1
                })
        
        return pd.DataFrame(rows)
        
    def dataframe_entities_to_result(self, extraction_results):
        """Convert DataFrame extraction results to a clean output DataFrame"""
        rows = []
    
        for result in extraction_results:
            course_name = result['course_name']
        
            if result['entities']:
                # Create a list of all entities for this course
                entity_list = [f"{entity['text']} ({entity['label']})" for entity in result['entities']]
                entity_string = "; ".join(entity_list)
            
                rows.append({
                    'course_name': course_name,
                    'named_entities': entity_string,
                    'entity_count': result['entity_count']
                })
            else:
                # Add row even if no entities found
                rows.append({
                    'course_name': course_name,
                    'named_entities': 'No entities found',
                    'entity_count': 0
                })
    
        return pd.DataFrame(rows)
    
    def save_model(self, path="./course_ner_model"):
        """Save the trained model"""
        self.nlp.to_disk(path)
        print(f"Model saved to {path}")
    
    def load_model(self, path="./course_ner_model"):
        """Load a trained model"""
        self.nlp = spacy.load(path)
        print(f"Model loaded from {path}")

    def process_course_dataframe(self, df, title_column='course_title', description_column='course_description'):
        """Complete pipeline: extract entities from DataFrame and return results DataFrame"""
    
        # Extract entities from the input DataFrame
        extraction_results = self.extract_entities_from_dataframe(df, title_column, description_column)
    
        # Convert to output DataFrame format
        results_df = self.dataframe_entities_to_result(extraction_results)
    
        return results_df



In [82]:
# NER Model Evaluation Module
# Use this in addition to your existing CourseNER class

import random
import numpy as np
from sklearn.metrics import classification_report, precision_recall_fscore_support
from collections import defaultdict
from spacy.training import Example

class NERModelEvaluator:
    """Evaluation utilities for Named Entity Recognition models"""
    
    def __init__(self, ner_model):
        """Initialize with a trained CourseNER model"""
        self.model = ner_model
    
    def split_data(self, training_data, test_split=0.2, random_seed=42):
        """Split training data into train and test sets"""
        random.seed(random_seed)
        shuffled_data = training_data.copy()
        random.shuffle(shuffled_data)
        
        split_index = int(len(shuffled_data) * (1 - test_split))
        train_data = shuffled_data[:split_index]
        test_data = shuffled_data[split_index:]
        
        print(f"Split data: {len(train_data)} training examples, {len(test_data)} test examples")
        return train_data, test_data
    
    def evaluate_model(self, test_data, verbose=True):
        """Evaluate the model on test data and return detailed metrics"""
        if isinstance(test_data[0], tuple):  # Raw data format
            test_examples = []
            for text, annotations in test_data:
                doc = self.model.nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                test_examples.append(example)
        else:  # Already Example objects
            test_examples = test_data
        
        # Collect predictions and true labels
        y_true = []
        y_pred = []
        
        for example in test_examples:
            # Get true entities
            true_ents = []
            for ent in example.reference.ents:
                true_ents.append((ent.start, ent.end, ent.label_))
            
            # Get predicted entities
            pred_doc = self.model.nlp(example.text)
            pred_ents = []
            for ent in pred_doc.ents:
                pred_ents.append((ent.start, ent.end, ent.label_))
            
            # Token-level evaluation
            tokens_true = ['O'] * len(example.reference)
            tokens_pred = ['O'] * len(pred_doc)
            
            # Mark true entity tokens
            for start, end, label in true_ents:
                for i in range(start, end):
                    if i < len(tokens_true):
                        tokens_true[i] = label
            
            # Mark predicted entity tokens
            for start, end, label in pred_ents:
                for i in range(start, end):
                    if i < len(tokens_pred):
                        tokens_pred[i] = label
            
            # Align lengths (in case of tokenization differences)
            min_len = min(len(tokens_true), len(tokens_pred))
            y_true.extend(tokens_true[:min_len])
            y_pred.extend(tokens_pred[:min_len])
        
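        # Calculate token-level metrics. The weighted average includes the
        # majority 'O' (non-entity) class, so it can look high even when
        # entity tokens are predicted poorly.
        precision, recall, f1, _ = precision_recall_fscore_support(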
            y_true, y_pred, average='weighted', zero_division=0
        )
        
        # Entity-level exact match metrics
        entity_precision, entity_recall, entity_f1 = self.calculate_entity_metrics(
            test_examples
        )
        
        metrics = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'entity_precision': entity_precision,
            'entity_recall': entity_recall,
            'entity_f1': entity_f1
        }
        
        if verbose:
            print("\n=== MODEL EVALUATION RESULTS ===")
            print(f"Token-level Precision: {precision:.4f}")
            print(f"Token-level Recall: {recall:.4f}")
            print(f"Token-level F1-Score: {f1:.4f}")
            print(f"\nEntity-level Precision: {entity_precision:.4f}")
            print(f"Entity-level Recall: {entity_recall:.4f}")
            print(f"Entity-level F1-Score: {entity_f1:.4f}")
            
            # Detailed classification report
            print("\nDetailed Classification Report:")
            print(classification_report(y_true, y_pred, zero_division=0))
            
            # Per-label performance
            self.print_label_performance(test_examples)
        
        return metrics
    
    def calculate_entity_metrics(self, test_examples):
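        """Calculate entity-level precision, recall, and F1.

        Entities are compared as unique (text, label) pairs pooled across all
        examples, so repeated mentions count once and positions are ignored.
        """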
        true_entities = set()
        pred_entities = set()
        
        for example in test_examples:
            # True entities
            for ent in example.reference.ents:
                true_entities.add((example.text[ent.start_char:ent.end_char], ent.label_))
            
            # Predicted entities
            pred_doc = self.model.nlp(example.text)
            for ent in pred_doc.ents:
                pred_entities.add((ent.text, ent.label_))
        
        # Calculate metrics
        if len(pred_entities) == 0:
            precision = 0.0
        else:
            precision = len(true_entities & pred_entities) / len(pred_entities)
        
        if len(true_entities) == 0:
            recall = 0.0
        else:
            recall = len(true_entities & pred_entities) / len(true_entities)
        
        if precision + recall == 0:
            f1 = 0.0
        else:
            f1 = 2 * (precision * recall) / (precision + recall)
        
        return precision, recall, f1
    
    def print_label_performance(self, test_examples):
        """Print performance metrics for each entity label"""
        label_stats = defaultdict(lambda: {'tp': 0, 'fp': 0, 'fn': 0})
        
        for example in test_examples:
            # True entities
            true_ents = set()
            for ent in example.reference.ents:
                entity_key = (ent.start_char, ent.end_char, ent.label_)
                true_ents.add(entity_key)
            
            # Predicted entities
            pred_doc = self.model.nlp(example.text)
            pred_ents = set()
            for ent in pred_doc.ents:
                entity_key = (ent.start_char, ent.end_char, ent.label_)
                pred_ents.add(entity_key)
            
            # Calculate TP, FP, FN for each label
            all_labels = set([ent[2] for ent in true_ents | pred_ents])
            
            for label in all_labels:
                true_label_ents = {ent for ent in true_ents if ent[2] == label}
                pred_label_ents = {ent for ent in pred_ents if ent[2] == label}
                
                label_stats[label]['tp'] += len(true_label_ents & pred_label_ents)
                label_stats[label]['fp'] += len(pred_label_ents - true_label_ents)
                label_stats[label]['fn'] += len(true_label_ents - pred_label_ents)
        
        print("\nPer-Label Performance:")
        print("-" * 60)
        print(f"{'Label':<15} {'Precision':<12} {'Recall':<12} {'F1-Score':<12}")
        print("-" * 60)
        
        for label in sorted(label_stats.keys()):
            stats = label_stats[label]
            
            if stats['tp'] + stats['fp'] == 0:
                precision = 0.0
            else:
                precision = stats['tp'] / (stats['tp'] + stats['fp'])
            
            if stats['tp'] + stats['fn'] == 0:
                recall = 0.0
            else:
                recall = stats['tp'] / (stats['tp'] + stats['fn'])
            
            if precision + recall == 0:
                f1 = 0.0
            else:
                f1 = 2 * (precision * recall) / (precision + recall)
            
            print(f"{label:<15} {precision:<12.4f} {recall:<12.4f} {f1:<12.4f}")
    
    def cross_validate(self, training_data, CourseNER_class, k_folds=5, iterations=100, random_seed=42):
        """Perform k-fold cross validation"""
        random.seed(random_seed)
        shuffled_data = training_data.copy()
        random.shuffle(shuffled_data)
        
        fold_size = len(shuffled_data) // k_folds
        cv_scores = []
        
        print(f"Performing {k_folds}-fold cross validation...")
        
        for fold in range(k_folds):
            print(f"\nFold {fold + 1}/{k_folds}")
            
            # Split data
            start_idx = fold * fold_size
            end_idx = start_idx + fold_size if fold < k_folds - 1 else len(shuffled_data)
            
            test_fold = shuffled_data[start_idx:end_idx]
            train_fold = shuffled_data[:start_idx] + shuffled_data[end_idx:]
            
            # Create fresh model for this fold
            temp_ner = CourseNER_class()
            
            # Train and evaluate
            temp_ner.train_model(train_fold, iterations=iterations)
            temp_evaluator = NERModelEvaluator(temp_ner)
            scores = temp_evaluator.evaluate_model(test_fold, verbose=False)
            cv_scores.append(scores)
            
            print(f"Fold {fold + 1} F1: {scores['f1']:.4f}")
        
        # Calculate average scores
        avg_scores = {}
        for metric in cv_scores[0].keys():
            avg_scores[metric] = np.mean([score[metric] for score in cv_scores])
            avg_scores[f"{metric}_std"] = np.std([score[metric] for score in cv_scores])
        
        print("\n=== CROSS VALIDATION RESULTS ===")
        print(f"Average F1: {avg_scores['f1']:.4f} (±{avg_scores['f1_std']:.4f})")
        print(f"Average Precision: {avg_scores['precision']:.4f} (±{avg_scores['precision_std']:.4f})")
        print(f"Average Recall: {avg_scores['recall']:.4f} (±{avg_scores['recall_std']:.4f})")
        
        return avg_scores, cv_scores
    
    def hyperparameter_search(self, training_data, CourseNER_class, param_grid, iterations=50):
        """Search for best hyperparameters"""
        print("Starting hyperparameter search...")
        
        best_score = 0
        best_params = None
        results = []
        
        # Split data for consistent evaluation
        train_data, val_data = self.split_data(training_data, test_split=0.2)
        
        for params in param_grid:
            print(f"\nTesting parameters: {params}")
            
            # Create model with these parameters
            temp_ner = CourseNER_class()
            
            # Set each hyperparameter as an attribute on the model. This only has
            # an effect if train_model reads the attribute (e.g. self.dropout).
            for param, value in params.items():
                setattr(temp_ner, param, value)
            
            # Train model
            temp_ner.train_model(train_data, iterations=iterations)
            
            # Evaluate
            temp_evaluator = NERModelEvaluator(temp_ner)
            scores = temp_evaluator.evaluate_model(val_data, verbose=False)
            f1_score = scores['f1']
            
            results.append({
                'params': params,
                'f1_score': f1_score,
                'precision': scores['precision'],
                'recall': scores['recall']
            })
            
            print(f"F1 Score: {f1_score:.4f}")
            
            if f1_score > best_score:
                best_score = f1_score
                best_params = params
        
        print("\n=== HYPERPARAMETER SEARCH RESULTS ===")
        print(f"Best F1 Score: {best_score:.4f}")
        print(f"Best Parameters: {best_params}")
        
        return best_params, results
    
    def analyze_errors(self, test_data, show_examples=5):
        """Analyze common errors and show examples"""
        if isinstance(test_data[0], tuple):  # Raw data format
            test_examples = []
            for text, annotations in test_data:
                doc = self.model.nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                test_examples.append(example)
        else:
            test_examples = test_data
        
        false_positives = []
        false_negatives = []
        
        for example in test_examples:
            # True entities
            true_ents = set()
            for ent in example.reference.ents:
                true_ents.add((ent.start_char, ent.end_char, ent.label_))
            
            # Predicted entities
            pred_doc = self.model.nlp(example.text)
            pred_ents = set()
            for ent in pred_doc.ents:
                pred_ents.add((ent.start_char, ent.end_char, ent.label_))
            
            # Find false positives and negatives
            fps = pred_ents - true_ents
            fns = true_ents - pred_ents
            
            for fp in fps:
                false_positives.append({
                    'text': example.text,
                    'entity': example.text[fp[0]:fp[1]],
                    'predicted_label': fp[2],
                    'context': example.text[max(0, fp[0]-20):fp[1]+20]
                })
            
            for fn in fns:
                false_negatives.append({
                    'text': example.text,
                    'entity': example.text[fn[0]:fn[1]],
                    'true_label': fn[2],
                    'context': example.text[max(0, fn[0]-20):fn[1]+20]
                })
        
        print("\n=== ERROR ANALYSIS ===")
        print(f"Total False Positives: {len(false_positives)}")
        print(f"Total False Negatives: {len(false_negatives)}")
        
        if false_positives:
            print(f"\nTop {show_examples} False Positives:")
            for i, fp in enumerate(false_positives[:show_examples]):
                print(f"{i+1}. Entity: '{fp['entity']}' | Predicted: {fp['predicted_label']}")
                print(f"   Context: ...{fp['context']}...")
        
        if false_negatives:
            print(f"\nTop {show_examples} False Negatives:")
            for i, fn in enumerate(false_negatives[:show_examples]):
                print(f"{i+1}. Entity: '{fn['entity']}' | True Label: {fn['true_label']}")
                print(f"   Context: ...{fn['context']}...")
        
        return false_positives, false_negatives

# Usage Example:
def evaluate_ner_model(trained_model, training_data, CourseNER_class):
    """Complete evaluation pipeline for your trained NER model"""
    
    # Initialize evaluator
    evaluator = NERModelEvaluator(trained_model)
    
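    # Split data for testing. If trained_model was fit on all of training_data,
    # this test split overlaps its training set and the scores will be optimistic.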
    train_data, test_data = evaluator.split_data(training_data, test_split=0.2)
    
    # Basic evaluation
    print("="*50)
    print("BASIC MODEL EVALUATION")
    print("="*50)
    metrics = evaluator.evaluate_model(test_data)
    
    # Cross validation
    print("\n" + "="*50)
    print("CROSS VALIDATION")
    print("="*50)
    cv_scores, _ = evaluator.cross_validate(training_data, CourseNER_class, k_folds=3, iterations=50)
    
    # Error analysis
    print("\n" + "="*50)
    print("ERROR ANALYSIS")
    print("="*50)
    fps, fns = evaluator.analyze_errors(test_data)
    
    # Hyperparameter search example
    param_grid = [
        {'dropout': 0.3},
        {'dropout': 0.5},
        {'dropout': 0.7}
    ]
    
    print("\n" + "="*50)
    print("HYPERPARAMETER SEARCH")
    print("="*50)
    best_params, search_results = evaluator.hyperparameter_search(
        training_data, CourseNER_class, param_grid, iterations=30
    )
    
    return evaluator, metrics, cv_scores

# Key Hyperparameters to Modify for Better Accuracy:
"""
HYPERPARAMETERS TO TUNE IN YOUR ORIGINAL CourseNER CLASS:

1. Training Parameters (modify in train_model method):
   - iterations: 50-500 (more for complex datasets)
   - dropout: 0.2-0.8 in nlp.update() call
   
2. Batch Configuration (modify in train_model method):
   - minibatch size parameters in compounding()
   - batch_size start: 1.0-8.0
   - batch_size end: 16.0-64.0
   - batch_size compound: 1.001-1.1

3. Model Architecture (modify in __init__):
   - Use pre-trained model: spacy.load("en_core_web_sm") instead of spacy.blank("en")
   - Add custom components or features

4. Data Improvements:
   - Increase training data size
   - Improve annotation quality
   - Balance entity type distribution
   - Add data augmentation

5. Advanced Techniques:
   - Ensemble multiple models
   - Active learning for hard examples
   - Domain-specific word embeddings
"""


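The batch-size notes above refer to the three arguments passed to `compounding()`. As a rough illustration of what those numbers mean, here is a pure-Python sketch of such a schedule. It assumes the schedule starts at `start` and is multiplied by `compound` after each batch until it is capped at `stop`; this is a re-implementation for illustration rather than spaCy's own code, and `compounding_schedule` and `batches_to_reach` are hypothetical names.

```python
import itertools
import math


def compounding_schedule(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def batches_to_reach(start, stop, compound):
    """Number of batches drawn before the schedule first reaches `stop`."""
    return math.ceil(math.log(stop / start) / math.log(compound))


# First few batch sizes for the settings used in train_model
first_sizes = list(itertools.islice(compounding_schedule(4.0, 32.0, 1.001), 5))
print(first_sizes)

# How long each compound factor takes to grow from 4 to 32
print(batches_to_reach(4.0, 32.0, 1.001))
print(batches_to_reach(4.0, 32.0, 1.1))
```

With a compound factor of 1.001 the schedule needs on the order of two thousand batches to reach its upper bound. Because `train_model` builds a new schedule inside every iteration and the training set is small, the batch size there appears to stay close to 4 throughout, so the upper bound has little practical effect unless the compound factor is raised.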
In [84]:
# Example usage and demonstration
def main(output_df, training_data, title_column='course_name', description_column='description_text'):
    """Main function to demonstrate the NER model with input DataFrame"""
    
    # Initialize the NER model
    course_ner = CourseNER()
    
    print(f"Training on {len(training_data)} training examples")
    
    # Train the model
    course_ner.train_model(training_data, iterations=100)

    # Initialize evaluator
    evaluator = NERModelEvaluator(course_ner)

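    # Split data and evaluate.
    # NOTE: the model above was trained on all of training_data, so this test
    # split was seen during training and these scores are optimistic. The
    # cross-validation below trains a fresh model per fold and is a fairer estimate.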
    train_data, test_data = evaluator.split_data(training_data, test_split=0.2)
    metrics = evaluator.evaluate_model(test_data)

    # Cross validation
    cv_scores, _ = evaluator.cross_validate(training_data, CourseNER, k_folds=5)

    # Error analysis
    fps, fns = evaluator.analyze_errors(test_data)
    
    # Process the input DataFrame
    print(f"\nProcessing input DataFrame with {len(output_df)} rows...")
    results_df = course_ner.process_course_dataframe(output_df, title_column, description_column)
    
    print("\nResults DataFrame:")
    print(results_df)
    
    # Save model for future use
    course_ner.save_model()
    
    return course_ner, results_df

# Run the demonstration
if __name__ == "__main__":
    model, results_df = main(df_small_full, training_data) 

Training on 27 training examples
Training custom NER model for 100 iterations...
Iteration 0, Losses: {'ner': 1322.8350258469582}
Iteration 20, Losses: {'ner': 77.12062924336934}
Iteration 40, Losses: {'ner': 29.25144657089912}
Iteration 60, Losses: {'ner': 11.37146009689421}
Iteration 80, Losses: {'ner': 10.369859737908921}
Training completed!
Split data: 21 training examples, 6 test examples

=== MODEL EVALUATION RESULTS ===
Token-level Precision: 1.0000
Token-level Recall: 1.0000
Token-level F1-Score: 1.0000

Entity-level Precision: 1.0000
Entity-level Recall: 1.0000
Entity-level F1-Score: 1.0000

Detailed Classification Report:
              precision    recall  f1-score   support

DATA_CONCEPT       1.00      1.00      1.00        16
           O       1.00      1.00      1.00       340
   PROG_LANG       1.00      1.00      1.00        17

    accuracy                           1.00       373
   macro avg       1.00      1.00      1.00       373
weighted avg       1.00      1.00 
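
The perfect scores above are consistent with evaluating on examples the model has already seen: `main` trains on all of `training_data` before `split_data` carves out the test set. The sketch below illustrates the effect with a toy stand-in, not the NER model itself; `MemorizingModel` and the synthetic `data` list are assumptions made purely for illustration.

```python
import random


def split_data(data, test_split=0.2, seed=42):
    """Shuffle a copy of `data` and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_split))
    return shuffled[:cut], shuffled[cut:]


class MemorizingModel:
    """Toy stand-in for a model: it only 'knows' the exact inputs it was trained on."""

    def __init__(self):
        self.memory = {}

    def train(self, examples):
        self.memory = dict(examples)

    def accuracy(self, examples):
        correct = sum(1 for text, label in examples if self.memory.get(text) == label)
        return correct / len(examples)


# Synthetic data with the same size as the training set in this notebook
data = [(f"course {i}", f"label {i}") for i in range(27)]
train, test = split_data(data)

leaky = MemorizingModel()
leaky.train(data)    # trained on everything, including the test split
clean = MemorizingModel()
clean.train(train)   # trained on the training split only

print("evaluated on seen data:  ", leaky.accuracy(test))
print("evaluated on unseen data:", clean.accuracy(test))
```

A real model generalizes better than this memorizer, but the direction of the bias is the same. Splitting first and calling `train_model` on the training portion only would give a score that reflects unseen course descriptions.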

In [55]:
results_df.head(3)

Unnamed: 0,course_name,named_entities,entity_count
0,Introduction to Python,Python (PROG_LANG); Python (PROG_LANG); Python...,21
1,Introduction to SQL,SQL (PROG_LANG); SQL (PROG_LANG); relational d...,18
2,Introduction to Power BI,Power BI (SOFTWARE); Power BI (SOFTWARE); Powe...,13


## Generate List of Named Entities

In [56]:
# Convert the named_entities column to list of entity names
entity_lists = []

for index, row in results_df.iterrows():
    entities = []
    named_entities = row['named_entities']
    
    if isinstance(named_entities, str) and named_entities.strip():
        entity_parts = named_entities.split(';')
        
        for part in entity_parts:
            part = part.strip()
            if part:
                # Extract entity name before the parentheses using regex
                # Pattern matches everything before " (" 
                match = re.match(r'^(.+?)\s*\([^)]+\)$', part)
                if match:
                    entity_name = match.group(1).strip()
                    entity_name = entity_name.replace('\n', '').replace('\r', '').replace('\\n', '')
                    entity_name = re.sub(r'\s+', ' ', entity_name).strip()
                    if entity_name:
                        entities.append(entity_name)
    
    entity_lists.append(entities)

# Create new column of entity names lists
results_df['entity_list'] = entity_lists
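
To make the parsing step above easier to check in isolation, here is a small sketch that applies the same regular expression to a single `named_entities` string. The function name `parse_entity_names` and the sample strings are illustrative assumptions, not part of the original notebook.

```python
import re

# Same pattern as in the cell above: capture everything before the trailing " (LABEL)"
ENTITY_PATTERN = re.compile(r'^(.+?)\s*\([^)]+\)$')


def parse_entity_names(named_entities):
    """Return the entity names from a 'text (LABEL); text (LABEL)' string."""
    names = []
    for part in named_entities.split(';'):
        match = ENTITY_PATTERN.match(part.strip())
        if match:
            # Collapse repeated internal whitespace to single spaces
            names.append(re.sub(r'\s+', ' ', match.group(1)).strip())
    return names


print(parse_entity_names("Python (PROG_LANG); Power BI (SOFTWARE); relational  databases (DATA_CONCEPT)"))
print(parse_entity_names("No entities found"))
```

The placeholder string `'No entities found'` has no trailing label in parentheses, so it does not match the pattern and yields an empty list.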

In [57]:
results_df.head(2)

Unnamed: 0,course_name,named_entities,entity_count,entity_list
0,Introduction to Python,Python (PROG_LANG); Python (PROG_LANG); Python...,21,"[Python, Python, Python, Python, Python, Pytho..."
1,Introduction to SQL,SQL (PROG_LANG); SQL (PROG_LANG); relational d...,18,"[SQL, SQL, relational databases, databases, SQ..."
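
The parsing step above can be sketched on a single string, assuming the `name (LABEL); name (LABEL)` layout shown in `named_entities`. The helper name `parse_named_entities` and the sample text are hypothetical and only illustrate the regex; unlike the cell above, this version also keeps the label.

```python
import re

# Same idea as the cell above: capture the name before " (" and, additionally, the label.
ENTITY_PATTERN = re.compile(r'^(.+?)\s*\(([^)]+)\)$')

def parse_named_entities(named_entities):
    """Split 'name (LABEL); name (LABEL)' into a list of (name, label) pairs."""
    pairs = []
    if not isinstance(named_entities, str):
        return pairs  # e.g. NaN for a course with no entities
    for part in named_entities.split(';'):
        # Collapse newlines and repeated spaces before matching
        part = re.sub(r'\s+', ' ', part).strip()
        match = ENTITY_PATTERN.match(part)
        if match:
            pairs.append((match.group(1).strip(), match.group(2)))
    return pairs

# Made-up sample string in the same layout as the named_entities column
sample = "Python (PROG_LANG); relational\ndatabases (DATA_CONCEPT); Power BI (SOFTWARE)"
print(parse_named_entities(sample))
```

Collapsing whitespace before matching matters because `.` does not match a newline by default, so an entity split across lines would otherwise be skipped.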


In [58]:
# Combine Course Information with Named Entity Recognition Results
full_df = pd.merge(df_small_full, results_df, on='course_name', how='inner')
print("Course Information DF contained " + str(len(df_small_full)) + " rows")
print("Merged DF contains " + str(len(full_df)) + " rows")

Course Information DF contained 582 rows
Merged DF contains 582 rows
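
Matching row counts suggest that no course was dropped, but with an inner join a dropped row could in principle be offset by a duplicated one. A stricter check, sketched here on toy frames (the course names and counts are made up, not taken from `df_small_full` or `results_df`), uses the `validate` and `indicator` options of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for df_small_full and results_df (hypothetical values)
courses = pd.DataFrame({'course_name': ['Course A', 'Course B', 'Course C']})
ner = pd.DataFrame({'course_name': ['Course A', 'Course B'], 'entity_count': [3, 5]})

# validate raises MergeError if course_name is duplicated on either side;
# indicator adds a _merge column showing where each row came from
checked = pd.merge(courses, ner, on='course_name', how='left',
                   validate='one_to_one', indicator=True)

unmatched = checked.loc[checked['_merge'] == 'left_only', 'course_name'].tolist()
print(unmatched)
```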


In [59]:
# Create Unique Named Entity List of Course Topics for User to Select From

STOPWORDS = {
    'in', 'abc', 'of', 'your', 'and'
}

# Map abbreviations to their full forms, since both can be identified as named entities
ABBREVIATION_MAPPINGS = {
    'ai': 'artificial intelligence',
    'ci': 'continuous integration',
    'ml': 'machine learning',
    'dl': 'deep learning',
    'nlp': 'natural language processing',
    'rl': 'reinforcement learning',
    'sql': 'structured query language',
    'api': 'application programming interface',
    'json': 'javascript object notation',
    'aws': 'amazon web services',
    'bi': 'business intelligence',
    'etl': 'extract transform load'
}

def normalize_abbreviation(entity):
    # convert any abbreviations to the full form per ABBREVIATION_MAPPINGS
    entity_lower = entity.lower().strip()
    
    if entity_lower in ABBREVIATION_MAPPINGS:
        return ABBREVIATION_MAPPINGS[entity_lower]
    
    words = entity_lower.split()
    normalized_words = []
    
    for word in words:
        if word in ABBREVIATION_MAPPINGS:
            normalized_words.append(ABBREVIATION_MAPPINGS[word])
        else:
            normalized_words.append(word)
    
    return ' '.join(normalized_words)

def remove_stopwords_from_entity(entity):
    words = entity.lower().split()
    
    while words and words[0] in STOPWORDS:
        words.pop(0)
    
    while words and words[-1] in STOPWORDS:
        words.pop()
    
    return ' '.join(words)

def clean_entity(entity):
    """Clean entity by removing excess characters and normalizing case"""
    # Remove leading/trailing whitespace
    cleaned = entity.strip()
    
    # Remove newline characters and replace with space
    cleaned = cleaned.replace('\n', ' ').replace('\r', ' ').replace('\\n', ' ')
    
    # Replace multiple spaces with single space
    cleaned = re.sub(r'\s+', ' ', cleaned)
    
    # Remove trailing punctuation (periods, commas, etc.)
    cleaned = cleaned.rstrip(string.punctuation)
    
    # Remove leading punctuation if needed
    cleaned = cleaned.lstrip(string.punctuation)
    
    # Remove stopwords from entity
    cleaned = remove_stopwords_from_entity(cleaned)
    
    # Normalize abbreviations to full forms
    cleaned = normalize_abbreviation(cleaned)
    
    return cleaned.strip()

def capitalize_entity(entity):
    # Text Formatting - Capitalize first letters of words or full abbreviation
    original_abbrev = None
    for abbrev, full_form in ABBREVIATION_MAPPINGS.items():
        if entity.lower() == full_form.lower():
            original_abbrev = abbrev.upper()
            break
    
    # Show abbreviations as ABBREVIATION (full phrase)
    if original_abbrev:
        words = entity.split()
        capitalized_words = [word.capitalize() for word in words]
        return f"{original_abbrev} ({' '.join(capitalized_words)})"
    
    words = entity.split()
    capitalized_words = []
    
    for word in words:
        # Short alphabetic words (2-5 letters) that are already uppercase or are
        # known abbreviations stay in all caps; everything else is capitalized
        is_short_alpha = word.isalpha() and 2 <= len(word) <= 5
        if is_short_alpha and (word.isupper() or word.lower() in ABBREVIATION_MAPPINGS):
            capitalized_words.append(word.upper())
        else:
            capitalized_words.append(word.capitalize())
    
    return ' '.join(capitalized_words)

# Extract entities and count occurrences
entity_counter = Counter()

for index, row in results_df.iterrows():
    entity_list = row['entity_list']
    
    for entity in entity_list:
        if entity:
            cleaned_entity = clean_entity(entity)
            
            if cleaned_entity and len(cleaned_entity) > 1:
                normalized_entity = cleaned_entity.lower()
                entity_counter[normalized_entity] += 1

unique_entities_list = []

for entity, count in entity_counter.items():
    if count > 3:  # Only include entities that occur more than 3 times
        capitalized_entity = capitalize_entity(entity)
        unique_entities_list.append(capitalized_entity)

unique_entities_list = sorted(unique_entities_list)

print(f"Total unique entities that occur more than 3 times: {len(unique_entities_list)}")
print(f"Abbreviations normalized: {len(ABBREVIATION_MAPPINGS)}")
print("\nFirst 10 unique entities (with counts):")
for entity in unique_entities_list[:10]:
    entity_for_lookup = entity.lower()
    if '(' in entity:  # Handle "AI (Artificial Intelligence)" format
        entity_for_lookup = entity.split('(')[1].rstrip(')').lower()
    count = entity_counter[entity_for_lookup]
    print(f"  - '{entity}' (appears {count} times)")

Total unique entities that occur more than 3 times: 70
Abbreviations normalized: 12

First 10 unique entities (with counts):
  - 'AI (Artificial Intelligence)' (appears 289 times)
  - 'API (Application Programming Interface)' (appears 7 times)
  - 'Analyzing' (appears 20 times)
  - 'Anomaly' (appears 4 times)
  - 'Application' (appears 9 times)
  - 'BI (Business Intelligence)' (appears 7 times)
  - 'Business' (appears 18 times)
  - 'Business Decisions' (appears 8 times)
  - 'Business Processes' (appears 5 times)
  - 'Business Questions' (appears 8 times)
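
The effect of the normalization and the frequency threshold can be seen on a small example. This is a simplified sketch, not the cell above: it uses a two-entry subset of `ABBREVIATION_MAPPINGS`, skips the stopword and punctuation cleanup, and the `raw` list is made up.

```python
from collections import Counter

# Subset of the abbreviation map, for illustration only
ABBREVIATIONS = {'ai': 'artificial intelligence', 'ml': 'machine learning'}
MIN_COUNT = 3  # keep entities that occur more than this many times

def normalize(entity):
    """Lowercase the entity and expand any word that is a known abbreviation."""
    words = entity.lower().split()
    return ' '.join(ABBREVIATIONS.get(word, word) for word in words)

# Hypothetical raw entities as the NER step might emit them
raw = ['AI', 'Artificial Intelligence', 'ai', 'Generative AI', 'ML', 'AI']

counts = Counter(normalize(entity) for entity in raw)
frequent = sorted(entity for entity, n in counts.items() if n > MIN_COUNT)

print(counts['artificial intelligence'])
print(frequent)
```

Because "AI" and "Artificial Intelligence" collapse to the same key, their occurrences are counted together, which is why the abbreviated topics rank so highly in the output above.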


In [60]:
print(unique_entities_list)

['AI (Artificial Intelligence)', 'API (Application Programming Interface)', 'Analyzing', 'Anomaly', 'Application', 'BI (Business Intelligence)', 'Business', 'Business Decisions', 'Business Processes', 'Business Questions', 'Business Value', 'DL (Deep Learning)', 'Data Manipulation', 'Data Visualization', 'Databases', 'Databricks', 'Deep Dive', 'Deep Understanding', 'Dictionaries', 'ETL (Extract Transform Load)', 'Excel', 'Final', 'Financial', 'Foundation', 'Foundational', 'Functions', 'Generative Artificial Intelligence', 'Loops', 'ML (Machine Learning)', 'Machine', 'Machine Translation', 'Matplotlib', 'Model', 'Nosql', 'Numpy', 'Pandas', 'Parallel', 'Performance', 'Pinecone', 'Pipelines', 'Pivot', 'Plotly', 'Polars', 'Postgresql', 'Power', 'Power Analysis', 'Power Business Intelligence', 'Power Pivot', 'Power Query', 'Practicing', 'Processing', 'Production', 'Professional', 'Prompt', 'Pyspark', 'Python', 'Queries', 'Query', 'Recurrent', 'Regression', 'Reinforcement', 'Relational Datab

# User Interface to Select Topics and Courses

In [72]:
def jupyter_entity_filter_app(df, unique_entities_list, entity_counter):
    
    # Build the selector options; each display string maps back to its entity
    entity_options = []
    entity_mapping = {}  # Maps display string to actual entity
    
    for entity in unique_entities_list:
        display_name = f"{entity}"
        entity_options.append(display_name)
        entity_mapping[display_name] = entity
    
    # Entity selection widget
    entity_selector = widgets.SelectMultiple(
        options=entity_options,
        description='Select Entities:',
        disabled=False,
        layout=widgets.Layout(width='600px', height='200px'),
        style={'description_width': 'initial'}
    )
    
    # Output widgets
    filter_output = widgets.Output()
    course_selector_output = widgets.Output()
    results_output = widgets.Output()
    
    # Storage for selected courses
    selected_courses_storage = {'df': None}
    
    def filter_courses(selected_entity_options):
        with filter_output:
            filter_output.clear_output()
            
            if not selected_entity_options:
                print("Please select at least one entity to filter courses.")
                return
            
            # Extract actual entity names
            selected_entities = [entity_mapping[option] for option in selected_entity_options]
            
            print("Selected Entities:")
            for entity in selected_entities:
                print(f"  • {entity}")
            print()
            
            # Filter DataFrame
            filtered_df = df[df['entity_list'].apply(
                lambda x: matches_any_entity(x, selected_entities)
            )].copy()
            
            print(f"Found {len(filtered_df)} matching courses:")
            
            if len(filtered_df) == 0:
                print("No courses found containing the selected entities.")
                return
            
            # Display course selection interface
            display_course_selector(filtered_df)
    
    def matches_any_entity(entity_list, target_entities):
        """Check if any target entity appears in the entity_list"""
        if not entity_list:
            return False
        
        for target_entity in target_entities:
            if entity_matches_single(entity_list, target_entity):
                return True
        return False
    
    def entity_matches_single(entity_list, target_entity):
        """Check if target_entity appears in the entity_list (case-insensitive)"""
        # Handle the display format "ABBREV (Full Form)" by extracting both parts
        target_variants = set()
        target_lower = target_entity.lower()
        target_variants.add(target_lower)
        
        # If target has abbreviation format, extract both abbreviation and full form
        if '(' in target_entity:
            # Extract abbreviation: "AI (Artificial Intelligence)" -> "ai"
            abbrev = target_entity.split('(')[0].strip().lower()
            target_variants.add(abbrev)
            
            # Extract full form: "AI (Artificial Intelligence)" -> "artificial intelligence"  
            full_form = target_entity.split('(')[1].rstrip(')').strip().lower()
            target_variants.add(full_form)
        
        # Check each entity in the list against all target variants
        for entity in entity_list:
            entity_lower = entity.lower().strip()
            
            # Remove stopwords and normalize the entity for comparison
            entity_clean = remove_stopwords_from_entity(entity_lower)
            entity_normalized = normalize_abbreviation(entity_clean)
            
            # Check if any target variant matches
            for variant in target_variants:
                if (variant == entity_lower or 
                    variant == entity_clean or 
                    variant == entity_normalized):
                    return True
        
        return False
    
    def display_course_selector(filtered_df):
        with course_selector_output:
            course_selector_output.clear_output()
            
            print("\nSelect courses you're interested in:")
            print("-" * 50)
            
            # Create checkboxes for each course
            course_checkboxes = []
            course_data = []
            
            for idx, row in filtered_df.iterrows():
                course_name = row.get('course_name', f'Course {idx}')
                course_url = row.get('course_url', '')
                skill_level = row.get('skill_level', 'Unknown')
                entities_in_course = row.get('entity_list', [])
                description = row.get('description_text', '')
                
                # Create checkbox
                checkbox = widgets.Checkbox(
                    value=False,
                    description=f"{course_name} ({skill_level})",
                    disabled=False,
                    indent=False,
                    layout=widgets.Layout(width='600px'),
                    style={'description_width': 'initial'}
                )
                
                course_checkboxes.append(checkbox)
                course_data.append({
                    'course_name': course_name,
                    'course_url': course_url,
                    'skill_level': skill_level,
                    'entities': entities_in_course,
                    'description': description
                })
            
            # Display checkboxes
            for checkbox in course_checkboxes:
                display(checkbox)
            
            # Create submit button
            submit_button = widgets.Button(
                description='Save Selected Courses',
                disabled=False,
                button_style='success',
                tooltip='Save selected courses to DataFrame',
                icon='check'
            )
            
            def on_submit_clicked(b):
                selected_courses = []
                
                for i, checkbox in enumerate(course_checkboxes):
                    if checkbox.value:
                        selected_courses.append(course_data[i])
                
                if selected_courses:
                    selected_courses_df = pd.DataFrame(selected_courses)
                    selected_courses_storage['df'] = selected_courses_df
                    
                    with results_output:
                        results_output.clear_output()
                        print(f"✅ Successfully saved {len(selected_courses)} selected courses!")
                        print("\nSelected Courses:")
                        print("-" * 40)
                        
                        for i, course in enumerate(selected_courses, 1):
                            print(f"{i}. {course['course_name']} ({course['skill_level']})")
                                                
                        # Also display the DataFrame
                        print("\nSelected Courses DataFrame:")
                        display(selected_courses_df[['course_name', 'skill_level', 'course_url']])
                else:
                    with results_output:
                        results_output.clear_output()
                        print("No courses selected.")
            
            submit_button.on_click(on_submit_clicked)
            display(submit_button)
    
    def get_selected_courses():
        """Function to retrieve selected courses DataFrame"""
        return selected_courses_storage['df']
    
    # Create interactive widget
    interactive_widget = interactive(filter_courses, selected_entity_options=entity_selector)
    
    # Display the interface
    print("Course Entity Filter - Jupyter Notebook Version")
    print("=" * 50)
    display(interactive_widget)
    display(filter_output)
    display(course_selector_output)
    display(results_output)
    
    return get_selected_courses
 
get_selected = jupyter_entity_filter_app(full_df, unique_entities_list, entity_counter)
#selected_courses_df = get_selected()

Course Entity Filter - Jupyter Notebook Version


interactive(children=(SelectMultiple(description='Select Entities:', layout=Layout(height='200px', width='600p…

Output()

Output()

Output()
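
Since the widgets only render in a live notebook, the matching rule behind the filter can be checked on its own. The sketch below is a simplified, widget-free version: the helper names `target_variants` and `course_matches` are hypothetical, and it omits the stopword removal and abbreviation expansion that `entity_matches_single` applies to each course entity.

```python
def target_variants(label):
    """Return the lowercase forms a display label can match.

    'AI (Artificial Intelligence)' yields the full label, 'ai',
    and 'artificial intelligence'.
    """
    variants = {label.lower()}
    if '(' in label:
        abbrev, _, rest = label.partition('(')
        variants.add(abbrev.strip().lower())
        variants.add(rest.rstrip(')').strip().lower())
    return variants

def course_matches(entity_list, selected_labels):
    """True if any entity of the course equals a variant of any selected label."""
    entities = {entity.lower().strip() for entity in entity_list}
    return any(entities & target_variants(label) for label in selected_labels)

print(sorted(target_variants('AI (Artificial Intelligence)')))
print(course_matches(['Python', 'ai'], ['AI (Artificial Intelligence)']))
print(course_matches(['Python'], ['SQL (Structured Query Language)']))
```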

# Create Course Schedule With Kahn's Algorithm

Kahn's algorithm finds a topological sort of a Directed Acyclic Graph (DAG): a linear ordering of the vertices such that for every directed edge (u, v), vertex u comes before vertex v.  In this use case, each prerequisite must come before the course that requires it.

I'm using Kahn's algorithm on the list of courses generated from the user's selections, so that every prerequisite appears before the course that requires it.  The ordering accounts for courses that share prerequisites and for prerequisites that have prerequisites of their own.  Each course can have up to 3 prerequisites.
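
For reference, Kahn's algorithm itself fits in a few lines: repeatedly take a node with no remaining incoming edges, append it to the order, and remove its outgoing edges. The function below is a minimal standalone sketch (the name `kahn_topological_sort` and the edge list are illustrative, and it is not the code this notebook runs).

```python
from collections import deque

def kahn_topological_sort(edges):
    """Order nodes so that u precedes v for every (u, v) edge; raise on a cycle."""
    successors = {}
    in_degree = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
        successors.setdefault(v, [])
        in_degree[v] = in_degree.get(v, 0) + 1
        in_degree.setdefault(u, 0)

    # Start with every node that has no prerequisites
    queue = deque(node for node, degree in in_degree.items() if degree == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        # "Remove" the node's outgoing edges; newly freed nodes join the queue
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)

    # Nodes left unprocessed can only be part of a cycle
    if len(order) != len(in_degree):
        raise ValueError("Cycle detected in prerequisites")
    return order

# Edges point from prerequisite to course
edges = [
    ("Introduction to Python", "Intermediate Python"),
    ("Intermediate Python", "Data Manipulation with pandas"),
    ("Introduction to Tableau", "Analyzing Data in Tableau"),
]
print(kahn_topological_sort(edges))
```

A topological order is generally not unique; any order in which each prerequisite precedes its course is valid.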

In [73]:
selected_courses_df = get_selected()

In [63]:
# Create DF of target courses with all layers of prerequisite(s) for each
course_list_df = pd.DataFrame(columns=['course_name','prerequisite'])

# master list of valid course names, sorted longest-first to prioritize longer matches
all_courses = sorted(full_df['course_name'].astype(str).str.strip().tolist(), key=len, reverse=True)

for course in selected_courses_df['course_name']:
    course = re.sub(r"[\[\]'\"]", "", str(course)).strip()
    stack, seen = [course], set()

    while stack:
        c = re.sub(r"[\[\]'\"]", "", str(stack.pop())).strip()
        if c in seen: 
            continue
        seen.add(c)

        prereqs_raw = full_df.loc[full_df['course_name'].astype(str).str.strip()==c, 'prerequisites']
        
        if prereqs_raw.empty:
            course_list_df.loc[len(course_list_df)] = [c, "there are no prerequisites for this course"]
        else:
            for raw in prereqs_raw:
                raw = re.sub(r"[\[\]'\"]", "", str(raw)).strip()
                
                prereq_matches = []
                remaining = raw
                for crs in all_courses:
                    if crs in remaining:
                        prereq_matches.append(crs)
                        # remove the match once found to avoid double-counting
                        remaining = remaining.replace(crs, "", 1)
                
                if not prereq_matches:
                    course_list_df.loc[len(course_list_df)] = [c, "there are no prerequisites for this course"]
                else:
                    for p in prereq_matches:
                        # c is expanded only once per target course (tracked in seen),
                        # so each (course, prerequisite) row is added once here
                        course_list_df.loc[len(course_list_df)] = [c, p]
                        stack.append(p)

In [64]:
course_list_df

Unnamed: 0,course_name,prerequisite
0,Anomaly Detection in Python,Supervised Learning with scikit-learn
1,Supervised Learning with scikit-learn,Introduction to Statistics in Python
2,Introduction to Statistics in Python,Data Manipulation with pandas
3,Data Manipulation with pandas,Intermediate Python
4,Intermediate Python,Introduction to Python
5,Introduction to Python,there are no prerequisites for this course
6,Time Series Analysis in Tableau,Calculations in Tableau
7,Calculations in Tableau,Analyzing Data in Tableau
8,Analyzing Data in Tableau,Introduction to Tableau
9,Introduction to Tableau,there are no prerequisites for this course
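
The traversal that builds this table is an iterative depth-first walk over the prerequisite relation. Stripped of the string cleanup and DataFrame lookups, it might look like the sketch below, which assumes the prerequisites are already available as a dictionary (`PREREQS` and `collect_edges` are hypothetical names; the notebook derives the same information from `full_df['prerequisites']`).

```python
# Hypothetical prerequisite map: course -> list of direct prerequisites
PREREQS = {
    "Calculations in Tableau": ["Analyzing Data in Tableau"],
    "Analyzing Data in Tableau": ["Introduction to Tableau"],
    "Introduction to Tableau": [],
}

def collect_edges(targets, prereqs):
    """Return (prerequisite, course) pairs for the targets and all their ancestors."""
    edges, seen, stack = [], set(), list(targets)
    while stack:
        course = stack.pop()
        if course in seen:
            continue  # already expanded via another path
        seen.add(course)
        for prereq in prereqs.get(course, []):
            edges.append((prereq, course))
            stack.append(prereq)
    return edges

print(collect_edges(["Calculations in Tableau"], PREREQS))
```

One design difference from the cell above: `seen` is shared across all target courses here, so a prerequisite common to several targets is expanded only once.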


In [65]:
# Kahn's Algorithm
df = course_list_df
G = nx.DiGraph()

# Add edges: prerequisites -> course
for _, row in df.iterrows():
    G.add_edge(row['prerequisite'], row['course_name'])

# Check for cycles
if not nx.is_directed_acyclic_graph(G):
    raise ValueError("Cycle detected in prerequisites")

# Perform topological sort
course_order = list(nx.topological_sort(G))

print("Recommended course order:")
for course in course_order:
    if course != 'there are no prerequisites for this course':
        print(course)

Recommended course order:
Introduction to Python
Introduction to Tableau
Intermediate Python
Analyzing Data in Tableau
Data Manipulation with pandas
Calculations in Tableau
Introduction to Statistics in Python
Time Series Analysis in Tableau
Supervised Learning with scikit-learn
Anomaly Detection in Python
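
The flat order interleaves the Python and Tableau chains because the two chains are independent of each other. One possible way to make that explicit is to group courses into stages, where every course in a stage has all of its prerequisites in earlier stages. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) rather than networkx, on a few rows taken from `course_list_df`; it is an alternative presentation, not part of the original pipeline.

```python
from graphlib import TopologicalSorter

# Course -> set of direct prerequisites (a few rows from course_list_df)
prerequisites = {
    "Intermediate Python": {"Introduction to Python"},
    "Data Manipulation with pandas": {"Intermediate Python"},
    "Analyzing Data in Tableau": {"Introduction to Tableau"},
}

sorter = TopologicalSorter(prerequisites)
sorter.prepare()  # raises graphlib.CycleError if the prerequisites contain a cycle

levels = []
while sorter.is_active():
    # Courses whose prerequisites have all been marked done
    ready = sorted(sorter.get_ready())
    levels.append(ready)
    sorter.done(*ready)

for number, courses in enumerate(levels, 1):
    print(f"Stage {number}: {', '.join(courses)}")
```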
