# Notebook 2: Process Courses Dataset 
Coursera.csv  
https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021

#### This notebook writes the following dataset to the _output_datasets_ folder:
```
(COURSE) NODE						course__node.csv
course_id:ID
course_name
course_difficulty_level
course_url
:LABEL = "COURSE"
```
#### It also writes intermediate datasets, used in the later Skill Matching steps, to the _temp_datasets_ folder:
```
(COURSE_SKILL) NODE					courses_skills_TEMP.csv
course_skill_id
course_skill_name

(COURSE_SKILL) RELATION					courses_skills_relationship_TEMP.csv
course_id
course_skill_id
```



## Imports

In [1]:
# %pip install stanza
# %pip install spacy
# %pip install nltk
# !python -m spacy download en_core_web_sm

import pandas as pd
import numpy as np
import stanza
import spacy
import re
stanza.download('en') 
nlp_spacy = spacy.load("en_core_web_sm")
nlp_stanza = stanza.Pipeline('en', processors='tokenize, ner', use_gpu=False, pos_batch_size=3000, download_method=None)

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 50.1MB/s]
2022-12-10 11:31:20 INFO: Downloading default packages for language: en (English) ...
2022-12-10 11:31:22 INFO: File exists: /Users/sergeygurvich/stanza_resources/en/default.zip
2022-12-10 11:31:26 INFO: Finished downloading models and saved to /Users/sergeygurvich/stanza_resources.
2022-12-10 11:31:26 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| ner       | ontonotes |

2022-12-10 11:31:26 INFO: Use device: cpu
2022-12-10 11:31:26 INFO: Loading: tokenize
2022-12-10 11:31:26 INFO: Loading: ner
2022-12-10 11:31:27 INFO: Done loading processors!


In [2]:
# this cell is to support running the notebook in Google Colab

mydrive = ""  # this is when we run locally

# Google Colab:
# from google.colab import drive
# drive.mount('/content/drive')
# mydrive = "/content/drive/MyDrive/DSE 203 — etl/DSE203_Project/"  # this is when we run on COLAB Leslie
# mydrive = "/content/drive/MyDrive/DSE203_Project/"  # this is when we run on COLAB Sergey

input_dir = mydrive+"input_datasets/"
output_dir = mydrive+"output_datasets/"
temp_dir = mydrive+"temp_datasets/"
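
The writing cells later in the notebook assume that the output and temp folders already exist. A minimal sketch of creating them on demand, assuming that is acceptable for the project; `demo_root` is a hypothetical temporary folder that stands in for `mydrive`:

```python
import os
import tempfile

# Hypothetical root used only for this illustration; in the notebook it would be `mydrive`.
demo_root = tempfile.mkdtemp()

demo_dirs = {
    "input": os.path.join(demo_root, "input_datasets"),
    "output": os.path.join(demo_root, "output_datasets"),
    "temp": os.path.join(demo_root, "temp_datasets"),
}

# exist_ok=True makes the call a no-op when the folder is already there.
for path in demo_dirs.values():
    os.makedirs(path, exist_ok=True)

print(sorted(os.listdir(demo_root)))
```

Building paths with `os.path.join` also avoids depending on a trailing slash in the root string.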

## Read Data

In [3]:
# SMALL, SAMPLED DATASET FOR TESTING:
# courses_df = pd.read_csv(input_dir+'coursera_small.csv')

# FULL DATASET:
courses_df = pd.read_csv(input_dir+'Coursera.csv')

courses_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Course Name         20 non-null     object
 1   University          20 non-null     object
 2   Difficulty Level    20 non-null     object
 3   Course Rating       20 non-null     object
 4   Course URL          20 non-null     object
 5   Course Description  20 non-null     object
 6   Skills              20 non-null     object
dtypes: object(7)
memory usage: 1.2+ KB


In [4]:
courses_df.rename(columns = {"Course Name":"course_name",
                     "Course Rating": "course_rating",
                     "Difficulty Level": "course_difficulty_level",
                     "Course URL": "course_url",
                     "Course Description": "course_description",
                     "Skills": "course_skills"
                    }, inplace=True)

In [5]:
columns_to_leave = ["course_name", 
                    "course_description",
                    "course_difficulty_level",
                    "course_url",
                    "course_skills"]
courses_df = courses_df[columns_to_leave].copy()  # copy, so later column assignments do not write to a view
courses_df.columns

Index(['course_name', 'course_description', 'course_difficulty_level',
       'course_url', 'course_skills'],
      dtype='object')

## Text Cleaning

In [6]:
def clean_text(string):
    '''
    remove every character except ASCII letters, spaces and the punctuation , . ? !
    (digits and all other symbols are dropped)
    '''
    string = re.sub('[^a-zA-Z,.?! ]+', '', string)
    return string


columns_to_clean = ['course_name', 'course_description', 'course_skills']

for column in columns_to_clean:
    courses_df[column] = courses_df[column].apply(clean_text)
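
To see what the pattern in `clean_text` keeps and drops, here is a small sketch on a made-up string. `clean_text_demo` is a hypothetical copy of the function above, repeated so the block runs on its own:

```python
import re

def clean_text_demo(string):
    # Same pattern as clean_text: keep letters, spaces and , . ? !
    return re.sub('[^a-zA-Z,.?! ]+', '', string)

sample = "C++ & Python 3"  # made-up input
cleaned = clean_text_demo(sample)
print(repr(cleaned))  # 'C  Python '
```

Note that removing symbols between two spaces leaves a double space behind. Since the skills column is later split on two spaces, it may be worth checking whether this introduces unintended splits in the real data.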

### Convert Skills string to list

In [7]:
courses_df.course_skills = courses_df.course_skills.str.split('  ')
courses_df.head(3)

Unnamed: 0,course_name,course_description,course_difficulty_level,course_url,course_skills
0,Python and Statistics for Financial Analysis,Course Overview httpsyoutu.beJgFVqzAYno Pytho...,Advanced,https://www.coursera.org/learn/python-statisti...,"[Data Analysis, Python, Python Programming, fi..."
1,Parallel programming,With every smartphone and computer now boastin...,Beginner,https://www.coursera.org/learn/parprog1,"[Data Structures, parallel algorithm, openfabr..."
2,Getting Started with Go,"Learn the basics of Go, an open source program...",Intermediate,https://www.coursera.org/learn/golang-getting-...,"[record computer science, language, structured..."
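
The split above relies on skills being separated by two spaces, while a single space stays inside a skill name. A toy illustration with made-up values:

```python
import pandas as pd

# Made-up row; two spaces separate skills, one space separates words within a skill.
demo = pd.DataFrame({"course_skills": ["Data Analysis  Python  Python Programming"]})

demo["course_skills"] = demo["course_skills"].str.split("  ")
skills = demo.loc[0, "course_skills"]
print(skills)  # ['Data Analysis', 'Python', 'Python Programming']
```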


## Apply NER to extract skills

In [8]:
def extract_entities_stanza(text):
    '''
    Extract skill candidates from a single text value (one cell of a dataframe column)
    using the Stanza NER pipeline. Only ORG and PRODUCT entities are kept.
    '''
    
    doc = nlp_stanza(text)
    entities_skills = doc.entities
    
    # the set comprehension drops repeated mentions of the same entity
    result = list({x.text for x in entities_skills if x.type in ('ORG', 'PRODUCT')})
    
    return result

In [9]:
def extract_entities_spacy(text):
    '''
    Extract skill candidates from a single text value (one cell of a dataframe column)
    using the spaCy NER pipeline. Only ORG and PRODUCT entities are kept.
    '''

    doc = nlp_spacy(text, disable=["tok2vec", "parser"])
    entities_skills = doc.ents
    
    # the set comprehension drops repeated mentions of the same entity
    result = list({x.text for x in entities_skills if x.label_ in ('ORG', 'PRODUCT')})
    
    return result
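
Both extraction functions share the same filtering step: keep entities labelled ORG or PRODUCT and drop repeats. A sketch of that step in isolation, without loading any NLP model; `FakeEntity` and `keep_skill_like` are hypothetical names, and the entities are made up:

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy entity span, which exposes .text and .label_
FakeEntity = namedtuple("FakeEntity", ["text", "label_"])

def keep_skill_like(entities, labels=("ORG", "PRODUCT")):
    # The set removes repeated mentions; sorting makes the order reproducible.
    return sorted({e.text for e in entities if e.label_ in labels})

entities = [
    FakeEntity("Google", "ORG"),
    FakeEntity("Java", "PRODUCT"),
    FakeEntity("2021", "DATE"),
    FakeEntity("Google", "ORG"),
]
kept = keep_skill_like(entities)
print(kept)  # ['Google', 'Java']
```

The notebook functions return `list(set)` without sorting, so the order of their results is not guaranteed; sorting here is an addition for readability.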

In [10]:
def extend_lists(df):
    '''
    This function takes one row of the Courses dataframe, combines the skill lists
    from its different columns, lowercases them and returns the list of unique skills.
    '''
    # for simplicity
    one = df.course_name_stanza
    two = df.course_name_spacy
    three = df.description_stanza
    four = df.description_spacy
    five = df.course_skills
    
    # combine into one list (duplicates are removed below, after lowercasing)
    result = one + two + three + four + five
    
    # lowercase all skills
    result = [x.lower() for x in result]
    
    result = list(set(result))
    
    return result
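
Lowercasing before deduplicating matters, because the same skill can arrive with different capitalisation from different sources. A minimal sketch with made-up lists; `merge_skill_lists` is a hypothetical helper, not part of the notebook:

```python
def merge_skill_lists(*lists):
    # Flatten all lists and lowercase each skill before removing duplicates.
    merged = [skill.lower() for one_list in lists for skill in one_list]
    return sorted(set(merged))

merged_skills = merge_skill_lists(
    ["Python"],                    # e.g. from the course name
    [],                            # a source that found nothing
    ["Jupyter Notebook"],          # e.g. from the description
    ["python"],                    # same skill, different case
    ["Data Analysis", "Python"],   # e.g. from the skills column
)
print(merged_skills)  # ['data analysis', 'jupyter notebook', 'python']
```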

In [11]:
%%time

# extract with Stanza
courses_df['course_name_stanza'] = courses_df['course_name'].apply(extract_entities_stanza)
courses_df['description_stanza'] = courses_df['course_description'].apply(extract_entities_stanza)

# extract with spaCy
courses_df['course_name_spacy'] = courses_df['course_name'].apply(extract_entities_spacy)
courses_df['description_spacy'] = courses_df['course_description'].apply(extract_entities_spacy)

# Combine everything together and remove duplicate skills
courses_df['all_course_skills'] = courses_df.apply(extend_lists,axis=1)

courses_df.head(3)

CPU times: user 2min 12s, sys: 25.5 s, total: 2min 38s
Wall time: 21.9 s


Unnamed: 0,course_name,course_description,course_difficulty_level,course_url,course_skills,course_name_stanza,description_stanza,course_name_spacy,description_spacy,all_course_skills
0,Python and Statistics for Financial Analysis,Course Overview httpsyoutu.beJgFVqzAYno Pytho...,Advanced,https://www.coursera.org/learn/python-statisti...,"[Data Analysis, Python, Python Programming, fi...",[],[Jupyter Notebook],[],"[linear, Dataframe Manipulate]","[statistical analysis, jupyter notebook, data ..."
1,Parallel programming,With every smartphone and computer now boastin...,Beginner,https://www.coursera.org/learn/parprog1,"[Data Structures, parallel algorithm, openfabr...",[],"[Scala, Ruby, CC, Functional Program Design]",[],"[Scala, Recommended, Learning Outcomes, Functi...",[computer programming computerscience software...
2,Getting Started with Go,"Learn the basics of Go, an open source program...",Intermediate,https://www.coursera.org/learn/golang-getting-...,"[record computer science, language, structured...",[],"[Google, Java]",[],[Google],[subroutine computerscience softwaredevelopmen...


## Create relational tables

### Explode table by all skills

In [12]:
courses_exploded_df = courses_df[['course_name', 'all_course_skills']].explode('all_course_skills')

In [13]:
courses_exploded_df.columns = ['course_name', 'course_skill']
courses_exploded_df = courses_exploded_df.drop_duplicates(subset=['course_name','course_skill'])
courses_exploded_df.reset_index(inplace=True)
courses_exploded_df.columns = ['course_id', 'course_name', 'course_skill']
courses_exploded_df[courses_exploded_df.course_skill.str.contains('ython')]

Unnamed: 0,course_id,course_name,course_skill
6,0,Python and Statistics for Financial Analysis,python
10,0,Python and Statistics for Financial Analysis,python programming
50,3,TensorFlow for CNNs Transfer Learning,python
62,4,Image Classification with CNNs using Keras,python programming
66,4,Image Classification with CNNs using Keras,python


## Create table of course skills (without duplicates)

In [14]:
courses_skills_df = courses_exploded_df[['course_skill']].copy()
courses_skills_df = courses_skills_df.drop_duplicates() \
               .reset_index(drop=True) \
               .reset_index() \
               .rename(columns={'course_skill':'course_skill_name', 'index':'course_skill_id'})
courses_skills_df.course_skill_id = courses_skills_df.course_skill_id.astype('int')

courses_skills_df[courses_skills_df.course_skill_name.str.contains('ython')]

Unnamed: 0,course_skill_id,course_skill_name
6,6,python
10,10,python programming


In [15]:
courses_skills_df.to_csv(temp_dir+"courses_skills_TEMP.csv", index=False)

## Map course skills back to the courses to create the intermediate (temp) relationships table

In [16]:
courses_exploded_df = courses_exploded_df.merge(courses_skills_df, how='outer', left_on='course_skill', right_on='course_skill_name')
courses_exploded_df

Unnamed: 0,course_id,course_name,course_skill,course_skill_id,course_skill_name
0,0,Python and Statistics for Financial Analysis,statistical analysis,0,statistical analysis
1,0,Python and Statistics for Financial Analysis,jupyter notebook,1,jupyter notebook
2,0,Python and Statistics for Financial Analysis,data analysis,2,data analysis
3,6,Behavior Driven Development with Selenium and ...,data analysis,2,data analysis
4,19,The Product Lifecycle A Guide from start to fi...,data analysis,2,data analysis
...,...,...,...,...,...
269,19,The Product Lifecycle A Guide from start to fi...,microsoft excel,228,microsoft excel
270,19,The Product Lifecycle A Guide from start to fi...,product lifecycle,229,product lifecycle
271,19,The Product Lifecycle A Guide from start to fi...,tracking,230,tracking
272,19,The Product Lifecycle A Guide from start to fi...,product sales,231,product sales


In [17]:
courses_exploded_df[["course_id", "course_skill_id"]].to_csv(temp_dir+"courses_skills_relationship_TEMP.csv",index=False)

### (node) COURSE

In [18]:
columns_to_leave = ["course_name", 
                    "course_difficulty_level",
                    "course_url"]
courses_df = courses_df[columns_to_leave].copy()  # copy, so adding :LABEL does not write to a view
courses_df[":LABEL"] = "COURSE"
courses_df

Unnamed: 0,course_name,course_difficulty_level,course_url,:LABEL
0,Python and Statistics for Financial Analysis,Advanced,https://www.coursera.org/learn/python-statisti...,COURSE
1,Parallel programming,Beginner,https://www.coursera.org/learn/parprog1,COURSE
2,Getting Started with Go,Intermediate,https://www.coursera.org/learn/golang-getting-...,COURSE
3,TensorFlow for CNNs Transfer Learning,Beginner,https://www.coursera.org/learn/tensorflow-for-...,COURSE
4,Image Classification with CNNs using Keras,Beginner,https://www.coursera.org/learn/image-classific...,COURSE
5,Create your first test automation script Sele...,Beginner,https://www.coursera.org/learn/create-first-te...,COURSE
6,Behavior Driven Development with Selenium and ...,Intermediate,https://www.coursera.org/learn/behavior-driven...,COURSE
7,Building Test Automation Framework using Selen...,Beginner,https://www.coursera.org/learn/building-test-a...,COURSE
8,Advanced TestNG Framework and Integration with...,Beginner,https://www.coursera.org/learn/Advanced-testng...,COURSE
9,Automate an ecommerce web application using Se...,Beginner,https://www.coursera.org/learn/automate-e-comm...,COURSE


In [19]:
courses_df.to_csv(output_dir+"course__node.csv", index_label="course_id:ID")
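
The `index_label` argument is what turns the dataframe index into the `course_id:ID` column of the node file. A quick sketch that writes a made-up node table to an in-memory buffer and inspects the header line; the URL is a placeholder:

```python
import io
import pandas as pd

nodes = pd.DataFrame({
    "course_name": ["Course A"],
    "course_difficulty_level": ["Beginner"],
    "course_url": ["https://example.org/a"],  # placeholder URL
})
nodes[":LABEL"] = "COURSE"

# Write to a string buffer instead of a file, so nothing is left on disk.
buffer = io.StringIO()
nodes.to_csv(buffer, index_label="course_id:ID")

header = buffer.getvalue().splitlines()[0]
print(header)  # course_id:ID,course_name,course_difficulty_level,course_url,:LABEL
```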