# Course Recommender: Feature Engineering

This project implements and deploys an AI course Recommender System using [Streamlit](https://streamlit.io/). It was inspired by the [IBM Machine Learning Professional Certificate](https://www.coursera.org/professional-certificates/ibm-machine-learning) offered by IBM & Coursera. In the last course/module of the Specialization, Machine Learning Capstone, a similar application is built; check my [class notes](https://github.com/mxagar/machine_learning_ibm/tree/main/06_Capstone_Project/06_Capstone_Recommender_System.md) for more information.

This notebook performs the **Feature Engineering (FE)** of the dataset. You can open it in Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mxagar/course_recommender_streamlit/blob/main/notebooks/02_FE.ipynb)

Table of contents:

- [1. Build Course Text BoWs DataFrame](#1.-Build-Course-Text-BoWs-DataFrame)
- [2. Analyze Course Text BoWs DataFrame](#2.-Analyze-Course-Text-BoWs-DataFrame)
    - Pivot to Data-Term-Matrix
    - Get Similar Courses

In [62]:
import itertools

import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import gensim
from gensim import corpora

from scipy.spatial.distance import cosine, jaccard, euclidean, cityblock

In [2]:
# Download NLTK packages
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')

[nltk_data] Downloading package punkt to /Users/mxagar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mxagar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/mxagar/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to /Users/mxagar/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

## 1. Build Course Text BoWs DataFrame

In [5]:
course_content_df = pd.read_csv('../data/course_processed.csv')

In [6]:
course_content_df.head()

Unnamed: 0,COURSE_ID,TITLE,DESCRIPTION
0,ML0201EN,robots are coming build iot apps with watson ...,have fun with iot and learn along the way if ...
1,ML0122EN,accelerating deep learning with gpu,training complex deep learning models with lar...
2,GPXX0ZG0EN,consuming restful services using the reactive ...,learn how to use a reactive jax rs client to a...
3,RP0105EN,analyzing big data in r using apache spark,apache spark is a popular cluster computing fr...
4,GPXX0Z2PEN,containerizing packaging and running a sprin...,learn how to containerize package and run a ...


In [10]:
# A unique text field is created
course_content_df['course_texts'] = course_content_df['TITLE'] + ' ' + course_content_df['DESCRIPTION']
course_content_df['index'] = course_content_df.index

In [11]:
course_content_df.head()

Unnamed: 0,index,COURSE_ID,TITLE,DESCRIPTION,course_texts
0,0,ML0201EN,robots are coming build iot apps with watson ...,have fun with iot and learn along the way if ...,robots are coming build iot apps with watson ...
1,1,ML0122EN,accelerating deep learning with gpu,training complex deep learning models with lar...,accelerating deep learning with gpu training c...
2,2,GPXX0ZG0EN,consuming restful services using the reactive ...,learn how to use a reactive jax rs client to a...,consuming restful services using the reactive ...
3,3,RP0105EN,analyzing big data in r using apache spark,apache spark is a popular cluster computing fr...,analyzing big data in r using apache spark apa...
4,4,GPXX0Z2PEN,containerizing packaging and running a sprin...,learn how to containerize package and run a ...,containerizing packaging and running a sprin...


In [39]:
# We tokenize the course_texts column values, BUT:
# - stop words and purely numeric tokens are removed
# - tokens are filtered by their POS (part-of-speech) tag; despite the flag name,
#   this is an exclusion list, so nouns, verbs and base-form adjectives are kept
def tokenize_course(course, keep_only_nouns=True):
    """Tokenize a text string.
    Stop words and numeric tokens are removed.
    If specified, tokens with an excluded POS tag are dropped
    (wh-words, prepositions, adverbs, modals, pronouns, etc.).
    
    Args:
        course: str
            Course title+description
        keep_only_nouns: bool
            Whether to apply the POS tag filter or not (default: True)
    Returns:
        word_tokens: list[str]
            List of word tokens
    """
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(course)
    # Remove English stop words and numbers
    word_tokens = [w for w in word_tokens
                   if w.lower() not in stop_words and not w.isnumeric()]
    # Drop tokens whose POS tag is in the exclusion list
    if keep_only_nouns:
        # We can get a list of all POS tags with nltk.help.upenn_tagset()
        filter_list = ['WDT', 'WP', 'WRB', 'FW', 'IN', 'JJR', 'JJS',
                       'MD', 'PDT', 'POS', 'PRP', 'RB', 'RBR', 'RBS',
                       'RP']
        tags = nltk.pos_tag(word_tokens)
        word_tokens = [word for word, pos in tags if pos not in filter_list]

    return word_tokens

In [40]:
# Run tokenization on the column
# We get a list of lists: for each course a list of words/tokens
tokenized_course_texts = [tokenize_course(text)
                          for text in course_content_df['course_texts']]

In [41]:
tokenized_course_texts[:1]

[['robots',
  'coming',
  'build',
  'iot',
  'apps',
  'watson',
  'swift',
  'red',
  'fun',
  'iot',
  'learn',
  'way',
  'swift',
  'developer',
  'want',
  'learn',
  'iot',
  'watson',
  'ai',
  'services',
  'cloud',
  'raspberry',
  'pi',
  'node',
  'red',
  'found',
  'place',
  'build',
  'iot',
  'apps',
  'read',
  'temperature',
  'data',
  'take',
  'pictures',
  'raspcam',
  'use',
  'ai',
  'recognize',
  'objects',
  'pictures',
  'program',
  'irobot',
  'create',
  'robot']]
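
The POS step in `tokenize_course` works as an exclusion list: a token is dropped only if its tag appears in `filter_list`. Below is a minimal sketch of that filtering logic in isolation. It uses hand-written `(word, tag)` pairs as a stand-in for the output of `nltk.pos_tag`; the tags are illustrative assumptions, not actual tagger output, and `filter_by_pos` is a hypothetical helper.

```python
# Penn Treebank tags to exclude (same list as in tokenize_course)
FILTER_TAGS = {'WDT', 'WP', 'WRB', 'FW', 'IN', 'JJR', 'JJS',
               'MD', 'PDT', 'POS', 'PRP', 'RB', 'RBR', 'RBS', 'RP'}

def filter_by_pos(tagged_tokens, excluded_tags=FILTER_TAGS):
    """Keep the words whose POS tag is not in the exclusion set."""
    return [word for word, tag in tagged_tokens if tag not in excluded_tags]

# Hypothetical tagged tokens, written by hand for illustration
tagged = [('robots', 'NNS'), ('quickly', 'RB'), ('build', 'VB'),
          ('better', 'JJR'), ('apps', 'NNS')]
kept = filter_by_pos(tagged)
print(kept)
```

Verbs such as `build` pass the filter, which is consistent with the tokens listed above (`coming`, `build`, `learn`).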

In [42]:
# Create a dictionary
tokens_dict = gensim.corpora.Dictionary(tokenized_course_texts)

In [43]:
# Vocabulary: token2id dictionary
print(tokens_dict.token2id)

{'ai': 0, 'apps': 1, 'build': 2, 'cloud': 3, 'coming': 4, 'create': 5, 'data': 6, 'developer': 7, 'found': 8, 'fun': 9, 'iot': 10, 'irobot': 11, 'learn': 12, 'node': 13, 'objects': 14, 'pi': 15, 'pictures': 16, 'place': 17, 'program': 18, 'raspberry': 19, 'raspcam': 20, 'read': 21, 'recognize': 22, 'red': 23, 'robot': 24, 'robots': 25, 'services': 26, 'swift': 27, 'take': 28, 'temperature': 29, 'use': 30, 'want': 31, 'watson': 32, 'way': 33, 'accelerate': 34, 'accelerated': 35, 'accelerating': 36, 'analyze': 37, 'based': 38, 'benefit': 39, 'caffe': 40, 'case': 41, 'chips': 42, 'classification': 43, 'comfortable': 44, 'complex': 45, 'computations': 46, 'convolutional': 47, 'course': 48, 'datasets': 49, 'deep': 50, 'dependencies': 51, 'deploy': 52, 'designed': 53, 'feel': 54, 'google': 55, 'gpu': 56, 'hardware': 57, 'house': 58, 'ibm': 59, 'images': 60, 'including': 61, 'inference': 62, 'large': 63, 'learning': 64, 'libraries': 65, 'machine': 66, 'models': 67, 'need': 68, 'needs': 69, 'n

In [44]:
# Create Bags of Words (BoWs) for each tokenized course text
courses_bow = [tokens_dict.doc2bow(course) for course in tokenized_course_texts]

In [45]:
courses_bow[:1]

[[(0, 2),
  (1, 2),
  (2, 2),
  (3, 1),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 4),
  (11, 1),
  (12, 2),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 2),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 2),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 2),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 2),
  (33, 1)]]
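
To make the `(token_id, count)` pairs above more concrete, here is a conceptual sketch in plain Python of what the dictionary and BoW steps compute. `build_token2id` and `doc_to_bow` are hypothetical helpers, not gensim functions, and the ID assignment order (alphabetical within each document) is an assumption that gensim does not necessarily follow.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign an integer ID to each distinct token (assumed ordering)."""
    token2id = {}
    for doc in tokenized_docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc_to_bow(doc, token2id):
    """Count token occurrences and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Toy tokenized documents
docs = [["iot", "apps", "iot", "watson"],
        ["deep", "learning", "gpu", "learning"]]
token2id = build_token2id(docs)
bows = [doc_to_bow(d, token2id) for d in docs]
print(bows)
```

Each document becomes a sparse vector: only tokens that actually occur in it are listed, together with their counts.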

In [46]:
# Create a dataframe which contains the BoWs of each course
# flattened along the rows
doc_indices = [[course_content_df.iloc[i,:]['index']]*len(courses_bow[i]) for i in range(course_content_df.shape[0])]
doc_ids = [[course_content_df.iloc[i,:]['COURSE_ID']]*len(courses_bow[i]) for i in range(course_content_df.shape[0])]
tokens = [[tokens_dict.get(courses_bow[i][j][0]) for j in range(len(courses_bow[i]))] for i in range(len(courses_bow))]
bow_values = [[courses_bow[i][j][1] for j in range(len(courses_bow[i]))] for i in range(len(courses_bow))]

In [47]:
# Flatten the lists of lists
doc_indices = list(itertools.chain(*doc_indices))
doc_ids = list(itertools.chain(*doc_ids))
tokens = list(itertools.chain(*tokens))
bow_values = list(itertools.chain(*bow_values))

In [48]:
# Dictionary for the dataframe
bow_dicts = {"doc_index": doc_indices,
             "doc_id": doc_ids,
             "token": tokens,
             "bow": bow_values}

In [49]:
course_text_bows_df = pd.DataFrame(bow_dicts)

In [50]:
course_text_bows_df.head()

Unnamed: 0,doc_index,doc_id,token,bow
0,0,ML0201EN,ai,2
1,0,ML0201EN,apps,2
2,0,ML0201EN,build,2
3,0,ML0201EN,cloud,1
4,0,ML0201EN,coming,1
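
The long-format table above can also be assembled in a single pass, without building and flattening four parallel lists. The following is only a sketch on toy data: `flatten_bows` is a hypothetical helper, and it assumes that `doc_index` is simply the position of the course in the list (as is the case for the `index` column created earlier).

```python
def flatten_bows(doc_ids, bows, id2token):
    """Return one dict per (course, token) pair in long format."""
    rows = []
    for doc_index, (doc_id, bow) in enumerate(zip(doc_ids, bows)):
        for token_id, count in bow:
            rows.append({"doc_index": doc_index,
                         "doc_id": doc_id,
                         "token": id2token[token_id],
                         "bow": count})
    return rows

# Toy inputs (made up for illustration)
toy_doc_ids = ["ML0201EN", "ML0122EN"]
toy_bows = [[(0, 2), (1, 1)], [(2, 1)]]
toy_id2token = {0: "ai", 1: "apps", 2: "gpu"}

rows = flatten_bows(toy_doc_ids, toy_bows, toy_id2token)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` would give a frame with the columns `doc_index`, `doc_id`, `token` and `bow`.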


In [51]:
course_text_bows_df.shape

(10363, 4)

In [52]:
# This should be the same as `../data/courses_bows.csv`
course_bows_df = pd.read_csv('../data/courses_bows.csv')

In [53]:
# The row counts differ slightly: 10363 rows computed here vs. 10366 in the file.
# The difference might be due to some tagging/tokenization disparities, etc.
course_bows_df.shape

(10366, 4)
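
One way to investigate the small mismatch would be to compare the `(doc_id, token)` pairs of both tables. The sketch below does this with plain sets on toy rows; `bow_key_diff` is a hypothetical helper, and the rows are made up. For the actual DataFrames, the row dicts could be obtained with `df.to_dict("records")`.

```python
def bow_key_diff(rows_a, rows_b):
    """Return the (doc_id, token) pairs found only in A and only in B."""
    keys_a = {(r["doc_id"], r["token"]) for r in rows_a}
    keys_b = {(r["doc_id"], r["token"]) for r in rows_b}
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a)

# Toy rows standing in for the computed and the stored BoW tables
computed_rows = [{"doc_id": "ML0201EN", "token": "ai"},
                 {"doc_id": "ML0201EN", "token": "apps"}]
stored_rows = [{"doc_id": "ML0201EN", "token": "ai"},
               {"doc_id": "ML0201EN", "token": "apps"},
               {"doc_id": "ML0201EN", "token": "also"}]

only_computed, only_stored = bow_key_diff(computed_rows, stored_rows)
print(only_computed, only_stored)
```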

## 2. Analyze Course Text BoWs DataFrame

In [54]:
course_text_bows_df.head()

Unnamed: 0,doc_index,doc_id,token,bow
0,0,ML0201EN,ai,2
1,0,ML0201EN,apps,2
2,0,ML0201EN,build,2
3,0,ML0201EN,cloud,1
4,0,ML0201EN,coming,1


In [64]:
# All documents/courses
course_bows_df['doc_id'].unique()

array(['ML0201EN', 'ML0122EN', 'GPXX0ZG0EN', 'RP0105EN', 'GPXX0Z2PEN',
       'CNSC02EN', 'DX0106EN', 'GPXX0FTCEN', 'RAVSCTEST1', 'GPXX06RFEN',
       'GPXX0SDXEN', 'CC0271EN', 'WA0103EN', 'DX0108EN', 'GPXX0PICEN',
       'DAI101EN', 'GPXX0W7KEN', 'GPXX0QR3EN', 'BD0145EN', 'HCC105EN',
       'DE0205EN', 'DS0132EN', 'OS0101EN', 'DS0201EN', 'BENTEST4',
       'CC0210EN', 'PA0103EN', 'HCC104EN', 'GPXX0A1YEN', 'TMP0105EN',
       'PA0107EN', 'DB0113EN', 'PA0109EN', 'PHPM002EN', 'GPXX03HFEN',
       'RP0103', 'RP0103EN', 'BD0212EN', 'GPXX0IBEN', 'SECM03EN',
       'SC0103EN', 'GPXX0YXHEN', 'RP0151EN', 'TA0105', 'SW0201EN',
       'TMP0106', 'GPXX0BUBEN', 'ST0201EN', 'ST0301EN', 'SW0101EN',
       'TMP0101EN', 'DW0101EN', 'BD0143EN', 'WA0101EN', 'GPXX04HEEN',
       'BD0141EN', 'CO0401EN', 'ML0122ENv1', 'BD0151EN', 'TA0106EN',
       'TMP107', 'ML0111EN', 'GPXX048OEN', 'CO0201EN', 'GPXX01DCEN',
       'GPXX04XJEN', 'GPXX0JZ4EN', 'GPXX0ZYVEN', 'GPXX0ZMZEN',
       'GPXX0742EN', 'GPXX0KV4EN', 

### Pivot to Data-Term-Matrix

In [70]:
def pivot_two_bows(basedoc, comparedoc):
    """Given two BOWs of different documents/course texts,
    re-arrange them horizontally and create a dataframe.
    
    Args:
        basedoc: pd.DataFrame
            Base BoW
        comparedoc: pd.DataFrame
            The other BoW
            
    Returns:
        joinT: pd.DataFrame
            Composed data frame which contains
            both inputs horizontally arranged.
    """
    base = basedoc.copy()
    base['type'] = 'base'
    compare = comparedoc.copy()
    compare['type'] = 'compare'
    # Stack the two token sets vertically
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    join = pd.concat([base, compare])
    # Pivot the two joined courses
    joinT = join.pivot(index=['doc_id', 'type', 'doc_index'], columns='token').fillna(0).reset_index(level=[0, 1])
    # Assign columns
    joinT.columns = ['doc_id', 'type'] + [t[1] for t in joinT.columns][2:]
    return joinT

In [71]:
course1 = course_bows_df[course_bows_df['doc_id'] == 'ML0151EN']
course2 = course_bows_df[course_bows_df['doc_id'] == 'ML0101ENv3']

In [72]:
course1

Unnamed: 0,doc_index,doc_id,token,bow
3512,200,ML0151EN,learn,1
3513,200,ML0151EN,course,1
3514,200,ML0151EN,learning,5
3515,200,ML0151EN,machine,4
3516,200,ML0151EN,using,1
3517,200,ML0151EN,r,2
3518,200,ML0151EN,basics,1
3519,200,ML0151EN,language,1
3520,200,ML0151EN,programming,1
3521,200,ML0151EN,statistical,1


In [74]:
bow_vectors = pivot_two_bows(course1, course2)
bow_vectors

doc_index,doc_id,type,approachable,basics,beneficial,comparison,course,dives,free,future,...,relates,started,statistical,supervised,tool,tools,trends,unsupervised,using,vs
158,ML0101ENv3,compare,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
200,ML0151EN,base,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0


In [76]:
similarity = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])
similarity

0.6626221399549089
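
Note that `scipy.spatial.distance.cosine` returns a distance, which is why the similarity is computed as `1 - cosine(...)`. The same quantity can be obtained directly from two sparse BoWs without pivoting them first. The following is a minimal sketch in plain Python; `cosine_similarity` is a hypothetical helper, and returning 0.0 for an empty BoW is an assumption made here for convenience.

```python
import math

def cosine_similarity(bow_a, bow_b):
    """Cosine similarity between two BoWs given as {token: count} dicts."""
    # Only tokens present in both BoWs contribute to the dot product
    dot = sum(count * bow_b.get(token, 0) for token, count in bow_a.items())
    norm_a = math.sqrt(sum(c * c for c in bow_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bow_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_a * norm_b)

# Toy BoWs (made up for illustration)
bow_1 = {"machine": 1, "learning": 1}
bow_2 = {"machine": 1}
print(cosine_similarity(bow_1, bow_2))
```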

### Get Similar Courses

In [81]:
# Given a reference course,
# compute BoW similarities to all other courses
reference = 'ML0101ENv3'
course_vector = []
similarity_vector = []
for course in course_bows_df['doc_id'].unique():
    if course != reference:
        course1 = course_bows_df[course_bows_df['doc_id'] == reference]
        course2 = course_bows_df[course_bows_df['doc_id'] == course]
        course_vector.append(course)
        bow_vectors = pivot_two_bows(course1, course2)
        similarity = 1 - cosine(bow_vectors.iloc[0, 2:], bow_vectors.iloc[1, 2:])
        similarity_vector.append(similarity)

similarity_df = (pd.DataFrame({'course': course_vector,
                               'similarity': similarity_vector})
                 .sort_values(by="similarity", ascending=False).reset_index(drop=True)
                )
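
The loop above pivots two courses at a time. An alternative, sketched here on toy data, is to pivot the whole long-format table once into a document-term matrix and compute all similarities with one matrix product. `rank_similar` is a hypothetical helper, the toy table only mimics the columns of `course_bows_df`, and the sketch assumes every course has at least one token (otherwise its norm would be zero).

```python
import numpy as np
import pandas as pd

# Toy long-format BoW table with the same column names as course_bows_df
toy_bows_df = pd.DataFrame({
    "doc_id": ["A", "A", "B", "B", "C"],
    "token": ["machine", "learning", "machine", "r", "cloud"],
    "bow": [1, 2, 1, 1, 3],
})

# One row per course, one column per token, 0 where a token is absent
dtm = toy_bows_df.pivot_table(index="doc_id", columns="token",
                              values="bow", aggfunc="sum", fill_value=0)

def rank_similar(dtm, reference):
    """Cosine similarity of every other course to the reference course."""
    matrix = dtm.to_numpy(dtype=float)
    norms = np.linalg.norm(matrix, axis=1)
    ref = dtm.loc[reference].to_numpy(dtype=float)
    sims = matrix @ ref / (norms * np.linalg.norm(ref))
    return (pd.Series(sims, index=dtm.index)
            .drop(reference)
            .sort_values(ascending=False))

ranking = rank_similar(dtm, "A")
print(ranking)
```

The trade-off is memory: the full matrix has one column per vocabulary token, whereas the pairwise approach only ever holds the tokens of two courses.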

In [84]:
# All similar courses sorted
# Now, we can perform a JOIN (merge) to get course names...
similarity_df

Unnamed: 0,course,similarity
0,ML0151EN,0.662622
1,excourse47,0.634755
2,excourse46,0.612054
3,excourse60,0.549040
4,ML0109EN,0.521749
...,...,...
301,CO0101EN,0.000000
302,LB0107ENv1,0.000000
303,GPXX0PG8EN,0.000000
304,CB0103EN,0.000000


In [86]:
# Reference course
course_content_df[course_content_df['COURSE_ID'] == reference]['TITLE']

158    machine learning with python
Name: TITLE, dtype: object

In [89]:
# Most similar courses (first 10)
similarity_df_ = pd.merge(left=similarity_df,
                          right=course_content_df,
                          how='inner',
                          left_on='course',
                          right_on='COURSE_ID')[['course', 'similarity', 'TITLE']]
similarity_df_.head(10)

Unnamed: 0,course,similarity,TITLE
0,ML0151EN,0.662622,machine learning with r
1,excourse47,0.634755,machine learning for all
2,excourse46,0.612054,machine learning
3,excourse60,0.54904,introduction to tensorflow for artificial inte...
4,ML0109EN,0.521749,machine learning dimensionality reduction
5,excourse69,0.481932,machine learning with big data
6,excourse51,0.4594,introduction to machine learning in production
7,excourse48,0.454077,introduction to machine learning language pro...
8,excourse61,0.447214,convolutional neural networks in tensorflow
9,ML0115EN,0.421637,deep learning 101
