# Content Based Search for Curriculum Bot

This notebook walks through a demo of a machine learning model that searches training kit topics based on students' questions. For this demo, we use private data obtained from Lambda School's Airtable.

To run this notebook, first upload **modSearchData.json** from the GitHub repository.

Now we perform exploratory data analysis on the dataset.

In [0]:
# Generic imports
import pandas as pd
import numpy as np

In [2]:
df = pd.read_json('modSearchData.json')
print(f'Our data set has {df.shape[0]} records and {df.shape[1]} features or columns.')

# Identify initial records in the data
df.head()

Our data set has 380 records and 5 features or columns.


| | URL | description | id | modSearchProfile | name |
|---|---|---|---|---|---|
| 0 | https://learn.lambdaschool.com/and-pre/module/... | This teaches students how to make more modular... | rec06OcmnettIrdMk | {'text': 'this teaches students how to make mo... | Methods |
| 1 | https://learn.lambdaschool.com/web1/module/rec... | JavaScript III introduces us to the `this` key... | rec0AWuNLezbpit7m | {'text': 'javascript iii introduces us to the ... | JavaScript III |
| 2 | https://learn.lambdaschool.com/ds/module/rec0O... | Explore Bag of Words analysis with NLTK! We'll... | rec0O4tJizjI1C6EA | {'text': 'explore bag of words analysis with n... | Vector Representations |
| 3 | https://learn.lambdaschool.com/fsw-pre/module/... | In the last lesson, we started down the path o... | rec0Vmn34NDFqmwy8 | {'text': 'in the last lesson, we started down ... | HTML and CSS Fundamentals |
| 4 | https://learn.lambdaschool.com/ds/module/rec0p... | One of Linear Algebra's great strengths is its... | rec0pSWqkfdxJv6eC | {'text': 'one of linear algebra's great streng... | Dimensionality Reduction Techniques |


In [3]:
print('Checking the data consistency')
df.isnull().sum()

Checking the data consistency


URL                 0
description         0
id                  0
modSearchProfile    0
name                0
dtype: int64


### Feature Engineering

The output above shows no missing values, so the features appear clean. For this demo we proceed with the **name** and **description** features.

Let us group the links into the following categories.

* ds - Data Science or Data Structure
* web - Full Stack or Web Development
* ios - iOS
* android - Android
* career - Career related
* ux - UX
* cs - Computer Science

For now, only the `text` field of **modSearchProfile** is used.

In [0]:
# Categorizing the training kit information
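category = []
# Take the section slug (first path segment) straight from each URL so the
# result stays aligned with the rows of df and needs no IPython shell magic.
section_names = df['URL'].str.split('/').str[3]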

for section in section_names:
    if section in ['and-pre', 'android']:
        category.append('android')
    elif section in ['cd', 'cr', 'ls-edu', 'nxt', 'p4s']:
        category.append('career')
    elif section in ['cs']:
        category.append('cs')
    elif section in ['ds', 'ds-pre']:
        category.append('ds')
    elif section in ['fsw', 'fsw-pre', 'web1', 'web2', 'web3', 'web4java', 'web4node']:
        category.append('web')
    elif section in ['ios', 'ios-pre']:
        category.append('ios')
    elif section in ['ux', 'ux-pre']:
        category.append('ux')
    else:
        category.append('other')

df['category'] = category

# Extract text information from modSearchProfile
def extract_text(row):
    return dict(row)['text']
    
df['modSearchText'] = df['modSearchProfile'].apply(extract_text)

# Combining text based information
df['text'] = df.apply(lambda row: row['name'] + " " + row['description']
                      + " " + row['modSearchText'], axis = 1)

# Dropping detailed text information. This can be used later if needed.
df.drop(columns=['modSearchProfile'], inplace=True)
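The if/elif chain above can also be written as a lookup table. The following is a minimal sketch rather than the notebook's own method: `SECTION_TO_CATEGORY` and `url_to_category` are hypothetical names, the slugs are copied from the cell above, and the example URL ends in a placeholder record id.

```python
from urllib.parse import urlparse

# Hypothetical lookup table built from the slugs used in the cell above.
SECTION_TO_CATEGORY = {
    'and-pre': 'android', 'android': 'android',
    'cd': 'career', 'cr': 'career', 'ls-edu': 'career',
    'nxt': 'career', 'p4s': 'career',
    'cs': 'cs',
    'ds': 'ds', 'ds-pre': 'ds',
    'fsw': 'web', 'fsw-pre': 'web', 'web1': 'web', 'web2': 'web',
    'web3': 'web', 'web4java': 'web', 'web4node': 'web',
    'ios': 'ios', 'ios-pre': 'ios',
    'ux': 'ux', 'ux-pre': 'ux',
}

def url_to_category(url):
    """Map a training kit URL to a category via its first path segment."""
    path = urlparse(url).path                  # e.g. '/web1/module/recXYZ'
    section = path.strip('/').split('/')[0]    # '' when the path is empty
    return SECTION_TO_CATEGORY.get(section, 'other')

# 'recXYZ' is a placeholder id, not a real record.
print(url_to_category('https://learn.lambdaschool.com/web1/module/recXYZ'))  # web
```

With a DataFrame in hand, this could be applied as `df['category'] = df['URL'].map(url_to_category)`, which keeps each category on the same row as its URL.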

In [6]:
available_category = df.category.unique()
available_category

array(['android', 'web', 'ds', 'career', 'ios', 'cs', 'ux'], dtype=object)

### Text Similarity Metrics:

To build our content-based search bot, we compare the name and description of each training kit with the student's question. For this we use two common text similarity metrics: **Jaccard similarity** and **cosine similarity**.

#### Jaccard similarity:
Also called intersection over union, it is defined as the size of the intersection divided by the size of the union of two sets. Here each set holds the distinct words of one text.

#### Cosine similarity:
Measures similarity as the cosine of the angle between two vectors. Here each vector holds the word counts of one text.
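To make the two definitions concrete, here is a small worked example using only the standard library. It is an illustrative sketch: the function names `jaccard` and `cosine` and the two sample strings are made up for this example and are separate from the helpers defined in the cells below.

```python
import math
from collections import Counter

def jaccard(a, b):
    """Size of intersection over size of union of the two word sets."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine of the angle between the word-count vectors of a and b."""
    count_a, count_b = Counter(a.split()), Counter(b.split())
    dot = sum(count_a[w] * count_b[w] for w in count_a)
    norm_a = math.sqrt(sum(c * c for c in count_a.values()))
    norm_b = math.sqrt(sum(c * c for c in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc = "advanced css layout css grid"
query = "advanced css"

# 2 shared words out of 4 distinct words overall.
print(jaccard(doc, query))           # 0.5
# Dot product 3, norms sqrt(7) and sqrt(2), so 3 / sqrt(14).
print(round(cosine(doc, query), 4))  # 0.8018
```

Note that the repeated word `css` raises the cosine score because counts matter there, while Jaccard only looks at whether a word is present.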

#### Reference:
https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50

In [0]:
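def get_jaccard_sim(str1, str2):
    """Jaccard similarity between the word sets of two strings."""
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    union_size = len(a) + len(b) - len(c)
    # Two empty strings share nothing; avoid dividing by zero.
    return float(len(c)) / union_size if union_size else 0.0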

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_cosine_sim(*strs):
    """Cosine similarity between the word-count vectors of the first two strings."""
    vectors = get_vectors(*strs)
    return cosine_similarity(vectors)[0][1]

def get_vectors(*strs):
    """Encode each string as a row of word counts over a shared vocabulary."""
    text = list(strs)
    # The constructor takes configuration only; the documents go to fit().
    vectorizer = CountVectorizer()
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

### Text processing using NLTK

Before computing similarity scores on our data, we have to clean the text further.

The cleaning is done with the Natural Language Toolkit (NLTK) library.

In [9]:
!pip install --upgrade pip
!pip install -U nltk

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

Requirement already up-to-date: pip in /usr/local/lib/python3.6/dist-packages (19.2.3)
Requirement already up-to-date: nltk in /usr/local/lib/python3.6/dist-packages (3.4.5)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [0]:
import string
table = str.maketrans('','', string.punctuation)

from nltk.tokenize import word_tokenize # Word Tokenizer

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words = set(stop_words)


from nltk.stem.wordnet import WordNetLemmatizer # Word Lemmatizer
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    """
    Cleaning the document before vectorization.
    """
    # Tokenize by word
    tokens = word_tokenize(text)
    # Make all words lowercase
    lowercase_tokens = [w.lower() for w in tokens]
    # Strip punctuation from within words
    no_punctuation = [x.translate(table) for x in lowercase_tokens]
    # Remove words that aren't alphabetic
    alphabetic = [word for word in no_punctuation if word.isalpha()]
    # Remove stopwords
    no_stop_words = [w for w in alphabetic if w not in stop_words]
    # Lemmatize words
    lemmas = [lemmatizer.lemmatize(word) for word in no_stop_words]
    return ' '.join(lemmas)

# Clean up the text
df['cleaned_text'] = df.text.apply(clean_text)
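To see what the individual cleaning stages do without downloading any NLTK data, the sketch below mirrors the same order of steps with the standard library only. It is a simplification under stated assumptions: `toy_clean_text` and `TOY_STOP_WORDS` are hypothetical names, tokenization is a plain whitespace split, the stop-word list is a tiny illustrative subset, and lemmatization is left out.

```python
import string

# Tiny illustrative stop-word list; NLTK's English list is much longer.
TOY_STOP_WORDS = {'the', 'a', 'an', 'of', 'to', 'and', 'is', 'in', 'how'}
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def toy_clean_text(text):
    """Lowercase, strip punctuation, keep alphabetic non-stop words."""
    tokens = text.split()                                  # naive tokenizer
    lowered = [t.lower() for t in tokens]
    stripped = [t.translate(PUNCT_TABLE) for t in lowered]
    alphabetic = [t for t in stripped if t.isalpha()]      # drops 'css3'
    kept = [t for t in alphabetic if t not in TOY_STOP_WORDS]
    return ' '.join(kept)

print(toy_clean_text("How to center a div in CSS3, the easy way!"))
# center div easy way
```

The example shows one side effect worth keeping in mind: tokens that contain digits, such as `css3`, are removed by the alphabetic filter.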

### Student Search Input 

The cell below holds a sample student query, which serves as input to the content-based recommendation system.

Because a query can match topics in several categories, students can prefix the question with the category they are looking for.

**Category Based Search Format:**
```
"<category>: <question>"
```
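One possible way to split this format into its two parts is sketched below. `parse_query` is a hypothetical helper, not part of the notebook; unlike the cells that follow, it also trims whitespace and lowercases the prefix, which is an assumption about how lenient the bot should be.

```python
def parse_query(student_query, available_categories):
    """Split '<category>: <question>' into (category, question).

    Returns (None, full_query) when no known category prefix is present.
    """
    prefix, sep, rest = student_query.partition(':')
    category = prefix.strip().lower()
    if sep and category in available_categories:
        return category, rest.strip()
    return None, student_query.strip()

categories = {'android', 'web', 'ds', 'career', 'ios', 'cs', 'ux'}

print(parse_query("web: Advanced CSS", categories))      # ('web', 'Advanced CSS')
print(parse_query("Recursion", categories))              # (None, 'Recursion')
print(parse_query("Web : flexbox: basics", categories))  # ('web', 'flexbox: basics')
```

Because `str.partition` splits only at the first colon, any later colons stay inside the question text.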

In [11]:
# 1st Sample User Information
student_query = "web: Advanced CSS"

# Check if the category is available
query_category = student_query.split(":")[0]

if query_category in available_category:
    df_match_by_category = df[df['category']==query_category].copy()
    
    query_without_category = clean_text(student_query.\
                                        replace(query_category+":", ""))
    
    df_match_by_category['jaccard_sim_value'] = \
        df_match_by_category.cleaned_text.apply(get_jaccard_sim, 
                                                args=(query_without_category,))
    sort_by_jaccard_sim = df_match_by_category.sort_values('jaccard_sim_value',
                                                          ascending=False).head(3)
    print("\nCategory Based: Content matched based on Jaccard Similarity")
    jaccard_match = sort_by_jaccard_sim[sort_by_jaccard_sim['jaccard_sim_value'] > 0]
    print(jaccard_match.loc[:, ['name', 'jaccard_sim_value']])
    
    df_match_by_category['cosine_sim_value'] = \
        df_match_by_category.cleaned_text.apply(get_cosine_sim, 
                                                args=(query_without_category,))
    sort_by_cosine_sim = df_match_by_category.sort_values('cosine_sim_value',
                                                           ascending=False).head(3)
    print("\nCategory Based: Content matched based on Cosine Similarity")
    cosine_match = sort_by_cosine_sim[sort_by_cosine_sim['cosine_sim_value'] > 0]
    print(cosine_match.loc[:, ['name', 'cosine_sim_value']])
    
else:
    df_full_match = df.copy()
    
    df_full_match['jaccard_sim_value'] = \
        df_full_match.cleaned_text.apply(get_jaccard_sim, 
                                         args=(clean_text(student_query),))
    sort_by_jaccard_sim = df_full_match.sort_values('jaccard_sim_value',
                                                    ascending=False).head(3)
    print("\nFull Match: Content matched based on Jaccard Similarity")
    jaccard_match = sort_by_jaccard_sim[sort_by_jaccard_sim['jaccard_sim_value'] > 0]
    print(jaccard_match.loc[:, ['name', 'jaccard_sim_value']])
    
    df_full_match['cosine_sim_value'] = \
        df_full_match.cleaned_text.apply(get_cosine_sim, 
                                         args=(clean_text(student_query),))
    
    sort_by_cosine_sim = df_full_match.sort_values('cosine_sim_value',
                                                   ascending=False).head(3)
    print("\nFull Match: Content matched based on Cosine Similarity")
    cosine_match = sort_by_cosine_sim[sort_by_cosine_sim['cosine_sim_value'] > 0]
    print(cosine_match.loc[:, ['name', 'cosine_sim_value']])


Category Based: Content matched based on Jaccard Similarity
                                                  name  jaccard_sim_value
20   Java Data Modeling, Custom Querying with Intro...           0.027778
177               Delivering a Single Page Application           0.010989
265                                      JS V: Classes           0.006536

Category Based: Content matched based on Cosine Similarity
                  name  cosine_sim_value
348    Preprocessing I          0.304282
110  User Interface II          0.223708
286   User Interface I          0.221240


### Full Search

Students can also query without specifying a category. The search then runs across all categories.

**Full Search Format:**
```
"<question>"
```

In [12]:
# 2nd Sample User Information
student_query = "Recursion"

# Check if the category is available
query_category = student_query.split(":")[0]

if query_category in available_category:
    df_match_by_category = df[df['category']==query_category].copy()
    
    query_without_category = clean_text(student_query.\
                                        replace(query_category+":", ""))
    
    df_match_by_category['jaccard_sim_value'] = \
        df_match_by_category.cleaned_text.apply(get_jaccard_sim, 
                                                args=(query_without_category,))
    sort_by_jaccard_sim = df_match_by_category.sort_values('jaccard_sim_value',
                                                          ascending=False).head(3)
    print("\nCategory Based: Content matched based on Jaccard Similarity")
    jaccard_match = sort_by_jaccard_sim[sort_by_jaccard_sim['jaccard_sim_value'] > 0]
    print(jaccard_match.loc[:, ['name', 'jaccard_sim_value']])
    
    df_match_by_category['cosine_sim_value'] = \
        df_match_by_category.cleaned_text.apply(get_cosine_sim, 
                                                args=(query_without_category,))
    sort_by_cosine_sim = df_match_by_category.sort_values('cosine_sim_value',
                                                           ascending=False).head(3)
    print("\nCategory Based: Content matched based on Cosine Similarity")
    cosine_match = sort_by_cosine_sim[sort_by_cosine_sim['cosine_sim_value'] > 0]
    print(cosine_match.loc[:, ['name', 'cosine_sim_value']])
    
else:
    df_full_match = df.copy()
    
    df_full_match['jaccard_sim_value'] = \
        df_full_match.cleaned_text.apply(get_jaccard_sim, 
                                         args=(clean_text(student_query),))
    sort_by_jaccard_sim = df_full_match.sort_values('jaccard_sim_value',
                                                    ascending=False).head(3)
    print("\nFull Match: Content matched based on Jaccard Similarity")
    jaccard_match = sort_by_jaccard_sim[sort_by_jaccard_sim['jaccard_sim_value'] > 0]
    print(jaccard_match.loc[:, ['name', 'jaccard_sim_value']])
    
    df_full_match['cosine_sim_value'] = \
        df_full_match.cleaned_text.apply(get_cosine_sim, 
                                         args=(clean_text(student_query),))
    
    sort_by_cosine_sim = df_full_match.sort_values('cosine_sim_value',
                                                   ascending=False).head(3)
    print("\nFull Match: Content matched based on Cosine Similarity")
    cosine_match = sort_by_cosine_sim[sort_by_cosine_sim['cosine_sim_value'] > 0]
    print(cosine_match.loc[:, ['name', 'cosine_sim_value']])


Full Match: Content matched based on Jaccard Similarity
                                             name  jaccard_sim_value
136               Java II - Language Fundamentals           0.032258
107  Computer Architecture: Subroutines, CALL/RET           0.008621
282                             Iterative Sorting           0.005650

Full Match: Content matched based on Cosine Similarity
                                name  cosine_sim_value
233                Recursive Sorting          0.192715
136  Java II - Language Fundamentals          0.093659
80                          Graphs I          0.034120
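Both query cells repeat the same ranking steps: score every document, sort in descending order, keep the top three, and drop zero scores. A sketch of how that logic could be factored into one helper follows. The names `top_matches` and `jaccard` and the toy corpus are made up for illustration; in the notebook the documents would come from `df[['name', 'cleaned_text']]`.

```python
def jaccard(a, b):
    """Jaccard similarity over whitespace-separated word sets."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def top_matches(docs, query, sim_fn, k=3):
    """Return up to k (name, score) pairs with score > 0, best first."""
    scored = [(name, sim_fn(text, query)) for name, text in docs.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy corpus standing in for the cleaned training kit text.
docs = {
    'Recursive Sorting': 'recursion merge sort quick sort',
    'Graphs I': 'graph traversal recursion depth first search',
    'User Interface I': 'html css layout',
}

for name, score in top_matches(docs, 'recursion', jaccard):
    print(f'{name}: {score:.3f}')
# Recursive Sorting: 0.250
# Graphs I: 0.167
```

Passing the similarity function as an argument lets the same helper serve both metrics, so the category and full-search branches would differ only in which documents they pass in.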


### Conclusion

We can run **A/B testing** between Jaccard and cosine similarity to collect feedback from students.

Once we have enough feedback, we can try building **user-based collaborative filtering** to recommend results.
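As one possible starting point for the A/B test, each student could be assigned to a metric deterministically, so that the same student always sees results from the same variant. This is only a sketch under assumptions: `assign_variant` is a hypothetical helper, and the student identifier format shown is invented for the example.

```python
import hashlib

def assign_variant(student_id, variants=('jaccard', 'cosine')):
    """Deterministically map a student id to one similarity metric."""
    digest = hashlib.sha256(student_id.encode('utf-8')).hexdigest()
    # Interpret the hash as an integer and bucket it by variant count.
    return variants[int(digest, 16) % len(variants)]

# 'student-001' is a made-up identifier.
print(assign_variant('student-001'))
```

Hashing the id rather than drawing a random number keeps the assignment stable across sessions without storing any state.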