>NOTE: This notebook is replicated as a Python file for later chatbot engineering, so there is no need to run the code here. The notebook is for explaining the data.

## 1. Retrieving data

In [1]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import nltk
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from rake_nltk import Rake
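import psycopg2 as pg2
# RealDictCursor is referenced by con_cur_to_db when dict_cur is set
from psycopg2.extras import RealDictCursor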

In [2]:
%run sql_table.py

In [3]:
def con_cur_to_db(dbname=DBNAME, dict_cur=None):
    ''' 
    Returns both a connection and a cursor object for your database
    '''
    con = pg2.connect(host=IP_ADDRESS,
                  dbname=dbname,
                  user=USER,
                  password=PASSWORD)
    if dict_cur:
        cur = con.cursor(cursor_factory=RealDictCursor)
    else:
        cur = con.cursor()
    
    return con, cur

In [4]:
def execute_query(query, dbname=DBNAME, dict_cur=None, command=False):
    '''
    Executes a query directly to a database, without having to create a cursor and connection each time. 
    '''
    con, cur = con_cur_to_db(dbname, dict_cur)
    cur.execute(query)
    if not command:
        data = cur.fetchall()
        col_names = []
        for elt in cur.description:
            col_names.append(elt[0])
        con.close()
        return data, col_names
    con.commit()
    con.close()

In [5]:
query = '''
SELECT *
FROM udemy
'''
data, column_names = execute_query(query)

In [6]:
df = pd.DataFrame(data, columns=column_names)
df['headline'] = df['headline'].fillna('No headlines')
df['description'] = df['description'].fillna('No description')
df.reset_index(inplace=True)

## 2. Convert texts into vectors

Text embedding converts text (words or sentences) into numerical vectors. Words and sentences are encoded as fixed-length dense vectors, which makes textual data much easier to process.

Universal embeddings are embeddings pre-trained on a large corpus that can be plugged into a variety of downstream task models, improving their performance by incorporating general word and sentence representations learned on the larger dataset.

Google’s Universal Sentence Encoder was published in early 2018. The encoder uses a transformer network trained on a variety of data sources and tasks, with the aim of accommodating a wide range of natural language understanding tasks. The pre-trained Universal Sentence Encoder is publicly available on [TensorFlow Hub](https://tfhub.dev/). The model is efficient and gives accurate results on diverse transfer tasks.


![https://www.tensorflow.org/hub/modules/google/universal-sentence-encoder/1](https://cdn-images-1.medium.com/max/1600/1*qACWEt8866AOKEYRb-Y5ig.png)


We simply load the Universal Sentence Encoder module from tensorflow hub. It’s as simple as that.

In [7]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/2" #@param ["https://tfhub.dev/google/universal-sentence-encoder/2", "https://tfhub.dev/google/universal-sentence-encoder-large/3"]
embed = hub.Module(module_url)

INFO:tensorflow:Using /var/folders/mq/2zsplllx4zjg6fgp1tlzc4500000gn/T/tfhub_modules to cache modules.


The function `get_features` wraps the TensorFlow call: it creates a session and runs the embed node in the graph, which gives us a vector for each text.

In [10]:
def get_features(texts):
    if type(texts) is str:
        texts = [texts]
    tf.logging.set_verbosity(tf.logging.ERROR)
    with tf.Session() as sess:
        sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
        return sess.run(embed(texts))

In [8]:
def clean_words(raw_text):
    # Keep letters, digits and '.', '#', '+' so that terms such as "c++" or "c#" survive
    letters_only = re.sub("[^a-zA-Z0-9.#+]", " ", raw_text)
    words = letters_only.lower().split()
    # Drop English stop words plus the extra stop words 'learn' and 'course'
    stops = set(stopwords.words('english')) | {'learn', 'course'}
    meaningful_words = [w for w in words if w not in stops]
    # Reduce each remaining word to its lemma
    lemmatizer = WordNetLemmatizer()
    lem_words = [lemmatizer.lemmatize(w) for w in meaningful_words]
    return " ".join(lem_words)

The text should be preprocessed to remove noise.
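
As a rough illustration of the noise-removal step, the sketch below applies only the character filter from `clean_words`, without stop-word removal or lemmatization, so it needs no NLTK data. The helper name `strip_noise` and the sample sentence are made up for this example.

```python
import re

def strip_noise(raw_text):
    # Same character filter as clean_words: everything except letters,
    # digits, '.', '#' and '+' is replaced by a space.
    letters_only = re.sub("[^a-zA-Z0-9.#+]", " ", raw_text)
    # Lowercase and collapse the runs of whitespace left by the substitution.
    return " ".join(letters_only.lower().split())

print(strip_noise("Learn C++ & C# (2019 edition)!"))
# prints: learn c++ c# 2019 edition
```

Keeping `.`, `#` and `+` means that technology names such as `c++`, `c#` or `asp.net` are not broken apart before embedding.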

We are going to build vectors from the `objectives_summary` of each course.

In [9]:
df['objectives_summary'] = [clean_words(i) for i in df['objectives_summary']]

In [11]:
BASE_VECTORS = get_features(df['objectives_summary'].values)

Now `BASE_VECTORS` is a 79821 x 512 matrix of vectors converted from the `objectives_summary` text.

In [18]:
df['num_subscribers'] = df['num_subscribers'].astype('int64')
df['avg_rating_recent'] = df['avg_rating_recent'].astype('float64')
norm = (df['avg_rating_recent'] * df['num_subscribers'])/(df['num_subscribers']+1000)
scaled_foctor = MinMaxScaler().fit_transform(norm.values.reshape(-1,1))

We convert `num_subscribers` and `avg_rating_recent` to numeric values and generate a scale factor: `avg_rating_recent` multiplied by `num_subscribers` and divided by `num_subscribers + 1000`, then rescaled to the range [0, 1] with `MinMaxScaler`. The constant 1000 in the denominator damps the rating of courses with few subscribers, so the factor reflects both the number of subscribers and the average rating: courses with more subscribers and higher ratings get a higher value.
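
To make the effect of the scale factor concrete, here is a small sketch using three invented courses. The function names `damped_rating` and `min_max` are hypothetical, and the min-max step is written out in NumPy as a stand-in for `MinMaxScaler`.

```python
import numpy as np

def damped_rating(avg_rating, num_subscribers, prior=1000):
    # rating * n / (n + prior): close to 0 for tiny courses,
    # close to the raw rating once n is much larger than prior.
    return avg_rating * num_subscribers / (num_subscribers + prior)

def min_max(values):
    # Rescale to [0, 1], as MinMaxScaler does for a single column.
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

ratings = np.array([5.0, 4.5, 4.0])    # invented average ratings
subs = np.array([10, 1000, 100000])    # invented subscriber counts

norm = damped_rating(ratings, subs)    # approx. [0.05, 2.25, 3.96]
scale = min_max(norm)                  # approx. [0.00, 0.56, 1.00]
print(norm, scale)
```

The 5.0-rated course with only 10 subscribers ends up with the lowest factor, while the 4.0-rated course with 100000 subscribers gets the highest. Note that min-max scaling maps the lowest-scoring course to exactly 0, so its scaled similarity is 0 whatever the query.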

In [14]:
def semantic_search(query, dataframe, vectors, scale):
    query = clean_words(query)
    print("Extracting features...")
    query_vec = get_features(query)
    sim = cosine_similarity(vectors, query_vec)
    dataframe['scaled_sim'] = sim*scale
    top_5 = dataframe.sort_values('scaled_sim', ascending=False)[:5]
    return top_5

This function converts the input text into a 512-dimensional vector and uses cosine similarity to measure the similarity between that vector and each row of `BASE_VECTORS`. Cosine similarity is simply the cosine of the angle between two vectors. The formula follows directly from the dot product of vectors:

$$
\cos(\theta) = \frac{A \cdot B}{\left\| A\right\| \left\| B\right\| } = \frac{A \cdot B}{\sqrt{\sum{A_i^2}} \cdot \sqrt{\sum{B_i^2}}}
$$

The function finds the `objectives_summary` entries most similar to the search term and returns the 5 courses that best match it while also having a high number of subscribers and high ratings.
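
The ranking logic can be sketched on toy data without TensorFlow. The 2-dimensional vectors and scale values below are invented, and `cosine_sim` is a hypothetical NumPy stand-in for scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_sim(matrix, query):
    # Dot product of every row with the query, divided by the product of norms.
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return matrix @ query / norms

vectors = np.array([[1.0, 0.0],    # same direction as the query
                    [1.0, 1.0],    # 45 degrees away from the query
                    [0.0, 1.0]])   # orthogonal to the query
query = np.array([1.0, 0.0])
scale = np.array([0.2, 1.0, 1.0])  # invented popularity factors

sim = cosine_sim(vectors, query)   # approx. [1.0, 0.707, 0.0]
scaled = sim * scale               # approx. [0.2, 0.707, 0.0]
ranking = np.argsort(-scaled)      # indices from best to worst match
print(ranking)
# prints: [1 0 2]
```

The first course is the closest match by cosine similarity alone, but its low popularity factor pushes it below the second course once the scale is applied.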

In [16]:
!jupyter nbconvert --to script 02_Recommender.ipynb

[NbConvertApp] Converting notebook 02_Recommender.ipynb to script
[NbConvertApp] Writing 5364 bytes to 02_Recommender.py


Finally, rename the Python file to `Recommender.py`.

# The MOOC_BOT Engineering
<img src="../Image/MOOC_BOT.jpg" width="100">

The Python files in this folder deploy MOOC_BOT, a simple bot that answers questions about Udemy online courses. The user can ask about popular courses and ask the bot to recommend courses that match a search.

One important thing to note about this design is that the data and processing are all handled on the local system. Even though we use IBM, it is used as an API service and none of the internal data is sent to IBM. This way the entire design can be implemented in your workplace without having to worry about data transfers.

The [SlackBot](https://github.com/Sundar0989/Movie_Bot/blob/master/slack/Create_slack_app.ipynb) and IBM [Watson account](https://github.com/Sundar0989/Movie_Bot/blob/master/nlp/IBM_Watson_Conversation_setup.ipynb) are set up following Sundar0989's tutorials. The API keys and authorizations are in `config.py`.

### 1. `nlp_command.py`
Users interact with MOOC_BOT via Slack. Once a user posts a question, it is passed to the backend system for analysis.

**Identify Response conditions in the IBM Watson Platform:**

- Intents: what the user is trying to ask or query.
- Dialog/Interaction: provides the appropriate request/response for the user question.

> For instance, when the user asks "Recommend me Udemy courses", the intent is detected as the predefined `recommend_moocs` intent, and the response is given based on the condition set in `nlp_command.py`.
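
A minimal sketch of this intent-to-response step might look as follows. It assumes the Watson reply is a dictionary with an `intents` list of `{"intent", "confidence"}` entries. The functions `top_intent` and `respond`, the confidence threshold, and the reply texts are all hypothetical and not taken from `nlp_command.py`.

```python
def top_intent(watson_response, min_confidence=0.5):
    """Return the most confident intent name, or None if nothing qualifies."""
    intents = watson_response.get("intents", [])
    if not intents:
        return None
    best = max(intents, key=lambda item: item["confidence"])
    return best["intent"] if best["confidence"] >= min_confidence else None

def respond(watson_response, handlers,
            fallback="Sorry, I did not understand that."):
    # Look up the handler registered for the detected intent, if any.
    handler = handlers.get(top_intent(watson_response))
    return handler() if handler else fallback

# Hypothetical handler table: one callable per predefined intent.
handlers = {"recommend_moocs": lambda: "What topic are you interested in?"}

sample = {"intents": [{"intent": "recommend_moocs", "confidence": 0.93}]}
print(respond(sample, handlers))
# prints: What topic are you interested in?
```

Keeping the handlers in a dictionary keyed by intent name lets a new intent be supported by adding one entry rather than another branch.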


### 2. `slack_command.py`

Slack app API engineering: includes functions that return the Slack output, mainly through the `attachments` argument of `chat.postMessage`.
 
 ```  
      slack_client.api_call(
      "chat.postMessage",
      channel= ' ',
      attachments='')   
 ```
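
As a hedged sketch of what could be passed as `attachments`, the helper below turns a list of course records into Slack attachment dictionaries. The function `build_attachments` and the record keys `title`, `url` and `headline` are assumptions for illustration, not the actual contents of `slack_command.py`.

```python
def build_attachments(courses):
    """Build one Slack message attachment per course record."""
    attachments = []
    for course in courses:
        attachments.append({
            "fallback": course["title"],   # plain-text fallback for notifications
            "title": course["title"],
            "title_link": course["url"],   # makes the title clickable
            "text": course["headline"],
        })
    return attachments

# Invented sample record
courses = [{"title": "Intro to SQL",
            "url": "https://www.udemy.com/",
            "headline": "Query data from scratch"}]
attachments = build_attachments(courses)
print(attachments[0]["title"])
# prints: Intro to SQL
```

The resulting list would then be supplied as the `attachments` value in the `chat.postMessage` call shown above.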
 
 
### 3. `Recommender.py`

The content-based recommender function generated [here](../Code/02_Content_Based_Recommender.ipynb). Given the text input in the Slack bot, the recommender returns the top 5 courses related to your search.


### 4. `chatbot_functions.py`

The dataset is processed here to produce results formatted as Slack messages.

> For instance, the `Top5_Overall` variable holds the five courses with the most subscribers.

### 5. `main.py`

Finally, `main.py` puts all the functions together to start the bot.


## Step 7: Initiate Bot

Navigate to the folder where the main python script exists and run the code below.

```sh
python3 main.py
