The objective of this project is to develop a messenger bot, which provides a technological watch on a specific theme (artificial intelligence in our case).
The project operates in several stages:
- Load and clean the previously crawled data
- We get the last users and create a session for them if they are new
- We load the data of each user and we reentrain the DNN
- We predict which articles are the most suitable for each user
- We update the data on S3
The purpose of the tutorial is to give you a guideline to develop your own recommendation system. We're going to build a system which performs technical monitoring. In our case we will focus on technical media.
This article is divided in two parts : how to collect data and build a recommendation system and a second part, how to build a messenger bot which send daily messages and collect data to improve the recommendation.
- Collect data
- Clean Data
- Predict articles
To manage all this section, we used python. First, build your environnement, you can use Anaconda (https://www.youtube.com/watch?v=pVME6JvdD5g&t=67s) or VirtualEnv. In our case, we used VirtualEnv with PyCharm for the IDE.
The methode we use to collect the data is the scraping, using the library Scrapy. This is the first lib you need to install.
Collect data with Scrapy :
- Create a project : scrapy startproject NameOfTheProject
- cd /pathToYourProject
- scrapy genspider NameOfTheCrawler theLinkOfASiteYouWantToScrap.anything
A file "NameOfTheCrawler" should have been created. Open this file and look at the code.
# -*- coding: utf-8 -*-
import scrapy
class NameofthecrawlerSpider(scrapy.Spider):
name = 'NameOfTheCrawler'
allowed_domains = ['https://theLinkOfASiteYouWantToScrap.anything']
start_urls = ['https://theLinkOfASiteYouWantToScrap.anything/']
def parse(self, response):
passNow we're going to fetch the webpage you want to scrap and select the appropriate data : title of the articles, link of the articles and we will add a date for later analysis.
- Open a scrapy shell with command : scrapy shell
- Fetch the website (we will scrap medium for the example) : fetch('https://medium.com/topic/artificial-intelligence')
- If it output 200 it means that is works, then you need to search for the data you want
You can select data by using many informations : id, tag, h1, h2 etc... using response.xpath('YourXpath'). Check the doc if you want more explanation : https://docs.scrapy.org/en/xpath-tutorial/topics/xpath-tutorial.html
Here is an example of the command we use to scrap Medium :
# -*- coding: utf-8 -*-
import scrapy
from datetime import datetime
class MediumaiSpider(scrapy.Spider):
name = 'MediumAI'
allowed_domains = ['https://medium.com/topic/artificial-intelligence']
start_urls = ['https://medium.com/topic/artificial-intelligence/']
def parse(self, response):
titles = response.xpath('//*[@id="root"]/div/section/section[1]/div[3]/div[1]/section/div/section/div[1]/div[1]/div[1]/h3/a/text()').extract()
link = response.xpath('//*[@id="root"]/div/section/section[1]/div[3]/div[1]/section/div/section/div[1]/div[1]/div[1]/h3/a/@href').extract()
now = datetime.today().strftime('%Y-%m-%d')
clean_approx = []
for sentence in titles:
clean_approx.append((sentence.replace(",", "")))
yield {"Title": clean_approx, 'Link': link, 'Date': now}While you have edit your python file with your corrects xpath, you will be able to crawl the data by typing "scrapy crawl NameOfTheCrawler -o NameOfTheCrawler.csv" in a command shell at the root of your project.
Then, create a bat file where you activate your environnement and then you run every crawler.
@echo off
cmd /k "MyPythonEnvironnement\Scripts\activate & scrapy crawl MediumAI -o MediumAI.csv & scrapy crawl mediumtech -o MediumTech.csv & scrapy crawl approximatelycorrect -o approximatelycorrect.csv & scrapy crawl kdnuggets -o kdnuggets.csv"
It should store all your csv files in the same folder. Now we have to clean all this data to make predictions.
Since all the csv are stored in the same folder, we can load them and clean them in order to create a simple csv file with all the articles/links/dates.
This task is done using the class DailyData. There are two functions to clean the csv files because differents websites return data with differents structures.
Then, in the main.py file, we load every file to add it in a huge csv file and then we save this file locally by calling it 'AI_articles_dataset.csv'.
Once we have our articles of the day cleaned, we need to select 3 articles for each user depending on his preferences.
In a firt time, we need to collect many data to build a recommendation system. Since we start with no data, there is a problem called the "Cold Start problem", which mean we have to give to users articles without knowing their preferences.
We will first use a basic neural network with transfer learning (ELMO) for the embbeding layer to make a classification between articles. Each user has his own file with his history of what he has read. We initialize it with a default file and then each day we add the previous day's data depending on whether or not he has read his articles.
The file TrainModel.py works in several steps :
- Add the label from the previous day
- Preprocessing on the daily data
- Retraining of the model with new data
- Predict 3 best articles and upload it on AWS S3
As soon as we have the 3 predicted articles uploaded on S3, the messenger bot is able to send the customized articles for each user.
Here is what you need to follow this tutorial and set up your messenger bot :
- Node.js
- npm (node package manager)
- Heroku Toolbelt
- Git
- A Facebook page
What is Node.js?
One day, there was a guy named Ryan Dahl who had the brilliant idea of taking the V8 Javascript engine, it's the one found in the Chrome browser, and using it outside the browser. He created the Node.js platform! It is an open source development platform for executing JavaScript code server-side. Node is useful for developing applications that require a persistent connection from the browser to the server and is often used for real-time applications such as chat, news feeds and web push notifications.
Download node.js here: https://nodejs.org/en/download/
When your download and set up of Node.js are finished, make sure that Node.js is correctly installed on your machine by simply typing in your command prompt: node -v
If Node.js is installed you will see its version.
What is npm ?
npm is the package manager for the Node JavaScript platform. It puts modules in place so that node can find them, and manages dependency conflicts intelligently.
npm website: https://www.npmjs.com/
npm is distributed with Node.js which means that when you download Node.js, you automatically get npm installed on your computer. Make sure that npm is correctly installed on your machine by simply typing in your command prompt: npm -v
If npm is installed you will see its version.
What is Heroku ?
Heroku is a cloud platform as a service (PaaS) supporting several programming languages. Heroku, one of the first cloud platforms, has been in development since June 2007, when it supported only the Ruby programming language, but now supports Java, Node.js, Scala, Clojure, Python, PHP, and Go.
Download Heroku Toolbelt here: https://nodejs.org/en/download/
Make sure that Heroku Toolbelt is correctly installed on your machine by simply typing in your command prompt: heroku -v
If Heroku is installed you will see its version.
What is Git ?
Git is a distributed version-control system for tracking changes in source code during software development. It is designed for coordinating work among programmers, but it can be used to track changes in any set of files. Its goals include speed, data integrity, and support for distributed, non-linear workflows.
Download Git here: https://git-scm.com/downloads
Make sure that Git is correctly installed on your machine by simply typing in your command prompt: git --version
If git is installed you will see its version.
Now that we have all the necessary tools, let's create our local environment. Start by opening a new command prompt.
- Go to the directory of your choice, in this example I go to my desktop:
cd C:\Users\myUserName\desktop
- Create a new folder and give it the name of your bot
mkdir myMessengerBot
- Go to your Facebook Messenger bot folder
cd myMessengerBot
- Set up your heroku account with your Facebook Messenger bot folder
heroku login
- Update npm
npm install npm --global
- Initialize npm
When I leave the fields empty it is because you are not required to put something in them
npm init
name: myMessengerBot
version: 1.0.0
description:
entry point: index.js
test command:
git repository:
keywords:
author: yourName
license: MIT
Now, if you have configured your folder with npm correctly, you will be able to see package.json in your myMessengerBot folder there:
- Install Node.js modules
npm install express request $ body-parser —save
After this step, your myMessengerBot folder looks like this:
- Download the Javascript code needed
FacebookBot-master.zip https://codeload.github.com/elmehdimobi/FacebookBot/zip/master Thanks to El Mehdi LAIDOUNI for the code.
And put it on your desktop. You now have these two folders on your desktop:
-
Now, unzip the FacebookBot-master.zip file and you will be able to see this:
-
Copy and paste these two files from the unzip fodler to your myMessenger bot folder:
- Procfile
- index.js
Now, we are here, your local environment myMessengerBot is ready:
Now that our local environment is ready, let's configure our developer facebook account.
- Go to the Facebook developers website
- Create a new app (click on "Add a new app")
- Go to "Messenger" section and clikc on "Get started" button
- In the "Access token" section, select your Facebook page
- Copy your token
- Paste it in your index.js file at the right place:
var token = "entrez votre token ici"
Translation: "entrez votre token ici" = "enter your token here"
Now that our local environment is ready and our facebook developer account is ready too, let's configure our heroku account.
- Go to myMessengerBot folder on your desktop and open a cmd
- Initialize git
git init
- Create your app (bot) on heroku by enter your app name
heroku create myMessengerBot
So you will get the callback URL of your application, which you can copy now in your clipboard, it looks like:
Callback URL: https://myMessengerBot.herokuapp.com/
- Now, if you go to Heroku dashboard, you will see your app
- Before continuing, make a first commit to heroku to initialize your application
git add .
git commit -am "your message"
git push heroku master
- Go back to your facebook developer dashboard
- Click on your app
- Go to the "Messenger" tab
- Click on "Settings"
- Go to "webhooks" section
- Click on "Edit events"
- Fill in the fields
Callback URL: https://myMessengerBot.herokuapp.com/webhook Check the token: Bot_Messenger_App Tick all
You now have the possibility to modify the index.js file as you wish.
Messenger platform: https://developers.facebook.com/docs/messenger-platform/
Heroku documentation: https://devcenter.heroku.com/categories/reference
