In [1]:
import sys
import os
import re

import pandas as pd
import numpy as np

from facepy import GraphAPI
import spacy
from textacy import extract
from textacy.doc import Doc
from textacy import similarity

stdout = sys.stdout
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdout = stdout

In this post we will use modern NLP techniques to find similar posts in a Facebook group.

You have probably been in a situation where you want to post something in a Facebook group but are not sure whether almost the same post already exists, perhaps hiding on the next page. You could click through the pages and read all the posts, or you could use the search, but both approaches have drawbacks. The first is obviously very tedious; with the second you have to try different words and phrasings so as not to miss anything.

Our goal here is to find the most similar posts in a group based on their meaning.

First, you will need an `App-ID` and an `App-Secret` from Facebook in order to access the content through the API. Registering an app is simple, and the required steps are described here: https://developers.facebook.com/docs/apps/register

Second, you will need the `Group-ID`. I decided to look at the posts in the group "Python Programming Language" (https://www.facebook.com/groups/python.programmers/). The easiest way to get the ID is to go to the website https://lookup-id.com and paste the URL of the group there.

We will use the library `facepy` (https://github.com/jgorset/facepy), which "makes it really easy to use Facebook's Graph API". You can install `facepy` with `pip install facepy`.

Once you have installed facepy and have got your App-ID, App-Secret and Group-ID, scraping all posts from the group wall can be done with just a couple lines of code.

In [2]:
group_id = '457660044251817' #https://www.facebook.com/groups/python.programmers/
app_id = os.environ["FB_APP_ID"]
app_secret = os.environ["FB_APP_SECRET"]
access_token = app_id + "|" + app_secret

graph = GraphAPI(access_token)
pages = graph.get(group_id + "/feed", page=True, retry=3, limit=100)
i = 0
posts =[]

#Iterate through pages and store all posts in one list
for p in pages:
    if i%10 == 0:
        print 'Downloading page', i
    posts.extend(p['data'])
    i += 1




Downloading page 0
Downloading page 10
Downloading page 20
Downloading page 30
Downloading page 40
Downloading page 50
Downloading page 60
Downloading page 70


In [3]:
#Convert the list to a dataframe for more comfort
df = pd.DataFrame(posts)

#Drop empty posts
df.dropna(subset=['message'],inplace=True)

#Do some cleaning and drop posts with 20 or fewer characters after cleaning
df["cleanmessage"] = (df['message'].str.lower()
                                   .str.replace(r'\n|\r|\t', ' ')
                                   .str.replace(r'[^a-z0-9+]+', ' ')
                                   .str.replace(r' +', ' '))

df = df.loc[df['cleanmessage'].str.len() > 20]
df.reset_index(inplace=True,drop=True)
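
To make the three cleaning steps easier to follow, here is a minimal sketch that applies the same patterns to a single string with the standard `re` module. The helper name `clean_message` and the sample text are made up for illustration. Note that in recent pandas versions `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed, so the cell above may need that argument depending on your pandas version.

```python
import re

def clean_message(text):
    """Lower-case a post and squeeze everything except a-z, 0-9 and '+' into single spaces."""
    text = text.lower()
    text = re.sub(r'\n|\r|\t', ' ', text)      # line breaks and tabs become spaces
    text = re.sub(r'[^a-z0-9+]+', ' ', text)   # punctuation and other symbols become spaces
    text = re.sub(r' +', ' ', text)            # collapse runs of spaces
    return text

# Hypothetical sample post
sample = "Hi all!\nAny good C++ / Python3 books?"
cleaned = clean_message(sample)
print(repr(cleaned))
```

The `+` is kept on purpose so that tokens such as `c++` survive the cleaning. Like the notebook cell, the sketch does not strip a trailing space.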

There are many different ways to calculate the similarity between two documents. First we need to decide how we want to represent the document (bag of words https://en.wikipedia.org/wiki/Bag-of-words_model, Tf-Idf https://en.wikipedia.org/wiki/Tf%E2%80%93idf, word embeddings https://en.wikipedia.org/wiki/Word_embedding), and then we have to choose the distance metric (Euclidean distance https://en.wikipedia.org/wiki/Euclidean_distance, cosine similarity https://en.wikipedia.org/wiki/Cosine_similarity, word mover's distance http://proceedings.mlr.press/v37/kusnerb15.pdf), which tells us how close (similar) two documents are.
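
As a rough illustration of one such combination (bag of words plus cosine similarity), the following sketch compares three made-up sentences. It assumes simple whitespace tokenization and is not part of the pipeline used later in this post.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Represent a document as word counts (whitespace tokenization)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical example documents
doc_a = bag_of_words("the lion sleeps")
doc_b = bag_of_words("the lion hunts")
doc_c = bag_of_words("a tiger rests")

sim_ab = cosine_similarity(doc_a, doc_b)  # two shared words
sim_ac = cosine_similarity(doc_a, doc_c)  # no shared words
print(sim_ab)
print(sim_ac)
```

The first and third sentences describe nearly the same scene, yet their bag-of-words similarity is zero because they share no words. This is the weakness that word embeddings address.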

We are going to represent the content of the Facebook posts with word embeddings and compare the transformed posts with word mover's distance. This combination has been shown to yield lower k-nearest-neighbor document classification error rates than other state-of-the-art techniques.

The advantage of word embeddings is that words with similar meanings but no letters in common still have similar vectors, i.e. they are close in the embedding space (e.g. `lion` and `tiger`).
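
A toy sketch of this idea is shown below. The three-dimensional vectors are invented for illustration only; real models learn vectors with a few hundred dimensions from large text corpora.

```python
import math

# Hypothetical embeddings, hand-picked so that the two animals point in a similar direction
toy_vectors = {
    "lion":  [0.9, 0.8, 0.1],
    "tiger": [0.8, 0.9, 0.2],
    "table": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_lion_tiger = cosine_similarity(toy_vectors["lion"], toy_vectors["tiger"])
sim_lion_table = cosine_similarity(toy_vectors["lion"], toy_vectors["table"])

# "lion" and "tiger" share no letters, yet their vectors are close
print(sim_lion_tiger > sim_lion_table)
```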

This requires a model that has been trained on a large corpus of text in the respective language. Luckily for us, such models are readily available and we don't have to train our own. We will use the library `spaCy` (https://spacy.io/) to transform the documents. You can find instructions on how to install `spaCy` and download the language models here: https://spacy.io/docs/usage/. Currently four languages (`EN`, `DE`, `ES`, `FR`) are supported out of the box, but you can find more open-source language models and add them to the library yourself.

Additionally, we will use the library `textacy` (http://textacy.readthedocs.io/en/latest/), which is built on top of `spaCy`, to compute the word mover's distance.
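
To give a feeling for what word mover's distance measures, here is a simplified sketch of the relaxed lower bound described in the word mover's distance paper linked above: every word of one document moves all of its weight to the nearest word of the other document, and the larger of the two directions is kept. This is not the full optimal-transport computation and not `textacy`'s implementation. The two-dimensional vectors and the function names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical 2-dimensional embeddings for illustration only
toy_vectors = {
    "lion":   (1.0, 0.0),
    "tiger":  (0.9, 0.1),
    "sleeps": (0.0, 1.0),
    "rests":  (0.1, 0.9),
    "table":  (5.0, 5.0),
    "stands": (5.0, 6.0),
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_sided_cost(src_words, dst_words, vectors):
    """Move each source word's normalized weight to its nearest destination word."""
    weights = Counter(src_words)
    total = float(sum(weights.values()))
    cost = 0.0
    for word, count in weights.items():
        nearest = min(euclidean(vectors[word], vectors[other])
                      for other in set(dst_words))
        cost += (count / total) * nearest
    return cost

def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed word mover's distance: the larger of the two one-sided costs."""
    return max(one_sided_cost(doc_a, doc_b, vectors),
               one_sided_cost(doc_b, doc_a, vectors))

d_close = relaxed_wmd(["lion", "sleeps"], ["tiger", "rests"], toy_vectors)
d_far = relaxed_wmd(["lion", "sleeps"], ["table", "stands"], toy_vectors)
print(d_close < d_far)
```

The sketch returns a distance, so smaller means more alike. The notebook cells below instead treat the value returned by `similarity.word_movers` as a similarity, where larger means more alike.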

In [4]:
#Load spaCy's English model
nlp = spacy.load('EN')

textacy_docs = {}

#Transform the posts to spaCy documents
docs = list(nlp.pipe(df.cleanmessage.values, 
                      batch_size=1000, 
                      n_threads=3, 
                      tag=False, 
                      entity=False))

#Transform spaCy documents to textacy documents
for i in np.arange(len(docs)):
    try:
        textacy_docs[i] = Doc(docs[i])
    except Exception as e:
        print("Failed to get word vector:{}, {}".format(i, e))
        continue

In [5]:
#Create a textacy document from your own post
my_post = unicode("""
I want to become a data scientist.
What online resources can you recommend? 
""")

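#str.replace on a plain string does not interpret regular expressions, so use re.sub
clean_post = re.sub(r'\n|\r|\t', ' ', my_post.lower())
clean_post = re.sub(r'[^a-z0-9+]+', ' ', clean_post)
clean_post = re.sub(r' +', ' ', clean_post)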

doc2 = Doc(content=clean_post, lang=nlp.lang)

In [6]:
def has_vector(doc):
    t = list(extract.words(doc))
    has_vec = np.sum([x.has_vector for x in t])
    return (has_vec > 0)

In [7]:
#Iterate through the posts and calculate the similarity between your own post and the existing posts
similarities = {}
for i, doc1 in textacy_docs.iteritems():
    #check if has any word vectors
    if (not has_vector(doc1)):
        continue

    try:
        this_sim = similarity.word_movers(doc1, doc2)
        if not np.isnan(this_sim):
            similarities[i] = this_sim
    except Exception as e:
        print e,i
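
The next cell picks the k best matches with `sorted(...)[-k:]`, which returns them in ascending order, so the closest match comes last. As a small sketch with made-up scores, `heapq.nlargest` from the standard library is one alternative that returns the best match first. The dictionary `toy_similarities` is a stand-in for `similarities`.

```python
import heapq

# Made-up similarity scores keyed by row index
toy_similarities = {0: 0.31, 1: 0.72, 2: 0.55, 3: 0.90, 4: 0.12}

k = 3

# Ascending order, best match last (the approach used in the notebook)
ascending_top_k = sorted(toy_similarities, key=toy_similarities.get)[-k:]

# Descending order, best match first
top_k = heapq.nlargest(k, toy_similarities, key=toy_similarities.get)

print(ascending_top_k)
print(top_k)
```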

In [8]:
#Find the k most similar posts and print the text and the direct link to each post
k = 5
kNN = sorted(similarities, key=similarities.get)[-k:]

print 'My post:\n {}\n\n'.format(my_post)
print 'Most relevant previous posts\n'
i = 1
for ind, row in df.loc[kNN].iterrows():
    print '{}) {}'.format(i,row['message'])
    post_id = row['id'].replace('_','/')
    print 'https://www.facebook.com/groups/{}\n'.format(post_id)
    i+=1

My post:
 
I want to become a data scientist.
What online resources can you recommend? 



Most relevant previous posts

1) What exactly  statistical knowledge I need to acquire for data science,machine learning. Predictive modelling etc??

Want answers from experienced data scientists.
https://www.facebook.com/groups/457660044251817/1016982104986272

2) What to become an expert in data science by learning some set of online coursers targeted to help you truly master your data science skill. Then this month is the great way start. 
coursera offering different data science specializations to full fill your dream to become data science expert. Have look at coursera data science specializations list ordered by DataAspirant team. 
ALL THE BEST


https://www.facebook.com/groups/457660044251817/1105549306129551

3) 12 Python Resources for Data Science
https://www.facebook.com/groups/457660044251817/1541228582561619

4) https://datapyr.zeef.com/kranthi.kumar
Hello everyone, I have collected s