In [30]:
import numpy as np
import pandas as pd
import nltk

nltk.download('punkt') # one time execution
import re

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\RobertPagano\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [31]:
df = pd.read_csv("data/medium_articles.csv")


In [32]:
df.head()

Unnamed: 0,author,claps,reading_time,link,title,text
0,Justin Lee,8.3K,11,https://medium.com/swlh/chatbots-were-the-next...,Chatbots were the next big thing: what happene...,"Oh, how the headlines blared:\nChatbots were T..."
1,Conor Dewey,1.4K,7,https://towardsdatascience.com/python-for-data...,Python for Data Science: 8 Concepts You May Ha...,If you’ve ever found yourself looking up the s...
2,William Koehrsen,2.8K,11,https://towardsdatascience.com/automated-featu...,Automated Feature Engineering in Python – Towa...,Machine learning is increasingly moving from h...
3,Gant Laborde,1.3K,7,https://medium.freecodecamp.org/machine-learni...,Machine Learning: how to go from Zero to Hero ...,If your understanding of A.I. and Machine Lear...
4,Emmanuel Ameisen,935,11,https://blog.insightdatascience.com/reinforcem...,Reinforcement Learning from scratch – Insight ...,Want to learn about applied Artificial Intelli...


In [33]:
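# Clean up the text: drop newlines and add a space after sentence punctuation.
# regex=False treats '.' and '?' as literal characters, not regex metacharacters.
df['text'] = df['text'].str.replace('\n', '', regex=False)
df['text'] = df['text'].str.replace('.', '. ', regex=False)
df['text'] = df['text'].str.replace('.  ', '. ', regex=False)
df['text'] = df['text'].str.replace('?', '? ', regex=False)
df['text'] = df['text'].str.replace('!', '! ', regex=False)
df['text'] = df['text'].str.replace(':', ': ', regex=False)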

In [34]:
df['text'][0]

"Oh, how the headlines blared: Chatbots were The Next Big Thing. Our hopes were sky high. Bright-eyed and bushy-tailed, the industry was ripe for a new era of innovation:  it was time to start socializing with machines. And why wouldn’t they be?  All the road signs pointed towards insane success. At the Mobile World Congress 2017, chatbots were the main headliners. The conference organizers cited an ‘overwhelming acceptance at the event of the inevitable shift of focus for brands and corporates to chatbots’. In fact, the only significant question around chatbots was who would monopolize the field, not whether chatbots would take off in the first place: One year on, we have an answer to that question. No. Because there isn’t even an ecosystem for a platform to dominate. Chatbots weren’t the first technological development to be talked up in grandiose terms and then slump spectacularly. The age-old hype cycle unfolded in familiar fashion. . . Expectations built, built, and then. . . . . 

For the sake of this exercise, I will keep only the first row of the DataFrame and summarize that one article.

In [35]:
df = df.iloc[[0]]
df.head()

Unnamed: 0,author,claps,reading_time,link,title,text
0,Justin Lee,8.3K,11,https://medium.com/swlh/chatbots-were-the-next...,Chatbots were the next big thing: what happene...,"Oh, how the headlines blared: Chatbots were Th..."


In [36]:
from nltk.tokenize import sent_tokenize
sentences = []
for s in df['text']:
  sentences.append(sent_tokenize(s))

sentences = [y for x in sentences for y in x] # flatten list

## Load in Word Embeddings

Download the pre-trained GloVe word embeddings (the 6B set, trained on Wikipedia 2014 and Gigaword 5) from:

http://nlp.stanford.edu/data/glove.6B.zip

In [37]:
# Extract word vectors: each line holds a token followed by its 100 coefficients
word_embeddings = {}
with open('data/glove.6B/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [38]:
len(word_embeddings)

400000

In [39]:
# vectors for 400k different terms are now stored in the dictionary 'word_embeddings'

## Text Preprocessing

In [40]:
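# replace punctuation, numbers and special characters with spaces
# (regex=True is required for the character class to be treated as a pattern)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert everything to lowercase
clean_sentences = [s.lower() for s in clean_sentences]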

nltk.download('stopwords')

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\RobertPagano\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [41]:
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [42]:
# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

## Vector Representation of Sentences to Summarize


In [43]:
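sentence_vectors = []
for i in clean_sentences:
  words = i.split()
  if len(words) != 0:
    # average the word vectors; out-of-vocabulary words count as zero vectors,
    # and the +0.001 keeps the denominator away from zero
    v = sum([word_embeddings.get(w, np.zeros((100,))) for w in words])/(len(words)+0.001)
  else:
    v = np.zeros((100,))
  sentence_vectors.append(v)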

In [44]:
v

array([-0.06382809,  0.09177911,  0.12717892,  0.06284608, -0.01834458,
       -0.09346077,  0.19451775,  0.20676911, -0.00334608,  0.02346177,
        0.1657896 , -0.03655922, -0.13967766,  0.05920789,  0.06758371,
       -0.00159665,  0.07687406, -0.12206397,  0.0054083 ,  0.10788106,
        0.17871564, -0.10188406,  0.03512993,  0.13056472, -0.14343828,
        0.05956272,  0.15232384, -0.110997  , -0.08306597, -0.20588706,
        0.07406547, -0.16765367,  0.03341829,  0.06630685,  0.34570214,
       -0.07956772,  0.01941429,  0.03512494,  0.23765118,  0.01764918,
       -0.05262369, -0.16465517, -0.03009995, -0.00084315, -0.06967766,
        0.26414294,  0.04409545,  0.07864068,  0.06878811,  0.17954773,
       -0.2587706 , -0.06105197,  0.109008  , -0.01011944, -0.12214893,
        0.00391104,  0.08542229, -0.07989755,  0.02728636, -0.02134583,
       -0.02693653, -0.09051724, -0.0926012 ,  0.02701899,  0.11056971,
       -0.05942029, -0.03931034,  0.0248993 ,  0.15335332,  0.09

## Similarity Matrix Preparation


Now we need to measure how similar the sentences are to one another, and we'll use cosine similarity to do so. We first initialize an n x n matrix of zeros (n is the number of sentences), then fill each off-diagonal entry with the cosine similarity of the corresponding pair of sentence vectors.
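To make the measure concrete, here is a minimal sketch of cosine similarity on hypothetical 3-dimensional vectors (the notebook's vectors have 100 dimensions). The helper `cosine` and the `toy_*` names are illustrative only and are not part of the notebook; returning 0.0 for an all-zero vector is an assumption of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat an empty sentence as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical toy "sentence vectors"
toy_vectors = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as the first, different length
    [0.0, 1.0, 0.0],  # orthogonal to the first two
]

# Fill the off-diagonal entries, leaving the diagonal at zero
n = len(toy_vectors)
toy_sim = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            toy_sim[i][j] = cosine(toy_vectors[i], toy_vectors[j])
```

Because cosine similarity depends only on direction, the first two vectors score as identical even though their lengths differ, and the matrix comes out symmetric.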

In [51]:
# similarity matrix
sim_mat = np.zeros([len(clean_sentences), len(clean_sentences)])

In [52]:
from sklearn.metrics.pairwise import cosine_similarity

In [53]:
for i in range(len(clean_sentences)):
  for j in range(len(clean_sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]

## Applying PageRank Algorithm

**Note:** I also want to try "TextRank".

Here we convert the similarity matrix "sim_mat" into a graph: the nodes represent sentences and the edge weights are the similarity scores between them. We then apply the PageRank algorithm to this graph to rank the sentences.

For more info on PageRank, see here: https://en.wikipedia.org/wiki/PageRank
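For intuition, the idea behind PageRank on a weighted graph can be sketched as a simple power iteration. This is a simplified illustration, not the networkx implementation: the function `pagerank_sketch`, the fixed iteration count, and the toy weights are all assumptions, and the sketch assumes non-negative weights with at least one positive edge per node.

```python
def pagerank_sketch(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted adjacency matrix (list of lists)."""
    n = len(weights)
    out_strength = [sum(row) for row in weights]  # total edge weight leaving each node
    scores = [1.0 / n] * n                        # start from a uniform distribution
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Each node i passes on its score in proportion to the weight of edge i -> j
            incoming = 0.0
            for i in range(n):
                if out_strength[i] > 0:
                    incoming += scores[i] * weights[i][j] / out_strength[i]
            new_scores.append((1.0 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Hypothetical similarity matrix: sentences 0 and 1 are very similar,
# sentence 2 is only weakly related to both
toy_weights = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
toy_scores = pagerank_sketch(toy_weights)
```

In this toy graph the two strongly connected sentences end up with equal scores, both higher than the weakly connected one, which is the behavior the summarizer relies on.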

In [54]:
import networkx as nx

nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

## Summary Extraction

Now we sort the sentences by their PageRank scores and take the top N as the summary.
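One possible variation, sketched here with hypothetical data, is to pick the top N sentences by score but present them in their original order, which can make the summary read more naturally. The helper `top_n_in_order` is illustrative and not part of the notebook; the scores are given as a dict keyed by sentence index, mirroring the shape of the `scores` object used in this notebook.

```python
import heapq

def top_n_in_order(sentences, scores, n):
    """Return the n highest-scoring sentences, kept in their original order."""
    top_indices = heapq.nlargest(n, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(top_indices)]

# Hypothetical sentences and scores
toy_sentences = ["First.", "Second.", "Third.", "Fourth."]
toy_scores = {0: 0.10, 1: 0.40, 2: 0.15, 3: 0.35}

summary = top_n_in_order(toy_sentences, toy_scores, n=2)
print(" ".join(summary))
```

If n exceeds the number of sentences, the helper simply returns all of them.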

In [55]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [56]:
# Print the top 5 sentences as the summary
# (slicing avoids an IndexError if the text has fewer than 5 sentences)
for _, sentence in ranked_sentences[:5]:
  print(sentence)

Building a bot for the sake of it, letting it loose and hoping for the best will never end well: The vast majority of bots are built using decision-tree logic, where the bot’s canned response relies on spotting specific keywords in the user input.
Some platforms provide a bit of NLP, but even the best is at toddler-level capacity (for example, think about Siri understanding your words, but not their meaning.
This turned out to be a whole lot more difficult than anyone originally realised: The next item on the agenda was holding a two-way dialog with a machine.
The next wave will be multimodal apps, where you can say what you want (like with Siri) and get back information as a map, text, or even a spoken response.
The level of AI required for human-like conversation just isn’t available yet.
