In [52]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import json
import re

# Welcome to A Basic Intro to NLP from Jordan Jomsky

In this notebook, I will walk you through one kind of NLP problem: **summarization**. My approach is deliberately oversimplified; the goal is to introduce the concepts rather than reproduce the robust methods used in industry. We will use a very popular introductory dataset, the Yelp reviews dataset, which is what first introduced me to NLP.

In [6]:
# Bringing in the dataset (you may need to change the filepath to make it work on your computer)

yelp = pd.read_csv("/content/yelp.csv")

# Problem 1: Summarization

When we want to write summaries, how does our brain condense information and remove extraneous information? Further, how can we translate these ideas into a method that can automate that task?

Let's start with a simple strategy: **pick the sentence that is most similar to the rest of the document and use it as a one-sentence summary**. The idea is that the sentence most similar, on average, to every other sentence captures the document's overall meaning without our having to generate any new text. Let's take a look.

In [33]:
# Importing spaCy, one of the most popular NLP packages for Python
!python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load("en_core_web_lg")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 1.0 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [59]:
# Grabbing the longest review from the dataset to summarize

longest_review = yelp.loc[yelp['text'].str.len().idxmax(),'text'].lower()

In [60]:
# A basic max-search loop: simple, though not efficient (it computes every pairwise similarity)
# Each sentence is a spaCy Span, so we can use its built-in similarity method
longest_review_sentences = list(nlp(longest_review).sents)

best_score = 0
best_sentence = ""

for candidate in longest_review_sentences:
  # Mean similarity of this candidate to every sentence in the review (itself included)
  similarity = np.mean([other.similarity(candidate) for other in longest_review_sentences])

  if similarity > best_score:
    best_score = similarity
    best_sentence = candidate

print("Similarity Score: " + str(best_score))
print("Best Sentence: " + str(best_sentence))

Similarity Score: 0.7688554015200016
Best Sentence: at this point that the night could have turned into a disaster, but to their credit - it didn't.
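
For intuition about what the score above measures: by default, spaCy's `similarity` is the cosine similarity between the averaged word vectors of the two spans. Here is a minimal sketch of the same "pick the most central sentence" idea in plain Python. The 3-dimensional vectors and the helper names `cosine` and `most_central` are made up for illustration and are not part of spaCy.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_central(vectors):
    """Return (index, score) of the vector with the highest mean similarity to all vectors."""
    best_index, best_score = -1, float("-inf")
    for i, candidate in enumerate(vectors):
        # Like the loop above, the mean includes the candidate's similarity to itself
        score = sum(cosine(candidate, other) for other in vectors) / len(vectors)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score

# Hypothetical "sentence vectors": the middle one sits between the other two
toy_vectors = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
]
index, score = most_central(toy_vectors)
print(index, round(score, 3))
```

The middle vector wins because it points partly in the direction of both neighbours. In the same way, a sentence that wins on this measure is the most "average" one, which is not necessarily the most informative.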


So, that was not great, but it is a start. Let's refine the strategy. Whole sentences are hard to compare directly. Instead, we can extract the words that occur most often and score each sentence by how often it uses those words, weighting each word by its frequency.

In [66]:
# Let's get the keywords of the review
from spacy.lang.en.stop_words import STOP_WORDS
from collections import Counter

stopwords = list(STOP_WORDS) # stop words are very common words (e.g. "the", "and") that carry little meaning on their own
doc = nlp(longest_review)
parts_of_speech = ['PROPN', 'NOUN', 'ADJ', 'VERB'] # Focusing on the meat and potatoes of the sentence

keywords = []

for token in doc: # a token is any unit the tokenizer produces: a word, a punctuation mark, or whitespace
  if token.pos_ in parts_of_speech and token.text not in stopwords:
    keywords.append(token.text)

top5_words = dict(Counter(keywords).most_common(5))

# Normalize the counts so the five weights sum to 1
factor = 1/sum(top5_words.values())
top5_words={key:count*factor for key,count in top5_words.items()}
top5_words

{'bar': 0.25,
 'caroline': 0.20833333333333331,
 'served': 0.20833333333333331,
 'vintage': 0.16666666666666666,
 'drinks': 0.16666666666666666}
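
To make the counting and normalizing step concrete, here is a minimal sketch on a toy token list, using only the standard library. The stop-word set and tokens are hypothetical stand-ins for spaCy's `STOP_WORDS` and the part-of-speech-filtered tokens above.

```python
from collections import Counter

# Hypothetical stand-ins for spaCy's STOP_WORDS and the filtered review tokens
toy_stop_words = {"the", "a", "was", "and"}
toy_tokens = ["the", "bar", "was", "vintage", "and", "the",
              "vintage", "bar", "served", "drinks", "bar"]

# Drop stop words, then keep only the two most frequent remaining words
content_words = [t for t in toy_tokens if t not in toy_stop_words]
top_counts = dict(Counter(content_words).most_common(2))

# Normalize over the kept words only, so the weights sum to 1
total = sum(top_counts.values())
weights = {word: count / total for word, count in top_counts.items()}
print(weights)
```

Note that the weights are normalized over the top words only, not over every word in the text, which matches what the cell above does with its top five.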

In [67]:
# Let's use the keywords to get the highest scoring sentence
best_score = 0
best_sentence = ""

for sent in doc.sents:
  sent_score = 0
  for token in sent:
    if token.text in top5_words.keys():
      word_score = top5_words[token.text]
      sent_score += word_score
  if sent_score > best_score:
    best_score = sent_score
    best_sentence = sent

print("Keyword Score: " + str(best_score))
print("Best Sentence: " + str(best_sentence))

Keyword Score: 0.625
Best Sentence: exactly.

caroline and i told the hostesses we were only there for drinks, so we were seated in the bar area in some fabulous leather club chairs.
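
As a sanity check, the score of 0.625 can be reproduced by hand from the keyword weights printed earlier. The sketch below assumes a simplified regex tokenizer in place of spaCy's, which is enough for this sentence.

```python
import re

# Weights copied from the keyword output above
top5_words = {
    "bar": 0.25,
    "caroline": 0.20833333333333331,
    "served": 0.20833333333333331,
    "vintage": 0.16666666666666666,
    "drinks": 0.16666666666666666,
}

sentence = ("caroline and i told the hostesses we were only there for drinks, "
            "so we were seated in the bar area in some fabulous leather club chairs.")

# Simplified tokenizer (an assumption): runs of lowercase letters, digits, and apostrophes
tokens = re.findall(r"[a-z0-9']+", sentence)

# Sum the weight of every keyword occurrence in the sentence
matched = [t for t in tokens if t in top5_words]
score = sum(top5_words[t] for t in matched)
print(matched, round(score, 3))
```

Three of the five keywords appear once each ("caroline", "drinks", "bar"), and their weights add up to 0.625. Because every occurrence counts and nothing is divided by sentence length, longer sentences have more chances to score highly.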


Let's try to grab multiple sentences and really make a summary!

In [78]:
sentence_dict = {}

for sent in doc.sents:
  sent_score = 0
  for token in sent:
    if token.text in top5_words.keys():
      word_score = top5_words[token.text]
      sent_score += word_score
  sentence_dict[sent] = sent_score

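# Keep the three highest-scoring sentences, best first
top3_sentences = [sent for sent, score in Counter(sentence_dict).most_common(3)]

# [10:] drops the first 10 characters: the stray "exactly." and blank line attached to the top sentence above
print(" ".join(s.text for s in top3_sentences)[10:])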

caroline and i told the hostesses we were only there for drinks, so we were seated in the bar area in some fabulous leather club chairs. and speaking of the bar, even though v95 advertises itself as a wine bar, they do have booze. before i go further, understand that whenever i go out for eats or drinks, i have  in  mind a platonic ideal of the bar/pub/eatery i most want to frequent.
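
The summary above lists sentences from highest score to lowest, which can scramble the order of the story. One possible refinement, sketched here with made-up sentences and scores, is to pick the top-scoring sentences and then put them back in their original order. The helper name `top_k_in_order` is hypothetical.

```python
import heapq

def top_k_in_order(sentences, scores, k=3):
    """Pick the k highest-scoring sentences, then return them in document order."""
    top_indices = heapq.nlargest(k, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(top_indices)]

# Made-up sentences and keyword scores for illustration
toy_sentences = [
    "we arrived late.",
    "the bar served vintage drinks.",
    "it rained.",
    "caroline loved the bar.",
    "we left.",
]
toy_scores = [0.0, 0.79, 0.0, 0.46, 0.1]

summary = " ".join(top_k_in_order(toy_sentences, toy_scores, k=3))
print(summary)
```

Selecting by index rather than by sentence text also avoids trouble when the same sentence appears twice in a document.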


There we go! A simple summary. Summarization is a well-studied problem with far stronger methods than this one, but this was a gentle introduction to some of the things you can do with Python's spaCy package, including part-of-speech tagging and similarity scoring. I leave an exercise for you: can you build a model that classifies a review as cool, useful, or funny? The labels are already built into the dataset, so go nuts!

In [84]:
yelp

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
...,...,...,...,...,...,...,...,...,...,...
9995,VY_tvNUCCXGXQeSvJl757Q,2012-07-28,Ubyfp2RSDYW0g7Mbr8N3iA,3,First visit...Had lunch here today - used my G...,review,_eqQoPtQ3e3UxLE4faT6ow,1,2,0
9996,EKzMHI1tip8rC1-ZAy64yg,2012-01-18,2XyIOQKbVFb6uXQdJ0RzlQ,4,Should be called house of deliciousness!\n\nI ...,review,ROru4uk5SaYc3rg8IU7SQw,0,0,0
9997,53YGfwmbW73JhFiemNeyzQ,2010-11-16,jyznYkIbpqVmlsZxSDSypA,4,I recently visited Olive and Ivy for business ...,review,gGbN1aKQHMgfQZkqlsuwzg,0,0,0
9998,9SKdOoDHcFoxK5ZtsgHJoA,2012-12-02,5UKq9WQE1qQbJ0DJbc-B6Q,2,My nephew just moved to Scottsdale recently so...,review,0lyVoNazXa20WzUyZPLaQQ,0,0,0
