# Summary

Make the wiki_summary function better at handling missing or ambiguous pages. Also check whether image retrieval is easy (or even possible).

In [1]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [125]:
from bs4 import BeautifulSoup as BS
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from pathlib import Path
import re
import requests
from tldextract import extract
import wikipedia as wiki

from jabberwocky.config import C
from jabberwocky.openai_utils import load_prompt, load_openai_api_key
from jabberwocky.external_data import W
from htools import *

In [3]:
cd_root()

Current directory: /Users/hmamin/jabberwocky


In [134]:
def _wiki_search(name, *tags):
    terms = ['wikipedia'] + name.split() + list(tags)
    r = requests.get(f'http://www.google.com/search?q={"+".join(terms)}')
    links = BS(r.text, 'lxml').find_all('a')
    return [link.replace('/url?q=', '')
            for link in map(lambda x: x['href'], links) 
            if link.startswith('/url?q=')]

In [15]:
def wiki_summary(name, *tags):
    page = W.page(name.title().replace(' ', '_'))
    if not page.exists():
        raise RuntimeError('Wikipedia page not found. Provide a URL instead.')
    summary = page.summary.splitlines()[0]
    if summary.endswith('may refer to:'):
        raise RuntimeError('Ambiguous search term. Provide a URL instead.')
    return summary

In [9]:
wiki_summary('j.k. rowling')

'Joanne Rowling  ( ROH-ling; born 31 July 1965), better known by her pen name J. K. Rowling, is a British author, philanthropist, film producer, television producer, and screenwriter. She is best known for writing the Harry Potter fantasy series, which has won multiple awards and sold more than 500 million copies, becoming the best-selling book series in history. The books are the basis of a popular film series, over which Rowling had overall approval on the scripts and was a producer on the final films. She also writes crime fiction under the pen name Robert Galbraith.'

In [31]:
wiki_summary('jk rowling')

RuntimeError: Wikipedia page not found. Provide a URL instead.

In [13]:
wiki_summary('chris lee')

'Christopher Lee (1922–2015) was an English actor and singer.'

In [57]:
r = _wiki_search('chris lee', 'politician')

In [59]:
eprint(r[:5])

 0: /url?q=https://en.wikipedia.org/wiki/Chris_Lee_(New_York_politician)&sa=U&ved=2ahUKEwibvu6Ch4fxAhX1IDQIHXM-AQgQFjAAegQIBBAB&usg=AOvVaw3enElqWa8R0y2pOkEEO5tz
 1: /url?q=https://en.wikipedia.org/wiki/Chris_Lee_(New_York_politician)%23Biography&sa=U&ved=2ahUKEwibvu6Ch4fxAhX1IDQIHXM-AQgQ0gIwAHoECAQQAg&usg=AOvVaw12Agt6zY_XL1Xa9tjlYGzz
 2: /url?q=https://en.wikipedia.org/wiki/Chris_Lee_(New_York_politician)%23Political_campaigns&sa=U&ved=2ahUKEwibvu6Ch4fxAhX1IDQIHXM-AQgQ0gIwAHoECAQQAw&usg=AOvVaw1ZKZmRVWBsqKA8zcN8V8H-
 3: /url?q=https://en.wikipedia.org/wiki/Chris_Lee_(New_York_politician)%23U.S._House_of_Representatives&sa=U&ved=2ahUKEwibvu6Ch4fxAhX1IDQIHXM-AQgQ0gIwAHoECAQQBA&usg=AOvVaw1pKb3uxYvCyhZLp3jc0lCe
 4: /url?q=https://en.wikipedia.org/wiki/Chris_Lee_(New_York_politician)%23First_term&sa=U&ved=2ahUKEwibvu6Ch4fxAhX1IDQIHXM-AQgQ0gIwAHoECAQQBQ&usg=AOvVaw0Zu3oRyNvkz5EArWYoddOE


In [66]:
extract(r[0].replace('/url?q=', ''))

ExtractResult(subdomain='en', domain='wikipedia', suffix='org')
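
The hrefs printed above still carry Google's tracking parameters and section anchors. A minimal sketch of recovering just the target URL using the standard library, assuming Google keeps wrapping results as `/url?q=<target>&...` (`clean_google_href` is a hypothetical helper, not part of jabberwocky, and the `ved` value below is shortened for illustration):

```python
from urllib.parse import parse_qs, urldefrag, urlparse


def clean_google_href(href):
    """Hypothetical helper: pull the target URL out of a Google result href.

    Returns None when the href has no `q` parameter.
    """
    targets = parse_qs(urlparse(href).query).get('q')
    if not targets:
        return None
    # parse_qs already decodes %23 to '#', so urldefrag can drop the anchor.
    return urldefrag(targets[0]).url


href = ('/url?q=https://en.wikipedia.org/wiki/'
        'Chris_Lee_(New_York_politician)%23Biography&sa=U&ved=abc')
print(clean_google_href(href))
```

Dropping the anchor means the five results shown above would collapse to a single page URL.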

In [88]:
duck_url_fmt = 'https://duckduckgo.com/?q=!ducky+{}+site%3Awikipedia.org'
terms = 'chris lee politician'
r = requests.get(duck_url_fmt.format('+'.join(terms.split())),
                 headers={'user-agent': 'jabberwocky'})
r

<Response [200]>

In [89]:
r

<Response [200]>

In [90]:
r.text

"<html><head><meta http-equiv='Content-Type' content='text/html; charset=utf-8'><meta name='referrer' content='origin'><meta name='robots' content='noindex, nofollow'><meta http-equiv='refresh' content='0; url=/l/?uddg=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FChris_Lee_(New_York_politician)&rut=301a9a08127652533377bab64eb867db64987c1add90ea92d3bafbccfeecbcde'></head><body><script language='JavaScript'>function ffredirect(){window.location.replace('/l/?uddg=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FChris_Lee_(New_York_politician)&rut=301a9a08127652533377bab64eb867db64987c1add90ea92d3bafbccfeecbcde');}setTimeout('ffredirect()',100);</script></body></html>"

In [91]:
r.url

'https://duckduckgo.com/?q=!ducky+chris+lee+politician+site%3Awikipedia.org'
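
Note that `r.url` is still the DuckDuckGo address: the response body is a redirect stub, so requests never follows it to Wikipedia. A rough sketch of reading the destination out of the `uddg` parameter, assuming the stub keeps the format seen in `r.text` above (`ducky_target` is a hypothetical helper and the `rut` value is shortened):

```python
import re
from urllib.parse import unquote


def ducky_target(html):
    """Hypothetical helper: extract the redirect target from the stub HTML.

    Returns None if no `uddg` parameter is present.
    """
    match = re.search(r"uddg=([^&'\"]+)", html)
    return unquote(match.group(1)) if match else None


html = ("<meta http-equiv='refresh' content='0; url=/l/?uddg=https%3A%2F%2F"
        "en.wikipedia.org%2Fwiki%2FChris_Lee_(New_York_politician)&rut=301a'>")
print(ducky_target(html))
```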

In [None]:
'https://www.google.com/search?btnI=1&q=%s site:developer.mozilla.org'

In [136]:
wiki.PageError

wikipedia.exceptions.PageError

In [138]:
try:
    print(wiki.summary('Derek Jetero', 
                       auto_suggest=False).splitlines()[0])
except wiki.PageError:
    print('Not found')

Not found


In [111]:
print(wiki.page('Derek Jeter', auto_suggest=False).content.splitlines()[0])

Derek Sanderson Jeter ( JEE-tər; born June 26, 1974) is an American former professional baseball shortstop, businessman, and baseball executive. He has been the chief executive officer (CEO) and part owner of the Miami Marlins of Major League Baseball (MLB) since September 2017. As a player, Jeter spent his entire 20-year MLB career with the New York Yankees. He was elected to the Baseball Hall of Fame in his first year of eligibility in 2020; he received 396 of 397 possible votes (99.75%), the second-highest percentage in MLB history and the highest by a position player.


In [139]:
print(wiki.summary('Derek Jeter', auto_suggest=False).splitlines()[0])

Derek Sanderson Jeter ( JEE-tər; born June 26, 1974) is an American former professional baseball shortstop, businessman, and baseball executive. He has been the chief executive officer (CEO) and part owner of the Miami Marlins of Major League Baseball (MLB) since September 2017. As a player, Jeter spent his entire 20-year MLB career with the New York Yankees. He was elected to the Baseball Hall of Fame in his first year of eligibility in 2020; he received 396 of 397 possible votes (99.75%), the second-highest percentage in MLB history and the highest by a position player.


In [122]:
matches = wiki.search('Chris Lee politician')
for match in matches:
    if '(disambiguation)' in match: continue
    summary = wiki.summary(match, auto_suggest=False)
    print(summary.splitlines()[0])
    break

Christopher John Lee (born April 1, 1964) is a former Republican member of the United States House of Representatives for New York's 26th congressional district. He served from January 2009 until he resigned on February 9, 2011, after it was revealed that he had solicited a woman on Craigslist.


In [127]:
matches = wiki.search('Joanne Rowling author')
for match in matches:
    if '(disambiguation)' in match: continue
    summary = wiki.summary(match, auto_suggest=False)
    print(summary.splitlines()[0])
    break

Joanne Rowling  ( ROH-ling; born 31 July 1965), better known by her pen name J. K. Rowling, is a British author, philanthropist, film producer, television producer, and screenwriter. She is best known for writing the Harry Potter fantasy series, which has won multiple awards and sold more than 500 million copies, becoming the best-selling book series in history. The books are the basis of a popular film series, over which Rowling had overall approval on the scripts and was a producer on the final films. She also writes crime fiction under the pen name Robert Galbraith.


In [128]:
re.sub(r'\s{2,}', ' ', summary.splitlines()[0])

'Joanne Rowling ( ROH-ling; born 31 July 1965), better known by her pen name J. K. Rowling, is a British author, philanthropist, film producer, television producer, and screenwriter. She is best known for writing the Harry Potter fantasy series, which has won multiple awards and sold more than 500 million copies, becoming the best-selling book series in history. The books are the basis of a popular film series, over which Rowling had overall approval on the scripts and was a producer on the final films. She also writes crime fiction under the pen name Robert Galbraith.'

In [133]:
_wiki_text_cleanup(summary.splitlines()[0])

'Joanne Rowling (ROH-ling; born 31 July 1965), better known by her pen name J. K. Rowling, is a British author, philanthropist, film producer, television producer, and screenwriter. She is best known for writing the Harry Potter fantasy series, which has won multiple awards and sold more than 500 million copies, becoming the best-selling book series in history. The books are the basis of a popular film series, over which Rowling had overall approval on the scripts and was a producer on the final films. She also writes crime fiction under the pen name Robert Galbraith.'

In [231]:
from fuzzywuzzy import fuzz, process
import warnings
from wikipedia import PageError

In [322]:
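def _wiki_text_cleanup(text):
    # Collapse runs of whitespace into a single space.
    text = re.sub(r'\s{2,}', ' ', text)
    # Drop stray non-alphanumeric characters right after an opening paren.
    return re.sub(r'\([^a-zA-Z0-9]+', '(', text)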

In [376]:
def wiki_page(name, *tags, retry=True, min_similarity=50, debug=False, 
              og_name=None):
    try:
        page = wiki.page(name, auto_suggest=False)
        score = fuzz.token_set_ratio((og_name or name).lower(),
                                     page.title.lower()) 
        if score < min_similarity:
            raise RuntimeError(
                f'Similarity score of {score} fell short of threshold '
                f'{min_similarity}. Page title: {page.title}.'
            ) from None
        return page
    except PageError:
        if not retry:
            raise ValueError(f'Couldn\'t find wikipedia page for {name}.') \
                from None
        warnings.warn('Page not found. Trying to auto-select correct match.')
        
        terms = ' '.join(name.split() + list(tags))
        matches = wiki.search(terms)
        if debug: print('matches:', matches)
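        for match in matches:
            if '(disambiguation)' in match: continue
            # Pass the threshold through so the retry honors the caller's value.
            return wiki_page(match, retry=False, og_name=name,
                             min_similarity=min_similarity)
        # No usable match: fail loudly instead of silently returning None.
        raise ValueError(f'Couldn\'t find wikipedia page for {name}.') \
            from None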

In [377]:
def download_image(url, out_path, verbose=False):
    """Ported from spellotape. Given a URL, fetch an image and download it to
    the specified path.
    
    Parameters
    ----------
    url: str
        Location of image online.
    out_path: str
        Path to download the image to.
    verbose: bool
        If True, prints a message alerting the user when the image could not 
        be retrieved.
        
    Returns
    -------
    bool: Specifies whether image was successfully retrieved.
    """
    try:
        with requests.get(url, stream=True, timeout=10) as r:
            if r.status_code != 200:
                if verbose: print(f'STATUS CODE ERROR: {url}')
                return False

            # Write bytes to file chunk by chunk.
            with open(out_path, 'wb') as f:
                for chunk in r.iter_content(256):
                    f.write(chunk)
            
    # Any time url cannot be accessed, don't care about exact error.
    except Exception as e:
        if verbose: print(e)
        return False

    return True

In [378]:
def wiki_data(name, tags=(), img_dir='data/tmp', **page_kwargs):
    page = wiki_page(name, *tolist(tags), **page_kwargs)
    summary = page.summary.splitlines()[0]
    
    # Download image if possible. Find photo with name closest to the one we
    # searched for (empirically, this seems to be a decent heuristic to give 
    # us a picture of the person rather than of, for instance, their house).
    img_url = ''
    img_path = ''
    if img_dir and page.images:
        name2url = {u.rpartition('/')[-1].split('.')[0].lower(): u 
                    for u in page.images}
        img_name, _ = process.extractOne(name.lower(), name2url.keys())
        url = name2url[img_name]
        path = Path(img_dir)/f'{img_name}.{url.rpartition(".")[-1]}'.lower()
        if download_image(url, path): 
            img_url = url
            img_path = str(path)
    return Results(summary=_wiki_text_cleanup(summary),
                   img_url=img_url, 
                   img_path=img_path)
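
The filename heuristic described in the comment above can be tried offline. This is only a sketch: it uses difflib from the standard library as a stand-in for fuzzywuzzy's `process.extractOne`, so the scores will differ, and `closest_image` and the second URL are hypothetical.

```python
from difflib import SequenceMatcher


def closest_image(name, urls):
    """Hypothetical helper: return the URL whose file stem best matches name.

    Raises ValueError if `urls` is empty.
    """
    def stem(url):
        # Same stem logic as wiki_data: last path segment, extension removed.
        return url.rpartition('/')[-1].split('.')[0].lower()

    return max(
        urls,
        key=lambda u: SequenceMatcher(None, name.lower(), stem(u)).ratio()
    )


urls = [
    'https://upload.wikimedia.org/wikipedia/commons/b/b4/Jk-rowling-crop.JPG',
    'https://example.org/images/Edinburgh_castle.jpg',  # made-up URL
]
best = closest_image('jk rowling', urls)
print(best)
```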

In [379]:
wiki_data('jo rowling', 'author')


Results(summary='Joanne Rowling (ROH-ling; born 31 July 1965), better known by her pen name J. K. Rowling, is a British author, philanthropist, film producer, television producer, and screenwriter. She is best known for writing the Harry Potter fantasy series, which has won multiple awards and sold more than 500 million copies, becoming the best-selling book series in history. The books are the basis of a popular film series, over which Rowling had overall approval on the scripts and was a producer on the final films. She also writes crime fiction under the pen name Robert Galbraith.', img_url='https://upload.wikimedia.org/wikipedia/commons/b/b4/Jk-rowling-crop.JPG', img_path='data/tmp/jk-rowling-crop.jpg')

In [383]:
wiki_data('mike schur')


Results(summary='Michael Herbert Schur (born c. 1975/1976) is an American television producer, writer, and character actor. He was a producer and writer for the comedy series The Office, and co-created Parks and Recreation with Office producer Greg Daniels. He created The Good Place, co-created the comedy series Brooklyn Nine-Nine and was a producer on the series Master of None. He also played Mose Schrute in The Office. In 2021, he co-created a comedy series Rutherford Falls.', img_url='https://upload.wikimedia.org/wikipedia/commons/0/07/Michael_Schur_2012_%28cropped%29.jpg', img_path='data/tmp/michael_schur_2012_%28cropped%29.jpg')

In [384]:
# Seems to struggle with typos.
with assert_raises(RuntimeError):
    wiki_data('yan lecun', 'machine learning', debug=True)


matches: ['Deep learning', 'History of artificial intelligence', 'Glossary of artificial intelligence', 'Darkforest', 'Synthetic media']
As expected, got RuntimeError(Similarity score of 36 fell short of threshold 50. Page title: Deep learning.).
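
In this run the typo sends the search step to unrelated titles, and the similarity check then rejects the top match. A rough illustration of that thresholding idea, using difflib as a stand-in for `fuzz.token_set_ratio` (an assumption: the two libraries score differently, so the numbers will not match the 36 reported above; `similarity` is a hypothetical helper):

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Hypothetical stand-in scorer: case-insensitive 0-100 similarity."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


# A near-miss title scores high, while an unrelated title falls below a
# threshold of 50 and would be rejected.
for title in ['Yann LeCun', 'Deep learning']:
    print(title, similarity('yan lecun', title))
```

The scorer handles the typo itself; what fails in the run above is the search, which never surfaces the right title for the check to accept.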


In [381]:
wiki_data('Usain Bolt')

Results(summary='Usain St Leo Bolt, (born 21 August 1986) is a Jamaican retired sprinter, widely considered to be the greatest sprinter of all time. He is a world record holder in the 100 metres, 200 metres and 4 × 100 metres relay. ', img_url='https://upload.wikimedia.org/wikipedia/commons/0/08/Usain-bolt-press-conference-berlin-2009.jpg', img_path='data/tmp/usain-bolt-press-conference-berlin-2009.jpg')

## Line Chunker

Experiment with manually inserting newlines into text while keeping the original text recoverable. We want to display text in the GUI so it's not all on one line, but we don't want to insert arbitrary newlines into prompts when querying GPT-3.
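
As a point of comparison, here is a minimal sketch of the same idea built on textwrap from the standard library. It is not the implementation explored in this notebook, and `ReversibleWrapper` and its method names are hypothetical. It stores the raw and wrapped versions side by side, so "reversal" is a lookup rather than an attempt to undo the wrapping.

```python
import textwrap


class ReversibleWrapper:
    """Hypothetical helper: keep raw text and a wrapped copy under one key."""

    def __init__(self, width=79):
        self.width = width
        self._raw = {}
        self._wrapped = {}

    def add(self, key, text):
        # Wrap each existing line separately so blank lines between
        # paragraphs survive and do not distort the line lengths.
        wrapped = '\n'.join(
            textwrap.fill(par, width=self.width) if par.strip() else ''
            for par in text.split('\n')
        )
        self._raw[key] = text
        self._wrapped[key] = wrapped
        return wrapped

    def get(self, key, wrapped=True):
        return self._wrapped[key] if wrapped else self._raw[key]


original = 'one two three four five six seven'
store = ReversibleWrapper(width=20)
shown = store.add('demo', original)
print(shown)
```

Wrapping line by line is a design choice: it keeps paragraph breaks intact when the input already contains newlines.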

In [484]:
class GuiTextChunker:

    def __init__(self, max_chars=79):
        self.raw = {}
        self.chunked = {}
        self.max_chars = max_chars

    @fallback(keep=['max_chars'])
    def add(self, key, text, return_chunked=True, **kwargs):
        if self._previously_added(key, text):
            if return_chunked:
                return self.get(key, chunked=True)
            return
        chunked = self._chunk_lines(text, max_chars)
        self.raw[key] = text
        self.chunked[key] = chunked
        if return_chunked: return chunked

    def get(self, key, chunked):
        if chunked:
            return self.chunked[key]
        return self.raw[key]

    def _chunk_lines(self, text, max_chars):
        words = text.split(' ')
        lines, line = [], []
        curr_len = 0
        for word in words:
            length = len(word) + 1
            if curr_len + length > max_chars:
                lines.append(line)
                line = []
                curr_len = 0
            line.append(word)
            curr_len += length
        if line: lines.append(line)
        return '\r\n'.join(' '.join(line) for line in lines)
    
    def _previously_added(self, key, text):
        try:
            raw = self.get(key, chunked=False)
            chunked = self.get(key, chunked=True)
        except KeyError:
            return False
        # Either stored form counts as a match, so re-adding is a no-op.
        # (Avoids assert, which is stripped when Python runs with -O.)
        return text in (raw, chunked)

    def clear(self):
        self.raw.clear()
        self.chunked.clear()

    def __contains__(self, key):
        in_raw, in_chunked = key in self.raw, key in self.chunked
        if in_raw and in_chunked:
            return True
        elif in_raw or in_chunked:
            raise KeyError(
                f'Key {key} was found in '
                f'{"self.raw" if in_raw else "self.chunked"}. Should be in '
                'neither or both. It may be wise to call the clear() method '
                'and re-add your key.'
            )
        else:
            return False

In [485]:
t = 'J.K. walked to the store! me@gmail.com, but then? You\'re being '\
    'difficult with that request...idk.Who is this; what are you doing? '\
    '"Nothing," '\
    'he answered - but what about more lines more words can emojis work :) '\
    'testing testing test test. Let\'s make this more characters. Over 2 '\
    'lines. How many now?? A bit overhanging '
print(t)

J.K. walked to the store! me@gmail.com, but then? You're being difficult with that request...idk.Who is this; what are you doing? "Nothing," he answered - but what about more lines more words can emojis work :) testing testing test test. Let's make this more characters. Over 2 lines. How many now?? A bit overhanging 


In [486]:
text_manager = GuiTextChunker()
tmp = text_manager.add('rand', t)
print(tmp)

J.K. walked to the store! me@gmail.com, but then? You're being difficult with
that request...idk.Who is this; what are you doing? "Nothing," he answered -
but what about more lines more words can emojis work :) testing testing test
test. Let's make this more characters. Over 2 lines. How many now?? A bit
overhanging 


In [487]:
print(text_manager.get('rand', chunked=True))

J.K. walked to the store! me@gmail.com, but then? You're being difficult with
that request...idk.Who is this; what are you doing? "Nothing," he answered -
but what about more lines more words can emojis work :) testing testing test
test. Let's make this more characters. Over 2 lines. How many now?? A bit
overhanging 


In [488]:
print(text_manager.get('rand', chunked=False))

J.K. walked to the store! me@gmail.com, but then? You're being difficult with that request...idk.Who is this; what are you doing? "Nothing," he answered - but what about more lines more words can emojis work :) testing testing test test. Let's make this more characters. Over 2 lines. How many now?? A bit overhanging 


In [489]:
chunker = GuiTextChunker()
inp = t
# Input is initially unchunked, but we repeatedly add the chunked version.
# This results in text growing progressively shorter, but I think we'd only
# want it to change the first time. UPDATE: seems to work now.
for i in range(5):
#     print(chunker._previously_added('rand', inp))
    inp = chunker.add('rand', inp)
    print(inp, end='\n\n')

J.K. walked to the store! me@gmail.com, but then? You're being difficult with
that request...idk.Who is this; what are you doing? "Nothing," he answered -
but what about more lines more words can emojis work :) testing testing test
test. Let's make this more characters. Over 2 lines. How many now?? A bit
overhanging 

J.K. walked to the store! me@gmail.com, but then? You're being difficult with
that request...idk.Who is this; what are you doing? "Nothing," he answered -
but what about more lines more words can emojis work :) testing testing test
test. Let's make this more characters. Over 2 lines. How many now?? A bit
overhanging 

J.K. walked to the store! me@gmail.com, but then? You're being difficult with
that request...idk.Who is this; what are you doing? "Nothing," he answered -
but what about more lines more words can emojis work :) testing testing test
test. Let's make this more characters. Over 2 lines. How many now?? A bit
overhanging 

J.K. walked to the store! me

In [476]:
inp.split(' ')

['J.K.',
 'walked',
 'to',
 'the',
 'store!',
 'me@gmail.com,',
 'but',
 "then?\r\nYou're\r\nbeing\r\ndifficult\r\nwith\r\nthat",
 'request...idk.Who',
 'is',
 'this;',
 'what',
 'are\r\nyou\r\ndoing?\r\n"Nothing,"',
 'he\r\nanswered',
 '-\r\nbut',
 'what',
 'about',
 'more',
 'lines',
 'more',
 'words\r\ncan\r\nemojis\r\nwork',
 ':)',
 'testing\r\ntesting',
 'test\r\ntest.',
 "Let's",
 'make',
 'this',
 'more\r\ncharacters.\r\nOver',
 '2\r\nlines.',
 'How',
 'many\r\nnow??',
 'A',
 'bit\r\noverhanging',
 '']

In [448]:
t2 = """Leverage agile frameworks to provide a robust synopsis for high level overviews. Iterative approaches to corporate strategy foster collaborative thinking to further the overall value proposition. Organically grow the holistic world view of disruptive innovation via workplace diversity and empowerment.

Bring to the table win-win survival strategies to ensure proactive domination. At the end of the day, going forward, a new normal that has evolved from generation X is on the runway heading towards a streamlined cloud solution. User generated content in real-time will have multiple touchpoints for offshoring.

Capitalize on low hanging fruit to identify a ballpark"""
text_manager.add('long', t2, max_chars=100)

In [450]:
print(text_manager.get('long', chunked=True))

Leverage agile frameworks to provide a robust synopsis for high level overviews. Iterative
approaches to corporate strategy foster collaborative thinking to further the overall value
proposition. Organically grow the holistic world view of disruptive innovation via workplace
diversity and empowerment.

Bring to the table win-win survival strategies to ensure proactive
domination. At the end of the day, going forward, a new normal that has evolved from generation X
is on the runway heading towards a streamlined cloud solution. User generated content in real-time
will have multiple touchpoints for offshoring.

Capitalize on low hanging fruit to identify a
ballpark


In [451]:
print(text_manager.get('long', chunked=False))

Leverage agile frameworks to provide a robust synopsis for high level overviews. Iterative approaches to corporate strategy foster collaborative thinking to further the overall value proposition. Organically grow the holistic world view of disruptive innovation via workplace diversity and empowerment.

Bring to the table win-win survival strategies to ensure proactive domination. At the end of the day, going forward, a new normal that has evolved from generation X is on the runway heading towards a streamlined cloud solution. User generated content in real-time will have multiple touchpoints for offshoring.

Capitalize on low hanging fruit to identify a ballpark
