## 1 — Recognize objects in image (or classify image)

Using a trained neural network, get the object label or labels for the image, or otherwise provide a label for the image. Also store the centrality of each object.

## 2  — Generate semantic word families

For each label, use Word2Vec `most_similar` to retrieve a list of words semantically related to that image object label.

## 3 — Generate all related words

For each word that is semantically related to an object label (below a distance threshold), measure its phonetic similarity to every word in the dictionary. Also store each word's distance.

For each word in each semantic family, sort the dictionary words by phonetic distance and choose the closest ones (below a distance threshold).
(One way is to convert the word to IPA and compare it to an IPA-converted version of every word in the CMU dictionary.)

## 4 — Gather candidate phrases

For each word in the phonetic family of each word in the semantic family of each object label, retrieve every idiom containing the word.
Add the idiom id, as well as the stats on the object centrality, semantic distance, and phonetic distance, to a dataframe.

Compute a _suitability score_ for each word-idiom match and add it to the score column of the dataframe.

Also, for each word in the semantic family, search the joke list for matches and add these to a joke_match dataframe, to be used if the suitability score of every substitution is too low.


## 5 — Choose captions

Sort the captions dataframe by the _suitability score_.

Choose the top 10 and generate a list containing each caption, with the original semantic family word substituted into the idiom, in addition to jokes containing any of the semantic family words.

In [109]:
import pandas as pd
from collections import namedtuple
import uuid

## -1  — Webscrape and process phrases (idioms, sayings, aphorisms)

Each phrase should be converted into a list of phonetic sounds.

## 0 — Load `phrase_dict`, pickled after being scraped and processed

#### Data structures defined

In [110]:

Phrase = namedtuple('Phrase',['text_string', 'word_list','phon_list','string_length', 'word_count', 'prefix', 'phrase_type'])
phrase_dict = dict()

Close_word = namedtuple('Close_word', ['word', 'distance'])
#Sem_family = namedtuple('Sem_family', ['locus_word', 'close_words'])
Sem_family = namedtuple('Sem_family', ['locus_word', 'sem_fam_words'])
#Sem_family = namedtuple('Sem_family', ['locus_word', 'phon_fams'])
Phon_family = namedtuple('Phon_family', ['locus_word_str', 'close_words'])

#### Temporary toy example of the dict of phrases, to be replaced with idioms etc. scraped from web

In [111]:
t_string = 'Smarter than the average bear'
w_list = t_string.lower().split()
ph_id1 = uuid.uuid1()
phrase_dict[ph_id1] = Phrase(text_string = t_string, word_list = w_list, phon_list = w_list, string_length = len(t_string), word_count = len(w_list), prefix="As usual: ", phrase_type='idiom' )

# toy example of the dict
t_string = 'Not a hair out of place'
w_list = t_string.lower().split()
ph_id2 = uuid.uuid1()
phrase_dict[ph_id2] = Phrase(text_string = t_string, word_list = w_list, phon_list = w_list, string_length = len(t_string), word_count = len(w_list), prefix="As usual: ", phrase_type='idiom' )

# toy example of the dict
t_string = 'Three blind mice'
w_list = t_string.lower().split()
ph_id3 = uuid.uuid1()
phrase_dict[ph_id3] = Phrase(text_string = t_string, word_list = w_list, phon_list = w_list, string_length = len(t_string), word_count = len(w_list), prefix="As usual: ", phrase_type='idiom' )

# toy example of the dict
t_string = 'I just called to say I love you'
w_list = t_string.lower().split()
ph_id4 = uuid.uuid1()  # separate id, so ph_id3 keeps pointing at 'Three blind mice'
phrase_dict[ph_id4] = Phrase(text_string = t_string, word_list = w_list, phon_list = w_list, string_length = len(t_string), word_count = len(w_list), prefix="As usual: ", phrase_type='idiom' )


## 1 — Recognize objects in image (or classify image)

Using a trained neural network, get the object label or labels for the image, or otherwise provide a label for the image. Also store the centrality of each object.

## 2 — Generate semantic word families

For each label, use Word2Vec `most_similar` to retrieve a list of words semantically related to that image object label.

## 3 — Generate all related words

For each word that is semantically related to an object label (below a distance threshold), measure its phonetic similarity to every word in the dictionary. Also store each word's distance.

For each word in each semantic family, sort the dictionary words by phonetic distance and choose the closest ones (below a distance threshold).
(One way is to convert the word to IPA and compare it to an IPA-converted version of every word in the CMU dictionary.)

In [112]:
english_dictionary = ['two', 'pair', 'bear', 'scare', 'you', 'twice', 'hair', 'mice', 'speaker', 'book']

In [113]:
# two_phon_fam = Phon_family(locus_word=Close_word('two', 3), close_words = [Close_word('you', 2.1)])
# two_phon_fam

# pair_phon_fam = Phon_family(locus_word=Close_word('pair', 5), close_words = [Close_word('bear', 1.5), Close_word('hair', 2.7)])
# pair_phon_fam

# twice_phon_fam = Phon_family(locus_word=Close_word('twice', 4.1), close_words = [Close_word('mice', 2.1)])
# twice_phon_fam

In [114]:
# import eng_to_ipa as ipa
# ipa.get_rhymes("word to rhyme with")

In [115]:
def IPA_get_rhymes( w ):
    if w == 'two':
        return [['you']]
    elif w == 'pair':
        return [['bear', 'hair']]
    else:
        return [['mice']]
    

In [116]:
## Temporary stand-in function, to be replaced with one that computes phonetic distance
def lev_dist( phon1, phon2 ):
    return 100
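
The stand-in above always returns 100, so nothing ever falls below `thresh`. One possible replacement is sketched here, assuming each word has already been converted to a list of phoneme symbols. The names `phoneme_edit_distance` and `normalized_phon_dist` are hypothetical, the normalization by the longer sequence is an assumption made so the result can be compared with a threshold such as 0.2, and the transcriptions are hand-written ARPAbet-style lists with stress markers omitted.

```python
def phoneme_edit_distance(phon1, phon2):
    """Levenshtein distance between two sequences of phoneme symbols."""
    # prev is the previous row of the dynamic-programming table
    prev = list(range(len(phon2) + 1))
    for i, p1 in enumerate(phon1, start=1):
        curr = [i]
        for j, p2 in enumerate(phon2, start=1):
            cost = 0 if p1 == p2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_phon_dist(phon1, phon2):
    """Scale the edit distance into [0, 1] (assumption: divide by the longer length)."""
    longest = max(len(phon1), len(phon2))
    return phoneme_edit_distance(phon1, phon2) / longest if longest else 0.0


# Hand-written transcriptions, for illustration only
pair = ['P', 'EH', 'R']
bear = ['B', 'EH', 'R']
mice = ['M', 'AY', 'S']

print(normalized_phon_dist(pair, bear))  # one substitution out of three symbols
print(normalized_phon_dist(pair, mice))  # every symbol differs
```

Working on phoneme lists rather than spelling means that words such as 'pair' and 'bear' come out close even though their letters differ more.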

In [117]:
two_fam_member_list = ['you']
pair_fam_member_list = ['bear', 'hair']
twice_fam_member_list = ['mice']

def make_phon_fam_for_sem_fam_member( w, thresh=.2 ):
    w_phon_code = w
    close_word_list = []
    
    # Find words that are not necessarily rhymes but are phonetically similar
    for word in english_dictionary:
        word_phon_code = word
        dist = lev_dist(w_phon_code, word_phon_code)
        if dist < thresh:
            close_word_list.append( Close_word(word, dist) )
    
    rhyme_dist = .1
    rhyme_word_list = IPA_get_rhymes( w )[0]
    
    # Find words that are rhymes
    for word in rhyme_word_list:
         close_word_list.append( Close_word(word, rhyme_dist) )
    
    return Phon_family(locus_word_str= w, close_words=close_word_list )
    
    

In [118]:
# two_phon_fam = make_phon_fam_for_sem_fam_member( 'two' )
# pair_phon_fam = make_phon_fam_for_sem_fam_member( 'pair' )
# twice_phon_fam = make_phon_fam_for_sem_fam_member( 'twice' )


In [119]:
## ALERT:  Need to incorporate the semantic distance somewhere

In [120]:
# two_sem_fam = Sem_family(locus_word='two', phon_fams = [make_phon_fam_for_sem_fam_member( 'two' ), \
#                                                         make_phon_fam_for_sem_fam_member( 'pair' ), \
#                                                         make_phon_fam_for_sem_fam_member( 'twice' )])

In [121]:
# To be replaced with Word2Vec `most_similar()`

def get_most_similar( w ):  
    list_of_duples =  [('pair', .95), ('twice', .90)]
    list_of_close_words = [Close_word( word=w_str, distance= 1 - w_sim) for w_str, w_sim in list_of_duples ]
        
    return list_of_close_words
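
A rough sketch of what the Word2Vec-backed version might look like. It assumes a gensim `KeyedVectors`-style object whose `most_similar(positive=..., topn=...)` returns `(word, cosine_similarity)` pairs. The function name `get_most_similar_w2v`, the `max_dist` cutoff, and the `ToyVectors` stand-in class are all hypothetical and exist only so the example runs without a trained model.

```python
from collections import namedtuple

Close_word = namedtuple('Close_word', ['word', 'distance'])


def get_most_similar_w2v(model, w, topn=10, max_dist=0.2):
    """Return Close_word entries for the neighbours of `w` within `max_dist`."""
    neighbours = model.most_similar(positive=[w], topn=topn)
    # Convert similarity to distance the same way the toy function does
    close_words = [Close_word(word=n, distance=1 - sim) for n, sim in neighbours]
    return [cw for cw in close_words if cw.distance <= max_dist]


class ToyVectors:
    """Stand-in with the same method shape as the assumed model (not a real model)."""

    def most_similar(self, positive, topn=10):
        return [('pair', 0.95), ('twice', 0.90), ('book', 0.30)][:topn]


print(get_most_similar_w2v(ToyVectors(), 'two'))
```

With a real model, `ToyVectors()` would be swapped for the loaded vectors; the filtering step is where the semantic distance threshold from step 3 would apply.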

In [122]:
def make_phon_fams_and_sem_family( w ):
    sem_sim_words = get_most_similar( w )  # Eventually replace with call to Word2Vec

    phon_fams_list = [make_phon_fam_for_sem_fam_member( w )]
    
    for close_w in sem_sim_words:
        print( close_w )
        phon_fams_list.append( make_phon_fam_for_sem_fam_member( close_w.word ) )
    
    # pass
    # Use the code in the cell above as the basis for a for loop  
    # First make a phonetic family for the locus word
    # Then go through each member of the semantic family and add a phonetic family for each
    
#     close_word_list = [sem_fam.locus_word]
#     for w in sem_fam.close_words:

    locus_word_ =   Close_word(w, 0.0)
    return Sem_family(locus_word= locus_word_ , sem_fam_words = phon_fams_list)
   
    
#     return Sem_family(locus_word= locus_word_ , phon_fams = [make_phon_fam_for_sem_fam_member( w ), \
#                                                      make_phon_fam_for_sem_fam_member( 'pair' ), \
#                                                      make_phon_fam_for_sem_fam_member( 'twice' )])
    

In [123]:
two_sem_fam = make_phon_fams_and_sem_family('two')
two_sem_fam

Close_word(word='pair', distance=0.050000000000000044)
Close_word(word='twice', distance=0.09999999999999998)


Sem_family(locus_word=Close_word(word='two', distance=0.0), sem_fam_words=[Phon_family(locus_word_str='two', close_words=[Close_word(word='you', distance=0.1)]), Phon_family(locus_word_str='pair', close_words=[Close_word(word='bear', distance=0.1), Close_word(word='hair', distance=0.1)]), Phon_family(locus_word_str='twice', close_words=[Close_word(word='mice', distance=0.1)])])

## 4 — Gather candidate phrases

For each word in the phonetic family of each word in the semantic family of each object label, retrieve the phrases containing the word.
Add the phrase id, as well as the stats on the object centrality, semantic distance, and phonetic distance, to a dataframe.

Compute a _suitability score_ for each word-phrase match and add it to the score column of the dataframe.

Also, for each word in the semantic family, search the joke list for matches and add these to a joke_match dataframe, to be used if the suitability score of every substitution is too low.
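
The joke fallback is not implemented in this notebook. A minimal sketch of one way it could work is given here, assuming the jokes are held in a plain list of strings. The name `find_joke_matches`, the record keys, and the placeholder sentences in `toy_jokes` are all made up for illustration; the returned list of dicts has the same shape as `word_match_records`, so it could be passed to `pd.DataFrame` to form the joke_match dataframe.

```python
import re


def find_joke_matches(sem_family_words, joke_list):
    """Return one record per (semantic family word, joke) whole-word match."""
    records = []
    for joke_id, joke in enumerate(joke_list):
        # Tokenize on letters and apostrophes so punctuation does not block a match
        joke_words = set(re.findall(r"[a-z']+", joke.lower()))
        for word in sem_family_words:
            if word in joke_words:
                records.append({'semantic_match': word, 'joke_id': joke_id, 'joke': joke})
    return records


# Placeholder strings standing in for a real joke list
toy_jokes = ['Two wrongs do not make a right.',
             'A pair of socks walks into a bar.',
             'Nothing relevant here.']

for record in find_joke_matches(['two', 'pair', 'twice'], toy_jokes):
    print(record)
```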


## TO CODE NEXT

Write code that takes the word `twice` and returns its `semantic_family`, which is a list of words
('pair' and 'twice' in this case), and returns either (TBD) the list of phonetically similar words or
the phoneticized version of the word, to be compared with the phoneticized variants of the words in
the phrases.



#### Define dataframe for candidate phrases

In [124]:
col_names = ['semantic_match', 'sem_dist', 'phonetic_match', 'phon_dist', 'phrase_id', 'score']

cand_df = pd.DataFrame(columns= col_names)
cand_df

| | semantic_match | sem_dist | phonetic_match | phon_dist | phrase_id | score |
|---|---|---|---|---|---|---|


#### Need to write body of function that will convert to phoneticized version of word

In [125]:
def phoneticized( w ):
    return w

### ALERT: Instead, pre-generate a dictionary of phoneticized versions of the words in the list of idioms. Each phonetic word should map to a list of duples, each consisting of a phrase id and the corresponding word.
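
The pre-generated dictionary described in the ALERT could be built along these lines. This is only a sketch: `build_phon_index` is a hypothetical name, the `Phrase` tuple is trimmed to three fields for brevity, and it assumes `phon_list` is aligned one-to-one with `word_list`, as it is in the toy data where both lists are identical.

```python
from collections import defaultdict, namedtuple
import uuid

# Trimmed version of the notebook's Phrase tuple, for illustration only
Phrase = namedtuple('Phrase', ['text_string', 'word_list', 'phon_list'])


def build_phon_index(phrases):
    """Map each phonetic word to a list of (phrase_id, original_word) duples."""
    index = defaultdict(list)
    for phrase_id, phrase in phrases.items():
        # Assumes the i-th phonetic entry corresponds to the i-th word
        for phon_word, word in zip(phrase.phon_list, phrase.word_list):
            index[phon_word].append((phrase_id, word))
    return index


toy_phrases = {}
for text in ['Smarter than the average bear', 'Three blind mice']:
    words = text.lower().split()
    toy_phrases[uuid.uuid4()] = Phrase(text, words, words)

phon_index = build_phon_index(toy_phrases)
print(phon_index.get('bear', []))
```

A lookup in this index replaces the scan over every phrase, which matters once the toy dictionary is swapped for a scraped list of idioms.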

In [126]:
def get_matching_phrases( w ):
    matched_id_list = []
    for phrase_id in phrase_dict.keys():
        if w in phrase_dict[phrase_id].phon_list:
            matched_id_list.append(phrase_id)
    return matched_id_list

In [127]:
#get_matching_phrases('bear')

In [128]:
# Cycles through each word in a phonetic family to get its matching phrases

def get_phrases_for_phon_fam( phon_fam_, sem_dist_ ):

    word_match_records_ = []

    #phon_fam_ = pair_phon_fam
    for word in phon_fam_.close_words:
        matched_phrases = get_matching_phrases( word.word )
        #print(word, len(matched_phrases))
        if matched_phrases:
            for p_id in matched_phrases:
                word_match_records_.append({'semantic_match': phon_fam_.locus_word_str, 'sem_dist': sem_dist_, 'phonetic_match': word.word, 'phon_dist': word.distance, 'phrase_id': p_id, 'score': ''})
    return word_match_records_ 


In [136]:
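def get_phrases_for_sem_fam( sem_fam_ ):
    word_match_records_ = []
    for phon_fam_ in sem_fam_.sem_fam_words:
        # NOTE: every phonetic family is given the locus word's own distance (0.0),
        # so the per-word semantic distance from get_most_similar() is not used yet
        print( sem_fam_.locus_word.distance )
        word_match_records_.extend( get_phrases_for_phon_fam( phon_fam_, sem_fam_.locus_word.distance ) )
    return word_match_records_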

In [137]:
# word_match_records = []   

# word_match_records.extend( get_phrases_for_phon_fam( two_phon_fam ) ) 
# word_match_records.extend( get_phrases_for_phon_fam( pair_phon_fam ) )
# word_match_records.extend( get_phrases_for_phon_fam( twice_phon_fam ) )  
# word_match_records

In [138]:
#cand_df = cand_df.append(word_match_records)

In [139]:
two_sem_fam

Sem_family(locus_word=Close_word(word='two', distance=0.0), sem_fam_words=[Phon_family(locus_word_str='two', close_words=[Close_word(word='you', distance=0.1)]), Phon_family(locus_word_str='pair', close_words=[Close_word(word='bear', distance=0.1), Close_word(word='hair', distance=0.1)]), Phon_family(locus_word_str='twice', close_words=[Close_word(word='mice', distance=0.1)])])

In [140]:
# word_match_records = get_phrases_for_sem_fam( two_sem_fam )
# word_match_records

In [141]:
# To be replaced with image recognition algorithms
def get_image_topics():
    return ['two']

## The equivalent of `main` for the time being, until two or more image topics are handled

In [142]:
image_topics     = get_image_topics()
image_topic_word = image_topics[0]

two_sem_fam = make_phon_fams_and_sem_family( image_topic_word )
#two_sem_fam

word_match_records = get_phrases_for_sem_fam( two_sem_fam )

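# DataFrame.append was removed in pandas 2.0; build the frame directly from the records
cand_df = pd.DataFrame(word_match_records, columns=col_names)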

Close_word(word='pair', distance=0.050000000000000044)
Close_word(word='twice', distance=0.09999999999999998)
0.0
0.0
0.0


In [143]:
# cand_df = pd.DataFrame({"semantic_match": ['pair','pair','twice'], "sem_dist": [5, 5, 4.1], "phonetic_match": ['bear', 'hair','mice'], "phon_dist": [1.5, 2.7, 2.1], "phrase_id":  [ph_id1, ph_id2,ph_id3], "phrase_type":  ['idiom', 'idiom','idiom'],'score': [.5, .3, 1.1]})


In [144]:
cand_df

| | semantic_match | sem_dist | phonetic_match | phon_dist | phrase_id | score |
|---|---|---|---|---|---|---|
| 0 | two | 0.0 | you | 0.1 | c2c6cf42-7fc5-11eb-9afd-acde48001122 | |
| 1 | pair | 0.0 | bear | 0.1 | c2c6be08-7fc5-11eb-9afd-acde48001122 | |
| 2 | pair | 0.0 | hair | 0.1 | c2c6c45c-7fc5-11eb-9afd-acde48001122 | |
| 3 | twice | 0.0 | mice | 0.1 | c2c6c9f2-7fc5-11eb-9afd-acde48001122 | |


In [145]:
cand_df['score'] = cand_df.apply(lambda row: 1.0/(row['sem_dist'] + row['phon_dist']), axis=1)
cand_df

| | semantic_match | sem_dist | phonetic_match | phon_dist | phrase_id | score |
|---|---|---|---|---|---|---|
| 0 | two | 0.0 | you | 0.1 | c2c6cf42-7fc5-11eb-9afd-acde48001122 | 10.0 |
| 1 | pair | 0.0 | bear | 0.1 | c2c6be08-7fc5-11eb-9afd-acde48001122 | 10.0 |
| 2 | pair | 0.0 | hair | 0.1 | c2c6c45c-7fc5-11eb-9afd-acde48001122 | 10.0 |
| 3 | twice | 0.0 | mice | 0.1 | c2c6c9f2-7fc5-11eb-9afd-acde48001122 | 10.0 |
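
The score used above, `1.0/(sem_dist + phon_dist)`, is undefined when both distances are zero (for example, if the locus word itself appears in a phrase). One possible alternative is sketched here; it is an assumption and not a scoring rule chosen by this notebook. The function name `suitability_score` and the weights `w_sem` and `w_phon` are hypothetical.

```python
def suitability_score(sem_dist, phon_dist, w_sem=1.0, w_phon=1.0):
    """Bounded score in (0, 1]: 1.0 for a perfect match, smaller as distances grow."""
    return 1.0 / (1.0 + w_sem * sem_dist + w_phon * phon_dist)


print(suitability_score(0.0, 0.0))    # perfect match, no division by zero
print(suitability_score(0.05, 0.1))   # 'pair' -> 'bear' with its own semantic distance
print(suitability_score(0.10, 0.1))   # 'twice' -> 'mice'
```

The ranking is the same as with the plain inverse whenever the weights are equal, since both are decreasing functions of the summed distance; the weights only matter if semantic and phonetic closeness should count differently.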


## 5 —  Choose captions

Sort the captions dataframe by the _suitability score_.

Choose the top 10(?) and generate a list containing each caption, with the original semantic family word substituted into the idiom, in addition to jokes containing any of the semantic family words.

In [146]:
def sub(phrase, orig_word='', new_word=''):
    return phrase.text_string.replace(orig_word, new_word)
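
One caveat with `str.replace` is that it also rewrites the target when it occurs inside a longer word, and it is case-sensitive. A sketch of a whole-word variant follows; `sub_whole_word` is a hypothetical helper that works on a plain string rather than a `Phrase`.

```python
import re


def sub_whole_word(text, orig_word, new_word):
    """Replace `orig_word` only where it appears as a whole word, ignoring case."""
    pattern = r'\b' + re.escape(orig_word) + r'\b'
    # A function replacement avoids treating backslashes in new_word as escapes
    return re.sub(pattern, lambda match: new_word, text, flags=re.IGNORECASE)


print(sub_whole_word('Not a hair out of place', 'hair', 'pair'))
# Not a pair out of place
print(sub_whole_word('The chair has a hair on it', 'hair', 'pair'))
# The chair has a pair on it
```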

In [147]:
def construct_caption( row ):
    return sub( phrase_dict[ row['phrase_id']], row['phonetic_match'],  row['semantic_match']  )

In [148]:
def get_best_captions(df, selection_size=10):
    # Highest score first; the stable sort keeps ties in their original order.
    # Working on a copy leaves the caller's dataframe untouched and avoids
    # pandas' SettingWithCopyWarning when the caption column is added.
    best_df = df.sort_values(by='score', ascending=False, kind='stable').head(selection_size).copy()
    best_df['caption'] = best_df.apply(construct_caption, axis=1)
    return best_df

In [149]:
best_captions_df = get_best_captions(cand_df)
best_captions_df


| | semantic_match | sem_dist | phonetic_match | phon_dist | phrase_id | score | caption |
|---|---|---|---|---|---|---|---|
| 0 | two | 0.0 | you | 0.1 | c2c6cf42-7fc5-11eb-9afd-acde48001122 | 10.0 | I just called to say I love two |
| 1 | pair | 0.0 | bear | 0.1 | c2c6be08-7fc5-11eb-9afd-acde48001122 | 10.0 | Smarter than the average pair |
| 2 | pair | 0.0 | hair | 0.1 | c2c6c45c-7fc5-11eb-9afd-acde48001122 | 10.0 | Not a pair out of place |
| 3 | twice | 0.0 | mice | 0.1 | c2c6c9f2-7fc5-11eb-9afd-acde48001122 | 10.0 | Three blind twice |


In [150]:
best_captions_list = best_captions_df['caption'].to_list()
best_captions_list 

['I just called to say I love two',
 'Smarter than the average pair',
 'Not a pair out of place',
 'Three blind twice']