5-6 March 2020  <br> ***Team Beat me up***: Fadi, Nat, Paula 

# Hackathon XXII

 ## Contact Beats Classification

### Steps <a id='steps'></a>
- get document urls and titles for [contacts](#contacts)
- get [beats](#beats)
- get keyword data from url and title
- return word embedding for keywords
- create match scores for beats
- use scores to return beat suggestions

## Word embeddings

 --> transforming words into vectors that hold information about each word's meaning <br>
- example: Vector(“King”) - Vector(“Man”) + Vector(“Woman”) ≈ Vector(“Queen”) <br>
- use cases: computing similar words, text classification, document clustering/grouping, feature extraction for text classification, natural language processing
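
The analogy above can be sketched with hand-made toy vectors. The two dimensions (roughly "royalty" and "gender") and all values below are invented for illustration; real embeddings such as the spaCy vectors used later have hundreds of dimensions and only approximate this behaviour.

```python
import math

# Toy 2-d "embeddings": (royalty, gender). Values are made up for illustration.
toy_vectors = {
    'king':  [1.0, 1.0],
    'queen': [1.0, -1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'apple': [-1.0, 0.0],
}

def cosine(a, b):
    '''Cosine similarity between two equal-length vectors.'''
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def analogy(a, b, c, vectors):
    '''Return the word closest to vector(a) - vector(b) + vector(c), excluding the inputs.'''
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

print(analogy('king', 'man', 'woman', toy_vectors))
```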

***

In [2]:
# using pretrained spacy word embeddings 
import spacy
nlp = spacy.load('en_core_web_lg')

In [3]:
from heapq import nlargest
from collections import defaultdict

***

In [4]:
def get_beat_similarities(beats_dict, keywords):
    '''Return similarity scores for all beats.'''
    
    similarity = {}
    keywords_token = nlp(keywords)
    for beat, beat_value in beats_dict.items():
        beat_token = nlp(beat_value)
        similarity[beat] = keywords_token.similarity(beat_token)
    return similarity
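
By default, spaCy's `Doc.similarity` is the cosine similarity of the two documents' vectors, and a document vector is the average of its token vectors. The sketch below reproduces that idea in plain Python with made-up toy vectors; `toy_vectors`, `average_vector` and `text_similarity` are illustrative names, not part of spaCy or of this notebook.

```python
import math

# Made-up 2-d vectors standing in for real word embeddings.
toy_vectors = {
    'soccer':     [0.9, 0.1],
    'football':   [0.8, 0.2],
    'performing': [0.1, 0.9],
    'arts':       [0.0, 1.0],
}

def average_vector(vectors):
    '''Component-wise mean of a non-empty list of vectors.'''
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]

def cosine(a, b):
    '''Cosine similarity between two equal-length vectors.'''
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def text_similarity(text_a, text_b, vectors):
    '''Cosine similarity of the averaged word vectors of two texts.'''
    vec_a = average_vector([vectors[word] for word in text_a.split()])
    vec_b = average_vector([vectors[word] for word in text_b.split()])
    return cosine(vec_a, vec_b)

print(text_similarity('soccer', 'football', toy_vectors))
print(text_similarity('soccer', 'performing arts', toy_vectors))
```

Because the vectors are averaged, a multi-word beat such as "performing arts" is compared as a single blended vector rather than word by word.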

In [5]:
def get_beats(beats_dict, keywords, threshold):
    '''Return sorted beat results above threshold.'''
    
    results = []
    similarities = get_beat_similarities(beats_dict, keywords)
    for beat, value in similarities.items():
        if value >= threshold:
            results.append((beat, value))
    return sort_results(results)

In [36]:
# get beats with score 
keywords = url_to_keywords("beat me up")

results_list = get_beats(beats_dict, keywords,0.58)
results_list += check_for_exact_beats_match(beats_dict, keywords)
sort_results(results_list)

[]

In [7]:
# examples

#https://www.digitaltrends.com/photography/how-to-create-a-layer-mask-photoshop/
#https://us.motorsport.com/f1/news/ferrari-not-hiding-binotto/4715906
#https://www.publicopiniononline.com/story/news/politics/elections/2020/03/05/election-2020-republicans-switching-parties-red-parts-pennsylvania/4901908002/
#https://www.espn.com/soccer/la-galaxy/story/4064228/galaxy-shift-from-zlatan-to-chicharito-shows-promise-but-needs-patience

In [8]:
def get_top_beats(beats_dict, contacts_dict, number_beats):
    '''Return top n beat suggestions for each document of a contact.'''
    
    results_dict = defaultdict(list)
    for contact, documents in contacts_dict.items():
        print('working on: ', contact)
        for doc in documents:
            keywords = url_to_keywords(doc[0]+doc[1]) # doc[0]: url and doc[1]: title TODO: include as params?
            results_dict[contact] += check_for_exact_beats_match(beats_dict, keywords)
            results_dict[contact] += get_beats(beats_dict, keywords, 0.55)
    
    top_beats_dict = {}
    for contact, beats in results_dict.items():
        top_beats_dict[contact] = list(set(beats))
        top_beats_dict[contact] = sort_results(beats)[:number_beats] #TODO: fix so that duplicates are removed (this line overwrites the set() result above)
        
    return top_beats_dict
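
One possible way to resolve the duplicate TODO, sketched under the assumption that a beat appearing several times should keep only its highest score. `dedupe_keep_best` is a hypothetical helper, not part of the notebook. A plain `set()` would not be enough here, since the same beat can appear with different scores.

```python
def dedupe_keep_best(results):
    '''Collapse (beat, score) tuples to one entry per beat, keeping the highest score.

    Returns a new list sorted by score, highest first.
    '''
    best = {}
    for beat, score in results:
        if beat not in best or score > best[beat]:
            best[beat] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Hypothetical input with a repeated beat at two different scores.
example = [('Hotels', 1.0), ('Luxury goods', 0.8), ('Hotels', 0.7), ('Mexico', 0.65)]
print(dedupe_keep_best(example))
```

In `get_top_beats`, this could replace the two assignments in the second loop, followed by the `[:number_beats]` slice.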

In [9]:
def check_for_exact_beats_match(beats_dict, keywords):
    '''Return exact beat matches from keywords.'''
    
    beat_matches = []
    for beat, beat_value in beats_dict.items():
        if beat_value in keywords:
            beat_matches.append((beat,1.0))
    return beat_matches
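
Note that `beat_value in keywords` is a substring test on strings, so a short beat keyword such as `'car'` also matches inside a longer word like `'oscar'`, and an empty beat keyword matches everything. A minimal sketch of a whole-word variant follows; `contains_phrase` and `exact_beat_matches_whole_words` are hypothetical names, and the sketch assumes both inputs are space-separated cleaned text as produced by `process_text`.

```python
def contains_phrase(phrase, keywords):
    '''Return True if the words of phrase occur as consecutive whole words in keywords.'''
    phrase_tokens = phrase.split()
    keyword_tokens = keywords.split()
    size = len(phrase_tokens)
    if size == 0:
        return False  # an empty beat keyword should never count as a match
    return any(
        keyword_tokens[start:start + size] == phrase_tokens
        for start in range(len(keyword_tokens) - size + 1)
    )

def exact_beat_matches_whole_words(beats_dict, keywords):
    '''Return (beat, 1.0) for every beat whose keyword phrase appears as whole words.'''
    return [
        (beat, 1.0)
        for beat, beat_value in beats_dict.items()
        if contains_phrase(beat_value, keywords)
    ]

# Hypothetical data: 'car' is a substring of 'oscar' but not a whole word in the text.
toy_beats = {'Cars': 'car', 'Music': 'music', 'Adult education': 'adult education'}
print(exact_beat_matches_whole_words(toy_beats, 'oscar music award adult education'))
```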

In [10]:
def top_beats(beats, url, number):
    '''Return top n beats for url ordered highest to lowest similarity.'''
    
    keywords = url_to_keywords(url)
    similarities = get_beat_similarities(beats, keywords)
    return nlargest(number, similarities, key = similarities.get)

***

In [11]:
def process_text(text):
    '''Clean data from punctuation, stopwords and perform lemmatization.''' 
    
    doc = nlp(text.lower())
    result = []
    for token in doc:
        if token.text in nlp.Defaults.stop_words:
            continue
        if token.is_punct:
            continue
        if token.lemma_ == '-PRON-':
            continue
        result.append(token.lemma_)
    return " ".join(result)

In [12]:
def url_to_keywords(url):
    '''Turn urls and other text into cleaned keywords.'''
    
    keywords = url.lower().replace('/', ' ').replace('.', ' ').replace(':', ' ').replace('-', ' ').replace('+', ' ')
    keywords = keywords.replace('www', '').replace('com', '').replace('https', '').replace('news','')
    keywords = ''.join(i for i in keywords if not i.isdigit()).strip() # TODO: does it help to remove digits? (f1, b2b cases?)
    return process_text(keywords)
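
One caveat with the chained `replace` calls: `str.replace('com', '')` removes the substring everywhere, so a word like `comic` becomes `ic`. The sketch below is one possible alternative that splits the text into alphabetic tokens first and then drops noise tokens as whole words. `NOISE_TOKENS` and `url_to_tokens` are hypothetical names; like the original, this variant discards digits.

```python
import re

# Tokens to drop as whole words only (hypothetical list mirroring the replace() calls above).
NOISE_TOKENS = {'www', 'com', 'https', 'http', 'news'}

def url_to_tokens(url):
    '''Split a url or title into lowercase alphabetic tokens, dropping noise tokens.'''
    tokens = re.findall(r'[a-z]+', url.lower())
    return [token for token in tokens if token not in NOISE_TOKENS]

# Made-up example url, chosen because it contains 'com' inside real words.
print(url_to_tokens('https://www.comicbook.com/news/comic-con-2020'))
```

The resulting tokens could be joined with spaces and passed to `process_text` in the same way as before.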
    

In [13]:
def sort_results(results):
    '''Sort (beat, score) tuples in place by score, highest first, and return the list.'''
    
    results.sort(key = lambda x: x[1])
    results.reverse()
    return results

***

## Experiments

In [17]:
%%time
# try stuff out:
test_keyword = nlp('soccer')
beat1 = nlp('performing arts')
beat2 = nlp('football')
beat3 = nlp('motorsport')
print('similarity to: ', test_keyword)
print('--------------------------------------------------')
print(beat1.similarity(test_keyword),'   ', beat1)
print(beat2.similarity(test_keyword),'   ', beat2)
print(beat3.similarity(test_keyword), '   ',beat3)
print('--------------------------------------------------')

similarity to:  soccer
--------------------------------------------------
0.3209037406442767     performing arts
0.8352942059469198     football
0.31737114230393226     motorsport
--------------------------------------------------
CPU times: user 42.1 ms, sys: 3.83 ms, total: 45.9 ms
Wall time: 45.7 ms


***

# Run examples from contacts

In [211]:
%%time
example_results = get_top_beats(beats_dict, contacts3, 7)

working on:  53
working on:  54
working on:  65
CPU times: user 1min 48s, sys: 304 ms, total: 1min 48s
Wall time: 1min 48s


In [89]:
#TODO: run with smaller test set, takes a very long time!
contacts3 = {k: contacts[k] for k in list(contacts)[:3]}
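
A likely contributor to the long runtime is that `get_beat_similarities` calls `nlp(beat_value)` for every beat on every call, so each beat is re-processed once per document. A minimal sketch of processing the beats only once is shown below; `precompute_beat_docs` and `similarities_from_cache` are hypothetical helpers, and `embed` stands in for any callable (such as `nlp`) that returns an object with a `similarity` method. This has not been timed against the original.

```python
def precompute_beat_docs(beats_dict, embed):
    '''Run embed once per beat and keep the resulting objects for reuse.'''
    return {beat: embed(beat_value) for beat, beat_value in beats_dict.items()}

def similarities_from_cache(beat_docs, keywords, embed):
    '''Return similarity scores for all beats, embedding only the keywords.'''
    keywords_doc = embed(keywords)
    return {beat: keywords_doc.similarity(doc) for beat, doc in beat_docs.items()}

# Intended use with the notebook's objects (assumption, not run here):
# beat_docs = precompute_beat_docs(beats_dict, nlp)      # once, up front
# scores = similarities_from_cache(beat_docs, keywords, nlp)  # once per document
```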

In [212]:
example_results

{'53': [('Technology', 1.0),
  ('Technology (general)', 1.0),
  ('Health (general)', 1.0),
  ('Government legislation', 0.7453719647511108),
  ('State government', 0.6904202382403168),
  ('Tax law', 0.6753549532058237),
  ('Financial law', 0.6682317042559766)],
 '54': [('Music', 1.0),
  ('Music (general)', 1.0),
  ('Hunting', 1.0),
  ('Cars', 1.0),
  ('Theatre', 1.0),
  ('Music', 1.0),
  ('Media', 1.0)],
 '65': [('Hotels', 1.0),
  ('Luxury goods', 0.7977234458607577),
  ('Home interest', 0.7043882694002281),
  ('Hotels', 0.6979901844198264),
  ('House & Home', 0.6912891349850351),
  ('Spas and health clubs', 0.6855559181117479),
  ('Mexico', 0.6546812142237388)]}

In [None]:
# write results to file

***

# Kermit Beats <a id='beats'></a>

In [24]:
beats_file_name = 'beats.txt'
with open(beats_file_name) as file:
    beats_list = [line.rstrip() for line in file]

In [25]:
# clean beats 
beats_dict = {}
for beat in beats_list:
    beats_dict[beat] = beat.replace('(','').replace(')','').replace(' & ',' ').replace(' and ',' ').replace('general','').lower().strip()
    beats_dict[beat] = process_text(beats_dict[beat])

In [30]:
beats_dict

{'Accounting': 'accounting',
 'Actors and acting': 'actor act',
 'Adult education': 'adult education',
 'Adult entertainment': 'adult entertainment',
 'Aerospace engineering': 'aerospace engineering',
 'Afghanistan': 'afghanistan',
 'Africa': 'africa',
 'Agricultural economics': 'agricultural economic',
 'Agricultural technology': 'agricultural technology',
 'Agricultural machinery': 'agricultural machinery',
 'Agricultural policy': 'agricultural policy',
 'Air freight': 'air freight',
 'Environmental pollution': 'environmental pollution',
 'Aviation safety': 'aviation safety',
 'Airlines': 'airline',
 'Airports': 'airport',
 'Albania': 'albania',
 'Alcohol': 'alcohol',
 'Algeria': 'algeria',
 'Allergology': 'allergies',
 'Alternative medicine': 'alternative medicine',
 'Andorra': 'andorra',
 'Angola': 'angola',
 'Animal breeding and genetics': 'animal breeding genetics',
 'Animal rights and protection': 'animal rights protection',
 'Anthropology': 'anthropology',
 'Antiques': 'antique

In [27]:
beats_dict['Football (Soccer)'] = 'soccer'

In [28]:
# beats_dict['Cooking'] = 'Cooking or cookery is the art, technology, \
# science and craft of preparing food for consumption. Cooking techniques and ingredients vary widely across the world,\
# from grilling food over an open fire to using electric stoves, to baking in various types of ovens,\
# reflecting unique environmental, economic, and cultural traditions and trends. \
# Types of cooking also depend on the skill levels and training of cooks. \
# Cooking is done both by people in their own dwellings and by professional cooks and\
# chefs in restaurants and other food establishments. \
# Cooking can also occur through chemical reactions without the presence of heat,\
# such as in ceviche, a traditional South American dish where fish is cooked with the acids in lemon or lime juice.\
# Preparing food with heat or fire is an activity unique to humans. \
# It may have started around 2 million years ago,\
# though archaeological evidence for it reaches no more than 1 million years ago. \
# The expansion of agriculture, commerce, trade, \
# and transportation between civilizations in different regions offered cooks many new ingredients. \
# New inventions and technologies, such as the invention of pottery for holding and boiling water, \
# expanded cooking techniques. Some modern cooks apply advanced scientific techniques to food preparation\
# to further enhance the flavor of the dish served.'

In [29]:
# interactively correct beat keywords that are out of vocabulary (OOV) for the spaCy model
for original,new in beats_dict.items():
    tokens = nlp(new)
    for token in tokens:
        if token.is_oov:
            print(original, '->', token.text)
            replacement = input('Please type corrected beat keyword: ')
            beats_dict[original] = replacement

Allergology -> allergology
Please type corrected beat keyword: allergies
Floristry and Flowers -> floristry
Please type corrected beat keyword: flowers florist
Comics -> comic_strip
Please type corrected beat keyword: comic


***

# Contact Data <a id='contacts'></a>

In [31]:
contact_file_name = 'contact_docs.txt'
contacts = defaultdict(list)
with open(contact_file_name) as file:
    for line in file:
        contact_id, url, title = line.rstrip().split(',',2)
        contacts[contact_id].append((url, title))

In [32]:
contacts


defaultdict(list,
            {'53': [('https://fcw.com/articles/2020/03/03/dhs-vacancies-hearing-cuccinelli.aspx',
               'DHS will appeal Cuccinelli ruling'),
              ('https://fcw.com/articles/2020/02/27/cerner-va-golive-july-delay.aspx',
               'VA health record go-live pushed back to July'),
              ('https://defensesystems.com/articles/2020/02/26/aws-jedi-administrative-record.aspx',
               'Cloud Amazon requests more documents in JEDI case'),
              ('https://washingtontechnology.com/articles/2020/02/25/aws-record-jedi-cloud.aspx',
               'AWS looks to expand record in JEDI lawsuit'),
              ('https://gcn.com/articles/2020/02/25/aws-jedi-administrative-record.aspx',
               'AWS calls for more JEDI procurement docs'),
              ('https://fcw.com/articles/2020/02/25/aws-record-jedi-cloud.aspx',
               'AWS looks to expand record in JEDI lawsuit'),
              ('https://fcw.com/articles/2020/02/24/schum

***

The End.