# Computational Creativity Seminar, LMU, WS 2021/22

## Project: Interdimensional Monopoly

### Creators: Laura Luckert & Shaoqiu Zhang

Topic Description: Zwei oder mehr Spieler spielen das klassische Monopoly, aber nach jedem "über Los" ändert sich das Thema des Spiels. Aus dem klassischen Monopoly wird ein "Star Wars" Monopoly, ein "Herr der Ringe" Monopoly etc. Die Namen der Straßen und Aktionsfelder ändern sich nach dem aktuellen Thema. Die Aktionskarten, Felder und Namen müssen sinnvoll generiert werden. Programmier-Aufwand sollte sich nicht auf eine aufwändige GUI richten, sondern auf die sinnvolle Generation neuer Dimensionen.

#### Target:
* Each new dimension is related to a popular Netflix movie / series
* Within the dimension, places and actions are generated from places that exist in this movie/series and useful actions related to the respective movie/series

#### Data Sources:
* Kaggle series / movie dataset with user rankings: https://www.kaggle.com/chasewillden/netflix-shows
* Fictional Worlds: https://github.com/prosecconetwork/The-NOC-List/blob/master/NOC/DATA/Veale's%20NOC%20List/Veales%20place%20elements.xlsx
* Wikipedia API via https://pypi.org/project/Wikipedia-API/0.3.5/ 
* Regularization via https://hatebase.org and https://github.com/dariusk/wordfilter

#### How To

##### Randomization
* Random selection of topic: Movie or series from netflix dataset, we only consider titles with a rating > 90 to pick only the most popular shows
* For each topic, we retrieve the Wikipedia article, if there is none, the topic is discarded
* Filtration: if Wikipedia article is too short, the topic is discarded

##### Plagiarism
* we use the regular Monopoly action cards in combination with the Wikipedia data for action card text generation

##### Generation
For places:
* NER of wikipedia text to extract persons and locations

For actions:
* KeywordToText Generation with action card input and frequently used terms in the Wikipedia article (considering NERs)

##### Filtration & Creation
* Fitness: Find a fitness metric for the existing places and actions and compare the generated text against it
* -> Self-evaluation of system -> keep only above a certain treshold, otherwise trigger re-generation

#### Output Structure

In [1]:
"""
{
"topic": "Topic Name",

"places": {
    "general_places": [("Name of Place", 1000), ("Name of Place 2", 2000)],
    "train_stations": [("Place 1",3000), ("Place 2",3000), ("Place 3",3000), ("Place 4",3000)]
    "jail": "Name of Place",
    "free_parking": "Name of Place"
    },
    
"actions": {
    "neutral_action": ["Go three fields back", "..."],
    "reward_action": ["Generate rewarding action for the player", "..."],
    "penalty_action": ["Generate punishing action for the player", "..."]
    }
}
"""

'\n{\n"topic": "Topic Name",\n\n"places": {\n    "general_places": [("Name of Place", 1000), ("Name of Place 2", 2000)],\n    "train_stations": ["Place 1", "Place 2", "Place 3", "Place 4"]\n    "jail": "Name of Place",\n    "free_parking": "Name of Place"\n    },\n    \n"actions": {\n    "neutral_action": ["Go three fields back", "..."],\n    "reward_action": ["Generate rewarding action for the player", "..."],\n    "penalty_action": ["Generate punishing action for the player", "..."]\n    }\n}\n'

### 1. Select Topic (Dimension) via Netflix Data

In [11]:
import sys
import pandas as pd
import os
import wikipediaapi

In [124]:
import numpy as np

In [7]:
#!conda install -n comp_creativity pandas -y
PATH = "~/Desktop/"
FILENAME = "netflix_data.csv"

full_path = os.path.expanduser(PATH)
os.chdir(full_path)

netflix_data = pd.read_csv(FILENAME, sep=";")

In [9]:
netflix_data.head(10)

Unnamed: 0,title,rating,ratingLevel,ratingDescription,release year,user rating score,user rating size
0,White Chicks,PG-13,"crude and sexual humor, language and some drug...",80,2004,82.0,80
1,Lucky Number Slevin,R,"strong violence, sexual content and adult lang...",100,2006,,82
2,Grey's Anatomy,TV-14,Parents strongly cautioned. May be unsuitable ...,90,2016,98.0,80
3,Prison Break,TV-14,Parents strongly cautioned. May be unsuitable ...,90,2008,98.0,80
4,How I Met Your Mother,TV-PG,Parental guidance suggested. May not be suitab...,70,2014,94.0,80
5,Supernatural,TV-14,Parents strongly cautioned. May be unsuitable ...,90,2016,95.0,80
6,Breaking Bad,TV-MA,For mature audiences. May not be suitable for...,110,2013,97.0,80
7,The Vampire Diaries,TV-14,Parents strongly cautioned. May be unsuitable ...,90,2017,91.0,80
8,The Walking Dead,TV-MA,For mature audiences. May not be suitable for...,110,2015,98.0,80
9,Pretty Little Liars,TV-14,Parents strongly cautioned. May be unsuitable ...,90,2016,96.0,80


In [8]:
## select only shows with rating score > 90
netflix_subset = netflix_data[netflix_data["user rating score"] > 90]

In [3]:
## 271 titles remain
netflix_subset.shape[0]

271

In [9]:
## randomly select topic
topic = netflix_subset.sample()["title"]
print(topic)

63    Criminal Minds
Name: title, dtype: object


### 2. Get Data for Topic from Wikipedia API

https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/

In [448]:
## for regular text output
wiki_en_wiki = wikipediaapi.Wikipedia(
        language='en',
        extract_format=wikipediaapi.ExtractFormat.WIKI)

## check if page for topic exists
if wiki_en_wiki.page(topic).exists():
    print("Topic is ok.")
    wiki_page = wiki_en_wiki.page(topic)
else:
    print("Find a new topic")

Topic is ok.


In [500]:
wiki_page.sections[0]

Section: Plot (1):
After defeating Owen Shaw and securing amnesty for their past crimes, Dom, Brian and the team have returned to the United States to live normal lives. Brian accustoms himself to life as a father, while Dom tries to help Letty Ortiz regain her memory. Meanwhile, Owen's older brother, Deckard, breaks into the hospital where the comatose Owen is held, before breaking into the DSS office in Los Angeles to extract profiles of Dom's crew. After revealing his identity, Deckard fights Luke Hobbs and escapes, detonating a bomb that severely injures Hobbs. Dom later learns from his sister Mia that she is pregnant again and convinces her to tell Brian. However, a letter bomb, sent from Tokyo, explodes and destroys the Toretto house shortly after Han Lue, a member of Dom's team, is apparently killed by Deckard in Tokyo. Dom travels to Tokyo to retrieve Han's body and acquires the objects found at the crash site from Han's friend, Sean Boswell.
As Dom, Brian, Tej Parker, and Roma

In [499]:
def print_sections(sections, level=0):
        for s in sections:
                print("%s: %s - %s" % ("*" * (level + 1), s.title, s.text[0:40]))
                print_sections(s.sections, level + 1)
print_sections(wiki_page.sections)

*: Plot - After defeating Owen Shaw and securing a
*: Cast - Vin Diesel as Dominic Toretto, a former 
*: Production - 
**: Development - On October 21, 2011, the Los Angeles Tim
**: Filming - Principal photography began in early Sep
**: Stunts - The "airdrop" sequence was conceived by 
**: Redevelopment of Walker's character - In January 2014, Time reported that Walk
**: Music - The musical score was composed by Brian 
*: Release - 
**: Theatrical - The film originally scheduled to be rele
**: Home media - Furious 7 was released on July 6, 2015 i
*: Reception - 
**: Box office - Furious 7 grossed $353 million in the Un
***: North America - Predictions for the opening weekend of F
***: Outside North America - Furious 7 opened on April 1, 2015 in 12 
**: Critical response - Furious 7 received positive reviews, wit
**: Accolades - 
*: Sequel - A sequel, titled The Fate of the Furious
*: See also - List of films featuring drones
List of f
*: Notes - 
*: References - DocumentsUniversal Pict

In [449]:
topic_text = wiki_page.text

### 3. NER on Data to Identify Places

Code from: https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
https://medium.com/spatial-data-science/how-to-extract-locations-from-text-with-natural-language-processing-9b77035b3ea4

In [14]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/lauraluckert/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/lauraluckert/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [26]:
## pos-tagging, maybe useful for actions?
topic_text_pos = preprocess(topic_text)

In [27]:
topic_text_pos

[('The', 'DT'),
 ('Vampire', 'NNP'),
 ('Diaries', 'NNP'),
 ('is', 'VBZ'),
 ('an', 'DT'),
 ('American', 'JJ'),
 ('supernatural', 'NN'),
 ('teen', 'JJ'),
 ('drama', 'NN'),
 ('television', 'NN'),
 ('series', 'NN'),
 ('developed', 'VBN'),
 ('by', 'IN'),
 ('Kevin', 'NNP'),
 ('Williamson', 'NNP'),
 ('and', 'CC'),
 ('Julie', 'NNP'),
 ('Plec', 'NNP'),
 (',', ','),
 ('based', 'VBN'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('book', 'NN'),
 ('series', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('same', 'JJ'),
 ('name', 'NN'),
 ('written', 'VBN'),
 ('by', 'IN'),
 ('L.', 'NNP'),
 ('J.', 'NNP'),
 ('Smith', 'NNP'),
 ('.', '.'),
 ('The', 'DT'),
 ('series', 'NN'),
 ('premiered', 'VBD'),
 ('on', 'IN'),
 ('The', 'DT'),
 ('CW', 'NNP'),
 ('on', 'IN'),
 ('September', 'NNP'),
 ('10', 'CD'),
 (',', ','),
 ('2009', 'CD'),
 (',', ','),
 ('and', 'CC'),
 ('concluded', 'VBD'),
 ('on', 'IN'),
 ('March', 'NNP'),
 ('10', 'CD'),
 (',', ','),
 ('2017', 'CD'),
 (',', ','),
 ('having', 'VBG'),
 ('aired', 'VBN'),
 ('171', 'CD'),
 ('

In [137]:
import spacy
from spacy import displacy
from collections import Counter
ner = spacy.load('en_core_web_sm')
ner_w = spacy.load('xx_ent_wiki_sm')
## roberta based
#nlp_b = spacy.load('en_core_web_trf')

In [450]:
doc = ner(topic_text)
print([(X.text, X.label_) for X in doc.ents])

[('Fast & Furious 7', 'ORG'), ('2015', 'DATE'), ('American', 'NORP'), ('James Wan', 'PERSON'), ('Chris Morgan', 'PERSON'), ('Fast & Furious', 'ORG'), ('6', 'CARDINAL'), ('2013', 'DATE'), ('seventh', 'ORDINAL'), ('the Fast & Furious', 'ORG'), ('Vin Diesel', 'PERSON'), ('Paul Walker', 'PERSON'), ('Dwayne Johnson', 'PERSON'), ('Michelle Rodriguez', 'PERSON'), ('Tyrese Gibson', 'PERSON'), ('Chris', 'PERSON'), ('Bridges', 'PERSON'), ('Jordana Brewster', 'PERSON'), ('Djimon Hounsou', 'ORG'), ('Kurt Russell', 'PERSON'), ('Jason Statham', 'ORG'), ('Dominic Toretto', 'PERSON'), ("Brian O'Conner", 'PERSON'), ('the United States', 'GPE'), ('Deckard Shaw', 'PERSON'), ("Paul Walker's", 'PERSON'), ("O'Connor", 'PERSON'), ('November 30', 'DATE'), ('2013.Plans', 'CARDINAL'), ('seventh', 'ORDINAL'), ('first', 'ORDINAL'), ('February 2012', 'DATE'), ('Johnson', 'PERSON'), ('Fast & Furious', 'ORG'), ('6', 'CARDINAL'), ('April 2013', 'DATE'), ('Wan', 'PERSON'), ('Diesel', 'ORG'), ('that same month', 'DATE'

In [43]:
entities = set()
for item in doc.ents:
    entities.add(item.label_)

print(entities)

{'DATE', 'FAC', 'GPE', 'TIME', 'QUANTITY', 'EVENT', 'WORK_OF_ART', 'PERSON', 'ORG', 'LANGUAGE', 'LOC', 'ORDINAL', 'NORP', 'CARDINAL'}


In [456]:
type(doc)
test_list = []
for item in doc.ents:
    if item.label_ == "GPE":
        test_list.append(item.text)
        #print(item.text)

inspect_counter = Counter(test_list)


In [460]:
inspect_counter.most_common()

[('China', 8),
 ('the United States', 7),
 ('Los Angeles', 7),
 ('Tokyo', 7),
 ('Colorado', 6),
 ('Atlanta', 5),
 ('Canada', 5),
 ('Abu Dhabi', 3),
 ('Hobbs', 3),
 ('Hollywood', 3),
 ('U.S.', 3),
 ('Hercules', 2),
 ('Brazil', 2),
 ('Mexico', 2),
 ('Deckard', 1),
 ('El Segundo', 1),
 ('California', 1),
 ('Azerbaijan', 1),
 ('Miami', 1),
 ('Rousey', 1),
 ('Spain', 1),
 ('Lucas Black', 1),
 ('the Dominican Republic', 1),
 ('Dubai', 1),
 ('Arizona', 1),
 ('Georgia', 1),
 ('Montana', 1),
 ('Austin', 1),
 ('Toronto', 1),
 ('UK', 1),
 ('Argentina', 1),
 ('Chile', 1),
 ('Colombia', 1),
 ('Egypt', 1),
 ('Malaysia', 1),
 ('Romania', 1),
 ('Taiwan', 1),
 ('Thailand', 1),
 ('Venezuela', 1),
 ('Vietnam', 1),
 ('Russia', 1),
 ('Poland', 1),
 ('the United Kingdom', 1),
 ('Germany', 1)]

In [47]:
doc_w = ner_w(topic_text)

In [48]:
entities_w = set()
for item in doc_w.ents:
    entities_w.add(item.label_)

print(entities_w)

{'MISC', 'LOC', 'PER', 'ORG'}


In [49]:
for item in doc_w.ents:
    if item.label_ == "LOC":
        print(item.text)

New York City
Nevada
Seattle
Hong Kong
Seattle
Savannah
Las Vegas High School
Paris
London
Europe
United States
Vietnam
New York
London
Texas
Alvez
Alvez
Iraq
Eli
Belgium
the Pentagon
New York
Las Vegas
Grand Canyon
Virginia
Briscoe County
Texas
Texas
Barbados
Aruba
Florida
Rochelle Aytes
Bethesda General Hospital
Savannah
Savannah
Alabama
Tennessee
Santa Barbara
Southeast Asia
Hong Kong
Taiwan


### 3. Text Generation via Keywords
https://medium.com/mlearning-ai/generating-sentences-from-keywords-using-transformers-in-nlp-e89f4de5cf6b

Generation Idea: 
* Use Monopoly action cards in combination with terms of wikipedia article to generate new action cards DONE
* Group monopoly cards into sentiments: positive, negative, neutral DONE
* Create frequency count of words in original action cards, use POS tagging / NER
* remove entities in monopoly action cards and save which entities have been removed (places?, persons?) -> do it deliberatly?
* filter stop words?

now either:
- Entity Swap (Kazemi): select removed entities from wikipedia article -> use frequency counts to select "most important ones" (higher frequency -> more important entity), swap entities
- Text Generation from Keywords: generate new text via k2t model and selected input words

Evaluation idea:
* analyse distribution of POS tags in in original monopoly cards vs. distribution of POS tags in generated action cards? Compare two distributions via Kullback-Leibler-Divergenz

In [16]:
from keytotext import pipeline
from sklearn.feature_extraction.text import CountVectorizer

In [133]:
#from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cosine

In [35]:
import re

In [261]:
FILENAME_MONOPOLY = "monopoly_action_cards_keywords.csv"
monopoly_data = pd.read_csv(FILENAME_MONOPOLY, sep=";")

In [262]:
monopoly_data.head()

Unnamed: 0,category,content,bias,keywords,Unnamed: 4,Unnamed: 5,Unnamed: 6
0,Event card,"Pay a fine of DM 200,- or take a community tic...",negativ,"pay, fine, community ticket",,,
1,Event card,"Move up to Seestrasse. \nIf you come over Go, ...",neutral,"move, Seestrasse, go",,,
2,Event card,Go back 3 fields.,neutral,"go, fields",,,
3,Event card,Go back to Badstraße.,neutral,"go, Badstraße",,,
4,Event card,Move forward to Schlossallee.,neutral,"move, Schlossallee",,,


In [18]:
## filter monopoly data by sentiment
monopoly_data_pos = monopoly_data[monopoly_data["bias"] == "positiv"]

In [17]:
#from nltk import ngrams, FreqDist

#inspect_counts = FreqDist(ngrams(monopoly_data_pos["content"], 1))

"""
all_counts = dict()
for size in 2, 3, 4, 5:
    all_counts[size] = FreqDist(ngrams(monopoly_data_pos["content"], size))
"""


'\nall_counts = dict()\nfor size in 2, 3, 4, 5:\n    all_counts[size] = FreqDist(ngrams(monopoly_data_pos["content"], size))\n'

In [19]:
# Create our vectorizer
vectorizer = CountVectorizer()
vectorizer.fit(monopoly_data_pos["content"])

# Let's look at the vocabulary:
print('Vocabulary: ')
print(vectorizer.vocabulary_)

Vocabulary: 
{'the': 56, 'bank': 13, 'pays': 41, 'you': 62, 'dividend': 22, 'dm': 23, '1000': 0, 'rent': 50, 'and': 9, 'bond': 17, 'interest': 33, 'are': 12, 'due': 25, '3000': 4, 'receive': 47, 'on': 39, 'preferred': 43, 'shares': 53, '900': 8, 'inherit': 32, '2000': 2, 'from': 29, 'stock': 54, 'sales': 51, '500': 7, 'annual': 10, 'annuity': 11, 'is': 34, 'draw': 24, 'win': 60, 'crossword': 21, 'puzzle': 46, 'contest': 20, 'error': 27, 'in': 30, 'your': 63, 'favor': 28, '4000': 6, 'income': 31, 'tax': 55, 'refund': 48, '400': 5, 'won': 61, '2nd': 3, 'prize': 45, 'beauty': 15, '200': 1, 'it': 35, 'birthday': 16, 'collect': 19, 'each': 26, 'player': 42, 'will': 59, 'be': 14, 'released': 49, 'prison': 44, 'must': 37, 'keep': 36, 'this': 57, 'card': 18, 'until': 58, 'need': 38, 'or': 40, 'sell': 52}


In [20]:
## save frequencies of words in monopoly action cards
new_list = []

for key, value in vectorizer.vocabulary_.items():
    new_series = [key, value]
    new_list.append(new_series)

action_cards = pd.DataFrame(new_list, columns=["word","frequency"])


In [40]:
action_cards.sort_values(by="frequency",ascending=False)

Unnamed: 0,word,frequency
35,your,63
3,you,62
42,won,61
29,win,60
52,will,59
...,...,...
13,3000,4
43,2nd,3
20,2000,2
46,200,1


## POS-Tag Preprocessing for Monopoly Action Cards

In [31]:
output_sent = preprocess(monopoly_data_pos["content"].iloc[0])

In [328]:
from collections import Counter
imd_list = []
collect_counts = []
for action_text in monopoly_data["content"]:
    action_pos_tags = preprocess(action_text)
    #print(action_pos_tags)
    imd_list = []
    for _, pos_tag in action_pos_tags:
        #print(pos_tag)
        imd_list.append(pos_tag[0:2])
    print(imd_list)
    print(Counter(imd_list))
    #collect_counts.append()
    
    #
#print(collect_counts)

['VB', 'DT', 'NN', 'IN', 'NN', 'CD', ',', ':', 'CC', 'VB', 'DT', 'NN', 'NN', '.']
Counter({'NN': 4, 'VB': 2, 'DT': 2, 'IN': 1, 'CD': 1, ',': 1, ':': 1, 'CC': 1, '.': 1})
['NN', 'RB', 'TO', 'NN', '.', 'IN', 'PR', 'VB', 'IN', 'NN', ',', 'NN', 'NN', 'CD', ',', ':', '.']
Counter({'NN': 5, '.': 2, 'IN': 2, ',': 2, 'RB': 1, 'TO': 1, 'PR': 1, 'VB': 1, 'CD': 1, ':': 1})
['NN', 'RB', 'CD', 'NN', '.']
Counter({'NN': 2, 'RB': 1, 'CD': 1, '.': 1})
['VB', 'RB', 'TO', 'NN', '.']
Counter({'VB': 1, 'RB': 1, 'TO': 1, 'NN': 1, '.': 1})
['NN', 'RB', 'TO', 'NN', '.']
Counter({'NN': 2, 'RB': 1, 'TO': 1, '.': 1})
['NN', 'RB', 'TO', 'NN', '.']
Counter({'NN': 2, 'RB': 1, 'TO': 1, '.': 1})
['VB', 'DT', 'NN', 'TO', 'DT', 'JJ', 'NN', '.', 'WR', 'PR', 'VB', 'IN', 'NN', ',', 'RB', 'NN', 'CD', ',', ':', '.']
Counter({'NN': 4, 'VB': 2, 'DT': 2, '.': 2, ',': 2, 'TO': 1, 'JJ': 1, 'WR': 1, 'PR': 1, 'IN': 1, 'RB': 1, 'CD': 1, ':': 1})
['DT', 'NN', 'VB', 'PR', 'DT', 'NN', 'NN', 'CD', ',', ':', '.']
Counter({'NN': 3, 'DT'

In [112]:
action_sent = preprocess(action_output)

In [38]:
output_sent

[('The', 'DT'),
 ('bank', 'NN'),
 ('pays', 'VBZ'),
 ('you', 'PRP'),
 ('a', 'DT'),
 ('dividend', 'NN'),
 ('.', '.'),
 ('DM', 'NNP'),
 ('1000', 'CD'),
 (',', ','),
 ('-', ':')]

In [113]:
action_sent

[('The', 'DT'),
 ('bank', 'NN'),
 ('with', 'IN'),
 ('a', 'DT'),
 ('dividend', 'NN'),
 ('of', 'IN'),
 ('Euro', 'NNP'),
 ('is', 'VBZ'),
 ('1000', 'CD'),
 ('in', 'IN'),
 ('Texas', 'NNP'),
 ('.', '.')]

In [114]:
## pos distribution
pos_tag_df_reference = pd.DataFrame(output_sent,columns=["token","pos_tag"])
for x in pos_tag_df_reference["pos_tag"]:
    print(x[0:2])
pos_tag_df_reference["short_pos_tag"] = [x[0:2] for x in pos_tag_df_reference["pos_tag"]]

reference = pos_tag_df_reference["short_pos_tag"].value_counts()

pos_tag_df_target = pd.DataFrame(action_sent,columns=["token","pos_tag"])

pos_tag_df_target["short_pos_tag"] = [x[0:2] for x in pos_tag_df_target["pos_tag"]]

target = pos_tag_df_target["short_pos_tag"].value_counts()

DT
NN
VB
PR
DT
NN
.
NN
CD
,
:


In [135]:
merged_df = pd.merge(reference,target,how="outer", left_index=True,right_index=True)
merged_df.columns=["reference","target"]
merged_df = merged_df.fillna(0)
#np.array(merged_df["reference"])
#np.dot(merged_df["reference"], merged_df["target"])
print(1 - cosine(merged_df["reference"], merged_df["target"]))
#cosine_similarity([merged_df["reference"]], [merged_df["target"]])

0.7705517503711221


array([[0.77055175]])

In [100]:
## use only verbs, personal pronoun, adjectives, nouns
keyword_list = []

for pos_tag in output_sent:
    if re.match("VB.*|NN.*|PR.*|RB.*|CD.*", pos_tag[1]):
        if pos_tag[0] == "DM":
            keyword_list.append("Euro")
        else:
            keyword_list.append(pos_tag[0])  
        
print("The topic's keyword_list is: ", keyword_list)

The topic's keyword_list is:  ['bank', 'pays', 'you', 'dividend', 'Euro', '1000']


List of POS-Tag Description:
https://www.guru99.com/pos-tagging-chunking-nltk.html

## Preprocess NER tokens for Wikipedia data

In [55]:
doc_w.ents[0].label_

'ORG'

In [58]:
ent_collection = []
for item in doc_w.ents:
    ent_collection.append([item.text,item.label_])

topic_data_entities = pd.DataFrame(ent_collection,columns=["entity","label"])

In [69]:
count_df = pd.DataFrame(topic_data_entities["entity"].value_counts())
count_df.columns=["count"]

In [76]:
topic_data_entities = pd.merge(topic_data_entities,count_df,how="left",left_on="entity",right_index=True)

In [82]:
## distinct rows after join
topic_data_entities = topic_data_entities.drop_duplicates()

In [92]:
first_topic = topic_data_entities[topic_data_entities["label"]=="LOC"].sort_values(by="count",ascending=False).entity.iloc[0]

- POS-Tag on Monopoly Data + Counts
- NER on Wikipedia Data + Count
- NER on 

In [101]:
keyword_list.append(first_topic)
keyword_list

['bank', 'pays', 'you', 'dividend', 'Euro', '1000', 'Texas']

## Key-to-Text Model Application

Input: List of words from action cards + wikipedia

In [29]:
nlp_k2t_base = pipeline("k2t-base")  #loading the pre-trained model
params = {"do_sample":True, "num_beams":4, "no_repeat_ngram_size":3, "early_stopping":True}    #decoding params

In [224]:
from keytotext import pipeline
nlp_k2t = pipeline("k2t") 

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/231M [00:00<?, ?B/s]

In [111]:
keywords=["You", "do", "Move forward to","Galactic Republic"]
action_output = (nlp(keyword_list, **params)) 
print(action_output)

The bank with a dividend of Euro is 1000 in Texas.


In [162]:
## TESTING

text_data = ""
for text_item in monopoly_data["content"]:
    text_data += ". " + text_item
    
inspect_actions = preprocess(text_data)

## use only verbs, personal pronoun, adjectives, nouns
keyword_list_verbs = []

for pos_tag in inspect_actions:
    if re.match("VB.*", pos_tag[1]):
        if pos_tag[0] == "DM":
            keyword_list_verbs.append("Euro")
        else:
            keyword_list_verbs.append(pos_tag[0])  
        
print("The topic's keyword_list is: ", keyword_list_verbs)

keyword_list_nouns = []

for pos_tag in inspect_actions:
    if re.match("NN.*", pos_tag[1]):
        if pos_tag[0] == "DM":
            keyword_list_nouns.append("Euro")
        else:
            keyword_list_nouns.append(pos_tag[0])  
        
print("\n\nThe topic's keyword_list is: ", keyword_list_nouns)


keyword_list_pronouns = []

for pos_tag in inspect_actions:
    if re.match("PR.*", pos_tag[1]):
        if pos_tag[0] == "DM":
            keyword_list_pronouns.append("Euro")
        else:
            keyword_list_pronouns.append(pos_tag[0])  
        
print("\n\nThe topic's keyword_list is: ", keyword_list_pronouns)


keyword_list_determiner = []

for pos_tag in inspect_actions:
    if re.match("DT.*", pos_tag[1]):
        if pos_tag[0] == "DM":
            keyword_list_determiner.append("Euro")
        else:
            keyword_list_determiner.append(pos_tag[0])  
        
print("\n\nThe topic's keyword_list is: ", keyword_list_determiner)


The topic's keyword_list is:  ['Pay', 'take', 'come', 'Go', 'Go', 'Go', 'get', 'pays', 'are', 'pays', 'Go', 'receive', 'inherit', 'receive', 'is', 'win', 'won', 'is', '..', 'receives', 'has', 'buy', 'get', 'have', 'been', 'elected', 'Have', 'renovated', 'Euro', 'Pay', 'be', 'called', 'do', 'Pay', 'Do', 'pass', 'collect', 'be', 'released', 'keep', 'need', 'sell', 'Do', 'pass', 'collect', 'be', 'released', 'keep', 'need', 'sell']


The topic's keyword_list is:  ['fine', 'Euro', 'community', 'ticket', '..', 'Move', 'Seestrasse', 'Go', 'collect', 'Euro', 'fields', 'Badstraße', 'Move', 'Schlossallee', '..', 'Move', '..', 'Make', 'trip', 'station', 'Go', 'Euro', 'bank', 'dividend', 'Euro', 'Rent', 'bond', 'interest', 'bank', 'Euro', 'Move', '..', '%', 'dividend', 'shares', 'Euro', 'Euro', 'stock', 'sales', 'Euro', 'annuity', 'Draw', 'Euro', 'crossword', 'puzzle', 'contest', 'Draw', 'Euro', 'Bank', 'error', 'favor', 'Draw', 'Euro', 'tax', 'refund', 'Draw', 'Euro', 'prize', 'beauty', 'contest'

In [276]:
import random
#random.seed(42)

1

In [278]:
second_verb = "pass"
pronoun = "you"
LOCATION = "Harry Potter Street"
ACTOR = "Hermine Granger"
number = 2000

In [279]:
keywords=[first_verb, second_verb, pronoun, LOCATION, ACTOR]
print(keywords)
#keywords=["You", "do", "Move forward to","Galactic Republic"]
#action_output = (nlp(keywords, **params)) 
action_output = (nlp_k2t(keywords, **params)) 

print(action_output)

['arrange', 'pass', 'you', 'Harry Potter Street', 'Hermine Granger']


  next_indices = next_tokens // vocab_size


The arrangement for the episode with hermine granger is "president".


In [236]:
#from transformers import pipeline, set_seed
#generator = pipeline('text-generation', model='gpt2')
#set_seed(42)
generator("The arrangement for the episode with hermine granger is \"president\".", max_length=40, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The arrangement for the episode with hermine granger is "president". She was part of the team that played the villain in her own series with Brawn.\n\nAce: Originally one of'},
 {'generated_text': 'The arrangement for the episode with hermine granger is "president".\n\n"When she gets down there in the hallway that I would like to call [Ms. Grace] the President was originally'},
 {'generated_text': 'The arrangement for the episode with hermine granger is "president".\n\nPuppet Shows\n\nAvengers: Infinity War took a more "traditional" approach to a series of interl'},
 {'generated_text': 'The arrangement for the episode with hermine granger is "president". There is also an audio recording of the performance by director Mark Rylance – a man she met while filming the series for the'},
 {'generated_text': 'The arrangement for the episode with hermine granger is "president". "So you know what would happen?" she asks. "My wife would have to work for eight to 10 hours a da

## Evaluation of Generated Action Card

In [304]:
def pos_distribution(pos_tuples_of_sentence):
    """
    :pos_tuple_of_sentences: tuple (token, pos_tag) as returned from preprocess function
    
    crop pos tags into relevant groups (first two letters)
    count occurences of pos tags in input sentence
    
    :returns: pandas DataFrame with pos_tag and its frequency
    
    """
    pos_df = pd.DataFrame(pos_tuples_of_sentence,columns=["token","long_pos_tag"])
    pos_df["pos_tag"] = [x[0:2] for x in pos_df["long_pos_tag"]]
    freq_df = pos_df["pos_tag"].value_counts()
    
    return freq_df

def evaluate_generated_sentence(reference, new_sentence):

    ## preprocess both
    reference = preprocess(reference)
    new_sentence = preprocess(new_sentence)
    
    ## pos distribution
    reference = pos_distribution(reference)
    new_sentence = pos_distribution(new_sentence)
    
    ## merge vectors
    merged_df = pd.merge(reference,new_sentence,how="outer", left_index=True,right_index=True).fillna(0)
    merged_df.columns=["reference","target"]
    
    ## calc cosine similarity 
    cos_similarity = 1 - cosine(merged_df["reference"], merged_df["target"])
    
    ## calc scalar product between vectors
    return cos_similarity



    


In [None]:
#POS of output sentence -> to make sure there is an action in it!

## NER on Monopoly Data
### Check named entity swap? Similar to Kazemi?

In [147]:
text_data = ""
for text_item in monopoly_data_pos["content"]:
    text_data += " " + text_item

print(text_data)
    
#sent = 
#print(sent)
sent_ner = ner(text_data)
print([(X.text, X.label_) for X in sent_ner.ents])

 The bank pays you a dividend. DM 1000,- Rent and bond interest are due. The bank pays you DM 3000,- You receive a 7% dividend on preferred shares. DM 900,- You inherit: DM 2000,- From stock sales you receive: DM 500,- The annual annuity is due. Draw DM 2000,- You win a crossword puzzle contest. Draw DM 2000,- Bank error in your favor. Draw DM 4000,- Income tax refund. Draw DM 400,- You won the 2nd prize in a beauty contest. Draw DM 200,- It is your birthday. Collect DM 1000,- from each player. You will be released from prison! You must keep this card until you need it or sell it. You will be released from prison! You must keep this card until you need it or sell it.
[('1000,-', 'CARDINAL'), ('7%', 'PERCENT'), ('DM 900,-', 'ORG'), ('2000,-', 'CARDINAL'), ('500,-', 'PRODUCT'), ('annual', 'DATE'), ('2000,-', 'CARDINAL'), ('1000,-', 'CARDINAL')]


In [145]:
print(text_data)

The bank pays you a dividend. DM 1000,-Rent and bond interest are due. The bank pays you DM 3000,-You receive a 7% dividend on preferred shares. DM 900,-You inherit: DM 2000,-From stock sales you receive: DM 500,-The annual annuity is due. Draw DM 2000,-You win a crossword puzzle contest. Draw DM 2000,-Bank error in your favor. Draw DM 4000,-Income tax refund. Draw DM 400,-You won the 2nd prize in a beauty contest. Draw DM 200,-It is your birthday. Collect DM 1000,- from each player.You will be released from prison! You must keep this card until you need it or sell it.You will be released from prison! You must keep this card until you need it or sell it.


## Few-Shot-Learning

Additional Data Sources:
https://www.kaggle.com/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv

https://en.wikipedia.org/wiki/List_of_fictional_towns_in_television

Source Inference API: https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api


In [None]:
import json
import requests

In [None]:
API_TOKEN = "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"

def query(payload='',parameters=None,options={'use_cache': False}):
    API_URL = "https://api-inference.huggingface.co/models/EleutherAI/gpt-neo-2.7B"
    headers = {"Authorization": f"Bearer {API_TOKEN}"}
    body = {"inputs":payload,'parameters':parameters,'options':options}
    response = requests.request("POST", API_URL, headers=headers, data= json.dumps(body))
    try:
      response.raise_for_status()
    except requests.exceptions.HTTPError:
        return "Error:"+" ".join(response.json()['error'])
    else:
      return response.json()[0]['generated_text']

In [264]:
## generate few shot training data for text generation

prompt_text = ""

for text, keywords in zip(monopoly_data["content"], monopoly_data["keywords"]):
    imd = "key: " + keywords + "\ntweet: " + text + "\n###"
    prompt_text += imd
    

In [496]:
## Source Action Verbs: https://www.citationmachine.net/resources/grammar-guides/verb/list-verbs/

action_verbs = ["Act","Answer","Approve","Arrange","Break","Build","Buy","Coach","Color","Cough","Create", 
                "Complete","Cry","Dance","Describe","Draw","Drink","Eat","Edit","Enter","Exit",
                "Imitate","Invent","Jump","Laugh","Lie","Listen","Paint","Plan","Play","Read","Replace",
                "Run","Scream","See","Shop","Shout","Sing","Skip","Sleep","Sneeze","Solve","Study","Teach",
                "Touch","Turn","Walk","Win","Write","Whistle","Yank","Zip"]

## POS Tag == "VB.*" from real monopoly action cards
action_verbs_monopoly = ["Pay","Take","Come","Go","Get","Receive","Inherit","Win","Pass",
                         "Collect","being released","Keep","Sell"]


## randomly select verbs, pronouns
first_verb = random.choice(action_verbs_monopoly).lower()
second_verb = random.choice(action_verbs).lower()
pronoun = random.choice(["you","your"]).lower()

## once locations available, randomly select location
LOCATION = "Dominic Toretto Avenue"
number = 2000

select_second_verb = random.choice([0,1])
select_pronoun = random.choice([0,1])
select_location = random.choice([0,1])

print("\nfirst_verb:", first_verb, "\nsecond_verb:", second_verb, "\npronoun:",  pronoun, "\nLOCATION:",LOCATION, "\nnumber:",number )

print("\nselect_second_verb:", select_second_verb, "\nselect_pronoun:", select_pronoun, "\nselect_location:",  select_location,)

if select_second_verb and select_pronoun and select_location:
    keyword_list = [first_verb, second_verb, pronoun, LOCATION]
elif select_second_verb == 0 and select_pronoun and select_location:
    keyword_list = [first_verb, pronoun, LOCATION]
elif select_second_verb == 0 and select_pronoun == 0 and select_location:
    keyword_list = [first_verb, LOCATION]
elif select_second_verb == 0 and select_pronoun == 1 and select_location == 0:
    keyword_list = [first_verb, pronoun, number]
elif select_second_verb == 0 and select_pronoun == 0 and select_location == 1:
    keyword_list = [first_verb, LOCATION]
elif select_second_verb == 1 and select_pronoun == 0 and select_location == 0:
    keyword_list = [first_verb, second_verb]
elif select_second_verb == 1 and select_pronoun == 0 and select_location == 1:
    keyword_list = [first_verb, second_verb, LOCATION]
elif select_second_verb == 1 and select_pronoun == 1 and select_location == 0:
    keyword_list = [first_verb, second_verb, pronoun]
elif select_second_verb == 0 and select_pronoun == 0 and select_location == 0:
    keyword_list = [first_verb, number]

prison = "Los Angeles"
if LOCATION == prison:
    keyword_list = [LOCATION, "not pass", "not collect"]
print(keyword_list)

keyword_string = ""

for item in keyword_list:
    if keyword_string == "":
        keyword_string += str(item)
    else:
        keyword_string += ", " + str(item)
        
print(keyword_string)


first_verb: being released 
second_verb: see 
pronoun: your 
LOCATION: Dominic Toretto Avenue 
number: 2000

select_second_verb: 1 
select_pronoun: 0 
select_location: 1
['being released', 'see', 'Dominic Toretto Avenue']
being released, see, Dominic Toretto Avenue


In [497]:
parameters = {
    'max_new_tokens':25,  # number of generated tokens
    'temperature': 1,   # controlling the randomness of generations
    'end_sequence': "###" # stopping sequence for generation
}

prompt = prompt_text + "\nkey: " + keyword_string + "\ntweet:"
#prompt = "key: move, Schlossallee\ntweet: Move forward to Schlossallee.\n###key: draw, Harry Potter, 10, fields, back\ntweet: Draw a picture of Harry Potter or go 10 fields back. \n###\nkey: renovate, pay, houses, bank\ntweet: Have all your houses renovated! Pay to the bank for each house 1000 Euro.\n###\nkey: go, Parkstraße, community ticket\ntweet: Go back to Parkstraße or take a community ticket. \n###\nkey: go, Parkstraße\ntweet:"

#print(prompt)


data = query(prompt,parameters)

action_card = re.findall(r"(?<=tweet:\s).*", data)[-1] 
#print(data)
print(keyword_string)
print(action_card)

being released, see, Dominic Toretto Avenue
See Dominic Toretto. Go there if you have the opportunity.


In [386]:
#action_card = re.match("(?<=:$).*", data)
#print(data)
#print(action_card)
action_card = "Harry Potter Street is now you; you can do anything in here! Eat."
action_card = "You have 2000 dollars."
action_card = "Come play Harry with the other wizards. Collect DM 8000,-"
action_card = "The town of Hogsmeade, the most famous town in Harry Potter, will be closed during the time of the party"
action_card = "You will be replaced by Harry Potter."

In [495]:
if LOCATION == prison:
    reference = "Go to the prison! Go directly there. Do not pass Go. Do not collect DM 4000,-."
    
#action_card = "Go to the jail in Los Angeles! No Going. Make it through the red line and collect DM 10000,-"
reference = "Advance to the opera square. If you get over Go, collect DM 4000,-"
print(reference)
print(action_card)

evaluate_generated_sentence(reference,action_card)

Advance to the opera square. If you get over Go, collect DM 4000,-
You inherit. If you give a house DM 500,- to a player or pay DM 4000,- for a player, then they


0.8776719906943025

## Locations - Use New Data
### Preprocess new data

In [389]:
movie_characters = pd.read_csv("tmdb_5000_credits.csv", sep=",")

In [390]:
movie_characters.head()

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [435]:
cast_rows = []

for malformed_string in movie_characters.cast:
    imd_string = list(malformed_string[1:(len(malformed_string)-1)].split("}"))
    
    new_list = []

    for item in imd_string:
        try: 
            if item[0] != "{":
                item = item[2:(len(item))]
            item += "}"
            new_item =json.loads(item)
            person = new_item["character"]
            #gender = new_item["gender"]
            new_list.append(person)
        except IndexError:
            break
    cast_rows.append(new_list)

    
cast_dict = {}
for movie, cast in zip(movie_characters.title,cast_rows):
    cast_dict[movie] = cast

    
#cast_df = pd.DataFrame(prep_df, columns=["movie","cast"])

In [463]:
cast = cast_dict["Furious 7"]
print(cast)

['Dominic Toretto', "Brian O'Conner", 'Hobbs', 'Letty', 'Roman', "Tej (as Chris 'Ludacris' Bridges)", 'Mia', 'Jakande', 'Kiet', 'Kara', 'Ramsey', 'Mr. Nobody', 'Deckard Shaw', 'Han', 'Gisele', 'Sean Boswell', 'Elena', 'Hector', 'Sheppard', 'Owen Shaw', 'Safar', 'Jack', 'Jack', 'Samantha Hobbs', 'Letty Fan', 'Female Racer', 'Male Racer', 'Race Starter', 'Hot Teacher', 'Doctor', 'Priest', 'Merc Tech', 'Weapons Tech', 'Billionaire', 'Dominican Priest', 'Hana', 'Merc Driver (as Ben Blankenship)', 'DJ', 'DJ', 'Drone Tech', 'Jasmine', 'Mando', 'Advisor', 'Field Reporter', 'Cop', 'Leo (uncredited / archive)', 'Neela (uncredited / archive)', 'Twinkie (uncredited)', 'Santos (uncredited / archive)', 'Race Wars Racer (uncredited)', "Brian O'Conner (uncredited)", "Brian O'Connor (uncredited)"]


In [465]:
new_field = {
    "streets": {"1-3": [],"4-6": [],"7-9":[] , "10-12": [], "13-15": [], "16-18": [], "expensive": [], "cheap": []},
    "stations": [],
    "prison": [],
    "free_parking": [],
    "special": {"1": [], "2": []}
}
#6*3 streets
#2*2 streets
#4 * station
#1 * prison
#1 * free parking
#2* elektrizität + wasserwerk

new_field["streets"]["expensive"] = [x + " Avenue" for x in cast[0:2]]
new_field["streets"]["cheap"] = [x + " Drive" for x in cast[8:10]]
new_field["streets"]["1-3"] = [x + " Street" for x in cast[11:14]]
new_field["stations"] = [x + " Station" for x in cast[3:7]]

new_field

{'streets': {'1-3': ['Mr. Nobody Street', 'Deckard Shaw Street', 'Han Street'],
  '4-6': [],
  '7-9': [],
  '10-12': [],
  '13-15': [],
  '16-18': [],
  'expensive': ['Dominic Toretto Avenue', "Brian O'Conner Avenue"],
  'cheap': ['Kiet Drive', 'Kara Drive']},
 'stations': ['Letty Station',
  'Roman Station',
  "Tej (as Chris 'Ludacris' Bridges) Station",
  'Mia Station'],
 'prison': [],
 'free_parking': [],
 'special': {'1': [], '2': []}}

In [412]:
street_names = ['Avenue', 'Park', 'Street', 'Boulevard', 'Road', 'Main Street', 'Drive', 'Lane', 'Alley']
station_name = 'Station'

In [469]:
new_field

{'streets': {'1-3': ['Mr. Nobody Street', 'Deckard Shaw Street', 'Han Street'],
  '4-6': [],
  '7-9': [],
  '10-12': [],
  '13-15': [],
  '16-18': [],
  'expensive': ['Dominic Toretto Avenue', "Brian O'Conner Avenue"],
  'cheap': ['Kiet Drive', 'Kara Drive']},
 'stations': ['Letty Station',
  'Roman Station',
  "Tej (as Chris 'Ludacris' Bridges) Station",
  'Mia Station'],
 'prison': [],
 'free_parking': [],
 'special': {'1': [], '2': []}}

In [None]:
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"
# b) Load model & tokenizer
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [502]:
question = "What is a special event in the movie " + topic + "?"
print(question)
# a) Get predictions
#nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': question,
    'context': topic_text
}

res = nlp(QA_input)
print(res)


What is a special event in the movie Furious 7?


TypeError: 'WikipediaPageSection' object is not iterable

In [480]:
res["answer"]

'Atlanta'

In [481]:
new_field["prison"] = "Los Angeles"
new_field["free_parking"] = "Atlanta"

special_1 = "What is an import monument in the movie "
special_2 = "What is an expensive location in the movie "
prison = "Which one is the worst area in the movie "
free_parking = "What is the loveliest place in the movie "
special_= "What is a special event in the movie "



In [501]:
plot_only = wiki_page.sections[0]