### What to do
This notebook tackles the Guillotine game (La Ghigliottina), which is broadcast every evening on Rai Uno, on Italian television.

The contestant is shown five pairs of words and must choose one word from each pair: one is the right clue and the other is an intruder. If the contestant picks the right clue, the prize money stays intact; otherwise it is halved.

**Once all five clues have been found, the contestant has one minute to work out which word links to each of them. A correct guess wins the prize money; a wrong one wins nothing. The champion returns by right in the next episode.**

So we need to implement an algorithm that, given 5 words, returns the 6th. The sixth word has to be strongly related to the other five.

### The algorithm
Our algorithm searches a dataset (which we assembled ourselves) for all the sentences that contain at least one of the 5 given words. These sentences are pre-processed by tokenizing them and removing stop words. For every remaining word we count how often it occurs in the selected sentences, giving more weight to words that appear in sentences matching several different test words. The word with the highest score is returned as the answer.
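
The idea can be sketched on a toy example before looking at the real cells. This is only a minimal illustration: `TOY_CORPUS`, `TOY_STOP_WORDS` and `guess_sixth_word` are hypothetical names that do not appear in the notebook, and plain whitespace splitting stands in for the NLTK-based pre-processing used further down.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical toy data, not taken from the notebook's corpus
TOY_CORPUS = [
    "togliersi un sassolino dalla scarpa",
    "scarpa da ginnastica",
    "scarpa con il tacco",
    "tacco a spillo",
]
TOY_STOP_WORDS = {"un", "dalla", "da", "con", "il", "a"}


def guess_sixth_word(clues: List[str], corpus: List[str], stop_words: Set[str]) -> str:
    clue_set = {clue.lower() for clue in clues}
    # counts[candidate][clue] = number of sentences shared by candidate and clue
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = {t for t in sentence.lower().split() if t not in stop_words}
        for clue in clue_set & tokens:
            for token in tokens - {clue}:
                counts[token][clue] += 1
    # Score = (number of distinct clues) * (total number of co-occurrences)
    scores = {word: len(per_clue) * sum(per_clue.values())
              for word, per_clue in counts.items()}
    return max(scores, key=scores.get) if scores else ""


print(guess_sixth_word(["sassolino", "ginnastica", "tacco"], TOY_CORPUS, TOY_STOP_WORDS))
# scarpa
```

Here "scarpa" shares one sentence with each of the three clues (score 3 * 3 = 9), while "togliersi" and "spillo" each share a single sentence with a single clue (score 1).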

### Dataset
The dataset was composed by combining:
  - Titles of movies scraped from https://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/ and converted to plain text
  - Titles of Italian songs scraped from https://www.midi-miti-mici.it/musica-midi/elenco-canzoni-A_B.asp
  - Common sayings scraped from https://www.sololibri.net/Modi-di-dire-i-piu-conosciuti.html and https://it.wikipedia.org/wiki/Glossario_delle_frasi_fatte
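
The notebook does not show how the three sources were merged, so the following is only one possible sketch. It assumes each source has already been saved as a text file with one title or saying per line; `merge_lines` and `build_corpus` are hypothetical helpers, and dropping case-insensitive duplicates is an assumption rather than something the notebook states.

```python
from pathlib import Path
from typing import Iterable, List


def merge_lines(sources: Iterable[Iterable[str]]) -> List[str]:
    """Merge several lists of lines, skipping blanks and repeated entries."""
    seen = set()
    merged = []
    for lines in sources:
        for line in lines:
            text = " ".join(line.split())  # trim and collapse whitespace
            key = text.lower()
            if text and key not in seen:
                seen.add(key)
                merged.append(text)
    return merged


def build_corpus(source_paths: List[str], output_path: str) -> None:
    """Read every source file and write the merged corpus, one sentence per line."""
    sources = [Path(p).read_text(encoding="utf-8").splitlines() for p in source_paths]
    Path(output_path).write_text("\n".join(merge_lines(sources)) + "\n", encoding="utf-8")


print(merge_lines([["La  dolce vita", ""], ["la dolce vita", "Tacco a spillo"]]))
# ['La dolce vita', 'Tacco a spillo']
```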

In [76]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
from collections import defaultdict
import random
from typing import Set, List

### Data Load

In [77]:
def import_dataset(path: str) -> List[str]:
    with open(path, encoding='utf-8') as f:
        rows = f.read().splitlines()
    return rows

stop_words = set(stopwords.words('italian'))
dataset = import_dataset('data/corpus.txt')

### Pre-processing

In [78]:
# Pre-processing of a sentence
def bag_of_words(sentence: str) -> Set[str]:
    return set(remove_stopwords(tokenize_sentence(remove_punctuation(sentence))))


# Remove stopwords from a word list
def remove_stopwords(words: List[str]) -> List[str]:
    return [value.lower() for value in words if value.lower() not in stop_words]


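# Get lower-cased, lemmatized tokens from a sentence
def tokenize_sentence(sentence: str) -> List[str]:
    # Note: WordNetLemmatizer is built on the English WordNet,
    # so it is of limited use on Italian text
    lmtzr = WordNetLemmatizer()
    # The POS tags were never used, so the tokens are lemmatized directly
    return [lmtzr.lemmatize(token).lower() for token in word_tokenize(sentence)]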


# Replace punctuation with spaces and collapse runs of whitespace
def remove_punctuation(sentence: str) -> str:
    return re.sub(r'\s\s+', ' ', re.sub(r'[^\w\s]', ' ', sentence))
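
To see what the two regular expressions do on their own, here is a small dependency-free sketch. `strip_punctuation` repeats the same substitutions as `remove_punctuation`, while `TOY_STOP_WORDS` and `simple_bag_of_words` are hypothetical stand-ins: `str.split` replaces `word_tokenize`, and no lemmatization is performed.

```python
import re
from typing import Set

# Hypothetical, tiny stop word list (the notebook uses NLTK's Italian list)
TOY_STOP_WORDS = {"il", "la", "che", "di", "un", "l"}


def strip_punctuation(sentence: str) -> str:
    # Replace every non-word, non-space character with a space,
    # then collapse runs of two or more whitespace characters
    return re.sub(r"\s\s+", " ", re.sub(r"[^\w\s]", " ", sentence))


def simple_bag_of_words(sentence: str) -> Set[str]:
    tokens = strip_punctuation(sentence).lower().split()
    return {t for t in tokens if t not in TOY_STOP_WORDS}


print(repr(strip_punctuation("L'amore, che  sorpresa!")))
# 'L amore che sorpresa '
print(sorted(simple_bag_of_words("L'amore, che  sorpresa!")))
# ['amore', 'sorpresa']
```

Because the apostrophe is replaced by a space, the elided article "L" becomes a token of its own, which is why the toy stop word list includes "l".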

### Answer Computation

In [79]:
def compute_answer(test_words: List[str]) -> str:
    # For every candidate word, count the sentences it shares with each test word
    word_counter = defaultdict(lambda: defaultdict(int))
    for test_word in (w.lower() for w in test_words):
        for row in dataset:
            # Does the test word occur in the current sentence of the dataset?
            if test_word in [w.lower() for w in row.split(' ')]:
                # Tokens are lower-cased, so the lower-cased test word is removed
                sentence_words = bag_of_words(row) - {test_word}
                for sentence_word in sentence_words:
                    word_counter[sentence_word][test_word] += 1

    # Weighted sum: terms that co-occur with several different test words weigh more
    scores = {}
    for sentence_word, inner_dict in word_counter.items():
        scores[sentence_word] = len(inner_dict) * sum(inner_dict.values())

    if not scores:
        raise ValueError('None of the test words occurs in the dataset')
    answer = max(scores, key=scores.get)  # Key with the highest score
    return answer
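
The effect of the weighting is easiest to see on hand-made numbers. In this sketch the counts are invented for illustration and `weighted_score` is a hypothetical helper that isolates the formula used in `compute_answer`.

```python
from typing import Dict


def weighted_score(per_clue_counts: Dict[str, int]) -> int:
    # Number of distinct clues the candidate co-occurs with,
    # multiplied by the total number of co-occurrences
    return len(per_clue_counts) * sum(per_clue_counts.values())


# Invented counts: candidate -> {clue: number of shared sentences}
counts = {
    "spillo": {"tacco": 4},
    "scarpa": {"sassolino": 1, "ginnastica": 1, "tacco": 1},
}

ranking = sorted(counts, key=lambda word: weighted_score(counts[word]), reverse=True)
print(ranking)
# ['scarpa', 'spillo']
```

"spillo" has more co-occurrences overall (4 against 3), but all of them involve a single clue, so it scores 1 * 4 = 4, while "scarpa" scores 3 * 3 = 9.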

In [80]:
tests = [
        {'words': ['sassolino', 'ciabatta', 'ginnastica', 'piedi', 'tacco'], 'correct_answer': 'scarpa'},
        {'words': ['signore', 'civile', 'capo', 'perdere', 'settimo'], 'correct_answer': 'anno'},
        {'words': ['scherzo', 'rock', 'insolito', 'amaro', 'danza'], 'correct_answer': 'destino'},
        {'words': ['sorpresa', 'natale', 'compleanno', 'laurea', 'costoso'], 'correct_answer': 'regalo'},
        {'words': ['adele', 'facile', 'dolce', 'grazie', 'forza'], 'correct_answer': 'vita'},
    ]


# Random index to select a test
index = random.randint(0, len(tests) - 1)
selected_words = tests[index]['words']
correct_answer = tests[index]['correct_answer']

# Answer computation
answer = compute_answer(selected_words)

print(f'Words: {", ".join(selected_words)}')
print(f'My answer: {answer}')
print(f'Correct answer: {correct_answer}')

Words: sassolino, ciabatta, ginnastica, piedi, tacco
My answer: scarpa
Correct answer: scarpa
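
The cell above checks a single, randomly chosen test. One possible way to score a solver on the whole list is sketched here. `accuracy` and `dummy_solver` are hypothetical names; in the notebook, `compute_answer` and the full `tests` list would be passed instead.

```python
from typing import Callable, Dict, List


def accuracy(tests: List[Dict], solver: Callable[[List[str]], str]) -> float:
    """Fraction of tests for which the solver returns the expected word."""
    if not tests:
        return 0.0
    hits = sum(1 for test in tests if solver(test["words"]) == test["correct_answer"])
    return hits / len(tests)


# Hypothetical stand-in for compute_answer: always answers "scarpa"
def dummy_solver(words: List[str]) -> str:
    return "scarpa"


sample_tests = [
    {"words": ["sassolino", "ciabatta", "ginnastica", "piedi", "tacco"], "correct_answer": "scarpa"},
    {"words": ["signore", "civile", "capo", "perdere", "settimo"], "correct_answer": "anno"},
]

print(accuracy(sample_tests, dummy_solver))
# 0.5
```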
