# Exam for evaluating ML skills needed for Trantor: Exercise I

### Below are a number of examples and exercises. The goal of the exam is to complete as many of the exercises as possible. Candidates may create an auxiliary .py file and import it from the notebook to avoid excessive text.
### It is highly recommended to write modular code so that it can be reused across the different exercises. The ability to write modular, self-explanatory, and clean code that can be used across tasks will be highly appreciated.
### Short comments may be added to explain the choice of ML model or algorithm, as well as references to papers where a similar solution is used for a related problem.

In [1]:
# Import the Python libraries used in the notebook.
# The candidate should have these libraries installed in order to execute the notebook.

import pandas as pd
import glob, os
import matplotlib.pyplot as plt
from transformers import pipeline


None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


### A number of datasets will be used for the evaluation. In particular, a set of files containing "topics" in different languages will be used. The use of these files is restricted to the completion of this exam.
### The provided topics should be in the directory assigned to the variable "path" in the following cell.



In [2]:
### Reads all the pickle files in "path" with information about the topics
# You will need to download the following file:
# https://drive.google.com/file/d/1wzZgN_pcDTEa50ZUwZgDGmRbLajOXH-W/view?usp=sharing
path = 'individualTopics_27-01-22/'
all_records = []

for fname in glob.glob(os.path.join(path, '*.pickle')):
    print(fname)
    obj = pd.read_pickle(fname)
    # Keep only the five fields used in the rest of the notebook
    record = [obj['id'], obj['name'], obj['audience_size'],
              obj['country'], obj['topic']]
    all_records.append(record)

individualTopics_27-01-22\#Te Amo.pickle
individualTopics_27-01-22\(500) Days of Summer.pickle
individualTopics_27-01-22\1 LIVE.pickle
individualTopics_27-01-22\1&1 Internet.pickle
individualTopics_27-01-22\1-800-Flowers.pickle
individualTopics_27-01-22\1. F.C. Colonia.pickle
individualTopics_27-01-22\1. FSV Mainz 05.pickle
individualTopics_27-01-22\10 000 a. C..pickle
individualTopics_27-01-22\10 000 metros.pickle
individualTopics_27-01-22\10 Barrel Brewing Company.pickle
individualTopics_27-01-22\10 de Downing Street.pickle
individualTopics_27-01-22\10 Things I Hate About You (serie de televisión).pickle
individualTopics_27-01-22\10 Years.pickle
individualTopics_27-01-22\100 latinos dijeron.pickle
individualTopics_27-01-22\100 metros vallas.pickle
individualTopics_27-01-22\100 metros.pickle
individualTopics_27-01-22\100 mexicanos dijeron.pickle
individualTopics_27-01-22\1000 km de Bathurst.pickle
individualTopics_27-01-22\101 Dalmatians.pickle
individualTopics_27-01-22\101 dálmatas.p
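
The loading loop above could be wrapped in a reusable function, in line with the request for modular code. The sketch below is one possible shape, not a reference solution: the name `load_topic_records` and the `fields` parameter are hypothetical, it uses the standard-library `pickle` module in place of `pd.read_pickle`, and it assumes each pickle holds a mapping with the five keys read above.

```python
import glob
import os
import pickle

# Hypothetical default: the five keys read in the loading cell.
FIELDS = ('id', 'name', 'audience_size', 'country', 'topic')

def load_topic_records(path, fields=FIELDS):
    """Return one list per pickle file found in `path`, in sorted file order."""
    records = []
    # Escape the directory so characters such as '[' in it are taken literally.
    pattern = os.path.join(glob.escape(path), '*.pickle')
    for fname in sorted(glob.glob(pattern)):
        with open(fname, 'rb') as handle:
            obj = pickle.load(handle)
        records.append([obj[field] for field in fields])
    return records
```

Sorting the file names makes the record order reproducible across platforms, which plain `glob.glob` does not guarantee.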

### Transforms the topic records into a dataframe

In [3]:
df = pd.DataFrame.from_records(
    all_records,
    columns=['id', 'name', 'audience_size', 'country', 'topic'])
df

| | id | name | audience_size | country | topic |
|---|---|---|---|---|---|
| 0 | 6003407352218 | #Te Amo | 38302730 | | |
| 1 | 6003195700298 | (500) Days of Summer | 456210 | | |
| 2 | 6002964102317 | 1 LIVE | 1333240 | | |
| 3 | 6003305598969 | 1&1 Internet | 1372910 | | |
| 4 | 6003311804999 | 1-800-Flowers | 1816950 | | |
| ... | ... | ... | ... | ... | ... |
| 25391 | 6011703428298 | 엔제리너스커피(Angelinus Coffee) | 8 | | |
| 25392 | 6018395661890 | 오늘 뭐 먹지? | 4865090 | | |
| 25393 | 6011929220692 | 인천공항 incheon airport | 564660 | | |
| 25394 | 6015801032961 | 케이투 아웃도어 (K2 OUTDOOR) | 159660 | | |


### Extracts the text describing the topics 

In [4]:
the_topics = df['name'].tolist()

In [5]:
the_topics

['#Te Amo',
 '(500) Days of Summer',
 '1 LIVE',
 '1&1 Internet',
 '1-800-Flowers',
 '1. F.C. Colonia',
 '1. FSV Mainz 05',
 '10 000 a. C.',
 '10 000 metros',
 '10 Barrel Brewing Company',
 '10 de Downing Street',
 '10 Things I Hate About You (serie de televisión)',
 '10 Years',
 '100 latinos dijeron',
 '100 metros vallas',
 '100 metros',
 '100 mexicanos dijeron',
 '1000 km de Bathurst',
 '101 Dalmatians',
 '101 dálmatas',
 '101 YüzBir Okey Plus',
 '102.7 KIIS FM',
 '109',
 '10K run',
 '110 metros vallas',
 '112',
 '12 años de esclavitud',
 '1200 Micrograms',
 '12th Planet (musician)',
 '13 Going on 30',
 '1408',
 '1500 metros',
 '16 and Pregnant',
 '16 bits',
 '1664 France',
 '1800 Tequila',
 '19 Kids and Counting',
 '1970s in music',
 '1990s in fashion',
 '2 Broke Girls',
 '2 Chainz',
 '2 Guns',
 '2 Unlimited',
 '2(x)ist',
 '2-step garage',
 '2. Bundesliga',
 '20 Feet from Stardom',
 '200 metros',
 '2000 AD',
 '2001 space odyssey',
 '2009 Indian Premier League',
 '2012 Philadelphia Ea

In [6]:
the_topics[0]

'#Te Amo'

# EXERCISE 1

### The list "the_topics" contains short sentences in different languages. Starting from this list, complete the following tasks:

1.1) Create a list of lists A=[[n_letters_1, n_words_1, n_verbs_1],[n_letters_2, n_words_2, n_verbs_2], ...,
[n_letters_k, n_words_k, n_verbs_k] ], where:

-n_letters_i: Is the number of letters (excluding spaces and punctuation symbols) in sentence i

-n_words_i: Is the number of words in sentence i

-n_verbs_i: Is the number of verbs in sentence i. For sentences in languages other than English this value is not relevant; you may either set it to zero or leave an incorrect value.

1.2) Visualize as a figure the wordcloud (https://en.wikipedia.org/wiki/Tag_cloud) of the corpus comprising all the sentences in "the_topics".

1.3) A named entity is a “real-world object” that is assigned a name, for example a person, a country, a product, or a book. Create a new list with all sentences in "the_topics" that contain at least one geopolitical entity (e.g., countries, cities, states).

1.4) Compute the semantic similarity of each topic to the following sentence: "Jeu de Paume is an excellent art gallery in Paris". Rank the sentences by the computed semantic similarity values, so that the top-ranked sentence from the_topics is the one most similar to the given sentence. You may use any method or algorithm to compute this ranking. The output of this exercise could be the index of each topic in the ranking, or a new list "ranked_topics".




SUGGESTIONS: You may use the libraries spacy, nltk, sklearn, wordcloud, or any other Python library to solve these exercises.
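
As a dependency-free baseline for the ranking asked for in 1.4, the sketch below orders topics by cosine similarity over character trigram counts. This is an assumption-laden stand-in: it measures lexical overlap, not meaning, so a real solution would more likely compare sentence embeddings from one of the suggested libraries. The names `char_ngrams`, `cosine`, and `rank_topics` are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_topics(topics, query):
    """Return the topics sorted from most to least similar to the query."""
    query_vec = char_ngrams(query)
    return sorted(topics,
                  key=lambda topic: cosine(char_ngrams(topic), query_vec),
                  reverse=True)

query = "Jeu de Paume is an excellent art gallery in Paris"
ranked_topics = rank_topics(['2 Guns', 'Paris art gallery', '109'], query)
```

Swapping `char_ngrams` for an embedding function would leave the ranking logic unchanged, which keeps the similarity measure and the sorting step independent of each other.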


In [82]:
from __future__ import unicode_literals
import spacy,en_core_web_sm
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.lang.en.stop_words import STOP_WORDS
import textacy
import string
import itertools
from itertools import zip_longest
import numpy as np

In [8]:
def remove_punctuation(text):
    # Delete every ASCII punctuation character listed in string.punctuation
    return text.translate(str.maketrans('', '', string.punctuation))
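
Because `string.punctuation` lists only ASCII characters, marks such as `¿` or `«` in non-English topics pass through `remove_punctuation` unchanged. A possible Unicode-aware variant is sketched below. The name `strip_punctuation_unicode` is hypothetical, and dropping every character in the punctuation (`P*`) and symbol (`S*`) categories is an assumption about what should count as punctuation here.

```python
import unicodedata

def strip_punctuation_unicode(text):
    """Drop every character whose Unicode category is punctuation (P*) or symbol (S*)."""
    return ''.join(ch for ch in text
                   if unicodedata.category(ch)[0] not in ('P', 'S'))

print(strip_punctuation_unicode('1&1 Internet'))  # 11 Internet
print(strip_punctuation_unicode('¿Qué?'))         # Qué
```

Every character in `string.punctuation` falls into one of these two categories, so this variant removes at least everything the ASCII version does.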

In [9]:
def number_of_verb(text):
    # Count the spans of consecutive VERB tokens found by the English spaCy model.
    # The count is wrapped in a list so it can be concatenated column-wise later.
    pattern = [{'POS': 'VERB', 'OP': '?'},
               {'POS': 'VERB', 'OP': '+'}]
    doc = textacy.make_spacy_doc(text, lang='en_core_web_sm')
    matches = textacy.extract.matches.token_matches(doc, [pattern])
    verbs = [match.text for match in matches]
    return [len(verbs)]


In [10]:
def numb_of_words(text):
    # Words are whitespace-separated tokens
    return [len(text.split())]

In [11]:
def number_letters(text):
    # Count alphabetic characters only; digits, spaces and punctuation are skipped
    return [sum(1 for ch in text if ch.isalpha())]

In [12]:
number_verb = [number_of_verb(i) for i in the_topics]

In [13]:
number_verb

[[0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [1],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [1],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [1],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [1],
 [0],
 [1],
 [0],
 [1],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0],
 [0]

In [14]:
non_punctuation = [remove_punctuation(i) for i in the_topics]
non_punctuation

['Te Amo',
 '500 Days of Summer',
 '1 LIVE',
 '11 Internet',
 '1800Flowers',
 '1 FC Colonia',
 '1 FSV Mainz 05',
 '10 000 a C',
 '10 000 metros',
 '10 Barrel Brewing Company',
 '10 de Downing Street',
 '10 Things I Hate About You serie de televisión',
 '10 Years',
 '100 latinos dijeron',
 '100 metros vallas',
 '100 metros',
 '100 mexicanos dijeron',
 '1000 km de Bathurst',
 '101 Dalmatians',
 '101 dálmatas',
 '101 YüzBir Okey Plus',
 '1027 KIIS FM',
 '109',
 '10K run',
 '110 metros vallas',
 '112',
 '12 años de esclavitud',
 '1200 Micrograms',
 '12th Planet musician',
 '13 Going on 30',
 '1408',
 '1500 metros',
 '16 and Pregnant',
 '16 bits',
 '1664 France',
 '1800 Tequila',
 '19 Kids and Counting',
 '1970s in music',
 '1990s in fashion',
 '2 Broke Girls',
 '2 Chainz',
 '2 Guns',
 '2 Unlimited',
 '2xist',
 '2step garage',
 '2 Bundesliga',
 '20 Feet from Stardom',
 '200 metros',
 '2000 AD',
 '2001 space odyssey',
 '2009 Indian Premier League',
 '2012 Philadelphia Eagles season',
 '2013 

In [15]:
number_words = [numb_of_words(i) for i in the_topics]
number_words

[[2],
 [4],
 [2],
 [2],
 [1],
 [3],
 [4],
 [4],
 [3],
 [4],
 [4],
 [9],
 [2],
 [3],
 [3],
 [2],
 [3],
 [4],
 [2],
 [2],
 [4],
 [3],
 [1],
 [2],
 [3],
 [1],
 [4],
 [2],
 [3],
 [4],
 [1],
 [2],
 [3],
 [2],
 [2],
 [2],
 [4],
 [3],
 [3],
 [3],
 [2],
 [2],
 [2],
 [1],
 [2],
 [2],
 [4],
 [2],
 [2],
 [3],
 [4],
 [4],
 [4],
 [4],
 [5],
 [3],
 [2],
 [2],
 [4],
 [3],
 [3],
 [4],
 [5],
 [4],
 [3],
 [4],
 [1],
 [2],
 [2],
 [3],
 [2],
 [1],
 [2],
 [2],
 [2],
 [1],
 [1],
 [1],
 [3],
 [2],
 [2],
 [4],
 [2],
 [4],
 [1],
 [2],
 [6],
 [1],
 [2],
 [3],
 [2],
 [1],
 [5],
 [1],
 [6],
 [3],
 [2],
 [2],
 [1],
 [1],
 [1],
 [1],
 [3],
 [4],
 [4],
 [3],
 [2],
 [3],
 [4],
 [2],
 [1],
 [2],
 [2],
 [2],
 [2],
 [3],
 [2],
 [2],
 [4],
 [4],
 [4],
 [1],
 [3],
 [2],
 [2],
 [4],
 [3],
 [2],
 [2],
 [5],
 [2],
 [2],
 [3],
 [1],
 [2],
 [5],
 [5],
 [4],
 [2],
 [4],
 [1],
 [1],
 [1],
 [2],
 [1],
 [3],
 [3],
 [3],
 [3],
 [2],
 [4],
 [3],
 [4],
 [3],
 [5],
 [4],
 [4],
 [4],
 [3],
 [4],
 [6],
 [3],
 [5],
 [4],
 [2],
 [7],
 [3]

In [16]:
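# Note: this rebinds the name number_letters from the function to its results,
# so the function must be redefined before this cell can be run again.
number_letters = [number_letters(i) for i in the_topics]
number_letters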

[[5],
 [12],
 [4],
 [8],
 [7],
 [9],
 [8],
 [2],
 [6],
 [20],
 [15],
 [36],
 [5],
 [14],
 [12],
 [6],
 [16],
 [12],
 [10],
 [8],
 [14],
 [6],
 [0],
 [4],
 [12],
 [0],
 [16],
 [10],
 [16],
 [7],
 [0],
 [6],
 [11],
 [4],
 [6],
 [7],
 [15],
 [8],
 [10],
 [10],
 [6],
 [4],
 [9],
 [4],
 [10],
 [10],
 [15],
 [6],
 [2],
 [12],
 [19],
 [24],
 [24],
 [24],
 [29],
 [12],
 [9],
 [6],
 [18],
 [12],
 [10],
 [9],
 [13],
 [18],
 [11],
 [15],
 [4],
 [6],
 [7],
 [9],
 [4],
 [6],
 [10],
 [6],
 [3],
 [7],
 [2],
 [2],
 [9],
 [6],
 [7],
 [13],
 [4],
 [13],
 [0],
 [6],
 [23],
 [0],
 [7],
 [17],
 [8],
 [1],
 [16],
 [3],
 [16],
 [12],
 [5],
 [6],
 [2],
 [6],
 [2],
 [8],
 [14],
 [14],
 [15],
 [13],
 [4],
 [10],
 [20],
 [6],
 [1],
 [8],
 [5],
 [8],
 [4],
 [11],
 [8],
 [10],
 [9],
 [13],
 [16],
 [6],
 [13],
 [6],
 [8],
 [19],
 [8],
 [4],
 [4],
 [9],
 [6],
 [11],
 [7],
 [0],
 [7],
 [8],
 [7],
 [4],
 [2],
 [15],
 [7],
 [7],
 [3],
 [8],
 [2],
 [11],
 [14],
 [10],
 [9],
 [7],
 [18],
 [11],
 [14],
 [15],
 [20],
 [11]

In [91]:
def list_1(a, b, c):
    # Stack the three per-sentence columns side by side:
    # each row becomes [n_letters, n_words, n_verbs]
    a = np.array(a)
    b = np.array(b)
    c = np.array(c)
    return np.concatenate((a, b, c), axis=1).tolist()
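
The same column-wise combination can also be written without NumPy. The sketch below is only an illustration: `build_feature_rows` is a hypothetical name, and it assumes the three inputs are equal-length lists of one-element lists, as produced by the helper functions above.

```python
def build_feature_rows(letters, words, verbs):
    """Merge three lists of one-element lists into [n_letters, n_words, n_verbs] rows."""
    if not (len(letters) == len(words) == len(verbs)):
        raise ValueError('all three inputs must have the same length')
    # List concatenation joins the single values of each sentence into one row.
    return [l + w + v for l, w, v in zip(letters, words, verbs)]

rows = build_feature_rows([[5], [12]], [[2], [4]], [[0], [0]])
print(rows)  # [[5, 2, 0], [12, 4, 0]]
```

The explicit length check guards against `zip` silently truncating to the shortest input.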

In [92]:
list_1(number_letters,number_words,number_verb)

[[5, 2, 0],
 [12, 4, 0],
 [4, 2, 0],
 [8, 2, 0],
 [7, 1, 0],
 [9, 3, 0],
 [8, 4, 0],
 [2, 4, 0],
 [6, 3, 0],
 [20, 4, 0],
 [15, 4, 0],
 [36, 9, 1],
 [5, 2, 0],
 [14, 3, 0],
 [12, 3, 0],
 [6, 2, 0],
 [16, 3, 0],
 [12, 4, 0],
 [10, 2, 0],
 [8, 2, 0],
 [14, 4, 0],
 [6, 3, 0],
 [0, 1, 0],
 [4, 2, 0],
 [12, 3, 0],
 [0, 1, 0],
 [16, 4, 0],
 [10, 2, 0],
 [16, 3, 0],
 [7, 4, 1],
 [0, 1, 0],
 [6, 2, 0],
 [11, 3, 0],
 [4, 2, 0],
 [6, 2, 0],
 [7, 2, 0],
 [15, 4, 0],
 [8, 3, 0],
 [10, 3, 0],
 [10, 3, 0],
 [6, 2, 0],
 [4, 2, 0],
 [9, 2, 0],
 [4, 1, 0],
 [10, 2, 0],
 [10, 2, 0],
 [15, 4, 0],
 [6, 2, 0],
 [2, 2, 0],
 [12, 3, 0],
 [19, 4, 0],
 [24, 4, 0],
 [24, 4, 0],
 [24, 4, 0],
 [29, 5, 0],
 [12, 3, 0],
 [9, 2, 0],
 [6, 2, 0],
 [18, 4, 0],
 [12, 3, 0],
 [10, 3, 0],
 [9, 4, 0],
 [13, 5, 0],
 [18, 4, 0],
 [11, 3, 0],
 [15, 4, 0],
 [4, 1, 0],
 [6, 2, 0],
 [7, 2, 0],
 [9, 3, 0],
 [4, 2, 0],
 [6, 1, 0],
 [10, 2, 0],
 [6, 2, 0],
 [3, 2, 0],
 [7, 1, 0],
 [2, 1, 0],
 [2, 1, 0],
 [9, 3, 0],
 [6, 2, 0],
 [7,