<div style="text-align: right">INFO 6105 Data Science Eng Methods and Tools, Lecture 2 Day 1</div>
<div style="text-align: right">Dino Konstantopoulos, 18 January 2024</div>

# Dictionary Problem Homework
Designed to help you learn container types, list comprehensions, and the functional data structure called a *dictionary*, which replaces objects in OO programming!

We are going to use Python dictionaries to help us learn Chinese and Hindi.

Every time we find an interesting English sentence to translate, we use [Google Translate](https://translate.google.com/) to translate it to Hindi and Chinese, and we store the translations in a dictionary, keyed by the time we enter the data and a random guid.

In [2]:
from bson.objectid import ObjectId
from datetime import datetime
import random

In [3]:
random.seed(3)


def prefix_crud_timestamp_suffix(key):
    # first 3 characters of the key
    prefix = key[:3]
    # fourth character of the key (index 3)
    crud = key[3:4]
    # one way to list every hyphen position in key[:4]:
    #hyphens = [i for i in range(len(key[:4])) if key[:4].startswith('-', i)]
    # index of the first hyphen
    hyphen1 = key.find('-')
    # offset of the next hyphen within key[5:] (assumes the first hyphen is at index 4)
    hyphen2 = key[5:].find('-')

    # timestamp: the characters between the first and second hyphens
    timestamp = key[hyphen1+1:hyphen1+1+hyphen2]
    # suffix: everything after the second hyphen
    suffix = key[hyphen1+hyphen2+2:]
    return prefix, crud, timestamp, suffix #coll, op, time, guid

## seconds since midnight, simulate non-contiguous times
def ssm():
    # get the current time
    now = datetime.now()
    # midnight today: same date with the time set to 00:00:00
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return str((now - midnight).seconds + random.randint(0, 1000))

words = dict()
def enter_words(en, zh = None, hi = None):
    # key = language prefix + simulated time + guid; `is not None` is the idiomatic None test
    uid = ('zhon-' if zh is not None else 'hind-' if hi is not None else 'oops-') + ssm() + '-' + str(ObjectId())
    words[uid] = (
        dict(english = en, chinese = zh, _id = uid) if zh is not None else
        dict(english = en, hindi = hi, _id = uid) if hi is not None else
        dict(_id = uid)
    )

In [4]:
ssm()

'67049'

Here's the structure of our key for an example translation: the first part is the language, the second part is the time (as an integer counter), and the third part is a guid (random string).

In [5]:
en = """If a person has not had a chance to acquire his target language by the time he's an adult, 
he's unlikely to be able to reach native speaker level in that language"""
zh = '如果一個人在成人前沒有機會習得目標語言，他對該語言的認識達到母語者程度的機會是相當小的'
('zhon-' if zh != None else 'hind-' if hi != None else 'oops-') + ssm() + '-' + str(ObjectId())

'zhon-67413-65b04cc7a16a6442888ce6ef'
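
As a minimal sketch of reading such a key back, assuming the three parts are always joined by hyphens as in the example above, `str.split` with a split limit recovers them. The helper name `parse_key` is hypothetical and is not part of the homework code:

```python
def parse_key(key):
    """Split a 'language-time-guid' key into its parts (hypothetical helper)."""
    # maxsplit=2 keeps the guid in one piece even if it ever contained a hyphen
    language, seconds, guid = key.split('-', 2)
    # the time is stored as text in the key, so convert it back to an integer
    return language, int(seconds), guid

parsed = parse_key('zhon-67413-65b04cc7a16a6442888ce6ef')
print(parsed)
```

Converting the time to `int` matters later: comparing times as strings would compare them character by character rather than numerically.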

We are going to *simulate* the data entering process. I'll give you two files with translations of English sentences, one for Chinese, another for Hindi (from my NLP class):

In [6]:
# read cmn.txt (UTF-8); the with-block closes the file for us
with open('data/cmn.txt', 'r', encoding='utf8') as file1:
    lines = file1.readlines()

for l in lines:
    # fields are tab-separated
    t2 = l.split('\t')
    # t2[0] is the English sentence, t2[1] is the Chinese translation;
    # [:-1] drops the last character of each field
    enter_words(t2[0][:-1], zh = t2[1][:-1])

In [7]:
# read hin.txt (UTF-8); the with-block closes the file for us
with open('data/hin.txt', 'r', encoding='utf8') as file1:
    lines = file1.readlines()

for l in lines:
    # fields are tab-separated
    t2 = l.split('\t')
    # t2[0] is the English sentence, t2[1] is the Hindi translation;
    # [:-1] drops the last character of each field
    enter_words(t2[0][:-1], hi = t2[1][:-1])

Dictionaries are *not sorted by key*! Since Python 3.7 they preserve insertion order, so if you enumerate dictionary items they appear in the order we entered them, which is *not* the order of our simulated times:

In [8]:
# u is the uid (the key); v is words[u], the value stored under that key
print([(u,v) for i,(u,v) in enumerate(words.items()) if i < 10 ])

[('zhon-67364-65b04cc7a16a6442888ce6f0', {'english': 'Hi', 'chinese': '嗨', '_id': 'zhon-67364-65b04cc7a16a6442888ce6f0'}), ('zhon-66940-65b04cc7a16a6442888ce6f1', {'english': 'Hi', 'chinese': '你好', '_id': 'zhon-66940-65b04cc7a16a6442888ce6f1'}), ('zhon-67185-65b04cc7a16a6442888ce6f2', {'english': 'Run', 'chinese': '你用跑的', '_id': 'zhon-67185-65b04cc7a16a6442888ce6f2'}), ('zhon-67744-65b04cc7a16a6442888ce6f3', {'english': 'Wait', 'chinese': '等等', '_id': 'zhon-67744-65b04cc7a16a6442888ce6f3'}), ('zhon-67425-65b04cc7a16a6442888ce6f4', {'english': 'Wait', 'chinese': '等一下', '_id': 'zhon-67425-65b04cc7a16a6442888ce6f4'}), ('zhon-67292-65b04cc7a16a6442888ce6f5', {'english': 'Hello', 'chinese': '你好', '_id': 'zhon-67292-65b04cc7a16a6442888ce6f5'}), ('zhon-67447-65b04cc7a16a6442888ce6f6', {'english': 'Dino', 'chinese': '迪诺', '_id': 'zhon-67447-65b04cc7a16a6442888ce6f6'}), ('zhon-67401-65b04cc7a16a6442888ce6f7', {'english': 'I try', 'chinese': '让我来', '_id': 'zhon-67401-65b04cc7a16a6442888ce6f7'}),
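
A small sketch of what this means in practice. In Python 3.7 and later a dictionary iterates in the order its keys were inserted, which here is not the order of the simulated times embedded in the keys. The toy keys below reuse times from the output above with made-up one-letter guids:

```python
# toy dictionary: keys inserted in "entry" order, times deliberately out of order
toy = {}
toy['zhon-67364-a'] = 'Hi'
toy['zhon-66940-b'] = 'Hello'
toy['zhon-67185-c'] = 'Run'

# iterating a dict yields keys in insertion order (guaranteed since Python 3.7)
insertion_order = list(toy)

# ordering by the simulated time needs an explicit sort on the middle key field
time_order = sorted(toy, key=lambda k: int(k.split('-')[1]))

print(insertion_order)
print(time_order)
```

So any answer that needs sentences in time order has to sort on the time field explicitly; the dictionary will not do it on its own.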

So now I have *one* dictionary for *both* Chinese and Hindi!

Let's separate them into two dictionaries:

In [9]:
# new dictionary named separated
separated = dict()
# separated has two keys, 'chinese' and 'hindi';
# the comprehensions below file each entry under one of them by key prefix
separated['chinese'] = {k:v for k,v in words.items() if k.startswith('zhon')}
separated['hindi'] = {k:v for k,v in words.items() if k.startswith('hind')}

In [10]:
print([(u,v) for i,(u,v) in enumerate(separated['hindi'].items()) if i < 10 ])

[('hind-67414-65b04cc8a16a6442888d3d2c', {'english': 'Wow', 'hindi': 'वाह', '_id': 'hind-67414-65b04cc8a16a6442888d3d2c'}), ('hind-67643-65b04cc8a16a6442888d3d2d', {'english': 'Help', 'hindi': 'बचाओ', '_id': 'hind-67643-65b04cc8a16a6442888d3d2d'}), ('hind-67418-65b04cc8a16a6442888d3d2e', {'english': 'Jump', 'hindi': 'उछलो', '_id': 'hind-67418-65b04cc8a16a6442888d3d2e'}), ('hind-67006-65b04cc8a16a6442888d3d2f', {'english': 'Jump', 'hindi': 'कूदो', '_id': 'hind-67006-65b04cc8a16a6442888d3d2f'}), ('hind-66955-65b04cc8a16a6442888d3d30', {'english': 'Jump', 'hindi': 'छलांग', '_id': 'hind-66955-65b04cc8a16a6442888d3d30'}), ('hind-67315-65b04cc8a16a6442888d3d31', {'english': 'Hello', 'hindi': 'नमस्ते', '_id': 'hind-67315-65b04cc8a16a6442888d3d31'}), ('hind-67420-65b04cc8a16a6442888d3d32', {'english': 'Hello', 'hindi': 'नमस्कार', '_id': 'hind-67420-65b04cc8a16a6442888d3d32'}), ('hind-67241-65b04cc8a16a6442888d3d33', {'english': 'Cheers', 'hindi': 'वाह-वाह', '_id': 'hind-67241-65b04cc8a16a64428
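
The two comprehensions above each scan `words` once. As an alternative sketch, the same split can be done in a single pass with `collections.defaultdict`. The names `PREFIX_TO_LANGUAGE` and `separate` are hypothetical, and the toy records stand in for the real data:

```python
from collections import defaultdict

# hypothetical lookup from key prefix to the language name used as a dictionary key
PREFIX_TO_LANGUAGE = {'zhon': 'chinese', 'hind': 'hindi'}

def separate(words):
    """Group records by language in one pass over the dictionary."""
    grouped = defaultdict(dict)
    for key, record in words.items():
        # the language prefix is everything before the first hyphen
        language = PREFIX_TO_LANGUAGE.get(key.split('-', 1)[0])
        if language is not None:  # skip unknown prefixes such as 'oops'
            grouped[language][key] = record
    return dict(grouped)

toy_words = {
    'zhon-1-a': {'english': 'Hi'},
    'hind-2-b': {'english': 'Wow'},
    'oops-3-c': {},
}
toy_separated = separate(toy_words)
print(toy_separated)
```

For two languages the difference is small; the single-pass form mainly pays off if more language prefixes are added later.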

The key has the format `language-time-randomguid` (we simulated `time` by adding a random number to number of seconds since midnight). Suppose I want to be able to practice my sentences every day in the (simulated) order that we saved them, and that every day, I want to be able to *see a specific number of sentences with a time greater than a specific time* (entered as an integer: number of seconds from midnight).

- **Question 1 (50 points)**: Given how many sentences I want to see (variable `n`) and a time of day specified as a number of seconds past midnight (variable `ssm`), write code that yields *the next `n` sentences of both the Chinese and Hindi dictionaries whose time is past `ssm`*. Structure the result as a **dictionary** with two keys: `chinese` and `hindi`.

- **Question 2 (50 points)**: Rewrite your code in Dua Lipa style: in the smallest number of lines of Python (e.g. a few!). Line continuations are allowed. For example:
```
{
    'chinese' : {k:v for k,v in .....},
    'hindi' : {k:v for k,v in .....},
}
```
counts for one line of code.

Time your Dua Lipa code with `%%time`. Shorter times combined with most beautiful Dua Lipa code get best grades :-)

In [11]:
# Answer 1

In [44]:
# Given how many sentences I want to see (variable n) and a
# time of day specified as seconds past midnight (variable ssm)
def generator_dic(dic, n, ssm):
    # first n entries per language (in insertion order) whose key time is greater than ssm
    zh_sentences = [(k, v) for k, v in dic['chinese'].items() if int(k.split('-')[1]) > ssm][:n]
    hi_sentences = [(k, v) for k, v in dic['hindi'].items() if int(k.split('-')[1]) > ssm][:n]
    # pair the two lists up; zip stops at the shorter one
    yield from zip(zh_sentences, hi_sentences)
# li stores the pairs produced by generator_dic
li = list(generator_dic(separated, 5, 67000))
# result: unzip the pairs into one list per language
res = {"chinese": [i[0] for i in li], "hindi": [i[1] for i in li]}
res

{'chinese': [('zhon-67364-65b04cc7a16a6442888ce6f0',
   {'english': 'Hi',
    'chinese': '嗨',
    '_id': 'zhon-67364-65b04cc7a16a6442888ce6f0'}),
  ('zhon-67185-65b04cc7a16a6442888ce6f2',
   {'english': 'Run',
    'chinese': '你用跑的',
    '_id': 'zhon-67185-65b04cc7a16a6442888ce6f2'}),
  ('zhon-67744-65b04cc7a16a6442888ce6f3',
   {'english': 'Wait',
    'chinese': '等等',
    '_id': 'zhon-67744-65b04cc7a16a6442888ce6f3'}),
  ('zhon-67425-65b04cc7a16a6442888ce6f4',
   {'english': 'Wait',
    'chinese': '等一下',
    '_id': 'zhon-67425-65b04cc7a16a6442888ce6f4'}),
  ('zhon-67292-65b04cc7a16a6442888ce6f5',
   {'english': 'Hello',
    'chinese': '你好',
    '_id': 'zhon-67292-65b04cc7a16a6442888ce6f5'})],
 'hindi': [('hind-67414-65b04cc8a16a6442888d3d2c',
   {'english': 'Wow',
    'hindi': 'वाह',
    '_id': 'hind-67414-65b04cc8a16a6442888d3d2c'}),
  ('hind-67643-65b04cc8a16a6442888d3d2d',
   {'english': 'Help',
    'hindi': 'बचाओ',
    '_id': 'hind-67643-65b04cc8a16a6442888d3d2d'}),
  ('hind-67418-
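
The answer above returns the first `n` matches in insertion order. If "the next `n` sentences" is read as the `n` earliest simulated times past `ssm`, the matches need to be sorted by time before slicing. A minimal sketch of that reading, with hypothetical names `key_time` and `next_sentences` and toy data in place of the real dictionaries:

```python
def key_time(key):
    """Integer time field of a 'language-time-guid' key (hypothetical helper)."""
    return int(key.split('-')[1])

def next_sentences(separated, n, ssm):
    """For each language, the n entries with the smallest times greater than ssm."""
    return {
        lang: dict(sorted(
            ((k, v) for k, v in entries.items() if key_time(k) > ssm),
            key=lambda kv: key_time(kv[0]),  # order by simulated time
        )[:n])
        for lang, entries in separated.items()
    }

# toy data: times are out of order on purpose
toy = {
    'chinese': {
        'zhon-300-a': {'english': 'Wait'},
        'zhon-100-b': {'english': 'Hi'},
        'zhon-200-c': {'english': 'Run'},
    },
    'hindi': {
        'hind-250-d': {'english': 'Jump'},
        'hind-50-e': {'english': 'Wow'},
    },
}
result = next_sentences(toy, 2, 90)
print(result)
```

Unlike the `zip`-based version, each language is handled independently here, so one language having fewer matches does not shorten the other's list.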

In [54]:
import time
# start time
start_time = time.time()
# rewritten code in Dua Lipa style
res_1 = {lang: [sentence[i] for sentence in (lambda dic, n, ssm: zip(
    [(k, v) for k, v in dic['chinese'].items() if int(k.split('-')[1]) > ssm][:n],
    [(k, v) for k, v in dic['hindi'].items() if int(k.split('-')[1]) > ssm][:n]
))(separated, 5, 67800)] for i, lang in enumerate(["chinese", "hindi"])}
# end time
end_time = time.time()
# get execution time
execution_time = end_time - start_time
res_1

{'chinese': [('zhon-67801-65b04cc7a16a6442888ce799',
   {'english': "It's cold",
    'chinese': '天很冷',
    '_id': 'zhon-67801-65b04cc7a16a6442888ce799'}),
  ('zhon-67803-65b04cc7a16a6442888ce7b6',
   {'english': 'Tom swims',
    'chinese': 'Tom游泳',
    '_id': 'zhon-67803-65b04cc7a16a6442888ce7b6'}),
  ('zhon-67801-65b04cc7a16a6442888ce800',
   {'english': 'Let him in',
    'chinese': '让他进来',
    '_id': 'zhon-67801-65b04cc7a16a6442888ce800'}),
  ('zhon-67801-65b04cc7a16a6442888ce874',
   {'english': 'Money talks',
    'chinese': '金钱万能',
    '_id': 'zhon-67801-65b04cc7a16a6442888ce874'}),
  ('zhon-67802-65b04cc7a16a6442888ce89c',
   {'english': "What's this",
    'chinese': '那是什么',
    '_id': 'zhon-67802-65b04cc7a16a6442888ce89c'})],
 'hindi': [('hind-67807-65b04cc8a16a6442888d3dd3',
   {'english': 'Summer is over',
    'hindi': 'गर्मियाँ खतम हों चुकीं हैं',
    '_id': 'hind-67807-65b04cc8a16a6442888d3dd3'}),
  ('hind-67807-65b04cc8a16a6442888d3dde',
   {'english': 'Where were you',
    

In [55]:
print("The execution time is: ", execution_time)

The execution time is:  0.022030115127563477
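
A single `time.time()` difference is a noisy measurement for code this fast. As a hedged alternative outside the notebook's `%%time` magic, the standard library offers `time.perf_counter` for one-off intervals and `timeit.repeat` for repeated runs. The function `work` below is a hypothetical stand-in for the code being timed:

```python
import time
import timeit

def work():
    """Stand-in workload: filter and slice a small dictionary."""
    data = {f'zhon-{i}-x': i for i in range(1000)}
    return [(k, v) for k, v in data.items() if int(k.split('-')[1]) > 500][:5]

# one-off measurement with a high-resolution, monotonic clock
start = time.perf_counter()
work()
elapsed = time.perf_counter() - start

# repeated measurement: best of 3 rounds of 100 calls each, reported per call
best_per_call = min(timeit.repeat(work, number=100, repeat=3)) / 100

print(f"single run: {elapsed:.6f} s, best per call: {best_per_call:.6f} s")
```

Taking the minimum over several rounds reduces the influence of other processes on the measurement.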
