# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.

Do make sure all results are uploaded to CSVs (as well as printed to console) for your assignment to be fully graded.

IMPORTS:

In [1]:
import os
import random
import pandas as pd
import numpy as np
from collections import defaultdict
import math

In [2]:
ids = ['319122610', '206446221']
STUDENT_IDS = ''
for id in ids:
  STUDENT_IDS += "{student_id_" + str(id) +  "}_"

*As a preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [3]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 71, done.[K
remote: Counting objects: 100% (71/71), done.[K
remote: Compressing objects: 100% (57/57), done.[K
remote: Total 71 (delta 29), reused 40 (delta 11), pack-reused 0[K
Unpacking objects: 100% (71/71), 11.28 MiB | 3.68 MiB/s, done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [4]:

!ls nlp-course/lm-languages-data-new


en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.

In [5]:

def preprocess():
  tokens = set()

  data_dir='nlp-course/lm-languages-data-new'

  for filename in os.listdir(data_dir):
      with open(os.path.join(data_dir, filename), 'r', encoding='utf-8') as f:
          for line in f:
              for char in line.strip():
                  tokens.add(char)
  return sorted(list(tokens)) #todo check if to return it sorted

#Test
vocabulary = preprocess()
print(vocabulary[:5])

[' ', '!', '"', '#', '$']


**Part 2**

Write a function lm that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.

Note - You should think how to add the add_one smoothing information to the dictionary and implement it.

In [6]:
def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)
  tweets =  pd.read_csv(data_file_path).get('tweet_text')
  vocab_length = len(vocabulary)
  model = defaultdict(lambda: defaultdict(float))

  for tweet in tweets:
    # We are operating on strings, therefore for ease we use unicode characters under the assumption they are not in the files,
    # this is a naive implementation
    # tweet =  "⇏" + tweet + "⇍"
    
    for start in range(len(tweet) - n):
      n_gram = tweet[start:start + n]
      prefix, suffix = n_gram[:-1], n_gram[-1]
      model[prefix][suffix] += 1
  
  def do_add(suffix_frequency, suffix_sum, vocab_length):
    return {key:((val + 1) / (suffix_sum + vocab_length)) for key, val in suffix_frequency.items()}

  def no_add(suffix_frequency, suffix_sum, vocab_length=0):
    return {key:(val/suffix_sum) for key,val in suffix_frequency.items()}

  norm_func = do_add if add_one else no_add

  for key, suffix_frequency in model.items():
    suffix_sum = sum(suffix_frequency.values())
    norm_dict = norm_func(suffix_frequency, suffix_sum, vocab_length)
    model[key] = dict(norm_dict)

  return dict(model)

  # Test cell without start and end tokens for tweets

print(lm(2, vocabulary,'nlp-course/lm-languages-data-new/en.csv', False))
print(lm(1, vocabulary,'nlp-course/lm-languages-data-new/en.csv', False))
print(lm(3, vocabulary,'nlp-course/lm-languages-data-new/en.csv', False))

{'R': {'T': 0.5891769203201529, 'P': 0.0021502807310954486, 'l': 0.0015529805280133796, 't': 0.003105961056026759, 'o': 0.0285509497073229, 'I': 0.01409628479273683, 'Y': 0.007167602436984829, 'y': 0.019352526579859038, 'E': 0.03930235336280014, 'e': 0.08230796798470912, '4': 0.0015529805280133796, 'L': 0.007764902640066898, '3': 0.0009556803249313105, 'V': 0.003703261259108828, 'R': 0.005375701827738622, ' ': 0.026997969179309522, 'N': 0.0034643411778760005, 'a': 0.022458487635885795, 'D': 0.005614621908971449, '8': 0.001314060446780552, '1': 0.0014335204873969656, 'A': 0.010512483574244415, '0': 0.0010751403655477243, 'd': 0.0009556803249313105, 'O': 0.01003464341177876, ':': 0.002747580934177518, '2': 0.0017919006092462072, 'U': 0.004181101421574483, 'c': 0.0015529805280133796, 'G': 0.003942181340341655, 'W': 0.0022697407717118625, 'f': 0.001911360649862621, '-': 0.00023892008123282762, 'S': 0.008003822721299726, 's': 0.0020308206904790346, 'g': 0.0017919006092462072, 'i': 0.0151714

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.

In [7]:
def eval(n, model, data_file):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to claculate a perplexity score for
  df = pd.read_csv(data_file)
  entropies_list = []
  missing_value = 1e-8
  probabilities = []

  for tweet in df['tweet_text'].values:

    for start in range(len(tweet) - n):
      substring = tweet[start: start + n]
      key, value = substring[:-1], substring[-1]

      if key in model:
        probabilities.append(model[key].get(value, missing_value))
      else:
        probabilities.append(missing_value)
      
  entropies_list.append(-math.log2(np.mean(probabilities)))

  return math.pow(2, np.average(entropies_list))


# Test cell without start and end tokens for tweets

model = lm(2, vocabulary, 'nlp-course/lm-languages-data-new/en.csv', False)
print(eval(2, model, 'nlp-course/lm-languages-data-new/en.csv'))
print(eval(2, model, 'nlp-course/lm-languages-data-new/it.csv'))

8.450006839661475
10.3711132093902


**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., en model applied on the data files en ,es, fr, in, it, nl, pt, tl; es model applied on the data files en, es...). This function should return a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages. Then, the values are the relevant perplexity values.

Save the dataframe to a CSV with the name format: {student_id_1}\_...\_{student_id_n}\_part4.csv

In [8]:
def match(n, add_one):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not
  results = {}
  languages = ["en","es","in","it","pt","fr","nl","tl"]
  for lang in languages:
    model = lm(n, vocabulary, f'nlp-course/lm-languages-data-new/{lang}.csv', add_one)
    perplexities = []

    for lang2 in languages:
      perplexities.append(eval(n, model, f'nlp-course/lm-languages-data-new/{lang2}.csv'))

    results[lang] = perplexities

  return pd.DataFrame(results, index=languages)


match_result = match(3, True)
match_result.to_csv(STUDENT_IDS + "_part4.csv")


**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

Load each result to a dataframe and save to a CSV with the name format: 

for cases with add_one: {student_id_1}\_...\_{student_id_n}\_n1\_part5.csv

For cases without add_one:
{student_id_1}\_...\_{student_id_n}\_n1\_wo\_addone\_part5.csv

Follow the same format for n2,n3, and n4


In [9]:
def run_match():
    n_values = [1, 2, 3, 4]
    data_file_path='nlp-course/lm-languages-data-new/en.csv'

    for n in n_values:
        # match with add_one smoothing
        df_add_one = match(n, True)
        file_name_add_one = STUDENT_IDS + "n1_part5.csv"
        #file_name_add_one = f"{student_id}_n{n}_part5.csv"
        df_add_one.to_csv(file_name_add_one)

        # match without add_one smoothing
        df_wo_add_one = match(n, False)
        file_name_wo_add_one = STUDENT_IDS + "wo_addone_part5.csv"
        #file_name_wo_add_one = f"{student_id}_n{n}_wo_addone_part5.csv"
        df_wo_add_one.to_csv(file_name_wo_add_one)
        
        # print the resulting tables
        print(f"n={n}, add_one=True:\n{df_add_one}\n")
        print(f"n={n}, add_one=False:\n{df_wo_add_one}\n")

run_match()

n=1, add_one=True:
           en         es         in         it         pt         fr  \
en  21.168951  20.816503  22.315085  21.649929  20.832839  21.061884   
es  20.813932  19.527945  21.157337  20.577570  19.583932  20.165776   
in  22.307913  21.153142  20.155104  21.934924  21.038162  22.111379   
it  21.645149  20.575567  21.937134  21.031269  20.603323  21.197780   
pt  20.818007  19.572400  21.029951  20.593202  19.399996  20.267364   
fr  21.063403  20.169724  22.120082  21.203992  20.283272  20.221535   
nl  21.614933  20.966975  22.649610  21.880882  21.182232  21.076931   
tl  22.966920  21.822158  21.121299  22.566285  21.700770  23.071528   

           nl         tl  
en  21.616138  22.979535  
es  20.965553  21.831454  
in  22.643593  21.126115  
it  21.877268  22.573699  
pt  21.168331  21.697243  
fr  21.079623  23.085862  
nl  20.979183  23.635054  
tl  23.623399  21.604524  

n=1, add_one=False:
           en         es         in         it         pt         fr

**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where a different solution will yield different accuracy scores. any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.

In [10]:
def classify():
   dir_path='nlp-course/lm-languages-data-new/'
   data_file_path = dir_path + 'test.csv'
   
   languages = ["en","es","in","it","pt","fr","nl","tl"]
   tweets = pd.read_csv(data_file_path, encoding='utf-8').get('tweet_text')
   ids = pd.read_csv(data_file_path, encoding='utf-8')['tweet_id']

   models = {}
   res = []
   n = 4
   add_one = True

   for language_model in languages:
        models[language_model] = lm(n, preprocess(), dir_path +language_model+".csv", add_one)

   for tweet in tweets:
     correct_lang = languages[0]
     min_perplexity = float('inf')

     for lang, model in models.items():
        perplexity = evalTweet(tweet, n, model)

        if perplexity < min_perplexity:
          correct_lang = lang
          min_perplexity = perplexity

     res.append(correct_lang)

   return pd.DataFrame(res, index=ids, columns=['prediction']).reset_index()

def evalTweet(tweet, n, model):
  missing_value = 1e-8
  entropy = 0

  start, end = range(len(tweet) - n + 1), range(n-1, len(tweet))
  n_gram = list(zip(start, end))

  for start, end in n_gram:
    text = ''

    if start != end:
        text = tweet[start:start + n -1]

    suffix = tweet[end]

    if text in model.keys():
      if suffix in model[text].keys():
        entropy += -math.log2(model[text][suffix])
      else:
        entropy += -math.log2(missing_value)
    else:
      entropy += -math.log2(missing_value)

  entropy /= len(n_gram)

  return math.pow(2, entropy)

classification_result = classify()

**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 

Load the results to a CSV (using a DataFrame), where the row indicates the F1 results, and the columns indicate the model used. Name it {student_id_1}\_...\_{student_id_n}\_part7.csv

In [11]:
def calc_f1(result):
  df = pd.read_csv('nlp-course/lm-languages-data-new/test.csv')
  df = df.join(result.set_index('tweet_id'), on='tweet_id')
  languages = ["en","es","in","it","pt","fr","nl","tl"]

  scores = {}

  for lang in languages:
    tp = len(df[(df.label == lang) & (df.prediction == lang)])
    fp = len(df[(df.label != lang) & (df.prediction == lang)])
    fn = len(df[(df.label == lang) & (df.prediction != lang)])

    scores[lang] = 2 * tp / (2 * tp + fp + fn)

  return pd.DataFrame(scores, index=['F1 score'])

f1_scores = calc_f1(classification_result)
f1_scores.to_csv(STUDENT_IDS + "part7.csv")

<br><br><br><br>
**Part 8**  
Let's use your Language model (dictionary) for generation (NLG).

When it comes to sampling from a language model decoder during text generation, there are several different methods that can be used to control the randomness and diversity of the generated text. 

Some of the most commonly used methods include:

> `Greedy sampling`
In this method, the model simply selects the word with the highest probability as the next word at each time step. This method can produce fluent text, but it can also lead to repetitive or predictable output.

> `Temperature scaling`  
Temperature scaling involves scaling the logits output of the language model by a temperature parameter before softmax normalization. This has the effect of smoothing the distribution of probabilities and increasing the probability of lower-probability words, which can lead to more diverse and creative output.

> `Top-K sampling`  
In this method, the model restricts the sampling to the top-K most likely words at each time step, where K is a predefined hyperparameter. This can generate more diverse output than greedy sampling, while limiting the number of low-probability words that are sampled.

> `Nucleus sampling` (also known as top-p sampling)  
This method restricts the sampling to the smallest possible set of words whose cumulative probability exceeds a certain threshold, defined by a hyperparameter p. Like top-K sampling, this can generate more diverse output than greedy sampling, while avoiding sampling extremely low probability words.

> `Beam search`  
Beam search involves maintaining a fixed number k of candidate output sequences at each time step, and then selecting the k most likely sequences based on their probabilities. This can improve the fluency and coherence of the output, but may not produce as much diversity as sampling methods.

The choice of sampling method depends on the specific application and desired balance between fluency, diversity, and randomness. Hyperparameters such as temperature, K, p, and beam size can also be tuned to adjust the behavior of the language model during sampling.


You may read more about this concept in <a href='https://huggingface.co/blog/how-to-generate#:~:text=pad_token_id%3Dtokenizer.eos_token_id)-,Greedy%20Search,-Greedy%20search%20simply'>this</a> blog post.


**Please added the needed code for each sampeling method:**

In [12]:
def sort_dict(dictionary):
  return dict(sorted(dictionary.items(), key=lambda x: x[1], reverse=True))


def sample_greedy(probabilities, k=1):
  sorted_dict = sort_dict(probabilities)
  return random.choices(list(sorted_dict.keys()), weights=sorted_dict.values(), k=1)[0]


def sample_temperature(probabilities, temperature=1.0, k=1):
  sorted_dict = sort_dict(probabilities)
  exp_probs = [math.exp(p/temperature) for p in sorted_dict.values()]
  norm_probs = [p/sum(exp_probs) for p in exp_probs]
  return random.choices(list(sorted_dict.keys()), weights=norm_probs, k=1)[0]


def sample_topK(probabilities, k=1):
  sorted_dict = sort_dict(probabilities)
  topK_dict = {}

  for key, value in probabilities.items():
    if len(topK_dict) == k:
      break
    topK_dict[key] = value

  return sample_greedy(topK_dict)


def sample_topP(probabilities, p=0.9):
  sorted_dict = sort_dict(probabilities)
  topP_dict = {}
  mass = 0

  for key, value in probabilities.items():
    if mass >= p:
      break
    topP_dict[key] = value
    mass += value

  return sample_greedy(topP_dict)

def sample_beam(probabilities, k=3):
  sorted_dict = sort_dict(probabilities)
  return random.choices(list(sorted_dict.keys()), weights=sorted_dict.values(), k=k)[0]

def beam_decoder(model, data, n, stop_token, k=3):
  beams = []
  for sequence in data:
    if sequence.endswith(stop_token) or model.get(sequence[-(n - 1):], None) is None:
        beams.append(sequence)
    else: 
        beams.extend(create_beams(model, sequence, n, k))

  return get_best_k_beams(beams, n, model, k)

def create_beams(model, sequence, n, k=3):
  prefix = sequence[-(n - 1):]
  probabilities = model.get(prefix, None)

  if probabilities is None:
    return [prefix]

  suffixes = sample_beam(probabilities, k)

  return [prefix + suffix for suffix in suffixes]

def get_best_k_beams(beams, n, model, k):
  perplexities = list(map(lambda text: evalTweet(text, n, model), beams))
  best_indices = np.array(perplexities).argsort()[:k]

  return [beams[index] for index in best_indices]


Use your Language Model to generate each one out of the following examples with the coresponding params.    
Notice the 4 core issues: 
- Starting tokens
- Length of the generation
- Sampling methond (use all)
- Stop Token (if this token is sampled, stop generating)

In [13]:
test_ = {
    'example1' : {
        'start_tokens' : "H",
        'sampling_method' : ['greedy','beam'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example2' : {
        'start_tokens' : "H",
        'sampling_method' : ['temperature','topK','topP'],
        'gen_length' : "10",
        'stop_token' : "\n",
        'generation' : []
    },
    'example3' : {
        'start_tokens' : "He",
        'sampling_method' : ['greedy','beam','temperature','topK','topP'],
        'gen_length' : "20",
        'stop_token' : "me",
        'generation' : []
    }
}

Use your LM to generate a string based on the parametes of each examples, and store the generation sequance at the generation list.

In [14]:
### your code here ###
def get_sampling_func(sampling_method):
  if sampling_method == 'greedy':
    return sample_greedy
  elif sampling_method == 'temperature':
    return sample_temperature
  elif sampling_method == 'topK':
    return sample_topK
  else: #sampling_method == 'topP'
    return sample_topP
  
def generate_text(model, n, examples):
  for example, options in examples.items():
    for sampling_method in options.get('sampling_method', []):
      sample_func = get_sampling_func(sampling_method)
      generated_string = options.get('start_tokens', '')

      for i in range(int(options.get('gen_length', '0'))):
        if sampling_method == 'beam':
          if i == 0:
            generated_string = [generated_string]
          generated_string = beam_decoder(model, generated_string, n, options['stop_token'])

        else:
          prefix = generated_string[-(n - 1):]
          probabilities = model.get(prefix, None)

          if probabilities is None:
            break

          suffix = sample_func(probabilities)
          generated_string += suffix

          if generated_string.endswith(options['stop_token']):
            break

      if sampling_method == 'beam': # taking the best sequence
        generated_string = get_best_k_beams(generated_string, n, model, 1)

      print(sampling_method, generated_string)
      examples[example]['generation'].append(generated_string)

generate_text(model, 2, test_)
#####################

greedy HI'ttemeant
beam ['me']
temperature HrUdcMo 🤗💕🎧
topK HERT @ONHER
topP Hed @Flo//W
greedy HewIO7 AYo//t K: t.c b
beam ['as']
temperature HeZQ1Um　♯TJ-^SB:0XmSu💫
topK Hends @ONHERT @ONHERT 
topP Heio///t me


In [15]:
### do not change ###
print('-------- NLG --------')

for k,v in test_.items():
  l = ''.join([f'\t{sm} >> {v["start_tokens"]}{g}\n' for sm,g in zip(v['sampling_method'],v['generation'])])
  print(f'{k}:')
  print(l)

-------- NLG --------
example1:
	greedy >> HHI'ttemeant
	beam >> H['me']

example2:
	temperature >> HHrUdcMo 🤗💕🎧
	topK >> HHERT @ONHER
	topP >> HHed @Flo//W

example3:
	greedy >> HeHewIO7 AYo//t K: t.c b
	beam >> He['as']
	temperature >> HeHeZQ1Um　♯TJ-^SB:0XmSu💫
	topK >> HeHends @ONHERT @ONHERT 
	topP >> HeHeio///t me



<br><br><br>
# **Good luck!**