# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.


In [2]:
import os
import io
import sys
import pandas as pd
import numpy as np
import math
from pathlib import Path
from google.colab import drive
from sklearn.metrics import f1_score
from collections import defaultdict

In [3]:
languages = ["en","es","in","it","pt","fr","nl","tl"]

As a preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [4]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 71 (delta 29), reused 40 (delta 11), pack-reused 0
Unpacking objects: 100% (71/71), done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [5]:
!ls nlp-course/lm-languages-data-new

en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
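
As a minimal sketch of the token definition above: a Python `str` is a sequence of Unicode code points, so iterating over it yields one character at a time, and `á` counts as a single token however many bytes it takes in UTF-8. The helper name `build_vocabulary` and the sample strings are made up for illustration and are not part of the assignment.

```python
def build_vocabulary(texts):
    """Collect every distinct character that appears in an iterable of strings."""
    tokens = set()
    for text in texts:
        tokens.update(text)  # iterating a str yields one character at a time
    return sorted(tokens)  # sorted only to make the result reproducible


toy_tweets = ["hola!", "ol\u00e1", "hello"]  # made-up sample strings
toy_vocabulary = build_vocabulary(toy_tweets)
print(toy_vocabulary)
```

The same idea applies to the real data by feeding the `tweet_text` column of each file into the set.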

In [6]:
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [7]:
dir_path = Path('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data')
all_files = [x for x in dir_path.glob('**/*') if x.is_file()]

In [8]:
# Test cell
print(all_files)

[PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/tl.json'), PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/tests.csv'), PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/fr.json'), PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/pt.csv'), PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/it.json'), PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/test.json'), PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/in.csv'), PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/it.csv'), PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/nl.json'), PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/nl.csv'), PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/es.csv'), PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/pt.json'), PosixPath('/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/tl.csv'), PosixPath('/content/drive/

In [9]:
# pandas.read_csv decodes files as UTF-8 by default; we pass the encoding explicitly anyway
def preprocess():

  # We use a set instead of a list so that duplicates are dropped as we go
  # The set grows with the number of distinct characters, not with the corpus size
  tokens = set()

  for data_file in all_files:
    if data_file.suffix == '.csv':
      df = pd.read_csv(data_file, encoding='UTF-8')

      for tweet in df['tweet_text'].values:
        # Iterating over a string yields its characters, so this adds each one in place
        tokens.update(tweet)

  # We return a list, per the instructions
  return list(tokens)

In [10]:
# Test cell
vocabulary = preprocess()
print(vocabulary[:5])

['z', '☆', '😘', '⊙', '🐓']


**Part 2**

Write a function *lm* that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant sequences of n-1 tokens, and the values are dictionaries mapping each possible n-th token to its probability of occurring. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.
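
The dictionary layout described above can be sketched on a toy string. This is only an illustration under simple assumptions (no smoothing, no start or end markers); the function name `ngram_model` and the input text are hypothetical.

```python
from collections import Counter, defaultdict


def ngram_model(text, n):
    """Map each (n-1)-character prefix to {next character: relative frequency}."""
    counts = defaultdict(Counter)
    # A text of length L contains L - n + 1 n-grams
    for start in range(len(text) - n + 1):
        prefix, token = text[start:start + n - 1], text[start + n - 1]
        counts[prefix][token] += 1
    return {
        prefix: {token: count / sum(followers.values()) for token, count in followers.items()}
        for prefix, followers in counts.items()
    }


toy_model = ngram_model("abcabcabd", 3)
print(toy_model["ab"])
```

Here "ab" is followed by "c" twice and by "d" once, so the inner dictionary assigns them 2/3 and 1/3.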

Note - You should think how to add the add_one smoothing information to the dictionary and implement it.
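
One possible way to carry the smoothing information in the dictionary, sketched here under assumptions: alongside the smoothed probabilities of the observed tokens, store a default probability for every unobserved token under a reserved key. The key `UNSEEN` and the function `smoothed_distribution` are hypothetical names, not part of the assignment; a multi-character key cannot collide with single-character tokens.

```python
from collections import Counter

UNSEEN = "<unseen>"  # hypothetical reserved key, not part of the assignment


def smoothed_distribution(counts, vocabulary):
    """Add-one estimate: P(token | prefix) = (count + 1) / (total + |V|)."""
    total = sum(counts.values())
    denominator = total + len(vocabulary)
    distribution = {token: (count + 1) / denominator for token, count in counts.items()}
    # Every token that never followed this prefix shares the same probability
    distribution[UNSEEN] = 1 / denominator
    return distribution


toy_vocab = ["a", "b", "c", "d"]
counts_after_ab = Counter({"c": 2, "b": 1, "d": 1})  # toy counts following "ab"
dist = smoothed_distribution(counts_after_ab, toy_vocab)

# Probability mass over the whole vocabulary, using the default for unseen tokens
total_mass = sum(dist.get(token, dist[UNSEEN]) for token in toy_vocab)
print(dist, total_mass)
```

With the default in place, the probabilities for a given prefix sum to 1 over the vocabulary, which is the property add-one smoothing is meant to preserve.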

In [None]:
def lm(n, vocabulary, data_file_path, add_one):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)

  tweets = pd.read_csv(data_file_path).get('tweet_text')
  vocab_length = len(vocabulary)

  # model[prefix][suffix] first holds the count of suffix following prefix
  model = defaultdict(lambda: defaultdict(float))

  for tweet in tweets:
    # Wrap each tweet in start and end markers. This is a naive approach: it assumes
    # that the characters "⇏" and "⇍" never occur in the data files
    tweet = "⇏" + tweet + "⇍"

    # range(len(tweet) - n) stops one position early, so the final n-gram
    # (the one ending in "⇍") is not counted
    for start in range(len(tweet) - n):
      n_gram = tweet[start:start + n]
      prefix, suffix = n_gram[:-1], n_gram[-1]
      model[prefix][suffix] += 1

  # Add-one smoothing: only suffixes seen after a prefix are stored, so unseen
  # suffixes have no entry and are handled by the fallback value in eval
  def do_add(suffix_frequency, suffix_sum, vocab_length):
    return {key: (val + 1) / (suffix_sum + vocab_length) for key, val in suffix_frequency.items()}

  # No smoothing: relative frequency of each suffix after the prefix
  def no_add(suffix_frequency, suffix_sum, vocab_length=0):
    return {key: val / suffix_sum for key, val in suffix_frequency.items()}

  norm_func = do_add if add_one else no_add

  # Convert the counts for each prefix into probabilities
  for key, suffix_frequency in model.items():
    suffix_sum = sum(suffix_frequency.values())
    model[key] = norm_func(suffix_frequency, suffix_sum, vocab_length)

  return dict(model)

In [None]:
# Test cell without start and end tokens for tweets
# print(lm(2, vocabulary, all_files[3], False))

{'A': {' ': 0.22585522585522586, 'n': 0.02286902286902287, 'I': 0.02457002457002457, 'L': 0.030807030807030806, 'N': 0.042714042714042715, 'M': 0.050652050652050654, 'V': 0.012852012852012852, 'g': 0.006615006615006615, 'c': 0.018522018522018523, 's': 0.012474012474012475, 'i': 0.008694008694008694, 'G': 0.007371007371007371, 'R': 0.06615006615006615, 'S': 0.029106029106029108, '5': 0.001323001323001323, 'C': 0.015498015498015497, 'x': 0.001512001512001512, '9': 0.001512001512001512, 'K': 0.003024003024003024, '3': 0.001512001512001512, 'P': 0.013608013608013609, 'D': 0.027594027594027595, ',': 0.00567000567000567, 'p': 0.010017010017010016, 'W': 0.001701001701001701, 'T': 0.02702702702702703, '7': 0.001701001701001701, 'O': 0.006615006615006615, 'A': 0.04554904554904555, 'r': 0.051219051219051216, 'w': 0.03931203931203931, '.': 0.002268002268002268, 'B': 0.010773010773010773, 'v': 0.00378000378000378, 't': 0.009639009639009639, 'd': 0.009261009261009262, 'Q': 0.004536004536004536, 'f'

In [None]:
# Test cell with start and end tokens for tweets
print(lm(2, vocabulary, all_files[3], False))

{'⇏': {'A': 0.027, 'V': 0.016888888888888887, 'E': 0.031, 'T': 0.019777777777777776, 'R': 0.43433333333333335, 'g': 0.013, 's': 0.018, 'o': 0.007222222222222222, '@': 0.13766666666666666, '3': 0.0016666666666666668, 'D': 0.008555555555555556, 'M': 0.019333333333333334, 'I': 0.0036666666666666666, 'G': 0.015333333333333332, '#': 0.006333333333333333, 'e': 0.015666666666666666, 'Q': 0.016777777777777777, '2': 0.002777777777777778, '"': 0.0035555555555555557, 'C': 0.014888888888888889, 'H': 0.004888888888888889, 'q': 0.0067777777777777775, 'c': 0.006111111111111111, 'F': 0.00788888888888889, 'N': 0.014777777777777779, 'f': 0.0024444444444444444, 'm': 0.008444444444444444, 'p': 0.00811111111111111, 'O': 0.014777777777777779, 'S': 0.02, 'v': 0.005333333333333333, 'B': 0.006111111111111111, 'L': 0.0026666666666666666, 'U': 0.0028888888888888888, 'P': 0.015666666666666666, '¡': 0.00011111111111111112, '-': 0.002111111111111111, 'd': 0.0022222222222222222, 'n': 0.006888888888888889, 'K': 0.002

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
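
For reference, the perplexity of a model over N predicted tokens is 2 raised to the average negative log2-probability of those tokens. The sketch below works on a plain list of probabilities; the function name `perplexity` and the sample values are assumptions made for illustration.

```python
import math


def perplexity(probabilities):
    """2 ** (average negative log2-probability) of the predicted tokens."""
    if not probabilities:
        raise ValueError("need at least one probability")
    cross_entropy = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** cross_entropy


uniform = [0.25, 0.25, 0.25, 0.25]
mixed = [0.5, 0.125]
print(perplexity(uniform))
print(perplexity(mixed))
```

Perplexity is the inverse of the geometric mean of the probabilities, so it differs from the inverse of their arithmetic mean: for `mixed`, the arithmetic mean is 0.3125, whose inverse is 3.2, while the perplexity is 4.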

In [None]:
def eval(n, model, data_file):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to calculate a perplexity score for

  df = pd.read_csv(data_file)
  missing_value = 1e-8  # fallback probability for unseen prefixes or suffixes
  probabilities = []

  for tweet in df['tweet_text'].values:
    # Tweets are scored as-is (no start/end markers); range(len(tweet) - n)
    # leaves out the final n-gram of each tweet
    for start in range(len(tweet) - n):
      substring = tweet[start: start + n]
      key, value = substring[:-1], substring[-1]
      probabilities.append(model.get(key, {}).get(value, missing_value))

  # The log is taken of the mean probability over the whole file, so the value
  # returned is 1 / mean(probabilities), not 2 ** mean(-log2(probabilities))
  entropy = -math.log2(np.mean(probabilities))

  return math.pow(2, entropy)

In [None]:
def evalTweet(tweet, n, model):
  # Perplexity of a single tweet: 2 ** (average negative log2-probability of its n-grams)
  missing_value = 1e-8  # fallback probability for unseen prefixes or suffixes
  entropy = 0

  # Start positions of all n-grams in the tweet
  n_grams = range(len(tweet) - n + 1)

  for start in n_grams:
    text = tweet[start:start + n - 1]  # prefix; the empty string when n == 1
    suffix = tweet[start + n - 1]
    entropy += -math.log2(model.get(text, {}).get(suffix, missing_value))

  # Raises ZeroDivisionError if the tweet is shorter than n characters
  entropy /= len(n_grams)

  return math.pow(2, entropy)

In [None]:
# Test cell without start and end tokens for tweets
#n=2
#model = lm(n, vocabulary, '/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/en.csv', True)
#print(eval(n, model, '/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/en.csv'))
#print(eval(n, model, '/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/es.csv'))
#print(eval(n, model, '/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/fr.csv'))

9.973425686113805
11.577644646746215
11.124442676049012


In [None]:
# Test cell with start and end tokens for tweets
n=2
model = lm(n, vocabulary, '/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/en.csv', True)
print(eval(n, model, '/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/en.csv'))
print(eval(n, model, '/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/es.csv'))
print(eval(n, model, '/content/drive/My Drive/Ilana Sivan IDC/NLP/HW1/data/fr.csv'))

10.031366993844895
11.646099255175804
11.18865693320134


**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., en model applied on the data files en ,es, fr, in, it, nl, pt, tl; es model applied on the data files en, es...). This function should return a pandas DataFrame with columns [en ,es, fr, in, it, nl, pt, tl] and every row should be labeled with one of the languages. Then, the values are the relevant perplexity values.
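
The pairwise structure described above can be sketched with toy stand-ins, independent of the real models. Everything here is hypothetical: a "model" is a set of known characters, a "dataset" is a string, and the score is the fraction of characters the model has not seen.

```python
def pairwise_scores(models, datasets, score):
    """Return table[model_name][dataset_name] = score(model, dataset)."""
    return {
        model_name: {data_name: score(model, data) for data_name, data in datasets.items()}
        for model_name, model in models.items()
    }


def unseen_fraction(model, text):
    """Toy score: share of characters in text that the model does not know."""
    return sum(ch not in model for ch in text) / len(text)


toy_models = {"en": set("the"), "es": set("el")}
toy_datasets = {"en": "the", "es": "el"}

table = pairwise_scores(toy_models, toy_datasets, unseen_fraction)
print(table)
```

Passing such a nested dictionary to `pd.DataFrame` makes the outer keys the columns; `pd.DataFrame.from_dict(table, orient="index")` makes them the row labels instead.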

In [None]:
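def match(n, add_one):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not
  models = {}
  perplexities = {}

  # Build the vocabulary once instead of once per language
  vocabulary = preprocess()

  for language_model in languages:
    data_file_path = dir_path / f'{language_model}.csv'
    models[language_model] = lm(n, vocabulary, data_file_path, add_one)
    perplexities[language_model] = {}

    for language in languages:
      data_file_path = dir_path / f'{language}.csv'
      perplexities[language_model][language] = eval(n, models[language_model], data_file_path)

  # For a dict of dicts, pandas uses the outer keys (model language) as columns
  # and the inner keys (data-file language) as the row index
  return pd.DataFrame(perplexities)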

**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [None]:
# Run match output without start and end tokens

add_one = True and n = 1
         en       es       in       it       pt       fr       nl       tl
en  21.1690  20.8165  22.3151  21.6500  20.8329  21.0619  21.6162  22.9796
es  20.8140  19.5280  21.1574  20.5776  19.5840  20.1658  20.9656  21.8315
in  22.3079  21.1532  20.1551  21.9350  21.0382  22.1114  22.6436  21.1261
it  21.6452  20.5756  21.9372  21.0313  20.6034  21.1978  21.8773  22.5737
pt  20.8180  19.5724  21.0300  20.5932  19.4000  20.2674  21.1684  21.6973
fr  21.0634  20.1697  22.1201  21.2040  20.2833  20.2216  21.0796  23.0859
nl  21.6150  20.9670  22.6496  21.8809  21.1823  21.0770  20.9792  23.6351
tl  22.9669  21.8222  21.1213  22.5663  21.7008  23.0716  23.6234  21.6046
add_one = False and n = 1
         en       es       in       it       pt       fr       nl       tl
en  21.1226  20.7684  22.2591  21.5978  20.7724  21.0173  21.5676  22.9167
es  20.7684  19.4828  21.1042  20.5280  19.5271  20.1231  20.9185  21.7717
in  22.2591  21.1042  20.1045  21.8821  20.9772  

In [None]:
def run_match():
  for i in range(1,5):
    print(f"add_one = True and n = {i}")
    print(round(match(i, True),4))
    print(f"add_one = False and n = {i}")
    print(round(match(i, False),4))
    
run_match()

add_one = True and n = 1
         en       es       in       it       pt       fr       nl       tl
en  21.5554  21.1645  22.7541  22.0273  21.2763  21.3910  22.0122  23.4151
es  21.1957  19.8530  21.5759  20.9335  19.9896  20.4815  21.3511  22.2373
in  22.7163  21.5027  20.5462  22.3186  21.4647  22.4603  23.0537  21.5186
it  22.0417  20.9183  22.3666  21.3956  21.0280  21.5313  22.2774  22.9975
pt  21.2035  19.8981  21.4477  20.9492  19.8017  20.5871  21.5608  22.1042
fr  21.4486  20.5098  22.5574  21.5755  20.7172  20.5362  21.4672  23.5206
nl  22.0038  21.3177  23.0909  22.2612  21.6302  21.4006  21.3603  24.0701
tl  23.3870  22.1799  21.5318  22.9568  22.1377  23.4359  24.0497  22.0052
add_one = False and n = 1
         en       es       in       it       pt       fr       nl       tl
en  21.5092  21.1166  22.6984  21.9755  21.2163  21.3466  21.9639  23.3527
es  21.1503  19.8080  21.5230  20.8842  19.9331  20.4390  21.3042  22.1780
in  22.6677  21.4541  20.4958  22.2661  21.4042  

**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g. returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good/creative solutions.
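
One simple decision rule, sketched here as an assumption rather than the required approach, is to score a sentence with every language model and pick the language with the lowest perplexity. The function name `pick_language` and the scores below are made up.

```python
def pick_language(scores):
    """Return the language with the lowest perplexity in a {language: perplexity} dict."""
    # On ties, min returns the first language in the dict's insertion order
    return min(scores, key=scores.get)


toy_scores = {"en": 12.4, "es": 9.8, "fr": 10.1}  # made-up perplexities for one tweet
predicted = pick_language(toy_scores)
print(predicted)
```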

In [None]:
# We kept the function signature as provided in the homework file; it could be improved by taking n and add_one as parameters
def classify():
  data_file_path = dir_path / 'test.csv'
  tweets = pd.read_csv(data_file_path, encoding='utf-8').get('tweet_text')

  models = {}
  res = []
  n = 4
  add_one = True

  # Build the vocabulary once and train one model per language
  vocabulary = preprocess()
  for language_model in languages:
    models[language_model] = lm(n, vocabulary, dir_path / f'{language_model}.csv', add_one)

  for tweet in tweets:
    # Pick the language whose model gives the tweet the lowest perplexity
    correct_lang = languages[0]
    min_perplexity = float('inf')

    for lang, model in models.items():
      perplexity = evalTweet(tweet, n, model)

      if perplexity < min_perplexity:
        correct_lang = lang
        min_perplexity = perplexity

    res.append(correct_lang)

  return res

In [None]:
# Test cell
classification_result = classify()
print(classification_result)

**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 
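
To make the averaging options concrete, here is a dependency-free sketch of micro- and macro-averaged F1 for single-label predictions. The function name `f1_scores` and the label lists are made up for illustration; in the single-label multiclass setting, micro-averaged F1 is equal to accuracy.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    def f1(true_pos, false_pos, false_neg):
        denominator = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denominator if denominator else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


y_true = ["en", "en", "es", "fr"]  # made-up gold labels
y_pred = ["en", "es", "es", "fr"]  # made-up predictions
micro, macro = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro, macro, accuracy)
```

Macro averaging weights every language equally, so it can differ from the micro average when some languages are classified better than others.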


In [None]:
def calc_f1(result):
  # result - the list of predicted languages, in the same order as the rows of test.csv
  data_file_path = dir_path / 'test.csv'
  labels = pd.read_csv(data_file_path).get('label')
  # Micro-averaged F1 over all test tweets
  return f1_score(list(labels), result, average="micro")

In [None]:
# Test cell
calc_f1(classification_result)

0.9218652331541443

# **Good luck!**