# Assignment 1
In this assignment you will be creating tools for learning and testing language models.
The corpora that you will be working with are lists of tweets in 8 different languages that use the Latin script. The data is provided either formatted as CSV or as JSON, for your convenience. The end goal is to write a set of tools that can detect the language of a given tweet.


As preparation for this task, download the data files from the course git repository.

The relevant files are under **lm-languages-data-new**:


*   en.csv (or the equivalent JSON file)
*   es.csv (or the equivalent JSON file)
*   fr.csv (or the equivalent JSON file)
*   in.csv (or the equivalent JSON file)
*   it.csv (or the equivalent JSON file)
*   nl.csv (or the equivalent JSON file)
*   pt.csv (or the equivalent JSON file)
*   tl.csv (or the equivalent JSON file)
*   test.csv (or the equivalent JSON file)





In [None]:
!git clone https://github.com/kfirbar/nlp-course.git

Cloning into 'nlp-course'...
remote: Enumerating objects: 71, done.
remote: Counting objects: 100% (71/71), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 71 (delta 29), reused 40 (delta 11), pack-reused 0
Unpacking objects: 100% (71/71), done.




---



**Important note: please use only the files under lm-languages-data-new and NOT under lm-languages-data**


---



In [None]:

!ls nlp-course/lm-languages-data-new


en.csv	 es.json  in.csv   it.json  pt.csv    test.json   tl.csv
en.json  fr.csv   in.json  nl.csv   pt.json   tests.csv   tl.json
es.csv	 fr.json  it.csv   nl.json  test.csv  tests.json


In [None]:
# imports
import pandas as pd
import os
import numpy as np
from collections import defaultdict

In [None]:
# Constants:
DIR = r'/content/nlp-course/lm-languages-data-new'
# marking chars that are not part of the vocabulary as start & end tokens
START_TOKEN = 'ɸ'
END_TOKEN = 'ɼ'

**Part 1**

Write a function *preprocess* that iterates over all the data files and creates a single vocabulary, containing all the tokens in the data. **Our token definition is a single UTF-8 encoded character**. So, the vocabulary list is a simple Python list of all the characters that you see at least once in the data.
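
As a minimal sketch of the token definition above, the snippet below collects a character vocabulary from two made-up strings (they are not taken from the data files). The helper name `build_vocabulary` is hypothetical and only serves the illustration.

```python
def build_vocabulary(texts):
    """Collect every distinct character that appears in any of the texts."""
    vocab = set()
    for text in texts:
        vocab.update(text)  # iterating a str yields one character at a time
    return sorted(vocab)


# Made-up sample strings, for illustration only.
sample_vocab = build_vocabulary(["hola", "olá!"])
print(sample_vocab)
```

Note that accented characters such as "á" count as separate tokens from their unaccented counterparts.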

In [None]:
# helper functions part 1
def get_csv_files():
  csv_files = []
  for data_file in os.listdir(DIR):
    if data_file.endswith('csv'):
      csv_files.append(data_file)
  return csv_files

In [None]:
def preprocess():
  tokens = set()
  csv_files = get_csv_files()
  for data_file in csv_files:
    df = pd.read_csv(os.path.join(DIR, data_file))
    for tweet in df['tweet_text'].values:
      # iterating over a string yields its characters, which are our tokens
      tokens.update(tweet)
  return list(tokens)

In [None]:
vocabulary = preprocess()
len(vocabulary)

1859

In [None]:
# helper functions part 2

def pad_sentence(prefix_len, start_token, end_token, sentence):
  return prefix_len * start_token + sentence + prefix_len * end_token

def data_file_to_process_string(df, n, start_token, end_token):
  data_with_padding = [pad_sentence(n-1, start_token, end_token, sentence) for sentence in df['tweet_text'].values]
  data_concated_string = ''.join(data_with_padding)
  return data_concated_string

def count_suffix(padded_str, prefix_len):
  # maps each prefix of length prefix_len to {next character: count}
  prefix_to_suffix_count = {}
  # stop at the last position where a full prefix plus one following character fits
  for i in range(len(padded_str) - prefix_len):
    prefix = padded_str[i : i + prefix_len]
    suffix = padded_str[i + prefix_len]
    suffix_counts = prefix_to_suffix_count.setdefault(prefix, {})
    suffix_counts[suffix] = suffix_counts.get(suffix, 0) + 1

  return prefix_to_suffix_count

def calc_prob(prefix_len, data_string, add_one, vocab_len):
  counter_dict = count_suffix(data_string, prefix_len)
  prob_dict = {}
  for prefix, suffix_counts in counter_dict.items():
    total = sum(suffix_counts.values())
    if add_one:
      # add-one smoothing: +1 on every seen count, +|V| on the denominator
      total += vocab_len
      prob_dict[prefix] = {suffix: (count + 1) / total for suffix, count in suffix_counts.items()}
    else:
      prob_dict[prefix] = {suffix: count / total for suffix, count in suffix_counts.items()}
  return prob_dict

In [None]:
def lm(n, vocabulary, data_file_path, add_one = False):
  # n - the n-gram to use (e.g., 1 - unigram, 2 - bigram, etc.)
  # vocabulary - the vocabulary list (which you should use for calculating add_one smoothing)
  # data_file_path - the data_file from which we record probabilities for our model
  # add_one - True/False (use add_one smoothing or not)
  
  df = pd.read_csv(data_file_path)
  padded_data_string = data_file_to_process_string(df, n, START_TOKEN, END_TOKEN)
  model = calc_prob(n-1, padded_data_string, add_one, len(vocabulary))
  return model

In [None]:
n = 3
add_one = False
data_file = 'en.csv'
data_file_path = os.path.join(DIR, data_file)
model = lm(n, vocabulary, data_file_path, add_one)

In [None]:
model

**Part 2**

Write a function lm that generates a language model from a textual corpus. The function should return a dictionary (representing a model) where the keys are all the relevant n-1 sequences, and the values are dictionaries with the n_th tokens and their corresponding probabilities to occur. For example, for a trigram model (tokens are characters), it should look something like:

{
  "ab":{"c":0.5, "b":0.25, "d":0.25},
  "ca":{"a":0.2, "b":0.7, "d":0.1}
}

which means for example that after the sequence "ab", there is a 0.5 chance that "c" will appear, 0.25 for "b" to appear and 0.25 for "d" to appear.
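
To make the structure concrete, here is a small sketch that queries the example dictionary above. The helper `next_char_prob` and its `default` parameter are hypothetical names chosen for this illustration, not part of the assignment.

```python
# The example model from the text: prefix -> {next character: probability}.
toy_model = {
    "ab": {"c": 0.5, "b": 0.25, "d": 0.25},
    "ca": {"a": 0.2, "b": 0.7, "d": 0.1},
}


def next_char_prob(model, prefix, char, default=0.0):
    """Look up P(char | prefix); return `default` for unseen combinations."""
    return model.get(prefix, {}).get(char, default)


print(next_char_prob(toy_model, "ab", "c"))  # 0.5
print(next_char_prob(toy_model, "ab", "z"))  # 0.0 ("z" was never seen after "ab")

# Without smoothing, each row is a probability distribution and sums to 1.
for prefix, dist in toy_model.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```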

Note: think about how to store the add_one smoothing information in the dictionary, and implement it.
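
One possible way to keep the smoothing information inside the dictionary is sketched below. It is only an illustration under stated assumptions: the sentinel key `UNSEEN` and the helpers `add_one_model` and `smoothed_prob` are hypothetical names, and the sketch assumes the sentinel string never occurs as a real prefix or character. With add-one smoothing, P(c | prefix) = (count(prefix, c) + 1) / (count(prefix) + V), where V is the vocabulary size, so every character that was not seen after a prefix shares the same value 1 / (count(prefix) + V).

```python
from collections import Counter, defaultdict

UNSEEN = "<unseen>"  # hypothetical sentinel key, not part of the assignment


def add_one_model(text, n, vocab_size):
    """Character n-gram model with add-one smoothing.

    P(c | prefix) = (count(prefix, c) + 1) / (count(prefix) + vocab_size)
    """
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        prefix, char = text[i:i + n - 1], text[i + n - 1]
        counts[prefix][char] += 1

    model = {}
    for prefix, suffix_counts in counts.items():
        total = sum(suffix_counts.values()) + vocab_size
        dist = {c: (k + 1) / total for c, k in suffix_counts.items()}
        dist[UNSEEN] = 1 / total  # shared by every character not seen after prefix
        model[prefix] = dist
    # A prefix that never occurred has count 0, so every character gets 1 / V.
    model[UNSEEN] = {UNSEEN: 1 / vocab_size}
    return model


def smoothed_prob(model, prefix, char):
    """Look up a smoothed probability, falling back to the stored defaults."""
    dist = model.get(prefix, model[UNSEEN])
    return dist.get(char, dist[UNSEEN])


toy = add_one_model("abab", n=2, vocab_size=3)
print(smoothed_prob(toy, "a", "b"))  # (2 + 1) / (2 + 3)
print(smoothed_prob(toy, "a", "c"))  # (0 + 1) / (2 + 3)
print(smoothed_prob(toy, "c", "a"))  # unseen prefix: 1 / 3
```

Storing one shared default per prefix keeps the dictionary small compared with writing out an entry for every vocabulary character.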

**Part 3**

Write a function *eval* that returns the perplexity of a model (dictionary) running over a given data file.
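
As a reminder of what is being computed, perplexity is 2 raised to the average negative log2 probability that the model assigns to each predicted token. The sketch below works on a plain list of probabilities; the function name `perplexity` is a hypothetical helper for illustration.

```python
import math


def perplexity(probabilities):
    """2 ** (average negative log2 probability) over all predicted tokens."""
    avg_neg_log2 = -sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2 ** avg_neg_log2


# A model that assigns 1/4 to every token is as uncertain as a fair 4-way choice.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

Lower values mean the model found the data less surprising, so a model should score lowest on text in its own language.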

In [None]:
# helper functions part 3
def eval_tweet(n, model, tweet):
  # collects the model probability of each scored n-gram in the padded tweet
  tweet_prob = []
  missing_token = 1e-8  # floor probability for n-grams the model has never seen
  padded_tweet = pad_sentence(n-1, START_TOKEN, END_TOKEN, tweet)
  for i in range(len(padded_tweet) - n):
    prefix = padded_tweet[i : i + n-1]
    suffix = padded_tweet[i + n-1 : i + n]

    if prefix in model and suffix in model[prefix]:
      tweet_prob.append(model[prefix][suffix])
    else:
      tweet_prob.append(missing_token)
  return tweet_prob

In [None]:
def eval(n, model, data_file):
  # n - the n-gram that you used to build your model (must be the same number)
  # model - the dictionary (model) to use for calculating perplexity
  # data_file - the tweets file that you wish to calculate a perplexity score for

  df = pd.read_csv(os.path.join(DIR, data_file))
  tweets_prob = []
  for tweet in df['tweet_text'].values:
    tweets_prob.extend(eval_tweet(n, model, tweet))

  # perplexity = 2 ** (average negative log2 probability per scored character)
  entropy_avg = np.mean(-np.log2(tweets_prob))
  return 2 ** entropy_avg

In [None]:
eval(3, model, 'en.csv')

8.895281101335167

**Part 4**

Write a function *match* that creates a model for every relevant language, using a specific value of *n* and *add_one*. Then, calculate the perplexity of all possible pairs (e.g., the en model applied to the data files en, es, fr, in, it, nl, pt, tl; the es model applied to the data files en, es, ...). This function should return a pandas DataFrame with columns [en, es, fr, in, it, nl, pt, tl], where every row is labeled with one of the languages and the values are the corresponding perplexity scores.
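
The pairwise idea can be sketched on a toy scale as follows. Everything here is an assumption made for illustration: the two "languages" `xx` and `yy` are made-up strings, the models are simple unsmoothed bigram models, and the helper names and the `floor` value for unseen pairs are hypothetical.

```python
import math
from collections import Counter, defaultdict


def bigram_model(text):
    """Unsmoothed character bigram model: previous char -> {next char: prob}."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return {p: {c: k / sum(cs.values()) for c, k in cs.items()}
            for p, cs in counts.items()}


def bigram_perplexity(model, text, floor=1e-8):
    """Perplexity of `text` under `model`; unseen pairs get a small floor."""
    probs = [model.get(p, {}).get(c, floor) for p, c in zip(text, text[1:])]
    return 2 ** (-sum(math.log2(p) for p in probs) / len(probs))


# Made-up miniature "corpora", for illustration only.
corpora = {"xx": "abababab", "yy": "cdcdcdcd"}
models = {lang: bigram_model(text) for lang, text in corpora.items()}

# scores[model language][data language] -> perplexity
scores = {m: {d: bigram_perplexity(models[m], corpora[d]) for d in corpora}
          for m in models}

for model_lang, row in scores.items():
    print(model_lang, row)
```

Each model scores lowest on its own data. If a nested dictionary like `scores` is passed to `pd.DataFrame`, pandas uses the outer keys as column labels and the inner keys as the row index, so transpose the result if the rows should correspond to the models.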

In [None]:
# helper functions part 4

from typing import NamedTuple

class LanguageFile(NamedTuple):
  language_name: str
  file_path: str
  language_model: dict

def file_to_language(language_file, n=None, vocabulary=None, add_one=None, train=False):
  language_name = language_file.replace('.csv', '')
  file_path = os.path.join(DIR, language_file)
  # only train a model when asked to; evaluation-only files carry no model
  model = lm(n, vocabulary, file_path, add_one) if train else None
  return LanguageFile(language_name=language_name, file_path=file_path, language_model=model)

def create_language_models(n, vocabulary, add_one):
  result = {}
  language_files = [file_name for file_name in get_csv_files() if not file_name.startswith("test")]
  for train_language in language_files:
    language =  file_to_language(train_language,n, vocabulary, add_one, True)
    result[language.language_name] = language.language_model
  return result

In [None]:
def match(n, add_one):
  # n - the n-gram to use for creating n-gram models
  # add_one - use add_one smoothing or not

  vocabulary = preprocess()
  match_results = defaultdict(lambda: defaultdict(float))

  language_files = [file_name for file_name in get_csv_files() if not file_name.startswith("test")]

  for train_language in language_files:
    language = file_to_language(train_language,n, vocabulary, add_one, True)

    for eval_language in language_files:
      test_language = file_to_language(eval_language)
      match_results[language.language_name][test_language.language_name] = eval(n, language.language_model , eval_language)

  return pd.DataFrame(match_results)

In [None]:
match_result = match(n, add_one)
match_result

data \ model,es,tl,fr,en,in,it,nl,pt
es,8.54936,75.616445,67.733662,80.828639,89.804394,61.470142,89.922192,57.221372
tl,107.338032,8.539298,102.945935,79.231334,62.678628,100.139832,87.705033,126.685479
fr,105.008279,160.870927,8.523593,114.929922,151.66991,101.066439,101.322698,134.61992
en,82.631448,51.817713,61.296126,8.895281,60.790169,76.693666,51.255345,102.081579
in,143.681673,66.205473,112.719284,99.906482,9.816557,154.631072,93.352589,191.859574
it,53.965185,70.107089,58.284058,66.905462,76.867413,8.515268,79.898298,71.061599
nl,154.838509,116.829793,105.544345,93.478954,94.255591,155.141736,9.156812,198.604218
pt,61.417044,100.134053,88.640732,109.056522,111.14557,76.742949,114.50092,8.056447


**Part 5**

Run match with *n* values 1-4, once with add_one and once without, and print the 8 tables to this notebook, one after another.

In [None]:
def run_match():
  for n in [1,2,3,4]:
    for add_one in [False, True]:
      print(f"calculate the perplexity matrix for n value : {n}, add_one: {add_one}")
      match_df = match(n, add_one)
      display(match_df)
      print()
      print()

In [None]:
run_match()

calculate the perplexity matrix for n value : 1, add_one: False


data \ model,es,en,fr,in,tl,it,nl,pt
es,34.070633,39.542618,37.897612,43.025013,40.738263,36.832806,38.826355,35.369637
en,39.866715,36.371756,39.479454,40.317252,40.060942,39.367315,38.706784,40.27798
fr,39.469402,41.768075,35.501822,45.86783,46.226094,38.84151,39.974571,38.901193
in,41.457613,39.353415,42.227229,35.198957,37.073403,41.294381,39.436639,40.758126
tl,44.895726,42.417564,46.711206,40.391906,38.382964,44.076221,44.056254,44.762021
it,38.340969,39.073668,37.808,41.635915,41.071226,35.492806,38.984893,38.76599
nl,39.340551,37.567472,38.748748,39.674969,40.577771,38.921182,35.395858,39.404516
pt,38.605658,42.71223,39.831015,45.673819,43.344967,40.177151,41.41428,34.78056




calculate the perplexity matrix for n value : 1, add_one: True


data \ model,es,en,fr,in,tl,it,nl,pt
es,34.129018,39.462518,37.726334,43.060147,40.730597,36.861579,38.824701,35.433788
en,39.925042,36.426501,39.536847,40.382448,40.146112,39.434677,38.765677,40.354576
fr,39.490707,41.773667,35.553953,45.597099,46.174078,38.907523,39.987989,38.96875
in,41.5263,39.413138,42.284574,35.258012,37.151788,41.365942,39.497842,40.843618
tl,44.966418,42.478578,46.768067,40.453465,38.457399,44.144675,44.117025,44.851003
it,38.387494,39.133043,37.864461,41.681845,41.148507,35.558929,39.033883,38.846389
nl,39.410525,37.623106,38.808607,39.740599,40.664466,38.991898,35.455549,39.493459
pt,38.656874,42.634488,39.76611,45.615867,43.239834,40.222121,41.428723,34.856372




calculate the perplexity matrix for n value : 2, add_one: False


data \ model,es,en,fr,in,tl,it,nl,pt
es,16.138463,29.998765,27.571313,34.060498,28.295181,23.693457,31.6096,21.458295
en,29.225963,18.20954,26.144273,26.771224,24.954475,28.889546,25.054647,30.134756
fr,29.003268,33.902845,17.025982,41.339476,41.533636,30.621814,29.803521,29.162219
in,31.021862,27.395183,30.099043,18.030251,23.418847,30.555093,28.04303,32.482158
tl,30.496772,26.567277,31.368117,23.918727,17.819044,29.746309,29.072151,31.811494
it,23.312712,27.237049,24.462256,28.207739,27.06904,16.532836,28.786322,24.733976
nl,30.099731,25.371571,27.139581,27.070156,28.184354,30.271576,17.830693,31.449946
pt,26.067352,37.249268,31.500176,41.032867,35.216573,29.462754,36.688141,16.449661




calculate the perplexity matrix for n value : 2, add_one: True


data \ model,es,en,fr,in,tl,it,nl,pt
es,18.914437,34.925921,31.598586,39.769654,34.079118,28.025802,37.115836,25.588139
en,34.692085,21.185413,30.474346,31.726969,29.506595,34.03814,29.155382,36.753412
fr,34.545898,39.318528,19.856047,49.139227,49.176823,35.875157,35.55517,35.150586
in,37.033579,31.70223,35.40383,21.306195,27.450751,36.367157,32.551095,39.472042
tl,36.985578,31.214066,37.632549,28.540618,21.421222,35.873525,34.497092,39.537608
it,27.402755,31.415176,28.269293,33.394074,31.973454,19.439419,33.695745,29.750262
nl,35.557763,29.246086,31.53173,31.852547,33.284576,35.585619,20.687968,37.938225
pt,30.301603,43.050886,36.024587,48.035181,42.662773,34.585243,43.124045,19.93492




calculate the perplexity matrix for n value : 3, add_one: False


data \ model,es,en,fr,in,tl,it,nl,pt
es,8.54936,80.828639,67.733662,89.804394,75.616445,61.470142,89.922192,57.221372
en,82.631448,8.895281,61.296126,60.790169,51.817713,76.693666,51.255345,102.081579
fr,105.008279,114.929922,8.523593,151.66991,160.870927,101.066439,101.322698,134.61992
in,143.681673,99.906482,112.719284,9.816557,66.205473,154.631072,93.352589,191.859574
tl,107.338032,79.231334,102.945935,62.678628,8.539298,100.139832,87.705033,126.685479
it,53.965185,66.905462,58.284058,76.867413,70.107089,8.515268,79.898298,71.061599
nl,154.838509,93.478954,105.544345,94.255591,116.829793,155.141736,9.156812,198.604218
pt,61.417044,109.056522,88.640732,111.14557,100.134053,76.742949,114.50092,8.056447




calculate the perplexity matrix for n value : 3, add_one: True


data \ model,es,en,fr,in,tl,it,nl,pt
es,24.996796,192.395059,153.532591,236.117407,206.940958,144.891429,235.559988,142.635194
en,237.054085,27.1547,164.333497,177.384209,149.157401,221.579525,139.387985,312.764215
fr,252.777089,263.362693,25.126621,394.627123,422.053197,248.916657,260.98775,336.197379
in,401.334329,260.041049,312.157394,31.598813,174.69849,432.235126,253.752024,560.247066
tl,332.339096,219.041466,315.140494,173.371952,29.983967,305.445284,256.418177,416.843407
it,134.614463,170.678756,146.075968,215.233538,196.230477,25.699749,219.645713,190.296273
nl,407.118285,225.837885,267.974612,256.075597,321.70211,410.069432,28.283368,548.416761
pt,147.659076,263.91679,208.114272,301.651088,280.15531,186.207289,301.715099,26.49302




calculate the perplexity matrix for n value : 4, add_one: False


data \ model,es,en,fr,in,tl,it,nl,pt
es,4.621382,1814.03571,1077.566898,2435.167082,1193.621015,579.975625,2672.792033,625.003193
en,1917.938019,4.381687,948.623026,722.847745,373.062004,1444.755911,527.880951,2922.324199
fr,2156.07282,1903.274388,4.388358,3782.885498,3545.481574,1996.213511,2008.946763,3813.24053
in,11863.869078,7842.968578,9245.522161,4.947262,1391.141455,12194.635901,6068.326078,21662.943159
tl,3995.381451,2596.497593,5146.003896,909.25515,4.219149,3440.722345,3602.199272,5519.324128
it,714.026353,1899.8147,1225.416873,2225.653879,1203.294314,4.515039,2766.836931,1341.039924
nl,8229.901549,2800.906108,4525.22945,3891.973067,4710.060872,9392.273609,4.515322,16246.548647
pt,660.680495,3532.889361,2125.743278,3794.841848,2041.451044,856.510411,4741.008774,4.271575




calculate the perplexity matrix for n value : 4, add_one: True


data \ model,es,en,fr,in,tl,it,nl,pt
es,56.960759,8979.558003,4993.815215,12915.847359,7464.476188,3048.994686,13551.9536,3193.096052
en,10813.761086,62.128102,5529.466649,5216.713562,2868.279969,8860.314701,3477.300171,16802.430309
fr,9280.288396,8546.209547,55.212542,19074.190709,18288.181925,9470.924752,9849.317543,16731.998058
in,47233.898676,33465.957582,40100.351826,79.41669,7651.937925,49147.687012,28079.783953,82874.552439
tl,21774.949248,13452.08757,26818.506424,5270.416882,69.060615,18018.682872,18799.27036,29342.000079
it,3771.240468,9850.712594,6372.348984,12309.271112,7681.327609,58.709793,15046.511436,7263.068791
nl,32366.907808,11988.724441,18552.54429,18633.530843,22753.073133,37739.803139,65.256166,63304.386252
pt,3193.834656,16334.475304,9696.499772,19237.342087,11985.415025,4472.092742,23159.599866,59.861984






**Part 6**

Each line in the file test.csv contains a sentence and the language it belongs to. Write a function that uses your language models to classify the correct language of each sentence.

Important note regarding the grading of this section: this is an open question, where different solutions will yield different accuracy scores. Any solution that is not trivial (e.g., returning 'en' in all cases) will be accepted. We do reserve the right to give bonus points to exceptionally good or creative solutions.
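
One simple, non-trivial approach is to score the sentence with every language model and pick the best-scoring language. The sketch below is only an illustration under stated assumptions: it uses unigram models trained on two made-up strings, and the names `unigram_model`, `log_likelihood`, `classify_text` and the `floor` value are hypothetical. For a single sentence every model scores the same number of characters, so choosing the highest total log-likelihood is equivalent to choosing the lowest perplexity.

```python
import math
from collections import Counter


def unigram_model(text):
    """Relative frequency of each character in the training text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {c: k / total for c, k in counts.items()}


def log_likelihood(model, text, floor=1e-8):
    """Sum of log2 probabilities; unseen characters get a small floor value."""
    return sum(math.log2(model.get(c, floor)) for c in text)


def classify_text(models, text):
    """Return the label whose model assigns the text the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(models[label], text))


# Made-up training strings, for illustration only.
toy_models = {
    "en": unigram_model("the weather is nice today"),
    "es": unigram_model("el tiempo está muy bueno hoy"),
}
print(classify_text(toy_models, "what a nice day"))  # en
```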

In [None]:
def classify(n, vocabulary, add_one):
  result_df = pd.read_csv(os.path.join(DIR, 'test.csv'))
  language_models = create_language_models(n, vocabulary, add_one)
  classifications = []
  for tweet in result_df['tweet_text'].values:
    min_per = np.inf
    min_lang = None
    for language in language_models:
      # score the tweet with the same n that the models were trained with
      tweet_prob = eval_tweet(n, language_models[language], tweet)
      entropy_avg = np.mean(-np.log2(tweet_prob))
      current_per = 2 ** entropy_avg
      if current_per < min_per:
        min_per = current_per
        min_lang = language
    # keep the language whose model gave the lowest perplexity
    classifications.append(min_lang)
  result_df['classification'] = classifications

  return result_df

clasification_result = classify(3, vocabulary, False)

In [None]:
clasification_result

index,tweet_id,tweet_text,label,classification
0,845394879479996416,RT @jarsofshine: In 08 I had a volunteer who h...,en,en
1,836313846675619841,IN OGNI CASO CON LE PAGHE CHE GIRANO IN Africa...,it,it
2,836259442328940544,@jaynaldmase @acobasilianne @dingDANGdantes @d...,tl,tl
3,847729104472358912,"Daags voor @RondeVlaanderen, @VoltaClassic als...",nl,nl
4,836491739699412992,RT @ertsul20: Susuportahan kita hanggang sa du...,tl,tl
...,...,...,...,...
7994,836250659464761344,"La triste historia que inspiró ""Tu falta de qu...",es,es
7995,847676283089637380,RT @ShahwalAdli_: Aku tak bersuara tak bermakn...,in,in
7996,836319299279138816,@Benji_Mascolo DEVI TAGLIARE QUEI CAPELLI 😠😡😠😂❤,it,it
7997,836258179847716865,Assistimos de camarote varias brigas ontem!,pt,pt


**Part 7**

Calculate the F1 score of your output from part 6. (hint: you can use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). 
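
To see what a weighted F1 score measures, the sketch below computes it by hand in plain Python: the F1 of each class is averaged with weights proportional to how many true examples that class has. The function name `weighted_f1` and the labels are made up for illustration; in the notebook itself the scikit-learn function from the hint is the natural choice.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Made-up labels, for illustration only.
y_true = ["en", "en", "es", "es"]
y_pred = ["en", "es", "es", "es"]
print(weighted_f1(y_true, y_pred))
```

Here "en" has F1 = 2/3 and "es" has F1 = 4/5, and both classes have two true examples, so the weighted score is their plain average.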


In [None]:
from sklearn.metrics import f1_score
def calc_f1(result):
  f1_score_result = f1_score(result['label'].tolist(), result['classification'].tolist(), average = 'weighted')
  return f1_score_result
calc_f1(clasification_result)

0.8415164925034869

# **Good luck!**