# Data prep for T-DNA
https://github.com/shizhediao/T-DNA
<img src="model-training-fine-tuning.png" width="50%">

1. A fastText model, from which we get the ngram embeddings
2. T-DNA expects:
    * data in the form `text \t label` - english_snippet_graph_matches_100k.tsv
    * an ngram frequency file in the form `ngram \t count` - english_snippet_graph_matches_100k_ngrams.tsv
    * an ngram embeddings file in NumPy array format - english_snippet_graph_matches_100k_fasttext.npy
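
Both text files above are two-column, tab-separated formats, so a quick format check before handing them to T-DNA can catch rows where a tab went missing. The following is a minimal sketch using only the standard library; the helper name `check_two_column_tsv` is hypothetical and is not part of the T-DNA code.

```python
def check_two_column_tsv(lines, second_is_int=False):
    """Return (line_number, problem) pairs for rows that are not `field \\t field`."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        fields = raw.rstrip("\n").split("\t")
        if len(fields) != 2:
            problems.append((lineno, f"expected 2 tab-separated fields, got {len(fields)}"))
        elif second_is_int and not fields[1].strip().isdigit():
            problems.append((lineno, f"count is not an integer: {fields[1]!r}"))
    return problems


# The third row uses a space instead of a tab, so it is reported.
sample = ["natural immunity\t2605\n", "great reset\t1355\n", "New York 642\n"]
print(check_two_column_tsv(sample, second_is_int=True))
```

For a real file, pass an open file handle instead of the `sample` list, since the function only iterates over lines.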


In [2]:
!pip install fasttext

Collecting fasttext
  Using cached fasttext-0.9.2.tar.gz (68 kB)
Collecting pybind11>=2.2
  Using cached pybind11-2.10.3-py3-none-any.whl (222 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp39-cp39-linux_x86_64.whl size=310321 sha256=dd473315fb548d6f0bd4bab942c5094aeeb65e3a88ce4432f15ad85f51597848
  Stored in directory: /home/khamilton/.cache/pip/wheels/64/57/bc/1741406019061d5664914b070bd3e71f6244648732bc96109e
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.10.3


In [1]:
import fasttext
import numpy as np
import pandas as pd
import json

In [10]:
data_path = "../datasets/train-articles/all_articles.txt"

In [11]:
!wc -l {data_path}

17484 ../datasets/train-articles/all_articles.txt


In [5]:
# Train a fastText model on the 'TAPT' data so we can extract warm-start embeddings for the ngrams fed to the T-DNA training code.
# The vector dimension must match the hidden size of the LM we will continue training: roberta-large, which has dim=1024.
# Since we only use unigrams and bigrams, wordNgrams is set to 2.
model = fasttext.train_unsupervised(data_path,
                                    model='skipgram',
                                    lr=0.05,
                                    dim=1024,
                                    ws=4,
                                    wordNgrams=2,
                                    epoch=3,
                                    thread=12)

Read 0M words
Number of words:  7463
Number of labels: 0
Progress: 100.0% words/sec/thread:    7857 lr:  0.000000 avg.loss:  2.629731 ETA:   0h 0m 0s


In [6]:
model.save_model("models/PTC_1024_fasttext.bin")

In [2]:
model = fasttext.load_model("models/PTC_1024_fasttext.bin")



In [3]:
words = model.get_words()
print(str(len(words)) + " " + str(model.get_dimension()))

7463 1024


In [9]:
# quick sanity check
model.get_nearest_neighbors('the United')

[(0.9927211999893188, 'expected'),
 (0.9927129149436951, 'appointed'),
 (0.992121160030365, 'called'),
 (0.9920170307159424, 'limited'),
 (0.990821897983551, 'affected'),
 (0.9905429482460022, 'pointed'),
 (0.9901823401451111, 'respected'),
 (0.9901558756828308, 'prohibited'),
 (0.9900302290916443, 'opened'),
 (0.9895544648170471, 'rejected')]
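
All ten neighbours end in "-ed", which is consistent with how fastText builds a vector for a string that is not in its vocabulary: it combines the vectors of the string's character n-grams. The sketch below lists those n-grams as plain strings, assuming the library defaults for unsupervised training (`minn=3`, `maxn=6`) and the `<`/`>` boundary markers; the real library also hashes n-grams into buckets, which this illustration skips. The helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of `<word>` as plain strings (no hashing into buckets)."""
    token = f"<{word}>"
    grams = set()
    for n in range(minn, maxn + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


query = char_ngrams("the United")
neighbour = char_ngrams("expected")
shared = sorted(query & neighbour)
print(shared)  # ['ed>', 'ted', 'ted>']
```

The only overlap between the query and its top neighbour is the word ending, which suggests the check is matching on suffixes rather than on meaning.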

In [7]:
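# ngram pair from the dataset; overwritten by the next two lines before being compared
v1 = model.get_word_vector('great reset')
v2 = model.get_word_vector('jew')
# the pair compared in the next cell
v1 = model.get_word_vector('apple')
v2 = model.get_word_vector('bench')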

In [8]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([v1,v2])

array([[0.99999976, 0.9881947 ],
       [0.9881947 , 1.0000002 ]], dtype=float32)
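
`cosine_similarity([v1, v2])` returns the full pairwise matrix: the diagonal compares each vector with itself, and the off-diagonal entry (0.9881947) is the similarity between `v1` and `v2`. As a reference for what that number means, here is a minimal standard-library sketch of the same quantity for a single pair of vectors; the function name `cosine` is illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
```

Orthogonal vectors score 0 and parallel vectors score 1, so a value near 0.99 for two unrelated words may indicate that the embeddings are not yet well separated. That reading is an interpretation, not something the notebook states.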

In [20]:
# Load the ngrams previously generated by the t-dna-ngrams.ipynb notebook.
# They were saved as a numpy array to make splitting the two columns easier:
# both pandas and plain file I/O had issues with the spaces inside the ngrams.
ngrams = np.load('../data/english_snippet_graph_matches_100k-ngrams.npy',allow_pickle=True)
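
Once the ngrams are written out as `ngram \t count`, the spaces inside an ngram need not be a problem as long as the reader splits on tabs only. A minimal sketch with the standard `csv` module follows; the helper name `read_ngram_counts` is hypothetical, and the sample rows are taken from the ngram listing in this notebook.

```python
import csv
import io


def read_ngram_counts(handle):
    """Parse `ngram \\t count` rows; spaces inside the ngram are preserved."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(ngram, int(count)) for ngram, count in reader]


sample = io.StringIO("natural immunity\t2605\ngreat reset\t1355\n")
pairs = read_ngram_counts(sample)
print(pairs)  # [('natural immunity', 2605), ('great reset', 1355)]
```

When reading from disk, open the file with `newline=''` as the `csv` documentation recommends.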

In [23]:
print(len(ngrams))
ngrams[:5]
# [ngram, count]

410161


array([['natural immunity', 2605],
       ['great reset', 1355],
       ['China virus', 1185],
       ['New York', 642],
       ['Joe Biden', 629]], dtype=object)

In [9]:
# Write the ngram frequency file (`ngram \t count`) for the T-DNA code and
# collect one fastText embedding per ngram, in the same row order.
words = model.get_words()
print(str(len(words)) + " " + str(model.get_dimension()))

vectors = []
ngrams_freq = []
# open with 'w' rather than 'a' so re-running the cell does not append duplicate rows
with open('../data/english_snippet_graph_matches_100k_ngrams.tsv', 'w') as the_file:
    for ngram, count in ngrams:
        vectors.append(model.get_word_vector(ngram))
        the_file.write(ngram + '\t' + str(count) + '\n')
        ngrams_freq.append([ngram, count])
print(len(vectors))

16919 1024
410161


In [9]:
np.save('../models/english_snippet_graph_matches_100k_fasttext.npy',np.array(vectors))

In [4]:
# Convert the training data to a TSV with a label column, as expected by the T-DNA code. The label is not used for MLM.

td = pd.read_csv('../data/english_snippet_graph_matches_100k.csv')


In [6]:
td['label']=0
td

Unnamed: 0,snippet,label
0,shots. The people are getting Now. Cover that ...,0
1,"It's it's insane. But you know, she wants to c...",0
2,going to be honest. When I saw the tape put to...,0
3,"Works through all phases of illness, because i...",0
4,"Free People, which was the freedom to choose M...",0
...,...,...
19899,the news that came out last night from Project...,0
19900,lasting Freedom. It answers all of the questio...,0
19901,the questions that have to be asked. Because t...,0
19902,most important things we can do. It's very har...,0


In [10]:
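# write only the text and label columns so each row is `text \t label`
td.to_csv('../data/english_snippet_graph_matches_100k.tsv', sep='\t', columns=['snippet', 'label'], header=False, index=False)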
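
A `text \t label` file only stays two-column if the text itself contains no tabs or newlines. Whether the snippets here contain any is not shown, so the following is a precautionary sketch rather than a required step: it collapses all internal whitespace to single spaces before writing. The helper name `write_text_label_tsv` is hypothetical.

```python
import io


def write_text_label_tsv(rows, handle):
    """Write `text \\t label` lines, collapsing tabs and newlines in the text to spaces."""
    for text, label in rows:
        clean = " ".join(str(text).split())
        handle.write(f"{clean}\t{label}\n")


buffer = io.StringIO()
write_text_label_tsv([("shots. The people\tare getting\nNow.", 0)], buffer)
print(repr(buffer.getvalue()))  # 'shots. The people are getting Now.\t0\n'
```

With a DataFrame like `td`, the rows could be supplied as `zip(td['snippet'], td['label'])`.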

# Removing punctuation and stopwords

`remove_punct.py` removes punctuation, stopwords, and snippets shorter than 64 tokens.

In [15]:
!python remove_punct.py --data_path=../data/english_audio_snippets_4.4.2022.csv
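
The contents of `remove_punct.py` are not shown here, so the following is only a hypothetical sketch of the three steps described above, not the script itself. It assumes ASCII punctuation, a tiny illustrative stopword list, whitespace tokenization, and that the 64-token threshold is measured after cleaning; the real script may differ on every one of these points.

```python
import string

# Assumption: a tiny illustrative stopword list, not the one the script uses.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_snippet(text, min_tokens=64):
    """Return the cleaned snippet, or None if fewer than `min_tokens` tokens remain."""
    tokens = text.translate(_PUNCT_TABLE).split()
    kept = [t for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(kept) if len(kept) >= min_tokens else None


# A low threshold is used here only to keep the example short.
print(clean_snippet("It's the news, and it is out!", min_tokens=3))  # Its news out
print(clean_snippet("too short"))  # None
```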