DatasetAnalysis

Research project on dealing with broken/biased datasets for NLP

Setup instructions (Linux server)

Python dependencies

Set up a personal Anaconda installation on the NLP servers. Then:

conda create -n DSAnalysis python=3.8

Set this environment up for PyTorch:

conda activate DSAnalysis
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
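
Once the install finishes, a quick check from Python can confirm that PyTorch imports and whether it sees a GPU. This is a minimal sketch; the helper name `describe_torch` is made up for illustration and is not part of this repository.

```python
import importlib.util


def describe_torch():
    """Return a short status string for the PyTorch install in the active env."""
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"

    import torch

    # torch.version.cuda is None for CPU-only builds.
    return (
        f"torch {torch.__version__}, "
        f"built for CUDA {torch.version.cuda}, "
        f"GPU available: {torch.cuda.is_available()}"
    )
```

Running `print(describe_torch())` inside the activated DSAnalysis environment prints the summary. If it reports that no GPU is available, the installed CUDA toolkit version may not match the server's driver.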

Install the packages listed in requirements.txt:

pip install -r requirements.txt

Finally, configure Weights & Biases:

wandb login

and paste in either Michael's private API key or your own.

Setting up working directory, env vars, data

Currently, the scripts hard-code the location of the data. Michael will fix this very soon.

The data is located at nlp.cs.ucsb.edu:/mnt/hdd/saxon/{dataset}, where {dataset} is one of snli_1.0, anli_v1.0, or mnli. You can access it directly from there or copy it elsewhere.
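
Until the hard-coded paths are replaced, one possible approach is to resolve the data location from an environment variable and fall back to the server path. The sketch below only illustrates that idea and is not how the repository's scripts currently work; the variable name `DSANALYSIS_DATA_DIR` and the function `dataset_dir` are hypothetical.

```python
import os
from pathlib import Path

# Dataset folder names as they appear on the server.
DATASETS = ("snli_1.0", "anli_v1.0", "mnli")

# Location on nlp.cs.ucsb.edu; used when the (hypothetical) variable is unset.
DEFAULT_DATA_ROOT = "/mnt/hdd/saxon"


def dataset_dir(name, env_var="DSANALYSIS_DATA_DIR"):
    """Return the directory for one dataset, honoring an optional override."""
    if name not in DATASETS:
        raise ValueError(f"unknown dataset {name!r}; expected one of {DATASETS}")
    # The environment variable wins, so a local copy can be used without edits.
    root = Path(os.environ.get(env_var, DEFAULT_DATA_ROOT))
    return root / name
```

With the variable unset, `dataset_dir("mnli")` points at the server copy; exporting `DSANALYSIS_DATA_DIR` before running a script redirects it to wherever the data was copied.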

You will want to set up a working directory where trained models and precomputed embeddings can be saved.
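
A working directory like this can be created in one step. The following is a rough sketch under assumed names: the subfolders `models` and `embeddings` and the helper `make_workdir` are placeholders, and the scripts in this repository may expect a different layout.

```python
from pathlib import Path


def make_workdir(root, subdirs=("models", "embeddings")):
    """Create the working directory and its subfolders; return their paths."""
    root = Path(root).expanduser()
    paths = {}
    for sub in subdirs:
        path = root / sub
        # exist_ok makes the call safe to repeat on an existing directory.
        path.mkdir(parents=True, exist_ok=True)
        paths[sub] = path
    return paths
```

For example, `make_workdir("~/dsanalysis_work")` (a placeholder location) creates both subfolders under your home directory and returns a dict mapping each name to its path.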
