# CS 449 Final Project Proposal

Due: April 21, 2023 at 11:59pm

## 1. Names and Net IDs

*List your group members*

## 2. Abstract

*Your abstract should be two or three sentences describing the motivation
for your project and your proposed methods.*

> For example:
> 
> Our final project seeks to use a variety of `sklearn` models to classify
> handwritten digits in the `MNIST` dataset. We will compare models such
> as Logistic Regression and Multilayer Perceptrons.

## 3. Introduction

*Why is this project interesting to you? Describe the motivation for pursuing this project. Give a specific description of your data and what machine learning task you will focus on.*

> For example:
> 
> It is very important for us to be able to automatically recognize handwritten digits so that the Postal Service can identify whether a letter has been sent to the correct address. We will use a large dataset of handwritten digits and train our models to input the black-and-white pixels of those digits and output the number that was written. [etc. etc.]

## 4a. Describe your dataset(s)

*List the datasets you plan to use, where you found them, and what they contain. Be detailed! For each dataset, what does the data look like? What is the data representation? (e.g., what resolution of images? what length of sequences?) How is the data annotated or labeled? Include citations for the datasets. Include at least one citation of previous work that has used your data, or explain why no one has used your data before.*

> We will be using the WMT (Workshop on Statistical Machine Translation) 2014 English-German and English-French datasets. These datasets pair English text with its translation in German or French (and vice versa). Specific details about the datasets can be found here: https://aclanthology.org/W14-3302.pdf . In summary, the datasets come from formal sources (European Parliament, United Nations, news sources, etc.). The text was translated directly from the source language to the target language (i.e., with no intermediary language) using machine translation, and this translation was then followed by an involved manual evaluation process. The dataset is not tokenized; it is purely the source text and its translation. Each example is relatively short: either a simple phrase or a sentence. We chose this dataset in order to replicate the work of Vaswani et al. in their influential paper "Attention is All you Need."

> @InProceedings{bojar-EtAl:2014:W14-33,
  author    = {Bojar, Ondrej  and  Buck, Christian  and  Federmann, Christian  and  Haddow, Barry  and  Koehn, Philipp  and  Leveling, Johannes  and  Monz, Christof  and  Pecina, Pavel  and  Post, Matt  and  Saint-Amand, Herve  and  Soricut, Radu  and  Specia, Lucia  and  Tamchyna, Ale\v{s}},
  title     = {Findings of the 2014 Workshop on Statistical Machine Translation},
  booktitle = {Proceedings of the Ninth Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, Maryland, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {12--58},
  url       = {http://www.aclweb.org/anthology/W/W14/W14-3302}
}

>@inproceedings{NIPS2017_3f5ee243,
 author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, \L ukasz and Polosukhin, Illia},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {I. Guyon and U. Von Luxburg and S. Bengio and H. Wallach and R. Fergus and S. Vishwanathan and R. Garnett},
 pages = {},
 publisher = {Curran Associates, Inc.},
 title = {Attention is All you Need},
 url = {https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf},
 volume = {30},
 year = {2017}
}

## 4b. Load your dataset(s)

*Demonstrate that you have made at least some progress with getting your
dataset ready to use. Load at least a few examples and visualize them
as best you can*

In [None]:
from datasets import load_dataset

'''
    streaming=True is important: without it, load_dataset downloads the whole dataset, which will probably take about an hour for both language pairs.

    Quick Guide:
        IterableDatasetDict : https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.IterableDatasetDict
        IterableDataset     : https://huggingface.co/docs/datasets/v2.11.0/en/package_reference/main_classes#datasets.IterableDataset
        
        `load_dataset(..., streaming=True)` returns an `IterableDatasetDict`.
        Use 'train', 'test', or 'validation' as keys to access the corresponding split, each of type `IterableDataset`.
        On an `IterableDataset`:
            use `take(n)` for some n:int > 0 to get an `IterableDataset` with the first n examples.
            use `shuffle()` to shuffle the dataset.
'''
dataset_fr = load_dataset("wmt14", "fr-en", streaming=True)
dataset_de = load_dataset("wmt14", "de-en", streaming=True)
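
> A streamed dataset cannot see all of its examples at once, so `shuffle()` on an `IterableDataset` works through a fixed-size buffer (the library exposes a `buffer_size` parameter for this). The pure-Python sketch below only illustrates that idea: `buffer_shuffle` is a hypothetical helper written for this note, not the library's implementation, and a range of integers stands in for the translation pairs.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffer_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer (illustrative only)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Buffer is full: emit a random buffered item and put the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit the remaining buffered items in random order.
    rng.shuffle(buffer)
    yield from buffer


# Stand-in for a streamed split: integers 0..99 instead of translation pairs.
shuffled = list(buffer_shuffle(range(100), buffer_size=10, seed=0))
first_five = list(islice(buffer_shuffle(range(100), buffer_size=10, seed=0), 5))
print(first_five)
```

> In this sketch, the first few items emitted always come from near the start of the stream, so calling `take(n)` after a shuffle with a small buffer may not give a representative sample. If the real `shuffle()` behaves similarly, a larger `buffer_size` would be the natural knob to turn.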

In [2]:
# Print the first two examples from every split of both language pairs.
for dataset in (dataset_fr, dataset_de):
    for split in ('train', 'test', 'validation'):
        for data in dataset[split].take(2):
            print(data)
        print()

{'translation': {'fr': 'Reprise de la session', 'en': 'Resumption of the session'}}
{'translation': {'fr': 'Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.', 'en': 'I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.'}}

{'translation': {'fr': 'Spectaculaire saut en "wingsuit" au-dessus de Bogota', 'en': 'Spectacular Wingsuit Jump Over Bogota'}}
{'translation': {'fr': "Le sportif Jhonathan Florez a sauté jeudi d'un hélicoptère au-dessus de Bogota, la capitale colombienne.", 'en': 'Sportsman Jhonathan Florez jumped from a helicopter above Bogota, the capital of Colombia, on Thursday.'}}

{'translation': {'fr': "Une stratégie républicaine pour contrer la réélection d'Obama", 'en': 'A Repu

In [3]:
'''
    NOTE: Some examples may look corrupted when printed, as with "voiture\xa0?" below.
    This is only an artifact of printing: printing a list shows the repr of each string, which escapes the non-breaking space as \xa0.
    As the output of this code block shows, the string displays properly when it is printed directly.
    
    This example also shows that the data contains some inconsistencies.
    The \xa0 is present in the French text but not the English one, which results in a space between the word and the question mark.
'''
printing_issue_example = list(dataset_fr['test'].skip(3).take(1))
print(printing_issue_example)
print(printing_issue_example[0]['translation']['fr'])

[{'translation': {'fr': 'Une boîte noire dans votre voiture\xa0?', 'en': 'A black box in your car?'}}]
Une boîte noire dans votre voiture ?
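
> If the non-breaking space turns out to matter for tokenization, one option is to replace it with an ordinary space during preprocessing. The sketch below assumes that this normalization is desirable, which we have not yet decided; `normalize_spaces` is a hypothetical helper, and the example pair is copied from the output above.

```python
def normalize_spaces(text: str) -> str:
    """Replace non-breaking spaces (U+00A0) with ordinary spaces."""
    return text.replace("\xa0", " ")


# Example pair copied from the printed output above.
example = {
    "fr": "Une boîte noire dans votre voiture\xa0?",
    "en": "A black box in your car?",
}

print(repr(example["fr"]))  # repr escapes the non-breaking space as \xa0
print(example["fr"])        # printing the string directly renders it as a space

cleaned = normalize_spaces(example["fr"])
print(repr(cleaned))
```

> The English side is left untouched here because it contains no non-breaking space in this example.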


## 4c. Small dataset

*Many deep learning datasets are very large, which is helpful for training powerful models but makes debugging difficult. For your update, you will need to construct a small version of your dataset that contains 200-1000 examples and is less than 10MB. If you are working with images, video, or audio, you may need to downsample your data. If you are working with text, you may need to truncate or otherwise preprocess your data.*

*Give a specific plan for how you will create a small version of one dataset you'll use that is less than 10MB in size. Mention the current size of your dataset and how many examples it has and how those numbers inform your plan.*

> Specifying `streaming=True` when loading the dataset returns a version that can be used immediately without downloading the entire thing. Because the data is streamed, it requires no significant local space, so keeping our dataset under 10MB is trivial. The code block below shows how little memory the streamed dataset objects occupy (note that `getsizeof` reports only the size of the Python object itself, not of the data it streams). Constructing a smaller version of our dataset is also trivial: simply pass the desired number of examples to the `take` function. The streamed dataset does not expose how many examples it contains, but given the capabilities of `IterableDataset`, we do not expect this to cause an issue.

In [4]:
from sys import getsizeof # returns size in bytes

print("size of dataset_fr\t\t\t", "type: IterableDatasetDict\t", getsizeof(dataset_fr), "Bytes")
print("size of dataset_fr['train']\t\t", "type: IterableDataset:\t\t", getsizeof(dataset_fr['train']), "Bytes")
print("size of dataset_fr['train'].take(1000)\t", "type: IterableDataset\t\t", getsizeof(dataset_fr['train'].take(1000)), "Bytes")

size of dataset_fr			 type: IterableDatasetDict	 208 Bytes
size of dataset_fr['train']		 type: IterableDataset:		 56 Bytes
size of dataset_fr['train'].take(1000)	 type: IterableDataset		 56 Bytes
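
> To check the size of the small dataset itself, rather than of the streaming objects, one option is to write the selected examples to a file and measure it on disk. The following is a minimal sketch under assumptions: `save_small_dataset` and the file name are hypothetical, and a short in-memory list stands in for `dataset_fr['train']` so that the snippet runs without a network connection.

```python
import json
import os
import tempfile
from itertools import islice
from typing import Iterable


def save_small_dataset(examples: Iterable[dict], path: str, n: int) -> int:
    """Write the first n examples to a JSON Lines file and return its size in bytes."""
    with open(path, "w", encoding="utf-8") as f:
        for example in islice(examples, n):
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
    return os.path.getsize(path)


# Stand-in for the streamed training split (two real examples, repeated).
stand_in = [
    {"translation": {"fr": "Reprise de la session", "en": "Resumption of the session"}},
    {"translation": {"fr": "Une boîte noire dans votre voiture\xa0?", "en": "A black box in your car?"}},
] * 500

path = os.path.join(tempfile.mkdtemp(), "wmt14_fr_en_small.jsonl")  # hypothetical file name
size_bytes = save_small_dataset(stand_in, path, n=1000)
print(f"{size_bytes / 1e6:.3f} MB for 1000 examples")
```

> With the real data, the streamed split would be passed in place of `stand_in`, since any iterable of examples works with this helper.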


## 5. Methods

*Describe what methods you plan to use. This is a deep learning class, so you should use deep learning methods. Cite at least one or two relevant papers. What model architectures or pretrained models will you use? What loss function(s) will you use and why? How will you evaluate or visualize your model's performance?*

> For example:
> 
> This is a standard supervised learning task, and we will use `sklearn`'s Logistic Regression model to predict digit labels from their pixels. The model will contain one weight per pixel. We will train our model using Cross-Entropy loss, because...

## 6. Deliverables

*Include at least six goals that you would like to focus on over the course of the quarter. These should be nontrivial, but you should have at least one and hopefully both of your "Essential" goals done by the project update, due in mid-May. Your "Stretch" goals should be ambitious enough such that completing one is doable, but completing both this quarter is unlikely.*

### 6.1 Essential Goals
- (At least two goals here. At least one should involve getting a neural network model running.)

> For example:
>
> We will use a Logistic Regression and a Multilayer Perceptron to train and test on our MNIST data.

### 6.2 Desired Goals
- (At least two goals here. Completing these goals should be sufficient for you to say your project was a success.)

> For example:
>
> We will compare our MLP model against a pretrained Visual Transformer model that we fine-tune for this task.

### 6.3 Stretch Goals
- (At least two goals here. These should be ambitious extensions to your desired goals. You can still get full points without completing these.)
> For example:
> 
> We will conduct a manual analysis of the digits that our model gets wrong and use a GAN to create new images that help us learn a more robust classifier.


## 7. Hopes and Concerns

*What are you most excited about with this project? What parts, if any, are you nervous about?*

> For example: 
> 
> We're worried that we'll get really bored of staring at pixelated hand-written digits for hours on end.

## 8. References

*Cite the papers or sources that you used to discover your datasets and/or models, if you didn't include the citation above.*

> For example:
> 
> LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.