# Conda
Why?  
You might have different projects that require different sets of dependencies. Installing them all in the same environment can lead to dependency conflicts.  

Ref:
* [Official Guide](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
* [Cheatsheet](https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf)

TLDR:  
Create a virtual environment
```
conda create -n my_env
```
Activate the env
```
conda activate my_env
```
Install packages
```
conda install my_package
```
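or (if the package is not available in conda; make sure pip itself is installed inside the environment, e.g. with `conda install pip`, otherwise the package may land in a different Python installation)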
```
pip install my_package
```
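
To confirm that the environment you activated is the one Python is actually running in, a small check like the following can help. This is a minimal sketch: the function name `describe_environment` is made up for illustration, and it relies on conda setting the `CONDA_DEFAULT_ENV` variable when an environment is activated.

```python
import os
import sys


def describe_environment():
    """Report which interpreter and conda environment this process uses."""
    return {
        "python_executable": sys.executable,
        "prefix": sys.prefix,
        # conda sets CONDA_DEFAULT_ENV on activation; None means no env is active
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If `conda_env` does not show `my_env`, or `python_executable` points outside the environment folder, packages you install may not be visible to your project.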


# Git / GitHub
Why?  
You want to keep track of your source code as you create new versions (so you can revert if bad things happen). You want to pull/merge different versions so you can collaborate with teammates.  

Ref:
* [Official Guide](https://docs.github.com/en/get-started/quickstart/hello-world)
* [Video Tutorial](https://www.youtube.com/watch?v=SWYqp7iY_Tc)
* [Cheatsheet](https://training.github.com/downloads/github-git-cheat-sheet/)

TLDR:  
First-time setup for Git
```
git config --global user.name "myname"
git config --global user.email "my_email_on_github@example.com"
```
Create a repo  
Create an empty repo on GitHub  
`cd` to your project folder  
Copy and run all the commands listed under "create a new repository on the command line"  
Add a `.gitignore` to avoid uploading big files  
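
To decide what belongs in `.gitignore`, it can help to list the big files in the project folder first. The script below is a minimal sketch: the helper name `find_large_files` and the 50 MB threshold are assumptions chosen for illustration, not part of Git or GitHub.

```python
import os


def find_large_files(root, limit_bytes=50 * 1024 * 1024):
    """Yield (relative_path, size_in_bytes) for files under root above the limit."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own metadata directory
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks and other non-regular entries
            size = os.path.getsize(path)
            if size > limit_bytes:
                yield os.path.relpath(path, root), size


if __name__ == "__main__":
    for rel_path, size in find_large_files("."):
        print(f"{size / 1e6:.1f} MB\t{rel_path}")
```

Each reported path (or a pattern covering it, such as a whole data folder) can then be added as its own line in `.gitignore`.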

Upload to GitHub
```
git add .
git commit -m "my commit message"
git push -u origin main
```

Download from GitHub (first time)
```
git clone https://github.com/someone/somethingsomething.git
```

Download updates from GitHub (fetches and automatically merges)
```
git pull
```

Create/merge a branch
```
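git branch branch-name      # create a new branch (does not switch to it)
git checkout branch-name    # switch to the new branch and commit your work there
git checkout main           # switch back to the branch you want to merge into
git merge branch-name       # merge branch-name into the current branch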
```

# Torch Dataset and Dataloader
Ref:
* [Official tutorial](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)
* [Documentation](https://pytorch.org/docs/stable/data.html)
* [Torchvision transforms](https://pytorch.org/vision/stable/transforms.html)
* [Huggingface tokenizers](https://huggingface.co/transformers/preprocessing.html)
* [Tokenizing Chinese (a blog in Chinese)](https://zhuanlan.zhihu.com/p/371300063)

In [1]:
# Import stuff
import torch
from torch.utils.data import Dataset, DataLoader

import glob # For listing things in a given folder
import json # For handling json files

# Image
from PIL import Image
import torchvision
from torchvision import transforms

# Text
import transformers

In [2]:
# Image example
class ImgDataset(Dataset):
    def __init__(self, img_root='./data/img/*'):
        self.paths = glob.glob(img_root)
        self.transforms = transforms.Compose([
            transforms.PILToTensor(),
            transforms.CenterCrop(300),
            transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.3)
            # Put more transforms here
            ])
        
    def __len__(self):
        return len(self.paths)
    
    def _show_img(self, img):
        img = transforms.ToPILImage()(img)
        img.show()
    
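    def __getitem__(self, index):
        path = self.paths[index]
        # Convert to RGB so every image has 3 channels and can be batched together
        image = Image.open(path).convert("RGB")

        image = self.transforms(image)

        # Uncomment the next line to display the transformed image
        # self._show_img(image)

        return image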

# collate function example:
# A collate function specifies how individual dataset entries are combined into a batch.
# Here we concatenate the images along the 0th axis instead of stacking them.
def mr_collate(batch):
    # batch is a list of samples: [img_1, img_2, ...]
    # print(batch)
    return torch.cat(batch, dim=0)
        
def make_img_loader(batch_size, shuffle=True):
    dataset = ImgDataset()
    print(dataset[0].shape)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,) #collate_fn=mr_collate)
    return loader
        
img_loader = make_img_loader(2)
for idx, batch in enumerate(img_loader):
    print(batch.shape)

torch.Size([3, 300, 300])
torch.Size([2, 3, 300, 300])
torch.Size([1, 3, 300, 300])
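
To see what enabling the commented-out `collate_fn=mr_collate` would change, the sketch below compares the default collation (which stacks samples along a new batch axis) with a concatenating collate function. It is a self-contained illustration under stated assumptions: `RandomImages` and `cat_collate` are hypothetical stand-ins, and small random tensors replace the real image files.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RandomImages(Dataset):
    """Hypothetical stand-in dataset: n random 'images' of shape [3, 8, 8]."""

    def __init__(self, n=3):
        self.items = [torch.rand(3, 8, 8) for _ in range(n)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]


def cat_collate(batch):
    # batch is a list of tensors, one per sample: [img_1, img_2, ...]
    return torch.cat(batch, dim=0)


dataset = RandomImages()
stacked = [b.shape for b in DataLoader(dataset, batch_size=2)]
concatenated = [
    b.shape for b in DataLoader(dataset, batch_size=2, collate_fn=cat_collate)
]

print(stacked)       # [torch.Size([2, 3, 8, 8]), torch.Size([1, 3, 8, 8])]
print(concatenated)  # [torch.Size([6, 8, 8]), torch.Size([3, 8, 8])]
```

With the default collation the batch size stays a separate leading axis; with concatenation it is merged into the channel axis, which is rarely what an image model expects but shows how much control `collate_fn` gives you.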



In [3]:
# Text example
# Tokenizers

text_example = "我需要去Will的Office Hour"

# WordPiece (effectively character-level for Chinese text)
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-chinese")
tokens = tokenizer.tokenize(text_example)
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))

# SentencePiece
# This mBART tokenizer is multilingual (works with multiple languages)
tokenizer = transformers.MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
tokens = tokenizer.tokenize(text_example)
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))


['我', '需', '要', '去', '[UNK]', '的', '[UNK]', '[UNK]']
[2769, 7444, 6206, 1343, 100, 4638, 100, 100]
['▁我', '需要', '去', 'Will', '的', 'Office', '▁Hour']
[13129, 2745, 1677, 211673, 43, 94833, 133250]
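
The `[UNK]` tokens in the first output are what a WordPiece tokenizer produces when it cannot build a word from pieces in its vocabulary. The sketch below is a toy version of the greedy longest-match-first idea, assuming a tiny made-up vocabulary (`toy_vocab`); it is not the actual `BertTokenizer` implementation and says nothing about the real `bert-base-chinese` vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab.

    Pieces after the first carry a '##' prefix. If any part of the word
    cannot be matched, the whole word becomes the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary, chosen only to illustrate the three cases
toy_vocab = {"[UNK]", "我", "去", "off", "##ice"}

print(wordpiece("我", toy_vocab))      # ['我']
print(wordpiece("office", toy_vocab))  # ['off', '##ice']
print(wordpiece("hour", toy_vocab))    # ['[UNK]']
```

A multilingual SentencePiece vocabulary, like the mBART one in the second output, covers both the Chinese and the English parts of the sentence, so no unknown tokens appear there.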


In [4]:
class TextDataset(Dataset):
    def __init__(self, tokenizer, path='./data/chinese_lyrics.json'):
        
        with open(path, 'r', encoding='utf-8') as f:
            self.data = json.load(f)
            # [{artist: '', title: '', lyrics:[[section],[section],...]}]
        self.tokenizer = tokenizer
            
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        # Flatten the nested sections into a single string
        lyrics = []
        for lis in self.data[index]["lyrics"]:
            lyrics += lis
        lyrics = ", ".join(lyrics)
        lyrics = self.tokenizer.bos_token + lyrics + self.tokenizer.eos_token
        
        ids = self.tokenizer(lyrics, truncation=True, padding='max_length', max_length=512)['input_ids']
        return torch.tensor(ids)

def make_text_loader(batch_size, shuffle=True):
    tokenizer = transformers.MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
    dataset = TextDataset(tokenizer)
    print(dataset[0].shape)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,) #collate_fn=mr_collate)
    return loader

text_loader = make_text_loader(16)
for idx, batch in enumerate(text_loader):
    print(batch.shape)

torch.Size([512])
torch.Size([16, 512])
torch.Size([16, 512])
torch.Size([16, 512])
torch.Size([2, 512])
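
Every item comes out with length 512 because the tokenizer is called with `truncation=True`, `padding='max_length'` and `max_length=512`, and equal lengths are what allow the default collation to stack items into a `[batch_size, 512]` tensor. The sketch below mimics that behavior on plain lists. It is a simplification: `pad_or_truncate` and the pad id `0` are hypothetical, and a real tokenizer also takes care of special tokens when it truncates.

```python
def pad_or_truncate(ids, max_length, pad_id):
    """Return a copy of ids cut or right-padded to exactly max_length entries."""
    ids = list(ids[:max_length])
    return ids + [pad_id] * (max_length - len(ids))


PAD_ID = 0  # hypothetical pad id, for illustration only

short = pad_or_truncate([5, 6, 7], max_length=5, pad_id=PAD_ID)
long = pad_or_truncate(list(range(10)), max_length=4, pad_id=PAD_ID)

print(short)  # [5, 6, 7, 0, 0]
print(long)   # [0, 1, 2, 3]
```

Without a fixed length, items of different sizes could not be stacked, and you would need a custom `collate_fn` that pads each batch itself.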
