- Python 3.7.7 and PyTorch:
  - torch==1.7.1+cu110 torchaudio==0.7.2 torchvision==0.8.2+cu110
  - On Colab: torch==1.10.0+cu111 torchaudio==0.10.0+cu111 torchvision==0.11.1+cu111
- More details in requirement.txt
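A quick check that the installed packages match the pinned versions above (a minimal sketch, not part of the provided scripts):

```python
# Print the installed versions; expect the local or Colab pins listed above.
import torch, torchaudio, torchvision
print(torch.__version__, torchaudio.__version__, torchvision.__version__)
print(torch.cuda.is_available())  # the +cu110/+cu111 builds assume a CUDA GPU
```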
-
Task - Income high or low
Problem: binary classification, predicting whether an individual's income exceeds $50,000.
Method: logistic regression (linear classifier), linear discriminant (generative model)
Dataset: This dataset is obtained by removing unnecessary attributes and balancing the ratio between positively and negatively labeled data in the Census-Income (KDD) Data Set from the UCI Machine Learning Repository. It contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau, with 41 demographics- and employment-related variables.
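As a sketch of the linear-classifier approach named above, a from-scratch logistic regression trained with gradient descent could look like the following (a feature matrix `X` and labels `y` in {0, 1} are assumed; this is not the repo's actual training code):

```python
# Minimal logistic-regression sketch for the binary income task
# (hypothetical variable names; X has shape (N, d), y is 0/1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)           # predicted P(income > 50K)
        grad_w = X.T @ (p - y) / len(y)  # gradient of the cross-entropy loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# usage: w, b = train_logistic_regression(X_train, y_train)
#        y_pred = (sigmoid(X_test @ w + b) > 0.5).astype(int)
```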
-
Task - Food Classification
The Food-11 dataset consists of food photos collected from the web, covering 11 classes:
- Training set: 9,866 images
- Validation set: 3,430 images
- Testing set: 3,347 images
The 11 categories are: Bread, Dairy product (e.g., cheese, milk, butter), Dessert, Egg, Fried food, Meat, Noodles/Pasta, Rice, Seafood, Soup, and Vegetable/Fruit.
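A minimal sketch of how an 11-class image classifier could be set up in PyTorch (hypothetical layer sizes and a 128x128 input resolution; not the actual model used for this task):

```python
# Small CNN sketch for 11-way food classification (illustrative only).
import torch
import torch.nn as nn

class FoodCNN(nn.Module):
    def __init__(self, num_classes=11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 128 -> 64
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 64 -> 32
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 32 -> 16
        )
        self.classifier = nn.Linear(128 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# usage: logits = FoodCNN()(torch.randn(8, 3, 128, 128))  # -> shape (8, 11)
```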
-
Task - Text Sentiment Classification
The data consists of tweets collected from Twitter; each tweet is labeled as positive (1) or negative (0), e.g.:
thanks! i love the color selectors. btw, that's a great way to search and list. (LABEL=1)
I feel icky, i need a hug. (LABEL=0)
In addition to the labeled data, roughly 1.2 million unlabeled tweets are also provided:
- Labeled training data: 200,000
- Unlabeled training data: 1,200,000
- Testing data: 200,000 (100,000 public, 100,000 private)
Preprocessing the sentences:
- First, build a dictionary whose entries map each word to its index value.
- e.g., "I have a pen." → [1, 2, 3, 4]; "I have an apple." → [1, 2, 5, 6],
  where {'I': 1, 'have': 2, 'a': 3, 'pen': 4, 'an': 5, 'apple': 6}
- Use word embeddings to represent each word, i.e., represent the meaning of a word (or phrase) with a vector.
- Use a bag-of-words (BOW) representation to obtain a vector for the whole sentence (see the sketch below).
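A toy sketch of the preprocessing steps above (building the word-to-index dictionary, converting sentences to index sequences, and forming a bag-of-words vector); a learned word embedding such as `torch.nn.Embedding` would instead map each index to a vector:

```python
# Toy preprocessing sketch; variable names are hypothetical.
sentences = [["I", "have", "a", "pen"], ["I", "have", "an", "apple"]]

# 1. Build the dictionary mapping each word to an index value.
word2index = {}
for sent in sentences:
    for word in sent:
        if word not in word2index:
            word2index[word] = len(word2index) + 1
# word2index == {'I': 1, 'have': 2, 'a': 3, 'pen': 4, 'an': 5, 'apple': 6}

# 2. Convert each sentence to its index sequence.
indexed = [[word2index[w] for w in sent] for sent in sentences]
# indexed == [[1, 2, 3, 4], [1, 2, 5, 6]]

# 3. Bag of words: count how many times each vocabulary word appears.
def bag_of_words(sent):
    vec = [0] * (len(word2index) + 1)  # index 0 unused / reserved for padding
    for w in sent:
        vec[word2index[w]] += 1
    return vec

print(bag_of_words(sentences[0]))  # [0, 1, 1, 1, 1, 0, 0]
```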
Semi-supervised Learning:
- Use the unlabeled data to help train the model, e.g., self-training (see the sketch below).
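A rough sketch of self-training with pseudo-labels, assuming hypothetical helpers `train()` and `predict_proba()` (not the repo's semi-supervised code):

```python
# Self-training sketch: add confidently pseudo-labeled examples and retrain.
import numpy as np

def self_training(X_labeled, y_labeled, X_unlabeled, train, predict_proba,
                  threshold=0.9, rounds=3):
    X, y = X_labeled.copy(), y_labeled.copy()
    for _ in range(rounds):
        model = train(X, y)
        proba = predict_proba(model, X_unlabeled)      # P(label = 1) per unlabeled example
        confident = (proba > threshold) | (proba < 1 - threshold)
        if not confident.any():
            break
        pseudo = (proba[confident] > 0.5).astype(int)  # pseudo-labels for confident examples
        X = np.concatenate([X, X_unlabeled[confident]])
        y = np.concatenate([y, pseudo])
        X_unlabeled = X_unlabeled[~confident]
    return train(X, y)
```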
-
Start PyTorch
Reference:
- start_Pytorch.py
- start_dataloader.py
- start_buildnn.py
- start_autograd.py
- start_optimizer.py
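For orientation, a tiny sketch that touches the same building blocks the starter scripts cover (a DataLoader, a small network, autograd, and an optimizer step); it is illustrative only, not a copy of those scripts:

```python
# Minimal PyTorch training loop on random data.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(100, 4)
y = torch.randn(100, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)  # forward pass
        loss.backward()                # autograd computes gradients
        optimizer.step()               # optimizer updates the parameters
```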