<a href="https://colab.research.google.com/github/nikxlvii/pytorch/blob/main/Tensor_Manipulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## This is a practice notebook where I manipulate tensors and experiment with different data types. Basically, I'm fiddling around in the data pre-processing stage.

# Working with Images

An image is represented as a collection of scalars arranged in a regular grid with a
height and a width (in pixels). We might have a single scalar per grid point (the
pixel), which gives a grayscale image; or multiple scalars per grid
point, which typically represent different colors.
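
As a small illustration of the two cases, the sketch below builds a grayscale and a colour image as NumPy arrays. The 4 x 6 size and the variable names are made up for the example and have nothing to do with the image loaded later.

```python
import numpy as np

# Hypothetical 4 x 6 images, invented purely for illustration.
height, width = 4, 6

# Grayscale: one scalar (intensity) per pixel -> shape H x W
gray = np.zeros((height, width), dtype=np.uint8)

# Colour: three scalars (red, green, blue) per pixel -> shape H x W x C
rgb = np.zeros((height, width, 3), dtype=np.uint8)
rgb[0, 0] = [255, 0, 0]  # make the top-left pixel pure red

print(gray.shape)  # (4, 6)
print(rgb.shape)   # (4, 6, 3)
```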

In [None]:
import imageio as im

img_arr = im.imread('oppie.jpeg')


In [None]:
img_arr.shape # the output is in the form of H x W x C

(2910, 5174, 3)

In [None]:
# Note: PyTorch expects image tensors laid out as C x H x W, so an H x W x C array has to be permuted first.

In [None]:
import torch

img = torch.from_numpy(img_arr)
out = img.permute(2,0,1) # not in place: returns a view that shares storage with img
out.shape

torch.Size([3, 2910, 5174])
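
A quick way to see what `permute` does with the data is to write to the original tensor and read the value back through the permuted one. This is a minimal sketch on a small zero tensor standing in for the image; the 4 x 6 x 3 size is arbitrary.

```python
import torch

# Stand-in for an image tensor in H x W x C layout (sizes are arbitrary).
img = torch.zeros(4, 6, 3, dtype=torch.uint8)

out = img.permute(2, 0, 1)  # C x H x W view of the same data

# Writing through the original is visible in the permuted view,
# because both tensors share one underlying storage.
img[1, 2, 0] = 255

print(img.shape)             # torch.Size([4, 6, 3])
print(out.shape)             # torch.Size([3, 4, 6])
print(out[0, 1, 2].item())   # 255
print(out.is_contiguous())   # False
```

No pixel data is copied here. If a contiguous C x H x W copy is needed later, calling `.contiguous()` on the result creates one.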

In [None]:
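batch_size = 3
# Assumes every image in the folder is 256 x 256 RGB; adjust H and W to match the data.
batch = torch.zeros(batch_size, 3, 256, 256, dtype=torch.uint8)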

In [None]:
import os

data_dir = 'oppie_img_data/'
filenames = [name for name in os.listdir(data_dir) if os.path.splitext(name)[-1] == '.png']

for i, filename in enumerate(filenames):
  img_arr = im.imread(os.path.join(data_dir,filename))
  img_t = torch.from_numpy(img_arr)
  img_t = img_t.permute(2,0,1)
  img_t = img_t[:3]
  batch[i] = img_t
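
The loop above only works if `batch` has one C x H x W slot per image. Here is a minimal sketch of the same pattern that runs without the image folder, using random arrays as stand-ins for the files. The 8 x 8 size and the extra alpha channel are assumptions made for the example.

```python
import numpy as np
import torch

batch_size, height, width = 3, 8, 8

# One N x C x H x W tensor that holds the whole batch.
batch = torch.zeros(batch_size, 3, height, width, dtype=torch.uint8)

rng = np.random.default_rng(0)
for i in range(batch_size):
    # Fake H x W x 4 image (RGBA), as a PNG with an alpha channel would load.
    img_arr = rng.integers(0, 256, size=(height, width, 4), dtype=np.uint8)
    img_t = torch.from_numpy(img_arr).permute(2, 0, 1)  # H x W x C -> C x H x W
    batch[i] = img_t[:3]  # keep the RGB channels, drop alpha

print(batch.shape)  # torch.Size([3, 3, 8, 8])
```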

# Working with Tabular Data

Tabular data is usually heterogeneous: different columns can hold different types.
PyTorch tensors, on the other hand, are homogeneous. Information in PyTorch is
encoded as numbers, typically floating-point (though integer and Boolean types
are supported as well). This numeric encoding is deliberate, since neural
networks are mathematical entities that take real numbers as inputs and produce real
numbers as output through successive application of matrix multiplications and
nonlinear functions.
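
To make "homogeneous" concrete, here is a small sketch. The `colours` column and its mapping are hypothetical (the wine CSV used below is all numeric) and only show that non-numeric values need to be converted to numbers first.

```python
import torch

# Mixed Python ints and floats end up in one common dtype.
row = torch.tensor([7, 0.27, 3])
print(row.dtype)  # torch.float32

# Integer and Boolean tensors are also available.
counts = torch.tensor([1, 2, 3])
flags = torch.tensor([True, False])
print(counts.dtype, flags.dtype)  # torch.int64 torch.bool

# A categorical column (hypothetical) has to be mapped to numbers
# before it can be stored in a tensor.
colours = ['white', 'red', 'white']
colour_to_index = {'red': 0, 'white': 1}
colour_t = torch.tensor([colour_to_index[c] for c in colours])
print(colour_t)  # tensor([1, 0, 1])
```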

In [None]:
import csv
wine_path = 'winequality-white.csv'

In [None]:
# Instead of pandas, let's import this using numpy

In [None]:
import numpy as np
# skiprows=1 skips the header row with the column names
wine_num = np.loadtxt(wine_path, dtype=np.float32, delimiter=';', skiprows=1)

In [None]:
wine_num

array([[ 7.  ,  0.27,  0.36, ...,  0.45,  8.8 ,  6.  ],
       [ 6.3 ,  0.3 ,  0.34, ...,  0.49,  9.5 ,  6.  ],
       [ 8.1 ,  0.28,  0.4 , ...,  0.44, 10.1 ,  6.  ],
       ...,
       [ 6.5 ,  0.24,  0.19, ...,  0.46,  9.4 ,  6.  ],
       [ 5.5 ,  0.29,  0.3 , ...,  0.38, 12.8 ,  7.  ],
       [ 6.  ,  0.21,  0.38, ...,  0.32, 11.8 ,  6.  ]], dtype=float32)

In [None]:
wineq = torch.from_numpy(wine_num)

In [None]:
wineq.dtype, wineq.shape

(torch.float32, torch.Size([4898, 12]))

In [None]:
data = wineq[:,:-1]
data, data.shape

(tensor([[ 7.0000,  0.2700,  0.3600,  ...,  3.0000,  0.4500,  8.8000],
         [ 6.3000,  0.3000,  0.3400,  ...,  3.3000,  0.4900,  9.5000],
         [ 8.1000,  0.2800,  0.4000,  ...,  3.2600,  0.4400, 10.1000],
         ...,
         [ 6.5000,  0.2400,  0.1900,  ...,  2.9900,  0.4600,  9.4000],
         [ 5.5000,  0.2900,  0.3000,  ...,  3.3400,  0.3800, 12.8000],
         [ 6.0000,  0.2100,  0.3800,  ...,  3.2600,  0.3200, 11.8000]]),
 torch.Size([4898, 11]))

In [None]:
target = wineq[:,-1]
target, target.shape

(tensor([6., 6., 6.,  ..., 6., 7., 6.]), torch.Size([4898]))

# Working with Text

This section touches on NLP. The process is quite simple.

1. Encoding Characters: First we split the text into individual lines, then create a tensor to hold the one-hot encoding of each character in a line. ASCII defines 128 characters (codes 0 to 127), so each character gets a row of 128 entries. We then set a 1 in each row at the position given by the character's ASCII code.

2. Encoding Words: To encode words, we have to establish a vocabulary and one-hot encode words along the rows of our tensor. There's a better alternative to one-hot encoding, known as embedding, which I'll do after this. In this method, we first create a function to clean the words (lowercase them and strip punctuation). Then we choose a line and extract its words into `words_in_line`. Finally, we build a vocabulary from all the words in the text, map each word to an index, and set a 1 in each row of the tensor at the index of the corresponding word from the line.

In [20]:
import torch

In [6]:
with open('jane_austen.txt') as f:
  text = f.read()

In [21]:
# Splitting the text into lines

lines = text.split('\n')
line = lines[200]
line

'“Impossible, Mr. Bennet, impossible, when I am not acquainted with him'

In [22]:
# Encoding characters
letter_t = torch.zeros(len(line),128)

In [23]:
letter_t.shape

torch.Size([70, 128])

In [24]:
for i, letter in enumerate(line.lower().strip()):
  letter_index = ord(letter) if ord(letter) < 128 else 0
  letter_t[i][letter_index] = 1
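
One way to sanity-check a one-hot tensor like this is to confirm that every row contains exactly one 1 and that `argmax` recovers the character codes. The sketch below repeats the encoding on a short stand-in string so that it runs on its own; the string is an assumption, not a line taken from the file.

```python
import torch

line = 'Mr. Bennet'  # short stand-in for a line of the text

# Same encoding as above: one row per character, one column per ASCII code.
letter_t = torch.zeros(len(line), 128)
for i, letter in enumerate(line.lower()):
    letter_index = ord(letter) if ord(letter) < 128 else 0
    letter_t[i][letter_index] = 1

# Each row should sum to exactly 1.
row_sums = letter_t.sum(dim=1)
print(row_sums)

# The position of the 1 in each row is the ASCII code of the character.
decoded = ''.join(chr(int(j)) for j in letter_t.argmax(dim=1))
print(decoded)  # mr. bennet
```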

In [27]:
# Encoding words

def clean_words(input_str):
  punctuation = '.,;:"!?”“_-'
  word_list = input_str.lower().replace('\n',' ').split()
  word_list = [word.strip(punctuation) for word in word_list]
  return word_list

In [28]:
words_in_line = clean_words(line)
line,words_in_line

('“Impossible, Mr. Bennet, impossible, when I am not acquainted with him',
 ['impossible',
  'mr',
  'bennet',
  'impossible',
  'when',
  'i',
  'am',
  'not',
  'acquainted',
  'with',
  'him'])

In [33]:
word_list = sorted(set(clean_words(text)))
len(word_list)

7261

In [34]:
word2index_dict = {word : i for (i,word) in enumerate(word_list)}

In [37]:
word_t = torch.zeros(len(words_in_line), len(word2index_dict))
for i, word in enumerate(words_in_line):
  word_index = word2index_dict[word]
  word_t[i][word_index] = 1
  print('{:2} {:4} {}'.format(i, word_index, word))
  print(word_t.shape)

 0 3394 impossible
torch.Size([11, 7261])
 1 4305 mr
torch.Size([11, 7261])
 2  813 bennet
torch.Size([11, 7261])
 3 3394 impossible
torch.Size([11, 7261])
 4 7078 when
torch.Size([11, 7261])
 5 3315 i
torch.Size([11, 7261])
 6  415 am
torch.Size([11, 7261])
 7 4436 not
torch.Size([11, 7261])
 8  239 acquainted
torch.Size([11, 7261])
 9 7148 with
torch.Size([11, 7261])
10 3215 him
torch.Size([11, 7261])


The problem with one-hot encoding shows up when the number of items to encode is large. We just encoded a vocabulary of 7,261 words, so every word takes a row of 7,261 entries with a single 1 in it. Since there's a better way to do this, we will now move on to embeddings instead of one-hot encoding.
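
As a rough preview of the idea, the sketch below looks up the word indices printed above in a `torch.nn.Embedding` table. The embedding size of 16 is an arbitrary assumption. The vectors are randomly initialised, so they carry no meaning until they are trained as part of a model.

```python
import torch
import torch.nn as nn

vocab_size = 7261    # number of distinct words found above
embedding_dim = 16   # assumed; chosen only for illustration

# Indices printed above for the words of the example line.
word_indices = torch.tensor(
    [3394, 4305, 813, 3394, 7078, 3315, 415, 4436, 239, 7148, 3215]
)

# Lookup table with one dense row of 16 floats per vocabulary word.
embedding = nn.Embedding(vocab_size, embedding_dim)
word_vectors = embedding(word_indices)

print(word_vectors.shape)  # torch.Size([11, 16])

# 'impossible' appears at positions 0 and 3 and maps to the same row.
print(torch.equal(word_vectors[0], word_vectors[3]))  # True
```

Each word is now described by 16 numbers instead of 7,261, and the table itself is a learnable parameter.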