# Representing text

Some models for natural language repeatedly consume a combination of new input and previous model output. These models are called recurrent neural networks (RNNs).
Our goal in this section is to turn text into something a neural network can process: a tensor of numbers.

### 4.5.1 Converting text to numbers

In [1]:
import torch
import numpy as np

In [3]:
# Load the full text of the novel; adjust the path to where your copy of the file lives
with open('D:/AI/data/p1ch4d/jane-austen/1342-0.txt', encoding='utf8') as f:
    text = f.read()

### 4.5.2 One-hot-encoding characters

We first split our text into a list of lines and pick an arbitrary line to focus on:

In [4]:
lines = text.split('\n')
line = lines[200]
line

'“Impossible, Mr. Bennet, impossible, when I am not acquainted with him'

Let’s create a tensor that can hold the total number of one-hot-encoded characters for
the whole line:

In [5]:
letter_t = torch.zeros(len(line), 128)  # 128 because ASCII has 128 distinct characters
letter_t.shape

torch.Size([70, 128])

In [7]:
# One-hot encode: set a 1 at each character's index; every other entry stays 0
for i, letter in enumerate(line.lower().strip()):
    # Non-ASCII characters, such as the curly quotes, all map to index 0
    letter_index = ord(letter) if ord(letter) < 128 else 0
    letter_t[i][letter_index] = 1
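
The loop above sets one entry at a time. As an alternative sketch (not part of the original notebook), the same encoding can be built from a tensor of character indices with `torch.nn.functional.one_hot`. The sample string and the names `sample`, `indices`, and `encoded` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative sample; the curly quote is non-ASCII and falls back to index 0.
sample = '“Impossible, Mr. Bennet'

# Same index rule as the loop above: ASCII code point, or 0 for anything else.
indices = torch.tensor(
    [ord(ch) if ord(ch) < 128 else 0 for ch in sample.lower().strip()]
)

# one_hot returns an integer tensor; cast to float to match torch.zeros(...).
encoded = F.one_hot(indices, num_classes=128).float()

print(encoded.shape)       # one row per character, 128 columns
print(encoded.sum(dim=1))  # each row sums to 1 because it holds a single 1
```

Because every non-ASCII character shares index 0, this encoding cannot tell such characters apart. That is a limitation of the 128-column choice, not of the one-hot technique.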

### 4.5.3 One-hot encoding whole words

In [8]:
def clean_words(input_str):
    # Characters to strip from both ends of each word
    punctuation = '.,;:"!?”“_-'
    # Lowercase, treat newlines as spaces, and split on whitespace
    word_list = input_str.lower().replace('\n',' ').split()
    word_list = [word.strip(punctuation) for word in word_list]
    return word_list

words_in_line = clean_words(line)
line, words_in_line

('“Impossible, Mr. Bennet, impossible, when I am not acquainted with him',
 ['impossible',
  'mr',
  'bennet',
  'impossible',
  'when',
  'i',
  'am',
  'not',
  'acquainted',
  'with',
  'him'])
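
Note that `str.strip(punctuation)` removes the listed characters only from the two ends of each word, so punctuation inside a word survives, and a token made up entirely of punctuation becomes an empty string. A small sketch of that behavior follows; the sample sentence is made up for illustration and is not taken from the novel.

```python
def clean_words(input_str):
    punctuation = '.,;:"!?”“_-'
    word_list = input_str.lower().replace('\n', ' ').split()
    return [word.strip(punctuation) for word in word_list]

# Hypothetical sample chosen to exercise the edge cases.
sample = '“Well-known,” she said; _quite_ so -- indeed!'
cleaned = clean_words(sample)
print(cleaned)
# The inner hyphen in 'well-known' is kept, while the standalone '--'
# is stripped down to an empty string.
```

If empty tokens are unwanted, they can be filtered out with a condition such as `if word` in the list comprehension. The notebook's version does not do this.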

Next, let’s build a mapping of words to indexes in our encoding. The resulting word2index_dict is a dictionary with words as keys and integers as values. We will use it to efficiently find the index of a word as we one-hot encode it:

In [9]:
word_list = sorted(set(clean_words(text)))
word2index_dict = {word: i for (i, word) in enumerate(word_list)}
len(word2index_dict), word2index_dict['impossible']

(7261, 3394)

In [10]:
word_t = torch.zeros(len(words_in_line), len(word2index_dict))
for i, word in enumerate(words_in_line):
    word_index = word2index_dict[word]
    word_t[i][word_index] = 1
    print('{:2} {:4} {}'.format(i, word_index, word))
print(word_t.shape)

 0 3394 impossible
 1 4305 mr
 2  813 bennet
 3 3394 impossible
 4 7078 when
 5 3315 i
 6  415 am
 7 4436 not
 8  239 acquainted
 9 7148 with
10 3215 him
torch.Size([11, 7261])
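
Looking up `word2index_dict[word]` raises a `KeyError` for any word that did not appear in the text used to build the vocabulary. One common workaround, sketched here as an assumption rather than something the notebook does, is to reserve an extra index for unknown words. The miniature vocabulary, the `<unk>` token name, and the `encode` helper are all hypothetical.

```python
# Hypothetical miniature vocabulary standing in for word2index_dict.
word_list = sorted({'impossible', 'mr', 'bennet', 'when', 'i', 'am'})
word2index = {word: i for i, word in enumerate(word_list)}

# Reserve one extra index for words missing from the vocabulary.
UNK = '<unk>'
word2index[UNK] = len(word2index)

def encode(words):
    """Map each word to its index, falling back to the <unk> index."""
    return [word2index.get(word, word2index[UNK]) for word in words]

# 'darcy' is not in the miniature vocabulary, so it maps to the <unk> index.
indices = encode(['mr', 'darcy', 'am', 'impossible'])
print(indices)
```

With this scheme the one-hot tensor needs `len(word2index)` columns, one more than the number of distinct words in the vocabulary.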


### 4.5.4 Text embeddings

A vector of, say, 100 floating-point numbers can indeed represent a large number of words. The trick is to find an effective way to map individual words into this 100-dimensional space in a way that facilitates downstream learning. This is called an embedding. An ideal solution would be to generate the embedding in such a way that words used in similar contexts map to nearby regions of the embedding.
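
Mechanically, an embedding can be viewed as a lookup table with one row per vocabulary entry. The following is a minimal sketch using `torch.nn.Embedding`, which the text above does not itself use. The vocabulary size matches the 7261 words found earlier and the 100 dimensions follow the example in the prose. The weights here are randomly initialized rather than learned, so nearby vectors carry no meaning yet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embedding_dim = 7261, 100
embedding = nn.Embedding(vocab_size, embedding_dim)

# Indices of 'impossible', 'mr', and 'bennet' from the earlier output.
indices = torch.tensor([3394, 4305, 813])

# Lookup: one 100-dimensional row per index.
dense = embedding(indices)
print(dense.shape)

# Multiplying one-hot rows by the weight matrix selects the same rows.
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
same = torch.allclose(one_hot @ embedding.weight, dense)
print(same)
```

The comparison illustrates why an embedding can replace a one-hot encoding: it yields the same rows as a matrix product with the one-hot tensor, without ever materializing the 7261-column rows.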