# DATA

## Sentiment Analysis

### Raw Data
Suppose we are building a sentiment analysis model. Here are our sentences:

1. "I love PyTorch."
2. "The movie was terrible."
3. "PyTorch makes deep learning fun!"
4. "I didn't like the food."

Each sentence has a corresponding label:

- __1__ for positive sentiment.
- __0__ for negative sentiment.

So the labels are: __[1,0,1,0]__

### Tokenization
We first need to tokenize these sentences. Here we split each one into words and drop the sentence-ending punctuation:

1. Sentence 1: ["I", "love", "PyTorch"]
2. Sentence 2: ["The", "movie", "was", "terrible"]
3. Sentence 3: ["PyTorch", "makes", "deep", "learning", "fun"]
4. Sentence 4: ["I", "didn't", "like", "the", "food"]
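
The split above can be reproduced with a small regular-expression tokenizer. This is only a minimal sketch: the `tokenize` helper and its pattern are assumptions chosen to match the token lists above (letters and apostrophes are kept, other punctuation is dropped), not a general-purpose tokenizer.

```python
import re

sentences = [
    "I love PyTorch.",
    "The movie was terrible.",
    "PyTorch makes deep learning fun!",
    "I didn't like the food.",
]

def tokenize(sentence):
    # Keep runs of letters and apostrophes so "didn't" stays one token;
    # punctuation such as "." and "!" is dropped.
    return re.findall(r"[A-Za-z']+", sentence)

tokenized = [tokenize(s) for s in sentences]
for tokens in tokenized:
    print(tokens)
```

Case is preserved, so "The" and "the" remain different tokens, as in the lists above.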


### Creating Vocab

From the sentences above we extract the unique words and map each one to an integer, which gives us a lookup table:

```python
# Lookup table: token -> integer ID (IDs start at 1)
vocab = {"I": 1, "love": 2, "PyTorch": 3, "The": 4, "movie": 5, "was": 6,
         "terrible": 7, "makes": 8, "deep": 9, "learning": 10, "fun": 11,
         "didn't": 12, "like": 13, "the": 14, "food": 15}
```

Notice that "PyTorch" and "I" each appear in two sentences, but each word gets only one vocabulary index. "The" and "the" get separate indices because the tokens are not lowercased.
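
One way to build this lookup table is to walk through the tokens in order and give each unseen token the next free integer. The `build_vocab` helper below is a hypothetical name for illustration; it assumes IDs start at 1 and follow first-appearance order, which is what the table above shows.

```python
tokenized = [
    ["I", "love", "PyTorch"],
    ["The", "movie", "was", "terrible"],
    ["PyTorch", "makes", "deep", "learning", "fun"],
    ["I", "didn't", "like", "the", "food"],
]

def build_vocab(tokenized_sentences):
    vocab = {}
    for tokens in tokenized_sentences:
        for token in tokens:
            # A token that was already seen keeps its existing ID.
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # IDs start at 1
    return vocab

vocab = build_vocab(tokenized)
print(vocab)
```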

### Converting our tokenized sentences into Sequences of Numbers

1. Sentence 1: [1, 2, 3]
2. Sentence 2: [4, 5, 6, 7]
3. Sentence 3: [3, 8, 9, 10, 11]
4. Sentence 4: [1, 12, 13, 14, 15]
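
The conversion is a plain dictionary lookup per token. A minimal sketch, assuming the vocabulary and token lists shown earlier (every token is in the vocabulary here, so no unknown-word handling is included):

```python
vocab = {"I": 1, "love": 2, "PyTorch": 3, "The": 4, "movie": 5, "was": 6,
         "terrible": 7, "makes": 8, "deep": 9, "learning": 10, "fun": 11,
         "didn't": 12, "like": 13, "the": 14, "food": 15}

tokenized = [
    ["I", "love", "PyTorch"],
    ["The", "movie", "was", "terrible"],
    ["PyTorch", "makes", "deep", "learning", "fun"],
    ["I", "didn't", "like", "the", "food"],
]

# Replace every token with its integer ID from the lookup table.
encoded = [[vocab[token] for token in tokens] for tokens in tokenized]
for sequence in encoded:
    print(sequence)
```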

Notice that these sequences have different lengths. To process data in batches, every sequence must have the same length, so we pad the shorter sequences with 0 (an index no word in the vocabulary uses) until they match the longest sentence, which has 5 tokens here.

Padded Sequences:
1. Sentence 1: [1, 2, 3, 0, 0]
2. Sentence 2: [4, 5, 6, 7, 0]
3. Sentence 3: [3, 8, 9, 10, 11]
4. Sentence 4: [1, 12, 13, 14, 15]
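
Padding can be sketched in a few lines of plain Python. `PAD_ID` and `pad_sequences` are hypothetical names for illustration; the sketch assumes zeros are appended on the right up to the length of the longest sequence, as in the list above.

```python
PAD_ID = 0  # assumed padding value; no vocabulary word uses index 0

encoded = [[1, 2, 3], [4, 5, 6, 7], [3, 8, 9, 10, 11], [1, 12, 13, 14, 15]]

def pad_sequences(sequences, pad_id=PAD_ID):
    # Pad on the right so every sequence is as long as the longest one.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

padded = pad_sequences(encoded)
for sequence in padded:
    print(sequence)
```

Here the padding length is taken over all four sentences. Padding each batch separately is also possible, in which case a batch holding only sentences 1 and 2 would be 4 tokens wide instead of 5.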

### Creating batches

We now use a batch size of 2, which means each batch contains two sentences:

#### Batch 1

```python
Input = [[1, 2, 3, 0, 0], [4, 5, 6, 7, 0]]
Labels = [1, 0]
```

#### Batch 2
```python
Input = [[3, 8, 9, 10, 11],
         [1, 12, 13, 14, 15]]
Labels = [1, 0]
```

As shown above, each batch has shape 2 by 5, that is, `[batch_size, sequence_length]`.
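
The batching step can be sketched as slicing the padded sequences and labels in steps of `batch_size`. The `make_batches` generator is a hypothetical helper written in plain Python for illustration; in a PyTorch project the lists would typically be converted to tensors and batched by a data loader instead.

```python
padded = [[1, 2, 3, 0, 0], [4, 5, 6, 7, 0], [3, 8, 9, 10, 11], [1, 12, 13, 14, 15]]
labels = [1, 0, 1, 0]

def make_batches(inputs, targets, batch_size):
    # Yield consecutive (inputs, labels) slices of size batch_size.
    for start in range(0, len(inputs), batch_size):
        end = start + batch_size
        yield inputs[start:end], targets[start:end]

batches = list(make_batches(padded, labels, batch_size=2))
for batch_inputs, batch_labels in batches:
    # Shape is (batch_size, sequence_length).
    shape = (len(batch_inputs), len(batch_inputs[0]))
    print(shape, batch_labels)
```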