# NLP: NumPy basics

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
np.set_printoptions(precision=2, suppress=True)

### Numerical data

Tables and matrices are popular formats for storing data.  When all entries are numerical, computers can process them directly.  Categorical data and text data, however, usually have to be transformed into numerical data (vectors) before a model can be trained on them.

| student \ subject | A | B | C | D | E | decision | comments |
|----|----|----|----|----|----|----|----|
| 1 | 10 | 10 | 10 | 10 | 10 | accept | good |
| 2 | 10 | 10 | 10 | 10 | 0 | accept | so so |
| 3 | 0 | 0 | 15 | 0 | 0 | decline | need improvement |

**Terminology**:  

- sample: each row
- feature: each column
- vector: a list of numbers (one-dimensional array)
- matrix: numbers stored in rectangular form (two-dimensional array)
- array: numbers stored in a grid with any number of dimensions (vectors and matrices are the one- and two-dimensional cases)
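
To connect these terms to code, here is a minimal sketch that stores the numerical part of the table above as a NumPy array.  The names `scores`, `decisions`, and `labels`, and the choice of encoding accept as 1 and decline as 0, are assumptions made for illustration only.

```python
import numpy as np

# Scores from the table above: one row per student, one column per subject A-E.
scores = np.array([
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 0],
    [0, 0, 15, 0, 0],
])

# Hypothetical encoding of the categorical "decision" column:
# accept -> 1, decline -> 0.
decisions = ["accept", "accept", "decline"]
labels = np.array([1 if d == "accept" else 0 for d in decisions])

print(scores.shape)   # (3, 5): 3 samples, 5 features
print(scores[0])      # one sample: all scores of student 1
print(scores[:, 2])   # one feature: subject C for every student
print(labels)
```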

### Count vector: an example

One way to transform a document into a vector is simply to count how many times each word occurs.

In [None]:
# Generated by ChatGPT with the question:
# Describe "XXX" in simple English using less than 100 words.
cat = "A cat is a small furry animal with sharp claws and a long tail. They have soft fur that comes in different colors like black, white, or orange. Cats are known for their agility and ability to climb trees. They are independent creatures but can also be friendly and enjoy human companionship. They communicate through meowing and purring. Cats are often kept as pets and are loved for their playfulness and ability to catch mice."
dog = "Dogs are loyal and friendly animals that love to be around people. They come in different sizes and colors, but all dogs have fur and a wagging tail. They enjoy playing fetch, going for walks, and cuddling with their owners. Dogs are known for their keen sense of smell and hearing, which makes them great companions and protectors. They communicate through barking, wagging their tails, and using body language. Dogs are known for their unconditional love and can be part of a family, bringing happiness and companionship to their human friends."
rat = "Rats are small animals with fur and long tails. They are known for their quickness and agility. Rats can be found in various colors, such as brown, gray, or black. They are often seen scavenging for food and can squeeze through small spaces. While some people may consider them pests, rats are intelligent creatures that can learn tricks and solve problems. They have a keen sense of smell and are good at finding food. Rats are social animals and often live in groups called colonies. Despite their negative reputation, rats play a vital role in ecosystems."
sun = "The sun is a big, bright ball of light in the sky. It gives us warmth and helps plants grow. The sun rises in the morning and sets in the evening, giving us daylight. It shines during the day, making everything around us visible. The sun is very far away from us but still feels close because it is so bright. It provides us with energy and makes the world a brighter and happier place. When we feel its rays on our skin, it can feel warm and comforting. The sun is an essential part of our lives and the Earth's ecosystem."
corpus = [cat, dog, rat, sun]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(stop_words='english')
X = cvec.fit_transform(corpus).toarray()

In [None]:
# show 10 columns only
X[:,:10]

In [None]:
# better format by pandas
import pandas as pd
df = pd.DataFrame(X)
df.columns = cvec.get_feature_names_out()
df
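
The counting idea itself can be sketched without scikit-learn.  The example below uses two tiny made-up documents (not the corpus above) and a simplistic whitespace tokenizer; `CountVectorizer` does more than this, including its own tokenization and, as configured above, the removal of English stop words.

```python
from collections import Counter

# Two tiny hypothetical documents (not the corpus above).
docs = ["cats chase mice", "dogs chase cats and cats"]

# Simplistic tokenization: lowercase and split on whitespace.
tokens = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all words; each word gets one column.
vocab = sorted({word for doc in tokens for word in doc})

# Each document becomes a row of counts, one entry per vocabulary word.
counts = [[Counter(doc)[word] for word in vocab] for doc in tokens]

print(vocab)
print(counts)
```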

### Shape

NumPy is a Python package that handles arrays.  The shape of an array tells how many entries are stored in each dimension.  

In [None]:
arr = np.zeros((3,4)) # try (3,4,5)
print(arr.shape)
arr

You may use `reshape` to change the shape, as long as the total number of entries stays the same.  

In [None]:
arr = np.arange(12)
arr

In [None]:
# look carefully at the number of brackets
arr = np.arange(12)
arr.reshape(3,4)
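
A short sketch of two related behaviors of `reshape`: passing `-1` lets NumPy infer one dimension, and a shape that does not match the number of entries raises an error.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the total number of entries.
print(arr.reshape(3, -1).shape)    # (3, 4)
print(arr.reshape(2, 2, 3).shape)  # (2, 2, 3)

# 12 entries cannot be arranged into 5 equal rows, so this raises ValueError.
try:
    arr.reshape(5, -1)
except ValueError as err:
    print("reshape failed:", err)
```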

Use `zeros_like` to create an array of zeros with the same shape and data type as a given array.  

In [None]:
arr = np.arange(12)
arr = arr.reshape(3,4)
print(arr.dtype)
z_arr = np.zeros_like(arr)
print(z_arr.dtype)
z_arr
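
Because the data type is copied as well, an integer template gives an integer array of zeros.  The sketch below, with the illustrative names `z_int` and `z_float`, shows what that means when a fractional value is stored, and how the `dtype` argument of `zeros_like` overrides it.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)          # integer array

z_int = np.zeros_like(arr)                 # same shape, integer dtype
z_float = np.zeros_like(arr, dtype=float)  # same shape, float dtype

# Storing 0.5 in an integer array drops the fractional part.
z_int[0, 0] = 0.5
z_float[0, 0] = 0.5

print(z_int[0, 0], z_float[0, 0])
```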

### Selection and Slicing

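When `arr` is an array, use `arr[i]` to select the entry at index `i` (for a two-dimensional array, this is row `i`), and `arr[i,j]` to select the entry in row `i` and column `j`.  Indices start at 0.  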

In [None]:
arr = np.array([1,2,3])
print(arr)
print(arr[2])

In [None]:
arr = np.arange(12).reshape(3,4)
print(arr)
print(arr[2])

In [None]:
arr = np.arange(12).reshape(3,4)
print(arr)
print(arr[2,3])

Instead of selecting a single index `i`, you may use `a:b` to **slice** all entries from index `a` to index `b - 1`.  A bare `:` selects everything along that dimension.

In [None]:
arr = np.arange(12).reshape(3,4)
print(arr)
print(arr[:, 1:3])
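
A few more selection patterns on the same array, as a quick sketch: a whole row, a whole column, a negative index, and a slice with a step.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

print(arr[1])        # row 1: [4 5 6 7]
print(arr[:, 0])     # column 0: [0 4 8]
print(arr[:2, -1])   # last column of the first two rows: [3 7]
print(arr[::2])      # every second row: rows 0 and 2
```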

### Universal and aggregate functions

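An operation that is applied to each entry separately is called **universal**.  An operation that combines many entries into a single value, such as `sum`, is called **aggregate**.  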

In [None]:
arr = np.arange(12).reshape(3,4)
arr * 2

In [None]:
arr = np.arange(12).reshape(3,4)
arr.sum(axis=0) # try .sum(axis=1) or .sum()
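
The two kinds of operations are often combined.  As a sketch, the made-up count matrix `counts` below is turned into word frequencies: an aggregate `sum` gives the total per document, and a universal division rescales every entry.

```python
import numpy as np

# Hypothetical word counts: 2 documents (rows), 3 words (columns).
counts = np.array([[2, 1, 1],
                   [0, 3, 1]])

# Aggregate: total number of words in each document (one value per row).
totals = counts.sum(axis=1, keepdims=True)   # shape (2, 1)

# Universal: the division is applied entry by entry; broadcasting
# stretches the (2, 1) totals across the columns.
freq = counts / totals

print(totals.ravel())
print(freq)
```

Here `keepdims=True` keeps `totals` as a column, so each row of `counts` is divided by its own total.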

### Distances and similarity

Distance measures how close two vectors are.  

When ${\bf x} = (x_1, x_2, \ldots, x_n)$ and ${\bf y} = (y_1, y_2, \ldots, y_n)$, the **distance** between them is  

$$
    \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}.
$$

This is the **$\ell_2$-norm** of the difference ${\bf x} - {\bf y}$, also known as the Euclidean distance.

**Intuition**:  The distance accumulates the differences in each entry: square each difference, add the squares up, and take the square root of the sum.  

**Interpretation**:  

- distance $\geq 0$
- Zero: two points are the same.  
- Higher value means farther.

In [None]:
a = np.array([1,2,3])
b = np.array([2,2,2])
dist = np.sqrt(np.sum((a - b) ** 2))
dist

In [None]:
# same as np.linalg.norm
a = np.array([1,2,3])
b = np.array([2,2,2])
np.linalg.norm(a - b)
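
As a worked check of the formula by hand, take the same vectors ${\bf a} = (1, 2, 3)$ and ${\bf b} = (2, 2, 2)$ used in the cells above:

$$
    \sqrt{(1 - 2)^2 + (2 - 2)^2 + (3 - 2)^2} = \sqrt{1 + 0 + 1} = \sqrt{2} \approx 1.414.
$$

Both cells should return this value.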

Another way to measure how close two vectors are is the cosine similarity, which is particularly natural for points on the unit sphere ($\|{\bf x}\| = 1$).  For example, word vectors produced by the `word2vec` algorithm are often normalized to unit length before they are compared.  

When ${\bf x} = (x_1, x_2, \ldots, x_n)$ and ${\bf y} = (y_1, y_2, \ldots, y_n)$, the **cosine similarity** between them is  

$$
    \cos\theta = \frac{{\bf x}\cdot{\bf y}}{\|{\bf x}\|\|{\bf y}\|},
$$

where ${\bf x}\cdot{\bf y} = x_1y_1 + x_2y_2 + \cdots + x_ny_n$ is the **inner product** or the **dot product**.

**Intuition**:  The cosine of the angle $\theta$ between the two vectors.

**Interpretation**:  

- The similarity is between $-1$ and $1$.  
- $1$: the two vectors point in the same direction.
- $0$: the two vectors are orthogonal (perpendicular).
- $-1$: the two vectors point in opposite directions.

In [None]:
a = np.array([0.6, 0.8,   0])
b = np.array([0.6,   0, 0.8])

def cosine_similarity(x, y):
    return np.dot(x, y) / np.linalg.norm(x) / np.linalg.norm(y)

cosine_similarity(a, b)
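
For vectors of length 1 the two measures are tied together: expanding $\|{\bf x} - {\bf y}\|^2 = \|{\bf x}\|^2 + \|{\bf y}\|^2 - 2\,{\bf x}\cdot{\bf y}$ gives $\|{\bf x} - {\bf y}\|^2 = 2 - 2\cos\theta$.  The sketch below checks this numerically with the same `a` and `b`; the names `cos` and `dist` are illustrative.

```python
import numpy as np

a = np.array([0.6, 0.8, 0.0])
b = np.array([0.6, 0.0, 0.8])

# Both vectors have length 1, so the cosine similarity is just the dot product.
cos = np.dot(a, b)
dist = np.linalg.norm(a - b)

# For unit vectors: squared distance = 2 - 2 * cosine similarity.
print(cos, dist ** 2, 2 - 2 * cos)
```

So on the unit sphere, a higher cosine similarity always corresponds to a smaller distance.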

### NLP task: how similar are two documents

We will use tf-idf to extract information from documents.  Here  

$$
    \operatorname{tf}(\text{doc}_i, \text{word}_j) = \frac{\text{# of occurrences of word$_j$ in doc$_i$}}{\text{# of words in doc$_i$}} 
$$

is the **term frequency** and  

$$
    \operatorname{df}(\text{word}_j) = \frac{\text{# of documents containing word$_j$}}{\text{# of documents}} 
$$

is the **document frequency**.  

Thus, we have  

$$
    \operatorname{tf-idf}(\text{doc}_i, \text{word}_j) = \operatorname{tf}(\text{doc}_i, \text{word}_j) \times \log_2 \frac{1}{\operatorname{df}(\text{word}_j)}.
$$

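_Variations of this formula exist; a common one adds one to the numerator and the denominator inside the logarithm to avoid division by zero.  In particular, `TfidfVectorizer` in scikit-learn by default uses raw counts for the term frequency, a smoothed inverse document frequency with the natural logarithm, and rescales each document vector to unit length, so its values differ from the formula above._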

**Intuition**:  A higher term frequency means the term is more important to the document, while a higher document frequency means the term behaves more like a function word that appears in almost every document.  

**Interpretation**:  

- $\operatorname{tf-idf}$ is $\geq 0$.
- Zero: the word does not appear in the document or, under the formula above, appears in every document.  
- Higher value means the word is more important.  
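
The formula above can be computed directly with NumPy.  The following is a minimal sketch on a made-up count matrix `counts`; it follows the formula in this section, not the variant implemented in scikit-learn, and it assumes every word occurs in at least one document so that the document frequency is never zero.

```python
import numpy as np

# Hypothetical counts: 3 documents (rows) and 3 words (columns).
counts = np.array([[2, 1, 1],
                   [0, 3, 1],
                   [0, 0, 4]])

# tf: occurrences of each word divided by the number of words in the document.
tf = counts / counts.sum(axis=1, keepdims=True)

# df: fraction of documents that contain each word.
df = (counts > 0).mean(axis=0)

# tf-idf = tf * log2(1 / df); the idf values are broadcast over all documents.
tfidf = tf * np.log2(1 / df)

print(df)
print(tfidf)
```

The last word appears in every document, so its column is zero; the first word appears only in the first document and receives the largest weight there.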

In [None]:
# Generated by ChatGPT with the question:
# Describe "XXX" in simple English using less than 100 words.
cat = "A cat is a small furry animal with sharp claws and a long tail. They have soft fur that comes in different colors like black, white, or orange. Cats are known for their agility and ability to climb trees. They are independent creatures but can also be friendly and enjoy human companionship. They communicate through meowing and purring. Cats are often kept as pets and are loved for their playfulness and ability to catch mice."
dog = "Dogs are loyal and friendly animals that love to be around people. They come in different sizes and colors, but all dogs have fur and a wagging tail. They enjoy playing fetch, going for walks, and cuddling with their owners. Dogs are known for their keen sense of smell and hearing, which makes them great companions and protectors. They communicate through barking, wagging their tails, and using body language. Dogs are known for their unconditional love and can be part of a family, bringing happiness and companionship to their human friends."
rat = "Rats are small animals with fur and long tails. They are known for their quickness and agility. Rats can be found in various colors, such as brown, gray, or black. They are often seen scavenging for food and can squeeze through small spaces. While some people may consider them pests, rats are intelligent creatures that can learn tricks and solve problems. They have a keen sense of smell and are good at finding food. Rats are social animals and often live in groups called colonies. Despite their negative reputation, rats play a vital role in ecosystems."
sun = "The sun is a big, bright ball of light in the sky. It gives us warmth and helps plants grow. The sun rises in the morning and sets in the evening, giving us daylight. It shines during the day, making everything around us visible. The sun is very far away from us but still feels close because it is so bright. It provides us with energy and makes the world a brighter and happier place. When we feel its rays on our skin, it can feel warm and comforting. The sun is an essential part of our lives and the Earth's ecosystem."
corpus = [cat, dog, rat, sun]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(corpus).toarray()

In [None]:
# show 10 columns only
X[:,:10]

Now `X[0], ..., X[3]` are the vector representations of the four documents, cat, dog, rat, and sun.  We may calculate their distances.

In [None]:
np.linalg.norm(X[0] - X[1]) # try other pairs

We may use the `pairwise_distances` function in `sklearn` to calculate all pairwise distances at once.  It looks like the sun document is farther away from the other three documents than they are from each other.

In [None]:
from sklearn.metrics import pairwise_distances
pairwise_distances(X)
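
The same kind of pairwise comparison can be made with cosine similarity instead of distance.  The sketch below uses plain NumPy on a small made-up matrix `X_demo`, which stands in for a matrix of document vectors; it is an illustration of the idea, not the computation performed in the cells above.

```python
import numpy as np

# Hypothetical document vectors, one row per document.
X_demo = np.array([[1.0, 2.0, 0.0],
                   [2.0, 4.0, 0.0],
                   [0.0, 0.0, 3.0]])

# Scale every row to length 1.
unit = X_demo / np.linalg.norm(X_demo, axis=1, keepdims=True)

# Entry (i, j) is the cosine similarity between documents i and j.
sim = unit @ unit.T

print(sim)
```

The first two rows point in the same direction, so their similarity is 1 even though their distance is not zero; the third row shares no nonzero entry with them, so its similarity to both is 0.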

### Further reading

- [_Python Data Science Handbook_](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas