# NLP: NumPy basics

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/phonchi/ModularPython/blob/master/NLP-NumPy_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/phonchi/ModularPython/blob/master/NLP-NumPy_basics.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [1]:
import numpy as np
from sklearn import set_config
np.set_printoptions(precision=2, suppress=True)
set_config(transform_output="pandas")

### Numerical Data Representation

Tables and matrices are commonly used to store data in a structured format. When dealing with purely numerical entries, these formats allow computers to process the information efficiently. However, when working with categorical or textual data, <u>transformation into numerical vectors</u> is necessary before any computational training can occur.

Here is an example table showing students' scores in different subjects together with the decision made for each student:

| Student \ Subject | A  | B  | C  | D  | E  | Decision | Comments         |
|-------------------|----|----|----|----|----|----------|------------------|
| 1                 | 10 | 10 | 10 | 10 | 10 | Accept   | Excellent        |
| 2                 | 10 | 10 | 10 | 10 | 0  | Accept   | Satisfactory     |
| 3                 | 0  | 0  | 15 | 0  | 0  | Decline  | Needs improvement|

**Terminology:**
- **Sample:** Each row represents an individual sample.
- **Feature:** Each column represents a feature of the dataset.
- **Vector:** A one-dimensional array of numbers.
- **Matrix:** A two-dimensional array of numbers, stored in a rectangular format.
- **Array:** A collection of elements stored in one or more dimensions.

This format not only aids in data organization but also prepares the dataset for machine learning models where features and samples are crucial for predictions and analysis.
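As a minimal sketch, the numerical part of the table above can be stored as a NumPy matrix in which rows are samples and columns are features. The names `scores` and `decision`, and the 1/0 encoding of the decision column, are assumptions made for illustration rather than part of the original notebook.

```python
import numpy as np

# Rows are samples (students 1-3); columns are features (subjects A-E).
scores = np.array([
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10,  0],
    [ 0,  0, 15,  0,  0],
])

# The categorical "Decision" column has to be encoded numerically as well.
# Accept -> 1 and Decline -> 0 is an arbitrary encoding chosen for this sketch.
decision = np.array([1, 1, 0])

n_samples, n_features = scores.shape
print(n_samples, n_features)  # number of rows and number of columns
print(scores[2])              # feature vector of student 3
print(scores[:, 2])           # subject C across all students
```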

### 📘 NLP task: Count vector for vectorization

One way to transform a document into a vector is simply to count how many times each word occurs.
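The idea can be sketched by hand with the standard library before turning to scikit-learn. The toy documents, the whitespace tokenization, and the names `docs`, `vocab`, and `count_vector` are assumptions made for illustration; `CountVectorizer` tokenizes more carefully.

```python
from collections import Counter

# Two toy documents (hypothetical data for illustration).
docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocab):
    """Return how many times each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

print(vocab)
for doc in docs:
    print(count_vector(doc, vocab))
```

Each document becomes a vector whose length equals the vocabulary size, so all documents live in the same space and can be compared.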

In [2]:
# Generated by ChatGPT with the question:
# Describe "XXX" in simple English using less than 100 words.
cat = "A cat is a small furry animal with sharp claws and a long tail. They have soft fur that comes in different colors like black, white, or orange. Cats are known for their agility and ability to climb trees. They are independent creatures but can also be friendly and enjoy human companionship. They communicate through meowing and purring. Cats are often kept as pets and are loved for their playfulness and ability to catch mice."
dog = "Dogs are loyal and friendly animals that love to be around people. They come in different sizes and colors, but all dogs have fur and a wagging tail. They enjoy playing fetch, going for walks, and cuddling with their owners. Dogs are known for their keen sense of smell and hearing, which makes them great companions and protectors. They communicate through barking, wagging their tails, and using body language. Dogs are known for their unconditional love and can be part of a family, bringing happiness and companionship to their human friends."
rat = "Rats are small animals with fur and long tails. They are known for their quickness and agility. Rats can be found in various colors, such as brown, gray, or black. They are often seen scavenging for food and can squeeze through small spaces. While some people may consider them pests, rats are intelligent creatures that can learn tricks and solve problems. They have a keen sense of smell and are good at finding food. Rats are social animals and often live in groups called colonies. Despite their negative reputation, rats play a vital role in ecosystems."
sun = "The sun is a big, bright ball of light in the sky. It gives us warmth and helps plants grow. The sun rises in the morning and sets in the evening, giving us daylight. It shines during the day, making everything around us visible. The sun is very far away from us but still feels close because it is so bright. It provides us with energy and makes the world a brighter and happier place. When we feel its rays on our skin, it can feel warm and comforting. The sun is an essential part of our lives and the Earth's ecosystem."
corpus = [cat, dog, rat, sun]

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(stop_words='english', lowercase=True)
X = cvec.fit_transform(corpus).toarray() # fit_transform returns a sparse matrix by default; toarray() makes it dense

In [4]:
# show 10 columns only
X[:,:10]

array([[2, 1, 1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
       [0, 1, 0, 2, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]])

In [5]:
# better format by pandas
import pandas as pd
df = pd.DataFrame(X)
df.columns = cvec.get_feature_names_out()
df

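|   | ability | agility | animal | animals | away | ball | barking | big | black | body | ... | using | various | visible | vital | wagging | walks | warm | warmth | white | world |
|---|---------|---------|--------|---------|------|------|---------|-----|-------|------|-----|-------|---------|---------|-------|---------|-------|------|--------|-------|-------|
| 0 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |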


### 🔍 Supplementary: Exploring CountVectorizer

**stop_words**: Since `CountVectorizer` just counts the occurrences of each word in its vocabulary, extremely common words like 'the' and 'and' become very prominent features even though they add little meaning to the text. Your model can often be improved if you don't take those words into account. Stop words are simply a list of words you don't want to use as features. You can set the parameter `stop_words='english'` to use a built-in list. Alternatively, you can set `stop_words` to a custom list. This parameter defaults to `None`.

**ngram_range**: An n-gram is just a string of n words in a row. E.g. the sentence ‘I am Groot’ contains the 2-grams ‘I am’ and ‘am Groot’. The sentence is itself a 3-gram. Set the parameter ngram_range=(a,b) where a is the minimum and b is the maximum size of ngrams you want to include in your features. The default ngram_range is (1,1). In a recent project where I modeled job postings online, I found that including 2-grams as features boosted my model’s predictive power significantly. This makes intuitive sense; many job titles such as ‘data scientist’, ‘data engineer’, and ‘data analyst’ are 2 words long.

**min_df, max_df**: These are the minimum and maximum document frequencies that words/n-grams must have to be used as features. If either parameter is set to an integer, it is used as a bound on the number of documents a term must appear in to be kept as a feature. If either is set to a float, the number is interpreted as a proportion of documents rather than an absolute count. `min_df` defaults to 1 (int) and `max_df` defaults to 1.0 (float).

**max_features**: This parameter is fairly self-explanatory. The `CountVectorizer` keeps only the `max_features` words/n-grams that occur most frequently in its vocabulary and drops everything else.
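To make n-grams and document frequency concrete, here is a small pure-Python sketch. The helper names `ngrams` and `document_frequency`, the toy job titles, and the whitespace tokenization are assumptions made for illustration; `CountVectorizer` itself lowercases text and, with its default token pattern, drops single-character tokens such as "I", so its output for the same sentence would differ.

```python
def ngrams(text, n):
    """Return all runs of n consecutive whitespace-separated words in text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("I am Groot", 2))  # ['I am', 'am Groot']
print(ngrams("I am Groot", 3))  # ['I am Groot']

# Toy documents (hypothetical data for illustration).
docs = ["data scientist", "data engineer", "software engineer", "data analyst"]

def document_frequency(term, docs):
    """Fraction of documents that contain the term as a whole word."""
    return sum(term in doc.split() for doc in docs) / len(docs)

print(document_frequency("data", docs))      # 0.75
print(document_frequency("engineer", docs))  # 0.5
```

With these toy documents, a setting such as `max_df=0.7` would drop "data" (document frequency 0.75) but keep "engineer" (0.5).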

In [6]:
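# @title 💡 Supplementary Code: Delve into CountVectorizer { run: "auto", vertical-output: true, form-width: "50%" }
# Unigrams and bigrams, kept only if they appear in at least 10% and at most 70% of documents; top 100 features
vect_tuned = CountVectorizer(stop_words='english', ngram_range=(1,2), min_df=0.1, max_df=0.7, max_features=100)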

### Shape

NumPy is a Python package that handles arrays.  The shape of an array tells how many entries are stored in each dimension.  

In [7]:
arr = np.zeros((3,4)) # try (3,4,5); entries are floating point by default
print(arr.shape)
arr

(3, 4)


array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

As long as the total number of entries stays the same, you may use `reshape` to change the shape.  

In [8]:
arr = np.arange(12)
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [9]:
# look carefully at the number of brackets
arr = np.arange(12)
arr.reshape(3,4)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
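The following sketch illustrates when `reshape` is possible: the new shape must hold exactly as many entries as the old one. The specific shapes tried here are only examples.

```python
import numpy as np

arr = np.arange(12)

# -1 lets NumPy infer the missing dimension: 12 entries / 3 rows = 4 columns.
print(arr.reshape(3, -1).shape)

# Three dimensions work as well, since 2 * 3 * 2 = 12.
print(arr.reshape(2, 3, 2).shape)

# A shape whose product is not 12 is rejected with a ValueError.
try:
    arr.reshape(5, 3)
except ValueError as err:
    print("reshape failed:", err)
```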

Use `zeros_like` to create a zero array with the same shape and data type as an existing array.  

In [10]:
arr = np.arange(12)
arr = arr.reshape(3,4)
print(arr.dtype)
z_arr = np.zeros_like(arr)
print(z_arr.dtype)
z_arr

int64
int64


array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

### Selection and Slicing

When `arr` is an array, use `arr[i]` to select entry `i` (or row `i` of a two-dimensional array) and `arr[i,j]` to select the entry in row `i` and column `j`.  

In [11]:
arr = np.array([1,2,3])
print(arr)
print(arr[2]) # indices start at 0, so index 2 is the third entry

[1 2 3]
3


In [12]:
arr = np.arange(12).reshape(3,4)
print(arr)
print(arr[2])

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[ 8  9 10 11]


In [13]:
arr = np.arange(12).reshape(3,4)
print(arr)
print(arr[2,3])

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
11


Instead of selecting only one index `i`, you may use `a:b` to **slice** all entries from `a` to `b - 1`.

In [None]:
arr = np.arange(12).reshape(3,4)
print(arr)
print(arr[:, 1:3])

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[[ 1  2]
 [ 5  6]
 [ 9 10]]
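A few more slicing patterns, as a brief sketch: open-ended ranges, a step, negative indices, and the fact that a basic slice is a view onto the original array rather than a copy. The name `block` is only illustrative.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

print(arr[1:, :2])    # rows 1 and 2, columns 0 and 1
print(arr[:, ::2])    # every other column
print(arr[-1, -2:])   # last row, last two columns

# Basic slices are views: writing through the slice changes arr itself.
block = arr[:2, :2]
block[:] = 0
print(arr)
```

Call `.copy()` on a slice when the original array must stay unchanged.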


### Universal and aggregate functions

If an operation is applied to each entry, then it is called **universal**.  

In [None]:
arr = np.arange(12).reshape(3,4)
arr * 2

array([[ 0,  2,  4,  6],
       [ 8, 10, 12, 14],
       [16, 18, 20, 22]])

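An operation that combines many entries into a single value, such as a sum taken along an axis, is called an **aggregate** function.

In [None]: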
arr = np.arange(12).reshape(3,4)
arr.sum(axis=0) # try .sum(axis=1) or .sum()

array([12, 15, 18, 21])
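The difference between the two kinds of functions shows up in the shape of the result. The sketch below uses `np.sqrt` as an example of a universal function and `sum` and `mean` as examples of aggregates.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# Universal: applied entry by entry, so the shape is preserved.
print(np.sqrt(arr).shape)

# Aggregate: entries are combined, so the axis being summed over disappears.
print(arr.sum(axis=0))   # column sums, shape (4,)
print(arr.sum(axis=1))   # row sums, shape (3,)
print(arr.sum())         # sum of all entries
print(arr.mean(axis=1))  # row means
```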

### Distances and similarity

Distance metrics are fundamental in determining how close two vectors are. Consider two vectors, ${\bf x} = (x_1, x_2, \ldots, x_n)$ and ${\bf y} = (y_1, y_2, \ldots, y_n)$. The **distance** between them is computed as:

$$
    \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}.
$$

This expression is known as the **Euclidean distance**; it is the **$\ell_2$-norm** of the difference ${\bf x} - {\bf y}$.

**Intuition**: This distance represents the hypotenuse of a multi-dimensional right triangle, summing up the squared differences between corresponding elements of the vectors, followed by taking the square root.

**Interpretation**:
- The distance is always non-negative ($\geq 0$).
- A distance of zero indicates identical vectors.
- Larger values suggest greater disparity between the vectors.

In [None]:
a = np.array([1,2,3])
b = np.array([2,2,2])
dist = np.sqrt(np.sum((a - b) ** 2))
dist

1.4142135623730951

In [None]:
# same as np.linalg.norm
a = np.array([1,2,3])
b = np.array([2,2,2])
np.linalg.norm(a - b)

1.4142135623730951

[Cosine similarity](https://www.youtube.com/watch?v=e9U0QAFbfLI&ab_channel=StatQuestwithJoshStarmer) is another useful metric for assessing the closeness of two vectors, particularly when vectors are normalized to unit length (i.e., $\|{\bf x}\| = 1$), so that they lie on the unit sphere. This measure is widely used with word embeddings such as `word2vec`, whose vectors are commonly normalized to unit length before being compared.

Given vectors ${\bf x} = (x_1, x_2, \ldots, x_n)$ and ${\bf y} = (y_1, y_2, \ldots, y_n)$, the **cosine similarity** between them is calculated as:

$$
    \cos\theta = \frac{{\bf x}\cdot{\bf y}}{\|{\bf x}\|\|{\bf y}\|},
$$

where $\cos\theta$ is the cosine of the angle $\theta$ between the vectors, and ${\bf x}\cdot{\bf y} = x_1y_1 + x_2y_2 + \cdots + x_ny_n$ represents the **dot product**.

**Intuition**: Cosine similarity measures the cosine of the angle between two vectors, providing a measure of their directional alignment.

**Interpretation**:
- The value ranges from $-1$ to $1$.
- A similarity of $1$ indicates that the two vectors are in the same direction.
- A similarity of $0$ means the vectors are orthogonal; for count vectors, this happens when two documents share no words.
- A similarity of $-1$ means the vectors are in opposite directions.
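The three boundary cases in the list above can be checked numerically. This is a minimal sketch with hand-picked two-dimensional vectors; note that rescaling a vector does not change the result.

```python
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0])

same = cosine_similarity(x, 3 * x)                         # same direction
orthogonal = cosine_similarity(x, np.array([-2.0, 1.0]))   # perpendicular
opposite = cosine_similarity(x, -x)                        # opposite direction
print(same, orthogonal, opposite)
```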

In [None]:
a = np.array([0.6, 0.8,   0])
b = np.array([0.6,   0, 0.8])

def cosine_similarity(x, y):
    return np.dot(x, y) / np.linalg.norm(x) / np.linalg.norm(y)

cosine_similarity(a, b)

0.36

### 📘 NLP task: how similar are two documents

[TF-IDF (Term Frequency-Inverse Document Frequency)](https://www.youtube.com/watch?v=zLMEnNbdh4Q&ab_channel=DataMListic) is a statistical measure used to evaluate the importance of a word in a document, relative to a collection of documents or corpus. The computation of TF-IDF is performed as follows:

1. **Term Frequency (TF)** measures how frequently a term occurs in a document. It is calculated by:
   $$
   \operatorname{tf}(\text{doc}_i, \text{word}_j) = \frac{\text{# of occurrences of word}_j \text{ in doc}_i}{\text{# of words in doc}_i}
   $$

2. **Document Frequency (DF)** assesses how common a word is across all documents. It is defined as:
   $$
   \operatorname{df}(\text{word}_j) = \frac{\text{# of documents containing word}_j}{\text{# of documents}}
   $$

3. **TF-IDF** is then calculated by:
   $$
   \operatorname{tf-idf}(\text{doc}_i, \text{word}_j) = \operatorname{tf}(\text{doc}_i, \text{word}_j) \times \log_2 \left(\frac{1}{\operatorname{df}(\text{word}_j)}\right)
   $$

Modifications to the formula include adding one to the denominators to avoid division by zero when a word does not appear in any documents.

**Intuition**: A high term frequency indicates that a word is important within a specific document. Conversely, a high document frequency suggests that the word is common (e.g. a function word such as "the") and perhaps less significant across all documents.

**Interpretation**:
- TF-IDF values are non-negative ($\geq 0$).
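- A value of zero means that the word does not appear in the document or, with the formula above, that it appears in every document (so $\operatorname{df} = 1$ and the logarithm vanishes).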
- Higher TF-IDF values signify greater importance of the word within the document in relation to the entire corpus.

This method is highly effective in filtering out common terms while highlighting significant terms specific to a document, making it essential for document analysis and information retrieval systems.
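The three formulas above can be turned into code directly. The following is a minimal sketch on a toy, pre-tokenized corpus; the data and the function names `tf`, `df`, and `tf_idf` are assumptions made for illustration, and the sketch uses the plain formula with $\log_2$ rather than the smoothed variant that scikit-learn applies.

```python
import math

# Toy corpus, already tokenized (hypothetical data for illustration).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "sun", "rose"],
    ["the", "sun", "set"],
]

def tf(doc, word):
    """Occurrences of word in doc divided by the number of words in doc."""
    return doc.count(word) / len(doc)

def df(word):
    """Fraction of documents that contain word."""
    return sum(word in doc for doc in docs) / len(docs)

def tf_idf(doc, word):
    # Only valid for words that occur somewhere in the corpus (df > 0).
    return tf(doc, word) * math.log2(1 / df(word))

for word in ["the", "sat", "cat"]:
    print(word, tf_idf(docs[0], word))
```

In the first document all three words have the same term frequency, yet "the" scores zero because it appears in every document, while "cat", which appears in only one document, scores highest.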

In [None]:
# Generated by ChatGPT with the question:
# Describe "XXX" in simple English using less than 100 words.
cat = "A cat is a small furry animal with sharp claws and a long tail. They have soft fur that comes in different colors like black, white, or orange. Cats are known for their agility and ability to climb trees. They are independent creatures but can also be friendly and enjoy human companionship. They communicate through meowing and purring. Cats are often kept as pets and are loved for their playfulness and ability to catch mice."
dog = "Dogs are loyal and friendly animals that love to be around people. They come in different sizes and colors, but all dogs have fur and a wagging tail. They enjoy playing fetch, going for walks, and cuddling with their owners. Dogs are known for their keen sense of smell and hearing, which makes them great companions and protectors. They communicate through barking, wagging their tails, and using body language. Dogs are known for their unconditional love and can be part of a family, bringing happiness and companionship to their human friends."
rat = "Rats are small animals with fur and long tails. They are known for their quickness and agility. Rats can be found in various colors, such as brown, gray, or black. They are often seen scavenging for food and can squeeze through small spaces. While some people may consider them pests, rats are intelligent creatures that can learn tricks and solve problems. They have a keen sense of smell and are good at finding food. Rats are social animals and often live in groups called colonies. Despite their negative reputation, rats play a vital role in ecosystems."
sun = "The sun is a big, bright ball of light in the sky. It gives us warmth and helps plants grow. The sun rises in the morning and sets in the evening, giving us daylight. It shines during the day, making everything around us visible. The sun is very far away from us but still feels close because it is so bright. It provides us with energy and makes the world a brighter and happier place. When we feel its rays on our skin, it can feel warm and comforting. The sun is an essential part of our lives and the Earth's ecosystem."
corpus = [cat, dog, rat, sun]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(corpus).toarray()

In [None]:
# show 10 columns only
X[:,:10]

array([[0.33, 0.13, 0.16, 0.  , 0.  , 0.  , 0.  , 0.  , 0.13, 0.  ],
       [0.  , 0.  , 0.  , 0.1 , 0.  , 0.  , 0.13, 0.  , 0.  , 0.13],
       [0.  , 0.09, 0.  , 0.19, 0.  , 0.  , 0.  , 0.  , 0.09, 0.  ],
       [0.  , 0.  , 0.  , 0.  , 0.13, 0.13, 0.  , 0.13, 0.  , 0.  ]])

Now `X[0], ..., X[3]` are the vector representations of the four documents, cat, dog, rat, and sun.  We may calculate their pairwise distances and cosine similarity.

In [None]:
# cosine similarity
def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

cosine_similarity(X[0], X[1]) # try other pairs

0.12888416735501715

When each row has unit length (norm one), which `TfidfVectorizer` ensures by default through `norm='l2'`, the formula `X.dot(X.T)` calculates all pairwise cosine similarities at once.

In [None]:
X.dot(X.T)

array([[1.  , 0.13, 0.1 , 0.  ],
       [0.13, 1.  , 0.09, 0.01],
       [0.1 , 0.09, 1.  , 0.  ],
       [0.  , 0.01, 0.  , 1.  ]])
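If the rows are not yet of unit length, they can be normalized first. The sketch below uses a small hand-made matrix `A` (an assumption for illustration) and shows that the product of the normalized matrix with its transpose holds the pairwise cosine similarities.

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [4.0, 3.0],
              [0.0, 5.0]])

# Divide each row by its Euclidean norm so that every row has length one.
U = A / np.linalg.norm(A, axis=1, keepdims=True)

# Entry (i, j) is the dot product of unit rows i and j,
# which equals the cosine similarity of rows i and j of A.
S = U @ U.T
print(S)
```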

In [None]:
# distance
np.linalg.norm(X[0] - X[1]) # try other pairs

1.3199362353121324

We may use the `pairwise_distances` function in `sklearn` to calculate all pairwise distances at once.  It looks like the sun document is farther away from the other three documents than they are from one another.

In [None]:
from sklearn.metrics import pairwise_distances
pairwise_distances(X)

array([[0.  , 1.32, 1.34, 1.41],
       [1.32, 0.  , 1.35, 1.41],
       [1.34, 1.35, 0.  , 1.41],
       [1.41, 1.41, 1.41, 0.  ]])
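For unit-length vectors, distance and cosine similarity carry the same information, because $\|{\bf x} - {\bf y}\|^2 = \|{\bf x}\|^2 + \|{\bf y}\|^2 - 2\,{\bf x}\cdot{\bf y} = 2 - 2\cos\theta$. A similarity near $0$ therefore corresponds to a distance near $\sqrt{2} \approx 1.41$. The sketch below checks this identity on small hand-made unit vectors and computes the pairwise distances with NumPy broadcasting; the matrix `U` is an assumption for illustration.

```python
import numpy as np

# Unit-length toy vectors, one per row.
U = np.array([[0.6, 0.8, 0.0],
              [0.6, 0.0, 0.8],
              [0.0, 0.0, 1.0]])

# Pairwise Euclidean distances via broadcasting:
# diff[i, j] is the vector U[i] - U[j].
diff = U[:, np.newaxis, :] - U[np.newaxis, :, :]
dist = np.linalg.norm(diff, axis=2)

# For unit vectors, squared distance equals 2 - 2 * cosine similarity.
cos = U @ U.T
print(np.allclose(dist ** 2, 2 - 2 * cos))
```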

As a side note, `TfidfTransformer` applies the following transformation to the count matrix.

In [None]:
# count matrix
a = np.array([[1,1,2,1],
        [1,2,0,0],
        [2,2,0,0]])

from sklearn.feature_extraction.text import TfidfTransformer
model = TfidfTransformer()
tfidf_mtx = model.fit_transform(a).toarray()
tfidf_mtx

array([[0.25, 0.25, 0.84, 0.42],
       [0.45, 0.89, 0.  , 0.  ],
       [0.71, 0.71, 0.  , 0.  ]])

Now, we manually calculate it:

In [None]:
# calculate tf
tf = a / a.sum(axis=1)[:,np.newaxis]
tf

array([[0.2 , 0.2 , 0.4 , 0.2 ],
       [0.33, 0.67, 0.  , 0.  ],
       [0.5 , 0.5 , 0.  , 0.  ]])

In [None]:
# calculate idf
idf = np.log( (a.shape[0] + 1) / ((a > 0).sum(axis=0) + 1) ) + 1
idf

array([1.  , 1.  , 1.69, 1.69])

In [None]:
# create tf-idf and normalize each row
w = tf * idf
w / np.linalg.norm(w, axis=1)[:,np.newaxis]

array([[0.25, 0.25, 0.84, 0.42],
       [0.45, 0.89, 0.  , 0.  ],
       [0.71, 0.71, 0.  , 0.  ]])

### 🔍 Supplementary: Use `CountVectorizer` and `TfidfTransformer` to obtain the tf-idf vectorization.

In [None]:
# @title 💡 Supplementary Code: Use TF-IDF Transformer { run: "auto", vertical-output: true, form-width: "50%" }
tfidf = TfidfVectorizer(stop_words='english')
X1 = tfidf.fit_transform(corpus).toarray()

cvec = CountVectorizer(stop_words='english')
X2 = cvec.fit_transform(corpus)
model = TfidfTransformer()
tfidf_mtx = model.fit_transform(X2).toarray()

# compare with a floating-point tolerance rather than requiring exact equality
np.testing.assert_allclose(X1, tfidf_mtx)

### 📚 Further reading

- [_Python Data Science Handbook_](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas