# Natural Language Processing (NLP)
Natural language processing is a field concerned with the interaction between computers and human language. It covers tasks such as text classification, sentiment analysis, and machine translation. This notebook introduces some fundamental concepts first and then moves on to practical examples.

## Fundamental Concepts


### Tokenization:
Splitting text into individual words or tokens.

In [1]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "Natural Language Processing is fascinating!"
tokens = word_tokenize(text)
print(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']
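
To see roughly what a word tokenizer does, here is a minimal sketch that uses only Python's `re` module. The pattern and the `simple_tokenize` name are illustrative assumptions. NLTK's `word_tokenize` applies more elaborate rules (contractions, abbreviations, and so on), so the two will differ on harder inputs.

```python
import re

def simple_tokenize(text):
    # Match runs of word characters, or any single character that is
    # neither a word character nor whitespace (i.e. punctuation).
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Natural Language Processing is fascinating!"))
# ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']
```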


### Stemming and Lemmatization:
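Reducing words to their base or root form. Stemming strips suffixes with heuristic rules and may produce non-words; lemmatization maps a word to its dictionary form and benefits from a part-of-speech hint (the pos='v' argument below).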

In [2]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "running"
print("Stemmed:", stemmer.stem(word))
print("Lemmatized:", lemmatizer.lemmatize(word, pos='v'))

[nltk_data] Downloading package wordnet to /root/nltk_data...


Stemmed: run
Lemmatized: run
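
The difference between the two approaches can be sketched with toy stand-ins. Both functions below are hypothetical and deliberately simplistic: `toy_stem` is not the Porter algorithm and `TOY_LEMMAS` is not WordNet. They only illustrate that a stemmer applies suffix rules while a lemmatizer looks words up.

```python
def toy_stem(word):
    # Hypothetical rule-based stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

# Hypothetical lemma dictionary standing in for WordNet.
TOY_LEMMAS = {"ran": "run", "better": "good", "mice": "mouse"}

def toy_lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged.
    return TOY_LEMMAS.get(word, word)

for w in ("running", "jumped", "ran", "mice"):
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

Irregular forms such as "ran" or "mice" pass through the suffix rules untouched, which is where a dictionary-based lemmatizer helps.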


### Stop Word Removal:
Stop words are common words (like "and", "the", "is") that are often removed from text to focus on more meaningful words.

In [3]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

['Natural', 'Language', 'Processing', 'fascinating', '!']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
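
The filtered output above still contains the "!" token, because punctuation is not part of the stop word list. A minimal sketch of filtering both at once follows. The small `STOP_WORDS` set is an assumption made so the snippet runs without NLTK; the real English list is much longer.

```python
# Small hand-picked stop list for illustration only (an assumption);
# NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "the"}

tokens = ["Natural", "Language", "Processing", "is", "fascinating", "!"]

# Keep alphabetic tokens that are not stop words.
filtered = [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]
print(filtered)
# ['Natural', 'Language', 'Processing', 'fascinating']
```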


### Bag of Words (BoW):
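Representing each text as a vector of word counts over a shared vocabulary, ignoring word order. Note that CountVectorizer's default tokenizer drops single-character tokens, which is why "a" is absent from the vocabulary below.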

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Natural Language Processing is fascinating.",
    "Machine learning is a part of AI.",
    "Deep learning is a subset of machine learning."
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['ai' 'deep' 'fascinating' 'is' 'language' 'learning' 'machine' 'natural'
 'of' 'part' 'processing' 'subset']
[[0 0 1 1 1 0 0 1 0 0 1 0]
 [1 0 0 1 0 1 1 0 1 1 0 0]
 [0 1 0 1 0 2 1 0 1 0 0 1]]
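
To make the counting explicit, the same matrix can be rebuilt with the standard library alone. This is a sketch that mirrors CountVectorizer's defaults (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the helper name `tokenize` is an assumption.

```python
import re
from collections import Counter

corpus = [
    "Natural Language Processing is fascinating.",
    "Machine learning is a part of AI.",
    "Deep learning is a subset of machine learning.",
]

# Same default token pattern as CountVectorizer: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

# One Counter per document, then a sorted vocabulary across all of them.
counts = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted(set().union(*counts))

# A Counter returns 0 for terms a document does not contain.
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```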


### TF-IDF (Term Frequency-Inverse Document Frequency):
A statistical measure of how important a word is to a document relative to a collection of documents: a term scores higher the more often it appears in the document and the fewer documents contain it.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['ai' 'deep' 'fascinating' 'is' 'language' 'learning' 'machine' 'natural'
 'of' 'part' 'processing' 'subset']
[[0.         0.         0.47952794 0.28321692 0.47952794 0.
  0.         0.47952794 0.         0.         0.47952794 0.        ]
 [0.49482971 0.         0.         0.2922544  0.         0.37633075
  0.37633075 0.         0.37633075 0.49482971 0.         0.        ]
 [0.         0.41454097 0.         0.24483457 0.         0.63053818
  0.31526909 0.         0.31526909 0.         0.         0.41454097]]
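
The numbers above can be reproduced by hand. The sketch below assumes scikit-learn's default settings: raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. Helper names such as `tfidf_row` are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    "Natural Language Processing is fascinating.",
    "Machine learning is a part of AI.",
    "Deep learning is a subset of machine learning.",
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

docs = [Counter(tokenize(d)) for d in corpus]
vocabulary = sorted(set().union(*docs))
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocabulary:
    df = sum(1 for d in docs if term in d)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # Term count times idf, then scale the row to unit L2 norm.
    raw = [doc[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_row(d) for d in docs]
for row in tfidf:
    print([round(v, 4) for v in row])
```

The word "is" appears in every document, so its idf is the minimum value of 1 and it receives the lowest nonzero weight in each row.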


### Word Embeddings:
Representing words in a continuous vector space where semantically similar words are closer together (e.g., Word2Vec, GloVe).

In [6]:
import gensim.downloader as api

# Load pre-trained GloVe vectors (50-dimensional, Wikipedia + Gigaword)
model = api.load("glove-wiki-gigaword-50")

# Get the vector for a word
vector = model['king']
print(vector)

# Find similar words
similar_words = model.most_similar('king')
print(similar_words)

[ 0.50451   0.68607  -0.59517  -0.022801  0.60046  -0.13498  -0.08813
  0.47377  -0.61798  -0.31012  -0.076666  1.493    -0.034189 -0.98173
  0.68229   0.81722  -0.51874  -0.31503  -0.55809   0.66421   0.1961
 -0.13495  -0.11476  -0.30344   0.41177  -2.223    -1.0756   -1.0783
 -0.34354   0.33505   1.9927   -0.04234  -0.64319   0.71125   0.49159
  0.16754   0.34344  -0.25663  -0.8523    0.1661    0.40102   1.1685
 -1.0137   -0.21585  -0.15155   0.78321  -0.91241  -1.6106   -0.64426
 -0.51042 ]
[('prince', 0.8236179351806641), ('queen', 0.7839043140411377), ('ii', 0.7746230363845825), ('emperor', 0.7736247777938843), ('son', 0.766719400882721), ('uncle', 0.7627150416374207), ('kingdom', 0.7542160749435425), ('throne', 0.7539913654327393), ('brother', 0.7492411136627197), ('ruler', 0.7434253692626953)]
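
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that ranking follows; the three-dimensional `toy_vectors` are made up for illustration and are not taken from the GloVe model.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def most_similar(word, vectors):
    # Score every other word against `word`, highest similarity first.
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(most_similar("king", toy_vectors))
```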


### Sequence Models:
Models like RNNs, LSTMs, GRUs, and Transformers that handle sequential data.

In [8]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Sample data
sentences = [
    "I love machine learning",
    "Deep learning is amazing",
    "Natural language processing is a part of AI",
    "I enjoy learning new things"
]
labels = [1, 1, 1, 0]  # 1: positive, 0: negative

# Tokenize the sentences
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Pad the sequences to ensure they are all the same length
max_length = max(len(seq) for seq in sequences)
X = pad_sequences(sequences, maxlen=max_length, padding='post')

# Convert labels to a numpy array
y = np.array(labels)

# Define the LSTM model
model = Sequential()
model.add(Embedding(input_dim=100, output_dim=16, input_length=max_length))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X, y, epochs=10, verbose=1)

# Evaluate the model
loss, accuracy = model.evaluate(X, y, verbose=0)
print(f'Test Accuracy: {accuracy:.4f}')

# Make predictions
predictions = model.predict(X)
print(predictions)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy: 0.7500
[[0.54214   ]
 [0.5448219 ]
 [0.5214551 ]
 [0.53578043]]


# Explanation:



* Tokenization: We use the "Tokenizer" class from Keras to convert the sentences into sequences of integers.
* Padding: We pad the sequences to ensure they are all the same length using the pad_sequences function.
* Model Definition: We define a simple LSTM model with an embedding layer, an LSTM layer, and a dense output layer with a sigmoid activation function.
* Model Compilation: We compile the model using the Adam optimizer and binary cross-entropy loss.
* Model Training: We train the model on the sample data for 10 epochs.
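* Model Evaluation: We evaluate the model on the same data it was trained on, so the value printed as "Test Accuracy" is really training accuracy.
* Predictions: We make predictions on the sample data and print the results. All four probabilities are just above 0.5, so the model predicts "positive" for every sentence, which is correct for 3 of the 4 labels.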
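
The tokenization and padding steps above can be sketched in plain Python. This is a simplified approximation of what the Keras `Tokenizer` and `pad_sequences` do, under the assumption that words are lowercased, split on whitespace, and indexed from 1 in order of decreasing frequency; it is not the Keras implementation.

```python
from collections import Counter

sentences = [
    "I love machine learning",
    "Deep learning is amazing",
    "Natural language processing is a part of AI",
    "I enjoy learning new things",
]

# Count words over the whole corpus (lowercased, split on whitespace).
word_counts = Counter(w for s in sentences for w in s.lower().split())

# Index 0 is reserved for padding, so word indices start at 1;
# more frequent words get smaller indices.
word_index = {w: i for i, (w, _) in enumerate(word_counts.most_common(), start=1)}

# Replace each word with its integer index.
sequences = [[word_index[w] for w in s.lower().split()] for s in sentences]

# Post-pad with zeros up to the length of the longest sentence.
max_length = max(len(seq) for seq in sequences)
padded = [seq + [0] * (max_length - len(seq)) for seq in sequences]

for row in padded:
    print(row)
```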

# Advanced NLP Tasks
Once you're comfortable with the basics, you can explore more advanced NLP tasks and models:

* Named Entity Recognition (NER): Identifying entities like names, dates, and locations in text.
* Sentiment Analysis: Determining the sentiment (positive, negative, neutral) of a piece of text.
* Machine Translation: Translating text from one language to another.
* Text Summarization: Generating a summary of a given text.
* Question Answering: Building models that can answer questions based on a given context.

## Pre-trained Models and Libraries
There are several pre-trained models and libraries that can help you with advanced NLP tasks:

* spaCy: A popular NLP library with pre-trained models for various tasks.
* Hugging Face Transformers: A library with state-of-the-art pre-trained models like BERT, GPT-2, and more.

Example: Using Hugging Face Transformers for Sentiment Analysis

In [9]:
from transformers import pipeline

# Load a pre-trained sentiment analysis pipeline
classifier = pipeline('sentiment-analysis')

# Analyze sentiment of a sentence
result = classifier("I love natural language processing!")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

[{'label': 'POSITIVE', 'score': 0.9998558759689331}]
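
The warning in the output recommends naming the model explicitly. One way to do that is sketched below, using the model name and revision exactly as they appear in that warning. The `build_classifier` helper is a hypothetical wrapper, and if the Hub does not accept the short revision string, the full commit hash from the model page would be needed instead.

```python
# Model name and revision are taken from the warning printed above.
MODEL_NAME = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "af0f99b"

def build_classifier():
    # Imported lazily so that defining this helper does not require
    # the transformers package to be installed.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=MODEL_REVISION)
```

Calling `build_classifier()("I love natural language processing!")` should then behave like the unpinned pipeline above, without the "No model was supplied" warning.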


# Step-by-Step Example

### 1. Data Loading and Preprocessing
First, let's load the IMDb dataset and preprocess the text data.

In [12]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import matplotlib.pyplot as plt

# Load the IMDb dataset
num_words = 10000  # Only consider the top 10,000 words
max_length = 200   # Maximum length of sequences

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)

# Pad sequences to ensure they are all the same length
X_train = pad_sequences(X_train, maxlen=max_length, padding='post')
X_test = pad_sequences(X_test, maxlen=max_length, padding='post')

print(f"Training data shape: {X_train.shape}")
print(f"Test data shape: {X_test.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Test labels shape: {y_test.shape}")

Training data shape: (25000, 200)
Test data shape: (25000, 200)
Training labels shape: (25000,)
Test labels shape: (25000,)
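
Every review ends up with exactly 200 entries because `pad_sequences` both pads short sequences and truncates long ones. The sketch below imitates that behavior for a single list. `pad_or_truncate` is a hypothetical helper whose defaults mirror the call above: padding is 'post' as passed explicitly, and truncating is 'pre', which is the Keras default when the argument is omitted.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="pre", value=0):
    # Truncate first: "pre" drops tokens from the start, "post" from the end.
    if len(seq) > maxlen:
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # Then pad: "post" appends the fill value, "pre" prepends it.
    return seq + pad if padding == "post" else pad + seq

print(pad_or_truncate([1, 2], 4))           # [1, 2, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5], 3))  # [3, 4, 5]
```

With these settings, a review longer than 200 tokens keeps its last 200 tokens rather than its first 200.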


### 2. Building the LSTM Model
Next, we'll define and compile the LSTM model.

In [13]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Define the LSTM model
model = Sequential()
model.add(Embedding(input_dim=num_words, output_dim=128, input_length=max_length))
model.add(LSTM(64, return_sequences=True))
model.add(Dropout(0.5))
model.add(LSTM(64))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Print the model summary
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 200, 128)          1280000   
                                                                 
 lstm_1 (LSTM)               (None, 200, 64)           49408     
                                                                 
 dropout (Dropout)           (None, 200, 64)           0         
                                                                 
 lstm_2 (LSTM)               (None, 64)                33024     
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1362497 (5.20 MB)
Trainable params: 1362
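
The parameter counts in the summary can be checked by hand. The helper functions below are illustrative, using the standard formulas for embedding, LSTM, and dense layers with biases.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of length embed_dim per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    # Weights plus one bias per output unit.
    return (input_dim + 1) * units

counts = {
    "embedding": embedding_params(10000, 128),
    "lstm_1": lstm_params(128, 64),
    "lstm_2": lstm_params(64, 64),
    "dense": dense_params(64, 1),
}
print(counts)
print("total:", sum(counts.values()))  # total: 1362497
```

Dropout layers have no weights, which is why they show 0 parameters in the summary.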

### 3. Training the Model
Now, we'll train the model on the training data.
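
The training call in this step uses `validation_split=0.2` and `batch_size=64`. As a rough sketch of what that means for the 25,000 training reviews, the arithmetic is worked out below; the variable names are illustrative.

```python
import math

n_samples = 25000        # size of X_train, from the shapes printed earlier
validation_split = 0.2
batch_size = 64

# Keras takes the validation samples from the end of the arrays,
# before any shuffling.
n_train = int(n_samples * (1 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)  # 20000 5000 313
```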

In [14]:
# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_split=0.2, verbose=1)

# Plot training and validation loss
plt.figure(figsize=(10, 5))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training and Validation Loss Over Time')
plt.legend()
plt.show()

# Plot training and validation accuracy
plt.figure(figsize=(10, 5))
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy Over Time')
plt.legend()
plt.show()


### 4. Evaluating the Model
We'll evaluate the model on the test data to see how well it performs.

In [None]:
# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f'Test Loss: {test_loss:.4f}')
print(f'Test Accuracy: {test_accuracy:.4f}')
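
For a sigmoid output, the accuracy reported by `model.evaluate` is the fraction of examples whose prediction, thresholded at 0.5, matches the label. A minimal sketch of that calculation, reusing the four predictions and labels from the small LSTM example earlier in this notebook:

```python
# Sigmoid outputs and labels copied from the small LSTM example earlier.
probabilities = [0.54214, 0.5448219, 0.5214551, 0.53578043]
labels = [1, 1, 1, 0]

# Threshold each probability at 0.5 to get a class label.
predicted = [1 if p > 0.5 else 0 for p in probabilities]

# Accuracy is the share of predictions that match the true labels.
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)

print(predicted)          # [1, 1, 1, 1]
print(f"{accuracy:.4f}")  # 0.7500
```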

### 5. Making Predictions
Finally, we'll make predictions on the test data and compare them with the actual labels.

In [None]:
# Make predictions
predictions = model.predict(X_test)

# Convert predicted probabilities to binary labels; flatten (N, 1) to (N,)
predicted_labels = (predictions > 0.5).astype(int).flatten()

# Print some example predictions (reviews are shown as padded word indices)
for i in range(5):
    print(f"Encoded review: {X_test[i]}")
    print(f"Predicted Sentiment: {'Positive' if predicted_labels[i] == 1 else 'Negative'}")
    print(f"Actual Sentiment: {'Positive' if y_test[i] == 1 else 'Negative'}")
    print()
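
The loop above prints each review as a row of integers. To read the text instead, the indices can be mapped back to words. The sketch below assumes the default `imdb.load_data` encoding (0 for padding, 1 for start of sequence, 2 for unknown words, real word indices shifted by 3). `decode_review` is a hypothetical helper, and `toy_word_index` stands in for the dictionary returned by `imdb.get_word_index()`.

```python
def decode_review(encoded, word_index, index_from=3):
    # Keras shifts every word index by `index_from` and reserves
    # 0 for padding, 1 for start-of-sequence, and 2 for unknown words.
    reverse_index = {idx + index_from: word for word, idx in word_index.items()}
    special = {0: "", 1: "<START>", 2: "<UNK>"}
    words = (
        special[i] if i in special else reverse_index.get(i, "<UNK>")
        for i in encoded
    )
    # Padding decodes to an empty string and is dropped here.
    return " ".join(w for w in words if w)

# Hypothetical three-word index standing in for imdb.get_word_index().
toy_word_index = {"the": 1, "movie": 2, "great": 3}
print(decode_review([1, 4, 5, 6, 0, 0], toy_word_index))
# <START> the movie great
```

With the real dataset, the call would look like `decode_review(X_test[i], imdb.get_word_index())`.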