## Introduction to Natural Language Processing (NLP)
NLP is a field of artificial intelligence that helps computers understand, interpret, and manipulate human language.

In this notebook, we will use a scene from *The Shawshank Redemption* to demonstrate various NLP techniques.

Our goals are to:

- Preprocess the text (cleaning and preparing)
- Represent the text in numerical formats using different techniques
- Understand how these representations can be used in machine learning


1. **Text Representation and Preprocessing:**

   This involves transforming text into a format that can be effectively analyzed by machine learning algorithms. Text data is inherently unstructured and requires preprocessing steps like tokenization, stop word removal, stemming, and lemmatization to standardize and clean it. Preprocessing converts text into a more uniform, structured format that can be used for machine learning tasks.

2. **Introduction to NLP and Text Processing:**

   Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and human languages. NLP aims to read, decipher, understand, and make sense of human languages in a valuable way. The process involves various techniques, including syntactic and semantic analysis, sentiment analysis, and named entity recognition, to interpret and analyze the text data.

3. **Text Preprocessing: Tokenization, Stop Word Removal, Stemming, and Lemmatization:**

   - **Tokenization**: The process of breaking down text into smaller units, such as words or sentences, so that each unit can be analyzed on its own. For example, splitting "I've seen that magazine before" on whitespace gives ["I've", "seen", "that", "magazine", "before"]. NLTK's `word_tokenize`, used later in this notebook, goes further and splits the contraction into "I" and "'ve".
   
   - **Stop Word Removal**: This involves removing common words (such as "the", "is", "in") that do not contribute much to the text's meaning in the context of the analysis. Removing stop words reduces noise and helps focus on the important words.
   
   - **Stemming**: This reduces words to a root form by stripping suffixes (and, in some stemmers, prefixes) according to fixed rules. Stemming is a fast way to normalize related word forms, but the resulting stems are often not valid words. For example, "running" becomes "run", and "magazines" becomes "magazin".
   
   - **Lemmatization**: Similar to stemming but more sophisticated. Lemmatization reduces words to their base or dictionary form, taking the word's part of speech into account, and it produces valid words. For example, "better" is lemmatized to "good" when it is treated as an adjective, and "running" becomes "run" when it is treated as a verb.

4. **Different Text Vectorization Techniques: Bag of Words (BoW), TF-IDF, and N-grams:**

   - **Bag of Words (BoW)**: Represents text data as a collection of individual words, without regard to grammar or word order. It creates a vocabulary from all unique words in the text and represents each document as a vector that counts the occurrences of each word in the document.
   
   - **TF-IDF (Term Frequency-Inverse Document Frequency)**: A statistical measure used to evaluate how important a word is to a document in a collection or corpus. It considers the frequency of a word in a document (TF) and the inverse frequency of the word across all documents (IDF). This technique highlights words that are frequent in a document but not in others.
   
   - **N-grams**: Represent sequences of consecutive words in a text. A "uni-gram" is a single word, a "bi-gram" is a sequence of two words, and a "tri-gram" is a sequence of three words. N-grams capture local context and word order, which makes them useful for language models and sentiment analysis.
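
The three representations above can be sketched in plain Python, without any NLP library. This is a minimal illustration rather than how a production vectorizer works: the three-document toy corpus and the helper names (`bag_of_words`, `ngrams`, `tf_idf`) are made up for this example, and the TF-IDF weighting is the unsmoothed textbook form, `tf * log(N / df)`. Libraries typically use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration): three short, already-tokenized documents.
docs = [
    ["the", "beach", "sounds", "nice"],
    ["the", "treasure", "map"],
    ["the", "beach", "the", "waves"],
]

def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Unsmoothed TF-IDF: (count / document length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

vocabulary = sorted({word for doc in docs for word in doc})
bow_vectors = [bag_of_words(doc, vocabulary) for doc in docs]
bigrams = ngrams(docs[2], 2)
weights = tf_idf(docs)

print(vocabulary)
print(bow_vectors)
print(bigrams)
print(weights[1])
```

In this toy corpus "the" occurs in every document, so its TF-IDF weight is zero, while words that occur in only one document receive the largest weights.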

In [16]:
scene_text = """
Red: I've seen that magazine before. It's got a story about a guy who found a treasure map.
Andy: Yeah, I've read it too.
Red: It's a load of crap.
Andy: You never know, Red.
Red: Yeah, well, I'm not holding my breath.
Andy: Ever think about what you'd do if you found a treasure?
Red: Yeah, I guess I have.
Andy: Where would you go?
Red: Somewhere warm. Maybe Mexico.
Andy: That sounds nice.
Red: What about you, Andy? Where would you go?
Andy: I'd go to the beach. Just sit there and listen to the waves.
Red: The beach?
Andy: Yeah. The beach.
"""

print(scene_text)


Red: I've seen that magazine before. It's got a story about a guy who found a treasure map.
Andy: Yeah, I've read it too.
Red: It's a load of crap.
Andy: You never know, Red.
Red: Yeah, well, I'm not holding my breath.
Andy: Ever think about what you'd do if you found a treasure?
Red: Yeah, I guess I have.
Andy: Where would you go?
Red: Somewhere warm. Maybe Mexico.
Andy: That sounds nice.
Red: What about you, Andy? Where would you go?
Andy: I'd go to the beach. Just sit there and listen to the waves.
Red: The beach?
Andy: Yeah. The beach.



## Install NLTK if not already installed

In [17]:
!pip install nltk



### Importing NLTK and downloading the required data files

In [18]:
import nltk
import os
import zipfile

# Directory where NLTK data is stored; this path assumes the notebook runs as root.
nltk_data_dir = '/root/nltk_data'
os.makedirs(nltk_data_dir, exist_ok=True)

# Make sure NLTK searches this directory when it loads resources.
if nltk_data_dir not in nltk.data.path:
    nltk.data.path.append(nltk_data_dir)

# Tokenizer models, stop word lists, and WordNet data used in the steps below.
nltk.download('punkt', download_dir=nltk_data_dir)
nltk.download('stopwords', download_dir=nltk_data_dir)
nltk.download('wordnet', download_dir=nltk_data_dir)
nltk.download('omw-1.4', download_dir=nltk_data_dir)

wordnet_zip_path = os.path.join(nltk_data_dir, 'corpora', 'wordnet.zip')
wordnet_extracted_path = os.path.join(nltk_data_dir, 'corpora', 'wordnet')

# Unpack WordNet manually if only the zip archive is present.
if not os.path.exists(wordnet_extracted_path) and os.path.exists(wordnet_zip_path):
    print("Extracting 'wordnet.zip'...")
    with zipfile.ZipFile(wordnet_zip_path, 'r') as zip_ref:
        zip_ref.extractall(os.path.join(nltk_data_dir, 'corpora'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### Importing the stop word list, tokenizer, stemmer, and lemmatizer

In [19]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

### Display the original scene text

In [20]:
print("Original Scene Text:\n", scene_text)

Original Scene Text:
 
Red: I've seen that magazine before. It's got a story about a guy who found a treasure map.
Andy: Yeah, I've read it too.
Red: It's a load of crap.
Andy: You never know, Red.
Red: Yeah, well, I'm not holding my breath.
Andy: Ever think about what you'd do if you found a treasure?
Red: Yeah, I guess I have.
Andy: Where would you go?
Red: Somewhere warm. Maybe Mexico.
Andy: That sounds nice.
Red: What about you, Andy? Where would you go?
Andy: I'd go to the beach. Just sit there and listen to the waves.
Red: The beach?
Andy: Yeah. The beach.



### Step 1: Tokenization
Tokenization is the process of breaking down text into smaller pieces, usually words or sentences.


In [21]:
tokens = word_tokenize(scene_text)
print("\nTokens:\n", tokens)


Tokens:
 ['Red', ':', 'I', "'ve", 'seen', 'that', 'magazine', 'before', '.', 'It', "'s", 'got', 'a', 'story', 'about', 'a', 'guy', 'who', 'found', 'a', 'treasure', 'map', '.', 'Andy', ':', 'Yeah', ',', 'I', "'ve", 'read', 'it', 'too', '.', 'Red', ':', 'It', "'s", 'a', 'load', 'of', 'crap', '.', 'Andy', ':', 'You', 'never', 'know', ',', 'Red', '.', 'Red', ':', 'Yeah', ',', 'well', ',', 'I', "'m", 'not', 'holding', 'my', 'breath', '.', 'Andy', ':', 'Ever', 'think', 'about', 'what', 'you', "'d", 'do', 'if', 'you', 'found', 'a', 'treasure', '?', 'Red', ':', 'Yeah', ',', 'I', 'guess', 'I', 'have', '.', 'Andy', ':', 'Where', 'would', 'you', 'go', '?', 'Red', ':', 'Somewhere', 'warm', '.', 'Maybe', 'Mexico', '.', 'Andy', ':', 'That', 'sounds', 'nice', '.', 'Red', ':', 'What', 'about', 'you', ',', 'Andy', '?', 'Where', 'would', 'you', 'go', '?', 'Andy', ':', 'I', "'d", 'go', 'to', 'the', 'beach', '.', 'Just', 'sit', 'there', 'and', 'listen', 'to', 'the', 'waves', '.', 'Red', ':', 'The', 'beac
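
The output above shows that `word_tokenize` splits contractions ("I've" becomes 'I' and "'ve") and emits punctuation marks as separate tokens. When contractions should stay whole, one alternative is a regular-expression tokenizer. The following is a rough sketch using only the standard library; `simple_tokenize` and its pattern are assumptions made for this illustration and are not part of NLTK.

```python
import re

# A word, optionally followed by an apostrophe part (I've, you'd), or one punctuation mark.
# Limitation of this sketch: digits and non-ASCII letters are skipped entirely.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens (contractions kept whole) and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sample = "Red: I've seen that magazine before."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```

Whether to keep contractions whole is a design choice: splitting them, as NLTK does, lets a later step treat "'ve" like "have", while keeping them whole gives tokens that match the surface text.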

### Step 2: Stop Word Removal
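Stop words are common words (like "the", "is", "in") that are usually removed in text processing because they carry little meaning on their own. The comparison below is done in lowercase, so "The" and "the" are both removed.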

In [22]:
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("\nFiltered Tokens (Stop Words Removed):\n", filtered_tokens)


Filtered Tokens (Stop Words Removed):
 ['Red', ':', "'ve", 'seen', 'magazine', '.', "'s", 'got', 'story', 'guy', 'found', 'treasure', 'map', '.', 'Andy', ':', 'Yeah', ',', "'ve", 'read', '.', 'Red', ':', "'s", 'load', 'crap', '.', 'Andy', ':', 'never', 'know', ',', 'Red', '.', 'Red', ':', 'Yeah', ',', 'well', ',', "'m", 'holding', 'breath', '.', 'Andy', ':', 'Ever', 'think', "'d", 'found', 'treasure', '?', 'Red', ':', 'Yeah', ',', 'guess', '.', 'Andy', ':', 'would', 'go', '?', 'Red', ':', 'Somewhere', 'warm', '.', 'Maybe', 'Mexico', '.', 'Andy', ':', 'sounds', 'nice', '.', 'Red', ':', ',', 'Andy', '?', 'would', 'go', '?', 'Andy', ':', "'d", 'go', 'beach', '.', 'sit', 'listen', 'waves', '.', 'Red', ':', 'beach', '?', 'Andy', ':', 'Yeah', '.', 'beach', '.']
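
The filtered list still contains punctuation marks and contraction fragments such as "'ve" and "'s", because neither appears in the stop word list. A possible follow-up step, sketched here under the assumption that only alphabetic tokens matter for the analysis, is to lowercase the tokens and drop anything that is not purely alphabetic. The helper name `keep_words` is illustrative, and the short sample list is copied from the start of the output above.

```python
def keep_words(tokens):
    """Lowercase tokens and keep only those made up entirely of letters."""
    return [token.lower() for token in tokens if token.isalpha()]

# First few filtered tokens from the output above.
sample_filtered = ['Red', ':', "'ve", 'seen', 'magazine', '.', "'s", 'got', 'story']
clean_tokens = keep_words(sample_filtered)
print(clean_tokens)
```

In the notebook this would be called as `keep_words(filtered_tokens)`. Lowercasing has a cost: the name "Red" becomes indistinguishable from the color "red".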


### Step 3: Stemming
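Stemming reduces words to their root form (e.g., "running" to "run"). The stem is not always a valid word: in the output below, "magazine" becomes "magazin". The Porter stemmer also lowercases its input.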

In [23]:
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("\nStemmed Tokens:\n", stemmed_tokens)


Stemmed Tokens:
 ['red', ':', "'ve", 'seen', 'magazin', '.', "'s", 'got', 'stori', 'guy', 'found', 'treasur', 'map', '.', 'andi', ':', 'yeah', ',', "'ve", 'read', '.', 'red', ':', "'s", 'load', 'crap', '.', 'andi', ':', 'never', 'know', ',', 'red', '.', 'red', ':', 'yeah', ',', 'well', ',', "'m", 'hold', 'breath', '.', 'andi', ':', 'ever', 'think', "'d", 'found', 'treasur', '?', 'red', ':', 'yeah', ',', 'guess', '.', 'andi', ':', 'would', 'go', '?', 'red', ':', 'somewher', 'warm', '.', 'mayb', 'mexico', '.', 'andi', ':', 'sound', 'nice', '.', 'red', ':', ',', 'andi', '?', 'would', 'go', '?', 'andi', ':', "'d", 'go', 'beach', '.', 'sit', 'listen', 'wave', '.', 'red', ':', 'beach', '?', 'andi', ':', 'yeah', '.', 'beach', '.']
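
To see which words stemming actually changed, the stems can be compared with the tokens they came from. The sketch below uses short excerpts of the two lists printed above rather than calling the stemmer again; `changed_by_stemming` is a hypothetical helper name. Because the stemmed output is lowercased, the comparison ignores case.

```python
def changed_by_stemming(original_tokens, stems):
    """Return (token, stem) pairs where the stem differs from the token, ignoring case."""
    return [
        (token, stem)
        for token, stem in zip(original_tokens, stems)
        if token.lower() != stem
    ]

# Excerpts from the filtered and stemmed outputs shown above.
sample_original = ['seen', 'magazine', 'story', 'treasure', 'Andy', 'holding', 'Maybe', 'beach']
sample_stems = ['seen', 'magazin', 'stori', 'treasur', 'andi', 'hold', 'mayb', 'beach']

changes = changed_by_stemming(sample_original, sample_stems)
print(changes)
```

In the notebook the same helper could be applied to the full lists with `changed_by_stemming(filtered_tokens, stemmed_tokens)`.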


### Step 4: Lemmatization
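Lemmatization reduces words to their base or dictionary form and, unlike stemming, returns valid words. `WordNetLemmatizer.lemmatize` treats every word as a noun by default, so plural nouns are reduced ("waves" to "wave") but verb forms such as "holding" are left unchanged unless a part of speech is passed (e.g., `pos='v'`).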

In [24]:
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("\nLemmatized Tokens:\n", lemmatized_tokens)


Lemmatized Tokens:
 ['Red', ':', "'ve", 'seen', 'magazine', '.', "'s", 'got', 'story', 'guy', 'found', 'treasure', 'map', '.', 'Andy', ':', 'Yeah', ',', "'ve", 'read', '.', 'Red', ':', "'s", 'load', 'crap', '.', 'Andy', ':', 'never', 'know', ',', 'Red', '.', 'Red', ':', 'Yeah', ',', 'well', ',', "'m", 'holding', 'breath', '.', 'Andy', ':', 'Ever', 'think', "'d", 'found', 'treasure', '?', 'Red', ':', 'Yeah', ',', 'guess', '.', 'Andy', ':', 'would', 'go', '?', 'Red', ':', 'Somewhere', 'warm', '.', 'Maybe', 'Mexico', '.', 'Andy', ':', 'sound', 'nice', '.', 'Red', ':', ',', 'Andy', '?', 'would', 'go', '?', 'Andy', ':', "'d", 'go', 'beach', '.', 'sit', 'listen', 'wave', '.', 'Red', ':', 'beach', '?', 'Andy', ':', 'Yeah', '.', 'beach', '.']
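
In the output above, 'holding' is unchanged because `lemmatize` was called without a part of speech, and its `pos` argument defaults to noun (`'n'`). The helper below is a sketch of one common way to supply that argument: it maps Penn Treebank tags, the kind returned by `nltk.pos_tag`, to the single-letter codes WordNet uses. The function name `penn_to_wordnet_pos` is made up for this example, and falling back to noun for every other tag is an assumption.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code ('a', 'v', 'r', or 'n')."""
    if tag.startswith('J'):
        return 'a'  # adjective (JJ, JJR, JJS)
    if tag.startswith('V'):
        return 'v'  # verb (VB, VBD, VBG, ...)
    if tag.startswith('RB'):
        return 'r'  # adverb (RB, RBR, RBS)
    return 'n'      # noun, and the fallback for every other tag

print(penn_to_wordnet_pos('VBG'))
print(penn_to_wordnet_pos('JJR'))
print(penn_to_wordnet_pos('NNS'))
```

In the notebook this could be used as `lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))` for each `(word, tag)` pair from `nltk.pos_tag(filtered_tokens)`. Note that `pos_tag` needs an additional tagger resource that this notebook does not download. With `pos='v'`, "holding" is lemmatized to "hold".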
