# Script Content
- Support for PDFs in Subfolders:
Uses pathlib to recursively find PDFs in all subdirectories.
- Improved Text Extraction:
Cleans extracted text by removing extra spaces and newlines.
- Dynamic Chunk Sizing:
Adjusts chunk size and overlap for short texts to avoid empty chunks.
- Enhanced Error Handling:
More robust error handling during PDF processing and embedding creation.
- Threaded PDF Processing:
Uses ThreadPoolExecutor to process multiple PDFs in parallel, speeding up text extraction and chunking.
- Progress Tracking:
Uses tqdm to show progress of PDF processing.
- Configurable Thread Count:
Allows setting the number of concurrent threads for PDF processing.
- Validation of Text Chunks:
Ensures only valid, non-empty text chunks are stored and processed.
- Improved User Interaction:
Validates user input and provides clearer prompts.
- Fallbacks for Model Loading:
Tries multiple model names before falling back to defaults.
- Detailed Logging:
Provides more informative messages during processing, including warnings for skipped files.
- No Support for Non-Embeddable Strings:
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
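
The dynamic chunk sizing item above could look roughly like the sketch below. The function name `chunk_text`, the character-based windows, and the default sizes (500 and 50) are assumptions for illustration; the actual script may chunk differently.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character chunks, skipping empty ones."""
    text = text.strip()
    if not text:
        return []
    # Shrink the window for short texts so a single chunk covers them
    chunk_size = max(1, min(chunk_size, len(text)))
    # Overlap must stay below the chunk size, otherwise the window never advances
    overlap = max(0, min(overlap, chunk_size - 1))
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size].strip()
        if chunk:  # keep only valid, non-empty chunks
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("short text"))
print(len(chunk_text("a" * 1200)))
```

Clamping both values to the text length is one simple way to guarantee that short inputs produce exactly one chunk instead of none.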

## (1) Import relevant libraries

In [42]:
import os
import string
import array
from pathlib import Path
import re
import PyPDF2
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import faiss
from tqdm import tqdm
import pickle
from concurrent.futures import ThreadPoolExecutor, as_completed  # THREADING: Import thread tools

- Let's keep static information, such as data paths, in one place:

In [49]:
fp_rhyme ='../2_text_processing/rhyme.txt'
fp_metamorphosis ='../2_text_processing/metamorphosis.txt'
fp_clean_metamorphosis ='../2_text_processing/metamorphosis_clean.txt'

## (2) File Manipulation in python

### 1. Read file from drive

In [None]:
def load_doc(filename: str):
    # open the file as read only; the with block closes it automatically
    with open(filename, 'r') as file:
        text = file.read()  # read all text
    return text

# ----------------------------------------------------------------

raw_text = load_doc(filename=fp_rhyme)

### 2. Save file

In [None]:
def save_doc(lines, filename):
    data = '\n'.join(lines)  # one sequence per line
    # the with block closes the file even if the write fails
    with open(filename, 'w') as file:
        file.write(data)

# ----------------------------------------------------------------

my_lines = [
    "The Foundation is committed to complying with the laws regulating",
    "charities and charitable donations in all 50 states of the United States.",
    "Compliance requirements are not uniform and it takes a considerable effort,",
    "much paperwork and many fees to meet and keep up with these requirements."
]
save_doc(lines=my_lines, filename='artifacts/my_lines.txt')

### 3. Read text from PDF file

In [None]:
def extract_text_from_pdf(pdf_path):
    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            text = ""
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:  # Skip empty pages
                    text += page_text
            # Clean text: remove extra spaces/newlines and strip whitespace
            # '\s+' matches one or more whitespace characters (space, \n newline, \t tab)
            text = re.sub(r'\s+', ' ', text).strip()
            return text
    except Exception as e:
        print(f"❌ Error extracting text from {pdf_path}: {str(e)}")
        return ""

# ----------------------------------------------------------------

extract_text_from_pdf(pdf_path='knowledge/Python Tricks (en).pdf')
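
To see what the whitespace cleanup inside `extract_text_from_pdf` does on its own, here is a minimal sketch on a made-up string (the `sample` text is an assumption, not taken from any of the PDFs):

```python
import re

# Made-up text with newlines, a tab, and repeated spaces
sample = "Line one\n\n  Line   two\twith a tab  "

# Collapse every run of whitespace into one space, then trim both ends
cleaned = re.sub(r'\s+', ' ', sample).strip()
print(cleaned)  # Line one Line two with a tab
```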



### 4. Tree-traversal of a Directory

In [27]:
pdf_dir = 'knowledge'
pdf_files = list(Path(pdf_dir).rglob("*.pdf"))
for pdf_file in pdf_files:
    print(f'Processing {pdf_file}')
print(f"Total PDF files found: {len(pdf_files)}")

Processing knowledge/Vladimir Kushnir - Safe C++_ How to avoid common mistakes-O'Reilly Media (2012).pdf
Processing knowledge/What Do We Understand About Convolutional Neural Network.pdf
Processing knowledge/William W. Hsieh-Machine Learning Methods in the Environmental Sciences_ Neural Networks and Kernels-Cambridge University Press (2009).pdf
Processing knowledge/C Programming and Numerical Analysis.pdf
Processing knowledge/C Programming Language (2nd Edition).pdf
Processing knowledge/Clean.Code.A.Handbook.of.Agile.Software.Craftsmanship.pdf
Processing knowledge/concepts of programming languages.pdf
Processing knowledge/Python Tricks (en).pdf
Processing knowledge/base-1/AlgorithmsNotesForProfessionals.pdf
Processing knowledge/base-1/BashNotesForProfessionals.pdf
Processing knowledge/base-1/CNotesForProfessionals.pdf
Processing knowledge/base-1/CPlusPlusNotesForProfessionals.pdf
Processing knowledge/base-1/CSharpNotesForProfessionals.pdf
Processing knowledge/base-1/CSSNotesForProfessi

### 5. Threads and Concurrent Execution

In [53]:
import concurrent.futures
import time
import math

- Example 1: ThreadPoolExecutor (I/O-bound task simulation)

In [54]:
def io_bound_task(task_id, delay):
    """Simulate an I/O-bound task (e.g., file read, API call)"""
    print(f"Task {task_id}: Starting (will take {delay}s)")
    time.sleep(delay)  # Simulate waiting for I/O
    print(f"Task {task_id}: Finished")
    return f"Result from Task {task_id}"

    
# -----------------------------------------
io_bound_task(task_id='cute-task', delay=1)

Task cute-task: Starting (will take 1s)
Task cute-task: Finished


'Result from Task cute-task'

- Example 2: ProcessPoolExecutor (CPU-bound task simulation)

In [56]:
def cpu_bound_task(number):
    """Simulate a CPU-bound task (e.g., complex calculation)"""
    print(f"Calculating factorial of {number}")
    result = math.factorial(number)
    print(f"Finished factorial of {number}")
    return result


# --------------------------------------------------------------------

cpu_bound_task(500)

Calculating factorial of 500
Finished factorial of 500


1220136825991110068701238785423046926253574342803192842192413588385845373153881997605496447502203281863013616477148203584163378722078177200480785205159329285477907571939330603772960859086270429174547882424912726344305670173270769461062802310452644218878789465754777149863494367781037644274033827365397471386477878495438489595537537990423241061271326984327745715546309977202781014561081188373709531016356324432987029563896628911658974769572087926928871281780070265174507768410719624390394322536422605234945850129918571501248706961568141625359056693423813008856249246891564126775654481886506593847951775360894005745238940335798476363944905313062323749066445048824665075946735862074637925184200459369692981022263971952597190945217823331756934581508552332820762820023402626907898342451712006207714640979456116127629145951237229913340169552363850942885592018727433795173014586357570828355780158735432768888680120399882384702151467605445407663535984174430480128938313896881639487469658817504506926365338175

In [None]:
def main():
    # Run I/O-bound tasks in parallel with threads
    print("\n=== Running I/O-bound tasks with ThreadPoolExecutor ===")
    start_time = time.time()
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        # Submit tasks with different delays
        tasks = [
            executor.submit(io_bound_task, 1, 2),
            executor.submit(io_bound_task, 2, 3),
            executor.submit(io_bound_task, 3, 1)
        ]
        
        # Get results as tasks complete
        for future in concurrent.futures.as_completed(tasks):
            try:
                result = future.result()
                print(f"Received: {result}")
            except Exception as e:
                print(f"Task failed: {str(e)}")
    
    print(f"I/O tasks completed in {time.time() - start_time:.2f}s")

    # Run CPU-bound tasks in parallel with processes
    print("\n=== Running CPU-bound tasks with ProcessPoolExecutor ===")
    start_time = time.time()
    
    numbers = [100000, 120000, 110000]  # Large numbers to stress CPU
    with concurrent.futures.ProcessPoolExecutor(max_workers=3) as executor:
        # Map numbers to the cpu_bound_task function
        results = executor.map(cpu_bound_task, numbers)
        
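        # Process results in order
        for num, result in zip(numbers, results):
            # Count digits with log10: str() on an int this large raises
            # ValueError on Python 3.11+ (default limit of 4300 digits)
            digits = math.floor(math.log10(result)) + 1
            print(f"Factorial of {num} has {digits} digits")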
    
    print(f"CPU tasks completed in {time.time() - start_time:.2f}s")

# -----------------------------------------------------------------------------

main()
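
The same thread pool pattern can be pointed at files found by a recursive directory walk, which is roughly how threaded PDF processing could be wired up. This is only a sketch: `count_words` and `process_files` are hypothetical helpers, and plain text files in a temporary directory stand in for the PDFs.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def count_words(path):
    """Hypothetical per-file job: read a text file and count its words."""
    return len(path.read_text(encoding="utf-8").split())


def process_files(paths, max_workers=4):
    """Run count_words on every path in a thread pool; skip files that fail."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its path so results can be labeled
        future_to_path = {executor.submit(count_words, p): p for p in paths}
        for future in as_completed(future_to_path):
            path = future_to_path[future]
            try:
                results[path.name] = future.result()
            except Exception as e:
                print(f"Skipping {path}: {e}")
    return results


# Demo on throwaway files, including one in a subfolder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("one two three", encoding="utf-8")
    (root / "sub" / "b.txt").write_text("four five", encoding="utf-8")
    counts = process_files(sorted(root.rglob("*.txt")))

print(counts)
```

Keeping a dictionary from future to path means a failure can be reported against the file that caused it, and one bad file does not stop the rest.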


## (3) Manual Tokenization

### 1. Clean Text

In [12]:
tokens = raw_text.split()
raw_text = ' '.join(tokens)

### 2. Create Sequences

In [16]:
# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select a window of length + 1 characters (10 inputs plus 1 output)
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)


print(f'Total Sequences: {len(sequences)}')
print(sequences[:5])

Total Sequences: 399
['Sing a song', 'ing a song ', 'ng a song o', 'g a song of', ' a song of ']


### 3. Save Sequences

In [None]:
out_filename = 'artifacts/char_sequences.txt'
save_doc(sequences, out_filename)

### 4. Encode Sequences

In [18]:
raw_text = load_doc(filename='artifacts/char_sequences.txt')
lines = raw_text.split('\n')
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
print(f'Unique Characters: {len(chars)}')
print(mapping)

Unique Characters: 38
{'\n': 0, ' ': 1, "'": 2, ',': 3, '.': 4, ';': 5, 'A': 6, 'B': 7, 'C': 8, 'E': 9, 'F': 10, 'H': 11, 'S': 12, 'T': 13, 'W': 14, 'a': 15, 'b': 16, 'c': 17, 'd': 18, 'e': 19, 'f': 20, 'g': 21, 'h': 22, 'i': 23, 'k': 24, 'l': 25, 'm': 26, 'n': 27, 'o': 28, 'p': 29, 'q': 30, 'r': 31, 's': 32, 't': 33, 'u': 34, 'w': 35, 'x': 36, 'y': 37}


- Next, we can process each sequence of characters one at a time and use the dictionary mapping to look up the integer value for each character.

In [20]:
sequences = list()
for line in lines:
    encoded_seq = [mapping[char] for char in line] # integer encode line 
    sequences.append(encoded_seq) # store

sequences[:5]

[[12, 23, 27, 21, 1, 15, 1, 32, 28, 27, 21],
 [23, 27, 21, 1, 15, 1, 32, 28, 27, 21, 1],
 [27, 21, 1, 15, 1, 32, 28, 27, 21, 1, 28],
 [21, 1, 15, 1, 32, 28, 27, 21, 1, 28, 20],
 [1, 15, 1, 32, 28, 27, 21, 1, 28, 20, 1]]

- The result is a list of integer lists. We need to know the size of the vocabulary later. We can retrieve this as the size of the dictionary mapping.

In [21]:
# vocabulary size
vocab_size = len(mapping)
print(f'Vocabulary Size: {vocab_size}')

Vocabulary Size: 38
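
A quick way to sanity-check such a mapping is to invert it and decode an encoded sequence back to text. The sketch below rebuilds a small mapping from a short sample string, so the integer values differ from the ones printed above; the name `inverse` is an assumption.

```python
sample_text = "Sing a song of sixpence"

# Character-to-integer mapping, built the same way as above
chars = sorted(set(sample_text))
mapping = {c: i for i, c in enumerate(chars)}

# Integer-to-character mapping for decoding
inverse = {i: c for c, i in mapping.items()}

encoded = [mapping[c] for c in sample_text[:11]]
decoded = ''.join(inverse[i] for i in encoded)
print(encoded)
print(decoded)  # Sing a song
```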


### 5. Split Inputs and Output

In [24]:
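import numpy as np

# array.array() expects a typecode as its first argument and cannot hold a
# list of lists, so it raised a TypeError here. A NumPy array supports the
# 2-D slicing used below.
sequences = np.array(sequences)
# all columns except the last are the input; the last column is the output
X, y = sequences[:, :-1], sequences[:, -1]

print(X[:5], y[:5])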

- CountVectorizer returns an encoded vector with the length of the entire vocabulary and an integer count of how many times each word appeared in the document. Because these vectors contain many zeros, we call them sparse. Python provides an efficient way of handling sparse vectors in the scipy.sparse package. The vectors returned from a call to transform() are sparse, and you can convert them to NumPy arrays with the toarray() function to inspect them. The section Word Counts with CountVectorizer gives an example of using the CountVectorizer to tokenize, build a vocabulary, and then encode a document.
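
The idea of a vocabulary-length count vector can be sketched in plain Python before reaching for scikit-learn. This is an illustration only, not how CountVectorizer is implemented; the variable names and the whitespace split are assumptions.

```python
from collections import Counter

document = "the quick brown fox jumped over the lazy dog"
tokens = document.split()

# Vocabulary: each distinct word gets a fixed column index (alphabetical here)
vocabulary = {word: idx for idx, word in enumerate(sorted(set(tokens)))}

# Dense count vector with one slot per vocabulary entry
counts = Counter(tokens)
vector = [0] * len(vocabulary)
for word, n in counts.items():
    vector[vocabulary[word]] = n

# Sparse view: store only the non-zero entries as {column: count}
sparse = {i: n for i, n in enumerate(vector) if n}

print(vocabulary)
print(vector)
```

With a single short document every slot is filled, but with a vocabulary built from many documents most entries of each vector would be zero, which is why a sparse representation pays off.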

### 1. Select words

In [None]:
import re
# load text
file = open(fp_clean_metamorphosis, 'rt')
text = file.read()
file.close()
# split based on words only
words = re.split(r'\W+', text)
print(words[:100])

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room']


- We may want the words, but without punctuation such as commas and quotes. We also want to keep contractions together. One way would be to split the document into words by white space, then use string translation to replace all punctuation with nothing (i.e. remove it). Python provides a constant called string.punctuation that contains a useful list of punctuation characters.

In [36]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
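
The string translation approach mentioned above can be sketched with `str.maketrans` and `str.translate`. The `sample_words` list is made up for illustration.

```python
import string

sample_words = ["wasn't", "dream.", "armour-like", "room,"]

# Translation table that deletes every character in string.punctuation
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in sample_words]
print(stripped)  # ['wasnt', 'dream', 'armourlike', 'room']
```

Note that string.punctuation only covers ASCII characters, so typographic quotes such as ’ or “ would pass through this table unchanged.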


- We can use regular expressions to select for the punctuation characters and use the sub() function to replace them with nothing. For example:

In [38]:
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in words]
print(stripped[:100])

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room']


- Sometimes text data may contain non-printable characters. We can use a similar approach to filter out all non-printable characters by selecting the inverse of the string.printable constant. For example:

In [40]:
re_print = re.compile('[^%s]' % re.escape(string.printable))
result = [re_print.sub('', w) for w in words]

- It is common to convert all words to one case. This means that the vocabulary will shrink in size, but some distinctions are lost (e.g. Apple the company vs apple the fruit is a commonly used example). We can convert all words to lowercase by calling the lower() function on each word. For example:

In [41]:
# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])

['one', 'morning,', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'he', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'his', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '“what’s', 'happened', 'to', 'me?”', 'he', 'thought.', 'it', 'wasn’t', 'a', 'dream.', 'his', 'room,', 'a', 'proper', 'human']


## (4) Tokenization and Cleaning with NLTK

- The Natural Language Toolkit, or NLTK for short, is a Python library written for working with and modeling text. It provides good tools for loading and cleaning text that we can use to get our data ready for machine learning and deep learning algorithms.

In [47]:
#!pip install nltk
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/njad/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk import sent_tokenize

# load data
file = open(fp_clean_metamorphosis, 'rt')
text = file.read()
file.close()
# split into sentences
sentences = sent_tokenize(text)
print(sentences[0])

### 1. Split into Words

In [None]:
from nltk.tokenize import word_tokenize

# split into words
tokens = word_tokenize(text)
print(tokens[:100])

- We can filter out all tokens that we are not interested in, such as all standalone punctuation. This can be done by iterating over all tokens and keeping only those that are entirely alphabetic. Python strings have the method isalpha() for this. For example:

In [None]:
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])

- Stop words are those words that do not contribute to the deeper meaning of the phrase. They are the most common words such as: the, a, and is. For some applications like document classification, it may make sense to remove stop words. NLTK provides a list of commonly agreed upon stop words for a variety of languages, such as English. They can be loaded as follows:

In [None]:
import nltk
from nltk.corpus import stopwords

# the stop word lists are a separate download from 'punkt'
nltk.download('stopwords')
stop_words = stopwords.words('english')
print(stop_words)
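
Once a stop word list is available, filtering is a single list comprehension. The sketch below uses a tiny hand-written set, `demo_stop_words`, as a stand-in so it runs without any downloads; the NLTK list is much longer.

```python
# Hypothetical mini stop list standing in for stopwords.words('english')
demo_stop_words = {"the", "a", "and", "is", "in", "his", "he", "into", "from"}

demo_tokens = ["One", "morning", "when", "Gregor", "Samsa", "woke", "from",
               "troubled", "dreams", "he", "found", "himself", "transformed",
               "in", "his", "bed", "into", "a", "horrible", "vermin"]

# Lowercase each token for the lookup, since stop lists are lowercase
filtered = [w for w in demo_tokens if w.lower() not in demo_stop_words]
print(filtered)
```

Using a set rather than a list keeps each membership test fast, which matters when filtering a whole book.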

### 2. Stem

In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

# split into words
tokens = word_tokenize(text)
# stemming of words
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

### 3. Word Counts with CountVectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
(1, 8)
<class 'scipy.sparse._csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]


### 4. Word Frequencies with TfidfVectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The fox"
]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]
(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]
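
The numbers above can be reproduced by hand, which helps to see what the vectorizer computes. The sketch below assumes the scikit-learn defaults (smoothed IDF and L2 normalization) and a simple whitespace split on already-cleaned text; the variable names are illustrative.

```python
import math

docs = [
    "the quick brown fox jumped over the lazy dog",
    "the dog",
    "the fox",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for word in vocab:
    df = sum(word in doc for doc in tokenized)  # documents containing the word
    idf[word] = math.log((1 + n_docs) / (1 + df)) + 1

# TF-IDF for the first document: raw count times IDF, then L2 normalization
first = tokenized[0]
raw = [first.count(word) * idf[word] for word in vocab]
norm = math.sqrt(sum(v * v for v in raw))
tfidf = [v / norm for v in raw]

print([round(idf[w], 8) for w in vocab])
print([round(v, 8) for v in tfidf])
```

The word "the" appears in all three documents, so its IDF is the minimum of 1.0, yet it still gets the largest score in the first document because it occurs there twice.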


### 5. Hashing with HashingVectorizer

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = HashingVectorizer(n_features=20)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

(1, 20)
[[ 0.          0.          0.          0.          0.          0.33333333
   0.         -0.33333333  0.33333333  0.          0.          0.33333333
   0.          0.          0.         -0.33333333  0.          0.
  -0.66666667  0.        ]]
