## Text_Parsing test


In [55]:
test_text = "This is a sample text. It contains some words, and stop words like 'the' and 'is'."
original_text = test_text

In [56]:
def print_comparison():
    print("Original text: ", original_text)
    print("Test text:     ", test_text)

### Lowercasing text

In [57]:
test_text = test_text.lower()
print_comparison()

Original text:  This is a sample text. It contains some words, and stop words like 'the' and 'is'.
Test text:      this is a sample text. it contains some words, and stop words like 'the' and 'is'.


### Removing punctuation and numbers

In [58]:
import re # regular expressions

In [59]:
# Keep letters only: punctuation and digits are replaced with spaces.
# To keep digits as well, use r"[^a-zA-Z0-9]" instead; the right choice depends on the use case.
test_text = re.sub(r"[^a-zA-Z]", " ", test_text)
print_comparison()

Original text:  This is a sample text. It contains some words, and stop words like 'the' and 'is'.
Test text:      this is a sample text  it contains some words  and stop words like  the  and  is  
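Whether digits should survive this step depends on the use case. The following is a minimal sketch comparing the two patterns; the `sample` string is a made-up input, not part of the notebook's data.

```python
import re

# Hypothetical input containing digits (not from the notebook)
sample = "Room 101 opens at 9am, floor 3."

# Letters only: digits and punctuation become spaces
letters_only = re.sub(r"[^a-zA-Z]", " ", sample)

# Letters and digits: only punctuation becomes a space
letters_and_digits = re.sub(r"[^a-zA-Z0-9]", " ", sample)

# split() with no arguments also collapses the runs of spaces left behind
print(letters_only.split())
print(letters_and_digits.split())
```

With the letters-only pattern, a token such as `9am` loses its digit and becomes `am`, which may or may not be acceptable for a given task.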


### Removing stopwords

In [60]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kllmm\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [61]:
sw = stopwords.words("english")
sw[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
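`stopwords.words("english")` returns a plain Python list, so each `in` check scans the list from the start. A minimal sketch of filtering against a set instead is shown here; `stop_list` is a small hypothetical stand-in for the NLTK list so the example runs without downloading any data.

```python
# Hypothetical stand-in for stopwords.words("english") (assumption, much shorter than the real list)
stop_list = ["i", "me", "my", "the", "is", "a", "it", "and", "this", "some"]

# A set gives constant-time membership tests on average
stop_set = set(stop_list)

tokens = "this is a sample text it contains some words".split()

# Build a new list rather than mutating tokens while looping over it
kept = [word for word in tokens if word not in stop_set]
print(kept)  # ['sample', 'text', 'contains', 'words']
```

For a single short sentence the difference is negligible; it matters more when filtering a whole book.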

In [62]:
test_text = test_text.split()
test_text[:10]

['this', 'is', 'a', 'sample', 'text', 'it', 'contains', 'some', 'words', 'and']

In [63]:
# Build a new list instead of calling remove() while iterating:
# removing items inside the loop shifts the remaining elements, so some stop words get skipped.
test_text = [word for word in test_text if word not in sw]
test_text[:10]

['sample', 'text', 'contains', 'words', 'stop', 'words', 'like']

### Lemmatization or Stemming

In [64]:
# using stemming for speed
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [65]:
test_text = [ps.stem(word) for word in test_text]
test_text[:10]

['sampl', 'text', 'contain', 'word', 'stop', 'word', 'like']

In [66]:
test_text = " ".join(test_text)
test_text

'sampl text contain word stop word like'

In [67]:
print_comparison()

Original text:  This is a sample text. It contains some words, and stop words like 'the' and 'is'.
Test text:      sampl text contain word stop word like


In [68]:
# count how often each stemmed token appears
dictionary = {}

for word in test_text.split():
    if word in dictionary:
        dictionary[word] += 1
    else:
        dictionary[word] = 1
dictionary

{'sampl': 1, 'text': 1, 'contain': 1, 'word': 2, 'stop': 1, 'like': 1}

In [69]:
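def text_parse(text):
    """Lowercase, strip non-letters, drop English stop words, and stem the remaining tokens."""
    text = text.lower()
    # anything that is not a letter becomes a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.split()
    stop_words = set(stopwords.words('english'))  # set for fast membership tests
    tokens = [ps.stem(word) for word in words if word not in stop_words]
    return ' '.join(tokens)

text = text_parse(original_text)
print(text)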


sampl text contain word stop word like


### Counter test


In [70]:
from collections import Counter

tokens = text.split()
# manual count with a dict comprehension, for comparison with Counter
most_occurrence = {word: tokens.count(word) for word in tokens}
print(most_occurrence)
print(Counter(tokens))

# sort by count, highest first
sorted_count = sorted(most_occurrence.items(),
                      key=lambda val: val[1], reverse=True)
print(sorted_count)

for word, count in sorted_count[:5]:
    print(f"{word}: {count}")



{'sampl': 1, 'text': 1, 'contain': 1, 'word': 2, 'stop': 1, 'like': 1}
Counter({'word': 2, 'sampl': 1, 'text': 1, 'contain': 1, 'stop': 1, 'like': 1})
[('word', 2), ('sampl', 1), ('text', 1), ('contain', 1), ('stop', 1), ('like', 1)]
word: 2
sampl: 1
text: 1
contain: 1
stop: 1
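`Counter` also provides `most_common`, which returns the highest counts directly without a separate `sorted` call. A minimal sketch on the same stemmed tokens as above:

```python
from collections import Counter

# Same stemmed tokens as the parsed sample sentence
tokens = "sampl text contain word stop word like".split()
counts = Counter(tokens)

# most_common(n) returns (word, count) pairs, highest count first;
# ties keep the order in which the words were first seen
top = counts.most_common(5)
for word, count in top:
    print(f"{word}: {count}")
```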


In [71]:
def word_freq_analysis(text, top_n=5):
    """Print the top_n most frequent whitespace-separated tokens in text."""
    tokens = text.split()
    words_count = Counter(tokens)

    # sort by count, highest first
    sorted_count = sorted(words_count.items(),
                          key=lambda val: val[1], reverse=True)

    for word, count in sorted_count[:top_n]:
        print(f'{word}: {count}')

In [72]:
text_file = './gutenberg.org_cache_epub_71894_pg71894.txt'

In [73]:
with open(text_file, 'r', encoding='utf-8') as file:
    text = file.read()

gutt_book = text_parse(text)
word_freq_analysis(gutt_book,10)


hellen: 264
cyru: 205
king: 176
great: 175
would: 171
could: 155
one: 154
day: 133
soldier: 129
time: 124


## Summary


``` 
First, the text was lowercased so that the same word appearing in different cases is treated as a single token.

Next, punctuation and numbers were removed. This eliminates noise that does not contribute to the meaning of the text and makes it simpler to process and analyze.

Then stop words were filtered out. These are common words, such as 'the' and 'is', that carry little meaning in text analysis.

After that, stemming was applied with NLTK's Porter stemmer (lemmatization is the alternative). The goal is to reduce words to a root form, which shrinks the vocabulary while keeping the core meaning of each word. This benefits the NLP tasks we will work on in the future.

Lastly, a counter test was implemented to count how often each word appears in the text.
```

## Analysis

```
Text parsing is a core step in text analytics, and the techniques used in this notebook apply to a wide range of future applications.

The preprocessing steps outlined above form a foundation that can be adapted across several NLP projects.

Whether the focus shifts to sentiment analysis to gauge customer opinions, text classification to automate document sorting, conversational chatbots for customer service, or tuning search engines for more accurate query results, this notebook can serve as a starting guide.

The foundation established here is intended to improve the performance, accuracy, and reliability of the NLP and machine learning projects we will take on in the future.
```

## Future improvements

### ngrams


In [None]:
from nltk.util import ngrams


In [None]:
list(ngrams(gutt_book.split(), 2))[:10]

[('project', 'gutenberg'),
 ('gutenberg', 'ebook'),
 ('ebook', 'retreat'),
 ('retreat', 'ten'),
 ('ten', 'thousand'),
 ('thousand', 'ebook'),
 ('ebook', 'use'),
 ('use', 'anyon'),
 ('anyon', 'anywher'),
 ('anywher', 'unit')]
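The first bigrams come from the Project Gutenberg header rather than from the book itself. One possible improvement is to cut the text down to the part between the start and end markers before parsing. The sketch below assumes the file uses the usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` lines; the marker wording and the helper name `strip_gutenberg_boilerplate` are assumptions and should be checked against the actual file.

```python
import re

# Assumed marker formats; older files may say "THIS" instead of "THE"
START = re.compile(r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.S)
END = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.S)

def strip_gutenberg_boilerplate(raw):
    """Return the text between the START and END markers, or raw unchanged if either is missing."""
    start = START.search(raw)
    end = END.search(raw)
    if not start or not end or end.start() < start.end():
        return raw
    return raw[start.end():end.start()].strip()

# Tiny synthetic document standing in for the real file
demo = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter one.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text."
)
print(strip_gutenberg_boilerplate(demo))  # Chapter one.
```

In this notebook the call would sit in front of the parser, for example `text_parse(strip_gutenberg_boilerplate(text))`.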

In [None]:
list(ngrams(gutt_book.split(), 3))[:10]

[('project', 'gutenberg', 'ebook'),
 ('gutenberg', 'ebook', 'retreat'),
 ('ebook', 'retreat', 'ten'),
 ('retreat', 'ten', 'thousand'),
 ('ten', 'thousand', 'ebook'),
 ('thousand', 'ebook', 'use'),
 ('ebook', 'use', 'anyon'),
 ('use', 'anyon', 'anywher'),
 ('anyon', 'anywher', 'unit'),
 ('anywher', 'unit', 'state')]
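Listing the first n-grams only shows how the text begins. A natural next step is to count them, as was done for single words. The sketch below uses a small standard-library helper, `make_ngrams` (a hypothetical name), in place of `nltk.util.ngrams` so that it runs without NLTK; the token list is made up for illustration.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return an iterator of n-token tuples built with a sliding window."""
    # zip stops at the shortest slice, so only complete n-grams are produced
    return zip(*(tokens[i:] for i in range(n)))

# Made-up token list (not taken from the book)
tokens = "ten thousand hellen march ten thousand stadia".split()

bigram_counts = Counter(make_ngrams(tokens, 2))
print(bigram_counts.most_common(1))  # [(('ten', 'thousand'), 2)]
```

The same `Counter` call should also work on the output of `ngrams(gutt_book.split(), 2)`, since both produce tuples of tokens.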