##### Title: Exercise 4.2
##### Author: Jerock Kalala
##### Date: January 20th 2023
##### Modified By: --
##### Using Natural Language Processing (NLP)


1. In the text, a text normalizer is created. Your assignment is to re-create that normalizer as a reusable Python class (within a .py file). However, unlike the book author's version, pass a Pandas Series (e.g., dataframe['column']) to your normalize_corpus function and use apply/lambda for each cleaning function.

In [2]:
from bs4 import BeautifulSoup
from typing import List
import pandas as pd
import unicodedata
import re
import nltk
from nltk.corpus import gutenberg
import string
import spacy

class TextNormalizer:

    def normalize_corpus(self, corpus: pd.Series, html_stripping=True, contraction_expansion=True, accented_char_removal=True,
                     text_lower_case=True, text_lemmatization=True, special_char_removal=True,
                     stopword_removal=True, remove_digits=True) -> pd.Series:
        """
        corpus : pd.Series :  A pandas series containing text data
        html_stripping : bool :  whether or not to remove html tags from the text
        contraction_expansion : bool : whether or not to expand contractions
        accented_char_removal : bool : whether or not to remove accented characters
        text_lower_case : bool : whether or not to convert text to lowercase
        text_lemmatization : bool : whether or not to lemmatize text
        special_char_removal : bool : whether or not to remove special characters and/or digits
        stopword_removal : bool : whether or not to remove stopwords
        remove_digits : bool : whether or not to remove digits from the text
        """
        corpus = corpus.apply(lambda doc: self.strip_html_tags(doc) if html_stripping else doc)
        corpus = corpus.apply(lambda doc: self.remove_accented_chars(doc) if accented_char_removal else doc)
        corpus = corpus.apply(lambda doc: self.expand_contractions(doc) if contraction_expansion else doc)
        corpus = corpus.apply(lambda doc: doc.lower() if text_lower_case else doc)
        corpus = corpus.apply(lambda doc: re.sub(r'[\r\n]+', ' ', doc))
        corpus = corpus.apply(lambda doc: self.text_lemmatization(doc) if text_lemmatization else doc)
        corpus = corpus.apply(lambda doc: self.remove_special_character(doc, remove_digits) if special_char_removal else doc)
        corpus = corpus.apply(lambda doc: re.sub(' +', ' ', doc))
        corpus = corpus.apply(lambda doc: self.remove_stopwords(doc, is_lower_case=text_lower_case) if stopword_removal else doc)
        return corpus

    def strip_html_tags(self, text):
        soup = BeautifulSoup(text, "html.parser")
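        # Drop embedded frames and scripts before extracting the visible text
        for tag in soup(['iframe', 'script']):
            tag.extract()
        stripped_text = soup.get_text()
        # Collapse runs of line breaks; inside [...] a '|' would be matched literally
        stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)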
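        return stripped_text

    def remove_accented_chars(self, text):
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        return text

    def expand_contractions(self, text):
        # normalize_corpus calls this method, but it was missing from the class.
        # Assumption: this small mapping is illustrative only; the book relies on a much larger contraction map.
        contraction_map = {"can't": "cannot", "won't": "will not", "n't": " not",
                           "'re": " are", "'ll": " will", "'ve": " have", "'m": " am"}
        for contraction, expansion in contraction_map.items():
            text = re.sub(re.escape(contraction), expansion, text, flags=re.IGNORECASE)
        return text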

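    def text_lemmatization(self, text):
        # Load the spaCy model once and cache it on the instance;
        # reloading it for every document is very slow.
        if not hasattr(self, '_nlp'):
            self._nlp = spacy.load('en_core_web_sm')
        doc = self._nlp(text)
        # '-PRON-' is the pronoun placeholder lemma used by older spaCy versions
        text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])
        return text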

    def remove_special_character(self, text, remove_digits=False):
        pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
        text = re.sub(pattern, '', text)
        return text

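    def remove_stopwords(self, text, is_lower_case=False):
        # Requires the NLTK 'stopwords' corpus (nltk.download('stopwords'))
        stopword_list = set(nltk.corpus.stopwords.words('english'))
        if is_lower_case:
            # Text is already lowercase, so compare tokens directly
            text = [word for word in text.split() if word not in stopword_list]
        else:
            # Lowercase each token for the comparison so that 'The' is removed as well
            text = [word for word in text.split() if word.lower() not in stopword_list]
        return ' '.join(text)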
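
Two regular-expression details matter in cleaning functions like the ones above. Inside a character class, `|` is a literal pipe rather than alternation, and the range `A-z` also covers the punctuation that sits between `Z` and `a` in ASCII (`[`, `\`, `]`, `^`, `_` and the backtick). The sketch below is not part of the original notebook, and the sample strings are made up for illustration.

```python
import re

# Inside [...] the pipe is a literal character, so this class also matches "|".
loose_newlines = re.compile(r'[\r|\n|\r\n]+')
strict_newlines = re.compile(r'[\r\n]+')

sample = "a|b\r\nc"   # made-up sample text
print(loose_newlines.sub(' ', sample))    # the pipe is replaced as well
print(strict_newlines.sub(' ', sample))   # only the line break is replaced

# The range A-z spans ASCII 65..122, which includes [ \ ] ^ _ and the backtick.
loose_letters = re.compile(r'[^a-zA-z\s]')
strict_letters = re.compile(r'[^a-zA-Z\s]')

noisy = "snake_case [x] ^ok"   # made-up sample text
print(loose_letters.sub('', noisy))    # underscore, brackets and caret survive
print(strict_letters.sub('', noisy))   # only letters and whitespace remain
```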

2. Using your new text normalizer, create a Jupyter Notebook that uses this class to clean up the text found in the file big.txt (that text file is in the Week 4 GitHub repository). Your resulting text should be a (long) single stream of text.

In [3]:
# read the contents of big.txt into a pandas series
big_txt = pd.read_csv('big.txt', header=None, names=['text'])

# create an instance of the TextNormalizer class
normalizer = TextNormalizer()

# normalize the big_txt series using the normalize_corpus function
big_txt_normalized = normalizer.normalize_corpus(big_txt['text'])

ParserError: Error tokenizing data. C error: Expected 1 fields in line 13, saw 5


Note: I could not fix the error above. I emailed you, as my professor, for help, given your knowledge and experience with Python, but did not get the help I needed; instead, you asked me to submit the work before the end of the day. Unable to solve it on my own, I submitted it as is, but I normalized the text by passing it to each cleaning function directly and saved the result to a file named big_txt_normalized.txt. I still hope to receive help fixing this error, because I need not only a good grade; the knowledge matters most.
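
A likely cause of the ParserError above: `pd.read_csv` treats commas as field separators, so a line of prose that contains commas is split into several fields (line 13 of big.txt yields 5 where 1 was expected). One possible workaround, sketched here under the assumption that big.txt is plain UTF-8 text, is to read the file line by line with the standard library and wrap the result in a Series. The helper name `read_text_lines` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def read_text_lines(path, encoding="utf-8"):
    """Return the non-empty lines of a plain-text file as a list of strings."""
    # errors="ignore" skips bytes that cannot be decoded (assumption: acceptable here)
    raw = Path(path).read_text(encoding=encoding, errors="ignore")
    return [line for line in raw.splitlines() if line.strip()]


# Hypothetical usage with the notebook's objects:
# big_txt = pd.Series(read_text_lines('big.txt'), name='text')
# big_txt_normalized = ' '.join(normalizer.normalize_corpus(big_txt))
```

Joining the normalized Series with spaces would give the single stream of text that the exercise asks for.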

In [8]:
from pathlib import Path
big_txt_normalized = Path('big_txt_normalized.txt').read_text()


3. Using spaCy and NLTK, show the tokens, lemmas, parts of speech, and dependencies in the first 1,021 characters of big.txt.

In [10]:
import spacy

# load the spaCy model
nlp = spacy.load("en_core_web_sm")

# get the first 1021 characters of big_txt_normalized
text = big_txt_normalized[:1021]

# process the text using spaCy
doc = nlp(text)

# print the tokens, lemmas, POS tags, and dependencies
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

the the DET det
Project Project PROPN compound
Gutenberg Gutenberg PROPN compound
EBook EBook PROPN nsubj
of of ADP prep
the the DET det
Adventures Adventures PROPN pobj
of of ADP prep
Sherlock Sherlock PROPN compound
Holmesby Holmesby PROPN compound
Sir Sir PROPN compound
Arthur Arthur PROPN compound
Conan Conan PROPN compound
Doyle15 Doyle15 PROPN pobj
in in ADP prep
our our PRON poss
series series NOUN pobj
by by ADP prep
Sir Sir PROPN compound
Arthur Arthur PROPN compound
Conan Conan PROPN compound
doylecopyright doylecopyright PROPN compound
law law NOUN pobj
be be AUX aux
change change NOUN ROOT
all all ADV advmod
over over ADP prep
the the DET det
world world NOUN pobj
    SPACE dep
be be AUX advcl
sure sure ADJ acomp
to to PART aux
check check VERB xcomp
thecopyright thecopyright ADJ amod
law law NOUN dobj
for for ADP prep
your your PRON poss
country country NOUN pobj
before before ADP prep
download download NOUN pobj
or or CCONJ cc
redistributingthis redistributingthis NOUN co

In [12]:
# Done using NLTK:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
import nltk
nltk.download('averaged_perceptron_tagger')

# get the first 1021 characters of big_txt_normalized
text = big_txt_normalized[:1021]

# tokenize the text
tokens = word_tokenize(text)

# get the POS tags and lemmatize the tokens
tagged_tokens = pos_tag(tokens)
lemmatizer = WordNetLemmatizer()
for token in tagged_tokens:
    print(token[0], lemmatizer.lemmatize(token[0]), token[1])

the the DT
Project Project NNP
Gutenberg Gutenberg NNP
EBook EBook NNP
of of IN
the the DT
Adventures Adventures NNP
of of IN
Sherlock Sherlock NNP
Holmesby Holmesby NNP
Sir Sir NNP
Arthur Arthur NNP
Conan Conan NNP
Doyle15 Doyle15 NNP
in in IN
our our PRP$
series series NN
by by IN
Sir Sir NNP
Arthur Arthur NNP
Conan Conan NNP
doylecopyright doylecopyright VBD
law law NN
be be VB
change change VBN
all all DT
over over IN
the the DT
world world NN
be be VB
sure sure JJ
to to TO
check check VB
thecopyright thecopyright JJ
law law NN
for for IN
your your PRP$
country country NN
before before IN
download download NN
or or CC
redistributingthis redistributingthis NN
or or CC
any any DT
other other JJ
Project Project NNP
Gutenberg Gutenberg NNP
eBook eBook VB
this this DT
header header NN
should should MD
be be VB
the the DT
first first JJ
thing thing NN
see see NN
when when WRB
view view NN
this this DT
ProjectGutenberg ProjectGutenberg NNP
file file NN
please please NN
do do VB
not not RB

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\jeroc\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
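
In the NLTK output above, every lemma equals its token. Part of the reason is that the input was already lemmatized by the normalizer, but `WordNetLemmatizer.lemmatize` also treats each word as a noun unless a `pos` argument is given. A minimal sketch of passing the tagger's result on to the lemmatizer follows. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and the lemmatizer is passed in as a parameter so that the block itself does not depend on downloaded NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # default to noun, which is also the lemmatizer's own default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Return (token, lemma, tag) triples using a POS-aware lemmatize call."""
    return [(word, lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)), tag)
            for word, tag in tagged_tokens]


# Hypothetical usage with the notebook's objects:
# for row in lemmatize_tagged(pos_tag(tokens), WordNetLemmatizer()):
#     print(*row)
```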
