##### Title: Exercise 4.2
##### Author: Jerock Kalala
##### Date: January 20th 2023
##### Modified By: --
##### Using Natural Language Processing (NLP)


1. The text creates a text normalizer. Your assignment is to re-create that normalizer as a reusable Python class (within a .py file). However, unlike the book author's version, pass a Pandas Series (e.g., dataframe['column']) to your normalize_corpus function and use apply/lambda for each cleaning function.

In [8]:
from bs4 import BeautifulSoup
from typing import List
import pandas as pd
import unicodedata
import re
import nltk
from nltk.corpus import gutenberg
import string
import spacy

class TextNormalizer:

    def normalize_corpus(self, corpus: pd.Series, html_stripping=True, contraction_expansion=False, accented_char_removal=True,
                     text_lower_case=True, text_lemmatization=True, special_char_removal=True,
                     stopword_removal=True, remove_digits=True) -> pd.Series:
        """
        corpus : pd.Series :  A pandas series containing text data
        html_stripping : bool :  whether or not to remove html tags from the text
        contraction_expansion : bool : whether or not to expand contractions
        accented_char_removal : bool : whether or not to remove accented characters
        text_lower_case : bool : whether or not to convert text to lowercase
        text_lemmatization : bool : whether or not to lemmatize text
        special_char_removal : bool : whether or not to remove special characters and/or digits
        stopword_removal : bool : whether or not to remove stopwords
        remove_digits : bool : whether or not to remove digits from the text
        """
        corpus = corpus.apply(lambda doc: self.strip_html_tags(doc) if html_stripping else doc)
        corpus = corpus.apply(lambda doc: self.remove_accented_chars(doc) if accented_char_removal else doc)
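        # NOTE: this class does not define expand_contractions, so passing
        # contraction_expansion=True raises AttributeError until that method is added
        corpus = corpus.apply(lambda doc: self.expand_contractions(doc) if contraction_expansion else doc)
        corpus = corpus.apply(lambda doc: doc.lower() if text_lower_case else doc)
        # collapse runs of line breaks into one space ('|' inside [...] is a literal pipe, not "or")
        corpus = corpus.apply(lambda doc: re.sub(r'[\r\n]+', ' ', doc))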
        corpus = corpus.apply(lambda doc: self.text_lemmatization(doc) if text_lemmatization else doc)
        corpus = corpus.apply(lambda doc: self.remove_special_character(doc, remove_digits) if special_char_removal else doc)
        corpus = corpus.apply(lambda doc: re.sub(' +', ' ', doc))
        corpus = corpus.apply(lambda doc: self.remove_stopwords(doc, is_lower_case=text_lower_case) if stopword_removal else doc)
        return corpus

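    def strip_html_tags(self, text):
        soup = BeautifulSoup(text, "html.parser")
        # drop embedded frames and scripts before extracting the visible text
        for tag in soup(['iframe', 'script']):
            tag.extract()
        stripped_text = soup.get_text()
        # collapse runs of line breaks into a single newline
        stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
        return stripped_text

    def remove_accented_chars(self, text):
        # decompose accented characters, then drop the non-ASCII combining marks
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        return text

    def text_lower_case(self, text):
        # helper for a single string; normalize_corpus calls doc.lower() directly
        return text.lower()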

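    def text_lemmatization(self, text):
        # load the spaCy model once and cache it; reloading it for every document is very slow
        if not hasattr(self, '_nlp'):
            self._nlp = spacy.load('en_core_web_sm')
        doc = self._nlp(text)
        # spaCy v2 lemmatizes pronouns to '-PRON-'; keep the original word in that case
        return ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])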

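    def remove_digits(self, text):
        text = ''.join([i for i in text if not i.isdigit()])
        return text

    def remove_special_character(self, text, remove_digits=True):
        # keep letters and whitespace (plus digits unless remove_digits is set);
        # the range must be A-Z: A-z would also keep [ \ ] ^ _ and the backtick
        pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
        text = re.sub(pattern, '', text)
        return text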

    def remove_stopwords(self, text, is_lower_case=False):
        stopword_list = nltk.corpus.stopwords.words('english')
        stopword_list = set(stopword_list)
        if is_lower_case:
            text = [word for word in text.split() if word.lower() not in stopword_list]
        else:
            text = [word for word in text.split() if word not in stopword_list]
        return ' '.join(text)
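
normalize_corpus accepts a contraction_expansion flag and calls self.expand_contractions, but the class above does not define that method. The following is a minimal sketch of one way to fill the gap, assuming a small hand-written mapping. CONTRACTION_MAP and the standalone expand_contractions function are hypothetical, the mapping is far from complete, and only straight apostrophes (') are handled.

```python
import re

# Hypothetical, deliberately tiny mapping; a real one would cover far more contractions.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
    "don't": "do not",
}

# One alternation pattern built from the keys, matched case-insensitively on word boundaries.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form."""

    def _expand(match):
        found = match.group(0)
        expanded = CONTRACTION_MAP[found.lower()]
        # Preserve a leading capital letter ("Can't" -> "Cannot").
        return expanded.capitalize() if found[0].isupper() else expanded

    return _CONTRACTION_RE.sub(_expand, text)


print(expand_contractions("I'm sure it's fine, but don't worry."))
```

To use it inside TextNormalizer, the same body could become a method that takes self as its first parameter. Expansion should run before special characters are removed, since that step strips the apostrophes the pattern depends on.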

2. Using your new text normalizer, create a Jupyter Notebook that uses this class to clean up the text found in the file big.txt (that text file is in the GitHub for Week 4 repository). Your resulting text should be a (long) single stream of text.

In [9]:
# read the contents of big.txt into a pandas series
big_txt = pd.read_fwf('big.txt', header=None)
corpus = big_txt[0]
# create an instance of the TextNormalizer class
normalizer = TextNormalizer()

# normalize the big_txt series using the normalize_corpus function
big_txt_normalized = normalizer.normalize_corpus(corpus)
big_txt_normalized[:100]



0      project gutenberg ebook adventure sherlock holme
1                                sir arthur conan doyle
2                         series sir arthur conan doyle
3                 copyright law change world sure check
4           copyright law country download redistribute
                            ...                        
95                               aware say holmes dryly
96    circumstance great delicacy every precaution t...
97    also aware murmur holme settle armchair close eye
98    visitor glance apparent surprise languid loung...
99    majesty would condescend state case remark wel...
Name: 0, Length: 100, dtype: object
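
The assignment asks for a single long stream of text, while normalize_corpus returns one cleaned string per row. Below is a minimal sketch of the final joining step, assuming every element is a string. to_single_stream is a hypothetical helper name, and a plain list stands in for the Series; iterating a pandas Series yields its values, so big_txt_normalized could be passed the same way.

```python
from typing import Iterable


def to_single_stream(lines: Iterable[str]) -> str:
    """Join cleaned lines into one space-separated string, skipping empty rows."""
    # split() with no arguments also trims stray leading/trailing whitespace per line
    pieces = [" ".join(line.split()) for line in lines]
    return " ".join(piece for piece in pieces if piece)


# Stand-in for the first rows of the normalized Series shown above.
sample = [
    "project gutenberg ebook adventure sherlock holme",
    "",
    "sir arthur conan doyle",
]
print(to_single_stream(sample))
```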

3. Using spaCy and NLTK, show the tokens, lemmas, parts of speech, and dependencies in the first 1,021 characters of big.txt.

In [10]:
import spacy

# load the spaCy model
nlp = spacy.load("en_core_web_sm")

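# take the first 1,021 rows of big_txt_normalized
# (slicing a Series selects rows, not characters)
text = big_txt_normalized[:1021]

# process the text using spaCy; to_string() renders the Series with its
# index labels and padding, so those show up as tokens in the output
doc = nlp(text.to_string())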

# print the tokens, lemmas, POS tags, and dependencies
for token in doc:
    print(token.text,"*", token.lemma_, "*", token.pos_, "*", token.dep_)

0 * 0 * NUM * nummod
       *        * SPACE * dep
project * project * NOUN * compound
gutenberg * gutenberg * PROPN * compound
ebook * ebook * PROPN * nsubj
adventure * adventure * NOUN * compound
sherlock * sherlock * NOUN * compound
holme * holme * NOUN * appos

 * 
 * SPACE * dep
1 * 1 * NUM * nummod
                                 *                                  * SPACE * dep
sir * sir * PROPN * compound
arthur * arthur * PROPN * compound
conan * conan * PROPN * compound
doyle * doyle * PROPN * appos

 * 
 * SPACE * dep
2 * 2 * NUM * nummod
                          *                           * SPACE * dep
series * series * NOUN * conj
sir * sir * PROPN * compound
arthur * arthur * PROPN * compound
conan * conan * PROPN * compound
doyle * doyle * PROPN * npadvmod

 * 
 * SPACE * dep
3 * 3 * NUM * nummod
                  *                   * SPACE * dep
copyright * copyright * NOUN * compound
law * law * NOUN * compound
change * change * NOUN * compound
world * world * NOUN 
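
The spaCy call above receives a rendered Series rather than raw text, which is why index labels such as 0 and 1 appear as NUM tokens. If the goal is literally the first 1,021 characters of big.txt, one option is to read them straight from the file. This is a minimal sketch, assuming big.txt is UTF-8 encoded (an assumption; pass a different encoding if it is not). first_chars is a hypothetical helper.

```python
def first_chars(path: str, n: int = 1021, encoding: str = "utf-8") -> str:
    """Return at most the first n characters of a text file."""
    # In text mode, read(n) counts decoded characters, not bytes.
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return handle.read(n)


# Usage (commented out because it needs big.txt and the spaCy model):
# raw_text = first_chars("big.txt")
# doc = nlp(raw_text)
# for token in doc:
#     print(token.text, "*", token.lemma_, "*", token.pos_, "*", token.dep_)
```

Running spaCy on the raw characters keeps capitalization and punctuation, which the tagger and parser generally rely on.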

In [12]:
# Done using NLTK:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
import nltk
nltk.download('averaged_perceptron_tagger')

# take the first 1,021 rows of big_txt_normalized
# (slicing a Series selects rows, not characters)
text = big_txt_normalized[:1021]

# tokenize the rendered Series; to_string() includes the index labels
tokens = word_tokenize(text.to_string())

# get the POS tags and lemmatize the tokens
tagged_tokens = pos_tag(tokens)
lemmatizer = WordNetLemmatizer()
for token in tagged_tokens:
    print(token[0],"*", lemmatizer.lemmatize(token[0]), "*", token[1])

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\jeroc\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


0 * 0 * CD
project * project * NN
gutenberg * gutenberg * NN
ebook * ebook * NN
adventure * adventure * NN
sherlock * sherlock * NN
holme * holme * VBD
1 * 1 * CD
sir * sir * NN
arthur * arthur * NN
conan * conan * NN
doyle * doyle * VBD
2 * 2 * CD
series * series * NN
sir * sir * NN
arthur * arthur * IN
conan * conan * JJ
doyle * doyle * JJ
3 * 3 * CD
copyright * copyright * JJ
law * law * NN
change * change * NN
world * world * NN
sure * sure * JJ
check * check * VB
4 * 4 * CD
copyright * copyright * JJ
law * law * NN
country * country * NN
download * download * VBD
redistribute * redistribute * JJ
5 * 5 * CD
project * project * NN
gutenberg * gutenberg * NN
ebook * ebook * VBD
6 * 6 * CD
header * header * NN
first * first * JJ
thing * thing * NN
see * see * NN
view * view * NN
project * project * VBP
7 * 7 * CD
gutenberg * gutenberg * NNS
file * file * JJ
please * please * NN
remove * remove * VB
change * change * NN
edit * edit * NN
8 * 8 * CD
header * header * NN
without * without
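
In the NLTK output above, every lemma equals its token. The text had already been lemmatized by the normalizer, and in addition WordNetLemmatizer.lemmatize treats its input as a noun unless a pos argument is supplied, so the tagger's Penn Treebank tags (NN, VBD, JJ, ...) are never used. Below is a minimal sketch of a common mapping from those tags to WordNet POS codes. penn_to_wordnet is a hypothetical helper, and the single-letter codes are the values of NLTK's wordnet.ADJ, wordnet.VERB, wordnet.ADV, and wordnet.NOUN constants.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # noun, which is also the lemmatizer's default


# A few (token, tag) pairs copied from the output above.
tagged = [("download", "VBD"), ("sure", "JJ"), ("law", "NN"), ("1", "CD")]
for token, tag in tagged:
    print(token, "*", tag, "*", penn_to_wordnet(tag))

# With NLTK and the WordNet data installed, the code would be passed as:
# lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
```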