## [A Practitioner's Guide to Natural Language Processing](https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72)

### Table of Contents:

1. Data Retrieval with Web Scraping
2. Text wrangling and pre-processing
3. Parts of Speech Tagging
4. Shallow Parsing
5. Constituency and Dependency Parsing
6. Named Entity Recognition
7. Emotion and Sentiment Analysis

## Standard NLP Workflow

The `CRISP-DM` (Cross-Industry Standard Process for Data Mining) model is a widely used industry standard for structuring data science projects.

Most NLP problems can be approached with a methodical workflow made up of a sequence of steps:

![title](media/nlp-workflow.png)

## 1. Scraping News Articles for Data Retrieval

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

%matplotlib inline

In [4]:
seed_urls = ['https://inshorts.com/en/read/technology',
             'https://inshorts.com/en/read/sports',
             'https://inshorts.com/en/read/world']

def build_dataset(seed_urls):
    news_data = []
    for url in seed_urls:
        news_category = url.split('/')[-1]
        data = requests.get(url)
        soup = BeautifulSoup(data.content, 'html.parser')
        
        news_articles = [{'news_headline': headline.find('span', 
                                                         attrs={"itemprop": "headline"}).string,
                          'news_article': article.find('div', 
                                                       attrs={"itemprop": "articleBody"}).string,
                          'news_category': news_category}
                         
                            for headline, article in 
                             zip(soup.find_all('div', 
                                               class_=["news-card-title news-right-box"]),
                                 soup.find_all('div', 
                                               class_=["news-card-content news-right-box"]))
                        ]
        news_data.extend(news_articles)
        
    df =  pd.DataFrame(news_data)
    df = df[['news_headline', 'news_article', 'news_category']]
    return df

In [5]:
news_df = build_dataset(seed_urls)
news_df.head(10)

| | news_headline | news_article | news_category |
|---|---|---|---|
| 0 | Xiaomi Redmi K20 Pro launched in India, to go ... | Xiaomi's flagship Redmi K20 Pro has been unvei... | technology |
| 1 | New TVS Ntorq 125 scooter features Bluetooth &... | The TVS Ntorq 125, India's first scooter comes... | technology |
| 2 | Chrome, Firefox users' data leaked, for sale a... | Extensions on Chrome and Firefox internet brow... | technology |
| 3 | Privacy concerns raised over photo app FaceApp... | The Russian company behind the viral photo-edi... | technology |
| 4 | Bluetooth flaw exposes user locations; Android... | Boston University researchers have found a fla... | technology |
| 5 | Twitter gets trolled after unveiling redesigne... | Micro-blogging platform Twitter was trolled on... | technology |
| 6 | Musk's Neuralink unveils 'threads' that link b... | Elon Musk has unveiled his new startup Neurali... | technology |
| 7 | Good: Elizabeth Warren on Peter Thiel saying s... | US politician Elizabeth Warren shared an artic... | technology |
| 8 | Microsoft posts record Q4 results, shares hit ... | Microsoft posted its Q4 FY19 earnings report r... | technology |
| 9 | Google may be fined ₹136cr for abuse of Androi... | India antitrust watchdog Competition Commissio... | technology |


In [6]:
news_df.news_category.value_counts()

world         25
technology    24
sports        24
Name: news_category, dtype: int64

## 2. Text Wrangling & Pre-processing

In [15]:
import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
from contractions import CONTRACTION_MAP
import unicodedata
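# load the small English model by its full package name; the 'en' shortcut and
# the parse/tag/entity keyword arguments are not accepted by current spaCy releases
nlp = spacy.load('en_core_web_sm')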
#nlp_vec = spacy.load('en_vecs', parse = True, tag=True, #entity=True)
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

### Remove HTML tags

In [16]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

strip_html_tags('<html><h2>Some important text</h2></html>')

'Some important text'

### Remove accented characters

In [17]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

remove_accented_chars('Sómě Áccěntěd těxt')

'Some Accented text'
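
To see why this works, it can help to look at what NFKD normalization does to a single character. The sketch below uses only the standard library; the sample character is an assumption chosen for illustration.

```python
import unicodedata

sample = '\u00e9'  # 'e' with an acute accent, as one precomposed character

# NFKD splits the character into a base letter plus a combining accent
decomposed = unicodedata.normalize('NFKD', sample)
print([unicodedata.name(ch) for ch in decomposed])

# Encoding to ASCII with errors='ignore' drops the combining accent,
# which has no ASCII equivalent, and keeps the base letter
ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
print(ascii_only)
```

Characters that have no decomposition, such as 'ø', are dropped entirely by this approach rather than transliterated.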

### Expand contractions

Converting each contraction to its expanded, original form helps with text standardization.

In [19]:
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

expand_contractions("Y'all can't expand contractions I'd think")

'You all cannot expand contractions I would think'
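
`CONTRACTION_MAP` is imported from a separate module that is not shown here. As a rough sketch of the idea, assuming it is a plain dictionary from contractions to their expansions, a cut-down version might look like the following. The three entries and the names `SMALL_CONTRACTION_MAP` and `expand_simple` are hypothetical, for illustration only.

```python
import re

# Hypothetical, cut-down mapping; a real one would have many more entries
SMALL_CONTRACTION_MAP = {
    "can't": "cannot",
    "i'd": "i would",
    "y'all": "you all",
}

def expand_simple(text, mapping=SMALL_CONTRACTION_MAP):
    # Build one alternation pattern from the (escaped) dictionary keys
    pattern = re.compile('|'.join(re.escape(key) for key in mapping),
                         flags=re.IGNORECASE)

    def replace(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # Preserve the case of the first character of the original text
        return found[0] + expanded[1:]

    return pattern.sub(replace, text)

print(expand_simple("Y'all can't expand contractions I'd think"))
```

Escaping the keys with `re.escape` is a defensive choice here, so that any key containing a regex metacharacter is still matched literally.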

### Remove special characters

- Removing digits is optional, because often we might need to keep them in the pre-processed text.

In [20]:
def remove_special_characters(text, remove_digits=False):
    # use A-Z (not A-z): the range A-z also covers [ \ ] ^ _ and `, which would then be kept
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

remove_special_characters("Well this was fun! What do you think? 123#@!", 
                          remove_digits=True)

'Well this was fun What do you think '
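
For contrast, here is a sketch of the same sentence processed with digits kept and with digits removed. The function is repeated so that the snippet runs on its own.

```python
import re

def remove_special_characters(text, remove_digits=False):
    # A-Z rather than A-z: the range A-z would also keep [ \ ] ^ _ and `
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

sample = "Well this was fun! What do you think? 123#@!"
kept = remove_special_characters(sample)                         # digits kept
dropped = remove_special_characters(sample, remove_digits=True)  # digits removed
print(repr(kept))
print(repr(dropped))
```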

### Text lemmatization

- Very similar to stemming: we remove word affixes to get to the base form of a word.

    - The difference is that the base form here, the root word (lemma), is always a **lexicographically correct word** (present in the dictionary), whereas a stem may not be.

- Both nltk and spacy have excellent lemmatizers. We will be using spacy here.

In [21]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

lemmatize_text("My system keeps crashing! his crashed yesterday, ours crashes daily")

'My system keep crash ! his crashed yesterday , ours crash daily'

### Text stemming

- Word stems are also known as the **base form** of a word, and we can create new words by attaching affixes to them in a process known as inflection.

    - Consider the word **JUMP**: adding affixes gives JUMPS, JUMPED and JUMPING.
    
- Stemming is the reverse process: it reduces words to their base or root stem, irrespective of their inflections.
    - This helps many applications, such as text classification, clustering, and information retrieval.

In [22]:
def simple_stemmer(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

simple_stemmer("My system keeps crashing his crashed yesterday, ours crashes daily")

'My system keep crash hi crash yesterday, our crash daili'
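
To make the idea of stripping affixes concrete, here is a deliberately naive sketch that removes a few common suffixes. It is not the Porter algorithm used above, which applies a much longer series of rules; the function name `toy_stem` and the suffix list are assumptions made for illustration.

```python
def toy_stem(word, suffixes=('ing', 'ed', 's')):
    # Strip the first matching suffix, keeping at least three characters
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['jumps', 'jumped', 'jumping', 'jump']
stems = [toy_stem(w) for w in words]
print(stems)
```

A rule set this small mishandles many words (`crashes` becomes `crashe`), which is why real stemmers apply their rules in ordered phases.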

### Remove stopwords

- Stopwords are words that have little or no significance, especially when constructing meaningful features from text (for example *a*, *an*, *the* and *and*).

- There is no universal stopword list, but we use a standard English language stopwords list from nltk. You can also add your own domain-specific stopwords as needed.


In [23]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

remove_stopwords("The, and, if are stopwords, computer is not")

', , stopwords , computer not'
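
Adding domain-specific stopwords amounts to extending the list before filtering. Below is a minimal sketch that uses a small hand-written list in place of the nltk one, so that it runs without any downloads. The base list, the extra words, the name `filter_stopwords` and the whitespace tokenization are all assumptions for illustration.

```python
# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
base_stopwords = ['the', 'and', 'if', 'is', 'are', 'a', 'of', 'not', 'no']

# Keep negations, as in the setup above, since they carry meaning
stopwords = [w for w in base_stopwords if w not in ('no', 'not')]

# Add words that are uninformative in this particular domain (news)
stopwords.extend(['said', 'reportedly'])
stopword_set = set(stopwords)  # sets give fast membership tests

def filter_stopwords(text):
    # Simple whitespace tokenization, for illustration only
    return ' '.join(tok for tok in text.split()
                    if tok.lower() not in stopword_set)

print(filter_stopwords("The minister said the deal is not final"))
```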

### Bringing it all together, building a text normalizer

- We could keep going with more techniques, such as correcting spelling and grammar.

- In the end, we chain these operations together to build a text normalizer for pre-processing text data.

In [24]:
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters and/or digits
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.()\-!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus
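
The order of the steps matters. As a small illustration under simplified assumptions (a one-entry contraction table and hypothetical helper names, not the functions defined above), contractions have to be expanded before special characters are removed, because the apostrophe is what identifies a contraction.

```python
import re

def expand(text):
    # Hypothetical single-entry expansion, for illustration only
    return re.sub(r"can't", "cannot", text, flags=re.IGNORECASE)

def strip_special(text):
    # Remove everything except letters, digits and whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

def run_pipeline(text, steps):
    # Apply each step in turn to the output of the previous one
    for step in steps:
        text = step(text)
    return text

text = "I can't log in!"
good_order = run_pipeline(text, [expand, strip_special])
bad_order = run_pipeline(text, [strip_special, expand])
print(good_order)
print(bad_order)
```

In the second ordering the apostrophe is gone before the expansion runs, so `cant` is left behind as an unknown token.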

#### Pre-process and normalize news articles

In [25]:
news_df['full_text'] = news_df["news_headline"].map(str)+ '. ' + news_df["news_article"]

In [26]:
news_df['clean_text'] = normalize_corpus(news_df['full_text'])
norm_corpus = list(news_df['clean_text'])
news_df.iloc[1][['full_text', 'clean_text']].to_dict()

{'full_text': "New TVS Ntorq 125 scooter features Bluetooth & smart connectivity. The TVS Ntorq 125, India's first scooter comes with Bluetooth support and smart connectivity features. It takes you to places with Navigation Assist and keeps you connected all the time with Caller ID and SMS alerts. Featuring SmartXonnect technology and sporty design, TVS Ntorq 125's special edition will also feature a special 'Scooter of the Year' insignia on the front.\n",
 'clean_text': 'new tvs ntorq scooter feature bluetooth smart connectivity tvs ntorq india first scooter come bluetooth support smart connectivity feature take place navigation assist keep connect time caller would sms alert feature smartxonnect technology sporty design tvs ntorq special edition also feature special scooter year insignia front'}

#### Save this dataset to disk if needed, so that you can always load it up later for future analysis.

In [27]:
news_df.to_csv('news.csv', index=False, encoding='utf-8')
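
Loading the file back later could look like the sketch below. To keep the snippet runnable on its own, it first writes a one-row stand-in file to a temporary directory; the row contents are placeholders, not real data, and in practice you would point `pd.read_csv` at the `news.csv` written above.

```python
import os
import tempfile
import pandas as pd

# Stand-in for the real news_df, so the snippet is self-contained
demo_df = pd.DataFrame({'news_headline': ['Example headline'],
                        'news_article': ['Example article text.'],
                        'news_category': ['technology']})

path = os.path.join(tempfile.mkdtemp(), 'news.csv')
demo_df.to_csv(path, index=False, encoding='utf-8')

# Because the file was written with index=False, reading it back
# does not produce an extra 'Unnamed: 0' column
loaded_df = pd.read_csv(path, encoding='utf-8')
print(loaded_df.columns.tolist())
```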