# Introduction

This notebook is a compilation of functions I wrote for myself and my teammates for projects that leaned heavily on Natural Language Processing. The functions are meant to streamline cleaning and pre-processing raw text with the help of libraries such as `SpaCy` and `NLTK`.

# Required Libraries

In [1]:
import nltk
import spacy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import seaborn as sns
from nltk import pos_tag
from collections import Counter

# 1. Text Cleaning

Initialize the variables below first; most of the functions that follow depend on them and will not work otherwise.

In [2]:
nlp = spacy.load('en_core_web_sm')
stops = stopwords.words('english')
stops2 = spacy.lang.en.stop_words.STOP_WORDS

## 1.1 Contraction Mapping

Contraction mapping maps common word contractions such as "ain't" or "can't" back to their expanded forms, "is not" and "cannot" respectively (following the mapping below). Otherwise, regular text cleaning would simply strip the apostrophe, and the words would later be tokenized into "ain" and "t", or "can" and "t". Below is a mapping of common word contractions.

In [3]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",

                           "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",

                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",

                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",

                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",

                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",

                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",

                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",

                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",

                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",

                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",

                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",

                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",

                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",

                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",

                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",

                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",

                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",

                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",

                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",

                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",

                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",

                           "you're": "you are", "you've": "you have"}

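As a minimal sketch of how a mapping like this gets applied, the snippet below expands contractions token by token. The `sample_mapping` subset and the `expand_contractions` helper are hypothetical names used only for illustration; the lookup is the same whitespace-split approach used later in `clean_text`.

```python
# Hypothetical subset of the full mapping, for illustration only.
sample_mapping = {"can't": "cannot", "it's": "it is", "won't": "will not"}

def expand_contractions(text, mapping):
    """Lowercase the text and replace each whitespace-separated token found in mapping."""
    tokens = text.lower().split()
    return " ".join(mapping.get(tok, tok) for tok in tokens)

expanded = expand_contractions("It's late and I can't stay", sample_mapping)
print(expanded)  # it is late and i cannot stay

# A token with punctuation attached is not an exact key, so it is left as-is.
print(expand_contractions("I can't, sorry", sample_mapping))  # i can't, sorry
```

The second call shows a limitation of exact token lookup: "can't," (with the trailing comma) does not match the key "can't", so it passes through unchanged.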

## 1.2 Cleaning raw text

Text data cannot be used for analysis out of the box. Raw text is often accompanied by several special characters, punctuation marks, and other junk that you probably won't want when you tokenize the text. `clean_text` allows simple text cleaning and word filtering by part of speech. 

In [26]:
word_lem = WordNetLemmatizer()
stops = stopwords.words('english')

def clean_text(text, clean_only=False, 
               parts_of_speech=['ADJ' ,'NOUN', 'ADV', 'VERB'],
              remove_sw=True, sw=stops):
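    """
    Cleans text and optionally lemmatizes it and filters tokens by part of
    speech. `parts_of_speech` takes the coarse-grained tags that SpaCy
    exposes through `token.pos_` (for example 'NOUN', 'VERB', 'PROPN').
    
    Parameters
    ----------
    text : str
        raw input string
    
    clean_only : bool
        defaults to False; if True, returns the cleaned string without
        lemmatizing, POS filtering, or stopword removal
    
    parts_of_speech : list of strings
        POS tags to keep; refer to the parts of speech in SpaCy
        
    remove_sw : bool
        if True, removes the words listed in `sw` from the output
    
    sw : list of strings
        stopwords to remove; add your own if necessary
        
    Returns
    -------
    out3 : str
        output string after processing
    """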
    # cleaning
    text = text.lower()
    text = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in text.split(" ")])
    text = text.replace('\xa0', ' ')
    text = text.replace('\n', ' ')
    text = text.replace('\r', ' ')
    text = re.sub(r'[^\w\s]+', ' ', text)
    # raw string avoids an invalid-escape warning; drops digits and any run of 'p' right before them
    text = re.sub(r"p*\d", "", text)
    text = re.sub(r" +", ' ', text)
    
#     print("text before lemmatizing:",text)
    
    if clean_only:
        return text
    
    # pass text into nlp then remove stopwords
    text = nlp(text)
    
    # .lemma_ and .pos_ are helpful extracting the lemmatized
    # word and part of speech.
    
    out = []
    for token in text:
         out.append((token.lemma_, token.pos_))
    poss = parts_of_speech
    out3 = ''
    
    for item in out:
        if item[1] in poss:
            out3 = out3 + ' ' + item[0]
    
    if remove_sw:
        dummy = out3.split()
        dummy = [word for word in dummy if word not in sw]
        out3 = ' '.join(dummy)
        return out3.strip()
    
    else:
        return out3.strip()

Below is a demonstration of how the `clean_text` function works.

In [27]:
# just a test cell
text = "Cooking for the people you love is the greatest happiness one could ask for. It ain't boring"
clean_text(text)

'cook people love great happiness could ask bore'

With `clean_only=True`, the function returns the cleaned text without lemmatizing or filtering by part of speech, as shown below.

In [28]:
clean_text(text, clean_only=True)

'cooking for the people you love is the greatest happiness one could ask for it is not boring'
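To see what the character-level substitutions inside `clean_text` do on their own, here is a standard-library sketch that applies the same replacements in the same order. The `strip_noise` name and the sample string are assumptions made for this illustration.

```python
import re

def strip_noise(text):
    """Apply the same character-level substitutions used in clean_text, in order."""
    text = text.replace('\xa0', ' ').replace('\n', ' ').replace('\r', ' ')
    text = re.sub(r'[^\w\s]+', ' ', text)  # punctuation runs become a single space
    text = re.sub(r'p*\d', '', text)       # remove digits, plus any 'p' run directly before them
    text = re.sub(r' +', ' ', text)        # collapse repeated spaces
    return text

sample = "see p12,\nthen call 911!"
cleaned = strip_noise(sample)
print(repr(cleaned))  # 'see then call '
```

Note that the result can keep a trailing space, since nothing in these steps strips the ends of the string; call `.strip()` on the result if that matters for your use case.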

# 2. Vectorizing and Word Embeddings

Machines can't understand words, so text has to be converted into numerical representations. There are many ways to go about this (and several videos explaining the process), but this notebook will focus on methods using the `keras` library.

## 2.1 Vectorizing

Keras comes with its own tokenizer, which lets us prepare the corpus for embedding by splitting each string into words and mapping every word to an integer index. We then use `pad_sequences` to make sure every sequence has the same length. The result is ready to be accepted as input for our embedding layer.

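Note that the `num_words` argument tells `Tokenizer` to keep only the most frequent words in the corpus when converting text to sequences (with `num_words=300`, Keras keeps the 299 most frequent). This is a vocabulary size, which is separate from the embedding dimension, so 300 is only a starting point; experiment as necessary.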

In [32]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [34]:
def vectorize_text(df_series):
    """
    Uses the keras tokenizer to vectorize text. Note that other arguments
    inside the function must be changed according to the user's specific
    problem.
    
    Parameters
    ----------
    df_series : pandas series
    
    Returns
    -------
    X : numpy array
        2D integer array of padded word-index sequences, one row per text
    """
    tokenizer = Tokenizer(num_words=300, split=' ')
    tokenizer.fit_on_texts(df_series.values)
    X = tokenizer.texts_to_sequences(df_series.values)
    X = pad_sequences(X)
    return X
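To make the output of `vectorize_text` easier to picture, the sketch below imitates the two steps in plain Python: ranking words by frequency to build an index, then left-padding the index sequences with zeros. It is a simplified stand-in, not the Keras implementation; `build_word_index` and `to_padded_sequences` are hypothetical helpers, and details such as punctuation filtering are left out.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(texts, word_index, num_words=None):
    """Map words to indices, skip words outside the kept vocabulary, left-pad with zeros."""
    seqs = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split()
               if w in word_index and (num_words is None or word_index[w] < num_words)]
        seqs.append(seq)
    max_len = max(len(s) for s in seqs)
    return [[0] * (max_len - len(s)) + s for s in seqs]

docs = ["cook people love", "love great happiness", "love"]
word_index = build_word_index(docs)
padded = to_padded_sequences(docs, word_index)
print(word_index)  # {'love': 1, 'cook': 2, 'people': 3, 'great': 4, 'happiness': 5}
print(padded)      # [[2, 3, 1], [1, 4, 5], [0, 0, 1]]

# With num_words=3, only indices 1 and 2 survive.
print(to_padded_sequences(docs, word_index, num_words=3))  # [[2, 1], [0, 1], [0, 1]]
```

Each row has the same length, which is what an embedding layer expects as input; shorter texts are filled from the left with the reserved index 0.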

# Others

## Visualizing Relationships

SpaCy includes a visualizer, displaCy, for showing the parts of speech in an input string and the dependency relationships between them. `text_viz` combines the standard text cleaning steps with the actual visualization function for easier use.

In [29]:
from spacy import displacy

def text_viz(text):
    """
    Visualizes a string's parts of speech and maps out their relationships
    with each other using displacy.
    
    Parameters
    ----------
    text : str
        input string for visualization
    
    """
    text = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in text.split(" ")])
    text = text.replace('\xa0', ' ')
    text = text.replace('\n', ' ')
    text = text.replace('\r', ' ')
    # keep the punctuations for this
    # text = re.sub(r'[^\w\s]+', ' ', text)
    # raw string avoids an invalid-escape warning; drops digits and any run of 'p' right before them
    text = re.sub(r"p*\d", "", text)
    text = re.sub(r" +", ' ', text)
    return displacy.render(nlp(text), style="dep", jupyter=True)

In [31]:
text_viz("I am a happy boy.")

# Acknowledgements

I'd like to thank my teammates, Jeddahlyn Gacera, Ria Flora, and Crisanto, for helping me build on my knowledge of Natural Language Processing. I'd also like to acknowledge our professors, Christian Alis, PhD, and Madhavi Devaraj, PhD, for imparting their knowledge in our Natural Language Processing class.