# Building a class to manage our NLP pipelines
* https://www.youtube.com/watch?v=SG6jdlBx_vQ
    
* https://github.com/ZWMiller/nlp_pipe_manager/tree/master/nlp_pipeline_manager

* https://github.com/ZWMiller/nlp_pipe_manager/blob/master/nlp_pipeline_manager/pipeline_demo.ipynb

Managing every permutation of NLP cleaners, tokenizers, vectorizers, and stemmers by hand is tedious, so we're going to build a class that takes all of those pieces as arguments and manages the pipeline for us.
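The `nlp_preprocessor` class used below is imported from a local package and its source is not shown here. To give a rough idea of what such a class can look like, here is a minimal sketch. `SimplePreprocessor` and `TinyCountVectorizer` are hypothetical names made up for this illustration, and the cleaning rules are an assumption, not the actual implementation.

```python
import string


class TinyCountVectorizer:
    """Hypothetical bag-of-words counter, used only to keep the sketch dependency-free."""

    def fit(self, texts):
        self.vocab = sorted({tok for text in texts for tok in text.split()})
        return self

    def transform(self, texts):
        return [[text.split().count(word) for word in self.vocab] for text in texts]


class SimplePreprocessor:
    """Hypothetical pipeline manager: clean -> tokenize -> (stem) -> vectorize."""

    def __init__(self, vectorizer, tokenizer=None, cleaning_function=None, stemmer=None):
        # Any object exposing fit(texts) and transform(texts) can be plugged in.
        self.vectorizer = vectorizer
        self.tokenizer = tokenizer or str.split
        self.cleaning_function = cleaning_function or self._default_clean
        self.stemmer = stemmer  # optional object exposing .stem(word)
        self._is_fit = False

    def _default_clean(self, texts):
        """Lowercase, drop punctuation and digits, then optionally stem each token."""
        drop = str.maketrans('', '', string.punctuation + string.digits)
        cleaned = []
        for text in texts:
            tokens = self.tokenizer(text.lower().translate(drop))
            if self.stemmer is not None:
                tokens = [self.stemmer.stem(tok) for tok in tokens]
            cleaned.append(' '.join(tokens))
        return cleaned

    def fit(self, texts):
        self.vectorizer.fit(self.cleaning_function(texts))
        self._is_fit = True
        return self

    def transform(self, texts):
        if not self._is_fit:
            raise ValueError('Call fit() before transform().')
        return self.vectorizer.transform(self.cleaning_function(texts))


pipe = SimplePreprocessor(TinyCountVectorizer()).fit(['No results found.', 'Results: 2 found!'])
print(pipe.vectorizer.vocab)                          # ['found', 'no', 'results']
print(pipe.transform(['results results found']))     # [[1, 0, 2]]
```

Because the vectorizer is duck-typed, a scikit-learn `CountVectorizer` or `TfidfVectorizer` should drop in where `TinyCountVectorizer` is used, which is the kind of swapping this notebook relies on.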

In [118]:
s = 'resulting result results resulted resulting run UPPER CASE @you running ran No #results  😺 😺 😺@results FOUND. View all teams. MAD Prod Fundraistrick. 350 10th Ave, Suite 1100. San Diego, CA 92101 US. Back to top. Donor Support braistrick@stayclassy.org'
s1 = 'run bunda bunda bunda No results results found. View all teams. Prod Fundraistrick. 350 10th Ave, Suite 1100. San Diego, CA 92101 US. Back to top. Donor Support braistrick@stayclassy.org. http://localhost:8888/notebooks/nlp/cleaning_sandbox.ipynb https://www.w3schools.com/python/python_regex.asp'
s2 = 'https://www.w3schools.com/python/python_regex.asp No results results found. View all teams. Prod Fundraistrick. 350 10th Ave, Suite 1100. San Diego, CA 92101 US. Back to top. Donor Support braistrick@stayclassy.org. http://localhost:8888/notebooks/nlp/cleaning_sandbox.ipynb'
s3 = 'No results results found. View all teams. Prod Fundraistrick. 350 10th Ave, Suite 1100. San Diego, CA 92101 US. Back to top. Donor Support braistrick@stayclassy.org. http://localhost:8888/notebooks/nlp/cleaning_sandbox.ipynb'
s4 = 'bunda results results found view teams prod fundraistrick ave suite san diego back top donor support'
s_pt = 'vou. vindo vamos. testar? uma nova bibliotecas para português eu vou tu vai ele foi correr corriam corrida'
list_of_strings = [s, s1, s2, s3, s4]
corpus = list_of_strings

import re
import string

dict_regex = {
    'hashtags': r'#(\w+)',
    # matches not only mentions, but also
    # the part of an email address after the @
    'mentions': r'@(\w+)',
    'emails': r'',  # placeholder: no pattern defined yet
    # greedy: removes everything from the URL to the end of the line
    'links': r'https?:\/\/.*[\r\n]*',
    'remove_RT': r'^RT[\s]+',
    'numbers': r'\d+',
    'symbols': r'',  # placeholder: no pattern defined yet
    'punctuation2': r'[^\w\s]',
    'punctuation': r'[%s]' % re.escape(string.punctuation),
    'periods': r'\.',
    'exclamation points': r'\!',
    'question marks': r'\?',
    'upper case words': r'[A-Z][A-Z\d]+',
    # https://stackoverflow.com/questions/39536390/match-unicode-emoji-in-python-regex
    'emojis': r'\d+(.*?)[\u263a-\U0001f645]',
    # one character class covering several emoji blocks; quotes and pipes
    # are left out because inside [...] they would be matched literally
    'emojis_work': r'[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]',
    'upper case': r'[A-Z][A-Z\d]+'
}

# same character class as 'emojis_work', without the literal quotes and pipes
regex_emojis = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
list_of_regex_values = list(dict_regex.values())
list_of_regex_keys = list(dict_regex.keys())

sw = ['😺', '😺 😺', '😺 😺 😺', 'prod', 'suite', ' ']
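
The comment on the `mentions` pattern notes that it also picks up the domain of an email address. The short check below shows that behavior on a shortened sample string, along with one possible alternative that uses a negative lookbehind. The `standalone` pattern is a suggestion for illustration, not something the original pipeline uses.

```python
import re

sample = 'UPPER CASE @you running 😺@results FOUND. Donor Support braistrick@stayclassy.org'

naive = r'@(\w+)'                  # the `mentions` pattern from dict_regex
standalone = r'(?<![\w.])@(\w+)'   # assumption: skip an @ preceded by a word character or a dot

print(re.findall(naive, sample))        # ['you', 'results', 'stayclassy']
print(re.findall(standalone, sample))   # ['you', 'results']
```

The emoji before `@results` is not a word character, so that mention is still captured, while the `@` inside the email address is skipped.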

In [119]:
# arrays and tables
import pandas as pd
import numpy as np

# modeling
import nltk
import sklearn

# viz
import matplotlib
import matplotlib.pyplot as plt

# numbers and calculation
import random
import math
import scipy

# files
import sys

libraries = (('Matplotlib', matplotlib), ('Numpy', np), ('Pandas', pd), ('NLTK', nltk), ('sklearn',sklearn))

print("Python Version:", sys.version, '\n')
for lib in libraries:
    print('{0} Version: {1}'.format(lib[0], lib[1].__version__))

Python Version: 3.7.6 (default, Jan  7 2020, 16:28:00) 
[Clang 11.0.0 (clang-1100.0.33.8)] 

Matplotlib Version: 3.2.0
Numpy Version: 1.18.1
Pandas Version: 0.25.3
NLTK Version: 3.5
sklearn Version: 0.22.2.post1


In [120]:
# class pipeline: preprocessing and supervised nlp
from data_pipeline.pre_processing_text.class_preprocessing import nlp_preprocessor
from data_pipeline.modeling.class_supervised_ml import supervised_nlp
from data_pipeline.modeling.my_topicmodeling import topic_modeling_nlp
# from data_pipeline.pre_processing_text.norm_lemmatize import portuguese_stemmer

# lemmatize and stem words
from data_pipeline.pre_processing_text.norm_lemmatize import lemmatize_list
from nltk.stem import PorterStemmer

# vectorization
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# models
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation

# Preprocessing

In [121]:
# example 1
nlp = nlp_preprocessor(lemmatizer=lemmatize_list)
# nlp = nlp_preprocessor()

In [122]:
print(nlp.tokenizer(s))

['resulting', 'result', 'results', 'resulted', 'resulting', 'run', 'UPPER', 'CASE', '@you', 'running', 'ran', 'No', '#results', '', '😺', '😺', '😺@results', 'FOUND.', 'View', 'all', 'teams.', 'MAD', 'Prod', 'Fundraistrick.', '350', '10th', 'Ave,', 'Suite', '1100.', 'San', 'Diego,', 'CA', '92101', 'US.', 'Back', 'to', 'top.', 'Donor', 'Support', 'braistrick@stayclassy.org']


In [123]:
print(nlp.clean_text(list_of_strings, lemmatizer=lemmatize_list))

['result result result result result run upper case run run view team mad prod fundraistrick ave suite san diego donor support braistrickorg', 'run bunda bunda bunda result result view team prod fundraistrick ave suite san diego donor support braistrickorg', '', 'result result view team prod fundraistrick ave suite san diego donor support braistrickorg', 'bunda result result view team prod fundraistrick ave suite san diego donor support']
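
Note that the third cleaned document is an empty string. The internals of `nlp_preprocessor` are not shown here, so the following is only a plausible explanation: if the cleaner applies a greedy link pattern like the `links` entry in `dict_regex`, a document that starts with a URL (as `s2` does) is removed entirely, because `.*` runs to the end of the line. The `bounded_links` pattern is an assumed alternative that stops at the first whitespace.

```python
import re

greedy_links = r'https?:\/\/.*[\r\n]*'   # the `links` pattern from dict_regex
bounded_links = r'https?://\S+'          # assumption: match only up to the next whitespace

text = 'https://www.w3schools.com/python/python_regex.asp No results found. View all teams.'

print(repr(re.sub(greedy_links, '', text)))            # ''
print(repr(re.sub(bounded_links, '', text).strip()))   # 'No results found. View all teams.'
```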


In [124]:
nlp.fit(list_of_strings)

In [125]:
nlp.transform(list_of_strings).toarray()

array([[1, 1, 0, 1, 1, 1, 1, 1, 1, 5, 3, 1, 1, 1, 1, 1, 1],
       [1, 1, 3, 0, 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 1, 1, 1, 0, 1, 2, 0, 1, 1, 1, 1, 0, 1],
       [1, 0, 1, 0, 1, 1, 1, 0, 1, 2, 0, 1, 1, 1, 1, 0, 1]])

In [126]:
nlp.bow_table(list_of_strings)

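| | ave | braistrickorg | bunda | case | diego | donor | fundraistrick | mad | prod | result | run | san | suite | support | team | upper | view |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | 3 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 3 | 0 | 1 | 1 | 1 | 0 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| 4 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |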


In [127]:
# nlp.save_pipe('test')
# nlp.load_pipe('test.mdl')

__Vectorize using TF-IDF__

In [128]:
nlp2 = nlp_preprocessor(lemmatizer=lemmatize_list, vectorizer=TfidfVectorizer(lowercase=False))

In [129]:
nlp2.fit(list_of_strings)
print(nlp2.vectorizer.get_feature_names())
nlp2.bow_table(list_of_strings)

['ave', 'braistrickorg', 'bunda', 'case', 'diego', 'donor', 'fundraistrick', 'mad', 'prod', 'result', 'run', 'san', 'suite', 'support', 'team', 'upper', 'view']


| | ave | braistrickorg | bunda | case | diego | donor | fundraistrick | mad | prod | result | run | san | suite | support | team | upper | view |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.124687 | 0.148219 | 0.0 | 0.221318 | 0.124687 | 0.124687 | 0.124687 | 0.221318 | 0.124687 | 0.623434 | 0.535675 | 0.124687 | 0.124687 | 0.124687 | 0.124687 | 0.221318 | 0.124687 |
| 1 | 0.16685 | 0.19834 | 0.716815 | 0.0 | 0.16685 | 0.16685 | 0.16685 | 0.0 | 0.16685 | 0.3337 | 0.238938 | 0.16685 | 0.16685 | 0.16685 | 0.16685 | 0.0 | 0.16685 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.254715 | 0.302789 | 0.0 | 0.0 | 0.254715 | 0.254715 | 0.254715 | 0.0 | 0.254715 | 0.509431 | 0.0 | 0.254715 | 0.254715 | 0.254715 | 0.254715 | 0.0 | 0.254715 |
| 4 | 0.249604 | 0.0 | 0.357447 | 0.0 | 0.249604 | 0.249604 | 0.249604 | 0.0 | 0.249604 | 0.499209 | 0.0 | 0.249604 | 0.249604 | 0.249604 | 0.249604 | 0.0 | 0.249604 |
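
To see where these weights come from, the TF-IDF values can be recomputed by hand from the count table. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses only the standard library. The helper name `tfidf_row` is made up for this example.

```python
import math

vocab = ['ave', 'braistrickorg', 'bunda', 'case', 'diego', 'donor', 'fundraistrick',
         'mad', 'prod', 'result', 'run', 'san', 'suite', 'support', 'team', 'upper', 'view']

# Term counts copied from the bag-of-words table above (one row per document).
counts = [
    [1, 1, 0, 1, 1, 1, 1, 1, 1, 5, 3, 1, 1, 1, 1, 1, 1],
    [1, 1, 3, 0, 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 1, 0, 1, 2, 0, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 1, 1, 0, 1, 2, 0, 1, 1, 1, 1, 0, 1],
]

n_docs = len(counts)
# Document frequency: in how many documents does each term appear?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm if norm else 0.0 for x in weighted]


row4 = dict(zip(vocab, tfidf_row(counts[4])))
print(row4['ave'], row4['bunda'], row4['result'])
```

The printed values should agree with row 4 of the TF-IDF table to the displayed precision. `bunda` gets a higher weight than `ave` despite the same count because it appears in fewer documents.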


# Modeling

In [130]:
from sklearn import datasets

categories = ['alt.atheism', 'comp.graphics', 'rec.sport.baseball']
ng_train = datasets.fetch_20newsgroups(subset='train', 
                                       categories=categories, 
                                       remove=('headers', 
                                               'footers', 'quotes'))
ng_train_data = ng_train.data
ng_train_targets = ng_train.target

ng_test = datasets.fetch_20newsgroups(subset='test', 
                                       categories=categories, 
                                       remove=('headers', 
                                               'footers', 'quotes'))

ng_test_data = ng_test.data
ng_test_targets = ng_test.target

In [131]:
# nlp with stemmer
nlp = nlp_preprocessor()

# nlp lemmatizing
nlp1 = nlp_preprocessor(lemmatizer=lemmatize_list)

# nlp with CountVectorizer vectorizer
nlp2 = nlp_preprocessor(vectorizer=CountVectorizer(lowercase=False))

# nlp with TfidfVectorizer
nlp3 = nlp_preprocessor(vectorizer=TfidfVectorizer(lowercase=False))

# nlp with the lemmatize_list function and TfidfVectorizer
nlp4 = nlp_preprocessor(lemmatizer=lemmatize_list, vectorizer=TfidfVectorizer(lowercase=False))

nlp_chains = [nlp, nlp1, nlp2, nlp3, nlp4]

In [132]:
# iterate over the nlp instantiated classes

import time

start_start = time.time()

for ix, chain in enumerate(nlp_chains):
    
    start = time.time()

    # model to be used
    nb = MultinomialNB()
    
    # fit the preprocessed data
    chain.fit(ng_train_data)
    
    # transform the train dataset
    train_data = chain.transform(ng_train_data)
    
    # transform the test dataset
    test_data = chain.transform(ng_test_data)
    
    # fit the model
    nb.fit(train_data, ng_train_targets)
    
    # get the accuracy
    accuracy = nb.score(test_data, ng_test_targets)
    
    # print the results
    print("Chain {}: {}".format(ix, accuracy))
    
    end = time.time()
    print(end - start)

final_end = time.time()
print(final_end - start_start)

Chain 0: 0.9158371040723982
20.290612936019897
Chain 1: 0.9194570135746606
54.94128394126892
Chain 2: 0.9158371040723982
30.068961143493652
Chain 3: 0.9067873303167421
30.367820978164673
Chain 4: 0.9076923076923077
27.94713592529297
163.61649703979492


## Supervised: Classification

Here we'll use a class that wraps the preprocessor and a classifier, so that we can predict a document's category directly from its text.
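
The source of `supervised_nlp` is not included in this notebook. As an illustration of the idea, a wrapper of this kind could be as small as the sketch below. `SupervisedText` is a hypothetical name, and the only assumptions are that the preprocessor exposes `fit`/`transform` and the model exposes `fit`/`predict`/`score`, as scikit-learn estimators do.

```python
class SupervisedText:
    """Hypothetical wrapper that chains a text preprocessor with a classifier."""

    def __init__(self, model, preprocessing_pipeline):
        self.model = model
        self.preprocessor = preprocessing_pipeline

    def fit(self, texts, labels):
        # Learn the vocabulary first, then train the model on the resulting vectors.
        self.preprocessor.fit(texts)
        self.model.fit(self.preprocessor.transform(texts), labels)
        return self

    def predict(self, texts):
        # New documents go through the same preprocessing before prediction.
        return self.model.predict(self.preprocessor.transform(texts))

    def score(self, texts, labels):
        return self.model.score(self.preprocessor.transform(texts), labels)
```

Keeping the model and the preprocessor as separate constructor arguments is what makes it possible to swap `MultinomialNB()` for `LinearSVC()` without touching the text-cleaning code.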

In [133]:
nlp = nlp_preprocessor()
nlp_pipe = supervised_nlp(MultinomialNB(), nlp)
nlp_pipe.fit(ng_train_data, ng_train_targets)
nlp_pipe.score(ng_test_data, ng_test_targets)

0.9158371040723982

Swap out the model for something different.

In [134]:

nlp_pipe = supervised_nlp(LinearSVC(), nlp)
nlp_pipe.fit(ng_train_data, ng_train_targets)
nlp_pipe.score(ng_test_data, ng_test_targets)

0.8615384615384616

## Unsupervised: Topic Modeling

In this example we aren't making predictions. We only want to find topics and be able to cast our data from "word space" into "topic space." With this in mind, the class provides a `transform` method and the ability to print out the topics.
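
The source of `topic_modeling_nlp` is not shown here either. The sketch below illustrates one way such a wrapper could work. `TopicModelText` is a hypothetical name, and it assumes the model exposes a `components_` attribute with one row of word weights per topic, as scikit-learn's `TruncatedSVD` and `LatentDirichletAllocation` do. Feature names are passed in explicitly because how to retrieve them depends on the vectorizer in use.

```python
class TopicModelText:
    """Hypothetical wrapper that chains a text preprocessor with a topic model."""

    def __init__(self, model, preprocessing_pipeline):
        self.model = model
        self.preprocessor = preprocessing_pipeline

    def fit(self, texts):
        self.preprocessor.fit(texts)
        self.model.fit(self.preprocessor.transform(texts))
        return self

    def transform(self, texts):
        # Cast documents from word space into topic space.
        return self.model.transform(self.preprocessor.transform(texts))

    def top_words(self, feature_names, num_words=10):
        """Return the highest-weighted words for each topic."""
        topics = []
        for weights in self.model.components_:
            ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
            topics.append([feature_names[j] for j in ranked[:num_words]])
        return topics

    def print_topics(self, feature_names, num_words=10):
        for ix, words in enumerate(self.top_words(feature_names, num_words)):
            print('Topic #{}: {}'.format(ix, ' '.join(words)))
```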

In [135]:
cv = CountVectorizer(stop_words='english', token_pattern='\\b[a-z][a-z]+\\b')

cleaning_pipe = nlp_preprocessor(vectorizer=cv)
topic_chain = topic_modeling_nlp(TruncatedSVD(n_components=15), preprocessing_pipeline=cleaning_pipe)

In [136]:
topic_chain.fit(ng_train_data)
topic_chain.transform(ng_train_data).shape

(1661, 15)

In [137]:
topic_chain.print_topics()

Topic #0: jpeg image file images gif format files data software version
Topic #1: jpeg gif file quality jfif viewer quicktime programs colors quantization
Topic #2: jesus god atheists people matthew atheism religious gd religion prophecy
Topic #3: jesus matthew gd prophecy psalm messiah isaiah lord israel prophet
Topic #4: send ray graphics mail objects rayshade file stuff files format
Topic #5: argument fallacy conclusion true argumentum premises false valid inference form
Topic #6: data ftp model grass vertex sgi pci ibm contact jpeg
Topic #7: posting god response subject typical universe evidence einstein formal watchmaker
Topic #8: game year runs good hit team dont cubs win run
Topic #9: den px pz double radius pxpxpypypzpz sqrtden theta pole rtheta
Topic #10: compass opcols int oprows compassop row inputimage rowcol oprowscol char
Topic #11: program bits read menu display change pressing files file dont
Topic #12: cubs suck game atheists lost runs program file jesus bits
Topic #13: cubs suck people dont isla

Swap out the model for something different.

In [138]:
topic_chain = topic_modeling_nlp(LatentDirichletAllocation(n_components=15), preprocessing_pipeline=cleaning_pipe)

In [139]:
topic_chain.fit(ng_train_data)
topic_chain.transform(ng_train_data).shape
topic_chain.print_topics()

Topic #0: year good players years dont league average baseball game lot
Topic #1: image program data display processing video color images mode card
Topic #2: jpeg image file gif files images format good bit color
Topic #3: game games fan phillies philadelphia team era wip time baseball
Topic #4: graphics data image package email ftp send software code file
Topic #5: time players sex bob software bronx dont good bobbeicotekcom beauchaine
Topic #6: dont women god people morris deletion men quran islamic games
Topic #7: fallacy conclusion graphics premises argumentum argument valid computer inference group
Topic #8: david reality win presentations runs virtual visualization scientific seminar robert
Topic #9: god people dont argument atheists true evidence religion islam exist
Topic #10: team year game runs lost hit games dont win braves
Topic #11: jesus people time matthew atheism god atheists bible prophecy religious
Topic #12: compass opcols int faq search graphics oprows data send dont
Topic #13: good dont cubs