The following notebook reproduces code from Feature Engineering for Machine Learning by Alice Zheng and Amanda Casari.

In parallel, the same techniques are applied to a Twitter dataset from a Kaggle competition whose goal is to identify tweets that describe a real natural disaster. See https://www.kaggle.com/c/nlp-getting-started.

## N-Grams

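An n-gram is a sequence of n consecutive tokens. 1-grams (unigrams) are the individual words, so counting them gives the usual bag-of-words frequencies; 2-grams (bigrams) are pairs of adjacent words, and 3-grams (trigrams) are triples of adjacent words. Because the number of distinct n-grams grows quickly with n, the vocabulary sizes reported below increase from words to bigrams to trigrams. The code is taken from the book, but I consolidated it into functions for reusability.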
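
To make the definition concrete, here is a minimal pure-Python sketch that extracts n-grams without scikit-learn. It reuses the token pattern the notebook passes to `CountVectorizer`; the `ngrams` helper and the sample sentence are made up for illustration and are not taken from the book.

```python
import re
from collections import Counter

# Same token pattern the notebook passes to CountVectorizer: runs of word
# characters, so single-character tokens are kept.
TOKEN_PATTERN = re.compile(r"(?u)\b\w+\b")


def ngrams(text, n):
    """Return every n-gram in `text` as a space-joined string (hypothetical helper)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample text, not drawn from either dataset.
sample = "Emma knocked on the door. No answer. She knocked again."

unigram_counts = Counter(ngrams(sample, 1))    # word -> frequency
bigram_vocab = sorted(set(ngrams(sample, 2)))  # distinct adjacent word pairs
```

For example, `ngrams("a b c", 2)` returns `['a b', 'b c']`. Like `CountVectorizer`, this sketch ignores sentence boundaries, so "door no" counts as a bigram.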

In [72]:
import pandas as pd
import csv

# Yelp reviews data set - first 10,000 rows
yelp_df = pd.read_csv('C://Users/cusey/source/repos/DataScienceCoursework/MDS 564 - 2020 Spring/Week 2 - Numeric Feature Selection/Yelp Reviews - 10000.csv', nrows=10000)
print(yelp_df.shape)

twitter_df = pd.read_csv('C://Users/cusey/source/repos/DataScienceProjects/MDS 564 - Twitter NLP Text Analysis/twitter_train.csv')
print(twitter_df.shape)

(10000, 11)
(7613, 5)


In [37]:
def bow_and_ngrams(df, text_column):
    from sklearn.feature_extraction.text import CountVectorizer
    # Create feature transformations for unigrams, bigrams, and trigrams.
    # The default token pattern ignores single-character words; this example explicitly includes them.
    bow_converter = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
    bigram_converter = CountVectorizer(ngram_range=(2,2), token_pattern='(?u)\\b\\w+\\b')
    trigram_converter = CountVectorizer(ngram_range=(3,3), token_pattern='(?u)\\b\\w+\\b')

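    # Fit transformers and look at vocab size.
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
    # returns a NumPy array, so convert it to a plain list of strings.
    bow_converter.fit(df[text_column])
    words = bow_converter.get_feature_names_out().tolist()
    bigram_converter.fit(df[text_column])
    bigrams = bigram_converter.get_feature_names_out().tolist()
    trigram_converter.fit(df[text_column])
    trigrams = trigram_converter.get_feature_names_out().tolist()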

    print("Lengths of BOW, Bigrams, and Trigrams:",
           "\n Words:",len(words),
           "\n Bigrams:", len(bigrams),
           "\n Trigrams:", len(trigrams),
           "\n\n")
    
    return words, bigrams, trigrams


def view_results(words, bigrams, trigrams):
    print("Sample of Words: \n", words[:10],"\n\n",
          "Sample of Bigrams: \n", bigrams[-10:],"\n\n",
          "Sample of Trigrams: \n", trigrams[:10],"\n\n")
    

In [38]:
# Yelp Words, Bigrams, and Trigrams
words, bigrams, trigrams = bow_and_ngrams(yelp_df,"text")

view_results(words, bigrams, trigrams)

Lengths of BOW, Bigrams, and Trigrams: 
 
 Words: 29221 
 Bigrams: 368937 
 Trigrams: 881609 


Sample of Words: 
 ['0', '00', '000', '007', '00a', '00am', '00pm', '01', '02', '03'] 

 Sample of Bigrams: 
 ['zuzu was', 'zuzus room', 'zweigel wine', 'zwiebel kräuter', 'zy world', 'zzed in', 'éclairs napoleons', 'école lenôtre', 'ém all', 'òc châm'] 

 Sample of Trigrams: 
 ['0 0 eye', '0 20 less', '0 39 oz', '0 39 pizza', '0 5 i', '0 50 to', '0 6 can', '0 75 oysters', '0 75 that', '0 75 to'] 




In [40]:
# Twitter NLP Words, Bigrams, and Trigrams
words, bigrams, trigrams = bow_and_ngrams(twitter_df,"text")

view_results(words, bigrams, trigrams)

Lengths of BOW, Bigrams, and Trigrams: 
 
 Words: 21678 
 Bigrams: 69982 
 Trigrams: 87447 


Sample of Words: 
 ['0', '00', '000', '0000', '007npen6lg', '00cy9vxeff', '00end', '00pm', '01', '02'] 

 Sample of Bigrams: 
 ['ûó oh', 'ûó organizers', 'ûó rt', 'ûó the', 'ûó wallybaiter', 'ûóher upper', 'ûókody vine', 'ûónegligence and', 'ûótech business', 'ûówe work'] 

 Sample of Trigrams: 
 ['0 11 ronnie', '0 45 to', '0 6 8km', '0 75 in', '0 9 northern', '0 amp more', '0 and blew', '0 balls 0', '0 bids û_', '0 but dude'] 




## Chunking & Part of Speech Tagging
Chunking groups tokens into short phrases, such as noun phrases, based on their part-of-speech tags.
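
As a rough illustration of the idea, the sketch below groups hand-tagged tokens into noun-phrase chunks using a single rule: an optional determiner, any number of adjectives, then one or more nouns. The tags, the sample sentence, and the `noun_chunks` helper are assumptions made for illustration. spaCy derives its noun chunks from a dependency parse rather than from a tag pattern like this one.

```python
import re

# Hand-tagged (token, POS) pairs; in practice the tags would come from a tagger.
tagged = [("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"), ("jumped", "VERB"),
          ("over", "ADP"), ("a", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")]

# Encode each tag as one character so the chunk rule can be written as a regex.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N"}
NOUN_PHRASE = re.compile(r"D?A*N+")  # optional determiner, adjectives, nouns


def noun_chunks(tagged_tokens):
    """Return noun-phrase chunks found by the toy rule above (hypothetical helper)."""
    tag_string = "".join(TAG_CODES.get(pos, "x") for _, pos in tagged_tokens)
    # Each regex match covers a span of tag characters, which maps one-to-one
    # onto a span of tokens.
    return [" ".join(word for word, _ in tagged_tokens[m.start():m.end()])
            for m in NOUN_PHRASE.finditer(tag_string)]


chunks = noun_chunks(tagged)
```

With the sample above, `chunks` is `['the quick fox', 'a lazy dog']`; the verb and the preposition fall outside every chunk.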



In [57]:
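def english_chunking(df, text_column):
    ## Preload the small English model (it has to be installed separately)
    import spacy
    nlp = spacy.load('en_core_web_sm')
    
    ## Parse every document into a spaCy Doc (a pandas Series of Docs)
    doc_series = df[text_column].apply(nlp)
    
    ## POS tagging: print text, coarse POS, and fine-grained tag for each token of the document at index 4
    for token in doc_series[4]:
        print([token.text, token.pos_, token.tag_])
    
    ## Chunking: print the noun chunks of the same document
    print([chunk.text for chunk in doc_series[4].noun_chunks])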

In [70]:
## Yelp Chunking & POS Tagging

english_chunking(yelp_df,"text")

## This call keeps raising the OSError shown below. The model has to be
## installed first, e.g. with: python -m spacy download en_core_web_sm

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

## Term Frequency - Inverse Document Frequency
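
As a small worked example of the idea, the sketch below computes tf-idf with one common definition: the raw count of a word in a document multiplied by log(N / number of documents containing the word). The three documents and the `tfidf` helper are made up for illustration. scikit-learn's `TfidfTransformer` uses a smoothed idf and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Made-up toy corpus, not drawn from the Yelp or Twitter data.
docs = ["the puppy is cute", "the cat is cute", "the puppy barks"]


def tfidf(documents):
    """Return one {word: weight} dict per document (hypothetical helper)."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    return [
        {word: count * math.log(n_docs / doc_freq[word])
         for word, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


weights = tfidf(docs)
```

A word that appears in every document, such as "the", gets weight 0, while a word confined to a single document, such as "barks", gets the largest weight.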

In [75]:
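# The book pulls out only Nightlife and Restaurants businesses to set up a two-class classification problem.
# Note: the book applies this filter to the Yelp business data. yelp_df holds reviews and has no
# 'categories' column, which is what the KeyError below reports.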
two_biz = yelp_df[yelp_df.apply(lambda x: 'Nightlife' in x['categories'] or 'Restaurants' in x['categories'], axis=1)]

KeyError: ('categories', 'occurred at index 0')