# Text Mining using Natural Language Processing (NLP)

## Introduction

### What is NLP?

- Using computers to process (analyze, understand, generate) natural human languages
- Most knowledge created by humans is unstructured text, and we need a way to make sense of it
- Build probabilistic models from data about a language

### Important Packages Related to Text Mining
- **textmining1.0:** contains a variety of useful functions for text mining in Python
- **NLTK:** gives easy access to over 50 corpora and lexical resources
- **Tweepy:** mines Twitter data
- **Scrapy:** extracts the data you need from websites
- **urllib2:** opens URLs (Python 2 only; in Python 3 the equivalent is `urllib.request`)
- **requests:** fetches data from the internet over HTTP
- **Beautiful Soup:** parses HTML data
- **re:** regular expressions; `re.search()`, `re.match()`, `re.findall()`, `re.sub()`, and `re.split()` are helpful functions
- **wordcloud:** draws word clouds
- **TextBlob:** used for text processing (low-level NLP tasks)
- **sklearn:** used for preprocessing and modeling

### What are some of the higher level task areas?

- **Information retrieval**: Find relevant results and similar results
    - [Google](https://www.google.com/)
- **Information extraction**: Structured information from unstructured documents
    - [Events from Gmail](https://support.google.com/calendar/answer/6084018?hl=en)
- **Machine translation**: One language to another
    - [Google Translate](https://translate.google.com/)
- **Text simplification**: Preserve the meaning of text, but simplify the grammar and vocabulary
    - [Rewordify](https://rewordify.com/)
    - [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page)
- **Predictive text input**: Faster or easier typing
    - [My application](https://justmarkham.shinyapps.io/textprediction/)
    - [A much better application](https://farsite.shinyapps.io/swiftkey-cap/)
- **Sentiment analysis**: Attitude of speaker
    - [Hater News](http://haternews.herokuapp.com/)
- **Automatic summarization**: Extractive or abstractive summarization
    - [autotldr](https://www.reddit.com/r/technology/comments/35brc8/21_million_people_still_use_aol_dialup/cr2zzj0)
- **Natural Language Generation**: Generate text from data
    - [How a computer describes a sports match](http://www.bbc.com/news/technology-34204052)
    - [Publishers withdraw more than 120 gibberish papers](http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763)
- **Speech recognition and generation**: Speech-to-text, text-to-speech
    - [Google's Web Speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html)
    - [Vocalware Text-to-Speech demo](https://www.vocalware.com/index/demo)
- **Question answering**: Determine the intent of the question, match query with knowledge base, evaluate hypotheses
    - [How did supercomputer Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
    - [IBM's Watson Trivia Challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html)
    - [The AI Behind Watson](http://www.aaai.org/Magazine/Watson/watson.php)

### Data Processing - What are some of the lower level components?

- **Tokenization**: breaking text into tokens (words, sentences, n-grams)
- **Stopword removal**: a/an/the
- **Stemming and lemmatization**: root word
- **TF-IDF**: word importance
- **Part-of-speech tagging**: noun/verb/adjective
- **Named entity recognition**: person/organization/location
- **Spelling correction**: "New Yrok City"
- **Word sense disambiguation**: "buy a mouse"
- **Segmentation**: "New York City subway"
- **Language detection**: "translate this page"
- **Machine learning**

### Why is NLP hard?

- **Ambiguity**:
    - Hospitals are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Non-standard English**: text messages
- **Idioms**: "throw in the towel"
- **Newly coined words**: "retweet"
- **Tricky entity names**: "Where is A Bug's Life playing?"
- **World knowledge**: "Mary and Sue are sisters", "Mary and Sue are mothers"

NLP requires an understanding of the **language** and the **world**.

## Text Classification

#### Feature Engineering
##### TF-IDF Vectors as features
- TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
- IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

- TF-IDF Vectors can be generated at different levels of input tokens (words, characters, n-grams)
    - a. Word Level TF-IDF: matrix of tf-idf scores for every term in each document
    - b. N-gram Level TF-IDF: an n-gram is a sequence of N consecutive terms; this matrix holds the tf-idf scores of n-grams
    - c. Character Level TF-IDF: matrix of tf-idf scores for character-level n-grams in the corpus
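
The two formulas above can be checked by hand on a toy corpus. The following is a minimal sketch in plain Python, assuming a hypothetical three-document corpus and helper names (`tf`, `idf`) that are not part of the notebook. It implements the unsmoothed definitions given above; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency: ln(N / number of documents containing `term`).

    Assumes the term occurs in at least one document (otherwise this divides by zero).
    """
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

# Hypothetical corpus, already lowercased and split on whitespace.
corpus = [
    "the food was great".split(),
    "the service was slow".split(),
    "great food great service".split(),
]

# "great" appears twice in the 4-token third document and in 2 of 3 documents.
tfidf_great = tf("great", corpus[2]) * idf("great", corpus)
# "the" does not appear in the third document, so its score there is zero.
tfidf_the = tf("the", corpus[2]) * idf("the", corpus)

print(round(tfidf_great, 4))  # 0.2027
print(tfidf_the)              # 0.0
```

A term scores highly when it is frequent inside one document but rare across the corpus, which is why very common words end up with low weights.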

##### Text / NLP based features
- Word Count of the documents – total number of words in the documents
- Character Count of the documents – total number of characters in the documents
- Average Word Density of the documents – average length of the words used in the documents
- Punctuation Count of the documents – total number of punctuation marks in the documents
- Upper Case Count of the documents – total number of upper case words in the documents
- Title Word Count of the documents – total number of proper case (title) words in the documents
- Frequency distribution of Part of Speech Tags:
    - Noun Count
    - Verb Count
    - Adjective Count
    - Adverb Count
    - Pronoun Count

### Model Building
- Naive Bayes Classifier
- Linear Classifier
- Support Vector Machine
- KNN
- Bagging Models
- Boosting Models
- Shallow Neural Networks
- Deep Neural Networks
    - Convolutional Neural Network (CNN)
    - Long Short-Term Memory (LSTM)
    - Gated Recurrent Unit (GRU)
    - Bidirectional RNN
    - Recurrent Convolutional Neural Network (RCNN)
    - Other Variants of Deep Neural Networks

## Part 1: Reading in the Yelp Reviews

- "corpus" = collection of documents
- "corpora" = plural form of corpus

In [None]:
#import required packages
#basics
import pandas as pd 
import numpy as np

#misc
import gc
import time
import warnings

#stats
#from scipy.misc import imread
from scipy import sparse
import scipy.stats as ss

#viz
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import seaborn as sns
from wordcloud import WordCloud ,STOPWORDS
from PIL import Image
#import matplotlib_venn as venn

#nlp
import string
import re    #for regex
import nltk
from nltk.corpus import stopwords

#import spacy
from nltk import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.tokenize import word_tokenize

# Tweet tokenizer does not split at apostrophes, which is what we want
from nltk.tokenize import TweetTokenizer   


#FeatureEngineering
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_is_fitted
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm, decomposition, ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

import  textblob
#import xgboost
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers

from textblob import TextBlob
from nltk.stem import PorterStemmer
import nltk
#nltk.download('wordnet')
from textblob import Word

#settings
start_time=time.time()
color = sns.color_palette()
sns.set_style("dark")
eng_stopwords = set(stopwords.words("english"))
warnings.filterwarnings("ignore")

lem = WordNetLemmatizer()
tokenizer=TweetTokenizer()

%matplotlib inline

Using TensorFlow backend.


In [None]:
# read yelp.csv into a DataFrame
yelp = pd.read_csv('yelp.csv')

In [3]:
yelp.head(5)

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [4]:
yelp=yelp[['review_id', 'stars', 'text', 'cool', 'useful', 'funny']]

In [5]:
yelp.head()

Unnamed: 0,review_id,stars,text,cool,useful,funny
0,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,2,5,0
1,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,0,0,0
2,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,0,1,0
3,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",1,2,0
4,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,0,0,0


In [6]:
df = yelp

In [None]:
df['text'] = df['text'].astype(str)
#Sentence count, approximated as the number of newline-separated lines
df['count_sent']=df["text"].apply(lambda x: len(re.findall("\n",str(x)))+1)

#Word count in each comment:
df['count_word']=df["text"].apply(lambda x: len(str(x).split()))

#Unique word count
df['count_unique_word']=df["text"].apply(lambda x: len(set(str(x).split())))

#Letter count
df['count_letters']=df["text"].apply(lambda x: len(str(x)))

#Word density

df['word_density'] = df['count_letters'] / (df['count_word']+1)

#punctuation count
df["count_punctuations"] =df["text"].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))

#upper case words count
df["count_words_upper"] = df["text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

#lower case words count
df["count_words_lower"] = df["text"].apply(lambda x: len([w for w in str(x).split() if w.islower()]))

#title case words count
df["count_words_title"] = df["text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))

#Number of stopwords
df["count_stopwords"] = df["text"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))

#Average length of the words
df["mean_word_len"] = df["text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

#Number of purely numeric tokens
df['numeric'] = df['text'].apply(lambda x: len([w for w in x.split() if w.isdigit()]))

#Number of alphanumeric tokens
df['alphanumeric'] = df['text'].apply(lambda x: len([w for w in x.split() if w.isalnum()]))

#Number of alphabetic tokens
df['alphabetetics'] = df['text'].apply(lambda x: len([w for w in x.split() if w.isalpha()]))

#Number of whitespace characters (tokens from split() never contain whitespace, so count characters)
df['Spaces'] = df['text'].apply(lambda x: len([c for c in x if c.isspace()]))

#Number of words ending with 'et'
df['words_ends_with_et'] = df['text'].apply(lambda x: len([w for w in x.lower().split() if w.endswith('et')]))

#Number of words starting with 'no'
df['words_start_with_no'] = df['text'].apply(lambda x: len([w for w in x.lower().split() if w.startswith('no')]))

# Count the occurrences of every word
df['wordcounts'] = df['text'].apply(lambda x: dict([[t, x.split().count(t)] for t in set(x.split())]))

pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

# count the words in a text whose part-of-speech tag belongs to the given family
def check_pos_tag(x, flag):
    cnt = 0
    try:
        wiki = textblob.TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_family[flag]:
                cnt += 1
    except Exception:
        # leave the count as-is if tagging fails for this text
        pass
    return cnt

df['noun_count'] = df['text'].apply(lambda x: check_pos_tag(x, 'noun'))
df['verb_count'] = df['text'].apply(lambda x: check_pos_tag(x, 'verb'))
df['adj_count']  = df['text'].apply(lambda x: check_pos_tag(x, 'adj'))
df['adv_count']  = df['text'].apply(lambda x: check_pos_tag(x, 'adv'))
df['pron_count'] = df['text'].apply(lambda x: check_pos_tag(x, 'pron')) 

In [7]:
df['sentiment'] = df["text"].apply(lambda x: TextBlob(x).sentiment.polarity )

In [8]:
df.head()

Unnamed: 0,review_id,stars,text,cool,useful,funny,sentiment
0,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,2,5,0,0.402469
1,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,0,0,0,0.229773
2,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,0,1,0,0.566667
3,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",1,2,0,0.608646
4,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,0,0,0,0.468125


In [9]:
yelp.stars.value_counts()

4    3526
5    3337
3    1461
2     927
1     749
Name: stars, dtype: int64

In [10]:
# optionally keep only the 5-star and 1-star reviews (disabled: all five classes are used)
#yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# define X and y
X = yelp.text
y = yelp.stars

# split X and y into training and testing sets (the default split is 75% / 25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(7500,)
(2500,)
(7500,)
(2500,)


In [11]:
yelp.shape

(10000, 7)

In [None]:
yelp.head()

In [16]:
s = 'Analytixlabs is from bagnalore, it has offices in Gurgoan, KL. It is staarted in 4Years back'

In [17]:
#Basic text cleaning: lowercase, trim, collapse repeated spaces, strip punctuation and digits
def clean_text(text):
    text = text.lower()
    text = text.strip()
    text = re.sub(r' +', ' ', text)
    text = re.sub(r"[-()\"#/@;:{}`+=~|.!?,'0-9]", "", text)
    return text

In [18]:
clean_text(s)

'analytixlabs is from bagnalore it has offices in gurgoan kl it is staarted in years back'

In [19]:
stop = set(nltk.corpus.stopwords.words('english'))

In [21]:
print(stop)

{"don't", 'yourself', 'needn', 'myself', 'been', 'he', 'theirs', "haven't", 'haven', 'do', 'out', 'mightn', 'weren', 'from', 'some', "you're", 'be', 'himself', 'with', 'by', 'can', 'm', 'o', 'being', 'hasn', "hasn't", "didn't", 'to', 'more', 's', 'couldn', 'its', "isn't", 'themselves', 'each', "wasn't", 'your', 'their', 'an', 'most', 'only', 'won', "you'd", 'then', 'same', 'of', 'just', 'don', 'under', 'again', 'y', "mustn't", "you'll", 'on', 'such', "won't", 're', "hadn't", "shan't", 'ma', 'herself', "needn't", 'does', 'there', 'where', 'if', "shouldn't", 'she', 'you', 'will', 'through', 'mustn', "weren't", 'no', 'shan', 'have', 'that', 'few', 't', 'as', 'down', 'has', 'isn', 'didn', 'ourselves', 'is', 'here', 'while', 'her', 'the', 'at', "you've", 'doing', 'should', 'up', 'during', 'having', 'over', 'ours', 'hers', 'doesn', 'yourselves', 'when', 'but', 'very', 'they', 'once', 'into', 'we', 'until', 'and', 'own', "couldn't", 'me', 'am', 'a', 'yours', "she's", "it's", 'for', 'who', 've

In [22]:
stemmer_func = nltk.stem.snowball.SnowballStemmer("english").stem

In [25]:
s= 'Analytics is really doing good'

In [30]:
stemmer_func('really')

'realli'

In [32]:
s.split()

['Analytics', 'is', 'really', 'doing', 'good']

In [34]:
import string
def pre_process(text):
    #text = text.str.replace('/','')
    #text = text.apply(lambda x: re.sub("  "," ", x))
    #text = re.sub(r"[-()\"#/@;:{}`+=~|.!?,']", "", text)
    #text = re.sub(r'[0-9]+', '', text)
    #text = text.apply(lambda x: " ".join(x.translate(str.maketrans('', '', string.punctuation)) for x in x.split() if x.isalpha()))
    text = text.apply(lambda x: " ".join(x for x in x.split() if x not in stop))
    #text = text.apply(lambda x: str(TextBlob(x).correct()))
    #text = text.apply(lambda x: " ".join(PorterStemmer().stem(word) for word in x.split()))
    #text = text.apply(lambda x: " ".join(stemmer_func(word) for word in x.split()))
    #text = text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
    #text = text.apply(lambda x: " ".join(word for word, pos in pos_tag(x.split()) if pos not in ['NN','NNS','NNP','NNPS']))
    return(text)

In [35]:
X_train = X_train.apply(lambda x: clean_text(x))
X_test = X_test.apply(lambda x: clean_text(x))

In [36]:
X_train=pre_process(X_train)
X_test=pre_process(X_test)

In [None]:
#Vectorization

In [37]:
CountVectorizer?

In [38]:
#Train
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 1 ), min_df=5, encoding='latin-1' , max_features=800)
xtrain_count = count_vect.fit_transform(X_train)


In [41]:
dtm=xtrain_count.toarray()

In [45]:
dtm

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 2, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [51]:
count_vect.get_feature_names()

['able',
 'absolutely',
 'across',
 'actually',
 'add',
 'added',
 'afternoon',
 'ago',
 'almost',
 'along',
 'already',
 'also',
 'although',
 'always',
 'amazing',
 'ambiance',
 'amount',
 'another',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'appetizer',
 'appetizers',
 'area',
 'arent',
 'arizona',
 'around',
 'arrived',
 'art',
 'asian',
 'ask',
 'asked',
 'ate',
 'atmosphere',
 'attention',
 'attentive',
 'authentic',
 'available',
 'average',
 'away',
 'awesome',
 'az',
 'back',
 'bacon',
 'bad',
 'bag',
 'bar',
 'bartender',
 'based',
 'bbq',
 'beans',
 'beat',
 'beautiful',
 'beef',
 'beer',
 'beers',
 'behind',
 'believe',
 'best',
 'better',
 'big',
 'bill',
 'birthday',
 'bit',
 'bite',
 'black',
 'bland',
 'blue',
 'bottle',
 'bought',
 'bowl',
 'box',
 'bread',
 'breakfast',
 'bring',
 'brought',
 'bucks',
 'buffet',
 'burger',
 'burgers',
 'burrito',
 'business',
 'busy',
 'butter',
 'buy',
 'cafe',
 'cake',
 'call',
 'called',
 'came',
 'cannot',
 'cant',
 'car',


In [52]:
dtm1=pd.DataFrame(dtm)

In [53]:
dtm1.columns=count_vect.get_feature_names()

In [54]:
dtm1.head()

Unnamed: 0,able,absolutely,across,actually,add,added,afternoon,ago,almost,along,...,year,years,yelp,yes,yet,youll,youre,youve,yum,yummy
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,2,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0


In [55]:
#Train
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 1 ), min_df=5, encoding='latin-1' , max_features=800)
xtrain_count = count_vect.fit_transform(X_train)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(xtrain_count)

#Test
#count_vect = CountVectorizer()
xtest_count = count_vect.transform(X_test)

#tfidf_transformer = TfidfTransformer()
X_test_tfidf = tfidf_transformer.transform(xtest_count)


In [56]:
dtm2=pd.DataFrame(X_train_tfidf.toarray(), columns=count_vect.get_feature_names())

In [57]:
dtm2.head(100)

Unnamed: 0,able,absolutely,across,actually,add,added,afternoon,ago,almost,along,...,year,years,yelp,yes,yet,youll,youre,youve,yum,yummy
0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000
1,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000
2,0.000000,0.243022,0.0,0.100958,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000
3,0.000000,0.279355,0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.254748,0.000000,0.0,0.000000,0.0,0.000000,0.000000
4,0.000000,0.000000,0.0,0.000000,0.126434,0.000000,0.000000,0.0000,0.000000,0.126715,...,0.000000,0.0,0.000000,0.000000,0.114775,0.0,0.000000,0.0,0.000000,0.000000
5,0.000000,0.000000,0.0,0.068286,0.000000,0.000000,0.092854,0.0000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000
6,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000
7,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.424834,0.000000
8,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000
9,0.000000,0.000000,0.0,0.227758,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,...,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000


In [59]:
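# ngram level tf-idf (unigrams and bigrams)
# note: the vectorizer is fitted on all of df['text'], which is uncleaned and includes the test rows;
# fitting on X_train instead would keep test vocabulary out of the features
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2), max_features=800)
tfidf_vect_ngram.fit(df['text'])
xtrain_tfidf_ngram =  tfidf_vect_ngram.transform(X_train)
xtest_tfidf_ngram =  tfidf_vect_ngram.transform(X_test)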

In [60]:
# character level tf-idf (token_pattern is only used when analyzer='word', so it is omitted here)
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(1,2), max_features=800)
tfidf_vect_ngram_chars.fit(df['text'])
xtrain_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(X_train) 
xtest_tfidf_ngram_chars =  tfidf_vect_ngram_chars.transform(X_test)

In [61]:
#Topic Models as features

# train an LDA model
lda_model = decomposition.LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20)
X_topics = lda_model.fit_transform(X_train_tfidf)
topic_word = lda_model.components_ 
vocab = count_vect.get_feature_names()

In [63]:
# view the topic models
n_top_words = 50
topic_summaries = []
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))

topic_summaries

['average sports charge money super low options given crowded prices area center go except dining end desert outdoor clean cant rather leave watch business cheap better like though long drinks one head come really small want sure bar find dont make city employees literally shopping side mac truly also always',
 'yum weeks awesome delicious great beer say bucks surprised food place good year want worth service need burger pizza try kind think menu corn chicken id mind find low room really excellent different got large money parking stand summer price go wasnt stayed cute pool quick excited dont lunch side',
 'place great good like food get go one really time bar dont love nice always people service coffee ive would back pretty im night staff better well little happy area also fun friendly beer much drinks hour even location know think going room want never see around way youre new',
 'great food service love good best excellent place atmosphere amazing awesome ever always breakfast yumm

In [64]:
frequency_words_wo_stop= {}
for data in yelp['text']:
    tokens = nltk.wordpunct_tokenize(data.lower())
    for token in tokens:
        if token.lower() not in stop:
            if token in frequency_words_wo_stop:
                count = frequency_words_wo_stop[token]
                count = count + 1
                frequency_words_wo_stop[token] = count
            else:
                frequency_words_wo_stop[token] = 1
                



In [65]:
frequency_words_wo_stop

{'wife': 365,
 'took': 759,
 'birthday': 202,
 'breakfast': 737,
 'excellent': 724,
 '.': 75581,
 'weather': 92,
 'perfect': 649,
 'made': 1334,
 'sitting': 276,
 'outside': 594,
 'overlooking': 12,
 'grounds': 38,
 'absolute': 57,
 'pleasure': 55,
 'waitress': 426,
 'food': 6184,
 'arrived': 287,
 'quickly': 266,
 'semi': 37,
 '-': 9550,
 'busy': 498,
 'saturday': 301,
 'morning': 323,
 'looked': 464,
 'like': 5041,
 'place': 6662,
 'fills': 16,
 'pretty': 1812,
 'earlier': 73,
 'get': 3819,
 'better': 1541,
 'favor': 41,
 'bloody': 54,
 'mary': 46,
 'phenomenal': 59,
 'simply': 177,
 'best': 1952,
 "'": 27668,
 'ever': 1081,
 'sure': 1149,
 'use': 485,
 'ingredients': 276,
 'garden': 89,
 'blend': 39,
 'fresh': 1222,
 'order': 1589,
 'amazing': 1060,
 'everything': 1066,
 'menu': 1678,
 'looks': 293,
 ',': 53283,
 'white': 382,
 'truffle': 42,
 'scrambled': 21,
 'eggs': 239,
 'vegetable': 71,
 'skillet': 28,
 'tasty': 863,
 'delicious': 1339,
 'came': 1309,
 '2': 1431,
 'pieces': 200

In [1]:
var = "chandr mouli rajesh rree chandra chandra mouli mouli rajesh rajesh"

In [3]:
from wordcloud import WordCloud ,STOPWORDS
import matplotlib.pyplot as plt

In [4]:
# generate() expects a plain string, so pass var directly
wordcloud = WordCloud(stopwords=[]).generate(var)
%matplotlib inline
fig = plt.figure(figsize=(20,10))
plt.imshow(wordcloud)
plt.axis("off")

In [80]:
def train_model(classifier, feature_vector_train, label, feature_vector_valid,  valid_y, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    
    if is_neural_net:
        predictions = predictions.argmax(axis=-1)
    
    return metrics.accuracy_score(predictions, valid_y)

In [81]:
#Naive Bayes
# Naive Bayes on Word Level TF-IDF Vectors
accuracy_L1 = train_model(naive_bayes.MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf, y_test)
print("NB  for L1, WordLevel TF-IDF: ", accuracy_L1)

# Naive Bayes on Count Vectors
accuracy_L1 = train_model(naive_bayes.MultinomialNB(), xtrain_count, y_train, xtest_count, y_test)
print("NB  for L1, Count Vectors: ", accuracy_L1)

# Naive Bayes on Ngram Level TF-IDF Vectors
accuracy_L1 = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, y_train, xtest_tfidf_ngram, y_test)
print("NB  for L1, N-Gram Vectors: ", accuracy_L1)

# Naive Bayes on Character Level TF-IDF Vectors
accuracy_L1 = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, y_train, xtest_tfidf_ngram_chars, y_test)
print("NB for L1, CharLevel Vectors: ", accuracy_L1)

NB  for L1, WordLevel TF-IDF:  0.4736
NB  for L1, Count Vectors:  0.508
NB  for L1, N-Gram Vectors:  0.4512
NB for L1, CharLevel Vectors:  0.4028


In [83]:
#Logistic Regression
# Logistic Regression on Word Level TF-IDF Vectors
accuracy_L1 = train_model(LogisticRegression(), X_train_tfidf, y_train, X_test_tfidf, y_test)
print("LR  for L1, WordLevel TF-IDF: ", accuracy_L1)

# Logistic Regression on Count Vectors
accuracy_L1 = train_model(LogisticRegression(), xtrain_count, y_train, xtest_count, y_test)
print("LR  for L1, Count Vectors: ", accuracy_L1)

# Logistic Regression on Ngram Level TF-IDF Vectors
accuracy_L1 = train_model(LogisticRegression(), xtrain_tfidf_ngram, y_train, xtest_tfidf_ngram, y_test)
print("LR  for L1, N-Gram Vectors: ", accuracy_L1)

# Logistic Regression on Character Level TF-IDF Vectors
accuracy_L1 = train_model(LogisticRegression(), xtrain_tfidf_ngram_chars, y_train, xtest_tfidf_ngram_chars, y_test)
print("LR for L1, CharLevel Vectors: ", accuracy_L1)

LR  for L1, WordLevel TF-IDF:  0.5164
LR  for L1, Count Vectors:  0.4824
LR  for L1, N-Gram Vectors:  0.4988
LR for L1, CharLevel Vectors:  0.4324


In [85]:
#Linear SVM
# Linear SVM on Word Level TF-IDF Vectors
accuracy_L1 = train_model(svm.LinearSVC(), X_train_tfidf, y_train, X_test_tfidf, y_test)
print("SVM  for L1, WordLevel TF-IDF: ", accuracy_L1)

# Linear SVM on Count Vectors
accuracy_L1 = train_model(svm.LinearSVC(), xtrain_count, y_train, xtest_count, y_test)
print("SVM  for L1, Count Vectors: ", accuracy_L1)

# Linear SVM on Ngram Level TF-IDF Vectors
accuracy_L1 = train_model(svm.LinearSVC(), xtrain_tfidf_ngram, y_train, xtest_tfidf_ngram, y_test)
print("SVM  for L1, N-Gram Vectors: ", accuracy_L1)

# Linear SVM on Character Level TF-IDF Vectors
accuracy_L1 = train_model(svm.LinearSVC(), xtrain_tfidf_ngram_chars, y_train, xtest_tfidf_ngram_chars, y_test)
print("SVM for L1, CharLevel Vectors: ", accuracy_L1)

SVM  for L1, WordLevel TF-IDF:  0.4976
SVM  for L1, Count Vectors:  0.4856
SVM  for L1, N-Gram Vectors:  0.4976
SVM for L1, CharLevel Vectors:  0.4608


## Part 2: Tokenization

- **What:** Separate text into units such as sentences or words
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy with English language text, not easy with some languages

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
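
The bag-of-words idea in the quote can be sketched without any library. This is an illustrative example only: the two documents and the `tokenize` helper are hypothetical, and the regular expression loosely mirrors `CountVectorizer`'s default of keeping tokens with two or more word characters.

```python
from collections import Counter
import re

# Hypothetical two-document corpus.
docs = [
    "Great food, great service",
    "Service was slow",
]

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\w\w+", text.lower())

# Vocabulary: every distinct token in the corpus, sorted so the columns are stable.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Document-term matrix: one row per document, one column per vocabulary token.
dtm = []
for doc in docs:
    counts = Counter(tokenize(doc))
    dtm.append([counts[tok] for tok in vocabulary])

print(vocabulary)  # ['food', 'great', 'service', 'slow', 'was']
print(dtm)         # [[1, 2, 1, 0, 0], [0, 0, 1, 1, 1]]
```

Word order is discarded: each row records only how often each token occurs, which is exactly the "bag" in Bag of Words.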

In [None]:
# use CountVectorizer to create document-term matrices from X_train and X_test
vect = CountVectorizer()

In [None]:
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [None]:
X_test_dtm.shape

In [None]:
# rows are documents, columns are terms (aka "tokens" or "features")
X_train_dtm.shape

In [None]:
# last 50 features
print(vect.get_feature_names()[-50:])

In [None]:
# show vectorizer options
vect

[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

- **lowercase:** boolean, True by default
    - Convert all characters to lowercase before tokenizing.

In [None]:
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

- **ngram_range:** tuple (min_n, max_n)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
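
To make the `min_n <= n <= max_n` rule concrete, here is a small sketch that enumerates word n-grams by hand. The function name `word_ngrams` is hypothetical and is not a scikit-learn API; it only illustrates which features a given range would produce.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = "new york city subway".split()
unigrams_and_bigrams = word_ngrams(tokens, 1, 2)

print(unigrams_and_bigrams)
# ['new', 'york', 'city', 'subway', 'new york', 'york city', 'city subway']
```

Widening the range adds many more columns to the document-term matrix, so the feature count grows quickly with `max_n`.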

In [None]:
CountVectorizer?

In [None]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

In [None]:
# last 50 features
print(vect.get_feature_names()[-50:])

In [58]:
#Calculate tf-idf:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(min_df=1)
tfidf = vect.fit_transform(["New Year's Eve in New York",
                            "New Year's Eve in London",
                            "York is closer to London than to New York",
                            "London is closer to Bucharest than to New York"])

#Calculate cosine similarity (rows are L2-normalized by default, so the dot product is the cosine):
cosine=(tfidf * tfidf.T).toarray()
print(cosine)

[[ 1.          0.82384531  0.28730789  0.20464882]
 [ 0.82384531  1.          0.16511247  0.1679379 ]
 [ 0.28730789  0.16511247  1.          0.89268279]
 [ 0.20464882  0.1679379   0.89268279  1.        ]]
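
The matrix product above yields cosine similarities because `TfidfVectorizer` scales each row to unit length by default. For vectors that are not normalized, the general formula divides the dot product by the two norms. A minimal sketch, using hypothetical count vectors rather than the tf-idf values above:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors over the vocabulary ["new", "york", "london"].
doc_a = [2, 1, 0]
doc_b = [1, 0, 1]

similarity = cosine_similarity(doc_a, doc_b)
print(round(similarity, 4))  # 0.6325
```

Values close to 1 mean the two documents use terms in similar proportions; 0 means they share no terms.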


**Predicting the star rating:**

In [None]:
# use Naive Bayes to predict the star rating
nb = naive_bayes.MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))

In [None]:
# calculate null accuracy
y_test_binary = np.where(y_test==5, 1, 0)
max(y_test_binary.mean(), 1 - y_test_binary.mean())
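
Null accuracy is the accuracy you would get by always predicting the most frequent class, so it is the baseline any model has to beat. A minimal sketch that works for any number of classes, assuming a hypothetical helper `null_accuracy` and made-up labels:

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# made-up labels: six 5-star reviews and two 1-star reviews
example_labels = [5, 5, 5, 1, 5, 1, 5, 5]
baseline = null_accuracy(example_labels)
print(baseline)  # 0.75
```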

In [None]:
# define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(vect):
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [None]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

## Part 3: Stopword Removal

- **What:** Remove common words that will likely appear in any text
- **Why:** They don't tell you much about your text

In [None]:
# show vectorizer options
vect

- **stop_words:** string {'english'}, list, or None (default)
- If 'english', a built-in stop word list for English is used.
- If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
- If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
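
As a rough illustration of what passing a list to `stop_words` does, the sketch below filters tokens against a tiny stop word set. The set `STOP_WORDS` and both helper functions are hypothetical; the built-in English list is far longer.

```python
import re

# hypothetical, deliberately tiny stop word list for illustration only
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of"}

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The service was slow and the food is cold")
kept = remove_stop_words(tokens, STOP_WORDS)
print(kept)  # ['service', 'slow', 'food', 'cold']
```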

In [None]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

In [None]:
# set of stop words
print(vect.get_stop_words())

## Part 4: Other CountVectorizer Options

- **max_features:** int or None, default=None
- If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.

In [None]:
# remove English stop words and only keep 100 features
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)

In [None]:
# all 100 features
print(vect.get_feature_names())

In [None]:
# include 1-grams and 2-grams, and limit the number of features
vect = CountVectorizer(ngram_range=(1, 2), max_features=100000)
tokenize_test(vect)

- **min_df:** float in range [0.0, 1.0] or int, default=1
- When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts.
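
The `min_df` and `max_features` rules described above can be sketched in a few lines of plain Python. This is an illustration of the documented behaviour, not scikit-learn's implementation: the function name `build_vocabulary` is hypothetical, and the alphabetical tie-breaking for equally frequent terms is an assumption.

```python
from collections import Counter

def build_vocabulary(tokenized_docs, min_df=1, max_features=None):
    """Select terms by document frequency, then optionally cap the vocabulary size."""
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term in the corpus
    for tokens in tokenized_docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    # a float min_df is a proportion of documents, an int is an absolute count
    if isinstance(min_df, float):
        threshold = min_df * len(tokenized_docs)
    else:
        threshold = min_df
    terms = [t for t in doc_freq if doc_freq[t] >= threshold]
    if max_features is not None:
        # keep the most frequent terms; ties are broken alphabetically (assumption)
        terms = sorted(terms, key=lambda t: (-term_freq[t], t))[:max_features]
    return sorted(terms)

docs = [["good", "food", "good"], ["bad", "food"], ["good", "service"]]
print(build_vocabulary(docs, min_df=2))        # ['food', 'good']
print(build_vocabulary(docs, max_features=1))  # ['good']
```

Note that `min_df` counts documents, not occurrences: a term used ten times in a single review still has a document frequency of 1.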

In [None]:
# include 1-grams and 2-grams, and only include terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect)

## Part 5: Introduction to TextBlob

TextBlob: "Simplified Text Processing"

In [None]:
# print the first review
print(yelp.text[0])

In [None]:
# save it as a TextBlob object
review = TextBlob(yelp.text[0])

In [None]:
print(dir(review))

In [None]:
print(review.ngrams(2))

In [None]:
review.sentiment

In [None]:
# list the words
review.words

In [None]:
# list the sentences
review.sentences

In [None]:
# some string methods are available
review.lower()

In [None]:
review.ngrams(n=2)

## Part 6: Stemming and Lemmatization

**Stemming:**

- **What:** Reduce a word to its base/stem/root form
- **Why:** Often makes sense to treat related words the same way
- **Notes:**
    - Uses a "simple" and fast rule-based approach
    - Stemmed words are usually not shown to users (used for analysis/indexing)
    - Some search engines treat words with the same stem as synonyms

In [None]:
# initialize stemmer
stemmer = SnowballStemmer('english')
stemmer

In [None]:
review.words

In [None]:
# stem each word
print([stemmer.stem(word) for word in review.words])

**Lemmatization**

- **What:** Derive the canonical form ('lemma') of a word
- **Why:** Lemmas are real words, so the output is easier to read and often more accurate than stems
- **Notes:** Uses a dictionary-based approach (slower than stemming)
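
To contrast the two approaches, here is a toy sketch: a rule-based suffix stripper standing in for a stemmer and a lookup table standing in for a lemmatizer. Both are hypothetical and far simpler than the Snowball stemmer or the dictionary-based lemmatizer used in this notebook; the suffix list and the `LEMMAS` table are assumptions chosen for the example.

```python
# hypothetical suffix rules, checked in order (much cruder than Snowball)
SUFFIXES = ("ing", "ed", "es", "s")

# hypothetical lemma dictionary covering only the example words
LEMMAS = {"went": "go", "mice": "mouse", "ponies": "pony"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

words = ["waiting", "ponies", "went", "mice"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['wait', 'poni', 'went', 'mice']
print(lemmas)  # ['waiting', 'pony', 'go', 'mouse']
```

The rules are fast but produce non-words such as "poni" and miss irregular forms, while the lookup handles "went" and "mice" but only for words present in its dictionary.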

In [None]:
review.words

In [None]:
# assume every word is a noun
print([word.lemmatize() for word in review.words])

In [None]:
# assume every word is a verb
print([word.lemmatize(pos='v') for word in review.words])

In [None]:
# define a function that accepts text and returns a list of lemmas
def split_into_lemmas(text):
    text = text.lower()  # Python 3 strings are already Unicode, so no decoding is needed
    words = TextBlob(text).words
    return [word.lemmatize() for word in words]

In [None]:
# use split_into_lemmas as the feature extraction function (WARNING: SLOW!)
vect = CountVectorizer(analyzer=split_into_lemmas)
tokenize_test(vect)

In [None]:
# last 50 features
print(vect.get_feature_names()[-50:])

## Part 7: Term Frequency-Inverse Document Frequency (TF-IDF)

- **What:** Computes the "relative frequency" with which a word appears in a document, compared to its frequency across all documents
- **Why:** More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)
- **Notes:** Used for search engine scoring, text summarization, document clustering

In [None]:
# example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [None]:
# Term Frequency
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
tf

In [None]:
# Document Frequency
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1, 6), columns=vect.get_feature_names())

In [None]:
# Term Frequency-Inverse Document Frequency (simple version)
tf/df
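
The same "simple version" can be reproduced without pandas or scikit-learn, which makes the arithmetic easier to follow. In this sketch the regular expression is intended to mimic the default tokenization (lowercasing, tokens of two or more characters); that correspondence and the helper name `tokenize` are assumptions. `TfidfVectorizer` additionally applies a smoothed logarithmic IDF and normalizes each row, so its numbers differ from these.

```python
import re
from collections import Counter

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in simple_train]

# document frequency: in how many documents does each term occur?
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

# simple TF-IDF: term count in the document divided by document frequency
simple_tfidf = []
for tokens in tokenized:
    term_freq = Counter(tokens)
    simple_tfidf.append({term: count / doc_freq[term] for term, count in term_freq.items()})

for doc, scores in zip(simple_train, simple_tfidf):
    print(doc, '->', scores)
```

"call" occurs in every document, so it scores only 1/3 each time, while "please" occurs twice in a single document and scores 2.0.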

In [None]:
# TfidfVectorizer
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())

**More details:** [TF-IDF is about what matters](http://planspace.org/20150524-tfidf_is_about_what_matters/)

## Part 8: Using TF-IDF to Summarize a Yelp Review

Reddit's autotldr uses the [SMMRY](http://smmry.com/about) algorithm, which is based on TF-IDF!

In [None]:
TfidfVectorizer?

In [None]:
# create a document-term matrix using TF-IDF
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(yelp.text)
features = vect.get_feature_names()
dtm.shape

In [None]:
def summarize():
    
    # choose a random review that is at least 300 characters
    review_length = 0
    while review_length < 300:
        review_id = np.random.randint(0, len(yelp))
        review_text = yelp.text[review_id]  # Python 3 strings are already Unicode
        review_length = len(review_text)
    
    # create a dictionary of words and their TF-IDF scores
    word_scores = {}
    for word in TextBlob(review_text).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[review_id, features.index(word)]
    
    # print words with the top 5 TF-IDF scores
    print('TOP SCORING WORDS:')
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
    for word, score in top_scores:
        print(word)
    
    # print 5 random words (np.random.choice needs a sequence, not a dict view)
    print('\n' + 'RANDOM WORDS:')
    random_words = np.random.choice(list(word_scores.keys()), size=5, replace=False)
    for word in random_words:
        print(word)
    
    # print the review
    print('\n' + review_text)

In [None]:
summarize()

## Part 9: Sentiment Analysis

In [None]:
print(review)

In [None]:
# polarity ranges from -1 (most negative) to 1 (most positive)
review.sentiment.polarity

In [None]:
# understanding the apply method
yelp['length'] = yelp.text.apply(len)
yelp.head(1)

In [None]:
# define a function that accepts text and returns the polarity
def detect_sentiment(text):
    return TextBlob(text).sentiment.polarity  # Python 3 strings need no decoding

In [None]:
# create a new DataFrame column for sentiment (WARNING: SLOW!)
yelp['sentiment'] = yelp.text.apply(detect_sentiment)

In [None]:
yelp.head(5)

In [None]:
# box plot of sentiment grouped by stars
yelp.boxplot(column='sentiment', by='stars')

In [None]:
# reviews with most positive sentiment
yelp[yelp.sentiment == 1].text.head()

In [None]:
# reviews with most negative sentiment
yelp[yelp.sentiment == -1].text.head()

In [None]:
# widen the column display
pd.set_option('max_colwidth', 500)

In [None]:
# negative sentiment in a 5-star review
print(yelp[(yelp.stars == 5) & (yelp.sentiment < -0.3)].text)

In [None]:
# positive sentiment in a 1-star review
print(yelp[(yelp.stars == 1) & (yelp.sentiment > 0.5)].text)

In [None]:
# reset the column display width
pd.reset_option('max_colwidth')

### Adding Features to a Document-Term Matrix

In [86]:
# optionally restrict to the 5-star and 1-star reviews (disabled here, so all reviews are used)
#yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# define X and y
feature_cols = ['text', 'sentiment', 'cool', 'useful', 'funny']
X = yelp[feature_cols]
y = yelp.stars

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [88]:
# use CountVectorizer with text column only
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train.text)
X_test_dtm = vect.transform(X_test.text)
print(X_train_dtm.shape)
print(X_test_dtm.shape)

(7500, 25797)
(2500, 25797)


In [89]:
# shape of other four feature columns
X_train.drop('text', axis=1).shape

(7500, 4)

In [94]:
# cast other feature columns to float and convert to a sparse matrix
extra = sparse.csr_matrix(X_train.drop('text', axis=1).astype(float))
extra.shape

(7500, 4)

In [95]:
# combine sparse matrices
X_train_dtm_extra = sparse.hstack((X_train_dtm, extra))
X_train_dtm_extra.shape

(7500, 25801)

In [96]:
# repeat for testing set
extra = sparse.csr_matrix(X_test.drop('text', axis=1).astype(float))
X_test_dtm_extra = sparse.hstack((X_test_dtm, extra))
X_test_dtm_extra.shape

(2500, 25801)

In [98]:
# use logistic regression with text column only
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))

0.4592


In [99]:
# use logistic regression with all features
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm_extra, y_train)
y_pred_class = logreg.predict(X_test_dtm_extra)
print(metrics.accuracy_score(y_test, y_pred_class))

0.4672


## Bonus: Fun TextBlob Features

In [100]:
# spelling correction
TextBlob('15 minuets late').correct()

TextBlob("15 minutes late")

In [103]:
s="this is bcz"

In [104]:
TextBlob(s).correct()

TextBlob("this is bc")

In [107]:
# spellcheck
Word('parot').spellcheck()

[('part', 0.9929478138222849), ('parrot', 0.007052186177715092)]

In [None]:
# definitions
Word('bank').define('v')

In [None]:
# language identification (calls an online translation service; deprecated in newer TextBlob releases)
TextBlob('Hola amigos').detect_language()

## Conclusion

- NLP is a gigantic field
- Understanding the basics broadens the types of data you can work with
- Simple techniques go a long way
- Use scikit-learn for NLP whenever possible