# Text Representation
Zimei Yang |
May 13 2018

### Part I BoW and TF-IDF
1. Use Amazon book reviews (text documents) dataset.
2. Use Bag of Words (BoW) and TF-IDF (CountVectorizer and TfidfVectorizer in scikit-learn).
3. Write a Python program to create and print the vocabulary and the document-term matrix (vectorized representation).
4. Try unigram and bigram parameters and observe their effect on the number of features.

In [2]:
# import packages
import numpy as np
import pandas as pd

In [9]:
# load Amazon book reviews dataset
df = pd.read_csv("Small-Book Reviews from Amazon.csv", names=['number','review'])

In [15]:
df.review.head()

0    Ok~ but I think the Keirsey Temperment Test is...
1    Repellent Sale of Conservativism  The fatalist...
2    I had a bad feeling about this!  And I was rig...
3    Lost Credability, QUICKLY!!  I admit, I haven'...
4    Poorly written  I tried reading this book but ...
Name: review, dtype: object

#### Bag of Words Representation
BoW representation is implemented in CountVectorizer.

In [78]:
# import CountVectorizer for BoW
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()  

In [18]:
# define corpus
corpus = df.review

Fitting the CountVectorizer does the following: 
1. tokenizing 
2. building the vocabulary (*vocabulary is a mapping of terms to feature indices.*)
  -- vocabulary can be accessed through the "vocabulary_" attribute
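
As a minimal sketch of these two steps, the cell below runs the default analyzer on a tiny made-up corpus (`toy_docs`, `toy_vect`, and `tokens` are illustrative names, not part of the assignment data). It assumes the default settings shown in the notebook (`lowercase=True` and a token pattern that keeps only tokens of two or more characters), so single-character words such as "I" and "a" are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical two-document corpus, used only for illustration
toy_docs = ["I like this book", "This book is a good book"]

toy_vect = CountVectorizer()

# step 1: tokenizing (lowercases, then keeps tokens of 2+ word characters)
analyzer = toy_vect.build_analyzer()
tokens = analyzer(toy_docs[0])
print(tokens)

# step 2: building the vocabulary (terms mapped to feature indices,
# with indices assigned in alphabetical order of the terms)
toy_vect.fit(toy_docs)
print(toy_vect.vocabulary_)
```

The feature indices follow the alphabetical order of the terms, which is why the order of `vocabulary_.items()` (order of first appearance) differs from the order of the feature names.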

In [79]:
# fit countvectorizer to tokenize the corpus and build the corpus's vocabulary 
vect.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [80]:
# see the head of our vocabulary (terms and indices) 
print(list(vect.vocabulary_.items())[0:30])

[('ok', 552), ('but', 116), ('think', 792), ('the', 782), ('keirsey', 436), ('temperment', 774), ('test', 777), ('is', 417), ('more', 508), ('accurate', 16), ('and', 47), ('cheaper', 135), ('this', 794), ('book', 108), ('has', 352), ('its', 425), ('good', 335), ('points', 593), ('if', 386), ('anything', 54), ('it', 424), ('helps', 361), ('you', 899), ('put', 623), ('into', 410), ('words', 884), ('what', 862), ('want', 847), ('from', 318), ('supervisor', 753)]


In [81]:
# see length of the vocabulary
print(len(vect.vocabulary_))

904


To create the BoW representation for the corpus, call the transform method, which creates a document-term matrix.
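
To make the shape of a document-term matrix concrete, here is a small sketch on a made-up corpus (`toy_docs`, `toy_bow`, and `toy_table` are hypothetical names). Each row is a document, each column is a vocabulary term, and each cell is the number of times the term occurs in that document. It uses `fit_transform`, which should be equivalent to calling `fit` and then `transform` on the same data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical two-document corpus, used only for illustration
toy_docs = ["I like this book", "This book is a good book"]

toy_vect = CountVectorizer()
toy_bow = toy_vect.fit_transform(toy_docs)  # sparse document-term matrix

# order the terms by their feature index so they line up with the columns
terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

# dense view with labeled columns (only sensible for small corpora)
toy_table = pd.DataFrame(toy_bow.toarray(), columns=terms)
print(toy_table)
```

In the second row the `book` column holds 2, because "book" appears twice in the second document.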

In [82]:
# create BoW representation for the corpus
bow = vect.transform(corpus)

In [83]:
# see dimension of the BoW matrix
print(bow.shape)

(10, 904)


In [84]:
# see review vectors in BoW representation
print(bow.toarray())

[[1 0 0 ... 3 0 0]
 [0 1 1 ... 0 0 0]
 [0 0 0 ... 0 0 1]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 0 0 0]]


In [85]:
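# see feature names
# (get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
#  newer versions provide get_feature_names_out() instead)
feature_names = vect.get_feature_names()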
print(feature_names[0:30])

['10', '1000', '14th', '18th', '1953', '1955', '1960', '1970', '60', 'abandon', 'able', 'about', 'academic', 'accomplished', 'according', 'account', 'accurate', 'act', 'actively', 'actually', 'ade', 'admit', 'adolescent', 'adulation', 'adventure', 'advocate', 'advocating', 'after', 'against', 'age']


In [86]:
# see number of features
number_of_features = len(feature_names)
print(number_of_features)

904


#### TF-IDF Representation

TF-IDF representation is implemented in TfidfVectorizer.


In [87]:
# import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(stop_words = 'english') #excluding stop words

Fitting the TfidfVectorizer does the following: 
1. tokenizing
2. building the vocabulary
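3. learning the IDF weight of each term from the corpus (the tf-idf weighting itself is applied to documents by the transform method).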
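
The weighting can be reproduced by hand on a toy corpus. The sketch below assumes the default settings printed in the next cell (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scales each document vector to unit length. The corpus and variable names are made up for illustration, and no stop-word list is used here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# hypothetical three-document corpus, used only for illustration
toy_docs = ["good book", "bad book", "good good plot"]

# raw term counts (the tf part)
counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)             # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf

weighted = counts * idf
# l2-normalize each row so every document vector has length 1
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, from_sklearn))
```

Terms that occur in fewer documents get a larger idf, so they contribute more to a document's vector than terms that occur almost everywhere.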

In [88]:
# fit the TF-IDF Vectorizer to corpus
vect.fit(corpus)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [90]:
# see the head part of the vocabulary
print(list(vect.vocabulary_.items())[0:30])

[('ok', 462), ('think', 667), ('keirsey', 367), ('temperment', 658), ('test', 661), ('accurate', 15), ('cheaper', 101), ('book', 79), ('good', 284), ('points', 490), ('helps', 306), ('words', 734), ('want', 710), ('supervisor', 637), ('online', 464), ('does', 181), ('account', 14), ('difference', 169), ('options', 467), ('exactly', 215), ('like', 392), ('don', 187), ('messes', 422), ('results', 562), ('did', 167), ('just', 365), ('denial', 157), ('taken', 649), ('lot', 402), ('personality', 483)]


In [91]:
# see the total number of terms in the vocabulary (excluding stop words)
print(len(vect.vocabulary_))

749


To create the TF-IDF representation for the corpus, call the transform method, which creates a document-term matrix.

In [94]:
# transform corpus and create TF-IDF representation
tfidf = vect.transform(corpus)

In [96]:
# see dimension of the tfidf matrix
print(tfidf.shape)

(10, 749)


In [97]:
# see review vectors in TF-IDF representation
print(tfidf.toarray())

[[0.09036659 0.         0.         ... 0.         0.         0.        ]
 [0.         0.03673022 0.03673022 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.11364664 0.         0.11364664]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.08401684 0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


#### Unigram and Bigram Parameters
Use unigram and bigram parameters and observe their effect on the number of features.

In [131]:
# use only unigram
vect1 = CountVectorizer(ngram_range=(1, 1))
vect1.fit_transform(corpus)
# use only bigram 
vect2 = CountVectorizer(ngram_range=(2, 2))
vect2.fit_transform(corpus)
# use both unigram and bigram
vect1_2 = CountVectorizer(ngram_range=(1, 2))
vect1_2.fit_transform(corpus)
# use unigram, bigram and trigram 
vect1_3 = CountVectorizer(ngram_range=(1, 3))
vect1_3.fit_transform(corpus)

<10x4890 sparse matrix of type '<class 'numpy.int64'>'
	with 5360 stored elements in Compressed Sparse Row format>

In [132]:
# see number of features of using different ngram parameters
print("use only unigram:",len(vect1.get_feature_names()))
print("use only bigram:",len(vect2.get_feature_names()))
print("use both unigram and bigram:",len(vect1_2.get_feature_names()))
print("use unigram, bigram and trigram:",len(vect1_3.get_feature_names()))

use only unigram: 904
use only bigram: 1916
use both unigram and bigram: 2820
use unigram, bigram and trigram: 4890
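
Note that the unigram and bigram counts add up to the combined count (904 + 1916 = 2820): `ngram_range=(1, 2)` keeps every unigram and every bigram as a separate feature. A small sketch on a made-up corpus shows the same pattern (`toy_docs` and `ngram_terms` are hypothetical names used only here):

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical two-document corpus, used only for illustration
toy_docs = ["the book was good", "the plot was bad"]

def ngram_terms(ngram_range):
    """Return the sorted vocabulary terms for a given ngram_range."""
    vect = CountVectorizer(ngram_range=ngram_range)
    vect.fit(toy_docs)
    return sorted(vect.vocabulary_)

unigrams = ngram_terms((1, 1))
bigrams = ngram_terms((2, 2))
both = ngram_terms((1, 2))

print(unigrams)
print(bigrams)
print(len(unigrams), len(bigrams), len(both))
```

Bigrams are formed from adjacent tokens within a single document, so the feature space grows quickly as the n-gram range widens, while each individual feature becomes rarer.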


### Part II Speech-to-Text Cognitive Services Features
Explore speech-to-text services from four vendors (Microsoft, Amazon, Google, IBM) and list the key features of each service.

#### Microsoft Azure Speech-to-Text API
1. Create custom language models tailored to users’ speaking styles
   -- Customize the language model of your app’s speech recognition by tailoring it to your industry expressions; technical, geographic, or market terms; and even speaker style.

2. Adapt to user environment with custom acoustic models
   -- Make sure your app’s speech recognition can function in all environments, account for background noise, and match your users’ expected environments.

3. Use robust speech models from Microsoft
   -- Enable powerful, personalized speech recognition by building your own customized speech recognition models on top of Microsoft’s existing state-of-the-art models.
  
   Retrieved From: https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/


------
#### AWS Amazon Transcribe
1. Simple-to-Use API
   -- No complicated programming is required. Just call the API with a few lines of code, and Amazon Transcribe will return the text from your audio file stored in Amazon S3.
2. Support for a Wide Range of Use Cases and Audio Quality
   -- Provide accurate, automated transcripts for a wide range of audio quality. You can generate subtitles for any video or audio file, and even transcribe low-quality telephony recordings such as customer service calls.
3. Easy-to-Read Transcriptions with Punctuation
   -- Most speech recognition systems output a string of text without punctuation. Amazon Transcribe uses deep learning to add punctuation and formatting automatically, so that the output is more intelligible and can be used without any further editing.
4. Custom Vocabulary
   -- Gives you the ability to expand and customize the speech recognition vocabulary. You can add new words to the base vocabulary and generate highly accurate transcriptions specific to your use case, such as product names, domain-specific terminology, or names of individuals.
5. Timestamp Generation
   -- Return a timestamp for each word, so that you can easily locate the audio in the original recording by searching for the text.
6. Recognize Multiple Speakers
   -- Recognize when the speaker changes and attribute the transcribed text appropriately. This can significantly reduce the amount of work needed to transcribe audio with multiple speakers, such as telephone calls, meetings, and television shows.

   Retrieved From: https://aws.amazon.com/transcribe/



------
#### Google Cloud Speech-to-Text | Speech Recognition
1. Automatic Speech Recognition
   -- Uses deep-learning neural networks to power applications such as voice search or speech transcription.
2. Global Vocabulary in 120 Languages!
   -- Recognizes 120 languages and variants with an extensive vocabulary.
3. Word Hints
   -- Can be customized to a specific context by providing a set of words and phrases that are likely to be spoken. Especially useful for adding custom words and names to the vocabulary and in voice-control use cases.
4. Real-time Streaming or Pre-recorded Audio Support
   -- Audio input can be streamed from an application’s microphone or sent from a pre-recorded audio file (inline or through Google Cloud Storage). Multiple audio encodings are supported, including FLAC, AMR, PCMU and Linear-16.
5. Noise Robustness
   -- Handles noisy audio from many environments without requiring additional noise cancellation.
6. Inappropriate Content Filtering
   -- Filter inappropriate content in text results for some languages.
7. Automatic Punctuation
   -- Accurately punctuates transcriptions (e.g., commas, question marks, and periods) with machine learning.
8. Model Selection
   -- Choose from a selection of four pre-built models: default, voice commands and search, phone calls, and video transcription.

   Retrieved From: https://cloud.google.com/speech-to-text/


----
#### IBM Watson Speech-to-Text
1. Powerful real-time speech recognition
   -- Automatically transcribe audio from 7 languages in real time. Rapidly identify and transcribe what is being discussed, even from lower-quality audio, across a variety of audio formats and programming interfaces (HTTP REST, WebSocket, asynchronous HTTP).
2. Highly accurate speech engine
   -- Customize your model to improve accuracy for the language and content you care most about, such as product names, sensitive subjects, or names of individuals. Recognize different speakers in your audio. Spot specified keywords in real time with high accuracy and confidence.
3. Built to support various use cases
   -- Transcribe audio for various use cases, ranging from real-time transcription of audio from a microphone to analyzing thousands of audio recordings from your call center to provide meaningful analytics.
   
   Retrieved From: https://www.ibm.com/watson/services/speech-to-text/

