In [1]:
%load_ext autoreload
%autoreload 2

# Character Embedding

In 2013, Tomas Mikolov introduced word embeddings that learn better-quality word representations. At that time, word2vec was the state of the art for dealing with text. Later on, doc2vec was introduced as well. What if we look at the problem from another angle? Instead of aggregating from words to documents, is it possible to aggregate from characters to words?

In this article, you will go through the what, why, when and how of character embedding.

**What?**

Text Understanding from Scratch introduced the character-level CNN (Char CNN). The Char CNN paper defines an alphabet of 70 characters: 26 English letters, 10 digits, 33 special characters and the newline character.

Separately, the Google Brain team published Exploring the Limits of Language Modeling and released the lm_1b model, which includes 256 character vectors (covering the 52 upper- and lower-case letters and special characters), each with a dimension of just 16. By comparison, a word embedding can have up to 300 dimensions, and its number of vectors is huge.
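
To make the size difference concrete, here is a small arithmetic sketch. The character table uses the 256 x 16 figures quoted above. The 100,000-word vocabulary is an assumed figure chosen only for illustration, and `embedding_parameters` is a hypothetical helper.

```python
def embedding_parameters(num_vectors: int, dim: int) -> int:
    """Number of weights in an embedding table of shape (num_vectors, dim)."""
    return num_vectors * dim

# Character table with the sizes quoted above for lm_1b: 256 vectors x 16 dims.
char_params = embedding_parameters(256, 16)

# Hypothetical word table: the 100,000-word vocabulary is an assumed figure,
# 300 is the word embedding dimension mentioned above.
word_params = embedding_parameters(100_000, 300)

print(char_params)                 # 4096
print(word_params)                 # 30000000
print(word_params // char_params)  # 7324
```

Under these assumptions, the word table is several thousand times larger than the character table.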

**Why?**

In English, all words are formed from 26 letters (or 52 if counting both upper and lower case, or even more if including special characters). With character embedding, a vector can be formed for every single word, even an out-of-vocabulary (OOV) word. On the other hand, word2vec embedding can only handle words that were seen during training.

Another benefit is that it is a good fit for misspelled words, emoticons and new words (e.g. in 2018, the Oxford English Dictionary added the new word boba tea 波霸奶茶; before that, no pre-trained word embedding existed for it).
It also handles infrequent words better than word2vec embedding, as the latter lacks enough training opportunities for rare words.
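
The out-of-vocabulary point can be illustrated with a toy sketch. The three-word vocabulary, the lowercase-only character table and both helper functions are assumptions made up for this example. They are not taken from word2vec or from the notebook's `CharCNN` class.

```python
# Toy word vocabulary: an assumption for illustration, not a real model.
word_index = {"fed": 0, "stocks": 1, "fall": 2}

# Lowercase letters only, with index 0 reserved for unknown characters.
char_index = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

def encode_word_level(word):
    """Return the word's index, or None when the word was never seen."""
    return word_index.get(word)

def encode_char_level(word):
    """Return one index per character; unknown characters map to 0."""
    return [char_index.get(ch, 0) for ch in word]

print(encode_word_level("stocks"))  # 1
print(encode_word_level("stokcs"))  # None: the misspelling is out of vocabulary
print(encode_char_level("stokcs"))  # [19, 20, 15, 11, 3, 19]
```

The misspelled word has no word-level entry, but every one of its characters still maps to a known index.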

The third reason is that, with only a small number of vectors, it reduces model complexity and improves performance (in terms of speed).

**When?**

In NLP, we can apply character embedding on:
1. [Text Classification](https://arxiv.org/pdf/1509.01626.pdf)
2. [Language Model](https://arxiv.org/pdf/1602.02410.pdf)
3. [Named Entity Recognition](https://www.aclweb.org/anthology/Q16-1026)

**The Dataset**

References to news pages collected from a web aggregator in the period from 10-March-2014 to 10-August-2014. The resources are grouped into clusters that represent pages discussing the same story.
http://archive.ics.uci.edu/ml/datasets/News+Aggregator


Original notebook https://github.com/makcedward/nlp/blob/master/sample/nlp-character_embedding.ipynb

**Function flow**

* preprocess()
    * build_char_dictionary()
    * convert_labels()

* process()
    * transform_raw_data()
        * sent_tokenize
    * transform_training_data()

* build_model()
    * build_sentence_block()
        * Input
        * Embedding
        * _build_character_block
        * concatenate
        * Dropout
        * Model
    * build_document_block()
        * Input - input
        * TimeDistributed - output
        * Bidirectional - LSTM
        * Dropout
        * Dense
        * Dropout
        * Dense
        * Model
        * compile
        

In [2]:
import re
import pandas as pd
import nltk

from sklearn.model_selection import train_test_split

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\argenisleon\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Data Ingestion

In [4]:
def get_raw_data():
    # Get file from http://archive.ics.uci.edu/ml/datasets/News+Aggregator
    df = pd.read_csv(
        "data/newsCorpora.csv", 
        sep='\t', 
        names=['id', 'headline', 'url', 'publisher', 'category', 'story', 'hostname', 'timestamp']
    )
    
    # Category: b = business, t = science and technology, e = entertainment, m = health
    return df[['category', 'headline']]

In [5]:
df = get_raw_data()
df.head()

Unnamed: 0,category,headline
0,b,"Fed official says weak data caused by weather,..."
1,b,Fed's Charles Plosser sees high bar for change...
2,b,US open: Stocks fall after Fed official hints ...
3,b,"Fed risks falling 'behind the curve', Charles ..."
4,b,Fed's Plosser: Nasty Weather Has Curbed Job Gr...


In [6]:
print('Number of news: %d' % (len(df)))

Number of news: 422419


In [7]:
train_df, test_df = train_test_split(df, test_size=0.2)

# Modeling

**_Character Embedding_**
1. Define a list of characters (of size m). For example, use alphanumeric and some special characters. In this example: English letters (52), digits (10), special characters (20) and one unknown character, UNK, for a total of 83 characters.
2. Convert the characters to one-hot encoding to get a sequence of vectors. For unknown characters and blank characters, use an all-zero vector. Characters beyond the pre-defined maximum length (l) are ignored. The output is a 16-dimensional vector for every single character.
3. Use three 1D CNN layers (configurable) to learn the sequence.
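
Steps 1 and 2 above can be sketched in plain Python. This is a minimal sketch, not the code of the `CharCNN` class: the 20 special characters in `SPECIAL_CHARS` are an assumed selection, `encode_sentence` is a hypothetical helper, and the indices are assigned in a fixed order. Integer indices stand in for one-hot vectors here, since an embedding layer looks rows up by index. Zeros are padded on the left, which matches the encoded headline printed later in this notebook.

```python
import string

# 20 special characters: an assumed selection, the original set is not listed.
SPECIAL_CHARS = " .,;:!?'\"-()/&%$@#+="

def build_char_dictionary():
    """Map 52 letters, 10 digits, 20 special characters and UNK to indices."""
    chars = ["UNK"] + list(string.ascii_letters + string.digits + SPECIAL_CHARS)
    char_indices = {ch: i for i, ch in enumerate(chars)}
    indices_char = {i: ch for ch, i in char_indices.items()}
    return char_indices, indices_char

def encode_sentence(sentence, char_indices, max_len):
    """Encode a sentence as max_len indices, truncated and left-padded with 0."""
    unk = char_indices["UNK"]
    ids = [char_indices.get(ch, unk) for ch in sentence[:max_len]]
    return [0] * (max_len - len(ids)) + ids

char_indices, indices_char = build_char_dictionary()
print(len(char_indices))                         # 83
print(encode_sentence("Fed", char_indices, 6))   # [0, 0, 0, 32, 5, 4]
```

In this sketch, index 0 serves both as the UNK index and as the padding value, in line with step 2, where unknown and blank characters share the same all-zero vector.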

**_Sentence Embedding_**
1. A bi-directional LSTM follows the CNN layers
2. Some dropout layers are added after the LSTM

### Function flow

* preprocess
    * build_char_dictionary
    * convert_labels
* process
    * _transform_raw_data
    * _transform_training_data

In [9]:
from charcnn import CharCNN

Using TensorFlow backend.


In [10]:
# The maximum number of characters per sentence is 256.
# The maximum number of sentences is 5.

char_cnn = CharCNN(max_len_of_sentence=256, max_num_of_setnence=5, verbose=6)

# First, prepare the meta information: build the character dictionary
# and convert the labels from text to numeric (Keras supports numeric input only).

char_cnn.preprocess(labels=df['category'].unique())

-----> Stage: preprocess
5
['b' 't' 'e' 'm']
Totoal number of chars: 83
First 3 char_indices sample: {'UNK': 0, '%': 1, 'b': 2}
First 3 indices_char sample: {0: 'UNK', 1: '%', 2: 'b'}
Label to Index:  {'b': 0, 't': 1, 'e': 2, 'm': 3}
Index to Label:  {0: 'b', 1: 't', 2: 'e', 3: 'm'}
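
The label mappings printed above can be reproduced with a short sketch. This `convert_labels` is a hypothetical stand-alone function written for illustration. It is assumed to behave like the method of the same name in the function flow, but it is not copied from it.

```python
def convert_labels(labels):
    """Assign each label an integer index in order of first appearance."""
    label_to_index = {}
    for label in labels:
        if label not in label_to_index:
            label_to_index[label] = len(label_to_index)
    index_to_label = {i: label for label, i in label_to_index.items()}
    return label_to_index, index_to_label

label_to_index, index_to_label = convert_labels(['b', 't', 'e', 'm'])
print(label_to_index)  # {'b': 0, 't': 1, 'e': 2, 'm': 3}
print(index_to_label)  # {0: 'b', 1: 't', 2: 'e', 3: 'm'}
```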


In [14]:
print(train_df)

       category                                           headline
224652        m        Dramatic increase in MERS cases since March
325223        e  In Talks: Frank Darabont to Helm 'The Huntsman...
27519         e  Gisele Bundchen and Tom Brady selling estate i...
84787         t  Facebook could be working on a Secret-inspired...
43298         e  CinemaCon: The New Trailer for Hercules, Starr...
128382        b  Chipotle raising prices as steak, avocados, ch...
266060        m   Eva Longoria's Las Vegas Strip steakhouse closes
392455        t  Best new mobile apps for iOS and Android: Top ...
417912        b  US Index Futures Little Changed After S&P 500 ...
26494         e  President Obama to promote Obamacare on 'Ellen...
374287        b         Fed plans to end taper in October: minutes
285713        b            Pay raises depend on industry you're in
142030        e  Coachella goes corporate, as marketers can't r...
181404        t  White House top climate threats: California a

In [36]:
type(train_df)
#train_df.iloc[2]
train_df.sample(3)


Unnamed: 0,category,headline
173373,b,"Alstom Studies GE Offer, Leaves Door Open for ..."
84543,t,Google preparing Android TV launch? Leaked det...
238202,b,Temporary Fee On Big Businesses Funds Obamacare


In [49]:
df_sample = train_df.sample(10)

x_train, y_train = char_cnn.process(
    df=df_sample, x_col='headline', y_col='category')

-----> Stage: process
Number of news: 10
Actual max sentence: 1
{'b': 0, 't': 1, 'e': 2, 'm': 3}
Shape:  (10, 5, 256) (10,)
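
The `(10, 5, 256)` shape above means each headline becomes a matrix of 5 sentence rows with 256 character slots each. The sketch below shows one way such a matrix could be built. It rests on several assumptions: `split_sentences` is a naive regex stand-in for NLTK's `sent_tokenize`, `toy_encode` is a placeholder for a real character dictionary lookup, and `to_document_matrix` is a hypothetical helper, not the `process` method itself.

```python
import re

def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def toy_encode(ch):
    """Placeholder character encoder returning a value from 1 to 82."""
    return (ord(ch) % 82) + 1

def to_document_matrix(text, encode_char, max_sentences=5, max_len=256):
    """Return max_sentences rows of max_len indices, left-padded with zeros."""
    matrix = [[0] * max_len for _ in range(max_sentences)]
    for row, sentence in enumerate(split_sentences(text)[:max_sentences]):
        ids = [encode_char(ch) for ch in sentence[:max_len]]
        # Write the indices at the end of the row, leaving zeros in front.
        matrix[row][max_len - len(ids):] = ids
    return matrix

matrix = to_document_matrix("Fed plans to end taper. Stocks fall!", toy_encode)
print(len(matrix), len(matrix[0]))  # 5 256
```

Rows for missing sentences stay all zero, which is why a single-sentence headline produces one populated row followed by four zero rows.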


In [51]:
df_sample

Unnamed: 0,category,headline
183622,m,International public health emergency: Polio s...
353037,b,How Will PACCAR (PCAR) Stock Respond To Volksw...
409844,b,India's Factory Output Reaches 17-Month High: ...
359972,t,U.S. regulators should just ban premium SMS pr...
316913,b,Sector Update: Energy Stocks Rising; Pioneer N...
159827,t,Facebook Buys Fitness App to Track Your 'Moves'
286376,t,Destiny's extended E3 trailer does a good job ...
152035,b,TheStreet Downgrades Coca-Cola Bottling Co. Co...
291930,e,'Game of Thrones' Season 4 Finale Made More Sh...
217441,t,First Camelopardalid meteor shower expected ne...


In [68]:

print(len(x_train))
print(len(y_train))

print(x_train[0])

# y_train could also be produced with a string indexer function, which converts every possible label to an integer representation.
print(y_train)

10
10
[[ 0  0  0 ... 47 39 33]
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]]
[3 0 0 1 0 1 1 0 2 1]


In [38]:
# Transform the raw training and testing data into NumPy format for Keras input

x_train, y_train = char_cnn.process(
    df=train_df, x_col='headline', y_col='category')


-----> Stage: process
Number of news: 337935
Actual max sentence: 15
{'b': 0, 't': 1, 'e': 2, 'm': 3}
Shape:  (337935, 5, 256) (337935,)


In [18]:
#print(train_df)
print(x_train[0][0])
#print(y_train)

[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 50 42 41
 17 35  5 37 42 39 53  7  5  7 37  7 17 42  5 58 14  6 35  5 39 53  5 37
  7 17 37 41 42 39 53  5 42 53 47 17 25 17 41 79]


In [None]:
x_test, y_test = char_cnn.process(
    df=test_df, x_col='headline', y_col='category')

In [22]:
char_cnn.build_model()
char_cnn.train(x_train, y_train, x_test, y_test, batch_size=2048, epochs=10)

char_cnn.get_model().save('./char_cnn_model.h5')

-----> Stage: build model
-----> Stage: train model
Train on 337935 samples, validate on 84484 samples
Epoch 1/10
 12288/337935 [>.............................] - ETA: 26:44:09 - loss: 1.3712 - acc: 0.3313

KeyboardInterrupt: 

In [None]:
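# load_model was never imported in this notebook, so import it before use
from keras.models import load_model

char_cnn_model_loaded = load_model('./char_cnn_model.h5')
char_cnn_model_loaded.predict(x_test)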


# Conclusion
Character embedding is a brilliant design for solving many text classification problems. It resolves some of the limitations of word embedding. FAIR went a step further and introduced subword embedding to build fastText.

One criticism of character embedding is that it does not capture any word meaning, since it uses only characters. We may combine character embedding and word embedding to solve our NLP problem.