Candidate: Ivomar Brito Soares

Email: ivomarbsoares@gmail.com

## Summary

<ul>
    <li>Importing libraries</li>
    <li>Utility methods</li>
    <li>Reading data set</li>
    <li>Preprocessing</li>
    <li>Feature Extraction: Term Frequency - Inverse Document Frequency (TF-IDF)</li>
    <li>Preparing categorical target variable</li>
    <li>Training deep learning model</li>
    <li>Model Evaluation</li>
    <li>Saving model to file</li>
</ul>

## Importing libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report

# Data pre-processing modules
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from textblob import Word
from sklearn import preprocessing

# TFIDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Deep Learning modules
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.utils import np_utils

[nltk_data] Downloading package stopwords to /home/ivomar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ivomar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Using TensorFlow backend.


## Utility methods

In [3]:
def basic_preprocessing(dataset, feature_name):
    """
    Cleans the text column `feature_name` of `dataset` in place. Steps:
    - Convert text to lower case.
    - Punctuation removal.
    - Stop words removal.
    - Lemmatization.

    Additional possible pre-processing steps (future work):
    - Common words removal.
    - Rare words removal.
    - Spelling correction.
    - Keeping words of length of at least 3.
    """
    # Convert all text to lower case so the same word in different cases is not counted as several words.
    dataset[feature_name] = dataset[feature_name].apply(lambda text: " ".join(text.lower().split()))

    # Punctuation rarely adds information in text data; removing it reduces the size of the training data.
    # regex=True is passed explicitly because newer pandas versions treat the pattern as a literal by default.
    dataset[feature_name] = dataset[feature_name].str.replace(r'[^\w\s]', '', regex=True)

    # Remove stop words (frequently occurring words); a set makes the membership test fast.
    stop = set(stopwords.words('english'))
    dataset[feature_name] = dataset[feature_name].apply(
        lambda text: " ".join(word for word in text.split() if word not in stop))

    # Lemmatization: converts each word into its root word.
    dataset[feature_name] = dataset[feature_name].apply(
        lambda text: " ".join(Word(word).lemmatize() for word in text.split()))


def prepare_targets(y_train):
    """
    Converts non-numerical categorical labels to integer class indices.
    """
    le = preprocessing.LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    return y_train_enc
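
As a rough illustration of what the encoding in `prepare_targets` produces, the sketch below mimics it in plain Python: the distinct labels are sorted and each label is replaced by its position in that sorted list, which is how `LabelEncoder` assigns integers. The helper name `encode_labels` and the label values are made up for this example.

```python
def encode_labels(labels):
    """Map each label to its position among the sorted unique labels."""
    classes = sorted(set(labels))
    index_of = {label: i for i, label in enumerate(classes)}
    return [index_of[label] for label in labels], classes


# Hypothetical label values, for illustration only.
labels = ["US", "France", "Italy", "France", "US"]
encoded, classes = encode_labels(labels)
print(encoded)   # integer class index per row
print(classes)   # index -> original label
```

Keeping `classes` around (the counterpart of `le.classes_`) is what allows predicted indices to be translated back into label names later; `prepare_targets` as written returns only the encoded values.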

## Reading data set

In [4]:
dataset = pd.read_csv('train_data.csv')

## Preprocessing

In [5]:
# Drop rows whose target label is missing
dataset.dropna(subset=['categorical_target_1'], inplace=True)

basic_preprocessing(dataset, 'features')
dataset['features']

0         today past read title wine rich full aged lee ...
1         crisp dry searing acidity 100 varietal wine co...
2         light lovely 2 residual sugar taste drier arre...
3         borras blend 80 petite sirah 10 syrah 10 mourv...
4         spirit south africa swartland region shine rus...
                                ...                        
103971    textured full wine ripe character full fragran...
103972    funk nose soon blow reveal generously ripe fru...
103973    flinty lemon caramel flouted around rich layer...
103974    exuberantly fragrant ripe tropical fruit flora...
103975    wine brings fruitiness gamay along extra perfu...
Name: features, Length: 103927, dtype: object
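
To see how the cleaning steps turn a raw review into output like the rows above, here is a minimal sketch on a single string. It uses only the standard library, a tiny made-up stop list instead of NLTK's English list, and skips lemmatization, so it approximates `basic_preprocessing` rather than reproducing it.

```python
import re

# A tiny hypothetical stop list; the notebook uses NLTK's full English list.
STOP_WORDS = {"this", "is", "a", "and", "the", "with"}


def clean(text):
    """Lower-case, strip punctuation, and drop stop words from one string."""
    text = " ".join(text.lower().split())
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


cleaned = clean("This is a crisp, dry wine with searing acidity!")
print(cleaned)
```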

## Feature Extraction: Term Frequency - Inverse Document Frequency (TF-IDF)

<ul>
    <li>Term frequency (TF) is the number of times a term appears in a row divided by the number of terms in that row: TF = (number of times term T appears in the row) / (number of terms in the row).</li>
    <li>The intuition behind inverse document frequency (IDF) is that a word is of little use if it appears in every document. The IDF of a word is therefore the log of the ratio of the total number of rows to the number of rows that contain the word: IDF = log(N/n), where N is the total number of rows and n is the number of rows in which the word is present.</li>
    <li>TF-IDF is the product of the TF and IDF defined above. Note that scikit-learn's TfidfVectorizer, used below, applies a variant of these formulas by default: raw counts as TF, a smoothed IDF of ln((1 + N) / (1 + n)) + 1, and, with norm='l2', each row scaled to unit length.</li>
</ul>
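
The textbook formulas in the list above can be worked through on a toy corpus. This is a minimal sketch of those definitions only, not of the values scikit-learn would return; the function names and the three rows are invented for the example, and `idf` assumes the term occurs in at least one row.

```python
import math


def tf(term, row):
    """Term frequency: occurrences of the term divided by the number of terms in the row."""
    tokens = row.split()
    return tokens.count(term) / len(tokens)


def idf(term, rows):
    """Inverse document frequency: log(N / n), with n the number of rows containing the term."""
    n = sum(1 for row in rows if term in row.split())
    return math.log(len(rows) / n)


# Made-up rows, for illustration only.
rows = [
    "ripe wine ripe finish",
    "dry wine crisp finish",
    "ripe tropical fruit wine",
]

tf_ripe = tf("ripe", rows[0])    # 2 of the 4 terms in the first row
idf_ripe = idf("ripe", rows)     # "ripe" appears in 2 of 3 rows
idf_wine = idf("wine", rows)     # "wine" appears in every row

print(tf_ripe, idf_ripe, tf_ripe * idf_ripe)
print(idf_wine)
```

A word that occurs in every row, such as "wine" here, gets an IDF of log(1) = 0 and therefore contributes nothing under this definition.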

In [6]:
tfidf_vectorizer = TfidfVectorizer(
    min_df=2,
    ngram_range=(1, 2),
    stop_words='english',
    max_features=10000,
    strip_accents='unicode',
    norm='l2',
)

In [7]:
# toarray() returns a plain ndarray; todense() would return the discouraged np.matrix type.
X_train = tfidf_vectorizer.fit_transform(dataset['features']).toarray()
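
Converting the sparse TF-IDF matrix to a dense one has a memory cost that is worth estimating. The arithmetic below is a back-of-the-envelope sketch that assumes the `X_train` shape printed further down in this notebook and `TfidfVectorizer`'s default `float64` dtype.

```python
rows, cols = 103927, 10000   # shape of X_train as printed in the notebook
bytes_per_value = 8          # float64, the default dtype of TfidfVectorizer

dense_bytes = rows * cols * bytes_per_value
dense_gib = dense_bytes / 1024**3
print(f"{dense_gib:.2f} GiB")
```

If that is too much for the machine at hand, one possible option is to keep the matrix sparse and feed the model in batches, at the cost of a slightly more involved training loop.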

## Preparing categorical target variable

In [8]:
y_train_enc = prepare_targets(dataset['categorical_target_1'])

nb_classes = 43      # Number of unique values (classes) in the chosen target variable, categorical_target_1.

# One-hot encode the integer labels: each row becomes a vector of 43 columns with a 1 in the
# column of its class and a 0 in every other column.
y_train = np_utils.to_categorical(y_train_enc, nb_classes)
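
For readers unfamiliar with one-hot encoding, the following sketch reproduces the idea with NumPy alone. The helper `one_hot` and the small label list are hypothetical; the sketch is meant to show the shape of the result, not how `to_categorical` is implemented.

```python
import numpy as np


def one_hot(class_indices, num_classes):
    """Return a (len(class_indices), num_classes) array with a single 1 per row."""
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[np.asarray(class_indices)]


encoded = [2, 0, 1, 0]           # made-up integer labels
result = one_hot(encoded, 3)
print(result)
```

When the labels cover every index from 0 upward, the number of classes could also be derived from the data, for example as `int(y_train_enc.max()) + 1`, instead of being hard-coded.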

In [9]:
print(dataset.shape)
print(X_train.shape)
print(y_train.shape)

(103927, 6)
(103927, 10000)
(103927, 43)


## Training deep learning model

In [10]:
np.random.seed(42)
batch_size = 64
nb_epochs = 20
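
These two settings determine how many weight updates the training run performs. A quick calculation, assuming the 103927 training rows reported above:

```python
import math

n_samples = 103927   # rows in X_train, as printed above
batch_size = 64
nb_epochs = 20

steps_per_epoch = math.ceil(n_samples / batch_size)               # the last batch is partial
last_batch_size = n_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * nb_epochs

print(steps_per_epoch, last_batch_size, total_updates)
```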

In [11]:
# Deep learning model built in keras
model = Sequential()

model.add(Dense(1000, input_shape=(10000,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(500))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')

In [12]:
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 1000)              10001000  
_________________________________________________________________
activation_1 (Activation)    (None, 1000)              0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 500)               500500    
_________________________________________________________________
activation_2 (Activation)    (None, 500)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 500)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 50)               
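
The parameter counts in the summary follow from the layer sizes: a `Dense` layer has one weight per input-unit pair plus one bias per unit. The sketch below applies that rule to the architecture defined above. The first two results match the rows shown in the summary; the last two are computed by the same rule rather than copied from the output.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input width followed by the number of units in each Dense layer.
layer_sizes = [10000, 1000, 500, 50, 43]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)
print(total)
```

Activation and Dropout layers have no trainable parameters, which is why they show 0 in the summary.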

In [13]:
# Model Training
model.fit(X_train, y_train, batch_size=batch_size, epochs=nb_epochs, verbose=1)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.callbacks.History at 0x7f2cb09f4c50>

## Model Evaluation

In [14]:
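# Predicted class = index of the largest softmax probability. This is equivalent to
# predict_classes, which newer Keras releases no longer provide.
y_train_predclass = np.argmax(model.predict(X_train, batch_size=batch_size), axis=1)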

In [20]:
print("Deep Neural Network - Train Classification Report")
print(classification_report(y_train_enc, y_train_predclass))

Deep Neural Network - Train Classification Report
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3016
           1       0.00      0.00      0.00         1
           2       1.00      1.00      1.00      1845
           3       1.00      1.00      1.00      2703
           4       0.00      0.00      0.00         2
           5       0.81      0.42      0.56        40
           6       0.99      0.99      0.99       114
           7       1.00      0.99      1.00       199
           8       1.00      1.00      1.00      3587
           9       0.00      0.00      0.00         1
          10       0.74      0.80      0.77        56
          11       1.00      0.50      0.67         8
          12       0.00      0.00      0.00        10
          13       0.00      0.00      0.00         1
          14       0.98      1.00      0.99        59
          15       1.00      1.00      1.00    
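
The f1-score column is the harmonic mean of the precision and recall columns. As a sanity check, the sketch below recomputes it for two rows of the report; the function name `f1_score` is chosen for this example and is not the scikit-learn function. Because the report shows precision and recall rounded to two decimals, recomputing from them can differ in the last digit for some rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; taken as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_class_11 = round(f1_score(1.00, 0.50), 2)   # class 11 in the report
f1_class_10 = round(f1_score(0.74, 0.80), 2)   # class 10 in the report
print(f1_class_11, f1_class_10)
```

The rows with 0.00 in every column belong to classes with very small support (1, 2, or 10 samples) for which no training sample was correctly predicted. Since these figures are computed on the training data, they say little about performance on unseen reviews.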

## Saving model to file

In [16]:
# Serialize model to JSON.
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# Serialize weights to HDF5.
model.save_weights("model.h5")
print("Saved model to disk")

Saved model to disk
