# Skills map searcher
Search for related chapters based on the text entered.

## Data loading

In [2]:
import numpy as np
import tensorflow as tf
from openpyxl import load_workbook

Load the data from an xlsx file, then split it into inputs and labels. Finally, split the long inputs into shorter chunks to generate more training data.

In [32]:
# Load data from the xlsx file
wb = load_workbook('skill_map_data.xlsx')
# print(wb.sheetnames)
ws = wb['raw data - Chapter and Text']

raw_data = []
for row in ws.iter_rows():
    raw_data_row = {
        "week_day" : row[0].value,
        "chapter" : row[1].value,
        "lesson" : row[2].value,
        "section" : row[3].value,
        "text" : row[4].value
        }
    raw_data.append(raw_data_row)

raw_data = raw_data[2:] # remove table name and header
assert(len(raw_data) < 100) # normally we don't have 100+ sections

# Split raw_data into inputs and labels
inputs = [row['text'] for row in raw_data]
assert(len(raw_data) == len(inputs))

# Concatenate week_day, chapter, lesson and section into one label
labels = [' '.join([
            str(row['week_day']),
            str(row['chapter']),
            str(row['lesson']),
            str(row['section'])
        ]) for row in raw_data]

assert(len(raw_data) == len(labels))

# Split long inputs into chunks to generate more training data
seq_len = 200 # chunk length in characters
seq_inputs = []
seq_labels = []
for i, text in enumerate(inputs):
    if len(text) > seq_len:
        # The chunk count is rounded to the nearest integer, so a tail
        # shorter than half of seq_len is dropped
        for j in range(int(len(text)/seq_len + 0.5)):
            seq_inputs.append(text[j*seq_len:(j+1)*seq_len])
            seq_labels.append(labels[i])
    else:
        seq_inputs.append(text)
        seq_labels.append(labels[i])

len(seq_inputs), len(seq_labels)
# seq_labels[998], seq_inputs[998]

(1093, 1093)
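The loop above rounds the number of chunks, so the end of a text can be dropped when the remainder is shorter than half of `seq_len`. As a minimal sketch of an alternative that keeps every character, a helper could step through the text in strides of `seq_len`. The name `split_text` is hypothetical and not part of the notebook.

```python
def split_text(text, seq_len=200):
    """Split text into chunks of at most seq_len characters, keeping the tail."""
    if len(text) <= seq_len:
        return [text]
    # Step through the text in strides of seq_len; the last chunk may be shorter
    return [text[start:start + seq_len] for start in range(0, len(text), seq_len)]


# A 450-character text yields two full chunks and one 50-character tail
chunks = split_text("a" * 450, seq_len=200)
print([len(chunk) for chunk in chunks])  # [200, 200, 50]
```

Using this helper instead of the rounding loop would change the number of generated sequences, so the counts shown above would no longer match.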

In [4]:
inputs[:100]

["Recurrent neural networks are able to learn from sequences of data. In this lesson, you'll learn the concepts behind recurrent networks and see how a character-wise recurrent network is implemented in TensorFlow.",
 "One of the coolest deep learning results from last year was the Google Translate update. They've been a leader in machine learning for a while, but implementing a deep neural network for translation brought the service nearly to the level of human translators. With translation, the correct word to use depends on the context, and all the other words in the sentence, and even in the paragraph. Much of the information contained in language is in the sequence of the words. So far, we've been working with what are called feed forward networks. The input is fed into the network and it propagates forward through the hidden layers to the output layer. In feed forward networks, there is no sense of order in the inputs. Here's a simple idea then, let's build order into our network

## Data preprocessing

In [5]:
# Join all texts into one string and split it into words.
# Punctuation is kept, so tokens such as 'data.' and 'lesson,' stay in the vocabulary.
all_text = ' '.join(inputs)
words = all_text.split()

In [6]:
len(words), len(all_text), len(inputs)

(39974, 218877, 46)

In [7]:
all_text[:2000]

"Recurrent neural networks are able to learn from sequences of data. In this lesson, you'll learn the concepts behind recurrent networks and see how a character-wise recurrent network is implemented in TensorFlow. One of the coolest deep learning results from last year was the Google Translate update. They've been a leader in machine learning for a while, but implementing a deep neural network for translation brought the service nearly to the level of human translators. With translation, the correct word to use depends on the context, and all the other words in the sentence, and even in the paragraph. Much of the information contained in language is in the sequence of the words. So far, we've been working with what are called feed forward networks. The input is fed into the network and it propagates forward through the hidden layers to the output layer. In feed forward networks, there is no sense of order in the inputs. Here's a simple idea then, let's build order into our network. Fir

In [8]:
words[:100]

['Recurrent',
 'neural',
 'networks',
 'are',
 'able',
 'to',
 'learn',
 'from',
 'sequences',
 'of',
 'data.',
 'In',
 'this',
 'lesson,',
 "you'll",
 'learn',
 'the',
 'concepts',
 'behind',
 'recurrent',
 'networks',
 'and',
 'see',
 'how',
 'a',
 'character-wise',
 'recurrent',
 'network',
 'is',
 'implemented',
 'in',
 'TensorFlow.',
 'One',
 'of',
 'the',
 'coolest',
 'deep',
 'learning',
 'results',
 'from',
 'last',
 'year',
 'was',
 'the',
 'Google',
 'Translate',
 'update.',
 "They've",
 'been',
 'a',
 'leader',
 'in',
 'machine',
 'learning',
 'for',
 'a',
 'while,',
 'but',
 'implementing',
 'a',
 'deep',
 'neural',
 'network',
 'for',
 'translation',
 'brought',
 'the',
 'service',
 'nearly',
 'to',
 'the',
 'level',
 'of',
 'human',
 'translators.',
 'With',
 'translation,',
 'the',
 'correct',
 'word',
 'to',
 'use',
 'depends',
 'on',
 'the',
 'context,',
 'and',
 'all',
 'the',
 'other',
 'words',
 'in',
 'the',
 'sentence,',
 'and',
 'even',
 'in',
 'the',
 'paragraph

## Encoding the words


In [9]:
from collections import Counter
counts = Counter(words)
# Most frequent words first; indices start at 1 so that 0 can be used for padding
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

inputs_ints = []
for each in inputs:
    inputs_ints.append([vocab_to_int[word] for word in each.split()])
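To make the encoding step concrete, here is a small sketch of the same idea on two made-up sentences. The `toy_` names and the sentences are illustrative assumptions, not data from the spreadsheet.

```python
from collections import Counter

# Two made-up texts, for illustration only
toy_inputs = ["the cat sat", "the dog sat down"]
toy_words = " ".join(toy_inputs).split()

# Count word frequencies and order the vocabulary from most to least frequent
toy_counts = Counter(toy_words)
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)

# Start numbering at 1 so that 0 stays free for padding
toy_vocab_to_int = {word: i for i, word in enumerate(toy_vocab, 1)}

# Encode each text as a list of integers
toy_ints = [[toy_vocab_to_int[word] for word in text.split()] for text in toy_inputs]
print(toy_ints)  # [[1, 3, 2], [1, 4, 2, 5]]
```

The two most frequent words, "the" and "sat", receive the smallest indices.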

## Encoding the labels


In [10]:
# Each section is its own class, so the encoded label is simply the row index
labels_array = list(range(len(labels)))
labels_np = np.array(labels_array)


In [11]:
labels_np

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45])
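Because each encoded label is just a row index, a predicted class id can be mapped back to its readable label by indexing into the label list. The sketch below assumes hypothetical label strings and a hypothetical helper named `decode_label`; the real labels come from the spreadsheet.

```python
import numpy as np

# Hypothetical labels, for illustration only
toy_labels = [
    "1 Chapter A Lesson 1 Intro",
    "1 Chapter A Lesson 1 Quiz",
    "2 Chapter B Lesson 2 Lab",
]
toy_labels_np = np.arange(len(toy_labels))


def decode_label(class_id, label_names):
    """Map an integer class id back to its readable label."""
    return label_names[int(class_id)]


print(decode_label(toy_labels_np[2], toy_labels))
```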

Now create an array `features` that contains the data we'll pass to the network. The data comes from `inputs_ints`, since we want to feed integers to the network. Each row is 200 elements long. Texts shorter than 200 words are left-padded with 0s. That is, if the text is ['best', 'movie', 'ever'], [117, 18, 128] as integers, the row will look like [0, 0, 0, ..., 0, 117, 18, 128]. Texts longer than 200 words are truncated to their first 200 words.

In [12]:
# Filter out inputs with zero length
inputs_ints = [each for each in inputs_ints if len(each) > 0]

In [13]:
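seq_len = 200 # row length in words
# One row per encoded input, left-padded with 0s and truncated to seq_len words
features = np.zeros((len(inputs_ints), seq_len), dtype=int)
for i, row in enumerate(inputs_ints):
    features[i, -len(row):] = np.array(row)[:seq_len]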

In [14]:
features

array([[   0,    0,    0, ..., 1031,   10, 1484],
       [ 561,    3,    1, ...,    3, 2349, 1611],
       [1178,   12,  564, ...,    3,  518,   26],
       ..., 
       [  84,   11,    7, ...,  341,   30,    7],
       [4158, 3044,    1, ...,  890,  950,   43],
       [ 217, 5199,   90, ...,    1,   99, 1459]])
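The left-padding and truncation rule can be checked on a toy example with a short row length. This is a minimal sketch; the helper name `pad_features` is an assumption, and it skips empty rows instead of filtering them out beforehand.

```python
import numpy as np


def pad_features(int_seqs, seq_len):
    """Left-pad each integer sequence with 0s and truncate it to seq_len."""
    features = np.zeros((len(int_seqs), seq_len), dtype=int)
    for i, row in enumerate(int_seqs):
        if len(row) == 0:
            continue  # an empty sequence stays an all-zero row
        trimmed = row[:seq_len]
        features[i, -len(trimmed):] = trimmed
    return features


# One sequence shorter than seq_len and one longer
toy_features = pad_features([[117, 18, 128], [1, 2, 3, 4, 5, 6, 7]], seq_len=5)
print(toy_features)
# [[  0   0 117  18 128]
#  [  1   2   3   4   5]]
```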

## Training, Validation, Test

In [15]:
split_frac = 0.8
split_idx = int(len(features)*split_frac)

train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = labels_np[:split_idx], labels_np[split_idx:]

test_idx = int(len(val_x)*0.5)
val_x, test_x = val_x[:test_idx], val_x[test_idx:]
val_y, test_y = val_y[:test_idx], val_y[test_idx:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(36, 200) 
Validation set: 	(5, 200) 
Test set: 		(5, 200)
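The same 80/10/10 split can be wrapped in a small function, which makes it easy to check the resulting sizes on placeholder arrays. This is a sketch under assumptions: `split_train_val_test` is a hypothetical name, and the zero-filled array only mimics the shape of `features`.

```python
import numpy as np


def split_train_val_test(x, y, split_frac=0.8):
    """Split x and y sequentially into train, validation and test parts."""
    split_idx = int(len(x) * split_frac)
    train_x, rest_x = x[:split_idx], x[split_idx:]
    train_y, rest_y = y[:split_idx], y[split_idx:]

    # Halve the remainder into validation and test sets
    test_idx = int(len(rest_x) * 0.5)
    val = (rest_x[:test_idx], rest_y[:test_idx])
    test = (rest_x[test_idx:], rest_y[test_idx:])
    return (train_x, train_y), val, test


# Placeholder data with the same shape as the features above
toy_x = np.zeros((46, 200), dtype=int)
toy_y = np.arange(46)
train, val, test = split_train_val_test(toy_x, toy_y)
print(train[0].shape, val[0].shape, test[0].shape)  # (36, 200) (5, 200) (5, 200)
```

Note that the split is sequential, not shuffled, so the rows keep their original order.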
