Dataset loading #14

Closed
danintheory opened this issue May 4, 2015 · 3 comments

@danintheory

I'm trying to train an HMM that classifies each line of an HTML document as belonging to a certain zone or class (e.g. body, header, footer, title). Thus, each sequence is a document and each sample is a line. For each line, I compute a number of floating-point features.

What is the correct input format for this data in order to train a model with seqlearn? I'm having trouble understanding how to format the data from the documentation.

@larsmans
Owner

larsmans commented May 6, 2015

All lines in the entire set of HTML documents would be one big matrix X. Each row in this matrix is a sample (line). All the labels of all the lines are a single target vector y of the same length (len(y) == X.shape[0]).

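The sequence boundaries are passed separately, as an array lengths that contains the length of each sequence (the number of lines in each document), in the same order as the rows of X. The lengths must add up to the number of rows (sum(lengths) == X.shape[0]).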

So, suppose you have a function that computes the features of a single line as a vector (1-d NumPy array):

import numpy as np

def features(line):
    # feature1 and feature2 stand in for your own per-line feature functions
    return np.array([feature1(line), feature2(line)])

... then you should be able to construct the input as follows:

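X, y, lengths = [], [], []

# each training_set item: (document text, list of labels, one per line)
for doc, labels in training_set:
    lines = doc.splitlines()
    lengths.append(len(lines))
    # one row of X and one entry of y per line
    for line, label in zip(lines, labels):
        X.append(features(line))
        y.append(label)

X, y, lengths = map(np.asarray, [X, y, lengths])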
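To make the layout concrete, here is a small sketch with made-up data (the feature values and labels are purely illustrative, not from a real dataset). It checks the invariants described above and shows how lengths can be turned back into per-document slices:

```python
import numpy as np

# Hypothetical toy data: two documents with 3 and 2 lines, two features per line.
X = np.array([[0.1, 3.0],
              [0.4, 1.0],
              [0.9, 0.0],
              [0.2, 2.0],
              [0.7, 5.0]])
y = np.array(["header", "body", "footer", "title", "body"])
lengths = np.array([3, 2])

# One row and one label per line; the lengths account for every row.
assert X.shape[0] == len(y) == lengths.sum()

# Recover the row range of each document from the lengths.
ends = np.cumsum(lengths)
starts = ends - lengths
docs = [(X[s:e], y[s:e]) for s, e in zip(starts, ends)]
```

Here docs[0] holds the three rows and labels of the first document and docs[1] the two rows and labels of the second.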

Does that answer your question?

@danintheory
Author

Thank you, that was very helpful. So the X matrix can be a dense matrix, where each row is the feature vector of one sample and each column is a different feature? And the feature vectors can be float valued with multiple features being nonzero?

I got confused trying to deconstruct the included conll.py example. I wasn't sure if features had to be translated to a list of strings (such as ["feature1:val1", "feature2:val2"]) and then encoded into a sparse matrix using the FeatureHasher from sklearn. Also, from your documentation, I wasn't sure what this line

"Make sure the training set (X) is one-hot encoded; if more than one feature in X is on, the emission probabilities will be multiplied."

meant in terms of having my feature vectors as floats, many of which are nonzero.

Finally, I wasn't sure (from deconstructing conll.py) whether features from the previous and subsequent sample in the sequence need to be included in the current sample (as is shown in the conll.py example).

Thanks again for all your help!

@larsmans
Owner

larsmans commented May 6, 2015

X may be either a dense array or a sparse matrix. It follows scikit-learn conventions.

Re: one-hot encoding, that's because the HMM is meant to deal with categorical data and each feature should represent the identity of an event as a boolean. I think you should be using a StructuredPerceptron if your data is anything different (sorry, hadn't thought about this earlier, I very seldom use HMMs).
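As an illustration of what "one-hot encoded" means for a row of X, the sketch below bins a single float feature and turns the bin index into a row with exactly one column on. The helper one_hot_bins and the cut points are hypothetical, not part of seqlearn, and discretizing is only one possible way to get categorical input; it is not something suggested in this thread.

```python
import numpy as np

def one_hot_bins(values, edges):
    """Map each float to its bin index, then one-hot encode that index."""
    idx = np.digitize(values, edges)           # integers in 0..len(edges)
    return np.eye(len(edges) + 1, dtype=int)[idx]

values = np.array([0.05, 0.5, 0.95])   # one float feature for three lines
edges = [0.33, 0.66]                   # hypothetical cut points, giving three bins
X_onehot = one_hot_bins(values, edges)

# Every row has exactly one column set to 1.
assert (X_onehot.sum(axis=1) == 1).all()
```

With several features, concatenating such blocks side by side would turn on more than one column per row, which is the case the documentation's warning about multiplied emission probabilities refers to.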
