Dataset loading #14
Comments
All lines in the entire set of HTML documents would form one big matrix, and the lengths of the actual sequences need to be given as an array. So, suppose you have a function that computes the features of a single line as a vector (1-d NumPy array):

    import numpy as np

    def features(line):
        return np.array([feature1(line), feature2(line)])

... then you should be able to construct the input as follows:

    X, y, lengths = [], [], []
    for doc, labels in training_set:  # labels: one label per line (assumed)
        lines = doc.splitlines()
        lengths.append(len(lines))
        for line in lines:
            X.append(features(line))  # one row per line
        y.extend(labels)
    X, y, lengths = map(np.asarray, [X, y, lengths])

Does that answer your question?
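To make the role of `lengths` concrete, here is a minimal sketch of how a flat list of samples and a list of sequence lengths map back to per-document slices. The helper name `split_sequences` and the toy numbers are hypothetical and not part of seqlearn; the sketch only illustrates the layout, and the same slicing works on NumPy arrays as well as plain lists.

```python
from itertools import accumulate


def split_sequences(X, lengths):
    """Yield one slice of X per sequence, using lengths as boundaries.

    Hypothetical helper for illustration; not a seqlearn function.
    """
    if sum(lengths) != len(X):
        raise ValueError("sum(lengths) must equal the number of rows in X")
    ends = list(accumulate(lengths))   # cumulative end offset of each sequence
    starts = [0] + ends[:-1]           # each sequence starts where the last ended
    for start, end in zip(starts, ends):
        yield X[start:end]


# Two toy documents: 3 lines and 2 lines, two features per line.
X = [[0.1, 1.0], [0.4, 0.0], [0.9, 1.0],   # document 1
     [0.2, 0.0], [0.7, 1.0]]               # document 2
lengths = [3, 2]

docs = list(split_sequences(X, lengths))
print([len(d) for d in docs])
```

The point is that `X` carries no document boundaries of its own: `lengths` is the only record of where one sequence stops and the next begins, so its entries must sum to the number of rows in `X`.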
Thank you, that was very helpful. So the X matrix can be a dense matrix, where each row is a feature vector and each column is a different sample? And the feature vectors can be float valued with multiple features being nonzero?

I was confused by trying to deconstruct the included conll.py file in the example. I wasn't sure if features had to be translated to a list of strings (such as ["feature1:val1", "feature2:val2"]) and then encoded into a sparse matrix using the FeatureHasher from sklearn.

Also, from your documentation, I wasn't sure what this line "Make sure the training set (X) is one-hot encoded; if more than one feature in X is on, the emission probabilities will be multiplied." meant in terms of having my feature vectors as floats, many of which are nonzero.

Finally, I wasn't sure (from deconstructing conll.py) whether features from the previous and subsequent sample in the sequence need to be included in the current sample (as is shown in the conll.py example).

Thanks again for all your help!
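For readers unsure what "features from the previous and subsequent sample" looks like in practice, the following is a rough sketch of the idea using string features of the "name:value" form mentioned above. It is not the code from conll.py, and the thread does not say whether such context features are required; `line_features`, `with_context`, and the `<start>`/`<end>` markers are hypothetical names chosen for illustration.

```python
def line_features(line):
    """Toy per-line features as "name:value" strings (hypothetical)."""
    stripped = line.strip()
    return [
        "starts_with_tag:%s" % stripped.startswith("<"),
        "is_blank:%s" % (not stripped),
    ]


def with_context(lines, i):
    """Features of line i plus prefixed copies of its neighbours' features."""
    feats = list(line_features(lines[i]))
    if i > 0:
        feats += ["prev:" + f for f in line_features(lines[i - 1])]
    else:
        feats.append("prev:<start>")   # marker for the first line of a document
    if i + 1 < len(lines):
        feats += ["next:" + f for f in line_features(lines[i + 1])]
    else:
        feats.append("next:<end>")     # marker for the last line of a document
    return feats


lines = ["<h1>Title</h1>", "", "Body text"]
context = [with_context(lines, i) for i in range(len(lines))]
for feats in context:
    print(feats)
```

Lists of strings like these are the kind of input that scikit-learn's `FeatureHasher` accepts when constructed with `input_type="string"`, which would yield a sparse matrix with one row per line.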
Re: one-hot encoding, that's because the HMM is meant to deal with categorical data, and each feature should represent the identity of an event as a boolean. I think you should be using a
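As an illustration of what "one-hot" means here, the sketch below turns a float feature into a categorical event by binning it, then encodes the bin as a row with exactly one entry switched on. Binning is an assumption made for this example, not something proposed in the thread, and the thresholds and the helper names `bin_index` and `one_hot` are hypothetical.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Index of the bin holding value; edges are ascending thresholds.

    Values below edges[0] fall in bin 0, values at or above the last
    edge fall in the final bin, so there are len(edges) + 1 bins.
    """
    return bisect_right(edges, value)


def one_hot(index, n_categories):
    """Row of zeros with a single 1 at position index."""
    row = [0] * n_categories
    row[index] = 1
    return row


edges = [0.25, 0.75]         # hypothetical thresholds, giving 3 bins
values = [0.1, 0.5, 0.9]     # one float feature value per line

rows = [one_hot(bin_index(v, edges), len(edges) + 1) for v in values]
print(rows)
```

Each row now names a single event (which bin the value fell into), which matches the boolean, one-feature-on format described above; a vector of several nonzero floats does not.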
I'm trying to train an HMM that classifies lines in an HTML document as belonging to a certain zone or class (e.g. body, header, footer, title, etc.). Thus, each sequence is a document and each sample is a line. On each line, I compute a number of floating-point features.
What is the correct input format for this data in order to train a model with seqlearn? I'm having trouble understanding how to format the data from the documentation.