[GSOC]DatasetMapper & Imputer #694
Merged
47 commits
87c05a5 concept work for imputer (keon)
2e4b1a8 Merge branch 'master' of github.com:keonkim/mlpack into imputer (keon)
631e59e do not to use NaN by default, let the user specify (keon)
391006e Merge branch 'master' of github.com:keonkim/mlpack into imputer (keon)
6a1fb81 add template to datasetinfo and add imputer class (keon)
b0c5224 clean datasetinfo class and rename files (keon)
de35241 implement basic imputation strategies (keon)
2d38604 modify imputer_main and clean logs (keon)
bb045b8 add parameter verification for imputer_main (keon)
1295f4b add custom strategy to impute_main (keon)
5a517c2 add datatype change in IncrementPolicy (keon)
94b7a5c update types used in datasetinfo (keon)
ebed68f initialize imputer with parameters (keon)
db78f39 remove datatype in dataset_info (keon)
7c60b97 Merge branch 'master' of github.com:keonkim/mlpack into imputer (keon)
da4e409 add test for imputer (keon)
d8618ec restructure, add listwise deletion & imputer tests (keon)
3b8ffd0 fix transpose problem (keon)
90a5cd2 Merge pull request #7 from mlpack/master (keon)
32c8a73 merge (keon)
e09d9bc updates and fixes on imputation methods (keon)
87d8d46 update data::load to accept different mappertypes (keon)
de0b2db update data::load to accept different policies (keon)
bc187ca add imputer doc (keon)
a340f69 debug median imputation and listwise deletion (keon)
21d94c0 remove duplicate code in load function (keon)
a92afaa delete load overload (keon)
bace8b2 modify MapToNumerical to work with MissingPolicy (keon)
896a018 MissingPolicy uses NaN instead of numbers (keon)
1a908c2 fix reference issue in DatasetMapper (keon)
2edbc40 Move MapToNumerical(MapTokens) to Policy class (keon)
d881cb7 make policy and imputation api more consistent (keon)
a881831 numerical values can be set as missing values (keon)
63268a3 add comments and use more proper names (keon)
2eb6754 modify custom impute interface and rename variables (keon)
6d43aa3 add input-only overloads to imputation methods (keon)
fedc5e0 update median imputation to exclude missing values (keon)
787fd82 optimize imputation methods with output overloads (keon)
a0b7d59 expressive comments in imputation_test (keon)
9a6dce7 shorten imputation tests (keon)
c3aeba1 optimize preprocess imputer executable (keon)
028c217 fix bugs in imputation test (keon)
03e19a4 add more comments and delete impute_test.csv (keon)
ef4536b Merge pull request #8 from mlpack/master (keon)
6e2c1ff Merge branch 'master' of github.com:keonkim/mlpack into imputer (keon)
5eb9abd fix PARAM statements in imputer (keon)
d043235 delete Impute() overloads that produce output matrix (keon)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,164 @@ | ||
/** | ||
* @file dataset_mapper.hpp | ||
* @author Ryan Curtin | ||
* @author Keon Kim | ||
* | ||
* Defines the DatasetMapper class, which holds information about a dataset. | ||
* This is useful when the dataset contains categorical non-numeric features | ||
* that need to be mapped to categorical numeric features. | ||
*/ | ||
#ifndef MLPACK_CORE_DATA_DATASET_INFO_HPP | ||
#define MLPACK_CORE_DATA_DATASET_INFO_HPP | ||
|
||
#include <mlpack/core.hpp> | ||
#include <unordered_map> | ||
#include <boost/bimap.hpp> | ||
|
||
#include "map_policies/increment_policy.hpp" | ||
|
||
namespace mlpack { | ||
namespace data { | ||
/** | ||
* Auxiliary information for a dataset, including mappings to/from strings and | ||
* the datatype of each dimension. DatasetMapper objects are optionally | ||
* produced by data::Load(), and store the type of each dimension | ||
* (Datatype::numeric or Datatype::categorical) as well as mappings from strings | ||
* to unsigned integers and vice versa. | ||
* | ||
* @tparam PolicyType Mapping policy that defines the behavior of MapString(). | ||
*/ | ||
template <typename PolicyType> | ||
class DatasetMapper | ||
{ | ||
public: | ||
/** | ||
* Create the DatasetMapper object with the given dimensionality. Note that | ||
* the dimensionality cannot be changed later; you will have to create a new | ||
* DatasetMapper object. | ||
*/ | ||
explicit DatasetMapper(const size_t dimensionality = 0); | ||
|
||
/** | ||
* Create the DatasetMapper object with the given policy and dimensionality. | ||
* Note that the dimensionality cannot be changed later; you will have to | ||
* create a new DatasetMapper object. The policy can be changed later via Policy(). | ||
*/ | ||
explicit DatasetMapper(PolicyType& policy, const size_t dimensionality = 0); | ||
|
||
/** | ||
* Given the string and the dimension to which it belongs, return its numeric | ||
* mapping. If no mapping yet exists, the string is added to the list of | ||
* mappings for the given dimension. The dimension parameter refers to the | ||
* index of the dimension of the string (i.e. the row in the dataset). | ||
* | ||
* @param string String to find/create mapping for. | ||
* @param dimension Index of the dimension of the string. | ||
*/ | ||
typename PolicyType::MappedType MapString(const std::string& string, | ||
const size_t dimension); | ||
|
||
/** | ||
* Return the string that corresponds to a given value in a given dimension. | ||
* If the value is not a valid mapping in the given dimension, a | ||
* std::invalid_argument is thrown. | ||
* | ||
* @param value Mapped value for string. | ||
* @param dimension Dimension to unmap string from. | ||
*/ | ||
const std::string& UnmapString(const size_t value, const size_t dimension); | ||
|
||
|
||
/** | ||
* Return the value that corresponds to a given string in a given dimension. | ||
* If the string is not a valid mapping in the given dimension, a | ||
* std::invalid_argument is thrown. | ||
* | ||
* @param string String to look up the mapped value for. | ||
* @param dimension Dimension in which to look up the string. | ||
*/ | ||
typename PolicyType::MappedType UnmapValue(const std::string& string, | ||
const size_t dimension); | ||
|
||
/** | ||
* MapTokens turns a vector of strings into numeric values and stores them | ||
* in the given matrix. It uses the mapping policy to record categorical | ||
* values in the maps. How a value is judged to be categorical, how it is | ||
* stored in the maps, and which numeric value replaces it all depend on the | ||
* mapping policy object's MapTokens() function. | ||
* | ||
* @tparam eT Type of armadillo matrix. | ||
* @param tokens Vector of variables inside a dimension. | ||
* @param row Position of the given tokens. | ||
* @param matrix Matrix to save the data into. | ||
*/ | ||
template <typename eT> | ||
void MapTokens(const std::vector<std::string>& tokens, size_t& row, | ||
arma::Mat<eT>& matrix); | ||
|
||
//! Return the type of a given dimension (numeric or categorical). | ||
Datatype Type(const size_t dimension) const; | ||
//! Modify the type of a given dimension (be careful!). | ||
Datatype& Type(const size_t dimension); | ||
|
||
/** | ||
* Get the number of mappings for a particular dimension. If the dimension | ||
* is numeric, then this will return 0. | ||
*/ | ||
size_t NumMappings(const size_t dimension) const; | ||
|
||
/** | ||
* Get the dimensionality of the DatasetMapper object (that is, how many | ||
* dimensions it has information for). If this object was created by a call | ||
* to mlpack::data::Load(), then the dimensionality will be the same as the | ||
* number of rows (dimensions) in the dataset. | ||
*/ | ||
size_t Dimensionality() const; | ||
|
||
/** | ||
* Serialize the dataset information. | ||
*/ | ||
template<typename Archive> | ||
void Serialize(Archive& ar, const unsigned int /* version */) | ||
{ | ||
ar & data::CreateNVP(types, "types"); | ||
ar & data::CreateNVP(maps, "maps"); | ||
} | ||
|
||
//! Return the policy of the mapper. | ||
const PolicyType& Policy() const; | ||
|
||
//! Modify the policy of the mapper (be careful!). | ||
PolicyType& Policy(); | ||
|
||
//! Modify (replace) the policy of the mapper with a new policy. | ||
void Policy(PolicyType&& policy); | ||
|
||
private: | ||
//! Types of each dimension. | ||
std::vector<Datatype> types; | ||
|
||
// BiMapType definition | ||
using BiMapType = boost::bimap<std::string, typename PolicyType::MappedType>; | ||
|
||
// Mappings from strings to integers. | ||
// Map entries will only exist for dimensions that are categorical. | ||
// MapType = map<dimension, pair<bimap<string, MappedType>, numMappings>> | ||
using MapType = std::unordered_map<size_t, std::pair<BiMapType, size_t>>; | ||
|
||
//! maps object stores string and numerical pairs. | ||
MapType maps; | ||
|
||
//! policy object tells dataset mapper how the categorical values should be | ||
// mapped to the maps object. It is used in MapString() and MapTokens(). | ||
PolicyType policy; | ||
}; | ||
|
||
// Type alias to provide backward compatibility. | ||
using DatasetInfo = DatasetMapper<data::IncrementPolicy>; | ||
|
||
} // namespace data | ||
} // namespace mlpack | ||
|
||
#include "dataset_mapper_impl.hpp" | ||
|
||
#endif |
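
The MapString(), UnmapString() and NumMappings() contract documented in the header above can be illustrated with a small standalone sketch. This is not the mlpack implementation: SimpleMapper is a hypothetical class, it hard-codes an incrementing assignment (0, 1, 2, ...) instead of taking a PolicyType, and it uses a pair of std::unordered_map objects per dimension in place of boost::bimap so that it depends only on the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for DatasetMapper with an incrementing
// assignment of values. Not part of mlpack.
class SimpleMapper
{
 public:
  // The dimensionality is fixed at construction, as in the header's comment.
  explicit SimpleMapper(const std::size_t dimensionality) :
      forward(dimensionality), backward(dimensionality) { }

  // Return the numeric mapping of the string, creating one if none exists.
  std::size_t MapString(const std::string& string,
                        const std::size_t dimension)
  {
    auto& f = forward.at(dimension);
    const auto it = f.find(string);
    if (it != f.end())
      return it->second;

    // The next unused value is the number of mappings stored so far.
    const std::size_t value = f.size();
    f.emplace(string, value);
    backward.at(dimension).emplace(value, string);
    assert(f.size() == backward.at(dimension).size());
    return value;
  }

  // Return the string for a mapped value; throw if no such mapping exists.
  const std::string& UnmapString(const std::size_t value,
                                 const std::size_t dimension) const
  {
    const auto& b = backward.at(dimension);
    const auto it = b.find(value);
    if (it == b.end())
      throw std::invalid_argument("value has no mapping in this dimension");
    return it->second;
  }

  // Number of distinct strings mapped so far in the given dimension.
  std::size_t NumMappings(const std::size_t dimension) const
  {
    return forward.at(dimension).size();
  }

 private:
  // One string -> value map and one value -> string map per dimension.
  std::vector<std::unordered_map<std::string, std::size_t>> forward;
  std::vector<std::unordered_map<std::size_t, std::string>> backward;
};
```

With this sketch, mapping "red" and then "blue" in dimension 0 yields 0 and 1, mapping "red" again returns the existing 0, and each dimension keeps its own independent set of mappings. Keeping both directions in sync is the job boost::bimap does in the actual header.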
Can you comment the maps and policy parameter?
updated!
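
For readers following this thread, here is a rough sketch of how the two members could relate: maps holds the per-dimension string mappings plus a counter, and policy decides what value a string receives. Everything in it is an assumption made for illustration. PolicyMapper, CountingPolicy and NanPolicy are hypothetical names, the policy's MapString() signature is simplified and is not the one used in this PR, and a plain std::unordered_map replaces boost::bimap.

```cpp
#include <cstddef>
#include <limits>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical policy: every new string receives the next unused integer.
struct CountingPolicy
{
  using MappedType = std::size_t;

  template<typename MapT>
  MappedType MapString(const std::string& s, MapT& map,
                       std::size_t& numMappings)
  {
    const auto it = map.find(s);
    if (it != map.end())
      return it->second;
    map.emplace(s, numMappings);
    return numMappings++;
  }
};

// Hypothetical policy: user-chosen strings count as missing and map to NaN;
// every other string is parsed as a number (std::stod throws otherwise).
struct NanPolicy
{
  using MappedType = double;

  explicit NanPolicy(std::set<std::string> missingSet) :
      missing(std::move(missingSet)) { }

  template<typename MapT>
  MappedType MapString(const std::string& s, MapT& map,
                       std::size_t& numMappings)
  {
    if (missing.count(s) == 0)
      return std::stod(s);

    const double nan = std::numeric_limits<double>::quiet_NaN();
    if (map.emplace(s, nan).second)
      ++numMappings;  // Count each distinct missing token once.
    return nan;
  }

  std::set<std::string> missing;
};

// Hypothetical mapper: owns the maps and delegates each decision to policy.
template<typename PolicyType>
class PolicyMapper
{
 public:
  explicit PolicyMapper(PolicyType p) : policy(std::move(p)) { }

  typename PolicyType::MappedType MapString(const std::string& s,
                                            const std::size_t dimension)
  {
    // operator[] creates an empty map with a zero counter on first use.
    auto& entry = maps[dimension];
    return policy.MapString(s, entry.first, entry.second);
  }

  std::size_t NumMappings(const std::size_t dimension) const
  {
    const auto it = maps.find(dimension);
    return (it == maps.end()) ? 0 : it->second.second;
  }

 private:
  using ForwardMap =
      std::unordered_map<std::string, typename PolicyType::MappedType>;

  // maps: dimension -> (string-to-value map, number of mappings).
  std::unordered_map<std::size_t, std::pair<ForwardMap, std::size_t>> maps;
  // policy: decides how a string is turned into a MappedType.
  PolicyType policy;
};
```

Swapping CountingPolicy for NanPolicy changes both the mapped type (std::size_t versus double) and the rule for assigning values, while the mapper class itself stays the same. That separation is the reason the sketch keeps maps and policy as distinct members.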