[GSOC]DatasetMapper & Imputer #694

Merged
merged 47 commits into from Jul 25, 2016
47 commits
87c05a5
concept work for imputer
keon Jun 1, 2016
2e4b1a8
Merge branch 'master' of github.com:keonkim/mlpack into imputer
keon Jun 6, 2016
631e59e
do not to use NaN by default, let the user specify
keon Jun 6, 2016
391006e
Merge branch 'master' of github.com:keonkim/mlpack into imputer
keon Jun 6, 2016
6a1fb81
add template to datasetinfo and add imputer class
keon Jun 12, 2016
b0c5224
clean datasetinfo class and rename files
keon Jun 13, 2016
de35241
implement basic imputation strategies
keon Jun 13, 2016
2d38604
modify imputer_main and clean logs
keon Jun 13, 2016
bb045b8
add parameter verification for imputer_main
keon Jun 13, 2016
1295f4b
add custom strategy to impute_main
keon Jun 13, 2016
5a517c2
add datatype change in IncrementPolicy
keon Jun 14, 2016
94b7a5c
update types used in datasetinfo
keon Jun 14, 2016
ebed68f
initialize imputer with parameters
keon Jun 14, 2016
db78f39
remove datatype in dataset_info
keon Jun 15, 2016
7c60b97
Merge branch 'master' of github.com:keonkim/mlpack into imputer
keon Jun 15, 2016
da4e409
add test for imputer
keon Jun 15, 2016
d8618ec
restructure, add listwise deletion & imputer tests
keon Jun 18, 2016
3b8ffd0
fix transpose problem
keon Jun 27, 2016
90a5cd2
Merge pull request #7 from mlpack/master
keon Jun 27, 2016
32c8a73
merge
keon Jun 27, 2016
e09d9bc
updates and fixes on imputation methods
keon Jun 28, 2016
87d8d46
update data::load to accept different mappertypes
keon Jul 1, 2016
de0b2db
update data::load to accept different policies
keon Jul 1, 2016
bc187ca
add imputer doc
keon Jul 1, 2016
a340f69
debug median imputation and listwise deletion
keon Jul 2, 2016
21d94c0
remove duplicate code in load function
keon Jul 2, 2016
a92afaa
delete load overload
keon Jul 3, 2016
bace8b2
modify MapToNumerical to work with MissingPolicy
keon Jul 4, 2016
896a018
MissingPolicy uses NaN instead of numbers
keon Jul 4, 2016
1a908c2
fix reference issue in DatasetMapper
keon Jul 4, 2016
2edbc40
Move MapToNumerical(MapTokens) to Policy class
keon Jul 5, 2016
d881cb7
make policy and imputation api more consistent
keon Jul 5, 2016
a881831
numerical values can be set as missing values
keon Jul 6, 2016
63268a3
add comments and use more proper names
keon Jul 7, 2016
2eb6754
modify custom impute interface and rename variables
keon Jul 10, 2016
6d43aa3
add input-only overloads to imputation methods
keon Jul 10, 2016
fedc5e0
update median imputation to exclude missing values
keon Jul 11, 2016
787fd82
optimize imputation methods with output overloads
keon Jul 18, 2016
a0b7d59
expressive comments in imputation_test
keon Jul 18, 2016
9a6dce7
shorten imputation tests
keon Jul 18, 2016
c3aeba1
optimize preprocess imputer executable
keon Jul 18, 2016
028c217
fix bugs in imputation test
keon Jul 18, 2016
03e19a4
add more comments and delete impute_test.csv
keon Jul 22, 2016
ef4536b
Merge pull request #8 from mlpack/master
keon Jul 22, 2016
6e2c1ff
Merge branch 'master' of github.com:keonkim/mlpack into imputer
keon Jul 22, 2016
5eb9abd
fix PARAM statements in imputer
keon Jul 22, 2016
d043235
delete Impute() overloads that produce output matrix
keon Jul 23, 2016
5 changes: 3 additions & 2 deletions src/mlpack/core/data/CMakeLists.txt
@@ -1,8 +1,8 @@
# Define the files that we need to compile.
# Anything not in this list will not be compiled into mlpack.
set(SOURCES
dataset_info.hpp
dataset_info_impl.hpp
dataset_mapper.hpp
dataset_mapper_impl.hpp
extension.hpp
format.hpp
load.hpp
@@ -15,6 +15,7 @@ set(SOURCES
save_impl.hpp
serialization_shim.hpp
split_data.hpp
imputer.hpp
binarize.hpp
)

114 changes: 0 additions & 114 deletions src/mlpack/core/data/dataset_info.hpp

This file was deleted.

100 changes: 0 additions & 100 deletions src/mlpack/core/data/dataset_info_impl.hpp

This file was deleted.

164 changes: 164 additions & 0 deletions src/mlpack/core/data/dataset_mapper.hpp
@@ -0,0 +1,164 @@
/**
* @file dataset_mapper.hpp
* @author Ryan Curtin
* @author Keon Kim
*
* Defines the DatasetMapper class, which holds information about a dataset.
* This is useful when the dataset contains categorical non-numeric features
* that need to be mapped to categorical numeric features.
*/
#ifndef MLPACK_CORE_DATA_DATASET_INFO_HPP
#define MLPACK_CORE_DATA_DATASET_INFO_HPP

#include <mlpack/core.hpp>
#include <unordered_map>
#include <boost/bimap.hpp>

#include "map_policies/increment_policy.hpp"

namespace mlpack {
namespace data {
/**
* Auxiliary information for a dataset, including mappings to/from strings and
* the datatype of each dimension. DatasetMapper objects are optionally
* produced by data::Load(), and store the type of each dimension
* (Datatype::numeric or Datatype::categorical) as well as mappings from strings
* to unsigned integers and vice versa.
*
* @tparam PolicyType Mapping policy used by MapString() and MapTokens().
*/
template <typename PolicyType>
class DatasetMapper
{
public:
/**
* Create the DatasetMapper object with the given dimensionality. Note that
* the dimensionality cannot be changed later; you will have to create a new
* DatasetMapper object.
*/
explicit DatasetMapper(const size_t dimensionality = 0);

/**
* Create the DatasetMapper object with the given policy and dimensionality.
* Note that the dimensionality cannot be changed later; you will have to
* create a new DatasetMapper object. Policy can be modified by the modifier.
*/
explicit DatasetMapper(PolicyType& policy, const size_t dimensionality = 0);

/**
* Given the string and the dimension to which it belongs, return its numeric
* mapping. If no mapping yet exists, the string is added to the list of
* mappings for the given dimension. The dimension parameter refers to the
* index of the dimension of the string (i.e. the row in the dataset).
*
* @param string String to find/create mapping for.
* @param dimension Index of the dimension of the string.
*/
typename PolicyType::MappedType MapString(const std::string& string,
const size_t dimension);

/**
* Return the string that corresponds to a given value in a given dimension.
* If the value is not a valid mapping in the given dimension, a
* std::invalid_argument is thrown.
*
* @param value Mapped value for string.
* @param dimension Dimension to unmap string from.
*/
const std::string& UnmapString(const size_t value, const size_t dimension);


/**
* Return the value that corresponds to a given string in a given dimension.
* If the string is not a valid mapping in the given dimension, a
* std::invalid_argument is thrown.
*
* @param string Mapped string for value.
* @param dimension Dimension to unmap string from.
*/
typename PolicyType::MappedType UnmapValue(const std::string& string,
const size_t dimension);

/**
* MapTokens turns a vector of strings into numeric variables and puts them
* into a given matrix. It uses the mapping policy to store categorical values
* in maps. How it determines whether a value is categorical, and how it
* stores the categorical value in the map and replaces it with the numerical
* value, all depends on the mapping policy object's MapTokens() function.
*
* @tparam eT Type of armadillo matrix.
* @param tokens Vector of variables inside a dimension.
* @param row Position of the given tokens.
* @param matrix Matrix to save the data into.
*/
template <typename eT>
void MapTokens(const std::vector<std::string>& tokens, size_t& row,
arma::Mat<eT>& matrix);

//! Return the type of a given dimension (numeric or categorical).
Datatype Type(const size_t dimension) const;
//! Modify the type of a given dimension (be careful!).
Datatype& Type(const size_t dimension);

/**
* Get the number of mappings for a particular dimension. If the dimension
* is numeric, then this will return 0.
*/
size_t NumMappings(const size_t dimension) const;

/**
* Get the dimensionality of the DatasetMapper object (that is, how many
* dimensions it has information for). If this object was created by a call
* to mlpack::data::Load(), then the dimensionality will be the same as the
* number of rows (dimensions) in the dataset.
*/
size_t Dimensionality() const;

/**
* Serialize the dataset information.
*/
template<typename Archive>
void Serialize(Archive& ar, const unsigned int /* version */)
{
ar & data::CreateNVP(types, "types");
ar & data::CreateNVP(maps, "maps");
}

//! Return the policy of the mapper.
const PolicyType& Policy() const;

//! Modify the policy of the mapper (be careful!).
PolicyType& Policy();

//! Modify (Replace) the policy of the mapper with a new policy
void Policy(PolicyType&& policy);

private:
//! Types of each dimension.
std::vector<Datatype> types;

// BiMapType definition
using BiMapType = boost::bimap<std::string, typename PolicyType::MappedType>;

// Mappings from strings to integers.
// Map entries will only exist for dimensions that are categorical.
// MapType = map<dimension, pair<bimap<string, MappedType>, numMappings>>
using MapType = std::unordered_map<size_t, std::pair<BiMapType, size_t>>;

//! maps object stores string and numerical pairs.
MapType maps;
Member


Can you comment the maps and policy parameter?

Member Author


updated!


//! policy object tells dataset mapper how the categorical values should be
// mapped to the maps object. It is used in MapString() and MapTokens().
PolicyType policy;
};

// Use a type alias to provide backward compatibility.
using DatasetInfo = DatasetMapper<data::IncrementPolicy>;

} // namespace data
} // namespace mlpack

#include "dataset_mapper_impl.hpp"

#endif
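
The header above only declares the interface, so the following is a minimal, self-contained sketch of how a policy-parameterized mapper of this shape might behave. It is not mlpack's implementation: `SketchMapper`, `SketchIncrementPolicy`, and `NextValue()` are hypothetical names invented for illustration, and two `std::unordered_map` objects per dimension stand in for the `boost::bimap` used in the real class. The sketch assumes the policy simply hands out 0, 1, 2, ... to each new string within a dimension.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical policy (not mlpack's IncrementPolicy): assigns 0, 1, 2, ...
// to previously unseen strings within a dimension.
struct SketchIncrementPolicy
{
  using MappedType = std::size_t;

  // Value for the next unseen string, given the current number of mappings.
  MappedType NextValue(const std::size_t numMappings) const
  {
    return numMappings;
  }
};

// Simplified stand-in for a policy-based mapper. Each dimension keeps a
// forward (string -> value) and a backward (value -> string) hash map.
template<typename PolicyType>
class SketchMapper
{
 public:
  using MappedType = typename PolicyType::MappedType;

  explicit SketchMapper(const std::size_t dimensionality = 0) :
      forward(dimensionality), backward(dimensionality) { }

  // Return the mapping for the string, creating one if none exists yet.
  MappedType MapString(const std::string& string, const std::size_t dimension)
  {
    auto& fwd = forward.at(dimension);
    const auto it = fwd.find(string);
    if (it != fwd.end())
      return it->second;

    const MappedType value = policy.NextValue(fwd.size());
    fwd.emplace(string, value);
    backward.at(dimension).emplace(value, string);
    return value;
  }

  // Return the string for a mapped value; throw if the value is unknown.
  const std::string& UnmapString(const MappedType value,
                                 const std::size_t dimension) const
  {
    const auto& bwd = backward.at(dimension);
    const auto it = bwd.find(value);
    if (it == bwd.end())
      throw std::invalid_argument("value has no mapping in this dimension");
    return it->second;
  }

  // Number of distinct strings mapped so far in the given dimension.
  std::size_t NumMappings(const std::size_t dimension) const
  {
    return forward.at(dimension).size();
  }

 private:
  PolicyType policy;
  std::vector<std::unordered_map<std::string, MappedType>> forward;
  std::vector<std::unordered_map<MappedType, std::string>> backward;
};
```

With this sketch, mapping "red" and then "blue" in dimension 0 yields 0 and 1, mapping "red" again returns the existing 0, and each dimension keeps its own independent numbering. Keeping the numbering rule in the policy type is what would let a different policy (for example, one that marks missing values) be swapped in without changing the mapper itself.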