<a href="https://colab.research.google.com/github/maskot1977/sampledataset_generator/blob/main/sampledataset_generator_basic_usage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sample Dataset Generator

Have you ever struggled to find a good dataset when learning about machine learning or testing your methods? The Sample Dataset Generator is a tool that generates sample data automatically for exactly such situations.

# Install

In [2]:
!pip install git+https://github.com/maskot1977/sampledataset_generator.git

Collecting git+https://github.com/maskot1977/sampledataset_generator.git
  Cloning https://github.com/maskot1977/sampledataset_generator.git to /tmp/pip-req-build-587v_id0
  Running command git clone -q https://github.com/maskot1977/sampledataset_generator.git /tmp/pip-req-build-587v_id0
Building wheels for collected packages: sampledataset-generator
  Building wheel for sampledataset-generator (setup.py) ... done
  Created wheel for sampledataset-generator: filename=sampledataset_generator-0.1.0-cp36-none-any.whl size=3101 sha256=7f18f30b899d3664a6b5936e0dc764f5209280055d80419a4114007055185805
  Stored in directory: /tmp/pip-ephem-wheel-cache-95ietsbo/wheels/45/04/cf/e9f65e7ff4100654a7511503a2873b57e5f83431fccef54f2c
Successfully built sampledataset-generator
Installing collected packages: sampledataset-generator
Successfully installed sampledataset-generator-0.1.0


# Basic usage

You can automatically create the descriptive (explanatory) variables `X` and the objective variable `Y` as follows:

In [3]:
from sampledataset_generator import generator

dataset = generator.SampleDatasetGenerator()
dataset.generate()

In [4]:
import pandas as pd

pd.DataFrame(dataset.X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.874719,0.228159,1.043060,-0.064032,0.062203,2.938519,-0.366674,1.087626,0.271206,1.396527
1,-0.380454,0.070878,0.181825,-0.172262,0.080013,0.420896,0.734195,-0.340716,-0.571168,1.437683
2,-1.624316,-0.312386,-0.007409,-0.001500,-0.085444,-0.117991,-2.223847,-0.318631,0.389458,-0.217023
3,-0.379374,-0.058680,-0.496479,0.541437,0.121997,-0.468587,-0.250380,0.038416,0.646550,0.869240
4,2.187036,-0.014006,0.363903,-0.509491,0.001101,1.084016,1.328812,-1.241679,0.703932,-0.157089
...,...,...,...,...,...,...,...,...,...,...
95,-0.236994,-0.159133,0.807234,-0.083568,0.078701,-0.105214,-2.761139,0.043162,-0.096247,-0.106179
96,-0.391223,-0.014557,0.804878,0.040324,0.022574,0.717083,-1.149362,1.051754,-0.471551,0.333957
97,0.481387,-0.038779,0.380662,0.857424,0.067420,1.599628,-0.529226,0.071804,-0.201889,-0.087362
98,1.422031,0.021933,-0.180896,0.644405,-0.012918,-1.929145,-0.598751,-1.371636,0.812930,0.000721


In [5]:
pd.DataFrame(dataset.Y)

Unnamed: 0,0
0,8.584018
1,1.502417
2,-0.355069
3,-2.698677
4,2.331078
...,...
95,6.929988
96,6.780913
97,4.847613
98,-0.441276


By default, the formula for calculating Y from X is a linear expression. Only the first five features are used to derive Y, and the remaining features are independent of Y. The coefficients can be obtained as follows:

In [6]:
pd.DataFrame(dataset.coef)

Unnamed: 0,0
0,0.060915
1,-0.54852
2,8.244854
3,1.599333
4,4.259971
5,0.0
6,0.0
7,0.0
8,0.0
9,0.0
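
The relationship between the three outputs above can be checked by hand. The following sketch copies the first two rows of `X`, the matching `Y` values, and the coefficients from the printed tables and recomputes Y as a plain dot product. It assumes there is no intercept term, which is consistent with the printed numbers but is not confirmed by the package itself; the helper name `linear_y` is made up for this illustration.

```python
# Values copied from the printed outputs above (rounded to 6 decimals).
coef = [0.060915, -0.54852, 8.244854, 1.599333, 4.259971,
        0.0, 0.0, 0.0, 0.0, 0.0]
X_head = [
    [-0.874719, 0.228159, 1.043060, -0.064032, 0.062203,
     2.938519, -0.366674, 1.087626, 0.271206, 1.396527],
    [-0.380454, 0.070878, 0.181825, -0.172262, 0.080013,
     0.420896, 0.734195, -0.340716, -0.571168, 1.437683],
]
Y_head = [8.584018, 1.502417]


def linear_y(row, coef):
    """Dot product of one sample with the coefficients (no intercept assumed)."""
    return sum(x * c for x, c in zip(row, coef))


Y_recomputed = [linear_y(row, coef) for row in X_head]
for y_new, y_shown in zip(Y_recomputed, Y_head):
    print(f"recomputed={y_new:.6f}  shown={y_shown:.6f}")
```

The recomputed values agree with the printed `Y` up to rounding, and the last five features contribute nothing because their coefficients are zero.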


# Parameters and functions

By default, `n_samples`=100, `n_features`=10, `n_informative`=5, `noise`=0.0, `function`=`linear`, `independence`=1.0. 

- `n_samples` is the number of samples.
- `n_features` is the number of features.
- `n_informative` is the number of "meaningful" features in X that are used to derive Y in the default linear function.
- `noise` is the strength of the noise.
- `independence` controls how independent the "irrelevant" features (those not used to derive Y) are from the "meaningful" features.
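
To make the roles of `n_samples`, `n_features`, `n_informative` and `noise` concrete, here is a minimal plain-Python sketch of a linear generator that honours them. It is an illustration only, not the package's implementation: the helper name `make_linear_dataset`, the standard-normal features, the coefficient range, and the way noise is added are all assumptions, and `independence` is left out.

```python
import random


def make_linear_dataset(n_samples=100, n_features=10, n_informative=5,
                        noise=0.0, seed=0):
    """Hypothetical helper: linear Y built from the first n_informative features."""
    rng = random.Random(seed)
    # Features drawn from a standard normal distribution (an assumption).
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    # Non-zero coefficients only for the informative features.
    coef = [rng.uniform(-10.0, 10.0) if j < n_informative else 0.0
            for j in range(n_features)]
    # Y is the dot product plus Gaussian noise scaled by `noise`.
    Y = [sum(x * c for x, c in zip(row, coef)) + noise * rng.gauss(0.0, 1.0)
         for row in X]
    return X, Y, coef


X, Y, coef = make_linear_dataset(n_samples=50, n_features=8, n_informative=3)
print(len(X), len(X[0]), coef)
```

With `noise=0.0`, every Y value is exactly the dot product of its row with `coef`, so a linear model should be able to recover the coefficients.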

If you want a non-linear function, the functions `friedman1`, `friedman2` and `friedman3` are provided and can be used as follows:

In [7]:
dataset = generator.SampleDatasetGenerator(function=generator.friedman1)
dataset.generate()
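
For reference, the standard Friedman #1 benchmark computes the response from the first five features as 10 sin(pi x0 x1) + 20 (x2 - 0.5)^2 + 10 x3 + 5 x4. The sketch below evaluates that formula for a single sample in plain Python. Whether `generator.friedman1` follows this definition exactly is an assumption; the helper name `friedman1_row` is hypothetical.

```python
import math


def friedman1_row(x):
    """Standard Friedman #1 response for one sample with at least 5 features."""
    return (10.0 * math.sin(math.pi * x[0] * x[1])
            + 20.0 * (x[2] - 0.5) ** 2
            + 10.0 * x[3]
            + 5.0 * x[4])


sample = [0.5, 1.0, 0.5, 0.2, 0.4, 0.9]  # the sixth feature is ignored
y = friedman1_row(sample)
print(y)
```

For this sample the four terms are 10, 0, 2 and 2, so features beyond the fifth have no influence on the result.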

Alternatively, you can define any function you like and use it as follows:

In [8]:
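import numpy as np

# A custom function receives the feature matrix X and the coefficients coef
# (unused here) and returns one Y value per row of X.
def my_func(X, coef):
    return np.arctan((X[:, 1] * X[:, 2] - 1 / (X[:, 1] * X[:, 3])) / X[:, 0])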

dataset = generator.SampleDatasetGenerator(function=my_func)
dataset.generate()

# Creating data with missing values

You can create data with missing values in the following way. The rate of occurrence of the missing values can be adjusted, e.g. `rate` = 0.05.

In [9]:
dataset = generator.SampleDatasetGenerator()
dataset.generate()
dataset.fill_nan_randomly()
pd.DataFrame(dataset.X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,,-0.717267,0.466119,0.157829,0.403057,0.261670,1.621723,-2.384312,-0.419102,-0.187467
1,-0.226485,0.529230,-1.417389,-0.108385,0.108098,0.217466,1.137134,0.839975,-0.329924,
2,-0.213283,-0.980020,,-0.031286,0.252950,-1.684535,1.525129,-1.228553,-0.815602,
3,-0.610736,-0.601983,0.129403,-0.041961,-0.834912,-0.447463,1.315553,-0.935183,,-1.116784
4,-0.550343,-0.148456,,0.757180,0.731174,-0.562547,0.425136,-0.848524,-1.284881,-0.720632
...,...,...,...,...,...,...,...,...,...,...
95,-0.400990,0.010573,-0.017612,0.323151,0.257867,0.008553,-0.781648,0.031025,-1.071831,1.435444
96,-0.008266,-0.945571,-1.367276,0.396686,0.434881,-0.684910,-0.845351,-0.597488,-1.066122,0.822694
97,0.154777,,,0.259955,0.008274,-0.866544,-1.278649,-0.711525,-0.105661,1.192752
98,0.174639,0.208390,2.833820,0.352648,-0.131895,1.685485,1.342137,-0.594071,0.882873,-1.190221
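
The idea behind random missing values can be sketched without the package: each cell is replaced by NaN independently with probability `rate`. The helper `fill_nan_at_rate` below is a hypothetical stand-in written for illustration; how `fill_nan_randomly` chooses cells internally is not documented here.

```python
import math
import random


def fill_nan_at_rate(X, rate=0.05, seed=0):
    """Hypothetical stand-in: copy X, setting each cell to NaN with probability `rate`."""
    rng = random.Random(seed)
    return [[float("nan") if rng.random() < rate else value for value in row]
            for row in X]


# A 100 x 10 grid of dummy values, matching the default dataset shape.
X = [[float(i + j) for j in range(10)] for i in range(100)]
X_missing = fill_nan_at_rate(X, rate=0.05)

n_missing = sum(math.isnan(v) for row in X_missing for v in row)
print(f"missing fraction: {n_missing / 1000:.3f}")
```

The observed fraction fluctuates around `rate` because every cell is drawn independently; the original `X` is left untouched.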


# Creating sparse data

You can create sparse data, in which most values are zero, in the following way. The rate of occurrence of the zero values can be adjusted, e.g. `rate` = 0.9.

In [10]:
dataset = generator.SampleDatasetGenerator()
dataset.generate()
dataset.fill_zero_randomly()
pd.DataFrame(dataset.X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000
1,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000
2,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000
3,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000
4,0.0,0.0,1.290429,0.0,0.000000,0.000000,-0.620650,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...
95,0.0,0.0,0.000000,0.0,2.746314,0.000000,0.000000,0.0,0.0,0.000000
96,0.0,0.0,0.000000,0.0,0.000000,0.254578,0.000000,0.0,0.0,1.269509
97,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000
98,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.000000
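
A quick way to judge how sparse the result is, is to count the zero cells. The sketch below does this for the nine rows visible in the output above (rows 0 to 4 and 95 to 98), copied by hand; with so few rows the figure is only indicative of the overall rate.

```python
# The nine rows visible in the output above, copied by hand.
zero_row = [0.0] * 10
shown_rows = [
    zero_row, zero_row, zero_row, zero_row,
    [0.0, 0.0, 1.290429, 0.0, 0.0, 0.0, -0.620650, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 2.746314, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.254578, 0.0, 0.0, 0.0, 1.269509],
    zero_row, zero_row,
]

n_cells = sum(len(row) for row in shown_rows)
n_zero = sum(value == 0.0 for row in shown_rows for value in row)
zero_fraction = n_zero / n_cells
print(f"{n_zero} of {n_cells} cells are zero ({zero_fraction:.1%})")
```

On the full dataset the same count can be taken over every row of `dataset.X`.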


# Objective variables for classification problems

By default, the objective variable is continuous and is intended for regression problems. To treat the task as a classification problem, convert the objective variable to discrete values as follows:

In [11]:
dataset = generator.SampleDatasetGenerator()
dataset.generate()
dataset.categoricalize()
pd.DataFrame(dataset.Y)

Unnamed: 0,0
0,1
1,1
2,0
3,1
4,0
...,...
95,0
96,0
97,0
98,0
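
One simple way to turn a continuous objective into two classes is to compare each value with a threshold such as the median. The sketch below applies this to the continuous `Y` values printed in the basic usage section. It is only an illustration of the idea: the helper `binarize` is hypothetical, and the rule that `categoricalize` actually applies is not specified here and may differ.

```python
import statistics


def binarize(y_values, threshold=None):
    """Hypothetical helper: 1 if the value exceeds the threshold, else 0."""
    if threshold is None:
        # Use the median so that the two classes are roughly balanced.
        threshold = statistics.median(y_values)
    return [1 if y > threshold else 0 for y in y_values]


# Continuous Y values copied from the basic usage output (rows 0-4 and 95-98).
y_continuous = [8.584018, 1.502417, -0.355069, -2.698677, 2.331078,
                6.929988, 6.780913, 4.847613, -0.441276]
labels = binarize(y_continuous)
print(labels)
```

Moving the threshold away from the median shifts the class balance, which is one way to think about a biased split.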


You can generate an arbitrary number of categories with an arbitrary bias as follows:

In [12]:
dataset = generator.SampleDatasetGenerator()
dataset.generate()
dataset.categoricalize(labels=["beer", "wine", "whiskey"], classification_ratio=0.4)
pd.DataFrame(dataset.Y)

Unnamed: 0,0
0,whiskey
1,whiskey
2,whiskey
3,whiskey
4,whiskey
...,...
95,whiskey
96,whiskey
97,beer
98,wine


# Include discrete values in explanatory variables

You can turn the descriptive variable in a given column into a categorical variable as follows:

In [13]:
dataset = generator.SampleDatasetGenerator()
dataset.generate()
dataset.categoricalize(column=3)
pd.DataFrame(dataset.X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.069141,-0.152245,-0.069177,1.0,2.147144,1.683370,-1.406095,-0.288278,0.021179,0.386266
1,-0.017672,-0.330113,-0.233835,1.0,-1.383669,0.352960,0.937982,-1.835730,1.179513,-0.555153
2,-0.049318,0.672656,-0.694506,1.0,-0.250240,1.386238,1.016130,0.435432,-0.102031,-1.021133
3,-0.035957,-0.070815,0.086850,1.0,-0.044765,-0.424427,0.823290,-2.021052,-0.740976,0.774101
4,0.054945,0.316388,-0.937645,0.0,-1.004592,-1.727037,-0.325570,-0.185780,0.660687,-0.249081
...,...,...,...,...,...,...,...,...,...,...
95,-0.050984,0.218996,-0.436951,1.0,1.797569,1.522909,0.091088,1.819058,0.252704,1.506190
96,-0.040341,0.239490,-0.118325,1.0,-0.750467,-0.524603,1.785338,-0.590646,-0.311082,-0.588058
97,0.032199,0.652054,-0.293499,1.0,-0.526053,0.753688,1.558890,-1.672383,0.610953,-0.076587
98,-0.077576,0.222219,-0.018319,0.0,-0.393507,-0.626555,-0.160973,-0.469242,-0.435700,-1.651206


For a descriptive variable, too, you can generate an arbitrary number of categories with an arbitrary bias as follows:

In [14]:
dataset.categoricalize(column=5, labels=["pen", "pineapple", "apple"], classification_ratio=0.4)
pd.DataFrame(dataset.X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.0691405,-0.152245,-0.0691765,1,2.14714,pineapple,-1.4061,-0.288278,0.0211786,0.386266
1,-0.0176721,-0.330113,-0.233835,1,-1.38367,pen,0.937982,-1.83573,1.17951,-0.555153
2,-0.0493178,0.672656,-0.694506,1,-0.25024,pineapple,1.01613,0.435432,-0.102031,-1.02113
3,-0.0359565,-0.0708147,0.0868503,1,-0.0447651,apple,0.82329,-2.02105,-0.740976,0.774101
4,0.0549453,0.316388,-0.937645,0,-1.00459,apple,-0.32557,-0.18578,0.660687,-0.249081
...,...,...,...,...,...,...,...,...,...,...
95,-0.0509837,0.218996,-0.436951,1,1.79757,pineapple,0.0910875,1.81906,0.252704,1.50619
96,-0.0403408,0.23949,-0.118325,1,-0.750467,apple,1.78534,-0.590646,-0.311082,-0.588058
97,0.0321993,0.652054,-0.293499,1,-0.526053,pineapple,1.55889,-1.67238,0.610953,-0.0765871
98,-0.0775761,0.222219,-0.0183187,0,-0.393507,apple,-0.160973,-0.469242,-0.4357,-1.65121


If you want a column to contain non-negative integers, you can pass integer labels as follows:

In [15]:
dataset.categoricalize(column=7, labels=range(10), classification_ratio=0.4)
pd.DataFrame(dataset.X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.0691405,-0.152245,-0.0691765,1,2.14714,pineapple,-1.4061,0,0.0211786,0.386266
1,-0.0176721,-0.330113,-0.233835,1,-1.38367,pen,0.937982,5,1.17951,-0.555153
2,-0.0493178,0.672656,-0.694506,1,-0.25024,pineapple,1.01613,6,-0.102031,-1.02113
3,-0.0359565,-0.0708147,0.0868503,1,-0.0447651,apple,0.82329,5,-0.740976,0.774101
4,0.0549453,0.316388,-0.937645,0,-1.00459,apple,-0.32557,2,0.660687,-0.249081
...,...,...,...,...,...,...,...,...,...,...
95,-0.0509837,0.218996,-0.436951,1,1.79757,pineapple,0.0910875,6,0.252704,1.50619
96,-0.0403408,0.23949,-0.118325,1,-0.750467,apple,1.78534,0,-0.311082,-0.588058
97,0.0321993,0.652054,-0.293499,1,-0.526053,pineapple,1.55889,5,0.610953,-0.0765871
98,-0.0775761,0.222219,-0.0183187,0,-0.393507,apple,-0.160973,0,-0.4357,-1.65121
