# Data example

This notebook shows how to use the functions and classes in the data module.

In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets

# make the parent directory importable so that `toolkit` can be found
import os
import sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

# functions to test
import toolkit.data.creation as dc
import toolkit.data.eda as eda

## Data Creation
In this section, we test the creation of a dataset using the functions available in toolkit/data/creation.py.

In [2]:
# let's create a random dataset
dc.create_df_rd(
    size=10,  # number of rows
    seed=1234,  # fixed random seed
    numerics={
        "num1": None,  # normally distributed, mean 0, std 1
        "num2": [3, 1],  # normally distributed, mean 3, std 1
    },
    booleans={"bool1": 0.5},  # Bernoulli distributed, P(True) = 0.5
    categories={
        "a": [1, 2, 3],  # 3 values, each drawn with probability 1/3
        "b": ["F", "M"],  # 2 values, each drawn with probability 1/2
    },
)

We check the inputs...
	We have 2 numerics columns
	We have 1 booleans columns
	We have 2 categories columns
We check the format of the new columns' name
We create the DataFrame
	The column ''num1'' is normaly distributed (0,1)
	The column ''num2'' is normal, (3 , 1)
	The column ''bool1'' is binomialy distributed with proba 0.5 of having 1
	The column ''a'' is binomialy distributed with proba 0.33 for each value
	The column ''b'' is binomialy distributed with proba 0.5 for each value


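| | id | num1 | num2 | bool1 | a | b |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.471435 | 3.689382 | False | 1 | F |
| 1 | 1 | -1.190976 | 2.968288 | False | 1 | M |
| 2 | 2 | 1.432707 | 3.668054 | True | 1 | M |
| 3 | 3 | -0.312652 | 3.488838 | True | 3 | M |
| 4 | 4 | -0.720589 | 2.320212 | False | 3 | M |
| 5 | 5 | 0.887163 | 1.692521 | True | 3 | F |
| 6 | 6 | 0.859588 | 4.470304 | False | 2 | M |
| 7 | 7 | -0.636524 | 1.768973 | False | 1 | M |
| 8 | 8 | 0.015696 | 3.958775 | False | 1 | F |
| 9 | 9 | -2.242685 | 3.74049 | False | 2 | F |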
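
To make the column specification above more concrete, here is a minimal sketch of how a comparable dataset could be generated with NumPy and pandas alone. The function `make_random_df` is hypothetical and is not the implementation of `dc.create_df_rd`. It only mirrors the call signature used in the cell above, and it assumes that `None` means a standard normal column and that every category level is equally likely. Because it uses its own random generator, the values it produces will not match the table above.

```python
import numpy as np
import pandas as pd


def make_random_df(size, seed, numerics, booleans, categories):
    """Hypothetical stand-in for dc.create_df_rd, built on NumPy only."""
    rng = np.random.default_rng(seed)
    columns = {"id": np.arange(size)}

    # Numeric columns: None means standard normal, otherwise [mean, std].
    for name, params in numerics.items():
        mean, std = params if params is not None else (0.0, 1.0)
        columns[name] = rng.normal(loc=mean, scale=std, size=size)

    # Boolean columns: True with the given probability (assumed meaning).
    for name, prob_true in booleans.items():
        columns[name] = rng.random(size) < prob_true

    # Categorical columns: each level drawn with equal probability.
    for name, levels in categories.items():
        columns[name] = rng.choice(levels, size=size)

    return pd.DataFrame(columns)


df = make_random_df(
    size=10,
    seed=1234,
    numerics={"num1": None, "num2": [3, 1]},
    booleans={"bool1": 0.5},
    categories={"a": [1, 2, 3], "b": ["F", "M"]},
)
print(df.dtypes)
```

Fixing the seed makes the sketch reproducible: two calls with the same arguments return identical frames.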


## EDAs
In this section, we test the exploratory data analysis (EDA) of a dataset using the functions available in toolkit/data/eda.py.

In [3]:
# let's test the format-check function on the iris dataset
iris = datasets.load_iris(as_frame=True)
data = iris.frame
data["target"] = data["target"].map(
    {0: "setosa", 1: "versicolor", 2: "virginica"}
)
data["target"] = data["target"].astype("category")

eda.check_format_df(df=data, digits=2)

❌ : There are columns with names that are not in accepted format:
❌ : 	- The column sepal length (cm) should have been sepal-length-(-cm)
❌ : 	- The column sepal width (cm) should have been sepal-width-(-cm)
❌ : 	- The column petal length (cm) should have been petal-length-(-cm)
❌ : 	- The column petal width (cm) should have been petal-width-(-cm)
😃 : ALl columns have correct type
😃 : All the columns are full
😃 : All the columns are finite


| | columns_snake_case | data_type | type_cat_or_num | count | missing_rows_pct | infinite_rows_pct | mean | std | min | 25% | 50% | 75% | max | num_levels | top | freq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sepal length (cm) | sepal-length-(-cm) | float64 | Numerical | 150.0 | 0.0 | 0.0 | 5.84 | 0.83 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 | 35 | | |
| sepal width (cm) | sepal-width-(-cm) | float64 | Numerical | 150.0 | 0.0 | 0.0 | 3.06 | 0.44 | 2.0 | 2.8 | 3.0 | 3.3 | 4.4 | 23 | | |
| petal length (cm) | petal-length-(-cm) | float64 | Numerical | 150.0 | 0.0 | 0.0 | 3.76 | 1.77 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 | 43 | | |
| petal width (cm) | petal-width-(-cm) | float64 | Numerical | 150.0 | 0.0 | 0.0 | 1.2 | 0.76 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 | 22 | | |
| target | target | category | Categorical | 150.0 | 0.0 | 0.0 | | | | | | | | 3 | setosa | 50.0 |
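
For readers who want to see how a few of the columns in this summary could be derived, here is a minimal sketch using plain pandas. The function `summarize_columns` is hypothetical and is not how `eda.check_format_df` is implemented. It reproduces only a subset of the columns, and it assumes that the `_pct` columns are percentages of rows and that `digits` controls rounding.

```python
import numpy as np
import pandas as pd


def summarize_columns(df, digits=2):
    """Hypothetical per-column summary; not the toolkit's implementation."""
    rows = {}
    for name in df.columns:
        col = df[name]
        # Treat booleans as non-numeric for the purpose of this sketch.
        is_num = pd.api.types.is_numeric_dtype(col) and not pd.api.types.is_bool_dtype(col)
        row = {
            "data_type": str(col.dtype),
            "type_cat_or_num": "Numerical" if is_num else "Categorical",
            "count": int(col.notna().sum()),
            # Assumed to be percentages of rows, rounded to `digits`.
            "missing_rows_pct": round(100 * col.isna().mean(), digits),
            "infinite_rows_pct": round(100 * np.isinf(col).mean(), digits) if is_num else 0.0,
            "num_levels": col.nunique(),
        }
        if is_num:
            row["mean"] = round(col.mean(), digits)
            row["std"] = round(col.std(), digits)
        rows[name] = row
    # Keys missing for a column (mean/std of categoricals) become NaN.
    return pd.DataFrame.from_dict(rows, orient="index")


example = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 3.0],
    "label": pd.Categorical(["a", "b", "a", "a"]),
})
summary = summarize_columns(example)
print(summary)
```

The remaining numeric columns (`min`, the quartiles, `max`) and the categorical ones (`top`, `freq`) correspond to what `DataFrame.describe` reports for numeric and categorical data respectively.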
