### <H1 align="center">Synthetic data</H1>

In [None]:
source("genData.R") # load the data generation function
nrow <- 1000 # number of observations in each cluster
sd <- c(0.7, 0.8, 0.2, 0.6, 0.8) # standard deviation of each cluster
real.centers <- list(x = c(-2, 2, 0, -2, 2), y = c(2, 1, 0, -1, -2)) # the true centers of the clusters
seed <- 1234 # fixed seed so the generated data are reproducible

# data generation
data <- genData(nrow, sd, real.centers, seed) # total: 5000 bivariate obs (1000 obs for each group)

In [None]:
df <- as.data.frame(data)
head(df)
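
The source of `genData.R` is not shown here, so its exact behaviour is unknown. As a rough sketch only, assuming each cluster is an isotropic bivariate Gaussian around its center, a stand-in generator could look like the following. The function name `gen_clusters`, its argument names, and the `cluster` column are hypothetical and are not part of `genData.R`.

```r
# Hypothetical stand-in for genData(); the real implementation may differ.
gen_clusters <- function(n, sds, centers, seed) {
  set.seed(seed)                       # make the draw reproducible
  k <- length(sds)                     # number of clusters
  do.call(rbind, lapply(seq_len(k), function(i) {
    data.frame(
      x = rnorm(n, mean = centers$x[i], sd = sds[i]),
      y = rnorm(n, mean = centers$y[i], sd = sds[i]),
      cluster = i                      # true cluster label (assumed column)
    )
  }))
}

toy <- gen_clusters(
  n = 1000,
  sds = c(0.7, 0.8, 0.2, 0.6, 0.8),
  centers = list(x = c(-2, 2, 0, -2, 2), y = c(2, 1, 0, -1, -2)),
  seed = 1234
)
dim(toy)  # 5000 rows, 3 columns
```

Keeping the true label next to the coordinates makes it easy to compare a clustering result against the known grouping later on.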

### <H1 align="center">Titanic</H1>

<img src="./images/RMS_Titanic_2.jpg" align="center">

Overview (from Kaggle):
https://www.kaggle.com/c/titanic/data

The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

Data fields

- survival - Survival (0 = No, 1 = Yes)
- pclass - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- sex - Sex
- Age - Age in years
- sibsp - Number of siblings / spouses aboard the Titanic
- parch - Number of parents / children aboard the Titanic
- ticket - Ticket number
- fare - Passenger fare
- cabin - Cabin number
- embarked - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [None]:
titanic <- read.csv("data/TITANIC/train.csv")

In [None]:
head(titanic)
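
The coded fields and the feature engineering mentioned in the overview can be illustrated on a few made-up rows. This is a sketch only: the rows are invented, the column names are assumed to follow the capitalised headers of Kaggle's `train.csv` (check `names(titanic)`), and `FamilySize` is a hypothetical engineered feature, not something the data set provides.

```r
# Made-up rows; column names are an assumption about the CSV header.
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 3, 2),
  Sex      = c("male", "female", "female", "male"),
  SibSp    = c(1, 1, 0, 0),
  Parch    = c(0, 0, 0, 2)
)

# Recode the integer codes as labelled factors.
toy$Survived <- factor(toy$Survived, levels = c(0, 1), labels = c("No", "Yes"))
toy$Pclass   <- factor(toy$Pclass, levels = 1:3,
                       labels = c("1st", "2nd", "3rd"), ordered = TRUE)

# Engineered feature: siblings/spouses + parents/children + the passenger.
toy$FamilySize <- toy$SibSp + toy$Parch + 1
toy
```

Labelled factors make tables and plots easier to read than raw 0/1 or 1/2/3 codes, and most R modelling functions treat them as categorical automatically.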

 ### <H1 align="center"> Covertype</H1>


<img src="./images/covtype.jpg">

The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. Each observation is a 30m x 30m patch. You are asked to predict an integer classification for the forest cover type. The seven types are:

- 1 - Spruce/Fir
- 2 - Lodgepole Pine
- 3 - Ponderosa Pine
- 4 - Cottonwood/Willow
- 5 - Aspen
- 6 - Douglas-fir
- 7 - Krummholz


The data set (581012 observations) contains both features and the Cover_Type.

Data Fields

- Elevation - Elevation in meters
- Aspect - Aspect in degrees azimuth
- Slope - Slope in degrees
- Horizontal_Distance_To_Hydrology - Horz Dist to nearest surface water features
- Vertical_Distance_To_Hydrology - Vert Dist to nearest surface water features
- Horizontal_Distance_To_Roadways - Horz Dist to nearest roadway
- Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice
- Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice
- Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice
- Horizontal_Distance_To_Fire_Points - Horz Dist to nearest wildfire ignition points
- Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation
- Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation
- Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation

The wilderness areas are:

- 1 - Rawah Wilderness Area
- 2 - Neota Wilderness Area
- 3 - Comanche Peak Wilderness Area
- 4 - Cache la Poudre Wilderness Area

There are 40 different soil types, one per binary Soil_Type column.

https://www.kaggle.com/c/forest-cover-type-prediction/data

In [None]:
covertype <- read.csv("data/covtype.full.csv")

In [None]:
head(covertype)
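
Because the wilderness area is stored as four binary columns, it can be convenient to collapse them into a single factor. The sketch below uses a made-up one-hot block. The column names `Wilderness_Area1` to `Wilderness_Area4` are an assumption about the CSV header and should be checked against `names(covertype)`.

```r
# Made-up one-hot block; column names are assumed, not taken from the file.
toy <- data.frame(
  Wilderness_Area1 = c(1, 0, 0, 0),
  Wilderness_Area2 = c(0, 0, 1, 0),
  Wilderness_Area3 = c(0, 1, 0, 0),
  Wilderness_Area4 = c(0, 0, 0, 1)
)
area_names <- c("Rawah", "Neota", "Comanche Peak", "Cache la Poudre")

cols <- grep("^Wilderness_Area", names(toy))

# Sanity check: every row must have exactly one indicator set.
stopifnot(all(rowSums(toy[cols]) == 1))

# Index of the column holding the 1 in each row, mapped to the area name.
idx <- max.col(as.matrix(toy[cols]), ties.method = "first")
toy$Wilderness <- factor(idx, levels = 1:4, labels = area_names)
toy$Wilderness
```

The same pattern would apply to the Soil_Type columns, provided the one-indicator-per-row check passes.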

 ### <H1 align="center">Iris</H1>

<img src="./images/iris.jpg">

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

In [None]:
iris <- read.csv("data/iris_h.csv")

In [None]:
head(iris)
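
The linear discriminant model mentioned above can be sketched with `lda()` from the MASS package. This example deliberately uses R's built-in copy of the data (`datasets::iris`) rather than `data/iris_h.csv`, whose column names are not shown here, so treat it as an illustration and not as a reproduction of Fisher's original analysis.

```r
library(MASS)  # provides lda(); MASS ships with standard R distributions

# Fit a linear discriminant model on all four measurements.
fit  <- lda(Species ~ ., data = datasets::iris)
pred <- predict(fit, datasets::iris)$class

# Confusion matrix and training accuracy. The accuracy is optimistic,
# because the same observations are used for fitting and evaluation.
table(actual = datasets::iris$Species, predicted = pred)
mean(pred == datasets::iris$Species)
```

For an honest estimate of performance, the model would need to be evaluated on held-out data or with cross-validation.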

 ### <H1 align="center">Swiss Banknote</H1>

<img src="./images/money.jpg">

Data Set Information:

Data were extracted from images taken of genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images.


Attribute Information:

1. variance of Wavelet Transformed image (continuous) 
2. skewness of Wavelet Transformed image (continuous) 
3. kurtosis of Wavelet Transformed image (continuous)
4. entropy of image (continuous) 
5. class (integer) 



In [24]:
banknote <- read.csv("data/data_banknote_authentication.txt", header = FALSE)

In [25]:
head(banknote)

V1,V2,V3,V4,V5
3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
3.866,-2.6383,1.9242,0.10645,0
3.4566,9.5228,-4.0112,-3.5944,0
0.32924,-4.4552,4.5718,-0.9888,0
4.3684,9.6718,-3.9606,-3.1625,0
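
Because the file is read with `header = FALSE`, the columns get the default names `V1` to `V5`. A minimal sketch of attaching descriptive names, in the order of the attribute list above, is shown below on the first three rows of the printed output. The short names themselves are a choice made for this example.

```r
# First three rows copied from the head(banknote) output above.
toy <- read.csv(text = "3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
3.866,-2.6383,1.9242,0.10645,0", header = FALSE)

# Names follow the order of the attribute list; the labels are illustrative.
names(toy) <- c("variance", "skewness", "kurtosis", "entropy", "class")

# The class is a category, not a quantity, so store it as a factor.
toy$class <- factor(toy$class)
str(toy)
```

The same two assignments can be applied to the full `banknote` data frame.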


 ### <H1 align="center">Higgs Boson</H1>

<img src="./images/higgs.png">

All variables are floating point, except PRI_jet_num, which is an integer.

- Variables prefixed with PRI (for PRImitives) are “raw” quantities about the bunch collision as measured by the detector.
- Variables prefixed with DER (for DERived) are quantities computed from the primitive features, which were selected by the physicists of ATLAS.
- Label is the response: s for signal, b for background.

It can happen that for some entries some variables are meaningless or cannot be computed; in this case, their value is −999.0, which is outside the normal range of all variables.

https://www.kaggle.com/c/higgs-boson

In [1]:
higgs <- read.csv("data/higgs_train_10k.csv")

In [2]:
head(higgs)

EventId,DER_mass_MMC,DER_mass_transverse_met_lep,DER_mass_vis,DER_pt_h,DER_deltaeta_jet_jet,DER_mass_jet_jet,DER_prodeta_jet_jet,DER_deltar_tau_lep,DER_pt_tot,⋯,PRI_jet_num,PRI_jet_leading_pt,PRI_jet_leading_eta,PRI_jet_leading_phi,PRI_jet_subleading_pt,PRI_jet_subleading_eta,PRI_jet_subleading_phi,PRI_jet_all_pt,Weight,Label
100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,⋯,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,0.002653311,s
100001,160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,2.078,⋯,1,46.226,0.725,1.158,-999.0,-999.0,-999.0,46.226,2.233584487,b
100002,-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,9.336,⋯,1,44.251,2.053,-2.028,-999.0,-999.0,-999.0,44.251,2.347388944,b
100003,143.905,81.417,80.943,0.414,-999.0,-999.0,-999.0,3.31,0.414,⋯,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,5.446378212,b
100004,175.864,16.915,134.805,16.405,-999.0,-999.0,-999.0,3.891,16.405,⋯,0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,0.0,6.245332687,b
100005,89.744,13.55,59.149,116.344,2.636,284.584,-0.54,1.362,61.619,⋯,3,90.547,-2.412,-0.653,56.165,0.224,3.106,193.66,0.083414031,b
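
Since −999.0 is a placeholder and not a measurement, it is usually safer to turn it into `NA` before computing summaries. The sketch below does this on a small data frame built from three of the rows and four of the columns printed above. Whether to recode, and how to treat the resulting missing values, is a modelling decision that the data description does not settle.

```r
# Values copied from the first three rows of the head(higgs) output.
toy <- data.frame(
  DER_mass_MMC         = c(138.47, 160.937, -999.0),
  DER_deltaeta_jet_jet = c(0.91, -999.0, -999.0),
  PRI_jet_num          = c(2L, 1L, 1L),
  Label                = c("s", "b", "b")
)

# Replace the -999 sentinel with NA in every numeric column.
num_cols <- sapply(toy, is.numeric)
toy[num_cols] <- lapply(toy[num_cols], function(v) replace(v, v == -999, NA))

# Count the missing values per column.
colSums(is.na(toy))
```

Without this step, a call such as `mean(higgs$DER_mass_MMC)` would silently average the sentinel values together with the real ones.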
