## The fruits dataset

Below, we generate a toy dataset of fruits with different colors and diameters and save it to a CSV file. Diameters are drawn from normal distributions and only positive values are meant to be kept, so the group sizes are approximate. The dataset contains roughly:
- 500 grapes with a mean diameter of 1.5 cm (standard deviation 0.9 cm) and a color chosen at random from green or red.
- 400 ripe apples with a mean diameter of 7 cm (standard deviation 3 cm) and a color chosen at random from green, red or yellow.
- 100 unripe apples with a mean diameter of 3 cm (standard deviation 2 cm), all of which are green.
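
To get a feel for how far the actual group sizes can fall below the nominal ones, the sketch below estimates how many samples are expected to survive a positive-diameter filter. It assumes the means and standard deviations used in the generating code (0.9, 3 and 2 cm); the helper `fraction_positive` and the `groups` dictionary are illustrative names, not part of the notebook.

```python
import math


def fraction_positive(mean, std):
    """Probability that a Normal(mean, std) draw is greater than zero."""
    return 0.5 * (1.0 + math.erf(mean / (std * math.sqrt(2.0))))


# (nominal count, mean diameter in cm, standard deviation in cm)
groups = {
    "grapes": (500, 1.5, 0.9),
    "ripe apples": (400, 7, 3),
    "unripe apples": (100, 3, 2),
}

# Expected number of samples per group that have a positive diameter.
expected_counts = {
    name: count * fraction_positive(mean, std)
    for name, (count, mean, std) in groups.items()
}

for name, count in expected_counts.items():
    print(f"{name}: about {count:.0f} samples expected to remain")
```

These are expectations only; the counts in any single run depend on the random seed.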

In [1]:
import pandas as pd 
import numpy as np

rnd = np.random.RandomState(0)

num_grapes, loc_grapes, scale_grapes = 500, 1.5, 0.9
num_apples, loc_apples, scale_apples = 400, 7, 3
num_unripe_apples, loc_unripe_apples, scale_unripe_apples = 100, 3, 2

# Grapes:
diameter_grapes = rnd.normal(size=num_grapes, loc=loc_grapes, scale=scale_grapes)
diameter_grapes = diameter_grapes[diameter_grapes > 0]
colors_grapes = rnd.choice(['Red', 'Green'], size=len(diameter_grapes))
labels_grapes = rnd.choice(['Grape'], size=len(diameter_grapes))

# Apples:
diameter_apples = rnd.normal(size=num_apples, loc=loc_apples, scale=scale_apples)
diameter_apples = diameter_apples[diameter_apples > 0] 
colors_apples = rnd.choice(['Red', 'Green', 'Yellow'], size=len(diameter_apples))
labels_apples = rnd.choice(['Apple'], size=len(diameter_apples))

# Unripe apples:
diameter_unripe_apples = rnd.normal(size=num_unripe_apples, loc=loc_unripe_apples, scale=scale_unripe_apples)
diameter_unripe_apples = diameter_unripe_apples[diameter_unripe_apples > 0]
colors_unripe_apples = rnd.choice(['Green'], size=len(diameter_unripe_apples))
labels_unripe_apples = rnd.choice(['Apple'], size=len(diameter_unripe_apples))

# We stack the colors, diameters and fruit-labels respectively, and combine them into a dataset:
colors = np.hstack((colors_grapes, colors_apples, colors_unripe_apples))
diameters = np.hstack((diameter_grapes, diameter_apples, diameter_unripe_apples))
labels = np.hstack((labels_grapes, labels_apples, labels_unripe_apples))

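# Stacking strings and floats casts every value to a string; after transposing, each row is one fruit.
data = np.vstack((colors, diameters, labels)).T
# Shuffle the rows with the seeded generator so the result is reproducible.
rnd.shuffle(data)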

pd.DataFrame(data).to_csv("fruits-data.csv", header=["Color", "Diameter", "Label"])
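
Because `to_csv` is called without `index=False`, the file also contains the DataFrame index as an unnamed first column. The sketch below shows one way such a file could be read back. It uses a tiny made-up sample and an in-memory buffer instead of `fruits-data.csv`, so the values and the names `sample`, `buffer` and `fruits` are illustrative assumptions only.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for the stacked string array (values are made up).
sample = np.array([
    ["Red", "1.4", "Grape"],
    ["Green", "6.8", "Apple"],
    ["Green", "2.9", "Apple"],
])

# Write with the same to_csv arguments, but to an in-memory buffer.
buffer = io.StringIO()
pd.DataFrame(sample).to_csv(buffer, header=["Color", "Diameter", "Label"])
buffer.seek(0)

# index_col=0 consumes the unnamed index column written by to_csv.
fruits = pd.read_csv(buffer, index_col=0)

# Diameter was stored as text in the array but is parsed back as a float.
print(fruits.dtypes)
print(fruits.groupby("Label")["Diameter"].mean())
```

Reading the real file would work the same way with `pd.read_csv("fruits-data.csv", index_col=0)`.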