### Random Dataset Generation ###

It's sometimes useful to generate your own dataset, either for testing a model or for demonstration purposes.
We could of course create a random dataset manually in Excel, but why do so when it is faster in Python?

In [1]:
import numpy as np
import pandas as pd

num_rows = 500 # Number of rows to generate

In [2]:
# Dataset headers
column_headers = ['No', 'Gender', 'Height', 'Weight', 'Shoe Size',
                  'Shopping Satisfaction Offline', 'Shopping Satisfaction Online',
                  'Average Spent Per Month']

### Sample Generation ###

Let's start with the running row number and the categorical columns.
The function np.random.choice draws 'num_rows' samples from the given list, with the probability of each choice specified in p. The probabilities in p must sum to 1.

In [3]:
no_answers = np.arange(1,num_rows+1)

In [4]:
gender_list = ['Male', 'Female', 'LGBT']
gender = np.random.choice(gender_list, num_rows, p=[0.4, 0.4, 0.2])

In [5]:
satisfaction_list = ['High', 'Medium', 'Low']
satisfaction_offline = np.random.choice(satisfaction_list, num_rows, p=[0.4, 0.4, 0.2])
satisfaction_online = np.random.choice(satisfaction_list, num_rows, p=[0.3, 0.4, 0.3])
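As a quick sanity check, the sampled proportions should land close to the probabilities passed in p. The following is a minimal sketch, not part of the original notebook: it assumes NumPy's newer Generator API (np.random.default_rng) with an arbitrarily chosen seed so that the draw is reproducible, and the names rng, probs, and observed are illustrative only.

```python
import numpy as np

# Assumption: a seeded Generator is used here for reproducibility;
# the notebook itself calls the legacy np.random.choice without a seed.
rng = np.random.default_rng(42)

num_rows = 500
satisfaction_list = ['High', 'Medium', 'Low']
probs = [0.4, 0.4, 0.2]  # must sum to 1

# Draw num_rows labels according to probs
sample = rng.choice(satisfaction_list, size=num_rows, p=probs)

# Count how often each label was drawn and convert to proportions
values, counts = np.unique(sample, return_counts=True)
observed = dict(zip(values.tolist(), (counts / num_rows).tolist()))
print(observed)
```

With only 500 rows the observed proportions will differ slightly from p; the gap shrinks as num_rows grows.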

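For the height, weight, and shoe size we can use the randint function with a range; I shall just assume that these are nice round numbers for simplicity. Note that randint excludes the upper bound, so randint(110,200,num_rows) produces heights from 110 to 199. The average spent per month is drawn with the uniform function instead, which returns floating-point values.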

In [6]:
height = np.random.randint(110,200,num_rows)
weight = np.random.randint(30,100,num_rows)
shoe_size = np.random.randint(4,12,num_rows)
monthly_spending = np.random.uniform(100,10000,num_rows)
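The bounds are easy to get wrong, so here is a small sketch that checks them. It is an illustration under assumptions rather than the notebook's own code: the seeded Generator (rng), the variable height_inclusive, and the rounding of the spending to two decimals are all additions made for this example.

```python
import numpy as np

num_rows = 500

# Legacy call as used in the notebook: the upper bound 200 is exclusive
height = np.random.randint(110, 200, num_rows)
print(height.min(), height.max())  # max is never above 199

# Assumption: Generator API with an arbitrary seed. endpoint=True makes
# the upper bound inclusive, so 200 itself can be drawn.
rng = np.random.default_rng(0)
height_inclusive = rng.integers(110, 200, size=num_rows, endpoint=True)

# Uniform floats in [100, 10000), rounded to two decimals like currency
monthly_spending = np.round(rng.uniform(100, 10000, num_rows), 2)
print(monthly_spending[:5])
```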

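Now let's create a pandas dataframe from all of this randomly generated data. Each array is passed in as a row, so the dataframe has to be transposed afterwards to turn the arrays into columns.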

In [8]:
responses = pd.DataFrame([no_answers, gender, height, weight, shoe_size,
                         satisfaction_offline, satisfaction_online,
                         monthly_spending])

In [10]:
responses = responses.transpose()
responses.columns = column_headers
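One side effect of building the frame from a list of mixed-type arrays and transposing it is that the columns typically end up with the generic object dtype, even the numeric ones. A possible alternative, sketched here with a reduced set of columns and an arbitrarily seeded Generator (both assumptions for illustration, as is the name responses_typed), is to pass a dict that maps each header to its array, which keeps the numeric dtypes and needs no transpose.

```python
import numpy as np
import pandas as pd

# Assumption: seeded Generator and a small row count, for illustration only
rng = np.random.default_rng(1)
num_rows = 5

# Map each column header directly to its array of values
columns = {
    'No': np.arange(1, num_rows + 1),
    'Gender': rng.choice(['Male', 'Female', 'LGBT'], num_rows, p=[0.4, 0.4, 0.2]),
    'Height': rng.integers(110, 200, num_rows),
    'Average Spent Per Month': rng.uniform(100, 10000, num_rows),
}

# Each dict entry becomes one column with its own dtype
responses_typed = pd.DataFrame(columns)
print(responses_typed.dtypes)
```

Keeping numeric dtypes means calls such as responses_typed['Height'].mean() work without an explicit conversion.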

In [11]:
responses.head()

,No,Gender,Height,Weight,Shoe Size,Shopping Satisfaction Offline,Shopping Satisfaction Online,Average Spent Per Month
0,1,Male,112,70,5,High,High,8426.54
1,2,LGBT,163,48,8,Medium,Medium,2014.68
2,3,Male,133,42,4,High,Medium,3018.07
3,4,Female,121,66,5,High,Low,3598.17
4,5,Female,171,36,11,High,Medium,4755.22


Now, let's just save this to a CSV.

In [13]:
responses.to_csv('responses.csv')
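By default to_csv also writes the dataframe index as an unnamed first column, which comes back as 'Unnamed: 0' when the file is read again. Since the data already has a 'No' column, passing index=False may be preferable. The sketch below compares the two on a tiny stand-in frame written to a temporary directory; the frame contents and the names with_index and without_index are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Assumption: a tiny stand-in for the generated responses dataframe
responses = pd.DataFrame({'No': [1, 2, 3],
                          'Gender': ['Male', 'Female', 'LGBT']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'responses.csv')

    # Default: the index is written as an extra, unnamed first column
    responses.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False: only the actual data columns are written
    responses.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))     # ['Unnamed: 0', 'No', 'Gender']
print(list(without_index.columns))  # ['No', 'Gender']
```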