## How to Generate a (pseudo)Random Data Set
#### A useful tool for practicing data science

There have been multiple times throughout my coursework when it was convenient to create a random data set for practicing Python exploratory data analysis (EDA) and visualization techniques. The code below is easy to adapt to produce whatever data table you need.


In [1]:
# Import appropriate libraries
import pandas as pd # A very popular dataframe library
import random # Library containing randomizing functions

If you have trouble installing pandas, please consult the online documentation (the `random` module is part of Python's standard library, so it needs no installation). I'm currently using Anaconda3 and install packages with `conda install`. Again, please consult the documentation for your specific setup.

In [2]:
# Creating parameters for upcoming random functions
store_location = ["Denver", "Seattle", "Houston", "Boston", "New York City", "London", "Beijing", "Tokyo", "Seoul"] 
cust_count = [random.randint(1000, 10000) for _ in range(100)] # Random number of daily customers
daily_sales = [random.randint(50000, 1000000) for _ in range(100)] # Random daily sales revenue
units_sold = [random.randint(2000, 50000) for _ in range(100)] # Random number of units sold
day = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"] # Day of the week
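Because these values are pseudo-random, the lists above will differ on every run. If reproducible practice data is wanted, seeding the generator first is one option. A minimal sketch, assuming an arbitrary seed of 42 and illustrative variable names (`first_run`, `second_run`) that are not part of the original notebook:

```python
import random

random.seed(42)  # 42 is an arbitrary seed chosen for illustration
first_run = [random.randint(1000, 10000) for _ in range(5)]

random.seed(42)  # Re-seeding restarts the same pseudo-random sequence
second_run = [random.randint(1000, 10000) for _ in range(5)]

print(first_run == second_run)  # True: same seed, same values
```

Calling `random.seed` once at the top of the notebook would make every later cell repeatable as long as the cells run in the same order.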

In [3]:
# Define the function to generate a single line of data
def single_entry(store_location, cust_count, units_sold, daily_sales, day): # Define function name and list of parameters
    '''(list of str, list of int, list of int, list of int, list of str) -> dict
    
    Return one randomly generated row of data. Each value is a one-element
    list, because random.sample always returns a list.
    
    >>> single_entry(store_location, cust_count, units_sold, daily_sales, day)
    {'Location': ['Seattle'],
     'Customers': [6437],
     'Sales': [534274],
     'Units': [14263],
     'Day': ['Tuesday']}
     
    '''
    return {"Location":random.sample(store_location,1), # random.sample(population, k) returns a list of k unique elements chosen from population
            "Customers":random.sample(cust_count,1),    # For example: 'Customers' will be a list holding one random int from cust_count
            "Sales":random.sample(daily_sales,1),
            "Units":random.sample(units_sold,1),
            "Day":random.sample(day,1)}

Now that the function is built, the last step is to test that it works. Let's see what happens:

In [4]:
single_entry(store_location, cust_count, units_sold, daily_sales, day)

{'Location': ['Houston'],
 'Customers': [7908],
 'Sales': [664856],
 'Units': [48826],
 'Day': ['Tuesday']}

Success! We have defined our first function, which generates a single row of data, and we can use it to build a second function. The second function lets us specify how many rows of data to generate for our analysis. As you can imagine, you can create sizable "fake" or "random" data sets to play around with in a short period of time.
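Each value in the output above is wrapped in a one-element list because `random.sample` always returns a list. If bare values are preferred, one option is `random.choice`, which returns a single element. A minimal sketch, assuming a hypothetical `single_entry_flat` variant and small stand-in lists (neither is part of the original notebook):

```python
import random

def single_entry_flat(store_location, cust_count, units_sold, daily_sales, day):
    '''Hypothetical variant of single_entry that returns bare values, not lists.'''
    return {"Location": random.choice(store_location),  # One str, not ['str']
            "Customers": random.choice(cust_count),     # One int, not [int]
            "Sales": random.choice(daily_sales),
            "Units": random.choice(units_sold),
            "Day": random.choice(day)}

# Small stand-in lists so the sketch runs on its own
locations = ["Denver", "Seattle", "Houston"]
customers = [random.randint(1000, 10000) for _ in range(10)]
sales = [random.randint(50000, 1000000) for _ in range(10)]
units = [random.randint(2000, 50000) for _ in range(10)]
days = ["Monday", "Tuesday", "Wednesday"]

row = single_entry_flat(locations, customers, units, sales, days)
print(row)
```

With bare values, the numeric columns of a dataframe built from such rows hold integers instead of lists, which is usually what later aggregation steps expect.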

In [5]:
# Define a function to specify number of random entries in the data set.
def how_many(k):
    '''(int) -> list of dict
    
    Return k randomly generated rows of data using the single_entry function.
    
    >>> how_many(2)
    [{'Location': ['Seoul'],
      'Customers': [6437],
      'Sales': [534274],
      'Units': [14263],
      'Day': ['Tuesday']},
     {'Location': ['Houston'],
      'Customers': [7752],
      'Sales': [898774],
      'Units': [13563],
      'Day': ['Sunday']}]
     
    '''
    return [single_entry(store_location, cust_count, units_sold, daily_sales, day) for _ in range(k)]

With the function design complete, it's time to test. For the first run we'll set `k` to 5 and see what happens:

In [6]:
how_many(5)

[{'Location': ['Denver'],
  'Customers': [2036],
  'Sales': [520452],
  'Units': [48826],
  'Day': ['Wednesday']},
 {'Location': ['Seoul'],
  'Customers': [4348],
  'Sales': [991331],
  'Units': [10977],
  'Day': ['Sunday']},
 {'Location': ['London'],
  'Customers': [2619],
  'Sales': [418218],
  'Units': [37657],
  'Day': ['Tuesday']},
 {'Location': ['Seoul'],
  'Customers': [2387],
  'Sales': [238283],
  'Units': [6135],
  'Day': ['Wednesday']},
 {'Location': ['New York City'],
  'Customers': [3463],
  'Sales': [510075],
  'Units': [37861],
  'Day': ['Saturday']}]

Everything looks good. Now it's time to call on pandas to put this data into a dataframe, display it as a simple table, and get it ready for analysis.

In [7]:
df = pd.DataFrame(how_many(30), columns=['Location','Customers','Sales','Units','Day']) # Generate 30 rows, load them into a dataframe, and assign it to the variable df

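__Important Note:__ The column names must exactly match the dictionary keys returned by `single_entry` (not the function's parameter names). I tried to abbreviate 'Customers' to 'Cust' and got a column filled entirely with `NaN`, because no key named 'Cust' exists in the generated rows.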
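The behavior described in the note above can be reproduced in a few lines. A minimal sketch, assuming two hand-built rows shaped like the output of `how_many` (the names `rows`, `mismatched`, and `renamed` are illustrative, not from the original notebook): asking for a column label that is not a key in the row dictionaries produces a column of `NaN` rather than raising an error, and renaming after construction is one way to get the shorter label.

```python
import pandas as pd

# Two rows shaped like the output of how_many, trimmed to two keys
rows = [{"Location": ["Denver"], "Customers": [2884]},
        {"Location": ["Boston"], "Customers": [1615]}]

# 'Cust' is not a key in the dictionaries, so pandas fills that column with NaN
mismatched = pd.DataFrame(rows, columns=["Location", "Cust"])
print(mismatched["Cust"].isna().all())  # True

# Build with the real keys first, then rename the column
renamed = pd.DataFrame(rows, columns=["Location", "Customers"]).rename(
    columns={"Customers": "Cust"})
print(list(renamed.columns))  # ['Location', 'Cust']
```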

In [8]:
df # Display the dataframe built in the previous cell

|   | Location | Customers | Sales | Units | Day |
|---|----------|-----------|-------|-------|-----|
| 0 | [Denver] | [2884] | [841461] | [47700] | [Saturday] |
| 1 | [Beijing] | [2418] | [532449] | [30729] | [Monday] |
| 2 | [Beijing] | [4830] | [677490] | [12516] | [Thursday] |
| 3 | [London] | [6120] | [434720] | [16690] | [Friday] |
| 4 | [Houston] | [6360] | [404086] | [39442] | [Wednesday] |
| 5 | [Houston] | [2916] | [457370] | [23421] | [Saturday] |
| 6 | [Boston] | [1615] | [710899] | [32734] | [Monday] |
| 7 | [Houston] | [9341] | [418994] | [22475] | [Thursday] |
| 8 | [Houston] | [2433] | [886132] | [12771] | [Wednesday] |
| 9 | [Seattle] | [1875] | [884902] | [22090] | [Friday] |


There we have it. We now have a basic understanding of how to build our own "dummy" data. In a future post I will use these functions to create a data set for practicing exploratory statistical data analysis and consider ways to create "dummy" pixel data for future convolutional neural networks.