As skateboarding fans, what could be better than having our own online store selling skateboards and accessories? We want to create CSV files of imaginary sales for each month of last year, so we're going to generate thousands of rows of products, prices, addresses and so on.


Our data will have the following 6 columns:



In [None]:
columns = ['Order_ID', 'Product', 'Quantity_Ordered', 'Price_Each', 'Order_Date', 'Purchase_Address']

Here's the list of the products we sell, with their prices. You can find complete boards, shoes and some accessories from popular brands.

In [None]:
products = {
    'Flip Complete Board': 1400,
    'Element Complete Board': 1200,
    'Oliveira Tin Toy Deck': 800,
    '55mm Sparx 99a Ricta Wheels': 219.99,
    'Nike Nyjah 2': 758.99,
    'Nike Jacob Janowski': 778.99,
    'Polished Silver Standard Bullet Trucks': 299.99,
    'Trasher hoodie': 600, 
    'Element by Nigel Cabourn Alder 4 Jacket': 3400, 
    'Flat Bar': 1899.99,
    'Grip': 26.84,
    'Steackers': 24.99,
    'Element ‑ Visserie Allen 1': 82.95,
    'All In One Skate': 88.95,
    'Supreme Cap': 79.99,
    'T-Shirt Colorblock vans': 189.99,
    'Black Standard Bullet Trucks': 300,
    'Mercer Flora Untamed 35" Longboard': 1200.00,
    'Santa Cruz Classic Dot 41" Drop Through Longboard': 1200.00

}

Let's start by generating a simple CSV of our data with 1000 entries. For the moment, we keep Quantity_Ordered equal to 1 for each purchase and leave Order_Date and Purchase_Address empty (filled with an 'NA' placeholder).

In [None]:
import pandas as pd 
import random 

In [None]:
#empty DataFrame with the 6 columns we defined before 
df = pd.DataFrame(columns=columns)

#filling rows with randomly selected products 
for i in range(1000):
  product = random.choice(list(products.keys()))
  price = products[product]
  df.loc[i] = [i, product, 1, price, 'NA', 'NA']

df.to_csv('test_data.csv')

Now we can open the file. Here's a look at the first five rows:

In [None]:
df_test= pd.read_csv('test_data.csv')
df_test.head()

Unnamed: 0.1,Unnamed: 0,Order_ID,Product,Quantity_Ordered,Price_Each,Order_Date,Purchase_Address
0,0,0,"Santa Cruz Classic Dot 41"" Drop Through Longboard",1,1200.0,,
1,1,1,Polished Silver Standard Bullet Trucks,1,299.99,,
2,2,2,All In One Skate,1,88.95,,
3,3,3,55mm Sparx 99a Ricta Wheels,1,219.99,,
4,4,4,All In One Skate,1,88.95,,


The products look randomly chosen, and each price matches the one in the dictionary.
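The `Unnamed: 0` column in the output above appears because `to_csv` writes the DataFrame index as an unnamed first column by default. A minimal sketch, assuming we do not need that index in the file, passes `index=False`. The in-memory buffer and the two sample rows are only there to keep the example self-contained.

```python
import io
import pandas as pd

# two illustrative rows in the same shape as our test data
sample = pd.DataFrame(
    [[0, 'Grip', 1, 26.84, 'NA', 'NA'],
     [1, 'Flat Bar', 1, 1899.99, 'NA', 'NA']],
    columns=['Order_ID', 'Product', 'Quantity_Ordered',
             'Price_Each', 'Order_Date', 'Purchase_Address'])

# write to an in-memory buffer instead of a file, without the index
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# reading it back gives only the 6 columns we defined
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```

With a real file, the same idea would be `df.to_csv('test_data.csv', index=False)`.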

# selecting some products with higher probability than others 


To make our data more realistic, we want some products to show up more often than others. Grip and Steackers cost so little that they should be purchased far more often than the expensive Flat Bar. To do this, we store a weight next to each price in the dictionary: the higher the weight, the more likely the product is to be picked. For example, we expect the Flip Complete Board to sell more than the Element Complete Board, and so on.

In [None]:
products = {
    'Flip Complete Board': [1400, 10],
    'Element Complete Board': [1200, 8],
    'Oliveira Tin Toy Deck': [800, 3],
    '55mm Sparx 99a Ricta Wheels': [219.99,6],
    'Nike Nyjah 2': [758.99, 9],
    'Nike Jacob Janowski': [778.99,9],
    'Polished Silver Standard Bullet Trucks': [299.99, 11],
    'Trasher hoodie': [600, 7],
    'Element by Nigel Cabourn Alder 4 Jacket': [3400, 7],
    'Flat Bar': [1899.99, 6],
    'Grip': [26.84, 30],
    'Steackers': [24.99, 30],
    'Element ‑ Visserie Allen 1': [82.95, 30],
    'All In One Skate': [88.95, 30],
    'Supreme Cap': [79.99, 26],
    'T-Shirt Colorblock vans': [189.99, 19],
    'Black Standard Bullet Trucks': [300, 22],
    'Mercer Flora Untamed 35" Longboard': [1200.00, 1],
    'Santa Cruz Classic Dot 41" Drop Through Longboard': [1200.00, 1]
}

Let's repeat the same process, this time taking into account the weight we added for each product:

In [None]:
product_list = [product for product in products]
price_list = [products[product][0] for product in products]
weight_list = [products[product][1] for product in products]

df = pd.DataFrame(columns=columns)
for i in range(1000): 
  product = random.choices(product_list, weights=weight_list)[0]
  price = products[product][0]
  df.loc[i] = [i, product, 1, price, 'NA', 'NA']
df.to_csv('test_data.csv')
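To check that the weights behave as expected, we can count how often each product is drawn. The sketch below uses a shortened, illustrative product list (three items with weights copied from the dictionary above) and a fixed seed. `sample_products`, `sample_weights` and `draw_counts` are hypothetical names that are not used elsewhere in this notebook.

```python
import random
from collections import Counter

# shortened, illustrative subset of the products with their weights
sample_products = ['Grip', 'Flip Complete Board', 'Mercer Flora Untamed 35" Longboard']
sample_weights = [30, 10, 1]

def draw_counts(n, seed=0):
    """Draw n products with the given weights and count each one."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    draws = rng.choices(sample_products, weights=sample_weights, k=n)
    return Counter(draws)

counts = draw_counts(10000)
for product, count in counts.most_common():
    print(f'{product}: {count / 10000:.1%}')
```

The shares should come out roughly proportional to the weights: the Grip far ahead, the longboard rarely drawn.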

# generating 12 months of data in 12 CSVs

We want December to have the most orders, November the second most, and the other months to fluctuate around a common value. Let's generate data for each month, with the number of orders depending on the month, so the 1000 is no longer fixed. We want to pick an average value for each month and draw the actual number of orders around that average. A normal distribution does exactly that.

In [None]:
import numpy as np 
#We grab the month name by using the calendar library 
import calendar 

In [None]:
# arbitrary starting value for the order IDs
order_id = 143253
for month_value in range(1, 13):
  df = pd.DataFrame(columns=columns)
  if month_value == 12:
    # highest number of orders
    orders_amounth = int(np.random.normal(loc=2600, scale=300))
  elif month_value == 11:
    # second highest number of orders
    orders_amounth = int(np.random.normal(loc=2000, scale=300))
  else:
    orders_amounth = int(np.random.normal(loc=1200, scale=400))
  for i in range(orders_amounth):
    product = random.choices(product_list, weights=weight_list)[0]
    price = products[product][0]
    df.loc[i] = [order_id, product, 1, price, 'NA', 'NA']
    order_id += 1
  month_name = calendar.month_name[month_value]
  df.to_csv(f'{month_name}.csv')
  # we break here so that only the first month is generated for this test
  break
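The month logic can also be isolated in a small helper, which makes it easier to test on its own. This is only a sketch: `MONTH_PARAMS`, `DEFAULT_PARAMS` and `orders_for_month` are hypothetical names, and the clamp to zero is an added safeguard, since a normal draw can in principle be negative.

```python
import numpy as np

# (mean, standard deviation) of the number of orders, values from the loop above
MONTH_PARAMS = {12: (2600, 300), 11: (2000, 300)}
DEFAULT_PARAMS = (1200, 400)

def orders_for_month(month_value, rng=None):
    """Return a non-negative number of orders for the given month (1-12)."""
    if rng is None:
        rng = np.random.default_rng()
    loc, scale = MONTH_PARAMS.get(month_value, DEFAULT_PARAMS)
    # clamp at 0 (added safeguard, not in the original loop)
    return max(0, int(rng.normal(loc=loc, scale=scale)))

rng = np.random.default_rng(42)
for month_value in (1, 11, 12):
    print(month_value, orders_for_month(month_value, rng))
```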

# generating random addresses for our data

We want to generate realistic-looking random addresses for each row. A quick Google search gave us street names that are common in most Moroccan cities, such as Avenue Mohammed VI, along with popular cities and their zip codes. We again use weights because we want certain cities to show up more often than others. The generate_random_adress function returns a random address.

In [None]:
def generate_random_adress():
  street_names = ['Mohammed VI', 'Mohammed V', 'Abbess Ben Abdelmoutalib', 'Medina','El Siaghin', 'Essaada', 'Elfarah', 'Ennasr', 'Ville nouvelle', 'Hassan II']
  cities = ['Marrakech', 'CasaBlanca', 'Rabat', 'Tanger', 'Essaouira', 'Fes', 'ElJadida', 'Tetouan', 'Oujda', 'Salé']
  weights = [3,6,5,3,9,4,0.5,2,3,6]
  zips = ['40000', '20000', '10000', '90000', '44000', '30000', '24000', '93000', '60000', '11000']

  street = random.choice(street_names)
  index = random.choices(range(len(cities)), weights= weights)[0] 

  return f'{random.randint(1,200)} AVN {street}, {cities[index]} {zips[index]}, Morocco'

Here's an example of an address:

In [None]:
print(generate_random_adress())

66 AVN Hassan II, Oujda 60000, Morocco
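Because the cities, weights and zip codes live in three parallel lists, they can drift out of sync when one list is edited. One possible alternative, sketched here with only three of the cities, keeps each city together with its zip code and weight in a single tuple. `CITY_DATA` and `pick_city` are hypothetical names.

```python
import random

# (city, zip code, weight): three entries taken from the lists above
CITY_DATA = [
    ('Marrakech', '40000', 3),
    ('CasaBlanca', '20000', 6),
    ('Rabat', '10000', 5),
]

def pick_city(rng=random):
    """Return a (city, zip code) pair chosen according to the weights."""
    weights = [weight for _, _, weight in CITY_DATA]
    city, zip_code, _ = rng.choices(CITY_DATA, weights=weights)[0]
    return city, zip_code

city, zip_code = pick_city()
print(f'{city} {zip_code}')
```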


# generating order times for purchases

Let's fill in the last NA and generate a random order date for each row of data. We also want the purchase times to peak around noon and 8 pm, with all other times spread around those two peaks. The generate_order_time function returns a date in the form "m/d/y H:M".

In [None]:
import datetime as dt

In [None]:
def generate_order_time(month_value):
  #number of days for each month
  day_range = calendar.monthrange(2019, month_value)[1]
  random_day = random.randint(1,day_range)
  if random.random() < 0.5 :  
    date = dt.datetime(2019, month_value, random_day, 12, 0)
  else : 
    date = dt.datetime(2019, month_value, random_day, 20, 0)
  time_offset = np.random.normal(loc=0, scale=180)

  final_date = date + dt.timedelta(minutes = time_offset)

  return final_date.strftime('%m/%d/%y %H:%M')
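One thing to be aware of: because the offset is added after the day is chosen, an order placed late on the last day of a month can spill into the next month (for example 20:00 plus five hours). If that matters, one option is to redraw the offset until the result stays in the requested month. The sketch below is a variant under that assumption, not the function used in the rest of this notebook. `generate_order_time_in_month` is a hypothetical name.

```python
import calendar
import datetime as dt
import random

import numpy as np

def generate_order_time_in_month(month_value, year=2019):
    """Like generate_order_time, but never leaves the requested month."""
    day_range = calendar.monthrange(year, month_value)[1]
    random_day = random.randint(1, day_range)
    # pick one of the two peaks: noon or 8 pm
    peak_hour = 12 if random.random() < 0.5 else 20
    date = dt.datetime(year, month_value, random_day, peak_hour, 0)
    while True:
        # float() turns the numpy scalar into a plain Python float
        time_offset = float(np.random.normal(loc=0, scale=180))
        final_date = date + dt.timedelta(minutes=time_offset)
        # redraw the offset if it pushed the date into another month
        if final_date.month == month_value:
            return final_date.strftime('%m/%d/%y %H:%M')

print(generate_order_time_in_month(2))
```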

# Generating a realistic quantity ordered for each product 

If we go back to our products, a Flat Bar costs 1899.99 DHS, so a client is unlikely to purchase two of them, and even less likely to purchase three. A Grip, however, costs less than 30 DHS, so a client is much more likely to purchase a few of them. The same goes for Steackers and caps. The quantity ordered of an item depends on its price. To model this, we use a geometric distribution:

In [None]:
quantity_ordered = np.random.geometric(p= 1 - 1/price, size= 1)[0]
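To see why this works: numpy's geometric distribution returns the number of trials up to and including the first success, so the result is always at least 1. With p = 1 - 1/price, the probability of ordering exactly one unit is p, and the probability of ordering more than one is 1/price. The sketch below compares two prices from our dictionary. `prob_more_than_one` is a hypothetical helper, and the sample size and seed are arbitrary.

```python
import numpy as np

def prob_more_than_one(price):
    """Probability that the quantity drawn for this price exceeds 1."""
    p = 1 - 1 / price   # success probability passed to the geometric distribution
    return 1 - p        # P(quantity > 1) = 1 / price

rng = np.random.default_rng(0)
for name, price in [('Steackers', 24.99), ('Flat Bar', 1899.99)]:
    quantities = rng.geometric(p=1 - 1 / price, size=100_000)
    share = (quantities > 1).mean()
    print(f'{name}: theory {prob_more_than_one(price):.4f}, simulated {share:.4f}')
```

Note that this formula needs a price greater than 1, otherwise p is not a valid probability.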

# Making some items more likely to be sold together

Often when you're shopping, you're not buying one item but several at a time. For example, if you order a Flip Complete Board, you would likely also pick up an All In One Skate or maybe some Black Standard Bullet Trucks.
We want these items to share the same order_id, order date and address. And to make our data a little bit messy, let's add some blank rows and some rows that repeat the column names.

In [None]:
def write_row(order_id, product, date, adress):
  price = products[product][0]
  quantity_ordered = np.random.geometric(p= 1.0 - (1.0/price), size= 1)[0]
  output = [order_id, product, quantity_ordered, price, date, adress]
  return output

In [None]:
order_id = 143253 
for month_value in range(1,13):
  df=pd.DataFrame(columns=columns)
  # Make some months have more purchases than others
  if month_value == 12 : 
    orders_amounth = int(np.random.normal(loc=2600, scale=300))
  if month_value == 11 :  
    orders_amounth = int(np.random.normal(loc=2000, scale=300))
  if month_value <= 10 : 
    orders_amounth = int(np.random.normal(loc=1200, scale=400))
  
  i=0
  while orders_amounth > 0: 
    # get a random address
    adress = generate_random_adress()
    # get a random product 
    product = random.choices(product_list, weights=weight_list)[0]
    # get a random date
    date = generate_order_time(month_value)
    # fill the row 
    df.loc[i] = write_row(order_id, product, date, adress)
    i+=1 
    # Flip Complete Board more likely to be sold with All In One Skate, Black Standard Bullet Trucks and Supreme Cap
    if product == 'Flip Complete Board':
      if random.random() < 0.15:
        df.loc[i] = write_row(order_id, 'All In One Skate', date, adress)
        i+=1
      if random.random() < 0.05:
        df.loc[i] = write_row(order_id, 'Black Standard Bullet Trucks', date, adress)
        i+=1
      if random.random() < 0.07:
        df.loc[i] = write_row(order_id, 'Supreme Cap', date, adress)
        i+=1
    # Element Complete Board and Oliveira Tin Toy Deck more likely to be sold with Element ‑ Visserie Allen 1
    elif product == 'Element Complete Board' or product == "Oliveira Tin Toy Deck":
        if random.random() < 0.18:
          df.loc[i] = write_row(order_id, "Element ‑ Visserie Allen 1", date, adress)
          i += 1
        if random.random() < 0.04:
          df.loc[i] = write_row(order_id, "T-Shirt Colorblock vans", date, adress)
          i += 1
        if random.random() < 0.07:
          df.loc[i] = write_row(order_id, "Supreme Cap", date, adress)
          i += 1 
    # 2% chance of adding another random item to the order
    if random.random() <= 0.02:
        product = random.choices(product_list, weight_list)[0]
        df.loc[i] = write_row(order_id, product, date, adress)
        i += 1
    # to make our data messy, we fill some rows with the column names
    if random.random() <= 0.002:
        df.loc[i] = columns
        i += 1
    # blank rows
    if random.random() <= 0.003:
        df.loc[i] = ["","","","","",""]
        i += 1
           
      
    order_id+=1
    orders_amounth-=1
  month_name = calendar.month_name[month_value]
  df.to_csv(f'Sales_{month_name}_2019.csv')
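Assigning rows one at a time with `df.loc[i] = ...` tends to get slow as the DataFrame grows, because the frame is enlarged on every assignment. A common alternative, sketched here with a hypothetical `build_month` helper and stand-in rows, is to collect the rows in a plain Python list and build the DataFrame once at the end.

```python
import pandas as pd

columns = ['Order_ID', 'Product', 'Quantity_Ordered',
           'Price_Each', 'Order_Date', 'Purchase_Address']

def build_month(rows):
    """Build one month's DataFrame from a list of already generated rows."""
    return pd.DataFrame(rows, columns=columns)

# stand-in rows; in the notebook these would come from write_row(...)
rows = []
for order_id in range(143253, 143256):
    rows.append([order_id, 'Grip', 1, 26.84, 'NA', 'NA'])
rows.append(columns)              # a messy row repeating the column names
rows.append([''] * len(columns))  # a blank row

df_month = build_month(rows)
print(df_month.shape)
```

In the loop above, this would mean replacing each `df.loc[i] = ...` with `rows.append(...)` and dropping the counter `i`.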


There we have it: 12 months of sales data.
To do that, we started by creating a simple DataFrame and programmatically adding rows of product purchases to it. We use the random library to select these products.

We make our data more realistic by using normal and geometric distributions from numpy to vary the number of purchases per month and the quantity of each item purchased.

We use the datetime library to generate thousands of different times for the purchases, with the most common times peaking around 12 pm and 8 pm.

We take a list of common Moroccan street names to help us randomly generate an address for each purchase.

Finally, we make some items more likely to be sold together.