# Random Data Generation
#### By Julien Dhouti
GoDaddy told me I couldn't release any of my work data :( so I had to generate random data from the work data. This is the notebook that does so.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('data/clean_data.csv')
df = df.drop(['Unnamed: 0'], axis=1)   # I don't care for the first column that was imported

I start randomly generating the data using the handy numpy library.<br />
Here in this section, I generate random customer availability based on the already given avaibility percentages.

In [3]:
min_avail = df.cust_avail_v3.min()
max_avail = df.cust_avail_v3.max()
size = df.cust_avail_v3.size

cust_avail_v3 = np.random.uniform(min_avail, max_avail, size)
cust_avail_v3

array([ 0.84789413,  0.93332949,  0.88785679,  0.88805097,  0.88787775,
        0.86995898,  0.94509068,  0.9766714 ,  1.00799917,  0.8940723 ,
        0.90486815,  0.97232378,  0.86972765,  0.83354523,  0.89412036,
        0.82985928,  1.03506816,  0.82306602,  0.92414826])

Now that I have created random data fo the cust_avail_v3 column, I can go ahead and create the new dataframe and apply that data to it.

In [4]:
df2 = pd.DataFrame()   # create a new dataframe object
df2['cust_avail_v3'] = cust_avail_v3   # apply the random column to the new dataframe
df2.head()

Unnamed: 0,cust_avail_v3
0,0.847894
1,0.933329
2,0.887857
3,0.888051
4,0.887878


This can get tedious so I should probably create a function that can do this for any column that is not a dependent variable.

In [5]:
# create the function
def generate_random(column):
    """The parameter is going to be a series. The function will do the rest."""
    col_max = column.max()  # get the max from the column
    col_min = column.min()  # get the min from the column
    size = column.size      # the size will be the same for all columns but I do this just in case
    
    d_type = column.dtype   # get the type of the column so that I can apply the right random function to it
    
    random_column = np.random.uniform(col_min, col_max, size)
        
    return random_column

Now I can take this function for a test drive by applying it to the columns `css_count`, `css_score`, `orders`, `calls_per_day`, `new_sales`, and `new_conv`.

In [6]:
df2['css_count'] = generate_random(df.css_count)
df2['css_score'] = generate_random(df.css_score)
df2['orders'] = generate_random(df.orders)
df2['calls_per_day'] = generate_random(df.calls_per_day)
df2['new_sales'] = generate_random(df.new_sales)
df2['new_conv'] = generate_random(df.new_conv)

df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv
0,0.847894,3.568448,3.391978,9.11708,32.282608,697.163707,0.015936
1,0.933329,5.244309,0.193842,2.561375,22.54871,704.840896,0.193567
2,0.887857,3.078791,3.084555,5.985653,36.40241,214.463666,0.162255
3,0.888051,4.241505,2.391848,4.007344,34.727137,281.225777,0.134914
4,0.887878,3.35047,3.642247,7.790307,35.388316,86.92491,0.044531


Not too bad huh? Some columns look a bit funky because they probably aren't supposed to have so many decimal points. I can probably fix that with a quick function though.

In [7]:
df2 = df2.round({'css_score':2, 'new_sales':2, 'orders':0,
                 'calls_per_day':0, 'css_count':0, 'cust_avail_v3':4,
                 'new_conv':4})

I should probably also convert the `css_score` column to an integer based column since that's what the dirty data looked like.

In [8]:
df2['css_count'] = pd.to_numeric(df.css_count)

What does the dataframe look like now? Well still a bit weird but definitely more logical.

In [9]:
df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv
0,0.8479,2,3.39,9.0,32.0,697.16,0.0159
1,0.9333,1,0.19,3.0,23.0,704.84,0.1936
2,0.8879,1,3.08,6.0,36.0,214.46,0.1623
3,0.8881,2,2.39,4.0,35.0,281.23,0.1349
4,0.8879,4,3.64,8.0,35.0,86.92,0.0445


I have a few columns that are left but I probably don't need to worry about randomizing the dates. If GoDaddy wants to sue me on stolen dates data, go ahead.

In [10]:
df2['date'] = df.date

Next I can probably randomize total_sales by adding a random number to the new_sales column. I can find this random number by calculating all of differences and then randomizing that array.

In [11]:
total_differences = df.sales - df.new_sales # create a series of all of the differences

# apply the same function i created to this series to create random differences
random_differences = generate_random(total_differences)
random_differences

array([ 252.35455903,   85.69640716,   82.10818321,  537.00481954,
        259.9062231 ,  256.36733083,  428.30855711,  285.34667406,
        243.47037128,  305.13039731,  535.21402767,  162.49959739,
        467.61973452,    1.15896232,  444.14429326,  291.31909667,
         59.87214947,  487.03589546,  453.32346662])

Looks kinda cool, now let's add it to the `new_sales` column of the new random dataframe and then create the `sales` column.

In [12]:
df2['sales'] = df2.new_sales + random_differences
df2 = df2.round({'sales':2})
df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv,date,sales
0,0.8479,2,3.39,9.0,32.0,697.16,0.0159,2018-04-02,949.51
1,0.9333,1,0.19,3.0,23.0,704.84,0.1936,2018-04-03,790.54
2,0.8879,1,3.08,6.0,36.0,214.46,0.1623,2018-04-04,296.57
3,0.8881,2,2.39,4.0,35.0,281.23,0.1349,2018-04-05,818.23
4,0.8879,4,3.64,8.0,35.0,86.92,0.0445,2018-04-09,346.83


Almost done! The last column that I need to create doesn't need any randomization because it's actually already been determined. I just need to look at the `sales` and `new_sales` columns. I then divide the `sales` column by the `new_sales` column and I get percentage new sales.

In [13]:
df2['new_sales_perc'] = df2.new_sales.divide(df2.sales)
df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv,date,sales,new_sales_perc
0,0.8479,2,3.39,9.0,32.0,697.16,0.0159,2018-04-02,949.51,0.734231
1,0.9333,1,0.19,3.0,23.0,704.84,0.1936,2018-04-03,790.54,0.891593
2,0.8879,1,3.08,6.0,36.0,214.46,0.1623,2018-04-04,296.57,0.723135
3,0.8881,2,2.39,4.0,35.0,281.23,0.1349,2018-04-05,818.23,0.343705
4,0.8879,4,3.64,8.0,35.0,86.92,0.0445,2018-04-09,346.83,0.250613


I can then restrict this column to 4 decimal places:

In [14]:
df2 = df2.round({'new_sales_perc':4})
df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv,date,sales,new_sales_perc
0,0.8479,2,3.39,9.0,32.0,697.16,0.0159,2018-04-02,949.51,0.7342
1,0.9333,1,0.19,3.0,23.0,704.84,0.1936,2018-04-03,790.54,0.8916
2,0.8879,1,3.08,6.0,36.0,214.46,0.1623,2018-04-04,296.57,0.7231
3,0.8881,2,2.39,4.0,35.0,281.23,0.1349,2018-04-05,818.23,0.3437
4,0.8879,4,3.64,8.0,35.0,86.92,0.0445,2018-04-09,346.83,0.2506


Looks great!

There are a few other things that I have to add to the columns so that it looks exactly the same as the previous real data, otherwise when the `april_18.ipynb` notebook is trying to clean the data, some functions won't work.

I have to add $ symbols in certain columns and % symbols in the percentage columns. I create a function for this.

In [15]:
def add_dollar(value):
    """Add a dollar sign in the front of a given value that is supposed to represent a money sum."""
    value = '$' + str(value)
    return value

def make_perc(value):
    """Convert a value into a percentage."""
    value = str(value * 100.00) + '%'
    return value

I can apply these functions using the `.apply()` function in the pandas library. This will render the function vectorized and will prevent me from having to iterate the function over every row.

In [16]:
df2['new_sales'] = df2.new_sales.apply(add_dollar)
df2['sales'] = df2.sales.apply(add_dollar)

# now for the percentages
df2['cust_avail_v3'] = df2.cust_avail_v3.apply(make_perc)
df2['new_conv'] = df2.new_conv.apply(make_perc)
df2['new_sales_perc'] = df2.new_sales_perc.apply(make_perc)

Now let's take a look at the new data and see if it's formatted the right way:

In [17]:
df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv,date,sales,new_sales_perc
0,84.78999999999999%,2,3.39,9.0,32.0,$697.16,1.59%,2018-04-02,$949.51,73.42%
1,93.33%,1,0.19,3.0,23.0,$704.84,19.36%,2018-04-03,$790.54,89.16%
2,88.79%,1,3.08,6.0,36.0,$214.46,16.23%,2018-04-04,$296.57,72.31%
3,88.81%,2,2.39,4.0,35.0,$281.23,13.489999999999998%,2018-04-05,$818.23,34.37%
4,88.79%,4,3.64,8.0,35.0,$86.92,4.45%,2018-04-09,$346.83,25.06%


Those weird super long decimals don't matter so I am going to leave them. It shouldn't affect the cleaning notebook.
I can now export the data to `april_random.csv`.

I do want to reorder the columns before I export the data.

In [18]:
df2 = df2[['date', 'cust_avail_v3', 'css_count', 'css_score', 'orders', 'new_conv', 'new_sales', 'new_sales_perc',
          'sales', 'calls_per_day']]
df2.head(1)

Unnamed: 0,date,cust_avail_v3,css_count,css_score,orders,new_conv,new_sales,new_sales_perc,sales,calls_per_day
0,2018-04-02,84.78999999999999%,2,3.39,9.0,1.59%,$697.16,73.42%,$949.51,32.0


I am finally ready to export the data

In [19]:
df2.to_csv('random_data.csv')