# Random data
#### By Julien Dhouti
GoDaddy told me I couldn't release any of my work data :( so I had to generate random data from the work data. This is the notebook that does so.

In [118]:
import pandas as pd
import numpy as np

In [119]:
df = pd.read_csv('clean_april_18.csv')
df = df.drop(['Unnamed: 0'], axis=1)   # we don't care for the first column that was imported

We start by generating random data with the handy NumPy library.
In this section, we generate random customer availability by drawing uniformly between the minimum and maximum of the existing availability percentages.

In [120]:
min_avail = df.cust_avail_v3.min()
max_avail = df.cust_avail_v3.max()
size = df.cust_avail_v3.size

cust_avail_v3 = np.random.uniform(min_avail, max_avail, size)
cust_avail_v3

array([ 1.05488657,  1.03156509,  1.0214843 ,  0.89743093,  0.86710445,
        1.04484562,  0.81816838,  0.98533837,  0.88750236,  0.96574661,
        0.84634189,  0.9237154 ,  0.94468935,  1.01023713,  0.85321815,
        0.82568268,  0.94913525,  0.95398924,  0.87052809])

Now that we have created random data for the `cust_avail_v3` column, we can go ahead and create the new dataframe and apply that data to it.

In [121]:
df2 = pd.DataFrame()   # create a new dataframe object
df2['cust_avail_v3'] = cust_avail_v3   # apply the random column to the new dataframe
df2.head()

Unnamed: 0,cust_avail_v3
0,1.054887
1,1.031565
2,1.021484
3,0.897431
4,0.867104


This can get tedious, so we should probably create a function that does this for any column that doesn't depend on other columns.

In [122]:
# create the function
def generate_random(column):
    """Return uniform random floats between the min and max of the given Series."""
    col_max = column.max()  # get the max from the column
    col_min = column.min()  # get the min from the column
    size = column.size      # the size will be the same for all columns but we do this just in case

    # every column gets a float uniform draw, whatever its dtype
    random_column = np.random.uniform(col_min, col_max, size)

    return random_column
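
`generate_random` always draws floats, even for count-like columns. As a rough sketch of what a dtype-aware version could look like, here is a hypothetical variant (`generate_random_typed` is my own name, not part of the notebook) that draws integers for integer columns and falls back to a uniform draw otherwise. The example Series are made up.

```python
import numpy as np
import pandas as pd

def generate_random_typed(column, rng=None):
    """Hypothetical variant of generate_random that respects the column dtype."""
    rng = np.random.default_rng() if rng is None else rng
    col_min, col_max = column.min(), column.max()
    if pd.api.types.is_integer_dtype(column.dtype):
        # endpoint=True makes the upper bound inclusive
        return rng.integers(col_min, col_max, size=column.size, endpoint=True)
    # float columns keep the uniform draw
    return rng.uniform(col_min, col_max, column.size)

# Made-up example data, not the original columns
counts = pd.Series([0, 3, 5, 2])
scores = pd.Series([1.5, 9.2, 4.4])
print(generate_random_typed(counts).dtype.kind)  # i
print(generate_random_typed(scores).dtype.kind)  # f
```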

Now we can take this function for a test drive by applying it to the columns `css_count`, `css_score`, `orders`, `calls_per_day`, `new_sales`, and `new_conv`.

In [123]:
df2['css_count'] = generate_random(df.css_count)
df2['css_score'] = generate_random(df.css_score)
df2['orders'] = generate_random(df.orders)
df2['calls_per_day'] = generate_random(df.calls_per_day)
df2['new_sales'] = generate_random(df.new_sales)
df2['new_conv'] = generate_random(df.new_conv)

df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv
0,1.054887,3.893389,1.747388,1.241616,37.228616,863.193624,0.080105
1,1.031565,0.671246,5.469128,9.195623,38.561679,517.594264,0.122897
2,1.021484,2.85001,5.792848,1.023614,32.630453,555.509561,0.139811
3,0.897431,4.890119,8.898066,2.900913,32.271053,676.622558,0.166528
4,0.867104,0.174798,7.525343,9.139004,23.625943,844.002159,0.206834


Not too bad, huh? Some columns look a bit funky because they probably aren't supposed to have so many decimal places. We can fix that with a quick call to `round`.

In [124]:
df2 = df2.round({'css_score':2, 'new_sales':2, 'orders':0, 'calls_per_day':0, 'css_count':0})

What does the dataframe look like now? Well, still a bit weird, but definitely more sensible.

In [125]:
df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv
0,1.054887,4.0,1.75,1.0,37.0,863.19,0.080105
1,1.031565,1.0,5.47,9.0,39.0,517.59,0.122897
2,1.021484,3.0,5.79,1.0,33.0,555.51,0.139811
3,0.897431,5.0,8.9,3.0,32.0,676.62,0.166528
4,0.867104,0.0,7.53,9.0,24.0,844.0,0.206834
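
The count-like columns are still floats after `round` (4.0 rather than 4). If whole-number columns are wanted, one option is to cast them after rounding. This is only a sketch on made-up rows, and which columns count as "counts" is my assumption.

```python
import pandas as pd

# Made-up rows shaped like the uniform draws above
demo = pd.DataFrame({
    'css_count': [3.89, 0.67],
    'orders': [1.24, 9.20],
    'css_score': [1.747, 5.469],
})

# round first, then cast the count-like columns to integers
demo = demo.round({'css_count': 0, 'orders': 0, 'css_score': 2})
demo = demo.astype({'css_count': int, 'orders': int})
print(demo.dtypes)
```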


We have a few columns left, but we probably don't need to worry about randomizing the dates. If GoDaddy wants to sue me over stolen date data, go ahead.

In [126]:
df2['date'] = df.date

Next, we can randomize `sales` (total sales) by adding a random number to the `new_sales` column. We find this random number by calculating all of the differences between `sales` and `new_sales` in the original data, then drawing random values across the range of those differences.

In [127]:
total_differences = df.sales - df.new_sales # create a series of all of the differences

# apply the same function I created to this series to create random differences
random_differences = generate_random(total_differences)
random_differences

array([ 422.71896937,   99.15180399,  374.93458224,  314.60392876,
        401.06364052,   36.41527157,  161.84936159,  100.99322763,
        506.15238631,  500.61254063,  527.52077281,  538.14808822,
        452.57609752,   92.34873146,  403.667233  ,  341.49075056,
        265.63051921,  368.66243699,  192.28623491])

Looks kinda cool. Now let's add it to the `new_sales` column of the new random dataframe to create the `sales` column.

In [128]:
df2['sales'] = df2.new_sales + random_differences
df2 = df2.round({'sales':2})
df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv,date,sales
0,1.054887,4.0,1.75,1.0,37.0,863.19,0.080105,2018-04-02,1285.91
1,1.031565,1.0,5.47,9.0,39.0,517.59,0.122897,2018-04-03,616.74
2,1.021484,3.0,5.79,1.0,33.0,555.51,0.139811,2018-04-04,930.44
3,0.897431,5.0,8.9,3.0,32.0,676.62,0.166528,2018-04-05,991.22
4,0.867104,0.0,7.53,9.0,24.0,844.0,0.206834,2018-04-09,1245.06
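
A quick sanity check on this construction: as long as the random differences are non-negative, `sales` can never drop below `new_sales`. The sketch below shows that with made-up numbers. The bounds on the differences are assumed, not taken from the work data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up stand-in for df2.new_sales
new_sales = pd.Series(rng.uniform(0, 900, 19)).round(2)

# Assumed range for the differences (the real min/max come from df)
diff_min, diff_max = 30.0, 550.0
random_differences = rng.uniform(diff_min, diff_max, new_sales.size)

sales = (new_sales + random_differences).round(2)

# With non-negative differences, total sales stay at or above new sales
print((sales >= new_sales).all())  # True
```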


Almost done! The last column that we need to create doesn't need any randomization because it's already determined by the `sales` and `new_sales` columns. We divide the `new_sales` column by the `sales` column and we get the percentage of new sales.

In [111]:
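# note: this divides the original df columns, so the ratios come from the real data;
# df2.new_sales / df2.sales would derive them from the random columns instead
df2['new_sales_perc'] = df.new_sales / df.sales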
df2

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,date,sales,new_conv,new_sales_perc
0,0.879222,5.0,9.8,10.0,24.0,534.85,2018-04-02,599.47,0.030851,0.950954
1,1.007708,7.0,8.37,2.0,32.0,624.06,2018-04-03,777.86,0.184306,1.0
2,0.858833,4.0,2.67,5.0,34.0,7.66,2018-04-04,553.98,0.172314,0.950568
3,0.818615,1.0,4.45,4.0,28.0,192.82,2018-04-05,543.83,0.132789,1.0
4,1.0182,2.0,8.32,9.0,26.0,759.04,2018-04-09,1290.02,0.148876,0.602737
5,0.898391,4.0,5.11,4.0,24.0,55.38,2018-04-10,316.29,0.16637,0.555206
6,0.826304,4.0,5.87,7.0,21.0,462.46,2018-04-11,598.72,0.196712,1.0
7,0.994471,5.0,8.43,6.0,37.0,851.26,2018-04-12,1089.03,0.07172,0.941143
8,0.965107,3.0,8.57,8.0,37.0,560.65,2018-04-13,933.37,0.100182,0.682179
9,0.852617,7.0,4.83,7.0,35.0,446.22,2018-04-16,582.51,0.083631,0.96281
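
The ratios in the output above don't line up with the random `new_sales` and `sales` sitting next to them (row 0 would be about 0.89, not 0.950954), because they were computed from the original `df`. Here is a sketch of deriving the column from the random values instead, assuming that's the goal. The three rows are copied from the output above.

```python
import pandas as pd

# First three rows of new_sales and sales from the df2 output above
demo = pd.DataFrame({
    'new_sales': [534.85, 624.06, 7.66],
    'sales': [599.47, 777.86, 553.98],
})

# share of total sales that came from new sales
demo['new_sales_perc'] = demo.new_sales / demo.sales
print(demo.round({'new_sales_perc': 6}))
```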


Looks great! I'm going to export it so that I can use it in my other notebooks.

In [129]:
df2.to_csv('april_random.csv')
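
By default `to_csv` also writes the index, which is where an `Unnamed: 0` column comes from when the file is read back. Passing `index=False` avoids that. A small sketch, using an in-memory buffer instead of a real file and a made-up two-row frame:

```python
import io
import pandas as pd

demo = pd.DataFrame({'orders': [1.0, 9.0], 'sales': [1285.91, 616.74]})

buf = io.StringIO()            # in-memory stand-in for 'april_random.csv'
demo.to_csv(buf, index=False)  # skip the index so no 'Unnamed: 0' shows up on reload
buf.seek(0)

reloaded = pd.read_csv(buf)
print(list(reloaded.columns))  # ['orders', 'sales']
```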