# Random Data Generation
#### By Julien Dhouti
GoDaddy told me I couldn't release any of my work data :( so I had to generate random data from the work data. This is the notebook that does so.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('clean_data.csv')
df = df.drop(['Unnamed: 0'], axis=1)   # we don't care for the first column that was imported

We start randomly generating the data using the handy numpy library.<br />
In this section, we generate random customer availability values that fall between the minimum and maximum availability percentages in the original data.

In [3]:
min_avail = df.cust_avail_v3.min()
max_avail = df.cust_avail_v3.max()
size = df.cust_avail_v3.size

cust_avail_v3 = np.random.uniform(min_avail, max_avail, size)
cust_avail_v3

array([ 0.9823137 ,  0.90197904,  0.95474414,  0.86781271,  1.02353668,
        0.81412529,  0.93606552,  0.86292229,  0.96887922,  0.81832643,
        1.01544145,  1.04717629,  0.84890921,  0.8688426 ,  1.01327074,
        0.95930932,  1.01814334,  0.92916629,  0.98462644])
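
Since `np.random.uniform` draws from NumPy's global random state, rerunning the cell gives different numbers every time. If reproducibility matters, one option is a seeded generator. This is only a minimal sketch, assuming NumPy's `Generator` API (NumPy 1.17 or later); the seed, bounds, and size below are arbitrary placeholders, not values from the real data.

```python
import numpy as np

# Hypothetical bounds and size standing in for the real column's min, max, and length
min_avail, max_avail, size = 0.80, 1.05, 19

rng = np.random.default_rng(42)   # arbitrary seed, chosen only for illustration
sample = rng.uniform(min_avail, max_avail, size)

# A second generator with the same seed reproduces the same draws
repeat = np.random.default_rng(42).uniform(min_avail, max_avail, size)
```

With a fixed seed, anyone rerunning the notebook would get the same "random" dataset.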

Now that we have created random data for the `cust_avail_v3` column, we can go ahead and create the new dataframe and add that data to it.

In [4]:
df2 = pd.DataFrame()   # create a new dataframe object
df2['cust_avail_v3'] = cust_avail_v3   # apply the random column to the new dataframe
df2.head()

Unnamed: 0,cust_avail_v3
0,0.982314
1,0.901979
2,0.954744
3,0.867813
4,1.023537


This can get tedious so we should probably create a function that can do this for any column that is not a dependent variable.

In [5]:
# create the function
def generate_random(column):
    """Return uniform random values spanning the min and max of the given pandas Series."""
    col_max = column.max()  # get the max from the column
    col_min = column.min()  # get the min from the column
    size = column.size      # the size will be the same for all columns but we do this just in case

    # every column is drawn from a uniform distribution, whatever its dtype
    random_column = np.random.uniform(col_min, col_max, size)

    return random_column
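
The function above returns floats for every column, including ones that hold counts. A possible variant that picks the random function based on the column's dtype might look like the sketch below. The name `generate_random_typed`, the optional `rng` parameter, and the toy series are all assumptions made for illustration; none of them come from the original notebook.

```python
import numpy as np
import pandas as pd

def generate_random_typed(column, rng=None):
    """Hypothetical variant of generate_random that respects integer dtypes."""
    rng = np.random.default_rng() if rng is None else rng
    col_min, col_max = column.min(), column.max()
    if pd.api.types.is_integer_dtype(column):
        # rng.integers excludes the upper bound by default, so include it explicitly
        return rng.integers(col_min, col_max, size=column.size, endpoint=True)
    # anything else falls back to uniform floats, as in the original function
    return rng.uniform(col_min, col_max, column.size)

# Toy series, not the real data
counts = pd.Series([1, 4, 2, 6])
scores = pd.Series([1.5, 9.6, 3.2, 2.5])
random_counts = generate_random_typed(counts, np.random.default_rng(0))
random_scores = generate_random_typed(scores, np.random.default_rng(0))
```

Integer columns would then come out as whole numbers straight away, with no rounding step needed afterwards.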

Now we can take this function for a test drive by applying it to the columns `css_count`, `css_score`, `orders`, `calls_per_day`, `new_sales`, and `new_conv`.

In [6]:
df2['css_count'] = generate_random(df.css_count)
df2['css_score'] = generate_random(df.css_score)
df2['orders'] = generate_random(df.orders)
df2['calls_per_day'] = generate_random(df.calls_per_day)
df2['new_sales'] = generate_random(df.new_sales)
df2['new_conv'] = generate_random(df.new_conv)

df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv
0,0.982314,3.132929,9.596516,1.810967,21.332267,927.295091,0.097898
1,0.901979,0.803462,1.776163,4.372358,32.059763,595.682988,0.083649
2,0.954744,5.232132,3.317613,5.809159,29.049377,67.878894,0.082953
3,0.867813,1.106139,2.506306,2.852789,39.115556,249.052944,0.205916
4,1.023537,5.513686,3.192747,1.581001,27.408561,821.252063,0.137357
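
The six near-identical assignments could also be collapsed into a loop over the column names. A minimal sketch on a toy dataframe (the values below are made up, and `generate_random` is restated so the block runs on its own):

```python
import numpy as np
import pandas as pd

def generate_random(column):
    """Uniform draws between the column's min and max, one per row."""
    return np.random.uniform(column.min(), column.max(), column.size)

# Toy stand-in for the real dataframe
df = pd.DataFrame({'css_count': [1, 4, 2],
                   'orders': [2, 6, 3],
                   'new_conv': [0.08, 0.21, 0.10]})

columns = ['css_count', 'orders', 'new_conv']

# Build every random column in one pass
df2 = pd.DataFrame({name: generate_random(df[name]) for name in columns})
```

Adding another column to randomize then only means adding its name to the list.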


Not too bad, huh? Some columns look a bit funky because they probably aren't supposed to have so many decimal places. We can fix that with the dataframe's `round` method.

In [7]:
df2 = df2.round({'css_score':2, 'new_sales':2, 'orders':0,
                 'calls_per_day':0, 'css_count':0, 'cust_avail_v3':4,
                 'new_conv':4})

I should probably also convert the `css_count` column to an integer-based column since that's what the dirty data looked like.

In [8]:
df2['css_count'] = pd.to_numeric(df.css_count)   # careful: this reads from df, so it copies the original css_count values into df2
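
Note that the cell above passes `df.css_count` (the original data) to `pd.to_numeric`, so `df2.css_count` ends up holding the real values rather than the random ones. If the goal is only to turn the already-rounded random column into integers, a cast on `df2` itself would do it. A minimal sketch on toy values:

```python
import pandas as pd

# Toy values standing in for the randomly generated css_count column
df2 = pd.DataFrame({'css_count': [3.132929, 0.803462, 5.232132]})

df2 = df2.round({'css_count': 0})                  # round first, as in the earlier cell
df2['css_count'] = df2['css_count'].astype(int)    # then cast the random column itself
```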

What does the dataframe look like now? Well, still a bit weird, but definitely more logical.

In [9]:
df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv
0,0.9823,2,9.6,2.0,21.0,927.3,0.0979
1,0.902,1,1.78,4.0,32.0,595.68,0.0836
2,0.9547,1,3.32,6.0,29.0,67.88,0.083
3,0.8678,2,2.51,3.0,39.0,249.05,0.2059
4,1.0235,4,3.19,2.0,27.0,821.25,0.1374


We have a few columns left, but we probably don't need to worry about randomizing the dates. If GoDaddy wants to sue me over stolen date data, go ahead.

In [10]:
df2['date'] = df.date

Next we can randomize `sales` by adding a random number to the `new_sales` column. We find this random number by calculating all of the differences between `sales` and `new_sales` in the original data and then randomizing that series.

In [11]:
total_differences = df.sales - df.new_sales # create a series of all of the differences

# apply the same function I created to this series to create random differences
random_differences = generate_random(total_differences)
random_differences

array([ 338.38245776,   18.34334569,  194.13670608,  479.08367853,
         61.02003058,  529.56583815,  184.67913192,  182.50563238,
        305.72418498,  317.42295898,   15.67350906,  291.69652747,
        322.14377353,  437.38112208,  264.21456786,  343.48748749,
        332.61043922,  312.0299232 ,  272.72229887])

Looks kinda cool. Now let's add it to the `new_sales` column of the new random dataframe to create the `sales` column.

In [12]:
df2['sales'] = df2.new_sales + random_differences
df2 = df2.round({'sales':2})
df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv,date,sales
0,0.9823,2,9.6,2.0,21.0,927.3,0.0979,2018-04-02,1265.68
1,0.902,1,1.78,4.0,32.0,595.68,0.0836,2018-04-03,614.02
2,0.9547,1,3.32,6.0,29.0,67.88,0.083,2018-04-04,262.02
3,0.8678,2,2.51,3.0,39.0,249.05,0.2059,2018-04-05,728.13
4,1.0235,4,3.19,2.0,27.0,821.25,0.1374,2018-04-09,882.27


Almost done! The last column that we need to create doesn't need any randomization because it's already determined by the `sales` and `new_sales` columns. We divide the `new_sales` column by the `sales` column and we get the percentage of new sales.

In [13]:
df2['new_sales_perc'] = df2.new_sales.divide(df2.sales)
df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv,date,sales,new_sales_perc
0,0.9823,2,9.6,2.0,21.0,927.3,0.0979,2018-04-02,1265.68,0.73265
1,0.902,1,1.78,4.0,32.0,595.68,0.0836,2018-04-03,614.02,0.970131
2,0.9547,1,3.32,6.0,29.0,67.88,0.083,2018-04-04,262.02,0.259064
3,0.8678,2,2.51,3.0,39.0,249.05,0.2059,2018-04-05,728.13,0.342041
4,1.0235,4,3.19,2.0,27.0,821.25,0.1374,2018-04-09,882.27,0.930837


I can then restrict this column to 4 decimal places:

In [14]:
df2 = df2.round({'new_sales_perc':4})
df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv,date,sales,new_sales_perc
0,0.9823,2,9.6,2.0,21.0,927.3,0.0979,2018-04-02,1265.68,0.7326
1,0.902,1,1.78,4.0,32.0,595.68,0.0836,2018-04-03,614.02,0.9701
2,0.9547,1,3.32,6.0,29.0,67.88,0.083,2018-04-04,262.02,0.2591
3,0.8678,2,2.51,3.0,39.0,249.05,0.2059,2018-04-05,728.13,0.342
4,1.0235,4,3.19,2.0,27.0,821.25,0.1374,2018-04-09,882.27,0.9308


Looks great!

There are a few other things that we have to add to the columns so that it looks exactly the same as the previous real data, otherwise when the `april_18.ipynb` notebook is trying to clean the data, some functions won't work.

We have to add $ symbols in certain columns and % symbols in the percentage columns. We create two functions for this.

In [15]:
def add_dollar(value):
    """Add a dollar sign in the front of a given value that is supposed to represent a money sum."""
    value = '$' + str(value)
    return value

def make_perc(value):
    """Convert a value into a percentage."""
    value = str(value * 100.00) + '%'
    return value

We can apply these functions using the `.apply()` method of a pandas Series. It calls the function on every element of the column, which saves me from having to write the loop over the rows myself.

In [16]:
df2['new_sales'] = df2.new_sales.apply(add_dollar)
df2['sales'] = df2.sales.apply(add_dollar)

# now for the percentages
df2['cust_avail_v3'] = df2.cust_avail_v3.apply(make_perc)
df2['new_conv'] = df2.new_conv.apply(make_perc)
df2['new_sales_perc'] = df2.new_sales_perc.apply(make_perc)

Now let's take a look at the new data and see if it's formatted the right way:

In [17]:
df2.head()

Unnamed: 0,cust_avail_v3,css_count,css_score,orders,calls_per_day,new_sales,new_conv,date,sales,new_sales_perc
0,98.22999999999999%,2,9.6,2.0,21.0,$927.3,9.790000000000001%,2018-04-02,$1265.68,73.26%
1,90.2%,1,1.78,4.0,32.0,$595.68,8.36%,2018-04-03,$614.02,97.00999999999999%
2,95.47%,1,3.32,6.0,29.0,$67.88,8.3%,2018-04-04,$262.02,25.91%
3,86.78%,2,2.51,3.0,39.0,$249.05,20.59%,2018-04-05,$728.13,34.2%
4,102.35000000000001%,4,3.19,2.0,27.0,$821.25,13.74%,2018-04-09,$882.27,93.08%


Those weird super long decimals (like `98.22999999999999%`) are floating-point artifacts from multiplying by 100. They don't matter so I am going to leave them. They shouldn't affect the cleaning notebook.
I can now export the data to `random_data.csv`.
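
If the long decimals ever did become a problem, formatting to a fixed number of decimal places would avoid them. The function below is a hypothetical variant of `make_perc`, not something the notebook uses, and the `decimals` parameter is an assumption added for illustration.

```python
def make_perc_fixed(value, decimals=2):
    """Hypothetical variant of make_perc that formats to a fixed number of decimals."""
    return f"{value * 100:.{decimals}f}%"

# Values taken from the first rows shown above
examples = [make_perc_fixed(v) for v in (0.9823, 0.0979, 1.0235)]
# examples is ['98.23%', '9.79%', '102.35%']
```

One trade-off is that fixed formatting keeps trailing zeros, so `0.902` would become `90.20%` rather than `90.2%`.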

I do want to reorder the columns before I export the data.

In [18]:
df2 = df2[['date', 'cust_avail_v3', 'css_count', 'css_score', 'orders', 'new_conv', 'new_sales', 'new_sales_perc',
          'sales', 'calls_per_day']]
df2.head(1)

Unnamed: 0,date,cust_avail_v3,css_count,css_score,orders,new_conv,new_sales,new_sales_perc,sales,calls_per_day
0,2018-04-02,98.22999999999999%,2,9.6,2.0,9.790000000000001%,$927.3,73.26%,$1265.68,21.0


I am finally ready to export the data.

In [19]:
df2.to_csv('random_data.csv')
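
By default `to_csv` also writes the dataframe's index as an unnamed first column, which `read_csv` later loads as `Unnamed: 0` (presumably the same kind of column that was dropped at the top of this notebook). Whether the cleaning notebook expects that column is not something this notebook shows, so the sketch below only illustrates the difference, using an in-memory buffer and toy rows instead of a real file.

```python
import io
import pandas as pd

# Toy rows standing in for the final dataframe
df2 = pd.DataFrame({'date': ['2018-04-02', '2018-04-03'],
                    'sales': ['$1265.68', '$614.02']})

with_index = io.StringIO()
df2.to_csv(with_index)                     # default: index written as an unnamed first column

without_index = io.StringIO()
df2.to_csv(without_index, index=False)     # index left out of the file

# Read both versions back to compare the resulting columns
reloaded_with = pd.read_csv(io.StringIO(with_index.getvalue()))
reloaded_without = pd.read_csv(io.StringIO(without_index.getvalue()))
```

Reading the first version back gives an extra `Unnamed: 0` column, while the second comes back with just `date` and `sales`.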