## Lab 3.4: CSVs, functions, NumPy, and distributions

Run the cell below to load the required packages and set up plotting in the notebook!

In [1]:
import numpy as np
import scipy.stats as stats
import csv
import seaborn as sns
%matplotlib inline

### Sales data

This lab uses a truncated version of a sales dataset that we will examine in more detail later.

The csv has 200 rows of data (plus a header row) and 4 columns. The path to the csv ```sales_info.csv``` is provided below. If you copied files over or moved them around, the path will be different for you and you will have to work out the correct one to enter.

In [2]:
sales_csv_path = '/Users/kristensu/Dropbox/GA-DSI/week-01-KS/4.3-intro-stats-numpy-lab/assets/datasets/sales_info.csv'

#### 1. Loading the data

Set up an empty list called ```rows```.

Using the pattern for loading csvs we learned earlier, add all of the rows in the csv file to the rows list.

For your reference, the pattern is:
```python
with open(my_csv_path, 'r') as f:
    reader = csv.reader(f)
    ...
```

Beyond this, adding the rows in the csv file to the ```rows``` variable is up to you.
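
As a rough illustration of the pattern, the sketch below reads a small in-memory csv instead of a file on disk. The sample text and the names ```sample_csv``` and ```sample_rows``` are made up for the example, and it assumes Python 3 (where ```io.StringIO``` accepts an ordinary string).

```python
import csv
import io

# Hypothetical stand-in for a file on disk: two columns, two data rows.
sample_csv = io.StringIO("volume_sold,2015_margin\n18.42,93.80\n4.77,21.08\n")

sample_rows = []
reader = csv.reader(sample_csv)
for row in reader:
    # Each row comes back as a list of strings, one string per column.
    sample_rows.append(row)

print(sample_rows)
```

Note that every value, including the numbers, is read in as a string.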

In [4]:
rows = []
with open(sales_csv_path, 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        rows.append(row)


#### 2. Separate header and data

The header of the csv is the first element of the ```rows``` variable (index 0), as it is the first row in the csv file.

Use Python indexing and slicing to create two new variables: ```header```, which contains the 4 column names, and ```data```, which contains the list of lists, each sub-list representing a row from the csv.

Lastly, print ```header``` to see the names of the columns.

In [5]:
header = rows[0]
data = rows[1:]

In [6]:
print header

['volume_sold', '2015_margin', '2015_q1_sales', '2016_q1_sales']


In [7]:
len(data)

200
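
An alternative to slicing, sketched here under the assumption of Python 3 and a made-up in-memory csv, is to pull the header off the reader with ```next()``` before collecting the remaining rows. The names ```sample_header``` and ```sample_data``` are hypothetical.

```python
import csv
import io

# Hypothetical in-memory csv; the numbers are invented for illustration.
sample_csv = io.StringIO("volume_sold,2015_margin\n18.42,93.80\n4.77,21.08\n")

reader = csv.reader(sample_csv)
sample_header = next(reader)   # first row: the column names
sample_data = list(reader)     # every row after the header

print(sample_header)
print(len(sample_data))
```

Either approach ends with the column names in one variable and the data rows in another.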

#### 3. Create a dictionary with the data

Use loops or list comprehensions to create a dictionary called ```sales_data```, where the keys of the dictionary are the column names, and the values of the dictionary are lists of the data points of the column corresponding to that column name.

In [8]:
sales_data = {}
for index, column_name in enumerate(header):
    sales_data[column_name] = []
    for row in data:
        sales_data[column_name].append(row[index])
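
The nested loop above can also be written as a single dictionary comprehension. The sketch below uses small invented stand-ins (```toy_header```, ```toy_data```) rather than the real lab data, and relies on ```zip(*rows)``` to turn rows into columns.

```python
# Toy stand-ins for the lab's header and data variables (values invented).
toy_header = ['volume_sold', '2015_margin']
toy_data = [['18.42', '93.80'], ['4.77', '21.08'], ['16.60', '89.71']]

# zip(*toy_data) transposes the rows into columns; the outer zip pairs
# each column name with its column of values.
toy_sales = {name: list(column)
             for name, column in zip(toy_header, zip(*toy_data))}

print(toy_sales['volume_sold'])  # ['18.42', '4.77', '16.60']
```

The explicit loop is easier to step through, while the comprehension is shorter; both produce the same dictionary.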

In [9]:
header

['volume_sold', '2015_margin', '2015_q1_sales', '2016_q1_sales']

In [10]:
header_2 = list(enumerate(header))
header_2

[(0, 'volume_sold'),
 (1, '2015_margin'),
 (2, '2015_q1_sales'),
 (3, '2016_q1_sales')]

In [11]:
sales_data = {}
for index, column_name in header_2:
    sales_data[column_name] = []
    for row in data:
        sales_data[column_name].append(row[index])
    
    

**3.A** Print out the first 10 items of the 'volume_sold' column.

In [12]:
print sales_data['volume_sold'][0:10]

['18.4207604861', '4.77650991918', '16.6024006077', '4.29611149826', '8.15602328201', '5.00512242518', '14.60675', '4.45646649485', '5.04752965097', '5.38807023767']


#### 4. Convert data from string to float

As you can see, the data is still in string format (which is how it is read in from the csv). For each key:value pair in our ```sales_data``` dictionary, convert the values (column data) from string values to float values.

In [35]:
# Loop variable is column_name (not header) so the header list from step 2 is not overwritten.
for column_name, values in sales_data.items():
    sales_data[column_name] = [float(v) for v in values]

In [36]:
print sales_data['volume_sold'][0:10]

[18.4207604861, 4.77650991918, 16.6024006077, 4.29611149826, 8.15602328201, 5.00512242518, 14.60675, 4.45646649485, 5.04752965097, 5.38807023767]


In [37]:
sales_data.keys()

['volume_sold', '2015_q1_sales', '2016_q1_sales', '2015_margin']
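
If you would rather build a new dictionary than overwrite the old one, the conversion can be sketched as a nested comprehension. The dictionary ```toy_sales``` and its values are invented for the example.

```python
# Toy dictionary of string columns (values invented for illustration).
toy_sales = {
    'volume_sold': ['18.42', '4.77'],
    '2015_margin': ['93.80', '21.08'],
}

# Outer comprehension walks the columns; inner one converts each value.
toy_floats = {name: [float(v) for v in values]
              for name, values in toy_sales.items()}

print(toy_floats['volume_sold'])  # [18.42, 4.77]
```

Keeping the original string version around can be handy for checking the conversion afterwards.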

#### 5. Write function to print summary statistics

Now write a function to print out summary statistics for the data.

Your function should:

- Accept two arguments: the column name and the data associated with that column
- Print out information, clearly labeling each item when you print it:
    1. Print out the column name
    2. Print the mean of the data using ```np.mean()```
    3. Print out the median of the data using ```np.median()```
    4. Print out the mode of the **rounded data** using ```stats.mode()```
    5. Print out the variance of the data using ```np.var()```
    6. Print out the standard deviation of the data using ```np.std()```
    
Remember that you will need to convert the numeric output of these functions to strings by wrapping it in the ```str()``` function.
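
One possible shape for a function covering all six items is sketched below. Several things here are assumptions rather than part of the lab: the name ```summary_stats_sketch```, rounding to whole numbers before taking the mode, and the use of ```np.unique()``` as a stand-in for ```stats.mode()``` so that the example depends only on NumPy. It also returns the values in a dictionary, which the lab does not ask for, and uses the Python 3 style ```print()```.

```python
import numpy as np

def summary_stats_sketch(column_name, column_data):
    """Print labeled summary statistics for one column and return them."""
    values = np.asarray(column_data, dtype=float)

    # Mode of the rounded data: count repeats and take the most frequent.
    rounded = np.round(values)
    uniques, counts = np.unique(rounded, return_counts=True)
    mode = uniques[np.argmax(counts)]  # ties resolve to the smallest value

    results = {
        'mean': np.mean(values),
        'median': np.median(values),
        'mode': mode,
        'variance': np.var(values),   # population variance (ddof=0)
        'std': np.std(values),        # population standard deviation
    }

    print('Column: ' + column_name)
    print('Mean: ' + str(results['mean']))
    print('Median: ' + str(results['median']))
    print('Mode (rounded data): ' + str(results['mode']))
    print('Variance: ' + str(results['variance']))
    print('Standard deviation: ' + str(results['std']))
    return results

# Invented toy data, not taken from sales_info.csv.
toy_results = summary_stats_sketch('toy_column', [1.2, 2.7, 3.1, 3.4, 9.0])
```

Rounding before taking the mode matters for continuous data: unrounded floats rarely repeat, so the mode of the raw values is often just one arbitrary data point.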

**5.A** Using your function, print the summary statistics for 'volume_sold'

In [103]:
### a = column name  b = column data
volume_sold = sales_data['volume_sold']
q1_sales_1 = sales_data['2015_q1_sales']
q1_sales_2 = sales_data['2016_q1_sales']
margin = sales_data['2015_margin']

def sales_summary_stats(a, b):
    mode = stats.mode(b)
    print 'Data for: ', a
    print 'Mean: ', round(np.mean(b), 2)
    print 'Median: ', round(np.median(b), 2)
    print 'Mode: ', mode.mode[0]
    print ''


In [104]:
sales_summary_stats('volume_sold', volume_sold)

Data for:  volume_sold
Mean:  10.02
Median:  8.17
Mode:  2.79463149728



In [105]:
# The function prints its results and returns None, so wrapping the call in print adds a trailing None
print sales_summary_stats('volume_sold', volume_sold)

Data for:  volume_sold
Mean:  10.02
Median:  8.17
Mode:  2.79463149728

None


In [106]:
### Alternative function: takes the whole dictionary (d) and prints the same summary for every column
def sales_summary(d):
    for a, b in d.items():
        mode_as_list = stats.mode(b)
        print 'Data for: ', a
        print 'Mean: ', round(np.mean(b),2)
        print 'Median: ', round(np.median(b),2)
        print 'Mode: ', mode_as_list.mode[0]
        print ''

In [107]:
sales_summary(sales_data)

Data for:  volume_sold
Mean:  10.02
Median:  8.17
Mode:  2.79463149728

Data for:  2015_q1_sales
Mean:  154631.67
Median:  104199.41
Mode:  4151.93

Data for:  2016_q1_sales
Mean:  154699.18
Median:  103207.2
Mode:  3536.14

Data for:  2015_margin
Mean:  46.86
Median:  36.56
Mode:  11.9961176992



**5.B** Using your function, print the summary statistics for '2015_margin'

In [108]:
sales_summary_stats('2015_margin', margin)

Data for:  2015_margin
Mean:  46.86
Median:  36.56
Mode:  11.9961176992



**5.C** Using your function, print the summary statistics for '2015_q1_sales'

In [109]:
sales_summary_stats('2015_q1_sales', q1_sales_1)

Data for:  2015_q1_sales
Mean:  154631.67
Median:  104199.41
Mode:  4151.93



**5.D** Using your function, print the summary statistics for '2016_q1_sales'

In [110]:
sales_summary_stats('2016_q1_sales', q1_sales_2)

Data for:  2016_q1_sales
Mean:  154699.18
Median:  103207.2
Mode:  3536.14



#### 6. Plot the distributions

We've provided a plotting function below called ```distribution_plotter()```. It takes two arguments, the name of the column and the data associated with that column.

In individual cells, plot the distributions for each of the 4 columns. Do the data appear skewed? Symmetrical? If skewed, what would be your hypothesis for why?

In [None]:
def distribution_plotter(column, data):
    sns.set(rc={"figure.figsize": (10, 7)})
    sns.set_style("white")
    dist = sns.distplot(data, hist_kws={'alpha':0.2}, kde_kws={'linewidth':5})
    dist.set_title('Distribution of ' + column + '\n', fontsize=16)
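
To back up what the plots suggest, skew can also be checked numerically. The sketch below computes the sample skewness (the third standardized moment) with NumPy; the function name ```sample_skewness``` and the toy lists are invented for illustration. A positive value points to a long right tail, a negative value to a long left tail, and a value near zero to a roughly symmetrical distribution.

```python
import numpy as np

def sample_skewness(data):
    """Return the biased sample skewness (third standardized moment)."""
    values = np.asarray(data, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.std(values) ** 3

# Invented toy data: one list with a large outlier, one symmetrical list.
toy_right_skewed = [1.0, 1.5, 2.0, 2.5, 3.0, 12.0]
toy_symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]

print(sample_skewness(toy_right_skewed) > 0)  # True
print(sample_skewness(toy_symmetric))         # 0.0
```

A quicker rule of thumb is to compare the mean with the median: in the summary output above, the mean is larger than the median for every column (for example 10.02 against 8.17 for 'volume_sold'), which is what you would expect from right-skewed data.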