## Lab 3.4: CSVs, functions, NumPy, and distributions

Run the cell below to load the required packages and set up plotting in the notebook!

import numpy as np
import scipy.stats as stats
import csv
import seaborn as sns
%matplotlib inline

### Sales data

For this lab we will use a truncated version of a sales dataset that we will examine in more detail later on.

The csv has about 200 rows of data and 4 columns. The relative path to the csv ```sales_info.csv``` is provided below. If you copied files over and moved them around, this might be different for you and you will have to figure out the correct relative path to enter.

In [None]:
sales_csv_path = '../../assets/datasets/sales_info.csv'

#### 1. Loading the data

Set up an empty list called ```rows```.

Using the pattern for loading csvs we learned earlier, add all of the rows in the csv file to the rows list.

For your reference, the pattern is:
```python
with open(my_csv_path, 'r') as f:
    reader = csv.reader(f)
    ...
```

Beyond this, adding the rows in the csv file to the ```rows``` variable is up to you.
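As a minimal sketch of the pattern above, the example below writes a tiny throwaway csv and reads it back row by row. The file contents and the names `toy_csv_path` and `toy_rows` are made up for illustration and have nothing to do with ```sales_info.csv```.

```python
import csv
import os
import tempfile

# Hypothetical two-column file, written only so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write('name,price\nwidget,2.50\ngadget,4.00\n')
    toy_csv_path = tmp.name

toy_rows = []
with open(toy_csv_path, 'r', newline='') as f:
    reader = csv.reader(f)
    # Each row comes back as a list of strings, one string per column.
    for row in reader:
        toy_rows.append(row)

os.remove(toy_csv_path)  # clean up the throwaway file
print(toy_rows)
```

Note that every value, including the numbers, is read in as a string.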

In [1]:
import csv

rows = []
print('Opening File. Data:')
# Read every row of the csv into a list of lists.
with open('../../assets/datasets/sales_info.csv', 'r') as f:
    reader = csv.reader(f)
    rows = list(reader)
print(rows)

Opening File. Data:


IOError: [Errno 2] No such file or directory: '../../3.4-lab/assets/datasets/sales_info.csv'

#### 2. Separate header and data

The header of the csv is the first element (index 0) of the ```rows``` variable, since it is the first row in the csv file.

Use python indexing to create two new variables: ```header``` which contains the 4 column names, and ```data``` which contains the list of lists, each sub-list representing a row from the csv.

Lastly, print ```header``` to see the names of the columns.
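A short sketch of the indexing involved, using a made-up list of rows (the names `toy_rows`, `toy_header`, and `toy_data` are hypothetical):

```python
# Toy rows standing in for what csv.reader returns; not the real sales data.
toy_rows = [
    ['name', 'price'],
    ['widget', '2.50'],
    ['gadget', '4.00'],
]

toy_header = toy_rows[0]   # first row: the column names
toy_data = toy_rows[1:]    # slice from index 1 onward: every remaining row

print(toy_header)
print(toy_data)
```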

In [69]:
import csv

print('Opening File. Data:')
with open('../../assets/datasets/sales_info.csv', 'r') as f:
    reader = csv.reader(f)
    rows = list(reader)

# The first row holds the column names; the remaining rows hold the data.
header = rows[0]
data = rows[1:]

print(header, data[1:6])


Opening File. Data:
(['volume_sold', '2015_margin', '2015_q1_sales', '2016_q1_sales'], [['4.77650991918', '21.0824246877', '22351.86', '21736.63'], ['16.6024006077', '93.6124943024', '277764.46', '306942.27'], ['4.29611149826', '16.8247038328', '16805.11', '9307.75'], ['8.15602328201', '35.0114570034', '54411.42', '58939.9'], ['5.00512242518', '31.8774372328', '255939.81', '332979.03']])


#### 3. Create a dictionary with the data

Use loops or list comprehensions to create a dictionary called ```sales_data```, where the keys of the dictionary are the column names, and the values of the dictionary are lists of the data points of the column corresponding to that column name.
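One possible shape for this, sketched with a dictionary comprehension on toy data. The names `toy_header`, `toy_data`, and `toy_columns` are made up for the example, and a nested loop works just as well.

```python
# Toy header and rows; not the real sales data.
toy_header = ['name', 'price']
toy_data = [['widget', '2.50'], ['gadget', '4.00']]

# For each column position, collect that position's value from every row.
toy_columns = {
    column_name: [row[index] for row in toy_data]
    for index, column_name in enumerate(toy_header)
}

print(toy_columns)
```

`enumerate()` pairs each column name with its position, which is what lets us pull the matching value out of each row.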

In [115]:
sales_data = {}

for index, column_name in enumerate(header):
    print(index, column_name)
    sales_data[column_name] = []
    for row in data:
        sales_data[column_name].append(row[index])

print(sales_data)


(0, 'volume_sold')
(1, '2015_margin')
(2, '2015_q1_sales')
(3, '2016_q1_sales')
{'volume_sold': ['18.4207604861', '4.77650991918', '16.6024006077', '4.29611149826', '8.15602328201', '5.00512242518', '14.60675', '4.45646649485', '5.04752965097', '5.38807023767', '9.34734863474', '10.9303977273', '6.27020860495', '12.3959191176', '4.55771189614', '4.20012242627', '10.2528698945', '12.0767847594', '3.7250952381', '3.21072662722', '6.29097142857', '7.43482131661', '4.37622478386', '12.9889127838', '11.6974557522', '5.96517512509', '3.94522273425', '7.36958530901', '7.34350882699', '12.3500273544', '8.41791967737', '10.2608361718', '7.82435369972', '10.3314300532', '12.5284878049', '18.7447505256', '6.65773264189', '10.6321289355', '6.92770422965', '6.61817422161', '7.12444444444', '9.84966032435', '11.5058377559', '6.30981315215', '10.1866219839', '10.1221793301', '10.8003469032', '7.26782845188', '10.6737166742', '9.15026865672', '8.12418187744', '6.27579970306', '10.6772953319', '5.88898

#### 4. Convert data from string to float

As you can see, the data is still in string format (which is how it is read in from the csv). For each key:value pair in our ```sales_data``` dictionary, convert the values (column data) from string values to float values.
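A minimal sketch of the conversion on a toy dictionary (the names `toy_columns` and `toy_floats` are hypothetical). It builds a new dictionary rather than changing the original in place, which is one of several reasonable ways to do this.

```python
# Toy column data as strings, the way csv.reader delivers it.
toy_columns = {'price': ['2.50', '4.00'], 'quantity': ['3', '10']}

# Convert every string in every column to a float.
toy_floats = {
    name: [float(value) for value in values]
    for name, values in toy_columns.items()
}

print(toy_floats)
```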

In [124]:
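for name, col in sales_data.items():
    # Convert each string in the column to a float.
    col = [float(x) for x in col]
    print(col[2:5])
    # Store the converted list back under the same column name.
    sales_data[name] = col
    print(sales_data[name][2:5])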

[16.6024006077, 4.29611149826, 8.15602328201]
[16.6024006077, 4.29611149826, 8.15602328201]
[277764.46, 16805.11, 54411.42]
[277764.46, 16805.11, 54411.42]
[306942.27, 9307.75, 58939.9]
[306942.27, 9307.75, 58939.9]
[93.6124943024, 16.8247038328, 35.0114570034]
[93.6124943024, 16.8247038328, 35.0114570034]


#### 5. Write a function to print summary statistics

Now write a function to print out summary statistics for the data.

Your function should:

- Accept two arguments: the column name and the data associated with that column
- Print out information, clearly labeling each item when you print it:
    1. Print out the column name
    2. Print the mean of the data using ```np.mean()```
    3. Print out the median of the data using ```np.median()```
    4. Print out the mode of the **rounded data** using ```stats.mode()```
    5. Print out the variance of the data using ```np.var()```
    6. Print out the standard deviation of the data using ```np.std()```
    
Remember that you will need to convert the numeric data from these functions to strings by wrapping them in the ```str()``` function.

**5.A** Using your function, print the summary statistics for 'volume_sold'

**5.B** Using your function, print the summary statistics for '2015_margin'

**5.C** Using your function, print the summary statistics for '2015_q1_sales'

**5.D** Using your function, print the summary statistics for '2016_q1_sales'
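One way such a function might look is sketched below. The function name `summary_statistics`, the labels, and the toy values are all assumptions made for illustration, and returning the results in a dictionary is an extra that the exercise does not ask for. Depending on the SciPy version, `stats.mode()` returns either arrays or scalars, so the sketch wraps the result in `np.atleast_1d()` to handle both.

```python
import numpy as np
import scipy.stats as stats


def summary_statistics(column, data):
    """Print labeled summary statistics for one column and return them."""
    values = np.asarray(data, dtype=float)
    # np.atleast_1d copes with stats.mode returning an array or a scalar.
    mode = np.atleast_1d(stats.mode(np.round(values)).mode)[0]
    results = {
        'mean': np.mean(values),
        'median': np.median(values),
        'mode (rounded)': mode,
        'variance': np.var(values),
        'standard deviation': np.std(values),
    }
    print('Column: ' + column)
    for label, value in results.items():
        print(label + ': ' + str(value))
    return results


# Toy values, not taken from sales_info.csv.
toy_stats = summary_statistics('toy_column', [1.2, 2.0, 2.4, 3.9, 10.0])
```

Keep in mind that `np.var()` and `np.std()` compute the population versions by default (`ddof=0`).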

#### 6. Plot the distributions

We've provided a plotting function below called ```distribution_plotter()```. It takes two arguments, the name of the column and the data associated with that column.

In individual cells, plot the distributions for each of the 4 columns. Do the data appear skewed? Symmetrical? If skewed, what would be your hypothesis for why?

In [None]:
def distribution_plotter(column, data):
    sns.set(rc={"figure.figsize": (10, 7)})
    sns.set_style("white")
    dist = sns.distplot(data, hist_kws={'alpha':0.2}, kde_kws={'linewidth':5})
    dist.set_title('Distribution of ' + column + '\n', fontsize=16)
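
If you want a numeric cross-check on what the plots suggest about skew, one option is to compare the mean with the median and to compute the sample skewness with `stats.skew()`. The sketch below uses made-up values (`toy_values` is hypothetical, not a column from the sales data); a mean well above the median together with a positive skewness points to a right-skewed distribution.

```python
import numpy as np
import scipy.stats as stats

# Toy values with one large outlier on the right; not the real sales data.
toy_values = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 12.0])

toy_mean = np.mean(toy_values)
toy_median = np.median(toy_values)
toy_skew = stats.skew(toy_values)  # positive values indicate a longer right tail

print('mean: ' + str(toy_mean))
print('median: ' + str(toy_median))
print('skewness: ' + str(toy_skew))
```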