In [1]:
options(jupyter.rich_display=FALSE)

# Problem 3: Sales analysis

(Scoring: Part A 5 points, parts B-F 9 points each. Total 50 points.)

In this assignment, you will analyze a data set from a wholesale goods distributor. The data shows the annual spending of each customer on products in several categories.


The data is stored in the file *wholesaledata.csv*. The first five lines of the file are as follows:

|Channel|Region|Fresh|Milk|Grocery|Frozen|Detergents_Paper|Delicassen|
|---|---|---|---|---|---|---|---|
|2|3|12669|9656|7561|214|2674|1338|
|2|3|7057|9810|9568|1762|3293|1776|
|2|3|6353|8808|7684|2405|3516|7844|
|1|3|13265|1196|4221|6404|507|1788|

* The `Channel` field indicates the type of the customer.
    * 1 indicates a hotel, restaurant, or cafe ("horeca" for short).
    * 2 indicates a retail store.

* The `Region` field indicates the region where the customer is located.
    * 1 for Istanbul
    * 2 for Bursa
    * 3 for Other

**(A)** Write a function named **read_data** that reads the data file and returns a data frame. You can use the built-in **read.csv()** function inside this function's body.

```r
> sales <- read_data("wholesaledata.csv")
> sales
    Channel Region  Fresh  Milk Grocery Frozen Detergents_Paper Delicassen
1         2      3  12669  9656    7561    214             2674       1338
2         2      3   7057  9810    9568   1762             3293       1776
3         2      3   6353  8808    7684   2405             3516       7844
...
338       1      2   9351  1347    2611   8170              442        868
339       1      2      3   333    7021  15601               15        550
340       1      2   2617  1188    5332   9584              573       1942
```

The name **sales** is arbitrary. You don't have to use the same name for the data frame.

**(B)** Write a function named **tofactors** that takes the original data frame and returns a new data frame such that:
* In the **Channel** column, the values 1 and 2 are replaced by "Horeca" and "Retail", respectively.
* In the **Region** column, the values 1, 2, and 3 are replaced by "Istanbul", "Bursa", and "Other", respectively.
* The columns **Channel** and **Region** are converted to factors.

Example:
```r
> summary(tofactors(sales))
   Channel         Region        Fresh             Milk          Grocery     
 Horeca:222   Bursa   : 47   Min.   :     3   Min.   :   55   Min.   :    3  
 Retail:118   Istanbul: 77   1st Qu.:  3286   1st Qu.: 1606   1st Qu.: 2366  
              Other   :216   Median :  8726   Median : 3664   Median : 5146  
                             Mean   : 12441   Mean   : 6175   Mean   : 8442  
                             3rd Qu.: 16934   3rd Qu.: 7612   3rd Qu.:10830  
                             Max.   :112151   Max.   :73498   Max.   :92780  
     Frozen      Detergents_Paper    Delicassen     
 Min.   :   33   Min.   :    3.0   Min.   :    3.0  
 1st Qu.:  744   1st Qu.:  283.8   1st Qu.:  416.5  
 Median : 1500   Median :  833.0   Median :  982.5  
 Mean   : 3131   Mean   : 3112.8   Mean   : 1615.1  
 3rd Qu.: 3708   3rd Qu.: 4125.0   3rd Qu.: 1795.8  
 Max.   :60869   Max.   :40827.0   Max.   :47943.0  
```

**(C)** Write a function named **totalrevenue** that returns the total revenue from a given channel, region and type of good.

Example:
```r
> totalrevenue(sales, 1, 1, "Fresh")
[1] 761233
> totalrevenue(tofactors(sales), "Horeca", "Istanbul", "Fresh")
[1] 761233
```

**(D)** Write a function named **sales_stats** that takes the data frame and returns the mean and max of sales for every item type. The result should be returned as a list of lists, as shown below.

Example:

```r
> sales_stats(sales)
$Fresh
$Fresh$avg
[1] 12441.45

$Fresh$max
[1] 112151


$Milk
$Milk$avg
[1] 6174.871

$Milk$max
[1] 73498


$Grocery
$Grocery$avg
[1] 8441.918

$Grocery$max
[1] 92780


$Frozen
$Frozen$avg
[1] 3131.297

$Frozen$max
[1] 60869


$Detergents_Paper
$Detergents_Paper$avg
[1] 3112.824

$Detergents_Paper$max
[1] 40827


$Delicassen
$Delicassen$avg
[1] 1615.103

$Delicassen$max
[1] 47943
```

**(E)** Write a function named **contingency_table** that returns a two-way table (contingency table) giving the number of entries broken down by channel and region.

Example:

```r
> contingency_table(sales)
           Horeca Retail
  Bursa        28     19
  Istanbul     59     18
  Other       135     81
```


**(F)** Write a function named **break_sales** that returns a table of total revenue for each item type, aggregated by regions and channels, as shown below. (Hint: Use the built-in **aggregate()** function.)

Example:

```r
> break_sales(sales)
   Group.1 Group.2   Fresh   Milk Grocery Frozen Detergents_Paper Delicassen
1    Bursa  Horeca  326215  64519  123074 160861            13516      30965
2 Istanbul  Horeca  761233 228342  237542 184512            56081      70632
3    Other  Horeca 2085912 459174  526214 520296           102605     230435
4    Bursa  Retail  138506 174625  310200  29271           159795      23541
5 Istanbul  Retail   93600 194112  332495  46514           148055      33695
6    Other  Retail  824627 978684 1340727 123187           578308     159867
```

# Solution

**(A)**

In [3]:
read_data <- function(whole_path) {
    read.csv(whole_path)
}

sales <- read_data("~/tests/assignments/2019_2020_2/wholesaledata_test.csv")
head(sales)

  Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 2       3      12669 9656 7561     214   2674             1338      
2 2       3       7057 9810 9568    1762   3293             1776      
3 2       3       6353 8808 7684    2405   3516             7844      
4 1       3      13265 1196 4221    6404    507             1788      
5 2       3      22615 5410 7198    3915   1777             5185      
6 2       3       9413 8259 5126     666   1795             1451      

**(B)**

In [4]:
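tofactors <- function(df){
    # Channel codes: 1 = Horeca, anything else (2 in this data) = Retail
    df$Channel <- factor(ifelse(df$Channel==1, "Horeca", "Retail"))
    # Region codes: 1 = Istanbul, 2 = Bursa, anything else (3 in this data) = Other
    df$Region <- factor(ifelse(df$Region==1, "Istanbul", ifelse(df$Region==2, "Bursa", "Other")))
    return(df)
}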

head(tofactors(sales))
summary(tofactors(sales))

  Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 Retail  Other  12669 9656 7561     214   2674             1338      
2 Retail  Other   7057 9810 9568    1762   3293             1776      
3 Retail  Other   6353 8808 7684    2405   3516             7844      
4 Horeca  Other  13265 1196 4221    6404    507             1788      
5 Retail  Other  22615 5410 7198    3915   1777             5185      
6 Retail  Other   9413 8259 5126     666   1795             1451      

   Channel         Region        Fresh             Milk          Grocery     
 Horeca:298   Bursa   : 47   Min.   :     3   Min.   :   55   Min.   :    3  
 Retail:142   Istanbul: 77   1st Qu.:  3128   1st Qu.: 1533   1st Qu.: 2153  
              Other   :316   Median :  8504   Median : 3627   Median : 4756  
                             Mean   : 12000   Mean   : 5796   Mean   : 7951  
                             3rd Qu.: 16934   3rd Qu.: 7190   3rd Qu.:10656  
                             Max.   :112151   Max.   :73498   Max.   :92780  
     Frozen        Detergents_Paper    Delicassen     
 Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
 1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
 Median : 1526.0   Median :  816.5   Median :  965.5  
 Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
 3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
 Max.   :60869.0   Max.   :40827.0   Max.   :47943.0  
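
The nested `ifelse()` calls above are one way to recode the two columns. As a rough sketch of an alternative, `factor()` accepts `levels` and `labels` directly. The data frame `toy` and the function name `tofactors_labels` are made up for illustration and are not part of the assignment.

```r
# Toy data with made-up values (not taken from wholesaledata.csv)
toy <- data.frame(Channel = c(1, 2, 2, 1),
                  Region  = c(3, 1, 2, 3),
                  Fresh   = c(10, 20, 30, 40))

# Hypothetical variant of tofactors(): map codes to labels via factor()
tofactors_labels <- function(df) {
    df$Channel <- factor(df$Channel, levels = c(1, 2),
                         labels = c("Horeca", "Retail"))
    df$Region  <- factor(df$Region, levels = c(1, 2, 3),
                         labels = c("Istanbul", "Bursa", "Other"))
    df
}

toy_f <- tofactors_labels(toy)
levels(toy_f$Region)   # level order follows `labels`, not the alphabet
```

Two behaviours differ from the `ifelse()` version: the level order follows `labels`, so `summary()` would list Istanbul before Bursa, and a code outside the given `levels` becomes `NA` instead of falling into the last category.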

**(C)**

In [5]:
totalrevenue <- function(df, channel, region, type){
    # Keep the rows matching both channel and region, then sum the requested column
    sum(df[(df$Channel==channel) & (df$Region==region),][[type]])
}

totalrevenue(sales,1,1,"Fresh")
totalrevenue(tofactors(sales),"Horeca","Istanbul","Fresh")

[1] 761233

[1] 761233
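
The row filter in `totalrevenue` relies on a logical mask. The following minimal sketch shows that step in isolation on a toy data frame; the values are invented for illustration and do not come from the data file.

```r
# Toy data with made-up values (not taken from wholesaledata.csv)
toy <- data.frame(Channel = c(1, 1, 2, 1),
                  Region  = c(1, 1, 1, 3),
                  Fresh   = c(100, 250, 40, 7))

# TRUE only for rows that are in channel 1 AND region 1
mask <- toy$Channel == 1 & toy$Region == 1
mask                    # TRUE TRUE FALSE FALSE

# Sum the Fresh column over the selected rows
sum(toy$Fresh[mask])    # 350
```

The same comparison works when the columns are factors, because `==` compares a factor against a character label, which is why both calls above give the same total.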

**(D)**

In [6]:
sales_stats <- function(df){
    lst <- list()
    # Skip the first two columns (Channel, Region); the rest are item types
    for (type in colnames(df)[c(-1,-2)]){
        lst[[type]]$avg <- mean(df[[type]])
        lst[[type]]$max <- max(df[[type]])
    }
    lst
}
sales_stats(sales)

$Fresh
$Fresh$avg
[1] 12000.3

$Fresh$max
[1] 112151


$Milk
$Milk$avg
[1] 5796.266

$Milk$max
[1] 73498


$Grocery
$Grocery$avg
[1] 7951.277

$Grocery$max
[1] 92780


$Frozen
$Frozen$avg
[1] 3071.932

$Frozen$max
[1] 60869


$Detergents_Paper
$Detergents_Paper$avg
[1] 2881.493

$Detergents_Paper$max
[1] 40827


$Delicassen
$Delicassen$avg
[1] 1524.87

$Delicassen$max
[1] 47943
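
The loop above builds the nested list one item type at a time. A possible alternative, sketched here under the assumption that the first two columns are always `Channel` and `Region`, is to let `lapply()` iterate over the remaining columns. The name `sales_stats_lapply` and the toy data are hypothetical.

```r
# Toy data with made-up values (not taken from wholesaledata.csv)
toy <- data.frame(Channel = c(1, 2, 1),
                  Region  = c(3, 3, 1),
                  Fresh   = c(10, 20, 60),
                  Milk    = c(5, 1, 3))

# Hypothetical lapply-based variant of sales_stats():
# one inner list(avg, max) per item column, named after the column
sales_stats_lapply <- function(df) {
    lapply(df[c(-1, -2)], function(col) list(avg = mean(col), max = max(col)))
}

stats <- sales_stats_lapply(toy)
stats$Fresh$avg   # 30
stats$Milk$max    # 5
```

Because a data frame is a list of columns, `lapply()` keeps the column names as the names of the outer list, which gives the same `$Fresh$avg` access pattern as the loop.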



**(E)**

In [7]:
# Cross-tabulate region (rows) against channel (columns); df must hold the original numeric codes
contingency_table <- function(df) table(tofactors(df)$Region, tofactors(df)$Channel)

contingency_table(sales)

          
           Horeca Retail
  Bursa        28     19
  Istanbul     59     18
  Other       211    105
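
As a small illustration of how `table()` builds a two-way table, the sketch below cross-tabulates two label vectors from a toy data frame. Passing named arguments labels the two dimensions, and `addmargins()` appends row and column totals. The data is invented and the variable names are assumptions for this example only.

```r
# Toy data with made-up labels (not taken from wholesaledata.csv)
toy <- data.frame(Channel = c("Horeca", "Retail", "Horeca", "Horeca"),
                  Region  = c("Other", "Other", "Bursa", "Other"))

# Named arguments become the dimension names of the table
tab <- table(Region = toy$Region, Channel = toy$Channel)
tab

tab["Other", "Horeca"]   # 2

# Add row and column totals
addmargins(tab)
```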

**(F)**

In [8]:
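# Sum every item column within each (region, channel) group; df must hold the original numeric codes
break_sales <- function(df) aggregate(df[c(-1,-2)], by=list(tofactors(df)$Region, tofactors(df)$Channel), sum)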

break_sales(sales)

  Group.1  Group.2 Fresh   Milk    Grocery Frozen Detergents_Paper Delicassen
1 Bursa    Horeca   326215   64519  123074 160861  13516            30965    
2 Istanbul Horeca   761233  228342  237542 184512  56081            70632    
3 Other    Horeca  2928269  735753  820101 771606 165990           320358    
4 Bursa    Retail   138506  174625  310200  29271 159795            23541    
5 Istanbul Retail    93600  194112  332495  46514 148055            33695    
6 Other    Retail  1032308 1153006 1675150 158886 724420           191752    
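
The `Group.1` and `Group.2` headers appear because the `by` list is unnamed. If descriptive headers are preferred over matching the expected output exactly, one option is to name the list elements, as in this sketch on invented data.

```r
# Toy data with made-up values (not taken from wholesaledata.csv)
toy <- data.frame(Channel = c("Horeca", "Horeca", "Retail", "Horeca"),
                  Region  = c("Bursa", "Bursa", "Bursa", "Other"),
                  Fresh   = c(1, 2, 10, 100),
                  Milk    = c(5, 5, 20, 200))

# Naming the elements of `by` names the grouping columns in the result
res <- aggregate(toy[c("Fresh", "Milk")],
                 by = list(Region = toy$Region, Channel = toy$Channel),
                 FUN = sum)
res
```

Only the group combinations that occur in the data get a row, so this toy example yields three rows rather than four.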