<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Supplementary-Material-for-Exploratory-Data-Analysis" data-toc-modified-id="Supplementary-Material-for-Exploratory-Data-Analysis-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Supplementary Material for Exploratory Data Analysis</a></span></li><li><span><a href="#Case-#1:-Finding-the-right-phone" data-toc-modified-id="Case-#1:-Finding-the-right-phone-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Case #1: Finding the right phone</a></span></li><li><span><a href="#Case-#2:-A-treat-for-Loyal-Customer" data-toc-modified-id="Case-#2:-A-treat-for-Loyal-Customer-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Case #2: A treat for Loyal Customer</a></span></li></ul></div>

# Supplementary Material for Exploratory Data Analysis

```python
import pandas as pd
import numpy as np
```

# Case #1: Finding the right phone

Let's say you've been asked to make a You*ube video about "The Phone of 2019". You currently have no idea which phone is the best. Luckily, a friend gave you data scraped from a local online retailer.
Read the data using:
```python 
data = pd.read_csv('data_input/handphone.csv', encoding='ISO-8859-1')
```
Don't worry about the `encoding` argument; it only tells pandas which standard code is used to represent the text in the file (usually `utf-8`, but `ISO-8859-1` for this file).
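
To make the idea of an encoding concrete, here is a minimal sketch (the sample string is made up for illustration) showing that the same text becomes different bytes under `utf-8` and `ISO-8859-1`, which is why reading a file with the wrong encoding can fail:

```python
# Hypothetical sample text containing a non-ASCII character
text = "café"

utf8_bytes = text.encode("utf-8")         # 'é' takes two bytes in UTF-8
latin1_bytes = text.encode("ISO-8859-1")  # 'é' takes one byte in ISO-8859-1

print(utf8_bytes)
print(latin1_bytes)

# Decoding ISO-8859-1 bytes as UTF-8 fails here, which is the kind of
# error the `encoding` argument of pd.read_csv helps avoid.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)
```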

Use the data to answer the following questions and get a first look at it:

1. Find the top 5% most-reviewed phones. 

2. Get the 3 most frequent brands based on answer no. 1. <br>
Which brand doesn't belong to the top 3? 
- [ ] Xiaomi
- [ ] Oppo
- [ ] Apple
- [ ] Vivo


3. Based on answer no. 2, compare the average price for each brand. Which brand has the highest average price? 
- [ ] Xiaomi
- [ ] Oppo
- [ ] Apple
- [ ] Vivo


4. Based on answer no. 2, which brand's price varies the most? 
- [ ] Xiaomi
- [ ] Oppo
- [ ] Apple
- [ ] Vivo

Once you have answered all the questions above, you can decide which brand to nominate.

<details><summary>Answer no 1</summary>
    
To get the top 5%, sort the data in descending order of review count using `sort_values`. Then take the first 5% of the rows, where the number of rows is `round(len(data)*0.05)`. 
```python 
data = data.sort_values(by='review', ascending=False)
five_percent = round(len(data)*0.05)
top_five_percent = data.head(five_percent)
```
<br>
    
Alternatively, pandas provides the `.quantile` method to get the value at a given quantile. To get the top 5% of the data, select every row whose review count is at or above the 95th percentile, `data['review'].quantile(0.95)`. See the following code:
```python 
threshold = data['review'].quantile(0.95)
condition = data['review'] >= threshold
top_five_percent = data[condition]
```
<br>
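
As a rough check that the two approaches agree, here is a small sketch on made-up data (the `toy` frame and its values are assumptions, not part of the real dataset). When many rows share the same review count, the quantile approach keeps all of the tied rows, so the two results can differ in size.

```python
import pandas as pd

# Hypothetical toy data: 40 phones with review counts 1..40
toy = pd.DataFrame({
    "name": [f"phone_{i}" for i in range(1, 41)],
    "review": range(1, 41),
})

# Approach 1: sort, then keep the first 5% of rows
n_rows = round(len(toy) * 0.05)
top_by_head = toy.sort_values(by="review", ascending=False).head(n_rows)

# Approach 2: keep every row at or above the 95th percentile
threshold = toy["review"].quantile(0.95)
top_by_quantile = toy[toy["review"] >= threshold]

print(sorted(top_by_head["review"]))
print(sorted(top_by_quantile["review"]))
```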
    
</details>

<details><summary>Answer no 2</summary>
    
You can use the `value_counts()` method, which sorts value frequencies from highest to lowest. Apply it to the result of answer no. 1, then take the top 3 using `head` and get the index values.
    
```python
top_five_percent['brand_hp'].value_counts().head(3).index.values
```
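
To see what each step of that chain returns, here is a minimal sketch on a made-up brand column (the brand letters are placeholders, not real data):

```python
import pandas as pd

# Hypothetical brand column, not taken from the real dataset
brands = pd.Series(["A", "B", "A", "C", "A", "B", "D", "B", "A", "C"])

counts = brands.value_counts()      # sorted from most to least frequent
top3 = counts.head(3).index.values  # brand names only, counts dropped

print(counts)
print(top3)
```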
    
</details>

<details><summary>Answer no 3</summary>

To answer questions no. 3 and 4, you need to inspect each brand one at a time (don't worry, you can use `groupby` later).

```python
# Replace the placeholders with the brand names from answer no. 2
condition1 = data['brand_hp'] == 'specific brand1'
condition2 = data['brand_hp'] == 'specific brand2'
condition3 = data['brand_hp'] == 'specific brand3'
data_brand1 = data[condition1]
data_brand2 = data[condition2]
data_brand3 = data[condition3]

print(data_brand1['price'].mean())
print(data_brand2['price'].mean())
print(data_brand3['price'].mean())
```
<br>
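
For reference, the `groupby` route mentioned above could look roughly like this. It is only a sketch on made-up prices; the brand letters and the `top3_brands` list are assumptions standing in for the result of answer no. 2.

```python
import pandas as pd

# Hypothetical prices; the column names mirror the ones used above
toy = pd.DataFrame({
    "brand_hp": ["A", "A", "B", "B", "C", "C"],
    "price": [100, 300, 150, 350, 400, 600],
})

top3_brands = ["A", "B", "C"]  # stand-in for the answer to question 2
subset = toy[toy["brand_hp"].isin(top3_brands)]

# One mean per brand, computed in a single call
mean_price = subset.groupby("brand_hp")["price"].mean()
print(mean_price.sort_values(ascending=False))
```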
    
</details>

<details><summary>Answer no 4</summary>

Similar to no. 3, but instead of the mean we use `std` (standard deviation) to measure how much the values vary. 
    
```python
# Replace the placeholders with the brand names from answer no. 2
condition1 = data['brand_hp'] == 'specific brand1'
condition2 = data['brand_hp'] == 'specific brand2'
condition3 = data['brand_hp'] == 'specific brand3'
data_brand1 = data[condition1]
data_brand2 = data[condition2]
data_brand3 = data[condition3]

print(data_brand1['price'].std())
print(data_brand2['price'].std())
print(data_brand3['price'].std())
```
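
To illustrate why the standard deviation is a useful complement to the mean, here is a small sketch with made-up prices: both hypothetical brands have the same average price, but only one of them varies.

```python
import pandas as pd

# Hypothetical prices: same mean for both brands, different spread
toy = pd.DataFrame({
    "brand_hp": ["A", "A", "A", "B", "B", "B"],
    "price": [100, 100, 100, 50, 100, 150],
})

# Mean and (sample) standard deviation per brand in one table
summary = toy.groupby("brand_hp")["price"].agg(["mean", "std"])
print(summary)
```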
  
</details>

# Case #2: A treat for Loyal Customer

You are now the owner of a multinational store based mainly in the UK, and you are looking to open a branch in France. To grow the French market, you want to reward loyal customers in France beforehand. Who will be rewarded, and how does the French market compare to other countries? 

Load the data using 
```python 
data = pd.read_csv('data_input/online-retail.csv', encoding='ISO-8859-1', parse_dates=['InvoiceDate'])
```
and select only the rows where the country is `France`. Answer these questions in order to execute your plan: 

1. Get the top 10 most frequent customers.
2. In which months do they most frequently make transactions? (Get the top 3 months.) Which month doesn't belong to the top 3? 
- [ ] September 
- [ ] October
- [ ] November
- [ ] December

3. Which 10 items do they buy most frequently in the top 3 months (ignoring `Quantity`)? Select the item that doesn't belong to those top 10.
- [ ] WHITE HANGING HEART T-LIGHT HOLDER
- [ ] LUNCH BOX WITH CUTLERY RETROSPOT
- [ ] PLASTERS IN TIN CIRCUS PARADE
- [ ] RABBIT NIGHT LIGHT

4. Compare the revenue of those 10 items between France (use the France data) and all countries (use all the data) to see France's share of the whole market. What is the proportion of the France market? <br>
*Note: you can calculate revenue by multiplying the unit price by the quantity of each transaction.*
- [ ] 13.03 %
- [ ] 13.08 %
- [ ] 0.3 %
- [ ] 20.6 %

Now that you have the answers, you might discount those items during the most active months in the future and reward those ten customers in France. 

If you run into any problems, don't hesitate to look at these reference answers. (Your answers don't have to be exactly the same.)
<details><summary>Answer no.1</summary>

The only customer attribute in the data is `CustomerID`, so let's find the most frequent `CustomerID` (assuming we don't care about the `Quantity` values). 

First, select only the France data. Then count the values of `CustomerID` with `.value_counts()` and take `.head(10)` to get the top 10 most frequent customers. Don't forget to obtain the customer IDs using `.index.values`:
```python
france = data[data['Country'] == 'France']
top10_user = france['CustomerID'].value_counts().head(10).index.values
top10_user
```
<br>
</details>


<details><summary>Answer no.2</summary>

To answer this question, we need to narrow our data down to those top 10 customers. Using the `.isin()` method, you can filter the customers. <br>
After filtering, extract the month name from the `InvoiceDate` column using `.dt.month_name()` and count the values using `value_counts()`:

```python
condition = france['CustomerID'].isin(top10_user)
# Copy the subset so that adding a column does not raise SettingWithCopyWarning
data_customer = france[condition].copy()
data_customer['Month'] = data_customer['InvoiceDate'].dt.month_name()
top3_month = data_customer['Month'].value_counts().head(3).index.values
top3_month
```
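
The same filter-then-count pattern can be tried on a tiny made-up table first. This is only a sketch: the customer IDs and dates below are invented, and the list passed to `.isin()` stands in for `top10_user`.

```python
import pandas as pd

# Hypothetical transactions, not taken from the real dataset
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 2, 1],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-11-15", "2011-12-02",
        "2011-10-20", "2011-11-30", "2011-12-05",
    ]),
})

keep = toy["CustomerID"].isin([1, 2])  # stand-in for the top customers
selected = toy[keep].copy()            # copy before adding a column
selected["Month"] = selected["InvoiceDate"].dt.month_name()

month_counts = selected["Month"].value_counts()
print(month_counts)
```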
    
</details>

<details><summary>Answer no. 3</summary>
    
You need the top 3 most frequent months first. Using `.value_counts()`, you can select the top 3 months from the `Month` column. Get the month names and keep only the rows whose month is one of those names.<br>
Next, use `value_counts()` on `StockCode` to find the 10 most frequently bought products. <br>
To get the descriptions, filter the dataframe's `StockCode` using `.isin`. Because `.isin` returns every matching row, we need `drop_duplicates` so that each of the 10 products is shown only once. <br><br>
    
```python
top3_month = data_customer['Month'].value_counts().head(3).index.values
condition = data_customer['Month'].isin(top3_month)
data_month = data_customer[condition]
stock_code = data_month['StockCode'].value_counts().head(10).index.values
unique_item = data_month[data_month['StockCode'].isin(stock_code)].drop_duplicates(subset=['StockCode'])
top10_unique = unique_item['Description'].unique()
```
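
To show why `drop_duplicates` is needed after `.isin`, here is a minimal sketch with invented stock codes and descriptions (it keeps the top 2 codes instead of 10 to stay short):

```python
import pandas as pd

# Hypothetical purchases; each row is one transaction line
toy = pd.DataFrame({
    "StockCode": ["10A", "10A", "20B", "20B", "20B", "30C"],
    "Description": ["MUG", "MUG", "LAMP", "LAMP", "LAMP", "CLOCK"],
})

top_codes = toy["StockCode"].value_counts().head(2).index.values
matches = toy[toy["StockCode"].isin(top_codes)]  # still one row per purchase
unique_items = matches.drop_duplicates(subset=["StockCode"])

print(len(matches), len(unique_items))
print(unique_items["Description"].tolist())
```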
<br>
<br>
</details>

<details><summary>Answer no. 4</summary>

To answer this question, we need the revenue for all the data. So let's start over by reading the data again, then add a new column called `Revenue` whose value is the product of `data['Quantity']` and `data['UnitPrice']`. <br>
Once the `Revenue` column is created, keep only the top 10 items, then sum `Revenue` for the France-only rows and for all rows. Dividing the first sum by the second gives the proportion. 
    
```python
data = pd.read_csv('data_input/online-retail.csv', encoding='ISO-8859-1', parse_dates=['InvoiceDate'])
condition_france = data['Country'] == 'France'
condition_item = data['Description'].isin(top10_unique)
data['Revenue'] = data['Quantity'] * data['UnitPrice']

# Top 10 items sold in France vs. the same items sold in all countries
france = data[condition_france & condition_item]
# Sum only the Revenue column so that non-numeric columns are never aggregated
proportion = france['Revenue'].sum() / data.loc[condition_item, 'Revenue'].sum()

print('France Proportion: ', round(proportion*100, 2), '%')
```
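
The arithmetic can be checked by hand on a tiny made-up table. This is a sketch under assumed values: the countries, items, and prices below are invented, and `items` stands in for `top10_unique`.

```python
import pandas as pd

# Hypothetical sales lines, not taken from the real dataset
toy = pd.DataFrame({
    "Country": ["France", "France", "United Kingdom", "United Kingdom"],
    "Description": ["MUG", "LAMP", "MUG", "CLOCK"],
    "Quantity": [2, 1, 6, 3],
    "UnitPrice": [5.0, 20.0, 5.0, 10.0],
})
toy["Revenue"] = toy["Quantity"] * toy["UnitPrice"]

items = ["MUG"]  # stand-in for the list of top items
is_item = toy["Description"].isin(items)
is_france = toy["Country"] == "France"

france_revenue = toy.loc[is_item & is_france, "Revenue"].sum()
total_revenue = toy.loc[is_item, "Revenue"].sum()
proportion = france_revenue / total_revenue

print(round(proportion * 100, 2), "%")
```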
    
</details>