[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mosleh-exeter/BEM1025/blob/main/Tutorial/05-Tutorial05-Transformation-practice-solution.ipynb)

# Tutorial 05. Data Transformation in Pandas

In [1]:
import pandas as pd

### Loading dataset

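This tutorial uses the Superstore dataset from https://www.kaggle.com/datasets/vivek468/superstore-dataset-final. The copy loaded below is in long format: each measure (Sales, Quantity, Discount, Profit) is stored in its own row, labelled by the `variable` column, with the number in the `value` column.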

In [2]:
# Read the Superstore data from a CSV file hosted on GitHub, then keep only the columns we need.
# The data is a sample that ships with the commercial software Tableau.
df = pd.read_csv('https://github.com/mosleh-exeter/BEM1025/raw/main/Lecture/Superstore.csv')

selected_columns = ['Order ID', 'Order Date', 'Product ID', 'Segment', 'Region', 'Category', 'variable', 'value']

# Take a copy so that adding columns later does not touch (or warn about) a slice of df
df_order = df[selected_columns].copy()
# or, equivalently:
# df_order = pd.DataFrame(df, columns=selected_columns)


In [3]:
df_order.head()

| | Order ID | Order Date | Product ID | Segment | Region | Category | variable | value |
|---|---|---|---|---|---|---|---|---|
| 0 | CA-2013-152156 | 2013-11-09 00:00:00 | FUR-BO-10001798 | Consumer | South | Furniture | Sales | 261.96 |
| 1 | CA-2013-152156 | 2013-11-09 00:00:00 | FUR-CH-10000454 | Consumer | South | Furniture | Sales | 731.94 |
| 2 | CA-2013-138688 | 2013-06-13 00:00:00 | OFF-LA-10000240 | Corporate | West | Office Supplies | Sales | 14.62 |
| 3 | US-2012-108966 | 2012-10-11 00:00:00 | FUR-TA-10000577 | Consumer | South | Furniture | Sales | 957.5775 |
| 4 | US-2012-108966 | 2012-10-11 00:00:00 | OFF-ST-10000760 | Consumer | South | Office Supplies | Sales | 22.368 |


In [4]:
df_order['variable'].unique()

array(['Sales', 'Quantity', 'Discount', 'Profit'], dtype=object)
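
The four values of `variable` indicate that the data is in long format, so any question about a single measure starts with a filter on that column. A minimal sketch of the idea, using a toy frame whose IDs and numbers are invented for illustration (they are not Superstore rows):

```python
import pandas as pd

# Toy long-format frame; all values here are made up for illustration.
toy_long = pd.DataFrame({
    "Order ID": ["A-1", "A-1", "A-2", "A-2"],
    "variable": ["Sales", "Profit", "Sales", "Profit"],
    "value": [100.0, 20.0, 50.0, -5.0],
})

# Count how many rows each measure contributes.
rows_per_measure = toy_long["variable"].value_counts()

# Keep only the rows that hold sales figures.
toy_sales = toy_long[toy_long["variable"] == "Sales"]
print(rows_per_measure)
print(toy_sales)
```

The same boolean-mask pattern is what the solutions below apply to `df_order`.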

**Q1.** Find the top 10 products with the highest sales.

In [6]:
# Sort by the numeric 'value' column; after filtering, 'variable' is always 'Sales',
# so sorting by it would not rank the rows at all
df_order[df_order['variable']=='Sales'].sort_values(by='value',ascending=False).head(10)
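
Q1 can also be read as asking for products ranked by their total sales across all order lines, rather than for the largest individual order lines. Under that reading, one possible sketch groups by `Product ID` before ranking. The toy frame and its product IDs are hypothetical:

```python
import pandas as pd

# Toy long-format data; product IDs and values are invented.
toy = pd.DataFrame({
    "Product ID": ["P1", "P2", "P1", "P3", "P2"],
    "variable": ["Sales", "Sales", "Sales", "Sales", "Profit"],
    "value": [10.0, 40.0, 35.0, 20.0, 99.0],
})

top_products = (
    toy[toy["variable"] == "Sales"]   # keep sales rows only
    .groupby("Product ID")["value"]   # one group per product
    .sum()                            # total sales per product
    .nlargest(2)                      # use 10 on the full dataset
)
print(top_products)
```

`nlargest` returns the result already sorted in descending order, so no separate `sort_values` call is needed.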


**Q2.** Find the average profit for each region for each year.

Tip: You can use `pd.Grouper` with an appropriate frequency: https://pandas.pydata.org/docs/reference/api/pandas.Grouper.html

In [7]:
df_order['date_formatted']=pd.to_datetime(df_order['Order Date'])
# The question asks for profit, so filter on 'Profit' and average only the numeric 'value' column.
# On pandas 2.2 or later, use freq='YE'; the 'Y' alias is deprecated there.
df_order[df_order['variable']=='Profit'].groupby(
    ['Region', pd.Grouper(key='date_formatted',freq='1Y')])['value'].mean().reset_index()
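
`pd.Grouper` with a yearly frequency labels each group with a year-end timestamp such as `2011-12-31`. If plain integer years are preferred, one alternative is to derive the year with `.dt.year` and group on it like any other column. This is a sketch on invented rows, and the `Year` column name is an assumption made for this example:

```python
import pandas as pd

# Toy long-format data; regions, dates and values are invented.
toy = pd.DataFrame({
    "Region": ["East", "East", "East", "West"],
    "Order Date": ["2013-01-05", "2013-07-20", "2014-03-02", "2013-11-11"],
    "variable": ["Profit"] * 4,
    "value": [10.0, 30.0, 8.0, -4.0],
})
toy["Order Date"] = pd.to_datetime(toy["Order Date"])

# Keep profit rows and add an integer year column (hypothetical name "Year").
profit = toy[toy["variable"] == "Profit"].assign(
    Year=lambda d: d["Order Date"].dt.year
)

# Average profit per region and year.
yearly_profit = profit.groupby(["Region", "Year"], as_index=False)["value"].mean()
print(yearly_profit)
```

Selecting `["value"]` before calling `mean()` keeps non-numeric columns out of the aggregation, which recent pandas versions would otherwise reject.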


**Q3.** Transform the dataframe so that Discount, Profit, Quantity, and Sales each have a separate column; then find the maximum value of each for each segment.

In [8]:
df_order_wide=df_order.pivot_table(
    index=['Order ID','Order Date','Product ID','Segment','Region','Category'],
    columns='variable',
    values='value').reset_index()
df_order_wide.head(1)

| variable | Order ID | Order Date | Product ID | Segment | Region | Category | Discount | Profit | Quantity | Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CA-2011-100006 | 2011-09-07 00:00:00 | TEC-PH-10002075 | Consumer | East | Technology | 0.0 | 109.6113 | 3.0 | 377.97 |


In [9]:
df_order_summary=df_order_wide.groupby('Segment')[['Discount','Profit','Quantity','Sales']].max().reset_index()
df_order_summary

| variable | Segment | Discount | Profit | Quantity | Sales |
|---|---|---|---|---|---|
| 0 | Consumer | 0.8 | 6719.9808 | 14.0 | 13999.96 |
| 1 | Corporate | 0.8 | 8399.976 | 14.0 | 17499.95 |
| 2 | Home Office | 0.8 | 3919.9888 | 14.0 | 22638.48 |
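
To see what the pivot step does in isolation, it can help to run it on a few rows small enough to check by hand. The sketch below uses an invented toy frame with only two measures. Note that `pivot_table` aggregates duplicate index/column combinations with the mean by default, which is why it works even when the index columns do not identify rows uniquely.

```python
import pandas as pd

# Toy long-format data; order IDs, segments and values are invented.
toy_long = pd.DataFrame({
    "Order ID": ["A-1", "A-1", "A-2", "A-2", "A-3", "A-3"],
    "Segment": ["Consumer", "Consumer", "Corporate", "Corporate",
                "Consumer", "Consumer"],
    "variable": ["Sales", "Profit", "Sales", "Profit", "Sales", "Profit"],
    "value": [100.0, 20.0, 50.0, -5.0, 300.0, 45.0],
})

# Long -> wide: one row per order, one column per measure.
toy_wide = toy_long.pivot_table(
    index=["Order ID", "Segment"],
    columns="variable",
    values="value",
).reset_index()

# Maximum of each measure within each segment.
toy_max = toy_wide.groupby("Segment")[["Profit", "Sales"]].max()
print(toy_wide)
print(toy_max)
```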


**Q4.** Transform the dataframe from the previous task so that Discount, Profit, Quantity, and Sales each have separate rows.

In [10]:
# Wide -> long: stack the four measure columns back into variable/value pairs
df_order_summary.melt(id_vars=['Segment'],
                      value_vars=['Discount','Profit','Quantity','Sales'],
                      var_name='variable',
                      value_name='value')

| | Segment | variable | value |
|---|---|---|---|
| 0 | Consumer | Discount | 0.8 |
| 1 | Corporate | Discount | 0.8 |
| 2 | Home Office | Discount | 0.8 |
| 3 | Consumer | Profit | 6719.9808 |
| 4 | Corporate | Profit | 8399.976 |
| 5 | Home Office | Profit | 3919.9888 |
| 6 | Consumer | Quantity | 14.0 |
| 7 | Corporate | Quantity | 14.0 |
| 8 | Home Office | Quantity | 14.0 |
| 9 | Consumer | Sales | 13999.96 |
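
`melt` and `pivot` are inverse operations, which gives a quick way to check a reshape: melting a wide frame and pivoting it back should recover the original columns. A minimal sketch on an invented two-measure frame (the names `wide_toy`, `long_toy` and `back` are hypothetical):

```python
import pandas as pd

# Toy wide-format summary; values are invented.
wide_toy = pd.DataFrame({
    "Segment": ["Consumer", "Corporate"],
    "Profit": [45.0, -5.0],
    "Sales": [300.0, 50.0],
})

# Wide -> long: every non-id column becomes a (variable, value) pair.
long_toy = wide_toy.melt(id_vars="Segment", var_name="variable", value_name="value")

# Long -> wide again; pivot requires each (Segment, variable) pair to be unique.
back = (
    long_toy.pivot(index="Segment", columns="variable", values="value")
    .reset_index()
    .rename_axis(columns=None)  # drop the leftover 'variable' axis label
)
print(long_toy)
print(back)
```

Unlike `pivot_table`, plain `pivot` raises an error on duplicate index/column pairs instead of averaging them, so it also acts as a uniqueness check.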
