# Transforming

In [1]:
import pandas as pd

#### Read csv files

In [2]:
from pathlib import Path

# creating a relative path to the data folder 
pth = Path('../../data')

In [3]:
# read canvas csv file into a dataframe 
canvas = pd.read_csv(pth / 'canvas.csv')

In [4]:
# read graffiti csv file into a dataframe
graffiti = pd.read_csv(pth/'graffiti.csv')

## Table formats

A question in data analysis that seems like it should be obvious, but often is not, is how to set up your data.

To answer this question you first need to consider the following:

* What are your variables?
* What is your unit of observation?
* How should the data be laid out in a table? What does a row or a column actually represent?

Answering these practical questions is not as trivial as you may think. If the data is not organized in the right way, analysis and visualization become difficult or even impossible.

For the most part data tables can be found in two main formats (or a mixture of both):
1. Wide format
2. Long format (aka *tidy*, *tall*)

### Wide format

In this format, all of the observations about a single subject are in the same row, and each measured attribute has its own column. For example:

|id|Product|Height|Width|Weight|
|--|-------|------|-----|------|
|0| Samsung Edge| 15|10|3|
|1| Samsung Note|20|10|4|
|2| IPhone15|10|5|2|


In [5]:
# generate wide dataframe
data = {'product': ['Samsung Edge', 'Samsung Note', 'IPhone15'],
        'height':[15, 20, 10],
        'width': [10, 10, 5],
        'weight': [3, 4, 2]}
wide = pd.DataFrame(data)
wide

Unnamed: 0,product,height,width,weight
0,Samsung Edge,15,10,3
1,Samsung Note,20,10,4
2,IPhone15,10,5,2


### Long (*tall* or *tidy*) format

In this format:
* Each variable must have its own column
* Each observation must have its own row
* Each value must have its own cell

|id|Product|Attribute| Measure|
|--|-------|------|-----|
|0| IPhone15|Height|10|
|1| IPhone15|Width|5|
|2| IPhone15|Weight|2|
|3| Samsung Note|Height|20|
|4| Samsung Note|Width|10|
|5| Samsung Note|Weight|4|
|6| Samsung Edge| Height|15|
|7| Samsung Edge| Width|10|
|8| Samsung Edge| Weight|3|





## *Wide* to *Long* (*tall* or *tidy*)

In pandas this can be achieved in several different ways, but here we will consider only one of them. Whenever we want to go from a wide format to a long one, we will use the pandas function [`melt()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html). Its main parameters are (for more examples see [here](https://hausetutorials.netlify.app/posts/2020-05-14-reshaping-data-in-python-pandas/)):
* *id_vars*: Column/s used as identifiers. They are kept as columns and their values are repeated for every unpivoted column.
* *value_vars*: Column/s that are going to be 'unpivoted' (i.e. their names are stacked into a single 'variable' column and their contents into a matching 'value' column). If not specified, all columns that are not listed in *id_vars* are used.
* *var_name*: Name to be used to label the ‘variable’ column. If None, pandas uses ‘variable’ as a label.
* *value_name*: Name to be used to label the ‘value’ column. The default is ‘value’.
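The effect of these parameters can be seen on a very small frame. The sketch below uses a made-up two-row DataFrame (`phones` and its contents are hypothetical, not part of the course data) and compares a call that passes only `id_vars` with a call that sets all four parameters.

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the parameters
phones = pd.DataFrame({
    'product': ['A', 'B'],
    'height': [15, 20],
    'width': [10, 10],
})

# Only id_vars given: every other column is unpivoted and the new
# columns get the default labels 'variable' and 'value'
default_names = phones.melt(id_vars='product')

# All four parameters given: only 'height' is unpivoted and the new
# columns are labelled 'attribute' and 'measure'
custom_names = phones.melt(
    id_vars='product',
    value_vars=['height'],
    var_name='attribute',
    value_name='measure',
)

print(default_names)
print(custom_names)
```

The first result has one row per product and unpivoted column, while the second keeps only the rows that come from `height`.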


```{image} ../../images/melt.png
:alt: melt
:width: 85%
:align: center
```

In general, data in *long* or *tidy* format is easier to analyze, because all values of a variable sit in a single column that can be used directly in calculations. Let's reshape our small wide dataframe into a long/tidy dataframe.

In [6]:
tidy = wide.melt(id_vars='product', value_vars=['width', 'height', 'weight'], var_name='attribute', value_name='measure')
tidy

Unnamed: 0,product,attribute,measure
0,Samsung Edge,width,10
1,Samsung Note,width,10
2,IPhone15,width,5
3,Samsung Edge,height,15
4,Samsung Note,height,20
5,IPhone15,height,10
6,Samsung Edge,weight,3
7,Samsung Note,weight,4
8,IPhone15,weight,2
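One reason the long format is convenient is that a single grouping or filtering step covers every attribute at once. The following is a minimal sketch that rebuilds the same small example so it runs on its own; the names `summary` and `heights` are only illustrative.

```python
import pandas as pd

# Rebuild the small example so this block is self-contained
wide = pd.DataFrame({
    'product': ['Samsung Edge', 'Samsung Note', 'IPhone15'],
    'height': [15, 20, 10],
    'width': [10, 10, 5],
    'weight': [3, 4, 2],
})
tidy = wide.melt(id_vars='product', var_name='attribute', value_name='measure')

# One groupby call summarizes every attribute
summary = tidy.groupby('attribute')['measure'].agg(['min', 'max', 'mean'])
print(summary)

# Selecting a single attribute is a plain boolean filter
heights = tidy[tidy['attribute'] == 'height']
print(heights)
```

In the wide format the same summary would require naming each measurement column explicitly.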


Reshape the graffiti DataFrame into a long/tidy dataframe, specifying only **one** identifier column. Every other column is unpivoted, so each row holds one column-value pair from the original DataFrame.

In [7]:
graffiti.melt(id_vars='type', var_name='attribute')

Unnamed: 0,type,attribute,value
0,piece,id,0
1,piece,id,1
2,tag,id,2
3,hollow,id,3
4,tag,id,4
...,...,...,...
72655,piece,transcription,SNIDE
72656,piece,transcription,CARPO (VOA)
72657,hollow,transcription,CARPO
72658,tag,transcription,MUTE


Reshape the graffiti DataFrame into a long/tidy dataframe, specifying **two** identifier columns. Since no *var_name* is given, the new column gets the default label ‘variable’.

In [8]:
graffiti.melt(id_vars=['type', 'message'])

Unnamed: 0,type,message,variable,value
0,piece,writter,id,0
1,piece,other,id,1
2,tag,,id,2
3,hollow,,id,3
4,tag,writter,id,4
...,...,...,...,...
69022,piece,writter,transcription,SNIDE
69023,piece,writter,transcription,CARPO (VOA)
69024,hollow,writter,transcription,CARPO
69025,tag,writter,transcription,MUTE


In [9]:
# reshape graffiti dataframe into a long/tidy dataframe specifying identifier and column/s of interest
long = graffiti.melt(id_vars='type', value_vars=['width', 'height'])
long

Unnamed: 0,type,variable,value
0,piece,width,91
1,piece,width,91
2,tag,width,25
3,hollow,width,365
4,tag,width,53
...,...,...,...
7261,piece,height,90
7262,piece,height,90
7263,hollow,height,90
7264,tag,height,30


## *Long* (*tall* or *tidy*) to *Wide* 

In pandas, this can be achieved in several different ways. Here we will only consider one of them. To convert from long to wide we use the pandas function [`pivot_table()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html). Its main parameters are:
* *index*: Which column/s should be used to identify and order your rows vertically
* *columns*: Which column/s should be used to create the new columns. The new DataFrame will have one column for each unique value found there.
* *values*: Which column/s should be used to fill the values in the cells of our DataFrame.
* *aggfunc*: Which function/s should be used to combine values that share the same index/column combination. The default is the mean.

```{image} ../../images/pivot_table.png
:alt: pivot table
:width: 85%
:align: center
```

A pivot table takes columns of raw data from a pandas DataFrame and summarizes them. You can compute common aggregate statistics such as sums, counts, and averages, revealing trends that are hard to see in the raw data.
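As a minimal sketch of this summarizing behaviour, the block below builds a small made-up long-format frame (the `records` data is hypothetical and only loosely modelled on the graffiti columns) in which each type/variable pair occurs twice, and aggregates it first with the default mean and then with a sum.

```python
import pandas as pd

# Hypothetical long-format data with repeated type/variable pairs
records = pd.DataFrame({
    'type': ['tag', 'tag', 'piece', 'piece', 'tag', 'tag', 'piece', 'piece'],
    'variable': ['width', 'width', 'width', 'width',
                 'height', 'height', 'height', 'height'],
    'value': [20, 40, 100, 200, 10, 30, 80, 120],
})

# Without aggfunc, repeated pairs are averaged
means = records.pivot_table(index='type', columns='variable', values='value')

# aggfunc swaps in a different summary, here the sum
totals = records.pivot_table(index='type', columns='variable',
                             values='value', aggfunc='sum')

print(means)
print(totals)
```

Each cell of the result summarizes all rows of `records` that share that combination of `type` and `variable`.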

In [10]:
# reshape small dataframe back into wide format
# (values come back as floats because pivot_table averages by default)
wide = tidy.pivot_table(index='product', columns='attribute', values='measure').reset_index()
wide

attribute,product,height,weight,width
0,IPhone15,10.0,2.0,5.0
1,Samsung Edge,15.0,3.0,10.0
2,Samsung Note,20.0,4.0,10.0
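The values in the output above are floats because `pivot_table()` aggregates with the mean by default. When every index/column pair occurs exactly once, as in this small example, the related pandas method `pivot()` reshapes without aggregating. The sketch below is an alternative to the approach used in this section, not a replacement for it; it rebuilds the small example so it runs on its own, and `wide_again` is an illustrative name.

```python
import pandas as pd

# Rebuild the small tidy example so this block is self-contained
tidy = pd.DataFrame({
    'product': ['Samsung Edge', 'Samsung Note', 'IPhone15'],
    'height': [15, 20, 10],
    'width': [10, 10, 5],
    'weight': [3, 4, 2],
}).melt(id_vars='product', var_name='attribute', value_name='measure')

# pivot() only reshapes; it raises an error if a pair is duplicated
wide_again = (
    tidy.pivot(index='product', columns='attribute', values='measure')
        .reset_index()
        .rename_axis(columns=None)  # drop the leftover 'attribute' label
)
print(wide_again)
print(wide_again.dtypes)
```

Because nothing is averaged, the integer measurements stay integers.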


In [11]:
# generate a pivot table with counts and means from the graffiti long format
wide = long.pivot_table(index='type', columns='variable', aggfunc=['count', 'mean']).reset_index()
wide

Unnamed: 0_level_0,type,count,count,mean,mean
Unnamed: 0_level_1,Unnamed: 1_level_1,value,value,value,value
variable,Unnamed: 1_level_2,height,width,height,width
0,blockbuster,17,17,78.941176,232.705882
1,edging,15,15,24.2,63.2
2,hollow,162,162,98.54321,226.006173
3,other,106,106,44.169811,58.915094
4,pasteUp,13,13,41.538462,22.461538
5,piece,121,121,120.958678,200.214876
6,stencil,8,8,25.625,21.0
7,sticker,442,442,9.99095,10.466063
8,tag,2446,2446,29.544971,40.646361
9,throwUp,280,280,108.45,205.346429


We can rename the columns and drop redundant ones. The two count columns (one for height, one for width) hold the same numbers, so we only need to keep one of them.

In [12]:
wide.columns = ['type', 'count_dummy', 'count', 'mean_height', 'mean_width']
wide

Unnamed: 0,type,count_dummy,count,mean_height,mean_width
0,blockbuster,17,17,78.941176,232.705882
1,edging,15,15,24.2,63.2
2,hollow,162,162,98.54321,226.006173
3,other,106,106,44.169811,58.915094
4,pasteUp,13,13,41.538462,22.461538
5,piece,121,121,120.958678,200.214876
6,stencil,8,8,25.625,21.0
7,sticker,442,442,9.99095,10.466063
8,tag,2446,2446,29.544971,40.646361
9,throwUp,280,280,108.45,205.346429


In [13]:
# drop the redundant count column
wide.drop(columns='count_dummy')

Unnamed: 0,type,count,mean_height,mean_width
0,blockbuster,17,78.941176,232.705882
1,edging,15,24.2,63.2
2,hollow,162,98.54321,226.006173
3,other,106,44.169811,58.915094
4,pasteUp,13,41.538462,22.461538
5,piece,121,120.958678,200.214876
6,stencil,8,25.625,21.0
7,sticker,442,9.99095,10.466063
8,tag,2446,29.544971,40.646361
9,throwUp,280,108.45,205.346429
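Typing the new column names by hand works for a handful of columns. With more statistics or variables, the multi-level column labels can instead be joined programmatically. The sketch below assumes a small made-up stand-in for the melted graffiti data, and it passes `values='value'` so that each column label has only two levels; both choices are illustrative assumptions rather than the steps used above.

```python
import pandas as pd

# Hypothetical stand-in for the melted graffiti data
long = pd.DataFrame({
    'type': ['tag', 'tag', 'piece', 'piece'],
    'variable': ['width', 'height', 'width', 'height'],
    'value': [40, 30, 200, 120],
})

summary = long.pivot_table(index='type', columns='variable',
                           values='value', aggfunc=['count', 'mean'])

# Each column label is a tuple such as ('count', 'height');
# join the parts into a single flat name
summary.columns = ['_'.join(label) for label in summary.columns]
summary = summary.reset_index()

print(summary.columns.tolist())
```

The flat names follow the pattern statistic_variable, so redundant columns can then be dropped by name as before.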


## Additional References

- More examples of using [`melt()`](https://www.slingacademy.com/article/pandas-using-dataframe-melt-method-5-examples/).
- More on [pivot tables](https://realpython.com/how-to-pandas-pivot-table/).