# Dataframe Transformation

* Query
* Transformation: new columns, df.apply(row_processor, axis=1)


In [None]:
import pandas
iris = pandas.read_csv('../Datasets/iris.csv')

In [None]:
iris.sample(5)

Task: Select all iris flowers with petal lengths within 5% of the largest.

+ find the largest petal length
+ define a threshold at 95% of the largest
+ select rows at or above this threshold

In [None]:
largest_pl = iris['PetalLength'].max()
threshold = largest_pl * 0.95
Q = (iris['PetalLength'] >= threshold)

In [None]:
data = iris[Q]
data
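
It may help to look at what `Q` actually holds. The sketch below repeats the same steps on a small made-up dataframe (the `toy` values are assumptions, not rows from `iris.csv`) and prints the boolean mask before using it to select rows.

```python
import pandas

# Toy data standing in for iris.csv; the values are made up for illustration.
toy = pandas.DataFrame({'PetalLength': [1.4, 4.7, 6.6, 6.9, 6.7]})

largest = toy['PetalLength'].max()
threshold = largest * 0.95               # 95% of the largest value
mask = toy['PetalLength'] >= threshold   # boolean Series, one entry per row

print(mask.tolist())

selected = toy[mask]                     # keeps only the rows where mask is True
print(selected)
```

The mask has one True/False entry per row, and indexing the dataframe with it keeps the True rows only.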

---

In [14]:
#PID:5
#
#  1. calculate tip percentage (tip/total_bill)
#  2. select customers that tip more than average.
#
#

import pandas
tips = pandas.read_csv('../Datasets/tips.csv')
tips['tip_perc'] = tips['tip']/tips['total_bill']
Q = (tips['tip_perc'] > tips['tip_perc'].mean())
data = tips[Q]
data

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_perc
2,21.01,3.50,Male,No,Sun,Dinner,3,0.166587
5,25.29,4.71,Male,No,Sun,Dinner,4,0.186240
6,8.77,2.00,Male,No,Sun,Dinner,2,0.228050
9,14.78,3.23,Male,No,Sun,Dinner,2,0.218539
10,10.27,1.71,Male,No,Sun,Dinner,2,0.166504
...,...,...,...,...,...,...,...,...
228,13.28,2.72,Male,No,Sat,Dinner,2,0.204819
231,15.69,3.00,Male,Yes,Sat,Dinner,3,0.191205
232,11.61,3.39,Male,No,Sat,Dinner,2,0.291990
234,15.53,3.00,Male,Yes,Sat,Dinner,2,0.193175


**Task**: identify people who spend a lot of money on their meals but tip very little.

We need to be more specific about a few things:
+ Spending a lot of money on meals:
    + Within 15% of the largest total bill.
+ Tipping very little (two possible definitions):
    + Within 15% of the smallest tip percentage, or
    + Less than the average tip percentage.
    
Steps:
+ Calculate the thresholds.
+ Construct the queries.
+ Select the data.

In [17]:
tips['tip_perc'] = tips['tip']/tips['total_bill']
t1 = 0.85 * tips['total_bill'].max()
# t2 = 1.15 * tips['tip_perc'].min()
t2 = tips['tip_perc'].mean()
big_spender = (tips['total_bill'] >= t1)
small_tipper = (tips['tip_perc'] <= t2)
data = tips[ big_spender & small_tipper ]
data

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_perc
59,48.27,6.73,Male,No,Sat,Dinner,4,0.139424
102,44.3,2.5,Female,Yes,Sat,Dinner,3,0.056433
156,48.17,5.0,Male,No,Sun,Dinner,6,0.103799
182,45.35,3.5,Male,Yes,Sun,Dinner,3,0.077178
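
The other definition of a small tipper (within 15% of the smallest tip percentage) can be tried the same way. A minimal sketch on made-up numbers, not rows from `tips.csv`; the `toy` dataframe and its values are assumptions for illustration only.

```python
import pandas

# Made-up bills and tips, not taken from tips.csv.
toy = pandas.DataFrame({
    'total_bill': [10.0, 20.0, 48.0, 50.0, 45.0],
    'tip':        [2.0,  1.0,  2.4,  10.0, 9.0],
})
toy['tip_perc'] = toy['tip'] / toy['total_bill']

t1 = 0.85 * toy['total_bill'].max()   # big spender: within 15% of the largest bill
t2 = 1.15 * toy['tip_perc'].min()     # small tipper: within 15% of the smallest tip percentage

big_spender = toy['total_bill'] >= t1
small_tipper = toy['tip_perc'] <= t2

# `&` combines the two masks row by row; Python's `and` would raise an error here.
result = toy[big_spender & small_tipper]
print(result)
```

Note that each condition is built as its own mask first, then combined with `&`. This keeps the thresholds easy to swap.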


---

When we create a new column `tip_perc`, we technically transform the original dataframe.
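
To see the difference between changing the original dataframe and producing a new one, here is a small sketch. It uses `DataFrame.assign`, which returns a copy with the extra column; the `toy` dataframe is made up for illustration.

```python
import pandas

# Made-up data for illustration.
toy = pandas.DataFrame({'total_bill': [10.0, 20.0], 'tip': [2.0, 3.0]})

# assign() returns a new dataframe and leaves `toy` unchanged.
with_perc = toy.assign(tip_perc=toy['tip'] / toy['total_bill'])
columns_before = list(toy.columns)
print(columns_before)
print(list(with_perc.columns))

# Column assignment changes `toy` itself.
toy['tip_perc'] = toy['tip'] / toy['total_bill']
print(list(toy.columns))
```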

Task: calculate the area of sepals and petals for each flower.

+ Area of sepal = sepal length * sepal width; petal area is computed the same way.
+ What does `calculate` mean?

In [24]:
iris['sepal_area'] = iris['SepalLength'] * iris['SepalWidth']
iris['petal_area'] = iris['PetalLength'] * iris['PetalWidth']
iris

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species,sepal_area,petal_area
0,5.1,3.5,1.4,0.2,setosa,17.85,0.28
1,4.9,3.0,1.4,0.2,setosa,14.70,0.28
2,4.7,3.2,1.3,0.2,setosa,15.04,0.26
3,4.6,3.1,1.5,0.2,setosa,14.26,0.30
4,5.0,3.6,1.4,0.2,setosa,18.00,0.28
...,...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica,20.10,11.96
146,6.3,2.5,5.0,1.9,virginica,15.75,9.50
147,6.5,3.0,5.2,2.0,virginica,19.50,10.40
148,6.2,3.4,5.4,2.3,virginica,21.08,12.42


Task: categorize flowers into small, medium, and large sizes, based on sepal area.

Problem formulation:
+ small: more than 1 std below the mean.
+ medium: within 1 std of the mean.
+ large: more than 1 std above the mean.

In [27]:
t1 = iris['sepal_area'].mean() - iris['sepal_area'].std()
t2 = iris['sepal_area'].mean() + iris['sepal_area'].std()

In [40]:
def row_processor(row):
    value = row['sepal_area']
    if value < t1:
        size = 'small'
    elif value > t2:
        size = 'large'
    else:
        size = 'medium'
    return size

In [41]:
iris['size'] = iris.apply(row_processor, axis=1)

In [43]:
iris['size'].value_counts()

size
medium    103
small      24
large      23
Name: count, dtype: int64
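
The same labels can also be produced without a row-by-row function. One possible alternative, not the approach used above, is `numpy.select`, which picks a label per row from a list of conditions. The sketch below uses made-up `sepal_area` values.

```python
import numpy
import pandas

# Made-up sepal areas for illustration.
toy = pandas.DataFrame({'sepal_area': [10.0, 14.0, 15.0, 16.0, 20.0]})

mean = toy['sepal_area'].mean()
std = toy['sepal_area'].std()
t1 = mean - std   # below this: small
t2 = mean + std   # above this: large

# Conditions are checked in order; rows matching none of them get the default.
toy['size'] = numpy.select(
    [toy['sepal_area'] < t1, toy['sepal_area'] > t2],
    ['small', 'large'],
    default='medium',
)
print(toy)
```

On a large dataframe this usually runs faster than `apply` with `axis=1`, because the comparisons are done on whole columns at once.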

Exercise: identify and save customers who spend big and tip well.

Formulate:
+ Class A = people whose total bill is above average and whose tip percentage is above average.

In [49]:
import pandas

def row_processor2(row):
    # True when both the bill and the tip percentage are above average
    return (row['total_bill'] > average_total_bill) and (row['tip_perc'] > average_tip_perc)
    
tips = pandas.read_csv('../Datasets/tips.csv')
tips['tip_perc'] = tips['tip']/tips['total_bill']

average_total_bill = tips['total_bill'].mean()
average_tip_perc = tips['tip_perc'].mean()

tips['class_A'] = tips.apply(row_processor2, axis=1)

In [51]:
tips[tips['class_A']]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_perc,class_A
2,21.01,3.5,Male,No,Sun,Dinner,3,0.166587,True
5,25.29,4.71,Male,No,Sun,Dinner,4,0.18624,True
15,21.58,3.92,Male,No,Sun,Dinner,2,0.18165,True
19,20.65,3.35,Male,No,Sat,Dinner,3,0.162228,True
23,39.42,7.58,Male,No,Sat,Dinner,4,0.192288,True
28,21.7,4.3,Male,No,Sat,Dinner,2,0.198157,True
44,30.4,5.6,Male,No,Sun,Dinner,4,0.184211,True
46,22.23,5.0,Male,No,Sun,Dinner,2,0.224921,True
47,32.4,6.0,Male,No,Sun,Dinner,4,0.185185,True
54,25.56,4.34,Male,No,Sun,Dinner,4,0.169797,True
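
The exercise also asks to save the result. A minimal sketch of that step, assuming a CSV file is an acceptable format: the file name `class_A.csv`, the temporary folder, and the `toy` data are all assumptions for illustration.

```python
import os
import tempfile
import pandas

# Made-up bills and tips, not taken from tips.csv.
toy = pandas.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 40.0],
    'tip':        [1.0,  5.0,  3.0,  10.0],
})
toy['tip_perc'] = toy['tip'] / toy['total_bill']

# Same Class A rule, written as a column-wise mask.
is_class_a = (
    (toy['total_bill'] > toy['total_bill'].mean())
    & (toy['tip_perc'] > toy['tip_perc'].mean())
)
class_a = toy[is_class_a]

# Write to a temporary folder (hypothetical location) and read it back to check.
out_path = os.path.join(tempfile.mkdtemp(), 'class_A.csv')
class_a.to_csv(out_path, index=False)   # index=False leaves the row index out of the file
saved = pandas.read_csv(out_path)
print(saved)
```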
