# pandas
pandas: a library that provides high-level data structures and analysis tools for working with data effectively
- pandas Series: a data structure suited to one-dimensional data
- pandas DataFrame: a data structure suited to two-dimensional data organized in rows and columns

Reference 1) 파이썬으로 배우는 트레이딩 알고리즘 (https://wikidocs.net/4367) <br>
Reference 2) Python for Data Analysis <br>
Reference 3) pandas documentation (https://pandas.pydata.org/pandas-docs/stable/dsintro.html)

In [1]:
import pandas as pd

# Series
- In some respects it resembles a Python list, and in others a dictionary
- Unlike a plain one-dimensional array, it stores an index label alongside each value<br>

<font size = 4.7><center>< __Structure of Series__ ></center></font>
<img src="https://wikidocs.net/images/page/4364/r13.02.png" alt="Drawing" style="width: 300px;"/>

In [2]:
from pandas import Series

In [3]:
kakao = Series([92600, 92400, 92100, 94300, 92300])
kakao

0    92600
1    92400
2    92100
3    94300
4    92300
dtype: int64

In [4]:
kakao2 = Series([92600, 92400, 92100, 94300, 92300], 
                index=['2016-02-19', '2016-02-18', '2016-02-17', '2016-02-16', '2016-02-15'])
kakao2

2016-02-19    92600
2016-02-18    92400
2016-02-17    92100
2016-02-16    94300
2016-02-15    92300
dtype: int64
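
The list-like and dictionary-like sides of a Series can be sketched as follows. This is only an illustration: `prices` and the other variable names are made up for the example, and the values are the first three entries of `kakao2` above.

```python
import pandas as pd

# Illustrative Series with date strings as index labels.
prices = pd.Series([92600, 92400, 92100],
                   index=['2016-02-19', '2016-02-18', '2016-02-17'])

# Dictionary-like: look a value up by its label.
by_label = prices['2016-02-18']

# List-like: look a value up by its integer position.
by_position = prices.iloc[1]

# As with a dict, labels and values can be listed separately.
labels = list(prices.index)
values = list(prices.values)

print(by_label, by_position)
print(labels)
```

Using `.iloc` for positional access keeps the two kinds of lookup visibly separate, which matters once the labels themselves are integers.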

# DataFrame
- The main data structure in pandas
- A two-dimensional structure made of rows and columns (like a typical Excel sheet)
- Can be thought of as a dictionary of Series objects<br><br>

<font size = 4.7><center>< __Structure of DataFrame__ ></center></font>
<img src="https://wikidocs.net/images/page/4367/r13.10.png" alt="Drawing" style="width: 550px;"/>
<br>
<br>
<font size = 4.7><center>< __DataFrame as dictionary of Series__ ></center></font>
<img src="https://wikidocs.net/images/page/4367/r13.11.png" alt="Drawing" style="width: 650px;"/>

In [5]:
from pandas import DataFrame

In [6]:
raw_data = {'col0': [1, 2, 3, 4],
            'col1': [10, 20, 30, 40],
            'col2': [100, 200, 300, 400]}

data = DataFrame(raw_data)
data

Unnamed: 0,col0,col1,col2
0,1,10,100
1,2,20,200
2,3,30,300
3,4,40,400


In [7]:
daeshin = {'open':  [11650, 11100, 11200, 11100, 11000],
           'high':  [12100, 11800, 11200, 11100, 11150],
           'low' :  [11600, 11050, 10900, 10950, 10900],
           'close': [11900, 11600, 11000, 11100, 11050]}

date = ['16.02.29', '16.02.26', '16.02.25', '16.02.24', '16.02.23']

daeshin_day = DataFrame(daeshin, columns=['open', 'high', 'low', 'close'], index=date)
daeshin_day

Unnamed: 0,open,high,low,close
16.02.29,11650,12100,11600,11900
16.02.26,11100,11800,11050,11600
16.02.25,11200,11200,10900,11000
16.02.24,11100,11100,10950,11100
16.02.23,11000,11150,10900,11050


# Getting Started with pandas

In [8]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

## Creating Series & DataFrame
#### Series
- from ndarray
- from dict
- from scalar value
<br>
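
A minimal sketch of the three Series constructors listed above. The variable names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# From an ndarray: a default integer index 0..n-1 is attached.
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# From a dict: the keys become the index labels.
from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})

# From a scalar: the value is repeated for every label in the given index.
from_scalar = pd.Series(7, index=['x', 'y', 'z'])

print(from_array)
print(from_dict)
print(from_scalar)
```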

#### DataFrame
- from dict of Series or dicts
- from dict of ndarrays / lists
- from structured or record array
- from a list of dicts
- from a dict of tuples
- from a Series

*Reference: pandas documentation (https://pandas.pydata.org/pandas-docs/stable/dsintro.html)
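
Three of the DataFrame constructors listed above, as a rough sketch with made-up data. Note how the dict-of-Series form aligns on the index, so a label missing from one Series shows up as NaN.

```python
import pandas as pd

# From a dict of Series: indexes are aligned, missing labels become NaN.
from_series = pd.DataFrame({'one': pd.Series([1, 2], index=['a', 'b']),
                            'two': pd.Series([10, 20, 30], index=['a', 'b', 'c'])})

# From a dict of lists: each key is a column; all lists must have equal length.
from_lists = pd.DataFrame({'col0': [1, 2], 'col1': [10, 20]})

# From a list of dicts: each dict is one row, and the keys are column names.
from_records = pd.DataFrame([{'x': 1, 'y': 2}, {'x': 3, 'y': 4}])

print(from_series)
print(from_lists)
print(from_records)
```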

## Load Data
- CSV
- Excel
- Database
- URL

*Reference: http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
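
A hedged sketch of two of these sources that runs without external files: CSV text is read from an in-memory buffer, and the database case uses an in-memory SQLite connection. The table name `tips` and the sample rows are assumptions made for the example. `pd.read_csv` also accepts a URL string in place of a path, and `pd.read_excel` covers the Excel case but needs an Excel engine installed, so both are left out here.

```python
import io
import sqlite3
import pandas as pd

# CSV: read_csv accepts a path, a URL, or any file-like object.
csv_text = "customer_id,total_bill,tip\n1,16.99,1.01\n2,10.34,1.66\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Database: write the frame to an in-memory SQLite table, then query it back.
conn = sqlite3.connect(':memory:')
from_csv.to_sql('tips', conn, index=False)
from_db = pd.read_sql_query('SELECT customer_id, tip FROM tips WHERE tip > 1.5', conn)
conn.close()

print(from_csv)
print(from_db)
```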

## Read csv
- Reading a CSV file

tips data description
- customer_id: customer id
- total_bill: total bill (cost of the meal), including tax, in US dollars
- tip: tip (gratuity) in US dollars
- sex: sex of person paying for the meal (Male, Female)
- smoker: smoker in party? (No, Yes)
- day: Thur, Fri, Sat, Sun
- time: Lunch, Dinner
- size: size of the party

In [9]:
# Load the file with the read_csv function
tips = pd.read_csv('tips.csv')
tips

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
0,1,16.99,1.01,Female,No,Sun,Dinner,2.0
1,2,10.34,1.66,Male,No,Sunday,Dinner,3.0
2,3,21.01,3.50,Male,No,Sun,Dinner,3.0
3,4,23.68,3.31,Male,No,Sun,Dinner,2.0
4,5,24.59,3.61,Female,No,Sun,Dinner,4.0
5,6,25.29,4.71,Male,No,Sun,Dinner,4.0
6,7,8.77,2.00,Male,No,Sun,Dinner,2.0
7,8,26.88,3.12,Male,No,Sunday,Dinner,4.0
8,9,15.04,1.96,Male,No,Sun,Dinner,2.0
9,10,14.78,3.23,Male,No,Sun,Dinner,2.0


## Looking through data

Getting an overall view of the data

In [10]:
tips.describe()

Unnamed: 0,customer_id,total_bill,tip,size
count,251.0,251.0,248.0,250.0
mean,124.191235,19.987689,2.998347,2.58
std,71.523558,9.046846,1.38207,0.950375
min,1.0,3.07,1.0,1.0
25%,62.5,13.405,2.0,2.0
50%,124.0,17.82,2.855,2.0
75%,185.5,24.535,3.5625,3.0
max,248.0,50.81,10.0,6.0


In [11]:
tips.head()

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
0,1,16.99,1.01,Female,No,Sun,Dinner,2.0
1,2,10.34,1.66,Male,No,Sunday,Dinner,3.0
2,3,21.01,3.5,Male,No,Sun,Dinner,3.0
3,4,23.68,3.31,Male,No,Sun,Dinner,2.0
4,5,24.59,3.61,Female,No,Sun,Dinner,4.0


In [12]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 8 columns):
customer_id    251 non-null int64
total_bill     251 non-null float64
tip            248 non-null float64
sex            250 non-null object
smoker         250 non-null object
day            251 non-null object
time           251 non-null object
size           250 non-null float64
dtypes: float64(3), int64(1), object(4)
memory usage: 15.8+ KB


## Indexing, selection
- Looking up data

### Access by column name

In [13]:
tips[["total_bill", 'tip']]

Unnamed: 0,total_bill,tip
0,16.99,1.01
1,10.34,1.66
2,21.01,3.50
3,23.68,3.31
4,24.59,3.61
5,25.29,4.71
6,8.77,2.00
7,26.88,3.12
8,15.04,1.96
9,14.78,3.23


In [14]:
tips.day

0           Sun
1        Sunday
2           Sun
3           Sun
4           Sun
5           Sun
6           Sun
7        Sunday
8           Sun
9           Sun
10          Sun
11       Sunday
12          Sun
13          Sun
14       Sunday
15          Sun
16       Sunday
17       Sunday
18          Sun
19     Saturday
20          Sat
21          Sat
22          Sat
23          Sat
24          Sat
25          Sat
26          Sat
27     saturday
28          Sat
29          Sat
         ...   
221         Sat
222         Sat
223         Sat
224         Sat
225         Sat
226         Sat
227         Fri
228         Fri
229      Friday
230      Friday
231         Fri
232         Fri
233         Fri
234         Sat
235         Sat
236         Sat
237         Sat
238         Sat
239         Sat
240         Sat
241         Sat
242         Sat
243         Sat
244         Sat
245         Sat
246         Sat
247         Sat
248         Sat
249         Sat
250        Thur
Name: day, Length: 251, 

In [15]:
tips.day.values

array(['Sun', 'Sunday', 'Sun', 'Sun', 'Sun', 'Sun', 'Sun', 'Sunday',
       'Sun', 'Sun', 'Sun', 'Sunday', 'Sun', 'Sun', 'Sunday', 'Sun',
       'Sunday', 'Sunday', 'Sun', 'Saturday', 'Sat', 'Sat', 'Sat', 'Sat',
       'Sat', 'Sat', 'Sat', 'saturday', 'Sat', 'Sat', 'saturday', 'Sat',
       'Sat', 'Sat.', 'Sat', 'Sat.', 'Sat', 'Sat', 'Sat', 'Saturday',
       'Saturday', 'Sun', 'Sun', 'Sun', 'Sunday', 'Sunday', 'Sunday',
       'Sun', 'Sun', 'Sun', 'Sun', 'Sun.', 'Sun.', 'Sun.', 'Sun', 'Sun',
       'Sun', 'Sat', 'Sat', 'Sat', 'Sat', 'Sat', 'Sat', 'Sat.',
       'Saturday', 'Sat', 'Saturday', 'Sat', 'Sat', 'Thur', 'Sat', 'Sat',
       'Sat', 'Sat', 'Sat', 'Sat', 'Sat', 'Sat', 'Sat', 'Sat', 'Thur',
       'Thur', 'Thur', 'Thur', 'Thur', 'Thur', 'Thur', 'Thur', 'Thur',
       'Thur', 'Thur', 'Thur', 'Thur', 'Thur', 'Thur', 'Fri', 'Fri',
       'Friday', 'friday', 'Fri', 'Fri', 'Fri', 'Friday', 'Fri', 'Fri',
       'Fri', 'Friday', 'Sat', 'Sat', 'Sat', 'Fri', 'Sat', 'Sat', 'Sat',
       '

### Access by index

In [16]:
tips[0:7]

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
0,1,16.99,1.01,Female,No,Sun,Dinner,2.0
1,2,10.34,1.66,Male,No,Sunday,Dinner,3.0
2,3,21.01,3.5,Male,No,Sun,Dinner,3.0
3,4,23.68,3.31,Male,No,Sun,Dinner,2.0
4,5,24.59,3.61,Female,No,Sun,Dinner,4.0
5,6,25.29,4.71,Male,No,Sun,Dinner,4.0
6,7,8.77,2.0,Male,No,Sun,Dinner,2.0


### loc, iloc
- loc: selects by label (integers work only when the index labels themselves are integers)
- iloc: selects by integer position<br>

*Reference: https://datascienceschool.net/view-notebook/704731b41f794b8ea00768f5b0904512/
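
The difference is easiest to see when the integer labels do not match the positions. The frame below is a made-up example, not the tips data. It also shows that label slices include the end label while position slices exclude the end position.

```python
import pandas as pd

# Illustrative frame whose integer labels differ from the row positions.
df = pd.DataFrame({'tip': [1.01, 1.66, 3.50, 3.31],
                   'sex': ['Female', 'Male', 'Male', 'Male']},
                  index=[10, 20, 30, 40])

by_label = df.loc[20, 'tip']      # label 20 is the second row
by_position = df.iloc[1, 0]       # position 1, column 0 is the same cell

label_slice = df.loc[10:30]       # label slices include the end label
position_slice = df.iloc[0:2]     # position slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(position_slice)
```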

In [17]:
tips.loc[:, ['tip', 'sex']]
# tips.loc[:, [1, 2, 3]] raises a KeyError: the column labels are strings, not integers

Unnamed: 0,tip,sex
0,1.01,Female
1,1.66,Male
2,3.50,Male
3,3.31,Male
4,3.61,Female
5,4.71,Male
6,2.00,Male
7,3.12,Male
8,1.96,Male
9,3.23,Male


In [18]:
tips.iloc[[2, 5, 7], 1:4]

Unnamed: 0,total_bill,tip,sex
2,21.01,3.5,Male
5,25.29,4.71,Male
7,26.88,3.12,Male


## Exploring data

### Understanding the composition of the data

In [19]:
tips.columns

Index(['customer_id', 'total_bill', 'tip', 'sex', 'smoker', 'day', 'time',
       'size'],
      dtype='object')

In [20]:
tips.sex.unique()

array(['Female', 'Male', nan], dtype=object)

In [21]:
tips.day.unique()

array(['Sun', 'Sunday', 'Saturday', 'Sat', 'saturday', 'Sat.', 'Sun.',
       'Thur', 'Fri', 'Friday', 'friday'], dtype=object)

In [22]:
tips.sex.value_counts()

Male      159
Female     91
Name: sex, dtype: int64

### Sorting

In [23]:
tips.sort_values('total_bill', ascending=False)

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
177,175,50.81,10.00,Male,Yes,Sat,Dinner,3.0
110,109,48.68,,Female,Yes,Fri,Dinner,
219,217,48.33,9.00,Male,No,Sat,Dinner,4.0
60,60,48.27,6.73,Male,No,Sat,Dinner,4.0
163,161,48.17,5.00,Male,No,Sun,Dinner,6.0
189,187,45.35,3.50,Male,Yes,Sun,Dinner,3.0
107,106,44.30,2.50,Female,Yes,Sat,Dinner,3.0
204,202,43.11,5.00,Female,Yes,Thur,Lunch,4.0
148,147,41.19,5.00,Male,No,Thur,Lunch,5.0
191,189,40.55,3.00,Male,Yes,Sun,Dinner,2.0


### Filtering
- Extract only the rows that satisfy a condition

Select only the parties with no smoker

In [24]:
tips[tips.smoker == 'No']

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
0,1,16.99,1.01,Female,No,Sun,Dinner,2.0
1,2,10.34,1.66,Male,No,Sunday,Dinner,3.0
2,3,21.01,3.50,Male,No,Sun,Dinner,3.0
3,4,23.68,3.31,Male,No,Sun,Dinner,2.0
4,5,24.59,3.61,Female,No,Sun,Dinner,4.0
5,6,25.29,4.71,Male,No,Sun,Dinner,4.0
6,7,8.77,2.00,Male,No,Sun,Dinner,2.0
7,8,26.88,3.12,Male,No,Sunday,Dinner,4.0
8,9,15.04,1.96,Male,No,Sun,Dinner,2.0
9,10,14.78,3.23,Male,No,Sun,Dinner,2.0


Select only the rows where the total bill was at least 15 dollars

In [25]:
tips[tips['total_bill'] >= 15]

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
0,1,16.99,1.01,Female,No,Sun,Dinner,2.0
2,3,21.01,3.50,Male,No,Sun,Dinner,3.0
3,4,23.68,3.31,Male,No,Sun,Dinner,2.0
4,5,24.59,3.61,Female,No,Sun,Dinner,4.0
5,6,25.29,4.71,Male,No,Sun,Dinner,4.0
7,8,26.88,3.12,Male,No,Sunday,Dinner,4.0
8,9,15.04,1.96,Male,No,Sun,Dinner,2.0
11,12,35.26,5.00,Female,No,Sunday,Dinner,4.0
12,13,15.42,1.57,Male,No,Sun,Dinner,2.0
13,14,18.43,3.00,Male,No,Sun,Dinner,4.0


### Exercise 1) Using loc or iloc, examine the 'size' values of customers who were smokers and tipped at least 5 dollars.

In [56]:
tips.loc[(tips['smoker'] == 'Yes') & (tips['tip'] >= 5), ['size']]

Unnamed: 0,size
76,2.0
86,2.0
177,3.0
179,2.0
188,2.0
190,4.0
204,4.0
218,4.0
221,3.0


## Handling missing data

#### find missing data
*Reference: https://stackoverflow.com/questions/29530232/how-to-check-if-any-value-is-nan-in-a-pandas-dataframe

In [27]:
tips.isnull()

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False


In [28]:
tips.notnull()

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
0,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True
5,True,True,True,True,True,True,True,True
6,True,True,True,True,True,True,True,True
7,True,True,True,True,True,True,True,True
8,True,True,True,True,True,True,True,True
9,True,True,True,True,True,True,True,True


Finding which columns contain null values

In [29]:
tips.isnull().any()

customer_id    False
total_bill     False
tip             True
sex             True
smoker          True
day            False
time           False
size            True
dtype: bool

In [30]:
tips.isnull().sum() # for sum, see "Summarizing and Computing Descriptive Statistics"

customer_id    0
total_bill     0
tip            3
sex            1
smoker         1
day            0
time           0
size           1
dtype: int64

Finding which rows contain null values

In [31]:
tips[tips.isnull().any(axis=1)]

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
61,61,28.03,2.1,Female,,Sat,Lunch,2.0
69,69,15.2,,Male,Yes,Thur,Dinner,4.0
87,87,27.1,,,No,Thur,Lunch,3.0
110,109,48.68,,Female,Yes,Fri,Dinner,


#### Filtering out missing data

In [32]:
tips.dropna() # with how='all', only rows in which every value is null are dropped

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
0,1,16.99,1.01,Female,No,Sun,Dinner,2.0
1,2,10.34,1.66,Male,No,Sunday,Dinner,3.0
2,3,21.01,3.50,Male,No,Sun,Dinner,3.0
3,4,23.68,3.31,Male,No,Sun,Dinner,2.0
4,5,24.59,3.61,Female,No,Sun,Dinner,4.0
5,6,25.29,4.71,Male,No,Sun,Dinner,4.0
6,7,8.77,2.00,Male,No,Sun,Dinner,2.0
7,8,26.88,3.12,Male,No,Sunday,Dinner,4.0
8,9,15.04,1.96,Male,No,Sun,Dinner,2.0
9,10,14.78,3.23,Male,No,Sun,Dinner,2.0
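
The effect of the `how` argument is hard to see on the tips data, since no row there is entirely null. A small made-up frame shows it, together with `subset`, which restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 1 is partly missing, row 2 is entirely missing.
df = pd.DataFrame({'tip': [1.0, np.nan, np.nan],
                   'size': [2.0, 3.0, np.nan]})

any_missing = df.dropna()                 # default how='any': drops rows 1 and 2
all_missing = df.dropna(how='all')        # drops only row 2, where every value is NaN
tip_missing = df.dropna(subset=['tip'])   # looks at the 'tip' column only

print(any_missing)
print(all_missing)
print(tip_missing)
```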


#### Filling in Missing Data
*Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html
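
A minimal sketch of `fillna`, using made-up data. The fill values chosen here (0, the column mean, and the string `'Unknown'`) are illustrative assumptions, not a recommendation for the tips data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'tip': [1.0, np.nan, 3.0],
                   'sex': ['Female', None, 'Male']})

# One constant for every missing value in a single column.
filled_zero = df['tip'].fillna(0)

# A different fill value per column, passed as a dict keyed by column name.
filled_per_column = df.fillna({'tip': df['tip'].mean(), 'sex': 'Unknown'})

print(filled_zero)
print(filled_per_column)
```

Like most pandas methods, `fillna` returns a new object, so the result has to be assigned back to keep it.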

### Exercise 2) Remove the rows containing null values without using the dropna function.

In [33]:
tips[tips.notnull().all(axis=1)]

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
0,1,16.99,1.01,Female,No,Sun,Dinner,2.0
1,2,10.34,1.66,Male,No,Sunday,Dinner,3.0
2,3,21.01,3.50,Male,No,Sun,Dinner,3.0
3,4,23.68,3.31,Male,No,Sun,Dinner,2.0
4,5,24.59,3.61,Female,No,Sun,Dinner,4.0
5,6,25.29,4.71,Male,No,Sun,Dinner,4.0
6,7,8.77,2.00,Male,No,Sun,Dinner,2.0
7,8,26.88,3.12,Male,No,Sunday,Dinner,4.0
8,9,15.04,1.96,Male,No,Sun,Dinner,2.0
9,10,14.78,3.23,Male,No,Sun,Dinner,2.0


In [34]:
tips = tips[tips.notnull().all(axis=1)]

## Handling duplicated data

In [35]:
tips.duplicated()

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
221    False
222    False
223    False
224    False
225    False
226    False
227    False
228    False
229    False
230    False
231    False
232    False
233    False
234    False
235    False
236    False
237    False
238    False
239    False
240    False
241    False
242    False
243    False
244    False
245    False
246    False
247    False
248    False
249    False
250    False
Length: 247, dtype: bool

In [36]:
tips.drop_duplicates()

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
0,1,16.99,1.01,Female,No,Sun,Dinner,2.0
1,2,10.34,1.66,Male,No,Sunday,Dinner,3.0
2,3,21.01,3.50,Male,No,Sun,Dinner,3.0
3,4,23.68,3.31,Male,No,Sun,Dinner,2.0
4,5,24.59,3.61,Female,No,Sun,Dinner,4.0
5,6,25.29,4.71,Male,No,Sun,Dinner,4.0
6,7,8.77,2.00,Male,No,Sun,Dinner,2.0
7,8,26.88,3.12,Male,No,Sunday,Dinner,4.0
8,9,15.04,1.96,Male,No,Sun,Dinner,2.0
9,10,14.78,3.23,Male,No,Sun,Dinner,2.0


### Exercise 2) Check for duplicated rows and remove all of them. (Read up on the functions' parameters and make use of them.)
1. Check for duplicates
2. Remove them

In [57]:
tips.drop_duplicates()

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
0,1,16.99,1.01,Female,No,Sun,Dinner,2.0
1,2,10.34,1.66,Male,No,Sunday,Dinner,3.0
2,3,21.01,3.50,Male,No,Sun,Dinner,3.0
3,4,23.68,3.31,Male,No,Sun,Dinner,2.0
4,5,24.59,3.61,Female,No,Sun,Dinner,4.0
5,6,25.29,4.71,Male,No,Sun,Dinner,4.0
6,7,8.77,2.00,Male,No,Sun,Dinner,2.0
7,8,26.88,3.12,Male,No,Sunday,Dinner,4.0
8,9,15.04,1.96,Male,No,Sun,Dinner,2.0
9,10,14.78,3.23,Male,No,Sun,Dinner,2.0


In [58]:
tips = tips.drop_duplicates()
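
The parameters the exercise points to are `keep` and `subset`. The sketch below shows them on a made-up frame in which `customer_id` 2 appears twice; it is one possible way to use them, not the only one.

```python
import pandas as pd

df = pd.DataFrame({'customer_id': [1, 2, 2, 3],
                   'tip': [1.01, 1.66, 1.66, 3.50]})

# keep='first' (the default) flags only the later copies.
later_copies = df.duplicated()

# keep=False flags every member of a duplicated group.
all_copies = df.duplicated(keep=False)

# subset restricts the comparison to the listed columns;
# keep='last' retains the final occurrence instead of the first.
unique_ids = df.drop_duplicates(subset=['customer_id'], keep='last')

print(later_copies.tolist())
print(all_copies.tolist())
print(unique_ids)
```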

## Summarizing and Computing Descriptive Statistics

In [37]:
tips.describe()

Unnamed: 0,customer_id,total_bill,tip,size
count,247.0,247.0,247.0,247.0
mean,124.882591,19.829555,3.001984,2.574899
std,71.855026,8.901847,1.383686,0.950747
min,1.0,3.07,1.0,1.0
25%,62.5,13.38,2.0,2.0
50%,126.0,17.81,2.88,2.0
75%,186.5,24.175,3.575,3.0
max,248.0,50.81,10.0,6.0


In [38]:
tips.total_bill.mean()

19.829554655870453

In [39]:
tips.total_bill.quantile(0.25)

13.379999999999999

In [40]:
tips.cumsum()

Unnamed: 0,customer_id,total_bill,tip,sex,smoker,day,time,size
0,1,16.99,1.01,Female,No,Sun,Dinner,2
1,3,27.33,2.67,FemaleMale,NoNo,SunSunday,DinnerDinner,5
2,6,48.34,6.17,FemaleMaleMale,NoNoNo,SunSundaySun,DinnerDinnerDinner,8
3,10,72.02,9.48,FemaleMaleMaleMale,NoNoNoNo,SunSundaySunSun,DinnerDinnerDinnerDinner,10
4,15,96.61,13.09,FemaleMaleMaleMaleFemale,NoNoNoNoNo,SunSundaySunSunSun,DinnerDinnerDinnerDinnerDinner,14
5,21,121.9,17.8,FemaleMaleMaleMaleFemaleMale,NoNoNoNoNoNo,SunSundaySunSunSunSun,DinnerDinnerDinnerDinnerDinnerDinner,18
6,28,130.67,19.8,FemaleMaleMaleMaleFemaleMaleMale,NoNoNoNoNoNoNo,SunSundaySunSunSunSunSun,DinnerDinnerDinnerDinnerDinnerDinnerDinner,20
7,36,157.55,22.92,FemaleMaleMaleMaleFemaleMaleMaleMale,NoNoNoNoNoNoNoNo,SunSundaySunSunSunSunSunSunday,DinnerDinnerDinnerDinnerDinnerDinnerDinnerDinner,24
8,45,172.59,24.88,FemaleMaleMaleMaleFemaleMaleMaleMaleMale,NoNoNoNoNoNoNoNoNo,SunSundaySunSunSunSunSunSundaySun,DinnerDinnerDinnerDinnerDinnerDinnerDinnerDinn...,26
9,55,187.37,28.11,FemaleMaleMaleMaleFemaleMaleMaleMaleMaleMale,NoNoNoNoNoNoNoNoNoNo,SunSundaySunSunSunSunSunSundaySunSun,DinnerDinnerDinnerDinnerDinnerDinnerDinnerDinn...,28


In [41]:
tips.tip.idxmax()

177

In [42]:
tips.tip.max()

10.0

In [43]:
tips.loc[177, :]

customer_id       175
total_bill      50.81
tip                10
sex              Male
smoker            Yes
day               Sat
time           Dinner
size                3
Name: 177, dtype: object

In [44]:
tips.tip / tips.total_bill

0      0.059447
1      0.160542
2      0.166587
3      0.139780
4      0.146808
5      0.186240
6      0.228050
7      0.116071
8      0.130319
9      0.218539
10     0.166504
11     0.141804
12     0.101816
13     0.162778
14     0.203641
15     0.181650
16     0.161665
17     0.227747
18     0.206246
19     0.162228
20     0.227679
21     0.135535
22     0.141408
23     0.192288
24     0.160444
25     0.131387
26     0.149589
27     0.157604
28     0.198157
29     0.152672
         ...   
221    0.230742
222    0.085271
223    0.106572
224    0.129422
225    0.186047
226    0.102522
227    0.180921
228    0.259314
229    0.223776
230    0.187735
231    0.117735
232    0.153657
233    0.198216
234    0.146699
235    0.204819
236    0.130199
237    0.083299
238    0.191205
239    0.291990
240    0.136490
241    0.193175
242    0.124131
243    0.079365
244    0.035638
245    0.130338
246    0.203927
247    0.073584
248    0.088222
249    0.098204
250    0.159744
Length: 247, dtype: floa

## Function application and mapping

#### Working with numbers

In [45]:
tips[['total_bill', 'tip']].apply(lambda x: x.max())

total_bill    50.81
tip           10.00
dtype: float64

In [46]:
tips[['total_bill', 'tip']].apply(lambda x: x.max() - x.min(), axis=1)

0      15.98
1       8.68
2      17.51
3      20.37
4      20.98
5      20.58
6       6.77
7      23.76
8      13.08
9      11.55
10      8.56
11     30.26
12     13.85
13     15.43
14     11.81
15     17.66
16      8.66
17     12.58
18     13.47
19     17.30
20     13.84
21     17.54
22     13.54
23     31.84
24     16.64
25     15.47
26     11.37
27     10.69
28     17.40
29     16.65
       ...  
221    21.67
222    11.80
223    25.15
224    10.09
225     6.30
226    27.05
227     9.96
228     9.94
229     6.66
230    12.98
231    11.84
232    13.77
233     8.09
234    17.45
235    10.56
236    19.24
237    22.01
238    12.69
239     8.22
240     9.30
241    12.53
242     8.82
243    11.60
244    31.66
245    31.16
246    23.11
247    25.18
248    20.67
249    16.07
250    15.78
Length: 247, dtype: float64
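
The two `apply` calls above differ only in `axis`. As a small sketch on made-up data: with the default `axis=0` the function receives one column at a time, and with `axis=1` it receives one row at a time.

```python
import pandas as pd

df = pd.DataFrame({'total_bill': [16.99, 10.34], 'tip': [1.01, 1.66]})

# axis=0 (default): the function receives each column as a Series,
# so the result has one entry per column.
per_column = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series,
# so the result has one entry per row.
per_row = df.apply(lambda row: row['total_bill'] - row['tip'], axis=1)

print(per_column)
print(per_row)
```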

In [47]:
tips['tip'].apply(lambda x: x - 0.5)

0      0.51
1      1.16
2      3.00
3      2.81
4      3.11
5      4.21
6      1.50
7      2.62
8      1.46
9      2.73
10     1.21
11     4.50
12     1.07
13     2.50
14     2.52
15     3.42
16     1.17
17     3.21
18     3.00
19     2.85
20     3.58
21     2.25
22     1.73
23     7.08
24     2.68
25     1.84
26     1.50
27     1.50
28     3.80
29     2.50
       ... 
221    6.00
222    0.60
223    2.50
224    1.00
225    0.94
226    2.59
227    1.70
228    2.98
229    1.42
230    2.50
231    1.08
232    2.00
233    1.50
234    2.50
235    2.22
236    2.38
237    1.50
238    2.50
239    2.89
240    0.97
241    2.50
242    0.75
243    0.50
244    0.67
245    4.17
246    5.42
247    1.50
248    1.50
249    1.25
250    2.50
Name: tip, Length: 247, dtype: float64

In [48]:
def f(x):
    return x*2
tips['tip'].apply(f)

0       2.02
1       3.32
2       7.00
3       6.62
4       7.22
5       9.42
6       4.00
7       6.24
8       3.92
9       6.46
10      3.42
11     10.00
12      3.14
13      6.00
14      6.04
15      7.84
16      3.34
17      7.42
18      7.00
19      6.70
20      8.16
21      5.50
22      4.46
23     15.16
24      6.36
25      4.68
26      4.00
27      4.00
28      8.60
29      6.00
       ...  
221    13.00
222     2.20
223     6.00
224     3.00
225     2.88
226     6.18
227     4.40
228     6.96
229     3.84
230     6.00
231     3.16
232     5.00
233     4.00
234     6.00
235     5.44
236     5.76
237     4.00
238     6.00
239     6.78
240     2.94
241     6.00
242     2.50
243     2.00
244     2.34
245     9.34
246    11.84
247     4.00
248     4.00
249     3.50
250     6.00
Name: tip, Length: 247, dtype: float64

#### Working with strings

In [49]:
tips.day.replace('Friday', 'Fri')

0           Sun
1        Sunday
2           Sun
3           Sun
4           Sun
5           Sun
6           Sun
7        Sunday
8           Sun
9           Sun
10          Sun
11       Sunday
12          Sun
13          Sun
14       Sunday
15          Sun
16       Sunday
17       Sunday
18          Sun
19     Saturday
20          Sat
21          Sat
22          Sat
23          Sat
24          Sat
25          Sat
26          Sat
27     saturday
28          Sat
29          Sat
         ...   
221         Sat
222         Sat
223         Sat
224         Sat
225         Sat
226         Sat
227         Fri
228         Fri
229         Fri
230         Fri
231         Fri
232         Fri
233         Fri
234         Sat
235         Sat
236         Sat
237         Sat
238         Sat
239         Sat
240         Sat
241         Sat
242         Sat
243         Sat
244         Sat
245         Sat
246         Sat
247         Sat
248         Sat
249         Sat
250        Thur
Name: day, Length: 247, 

In [50]:
tips.day.replace(['Friday', 'friday'], 'Fri')  # the data contains 'friday', not 'fri'

0           Sun
1        Sunday
2           Sun
3           Sun
4           Sun
5           Sun
6           Sun
7        Sunday
8           Sun
9           Sun
10          Sun
11       Sunday
12          Sun
13          Sun
14       Sunday
15          Sun
16       Sunday
17       Sunday
18          Sun
19     Saturday
20          Sat
21          Sat
22          Sat
23          Sat
24          Sat
25          Sat
26          Sat
27     saturday
28          Sat
29          Sat
         ...   
221         Sat
222         Sat
223         Sat
224         Sat
225         Sat
226         Sat
227         Fri
228         Fri
229         Fri
230         Fri
231         Fri
232         Fri
233         Fri
234         Sat
235         Sat
236         Sat
237         Sat
238         Sat
239         Sat
240         Sat
241         Sat
242         Sat
243         Sat
244         Sat
245         Sat
246         Sat
247         Sat
248         Sat
249         Sat
250        Thur
Name: day, Length: 247, 

In [51]:
dict_day = {'Friday': 'Fri'}
tips.day.map(dict_day)

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
8      NaN
9      NaN
10     NaN
11     NaN
12     NaN
13     NaN
14     NaN
15     NaN
16     NaN
17     NaN
18     NaN
19     NaN
20     NaN
21     NaN
22     NaN
23     NaN
24     NaN
25     NaN
26     NaN
27     NaN
28     NaN
29     NaN
      ... 
221    NaN
222    NaN
223    NaN
224    NaN
225    NaN
226    NaN
227    NaN
228    NaN
229    Fri
230    Fri
231    NaN
232    NaN
233    NaN
234    NaN
235    NaN
236    NaN
237    NaN
238    NaN
239    NaN
240    NaN
241    NaN
242    NaN
243    NaN
244    NaN
245    NaN
246    NaN
247    NaN
248    NaN
249    NaN
250    NaN
Name: day, Length: 247, dtype: object
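
The output above is mostly NaN because `map` turns every value that is missing from the dictionary into NaN, whereas `replace` leaves unmatched values alone. A short sketch on a made-up Series; the `fillna` step is one possible way to keep the original values when using `map`.

```python
import pandas as pd

day = pd.Series(['Fri', 'Friday', 'Sat'])
dict_day = {'Friday': 'Fri'}

mapped = day.map(dict_day)                    # unmatched values become NaN
replaced = day.replace(dict_day)              # unmatched values are kept
mapped_keep = day.map(dict_day).fillna(day)   # map, then restore unmatched values

print(mapped.tolist())
print(replaced.tolist())
print(mapped_keep.tolist())
```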

In [52]:
tips['day'].apply(str.lower) # tips['day'].apply(lambda x: str.lower(x)) also works

0           sun
1        sunday
2           sun
3           sun
4           sun
5           sun
6           sun
7        sunday
8           sun
9           sun
10          sun
11       sunday
12          sun
13          sun
14       sunday
15          sun
16       sunday
17       sunday
18          sun
19     saturday
20          sat
21          sat
22          sat
23          sat
24          sat
25          sat
26          sat
27     saturday
28          sat
29          sat
         ...   
221         sat
222         sat
223         sat
224         sat
225         sat
226         sat
227         fri
228         fri
229      friday
230      friday
231         fri
232         fri
233         fri
234         sat
235         sat
236         sat
237         sat
238         sat
239         sat
240         sat
241         sat
242         sat
243         sat
244         sat
245         sat
246         sat
247         sat
248         sat
249         sat
250        thur
Name: day, Length: 247, 
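
An alternative to `apply(str.lower)` is the `.str` accessor, sketched here on a made-up Series. `str.lower` raises a `TypeError` when it meets a missing value, while the `.str` methods pass missing values through unchanged.

```python
import numpy as np
import pandas as pd

day = pd.Series(['Sun.', 'saturday', np.nan])

# .str methods skip missing values instead of raising.
lowered = day.str.lower()

# .str calls can be chained: strip a trailing dot, keep the first
# three characters, then capitalize the first letter.
short = day.str.rstrip('.').str[:3].str.capitalize()

print(lowered.tolist())
print(short.tolist())
```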

### Exercise 3) Implement the describe function using apply. (It only needs to work on tips[['total_bill', 'tip']].)
1. Define a function that returns a Series indexed by count, mean, std, min, 25%, 50%, 75%, and max
2. Use apply to run the function from step 1 on tips[['total_bill', 'tip']]
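
One possible sketch of the two steps, shown on a small made-up sample rather than the real data. The name `describe_column` is hypothetical. Because the function returns a Series, `apply` assembles the per-column results into a DataFrame.

```python
import pandas as pd

def describe_column(col):
    """Return the eight describe() statistics for one numeric column."""
    return pd.Series({'count': col.count(),
                      'mean': col.mean(),
                      'std': col.std(),
                      'min': col.min(),
                      '25%': col.quantile(0.25),
                      '50%': col.quantile(0.50),
                      '75%': col.quantile(0.75),
                      'max': col.max()})

# Illustrative sample; with the real data, use tips[['total_bill', 'tip']].
sample = pd.DataFrame({'total_bill': [16.99, 10.34, 21.01, 23.68],
                       'tip': [1.01, 1.66, 3.50, 3.31]})

summary = sample.apply(describe_column)
print(summary)
```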

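### Exercise 4) Unify the day values of the tips data into four categories.
1. Inspect the distinct day values in the tips data
2. Define a dictionary that maps the day values onto four categories
3. Use map or replace to unify the day values into the four categories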

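In [53]:
# day_dict maps each spelling variant returned by tips.day.unique() to one of four labels
day_dict = {'Sunday': 'Sun', 'Sun.': 'Sun', 'Saturday': 'Sat', 'saturday': 'Sat',
            'Sat.': 'Sat', 'Friday': 'Fri', 'friday': 'Fri'}
tips.day = tips.day.replace(day_dict)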

## Data Aggregation and Group Operations
- split: group the data by some key
- apply: apply a function to each group
- combine: merge the per-group results into a single result
<br>
<img src="https://i.stack.imgur.com/sgCn1.jpg" alt="Drawing" style="width: 450px;"/>
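
To make the three steps concrete, the sketch below performs them by hand on a made-up frame and then compares the result with the one-line `groupby` form. The variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({'sex': ['Female', 'Male', 'Male', 'Female'],
                       'tip': [1.0, 2.0, 4.0, 3.0]})

# split: one sub-frame per distinct value of 'sex'.
pieces = {name: group for name, group in sample.groupby('sex')}

# apply: compute the mean tip inside each piece.
means = {name: group['tip'].mean() for name, group in pieces.items()}

# combine: collect the per-group results into one Series.
combined = pd.Series(means, name='tip')

# groupby performs the same three steps in a single expression.
direct = sample.groupby('sex')['tip'].mean()

print(combined)
print(direct)
```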

In [None]:
tips.groupby('sex')

In [None]:
tips.groupby('sex').groups

In [None]:
grouped_sex = tips.groupby('sex')
grouped_sex.mean(numeric_only=True)  # average only the numeric columns
# tips.groupby('sex').mean(numeric_only=True)

In [None]:
tips.groupby(['sex', 'day']).count()['customer_id']

### Iterating over groups

In [None]:
for name, group in tips.groupby('smoker'):
    print(name)
    print(group)

In [None]:
for (name1, name2), group in tips.groupby(['sex', 'smoker']):
    print(name1, name2)
    print(group)

### Apply function

In [None]:
def f(x):
    return x.head()
tips.groupby(['sex', 'smoker']).apply(f)

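### Exercise 5) Grouping by sex and smoking status, show the full rows of the 5 largest tips in each group, in descending order of tip.
1. Define a function that returns the 5 rows with the largest tips
2. Use groupby to apply the function from step 1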
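
One possible sketch of the two steps on a made-up sample, using `n=2` so that the truncation is visible. The function name `top_tips` is hypothetical. Depending on the pandas version, the group columns may appear only in the result's index rather than also as columns.

```python
import pandas as pd

def top_tips(group, n=5):
    """Return the n rows of a group with the largest tip, largest first."""
    return group.sort_values('tip', ascending=False).head(n)

sample = pd.DataFrame({'sex': ['Male', 'Male', 'Male', 'Female', 'Female'],
                       'smoker': ['Yes', 'Yes', 'Yes', 'No', 'No'],
                       'tip': [1.0, 5.0, 3.0, 2.0, 4.0]})

# Extra keyword arguments to apply are forwarded to the function.
# The group labels become the outer levels of the result's index.
top = sample.groupby(['sex', 'smoker']).apply(top_tips, n=2)

print(top)
```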

## Data transformation

### Stack & Unstack

In [None]:
tips.groupby(['sex', 'day']).count()['customer_id']

In [None]:
tips.groupby(['sex', 'day']).count()['customer_id'].unstack()

### Pivot table

In [None]:
tips.pivot_table(values='tip', index='sex', columns='day', aggfunc='mean')

In [None]:
tips.pivot_table(values='customer_id', index='sex', columns='day', aggfunc='count')
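
The two sections above are closely related: on a made-up sample, `groupby` followed by `unstack` gives the same wide table as `pivot_table`. Combinations that never occur in the data, such as Male on Sun here, come out as NaN.

```python
import pandas as pd

sample = pd.DataFrame({'sex': ['Female', 'Female', 'Male', 'Male'],
                       'day': ['Sat', 'Sun', 'Sat', 'Sat'],
                       'tip': [1.0, 3.0, 2.0, 4.0]})

# Long form: a Series indexed by the (sex, day) pairs that occur in the data.
long_form = sample.groupby(['sex', 'day'])['tip'].mean()

# Wide form: unstack moves the inner index level (day) into the columns.
wide_form = long_form.unstack()

# pivot_table produces the same wide table in one call.
pivoted = sample.pivot_table(values='tip', index='sex', columns='day', aggfunc='mean')

print(long_form)
print(wide_form)
print(pivoted)
```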

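### Exercise 6) Compute the mean tip-to-bill ratio (%) for each day and sex.
1. Define a function that computes the mean ratio of tip to total bill
2. Use groupby to apply the function from step 1 to each group
3. Reshape the resulting DataFrame with stack or unstack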
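
One possible sketch of the three steps, run on a made-up sample. The function name `mean_tip_pct` is hypothetical. It averages the per-row ratios, which is one reading of the exercise; dividing total tips by total bills would give a different number.

```python
import pandas as pd

def mean_tip_pct(group):
    """Mean of tip / total_bill within one group, as a percentage."""
    return (group['tip'] / group['total_bill']).mean() * 100

sample = pd.DataFrame({'day': ['Sat', 'Sat', 'Sun', 'Sun'],
                       'sex': ['Male', 'Female', 'Male', 'Male'],
                       'total_bill': [20.0, 10.0, 40.0, 50.0],
                       'tip': [2.0, 2.0, 4.0, 10.0]})

# Select the two numeric columns so the function sees only what it needs.
pct = sample.groupby(['day', 'sex'])[['total_bill', 'tip']].apply(mean_tip_pct)

# One row per day, one column per sex.
table = pct.unstack()

print(pct)
print(table)
```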