# Chapter 8: pandas dtypes
## 8.1 Working with data types
### 8.1.1 The astype method (p. 172)
Let's use the tips dataset.

In [1]:
import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")
tips.dtypes

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object

#### 1) Converting various data types to strings
To convert a data type, use the astype method. The following converts the sex column to strings with astype and stores the result in a new column named sex_str.

In [2]:
tips['sex_str'] = tips['sex'].astype(str)
print(tips.dtypes)

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object


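#### 2) Restoring converted data to its original type
Here total_bill is first converted to a string and then converted back to float with the same astype method.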

In [3]:
tips['total_bill'] = tips['total_bill'].astype(str) 
print(tips.dtypes)

total_bill      object
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object


In [4]:
tips['total_bill'] = tips['total_bill'].astype(float) 
print(tips.dtypes)

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object
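
As a side note to the round trip above, astype can also be called on a whole DataFrame with a dict that maps column names to target types. The following is a minimal sketch on a made-up two-row frame (the frame `df` and its values are illustrative and are not taken from the tips data):

```python
import pandas as pd

# Illustrative frame whose numeric values arrived as strings.
df = pd.DataFrame({"total_bill": ["16.99", "10.34"], "size": ["2", "3"]})

# Convert several columns at once; astype returns a new DataFrame
# and leaves the original untouched.
converted = df.astype({"total_bill": "float64", "size": "int64"})
print(converted.dtypes)
```

Because astype returns a new object, the result has to be assigned back if the original frame should change.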


### 8.1.2 Handling incorrectly entered strings: the to_numeric function (p. 174)
Let's see how to handle a column in which strings were entered where floats were expected.

In [5]:
# Take an explicit copy so that later assignments modify an independent
# DataFrame and do not trigger SettingWithCopyWarning.
tips_sub_miss = tips.head(10).copy()
tips_sub_miss.loc[[1, 3, 5, 7], 'total_bill'] = 'missing'

print(tips_sub_miss)

  total_bill   tip     sex smoker  day    time  size sex_str
0      16.99  1.01  Female     No  Sun  Dinner     2  Female
1    missing  1.66    Male     No  Sun  Dinner     3    Male
2      21.01  3.50    Male     No  Sun  Dinner     3    Male
3    missing  3.31    Male     No  Sun  Dinner     2    Male
4      24.59  3.61  Female     No  Sun  Dinner     4  Female
5    missing  4.71    Male     No  Sun  Dinner     4    Male
6       8.77  2.00    Male     No  Sun  Dinner     2    Male
7    missing  3.12    Male     No  Sun  Dinner     4    Male
8      15.04  1.96    Male     No  Sun  Dinner     2    Male
9      14.78  3.23    Male     No  Sun  Dinner     2    Male


In [7]:
print(type(tips_sub_miss.total_bill[0]))
print(type(tips_sub_miss.total_bill[1]))
print(tips_sub_miss.dtypes)

<class 'float'>
<class 'str'>
total_bill      object
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object


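Because str values were inserted into a float column, the dtype of the column as a whole became object.

In this state, neither the astype method nor the to_numeric function (with its default settings) can convert the column to a numeric type.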

In [8]:
try:
    tips_sub_miss['total_bill'].astype(float)
except ValueError as e:
    print(e)

could not convert string to float: 'missing'


In [9]:
try:
    pd.to_numeric(tips_sub_miss['total_bill'])
except ValueError as e:
    print(e)

Unable to parse string "missing" at position 1


The errors option of pd.to_numeric defaults to 'raise', so let's see what happens when it is set to other values.

In [10]:
tips_sub_miss['total_bill'] = pd.to_numeric(tips_sub_miss['total_bill'], errors='ignore')
print(tips_sub_miss.dtypes)
tips_sub_miss.head()

total_bill      object
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object


  total_bill   tip     sex smoker  day    time  size sex_str
0      16.99  1.01  Female     No  Sun  Dinner     2  Female
1    missing  1.66    Male     No  Sun  Dinner     3    Male
2      21.01  3.50    Male     No  Sun  Dinner     3    Male
3    missing  3.31    Male     No  Sun  Dinner     2    Male
4      24.59  3.61  Female     No  Sun  Dinner     4  Female
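
With errors='ignore', a failed parse simply hands back the input unchanged, which is why total_bill is still object above. pandas 2.2 and later deprecate this value and suggest catching the exception explicitly instead. A minimal sketch of that alternative follows; the helper name `to_numeric_or_original` and the sample Series are hypothetical and are not part of pandas or of the book's code:

```python
import pandas as pd

def to_numeric_or_original(values):
    """Hypothetical helper: convert to numeric, or return the input unchanged on failure."""
    try:
        return pd.to_numeric(values)
    except (ValueError, TypeError):
        return values

mixed = pd.Series([16.99, "missing", 21.01])    # contains an unparseable string
clean = pd.Series(["16.99", "10.34", "21.01"])  # every entry is a numeric string

kept = to_numeric_or_original(mixed)       # parsing fails, input comes back as-is
converted = to_numeric_or_original(clean)  # parsing succeeds
print(kept.dtype, converted.dtype)
```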


In [11]:
tips_sub_miss['total_bill'] = pd.to_numeric(tips_sub_miss['total_bill'], errors='coerce')

print(tips_sub_miss.dtypes)
tips_sub_miss.head()

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object


  total_bill   tip     sex smoker  day    time  size sex_str
0      16.99  1.01  Female     No  Sun  Dinner     2  Female
1        NaN  1.66    Male     No  Sun  Dinner     3    Male
2      21.01  3.50    Male     No  Sun  Dinner     3    Male
3        NaN  3.31    Male     No  Sun  Dinner     2    Male
4      24.59  3.61  Female     No  Sun  Dinner     4  Female
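
Since errors='coerce' silently replaces every unparseable value with NaN, it can be useful to check which entries were lost. The following is a small sketch on a made-up Series (the values, including the 'n/a' entry, are illustrative and not from the tips data):

```python
import pandas as pd

raw = pd.Series([16.99, "missing", 21.01, "n/a", 24.59])
numeric = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the input but are NaN after conversion
# are the ones that failed to parse.
failed = raw[numeric.isna() & raw.notna()]

print(numeric.dtype)
print(failed.tolist())
```

The `raw.notna()` condition keeps values that were already missing in the input from being reported as parse failures.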


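The downcast option converts numeric data such as integers and floats to the smallest dtype that can still hold the values.

For example, downcast='float' turns float64 into float32, and downcast='integer' picks the smallest signed integer type that fits (as small as int8).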

In [12]:
tips_sub_miss['total_bill'] = pd.to_numeric(
    tips_sub_miss['total_bill'], errors='coerce', downcast='float'
)

print(tips_sub_miss.dtypes)
tips_sub_miss.head()

total_bill     float32
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object


  total_bill   tip     sex smoker  day    time  size sex_str
0      16.99  1.01  Female     No  Sun  Dinner     2  Female
1        NaN  1.66    Male     No  Sun  Dinner     3    Male
2      21.01  3.50    Male     No  Sun  Dinner     3    Male
3        NaN  3.31    Male     No  Sun  Dinner     2    Male
4      24.59  3.61  Female     No  Sun  Dinner     4  Female
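
The target dtype chosen by downcast depends on the values themselves. The following is a minimal sketch with three made-up Series (none of them come from the tips data) showing how the result changes with the range of the data:

```python
import pandas as pd

small_ints = pd.Series([1, 2, 3])           # fits in 8 bits
larger_ints = pd.Series([1, 300, 70000])    # 70000 needs 32 bits
floats = pd.Series([16.99, 21.01, 24.59])

small_dtype = pd.to_numeric(small_ints, downcast="integer").dtype
larger_dtype = pd.to_numeric(larger_ints, downcast="integer").dtype
float_dtype = pd.to_numeric(floats, downcast="float").dtype

print(small_dtype)   # int8
print(larger_dtype)  # int32
print(float_dtype)   # float32
```

Note that downcast='float' never goes below float32, and the conversion to float32 loses some precision.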


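## 8.2 Converting strings to categories (p. 179)
category is a dtype for string-like data that takes values from a limited set of categories. Internally it stores an integer code for each value, together with a lookup table that maps each code back to its string, which saves memory.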
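
The integer codes and the lookup table can be inspected through the `.cat` accessor. The following is a short sketch on a made-up four-element Series (not the tips data):

```python
import pandas as pd

sex = pd.Series(["Male", "Female", "Male", "Male"], dtype="category")

categories = sex.cat.categories.tolist()  # the lookup table
codes = sex.cat.codes.tolist()            # one small integer per row

print(categories)
print(codes)
```

Each code is the position of the row's value in the categories list, so the strings themselves are stored only once.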

Let's compare the memory used when the column is stored as strings and when it is stored as a category.

In [13]:
tips['sex'] = tips['sex'].astype('str')
tips.info()  # info() prints the summary itself and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null object
smoker        244 non-null category
day           244 non-null category
time          244 non-null category
size          244 non-null int64
sex_str       244 non-null object
dtypes: category(3), float64(2), int64(1), object(2)
memory usage: 10.7+ KB


In [17]:
tips['sex'] = tips['sex'].astype('category')
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null category
time          244 non-null category
size          244 non-null int64
sex_str       244 non-null object
dtypes: category(4), float64(2), int64(1), object(1)
memory usage: 9.2+ KB


In [18]:
tips[['sex']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 1 columns):
sex    244 non-null category
dtypes: category(1)
memory usage: 468.0 bytes
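
The "+" in figures such as "10.7+ KB" means the number is only a lower bound, because info() does not measure the Python strings held by object columns by default. For a closer comparison, memory_usage(deep=True) can be used. The following is a rough sketch on a synthetic column of 10,000 labels (the data is made up, and exact byte counts will vary by platform and pandas version, so none are shown):

```python
import pandas as pd

labels = ["Male", "Female"] * 5000

as_object = pd.Series(labels, dtype=object)   # one Python string per row
as_category = as_object.astype("category")    # integer codes plus a lookup table

# deep=True also counts the memory of the Python string objects.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```

The saving grows with the number of rows, since the category version stores each distinct string only once.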


What we covered:
- df.info()
- df.astype()
- pd.to_numeric()
- Series.astype('category')