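# 01 Machine Learning Concepts

- Machine learning is the general term for algorithmic techniques that learn patterns from data and predict outcomes without the application itself having to be modified.
- Typical examples: programs that detect fraudulent financial transactions, and spam mail filtering.
- Machine learning has advanced rapidly in areas where difficulty and development complexity keep growing, such as data mining, image recognition, speech recognition, and natural language processing.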

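## Types of Machine Learning

1. Supervised learning
- Classification
- Regression
- Recommender systems
- Visual/speech detection and recognition
- Text analysis, NLP

2. Unsupervised learning
- Clustering
- Dimensionality reduction

3. Reinforcement learning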

## NumPy

In [1]:
import numpy as np

In [2]:
array1=np.array([1,2,3])
print('array1 type:',type(array1))
print('array1 shape:',array1.shape)

array1 type: <class 'numpy.ndarray'>
array1 shape: (3,)


In [3]:
array2=np.array([[1,2,3],
                [2,3,4]])
print('array2 type:',type(array2))
print('array2 shape:',array2.shape)

array2 type: <class 'numpy.ndarray'>
array2 shape: (2, 3)


In [4]:
array3=np.array([[1,2,3]])
print('array3 type:',type(array3))
print('array3 shape:',array3.shape)

array3 type: <class 'numpy.ndarray'>
array3 shape: (1, 3)


In [5]:
print('array1: {0}-dimensional, array2: {1}-dimensional, array3: {2}-dimensional'.\
      format(array1.ndim,array2.ndim,array3.ndim))

array1: 1-dimensional, array2: 2-dimensional, array3: 2-dimensional


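An ndarray holds elements of a single data type, so operations always run on one type.  
Check the type with the **dtype** attribute.

When a list that mixes different data types is converted to an **ndarray**, every element is converted to the larger (more general) type.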

In [6]:
list1=[1,2,3]
print(type(list1))
array1=np.array(list1)
print(type(array1))
print(array1,array1.dtype)

<class 'list'>
<class 'numpy.ndarray'>
[1 2 3] int32


If int and string values are mixed, every element becomes a string.

In [7]:
list2=[1,2,'test']
print(type(list2))
array2=np.array(list2)
print(type(array2))
print(array2,array2.dtype)

<class 'list'>
<class 'numpy.ndarray'>
['1' '2' 'test'] <U11


If int and float values are mixed, every element becomes a float.

In [8]:
list3=[1,2,3.0]
print(type(list3))
array3=np.array(list3)
print(type(array3))
print(array3,array3.dtype)

<class 'list'>
<class 'numpy.ndarray'>
[1. 2. 3.] float64


Converting between float64 and int32 with **astype()**. Converting float to int drops the fractional part.

In [9]:
array_int=np.array([1,2,3])
array_float=array_int.astype('float64')
print(array_float,array_float.dtype)

[1. 2. 3.] float64


In [10]:
array_int1=array_float.astype('int32')
print(array_int1,array_int1.dtype)

[1 2 3] int32


In [11]:
array_float1=np.array([1.1,2.1,3.1])
array_int2=array_float1.astype('int32')
print(array_int2,array_int2.dtype)

[1 2 3] int32


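When you need to quickly create an ndarray of a given size and dimension filled with sequential values, 0s, or 1s, use **arange()**, **zeros()**, and **ones()**. **arange()** is the array counterpart of Python's **range()**.  

Given x, arange() returns an ndarray whose values run sequentially from 0 to x-1.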

In [12]:
sequence_array=np.arange(10)
print(sequence_array)
print(sequence_array.dtype,sequence_array.shape)

[0 1 2 3 4 5 6 7 8 9]
int32 (10,)
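
Like Python's range(), np.arange() also accepts a start, stop, and step. The sketch below illustrates this; the variable names `evens` and `halves` are made up for the example and do not appear in the notes.

```python
import numpy as np

# arange(start, stop, step): the stop value is excluded, as with range()
evens = np.arange(2, 11, 2)
print(evens)                  # [ 2  4  6  8 10]

# a float step produces a float64 array
halves = np.arange(0, 2, 0.5)
print(halves, halves.dtype)
```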


**zeros()** takes a **shape** tuple as its argument and returns an ndarray of that **shape** with every value set to 0.

In [13]:
zero_array=np.zeros((3,2),dtype='int32')
print(zero_array)
print(zero_array.dtype,zero_array.shape)

[[0 0]
 [0 0]
 [0 0]]
int32 (3, 2)


**ones()** takes a **shape** tuple as its argument and returns an ndarray of that **shape** with every value set to 1. When dtype is not given, the values are float64.

In [14]:
one_array=np.ones((3,2))
print(one_array)
print(one_array.dtype,one_array.shape)

[[1. 1.]
 [1. 1.]
 [1. 1.]]
float64 (3, 2)


**reshape()** converts an ndarray to a given dimension and size.

In [15]:
array1=np.arange(10)
print('array1:\n',array1)

array1:
 [0 1 2 3 4 5 6 7 8 9]


In [16]:
array2=array1.reshape(2,5)
print('array2:\n',array2)

array2:
 [[0 1 2 3 4]
 [5 6 7 8 9]]


In [17]:
array3=array1.reshape(5,2)
print('array3:\n',array3)

array3:
 [[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]


In [18]:
array1.reshape(3,4)

ValueError: cannot reshape array of size 10 into shape (3,4)

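arange(10) has 10 elements, so it cannot be reshaped to (3,4), which needs 12.

In practice, -1 is used often: the size of the axis marked -1 is inferred from the other axes and the total number of elements.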

In [19]:
array2=array1.reshape(-1,5)
print('array2 shape:',array2.shape)

array2 shape: (2, 5)


In [20]:
array3=array1.reshape(5,-1)
print('array3 shape:',array3.shape)

array3 shape: (5, 2)
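
Two further cases of -1 are worth a quick sketch: flattening with reshape(-1), and the error raised when the remaining axis does not divide the element count. The name `flat` is illustrative only.

```python
import numpy as np

array1 = np.arange(10)

# reshape(-1) flattens to 1D; the single axis length is inferred as 10
flat = array1.reshape(2, 5).reshape(-1)
print(flat.shape)             # (10,)

# -1 only works when the other axes divide the element count evenly
try:
    array1.reshape(-1, 4)     # 10 is not divisible by 4
except ValueError as e:
    print('ValueError:', e)
```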


Changing dimensions (3D ==> 2D)

In [21]:
array1=np.arange(8)
array3d=array1.reshape((2,2,2))
print('array3d:\n',array3d.tolist())

array3d:
 [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]


In [22]:
array5=array3d.reshape(-1,1)
print('array5:\n',array5.tolist())
print('array5 shape:',array5.shape)

array5:
 [[0], [1], [2], [3], [4], [5], [6], [7]]
array5 shape: (8, 1)


- Extracting a single value

In [23]:
array1=np.arange(1,10)
print('array1',array1)
value=array1[2]
print('value:',value)
print(type(value))

array1 [1 2 3 4 5 6 7 8 9]
value: 3
<class 'numpy.int32'>


In [24]:
print('last value:', array1[-1],'second-to-last value:', array1[-2])

last value: 9 second-to-last value: 8


- Slicing

In [25]:
array1=np.arange(1,10)
array3=array1[0:3]
print(array3)
print(type(array3))

[1 2 3]
<class 'numpy.ndarray'>


- Fancy indexing

In [26]:
array1d=np.arange(1,10)
array2d=array1d.reshape(3,3)
array3=array2d[[0,1],2]
print('array2d[[0,1],2]=>',array3.tolist())

array2d[[0,1],2]=> [3, 6]


In [27]:
array4=array2d[[0,1],0:2]
print('array2d[[0,1],0:2]=>',array4.tolist())

array2d[[0,1],0:2]=> [[1, 2], [4, 5]]


- Boolean indexing

In [28]:
array3=array1d[array1d>5]
print('array1d>5 boolean indexing result:',array3)

array1d>5 boolean indexing result: [6 7 8 9]
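
Boolean indexing works in two steps: the comparison builds a boolean array, and that array then selects the matching positions. A minimal sketch, where `mask` and `even_over_2` are illustrative names:

```python
import numpy as np

array1d = np.arange(1, 10)

# the comparison itself produces a boolean ndarray of the same shape
mask = array1d > 5
print(mask)

# indexing with the mask keeps only the positions that are True
print(array1d[mask])          # [6 7 8 9]

# conditions combine with & (and) and | (or); each needs its own parentheses
even_over_2 = array1d[(array1d > 2) & (array1d % 2 == 0)]
print(even_over_2)            # [4 6 8]
```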


Sorting arrays

In [29]:
org_array=np.array([3,1,5,9])
print('original array:',org_array)
sort_array1=np.sort(org_array)
print('sorted array returned by np.sort():',sort_array1)
print('original array after np.sort():',org_array)

original array: [3 1 5 9]
sorted array returned by np.sort(): [1 3 5 9]
original array after np.sort(): [3 1 5 9]


In [30]:
sort_array2=org_array.sort()
print('value returned by org_array.sort():',sort_array2)
print('original array after org_array.sort():',org_array)

value returned by org_array.sort(): None
original array after org_array.sort(): [1 3 5 9]


**Descending order**

In [31]:
sort_array1_desc=np.sort(org_array)[::-1]
print('sorted in descending order:',sort_array1_desc)

sorted in descending order: [9 5 3 1]


**Returning the indices of the sorted array**

In [32]:
org_array=np.array([3,1,9,5])
sort_indices=np.argsort(org_array)
print(type(sort_indices))
print('indices of the original array in sorted order:',sort_indices)

<class 'numpy.ndarray'>
indices of the original array in sorted order: [1 0 3 2]
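
A common use of argsort() is reordering a second array that is aligned with the first. The sketch below uses made-up names and scores (`name_array`, `score_array` are hypothetical data, not from the notes).

```python
import numpy as np

# hypothetical data: names and their scores at matching positions
name_array = np.array(['John', 'Mike', 'Sarah', 'Kate', 'Samuel'])
score_array = np.array([78, 95, 84, 98, 88])

# indices that would sort the scores in ascending order
sort_indices_asc = np.argsort(score_array)
print(sort_indices_asc)       # [0 2 4 1 3]

# fancy indexing with those indices reorders the names by score
names_by_score = name_array[sort_indices_asc]
print(names_by_score)
```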


**Matrix product (dot product)**  

Use **np.dot()**.

In [33]:
A=np.array([[1,2,3],
           [4,5,6]])
B=np.array([[7,8],
           [9,10],
           [11,12]])

In [34]:
dot_product=np.dot(A,B)
print('matrix product result:\n',dot_product)

matrix product result:
 [[ 58  64]
 [139 154]]
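
For 2D arrays, the `@` operator gives the same result as np.dot(). A short sketch to confirm this on the same A and B; `same` is an illustrative name.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

# for 2D arrays, A @ B matches np.dot(A, B)
same = np.array_equal(A @ B, np.dot(A, B))
print(same)                   # True

# the inner dimensions must agree: (2,3) x (3,2) -> (2,2)
print((A @ B).shape)          # (2, 2)
```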


**Transpose**

In [35]:
A=np.array([[1,2],
           [3,4]])
transpose_mat=np.transpose(A)
print('transpose of A:\n',transpose_mat)

transpose of A:
 [[1 3]
 [2 4]]


## pandas

In [36]:
import pandas as pd

In [37]:
titanic=pd.read_csv("titanic_train.csv")

In [38]:
titanic.shape

(891, 12)

In [39]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [40]:
titanic.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [41]:
value_counts=titanic['Pclass'].value_counts()
value_counts

3    491
1    216
2    184
Name: Pclass, dtype: int64

**Converting a NumPy ndarray, list, or dictionary to a DataFrame**

- 1D

In [42]:
col_name1=['col1']
list1=[1,2,3]
array1=np.array(list1)

In [43]:
array1

array([1, 2, 3])

In [44]:
array1.shape

(3,)

In [45]:
df_list1=pd.DataFrame(list1,columns=col_name1)

In [46]:
df_list1

| | col1 |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |


In [47]:
df_array1=pd.DataFrame(array1,columns=col_name1)

In [48]:
df_array1

| | col1 |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |


- 2D

In [49]:
col_name2=['col1','col2','col3']
list2=[[1,2,3],[11,12,13]]
array2=np.array(list2)

In [50]:
array2

array([[ 1,  2,  3],
       [11, 12, 13]])

In [51]:
array2.shape

(2, 3)

In [52]:
df_list2=pd.DataFrame(list2,columns=col_name2)

In [53]:
df_list2

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 11 | 12 | 13 |


In [54]:
df_array2=pd.DataFrame(array2,columns=col_name2)

In [55]:
df_array2

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 11 | 12 | 13 |


- Dictionary

In [56]:
# named dict1 so the built-in dict is not shadowed
dict1={'col1':[1,11],'col2':[2,22],'col3':[3,33]}
df_dict=pd.DataFrame(dict1)
df_dict

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 11 | 22 | 33 |


**Converting a DataFrame to a NumPy ndarray, list, or dictionary**

In [57]:
array3=df_dict.values
array3

array([[ 1,  2,  3],
       [11, 22, 33]], dtype=int64)

In [58]:
list3=df_dict.values.tolist()
list3

[[1, 2, 3], [11, 22, 33]]

In [59]:
dict3=df_dict.to_dict('list')
dict3

{'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]}

### Creating and modifying DataFrame columns

In [60]:
titanic['Age_0']=0

In [61]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 |


Creating a new column Series from the data of existing column Series

In [62]:
titanic['Age_by_10']=titanic['Age']*10
titanic['Family_No']=titanic['SibSp']+titanic['Parch']+1
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_0 | Age_by_10 | Family_No |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 | 220.0 | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 380.0 | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 0 | 260.0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 0 | 350.0 | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 | 350.0 | 1 |


In [63]:
titanic['Age_by_10']=titanic['Age_by_10']+100
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_0 | Age_by_10 | Family_No |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 | 320.0 | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 480.0 | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 0 | 360.0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 0 | 450.0 | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 | 450.0 | 1 |


### Deleting DataFrame data

In [64]:
titanic_drop=titanic.drop('Age_0',axis=1)
titanic_drop.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_by_10 | Family_No |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 320.0 | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 480.0 | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 360.0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 450.0 | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 450.0 | 1 |


In [65]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_0 | Age_by_10 | Family_No |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 | 320.0 | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 480.0 | 2 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 0 | 360.0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 0 | 450.0 | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 | 450.0 | 1 |


drop() defaults to inplace=False, so it returns a new DataFrame with the column removed and leaves titanic unchanged; Age_0 is still there.

In [66]:
drop_result=titanic.drop(['Age_0','Age_by_10','Family_No'],axis=1,inplace=True)
drop_result

In [67]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


With inplace=True, the columns are removed from titanic itself, and drop() returns None.
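
drop() also removes rows when axis=0 is given with index labels. A minimal sketch on a small made-up DataFrame (`df` and its columns are hypothetical, not the titanic data):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [10, 20, 30, 40]})

# axis=0 drops rows by index label and returns a new DataFrame
dropped = df.drop([0, 1], axis=0)
print(dropped.index.tolist())        # [2, 3]
print(len(df))                       # 4, the original is unchanged

# inplace=True modifies df itself and returns None
result = df.drop('col2', axis=1, inplace=True)
print(result, df.columns.tolist())   # None ['col1']
```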

### Index object

An Index object is a one-dimensional array that supports single-value access and slicing.  
**Its values cannot be modified once created.**

In [68]:
index=titanic.index

In [70]:
index[0]=5

TypeError: Index does not support mutable operations

Adding the index as a column

In [71]:
titanic_reset=titanic.reset_index(inplace=False)
titanic_reset.head()

| | index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


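reset_index() is used to rebuild the index as consecutive integers when the existing index is not a consecutive integer sequence. The old index is kept as a new column named index.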
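
A typical case of a non-consecutive index is the result of filtering. The sketch below uses a small hypothetical DataFrame (`df`, `third` are illustrative names) to show reset_index() with and without drop=True.

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [3, 1, 3, 2, 3]})

# filtering leaves gaps in the index: 0, 2, 4
third = df[df['Pclass'] == 3]
print(third.index.tolist())          # [0, 2, 4]

# reset_index() renumbers from 0 and keeps the old index as a column
reset = third.reset_index()
print(reset.columns.tolist())        # ['index', 'Pclass']

# drop=True discards the old index instead of keeping it
reset_drop = third.reset_index(drop=True)
print(reset_drop.index.tolist())     # [0, 1, 2]
```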

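### Data selection and filtering

Use iloc[ ] and loc[ ].

**The DataFrame [ ] operator**  
The [ ] operator is where NumPy and DataFrame data selection differ most, so it needs the most care.  

NumPy: specify row and column positions, slicing ranges, and so on.  
DataFrame: a column name, or an expression that can be converted to an index.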

In [72]:
titanic['Pclass'].head()

0    3
1    1
2    3
3    1
4    3
Name: Pclass, dtype: int64

In [73]:
titanic[['Survived','Pclass']].head()

| | Survived | Pclass |
|---|---|---|
| 0 | 0 | 3 |
| 1 | 1 | 1 |
| 2 | 1 | 3 |
| 3 | 1 | 1 |
| 4 | 0 | 3 |


Passing a number raises an error because it is not a column name.

In [74]:
titanic[0]

KeyError: 0

- Slicing: an expression that can be converted to an index

In [75]:
titanic[0:3]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


- Boolean indexing: also an expression that can be converted to an index

In [76]:
titanic[titanic['Pclass']==3].head()

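| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |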


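**The DataFrame iloc[ ] and loc[ ] operators**

iloc[ ] and loc[ ] are the indexing methods for selecting data by specifying rows and columns.  

iloc: position-based indexing; takes integers, integer slices, or lists of integers (fancy indexing).  
loc: label-based indexing; takes index labels for rows and column names for columns.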

In [77]:
data={'Name':['Chulmin','Eunkyung','Jinwoong','Soobeom'],
     'Year':[2011,2016,2015,2015],
     'Gender':['Male','Female','Male','Male']
     }
data_df=pd.DataFrame(data,index=['one','two','three','four'])
data_df

| | Name | Year | Gender |
|---|---|---|---|
| one | Chulmin | 2011 | Male |
| two | Eunkyung | 2016 | Female |
| three | Jinwoong | 2015 | Male |
| four | Soobeom | 2015 | Male |


In [78]:
data_df.iloc[0,0]

'Chulmin'

iloc raises an error when given a column name.

In [79]:
data_df.iloc[0,'Name']

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

In [80]:
data_df.loc['one','Name']

'Chulmin'
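
One difference between the two that is easy to miss is slicing: iloc excludes the end position, while loc includes the end label. A short sketch on a reduced copy of data_df (the names `by_position` and `by_label` are illustrative):

```python
import pandas as pd

data = {'Name': ['Chulmin', 'Eunkyung', 'Jinwoong', 'Soobeom'],
        'Year': [2011, 2016, 2015, 2015]}
data_df = pd.DataFrame(data, index=['one', 'two', 'three', 'four'])

# iloc slices by position: the end position is excluded
by_position = data_df.iloc[0:2, 0]
print(by_position.tolist())      # ['Chulmin', 'Eunkyung']

# loc slices by label: the end label is included
by_label = data_df.loc['one':'three', 'Name']
print(by_label.tolist())         # ['Chulmin', 'Eunkyung', 'Jinwoong']
```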

- Boolean indexing

In [81]:
titanic[titanic['Age']>60].head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33 | 34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5 | NaN | S |
| 54 | 55 | 0 | 1 | Ostby, Mr. Engelhart Cornelius | male | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C |
| 96 | 97 | 0 | 1 | Goldschmidt, Mr. George B | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C |
| 116 | 117 | 0 | 3 | Connors, Mr. Patrick | male | 70.5 | 0 | 0 | 370369 | 7.75 | NaN | Q |
| 170 | 171 | 0 | 1 | Van der hoef, Mr. Wyckoff | male | 61.0 | 0 | 0 | 111240 | 33.5 | B19 | S |


In [82]:
titanic[titanic['Age']>60][['Name','Age']].head()

| | Name | Age |
|---|---|---|
| 33 | Wheadon, Mr. Edward H | 66.0 |
| 54 | Ostby, Mr. Engelhart Cornelius | 65.0 |
| 96 | Goldschmidt, Mr. George B | 71.0 |
| 116 | Connors, Mr. Patrick | 70.5 |
| 170 | Van der hoef, Mr. Wyckoff | 61.0 |


The same selection works with loc.

In [83]:
titanic.loc[titanic['Age']>60,['Name','Age']].head()

| | Name | Age |
|---|---|---|
| 33 | Wheadon, Mr. Edward H | 66.0 |
| 54 | Ostby, Mr. Engelhart Cornelius | 65.0 |
| 96 | Goldschmidt, Mr. George B | 71.0 |
| 116 | Connors, Mr. Patrick | 70.5 |
| 170 | Van der hoef, Mr. Wyckoff | 61.0 |


Use & for and, | for or, and ~ for not. Wrap each condition in parentheses.

In [84]:
titanic[(titanic['Age']>60)&(titanic['Pclass']==1)&(titanic['Sex']=='female')]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 275 | 276 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | D7 | S |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
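
The | and ~ operators follow the same pattern as &. A minimal sketch on a hypothetical four-row stand-in for the titanic data (`df`, `old_or_first`, `not_third` are illustrative names):

```python
import pandas as pd

# hypothetical stand-in for a few titanic rows
df = pd.DataFrame({'Age': [70.0, 5.0, 30.0, 62.0],
                   'Pclass': [1, 3, 2, 3]})

# | keeps rows where either condition holds
old_or_first = df[(df['Age'] > 60) | (df['Pclass'] == 1)]
print(old_or_first.index.tolist())   # [0, 3]

# ~ negates a condition
not_third = df[~(df['Pclass'] == 3)]
print(not_third.index.tolist())      # [0, 2]
```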


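**Sorting DataFrame and Series: sort_values()**  
The main parameters are by, ascending, and inplace.

by: sorts by the given column or columns.  
ascending=True (the default) sorts in ascending order.  
inplace=False (the default) returns the sorted result and leaves the original unchanged.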

In [85]:
titanic_sorted=titanic.sort_values(by=['Name'])

In [86]:
titanic_sorted.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 845 | 846 | 0 | 3 | Abbing, Mr. Anthony | male | 42.0 | 0 | 0 | C.A. 5547 | 7.55 | NaN | S |
| 746 | 747 | 0 | 3 | Abbott, Mr. Rossmore Edward | male | 16.0 | 1 | 1 | C.A. 2673 | 20.25 | NaN | S |
| 279 | 280 | 1 | 3 | Abbott, Mrs. Stanton (Rosa Hunt) | female | 35.0 | 1 | 1 | C.A. 2673 | 20.25 | NaN | S |
| 308 | 309 | 0 | 2 | Abelson, Mr. Samuel | male | 30.0 | 1 | 0 | P/PP 3381 | 24.0 | NaN | C |
| 874 | 875 | 1 | 2 | Abelson, Mrs. Samuel (Hannah Wizosky) | female | 28.0 | 1 | 0 | P/PP 3381 | 24.0 | NaN | C |


In [87]:
titanic_sorted=titanic.sort_values(by=['Pclass','Name'],ascending=False)

In [88]:
titanic_sorted.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5 | NaN | S |
| 153 | 154 | 0 | 3 | van Billiard, Mr. Austin Blyler | male | 40.5 | 0 | 2 | A/5. 851 | 14.5 | NaN | S |
| 282 | 283 | 0 | 3 | de Pelsmaeker, Mr. Alfons | male | 16.0 | 0 | 0 | 345778 | 9.5 | NaN | S |
| 286 | 287 | 1 | 3 | de Mulder, Mr. Theodore | male | 30.0 | 0 | 0 | 345774 | 9.5 | NaN | S |
| 559 | 560 | 1 | 3 | de Messemaeker, Mrs. Guillaume Joseph (Emma) | female | 36.0 | 1 | 0 | 345572 | 17.4 | NaN | S |


**Aggregation**  
**Functions such as min(), max(), sum(), and count()**

In [89]:
titanic.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

In [90]:
titanic[['Age','Fare']].mean()

Age     29.699118
Fare    32.204208
dtype: float64

**groupby**

In [91]:
titanic.groupby(by='Pclass').count()

| Pclass | PassengerId | Survived | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 216 | 216 | 216 | 216 | 186 | 216 | 216 | 216 | 216 | 176 | 214 |
| 2 | 184 | 184 | 184 | 184 | 173 | 184 | 184 | 184 | 184 | 16 | 184 |
| 3 | 491 | 491 | 491 | 491 | 355 | 491 | 491 | 491 | 491 | 12 | 491 |


In [92]:
titanic.groupby(by='Pclass')['Age'].agg(['max','min'])

| Pclass | max | min |
|---|---|---|
| 1 | 80.0 | 0.92 |
| 2 | 70.0 | 0.67 |
| 3 | 74.0 | 0.42 |
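
To apply a different aggregation to each column, agg() also accepts a dictionary that maps column names to function names. A sketch on a small made-up DataFrame (`df`, `agg_format`, and `summary` are hypothetical names, and the values are not from the titanic data):

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [1, 1, 2, 3, 3],
                   'Age': [40.0, 20.0, 30.0, 10.0, 50.0],
                   'Fare': [80.0, 60.0, 20.0, 8.0, 7.0]})

# a dict maps each column to its own aggregation function
agg_format = {'Age': 'max', 'Fare': 'mean'}
summary = df.groupby('Pclass').agg(agg_format)
print(summary)
```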


Missing data: check for missing values with isna() or isnull(), and replace them with fillna().

In [93]:
titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [94]:
titanic['Cabin']=titanic['Cabin'].fillna('C000')

In [95]:
titanic['Age']=titanic['Age'].fillna(titanic['Age'].mean())

In [96]:
titanic['Embarked']=titanic['Embarked'].fillna('S')

In [97]:
titanic.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

In [98]:
def get_square(a):
    return a**2

In [99]:
get_square(3)

9

In [100]:
lambda_square=lambda x: x**2

In [101]:
lambda_square(3)

9

In [102]:
titanic['Name_len']=titanic.Name.apply(lambda x: len(x))

In [108]:
titanic[['Name','Name_len']].head()

| | Name | Name_len |
|---|---|---|
| 0 | Braund, Mr. Owen Harris | 23 |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 51 |
| 2 | Heikkinen, Miss. Laina | 22 |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 44 |
| 4 | Allen, Mr. William Henry | 24 |


In [104]:
titanic['Child_Adult']=titanic.Age.apply(lambda x: 'Child' if x<=15 else 'Adult')

In [105]:
titanic[['Age','Child_Adult']][:10]

| | Age | Child_Adult |
|---|---|---|
| 0 | 22.0 | Adult |
| 1 | 38.0 | Adult |
| 2 | 26.0 | Adult |
| 3 | 35.0 | Adult |
| 4 | 35.0 | Adult |
| 5 | 29.699118 | Adult |
| 6 | 54.0 | Adult |
| 7 | 2.0 | Child |
| 8 | 27.0 | Adult |
| 9 | 14.0 | Child |


A lambda expression does not support else if (elif).  
To get the same effect, nest another if else inside parentheses in the else clause.

In [106]:
titanic['Age_cat']=titanic.Age.apply(lambda x: 'Child' if x<=15 else ('Adult' if x<=60 else 'Elderly'))

In [107]:
titanic.Age_cat.value_counts()

Adult      786
Child       83
Elderly     22
Name: Age_cat, dtype: int64
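
When the conditions get longer, a named function passed to apply() is usually easier to read than a nested lambda. The sketch below uses the same thresholds as the Age_cat lambda; `get_category` and the sample `ages` Series are illustrative, not part of the notes.

```python
import pandas as pd

def get_category(age):
    # same thresholds as the nested lambda: <=15 Child, <=60 Adult, else Elderly
    if age <= 15:
        return 'Child'
    elif age <= 60:
        return 'Adult'
    else:
        return 'Elderly'

ages = pd.Series([2.0, 15.0, 22.0, 60.0, 71.0])
age_cat = ages.apply(get_category)
print(age_cat.tolist())   # ['Child', 'Child', 'Adult', 'Adult', 'Elderly']
```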