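# Machine Learning

## Concept

- A family of algorithmic techniques that learn patterns from data and predict outcomes, without the application itself having to be modified
- Solves problems by finding hidden patterns in the data
- Those patterns can translate into substantial gains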

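## Categories

> **Supervised learning**

The outcome is specified, so labeled training data exists.<br>
In other words, <span style="color: #2D3748; background-color:#fff5b1;">the model first learns from data with known correct answers and then predicts the answers for unseen data</span>  
e.g. **classification**, **regression**, recommender systems, visual/speech detection and recognition, text analytics, NLP

> **Unsupervised learning**

The outcome is not specified, so the model looks for structure in unlabeled data.<br>
e.g. clustering, dimensionality reduction, feature extraction

Reinforcement learning is sometimes listed here as well, but it is usually treated as a separate third category, because it learns from reward signals rather than from unlabeled data.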

## Limitations

Highly dependent on the data<br>
<br>
"Garbage In, Garbage Out."<br>
: "Even absurd input data (garbage) is processed without question, and it produces absurd output (garbage) that nobody intended."

------------

## Key packages in the Python machine learning ecosystem

- Machine learning: scikit-learn
- Arrays/linear algebra/statistics: numpy
- Data handling: pandas
- Visualization: matplotlib, seaborn

----------

# Numpy

## ndarray

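- The core data type of Numpy
- Makes it easy to create multidimensional arrays and run a wide range of operations on them
- The values in an ndarray can be numbers, strings, booleans, and so on
- However, all elements of a single ndarray must share one data type; mixed inputs are converted to a common type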
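As a small illustration of the single-dtype rule, the sketch below builds arrays from mixed Python values and inspects the resulting dtype. The variable names are made up for this example.

```python
import numpy as np

# Integers mixed with a float: everything is upcast to a floating-point type.
mixed_num = np.array([1, 2, 3.5])
print(mixed_num.dtype)        # float64

# Numbers mixed with a string: everything becomes a Unicode string.
mixed_str = np.array([1, 2, 'three'])
print(mixed_str.dtype.kind)   # 'U' (the exact string width depends on the platform)
print(mixed_str)
```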

### Creating an array

In [5]:
import numpy as np

In [6]:
arr = np.array([1,2,3,4,5])
arr

array([1, 2, 3, 4, 5])

### Checking the data type

In [7]:
arr.dtype

dtype('int32')

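### Changing the data type

astype() returns a new array with the requested dtype. It is typically used to save memory by converting to a smaller type; the cell below converts int32 to float64 only to show the call.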

In [9]:
arr = arr.astype('float64')
arr.dtype

dtype('float64')
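To make the memory angle concrete, here is a minimal sketch that compares the byte size of an array before and after converting to a narrower type. The array size and the choice of float32 are assumptions for illustration; whether the lost precision is acceptable depends on the data.

```python
import numpy as np

big = np.arange(1000, dtype='float64')   # 8 bytes per element
small = big.astype('float32')            # 4 bytes per element, returned as a new array

print(big.nbytes)    # 8000
print(small.nbytes)  # 4000
print(big.dtype)     # float64, the original array is unchanged
```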

### Convenient constructors

#### arange()

In [12]:
arr1 = np.arange(10)
print(arr1)
print(arr1.dtype)
arr1.shape

[0 1 2 3 4 5 6 7 8 9]
int32


(10,)

#### zeros()

In [13]:
arr2 = np.zeros((3,2), dtype = 'int32')
print(arr2)
print(arr2.dtype)
arr2.shape

[[0 0]
 [0 0]
 [0 0]]
int32


(3, 2)

#### ones()

In [14]:
arr3 = np.ones((3,2))
print(arr3)
print(arr3.dtype)
arr3.shape

[[1. 1.]
 [1. 1.]
 [1. 1.]]
float64


(3, 2)

### Changing dimensions and shape

#### reshape()

In [16]:
arr1 = np.arange(10)
print(arr1)
print()
arr2 = arr1.reshape(2,5)
print(arr2)
print()
arr3 = arr1.reshape(5,2)
arr3

[0 1 2 3 4 5 6 7 8 9]

[[0 1 2 3 4]
 [5 6 7 8 9]]



array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

> Using it more efficiently in practice

- When -1 is passed for a dimension, Numpy infers that dimension so the new shape is compatible with the original ndarray

In [17]:
arr1

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [19]:
arr1.reshape(-1,2)

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [20]:
arr1.reshape(2,-1)

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])
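The -1 placeholder only works when the remaining dimensions divide the element count evenly. The sketch below (variable names are illustrative) shows one shape that can be inferred and one that cannot.

```python
import numpy as np

arr = np.arange(10)

# 10 elements / 5 columns -> Numpy infers 2 rows.
print(arr.reshape(-1, 5).shape)   # (2, 5)

# 10 is not divisible by 4, so no compatible shape exists.
try:
    arr.reshape(-1, 4)
except ValueError as err:
    print('reshape failed:', err)
```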

- Converting a 3-D array to 2-D, and a 1-D array to 2-D

In [21]:
arr1 = np.arange(8)
arr1

array([0, 1, 2, 3, 4, 5, 6, 7])

In [23]:
arr3d = arr1.reshape(2, 2, 2)
arr3d

array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])

In [25]:
arr5 = arr3d.reshape(-1,1) # 3-D -> 2-D
arr5

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7]])

In [26]:
arr2 = arr1.reshape(-1,1) # 1-D -> 2-D
arr2

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7]])

### Selecting data

#### Indexing

##### Extracting a single value

In [30]:
arr1 = np.arange(1, 10)
arr1

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [31]:
arr1[2]

3

In [32]:
arr1[-1]

9

In [35]:
arr2d = arr1.reshape(3, 3)
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [37]:
arr2d[2, 1]

8

In [38]:
arr2d[1, 0]

4

##### slicing

Extracts the elements from the start index up to (end index - 1) as an ndarray

In [39]:
arr1 = np.arange(1, 10)
arr3 = arr1[0:3]
print(arr1)
arr3

[1 2 3 4 5 6 7 8 9]


array([1, 2, 3])
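One detail worth knowing about slices: a basic slice of an ndarray is a view on the same data, not a copy. The sketch below uses made-up variable names to show the effect and how `copy()` avoids it.

```python
import numpy as np

arr = np.arange(1, 10)

view = arr[0:3]             # basic slice: a view on arr
view[0] = 100
print(arr[0])               # 100, the original changed as well

copied = arr[0:3].copy()    # explicit copy: independent data
copied[1] = -1
print(arr[1])               # 2, the original is untouched this time
```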

##### fancy indexing

In [45]:
arr1 = np.arange(1, 10)
arr2 = arr1.reshape(3, 3)
print(arr1)
arr2

[1 2 3 4 5 6 7 8 9]


array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [46]:
arr3 = arr2[[0, 1], 2]
arr3

array([3, 6])

In [47]:
arr4 = arr2[[0, 1], 0:2]
arr4

array([[1, 2],
       [4, 5]])

In [49]:
arr5 = arr2[[0, 1]] # selects rows (0, :) and (1, :)
arr5

array([[1, 2, 3],
       [4, 5, 6]])

##### boolean indexing

Filters by a condition and selects the matching values in a single step

In [51]:
arr1 = np.arange(1, 10)
arr3 = arr1[arr1 > 5]
arr3

array([6, 7, 8, 9])
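Boolean indexing works in two steps: the comparison produces a boolean array, and that array then selects the elements. A short sketch, with illustrative names, that also combines two conditions:

```python
import numpy as np

arr = np.arange(1, 10)

mask = arr > 5                # boolean array, one flag per element
print(mask)
print(arr[mask])              # [6 7 8 9]

# Combine conditions with & or |, and parenthesize each comparison.
evens_above_2 = arr[(arr > 2) & (arr % 2 == 0)]
print(evens_above_2)          # [4 6 8]
```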

### Sorting arrays

#### sort()

Returns a sorted copy and leaves the original array unchanged

In [52]:
org_array = np.array([3,1,9,5])
sort_array1 = np.sort(org_array)
sort_array1

array([1, 3, 5, 9])

#### ndarray.sort()

Sorts the original array in place and returns None

In [53]:
sort_array2 = org_array.sort()

In [55]:
org_array

array([1, 3, 5, 9])
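The difference between the two sort calls is easy to miss, so here is a side-by-side sketch. The variable names are illustrative.

```python
import numpy as np

org = np.array([3, 1, 9, 5])

sorted_copy = np.sort(org)   # new array; org is untouched
print(org)                   # [3 1 9 5]
print(sorted_copy)           # [1 3 5 9]

result = org.sort()          # sorts org in place
print(result)                # None
print(org)                   # [1 3 5 9]
```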

> Default: ascending order.<br>What if you want descending order?

In [56]:
sort_array1_desc = np.sort(org_array)[::-1]
sort_array1_desc

array([9, 5, 3, 1])

> What if you want to choose the axis to sort along?

In [57]:
arr2d = np.array([[8,12], [7,1]])
arr2d

array([[ 8, 12],
       [ 7,  1]])

In [58]:
np.sort(arr2d, axis = 0)

array([[ 7,  1],
       [ 8, 12]])

In [59]:
np.sort(arr2d, axis = 1)

array([[ 8, 12],
       [ 1,  7]])

#### argsort()

Returns the indices of the original array in the order that would sort it

In [62]:
org_array = np.array([3,1,9,5])
org_array

array([3, 1, 9, 5])

In [63]:
sort = np.argsort(org_array)
sort

array([1, 0, 3, 2], dtype=int64)

In [64]:
np.argsort(org_array)[::-1]

array([2, 3, 0, 1], dtype=int64)

> Usage example

In [65]:
name = np.array(['John', 'Mike', 'Sarah', 'Kate', 'Samuel'])
score = np.array([78, 95, 84, 98, 88])

sort = np.argsort(score)
print('Indices of score in ascending order:', sort)
print('Names in ascending order of score:', name[sort])

Indices of score in ascending order: [0 2 4 1 3]
Names in ascending order of score: ['John' 'Sarah' 'Samuel' 'Mike' 'Kate']


### Dot product (matrix multiplication)

#### np.dot()

The number of columns in the left matrix must equal the number of rows in the right matrix for the product to be defined

In [67]:
A = np.array( [[1,2,3],[4,5,6]] )
B = np.array( [[7,8], [9,10], [11,12]] )

dot_product = np.dot(A, B)
dot_product

array([[ 58,  64],
       [139, 154]])
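The shape rule can be checked directly. The sketch below reuses the same A and B, confirms that the `@` operator gives the same result as `np.dot` for 2-D arrays, and shows what happens when the inner dimensions do not match.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
B = np.array([[7, 8], [9, 10], [11, 12]])   # shape (3, 2)

product = np.dot(A, B)
print(product.shape)                         # (2, 2)
print(np.array_equal(product, A @ B))        # True

# (2, 3) times (2, 3): inner dimensions 3 and 2 differ.
try:
    np.dot(A, A)
except ValueError as err:
    print('dot failed:', err)
```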

### Transpose

#### transpose()

In [68]:
A = np.array([[1,2], [3,4]])
A

array([[1, 2],
       [3, 4]])

In [69]:
np.transpose(A)

array([[1, 3],
       [2, 4]])

----------

## Pandas

In [3]:
import pandas as pd

titanic = pd.read_csv('./data/titanic_data/titanic_train.csv')
titanic

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


### info()

Shows the total number of rows, the data type of each column, and the non-null count per column

In [2]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### describe()

Shows the count, mean, standard deviation, minimum, maximum, and 25th/50th/75th percentiles of each numeric column

In [3]:
titanic.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


### value_counts()

- Shows the distinct values of a column and how many times each occurs
- Very useful for checking the distribution of the data

In [4]:
titanic['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

> Apply value_counts() with null values included.

In [7]:
titanic['Embarked'].value_counts(dropna = False)

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

---------------

## DataFrame

### Converting

In [9]:
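data_dict = {'col' : [1, 11], 'col2' : [2, 22], 'col3' : [3, 33]}  # named data_dict to avoid shadowing the built-in dict
df_dict = pd.DataFrame(data_dict)
df_dict

| | col | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 11 | 22 | 33 |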


> Converting to a list

In [14]:
values_list = df_dict.values.tolist()  # named values_list to avoid shadowing the built-in list
type(values_list)

list

> Converting to a dictionary

Calling to_dict() with 'list' as the argument returns a dictionary whose values are lists

In [15]:
dict2 = df_dict.to_dict('list')
type(dict2)

dict
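The cells above only check the resulting types. To see what the converted objects actually contain, a minimal sketch with the same small DataFrame (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 11], 'col2': [2, 22], 'col3': [3, 33]})

rows = df.values.tolist()     # one inner list per row
print(rows)                   # [[1, 2, 3], [11, 22, 33]]

cols = df.to_dict('list')     # one list per column
print(cols)                   # {'col': [1, 11], 'col2': [2, 22], 'col3': [3, 33]}
```

The list form is row-oriented while the dictionary form is column-oriented, which is worth keeping in mind when passing the result on.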

### Creating and modifying column data

> Create

In [18]:
titanic['Age_0'] = 0
titanic[:3]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 0 |


> Modify

In [20]:
titanic['Age_0'] = titanic['Age_0'] + 100
titanic[:3]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 100 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 100 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 100 |


### Dropping

In [22]:
titanic_drop = titanic.drop('Age_0', axis = 1)
titanic_drop[:3]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


#### inplace = True

In [24]:
titanic.drop('Age_0', axis = 1, inplace = True)
titanic

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
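The two drop() calls differ in what they return and what they modify. A minimal sketch on a toy DataFrame (the column names `a` and `b` are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

dropped = df.drop('b', axis=1)                # returns a new DataFrame
print(list(df.columns))                       # ['a', 'b'], the original is unchanged
print(list(dropped.columns))                  # ['a']

result = df.drop('b', axis=1, inplace=True)   # modifies df itself
print(result)                                 # None
print(list(df.columns))                       # ['a']
```

Because the inplace form returns None, assigning its result back to a variable would lose the DataFrame.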


### reset_index()

Assigns a new consecutive integer index and adds the previous index as a new column named 'index'

In [29]:
titanic.reset_index(inplace = True)
titanic[:3]

| | index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


> drop = True: the previous index is discarded instead of being added as a column

In [31]:
titanic.reset_index(inplace = True, drop = True)

In [33]:
titanic.head(3)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
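reset_index() is most useful after filtering, when the surviving rows keep their old, non-consecutive labels. A small sketch with a hypothetical `score` column:

```python
import pandas as pd

df = pd.DataFrame({'score': [78, 95, 84, 98]})
high = df[df['score'] > 80]                  # remaining index labels: 1, 2, 3

kept = high.reset_index()                    # old index kept as an 'index' column
print(list(kept.columns))                    # ['index', 'score']

fresh = high.reset_index(drop=True)          # old index discarded
print(list(fresh.index))                     # [0, 1, 2]
```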


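### Selection and filtering

#### df['column name']

In [37]:
titanic[0:2] # row slicing with [] works, but is not recommended; prefer iloc[]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |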


In [36]:
titanic[titanic['Pclass'] == 3].head(3)

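| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |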


#### iloc[]

Position-based indexing

In [38]:
data = {'Name' : ['Chulmin', 'Eunkyung', 'Jinwoong', 'Soobeon'],
        'Year' : [2011, 2016, 2015, 2015],
        'Gender' : ['Male', 'Female', 'Male', 'Male']
       }
data_df = pd.DataFrame(data, index = ['one', 'two', 'three', 'four'])
data_df

| | Name | Year | Gender |
|---|---|---|---|
| one | Chulmin | 2011 | Male |
| two | Eunkyung | 2016 | Female |
| three | Jinwoong | 2015 | Male |
| four | Soobeon | 2015 | Male |


In [39]:
data_df.iloc[0,0]

'Chulmin'

In [41]:
data_df.iloc[0:2, [0,1]]

| | Name | Year |
|---|---|---|
| one | Chulmin | 2011 |
| two | Eunkyung | 2016 |


#### loc[]

Label-based indexing

- Slices include the end point

In [42]:
data_df.loc['one', 'Name']

'Chulmin'

In [43]:
data_df.loc['one' : 'three', ['Name', 'Gender']]

| | Name | Gender |
|---|---|---|
| one | Chulmin | Male |
| two | Eunkyung | Female |
| three | Jinwoong | Male |
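The end-point behavior is the main practical difference between iloc[] and loc[] slices. A minimal sketch on a reduced version of the DataFrame above (only the `Year` column is kept for brevity):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2011, 2016, 2015, 2015]},
                  index=['one', 'two', 'three', 'four'])

by_position = df.iloc[0:2]          # positions 0 and 1; position 2 is excluded
print(list(by_position.index))      # ['one', 'two']

by_label = df.loc['one':'three']    # labels 'one' through 'three'; end label included
print(list(by_label.index))         # ['one', 'two', 'three']
```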


### sort_values()

In [45]:
titanic_sorted = titanic.sort_values(by = ['Name'])
titanic_sorted.head(3)

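| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 845 | 846 | 0 | 3 | Abbing, Mr. Anthony | male | 42.0 | 0 | 0 | C.A. 5547 | 7.55 | NaN | S |
| 746 | 747 | 0 | 3 | Abbott, Mr. Rossmore Edward | male | 16.0 | 1 | 1 | C.A. 2673 | 20.25 | NaN | S |
| 279 | 280 | 1 | 3 | Abbott, Mrs. Stanton (Rosa Hunt) | female | 35.0 | 1 | 1 | C.A. 2673 | 20.25 | NaN | S |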


In [46]:
titanic.sort_values(by = ['Name'], ascending = False, inplace = False)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S |
| 153 | 154 | 0 | 3 | van Billiard, Mr. Austin Blyler | male | 40.5 | 0 | 2 | A/5. 851 | 14.5000 | NaN | S |
| 361 | 362 | 0 | 2 | del Carlo, Mr. Sebastiano | male | 29.0 | 1 | 0 | SC/PARIS 2167 | 27.7208 | NaN | C |
| 282 | 283 | 0 | 3 | de Pelsmaeker, Mr. Alfons | male | 16.0 | 0 | 0 | 345778 | 9.5000 | NaN | S |
| 286 | 287 | 1 | 3 | de Mulder, Mr. Theodore | male | 30.0 | 0 | 0 | 345774 | 9.5000 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 874 | 875 | 1 | 2 | Abelson, Mrs. Samuel (Hannah Wizosky) | female | 28.0 | 1 | 0 | P/PP 3381 | 24.0000 | NaN | C |
| 308 | 309 | 0 | 2 | Abelson, Mr. Samuel | male | 30.0 | 1 | 0 | P/PP 3381 | 24.0000 | NaN | C |
| 279 | 280 | 1 | 3 | Abbott, Mrs. Stanton (Rosa Hunt) | female | 35.0 | 1 | 1 | C.A. 2673 | 20.2500 | NaN | S |
| 746 | 747 | 0 | 3 | Abbott, Mr. Rossmore Edward | male | 16.0 | 1 | 1 | C.A. 2673 | 20.2500 | NaN | S |


### Aggregation functions

Calling an aggregation directly on a DataFrame applies it to every column<br>
To apply it to specific columns only, select those columns from the DataFrame first and then call the aggregation

In [47]:
titanic[['Age', 'Fare']].mean()

Age     29.699118
Fare    32.204208
dtype: float64

### groupby()

In [48]:
titanic.groupby('Pclass').count()

| Pclass | PassengerId | Survived | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 216 | 216 | 216 | 216 | 186 | 216 | 216 | 216 | 216 | 176 | 214 |
| 2 | 184 | 184 | 184 | 184 | 173 | 184 | 184 | 184 | 184 | 16 | 184 |
| 3 | 491 | 491 | 491 | 491 | 355 | 491 | 491 | 491 | 491 | 12 | 491 |


In [49]:
titanic.groupby('Pclass')[['PassengerId', 'Survived']].count()

| Pclass | PassengerId | Survived |
|---|---|---|
| 1 | 216 | 216 |
| 2 | 184 | 184 |
| 3 | 491 | 491 |


In [50]:
titanic.groupby('Pclass')['Age'].agg(['max', 'min'])

| Pclass | max | min |
|---|---|---|
| 1 | 80.0 | 0.92 |
| 2 | 70.0 | 0.67 |
| 3 | 74.0 | 0.42 |


In [52]:
titanic.groupby('Pclass').agg({'Age' : 'max', 'SibSp' : 'sum', 'Fare' : 'mean'})

| Pclass | Age | SibSp | Fare |
|---|---|---|---|
| 1 | 80.0 | 90 | 84.154687 |
| 2 | 70.0 | 74 | 20.662183 |
| 3 | 74.0 | 302 | 13.67555 |
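The groupby() patterns above can be tried without the Titanic file. The sketch below uses a tiny hand-made DataFrame (the values are invented for illustration, not taken from the dataset) to show that count() skips missing values and that agg() accepts a different function name per column.

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Age':    [38.0, 35.0, 27.0, 22.0, None],
    'Fare':   [71.28, 53.10, 13.00, 7.25, 8.05],
})

# count() skips missing values, so Age and Fare can differ within a group.
counts = df.groupby('Pclass').count()
print(counts)

# A different aggregation per column, passed as function names.
summary = df.groupby('Pclass').agg({'Age': 'max', 'Fare': 'mean'})
print(summary)
```

Passing function names as strings keeps the call independent of whichever `max` or `min` happens to be in scope.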


### Handling missing data

In [53]:
titanic.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [54]:
titanic['Cabin'] = titanic['Cabin'].fillna('C000')  # assign back; inplace=True on a selected column is deprecated in recent pandas

In [55]:
titanic.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         2
dtype: int64
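Age still has missing values after this step. One common option, shown here only as a sketch on invented data, is to fill a numeric column with its mean; whether that is appropriate depends on the analysis. The `'C000'` placeholder follows the cell above.

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, None, 26.0], 'Cabin': [None, 'C85', None]})

# fillna() returns a new Series, so assign the result back to the column.
df['Cabin'] = df['Cabin'].fillna('C000')
df['Age'] = df['Age'].fillna(df['Age'].mean())   # mean() skips NaN: (22.0 + 26.0) / 2

remaining = int(df.isna().sum().sum())
print(remaining)   # 0
print(df)
```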

### lambda

In [1]:
a = [1, 2, 3]
b = map(lambda x : x**2, a)
list(b)

[1, 4, 9]

In [71]:
titanic['Name_len'] = titanic['Name'].apply(lambda x : len(x))
titanic[['Name', 'Name_len']].head(3)

| | Name | Name_len |
|---|---|---|
| 0 | Braund, Mr. Owen Harris | 23 |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 51 |
| 2 | Heikkinen, Miss. Laina | 22 |


In [4]:
titanic['Child_Adult'] = titanic['Age'].apply(lambda x : 'Child' if x <= 15 else 'Adult')
titanic[['Age', 'Child_Adult']].head(10)

| | Age | Child_Adult |
|---|---|---|
| 0 | 22.0 | Adult |
| 1 | 38.0 | Adult |
| 2 | 26.0 | Adult |
| 3 | 35.0 | Adult |
| 4 | 35.0 | Adult |
| 5 | NaN | Adult |
| 6 | 54.0 | Adult |
| 7 | 2.0 | Child |
| 8 | 27.0 | Adult |
| 9 | 14.0 | Child |
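Row 5 in the output above has a missing Age but is still labeled Adult, because any comparison with NaN evaluates to False and the lambda falls through to the else branch. The sketch below reproduces this on a small Series and shows one possible way to keep missing ages separate. `label_age` and the `'Unknown'` label are hypothetical names introduced for this example.

```python
import pandas as pd

ages = pd.Series([22.0, None, 2.0, 14.0])

def label_age(age):
    # Hypothetical helper: keep missing ages in their own category.
    if pd.isna(age):
        return 'Unknown'
    return 'Child' if age <= 15 else 'Adult'

naive = ages.apply(lambda x: 'Child' if x <= 15 else 'Adult').tolist()
print(naive)     # ['Adult', 'Adult', 'Child', 'Child']

checked = ages.apply(label_age).tolist()
print(checked)   # ['Adult', 'Unknown', 'Child', 'Child']
```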


-----------------