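- The loc and iloc accessors work when you already know the index labels or positions of the rows/columns you want to target
-> Filtering covers the other case: selecting rows by a condition or criterion rather than by an identifier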

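# Datasets and memory optimization
- Whenever you import a dataset, check that each column stores its data in the most suitable type
  - Most suitable type: the data type that uses the least memory or offers the most utility
  - ex. integers -> an integer type rather than floating point; dates/times -> a specialized datetime type with its own methods rather than strings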

In [1]:
import pandas as pd

In [2]:
pd.read_csv("employees.csv")

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,8/6/93,,True,Marketing
1,Thomas,Male,3/31/96,61933.0,True,
2,Maria,Female,,130590.0,False,Finance
3,Jerry,,3/4/05,138705.0,True,Finance
4,Larry,Male,1/24/98,101004.0,True,IT
...,...,...,...,...,...,...
996,Phillip,Male,1/31/84,42392.0,False,Finance
997,Russell,Male,5/20/13,96914.0,False,Product
998,Larry,Male,4/20/13,60500.0,False,Business Dev
999,Albert,Male,5/15/12,129949.0,True,Sales


In [3]:
employees = pd.read_csv("employees.csv", parse_dates = ["Start Date"])

In [4]:
#info method: lists each column's data type and non-null count (missing values = total rows minus the non-null count)
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   First Name  933 non-null    object        
 1   Gender      854 non-null    object        
 2   Start Date  999 non-null    datetime64[ns]
 3   Salary      999 non-null    float64       
 4   Mgmt        933 non-null    object        
 5   Team        957 non-null    object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 47.0+ KB
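
As a minimal sketch of how `parse_dates` changes what `info` reports, the snippet below reads a small inline CSV instead of employees.csv. The CSV text and the names `plain` and `sample` are made up for illustration, and the dates are written in ISO format so that parsing is unambiguous.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for employees.csv
csv_text = (
    "First Name,Start Date,Salary\n"
    "Douglas,1993-08-06,\n"
    "Thomas,1996-03-31,61933\n"
    "Maria,,130590\n"
)

plain = pd.read_csv(io.StringIO(csv_text))
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Start Date"])

# Without parse_dates the column is not stored as a datetime type
print(plain["Start Date"].dtype)
# With parse_dates it is a datetime64 column; the empty cell becomes NaT
print(sample["Start Date"].dtype)
# The empty Salary cell forces a floating-point column
print(sample["Salary"].dtype)
```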


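## Converting data types with the astype method
- pandas reads the Mgmt column in with the object dtype (see the info output above) -> it can be converted to the Boolean data type
- astype method: converts a Series's values to a different data type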

In [5]:
employees["Mgmt"].astype(bool) #missing values (NaN) are converted to True

0        True
1        True
2       False
3        True
4        True
        ...  
996     False
997     False
998     False
999      True
1000     True
Name: Mgmt, Length: 1001, dtype: bool
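
The NaN-to-True conversion is easy to overlook, so here is a small check on a hypothetical three-value Series (not the employees data). It also shows one possible alternative, the nullable `"boolean"` dtype, which keeps the missing value instead of turning it into True. Whether that is preferable depends on how the column is used later.

```python
import numpy as np
import pandas as pd

# Hypothetical object column: Booleans mixed with a missing value
mgmt = pd.Series([True, False, np.nan], dtype=object)

# Plain bool: NaN is truthy, so it silently becomes True
as_bool = mgmt.astype(bool)

# Nullable "boolean" dtype: the missing value stays missing (<NA>)
as_nullable = mgmt.astype("boolean")

print(as_bool.tolist())
print(as_nullable.isna().tolist())
```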

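- Updating a DataFrame column ~= setting a key-value pair in a dictionary
  - If a column with the given name exists, pandas overwrites it with the new Series
  - If no column with the given name exists, pandas creates a new Series and appends it on the right side of the DataFrame. pandas matches the rows of the Series and the DataFrame by their shared index labels.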
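
Both cases can be sketched on a tiny hypothetical DataFrame (the column names `salary` and `bonus` are invented for this example). Note how the new column is aligned by index label, not by position.

```python
import pandas as pd

df = pd.DataFrame({"salary": [100, 200, 300]}, index=["x", "y", "z"])

# Existing column name: the column is overwritten
df["salary"] = df["salary"] * 10

# New column name: the Series is appended on the right and aligned by label;
# "y" has no matching label in the Series, so it receives NaN
df["bonus"] = pd.Series([1.5, 2.5], index=["z", "x"])

print(df)
```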

In [6]:
employees["Mgmt"] = employees["Mgmt"].astype(bool)

In [7]:
employees.tail()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
996,Phillip,Male,1984-01-31,42392.0,False,Finance
997,Russell,Male,2013-05-20,96914.0,False,Product
998,Larry,Male,2013-04-20,60500.0,False,Business Dev
999,Albert,Male,2012-05-15,129949.0,True,Sales
1000,,,NaT,,True,


In [8]:
#Compare memory usage with the earlier info() output
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   First Name  933 non-null    object        
 1   Gender      854 non-null    object        
 2   Start Date  999 non-null    datetime64[ns]
 3   Salary      999 non-null    float64       
 4   Mgmt        1001 non-null   bool          
 5   Team        957 non-null    object        
dtypes: bool(1), datetime64[ns](1), float64(1), object(3)
memory usage: 40.2+ KB


In [9]:
#The Salary column in employees contains NaN, so pandas stores it as floating point rather than integer
#Forcing the conversion with astype raises IntCastingNaNError
employees["Salary"].astype(int)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

In [10]:
employees["Salary"] = employees["Salary"].fillna(0).astype(int)
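
Filling with 0 makes the integer conversion possible, but it also changes statistics such as the mean. As a sketch of one alternative (an addition, not part of the original notes), pandas' nullable `"Int64"` dtype stores integers while keeping missing values as `<NA>`. The three-value Series below is hypothetical.

```python
import numpy as np
import pandas as pd

salary = pd.Series([61933.0, np.nan, 130590.0])

# Approach used above: replace NaN with 0, then cast to int
filled = salary.fillna(0).astype(int)

# Alternative: nullable integer dtype keeps the missing value
nullable = salary.astype("Int64")

print(filled.mean())    # the 0 placeholder pulls the mean down
print(nullable.mean())  # the missing value is skipped
```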

In [11]:
#Category data type: suited to columns made up of a small number of unique values relative to their total size.
employees.nunique()

First Name    200
Gender          2
Start Date    971
Salary        995
Mgmt            2
Team           10
dtype: int64

In [12]:
#Pick Gender, which has only a few unique values
employees["Gender"].astype("category")

0         Male
1         Male
2       Female
3          NaN
4         Male
         ...  
996       Male
997       Male
998       Male
999       Male
1000       NaN
Name: Gender, Length: 1001, dtype: category
Categories (2, object): ['Female', 'Male']

In [13]:
employees["Gender"] = employees["Gender"].astype("category")

In [14]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   First Name  933 non-null    object        
 1   Gender      854 non-null    category      
 2   Start Date  999 non-null    datetime64[ns]
 3   Salary      1001 non-null   int32         
 4   Mgmt        1001 non-null   bool          
 5   Team        957 non-null    object        
dtypes: bool(1), category(1), datetime64[ns](1), int32(1), object(2)
memory usage: 29.6+ KB


In [15]:
#Pick the Team column, which has only a few unique values
employees["Team"] = employees["Team"].astype("category")
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   First Name  933 non-null    object        
 1   Gender      854 non-null    category      
 2   Start Date  999 non-null    datetime64[ns]
 3   Salary      1001 non-null   int32         
 4   Mgmt        1001 non-null   bool          
 5   Team        957 non-null    category      
dtypes: bool(1), category(2), datetime64[ns](1), int32(1), object(1)
memory usage: 23.1+ KB
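
The trailing `+` in the `info` figures indicates that the contents of object columns are not fully counted. A rough way to compare the two representations directly is `memory_usage(deep=True)`, sketched here on a hypothetical 1,000-row Series with four unique values. Exact byte counts vary by platform and pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical column: 1,000 rows but only 4 unique values
teams = pd.Series(["Finance", "IT", "Sales", "Legal"] * 250)

as_object = teams.astype(object)
as_category = teams.astype("category")

# deep=True also measures the Python string objects themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```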


# Filtering by a single condition
- Extract a subset of the data (the part of the dataset that meets some condition)

In [16]:
#Equality operator syntax (Series == value) -> returns a Boolean Series
#pandas compares each value in the Series, not the Series object itself, with the given string
employees["First Name"] == "Maria"

0       False
1       False
2        True
3       False
4       False
        ...  
996     False
997     False
998     False
999     False
1000    False
Name: First Name, Length: 1001, dtype: bool

In [17]:
#Filtering-1: pass the Boolean Series directly inside the square brackets
employees[employees["First Name"] == "Maria"]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
2,Maria,Female,NaT,130590,False,Finance
198,Maria,Female,1990-12-27,36067,True,Product
815,Maria,,1986-01-18,106562,False,HR
844,Maria,,1985-06-19,148857,False,Legal
936,Maria,Female,2003-03-14,96250,False,Business Dev
984,Maria,Female,2011-10-15,43455,False,Engineering


In [18]:
#Filtering-2: store the Boolean Series in a variable first
marias = employees["First Name"] == "Maria"
employees[marias]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
2,Maria,Female,NaT,130590,False,Finance
198,Maria,Female,1990-12-27,36067,True,Product
815,Maria,,1986-01-18,106562,False,HR
844,Maria,,1985-06-19,148857,False,Legal
936,Maria,Female,2003-03-14,96250,False,Business Dev
984,Maria,Female,2011-10-15,43455,False,Engineering
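
The mechanics of the mask can be seen on a four-row hypothetical DataFrame: the comparison produces one Boolean per row, and indexing with that Series keeps only the rows marked True. A missing name compares as False.

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Maria", "Jerry", "Maria", None],
    "Salary": [130590, 138705, 36067, 0],
})

# One Boolean per row; the missing name yields False
is_maria = df["First Name"] == "Maria"

# Rows where the mask is True are kept, with their original index labels
marias = df[is_maria]

print(is_maria.tolist())
print(marias)
```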


In [19]:
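#Syntax using the inequality operator (!=); rows with a missing Team are included because NaN != "Finance" evaluates to True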
employees[employees["Team"] != "Finance"]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
1,Thomas,Male,1996-03-31,61933,True,
4,Larry,Male,1998-01-24,101004,True,IT
5,Dennis,Male,1987-04-18,115163,False,Legal
6,Ruby,Female,1987-08-17,65476,True,Product
...,...,...,...,...,...,...
995,Henry,,2014-11-23,132483,False,Distribution
997,Russell,Male,2013-05-20,96914,False,Product
998,Larry,Male,2013-04-20,60500,False,Business Dev
999,Albert,Male,2012-05-15,129949,True,Sales


In [20]:
#For a column that already holds Boolean values, the equality/inequality operators are unnecessary
employees[employees["Mgmt"]].head() #same as employees[employees["Mgmt"] == True]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
1,Thomas,Male,1996-03-31,61933,True,
3,Jerry,,2005-03-04,138705,True,Finance
4,Larry,Male,1998-01-24,101004,True,IT
6,Ruby,Female,1987-08-17,65476,True,Product


In [21]:
high_earners = employees["Salary"] > 100000 #creates a Boolean Series that is True where the condition holds
high_earners.head()

0    False
1    False
2     True
3     True
4     True
Name: Salary, dtype: bool

In [22]:
employees[high_earners].head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
2,Maria,Female,NaT,130590,False,Finance
3,Jerry,,2005-03-04,138705,True,Finance
4,Larry,Male,1998-01-24,101004,True,IT
5,Dennis,Male,1987-04-18,115163,False,Legal
9,Frances,Female,2002-08-08,139852,True,Business Dev


# Filtering by multiple conditions
- Create two independent Boolean Series, then declare the logical criterion to apply between them to filter on multiple conditions
## The AND condition

In [23]:
is_female = employees["Gender"] == "Female"
in_biz_dev = employees["Team"] == "Business Dev"
employees[is_female & in_biz_dev].head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
9,Frances,Female,2002-08-08,139852,True,Business Dev
33,Jean,Female,1993-12-18,119082,False,Business Dev
36,Rachel,Female,2009-02-16,142032,False,Business Dev
38,Stephanie,Female,1986-09-13,36844,True,Business Dev
61,Denise,Female,2001-11-06,106862,False,Business Dev


In [24]:
#You can include as many conditions as you like
is_manager = employees["Mgmt"]
employees[is_female & in_biz_dev & is_manager].head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
9,Frances,Female,2002-08-08,139852,True,Business Dev
38,Stephanie,Female,1986-09-13,36844,True,Business Dev
66,Nancy,Female,2012-12-15,125250,True,Business Dev
92,Linda,Female,2000-05-25,119009,True,Business Dev
111,Bonnie,Female,1999-12-17,42153,True,Business Dev


## The OR condition

In [25]:
earning_below_40k = employees["Salary"] < 40000
started_after_2015 = employees["Start Date"] > "2015-01-01"
employees[earning_below_40k | started_after_2015].tail()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
958,Gloria,Female,1987-10-24,39833,False,Engineering
964,Bruce,Male,1980-05-07,35802,True,Sales
967,Thomas,Male,2016-03-12,105681,False,Engineering
989,Justin,,1991-02-10,38344,False,Legal
1000,,,NaT,0,True,
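
The examples above store each condition in a variable before combining them. If the conditions are written inline instead, each comparison needs its own parentheses, because `&` and `|` bind more tightly than comparison operators in Python. A sketch on a hypothetical four-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Team": ["Sales", "Sales", "IT", "IT"],
    "Salary": [35000, 90000, 120000, 38000],
})

# AND: both conditions must hold
female_and_sales = df[(df["Gender"] == "Female") & (df["Team"] == "Sales")]

# OR: at least one condition must hold
low_or_sales = df[(df["Salary"] < 40000) | (df["Team"] == "Sales")]

print(female_and_sales.index.tolist())
print(low_or_sales.index.tolist())
```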


## Inversion with ~

In [26]:
my_series = pd.Series([True, False, True])
my_series

0     True
1    False
2     True
dtype: bool

In [27]:
~my_series

0    False
1     True
2    False
dtype: bool

In [28]:
#Identify employees earning less than $100,000 -1
employees[employees["Salary"] < 100000].head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
1,Thomas,Male,1996-03-31,61933,True,
6,Ruby,Female,1987-08-17,65476,True,Product
7,,Female,2015-07-20,45906,True,Finance
8,Angela,Female,2005-11-22,95570,True,Engineering


In [29]:
#Identify employees earning less than $100,000 -2
employees[~(employees["Salary"] >= 100000)].head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
1,Thomas,Male,1996-03-31,61933,True,
6,Ruby,Female,1987-08-17,65476,True,Product
7,,Female,2015-07-20,45906,True,Finance
8,Angela,Female,2005-11-22,95570,True,Engineering
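
The two approaches return the same rows here because the missing salaries were already replaced with 0. If NaN values were still present, they would differ: any comparison with NaN is False, so inverting it yields True. A sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd

salary = pd.Series([50000.0, np.nan, 150000.0])

# NaN < 100000 is False, so the missing row is excluded
below = salary < 100000

# NaN >= 100000 is also False, so inverting it includes the missing row
not_at_least = ~(salary >= 100000)

print(below.tolist())
print(not_at_least.tolist())
```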


## Boolean methods
- Equal (==): eq
- Not equal (!=): ne
- Less than (<): lt
- Less than or equal (<=): le
- Greater than (>): gt
- Greater than or equal (>=): ge
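
Each method gives the same result as its operator. The short check below uses a hypothetical Series; the method form can be convenient when chaining calls.

```python
import pandas as pd

salary = pd.Series([42392, 96914, 100000, 129949])

via_operator = salary < 100000
via_method = salary.lt(100000)

exactly_100k = salary.eq(100000)
at_least_100k = salary.ge(100000)

print(via_method.equals(via_operator))
print(exactly_100k.tolist())
```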

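# Filtering by condition
- pandas provides helper methods for building the Boolean Series needed for more complex extractions

## The isin method
- Scales better than chaining many '==' conditions.
- Takes an iterable (list, tuple, Series, etc.) of the values to compare against and returns a Boolean Series that is True where the row's value is among them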

In [30]:
sales = employees["Team"] == "Sales"
legal = employees["Team"] == "Legal"
mktg = employees["Team"] == "Marketing"
employees[sales | legal | mktg].head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
5,Dennis,Male,1987-04-18,115163,False,Legal
11,Julie,Female,1997-10-26,102508,True,Legal
13,Gary,Male,2008-01-27,109831,False,Sales
20,Lois,,1995-04-22,64714,True,Legal


In [31]:
all_star_teams = ["Sales", "Legal", "Marketing"]
on_all_star_teams = employees["Team"].isin(all_star_teams)
employees[on_all_star_teams].head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
5,Dennis,Male,1987-04-18,115163,False,Legal
11,Julie,Female,1997-10-26,102508,True,Legal
13,Gary,Male,2008-01-27,109831,False,Sales
20,Lois,,1995-04-22,64714,True,Legal
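
A compact version of the same idea on a hypothetical Series, including the inverted mask for "not on any of these teams". Note that a missing value is not in the list, so it ends up on the inverted side.

```python
import pandas as pd

teams = pd.Series(["Sales", "IT", None, "Legal", "Finance"])
all_star_teams = ["Sales", "Legal", "Marketing"]

on_all_star = teams.isin(all_star_teams)
not_on_all_star = ~on_all_star

print(on_all_star.tolist())
print(not_on_all_star.tolist())
```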


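## The between method
- When working with numbers or dates, to find the values that fall within a range:
  1. Create two Boolean Series for the lower and upper bounds, then combine them with the & operator
  2. between method: takes the lower and upper bounds as arguments and returns a Boolean Series; both bounds are inclusive by default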

In [32]:
#Method 1 (the upper bound 90000 is excluded here, whereas between includes it by default)
higher_than_80 = employees["Salary"] >= 80000
lower_than_90 = employees["Salary"] < 90000
employees[higher_than_80 & lower_than_90].head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
19,Donna,Female,2010-07-22,81014,False,Product
31,Joyce,,2005-02-20,88657,False,Product
35,Theresa,Female,2006-10-10,85182,False,Sales
45,Roger,Male,1980-04-17,88010,True,Sales
54,Sara,Female,2007-08-15,83677,False,Engineering


In [33]:
between_80k_and_90k = employees["Salary"].between(80000, 90000)
employees[between_80k_and_90k].head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
19,Donna,Female,2010-07-22,81014,False,Product
31,Joyce,,2005-02-20,88657,False,Product
35,Theresa,Female,2006-10-10,85182,False,Sales
45,Roger,Male,1980-04-17,88010,True,Sales
54,Sara,Female,2007-08-15,83677,False,Engineering
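
The two methods are not exactly equivalent at the upper bound. As a sketch, the `inclusive` parameter of `between` (string values such as `"both"` and `"left"`, assuming pandas 1.3 or later) can reproduce the `>=` / `<` combination. The Series below is hypothetical and chosen to sit on the boundaries.

```python
import pandas as pd

salary = pd.Series([79999, 80000, 85000, 90000, 90001])

# Default: both bounds included
both = salary.between(80000, 90000)

# Include only the lower bound, matching (>= 80000) & (< 90000)
left_only = salary.between(80000, 90000, inclusive="left")
manual = (salary >= 80000) & (salary < 90000)

print(both.tolist())
print(left_only.tolist())
```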


- The between method also works on columns of other data types
  - ex. a datetime column -> pass the bounds as strings; the keyword parameters for the first/second arguments are left and right

In [34]:
eighties_folk = employees["Start Date"].between(
    left = "1980-01-01",
    right = "1990-01-01"
)
employees[eighties_folk].head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
5,Dennis,Male,1987-04-18,115163,False,Legal
6,Ruby,Female,1987-08-17,65476,True,Product
10,Louise,Female,1980-08-12,63241,True,
12,Brandon,Male,1980-12-01,112807,True,HR
17,Shawn,Male,1986-12-07,111737,False,Product


In [35]:
name_starts_with_r = employees["First Name"].between("R", "S")
employees[name_starts_with_r].head()
#Character and string comparisons are case-sensitive!

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
6,Ruby,Female,1987-08-17,65476,True,Product
36,Rachel,Female,2009-02-16,142032,False,Business Dev
45,Roger,Male,1980-04-17,88010,True,Sales
67,Rachel,Female,1999-08-16,51178,True,Finance
78,Robin,Female,1983-06-04,114797,True,Sales
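
On strings, `between` compares values in character order, which has two side effects: a lowercase "r" name falls outside the range, and the exact string "S" falls inside it because the upper bound is inclusive. The hypothetical Series below shows both, next to `str.startswith` as one possible alternative for a prefix match.

```python
import pandas as pd

names = pd.Series(["Ruby", "rachel", "S", "Sam", "Roger"])

# Range comparison: "R" <= value <= "S"
in_range = names.between("R", "S")

# Prefix match: only values that begin with an uppercase "R"
starts_with_r = names.str.startswith("R")

print(in_range.tolist())
print(starts_with_r.tolist())
```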


## The isnull and notnull methods
- The isnull method returns a Boolean Series that is True where the row's value is missing
- pandas also treats NaT and None as null

In [36]:
employees["Team"].isnull().head() #using isnull() - explicit

0    False
1     True
2    False
3    False
4    False
Name: Team, dtype: bool

In [37]:
employees["Team"].notnull().head() #using notnull() - explicit

0     True
1    False
2     True
3     True
4     True
Name: Team, dtype: bool

In [38]:
(~employees["Team"].isnull()).head() #using ~isnull()

0     True
1    False
2     True
3     True
4     True
Name: Team, dtype: bool
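
A quick check, on hypothetical values, that NaN, None, and NaT are all reported as null, and that `notnull` is simply the inverse of `isnull`.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.5, np.nan, None])
dates = pd.Series(pd.to_datetime(["2020-01-01", None]))

values_null = values.isnull()
dates_null = dates.isnull()

print(values_null.tolist())
print(dates_null.tolist())
# notnull is the inverse of isnull
print(values.notnull().equals(~values_null))
```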

In [39]:
no_team = employees["Team"].isnull()
employees[no_team].head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
1,Thomas,Male,1996-03-31,61933,True,
10,Louise,Female,1980-08-12,63241,True,
23,,Male,2012-06-14,125792,True,
32,,Male,1998-08-21,122340,True,
91,James,,2005-01-26,128771,False,


In [40]:
has_name = employees["First Name"].notnull()
employees[has_name].tail()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
995,Henry,,2014-11-23,132483,False,Distribution
996,Phillip,Male,1984-01-31,42392,False,Finance
997,Russell,Male,2013-05-20,96914,False,Product
998,Larry,Male,2013-04-20,60500,False,Business Dev
999,Albert,Male,2012-05-15,129949,True,Sales


## Dealing with null values

In [41]:
employees = pd.read_csv("employees.csv", parse_dates = ["Start Date"])
employees

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,,True,Marketing
1,Thomas,Male,1996-03-31,61933.0,True,
2,Maria,Female,NaT,130590.0,False,Finance
3,Jerry,,2005-03-04,138705.0,True,Finance
4,Larry,Male,1998-01-24,101004.0,True,IT
...,...,...,...,...,...,...
996,Phillip,Male,1984-01-31,42392.0,False,Finance
997,Russell,Male,2013-05-20,96914.0,False,Product
998,Larry,Male,2013-04-20,60500.0,False,Business Dev
999,Albert,Male,2012-05-15,129949.0,True,Sales


In [42]:
#The dropna method removes every DataFrame row that contains at least one NaN, no matter how many of its values are missing
#Default argument for how: any
employees.dropna()
#employees.dropna(how = "any")

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
4,Larry,Male,1998-01-24,101004.0,True,IT
5,Dennis,Male,1987-04-18,115163.0,False,Legal
6,Ruby,Female,1987-08-17,65476.0,True,Product
8,Angela,Female,2005-11-22,95570.0,True,Engineering
9,Frances,Female,2002-08-08,139852.0,True,Business Dev
...,...,...,...,...,...,...
994,George,Male,2013-06-21,98874.0,True,Marketing
996,Phillip,Male,1984-01-31,42392.0,False,Finance
997,Russell,Male,2013-05-20,96914.0,False,Product
998,Larry,Male,2013-04-20,60500.0,False,Business Dev


In [43]:
#dropna method: to remove only the rows in which every value is missing -> pass all to the how parameter
employees.dropna(how = "all").tail()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
995,Henry,,2014-11-23,132483.0,False,Distribution
996,Phillip,Male,1984-01-31,42392.0,False,Finance
997,Russell,Male,2013-05-20,96914.0,False,Product
998,Larry,Male,2013-04-20,60500.0,False,Business Dev
999,Albert,Male,2012-05-15,129949.0,True,Sales


In [44]:
#dropna method: to remove rows with a missing value in specific columns -> pass a list of those columns to the subset parameter
employees.dropna(subset = ["Gender"]).tail()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
994,George,Male,2013-06-21,98874.0,True,Marketing
996,Phillip,Male,1984-01-31,42392.0,False,Finance
997,Russell,Male,2013-05-20,96914.0,False,Product
998,Larry,Male,2013-04-20,60500.0,False,Business Dev
999,Albert,Male,2012-05-15,129949.0,True,Sales


In [45]:
#dropna method: the thresh parameter is the threshold for the minimum number of
#non-null values a row must have for pandas to keep it
employees.dropna(thresh = 4).head()
#Passing how and thresh together raises an error.

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,,True,Marketing
1,Thomas,Male,1996-03-31,61933.0,True,
2,Maria,Female,NaT,130590.0,False,Finance
3,Jerry,,2005-03-04,138705.0,True,Finance
4,Larry,Male,1998-01-24,101004.0,True,IT
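
The four variants can be compared side by side on a hypothetical four-row DataFrame in which the rows have 3, 2, 0, and 1 non-null values respectively.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", None, None, "Dee"],
    "team": ["IT", "HR", None, None],
    "salary": [1.0, 2.0, np.nan, np.nan],
})

any_missing = df.dropna()                   # drop rows with any null
all_missing = df.dropna(how="all")          # drop rows that are entirely null
team_missing = df.dropna(subset=["team"])   # look only at the team column
at_least_two = df.dropna(thresh=2)          # keep rows with >= 2 non-null values

for result in (any_missing, all_missing, team_missing, at_least_two):
    print(result.index.tolist())
```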


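# Dealing with duplicates
- Datasets often contain many duplicate values as well as missing values. pandas provides methods to identify and remove them
## The duplicated method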

In [46]:
employees["Team"].head()

0    Marketing
1          NaN
2      Finance
3      Finance
4           IT
Name: Team, dtype: object

In [47]:
#duplicated method: returns a Boolean Series that identifies duplicate values in a column
#Returns True for any value that has already appeared earlier in the Series
employees["Team"].duplicated().head()

0    False
1    False
2    False
3     True
4    False
Name: Team, dtype: bool

In [48]:
#keep parameter of duplicated: which occurrence of each duplicate value to treat as the non-duplicate (the one to keep)
#Default argument = first
employees["Team"].duplicated(keep = "first").head()

0    False
1    False
2    False
3     True
4    False
Name: Team, dtype: bool

In [49]:
employees["Team"].duplicated(keep = "last")

0        True
1        True
2        True
3        True
4        True
        ...  
996     False
997     False
998     False
999     False
1000    False
Name: Team, Length: 1001, dtype: bool

In [50]:
(~employees["Team"].duplicated()).head()

0     True
1     True
2     True
3    False
4     True
Name: Team, dtype: bool
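
The effect of each `keep` option is easier to see on a short hypothetical Series in which "IT" appears three times.

```python
import pandas as pd

teams = pd.Series(["IT", "HR", "IT", "IT", "Sales"])

# First occurrence is treated as the original; later ones are duplicates
mark_after_first = teams.duplicated()

# Last occurrence is treated as the original; earlier ones are duplicates
mark_before_last = teams.duplicated(keep="last")

# Every occurrence of a repeated value is marked
mark_all = teams.duplicated(keep=False)

print(mark_after_first.tolist())
print(mark_before_last.tolist())
print(mark_all.tolist())
```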

In [51]:
first_one_in_team = ~employees["Team"].duplicated()
employees[first_one_in_team]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,,True,Marketing
1,Thomas,Male,1996-03-31,61933.0,True,
2,Maria,Female,NaT,130590.0,False,Finance
4,Larry,Male,1998-01-24,101004.0,True,IT
5,Dennis,Male,1987-04-18,115163.0,False,Legal
6,Ruby,Female,1987-08-17,65476.0,True,Product
8,Angela,Female,2005-11-22,95570.0,True,Engineering
9,Frances,Female,2002-08-08,139852.0,True,Business Dev
12,Brandon,Male,1980-12-01,112807.0,True,HR
13,Gary,Male,2008-01-27,109831.0,False,Sales


## The drop_duplicates method
- By default, removes rows that duplicate an earlier row in every column

In [52]:
employees.drop_duplicates()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,,True,Marketing
1,Thomas,Male,1996-03-31,61933.0,True,
2,Maria,Female,NaT,130590.0,False,Finance
3,Jerry,,2005-03-04,138705.0,True,Finance
4,Larry,Male,1998-01-24,101004.0,True,IT
...,...,...,...,...,...,...
996,Phillip,Male,1984-01-31,42392.0,False,Finance
997,Russell,Male,2013-05-20,96914.0,False,Product
998,Larry,Male,2013-04-20,60500.0,False,Business Dev
999,Albert,Male,2012-05-15,129949.0,True,Sales


In [53]:
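#Passing the subset parameter a list of columns that must hold unique values removes duplicates based on those columns only
#It keeps the first row found for each unique value in those columns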
employees.drop_duplicates(subset = ["Team"])

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,,True,Marketing
1,Thomas,Male,1996-03-31,61933.0,True,
2,Maria,Female,NaT,130590.0,False,Finance
4,Larry,Male,1998-01-24,101004.0,True,IT
5,Dennis,Male,1987-04-18,115163.0,False,Legal
6,Ruby,Female,1987-08-17,65476.0,True,Product
8,Angela,Female,2005-11-22,95570.0,True,Engineering
9,Frances,Female,2002-08-08,139852.0,True,Business Dev
12,Brandon,Male,1980-12-01,112807.0,True,HR
13,Gary,Male,2008-01-27,109831.0,False,Sales


In [54]:
#To keep the last row in which each duplicate value appears, pass last to the keep parameter
employees.drop_duplicates(subset = ["Team"], keep = "last")

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
988,Alice,Female,2004-10-05,47638.0,False,HR
989,Justin,,1991-02-10,38344.0,False,Legal
990,Robin,Female,1987-07-24,100765.0,True,IT
993,Tina,Female,1997-05-15,56450.0,True,Engineering
994,George,Male,2013-06-21,98874.0,True,Marketing
995,Henry,,2014-11-23,132483.0,False,Distribution
996,Phillip,Male,1984-01-31,42392.0,False,Finance
997,Russell,Male,2013-05-20,96914.0,False,Product
998,Larry,Male,2013-04-20,60500.0,False,Business Dev
999,Albert,Male,2012-05-15,129949.0,True,Sales


In [55]:
#Passing False to the keep parameter removes every row whose value is duplicated, leaving only values that occur exactly once
employees.drop_duplicates(subset = ["First Name"], keep = False)

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
5,Dennis,Male,1987-04-18,115163.0,False,Legal
8,Angela,Female,2005-11-22,95570.0,True,Engineering
33,Jean,Female,1993-12-18,119082.0,False,Business Dev
190,Carol,Female,1996-03-19,57783.0,False,Finance
291,Tammy,Female,1984-11-11,132839.0,True,IT
495,Eugene,Male,1984-05-24,81077.0,False,Sales
688,Brian,Male,2007-04-07,93901.0,True,Legal
832,Keith,Male,2003-02-12,120672.0,False,Legal
887,David,Male,2009-12-05,92242.0,False,Legal


In [57]:
#Identify duplicates by the combination of values across multiple columns
employees.drop_duplicates(subset = ["Gender", "Team"]).head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,,True,Marketing
1,Thomas,Male,1996-03-31,61933.0,True,
2,Maria,Female,NaT,130590.0,False,Finance
3,Jerry,,2005-03-04,138705.0,True,Finance
4,Larry,Male,1998-01-24,101004.0,True,IT


In [59]:
#Example of rows whose values are duplicated across several columns at once (same First Name and Gender)
name_is_douglas = employees["First Name"] == "Douglas"
is_male = employees["Gender"] == "Male"
employees[name_is_douglas & is_male]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,,True,Marketing
217,Douglas,Male,1999-09-03,83341.0,True,IT
322,Douglas,Male,2002-01-08,41428.0,False,Product
835,Douglas,Male,2007-08-04,132175.0,False,Engineering
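
The `subset` and `keep` combinations above can be summarized on a hypothetical four-row DataFrame. The last call shows that no row is dropped when the combination of both columns is unique, even though each column on its own contains repeats.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "team": ["IT", "IT", "HR", "HR"],
})

first_per_team = df.drop_duplicates(subset=["team"])
last_per_team = df.drop_duplicates(subset=["team"], keep="last")
names_used_once = df.drop_duplicates(subset=["name"], keep=False)
unique_pairs = df.drop_duplicates(subset=["name", "team"])

for result in (first_per_team, last_per_team, names_used_once, unique_pairs):
    print(result.index.tolist())
```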


# Coding challenge

In [60]:
pd.read_csv("netflix.csv")

Unnamed: 0,title,director,date_added,type
0,Alias Grace,,3-Nov-17,TV Show
1,A Patch of Fog,Michael Lennox,15-Apr-17,Movie
2,Lunatics,,19-Apr-19,TV Show
3,Uriyadi 2,Vijay Kumar,2-Aug-19,Movie
4,Shrek the Musical,Jason Moore,29-Dec-13,Movie
...,...,...,...,...
5832,The Pursuit,John Papola,7-Aug-19,Movie
5833,Hurricane Bianca,Matt Kugelman,1-Jan-17,Movie
5834,Amar's Hands,Khaled Youssef,26-Apr-19,Movie
5835,Bill Nye: Science Guy,Jason Sussberg,25-Apr-18,Movie


In [71]:
#1. Optimize the dataset
netflix = pd.read_csv(
    "netflix.csv",
    parse_dates = ["date_added"]
)
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5837 entries, 0 to 5836
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   title       5837 non-null   object        
 1   director    3936 non-null   object        
 2   date_added  5195 non-null   datetime64[ns]
 3   type        5837 non-null   object        
dtypes: datetime64[ns](1), object(3)
memory usage: 182.5+ KB


In [72]:
netflix.nunique()

title         5780
director      3024
date_added    1092
type             2
dtype: int64

In [75]:
netflix["type"] = netflix["type"].astype("category")
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5837 entries, 0 to 5836
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   title       5837 non-null   object        
 1   director    3936 non-null   object        
 2   date_added  5195 non-null   datetime64[ns]
 3   type        5837 non-null   category      
dtypes: category(1), datetime64[ns](1), object(2)
memory usage: 142.8+ KB


In [77]:
#2. Find the content titled 'Limitless'
netflix[netflix["title"] == "Limitless"]

Unnamed: 0,title,director,date_added,type
1559,Limitless,Neil Burger,2019-05-16,Movie
2564,Limitless,,2016-07-01,TV Show
4579,Limitless,Vrinda Samartha,2019-10-01,Movie


In [79]:
#3. Find the content of type 'Movie' directed by 'Robert Rodriguez'
is_robert_rodriguez = netflix["director"] == "Robert Rodriguez"
is_movie = netflix["type"] == "Movie"
netflix[is_robert_rodriguez & is_movie]

Unnamed: 0,title,director,date_added,type
1384,Spy Kids: All the Time in the World,Robert Rodriguez,2019-02-19,Movie
1416,Spy Kids 3: Game Over,Robert Rodriguez,2019-04-01,Movie
1460,Spy Kids 2: The Island of Lost Dreams,Robert Rodriguez,2019-03-08,Movie
2890,Sin City,Robert Rodriguez,2019-10-01,Movie
3836,Shorts,Robert Rodriguez,2019-07-01,Movie
3883,Spy Kids,Robert Rodriguez,2019-04-01,Movie


In [80]:
#4. Find the content added on '2019-07-31' or directed by 'Robert Altman'
is_date = netflix["date_added"] == "2019-07-31"
is_robert_altman = netflix["director"] == "Robert Altman"
netflix[is_date | is_robert_altman]

Unnamed: 0,title,director,date_added,type
611,Popeye,Robert Altman,2019-11-24,Movie
1028,The Red Sea Diving Resort,Gideon Raff,2019-07-31,Movie
1092,Gosford Park,Robert Altman,2019-11-01,Movie
3473,Bangkok Love Stories: Innocence,,2019-07-31,TV Show
5117,Ramen Shop,Eric Khoo,2019-07-31,Movie


In [81]:
#5. Find the content directed by 'Orson Welles', 'Aditya Kripalani', or 'Sam Raimi'
directors = ['Orson Welles','Aditya Kripalani','Sam Raimi']
is_directors = netflix["director"].isin(directors)
netflix[is_directors]

Unnamed: 0,title,director,date_added,type
946,The Stranger,Orson Welles,2018-07-19,Movie
1870,The Gift,Sam Raimi,2019-11-20,Movie
3706,Spider-Man 3,Sam Raimi,2019-11-01,Movie
4243,Tikli and Laxmi Bomb,Aditya Kripalani,2018-08-01,Movie
4475,The Other Side of the Wind,Orson Welles,2018-11-02,Movie
5115,Tottaa Pataaka Item Maal,Aditya Kripalani,2019-06-25,Movie


In [83]:
#6. Find the content added between 2019-05-01 and 2019-06-01 (both dates included)
contents = netflix["date_added"].between("2019-05-01", "2019-06-01")
netflix[contents].head()

Unnamed: 0,title,director,date_added,type
29,Chopsticks,Sachin Yardi,2019-05-31,Movie
60,Away From Home,,2019-05-08,TV Show
82,III Smoking Barrels,Sanjib Dey,2019-06-01,Movie
108,Jailbirds,,2019-05-10,TV Show
124,Pegasus,Han Han,2019-05-31,Movie


In [84]:
#7. Delete all rows with NaN in the director column
netflix.dropna(subset = ["director"], how = "any")

Unnamed: 0,title,director,date_added,type
1,A Patch of Fog,Michael Lennox,2017-04-15,Movie
3,Uriyadi 2,Vijay Kumar,2019-08-02,Movie
4,Shrek the Musical,Jason Moore,2013-12-29,Movie
5,Schubert In Love,Lars Büchel,2018-03-01,Movie
6,We Have Always Lived in the Castle,Stacie Passon,2019-09-14,Movie
...,...,...,...,...
5830,Bibi & Tina,Detlev Buck,2017-04-15,Movie
5832,The Pursuit,John Papola,2019-08-07,Movie
5833,Hurricane Bianca,Matt Kugelman,2017-01-01,Movie
5834,Amar's Hands,Khaled Youssef,2019-04-26,Movie


In [85]:
#8. Identify the dates on which Netflix added only a single piece of content
netflix.drop_duplicates(subset = ["date_added"], keep = False)

Unnamed: 0,title,director,date_added,type
4,Shrek the Musical,Jason Moore,2013-12-29,Movie
12,Without Gorky,Cosima Spender,2017-05-31,Movie
30,Anjelah Johnson: Not Fancy,Jay Karas,2015-10-02,Movie
38,One Last Thing,Tim Rouhana,2019-08-25,Movie
70,Marvel's Iron Man & Hulk: Heroes United,Leo Riley,2014-02-16,Movie
...,...,...,...,...
5748,Menorca,John Barnard,2017-08-27,Movie
5749,Green Room,Jeremy Saulnier,2018-11-12,Movie
5788,Chris Brown: Welcome to My Life,Andrew Sandler,2017-10-07,Movie
5789,A Very Murray Christmas,Sofia Coppola,2015-12-04,Movie
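
As a cross-check for challenge 8 (an addition, not part of the original solution), `value_counts` offers another way to find values that occur exactly once. One difference worth knowing: `value_counts` skips missing dates by default, whereas `drop_duplicates(keep = False)` keeps a row whose missing date happens to be the only one. The five-row DataFrame `netflix_sample` is hypothetical.

```python
import pandas as pd

netflix_sample = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "date_added": pd.to_datetime(
        ["2019-01-01", "2019-01-01", "2019-02-01", "2019-03-01", None]
    ),
})

# Approach from the challenge: drop every row whose date is repeated
via_drop = netflix_sample.drop_duplicates(subset=["date_added"], keep=False)

# Alternative: count each date and keep the dates seen exactly once
counts = netflix_sample["date_added"].value_counts()
single_dates = counts[counts == 1].index

print(via_drop["title"].tolist())  # includes the row with the lone NaT
print(sorted(single_dates))        # NaT is not counted
```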
