
## 5.1. What Is Pandas?

- The name pandas comes from "panel data" and is also a play on <b>"python data analysis"</b>.
> pandas specializes in processing structured (tabular) data.

- pandas is also commonly used alongside many machine learning and scientific libraries.
> scikit-learn, scipy, statsmodels, tensorflow, pytorch, ...


- Put simply, pandas lets you **use Excel-like functionality from Python**.
> pandas = Python + Excel // pandas & Excel // pandas vs. MS Excel

- However, because pandas is built on NumPy arrays and combines well with the rest of Python, it can go well beyond Excel in performance.
> pandas is better suited than Excel to high-performance data processing.

![numpy_data_type](../images/pandas/dataframe.png)

- The basic unit for handling data in pandas is the DataFrame. It is the same idea as the familiar spreadsheet.


- Data in this form is called structured data, panel data, or tabular data.


- Studying pandas ultimately means learning how to use the DataFrame and how to apply it.


- Once you are comfortable with pandas, you can handle most structured data freely.

![pandas_files](../images/pandas/pandas_files.png)

## 5.2. Basic pandas Data Structures (Series, DataFrame)

In [1]:
# Import the pandas library, using pd as the alias.
import pandas as pd
import numpy as np
print(pd.__version__)  # check the pandas version

1.1.5


- A DataFrame is a two-dimensional table, and a single line of that table (one row or one column) is a Series.


- In other words, a collection of Series makes up a DataFrame.
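
The relationship between Series and DataFrame can be sketched as follows. This is only an illustration: the names `height`, `weight`, and `people` and their values are made up for the example.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative data).
height = pd.Series([170, 182, 165], index=["a", "b", "c"], name="height")
weight = pd.Series([60, 81, 55], index=["a", "b", "c"], name="weight")

# Combining Series column-wise yields a DataFrame.
people = pd.DataFrame({"height": height, "weight": weight})

# Pulling one column or one row back out gives a Series again.
col = people["height"]   # a column, indexed by the row labels
row = people.loc["b"]    # a row, indexed by the column names
print(type(col).__name__, type(row).__name__)
```

Both `col` and `row` are Series; only the labels they are indexed by differ.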

In [5]:
# s is a pandas.Series whose elements are 1, 3, 5, np.nan, 6, 8
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

- pandas also provides the date_range function, which makes it easy to generate a range of dates.

In [6]:
# pandas.date_range generates a range of 6 consecutive days starting from 20210101
dates = pd.date_range('20210101',periods=6)
dates

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [7]:
# Create a 6x4 DataFrame of random values drawn from the standard normal distribution (so values are not limited to -1..1), with dates as the index and A, B, C, D as the columns
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
2021-01-01,-0.559762,-0.586253,0.215565,0.971196
2021-01-02,0.687576,1.21847,-1.373275,-0.891995
2021-01-03,-1.360237,-2.106209,-0.593947,0.560732
2021-01-04,1.337817,-0.607375,-0.120128,-0.602777
2021-01-05,-0.597587,-0.617681,1.17096,-0.04462
2021-01-06,1.981247,0.860015,-0.399118,0.194694
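
Because the values are random, rerunning the cell gives different numbers each time. If reproducible output is wanted, one option is a seeded NumPy generator. This is a sketch: the seed value `0` and the name `df_seeded` are arbitrary choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210101", periods=6)

# A seeded generator produces the same standard-normal samples on every run.
rng = np.random.default_rng(seed=0)  # the seed value is arbitrary
df_seeded = pd.DataFrame(
    rng.standard_normal((6, 4)),
    index=dates,
    columns=list("ABCD"),
)
print(df_seeded.shape)
```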


## 5.3. Basic DataFrame Methods

In [8]:
# head() shows the first five rows of the DataFrame
df.head()

Unnamed: 0,A,B,C,D
2021-01-01,-0.559762,-0.586253,0.215565,0.971196
2021-01-02,0.687576,1.21847,-1.373275,-0.891995
2021-01-03,-1.360237,-2.106209,-0.593947,0.560732
2021-01-04,1.337817,-0.607375,-0.120128,-0.602777
2021-01-05,-0.597587,-0.617681,1.17096,-0.04462


In [10]:
# Show 3 rows: head(3) gives the first three, tail(3) gives the last three
# df.head(3)
df.tail(3)

Unnamed: 0,A,B,C,D
2021-01-04,1.337817,-0.607375,-0.120128,-0.602777
2021-01-05,-0.597587,-0.617681,1.17096,-0.04462
2021-01-06,1.981247,0.860015,-0.399118,0.194694


In [11]:
# dataframe index
df.index

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06'],
              dtype='datetime64[ns]', freq='D')

In [12]:
# dataframe columns
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [13]:
# dataframe values
df.values

array([[-0.55976209, -0.58625289,  0.21556451,  0.97119572],
       [ 0.68757623,  1.21847008, -1.37327455, -0.89199509],
       [-1.36023706, -2.10620897, -0.59394687,  0.56073158],
       [ 1.33781703, -0.60737536, -0.12012782, -0.60277655],
       [-0.59758745, -0.61768124,  1.17095993, -0.04462018],
       [ 1.98124669,  0.86001464, -0.39911782,  0.19469364]])

In [14]:
# Shows an overall summary of the DataFrame: index, columns, non-null counts, dtypes, and memory usage.
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2021-01-01 to 2021-01-06
Freq: D
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       6 non-null      float64
 1   B       6 non-null      float64
 2   C       6 non-null      float64
 3   D       6 non-null      float64
dtypes: float64(4)
memory usage: 240.0 bytes


In [16]:
# Shows summary statistics for each numeric column of the DataFrame.
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.248176,-0.306506,-0.183324,0.031205
std,1.291433,1.199238,0.851863,0.700099
min,-1.360237,-2.106209,-1.373275,-0.891995
25%,-0.588131,-0.615105,-0.54524,-0.463237
50%,0.063907,-0.596814,-0.259623,0.075037
75%,1.175257,0.498448,0.131641,0.469222
max,1.981247,1.21847,1.17096,0.971196


In [19]:
# Sort by column B in descending order and show the top 3 rows
df.sort_values(by='B',ascending=False).head(3)

Unnamed: 0,A,B,C,D
2021-01-02,0.687576,1.21847,-1.373275,-0.891995
2021-01-06,1.981247,0.860015,-0.399118,0.194694
2021-01-01,-0.559762,-0.586253,0.215565,0.971196
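
Note that `sort_values` returns a new, sorted DataFrame and leaves the original order untouched; `sort_index` does the same using the index labels. A minimal sketch with made-up data (`scores` and its values are illustrative):

```python
import pandas as pd

scores = pd.DataFrame({"B": [0.5, -1.2, 2.0]}, index=["x", "y", "z"])

# Sort by the values of column B, largest first.
by_b_desc = scores.sort_values(by="B", ascending=False)

# Sort by the index labels instead of by a column.
by_index_desc = scores.sort_index(ascending=False)

print(by_b_desc.index.tolist())   # ['z', 'x', 'y']
print(scores.index.tolist())      # ['x', 'y', 'z'] (original unchanged)
```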


## 5.4. DataFrame Indexing

> Indexing: a way of finding the elements of a dataset that satisfy a particular condition.

> It is useful for quickly locating and manipulating the data in a DataFrame that meets a condition.

In [20]:
# A pandas DataFrame supports basic indexing by column name.
# Index column A.
df["A"]  # indexing a DataFrame directly looks up a column, just as indexing a dictionary looks up a key
# In other words, the column name plays the role of the dictionary key.

2021-01-01   -0.559762
2021-01-02    0.687576
2021-01-03   -1.360237
2021-01-04    1.337817
2021-01-05   -0.597587
2021-01-06    1.981247
Freq: D, Name: A, dtype: float64

In [23]:
# Indexing by a specific date (an index label)
df.loc['2021-01-03']  # returns a pd.Series

A   -1.360237
B   -2.106209
C   -0.593947
D    0.560732
Name: 2021-01-03 00:00:00, dtype: float64

In [24]:
# Indexing by integer position (position 2 is the third row)
df.iloc[2]

A   -1.360237
B   -2.106209
C   -0.593947
D    0.560732
Name: 2021-01-03 00:00:00, dtype: float64

In [25]:
# Slicing a DataFrame with [] selects rows.
# Slice the first 3 rows.
df[:3]

Unnamed: 0,A,B,C,D
2021-01-01,-0.559762,-0.586253,0.215565,0.971196
2021-01-02,0.687576,1.21847,-1.373275,-0.891995
2021-01-03,-1.360237,-2.106209,-0.593947,0.560732


In [26]:
# Slicing by index values also works (still row-wise).
# Slice from 20210102 through 20210104. When index labels are used, the end label is included.
df['20210102':'20210104']

Unnamed: 0,A,B,C,D
2021-01-02,0.687576,1.21847,-1.373275,-0.891995
2021-01-03,-1.360237,-2.106209,-0.593947,0.560732
2021-01-04,1.337817,-0.607375,-0.120128,-0.602777


In [27]:
df.loc['2021-01-02']

A    0.687576
B    1.218470
C   -1.373275
D   -0.891995
Name: 2021-01-02 00:00:00, dtype: float64

In [28]:

# df.loc indexes by label (like a key-value lookup).
# Get the row whose index value is 2021-01-01.
df.loc[dates[0]]

A   -0.559762
B   -0.586253
C    0.215565
D    0.971196
Name: 2021-01-01 00:00:00, dtype: float64

In [29]:
# df.loc also supports 2D indexing. [:, ["A", "B"]] means: for every row, take only columns A and B.
df.loc[:,["A","B"]]  # in 2D indexing, the columns can be passed as a list

Unnamed: 0,A,B
2021-01-01,-0.559762,-0.586253
2021-01-02,0.687576,1.21847
2021-01-03,-1.360237,-2.106209
2021-01-04,1.337817,-0.607375
2021-01-05,-0.597587,-0.617681
2021-01-06,1.981247,0.860015


In [30]:
# This time, slice a range of rows and take only columns A and C
df.loc['2021-01-03':'2021-01-05',['A','C']]

Unnamed: 0,A,C
2021-01-03,-1.360237,-0.593947
2021-01-04,1.337817,-0.120128
2021-01-05,-0.597587,1.17096


In [31]:
# Select specific rows by index label, keeping columns A and B
df.loc['2021-01-02':'2021-01-04',['A','B']]

Unnamed: 0,A,B
2021-01-02,0.687576,1.21847
2021-01-03,-1.360237,-2.106209
2021-01-04,1.337817,-0.607375


In [32]:
# With one row label and one column label, this works like indexing a 2D list and returns a single value.
df.loc['2021-01-05','C']

1.1709599288684085

In [34]:
# df.iloc indexes by integer position. Position 3 is the fourth row, so 3:5 selects the fourth and fifth rows.
df.iloc[3:5,0:2]  # iloc indexing behaves like 2D indexing on a NumPy array

Unnamed: 0,A,B
2021-01-04,1.337817,-0.607375
2021-01-05,-0.597587,-0.617681


In [35]:
# 2D indexing with iloc: take the rows at positions 3 and 4 and the columns at positions 0 and 1.
df.iloc[3:5,0:2]  # iloc indexing behaves like 2D indexing on a NumPy array

Unnamed: 0,A,B
2021-01-04,1.337817,-0.607375
2021-01-05,-0.597587,-0.617681
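
One difference between the two accessors is easy to miss: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, as in ordinary Python slicing. The sketch below uses a small stand-in frame (`demo`, filled with the integers 0 to 23) so the result is deterministic.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210101", periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based slice: both the start and end labels are included.
by_label = demo.loc["2021-01-02":"2021-01-04", "A":"C"]

# Position-based slice: the end position is excluded.
by_position = demo.iloc[1:4, 0:3]

print(by_label.shape, by_position.shape)
print(by_label.equals(by_position))
```

Both selections cover the same three rows and three columns, even though the slice bounds are written differently.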


In [36]:
# Indexing with explicit lists of positions instead of slices
df.iloc[[1,2,4],[0,3]]

Unnamed: 0,A,D
2021-01-02,0.687576,-0.891995
2021-01-03,-1.360237,0.560732
2021-01-05,-0.597587,-0.04462


In [37]:
# Q. What does a trailing : mean in 2D indexing? A. It selects every column.
df.iloc[1:3,:]

Unnamed: 0,A,B,C,D
2021-01-02,0.687576,1.21847,-1.373275,-0.891995
2021-01-03,-1.360237,-2.106209,-0.593947,0.560732


In [38]:
# Same as 2D indexing on a NumPy array: every row, columns at positions 1 and 2.
df.iloc[:,1:3]

Unnamed: 0,B,C
2021-01-01,-0.586253,0.215565
2021-01-02,1.21847,-1.373275
2021-01-03,-2.106209,-0.593947
2021-01-04,-0.607375,-0.120128
2021-01-05,-0.617681,1.17096
2021-01-06,0.860015,-0.399118


In [39]:
df

Unnamed: 0,A,B,C,D
2021-01-01,-0.559762,-0.586253,0.215565,0.971196
2021-01-02,0.687576,1.21847,-1.373275,-0.891995
2021-01-03,-1.360237,-2.106209,-0.593947,0.560732
2021-01-04,1.337817,-0.607375,-0.120128,-0.602777
2021-01-05,-0.597587,-0.617681,1.17096,-0.04462
2021-01-06,1.981247,0.860015,-0.399118,0.194694


In [40]:
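# pandas supports boolean indexing, sometimes loosely called fancy indexing. (NumPy supports it, and pandas builds on NumPy.)
# Boolean indexing selects data with a condition: the condition produces a Series of True/False values that acts as a mask.
# This expression builds the mask marking which elements of column A are greater than 0.
df['A'] > 0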

2021-01-01    False
2021-01-02     True
2021-01-03    False
2021-01-04     True
2021-01-05    False
2021-01-06     True
Freq: D, Name: A, dtype: bool

In [41]:
# Apply the mask: keep only the rows where column A is greater than 0
df[df['A'] >0]

Unnamed: 0,A,B,C,D
2021-01-02,0.687576,1.21847,-1.373275,-0.891995
2021-01-04,1.337817,-0.607375,-0.120128,-0.602777
2021-01-06,1.981247,0.860015,-0.399118,0.194694
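
Masks can also be combined. With pandas objects the element-wise operators `&` (and), `|` (or), and `~` (not) are used, and each condition needs its own parentheses; the Python keywords `and`/`or` raise an error on a Series. A minimal sketch with made-up data (`demo` is illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    "A": [-1.0, 0.5, 2.0, 3.0],
    "B": [1.0, -0.5, 0.2, -2.0],
})

# Rows where both A and B are positive.
both_positive = demo[(demo["A"] > 0) & (demo["B"] > 0)]

# Rows where at least one of A or B is negative.
either_negative = demo[(demo["A"] < 0) | (demo["B"] < 0)]

# Rows where A is NOT greater than 0.
a_not_positive = demo[~(demo["A"] > 0)]

print(both_positive.index.tolist())    # [2]
print(either_negative.index.tolist())  # [0, 1, 3]
print(a_not_positive.index.tolist())   # [0]
```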


In [43]:
# Replace every negative value with 0. This modifies df in place.
df[df < 0] = 0
df

Unnamed: 0,A,B,C,D
2021-01-01,0.0,0.0,0.215565,0.971196
2021-01-02,0.687576,1.21847,0.0,0.0
2021-01-03,0.0,0.0,0.0,0.560732
2021-01-04,1.337817,0.0,0.0,0.0
2021-01-05,0.0,0.0,1.17096,0.0
2021-01-06,1.981247,0.860015,0.0,0.194694


In [42]:
# Keep only the values greater than 0; every other cell becomes NaN
df[df > 0]

Unnamed: 0,A,B,C,D
2021-01-01,,,0.215565,0.971196
2021-01-02,0.687576,1.21847,,
2021-01-03,,,,0.560732
2021-01-04,1.337817,,,
2021-01-05,,,1.17096,
2021-01-06,1.981247,0.860015,,0.194694
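
Assigning through a mask, as in `df[df < 0] = 0`, overwrites the original frame. When the original should stay intact, `DataFrame.where` or `DataFrame.clip` return a new object instead. The sketch below uses a small made-up frame (`demo`) to compare the three forms.

```python
import pandas as pd

demo = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

# Selecting with a boolean frame: cells failing the condition become NaN.
masked = demo[demo > 0]

# where(cond, other): keep cells where cond is True, use `other` elsewhere.
zero_filled = demo.where(demo > 0, 0)

# clip(lower=0): raise anything below 0 up to 0.
clipped = demo.clip(lower=0)

print(zero_filled)
print(demo)  # the original still contains its negative values
```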


In [44]:
# Make a copy of the DataFrame. copy() creates an independent object, so changes to df2 do not affect df.
df2 = df.copy()
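
The difference between copying and plain assignment can be sketched as follows. The names `original`, `alias`, and `independent` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({"A": [1, 2, 3]})

alias = original               # plain assignment: a second name for the same object
independent = original.copy()  # copy(): a separate object with its own data

alias.loc[0, "A"] = 100        # also changes `original`
independent.loc[1, "A"] = -1   # leaves `original` untouched

print(original["A"].tolist())     # [100, 2, 3]
print(independent["A"].tolist())  # [1, -1, 3]
```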

In [46]:
# A DataFrame supports assignment in much the same way as a dictionary.
# Add a column E to df2 whose values are the list ['one', 'one','two','three','four','three'].
df2['E'] =  ['one', 'one','two','three','four','three']
df2

Unnamed: 0,A,B,C,D,E
2021-01-01,0.0,0.0,0.215565,0.971196,one
2021-01-02,0.687576,1.21847,0.0,0.0,one
2021-01-03,0.0,0.0,0.0,0.560732,two
2021-01-04,1.337817,0.0,0.0,0.0,three
2021-01-05,0.0,0.0,1.17096,0.0,four
2021-01-06,1.981247,0.860015,0.0,0.194694,three


In [47]:
# isin returns a boolean Series that is True for the rows whose value is one of the given values.
df2['E'].isin(['two','four'])

2021-01-01    False
2021-01-02    False
2021-01-03     True
2021-01-04    False
2021-01-05     True
2021-01-06    False
Freq: D, Name: E, dtype: bool

In [48]:
df2[df2['E'].isin(['two','four'])]

Unnamed: 0,A,B,C,D,E
2021-01-03,0.0,0.0,0.0,0.560732,two
2021-01-05,0.0,0.0,1.17096,0.0,four


## 5.5. Reading and Writing External Data

In [51]:
# Load Iris.csv (located in /content in this Colab session).
data=pd.read_csv("/content/Iris.csv")
data

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica


In [52]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [54]:
set(data["Species"])

{'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'}

In [57]:
# Convert the Species column to numbers.
data.loc[data["Species"] == "Iris-setosa","Species"] = 0
data.loc[data["Species"] == "Iris-versicolor","Species"] = 1
data.loc[data["Species"] == "Iris-virginica","Species"] = 2

In [59]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
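
The `info()` output above still lists Species as `object`, even though every value is now an integer: replacing values one mask at a time did not change the column's dtype here. One alternative is to map all labels in a single step and then cast explicitly. This is a sketch on a tiny stand-in frame; `iris_demo` and `species_to_code` are illustrative names, and the rows are not the real file.

```python
import pandas as pd

# A tiny stand-in for the Iris data (illustrative rows only).
iris_demo = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
})

# Map each label to its numeric code, then make the integer dtype explicit.
species_to_code = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
iris_demo["Species"] = iris_demo["Species"].map(species_to_code).astype("int64")

print(iris_demo["Species"].dtype)  # int64
```

A label missing from the dictionary would map to NaN and make the cast fail, which at least surfaces unexpected categories early.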


In [61]:
# Save the modified DataFrame as Iris_edited.csv.
data.to_csv("/content/Iris_edited.csv")
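
By default `to_csv` also writes the index as the first column, with an empty header. Reading such a file back produces an extra `Unnamed: 0` column. Passing `index=False` avoids this when the index carries no information. The sketch below uses in-memory buffers instead of files so it runs anywhere; `demo` is illustrative data.

```python
import io
import pandas as pd

demo = pd.DataFrame({"Id": [1, 2], "Species": [0, 1]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the actual data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'Id', 'Species']
print(cols_without)  # ['Id', 'Species']
```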

In [64]:
# Load another file.
data2= pd.read_csv("/content/kaggle_survey_2020_responses.csv")
data2



Unnamed: 0,Time from Start to Finish (seconds),Q1,Q2,Q3,Q4,Q5,Q6,...
0,Duration (in seconds),What is your age (# years)?,What is your gender? - Selected Choice,In which country do you currently reside?,What is the highest level of formal education ...,Select the title most similar to your current ...,For how many years have you been writing code ...,...
1,1838,35-39,Man,Colombia,Doctoral degree,Student,5-10 years,...
2,289287,30-34,Man,United States of America,Master’s degree,Data Engineer,5-10 years,...
3,860,35-39,Man,Argentina,Bachelor’s degree,Software Engineer,10-20 years,...
4,507,30-34,Man,United States of America,Master’s degree,Data Scientist,5-10 years,...
...,...,...,...,...,...,...,...,...
20032,126,18-21,Man,Turkey,Some college/university study without earning ...,,,...
20033,566,55-59,Woman,United Kingdom of Great Britain and Northern I...,Master’s degree,Currently not employed,20+ years,...
20034,238,30-34,Man,Brazil,Master’s degree,Research Scientist,< 1 years,...
20035,625,22-24,Man,India,Bachelor’s degree,Software Engineer,3-5 years,...

(Columns Q7_Part_1 through Q35_B_OTHER are omitted here for width.)
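
In the output above, row 0 holds the question text rather than a respondent's answers. One way to keep that row out of the data is to skip the second line of the file while reading. The sketch below demonstrates this on a miniature in-memory stand-in for the survey file; `csv_text` and its two answer rows are made up for illustration.

```python
import io
import pandas as pd

# Miniature stand-in: a row of column codes, a row of question text, then answers.
csv_text = (
    "Q1,Q4\n"
    "What is your age (# years)?,What is the highest level of formal education ...\n"
    "35-39,Doctoral degree\n"
    "30-34,Master's degree\n"
)

# Plain read: the question text ends up as data row 0.
raw = pd.read_csv(io.StringIO(csv_text))
questions = raw.iloc[0]  # question text, keyed by column code

# skiprows=[1] drops the second line of the file (line numbers start at 0).
answers_only = pd.read_csv(io.StringIO(csv_text), skiprows=[1])

print(raw.shape, answers_only.shape)  # (3, 2) (2, 2)
print(questions["Q1"])
```

Keeping `questions` around is handy as a lookup from column code to question wording.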


In [66]:
# Select only the respondents who hold a doctoral degree.
phd=data2[data2["Q4"] == "Doctoral degree"]

In [67]:
# Save only the doctoral-degree respondents to kaggle_survey_2020_phd.csv.
phd.to_csv("/content/kaggle_survey_2020_phd.csv")

In [70]:
# (OPTIONAL) Select the doctoral-degree holders who currently reside in South Korea (Q3 asks for country of residence).
phd_Korea= phd[phd["Q3"]== "South Korea"]