<a href="https://colab.research.google.com/github/kiuugi/pandas/blob/main/02_DataFrame.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### [Reference] <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Pandas Cheat Sheet</a>

### DataFrame (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

<img src="https://miro.medium.com/max/1059/1*5zJ9tsVIRvxY83GsO8eyOw.png" width="500" height="350">

**pd.DataFrame(data=None, index: Union[Collection, NoneType] = None, columns: Union[Collection, NoneType] = None, dtype: Union[str, numpy.dtype, ForwardRef('ExtensionDtype'), NoneType] = None, copy: bool = False)**

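- A DataFrame is tabular (two-dimensional) data and is the main structure used for data processing in data analysis and machine learning.
- Because it is two-dimensional, the data is organized into rows and columns, as in Excel or CSV, and there are two sets of labels, one for rows and one for columns.
  - Row labels are called the index, and column labels are called the columns.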
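
As a minimal sketch of the two label sets described above, the following builds a tiny DataFrame and prints its shape, row labels, and column labels. The sample values and the names `price`, `qty`, `r1`, and `r2` are made up for illustration and are not part of this notebook's data.

```python
import pandas as pd

# Hypothetical sample data, not taken from the notebook's files
df = pd.DataFrame(
    {"price": [100, 200], "qty": [3, 5]},
    index=["r1", "r2"],
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.index))    # row labels (the index)
print(list(df.columns))  # column labels (the columns)
```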

In [12]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
# Check the current working directory
!pwd

/content


In [23]:
import pandas as pd

### Creating a DataFrame
#### 1) From dictionaries

In [4]:
pd.DataFrame?

In [24]:
friend_dict_list = [
    {
       "name" : "John",
       "age" : 25,
       "job" : "student"
    },
    {
       "name" : "Nate",
       "age" : 34,
       "job" : "teacher"
    },
    {
       "name" : "Jenny",
       "age" : 30,
       "job" : "developer"
    }
]

friend_df = pd.DataFrame(friend_dict_list)
friend_df

Unnamed: 0,name,age,job
0,John,25,student
1,Nate,34,teacher
2,Jenny,30,developer
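
When a DataFrame is built from a list of dictionaries, each dictionary becomes one row and the keys become the columns. The sketch below shows what happens when the dictionaries do not all share the same keys: the gaps are filled with `NaN`. The records are a shortened, hypothetical variant of the friend list above.

```python
import pandas as pd

# Hypothetical records: the second one has no "age", the first has no "job"
records = [
    {"name": "John", "age": 25},
    {"name": "Nate", "job": "teacher"},
]
df = pd.DataFrame(records)

print(df)
print(df.isna().sum())  # number of missing values per column
```

Because `NaN` is a floating-point value, the `age` column is stored as `float64` here rather than `int64`.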


In [27]:
# Named score_dict so the built-in dict is not shadowed
# Keys: 국어 = Korean, 영어 = English, 수학 = math
score_dict = {"국어":[15,25,35], "영어":[45,55,65], "수학":[75,85,95]}
student_df = pd.DataFrame(score_dict)
student_df

Unnamed: 0,국어,영어,수학
0,15,45,75
1,25,55,85
2,35,65,95


><b>Creating with an explicit index</b>

In [29]:
person_df = pd.DataFrame(friend_dict_list, index=["f1","f2","f3"])
person_df

Unnamed: 0,name,age,job
f1,John,25,student
f2,Nate,34,teacher
f3,Jenny,30,developer


In [30]:
# Index labels: 1월, 2월, 3월 = January, February, March
month_df = pd.DataFrame(score_dict, index=["1월", "2월", "3월"])
month_df

Unnamed: 0,국어,영어,수학
1월,15,45,75
2월,25,55,85
3월,35,65,95
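
One reason to supply an index at creation time is that rows can then be looked up by label. The sketch below rebuilds the same table under the assumed name `scores` and uses `.loc` to read one row and one cell. The `index` list needs one label per row.

```python
import pandas as pd

scores = {"국어": [15, 25, 35], "영어": [45, 55, 65], "수학": [75, 85, 95]}
month_df = pd.DataFrame(scores, index=["1월", "2월", "3월"])

row = month_df.loc["2월"]            # one row, returned as a Series
value = month_df.loc["2월", "영어"]   # a single cell: row label, column label

print(row)
print(value)
```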


#### 2) From a two-dimensional list

In [21]:
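# Named rows so the built-in list is not shadowed; each inner list is one row
rows = [
    [1,2,3,4,5],
    [6,7,8,9,10]
]
# 내용1, 내용2 = "content 1", "content 2"
two_df = pd.DataFrame(rows, index=["내용1", "내용2"], columns = ["c1","c2","c3","c4","c5"])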

two_df

Unnamed: 0,c1,c2,c3,c4,c5
내용1,1,2,3,4,5
내용2,6,7,8,9,10
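
With a two-dimensional list, each inner list is read as one row. If the data is actually organized the other way around, the result can be transposed with `.T`. This is a small sketch using the same values as above; the name `flipped` is only illustrative.

```python
import pandas as pd

rows = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
two_df = pd.DataFrame(rows, index=["내용1", "내용2"],
                      columns=["c1", "c2", "c3", "c4", "c5"])

print(two_df.shape)   # each inner list is one row

flipped = two_df.T    # transpose: rows become columns and vice versa
print(flipped.shape)
print(flipped)
```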


#### 3) From a CSV file

In [33]:
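# By default the first row is used as the column names.
# header=None treats every row as data and numbers the columns from 0.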

df = pd.read_csv("./gdrive/MyDrive/Colab Notebooks/data/sample3.csv",
                 header=None)
df

Unnamed: 0,0,1,2
0,1,2,3
1,4,5,6
2,7,8,9
3,10,11,12
4,13,14,15


In [34]:
df = pd.read_csv("./gdrive/MyDrive/Colab Notebooks/data/sample3.csv",
                 header=None, names = ["c1", "c2", "c3"])
df

Unnamed: 0,c1,c2,c3
0,1,2,3
1,4,5,6
2,7,8,9
3,10,11,12
4,13,14,15
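
The three header behaviors can be compared side by side without a file on disk. The sketch below feeds a made-up, header-less CSV string to `read_csv` through `io.StringIO`; the string is only a stand-in for a file like `sample3.csv`, whose real contents are not reproduced here.

```python
import io
import pandas as pd

csv_text = "1,2,3\n4,5,6\n7,8,9\n"  # made-up stand-in for a header-less CSV

# Default: the first line is consumed as the header
default_df = pd.read_csv(io.StringIO(csv_text))

# header=None: every line is data, columns are numbered 0, 1, 2
no_header_df = pd.read_csv(io.StringIO(csv_text), header=None)

# header=None plus names: every line is data, columns get the given names
named_df = pd.read_csv(io.StringIO(csv_text), header=None,
                       names=["c1", "c2", "c3"])

print(list(default_df.columns), len(default_df))
print(list(no_header_df.columns), len(no_header_df))
print(list(named_df.columns), len(named_df))
```

With the default settings, a header-less file silently loses its first data row to the header, which is why `header=None` is used in the cells above.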


In [35]:
# encoding="cp949": use when Korean text comes out garbled (the file was saved in CP949 rather than UTF-8)

df = pd.read_csv("./gdrive/MyDrive/Colab Notebooks/data/sample1.csv",
                 encoding="cp949")
df

Unnamed: 0,번호,이름,가입일시,나이
0,1,김정수,2017-01-19 11:30:00,25
1,2,박민구,2017-02-07 10:22:00,35
2,3,정순미,2017-01-22 09:10:00,33
3,4,김정현,2017-02-22 14:09:00,45
4,5,홍미진,2017-04-01 18:00:00,17
5,6,김순철,2017-05-14 22:33:07,22
6,7,이동철,2017-03-01 23:44:45,27
7,8,박지숙,2017-01-11 06:04:18,30
8,9,김은미,2017-02-08 07:44:33,51
9,10,장혁철,2017-12-01 13:01:11,16
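
To illustrate why the `encoding` argument matters, the sketch below encodes a short, made-up CSV snippet as CP949 in memory, shows that those bytes are not valid UTF-8, and then reads them with `encoding="cp949"`. The snippet is a stand-in and not the actual contents of `sample1.csv`.

```python
import io
import pandas as pd

# Made-up CSV content, encoded as CP949 bytes
raw = "번호,이름\n1,김정수\n".encode("cp949")

# These bytes cannot be decoded as UTF-8
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Telling read_csv the right encoding recovers the Korean text
df = pd.read_csv(io.BytesIO(raw), encoding="cp949")

print(decodes_as_utf8)
print(df)
```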


In [37]:
# CSV fields are comma-separated by default; pass delimiter for other separators such as |
df = pd.read_csv("./gdrive/MyDrive/Colab Notebooks/data/sample2.csv",
                 encoding="cp949", delimiter="|")
df

Unnamed: 0,번호,이름,가입일시,나이
0,1,김정수,2017-01-19 11:30:00,25
1,2,박민구,2017-02-07 10:22:00,35
2,3,정순미,2017-01-22 09:10:00,33
3,4,김정현,2017-02-22 14:09:00,45
4,5,홍미진,2017-04-01 18:00:00,17
5,6,김순철,2017-05-14 22:33:07,22
6,7,이동철,2017-03-01 23:44:45,27
7,8,박지숙,2017-01-11 06:04:18,30
8,9,김은미,2017-02-08 07:44:33,51
9,10,장혁철,2017-12-01 13:01:11,16


In [38]:
df = pd.read_csv("./gdrive/MyDrive/Colab Notebooks/data/sample2.csv",
                 encoding="cp949", delimiter="|", index_col=0)
df

번호,이름,가입일시,나이
1,김정수,2017-01-19 11:30:00,25
2,박민구,2017-02-07 10:22:00,35
3,정순미,2017-01-22 09:10:00,33
4,김정현,2017-02-22 14:09:00,45
5,홍미진,2017-04-01 18:00:00,17
6,김순철,2017-05-14 22:33:07,22
7,이동철,2017-03-01 23:44:45,27
8,박지숙,2017-01-11 06:04:18,30
9,김은미,2017-02-08 07:44:33,51
10,장혁철,2017-12-01 13:01:11,16
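
The same two options can be tried on an in-memory string. In `read_csv`, `delimiter` is an alias for `sep`, and `index_col` accepts either a column position (as above) or a column name. The pipe-delimited text below is a made-up stand-in for `sample2.csv`.

```python
import io
import pandas as pd

# Made-up stand-in for a pipe-delimited file
text = "번호|이름|나이\n1|김정수|25\n2|박민구|35\n"

# sep="|" is equivalent to delimiter="|"; index_col is given by name here
df = pd.read_csv(io.StringIO(text), sep="|", index_col="번호")

print(df.index.name)
print(df.loc[2, "이름"])  # look up by the value in the 번호 column
```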


#### 4) From an Excel file

In [40]:
df = pd.read_excel("./gdrive/MyDrive/Colab Notebooks/data/train.xlsx")
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [41]:
df = pd.read_excel("./gdrive/MyDrive/Colab Notebooks/data/sample.xlsx")
df.head(3)

Unnamed: 0,Sap Co.,대리점,영업사원,전월,금월,TEAM,총 판매수량
0,KI1316,경기수원대리점,이기정,1720000,2952000,1,123
1,KI1451,충청홍성대리점,정미진,4080000,2706000,2,220
2,KI1534,경기화성대리점,경인선,600000,2214000,1,320


### Inspecting

In [42]:
friend_df.index

RangeIndex(start=0, stop=3, step=1)

In [44]:
friend_df.columns

Index(['name', 'age', 'job'], dtype='object')

In [45]:
friend_df.values

array([['John', 25, 'student'],
       ['Nate', 34, 'teacher'],
       ['Jenny', 30, 'developer']], dtype=object)

In [46]:
# Check the data type of each column

friend_df.dtypes

name    object
age      int64
job     object
dtype: object
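
A few related attributes and methods are often useful alongside `index`, `columns`, `values`, and `dtypes`. The sketch below rebuilds the friend table from a dictionary of lists so that it runs on its own. `to_numpy()` is the method the pandas documentation recommends over `.values`, and `select_dtypes` filters columns by type.

```python
import pandas as pd

friend_df = pd.DataFrame({
    "name": ["John", "Nate", "Jenny"],
    "age": [25, 34, 30],
    "job": ["student", "teacher", "developer"],
})

print(friend_df.shape)                 # (rows, columns)

ages = friend_df["age"].to_numpy()     # NumPy array of a single column
print(ages.dtype, ages.mean())

# Names of the columns that hold numeric data
numeric_cols = friend_df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```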

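### Deleting
* Deletions (and other operations that return a new DataFrame) do not change the original DataFrame unless the result is applied back to it.
* To change the DataFrame directly, pass inplace=True. To leave df untouched, store the result in a new variable instead.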

In [48]:
# drop()

friend_df.drop([0,2])

Unnamed: 0,name,age,job
1,Nate,34,teacher


In [49]:
friend_df

Unnamed: 0,name,age,job
0,John,25,student
1,Nate,34,teacher
2,Jenny,30,developer


In [50]:
friend_df.drop([0,2], inplace=True) # same effect as friend_df = friend_df.drop([0,2])

In [51]:
friend_df

Unnamed: 0,name,age,job
1,Nate,34,teacher


In [52]:
friend_df = pd.DataFrame(friend_dict_list)
friend_df

Unnamed: 0,name,age,job
0,John,25,student
1,Nate,34,teacher
2,Jenny,30,developer


In [53]:
# drop(labels, axis=0): axis defaults to 0, which drops rows by index label

friend_df.drop("age", axis=1) # axis=0 drops rows, axis=1 drops columns

Unnamed: 0,name,job
0,John,student
1,Nate,teacher
2,Jenny,developer
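
As an alternative to remembering which axis number is which, `drop` also accepts the `index=` and `columns=` keywords. The sketch below rebuilds the friend table so that it runs on its own, and also confirms that the original DataFrame is left unchanged.

```python
import pandas as pd

friend_df = pd.DataFrame({
    "name": ["John", "Nate", "Jenny"],
    "age": [25, 34, 30],
    "job": ["student", "teacher", "developer"],
})

no_age = friend_df.drop(columns="age")     # same as drop("age", axis=1)
only_nate = friend_df.drop(index=[0, 2])   # same as drop([0, 2], axis=0)

print(no_age)
print(only_nate)
print(friend_df.shape)  # the original still has every row and column
```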


### Modifying

#### 1) Renaming columns

In [54]:
friend_df

Unnamed: 0,name,age,job
0,John,25,student
1,Nate,34,teacher
2,Jenny,30,developer


In [60]:
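# Note: the p1..p3 index in the output comes from the index change in 2) below, which was executed first (In [58]).
friend_df.columns = ["이름", "나이", "직업"]  # name, age, job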
friend_df

Unnamed: 0,이름,나이,직업
p1,John,25,student
p2,Nate,34,teacher
p3,Jenny,30,developer


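#### 2) Modifying the index

In [58]:
# Note: this cell was executed before the column rename in 1) (In [60]), so its output still shows the original column names.
friend_df.index = ["p1", "p2", "p3"]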
friend_df

Unnamed: 0,name,age,job
p1,John,25,student
p2,Nate,34,teacher
p3,Jenny,30,developer
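
Assigning to `.columns` or `.index` replaces every label at once, so the new list must cover all of them. When only some labels need to change, `rename` with a mapping is a possible alternative. It returns a new DataFrame and leaves the original as it was. This is a sketch on a rebuilt copy of the friend table; the name `renamed` is only illustrative.

```python
import pandas as pd

friend_df = pd.DataFrame({
    "name": ["John", "Nate", "Jenny"],
    "age": [25, 34, 30],
    "job": ["student", "teacher", "developer"],
})

# Only the labels listed in the mappings are changed
renamed = friend_df.rename(columns={"name": "이름"}, index={0: "p1"})

print(renamed)
print(list(friend_df.columns))  # the original labels are untouched
```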


#### 3) Adding a column

In [61]:
# Assigning to a new column name appends the column at the end
friend_df["주소"] = ["서울", "경기", "부산"]
friend_df

Unnamed: 0,이름,나이,직업,주소
p1,John,25,student,서울
p2,Nate,34,teacher,경기
p3,Jenny,30,developer,부산


In [70]:
# insert(loc, column, value): insert a column named `column` at position `loc` (0-based)

# friend_df.insert?
friend_df.insert(1, "전화번호",["010-1234-5678", "010-7890-4567","010-6389-7890"])
friend_df

Unnamed: 0,이름,전화번호,나이,직업,주소
p1,John,010-1234-5678,25,student,서울
p2,Nate,010-7890-4567,34,teacher,경기
p3,Jenny,010-6389-7890,30,developer,부산
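
One thing to watch for with `insert` is that it modifies the DataFrame in place and raises a `ValueError` if the column name already exists, so running the same cell a second time fails. The sketch below reproduces this on a shortened, hypothetical two-row table.

```python
import pandas as pd

df = pd.DataFrame({"이름": ["John", "Nate"], "나이": [25, 34]})

# The first insert succeeds and changes df in place
df.insert(1, "전화번호", ["010-1234-5678", "010-7890-4567"])

# Inserting the same column name again is rejected
try:
    df.insert(1, "전화번호", ["x", "y"])
    second_insert_failed = False
except ValueError as err:
    second_insert_failed = True
    print(err)

print(list(df.columns))
```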


#### 4) Adding a row

In [90]:
# loc[row label] = [values], one value per column
friend_df.loc["p4"] = ["Tom", "010-7891-4578", 28, "tester", "대전"]
friend_df

Unnamed: 0,이름,전화번호,나이,직업,주소
p1,John,010-1234-5678,25,student,서울
p2,Nate,010-7890-4567,34,teacher,경기
p3,Jenny,010-6389-7890,30,developer,부산
p4,Tom,010-7891-4578,28,tester,대전
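
Assigning through `.loc` adds one row at a time, and it overwrites the row if the label already exists. For adding several rows at once, one common option is `pd.concat`, which returns a new DataFrame (`DataFrame.append` was removed in pandas 2.0). The sketch below uses a shortened table, and the extra rows are hypothetical.

```python
import pandas as pd

friend_df = pd.DataFrame(
    {"이름": ["John", "Nate"], "나이": [25, 34]},
    index=["p1", "p2"],
)

# Hypothetical extra rows with the same columns
new_rows = pd.DataFrame(
    {"이름": ["Tom", "Amy"], "나이": [28, 31]},
    index=["p3", "p4"],
)

combined = pd.concat([friend_df, new_rows])  # stacks the rows vertically

print(combined)
print(len(friend_df))  # the original is unchanged
```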
