<a href="https://colab.research.google.com/github/jiseon0516/pdm19/blob/main/py-pandas/pandas_1_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Python module 3. **pandas**

# Using pandas

* [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html)
* [Pandas tutorial with interactive exercises](https://www.kaggle.com/pistak/pandas-tutorial-with-interactive-exercises)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline  # works in Jupyter Notebook or JupyterLab
# Renders plots inline, directly below the cell that produces them

## [1] Make data: Series, and DataFrame
> pandas data structures
- Series
- DataFrame

### Series
> One-dimensional data

In [2]:
# Creating a Series by passing a list of values
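s = pd.Series([1,3,5,np.nan,6,8,'you']) # integers, a special constant (np.nan), and a string
# Series: one-dimensional data; np.nan: a special "not a number" value that marks a missing entry
s
# The output shows the default range index on the left and the values on the right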

0      1
1      3
2      5
3    NaN
4      6
5      8
6    you
dtype: object
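The `dtype: object` in the output above comes from mixing numbers with the string `'you'`. As a minimal sketch (the names `mixed` and `numeric` are only for illustration), dropping the string lets pandas pick a numeric dtype, and `np.nan` forces that dtype to be floating point:

```python
import numpy as np
import pandas as pd

# Same values as above, with and without the string element
mixed = pd.Series([1, 3, 5, np.nan, 6, 8, 'you'])
numeric = pd.Series([1, 3, 5, np.nan, 6, 8])

print(mixed.dtype)    # object: the generic dtype for arbitrary Python values
print(numeric.dtype)  # float64: np.nan is a float, so the integers are upcast

# isna() flags missing entries; summing the booleans counts them
missing_count = int(numeric.isna().sum())
print(missing_count)
```

Keeping a column numeric matters in practice because arithmetic and aggregation on `object` columns are slower and can fail on the non-numeric entries.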

In [3]:
type(s)

pandas.core.series.Series

In [8]:
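# s.info()

# On pandas versions before 1.4 this raises
# AttributeError: 'Series' object has no attribute 'info'
# because info() was defined only for DataFrame; Series.info() was added in pandas 1.4.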
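Where `s.info()` is not available, much the same information can be collected from attributes and methods that a Series has always had. The sketch below is one way to do that; the `summary` dictionary and the column name `'value'` are illustrative choices, not part of the original notebook:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8, 'you'])

# Collect the facts info() would report, one attribute at a time
summary = {
    'dtype': s.dtype,               # element type of the Series
    'shape': s.shape,               # (number of elements,)
    'non_null': int(s.count()),     # count() skips missing values
    'nulls': int(s.isna().sum()),   # number of missing values
}
print(summary)

# Converting to a one-column DataFrame makes DataFrame.info() usable
s.to_frame(name='value').info()
```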

In [10]:
# indexing & slicing of series
s[0],s[:3],s[-1:] # s[:3] holds 1, 3, 5

# s[:3]
# index, value
# 0    1
# 1    3
# 2    5

(1, 0    1
 1    3
 2    5
 dtype: object, 6    you
 dtype: object)
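With the default integer index, `s[0]` looks up the label `0`, which happens to coincide with the first position. When the index is not a plain range, it is usually clearer to state the intent explicitly. A small sketch, using a hypothetical Series with string labels:

```python
import pandas as pd

# Hypothetical Series with string labels instead of the default range index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

first = s.iloc[0]          # by position: first element
by_label = s.loc['b']      # by label: element whose index is 'b'
head = s.iloc[:2]          # positional slice, end excluded
labelled = s.loc['a':'b']  # label slice, end included
last = s.iloc[-1:]         # a slice returns a Series, not a scalar

print(first, by_label)
print(head.tolist(), labelled.tolist(), last.tolist())
```

Note the asymmetry: positional slices exclude the end point like ordinary Python slices, while label slices with `.loc` include it.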

### What a one-dimensional Series is used for
- Each column of a two-dimensional DataFrame is a Series
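The relationship between the two structures can be checked directly. In this sketch, `df_demo` is a made-up frame used only to show that selecting one column returns a Series that shares the frame's index:

```python
import pandas as pd

# Made-up two-column frame for illustration
df_demo = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': ['x', 'y', 'z']})

col = df_demo['A']                   # select a single column
print(type(col))                     # a pandas Series
print(col.name)                      # the column label becomes the Series name
print(col.index.equals(df_demo.index))  # the row index is shared
```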

In [16]:
# Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
dates = pd.date_range('20210927', periods=6) # 6 consecutive days starting on September 27, 2021
dates

DatetimeIndex(['2021-09-27', '2021-09-28', '2021-09-29', '2021-09-30',
               '2021-10-01', '2021-10-02'],
              dtype='datetime64[ns]', freq='D')

In [17]:
type(dates) # DatetimeIndex: an index object that ties each row to a point in time

pandas.core.indexes.datetimes.DatetimeIndex
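`pd.date_range` accepts other ways of describing the same kind of range. The sketch below shows two variations; the variable names and the `'2D'` step are illustrative assumptions rather than anything the notebook uses:

```python
import pandas as pd

# Same six days, described by start and end instead of a period count
by_end = pd.date_range(start='2021-09-27', end='2021-10-02')

# Every second day, three entries, using the freq parameter
every_other = pd.date_range('20210927', periods=3, freq='2D')

print(by_end)
print(every_other.strftime('%Y-%m-%d').tolist())
```

Both `start`/`end` and `start`/`periods` fully determine the range; the default frequency is one calendar day.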

In [20]:
# Make dataframe using an array with random numbers
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD')) # without index=dates, the default range index (0-5) is used
# index: 6 dates, columns: 4 labels, randn: samples from the standard normal (Gaussian) distribution
df

                   A         B         C         D
2021-09-27 -0.902987 -0.574021 -1.667751  0.469512
2021-09-28  0.599148 -1.262151 -0.233235 -0.795725
2021-09-29  0.043196   0.70666  1.180832  0.058274
2021-09-30 -1.244955  0.966673 -1.358386 -0.751163
2021-10-01  0.543143   0.57912 -0.512113 -3.219162
2021-10-02  1.693869  0.780737   1.58184 -1.898911


In [23]:
# check types of df  --> same type
df.dtypes # dtypes is an attribute (no parentheses) that gives the data type of each column

A    float64
B    float64
C    float64
D    float64
dtype: object

In [24]:
type(df)

pandas.core.frame.DataFrame
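Because `np.random.randn` draws new numbers on every run, the values in `df` will differ from the ones printed above. If repeatable output is wanted, one option is a seeded generator. This is a sketch under that assumption; the seed `0` and the name `df_seeded` are arbitrary:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20210927', periods=6)

# A seeded generator produces the same samples on every run
rng = np.random.default_rng(0)
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)),
                         index=dates, columns=list('ABCD'))

# The date index allows row lookup by timestamp; the result is a Series
row = df_seeded.loc[pd.Timestamp('2021-09-29')]

print(df_seeded.shape)
print(row.index.tolist())
```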

### DataFrame
- Two-dimensional data (rows and columns)
- Can hold multivariate data, one variable per column

In [28]:
# Creating a DataFrame by passing a dict of objects that can be converted to series-like.
df2 = pd.DataFrame({ 'A' : 1., 
                    'B' : pd.Timestamp('20191129'), # Timestamp: a single point in time
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'), # four 1s stored as float32
                    'D' : np.array([3] * 4,dtype='int32'), # an array of four 3s
                    'E' : pd.Categorical(["test","train","test","train"]), # Categorical: values drawn from a fixed set of categories
                    'F' : 'foo' })
# C, D, and E each hold 4 values, so the scalars A, B, and F are broadcast to 4 rows

In [29]:
df2

     A          B    C  D      E    F
0  1.0 2019-11-29  1.0  3   test  foo
1  1.0 2019-11-29  1.0  3  train  foo
2  1.0 2019-11-29  1.0  3   test  foo
3  1.0 2019-11-29  1.0  3  train  foo
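The repeated values in columns A, B, and F come from scalar broadcasting in the dict constructor. A short sketch of the rule and its limit follows; the frames `ok` and the deliberately mismatched input are illustrative only:

```python
import numpy as np
import pandas as pd

# Scalars are repeated to match the length of the array-like columns
ok = pd.DataFrame({'A': 1.0,
                   'D': np.array([3] * 4, dtype='int32'),
                   'F': 'foo'})
print(ok.shape)

# Array-like columns of different lengths are not broadcast; pandas raises
raised = False
try:
    pd.DataFrame({'X': [1, 2, 3], 'Y': [1, 2]})
except ValueError as exc:
    raised = True
    print('ValueError:', exc)
```

Only scalars are stretched; every list, array, or unindexed sequence must already have the same length.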


In [30]:
# check types of df2 --> different types
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
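Per-column dtypes can also be used to filter or convert columns. The sketch below rebuilds a smaller frame with some of the same column types (named `df_types` here to keep it separate from `df2`) and uses `select_dtypes` and `astype`:

```python
import numpy as np
import pandas as pd

# A reduced version of the frame above, kept self-contained
df_types = pd.DataFrame({
    'A': 1.0,
    'C': pd.Series(1, index=range(4), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(['test', 'train', 'test', 'train']),
})

# Keep only the numeric columns (floats and ints of any width)
numeric_cols = df_types.select_dtypes(include='number').columns.tolist()

# Convert one column to a wider integer type; returns a new frame
widened = df_types.astype({'D': 'int64'})

# A categorical column stores its distinct categories separately
categories = df_types['E'].cat.categories.tolist()

print(numeric_cols)
print(widened.dtypes['D'])
print(categories)
```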

In [31]:
type(df2)

pandas.core.frame.DataFrame



---

