
## Series

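A Series is a one-dimensional array-like data structure. It can hold integers, floats, strings, Python objects (for example, a list of strings or a dict), NumPy ndarrays, scalars, and so on. Although the data is one-dimensional, a printed Series looks two-dimensional, because one column is the index (also called the label) and the other is the actual data.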

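A Series is similar to a Python list, except that the programmer can assign a custom index label to each element. A Series object is created with pandas.Series(), using the following syntax: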
```
pandas.Series(data=None, index=None, dtype=None, name=None, ... )
```

The examples in the following sections import pandas with this statement:
```
import pandas as pd
```

so pd.Series() can be used in place of pandas.Series() above.

In [9]:
import pandas as pd

s1 = pd.Series([11, 22, 33, 44, 55])
print(s1)
s1[1] = 20
print(f"After modification s1={s1}")

0    11
1    22
2    33
3    44
4    55
dtype: int64
After modification s1=0    11
1    20
2    33
3    44
4    55
dtype: int64


In [10]:
import pandas as pd

s = pd.Series([30, 50, 40], index=['Orange', 'Apple', 'Grape'])
print(f"{s.values}")
print(f"{s.index}")

[30 50 40]
Index(['Orange', 'Apple', 'Grape'], dtype='object')


In [8]:
import pandas as pd

s6 = pd.Series(9, index=[1, 2, 3])
print(f"{s6}")

1    9
2    9
3    9
dtype: int64


### Creating a Series from a Python dict

When a Series is created from a Python dict, the dict keys become the index of the Series and the dict values become the values of the Series.


In [4]:
import pandas as pd

mydict = {'北京':'Beijing', '東京':'Tokyo'}
s2 = pd.Series(mydict)
print(f"{s2}")

北京    Beijing
東京      Tokyo
dtype: object
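
As a small extension of the dict example, the sketch below passes an explicit `index` together with a dict. In that case pandas looks up each label among the dict keys and fills labels that are not present with NaN. The price data and the name `wanted` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: fruit prices keyed by name
prices = {'Orange': 30, 'Apple': 50, 'Grape': 40}

# Labels to pull from the dict; 'Mango' is not a key in prices
wanted = ['Apple', 'Mango', 'Orange']
s = pd.Series(prices, index=wanted)
print(s)  # 'Mango' shows up as NaN, so the dtype becomes float64
```

The order of the result follows `wanted`, not the order of the dict.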


### Creating a Series from a NumPy ndarray

In [5]:
import pandas as pd
import numpy as np

s3 = pd.Series(np.arange(0, 7, 2))
print(f"{s3}")

0    0
1    2
2    4
3    6
dtype: int32
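
The `dtype: int32` above most likely reflects NumPy's default integer type on the machine that produced this output; on many other platforms the same code prints `int64`. A minimal sketch, assuming you want the same dtype everywhere, is to request it explicitly. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Ask NumPy for a specific integer width instead of the platform default
arr = np.arange(0, 7, 2, dtype=np.int64)
s = pd.Series(arr)
print(s.dtype)

# The dtype can also be set on the Series itself
s_float = pd.Series(np.arange(0, 7, 2), dtype='float64')
print(s_float.dtype)
```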


### Custom index

In [6]:
import pandas as pd

myindex = [3, 5, 7]
price = [100, 200, 300]
s4 = pd.Series(price, index=myindex)
print(f"{s4}")

3    100
5    200
7    300
dtype: int64


In [18]:
fruits = ['Orange', 'Apple', 'Grape']
price = [30, 50, 40]
s5 = pd.Series(price, index=fruits)
print(f"{s5}")
print(f"{s5['Orange']}")


Orange    30
Apple     50
Grape     40
dtype: int64
30
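
With a custom index it helps to be explicit about whether a lookup is by label or by position. The sketch below uses the `.loc` (label) and `.iloc` (position) accessors on the same fruit data; the variable names `by_label`, `by_position`, and `subset` are illustrative.

```python
import pandas as pd

s5 = pd.Series([30, 50, 40], index=['Orange', 'Apple', 'Grape'])

by_label = s5.loc['Apple']             # label-based lookup
by_position = s5.iloc[0]               # position-based lookup (first element)
subset = s5.loc[['Grape', 'Orange']]   # several labels at once, in the order asked

print(by_label, by_position)
print(subset)
```

This distinction matters most for integer labels such as `[3, 5, 7]`, where a bare `s4[3]` is a label lookup rather than "the fourth element".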


### Series operations

In [17]:
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])
# Slicing
# print(f"s[2:4] = \n{s[2:4]}")
# print(f"s[:3] = \n{s[:3]}")
# print(f"s[2:] = \n{s[2:]}")
# print(f"s[-1:] = \n{s[-1:]}")

# Adding objects
x = pd.Series([1, 2])
y = pd.Series([3, 4])
print("Addition")
print(f"{x + y}")

print("Multiplication")
print(f"{x * y}")

print("Comparison")
x = pd.Series([1, 5, 9])
y = pd.Series([2, 4, 8])
print(f"{x > y}")

Addition
0    4
1    6
dtype: int64
Multiplication
0    3
1    8
dtype: int64
Comparison
0    False
1     True
2     True
dtype: bool
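
The operations above pair elements that share the same index label, which happens to coincide with position when both Series use the default index. The sketch below, with made-up labels, shows what happens when the indexes only partly overlap: labels present in only one operand produce NaN.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = x + y   # aligned on labels, not on position
print(total)    # 'a' and 'd' have no partner, so they become NaN
```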


## DataFrame 
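A DataFrame is a two-dimensional array-like data structure; conceptually it resembles an Excel worksheet. It can hold integers, floats, strings, Python objects (for example, a list of strings or a dict), NumPy ndarrays, scalars, and so on.

A DataFrame object is created with DataFrame(), using the following syntax:
```
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
```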



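Series objects can be combined into a two-dimensional DataFrame with pandas.concat([Series1, Series2, ... ], axis=0). The default axis is 0.

- axis=0 stacks the Series vertically, one after another, and the result is still a Series
- axis=1 places the Series side by side as columns, and the result is a DataFrame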


In [19]:
import pandas as pd
years = range(2020, 2023)
beijing = pd.Series([20, 21, 19], index = years)
hongkong = pd.Series([25, 26, 27], index = years)
singapore = pd.Series([30, 29, 31], index = years)
citydf = pd.concat([beijing, hongkong, singapore])  # default axis=0
print(type(citydf))
print(citydf)

<class 'pandas.core.series.Series'>
2020    20
2021    21
2022    19
2020    25
2021    26
2022    27
2020    30
2021    29
2022    31
dtype: int64


In [20]:
import pandas as pd
years = range(2020, 2023)
beijing = pd.Series([20, 21, 19], index = years)
hongkong = pd.Series([25, 26, 27], index = years)
singapore = pd.Series([30, 29, 31], index = years)
citydf = pd.concat([beijing, hongkong, singapore], axis=1)  # axis=1: combine as columns
print(type(citydf))
print(citydf)

<class 'pandas.core.frame.DataFrame'>
       0   1   2
2020  20  25  30
2021  21  26  29
2022  19  27  31


### The columns attribute (vertical)

In [21]:
import pandas as pd
years = range(2020, 2023)
beijing = pd.Series([20, 21, 19], index = years)
hongkong = pd.Series([25, 26, 27], index = years)
singapore = pd.Series([30, 29, 31], index = years)
citydf = pd.concat([beijing,hongkong,singapore],axis=1)  # axis=1
cities = ["Beijing", "HongKong", "Singapore"]
citydf.columns = cities
print(citydf)

      Beijing  HongKong  Singapore
2020       20        25         30
2021       21        26         29
2022       19        27         31


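A Series object has a name attribute. It can be set inside Series() when the object is created, or assigned after the object has been created. When name is set, it is shown when the Series is printed.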

In [22]:
import pandas as pd

beijing = pd.Series([20, 21, 19], name='Beijing')
print(beijing)

0    20
1    21
2    19
Name: Beijing, dtype: int64


In [23]:
import pandas as pd
years = range(2020, 2023)
beijing = pd.Series([20, 21, 19], index = years)
hongkong = pd.Series([25, 26, 27], index = years)
singapore = pd.Series([30, 29, 31], index = years)
beijing.name = "Beijing"
hongkong.name = "HongKong"
singapore.name = "Singapore"
citydf = pd.concat([beijing, hongkong, singapore],axis=1)  
print(citydf)

      Beijing  HongKong  Singapore
2020       20        25         30
2021       21        26         29
2022       19        27         31
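
Besides assigning `citydf.columns` or naming each Series, a third option is the `keys` argument of `pd.concat`. The sketch below reuses the same city data and is only one possible way to label the columns.

```python
import pandas as pd

years = range(2020, 2023)
beijing = pd.Series([20, 21, 19], index=years)
hongkong = pd.Series([25, 26, 27], index=years)
singapore = pd.Series([30, 29, 31], index=years)

# With axis=1, each entry in keys becomes the label of the matching column
citydf = pd.concat([beijing, hongkong, singapore], axis=1,
                   keys=["Beijing", "HongKong", "Singapore"])
print(citydf)
```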


### Creating a DataFrame from a dict

In [24]:
import pandas as pd
cities = {'country':['China','Japan','Singapore'],
          'town':['Beijing','Tokyo','Singapore'],
          'population':[2000, 1600, 600]}
citydf = pd.DataFrame(cities)
print(citydf)


     country       town  population
0      China    Beijing        2000
1      Japan      Tokyo        1600
2  Singapore  Singapore         600


### Changing the index

In [25]:
import pandas as pd
cities = {'country':['China','Japan','Singapore'],
          'town':['Beijing','Tokyo','Singapore'],
          'population':[2000, 1600, 600]}
rowindex = ['first', 'second', 'third']
citydf = pd.DataFrame(cities, index=rowindex)
print(citydf)

          country       town  population
first       China    Beijing        2000
second      Japan      Tokyo        1600
third   Singapore  Singapore         600


In [26]:
import pandas as pd
cities = {'country':['China', 'Japan', 'Singapore'],
          'town':['Beijing','Tokyo','Singapore'],
          'population':[2000, 1600, 600]}
citydf = pd.DataFrame(cities,columns=["town","population"],
                      index=cities["country"])
print(citydf)

                town  population
China        Beijing        2000
Japan          Tokyo        1600
Singapore  Singapore         600
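
The cell above builds the row index by passing `cities["country"]` to `index=`. An alternative sketch, using the same data, is to build the full DataFrame first and then promote a column to the index with `set_index()`. Unlike the cell above, this also records the column name as the name of the index.

```python
import pandas as pd

cities = {'country': ['China', 'Japan', 'Singapore'],
          'town': ['Beijing', 'Tokyo', 'Singapore'],
          'population': [2000, 1600, 600]}

# set_index returns a new DataFrame; 'country' moves from a column to the index
citydf = pd.DataFrame(cities).set_index('country')
print(citydf)
```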


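## Basic data analysis and processing with Pandas

Once Series and DataFrame objects have been created, the next step is data analysis and processing. Pandas provides many functions and methods for this purpose.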

In [28]:
import pandas as pd
cities = {'Country':['Taiwan','Taiwan','Thailand','Japan','Singapore'],
          'Town':['Taipei','Tainan','Bangkok','Tokyo','Singapore'],
          'Population':[2000, 2300, 900, 1600, 600]}
df = pd.DataFrame(cities, columns=["Town","Population"],
                  index=cities["Country"])
print(df)


                Town  Population
Taiwan        Taipei        2000
Taiwan        Tainan        2300
Thailand     Bangkok         900
Japan          Tokyo        1600
Singapore  Singapore         600


### Selecting values by index

In [37]:
import pandas as pd
cities = {'Country':['Taiwan','Taiwan','Thailand','Japan','Singapore'],
          'Town':['Taipei','Tainan','Bangkok','Tokyo','Singapore'],
          'Population':[2000, 2300, 900, 1600, 600]}
df = pd.DataFrame(cities, columns=["Town","Population"],
                  index=cities["Country"])
print(df["Town"])
print()
print(df["Town"]["Thailand"])
print()
print(df["Population"]>1000)

Taiwan          Taipei
Taiwan          Tainan
Thailand       Bangkok
Japan            Tokyo
Singapore    Singapore
Name: Town, dtype: object

Bangkok

Taiwan        True
Taiwan        True
Thailand     False
Japan         True
Singapore    False
Name: Population, dtype: bool
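
The boolean Series printed last can be used directly as a row filter, and `.loc` can select a row label and a column in one step instead of chaining two `[]` lookups. The sketch below reuses the same data; the names `big`, `towns`, and `tokyo_pop` are illustrative.

```python
import pandas as pd

cities = {'Country': ['Taiwan', 'Taiwan', 'Thailand', 'Japan', 'Singapore'],
          'Town': ['Taipei', 'Tainan', 'Bangkok', 'Tokyo', 'Singapore'],
          'Population': [2000, 2300, 900, 1600, 600]}
df = pd.DataFrame(cities, columns=["Town", "Population"],
                  index=cities["Country"])

big = df[df["Population"] > 1000]          # keep only rows where the mask is True
tokyo_pop = df.loc["Japan", "Population"]  # single row label + column -> scalar
towns = df.loc["Taiwan", "Town"]           # duplicated label -> Series with both rows

print(big)
print(tokyo_pop)
print(towns)
```

Because the label `Taiwan` appears twice in the index, a lookup on it returns both rows rather than a single value.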


### Arithmetic methods

The following arithmetic methods are available in Pandas.
```
add(): addition.
sub(): subtraction.
mul(): multiplication.
div(): division.
```

In [41]:
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
x = s1.add(s2)
print(x)

y = s1.sub(s2)
print(y)

data1 = [{'a':10, 'b':20}, {'a':30, 'b':40}]
df1 = pd.DataFrame(data1)
data2 = [{'a':1, 'b':2}, {'a':3, 'b':4}]
df2 = pd.DataFrame(data2)                            
print(df1.mul(df2))
print(df1.div(df2))


0    5
1    7
2    9
dtype: int64
0   -3
1   -3
2   -3
dtype: int64
    a    b
0  10   40
1  90  160
      a     b
0  10.0  10.0
1  10.0  10.0
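
One reason to call these methods rather than the `+`, `-`, `*`, `/` operators is the optional `fill_value` argument, which stands in for an element missing from one of the two operands. The sketch below uses two Series of different lengths, chosen only for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5])               # one element shorter than s1

plain = s1 + s2                      # index 2 has no partner in s2 -> NaN
filled = s1.add(s2, fill_value=0)    # the missing partner is treated as 0

print(plain)
print(filled)
```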


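### Comparison methods

The following comparison methods are available in Pandas.

- gt(), lt(): greater than, less than.
- ge(), le(): greater than or equal to, less than or equal to.
- eq(), ne(): equal to, not equal to.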

In [42]:
import pandas as pd

s1 = pd.Series([1, 5, 9])
s2 = pd.Series([2, 4, 8])
x = s1.gt(s2)
print(x)

y = s1.eq(s2)
print(y)

0    False
1     True
2     True
dtype: bool
0    False
1    False
2    False
dtype: bool
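
The cell above only exercises `gt()` and `eq()`. The sketch below covers the remaining methods in the list and shows that the resulting boolean Series can serve as a filter; the sample values are made up so that one pair is equal.

```python
import pandas as pd

s1 = pd.Series([1, 5, 9])
s2 = pd.Series([2, 5, 8])

less = s1.lt(s2)            # element-wise s1 < s2
at_least = s1.ge(s2)        # element-wise s1 >= s2
at_most = s1.le(s2)         # element-wise s1 <= s2
differs = s1.ne(s2)         # element-wise s1 != s2

# A boolean Series selects the elements where it is True
kept = s1[at_least]
print(kept)
```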


### Using NumPy with pandas

In [43]:
import pandas as pd
import numpy as np

s = pd.Series([1, 2, 3])
x = np.square(s)
print(x)


0    1
1    4
2    9
dtype: int64


In [44]:
import pandas as pd
import numpy as np
name = ['Frank', 'Peter', 'John']
score = ['first', 'second', 'final']
df = pd.DataFrame(np.random.randint(60,100,size=(3,3)),
                  columns=name,
                  index=score)
print(df)

        Frank  Peter  John
first      80     96    98
second     65     81    88
final      75     93    84
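
The scores above come from an unseeded random generator, so they change on every run. A minimal sketch for reproducible output, assuming NumPy's `default_rng` interface is available, is shown below. The helper name `make_scores` and the seed value are hypothetical.

```python
import numpy as np
import pandas as pd

def make_scores(seed):
    """Build a 3x3 table of random integer scores in [60, 100) from a fixed seed."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.integers(60, 100, size=(3, 3)),
                        columns=['Frank', 'Peter', 'John'],
                        index=['first', 'second', 'final'])

df = make_scores(42)
print(df)
print(df.mean())   # average score per student (one value per column)
```

Calling `make_scores(42)` again yields the same table, which makes the example easier to check.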



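## Operations involving NaN

In large-scale data collection, readings for some points in time are often missing because of an oversight by the person collecting them; such gaps can be represented by NaN. The earlier arithmetic examples did not include NaN values. In element-wise arithmetic, any operation that involves NaN also produces NaN.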

In [45]:
import pandas as pd
import numpy as np

s1 = pd.Series([1, np.nan, 5])
s2 = pd.Series([np.nan, 6, 8])
x = s1.add(s2)
print(x)


0     NaN
1     NaN
2    13.0
dtype: float64
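
The NaN-in, NaN-out rule applies to element-wise arithmetic such as `add()`. Reductions behave differently: by default, methods like `sum()` and `mean()` skip NaN. The sketch below contrasts the two behaviors using the `skipna` argument; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 5])

total = s.sum()                     # NaN is skipped by default
strict_total = s.sum(skipna=False)  # NaN propagates when skipping is disabled
average = s.mean()                  # mean of the two non-NaN values
valid = s.count()                   # number of non-NaN entries

print(total, strict_total, average, valid)
```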


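### Handling NaN

- dropna(): removes NaN entries and returns a new Series or DataFrame object.
- fillna(value): replaces NaN with the given value and returns a new Series or DataFrame object.
- isna(): tests for NaN; returns True where the value is NaN and False otherwise.
- notna(): tests for non-NaN; returns False where the value is NaN and True otherwise.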

In [46]:
# Using isna() and notna()
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3],[4, np.nan, 6],[7, 8, np.nan]])
print(df)
print("-"*70)
x = df.isna()
print(x)
print("-"*70)
y = df.notna()
print(y)

   0    1    2
0  1  2.0  3.0
1  4  NaN  6.0
2  7  8.0  NaN
----------------------------------------------------------------------
       0      1      2
0  False  False  False
1  False   True  False
2  False  False   True
----------------------------------------------------------------------
      0      1      2
0  True   True   True
1  True  False   True
2  True   True  False


In [48]:
# Fill NaN with 0
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3],[4, np.nan, 6],[7, 8, np.nan]])
z = df.fillna(0)
print(z)


   0    1    2
0  1  2.0  3.0
1  4  0.0  6.0
2  7  8.0  0.0
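
Filling with 0 is not always appropriate, since it can distort later averages. As one possible alternative, the sketch below passes the column means to `fillna()`, so each NaN is replaced by the mean of its own column. Whether this suits a given dataset is an assumption to check case by case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, np.nan, 6], [7, 8, np.nan]])

col_means = df.mean()        # per-column mean, NaN skipped by default
z = df.fillna(col_means)     # each NaN takes the mean of its column
print(z)
```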


In [57]:
# Drop rows that contain NaN
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3],[4, np.nan, 6],[7, 8, np.nan]])
x = df.dropna(axis=0)
print(x)


   0    1    2
0  1  2.0  3.0


In [56]:
# Drop columns that contain NaN
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 2, 3],[4, np.nan, 6],[7, 8, np.nan]])
x = df.dropna(axis=1)
print(x)

   0
0  1
1  4
2  7
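
Dropping every row or column that contains any NaN can discard a lot of data, as the two results above show. `dropna()` accepts further arguments that narrow what gets removed; the sketch below tries `subset`, `how`, and `thresh` on the same DataFrame. The result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, np.nan, 6], [7, 8, np.nan]])

by_column = df.dropna(subset=[1])   # drop rows only if column 1 is NaN
all_nan = df.dropna(how='all')      # drop rows only if every value is NaN
complete = df.dropna(thresh=3)      # keep rows with at least 3 non-NaN values

print(by_column)
print(all_nan)
print(complete)
```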
