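### Categorical Feature Handling

Categorical features take values from a fixed set of categories rather than numeric quantities, for example sex, color, or country. Most models require numeric input, so these features must be encoded first. Common techniques include:

Label encoding: maps each category to an integer.

One-hot encoding: converts each category into a binary vector with a single 1.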
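
The two encodings can be compared on a toy column before working with a real dataset. The following is a minimal sketch in plain Python; the `colors` list and the variable names are made up for illustration, and the sorted-category ordering is just one possible convention (it happens to match how scikit-learn's `LabelEncoder` orders its classes).

```python
# Hypothetical toy data, not taken from the Titanic dataset
colors = ["red", "green", "blue", "green"]

# Label encoding: assign each distinct category an integer (sorted order here)
categories = sorted(set(colors))
label_of = {category: index for index, category in enumerate(categories)}
labels = [label_of[color] for color in colors]

# One-hot encoding: one binary column per category, exactly one 1 per row
one_hot = [[1 if color == category else 0 for category in categories]
           for color in colors]

print(categories)  # ['blue', 'green', 'red']
print(labels)      # [2, 1, 0, 1]
print(one_hot)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Label encoding implies an order (blue < green < red) that the data does not have, which some models may treat as meaningful. One-hot encoding avoids this at the cost of one column per category.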

In [86]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

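# Load the data
df = pd.read_csv("dataset/titanic/train.csv")

# Select the categorical feature to encode
categorical_feature = "Sex"

# Label encoding: classes are sorted, so female -> 0 and male -> 1
le = LabelEncoder()
df["SexLabel"] = le.fit_transform(df[categorical_feature])

# Inspect the result
print(df.head())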

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  SexLabel  
0      0         A/5 21171   7.2500   NaN        S         1  
1      0          PC 17599  71.2833   C85        C         0  
2      0  STON/O2. 3101282   7.9250   NaN        S         0  
3      0            113803  53.1000  C123        S         0  
4    

In [9]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

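# Load the data
df = pd.read_csv("dataset/titanic.csv")

# Select the categorical features
categorical_features = ["Sex", "Embarked"]
# df["Embarked"].unique()
# One-hot encoding (returns a sparse matrix; missing values in
# "Embarked" are treated as their own category)
ohe = OneHotEncoder()
df_encoded = ohe.fit_transform(df[categorical_features])

# Inspect the result
print(ohe.get_feature_names_out())
print(df_encoded.toarray())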

['Sex_female' 'Sex_male' 'Embarked_C' 'Embarked_Q' 'Embarked_S'
 'Embarked_nan']
[[0. 1. 0. 0. 1. 0.]
 [1. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 1. 0.]
 ...
 [1. 0. 0. 0. 1. 0.]
 [0. 1. 1. 0. 0. 0.]
 [0. 1. 0. 1. 0. 0.]]


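## Text Encoding

### CountVectorizer
CountVectorizer converts a collection of texts into a term-count matrix. Each row represents a document, each column represents a term, and each entry is the number of times that term occurs in that document.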

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Text data
texts = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding',
    'Coding is fun and exciting',
    'I love coding but hate machine learning'
]

# Create the CountVectorizer object
vectorizer = CountVectorizer()

# Convert the texts into a term-count matrix (sparse)
X = vectorizer.fit_transform(texts)

# Get the term (feature) names
features = vectorizer.get_feature_names_out()

# Convert the result to a DataFrame for easier viewing
df = pd.DataFrame(X.toarray(), columns=features)

print(df)


   and  but  coding  exciting  fun  hate  is  learning  love  machine
0    0    0       0         0    0     0   0         1     1        1
1    0    0       0         0    1     0   1         1     0        1
2    0    0       1         0    0     0   0         0     1        0
3    1    0       1         1    1     0   1         0     0        0
4    0    1       1         0    0     1   0         1     1        1
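
The single-letter word "I" does not appear among the columns because CountVectorizer, by default, lowercases the text and keeps only tokens of two or more word characters. The following rough sketch reproduces the counting for the first sentence using only the standard library; the regular expression mirrors the default `token_pattern`, and `count_tokens` is a hypothetical helper, not part of scikit-learn.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern:
# runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_tokens(text):
    """Lowercase the text, tokenize it, and count each token."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))

counts = count_tokens("I love machine learning")
print(sorted(counts.items()))  # [('learning', 1), ('love', 1), ('machine', 1)]
```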


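### TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF weights each term by the product of its term frequency in a document and its inverse document frequency across the corpus, so terms that are frequent in one document but rare overall receive higher weights.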

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Suppose we have a DataFrame containing text data
df_text = pd.DataFrame({'text': ['I love machine learning', 'Machine learning is fun', 'I love coding']})


print(df_text)
# Encode the text with TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df_text['text'])

# Columns are numbered in the order of vectorizer.get_feature_names_out()
print(pd.DataFrame(tfidf_matrix.toarray()))


                      text
0  I love machine learning
1  Machine learning is fun
2            I love coding
          0         1         2         3         4         5
0  0.000000  0.000000  0.000000  0.577350  0.577350  0.577350
1  0.000000  0.562829  0.562829  0.428046  0.000000  0.428046
2  0.795961  0.000000  0.000000  0.000000  0.605349  0.000000
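
These values can be checked by hand. Assuming TfidfVectorizer's default settings (`smooth_idf=True`, `norm="l2"`, raw counts as term frequency), the inverse document frequency is ln((1 + n) / (1 + df)) + 1, and each row is then scaled to unit Euclidean length. The sketch below recomputes the last row ("I love coding"), where column 0 is "coding" and column 4 is "love" in the alphabetical vocabulary. The document frequencies are read off the three texts, and the helper name is hypothetical.

```python
import math

n_documents = 3
# Document frequencies in the three example texts (after tokenization)
document_frequency = {"coding": 1, "love": 2}

def smooth_idf(df, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1

# Each term occurs once in "I love coding", so tf = 1 and tf-idf = idf
raw = {term: smooth_idf(df, n_documents)
       for term, df in document_frequency.items()}

# L2-normalize the row so its Euclidean length is 1
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
tfidf = {term: value / norm for term, value in raw.items()}

print({term: round(value, 6) for term, value in tfidf.items()})
```

"coding" receives the higher weight because it occurs in only one of the three documents, while "love" occurs in two.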


### Word Embeddings
Word embeddings (such as Word2Vec and GloVe) map words to dense vectors that can capture semantic relationships between words.

In [2]:
from gensim.models import Word2Vec

# Suppose we have a list of tokenized sentences
sentences = [['I', 'love', 'machine', 'learning'], ['Machine', 'learning', 'is', 'fun'], ['I', 'love', 'coding']]

# Train the Word2Vec model
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)

# Get the vector for a word (tokens are case-sensitive)
word_vector = model.wv['machine']
print(word_vector)

[ 0.00855287  0.00015212 -0.01916856 -0.01933109 -0.01229639 -0.00025714
  0.00399483  0.01886394  0.0111687  -0.00858139  0.00055663  0.00992872
  0.01539662 -0.00228845  0.00864684 -0.01162876 -0.00160838  0.0162001
 -0.00472013 -0.01932691  0.01155852 -0.00785964 -0.00244575  0.01996103
 -0.0045127  -0.00951413 -0.01065877  0.01396178 -0.01141774  0.00422733
 -0.01051132  0.01224143  0.00871461  0.00521271 -0.00298217 -0.00549213
  0.01798587  0.01043155 -0.00432504 -0.01894062 -0.0148521  -0.00212748
 -0.00158989 -0.00512582  0.01936544 -0.00091704  0.01174752 -0.01489517
 -0.00501215 -0.01109973]
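
How related two embedding vectors are is commonly measured with cosine similarity. The following is a minimal sketch in plain Python; the three 3-dimensional vectors are invented for illustration and do not come from the model trained above. With a corpus as small as three sentences, the learned vectors are likely still close to their random initialization, so similarities computed from that model should not be read as meaningful.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors, invented for illustration only
vec_coding = [0.9, 0.1, 0.0]
vec_programming = [0.8, 0.2, 0.1]
vec_weather = [0.0, 0.1, 0.9]

# Vectors pointing in similar directions score close to 1
print(cosine_similarity(vec_coding, vec_programming))
# Nearly orthogonal vectors score close to 0
print(cosine_similarity(vec_coding, vec_weather))
```

For a trained gensim model, `model.wv.similarity('machine', 'learning')` returns the same measure for two words in the vocabulary.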


## Exercise
Try using CountVectorizer to extract features from the texts below.

In [5]:
texts = [
    'The quick brown fox jumps over the lazy dog',
    'Never jump over the lazy dog quickly',
    'Bright sun and warm weather make a perfect day',
    'The weather is bright and sunny today',
    'Dogs are loyal and friendly animals',
    'Foxes are quick and cunning creatures'
]

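## Homework
- Practice using three different methods to process movie synopses, and use the resulting features to predict each movie's rating.
- Write a brief conclusion based on the prediction results.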