# Linear Discriminant Analysis

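## Experiment
1. Use linear discriminant analysis (LDA) to solve the spam email classification problem and the Dota 2 match outcome prediction problem.
2. Compute the accuracy, precision, recall, and F1 score under 10-fold cross-validation.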

## Evaluation metrics
1. Accuracy
2. Precision
3. Recall
4. F1
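
For reference, the four metrics can be written in terms of confusion-matrix counts. The snippet below is a minimal sketch on a made-up label vector (`y_true` and `y_pred` are illustrative and not taken from either dataset); it computes each metric by hand and compares it with the corresponding scikit-learn function.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels, invented for illustration only (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The hand-computed values should agree with scikit-learn.
print(np.isclose(accuracy, accuracy_score(y_true, y_pred)))
print(np.isclose(precision, precision_score(y_true, y_pred)))
print(np.isclose(recall, recall_score(y_true, y_pred)))
print(np.isclose(f1, f1_score(y_true, y_pred)))
```

Note that accuracy needs the true labels: the mean of `y_pred` alone only says how often the positive class was predicted.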

# 1. Load the data

In [1]:
import numpy as np

In [2]:
spambase = np.loadtxt('data/spambase/spambase.data', delimiter = ",")
dota2results = np.loadtxt('data/dota2Dataset/dota2Train.csv', delimiter=',')

# 2. Import the model

In [3]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from matplotlib.font_manager import FontProperties

# 3. Extract the data

Here `spamx` and `dota2x` contain all the features of their respective datasets.

In [4]:
spamx = spambase[:, :57]
spamy = spambase[:, 57]

dota2x = dota2results[:, 1:]
dota2y = dota2results[:, 0]
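
The two slicing patterns in the cell above differ only in where the label column sits: last for spambase, first for dota2. A small sketch on hypothetical arrays (the values and the names `label_last` and `label_first` are invented for illustration) shows the same idea without needing the data files:

```python
import numpy as np

# Hypothetical table with the label in the LAST column (spambase-style layout).
label_last = np.array([
    [0.1, 0.2, 0.3, 1.0],
    [0.4, 0.5, 0.6, 0.0],
    [0.7, 0.8, 0.9, 1.0],
])
X_a = label_last[:, :-1]  # every column except the last one
y_a = label_last[:, -1]   # the last column

# Hypothetical table with the label in the FIRST column (dota2-style layout).
label_first = np.array([
    [1.0, 0.1, 0.2, 0.3],
    [-1.0, 0.4, 0.5, 0.6],
    [1.0, 0.7, 0.8, 0.9],
])
X_b = label_first[:, 1:]  # every column except the first one
y_b = label_first[:, 0]   # the first column

print(X_a.shape, y_a.shape, X_b.shape, y_b.shape)
```

Checking the shapes after slicing is a cheap way to confirm that no feature column was dropped and that the label did not leak into the feature matrix.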

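# 4. Training

Train both models on all features, run the predictions, and store the predicted results.

**Note: on the dota2 dataset, the linear discriminant analysis model emits warnings during training; they do not affect the run.**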

In [5]:
# YOUR CODE HERE

# Each email is predicted by a model trained on the other nine folds.
model = LinearDiscriminantAnalysis()
prediction = cross_val_predict(model, spamx, spamy, cv=10)
prediction.shape




(4601L,)
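
The task asks for predictions from both models to be stored. One way to do that, sketched here on synthetic data from `make_classification` rather than the real files, is to keep the cross-validated predictions in a dictionary keyed by dataset name (the names `dataset_a`, `dataset_b`, and `predictions` are assumptions made for this example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict

# Synthetic stand-ins for the two datasets (NOT the real spambase/dota2 data).
X_a, y_a = make_classification(n_samples=300, n_features=10, n_redundant=0, random_state=0)
X_b, y_b = make_classification(n_samples=400, n_features=20, n_redundant=0, random_state=1)
datasets = {"dataset_a": (X_a, y_a), "dataset_b": (X_b, y_b)}

predictions = {}
for name, (X, y) in datasets.items():
    model = LinearDiscriminantAnalysis()
    # 10-fold cross-validation: every sample gets exactly one out-of-fold prediction.
    predictions[name] = cross_val_predict(model, X, y, cv=10)

for name, pred in predictions.items():
    print(name, pred.shape)
```

Keeping one entry per dataset avoids overwriting the first model's predictions when the second model is run.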

# 5. Computing the evaluation metrics

Compute the four metrics for both models.

In [6]:
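# YOUR CODE HERE

# Accuracy compares the cross-validated predictions with the true labels.
# (np.mean(prediction) would only give the mean of the predicted labels.)
print("accuracy:", accuracy_score(spamy, prediction))
# Precision
precisions = cross_val_score(model, spamx, spamy, cv=10, scoring='precision')
print('precision:', np.mean(precisions))
# Recall
recalls = cross_val_score(model, spamx, spamy, cv=10, scoring='recall')
print('recall:', np.mean(recalls))

f1s = cross_val_score(model, spamx, spamy, cv=10, scoring='f1')
print('F1:', np.mean(f1s))



('precision:', 0.91001837760157955)
('recall:', 0.7816495659037096)
('F1:', 0.84000288082271291)

The accuracy was not recorded for this run: the value originally printed on the accuracy line, 0.338839382742882, came from `np.mean(prediction)`, which is the mean of the predicted labels rather than the accuracy.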


In [7]:
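# YOUR CODE HERE
model = LinearDiscriminantAnalysis()
prediction = cross_val_predict(model, dota2x, dota2y, cv=10)

# Accuracy compares the cross-validated predictions with the true labels.
# (np.mean(prediction) would only give the mean of the predicted labels.)
print("accuracy:", accuracy_score(dota2y, prediction))
# Precision
precisions = cross_val_score(model, dota2x, dota2y, cv=10, scoring='precision')
print('precision:', np.mean(precisions))
# Recall
recalls = cross_val_score(model, dota2x, dota2y, cv=10, scoring='recall')
print('recall:', np.mean(recalls))

f1s = cross_val_score(model, dota2x, dota2y, cv=10, scoring='f1')
print('F1:', np.mean(f1s))




('precision:', 0.60671434328081852)
('recall:', 0.67645874357903235)
('F1:', 0.63967806187236342)

The accuracy was not recorded for this run: the value originally printed on the accuracy line, 0.17409606044252562, came from `np.mean(prediction)`, which is the mean of the predicted labels rather than the accuracy.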
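
An alternative to calling `cross_val_score` once per metric is to compute all four metrics from a single set of cross-validated predictions. The sketch below assumes a hypothetical helper named `evaluate` and uses synthetic data from `make_classification` in place of the real datasets:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_predict


def evaluate(y_true, y_pred):
    """Return the four metrics for one set of predictions (hypothetical helper)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


# Synthetic binary problem standing in for spambase or dota2.
X, y = make_classification(n_samples=500, n_features=10, n_redundant=0, random_state=0)

# One pass of 10-fold cross-validation yields an out-of-fold prediction per sample.
y_pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=10)

scores = evaluate(y, y_pred)
for name, value in scores.items():
    print(name, round(value, 4))
```

Scores computed this way pool the predictions of all folds, so precision, recall, and F1 can differ slightly from the fold-averaged values that `cross_val_score` returns.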


###### Double-click here to fill in

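Dataset|Accuracy (%)|Precision (%)|Recall (%)|F1 (%)
-|-|-|-|-
spambase | n/a | 91.00 | 78.16 | 84.00
dota2Results | n/a | 60.67 | 67.65 | 63.97

The accuracy column is marked n/a because the values recorded in the runs above (33.88 for spambase and 17.41 for dota2Results) are means of the predicted labels, obtained with `np.mean(prediction)`, not accuracies.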

# 6. Optional: transform, select, or combine the features, then train models and compute the four metrics under 10-fold cross-validation

In [8]:
# YOUR CODE HERE




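###### Double-click here to fill in
1. Processing pipeline for model 1:
2. Processing pipeline for model 2:
3. Processing pipeline for model 3:

Model|Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-|-
Model 1 | Dataset | 0.0 | 0.0 | 0.0 | 0.0
Model 2 | Dataset | 0.0 | 0.0 | 0.0 | 0.0
Model 3 | Dataset | 0.0 | 0.0 | 0.0 | 0.0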
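
As a starting point for the optional task, one possible processing pipeline is a log transform, followed by univariate feature selection, followed by LDA. The sketch below is only an illustration under stated assumptions: the data is synthetic and made positive with `np.exp` so that `np.log1p` is well defined, and the choice of `k=5` selected features is arbitrary. `cross_validate` is used so that all four metrics come from the same 10 folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Synthetic, strictly positive, skewed features (NOT the real datasets).
X_raw, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, n_redundant=0, random_state=0
)
X = np.exp(X_raw)

pipeline = make_pipeline(
    FunctionTransformer(np.log1p),   # transformation: compress skewed positive values
    SelectKBest(f_classif, k=5),     # selection: keep the 5 highest-scoring features
    LinearDiscriminantAnalysis(),    # classifier
)

metrics = ["accuracy", "precision", "recall", "f1"]
# Selection is refit inside every training fold, so the test fold never leaks into it.
results = cross_validate(pipeline, X, y, cv=10, scoring=metrics)
summary = {metric: results["test_" + metric].mean() for metric in metrics}

for metric, value in summary.items():
    print(metric, round(value, 4))
```

Wrapping the preprocessing steps in the pipeline, instead of applying them to the full dataset beforehand, keeps the cross-validated scores honest. A nonlinear transform is used here because a plain per-feature rescaling would be expected to have little effect on LDA's predictions.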