# Logistic Regression

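## Experiment Tasks
1. Use logistic regression to solve the spam email classification problem and the Dota2 match outcome prediction problem.
2. Compute the accuracy, precision, recall, and F1 score under 10-fold cross-validation.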

## Evaluation Metrics
1. Accuracy
2. Precision
3. Recall
4. F1
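
The four metrics listed above can all be derived from the counts of true positives, false positives, and false negatives. The following is a minimal sketch on toy labels; the helper name `binary_metrics` and the toy arrays are illustrative assumptions and are not part of the assignment.

```python
import numpy as np

def binary_metrics(y_true, y_pred, pos_label=1):
    """Return (accuracy, precision, recall, f1) for a binary task.

    Hypothetical helper for illustration; ratios with a zero
    denominator are reported as 0.0.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == pos_label) & (y_pred == pos_label))  # true positives
    fp = np.sum((y_true != pos_label) & (y_pred == pos_label))  # false positives
    fn = np.sum((y_true == pos_label) & (y_pred != pos_label))  # false negatives
    accuracy = np.mean(y_true == y_pred)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return float(accuracy), float(precision), float(recall), float(f1)

# Toy labels (made up for this sketch): 3 TP, 1 FP, 1 FN, 3 TN
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
print(toy_scores)
```

With 3 true positives, 1 false positive, and 1 false negative out of 8 samples, all four values come out to 0.75 for this toy input.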

# 1. Load the Data

In [4]:
import numpy as np

In [5]:
spambase = np.loadtxt('data/spambase/spambase.data', delimiter = ",")
dota2results = np.loadtxt('data/dota2Dataset/dota2Train.csv', delimiter=',')

# 2. Import the Model

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

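# 3. Extract the Data

Here `spamx` and `dota2x` contain all the features of their respective datasets. In spambase the label is the last column (index 57); in the Dota2 data the label is the first column (index 0).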

In [12]:
spamx = spambase[:, :57]
spamy = spambase[:, 57]
print(spamx.shape,spamx)
print(spamy.shape,spamy)

dota2x = dota2results[:, 1:]
dota2y = dota2results[:, 0]

(4601, 57) [[0.000e+00 6.400e-01 6.400e-01 ... 3.756e+00 6.100e+01 2.780e+02]
 [2.100e-01 2.800e-01 5.000e-01 ... 5.114e+00 1.010e+02 1.028e+03]
 [6.000e-02 0.000e+00 7.100e-01 ... 9.821e+00 4.850e+02 2.259e+03]
 ...
 [3.000e-01 0.000e+00 3.000e-01 ... 1.404e+00 6.000e+00 1.180e+02]
 [9.600e-01 0.000e+00 0.000e+00 ... 1.147e+00 5.000e+00 7.800e+01]
 [0.000e+00 0.000e+00 6.500e-01 ... 1.250e+00 5.000e+00 4.000e+01]]
(4601,) [1. 1. 1. ... 0. 0. 0.]


# 4. Train and Predict

Train both models on all features, run the predictions, and store the predicted results.

In [10]:
# YOUR CODE HERE
model1 = LogisticRegression(max_iter=5000)
prediction1 = cross_val_predict(model1, spamx, spamy.astype('int'), cv = 10)

In [11]:
model2 = LogisticRegression(max_iter=5000)
prediction2 = cross_val_predict(model2, dota2x, dota2y.astype('int'), cv = 10)
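
To make explicit what `cross_val_predict(..., cv=10)` does in the cells above: for a classifier and an integer `cv`, scikit-learn splits the data with a non-shuffled `StratifiedKFold`, fits a fresh copy of the model on nine folds, and predicts the held-out fold, so every sample receives exactly one out-of-fold prediction. The sketch below reproduces this by hand on synthetic data from `make_classification`, which is a stand-in assumption rather than the spambase or Dota2 data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data (assumption: not the datasets used in this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

model = LogisticRegression(max_iter=5000)

# Manual 10-fold loop: each sample is predicted by a model that never saw it
manual_pred = np.empty_like(y_demo)
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    manual_pred[test_idx] = fold_model.predict(X_demo[test_idx])

# Library call used in the notebook
library_pred = cross_val_predict(model, X_demo, y_demo, cv=10)
print(np.array_equal(manual_pred, library_pred))
```

Because the predictions are pooled across folds, the metrics computed from them are not the same thing as the average of per-fold metrics, although the two are usually close.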

# 5. Compute the Evaluation Metrics

Compute the four metrics for both models.

In [91]:
# YOUR CODE HERE
# acc_spam = 0
# tp_spam = 0
# fp_spam = 0
# fn_spam = 0
# for i in range(prediction1.shape[0]):
#     if prediction1[i] == spamy[i]:
#         acc_spam += 1
#     if spamy[i]and prediction1[i]:
#         tp_spam +=1
#     if spamy[i]and not prediction1[i]:
#         fn_spam +=1
#     if (not spamy[i])and prediction1[i]:
#         fp_spam +=1
# acc_spam /= prediction1.shape[0]
# pre_spam = tp_spam/(tp_spam+fp_spam)
# rec_spam = tp_spam/(tp_spam+fn_spam)
# f1_spam = 2*pre_spam*rec_spam/(pre_spam + rec_spam)

# print(acc_spam,pre_spam,rec_spam,f1_spam)

acc_spam = accuracy_score(spamy,prediction1)
pre_spam = precision_score(spamy,prediction1)
rec_spam = recall_score(spamy,prediction1)
f1_spam = f1_score(spamy,prediction1)

print(acc_spam,pre_spam,rec_spam,f1_spam)

0.9180612910236905 0.9042792792792793 0.8858246001103144 0.8949568124825857


In [92]:
# acc_dota2 = 0
# tp_dota2 = 0
# fp_dota2 = 0
# fn_dota2 = 0

# for i in range(prediction2.shape[0]):
#     if prediction2[i] == dota2y[i]:
#         acc_dota2 += 1
#     if dota2y[i]>0 and prediction2[i]>0:
#         tp_dota2 +=1
#     if dota2y[i]>0 and prediction2[i]<0:
#         fn_dota2 +=1
#     if dota2y[i]<0 and prediction2[i]>0:
#         fp_dota2 +=1
# print(acc_dota2)
# acc_dota2 /= prediction2.shape[0]
# pre_dota2 = tp_dota2/(tp_dota2+fp_dota2)
# rec_dota2 = tp_dota2/(tp_dota2+fn_dota2)
# f1_dota2 = 2*pre_dota2*rec_dota2/(pre_dota2 + rec_dota2)

# print(acc_dota2,pre_dota2,rec_dota2,f1_dota2)

acc_dota2 = accuracy_score(dota2y,prediction2)
pre_dota2 = precision_score(dota2y,prediction2)
rec_dota2 = recall_score(dota2y,prediction2)
f1_dota2 = f1_score(dota2y,prediction2)

print(acc_dota2,pre_dota2,rec_dota2,f1_dota2)

0.5987371829465731 0.6066498796110794 0.6766020253372146 0.6397193499307097
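
The commented-out loop above tests `> 0` and `< 0`, which suggests the Dota2 labels are encoded as -1 and 1. Under that assumption, `precision_score`, `recall_score`, and `f1_score` treat 1 as the positive class by default (`pos_label=1`), so the reported values describe the class labeled 1. The toy labels below are made up to show how the choice of `pos_label` changes the result.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels in the -1/1 encoding (made up for this sketch)
toy_true = [1, 1, 1, 1, -1, -1]
toy_pred = [1, 1, 1, -1, 1, -1]

# Default: the class labeled 1 is treated as positive
prec_pos = precision_score(toy_true, toy_pred)
rec_pos = recall_score(toy_true, toy_pred)

# Explicitly score the class labeled -1 instead
prec_neg = precision_score(toy_true, toy_pred, pos_label=-1)
rec_neg = recall_score(toy_true, toy_pred, pos_label=-1)

print(prec_pos, rec_pos, prec_neg, rec_neg)
```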


###### Double-click here to fill in

Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-
spambase | 0.9180612910236905 | 0.9042792792792793 | 0.8858246001103144 | 0.8949568124825857
dota2Results | 0.5987371829465731  | 0.6066498796110794  | 0.6766020253372146 | 0.6397193499307097

# 6. Optional: Transform, Select, or Combine Features, Then Train a Model and Compute the Four Metrics under 10-Fold Cross-Validation
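
One possible starting point, offered as a sketch rather than the expected solution, is to standardize the features before fitting. Wrapping `StandardScaler` and `LogisticRegression` in a pipeline means the scaler is refit on the training folds only, so no statistics leak from the held-out fold. The data below is synthetic (`make_classification`) and the names `X_demo`, `y_demo`, and `scaled_model` are illustrative assumptions; to use the sketch, substitute `spamx`/`spamy` or `dota2x`/`dota2y`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: replace with the real feature matrices)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Scaling happens inside each fold because it is part of the pipeline
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
scaled_pred = cross_val_predict(scaled_model, X_demo, y_demo, cv=10)

scores = {
    "accuracy": accuracy_score(y_demo, scaled_pred),
    "precision": precision_score(y_demo, scaled_pred),
    "recall": recall_score(y_demo, scaled_pred),
    "f1": f1_score(y_demo, scaled_pred),
}
print(scores)
```

Other transformations, such as feature selection or polynomial feature combinations, can be added as further pipeline steps in the same way.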

In [None]:
# YOUR CODE HERE




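###### Double-click here to fill in
1. Processing pipeline for model 1:
2. Processing pipeline for model 2:
3. Processing pipeline for model 3:

Model|Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-|-
Model 1 | Dataset | 0.0 | 0.0 | 0.0 | 0.0
Model 2 | Dataset | 0.0 | 0.0 | 0.0 | 0.0
Model 3 | Dataset | 0.0 | 0.0 | 0.0 | 0.0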