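# Logistic Regression

## Experiment tasks
1. Use logistic regression to solve the spam email classification problem and the Dota 2 match outcome prediction problem.
2. Compute accuracy, precision, recall, and F1 score under 10-fold cross-validation.

## Evaluation metrics
1. Accuracy
2. Precision
3. Recall
4. F1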
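
The four metrics all derive from the confusion counts (true/false positives and negatives). The following is a minimal pure-Python sketch of those definitions. The helper name `binary_metrics` and the toy labels are made up for illustration; the notebook itself uses the `sklearn.metrics` functions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for a binary task.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted/labelled positive
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, precision, recall, f1


# Toy labels, made up for illustration: 3 TP, 1 FP, 1 FN, 3 TN
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.75)
```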

# 1. Load the data

In [1]:
import numpy as np

In [14]:
spambase = np.loadtxt('data/spambase/spambase.data', delimiter = ",")
dota2results = np.loadtxt('data/dota2Dataset/dota2Train.csv', delimiter=',')
print(spambase.shape)

(4601, 58)


# 2. Import the model

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

# 3. Extract the data

Here `spamx` and `dota2x` contain all the features of their respective datasets.

In [16]:
spamx = spambase[:, :57]
spamy = spambase[:, 57]

dota2x = dota2results[:, 1:]
dota2y = dota2results[:, 0]
#print(dota2x)
print(dota2y)
print(spamy.shape)
print(spamx.shape)

[-1.  1.  1. ...  1. -1. -1.]
(4601,)
(4601, 57)


# 4. Train and predict

Train both models on all features, predict with each, and store the predictions.

In [5]:
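# YOUR CODE HERE
# cross_val_predict returns, for each sample, the prediction of the model
# that was trained on the nine folds not containing that sample.
model1 = LogisticRegression()
prediction1 = cross_val_predict(model1, spamx, spamy, cv=10)

model2 = LogisticRegression()
prediction2 = cross_val_predict(model2, dota2x, dota2y, cv=10)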
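
To make explicit what `cross_val_predict` does with `cv=10`, here is a hedged sketch that reproduces it with a manual loop. It runs on synthetic data from `make_classification` (a stand-in, not the course datasets), and `max_iter=1000` is an assumption chosen to keep the solver from stopping early.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data; sizes are arbitrary
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

base_model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10)  # the splitter used for classifiers when cv is an int

# Manual version: fit on nine folds, predict the held-out fold
manual = np.empty_like(y)
for train_idx, test_idx in cv.split(X, y):
    fold_model = clone(base_model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual[test_idx] = fold_model.predict(X[test_idx])

# Library version with the same splitter
auto = cross_val_predict(base_model, X, y, cv=cv)

print(np.array_equal(manual, auto))
```

Every sample receives exactly one prediction, and that prediction comes from a model that never saw the sample during fitting.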



# 5. Compute the evaluation metrics

Compute the four metrics for both models.

In [6]:
# YOUR CODE HERE

acc1 = accuracy_score(spamy, prediction1)
acc2 = accuracy_score(dota2y, prediction2)

precision1 = precision_score(spamy, prediction1)
precision2 = precision_score(dota2y, prediction2)

recall1 = recall_score(spamy, prediction1)
recall2 = recall_score(dota2y, prediction2)

# f1 is the F1 score on spambase, f2 the F1 score on dota2 (not an F2 score)
f1 = f1_score(spamy, prediction1)
f2 = f1_score(dota2y, prediction2)

print("Four metrics for logistic regression on spambase (10-fold CV)")
print("Accuracy:", acc1)
print("Precision:", precision1)
print("Recall:", recall1)
print("F1:", f1)
print()
print("Four metrics for logistic regression on dota2 (10-fold CV)")
print("Accuracy:", acc2)
print("Precision:", precision2)
print("Recall:", recall2)
print("F1:", f2)

Four metrics for logistic regression on spambase (10-fold CV)
Accuracy: 0.9184959791349706
Precision: 0.9039325842696629
Recall: 0.8874793160507446
F1: 0.8956303924297244

Four metrics for logistic regression on dota2 (10-fold CV)
Accuracy: 0.5986832164058283
Precision: 0.6066019702984855
Recall: 0.6765610266081752
F1: 0.639674387053009


###### Double-click here to fill in

Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-
spambase | 0.9184959791349706 | 0.9039325842696629 | 0.8874793160507446 | 0.8956303924297244
dota2Results | 0.5986832164058283 | 0.6066019702984855 | 0.6765610266081752 | 0.639674387053009
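
Scoring the pooled out-of-fold predictions once is one way to report cross-validated metrics. Another is to score each fold separately and average the ten values; the two can differ slightly. Here is a sketch of the per-fold variant using `cross_validate`, run on synthetic data (`make_classification` stand-in, with `max_iter=1000` as an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; not the spambase or Dota 2 datasets
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

metric_names = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=10,
    scoring=metric_names,
)

# One score per fold and per metric, stored under the key "test_<metric>"
for name in metric_names:
    fold_scores = scores[f"test_{name}"]
    print(name, fold_scores.mean(), fold_scores.std())
```

The per-fold standard deviation gives a rough sense of how stable each metric is across splits, which a single pooled number cannot show.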

# 6. Optional: transform, select, or combine features, then train the model and compute the four metrics under 10-fold cross-validation

In [7]:
# YOUR CODE HERE
import pandas as pd
from sklearn.feature_selection import RFE

model = LogisticRegression()

def train(x, y):
    """Run 10-fold cross-validation on (x, y) and print the four metrics."""
    prediction_dota2 = cross_val_predict(model, x, y, cv=10)

    acc = accuracy_score(y, prediction_dota2)
    precision = precision_score(y, prediction_dota2)
    recall = recall_score(y, prediction_dota2)
    f1 = f1_score(y, prediction_dota2)

    print("Four metrics for logistic regression on dota2 (10-fold CV)")
    print("Accuracy:", acc)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1:", f1)
    print()

# Feature transformation: binarize every column against its own mean
x1 = pd.DataFrame(dota2x)
y1 = pd.Series(dota2y)

for elm in x1:
    elm_mean = x1[elm].mean()
    x1[elm] = x1[elm].apply(lambda x: 1 if x > elm_mean else 0)
train(x1, y1)

Four metrics for logistic regression on dota2 (10-fold CV)
Accuracy: 0.573815434430653
Precision: 0.5805992925019074
Recall: 0.6863597228485917
F1: 0.6290652888680132
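
The mean-threshold binarization can also be written as a single vectorized comparison instead of a per-column loop. This is a small sketch on a made-up array; the values are illustrative only.

```python
import numpy as np

# Made-up feature matrix: 3 samples, 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [6.0, 30.0]])

col_means = X.mean(axis=0)           # per-column means: [3.0, 20.0]
X_bin = (X > col_means).astype(int)  # 1 where strictly above the column mean, else 0

print(X_bin)
# [[0 0]
#  [0 0]
#  [1 1]]
```

Values equal to the mean map to 0, matching the `x > elm_mean` test in the loop version.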



In [8]:
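# Feature selection with recursive feature elimination (RFE)
rfe = RFE(model)
rfe = rfe.fit(dota2x, dota2y)
#print(rfe.ranking_)

feature_indexs = rfe.get_support(True)  # indices of the selected feature columns
#print(feature_indexs)

# Index columns, not rows: dota2x[feature_indexs] would keep 58 samples
# instead of the 58 selected features, and the labels must stay complete.
x2 = dota2x[:, feature_indexs]
y2 = dota2y
train(x2, y2)

Four metrics for logistic regression on dota2 (10-fold CV)
Accuracy: 0.39655172413793105
Precision: 0.40625
Recall: 0.4482758620689655
F1: 0.4262295081967213

Note: the figures recorded here were computed on only 58 samples, because the run that produced them applied `feature_indexs` to the rows of `dota2x` and `dota2y`. They do not measure the RFE-selected feature set; the cell has to be rerun with column indexing to obtain those metrics.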
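
RFE returns indices of feature columns, so they belong on the second axis of the feature matrix. Fitting the selector on the full dataset before cross-validating also lets the held-out folds influence which features are kept. Below is a hedged sketch that avoids both issues by placing RFE inside a pipeline. It uses synthetic data, and `n_features_to_select=10` and `max_iter=1000` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: 300 samples, 20 features
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)

# The pipeline refits the selector inside every training fold
pipeline = make_pipeline(selector, LogisticRegression(max_iter=1000))
pred = cross_val_predict(pipeline, X, y, cv=10)

# Selecting manually: the indices address columns, so all samples are kept
support = selector.fit(X, y).get_support(indices=True)
X_selected = X[:, support]
print(X_selected.shape)  # (300, 10)

# Row indexing with the same indices silently keeps 10 samples instead
print(X[support].shape)  # (10, 20)
```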





In [9]:
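# Feature combination: here, only the first three feature columns are used
x3 = dota2x[:, 0:3]
y3 = dota2y
train(x3, y3)

Four metrics for logistic regression on dota2 (10-fold CV)
Accuracy: 0.5265191581219644
Precision: 0.5265191581219644
Recall: 1.0
F1: 0.6898297415012161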
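
A recall of exactly 1.0 together with precision equal to accuracy is the signature of a classifier that outputs the positive class for every sample. Here is a small pure-Python check on made-up labels, using the same -1 / 1 encoding as `dota2y`:

```python
# Made-up labels: 6 positives and 4 negatives
y_true = [1] * 6 + [-1] * 4
y_pred = [1] * len(y_true)  # a model that always predicts the positive class

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == -1 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == -1)
tn = sum(1 for t, p in pairs if t == -1 and p == -1)

accuracy = (tp + tn) / len(pairs)  # 6 / 10
precision = tp / (tp + fp)         # 6 / 10, the share of positives in the data
recall = tp / (tp + fn)            # 6 / 6
f1 = 2 * tp / (2 * tp + fp + fn)   # 12 / 16

print(accuracy, precision, recall, f1)  # 0.6 0.6 1.0 0.75
```

In this situation accuracy and precision both equal the share of positive samples, so the model has learned nothing beyond the class prior.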




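1. Model 1 pipeline: binarize each feature column against its mean (1 if above the mean, 0 otherwise), then train on the result.
2. Model 2 pipeline: fit RFE with the logistic regression estimator and train on the features it selects. The figures recorded for it come from only 58 samples, because the RFE indices were applied to rows rather than columns, so they do not measure the selected feature set.
3. Model 3 pipeline: train on the first three feature columns only. A recall of 1.0 with precision equal to accuracy means the model predicts the positive class for every sample.

Model|Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-|-
Model 1 | dota2Results | 0.573815434430653 | 0.5805992925019074 | 0.6863597228485917 | 0.6290652888680132
Model 2 | dota2Results | 0.39655172413793105 | 0.40625 | 0.4482758620689655 | 0.4262295081967213
Model 3 | dota2Results | 0.5265191581219644 | 0.5265191581219644 | 1.0 | 0.6898297415012161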