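Goal: implement text classification based on logistic/softmax regression.


Background to cover:

1) Text feature representation: Bag-of-Words, N-gram

2) Classifier: logistic/softmax regression, loss function, (stochastic) gradient descent, feature selection

3) Dataset: splitting into training, validation, and test sets

Experiments:

Analyze how different features, loss functions, and learning rates affect the final classification performance

shuffle, batch, mini-batch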
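
The classifier and the shuffle / batch / mini-batch ideas listed above can be sketched in plain NumPy. This is a minimal illustration, not the notebook's own method: the function names (`softmax`, `cross_entropy`, `train_softmax`, `predict`) and the default hyperparameters are assumptions chosen for the example.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, y):
    # Mean negative log-likelihood of the true classes
    return -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))

def train_softmax(X, y, n_classes, lr=0.1, epochs=10, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    for _ in range(epochs):
        order = rng.permutation(n)  # shuffle the samples once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx].astype(np.float64), y[idx]
            grad = softmax(Xb @ W + b)
            grad[np.arange(len(yb)), yb] -= 1.0  # d(loss)/d(logits) = p - one_hot(y)
            W -= lr * (Xb.T @ grad) / len(yb)
            b -= lr * grad.mean(axis=0)
    return W, b

def predict(X, W, b):
    # Class with the largest logit for each row
    return np.argmax(X @ W + b, axis=1)
```

With `batch_size=len(X)` this is full-batch gradient descent, with `batch_size=1` it is stochastic gradient descent, and anything in between is mini-batch; varying `lr` is one way to run the learning-rate comparison.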

In [2]:
import numpy as np
import pandas as pd

In [3]:
train=pd.read_csv(r'C:\Users\16786\Desktop\复旦nlp上手教程\task1\sentiment-analysis-on-movie-reviews\train.tsv',sep='\t')
test=pd.read_csv(r'C:\Users\16786\Desktop\复旦nlp上手教程\task1\sentiment-analysis-on-movie-reviews\test.tsv',sep='\t')

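The dataset consists of tab-separated files containing phrases from the Rotten Tomatoes dataset. The train/test split has been preserved for benchmarking purposes, but the sentences have been shuffled from their original order. Each sentence was parsed into many phrases by the Stanford parser. Each phrase has a PhraseId and each sentence has a SentenceId. Repeated phrases (such as short or common words) are included only once in the data.

train.tsv contains the phrases and their associated sentiment labels. A SentenceId is also provided so that you can track which phrases belong to a single sentence.

test.tsv contains only phrases. A sentiment label must be assigned to each phrase.

The sentiment labels are:

0 - negative

1 - somewhat negative

2 - neutral

3 - somewhat positive

4 - positive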

In [4]:
print(train.info())
train.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PhraseId    156060 non-null  int64 
 1   SentenceId  156060 non-null  int64 
 2   Phrase      156060 non-null  object
 3   Sentiment   156060 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 4.8+ MB
None


Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2


In [5]:
print(test.info())
test.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66292 entries, 0 to 66291
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   PhraseId    66292 non-null  int64 
 1   SentenceId  66292 non-null  int64 
 2   Phrase      66292 non-null  object
dtypes: int64(2), object(1)
memory usage: 1.5+ MB
None


Unnamed: 0,PhraseId,SentenceId,Phrase
0,156061,8545,An intermittently pleasing but mostly routine ...
1,156062,8545,An intermittently pleasing but mostly routine ...
2,156063,8545,An
3,156064,8545,intermittently pleasing but mostly routine effort
4,156065,8545,intermittently pleasing but mostly routine


In [6]:
train.Sentiment.value_counts()/train.Sentiment.count()

2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64

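## Hand-written Bag-of-Words model


Each column corresponds to a word in the vocabulary, and the value in that column is the number of times the word occurs in the sentence.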

In [7]:
class Bag_of_Word:
    def __init__(self,do_lower_case=False):
        self.do_lower_case=do_lower_case
        self.feature_set=set()# a set keeps each word only once
    def fit_transform(self,data):
        # First pass: collect the vocabulary
        for sentence in data:
            if self.do_lower_case:# False by default; lowercasing happens only when explicitly enabled
                sentence=sentence.lower()
            words=sentence.split(' ')# split each sentence into words
            for word in words:
                self.feature_set.add(word)
        # Note: uint8 saves memory, but counts above 255 would wrap around
        feature_bow=np.zeros((len(data),len(self.feature_set)),dtype='uint8')
        feature_dic=dict(zip(self.feature_set,range(len(self.feature_set))))# a set cannot be indexed, so map each word to a column index with a dict
        # Second pass: count word occurrences per sentence
        for idx,sentence in enumerate(data):
            if self.do_lower_case:
                sentence=sentence.lower()
            words=sentence.split(' ')
            for word in words:
                if word in self.feature_set:
                    feature_bow[idx,feature_dic[word]]+=1
        return feature_bow
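
As a small worked example of the count representation described above, the sketch below builds the matrix for two toy sentences. The helper name `bow_matrix` and the sentences are made up for illustration; unlike the class above, it sorts the vocabulary so the column order is reproducible across runs.

```python
import numpy as np

def bow_matrix(sentences):
    # Sorted vocabulary gives a reproducible column order
    vocab = sorted({w for s in sentences for w in s.split(' ')})
    index = {w: j for j, w in enumerate(vocab)}
    counts = np.zeros((len(sentences), len(vocab)), dtype=np.int32)
    for i, s in enumerate(sentences):
        for w in s.split(' '):
            counts[i, index[w]] += 1  # one increment per occurrence
    return vocab, counts

vocab, counts = bow_matrix(["a good good movie", "a bad movie"])
print(vocab)   # ['a', 'bad', 'good', 'movie']
print(counts)  # rows: [1 0 2 1] and [1 1 0 1]
```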

## Hand-written N-Gram

Here a dict is used directly to build the vocabulary; it could also be built from a set, as in the BoW class above.

In [8]:
class N_Gram:
    def __init__(self,ngram,do_lower_case=False):
        self.ngram=ngram
        self.feature_dict={}
        self.do_lower_case=do_lower_case
    def fit_transform(self,data):
        # First pass: build the vocabulary over every n in self.ngram
        for gram in self.ngram:
            for sentence in data:
                if self.do_lower_case:
                    sentence=sentence.lower()# lowercase the sentence itself so the split below sees it
                words=sentence.split(' ')
                for i in range(len(words)-gram+1):# a sentence of L words yields L-N+1 N-grams
                    feature="_".join(words[i:(i+gram)])# join the words into one token, e.g. "i have" -> "i_have" (see the note at the end)
                    if feature not in self.feature_dict:
                        self.feature_dict[feature]=len(self.feature_dict)
        # Second pass: build the feature matrix once, after the vocabulary is complete
        n=len(data)# number of sentences
        m=len(self.feature_dict)# vocabulary size
        ngram_feature = np.zeros((n, m),dtype='uint8')
        for idx,sentence in enumerate(data):
            if self.do_lower_case:
                sentence=sentence.lower()
            words=sentence.split(' ')
            for gram in self.ngram:
                for i in range(len(words)-gram+1):
                    feature="_".join(words[i:(i+gram)])
                    if feature in self.feature_dict:
                        ngram_feature[idx][self.feature_dict[feature]]=1# binary indicator: presence, not count
        return ngram_feature
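
The L-N+1 rule used in the class above can be checked on a single sentence. This is only an illustrative sketch; `extract_ngrams` is a hypothetical helper, not part of the notebook.

```python
def extract_ngrams(sentence, n):
    # A sentence of L words yields L - n + 1 n-grams (none if L < n)
    words = sentence.split(' ')
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(extract_ngrams("i have a dream", 2))  # ['i_have', 'have_a', 'a_dream']
print(extract_ngrams("i have a dream", 3))  # ['i_have_a', 'have_a_dream']
print(extract_ngrams("hi", 2))              # []
```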

## Preprocessing and splitting the data

This uses the utilities provided by sklearn directly instead of splitting by hand with shuffle().

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X, y = train["Phrase"].values, train["Sentiment"].values
#y = np.array(y).reshape((-1, 1))# reshape y into a single column
y.shape

(156060,)

In [11]:
bow=Bag_of_Word()# separate name so the instance does not shadow the class
#n_gram=N_Gram(ngram=(1, 2))

X_BoW=bow.fit_transform(X)
#X_NGram = n_gram.fit_transform(X)

In [12]:
X_BoW.shape#,X_NGram.shape

(156060, 18227)

In [None]:
X_train_BoW,X_test_BoW,y_train_BoW,y_test_BoW=train_test_split(X_BoW,y,test_size=0.2,random_state=42,stratify=y) 
#X_train_NGram, X_test_NGram, y_train_NGram, y_test_NGram=train_test_split(X_NGram, y,test_size=0.2,random_state=42,stratify=y)
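
The split above produces only a training part and a held-out part. If a separate validation set is also wanted, one common approach is to call `train_test_split` twice. The sketch below uses small made-up arrays (`X_demo`, `y_demo`) rather than the real features, and the 60/20/20 proportions are an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and labels (hypothetical data, not the real dataset)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.repeat(np.arange(5), 20)

# Hold out 20% as the test set, then 25% of the remainder as validation (0.25 * 0.8 = 0.2 overall)
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp)

print(X_tr.shape, X_val.shape, X_te.shape)  # (60, 2) (20, 2) (20, 2)
```

Hyperparameters such as the learning rate would then be tuned on the validation part, with the test part used only for the final score.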

## 训练模型

In [13]:
from sklearn.linear_model import LogisticRegression

In [None]:
model_1=LogisticRegression(solver='sag', multi_class='multinomial')
model_1.fit(X_train_BoW,y_train_BoW)

In [None]:
predict = model_1.predict(X_test_BoW)
# Evaluate: accuracy on the held-out split
print(np.mean(predict == y_test_BoW))

In [None]:
# Note: how "_".join turns the first two words into a single n-gram token
'''
a="a b d d d v b e e"
b=a.split(' ')
c="_".join(b[0:2])
c
'''

## Not sure why, but this ran for a very long time and never finished QAQ
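
One plausible cause, stated as an assumption rather than something verified on the original machine, is memory. A dense 156060 × 18227 `uint8` matrix takes about 2.8 GB, and it would be roughly eight times larger if the estimator converts it to `float64`. Almost all of its entries are zero, so a sparse matrix may help. The sketch below builds the same word counts as a `scipy.sparse` CSR matrix, a format that scikit-learn's `LogisticRegression.fit` accepts; the helper name `bow_sparse` and the toy sentences are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

def bow_sparse(sentences):
    vocab = {}            # word -> column index, in order of first appearance
    rows, cols = [], []
    for i, s in enumerate(sentences):
        for w in s.split(' '):
            j = vocab.setdefault(w, len(vocab))
            rows.append(i)
            cols.append(j)
    data = np.ones(len(rows), dtype=np.float64)
    # Duplicate (row, col) pairs are summed on construction, which yields the counts
    X = csr_matrix((data, (rows, cols)), shape=(len(sentences), len(vocab)))
    return X, vocab

X_sp, vocab_sp = bow_sparse(["a good good movie", "a bad movie"])
print(X_sp.shape)      # (2, 4)
print(X_sp.toarray())  # rows: [1. 2. 1. 0.] and [1. 0. 1. 1.]
```

Whether this alone makes training finish is untested here; raising `max_iter` or scaling the features may also matter for the `sag` solver.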