## Task 2: Regex Keywords and Text Classification

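Regular expressions (regex) are a powerful tool for searching and manipulating strings. A regular expression is a single pattern string that describes, and matches, a whole set of strings conforming to a syntactic rule. They are widely used in programming, data mining, and text processing.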

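A regex-based text classifier typically follows four steps:

1. Define rules: write a set of regular expressions that reflect the classification requirements.
2. Preprocess the text: clean the input, for example by removing punctuation and converting to lowercase.
3. Match patterns: search the text for the patterns defined in step 1.
4. Decide the class: assign the text to a category based on which patterns matched.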
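The four steps above can be sketched as follows. This is a minimal illustration rather than the method used later in this notebook: the `RULES` patterns and the `preprocess` and `classify` helpers are hypothetical and hand-written for the example; only the label names come from the dataset.

```python
import re

# Step 1: define rules. Hypothetical hand-written patterns, one per category.
RULES = {
    "Travel-Query": re.compile(r"汽车票|火车票|机票|怎么去"),
    "Music-Play": re.compile(r"一首|专辑|歌曲"),
    "Weather-Query": re.compile(r"天气|气温|下雨"),
}

def preprocess(text: str) -> str:
    # Step 2: lowercase, then drop punctuation and whitespace.
    # \w keeps letters, digits, underscores, and CJK characters.
    return re.sub(r"[^\w]", "", text.lower())

def classify(text: str, default: str = "Other") -> str:
    cleaned = preprocess(text)
    # Step 3: search each pattern. Step 4: return the first category that matches.
    for label, pattern in RULES.items():
        if pattern.search(cleaned):
            return label
    return default

print(classify("还有双鸭山到淮阴的汽车票吗13号的"))  # Travel-Query
print(classify("随便播放一首专辑阁楼里的佛里的歌"))  # Music-Play
print(classify("你好！"))  # Other
```

Returning the first match keeps the sketch short; a real rule set would need a policy for texts that match several categories.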

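When classifying text with regular expressions, choosing the keywords is the critical step, because it directly determines the accuracy and efficiency of the classifier. Candidate keywords can come from the high-frequency words of each category, found by analyzing the training data, or from domain-specific terminology associated with the category. The rest of this section takes the first approach: it counts how often each word occurs under each intent label.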

In [34]:
import pandas as pd
import jieba
import matplotlib.pyplot as plt

# Load the dataset directly over the network; you can also download the files first and read them locally
data_dir = 'https://mirror.coggle.club/dataset/coggle-competition/'
train_data = pd.read_csv(data_dir + 'intent-classify/train.csv', sep='\t', header=None)
test_data = pd.read_csv(data_dir + 'intent-classify/test.csv', sep='\t', header=None)


In [35]:
train_data.head(10)

Unnamed: 0,0,1
0,还有双鸭山到淮阴的汽车票吗13号的,Travel-Query
1,从这里怎么回家,Travel-Query
2,随便播放一首专辑阁楼里的佛里的歌,Music-Play
3,给看一下墓王之王嘛,FilmTele-Play
4,我想看挑战两把s686打突变团竞的游戏视频,Video-Play
5,我想看和平精英上战神必备技巧的游戏视频,Video-Play
6,2019年古装爱情电视剧小女花不弃的花絮播放一下,Video-Play
7,找一个2004年的推理剧给我看一会呢,FilmTele-Play
8,自驾游去深圳都经过那些地方啊,Travel-Query
9,给我转播今天的女子双打乒乓球比赛现场,Video-Play


In [36]:
train_text = '。'.join(list(train_data[0]))
train_words = jieba.lcut(train_text)

In [37]:
print(train_text[:10])

print(train_words[:10])

还有双鸭山到淮阴的汽
['还有', '双鸭山', '到', '淮阴', '的', '汽车票', '吗', '13', '号', '的']


In [38]:
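# Load the stop words into a set so that `in` tests exact membership
# (joining them into one string would turn `in` into a substring test)
cn_stopwords = set(pd.read_csv('https://mirror.coggle.club/stopwords/baidu_stopwords.txt', header=None)[0])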

In [39]:

# Remove stop words
train_words = [x for x in train_words if x not in cn_stopwords]
# Remove single-character words
train_words = [x for x in train_words if len(x) > 1]
# Remove tokens that consist only of digits
train_words = [x for x in train_words if not x.isdigit()]
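One detail worth checking in a filter like `x not in cn_stopwords` is the type of the container: Python's `in` performs a substring test against a `str`, but an exact-match test against a `set` or `list`. The sketch below uses a tiny made-up stop-word list (an assumption for illustration, not the Baidu list) to show the difference.

```python
# Tiny made-up stop-word list, for illustration only.
stopwords = ["怎么样", "有没有"]
words = ["怎么", "没有", "电影", "怎么样"]

as_string = " ".join(stopwords)  # "怎么样 有没有"
as_set = set(stopwords)

# Substring test: "怎么" and "没有" are dropped too, because they occur
# inside the longer stop words "怎么样" and "有没有".
kept_with_string = [w for w in words if w not in as_string]

# Exact test: only the whole word "怎么样" is dropped.
kept_with_set = [w for w in words if w not in as_set]

print(kept_with_string)  # ['电影']
print(kept_with_set)     # ['怎么', '没有', '电影']
```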

In [40]:
from collections import Counter

# Keep only words that occur at least 5 times
train_words_freq = Counter(train_words)
train_words = [x for x in train_words if train_words_freq[x] >= 5]

In [41]:
train_word_prior = {}
# Membership tests against a set are O(1); scanning the list (which still contains duplicates) is O(n)
vocab = set(train_words)
for row in train_data.itertuples():
    # row[0] is the index, row[1] the text, row[2] the label
    text, label = row[1], row[2]
    words = jieba.lcut(text)
    words = [x for x in words if x in vocab]

    if len(words) == 0:
        continue

    # Count how often each word occurs, in total and per sentence label
    for word in words:
        if word not in train_word_prior:
            train_word_prior[word] = {"total": 0}

        if label not in train_word_prior[word]:
            train_word_prior[word][label] = 0

        train_word_prior[word][label] += 1
        train_word_prior[word]['total'] += 1
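The same word-by-label counting can be written more compactly with `collections.defaultdict` and `Counter`. The sketch below is an alternative formulation on toy, pre-tokenized samples (hypothetical data; in the notebook the tokens come from `jieba.lcut`), and the name `word_label_counts` is introduced here only for illustration.

```python
from collections import Counter, defaultdict

# Toy pre-tokenized samples: (tokens, label). Hypothetical data for illustration.
samples = [
    (["汽车票", "回家"], "Travel-Query"),
    (["播放", "一首"], "Music-Play"),
    (["播放", "视频"], "Video-Play"),
    (["回家", "闹钟"], "Alarm-Update"),
]

# word -> Counter recording how often the word occurs under each label
word_label_counts = defaultdict(Counter)
for tokens, label in samples:
    for token in tokens:
        word_label_counts[token][label] += 1

for word, counts in word_label_counts.items():
    # The total is derived from the per-label counts instead of being stored separately.
    print(word, dict(counts), "total =", sum(counts.values()))
```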

In [42]:
train_word_prior

{'汽车票': {'total': 39, 'Travel-Query': 39},
 '回家': {'total': 20,
  'Travel-Query': 8,
  'Music-Play': 3,
  'Alarm-Update': 7,
  'Video-Play': 1,
  'FilmTele-Play': 1},
 '随便': {'total': 91,
  'Music-Play': 30,
  'Video-Play': 26,
  'Radio-Listen': 9,
  'Audio-Play': 6,
  'FilmTele-Play': 15,
  'Other': 4,
  'TVProgram-Play': 1},
 '播放': {'total': 1729,
  'Music-Play': 464,
  'Video-Play': 310,
  'FilmTele-Play': 589,
  'Radio-Listen': 180,
  'Weather-Query': 48,
  'Audio-Play': 58,
  'TVProgram-Play': 69,
  'HomeAppliance-Control': 5,
  'Other': 6},
 '一首': {'total': 406, 'Music-Play': 391, 'Audio-Play': 3, 'Other': 12},
 '专辑': {'total': 110, 'Music-Play': 108, 'Video-Play': 1, 'FilmTele-Play': 1},
 '挑战': {'total': 8, 'Video-Play': 8},
 '游戏': {'total': 85,
  'Video-Play': 73,
  'FilmTele-Play': 3,
  'Music-Play': 7,
  'Alarm-Update': 1,
  'TVProgram-Play': 1},
 '视频': {'total': 556,
  'Video-Play': 541,
  'FilmTele-Play': 3,
  'HomeAppliance-Control': 1,
  'Alarm-Update': 1,
  'TVProgram-Pl

In [43]:
train_word_prior = pd.DataFrame(train_word_prior).T
train_word_prior.fillna(0, inplace=True)

In [44]:
train_word_prior.head(10)

Unnamed: 0,total,Travel-Query,Music-Play,Alarm-Update,Video-Play,FilmTele-Play,Radio-Listen,Audio-Play,Other,TVProgram-Play,Weather-Query,HomeAppliance-Control,Calendar-Query
汽车票,39.0,39.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
回家,20.0,8.0,3.0,7.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
随便,91.0,0.0,30.0,0.0,26.0,15.0,9.0,6.0,4.0,1.0,0.0,0.0,0.0
播放,1729.0,0.0,464.0,0.0,310.0,589.0,180.0,58.0,6.0,69.0,48.0,5.0,0.0
一首,406.0,0.0,391.0,0.0,0.0,0.0,0.0,3.0,12.0,0.0,0.0,0.0,0.0
专辑,110.0,0.0,108.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
挑战,8.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
游戏,85.0,0.0,7.0,1.0,73.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
视频,556.0,0.0,0.0,1.0,541.0,3.0,2.0,0.0,0.0,8.0,0.0,1.0,0.0
和平,11.0,0.0,0.0,0.0,8.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
# Convert the per-label counts into proportions of each word's total count
for category in train_data[1].unique():
    train_word_prior[category] /= train_word_prior['total']

In [46]:
train_word_prior.head(10)

Unnamed: 0,total,Travel-Query,Music-Play,Alarm-Update,Video-Play,FilmTele-Play,Radio-Listen,Audio-Play,Other,TVProgram-Play,Weather-Query,HomeAppliance-Control,Calendar-Query
汽车票,39.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
回家,20.0,0.4,0.15,0.35,0.05,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0
随便,91.0,0.0,0.32967,0.0,0.285714,0.164835,0.098901,0.065934,0.043956,0.010989,0.0,0.0,0.0
播放,1729.0,0.0,0.268363,0.0,0.179294,0.340659,0.104106,0.033545,0.00347,0.039907,0.027762,0.002892,0.0
一首,406.0,0.0,0.963054,0.0,0.0,0.0,0.0,0.007389,0.029557,0.0,0.0,0.0,0.0
专辑,110.0,0.0,0.981818,0.0,0.009091,0.009091,0.0,0.0,0.0,0.0,0.0,0.0,0.0
挑战,8.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
游戏,85.0,0.0,0.082353,0.011765,0.858824,0.035294,0.0,0.0,0.0,0.011765,0.0,0.0,0.0
视频,556.0,0.0,0.0,0.001799,0.973022,0.005396,0.003597,0.0,0.0,0.014388,0.0,0.001799,0.0
和平,11.0,0.0,0.0,0.0,0.727273,0.272727,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [47]:
# Column index of the most probable label for each word (skipping the 'total' column)
train_word_prior.values[:, 1:].argmax(1)

array([0, 0, 1, ..., 7, 3, 4], dtype=int64)

In [48]:
# Name of the most probable label for each word
train_word_prior['category'] = train_word_prior.columns[1:][train_word_prior.values[:, 1:].argmax(1)]

train_word_prior.head(10)

Unnamed: 0,total,Travel-Query,Music-Play,Alarm-Update,Video-Play,FilmTele-Play,Radio-Listen,Audio-Play,Other,TVProgram-Play,Weather-Query,HomeAppliance-Control,Calendar-Query,category
汽车票,39.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Travel-Query
回家,20.0,0.4,0.15,0.35,0.05,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Travel-Query
随便,91.0,0.0,0.32967,0.0,0.285714,0.164835,0.098901,0.065934,0.043956,0.010989,0.0,0.0,0.0,Music-Play
播放,1729.0,0.0,0.268363,0.0,0.179294,0.340659,0.104106,0.033545,0.00347,0.039907,0.027762,0.002892,0.0,FilmTele-Play
一首,406.0,0.0,0.963054,0.0,0.0,0.0,0.0,0.007389,0.029557,0.0,0.0,0.0,0.0,Music-Play
专辑,110.0,0.0,0.981818,0.0,0.009091,0.009091,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Music-Play
挑战,8.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Video-Play
游戏,85.0,0.0,0.082353,0.011765,0.858824,0.035294,0.0,0.0,0.0,0.011765,0.0,0.0,0.0,Video-Play
视频,556.0,0.0,0.0,0.001799,0.973022,0.005396,0.003597,0.0,0.0,0.014388,0.0,0.001799,0.0,Video-Play
和平,11.0,0.0,0.0,0.0,0.727273,0.272727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Video-Play
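To check the arithmetic, the proportion and arg-max steps can be reproduced in plain Python on the raw count dictionary. The sketch below copies three entries from the `train_word_prior` dictionary printed earlier; the `dominant_label` helper is a hypothetical name introduced for this example.

```python
# Counts copied from the `train_word_prior` dictionary printed earlier.
word_counts = {
    "汽车票": {"total": 39, "Travel-Query": 39},
    "回家": {"total": 20, "Travel-Query": 8, "Music-Play": 3, "Alarm-Update": 7,
             "Video-Play": 1, "FilmTele-Play": 1},
    "一首": {"total": 406, "Music-Play": 391, "Audio-Play": 3, "Other": 12},
}

def dominant_label(counts):
    """Return (label, share) for the label with the largest share of a word's occurrences."""
    total = counts["total"]
    shares = {label: n / total for label, n in counts.items() if label != "total"}
    best = max(shares, key=shares.get)
    return best, shares[best]

for word, counts in word_counts.items():
    label, share = dominant_label(counts)
    print(f"{word}: {label} ({share:.3f})")
# 汽车票: Travel-Query (1.000)
# 回家: Travel-Query (0.400)
# 一首: Music-Play (0.963)
```

Note that 回家 is assigned to `Travel-Query` with a share of only 0.4, so the arg-max label alone says little about how reliable a keyword is.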


In [49]:

train_word_prior.groupby('category').apply(lambda x: list(x.index))

category
Alarm-Update             [早上, 我定, 下午, 参加, 公司, 闹钟, 活动, 提醒, 创建, 周末, 上午, 取...
Audio-Play               [故事, 小说, 广播剧, 英文版, 岳云鹏, 爆笑, 相声, 有声, 俄语, 第五章, 郭...
Calendar-Query           [昨天, 农历, 我查, 星期, 几号, 告诉, 几月, 查查, 礼拜, 几是, 春节, 母...
FilmTele-Play            [播放, 古装, 爱情, 电视剧, 一个, 推理, 一会, 地方, 导演, 赵丽颖, 麻烦,...
HomeAppliance-Control    [空调, 客厅, 风速, 打开, 烤箱, 儿童房, 调高, 洗衣机, 停止, 工作, 模式,...
Music-Play               [随便, 一首, 专辑, 单曲, 循环, 王菲, 钢琴曲, 随机, 治愈, 日语, 歌曲, ...
Other                                     [永远, 电话, 笑话, 之间, 老婆, 不好, 漫画, 有人]
Radio-Listen             [河南, 新闻广播, 新闻台, 交通, 广播电台, 经典音乐, 七点, 中央, 电台, 都市...
TVProgram-Play           [播出, 卫视, 广西, 法治, CCTV11, 剧场, 开播, 文化, 结束, 早间, 贵...
Travel-Query             [汽车票, 回家, 深圳, 武汉, 北京, 桂林, 飞机, 起飞, 快点, 三张, 成都, ...
Video-Play               [挑战, 游戏, 视频, 和平, 精英, 花絮, 转播, 比赛, 现场, 世界, 年谍, 第...
Weather-Query            [查询, 海南, 几级, 刮风, 几天, 山西, 明天, 衡水, 气温, 适合, 杭州, 香...
dtype: object
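The per-category keyword lists can then be turned into regular expressions for the classification step. The following is a rough sketch under stated assumptions: `keyword_table` is a hand-copied sample of a few rows from the tables above, and `build_patterns`, `predict`, the `min_share` threshold of 0.8, and the vote-counting rule are all illustrative choices rather than part of the original notebook.

```python
import re
from collections import Counter

# (word, dominant category, share of occurrences), copied from the tables above.
keyword_table = [
    ("汽车票", "Travel-Query", 1.0),
    ("回家", "Travel-Query", 0.4),
    ("一首", "Music-Play", 0.963054),
    ("专辑", "Music-Play", 0.981818),
    ("播放", "FilmTele-Play", 0.340659),
    ("游戏", "Video-Play", 0.858824),
    ("视频", "Video-Play", 0.973022),
]

def build_patterns(table, min_share=0.8):
    """Group sufficiently category-specific keywords into one alternation regex per category."""
    grouped = {}
    for word, label, share in table:
        if share >= min_share:  # drops ambiguous words such as 播放 and 回家
            grouped.setdefault(label, []).append(re.escape(word))
    # Longer keywords first, so the alternation prefers the most specific match.
    return {
        label: re.compile("|".join(sorted(words, key=len, reverse=True)))
        for label, words in grouped.items()
    }

def predict(text, patterns, default="Other"):
    """Pick the category whose pattern matches most often; fall back to `default`."""
    votes = Counter({label: len(p.findall(text)) for label, p in patterns.items()})
    if not votes:
        return default
    label, hits = votes.most_common(1)[0]
    return label if hits > 0 else default

patterns = build_patterns(keyword_table)
print(predict("随便播放一首专辑阁楼里的佛里的歌", patterns))  # Music-Play
print(predict("我想看和平精英上战神必备技巧的游戏视频", patterns))  # Video-Play
print(predict("从这里怎么回家", patterns))  # Other
```

The threshold trades recall for precision: with `min_share=0.8` the word 回家 is discarded, so the third sentence falls back to `Other` even though its label in the training data is `Travel-Query`.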