# MIND: Microsoft News Recommendation Dataset

- behaviors.tsv: The click histories and impression logs of users
- news.tsv: Information about the news articles
- entity_embedding.vec: Embeddings of the entities in news articles, extracted from a knowledge graph
- relation_embedding.vec: Embeddings of the relations between entities, extracted from a knowledge graph

## news.tsv

- News ID
- Category
- SubCategory
- Title
- Abstract
- URL
- Title Entities (entities contained in the title of this news article)
- Abstract Entities (entities contained in the abstract of this news article)

In [32]:
import pandas as pd

# Reset the displayed column width to its default value
pd.reset_option('display.max_colwidth')

# Path to the TSV file
tsv_file_path = 'news.tsv'

# Set the column names manually
column_names = ["news_id", "category", "sub_category", "title", "abstract", "url", "title_entities", "abstract_entities"]

# Read the TSV file with pandas; the file has no header row, so the column names are passed explicitly
df_news = pd.read_csv(tsv_file_path, sep='\t', header=None, names=column_names)

# Show the first two rows
df_news.head(2)

Unnamed: 0,news_id,category,sub_category,title,abstract,url,title_entities,abstract_entities
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
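
The `title_entities` and `abstract_entities` cells shown above hold JSON lists of entity objects, and `[]` marks an article without entities. A minimal sketch of decoding one cell follows. The helper name `parse_entities` is hypothetical, and the sample string is an abbreviated, made-up cell that keeps only the `Label` and `Type` keys visible in the output above.

```python
import json

def parse_entities(cell):
    """Return the list of entity dicts stored in one entities cell."""
    # pandas reads a missing cell as NaN (a float), so anything that is
    # not a string is treated as "no entities".
    if not isinstance(cell, str):
        return []
    return json.loads(cell)

# Abbreviated sample cell (assumption: real cells carry more keys per entity).
sample_cell = '[{"Label": "Adipose tissue", "Type": "C"}]'
entities = parse_entities(sample_cell)
labels = [entity["Label"] for entity in entities]
print(labels)  # ['Adipose tissue']
```

Applied to the notebook's data, the same helper could be mapped over a column, for example `df_news['title_entities'].map(parse_entities)`.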


In [33]:
# Set the displayed column width to None so that the full content of text columns is shown
pd.set_option('display.max_colwidth', None)

df_news.iloc[0]

news_id         N55528
category        lifestyle
sub_category    

In [45]:
df_news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51282 entries, 0 to 51281
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   news_id            51282 non-null  object
 1   category           51282 non-null  object
 2   sub_category       51282 non-null  object
 3   title              51282 non-null  object
 4   abstract           48616 non-null  object
 5   url                51282 non-null  object
 6   title_entities     51279 non-null  object
 7   abstract_entities  51278 non-null  object
dtypes: object(8)
memory usage: 3.1+ MB


In [48]:
df_news.describe()

Unnamed: 0,news_id,category,sub_category,title,abstract,url,title_entities,abstract_entities
count,51282,51282,51282,51282,48616,51282,51279,51278
unique,51282,17,264,50434,47309,51281,34472,36277
top,N55528,news,newsus,Photos of the Day,What's the weather today? What's the weather for the week? Here's your forecast.,[],[],[]
freq,1,15774,6564,15,124,2,13842,13825


## behaviors.tsv

The click histories and impression logs of users


- Impression ID. The ID of an impression.
- User ID. The anonymous ID of a user.
- Time. The impression time, in the format "MM/DD/YYYY HH:MM:SS AM/PM".
- History. The news click history (ID list of clicked news) of this user before this impression. The clicked news articles are ordered by time.
- Impressions. The list of news displayed in this impression and the user's click behavior on each item (1 for click, 0 for non-click). The order of the news within an impression has been shuffled.
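
The Time column is read as plain text, so it needs to be parsed before impressions can be compared or sorted chronologically. The sketch below uses the standard library; `TIME_FORMAT` and `parse_impression_time` are hypothetical names, and the format string is an assumed translation of "MM/DD/YYYY HH:MM:SS AM/PM" into `strptime` codes.

```python
from datetime import datetime

# strptime codes for "MM/DD/YYYY HH:MM:SS AM/PM" (12-hour clock with AM/PM).
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"

def parse_impression_time(text):
    """Convert one value of the Time column into a datetime object."""
    return datetime.strptime(text, TIME_FORMAT)

parsed = parse_impression_time("11/12/2019 6:11:30 PM")
print(parsed.isoformat())  # 2019-11-12T18:11:30
```

The same format string can also be passed to `pd.to_datetime` through its `format` argument when converting a whole column.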

In [26]:
import pandas as pd

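# Reset the displayed column width to its default value
pd.reset_option('display.max_colwidth')

# Path to the TSV file
tsv_file_path = 'behaviors.tsv'

# Set the column names manually
column_names = ["impression_id", "user_id", "time", "click_history", "impression_lpg"]

# Read the TSV file with pandas; the file has no header row, so the column names are passed explicitly
df_behaviors = pd.read_csv(tsv_file_path, sep='\t', header=None, names=column_names)

# Print the number of rows and show the first two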
print(len(df_behaviors))
df_behaviors.head(2)

156965


Unnamed: 0,impression_id,user_id,time,click_history,impression_lpg
0,1,U13740,11/11/2019 9:05:58 AM,N55189 N42782 N34694 N45794 N18445 N63302 N104...,N55689-1 N35729-0
1,2,U91836,11/12/2019 6:11:30 PM,N31739 N6072 N63045 N23979 N35656 N43353 N8129...,N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N...


In [27]:
df_behaviors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156965 entries, 0 to 156964
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   impression_id   156965 non-null  int64 
 1   user_id         156965 non-null  object
 2   time            156965 non-null  object
 3   click_history   153727 non-null  object
 4   impression_lpg  156965 non-null  object
dtypes: int64(1), object(4)
memory usage: 6.0+ MB


In [28]:
df_behaviors.describe()

Unnamed: 0,impression_id
count,156965.0
mean,78483.0
std,45312.036839
min,1.0
25%,39242.0
50%,78483.0
75%,117724.0
max,156965.0


In [29]:
# Set the displayed column width to None so that the full content of text columns is shown
pd.set_option('display.max_colwidth', None)

df_behaviors.iloc[0]

impression_id                                                                  1
user_id                                                                   U13740
time                                                       11/11/2019 9:05:58 AM
click_history     N55189 N42782 N34694 N45794 N18445 N63302 N10414 N19347 N31801
impression_lpg                                                 N55689-1 N35729-0
Name: 0, dtype: object
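
Both `click_history` and `impression_lpg` are space-separated strings, and each impression item has the form `<news ID>-<label>`. A minimal sketch of splitting the row above into Python structures follows; the helper names `parse_history` and `parse_impressions` are hypothetical.

```python
def parse_history(cell):
    """Split the space-separated click history into a list of news IDs."""
    # Rows without any click history are read by pandas as NaN (a float).
    if not isinstance(cell, str):
        return []
    return cell.split()

def parse_impressions(cell):
    """Turn "N55689-1 N35729-0" into [("N55689", 1), ("N35729", 0)]."""
    pairs = []
    for item in cell.split():
        # Split on the last hyphen: left part is the news ID, right part the label.
        news_id, label = item.rsplit("-", 1)
        pairs.append((news_id, int(label)))
    return pairs

history = parse_history("N55189 N42782 N34694 N45794 N18445 N63302 N10414 N19347 N31801")
impressions = parse_impressions("N55689-1 N35729-0")
clicked = [news_id for news_id, label in impressions if label == 1]
print(len(history), impressions, clicked)
# 9 [('N55689', 1), ('N35729', 0)] ['N55689']
```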

In [30]:
unique_user_ids = df_behaviors['user_id'].unique()
print(len(unique_user_ids))
unique_user_ids

50000


array(['U13740', 'U91836', 'U73700', ..., 'U43157', 'U66493', 'U72015'],
      dtype=object)

In [31]:
user_id_counts = df_behaviors['user_id'].value_counts()
print(user_id_counts)

user_id
U32146    62
U15740    44
U20833    41
U51286    40
U44201    40
          ..
U60416     1
U20588     1
U84385     1
U89164     1
U72015     1
Name: count, Length: 50000, dtype: int64


In [32]:
user_id_counts_df = user_id_counts.reset_index()
user_id_counts_df.columns = ['user_id', 'row_count']
print(user_id_counts_df)

      user_id  row_count
0      U32146         62
1      U15740         44
2      U20833         41
3      U51286         40
4      U44201         40
...       ...        ...
49995  U60416          1
49996  U20588          1
49997  U84385          1
49998  U89164          1
49999  U72015          1

[50000 rows x 2 columns]


In [43]:
user_id_to_check = 'U13740'
user_id_records = df_behaviors[df_behaviors['user_id'] == user_id_to_check]

In [45]:
user_id_records

Unnamed: 0,impression_id,user_id,time,click_history,impression_lpg
0,1,U13740,11/11/2019 9:05:58 AM,N55189 N42782 N34694 N45794 N18445 N63302 N10414 N19347 N31801,N55689-1 N35729-0
35262,35263,U13740,11/9/2019 5:59:43 AM,N55189 N42782 N34694 N45794 N18445 N63302 N10414 N19347 N31801,N20020-0 N3737-0 N43202-0 N18708-0 N30125-0 N349-0 N43388-0 N32260-0 N3491-0 N57972-0 N43370-0 N31602-0 N7128-0 N43001-0 N10359-0 N22600-0 N10602-0 N5889-0 N17650-0 N59761-0 N7891-0 N31025-0 N49879-0 N31748-0 N59457-0 N60374-0 N38330-0 N46567-0 N33291-0 N58075-0 N52649-0 N54300-0 N39707-0 N1080-0 N21428-0 N15361-0 N59931-0 N41400-0 N39115-0 N59893-0 N14522-0 N3449-0 N28091-0 N5442-0 N63835-0 N20147-0 N23060-0 N19099-0 N39587-0 N14884-0 N29749-0 N33981-0 N17087-0 N62647-0 N9019-0 N53343-0 N64228-0 N41615-0 N52875-0 N60186-0 N62801-0 N18378-0 N49953-0 N59143-0 N8555-0 N43373-0 N46175-0 N46092-0 N57099-0 N3590-0 N28684-0 N29069-0 N41178-0 N25437-0 N6825-0 N22816-0 N28324-0 N24180-0 N15020-0 N32268-0 N53017-0 N32399-0 N57005-0 N8519-0 N52294-0 N54634-0 N24272-0 N38488-0 N37377-0 N23784-0 N15134-0 N22664-0 N36964-0 N18595-0 N13423-0 N24104-0 N54283-0 N3727-0 N43587-0 N36086-0 N39949-0 N1215-0 N20567-0 N3035-0 N64634-0 N58890-0 N62838-0 N34451-0 N25546-0 N59508-0 N5615-0 N41569-0 N51049-0 N17340-0 N16280-0 N36836-0 N26376-0 N28934-0 N31002-0 N28788-0 N57592-0 N17331-0 N53617-0 N36509-0 N36707-0 N11821-0 N30151-0 N48578-0 N7795-0 N33347-0 N58638-0 N59407-0 N15117-0 N32846-0 N30295-0 N62443-0 N58553-0 N9457-0 N29536-0 N62070-0 N38783-0 N25703-0 N49527-0 N27714-0 N21086-0 N55743-0 N57187-0 N47267-0 N3599-0 N475-0 N51377-0 N45445-0 N24054-0 N123-0 N61862-0 N36865-0 N42488-0 N46213-0 N8399-0 N58574-0 N7035-0 N17859-0 N35290-0 N1480-0 N48510-0 N49908-0 N30473-0 N37250-0 N51398-0 N29857-0 N56154-0 N8170-0 N46057-0 N37108-0 N55558-0 N46597-0 N1914-0 N15960-0 N26514-0 N4912-0 N31688-0 N41309-0 N43671-0 N47061-0 N35106-0 N509-0 N60902-0 N44628-0 N55645-0 N35486-0 N26702-0 N14512-0 N25467-0 N45428-0 N21342-0 N55582-0 N60770-0 N62853-0 N57420-0 N44324-0 N48621-0 N9340-0 N41757-0 N27143-0 N56227-0 N13689-0 N13176-0 N28614-0 N4247-0 N39000-0 N63583-0 N499-0 N55714-0 N279-0 N1774-0 
N23760-0 N27101-0 N61652-0 N11379-0 N23575-0 N41881-0 N61140-0 N3580-0 N23873-0 N17350-0 N56396-0 N54287-0 N24983-0 N26130-0 N17681-0 N3894-0 N55940-0 N17000-0 N11484-0 N6268-0 N9702-0 N4080-0 N1936-0 N51752-0 N9617-0 N8913-0 N56665-0 N28910-1 N13648-0 N25764-0 N10116-0 N39112-0 N50329-0 N53296-0 N27583-0 N1019-0 N38906-0 N3256-0 N6926-0 N27845-0 N55434-0 N33575-0 N19300-0 N20926-0 N17913-0 N48476-0 N34930-0 N46046-0 N40817-0 N10969-0 N52694-0 N15855-0 N60291-0 N42336-0 N40123-0 N31281-0 N40356-0 N62762-0 N45462-0 N29017-0 N53008-0 N62391-0 N209-0 N37593-0 N61822-0 N60281-0 N30367-0 N43500-0 N61742-0 N49108-0 N9836-0 N1920-0 N44737-0
154836,154837,U13740,11/13/2019 3:27:40 PM,N55189 N42782 N34694 N45794 N18445 N63302 N10414 N19347 N31801,N13907-0 N8509-0 N47061-0 N51048-0 N22417-0 N35273-0 N33831-0 N64252-0 N18862-0 N58133-1 N56214-0 N37437-0 N38215-0 N4510-0 N14029-0 N17117-0 N51287-0 N24767-0 N21420-0 N41612-0 N62894-0 N34048-0 N36261-0 N36016-0 N21128-0 N4642-0
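
The three impressions for this user are not listed in chronological order: the row with impression ID 35263 is dated 11/9/2019, earlier than impression 1. A short sketch of ordering them by time, using the `(impression_id, time)` pairs copied from the rows above and an assumed `strptime` format string:

```python
from datetime import datetime

# Assumed strptime translation of "MM/DD/YYYY HH:MM:SS AM/PM".
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# (impression_id, time) pairs taken from the three rows above.
records = [
    (1, "11/11/2019 9:05:58 AM"),
    (35263, "11/9/2019 5:59:43 AM"),
    (154837, "11/13/2019 3:27:40 PM"),
]

# Sort by the parsed timestamp rather than by the raw string.
ordered = sorted(records, key=lambda rec: datetime.strptime(rec[1], TIME_FORMAT))
print([impression_id for impression_id, _ in ordered])  # [35263, 1, 154837]
```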


In [46]:
# Set the displayed column width to None so that the full content of text columns is shown
pd.set_option('display.max_colwidth', None)

user_id_records.iloc[0]

impression_id                                                                  1
user_id                                                                   U13740
time                                                       11/11/2019 9:05:58 AM
click_history     N55189 N42782 N34694 N45794 N18445 N63302 N10414 N19347 N31801
impression_lpg                                                 N55689-1 N35729-0
Name: 0, dtype: object

In [47]:
user_id_records.iloc[1]

impression_id                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

## entity_embedding.vec and relation_embedding.vec


The entity_embedding.vec and relation_embedding.vec files contain the 100-dimensional embeddings of the entities and relations, learned from a subgraph of the WikiData knowledge graph with the TransE method. In both files, the first column is the ID of the entity or relation, and the remaining columns are the values of the embedding vector. We hope this data can facilitate research on knowledge-aware news recommendation.
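
Given that layout (an ID followed by the vector values), a whole file can be read into a dictionary for lookup by ID. The following is a minimal sketch, not the dataset authors' loader: `load_embeddings` is a hypothetical helper, tab-separated columns are assumed, and the tiny temporary file with 3-dimensional vectors is made up purely to exercise it.

```python
import os
import tempfile

def load_embeddings(path):
    """Read a .vec file into a dict mapping ID -> list of floats."""
    embeddings = {}
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            parts = line.strip().split("\t")
            if len(parts) < 2:  # skip blank or malformed lines
                continue
            embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings

# Made-up file with two 3-dimensional vectors (the real files use 100 dimensions).
with tempfile.NamedTemporaryFile("w", suffix=".vec", delete=False, encoding="utf-8") as tmp:
    tmp.write("Q1\t0.1\t0.2\t0.3\n")
    tmp.write("Q2\t-0.5\t0.0\t0.5\n")
    tmp_path = tmp.name

toy_embeddings = load_embeddings(tmp_path)
os.remove(tmp_path)
print(toy_embeddings["Q2"])  # [-0.5, 0.0, 0.5]
```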

In [43]:
# Path to the .vec file
vec_file_path = 'entity_embedding.vec'

# Read the .vec file
with open(vec_file_path, 'r', encoding='utf-8') as file:

    # Read the embeddings line by line
    for line in file:
        line = line.strip()
        parts = line.split('\t')
        # The first column is the entity ID; the remaining columns are the vector values
        word = parts[0]
        vector = [float(value) for value in parts[1:]]
        print(f'Word: {word}, Vector: {vector}')

        # Stop reading once the entity of interest has been printed
        if word == 'Q41':
            break


Word: Q41, Vector: [-0.063388, -0.181451, 0.057501, -0.091254, -0.076217, -0.052525, 0.0505, -0.224871, -0.018145, 0.030722, 0.064276, 0.073063, 0.039489, 0.159404, -0.128784, 0.016325, 0.026797, 0.13709, 0.001849, -0.059103, 0.012091, 0.045418, 0.000591, 0.211337, -0.034093, -0.074582, 0.014004, -0.099355, 0.170144, 0.109376, -0.014797, 0.071172, 0.080375, 0.045563, -0.046462, 0.070108, 0.015413, -0.020874, -0.170324, -0.00113, 0.05981, 0.054342, 0.027358, -0.028995, -0.224508, 0.066281, -0.200006, 0.018186, 0.082396, 0.167178, -0.136239, 0.055134, -0.080195, -0.00146, 0.031078, -0.017084, -0.091176, -0.036916, 0.124642, -0.098185, -0.054836, 0.152483, -0.053712, 0.092816, -0.112044, -0.072247, -0.114896, -0.036541, -0.186339, -0.16061, 0.037342, -0.133474, 0.11008, 0.070678, -0.005586, -0.046667, -0.07201, 0.086424, 0.026165, 0.030561, 0.077888, -0.117226, 0.211597, 0.112512, 0.079999, -0.083398, -0.121117, 0.071751, -0.017654, -0.134979, -0.051949, 0.001861, 0.124535, -0.151043, -0.

In [44]:
# Path to the .vec file
vec_file_path = 'relation_embedding.vec'

# Read the .vec file
with open(vec_file_path, 'r', encoding='utf-8') as file:

    # Read the embeddings line by line
    for line in file:
        line = line.strip()
        parts = line.split('\t')
        # The first column is the relation ID; the remaining columns are the vector values
        word = parts[0]
        vector = [float(value) for value in parts[1:]]
        print(f'Word: {word}, Vector: {vector}')

        # Stop reading once the relation of interest has been printed
        if word == 'P31':
            break


Word: P31, Vector: [-0.073467, -0.132227, 0.034173, -0.032769, 0.008289, -0.107088, -0.031712, -0.039581, 0.101882, -0.106961, -0.053441, 0.068202, -0.045584, -0.140448, -0.079402, 0.001022, 0.059921, -0.06251, 0.102848, 0.077947, -0.063644, 0.05007, -0.01918, 0.064456, -0.052222, 0.071078, -0.036413, -0.039235, 0.137947, 0.067378, -0.137468, 0.103482, 0.121755, -0.006587, 0.063077, -0.024954, -0.0313, -0.056833, -0.139115, -0.05357, 0.165815, -0.022143, 0.006561, -0.108691, -0.149139, 0.080943, 0.054542, -0.034564, 0.082343, -0.095843, -0.068758, 0.01385, -0.025589, -0.012451, 0.116367, -0.066981, -0.006472, 0.136078, -0.057084, -0.066427, -0.035916, -0.028447, -0.070395, -0.052364, -0.040038, 0.037342, -0.073347, 0.112529, 0.106537, 0.107426, 0.086297, 0.085833, 0.054393, 0.053187, 0.066242, 0.058507, -0.04718, -0.086089, 0.050148, 0.053491, -0.04237, -0.110435, -0.058929, 0.063987, -0.037393, -0.057942, -0.032128, 0.141226, -0.106979, 0.072183, -0.045641, -0.050068, -0.053686, -0.04
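
TransE treats a relation as a translation in the embedding space: for a plausible triple (head, relation, tail), head + relation should land close to tail. The sketch below shows how entity and relation vectors like the ones printed above could be combined under that idea. The 3-dimensional vectors are made up for illustration, `transe_distance` is a hypothetical helper, and the use of the Euclidean norm is an assumption, since the source does not say which norm was used in training.

```python
import math

def transe_distance(head, relation, tail):
    """Euclidean norm of head + relation - tail (smaller means more plausible)."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

# Toy vectors, not taken from the dataset.
head = [0.1, 0.2, 0.3]
relation = [0.2, 0.0, -0.1]
good_tail = [0.3, 0.2, 0.2]   # approximately head + relation
bad_tail = [-0.3, 0.9, 0.0]   # far from head + relation

print(transe_distance(head, relation, good_tail) < transe_distance(head, relation, bad_tail))
# True
```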