Load the datasets and check the shapes of the books, users, and ratings data.

In [56]:
import pandas as pd

books = pd.read_csv('../data/BX-Books.csv', on_bad_lines='skip', sep=';', dtype={3: str}, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM',
                 'imageUrlL']
users = pd.read_csv('../data/BX-Users.csv', on_bad_lines='skip', sep=';', encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']
ratings = pd.read_csv('../data/BX-Book-Ratings.csv', on_bad_lines='skip', sep=';', encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

In [57]:
print(books.shape)
print(users.shape)
print(ratings.shape)

(271360, 8)
(278858, 3)
(1149780, 3)


Explore each dataset in turn, starting with the books dataset.

In [58]:
print(books.columns)

Index(['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher',
       'imageUrlS', 'imageUrlM', 'imageUrlL'],
      dtype='object')


The image URL columns are not needed for the analysis, so they can be dropped.

In [59]:
books.drop(['imageUrlS', 'imageUrlM', 'imageUrlL'], axis=1, inplace=True)
print(books.columns)  # columns remaining after the drop

Index(['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher'], dtype='object')


Next, check the data type of each column and correct missing and inconsistent entries.

In [60]:
print(books.dtypes)

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication    object
publisher            object
dtype: object


Start with the yearOfPublication column and check its unique values.

In [61]:
books.yearOfPublication.unique()

array(['2002', '2001', '1991', '1999', '2000', '1993', '1996', '1988',
       '2004', '1998', '1994', '2003', '1997', '1983', '1979', '1995',
       '1982', '1985', '1992', '1986', '1978', '1980', '1952', '1987',
       '1990', '1981', '1989', '1984', '0', '1968', '1961', '1958',
       '1974', '1976', '1971', '1977', '1975', '1965', '1941', '1970',
       '1962', '1973', '1972', '1960', '1966', '1920', '1956', '1959',
       '1953', '1951', '1942', '1963', '1964', '1969', '1954', '1950',
       '1967', '2005', '1957', '1940', '1937', '1955', '1946', '1936',
       '1930', '2011', '1925', '1948', '1943', '1947', '1945', '1923',
       '2020', '1939', '1926', '1938', '2030', '1911', '1904', '1949',
       '1932', '1928', '1929', '1927', '1931', '1914', '2050', '1934',
       '1910', '1933', '1902', '1924', '1921', '1900', '2038', '2026',
       '1944', '1917', '1901', '2010', '1908', '1906', '1935', '1806',
       '2021', '2012', '2006', 'DK Publishing Inc', 'Gallimard', '1909',
       

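yearOfPublication contains some incorrect entries:
- Because of errors in the csv file, the publisher names 'DK Publishing Inc' and 'Gallimard' were loaded into yearOfPublication.
- In addition, the years are stored as strings rather than numbers (the column was read with `dtype={3: str}`).
We will correct the affected rows and set the data type of yearOfPublication to int.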
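
One way to locate such entries without reading the whole array by eye is to coerce the column to numbers and look at what fails to parse. The following is a minimal sketch on a toy Series; the values and the names `years`, `numeric_years`, and `bad_entries` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for books.yearOfPublication (illustrative values only).
years = pd.Series(['2002', '1999', 'DK Publishing Inc', '0', 'Gallimard'])

# Coerce to numbers; anything that cannot be parsed becomes NaN.
numeric_years = pd.to_numeric(years, errors='coerce')

# Entries that were present but are not numeric.
bad_entries = years[numeric_years.isna() & years.notna()]
print(bad_entries.tolist())
```

On the real data, the same mask could be used to select the full rows for inspection, for example `books[mask]`.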

In [62]:
books.loc[books.yearOfPublication == 'DK Publishing Inc', :]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


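The output above shows that these rows were loaded with shifted columns: bookAuthor holds the year, yearOfPublication holds the publisher, and publisher holds an image URL. They need to be corrected by hand.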

In [63]:
# ISBN '0789466953'
books.loc[books.ISBN == '0789466953','yearOfPublication'] = 2000
books.loc[books.ISBN == '0789466953','bookAuthor'] = "James Buckley"
books.loc[books.ISBN == '0789466953','publisher'] = "DK Publishing Inc"
books.loc[books.ISBN == '0789466953','bookTitle'] = "DK Readers: Creating the X-Men, How Comic Books Come to Life (Level 4: Proficient Readers)"

#ISBN '078946697X'
books.loc[books.ISBN == '078946697X','yearOfPublication'] = 2000
books.loc[books.ISBN == '078946697X','bookAuthor'] = "Michael Teitelbaum"
books.loc[books.ISBN == '078946697X','publisher'] = "DK Publishing Inc"
books.loc[books.ISBN == '078946697X','bookTitle'] = "DK Readers: Creating the X-Men, How It All Began (Level 4: Proficient Readers)"

books.loc[(books.ISBN == '0789466953') | (books.ISBN == '078946697X'),:]


Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",Michael Teitelbaum,2000,DK Publishing Inc
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",James Buckley,2000,DK Publishing Inc


Next, convert yearOfPublication to a numeric type.

In [64]:
books.yearOfPublication=pd.to_numeric(books.yearOfPublication, errors='coerce')
sorted(books['yearOfPublication'].unique())

[np.float64(0.0),
 np.float64(1376.0),
 np.float64(1378.0),
 np.float64(1806.0),
 np.float64(1897.0),
 np.float64(1900.0),
 np.float64(1901.0),
 np.float64(1902.0),
 np.float64(1904.0),
 np.float64(1906.0),
 np.float64(1908.0),
 np.float64(1909.0),
 np.float64(1910.0),
 np.float64(1911.0),
 np.float64(1914.0),
 np.float64(1917.0),
 np.float64(1919.0),
 np.float64(1920.0),
 np.float64(1921.0),
 np.float64(1922.0),
 np.float64(1923.0),
 np.float64(1924.0),
 np.float64(1925.0),
 np.float64(1926.0),
 np.float64(1927.0),
 np.float64(1928.0),
 np.float64(1929.0),
 np.float64(1930.0),
 np.float64(1931.0),
 np.float64(1932.0),
 np.float64(1933.0),
 np.float64(1934.0),
 np.float64(1935.0),
 np.float64(1936.0),
 np.float64(1937.0),
 np.float64(1938.0),
 np.float64(1939.0),
 np.float64(1940.0),
 np.float64(1941.0),
 np.float64(1942.0),
 np.float64(1943.0),
 np.float64(1944.0),
 np.float64(1945.0),
 np.float64(1946.0),
 np.float64(1947.0),
 np.float64(1948.0),
 np.float64(1949.0),
 np.float64(1950

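yearOfPublication is now of type float64, with values ranging from 0 to 2050.
Since the dataset was built in 2004, I treat all years after 2006 as invalid, leaving a two-year margin in case the dataset was updated.
All invalid entries (including 0) are converted to NaN and then replaced with the mean of the remaining years.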

In [65]:
import numpy as np

books.loc[(books.yearOfPublication > 2006) | (books.yearOfPublication == 0), 'yearOfPublication'] = np.nan


Replace the NaNs with the mean year of publication, rounded to the nearest integer.

In [66]:
books.yearOfPublication = books.yearOfPublication.fillna(round(books.yearOfPublication.mean()))
books.yearOfPublication.isnull().sum()

np.int64(0)

Reset the dtype to int32.

In [67]:
books.yearOfPublication = books.yearOfPublication.astype(np.int32)

Moving on to the publisher column: it has two NaN values, which I end up replacing with 'other'.

In [68]:
books.loc[books.publisher.isnull(),:]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,


Investigate the rows with NaN publishers, starting with the title Tyrant Moon, to see whether they offer any clue.

In [69]:
books.loc[(books.bookTitle == 'Tyrant Moon'),:]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,


Check the rows with the title Finders Keepers to see whether they offer any clue.

In [70]:
books.loc[(books.bookTitle == 'Finders Keepers'),:]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
10799,082177364X,Finders Keepers,Fern Michaels,2002,Zebra Books
42019,0070465037,Finders Keepers,Barbara Nickolae,1989,McGraw-Hill Companies
58264,0688118461,Finders Keepers,Emily Rodda,1993,Harpercollins Juvenile Books
66678,1575663236,Finders Keepers,Fern Michaels,1998,Kensington Publishing Corporation
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,
134309,0156309505,Finders Keepers,Will,1989,Voyager Books
173473,0973146907,Finders Keepers,Sean M. Costello,2002,Red Tower Publications
195885,0061083909,Finders Keepers,Sharon Sala,2003,HarperTorch
211874,0373261160,Finders Keepers,Elizabeth Travis,1993,Worldwide Library


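These rows all have different publishers and authors, so there is no clue here. Next, check by book author to look for a pattern.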

In [71]:
books.loc[(books.bookAuthor == 'Elaine Corvidae'),:]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
126762,1931696934,Winter's Orphans,Elaine Corvidae,2001,Novelbooks
128890,193169656X,Tyrant Moon,Elaine Corvidae,2002,
129001,0759901880,Wolfkin,Elaine Corvidae,2001,Hard Shell Word Factory


Check by book author for the second book as well.

In [72]:
books.loc[(books.bookAuthor == 'Linnea Sinclair'),:]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
129037,1931696993,Finders Keepers,Linnea Sinclair,2001,


Since the rows have nothing in common from which the missing publishers could be inferred, replace the NaNs with 'other'.

In [73]:
books.loc[(books.ISBN == '193169656X'),'publisher'] = 'other'
books.loc[(books.ISBN == '1931696993'),'publisher'] = 'other'
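
The two assignments above target the rows by ISBN. An equivalent and slightly more general option, sketched here on a toy frame, is to fill every missing publisher in one call. The frame `toy_books` and its values are hypothetical and only stand in for `books`.

```python
import pandas as pd

# Toy stand-in for the books frame (illustrative values only).
toy_books = pd.DataFrame({
    'ISBN': ['A1', 'A2', 'A3'],
    'publisher': ['Zebra Books', None, None],
})

# Replace every missing publisher with the placeholder used in the notebook.
toy_books['publisher'] = toy_books['publisher'].fillna('other')
print(toy_books)
```

This only matches the notebook's result as long as the two ISBNs above are the only rows with a missing publisher.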

Now explore the users dataset, first checking its shape, first few rows, and data types.

In [74]:
print (users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [75]:
users.dtypes

userID        int64
Location     object
Age         float64
dtype: object

In [76]:
users.userID.values

array([     1,      2,      3, ..., 278856, 278857, 278858])

In [77]:
sorted(users.Age.unique())

[np.float64(nan),
 np.float64(0.0),
 np.float64(1.0),
 np.float64(2.0),
 np.float64(3.0),
 np.float64(4.0),
 np.float64(5.0),
 np.float64(6.0),
 np.float64(7.0),
 np.float64(8.0),
 np.float64(9.0),
 np.float64(10.0),
 np.float64(11.0),
 np.float64(12.0),
 np.float64(13.0),
 np.float64(14.0),
 np.float64(15.0),
 np.float64(16.0),
 np.float64(17.0),
 np.float64(18.0),
 np.float64(19.0),
 np.float64(20.0),
 np.float64(21.0),
 np.float64(22.0),
 np.float64(23.0),
 np.float64(24.0),
 np.float64(25.0),
 np.float64(26.0),
 np.float64(27.0),
 np.float64(28.0),
 np.float64(29.0),
 np.float64(30.0),
 np.float64(31.0),
 np.float64(32.0),
 np.float64(33.0),
 np.float64(34.0),
 np.float64(35.0),
 np.float64(36.0),
 np.float64(37.0),
 np.float64(38.0),
 np.float64(39.0),
 np.float64(40.0),
 np.float64(41.0),
 np.float64(42.0),
 np.float64(43.0),
 np.float64(44.0),
 np.float64(45.0),
 np.float64(46.0),
 np.float64(47.0),
 np.float64(48.0),
 np.float64(49.0),
 np.float64(50.0),
 np.float64(51.0),
 np.

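After checking the unique values, userID looks correct. The Age column, however, has NaNs and some very high values. In my view, ages below 5 and above 90 do not make much sense, so they are replaced with NaN. All NaNs are then replaced with the mean of Age, and the data type is set to int.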
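
The three steps just described can be sketched on a toy Series as follows. The values and the names `ages`, `valid`, and `cleaned` are illustrative assumptions; the bounds 5 and 90 are the ones stated above, and both endpoints are kept.

```python
import numpy as np
import pandas as pd

# Toy stand-in for users.Age (illustrative values only).
ages = pd.Series([np.nan, 0.0, 18.0, 17.0, 244.0, 35.0])

# Keep ages in [5, 90]; everything else, including NaN, becomes NaN.
valid = ages.where(ages.between(5, 90))

# Fill the gaps with the mean of the valid ages, then truncate to integers.
cleaned = valid.fillna(valid.mean()).astype(np.int32)
print(cleaned.tolist())
```

Note that `astype(np.int32)` truncates the fractional part of the mean rather than rounding it.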

In [78]:
users.loc[(users.Age > 90) | (users.Age < 5), 'Age'] = np.nan

Replace NaN with the mean.

In [79]:
users.Age = users.Age.fillna(users.Age.mean())

Set the data type to int.

In [80]:
users.Age = users.Age.astype(np.int32)
sorted(users.Age.unique())

[np.int32(5),
 np.int32(6),
 np.int32(7),
 np.int32(8),
 np.int32(9),
 np.int32(10),
 np.int32(11),
 np.int32(12),
 np.int32(13),
 np.int32(14),
 np.int32(15),
 np.int32(16),
 np.int32(17),
 np.int32(18),
 np.int32(19),
 np.int32(20),
 np.int32(21),
 np.int32(22),
 np.int32(23),
 np.int32(24),
 np.int32(25),
 np.int32(26),
 np.int32(27),
 np.int32(28),
 np.int32(29),
 np.int32(30),
 np.int32(31),
 np.int32(32),
 np.int32(33),
 np.int32(34),
 np.int32(35),
 np.int32(36),
 np.int32(37),
 np.int32(38),
 np.int32(39),
 np.int32(40),
 np.int32(41),
 np.int32(42),
 np.int32(43),
 np.int32(44),
 np.int32(45),
 np.int32(46),
 np.int32(47),
 np.int32(48),
 np.int32(49),
 np.int32(50),
 np.int32(51),
 np.int32(52),
 np.int32(53),
 np.int32(54),
 np.int32(55),
 np.int32(56),
 np.int32(57),
 np.int32(58),
 np.int32(59),
 np.int32(60),
 np.int32(61),
 np.int32(62),
 np.int32(63),
 np.int32(64),
 np.int32(65),
 np.int32(66),
 np.int32(67),
 np.int32(68),
 np.int32(69),
 np.int32(70),
 np.int32(71),


I have not processed the Location column here. If you wish, you can split it further into city, state, and country, and apply some text processing to it.
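
A minimal sketch of such a split, assuming each Location value has the form "city, state, country" as in the first rows shown earlier. The column names `city`, `state`, and `country` are hypothetical, and real entries with missing or extra parts would need additional handling.

```python
import pandas as pd

# Toy stand-in for users.Location, using values shown in users.head().
locations = pd.Series([
    'nyc, new york, usa',
    'stockton, california, usa',
    'porto, v.n.gaia, portugal',
])

# Split on the first two commas only, so anything after them stays in the last part.
parts = locations.str.split(',', n=2, expand=True)
parts.columns = ['city', 'state', 'country']

# Remove the spaces left over after each comma.
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```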

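Ratings dataset
We check the shape and first few rows of the ratings dataset. It shows that our user-book rating matrix will be very sparse, because the number of actual ratings is very small compared with the size of the matrix (number of users × number of books).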

In [81]:
ratings.shape

(1149780, 3)

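If every user rated every book, the ratings dataset would have n_users * n_books entries. The actual number of rows is far smaller, which shows that the dataset is very sparse.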

In [82]:
n_users = users.shape[0]
n_books = books.shape[0]
print (n_users * n_books)

75670906880


In [83]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [84]:
ratings.bookRating.unique()

array([ 0,  5,  3,  6,  8,  7, 10,  9,  4,  1,  2])

Unless new books have been added to the books dataset, the ratings dataset should only contain books that exist in our books dataset.

In [85]:
ratings_new = ratings[ratings.ISBN.isin(books.ISBN)]
print (ratings.shape)
print (ratings_new.shape)

(1149780, 3)
(1031136, 3)
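
To see which ratings are dropped by such a filter, the boolean mask can be kept and inverted. This is a small sketch on toy frames; `toy_books`, `toy_ratings`, and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for books and ratings (illustrative values only).
toy_books = pd.DataFrame({'ISBN': ['111', '222']})
toy_ratings = pd.DataFrame({
    'userID': [1, 1, 2, 3],
    'ISBN': ['111', '999', '222', '888'],
    'bookRating': [5, 0, 8, 7],
})

# True where the rated ISBN exists in the books frame.
known = toy_ratings['ISBN'].isin(toy_books['ISBN'])

kept = toy_ratings[known]       # ratings for known books
dropped = toy_ratings[~known]   # ratings that refer to unknown ISBNs
print(len(kept), len(dropped))
```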


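As the shapes show, many rows whose ISBN is not part of the books dataset were dropped. Likewise, unless new users have been added to the users dataset, the ratings dataset should only contain ratings from users in the users dataset.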

In [86]:
ratings = ratings[ratings.userID.isin(users.userID)]
print (ratings.shape)
print (ratings_new.shape)

(1149780, 3)
(1031136, 3)


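No new users were added (the shape of ratings is unchanged), so we continue with the filtered dataset ratings_new, of shape (1031136, 3).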

In [87]:
print ("number of users: " + str(n_users))
print ("number of books: " + str(n_books))

number of users: 278858
number of books: 271360


Clearly, users have rated some books that are not part of the original books dataset. The sparsity of the dataset can be computed as follows:

In [88]:
sparsity=1.0-len(ratings_new)/float(n_users*n_books)
print ('The sparsity level of the Book-Crossing dataset is ' +  str(sparsity*100) + ' %')

The sparsity level of the Book-Crossing dataset is 99.99863734155898 %
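
Written out with the numbers printed earlier, the computation is the following, where the symbols $R$, $U$, and $B$ are introduced here only for illustration and denote the filtered ratings, the users, and the books:

$$
\text{sparsity} = 1 - \frac{|R|}{|U| \cdot |B|} = 1 - \frac{1031136}{278858 \times 271360} = 1 - \frac{1031136}{75670906880} \approx 0.9999864
$$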


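The explicit ratings, represented by 1-10, and the implicit ratings, represented by 0, now have to be separated. We will use only the explicit ratings to build our book recommender system. Similarly, users are split into those who gave explicit ratings and those whose implicit behavior was recorded.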

In [89]:
ratings.bookRating.unique()

array([ 0,  5,  3,  6,  8,  7, 10,  9,  4,  1,  2])

The ratings dataset is therefore split into an explicit part and an implicit part.

In [90]:
ratings_explicit = ratings_new[ratings_new.bookRating != 0]
ratings_implicit = ratings_new[ratings_new.bookRating == 0]
print (ratings_new.shape)
print( ratings_explicit.shape)
print (ratings_implicit.shape)

(1031136, 3)
(383842, 3)
(647294, 3)


A count plot of bookRating shows that higher ratings are more common among users, with a rating of 8 given most often.

In [91]:
from matplotlib import pyplot as plt
import seaborn as sns

sns.countplot(data=ratings_explicit , x='bookRating')
plt.show()

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

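## Simple popularity-based recommender system
At this point, a simple popularity-based recommender can be built from the explicit user ratings of each book. The code below ranks books by the sum of bookRating, so both the number of ratings and their values contribute. J. K. Rowling's Harry Potter and the Sorcerer's Stone is among the most popular titles.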

In [92]:
ratings_count = pd.DataFrame(ratings_explicit.groupby(['ISBN'])['bookRating'].sum())
top10 = ratings_count.sort_values('bookRating', ascending = False).head(10)
print ("The following books are recommended")
top10.merge(books, left_index = True, right_on = 'ISBN')

The following books are recommended


Unnamed: 0,bookRating,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
408,5787,0316666343,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown"
748,4108,0385504209,The Da Vinci Code,Dan Brown,2003,Doubleday
522,3134,0312195516,The Red Tent (Bestselling Backlist),Anita Diamant,1998,Picador USA
2143,2798,059035342X,Harry Potter and the Sorcerer's Stone (Harry P...,J. K. Rowling,1999,Arthur A. Levine Books
356,2595,0142001740,The Secret Life of Bees,Sue Monk Kidd,2003,Penguin Books
26,2551,0971880107,Wild Animus,Rich Shapero,2004,Too Far
1105,2524,0060928336,Divine Secrets of the Ya-Ya Sisterhood: A Novel,Rebecca Wells,1997,Perennial
706,2402,0446672211,Where the Heart Is (Oprah's Book Club (Paperba...,Billie Letts,1998,Warner Books
231,2219,0452282152,Girl with a Pearl Earring,Tracy Chevalier,2001,Plume Books
118,2179,0671027360,Angels &amp; Demons,Dan Brown,2001,Pocket Star
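
Ranking by the sum of ratings and ranking by the number of ratings can give different orders. The sketch below illustrates this on a toy frame; the frame `toy` and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for ratings_explicit (illustrative values only).
toy = pd.DataFrame({
    'ISBN': ['A', 'A', 'A', 'B', 'B', 'C'],
    'bookRating': [2, 2, 2, 10, 9, 10],
})

# Per-book sum, count, and mean of the ratings.
stats = toy.groupby('ISBN')['bookRating'].agg(['sum', 'count', 'mean'])

by_sum = stats.sort_values('sum', ascending=False).index.tolist()
by_count = stats.sort_values('count', ascending=False).index.tolist()
print(by_sum, by_count)
```

Book A is rated most often but poorly, so it leads the count ranking and comes last in the sum ranking. Which measure better reflects popularity is a design choice.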


Similarly, separate the users who gave explicit ratings from 1 to 10 from those whose implicit behavior was tracked.

In [93]:
users_exp_ratings = users[users.userID.isin(ratings_explicit.userID)]
users_imp_ratings = users[users.userID.isin(ratings_implicit.userID)]
print (users.shape)
print (users_exp_ratings.shape)
print (users_imp_ratings.shape)

(278858, 3)
(68091, 3)
(52451, 3)


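## Collaborative filtering-based recommender system
To stay within the computing power of my machine and reduce the dataset size, I consider only users who have rated at least 100 books and, as intended, books that have at least 100 ratings. Note that the second filter below is applied to the bookRating column, so it keeps rating values that occur at least 100 times rather than books with at least 100 ratings. This is why the matrix built later still has 66574 book columns.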

In [94]:
counts1 = ratings_explicit['userID'].value_counts()
ratings_explicit = ratings_explicit[ratings_explicit['userID'].isin(counts1[counts1 >= 100].index)]
counts = ratings_explicit['bookRating'].value_counts()
ratings_explicit = ratings_explicit[ratings_explicit['bookRating'].isin(counts[counts >= 100].index)]
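
The second filter in the cell above selects on bookRating values. If the intent is to keep books with a minimum number of ratings, the count would have to be taken per ISBN. The following is a sketch of that variant on toy data; the helper `filter_min_counts` and its parameters are hypothetical, and applying it to the real data would change the shapes printed later in this notebook.

```python
import pandas as pd

def filter_min_counts(df, min_user_ratings, min_book_ratings):
    """Keep active users first, then books with enough ratings among them."""
    user_counts = df['userID'].value_counts()
    df = df[df['userID'].isin(user_counts[user_counts >= min_user_ratings].index)]
    # Count per ISBN (not per rating value) after the user filter.
    book_counts = df['ISBN'].value_counts()
    return df[df['ISBN'].isin(book_counts[book_counts >= min_book_ratings].index)]

# Toy stand-in for ratings_explicit (illustrative values only).
toy = pd.DataFrame({
    'userID': [1, 1, 1, 2, 2, 3],
    'ISBN': ['A', 'B', 'C', 'A', 'B', 'A'],
    'bookRating': [5, 7, 9, 6, 8, 10],
})
filtered = filter_min_counts(toy, min_user_ratings=2, min_book_ratings=2)
print(filtered)
```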

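Generate the ratings matrix from the explicit ratings table. The next key step in building a CF-based recommender system is to generate the user-item rating matrix from the ratings table.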

In [95]:
ratings_matrix = ratings_explicit.pivot(index='userID', columns='ISBN', values='bookRating')
userID = ratings_matrix.index
ISBN = ratings_matrix.columns
print(ratings_matrix.shape)
ratings_matrix.head()

(449, 66574)


ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2033,,,,,,,,,,,...,,,,,,,,,,
2110,,,,,,,,,,,...,,,,,,,,,,
2276,,,,,,,,,,,...,,,,,,,,,,
4017,,,,,,,,,,,...,,,,,,,,,,
4385,,,,,,,,,,,...,,,,,,,,,,
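
What the pivot does is easier to see on a tiny example. The frame `toy` below is hypothetical; each (userID, ISBN) pair becomes one cell, and pairs with no rating become NaN.

```python
import pandas as pd

# Toy stand-in for ratings_explicit (illustrative values only).
toy = pd.DataFrame({
    'userID': [1, 1, 2],
    'ISBN': ['A', 'B', 'A'],
    'bookRating': [5, 7, 9],
})

# Rows are users, columns are books, values are ratings.
matrix = toy.pivot(index='userID', columns='ISBN', values='bookRating')
print(matrix)
```

`pivot` raises a ValueError if the same (userID, ISBN) pair appears more than once, so it also acts as a check that each user rated each book at most once.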


In [96]:
n_users = ratings_matrix.shape[0]  # only users who gave explicit ratings are considered
n_books = ratings_matrix.shape[1]
print (n_users, n_books)

449 66574


Since the training algorithms cannot handle NaN, replace the NaNs with 0, which indicates no rating, and set the data type.

In [97]:
ratings_matrix.fillna(0, inplace = True)
ratings_matrix = ratings_matrix.astype(np.int32)
ratings_matrix.head(5)

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2033,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2110,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2276,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4017,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4385,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


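Recheck the sparsity. Note that the denominator below uses users_exp_ratings.shape[0] (68091 users) rather than the 449 users in ratings_matrix, so the printed value does not describe the 449 × 66574 matrix itself.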

In [98]:
sparsity=1.0-len(ratings_explicit)/float(users_exp_ratings.shape[0]*n_books)
print ('The sparsity level of the Book-Crossing dataset is ' +  str(sparsity*100) + ' %')

The sparsity level of the Book-Crossing dataset is 99.99772184106935 %
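
The sparsity of a filled ratings matrix can also be measured directly from the matrix, which avoids any mismatch between the counts used in the numerator and the denominator. This is a sketch on a toy matrix; `toy_matrix` and its values are hypothetical, and 0 is taken to mean "no rating" as in the cell that fills the NaNs.

```python
import pandas as pd

# Toy stand-in for ratings_matrix after fillna(0) (illustrative values only).
toy_matrix = pd.DataFrame(
    [[5, 0, 0], [0, 0, 8]],
    index=[1, 2], columns=['A', 'B', 'C'],
)

# Number of cells that hold an actual rating.
n_rated = int((toy_matrix != 0).to_numpy().sum())

# Share of empty cells among all user-book pairs in the matrix.
matrix_sparsity = 1.0 - n_rated / toy_matrix.size
print(matrix_sparsity)
```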
