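# Data Source
The data comes from a retail store and is a sample of the transactions made there. The retailer wants to better understand how customers' purchasing behavior varies across products, and to use the other known variables to predict how much a customer will spend.

From another angle, the same data could also be used to predict a customer's gender or age, or even a product's category.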

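# Questions
- 1. How does purchasing power differ between men and women?
- 2. How does purchasing power differ across age groups?
- 3. Does marital status affect purchasing power?
- 4. How does purchasing power differ across occupations?
- 5. How does spending differ across the three city categories?
- 6. Does the number of years spent in the current city affect purchasing power?
- 7. How does spending differ overall, which product categories are more popular, and which individual products sell best?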
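
Most of these questions reduce to grouping the transactions by one column and aggregating `Purchase`. The following is a minimal sketch of that pattern, assuming a toy DataFrame that reuses the dataset's column names; the rows are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy rows using the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "Gender": ["M", "M", "F", "F", "M"],
    "Age": ["26-35", "18-25", "26-35", "0-17", "26-35"],
    "Purchase": [8000, 12000, 7000, 9000, 10000],
})

# Mean and total purchase per gender (question 1).
by_gender = toy.groupby("Gender")["Purchase"].agg(["mean", "sum"])

# Mean purchase per age group (question 2).
by_age = toy.groupby("Age")["Purchase"].mean()

print(by_gender)
print(by_age)
```

The same `groupby` call with `Occupation`, `City_Category`, `Marital_Status`, or `Stay_In_Current_City_Years` as the key would cover questions 3 to 6.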


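# Data Cleaning

## Column Descriptions

User_ID: user ID

Product_ID: product ID

Gender: gender (M = male, F = female)

Age: age group (7 groups: 0-17, 18-25, 26-35, 36-45, 46-50, 51-55, 55+)

Occupation: occupation (encoded as a number; 20 occupations in total)

City_Category: city category (three categories: A, B, C)

Stay_In_Current_City_Years: years lived in the current city (5 values: 0, 1, 2, 3, 4+)

Marital_Status: marital status (0 = unmarried, 1 = married)

Product_Category_1: first product category; never null

Product_Category_2: second product category

Product_Category_3: third product category

Purchase: purchase amount (in US dollars)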

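# Data Analysis
## Loading the Data
- The dataset has 537577 rows and 12 columns; each row holds 12 attributes.
- We look at the first 5 rows to get a first impression of the data.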

In [23]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
data = pd.read_csv('./BlackFriday.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 537577 entries, 0 to 537576
Data columns (total 12 columns):
User_ID                       537577 non-null int64
Product_ID                    537577 non-null object
Gender                        537577 non-null object
Age                           537577 non-null object
Occupation                    537577 non-null int64
City_Category                 537577 non-null object
Stay_In_Current_City_Years    537577 non-null object
Marital_Status                537577 non-null int64
Product_Category_1            537577 non-null int64
Product_Category_2            370591 non-null float64
Product_Category_3            164278 non-null float64
Purchase                      537577 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 49.2+ MB


In [24]:
data.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,1000001,P00069042,F,0-17,10,A,2,0,3,,,8370
1,1000001,P00248942,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,1000001,P00087842,F,0-17,10,A,2,0,12,,,1422
3,1000001,P00085442,F,0-17,10,A,2,0,12,14.0,,1057
4,1000002,P00285442,M,55+,16,C,4+,0,8,,,7969


In [25]:
# Check whether the dataset contains missing values
data.isnull().any()

User_ID                       False
Product_ID                    False
Gender                        False
Age                           False
Occupation                    False
City_Category                 False
Stay_In_Current_City_Years    False
Marital_Status                False
Product_Category_1            False
Product_Category_2             True
Product_Category_3             True
Purchase                      False
dtype: bool

In [26]:
# Product_Category_2 and Product_Category_3 contain missing values (NaN)
# Fill the missing values with 0
data.fillna(value=0,inplace=True)

In [27]:
data.isnull().any()

User_ID                       False
Product_ID                    False
Gender                        False
Age                           False
Occupation                    False
City_Category                 False
Stay_In_Current_City_Years    False
Marital_Status                False
Product_Category_1            False
Product_Category_2            False
Product_Category_3            False
Purchase                      False
dtype: bool
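
To get a sense of how much data the zero-fill touches, the missing share of each column can be worked out from the non-null counts in the `data.info()` output above. This is a small arithmetic sketch; the variable names are illustrative.

```python
# Row count and non-null counts copied from the data.info() output above.
total_rows = 537577
non_null = {"Product_Category_2": 370591, "Product_Category_3": 164278}

missing_share = {}
for column, count in non_null.items():
    missing = total_rows - count
    missing_share[column] = missing / total_rows
    print(f"{column}: {missing} missing ({missing_share[column]:.1%})")
```

Roughly 31% of `Product_Category_2` and 69% of `Product_Category_3` are missing, so after filling, the value 0 effectively acts as a "no category" label in those columns. On the unfilled DataFrame, `data.isnull().sum()` gives the same counts directly.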

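## Processing the Data
- Most of the features are discrete, which makes a random forest estimator a reasonable fit for this data.
- Because the target, Purchase, is continuous, we use a random forest regressor.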

In [28]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [29]:
# Use LabelEncoder to turn string and mixed-type ID columns into integer labels
transfer = LabelEncoder()
data['User_ID'] = transfer.fit_transform(data['User_ID'])
data['Product_ID'] = transfer.fit_transform(data['Product_ID'])
data.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,0,670,F,0-17,10,A,2,0,3,0.0,0.0,8370
1,0,2374,F,0-17,10,A,2,0,1,6.0,14.0,15200
2,0,850,F,0-17,10,A,2,0,12,0.0,0.0,1422
3,0,826,F,0-17,10,A,2,0,12,14.0,0.0,1057
4,1,2732,M,55+,16,C,4+,0,8,0.0,0.0,7969


In [30]:
# Encode Gender as a binary integer (M -> 0, F -> 1)
data['Gender'] = data['Gender'].map({'M': 0, 'F': 1})
data.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
0,0,670,1,0-17,10,A,2,0,3,0.0,0.0,8370
1,0,2374,1,0-17,10,A,2,0,1,6.0,14.0,15200
2,0,850,1,0-17,10,A,2,0,12,0.0,0.0,1422
3,0,826,1,0-17,10,A,2,0,12,14.0,0.0,1057
4,1,2732,0,55+,16,C,4+,0,8,0.0,0.0,7969


In [31]:
# One-hot encode the Age groups
data_Age = pd.get_dummies(data.Age)
data_Age.head()

Unnamed: 0,0-17,18-25,26-35,36-45,46-50,51-55,55+
0,1,0,0,0,0,0,0
1,1,0,0,0,0,0,0
2,1,0,0,0,0,0,0
3,1,0,0,0,0,0,0
4,0,0,0,0,0,0,1


In [32]:
# Concatenate the one-hot columns onto the main table
data = pd.concat((data,data_Age),axis=1)
data.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase,0-17,18-25,26-35,36-45,46-50,51-55,55+
0,0,670,1,0-17,10,A,2,0,3,0.0,0.0,8370,1,0,0,0,0,0,0
1,0,2374,1,0-17,10,A,2,0,1,6.0,14.0,15200,1,0,0,0,0,0,0
2,0,850,1,0-17,10,A,2,0,12,0.0,0.0,1422,1,0,0,0,0,0,0
3,0,826,1,0-17,10,A,2,0,12,14.0,0.0,1057,1,0,0,0,0,0,0
4,1,2732,0,55+,16,C,4+,0,8,0.0,0.0,7969,0,0,0,0,0,0,1


In [33]:
# One-hot encode the remaining categorical columns
data_City = pd.get_dummies(data.City_Category)
data_City_Years = pd.get_dummies(data.Stay_In_Current_City_Years)

In [34]:
data = pd.concat([data,data_City,data_City_Years],axis=1)
data.head()

Unnamed: 0,User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,...,51-55,55+,A,B,C,0,1,2,3,4+
0,0,670,1,0-17,10,A,2,0,3,0.0,...,0,0,1,0,0,0,0,1,0,0
1,0,2374,1,0-17,10,A,2,0,1,6.0,...,0,0,1,0,0,0,0,1,0,0
2,0,850,1,0-17,10,A,2,0,12,0.0,...,0,0,1,0,0,0,0,1,0,0
3,0,826,1,0-17,10,A,2,0,12,14.0,...,0,0,1,0,0,0,0,1,0,0
4,1,2732,0,55+,16,C,4+,0,8,0.0,...,0,1,0,0,1,0,0,0,0,1


In [35]:
# Drop the original categorical columns, which are now redundant
data.drop(['Age','City_Category','Stay_In_Current_City_Years'],axis=1,inplace=True)

In [36]:
# Inspect the data again
data.head()

Unnamed: 0,User_ID,Product_ID,Gender,Occupation,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase,0-17,...,51-55,55+,A,B,C,0,1,2,3,4+
0,0,670,1,10,0,3,0.0,0.0,8370,1,...,0,0,1,0,0,0,0,1,0,0
1,0,2374,1,10,0,1,6.0,14.0,15200,1,...,0,0,1,0,0,0,0,1,0,0
2,0,850,1,10,0,12,0.0,0.0,1422,1,...,0,0,1,0,0,0,0,1,0,0
3,0,826,1,10,0,12,14.0,0.0,1057,1,...,0,0,1,0,0,0,0,1,0,0
4,1,2732,0,16,0,8,0.0,0.0,7969,0,...,0,1,0,0,1,0,0,0,0,1


In [37]:
data.shape

(537577, 24)
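
The encode, concatenate, and drop steps above can also be sketched as a single `pd.get_dummies` call with the `columns` argument. The toy frame below is an assumption for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Toy rows with the three categorical columns; values are illustrative only.
toy = pd.DataFrame({
    "Age": ["0-17", "55+", "26-35"],
    "City_Category": ["A", "C", "B"],
    "Stay_In_Current_City_Years": ["2", "4+", "1"],
    "Purchase": [8370, 7969, 1422],
})

# columns= encodes the listed columns, drops the originals, and prefixes
# each new column with the name of its source column.
encoded = pd.get_dummies(
    toy,
    columns=["Age", "City_Category", "Stay_In_Current_City_Years"],
    dtype=int,
)
print(encoded.columns.tolist())
```

One side effect of this form is that the new columns carry prefixes such as `Age_55+` or `City_Category_A`, which are easier to tell apart than bare names like `0`, `1`, or `A`.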

In [38]:
# The full dataset is large, so randomly sample 2% of the rows
data = data.sample(frac=0.02,random_state=100)
data.shape

(10752, 24)

In [43]:
# Split into features and target, then into training and test sets
x = data.drop(['Purchase'], axis=1)
y = data['Purchase']
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=100)
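
The split above passes no `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows. A quick sketch of what that means for the 10752 sampled rows, using a stand-in array rather than the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 10752 sampled rows; only the row count matters here.
rows = np.arange(10752)

# With neither test_size nor train_size given, scikit-learn holds out 25%.
train_rows, test_rows = train_test_split(rows, random_state=100)
print(len(train_rows), len(test_rows))
```

This leaves 8064 rows for training and 2688 for testing.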


In [44]:
# Feature engineering: standardize the features (fit on the training set only)
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
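
The scaler here is fit once on the whole training split. If those scaled features are then cross-validated, each validation fold has already contributed to the scaling statistics. One common way to keep scaling inside each fold is a `Pipeline`. The sketch below assumes synthetic data and a deliberately tiny parameter grid; the step names `scale` and `model` are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the encoded features and Purchase.
rng = np.random.default_rng(100)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Scaler and model are refit together inside every cross-validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(random_state=100)),
])

# Step-name prefixes ("model__") route each parameter to the right step.
# The grid is kept tiny so the sketch runs quickly.
param_grid = {"model__n_estimators": [10, 30], "model__max_depth": [3, 5]}
search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Tree-based models are largely insensitive to feature scaling, so the practical effect is likely small for a random forest; the pattern matters more when the estimator is scale-sensitive.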

In [48]:
# Hyperparameter tuning with grid search (GridSearchCV)
param_grid = {'n_estimators':[1,3,10,30,100,150,300],'max_depth':[1,3,5,7,9]}
estimator = GridSearchCV(RandomForestRegressor(),param_grid,cv=3,scoring='neg_mean_squared_error')
estimator.fit(x_train,y_train)
y_predict = estimator.predict(x_test)
print("y_predict:", y_predict)
# Exact equality is rarely True for continuous predictions; printed only as a sanity check
print("Predictions exactly equal to true values:", y_predict == y_test )
score = estimator.score(x_test,y_test)
# With scoring='neg_mean_squared_error', this test-set score is the negative MSE, not an accuracy
print("Test score (negative MSE):",score)
# Best parameters: best_params_
print("Best parameters:",estimator.best_params_)
# Best score: best_score_
# This is the mean cross-validated score on the validation folds of the training set
print("Best score:",estimator.best_score_)
# Best estimator: best_estimator_
print("Best estimator:",estimator.best_estimator_)
# Cross-validation results: cv_results_
print("Cross-validation results:",estimator.cv_results_)

y_predict: [  2017.47426005  12384.77926821   6266.5396602  ...,   6262.62985314
   6328.62574945  12538.68181027]
Predictions exactly equal to true values: 80549     False
167501    False
103583    False
375606    False
397270    False
529001    False
122805    False
390526    False
220585    False
227988    False
346898    False
508293    False
82695     False
201742    False
114549    False
261127    False
163935    False
508032    False
70234     False
532823    False
255782    False
242604    False
372356    False
200687    False
421372    False
422886    False
318534    False
87625     False
183441    False
297387    False
          ...  
115474    False
497461    False
137846    False
522362    False
385903    False
81198     False
451365    False
487       False
303742    False
160592    False
478609    False
504922    False
347665    False
407535    False
455378    False
280790    False
521220    False
86593     False
425055    False
224043    False
381487    False
138430    False
529693    False
167392    

In [49]:
print('Best score: {:.2f}'.format((-1*estimator.best_score_)**0.5))

Best score: 2959.40
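
The value printed above is a root mean squared error: `best_score_` holds the negative MSE, so negating it and taking the square root gives an error in the same units as `Purchase`. A minimal sketch of the same conversion on a held-out set, using `mean_squared_error` and made-up numbers rather than the notebook's actual predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted purchase amounts, for illustration only.
y_true = np.array([8370.0, 15200.0, 1422.0, 1057.0])
y_hat = np.array([8000.0, 14000.0, 2000.0, 1500.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)

# GridSearchCV reports -MSE under scoring='neg_mean_squared_error',
# so the same RMSE is recovered as (-score) ** 0.5.
neg_mse_score = -mse
recovered = (-neg_mse_score) ** 0.5
print(f"RMSE: {rmse:.2f}")
```

Applied to `y_test` and `y_predict`, the same two lines would give a test-set RMSE that can be compared directly with the cross-validated figure.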


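- Best parameters: {'max_depth': 9, 'n_estimators': 150}
- For this model, max_depth = 9 and n_estimators = 150 gave the best cross-validation result.
- Limitation: the model depends on known customer information and specific product information, so it cannot make predictions for a new customer who has never shopped here.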
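
One concrete place where this limitation shows up is the `LabelEncoder` step applied to `User_ID`: an encoder fit on known IDs cannot transform an ID it has never seen. The sketch below uses the two user IDs from the `head()` output and a hypothetical new ID.

```python
from sklearn.preprocessing import LabelEncoder

# Fit on user IDs seen in training (taken from the head() output).
encoder = LabelEncoder()
encoder.fit([1000001, 1000002])

# A hypothetical ID that never appeared in the training data.
new_user_id = 1999999

try:
    encoder.transform([new_user_id])
    seen = True
except ValueError as error:
    # scikit-learn refuses to encode labels it did not see during fit.
    seen = False
    print("Cannot encode new customer:", error)
```

A model meant to score first-time customers would need features that do not depend on the customer's identity, for example by leaving `User_ID` out of the feature set; that is a possible direction rather than something the notebook does.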