# 520 Tmall Data Analysis

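## Data Exploration and Cleaning

1. Remove duplicate rows
2. Extract the specific product name from goods_name
3. Convert the number of buyers to a numeric type
4. Add a revenue column: price * number of buyers = revenue
5. Add binned product price data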

In [1]:
import numpy as np 
import pandas as pd 
import time 
import re 
import jieba
import jieba.analyse
from collections import Counter

from pyecharts.charts import Bar, Map, Pie, TreeMap, WordCloud, Page
from pyecharts import options as opts 
from pyecharts.globals import SymbolType


In [2]:
# Read the data
df_all = pd.read_excel('../data/520礼物天猫数据.xlsx') 
df = df_all.copy()
df.head() 

Unnamed: 0,goods_name,shop_name,price,purchase_num,location
0,永生花礼盒告白玫瑰兔 送女朋友爱人生日礼,joyflower旗舰店,359.0,['38人付款'],北京
1,THEBEAST/野兽派音乐盒水晶球玫瑰永生花 结婚生日520礼物送女友,thebeast野兽派官方旗舰店,520.0,['1175人付款'],上海
2,进口永生花玻璃罩礼盒真花小王子玫瑰花情人节520送女友生日礼物,芊云旗舰店,179.0,['1997人付款'],云南 昆明
3,THE BEAST/野兽派永生花告白兔大声说爱你新婚生日520礼物送女友,thebeast野兽派官方旗舰店,630.0,['1196人付款'],上海
4,JANE永生花玫瑰花熊音乐旋转告白小熊礼520情人节礼物生日送女友,janeflowers旗舰店,258.0,['2805人付款'],江苏 苏州


In [3]:
df[df.goods_name.str.contains('施华洛')]

Unnamed: 0,goods_name,shop_name,price,purchase_num,location
801,520生日礼物送女友纯银手链女手镯首饰ins小众采用施华洛世奇元素,glten旗舰店,253.0,['1865人付款'],江苏 南京
805,四叶草手链女纯银ins小众设计施华洛世奇锆手饰手镯520礼物送女友,vigg旗舰店,183.0,['2614人付款'],广东 深圳
812,施华洛世奇DAZZLING SWAN浪漫天鹅魅力百搭女手链520礼物,施华洛世奇官方旗舰店,1190.0,['204人付款'],浙江 嘉兴
816,纯银手链女ins小众设计手饰手镯520礼物送女友采用施华洛世奇元素,glten旗舰店,353.0,['188人付款'],江苏 南京
821,【新品】施华洛世奇DAZZLING SWAN蓝调天鹅手链全新配色520礼物,施华洛世奇官方旗舰店,1190.0,['145人付款'],浙江 嘉兴
849,【新品】施华洛世奇 LIFELONG HRT 挚爱璀璨 女手镯520礼物,施华洛世奇官方旗舰店,1290.0,['146人付款'],浙江 嘉兴
850,施华洛世奇 FURTHER 环环相扣 现代风格 女手链520礼物,施华洛世奇官方旗舰店,1490.0,['84人付款'],浙江 嘉兴
855,【直营】告白日施华洛世奇SWAN新款首饰镀玫瑰金天鹅手链520礼物,天猫国际时尚直营,569.0,['149人付款'],浙江 杭州
863,星座手链女采用施华洛世奇元素水晶手饰品小众设计520礼物送女友,静风格官方旗舰店,149.0,['632人付款'],上海
890,JG镶施华洛世奇闺蜜手链女ins小众设计潮纯银学生简约清新520礼物,johnsongem旗舰店,95.0,['873人付款'],广东 广州


In [4]:
# Remove duplicate rows
df.drop_duplicates(inplace=True)
df.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3855 entries, 0 to 4403
Data columns (total 5 columns):
goods_name      3855 non-null object
shop_name       3855 non-null object
price           3855 non-null float64
purchase_num    3855 non-null object
location        3855 non-null object
dtypes: float64(1), object(4)
memory usage: 180.7+ KB


In [5]:
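# Drop records whose purchase count is missing
# (na=False treats missing values as non-matching; .copy() avoids SettingWithCopyWarning later)
df = df[df['purchase_num'].str.contains('人付款', na=False)].copy()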
df.shape

(3854, 5)

In [6]:
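# Number of buyers: pull the digits out of strings such as "['38人付款']"
df['purchase_num'] = df['purchase_num'].str.extract(r'(\d+)', expand=False).astype('float')
# Revenue: price * number of buyers
df['sales_volume'] = df['price'] * df['purchase_num']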
df.head(2) 

Unnamed: 0,goods_name,shop_name,price,purchase_num,location,sales_volume
0,永生花礼盒告白玫瑰兔 送女朋友爱人生日礼,joyflower旗舰店,359.0,38.0,北京,13642.0
1,THEBEAST/野兽派音乐盒水晶球玫瑰永生花 结婚生日520礼物送女友,thebeast野兽派官方旗舰店,520.0,1175.0,上海,611000.0
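
The digit-only pattern handles counts such as `38人付款`. Listings may also abbreviate large counts with 万 (ten thousand), for example `1.5万+人付款`; that is an assumption here, since no such value appears in the rows shown above. A minimal sketch of a parser that covers both forms, where `parse_purchase_num` is a hypothetical helper and not part of the original notebook:

```python
import re

# Hypothetical helper: a number with an optional decimal part,
# optionally followed by 万 (ten thousand).
_COUNT_RE = re.compile(r'(\d+(?:\.\d+)?)(万)?')


def parse_purchase_num(text):
    """Return the buyer count found in text as a float, or None if absent."""
    match = _COUNT_RE.search(text)
    if match is None:
        return None
    number = float(match.group(1))
    if match.group(2):
        # "1.5万" is read as 1.5 * 10000
        number *= 10_000
    return number


plain = parse_purchase_num("['38人付款']")
abbreviated = parse_purchase_num("['1.5万+人付款']")
missing = parse_purchase_num("no count here")
print(plain, abbreviated, missing)
```

If needed, it could be applied with `df['purchase_num'].apply(parse_purchase_num)` on the raw string column.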


In [7]:
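# Province: the first whitespace-separated token of location,
# so that three-character names such as 黑龙江 and 内蒙古 are kept whole
df['province_name'] = df.location.str.split().str[0]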
df.head(2)

Unnamed: 0,goods_name,shop_name,price,purchase_num,location,sales_volume,province_name
0,永生花礼盒告白玫瑰兔 送女朋友爱人生日礼,joyflower旗舰店,359.0,38.0,北京,13642.0,北京
1,THEBEAST/野兽派音乐盒水晶球玫瑰永生花 结婚生日520礼物送女友,thebeast野兽派官方旗舰店,520.0,1175.0,上海,611000.0,上海
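
The location strings hold a province, optionally followed by a space and a city. The sketch below compares two ways of extracting the province on a toy Series. The value `黑龙江 哈尔滨` is a made-up sample chosen to show a three-character province name; it is not taken from the output above.

```python
import pandas as pd

# Toy data: the last entry is hypothetical.
location = pd.Series(['北京', '云南 昆明', '江苏 苏州', '黑龙江 哈尔滨'])

# Fixed-width slice: truncates three-character province names.
province_by_slice = location.str[:2]

# First whitespace-separated token: keeps the full province name.
province_by_split = location.str.split().str[0]

print(province_by_slice.tolist())
print(province_by_split.tolist())
```

The split-based version matters mainly for the map chart, where region names have to match the names the map expects.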


In [35]:
txt = df.goods_name.str.cat(sep='。') 

# Add custom words to the jieba dictionary
word_list = ['永生花', '音乐盒', '水晶球', '玫瑰永生花', '玻璃罩礼盒', '玫瑰兔', '手镯', '手链', '首饰']

for word in word_list:
    jieba.add_word(word)

# Load stop words, one per line (a set gives fast membership tests)
with open(r'C:\Users\wzd\Desktop\CDA\CDA_Learning\Python\Python项目实作\天猫\stop_words.txt', 'r', encoding='utf-8') as f:
    stop_words = {line.strip() for line in f}

# Extra stop words: blanks and numeric noise from product titles
stop_words.update([' ', '925', '999', '18K', '18k', '5.20', '740',
                   '888', '2020', '1314', '100ml'])

# Tokenize the concatenated titles
word_num = jieba.lcut(txt)

# Drop stop words and keep only tokens longer than two characters
word_num_selected = [w for w in word_num
                     if w not in stop_words and len(w) > 2]

In [36]:
# Count token frequencies
df_num = pd.Series(Counter(word_num_selected))
df_num = df_num.sort_values(ascending=False) 
num_top10 = df_num[:10] 
num_top10

520     3639
情人节     1903
生日礼物    1216
永生花      593
女朋友      375
玫瑰花      340
施华洛      221
男朋友      204
玻璃罩      199
母亲节      179
dtype: int64
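
The filter-then-count step can also be written with `collections.Counter` alone. This is a minimal sketch on a hand-made token list; the tokens and the stop-word set are illustrative assumptions, not the notebook's data.

```python
from collections import Counter

# Illustrative tokens, standing in for the output of jieba.lcut
tokens = ['永生花', '520', '的', '永生花', '玫瑰花', '送', '520', '永生花']
# Illustrative stop words; here '520' is excluded on purpose
stop_words = {'的', '送', '520'}

# Same rule as the notebook: not a stop word, and longer than two characters
kept = [t for t in tokens if t not in stop_words and len(t) > 2]

# most_common(n) returns the n most frequent (token, count) pairs
top = Counter(kept).most_common(2)
print(top)
```

Adding a generic token such as `520` to the stop words keeps it from crowding actual gift names out of the top 10.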

In [10]:
data = [
    {"value": 593, "name": "永生花"},
    {"value": 340, "name": "玫瑰花"},
    {"value": 221, "name": "施华洛"},
    {"value": 114, "name": "巧克力"},
    {"value": 66, "name": "银项链"},
    {"value": 65, "name": "四叶草"},
    {"value": 65, "name": "音乐盒"},
    {"value": 65, "name": "潘多拉"},
    {"value": 59, "name": "满天星"},
    {"value": 49, "name": "康乃馨"}
] 
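
The list above is typed in by hand. One possible way to build the same structure from the counts is sketched below. The counts are copied from the top-10 output; `gift_words` is a hypothetical whitelist of tokens considered to be gifts.

```python
# Counts copied from the top-10 output above
counts = {'520': 3639, '情人节': 1903, '生日礼物': 1216,
          '永生花': 593, '玫瑰花': 340, '施华洛': 221}

# Hypothetical whitelist: tokens that name a gift rather than an occasion
gift_words = ['玫瑰花', '永生花', '施华洛']

# Build the {"value", "name"} records that TreeMap data expects
tree_data = [{"value": counts[w], "name": w} for w in gift_words if w in counts]
tree_data.sort(key=lambda item: item["value"], reverse=True)
print(tree_data)
```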

In [11]:
tree = TreeMap(init_opts=opts.InitOpts(width="1280px", height="720px"))
tree.add(series_name='', data=data, label_opts=opts.LabelOpts(position='inside'))
tree.set_global_opts(title_opts=opts.TitleOpts(title='Top 10 Gifts People Buy for 520'), 
                     legend_opts=opts.LegendOpts(is_show=False))
tree.render() 

'C:\\Users\\wzd\\Desktop\\CDA\\CDA_Learning\\Python\\Python项目实作\\520\\code\\render.html'

In [12]:
# Bar chart of the 10 most frequent title keywords
bar0 = Bar(init_opts=opts.InitOpts(width='1350px', height='750px')) 
bar0.add_xaxis(num_top10.index.tolist()) 
bar0.add_yaxis('', num_top10.values.tolist()) 
bar0.set_global_opts(title_opts=opts.TitleOpts(title='Top 10 Keywords in 520 Gift Titles'),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15)),
                     # Scale the colour range to the largest count (cast to a plain int)
                     visualmap_opts=opts.VisualMapOpts(max_=int(num_top10.max()))) 
bar0.render() 

'C:\\Users\\wzd\\Desktop\\CDA\\CDA_Learning\\Python\\Python项目实作\\520\\code\\render.html'

## Data Visualization

### Top 10 Shops by Sales Volume - Bar Chart

In [13]:
# Top 10 shops by total number of buyers
shop_top10 = df.groupby('shop_name')['purchase_num'].sum().sort_values(ascending=False).head(10)

# Draw the bar chart
bar1 = Bar(init_opts=opts.InitOpts(width='1350px', height='750px')) 
bar1.add_xaxis(shop_top10.index.tolist())
bar1.add_yaxis('', shop_top10.values.tolist()) 
bar1.set_global_opts(title_opts=opts.TitleOpts(title='Top 10 Shops by 520 Gift Sales Volume'),
                     xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15)),
                     visualmap_opts=opts.VisualMapOpts(max_=shop_top10.values.max())) 
bar1.render() 

'C:\\Users\\wzd\\Desktop\\CDA\\CDA_Learning\\Python\\Python项目实作\\520\\code\\render.html'

### Top 10 Provinces by Sales Volume - Bar Chart

In [14]:
# Top 10 provinces by total number of buyers
province_top10 = df.groupby('province_name')['purchase_num'].sum().sort_values(ascending=False).head(10)

# Draw the bar chart
bar2 = Bar(init_opts=opts.InitOpts(width='1350px', height='750px')) 
bar2.add_xaxis(province_top10.index.tolist())
bar2.add_yaxis('', province_top10.values.tolist()) 
bar2.set_global_opts(title_opts=opts.TitleOpts(title='Top 10 Provinces by 520 Gift Sales Volume'),
                     visualmap_opts=opts.VisualMapOpts(max_=province_top10.values.max())) 
bar2.render() 

'C:\\Users\\wzd\\Desktop\\CDA\\CDA_Learning\\Python\\Python项目实作\\520\\code\\render.html'

### Sales Volume by Province - Map

In [15]:
# Total number of buyers per province
province_num = df.groupby('province_name')['purchase_num'].sum().sort_values(ascending=False) 

# Draw the map
map1 = Map(init_opts=opts.InitOpts(width='1350px', height='750px'))
map1.add("", [list(z) for z in zip(province_num.index.tolist(), province_num.values.tolist())],
         maptype='china'
        ) 
map1.set_global_opts(title_opts=opts.TitleOpts(title='520 Gift Sales Volume by Province in China'),
                     visualmap_opts=opts.VisualMapOpts(max_=province_num.quantile(0.9)),
                    )
map1.render()

'C:\\Users\\wzd\\Desktop\\CDA\\CDA_Learning\\Python\\Python项目实作\\520\\code\\render.html'
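
The map caps its colour scale at the 90th percentile of the province totals instead of the maximum. A plausible reason is that a few dominant provinces would otherwise compress every other province into the lowest colours. The sketch below illustrates the effect with made-up totals; the labels and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical totals: nine ordinary provinces and one dominant one
sales = pd.Series(
    [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0, 900.0, 10000.0],
    index=[f'P{i}' for i in range(10)],
)

cap_at_max = sales.max()
cap_at_q90 = sales.quantile(0.9)  # linear interpolation between the two largest values

# Position of the second-largest province on each colour scale (0 to 1)
position_with_max = 900.0 / cap_at_max
position_with_q90 = min(900.0 / cap_at_q90, 1.0)

print(cap_at_max, cap_at_q90)
print(round(position_with_max, 2), round(position_with_q90, 2))
```

With the maximum as the cap, the second-largest province sits near the bottom of the scale; with the quantile cap it lands around the middle, so differences among ordinary provinces stay visible.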

### Number of Products per Price Range

In [16]:
def transform_price(x):
    """Map a price to its price-range label (upper bounds are inclusive)."""
    if x <= 50:
        return '0~50'
    elif x <= 100:
        return '50~100'
    elif x <= 150:
        return '100~150'
    elif x <= 200:
        return '150~200'
    elif x <= 250:
        return '200~250'
    elif x <= 300:
        return '250~300'
    elif x <= 500:
        return '300~500'
    elif x <= 1000:
        return '500~1000'
    elif x <= 2000:
        return '1000~2000'
    elif x <= 5000:
        return '2000~5000'
    else:
        return '5000~10000'

df['price_cut'] = df.price.apply(transform_price)
price_num = df.price_cut.value_counts()
price_num

150~200       620
50~100        594
100~150       565
300~500       399
0~50          395
500~1000      394
250~300       302
1000~2000     273
200~250       212
2000~5000      91
5000~10000      9
Name: price_cut, dtype: int64
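
The same price ranges can be expressed with `pd.cut`, which returns an ordered categorical. This is a sketch on a few sample prices, not a replacement the original notebook uses. The first and last edges are set to negative and positive infinity so that every price falls into a bin, mirroring the open-ended first and last branches of the function above.

```python
import numpy as np
import pandas as pd

# Bin edges: each bin is (lower, upper], matching the "<=" comparisons
edges = [-np.inf, 50, 100, 150, 200, 250, 300, 500, 1000, 2000, 5000, np.inf]
labels = ['0~50', '50~100', '100~150', '150~200', '200~250', '250~300',
          '300~500', '500~1000', '1000~2000', '2000~5000', '5000~10000']

# Sample prices; the last two are made-up boundary and outlier cases
prices = pd.Series([359.0, 520.0, 179.0, 50.0, 6999.0])

price_cut = pd.cut(prices, bins=edges, labels=labels, right=True)

# sort=False keeps the bins in category order instead of sorting by count
counts = price_cut.value_counts(sort=False)
print(price_cut.astype(str).tolist())
print(counts)
```

Because the categories carry their own order, the counts come out from the cheapest to the most expensive range, which is the order a bar chart usually needs.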

In [17]:
x_data = ['0~50', '50~100', '100~150', '150~200', '200~250', '250~300', 
          '300~500', '500~1000', '1000~2000', '2000~5000', '5000~10000']
y_data = [395, 594, 565, 620, 212, 302, 399, 394, 273, 91, 9]
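
The y values above are copied by hand from the `value_counts` output. A sketch of deriving them instead, using `reindex` to put the counts in chart order; the Series below reproduces the counts printed earlier.

```python
import pandas as pd

# Desired left-to-right order of the bars
x_order = ['0~50', '50~100', '100~150', '150~200', '200~250', '250~300',
           '300~500', '500~1000', '1000~2000', '2000~5000', '5000~10000']

# Counts as printed by value_counts (sorted by frequency, not by price)
price_counts = pd.Series({
    '150~200': 620, '50~100': 594, '100~150': 565, '300~500': 399,
    '0~50': 395, '500~1000': 394, '250~300': 302, '1000~2000': 273,
    '200~250': 212, '2000~5000': 91, '5000~10000': 9,
})

# reindex reorders by label; fill_value=0 covers a range with no products
y_values = price_counts.reindex(x_order, fill_value=0).tolist()
print(y_values)
```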

In [18]:
bar3 = Bar(init_opts=opts.InitOpts(width='1350px', height='750px')) 
bar3.add_xaxis(x_data)
bar3.add_yaxis('', y_data) 
bar3.set_global_opts(title_opts=opts.TitleOpts(title='Number of 520 Gift Products per Price Range'),
                     visualmap_opts=opts.VisualMapOpts(max_=800)) 
bar3.render()

'C:\\Users\\wzd\\Desktop\\CDA\\CDA_Learning\\Python\\Python项目实作\\520\\code\\render.html'

In [19]:
price_cut_num = df.groupby('price_cut')['purchase_num'].sum() 
data_pair = [list(z) for z in zip(price_cut_num.index, price_cut_num.values)]

In [20]:
# Pie chart
pie1 = Pie(init_opts=opts.InitOpts(width='1350px', height='750px'))
# Labels use rich-text styles
pie1.add( 
        series_name="",
        radius=["35%", "55%"],
        data_pair=data_pair,
        label_opts=opts.LabelOpts(
            position="outside",
            formatter="{b|{b}: }{c}  {per|{d}%}  ",
            background_color="#eee",
            border_color="#aaa",
            border_width=1,
            border_radius=4,
            rich={
                "a": {"color": "#999", "lineHeight": 22, "align": "center"},
                "abg": {
                    "backgroundColor": "#e3e3e3",
                    "width": "100%",
                    "align": "right",
                    "height": 22,
                    "borderRadius": [4, 4, 0, 0],
                },
                "hr": {
                    "borderColor": "#aaa",
                    "width": "100%",
                    "borderWidth": 0.5,
                    "height": 0,
                },
                "b": {"fontSize": 16, "lineHeight": 33},
                "per": {
                    "color": "#eee",
                    "backgroundColor": "#334455",
                    "padding": [2, 4],
                    "borderRadius": 2,
                },
            },
        ),
)
pie1.set_global_opts(legend_opts=opts.LegendOpts(pos_left="left", pos_top='30%', orient="vertical"), 
                     title_opts=opts.TitleOpts(title='Share of 520 Gift Sales Volume by Price Range'))
pie1.set_series_opts(
    tooltip_opts=opts.TooltipOpts(trigger="item", formatter="{a} <br/>{b}: {c} ({d}%)")
    )
pie1.render() 

'C:\\Users\\wzd\\Desktop\\CDA\\CDA_Learning\\Python\\Python项目实作\\520\\code\\render.html'
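
In the label and tooltip formatters, `{b}` is the slice name, `{c}` its value, and `{d}` its percentage of the total. The sketch below reproduces that percentage by hand for a few made-up `[name, value]` pairs; `percentage_shares` is a hypothetical helper, and the numbers are not the notebook's results.

```python
def percentage_shares(pairs):
    """Return (name, percent of total) for each [name, value] pair."""
    total = sum(value for _, value in pairs)
    if total == 0:
        # Avoid dividing by zero when every value is zero
        return [(name, 0.0) for name, _ in pairs]
    return [(name, round(100 * value / total, 2)) for name, value in pairs]


# Hypothetical sales totals per price range
sample_pairs = [['0~50', 300.0], ['50~100', 500.0], ['100~150', 200.0]]
shares = percentage_shares(sample_pairs)
print(shares)
```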

In [44]:
# Word cloud of the 100 most frequent title keywords
word1 = WordCloud(init_opts=opts.InitOpts(width='1350px', height='750px'))
word1.add("", list(zip(df_num[:100].index.astype('str'), df_num[:100].tolist())),
          word_size_range=[20, 200],
          shape=SymbolType.DIAMOND)
# Pass title by keyword so the call does not depend on argument order
word1.set_global_opts(title_opts=opts.TitleOpts(title='Word Cloud of 520 Gift Product Titles'))
word1.render() 

'C:\\Users\\wzd\\Desktop\\CDA\\CDA_Learning\\Python\\Python项目实作\\520\\code\\render.html'

In [41]:
page = Page()
page.add(tree, bar1, bar2, map1, bar3, pie1, word1)
page.render('520天猫数据分析.html')

'C:\\Users\\wzd\\Desktop\\CDA\\CDA_Learning\\Python\\Python项目实作\\520\\code\\520天猫数据分析.html'