# Project: Assessing and Cleaning Sales Data from a UK E-Commerce Company

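## Analysis Goal

The goal of this analysis is to identify best-selling products from the sales data so that more effective marketing strategies can be developed to increase revenue.

This hands-on project is an exercise in assessing how clean and tidy the data is and, based on that assessment, cleaning it so that it is ready for the next stage of analysis.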

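## Introduction

The raw dataset records every transaction of a UK-based online retailer between December 1, 2010 and December 9, 2011, covering the company's business across different countries and regions worldwide. The company mainly sells gifts for all occasions, including but not limited to birthday gifts, wedding keepsakes, and Christmas gifts. Its customers are mainly wholesalers and individual consumers, with wholesalers making up a considerable share.

The columns have the following meanings:
- `InvoiceNo`: Invoice number. A 6-digit code that uniquely identifies a transaction. A code starting with the letter "C" marks a cancelled transaction.
- `StockCode`: Product code. A 5-digit code that uniquely identifies a product.
- `Description`: Product name.
- `Quantity`: Quantity of the product in the transaction.
- `InvoiceDate`: Invoice date and time, i.e. when the transaction took place.
- `UnitPrice`: Unit price, in pounds sterling (£).
- `CustomerID`: Customer number. A 5-digit code that uniquely identifies a customer.
- `Country`: Country name. The name of the country where the customer resides.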

## Loading the Data

In [1]:
import pandas as pd

In [2]:
original_data = pd.read_csv('e_commerce.csv')

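## Assessing the Data

### Assessing the tidiness of the data structure
##### Does it satisfy "each column is a variable, each row is an observation, each cell is a single value"?

#### **Call the `head` and `sample` methods to inspect the structure of the dataset.**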

In [3]:
original_data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [4]:
original_data.sample(10)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
530335,580848,22544,MINI JIGSAW SPACEBOY,2,12/6/2011 11:51,0.19,18005.0,United Kingdom
164085,550636,22161,HEART DECORATION RUSTIC HANGING,3,4/19/2011 15:37,0.79,,United Kingdom
193949,553546,22620,4 TRADITIONAL SPINNING TOPS,320,5/17/2011 15:42,1.25,12415.0,Australia
356262,568048,22494,EMERGENCY FIRST AID TIN,12,9/23/2011 12:36,1.25,14911.0,EIRE
396661,571082,22834,HAND WARMER BABUSHKA DESIGN,1,10/13/2011 15:25,4.13,,United Kingdom
497524,578400,22623,BOX OF VINTAGE JIGSAW BLOCKS,1,11/24/2011 11:52,5.95,12748.0,United Kingdom
67394,541830,22452,MEASURING TAPE BABUSHKA PINK,2,1/21/2011 17:09,1.63,,United Kingdom
371672,569223,21509,COWBOYS AND INDIANS BIRTHDAY CARD,12,10/2/2011 13:49,0.42,16283.0,United Kingdom
154086,549844,21900,"KEY FOB , SHED",6,4/12/2011 14:17,0.65,12854.0,United Kingdom
191675,553385,23163,REGENCY SUGAR TONGS,1,5/16/2011 15:53,2.49,16710.0,United Kingdom


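- After viewing the first five rows and ten randomly sampled rows, we can confirm that the dataset satisfies "each column is a variable, each row is an observation, each cell is a single value".
- The data structure is tidy.

### Assessing the cleanliness of the data content
##### Missing data, duplicate data, inconsistent data, invalid/erroneous data, wrong data types

#### **Call the `info` method to check the number of observations, the data type of each variable, and whether any values are missing.**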

In [5]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


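- The dataset contains `541909` observations.
- The `Description` and `CustomerID` variables have missing values.
- `InvoiceDate` has the wrong data type and should be converted to a datetime; `CustomerID` has the wrong data type and should be converted to a string; the data type of `Country` is acceptable, but it can be changed to `category` to make later analysis easier.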
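
The missing-value counts that `info` implies can also be pulled out directly. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not rows taken from the real file) to show how `isnull` and `sum` list only the columns that have gaps.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few columns of the real dataset.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536414", "536544"],
    "Description": ["WHITE METAL LANTERN", None, "POLKADOT RAIN HAT"],
    "CustomerID": [17850.0, None, None],
})

# Count missing values per column, then keep only columns with at least one.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

On the real data the same two lines could be run against `original_data`.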

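#### **The goal of this analysis is to find best-selling products, so `Description` is essential; observations that lack it can be dropped.**
#### **`CustomerID` is not essential, but it is still worth pulling up to see whether it is related to other variables.**
#### **We will use `isnull` and `sum`, together with some conditional filters, to examine how the missing values relate to the other variables.**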

In [6]:
original_data[original_data['Description'].isnull()]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
622,536414,22139,,56,12/1/2010 11:52,0.0,,United Kingdom
1970,536545,21134,,1,12/1/2010 14:32,0.0,,United Kingdom
1971,536546,22145,,1,12/1/2010 14:33,0.0,,United Kingdom
1972,536547,37509,,1,12/1/2010 14:33,0.0,,United Kingdom
1987,536549,85226A,,1,12/1/2010 14:34,0.0,,United Kingdom
...,...,...,...,...,...,...,...,...
535322,581199,84581,,-2,12/7/2011 18:26,0.0,,United Kingdom
535326,581203,23406,,15,12/7/2011 18:31,0.0,,United Kingdom
535332,581209,21620,,6,12/7/2011 18:35,0.0,,United Kingdom
536981,581234,72817,,27,12/8/2011 10:33,0.0,,United Kingdom


- It appears that observations missing `Description` are also missing `CustomerID`.
- To test this hypothesis, add a condition that selects the observations missing both `Description` and `CustomerID`.

In [7]:
original_data[(original_data['Description'].isnull()) & original_data['CustomerID'].isnull()]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
622,536414,22139,,56,12/1/2010 11:52,0.0,,United Kingdom
1970,536545,21134,,1,12/1/2010 14:32,0.0,,United Kingdom
1971,536546,22145,,1,12/1/2010 14:33,0.0,,United Kingdom
1972,536547,37509,,1,12/1/2010 14:33,0.0,,United Kingdom
1987,536549,85226A,,1,12/1/2010 14:34,0.0,,United Kingdom
...,...,...,...,...,...,...,...,...
535322,581199,84581,,-2,12/7/2011 18:26,0.0,,United Kingdom
535326,581203,23406,,15,12/7/2011 18:31,0.0,,United Kingdom
535332,581209,21620,,6,12/7/2011 18:35,0.0,,United Kingdom
536981,581234,72817,,27,12/8/2011 10:33,0.0,,United Kingdom


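- The number of observations missing both `Description` and `CustomerID` equals the number of observations missing `Description`.
- The hypothesis is confirmed: every observation missing `Description` is also missing `CustomerID`.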
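
Rather than comparing the two displayed tables by eye, the same check can be done by comparing counts. This is a minimal sketch on made-up rows (`toy` is hypothetical); it only illustrates the comparison and is not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy rows: two lack both fields, one lacks only CustomerID.
toy = pd.DataFrame({
    "Description": [None, None, "RAIN PONCHO RETROSPOT", "WICKER STAR"],
    "CustomerID": [None, None, None, 14446.0],
})

missing_desc = toy["Description"].isnull()
missing_cust = toy["CustomerID"].isnull()

# If both counts match, every row missing Description also misses CustomerID.
n_desc = int(missing_desc.sum())
n_both = int((missing_desc & missing_cust).sum())
desc_implies_cust = n_desc == n_both
print(n_desc, n_both, desc_implies_cust)
```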

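#### **However, `CustomerID` has far more missing values than `Description`, which means the converse does not hold: an observation missing `CustomerID` is not necessarily missing `Description`.**
#### **With that in mind, we look at the observations missing `CustomerID` to see whether they have anything else in common.**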

In [8]:
original_data[original_data['CustomerID'].isnull()]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
622,536414,22139,,56,12/1/2010 11:52,0.00,,United Kingdom
1443,536544,21773,DECORATIVE ROSE BATHROOM BOTTLE,1,12/1/2010 14:32,2.51,,United Kingdom
1444,536544,21774,DECORATIVE CATS BATHROOM BOTTLE,2,12/1/2010 14:32,2.51,,United Kingdom
1445,536544,21786,POLKADOT RAIN HAT,4,12/1/2010 14:32,0.85,,United Kingdom
1446,536544,21787,RAIN PONCHO RETROSPOT,2,12/1/2010 14:32,1.66,,United Kingdom
...,...,...,...,...,...,...,...,...
541536,581498,85099B,JUMBO BAG RED RETROSPOT,5,12/9/2011 10:26,4.13,,United Kingdom
541537,581498,85099C,JUMBO BAG BAROQUE BLACK WHITE,4,12/9/2011 10:26,4.13,,United Kingdom
541538,581498,85150,LADIES & GENTLEMEN METAL SIGN,1,12/9/2011 10:26,4.96,,United Kingdom
541539,581498,85174,S/4 CACTI CANDLES,1,12/9/2011 10:26,10.79,,United Kingdom


- Nothing unusual stands out; there is no further relationship.
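
The cell above selects every row that lacks `CustomerID`, including rows that also lack `Description`. To look only at rows that lack `CustomerID` but still have a `Description`, the two conditions can be combined with `notnull`. The sketch below uses made-up rows (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical toy rows covering the three possible cases.
toy = pd.DataFrame({
    "Description": [None, "POLKADOT RAIN HAT", "WICKER STAR"],
    "CustomerID": [None, None, 14446.0],
})

# Missing CustomerID AND present Description.
only_cust_missing = toy[toy["CustomerID"].isnull() & toy["Description"].notnull()]
print(only_cust_missing)
```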

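#### **Call the `duplicated` and `sum` methods to find and count duplicates.**

#### **Given the nature of this dataset and the goal of the analysis, we only look at observations that repeat in both `InvoiceNo` and `StockCode`, and observations that repeat in every variable.**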

In [9]:
original_data.duplicated().sum()

np.int64(5268)

In [10]:
original_data[original_data.duplicated(subset = ['InvoiceNo','StockCode'])]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
125,536381,71270,PHOTO CLIP LINE,3,12/1/2010 9:41,1.25,15311.0,United Kingdom
498,536409,90199C,5 STRAND GLASS NECKLACE CRYSTAL,1,12/1/2010 11:45,6.35,17908.0,United Kingdom
502,536409,85116,BLACK CANDELABRA T-LIGHT HOLDER,5,12/1/2010 11:45,2.10,17908.0,United Kingdom
517,536409,21866,UNION JACK FLAG LUGGAGE TAG,1,12/1/2010 11:45,1.25,17908.0,United Kingdom
525,536409,90199C,5 STRAND GLASS NECKLACE CRYSTAL,2,12/1/2010 11:45,6.35,17908.0,United Kingdom
...,...,...,...,...,...,...,...,...
541692,581538,22992,REVOLVER WOODEN RULER,1,12/9/2011 11:34,1.95,14446.0,United Kingdom
541697,581538,21194,PINK HONEYCOMB PAPER FAN,1,12/9/2011 11:34,0.65,14446.0,United Kingdom
541698,581538,35004B,SET OF 3 BLACK FLYING DUCKS,1,12/9/2011 11:34,5.45,14446.0,United Kingdom
541699,581538,22694,WICKER STAR,1,12/9/2011 11:34,2.10,14446.0,United Kingdom


In [11]:
original_data[original_data.duplicated(subset = ['InvoiceNo','StockCode'],keep = 'last')]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
113,536381,71270,PHOTO CLIP LINE,1,12/1/2010 9:41,1.25,15311.0,United Kingdom
483,536409,90199C,5 STRAND GLASS NECKLACE CRYSTAL,3,12/1/2010 11:45,6.35,17908.0,United Kingdom
485,536409,22111,SCOTTIE DOG HOT WATER BOTTLE,1,12/1/2010 11:45,4.95,17908.0,United Kingdom
489,536409,22866,HAND WARMER SCOTTY DOG DESIGN,1,12/1/2010 11:45,2.10,17908.0,United Kingdom
491,536409,85116,BLACK CANDELABRA T-LIGHT HOLDER,1,12/1/2010 11:45,2.10,17908.0,United Kingdom
...,...,...,...,...,...,...,...,...
541656,581538,23318,BOX OF 6 MINI VINTAGE CRACKERS,1,12/9/2011 11:34,2.49,14446.0,United Kingdom
541659,581538,22899,CHILDREN'S APRON DOLLY GIRL,2,12/9/2011 11:34,2.10,14446.0,United Kingdom
541666,581538,23343,JUMBO BAG VINTAGE CHRISTMAS,1,12/9/2011 11:34,2.08,14446.0,United Kingdom
541674,581538,35004B,SET OF 3 BLACK FLYING DUCKS,2,12/9/2011 11:34,5.45,14446.0,United Kingdom


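- There are `5268` observations that are duplicated across every variable. This is implausible, so they should be removed to avoid distorting the later best-seller analysis.
- There are also `10684` observations duplicated in both `InvoiceNo` and `StockCode`. A comparison shows that most of them differ in `Quantity`, possibly because the customer ordered the same product several times, cancelled and re-ordered, or because we gave the customer a discount by reducing the item quantity, among other reasons. On reflection, these can be kept.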
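
The difference between the two kinds of duplicates can be seen on a small example. The frame below is made up (`toy` is hypothetical); it shows how `duplicated` behaves with and without `subset`, and how `keep=False` returns every member of a duplicated group so the rows can be compared side by side.

```python
import pandas as pd

# Hypothetical toy rows: rows 0 and 2 are identical in every column,
# rows 0, 1 and 2 share the same InvoiceNo and StockCode.
toy = pd.DataFrame({
    "InvoiceNo": ["536409", "536409", "536409", "536381"],
    "StockCode": ["90199C", "90199C", "90199C", "71270"],
    "Quantity": [3, 1, 3, 1],
})

# Rows that repeat an earlier row in every column.
full_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier row in the two key columns only.
key_dupes = int(toy.duplicated(subset=["InvoiceNo", "StockCode"]).sum())

# keep=False marks all members of each duplicated group, not just the repeats.
key_groups = toy[toy.duplicated(subset=["InvoiceNo", "StockCode"], keep=False)]
print(full_dupes, key_dupes, len(key_groups))
```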

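#### **For inconsistent data: to report each country's share of sales accurately later on, the values in the `Country` column should be unified. We check them with the `value_counts` method.**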

In [12]:
original_data['Country'].value_counts()

Country
United Kingdom          495266
Germany                   9495
France                    8557
EIRE                      8196
Spain                     2533
Netherlands               2371
Belgium                   2069
Switzerland               2002
Portugal                  1519
Australia                 1259
Norway                    1086
Italy                      803
Channel Islands            758
Finland                    695
Cyprus                     622
Sweden                     462
Unspecified                446
Austria                    401
Denmark                    389
Japan                      358
Poland                     341
Israel                     297
China                      288
Singapore                  229
USA                        218
UK                         211
Iceland                    182
Canada                     151
Greece                     146
Malta                      127
United States               73
United Arab Emirates        68


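A careful comparison shows:
- The United Kingdom appears under several spellings: `United Kingdom`, `UK`, and `U.K.`; these should be unified as `United Kingdom`.
- The United States appears under two spellings: `USA` and `United States`; these should be unified as `USA`.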
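
One way to express this unification is a single mapping from each variant to its canonical name. The sketch below is an illustration on a made-up series; `country_map` is a hypothetical name, and the mapping contains only the variants listed above.

```python
import pandas as pd

# Hypothetical sample of Country values, including the variants noted above.
countries = pd.Series(
    ["United Kingdom", "UK", "U.K.", "USA", "United States", "France"],
    name="Country",
)

# Variant -> canonical spelling.
country_map = {
    "UK": "United Kingdom",
    "U.K.": "United Kingdom",
    "United States": "USA",
}

# replace() with a dict swaps exact matches and leaves other values untouched.
unified = countries.replace(country_map)
print(unified.value_counts())
```

Keeping the variants in one dictionary makes it easy to add another spelling if `value_counts` reveals one later.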

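#### **For invalid/erroneous data, first look at the observations whose `InvoiceNo` starts with `C`, which will be removed later.**

#### **Also call the `describe` method to check for unreasonable values such as negative numbers.**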

In [13]:
original_data[original_data['InvoiceNo'].str[0] == 'C']

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,12/1/2010 9:41,27.50,14527.0,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,12/1/2010 9:49,4.65,15311.0,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,12/1/2010 10:24,1.65,17548.0,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,12/1/2010 10:24,0.29,17548.0,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,12/1/2010 10:24,0.29,17548.0,United Kingdom
...,...,...,...,...,...,...,...,...
540449,C581490,23144,ZINC T-LIGHT HOLDER STARS SMALL,-11,12/9/2011 9:57,0.83,14397.0,United Kingdom
541541,C581499,M,Manual,-1,12/9/2011 10:28,224.69,15498.0,United Kingdom
541715,C581568,21258,VICTORIAN SEWING BOX LARGE,-5,12/9/2011 11:57,10.95,15311.0,United Kingdom
541716,C581569,84978,HANGING HEART JAR T-LIGHT HOLDER,-1,12/9/2011 11:58,1.25,17315.0,United Kingdom


In [14]:
original_data.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


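- There are `9288` observations whose invoice number starts with `C`. These are all cancelled transactions and can be removed.
- Both `Quantity` and `UnitPrice` contain negative values. This may be how cancelled transactions are marked, or it may be a data-entry error. To prevent errors in later analysis, observations with negative values should be removed.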
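
To tell the two explanations apart, it can help to count how many negative values sit outside the cancelled invoices. The sketch below uses made-up rows (`toy` and its invoice numbers are hypothetical) and only illustrates the counting.

```python
import pandas as pd

# Hypothetical toy rows: one cancelled invoice and one non-cancelled
# row with a negative price.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536414", "536999"],
    "Quantity": [6, -1, 56, 1],
    "UnitPrice": [2.55, 27.50, 0.0, -11062.06],
})

# Cancelled transactions are those whose invoice number starts with "C".
is_cancelled = toy["InvoiceNo"].str.startswith("C")
n_cancelled = int(is_cancelled.sum())

# Negative values that the cancellation marker does not explain.
is_negative = (toy["Quantity"] < 0) | (toy["UnitPrice"] < 0)
n_negative_not_cancelled = int((is_negative & ~is_cancelled).sum())
print(n_cancelled, n_negative_not_cancelled)
```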

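## Cleaning the Data

To-do list:
- Data type conversion: convert `InvoiceDate` to a datetime; convert `CustomerID` to a string; change `Country` to `category`.
- Remove missing values: drop observations with a missing `Description`.
- Remove duplicates: drop observations duplicated across every variable.
- Unify inconsistent data: `United Kingdom`, `UK`, and `U.K.` become `United Kingdom`; `USA` and `United States` become `USA`.
- Remove invalid/erroneous data: drop observations whose `InvoiceNo` starts with `C`; drop observations whose `Quantity` or `UnitPrice` is zero or negative.

### Data type conversion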

In [17]:
cleaned_data = original_data.copy()
cleaned_data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [19]:
cleaned_data['InvoiceDate'] = pd.to_datetime(cleaned_data['InvoiceDate'])
cleaned_data['InvoiceDate']

0        2010-12-01 08:26:00
1        2010-12-01 08:26:00
2        2010-12-01 08:26:00
3        2010-12-01 08:26:00
4        2010-12-01 08:26:00
                 ...        
541904   2011-12-09 12:50:00
541905   2011-12-09 12:50:00
541906   2011-12-09 12:50:00
541907   2011-12-09 12:50:00
541908   2011-12-09 12:50:00
Name: InvoiceDate, Length: 541909, dtype: datetime64[ns]

In [21]:
cleaned_data['CustomerID'] = cleaned_data['CustomerID'].astype(str)
cleaned_data['Country'] = cleaned_data['Country'].astype('category')
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   541909 non-null  object        
 7   Country      541909 non-null  category      
dtypes: category(1), datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 29.5+ MB
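
The `info` output above reports `541909` non-null values for `CustomerID` after the conversion, which suggests the missing IDs were turned into text instead of staying missing. One possible alternative, sketched here on made-up values and not part of the original notebook, is to go through pandas' nullable integer dtype first, which also drops the trailing `.0`.

```python
import pandas as pd

# Hypothetical sample of the float-typed CustomerID column.
ids = pd.Series([17850.0, None, 12415.0], name="CustomerID")

# Nullable Int64 removes the ".0"; the "string" dtype keeps gaps as <NA>.
as_text = ids.astype("Int64").astype("string")

print(as_text.tolist())
print(int(as_text.isna().sum()))
```

This assumes every non-missing ID is a whole number, which the 5-digit description of `CustomerID` implies.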


### Removing missing values

In [22]:
cleaned_data.dropna(subset = ['Description'],inplace = True)
cleaned_data['Description'].isnull().sum()

np.int64(0)

### Removing duplicates

In [24]:
cleaned_data = cleaned_data.drop_duplicates()
cleaned_data.duplicated().sum()

np.int64(0)

### Unifying inconsistent data

In [29]:
cleaned_data['Country'] = cleaned_data['Country'].replace({'UK':'United Kingdom','U.K.':'United Kingdom'})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_data['Country'] = cleaned_data['Country'].replace({'UK':'United Kingdom','U.K.':'United Kingdom'})
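
The message above is pandas' `SettingWithCopyWarning`. A likely cause, though this is an assumption about the notebook's state, is that `cleaned_data` was reassigned from the result of `drop_duplicates()`, which pandas may still treat as derived from the earlier frame. A common way to make the intent explicit is to take a `.copy()` before assigning to columns, as in this sketch on made-up rows (`toy` and `deduped` are hypothetical names).

```python
import pandas as pd

# Hypothetical toy rows with one exact duplicate.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536409"],
    "Country": ["UK", "UK", "United States"],
})

# .copy() makes the de-duplicated frame an independent object, so later
# column assignments clearly target it and not the source frame.
deduped = toy.drop_duplicates().copy()
deduped["Country"] = deduped["Country"].replace(
    {"UK": "United Kingdom", "United States": "USA"}
)
print(deduped)
```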


In [31]:
len(cleaned_data[cleaned_data['Country'] == 'UK'])

0

In [32]:
cleaned_data['Country'] = cleaned_data['Country'].replace('United States','USA')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_data['Country'] = cleaned_data['Country'].replace('United States','USA')


In [33]:
len(cleaned_data[cleaned_data['Country'] == 'United States'])

0

### Removing invalid/erroneous data

In [35]:
cleaned_data.drop(cleaned_data[cleaned_data['InvoiceNo'].str[0] == 'C'].index, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_data.drop(cleaned_data[cleaned_data['InvoiceNo'].str[0] == 'C'].index, inplace = True)


In [36]:
len(cleaned_data[cleaned_data['InvoiceNo'].str[0] == 'C'])

0

In [37]:
cleaned_data.describe()

Unnamed: 0,Quantity,InvoiceDate,UnitPrice
count,525936.0,525936,525936.0
mean,10.365655,2011-07-04 15:11:14.737610240,3.872616
min,-9600.0,2010-12-01 08:26:00,-11062.06
25%,1.0,2011-03-28 11:59:00,1.25
50%,3.0,2011-07-20 11:07:00,2.08
75%,11.0,2011-10-19 11:41:00,4.13
max,80995.0,2011-12-09 12:50:00,13541.33
std,160.075723,,42.021233


In [39]:
cleaned_data.drop(cleaned_data[cleaned_data['Quantity'] <= 0].index,inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_data.drop(cleaned_data[cleaned_data['Quantity'] <= 0].index,inplace = True)


In [40]:
cleaned_data[cleaned_data['UnitPrice'] <= 0]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
6391,536941,22734,amazon,20,2010-12-03 12:08:00,0.0,,United Kingdom
6392,536942,22139,amazon,15,2010-12-03 12:08:00,0.0,,United Kingdom
9302,537197,22841,ROUND CAKE TIN VINTAGE GREEN,1,2010-12-05 14:02:00,0.0,12647.0,Germany
14335,537534,85064,CREAM SWEETHEART LETTER RACK,1,2010-12-07 11:48:00,0.0,,United Kingdom
14336,537534,84832,ZINC WILLIE WINKIE CANDLE STICK,1,2010-12-07 11:48:00,0.0,,United Kingdom
...,...,...,...,...,...,...,...,...
524622,580609,22927,Amazon,1,2011-12-05 11:41:00,0.0,,United Kingdom
535325,581202,23404,check,41,2011-12-07 18:30:00,0.0,,United Kingdom
535334,581211,22142,check,14,2011-12-07 18:36:00,0.0,,United Kingdom
538504,581406,46000M,POLYESTER FILLER PAD 45x45cm,240,2011-12-08 13:58:00,0.0,,United Kingdom


In [41]:
cleaned_data.drop(cleaned_data[cleaned_data['UnitPrice'] <= 0].index,inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_data.drop(cleaned_data[cleaned_data['UnitPrice'] <= 0].index,inplace = True)


In [42]:
cleaned_data.describe()

Unnamed: 0,Quantity,InvoiceDate,UnitPrice
count,524878.0,524878,524878.0
mean,10.6166,2011-07-04 15:30:16.317049088,3.922573
min,1.0,2010-12-01 08:26:00,0.001
25%,1.0,2011-03-28 12:13:00,1.25
50%,4.0,2011-07-20 11:22:00,2.08
75%,11.0,2011-10-19 11:41:00,4.13
max,80995.0,2011-12-09 12:50:00,13541.33
std,156.280031,,36.093028
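
The three `drop(..., inplace=True)` steps above could also be written as a single boolean mask that states which rows to keep. The following is a sketch on made-up rows (`toy` and `keep` are hypothetical names), not a replacement the notebook itself uses.

```python
import pandas as pd

# Hypothetical toy rows: only the first one passes all three conditions.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536941", "536414"],
    "Quantity": [6, -1, 20, -2],
    "UnitPrice": [2.55, 27.50, 0.0, 1.25],
})

# Keep rows that are not cancelled and have a positive quantity and price.
keep = (
    ~toy["InvoiceNo"].str.startswith("C")
    & (toy["Quantity"] > 0)
    & (toy["UnitPrice"] > 0)
)
filtered = toy[keep].copy()
print(filtered)
```

Writing the rule as "what to keep" puts all validity conditions in one place and avoids modifying the frame in place.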


## Saving the Cleaned Data

In [43]:
cleaned_data.to_csv('cleaned_e_commerce.csv',index = False)

In [44]:
pd.read_csv('cleaned_e_commerce.csv')


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
...,...,...,...,...,...,...,...,...
524873,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France
524874,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France
524875,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France
524876,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France
