## RETAIL in GERMANY (mini-project)

**Import the dataset from a CSV file in the working directory**

In [117]:
import pandas as pd

# Save the resulting DataFrame to retail
retail = pd.read_csv('data.csv.zip', encoding='ISO-8859-1', compression='zip')

retail

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 12/9/2011 12:50 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 12/9/2011 12:50 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France |


**Save the column names in a variable retail_columns**

In [118]:
retail_columns = retail.columns

for i in retail_columns:
    print(i)

InvoiceNo
StockCode
Description
Quantity
InvoiceDate
UnitPrice
CustomerID
Country


**Column descriptions:**
* InvoiceNo – invoice ID
* StockCode – product ID
* Description – product description
* Quantity – number of items ordered
* InvoiceDate – transaction date
* UnitPrice – price per unit
* CustomerID – customer ID
* Country – customer's country of residence

**Check for duplicates. Print the number of duplicate rows (if any)**

In [119]:
retail.duplicated().sum()

5268
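As a small illustration of what this count means, the sketch below uses a toy DataFrame (hypothetical values, not taken from the retail data). By default `duplicated()` flags only the repeats of a row, not its first occurrence, so the sum equals the number of rows that `drop_duplicates()` would remove.

```python
import pandas as pd

# Toy data (hypothetical values): the first three rows are identical
toy = pd.DataFrame({
    'InvoiceNo': ['1', '1', '1', '2'],
    'StockCode': ['A', 'A', 'A', 'B'],
    'Quantity': [6, 6, 6, 1],
})

# Default keep='first': every repeat after the first occurrence is flagged
n_duplicates = toy.duplicated().sum()

# keep=False flags all members of a duplicated group, including the first
n_in_duplicate_groups = toy.duplicated(keep=False).sum()

# Rows remaining once the repeats are dropped
n_unique_rows = len(toy.drop_duplicates())

print(n_duplicates, n_in_duplicate_groups, n_unique_rows)
```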

**Remove the duplicate rows (if any)**

In [120]:
retail = retail.drop_duplicates()

retail

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 12/9/2011 12:50 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 12/9/2011 12:50 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France |


**Count the rows that belong to cancelled purchases (invoice numbers starting with 'C')**

In [121]:
retail['InvoiceNo'].str.startswith('C').value_counts()

False    527390
True       9251
Name: InvoiceNo, dtype: int64
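Note that `value_counts()` on the boolean mask counts line items, not invoices, because one invoice usually spans several rows. The sketch below uses toy data (hypothetical invoice numbers) to show one possible way of getting both figures; the names `is_cancelled`, `cancelled_rows` and `cancelled_invoices` are introduced here for illustration only.

```python
import pandas as pd

# Toy data (hypothetical): invoice C536379 has two line items
toy = pd.DataFrame({
    'InvoiceNo': ['536365', 'C536379', 'C536379', 'C536383'],
    'Quantity': [6, -1, -2, -1],
})

# Boolean mask: True where the invoice number starts with 'C'
is_cancelled = toy['InvoiceNo'].str.startswith('C')

# Number of rows (line items) that belong to cancelled invoices
cancelled_rows = int(is_cancelled.sum())

# Number of distinct cancelled invoices
cancelled_invoices = toy.loc[is_cancelled, 'InvoiceNo'].nunique()

print(cancelled_rows, cancelled_invoices)
```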

**Filter the dataset. Only purchases with Quantity > 0 should remain**

In [122]:
retail = retail.query('Quantity > 0')

retail.shape

(526054, 8)

**Count the number of purchases for every customer from Germany. Print the IDs of the customers whose purchase count is above the 80th percentile.**

In [124]:
germany_all = retail.query('Country == "Germany"') \
    .groupby(['CustomerID'], as_index=False) \
    .agg({'InvoiceNo': 'nunique'})
    
transactions_80 = germany_all.InvoiceNo.quantile(q=0.8)

germany_top = germany_all.query('InvoiceNo > @transactions_80').drop(columns='InvoiceNo') \
    .reset_index(drop=True)

germany_top

| | CustomerID |
|---|---|
| 0 | 12471.0 |
| 1 | 12474.0 |
| 2 | 12476.0 |
| 3 | 12481.0 |
| 4 | 12500.0 |
| 5 | 12524.0 |
| 6 | 12569.0 |
| 7 | 12600.0 |
| 8 | 12619.0 |
| 9 | 12621.0 |
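To make the percentile step concrete, here is a minimal sketch on toy data (hypothetical customer and invoice IDs). It assumes the default linear interpolation of `Series.quantile` and the same strict `>` comparison as the cell above, so a customer sitting exactly on the threshold would not be selected.

```python
import pandas as pd

# Toy line items (hypothetical IDs); invoice 'd' spans two rows
orders = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2, 3, 4, 5],
    'InvoiceNo': ['a', 'b', 'c', 'd', 'd', 'e', 'f', 'g'],
})

# Distinct invoices per customer: customer 1 has 3, the others have 1 each
per_customer = orders.groupby('CustomerID')['InvoiceNo'].nunique()

# Linear interpolation over the sorted counts [1, 1, 1, 1, 3]:
# position 0.8 * (5 - 1) = 3.2, so the value is 1 + 0.2 * (3 - 1) = 1.4
threshold = per_customer.quantile(0.8)

# Strict comparison: only customers above the threshold are kept
top_customers = per_customer[per_customer > threshold].index.tolist()

print(threshold, top_customers)
```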


**Save the IDs of these customers to a list**

In [125]:
lst = germany_top.CustomerID.to_list()

print(lst)

[12471.0, 12474.0, 12476.0, 12481.0, 12500.0, 12524.0, 12569.0, 12600.0, 12619.0, 12621.0, 12626.0, 12647.0, 12662.0, 12705.0, 12708.0, 12709.0, 12712.0, 12720.0]


**Filter the dataset, keeping only the customers whose IDs are in the list above**

In [130]:
top_retail_germany = retail.query('CustomerID in @germany_top.CustomerID')

top_retail_germany

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 1109 | 536527 | 22809 | SET OF 6 T-LIGHTS SANTA | 6 | 12/1/2010 13:04 | 2.95 | 12662.0 | Germany |
| 1110 | 536527 | 84347 | ROTATING SILVER ANGELS T-LIGHT HLDR | 6 | 12/1/2010 13:04 | 2.55 | 12662.0 | Germany |
| 1111 | 536527 | 84945 | MULTI COLOUR SILVER T-LIGHT HOLDER | 12 | 12/1/2010 13:04 | 0.85 | 12662.0 | Germany |
| 1112 | 536527 | 22242 | 5 HOOK HANGER MAGIC TOADSTOOL | 12 | 12/1/2010 13:04 | 1.65 | 12662.0 | Germany |
| 1113 | 536527 | 22244 | 3 HOOK HANGER MAGIC GARDEN | 12 | 12/1/2010 13:04 | 1.95 | 12662.0 | Germany |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541726 | 581570 | 22139 | RETROSPOT TEA SET CERAMIC 11 PC | 3 | 12/9/2011 11:59 | 4.95 | 12662.0 | Germany |
| 541727 | 581570 | 23077 | DOUGHNUT LIP GLOSS | 20 | 12/9/2011 11:59 | 1.25 | 12662.0 | Germany |
| 541728 | 581570 | 20750 | RED RETROSPOT MINI CASES | 2 | 12/9/2011 11:59 | 7.95 | 12662.0 | Germany |
| 541729 | 581570 | 22505 | MEMO BOARD COTTAGE DESIGN | 4 | 12/9/2011 11:59 | 4.95 | 12662.0 | Germany |
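The same membership filter can also be written with boolean indexing and `isin`. The sketch below compares the two forms on toy data; the IDs 12471.0 and 12474.0 come from the list above, while 99999.0, the quantities and the name `top_ids` are made up for illustration.

```python
import pandas as pd

# Toy data: 99999.0 is a hypothetical customer outside the top list
toy = pd.DataFrame({
    'CustomerID': [12471.0, 99999.0, 12474.0, 12471.0],
    'Quantity': [6, 2, 12, 3],
})
top_ids = [12471.0, 12474.0]

# Membership test inside query(); @ refers to a local Python variable
via_query = toy.query('CustomerID in @top_ids')

# Equivalent boolean-mask form
via_isin = toy[toy['CustomerID'].isin(top_ids)]

print(via_query.equals(via_isin))
```

Both forms keep the original row index, which is why the filtered table above starts at 1109 rather than 0.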


**Find the most popular product among top customers from Germany**

In [131]:
top_retail_germany.groupby(['StockCode'], as_index=False) \
    .agg({'InvoiceNo': 'nunique'}) \
    .sort_values('InvoiceNo', ascending=False)

| | StockCode | InvoiceNo |
|---|---|---|
| 1157 | POST | 213 |
| 409 | 22326 | 52 |
| 411 | 22328 | 38 |
| 453 | 22423 | 34 |
| 45 | 20719 | 30 |
| ... | ... | ... |
| 520 | 22563 | 1 |
| 524 | 22569 | 1 |
| 528 | 22574 | 1 |
| 529 | 22576 | 1 |
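The top entry, `POST`, looks like a postage charge rather than a product (an assumption that is worth checking against the Description column). If so, the most popular actual product would be the next code, 22326. A minimal sketch of excluding such service codes before ranking, on toy data with a hypothetical `service_codes` list:

```python
import pandas as pd

# Toy data (hypothetical invoices): POST appears on every invoice
toy = pd.DataFrame({
    'InvoiceNo': ['1', '1', '2', '2', '3', '3'],
    'StockCode': ['POST', '22326', 'POST', '22326', 'POST', '22328'],
})

# Assumption: codes listed here are service charges, not products
service_codes = ['POST']
products_only = toy[~toy['StockCode'].isin(service_codes)]

# Rank products by the number of distinct invoices they appear in
popularity = (
    products_only.groupby('StockCode')['InvoiceNo']
    .nunique()
    .sort_values(ascending=False)
)
most_popular = popularity.index[0]

print(most_popular, popularity.iloc[0])
```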


**Compute the revenue of each row (UnitPrice × Quantity) and store it in a new column Revenue**

In [133]:
retail = retail.assign(Revenue = retail.UnitPrice * retail.Quantity)

retail

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | Revenue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom | 15.30 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom | 22.00 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 12/9/2011 12:50 | 0.85 | 12680.0 | France | 10.20 |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 12/9/2011 12:50 | 2.10 | 12680.0 | France | 12.60 |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France | 16.60 |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 12/9/2011 12:50 | 4.15 | 12680.0 | France | 16.60 |


**Compute the total revenue of each purchase. List the IDs of the five purchases with the highest total revenue**

In [151]:
top5_inv = retail.groupby(['InvoiceNo'], as_index=False) \
    .agg({'Revenue': 'sum'}) \
    .sort_values('Revenue', ascending=False) \
    .head(5) \
    .InvoiceNo \
    .to_list()


', '.join(top5_inv)

'581483, 541431, 574941, 576365, 556444'
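As an alternative to sorting and taking `head(5)`, the same ranking could be sketched with `Series.nlargest`. The toy data below is hypothetical and uses a top 2 instead of a top 5 to stay short; the names `invoice_totals` and `top2` are introduced here for illustration.

```python
import pandas as pd

# Toy data (hypothetical invoices and prices)
toy = pd.DataFrame({
    'InvoiceNo': ['A', 'A', 'B', 'C', 'C'],
    'UnitPrice': [2.0, 1.0, 12.0, 0.5, 4.0],
    'Quantity': [3, 4, 1, 2, 5],
})

# Row-level revenue, as in the Revenue column above
toy = toy.assign(Revenue=toy['UnitPrice'] * toy['Quantity'])

# Total per invoice: A = 6 + 4 = 10, B = 12, C = 1 + 20 = 21
invoice_totals = toy.groupby('InvoiceNo')['Revenue'].sum()

# nlargest returns the n highest totals, already in descending order
top2 = invoice_totals.nlargest(2).index.to_list()

print(', '.join(top2))
```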