<a href="https://colab.research.google.com/github/krakowiakpawel9/ml_course/blob/master/ul/17_customer_segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The core library for machine learning in Python.

To install scikit-learn, run the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, run the command below:
```
!pip install --upgrade scikit-learn
```
This course was built with version `0.22.1`.
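
Because the notebook targets a specific scikit-learn release, it can help to check which version is installed before running the cells. A minimal sketch (the variable name `installed_version` is only illustrative):

```python
import sklearn

# Version string of the scikit-learn installation in the current environment
installed_version = sklearn.__version__
print(installed_version)
```

If the printed version differs from `0.22.1`, some printed outputs in this notebook (for example the `KMeans(...)` parameter listing) may look different.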

### Table of contents:
1. [Import libraries](#0)
2. [Data generation](#1)
3. [Data visualization](#2)
4. [K-means algorithm](#3)
5. [Cluster visualization](#4)




### <a name='0'></a> Import libraries

In [0]:
import numpy as np
import pandas as pd
import plotly.express as px

In [0]:
url = 'https://storage.googleapis.com/esmartdata-courses-files/ml-course/OnlineRetail.csv'

In [3]:
# Read the file from the URL defined above, so no local copy of OnlineRetail.csv is required
raw_data = pd.read_csv(url, encoding='latin', parse_dates=['InvoiceDate'])
data = raw_data.copy()
data.head(3)

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


In [5]:
data.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


In [6]:
data.describe(include=['object'])

Unnamed: 0,InvoiceNo,StockCode,Description,Country
count,541909,541909,540455,541909
unique,25900,4070,4223,38
top,573585,85123A,WHITE HANGING HEART T-LIGHT HOLDER,United Kingdom
freq,1114,2313,2369,495478


In [7]:
data.describe(include=['datetime'])

Unnamed: 0,InvoiceDate
count,541909
unique,23260
top,2011-10-31 14:41:00
freq,1114
first,2010-12-01 08:26:00
last,2011-12-09 12:50:00


In [8]:
data.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

In [9]:
data = data.dropna()
data.isnull().sum()

InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64

In [0]:
data['CustomerID'] = data['CustomerID'].astype(int)

In [11]:
data['Country'].value_counts()

United Kingdom          361878
Germany                   9495
France                    8491
EIRE                      7485
Spain                     2533
Netherlands               2371
Belgium                   2069
Switzerland               1877
Portugal                  1480
Australia                 1259
Norway                    1086
Italy                      803
Channel Islands            758
Finland                    695
Cyprus                     622
Sweden                     462
Austria                    401
Denmark                    389
Japan                      358
Poland                     341
USA                        291
Israel                     250
Unspecified                244
Singapore                  229
Iceland                    182
Canada                     151
Greece                     146
Malta                      127
United Arab Emirates        68
European Community          61
RSA                         58
Lebanon                     45
Lithuani

In [12]:
tmp = data['Country'].value_counts().reset_index()
tmp.columns = ['Country', 'Count']
tmp.query("Count > 200", inplace=True)
px.bar(tmp, x='Country', y='Count', template='plotly_dark', color_discrete_sequence=['#03fcb5'])

In [13]:
data_uk = data.query("Country == 'United Kingdom'").copy()
data_uk.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom


In [14]:
tmp = data_uk.groupby(data_uk['InvoiceDate'].dt.date)['CustomerID'].count().reset_index()
tmp.columns = ['InvoiceDate', 'Count']
tmp.head()

Unnamed: 0,InvoiceDate,Count
0,2010-12-01,1809
1,2010-12-02,2029
2,2010-12-03,937
3,2010-12-05,2492
4,2010-12-06,1915


In [15]:
from plotly.subplots import make_subplots

fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.02)

trace1 = px.line(tmp, x='InvoiceDate', y='Count', template='plotly_dark', color_discrete_sequence=['#03fcb5'])['data'][0]
trace2 = px.scatter(tmp, x='InvoiceDate', y='Count', template='plotly_dark', color_discrete_sequence=['#03fcb5'])['data'][0]

fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)
fig.update_layout(template='plotly_dark', title='Count by day', width=950)
fig.show()

In [16]:
data_uk.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom


In [17]:
data_uk['Sales'] = data_uk['Quantity'] * data_uk['UnitPrice']
data_uk.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Sales
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


In [18]:
tmp = data_uk.groupby(data_uk['InvoiceDate'].dt.date)['Sales'].sum().reset_index()
tmp.columns = ['InvoiceDate', 'Sales']
tmp.head()

Unnamed: 0,InvoiceDate,Sales
0,2010-12-01,42030.85
1,2010-12-02,45622.08
2,2010-12-03,17512.44
3,2010-12-05,25458.85
4,2010-12-06,29007.74
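
Grouping by `dt.date` only returns days that have at least one row, which is why 2010-12-04 does not appear in the output above. If a continuous daily series is preferred, one option is `resample`. The sketch below uses hypothetical toy data (the frame `toy` and its values are not taken from OnlineRetail.csv):

```python
import pandas as pd

# Hypothetical toy sales with no rows on 2010-12-04
toy = pd.DataFrame({
    'InvoiceDate': pd.to_datetime([
        '2010-12-03 09:00', '2010-12-03 15:30', '2010-12-05 11:00',
    ]),
    'Sales': [10.0, 5.0, 7.5],
})

# resample('D') creates one bin per calendar day, including empty ones;
# summing an empty bin yields 0.0
daily = toy.set_index('InvoiceDate')['Sales'].resample('D').sum().reset_index()
print(daily)
```

Whether empty days should be shown as zero or left out depends on the question being asked; for a line chart, explicit zeros avoid drawing a straight segment across a day without sales.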


In [19]:
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.02)

trace1 = px.line(tmp, x='InvoiceDate', y='Sales', template='plotly_dark', color_discrete_sequence=['#03fcb5'])['data'][0]
trace2 = px.scatter(tmp, x='InvoiceDate', y='Sales', template='plotly_dark', color_discrete_sequence=['#03fcb5'])['data'][0]

fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)
fig.update_layout(template='plotly_dark', title='Sales by day', width=950)
fig.show()

In [20]:
data_user = pd.DataFrame(data['CustomerID'].unique(), columns=['CustomerID'])
data_user.head(3)

Unnamed: 0,CustomerID
0,17850
1,13047
2,12583


In [21]:
last_purchase = data_uk.groupby('CustomerID')['InvoiceDate'].max().reset_index()
last_purchase.columns = ['CustomerID', 'LastPurchaseDate']
last_purchase

Unnamed: 0,CustomerID,LastPurchaseDate
0,12346,2011-01-18 10:17:00
1,12747,2011-12-07 14:34:00
2,12748,2011-12-09 12:20:00
3,12749,2011-12-06 09:56:00
4,12820,2011-12-06 15:12:00
...,...,...
3945,18280,2011-03-07 09:52:00
3946,18281,2011-06-12 10:53:00
3947,18282,2011-12-02 11:43:00
3948,18283,2011-12-06 12:02:00


In [22]:
last_purchase['LastPurchaseDate'].max()

Timestamp('2011-12-09 12:49:00')

In [23]:
last_purchase['LastPurchaseDate'].min()

Timestamp('2010-12-01 09:53:00')

In [24]:
last_purchase['Retention'] = (last_purchase['LastPurchaseDate'].max() - last_purchase['LastPurchaseDate']).dt.days
last_purchase

Unnamed: 0,CustomerID,LastPurchaseDate,Retention
0,12346,2011-01-18 10:17:00,325
1,12747,2011-12-07 14:34:00,1
2,12748,2011-12-09 12:20:00,0
3,12749,2011-12-06 09:56:00,3
4,12820,2011-12-06 15:12:00,2
...,...,...,...
3945,18280,2011-03-07 09:52:00,277
3946,18281,2011-06-12 10:53:00,180
3947,18282,2011-12-02 11:43:00,7
3948,18283,2011-12-06 12:02:00,3
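
The `Retention` column above is the number of whole days between the latest purchase in the data and each customer's own last purchase. The same steps can be sketched on a small, hypothetical frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy transactions (not taken from OnlineRetail.csv)
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 3, 3],
    'InvoiceDate': pd.to_datetime([
        '2011-12-01 10:00', '2011-12-09 12:00',
        '2011-11-09 12:00', '2011-12-08 13:00', '2011-06-01 09:00',
    ]),
})

# Most recent purchase per customer
toy_last = toy.groupby('CustomerID')['InvoiceDate'].max().reset_index()
toy_last.columns = ['CustomerID', 'LastPurchaseDate']

# Whole days between the latest purchase overall and each customer's last purchase
reference = toy_last['LastPurchaseDate'].max()
toy_last['Retention'] = (reference - toy_last['LastPurchaseDate']).dt.days
print(toy_last)
```

Note that `.dt.days` keeps only the whole-day part of the difference: customer 3 bought 23 hours before the reference timestamp and therefore gets `0`, the same value as customer 1.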


In [25]:
last_purchase['Retention'].value_counts()

3      114
8       97
0       97
2       92
1       77
      ... 
243      1
370      1
285      1
295      1
174      1
Name: Retention, Length: 348, dtype: int64

In [26]:
px.histogram(last_purchase, x='Retention', template='plotly_dark', 
             width=950, height=500, title='Retention', nbins=100, 
             color_discrete_sequence=['#03fcb5'])

In [27]:
data_user = pd.merge(data_user, last_purchase, on='CustomerID')
data_user = data_user[['CustomerID', 'Retention']]
data_user.head()

Unnamed: 0,CustomerID,Retention
0,17850,301
1,13047,31
2,13748,95
3,15100,329
4,15291,25


In [28]:
px.scatter(data_user, x='CustomerID', y='Retention', template='plotly_dark',
           color_discrete_sequence=['#03fcb5'])

In [62]:
data_retention = data_user[['Retention']]
data_retention.head()

Unnamed: 0,Retention
0,301
1,31
2,95
3,329
4,25


In [0]:
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 10):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data_retention)
    wcss.append(kmeans.inertia_)

In [64]:
wcss = pd.DataFrame(data=np.c_[range(1, 10), wcss], columns=['number_of_clusters', 'wcss'])
wcss

Unnamed: 0,number_of_clusters,wcss
0,1.0,39672140.0
1,2.0,8179631.0
2,3.0,3895369.0
3,4.0,2219888.0
4,5.0,1369992.0
5,6.0,951810.9
6,7.0,712918.2
7,8.0,548520.4
8,9.0,413444.5
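
The `inertia_` value collected in the loop above is the within-cluster sum of squares (WCSS): the sum of squared distances from each point to the centroid of its assigned cluster. A minimal sketch on hypothetical synthetic data (the array `X` and the fixed `random_state` are assumptions made for reproducibility) recomputes it by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical one-dimensional data with three well-separated groups
rng = np.random.RandomState(0)
X = np.concatenate([
    rng.normal(10, 2, 50),
    rng.normal(100, 5, 50),
    rng.normal(300, 10, 50),
]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Centroid assigned to every sample, then the sum of squared distances to it
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_wcss = ((X - assigned_centers) ** 2).sum()

print(manual_wcss, kmeans.inertia_)
```

WCSS always decreases as the number of clusters grows, so the elbow method looks for the point where the decrease flattens rather than for the minimum.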


In [65]:
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.02)

trace1 = px.line(wcss, x='number_of_clusters', y='wcss', template='plotly_dark', color_discrete_sequence=['#03fcb5'])['data'][0]
trace2 = px.scatter(wcss, x='number_of_clusters', y='wcss', template='plotly_dark', color_discrete_sequence=['#03fcb5'])['data'][0]

fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)
fig.update_layout(template='plotly_dark', title='WCSS', width=950, height=700)
fig.show()

In [66]:
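# Note: data_user holds CustomerID alongside Retention, so every column is used
# as a feature here; the WCSS analysis above was computed on data_retention only.
kmeans = KMeans(n_clusters=3)
kmeans.fit(data_user)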

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [67]:
data_user['Cluster'] = kmeans.labels_
data_user.head()

Unnamed: 0,CustomerID,Retention,cluster,Cluster
0,17850,301,1,1
1,13047,31,0,0
2,13748,95,0,0
3,15100,329,2,2
4,15291,25,2,2


In [68]:
data_user.groupby('Cluster').describe()

CustomerID
Cluster,count,mean,std,min,25%,50%,75%,max
0,1322.0,13742.080938,540.070579,12346.0,13271.25,13740.5,14211.75,14667.0
1,1297.0,17391.124133,520.45008,16493.0,16932.0,17396.0,17849.0,18287.0
2,1331.0,15587.300526,523.020524,14669.0,15141.0,15589.0,16041.5,16491.0

Retention
Cluster,count,mean,std,min,25%,50%,75%,max
0,1322.0,88.684569,98.43412,0.0,15.0,49.0,136.75,373.0
1,1297.0,91.850424,102.641896,0.0,15.0,46.0,146.0,373.0
2,1331.0,91.813674,99.661336,0.0,17.0,49.0,145.5,373.0

cluster
Cluster,count,mean,std,min,25%,50%,75%,max
0,1322.0,0.025719,0.22542,0.0,0.0,0.0,0.0,2.0
1,1297.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
2,1331.0,1.990233,0.098382,1.0,2.0,2.0,2.0,2.0


In [69]:
px.scatter(data_user, x='CustomerID', y='Retention', color='Cluster', template='plotly_dark',
           color_discrete_sequence=['#03fcb5'])
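
In the `groupby('Cluster').describe()` summary above, the three clusters cover nearly disjoint CustomerID ranges (12346 to 14667, 14669 to 16491, 16493 to 18287) while their Retention statistics are almost identical. That is the pattern one would expect when the fit includes the CustomerID column, whose numeric range is much wider than that of Retention. The sketch below illustrates the effect on hypothetical synthetic data (the frame `toy`, its values, and the fixed `random_state` are assumptions, not part of the original notebook) and compares it with a fit on the Retention column alone:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: an ID-like column with a wide numeric range and a
# Retention-like column with two clear groups (low and high values)
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    'CustomerID': rng.permutation(np.arange(12000, 18000, 30)),
    'Retention': np.concatenate([rng.randint(0, 30, 100), rng.randint(300, 370, 100)]),
})

# Fit on both columns: distances are dominated by the CustomerID range
labels_both = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    toy[['CustomerID', 'Retention']])

# Fit on Retention only: clusters follow the Retention groups
labels_ret = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    toy[['Retention']])

toy['ClusterBoth'] = labels_both
toy['ClusterRetention'] = labels_ret
print(toy.groupby('ClusterBoth')['Retention'].agg(['min', 'max']))
print(toy.groupby('ClusterRetention')['Retention'].agg(['min', 'max']))
```

With the Retention-only fit, each cluster contains either the low or the high group; with both columns, each cluster mixes the two. Applied to this notebook, that would correspond to fitting on `data_retention` instead of `data_user`, though the outputs shown above come from the original cells.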