# 1. Clustering online retail customers

For this exercise, we are going to use a small portion of the [Online Retail dataset](https://archive.ics.uci.edu/ml/datasets/Online+Retail), which contains transactions that occurred in November 2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

The assignment is to build a clustering model that groups the customers. You should find a suitable number of clusters (the elbow method is a good starting point).
Finally, you can validate your clustering model using cross-validation.
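
As a rough illustration of the elbow method mentioned above, the sketch below fits k-means for several values of `k` and records the inertia (within-cluster sum of squared distances). It assumes scikit-learn is available and uses synthetic data from `make_blobs` as a stand-in for the customer features; the range of `k` and all parameter values are illustrative choices, not part of the exercise.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the customer feature matrix (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Fit k-means for a range of k and record the inertia of each fit.
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# The "elbow" is the k after which inertia stops dropping sharply.
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` (for example with matplotlib) usually makes the elbow easier to spot than reading the printed numbers.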

This dataset doesn't follow the principles of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html), because each `CustomerID` spans multiple rows (one per purchased item, across different orders).

You will probably need to reshape this dataframe into one with a single row per customer.

There are different ways to process this dataset, for example:

- Use a text vectorizer to vectorize the product descriptions, then group the result by customer.
- Use `pd.crosstab` to create a dataframe with the customers as rows, the products as columns, and the number of times each customer bought each product as values.
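
The `pd.crosstab` option could look roughly like the sketch below. It uses a small made-up dataframe (named `toy` here, a hypothetical name) that borrows the `CustomerID`, `Description` and `Quantity` column names from the dataset; the values are invented for illustration.

```python
import pandas as pd

# Toy transactions that reuse the dataset's column names (values are made up).
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 1.0, 2.0, 2.0, 3.0],
    "Description": [
        "RED HARMONICA IN BOX", "RED HARMONICA IN BOX",
        "BAG 250g SWIRLY MARBLES", "RED HARMONICA IN BOX",
        "SMALL GLASS HEART TRINKET POT", "BAG 250g SWIRLY MARBLES",
    ],
    "Quantity": [12, 6, 12, 12, 8, 24],
})

# One row per customer, one column per product,
# values = number of rows (purchase lines) for that pair.
counts = pd.crosstab(toy["CustomerID"], toy["Description"])

# Alternative: sum the purchased quantity instead of counting rows.
quantities = pd.crosstab(
    toy["CustomerID"], toy["Description"],
    values=toy["Quantity"], aggfunc="sum",
).fillna(0)

print(counts)
```

Counting rows treats every purchase line equally, while summing `Quantity` gives more weight to bulk buyers; which one suits the clustering better is a modelling choice.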

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("./data/retail.csv")

In [3]:
df.shape

(84711, 8)

In [4]:
df.dtypes

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object

In [5]:
df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 573744 | 21314 | SMALL GLASS HEART TRINKET POT | 8 | 2011-11-01 08:16:00 | 2.1 | 17733.0 | United Kingdom |
| 1 | 573744 | 21704 | BAG 250g SWIRLY MARBLES | 12 | 2011-11-01 08:16:00 | 0.85 | 17733.0 | United Kingdom |
| 2 | 573744 | 21791 | VINTAGE HEADS AND TAILS CARD GAME | 12 | 2011-11-01 08:16:00 | 1.25 | 17733.0 | United Kingdom |
| 3 | 573744 | 21892 | TRADITIONAL WOODEN CATCH CUP GAME | 12 | 2011-11-01 08:16:00 | 1.25 | 17733.0 | United Kingdom |
| 4 | 573744 | 21915 | RED HARMONICA IN BOX | 12 | 2011-11-01 08:16:00 | 1.25 | 17733.0 | United Kingdom |


# 2. Anomaly Detection

One of the uses of clustering algorithms is [Anomaly Detection](https://www.datascience.com/blog/python-anomaly-detection). If clustering finds groups of similar elements, then the elements that don't fit well into any cluster are the most likely to be anomalies.

For this exercise, we are going to use a dataset of credit card transactions. Your assignment is to implement a model that clusters the data properly and finds the potential outliers.
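
One possible way to turn a clustering into an outlier score is to measure how far each point lies from its nearest cluster centre. The sketch below shows this with k-means on synthetic data; the data, the number of clusters and the 99% distance cutoff are all assumptions made for illustration, and other clustering algorithms or scores may suit the real dataset better.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two dense groups plus three far-away points (rows 400-402).
normal = np.vstack([
    rng.normal(0, 1, size=(200, 2)),
    rng.normal(8, 1, size=(200, 2)),
])
extreme = np.array([[25.0, -20.0], [-20.0, 25.0], [30.0, 30.0]])
X = np.vstack([normal, extreme])

# Scale features so that no single column dominates the distances.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# transform() gives the distance to every centroid; keep the nearest one.
dist = kmeans.transform(X_scaled).min(axis=1)

# Flag points whose distance exceeds the 99th percentile (an arbitrary cutoff).
threshold = np.quantile(dist, 0.99)
outlier_idx = np.where(dist > threshold)[0]
print(outlier_idx)
```

A fixed percentile always flags some points, even in clean data, so the cutoff is best treated as a tuning knob rather than a definition of what counts as an anomaly.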

In [6]:
df = pd.read_csv("./data/CC GENERAL.csv")

In [7]:
df.dtypes

CUST_ID                              object
BALANCE                             float64
BALANCE_FREQUENCY                   float64
PURCHASES                           float64
ONEOFF_PURCHASES                    float64
INSTALLMENTS_PURCHASES              float64
CASH_ADVANCE                        float64
PURCHASES_FREQUENCY                 float64
ONEOFF_PURCHASES_FREQUENCY          float64
PURCHASES_INSTALLMENTS_FREQUENCY    float64
CASH_ADVANCE_FREQUENCY              float64
CASH_ADVANCE_TRX                      int64
PURCHASES_TRX                         int64
CREDIT_LIMIT                        float64
PAYMENTS                            float64
MINIMUM_PAYMENTS                    float64
PRC_FULL_PAYMENT                    float64
TENURE                                int64
dtype: object

In [8]:
# Keep the customer IDs aside, then drop them so that only feature columns remain.
customer_ids = df["CUST_ID"]
df = df.drop(columns="CUST_ID")