# ADA Project: Dunnhumby dataset
## Tell me what you buy and I will tell you who you are



### Abstract
We would like to analyse the Dunnhumby dataset. Living in an age where every piece of our data is stored and analysed, and being active consumers ourselves, we would like to see what information retail chains can gather and infer about us knowing only our shopping habits. The dataset provides two years of transactions for several households together with their basic demographic profiles, so we want to see whether there are links and correlations between specific demographics (e.g. marital status, income, number of children) and purchase patterns. Furthermore, if time permits, we want to see if we can build a model that predicts a consumer's demographic profile from their shopping. Thus, we would like to see how "easy" and how precise it actually is for retailers to infer who their customers are from what they buy and to target them with specific marketing. Basically, we want to know how much of a target we actually are.

Research questions:
- What are the main shopping trends that we can identify in this data?
- Can we relate shopping trends to specific demographic parameters?
- Can we predict some of these demographic parameters (age, marital status, etc.) from the household's habits?
- Conversely, can we predict a household's consumption behaviour from its characteristics?
- What accuracy in consumption prediction can the retailer obtain from simple profile information?

### Task 1: Clean up the data and prepare the sets we want to keep

In [None]:
%matplotlib inline
import pandas as pd

import matplotlib.pyplot as plt

import os

In [None]:
os.getcwd()

In [None]:
# As stated in the description of our project, we concentrate on 3 of the 8 tables:
# - hh_demographic.csv
# - transaction_data.csv
# - product.csv
# In this first step, we load the data and prepare it for the analysis.

# Load the data
hh_demographic = pd.read_csv('../data/dunnhumby_complete_csv/hh_demographic.csv', sep = ',')

transaction_data = pd.read_csv('../data/dunnhumby_complete_csv/transaction_data.csv', sep = ',')

product = pd.read_csv('../data/dunnhumby_complete_csv/product.csv', sep = ',')

#### Data exploration

In [None]:
transaction_data.head(4)

In [None]:
hh_demographic.head(4)

In [None]:
product.head(4)

In [None]:
transaction_data.groupby('household_key').count().describe()

The *transaction_data* table contains transactions for 2500 distinct households. It would be interesting to know whether we have demographic data for all of them.

In [None]:
hh_demographic.household_key.is_unique

There are no duplicates in the table: each household has exactly one row.

In [None]:
hh_demographic.describe()

However, this table has only 801 rows: of the 2500 households represented in the *transaction_data* table, we have demographic data for only about one third (801 / 2500 ≈ 32%). We should keep this in mind later on, for example when merging the tables on the *household_key* column: we will have to decide whether to keep all households, thus introducing missing data, or to continue the study with only the one third of households that have a demographic profile.
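To make the trade-off between the two merge strategies concrete, here is a minimal sketch on toy data. The frames `toy_transactions` and `toy_demographic` and all their values are made up for illustration; only the `household_key` join column and the `AGE_DESC` column name come from the real tables.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration).
toy_transactions = pd.DataFrame({
    'household_key': [1, 1, 2, 3, 3, 3],
    'amount': [1.5, 2.0, 0.99, 3.2, 4.1, 0.5],  # hypothetical column
})
toy_demographic = pd.DataFrame({
    'household_key': [1, 3],
    'AGE_DESC': ['25-34', '45-54'],
})

# Share of transacting households that have a demographic profile.
tx_households = toy_transactions['household_key'].drop_duplicates()
coverage = tx_households.isin(toy_demographic['household_key']).mean()

# Inner join: keeps only transactions of households with a profile.
inner = toy_transactions.merge(toy_demographic, on='household_key', how='inner')

# Left join: keeps every transaction; missing profiles become NaN.
# indicator=True adds a '_merge' column telling where each row matched.
left = toy_transactions.merge(
    toy_demographic, on='household_key', how='left', indicator=True
)

print(f'coverage: {coverage:.2f}')
print(f'inner rows: {len(inner)}, left rows: {len(left)}')
print(left['_merge'].value_counts())
```

On the real tables, the same calls with `transaction_data` and `hh_demographic` would tell us how many transaction rows each choice keeps before we commit to one.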

In [None]:
product.PRODUCT_ID.is_unique

In [None]:
product.describe()

In the *product* table, there are no duplicates: each product appears exactly once, with its characteristics described in the corresponding row. There are 92 353 products. As for the households, we can investigate whether all the products appear in the *transaction_data* table.

In [None]:
transaction_data.groupby('PRODUCT_ID').count().describe()

There are 92 339 products represented in the *transaction_data* table, meaning that only 14 products are never purchased. An inner join is therefore easy to justify here, since it would drop only those 14 products.
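Instead of comparing counts by eye, we could list the unsold products directly. The sketch below uses toy frames with invented values; only the `PRODUCT_ID` column name comes from the real tables, and the `label` column is hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration).
toy_product = pd.DataFrame({
    'PRODUCT_ID': [10, 11, 12, 13],
    'label': ['bread', 'candles', 'milk', 'soap'],  # hypothetical column
})
toy_transactions = pd.DataFrame({'PRODUCT_ID': [10, 10, 12, 13, 13]})

# Number of distinct products that appear in at least one transaction.
n_sold = toy_transactions['PRODUCT_ID'].nunique()

# Products from the catalogue that never appear in a transaction.
sold_ids = toy_transactions['PRODUCT_ID'].unique()
never_sold = toy_product[~toy_product['PRODUCT_ID'].isin(sold_ids)]

print(f'{n_sold} products sold, {len(never_sold)} never sold')
print(never_sold)
```

Applied to `product` and `transaction_data`, this should return the products that an inner join would drop, so we can inspect them before discarding them.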

#### Some plots

In [None]:
hh_demographic.groupby('AGE_DESC').count()

In [None]:
hh_demographic['AGE_DESC'].value_counts().plot(kind='bar')

In [None]:
hh_demographic['MARITAL_STATUS_CODE'].value_counts().plot(kind='bar')

In [None]:
hh_demographic['INCOME_DESC'].value_counts().plot(kind='bar')

PS:
- we should continue to make some plots
- we should order the categories when it makes sense, so that the plots are more meaningful
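One possible way to order the categories is to reindex the counts with an explicit list before plotting. This is only a sketch: the age labels in `age_order` are assumed for illustration and should be replaced by the labels actually present in `AGE_DESC`, and `toy_age` is made-up data.

```python
import pandas as pd

# Toy age column (invented values); the real one is hh_demographic['AGE_DESC'].
toy_age = pd.Series(
    ['45-54', '19-24', '65+', '25-34', '45-54', '25-34', '45-54'],
    name='AGE_DESC',
)

# Assumed natural order of the labels; check it against the real data.
age_order = ['19-24', '25-34', '35-44', '45-54', '55-64', '65+']

# value_counts() sorts by frequency, so we reindex to the natural order.
# fill_value=0 keeps categories that have no household in the sample.
counts = toy_age.value_counts().reindex(age_order, fill_value=0)

print(counts)
# counts.plot(kind='bar') would then draw the bars in age order.
```

The same pattern would apply to `INCOME_DESC`, where frequency order makes the bar chart especially hard to read.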