# Data Analytics - Challenge


The goal of this challenge is to analyze a restaurant's invoices. Some cells are already implemented, you just need to **run** them. Other cells need you to write some code.

Start the challenge by running the two following cells:

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('data/invoices.csv')

---

❓ Display the first 10 rows of the dataset (no need to sort)

<details>
    <summary>🙈 Reveal solution</summary>

<p>
You can use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html"><code>pandas.DataFrame.head()</code></a> function:
    
<pre>
df.head(10)
</pre>
</p>
</details>

In [None]:
# Your code here

---

❓ How many kinds of products can you find in the dataset?

<details>
    <summary>🙈 Reveal solution</summary>

<p>
You can use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html"><code>pandas.Series.unique()</code></a> method combined with the <code>len()</code> Python built-in.
    
<pre>
len(df['product'].unique())
</pre>
</p>
</details>
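
As a side note, counting distinct values behaves slightly differently depending on how missing values are handled. The sketch below uses a small toy column (the values are made up for illustration, not taken from `invoices.csv`) to compare `len(Series.unique())` with `Series.nunique()`:

```python
import pandas as pd

# Toy column (hypothetical values, not taken from invoices.csv)
products = pd.Series(["pizza", "pasta", "pizza", None, "salad"], name="product")

distinct_values = products.unique()      # array of distinct values, missing value included
n_with_missing = len(distinct_values)    # counts the missing value as one "kind"
n_without_missing = products.nunique()   # skips missing values by default (dropna=True)

print(n_with_missing, n_without_missing)
```

If the column has no missing values, both approaches give the same number.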

In [None]:
# Your code here

---

❓ Which kind of product is the most represented in the invoices? Plot this with a Seaborn countplot.

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
df['product'].value_counts()
</pre>
    
<pre>
sns.countplot(data=df, x='product')
</pre>
</p>
</details>
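
To see what `value_counts()` returns before plotting, here is a minimal sketch on toy data (made-up values, not the real invoices). A countplot draws these same counts as bars:

```python
import pandas as pd

# Toy data (hypothetical values, not taken from invoices.csv)
toy = pd.DataFrame({"product": ["pizza", "pasta", "pizza", "salad", "pizza", "pasta"]})

# value_counts() returns one row per distinct value, sorted from most to least frequent
counts = toy["product"].value_counts()

# idxmax() gives the label with the highest count
most_common = counts.idxmax()

print(counts)
print(most_common)
```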

In [None]:
# Your code here

In [None]:
# Your plot here

---

❓ Try some other countplots, varying `x` with one of the categorical columns (`document_type`, `contract_type`).
Is there any value to the `document_type` column?

In [None]:
# Your first plot here
# To add a cell, you can go in the menu and do Insert > Insert cell below

---
❓ Let's plot the distribution of `amount` based on a given category. Start with `product`:

```python
sns.catplot(data=df, x='product', y='amount', kind="box")
```

1. Change the value of `x` to one of the categorical columns of the dataset, and the value of `kind` (`"bar"`, `"box"`, `"violin"`, `"boxen"`)
1. Change the value of `y` to one of the numerical columns of the dataset

In [None]:
# Your experiments here

---
❓ Let's use [`seaborn.FacetGrid`](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html)

1. Run the cell below. What do you observe?
2. Change `col` in the first line to another column (e.g. `"contract_type"`). Run the cell again. What do you observe?

In [None]:
g = sns.FacetGrid(df, col="product")
g.map(plt.hist, "amount")

---
❓ Let's continue with FacetGrid and add a `row="contract_type"` parameter. How many cells do you get in the plot?

<details>
    <summary>🙈 Reveal solution</summary>

<p>
You get 7 * 7 = 49 cells!
    
<pre>
g = sns.FacetGrid(df, col="product", row="contract_type")
g.map(plt.hist, "amount")
</pre>
</p>
</details>

In [None]:
# Your code here

---
❓ Build a representation of the time-to-pay ranges of the invoices. You may find it helpful to use the [`pd.cut()`](https://pandas.pydata.org/pandas-docs/stable//reference/api/pandas.cut.html) function. Here are a few intermediate steps:

* Create a subset of the DataFrame with cleared invoices
* Generate a `time_to_pay` column
* Cut the values of the column into categories (e.g. '7 days') and store the result in another new column: `time_ranges`
* Display the representation
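
The binning step can be sketched on toy data first. Everything below is an illustration under stated assumptions: the dates are made up, the labels are hypothetical, and only the `due_date` and `clearing_date` column names mirror the dataset. Note that subtracting two datetime columns gives Timedeltas, so the sketch converts them to a number of days before comparing them to numeric bins:

```python
import pandas as pd

# Toy data (hypothetical dates, not taken from invoices.csv)
toy = pd.DataFrame({
    "due_date": ["2021-01-10", "2021-01-10", "2021-01-10", "2021-01-10"],
    "clearing_date": ["2021-01-08", "2021-01-15", "2021-02-01", "2021-03-30"],
})

# Datetime subtraction yields Timedeltas; .dt.days turns them into integers
toy["time_to_pay"] = (
    pd.to_datetime(toy["clearing_date"]) - pd.to_datetime(toy["due_date"])
).dt.days

# Bins are right-inclusive by default: (-inf, 0], (0, 7], (7, 30], (30, inf]
bins = [float("-inf"), 0, 7, 30, float("inf")]
labels = ["early or on time", "within a week", "within a month", "later"]  # hypothetical labels
toy["time_ranges"] = pd.cut(toy["time_to_pay"], bins=bins, labels=labels)

print(toy[["time_to_pay", "time_ranges"]])
```

Passing one more bin edge than labels is required: each label names the interval between two consecutive edges.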

<details>
    <summary>🙈 Reveal solution</summary>
    
<p>
<pre>
# Keep only cleared invoices; .copy() avoids a SettingWithCopyWarning on assignment
sub_df = df[~df.clearing_date.isna()].copy()
# Convert the Timedelta difference to a number of days so it can be compared to numeric bins
sub_df['time_to_pay'] = (pd.to_datetime(sub_df['clearing_date']) - pd.to_datetime(sub_df['due_date'])).dt.days
labels = ['paid in advance', 'within a week', 'within a month', 'bad payers']
bins = [float('-inf'), 0, 7, 30, float('inf')]
sub_df['time_ranges'] = pd.cut(sub_df['time_to_pay'], bins=bins, labels=labels)
sns.countplot(data=sub_df, x='time_ranges')
</pre>
</p>
</details>

In [None]:
# Your code here

---
❓ Using the subset and columns you just created, can you find the payment habits of each payer?

* Group rows by `payer_id`
* Aggregate with [`GroupBy.aggregate()`](https://pandas.pydata.org/pandas-docs/stable//reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html), which lets you provide different kinds of functions
* Apply the [`pd.Series.mode()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mode.html) function to each group
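
Here is a minimal sketch of the "most frequent value per group" idea on toy data (payer ids and categories are made up). It is a variation rather than the reference solution: `Series.mode()` can return several values when there is a tie, so this sketch assumes it is acceptable to keep only the first one.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from invoices.csv)
toy = pd.DataFrame({
    "payer_id": ["A", "A", "A", "B", "B", "B"],
    "time_ranges": [
        "within a week", "within a week", "later",
        "later", "later", "within a month",
    ],
})

# For each payer, keep the most frequent category.
# mode() returns a Series (possibly several values on ties); .iloc[0] keeps the first.
habits = toy.groupby("payer_id")["time_ranges"].aggregate(lambda s: s.mode().iloc[0])

print(habits)
```

The result has one row per payer, which is why comparing its length with the number of unique `payer_id` values is a useful sanity check.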

<details>
    <summary>🙈 Reveal solution</summary>
    
<p>
<pre>
sub_df.groupby('payer_id')[['time_ranges']].aggregate(pd.Series.mode)
</pre>
</p>
</details>

In [None]:
# Your code here

In [None]:
# You can easily compare the shape of your DataFrame with the number of unique values in payer_id to see if it worked

## Good job!

Save your notebook, go back to the **Le Wagon - Learn** platform to upload your progress. A quiz awaits you!