# Data Analytics - Challenge


The goal of this challenge is to analyze customer data from a marketing campaign. Some cells are already implemented, you just need to **run** them. Other cells need you to write some code.

Start the challenge by running the two following cells:

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('data/marketing_campaign.csv', sep=';')

---

❓ Display the first 10 rows of the dataset (no need to sort)

<details>
    <summary>🙈 Reveal solution</summary>

<p>
You can use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html"><code>pandas.DataFrame.head()</code></a> function:
    
<pre>
df.head(10)
</pre>
</p>
</details>

In [None]:
# Your code here

---

❓ How many levels of education can you find in the dataset?

<details>
    <summary>🙈 Reveal solution</summary>

<p>
You can use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html"><code>pandas.Series.unique()</code></a> function combined with the <code>len()</code> Python built-in.
    
<pre>
len(df['Education'].unique())
</pre>
</p>
</details>

In [None]:
# Your code here

---

❓ Which level of education is the most represented among the customers? Plot this with a Seaborn countplot.
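
As a hint on the mechanics, here is a minimal sketch on a made-up Series (the values are hypothetical and not taken from the dataset). `value_counts()` orders categories by frequency, and `idxmax()` returns the label with the highest count:

```python
import pandas as pd

# Hypothetical toy data, not the challenge dataset
levels = pd.Series(["Master", "PhD", "Master", "Basic", "Master", "PhD"])

counts = levels.value_counts()   # number of rows per category, most frequent first
most_common = counts.idxmax()    # label with the highest count

print(counts)
print(most_common)
```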

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
df['Education'].value_counts()
</pre>
    
<pre>
sns.countplot(data=df, x='Education')
</pre>
</p>
</details>

In [None]:
# Your code here

In [None]:
# Your plot here

---

❓ Try some other countplots, varying `x` with one of the categorical columns (`Marital_Status`, `Kidhome`, `Teenhome`)

In [None]:
# Your first plot here
# To add a cell, go to the menu and select Insert > Insert Cell Below

---
❓ Let's plot the distribution of `Income` based on a given category. Start with `Education`:

```python
sns.catplot(data=df, x='Education', y='Income', kind="box")
```

1. Change the value of `x` to one of the categorical columns of the dataset, and the value of `kind` to one of `"bar"`, `"box"`, `"violin"`, `"boxen"`
1. Change the value of `y` to one of the numerical columns of the dataset

In [None]:
# Your experiments here

---
❓ Let's use [`seaborn.FacetGrid`](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html)

1. Run the cell below. What do you observe?
2. Change `col` in the first line to another column (e.g. `"Marital_Status"`). Run the cell again. What do you observe?

In [None]:
g = sns.FacetGrid(df, col="Education")
g.map(plt.hist, "Income")

---
❓ Let's continue with FacetGrid and add a `row="Teenhome"` parameter. How many cells do you get in the plot?

<details>
    <summary>🙈 Reveal solution</summary>

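<p>
You get 3 * 5 = 15 cells: 3 rows (one per <code>Teenhome</code> value) and 5 columns (one per <code>Education</code> level).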
    
<pre>
g = sns.FacetGrid(df, col="Education", row="Teenhome")
g.map(plt.hist, "Income")
</pre>
</p>
</details>

In [None]:
# Your code here

---
❓ Show the percentage of positive reactions to the campaign by education level.
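
A minimal sketch of the idea on made-up data, assuming the reaction is stored as 0/1 (the column names `group` and `clicked` are hypothetical). The mean of a 0/1 column is the share of ones, so multiplying by 100 gives a percentage:

```python
import pandas as pd

# Hypothetical toy data: `clicked` is 1 for a positive reaction, 0 otherwise
toy = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b"],
    "clicked": [1, 0, 0, 0, 1, 1],
})

# Mean of a 0/1 column = share of ones; multiply by 100 for a percentage
pct = toy.groupby("group")["clicked"].mean() * 100
print(pct.round(1))
```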

<details>
    <summary>🙈 Reveal solution</summary>
    
<p>
<pre>
df.groupby('Education')['Response'].mean()
</pre>
</p>
</details>

In [None]:
# Your code here

---
❓ The education levels are not equally represented, so percentages alone can be misleading. Try to get a more representative view by showing the number of customers per response for each education level. You could use [`seaborn.countplot`](https://seaborn.pydata.org/generated/seaborn.countplot.html) to do so:

<details>
    <summary>🙈 Reveal solution</summary>
    
<p>
<pre>
sns.countplot(data=df, x='Education', hue='Response')
</pre>
</p>
</details>

In [None]:
# Your code here

---
❓ Plot the distribution of the customers by age range. You may find it helpful to use the [`pd.cut()`](https://pandas.pydata.org/pandas-docs/stable//reference/api/pandas.cut.html) function. Here are a few intermediate steps:

* Generate an `Age` column
* Cut the values of that column into categories (e.g. '20-40') and store them in a new column
* Plot the number of customers per category
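
To see how `pd.cut()` maps numbers to labelled intervals, here is a small sketch on made-up values (the Series, the bin edges and the labels are all hypothetical). With the default settings each bin includes its right edge, so `10` falls in the first bin:

```python
import pandas as pd

# Hypothetical toy values, not the challenge dataset
scores = pd.Series([5, 10, 11, 20, 35])

# Three intervals: (0, 10], (10, 20], (20, 40]; the right edge is included by default
bands = pd.cut(scores, bins=[0, 10, 20, 40], labels=["low", "mid", "high"])
print(bands.tolist())
```

Values that fall outside the outermost edges are not assigned to any bin and come back as `NaN`, so choose edges that cover the whole range you care about.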

<details>
    <summary>🙈 Reveal solution</summary>
    
<p>
<pre>
df['Age'] = pd.to_datetime('today').year - df['Year_Birth']
labels = ['20-40', '41-60', '60+']
bins = [20, 40, 60, 100]
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels)
sns.countplot(data=df, x='AgeGroup')
</pre>
</p>
</details>

In [None]:
# Your code here

## Correlation

Let's start looking for correlations between columns in the dataset.


---
❓ What is your intuition about the relationship between the columns `Income` and `MntWines`?

---
❓ Let's look at the data to see if our intuition is correct. We will draw a **scatterplot** with `Income` on the `x` axis and `MntWines` on the `y` axis.

In [None]:
with sns.axes_style(style="whitegrid"):
    sns.relplot(x="Income", y="MntWines", data=df)

---
❓ Another way of looking at this data is to use a [`seaborn.jointplot`](https://seaborn.pydata.org/generated/seaborn.jointplot.html).

In [None]:
with sns.axes_style("white"):
    sns.jointplot(x="Income", y="MntWines", kind="hex", data=df)

---
❓ A very useful tool to **identify** correlations is the [`seaborn.pairplot`](https://seaborn.pydata.org/generated/seaborn.pairplot.html):

In [None]:
cols = ['Income', 'MntWines', 'MntFruits', 'Education',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds']
sns.pairplot(df[cols], height=2, hue='Education')
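
Correlations can also be read as numbers rather than from plots. The sketch below uses `DataFrame.corr()`, which computes Pearson correlation by default, on a made-up DataFrame (column names and values are hypothetical). On the challenge data you would likely need to select the numeric columns first, since a text column such as `Education` cannot be correlated directly:

```python
import pandas as pd

# Hypothetical toy data, not the challenge dataset
toy = pd.DataFrame({
    "income": [10, 20, 30, 40, 50],
    "wine":   [1, 3, 4, 8, 9],
    "fruit":  [5, 4, 3, 2, 1],
})

# Pairwise Pearson correlation between all columns (values between -1 and 1)
corr = toy.corr()
print(corr.round(2))
```

A value close to 1 means the two columns rise together, a value close to -1 means one falls as the other rises, and a value near 0 means no linear relationship.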

## Regression

We are not doing Machine Learning yet, but we can use [`seaborn.lmplot`](https://seaborn.pydata.org/generated/seaborn.lmplot.html) to visualize the linear relationship between two columns:

In [None]:
sns.lmplot(x="Income", y="MntWines", col="Teenhome", data=df)
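
To put numbers on a regression line like the ones plotted above, a straight-line least-squares fit can be computed with `numpy.polyfit`. This is a sketch on hypothetical, exactly linear data; it is not meant to describe how seaborn computes its line internally:

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Degree-1 polynomial fit returns the coefficients as (slope, intercept)
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))
```

If you try this on real columns, drop rows with missing values first, as `polyfit` does not handle `NaN`.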

## Good job!

Save your notebook, then go back to the **Le Wagon - Learn** platform to upload your progress. A quiz awaits you!