# Data Analytics - Challenge


The goal of this challenge is to analyze the results of an experiment. Some cells are already implemented, so you just need to **run** them. Other cells require you to write some code.

Start the challenge by running the two following cells:

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('data/experiments_results.csv', sep=';')

---

❓ Display the first 10 rows of the dataset (no need to sort)

<details>
    <summary>🙈 Reveal solution</summary>

<p>
You can use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html"><code>pandas.DataFrame.head()</code></a> function:
    
<pre>
df.head(10)
</pre>
</p>
</details>

In [None]:
# Your code here

---

❓ How many academic levels can you find among the participants?

<details>
    <summary>🙈 Reveal solution</summary>

<p>
You can use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html"><code>pandas.Series.unique()</code></a> method combined with the <code>len()</code> Python built-in.
    
<pre>
len(df['academic_lvl'].unique())
</pre>
</p>
</details>

In [None]:
# Your code here
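
As a side note, counting distinct values can also be sketched with `nunique()` and `value_counts()`. The example below runs on a small made-up DataFrame (the values are hypothetical, not taken from the challenge dataset):

```python
import pandas as pd

# Hypothetical toy data, NOT the challenge dataset
toy = pd.DataFrame(
    {"academic_lvl": ["bachelor", "master", "bachelor", "phd", "master"]}
)

# nunique() counts the distinct values directly (missing values are ignored by default)
n_levels = toy["academic_lvl"].nunique()

# value_counts() additionally tells you how many rows fall in each level
counts = toy["academic_lvl"].value_counts()

print(n_levels)
print(counts)
```

Unlike `unique()`, `nunique()` leaves missing values out of the count unless you pass `dropna=False`, so the two approaches can differ by one when a column contains `NaN`.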

---

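❓ Add the following columns: `score_total`, `score_dis`, `score_ndis`, `score_t1`, `score_t2`, `score_t3`. Each one is the row-wise sum of the question columns matching that category (look at `df.columns` to spot the naming pattern).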

<details>
    <summary>🙈 Reveal solution</summary>

<p>
<pre>
df.columns
</pre>
    
<pre>
df['score_total'] = df[[col for col in df.columns if col.startswith('q')]].sum(axis=1)
df['score_dis'] = df[[col for col in df.columns if '_Dis_' in col]].sum(axis=1)
df['score_ndis'] = df[[col for col in df.columns if '_NDis_' in col]].sum(axis=1)
df['score_t1'] = df[[col for col in df.columns if col.endswith('1')]].sum(axis=1)
df['score_t2'] = df[[col for col in df.columns if col.endswith('2')]].sum(axis=1)
df['score_t3'] = df[[col for col in df.columns if col.endswith('3')]].sum(axis=1)
</pre>
</p>
</details>

In [None]:
# Your code here
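
The pattern behind these columns is "select columns by name, then sum across each row". Here is a minimal sketch on a toy DataFrame; the column names below are hypothetical and only mimic a naming scheme, so check `df.columns` for the real ones:

```python
import pandas as pd

# Hypothetical column names, chosen only to illustrate the pattern matching
toy = pd.DataFrame({
    "participant": ["a", "b"],
    "q_Dis_1": [1, 0],
    "q_Dis_2": [1, 1],
    "q_NDis_1": [0, 1],
    "q_NDis_2": [1, 1],
})

# Keep the columns whose name contains a substring, then sum across each row
dis_cols = [col for col in toy.columns if "_Dis_" in col]
toy["score_dis"] = toy[dis_cols].sum(axis=1)

# DataFrame.filter(regex=...) is an alternative to the list comprehension.
# Anchoring the pattern with ^q keeps previously created score_* columns out.
toy["score_t1"] = toy.filter(regex=r"^q.*1$").sum(axis=1)

print(toy[["participant", "score_dis", "score_t1"]])
```

Restricting the selection to the question columns matters when a cell is run twice: a test such as `col.endswith('1')` alone would also match an existing `score_t1` column and add it to the sum.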

---

❓ What is the mean `score_total` by gender? Plot it with a Seaborn barplot.

<details>
    <summary>🙈 Reveal solution</summary>

<p>    
<pre>
sns.barplot(data=df, x='gender', y='score_total')
</pre>
</p>
</details>

In [None]:
# Your plot here
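
By default, `sns.barplot` draws the mean of `y` for each category of `x`, with an error bar around it. The numbers behind the bars can be checked with a `groupby`, sketched here on made-up values (not the challenge dataset):

```python
import pandas as pd

# Hypothetical toy data, NOT the challenge dataset
toy = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "score_total": [10, 6, 14, 8],
})

# One mean per category: this is the height of each bar in the barplot
mean_by_gender = toy.groupby("gender")["score_total"].mean()

print(mean_by_gender)
```

Comparing this table with the plot is a quick sanity check that the chart shows what you think it shows.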

---

❓ Try some other barplots, varying `x` with one of the categorical columns (`academic_lvl`, `age`)

In [None]:
# Your first plot here
# To add a cell, you can go in the menu and do Insert > Insert cell below

---

❓ Try to create a new categorical column by combining the `age` and `gender` columns, to add a new demographic feature to the dataset.

<details>
    <summary>🙈 Reveal solution</summary>

<p>    
<pre>
df['demo'] = df['gender'] + df['age']
</pre>
</p>
</details>

In [None]:
# Your code here
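
String concatenation with `+` only works when both columns hold strings. A slightly more defensive variant, sketched below on hypothetical values, casts both columns to `str` and adds a separator so the resulting labels stay readable:

```python
import pandas as pd

# Hypothetical toy data; the category labels are made up for illustration
toy = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "age": ["18-25", "26-35", "26-35"],
})

# astype(str) guards against numeric columns; "_" separates the two parts
toy["demo"] = toy["gender"].astype(str) + "_" + toy["age"].astype(str)

print(toy["demo"].tolist())
```

The separator is a design choice: without it, labels from different columns run together, which makes plot legends harder to read.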

---
❓ Let's plot the distribution of `score_total` based on a given category. Start with `age`:

```python
sns.catplot(data=df, x='age', y='score_total', kind="violin")
```

1. Change the value of `x` to one of the categorical columns of the dataset, and try other values of `kind` (`"bar"`, `"box"`, `"violin"`, `"boxen"`)
1. Change the value of `y` to one of the numerical columns of the dataset

In [None]:
# Your experiments here

---
❓ Let's use [`seaborn.FacetGrid`](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html)

1. Run the cell below. What do you observe?
2. Change `col` in the first line to another column (e.g. `"academic_lvl"`). Run the cell again. What do you observe?

In [None]:
g = sns.FacetGrid(df, col="demo")
g.map(plt.hist, "score_total")

---
❓ Let's continue with FacetGrid and add a `row="academic_lvl"` parameter. How many cells do you get in the plot?

<details>
    <summary>🙈 Reveal solution</summary>

<p>
You get 2 * 4 = 8 cells!
    
<pre>
g = sns.FacetGrid(df, col="demo", row="academic_lvl")
g.map(plt.hist, "score_total")
</pre>
</p>
</details>

In [None]:
# Your code here
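
The size of the grid follows from the number of distinct values in the `row` and `col` columns. The sketch below illustrates the arithmetic on made-up data (the labels are hypothetical), and uses `pd.crosstab` to show how many observations land in each cell:

```python
import pandas as pd

# Hypothetical toy data, NOT the challenge dataset
toy = pd.DataFrame({
    "demo": ["F_young", "M_young", "F_old", "M_old", "F_young"],
    "academic_lvl": ["bachelor", "master", "bachelor", "master", "master"],
})

# One grid column per distinct `col` value, one grid row per distinct `row` value
n_cols = toy["demo"].nunique()
n_rows = toy["academic_lvl"].nunique()
n_cells = n_rows * n_cols

# Observations per (row, col) combination; a 0 means that facet would be empty
table = pd.crosstab(toy["academic_lvl"], toy["demo"])

print(n_cells)
print(table)
```

A cross-tabulation like this is worth a look before faceting: cells with very few observations produce histograms that are hard to interpret.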

## Correlation

Let's start looking for correlations between columns in the dataset.


---
❓ What is your intuition about the relationship between the columns `fatigue_pre` and `score_total`?

---
❓ Let's look at the data to see if our intuition is correct. We will do a **scatterplot** with `x` being `fatigue_pre` and `y` the `score_total`.

In [None]:
with sns.axes_style(style="whitegrid"):
    sns.relplot(x="fatigue_pre", y="score_total", data=df)

---
❓ Another way of looking at this data is to use a [`seaborn.jointplot`](https://seaborn.pydata.org/generated/seaborn.jointplot.html).

In [None]:
with sns.axes_style("white"):
    sns.jointplot(x="fatigue_pre", y="score_total", kind="hex", data=df)

---
❓ A very useful tool to **identify** correlations is the [`seaborn.pairplot`](https://seaborn.pydata.org/generated/seaborn.pairplot.html):

In [None]:
sns.pairplot(df[['fatigue_pre', 'fatigue_post', 'score_total', 'demo']], height=2, hue="demo")
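
Plots give a visual impression of a correlation; a number can back it up. A minimal sketch with pandas, on made-up values (not the challenge dataset), computing the Pearson correlation coefficient:

```python
import pandas as pd

# Hypothetical toy data: the score tends to drop as fatigue increases
toy = pd.DataFrame({
    "fatigue_pre": [1, 2, 3, 4, 5],
    "score_total": [20, 18, 17, 13, 12],
})

# Pearson correlation between two columns, a value between -1 and 1
r = toy["fatigue_pre"].corr(toy["score_total"])

# Full correlation matrix; on a real dataset, select the numeric columns first
corr_matrix = toy.corr()

print(r)
print(corr_matrix)
```

A coefficient close to -1 or 1 indicates a strong linear relationship, while a value near 0 indicates no linear relationship (other kinds of dependence may still exist).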

## Regression

We are not doing Machine Learning yet, but we can use [`seaborn.lmplot`](https://seaborn.pydata.org/generated/seaborn.lmplot.html) to visualize a linear relationship between two columns: it draws a scatterplot together with a fitted regression line.

In [None]:
sns.lmplot(x="fatigue_pre", y="score_total", col="demo", data=df)
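
To get a feel for what the regression line represents, here is a small sketch that fits a straight line with `numpy.polyfit`. The data is made up and exactly linear, so the fit recovers the slope and intercept used to build it; this illustrates the idea and is not a description of Seaborn's internals:

```python
import numpy as np

# Hypothetical, exactly linear data: y = 20 - 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 20.0 - 2.0 * x

# Least-squares fit of a degree-1 polynomial: returns (slope, intercept)
slope, intercept = np.polyfit(x, y, deg=1)

print(slope, intercept)
```

On real data the points scatter around the line, and the sign of the slope tells you whether the two columns tend to move together or in opposite directions.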

## Good job!

Save your notebook, then go back to the **Le Wagon - Learn** platform to upload your progress. A quiz awaits you!