# Python Foundations for Data Science 🤓

### Welcome to the Jupyter Notebook! How to use this file? 🤔

* Type your code inside the empty cells. Empty cells are marked with an `In [ ]:` prefix
* Press the `return/enter ⏎` key to add a new line inside the cell
* To display your results, use the Python built-in `print(STUFF_YOU_WANT_TO_PRINT)` function, or simply put what you want to display on the last line of the cell. The result of the last line will appear as the `Out[]:` output of the cell :)
* Press `shift` + `return/enter ⏎` to run your code 🤓 This runs the code inside the currently selected cell and displays anything passed to `print()`, as well as the result of the cell's last line
* To add a new cell, select any cell and press the `b` key (make sure you are not just typing the letter `b` in the cell). This will add a new cell below
* To delete a cell, double press the `d` key (make sure you are not just typing the letter `d` in the cell)

#### Try to run the code below! 😃

In [None]:
print("Welcome to the Python workshop of Le Wagon!")
print("Are you ready to do your first data analysis?")
"Yes we are! 🚀"

---
# Data Science - Analysis 🧮

### First of all - what are we using? 🔨

[Jupyter Notebook](https://jupyter.org) is an open-source web application which allows you to create and share documents with code, visualizations and narrative. Which is what we are doing right now! 

[NumPy](https://www.numpy.org) is the fundamental package for scientific computing in Python.

[Pandas](https://pandas.pydata.org) is an open-source library providing easy-to-use data structures and data analysis tools for Python.

### So this is how most data science notebooks start...

In [None]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib
print("You are good to go!")

### Python and Pandas are great for reading data files, like CSV

In [None]:
file = "data/countries.csv"
df = pd.read_csv(file, decimal=",")
# no need to use print() when we want to see DataFrames, simply put what you want to see on the last line
df
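
The `decimal=","` argument tells Pandas that numbers in the file use a comma as the decimal separator. Here is a minimal sketch of the difference it makes. The two-row CSV below is made up for illustration and is not taken from `data/countries.csv`:

```python
import io
import pandas as pd

# Hypothetical CSV content where decimals are written with a comma
raw = 'Country,Density\nA,"48,0"\nB,"124,6"\n'

# Without decimal=",", values such as "48,0" are kept as text
as_text = pd.read_csv(io.StringIO(raw))

# With decimal=",", the same values are parsed as floating-point numbers
as_number = pd.read_csv(io.StringIO(raw), decimal=",")

print(as_text["Density"].dtype)
print(as_number["Density"].dtype)
print(as_number["Density"].sum())
```

If a numeric column unexpectedly behaves like text (for example, sorting alphabetically), checking `df.dtypes` is a quick way to spot this kind of parsing issue.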

---
### We can then have a quick look at our `DataFrame` 🗺

❓ Display the first 10 rows of the dataset (no need to sort)

<details>
    <summary>🙈 Reveal hint</summary>

<p>
You can use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html"><code>pandas.DataFrame.head()</code></a> method.
</p>
</details>

<details>
    <summary>🙈 Reveal solution</summary>

<pre>
df.head(10)
</pre>
</details>

In [None]:
# Your code here

---
❓ What are the countries with more than one billion (10^9) inhabitants?

<details>
    <summary>🙈 Reveal hint</summary>

<p>
You can use <a href="https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing"><code>boolean indexing</code></a> to filter the rows of your dataframe.
</p>
</details>

<details>
    <summary>🙈 Reveal solution</summary>
<pre>
big_population = df['Population'] > 1000000000
df[big_population]
</pre>
</details>

In [None]:
# Your code here
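
To see what boolean indexing does under the hood, here is a small sketch on a made-up `toy` dataframe (the names and numbers are hypothetical, not from the countries dataset). A comparison on a column produces a Series of `True`/`False` values, and indexing with it keeps only the `True` rows:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"Country": ["A", "B", "C"], "Population": [5, 50, 500]})

# The comparison yields one True/False value per row
mask = toy["Population"] > 10

# Indexing with the mask keeps only the rows where it is True
selected = toy[mask]

# Conditions can be combined with & (and) or | (or);
# each condition needs its own parentheses
mid_sized = toy[(toy["Population"] > 10) & (toy["Population"] < 100)]

print(mask.tolist())
print(selected["Country"].tolist())
print(mid_sized["Country"].tolist())
```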

---
❓ How many different regions are available?

<details>
    <summary>🙈 Reveal hint</summary>
<p>
You can use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html"><code>pandas.Series.unique()</code></a> method combined with the <code>len()</code> Python built-in function.
</p>
</details>

<details>
    <summary>🙈 Reveal solution</summary>
<pre>
len(df['Region'].unique())
</pre>
</details>

In [None]:
# Your code here
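
As a side note, Pandas also offers `Series.nunique()`, which returns the number of distinct values directly. The sketch below compares both approaches on a made-up Series (the region names are hypothetical):

```python
import pandas as pd

# Hypothetical values, for illustration only
regions = pd.Series(["North", "South", "North", "East"])

# unique() returns the distinct values in order of first appearance
names = regions.unique()

# nunique() returns how many distinct values there are
count = regions.nunique()

print(list(names))
print(count)
```

By default `nunique()` ignores missing values while `unique()` includes them, so the two counts can differ on a column that contains `NaN`.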

---
❓ What are the 5 countries with the biggest GDP per capita?

<details>
    <summary>🙈 Reveal hint</summary>

<p>
You can use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html"><code>pandas.DataFrame.sort_values()</code></a> method to order the rows of your dataframe based on one or more criteria.
</p>
</details>

<details>
    <summary>🙈 Reveal solution</summary>

<p>
You can use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html"><code>pandas.DataFrame.head()</code></a> method to limit the number of rows displayed:
    
<pre>
df.sort_values(by='GDP ($ per capita)', ascending=False).head(5)
</pre>
</p>
</details>

In [None]:
# Your code here
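
Note that `sort_values()` sorts in ascending order by default, so a "top N" query needs `ascending=False`. Pandas also provides `DataFrame.nlargest()` as a shortcut. The sketch below shows both on a made-up `toy` dataframe (the `Score` column is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing value, for illustration only
toy = pd.DataFrame({"Country": list("ABCD"), "Score": [3.0, None, 9.0, 1.0]})

# Sort from highest to lowest, then keep the first 2 rows.
# Missing values are placed last by default.
top_sorted = toy.sort_values(by="Score", ascending=False).head(2)

# nlargest() returns the same rows in one call
top_nlargest = toy.nlargest(2, "Score")

print(top_sorted["Country"].tolist())
print(top_nlargest["Country"].tolist())
```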

---
❓ Which region of the world is the most populated?

<details>
    <summary>🙈 Reveal hint</summary>

<p>
You can use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html"><code>pandas.DataFrame.groupby()</code></a> method to aggregate the rows of your dataframe based on one or more criteria.
</p>
</details>

<details>
    <summary>🙈 Reveal solution</summary>
    
<pre>
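regions = df.groupby('Region')
# Sum per region, then sort so the most populated region comes first
regions[['Population', 'Area (sq. mi.)']].sum().sort_values(by='Population', ascending=False)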
</pre>
</details>

In [None]:
# Your code here
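
The group-then-aggregate pattern can be sketched on a small made-up dataframe (region names and numbers are hypothetical). Each group is reduced to a single value, and the result is indexed by the grouping column:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Region": ["X", "Y", "X", "Y", "Z"],
    "Population": [10, 5, 20, 1, 12],
})

# One total per region; the result is a Series indexed by region
totals = toy.groupby("Region")["Population"].sum()

# Sort the totals from largest to smallest
ranked = totals.sort_values(ascending=False)

# idxmax() returns the index label (here, the region) of the largest total
largest = totals.idxmax()

print(ranked.to_dict())
print(largest)
```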

---
❓ Plot the GDP per capita of the 10 richest countries

<details>
    <summary>🙈 Reveal hint</summary>

<p>
Reuse what you did previously to retrieve the richest countries, then plot the result with <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html"><code>pandas.DataFrame.plot()</code></a>.
</p>
</details>

<details>
    <summary>🙈 Reveal second hint</summary>

<p>
You can use the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html"><code>pandas.DataFrame.set_index()</code></a> method to label the rows with one of the columns of the dataframe.
</p>
</details>

<details>
    <summary>🙈 Reveal solution</summary>
    
<pre>
column = 'GDP ($ per capita)'
# Keep df unchanged so the cell can be re-run without errors
countries_df = df.set_index('Country')
top_ten_countries_df = countries_df[[column]].sort_values(by=column, ascending=False).head(10)
top_ten_countries_df.plot(kind='bar')
</pre>
</details>

In [None]:
# Your code here