# Data analysis with Python: a quick refresher

<div class="alert alert-warning">
<h3>Goal of this session:</h3>

The following tasks aim to refresh your memory of some key Python functionality for data analysis. In particular, you will practice exploring, summarizing, and visualizing data using the pandas, seaborn and matplotlib libraries.
</div>

The dataset that you will be working with was compiled from several public resources published by the Swiss Federal Statistical Office. It contains demographic and geographic information for the communes in Switzerland. More specifically, it contains the following information for each commune:

* The canton where the commune is located
* Name of the commune
* Language
* Number of residents
* Population density per km²
* Percentage of residents aged from 0 to 19 years
* Percentage of residents aged from 20 to 64 years
* Percentage of residents aged 65 years and more
* Number of private households
* Surface area in km²
* Percentage of the Settlement area
* Percentage of the Agricultural area
* Percentage of the Wooded area
* Percentage of the Unproductive area
* East coordinate of the center
* North coordinate of the center
* Median elevation in meters

You will analyse this dataset through the following steps:

- A. Importing and Cleaning
- B. Data Exploration
- C. Data Visualizations

The dataset is located at `Day1-01/data.csv` in the GitHub repository of the workshop.

Let's start by importing the necessary libraries:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<div class="alert alert-info">
<h1>A. Importing and Cleaning</h1>
</div>


The objective of this task is to gain some general information about the dataset before starting your exploration journey. By the end of this task, you will know the number of observations, the index and columns, the data types, and the missing observations in the dataset. This is essential information to have about any dataset before jumping into data exploration.

> Hint: the following functions and attributes can be used to address the tasks in this part:

- `read_csv`
- `shape`
- `head`
- `index`
- `columns`
- `dtypes`
- `isna`, `any`
- `dropna`
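
As a rough illustration of how these hints fit together, the sketch below applies them to a tiny made-up CSV string rather than to the workshop dataset. The column names `Name`, `Group` and `Value`, and all the values, are assumptions chosen only for this example.

```python
import io
import pandas as pd

# Made-up CSV content (not the workshop data); one value is deliberately missing.
csv_text = """Name,Group,Value
a,x,1.5
b,x,
c,y,3.0
"""

# read_csv accepts a file path or, as here, a file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(toy.head())         # first rows of the DataFrame
print(toy.index)          # row labels
print(list(toy.columns))  # column names as a list
print(toy.dtypes)         # data type of each column

# Boolean mask marking the rows that contain at least one missing value
has_missing = toy.isna().any(axis=1)
print(toy[has_missing])

# dropna returns a new DataFrame without those rows; toy itself is unchanged
toy_clean = toy.dropna()
print(toy_clean.shape)
```

Note that `shape`, `index`, `columns` and `dtypes` are attributes, so they are accessed without parentheses.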

1. Import the data as a Pandas DataFrame and name it `df`.

In [None]:
# write your code here


2. Check the number of rows and columns.

In [None]:
# write your code here


3. Display the first few entries of the DataFrame.

In [None]:
# write your code here


4. Obtain the index labels, and then show the column names as a `list`.

In [None]:
# write your code here


In [None]:
# write your code here


5. Check the data type for each column.

In [None]:
# write your code here


6. Check whether there are any missing values and show the rows that contain them.

In [None]:
# write your code here


7. If necessary, remove any observations to ensure that there are no missing values.

In [None]:
# write your code here


<div class="alert alert-info">
<h1>B. Data Exploration</h1>
</div>

Now that you are familiar with the general structure of the dataset, you can start an exploration adventure! Data exploration, or exploratory data analysis, enables you to tell a story with the data. It can be done in several ways depending on the objective and the data types. In this task, you will use your data summarization and aggregation skills to explore the data.

> Hint: the following functions can be used to address the tasks in this part:

- `describe`
- `loc`
- `sort_values`
- `groupby`
- `count`
- `agg`
- `apply`
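
The following sketch shows how several of these hints can be combined. It uses a small made-up DataFrame, not the workshop dataset: the column names `Group`, `Share` and `Height` and all the values are assumptions for illustration only.

```python
import pandas as pd

# Made-up values for illustration only (not the workshop data).
toy = pd.DataFrame({
    "Group": ["x", "x", "y", "y", "y"],
    "Share": [40.0, 60.0, 55.0, 70.0, 30.0],
    "Height": [400, 900, 300, 350, 1200],
})

# Summary statistics of the numerical columns, keeping only three of the rows
summary = toy.describe().loc[["mean", "min", "max"]]

# The two rows with the largest 'Height', largest first
tallest = toy.sort_values("Height", ascending=False).head(2)

# Per group: number of rows whose 'Share' exceeds 50
n_above = toy.loc[toy["Share"] > 50].groupby("Group")["Share"].count()

# Per group: range (maximum minus minimum) of 'Height', largest first
height_range = (
    toy.groupby("Group")["Height"]
    .agg(lambda s: s.max() - s.min())
    .sort_values(ascending=False)
)

print(summary)
print(tallest)
print(n_above)
print(height_range)
```

One design choice to be aware of: filtering before grouping drops any group that has no matching rows, so such a group will not appear in the result with a count of 0.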

1. Obtain the mean, minimum and maximum value for each column containing numerical data.

In [None]:
# write your code here


2. List the 10 most populated communes, ordered by their number of 'Residents'.

In [None]:
# write your code here


3. Compute the number of communes in each canton in which more than 50 percent of the population is aged between 20 and 64 years.
You should group the communes by canton, and then count how many communes have more than 50 percent of their population aged between 20 and 64 years. Note that the column '20-64 years' holds the percentage of people aged between 20 and 64 years.

In [None]:
# write your code here


4. Compute the difference between the maximum and minimum elevations for each canton. Then find the 5 cantons with the largest elevation range.
Note that the column 'Elevation' holds the elevation for each commune.

In [None]:
# write your code here


5. The following image shows a matrix whose rows correspond to the communes and whose columns correspond to the cantons. Entry $(i,j)$ of the matrix is 1 if the commune in row $i$ is in the canton in column $j$, and 0 otherwise. Write code that generates this matrix.

<img src="img/plot_output_4.png" width=500 height=250 />
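
A 0/1 membership matrix of this kind is often called an indicator (or one-hot) matrix. The sketch below shows one possible way to build such a matrix with `pd.get_dummies` on a made-up DataFrame. The column names `Item` and `Group` and their values are assumptions for illustration, and this is not necessarily the approach used to produce the image above.

```python
import pandas as pd

# Made-up labels for illustration only (not the workshop data).
toy = pd.DataFrame({
    "Item": ["a", "b", "c"],
    "Group": ["x", "y", "x"],
})

# One row per item and one column per group;
# a 1 marks the group that the item belongs to, all other entries are 0
indicator = pd.get_dummies(toy.set_index("Item")["Group"], dtype=int)
print(indicator)
```

`pd.crosstab` applied to the same two columns would be an alternative way to obtain a comparable table.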

In [None]:
# write your code here


<div class="alert alert-info">
<h1>C. Data Visualizations</h1>
</div>

Let's continue the exploratory data analysis, but this time with visualizations. Visual exploration is a powerful way to learn about the distributions of the columns and the relationships between them, and to add these findings to your story. In this task, you will use your data visualization skills to explore the data.

> Hint: the following functions can be used to address the tasks in this part:

- `barh`
- `xlabel`
- `title`
- `legend`
- `violinplot`
- `subplots`
- `boxplot`
- `pairplot`
- `scatterplot`
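
As a minimal sketch of how some of these hints fit together, the example below draws a horizontal stacked bar chart from a made-up DataFrame. The labels `Part 1`, `Part 2`, the index values and all the numbers are assumptions for illustration, not workshop data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up percentages for illustration only (not the workshop data).
toy = pd.DataFrame(
    {"Part 1": [20, 50, 35], "Part 2": [80, 50, 65]},
    index=["a", "b", "c"],
)

fig, ax = plt.subplots(figsize=(6, 3))

# stacked=True places the columns end to end within each horizontal bar
toy.plot.barh(stacked=True, ax=ax)

# Axis label, title and legend are set on the Axes object
ax.set_xlabel("Percentage")
ax.set_title("Composition per item")
ax.legend(title="Part")
```

The sketch uses the Axes methods `set_xlabel` and `set_title`; the `plt.xlabel` and `plt.title` functions from the hint list act on the current Axes in the same way.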

1. Your task is to produce a **horizontal** bar plot that shows the 10 most populated communes. Your bar chart should list the names of the communes along the $y$-axis, and the $x$-axis should show their populations.

In [None]:
# write your code here


2. For the 10 most populated communes of the previous step, your task now is to plot a horizontal **stacked** bar chart that shows how their land is divided into the 4 area types: Settlement, Agricultural, Wooded, Unproductive. Remember that these 4 area types are given as percentages and should add up to 100 for each commune. Ensure that the chart has an appropriate title, legend and labels. Your output should look like the following plot.

<img src="img/plot_output_1.png" width=600 height=300 />

In [None]:
# write your code here


3. Your task is to investigate the distribution of the age group 0-19 years, which is a numerical variable, across the four language regions, which form a categorical variable. Hint: for this purpose you can [choose](https://seaborn.pydata.org/tutorial/categorical.html) one of Seaborn's violin plot, box plot or boxen plot.

In [None]:
# write your code here


4. Your task is to repeat the previous task for the three age groups 0-19 years, 20-64 years, and 65 years or over. To make the comparison easy, you should make a figure with one subplot per age group (1 row and 3 columns). The $y$-axis of each subplot should show the percentage of the population in the age group, ranging from 0 to 80%. Hint: you can use the `sharey` parameter of `subplots`.
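
The layout described above can be sketched as follows, assuming a made-up DataFrame in place of the workshop dataset. The column names `Category`, `Measure 1`, `Measure 2` and `Measure 3` and all the values are hypothetical, and a box plot is used only as one of the possible plot types.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up values for illustration only (not the workshop data).
toy = pd.DataFrame({
    "Category": ["p", "p", "q", "q", "r", "r"],
    "Measure 1": [10, 20, 15, 25, 30, 35],
    "Measure 2": [50, 60, 55, 65, 45, 40],
    "Measure 3": [40, 20, 30, 10, 25, 25],
})
measures = ["Measure 1", "Measure 2", "Measure 3"]

# One row, three columns; sharey=True gives all panels a common y-axis
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)

# Draw one box plot per numerical column, each on its own Axes
for ax, col in zip(axes, measures):
    sns.boxplot(data=toy, x="Category", y=col, ax=ax)
    ax.set_title(col)

# With a shared y-axis, setting the limits on one panel applies to all
axes[0].set_ylim(0, 80)
```

Passing `ax=ax` tells Seaborn which subplot to draw on; without it, every plot would be drawn on the current (last) Axes.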

In [None]:
# write your code here


5. Your task is to use the pairplot from Seaborn and produce **3 plots** to visually investigate the relationship between the Agricultural area of the communes and their Settlement area, Wooded area and Unproductive area. Use the Elevation as the color-code variable in the plot and show that communes located at high altitudes have very low Settlement and Agricultural areas, but large Unproductive areas. Your output should look like the following plot.

![title](img/plot_output_2.png)


In [None]:
# write your code here


6. Your task is to use the Seaborn scatter plot to draw maps of Switzerland using the East and North coordinates of the communes. Write code that generates the following plots.

![title](img/plot_output_3.png)


In [None]:
# write your code here
