Commit fe4edc9

Second pandas chapter translated (#524)
* first chapter * update * update
1 parent f32915b commit fe4edc9

22 files changed, +1757 -469 lines changed

_quarto-prod.yml

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ project:
     - content/manipulation/02_pandas_intro.qmd
     - content/manipulation/02_pandas_intro_en.qmd
     - content/manipulation/02_pandas_suite.qmd
+    - content/manipulation/02_pandas_suite_en.qmd
     - content/manipulation/03_geopandas_intro.qmd
     - content/manipulation/02a_pandas_tutorial.qmd
     - content/manipulation/02b_pandas_TP.qmd

_quarto.yml

Lines changed: 2 additions & 2 deletions
@@ -12,8 +12,8 @@ project:
     - content/getting-started/06_rappels_fonctions.qmd
     - content/getting-started/07_rappels_classes.qmd
     - content/manipulation/index.qmd
-    - content/manipulation/02_pandas_intro.qmd
-    - content/manipulation/02_pandas_intro_en.qmd
+    - content/manipulation/02_pandas_suite.qmd
+    - content/manipulation/02_pandas_suite_en.qmd
     - content/visualisation/index.qmd
     - content/modelisation/index.qmd
     - content/NLP/index.qmd

content/manipulation/02_pandas_suite.qmd

Lines changed: 18 additions & 467 deletions
Large diffs are not rendered by default.
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+::: {.exercise}
+## Exercise 1: Group Aggregations
+
+1. Calculate the total emissions of the "Residential" sector by department, then express each value relative to that of the most polluting department in this sector. Consider what reality this statistic actually reflects.
+
+2. Calculate the total emissions for each sector in each department. For each department, calculate the proportion of total emissions coming from each sector.
+
+<details>
+<summary>
+Hint for this question
+</summary>
+
+* _"Group by"_ = `groupby`
+* _"Total emissions"_ = `agg({*** : "sum"})`
+</details>
+
+:::
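
The hint in Exercise 1 maps "group by" to `groupby` and "total emissions" to a sum aggregation. A minimal sketch of that pattern follows, assuming a toy table whose `dep` and `Résidentiel` columns mimic the real data; the values and the `share_of_max` column name are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values): one row per commune
toy = pd.DataFrame(
    {
        "dep": ["01", "01", "02", "02", "03"],
        "Résidentiel": [10.0, 30.0, 5.0, 15.0, 8.0],
    }
)

# "Group by" department, then "total emissions" through a sum aggregation
residential_by_dep = (
    toy
    .groupby("dep")
    .agg({"Résidentiel": "sum"})
    .reset_index()
    .sort_values("Résidentiel", ascending=False)
)

# Express each department relative to the largest emitter (1.0 = the maximum)
residential_by_dep["share_of_max"] = (
    residential_by_dep["Résidentiel"] / residential_by_dep["Résidentiel"].max()
)
print(residential_by_dep)
```

Dividing by the maximum turns the raw totals into a ranking that is easy to read, but it does not correct for population size.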
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+::: {.exercise}
+## Exercice 1 : agrégations par groupe
+
+1. Calculer les émissions totales du secteur "Résidentiel" par département et rapporter la valeur au département le plus polluant dans le domaine. En tirer des intuitions sur la réalité que cette statistique reflète.
+
+2. Calculer, pour chaque département, les émissions totales de chaque secteur. Pour chaque département, calculer la proportion des émissions totales venant de chaque secteur.
+
+<details>
+<summary>
+Indice pour cette question
+</summary>
+
+* _"Grouper par"_ = `groupby`
+* _"émissions totales"_ = `agg({*** : "sum"})`
+</details>
+
+:::
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+```{python}
+#| echo: false
+#| output: asis
+if lang == "en":
+    print("In question 1, the result should be as follows:")
+else:
+    print("A la question 1, le résultat obtenu devrait être le suivant:")
+```
+
+
+```{python}
+# Question 1
+emissions_residentielles = (
+    emissions
+    .groupby("dep")
+    .agg({"Résidentiel" : "sum"})
+    .reset_index()
+    .sort_values("Résidentiel", ascending = False)
+)
+emissions_residentielles["Résidentiel (% valeur max)"] = emissions_residentielles["Résidentiel"]/emissions_residentielles["Résidentiel"].max()
+emissions_residentielles.head(5)
+```
+
+
+```{python}
+#| echo: false
+#| output: asis
+if lang == "en":
+    print(
+        """
+This ranking may reflect demographics rather than the process we wish to measure. Without the addition of information on the population of each département to control for this factor, it is difficult to know whether there is a structural difference in behavior between the inhabitants of Nord (département 59) and Moselle (département 57).
+        """
+    )
+else:
+    print(
+        """
+Ce classement reflète peut-être plus la démographie que le processus qu'on désire mesurer. Sans l'ajout d'une information annexe sur la population de chaque département pour contrôler ce facteur, on peut difficilement savoir s'il y a une différence structurelle de comportement entre les habitants du Nord (département 59) et ceux de la Moselle (département 57).
+        """
+    )
+```
+
+
+```{python}
+# Question 2
+emissions_par_departement = (
+    emissions.groupby('dep').sum(numeric_only=True)
+)
+emissions_par_departement['total'] = emissions_par_departement.sum(axis = 1)
+emissions_par_departement["Part " + secteurs] = (
+    emissions_par_departement
+    .loc[:, secteurs]
+    .div(emissions_par_departement['total'], axis = 0)
+    .mul(100)
+)
+```
+
+
+```{python}
+#| echo: false
+#| output: asis
+if lang == "en":
+    print(
+        """
+At the end of question 2, let's take the share of emissions from agriculture and the tertiary sector in departmental emissions:
+        """
+    )
+else:
+    print(
+        """
+A l'issue de la question 2, prenons la part des émissions de l'agriculture et du secteur tertiaire dans les émissions départementales:
+        """
+    )
+```
+
+
+```{python}
+emissions_par_departement.sort_values("Part Agriculture", ascending = False).head(5)
+```
+
+```{python}
+emissions_par_departement.sort_values("Part Tertiaire", ascending = False).head(5)
+```
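
The Question 2 solution relies on `secteurs`, which is defined outside this hunk, and on a row-wise division with `div(..., axis = 0)`. The sketch below reproduces that step on toy data so it can be run in isolation. The `totals` frame, its values and the `sectors` list are assumptions made for illustration, not the chapter's actual objects.

```python
import pandas as pd

# Toy per-department totals (hypothetical values), indexed by department
totals = pd.DataFrame(
    {"Agriculture": [30.0, 10.0], "Tertiaire": [10.0, 30.0]},
    index=pd.Index(["01", "02"], name="dep"),
)
sectors = list(totals.columns)

# Row total computed before any share column is added,
# so that shares are never counted in the total
row_total = totals[sectors].sum(axis=1)

# Divide each row by its own total (axis=0 aligns on the row index),
# then scale to percentages and rename the columns "Part <sector>"
shares = totals[sectors].div(row_total, axis=0).mul(100).add_prefix("Part ")

result = totals.join(shares)
print(result)
```

`add_prefix` is used here so the sketch works with a plain list of sector names. The string concatenation `"Part " + secteurs` in the solution only works if `secteurs` is a pandas `Index` or similar array-like.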
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+::: {.exercise}
+## Exercise 2: Restructuring Data: Wide to Long
+
+1. Create a copy of the ADEME data by doing `df_wide = emissions.copy()`
+
+2. Restructure the data into the *long* format to have emission data by sector while keeping the commune as the level of analysis (pay attention to other identifying variables).
+
+3. Sum the emissions by sector and represent it graphically.
+
+4. For each department, identify the most polluting sector.
+
+:::
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+::: {.exercise}
+## Exercice 2: Restructurer les données : wide to long
+
+1. Créer une copie des données de l'`ADEME` en faisant `df_wide = emissions_wide.copy()`
+
+2. Restructurer les données au format *long* pour avoir des données d'émissions par secteur en gardant comme niveau d'analyse la commune (attention aux autres variables identifiantes).
+
+3. Faire la somme par secteur et représenter graphiquement
+
+4. Garder, pour chaque département, le secteur le plus polluant
+
+:::
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+```{python}
+#| output: false
+#| label: question1
+# Question 1
+
+emissions_wide = emissions.copy()
+emissions_wide[['Commune','dep', "Agriculture", "Tertiaire"]].head()
+```
+
+```{python}
+#| output: false
+#| label: question2
+# Question 2
+emissions_wide.reset_index().melt(id_vars = ['INSEE commune','Commune','dep'],
+                                  var_name = "secteur", value_name = "emissions")
+```
+
+```{python}
+#| output: false
+#| label: question3
+# Question 3
+
+emissions_totales = (
+    emissions_wide.reset_index()
+    .melt(
+        id_vars = ['INSEE commune','Commune','dep'],
+        var_name = "secteur", value_name = "emissions"
+    )
+    .groupby('secteur')
+    .sum(numeric_only = True)
+)
+
+emissions_totales.plot(kind = "barh")
+```
+
+```{python}
+#| output: false
+#| label: question4
+# Question 4
+
+top_commune_dep = (
+    emissions_wide
+    .reset_index()
+    .melt(
+        id_vars = ['INSEE commune','Commune','dep'],
+        var_name = "secteur", value_name = "emissions"
+    )
+    .groupby(['secteur','dep'])
+    .sum(numeric_only=True).reset_index()
+    .sort_values(['dep','emissions'], ascending = False)
+    .groupby('dep').head(1)
+)
+display(top_commune_dep)
+```
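
The Question 4 solution finds the top sector per department by sorting and keeping the first row of each group. The sketch below shows the same wide-to-long reshaping on toy data, followed by an alternative selection with `idxmax`. This is a swapped-in technique rather than the solution's method, and the frame and its values are made up for illustration.

```python
import pandas as pd

# Toy wide table (hypothetical values): one row per commune, one column per sector
wide = pd.DataFrame(
    {
        "Commune": ["A", "B", "C"],
        "dep": ["01", "01", "02"],
        "Agriculture": [5.0, 1.0, 2.0],
        "Tertiaire": [2.0, 3.0, 9.0],
    }
)

# Wide to long: identifier columns are kept, sector columns are stacked
long_df = wide.melt(
    id_vars=["Commune", "dep"],
    var_name="secteur",
    value_name="emissions",
)

# Total emissions per (department, sector) pair
by_dep_sector = (
    long_df.groupby(["dep", "secteur"], as_index=False)["emissions"].sum()
)

# For each department, keep the row whose emissions are the largest
top_sector = by_dep_sector.loc[
    by_dep_sector.groupby("dep")["emissions"].idxmax()
]
print(top_sector)
```

Both approaches return one row per department. `idxmax` avoids a full sort, while sorting followed by `head(1)` extends more naturally to "top n" questions.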
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+::: {.exercise}
+## Exercise 3: Verification of Join Keys
+
+Let's start by checking the dimensions of the `DataFrames` and the structure of some key variables.
+In this case, the fundamental variables for linking our data are the communal variables.
+Here, we have two geographical variables: a commune code and a commune name.
+
+1. Check the dimensions of the `DataFrames`.
+
+2. Identify in `filosofi` the commune names that correspond to multiple commune codes and select their codes. In other words, identify the `LIBGEO` where there are duplicate `CODGEO` and store them in a vector `x` (tip: be careful with the index of `x`).
+
+We temporarily focus on observations where the label involves more than two different commune codes.
+
+* _Question 3_. Look at these observations in `filosofi`.
+
+* _Question 4_. To get a better view, reorder the obtained dataset alphabetically.
+
+* _Question 5_. Compute the average size (number of people, variable `NBPERSMENFISC16`) and some descriptive statistics for these observations. Compare them with the same statistics for the observations where labels and commune codes coincide.
+
+* _Question 6_. Among the major cities (more than 100,000 people), compute the proportion of cities whose name is associated with several commune codes.
+
+* _Question 7_. Check in `filosofi` the cities where the label is equal to Montreuil. Also, check those that contain the term _'Saint-Denis'_.
+
+:::
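
Questions 2, 3, 4 and 7 of Exercise 3 can be sketched as follows. The sketch assumes a toy table that reuses the `LIBGEO` and `CODGEO` column names from `filosofi`, with placeholder codes that are not real commune codes. Resetting the index of `x` is only one possible reading of the tip about its index.

```python
import pandas as pd

# Toy stand-in for `filosofi` (placeholder codes, not real commune codes)
toy = pd.DataFrame(
    {
        "CODGEO": ["A1", "A2", "A3", "B1", "C1"],
        "LIBGEO": ["Montreuil", "Montreuil", "Montreuil", "Saint-Denis", "Paris"],
    }
)

# Question 2: names attached to more than one commune code
codes_per_name = toy.groupby("LIBGEO")["CODGEO"].nunique()
ambiguous_names = codes_per_name[codes_per_name > 1].index

# Codes of those communes, with a clean 0..n-1 index
x = toy.loc[toy["LIBGEO"].isin(ambiguous_names), "CODGEO"].reset_index(drop=True)

# Questions 3 and 4: look at the matching observations, sorted by name
duplicated_rows = toy[toy["CODGEO"].isin(x)].sort_values("LIBGEO")

# Question 7: exact match on one label, substring match on another
montreuil = toy[toy["LIBGEO"] == "Montreuil"]
contains_saint_denis = toy[toy["LIBGEO"].str.contains("Saint-Denis")]
print(duplicated_rows)
```

Counting distinct codes per name with `nunique` makes the ambiguity explicit before any merge. A name shared by several communes cannot serve as a join key on its own.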