# Academic Integrity Declaration

Academic Integrity and Learning Statement

By submitting my work, I confirm that:

1. The code, analysis, and documentation in this notebook are my own work and reflect my own understanding.
2. I am prepared to explain all code and analysis included in this submission.

If I used assistance (e.g., AI tools, tutors, or other resources), I have:

- Clearly documented where and how external tools or resources were used in my solution.
- Included a copy of the interaction (e.g., AI conversation or tutoring notes) in an appendix.

I acknowledge that:

- I may be asked to explain any part of my code or analysis during evaluation.
- Misrepresenting assisted work as my own constitutes academic dishonesty and undermines my learning.


For each task, include:
- a brief explanation,
- a detailed discussion of approaches,
- observations, and
- decisions.

A. An overview of the dataset

B. Exploration: numerical summaries, indexing and grouping

C. Exploration: visualizations

D. Probabilities

E. Matrices

Appendix


## A. An overview of the dataset

**Brief Explanation:**  
In this task, we aim to understand the general structure of the dataset. This includes examining the number of observations, columns, data types, and any missing values. Understanding these aspects is essential before performing further analysis and visualizations.

**Approach:**  
To explore the dataset, we will:  
1. Load the dataset into a Pandas DataFrame as `df`.  
2. Check the shape to determine the number of rows and columns.  
3. Display the first few entries to quickly inspect the data.  
4. Retrieve the index labels and show the column names as a `list` to understand the structure.  
5. Examine the data types of each column.  
6. Identify any missing values and show the rows containing them.  
7. Handle the affected observations so that no missing values remain and each column holds a single, consistent data type.

This structured exploration helps us ensure that the dataset is clean and ready for further analysis.

In [14]:
import pandas as pd

# Load the dataset
df = pd.read_csv("p1_communes.csv")

In [16]:
# Display shape (rows, columns)
print("(Rows, Columns):", df.shape)

(Rows, Columns): (2202, 17)


In [17]:
# Display first few entries
print("First 5 rows:")
display(df.head())

First 5 rows:


Unnamed: 0,Canton,Commune,Language,Residents,Population density per km²,0-19 years,20-64 years,65 years or over,Private households,Surface area in km²,Settlement area,Agricultural area,Wooded area,Unproductive area,East coordinate,North coordinate,Elevation
0,ZH,Aeugst am Albis,de,1982,250.5689,19.677094,62.764884,17.558022,835,7.91,12.658228,51.139241,30.886076,5.316456,2679300,1235700,673
1,ZH,Affoltern am Albis,de,12229,1154.76865,20.508627,61.329626,18.161747,5348,10.59,30.674264,40.17094,28.205128,0.949668,2676800,1236800,502
2,ZH,Bonstetten,de,5548,746.702557,23.666186,60.310022,16.023792,2325,7.43,15.456989,55.510753,28.629032,0.403226,2677800,1241000,583
3,ZH,Hausen am Albis,de,3701,272.132353,21.804918,60.686301,17.508781,1546,13.6,12.69259,55.90609,28.833456,2.567865,2682900,1233100,653
4,ZH,Hedingen,de,3734,571.822358,21.772898,61.756829,16.470273,1540,6.53,19.817073,46.341463,33.231707,0.609756,2676400,1239000,543


In [18]:
# Index labels
print("Index labels:", df.index)

Index labels: RangeIndex(start=0, stop=2202, step=1)


In [20]:
# Column names as list
print("Columns:", df.columns.tolist())

Columns: ['Canton', 'Commune', 'Language', 'Residents', 'Population density per km²', '0-19 years', '20-64 years', '65 years or over', 'Private households', 'Surface area in km²', 'Settlement area', 'Agricultural area', 'Wooded area', 'Unproductive area', 'East coordinate', 'North coordinate', 'Elevation']


In [21]:
# Data types of each column
print("Data types of each column:")
print(df.dtypes)

Data types of each column:
Canton                         object
Commune                        object
Language                       object
Residents                       int64
Population density per km²    float64
0-19 years                    float64
20-64 years                   float64
65 years or over              float64
Private households              int64
Surface area in km²           float64
Settlement area               float64
Agricultural area             float64
Wooded area                   float64
Unproductive area             float64
East coordinate                 int64
North coordinate                int64
Elevation                       int64
dtype: object


In [22]:
# Identify rows with missing values
missing_rows = df[df.isnull().any(axis=1)]
print(f"Number of rows with missing values: {len(missing_rows)}")
display(missing_rows)


Number of rows with missing values: 11


Unnamed: 0,Canton,Commune,Language,Residents,Population density per km²,0-19 years,20-64 years,65 years or over,Private households,Surface area in km²,Settlement area,Agricultural area,Wooded area,Unproductive area,East coordinate,North coordinate,Elevation
155,ZH,Stammheim,,2747,114.649416,21.405169,58.281762,20.313069,1125,23.96,9.056761,55.801336,34.557596,0.584307,2702400,1276500,455
156,ZH,Wädenswil,,24341,682.968575,19.740356,59.944127,20.315517,10371,35.64,19.336706,63.26588,15.064643,2.332771,2693400,1231600,641
157,ZH,Elgg,,4903,201.02501,20.762798,60.47318,18.764022,2121,24.39,9.545268,46.907005,43.138058,0.409668,2707700,1260900,598
158,ZH,Horgen,,22665,735.160558,20.657401,60.480918,18.861681,9685,30.83,18.456049,38.858255,39.506974,3.178722,2687800,1234900,624
440,,Thurnen,,1922,323.02521,20.759625,58.324662,20.915713,814,5.95,13.94958,79.159664,6.05042,0.840336,2605300,1184700,558
704,,Villaz,,2287,148.217758,24.398776,61.346742,14.254482,897,15.44,8.80829,70.336788,20.401554,0.453368,2563200,1174400,727
757,,Prez,,2236,139.401496,25.0,61.67263,13.32737,839,16.04,6.924517,68.808484,23.268871,0.998129,2567700,1181700,651
1133,GR,Bergün Filisur,,905,4.759651,16.574586,59.668508,23.756906,397,190.14,0.977815,19.824414,27.541794,51.655977,2776700,1166700,2273
1165,GR,Rheinwald,,577,4.21722,16.984402,54.07279,28.942808,266,136.82,1.124334,34.328685,14.448419,50.098562,2744500,1157500,2192
1626,TI,Riviera,,4220,48.780488,20.07109,61.445498,18.483412,1774,86.58,4.008317,7.03477,65.553887,23.403026,2718700,1128900,1458


In [23]:
# Drop rows with missing values
df = df.dropna()
print("Dataset shape after dropping missing values:", df.shape)

Dataset shape after dropping missing values: (2191, 17)


**Observations and Decisions:**  
- The dataset contains `2202` rows and `17` columns.  
- Columns such as `Residents`, `Surface area in km²`, and `Elevation` are numeric, while `Canton`, `Commune`, and `Language` are categorical (stored as `object`).  
- There were `11` rows with missing values, all in the `Language` column and, for some rows, also in `Canton`. After inspection, we decided to drop them: they make up about 0.5 % of the 2202 rows, so removing them should not significantly affect our analysis. The cleaned dataset has `2191` rows.  
- After cleaning, the dataset is ready for exploration in Task B, where we will compute numerical summaries, groupings, and other insights.
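
The cleaning decision above, together with the type-consistency goal from step 7 of the approach, can be double-checked per column. The following is a minimal sketch, not the notebook's actual cell: it uses a small made-up DataFrame named `toy` instead of `p1_communes.csv`, and the names `missing_per_column`, `clean`, and `types_per_column` are introduced here only for illustration. Resetting the index after `dropna()` is optional and shown as one possible choice.

```python
import pandas as pd

# Toy stand-in for the communes data (illustrative values, not the real file)
toy = pd.DataFrame({
    "Canton": ["ZH", "ZH", None, "GR"],
    "Commune": ["A", "B", "C", "D"],
    "Language": ["de", None, None, "de"],
    "Residents": [1982, 24341, 1922, 905],
})

# Missing values per column: shows which columns are affected
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop incomplete rows; reset_index makes the row labels contiguous again
clean = toy.dropna().reset_index(drop=True)

# Sanity checks: nothing is missing, and each column holds a single value type
assert clean.isnull().sum().sum() == 0
types_per_column = clean.apply(lambda col: col.map(type).nunique())
print(types_per_column)
```

On the real data, the same checks would be run on `df` right after the `dropna()` cell.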

## B. Exploration: numerical summaries, indexing and grouping


1. Obtain the mean, minimum and maximum value for each column containing numerical data. Your output should preferably show only the three requested statistics and not the full table of descriptive statistics.
2. List the 10 most populated communes, ordered by their number of residents.
3. List the 10 least populated communes, ordered by their number of residents.
4. Group the communes by canton and save them into separate .csv files, e.g. a ZH.csv with all the data for communes in Zurich (do not include the .csv files in your submission).
5. Compute the population density at the canton level and rank the cantons from most dense to least dense. Clearly state and comment on your observations.
6. Compute the number of communes in each canton where more than 50 percent of the population is aged between 20 and 64 years.
7. Compute the difference between the maximum and minimum elevations for each canton. Find the top 5 cantons with the largest range of elevations.
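
As a starting point for these tasks, here is a minimal sketch of one possible approach with pandas. It runs on a small made-up DataFrame named `toy` (illustrative values only) that reuses the dataset's column names; all variable names are assumptions introduced for this example. Note that the canton-level density in task 5 is computed here as total residents divided by total surface area, which is one reasonable reading of the task and differs from averaging the commune-level densities.

```python
import pandas as pd

# Toy stand-in for the cleaned communes data (illustrative values only)
toy = pd.DataFrame({
    "Canton": ["ZH", "ZH", "GR", "GR", "TI"],
    "Commune": ["A", "B", "C", "D", "E"],
    "Residents": [2000, 12000, 900, 600, 4000],
    "Surface area in km²": [8.0, 10.0, 190.0, 140.0, 80.0],
    "20-64 years": [62.8, 61.3, 59.7, 49.0, 61.4],
    "Elevation": [673, 502, 2273, 2192, 1458],
})

# Task 1: only mean, min and max of the numeric columns
summary = toy.select_dtypes("number").agg(["mean", "min", "max"])

# Tasks 2 and 3: most and least populated communes (use n=10 on the real data)
most_populated = toy.nlargest(2, "Residents")[["Commune", "Residents"]]
least_populated = toy.nsmallest(2, "Residents")[["Commune", "Residents"]]

# Task 5: canton density = total residents / total area, ranked most dense first
totals = toy.groupby("Canton")[["Residents", "Surface area in km²"]].sum()
canton_density = (
    totals["Residents"] / totals["Surface area in km²"]
).sort_values(ascending=False)

# Task 6: number of communes per canton with more than 50 % aged 20-64
working_age_counts = (toy["20-64 years"] > 50).groupby(toy["Canton"]).sum()

# Task 7: elevation range per canton, five largest first
elevation = toy.groupby("Canton")["Elevation"]
elevation_range = (elevation.max() - elevation.min()).nlargest(5)

print(summary, canton_density, working_age_counts, elevation_range, sep="\n\n")
```

For task 4, iterating over `df.groupby("Canton")` yields one `(canton, group)` pair per canton, and each group can be written with `group.to_csv(f"{canton}.csv", index=False)`.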


## C. Exploration: visualizations