# Analysing info about driver's licenses

Hello, I am Iza, and this is a simple analysis I made to get used to working with data in Python. My aim is to answer some questions using data about newly granted driver's licenses in Poland.

## About the data

The datasets come from the dane.gov.pl website. They cover driver's licenses newly granted in Poland in 2022 and 2023, and record the month, voivodeship, age, sex, and number of licenses granted. Unfortunately, the dataset doesn't contain much information, but we can still do some interesting stuff with it!

## Table of contents

1. [Data load and setup](#1-data-load-and-setup)
2. [In what months were the most driver's licenses granted? Are their numbers significantly higher in the summer?](#2-in-what-months-were-the-most-drivers-licenses-granted-are-their-numbers-significantly-higher-in-the-summer)
3. Which voivodeships have the highest number of driving licenses obtained (per number of inhabitants)?
4. What is the difference in the number of driving licenses obtained between men and women? Does age play a factor here?
5. Which age groups get the most driving licenses?
6. How did the number of driving licenses obtained change between 2022 and 2023? Is the difference significant?
7. Conclusion


## 1. Data load and setup

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import os


In [3]:
# list the monthly CSV files for each year
csv_files_22 = [f for f in os.listdir('2022') if f.endswith('.csv')]
csv_files_23 = [f for f in os.listdir('2023') if f.endswith('.csv')]
list1 = []  # monthly DataFrames for 2022
list2 = []  # monthly DataFrames for 2023

# the files are pipe-separated, one file per month
for file in csv_files_22:
    file_path = os.path.join('2022', file)
    df = pd.read_csv(file_path, encoding='utf-8', sep='|')
    list1.append(df)

for file in csv_files_23:
    file_path = os.path.join('2023', file)
    df = pd.read_csv(file_path, encoding='utf-8', sep='|')
    list2.append(df)


While working with this data I ran into a peculiar problem: the files for March and November 2023 use a different separator than the rest, so they could not be parsed with `sep='|'`. I fixed this manually in the CSV files.
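If you would rather not edit the files by hand, one option is to detect the separator before loading each file. Below is a minimal sketch, assuming the odd files use one of a few common separators. The source does not say which separator they actually used, so the `;` in the sample is a guess, the sample rows are made up, and `detect_separator` and `read_monthly_csv` are hypothetical helper names, not part of pandas.

```python
import io
import pandas as pd

# Candidate separators to try. Which one the March and November files
# really used is not stated, so this list is an assumption.
CANDIDATE_SEPARATORS = ('|', ';', ',', '\t')

def detect_separator(header_line, candidates=CANDIDATE_SEPARATORS):
    """Return the candidate that occurs most often in the header line."""
    return max(candidates, key=header_line.count)

def read_monthly_csv(buffer):
    """Read one monthly file from an open text buffer, whatever its separator."""
    header_line = buffer.readline()
    buffer.seek(0)  # rewind so pandas reads the header row too
    return pd.read_csv(buffer, sep=detect_separator(header_line))

# Made-up sample rows, for illustration only (not real data)
pipe_sample = io.StringIO("DATA_MC|KOD_WOJ|PLEC|WIEK|LICZBA\n2023-02|2|K|20|89\n")
other_sample = io.StringIO("DATA_MC;KOD_WOJ;PLEC;WIEK;LICZBA\n2023-03;2;K;20;89\n")

df_pipe = read_monthly_csv(pipe_sample)
df_other = read_monthly_csv(other_sample)
```

For real files, the same helper could be called inside `with open(file_path, encoding='utf-8') as handle:` by passing `handle` as the buffer.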

In [4]:
# combine each list of monthly DataFrames into one DataFrame per year
df_2022 = pd.concat(list1, ignore_index=True)
df_2023 = pd.concat(list2, ignore_index=True)


In [14]:
df = pd.concat([df_2022, df_2023], ignore_index=True)
df[['YEAR', 'MONTH']] = df['DATA_MC'].str.split('-', expand=True)
df.set_index(['YEAR', 'MONTH'], inplace=True)
df.drop(columns='DATA_MC', inplace=True)
# translate the Polish column names to English
df.rename(columns={'KOD_WOJ' : 'VOIV_CODE', 'WOJEWODZTWO' : 'VOIVODESHIP', 'PLEC' : 'GENDER', 'WIEK' : 'AGE', 'LICZBA' : 'NUMBER'}, inplace=True)
df

| YEAR | MONTH | VOIV_CODE | VOIVODESHIP | GENDER | AGE | NUMBER |
|---|---|---|---|---|---|---|
| 2022 | 01 | 2 | WOJ. DOLNOŚLĄSKIE | K | 54 | 4 |
| 2022 | 01 | 2 | WOJ. DOLNOŚLĄSKIE | K | 47 | 5 |
| 2022 | 01 | 2 | WOJ. DOLNOŚLĄSKIE | K | 42 | 5 |
| 2022 | 01 | 2 | WOJ. DOLNOŚLĄSKIE | K | 20 | 89 |
| 2022 | 01 | 2 | WOJ. DOLNOŚLĄSKIE | K | 29 | 22 |
| ... | ... | ... | ... | ... | ... | ... |
| 2023 | 12 | 32 | WOJ. ZACHODNIOPOMORSKIE | M | 18 | 275 |
| 2023 | 12 | 32 | WOJ. ZACHODNIOPOMORSKIE | M | 20 | 20 |
| 2023 | 12 | 32 | WOJ. ZACHODNIOPOMORSKIE | M | 19 | 51 |
| 2023 | 12 | 32 | WOJ. ZACHODNIOPOMORSKIE | M | 41 | 3 |


As we can see, the data is fairly simple, but we can still answer some questions with it :)

## 2. In what months were the most driver's licenses granted? Are their numbers significantly higher in the summer?

In [30]:
# total number of licenses granted in each month, separately for 2022 and 2023
num_months_1 = df_2022.groupby('DATA_MC')['LICZBA'].sum().to_frame()
num_months_1 = num_months_1.reset_index()
num_months_2 = df_2023.groupby('DATA_MC')['LICZBA'].sum().to_frame()
num_months_2 = num_months_2.reset_index()
# give the columns English names
num_months_1.rename(columns={'DATA_MC': 'MONTH', 'LICZBA': 'NUMBER_OF_LICENSES'}, inplace=True)
num_months_2.rename(columns={'DATA_MC': 'MONTH', 'LICZBA': 'NUMBER_OF_LICENSES'}, inplace=True)

