# 1.2 Data Overview and Cleaning (Python Notebook)

In this section we check for **missing values**, **duplicate records**, and **extreme values**, and justify every cleaning decision.

## 1) Data import and basic overview

File used: `CVD_cleaned.csv`.
Size at import: **308,854** rows × **19** columns.


In [10]:
import pandas as pd
import numpy as np

# Path to the file (adjust if your folder structure differs)
CSV_PATH = "CVD_cleaned.csv"

df = pd.read_csv(CSV_PATH)
df.shape


(308854, 19)

### Quick look at the data (head, dtypes, basic info)

In [11]:
df.head()

| | General_Health | Checkup | Exercise | Heart_Disease | Skin_Cancer | Other_Cancer | Depression | Diabetes | Arthritis | Sex | Age_Category | Height_(cm) | Weight_(kg) | BMI | Smoking_History | Alcohol_Consumption | Fruit_Consumption | Green_Vegetables_Consumption | FriedPotato_Consumption |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Poor | Within the past 2 years | No | No | No | No | No | No | Yes | Female | 70-74 | 150.0 | 32.66 | 14.54 | Yes | 0.0 | 30.0 | 16.0 | 12.0 |
| 1 | Very Good | Within the past year | No | Yes | No | No | No | Yes | No | Female | 70-74 | 165.0 | 77.11 | 28.29 | No | 0.0 | 30.0 | 0.0 | 4.0 |
| 2 | Very Good | Within the past year | Yes | No | No | No | No | Yes | No | Female | 60-64 | 163.0 | 88.45 | 33.47 | No | 4.0 | 12.0 | 3.0 | 16.0 |
| 3 | Poor | Within the past year | Yes | Yes | No | No | No | Yes | No | Male | 75-79 | 180.0 | 93.44 | 28.73 | No | 0.0 | 30.0 | 30.0 | 8.0 |
| 4 | Good | Within the past year | No | No | No | No | No | No | No | Male | 80+ | 191.0 | 88.45 | 24.37 | Yes | 0.0 | 8.0 | 4.0 | 0.0 |


In [12]:
df.dtypes

General_Health                   object
Checkup                          object
Exercise                         object
Heart_Disease                    object
Skin_Cancer                      object
Other_Cancer                     object
Depression                       object
Diabetes                         object
Arthritis                        object
Sex                              object
Age_Category                     object
Height_(cm)                     float64
Weight_(kg)                     float64
BMI                             float64
Smoking_History                  object
Alcohol_Consumption             float64
Fruit_Consumption               float64
Green_Vegetables_Consumption    float64
FriedPotato_Consumption         float64
dtype: object

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308854 entries, 0 to 308853
Data columns (total 19 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   General_Health                308854 non-null  object 
 1   Checkup                       308854 non-null  object 
 2   Exercise                      308854 non-null  object 
 3   Heart_Disease                 308854 non-null  object 
 4   Skin_Cancer                   308854 non-null  object 
 5   Other_Cancer                  308854 non-null  object 
 6   Depression                    308854 non-null  object 
 7   Diabetes                      308854 non-null  object 
 8   Arthritis                     308854 non-null  object 
 9   Sex                           308854 non-null  object 
 10  Age_Category                  308854 non-null  object 
 11  Height_(cm)                   308854 non-null  float64
 12  Weight_(kg)                   308854 non-nul

## 2) Missing values (NA)

First we check whether the data contain any missing values.


In [14]:
missing_by_col = df.isna().sum().sort_values(ascending=False)
missing_total = int(missing_by_col.sum())

missing_by_col, missing_total


(General_Health                  0
 Age_Category                    0
 Green_Vegetables_Consumption    0
 Fruit_Consumption               0
 Alcohol_Consumption             0
 Smoking_History                 0
 BMI                             0
 Weight_(kg)                     0
 Height_(cm)                     0
 Sex                             0
 Checkup                         0
 Arthritis                       0
 Diabetes                        0
 Depression                      0
 Other_Cancer                    0
 Skin_Cancer                     0
 Heart_Disease                   0
 Exercise                        0
 FriedPotato_Consumption         0
 dtype: int64,
 0)

**Result:** Total NA = **0**.

**Decision:** Since there are no missing values, **imputation (filling in values) is not needed**.
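A zero from `isna()` only covers values that pandas recognizes as missing. As a complementary check, the sketch below counts text cells that are blank or match a placeholder token. It runs on a small made-up frame, and both the `PLACEHOLDERS` list and the helper `count_disguised_na` are hypothetical names introduced for illustration; nothing in the source says such tokens occur in `CVD_cleaned.csv`.

```python
import pandas as pd

# Toy frame standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "Exercise": ["Yes", "No", " ", "Unknown"],
    "Sex": ["Female", "Male", "Male", "Female"],
    "BMI": [28.29, 33.47, 24.37, 14.54],
})

# Hypothetical placeholder tokens (an assumption); adjust to the data source.
PLACEHOLDERS = ["", "unknown", "n/a", "na", "?"]

def count_disguised_na(frame: pd.DataFrame) -> pd.Series:
    """Count, per non-numeric column, cells that are blank or a placeholder."""
    text_cols = frame.select_dtypes(exclude="number")
    # Normalize whitespace and case before comparing against the tokens.
    normalized = text_cols.apply(lambda s: s.str.strip().str.lower())
    return normalized.isin(PLACEHOLDERS).sum()

disguised = count_disguised_na(toy)
print(disguised)
```

If every count is zero on the real data, the conclusion that no imputation is needed also holds for text-encoded gaps.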


## 3) Duplicate records

We check for exact duplicates (rows identical across all columns).


In [15]:
dup_count = int(df.duplicated().sum())
dup_count


80

**Result:** **80** duplicate rows were found.

### Decision on duplicates

Because the dataset has no individual identifier (ID), fully identical responses **can come from different people**. We therefore do **not remove** the duplicates.
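Before settling on that decision it can help to look at the duplicated rows themselves. The sketch below, on a made-up three-column frame, shows the difference between counting repeats (the default `keep="first"`) and listing every member of a duplicated group (`keep=False`), and then tallies the size of each group. The toy values and variable names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Female", "Male", "Female"],
    "Age_Category": ["70-74", "60-64", "70-74", "60-64", "70-74"],
    "BMI": [28.29, 33.47, 28.29, 24.37, 28.29],
})

# Default keep="first": only the second and later copies are marked.
repeats = int(toy.duplicated().sum())

# keep=False: every row that belongs to a duplicated group is marked.
all_members = toy[toy.duplicated(keep=False)]

# Size of each group of identical rows; groups with n > 1 are duplicates.
group_sizes = (
    toy.groupby(list(toy.columns), sort=False)
       .size()
       .reset_index(name="n")
)
dup_groups = group_sizes[group_sizes["n"] > 1]

print(repeats)
print(all_members)
print(dup_groups)
```

Many small groups are what one would expect from coarse categorical answers with no ID column, whereas a single very large group would be more suspicious.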

## 4) Extreme values (outliers)

First we look at the ranges of the numeric variables.


In [16]:
# Duplicates are kept (see section 3), so we work on the full df
num_cols = df.select_dtypes(include="number").columns
minmax = df[num_cols].agg(["min", "max"]).T
minmax


| | min | max |
|---|---|---|
| Height_(cm) | 91.0 | 241.0 |
| Weight_(kg) | 24.95 | 293.02 |
| BMI | 12.02 | 99.33 |
| Alcohol_Consumption | 0.0 | 30.0 |
| Fruit_Consumption | 0.0 | 120.0 |
| Green_Vegetables_Consumption | 0.0 | 128.0 |
| FriedPotato_Consumption | 0.0 | 128.0 |


**Observations:**
- `Height_(cm)` min/max: **91 – 241 cm**
- `Weight_(kg)` min/max: **24.95 – 293.02 kg**
- `BMI` min/max: **12.02 – 99.33**

Distribution tails like these may represent:
- genuine rare cases (e.g. very high weight or BMI),
- or data entry/conversion errors.
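One possible way to separate the two explanations is an internal consistency check: BMI is defined as weight in kilograms divided by height in metres squared, so it can be recomputed from `Weight_(kg)` and `Height_(cm)` and compared with the stored `BMI`. The sketch below uses four rows taken from the `df.head()` output, with the last BMI deliberately altered. The tolerance `BMI_TOLERANCE` is an assumption chosen to absorb rounding and unit-conversion differences, not a value from the source.

```python
import pandas as pd

# Rows taken from the df.head() output; the last BMI is altered on purpose
# (28.73 -> 48.73) to simulate an entry error.
toy = pd.DataFrame({
    "Height_(cm)": [150.0, 165.0, 163.0, 180.0],
    "Weight_(kg)": [32.66, 77.11, 88.45, 93.44],
    "BMI": [14.54, 28.29, 33.47, 48.73],
})

# Assumed tolerance in BMI units; small gaps come from rounding/conversion.
BMI_TOLERANCE = 1.0

# BMI = weight [kg] / (height [m])^2
height_m = toy["Height_(cm)"] / 100.0
bmi_recomputed = toy["Weight_(kg)"] / height_m ** 2

# Rows whose stored BMI disagrees with the recomputed one beyond tolerance.
inconsistent = (bmi_recomputed - toy["BMI"]).abs() > BMI_TOLERANCE

print(toy.assign(BMI_recomputed=bmi_recomputed.round(2),
                 inconsistent=inconsistent))
```

An extreme BMI that agrees with its height and weight is at least internally consistent, while a mismatch points more towards an entry or conversion problem.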

Because we want to retain rare but genuine cases, we do **not remove** extreme values.
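Keeping the extremes does not rule out marking them for later sensitivity checks. The sketch below flags values outside the conventional 1.5 × IQR fences while leaving every row in place. The 1.5 multiplier is the common Tukey rule and is an assumption here, not a threshold used in the original notebook; the heights are made up, apart from the observed maximum of 241 cm.

```python
import pandas as pd

# Made-up heights; 241.0 mirrors the observed maximum of Height_(cm).
toy = pd.DataFrame({
    "Height_(cm)": [150.0, 160.0, 165.0, 170.0, 175.0, 180.0, 241.0],
})

col = toy["Height_(cm)"]
q1 = col.quantile(0.25)
q3 = col.quantile(0.75)
iqr = q3 - q1

# Tukey fences (assumed 1.5 multiplier).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag instead of dropping: the row count stays the same.
flagged = toy.assign(is_extreme=~col.between(lower, upper))

print(lower, upper)
print(flagged)
```

The flag column can then be used to compare results with and without the marked rows, without changing the cleaned dataset itself.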

## 5) Cleaning summary (decisions)
### What we did and why
- **NA values:** none, so no imputation was needed
- **Duplicates:** detected (identical rows), but **not removed**, because they may represent different people
- **Extremes:** present, but **not removed**, because they may be genuine and informative
