# 1) Pandas data cleaning

- essentially the cleaning of a data set, most commonly:
  - removing irrelevant columns
  - renaming columns
  - replacing or filling in missing values
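
The first item in the list above, removing irrelevant columns, can be sketched with `DataFrame.drop`. The sample data and the column name `internal_id` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# hypothetical sample data; "internal_id" stands in for an irrelevant column
df = pd.DataFrame(
    {"name": ["John", "Doe"], "age": [25, 30], "internal_id": ["x1", "x2"]}
)

# drop() returns a new DataFrame; the original df is left unchanged
df_trimmed = df.drop(columns=["internal_id"])

print(df_trimmed.to_string(index=False))
```

Because `drop()` returns a copy by default, the original frame stays available if the column turns out to be needed later.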


## 1.1) Drop rows with missing values

- **dropna():** removes rows (or columns) that contain missing values


In [1]:
import pandas as pd

# define a dictionary with sample data which includes some missing values
data = {"A": [1, 2, 3, None, 5], "B": [None, 2, 3, 4, 5], "C": [1, 2, None, None, 5]}

df = pd.DataFrame(data)
print("Original Data:\n", df)
print()

# use dropna() to remove rows with any missing values
df_cleaned = df.dropna()

print("Cleaned Data:\n", df_cleaned)

Original Data:
      A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  NaN
3  NaN  4.0  NaN
4  5.0  5.0  5.0

Cleaned Data:
      A    B    C
1  2.0  2.0  2.0
4  5.0  5.0  5.0
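
Dropping every row with any missing value can discard a lot of data. As a rough sketch, the `subset`, `how` and `thresh` parameters of `dropna()` narrow the condition. The variable names below are only illustrative.

```python
import pandas as pd

data = {"A": [1, 2, 3, None, 5], "B": [None, 2, 3, 4, 5], "C": [1, 2, None, None, 5]}
df = pd.DataFrame(data)

# drop a row only when column "C" is missing
by_subset = df.dropna(subset=["C"])

# drop a row only when every value in it is missing (no such row here)
all_missing = df.dropna(how="all")

# keep rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

print(by_subset.index.tolist())     # [0, 1, 4]
print(len(all_missing))             # 5
print(at_least_two.index.tolist())  # [0, 1, 2, 4]
```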


## 1.2) Fill missing values

- **fillna():** replaces missing values with a given value


In [2]:
import pandas as pd

# define a dictionary with sample data which includes some missing values
data = {"A": [1, 2, 3, None, 5], "B": [None, 2, 3, 4, 5], "C": [1, 2, None, None, 5]}

df = pd.DataFrame(data)

print("Original Data:\n", df)

# filling NaN values with 0
df.fillna(0, inplace=True)

print("\nData after filling NaN with 0:\n", df)

Original Data:
      A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  NaN
3  NaN  4.0  NaN
4  5.0  5.0  5.0

Data after filling NaN with 0:
      A    B    C
0  1.0  0.0  1.0
1  2.0  2.0  2.0
2  3.0  3.0  0.0
3  0.0  4.0  0.0
4  5.0  5.0  5.0
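
A single fill value is not always appropriate for every column. The sketch below shows two alternatives on the same sample data: a dictionary of per-column fill values, and forward filling with `ffill()`. The fill values `0` and `-1` are arbitrary placeholders.

```python
import pandas as pd

data = {"A": [1, 2, 3, None, 5], "B": [None, 2, 3, 4, 5], "C": [1, 2, None, None, 5]}
df = pd.DataFrame(data)

# a dict maps column name -> fill value; column "C" is not listed, so it keeps its NaN
filled = df.fillna({"A": 0, "B": -1})

# forward fill: copy the last valid value down each column;
# a leading NaN (row 0 of "B") has nothing above it and stays missing
forward = df.ffill()

print("Per-column fill:\n", filled)
print("\nForward fill:\n", forward)
```

Both calls return a new DataFrame, so `df` itself still contains the original missing values.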


## 1.3) Use aggregate functions to fill missing values


In [3]:
import pandas as pd

# define a dictionary with sample data which includes some missing values
data = {"A": [1, 2, 3, None, 5], "B": [None, 2, 3, 4, 5], "C": [1, 2, None, None, 5]}

df = pd.DataFrame(data)

print("Original Data:\n", df)

# filling NaN values with the mean of each column
df.fillna(df.mean(), inplace=True)

print("\nData after filling NaN with mean:\n", df)

Original Data:
      A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  NaN
3  NaN  4.0  NaN
4  5.0  5.0  5.0

Data after filling NaN with mean:
       A    B         C
0  1.00  3.5  1.000000
1  2.00  2.0  2.000000
2  3.00  3.0  2.666667
3  2.75  4.0  2.666667
4  5.00  5.0  5.000000
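
The mean is sensitive to outliers, so the median is a common alternative. The following sketch reuses the sample data and also shows one way to mix aggregates per column. Which aggregate suits which column is an assumption made here for illustration only.

```python
import pandas as pd

data = {"A": [1, 2, 3, None, 5], "B": [None, 2, 3, 4, 5], "C": [1, 2, None, None, 5]}
df = pd.DataFrame(data)

# fill each column with its own median (NaN values are skipped when computing it)
median_filled = df.fillna(df.median())

# mix aggregates: one fill value per column, passed as a dict
fill_values = {"A": df["A"].median(), "B": df["B"].mean(), "C": df["C"].min()}
mixed_filled = df.fillna(fill_values)

print("Filled with median:\n", median_filled)
print("\nFilled with mixed aggregates:\n", mixed_filled)
```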


## 1.4) Handle duplicate values

- **duplicated():** flags duplicate rows
- **drop_duplicates():** removes duplicate rows


In [4]:
import pandas as pd

# sample data
data = {"A": [1, 2, 2, 3, 3, 4], "B": [5, 6, 6, 7, 8, 8]}
df = pd.DataFrame(data)

print("Original DataFrame:\n", df.to_string(index=False))

# detect duplicates
print("\nDuplicate Rows:\n", df[df.duplicated()].to_string(index=False))

# remove duplicates based on column 'A'
df.drop_duplicates(subset=["A"], keep="first", inplace=True)

print(
    "\nDataFrame after removing duplicates based on column 'A':\n",
    df.to_string(index=False),
)

Original DataFrame:
  A  B
 1  5
 2  6
 2  6
 3  7
 3  8
 4  8

Duplicate Rows:
  A  B
 2  6

DataFrame after removing duplicates based on column 'A':
  A  B
 1  5
 2  6
 3  7
 4  8
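
In the output above, `duplicated()` reports only one row because it compares whole rows by default, while `drop_duplicates(subset=["A"])` compares only column `A`. A short sketch of how the `subset` and `keep` parameters change what counts as a duplicate:

```python
import pandas as pd

data = {"A": [1, 2, 2, 3, 3, 4], "B": [5, 6, 6, 7, 8, 8]}
df = pd.DataFrame(data)

# keep=False marks every copy of a fully duplicated row, not only the later ones
all_copies = df[df.duplicated(keep=False)]

# subset=["A"] compares only column "A", matching the drop_duplicates() call above
dup_in_a = df[df.duplicated(subset=["A"])]

# keep="last" retains the final occurrence of each "A" value instead of the first
last_kept = df.drop_duplicates(subset=["A"], keep="last")

print(all_copies.index.tolist())  # [1, 2]
print(dup_in_a.index.tolist())    # [2, 4]
print(last_kept.to_string(index=False))
```

With `keep="last"`, the row kept for `A == 3` has `B == 8` rather than `7`, so the choice of `keep` matters whenever the other columns differ.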


## 1.5) Rename column names to meaningful names

- **rename():** changes column (or index) labels


In [5]:
import pandas as pd

# sample data
data = {"A": [25, 30, 35], "B": ["John", "Doe", "Smith"], "C": [50000, 60000, 70000]}

df = pd.DataFrame(data)

# rename columns
df.rename(columns={"A": "Age", "B": "Name", "C": "Salary"}, inplace=True)

print(df.to_string(index=False))

 Age  Name  Salary
  25  John   50000
  30   Doe   60000
  35 Smith   70000
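
Besides a dictionary, `rename()` also accepts a function that is applied to every label, which can be handy when many columns follow the same pattern. A minimal sketch, where the `employee_` prefix is a made-up naming convention:

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [25, 30, 35], "Name": ["John", "Doe", "Smith"], "Salary": [50000, 60000, 70000]}
)

# apply a function to every column label; returns a new DataFrame
lowered = df.rename(columns=str.lower)

# any callable works, e.g. a lambda that adds a (hypothetical) prefix
prefixed = df.rename(columns=lambda name: f"employee_{name.lower()}")

print(lowered.columns.tolist())
print(prefixed.columns.tolist())
```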
