## Data Quality Control Automation

Subject: Data Cleaning and Reporting with Pandas

Components: Pandas, NumPy

Requirements:
- Load CSV data and calculate the NaN ratio for each column
- Detect outliers in a specific numerical column (using a rule such as 3 standard deviations from the mean)
- Create a new Pandas DataFrame summarizing all control results

In [1]:
import pandas as pd

In [2]:
# Loading Titanic Dataset
data = pd.read_csv('titanic.csv')

In [3]:
# Display basic information about the dataset
print('Initial Data Information:')
print(data.info())

Initial Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB
None


In [4]:
# Check for missing values
print('Missing Values:')
print(data.isnull().sum())

Missing Values:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
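
The first requirement asks for a NaN ratio rather than a raw count. A minimal sketch of that step follows, assuming a helper named `nan_ratio` (a hypothetical name, not part of the notebook) and a tiny made-up frame in place of the Titanic data:

```python
import numpy as np
import pandas as pd

def nan_ratio(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values in each column (0.0 to 1.0)."""
    # isna() yields a boolean frame; the mean of booleans is the share of True values
    return df.isna().mean()

# Tiny illustrative frame (assumption: stands in for the Titanic data)
demo = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, np.nan],
    'fare': [7.25, 71.28, 7.92, 53.10],
    'deck': [None, 'C', None, None],
})
ratios = nan_ratio(demo)
print(ratios)
```

Applied to `data`, the counts above would translate to roughly 0.199 for `age` (177 of 891), 0.772 for `deck` (688 of 891), and 0.002 for `embarked` and `embark_town` (2 of 891).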


In [5]:
# Get some basic statistics about the data
print('Basic Statistics:')
print(data.describe(include='all'))

Basic Statistics:
          survived      pclass   sex         age       sibsp       parch  \
count   891.000000  891.000000   891  714.000000  891.000000  891.000000   
unique         NaN         NaN     2         NaN         NaN         NaN   
top            NaN         NaN  male         NaN         NaN         NaN   
freq           NaN         NaN   577         NaN         NaN         NaN   
mean      0.383838    2.308642   NaN   29.699118    0.523008    0.381594   
std       0.486592    0.836071   NaN   14.526497    1.102743    0.806057   
min       0.000000    1.000000   NaN    0.420000    0.000000    0.000000   
25%       0.000000    2.000000   NaN   20.125000    0.000000    0.000000   
50%       0.000000    3.000000   NaN   28.000000    0.000000    0.000000   
75%       1.000000    3.000000   NaN   38.000000    1.000000    0.000000   
max       1.000000    3.000000   NaN   80.000000    8.000000    6.000000   

              fare embarked  class  who adult_male deck  embark_town 
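
The mean and standard deviation reported above are the inputs to the 3-standard-deviation rule named in the requirements. The sketch below shows one way to apply it, assuming a hypothetical helper `detect_outliers_3sigma` and synthetic values rather than the Titanic data:

```python
import numpy as np
import pandas as pd

def detect_outliers_3sigma(df, column, n_std=3.0):
    """Return a boolean Series marking values more than n_std standard deviations from the mean."""
    values = df[column]
    mean = values.mean()  # NaN values are skipped by default
    std = values.std()    # sample standard deviation (ddof=1), as in describe()
    lower, upper = mean - n_std * std, mean + n_std * std
    # Comparisons with NaN evaluate to False, so missing values are never flagged
    return (values < lower) | (values > upper)

# Synthetic data (assumption: for illustration only)
demo = pd.DataFrame({'fare': [10.0] * 20 + [500.0, np.nan]})
mask = detect_outliers_3sigma(demo, 'fare')
print(demo.loc[mask, 'fare'])
```

For the `age` column, the statistics above (mean 29.699118, std 14.526497) give bounds of about -13.88 and 73.28, so the maximum age of 80 would be flagged under this rule.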

In [8]:
# Calculate the survival rate
survival_rate = data['survived'].mean() * 100

print(f'Survival Rate: {survival_rate:0.2f}%')

Survival Rate: 38.38%


In [10]:
# Check the proportion of survivors by gender
print('Survivors by Gender:')
print(data.groupby('sex')['survived'].mean())

Survivors by Gender:
sex
female    0.742038
male      0.188908
Name: survived, dtype: float64


In [13]:
# Check the proportion of survivors based on passenger class
print('Survivors by Pclass:')
print(data.groupby('pclass')['survived'].mean())

Survivors by Pclass:
pclass
1    0.629630
2    0.472826
3    0.242363
Name: survived, dtype: float64
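
The last requirement calls for a DataFrame that summarizes all control results. The following sketch shows one possible layout, with one row per column; the function name `quality_report`, the output column names, and the demo data are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def quality_report(df, n_std=3.0):
    """Build one row per column with missing-value and 3-sigma outlier checks."""
    rows = []
    for col in df.columns:
        series = df[col]
        row = {
            'column': col,
            'dtype': str(series.dtype),
            'nan_count': int(series.isna().sum()),
            'nan_ratio': float(series.isna().mean()),
            'outlier_count': np.nan,  # stays NaN for non-numeric columns
        }
        # Apply the 3-sigma rule to numeric columns only (booleans excluded)
        if pd.api.types.is_numeric_dtype(series) and not pd.api.types.is_bool_dtype(series):
            mean, std = series.mean(), series.std()
            row['outlier_count'] = int(((series - mean).abs() > n_std * std).sum())
        rows.append(row)
    return pd.DataFrame(rows).set_index('column')

# Synthetic data (assumption: for illustration only)
demo = pd.DataFrame({
    'fare': [10.0] * 20 + [500.0, np.nan],
    'sex': ['male', 'female'] * 11,
})
report = quality_report(demo)
print(report)
```

Calling `quality_report(data)` on the Titanic frame should produce one row for each of the 15 columns. Here the outlier check runs on every numeric column; restricting it to a single chosen column, as the requirements suggest, would only need a column-name argument.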
