# Data Profiling & EDA

Discover insights about your data automatically using MKYZ's profiling tools.

In [1]:
import mkyz
mkyz.init()

mkyz package initialized. Version: 0.2.0


## 1. Load Titanic Dataset

This dataset contains mixed types (numeric, categorical) and missing values.

In [2]:
df = mkyz.load_data('data/titanic.csv')
print(f"Loaded {len(df)} rows.")

Loaded 891 rows.


## 2. Quick Info

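Get immediate statistics about the dataset, such as memory usage and column types. The cell below prints the overall missing-data percentage and the column names.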

In [3]:
info = mkyz.data_info(df)
print(f"Missing Data: {info['missing_pct']}%")
print(f"Columns: {info['columns']}")

Missing Data: 8.1%
Columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
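To make the missing-data figure concrete, here is a minimal pandas sketch of one way such a percentage can be computed. It assumes the percentage is taken over all cells (rows × columns), which is consistent with the report in the next section (866 missing cells out of 891 × 12 = 10,692, about 8.1%). The helper `missing_summary` and the toy frame are hypothetical and are not part of MKYZ.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: overall and per-column missing percentages."""
    total_cells = frame.shape[0] * frame.shape[1]
    total_missing = int(frame.isna().sum().sum())
    # Overall rate is computed over every cell, not over rows.
    overall = round(100 * total_missing / total_cells, 2) if total_cells else 0.0
    # isna().mean() gives the fraction of missing values in each column.
    per_column = {
        col: round(float(pct), 2)
        for col, pct in (frame.isna().mean() * 100).items()
    }
    return {"missing_pct": overall, "per_column": per_column}


# Small stand-in frame; swap in the Titanic DataFrame to compare numbers.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

summary = missing_summary(toy)
print(f"Missing Data: {summary['missing_pct']}%")
print(summary["per_column"])
```

The per-column breakdown is often more useful than the overall rate, because a single sparse column can dominate the total.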


## 3. Quick EDA Report

Generate a console-based report with recommendations.

In [4]:
# 'Survived' is the target column (spelled as it appears in the CSV header)
mkyz.quick_eda(df, target_column='Survived')

DATA PROFILE SUMMARY

📊 Overview
----------------------------------------
  Rows: 891
  Columns: 12
  Memory: 0.31 MB

📈 Column Types
----------------------------------------
  Numerical: 7
  Categorical: 5
  Datetime: 0
  Boolean: 0

❓ Missing Values
----------------------------------------
  Total: 866 (8.10%)
  Columns affected: 3
  Complete rows: 183

🔄 Duplicates
----------------------------------------
  Duplicate rows: 0


💡 RECOMMENDATIONS
----------------------------------------
  ⚠️ High missing rate (8.1%). Consider imputation or dropping columns with >50% missing.
  ⚠️ Column 'Name' has high cardinality (891 unique). Consider target encoding or grouping rare categories.
  ⚠️ Column 'Ticket' has high cardinality (681 unique). Consider target encoding or grouping rare categories.
  ⚠️ Column 'Cabin' has high cardinality (147 unique). Consider target encoding or grouping rare categories.
  ⚠️ Column 'SibSp' has 5.2% outliers. Consider capping or investigation.
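The overview counts in the report above (columns affected, complete rows, duplicate rows, high-cardinality columns) can be cross-checked with plain pandas. The sketch below runs on a small hypothetical frame. The 0.5 uniqueness ratio used to flag high cardinality is an arbitrary choice for illustration and is not MKYZ's documented threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the last row repeats the second one exactly.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee", "Bob"],
    "Sex": ["f", "m", "m", "f", "m"],
    "Age": [22.0, 30.0, np.nan, 35.0, 30.0],
})

# Columns that contain at least one missing value.
columns_affected = int(toy.isna().any().sum())

# Rows with no missing value in any column.
complete_rows = int(toy.notna().all(axis=1).sum())

# Rows that exactly repeat an earlier row.
duplicate_rows = int(toy.duplicated().sum())

# Non-numeric columns whose share of distinct values exceeds the threshold.
CARDINALITY_RATIO = 0.5  # assumption, chosen only for this example
high_cardinality = [
    col
    for col in toy.select_dtypes(exclude="number").columns
    if toy[col].nunique() / len(toy) > CARDINALITY_RATIO
]

print(f"Columns affected: {columns_affected}")
print(f"Complete rows: {complete_rows}")
print(f"Duplicate rows: {duplicate_rows}")
print(f"High cardinality: {high_cardinality}")
```

Columns such as `Name` or `Ticket`, where nearly every value is distinct, act more like identifiers than categories, which is why a report would flag them before encoding.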

## 4. Advanced Profiling

Deep dive into the data using the `DataProfile` class.

In [5]:
profile = mkyz.DataProfile(df, target_column='Survived')

# Check recommendations
print("Recommendations:")
for rec in profile.get_recommendations():
    print(f"- {rec}")

Recommendations:
- ⚠️ High missing rate (8.1%). Consider imputation or dropping columns with >50% missing.
- ⚠️ Column 'Name' has high cardinality (891 unique). Consider target encoding or grouping rare categories.
- ⚠️ Column 'Ticket' has high cardinality (681 unique). Consider target encoding or grouping rare categories.
- ⚠️ Column 'Cabin' has high cardinality (147 unique). Consider target encoding or grouping rare categories.
- ⚠️ Column 'SibSp' has 5.2% outliers. Consider capping or investigation.
- ⚠️ Column 'Parch' has 23.9% outliers. Consider capping or investigation.
- ⚠️ Column 'Fare' has 13.0% outliers. Consider capping or investigation.
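The outlier percentages above are easier to interpret with a concrete rule in mind. The sketch below uses the common 1.5 × IQR fence, which is an assumption here: the source does not say which method MKYZ applies. The functions `iqr_bounds`, `iqr_outlier_pct`, and `cap_outliers` are hypothetical helpers written for this example.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


def iqr_outlier_pct(values: pd.Series, k: float = 1.5) -> float:
    """Percentage of non-missing values that fall outside the fences."""
    clean = values.dropna()
    lower, upper = iqr_bounds(clean, k)
    outside = (clean < lower) | (clean > upper)
    return round(float(outside.mean()) * 100, 1)


def cap_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Capping: clip values to the fences instead of dropping rows."""
    lower, upper = iqr_bounds(values, k)
    return values.clip(lower=lower, upper=upper)


# A count-like column dominated by zeros (hypothetical values).
counts = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])

print(iqr_bounds(counts))
print(iqr_outlier_pct(counts))
print(cap_outliers(counts).max())
```

With this rule, a column where more than three quarters of the values are identical has an IQR of zero, so every other value counts as an outlier. That may be why a count column such as `Parch` shows a high percentage, and in such cases investigation is usually a better first step than capping.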


## 5. Export Report

Save the analysis as a shareable HTML file.

In [6]:
profile.export_report('titanic_eda_report.html')
print("Report saved!")

Report saved!
