# Example EDA Workflow with the `jcds` library

Install the library directly from GitHub with `pip`. This can be done from the command line or from within a Jupyter Notebook cell.

### Install the `jcds` library directly from GitHub using `pip`

- From a command line:  
    ```bash
    pip install git+https://github.com/junclemente/jcds.git
    ```  
    
- From a notebook cell:  
    ```python
    !pip install git+https://github.com/junclemente/jcds.git
    ```  

In [1]:
# import standard libraries
import pandas as pd
import seaborn as sns

In [2]:
# import seaborn's titanic dataset
df = sns.load_dataset("titanic")
df.head()

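|  | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |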


## Import the `reports` module from `jcds`

In [3]:
from jcds import reports as jrep

# alternative: import jcds.reports as jrep

In [4]:
jrep.help()

Available functions in jcds:

  - data_cardinality
  - data_info
  - data_quality

Use jcds.help("function_name") to see its documentation.


In [5]:
jrep.help("data_info")


Help for 'data_info':

Summarize the dataset's shape, memory usage, duplicates, and variable types.

Parameters
----------
dataframe : pandas.DataFrame
    The input dataset.

show_columns : bool, optional
    Whether to display the list of columns in each category.

Returns
-------
None
    Prints summary information to the console.


### Check basic dataset information

In [6]:
jrep.data_info(df)


SHAPE:
There are 891 rows and 15 columns (0.31 MB).

DUPLICATES:
There are 107 duplicated rows.

COLUMNS/VARIABLES:
Column dType Summary:
 * object: 5
 * int: 4
 * float: 2
 * bool: 2
 * category: 2
There are 8 numerical (int/float/bool) variables.
There are 7 categorical (nominal/ordinal) variables.

DATETIME COLUMNS:
There are 0 datetime variables and 0 possible datetime variables.

OTHER COLUMN/VARIABLE INFO:
ID Like Columns (threshold = 95.0%): 0
Columns with mixed datatypes: 3
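
For orientation, the headline numbers in this report can be approximated with plain pandas. The sketch below is not how `jcds` computes them (its internals are not shown in this notebook); it uses a small made-up DataFrame, and the helper name `basic_info` is hypothetical.

```python
import pandas as pd

# Toy data (made up for illustration); row 2 repeats row 0.
toy = pd.DataFrame(
    {
        "age": [22.0, 38.0, 22.0, None],
        "sex": ["male", "female", "male", "female"],
        "alone": [False, True, False, True],
    }
)


def basic_info(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: shape, duplicate count, and dtype counts."""
    rows, cols = frame.shape
    return {
        "rows": rows,
        "cols": cols,
        # duplicated() marks every repeat after the first occurrence
        "duplicates": int(frame.duplicated().sum()),
        # number of columns per dtype name
        "dtypes": frame.dtypes.astype(str).value_counts().to_dict(),
    }


info = basic_info(toy)
print(info)
```

On the toy data this finds one duplicated row, because `duplicated()` counts only the repeat and not the first occurrence.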


#### Using `show_columns=True`
Passing this argument includes the list of columns for each category in the output.

In [7]:
jrep.data_info(df, show_columns=True)


SHAPE:
There are 891 rows and 15 columns (0.31 MB).

DUPLICATES:
There are 107 duplicated rows.

COLUMNS/VARIABLES:
Column dType Summary:
 * object: 5
 * int: 4
 * float: 2
 * bool: 2
 * category: 2
There are 8 numerical (int/float/bool) variables.
 * Columns: ['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'adult_male', 'alone']
There are 7 categorical (nominal/ordinal) variables.
 * Columns: ['sex', 'embarked', 'class', 'who', 'deck', 'embark_town', 'alive']

DATETIME COLUMNS:
There are 0 datetime variables and 0 possible datetime variables.

OTHER COLUMN/VARIABLE INFO:
ID Like Columns (threshold = 95.0%): 0
Columns with mixed datatypes: 3
 * Columns: ['embarked', 'deck', 'embark_town']
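
The three columns flagged as mixed datatype are also the categorical columns with missing values. One plausible reading, which is an assumption and not confirmed by the `jcds` documentation shown here, is that a missing value stored as a float `NaN` sits alongside string values. A minimal sketch of that kind of check, with a hypothetical helper `mixed_type_columns` and made-up data:

```python
import pandas as pd

# Toy columns (made up): one has strings plus a float NaN, one has only strings.
toy = pd.DataFrame(
    {
        "embarked": pd.Series(["S", "C", float("nan")], dtype=object),
        "sex": pd.Series(["male", "female", "male"], dtype=object),
    }
)


def mixed_type_columns(frame: pd.DataFrame) -> list:
    """Hypothetical helper: columns whose values have more than one Python type."""
    flagged = []
    for name in frame.columns:
        # collect the Python type name of every value in the column
        type_names = {type(value).__name__ for value in frame[name].tolist()}
        if len(type_names) > 1:
            flagged.append(name)
    return flagged


mixed = mixed_type_columns(toy)
print(mixed)
```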


### Check the cardinality of the dataset

In [8]:
jrep.data_cardinality(df, show_columns=True)

CARDINALITY REPORT

Total columns analyzed: 15

[BINARY COLUMNS]
There are 5 binary columns.
 * Columns: ['survived', 'sex', 'adult_male', 'alive', 'alone']
There are 0 binary with nan.

[CONSTANT/NEAR CONSTANT COLUMNS]
There are 0 constant columns.
There are 0 near-constant columns with >= 95% of values being the same.

[LOW CARDINALITY CATEGORICAL COLUMNS]
 * There are 7 low cardinality columns with <= 10 unique values.
Columns:
 * sex: 2 unique values
 * embarked: 3 unique values
 * class: 3 unique values
 * who: 3 unique values
 * deck: 7 unique values
 * embark_town: 3 unique values
 * alive: 2 unique values

[HIGH CARDINALITY CATEGORICAL COLUMNS]
 * There are 0 high cardinality variables with >=90% unique values.
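
The categories in this report are all driven by per-column unique counts. The sketch below shows the general idea with plain pandas on made-up data; the thresholds are chosen for the toy example (the report above uses `<= 10` unique values and `>= 90%` unique), and none of this is taken from the `jcds` source.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame(
    {
        "sex": ["male", "female", "male", "female"],
        "deck": ["C", None, "E", "G"],
        "ticket": ["A1", "B2", "C3", "D4"],
    }
)

# nunique() ignores missing values by default
unique_counts = toy.nunique()

# exactly two distinct values
binary_cols = unique_counts[unique_counts == 2].index.tolist()

# at most 3 distinct values (toy threshold)
low_card_cols = unique_counts[unique_counts <= 3].index.tolist()

# distinct values as a percentage of the row count
pct_unique = unique_counts / len(toy) * 100
high_card_cols = pct_unique[pct_unique >= 90].index.tolist()

print(binary_cols, low_card_cols, high_card_cols)
```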


### Check data quality

In [9]:
jrep.data_quality(df, show_columns=True)

DATA QUALITY REPORT

 * Total entries (rows * cols): 13365
 * Memory usage: 0.31 MB
 * Rows: 891
 * Columns: 15

MISSING DATA:
 * Total entries: 869 missing (6.5%)

ROWS:
----------
 * Rows missing any: 709
 * Rows missing all: 0

DUPLICATES: 107

COLUMNS:
----------------
Columns missing any: 4
	'deck': 688 missing (77.2%)
	'age': 177 missing (19.9%)
	'embarked': 2 missing (0.2%)
	'embark_town': 2 missing (0.2%)
Column list: ['deck', 'age', 'embarked', 'embark_town']

CONSTANT: 0

NEAR CONSTANT: 0
	(95% of values are the same)

MIXED DATATYPES: 3
	Column list: ['embarked', 'deck', 'embark_town']

HIGH CARDINALITY: 0
	(60% >= unique values)
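
The percentages in this report follow directly from the counts. As a quick arithmetic check, using only numbers copied from the output above:

```python
# Counts copied from the data quality report
rows, cols = 891, 15
total_missing = 869
missing_by_column = {"deck": 688, "age": 177, "embarked": 2, "embark_town": 2}

# total entries = rows * cols
total_entries = rows * cols

# overall share of missing entries, as a percentage
overall_pct = round(total_missing / total_entries * 100, 1)

# per-column share of missing rows, as a percentage
column_pct = {
    name: round(count / rows * 100, 1) for name, count in missing_by_column.items()
}

print(total_entries)  # 13365
print(overall_pct)    # 6.5
print(column_pct)     # {'deck': 77.2, 'age': 19.9, 'embarked': 0.2, 'embark_town': 0.2}
```

The per-column counts also sum to the reported total of 869 missing entries.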


## Import the `eda` module from `jcds`

In [10]:
from jcds import eda as jeda

In [18]:
jeda.show_dimensions(df)

(891, 15, 13365, np.float64(0.31))
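
Comparing this tuple with the data quality report above, the values appear to be rows, columns, total entries (rows * columns), and memory usage in MB. A rough pandas stand-in is sketched below; the function name `dimensions` is hypothetical, and the MB conversion (dividing bytes by 1024²) is an assumption about how the last value is derived.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})


def dimensions(frame: pd.DataFrame) -> tuple:
    """Hypothetical stand-in returning (rows, cols, entries, memory in MB)."""
    rows, cols = frame.shape
    # deep=True also measures the contents of object columns
    memory_mb = frame.memory_usage(deep=True).sum() / 1024**2
    return rows, cols, frame.size, round(float(memory_mb), 2)


dims = dimensions(toy)
print(dims[:3])  # (3, 2, 6)
```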

In [21]:
jeda.show_highcardvars(df, percent_unique=10, verbose=True)

Cateogrical variables with cardinality >= 10%


[]

In [23]:
jeda.show_lowcardvars(df, max_unique=5, verbose=True)

Categorical variables with cardinality <= 5


[('sex', 2),
 ('embarked', 3),
 ('class', 3),
 ('who', 3),
 ('embark_town', 3),
 ('alive', 2)]

In [24]:
jeda.help()

Available functions in jcds:

  - clean_column_names
  - convert_to_bool
  - convert_to_categorical
  - convert_to_datetime
  - convert_to_int
  - convert_to_numeric
  - convert_to_object
  - correlation_matrix
  - count_cols_with_all_na
  - count_cols_with_any_na
  - count_id_like_columns
  - count_rows_with_all_na
  - count_rows_with_any_na
  - count_total_na
  - count_unique_values
  - create_dt_col
  - create_dt_cols
  - delete_columns
  - describe_categorical
  - detect_outliers_iqr
  - display_all_col_head
  - dqr_cat
  - dqr_cont
  - get_cat_list
  - get_cont_list
  - get_dtype_summary
  - list_unique_values
  - long_report
  - plot_categorical
  - plot_correlation_heatmap
  - quick_report
  - rename_column
  - show_binary_list
  - show_catvar
  - show_constantvars
  - show_convar
  - show_datetime_columns
  - show_dimensions
  - show_dupes
  - show_highcardvars
  - show_lowcardvars
  - show_memory_use
  - show_missing_summary
  - show_mixed_type_columns
  - show_nearconstvars
 

In [25]:
categorical_var = jeda.show_catvar(df)
print(categorical_var)

['sex', 'embarked', 'class', 'who', 'deck', 'embark_town', 'alive']


In [26]:
jeda.list_unique_values(df, categorical_var)

>>> EXECUTING DataFrame["['sex', 'embarked', 'class', 'who', 'deck', 'embark_town', 'alive']"].unique().tolist()
Unique values in 'sex':
['male', 'female']
---
Unique values in 'embarked':
['S', 'C', 'Q', nan]
---
Unique values in 'class':
['Third', 'First', 'Second']
---
Unique values in 'who':
['man', 'woman', 'child']
---
Unique values in 'deck':
[nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F']
---
Unique values in 'embark_town':
['Southampton', 'Cherbourg', 'Queenstown', nan]
---
Unique values in 'alive':
['no', 'yes']
---


In [14]:
jeda.show_convar(df)

['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'adult_male', 'alone']

In [15]:
jeda.show_binary_list(df)

{'binary_columns': ['survived', 'sex', 'adult_male', 'alive', 'alone'],
 'binary_with_nan': []}

In [16]:
jeda.quick_report(df)

------------------------------------------------------------
`quick_report()` is deprecated and will be removed in v0.3.0. This will be replaced with a new function.
Quick Report - info(memory_usage='deep')
Total cols: 15
Rows missing all values: 0 (0.0%)
Total Rows: 891
Cols with missing values: 4 (26.67%)
Total missing values in dataset: 869




In [17]:
jeda.long_report(df)

------------------------------------------------------------
`long_report()` is deprecated and will be removed in v0.3.0. This will be replaced with a new function.
Quick Report - info(memory_usage='deep')
Total cols: 15
Rows missing all values: 0 (0.0%)
Total Rows: 891
Cols with missing values: 4 (26.67%)
Total missing values in dataset: 869
Categorical features: 7
- sex: 2 unique values
- embarked: 4 unique values
- class: 3 unique values
- who: 3 unique values
- deck: 8 unique values
- embark_town: 4 unique values
- alive: 2 unique values
Continuous features: 8
- survived: 2 unique values
- pclass: 3 unique values
- age: 89 unique values
- sibsp: 7 unique values
- parch: 7 unique values
- fare: 248 unique values
- adult_male: 2 unique values
- alone: 2 unique values


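
Note that `long_report` lists `deck` with 8 unique values and `embarked` with 4, while `data_cardinality` reported 7 and 3. The counts differ by one in exactly the columns that have missing values, which is consistent with `NaN` being counted in one report and excluded in the other. That is an inference from the outputs, not documented behavior. In pandas, the two conventions look like this (made-up data):

```python
import pandas as pd

# Toy column (made up): two distinct decks plus missing values
deck = pd.Series(["C", None, "E", "C", None], dtype=object)

# missing values are excluded by default
without_nan = deck.nunique()

# with dropna=False, missing values count as one extra distinct value
with_nan = deck.nunique(dropna=False)

print(without_nan, with_nan)  # 2 3
```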
