# Example EDA Workflow with the `jcds` library

Install the library directly from GitHub, either from the command line or from within a Jupyter Notebook cell.

### Install the `jcds` library directly from GitHub using `pip`

- From a command line:  
    ```bash
    pip install git+https://github.com/junclemente/jcds.git
    ```  
    
- From a notebook cell:  
    ```python
    !pip install git+https://github.com/junclemente/jcds.git
    ```  

In [1]:
# import standard libraries
import pandas as pd
import seaborn as sns

In [2]:
# import seaborn's titanic dataset
df = sns.load_dataset("titanic")
df.head()

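| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |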


## Import the `reports` module from `jcds`

In [3]:
from jcds import reports as jrep

# equivalent alternative: import jcds.reports as jrep

### Check basic dataset information

In [4]:
jrep.data_info(df)


SHAPE:
There are 891 rows and 15 columns (0.31 MB).

DUPLICATES:
There are 107 duplicated rows.

COLUMNS/VARIABLES:
Column dType Summary:
 * object: 5
 * int: 4
 * float: 2
 * bool: 2
 * category: 2
There are 8 numerical (int/float/bool) variables.
There are 7 categorical (nominal/ordinal) variables.

DATETIME COLUMNS:
There are 0 datetime variables and 0 possible datetime variables.

OTHER COLUMN/VARIABLE INFO:
ID Like Columns: 0
Columns with mixed datatypes: 3
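
The shape, duplicate count, memory size, and dtype summary reported above can be cross-checked with plain pandas. The sketch below is not how `jcds` computes its report (the source does not show that); it only uses standard pandas calls on a small, made-up `toy` frame so that it runs without downloading the Titanic data.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only).
toy = pd.DataFrame(
    {
        "survived": [0, 1, 1, 0, 0],
        "sex": ["male", "female", "female", "male", "male"],
        "fare": [7.25, 71.28, 7.92, 7.25, 7.25],
    }
)

n_rows, n_cols = toy.shape
# duplicated() flags every repeat of an earlier row (keep="first" by default).
n_dupes = int(toy.duplicated().sum())
# deep=True measures the real storage used by string columns.
size_mb = toy.memory_usage(deep=True).sum() / 1024**2
# Count how many columns share each dtype.
dtype_counts = toy.dtypes.astype(str).value_counts().to_dict()

print(f"{n_rows} rows, {n_cols} columns, {n_dupes} duplicated rows, {size_mb:.4f} MB")
print(dtype_counts)
```

Running the same calls on `df` should give numbers comparable to the report, although the exact memory figure depends on how it is measured.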


#### Using `show_columns=True`
Passing this argument includes the list of column names in the output.

In [5]:
jrep.data_info(df, show_columns=True)


SHAPE:
There are 891 rows and 15 columns (0.31 MB).

DUPLICATES:
There are 107 duplicated rows.

COLUMNS/VARIABLES:
Column dType Summary:
 * object: 5
 * int: 4
 * float: 2
 * bool: 2
 * category: 2
There are 8 numerical (int/float/bool) variables.
 * Columns: ['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'adult_male', 'alone']
There are 7 categorical (nominal/ordinal) variables.
 * Columns: ['sex', 'embarked', 'class', 'who', 'deck', 'embark_town', 'alive']

DATETIME COLUMNS:
There are 0 datetime variables and 0 possible datetime variables.

OTHER COLUMN/VARIABLE INFO:
ID Like Columns: 0
Columns with mixed datatypes: 3
 * Columns: ['embarked', 'deck', 'embark_town']
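
The three columns flagged as mixed are also columns that contain missing values. One plausible reading, which is an assumption and not something the source confirms, is that a column counts as "mixed" when its values have more than one Python type, for example strings alongside a float `NaN`. The helper `python_types` below is hypothetical and only illustrates that idea on a made-up frame.

```python
import numpy as np
import pandas as pd

# Object columns: one with a missing value, one without (illustrative data).
toy = pd.DataFrame(
    {
        "embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
        "sex": pd.Series(["male", "female", "female", "male"], dtype=object),
    }
)


def python_types(series):
    """Return the set of Python type names found among a column's values."""
    return {type(value).__name__ for value in series}


# A column is treated as mixed here if its values have more than one type.
mixed = [col for col in toy.columns if len(python_types(toy[col])) > 1]
print(mixed)
```

If this reading is right, filling or dropping the missing values would remove a column from the mixed list.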


### Check the cardinality of the dataset

In [6]:
jrep.data_card2(df, show_columns=True)

CARDINALITY REPORT

-- BINARY COLUMNS --
There are 5 binary columns.
 • Columns: ['survived', 'sex', 'adult_male', 'alive', 'alone']
There are 0 binary with nan.

-- CONSTANT VARIABLES --
Showing constant columns (only one unique value)
There are 0 constant columns.

-- LOW CARDINALITY CATEGORICAL --
Showing cat var of cardinality <= 10
Found 7 categorical variables with ≤ 10 unique values.
 • sex: 2 unique values
 • embarked: 3 unique values
 • class: 3 unique values
 • who: 3 unique values
 • deck: 7 unique values
 • embark_town: 3 unique values
 • alive: 2 unique values

-- HIGH CARDINALITY CATEGORICAL --
Showing cat var of cardinality >= 90%
Found 0 categorical variables with ≥ 90% unique values.
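
The categories in this report (binary, constant, low cardinality, high cardinality) can be approximated with `DataFrame.nunique()`. This is a rough sketch on made-up data, not the library's implementation; the names `max_unique` and `min_unique_ratio` are hypothetical stand-ins for the "≤ 10" and "≥ 90%" thresholds printed above, and `max_unique` is lowered to 3 so the tiny example shows a difference.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "alive": ["no", "yes", "yes", "no", "no", "yes"],
        "port": ["S", "C", "Q", "S", "S", "C"],
        "ticket": ["A1", "B2", "C3", "D4", "E5", "F6"],
        "ship": ["Titanic"] * 6,
    }
)

max_unique = 3          # hypothetical low-cardinality cutoff
min_unique_ratio = 0.9  # hypothetical high-cardinality cutoff

# nunique() ignores missing values by default.
unique_counts = toy.nunique()

binary = unique_counts[unique_counts == 2].index.tolist()
constant = unique_counts[unique_counts == 1].index.tolist()
low_card = unique_counts[unique_counts <= max_unique].index.tolist()
# Share of rows that hold a distinct value, per column.
high_card = unique_counts[unique_counts / len(toy) >= min_unique_ratio].index.tolist()

print(binary, constant, low_card, high_card)
```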


## Import the `eda` module from `jcds`

In [9]:
from jcds import eda as jeda

In [10]:
jeda.show_shape(df)

(891, 15)

In [11]:
jeda.show_dupes(df)

np.int64(107)

In [12]:
jeda.show_catvar(df)

['sex', 'embarked', 'class', 'who', 'deck', 'embark_town', 'alive']

In [13]:
jeda.show_convar(df)

['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare', 'adult_male', 'alone']

In [14]:
jeda.show_binary_list(df)

{'binary_columns': ['survived', 'sex', 'adult_male', 'alive', 'alone'],
 'binary_with_nan': []}

In [15]:
jeda.quick_report(df)

------------------------------------------------------------
`quick_report()` is deprecated and will be removed in v0.3.0. This will be replaced with a new function.
Quick Report - info(memory_usage='deep')
Total cols: 15
Rows missing all values: 0 (0.0%)
Total Rows: 891
Cols with missing values: 4 (26.67%)
Total missing values in dataset: 869
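
The missing-value figures in this report can be reproduced with `isna()`. For example, 4 of 15 columns with missing values gives the 26.67% shown above. The following is a minimal pandas sketch on a made-up frame, not the code behind `quick_report()`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 26.0, np.nan],
        "deck": [np.nan, "C", np.nan, "C"],
        "fare": [7.25, 71.28, 7.92, 53.10],
    }
)

# Number of missing cells in each column.
missing_per_col = toy.isna().sum()

cols_with_missing = int((missing_per_col > 0).sum())
pct_cols_with_missing = round(cols_with_missing / toy.shape[1] * 100, 2)
total_missing = int(missing_per_col.sum())
# Rows in which every single column is missing.
rows_all_missing = int(toy.isna().all(axis=1).sum())

print(f"Cols with missing values: {cols_with_missing} ({pct_cols_with_missing}%)")
print(f"Total missing values in dataset: {total_missing}")
print(f"Rows missing all values: {rows_all_missing}")
```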




In [16]:
jeda.long_report(df)

------------------------------------------------------------
`long_report()` is deprecated and will be removed in v0.3.0. This will be replaced with a new function.
Quick Report - info(memory_usage='deep')
Total cols: 15
Rows missing all values: 0 (0.0%)
Total Rows: 891
Cols with missing values: 4 (26.67%)
Total missing values in dataset: 869
Categorical features: 7
- sex: 2 unique values
- embarked: 4 unique values
- class: 3 unique values
- who: 3 unique values
- deck: 8 unique values
- embark_town: 4 unique values
- alive: 2 unique values
Continuous features: 8
- survived: 2 unique values
- pclass: 3 unique values
- age: 89 unique values
- sibsp: 7 unique values
- parch: 7 unique values
- fare: 248 unique values
- adult_male: 2 unique values
- alone: 2 unique values
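
Note that `embarked`, `deck`, and `embark_town` each show one more unique value here than in the cardinality report (4, 8, and 4 versus 3, 7, and 3). A likely explanation, offered as an assumption since the source does not say, is that this report counts missing values as an extra category. pandas exposes the same distinction through the `dropna` argument of `nunique()`, shown here on a made-up series:

```python
import numpy as np
import pandas as pd

deck = pd.Series(["C", np.nan, "E", "C", np.nan], name="deck")

# Default: missing values are not counted as a distinct value.
without_nan = deck.nunique()
# dropna=False: NaN is counted once as its own value.
with_nan = deck.nunique(dropna=False)

print(without_nan, with_nan)
```

When comparing cardinality figures from two tools, it helps to check which convention each one follows.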


