## Manage Data

This section covers tools for versioning, comparing, exploring, logging, and transferring your data.

### DVC: A Data Version Control Tool for Your Data Science Projects

In [None]:
!pip install dvc

Git is great for managing code versions, but what about data? DVC solves this problem by allowing you to track data versions in Git while storing the actual data separately. Think of it as Git for data. 

Here's some example code for using DVC.

```bash
# Initialize
$ dvc init

# Track data directory
$ dvc add data # Create data.dvc
$ git add data.dvc
$ git commit -m "add data"

# Store the data remotely
$ dvc remote add -d remote gdrive://lynNBbT-4J0ida0eKYQqZZbC93juUUUbVH

# Push the data to remote storage
$ dvc push 

# Get the data
$ dvc pull 

# Switch between different versions
$ git checkout HEAD^1 data.dvc
$ dvc checkout
```
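To build intuition for how data can be versioned in Git while the bytes live elsewhere, here is a toy sketch in plain Python. It is not DVC's implementation: the `track` and `restore` helpers, the `.pointer.json` file name, and the SHA-256 cache layout are all assumptions made up for illustration. The idea it shows is that only a small pointer file (holding a content hash) would need to be committed, and an older version of the data can be restored from a cache by bringing back the older pointer.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file.

    Hypothetical helper for illustration only; DVC's real file format differs.
    """
    content = data_file.read_bytes()
    digest = hashlib.sha256(content).hexdigest()

    # Store the bytes under a path derived from their hash.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(content)

    # The pointer file is tiny, so it is the part you would commit to Git.
    pointer = data_file.parent / (data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"hash": digest, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> None:
    """Rewrite the data file from the cache entry named in the pointer file."""
    meta = json.loads(pointer.read_text())
    cached = cache_dir / meta["hash"][:2] / meta["hash"][2:]
    (pointer.parent / meta["path"]).write_bytes(cached.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    workspace, cache = Path(tmp) / "project", Path(tmp) / "cache"
    workspace.mkdir()
    data = workspace / "data.csv"

    # Version 1 of the data
    data.write_text("a,b\n1,2\n")
    pointer = track(data, cache)
    v1_pointer_text = pointer.read_text()  # stands in for an older Git commit

    # Version 2 of the data overwrites both the file and the pointer
    data.write_text("a,b\n1,2\n3,4\n")
    track(data, cache)

    # Go back to version 1: bring back the old pointer, then restore the data
    pointer.write_text(v1_pointer_text)
    restore(pointer, cache)
    restored_text = data.read_text()

print(restored_text)
```

The last two steps loosely mirror `git checkout HEAD^1 data.dvc` followed by `dvc checkout` in the commands above.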

[Link to DVC](https://dvc.org/)

Find step-by-step instructions on how to use DVC in [my article](https://towardsdatascience.com/introduction-to-dvc-data-version-control-tool-for-machine-learning-projects-7cb49c229fe0?sk=842f755cdf21a5db60aada1168c55447).

### sweetviz: Compare Similar Features Between Two Datasets

In [None]:
!pip install sweetviz 

Comparing the same features across two datasets, such as the training and test sets, helps you check whether they follow similar distributions. sweetviz makes this easy by generating an HTML report with side-by-side graphs for the two datasets.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import sweetviz as sv

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

report = sv.compare([X_train, "train data"], [X_test, "test data"])
report.show_html()

![image](../img/sweetviz_output.png)
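If you only need a quick numeric check, the same idea can be sketched without sweetviz. The example below uses the standard library only and a made-up numeric feature (the values, the `summarize` helper, and the 112/38 split are assumptions for illustration), then prints summary statistics for both splits side by side.

```python
import random
import statistics

random.seed(0)

# Hypothetical feature values standing in for one numeric column of a dataset.
values = [random.gauss(5.8, 0.8) for _ in range(150)]
random.shuffle(values)
train, test = values[:112], values[112:]  # roughly a 75/25 split


def summarize(sample):
    """Return basic summary statistics for a list of numbers."""
    return {
        "count": len(sample),
        "mean": statistics.mean(sample),
        "stdev": statistics.stdev(sample),
        "min": min(sample),
        "max": max(sample),
    }


train_stats, test_stats = summarize(train), summarize(test)

# Print the two summaries next to each other for a quick visual comparison.
for name in train_stats:
    print(f"{name:>5}  train={train_stats[name]:8.3f}  test={test_stats[name]:8.3f}")
```

Large gaps between the two columns would suggest the split is not representative; a sweetviz report makes the same kind of comparison visually, one feature at a time.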

[Link to sweetviz](https://github.com/fbdesignpro/sweetviz)

### quadratic: Data Science Spreadsheet with Python and SQL

If you want to use Python or SQL directly in a spreadsheet, use quadratic.

![](../img/quadratic.gif)

[Link to quadratic](https://github.com/quadratichq/quadratic).


### whylogs: Data Logging Made Easy

In [None]:
!pip install whylogs

Logging the summary statistics of a dataset is valuable for monitoring data changes and ensuring data quality. With whylogs, you can easily log your data in just a few lines of code.

In [1]:
import pandas as pd
import whylogs as why

data = {
    "Fruit": ["Apple", "Banana", "Orange"],
    "Color": ["Red", "Yellow", "Orange",],
    "Quantity": [5, 8, 3],
}

df = pd.DataFrame(data)

# Log the DataFrame using whylogs and create a profile
profile = why.log(df).profile()

# View the profile and convert it to a pandas DataFrame
prof_view = profile.view()
prof_df = prof_view.to_pandas()
prof_df

| column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/inf | counts/n | counts/nan | counts/null | distribution/max | distribution/mean | distribution/median | ... | frequent_items/frequent_strings | type | types/boolean | types/fractional | types/integral | types/object | types/string | types/tensor | ints/max | ints/min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Color | 3.0 | 3.0 | 3.00015 | 0 | 3 | 0 | 0 |  | 0.0 |  | ... | [FrequentItem(value='Yellow', est=1, upper=1, ... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 3 | 0 |  |  |
| Fruit | 3.0 | 3.0 | 3.00015 | 0 | 3 | 0 | 0 |  | 0.0 |  | ... | [FrequentItem(value='Orange', est=1, upper=1, ... | SummaryType.COLUMN | 0 | 0 | 0 | 0 | 3 | 0 |  |  |
| Quantity | 3.0 | 3.0 | 3.00015 | 0 | 3 | 0 | 0 | 8.0 | 5.333333 | 5.0 | ... | [FrequentItem(value='8', est=1, upper=1, lower... | SummaryType.COLUMN | 0 | 0 | 3 | 0 | 0 | 0 | 8.0 | 3.0 |


In [3]:
prof_df.iloc[:, :5]

| column | cardinality/est | cardinality/lower_1 | cardinality/upper_1 | counts/inf | counts/n |
|---|---|---|---|---|---|
| Color | 3.0 | 3.0 | 3.00015 | 0 | 3 |
| Fruit | 3.0 | 3.0 | 3.00015 | 0 | 3 |
| Quantity | 3.0 | 3.0 | 3.00015 | 0 | 3 |


In [16]:
prof_df.columns

Index(['cardinality/est', 'cardinality/lower_1', 'cardinality/upper_1',
       'counts/inf', 'counts/n', 'counts/nan', 'counts/null',
       'distribution/max', 'distribution/mean', 'distribution/median',
       'distribution/min', 'distribution/n', 'distribution/q_01',
       'distribution/q_05', 'distribution/q_10', 'distribution/q_25',
       'distribution/q_75', 'distribution/q_90', 'distribution/q_95',
       'distribution/q_99', 'distribution/stddev',
       'frequent_items/frequent_strings', 'type', 'types/boolean',
       'types/fractional', 'types/integral', 'types/object', 'types/string',
       'types/tensor', 'ints/max', 'ints/min'],
      dtype='object')
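To see why logged summary statistics help with monitoring, here is a rough sketch in plain Python that profiles two batches of data and prints the statistics that changed. It is not whylogs code: `profile_column` and `profile_table` are hypothetical helpers, the stat names are borrowed from the columns above for readability, and `cardinality` here is an exact count rather than an estimate.

```python
def profile_column(values):
    """Return a few summary statistics for one column (a list of values)."""
    non_null = [v for v in values if v is not None]
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    stats = {
        "counts/n": len(values),
        "counts/null": len(values) - len(non_null),
        "cardinality": len(set(non_null)),  # exact distinct count in this sketch
    }
    if numeric:
        stats["distribution/min"] = min(numeric)
        stats["distribution/max"] = max(numeric)
        stats["distribution/mean"] = sum(numeric) / len(numeric)
    return stats


def profile_table(table):
    """Profile every column of a dict-of-lists table."""
    return {name: profile_column(column) for name, column in table.items()}


# Two made-up batches: the second has a missing fruit and a suspicious quantity.
monday = {"Fruit": ["Apple", "Banana", "Orange"], "Quantity": [5, 8, 3]}
tuesday = {"Fruit": ["Apple", None, "Orange"], "Quantity": [5, 80, 3]}

before, after = profile_table(monday), profile_table(tuesday)

# Report every statistic that differs between the two profiles.
for column in before:
    for stat, old in before[column].items():
        new = after[column].get(stat)
        if new != old:
            print(f"{column} {stat}: {old} -> {new}")
```

Comparing the small profiles instead of the raw rows is what makes this kind of check cheap to run on every new batch.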

[Link to whylogs](https://github.com/whylabs/whylogs).

### Fluke: The Easiest Way to Move Data Around

Data scientists often need to transfer data between locations, such as from a remote server to cloud storage. However, many Python libraries require a lot of boilerplate code to handle HTTP/SSH connections and iterate over directories.

This is cumbersome when all you want is to move some files. Fluke offers a simple API for interacting with remote data in a few lines of code.

```python
from fluke.auth import RemoteAuth, AWSAuth

# This object will be used to authenticate
# with the remote machine.
rmt_auth = RemoteAuth.from_password(
    hostname="host",
    username="user",
    password="password")

# This object will be used to authenticate
# with AWS.
aws_auth = AWSAuth(
    aws_access_key_id="aws_access_key",
    aws_secret_access_key="aws_secret_key")
```

```python
from fluke.storage import RemoteDir, AWSS3Dir

with (
    RemoteDir(auth=rmt_auth, path='/home/user/dir') as rmt_dir,
    AWSS3Dir(auth=aws_auth, bucket="bucket", path='dir', create_if_missing=True) as aws_dir
):
    rmt_dir.transfer_to(dst=aws_dir, recursively=True)
```
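For a sense of the boilerplate that a single recursive transfer call saves, here is a sketch of just the directory-walking part, written by hand for two local directories with the standard library. It does not use Fluke, and `copy_tree` is a hypothetical helper; a real server-to-cloud transfer would also need connection handling, authentication, and error handling on top of this.

```python
import shutil
import tempfile
from pathlib import Path


def copy_tree(src: Path, dst: Path) -> int:
    """Copy every file under src into dst, keeping relative paths.

    Returns the number of files copied.
    """
    copied = 0
    for path in src.rglob("*"):
        if path.is_file():
            target = dst / path.relative_to(src)
            # Recreate the nested directory structure on the destination side.
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied += 1
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "src", Path(tmp) / "dst"
    (src / "nested").mkdir(parents=True)
    (src / "a.txt").write_text("a")
    (src / "nested" / "b.txt").write_text("b")

    n_copied = copy_tree(src, dst)
    copied_files = sorted(
        p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file()
    )

print(n_copied, copied_files)
```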

[Link to Fluke](https://github.com/manoss96/fluke).