# Review: AutoClean library

It looks like an ideal option for quick data cleaning during data analysis. However, **I miss the ability to include it in the pipeline of an ML project, that is, a fit / transform interface.**

This library is therefore **only useful for**:
- the **data understanding** phase.
- the **ML algorithm development** phase, for quick experiments where you can focus on modeling rather than on data cleaning.
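
To make the missing piece concrete, here is a minimal sketch of the fit / transform contract that scikit-learn style pipelines rely on. The class `MedianModeImputer` and the toy frames are hypothetical and are not part of AutoClean; the only point is that the fill values are learned once on the training data and then reused unchanged on new data.

```python
import pandas as pd


class MedianModeImputer:
    """Hypothetical cleaner with a fit / transform interface (not an AutoClean API)."""

    def fit(self, df):
        # Learn one fill value per column from the training data only.
        self.fill_values_ = {}
        for col in df.columns:
            observed = df[col].dropna()
            if observed.empty:
                continue  # nothing to learn from an all-missing column
            if pd.api.types.is_numeric_dtype(observed):
                self.fill_values_[col] = observed.median()
            else:
                self.fill_values_[col] = observed.mode().iloc[0]
        return self

    def transform(self, df):
        # Reuse the learned values; nothing is re-estimated on the new data.
        return df.fillna(self.fill_values_)


train = pd.DataFrame({
    "fare": [5.0, 7.0, None, 9.0],
    "payment": ["cash", "credit card", "credit card", None],
})
new = pd.DataFrame({
    "fare": [None, 12.0],
    "payment": [None, "cash"],
})

imputer = MedianModeImputer().fit(train)
cleaned = imputer.transform(new)
print(cleaned)
```

Calling `AutoClean(df)` on the training set and again on the scoring set would instead clean each frame from scratch, which is why it fits exploration better than a production pipeline.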

### References

- [PyPi: py-AutoClean 1.1.3](https://pypi.org/project/py-AutoClean/)
- [GitHub: AutoClean](https://github.com/elisemercury/AutoClean)


![image](parameters.png "AutoClean arguments")

In [30]:
import seaborn as sns
from AutoClean import AutoClean

### load data

In [31]:
df = sns.load_dataset("taxis")
print(df.shape)
df.head()

(6433, 14)


| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.6 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.0 | 0.0 | 9.3 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.7 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.1 | 0.0 | 13.4 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |


In [32]:
df.isnull().sum()

pickup              0
dropoff             0
passengers          0
distance            0
fare                0
tip                 0
tolls               0
total               0
color               0
payment            44
pickup_zone        26
dropoff_zone       45
pickup_borough     26
dropoff_borough    45
dtype: int64
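
As a quick sanity check, the per-column counts above can be totalled and expressed as a share of the 6433 rows. This is a small sketch that retypes the counts from the output by hand rather than reloading the dataset; the variable names are illustrative only.

```python
import pandas as pd

# Non-zero missing counts copied from the df.isnull().sum() output above.
missing = pd.Series({
    "payment": 44,
    "pickup_zone": 26,
    "dropoff_zone": 45,
    "pickup_borough": 26,
    "dropoff_borough": 45,
})
n_rows = 6433  # from df.shape

total_missing = int(missing.sum())
share_pct = (missing / n_rows * 100).round(2)

print(total_missing)  # 186
print(share_pct)
```

The total of 186 matches the figure AutoClean reports later in its log and in `count_missing`, and no single column is missing more than about 0.7% of its values.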

# Data Cleaning

### Pipeline Attributes
```
 'count_missing',
 'duplicates',
 'encode_categ',
 'extract_datetime',
 'missing_categ',
 'missing_num',
 'mode',
 'outlier_param',
 'outliers',
 'output'
```

### auto (full)

In [33]:
pipeline = AutoClean(df)

AutoClean process completed in 3.054068 seconds
Logfile saved to: /Users/juan/Workspace/projects/utilsDS/notebooks/data_cleaning/autoclean.log


In [34]:
pipeline.output.isnull().sum()

pickup                           0
dropoff                          0
passengers                       0
distance                         0
fare                             0
tip                              0
tolls                            0
total                            0
color                            0
payment                          0
pickup_zone                      0
dropoff_zone                     0
pickup_borough                   0
dropoff_borough                  0
Day                              0
Month                            0
Year                             0
Hour                             0
Minute                           0
Sec                              0
payment_cash                     0
payment_credit card              0
color_green                      0
color_yellow                     0
dropoff_borough_Bronx            0
dropoff_borough_Brooklyn         0
dropoff_borough_Manhattan        0
dropoff_borough_Queens           0
dropoff_borough_Stat

### manual settings

In [35]:
ppm = AutoClean(df, mode='manual', duplicates='auto', missing_num=False, missing_categ='knn',
                encode_categ=False, extract_datetime=False, outliers='delete', outlier_param=1.5,
                logfile=True, verbose=True)

04-11-2023 11:56:21.02 - INFO - Started validation of input parameters...
04-11-2023 11:56:21.02 - INFO - Completed validation of input parameters
04-11-2023 11:56:21.03 - INFO - Started handling of duplicates... Method: "AUTO"
04-11-2023 11:56:21.04 - DEBUG - 0 missing values found
04-11-2023 11:56:21.04 - INFO - Completed handling of duplicates in 0.013371 seconds
04-11-2023 11:56:21.05 - INFO - Started handling of missing values...
04-11-2023 11:56:21.05 - INFO - Found a total of 186 missing value(s)
04-11-2023 11:56:21.05 - INFO - Started handling of CATEGORICAL missing values... Method: "KNN"
04-11-2023 11:56:21.08 - DEBUG - KNN imputation of 44 value(s) succeeded for feature "payment"
04-11-2023 11:56:21.10 - DEBUG - KNN imputation of 26 value(s) succeeded for feature "pickup_zone"
04-11-2023 11:56:21.12 - DEBUG - KNN imputation of 45 value(s) succeeded for feature "dropoff_zone"
04-11-2023 11:56:21.14 - DEBUG - KNN imputation of 26 value(s) succeeded for feature "pickup_borough"

Logfile saved to: /Users/juan/Workspace/projects/utilsDS/notebooks/data_cleaning/autoclean.log


In [40]:
ppm.output.shape, ppm.count_missing

((4981, 14), 186)
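
The row count drops from 6433 to 4981, most likely because of `outliers='delete'` (and possibly duplicate removal). Assuming AutoClean applies the common interquartile-range rule with `outlier_param` as the multiplier, the effect can be sketched in plain pandas. The helper `drop_iqr_outliers` and the toy data are hypothetical and are not the library's implementation.

```python
import pandas as pd


def drop_iqr_outliers(df, columns, k=1.5):
    """Drop rows with a value outside [Q1 - k*IQR, Q3 + k*IQR] in any listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # between() is inclusive; rows with NaN in the column are also dropped.
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


toy = pd.DataFrame({
    "fare": [5.0, 6.0, 7.0, 8.0, 9.0, 150.0],
    "tip": [1.0, 1.5, 2.0, 1.0, 1.5, 2.0],
})
trimmed = drop_iqr_outliers(toy, ["fare", "tip"], k=1.5)
print(toy.shape, trimmed.shape)  # (6, 2) (5, 2)
```

Because a row is removed when any one numeric column is out of range, deletion can shrink a dataset quickly when several skewed columns are checked at once, which is worth keeping in mind before using this setting on data meant for training.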

In [37]:
ppm.output.isnull().sum()

pickup             0
dropoff            0
passengers         0
distance           0
fare               0
tip                0
tolls              0
total              0
color              0
payment            0
pickup_zone        0
dropoff_zone       0
pickup_borough     0
dropoff_borough    0
dtype: int64