## A common pipeline to clean dataframes


0. Import modules
1. Normalize column names
2. Eval misspelling in object non-date columns
3. Eval & Drop duplicates without index_id `mlg.DF_wo_colX(df, colnames)`
4. Check NAs `mlg.na_absperc(df)`
    - By Columns
    - By Rows
5. Eval 'unique/freq ratio' `mlg.categ_summ(df)`
6. Eval Data types
7. Save the cleaned dataframe

## Summary of cleaning `actor.csv`


**Original DF** had 200 rows and 4 columns `['actor_id', 'first_name', 'last_name', 'last_update']`

* 1. No column name to normalize
* 2. No misspelling values
* 3. There is one hidden duplicate row (identical to another row once `actor_id` is ignored) - **NOTE: I did not renumber the actor_ids after dropping it!**
* 4. Had no NA's
* 5. Nothing striking from unique/freq ratio analysis
* 6. No data type to transform (`last_update` could be converted to a date type, but I prefer to do it in MySQL)
* 7. Cleaned DF saved as `clean/actor1.csv`

**Cleaned DF** has 199 rows and 4 columns `['actor_id', 'first_name', 'last_name', 'last_update']`


### 0. Import modules

In [46]:
# Import modules etc

import pandas as pd
import numpy as np
import re

np.random.seed(42)
pd.set_option('display.max_columns', None) # show all the columns

# print the plot in the jupyter output
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore') # ignore warnings

import matplotlib.pyplot as plt   # pyplot is the recommended interface; pylab is discouraged
import seaborn as sns
import fuzzywuzzy as fzw

# Import my module
from src import dataanalysis_fun1 as mlg 

In [47]:
# Reload my module if necessary

import importlib
from src import dataanalysis_fun1 as mlg # Import the module
#importlib.reload(mlg)  # Reload the module

# Suppress warning when reloading the module
with warnings.catch_warnings():
    warnings.simplefilter("ignore") 
    importlib.reload(mlg)  # Reload the module

In [51]:
DF_raw = pd.read_csv('../../data/raw/actor.csv')

### 1. Normalize column names

In [52]:
display(DF_raw.columns)
setA = set(DF_raw.columns)
DF_raw=mlg.colnnam_clean(DF_raw)
setB = set(DF_raw.columns)

print(f'Normalized column names: {len(setB.difference(setA))}')

Index(['actor_id', 'first_name', 'last_name', 'last_update'], dtype='object')

Normalized column names: 0
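The source of `mlg.colnnam_clean` is not shown in this notebook, so the following is only a minimal sketch of what a column-name normalizer of this kind might do. The function name `normalize_columns`, the cleaning rules (lower case, non-alphanumeric runs replaced by `_`) and the demo column names are all assumptions, not the author's implementation.

```python
import re
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for mlg.colnnam_clean: return a copy with snake_case column names."""
    def clean(name) -> str:
        name = str(name).strip().lower()
        name = re.sub(r"[^0-9a-z]+", "_", name)  # collapse runs of other characters into "_"
        return name.strip("_")
    return df.rename(columns=clean)

# Made-up column names, only to show the before/after comparison used in the cell above
demo = pd.DataFrame(columns=[" Actor ID", "First-Name", "last_update"])
normalized = normalize_columns(demo)
changed = set(normalized.columns) - set(demo.columns)
print(list(normalized.columns))
print(f"Normalized column names: {len(changed)}")
```

As in the cell above, the set difference counts only the names that actually changed, so a file such as `actor.csv` with already clean names reports 0.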


### 2. Eval misspelling in object non-date columns

In [53]:
for i in DF_raw.columns: 
    if (DF_raw[i].dtype == 'object') and ("date" not in i):
        print(i.upper(), ":")
        # Remove leading and trailing spaces (.str.strip() leaves missing values untouched)
        DF_raw[i] = DF_raw[i].str.strip()
        print(DF_raw[[i]].value_counts())
        print("\n")
    else:
        print(i.upper(), " ---> Non object-Column or explicit date reference")

ACTOR_ID  ---> Non object-Column or explicit date reference
FIRST_NAME :
first_name
PENELOPE      4
KENNETH       4
JULIA         4
NICK          3
DAN           3
             ..
JAMES         1
JADA          1
IAN           1
HENRY         1
ZERO          1
Length: 128, dtype: int64


LAST_NAME :
last_name
KILMER       5
TEMPLE       4
NOLTE        4
AKROYD       3
ALLEN        3
            ..
HUNT         1
HUDSON       1
HOPE         1
BIRCH        1
HURT         1
Length: 121, dtype: int64


LAST_UPDATE  ---> Non object-Column or explicit date reference
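Scanning `value_counts()` by eye works for a couple of hundred rows, but near-duplicate spellings can also be flagged automatically. The sketch below uses the standard library's `difflib` rather than `fuzzywuzzy`, so it runs without extra dependencies. The helper name `similar_pairs`, the 0.85 threshold and the misspelled value `KILLMER` are made up for illustration and do not come from `actor.csv`.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(values, threshold=0.85):
    """Return (a, b, ratio) for distinct values whose similarity ratio is >= threshold."""
    unique = sorted(set(values))
    pairs = []
    for a, b in combinations(unique, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs

# Illustrative values only: "KILLMER" is an invented typo
names = ["KILMER", "KILMER", "KILLMER", "TEMPLE", "NOLTE"]
suspects = similar_pairs(names)
print(suspects)
```

Comparing all pairs is quadratic in the number of unique values, which is acceptable for the roughly 120 to 130 unique names seen here but would need a smarter approach for large columns.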


### 3. Eval & drop duplicates without index_id

In [54]:
listtest=["actor_id"]
DF_raw_wo1=mlg.DF_wo_colX(DF_raw, listtest)

print("\n")

DF1=DF_raw.copy()
if DF_raw_wo1.duplicated().any(): # there are duplicates!
    print("DUPLICATED?", True)
    # Boolean mask: True for rows that repeat an earlier row once actor_id is ignored
    dup_ind = DF_raw_wo1.duplicated()

    # Show the duplicate rows
    display(DF1[dup_ind])

    # Keep only the non-duplicated rows (a boolean mask works with any index)
    DF1 = DF1.loc[~dup_ind]

    # Reset the index to have continuous index values
    DF1 = DF1.reset_index(drop=True)
else:
    print("NO HIDDEN DUPLICATES")


display(DF_raw.shape)
display(DF1.shape)




DUPLICATED? True


|     | actor_id | first_name | last_name | last_update         |
|-----|----------|------------|-----------|---------------------|
| 109 | 110      | SUSAN      | DAVIS     | 2006-02-15 04:34:33 |


(200, 4)

(199, 4)
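The same "ignore the id column, then look for duplicates" check can be written with plain pandas through the `subset` argument of `DataFrame.duplicated`. This is a sketch on toy data, not a replacement for `mlg.DF_wo_colX`, whose source is not shown here. The toy rows are invented for illustration.

```python
import pandas as pd

# Toy data: rows 0 and 2 are identical except for actor_id
toy = pd.DataFrame({
    "actor_id": [1, 2, 3],
    "first_name": ["SUSAN", "NICK", "SUSAN"],
    "last_name": ["DAVIS", "WAHLBERG", "DAVIS"],
})

# Compare every column except the id
content_cols = [c for c in toy.columns if c != "actor_id"]
dup_mask = toy.duplicated(subset=content_cols, keep="first")

# Keep the first occurrence of each row and renumber the index, not the ids
deduped = toy.loc[~dup_mask].reset_index(drop=True)
print(toy.shape, deduped.shape)
```

With `keep="first"` the surviving row keeps its original `actor_id`, which matches the note in the summary that the ids were not renumbered.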

### 4. Check NAs

- By column
- By row

In [55]:
display(mlg.na_absperc(DF1)) # by cols
display(mlg.na_absperc(DF1.T)) # by rows

|     | abs_NA | perc_NA |
|-----|--------|---------|


|     | abs_NA | perc_NA |
|-----|--------|---------|
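Both outputs are empty, which suggests that `mlg.na_absperc` only lists columns that contain at least one NA. Under that assumption, a minimal equivalent could look like the sketch below. The name `na_summary` and the toy data are hypothetical, and the real helper may differ.

```python
import numpy as np
import pandas as pd

def na_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for mlg.na_absperc: NA count and percentage per column."""
    abs_na = df.isna().sum()
    out = pd.DataFrame({"abs_NA": abs_na, "perc_NA": abs_na / len(df) * 100})
    return out[out["abs_NA"] > 0]  # keep only columns that actually have NAs

# Toy data with a single missing value
toy = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": ["x", "y", "z", "w"]})
by_col = na_summary(toy)    # one line per column with NAs
by_row = na_summary(toy.T)  # transposing turns rows into columns, as in the cell above
print(by_col)
print(by_row)
```

Transposing is a convenient trick for small frames; for large ones, `df.isna().sum(axis=1)` gives the per-row counts without building the transposed copy.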


### 5. Eval unique/freq ratio

In [56]:
display(mlg.categ_summ(DF1).sort_values("resto_per", ascending =True))

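|             | count | unique | top                 | freq | unicount_ratio | resto_abs | resto_per |
|-------------|-------|--------|---------------------|------|----------------|-----------|-----------|
| last_update | 199   | 1      | 2006-02-15 04:34:33 | 199  | 0.005025       | 0         | 0.0       |
| last_name   | 199   | 121    | KILMER              | 5    | 0.60804        | 194       | 97.487437 |
| first_name  | 199   | 128    | PENELOPE            | 4    | 0.643216       | 195       | 97.98995  |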
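The derived columns can be reverse-engineered from the numbers above: `unicount_ratio` matches `unique / count` (121 / 199 ≈ 0.608), `resto_abs` matches `count - freq` (199 - 5 = 194), and `resto_per` matches `resto_abs / count * 100` (≈ 97.49). The sketch below reproduces those formulas. It is inferred from the output only, so the function name `categorical_summary` and its signature are assumptions rather than the actual `mlg.categ_summ` code.

```python
import pandas as pd

def categorical_summary(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Hypothetical re-creation of the categ_summ table for the given columns."""
    rows = {}
    for col in columns:
        counts = df[col].value_counts()  # sorted by frequency, NAs excluded
        count = int(df[col].count())     # non-null values
        freq = int(counts.iloc[0])       # frequency of the most common value
        rows[col] = {
            "count": count,
            "unique": int(counts.size),
            "top": counts.index[0],
            "freq": freq,
            "unicount_ratio": counts.size / count,
            "resto_abs": count - freq,                 # rows NOT holding the top value
            "resto_per": (count - freq) / count * 100,
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy data, not taken from actor.csv
toy = pd.DataFrame({"last_name": ["KILMER", "KILMER", "TEMPLE", "NOLTE"]})
summary = categorical_summary(toy, ["last_name"])
print(summary)
```

A `resto_per` of 0, as for `last_update`, flags a constant column; values close to 100 indicate that no single category dominates.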


### 6. Eval data types

In [57]:
DF1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199 entries, 0 to 198
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   actor_id     199 non-null    int64 
 1   first_name   199 non-null    object
 2   last_name    199 non-null    object
 3   last_update  199 non-null    object
dtypes: int64(1), object(3)
memory usage: 6.3+ KB
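`last_update` is stored as `object` (plain strings). The conversion is deliberately left to MySQL in this notebook, but if it were wanted in pandas instead, a sketch on toy data could look like this. The explicit `format` string is an assumption based on the timestamp layout shown in the outputs.

```python
import pandas as pd

# Toy frame with the same timestamp layout as last_update
toy = pd.DataFrame({"last_update": ["2006-02-15 04:34:33", "2006-02-15 04:34:33"]})

# An explicit format makes parsing strict: unexpected layouts raise instead of being guessed
toy["last_update"] = pd.to_datetime(toy["last_update"], format="%Y-%m-%d %H:%M:%S")
print(toy.dtypes)
```

Note that writing the frame back with `to_csv` turns the column into text again, so the conversion only matters for analysis done inside pandas.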


### 7. Save the cleaned dataframe

In [58]:
actor1=DF1.copy()
#actor1.to_csv('../../data/clean/actor1.csv', index=False)
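When the `to_csv` line is enabled, it can be worth creating the target folder first and reading the file back as a quick sanity check. The sketch below writes to a temporary directory so that it is safe to run anywhere; the `clean/actor1.csv` layout mirrors the path in the cell above, and the toy data is made up.

```python
import tempfile
from pathlib import Path
import pandas as pd

toy = pd.DataFrame({"actor_id": [1, 2], "first_name": ["PENELOPE", "NICK"]})

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for ../../data/clean/actor1.csv
    out_path = Path(tmp) / "clean" / "actor1.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    toy.to_csv(out_path, index=False)                   # index=False avoids an extra "Unnamed: 0" column
    reloaded = pd.read_csv(out_path)

print(reloaded.shape == toy.shape)
```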