## A common pipeline to clean dataframes


0. Import modules
1. Normalize column names
2. Eval misspelling in object non-date columns
3. Eval & Drop duplicates without index_id `mlg.DF_wo_colX(df, colnames)`
4. Check NAs `mlg.na_absperc(df)`
    - By Columns
    - By Rows
5. Eval 'unique/freq ratio' `mlg.categ_summ(df)`
6. Eval Data types
7. Save the cleaned dataframe

## Summary of cleaning `inventory.csv`

I made no changes!!

**Original inventory.csv** had 1000 rows and 4 columns `['inventory_id', 'film_id', 'store_id', 'last_update']`

* 1. No column name to normalize
* 2. No misspelling values
* 3. **CAUTION!! ~600 hidden duplicated rows when excluding the `inventory_id` column! <br/>
    I am not sure what this means, so I haven't changed anything yet.**
* 4. No NA's
* 5. Nothing striking from unique/freq ratio analysis
* 6. No data type to transform
* 7. Cleaned DF saved as `clean/inventory1.csv`

**Cleaned inventory1.csv** has 1000 rows and 4 columns `['inventory_id', 'film_id', 'store_id', 'last_update']`

### 0. Import modules

First things first!!

We need to import modules and set default notebook properties, such as:

- dealing with warnings, or
- defining how we want to display the outputs and plots.

I will also import my own module `mlg` from `src/dataanalysis_fun1.py`.
When I modify my functions, I need to reload the module!

In [2]:
# Import modules etc

import pandas as pd
import numpy as np
import re

np.random.seed(42)
pd.set_option('display.max_columns', None) # show all the columns

# print the plot in the jupyter output
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore') # ignore warnings

import matplotlib.pyplot as plt
import seaborn as sns
import fuzzywuzzy as fzw

# Import my module
from src import dataanalysis_fun1 as mlg 

In [3]:
# Reload my module if necessary

import importlib
from src import dataanalysis_fun1 as mlg # Import the module
#importlib.reload(mlg)  # Reload the module

# Suppress warning when reloading the module
with warnings.catch_warnings():
    warnings.simplefilter("ignore") 
    importlib.reload(mlg)  # Reload the module

In [4]:
DF_raw = pd.read_csv('../../data/raw/inventory.csv')

### 1. Normalize column names

In [5]:
display(DF_raw.columns)
setA = set(DF_raw.columns)
DF_raw=mlg.colnnam_clean(DF_raw)
setB = set(DF_raw.columns)

print(f'Normalized column names: {len(setB.difference(setA))}')

DF_raw.shape

Index(['inventory_id', 'film_id', 'store_id', 'last_update'], dtype='object')

Normalized column names: 0


(1000, 4)
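
`mlg.colnnam_clean` is not shown in this notebook, so the following is only a minimal sketch of what such a normalizer might do, assuming the goal is lowercase snake_case names. The function `normalize_name` and the sample headers are hypothetical.

```python
import re

def normalize_name(name: str) -> str:
    """Hypothetical helper: trim, lowercase, and collapse non-alphanumerics to '_'."""
    name = name.strip().lower()
    name = re.sub(r"[^0-9a-z]+", "_", name)  # spaces, dashes, dots -> single underscore
    return name.strip("_")                    # drop underscores left at the edges

# Made-up messy headers, not taken from inventory.csv
raw_names = ["Inventory ID", " film-id", "store_id", "Last Update "]
normalized = [normalize_name(c) for c in raw_names]
print(normalized)
```

With a dataframe, the same function could be applied via `df.rename(columns=normalize_name)`.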

### 2. Eval misspelling in object non-date columns

In [6]:
for i in DF_raw.columns: 
    if (DF_raw[i].dtype == 'object') and ("date" not in i):
        print(i.upper(), ":")
        DF_raw[i] = DF_raw[i].str.strip()  # remove leading and trailing spaces; NaN values pass through
        print(DF_raw[[i]].value_counts())
        print("\n")
    else:
        print(i.upper(), " ---> Non object-Column or explicit date reference")

INVENTORY_ID  ---> Non object-Column or explicit date reference
FILM_ID  ---> Non object-Column or explicit date reference
STORE_ID  ---> Non object-Column or explicit date reference
LAST_UPDATE  ---> Non object-Column or explicit date reference
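
No column qualified for the check in this table, but when one does, reading `value_counts()` by eye gets tedious. A rough sketch of flagging look-alike values automatically, using the standard library's `difflib` rather than `fuzzywuzzy`; the helper `near_duplicates`, the cutoff and the sample values are all assumptions for illustration.

```python
from difflib import get_close_matches

def near_duplicates(values, cutoff=0.8):
    """Hypothetical helper: map each distinct value to similar-looking distinct values."""
    distinct = sorted({v.strip() for v in values})
    result = {}
    for v in distinct:
        others = [o for o in distinct if o != v]
        matches = get_close_matches(v, others, n=3, cutoff=cutoff)
        if matches:
            result[v] = matches
    return result

# Made-up category values with one likely typo ("Acton")
sample = ["Action", "Acton", "Comedy", "Drama", "Comedy "]
suspects = near_duplicates(sample)
print(suspects)
```

A flagged pair is only a candidate misspelling; it still needs a manual look before merging values.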


### 3. Eval/drop duplicates without index_id

In [27]:
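# NOTE: "inventory_idXXX" matches no column name, so presumably no column is excluded here
# and the check below finds no duplicates. Use ["inventory_id"] to test for the hidden
# duplicates mentioned in the summary.
listtest=["inventory_idXXX"]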
DF_raw_wo1=mlg.DF_wo_colX(DF_raw, listtest)

print("\n")

DF1=DF_raw.copy()
# Boolean mask: True for rows that repeat an earlier row in the subset of columns
dup_ind = DF_raw_wo1.duplicated()
if dup_ind.any(): # there are duplicates!
    print("DUPLICATED?", True)

    # Show the duplicate rows
    display(DF1[dup_ind])

    # Keep only the rows that are not flagged as duplicates,
    # then reset the index to have continuous index values
    DF1 = DF1.loc[~dup_ind].reset_index(drop=True)
else:
    print("NO HIDDEN DUPLICATES")


display(DF_raw.shape)
display(DF1.shape)




NO HIDDEN DUPLICATES


(1000, 4)

(1000, 4)
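
The idea behind "hidden" duplicates can also be sketched with plain pandas, without `mlg.DF_wo_colX` (whose implementation is not shown here). The toy dataframe is made up; only the column names come from `inventory.csv`.

```python
import pandas as pd

# Toy data: rows 0 and 1 differ only in inventory_id
toy = pd.DataFrame({
    "inventory_id": [1, 2, 3, 4],
    "film_id": [1, 1, 1, 2],
    "store_id": [1, 1, 2, 2],
})

# Compare rows on every column except the id column
cols = [c for c in toy.columns if c != "inventory_id"]
dup_mask = toy.duplicated(subset=cols, keep="first")

print(int(dup_mask.sum()))  # number of hidden duplicates
deduped = toy.loc[~dup_mask].reset_index(drop=True)
```

Whether such rows should be dropped depends on what they mean. One possible reading is that each row is a separate physical copy of the same film in the same store, in which case they would not be errors.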

### 4. Check NAs again

- By column
- By row

In [28]:
display(mlg.na_absperc(DF1)) # by cols
display(mlg.na_absperc(DF1.T)) # by rows


| | abs_NA | perc_NA |
|---|---|---|

| | abs_NA | perc_NA |
|---|---|---|
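
Both tables are empty, which fits a helper that lists only the columns (or rows) containing NAs. A minimal sketch of such a summary, assuming that behaviour; `na_summary` is a hypothetical stand-in, not the actual `mlg.na_absperc`.

```python
import numpy as np
import pandas as pd

def na_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: absolute and percentage NA counts, only where NAs exist."""
    abs_na = df.isna().sum()
    out = pd.DataFrame({
        "abs_NA": abs_na,
        "perc_NA": (abs_na / len(df) * 100).round(2),
    })
    return out[out["abs_NA"] > 0]

# Made-up data with a single missing value
sample = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["x", "y", "z", "w"]})
by_col = na_summary(sample)    # NAs per column
by_row = na_summary(sample.T)  # NAs per row, via the transpose
```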


### 5. Unique/freq ratio

In [29]:
display(mlg.categ_summ(DF1).sort_values("resto_per", ascending =True))

| | count | unique | top | freq | unicount_ratio | resto_abs | resto_per |
|---|---|---|---|---|---|---|---|
| last_update | 1000 | 1 | 2006-02-15 05:09:17 | 1000 | 0.001 | 0 | 0.0 |
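
The columns of this summary can be reproduced roughly as follows. This is my reading of the column names, not the actual `mlg.categ_summ`: I assume `unicount_ratio = unique / count`, `resto_abs = count - freq` and `resto_per` is `resto_abs` as a percentage of `count`. All three are consistent with the single row above, but that row cannot confirm them.

```python
import pandas as pd

def categ_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: describe non-numeric columns plus unique/freq ratios."""
    rows = {}
    for col in df.select_dtypes(exclude="number").columns:
        counts = df[col].value_counts()  # sorted by frequency, NAs excluded
        n = int(df[col].count())         # non-NA values
        freq = int(counts.iloc[0])       # occurrences of the most frequent value
        rows[col] = {
            "count": n,
            "unique": len(counts),
            "top": counts.index[0],
            "freq": freq,
            "unicount_ratio": len(counts) / n,
            "resto_abs": n - freq,              # values other than the top one
            "resto_per": (n - freq) / n * 100,  # assumed to be a percentage
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up data: one constant column and one with some variety
sample = pd.DataFrame({
    "last_update": ["2006-02-15 05:09:17"] * 4,
    "rating": ["PG", "PG", "R", "G"],
    "film_id": [1, 2, 3, 4],
})
summ = categ_ratio(sample)
```

The sketch assumes every non-numeric column has at least one non-NA value.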


### 6. Data types

In [30]:
DF1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   inventory_id  1000 non-null   int64 
 1   film_id       1000 non-null   int64 
 2   store_id      1000 non-null   int64 
 3   last_update   1000 non-null   object
dtypes: int64(3), object(1)
memory usage: 31.4+ KB
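
`last_update` is stored as `object` (strings) and I leave it that way. If it is ever needed as a real timestamp, a conversion along these lines should work, assuming every value follows the `YYYY-MM-DD HH:MM:SS` pattern seen above. The two-row dataframe is made up.

```python
import pandas as pd

toy = pd.DataFrame({"last_update": ["2006-02-15 05:09:17", "2006-02-15 05:09:17"]})

# An explicit format makes parsing strict: a value that does not match raises an error
toy["last_update"] = pd.to_datetime(toy["last_update"], format="%Y-%m-%d %H:%M:%S")

print(toy.dtypes)
```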


### 7. Save the cleaned dataframe(s)

In [31]:
inventory1=DF1.copy()
#inventory1.to_csv('../../data/clean/inventory1.csv', index=False)
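
A small sketch of the save step with a round-trip check, assuming the output folder may not exist yet. It writes to a temporary directory rather than the real `data/clean` folder, and the two-row dataframe is made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"inventory_id": [1, 2], "film_id": [1, 1], "store_id": [1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "clean" / "inventory1.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    toy.to_csv(out_path, index=False)                   # index=False avoids an 'Unnamed: 0' column
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```

Reading the file back and comparing shapes is a cheap way to confirm that nothing was lost or added on the way out.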