## A common pipeline to clean dataframes


0. Import modules
1. Normalize column names
2. Eval misspelling in object non-date columns
3. Eval & Drop duplicates without index_id `mlg.DF_wo_colX(df, colnames)`
4. Check NAs `mlg.na_absperc(df)`
    - By Columns
    - By Rows
5. Eval 'unique/freq ratio' `mlg.categ_summ(df)`
6. Eval Data types
7. Save the cleaned dataframe

## Summary of cleaning `old_HDD.csv`

I made no changes!!

**Original old_HDD.csv** had 1000 rows and 5 columns `['first_name', 'last_name', 'title', 'release_year', 'category_id']`

* 1. No column name to normalize
* 2. No misspelled values
* 3. No hidden duplicated rows when excluding `category_id`
* 4. No NA's
* 5. Nothing striking from unique/freq ratio analysis
* 6. No data type to transform
* 7. Cleaned DF copied to `old_HDD1`; the `to_csv` call that would write `clean/old_HDD1.csv` is still commented out

**Cleaned old_HDD1.csv** has 1000 rows and 5 columns `['first_name', 'last_name', 'title', 'release_year', 'category_id']`

### 0. Import modules

First things first!

We need to import modules and set the default notebook properties, such as:

- how to deal with warnings, or
- how we want to display the outputs and plots.

I will also import my own module `mlg` from `src/dataanalysis_fun1.py`.
When I modify my functions, I need to reload the module so the notebook picks up the changes.

In [7]:
# Import modules etc

import pandas as pd
import numpy as np
import re

np.random.seed(42)
pd.set_option('display.max_columns', None) # show all the columns

# print the plot in the jupyter output
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore') # ignore warnings

import matplotlib.pyplot as plt   # preferred over pylab
import seaborn as sns
import fuzzywuzzy as fzw

# Import my module
from src import dataanalysis_fun1 as mlg 

In [8]:
# Reload my module if necessary

import importlib
from src import dataanalysis_fun1 as mlg # Import the module
#importlib.reload(mlg)  # Reload the module

# Suppress warning when reloading the module
with warnings.catch_warnings():
    warnings.simplefilter("ignore") 
    importlib.reload(mlg)  # Reload the module

In [9]:
DF_raw = pd.read_csv('../../data/raw/old_HDD.csv')

### 1. Normalize column names

In [11]:
display(DF_raw.columns)
setA = set(DF_raw.columns)
DF_raw=mlg.colnnam_clean(DF_raw)
setB = set(DF_raw.columns)

print(f'Normalized column names: {len(setB.difference(setA))}')

DF_raw.shape
DF_raw.head()

Index(['first_name', 'last_name', 'title', 'release_year', 'category_id'], dtype='object')

Normalized column names: 0


|   | first_name | last_name | title | release_year | category_id |
|---|---|---|---|---|---|
| 0 | PENELOPE | GUINESS | ACADEMY DINOSAUR | 2006 | 6 |
| 1 | PENELOPE | GUINESS | ANACONDA CONFESSIONS | 2006 | 2 |
| 2 | PENELOPE | GUINESS | ANGELS LIFE | 2006 | 13 |
| 3 | PENELOPE | GUINESS | BULWORTH COMMANDMENTS | 2006 | 10 |
| 4 | PENELOPE | GUINESS | CHEAPER CLYDE | 2006 | 14 |
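
I have not shown the body of `mlg.colnnam_clean` here, so the following is only a minimal sketch of what a column-name normalizer could look like, assuming the goal is lowercase snake_case names. The helper `normalize_colnames` and the demo dataframe are hypothetical and not part of my module.

```python
import re
import pandas as pd

def normalize_colnames(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with lowercase snake_case column names (hypothetical helper)."""
    def clean(name) -> str:
        name = str(name).strip().lower()
        name = re.sub(r"[^\w]+", "_", name)  # runs of non-word characters become one underscore
        return name.strip("_")
    return df.rename(columns=clean)

# Hypothetical demo data with two messy column names
demo = pd.DataFrame({" First Name": ["PENELOPE"], "release-year": [2006], "category_id": [6]})
before = set(demo.columns)
demo = normalize_colnames(demo)
changed = set(demo.columns) - before  # same counting idea as setB.difference(setA)
print(list(demo.columns), len(changed))
```

As in the cell above, comparing the column sets before and after tells me how many names were actually rewritten.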


### 2. Eval misspelling in object non-date columns

In [12]:
for i in DF_raw.columns:
    if DF_raw[i].dtype == 'object' and "date" not in i:
        print(i.upper(), ":")
        DF_raw[i] = DF_raw[i].str.strip()  # remove leading and trailing spaces (NaN-safe)
        print(DF_raw[[i]].value_counts())
        print("\n")
    else:
        print(i.upper(), " ---> Non-object column or explicit date reference")

FIRST_NAME :
first_name
SANDRA        56
VAL           35
UMA           35
JULIA         33
RIP           33
HELEN         32
WOODY         31
KARL          31
GRACE         30
VIVIEN        30
LUCILLE       30
JOHNNY        29
ALEC          29
BURT          29
CUBA          28
FRED          27
KIRSTEN       27
ELVIS         26
ZERO          25
BOB           25
JOE           25
AUDREY        25
TOM           25
NICK          25
CAMERON       24
MILLA         24
TIM           23
DAN           22
CHRISTIAN     22
ED            22
JENNIFER      22
KEVIN         21
MATTHEW       20
BETTE         20
PENELOPE      19
SISSY         18
JUDY          15
GOLDIE         7
dtype: int64


LAST_NAME :
last_name   
OLIVIER         53
PECK            43
KILMER          37
WOOD            35
BOLGER          35
CRAWFORD        33
MCQUEEN         33
VOIGHT          32
BERRY           31
HOFFMAN         31
BERGEN          30
TRACY           30
MOSTEL          30
WAYNE           29
DUKAKIS         29
LOLLO
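
Reading long `value_counts()` listings by eye is error-prone, so here is a hedged sketch of how near-identical spellings could be flagged automatically. It uses `difflib` from the standard library rather than `fuzzywuzzy` (which I import above but do not call in this notebook). The helper `similar_pairs`, the 0.85 threshold, and the demo values (including the deliberately misspelled `PENELOPPE`) are assumptions for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations
import pandas as pd

def similar_pairs(values, threshold=0.85):
    """Return (a, b, ratio) for distinct values whose similarity ratio is >= threshold."""
    uniques = sorted(set(values))
    pairs = []
    for a, b in combinations(uniques, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs

# Hypothetical demo column containing one misspelling
names = pd.Series(["PENELOPE", "PENELOPPE", "SANDRA", "UMA", "SANDRA"])
suspects = similar_pairs(names.dropna().unique())
print(suspects)
```

The pairwise comparison is quadratic in the number of unique values, which is fine for a few dozen names but would need a smarter approach for a column like `title`.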

### 3. Eval & drop duplicates without index_id

In [13]:
listtest=["category_id"]
DF_raw_wo1=mlg.DF_wo_colX(DF_raw, listtest)

print("\n")

DF1 = DF_raw.copy()
# True for rows that repeat an earlier row in the subset of columns
dup_ind = DF_raw_wo1.duplicated()
if dup_ind.any(): # there are duplicates!
    print("DUPLICATED?", True)

    # Show the duplicate rows
    display(DF1[dup_ind])

    # Keep only the rows that are not duplicated
    DF1 = DF1[~dup_ind]

    # Reset the index to have continuous index values
    DF1 = DF1.reset_index(drop=True)
else:
    print("NO HIDDEN DUPLICATES")


display(DF_raw.shape)
display(DF1.shape)




NO HIDDEN DUPLICATES


(1000, 5)

(1000, 5)
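
The same check can be written with plain pandas, without building a second dataframe first. This is a minimal sketch, assuming `mlg.DF_wo_colX` simply drops the listed columns; the helper `duplicates_ignoring` and the demo data are hypothetical.

```python
import pandas as pd

def duplicates_ignoring(df: pd.DataFrame, ignore_cols) -> pd.Series:
    """Boolean mask of rows that repeat an earlier row once ignore_cols are left out."""
    subset = [c for c in df.columns if c not in ignore_cols]
    return df.duplicated(subset=subset, keep="first")

# Hypothetical demo: rows 0 and 1 differ only in the id column
demo = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "title": ["ANGELS LIFE", "ANGELS LIFE", "CHEAPER CLYDE", "ANGELS LIFE"],
    "category_id": [13, 13, 14, 2],
})
mask = duplicates_ignoring(demo, ["record_id"])
deduped = demo[~mask].reset_index(drop=True)
print(mask.tolist())
print(deduped)
```

Filtering with the boolean mask keeps the first occurrence of each repeated row and works regardless of how the index is labelled.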

### 4. Check NAs

- By column
- By row

In [14]:
display(mlg.na_absperc(DF1)) # by cols
display(mlg.na_absperc(DF1.T)) # by rows


|   | abs_NA | perc_NA |
|---|---|---|

|   | abs_NA | perc_NA |
|---|---|---|
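
Both outputs are empty, which I read as "no NAs by column and none by row". For reference, a function of this kind could be sketched as below, assuming it reports the absolute and percentage NA counts and lists only the columns that contain at least one NA. The name `na_summary` and the demo data are hypothetical; this is not the actual body of `mlg.na_absperc`.

```python
import numpy as np
import pandas as pd

def na_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Absolute and percentage NA counts per column, keeping only columns with NAs."""
    abs_na = df.isna().sum()
    out = pd.DataFrame({
        "abs_NA": abs_na,
        "perc_NA": (abs_na / len(df) * 100).round(2),
    })
    return out[out["abs_NA"] > 0]

# Hypothetical demo data with a few missing values
demo = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
by_col = na_summary(demo)    # NAs per column
by_row = na_summary(demo.T)  # transposing turns rows into columns, so this is NAs per row
print(by_col)
print(by_row)
```

Transposing is a convenient trick for small frames; on large ones `df.isna().sum(axis=1)` avoids the copy.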


### 5. Unique/freq ratio

In [15]:
display(mlg.categ_summ(DF1).sort_values("resto_per", ascending=True))

|   | count | unique | top | freq | unicount_ratio | resto_abs | resto_per |
|---|---|---|---|---|---|---|---|
| first_name | 1000 | 38 | SANDRA | 56 | 0.038 | 944 | 94.4 |
| last_name | 1000 | 37 | OLIVIER | 53 | 0.037 | 947 | 94.7 |
| title | 1000 | 614 | BOONDOCK BALLROOM | 6 | 0.614 | 994 | 99.4 |
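
The derived columns in this table are consistent with `unicount_ratio = unique / count`, `resto_abs = count - freq` and `resto_per = resto_abs / count * 100` (for `first_name`: 38/1000, 1000 - 56 and 94.4). Below is a sketch that reproduces those columns under that assumption. The helper `categ_summary` and the demo data are hypothetical, not the actual `mlg.categ_summ`.

```python
import pandas as pd

def categ_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count, unique, top and freq for non-numeric columns, plus three derived ratios."""
    rows = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        counts = df[col].value_counts()  # sorted by frequency, most common first
        if counts.empty:
            continue
        rows[col] = {
            "count": int(df[col].count()),
            "unique": len(counts),
            "top": counts.index[0],
            "freq": int(counts.iloc[0]),
        }
    summ = pd.DataFrame.from_dict(rows, orient="index")
    summ["unicount_ratio"] = summ["unique"] / summ["count"]
    summ["resto_abs"] = summ["count"] - summ["freq"]       # rows not taken by the top value
    summ["resto_per"] = summ["resto_abs"] / summ["count"] * 100
    return summ

# Hypothetical demo data
demo = pd.DataFrame({
    "first_name": ["SANDRA", "SANDRA", "UMA", "VAL"],
    "title": ["ANGELS LIFE", "CHEAPER CLYDE", "ACADEMY DINOSAUR", "BOONDOCK BALLROOM"],
    "release_year": [2006, 2006, 2006, 2006],
})
summary = categ_summary(demo).sort_values("resto_per", ascending=True)
print(summary)
```

A low `resto_per` means one value dominates the column, while a `unicount_ratio` close to 1 means the column is almost an identifier.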


### 6. Data types

In [16]:
DF1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   first_name    1000 non-null   object
 1   last_name     1000 non-null   object
 2   title         1000 non-null   object
 3   release_year  1000 non-null   int64 
 4   category_id   1000 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 39.2+ KB


### 7. Save the cleaned dataframe(s)

In [17]:
old_HDD1=DF1.copy()
#old_HDD1.to_csv('../../data/clean/old_HDD1.csv', index=False)
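
Once the `to_csv` line is enabled, a quick round trip can confirm that the file on disk matches the dataframe in memory. The sketch below writes to a temporary directory so that it never touches the real `data/clean` folder; the file name `demo_clean.csv` and the demo data are hypothetical.

```python
import tempfile
from pathlib import Path
import pandas as pd

# Hypothetical demo data standing in for the cleaned dataframe
demo = pd.DataFrame({
    "first_name": ["PENELOPE", "NICK"],
    "release_year": [2006, 2006],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "clean" / "demo_clean.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    demo.to_csv(out_path, index=False)                  # index=False avoids an 'Unnamed: 0' column
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == demo.shape
same_columns = list(reloaded.columns) == list(demo.columns)
print(same_shape, same_columns)
```

Saving with `index=False` matters here: without it the index is written as an extra unnamed column and the reloaded shape no longer matches.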