## Data Cleansing in Python with Pandas and Pyjanitor
This notebook contains the code examples for the blog post about [data cleaning using pandas](https://www.marsja.se/easiest-data-cleaning-method-using-python-pandas-pyjanitor/).

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import norm

# Importing janitor registers the Pyjanitor methods on pandas DataFrames
import janitor

subject = ['n0' + str(i) for i in range(1, 201)]

# Generate response times using a normal distribution (mean 457, SD 1)
a = 457
rt = norm.rvs(a, size=200)

# Shuffle the response times in place
np.random.shuffle(rt)

# Adding some missing data
rt[4], rt[9], rt[100] = np.nan, np.nan, np.nan


# Creating a dictionary
data = {
    'Subject': subject,
    'RT': rt,
}

# Pandas Dataframe from dictionary
df = pd.DataFrame(data)

df.head()

Unnamed: 0,Subject,RT
0,n01,455.476461
1,n02,456.811823
2,n03,456.155956
3,n04,457.285037
4,n05,
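Because the response times are drawn at random, the numbers shown above will differ on every run. Here is a minimal sketch of a reproducible variant, assuming NumPy's `default_rng` and a seed of 42 (both the seed and the name `df_seeded` are my own choices for illustration, not from the blog post):

```python
import numpy as np
import pandas as pd

# Seeded generator so the simulated data is identical on every run
# (the seed value 42 is an arbitrary, illustrative choice)
rng = np.random.default_rng(42)

subject = ['n0' + str(i) for i in range(1, 201)]

# Mean 457, standard deviation 1 (the default scale used by norm.rvs)
rt = rng.normal(loc=457, scale=1, size=200)

# Same three missing values as in the cell above
rt[[4, 9, 100]] = np.nan

df_seeded = pd.DataFrame({'Subject': subject, 'RT': rt})
```

The exact values will not match the output above, since that was produced without a fixed seed.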


### Append a Column, or Columns, to a Pandas DataFrame

#### Pandas Method

In [2]:
df['NewColumnName'] = np.nan
df.head()

Unnamed: 0,Subject,RT,NewColumnName
0,n01,455.476461,
1,n02,456.811823,
2,n03,456.155956,
3,n04,457.285037,
4,n05,,


#### Pyjanitor Method

In [3]:
newcolvals = [np.nan]*len(df['Subject'])
df = df.add_column('NewColumnName2', newcolvals)
df.head()

Unnamed: 0,Subject,RT,NewColumnName,NewColumnName2
0,n01,455.476461,,
1,n02,456.811823,,
2,n03,456.155956,,
3,n04,457.285037,,
4,n05,,,
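Plain pandas also has a chainable way to add a column: `DataFrame.assign`, which returns a new DataFrame instead of modifying the original. A small sketch on a made-up three-row frame (`df_demo` and `NewColumnName3` are hypothetical names used only here):

```python
import numpy as np
import pandas as pd

# Small stand-in for the simulated data (values are illustrative)
df_demo = pd.DataFrame({'Subject': ['n01', 'n02', 'n03'],
                        'RT': [455.5, 456.8, np.nan]})

# assign returns a new DataFrame and leaves df_demo untouched,
# so it can be used inside a method chain
df_new = df_demo.assign(NewColumnName3=np.nan)

print(df_new.columns.tolist())
```

This can be handy when the column is added in the middle of a longer chain of cleaning steps.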


### Remove Missing Data

#### Pandas Dropna

In [4]:
df.dropna(subset=['RT']).head() # Set inplace=True to modify the dataframe

Unnamed: 0,Subject,RT,NewColumnName,NewColumnName2
0,n01,455.476461,,
1,n02,456.811823,,
2,n03,456.155956,,
3,n04,457.285037,,
5,n06,457.421874,,


In [5]:
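# With Pyjanitor imported, this is still the pandas dropna call;
# it can be chained together with the Pyjanitor methods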
df.dropna(subset=['RT']).head()

Unnamed: 0,Subject,RT,NewColumnName,NewColumnName2
0,n01,455.476461,,
1,n02,456.811823,,
2,n03,456.155956,,
3,n04,457.285037,,
5,n06,457.421874,,
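Before dropping rows it can be useful to count how many values are missing, and to notice that `dropna` keeps the original index labels (which is why index 4 is absent in the output above). A minimal sketch on a made-up frame (`df_demo` is a hypothetical stand-in):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'Subject': ['n01', 'n02', 'n03', 'n04'],
                        'RT': [455.5, np.nan, 456.2, np.nan]})

# Count the missing response times before removing them
n_missing = df_demo['RT'].isna().sum()

# dropna returns a new DataFrame; the original index labels are kept
df_complete = df_demo.dropna(subset=['RT'])

# Optionally renumber the rows from zero again
df_renumbered = df_complete.reset_index(drop=True)
```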


### Removing Empty Columns

#### Pandas Method

In [6]:
df.dropna(axis=1, how='all').head()

Unnamed: 0,Subject,RT
0,n01,455.476461
1,n02,456.811823
2,n03,456.155956
3,n04,457.285037
4,n05,


#### Pyjanitor Method

In [7]:
df.remove_empty().head()

Unnamed: 0,Subject,RT
0,n01,455.476461
1,n02,456.811823
2,n03,456.155956
3,n04,457.285037
4,n05,
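One difference worth keeping in mind: as far as I know, `remove_empty` drops completely empty rows as well as completely empty columns, while `dropna(axis=1, how='all')` only looks at columns. A pandas-only sketch of removing both, on a made-up frame (this is an approximation, not Pyjanitor's implementation):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Subject': ['n01', np.nan, 'n03'],
    'RT': [455.5, np.nan, 456.2],
    'Empty': [np.nan, np.nan, np.nan],
})

df_trimmed = (
    df_demo
    .dropna(axis=0, how='all')   # drop rows where every value is missing
    .dropna(axis=1, how='all')   # drop columns where every value is missing
)
```

In the simulated data above there are no fully empty rows, so both approaches give the same result there.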


### Cleaning the Column Names

In [8]:
import requests

url = "https://datahub.io/core/s-and-p-500-companies-financials/r/constituents-financials.json"
resp = requests.get(url=url)

# pd.json_normalize (pandas >= 1.0) replaces the deprecated
# pandas.io.json.json_normalize import
df = pd.json_normalize(resp.json())
df.iloc[:, 0:6].head()

Unnamed: 0,52 Week High,52 Week Low,Dividend Yield,EBITDA,Earnings/Share,Market Cap
0,175.49,259.77,2.332862,9048000000.0,7.92,138721100000.0
1,48.925,68.39,1.147959,601000000.0,1.7,10783420000.0
2,42.28,64.6,1.908982,5744000000.0,0.26,102121000000.0
3,60.05,125.86,2.49956,10310000000.0,3.29,181386300000.0
4,114.82,162.6,1.71447,5643228000.0,5.44,98765860000.0
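To try the same step without a network connection, `pd.json_normalize` can be fed a list of dictionaries directly. A small sketch with made-up records (the company names and numbers are hypothetical, chosen only to mimic the column names above):

```python
import pandas as pd

# Hypothetical records shaped like the JSON response (values are made up)
records = [
    {'Name': 'Company A', 'Price': 10.0, 'Earnings/Share': 1.5},
    {'Name': 'Company B', 'Price': 20.0, 'Earnings/Share': 2.5},
]

# Each dictionary becomes one row; the keys become column names
df_demo = pd.json_normalize(records)

print(df_demo.columns.tolist())
```

The column names keep their spaces and slashes, which is what the renaming steps in this section deal with.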


#### Using Pandas Rename

In [9]:
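import re

# Replace whitespace and slashes with underscores
# (raw string avoids an invalid-escape warning for \s)
df.rename(columns=lambda x: re.sub(r'(\s|/)', '_', x),
          inplace=True)
df.keys()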

Index(['52_Week_High', '52_Week_Low', 'Dividend_Yield', 'EBITDA',
       'Earnings_Share', 'Market_Cap', 'Name', 'Price', 'Price_Book',
       'Price_Earnings', 'Price_Sales', 'SEC_Filings', 'Sector', 'Symbol'],
      dtype='object')

#### Using clean_names from Pyjanitor

In [10]:
# Note: .head() is part of the assignment, so df holds only
# the first five rows from here on
df = df.clean_names().head()
df.keys()

Index(['52_week_high', '52_week_low', 'dividend_yield', 'ebitda',
       'earnings_share', 'market_cap', 'name', 'price', 'price_book',
       'price_earnings', 'price_sales', 'sec_filings', 'sector', 'symbol'],
      dtype='object')
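To make the comparison concrete, here is a pandas-only approximation of what `clean_names` did to these particular labels: lower-casing them and replacing whitespace and slashes with underscores. The helper `simple_clean` is a hypothetical name of my own, and the real `clean_names` handles more cases than this sketch:

```python
import re
import pandas as pd

# Empty frame with a few of the original, uncleaned column names
df_demo = pd.DataFrame(columns=['52 Week High', 'Earnings/Share', 'SEC Filings'])

def simple_clean(name):
    """Lower-case a label and replace whitespace or slashes with underscores."""
    return re.sub(r'[\s/]+', '_', name.strip()).lower()

df_clean = df_demo.rename(columns=simple_clean)

print(df_clean.columns.tolist())
```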

### Grouping and Aggregating

In [11]:
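# Aggregate only the numeric columns; recent pandas versions raise
# an error when asked for the mean of text columns
num_cols = df.select_dtypes('number').columns.tolist()
df.groupby('sector')[num_cols].agg(['mean',
                                    'std']).collapse_levels().reset_index()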


Unnamed: 0,sector,52_week_high_mean,52_week_high_std,52_week_low_mean,52_week_low_std,dividend_yield_mean,dividend_yield_std,ebitda_mean,ebitda_std,earnings_share_mean,...,market_cap_mean,market_cap_std,price_mean,price_std,price_book_mean,price_book_std,price_earnings_mean,price_earnings_std,price_sales_mean,price_sales_std
0,Health Care,51.165,12.565288,95.23,43.317361,2.204271,0.417601,8027000000.0,3228650000.0,1.775,...,141753700000.0,56049030000.0,82.375,36.918045,14.665,16.228101,20.96,2.192031,5.016026,1.803893
1,Industrials,112.2075,89.49497,164.08,135.326096,1.74041,0.837853,4824500000.0,5972931000.0,4.81,...,74752240000.0,90465570000.0,141.565,115.010918,8.845,3.528463,26.035,2.439518,3.982877,0.576142
2,Information Technology,114.82,,162.6,,1.71447,,5643228000.0,,5.44,...,98765860000.0,,150.51,,10.62,,25.47,,2.604117,
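For readers who want to see what the flattening step amounts to, here is a pandas-only sketch that joins the two header levels with an underscore. It is an approximation of the `collapse_levels` result above, on made-up data (`df_demo` and its values are hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'sector': ['A', 'A', 'B', 'B'],
    'price': [1.0, 3.0, 10.0, 20.0],
})

summary = df_demo.groupby('sector').agg(['mean', 'std'])
# summary.columns is now a MultiIndex: ('price', 'mean'), ('price', 'std')

# Join the two levels into single labels such as 'price_mean'
summary.columns = ['_'.join(levels) for levels in summary.columns]
summary = summary.reset_index()

print(summary.columns.tolist())
```

A group with a single row has no defined sample standard deviation, which is why the Information Technology row above shows empty `std` values.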


### All together now!
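Here the steps above (except the summary statistics) are chained into a single expression. `read_csv` and `dropna` are pandas; `add_column`, `remove_empty`, and `clean_names` come from Pyjanitor.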

In [12]:
data_id = [1]*200
df = (
    pd.read_csv('./SimData/DF_NA_Janitor.csv', 
                index_col=0)
    .add_column('data_id', data_id)
    .remove_empty()
    .dropna()
    .clean_names()
)

df.head()

Unnamed: 0,subject_id,first_name,day,age,response_time,gender,data_id
0,1,John,Sixth,23,0.562719,0,1
3,4,Don,Sixth,27,0.522961,0,1
5,6,James,Sixth,25,0.457103,0,1
11,12,Eric,Sixth,23,0.512688,0,1
13,14,James,Sixth,21,0.451782,0,1
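If the CSV file is not at hand, the same kind of chain can be tried on in-memory data. The sketch below is a pandas-only approximation of the pipeline, assuming a tiny made-up CSV (the content of `csv_text` is hypothetical and is not the real `DF_NA_Janitor.csv`). Assigning a scalar with `assign` also avoids hard-coding the number of rows:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for DF_NA_Janitor.csv
csv_text = """,Subject ID,First Name,Response Time,Empty
0,1,John,0.56,
1,2,Anna,,
2,3,Don,0.52,
"""

df_demo = (
    pd.read_csv(io.StringIO(csv_text), index_col=0)
    .assign(data_id=1)             # scalar is broadcast to every row
    .dropna(axis=1, how='all')     # drop entirely empty columns
    .dropna()                      # drop rows with any missing value
    .rename(columns=lambda c: c.strip().lower().replace(' ', '_'))
)

print(df_demo.columns.tolist())
```

As in the output above, the rows that survive keep their original index labels, so gaps in the index show where rows were dropped.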
