
# 5.4 Data Munging

## Overview
- Data munging is the manipulation of data into a more usable form
    - It is the Transform step in ETL (extract, transform, load)
- We've already done some munging by changing data types, rows, and other manipulations

Setup with file variables:

In [3]:
import io
import pandas as pd
import requests as r

#variables needed for ease of file access
url = 'http://drd.ba.ttu.edu/isqs6339/ex/L2.2/'
file_1 = 'scifi_characters.csv'

res = r.get(url + file_1)
res.status_code
df = pd.read_csv(io.StringIO(res.text)) 

## Basic Munging Operations

### Encoding (Categorizing)
- The same 'bucketing' that we've done before
- Here we will encode any character over the age of 80 as 'Wise'
    - The default is 'Gaining Wisdom'

In [4]:
df['Age_Category'] = 'Gaining Wisdom'
# Use .loc so the assignment hits the original frame; chained indexing
# (df['Age_Category'][mask] = ...) triggers SettingWithCopyWarning
df.loc[df['Age'] > 80, 'Age_Category'] = 'Wise'

In [5]:
df

Unnamed: 0,id,fname,lname,birthdate,Age,Age_Category
0,1,Daniel,Jackson,6/23/2114,91,Wise
1,2,Samantha,Carter,5/14/2167,38,Gaining Wisdom
2,3,Jack,O'Neil,6/20/2103,102,Wise
3,4,Susan,Ivanova,9/6/2151,53,Gaining Wisdom
4,5,John,Sheridan,6/23/2167,38,Gaining Wisdom
5,6,Londo,Mallari,3/19/2174,31,Gaining Wisdom
6,7,Aeryn,Sun,1/20/2175,30,Gaining Wisdom
7,8,John,Crichton,12/9/2123,81,Wise
8,9,Sikozu,Shanu,2/22/2149,56,Gaining Wisdom
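
The same encoding can also be written without the default-then-overwrite step. The following is a minimal sketch, assuming a small hand-built frame (`demo`, a name used only here) holding a few of the ages shown above; `np.where` and `pd.cut` are standard NumPy and pandas calls.

```python
import numpy as np
import pandas as pd

# Stand-in frame with a few of the ages from the table above
demo = pd.DataFrame({'fname': ['Daniel', 'Samantha', 'Jack', 'John'],
                     'Age': [91, 38, 102, 81]})

# np.where returns 'Wise' where the condition holds, otherwise the default
demo['Age_Category'] = np.where(demo['Age'] > 80, 'Wise', 'Gaining Wisdom')

# pd.cut does the same with explicit bin edges and extends to any number
# of buckets; each bin includes its right edge, so 80 is 'Gaining Wisdom'
demo['Age_Bucket'] = pd.cut(demo['Age'],
                            bins=[0, 80, np.inf],
                            labels=['Gaining Wisdom', 'Wise'])
```

`pd.cut` is usually the easier option once there are more than two buckets, since adding a bucket only means adding an edge and a label.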


### Change Capitalization
- `df['value'].str.upper()`
    - `.str` exposes string methods for every value in the column (it does not change the data type)
    - `.upper()` converts each string to uppercase
- Place the result in a new column

In [6]:
df['lname_upper'] = df['lname'].str.upper()
df

Unnamed: 0,id,fname,lname,birthdate,Age,Age_Category,lname_upper
0,1,Daniel,Jackson,6/23/2114,91,Wise,JACKSON
1,2,Samantha,Carter,5/14/2167,38,Gaining Wisdom,CARTER
2,3,Jack,O'Neil,6/20/2103,102,Wise,O'NEIL
3,4,Susan,Ivanova,9/6/2151,53,Gaining Wisdom,IVANOVA
4,5,John,Sheridan,6/23/2167,38,Gaining Wisdom,SHERIDAN
5,6,Londo,Mallari,3/19/2174,31,Gaining Wisdom,MALLARI
6,7,Aeryn,Sun,1/20/2175,30,Gaining Wisdom,SUN
7,8,John,Crichton,12/9/2123,81,Wise,CRICHTON
8,9,Sikozu,Shanu,2/22/2149,56,Gaining Wisdom,SHANU
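
The other case conversions follow the same pattern. A short sketch, assuming a three-row stand-in frame (`names` is a name used only for this example):

```python
import pandas as pd

# Stand-in frame with a few last names from the table above
names = pd.DataFrame({'lname': ['Jackson', "O'Neil", 'Sun']})

names['lname_upper'] = names['lname'].str.upper()        # all uppercase
names['lname_lower'] = names['lname'].str.lower()        # all lowercase
names['lname_title'] = names['lname_upper'].str.title()  # capitalize each word
```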


### Drop Columns
- `df.drop`
    - Prefer this over the older deletion syntax, which is harder to read
- `inplace=True`
    - Drops the column and saves the deletion to the df in one step
- `axis=1`
    - Tells `drop` that the label is a column (`axis=0` would be a row)

In [None]:
df.drop('lname_upper', inplace=True, axis=1)
df
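
As an alternative to `inplace=True` and `axis=1`, `drop` also accepts a `columns=` keyword and returns a new frame. A minimal sketch, assuming a two-row stand-in frame (`demo` and `trimmed` are names used only here):

```python
import pandas as pd

demo = pd.DataFrame({'lname': ['Jackson', 'Carter'],
                     'lname_upper': ['JACKSON', 'CARTER']})

# columns= names the axis explicitly, so axis=1 is not needed;
# without inplace=True the original frame is left untouched
trimmed = demo.drop(columns=['lname_upper'])

# errors='ignore' makes the cell safe to re-run after the column is gone
trimmed = trimmed.drop(columns=['lname_upper'], errors='ignore')
```

Returning a new frame keeps the original available, which helps when a notebook cell is run more than once.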

### Selecting String Characters
- Strings are sequences of characters
    - We can select specific characters by their index
- To get the first character we can use `df['value'].str[0]`

In [9]:
df['lname_1stchar'] = df['lname'].str[0]
df

Unnamed: 0,id,fname,lname,birthdate,Age,Age_Category,lname_upper,lname_1stchar
0,1,Daniel,Jackson,6/23/2114,91,Wise,JACKSON,J
1,2,Samantha,Carter,5/14/2167,38,Gaining Wisdom,CARTER,C
2,3,Jack,O'Neil,6/20/2103,102,Wise,O'NEIL,O
3,4,Susan,Ivanova,9/6/2151,53,Gaining Wisdom,IVANOVA,I
4,5,John,Sheridan,6/23/2167,38,Gaining Wisdom,SHERIDAN,S
5,6,Londo,Mallari,3/19/2174,31,Gaining Wisdom,MALLARI,M
6,7,Aeryn,Sun,1/20/2175,30,Gaining Wisdom,SUN,S
7,8,John,Crichton,12/9/2123,81,Wise,CRICHTON,C
8,9,Sikozu,Shanu,2/22/2149,56,Gaining Wisdom,SHANU,S
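
Indexing through `.str` also accepts slices and negative positions. A short sketch, assuming a two-row stand-in frame (`demo` and the new column names are used only here):

```python
import pandas as pd

demo = pd.DataFrame({'fname': ['Daniel', 'Aeryn'],
                     'lname': ['Jackson', 'Sun']})

demo['lname_first3'] = demo['lname'].str[:3]   # first three characters
demo['lname_last'] = demo['lname'].str[-1]     # last character
# Element-wise concatenation of two single-character columns
demo['initials'] = demo['fname'].str[0] + demo['lname'].str[0]
```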


### Date Operations
- Look at the datatypes to check whether we actually have dates
- Here birthdate is an object (string)
- `pd.to_datetime(df['value']).dt.year`
    - Parses the column as datetimes, then extracts the year
- The `.dt` accessor exposes the other elements of the date the same way (month, day, etc.)
    

In [11]:
df.dtypes

id                int64
fname            object
lname            object
birthdate        object
Age               int64
Age_Category     object
lname_upper      object
lname_1stchar    object
yr                int64
dtype: object

In [10]:
df['yr'] = pd.to_datetime(df['birthdate']).dt.year
df

Unnamed: 0,id,fname,lname,birthdate,Age,Age_Category,lname_upper,lname_1stchar,yr
0,1,Daniel,Jackson,6/23/2114,91,Wise,JACKSON,J,2114
1,2,Samantha,Carter,5/14/2167,38,Gaining Wisdom,CARTER,C,2167
2,3,Jack,O'Neil,6/20/2103,102,Wise,O'NEIL,O,2103
3,4,Susan,Ivanova,9/6/2151,53,Gaining Wisdom,IVANOVA,I,2151
4,5,John,Sheridan,6/23/2167,38,Gaining Wisdom,SHERIDAN,S,2167
5,6,Londo,Mallari,3/19/2174,31,Gaining Wisdom,MALLARI,M,2174
6,7,Aeryn,Sun,1/20/2175,30,Gaining Wisdom,SUN,S,2175
7,8,John,Crichton,12/9/2123,81,Wise,CRICHTON,C,2123
8,9,Sikozu,Shanu,2/22/2149,56,Gaining Wisdom,SHANU,S,2149
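
The other date parts can be pulled out the same way. A minimal sketch, assuming a two-row stand-in frame and assuming the dates are month/day/year, as the values above suggest; passing `format=` is optional but makes that assumption explicit.

```python
import pandas as pd

demo = pd.DataFrame({'birthdate': ['6/23/2114', '12/9/2123']})

# Parse once, then reuse the parsed column for each date part
parsed = pd.to_datetime(demo['birthdate'], format='%m/%d/%Y')

demo['yr'] = parsed.dt.year
demo['mo'] = parsed.dt.month
demo['day'] = parsed.dt.day
```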


## Data Munging Concerns
- Balancing between cleaning and manipulation
    - Cleaning: Making data more readily usable
    - Manipulation: Changing the meaning
- Also balance between time and value 
    - We can spend forever cleaning
- Ethics
    - Would my actions be defensible, given the analysis that will be run on the processed data?

## Munging Missing Values
- Not just missing measures or numerics
- Common missing issues
    - Data value is missing
    - Data attributes
        - Missing textual attributes
        - E.g. we have temperature but not state
    - Data keys
        - Missing key to join datasets
- Fixing missing data values
    - Easiest: replace with avg, min, max, etc
- Fixing data attributes?
    - E.g. list of states, we can't average a state

Setup:

In [13]:
url = 'http://drd.ba.ttu.edu/isqs6339/ex/L2.2/'
file_1 = 'employment.csv'
file_2 = 'job_title.csv'
file_3 = 'job_title_year.csv'

#pull employment
res = r.get(url + file_1)
res.status_code
df_emp = pd.read_csv(io.StringIO(res.text)) 

#pull job
res = r.get(url + file_2)
res.status_code
df_job = pd.read_csv(io.StringIO(res.text)) 

### Investigating Missing Data
- A good way to start is looking at the count of values in each column
- Columns should normally have the same count; differing counts are a red flag
- Note the columns whose counts differ (here salary and jobtitleid)
- `count()` returns the number of non-NA values, not the number of rows

In [15]:
df_emp.head()
#Let's look at the count
df_emp.count()

id            99
ssn           99
age           99
salary        95
jobtitleid    94
dtype: int64
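
Counting the nulls directly is often easier to read than comparing non-NA counts. A minimal sketch on made-up values (the `demo` frame and its numbers are illustrative only, not the employment file):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'id': [1, 2, 3, 4],
                     'salary': [50000.0, np.nan, 62000.0, np.nan],
                     'jobtitleid': [1.0, 2.0, np.nan, 2.0]})

n_rows = len(demo)               # number of rows, nulls included
null_counts = demo.isna().sum()  # nulls per column
null_share = demo.isna().mean()  # fraction of each column that is null
```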

### Identifying Nulls

Select the rows where salary is not NA:

In [None]:
df_emp[df_emp['salary'].notna()]

Now let's look for the nulls directly:

In [16]:
df_emp[df_emp['salary'].isnull()]

Unnamed: 0,id,ssn,age,salary,jobtitleid
10,11,521003273,38,,6.0
22,23,526007969,23,,1.0
44,45,707009414,29,,8.0
59,60,364006076,46,,7.0


### Filling Nulls

#### Filling Nulls with Zeroes
- After filling, the salary count matches the row count

In [17]:
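# Assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in newer pandas
df_emp['salary'] = df_emp['salary'].fillna(0)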
df_emp.count()

id            99
ssn           99
age           99
salary        99
jobtitleid    94
dtype: int64
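
Zero-filling makes the counts line up, but it also changes any statistic computed afterwards. A small sketch on made-up numbers (the `salary` values are illustrative only):

```python
import numpy as np
import pandas as pd

salary = pd.Series([40000.0, np.nan, 60000.0])

# mean() skips nulls by default, so only the two known salaries are averaged
mean_skipping_nulls = salary.mean()

# After filling with 0, the missing salary is treated as a real salary of 0
mean_zero_filled = salary.fillna(0).mean()
```

Here the average drops from 50000 to about 33333 purely because of the fill, which is the kind of change in meaning to watch for.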

#### Filling Nulls with Averages
- Might be a better idea than zeroes depending on data

In [18]:
# Reset dataframe to unfix salaries
res = r.get(url + file_1)
res.status_code
df_emp = pd.read_csv(io.StringIO(res.text)) 

Loop over the job rows to fill in averages:
- `df_emp.loc[mask, 'salary']`
    - Select the salary column for the rows flagged by `mask`
- `(df_emp['jobtitleid'] == row['jobtitleid'])`
    - The employee's job title must equal the job row we're currently iterating on
- `& (df_emp['salary'].isnull())`
    - AND we want to ensure we're only editing null salary rows
    - This is important, we're replacing data!
- `= row['avg_salary']`
    - Set the selected salaries to that job's average salary

In [20]:
for index, row in df_job.iterrows():
    # Rows with this job title whose salary is still missing
    mask = (df_emp['jobtitleid'] == row['jobtitleid']) & (df_emp['salary'].isnull())
    # .loc writes to the original frame and avoids SettingWithCopyWarning
    df_emp.loc[mask, 'salary'] = row['avg_salary']

df_emp.count()


id            99
ssn           99
age           99
salary        99
jobtitleid    94
dtype: int64

Nicer way to accomplish this:

In [30]:
#reset the dataframe
res = r.get(url + file_1)
res.status_code
df_emp = pd.read_csv(io.StringIO(res.text)) 

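for index, row in df_emp.iterrows():
    if pd.isna(row['salary']):
        # Look up the average for this job; take the scalar, not the one-row Series
        match = df_job.loc[df_job['jobtitleid'] == row['jobtitleid'], 'avg_salary']
        if not match.empty:
            df_emp.at[index, 'salary'] = match.iloc[0]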
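
The fill can also be done without a loop by mapping each job ID to its average. This is a sketch of an alternative technique, not the approach used above; the `emp` and `job` frames are small stand-ins, and the average for job 1 is a made-up value.

```python
import numpy as np
import pandas as pd

emp = pd.DataFrame({'id': [1, 2, 3, 4],
                    'salary': [84370.0, np.nan, 47686.0, np.nan],
                    'jobtitleid': [8.0, 8.0, 1.0, np.nan]})
job = pd.DataFrame({'jobtitleid': [1, 8],
                    'avg_salary': [52000, 81000]})  # 52000 is made up

# Lookup Series: index is the job ID, values are the average salaries
avg_by_job = job.set_index('jobtitleid')['avg_salary']

# map() turns each employee's job ID into that job's average;
# fillna() then uses those values only where salary is null
emp['salary'] = emp['salary'].fillna(emp['jobtitleid'].map(avg_by_job))
```

Employee 4 has no job ID, so there is nothing to look up and the salary stays null instead of being silently filled.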

### Join Issues

In [25]:
#reset the dataframe
res = r.get(url + file_1)
res.status_code
df_emp = pd.read_csv(io.StringIO(res.text)) 

Join our files and look at the count for issues

In [26]:
dfmerged = df_emp.merge(df_job, how='inner', on='jobtitleid')
dfmerged.head()

Unnamed: 0,id,ssn,age,salary,jobtitleid,jobtitle,avg_salary,avg_age
0,1,933003970,22,84370.0,8.0,Asst. Manager,81000,32
1,7,733008623,30,97576.0,8.0,Asst. Manager,81000,32
2,37,750004704,46,47686.0,8.0,Asst. Manager,81000,32
3,45,707009414,29,,8.0,Asst. Manager,81000,32
4,50,284001503,39,98150.0,8.0,Asst. Manager,81000,32


In [27]:
dfmerged.count()

id            94
ssn           94
age           94
salary        90
jobtitleid    94
jobtitle      94
avg_salary    94
avg_age       94
dtype: int64

Our counts are off: the inner join dropped the 5 employees with no `jobtitleid`, leaving 94 rows instead of 99. Let's try a left join instead of an inner join:

In [28]:
dfmerged = df_emp.merge(df_job, how='left', on='jobtitleid')
dfmerged.count()

id            99
ssn           99
age           99
salary        95
jobtitleid    94
jobtitle      94
avg_salary    94
avg_age       94
dtype: int64
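
To see which rows failed to match, `merge` accepts `indicator=True`, which adds a `_merge` column. A minimal sketch on stand-in frames (`emp`, `job`, and the 'Analyst' title are illustrative only):

```python
import numpy as np
import pandas as pd

emp = pd.DataFrame({'id': [1, 2, 3],
                    'jobtitleid': [8.0, 1.0, np.nan]})
job = pd.DataFrame({'jobtitleid': [1, 8],
                    'jobtitle': ['Analyst', 'Asst. Manager']})

# _merge is 'both' for matched rows and 'left_only' for employees
# with no matching job row
merged = emp.merge(job, how='left', on='jobtitleid', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
```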

#### Filling NAs
- `df.fillna()`
- To update multiple values at once we can insert a dictionary into `.fillna`
    - Key: column
    - Value: value to insert into nulls

In [29]:
dfmergedcln = dfmerged.fillna({'jobtitleid' : -1, 'jobtitle':'other', 'avg_salary':0, 'avg_age':0})
dfmergedcln.count()

id            99
ssn           99
age           99
salary        95
jobtitleid    99
jobtitle      99
avg_salary    99
avg_age       99
dtype: int64

#### Best Practice
- It may be better to add a placeholder row to the job file for employees with no job title
- Create a dummy job ID, assign it to the missing keys, then remerge to avoid key errors
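
One way this could look is sketched below, assuming `-1` as the dummy ID and 'other' as its title (the same placeholder values used in the `fillna` call above). The frames are small stand-ins, and the 'Analyst' row and its average are made up.

```python
import numpy as np
import pandas as pd

emp = pd.DataFrame({'id': [1, 2, 3],
                    'jobtitleid': [8.0, 1.0, np.nan]})
job = pd.DataFrame({'jobtitleid': [1, 8],
                    'jobtitle': ['Analyst', 'Asst. Manager'],
                    'avg_salary': [52000, 81000]})

DUMMY_ID = -1  # assumed placeholder key

# Add one placeholder row to the job table
dummy_row = pd.DataFrame({'jobtitleid': [DUMMY_ID],
                          'jobtitle': ['other'],
                          'avg_salary': [0]})
job_ext = pd.concat([job, dummy_row], ignore_index=True)

# Point employees with a missing key at the placeholder row
emp_keyed = emp.assign(jobtitleid=emp['jobtitleid'].fillna(DUMMY_ID).astype(int))

# Every key now has a match, so even an inner join keeps all employees
merged = emp_keyed.merge(job_ext, how='inner', on='jobtitleid')
```

Keeping the placeholder in the job table means the substitution is visible in one place, not scattered across `fillna` calls.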

### Missing Values Outcomes
- Everything we do is a compromise
    - We are modifying data no matter what
- We need to document our assumptions
- Practices
    - Check your assumptions
    - Can you justify your methodology?
    - Does the modification introduce ambiguity? 
- Goal: Clean data with minimal modification to the source
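
One lightweight way to document an assumption is to record which values were filled before filling them. A minimal sketch on made-up numbers; the `salary_imputed` column name is hypothetical.

```python
import numpy as np
import pandas as pd

emp = pd.DataFrame({'id': [1, 2, 3],
                    'salary': [84370.0, np.nan, 47686.0]})

# Flag the rows first, so the modification can be traced later
emp['salary_imputed'] = emp['salary'].isna()

# Then fill; the column mean is just one possible choice
emp['salary'] = emp['salary'].fillna(emp['salary'].mean())
```

With the flag in place, an analysis can be re-run with and without the imputed rows to check how much the fill affects the result.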