## Data wrangling / Data munging

**Data wrangling**, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" form into another format, with the intent of making it more appropriate and valuable for downstream purposes such as analytics.

The data transformations are typically applied to distinct entities (e.g. fields, rows, columns, data values etc.) within a data set, and could include such actions as extractions, parsing, joining, standardizing, augmenting, cleansing, consolidating and filtering to create desired wrangling outputs that can be leveraged downstream.

The recipients could be individuals, such as data architects or data scientists who will investigate the data further, business users who will consume the data directly in reports, or systems that will further process the data and write it into targets such as data warehouses, data lakes or downstream applications.

**Data munging broadly consists of the following steps:**

    * Data ingestion
    * Data merging
    * Quick data exploration
    * Checking initial stats of the data
    * Cleaning the data
        * Check for missing values
        * Treat missing values
        * Check for inconsistencies in names / values across data columns
        * Treat data inconsistencies
        * Rename columns
        * Introduce new columns after cleaning
    * Exploratory Data Analysis (EDA)
        * Data visualisation
        * Data distribution checking
        * Detailed data stats study
        * Check and treat outliers
    * Feature engineering
        * Binning
        * Converting categorical features to one-hot features
        * Converting categorical features to factorized features
        * Creating interaction features
        * Creating polynomial features
        * Deriving features
        * Acquiring more features
    * Preprocessing
        * Feature scaling
        * Feature normalization
        * Changing the training data format as required by the ML algorithm
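
As a rough illustration of a few of the feature-engineering steps listed above, the sketch below bins a numeric column, factorizes a categorical column, and builds one interaction feature. The toy frame, the bin edges, and the column names `age_bin`, `education_code`, and `age_x_hours` are assumptions made for this example; they are not part of the dataset used in the rest of the notebook.

```python
import pandas as pd

# Toy data; values and column names are made up for illustration
toy = pd.DataFrame({
    'age': [22, 35, 47, 63],
    'hours': [20, 40, 45, 10],
    'education': ['HS-grad', 'Bachelors', 'HS-grad', 'Masters'],
})

# Binning: cut a numeric column into labelled intervals (edges are arbitrary here)
toy['age_bin'] = pd.cut(toy['age'], bins=[0, 30, 50, 100],
                        labels=['young', 'middle', 'senior'])

# Factorizing: replace each distinct category with an integer code
toy['education_code'], education_levels = pd.factorize(toy['education'])

# Interaction feature: product of two numeric columns
toy['age_x_hours'] = toy['age'] * toy['hours']

print(toy)
```

`pd.factorize` assigns codes in order of first appearance, so the codes carry no ranking; they are only a compact label.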
    

### LOAD THE DATASET

In [20]:
import pandas as pd
import numpy as np
# Load Training and Test Data Sets
headers = ['age', 'workclass', 'fnlwgt', 
           'education', 'education-num', 
           'marital-status', 'occupation', 
           'relationship', 'race', 'sex', 
           'capital-gain', 'capital-loss', 
           'hours-per-week', 'native-country', 
           'predclass']
train_data_path=r'C:\Users\Prateek\Desktop\content\ag dsb1\d10 EDA Project2\Data Wrangling dataset\train.csv'
test_data_path =r'C:\Users\Prateek\Desktop\content\ag dsb1\d10 EDA Project2\Data Wrangling dataset\test.csv'
training_raw = pd.read_csv(train_data_path, names=headers, na_values=[" ?"])  # read train csv; " ?" marks a missing value
test_raw = pd.read_csv(test_data_path, names=headers, na_values=[" ?"])  # read test csv

### Check the datatypes of available data

In [28]:
test_raw.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
predclass         object
dtype: object

### Combine train and test data

In [30]:
# Join the train and test datasets.
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement.
# ignore_index=True builds a fresh 0..n-1 index, so no separate reset_index is needed.
dataset_raw = pd.concat([training_raw, test_raw], ignore_index=True)

In [31]:
print(training_raw.shape)

(32561, 15)


In [32]:
print(test_raw.shape)

(16281, 15)


In [33]:
print(dataset_raw.shape)

(48842, 15)
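
Once the two files are stacked into one frame, nothing records which rows came from which file. If the split is needed again after cleaning, one option is to tag the rows before concatenating. This is a minimal sketch on toy frames; the `source` column is a hypothetical helper, not something the original notebook creates.

```python
import pandas as pd

# Toy stand-ins for training_raw and test_raw
train_toy = pd.DataFrame({'age': [39, 50], 'predclass': ['<=50K', '>50K']})
test_toy = pd.DataFrame({'age': [25], 'predclass': ['<=50K.']})

# Tag each frame before stacking; 'source' is a hypothetical helper column
combined = pd.concat(
    [train_toy.assign(source='train'), test_toy.assign(source='test')],
    ignore_index=True,  # build a fresh 0..n-1 index instead of repeating row labels
)

# Recover the original training rows later
train_again = combined[combined['source'] == 'train'].drop(columns='source')
print(combined.shape, train_again.shape)
```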


### Check initial data stats

In [34]:
dataset_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
age               48755 non-null float64
workclass         46043 non-null object
fnlwgt            48842 non-null int64
education         48842 non-null object
education-num     48842 non-null int64
marital-status    48814 non-null object
occupation        46033 non-null object
relationship      48842 non-null object
race              48842 non-null object
sex               48842 non-null object
capital-gain      48829 non-null float64
capital-loss      48842 non-null int64
hours-per-week    48842 non-null int64
native-country    47964 non-null object
predclass         48842 non-null object
dtypes: float64(2), int64(4), object(9)
memory usage: 5.6+ MB


In [42]:
dataset_raw.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 48755.0 | 48842.0 | 48842.0 | 48829.0 | 48842.0 | 48842.0 |
| mean | 38.670782 | 189664.1 | 10.078089 | 1079.354912 | 87.502314 | 40.422382 |
| std | 13.707212 | 105604.0 | 2.570973 | 7452.990204 | 403.004552 | 12.391444 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117550.5 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178144.5 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237642.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1490400.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


In [53]:
# Describe only the integer (int64) features; float columns such as age and capital-gain are excluded
dataset_raw.describe(include='int64')

| | fnlwgt | education-num | capital-loss | hours-per-week |
|---|---|---|---|---|
| count | 48842.0 | 48842.0 | 48842.0 | 48842.0 |
| mean | 189664.1 | 10.078089 | 87.502314 | 40.422382 |
| std | 105604.0 | 2.570973 | 403.004552 | 12.391444 |
| min | 12285.0 | 1.0 | 0.0 | 1.0 |
| 25% | 117550.5 | 9.0 | 0.0 | 40.0 |
| 50% | 178144.5 | 10.0 | 0.0 | 40.0 |
| 75% | 237642.0 | 12.0 | 0.0 | 45.0 |
| max | 1490400.0 | 16.0 | 4356.0 | 99.0 |


In [47]:
dataset_raw.describe(exclude=['float64','object'])

| | fnlwgt | education-num | capital-loss | hours-per-week |
|---|---|---|---|---|
| count | 48842.0 | 48842.0 | 48842.0 | 48842.0 |
| mean | 189664.1 | 10.078089 | 87.502314 | 40.422382 |
| std | 105604.0 | 2.570973 | 403.004552 | 12.391444 |
| min | 12285.0 | 1.0 | 0.0 | 1.0 |
| 25% | 117550.5 | 9.0 | 0.0 | 40.0 |
| 50% | 178144.5 | 10.0 | 0.0 | 40.0 |
| 75% | 237642.0 | 12.0 | 0.0 | 45.0 |
| max | 1490400.0 | 16.0 | 4356.0 | 99.0 |


In [50]:
# Describing all the Categorical/string Features
dataset_raw.describe(include='O')

| | workclass | education | marital-status | occupation | relationship | race | sex | native-country | predclass |
|---|---|---|---|---|---|---|---|---|---|
| count | 46043 | 48842 | 48814 | 46033 | 48842 | 48842 | 48842 | 47964 | 48842 |
| unique | 8 | 16 | 7 | 14 | 6 | 5 | 2 | 41 | 4 |
| top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | <=50K |
| freq | 33906 | 15784 | 22379 | 6172 | 19716 | 41762 | 32650 | 43814 | 24720 |
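
The `unique` row reports 4 distinct values for `predclass`, although the target is meant to be binary. `value_counts` is a quick way to see which spellings are actually present in a column. The labels in this sketch are assumed for illustration and are not read from the real file.

```python
import pandas as pd

# Made-up labels showing how spelling variants inflate the unique count
labels = pd.Series([' <=50K', ' >50K', ' <=50K.', ' >50K.', ' <=50K'])

# One row per distinct spelling, most frequent first; dropna=False also counts NaN
counts = labels.value_counts(dropna=False)
print(counts)
```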


### Finding which columns have missing values

In [58]:
dataset_raw.isnull()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | predclass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 5 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 6 | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
| 7 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 8 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 9 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |


In [59]:
dataset_raw.isnull().any()

age                True
workclass          True
fnlwgt            False
education         False
education-num     False
marital-status     True
occupation         True
relationship      False
race              False
sex               False
capital-gain       True
capital-loss      False
hours-per-week    False
native-country     True
predclass         False
dtype: bool

In [68]:
dataset_raw.columns[dataset_raw.isnull().any()]

Index(['age', 'workclass', 'marital-status', 'occupation', 'capital-gain',
       'native-country'],
      dtype='object')

In [70]:
# Get the names of all columns that contain missing values
missing_info = list(dataset_raw.columns[dataset_raw.isnull().any()])
print(missing_info)


['age', 'workclass', 'marital-status', 'occupation', 'capital-gain', 'native-country']


In [66]:
# Get the number of missing values in each column
dataset_raw.isnull().sum()

age                 87
workclass         2799
fnlwgt               0
education            0
education-num        0
marital-status      28
occupation        2809
relationship         0
race                 0
sex                  0
capital-gain        13
capital-loss         0
hours-per-week       0
native-country     878
predclass            0
dtype: int64

### Find the percentage of missing values in each column

In [72]:
for col in missing_info:
    # Fraction of rows in which this column is missing
    percent_missing = dataset_raw[col].isnull().sum() / dataset_raw.shape[0]
    print('percent missing for column {}:{}'.format(col, percent_missing * 100))

percent missing for column age:0.17812538389091356
percent missing for column workclass:5.7307235575938735
percent missing for column marital-status:0.05732770975799517
percent missing for column occupation:5.751197739650301
percent missing for column capital-gain:0.0266164366733549
percent missing for column native-country:1.797633184554277


### Ways to Cleanse Missing Data in Python
#### a. Dropping Missing Values

In [73]:
non_na_df = dataset_raw.dropna()
# Verify by checking the number of missing values per column
non_na_df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
predclass         0
dtype: int64
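
Calling `dropna()` with no arguments discards a row if any column is missing, which can throw away rows that lack only a rarely used field. `dropna` also accepts `subset` (look only at certain columns) and `thresh` (minimum number of non-missing values a row must have). The toy frame below is an assumption for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'age': [39.0, np.nan, 38.0],
    'workclass': ['Private', 'Private', None],
    'occupation': ['Sales', 'Sales', None],
})

# Drop rows only when 'age' is missing
age_known = toy.dropna(subset=['age'])

# Keep rows that have at least 2 non-missing values
mostly_complete = toy.dropna(thresh=2)

print(len(toy), len(age_known), len(mostly_complete))
```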

#### b. Replacing Missing Values in column 'capital-gain' with 0

In [79]:
# Fill missing values in 'capital-gain' only; all other columns are left unchanged
zero_fill_df = dataset_raw.fillna({'capital-gain': 0})

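#### c. Replacing with a Scalar Value (mean, median, etc.)

In [15]:
# Fill numeric columns with their mean or median; categorical columns keep their NaNs
mean_fill_df = dataset_raw.fillna(dataset_raw.mean(numeric_only=True))
median_fill_df = dataset_raw.fillna(dataset_raw.median(numeric_only=True))
mean_fill_df.isnull().sum()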

age                  0
workclass         2799
fnlwgt               0
education            0
education-num        0
marital-status      28
occupation        2809
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     878
predclass            0
dtype: int64
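
The counts above show that a mean or median fill leaves the categorical columns untouched, since those statistics are not defined for strings. One common option for such columns is to fill with the most frequent value (the mode). This is a sketch on toy data, and `mode_fill_df` is a name introduced here, not one used by the original notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'workclass': ['Private', None, 'Private', 'State-gov'],
    'occupation': ['Sales', 'Sales', None, 'Tech-support'],
})

# mode() returns one row per tied mode; iloc[0] takes the first row
most_frequent = toy.mode().iloc[0]

# Filling with a Series matches its index against the column names
mode_fill_df = toy.fillna(most_frequent)
print(mode_fill_df)
```

Whether the mode is a sensible replacement depends on how dominant it is; for a column with many near-equal categories, a separate "Unknown" label may distort the data less.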

#### d. Filling Forward or Backward (works for both categorical and numeric variables)
To fill forward, use `ffill` (older alias: `pad`); to fill backward, use `bfill` (older alias: `backfill`).

In [76]:
# Forward fill copies the previous valid value down; backward fill copies the next valid value up
ffill_df = dataset_raw.ffill()
bfill_df = dataset_raw.bfill()
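
Forward and backward fills depend on row order, and each leaves one edge unfilled: a forward fill cannot fill leading gaps, and a backward fill cannot fill trailing ones. A small sketch on a made-up series shows both behaviours:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # carry the last seen value forward
backward = s.bfill()  # pull the next seen value backward

print(forward.tolist())
print(backward.tolist())
```

Here the final element stays missing after the backward fill because no later value exists to copy from.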

### Recode the Class Feature 'predclass' with >50K = 1 and <=50K = 0

In [74]:
# Work on a copy of the mean-filled frame
str_formated_df = mean_fill_df.copy()

# Normalize the labels (drop surrounding whitespace and any trailing '.'),
# then recode the class feature: '>50K' -> 1, '<=50K' -> 0
predclass_clean = str_formated_df['predclass'].str.strip().str.rstrip('.')
str_formated_df['predclass'] = predclass_clean.map({'>50K': 1, '<=50K': 0})

### Check unique values of categorical data

In [22]:
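for col in dataset_raw.select_dtypes(include='object'):
    print('-' * 50 + '\n{} unique values are: {}'.format(col, dataset_raw[col].unique()))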

--------------------------------------------------
workclass unique values are: [' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 nan ' Self-emp-inc' ' Without-pay' ' Never-worked']
--------------------------------------------------
education unique values are: [' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']
--------------------------------------------------
marital-status unique values are: [' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated' nan ' Widowed' ' Married-AF-spouse']
--------------------------------------------------
occupation unique values are: [' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service' ' Sales' ' Craft-repair' ' Transport-moving'
 ' Farming-fishing' ' Machine-op-inspct' ' Tech-support' nan
 ' Protective-serv' ' Armed-Forces

### Clear the extra white spaces in categorical data

In [25]:
# Strip surrounding whitespace from each categorical column.
# .str.strip() leaves NaN as NaN, whereas str(x).strip() would turn it into the string 'nan'.
cat_cols = ['workclass', 'education', 'marital-status', 'occupation',
            'relationship', 'race', 'sex', 'native-country']
for col in cat_cols:
    str_formated_df[col] = str_formated_df[col].str.strip()
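
The leading blanks can also be avoided at load time. Assuming the raw file separates fields with a comma followed by a space (which the leading spaces in the unique values suggest), `read_csv` can drop that space via `skipinitialspace=True`. The two rows below are made up, and the missing-value marker then becomes `"?"` without the leading space.

```python
import io
import pandas as pd

# Two made-up rows in a comma-plus-space layout
raw_text = "39, State-gov, Bachelors\n50, ?, HS-grad\n"

toy = pd.read_csv(
    io.StringIO(raw_text),
    names=['age', 'workclass', 'education'],
    skipinitialspace=True,  # drop the blank that follows each delimiter
    na_values=['?'],        # the marker no longer carries a leading space
)
print(toy['workclass'].tolist())
```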

### Create one-hot encoding for categorical variables
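
One common way to do this in pandas is `pd.get_dummies`, which expands each categorical column into one indicator column per category. The sketch below uses a small made-up frame rather than the cleaned dataset, and `encoded` is a name chosen for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [39, 50, 38],
    'sex': ['Male', 'Female', 'Male'],
    'race': ['White', 'Black', 'White'],
})

# Expand each listed categorical column into one indicator column per category
encoded = pd.get_dummies(toy, columns=['sex', 'race'], dtype=int)
print(encoded.columns.tolist())
```

Passing `drop_first=True` would drop one indicator per column, which some linear models prefer because the full set of indicators is redundant.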

### Transform the dummy-encoded data with MinMax scaling

Hint: use scikit-learn's `MinMaxScaler`.
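
Following the hint, a possible approach is to one-hot encode first and then pass the whole numeric frame through `MinMaxScaler`, which rescales each column to the range [0, 1] using (x - min) / (max - min). The toy frame and the names `encoded` and `scaled` are assumptions for this sketch, not the notebook's own solution.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy = pd.DataFrame({
    'age': [20, 40, 60],
    'hours-per-week': [10, 40, 50],
    'sex': ['Male', 'Female', 'Male'],
})

# One-hot encode the categorical column so every column is numeric
encoded = pd.get_dummies(toy, columns=['sex'], dtype=int)

# Rescale every column to [0, 1]: (x - min) / (max - min), computed per column
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(encoded),
                      columns=encoded.columns, index=encoded.index)
print(scaled)
```

Indicator columns already lie in {0, 1}, so scaling leaves them unchanged. When a separate test set is kept, the usual practice is to call `fit` on the training rows only and then `transform` both sets with the same scaler.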