# Data Processing: Phase 1

This is the first step in model development.

It includes the initial steps that precede data analysis, such as:

1. Rename features.
2. Change data types.
3. Remove unwanted features.
4. Impute missing values.
5. Deal with duplicates.
6. Apply logical validations.

The next step is Data Analysis.

# Import

Import all the needed modules

In [None]:
import os
import json
import re
import pandas as pd
import numpy as np

# Init

In [None]:
# pandas columns setting
pd.set_option('display.max_columns', 30)

# disable warning
import warnings
warnings.filterwarnings('ignore')

# notebook width
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

# Load data

In [None]:
file_path = r'/path/...'

# read data
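
The read step is left open in this template. A minimal sketch, assuming the source is a comma-separated file with a header row; the in-memory sample and its column names are hypothetical stand-ins for the real file at `file_path`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file; with real data, pass file_path
# to pd.read_csv instead of this in-memory buffer.
sample_csv = io.StringIO(
    "CustomerId,SignupDate,is_active\n"
    "1,2020-01-05,1\n"
    "2,2020-02-11,0\n"
)

# Assumption: the data is comma-separated with a header row.
df = pd.read_csv(sample_csv)
print(df.shape)
```

Other formats have their own readers (for example `pd.read_excel` or `pd.read_parquet`), so the call should match the actual file type.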

In [None]:
# first rows of the data
df.head()

In [None]:
# get info on data types, size and missing values
df.info()

In [None]:
# convert column names to snake_case (PEP 8 style); names that already contain '_' are only lowercased
df.columns = [col.lower() if '_' in col else re.sub(r'(?<!^)(?=[A-Z])', '_', col).lower() for col in df.columns]
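
The renaming rule can be checked on a few sample names before it is applied to the dataframe. The sketch below uses a hypothetical helper, `to_snake_case`, which is a variant of the rule that inserts an underscore before every uppercase letter except the first character, so leading lowercase characters are kept. The sample names are illustrative only.

```python
import re


def to_snake_case(name):
    """Convert a CamelCase or camelCase name to snake_case."""
    if '_' in name:
        # assumed to be snake-like already, only lowercase it
        return name.lower()
    # insert '_' before every uppercase letter that is not the first character
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()


examples = ['JobActualActiveDays', 'customerId', 'Already_Snake', 'plain']
converted = [to_snake_case(name) for name in examples]
print(converted)
```

Note that runs of capitals are split letter by letter (for example `ID` becomes `i_d`), so acronyms may need a manual rename afterwards.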

In [None]:
# rename a single column (e.g. jobactualactivedays)
df.rename(columns={'original_col': 'new_col'}, inplace=True)

# Data Validation

Data validation consists of the following steps:
1. Search for unneeded features.
2. Check data type correctness.
3. Search for missing values.
4. Check for duplicates in the data set.
5. Logical validation.

In [None]:
# info on the dataframe after renaming the columns
df.info()

## Remove features

Some features were extracted from the database but are not needed in the next model development steps.
Reducing the number of features can contribute substantially to the success of model development.

In [None]:
# drop unneeded features
df.drop(['col1', 'col2'], axis=1, inplace=True)

## Data Types

Convert to appropriate data types.

In [None]:
# IDs (numeric to object/str)
df.col_id = df.col_id.apply(lambda x: str(int(x)))

# to int
df.some_col = df.some_col.astype('int')

# to object
df.some_col = df.some_col.astype('object')

# bool to numeric
df.bool_col = pd.to_numeric(df.bool_col)

# datetime
df.datetime_col = pd.to_datetime(df.datetime_col)
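
Conversions like these raise an error as soon as one value cannot be parsed. A hedged sketch of a more tolerant variant: `errors='coerce'` turns unparseable values into `NaN` or `NaT`, so they show up later in the missing values step. The sample dataframe and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sample with the kinds of columns converted above.
df = pd.DataFrame({
    'col_id': [101.0, 102.0, 103.0],
    'some_col': ['1', '2', 'oops'],
    'datetime_col': ['2021-01-01', '2021-02-15', 'not a date'],
})

# IDs: numeric to string, so they are not treated as quantities
df['col_id'] = df['col_id'].astype('int64').astype(str)

# unparseable values become NaN / NaT instead of raising an error
df['some_col'] = pd.to_numeric(df['some_col'], errors='coerce')
df['datetime_col'] = pd.to_datetime(df['datetime_col'], errors='coerce')

print(df.dtypes)
```

Coercion hides bad values rather than fixing them, so it is worth counting the new nulls right after the conversion.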


## Missing values

Missing values are common in most datasets.
The following steps can be applied:
1. Too many missing values in some features: drop the columns.
2. Some rows have many missing values: drop the rows.
3. Inconsistent missing values: fill them using some strategy, such as a constant value, the mean, quantiles, or an advanced imputation method.
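
The three steps above can be sketched as follows. The sample data and both thresholds are illustrative assumptions, not recommended values; the median is used only as one example of a simple fill strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the thresholds below are illustrative assumptions.
df = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
    'feature_a': [1.0, np.nan, 3.0, 4.0],
    'feature_b': [10.0, np.nan, np.nan, 40.0],
})

max_col_missing = 0.5  # drop columns with more than 50% missing values
max_row_missing = 0.5  # drop rows with more than 50% missing values

# 1. drop columns with too many missing values
col_missing = df.isnull().mean()
df = df.loc[:, col_missing <= max_col_missing]

# 2. drop rows with too many missing values
row_missing = df.isnull().mean(axis=1)
df = df.loc[row_missing <= max_row_missing]

# 3. fill what is left with a simple strategy (here: the column median)
df = df.fillna(df.median())

print(df)
```

The order matters: dropping sparse columns first changes the per-row missing share used in the second step.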

In [None]:
# fraction of missing values per feature
df.isnull().mean().sort_values(ascending=False).head(30)

In [None]:
# fraction of missing values (null or empty string) per categorical feature
object_columns = df.select_dtypes(include='object')
(1 - (object_columns.apply(lambda x: x[x != ''].count(),
                           axis=0) / object_columns.shape[0])).sort_values(ascending=False).head(5)

__Fast Imputing__

In [None]:
# fill null values with a constant value
cols_to_fill = ['col1', 'col2']
df.loc[:, df.columns.isin(cols_to_fill)] = df.loc[:, df.columns.isin(cols_to_fill)].fillna(0)

# fill a single column; `const` is a placeholder for the chosen fill value
df.loc[df.col_with_nulls.isnull(), 'col_with_nulls'] = const

# empty categories
df.loc[df.cat_col == '', 'cat_col'] = 'Missing'

__Complex Imputing__

In [None]:
# some complex null values imputing, such as means or quantiles values.
# knn imputing
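
One possible form of a more complex strategy is to fill each missing value with a statistic of a related group instead of a global constant. The sketch below uses a group median; the `group` and `value` columns are hypothetical. For KNN imputing, scikit-learn's `KNNImputer` is one option, not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 'value' has one missing entry in each group.
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

# median of 'value' within each group, aligned to the original rows
group_median = df.groupby('group')['value'].transform('median')

# fill each missing value with the median of its own group
df['value'] = df['value'].fillna(group_median)

print(df)
```

Replacing `'median'` with `'mean'`, or with a lambda that calls `quantile`, gives the mean and quantile variants mentioned above.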

## Duplications

There are two possible types of duplicates:
1. The entire row is duplicated.
2. Duplication according to some business logic (atomic features).

Duplicates of type 1 should be dropped.

Duplicates of type 2 should be investigated to find their origin.

__Duplicated rows__

In [None]:
# count fully duplicated rows
duplications_count = df.duplicated().sum()
print('Number of duplicated rows: {}'.format(duplications_count))
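
Dropping the fully duplicated rows could look like the following sketch. The sample dataframe is hypothetical, and keeping the first occurrence is an assumption; `keep='last'` works the same way.

```python
import pandas as pd

# Hypothetical sample: the first two rows are identical.
df = pd.DataFrame({'key': [1, 1, 2, 3, 3], 'value': ['a', 'a', 'b', 'c', 'd']})

rows_before = df.shape[0]

# keep the first occurrence of each fully duplicated row
df = df.drop_duplicates(keep='first').reset_index(drop=True)

print('Dropped {} duplicated rows'.format(rows_before - df.shape[0]))
```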

__Atomic features duplications__

In [None]:
# atomic features (here: cycle and job); each combination should appear only once
granularity_cols = ['granularity_col1', 'granularity_col2']
cycle_job_dups = df.groupby(granularity_cols).size().reset_index(name='count')

cycle_job_dups.sort_values('count', ascending=False).head()
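
To investigate where such duplicates come from, it helps to look at the full rows and not only at the counts. A minimal sketch on hypothetical data: `keep=False` marks every member of a duplicated group, so the conflicting rows can be compared side by side.

```python
import pandas as pd

# Hypothetical sample: the first two rows share the same atomic key.
df = pd.DataFrame({
    'granularity_col1': [1, 1, 2, 2],
    'granularity_col2': ['x', 'x', 'y', 'z'],
    'value': [10, 11, 20, 30],
})
granularity_cols = ['granularity_col1', 'granularity_col2']

# keep=False marks every member of a duplicated group, not only the repeats
atomic_dups = df[df.duplicated(subset=granularity_cols, keep=False)]

print(atomic_dups.sort_values(granularity_cols))
```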

## Logical validation

The following logical rules will be tested:

1. Negative values.
2. Unique values.
3. Domain knowledge values.


__Some logical rule__

In [None]:
# some logical rule
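
As an illustration of what a domain rule might look like, the sketch below checks that an interval does not end before it starts. The rule and the `start_date` and `end_date` columns are hypothetical and stand in for whatever the real domain requires.

```python
import pandas as pd

# Hypothetical sample: the second row ends before it starts.
df = pd.DataFrame({
    'start_date': pd.to_datetime(['2021-01-01', '2021-03-10', '2021-05-01']),
    'end_date': pd.to_datetime(['2021-01-31', '2021-03-01', '2021-05-01']),
})

# rule: an interval cannot end before it starts
violations = df[df['end_date'] < df['start_date']]

print('Rule violations: {0:.2%}'.format(violations.shape[0] / df.shape[0]))
```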


__Negative values__

In [None]:
# columns with negative values
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerics_columns = df.select_dtypes(include=numerics)
numerics_columns.columns[(numerics_columns < 0).any()].tolist()

In [None]:
# share of negative values in col_with_negatives
col_negative_vals = (df.col_with_negatives < 0).mean()
print('col_with_negatives share of negative values - {0:.2%}'.format(col_negative_vals))

In [None]:
# show the first negative values of col_with_negatives
df.loc[df.col_with_negatives<0, 'col_with_negatives'].head(10)

__Unique values__

In [None]:
df.some_col.unique()
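
Listing the unique values is most useful when they are compared against the values the domain allows. A sketch under the assumption that such a set is known; `allowed_values` and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical sample with an inconsistent spelling and a missing value.
df = pd.DataFrame({'some_col': ['low', 'high', 'Low', 'medium', None]})
allowed_values = {'low', 'medium', 'high'}  # hypothetical domain

# dropna=False also shows how many values are missing
print(df['some_col'].value_counts(dropna=False))

# values that appear in the data but are not part of the allowed set
unexpected = set(df['some_col'].dropna().unique()) - allowed_values
print(unexpected)
```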

__....__

# Quick EDA

## Continuous data

In [None]:
# numeric data types
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

# numeric columns
numerics_columns_types = df.select_dtypes(include=numerics).columns

# exclude boolean and discrete columns
bool_cols = ['is_', 'has_']
discrete_cols = ['col1']

exclude_fun = lambda y: any(x in y for x in bool_cols + discrete_cols)
numeric_cols = [col for col in numerics_columns_types if not exclude_fun(col)]

In [None]:
df.loc[:, numeric_cols].iloc[:, 0:10].describe()

In [None]:
df.loc[:, numeric_cols].iloc[:, 10:16].describe()

__Continuous data conclusions:__

* conclusion 1

__Continuous data Operations:__

In [None]:
# action 1

In [None]:
# reset index
df.reset_index(drop=True, inplace=True)

## Boolean data

In [None]:
# numeric data types
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

# numeric columns
numerics_columns_types = df.select_dtypes(include=numerics).columns

# select boolean columns by name pattern
is_boolean_col = lambda y: any(x in y for x in ['is_', 'has_', 'have'])
boolean_cols = [col for col in numerics_columns_types if is_boolean_col(col)]

In [None]:
df[boolean_cols].iloc[:, :10].describe()

In [None]:
df[boolean_cols].iloc[:, 10:].describe()

__Boolean data conclusions:__

* conclusion 1

__Boolean data Operations:__

In [None]:
# action 1

## Discrete data

In [None]:
discrete_cols = ['col1']
df[discrete_cols].describe()

__Discrete data conclusions:__

* conclusion 1

__Discrete data Operations:__

In [None]:
# action 1

## Categorical data

In [None]:
df.describe(include='object')

__Categorical data conclusions:__

* conclusion 1

__Categorical data Operations:__

In [None]:
# action 1

## Datetime data

In [None]:
df.describe(include='datetime')

__Datetime data conclusions:__

* conclusion 1

__Datetime data Operations:__

In [None]:
# action 1

In [None]:
df.describe(include='datetime')

# Save Processed data

In [None]:
# save path
save_path = ''

# save
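
The save step is left open in this template. A minimal sketch, assuming CSV output; the temporary directory and the file name `processed_data.csv` are hypothetical stand-ins for `save_path`, and the sample dataframe is illustrative only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical processed data.
df = pd.DataFrame({'col_id': ['101', '102'], 'some_col': [1, 2]})

# Hypothetical location; in the notebook this would come from save_path.
save_dir = tempfile.mkdtemp()
save_file = os.path.join(save_dir, 'processed_data.csv')

# index=False avoids writing the row index as an extra column
df.to_csv(save_file, index=False)

# read back to confirm; the dtype argument keeps the ID column as text
restored = pd.read_csv(save_file, dtype={'col_id': str})
print(restored.head())
```

CSV does not store data types, so ID and datetime columns have to be converted again on load. A binary format such as pickle or parquet keeps them, if that matters for the next phase.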