# 6 - Data Transformation

## 6.1 Set Up & Data Initialization 

In [1]:
# Libraries
import pickle
from random import random

import numpy as np
import pandas as pd

from functions import *  # project helper functions

# Fix NumPy's random seed so any random steps are reproducible
np.random.seed(0)

In [2]:
df = pd.read_pickle("./original.pkl")
df.head()

Unnamed: 0,id,status_group,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,functional,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,functional,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,functional,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,non functional,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,functional,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


## 6.2 Explanation of Columns To Drop

<b> payment_type vs payment </b> - These columns are one-to-one duplicates of each other, so only one is needed for modeling.

<b> Reference Columns </b> - These columns are references for administrative purposes and will not help with modeling.

<b> Installer </b> - This categorical column has too many unique values to be meaningful, and there is not enough time to research each one.

<b> wpt_name </b> - There are too many unique values in this column, and they refer to hyper-specific locations, so binning would be nearly impossible.

<b> subvillage </b> - There are too many unique values in this column, and they refer to hyper-specific locations, so binning would be nearly impossible. In addition, 371 values are missing.

<b> quality_group </b> - This column duplicates water_quality. However, water_quality has two additional values that specify whether the pump has been abandoned, which might prove useful for prediction, so water_quality is kept.

<b> quantity_group </b> - This column duplicates the quantity column.

<b> source </b> - This is the most specific of the source columns. Very few rows fall into hand_twd (which does not appear in source_type), and it contains both other and unknown, whereas source_type contains only other and combines lakes and rivers.

<b> source_class </b> - Too generic to add value: all sources are grouped into 3 categories, one of which is unknown.

<b> waterpoint_type </b> - The only value in this column that is not in waterpoint_type_group is communal standpipe multiple, which is a small subset of communal standpipe.

<b> scheme_name </b> - There is a large number of unique values and an almost equal number of NaN values.

<b> extraction_type_group </b> - This is a more generalized version of extraction_type. Since there are no NaN values, it does not seem necessary to bin or generalize.

<b> extraction_type_class </b> - This is a more generalized version of extraction_type. Since there are no NaN values, it does not seem necessary to bin or generalize.

<b> management_group </b> - A generalized version of management that seems to lose valuable indicators.

<b> num_private </b> - No information about this column could be found.
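
The redundancy claims above can be checked before anything is dropped. The sketch below is not the code in `functions.py`; it uses a small made-up frame and a hypothetical helper, `is_one_to_one`, to show one way of confirming that two categorical columns carry the same information before removing one of them. The column values and the drop list are assumptions for illustration only.

```python
import pandas as pd


def is_one_to_one(frame: pd.DataFrame, a: str, b: str) -> bool:
    """Return True if each value of `a` maps to exactly one value of `b`, and vice versa."""
    a_to_b = frame.groupby(a)[b].nunique().max() == 1
    b_to_a = frame.groupby(b)[a].nunique().max() == 1
    return bool(a_to_b and b_to_a)


# Toy data for illustration only (not taken from original.pkl)
toy = pd.DataFrame({
    "payment": ["pay annually", "never pay", "pay per bucket", "never pay"],
    "payment_type": ["annually", "never pay", "per bucket", "never pay"],
    "water_quality": ["soft", "salty", "salty abandoned", "soft"],
    "quality_group": ["good", "salty", "salty", "good"],
})

# payment and payment_type relabel the same categories
payment_redundant = is_one_to_one(toy, "payment", "payment_type")

# quality_group merges several water_quality values, so the mapping is many-to-one
quality_redundant = is_one_to_one(toy, "water_quality", "quality_group")

# Drop the duplicate and the coarser column; errors="ignore" skips names that are absent
cols_to_drop = ["payment_type", "quality_group"]
reduced = toy.drop(columns=cols_to_drop, errors="ignore")
```

A one-to-one result supports dropping either column of the pair, while a many-to-one result means the coarser column loses detail and the choice depends on how much granularity the model needs.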

## 6.3 Explanation of Columns To Group & Encode

<b> scheme_management </b> - Bin these values into three categories: Other / Private / Government.

<b> permit </b> - Fill any NaN values with unknown.

<b> construction_year </b> - Assign new values to the years listed as 0.

<b> amount_tsh </b> - Add an aggregate column that groups these values into "low / medium / high".

<b> longitude </b> - Take the average longitude of pumps in the area, based on known longitudes.

<b> latitude </b> - Take the average latitude of pumps in the area, based on known latitudes.

<b> population </b> - Use the average population of wells in the area as the population for the well.

<b> public_meeting </b> - Fill NaN values with unknown.
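
Several of the steps above can be sketched on a toy frame. This is a rough illustration, not the contents of `proccess_data`. It rests on assumptions that the notebook does not state: that `region` is the column defining "the area", that unknown longitudes are stored as 0, that the `amount_tsh` thresholds are 0 and 1000, that the derived column is called `amount_group`, and that the `scheme_bins` mapping looks as shown.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not taken from original.pkl)
toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B"],
    "scheme_management": ["Company", "Water authority", "Trust", "Private operator", None],
    "permit": [True, None, False, True, None],
    "amount_tsh": [0.0, 25.0, 6000.0, 500.0, 0.0],
    "longitude": [34.0, 0.0, 36.0, 31.0, 0.0],
    "population": [100.0, np.nan, 300.0, 50.0, np.nan],
})

# Bin scheme_management; the mapping is hypothetical and unmapped values become "other"
scheme_bins = {
    "Company": "private",
    "Private operator": "private",
    "Water authority": "government",
}
toy["scheme_management"] = toy["scheme_management"].map(scheme_bins).fillna("other")

# Fill missing flags with an explicit "unknown" label
toy["permit"] = toy["permit"].astype(object).fillna("unknown")

# Group amount_tsh into low / medium / high (thresholds are assumed, not from the notebook)
toy["amount_group"] = pd.cut(
    toy["amount_tsh"],
    bins=[-np.inf, 0, 1000, np.inf],
    labels=["low", "medium", "high"],
)

# Treat 0 as an unknown longitude (assumption), then fill with the mean of known values per region
toy["longitude"] = toy["longitude"].replace(0.0, np.nan)
toy["longitude"] = toy["longitude"].fillna(
    toy.groupby("region")["longitude"].transform("mean")
)

# Fill missing population with the regional mean in the same way
toy["population"] = toy["population"].fillna(
    toy.groupby("region")["population"].transform("mean")
)
```

Using `groupby(...).transform("mean")` keeps the result aligned with the original rows, so it can be passed straight to `fillna` without a merge. The same pattern would apply to latitude.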

## 6.4 Test Function & Apply Processing 

In [3]:
df = proccess_data(df) 
df.head()

Unnamed: 0,status_group,amount_tsh,funder,gps_height,longitude,latitude,basin,subvillage,region,region_code,...,scheme_management,permit,construction_year,extraction_type,management,payment,water_quality,quantity,source_type,waterpoint_type_group
0,functional,6000.0,Roman,1390.0,34.938093,-9.856322,Lake Nyasa,Mnyusi B,Iringa,11,...,private,False,unknown,gravity,vwc,pay annually,soft,enough,spring,communal standpipe
1,functional,0.0,Grumeti,1399.0,34.698766,-2.147466,Lake Victoria,Nyamara,Mara,20,...,other,True,unknown,gravity,wug,never pay,soft,insufficient,rainwater harvesting,communal standpipe
2,functional,25.0,Lottery Club,686.0,37.460664,-3.821329,Pangani,Majengo,Manyara,21,...,private,True,unknown,gravity,vwc,pay per bucket,soft,enough,dam,communal standpipe
3,non functional,0.0,Unicef,263.0,38.486161,-11.155298,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,...,private,True,unknown,submersible,vwc,never pay,soft,dry,borehole,communal standpipe
4,functional,0.0,Action In A,0.0,31.130847,-1.825359,Lake Victoria,Kyanyamisa,Kagera,18,...,,True,0,gravity,other,never pay,soft,seasonal,rainwater harvesting,communal standpipe


In [4]:
pd.to_pickle(df, "./processed.pkl")
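
A quick round-trip check can confirm that a pickled frame reads back unchanged. The sketch below uses a toy frame and a temporary directory rather than `./processed.pkl`; the file name `processed_check.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the processed data
toy = pd.DataFrame({
    "status_group": ["functional", "non functional"],
    "amount_tsh": [6000.0, 0.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_check.pkl")
    pd.to_pickle(toy, path)           # write the frame
    restored = pd.read_pickle(path)   # read it back

# equals() compares values, dtypes, and index in one call
round_trip_ok = restored.equals(toy)
```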