# Scrub
This notebook handles all data cleaning.

Issues that were discovered during EDA are fixed here.

In [1]:
import pandas as pd

Read in the raw data and create a combined DataFrame by joining features and target on `id`.

In [2]:
features_df = pd.read_csv('../data/raw/X_train.csv')
target_df = pd.read_csv('../data/raw/y_train.csv')
df = features_df.set_index('id').join(target_df.set_index('id')).reset_index()
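A quick sanity check on the join can catch duplicated or unmatched ids early. The sketch below uses small toy frames (the values are hypothetical, not taken from the raw files) and `merge` with `validate='one_to_one'`, which is an alternative to the `set_index`/`join` call above rather than what this notebook runs.

```python
import pandas as pd

# Toy stand-ins for X_train.csv and y_train.csv (hypothetical values).
features_df = pd.DataFrame({'id': [1, 2, 3], 'gps_height': [1390, 0, 686]})
target_df = pd.DataFrame({
    'id': [3, 1, 2],
    'status_group': ['functional', 'non functional', 'functional'],
})

# validate='one_to_one' raises a MergeError if an id repeats in either frame.
df = features_df.merge(target_df, on='id', how='left', validate='one_to_one')

# Count feature rows that did not find a label.
unmatched = int(df['status_group'].isna().sum())
print(len(df), unmatched)
```

If `unmatched` is greater than zero, some feature rows have no label and would end up with a missing target.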

## Columns

### Label encoding target column
Create a column that holds a numerical representation of the target column. It can be used for graphs during EDA and is required for the modeling process.

In [3]:
df['status_group'] = df['status_group'].astype('category')
df['target'] = df['status_group'].cat.codes
df['target'].value_counts(normalize=True)

0    0.543081
2    0.384242
1    0.072677
Name: target, dtype: float64

In [4]:
df['status_group'].value_counts(normalize=True)

functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64
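The two outputs above line up because pandas assigns category codes in sorted order of the labels. A minimal sketch on a toy series (not the real data) that recovers the code-to-label mapping:

```python
import pandas as pd

# Toy series with the three status labels (order of appearance is arbitrary).
status = pd.Series([
    'functional',
    'non functional',
    'functional needs repair',
    'functional',
]).astype('category')

# Categories are sorted alphabetically, and codes index into that sorted list.
code_to_label = dict(enumerate(status.cat.categories))
codes = status.cat.codes.tolist()
print(code_to_label)
```

Keeping this mapping around makes it easy to translate model predictions back into readable labels once `status_group` has been dropped.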

### Dropping Useless Columns

It is unclear what 'num_private' represents. The data dictionary that came with the dataset has no description for it.

In [5]:
to_drop = ['amount_tsh', 'date_recorded', 'id', 'wpt_name', 
           'num_private', 'region_code', 'lga', 
           'ward', 'public_meeting', 'recorded_by',
           'payment', 'extraction_type_group',
           'extraction_type_class', 'management_group',
           'quality_group', 'quantity_group', 'source_type',
           'source_class', 'waterpoint_type_group', 'status_group',
           'scheme_name'
          ]

In [6]:
len(to_drop)

21

- **amount_tsh**: 70% of the values in this column are 0.
- **id**: Every value is unique, so it only identifies the row.
- **date_recorded**: The date the row was entered. It only logs when the record was created.
- **wpt_name**: The chosen name of the water well. It has no correlation to the functionality of the well.
- **num_private**: This column had no description with the dataset.
- **region_code**: Indicates the same thing as region.
- **lga**: Geographic area.
- **ward**: Another geographic area. Dropped because there are already numerous geographic area columns.
- **public_meeting**: Assumed to indicate whether people gather at the well, but the meaning is unclear and there are a lot of missing values.
- **recorded_by**: The company that recorded the information. It is the same value for the entire dataset.
- **payment**: Supposed to indicate the payment, but is just a copy of payment_type.
- **extraction_type_group**: Most of these values are copies of extraction_type.
- **extraction_type_class**: Less specific version of extraction_type_group.
- **management_group**: Highly correlated with management.
- **quality_group**: Less specific version of quality.
- **quantity_group**: Duplicate of quantity column.
- **source_type**: Less specific version of source.
- **source_class**: Much less specific version of source.
- **waterpoint_type_group**: Less specific version of waterpoint_type.
- **status_group**: Categorical version of our target column.
- **scheme_name**: Has 36% missing values and is very similar to scheme_management.
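Several of the reasons above (exact duplicates, constant columns, mostly-zero columns, coarser groupings) can be checked directly. The following is a sketch on toy data with hypothetical values, not output from the real dataset:

```python
import pandas as pd

# Toy frame mimicking a few of the columns discussed above (hypothetical values).
toy = pd.DataFrame({
    'quantity': ['enough', 'dry', 'enough', 'seasonal'],
    'quantity_group': ['enough', 'dry', 'enough', 'seasonal'],
    'recorded_by': ['recorder_a'] * 4,
    'amount_tsh': [0.0, 0.0, 25.0, 0.0],
    'source': ['spring', 'river', 'spring', 'lake'],
    'source_class': ['groundwater', 'surface', 'groundwater', 'surface'],
})

# Exact duplicate: every row has the same value in both columns.
is_duplicate = bool((toy['quantity'] == toy['quantity_group']).all())

# Constant column: only one distinct value (NaN counted as a value).
is_constant = toy['recorded_by'].nunique(dropna=False) == 1

# Share of zeros in a numeric column.
zero_share = float((toy['amount_tsh'] == 0).mean())

# Coarser grouping: each detailed value maps to exactly one group.
is_coarser = bool((toy.groupby('source')['source_class'].nunique() == 1).all())

print(is_duplicate, is_constant, zero_share, is_coarser)
```

Running the same expressions against the real `df` before the drop would confirm or refute each justification in the list.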


In [7]:
df = df.drop(to_drop, axis=1)

## Duplicates

## Zeros or Missing Values
Columns with missing values fall into two groups: columns that are missing less than 10% of their values, and columns that are missing more than 30%.

Handling:
- **funder**: Impute with the most common value (mode). This will add about 3k rows funded by the government.
- **installer**: Impute with the mode. DWE is the installer for the majority of the wells.
- **scheme_management**: Impute with the mode, because one management company has the majority of the wells.
- **permit**: Drop rows with missing values, because the column is a boolean.
- **construction_year**: Drop the affected rows. This column and the next two are missing (recorded as 0) in the same rows.
- **gps_height**: Same as construction_year.
- **population**: Same as construction_year.

In [8]:
less_missing_values = ['funder', 'installer', 'scheme_management', 'permit']
more_missing_values = ['construction_year', 'gps_height', 'population']
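The split into the two lists can be reproduced by measuring how much of each column is missing. This is a sketch on toy data (hypothetical values), assuming missing values appear as NaN in the first group and as 0 in the second:

```python
import pandas as pd

# Toy frame: NaN marks missing values in the first two columns, 0 in the last two.
toy = pd.DataFrame({
    'funder': ['gov', None, 'gov', 'other'],
    'permit': [True, False, None, True],
    'construction_year': [1999, 0, 0, 2010],
    'population': [109, 0, 0, 250],
})

# Fraction of NaN values per column.
nan_share = toy[['funder', 'permit']].isna().mean()

# Fraction of zeros per column, treating 0 as a placeholder for "missing".
zero_share = (toy[['construction_year', 'population']] == 0).mean()

# Check whether the zeros occur in exactly the same rows.
same_rows = bool(((toy['construction_year'] == 0) == (toy['population'] == 0)).all())

print(nan_share.to_dict(), zero_share.to_dict(), same_rows)
```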

In [25]:
df = df[(df['construction_year'] > 0) & (df['population'] > 0) & (df['gps_height'] > 0)]

In [24]:
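from sklearn.impute import SimpleImputer
# Mode imputation to match the plan above; the default strategy ('mean') does not work on string columns.
si = SimpleImputer(strategy='most_frequent')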
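The imputation plan for the less-missing columns is not applied anywhere in this notebook. The sketch below shows one way it could look on toy data (hypothetical values), using plain pandas `fillna` with the column mode. It is an illustration, not the notebook's actual pipeline; scikit-learn's `SimpleImputer` with `strategy='most_frequent'` is an alternative route to the same kind of result.

```python
import numpy as np
import pandas as pd

# Toy frame with NaN where values are missing (hypothetical values).
toy = pd.DataFrame({
    'funder': ['gov', np.nan, 'gov', 'other', 'gov'],
    'installer': ['DWE', 'DWE', np.nan, 'other', 'DWE'],
    'permit': [True, False, True, np.nan, True],
})

# permit: drop rows where the boolean flag is missing.
toy = toy.dropna(subset=['permit']).copy()

# funder / installer: fill remaining gaps with each column's most common value.
for col in ['funder', 'installer']:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```

Computing the mode after the row drops means the fill value reflects only the rows that are actually kept.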

## Outliers

## Save to CSV

In [26]:
df.to_csv('../data/clean/tanzania.csv')