# CSMODEL_Test

### Data: Philippine Family Income and Expenditures
Data Size: 41544 rows x 60 columns<br>

The dataset was sourced by Francis Paul Flores from data provided by the Philippine Statistics Authority (PSA) in 2017. According to Flores, the PSA conducts the Family Income and Expenditure Survey nationwide every three years. The survey aims to provide data on how family income affects the expenditure on and consumption of goods and services by families in the country.
<br><br>
**Sources:** <br> 
Dataset - https://www.kaggle.com/grosvenpaul/family-income-and-expenditure <br>
Combining Columns - https://www.statology.org/pandas-combine-two-columns/ <br>
Other functions - https://pandas.pydata.org/pandas-docs/stable/reference/index.html <br>


In [1]:
#Import Libraries
import pandas as pd
import numpy as np

In [2]:
#Import Data
ph_df = pd.read_csv("philippine_family_income_expenditure.csv")
ph_df.shape

(41544, 60)

In [3]:
#Please uncomment the line of code to be used.

#Check contents of ph_df
#ph_df.info()

#Check if there exists null data points in the dataframe
#ph_df.isnull().any()

## 1
### Checking for null values

Checking whether null values exist in any column and, if so, whether they occur in an insignificant quantity.

In [4]:
#Count number of null values
ph_df_nan=ph_df.columns[ph_df.isnull().any()].tolist()
ph_df[ph_df_nan].isnull().sum()

Household Head Occupation         7536
Household Head Class of Worker    7536
dtype: int64
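
To judge whether these counts are insignificant, it can help to express them as a share of all rows: 7536 of 41544 rows is roughly 18%. The sketch below shows one way to compute that share. It runs on a small toy frame rather than the real `ph_df`; the toy values and the `Example Numeric Column` name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ph_df (hypothetical values, not the real survey data)
toy_df = pd.DataFrame({
    "Household Head Occupation": ["Farmer", np.nan, "Teacher", np.nan],
    "Household Head Class of Worker": ["Self-employed", np.nan, "Government", np.nan],
    "Example Numeric Column": [100, 200, 300, 400],
})

# Count nulls per column and keep only the columns that have any
null_counts = toy_df.isnull().sum()
null_counts = null_counts[null_counts > 0]

# Express each count as a fraction of the total number of rows
null_summary = pd.DataFrame({
    "null_count": null_counts,
    "null_share": null_counts / len(toy_df),
})
print(null_summary)
```

Applied to `ph_df`, the same steps would report both affected columns with the same share, since each has 7536 nulls.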

In [5]:
#Checking datatype of 'Household Head Occupation'
type(ph_df['Household Head Occupation'].loc[0])

str

In [6]:
#Checking datatype of 'Household Head Class of Worker'
type(ph_df['Household Head Class of Worker'].loc[0])

str
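
Note that `type(...loc[0])` only inspects the first row. As a hedged alternative, the sketch below tallies the Python type of every value in a column, which also makes missing values visible (pandas stores them as `float` NaN in an object column). The toy series is hypothetical and stands in for a column such as `ph_df['Household Head Occupation']`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one column of ph_df (hypothetical values)
occupation = pd.Series(["Farmer", np.nan, "Teacher", np.nan], dtype=object)

# Map each value to the name of its Python type, then count occurrences
type_counts = occupation.map(lambda value: type(value).__name__).value_counts()
print(type_counts)
```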

Because the number of null values is large, rows with a missing 'Household Head Occupation' or 'Household Head Class of Worker' won't be omitted; the missing values will instead be replaced with the string "None".

In [7]:
#Setting mentioned columns with null values to "None"
ph_df.loc[ph_df['Household Head Occupation'].isnull(),'Household Head Occupation'] = "None"
ph_df.loc[ph_df['Household Head Class of Worker'].isnull(),'Household Head Class of Worker'] = "None"
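
The same replacement could also be written with `DataFrame.fillna`, which handles both columns in one statement. This is only a sketch on a toy frame with made-up values; the column names match the ones above. The fill value is the string `"None"`, not Python's `None` object.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ph_df (hypothetical values, not the real survey data)
toy_df = pd.DataFrame({
    "Household Head Occupation": ["Farmer", np.nan, "Teacher"],
    "Household Head Class of Worker": ["Self-employed", np.nan, "Government"],
})

# Columns whose missing values should become the string "None"
cols = ["Household Head Occupation", "Household Head Class of Worker"]
toy_df[cols] = toy_df[cols].fillna("None")

# Confirm that no nulls remain in those columns
print(toy_df[cols].isnull().sum())
```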

In [8]:
#Recheck if there exists null value on any column
#ph_df.isnull().any()

## 2
### Checking for columns of similar nature

Notice that there are columns that indicate commonality: <br>
**'Number of Car, Jeep, Van', 'Number of Motorized Banca', and 'Number of Motorcycle/Tricycle'** - All of which belong to Personal Mobility or Personal Vehicles. <br>
**'Alcoholic Beverages Expenditure' and 'Tobacco Expenditure'** - Both of which belong to Unhealthy Lifestyle Products.
<br><br>
Each group of columns can then be replaced with a single new column that holds the sum of the group.


In [10]:
#Creating New Columns
ph_df['Number of Vehicles'] = ph_df['Number of Car, Jeep, Van'] + ph_df['Number of Motorized Banca'] + ph_df['Number of Motorcycle/Tricycle']
ph_df['Sin Goods Expenditure'] = ph_df['Alcoholic Beverages Expenditure'] + ph_df['Tobacco Expenditure']

#Removing the original columns; drop() returns a new dataframe, so assign it back to ph_df
ph_df = ph_df.drop(columns=['Number of Car, Jeep, Van','Number of Motorized Banca','Number of Motorcycle/Tricycle','Alcoholic Beverages Expenditure','Tobacco Expenditure']).copy()

#Validate that changes did occur
ph_df.info()

KeyError: 'Number of Car, Jeep, Van'
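
The `KeyError` above is consistent with the cell having been run a second time, after the source columns were already dropped. This is an assumption, since the notebook does not say. A minimal sketch of a re-runnable version follows. The helper `combine_columns` is hypothetical, and the toy frame uses made-up values.

```python
import pandas as pd

# Toy stand-in for ph_df (hypothetical values, not the real survey data)
toy_df = pd.DataFrame({
    "Number of Car, Jeep, Van": [1, 0],
    "Number of Motorized Banca": [0, 1],
    "Number of Motorcycle/Tricycle": [2, 0],
})


def combine_columns(df, sources, target):
    """Sum `sources` into `target` and drop them; return df unchanged if any source is missing."""
    if not all(col in df.columns for col in sources):
        return df  # already combined (or columns absent), so nothing to do
    out = df.copy()
    # Note: sum(axis=1) skips NaN by default, unlike chained `+`
    out[target] = out[sources].sum(axis=1)
    return out.drop(columns=sources)


vehicle_cols = [
    "Number of Car, Jeep, Van",
    "Number of Motorized Banca",
    "Number of Motorcycle/Tricycle",
]
toy_df = combine_columns(toy_df, vehicle_cols, "Number of Vehicles")
# Running the same step again no longer raises a KeyError
toy_df = combine_columns(toy_df, vehicle_cols, "Number of Vehicles")
print(toy_df)
```

Restarting the kernel and running all cells in order once would avoid the error just as well. The guard only matters when cells are re-run interactively.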

Once these changes have been applied, it should be much simpler to clean the data within the remaining columns.