# CSMODEL_Test

### Data: Philippine Family Income and Expenditures
Data Size: 41544 rows x 60 columns<br>

The dataset was sourced by Francis Paul Flores from the Philippine Statistics Authority (PSA) in 2017. According to Flores, the PSA conducts the Family Income and Expenditure Survey nationwide every three years. The survey aims to provide data on how family income affects expenditure and the consumption of goods and services by families in the country.
<br><br>
**Sources:** <br> 
Dataset - https://www.kaggle.com/grosvenpaul/family-income-and-expenditure <br>
Combining Columns - https://www.statology.org/pandas-combine-two-columns/ <br>
Other functions - https://pandas.pydata.org/pandas-docs/stable/reference/index.html <br>


In [1]:
#Import Libraries
import pandas as pd
import numpy as np

In [2]:
#Import Data
ph_df = pd.read_csv("philippine_family_income_expenditure.csv")
ph_df.shape

(41544, 60)

In [3]:
#Please uncomment the line of code to be used.

#Check contents of ph_df
#ph_df.info()

#Check whether any column in the dataframe contains null values
#ph_df.isnull().any()

## 1
### Checking for null values

Checking whether null values exist in any column, and whether they occur in a significant quantity.

In [4]:
#Count number of null values
ph_df_nan=ph_df.columns[ph_df.isnull().any()].tolist()
ph_df[ph_df_nan].isnull().sum()

Household Head Occupation         7536
Household Head Class of Worker    7536
dtype: int64
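The counts above are easier to judge as a share of all rows: 7536 of 41544 is roughly 18%. A minimal sketch of that calculation follows, run on a small stand-in frame. The name `toy_df` and its values are hypothetical and are not taken from the survey.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ph_df (hypothetical values, not survey data)
toy_df = pd.DataFrame({
    "Total Household Income": [480332, 198235, 82785, 107589],
    "Household Head Occupation": ["Rice farmers", np.nan, "Secretaries", np.nan],
})

# isnull() yields booleans; their column-wise mean is the fraction of nulls
null_share = toy_df.isnull().mean() * 100

# Keep only the columns that actually contain nulls
null_share = null_share[null_share > 0]
print(null_share)
```

The same two lines should work on `ph_df` unchanged, since they only rely on `isnull()` and `mean()`.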

In [5]:
#Checking datatype of 'Household Head Occupation'
type(ph_df['Household Head Occupation'].loc[0])

str

In [6]:
#Checking datatype of 'Household Head Class of Worker'
type(ph_df['Household Head Class of Worker'].loc[0])

str

Because of the large number of null values, the rows with a missing 'Household Head Occupation' or 'Household Head Class of Worker' won't be omitted; the nulls will instead be replaced with the string "None".

In [7]:
#Setting mentioned columns with null values to "None"
ph_df.loc[ph_df['Household Head Occupation'].isnull(),'Household Head Occupation'] = "None"
ph_df.loc[ph_df['Household Head Class of Worker'].isnull(),'Household Head Class of Worker'] = "None"
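An equivalent way to do the replacement is `fillna`, which handles both columns in one statement. The sketch below uses a toy frame (`toy_df` and its values are hypothetical) and ends with an assertion in place of a visual recheck.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ph_df (hypothetical values, not survey data)
toy_df = pd.DataFrame({
    "Household Head Occupation": ["Rice farmers", np.nan, "Secretaries"],
    "Household Head Class of Worker": [
        "Worked for private establishment", np.nan, "Worked for private household",
    ],
})

cols = ["Household Head Occupation", "Household Head Class of Worker"]

# Replace every null in the selected columns with the string "None"
toy_df[cols] = toy_df[cols].fillna("None")

# Fails loudly if any null survived the replacement
assert not toy_df[cols].isnull().any().any()
print(toy_df)
```

Note that "None" here is an ordinary string, not Python's `None`, so these rows will later show up as their own category.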

In [8]:
#Recheck whether any column still contains null values
#ph_df.isnull().any()

## 2
### Checking for columns of similar nature

Notice that some columns share a common theme: <br>
**'Number of Car, Jeep, Van', 'Number of Motorized Banca', and 'Number of Motorcycle/Tricycle'** - All of which belong to Personal Mobility or Personal Vehicles. <br>
**'Alcoholic Beverages Expenditure' and 'Tobacco Expenditure'** - Both of which belong to Unhealthy Lifestyle Products.
<br><br>
Each group can then be replaced with a single new column holding its sum: 'Number of Vehicles' and 'Sin Goods Expenditure'.


In [9]:
#Creating New Columns
ph_df['Number of Vehicles'] = ph_df['Number of Car, Jeep, Van'] + ph_df['Number of Motorized Banca'] + ph_df['Number of Motorcycle/Tricycle']
ph_df['Sin Goods Expenditure'] = ph_df['Alcoholic Beverages Expenditure'] + ph_df['Tobacco Expenditure']

#Removing the original columns; drop() returns a new DataFrame, so assign the result back to ph_df
ph_df = ph_df.drop(columns=['Number of Car, Jeep, Van','Number of Motorized Banca','Number of Motorcycle/Tricycle','Alcoholic Beverages Expenditure','Tobacco Expenditure']).copy()

#Validate that changes did occur
ph_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41544 entries, 0 to 41543
Data columns (total 57 columns):
 #   Column                                         Non-Null Count  Dtype 
---  ------                                         --------------  ----- 
 0   Total Household Income                         41544 non-null  int64 
 1   Region                                         41544 non-null  object
 2   Total Food Expenditure                         41544 non-null  int64 
 3   Main Source of Income                          41544 non-null  object
 4   Agricultural Household indicator               41544 non-null  int64 
 5   Bread and Cereals Expenditure                  41544 non-null  int64 
 6   Total Rice Expenditure                         41544 non-null  int64 
 7   Meat Expenditure                               41544 non-null  int64 
 8   Total Fish and  marine products Expenditure    41544 non-null  int64 
 9   Fruit Expenditure                              41544 non-null

Now that the related columns have been merged, it should be simpler to clean the data within the remaining columns.
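A lighter check than scanning the `info()` listing is to assert that the new column exists and the old ones are gone. The sketch below does this on a toy frame (`toy_df` and its values are hypothetical). It sums with `sum(axis=1)` rather than `+`; the two agree when no values are missing, but `sum` skips nulls while `+` propagates them.

```python
import pandas as pd

# Toy stand-in for the three vehicle columns (hypothetical values)
toy_df = pd.DataFrame({
    "Number of Car, Jeep, Van": [0, 1, 2],
    "Number of Motorized Banca": [0, 0, 1],
    "Number of Motorcycle/Tricycle": [1, 2, 0],
})
vehicle_cols = list(toy_df.columns)

# Row-wise total of the grouped columns
toy_df["Number of Vehicles"] = toy_df[vehicle_cols].sum(axis=1)

# drop() returns a new frame, so the result is assigned back
toy_df = toy_df.drop(columns=vehicle_cols)

# Validate the change without printing the whole frame
assert "Number of Vehicles" in toy_df.columns
assert not set(vehicle_cols) & set(toy_df.columns)
print(toy_df)
```

On the full dataset the column count gives the same confirmation: 60 original columns, plus 2 new ones, minus 5 dropped, leaves the 57 reported above.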

## 3
### Determining Object Values

Most of the columns are of type int64, which is in line with the dataset consisting mostly of income and expenditure figures. However, some columns hold str values.<br>

### Columns with str datatype:
1. Region<br>
2. Main Source of Income<br>
3. Household Head Sex<br>
4. Household Head Marital Status<br>
5. Household Head Highest Grade Completed<br>
6. Household Head Job or Business Indicator<br>
7. Household Head Occupation<br>
8. Household Head Class of Worker<br>
9. Type of Household<br>
10. Type of Building/House<br>
11. Type of Roof<br>
12. Type of Walls<br>
13. Tenure Status<br>
14. Toilet Facilities<br>
15. Main Source of Water Supply<br>
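The list above can also be produced programmatically rather than read off the `info()` output. This is a minimal sketch on a toy frame (`toy_df` and its values are hypothetical); it selects every non-numeric column and then prints the unique values of each, which is what the following cells do one column at a time.

```python
import pandas as pd

# Toy stand-in for ph_df (hypothetical values, not survey data)
toy_df = pd.DataFrame({
    "Total Household Income": [480332, 198235, 82785],
    "Region": ["CAR", "NCR", "CAR"],
    "Household Head Sex": ["Female", "Male", "Male"],
})

# Everything that is not numeric is treated as a text column here
str_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(str_cols)

# Inspect the distinct values of each text column in one pass
for col in str_cols:
    print(col, toy_df[col].unique())
```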

In [10]:
ph_df['Region'].unique()

array(['CAR', 'Caraga', 'VI - Western Visayas', 'V - Bicol Region',
       ' ARMM', 'III - Central Luzon', 'II - Cagayan Valley',
       'IVA - CALABARZON', 'VII - Central Visayas',
       'X - Northern Mindanao', 'XI - Davao Region',
       'VIII - Eastern Visayas', 'I - Ilocos Region', 'NCR',
       'IVB - MIMAROPA', 'XII - SOCCSKSARGEN',
       'IX - Zasmboanga Peninsula'], dtype=object)

In [11]:
ph_df['Main Source of Income'].unique()

array(['Wage/Salaries', 'Other sources of Income',
       'Enterpreneurial Activities'], dtype=object)

In [12]:
ph_df['Household Head Sex'].unique()

array(['Female', 'Male'], dtype=object)

In [13]:
ph_df['Household Head Marital Status'].unique()

array(['Single', 'Married', 'Widowed', 'Divorced/Separated', 'Annulled',
       'Unknown'], dtype=object)

In [14]:
ph_df['Household Head Highest Grade Completed'].unique()

array(['Teacher Training and Education Sciences Programs',
       'Transport Services Programs', 'Grade 3', 'Elementary Graduate',
       'Second Year High School', 'Third Year High School',
       'Business and Administration Programs', 'First Year College',
       'High School Graduate',
       'Other Programs in Education at the Third Level, First Stage, of the Type that Leads to an Award not Equivalent to a First University or Baccalaureate Degree',
       'Humanities Programs', 'First Year High School', 'Grade 6',
       'Grade 4', 'Engineering and Engineering Trades Programs',
       'Grade 2', 'Grade 5', 'Social and Behavioral Science Programs',
       'Agriculture, Forestry, and Fishery Programs', 'Health Programs',
       'Fourth Year College',
       'Engineering and Engineering trades Programs',
       'Second Year College', 'Third Year College', 'Grade 1',
       'No Grade Completed', 'Security Services Programs',
       'Basic Programs', 'First Year Post Secondary',
      

In [15]:
ph_df['Household Head Occupation'].unique()

array(['General elementary education teaching professionals',
       'Transport conductors', 'Farmhands and laborers', 'Rice farmers',
       'General managers/managing proprietors in transportation, storage and communications',
       'Heavy truck and lorry drivers', 'None', 'Hog raising farmers',
       'Vegetable farmers',
       'General managers/managing proprietors in wholesale and retail trade',
       'Stocks clerks', 'Justices', 'Other social science professionals',
       'Protective services workers n. e. c.', 'Secretaries',
       'Electronics mechanics and servicers',
       'Foresters and related scientists',
       'Shop salespersons and demonstrators',
       'College, university and higher education teaching professionals',
       'General managers/managing proprietors of restaurants and hotels',
       'Welders and flamecutters', 'Car, taxi and van drivers',
       'Motor vehicle mechanics and related trades workers',
       'Traditional chiefs and heads of villages',

In [16]:
ph_df['Household Head Class of Worker'].unique()

array(['Worked for government/government corporation',
       'Worked for private establishment',
       'Employer in own family-operated farm or business',
       'Self-employed wihout any employee', 'None',
       'Worked without pay in own family-operated farm or business',
       'Worked for private household',
       'Worked with pay in own family-operated farm or business'],
      dtype=object)

In [17]:
ph_df['Type of Household'].unique()

array(['Extended Family', 'Single Family',
       'Two or More Nonrelated Persons/Members'], dtype=object)

In [18]:
ph_df['Type of Building/House'].unique()

array(['Single house', 'Duplex',
       'Commercial/industrial/agricultural building',
       'Multi-unit residential', 'Institutional living quarter',
       'Other building unit (e.g. cave, boat)'], dtype=object)

In [19]:
ph_df['Type of Roof'].unique()

array(['Strong material(galvanized,iron,al,tile,concrete,brick,stone,asbestos)',
       'Light material (cogon,nipa,anahaw)',
       'Mixed but predominantly strong materials',
       'Mixed but predominantly light materials',
       'Salvaged/makeshift materials',
       'Mixed but predominantly salvaged materials', 'Not Applicable'],
      dtype=object)

In [20]:
ph_df['Type of Walls'].unique()

array(['Strong', 'Light', 'Quite Strong', 'Very Light', 'Salvaged',
       'NOt applicable'], dtype=object)

In [21]:
ph_df['Tenure Status'].unique()

array(['Own or owner-like possession of house and lot',
       'Rent-free house and lot with consent of owner',
       'Own house, rent-free lot with consent of owner',
       'Own house, rent-free lot without consent of owner',
       'Own house, rent lot', 'Rent house/room including lot',
       'Not Applicable',
       'Rent-free house and lot without consent of owner'], dtype=object)

In [22]:
ph_df['Toilet Facilities'].unique()

array(['Water-sealed, sewer septic tank, used exclusively by household',
       'Water-sealed, sewer septic tank, shared with other household',
       'Closed pit',
       'Water-sealed, other depository, used exclusively by household',
       'Open pit',
       'Water-sealed, other depository, shared with other household',
       'None', 'Others'], dtype=object)

In [23]:
ph_df['Main Source of Water Supply'].unique()

array(['Own use, faucet, community water system',
       'Shared, faucet, community water system',
       'Shared, tubed/piped deep well', 'Own use, tubed/piped deep well',
       'Protected spring, river, stream, etc', 'Tubed/piped shallow well',
       'Lake, river, rain and others',
       'Unprotected spring, river, stream, etc', 'Dug well', 'Others',
       'Peddler'], dtype=object)
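The outputs above show a few inconsistencies that a later cleaning step may want to address: ' ARMM' carries a leading space, 'NOt applicable' is cased differently from the 'Not Applicable' used in other columns, and 'Engineering and Engineering Trades Programs' appears twice with different casing. The sketch below shows one possible way to normalize the first two on a toy frame. `toy_df` is hypothetical, and the choice of 'Not Applicable' as the target spelling is an assumption.

```python
import pandas as pd

# Toy stand-in holding the problem values seen in the outputs above
toy_df = pd.DataFrame({
    "Region": ["CAR", " ARMM", "NCR"],
    "Type of Walls": ["Strong", "NOt applicable", "Light"],
})

# Remove leading and trailing whitespace from every region label
toy_df["Region"] = toy_df["Region"].str.strip()

# Map the odd spelling to the one used by the other columns (assumed target)
toy_df["Type of Walls"] = toy_df["Type of Walls"].replace(
    {"NOt applicable": "Not Applicable"}
)

print(toy_df["Region"].unique())
print(toy_df["Type of Walls"].unique())
```

Misspellings that come from the source data, such as 'IX - Zasmboanga Peninsula', could be handled with the same `replace` mapping if consistent labels are needed.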

## 4
### Checking Region Population

In [33]:
#Share of the surveyed households that falls in each region, in percent
over = 1 / ph_df.shape[0] #reciprocal of the number of rows in the df

ph_pop_percentage = 100 * ( ph_df['Region'].value_counts() * over )

ph_pop_percentage

IVA - CALABARZON             10.018294
NCR                           9.941267
III - Central Luzon           7.791739
VI - Western Visayas          6.862604
VII - Central Visayas         6.116407
V - Bicol Region              5.950318
XI - Davao Region             5.887733
I - Ilocos Region             5.651839
VIII - Eastern Visayas        5.625361
 ARMM                         5.411130
II - Cagayan Valley           5.341325
XII - SOCCSKSARGEN            5.107837
X - Northern Mindanao         4.542172
IX - Zasmboanga Peninsula     4.303871
Caraga                        4.289428
CAR                           4.152224
IVB - MIMAROPA                3.006451
Name: Region, dtype: float64
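The same percentages can be obtained with the `normalize=True` option of `value_counts`, which divides by the number of rows internally. A minimal sketch on a toy series follows (`toy_regions` and its values are hypothetical), comparing the manual and built-in forms.

```python
import pandas as pd

# Toy stand-in for ph_df['Region'] (hypothetical values)
toy_regions = pd.Series(["NCR", "NCR", "CAR", "Caraga"], name="Region")

# Manual form: counts divided by the number of rows
manual = 100 * toy_regions.value_counts() / len(toy_regions)

# Built-in form: normalize=True returns proportions directly
built_in = 100 * toy_regions.value_counts(normalize=True)

print(built_in)
```

Since each row is one surveyed household, these figures describe each region's share of the sample rather than its actual population.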