## Exploring the Filipino Family Income and Expenditure Dataset [WIP]
- Kaggle Competition and dataset details: https://www.kaggle.com/grosvenpaul/family-income-and-expenditure
- Resources used: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

<i>Please let me know if there are any errors or if anything can be further improved upon. </i>

The outline for this kernel is as follows:
1. <b>Exploratory Data Analysis and Data Visualization</b> - Plot different variables against each other and dig into some interesting correlations.
2. <b>Data Pre-processing</b> - Handle missing values and standardize/normalize the data.
3. <b>Prediction of Household Income</b> - Apply machine learning techniques to the dataset to either (1) classify new instances as low, middle, or high income (classification); or (2) predict the household income (regression).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

In [None]:
from sklearn import preprocessing
from sklearn import decomposition
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [None]:
#df = pd.read_csv("../input/Family Income and Expenditure.csv")
df = pd.read_csv("Family-Income-and-Expenditure.csv")

In [None]:
df.head()

In [None]:
df.info()

## Inspecting the Variables
If we want to better understand our data, we first need to understand the meaning and relevance of the variables. Remember that the target variable we are trying to predict is `Total Household Income`. Thus, for each variable, we ask ourselves:
- Do we consider this variable when attempting to predict household income? 
- If so, how relevant is this variable in determining our target variable?

We can roughly group the different variables based on the information they provide:
- <b>Expenditures</b> - describes the amount of spending allocated to a certain commodity 
- <b>Household Head</b> - age, sex, marital status, education of the household breadwinner
- <b>Household Family Members</b>- type of family, number, age, and employment status of family members
- <b>Type of House</b> - describes the physical structure of the house
- <b>Number of Commodities Owned</b> - e.g. fridges, washing machines, television

First, we'll create a separate dataframe for expenditures alone. 

In [None]:
exp = df[[c for c in df.columns if ('Expenditure' in c) or ('expenses' in c)]]
exp.describe()

In [None]:
df = df[[c for c in df.columns if ('Expenditure' not in c) and ('expenses' not in c)]]
df.describe()

In [None]:
df.describe(include=['O'])

### Total Household Income
Let's start by taking a closer look at our target variable.

In [None]:
target = 'Total Household Income'
df[target].describe()

Observations:
- The average household income is P247,555.60 per year, or about P20,629.63 per month.
- The median income is P164,079.50 per year, or P13,673.29 per month.
- The highest income is P11,815,990.00 per year, or about P984,665 per month.
- The lowest income is P11,285.00 per year, or about P940 per month.
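
The monthly figures above are simply the annual values divided by 12. A minimal sketch of that conversion, using the four annual figures quoted in the observations (the `annual` Series and its labels are illustrative and not part of the notebook):

```python
import pandas as pd

# Annual figures copied from the observations above (in pesos).
annual = pd.Series(
    {"mean": 247555.60, "median": 164079.50, "max": 11815990.00, "min": 11285.00}
)

# Convert annual income to monthly income, rounded to centavos.
monthly = (annual / 12).round(2)
print(monthly)
```

On the real data, the same division could be applied directly to `df[target].describe()`.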

In [None]:
f, ax = plt.subplots(figsize=(7, 5))
sns.distplot(df[target])
plt.show()

In [None]:
print("Skewness:", df[target].skew())

Taking a look at the distribution, we see that the total household income:
- deviates from the normal distribution
- has a positive skewness
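
Positive skewness means a long right tail: a few very large incomes pull the mean above the median, which matches the gap between the mean and the median reported earlier. A small sketch on synthetic log-normal data (the sample below is made up for illustration and is not drawn from the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "income" sample (illustrative only).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=12, sigma=0.8, size=10_000))

skew = income.skew()  # positive for a long right tail
mean, median = income.mean(), income.median()

print(f"skewness: {skew:.2f}")
print(f"mean > median: {mean > median}")
```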

Let's have a look at the correlation between variables.

In [None]:
k = 7
corrmat = df.select_dtypes(include='number').corr()  # correlate numeric columns only
cols = corrmat.nlargest(k, target)[target].index
f, ax = plt.subplots(figsize=(5, 5))
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

### Numerical Variables

In [None]:
var = 'Household Head Age'
sns.distplot(df[var])
plt.show()

In [None]:
sns.jointplot(x=var, y=target, data=df);
plt.show()

<i>Families with household heads in their 50's seem to have the highest household income.</i>

In [None]:
var = 'Total Number of Family members'
sns.distplot(df[var], kde=False)
plt.show()

In [None]:
sns.jointplot(x=var, y=target, data=df);
plt.show()

<i>Families with more members living in the household (10 and above) tend to have lower household income. </i>

In [None]:
var = 'Total number of family members employed'
sns.distplot(df[var], kde=False)
plt.show()

In [None]:
sns.jointplot(x=var, y=target, data=df)
plt.show()

<i>Interesting... the more family members are employed, the lower the Total Household Income? That seems pretty counterintuitive.</i> 
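
A scatter plot of a count variable can be misleading here: groups with many employed members contain far fewer households, so extreme incomes are simply less likely to show up in them. One way to check the trend is to compare per-group summaries and group sizes. A sketch on made-up data (the `toy` frame and its values are hypothetical; on the real data one would group `df` by the same column):

```python
import pandas as pd

# Hypothetical toy data: few households have many employed members.
toy = pd.DataFrame({
    "Total number of family members employed": [0, 0, 1, 1, 1, 1, 2, 2, 3],
    "Total Household Income": [90, 110, 100, 150, 120, 900, 200, 260, 300],
})

# Mean, median, and number of households per group.
summary = (
    toy.groupby("Total number of family members employed")["Total Household Income"]
       .agg(["mean", "median", "count"])
)
print(summary)
```

In this toy example the single largest income sits in the group with one employed member, yet the median rises from group to group. Whether the real data behaves the same way is something only the actual group summaries can show.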

In [None]:
var = 'Members with age less than 5 year old'
sns.distplot(df[var], kde=False)
plt.show()

In [None]:
sns.jointplot(x=var, y=target, data=df)
plt.show()

In [None]:
var = 'Members with age 5 - 17 years old'
sns.distplot(df[var], kde=False)
plt.show()

In [None]:
sns.jointplot(x=var, y=target, data=df)
plt.show()

<i>Families with more children tend to have lower total household income.</i>

### Categorical Variables
We now take a look at bar plots and count plots of different categorical variables, most of them plotted against Total Household Income. 

In [None]:
var = 'Region' 
s = sns.barplot(x=var, y=target, data=df)
s.set_xticklabels(s.get_xticklabels(), rotation=90)
plt.show()

<i>Families residing in NCR (Metro Manila) enjoy a higher total household income overall. The runner-up is CALABARZON, which is about an hour or two from Metro Manila. </i>

In [None]:
sex = 'Household Head Sex' 
sns.countplot(x=sex, data=df)
plt.show()

In [None]:
sex = 'Household Head Sex' 
sns.barplot(x=sex, y=target, data=df)
plt.show()

<i>While there are more male household heads, female heads generally have a higher mean household income.</i>

In [None]:
var = 'Main Source of Income'
sns.barplot(x=var, y=target, data=df)
plt.show()

In [None]:
var = 'Household Head Marital Status'
sns.countplot(x=var, hue=sex, data=df)
plt.show()

In [None]:
var = 'Household Head Highest Grade Completed'
df[var] = df[var].replace('Other Programs of Education at the Third Level, First Stage, of the Type that Leads to a Baccalaureate or First University/Professional Degree (HIgher Education Level, First Stage, or Collegiate Education Level)', 'Programs of Education at the Third Level');
df[var] = df[var].replace('Other Programs in Education at the Third Level, First Stage, of the Type that Leads to an Award not Equivalent to a First University or Baccalaureate Degree', 'Third Level that Leads to Non-Baccalaureate Award')

In [None]:
df[var].astype('category').cat.categories;

In [None]:
var = 'Household Head Occupation'
# Top 50 occupations by number of households (value_counts sorts in descending order)
counts = df[var].value_counts().head(50)
a = list(counts.index)   # occupation names
c = list(counts.values)  # number of households per occupation

In [None]:
f, ax = plt.subplots(figsize=(20, 10))
s = sns.barplot(x=a, y=c)
s.set_xticklabels(s.get_xticklabels(), rotation=90)
s.set(xlabel=var, ylabel='Count')
plt.show()

In [None]:
var = 'Household Head Occupation'
# Mean income for the 50 most common occupations
top = df[var].value_counts().head(50).index
means = df.groupby(var)[target].mean().loc[top]
a = list(top)    # occupation names, most common first
c = list(means)  # mean household income per occupation
f, ax = plt.subplots(figsize=(20, 10))
s = sns.barplot(x=a, y=c)
s.set_xticklabels(s.get_xticklabels(), rotation=90)
s.set(xlabel=var, ylabel='Mean (Total Household Income)')
plt.show()

In [None]:
var = 'Household Head Occupation' 
df[var].astype('category').cat.categories
a = df[var].astype('category').cat.categories
b = {i:df[df[var]==i][target].mean() for i in a}
b = sorted(b.items(), key=lambda kv: kv[1], reverse=True)
d = [i for i in b[:25]]
for i in b[-25:]: d.append(i)
a = [i[0] for i in d]
c = [i[1] for i in d]
f, ax = plt.subplots(figsize=(20, 10))
s = sns.barplot(x=a, y=c)
s.set_xticklabels(s.get_xticklabels(), rotation=90)
s.set(xlabel=var, ylabel='Mean (Total Household Income)')
plt.show()

In [None]:
var = 'Household Head Highest Grade Completed' 
f, ax = plt.subplots(figsize=(20, 10))
s = sns.countplot(x=var, data=df)
s.set_xticklabels(s.get_xticklabels(), rotation=90)
plt.show()

In [None]:
var = 'Household Head Highest Grade Completed' 
f, ax = plt.subplots(figsize=(20, 10))
s = sns.barplot(x=var, y=target, data=df)
s.set_xticklabels(s.get_xticklabels(), rotation=90)
plt.show()

In [None]:
var = 'Household Head Highest Grade Completed' 
f, ax = plt.subplots(figsize=(20, 10))
s = sns.barplot(x=var, y=target, hue=sex, data=df)
s.set_xticklabels(s.get_xticklabels(), rotation=90)
plt.show()

In [None]:
var = 'Household Head Job or Business Indicator' 
s = sns.countplot(x=var, hue=sex, data=df)
plt.show()

In [None]:
var = 'Main Source of Water Supply'
target_ = 'Housing and water Expenditure'
s = sns.barplot(x=df[var], y=exp[target_])
s.set_xticklabels(s.get_xticklabels(), rotation=90)
plt.show()

## Missing Values

In [None]:
df.isnull().sum()

In [None]:
# Fill missing values in each column with its own 'Other' category
df['Household Head Occupation'] = df['Household Head Occupation'].fillna('Other')
df['Household Head Class of Worker'] = df['Household Head Class of Worker'].fillna('Other')

## Classification Task

I'll start by binning the Total Household Income values into one of three categories:

1. <b>Low Income</b>: Income ranges from P11,285 to P122,244.67 (P940.42 - P10,187.06 per month)
2. <b>Middle Income</b>: Income ranges from P122,244.67 to P234,636.33 (P10,187.06 - P19,553 per month)
3. <b>High Income</b>: Income ranges from P234,636.33 to P11,815,988.00 (P19,553.02 - P984,665 per month)
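
These cut points are the terciles of the income distribution, so each category holds roughly a third of the households. A small sketch of how `pd.qcut` derives such bins, using made-up values rather than the dataset:

```python
import pandas as pd

# Made-up incomes (illustrative only).
incomes = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90])

# Split into three equal-frequency bins; retbins=True also returns the bin edges.
labels = ["low income", "middle income", "high income"]
binned, edges = pd.qcut(incomes, 3, labels=labels, retbins=True)

print(edges)                  # bin edges: minimum, two terciles, maximum
print(binned.value_counts())  # three values fall in each category
```

Because the bins are defined by quantiles rather than fixed widths, the resulting classes are close to equal in size even when the underlying distribution is heavily skewed.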

In [None]:
y = df[target]
X = df[df.columns.difference([target])]

In [None]:
y_ = pd.qcut(y, 3, retbins=True)
y_

In [None]:
y = pd.qcut(y, 3, labels=["low income","middle income", "high income"])
y

In [None]:
y.value_counts()

This gives us a balanced dataset.

In [None]:
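# Standardize the numerical features to zero mean and unit variance
cols = list(X.columns[X.dtypes != object])
X = X.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
std_scale = preprocessing.StandardScaler().fit(X[cols])
X[cols] = std_scale.transform(X[cols])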

In [None]:
cols = list(X.columns[X.dtypes == object])
X = pd.DataFrame(pd.get_dummies(X, prefix=cols, columns=cols))

In [None]:
X.info()

In [None]:
X.describe(include='all')

In [None]:
pca = decomposition.PCA(n_components=50)
X = pca.fit_transform(X)

In [None]:
def train(X, y):
    test_size = 0.2
    seed = 42
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
    classifiers = dict() 
    classifiers['GaussianNB'] = GaussianNB()
    classifiers['DecisionTreeClassifier'] = DecisionTreeClassifier(random_state=seed)
    classifiers['SVM'] = SVC()
    classifiers['LinearSVM'] = LinearSVC()
    classifiers['MLPClassifier'] = MLPClassifier()
    classifiers['Perceptron'] = Perceptron()
    classifiers['KNeighbors Classifier'] = KNeighborsClassifier()
    classifiers['RandomForestClassifier'] = RandomForestClassifier(n_estimators=300)

    # Iterate over dictionary
    for clf_name, clf in classifiers.items(): #clf_name is the key, clf is the value
        scores = cross_val_score(clf, X, y, cv=3)
        print(clf_name + ' cross_val_score: ' + str(np.mean(scores)))
        
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        score = metrics.accuracy_score(y_test, pred)
        print(clf_name + ': ' + str(score))
        print(metrics.classification_report(y_test, pred))
        

In [None]:
train(X, y)