# Statistics Canada Analysis

## Summary

1-2 sentences description of analysis

## Introduction

Intro 

- Background info relating to the topic 
- Narrowing down of variables will be justified here
- Will use 3 references: Definitions of certain term(s), background info or inspiration for the study, narrowing down of variables due to prior research (hopefully from a peer reviewed article).

## Research Question

- How do family size and the major income earner in a family influence familial investment income?

Predicting the economic family's investment income (EFINVA), based on the economic family's major earner (EFMJIE) and its size (EFSIZE)

## Dataset Description

The Canadian Income Survey (CIS) is a cross-sectional survey sponsored by both the Government of Canada and Statistics Canada. The survey aims to collect information from all citizens and households in Canada; however, around 2% of the population is excluded: people residing on reserves, in Aboriginal settlements, or in extremely remote areas with very small populations. The survey collects data on several characteristics, including labour market activity, school attendance, disability, support payments, child care expenses, and inter-household transfers. The dataset also incorporates some information from the Labour Force Survey (LFS), such as education level and geographic information. The dataset is available to organizations, all levels of government, and individuals. Governments could use it to inform new economic well-being policies for Canadians.

#### Description of Relevant Variables

## Methods and Results

### Description of Methods

- The dataset was loaded, and the features of interest were identified to subset the data into columns containing only these features.
- The data was cleaned and wrangled according to results of EDA as well as information provided through the dataset description. 
- EDA was done to identify the distributions among the variables of interest and their correlations to each other.
- The data was split into a training set and a testing set using a 70/30 split, since there was a large number of observations.
- Within each split, the data was separated into the features and the target variable. The features were "EFSIZE" (family size) and "EFMJIE" (major income earner in family), and the target was "EFINVA" (family investment income).
- Preprocessing was applied to the features of the training set to transform the categorical features "EFSIZE" (family size) and "EFMJIE" (major income earner in family) into numerical features. This was done using ordinal encoding for family size and binary one-hot encoding for the major income earner feature.
- A linear regression pipeline using the Ridge model from scikit-learn was evaluated with 10-fold cross-validation to obtain a mean training score.
- The above was repeated for KNN regression.
- The models were compared to identify the better-performing one.
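
The model comparison described in the last three steps could be sketched roughly as follows. This is a minimal illustration on synthetic stand-in data, not the survey itself: the frame `X_demo`, the target `y_demo`, the helper `make_preprocessor`, the fixed `alpha=1.0`, and `n_neighbors=5` are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic stand-in for the CIS columns (NOT real survey data)
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    "EFSIZE": rng.integers(1, 8, size=n),  # family size, 1..7
    "EFMJIE": rng.integers(1, 3, size=n),  # major earner flag, coded 1 or 2
})
y_demo = 1000 * X_demo["EFSIZE"] + 500 * X_demo["EFMJIE"] + rng.normal(0, 200, size=n)


def make_preprocessor():
    """Hypothetical helper: a fresh copy of the encoding step for each model."""
    return make_column_transformer(
        (OneHotEncoder(drop="if_binary"), ["EFMJIE"]),
        (OrdinalEncoder(categories=[[1, 2, 3, 4, 5, 6, 7]]), ["EFSIZE"]),
    )


models = {
    "ridge": make_pipeline(make_preprocessor(), Ridge(alpha=1.0)),
    "knn": make_pipeline(make_preprocessor(), KNeighborsRegressor(n_neighbors=5)),
}

# Mean 10-fold cross-validation score (R^2 by default for regressors)
mean_cv_scores = {}
for name, model in models.items():
    cv = cross_validate(model, X_demo, y_demo, cv=10, return_train_score=True)
    mean_cv_scores[name] = cv["test_score"].mean()

best_model_name = max(mean_cv_scores, key=mean_cv_scores.get)
print(mean_cv_scores, best_model_name)
```

Comparing on the mean validation score rather than the training score is one reasonable choice here, since KNN can fit the training folds very closely without generalizing better.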

# Preliminary Analysis

#### Loading the Data

In [None]:
# Importing the appropriate packages
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder,OrdinalEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, cross_validate
from sklearn.linear_model import Ridge, RidgeCV
import seaborn as sns
import matplotlib.pyplot as plt
# import os

In [None]:
# Loading the data
data = pd.read_csv('data/CIS-72M0003-E-2017-Annual/CIS-72M0003-E-2017-Annual_F1.csv')
data.head()

# Cleaning and Wrangling

**The data is reduced to some specific features and targets of interest before EDA as it is difficult to analyze 194 features.**

In [None]:
reduced_data = data[['EFSIZE', 'USHRWK', 'ATINC', 'HLEV2G', 'EFINVA', 'EFMJIE', 'EFATINC', 'EFMJSI']]

### Figure 1: Dataframe containing only preliminary features of interest to do EDA on 

In [None]:
reduced_data

In [None]:
reduced_data.info()

- **The types are all numeric, but this does not mean the data are all numeric. When referring to the data set description, EFMJIE, EFMJSI, and HLEV2G are categorical variables with numbers corresponding to categories**
- **The USHRWK column has zero valid values as all are NaN. This is not an error as these were the values in the description of the dataset as well**


In [None]:
reduced_data.describe()

**The summaries above show no missing data in any column except USHRWK: every other non-null count is 92292, which matches the total number of observations. However, according to the dataset description, the ATINC column uses the value 99999996 to mark observations for which the question was validly skipped. These observations should therefore be removed.**
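
As a rough illustration of the reserved-code check described above, the sketch below counts and filters the code on a toy frame. The frame `demo`, its values, and the name `VALID_SKIP` are made up for illustration; only the code 99999996 comes from the dataset description.

```python
import pandas as pd

VALID_SKIP = 99999996  # reserved "valid skip" code quoted from the dataset description

# Toy frame standing in for reduced_data (values are made up)
demo = pd.DataFrame({
    "ATINC": [42000, VALID_SKIP, 58500, VALID_SKIP, 31000],
    "EFINVA": [0, 150, 1200, 0, 75],
})

# Count how often the reserved code appears in each column
skip_counts = (demo == VALID_SKIP).sum()

# Keep only rows where ATINC holds a real dollar amount
demo_valid = demo.loc[demo["ATINC"] != VALID_SKIP]
```

Counting the code in every column first is a cheap way to check whether other columns use the same reserved value before deciding which rows to drop.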

In [None]:
reduced_data = reduced_data.drop(columns = 'USHRWK')

In [None]:
reduced_data = reduced_data.loc[reduced_data['ATINC'] != 99999996]

### Figure 2: Dataframe containing only valid observations for ATINC (After Tax Income) column.

In [None]:
reduced_data

In [None]:
eda_feat_histograms = reduced_data.hist(bins=25, figsize=(30, 25))
# Titles and x-axis labels, listed in the same order as the columns of reduced_data
texts = { 'titles':['Number of economic family members', 
                    'After Tax Income',
                    'Highest level of education of person',
                    'EF Investment Income',
                    'Major Income earner in the economic Family',
                    'EF After-Tax Income',
                    'Major Source of income for the economic family' 
                    ],
            'xaxes':['Number of People',
                        'Dollars [CAD]',
                        'Category code',
                        'Dollars [CAD]',
                        'Category code',
                        'Dollars [CAD]',
                        'Category code']
         }


# zip stops after the last labelled plot, so any unused axes in the grid are left untouched
for hist, title, xlabel in zip(eda_feat_histograms.flatten(), texts['titles'], texts['xaxes']):
    hist.set_xlabel(xlabel)
    hist.set_title(title)
    hist.set_ylabel('Frequency')
    # Enlarge all text elements so they stay readable on the large figure
    for item in ([hist.title, hist.xaxis.label, hist.yaxis.label] +
             hist.get_xticklabels() + hist.get_yticklabels()):
        item.set_fontsize(20)

In [None]:
# Correlations among the continuous dollar-valued columns only
correlations = reduced_data[['ATINC', 'EFINVA', 'EFATINC']].corr()

### Figure 3: Matrix of correlations between various features. 
##### Note: only the continuous income columns (ATINC, EFINVA, EFATINC) are included, since correlations involving the categorical codes would not be meaningful

In [None]:
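# Create the figure in the same cell as the plot so the size applies to the heatmap
plt.figure(figsize=(10, 10))
sns.heatmap(correlations, cmap=plt.cm.Blues, annot=True)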

#### Visualisations

# Data Splitting and Preprocessing

In [None]:
processed = reduced_data[["EFINVA","EFSIZE","EFMJIE"]]

### Figure 4: Reduced dataframe of target variable investment income in CAD (EFINVA), and features family size (EFSIZE) and major income earner (EFMJIE)

In [None]:
processed

In [None]:
train, test = train_test_split(processed, test_size = 0.3, random_state=123)
X_train , Y_train = train.drop(columns = "EFINVA"), train["EFINVA"]
X_test, Y_test = test.drop(columns = "EFINVA"), test["EFINVA"]

**The features are identified as ordinal categorical for family size (EFSIZE) and binary for major income earner (EFMJIE). These are separated to apply different preprocessing steps**

In [None]:
binary_fea =["EFMJIE"]
cate_fea = ["EFSIZE"]

In [None]:
# Make a pipeline for preprocessing the features above according to their types: ordinal vs binary
cate_trans = make_pipeline(OrdinalEncoder(categories = [[1, 2, 3, 4, 5, 6, 7]], dtype=int))
binary_trans = make_pipeline(OneHotEncoder(drop="if_binary"))
preprocessor = make_column_transformer(
    (binary_trans, binary_fea),
    (cate_trans,cate_fea)
)

train_processed = preprocessor.fit_transform(X_train)

### Figure 5: Resulting dataframe of features after preprocessing steps

In [None]:
pd.DataFrame(train_processed, columns = ["EFMJIE","EFSIZE"])

# Hyperparameter Optimization for Ridge Regression Model Training

In [None]:
alphas = 10.0 ** np.arange(-2, 5, 1)
ridge_cv_pipe = make_pipeline(preprocessor, RidgeCV(alphas = alphas, cv=10))
ridge_cv_pipe.fit(X_train, Y_train)
best_alpha = ridge_cv_pipe.named_steps["ridgecv"].alpha_
best_alpha

In [None]:
# Make pipeline using optimized alpha value

ridge_pipeline = make_pipeline(preprocessor, Ridge(alpha=best_alpha))
pd.DataFrame(cross_validate(ridge_pipeline, X_train, Y_train, cv=10, return_train_score=True))

### Testing

In [1]:
#!!!TODO
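
One possible shape for the test-set evaluation is sketched below on synthetic stand-in data. It uses a bare `Ridge` model rather than the notebook's preprocessing pipeline; in the notebook, `ridge_pipeline` fitted on `X_train` and `Y_train` would presumably take the place of `model`, with `X_test` and `Y_test` used for scoring. The frame `demo`, the `*_demo`, `X_tr`, and `X_te` names, and the choice of R² and RMSE as metrics are assumptions for illustration, not the authors' results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed frame (NOT real survey data)
rng = np.random.default_rng(123)
n = 400
demo = pd.DataFrame({
    "EFSIZE": rng.integers(1, 8, size=n),  # family size, 1..7
    "EFMJIE": rng.integers(1, 3, size=n),  # major earner flag, coded 1 or 2
})
demo["EFINVA"] = 800 * demo["EFSIZE"] + rng.normal(0, 300, size=n)

# Same splitting pattern as the notebook: hold out 30% for testing
train_demo, test_demo = train_test_split(demo, test_size=0.3, random_state=123)
X_tr, y_tr = train_demo.drop(columns="EFINVA"), train_demo["EFINVA"]
X_te, y_te = test_demo.drop(columns="EFINVA"), test_demo["EFINVA"]

# Fit on the training split only, then score once on the held-out split
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)  # coefficient of determination on test data
test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))  # in target units
print(f"Test R^2: {test_r2:.3f}, test RMSE: {test_rmse:.1f}")
```

Reporting RMSE alongside R² keeps the error in the target's own units (dollars, for EFINVA), which may be easier to interpret in the discussion.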

## Discussion of Results

## References