# Machine Learning Living Arrangement Predictor Results

In this notebook, we'll take a look at the output of the machine learning algorithm we used to classify students in the TTS data according to whether or not they are living with their family.

## Model Details

We used Model Builder and ML.NET through Visual Studio to train a machine learning algorithm to classify students as either living with family or not, according to attributes that are available in both SMTO and TTS. The model was trained on labelled data from SMTO (where living arrangements were available) and then run on the unlabelled TTS dataset.

### Chosen Algorithm and Evaluation Details

The algorithm selected by Model Builder was `MODELNAMEHERE`. When tested against a reserved portion of the training data, the model had an accuracy of XX.X%.
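
For context, hold-out accuracy is the share of reserved rows whose predicted label matches the true label, and it is easiest to interpret next to a majority-class baseline (the accuracy of always predicting the most common label). The sketch below uses made-up labels, not SMTO data; `accuracy`, `majority_baseline`, `y_true`, and `y_pred` are hypothetical names, and Model Builder computes its own metric internally.

```python
import pandas as pd

def accuracy(y_true, y_pred):
    """Share of rows where the predicted label equals the true label."""
    y_true = pd.Series(list(y_true))
    y_pred = pd.Series(list(y_pred))
    return float((y_true == y_pred).mean())

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    shares = pd.Series(list(y_true)).value_counts(normalize=True)
    return float(shares.iloc[0])  # value_counts sorts in descending order

# Hypothetical hold-out labels (1 = living with family, 0 = not)
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

print("accuracy: {:.1%}".format(accuracy(y_true, y_pred)))
print("baseline: {:.1%}".format(majority_baseline(y_true)))
```

A model is only informative to the extent that its accuracy exceeds this baseline.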

## Tabulation of Results

We will now look at the model's output, comparing its predictions for different types of students with the labels in the SMTO data. We load the model's output into `df` and the labelled SMTO data into `SMTO_df`.

In [1]:
import pandas as pd
df = pd.read_csv('../../Data/TTS_2016_ML_Output.csv')
SMTO_df = pd.read_csv('../../Data/SMTO_2015_ML_Format.csv')

Let us compare the overall proportions of students living with their family in each dataset.

In [2]:
print("SMTO\n" + str(SMTO_df['Family?'].value_counts(normalize=True)))
print("\nTTS\n" + str(df['Family'].value_counts(normalize=True)))

SMTO
1    0.564904
0    0.435096
Name: Family?, dtype: float64

TTS
1    0.68098
0    0.31902
Name: Family, dtype: float64


We define the following function, which is used to compare the datasets below.

In [3]:
def print_comparison(col, col_print=None):
    """For each value of `col`, print the share of students living with family in SMTO and TTS."""
    col_print = col_print if col_print else col
    print(col_print + "\tSMTO\tTTS")
    for num in set(list(SMTO_df[col]) + list(df[col])):
        smto_rows, tts_rows = SMTO_df[SMTO_df[col] == num], df[df[col] == num]
        # Report 0.0 when a dataset has no rows with this value
        smto_share = (smto_rows['Family?'] == 1).mean() if len(smto_rows) else 0.0
        tts_share = (tts_rows['Family'] == 1).mean() if len(tts_rows) else 0.0
        print("{}\t{:2.1%}\t{:2.1%}".format(num, smto_share, tts_share))
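
The function above prints shares only, so a value that occurs in a handful of rows looks as reliable as one that occurs in thousands. As a rough sketch, the same comparison can be built with `groupby` so that group sizes appear next to the shares. The helper name `comparison_table` and the toy frames are hypothetical, and the sketch assumes the labels are coded 0/1 with no missing values. Unlike the function above, it shows `NaN` rather than 0.0 when a value is absent from one dataset.

```python
import pandas as pd

def comparison_table(smto, tts, col, smto_label="Family?", tts_label="Family"):
    """Share living with family and group size per value of `col`, for both datasets."""
    # The mean of a 0/1 label is the share of rows labelled 1
    smto_stats = smto.groupby(col)[smto_label].agg(SMTO="mean", SMTO_n="count")
    tts_stats = tts.groupby(col)[tts_label].agg(TTS="mean", TTS_n="count")
    # Outer join keeps values that appear in only one dataset
    table = smto_stats.join(tts_stats, how="outer")
    table[["SMTO_n", "TTS_n"]] = table[["SMTO_n", "TTS_n"]].fillna(0).astype(int)
    return table

# Hypothetical rows for illustration only
smto = pd.DataFrame({"Cars": [0, 0, 1, 1, 1, 2], "Family?": [0, 0, 1, 0, 1, 1]})
tts = pd.DataFrame({"Cars": [0, 1, 1, 3], "Family": [0, 1, 1, 1]})

table = comparison_table(smto, tts, "Cars")
print(table)
```

With the real data, the call would be `comparison_table(SMTO_df, df, "Cars")`.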

A list of observations and takeaways is presented at the bottom of this notebook. For now, it is important to keep in mind that the proportion of students living with their family was predicted to be much higher in the TTS data (68.1%) than in the SMTO data (56.5%).

### Household Composition

We can compare the TTS and SMTO label distributions according to specific features. Household composition was found to be a good indicator in the SMTO data, so let us check if it was a similarly good predictor in the ML model.

We assumed that most students whose household includes no adults other than, possibly, the student themselves are not living with their parents or family.

In [4]:
print("SMTO\n" + str(SMTO_df[((SMTO_df['Age'] >= 18) & (SMTO_df['Adults'] == 1)) | ((SMTO_df['Age'] < 18) & (SMTO_df['Adults'] == 0))]['Family?'].value_counts(normalize=True)))
print("\nTTS\n" + str(df[((df['Age'] >= 18) & (df['Adults'] == 1)) | ((df['Age'] < 18) & (df['Adults'] == 0))]['Family'].value_counts(normalize=True)))

SMTO
0    0.966424
1    0.033576
Name: Family?, dtype: float64

TTS
0    0.950805
1    0.049195
Name: Family, dtype: float64


Our assumption is validated, and the ML results reflect the distribution in the training data well.
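
The filter used in the cell above can be wrapped in small helpers so that the same condition is applied identically to both datasets. This is a minimal sketch: `family_share` and `no_other_adults` are hypothetical names, the toy frame is made up, and only the column names (`Age`, `Adults`, `Family`) come from this notebook.

```python
import pandas as pd

def family_share(frame, label, mask):
    """Share of rows with label == 1 among rows where `mask` is True (NaN if none)."""
    selected = frame.loc[mask, label]
    return float((selected == 1).mean()) if len(selected) else float("nan")

def no_other_adults(frame):
    """Rows whose household has no adult other than, possibly, the student."""
    adult_alone = (frame["Age"] >= 18) & (frame["Adults"] == 1)
    minor_alone = (frame["Age"] < 18) & (frame["Adults"] == 0)
    return adult_alone | minor_alone

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "Age":    [19, 22, 17, 20, 25],
    "Adults": [1, 3, 0, 1, 2],
    "Family": [0, 1, 0, 1, 1],
})

share = family_share(toy, "Family", no_other_adults(toy))
print("{:.1%}".format(share))
```

With the real data, the two calls would be `family_share(SMTO_df, 'Family?', no_other_adults(SMTO_df))` and `family_share(df, 'Family', no_other_adults(df))`.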

For students 18 or over who live with children under 18, we have:

In [5]:
print("SMTO\n" + str(SMTO_df[(SMTO_df['Age'] >= 18) & (SMTO_df['Children'] >= 1)]['Family?'].value_counts(normalize=True)))
print("\nTTS\n" + str(df[(df['Age'] >= 18) & (df['Children'] >= 1)]['Family'].value_counts(normalize=True)))

SMTO
1    0.869098
0    0.130902
Name: Family?, dtype: float64

TTS
1    0.805471
0    0.194529
Name: Family, dtype: float64


The model results and training data proportions are slightly different when segmented according to this aspect.

Let us look at the probability a student is living with their family according to number of adults in their household.

In [6]:
print_comparison("Adults")

Adults	SMTO	TTS
0	3.8%	0.0%
1	4.3%	7.4%
2	28.5%	30.2%
3	78.1%	90.6%
4	84.5%	94.0%
5	83.6%	91.8%
6	64.6%	87.7%
7	62.5%	81.1%
8	41.5%	62.5%
9	19.2%	100.0%
10	5.0%	0.0%
11	9.1%	0.0%
12	16.7%	0.0%
13	0.0%	0.0%
14	0.0%	0.0%
15	0.0%	0.0%
16	0.0%	0.0%


Again, the proportions follow a similar pattern. However, the TTS proportions are often quite a bit higher, which makes sense given that the overall proportion was higher.

Let's perform a similar check for number of children in the household.

In [7]:
print_comparison("Children", "Kids")

Kids	SMTO	TTS
0	46.4%	61.4%
1	85.9%	82.6%
2	86.3%	79.6%
3	84.6%	73.2%
4	92.6%	76.3%
5	86.7%	72.0%
6	100.0%	90.9%
7	100.0%	0.0%
8	100.0%	0.0%
18	0.0%	0.0%


This time, there are much more substantial differences between the samples.

### Household Income

Now let's see how the label proportions compare when segmented by income level.

In [8]:
print_comparison("Income")

Income	SMTO	TTS
0	59.6%	72.3%
1	42.5%	58.0%
2	64.4%	70.2%


Again, there are noticeable differences in the proportions, but how much of these differences can be accounted for by the different sample-wide proportions?
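
One rough way to approach this question is to divide each segment's share by its dataset's overall share, so that both datasets are expressed relative to their own baseline. The sketch below uses only figures already printed in this notebook; the ratio is one possible normalization among several (an odds ratio would be another), and the variable names are hypothetical.

```python
import pandas as pd

# Overall shares living with family, as printed earlier in this notebook
overall = pd.Series({"SMTO": 0.564904, "TTS": 0.68098})

# Shares by income level, copied from the table above
by_income = pd.DataFrame(
    {"SMTO": [0.596, 0.425, 0.644], "TTS": [0.723, 0.580, 0.702]},
    index=pd.Index([0, 1, 2], name="Income"),
)

# Ratio of each segment's share to its own dataset's overall share
relative = by_income / overall
relative["TTS_minus_SMTO"] = relative["TTS"] - relative["SMTO"]
print(relative.round(3))
```

A ratio near 1 means the segment resembles its dataset as a whole. Where the two ratios are close, the gap in raw proportions is mostly explained by the sample-wide difference; where they diverge, the segment itself behaves differently in the two datasets.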

### Vehicle Ownership

Here we check how vehicle ownership is correlated with living arrangement.

In [9]:
print_comparison("Cars")

Cars	SMTO	TTS
0	14.3%	22.0%
1	65.7%	59.7%
2	87.6%	82.1%
3	92.0%	90.9%
4	91.9%	94.9%
5	92.9%	90.7%
6	100.0%	94.7%
7	100.0%	100.0%
8	100.0%	100.0%
9	50.0%	0.0%
99	0.0%	20.0%
12	0.0%	100.0%


COMMENT HERE

### License Ownership

Now let's look at license ownership:

In [10]:
print_comparison("License")

License	SMTO	TTS
0	61.4%	76.2%
1	53.2%	66.2%
9	0.0%	57.6%


COMMENT HERE

Let us refine this by looking only at students who do not have a license but whose household owns at least one car.

In [11]:
print("SMTO\t", ((SMTO_df['Cars'] > 0) & (SMTO_df['Family?'] == 1) & (SMTO_df['License'] == 0)).sum()/((SMTO_df['Cars'] > 0) & (SMTO_df['License'] == 0)).sum())
print("TTS\t", ((df['Cars'] > 0) & (df['Family'] == 1) & (df['License'] == 0)).sum()/((df['Cars'] > 0) & (df['License'] == 0)).sum())

SMTO	 0.8863506567675614
TTS	 0.8891871165644172


These proportions are remarkably similar!

Finally, let us compare living arrangements by enrolment status (full-time vs. part-time) among students whose household owns at least one car.

In [12]:
print("\tFT\tPT")
print("SMTO\t{:2.1%}\t{:2.1%}".format(((SMTO_df['Cars'] > 0) & (SMTO_df['Family?'] == 1) & (SMTO_df['Status'] == 2)).sum()/((SMTO_df['Cars'] > 0) & (SMTO_df['Status'] == 2)).sum(), ((SMTO_df['Cars'] > 0) & (SMTO_df['Family?'] == 1) & (SMTO_df['Status'] == 1)).sum()/((SMTO_df['Cars'] > 0) & (SMTO_df['Status'] == 1)).sum()))
print("TTS\t{:2.1%}\t{:2.1%}".format(((df['Cars'] > 0) & (df['Family'] == 1) & (df['Status'] == 2)).sum()/((df['Cars'] > 0) & (df['Status'] == 2)).sum(), ((df['Cars'] > 0) & (df['Family'] == 1) & (df['Status'] == 1)).sum()/((df['Cars'] > 0) & (df['Status'] == 1)).sum()))

	FT	PT
SMTO	80.4%	56.1%
TTS	86.2%	49.1%


COMMENT HERE