# Machine Learning Living Arrangement Predictor Results

In this notebook, we'll take a look at the output of the machine learning algorithm we used to classify students in the TTS data according to whether or not they are living with their family.

## Model Details

We used Model Builder and ML.NET in Visual Studio to train a machine learning algorithm to classify students as either living with family or not, according to attributes that are available in both SMTO and TTS. The model was trained on labelled data from SMTO (where living arrangements were available) and then run on the unlabelled TTS dataset.

### Chosen Algorithm and Evaluation Details

The algorithm selected by Model Builder was `LightGbmBinary`. When tested against a reserved portion of the training data, the model had an accuracy of __91.65%__, an AUC and AUPRC of __97.00%__ each, and an F1-score of __92.47%__.
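
For reference, accuracy and the F1-score are both derived from the confusion matrix on the held-out data. The sketch below uses made-up counts (not the actual Model Builder evaluation) and a hypothetical helper, `binary_metrics`, purely to illustrate the definitions.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only; these are NOT the evaluation results reported above.
example = binary_metrics(tp=90, fp=10, fn=10, tn=90)
print(example)
```

AUC and AUPRC cannot be computed from a single confusion matrix, since they summarize performance across all classification thresholds.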

## Tabulation of Results

We now take a look at the model's output. In particular, we compare the model's predictions for different types of students with the labels in the SMTO data. We import the model's output into `df` and the labelled SMTO data into `SMTO_df`.

In [1]:
import pandas as pd
df = pd.read_csv('../../Data/TTS_2016_ML_Output_2.0.csv')
SMTO_df = pd.read_csv('../../Data/SMTO_2015_ML_Format.csv')

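We define a helper function that prints, for each value of a given column, the proportion of students living with family in SMTO and in TTS. It is used for the comparisons below.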

In [3]:
def print_comparison(col, col_print=None):
    col_print = col_print if col_print else col
    print(col_print + "\tSMTO\tTTS")
    for num in set(list(SMTO_df[col]) + list(df[col])):
        smto_group = SMTO_df[SMTO_df[col] == num]
        tts_group = df[df[col] == num]
        # Share living with family; printed as 0.0% when the value does not occur in a dataset
        smto_share = (smto_group['Family?'] == 1).sum() / len(smto_group) if len(smto_group) else 0.0
        tts_share = (tts_group['Family'] == 1).sum() / len(tts_group) if len(tts_group) else 0.0
        print("{}\t{:2.1%}\t{:2.1%}".format(num, smto_share, tts_share))

Let us compare the overall proportions of students living with their family in each dataset.

In [2]:
print("SMTO\n" + str(SMTO_df['Family?'].value_counts(normalize=True)))
print("\nTTS\n" + str(df['Family'].value_counts(normalize=True)))

SMTO
1    0.564904
0    0.435096
Name: Family?, dtype: float64

TTS
True     0.687779
False    0.312221
Name: Family, dtype: float64


A list of observations and takeaways is presented at the bottom of this notebook. For now, it is important to keep in mind that the proportion of students living with their family was predicted to be much higher in the TTS data (68.8%) than in the SMTO data (56.5%).

### 1. Household Composition

We can compare the TTS and SMTO label distributions according to specific features. Household composition was found to be a good indicator in the SMTO data, so let us check if it was a similarly good predictor in the ML model.

We assumed that most students whose household does not include any adults, except, possibly, for the student themselves, are not living with their parents/family.

In [4]:
print("SMTO\n" + str(SMTO_df[((SMTO_df['Age'] >= 18) & (SMTO_df['Adults'] == 1)) | ((SMTO_df['Age'] < 18) & (SMTO_df['Adults'] == 0))]['Family?'].value_counts(normalize=True)))
print("\nTTS\n" + str(df[((df['Age'] >= 18) & (df['Adults'] == 1)) | ((df['Age'] < 18) & (df['Adults'] == 0))]['Family'].value_counts(normalize=True)))

SMTO
0    0.966424
1    0.033576
Name: Family?, dtype: float64

TTS
False    0.953488
True     0.046512
Name: Family, dtype: float64


Our assumption is validated, and the ML results reflect the distribution in the training data well.

For students 18 or over who live with children under 18, we have:

In [5]:
print("SMTO\n" + str(SMTO_df[(SMTO_df['Age'] >= 18) & (SMTO_df['Children'] >= 1)]['Family?'].value_counts(normalize=True)))
print("\nTTS\n" + str(df[(df['Age'] >= 18) & (df['Children'] >= 1)]['Family'].value_counts(normalize=True)))

SMTO
1    0.869098
0    0.130902
Name: Family?, dtype: float64

TTS
True     0.816373
False    0.183627
Name: Family, dtype: float64


For this segment, the model's predictions are somewhat lower than the training data: 81.6% of these students are predicted to live with family in TTS, compared with 86.9% in SMTO.

Let us look at the probability a student is living with their family according to number of adults in their household.

In [6]:
print_comparison("Adults")

Adults	SMTO	TTS
0	3.8%	0.0%
1	4.3%	7.5%
2	28.5%	31.8%
3	78.1%	90.9%
4	84.5%	94.1%
5	83.6%	92.7%
6	64.6%	86.6%
7	62.5%	85.3%
8	41.5%	59.4%
9	19.2%	100.0%
10	5.0%	50.0%
11	9.1%	100.0%
12	16.7%	0.0%
13	0.0%	0.0%
14	0.0%	0.0%
15	0.0%	0.0%
16	0.0%	0.0%


Again, the proportions follow a similar pattern. However, the TTS proportions are often quite a bit higher. This makes sense, as the overall proportion was higher.
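
Some of the extreme values in the table above (for example 100.0% or 0.0% for households with nine or more adults) may come from groups containing very few students. A minimal sketch of how group sizes could be reported next to each proportion is shown below. `share_with_counts` is a hypothetical helper, and the small frame is made-up data standing in for `df` or `SMTO_df`.

```python
import pandas as pd

def share_with_counts(frame, col, label):
    """Per value of `col`: number of students and share with label == 1."""
    living_with_family = (frame[label] == 1).astype(int)
    grouped = living_with_family.groupby(frame[col])
    return pd.DataFrame({"n": grouped.size(), "share": grouped.mean()})

# Made-up data for illustration only (not taken from SMTO or TTS).
toy = pd.DataFrame({
    "Adults": [1, 1, 2, 2, 2, 9],
    "Family": [0, 0, 1, 0, 1, 1],
})
summary = share_with_counts(toy, "Adults", "Family")
print(summary)
```

In the toy data, the single household with nine adults yields a share of 100%, which illustrates why a proportion should be read together with its group size.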

Let's perform a similar check for number of children in the household.

In [7]:
print_comparison("Children", "Kids")

Kids	SMTO	TTS
0	46.4%	61.8%
1	85.9%	83.5%
2	86.3%	80.0%
3	84.6%	76.4%
4	92.6%	88.7%
5	86.7%	72.0%
6	100.0%	90.9%
7	100.0%	0.0%
8	100.0%	0.0%
18	0.0%	0.0%


This time, the two samples differ more substantially. For households with one to eight children, the TTS proportions are lower than the SMTO proportions, whereas for households with no children the TTS proportion is much higher (61.8% vs. 46.4%).

### 2. Household Income

Now let's see how the label proportions compare when segmented by income level.

In [8]:
print_comparison("Income")

Income	SMTO	TTS
0	59.6%	73.0%
1	42.5%	59.0%
2	64.4%	70.7%


In both SMTO and TTS, higher household income predicts living with family much better than lower household income (with slightly higher proportions in TTS this time). Interestingly, in SMTO an unknown income range did not strongly predict living with family, while in TTS it looks like 72% of students with unknown income live at home. Again, there are noticeable differences in the proportions, but how much of these differences can be accounted for by the different sample-wide proportions?
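
One rough way to approach this question is to divide each group's proportion by its dataset's overall proportion, so that a value above 1 means the group is more likely than that dataset's average student to live with family. The sketch below applies this to the figures printed above. The helper name `relative_to_overall` is hypothetical, and this ratio is only one of several possible normalizations.

```python
# Overall shares of students living with family, as printed earlier in this notebook.
OVERALL = {"SMTO": 0.564904, "TTS": 0.687779}

# Shares by income code, copied from the table above.
INCOME = {
    0: {"SMTO": 0.596, "TTS": 0.730},
    1: {"SMTO": 0.425, "TTS": 0.590},
    2: {"SMTO": 0.644, "TTS": 0.707},
}

def relative_to_overall(shares, overall):
    """Divide each group's share by the dataset-wide share."""
    return {
        group: {name: value / overall[name] for name, value in by_dataset.items()}
        for group, by_dataset in shares.items()
    }

ratios = relative_to_overall(INCOME, OVERALL)
print("Income\tSMTO\tTTS")
for code, row in ratios.items():
    print("{}\t{:.2f}\t{:.2f}".format(code, row["SMTO"], row["TTS"]))
```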

### 3. Vehicle Ownership

Here we check how vehicle ownership is correlated with living arrangement.

In [9]:
print_comparison("Cars")

Cars	SMTO	TTS
0	14.3%	21.5%
1	65.7%	61.2%
2	87.6%	82.7%
3	92.0%	91.4%
4	91.9%	94.7%
5	92.9%	90.7%
6	100.0%	97.4%
7	100.0%	100.0%
8	100.0%	100.0%
9	50.0%	0.0%
99	0.0%	20.0%
12	0.0%	100.0%


In general, the number of household vehicles was again a strong predictor of living arrangement in TTS: as the number of cars increases, the probability of living at home generally increases. For households with two to eight cars, the probability that a student lives with their family is very high (over 80% in both datasets). For households with one or two cars, the TTS proportion is actually slightly lower than the SMTO proportion.

### 4. License Ownership

Now let's look at licence ownership:

In [10]:
print_comparison("License")

License	SMTO	TTS
0	61.4%	76.5%
1	53.2%	66.9%
9	0.0%	58.6%


It is very interesting that, while in SMTO the ownership of a licence does not provide a strong prediction of whether the student lives at home or not, for TTS it does. Students who do not have a licence are actually quite likely to be living with their families. But again, these differences could be due to the different sample-wide proportions.
 


Let us refine this by looking only at students who do not have a license but whose household owns at least one car.

In [11]:
print("SMTO\t", ((SMTO_df['Cars'] > 0) & (SMTO_df['Family?'] == 1) & (SMTO_df['License'] == 0)).sum()/((SMTO_df['Cars'] > 0) & (SMTO_df['License'] == 0)).sum())
print("TTS\t", ((df['Cars'] > 0) & (df['Family'] == 1) & (df['License'] == 0)).sum()/((df['Cars'] > 0) & (df['License'] == 0)).sum())

SMTO	 0.8863506567675614
TTS	 0.8995398773006135


These proportions are remarkably similar!

Finally, let us check enrolment status (full-time vs. part-time) for students whose household owns at least one car. Note that the cell below filters on `Cars` and `Status` only, not on `License`.

In [12]:
print("\tFT\tPT")
print("SMTO\t{:2.1%}\t{:2.1%}".format(((SMTO_df['Cars'] > 0) & (SMTO_df['Family?'] == 1) & (SMTO_df['Status'] == 2)).sum()/((SMTO_df['Cars'] > 0) & (SMTO_df['Status'] == 2)).sum(), ((SMTO_df['Cars'] > 0) & (SMTO_df['Family?'] == 1) & (SMTO_df['Status'] == 1)).sum()/((SMTO_df['Cars'] > 0) & (SMTO_df['Status'] == 1)).sum()))
print("TTS\t{:2.1%}\t{:2.1%}".format(((df['Cars'] > 0) & (df['Family'] == 1) & (df['Status'] == 2)).sum()/((df['Cars'] > 0) & (df['Status'] == 2)).sum(), ((df['Cars'] > 0) & (df['Family'] == 1) & (df['Status'] == 1)).sum()/((df['Cars'] > 0) & (df['Status'] == 1)).sum()))

	FT	PT
SMTO	80.4%	56.1%
TTS	86.8%	50.7%


As we can see, among students whose household owns at least one car, full-time students are considerably more likely to live at home than part-time students, in both SMTO and TTS. The reasons for this are not immediately obvious, but the difference is worth noting.

## Overall Conclusions:

Right away we saw that the percentage of students living with their families differs between the SMTO and TTS datasets. While 56% of students in SMTO live at home, 69% of students in TTS are predicted to live at home. This is a sizeable difference, and it raises the question of how to compare the SMTO and TTS results. The correlations between living arrangement and other attributes cannot be directly compared, since the overall distributions are not the same.

   - __Possible Reason:__
A possible reason for the bigger percentage of students living at home in the TTS dataset could be that TTS is a household survey, and thus, it is possible that the majority of students who obtained and filled out this survey were living at their family's household. StudentMoveTO, however, is a survey sent to all students.

   - __Possible Solution:__
A possible solution for this could be to normalize the results so that we can conduct an objective comparison. However, we would need some guidance on how to perform this.
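
One candidate normalization, offered here only as a sketch, is to rescale the odds of living with family in each TTS group so that the overall TTS base rate matches the SMTO base rate. This is the usual correction for a shift in class proportions. It assumes that, within each class, the attribute distributions are the same in both surveys, which may not hold here. The function name `rebase_share` is hypothetical.

```python
def rebase_share(share, source_rate, target_rate):
    """Rescale a group share from a population with base rate `source_rate`
    to one with base rate `target_rate`, holding the likelihood ratio fixed."""
    source_odds = source_rate / (1 - source_rate)
    target_odds = target_rate / (1 - target_rate)
    weight = target_odds / source_odds
    return share * weight / (share * weight + (1 - share))

TTS_RATE = 0.687779   # overall TTS share, from the output above
SMTO_RATE = 0.564904  # overall SMTO share, from the output above

# Example: TTS share for income code 1 (59.0% in the income table).
adjusted = rebase_share(0.590, TTS_RATE, SMTO_RATE)
print("{:.1%}".format(adjusted))
```

As a sanity check, applying the function to the overall TTS share itself returns the overall SMTO share, so the adjusted group shares can be read against the SMTO column on a common baseline.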


We expected the proportion of students living at home to be higher in TTS for every group explored above, since more TTS students are predicted to live with their families overall. However, this was not the case for students 18 or over who live with children under 18, nor when segmenting by the number of children in the household.

## Questions:

- How can we normalize the above results about students' living arrangements to obtain an unbiased comparison between the SMTO and TTS surveys?
- Could we use this machine learning technique to also predict the education level of students (i.e., whether they are in an undergraduate or graduate program)? This would allow us to develop a similar segmentation for TTS as we did for SMTO. Alternatively, we could simply stick to age for any segmentation needs, or use it as an explanatory variable in whatever location choice model we come up with.
- When we ran the machine learning program, we noticed that removing some attribute columns decreased the accuracy of the algorithm by only very small amounts (e.g., from 91.65% accuracy to 90.40%). From this observation we inferred that such columns have little correlation with the student's living arrangement. Is it worth keeping these columns, or should we remove them given that the accuracy changes by very little?
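
On the last question, a small drop in accuracy after removing a column does not by itself show that the column is unrelated to living arrangement, because other, correlated columns may be compensating for it. A complementary check is permutation importance, which shuffles one column at a time and measures the drop in accuracy of an already trained model. The sketch below is plain Python with a toy stand-in model and made-up data. `permutation_drop`, `toy_predict`, and the data are all hypothetical and unrelated to the ML.NET model.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)

def permutation_drop(predict, rows, labels, column, seed=0):
    """Accuracy lost when the values of one column are shuffled across rows."""
    rng = random.Random(seed)
    shuffled_values = [row[column] for row in rows]
    rng.shuffle(shuffled_values)
    shuffled_rows = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(rows, shuffled_values)
    ]
    return accuracy(predict, rows, labels) - accuracy(predict, shuffled_rows, labels)

# Toy stand-in for a trained model: it only looks at column 0.
def toy_predict(row):
    return 1 if row[0] >= 2 else 0

# Made-up data: column 0 drives the label, column 1 is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.randint(0, 4), data_rng.randint(0, 9)] for _ in range(500)]
labels = [1 if row[0] >= 2 else 0 for row in rows]

drops = {column: permutation_drop(toy_predict, rows, labels, column) for column in (0, 1)}
print(drops)
```

In this toy setup, shuffling the noise column leaves accuracy unchanged, while shuffling the informative column reduces it substantially.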