# Machine Learning in Python - Project 1

Due Friday, March 8th by 4 pm.

*Include contributors' names in notebook metadata or here*

## Setup

*Install any packages here and load data*

In [3]:
# Add any additional libraries or submodules below

# Data libraries
import pandas as pd
import numpy as np

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting defaults
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 80

# sklearn modules
import sklearn

In [4]:
# Load data in easyshare.csv
d = pd.read_csv("easyshare.csv")
d.head()

| | mergeid | int_year | wave | country | country_mod | female | age | birth_country | citizenship | isced1997_r | ... | bmi2 | smoking | ever_smoked | br010_mod | br015_ | casp | chronic_mod | sp008_ | ch001_ | cogscore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AT-000674-01 | 2011.0 | 4.0 | 11.0 | 40.0 | 1.0 | 59.700001 | 40.0 | 40.0 | 5.0 | ... | 2.0 | 5.0 | 5.0 | 5.0 | 1.0 | 44.0 | 1.0 | 5.0 | 2.0 | 8.0 |
| 1 | AT-001215-01 | 2011.0 | 4.0 | 11.0 | 40.0 | 1.0 | 72.599998 | 528.0 | 528.0 | 5.0 | ... | 3.0 | 1.0 | 1.0 | 2.0 | 1.0 | 35.0 | 6.0 | 1.0 | 0.0 | 14.5 |
| 2 | AT-001492-01 | 2011.0 | 4.0 | 11.0 | 40.0 | 1.0 | 59.599998 | 40.0 | 40.0 | 3.0 | ... | 2.0 | 5.0 | 1.0 | 7.0 | 2.0 | 43.0 | 1.0 | 5.0 | 6.0 | 18.5 |
| 3 | AT-001492-02 | 2011.0 | 4.0 | 11.0 | 40.0 | 0.0 | 59.799999 | 40.0 | 40.0 | 4.0 | ... | 2.0 | 1.0 | 1.0 | 7.0 | 3.0 | 47.0 | 1.0 | NaN | 6.0 | 24.0 |
| 4 | AT-001816-01 | 2004.0 | 1.0 | 11.0 | 40.0 | 1.0 | 61.299999 | 40.0 | 40.0 | 3.0 | ... | 2.0 | 5.0 | 1.0 | 4.0 | 3.0 | 43.0 | 0.0 | 5.0 | 2.0 | 20.0 |


# Introduction

*This section should include a brief introduction to the task and the data (assume this is a report you are delivering to a professional body, e.g. the European Union, governments, health institutes and/or charities working on dementia and ageing). If you use any additional data sources, you should introduce them here and discuss why they were included.*

*Briefly outline the approaches being used and the conclusions that you are able to draw.*

# Exploratory Data Analysis and Feature Engineering

*Include a detailed discussion of the data with a particular emphasis on the features of the data that are relevant for the subsequent modeling. Including visualizations of the data is strongly encouraged - all code and plots must also be described in the write up. Think carefully about whether each plot needs to be included in your final draft - your report should include figures but they should be as focused and impactful as possible.*

*You should also split your data into training and testing sets, ideally before you look too much into the features and their relationships with the target.*
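As a minimal sketch of such a split, the snippet below uses `train_test_split` from scikit-learn. The frame `d_demo` is a hypothetical synthetic stand-in for `d`, and treating `cogscore` as the target is an assumption made only for illustration; the 80/20 ratio and the seed are likewise arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the data frame `d`; in the notebook, use the
# frame loaded from easyshare.csv instead.
rng = np.random.default_rng(0)
d_demo = pd.DataFrame({
    "age": rng.uniform(55, 100, size=200),
    "casp": rng.integers(12, 49, size=200),
    "cogscore": rng.uniform(0, 26, size=200),
})

# Assumption: cogscore is the modelling target, everything else is a feature.
X = d_demo.drop(columns="cogscore")
y = d_demo["cogscore"]

# Hold out 20% of the rows; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Splitting first and exploring only the training portion helps keep the test set untouched until the final evaluation.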

*This section should also implement and describe any preprocessing / feature engineering of the data. Specifically, this should be any code that you use to generate new columns in the data frame `d`. Feature engineering that will be performed as part of an sklearn pipeline can be mentioned here but should be implemented in the following section.*

*If you decide to extract additional features from the full data (easyshare_all.csv), describe these variables here.*

*All code and figures should be accompanied by text that provides an overview / context to what is being done or presented.*


* `mergeid` - person identifier
* `wave` - wave identifier
* `country` - country identifier
* `country_mod` - modified country identifier
* `female` - dummy encoded gender with 0 for male and 1 for female
* `age` - age at interview
* `birth_country` - country of birth
* `citizenship` - citizenship of respondent
* `isced1997_r` - ISCED-97 encoding of education (6 levels - see pg. 11 of data guide)
* `eduyears_mod` - years of education
* `eurod` - depression scale ranging from 0 “not depressed” to 12 “very depressed”
* `bmi` - body mass index
* `bmi2` - categorized body mass index
* `smoking` - smoke at present time
* `ever_smoked` - ever smoked daily
* `br010_mod` - drinking behavior
* `br015_` - vigorous activities
* `casp` - CASP-12 score measures quality of life and is based on four subscales on control, autonomy, pleasure and self-realization, ranges from 12 to 48
* `chronic_mod` - number of chronic diseases
* `sp008_` - gives help to others outside the household
* `ch001_` - number of children
* `cogscore` - measure of cognitive function combining results from two numeracy tests, two

In [5]:
d.describe().round(2)

| | int_year | wave | country | country_mod | female | age | birth_country | citizenship | isced1997_r | eduyears_mod | ... | bmi2 | smoking | ever_smoked | br010_mod | br015_ | casp | chronic_mod | sp008_ | ch001_ | cogscore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 97372.0 | 97372.0 | 97372.0 | 97372.0 | 97372.0 | 97372.0 | 97170.0 | 97281.0 | 97372.0 | 84532.0 | ... | 94686.0 | 96887.0 | 97116.0 | 79210.0 | 97113.0 | 84065.0 | 97283.0 | 83175.0 | 96928.0 | 97372.0 |
| mean | 2010.47 | 3.79 | 24.27 | 384.95 | 0.54 | 67.94 | 398.3 | 391.76 | 2.66 | 10.43 | ... | 2.85 | 4.34 | 3.26 | 3.35 | 2.67 | 36.74 | 1.31 | 3.98 | 2.17 | 11.97 |
| std | 4.63 | 2.15 | 12.3 | 229.26 | 0.5 | 8.86 | 237.54 | 232.97 | 1.49 | 4.26 | ... | 0.76 | 1.48 | 1.98 | 2.24 | 1.33 | 6.46 | 1.27 | 1.74 | 1.4 | 4.18 |
| min | 2004.0 | 1.0 | 11.0 | 40.0 | 0.0 | 55.1 | 2.0 | 4.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 12.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 2006.0 | 2.0 | 15.0 | 208.0 | 0.0 | 60.5 | 208.0 | 208.0 | 1.0 | 8.0 | ... | 2.0 | 5.0 | 1.0 | 1.0 | 1.0 | 32.0 | 0.0 | 1.0 | 1.0 | 9.0 |
| 50% | 2011.0 | 4.0 | 20.0 | 300.0 | 1.0 | 66.6 | 348.0 | 348.0 | 3.0 | 11.0 | ... | 3.0 | 5.0 | 5.0 | 3.0 | 3.0 | 37.0 | 1.0 | 5.0 | 2.0 | 12.0 |
| 75% | 2013.0 | 5.0 | 31.0 | 616.0 | 1.0 | 74.2 | 642.0 | 620.0 | 3.0 | 13.0 | ... | 3.0 | 5.0 | 5.0 | 5.0 | 4.0 | 42.0 | 2.0 | 5.0 | 3.0 | 15.0 |
| max | 2020.0 | 8.0 | 63.0 | 756.0 | 1.0 | 111.6 | 1101.0 | 1132.0 | 6.0 | 30.0 | ... | 4.0 | 5.0 | 5.0 | 7.0 | 4.0 | 48.0 | 9.0 | 5.0 | 17.0 | 26.0 |


In [1]:
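# Correlation heatmap of the numeric columns.
# Run the Setup cells first so that `sns` and `d` are defined.
sns.heatmap(d.corr(numeric_only=True))  # pass annot=True to print the coefficients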

# Model Fitting and Tuning

*In this section you should detail your choice of model and describe the process used to refine and fit that model. You are strongly encouraged to explore many different modeling methods (e.g. linear regression, interaction terms, lasso, etc.) but you should not include a detailed narrative of all of these attempts. At most this section should mention the methods explored and why they were rejected - most of your effort should go into describing the model you are using and your process for tuning and validating it.*

*For example, if you considered a linear regression model, a polynomial regression, and a lasso model and ultimately settled on the linear regression approach, then you should mention that the other two approaches were tried but do not include any of the code or any in-depth discussion of these models beyond why they were rejected. This section should then detail the development of the linear regression model in terms of features used, interactions considered, and any additional tuning and validation which ultimately led to your final model.*

*This section should also include the full implementation of your final model, including all necessary validation. As with figures, any included code must also be addressed in the text of the document.*
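One possible shape for a tuned and validated model is sketched below: an sklearn `Pipeline` that imputes, scales, and fits a lasso, with the penalty chosen by 5-fold cross-validation. The synthetic arrays, the median imputation, the `alpha` grid, and the choice of lasso itself are all assumptions for illustration, not a prescribed approach for this project.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic training data standing in for the real features.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 5))
y_train = 2.0 * X_train[:, 0] - X_train[:, 1] + rng.normal(scale=0.5, size=300)

# Inject roughly 5% missing values to mimic incomplete survey responses.
X_train[rng.random(X_train.shape) < 0.05] = np.nan

# Keeping preprocessing inside the pipeline means it is re-fit on each
# cross-validation fold, so no information leaks from validation rows.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", Lasso(max_iter=10_000)),
])

# Tune the lasso penalty over a log-spaced grid (an arbitrary example grid).
alphas = np.logspace(-3, 1, 9)
search = GridSearchCV(
    pipe,
    param_grid={"model__alpha": alphas},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["model__alpha"]
cv_rmse = -search.best_score_  # the scorer is negated, so flip the sign
print(f"best alpha: {best_alpha:.4g}, cross-validated RMSE: {cv_rmse:.3f}")
```

The fitted `search.best_estimator_` is the whole pipeline, so it can be applied directly to held-out rows that still contain missing values.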

*Finally, you should also provide a comparison of your model with baseline model(s) on the test data, but only briefly describe the baseline model(s) considered.*

# Discussion & Conclusions

*In this section you should provide a general overview of your final model, its performance, and reliability. You should discuss what the implications of your model are in terms of the included features, predictive performance, and anything else you think is relevant.*

*This should be written with a target audience of a government official or charity director, who understands the pressing challenges associated with ageing and dementia but may only have university-level mathematics (not necessarily postgraduate statistics or machine learning). Your goal should be to highlight to this audience how your model can be useful. You should also mention potential limitations of your model.*

*Finally, you should include recommendations on potential lifestyle changes or governmental/societal interventions to reduce dementia risk.*

*Keep in mind that a negative result, i.e. a model that does not work well predictively, that is well explained and justified in terms of why it failed will likely receive higher marks than a model with strong predictive performance but with poor or incorrect explanations / justifications.*

# References

*Include references if any*

In [None]:
# Run the following to render to PDF
!jupyter nbconvert --to pdf project1.ipynb