## Data quality check / cleaning / preparation 

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.** An example is given below.

In [50]:
# imports:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

In [55]:
# load in data set
df = pd.read_csv('./Datasets/d_alc_disorders_by_age.csv')
df.head()

,Entity,Code,Year,DALYs (Disability-Adjusted Life Years) - Alcohol use disorders - Sex: Both - Age: All Ages (Rate),DALYs (Disability-Adjusted Life Years) - Alcohol use disorders - Sex: Both - Age: 70+ years (Rate),DALYs (Disability-Adjusted Life Years) - Alcohol use disorders - Sex: Both - Age: 5-14 years (Rate),DALYs (Disability-Adjusted Life Years) - Alcohol use disorders - Sex: Both - Age: 15-49 years (Rate),DALYs (Disability-Adjusted Life Years) - Alcohol use disorders - Sex: Both - Age: Age-standardized (Rate),DALYs (Disability-Adjusted Life Years) - Alcohol use disorders - Sex: Both - Age: 50-69 years (Rate)
0,Afghanistan,AFG,1990,56.93,52.2,1.73,101.05,79.02,108.86
1,Afghanistan,AFG,1991,56.04,51.88,1.7,99.45,78.6,107.58
2,Afghanistan,AFG,1992,55.59,51.6,1.69,99.11,78.55,106.46
3,Afghanistan,AFG,1993,55.67,51.64,1.66,100.44,78.89,106.02
4,Afghanistan,AFG,1994,55.61,51.76,1.64,102.26,79.15,105.85


In [56]:
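# drop the three-letter country code column
df = df.drop(columns = "Code")
# shorten the long DALY-rate column names so they are easier to reference
df.rename(columns = {"Entity": "Country",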
                     'DALYs (Disability-Adjusted Life Years) - Alcohol use disorders - Sex: Both - Age: All Ages (Rate)': 'all ages rate', 
                     'DALYs (Disability-Adjusted Life Years) - Alcohol use disorders - Sex: Both - Age: 70+ years (Rate)': '70+ rate', 
                     'DALYs (Disability-Adjusted Life Years) - Alcohol use disorders - Sex: Both - Age: 5-14 years (Rate)':'5-14 rate', 
                     'DALYs (Disability-Adjusted Life Years) - Alcohol use disorders - Sex: Both - Age: 15-49 years (Rate)':'15-49 rate',
                     'DALYs (Disability-Adjusted Life Years) - Alcohol use disorders - Sex: Both - Age: Age-standardized (Rate)':'all age standarized rate',
                     'DALYs (Disability-Adjusted Life Years) - Alcohol use disorders - Sex: Both - Age: 50-69 years (Rate)': '50-69 rate'}, 
         inplace = True)

In [57]:
# index each row by its (Country, Year) pair
df2 = df.set_index(['Country', 'Year'])
df2.head()

Country,Year,all ages rate,70+ rate,5-14 rate,15-49 rate,all age standarized rate,50-69 rate
Afghanistan,1990,56.93,52.2,1.73,101.05,79.02,108.86
Afghanistan,1991,56.04,51.88,1.7,99.45,78.6,107.58
Afghanistan,1992,55.59,51.6,1.69,99.11,78.55,106.46
Afghanistan,1993,55.67,51.64,1.66,100.44,78.89,106.02
Afghanistan,1994,55.61,51.76,1.64,102.26,79.15,105.85


In [58]:
# save the cleaned data set; Country and Year are written as the first two columns
df2.to_csv('./Datasets/CLEANED_alc_disorders_by_age.csv')
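
Because `df2` has a two-level index, the saved file needs `index_col` when it is read back, otherwise `Country` and `Year` return as ordinary columns. The sketch below illustrates this round trip on a two-row stand-in frame (values copied from the preview above) and writes to an in-memory buffer rather than `./Datasets/`; the buffer and the variable name `reloaded` are assumptions made for illustration.

```python
import io

import pandas as pd

# Two-row stand-in for the cleaned frame (values taken from the preview above).
df = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan"],
    "Year": [1990, 1991],
    "all ages rate": [56.93, 56.04],
    "70+ rate": [52.2, 51.88],
})
df2 = df.set_index(["Country", "Year"])

# Write to an in-memory buffer so the sketch runs without the Datasets folder.
buffer = io.StringIO()
df2.to_csv(buffer)
buffer.seek(0)

# index_col restores the (Country, Year) MultiIndex on load.
reloaded = pd.read_csv(buffer, index_col=["Country", "Year"])
print(reloaded)
```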

### Data quality check
*By Elton John*

The code below visualizes the distribution of all the variables in the dataset, and their association with the response.

In [7]:
#...Distribution of continuous variables...#

In [8]:
#...Distribution of categorical variables...#

In [9]:
#...Association of the response with the predictors...#
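
As one possible starting point for the three placeholder cells above, the sketch below computes tabular versions of the same checks using pandas only. The frame `data` and its columns (`house_price`, `floor_area`, `house_type`) are hypothetical and should be replaced with the real dataset and response.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are illustrative only.
data = pd.DataFrame({
    "house_price": [250000.0, 310000.0, np.nan, 420000.0, 198000.0],
    "floor_area": [1200.0, 1500.0, 1100.0, 2100.0, 950.0],
    "house_type": ["detached", "detached", "townhouse", "detached", None],
})

# Share of missing values per column, largest first.
missing_share = data.isna().mean().sort_values(ascending=False)

# Count, mean, spread and quartiles of the continuous variables.
numeric_summary = data.describe()

# Frequency table of a categorical variable, counting missing values too.
type_counts = data["house_type"].value_counts(dropna=False)

# Pearson correlation of each numeric predictor with the response.
numeric = data.select_dtypes("number")
corr_with_response = numeric.corr()["house_price"].drop("house_price")

print(missing_share, numeric_summary, type_counts, corr_with_response, sep="\n\n")
```

Plots such as histograms or scatterplots can be layered on top of these tables once the suspicious columns have been identified.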

### Data cleaning
*By Xena Valenzuela*

From the data quality check we realized that:

1. Some of the columns that should have contained only numeric values, specifically <>, <>, and <>, contain special characters such as \*, #, %. We'll remove these characters and convert the datatype of these columns to numeric.

2. Some of the columns have more than 60% missing values, and it is very difficult to impute their values, as the values seem to be missing at random with negligible association with other predictors. We'll remove such columns from the data.

3. The column `number_of_bedrooms` has some unreasonably high values such as 15. As our data consist of single-family homes in Evanston, we suspect that any value greater than 5 may be incorrect. We'll replace all values that are greater than 5 with an estimate obtained using the $K$-nearest neighbor approach.

4. The column `house_price` has some unreasonably high values. We'll tag all values greater than 1 billion dollars as "potentially incorrect observation", to see if they distort our prediction / inference later on.

The code below implements the above cleaning.

In [None]:
#...Code with comments...#
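
The four cleaning steps listed above could be sketched roughly as follows. Everything in the frame `raw` is hypothetical, including the column names `lot_size`, `floor_area` and `mostly_missing`. The nearest-neighbor step is a deliberately simple stand-in: it averages the `k = 2` closest valid rows measured on `floor_area` alone, which is an assumption rather than the method the template prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data; every column name and value is illustrative.
raw = pd.DataFrame({
    "lot_size": ["5000#", "6200", "48%00", "*7100", "5300", "6900"],
    "floor_area": [1200.0, 1500.0, 1100.0, 2100.0, 1250.0, 1900.0],
    "number_of_bedrooms": [2.0, 3.0, 15.0, 4.0, 2.0, 4.0],
    "mostly_missing": [np.nan, np.nan, 1.0, np.nan, np.nan, np.nan],
    "house_price": [2.5e5, 3.1e5, 2.2e5, 4.2e5, 3.0e9, 3.9e5],
})
clean = raw.copy()

# Step 1: strip the characters *, # and %, then convert to numeric.
stripped = clean["lot_size"].str.replace(r"[*#%]", "", regex=True)
clean["lot_size"] = pd.to_numeric(stripped, errors="coerce")

# Step 2: drop columns in which more than 60% of the values are missing.
missing_share = clean.isna().mean()
clean = clean.drop(columns=missing_share[missing_share > 0.6].index)

# Step 3: replace bedroom counts above 5 with the mean of the k nearest
# valid rows, where distance is measured on floor_area only (assumption).
k = 2
suspect = clean["number_of_bedrooms"] > 5
valid = clean.loc[~suspect]
for idx in clean.index[suspect]:
    distance = (valid["floor_area"] - clean.at[idx, "floor_area"]).abs()
    neighbours = distance.nsmallest(k).index
    clean.at[idx, "number_of_bedrooms"] = valid.loc[neighbours, "number_of_bedrooms"].mean()

# Step 4: flag, rather than remove, prices above 1 billion dollars.
clean["potentially_incorrect"] = clean["house_price"] > 1e9

print(clean)
```

With more than one predictor in the distance, the predictors would normally be put on a common scale first so that no single column dominates.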

### Data preparation
*By Sankaranarayanan Balasubramanian and Chun-Li*

The following data preparation steps helped us to prepare our data for implementing various modeling / validation techniques:

1. Since we need to predict house price, we derived some new predictors *(from existing predictors)* that intuitively seem helpful for predicting house price.

2. We have shuffled the dataset to prepare it for $K$-fold cross-validation.

3. We have created a standardized version of the dataset, as we will use it to develop Lasso / Ridge regression models.

In [3]:
######---------------Creating new predictors----------------#########

#Creating number of bedrooms per unit floor area

#Creating ratio of bathrooms to bedrooms

#Creating ratio of carpet area to floor area
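
A minimal sketch of the three derived predictors named in the comments above, assuming hypothetical columns `number_of_bedrooms`, `number_of_bathrooms`, `carpet_area` and `floor_area`. Treating a zero bedroom count as missing is also an assumption, made here to avoid infinite ratios.

```python
import numpy as np
import pandas as pd

# Hypothetical columns; rename them to match the real dataset.
houses = pd.DataFrame({
    "number_of_bedrooms": [2, 3, 0, 4],
    "number_of_bathrooms": [1, 2, 1, 3],
    "carpet_area": [900.0, 1150.0, 400.0, 1700.0],
    "floor_area": [1200.0, 1500.0, 500.0, 2100.0],
})

# Number of bedrooms per unit floor area.
houses["bedrooms_per_floor_area"] = houses["number_of_bedrooms"] / houses["floor_area"]

# Ratio of bathrooms to bedrooms; a zero bedroom count becomes NaN, not inf.
bedrooms = houses["number_of_bedrooms"].replace(0, np.nan)
houses["bath_to_bed_ratio"] = houses["number_of_bathrooms"] / bedrooms

# Ratio of carpet area to floor area.
houses["carpet_to_floor_ratio"] = houses["carpet_area"] / houses["floor_area"]

print(houses)
```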

In [None]:
######-----------Shuffling the dataset for K-fold------------#########
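
One way to shuffle and then split into folds, shown on a hypothetical ten-row frame. `DataFrame.sample(frac=1)` draws every row once without replacement, which amounts to a random permutation; the seed, the fold count `k = 5` and the round-robin fold assignment are illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical data with ten rows.
data = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2.0})

# Shuffle all rows; random_state makes the order reproducible.
shuffled = data.sample(frac=1, random_state=1).reset_index(drop=True)

# Assign shuffled rows to k folds in round-robin order.
k = 5
fold_ids = np.arange(len(shuffled)) % k

fold_sizes = []
for fold in range(k):
    test = shuffled[fold_ids == fold]    # held-out fold
    train = shuffled[fold_ids != fold]   # remaining k - 1 folds
    fold_sizes.append((len(train), len(test)))

print(fold_sizes)
```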

In [None]:
######-----Standardizing the dataset for Lasso / Ridge-------#########
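
Standardizing here means subtracting each predictor's mean and dividing by its standard deviation, so that the Lasso / Ridge penalty treats all coefficients on the same scale. The sketch below does this with plain pandas on hypothetical predictors; using the population standard deviation (`ddof=0`) is an assumption, and the sample version works equally well if applied consistently.

```python
import pandas as pd

# Hypothetical predictors on very different scales.
predictors = pd.DataFrame({
    "floor_area": [1200.0, 1500.0, 1100.0, 2100.0],
    "number_of_bedrooms": [2.0, 3.0, 2.0, 4.0],
})

# Column means and standard deviations.
means = predictors.mean()
stds = predictors.std(ddof=0)

# Each standardized column has mean 0 and standard deviation 1.
standardized = (predictors - means) / stds

print(standardized)
```

When the model is validated on held-out data, `means` and `stds` would typically be computed on the training rows only and then reused to transform the test rows.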

## Exploratory data analysis

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**

## Developing the model

Put code with comments. The comments should explain the code such that it can be easily understood. You may put text *(in a markdown cell)* before a large chunk of code to explain the overall purpose of the code, if it is not intuitive. **Put the name of the person / persons who contributed to each code chunk / set of code chunks.**

### Code fitting the final model

Put the code(s) that fit the final model(s) in separate cell(s), i.e., the code with the `.ols()` or `.logit()` functions.
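
For reference, a cell that fits a model with `.ols()` might look like the sketch below. The data are simulated, and the formula `house_price ~ floor_area` is a placeholder rather than a recommended specification; `.logit()` follows the same pattern with a binary response.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated training data (hypothetical): price grows linearly with floor area.
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({"floor_area": rng.uniform(800, 2500, n)})
train["house_price"] = 50000 + 150 * train["floor_area"] + rng.normal(0, 10000, n)

# Fit the model; the formula names the response on the left of "~".
final_model = smf.ols(formula="house_price ~ floor_area", data=train).fit()

# Estimated intercept and slope.
print(final_model.params)
```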

## Conclusions and Recommendations to stakeholder(s)

You may or may not have code to put in this section. Delete this section if it is irrelevant.