In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn import preprocessing 
from scipy import stats

sb.set()

In [2]:
telco_df = pd.read_csv('telco.txt', sep='\t')
telco_df.head()

Unnamed: 0,tenure,age,marital,address,income,ed,employ,retire,gender,longmon,wiremon,churn
1,13,44,Married,9,64,College degree,5,No,Male,3.7,0.0,Yes
2,11,33,Married,7,136,Post-undergraduate degree,5,No,Male,4.4,35.7,Yes
3,68,52,Married,24,116,Did not complete high school,29,No,Female,18.15,0.0,No
4,33,33,Unmarried,12,33,High school degree,0,No,Female,9.45,0.0,Yes
5,23,30,Married,9,30,Did not complete high school,2,No,Male,6.3,0.0,No


# Problem 4

## Task 1
Description: Have a closer look at the definitions of the variables and analyze which of them might require a separate treatment. Consider for example the variable ed. There are two possibilities how the variable ed can be included into the model (one with dummy variables, the other one without dummies). Think about these two approaches and suggest which approach is more appropriate. Motivate your decision.

The given dataset contains several categorical variables:
- marital with values Married and Unmarried;
- retire with Yes and No values;
- gender with Male and Female values;
- churn with values Yes and No;
- ed with 5 distinct values - Did not complete high school, High school degree, Some college, College degree, Post-undergraduate degree.

The most common ways of including such variables in a model are Dummy Variable Encoding and Label Encoding. Dummy Variable Encoding introduces $m-1$ dummy variables (taking only the values 1 or 0) for each categorical variable with $m$ possible categories. It is important to introduce exactly $m-1$ dummy variables and not $m$: the remaining category is already represented by all $m-1$ dummies being 0, and an $m$-th dummy would be perfectly collinear with the intercept and the other dummies. The omitted category is called the reference category.

In our case, the variables marital, retire, gender and churn each need one dummy variable. This is possible because all of them are binary: each takes exactly one of two values, and the values cannot overlap. For the variable ed we need four dummy variables, one each for the categories Did not complete high school, High school degree, Some college and College degree. The category Post-undergraduate degree is the reference category and is represented by all four dummy variables being equal to 0.
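
The $m-1$ rule can be checked on a toy column. The sketch below is illustrative only: the `toy` frame is made-up data (not the telco dataset), and the reference category is dropped by hand so that it is the one named in the text.

```python
import pandas as pd

# Hypothetical toy data with m = 3 education levels (not the telco dataset).
toy = pd.DataFrame({'ed': ['High school degree', 'College degree',
                           'Post-undergraduate degree', 'College degree']})

# get_dummies creates one indicator column per category (m columns) ...
toy_dummies = pd.get_dummies(toy, columns=['ed'], dtype=int)

# ... so the chosen reference category is dropped to keep m - 1 columns.
toy_dummies = toy_dummies.drop(columns=['ed_Post-undergraduate degree'])

print(toy_dummies)
```

The row with Post-undergraduate degree ends up with zeros in both remaining columns. Passing `drop_first=True` to `pd.get_dummies` would also leave $m-1$ columns, but it drops the alphabetically first category rather than a chosen one.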

Another way to include categorical variables in the model is Label Encoding. With this approach each categorical variable is replaced by a single new variable holding one numeric label per category. For the variables marital, retire, gender and churn this gives label variables with values 0 and 1 (as with dummy variables). Similarly, for the variable ed we get one label variable with values from 0 to 4 representing the five categories.

Below are transformed datasets using both approaches.

In [3]:
#Dummy Variable Encoding 

# dtype=int keeps the indicators as 0/1 (newer pandas versions default to booleans).
telco_dummy_df = pd.get_dummies(telco_df, columns=['marital', 'retire', 'gender', 'churn', 'ed'], dtype=int)
# Drop one column per variable; the dropped category becomes the reference category.
telco_dummy_df.drop(columns=['marital_Unmarried', 'retire_No', 'gender_Female', 'churn_No',
                             'ed_Post-undergraduate degree'], inplace=True)
telco_dummy_df.rename(columns={'marital_Married': 'marital_dummy', 'retire_Yes': 'retire_dummy',
                               'gender_Male': 'gender_dummy', 'churn_Yes': 'churn_dummy',
                               'ed_College degree': 'ed_dummy_college_degree',
                               'ed_Did not complete high school': 'ed_dummy_no_high_school',
                               'ed_High school degree': 'ed_dummy_high_school',
                               'ed_Some college': 'ed_dummy_some_college'}, inplace=True)

telco_dummy_df.head()

Unnamed: 0,tenure,age,address,income,employ,longmon,wiremon,marital_dummy,retire_dummy,gender_dummy,churn_dummy,ed_dummy_college_degree,ed_dummy_no_high_school,ed_dummy_high_school,ed_dummy_some_college
1,13,44,9,64,5,3.7,0.0,1,0,1,1,1,0,0,0
2,11,33,7,136,5,4.4,35.7,1,0,1,1,0,0,0,0
3,68,52,24,116,29,18.15,0.0,1,0,0,0,0,1,0,0
4,33,33,12,33,0,9.45,0.0,0,0,0,1,0,0,1,0
5,23,30,9,30,2,6.3,0.0,1,0,1,0,0,1,0,0


In [14]:
#Label Encoding

# Work on a copy; plain assignment would only alias telco_df and overwrite its columns.
telco_label_df = telco_df.copy()

# Alternative: preprocessing.LabelEncoder().fit_transform(telco_label_df[col])

# Convert each categorical column to integer codes.
# pandas sorts the categories alphabetically, so the codes follow that order.
for col in ['marital', 'retire', 'gender', 'churn', 'ed']:
    telco_label_df[col] = telco_label_df[col].astype('category').cat.codes

telco_label_df.head()

Unnamed: 0,tenure,age,marital,address,income,ed,employ,retire,gender,longmon,wiremon,churn
1,13,44,0,9,64,0,5,0,1,3.7,0.0,1
2,11,33,0,7,136,3,5,0,1,4.4,35.7,1
3,68,52,0,24,116,1,29,0,0,18.15,0.0,0
4,33,33,1,12,33,2,0,0,0,9.45,0.0,1
5,23,30,0,9,30,1,2,0,1,6.3,0.0,0


There is not much difference between the results of Label Encoding and Dummy Variable Encoding for the categorical variables marital, retire, gender and churn: both yield a single 0/1 column, and only the category coded as 1 may differ. However, there is a major difference for the categorical variable ed.

With Label Encoding, ed is encoded by the values 0 to 4. A regression model treats these numbers as a metric scale: it assumes the ordering 4>3>2>1>0 and counts a value of 4 as twice a value of 2. Neither assumption holds here. The difference between levels of education cannot be measured numerically, and the codes above are assigned in alphabetical order of the labels (College degree is 0, Post-undergraduate degree is 3), so they do not even follow the level of education. As a result, the model would impose an arbitrary relation between the categories and treat some of them as more important than others.
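
If an integer coding of ed is wanted anyway, the codes can at least be made to follow the education level. This is a minimal sketch, assuming the five levels listed in Task 1 run from lowest to highest as their labels suggest. `ed_order` and `levels` are names introduced here for illustration, and `levels` is made-up data.

```python
import pandas as pd

# Education levels from lowest to highest (order assumed from the labels).
ed_order = ['Did not complete high school', 'High school degree',
            'Some college', 'College degree', 'Post-undergraduate degree']

# Hypothetical sample of ed values (not the telco dataset).
levels = pd.Series(['College degree', 'Post-undergraduate degree',
                    'Did not complete high school', 'High school degree'])

# Default coding: categories are sorted alphabetically.
alphabetical_codes = levels.astype('category').cat.codes

# Explicit ordering: codes follow the position in ed_order.
ordered = pd.Categorical(levels, categories=ed_order, ordered=True)
ordered_codes = pd.Series(ordered.codes)

print(alphabetical_codes.tolist())
print(ordered_codes.tolist())
```

Even with ordered codes, a regression would still assume equal steps between neighbouring levels, so this only removes the arbitrary ordering, not the spacing assumption.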

In contrast, Dummy Variable Encoding replaces ed with four dummy variables, each taking the values 0 or 1; the fifth category is encoded by setting all four dummy variables to 0. As a result, the model estimates a separate coefficient for each ed category, measured relative to the reference category, without imposing any order or spacing between the categories. This gives a more appropriate model than the Label Encoding approach, so we will use the Dummy Variable encoded dataset going forward.
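
How a regression reads dummy variables can be illustrated with ordinary least squares on a tiny example. The sketch below uses made-up groups `A`, `B`, `C` and made-up `y` values, with `A` as the reference category; it is not fitted to the telco data.

```python
import numpy as np

# Hypothetical toy data: three groups with different means of y.
group = np.array(['A', 'A', 'B', 'B', 'C', 'C'])
y = np.array([1.0, 3.0, 4.0, 6.0, 10.0, 12.0])

# Design matrix: intercept plus dummies for B and C (A is the reference).
X = np.column_stack([np.ones(len(y)),
                     (group == 'B').astype(float),
                     (group == 'C').astype(float)])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)
```

The intercept equals the mean of the reference group (2), and each dummy coefficient equals the difference between its group mean and the reference mean (3 for `B`, 9 for `C`). The spacing between the groups is estimated from the data rather than fixed in advance by the coding.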

## Task 2
Consider now the dependent variable and the interval (metric) scaled explanatory variables. Plot these data and decide if you wish to transform these x-variables and if there is a need to transform the y variable. You can also use some measure of skewness to decide about y. The variable wiremon shows a very specific pattern. How would you take it into account?

In [34]:
# Candidate transformation of wiremon: maps 0 to 1 and large values towards 0.
telco_dummy_df['test'] = 1 / (1 + telco_dummy_df['wiremon'])

telco_dummy_df

# Building pairplots, filtering out all binary variables - those are not representable on scatterplots.
# sb.pairplot(telco_dummy_df.filter(['tenure', 'age', 'address', 'income', 'employ', 'longmon', 'wiremon', 'test']))

Unnamed: 0,tenure,age,address,income,employ,longmon,wiremon,marital_dummy,retire_dummy,gender_dummy,churn_dummy,ed_dummy_college_degree,ed_dummy_no_high_school,ed_dummy_high_school,ed_dummy_some_college,test
1,13,44,9,64,5,3.70,0.0,1,0,1,1,1,0,0,0,1.000000
2,11,33,7,136,5,4.40,35.7,1,0,1,1,0,0,0,0,0.027248
3,68,52,24,116,29,18.15,0.0,1,0,0,0,0,1,0,0,1.000000
4,33,33,12,33,0,9.45,0.0,0,0,0,1,0,0,1,0,1.000000
5,23,30,9,30,2,6.30,0.0,1,0,1,0,0,1,0,0,1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,10,39,0,27,0,3.00,0.0,0,0,0,0,0,0,0,1,1.000000
997,7,34,2,22,5,4.65,0.0,0,0,0,0,0,0,0,0,1.000000
998,67,59,40,944,33,26.75,65.8,0,0,0,0,0,0,0,0,0.014970
999,70,49,18,87,22,25.60,0.0,0,0,0,0,0,0,1,0,1.000000
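
One possible way to approach the remaining questions of this task is sketched below on a handful of longmon and wiremon values copied from the rows displayed above, not on the full dataset. `scipy.stats.skew` gives a skewness measure per metric variable, and the mass of exact zeros in wiremon can be captured by a separate 0/1 indicator. The frame `toy` and the columns `log_longmon` and `wiremon_used` are names introduced here for illustration, and the log transform is only one candidate.

```python
import numpy as np
import pandas as pd
from scipy import stats

# A few rows taken from the output above: wiremon is mostly exactly 0.
toy = pd.DataFrame({'longmon': [3.7, 4.4, 18.15, 9.45, 6.3, 26.75],
                    'wiremon': [0.0, 35.7, 0.0, 0.0, 0.0, 65.8]})

# Sample skewness per column; values well above 0 indicate a long right tail.
skewness = toy.apply(stats.skew)

# A log transform is one common candidate for right-skewed positive data.
toy['log_longmon'] = np.log(toy['longmon'])

# Indicator for customers with any wiremon charge, so the zeros are modelled
# separately from the size of the positive values.
toy['wiremon_used'] = (toy['wiremon'] > 0).astype(int)

print(skewness)
print(toy)
```

On the real data the same steps would be applied to `telco_dummy_df`. The indicator could then enter the model alongside, or instead of, the raw wiremon values.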
