## Lab 5: logistic regression. ROC. AUC. One-hot encoding

## 1
Consider data from a health insurance company. Using patient characteristics from `Members.csv` (age, gender) and records of medical care received in the previous year from `Claims_Y1.csv` (medical institution, doctor, type of problem, number of days of hospitalization, date, etc.), predict whether a patient will be hospitalized for at least one day in the following year (`DaysInHospital_Y2.csv`).

In [1878]:
import pandas as pd
import numpy as np
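# Import only the scikit-learn submodules used below instead of a wildcard import
from sklearn import feature_selection, linear_model, metrics, model_selection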

%matplotlib inline
import matplotlib.pyplot as plt

Read the data, passing `MemberID` as the `index_col` parameter:

* assign the data from the `DaysInHospital_Y2.csv` table to the `days2` variable
* assign the data from the `Members.csv` table to the `m` variable
* assign the data from the `Claims_Y1.csv` table to the `claims` variable

In [1879]:
# place for code
days2 = pd.read_csv('DaysInHospital_Y2.csv', index_col='MemberID')
m = pd.read_csv('Members.csv', index_col='MemberID')
claims = pd.read_csv('Claims_Y1.csv', index_col='MemberID')

## 2
To anonymize the data, the organizers provided only approximate patient information. For example, the age column contains age groups: '0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80+'. Convert the string attributes to numeric ones and replace the missing values:

In [1880]:
i = pd.notnull(m.AgeAtFirstClaim)
m.loc[i,'AgeAtFirstClaim'] = m.loc[i,'AgeAtFirstClaim'].apply(lambda s: s.split('-')[0] if s!='80+' else '80')
m.loc[i,'AgeAtFirstClaim'] = m.loc[i,'AgeAtFirstClaim'].apply(lambda s: int(s))

m.AgeAtFirstClaim = m.AgeAtFirstClaim.fillna(value=-1)

m.Sex = m.Sex.fillna(value='N')

claims.CharlsonIndex = claims.CharlsonIndex.map({'0':0, '1-2':1, '3-4':3, '5+':5})
claims.LengthOfStay = claims.LengthOfStay.fillna(value=0)
claims.LengthOfStay = claims.LengthOfStay.map({0:0, '1 day':1, '2 days':2, '3 days':3, '4 days':4,\
    '5 days':5, '6 days':6, '1- 2 weeks':10, '2- 4 weeks':21, '4- 8 weeks':42, '26+ weeks':182})

## 3
Construct features from the `claims` data:
* `f_Charlson`: the maximum Charlson comorbidity index for a patient (`CharlsonIndex` in the `claims` table)
* `f_LengthOfStay`: the total number of days of hospitalization in the previous year (`LengthOfStay` in the `claims` table)

*Functions that can be useful: `.groupby(['MemberID'])`, `.max()`, `.sum()`*

In [1881]:
# place for code
f_Charlson = claims['CharlsonIndex'].groupby(['MemberID']).max()
f_LengthOfStay = claims['LengthOfStay'].groupby(['MemberID']).sum()

dummy = pd.get_dummies(claims.PrimaryConditionGroup, prefix='pcg')
f_pcg = dummy.groupby(['MemberID']).max()

dummy1 = pd.get_dummies(claims.Specialty, prefix='spe')
f_spe = dummy1.groupby(['MemberID']).max()

dummy2 = pd.get_dummies(claims.ProcedureGroup, prefix='pg')
f_pg = dummy2.groupby(['MemberID']).max()

dummy3 = pd.get_dummies(claims.DSFS, prefix='dsfc')
f_dsfc = dummy3.groupby(['MemberID']).max()

dummy4 = pd.get_dummies(claims.PlaceSvc, prefix='psvc')
f_psvc = dummy4.groupby(['MemberID']).max()

dummy5 = pd.get_dummies(m.Sex, prefix='sex')
f_sex = dummy5.groupby(['MemberID']).max()
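The cell above builds one-hot features per claim and then collapses them to one row per patient. The following is a minimal sketch of that pattern on a made-up three-patient table; the values and the `toy_` names are hypothetical, not part of the lab data.

```python
import pandas as pd

# Toy claims table (hypothetical values): one row per claim, indexed by MemberID.
toy_claims = pd.DataFrame(
    {"PrimaryConditionGroup": ["AMI", "CANCRA", "AMI", "PRGNCY"]},
    index=pd.Index([1, 1, 2, 3], name="MemberID"),
)

# One indicator column per category, still one row per claim.
toy_dummy = pd.get_dummies(toy_claims.PrimaryConditionGroup, prefix="pcg", dtype=int)

# Collapse to one row per member: 1 if the member had at least one such claim.
toy_f_pcg = toy_dummy.groupby(level="MemberID").max()
print(toy_f_pcg)
```

Taking `.max()` yields a "had at least one claim of this kind" flag; using `.sum()` instead would give a per-category claim count, which may or may not be a better feature here.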



## 4

Let's create a matrix of features with columns: `f_Charlson`, `f_LengthOfStay`, patient's age, `ClaimsTruncated` (whether there were too many cases of medical care):

*Functions that can be useful: `.join()`*

In [1882]:
data = days2
data = data.join(f_Charlson)
data = data.join(f_LengthOfStay)
data = data.join(f_pcg)
data = data.join(f_pg)
data = data.join(f_spe)
data = data.join(f_dsfc)
data = data.join(f_psvc)
# data = data.join(f_sex)
data = data.join(m['AgeAtFirstClaim'])

# place for code
data.head(5)

| MemberID | ClaimsTruncated | DaysInHospital | CharlsonIndex | LengthOfStay | pcg_AMI | pcg_APPCHOL | pcg_ARTHSPIN | pcg_CANCRA | pcg_CANCRB | pcg_CANCRM | ... | dsfc_9-10 months | psvc_Ambulance | psvc_Home | psvc_Independent Lab | psvc_Inpatient Hospital | psvc_Office | psvc_Other | psvc_Outpatient Hospital | psvc_Urgent Care | AgeAtFirstClaim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 98324177 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 30 |
| 33899367 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 80 |
| 5481382 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 20 |
| 69908334 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 60 |
| 29951458 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 40 |
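`DataFrame.join` performs a left join on the index by default, so a patient who appears in the target table but has no matching feature row would get `NaN` in the joined columns. The sketch below illustrates this on hypothetical toy data; whether 0 is an appropriate fill value for a given feature is an assumption to check against the real tables.

```python
import pandas as pd

# Hypothetical target table with three patients.
target = pd.DataFrame(
    {"DaysInHospital": [0, 1, 0]},
    index=pd.Index([1, 2, 3], name="MemberID"),
)
# Hypothetical feature available for only two of them.
feature = pd.Series(
    [2, 5], index=pd.Index([1, 2], name="MemberID"), name="LengthOfStay"
)

joined = target.join(feature)  # left join on the MemberID index
n_missing = int(joined["LengthOfStay"].isna().sum())

# Assumption: "no claims" means zero days of stay.
joined_filled = joined.fillna({"LengthOfStay": 0})
print(n_missing)
print(joined_filled)
```

Checking `data.isna().sum()` after the joins is a cheap way to confirm that nothing needs filling before the matrix is passed to the model.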


## 5
Create a function that splits the sample into two parts, `dataTrain` and `dataTest`, trains logistic regression on `dataTrain`, applies it to `dataTest`, builds the ROC curve, and computes the area under it (AUC):

In [1883]:
def calcAUC(data):
    dataTrain, dataTest = model_selection.train_test_split(data, test_size=0.5, random_state=1)
    model = linear_model.LogisticRegression()
    model.fit( dataTrain.loc[:, dataTrain.columns != 'DaysInHospital'], dataTrain.DaysInHospital )
    predictionProb = model.predict_proba( dataTest.loc[:, dataTest.columns != 'DaysInHospital'] )
    # fpr, tpr, _ = metrics.roc_curve(dataTest['DaysInHospital'], predictionProb[:,1])
    # plt.figure()
    # plt.plot(fpr, tpr, color='darkorange', lw=2)
    # plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    # plt.show()
    return( metrics.roc_auc_score(dataTest['DaysInHospital'], predictionProb[:,1]) )
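To make the returned number concrete, here is a small worked example on four hypothetical predictions. It computes the AUC three ways: by integrating the ROC curve, by calling `roc_auc_score` directly, and as the share of (positive, negative) pairs in which the positive example receives the higher score.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# ROC curve points, then the area under them by the trapezoidal rule.
fpr, tpr, thresholds = metrics.roc_curve(y_true, scores)
auc_trapezoid = metrics.auc(fpr, tpr)

# The same quantity in a single call.
auc_direct = metrics.roc_auc_score(y_true, scores)

# Pairwise view: fraction of (positive, negative) pairs ranked correctly,
# counting ties as one half.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
auc_pairs = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])

print(auc_trapezoid, auc_direct, auc_pairs)
```

Three of the four pairs are ordered correctly here, so all three values equal 0.75. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.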

## 6
Apply this function to `data`:

In [1884]:
calcAUC(data)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.6700704873373138
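The warning above suggests two remedies: raising `max_iter` or scaling the features. A minimal sketch of both, on synthetic data with deliberately mismatched feature scales, is shown below. The data, the `max_iter=1000` value, and the use of `StandardScaler` are assumptions for illustration; applying the same change inside `calcAUC` would likely shift the AUC values reported in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5)) * np.array([1, 10, 100, 1, 1000])
logits = X[:, 0] + X[:, 1] / 10
y = (logits + rng.normal(size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)

# Standardize each feature, then fit logistic regression with a higher
# iteration cap; the scaler is fitted on the training part only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

auc_scaled = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(auc_scaled)
```

Wrapping the scaler and the classifier in one pipeline keeps the test half out of the scaling statistics.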

## 7
Logistic regression accepts only quantitative features as input.

Add the patient's gender to our data using one hot encoding:

*Functions that can be useful: `pd.get_dummies(m.Sex, prefix='pol')`*
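As an illustration of what one-hot encoding produces, the sketch below encodes a hypothetical four-patient `Sex` column, with the missing value replaced by `'N'` as in section 2. The toy table is made up; `dtype=int` is passed so the indicators are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Hypothetical members table with one missing gender value.
toy_m = pd.DataFrame(
    {"Sex": ["M", "F", None, "F"]},
    index=pd.Index([1, 2, 3, 4], name="MemberID"),
)
toy_m["Sex"] = toy_m["Sex"].fillna("N")

# One 0/1 column per category; each row has exactly one 1.
toy_sex = pd.get_dummies(toy_m["Sex"], prefix="pol", dtype=int)
print(toy_sex)
```

The result can be attached to the feature matrix with `.join()`, since it shares the `MemberID` index.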

## 8
Try applying one-hot encoding to the existing features in `data2`, or create new features from the `claims` table.

In [1885]:
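data2 = data.join(pd.get_dummies(m.Sex, prefix='gen'))
f_data = data2.join(pd.get_dummies(data2.AgeAtFirstClaim, prefix='age'))

# Raising to the power 1 leaves AgeAtFirstClaim unchanged
f_data['AgeAtFirstClaim'] **=1
# The 1/9999 power maps 0 to 0 and any positive stay to roughly 1,
# which turns LengthOfStay into a near-binary "was hospitalized" flag
f_data['LengthOfStay'] **=1/9999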


In [1886]:
# feature selection method 1: recursive feature elimination (RFE)
cols = list(f_data.columns)
model = linear_model.LinearRegression()
# Keep all but one column; recent scikit-learn versions require the keyword form
rfe = feature_selection.RFE(model, n_features_to_select=len(cols)-1)
# Note: f_data still contains the target column DaysInHospital at this point
rfe.fit(f_data, f_data.DaysInHospital)

temp = pd.Series(rfe.support_, index=cols)
selected_features_rfe = temp[temp].index

# Drop the column that RFE eliminated
f_data = f_data[selected_features_rfe]

print(calcAUC(f_data))

0.7166027962226194


Whoever builds the feature matrix on which logistic regression achieves the best quality earns +5 bonus points.

In [1887]:
# feature selection method 2: greedy forward selection
def greedy_algorithm_step(current_dataset, all_features):
    # Try adding each candidate column in turn and record the resulting AUC
    res = []
    all_columns = list(set(all_features.columns) - {'DaysInHospital'})
    for col in all_columns:
        cd = current_dataset.copy()
        cd[col] = all_features[col]
        res.append(calcAUC(cd))
    # Keep the column that gives the highest AUC (modifies current_dataset in place)
    i = np.argmax(res)
    print(all_columns[i], res[i])
    current_dataset[all_columns[i]] = all_features[all_columns[i]]


In [1888]:
c_dataset = pd.DataFrame()
c_dataset['DaysInHospital']=f_data["DaysInHospital"]
greedy_algorithm_step(c_dataset, f_data)

gen_N 0.594293803407091


In [1889]:
greedy_algorithm_step(c_dataset, f_data)

CharlsonIndex 0.6480806905083488


In [1890]:
greedy_algorithm_step(c_dataset, f_data)

dsfc_3- 4 months 0.6630443492319367


In [1891]:
greedy_algorithm_step(c_dataset, f_data)

spe_Emergency 0.6739343446789168


In [1892]:
greedy_algorithm_step(c_dataset, f_data)

pcg_PRGNCY 0.6820493308617697


In [1893]:
greedy_algorithm_step(c_dataset, f_data)

age_80 0.6881381437204682


In [1894]:
greedy_algorithm_step(c_dataset, f_data)

ClaimsTruncated 0.6951979859308135


In [1895]:
greedy_algorithm_step(c_dataset, f_data)

age_-1 0.701050809150082


In [1896]:
greedy_algorithm_step(c_dataset, f_data)

age_70 0.7081091155039784


In [1897]:
greedy_algorithm_step(c_dataset, f_data)

age_10 0.7090422595138761


In [1898]:
greedy_algorithm_step(c_dataset, f_data)

AgeAtFirstClaim 0.7107965147498092


In [1899]:
greedy_algorithm_step(c_dataset, f_data)


In [1900]:
greedy_algorithm_step(c_dataset, f_data)
greedy_algorithm_step(c_dataset, f_data)
greedy_algorithm_step(c_dataset, f_data)
greedy_algorithm_step(c_dataset, f_data)
greedy_algorithm_step(c_dataset, f_data)
greedy_algorithm_step(c_dataset, f_data)

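The repeated manual calls to `greedy_algorithm_step` can be wrapped in a loop with a stopping rule. The sketch below is one possible variant on synthetic data, not the notebook's exact procedure: `holdout_auc`, `greedy_forward_selection`, and the `min_gain` threshold are hypothetical names introduced here, and unlike the step function above it skips columns that are already selected and stops once the AUC gain falls below the threshold.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def holdout_auc(frame, target):
    """Hold-out AUC of logistic regression on a 50/50 split (mirrors calcAUC)."""
    train, test = train_test_split(frame, test_size=0.5, random_state=1)
    model = LogisticRegression(max_iter=1000)
    model.fit(train.drop(columns=target), train[target])
    prob = model.predict_proba(test.drop(columns=target))[:, 1]
    return roc_auc_score(test[target], prob)


def greedy_forward_selection(all_features, target, min_gain=1e-4):
    """Add the best remaining feature until AUC improves by less than min_gain."""
    selected, best_auc = [], 0.5
    remaining = [c for c in all_features.columns if c != target]
    while remaining:
        # Score every candidate when added to the features chosen so far.
        scores = {
            c: holdout_auc(all_features[selected + [c, target]], target)
            for c in remaining
        }
        best_col = max(scores, key=scores.get)
        if scores[best_col] - best_auc < min_gain:
            break
        selected.append(best_col)
        remaining.remove(best_col)
        best_auc = scores[best_col]
    return selected, best_auc


# Synthetic data (hypothetical): only x0 and x1 carry signal.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(600, 5)), columns=[f"x{i}" for i in range(5)])
toy["y"] = (toy["x0"] + 0.5 * toy["x1"] + rng.normal(size=600) > 0).astype(int)

selected, auc = greedy_forward_selection(toy, "y")
print(selected, round(auc, 3))
```

Because features are chosen on the same hold-out split that is used to report the AUC, the final score is optimistic; a separate validation split or cross-validation would give a fairer estimate.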