# Logistic Regression

In this notebook you will use GPU-accelerated logistic regression to predict infection risk based on features of our population members.

## Objectives

By the time you complete this notebook you will be able to:

- Use GPU-accelerated logistic regression

## Imports

In [1]:
import cudf
import cuml

import cupy as cp

## Load Data

In [2]:
gdf = cudf.read_csv('./data/pop_2-05.csv', usecols=['age', 'sex', 'infected'])


In [3]:
gdf.dtypes

age         float64
sex         float64
infected    float64
dtype: object

In [4]:
gdf.shape

(58479894, 3)

In [5]:
gdf.head()

Unnamed: 0,age,sex,infected
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
3,0.0,0.0,0.0
4,0.0,0.0,0.0


In [24]:
gdf['infected'].value_counts()

0.0    57574163
1.0      905731
Name: infected, dtype: int32

## Logistic Regression

Logistic regression can be used to estimate the probability of an outcome as a function of some (assumed independent) inputs. In our case, we would like to estimate infection risk based on population members' age and sex.
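For reference, the model fitted in this notebook can be written in the standard logistic form. The symbols $\beta_0$, $\beta_{\text{age}}$ and $\beta_{\text{sex}}$ are notation introduced here for the intercept and the two coefficients; they are not names used by cuML.

$$
P(\text{infected} = 1 \mid \text{age}, \text{sex}) = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_{\text{age}} \cdot \text{age} + \beta_{\text{sex}} \cdot \text{sex}
$$

The quantity $z$ is the logit (log-odds) of infection, so each coefficient describes how much the log-odds change per unit change in its input.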

Here we create a cuML logistic regression instance `logreg`:

In [6]:
logreg = cuml.LogisticRegression()

## Exercise: Regress Infected Status

The `logreg.fit` method takes 2 arguments: the model's independent variables *X*, and the dependent variable *y*. Fit the `logreg` model using the `gdf` columns `age` and `sex` as *X* and the `infected` column as *y*.

In [7]:
logreg.fit(gdf[['age', 'sex']], gdf['infected'])

LogisticRegression()

#### Solution

In [8]:
# %load solutions/regress_infected
logreg.fit(gdf[['age', 'sex']], gdf['infected'])


## Viewing the Regression

After fitting the model, we could use `logreg.predict` to estimate whether someone has more than a 50% chance of being infected. However, the virus has low prevalence in the population (around 1.5% in this data set), so individual probabilities of infection are well below 50% and the model should correctly predict that no one is individually likely to be infected.
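The prevalence figure can be checked directly from the class counts printed by `value_counts()` earlier in this notebook. The following is a plain-Python sketch of that arithmetic; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the gdf['infected'].value_counts() output above.
not_infected = 57_574_163
infected = 905_731

total = not_infected + infected
prevalence = infected / total

# Accuracy of a trivial model that labels everyone as "not infected".
baseline_accuracy = not_infected / total

print(f"population:        {total}")
print(f"prevalence:        {prevalence:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
```

Because an always-negative prediction is already right for the vast majority of records, class labels from `logreg.predict` say little here, which is why the rest of the notebook works with coefficients and probabilities instead.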

However, we also have access to the model coefficients at `logreg.coef_` as well as the intercept at `logreg.intercept_`. As the outputs below show, the coefficients are returned as a cuDF DataFrame and the intercept as a cuDF Series:

In [9]:
# Coefficients
type(logreg.coef_)

cudf.core.dataframe.DataFrame

In [12]:
logreg.coef_

Unnamed: 0,0,1
0,0.014861,0.695666


In [10]:
# Intercept
type(logreg.intercept_)

cudf.core.series.Series

In [13]:
logreg.intercept_

0   -5.222369
dtype: float64

Here we view these values. Notice that changing sex from 0 to 1 has the same effect on the logit as increasing age by roughly 47 years.

In [11]:
logreg_coef = logreg.coef_
logreg_int = logreg.intercept_

print("Coefficients: [age, sex]")
print([logreg_coef[0], logreg_coef[1]])

print("Intercept:")
print(logreg_int[0])

Coefficients: [age, sex]
[0    0.014861
Name: 0, dtype: float64, 0    0.695666
Name: 1, dtype: float64]
Intercept:
-5.222369426099008
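The trade-off between sex and age mentioned above is just the ratio of the two coefficients. A minimal sketch of that arithmetic, using the coefficient values as displayed in the `logreg.coef_` output (so the result is subject to their rounding):

```python
# Values copied from the logreg.coef_ output above (rounded as displayed).
coef_age = 0.014861
coef_sex = 0.695666

# Years of age that shift the logit by as much as changing sex from 0 to 1.
equivalent_years = coef_sex / coef_age

print(f"sex 0 -> 1 is equivalent to about {equivalent_years:.1f} years of age")
```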


## Estimate Probability of Infection

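As with all logistic regressions, the coefficients allow us to calculate the logit for each individual; applying the logistic (sigmoid) function to that logit gives the estimated probability of infection. `logreg.predict_proba` returns these probabilities with one column per class: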

In [14]:
class_probs = logreg.predict_proba(gdf[['age', 'sex']])
class_probs

Unnamed: 0,0,1
0,0.994634,0.005366
1,0.994634,0.005366
2,0.994634,0.005366
3,0.994634,0.005366
4,0.994634,0.005366
...,...,...
58479889,0.960428,0.039572
58479890,0.960428,0.039572
58479891,0.960428,0.039572
58479892,0.960428,0.039572
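The probabilities above can be reproduced by hand from the fitted coefficients. The sketch below assumes the rounded values shown in the `logreg.coef_` and `logreg.intercept_` outputs; `infection_risk` is a hypothetical helper written for this illustration, not a cuML function.

```python
import math

# Fitted values copied from the logreg.coef_ and logreg.intercept_ outputs above.
INTERCEPT = -5.222369
COEF_AGE = 0.014861
COEF_SEX = 0.695666

def infection_risk(age, sex):
    """Return the modelled probability of infection for one person."""
    logit = INTERCEPT + COEF_AGE * age + COEF_SEX * sex
    return 1.0 / (1.0 + math.exp(-logit))

# Age at which the logit reaches 0 (risk = 50%) for sex = 1.
age_at_half = -(INTERCEPT + COEF_SEX) / COEF_AGE

for age, sex in [(0, 0), (20, 0), (69, 1)]:
    print(f"age={age:>2}, sex={sex}: risk={infection_risk(age, sex):.6f}")
print(f"risk would reach 50% at age {age_at_half:.0f} for sex = 1")
```

For age 0 and sex 0, which is what the first rows of `gdf` contain, the result should agree with the class-1 column above to within rounding. The age needed to reach 50% risk is far outside any human lifespan, which is consistent with `logreg.predict` labelling no one as infected.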


Remembering that a 1 indicates 'infected', we assign that class's probability to a new column in the original dataframe:

In [25]:
gdf['risk'] = class_probs[1]

In [29]:
gdf.sample(5)

Unnamed: 0,age,sex,infected,risk
54106887,69.0,1.0,0.0,0.029275
56231629,77.0,1.0,0.0,0.032849
47684887,51.0,1.0,0.0,0.022559
18559803,49.0,0.0,0.0,0.01105
7279411,20.0,0.0,0.0,0.007209


Looking at the original records with their new estimated risks, we can see how estimated risk varies across individuals.

In [26]:
gdf.take(cp.random.choice(gdf.shape[0], size=5, replace=False))

Unnamed: 0,age,sex,infected,risk
43308645,40.0,1.0,0.0,0.019222
43437025,40.0,1.0,0.0,0.019222
56799880,80.0,1.0,0.0,0.034295
14157163,38.0,0.0,0.0,0.009399
10887650,29.0,0.0,0.0,0.008232


## Exercise: Show Infection Prevalence is Related to Age

The positive coefficient on age suggests that the virus is more prevalent in older people, even when controlling for sex.

For this exercise, show that infection prevalence has some relationship to age by printing the mean `infected` values for the oldest and youngest members of the population when grouped by age:

In [36]:
age_groups = gdf[['age', 'infected']].groupby(['age'])
print(age_groups.mean().head())
print(age_groups.mean().tail())

      infected
age           
66.0  0.020700
71.0  0.021292
82.0  0.022929
64.0  0.020675
77.0  0.022102
      infected
age           
33.0  0.015707
76.0  0.021928
74.0  0.021807
79.0  0.022518
86.0  0.023417
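Note that the groups printed above are not in age order, so `head()` and `tail()` do not necessarily show the youngest and oldest groups. On the full data, calling `.sort_index()` on the grouped result would be one way to order it. The plain-Python sketch below only reorders the ten rows displayed above to make the trend easier to see; it is not a substitute for sorting the full result.

```python
# Mean 'infected' per age, copied from the ten rows printed above.
mean_infected_by_age = {
    66.0: 0.020700, 71.0: 0.021292, 82.0: 0.022929, 64.0: 0.020675,
    77.0: 0.022102, 33.0: 0.015707, 76.0: 0.021928, 74.0: 0.021807,
    79.0: 0.022518, 86.0: 0.023417,
}

# Order the groups by age so the youngest and oldest are easy to read off.
by_age = sorted(mean_infected_by_age.items())
youngest_age, youngest_rate = by_age[0]
oldest_age, oldest_rate = by_age[-1]

for age, rate in by_age:
    print(f"age {age:>4.0f}: {rate:.6f}")
```

Among these ten displayed groups, the mean `infected` value rises steadily with age, in line with the positive age coefficient.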


#### Solution

In [None]:
# %load solutions/risk_by_age
age_groups = gdf[['age', 'infected']].groupby(['age'])
# Sort by age so head() shows the youngest groups and tail() the oldest
age_means = age_groups.mean().sort_index()
print(age_means.head())
print(age_means.tail())


## Exercise: Show Infection Prevalence is Related to Sex

Similarly, the positive coefficient on sex suggests that the virus is more prevalent in people with sex = 1 (females), even when controlling for age.

For this exercise, show that infection prevalence has some relationship to sex by printing the mean `infected` values for the population when grouped by sex:

In [37]:
sex_groups = gdf[['sex', 'infected']].groupby(['sex'])
print(sex_groups.mean().head())
print(sex_groups.mean().tail())

     infected
sex          
0.0  0.010140
1.0  0.020713
     infected
sex          
0.0  0.010140
1.0  0.020713
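The two group means above can be compared with the fitted coefficient on sex. In a logistic regression, the exponential of a coefficient is the odds ratio associated with a one-unit change in that input. The sketch below uses the values displayed in this notebook; `odds` is a small helper defined only for this illustration.

```python
import math

# Mean 'infected' per sex, copied from the groupby output above.
rate_sex0 = 0.010140
rate_sex1 = 0.020713

# Coefficient on sex, copied from the logreg.coef_ output earlier.
coef_sex = 0.695666

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

risk_ratio = rate_sex1 / rate_sex0
observed_odds_ratio = odds(rate_sex1) / odds(rate_sex0)
model_odds_ratio = math.exp(coef_sex)

print(f"risk ratio (sex 1 vs 0):      {risk_ratio:.3f}")
print(f"observed odds ratio:          {observed_odds_ratio:.3f}")
print(f"model odds ratio exp(coef):   {model_odds_ratio:.3f}")
```

The observed ratio is not adjusted for age while the model's is, so the two numbers need not match exactly, but both indicate roughly double the prevalence for sex = 1.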


#### Solution

In [34]:
# %load solutions/risk_by_sex
sex_groups = gdf[['sex', 'infected']].groupby(['sex'])
sex_groups.mean()


## Making Predictions with Separate Training and Test Data

cuML gives us a simple method for producing paired training/testing data:

In [38]:
X_train, X_test, y_train, y_test = cuml.train_test_split(
    gdf[['age', 'sex']], gdf['infected'], train_size=0.9
)
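Conceptually, a split like this shuffles the rows and then cuts them into two parts according to `train_size`. The following CPU-only sketch illustrates that idea on a toy list; `shuffled_split` is a hypothetical helper, and cuML's own shuffling and rounding details may differ.

```python
import random

def shuffled_split(rows, train_size, seed=0):
    """Shuffle row positions, then cut them into train and test parts."""
    positions = list(range(len(rows)))
    random.Random(seed).shuffle(positions)
    n_train = int(len(rows) * train_size)
    train = [rows[i] for i in positions[:n_train]]
    test = [rows[i] for i in positions[n_train:]]
    return train, test

rows = list(range(20))  # stand-in for 20 records
train_rows, test_rows = shuffled_split(rows, train_size=0.9)

print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Every row ends up in exactly one of the two parts, so the test rows were never seen during fitting. With `train_size=0.9`, about 10% of the 58,479,894 records are held out for testing.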

## Exercise: Fit Logistic Regression Model Using Training Data

For this exercise, create a new logistic regression model `logreg`, and fit it with the *X* and *y* training data just created.

In [40]:
logreg = cuml.LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression()

#### Solution

In [39]:
# %load solutions/fit_training
logreg = cuml.LogisticRegression()
logreg.fit(X_train, y_train)


## Use Test Data to Validate Model

We can now use the same procedure as above to predict infection risk using the test data:

In [41]:
y_test_pred = logreg.predict_proba(X_test, convert_dtype=True)[1]
y_test_pred.index = X_test.index
y_test_pred

45357067    0.020685
27654796    0.017153
1168706     0.005613
1557514     0.005697
44864947    0.020386
              ...   
35349863    0.014153
43718257    0.019515
31649666    0.011870
8903401     0.007652
9225768     0.007766
Name: 1, Length: 5847990, dtype: float64

As we saw before, very few people in the population are actually infected, even among the highest-risk groups. As a simple check of our model, we split the test set into above-average and below-average predicted risk, then observe that the prevalence of infection in each group closely tracks the predicted risk.

In [45]:
test_results = cudf.DataFrame()
test_results['age'] = X_test['age']
test_results['sex'] = X_test['sex']
test_results['infected'] = y_test
test_results['predicted_risk'] = y_test_pred

# Classify as high risk if the predicted risk is above the mean predicted risk
test_results['high_risk'] = test_results['predicted_risk'] > test_results['predicted_risk'].mean()

risk_groups = test_results.groupby('high_risk')
risk_groups.head()

Unnamed: 0,age,sex,infected,predicted_risk,high_risk
45357067,45.0,1.0,0.0,0.020685,True
27654796,79.0,0.0,0.0,0.017153,True
1168706,3.0,0.0,0.0,0.005613,False
1557514,4.0,0.0,0.0,0.005697,False
44864947,44.0,1.0,0.0,0.020386,True
47707762,51.0,1.0,0.0,0.022569,True
8031650,22.0,0.0,0.0,0.00743,False
14223653,38.0,0.0,0.0,0.009404,False
17396596,47.0,0.0,0.0,0.010734,False
45618939,46.0,1.0,0.0,0.020988,True


In [46]:
risk_groups.mean()

high_risk,age,sex,infected,predicted_risk
False,29.515672,0.252183,0.009921,0.010326
True,56.197501,0.889953,0.023792,0.023326
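How closely the predicted risk tracks the observed infection rate can be quantified from the two rows above. This is a rough sketch of that comparison using the displayed values; the group labels and variable names are illustrative.

```python
# (observed infection rate, mean predicted risk), copied from the output above.
groups = {
    "high_risk=False": (0.009921, 0.010326),
    "high_risk=True": (0.023792, 0.023326),
}

relative_gaps = {}
for name, (observed, predicted) in groups.items():
    # Positive values mean the model over-predicts risk for that group.
    relative_gaps[name] = (predicted - observed) / observed
    print(f"{name}: observed={observed:.6f}, predicted={predicted:.6f}, "
          f"relative gap={relative_gaps[name]:+.1%}")
```

In both groups the mean predicted risk is within a few percent of the observed rate, which supports the claim that the model's probabilities are reasonable on held-out data.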


Finally, in a few tens of milliseconds, we can do a two-tier analysis by sex and age:

In [47]:
%%time
s_groups = test_results[['sex', 'age', 'infected', 'predicted_risk']].groupby(['sex', 'age'])
s_groups.mean()

CPU times: user 11.7 ms, sys: 13.9 ms, total: 25.6 ms
Wall time: 23.8 ms


sex,age,infected,predicted_risk
0.0,17.0,0.007419,0.006902
0.0,13.0,0.006114,0.006506
1.0,18.0,0.014374,0.013947
0.0,39.0,0.011496,0.009543
1.0,42.0,0.023233,0.019801
...,...,...,...
1.0,50.0,0.024914,0.022244
0.0,27.0,0.009666,0.007998
1.0,77.0,0.032190,0.032856
1.0,54.0,0.025403,0.023573


<br>
<div align="center"><h2>Please Restart the Kernel</h2></div>

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

In the next notebook, you will use a GPU-accelerated k-nearest-neighbors algorithm to locate the nearest road nodes to each hospital.