# Logistic Regression

In this notebook you will use GPU-accelerated logistic regression to predict infection risk based on features of our population members.

## Objectives

By the time you complete this notebook you will be able to:

- Use GPU-accelerated logistic regression

## Imports

In [1]:
import cudf
import cuml

import cupy as cp

## Load Data

In [2]:
gdf = cudf.read_csv('./data/pop_2-05.csv', usecols=['age', 'sex', 'infected'])

In [3]:
gdf.dtypes

age         float64
sex         float64
infected    float64
dtype: object

In [4]:
gdf.shape

(58479894, 3)

In [5]:
gdf.head()

,age,sex,infected
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
3,0.0,0.0,0.0
4,0.0,0.0,0.0


## Logistic Regression

Logistic regression can be used to estimate the probability of an outcome as a function of some (assumed independent) inputs. In our case, we would like to estimate infection risk based on population members' age and sex.
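Concretely, for the two inputs used in this notebook, the model has the form below. The symbols $\beta_0$ (intercept) and $\beta_1$, $\beta_2$ (coefficients) are notation introduced here for illustration; fitting the model estimates their values, and $p$ is the estimated probability of infection.

$$
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{sex},
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{age} + \beta_2\,\text{sex})}}
$$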

Here we create a cuML logistic regression instance `logreg`:

In [6]:
logreg = cuml.LogisticRegression()

## Exercise: Regress Infected Status

The `logreg.fit` method takes 2 arguments: the model's independent variables *X*, and the dependent variable *y*. Fit the `logreg` model using the `gdf` columns `age` and `sex` as *X* and the `infected` column as *y*.

In [7]:
logreg.fit(gdf[['age', 'sex']], gdf['infected'])

LogisticRegression(penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, max_iter=1000, linesearch_max_iter=50, verbose=4, l1_ratio=None, solver='qn', handle=<cuml.common.handle.Handle object at 0x7f0afda13d90>, output_type='cudf')

#### Solution

In [8]:
# %load solutions/regress_infected
logreg.fit(gdf[['age', 'sex']], gdf['infected'])


## Viewing the Regression

After fitting the model, we could use `logreg.predict` to estimate whether someone has more than a 50% chance of being infected. Because the virus has low prevalence in this dataset (around 1-2%), individual probabilities of infection are well below 50%, so the model should correctly predict that no one is individually likely to be infected.
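Conceptually, a binary class prediction amounts to thresholding the positive-class probability at 0.5. The sketch below illustrates this with made-up probabilities of roughly the size expected for a low-prevalence outcome. `to_class` is an illustrative helper, not a cuML function, and the numbers are hypothetical.

```python
# Sketch: a binary class prediction is the positive-class probability
# thresholded at 0.5. `to_class` is an illustrative helper, not a cuML function.
def to_class(prob_infected, threshold=0.5):
    """Return 1 (infected) if the probability exceeds the threshold, else 0."""
    return 1 if prob_infected > threshold else 0

# Hypothetical per-person probabilities, all far below 50%
example_probs = [0.005, 0.010, 0.020, 0.040]

predictions = [to_class(p) for p in example_probs]
print(predictions)
```

When every probability is this small, every class prediction is 0, so the predicted classes alone carry little information about relative risk.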

However, we also have access to the model coefficients at `logreg.coef_` as well as the intercept at `logreg.intercept_`. Both of these values are cuDF Series:

In [9]:
type(logreg.coef_)

cudf.core.series.Series

In [10]:
type(logreg.intercept_)

cudf.core.series.Series

Here we view these values. Notice that, according to the coefficients, changing sex from 0 to 1 shifts the log-odds of infection by as much as an increase in age of about 48 years.

In [11]:
logreg_coef = logreg.coef_
logreg_int = logreg.intercept_

print("Coefficients: [age, sex]")
print([logreg_coef[0], logreg_coef[1]])

print("Intercept:")
print(logreg_int[0])

Coefficients: [age, sex]
[0.014701467289187226, 0.7002792190571455]
Intercept:
-5.215624611703686
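Because the coefficients are on the log-odds scale, they can be compared directly. The following is a small standard-library sketch using the values printed above (the variable names are illustrative). Dividing the sex coefficient by the age coefficient gives the number of years of age with an equivalent effect, and exponentiating a coefficient gives the corresponding odds ratio.

```python
import math

# Values copied from the output above
coef_age = 0.014701467289187226
coef_sex = 0.7002792190571455

# Years of age whose effect on the log-odds equals switching sex from 0 to 1
years_equivalent = coef_sex / coef_age
print(f"{years_equivalent:.1f} years")

# Odds ratios: multiplicative change in the odds of infection
odds_ratio_per_year = math.exp(coef_age)  # one additional year of age
odds_ratio_sex = math.exp(coef_sex)       # sex = 1 versus sex = 0
print(f"{odds_ratio_per_year:.4f}, {odds_ratio_sex:.2f}")
```

Under this model, the odds of infection for sex = 1 are roughly double those for sex = 0 at the same age.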


## Estimate Probability of Infection

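As with all logistic regressions, the coefficients and intercept let us calculate the logit (log-odds) of infection for each individual; applying the logistic function to the logit gives the estimated probability of infection. The `predict_proba` method does this for us, returning one column of probabilities per class: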

In [13]:
class_probs = logreg.predict_proba(gdf[['age', 'sex']])
class_probs

,0,1
0,0.994598,0.005402
1,0.994598,0.005402
2,0.994598,0.005402
3,0.994598,0.005402
4,0.994598,0.005402
...,...,...
58479889,0.960540,0.039460
58479890,0.960540,0.039460
58479891,0.960540,0.039460
58479892,0.960540,0.039460
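The probabilities in column `1` can be reproduced by hand as a sanity check. The sketch below uses the coefficient and intercept values printed earlier in this notebook. It assumes that the first rows shown correspond to age 0, sex 0 and the last rows to age 90, sex 1, which is consistent with the output but not stated explicitly. `infection_probability` is an illustrative helper, not part of cuML.

```python
import math

# Values copied from the coefficient output earlier in this notebook
INTERCEPT = -5.215624611703686
COEF_AGE = 0.014701467289187226
COEF_SEX = 0.7002792190571455

def infection_probability(age, sex):
    """Apply the logistic function to the linear predictor (the logit)."""
    logit = INTERCEPT + COEF_AGE * age + COEF_SEX * sex
    return 1.0 / (1.0 + math.exp(-logit))

p_first = infection_probability(age=0, sex=0)  # compare with the first rows
p_last = infection_probability(age=90, sex=1)  # compare with the last rows
print(f"{p_first:.6f} {p_last:.6f}")
```

Both values agree with the `predict_proba` output to the displayed precision. Even the age 90, sex = 1 case stays far below the 50% threshold that `logreg.predict` would need to label someone as infected.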


Remembering that a 1 indicates 'infected', we assign that class's probability to a new column in the original dataframe:

In [14]:
gdf['risk'] = class_probs[1]

Looking at the original records with their new estimated risks, we can see how estimated risk varies across individuals.

In [15]:
gdf.take(cp.random.choice(gdf.shape[0], size=5, replace=False))

,age,sex,infected,risk
15897951,42.0,0.0,0.0,0.00997
15150513,40.0,0.0,0.0,0.009684
17600721,47.0,0.0,0.0,0.010722
42247662,37.0,1.0,0.0,0.018499
38487468,27.0,1.0,0.0,0.01601


## Exercise: Show Infection Prevalence is Related to Age

The positive coefficient on age suggests that the virus is more prevalent in older people, even when controlling for sex.

For this exercise, show that infection prevalence has some relationship to age by printing the mean `infected` values for the oldest and youngest members of the population when grouped by age:

In [17]:
gdf.groupby('age')['infected'].mean()

age
0.0     0.000000
1.0     0.000889
2.0     0.001960
3.0     0.002715
4.0     0.003586
          ...   
86.0    0.023417
87.0    0.023256
88.0    0.024569
89.0    0.024412
90.0    0.025017
Name: infected, Length: 91, dtype: float64

#### Solution

In [19]:
# %load solutions/risk_by_age
age_groups = gdf[['age', 'infected']].groupby(['age'])
print(age_groups.mean().head())
print(age_groups.mean().tail())


     infected
age          
0.0  0.000000
1.0  0.000889
2.0  0.001960
3.0  0.002715
4.0  0.003586
      infected
age           
86.0  0.023417
87.0  0.023256
88.0  0.024569
89.0  0.024412
90.0  0.025017


## Exercise: Show Infection Prevalence is Related to Sex

Similarly, the positive coefficient on sex suggests that the virus is more prevalent in people with sex = 1 (females), even when controlling for age.

For this exercise, show that infection prevalence has some relationship to sex by printing the mean `infected` values for the population when grouped by sex:

In [20]:
gdf.groupby('sex')['infected'].mean()

sex
0.0    0.010140
1.0    0.020713
Name: infected, dtype: float64

#### Solution

In [21]:
# %load solutions/risk_by_sex
sex_groups = gdf[['sex', 'infected']].groupby(['sex'])
sex_groups.mean()


## Making Predictions with Separate Training and Test Data

cuML gives us a simple method for producing paired training/testing data:

In [22]:
X_train, X_test, y_train, y_test = cuml.train_test_split(gdf[['age', 'sex']], gdf['infected'], train_size=0.9)
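Conceptually, such a split shuffles the row positions and cuts them at the requested fraction. The following standard-library sketch illustrates the idea on a toy index range. `split_indices` and its `seed` parameter are hypothetical names used for illustration, and this is not how cuML implements `train_test_split` internally.

```python
import random

def split_indices(n_rows, train_size=0.9, seed=0):
    """Shuffle row positions and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(n_rows * train_size)    # rows that go to the training set
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(n_rows=20, train_size=0.9)
print(len(train_idx), len(test_idx))
```

With `train_size=0.9`, about 90% of the rows are used for fitting and the remaining 10% are held out for evaluation.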

## Exercise: Fit Logistic Regression Model Using Training Data

For this exercise, create a new logistic regression model `logreg`, and fit it with the *X* and *y* training data just created.

In [23]:
logreg = cuml.LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, max_iter=1000, linesearch_max_iter=50, verbose=4, l1_ratio=None, solver='qn', handle=<cuml.common.handle.Handle object at 0x7f0afda13d90>, output_type='cudf')

#### Solution

In [24]:
# %load solutions/fit_training
logreg = cuml.LogisticRegression()
logreg.fit(X_train, y_train)


## Use Test Data to Validate Model

We can now use the same procedure as above to predict infection risk using the test data:

In [26]:
y_test_pred = logreg.predict_proba(X_test, convert_dtype=True)[1]
y_test_pred.index = X_test.index
y_test_pred

0          0.008375
1          0.006156
2          0.019814
3          0.007128
4          0.016894
             ...   
5847985    0.014822
5847986    0.022567
5847987    0.019222
5847988    0.006156
5847989    0.037805
Name: 1, Length: 5847990, dtype: float64

As we saw before, very few people in the population are actually infected, even among the highest-risk groups. As a simple check of our model, we split the test set into above-average and below-average predicted risk, then compare the observed prevalence of infection in each group with its mean predicted risk. If the model is well calibrated, the two should track each other closely.

In [27]:
test_results = cudf.DataFrame()
test_results['age'] = X_test['age']
test_results['sex'] = X_test['sex']
test_results['infected'] = y_test
test_results['predicted_risk'] = y_test_pred

test_results['high_risk'] = test_results['predicted_risk'] > test_results['predicted_risk'].mean()

risk_groups = test_results.groupby('high_risk')
risk_groups.mean()

high_risk,age,sex,infected,predicted_risk
False,41.472844,0.526807,0.015991,0.010347
True,7.583074,0.0,0.003964,0.023301


Finally, in under 100 milliseconds, we can do a two-level analysis grouped by sex and age:

In [28]:
%%time
s_groups = test_results[['sex', 'age', 'infected', 'predicted_risk']].groupby(['sex', 'age'])
s_groups.mean()

CPU times: user 32 ms, sys: 44 ms, total: 76 ms
Wall time: 76.4 ms


sex,age,infected,predicted_risk
0.0,0.0,0.000000,0.015395393
0.0,1.0,0.000656,0.015450358
0.0,2.0,0.001346,0.015517516
0.0,3.0,0.001856,0.015568505
0.0,4.0,0.002615,0.015522402
...,...,...,...
1.0,86.0,0.030582,
1.0,87.0,0.029942,
1.0,88.0,0.031099,
1.0,89.0,0.030081,


## Please Restart the Kernel

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

In the next notebook, you will use the GPU-accelerated k-nearest neighbors algorithm to locate the nearest road nodes to each hospital.