# Logistic Regression

In this notebook you will use GPU-accelerated logistic regression to predict infection risk based on features of our population members.

## Objectives

By the time you complete this notebook you will be able to:

- Use GPU-accelerated logistic regression

## Imports

In [1]:
import cudf
import cuml

import cupy as cp

## Load Data

In [2]:
gdf = cudf.read_csv('./data/pop_2-05.csv', usecols=['age', 'sex', 'infected'])

In [3]:
gdf.dtypes

age         float64
sex         float64
infected    float64
dtype: object

In [4]:
gdf.shape

(58479894, 3)

In [5]:
gdf.head()

Unnamed: 0,age,sex,infected
0,0.0,0.0,0.0
1,0.0,0.0,0.0
2,0.0,0.0,0.0
3,0.0,0.0,0.0
4,0.0,0.0,0.0


## Logistic Regression

Logistic regression can be used to estimate the probability of an outcome as a function of some (assumed independent) inputs. In our case, we would like to estimate infection risk based on population members' age and sex.

Here we create a cuML logistic regression instance `logreg`:

In [12]:
logreg = cuml.LogisticRegression()

## Exercise: Regress Infected Status

The `logreg.fit` method takes two arguments: the model's independent variables *X* and the dependent variable *y*. Fit the `logreg` model using the `gdf` columns `age` and `sex` as *X* and the `infected` column as *y*.

In [13]:
logreg.fit(gdf[['age','sex']],gdf['infected'])

LogisticRegression()

#### Solution

In [9]:
# %load solutions/regress_infected
logreg.fit(gdf[['age', 'sex']], gdf['infected'])
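
To give a feel for what fitting a logistic regression involves, here is a minimal pure-Python sketch that fits one feature by plain gradient descent. It is not how cuML fits the model (cuML uses its own GPU solver). The toy data and the `fit_logistic` helper are hypothetical and exist only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy data: one feature x and a binary outcome y.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit weight w and intercept b by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * 0.0 + b)   # estimated probability at x = 0
p_high = sigmoid(w * 5.0 + b)  # estimated probability at x = 5
print(w, b, p_low, p_high)
```

Because the outcome becomes more common as `x` grows in this toy data, the fitted weight is positive and the estimated probability rises with `x`.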


## Viewing the Regression

After fitting the model, we could use `logreg.predict` to estimate whether someone has more than a 50% chance of being infected. Because the virus has low prevalence in the population (around 1-2% in this data set), individual probabilities of infection are well below 50%, so the model should predict that no one is individually likely to be infected.
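
The thresholding described above can be sketched in a few lines. The probabilities below are illustrative values chosen to sit well under 50%, not model output, and `classify` is a hypothetical helper rather than part of cuML.

```python
# Illustrative probabilities only (not model output): all are far below 0.5,
# as expected for a low-prevalence infection.
illustrative_probs = [0.005, 0.015, 0.04]

def classify(prob, threshold=0.5):
    """Return 1 if prob exceeds the threshold, else 0."""
    return 1 if prob > threshold else 0

labels = [classify(p) for p in illustrative_probs]
print(labels)  # every label is 0 because no probability exceeds 0.5
```

With every individual below the threshold, a hard class prediction carries little information, which is why the rest of the notebook works with probabilities instead.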

However, we also have access to the model coefficients at `logreg.coef_` as well as the intercept at `logreg.intercept_`. Both of these values are cuDF Series:

In [14]:
type(logreg.coef_)

cudf.core.series.Series

In [15]:
type(logreg.intercept_)

cudf.core.series.Series

Here we view these values. Notice that changing sex from 0 to 1 has the same effect via the coefficients as changing the age by ~47 years (the sex coefficient divided by the age coefficient).

In [16]:
logreg_coef = logreg.coef_
logreg_int = logreg.intercept_

print("Coefficients: [age, sex]")
print([logreg_coef[0], logreg_coef[1]])

print("Intercept:")
print(logreg_int[0])

Coefficients: [age, sex]
[0.014860597365833405, 0.69566588394742]
Intercept:
-5.222369426098629
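
The comparison between the sex and age coefficients can be checked with a short calculation. This sketch copies the two coefficients printed above; the variable names are only illustrative.

```python
# Coefficients copied from the printed output above.
coef_age = 0.014860597365833405
coef_sex = 0.69566588394742

# Number of years of age whose effect on the logit equals switching sex from 0 to 1.
years_equivalent = coef_sex / coef_age
print(round(years_equivalent, 1))  # 46.8
```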


## Estimate Probability of Infection

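As with any logistic regression, the coefficients and intercept determine a logit (log-odds) for each individual, and applying the logistic function to that logit gives the estimated probability of infection. Here `predict_proba` returns the estimated probability of each class for every row: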

In [17]:
class_probs = logreg.predict_proba(gdf[['age', 'sex']])
class_probs

Unnamed: 0,0,1
0,0.994634,0.005366
1,0.994634,0.005366
2,0.994634,0.005366
3,0.994634,0.005366
4,0.994634,0.005366
...,...,...
58479889,0.960428,0.039572
58479890,0.960428,0.039572
58479891,0.960428,0.039572
58479892,0.960428,0.039572


Remembering that a 1 indicates 'infected', we assign that class's probability to a new column in the original dataframe:

In [18]:
gdf['risk'] = class_probs[1]

Looking at the original records with their new estimated risks, we can see how estimated risk varies across individuals.

In [19]:
gdf.take(cp.random.choice(gdf.shape[0], size=5, replace=False))

Unnamed: 0,age,sex,infected,risk
25769148,71.0,0.0,0.0,0.015258
37063727,24.0,1.0,0.0,0.015216
4503878,12.0,0.0,0.0,0.006406
10830446,29.0,0.0,0.0,0.008232
13241992,35.0,0.0,0.0,0.008993
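
The `risk` values above can be reproduced by hand from the coefficients and intercept printed earlier, assuming the usual logistic form: logit = intercept + coef_age × age + coef_sex × sex, followed by the logistic function. The `estimated_risk` helper is hypothetical, and the rows are copied from the sample above.

```python
import math

# Values copied from the printed model output earlier in the notebook.
intercept = -5.222369426098629
coef_age = 0.014860597365833405
coef_sex = 0.69566588394742

def estimated_risk(age, sex):
    """Apply the logistic function to the linear logit for one individual."""
    logit = intercept + coef_age * age + coef_sex * sex
    return 1.0 / (1.0 + math.exp(-logit))

# (age, sex, risk reported in the sampled rows above)
sampled_rows = [(71.0, 0.0, 0.015258), (24.0, 1.0, 0.015216), (12.0, 0.0, 0.006406)]
for age, sex, reported in sampled_rows:
    print(age, sex, round(estimated_risk(age, sex), 6), reported)
```

The hand-computed values agree with the reported `risk` column to the precision shown. Note that a 71-year-old with sex 0 and a 24-year-old with sex 1 end up with nearly the same risk, which mirrors the trade-off between the two coefficients.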


## Exercise: Show Infection Prevalence is Related to Age

The positive coefficient on age suggests that the virus is more prevalent in older people, even when controlling for sex.

For this exercise, show that infection prevalence has some relationship to age by printing the mean `infected` values for the oldest and youngest members of the population when grouped by age:

In [25]:
age_group = gdf.groupby('age')['infected'].mean()
print(age_group.head())
print(age_group.tail())

age
66.0    0.020700
71.0    0.021292
82.0    0.022929
64.0    0.020675
77.0    0.022102
Name: infected, dtype: float64
age
33.0    0.015707
76.0    0.021928
74.0    0.021807
79.0    0.022518
86.0    0.023417
Name: infected, dtype: float64
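
Note that the groups printed above are not in age order (the first row is age 66.0), so `head` and `tail` do not isolate the youngest and oldest groups. Sorting the grouped result by age first, for example with `sort_index()` in cuDF, addresses this. The idea is sketched below in pure Python on hypothetical toy records, not the notebook's data.

```python
from collections import defaultdict

# Hypothetical toy records: (age, infected). Not taken from the data set.
records = [(70, 1), (5, 0), (70, 0), (5, 0), (40, 1), (40, 0), (40, 0)]

# Accumulate the sum of infected flags and the row count for each age.
totals = defaultdict(lambda: [0, 0])
for age, infected in records:
    totals[age][0] += infected
    totals[age][1] += 1

# Sort by age so the first entries are the youngest and the last the oldest.
mean_by_age = {age: s / n for age, (s, n) in sorted(totals.items())}
print(list(mean_by_age.items()))
```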


#### Solution

In [23]:
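# %load solutions/risk_by_age
# Sort by age so head() shows the youngest groups and tail() the oldest;
# cuDF groupby does not sort group keys by default.
age_means = gdf[['age', 'infected']].groupby(['age']).mean().sort_index()
print(age_means.head())
print(age_means.tail())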


## Exercise: Show Infection Prevalence is Related to Sex

Similarly, the positive coefficient on sex suggests that the virus is more prevalent in people with sex = 1 (females), even when controlling for age.

For this exercise, show that infection prevalence has some relationship to sex by printing the mean `infected` values for the population when grouped by sex:

In [27]:
sex_groups = gdf[['sex', 'infected']].groupby(['sex'])
sex_groups.mean()

Unnamed: 0_level_0,infected
sex,Unnamed: 1_level_1
0.0,0.01014
1.0,0.020713
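
As a rough cross-check, the observed prevalences above can be turned into an odds ratio and compared with the one implied by the sex coefficient printed earlier. This is only a sketch: the observed ratio ignores age, while the model's coefficient controls for it, so the two need not match exactly.

```python
import math

# Mean infected values copied from the grouped output above.
prevalence_sex0 = 0.01014
prevalence_sex1 = 0.020713

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

observed_odds_ratio = odds(prevalence_sex1) / odds(prevalence_sex0)

# Odds ratio implied by the sex coefficient printed earlier in the notebook.
model_odds_ratio = math.exp(0.69566588394742)

# Both values come out close to 2.
print(observed_odds_ratio, model_odds_ratio)
```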


#### Solution

In [26]:
# %load solutions/risk_by_sex
sex_groups = gdf[['sex', 'infected']].groupby(['sex'])
sex_groups.mean()


## Making Predictions with Separate Training and Test Data

cuML gives us a simple method for producing paired training/testing data:

In [29]:
X_train, X_test, y_train, y_test = cuml.train_test_split(
    gdf[['age', 'sex']], gdf['infected'], train_size=0.9
)
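
Conceptually, a train/test split shuffles the rows and sets a fraction aside for testing. The pure-Python sketch below illustrates that idea only. `simple_train_test_split` is a hypothetical helper and does not reflect how cuML implements the split.

```python
import random

def simple_train_test_split(rows, train_size=0.9, seed=0):
    """Hypothetical helper: shuffle row indices, then cut them into two parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(len(rows) * train_size)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

rows = list(range(100))  # stand-in for 100 records
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 90 10
```

Every row lands in exactly one of the two parts, so the test rows are never seen during fitting.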

## Exercise: Fit Logistic Regression Model Using Training Data

For this exercise, create a new logistic regression model `logreg`, and fit it with the *X* and *y* training data just created.

In [30]:
logreg = cuml.LogisticRegression()
logreg.fit(X_train,y_train)

LogisticRegression()

#### Solution

In [31]:
# %load solutions/fit_training
logreg = cuml.LogisticRegression()
logreg.fit(X_train, y_train)


## Use Test Data to Validate Model

We can now use the same procedure as above to predict infection risk using the test data:

In [32]:
y_test_pred = logreg.predict_proba(X_test, convert_dtype=True)[1]
y_test_pred.index = X_test.index
y_test_pred

48335128    0.023224
40868670    0.017352
36242557    0.014557
22099634    0.012794
124623      0.005360
              ...   
3908227     0.006214
42882404    0.018667
276225      0.005360
27338464    0.016657
32847239    0.012570
Name: 1, Length: 5847990, dtype: float64

As we saw before, very few people in the population are actually infected, even among the highest-risk groups. As a simple check of our model, we split the test set into above-average and below-average predicted risk, then observe that the prevalence of infection in each group tracks the predicted risk closely.

In [33]:
test_results = cudf.DataFrame()
test_results['age'] = X_test['age']
test_results['sex'] = X_test['sex']
test_results['infected'] = y_test
test_results['predicted_risk'] = y_test_pred

test_results['high_risk'] = test_results['predicted_risk'] > test_results['predicted_risk'].mean()

risk_groups = test_results.groupby('high_risk')
risk_groups.mean()

Unnamed: 0_level_0,age,sex,infected,predicted_risk
high_risk,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
True,56.177232,0.890205,0.023782,0.023317
False,29.517201,0.252528,0.010051,0.010318
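
To quantify how closely the predicted risks track the observed prevalence, the gap in each group can be expressed relative to the observed value. The figures below are copied from the table above; the dictionary layout is only illustrative.

```python
# Observed infection prevalence and mean predicted risk, copied from the table above.
groups = {
    "high_risk": {"observed": 0.023782, "predicted": 0.023317},
    "low_risk": {"observed": 0.010051, "predicted": 0.010318},
}

# Absolute gap between observed and predicted, as a fraction of the observed value.
relative_gap = {
    name: abs(g["observed"] - g["predicted"]) / g["observed"]
    for name, g in groups.items()
}

for name, gap in relative_gap.items():
    print(name, f"{gap:.1%}")  # both gaps are under 3%
```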


Finally, in a few milliseconds, we can do a two-tier analysis by sex and age:

In [None]:
%%time
s_groups = test_results[['sex', 'age', 'infected', 'predicted_risk']].groupby(['sex', 'age'])
s_groups.mean()

<br>
<div align="center"><h2>Please Restart the Kernel</h2></div>

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

In the next notebook, you will use a GPU-accelerated k-nearest-neighbors algorithm to locate the nearest road nodes to each hospital.