# Generalizing to different populations

The convergence between the training and test curves in the learning curve (at least at larger sample sizes) provides some assurance that our model will continue to perform comparably well when tested on new observations sampled from the same population. This does *not*, however, mean that performance will remain comparable when tested on new *populations*. If our goal is to generalize beyond the population our training sample was drawn from, it's advisable to compute validation curves that evaluate generalization performance in as realistic a way as possible.

For example, if we intend to apply our age-prediction model to countries that are undersampled in our existing data (or not sampled at all), we might want to quantify how well the model generalizes across countries that *are* adequately sampled. Let's take a look at the country representation in our Johnson (2014) dataset:

In [11]:
# Show 20 most common countries in the dataset
data['COUNTRY'].value_counts()[:20]

USA            100608
Canada          10382
UK               7680
Australia        4978
Netherlands      1853
India            1218
Thailand         1156
Singapore        1149
Philippines      1088
Finland           992
New Zealand       911
Ireland           904
Sweden            738
Germany           574
Norway            520
China             484
South Afric       406
France            398
Malaysia          394
Hong Kong         356
Name: COUNTRY, dtype: int64

There's far more data from US participants than from any other country, so let's train our linear regression model (once again predicting age from the 300 items) on a random half of the US subset. Then we'll evaluate its performance both in the other half of the US subset and in the full sample for every other country with at least 500 data points.

In [12]:
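# Split US data into two disjoint random halves
us_data = data.query('COUNTRY == "USA"')
n_usa = len(us_data)
inds = np.random.choice(n_usa, n_usa // 2, replace=False)
us_train = us_data.iloc[inds]
# The test half is every row position that was not drawn for training
# (note that ~inds on an integer array is a bitwise NOT, not a set complement)
test_inds = np.setdiff1d(np.arange(n_usa), inds)
us_test = us_data.iloc[test_inds]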

# Train model and evaluate in-sample
model = LinearRegression()
items, age = get_features(us_train, 'items', 'AGE')
model.fit(items, age)
train_score = r2_score(age, model.predict(items))
print(f"R^2 in training half of US sample: {train_score:.2f}")

# Evaluate in testing half of US data
items, age = get_features(us_test, 'items', 'AGE')
test_score = r2_score(age, model.predict(items))
print(f"R^2 in testing half of US sample: {test_score:.2f}\n")

# Get data for all countries other than USA with >= 500 observations
countries = data.groupby('COUNTRY').filter(lambda x: len(x) >= 500)
countries = countries.query('COUNTRY != "USA"')

# Loop over countries and test performance in each one
results = []
for name, country_data in countries.groupby('COUNTRY'):
    items, age = get_features(country_data, 'items', 'AGE')
    country_score = r2_score(age, model.predict(items))
    n_obs = len(country_data)
    results.append([name, round(country_score, 2), n_obs])

results = pd.DataFrame(results, columns=['country', 'R^2', 'n'])
print("Other countries:")
results.sort_values('R^2', ascending=False)

R^2 in training half of US sample: 0.51
R^2 in testing half of US sample: 0.50

Other countries:


        country   R^2      n
5       Ireland  0.50    904
1        Canada  0.48  10382
0     Australia  0.46   4978
7   New Zealand  0.45    911
13           UK  0.44   7680
6   Netherlands  0.31   1853
11       Sweden  0.30    738
3       Germany  0.26    574
8        Norway  0.21    520
2       Finland  0.15    992


We observe that our US-trained model does nearly as well for people living in other English-speaking countries (R^2 between 0.44 and 0.50) as it does in the held-out US half, but predicts age more poorly in non-English-speaking Western countries (e.g., R^2 of 0.26 in Germany and 0.15 in Finland), and is basically useless for age prediction in non-Western countries.

Does this mean that the relationship between personality and age genuinely varies across countries, or is this simply a reliability issue, i.e., that we shouldn't trust personality scores on an English-language measure when they're provided by non-native English speakers? The above results don't directly tell us (though the fact that our age prediction model is useless even in Singapore, a country with high English literacy, might push us to suspect the former). We could of course pursue this question further, for example by reversing our approach and training the model on non-English speakers, or by conducting an item analysis to see if we can identify items whose predictive value changes substantially across cultures.
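As a rough illustration of what such an item analysis might look like, the sketch below correlates each item with age separately within each group, then ranks the items by how much that correlation differs between groups. Everything in it is a stand-in: the data are simulated, and the `GROUP` column, the `item_1` to `item_4` names, and the `simulate_group` helper are hypothetical rather than part of the Johnson (2014) dataset or the code above.

```python
import numpy as np
import pandas as pd

# Simulated stand-in data (an assumption for illustration): two hypothetical
# groups, four hypothetical items, and an age column.
rng = np.random.default_rng(0)
n = 2000
item_names = ["item_1", "item_2", "item_3", "item_4"]

def simulate_group(label, slopes):
    """Simulate items whose linear relationship with age follows `slopes`."""
    age = rng.uniform(18, 70, size=n)
    age_z = (age - age.mean()) / age.std()
    items = {name: slope * age_z + rng.normal(size=n)
             for name, slope in zip(item_names, slopes)}
    return pd.DataFrame({"GROUP": label, "AGE": age, **items})

# By construction, item_3 is related to age in group "A" but not in group "B"
toy = pd.concat([
    simulate_group("A", [0.5, -0.3, 0.6, 0.0]),
    simulate_group("B", [0.5, -0.3, 0.0, 0.0]),
], ignore_index=True)

# Correlate every item with age separately within each group
item_age_r = pd.DataFrame({
    label: grp[item_names].corrwith(grp["AGE"])
    for label, grp in toy.groupby("GROUP")
})

# Rank items by how much their correlation with age differs between groups
item_age_r["abs_diff"] = (item_age_r["A"] - item_age_r["B"]).abs()
print(item_age_r.sort_values("abs_diff", ascending=False).round(2))
```

With real data, a large difference for an item would only flag it for closer inspection. On its own it can't distinguish a genuine cultural difference from a translation or comprehension problem.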

In any case, the main point should hopefully be clear: it's a bad idea to assume that a particular pattern or prediction (even an extremely strong one, as in this case!) is going to generalize to different contexts. Fortunately, as long as we have data that spans multiple contexts, common cross-validation strategies allow us to directly estimate the generalizability of our model, rather than taking it as an article of faith.
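One cross-validation strategy suited to this situation is to hold out one whole group (here, a country) at a time, which scikit-learn provides as `LeaveOneGroupOut`. The sketch below runs it on simulated data, not on our dataset: the four groups, the weights, and the noise level are all assumptions chosen so that one group (`"D"`) has a much weaker feature-outcome relationship than the others.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Simulated stand-in data (an assumption for illustration): four groups play
# the role of countries, five features play the role of questionnaire items.
rng = np.random.default_rng(0)
n_per_group, n_features = 200, 5
groups = np.repeat(["A", "B", "C", "D"], n_per_group)
X = rng.normal(size=(len(groups), n_features))

# Groups A, B, and C share one set of weights; in group D the same
# relationship is much weaker (weights scaled down to 20%).
w_shared = np.array([1.0, -1.0, 0.5, 0.0, 2.0])
scale = np.where(groups == "D", 0.2, 1.0)
y = scale * (X @ w_shared) + rng.normal(scale=0.5, size=len(groups))

# Each fold holds out one whole group and trains on the remaining groups,
# so every score reflects generalization to a group the model never saw.
logo = LeaveOneGroupOut()
scores = cross_val_score(
    LinearRegression(), X, y, groups=groups, cv=logo, scoring="r2"
)

# LeaveOneGroupOut iterates over the group labels in sorted order
held_out_scores = dict(zip(np.unique(groups), scores))
for name, score in held_out_scores.items():
    print(f"Held-out group {name}: R^2 = {score:.2f}")
```

Applied to our data, the same pattern would pass the `COUNTRY` column as `groups`, giving one out-of-country score per country instead of a single pooled estimate.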