MSc Project

The aim of this project is to identify the socio-demographic and economic factors that matter to life satisfaction. The main research question is: what matters most to people's life satisfaction? The data come from the UK Data Archive: the Annual Population Survey Three-Year Pooled Dataset, January 2021 - December 2023.

In [3]:
import pandas as pd

df = pd.read_csv(
    'aps_3yr_jan21dec23_eul_withoutsmoking.tab',
    delimiter='\t',
    low_memory=False,
    dtype={
        'CLAIMS14': 'Int64',  # nullable integer (handles -9 as NA)
        'CombinedAuthorities': 'string'
    },
    na_values=[-9, -8]
)

# View the first few rows
df.head()

Unnamed: 0,AAGE,ACTHR,ACTHR2,ACTPOT,ACTUOT,ACTWKDY1,ACTWKDY2,ACTWKDY3,ACTWKDY4,ACTWKDY5,...,XDISDDA20,Y2JOB,YLESS20,YMORE,YPAYL20,YPAYM,YPTJOB,YSTART,YTETJB,YVARY99
0,13,,,,,2.0,3.0,,,,...,4.0,,,,,,4.0,,,
1,13,,,,,,,,,,...,4.0,,,,,,,,,
2,11,,,,,,,,,,...,1.0,,,,,,,,,
3,12,,,,,,,,,,...,1.0,,,,,,,5.0,,
4,13,,,,,,,,,,...,4.0,,,,,,,,,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 341465 entries, 0 to 341464
Columns: 459 entries, AAGE to YVARY99
dtypes: Int64(1), float64(415), int64(28), object(14), string(1)
memory usage: 1.2+ GB


There are 341465 entries and 459 attributes.

In [6]:
# Count missing values in each column
missing_counts = df.isna().sum()

# Display missing values only for columns that have any
missing_counts[missing_counts > 0]

ACTHR       301195
ACTHR2      335917
ACTPOT      301195
ACTUOT      301195
ACTWKDY1    189572
             ...  
YPAYM       338804
YPTJOB      299561
YSTART      331251
YTETJB      341277
YVARY99     334901
Length: 421, dtype: int64

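The dataset has a high number of missing values: 421 of the 459 columns contain at least one, and many of the columns shown above are missing for most of the 341465 rows.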
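
Raw counts are hard to compare across columns, so it can help to look at the share of missing rows instead. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not survey data) and an arbitrary 50% cut-off chosen only for the example.

```python
import pandas as pd

# Toy frame standing in for the survey data (values are made up)
toy = pd.DataFrame({
    'A': [1.0, None, None, None],
    'B': [1.0, 2.0, None, 4.0],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# isna().mean() gives the fraction of missing rows in each column
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns where more than half of the rows are missing (illustrative cut-off)
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print(mostly_missing)
```

The same two lines could be applied to `df` to see which columns are too sparse to be useful as predictors.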

To begin, we need a measure of the personal wellbeing of respondents. The survey includes four questions on this:

Pos. 348 (SATIS): Overall, how satisfied are you with your life nowadays?

Pos. 441 (WORTH): Overall, to what extent do you feel the things you do in your life are worthwhile?

Pos. 122 (HAPPY): Overall, how happy did you feel yesterday?

Pos. 15 (ANXIOUS): Overall, how anxious did you feel yesterday?

These four questions measure personal wellbeing. Respondents answer each one on a scale of 0 to 10, which makes the responses straightforward to analyse. Estimates are then produced by taking mean ratings. Life satisfaction, worthwhileness and happiness are combined into a single score. Anxiety is handled separately because its scale runs in the opposite direction: a higher score indicates more anxiety, not higher wellbeing.

In [9]:
# Create Wellbeing column
df['Wellbeing'] = df[['SATIS', 'WORTH', 'HAPPY']].mean(axis=1, skipna=True)

# Check the result
df[['SATIS', 'WORTH', 'HAPPY', 'Wellbeing']].head()

Unnamed: 0,SATIS,WORTH,HAPPY,Wellbeing
0,8.0,8.0,7.0,7.666667
1,10.0,8.0,10.0,9.333333
2,10.0,9.0,10.0,9.666667
3,,,,
4,8.0,5.0,8.0,7.0
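
Because `skipna=True` is used, a respondent who answered only one or two of the three questions still receives a score based on the answers that are available. The sketch below contrasts this with a stricter rule that requires all three answers. The strict variant is an assumption shown for comparison, not part of the analysis above, and the `toy` values are made up.

```python
import numpy as np
import pandas as pd

# Toy responses: complete, partially answered, and fully missing
toy = pd.DataFrame({
    'SATIS': [8.0, 10.0, np.nan],
    'WORTH': [8.0, np.nan, np.nan],
    'HAPPY': [7.0, np.nan, np.nan],
})
items = ['SATIS', 'WORTH', 'HAPPY']

# Lenient: average whatever answers are present (as in the cell above)
lenient = toy[items].mean(axis=1, skipna=True)

# Strict: the score is NaN unless all three questions were answered
strict = toy[items].mean(axis=1, skipna=False)

print(pd.DataFrame({'lenient': lenient, 'strict': strict}))
```

Which rule is preferable depends on how much partial non-response there is and whether a one-item score is considered comparable to a three-item score.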


The following code assigns threshold-based category labels to the combined wellbeing score (life satisfaction, worthwhile and happiness) and to the anxiety score.

In [11]:
def classify_wellbeing(score):
    if pd.isna(score):
        return pd.NA
    elif score <= 4:
        return 'Low'
    elif score <= 6:
        return 'Medium'
    elif score <= 8:
        return 'High'
    else:
        return 'Very high'

df['Wellbeing_category'] = df['Wellbeing'].apply(classify_wellbeing)

# Replace any remaining missing codes in ANXIOUS (-9 and -8 are already read as NA via na_values)
df['ANXIOUS'] = df['ANXIOUS'].replace([-9, -8], pd.NA)

def classify_anxiety(score):
    if pd.isna(score):
        return pd.NA
    elif score <= 1:
        return 'Very low'
    elif score <= 3:
        return 'Low'
    elif score <= 5:
        return 'Medium'
    else:
        return 'High'

df['Anxiety_category'] = df['ANXIOUS'].apply(classify_anxiety)

df[['Wellbeing', 'Wellbeing_category', 'ANXIOUS', 'Anxiety_category']].head()

Unnamed: 0,Wellbeing,Wellbeing_category,ANXIOUS,Anxiety_category
0,7.666667,High,7.0,High
1,9.333333,Very high,0.0,Very low
2,9.666667,Very high,0.0,Very low
3,,,,
4,7.0,High,1.0,Very low
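
The same thresholds can also be written with `pd.cut`, which labels a whole column at once instead of calling a Python function per row. This is a sketch of an equivalent alternative, not a change to the method. The `scores` values are made up, and the bin edges mirror the `<= 4`, `<= 6`, `<= 8` comparisons in `classify_wellbeing`.

```python
import numpy as np
import pandas as pd

# Made-up scores, including boundary values and a missing value
scores = pd.Series([3.0, 4.0, 4.33, 6.0, 7.67, 8.0, 9.33, np.nan])

# Right-closed bins: (-inf, 4], (4, 6], (6, 8], (8, inf)
wellbeing_cat = pd.cut(
    scores,
    bins=[-np.inf, 4, 6, 8, np.inf],
    labels=['Low', 'Medium', 'High', 'Very high'],
)

print(wellbeing_cat.tolist())
```

Missing scores stay missing, and the result is an ordered categorical, which keeps the labels in a sensible order in tables and plots.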


In [12]:
df[['Wellbeing', 'Wellbeing_category', 'ANXIOUS', 'Anxiety_category']].isna().sum()

Wellbeing             169132
Wellbeing_category    169132
ANXIOUS               169369
Anxiety_category      169369
dtype: int64

Wellbeing and ANXIOUS are our two most important measures. A row that is missing both carries no outcome information and cannot be used for prediction, so every row in which both Wellbeing and ANXIOUS are missing is dropped.

In [14]:
cleaned = df[~(df['Wellbeing'].isna() & df['ANXIOUS'].isna())]

cleaned[['Wellbeing', 'ANXIOUS']].isna().sum()

Wellbeing     11
ANXIOUS      248
dtype: int64
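
The boolean mask used to build `cleaned` can also be expressed with `dropna(..., how='all')`, which drops a row only when every column listed in `subset` is missing. The sketch below checks on a small made-up frame that the two forms keep the same rows.

```python
import numpy as np
import pandas as pd

# Made-up frame: complete row, one target missing, both missing, other target missing
toy = pd.DataFrame({
    'Wellbeing': [7.0, np.nan, np.nan, 8.0],
    'ANXIOUS': [1.0, 3.0, np.nan, np.nan],
})

# Mask form, as in the cell above
mask_version = toy[~(toy['Wellbeing'].isna() & toy['ANXIOUS'].isna())]

# dropna form: drop only rows where both columns are missing
dropna_version = toy.dropna(subset=['Wellbeing', 'ANXIOUS'], how='all')

print(dropna_version)
```

Rows with only one of the two values missing are kept in both versions, which is why a few missing values remain in each column after cleaning.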

In [15]:
# Variable mapping: readable name -> survey column
# Candidate variables not in the mapping below:
#   - presence of dependent and non-dependent children in the household
#   - relative deprivation of the area in which the individual lives
#   - whether the area is urban or rural
#   - mode of interview (telephone or personal interview)
columns = {
    'AGE': 'AAGE',
    'SEX': 'SEX',
    'ETHNICITY': 'ETHUKEUL',
    'MIGRATION': 'NATIDB11',
    'REL_STATUS': 'MARDY6',
    'ECON_ACTIVITY': 'ILODEFR',
    'TENURE': 'TEN1',
    'HEALTH': 'HEALYR',
    'DISABILITY': 'LIMACT',
    'QUALIFICATION': 'HIQUL15D',
    'SOCIO_ECON': 'NSECM20',
    'RELIGION': 'RELIG11',
    'REGION': 'GOR9d',
}

In [51]:
df[['AAGE','SEX','ETHUKEUL','NATIDB11','MARDY6','ILODEFR','TEN1','HEALYR','DISEA','HIQUL15D','GRSSWK','RELIG11','GOR9d']].isna().sum()

AAGE             0
SEX              0
ETHUKEUL       259
NATIDB11       501
MARDY6           0
ILODEFR          0
TEN1           182
HEALYR      216990
DISEA        75554
HIQUL15D    280038
GRSSWK      249693
RELIG11      20828
GOR9d            0
dtype: int64

In [17]:
from sklearn.model_selection import train_test_split

# STEP 1: Subset the predictors together with the two target columns
# (selecting only columns.values() left the targets out and raised a KeyError)
targets = ['Wellbeing', 'ANXIOUS']
model_df = cleaned[list(columns.values()) + targets].copy()

# STEP 2: Drop rows with a missing target or a missing predictor
model_df = model_df.dropna()

# STEP 3: Features and target
X = model_df.drop(columns=targets)
y = model_df['Wellbeing']

# STEP 4: One-hot encode the predictors. The survey variables are coded
# categories, so the columns are named explicitly; by default get_dummies
# leaves numeric columns unencoded.
X_encoded = pd.get_dummies(X, columns=list(X.columns), drop_first=True)

# STEP 5: Final split → 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)

# Confirm dimensions
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
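
One detail of the one-hot encoding step is worth checking: by default `pd.get_dummies` only encodes object, string and category columns and passes numeric columns through unchanged. The sketch below uses a small made-up frame of integer codes (the column names and values are illustrative, not survey data) to show the difference that the `columns` argument makes.

```python
import pandas as pd

# Made-up categorical variables stored as integer codes
toy = pd.DataFrame({
    'SEX': [1, 2, 1, 2],
    'REGION': [1, 2, 3, 1],
})

# Default call: numeric columns are left as they are
default = pd.get_dummies(toy, drop_first=True)

# Naming the columns forces them to be treated as categories
encoded = pd.get_dummies(toy, columns=['SEX', 'REGION'], drop_first=True)

print(list(default.columns))
print(list(encoded.columns))
```

With `drop_first=True` the first level of each variable becomes the reference category, so a variable with k codes produces k - 1 indicator columns.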

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize and train the model
ols_model = LinearRegression()
ols_model.fit(X_train, y_train)

# Predict
y_pred_ols = ols_model.predict(X_test)

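# Evaluate (RMSE is computed as the square root of the MSE, because the
# squared=False argument was deprecated and later removed in newer scikit-learn releases)
print("OLS Regression Results:")
print(f"  R² Score: {r2_score(y_test, y_pred_ols):.4f}")
print(f"  RMSE: {mean_squared_error(y_test, y_pred_ols) ** 0.5:.4f}")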
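
To make the two evaluation metrics concrete, the sketch below computes R² and RMSE by hand with NumPy on four made-up observations. The numbers are purely illustrative and are not results from the model above.

```python
import numpy as np

# Made-up true scores and predictions (not model output)
y_true = np.array([7.0, 8.0, 6.0, 9.0])
y_hat = np.array([7.5, 7.5, 6.5, 8.5])

# RMSE: square root of the mean squared residual, in the units of the score
residuals = y_true - y_hat
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.4f}, R²: {r2:.4f}")
```

RMSE reads as a typical prediction error on the 0 to 10 scale, while R² is the share of the variance in the score that the predictions account for.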