# Chart 6: Refugee Employment Regression Analysis

**Data Source:** SCB Registerdata för integration
- Table: KomMotForvBAS (Antal sysselsatta kommunmottagna flyktingar; number of employed refugees received by municipalities)
- URL: https://www.statistikdatabasen.scb.se/pxweb/sv/ssd/START__AA__AA0003__AA0003B/KomMotForvBAS/

**Analysis:** Regression of employment index on years since arrival in Sweden

**Methodology:**
- Track refugee cohorts (1997-2006) for up to 26 years
- Index employment to Year 1 = 100 for cross-cohort comparability
- Average the index across cohorts for each year since arrival
- Fit a linear regression to the cohort-averaged index
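
The indexing step can be illustrated with a minimal sketch. It uses the first four employed counts of the 1997 cohort from the SCB table (years 0 to 3); the helper name `index_to_year1` is hypothetical and is not part of the notebook.

```python
import pandas as pd


def index_to_year1(employed: pd.Series) -> pd.Series:
    """Rescale employment counts so that Year 1 equals 100.

    `employed` must be indexed by years since arrival and contain Year 1.
    Hypothetical helper, written for illustration only.
    """
    return employed / employed.loc[1] * 100


# Employed counts for the 1997 cohort, years 0-3 after arrival
cohort_1997 = pd.Series([240, 718, 1342, 2344], index=[0, 1, 2, 3], name="employed")

indexed = index_to_year1(cohort_1997)
print(indexed.round(1).to_dict())
```

Dividing by each cohort's own Year 1 value removes differences in cohort size, so trajectories from small and large arrival years can be averaged on the same scale.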

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import json

In [2]:
# Load data directly from GitHub, or re-download it from the SCB table listed above
DATA_URL = "https://raw.githubusercontent.com/kelvinchwng/kelvinchwng.github.io/main/project/data/p6_raw.csv"

# Skip the title row (row 0), use row 1 as header.
# Read as latin-1 so that 'å' and 'ö' in the Swedish headers decode correctly.
df_raw = pd.read_csv(DATA_URL, encoding='latin-1', skiprows=1)

print(f"Loaded from: {DATA_URL}")
print(f"Shape: {df_raw.shape}")
print(f"\nColumns: {df_raw.columns.tolist()}")

# Preview the data
df_raw.head(10)

Loaded from: https://raw.githubusercontent.com/kelvinchwng/kelvinchwng.github.io/main/project/data/p6_raw.csv
Shape: (27, 29)

Columns: ['år efter mottagning', 'kön', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023']


Unnamed: 0,år efter mottagning,kön,1997,1998,1999,2000,2001,2002,2003,2004,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,0 år,totalt,240,279,211,275,279,291,320,256,...,1183,1054,1767,2091,1067,1684,998,772,744,956
1,1 år,totalt,718,809,668,746,658,650,683,480,...,4142,4567,7618,7166,4917,3871,2096,1768,1654,..
2,2 år,totalt,1342,1713,1156,1296,1187,1203,1165,920,...,8696,9795,15871,13840,8593,6516,3393,2700,..,..
3,3 år,totalt,2344,2440,1533,1808,1726,1753,1899,1445,...,12422,15081,20979,17347,12294,8620,4151,..,..,..
4,4 år,totalt,3032,2862,1887,2281,2169,2548,2541,1870,...,15324,17240,22357,22270,15184,9558,..,..,..,..
5,5 år,totalt,3522,3251,2174,2624,2843,3272,2961,1788,...,17298,17879,27145,26742,16622,..,..,..,..,..
6,6 år,totalt,3845,3544,2344,3237,3500,3642,2841,2220,...,17650,20638,30632,28163,..,..,..,..,..,..
7,7 år,totalt,4064,3748,2727,3741,3847,3477,3301,2552,...,19672,23123,31602,..,..,..,..,..,..,..
8,8 år,totalt,4270,4210,3133,3999,3666,3981,3748,2794,...,21630,23535,..,..,..,..,..,..,..,..
9,9 år,totalt,4846,4690,3338,3848,4196,4415,4022,2988,...,22077,..,..,..,..,..,..,..,..,..


In [4]:
# Load the data
df = pd.read_csv(DATA_URL, encoding='latin-1', skiprows=1)

# Rename columns
df.columns = ['years_after', 'gender'] + [str(y) for y in range(1997, 2024)]

# Keep only 'totalt' rows
df = df[df['gender'] == 'totalt'].copy()

# Extract numeric years from 'X år' format
df['years_after'] = df['years_after'].str.extract(r'(\d+)').astype(int)

print(f"Data shape: {df.shape}")
df.head()

Data shape: (27, 29)


Unnamed: 0,years_after,gender,1997,1998,1999,2000,2001,2002,2003,2004,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,0,totalt,240,279,211,275,279,291,320,256,...,1183,1054,1767,2091,1067,1684,998,772,744,956
1,1,totalt,718,809,668,746,658,650,683,480,...,4142,4567,7618,7166,4917,3871,2096,1768,1654,..
2,2,totalt,1342,1713,1156,1296,1187,1203,1165,920,...,8696,9795,15871,13840,8593,6516,3393,2700,..,..
3,3,totalt,2344,2440,1533,1808,1726,1753,1899,1445,...,12422,15081,20979,17347,12294,8620,4151,..,..,..
4,4,totalt,3032,2862,1887,2281,2169,2548,2541,1870,...,15324,17240,22357,22270,15184,9558,..,..,..,..


In [5]:
# Focus on cohorts with long follow-up (1997-2006)
# These have 17-26 years of tracking data
cohorts = ['1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006']

all_data = []

for cohort in cohorts:
    cohort_data = df[['years_after', cohort]].copy()
    cohort_data.columns = ['years_after', 'employed']

    # Convert to numeric, replacing '..' with NaN
    cohort_data['employed'] = pd.to_numeric(
        cohort_data['employed'].replace('..', np.nan),
        errors='coerce'
    )
    cohort_data = cohort_data.dropna()
    cohort_data['cohort'] = int(cohort)

    # Calculate index (Year 1 = 100)
    year1_value = cohort_data[cohort_data['years_after'] == 1]['employed'].values
    if len(year1_value) > 0:
        cohort_data['employment_index'] = (cohort_data['employed'] / year1_value[0]) * 100
        all_data.append(cohort_data)
        print(f"Cohort {cohort}: {len(cohort_data)} years of data, Year 1 employed = {year1_value[0]:.0f}")

combined = pd.concat(all_data, ignore_index=True)
print(f"\nTotal observations: {len(combined)}")

Cohort 1997: 27 years of data, Year 1 employed = 718
Cohort 1998: 26 years of data, Year 1 employed = 809
Cohort 1999: 25 years of data, Year 1 employed = 668
Cohort 2000: 24 years of data, Year 1 employed = 746
Cohort 2001: 23 years of data, Year 1 employed = 658
Cohort 2002: 22 years of data, Year 1 employed = 650
Cohort 2003: 21 years of data, Year 1 employed = 683
Cohort 2004: 20 years of data, Year 1 employed = 480
Cohort 2005: 19 years of data, Year 1 employed = 785
Cohort 2006: 18 years of data, Year 1 employed = 2912

Total observations: 225
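
The per-cohort loop above can also be written without explicit iteration. The following is a rough sketch of one alternative, assuming the same wide layout (one column per cohort, `..` for years not yet observed). It runs on a three-row excerpt of the table, and the names `wide`, `tidy`, and `baseline` are illustrative only.

```python
import pandas as pd

# Small wide-format excerpt: three years of follow-up for three cohorts.
wide = pd.DataFrame({
    "years_after": [0, 1, 2],
    "1997": ["240", "718", "1342"],
    "1998": ["279", "809", "1713"],
    "2023": ["956", "..", ".."],  # '..' marks years not yet observed
})

# Reshape to one row per (years_after, cohort) pair
tidy = wide.melt(id_vars="years_after", var_name="cohort", value_name="employed")
tidy["cohort"] = tidy["cohort"].astype(int)

# '..' becomes NaN and is dropped, as in the loop version
tidy["employed"] = pd.to_numeric(tidy["employed"], errors="coerce")
tidy = tidy.dropna(subset=["employed"])

# Year 1 baseline per cohort; cohorts without a Year 1 value map to NaN
baseline = tidy.loc[tidy["years_after"] == 1].set_index("cohort")["employed"]
tidy["employment_index"] = tidy["employed"] / tidy["cohort"].map(baseline) * 100

# Cohorts lacking a baseline are skipped, matching the `if len(year1_value) > 0` check
tidy = tidy.dropna(subset=["employment_index"]).reset_index(drop=True)
print(tidy)
```

The long format produced here has the same columns as `combined`, so the later `groupby('years_after')` step would work on it unchanged.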


In [6]:
# Calculate average across cohorts for each year
avg_by_year = combined.groupby('years_after').agg({
    'employment_index': 'mean',
    'employed': 'mean'
}).reset_index()

print("Average employment trajectory across 1997-2006 cohorts:")
print("(Index: Year 1 = 100)")
print(avg_by_year.round(1).to_string(index=False))

Average employment trajectory across 1997-2006 cohorts:
(Index: Year 1 = 100)
 years_after  employment_index  employed
           0              42.2     380.5
           1             100.0     910.9
           2             180.4    1591.9
           3             262.5    2181.0
           4             330.7    2742.8
           5             385.9    3216.8
           6             434.3    3594.8
           7             475.0    3930.2
           8             516.2    4280.3
           9             560.6    4647.4
          10             599.6    4999.8
          11             639.1    5309.5
          12             676.0    5592.5
          13             722.1    5936.9
          14             762.8    6194.5
          15             796.2    6471.3
          16             834.0    6803.8
          17             865.3    7033.0
          18             936.3    6354.7
          19             999.5    6700.8
          20             994.3    6971.1
         ...  [output truncated]

In [7]:
# Regression on averaged data
X = avg_by_year[['years_after']].values
y = avg_by_year['employment_index'].values

model = LinearRegression()
model.fit(X, y)

print("=" * 50)
print("REGRESSION RESULTS")
print("=" * 50)
print(f"Intercept (predicted index at Year 0): {model.intercept_:.1f}")
print(f"Slope: {model.coef_[0]:+.1f} index points per year in Sweden")
print(f"R-squared: {model.score(X, y):.3f}")

# Growth interpretation
year0_pred = model.intercept_
year26_pred = model.intercept_ + model.coef_[0] * 26
growth_factor = year26_pred / year0_pred
print(f"\nPredicted growth from Year 0 to Year 26: {growth_factor:.1f}×")

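==================================================
REGRESSION RESULTS
==================================================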
Intercept (predicted index at Year 0): 181.2
Slope: +37.3 index points per year in Sweden
R-squared: 0.938

Predicted growth from Year 0 to Year 26: 6.4×
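
As a cross-check on what `LinearRegression` computes for a single predictor, the slope and intercept can be derived from the closed-form least-squares formulas. This sketch uses only the first five averaged points (years 0 to 4) from the trajectory table, so its numbers differ from the full 27-point fit reported above. It illustrates the calculation and is not a replacement for the notebook's model.

```python
import numpy as np

# Averaged employment index for years 0-4 (Year 1 = 100)
x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([42.2, 100.0, 180.4, 262.5, 330.7])

# Closed-form simple OLS: slope = S_xy / S_xx, intercept = mean(y) - slope * mean(x)
x_mean, y_mean = x.mean(), y.mean()
slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
intercept = y_mean - slope * x_mean

# R-squared = 1 - residual sum of squares / total sum of squares
residuals = y - (intercept + slope * x)
r_squared = 1 - (residuals ** 2).sum() / ((y - y_mean) ** 2).sum()

# Independent check with NumPy's polynomial fit (degree 1)
np_slope, np_intercept = np.polyfit(x, y, deg=1)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```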


In [8]:
# Generate predictions for plotting
avg_by_year['predicted'] = model.predict(X)

# Show observed vs predicted
print("Observed vs Predicted:")
comparison = avg_by_year[['years_after', 'employment_index', 'predicted']].copy()
comparison.columns = ['Year', 'Observed', 'Predicted']
comparison['Residual'] = comparison['Observed'] - comparison['Predicted']
print(comparison.round(1).to_string(index=False))

Observed vs Predicted:
 Year  Observed  Predicted  Residual
    0      42.2      181.2    -138.9
    1     100.0      218.5    -118.5
    2     180.4      255.8     -75.5
    3     262.5      293.2     -30.7
    4     330.7      330.5       0.2
    5     385.9      367.8      18.0
    6     434.3      405.2      29.1
    7     475.0      442.5      32.5
    8     516.2      479.8      36.4
    9     560.6      517.2      43.4
   10     599.6      554.5      45.1
   11     639.1      591.8      47.2
   12     676.0      629.2      46.8
   13     722.1      666.5      55.6
   14     762.8      703.8      59.0
   15     796.2      741.2      55.0
   16     834.0      778.5      55.5
   17     865.3      815.8      49.5
   18     936.3      853.2      83.1
   19     999.5      890.5     109.0
   20     994.3      927.8      66.5
   21     993.9      965.2      28.7
   22     960.4     1002.5     -42.1
   23     918.4     1039.8    -121.4
   24     919.7     1077.2    -157.4
   25     985.0
  ...  [output truncated]

In [9]:
# Create export data
export_data = []

# Add observed points
for _, row in avg_by_year.iterrows():
    export_data.append({
        'years_in_sweden': int(row['years_after']),
        'employment_index': round(row['employment_index'], 1),
        'type': 'observed'
    })

print(f"Exported {len(export_data)} data points")
print("\nSample:")
for item in export_data[:5]:
    print(item)

Exported 27 data points

Sample:
{'years_in_sweden': 0, 'employment_index': np.float64(42.2), 'type': 'observed'}
{'years_in_sweden': 1, 'employment_index': np.float64(100.0), 'type': 'observed'}
{'years_in_sweden': 2, 'employment_index': np.float64(180.4), 'type': 'observed'}
{'years_in_sweden': 3, 'employment_index': np.float64(262.5), 'type': 'observed'}
{'years_in_sweden': 4, 'employment_index': np.float64(330.7), 'type': 'observed'}
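
The sample shows `np.float64` values because `round()` on a pandas value keeps the NumPy type. `json.dump` still accepts these, but converting to built-in `float` keeps the records free of NumPy types. A minimal sketch, using two averaged points from the trajectory table and a hypothetical `rows` list standing in for `avg_by_year.iterrows()`:

```python
import json

import numpy as np

# Two averaged points; np.float64 mimics what pandas returns per row
rows = [(0, np.float64(42.2)), (1, np.float64(100.0))]

export_data = [
    {
        "years_in_sweden": int(year),
        # float(...) converts np.float64 to a built-in Python float
        "employment_index": float(round(value, 1)),
        "type": "observed",
    }
    for year, value in rows
]

print(export_data)
print(json.dumps(export_data))
```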


In [12]:
# Save to JSON
with open('p6_data.json', 'w') as f:
    json.dump(export_data, f, indent=2)

print("Saved to p6_data.json")

# Download (Colab only; skipped when google.colab is not installed)
try:
    from google.colab import files
    files.download('p6_data.json')
except ImportError:
    pass

Saved to p6_data.json



```
╔══════════════════════════════════════════════════════════════════╗
║                        DATAFLOW DIAGRAM                          ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  ┌────────────────────────────────────────────────────────────┐  ║
║  │                     DATA SOURCE                            │  ║
║  │  SCB Statistikdatabasen - KomMotForvBAS                    │  ║
║  │  Employed refugees by years since arrival (1997-2023)      │  ║
║  └────────────────────────────────────────────────────────────┘  ║
║                              │                                   ║
║                              ▼                                   ║
║  ┌────────────────────────────────────────────────────────────┐  ║
║  │                   MANUAL DOWNLOAD                          │  ║
║  │  Select: år efter mottagning (0-26), kön=totalt            │  ║
║  │  Export as CSV                                             │  ║
║  └────────────────────────────────────────────────────────────┘  ║
║                              │                                   ║
║                              ▼                                   ║
║  ┌────────────────────────────────────────────────────────────┐  ║
║  │                 COLAB PROCESSING                           │  ║
║  │  1. Load CSV (latin-1 encoding)                            │  ║
║  │  2. Filter cohorts 1997-2006                               │  ║
║  │  3. Index employment to Year 1 = 100                       │  ║
║  │  4. Average across cohorts                                 │  ║
║  │  5. Fit OLS regression (sklearn)                           │  ║
║  └────────────────────────────────────────────────────────────┘  ║
║                              │                                   ║
║                              ▼                                   ║
║  ┌────────────────────────────────────────────────────────────┐  ║
║  │                      OUTPUT                                │  ║
║  │  p6_data.json → GitHub: project/data/p6_data.json          │  ║
║  └────────────────────────────────────────────────────────────┘  ║
║                              │                                   ║
║                              ▼                                   ║
║  ┌────────────────────────────────────────────────────────────┐  ║
║  │                  VEGA-LITE CHART                           │  ║
║  │  - Scatter: observed employment index                      │  ║
║  │  - Line: regression transform (client-side)                │  ║
║  │  - Annotations: slope (+37/yr), R² = 0.94                  │  ║
║  └────────────────────────────────────────────────────────────┘  ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝
```
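
The final stage of the diagram could be expressed as a layered Vega-Lite spec. The sketch below builds one as a Python dictionary so it can be written out as JSON. It is an assumption about how the chart might be wired up, not the published spec: the relative data URL, the v5 schema version, and the variable names are illustrative. It does rely on Vega-Lite's `regression` transform, which fits a linear model by default.

```python
import json

# Shared axis encodings; field names follow the exported JSON records
encoding = {
    "x": {"field": "years_in_sweden", "type": "quantitative"},
    "y": {"field": "employment_index", "type": "quantitative"},
}

# Hypothetical spec: scatter of observed points plus a client-side regression line
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"url": "project/data/p6_data.json"},  # assumed relative path
    "layer": [
        {"mark": "point", "encoding": encoding},
        {
            # Regress employment_index on years_in_sweden in the browser
            "transform": [
                {"regression": "employment_index", "on": "years_in_sweden"}
            ],
            "mark": "line",
            "encoding": encoding,
        },
    ],
}

print(json.dumps(spec, indent=2))
```

Because the line is computed client-side, the exported file only needs the observed points, which matches the notebook exporting records of type `observed` only.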