In [1]:
# Set up Git in Colab (-y skips the interactive confirmation prompt)
!apt-get install -y git

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git is already the newest version (1:2.34.1-1ubuntu1.15).
0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.


In [2]:
# Cloning the repo
!git clone https://github.com/ceciliak27/is5126_finalproj.git

Cloning into 'is5126_finalproj'...
remote: Enumerating objects: 161, done.
remote: Counting objects: 100% (161/161), done.
remote: Compressing objects: 100% (129/129), done.
remote: Total 161 (delta 28), reused 108 (delta 9), pack-reused 0 (from 0)
Receiving objects: 100% (161/161), 13.50 MiB | 10.86 MiB/s, done.
Resolving deltas: 100% (28/28), done.


In [3]:
# installing required libraries
!pip install numpy pandas matplotlib



In [4]:
# import required libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pd.set_option('display.max_rows', None)

# 1. Dataset

---

This dataset comes from the [World Values Survey](https://www.worldvaluessurvey.org/WVSDocumentationWV7.jsp) (WVS). We use the Wave 7 data, which was collected between 2017 and 2022 and downloaded on 1 Nov 2025.

*Home › Data and Documentation › Data Download › Wave 7 (2017-2022)*

- https://www.worldvaluessurvey.org/WVSDocumentationWV7.jsp

The WVS consists of nationally representative surveys, conducted with a common questionnaire in almost 100 countries that together contain almost 90 percent of the world’s population. The WVS seeks to help scientists and policy makers understand changes in the beliefs, values and motivations of people throughout the world.

#### Survey method

The main method of data collection in the WVS is a face-to-face interview at the respondent’s home or place of residence. Answers can be recorded on a paper questionnaire (the traditional way) or by CAPI (Computer-Assisted Personal Interview). Any data collection method other than a face-to-face interview requires written approval from the Scientific Advisory Committee.

Based on the documentation, we extract the questions relevant to Singapore using the following files:

- Master Questionnaire (All Countries): "F00011012-WVS_WAVE_7_MASTER_QUESTIONNAIRE_2017-2021_ENGLISH.pdf"
- Singapore Questionnaire (with Singapore-specific questions): "F00011463-WVS7_Questionnaire_Singapore_2020_English.pdf"
- Response options matrix: "F00011055-WVS7_Codebook_Variables_report_V6.0.pdf"

The Singapore survey was conducted by Social Lab, under the Institute of Policy Studies, from November 2019 to March 2020. It targeted citizens and permanent residents aged 21 and above, with a target sample size of 2,000 individuals.

One respondent was interviewed per Primary Statistical Unit. Within each household, the respondent was selected using the last-birthday method. Data was collected through face-to-face interviews conducted by trained interviewers.

In [9]:
# importing data csv files

url_master_sg = 'https://raw.githubusercontent.com/ceciliak27/is5126_finalproj/refs/heads/preprocessing/Data/RawData/WVS_Cross-National_Wave_7_csv_v6_0(SGP_Only).csv'
url_sg = 'https://raw.githubusercontent.com/ceciliak27/is5126_finalproj/refs/heads/preprocessing/Data/RawData/F00013217-WVS_Wave_7_Singapore_Excel_v5.1.csv'
df = pd.read_csv(url_sg)

df.head()

Unnamed: 0,version: Version of Data File,doi: Digital Object Identifier,A_YEAR: Year of survey,B_COUNTRY: ISO 3166-1 numeric country code,B_COUNTRY_ALPHA: ISO 3166-1 alpha-3 country code,C_COW_NUM: CoW country code numeric,C_COW_ALPHA: CoW country code alpha,D_INTERVIEW: Interview ID,FW_START: Year/month of start-fieldwork,FW_END: Year/month of end-fieldwork,...,WEIGHT4A: Overall Secular Values-4: Weight 4a,WEIGHT4B: Emancipative Values-4: Weight 4b,RESEMAVALBWGT: Weight for Emancipative values,RESEMAVALWGT: Weight for Emancipative values,SECVALBWGT: Weight for overall secular values Short Version,Y001_1: Materialist/postmaterialist 12-item index: Component 1,Y001_2: Materialist/postmaterialist 12-item index: Component 2,Y001_3: Materialist/postmaterialist 12-item index: Component 3,Y001_4: Materialist/postmaterialist 12-item index: Component 4,Y001_5: Materialist/postmaterialist 12-item index: Component 5
0,6-0-0 (2024-04-15),doi.org/10.14281/18241.20,2020,702,SGP,830,SIN,702070001,201911,202003,...,1.0,1.0,1.0,1.0,0.83,0,0,0,1,0
1,6-0-0 (2024-04-15),doi.org/10.14281/18241.20,2020,702,SGP,830,SIN,702070002,201911,202003,...,1.0,1.0,1.0,1.0,0.83,1,1,1,0,1
2,6-0-0 (2024-04-15),doi.org/10.14281/18241.20,2020,702,SGP,830,SIN,702070003,201911,202003,...,1.0,1.0,1.0,1.0,1.0,1,0,1,0,1
3,6-0-0 (2024-04-15),doi.org/10.14281/18241.20,2020,702,SGP,830,SIN,702070004,201911,202003,...,1.0,1.0,1.0,1.0,1.0,1,0,0,0,0
4,6-0-0 (2024-04-15),doi.org/10.14281/18241.20,2020,702,SGP,830,SIN,702070005,201911,202003,...,1.0,1.0,1.0,1.0,1.0,0,0,0,0,0



# 2. Exploratory Data Analysis

---

### 2.1 Load the dataset
- Load the Singapore data file(s) into a DataFrame (completed in part 1 above).
- Show shape and head  (completed in part 1 above).
- Show number of data points.
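
As a minimal sketch of the "shape" and "number of data points" steps above, the snippet below reads both values from `DataFrame.shape`. The `demo` frame is a small stand-in with made-up values so the example runs on its own; in this notebook the same two lines would be applied to `df`.

```python
import pandas as pd

# Stand-in data with made-up values; in the notebook, use `df` instead of `demo`.
demo = pd.DataFrame({"A_YEAR": [2020, 2020, 2020], "Q1": [1, 2, -1]})

# shape is a (rows, columns) tuple: rows are respondents, columns are variables.
n_rows, n_cols = demo.shape
print(f"{n_rows} respondents x {n_cols} columns")
```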


In [10]:
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2012 entries, 0 to 2011
Data columns (total 387 columns):
 #    Column                                                                                                                                          Non-Null Count  Dtype  
---   ------                                                                                                                                          --------------  -----  
 0    version: Version of Data File                                                                                                                   2012 non-null   object 
 1    doi: Digital Object Identifier                                                                                                                  2012 non-null   object 
 2    A_YEAR: Year of survey                                                                                                                          2012 non-null   int64  
 3    B_COUNTRY: ISO 3

In [14]:
df.describe()


Unnamed: 0,A_YEAR: Year of survey,B_COUNTRY: ISO 3166-1 numeric country code,C_COW_NUM: CoW country code numeric,D_INTERVIEW: Interview ID,FW_START: Year/month of start-fieldwork,FW_END: Year/month of end-fieldwork,K_TIME_START: Start time of the interview [HH.MM],K_TIME_END: End time of the interview [HH.MM],K_DURATION: Total length of interview [minutes],Q_MODE: Mode of data collection,...,WEIGHT4A: Overall Secular Values-4: Weight 4a,WEIGHT4B: Emancipative Values-4: Weight 4b,RESEMAVALBWGT: Weight for Emancipative values,RESEMAVALWGT: Weight for Emancipative values,SECVALBWGT: Weight for overall secular values Short Version,Y001_1: Materialist/postmaterialist 12-item index: Component 1,Y001_2: Materialist/postmaterialist 12-item index: Component 2,Y001_3: Materialist/postmaterialist 12-item index: Component 3,Y001_4: Materialist/postmaterialist 12-item index: Component 4,Y001_5: Materialist/postmaterialist 12-item index: Component 5
count,2012.0,2012.0,2012.0,2012.0,2012.0,2012.0,2012.0,2012.0,2012.0,2012.0,...,2012.0,2012.0,2012.0,2012.0,2012.0,2012.0,2012.0,2012.0,2012.0,2012.0
mean,2020.0,702.0,830.0,702071000.0,201911.0,202003.0,17.331257,18.24508,54.385686,2.0,...,0.98665,0.992565,0.994339,0.995311,0.983355,0.47664,0.265905,0.448807,0.317594,0.599404
std,0.0,0.0,0.0,580.9587,0.0,0.0,3.477873,3.445957,22.156721,0.0,...,0.066052,0.049739,0.032338,0.022337,0.052489,0.550711,0.477615,0.570154,0.47517,0.533847
min,2020.0,702.0,830.0,702070000.0,201911.0,202003.0,10.04,10.42,29.0,2.0,...,0.66,0.66,0.66,0.745,0.66,-2.0,-2.0,-2.0,-2.0,-2.0
25%,2020.0,702.0,830.0,702070500.0,201911.0,202003.0,14.5475,15.55,39.0,2.0,...,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,2020.0,702.0,830.0,702071000.0,201911.0,202003.0,17.405,18.34,49.0,2.0,...,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
75%,2020.0,702.0,830.0,702071500.0,201911.0,202003.0,20.42,21.35,62.0,2.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,2020.0,702.0,830.0,702072000.0,201911.0,202003.0,23.22,23.59,175.0,2.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


The dataset contains 2,012 responses, one row per respondent.

The columns fall into three broad groups:
- Survey-related info (Technical Variables): "A_YEAR" to "PWGHT"
- Core questionnaire variables: "Q1" to "Q290"
- WVS indexes: "Y001" to "Y001_5"
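
One way to apply this grouping in code is to classify each column by the short variable code that precedes the first colon in its name. The sketch below is an assumption about how that could be done, not the method used in this notebook: `variable_code`, `column_group`, and the group labels are hypothetical names, the "Q1" label is a placeholder, and any column that is neither a "Q<number>" question nor a "Y0..." index is put in a catch-all group.

```python
import re
import pandas as pd

def variable_code(column_name: str) -> str:
    """Return the short code before the first colon, e.g. 'A_YEAR'."""
    return column_name.split(":", 1)[0].strip()

def column_group(column_name: str) -> str:
    """Assign a column to one of three broad groups (hypothetical labels)."""
    code = variable_code(column_name)
    if re.match(r"Q\d", code):      # Q1 ... Q290, but not Q_MODE
        return "core_question"
    if code.startswith("Y0"):       # Y001, Y001_1, ...
        return "wvs_index"
    return "technical_or_other"     # catch-all for everything else

# Column names in the same "CODE: description" style as the dataset.
example_columns = [
    "A_YEAR: Year of survey",
    "Q_MODE: Mode of data collection",
    "Q1: placeholder question label",
    "Y001_1: Materialist/postmaterialist 12-item index: Component 1",
]

groups = pd.Series(example_columns).map(column_group)
print(groups.value_counts())
```

Note that weight columns such as "WEIGHT4A" land in the catch-all group under this rule, so the result should be checked against the codebook before it is relied on.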

### 2.2 Identify columns with null / missing data
- Based on the exploration below, missing (NaN) values occur mainly in the WVS index variables.
- A closer look at the survey shows that most questions also use the following response codes:

    * -1 : Don't know
    * -2 : No answer
    * -4 : Not asked in survey
    * -5 : Missing; Not available

These codes are equivalent to null values and will require further processing (imputing or dropping the affected data points).
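
A minimal sketch of how these codes could be folded into the null count: replace the four codes listed above with `NaN`, then count nulls per column. The `demo` frame and its values are made up for illustration, and `MISSING_CODES` is a hypothetical name. Applying the replacement to every column is an assumption; in practice it may be safer to restrict it to the questionnaire columns so that legitimate negative values elsewhere are not touched.

```python
import numpy as np
import pandas as pd

# Response codes treated as missing, taken from the list above.
MISSING_CODES = [-1, -2, -4, -5]

# Stand-in data with made-up values; in the notebook, use `df` instead.
demo = pd.DataFrame({
    "Q1": [1, -1, 3, -2],
    "Q2": [2, 2, -5, 4],
    "Y001_1": [0.0, np.nan, 1.0, -2.0],
})

# Turn every missing code into NaN so isna() sees coded and true nulls alike.
cleaned = demo.replace(MISSING_CODES, np.nan)

missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Comparing `missing_per_column` with `demo.isna().sum()` shows how many nulls were hidden behind the negative codes.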

In [22]:
print(df.isnull().sum())

version: Version of Data File                                                                                                                       0
doi: Digital Object Identifier                                                                                                                      0
A_YEAR: Year of survey                                                                                                                              0
B_COUNTRY: ISO 3166-1 numeric country code                                                                                                          0
B_COUNTRY_ALPHA: ISO 3166-1 alpha-3 country code                                                                                                    0
C_COW_NUM: CoW country code numeric                                                                                                                 0
C_COW_ALPHA: CoW country code alpha                                                                 

### 2.3 Quick EDA views
**TO BE UPDATED**

Create views of the target goal and key features to give a quick overview of the data.
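
Until this section is filled in, one possible quick view is a bar chart of how often each response code occurs in a single column. The sketch below is only an illustration: `plot_response_counts` is a hypothetical helper, and `Q_example` with its values is made-up data rather than a column from the survey.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_response_counts(frame, column, ax=None):
    """Draw a bar chart of response-code frequencies for one column."""
    # Sort by response code so the bars follow the answer scale.
    counts = frame[column].value_counts().sort_index()
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 3))
    counts.plot(kind="bar", ax=ax)
    ax.set_xlabel("Response code")
    ax.set_ylabel("Number of respondents")
    ax.set_title(column)
    return counts, ax

# Stand-in data with made-up values; in the notebook, pass `df` and a real column.
demo = pd.DataFrame({"Q_example": [1, 2, 2, 3, 3, 3, -1]})
counts, ax = plot_response_counts(demo, "Q_example")
```

Negative codes such as -1 appear as their own bars, which makes it easy to see how much of a column is effectively missing.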


# 3. Preprocessing steps

---

### 3.1 Handling null / missing data

### 3.2 Assess relevant features based on survey description

### 3.3 Correlation chart

### 3.4 Ensuring column data types are correctly identified

### 3.5 Scaling data


# 4. Exporting the preprocessed data

---

### 4.1 Exporting data as csv file

In [None]:
from google.colab import files

df.to_csv('preprocessed_data.csv', index=False)
files.download('preprocessed_data.csv')