# Exploratory Data Analysis

In [10]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_rows = 999
pd.options.display.max_columns = 999

x_train = pd.read_csv("../01-Data/X_train.csv", index_col=0)
x_test = pd.read_csv("../01-Data/X_test.csv", index_col=0)
y_train = pd.read_csv("../01-Data/y_train.csv", index_col=0)

## Analysis of Missing Observations


In [11]:
# x_train
missing_values_train = x_train.isnull().sum()
print("Missing values in x_train:")
for column, count in missing_values_train.items():
    if count > 0:
        print(f"Column '{column}': {count}")

# x_test
missing_values_test = x_test.isnull().sum()
print("\nMissing values in x_test:")
for column, count in missing_values_test.items():
    if count > 0:
        print(f"Column '{column}': {count}")

Missing values in x_train:
Column 'v228b_r': 1
Column 'v231b_r': 3
Column 'v233b_r': 2
Column 'v251b_r': 4

Missing values in x_test:
Column 'v233b_r': 1


There are no missing values to deal with in columns 1 to 146; the few gaps reported above are in later columns.
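
The statement above can be checked directly instead of being read off the printed list. This is a minimal sketch, assuming the first 146 columns are selected by position as in the next cell. The helper `count_missing_in_range` and the toy frame are hypothetical and only serve as illustration.

```python
import numpy as np
import pandas as pd


def count_missing_in_range(df: pd.DataFrame, n_cols: int) -> int:
    """Return the total number of missing cells in the first n_cols columns."""
    return int(df.iloc[:, :n_cols].isnull().sum().sum())


# Toy frame standing in for x_train (hypothetical data, not the survey itself).
toy = pd.DataFrame(
    {
        "v1": [1, 2, 3],
        "v2": [4, 5, 6],
        "v228b_r": [1.0, np.nan, 3.0],
    }
)

missing_first_two = count_missing_in_range(toy, 2)  # looks at v1 and v2 only
missing_all = count_missing_in_range(toy, 3)        # also includes v228b_r
print(missing_first_two, missing_all)
```

On the real data, `count_missing_in_range(x_train, 146)` returning 0 would confirm the statement.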


In [12]:
print(x_train.iloc[:, :146].dtypes)

year         int64
fw_start     int64
fw_end       int64
country      int64
c_abrv      object
v1           int64
v2           int64
v3           int64
v4           int64
v5           int64
v7           int64
v8           int64
v9           int64
v10          int64
v11          int64
v12          int64
v13          int64
v14          int64
v15          int64
v16          int64
v17          int64
v18          int64
v19          int64
f20          int64
v20          int64
v20a         int64
v20b         int64
v21          int64
v22          int64
v23          int64
v24          int64
f24_IT       int64
v24a_IT      int64
v24b_IT      int64
v25          int64
v26          int64
v27          int64
v28          int64
v29          int64
v30          int64
f30a         int64
v30a         int64
v30b         int64
v30c         int64
v31          int64
v32          int64
v33          int64
v34          int64
v35          int64
v36          int64
v37          int64
v38          int64
v39         

There is one string variable, `c_abrv`. We **drop** it because the numeric `country` column already identifies the country.
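
Before dropping `c_abrv`, it may be worth confirming that it really is redundant, that is, that each abbreviation pairs with exactly one `country` code and vice versa. The sketch below shows one way to list the non-numeric columns and run that check. The helper `is_one_to_one` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def is_one_to_one(df: pd.DataFrame, col_a: str, col_b: str) -> bool:
    """True if each value of col_a pairs with exactly one value of col_b and vice versa."""
    pairs = df[[col_a, col_b]].drop_duplicates()
    return bool(pairs[col_a].is_unique and pairs[col_b].is_unique)


# Toy data (hypothetical codes, chosen only for illustration).
toy = pd.DataFrame(
    {
        "country": [208, 208, 756, 528],
        "c_abrv": ["DK", "DK", "CH", "NL"],
    }
)

# Columns that are not numeric, i.e. candidates for inspection.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

redundant = is_one_to_one(toy, "country", "c_abrv")

# A frame where one code carries two abbreviations fails the check.
broken = toy.assign(c_abrv=["DK", "XX", "CH", "NL"])
not_redundant = is_one_to_one(broken, "country", "c_abrv")
print(non_numeric, redundant, not_redundant)
```

If the check fails on the real data, the two columns carry different information and dropping one would lose detail.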

In [14]:
# See the values for c_abrv
frequency_table = x_train['c_abrv'].value_counts()
print("Frequency table for 'c_abrv' column:")
print(frequency_table)

# Drop c_abrv
x_train.drop('c_abrv', axis=1, inplace=True)
x_test.drop('c_abrv', axis=1, inplace=True)

Frequency table for 'c_abrv' column:
c_abrv
DK    2843
CH    2664
NL    2000
IT    1901
GE    1854
DE    1817
FR    1587
RU    1524
CZ    1519
AZ    1514
GB    1491
BA    1439
AT    1376
IS    1357
UA    1348
RO    1336
BG    1302
BY    1297
HR    1271
HU    1260
AM    1257
RS    1255
LT    1222
AL    1194
SK    1188
PL    1137
LV    1106
EE    1101
PT    1022
ES    1010
FI    1004
NO     951
MK     940
SI     913
Name: count, dtype: int64


Based on the codebook, we classify the 145 *numeric* variables into the following groups:

Open question: `f46_IT`?
To drop: `v72_DE`, `v73_DE`, `v74_DE`, `v75_DE`, `v76_DE`, `v77_DE`, `v78_DE`, `v79_DE`

1. Typical numeric variables: year, fw_start, fw_end, country
2. How important in your life: ___ (importance order): v1, v2, v3, v4, v5
3. Misc questions with logical ordinalities: v7, v8, v38, v39, v46, v47, v48, v49, v50, v54, v55, v56, v62, v63, v64, v72, v73, v74, v75, v76, v77, v78, v79, v80, v81, v82, v83, v84, v97, v102, v103, v104, v105, v106, v107
4. Misc questions with yes/no: v21, v31, v51, v53, v71
5. Do you belong to: ___ (category mentioned / not mentioned): v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19
**5-1.** Do you belong to - none: f20 (inconsistency), v20, v20a, v20b
6. Don't like as neighbours: __ (category mentioned / not mentioned): v22, v23, v25, v26, v27, v28, v29, v30
**6-1.** Don't like as neighbours - immigrants / foreign workers?: v24, f24_IT, v24a_IT, v24b_IT
**6-2.** Don't like as neighbours - none/DK: f30a, v30a, v30b, v30c
7. How much you trust: __ (category with ordinal responses): v32, v33, v34, v35, v36, v37
8. Important in a job: __ (category mentioned / not mentioned): v40, v41, v42, v43, v44, v45
**8-1.** Important in a job - none: f45a, v45a, v45b, v45c
9. Which religious denomination do you belong to: v52, v52_cs
10. Do you believe in __ (category yes / no): v57, v58, v59, v60, v61
11. Important in marriage: __ (category with ordinal responses): v65, v66, v67, v68, v69, v70
12. Children should learn at home: __ (category mentioned / not mentioned): v85, v86, v87, v88, v89, v90, v91, v92, v93, v94, v95
**12-1.** Children should learn at home - more than five checked: f85
**12-2.** Children should learn at home - none: f96, v96, v96a, v96b
13. Political Action: __ (category with frequencies): v98, v99, v100, v101
14. Aims of this country (category): f108, v108, v109
15. Aims of respondent: f110, v110, v111, v111_4
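
The grouping above can be kept as a plain dictionary, so that later steps can select columns by group and verify that no variable is assigned twice. This is a minimal sketch covering only a few of the groups. The dictionary keys and the helpers `find_duplicates` and `missing_from` are hypothetical names; the variable lists are copied from the list above.

```python
from collections import Counter

# A few of the groups from the list above; the keys are hypothetical labels.
variable_groups = {
    "importance_in_life": ["v1", "v2", "v3", "v4", "v5"],
    "yes_no": ["v21", "v31", "v51", "v53", "v71"],
    "trust": ["v32", "v33", "v34", "v35", "v36", "v37"],
    "political_action": ["v98", "v99", "v100", "v101"],
}


def find_duplicates(groups: dict[str, list[str]]) -> list[str]:
    """Return variables that are listed more than once across all groups."""
    counts = Counter(var for members in groups.values() for var in members)
    return sorted(var for var, n in counts.items() if n > 1)


def missing_from(groups: dict[str, list[str]], columns: list[str]) -> list[str]:
    """Return grouped variables that are absent from the given column names."""
    present = set(columns)
    return [var for members in groups.values() for var in members if var not in present]


duplicates = find_duplicates(variable_groups)
# Pretend the frame only has v1..v5 to show what the check reports.
absent = missing_from(variable_groups, ["v1", "v2", "v3", "v4", "v5"])
print(duplicates)
print(absent[:3])
```

With the real data, `missing_from(variable_groups, list(x_train.columns))` would flag any name in the grouping that does not exist in `x_train`.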