# Project 5: PISA Data Wrangling
This notebook carries out every data wrangling step needed to prepare the raw PISA 2012 dataset. The process begins with an assessment of data quality and tidiness, followed by several cleaning steps. Finally, the cleaned data is stored for later analysis.

In [1]:
import datetime as dt
import pandas as pd
pd.set_option('display.max_columns', 50)
pd.options.mode.chained_assignment = None 
import numpy as np
import csv
import os

## Gather
<p>As a first step, the CSV file with the PISA data is read in. Then, the resulting dataframe is reduced to the columns needed for the analysis.</p>

In [2]:
# Read in pisa data
pisa_data = pd.read_csv("pisa2012.csv", encoding="cp1252", sep=",")
pisa_dict = pd.read_csv('pisadict2012.csv', encoding="cp1252", sep=",", header=None, names=["A", "B"])
pisa_dict = dict(zip(pisa_dict["A"], pisa_dict["B"]))


# Reduce the full set of over 350 columns to the 39 essential columns
pisa_columns = ["CNT", "ST04Q01", "ST26Q04", "ST26Q05", "ST26Q06", "IC01Q01", 
            "IC01Q02", "IC01Q03", "IC01Q04", "IC02Q01", "IC02Q02",
            "IC02Q03", "IC02Q04", "IC03Q01", "IC04Q01", "IC10Q01",
            "IC10Q02", "IC10Q03", "IC10Q04", "IC10Q05", 
            "PV1MATH", "PV2MATH", "PV3MATH", "PV4MATH", "PV5MATH",   
            "PV1READ", "PV2READ", "PV3READ", "PV4READ", "PV5READ", 
            "PV1SCIE", "PV2SCIE", "PV3SCIE", "PV4SCIE", "PV5SCIE", "AGE", "OUTHOURS", "PARED", "TIMEINT"]
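# Work on an explicit copy so that later column assignments do not act on a view of pisa_data
pisa_new = pisa_data[pisa_columns].copy()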

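The column selection can also happen at read time. The following is a minimal sketch, assuming a tiny in-memory CSV (`raw_csv`, a hypothetical stand-in for `pisa2012.csv`) and only three of the columns; it uses the `usecols` and `dtype` parameters of `pd.read_csv` to skip unneeded columns and to state the expected types up front.

```python
import io
import pandas as pd

# Hypothetical stand-in for pisa2012.csv: four columns, one of them not needed
raw_csv = io.StringIO(
    "CNT,ST04Q01,UNUSED,PV1MATH\n"
    "Spain,Female,x,501.0983\n"
    "Latvia,Female,7,328.9531\n"
)

wanted = ["CNT", "ST04Q01", "PV1MATH"]

# usecols skips the other columns while parsing; dtype declares the column types
subset = pd.read_csv(
    raw_csv,
    usecols=wanted,
    dtype={"CNT": str, "ST04Q01": str, "PV1MATH": float},
)
print(subset.dtypes)
```

With a file as wide as the PISA export, this would keep the unused columns out of memory entirely, at the cost of having to know the column codes before loading.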


## Assess
The second step is the assessment of the data at hand. The PISA 2012 data are assessed visually and programmatically for quality and tidiness issues. A summary of the findings is provided below.

In [3]:
# Explore the dataframe visually as a first assessment
pisa_new.sample(10)

Unnamed: 0,CNT,ST04Q01,ST26Q04,ST26Q05,ST26Q06,IC01Q01,IC01Q02,IC01Q03,IC01Q04,IC02Q01,IC02Q02,IC02Q03,IC02Q04,IC03Q01,IC04Q01,IC10Q01,IC10Q02,IC10Q03,IC10Q04,IC10Q05,PV1MATH,PV2MATH,PV3MATH,PV4MATH,PV5MATH,PV1READ,PV2READ,PV3READ,PV4READ,PV5READ,PV1SCIE,PV2SCIE,PV3SCIE,PV4SCIE,PV5SCIE,AGE,OUTHOURS,PARED,TIMEINT
147821,Spain,Female,Yes,Yes,Yes,"Yes, and I use it","Yes, and I use it","Yes, and I use it","Yes, and I use it","Yes, and I use it","Yes, but I don’t use it",No,"Yes, and I use it",13 years old or older,13 years old or older,Never or hardly ever,Never or hardly ever,Never or hardly ever,Once or twice a month,Never or hardly ever,501.0983,471.4987,487.0774,485.5196,459.8146,545.8539,511.6985,508.5213,540.2937,507.727,468.1462,473.7411,453.2264,477.4711,428.0492,15.83,21.0,,19.0
306745,Latvia,Female,No,No,No,"Yes, but I don’t use it",No,No,No,No,No,No,"Yes, and I use it",7-9 years old,7-9 years old,Never or hardly ever,Never or hardly ever,Never or hardly ever,Never or hardly ever,Never or hardly ever,328.9531,314.9322,369.4579,356.9949,321.1637,321.8584,329.0072,385.4032,359.9853,306.7665,292.279,315.5912,388.3253,308.1313,323.9836,16.25,2.0,11.0,11.0
118970,Colombia,Male,No,No,No,,,,,,,,,,,,,,,,341.2603,376.3125,363.8495,352.1654,373.9757,408.0916,411.2994,436.9616,399.2702,417.7149,415.6471,388.605,417.5121,395.1324,416.5796,15.92,,11.0,
232365,Iceland,Male,Yes,Yes,Yes,"Yes, and I use it","Yes, and I use it",,"Yes, and I use it","Yes, but I don’t use it",No,No,"Yes, and I use it",6 years old or younger,6 years old or younger,Never or hardly ever,Never or hardly ever,Never or hardly ever,Never or hardly ever,Never or hardly ever,466.6693,513.4055,479.9112,473.6797,513.4055,504.6458,499.0322,450.9155,393.9774,475.7758,530.3432,505.166,458.5416,442.6893,508.8959,15.33,,18.0,26.0
289088,Kazakhstan,Male,No,No,No,,,,,,,,,,,,,,,,556.4029,519.0139,560.2976,515.1192,501.8772,437.8438,390.529,462.7041,391.3309,428.2204,412.7564,411.8239,442.596,429.5412,400.6341,15.83,13.0,14.0,
229943,Ireland,Female,Yes,Yes,Yes,"Yes, and I use it","Yes, but I don’t use it","Yes, and I use it","Yes, and I use it","Yes, and I use it",No,No,"Yes, and I use it",6 years old or younger,7-9 years old,Never or hardly ever,Once or twice a week,Once or twice a week,Never or hardly ever,Never or hardly ever,494.8668,523.6875,543.1609,490.1932,493.3089,561.184,596.1337,568.3328,576.2759,550.0637,523.8157,525.6807,528.4782,499.5711,543.398,16.17,,16.0,34.0
60735,Brazil,Male,,,,,,,,,,,,,,,,,,,353.1781,357.8517,413.1563,381.9988,405.3669,328.7792,340.8084,364.0648,375.292,351.2337,372.1,370.235,364.6401,362.7751,373.9649,15.67,,16.0,
419652,Russian Federation,Female,Yes,No,Yes,"Yes, and I use it","Yes, and I use it",No,"Yes, and I use it","Yes, and I use it","Yes, and I use it","Yes, and I use it","Yes, and I use it",10-12 years old,10-12 years old,Never or hardly ever,Never or hardly ever,Never or hardly ever,Never or hardly ever,Never or hardly ever,459.7367,487.7785,470.6418,463.6314,422.3477,461.657,525.2018,444.9765,465.6285,537.9108,453.6927,463.0175,477.0048,466.7475,484.4648,16.0,24.0,13.5,43.0
444794,Sweden,Male,Yes,Yes,Yes,No,"Yes, and I use it",No,"Yes, and I use it","Yes, but I don’t use it","Yes, and I use it",No,"Yes, and I use it",6 years old or younger,7-9 years old,Almost every day,Once or twice a week,Almost every day,Almost every day,Almost every day,385.8155,388.1524,371.7947,416.973,356.9949,396.9446,408.1718,346.422,369.6784,336.7987,434.9496,388.3253,356.6207,413.5024,361.2831,16.0,,16.0,77.0
302201,Luxembourg,Female,Yes,No,Yes,,,,,,,,,,,,,,,,506.2393,448.5979,498.4499,533.5021,504.6814,519.5622,454.4288,448.8686,473.4922,544.1858,486.5162,479.9888,423.107,566.7102,524.7482,16.0,4.0,13.0,


In [4]:
# Explore data programmatically by using built-in pandas functions
# Assess missing values and data types
pisa_new.info()

# Determine number of duplicate rows
print("Number of duplicate rows:", pisa_new.duplicated(keep="first").sum())

# Assess numeric variables through summary statistics
pisa_new.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 485490 entries, 0 to 485489
Data columns (total 39 columns):
CNT         485490 non-null object
ST04Q01     485490 non-null object
ST26Q04     473877 non-null object
ST26Q05     463178 non-null object
ST26Q06     473182 non-null object
IC01Q01     296977 non-null object
IC01Q02     297068 non-null object
IC01Q03     295602 non-null object
IC01Q04     297305 non-null object
IC02Q01     296975 non-null object
IC02Q02     295618 non-null object
IC02Q03     294625 non-null object
IC02Q04     296944 non-null object
IC03Q01     293216 non-null object
IC04Q01     296305 non-null object
IC10Q01     291811 non-null object
IC10Q02     291025 non-null object
IC10Q03     290262 non-null object
IC10Q04     290907 non-null object
IC10Q05     291025 non-null object
PV1MATH     485490 non-null float64
PV2MATH     485490 non-null float64
PV3MATH     485490 non-null float64
PV4MATH     485490 non-null float64
PV5MATH     485490 non-null float64
PV1READ  

Unnamed: 0,PV1MATH,PV2MATH,PV3MATH,PV4MATH,PV5MATH,PV1READ,PV2READ,PV3READ,PV4READ,PV5READ,PV1SCIE,PV2SCIE,PV3SCIE,PV4SCIE,PV5SCIE,AGE,OUTHOURS,PARED,TIMEINT
count,485490.0,485490.0,485490.0,485490.0,485490.0,485490.0,485490.0,485490.0,485490.0,485490.0,485490.0,485490.0,485490.0,485490.0,485490.0,485374.0,308799.0,473091.0,297074.0
mean,469.621653,469.648358,469.64893,469.641832,469.695396,472.00464,472.068052,472.022059,471.926562,472.013506,475.769824,475.813674,475.851549,475.78524,475.820184,15.784283,11.1041,12.995225,50.895996
std,103.265391,103.382077,103.407631,103.392286,103.41917,102.505523,102.626198,102.640489,102.576066,102.659989,101.464426,101.514649,101.495072,101.5122,101.566347,0.290221,10.476669,3.398623,40.987895
min,19.7928,6.473,42.2262,24.6222,37.0852,0.0834,0.7035,0.7035,4.1344,2.3074,2.6483,2.8348,11.8799,8.4297,17.7546,15.17,0.0,3.0,0.0
25%,395.3186,395.3186,395.2407,395.3965,395.2407,403.6007,403.3601,403.3601,403.3546,403.3601,404.4573,404.4573,404.5505,404.4573,404.4573,15.58,4.0,12.0,19.0
50%,466.2019,466.124,466.2019,466.2798,466.4356,475.455,475.5352,475.455,475.5352,475.5352,475.6994,475.6061,475.6994,475.9791,475.8859,15.75,8.0,13.0,39.0
75%,541.0578,541.4473,541.2915,541.4473,541.4473,544.5025,544.5035,544.5035,544.5025,544.5035,547.7807,547.8739,547.9672,547.7807,547.7807,16.0,14.0,16.0,71.0
max,962.2293,957.0104,935.7454,943.4569,907.6258,904.8026,881.2392,884.447,881.159,901.6086,903.3383,900.5408,867.624,926.5573,880.9586,16.33,180.0,18.0,206.0


### Summary
#### Quality
- some columns contain mixed data types
- there are many missing entries in the information and communication technology (ICT) columns (prefix IC)
- potential outliers in the variable OUTHOURS (learning time out of school), with a maximum of 180 hours

#### Tidiness
- test scores for READ, MATH and SCIENCE are each split into five columns with plausible values
- columns should be renamed for consistency and clarity
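The quality findings in this summary can also be checked programmatically. The sketch below is illustrative only: `toy` is a made-up dataframe, not the PISA data. It counts the distinct Python types among the non-missing values of each column and computes the share of missing entries.

```python
import numpy as np
import pandas as pd

# Made-up example data: one column with mixed types, one clean numeric column
toy = pd.DataFrame({
    "mixed": ["Yes", 1, np.nan, "No"],
    "clean": [1.0, 2.0, np.nan, 4.0],
})

# Number of distinct Python types among the non-missing values of each column
type_counts = {col: toy[col].dropna().map(type).nunique() for col in toy.columns}

# Share of missing entries per column
missing_share = toy.isna().mean()

print(type_counts)
print(missing_share)
```

A count above 1 flags a column whose values would need a consistent type before analysis.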

## Clean
The third step is dedicated to data cleaning. The documented quality and tidiness issues are resolved one after another.
### Tidiness
The five plausible values per subject are averaged into a single score each for READ, MATH, and SCIENCE. In addition, a total average score is calculated for each student record.
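This averaging can also be written as a row-wise mean. The following is a sketch on made-up numbers (`toy` is not the PISA data) for the MATH columns only. Note one assumption that matters: `mean(axis=1)` skips missing values by default, whereas summing five columns and dividing by 5 yields NaN as soon as one value is missing. The assessment shows no missing plausible values, so both approaches would agree here.

```python
import pandas as pd

# Made-up plausible values for two students
toy = pd.DataFrame({
    "PV1MATH": [500.0, 300.0],
    "PV2MATH": [510.0, 310.0],
    "PV3MATH": [520.0, 320.0],
    "PV4MATH": [530.0, 330.0],
    "PV5MATH": [540.0, 340.0],
})

# Build the five column names instead of typing them out
math_cols = [f"PV{i}MATH" for i in range(1, 6)]

# Row-wise mean over the five plausible values, then drop the source columns
toy["avg_math_score"] = toy[math_cols].mean(axis=1)
toy = toy.drop(columns=math_cols)
print(toy)
```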

In [5]:
# Create average of MATH, READ and SCIENCE scores and store results in new columns
pisa_new["avg_math_score"] = (pisa_new["PV1MATH"] + pisa_new["PV2MATH"] + pisa_new["PV3MATH"] + pisa_new["PV4MATH"] + pisa_new["PV5MATH"]) / 5
pisa_new["avg_read_score"] = (pisa_new["PV1READ"] + pisa_new["PV2READ"] + pisa_new["PV3READ"] + pisa_new["PV4READ"] + pisa_new["PV5READ"]) / 5
pisa_new["avg_science_score"] = (pisa_new["PV1SCIE"] + pisa_new["PV2SCIE"] + pisa_new["PV3SCIE"] + pisa_new["PV4SCIE"] + pisa_new["PV5SCIE"]) / 5
pisa_new["avg_total_score"] = (pisa_new["avg_math_score"] + pisa_new["avg_read_score"] + pisa_new["avg_science_score"]) / 3

pisa_new.drop(columns=["PV1MATH", "PV2MATH", "PV3MATH", "PV4MATH", "PV5MATH",   
            "PV1READ", "PV2READ", "PV3READ", "PV4READ", "PV5READ", 
            "PV1SCIE", "PV2SCIE", "PV3SCIE", "PV4SCIE", "PV5SCIE"], inplace=True)

To make the data easier to read, the column codes are replaced with descriptive names based on their dictionary definitions.

In [6]:
# Rename columns of dataframe
pisa_new.rename({"CNT":"country", "ST04Q01":"gender", "ST26Q04":"posession_computer", "ST26Q05":"posession_software",
          "ST26Q06":"posession_internet", "IC01Q01":"at_home_computer", "IC01Q02":"at_home_laptop", "IC01Q03":"at_home_tablet",
          "IC01Q04":"at_home_internet", "IC02Q01":"at_school_computer", "IC02Q02":"at_school_laptop", "IC02Q03":"at_school_tablet",
          "IC02Q04":"at_school_internet", "IC03Q01":"first_use_computer", "IC04Q01":"first_use_internet", "IC10Q01":"at_school_chatting",
          "IC10Q02":"at_school_emailing", "IC10Q03":"at_school_browsing", "IC10Q04":"at_school_downloading", "IC10Q05":"at_school_posting",
          "AGE":"age", "OUTHOURS":"study_time", "TIMEINT":"computer_time", "PARED":"parent_education_years"}, 
          axis="columns", inplace=True)

# Test if changes were successful
pisa_new.sample(5)

Unnamed: 0,country,gender,posession_computer,posession_software,posession_internet,at_home_computer,at_home_laptop,at_home_tablet,at_home_internet,at_school_computer,at_school_laptop,at_school_tablet,at_school_internet,first_use_computer,first_use_internet,at_school_chatting,at_school_emailing,at_school_browsing,at_school_downloading,at_school_posting,age,study_time,parent_education_years,computer_time,avg_math_score,avg_read_score,avg_science_score,avg_total_score
321806,Mexico,Female,No,No,No,No,No,No,No,No,No,No,"Yes, and I use it",13 years old or older,13 years old or older,Never or hardly ever,Never or hardly ever,Once or twice a month,Never or hardly ever,Never or hardly ever,15.75,8.0,14.0,4.0,317.8922,421.30604,362.77506,367.324433
122630,Colombia,Female,No,No,No,,,,,,,,,,,,,,,,16.17,3.0,3.0,,391.2681,446.64454,412.66318,416.858607
11800,United Arab Emirates,Female,Yes,Yes,Yes,,,,,,,,,,,,,,,,15.75,9.0,16.0,,438.7833,442.91126,447.2585,442.984353
54373,Bulgaria,Male,Yes,Yes,Yes,,,,,,,,,,,,,,,,15.5,5.0,10.0,,428.5792,288.0404,381.89108,366.170227
49459,Belgium,Male,Yes,Yes,Yes,,,,,,,,,,,,,,,,15.67,7.0,17.0,,599.7118,606.65318,540.22752,582.1975


### Quality
Because pandas issued a warning about inconsistent data types when the CSV file was read in, the columns must be cast to appropriate data types.

In [7]:
# Do conversion for string data type
for c in ["country", "gender", "posession_computer", "posession_software", 
          "posession_internet", "at_home_computer", "at_home_laptop", "at_home_tablet", 
          "at_home_internet", "at_school_computer", "at_school_laptop", "at_school_tablet", 
          "at_school_internet", "first_use_computer", "first_use_internet", "at_school_chatting", 
          "at_school_emailing", "at_school_browsing", "at_school_downloading", "at_school_posting"]:
    pisa_new[c] = pisa_new[c].astype(str)

# Do conversion for float data type
for c in ["age", "study_time", "computer_time", "parent_education_years", "avg_math_score", 
          "avg_read_score", "avg_science_score", "avg_total_score"]:
    pisa_new[c] = pisa_new[c].astype(float)

# Transform the strings "None" and "nan" created by astype(str) back into real NaN values
pisa_new.replace(to_replace=["None", "nan"], value=np.nan, inplace=True)

# Test if changes were successful
pisa_new.dtypes

country                    object
gender                     object
posession_computer         object
posession_software         object
posession_internet         object
at_home_computer           object
at_home_laptop             object
at_home_tablet             object
at_home_internet           object
at_school_computer         object
at_school_laptop           object
at_school_tablet           object
at_school_internet         object
first_use_computer         object
first_use_internet         object
at_school_chatting         object
at_school_emailing         object
at_school_browsing         object
at_school_downloading      object
at_school_posting          object
age                       float64
study_time                float64
parent_education_years    float64
computer_time             float64
avg_math_score            float64
avg_read_score            float64
avg_science_score         float64
avg_total_score           float64
dtype: object
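Depending on the pandas version, `astype(str)` can turn missing values into the literal text `"nan"`, which is why the cell above restores real NaN values afterwards. A possible alternative, sketched here on a made-up series (`answers` is hypothetical), converts only the non-missing entries and leaves the missing ones untouched.

```python
import numpy as np
import pandas as pd

# Made-up answers with mixed types and one missing value
answers = pd.Series(["Yes", 1, np.nan, "No"], dtype=object)

# where() keeps the original value where the condition is True (the missing entry)
# and takes the string-converted value everywhere else
as_text = answers.where(answers.isna(), answers.astype(str))
print(as_text)
```

This avoids a second pass over the dataframe, though the two-step approach used in the notebook leads to the same result.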

Next, the large number of missing values in the columns related to the information and communication technology (ICT) questionnaire needs to be addressed. If a row has no answer to any of the first eight ICT questions (availability at home and at school), it is deleted.

In [8]:
# Drop rows if they do not contain at least one answer for the first eight ICT questions
pisa_new.dropna(subset=["at_home_computer", "at_home_laptop", "at_home_tablet", "at_home_internet", "at_school_computer", "at_school_laptop", "at_school_tablet", "at_school_internet"], thresh=1, inplace=True)

# Test if changes were successful
pisa_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 299907 entries, 22151 to 475552
Data columns (total 28 columns):
country                   299907 non-null object
gender                    299907 non-null object
posession_computer        296940 non-null object
posession_software        290994 non-null object
posession_internet        296702 non-null object
at_home_computer          296977 non-null object
at_home_laptop            297068 non-null object
at_home_tablet            295602 non-null object
at_home_internet          297305 non-null object
at_school_computer        296975 non-null object
at_school_laptop          295618 non-null object
at_school_tablet          294625 non-null object
at_school_internet        296944 non-null object
first_use_computer        292771 non-null object
first_use_internet        295827 non-null object
at_school_chatting        291252 non-null object
at_school_emailing        290479 non-null object
at_school_browsing        289714 non-null object
at_
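To illustrate what `thresh=1` does in the `dropna` call, here is a small sketch on made-up rows (`toy` and its values are hypothetical). A row is kept if at least one of the columns in `subset` is non-missing, which matches `how="all"` restricted to the same columns.

```python
import numpy as np
import pandas as pd

# Made-up rows: A and C answered one ICT question each, B answered none
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "at_home_computer": ["Yes", np.nan, np.nan],
    "at_school_computer": [np.nan, np.nan, "No"],
})
ict_cols = ["at_home_computer", "at_school_computer"]

# Keep rows with at least one non-missing value among the ICT columns
kept = toy.dropna(subset=ict_cols, thresh=1)

# Equivalent formulation: drop rows only if all ICT columns are missing
same = toy.dropna(subset=ict_cols, how="all")
print(kept)
```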

The column for weekly study time out of school contains implausibly large values (the maximum is 180 hours). It seems unrealistic that a student studies more than 40 hours per week, i.e. the equivalent of a full-time job, on top of the time spent at school. All values above this limit are therefore replaced with missing values.

In [9]:
# Set study time values larger than 40 hours to NaN
pisa_new.loc[pisa_new["study_time"] > 40, "study_time"] = np.nan

# Test if changes were successful (Series.max() skips NaN values)
print("Maximum hours of study time out of school:", pisa_new["study_time"].max())

Maximum hours of study time out of school: 40.0
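Before discarding values, it can help to count how many records the limit affects. The sketch below uses a made-up series (`study_time` here is hypothetical, not the notebook column) and `Series.mask`, which replaces entries with NaN where the condition holds.

```python
import numpy as np
import pandas as pd

# Made-up weekly study times, including two implausible values and one missing
study_time = pd.Series([8.0, 40.0, 41.0, 180.0, np.nan])
limit = 40

# Number of values above the limit
n_over = int((study_time > limit).sum())

# mask() sets entries to NaN where the condition is True
capped = study_time.mask(study_time > limit)

print("Values above limit:", n_over)
print("Maximum after masking:", capped.max())
```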


In [10]:
# Store prepared dataframe in new CSV file
pisa_new.to_csv("pisa_new.csv", index=False, encoding="utf-8")
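A quick round trip can confirm that the stored file loads as expected. This is a minimal sketch, assuming an in-memory buffer in place of `pisa_new.csv` and a made-up two-row dataframe; missing values are written as empty fields and read back as NaN.

```python
import io
import numpy as np
import pandas as pd

# Made-up cleaned data with one missing value
toy = pd.DataFrame({"country": ["Spain", "Latvia"], "study_time": [21.0, np.nan]})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```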