# __Imports__

In [365]:
import pandas as pd
import numpy as np
import os
import json

# __Data Import and Shape__

<a href="../data/raw/data.csv">data.csv</a> is a pre-processed dataset formed by merging <a href="../data/raw/data.csv">metrics.csv</a>, <a>survey.csv</a>, and <a href="../data/raw/data.csv">violence.csv</a>. It was pre-processed in Excel because the original databases were not in proper tabular format (the column titles spanned multiple rows).

In [366]:
relative_path = os.path.join('..', 'data', 'raw', 'data.csv')

df = pd.read_csv(relative_path)
df.shape


(6025, 174)

# __Column Types__

Inspect the majority type of each column with mixed types, then coerce those columns to numeric.

In [367]:
for column in [16, 17, 18, 19, 20, 21, 22, 23, 165]:
    majority_type = df.iloc[:,column].apply(lambda x: type(x)).value_counts().idxmax()
    print(majority_type)

<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>
<class 'float'>


In [368]:
for column in [16, 17, 18, 19, 20, 21, 22, 23, 165]:
    df.iloc[:,column] = pd.to_numeric(df.iloc[:,column], errors='coerce')
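
As a small illustration of the two cells above, the sketch below builds a toy mixed-type column (the values are invented, not taken from the dataset), finds its most common Python type, and coerces it with `pd.to_numeric`. With `errors='coerce'`, strings that cannot be parsed become `NaN`.

```python
import pandas as pd

# Toy column mixing floats and strings (illustrative values only)
mixed = pd.Series([1.0, 2.5, "3", "n/a", 4.0], dtype=object)

# Most common Python type among the cells
majority_type = mixed.apply(lambda x: type(x)).value_counts().idxmax()

# Coerce everything to numeric; unparseable strings become NaN
coerced = pd.to_numeric(mixed, errors="coerce")

print(majority_type)
print(coerced.tolist())
```

Checking `coerced.isna().sum()` before and after the coercion is a quick way to see how many cells were lost to unparseable strings.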

# __Column Names__

In [369]:
column_names = df.columns.tolist()
column_dict = dict.fromkeys(column_names)
column_dict

{'A.02. Código del centro educativo según último acuerdo de acreditación:': None,
 'A.01. Nombre del centro educativo, según último acuerdo de acreditacion:': None,
 'A.03. Distrito educativo según última organización de la Dirección Departamental de Educación:': None,
 'A.04. Sector:': None,
 'A.04.a. Si marco sector privado, ¿Recibe Subsidio?:': None,
 'A.05. Organismo de administración:': None,
 'A.06. Zona:': None,
 'Código del Departamento': None,
 'A.07. Departamento:': None,
 'Código del Municipio': None,
 'A.08. Municipio:': None,
 'A.09. Nombre del cantón:': None,
 'A.10. Nombre del caserío:': None,
 'A.11. Dirección actual:': None,
 'A.12. ¿El centro educativo, se encuentra ubicado en una comunidad indígena?': None,
 'B.01. Proporcionó estos datos:': None,
 'Apoyo que recibe: Asistencia Técnica;No recibe apoyo': None,
 'Apoyo que recibe: Económico Monetario;No recibe apoyo': None,
 'Apoyo que recibe: Material Didáctico;No recibe apoyo': None,
 'Apoyo que recibe: Mobiliario y 

Export the column names to a JSON file so that English names can be assigned manually in a JSON editor.

In [370]:
# Write column_dict to a JSON file with UTF-8 encoding
with open('../data/interim/column_dict.json', 'w', encoding='utf-8') as f:
    json.dump(column_dict, f, ensure_ascii=False)

At this point, I manually assigned appropriate English names to all columns. After the assignment, the edited JSON file is loaded to rename the columns.
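
The round trip can be sketched on a toy frame as follows. The two Spanish headers come from the listing above, while the cell values and the in-memory "manual edit" are stand-ins for illustration. The sketch renames through a mapping with `DataFrame.rename`, which is an alternative to assigning the values by position and does not depend on the key order in the file.

```python
import json
import pandas as pd

# Toy frame with two of the original Spanish headers (values invented)
df_demo = pd.DataFrame({"Código del Municipio": [101], "A.06. Zona:": ["Rural"]})

# Template: original names as keys, values left empty for manual editing
template = dict.fromkeys(df_demo.columns)

# ensure_ascii=False keeps accented characters readable in the output
text = json.dumps(template, ensure_ascii=False, indent=2)

# Stand-in for the manual edit: fill in the English names
edited = json.loads(text)
edited["Código del Municipio"] = "Municipality Code"
edited["A.06. Zona:"] = "Zone"

# Guard against entries that were left unfilled
missing = [key for key, value in edited.items() if value is None]
assert not missing, f"Unnamed columns: {missing}"

# Rename through the mapping rather than by position
renamed = df_demo.rename(columns=edited)
print(list(renamed.columns))
```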

In [371]:
# Open the JSON file and make list of values
with open('../data/interim/edit_column_dict.json', 'r', encoding='utf-8') as f:
    edit_column_dict = json.load(f)

# Assign new column names to df
df.columns = list(edit_column_dict.values())

# Print column names
list(df.columns)

['School ID',
 'School Name',
 'School District',
 'Sector',
 'If private, does it receive subsidy?',
 'Administrative Body',
 'Zone',
 'Department Code',
 'Department Name',
 'Municipality Code',
 'Municipality Name',
 'Canton Name',
 'Hamlet Name',
 'Address',
 'Does the school belong to an indigenous community?',
 'Who provided the data?',
 'Does not receive technical support.',
 'Does not receive economic support.',
 'Does not receive didactic material support.',
 'Does not receive furniture and equipment support.',
 'Does not receive infrastructure support.',
 'Does not receive teacher remuneration support.',
 'Does not receive construction material support.',
 'Does not receive food support.',
 'Funding: cafeterias.',
 'Funding: voluntary contributions.',
 'Funding: own activities.',
 'Funding: donations.',
 'Do you own your facilities?',
 'Water source: internal pipeline.',
 'Water source: river, lake, spring.',
 'Water source: rainwater.',
 'Water source: public sink.',
 'Water

# __Generating new features__

In [372]:
# Make a deep copy of the dataframe
new_df = df.copy()

## __Explanation__

Many of the current features can be combined, through sums and ratios, into metrics that better represent a characteristic of each school.

Take, for example, the following features. They record the robotics kits a school has and whether students use them.

In [373]:
list(df.iloc[:10, 108:136].columns)

['Lego robotics kit: good condition.',
 'Lego robotics kit: bad condition.',
 'Lego robotics kit: rented.',
 'Lego robotics kit: used by students.',
 'Rex robotics kit: good condition.',
 'Rex robotics kit: bad condition.',
 'Rex robotics kit: rented.',
 'Rex robotics kit: used by students.',
 'NXT robotics kit: good condition.',
 'NXT robotics kit: bad condition.',
 'NXT robotics kit: rented.',
 'NXT robotics kit: used by students.',
 'EV3 robotics kit: good condition.',
 'EV3 robotics kit: bad condition.',
 'EV3 robotics kit: rented.',
 'EV3 robotics kit: used by students.',
 'Chumchebot robotics kit: good condition.',
 'Chumchebot robotics kit: bad condition.',
 'Chumchebot robotics kit: rented.',
 'Chumchebot robotics kit: used by students.',
 'Make Block robotics kit: good condition.',
 'Make Block robotics kit: bad condition.',
 'Make Block robotics kit: rented.',
 'Make Block robotics kit: used by students.',
 'Other robotics kit: good condition.',
 'Other robotics kit: bad cond

From these features, we will create two features.

1. **Good condition robotic kit rate** measures the share of robotic kits in good condition out of all robotic kits owned by the school. A rate of 1 means that all robotic kits are in good condition, which is better.
2. **Robotic kit usage rate** measures how many robotic kits are used by students out of all robotic kits possessed by the school. A rate of 1 means that all robotic kits possessed are being used by students, which is better.

$\text{Good condition robotic kit rate (GCRKR)} = \frac{\text{total robotic kits in good condition}}{\text{total robotic kits in good condition} + \text{total robotic kits in bad condition}} \in [0,1]$

$\text{Robotic kit usage rate (RKUR)} = \frac{\text{total robotic kits used by students}}{\text{total robotic kits in good condition} + \text{total robotic kits in bad condition} + \text{total rented robotic kits}} \in [0,1]$

## __New features__

\begin{equation}
\text{Good condition robotic kit rate (GCRKR)} = \frac{\text{total robotic kits in good condition}}{\text{total robotic kits in good condition} + \text{total robotic kits in bad condition}} \in [0,1]
\end{equation}

- **Def**: Measures the rate of good robotic kits out of all robotic kits owned by the school. A rate of 1 means that all robotic kits are in good condition, which is better.
- **Assumptions**: We assume the denominator represents all robotic kits owned by the school.

\begin{equation}
\text{Robotic kit usage rate (RKUR)} = \frac{\text{total robotic kits used by students}}{\text{total robotic kits in good condition} + \text{total robotic kits in bad condition} + \text{total rented robotic kits}} \in [0,1]
\end{equation}

- **Def**: Measures how many robotic kits are used by students out of all robotic kits possessed by the school. A rate of 1 means that all robotic kits possessed are being used by students, which is better.
- **Assumptions**: We assume the denominator represents all robotic kits possessed by the school.
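
To make the two formulas concrete, here is a minimal vectorized sketch on a toy frame with two kit types and three schools. The counts are invented, and the helper `total` is a hypothetical name introduced here. It selects columns by name suffix, which is a different approach from the positional loop used in the notebook, but it applies the same rules: missing counts are treated as 0 and an empty denominator yields a rate of 0.

```python
import numpy as np
import pandas as pd

# Toy data: two kit types, three schools (counts invented for illustration)
kits = pd.DataFrame({
    "Lego robotics kit: good condition.":   [3, 0, np.nan],
    "Lego robotics kit: bad condition.":    [1, 0, np.nan],
    "Lego robotics kit: rented.":           [0, 0, np.nan],
    "Lego robotics kit: used by students.": [2, 0, np.nan],
    "Rex robotics kit: good condition.":    [1, 2, np.nan],
    "Rex robotics kit: bad condition.":     [0, 2, np.nan],
    "Rex robotics kit: rented.":            [1, 0, np.nan],
    "Rex robotics kit: used by students.":  [3, 1, np.nan],
})

def total(suffix):
    """Row-wise sum of all columns ending with `suffix`; NaN counts as 0."""
    cols = [c for c in kits.columns if c.endswith(suffix)]
    return kits[cols].sum(axis=1)

good = total(": good condition.")
bad = total(": bad condition.")
rented = total(": rented.")
used = total(": used by students.")

owned = good + bad          # denominator of GCRKR
possessed = owned + rented  # denominator of RKUR

# Rates default to 0 when the school has no kits at all
gcrkr = (good / owned).where(owned > 0, 0.0)
rkur = (used / possessed).where(possessed > 0, 0.0)

print(pd.DataFrame({"GCRKR": gcrkr, "RKUR": rkur}))
```

For the first toy school, 4 of 5 owned kits are in good condition (GCRKR = 0.8) and 5 of 6 possessed kits are used by students (RKUR ≈ 0.83).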

In [374]:
list(df.iloc[:10, 108:136].columns)

['Lego robotics kit: good condition.',
 'Lego robotics kit: bad condition.',
 'Lego robotics kit: rented.',
 'Lego robotics kit: used by students.',
 'Rex robotics kit: good condition.',
 'Rex robotics kit: bad condition.',
 'Rex robotics kit: rented.',
 'Rex robotics kit: used by students.',
 'NXT robotics kit: good condition.',
 'NXT robotics kit: bad condition.',
 'NXT robotics kit: rented.',
 'NXT robotics kit: used by students.',
 'EV3 robotics kit: good condition.',
 'EV3 robotics kit: bad condition.',
 'EV3 robotics kit: rented.',
 'EV3 robotics kit: used by students.',
 'Chumchebot robotics kit: good condition.',
 'Chumchebot robotics kit: bad condition.',
 'Chumchebot robotics kit: rented.',
 'Chumchebot robotics kit: used by students.',
 'Make Block robotics kit: good condition.',
 'Make Block robotics kit: bad condition.',
 'Make Block robotics kit: rented.',
 'Make Block robotics kit: used by students.',
 'Other robotics kit: good condition.',
 'Other robotics kit: bad cond

In [375]:
# Drop columns 108 to 135 from new_df
new_df.drop([
    'Lego robotics kit: good condition.',
    'Lego robotics kit: bad condition.',
    'Lego robotics kit: rented.',
    'Lego robotics kit: used by students.',
    'Rex robotics kit: good condition.',
    'Rex robotics kit: bad condition.',
    'Rex robotics kit: rented.',
    'Rex robotics kit: used by students.',
    'NXT robotics kit: good condition.',
    'NXT robotics kit: bad condition.',
    'NXT robotics kit: rented.',
    'NXT robotics kit: used by students.',
    'EV3 robotics kit: good condition.',
    'EV3 robotics kit: bad condition.',
    'EV3 robotics kit: rented.',
    'EV3 robotics kit: used by students.',
    'Chumchebot robotics kit: good condition.',
    'Chumchebot robotics kit: bad condition.',
    'Chumchebot robotics kit: rented.',
    'Chumchebot robotics kit: used by students.',
    'Make Block robotics kit: good condition.',
    'Make Block robotics kit: bad condition.',
    'Make Block robotics kit: rented.',
    'Make Block robotics kit: used by students.',
    'Other robotics kit: good condition.',
    'Other robotics kit: bad condition.',
    'Other robotics kit: rented.',
    'Other robotics kit: used by students.'
], inplace=True, axis=1)

In [376]:
def compute_gcrkr_rkur(row):
    total_rk_good = 0
    total_rk_bad = 0
    total_rk_rented = 0
    total_rk_student = 0

    for j in range(108, 136, 4):
        if not np.isnan(row.iloc[j]):
            total_rk_good += row.iloc[j]
        if not np.isnan(row.iloc[j+1]):
            total_rk_bad += row.iloc[j+1]
        if not np.isnan(row.iloc[j+2]):
            total_rk_rented += row.iloc[j+2]
        if not np.isnan(row.iloc[j+3]):
            total_rk_student += row.iloc[j+3]
    
    good_bad = total_rk_good + total_rk_bad
    all_rk = (total_rk_good + total_rk_bad + total_rk_rented)

    gcrkr = 0 if good_bad == 0 else total_rk_good / good_bad
    rkur = 0 if all_rk == 0 else total_rk_student / all_rk
    
    return pd.Series([gcrkr, rkur])

In [377]:
new_df[['GCRKR', 'RKUR']] = df.apply(compute_gcrkr_rkur, axis=1)

In [378]:
new_df[['GCRKR', 'RKUR']].head()

|   | GCRKR | RKUR |
|---|-------|------|
| 0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 |
| 2 | 1.0 | 1.0 |
| 3 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 |


\begin{equation}
\text{Switch "Does not receive..." to "Receive..." features}
\end{equation}

- **Def**: Inverts each binary indicator, so that 1 means the school receives the given type of support and 0 means it does not.

In [379]:
list(df.iloc[:10, 16:24].columns)

['Does not receive technical support.',
 'Does not receive economic support.',
 'Does not receive didactic material support.',
 'Does not receive furniture and equipment support.',
 'Does not receive infrastructure support.',
 'Does not receive teacher remuneration support.',
 'Does not receive construction material support.',
 'Does not receive food support.']

In [380]:
new_df.drop([
    'Does not receive technical support.',
    'Does not receive economic support.',
    'Does not receive didactic material support.',
    'Does not receive furniture and equipment support.',
    'Does not receive infrastructure support.',
    'Does not receive teacher remuneration support.',
    'Does not receive construction material support.',
    'Does not receive food support.'
], inplace=True, axis=1)

In [381]:
new_df['Receives technical support'] = df.iloc[:, 16].apply(lambda x: 0 if x == 1 else 1)
new_df['Receives economic support'] = df.iloc[:, 17].apply(lambda x: 0 if x == 1 else 1)
new_df['Receives didactic material support'] = df.iloc[:, 18].apply(lambda x: 0 if x == 1 else 1)
new_df['Receives furniture and equipment support'] = df.iloc[:, 19].apply(lambda x: 0 if x == 1 else 1)
new_df['Receives infrastructure support'] = df.iloc[:, 20].apply(lambda x: 0 if x == 1 else 1)
new_df['Receives teacher remuneration support'] = df.iloc[:, 21].apply(lambda x: 0 if x == 1 else 1)
new_df['Receives construction material support'] = df.iloc[:, 22].apply(lambda x: 0 if x == 1 else 1)
new_df['Receives food support'] = df.iloc[:, 23].apply(lambda x: 0 if x == 1 else 1)
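
One detail of this inversion is worth seeing on a toy column: the rule `0 if x == 1 else 1` maps every value other than 1 to 1, including `NaN`. The sketch below shows that behavior next to a variant that keeps missing values missing. The variant is an assumption offered for comparison, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy "Does not receive ..." indicator with one missing answer
does_not_receive = pd.Series([1.0, 0.0, np.nan])

# Same rule as the cell above: anything that is not 1 becomes 1
flipped = does_not_receive.apply(lambda x: 0 if x == 1 else 1)

# Variant (an assumption, not the notebook's method): NaN stays NaN
flipped_keep_nan = 1 - does_not_receive

print(flipped.tolist())
print(flipped_keep_nan.tolist())
```

Which of the two is appropriate depends on whether a missing answer should count as "receives support" in later analysis.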

\begin{equation}
\text{Total internal funding (TIF)} = \text{total funding from cafeterias} + \text{total funding from own activities} \in \mathbb{R}
\end{equation}

- **Def**: All funding coming from internal operations at the school.
- **Assumptions**: We assume "total funding from own activities" represents total funding obtained from school's operations.

\begin{equation}
\text{Total external funding (TEF)} = \text{total funding from voluntary contributions} + \text{total funding from donations} \in \mathbb{R}
\end{equation}

- **Def**: All funding coming from sources external to the school.
- **Assumptions**: We assume its two parameters constitute all external funding to the school.

In [382]:
list(df.iloc[:10, 24:28].columns)

['Funding: cafeterias.',
 'Funding: voluntary contributions.',
 'Funding: own activities.',
 'Funding: donations.']

In [383]:
new_df.drop(['Funding: cafeterias.', 'Funding: voluntary contributions.', 'Funding: own activities.', 'Funding: donations.'], inplace=True, axis=1)

In [384]:
# Remove commas from columns 24 to 27
df.iloc[:, 24:28] = df.iloc[:, 24:28].replace(',', '', regex=True)

# Make columns 24 to 27 into floats
df.iloc[:, 24:28] = df.iloc[:, 24:28].astype(float)

In [385]:
def compute_funding(row):
    tif = 0
    tef = 0
    
    if not np.isnan(row.iloc[24]) and not np.isnan(row.iloc[26]):
        tif = row.iloc[24] + row.iloc[26]
    if not np.isnan(row.iloc[25]) and not np.isnan(row.iloc[27]):
        tef = row.iloc[25] + row.iloc[27]
    
    return pd.Series([tif, tef])
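
The cleaning and summing steps can be sketched on a toy frame (amounts invented). The first result follows the rule in `compute_funding`, where a total is only counted when both of its parts are present. The second is a variant, labeled as an assumption, that treats a missing part as 0 and keeps the other part.

```python
import numpy as np
import pandas as pd

# Toy funding columns stored as text with thousands separators
funding = pd.DataFrame({
    "Funding: cafeterias.":     ["1,200.50", "300", np.nan],
    "Funding: own activities.": ["800", np.nan, np.nan],
})

# Strip the commas, then convert to float (missing cells stay NaN)
numeric = funding.replace(",", "", regex=True).astype(float)

cafeterias = numeric["Funding: cafeterias."]
own = numeric["Funding: own activities."]

# Rule used in compute_funding: the sum counts only if both parts are present
tif_both = (cafeterias + own).fillna(0.0)

# Variant (an assumption): a missing part counts as 0
tif_any = numeric.sum(axis=1)

print(tif_both.tolist())
print(tif_any.tolist())
```

The two results differ only for the second toy school, where one part is missing: the first rule gives 0 and the variant gives 300.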

In [386]:
new_df[['TIF', 'TEF']] = df.apply(compute_funding, axis=1)

In [387]:
new_df[['TIF', 'TEF']].head()

|   | TIF | TEF |
|---|-----|-----|
| 0 | 4573.89 | 727338.03 |
| 1 | 2652.0 | 0.0 |
| 2 | 3919.64 | 2450.5 |
| 3 | 2000.0 | 300.0 |
| 4 | 245.0 | 0.0 |


\begin{equation}
\text{Disability Infrastructure Score (DIS)} = \text{Has ramps} + \text{Has handrails} + \text{Has special bathrooms} \in \{0,1,2,3\}
\end{equation}

- **Def**: Counts how many types of disability infrastructure (ramps, handrails, special bathrooms) are present in the facilities.

In [388]:
list(df.iloc[:, 45:49].columns)

['Dissability accomodations: ramp.',
 'Dissability accomodations: handrails.',
 'Dissability accomodations: special bathrooms.',
 'Dissability accomodations: none.']

In [389]:
new_df.drop(['Dissability accomodations: ramp.',
 'Dissability accomodations: handrails.',
 'Dissability accomodations: special bathrooms.',
 'Dissability accomodations: none.'] , inplace=True, axis=1)

In [390]:
df.iloc[:, 45:49] = df.iloc[:, 45:49].map(lambda x: float(1.0) if x == 'Sí' else float(0.0))

In [391]:
new_df['DIS'] = df.apply(lambda x: 0.0 if x.iloc[48] == 0.0 else x.iloc[45] + x.iloc[46] + x.iloc[47], axis=1)

In [392]:
np.unique(new_df['DIS'])

array([0., 1., 2., 3.])

\begin{align}
\text{Total classrooms (TC)} &= \text{total classrooms used for teaching} \\
&\quad + \text{total classrooms used for purposes other than teaching} \\
&\quad + \text{total classrooms not used} \\
&\quad + \text{total classrooms used for computer labs} \\
&\quad + \text{total classrooms used for temporary classrooms} \\
&\quad + \text{total classrooms used for other spaces} \in \mathbb{R}
\end{align}

- **Def**: Total number of classrooms in a school.
- **Assumptions**: We assume there are no classroom types other than the categories above.

\begin{equation}
\text{Teaching Classroom Rate (TCR)} = \frac{\text{total classrooms used for teaching}}{\text{total classrooms}} \in [0,1]
\end{equation}

- **Def**: The fraction of all classrooms that are used for teaching.

\begin{equation}
\text{Unused Classroom Rate (UCR)} = \frac{\text{total unused classrooms}}{\text{total classrooms}} \in [0,1]
\end{equation}

- **Def**: The fraction of all classrooms that are not used.

In [393]:
list(df.iloc[:, 39:45].columns)

['Amount of classrooms: for teaching.',
 'Amount of classrooms: for purposes other than teaching.',
 'Amount of classrooms: not used.',
 'Amount of classrooms: computer labs.',
 'Amount of classrooms: temporary classrooms.',
 'Amount of classrooms: other spaces.']

In [394]:
new_df.drop(['Amount of classrooms: for teaching.',
 'Amount of classrooms: for purposes other than teaching.',
 'Amount of classrooms: not used.',
 'Amount of classrooms: computer labs.',
 'Amount of classrooms: temporary classrooms.',
 'Amount of classrooms: other spaces.'], inplace=True, axis=1)

In [395]:
def compute_classrooms(row):
    TC = row.iloc[39] + row.iloc[40] + row.iloc[41] + row.iloc[42] + row.iloc[43] + row.iloc[44]
    TCR = 0.0 if TC == 0 else row.iloc[39] / TC
    UCR = 0.0 if TC == 0 else row.iloc[41] / TC
    
    return pd.Series([TC, TCR, UCR])
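
A vectorized sketch of the same computation on toy counts follows. It uses three hypothetical short column names instead of the six classroom categories, and the numbers are invented. Note one difference from the row-wise function: `sum(axis=1)` treats a missing count as 0, whereas adding the columns with `+` propagates `NaN`.

```python
import pandas as pd

# Toy classroom counts (hypothetical column names, invented values)
rooms = pd.DataFrame({
    "teaching":  [19, 12, 0],
    "other_use": [0, 2, 0],
    "not_used":  [0, 1, 0],
})

# Total classrooms per school
tc = rooms.sum(axis=1)

# Rates default to 0 when a school reports no classrooms
tcr = (rooms["teaching"] / tc).where(tc > 0, 0.0)
ucr = (rooms["not_used"] / tc).where(tc > 0, 0.0)

print(pd.DataFrame({"TC": tc, "TCR": tcr, "UCR": ucr}))
```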

In [396]:
new_df[['TC', 'TCR', 'UCR']] = df.apply(compute_classrooms, axis=1)

In [397]:
new_df[['TC', 'TCR', 'UCR']].head()

|   | TC | TCR | UCR |
|---|----|-----|-----|
| 0 | 19.0 | 1.0 | 0.0 |
| 1 | 14.0 | 0.857143 | 0.0 |
| 2 | 43.0 | 0.44186 | 0.023256 |
| 3 | 12.0 | 0.333333 | 0.0 |
| 4 | 4.0 | 0.75 | 0.0 |


\begin{equation}
\text{Computer Class Rate (CCR)} = \frac{\text{total students receiving computer classes}}{\frac{\text{total students enrolled at beginning of year} + \text{total students enrolled at end of year}}{2}} \in \mathbb{R}
\end{equation}

- **Def**: The ratio of students receiving computer classes to students enrolled at the school.
- **Assumptions**: None. Note that CCR may exceed 1 because the denominator is the average of enrollment at the start and at the end of the year. We use the average because we do not know whether the count of students receiving computer classes was taken at the start or at the end of the year.
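
A short worked example shows how the rate can exceed 1. The function name and the enrollment figures below are hypothetical and chosen only for illustration: if 44 students receive computer classes while enrollment falls from 46 to 40 during the year, the average enrollment is 43 and the rate is 44 / 43.

```python
def computer_class_rate(receiving, enrolled_start, enrolled_end):
    """CCR: students receiving computer classes over average enrollment."""
    avg_enrollment = (enrolled_start + enrolled_end) / 2
    # A school with no enrolled students gets a rate of 0
    if avg_enrollment == 0:
        return 0.0
    return receiving / avg_enrollment

# Hypothetical school whose enrollment shrinks during the year
ccr = computer_class_rate(receiving=44, enrolled_start=46, enrolled_end=40)
print(round(ccr, 6))
```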

In [398]:
list(df.iloc[:, 49:51].columns)

['Does the school have computers for student use?',
 'Amount of students receiving computer classes.']

In [399]:
new_df.drop(['Does the school have computers for student use?',
 'Amount of students receiving computer classes.'], inplace=True, axis=1)

In [400]:
# Make columns numeric
df.iloc[:, 50] = pd.to_numeric(df.iloc[:, 50], errors='coerce')
df['Total enrolled students at the end of the year'] = pd.to_numeric(df['Total enrolled students at the end of the year'], errors='coerce')
df['Total initially enrolled students at the beginning of the year'] = pd.to_numeric(df['Total initially enrolled students at the beginning of the year'], errors='coerce')

In [401]:
def compute_computers(row):
    CCR = 0.0
    avg_enrollment = (row['Total initially enrolled students at the beginning of the year'] + row['Total enrolled students at the end of the year']) / 2

    if row.iloc[49] == 'No' or avg_enrollment == 0.0:
        CCR = 0.0
    else:
        CCR = row.iloc[50] / avg_enrollment

    return CCR

In [402]:
new_df['CCR'] = df.apply(compute_computers, axis=1)

In [403]:
# Convert nan to 0
new_df['CCR'] = new_df['CCR'].fillna(0)

In [404]:
new_df['CCR'].head()

0    0.000000
1    0.000000
2    0.320484
3    0.956522
4    1.023256
Name: CCR, dtype: float64

\begin{equation}
\text{Good condition technology rate (GCTR)} = \frac{\text{total tech devices in good condition}}{\text{total tech devices in good condition} + \text{total tech devices in bad condition}} \in [0,1]
\end{equation}

- **Def**: Out of all tech devices owned, how many are in good condition. A rate of 1 means that all tech devices are in good condition, which is better.
- **Assumptions**: We assume the denominator represents all tech devices owned by the school.

\begin{equation}
\text{Tech device usage rate (TDUR)} = \frac{\text{total tech devices used by students}}{\text{total tech devices in good condition} + \text{total tech devices in bad condition} + \text{total rented tech devices}} \in [0,1]
\end{equation}

- **Def**: Out of all tech devices possessed, how many are used by students. A rate of 1 means that all tech devices are used by students, which is better.
- **Assumptions**: We assume the denominator represents all tech devices possessed by the school.

In [405]:
list(df.iloc[:, 52:108].columns)

['Desktop computers: good condition.',
 'Desktop computers: bad condition.',
 'Desktop computers: rented.',
 'Desktop computers: used by students.',
 'Laptops: good condition.',
 'Laptops: bad condition.',
 'Laptops: rented.',
 'Laptops: used by students.',
 'Printers: good condition.',
 'Printers: bad condition.',
 'Printers: rented.',
 'Printers: used by students.',
 'Scanner: good condition.',
 'Scanner: bad condition.',
 'Scanner: rented.',
 'Scanner: used by students.',
 'Projector: good condition.',
 'Projector: bad condition.',
 'Projector: rented.',
 'Projector: used by students.',
 'Television: good condition.',
 'Television: bad condition.',
 'Television: rented.',
 'Television: used by students.',
 'Recorder: good condition.',
 'Recorder: bad condition.',
 'Recorder: rented.',
 'Recorder: used by students.',
 'DVD: good condition.',
 'DVD: bad condition.',
 'DV: rented.',
 'DVD: used by students.',
 'Microphone: good condition.',
 'Microphone: bad condition.',
 'Microphone: 

In [406]:
new_df.drop(
    ['Desktop computers: good condition.',
 'Desktop computers: bad condition.',
 'Desktop computers: rented.',
 'Desktop computers: used by students.',
 'Laptops: good condition.',
 'Laptops: bad condition.',
 'Laptops: rented.',
 'Laptops: used by students.',
 'Printers: good condition.',
 'Printers: bad condition.',
 'Printers: rented.',
 'Printers: used by students.',
 'Scanner: good condition.',
 'Scanner: bad condition.',
 'Scanner: rented.',
 'Scanner: used by students.',
 'Projector: good condition.',
 'Projector: bad condition.',
 'Projector: rented.',
 'Projector: used by students.',
 'Television: good condition.',
 'Television: bad condition.',
 'Television: rented.',
 'Television: used by students.',
 'Recorder: good condition.',
 'Recorder: bad condition.',
 'Recorder: rented.',
 'Recorder: used by students.',
 'DVD: good condition.',
 'DVD: bad condition.',
 'DV: rented.',
 'DVD: used by students.',
 'Microphone: good condition.',
 'Microphone: bad condition.',
 'Microphone: rented.',
 'Microphone: used by students.',
 'Speakers: good condition.',
 'Speakers: bad condition.',
 'Speakers: rented.',
 'Speakers: used by students.',
 'Camera: good condition.',
 'Camera: bad condition.',
 'Camera: rented.',
 'Camera: used by students.',
 'Video camera: good condition.',
 'Video camera: bad condition.',
 'Video camera: rented.',
 'Video camera: used by students.',
 'Web camera: good condition.',
 'Web camera: bad condition.',
 'Web camera: rented.',
 'Web camera: used by students.',
 'Photocopier: good condition.',
 'Photocopier: bad condition.',
 'Photocopier: rented.',
 'Photocopier: used by students.'], inplace=True, axis=1)

In [407]:
def compute_gctr_tdur(row):
    total_good = 0
    total_bad = 0
    total_rented = 0
    total_student = 0

    for j in range(52, 108, 4):
        if not np.isnan(row.iloc[j]):
            total_good += row.iloc[j]
        if not np.isnan(row.iloc[j+1]):
            total_bad += row.iloc[j+1]
        if not np.isnan(row.iloc[j+2]):
            total_rented += row.iloc[j+2]
        if not np.isnan(row.iloc[j+3]):
            total_student += row.iloc[j+3]
    
    good_bad = total_good + total_bad
    all_td = (total_good + total_bad + total_rented)

    gctr = 0 if good_bad == 0 else total_good / good_bad
    tdur = 0 if all_td == 0 else total_student / all_td
    
    return pd.Series([gctr, tdur]) 

In [408]:
new_df[['GCTR', 'TDUR']] = df.apply(compute_gctr_tdur, axis=1)

In [409]:
new_df[['GCTR', 'TDUR']].head()

|   | GCTR | TDUR |
|---|------|------|
| 0 | 0.641975 | 0.234568 |
| 1 | 0.76087 | 0.152174 |
| 2 | 0.956522 | 0.826087 |
| 3 | 0.95 | 0.0 |
| 4 | 0.545455 | 0.545455 |


In [410]:
list(new_df.columns)

['School ID',
 'School Name',
 'School District',
 'Sector',
 'If private, does it receive subsidy?',
 'Administrative Body',
 'Zone',
 'Department Code',
 'Department Name',
 'Municipality Code',
 'Municipality Name',
 'Canton Name',
 'Hamlet Name',
 'Address',
 'Does the school belong to an indigenous community?',
 'Who provided the data?',
 'Do you own your facilities?',
 'Water source: internal pipeline.',
 'Water source: river, lake, spring.',
 'Water source: rainwater.',
 'Water source: public sink.',
 'Water source: well.',
 'Water source: pipe.',
 'Does the school have electrical installations?',
 'If it posseses electrical installations, do they work?',
 'Does the school have sanitary services?',
 'If it posseses sanitary services, are they separated by gender?',
 'Does the school have internet service?',
 'Does the school have a library?',
 'Does the school have a computer center?',
 'Does the school have a science lab?',
 'Does the school have an educational support classroo

# __Cleaning features__

Encode yes/no answers ('Sí'/'No') as Bernoulli variables (1.0/0.0).

In [411]:
def bernoulli_to_numeric(x):
    if x == 'Sí':
        return 1.0
    elif x == 'No':
        return 0.0
    else:
        return x

In [412]:
new_df['If private, does it receive subsidy?'] = new_df['If private, does it receive subsidy?'].map(bernoulli_to_numeric)
new_df['Do you own your facilities?'] = new_df['Do you own your facilities?'].map(bernoulli_to_numeric)
new_df['Water source: internal pipeline.'] = new_df['Water source: internal pipeline.'].map(bernoulli_to_numeric)
new_df['Water source: pipe.'] = new_df['Water source: pipe.'].map(bernoulli_to_numeric)
new_df['Water source: public sink.'] = new_df['Water source: public sink.'].map(bernoulli_to_numeric)
new_df['Water source: rainwater.'] = new_df['Water source: rainwater.'].map(bernoulli_to_numeric)
new_df['Water source: river, lake, spring.'] = new_df['Water source: river, lake, spring.'].map(bernoulli_to_numeric)
new_df['Water source: well.'] = new_df['Water source: well.'].map(bernoulli_to_numeric)
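new_df['If it posseses electrical installations, do they work?'] = new_df['If it posseses electrical installations, do they work?'].map(bernoulli_to_numeric)
new_df['If it posseses sanitary services, are they separated by gender?'] = new_df['If it posseses sanitary services, are they separated by gender?'].map(bernoulli_to_numeric)

# Encode every remaining yes/no question column, selected by keyword
for column in new_df.columns:
    if 'Does the school have' in column or 'Did you' in column or 'Pregnancy prevention' in column:
        new_df[column] = new_df[column].map(bernoulli_to_numeric)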
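
Selecting columns by keyword can be sketched on a toy frame as below (the answers are invented). Each keyword needs its own `in` test, because a bare non-empty string in a condition is always truthy and would select every column. Using `any` over a tuple of keywords keeps that explicit.

```python
import pandas as pd

def bernoulli_to_numeric(x):
    """Map 'Sí' to 1.0 and 'No' to 0.0; leave anything else unchanged."""
    if x == 'Sí':
        return 1.0
    elif x == 'No':
        return 0.0
    return x

# Toy frame: one yes/no question and one free-text column
answers = pd.DataFrame({
    "Does the school have a library?": ["Sí", "No"],
    "School Name": ["A", "B"],
})

keywords = ("Does the school have", "Did you", "Pregnancy prevention")

# Keep only the columns whose name contains at least one keyword
yes_no_cols = [c for c in answers.columns if any(k in c for k in keywords)]

for column in yes_no_cols:
    answers[column] = answers[column].map(bernoulli_to_numeric)

print(yes_no_cols)
print(answers)
```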

Convert numbers to floats

In [413]:
# Convert columns 55 and above to floats
# If there is a comma in a cell, remove it
for column in new_df.columns[55:]:
    new_df[column] = new_df[column].replace(',', '', regex=True)
    new_df[column] = pd.to_numeric(new_df[column], errors='coerce')

Drop unnecessary columns

In [414]:
new_df.drop(
    ['Total municipal homicides', 'Total departmental homicides'],
    inplace=True, axis=1)

Convert ints to floats

In [415]:
# Go through each column. If majority type is int, convert to float
for column in new_df.columns:
    majority_type = new_df[column].apply(lambda x: type(x)).value_counts().idxmax()
    if majority_type == int:
        new_df[column] = new_df[column].astype(float)

Take the absolute value of all float columns

In [416]:
# Go through each column; if it is a float column, take the absolute value
for column in new_df.columns:
    if new_df[column].dtype == float:
        new_df[column] = new_df[column].abs()

In [417]:
list(new_df.columns)

['School ID',
 'School Name',
 'School District',
 'Sector',
 'If private, does it receive subsidy?',
 'Administrative Body',
 'Zone',
 'Department Code',
 'Department Name',
 'Municipality Code',
 'Municipality Name',
 'Canton Name',
 'Hamlet Name',
 'Address',
 'Does the school belong to an indigenous community?',
 'Who provided the data?',
 'Do you own your facilities?',
 'Water source: internal pipeline.',
 'Water source: river, lake, spring.',
 'Water source: rainwater.',
 'Water source: public sink.',
 'Water source: well.',
 'Water source: pipe.',
 'Does the school have electrical installations?',
 'If it posseses electrical installations, do they work?',
 'Does the school have sanitary services?',
 'If it posseses sanitary services, are they separated by gender?',
 'Does the school have internet service?',
 'Does the school have a library?',
 'Does the school have a computer center?',
 'Does the school have a science lab?',
 'Does the school have an educational support classroo

In [418]:
new_df['Dropout rate'] = new_df['Total dropouts'] / new_df['Total initially enrolled students at the beginning of the year']
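
A plain division like the one above yields `NaN` or `inf` for a school with zero initial enrollment. The sketch below, on invented numbers, shows one possible guard: masking zero denominators to `NaN` before dividing. This is an optional variant and an assumption on my part, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: the last two schools report zero initial enrollment
schools = pd.DataFrame({
    "Total dropouts": [5.0, 0.0, 2.0],
    "Total initially enrolled students at the beginning of the year": [100.0, 0.0, 0.0],
})

enrolled = schools["Total initially enrolled students at the beginning of the year"]

# Mask zero enrollment so the result is NaN rather than inf
rate = schools["Total dropouts"] / enrolled.replace(0, np.nan)

print(rate.tolist())
```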

In [419]:
new_df.drop('Total dropouts', inplace=True, axis=1)

In [420]:
new_df['Dropout rate'].head()

0    0.049142
1    0.020599
2    0.093660
3    0.053763
4    0.045455
Name: Dropout rate, dtype: float64

# __Final Clean Data Shape__

In [421]:
new_df.shape

(6025, 83)

In [422]:
# Save new_df to a CSV file in the processed data folder
new_df.to_csv('../data/processed/data.csv', index=False)