# I. Project Team Members

| Prepared by | Email | Prepared for |
| :-: | :-: | :-: |
| **Hardefa Rogonondo** | hardefarogonondo@gmail.com | **Erasmus Scholarship Grant Prediction Engine** |

# II. Notebook Target Definition

This notebook covers the data preparation phase of the Erasmus Scholarship Grant Prediction Engine Project. We load the raw data acquired from Kaggle and inspect its shape, its column types and non-null counts, and its column definitions. We then separate the predictor features from the target label (`GRANT`). Finally, the prepared datasets are exported to pickle files for subsequent use in the prediction engine.

# III. Notebook Setup

## III.A. Import Libraries

In [1]:
import pandas as pd
import pickle

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## III.B. Import Data

In [2]:
df = pd.read_csv('../../data/raw/erasmus.csv')
df.head()

| | INDEX | COUNTRIES | UNIVERSITIES | FACULTIES | DEPARTMENTS | EXAM SCORE | GRANT |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 1 | ITALIA | UNIVERSITA DEGLI STUDI DI ROMA LA SAPIENZA | FACULTY OF ARTS AND SCIENCES | ENGLISH LANGUAGE AND LITERATURE | 98.5 | 1 |
| 1 | 2 | ITALIA | ALMA MATER STUDIORUM - UNIVERSITA DI BOLOGNA | FACULTY OF ARTS AND SCIENCES | SOCIOLOGY | 97.1 | 1 |
| 2 | 3 | GERMAN | UNIVERSITAET BIELEFELD | FACULTY OF ARTS AND SCIENCES | PSYCHOLOGY | 96.8 | 1 |
| 3 | 4 | GERMAN | HOCHSCHULE FUR ANGEWANDTE WISSENSCHAFTEN HAMBURG | FACULTY OF HEALTH SCIENCES | NUTRITION AND DIETETICS | 96.5 | 1 |
| 4 | 5 | ITALIA | UNIVERSITA DEGLI STUDI DI ROMA LA SAPIENZA | FACULTY OF ARTS AND SCIENCES | ENGLISH LANGUAGE AND LITERATURE | 96.32 | 1 |


# IV. Data Preparation

## IV.A. Data Shape Inspection

In [3]:
df.shape

(341, 7)

## IV.B. Data Information Inspection

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 341 entries, 0 to 340
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   INDEX         341 non-null    int64  
 1   COUNTRIES     341 non-null    object 
 2   UNIVERSITIES  341 non-null    object 
 3   FACULTIES     341 non-null    object 
 4   DEPARTMENTS   339 non-null    object 
 5   EXAM SCORE    341 non-null    float64
 6   GRANT         341 non-null    int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 18.8+ KB
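
The `df.info()` output reports 339 non-null values for `DEPARTMENTS` against 341 rows, so two records have no department. A minimal sketch of how those gaps could be located is shown here. It runs on a small toy frame (the values are illustrative and are not taken from `erasmus.csv`); in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "DEPARTMENTS": ["SOCIOLOGY", None, "PSYCHOLOGY", None],
    "EXAM SCORE": [97.1, 88.0, 96.8, 75.5],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Select the affected rows so they can be reviewed before modelling.
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```

How the missing departments should be handled (imputation, a placeholder category, or row removal) is left to the later phases of the project.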


## IV.C. Data Definition

| Variable | Column Definition |
| :-: | :-: |
| INDEX | Unique index number of each record. |
| COUNTRIES | Countries to be attended under the Erasmus programme. |
| UNIVERSITIES | Universities to be attended under the Erasmus programme. |
| FACULTIES | Faculties where the students are enrolled. |
| DEPARTMENTS | Departments where the students are enrolled. |
| EXAM SCORE | Students' Erasmus exam scores. |
| GRANT | Column indicating whether the students received a grant (1: received, 0: not received). |
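
The definitions in this table can also be expressed as a lightweight check. The sketch below is one possible way to do it, not part of the original notebook: the helper name `check_definition` and the toy frames are hypothetical, and only three rules are tested (expected columns present, `GRANT` restricted to 0 and 1, `INDEX` unique).

```python
import pandas as pd

# Column names taken from the data definition table.
EXPECTED_COLUMNS = [
    "INDEX", "COUNTRIES", "UNIVERSITIES", "FACULTIES",
    "DEPARTMENTS", "EXAM SCORE", "GRANT",
]


def check_definition(frame):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "GRANT" in frame.columns and not frame["GRANT"].isin([0, 1]).all():
        problems.append("GRANT contains values other than 0 and 1")
    if "INDEX" in frame.columns and not frame["INDEX"].is_unique:
        problems.append("INDEX is not unique")
    return problems


# Toy frames with illustrative values only.
good = pd.DataFrame({col: ["x", "y"] for col in EXPECTED_COLUMNS})
good["INDEX"] = [1, 2]
good["GRANT"] = [1, 0]
bad = good.assign(GRANT=[1, 2], INDEX=[1, 1])

print(check_definition(good))
print(check_definition(bad))
```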

## IV.D. Data Segregation

In [5]:
X = df.drop("GRANT", axis=1)
y = df["GRANT"]
X.shape, y.shape

((341, 6), (341,))

In [6]:
X.head()

| | INDEX | COUNTRIES | UNIVERSITIES | FACULTIES | DEPARTMENTS | EXAM SCORE |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 1 | ITALIA | UNIVERSITA DEGLI STUDI DI ROMA LA SAPIENZA | FACULTY OF ARTS AND SCIENCES | ENGLISH LANGUAGE AND LITERATURE | 98.5 |
| 1 | 2 | ITALIA | ALMA MATER STUDIORUM - UNIVERSITA DI BOLOGNA | FACULTY OF ARTS AND SCIENCES | SOCIOLOGY | 97.1 |
| 2 | 3 | GERMAN | UNIVERSITAET BIELEFELD | FACULTY OF ARTS AND SCIENCES | PSYCHOLOGY | 96.8 |
| 3 | 4 | GERMAN | HOCHSCHULE FUR ANGEWANDTE WISSENSCHAFTEN HAMBURG | FACULTY OF HEALTH SCIENCES | NUTRITION AND DIETETICS | 96.5 |
| 4 | 5 | ITALIA | UNIVERSITA DEGLI STUDI DI ROMA LA SAPIENZA | FACULTY OF ARTS AND SCIENCES | ENGLISH LANGUAGE AND LITERATURE | 96.32 |


In [7]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: GRANT, dtype: int64
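
The first five labels are all 1, so this preview alone does not reveal how the two classes are distributed across the 341 records. A small sketch of a class balance check follows. The series `y_toy` is a hypothetical stand-in with illustrative values; in the notebook the same calls would be made on `y`.

```python
import pandas as pd

# Toy label vector; values are illustrative, not the real GRANT column.
y_toy = pd.Series([1, 1, 1, 0, 0], name="GRANT")

# Absolute and relative frequency of each class.
class_counts = y_toy.value_counts()
class_share = y_toy.value_counts(normalize=True)
print(class_counts)
print(class_share)
```

A strongly skewed share would be worth noting before any train/test split is made later in the project.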

## IV.E. Export Data

In [8]:
X.to_pickle('../../data/processed/X.pkl')
y.to_pickle('../../data/processed/y.pkl')
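
Note that `to_pickle` does not create missing directories, so `../../data/processed/` must already exist when this cell runs. As a sanity check, the exported files can be read back with `pd.read_pickle` and compared with the originals. The sketch below assumes nothing about the project layout: it uses toy objects (`X_toy`, `y_toy`, illustrative values) and a temporary directory instead of the real paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy features and labels; values are illustrative only.
X_toy = pd.DataFrame({"INDEX": [1, 2], "EXAM SCORE": [98.5, 97.1]})
y_toy = pd.Series([1, 0], name="GRANT")

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the output directory first; to_pickle will not do it.
    out_dir = Path(tmp_dir) / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)

    X_toy.to_pickle(out_dir / "X.pkl")
    y_toy.to_pickle(out_dir / "y.pkl")

    # Read the files back while the temporary directory still exists.
    X_loaded = pd.read_pickle(out_dir / "X.pkl")
    y_loaded = pd.read_pickle(out_dir / "y.pkl")

# Both checks raise AssertionError if the round trip changed anything.
pd.testing.assert_frame_equal(X_toy, X_loaded)
pd.testing.assert_series_equal(y_toy, y_loaded)
```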