# DAAD Data Cleaning

In this notebook we clean the data set collected from the DAAD website and get it ready for later analysis.

## Imports


In [1]:
import warnings
import re
import pandas as pd
warnings.filterwarnings('ignore')
#pd.set_option("display.max_rows", None, "display.max_columns", None)

In [2]:
# JSON load/save helpers
# Load Data

import json
def load_data(title):
    with open(title, encoding='utf-8') as f:
        return json.load(f)


## Pre-Analysis

In [3]:
# Create DF
courses = load_data('./Database/daad_courses_final_1.json')

df = pd.DataFrame(courses)


In [4]:
df.describe(include='all')


Unnamed: 0,Couse ID,Program,University,Degree,Teaching language,Languages,Programme duration,Beginning,More information on beginning of studies,Application deadline,...,Phase(s) of compulsory attendance,Accommodation options,Required technical equipment,Description of other e-learning elements,"Number of course units, where relevant",Faculties,Financial support,Additional information on support programmes,List of doctoral programmes,Course language
count,2047.0,2047,2047,1705,2002,1729,1705,1705,782,1638,...,6,3,24,10,9,21,21,6,18,1
unique,,1961,289,773,37,986,8,5,529,1142,...,2,2,16,7,8,21,2,6,18,1
top,,Master of Science in Economics Master of Scien...,University of Göttingen • Göttingen,Master of Science,English,English,4 semesters,Winter semester,October,15 July for the following winter semester,...,Yes,The majority of our students live in residence...,Internet connection Personal computer or lapto...,"E-learning platform, virtual classroom, voice ...",Up to 240 hours,"Faculty 1 Mathematics, Computer Science, Physi...",Yes,Specific support for international candidates:...,The main programme at the GSLS is the PhD / Dr...,Courses are held in English.
freq,,10,71,320,1523,103,993,1015,47,44,...,5,2,5,4,2,1,17,1,1,1
mean,5036.176356,,,,,,,,,,...,,,,,,,,,,
std,1062.226761,,,,,,,,,,...,,,,,,,,,,
min,3589.0,,,,,,,,,,...,,,,,,,,,,
25%,4160.5,,,,,,,,,,...,,,,,,,,,,
50%,4772.0,,,,,,,,,,...,,,,,,,,,,
75%,5696.5,,,,,,,,,,...,,,,,,,,,,


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2047 entries, 0 to 2046
Columns: 104 entries, Couse ID to Course language
dtypes: int64(1), object(103)
memory usage: 1.6+ MB


There are 104 features, so we need to select the ones most relevant to us.

In [6]:
col_features = ['Couse ID','University', 'Program', 'Degree', 'Teaching language','Languages','Programme duration','Beginning','Application deadline',
               'Tuition fees per semester in EUR','Description/content','Types of assessment', 'Integrated internships','Course-specific, integrated German language courses','Semester contribution',
               'Academic admission requirements','Language requirements','Submit application to','Accommodation','Possibility of finding part-time employment']

In [7]:
# The new data frame is called programmes
programmes = df[col_features].copy()

In [8]:
programmes.head()

Unnamed: 0,Couse ID,University,Program,Degree,Teaching language,Languages,Programme duration,Beginning,Application deadline,Tuition fees per semester in EUR,Description/content,Types of assessment,Integrated internships,"Course-specific, integrated German language courses",Semester contribution,Academic admission requirements,Language requirements,Submit application to,Accommodation,Possibility of finding part-time employment
0,4000,Friedrich Schiller University Jena • Jena,Master of Science in Molecular Medicine MSc Mo...,Master of Science in Molecular Medicine,English,Courses are held in English.,4 semesters,Winter semester,"31 May for the following winter semester, appl...",,The MSc course in Molecular Medicine conveys b...,Each of the modules of the Master's programme ...,Not systematically provided,Yes,Semester fee (student services and student sel...,Bachelor's degree in biochemistry/molecular bi...,Non-native English speakers must prove profici...,https://www.uni-jena.de/en/studies/study+progr...,Accommodation in student residences is availab...,"Generally possible, depending on visa requirem..."
1,4001,University of Augsburg • Augsburg,Master of Science in Software Engineering Mast...,Master of Science in Software Engineering,English,The courses are in English unless all students...,4 semesters,Winter semester,"1 May, for all applicants",,Our programme offers basic and advanced lectur...,,An internship of eight weeks is to be complete...,No,115 EUR per semester,Bachelor's degree (or equivalent) in computer ...,Fluency in written and spoken English is expec...,Application website,Students can apply for a dorm room at the Stud...,
2,4002,Philipps-Universität Marburg • Marburg,German as a Foreign Language (MA) German as a ...,Master of Arts,German,German in courses and for the Master's thesis,4 semesters,Winter semester,15 July for the following winter semester,,"Didactics of German language, literature, and ...",,A teaching module is an integral part of the p...,No,The university charges a registration fee of 5...,Bachelor's degree in German with 12 ECTS in Ge...,A very good command of the German language is ...,Philipps-Universität Marburg c/o uni-assist e....,The market situation for accommodation is not ...,"Within certain legal limits, job opportunities..."
3,4003,University of Stuttgart • Stuttgart,Graduate School Simulation Technology (GS SimT...,PhD,German English,"English about 60%, German about 40%",6 semesters,Only for doctoral programmes: any time,No application deadline,Varied,"In the 21st century's societies, simulation te...",,The educational programme does not include a m...,No,Approx. 180 EUR per semester,"A university degree is mandatory. (A ""Diplom"" ...",English is mandatory. German is an advantage.,SC SimTech Pfaffenwaldring 5a 70569 Stuttgart ...,Both the campus in Stuttgart-Vaihingen and the...,Please be aware that it may be very challengin...
4,4004,University of Cologne • Köln,Bonn-Cologne Graduate School of Physics and As...,Doctoral Degree in Physics,English,The entire programme is taught in English.,6 semesters,Only for doctoral programmes: any time,No deadlines,,The Bonn-Cologne Graduate School of Physics an...,"Interim reports, thesis defence",,No,The semester contribution amounts to approx. 2...,"German ""Diplom"" or Master of Science in physic...","TOEFL 80 points Internet-based or equivalent, ...",http://www.gradschool.physics.uni-koeln.de/App...,Accommodation is available through the student...,


In [9]:
programmes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2047 entries, 0 to 2046
Data columns (total 20 columns):
 #   Column                                               Non-Null Count  Dtype 
---  ------                                               --------------  ----- 
 0   Couse ID                                             2047 non-null   int64 
 1   University                                           2047 non-null   object
 2   Program                                              2047 non-null   object
 3   Degree                                               1705 non-null   object
 4   Teaching language                                    2002 non-null   object
 5   Languages                                            1729 non-null   object
 6   Programme duration                                   1705 non-null   object
 7   Beginning                                            1705 non-null   object
 8   Application deadline                                 1638 non-null   object
 9

The table above shows 20 columns: the course ID plus 19 features. Most of them are strings (object dtype) with a large number of non-null values. We want to convert some of them to numbers, such as the teaching language, the programme duration, the tuition fees, and the semester contribution.

In [10]:
programmes

Unnamed: 0,Couse ID,University,Program,Degree,Teaching language,Languages,Programme duration,Beginning,Application deadline,Tuition fees per semester in EUR,Description/content,Types of assessment,Integrated internships,"Course-specific, integrated German language courses",Semester contribution,Academic admission requirements,Language requirements,Submit application to,Accommodation,Possibility of finding part-time employment
0,4000,Friedrich Schiller University Jena • Jena,Master of Science in Molecular Medicine MSc Mo...,Master of Science in Molecular Medicine,English,Courses are held in English.,4 semesters,Winter semester,"31 May for the following winter semester, appl...",,The MSc course in Molecular Medicine conveys b...,Each of the modules of the Master's programme ...,Not systematically provided,Yes,Semester fee (student services and student sel...,Bachelor's degree in biochemistry/molecular bi...,Non-native English speakers must prove profici...,https://www.uni-jena.de/en/studies/study+progr...,Accommodation in student residences is availab...,"Generally possible, depending on visa requirem..."
1,4001,University of Augsburg • Augsburg,Master of Science in Software Engineering Mast...,Master of Science in Software Engineering,English,The courses are in English unless all students...,4 semesters,Winter semester,"1 May, for all applicants",,Our programme offers basic and advanced lectur...,,An internship of eight weeks is to be complete...,No,115 EUR per semester,Bachelor's degree (or equivalent) in computer ...,Fluency in written and spoken English is expec...,Application website,Students can apply for a dorm room at the Stud...,
2,4002,Philipps-Universität Marburg • Marburg,German as a Foreign Language (MA) German as a ...,Master of Arts,German,German in courses and for the Master's thesis,4 semesters,Winter semester,15 July for the following winter semester,,"Didactics of German language, literature, and ...",,A teaching module is an integral part of the p...,No,The university charges a registration fee of 5...,Bachelor's degree in German with 12 ECTS in Ge...,A very good command of the German language is ...,Philipps-Universität Marburg c/o uni-assist e....,The market situation for accommodation is not ...,"Within certain legal limits, job opportunities..."
3,4003,University of Stuttgart • Stuttgart,Graduate School Simulation Technology (GS SimT...,PhD,German English,"English about 60%, German about 40%",6 semesters,Only for doctoral programmes: any time,No application deadline,Varied,"In the 21st century's societies, simulation te...",,The educational programme does not include a m...,No,Approx. 180 EUR per semester,"A university degree is mandatory. (A ""Diplom"" ...",English is mandatory. German is an advantage.,SC SimTech Pfaffenwaldring 5a 70569 Stuttgart ...,Both the campus in Stuttgart-Vaihingen and the...,Please be aware that it may be very challengin...
4,4004,University of Cologne • Köln,Bonn-Cologne Graduate School of Physics and As...,Doctoral Degree in Physics,English,The entire programme is taught in English.,6 semesters,Only for doctoral programmes: any time,No deadlines,,The Bonn-Cologne Graduate School of Physics an...,"Interim reports, thesis defence",,No,The semester contribution amounts to approx. 2...,"German ""Diplom"" or Master of Science in physic...","TOEFL 80 points Internet-based or equivalent, ...",http://www.gradschool.physics.uni-koeln.de/App...,Accommodation is available through the student...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2042,3993,University of Applied Sciences Ravensburg-Wein...,Electrical Engineering and Information Technol...,Bachelor of Engineering (BEng),German English,"English (semesters 1 to 4), German (semesters ...",7 semesters,Summer semester,15 November for the following summer semester,1500,The course at RWU trains engineers in the fiel...,,"In the fifth semester, an internship of 26 wee...",Yes,Approx. 170 EUR/semester,University entrance qualification for engineer...,Applicants must provide proof of their English...,Hochschule Ravensburg-Weingarten Studierenden-...,Accommodation is mainly provided and organised...,Foreign students registered at universities in...
2043,3995,Otto von Guericke University Magdeburg • Magde...,"International Management, Marketing, Entrepren...","Master of Science in International Management,...",English,Courses and examinations are held exclusively ...,4 semesters,Winter and summer semester,International applicants: 15 June for the foll...,,"International Management, Marketing, Entrepren...","For examinations, students can expect mostly w...",There are no compulsory internship requirement...,No,"Currently, the semester fee is approx. 128.50 ...",Applicants must provide proof of a Bachelor's ...,Proof of advanced English skills on the C1 lev...,www.uni-assist.de,"The ""Studentenwerk"" (student union) in Magdebu...",It is possible to find a part-time job as a re...
2044,3996,Dortmund University of Applied Sciences and Ar...,European Master's in Project Management (EuroM...,Master of Arts (MA),English,Courses are held in English (100%). Students c...,4 semesters,Winter semester,Application deadline: 15 July for the followin...,,The substantive idea for the course has been t...,Assessment is partly (50%) based on students' ...,Project managers from several companies such a...,Yes,Enrolment fees are approx. 310 EUR per semeste...,For EuroMPM-IT: The course of study is designe...,Applicants must provide proof of their English...,Only online application are allowed. Please se...,Suggestions on finding accommodations are avai...,Not offered by the university Students have go...
2045,3997,FH Aachen University of Applied Sciences • Jülich,Bachelor of Science in Applied Chemistry Bache...,Bachelor of Science in Applied Chemistry,German,The course of study is conducted in German. A ...,6 semesters,Winter semester,1 July for non-EU applicants 15 July for EU ap...,,The degree programme has a strong focus on app...,"Written/oral examinations, internship reports,...",Students are required to complete an eight-wee...,Yes,Currently approx. 320 EUR per semester,German entrance qualification for universities...,Applicants must provide proof of their German ...,https://hi.fh-aachen.de,Some students live in student dormitories. Oth...,Applicants should not count on financing their...


## Data cleaning itself

- ~~Separate University from City~~
- ~~Drop all the null records from language~~
- ~~Split English / German / Others in teaching languages~~
- ~~Convert Programme duration to int32~~
- ~~Clean the beginning and leave only the season~~
- ~~Clean the deadline~~
- ~~Make Tuition a number; if it is None or similar, put 0~~
- ~~Create a column that allows filtering by Master's / PhD / Bachelor's / Other~~
- ~~Make semester contribution a number~~
- Submit application: leave just a link (optional)
- ~~Delete types of assessment~~
- Integrated internships should be Yes or No
- ~~Rename the columns that are wrong~~
- ~~Create a total contribution~~
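One of the open items in the list above is reducing "Submit application to" to just a link. A minimal sketch of one way to approach it is shown below, assuming a simple regular expression is good enough to spot URLs. The `demo` frame, the `url_pattern` and the `Application link` column name are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy values modelled on entries seen in the "Submit application to" column.
demo = pd.DataFrame({"Submit application to": [
    "https://hi.fh-aachen.de",
    "www.uni-assist.de",
    "Application website",  # no URL in this entry
]})

# Assumed pattern: anything starting with http(s):// or www. up to the next whitespace.
url_pattern = r"(https?://\S+|www\.\S+)"

# expand=False returns a Series; rows without a match become missing values.
demo["Application link"] = demo["Submit application to"].str.extract(url_pattern, expand=False)
```

Entries that hold a postal address or an e-mail address would end up missing under this pattern, so they would need separate handling.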

### Separate University from City

In [11]:
programmes['University'].apply(lambda x: x.split('•')[0].strip()).unique()

array(['Friedrich Schiller University Jena', 'University of Augsburg',
       'Philipps-Universität Marburg', 'University of Stuttgart',
       'University of Cologne', 'Ruhr-Universität Bochum',
       'Technische Universität Berlin',
       'FH Aachen University of Applied Sciences',
       'University of Münster', 'Freie Universität Berlin',
       'University of Göttingen', 'FAU Erlangen-Nürnberg',
       'University of Passau', 'TU Bergakademie Freiberg',
       'Ludwig-Maximilians-Universität München',
       'Helmholtz Centre for Environmental Research - UFZ',
       'Leipzig University', 'Hamburg University of Applied Sciences',
       'Brandenburg University of Technology Cottbus-Senftenberg',
       'Offenburg University of Applied Sciences', 'University of Bonn',
       'Charité - Universitätsmedizin Berlin',
       'SRH University Heidelberg',
       'SRH Berlin University of Applied Sciences',
       'University of Konstanz',
       'TH Köln (University of Applied Sciences

In [12]:
programmes['City'] = programmes['University'].apply(lambda x: x.split('•')[1].strip())
programmes['University'] = programmes['University'].apply(lambda x: x.split('•')[0].strip())
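The split above indexes into the result of `split('•')`, which would raise an `IndexError` for any row without the separator. The sketch below shows a vectorised alternative on a small toy frame, assuming the same "University • City" layout; the `demo` data (including the row without a city) is made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the "University • City" pattern seen above.
demo = pd.DataFrame({"University": [
    "Friedrich Schiller University Jena • Jena",
    "University of Cologne • Köln",
    "Some Institute Without City",  # hypothetical row with no separator
]})

# Split once on the bullet; expand=True returns one column per part.
parts = demo["University"].str.split("•", n=1, expand=True)

demo["University"] = parts[0].str.strip()
demo["City"] = parts[1].str.strip()  # stays missing when there is no separator
```

Rows without a separator keep the full text as the university name and get a missing city instead of stopping the cell with an error.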

In [13]:
programmes

Unnamed: 0,Couse ID,University,Program,Degree,Teaching language,Languages,Programme duration,Beginning,Application deadline,Tuition fees per semester in EUR,...,Types of assessment,Integrated internships,"Course-specific, integrated German language courses",Semester contribution,Academic admission requirements,Language requirements,Submit application to,Accommodation,Possibility of finding part-time employment,City
0,4000,Friedrich Schiller University Jena,Master of Science in Molecular Medicine MSc Mo...,Master of Science in Molecular Medicine,English,Courses are held in English.,4 semesters,Winter semester,"31 May for the following winter semester, appl...",,...,Each of the modules of the Master's programme ...,Not systematically provided,Yes,Semester fee (student services and student sel...,Bachelor's degree in biochemistry/molecular bi...,Non-native English speakers must prove profici...,https://www.uni-jena.de/en/studies/study+progr...,Accommodation in student residences is availab...,"Generally possible, depending on visa requirem...",Jena
1,4001,University of Augsburg,Master of Science in Software Engineering Mast...,Master of Science in Software Engineering,English,The courses are in English unless all students...,4 semesters,Winter semester,"1 May, for all applicants",,...,,An internship of eight weeks is to be complete...,No,115 EUR per semester,Bachelor's degree (or equivalent) in computer ...,Fluency in written and spoken English is expec...,Application website,Students can apply for a dorm room at the Stud...,,Augsburg
2,4002,Philipps-Universität Marburg,German as a Foreign Language (MA) German as a ...,Master of Arts,German,German in courses and for the Master's thesis,4 semesters,Winter semester,15 July for the following winter semester,,...,,A teaching module is an integral part of the p...,No,The university charges a registration fee of 5...,Bachelor's degree in German with 12 ECTS in Ge...,A very good command of the German language is ...,Philipps-Universität Marburg c/o uni-assist e....,The market situation for accommodation is not ...,"Within certain legal limits, job opportunities...",Marburg
3,4003,University of Stuttgart,Graduate School Simulation Technology (GS SimT...,PhD,German English,"English about 60%, German about 40%",6 semesters,Only for doctoral programmes: any time,No application deadline,Varied,...,,The educational programme does not include a m...,No,Approx. 180 EUR per semester,"A university degree is mandatory. (A ""Diplom"" ...",English is mandatory. German is an advantage.,SC SimTech Pfaffenwaldring 5a 70569 Stuttgart ...,Both the campus in Stuttgart-Vaihingen and the...,Please be aware that it may be very challengin...,Stuttgart
4,4004,University of Cologne,Bonn-Cologne Graduate School of Physics and As...,Doctoral Degree in Physics,English,The entire programme is taught in English.,6 semesters,Only for doctoral programmes: any time,No deadlines,,...,"Interim reports, thesis defence",,No,The semester contribution amounts to approx. 2...,"German ""Diplom"" or Master of Science in physic...","TOEFL 80 points Internet-based or equivalent, ...",http://www.gradschool.physics.uni-koeln.de/App...,Accommodation is available through the student...,,Köln
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2042,3993,University of Applied Sciences Ravensburg-Wein...,Electrical Engineering and Information Technol...,Bachelor of Engineering (BEng),German English,"English (semesters 1 to 4), German (semesters ...",7 semesters,Summer semester,15 November for the following summer semester,1500,...,,"In the fifth semester, an internship of 26 wee...",Yes,Approx. 170 EUR/semester,University entrance qualification for engineer...,Applicants must provide proof of their English...,Hochschule Ravensburg-Weingarten Studierenden-...,Accommodation is mainly provided and organised...,Foreign students registered at universities in...,Weingarten
2043,3995,Otto von Guericke University Magdeburg,"International Management, Marketing, Entrepren...","Master of Science in International Management,...",English,Courses and examinations are held exclusively ...,4 semesters,Winter and summer semester,International applicants: 15 June for the foll...,,...,"For examinations, students can expect mostly w...",There are no compulsory internship requirement...,No,"Currently, the semester fee is approx. 128.50 ...",Applicants must provide proof of a Bachelor's ...,Proof of advanced English skills on the C1 lev...,www.uni-assist.de,"The ""Studentenwerk"" (student union) in Magdebu...",It is possible to find a part-time job as a re...,Magdeburg
2044,3996,Dortmund University of Applied Sciences and Arts,European Master's in Project Management (EuroM...,Master of Arts (MA),English,Courses are held in English (100%). Students c...,4 semesters,Winter semester,Application deadline: 15 July for the followin...,,...,Assessment is partly (50%) based on students' ...,Project managers from several companies such a...,Yes,Enrolment fees are approx. 310 EUR per semeste...,For EuroMPM-IT: The course of study is designe...,Applicants must provide proof of their English...,Only online application are allowed. Please se...,Suggestions on finding accommodations are avai...,Not offered by the university Students have go...,Dortmund
2045,3997,FH Aachen University of Applied Sciences,Bachelor of Science in Applied Chemistry Bache...,Bachelor of Science in Applied Chemistry,German,The course of study is conducted in German. A ...,6 semesters,Winter semester,1 July for non-EU applicants 15 July for EU ap...,,...,"Written/oral examinations, internship reports,...",Students are required to complete an eight-wee...,Yes,Currently approx. 320 EUR per semester,German entrance qualification for universities...,Applicants must provide proof of their German ...,https://hi.fh-aachen.de,Some students live in student dormitories. Oth...,Applicants should not count on financing their...,Jülich


### Drop all the null records from language

In [14]:
programmes.dropna(subset=['Teaching language'],how='any', inplace = True)

### Split English / German / Others in teaching languages

In [15]:
programmes['English'] = programmes['Teaching language'].apply(lambda x: 0 if re.search('English', x) is None else 1)
programmes['German'] = programmes['Teaching language'].apply(lambda x: 0 if re.search('German', x) is None else 1)
programmes['Other Language'] = programmes['Teaching language'].apply(lambda x: 0 if re.search('French|Spanish|Chinese|Italian|Other|Russian', x) is None else 1)
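The same 0/1 flags can also be built without `apply`. The following is a small sketch on toy data, assuming the column holds no missing values (they were dropped in the previous step); `demo` and `other_pattern` are illustrative names.

```python
import pandas as pd

# Toy values in the same style as the "Teaching language" column.
demo = pd.DataFrame({"Teaching language": [
    "English",
    "German English",
    "German English French",
    "German",
]})

# Same alternation as in the cell above, kept in one place for reuse.
other_pattern = "French|Spanish|Chinese|Italian|Other|Russian"

# str.contains treats the pattern as a regular expression by default.
demo["English"] = demo["Teaching language"].str.contains("English").astype(int)
demo["German"] = demo["Teaching language"].str.contains("German").astype(int)
demo["Other Language"] = demo["Teaching language"].str.contains(other_pattern).astype(int)
```

A programme taught in several languages gets a 1 in each matching column, which is why the column means in `describe()` can add up to more than 1.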

In [125]:
programmes.describe()

Unnamed: 0,Couse ID,English,German,Other Language
count,2002.0,2002.0,2002.0,2002.0
mean,5013.24975,0.892607,0.231768,0.043956
std,1055.59843,0.309689,0.422067,0.205049
min,3589.0,0.0,0.0,0.0
25%,4143.25,1.0,0.0,0.0
50%,4748.5,1.0,0.0,0.0
75%,5658.0,1.0,0.0,0.0
max,7169.0,1.0,1.0,1.0


### Convert Programme duration to int32

In [16]:
programmes['Programme duration'].unique()

array(['4 semesters', '6 semesters', '7 semesters', '8 semesters',
       '3 semesters', '2 semesters', 'More than 9 semesters',
       '5 semesters', nan], dtype=object)

In [17]:
programmes[programmes['Programme duration'].isnull()] # We are going to delete these programmes; they are irrelevant for our analysis

Unnamed: 0,Couse ID,University,Program,Degree,Teaching language,Languages,Programme duration,Beginning,Application deadline,Tuition fees per semester in EUR,...,Semester contribution,Academic admission requirements,Language requirements,Submit application to,Accommodation,Possibility of finding part-time employment,City,English,German,Other Language
741,4887,Catholic University of Eichstätt-Ingolstadt,Course I - Intensive German and Cultural Studi...,,German,,,,,,...,,,German A1,https://www.ku.de/en/international/internation...,,,Eichstätt,0,1,0
742,4888,Catholic University of Eichstätt-Ingolstadt,Course II – Intensive German and German Litera...,,German,,,,,,...,,,German B1+ / B2,https://www.ku.de/en/international/internation...,,,Eichstätt,0,1,0
743,4889,Kiel University,74th International Summer Course: Germany Toda...,,German,,,,,,...,,,Participants should have good basic German lan...,Andreas Ritter International Center Universitä...,,,Kiel,0,1,0
744,4890,Philipps-Universität Marburg,Hessen International Summer University Marburg...,,German English,,,,,,...,,,Good command of English to follow the orientat...,https://www.uni-marburg.de/isu,,,Marburg,1,1,0
745,4891,Universität Heidelberg,International Summer School of German Language...,,German,,,,,,...,,,Beginners to advanced level,http://www.ifk.uni-hd.de/organization/registra...,,,Heidelberg,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1637,7092,University of Passau,Academic German Year Academic German Year,,German,,,,,,...,,,Application should provide the following docum...,info-gcp@uni-passau.de or Universität Passau L...,,,Passau,0,1,0
1638,7093,TU Darmstadt,Hessen:International Summer University (ISU) –...,,German English,,,,,,...,,,Intensive German Language Courses None to B1 (...,summer@pvw.tu-darmstadt.de,,,Darmstadt,1,1,0
1639,7094,Technische Universität Ilmenau,43rd International Summer Course – DSH Course ...,,German,,,,,,...,,,Minimum upper intermediate German skills – at ...,https://www.tu-ilmenau.de/en/international /,,,Ilmenau,0,1,0
1640,7095,Freie Universität Berlin,Sustainable and Social Business Sustainable an...,,English,,,,,,...,,,"Students must be able to speak, read, and writ...",http://www.fubis.org/5_geb/anmeldung/index.html,,,Berlin,1,0,0


In [18]:
programmes[programmes['Programme duration'].isnull() & programmes['Languages'].isnull()].describe()

Unnamed: 0,Couse ID,English,German,Other Language
count,297.0,297.0,297.0,297.0
mean,5493.390572,0.494949,0.555556,0.006734
std,698.707873,0.500818,0.497743,0.081922
min,4887.0,0.0,0.0,0.0
25%,4998.0,0.0,0.0,0.0
50%,5114.0,0.0,1.0,0.0
75%,6138.0,1.0,1.0,0.0
max,7138.0,1.0,1.0,1.0


In [24]:
programmes.dropna(subset=['Programme duration'],how='any', inplace = True)

In [25]:
programmes.describe()

Unnamed: 0,Couse ID,English,German,Other Language
count,1705.0,1705.0,1705.0,1705.0
mean,4929.612317,0.961877,0.175367,0.05044
std,1084.674919,0.19155,0.380392,0.218915
min,3589.0,0.0,0.0,0.0
25%,4058.0,1.0,0.0,0.0
50%,4578.0,1.0,0.0,0.0
75%,5631.0,1.0,0.0,0.0
max,7169.0,1.0,1.0,1.0


In [19]:
programmes['Programme duration'].unique()

array(['4 semesters', '6 semesters', '7 semesters', '8 semesters',
       '3 semesters', '2 semesters', 'More than 9 semesters',
       '5 semesters', nan], dtype=object)

In [20]:
programmes[programmes['Programme duration'] == 'More than 9 semesters'] # This will be stored as 9, but we need to note that it means 9 or more

Unnamed: 0,Couse ID,University,Program,Degree,Teaching language,Languages,Programme duration,Beginning,Application deadline,Tuition fees per semester in EUR,...,Semester contribution,Academic admission requirements,Language requirements,Submit application to,Accommodation,Possibility of finding part-time employment,City,English,German,Other Language
107,4118,Max Delbrück Center for Molecular Medicine (MD...,MDC International PhD Programme in Biomedical ...,PhD,English,Working language is English,More than 9 semesters,Only for doctoral programmes: any time,Twice a year (spring and autumn) For more info...,,...,"Around 350 EUR, depending on the university",Applicants must hold a Master's degree with a ...,Applicants must have a good command of English...,For more information on the recruitment proces...,,,Berlin,1,0,0
266,4314,University of Münster,International and European Governance Internat...,Bachelor of Arts in International and European...,German English French,Courses are held in German and French. English...,More than 9 semesters,Winter semester,1 May for the following winter semester For de...,Varied,...,Social contribution fee of 299.34 EUR per seme...,University entrance qualification,The ability to actively participate in French ...,WWU Münster Institute of Political Science Bew...,As in all popular university cities in Germany...,Part-time employment is only possible during t...,Münster,1,1,1
369,4441,Ludwig-Maximilians-Universität München,Graduate School of Quantitative Biosciences Mu...,Doctoral Degree Dr rer nat,English,Courses are held in English (100%).,More than 9 semesters,Winter semester,31 August 2020,,...,142.40 EUR per semester,Master's degree in any relevant discipline: bi...,Applicants must provide proof of English profi...,http://qbm.genzentrum.lmu.de/application/,The International Office helps visiting academ...,,München,1,0,0
413,4494,University of Cologne,CEPLAS - Cluster of Excellence on Plant Scienc...,Doctoral Degree (Dr rer nat or PhD),English,The programme is conducted in English. Non-nat...,More than 9 semesters,Winter semester,Please consult the CEPLAS website .,,...,The semester contribution amounts to approx. 2...,A prerequisite for admission to the CEPLAS Gra...,Knowledge of the German language is not requir...,Please consult the CEPLAS website .,Accommodation is available through the student...,"Due to the intense nature of the programme, st...",Düsseldorf,1,0,0
415,4496,Humboldt-Universität zu Berlin,Integrative Research Institute of Transformati...,Dr rer nat Dr phil,English,English,More than 9 semesters,Winter and summer semester,15 March (registration for summer semester) 15...,,...,315 EUR,Excellent Master's degree in agricultural scie...,Very good English language skills German langu...,Contact: kathrin.klementz@hu-berlin.de,,,Berlin,1,0,0
663,4782,University of Göttingen,Chemistry (PhD) Chemistry (PhD),Dr rer nat / PhD,German English,There is a broad variety of courses both in Ge...,More than 9 semesters,Winter and summer semester,No fixed application deadlines,,...,Fees amount to around 380 EUR per semester. Th...,Master's degree in Chemistry or a related fiel...,Applicants must provide proof of their German ...,Georg-August-Universität Göttingen Fakultät fü...,The Accommodation Service of the International...,The university supports students in finding pa...,Göttingen,1,1,0
1312,6280,University of Göttingen,Max Planck School Matter to Life – Combined Ma...,Master of Science / PhD,English,Courses are held in English.,More than 9 semesters,Winter semester,1 December for following winter semester,,...,Fees amount to around 380 EUR per semester. Th...,Bachelor's degree in natural science or engine...,Very good English language skills are required...,https://www.maxplanckschools.de/23875/application,"For the first semester, students have the opti...",The university supports students in finding pa...,Göttingen,1,0,0
1322,6293,Heinrich Heine University Düsseldorf,CEPLAS - Cluster of Excellence on Plant Scienc...,Doctoral Degree (Dr rer nat or PhD),English,The programme is conducted in English. Non-nat...,More than 9 semesters,Winter semester,Please consult the CEPLAS website .,,...,A semester fee of approx. 300 EUR is required ...,A prerequisite for admission to the CEPLAS Gra...,Knowledge of the German language is not requir...,Please consult the CEPLAS website .,"On-campus student accommodation is available, ...","Due to the intense nature of the programme, st...",Düsseldorf,1,0,0
1430,6885,Frankfurt School of Finance & Management,"Doctoral Programme in Accounting, Finance and ...",Dr rer pol,English,English,More than 9 semesters,Winter semester,15 January,,...,,Master's degree with distinction or equivalent...,Proof of English proficiency with one of the f...,Please apply online: https://applymasters.fran...,The International Office at Frankfurt School w...,Support in finding internships or permanent po...,Frankfurt am Main,1,0,0
1582,7037,Universität Heidelberg,Max Planck School Matter to Life – Combined Ma...,Master of Science / PhD,English,The daily scientific communication at the inst...,More than 9 semesters,Winter semester,1 December with the start of studies the follo...,,...,"Depending on the home university, the semester...",Applicants with a Bachelor of Science in Chemi...,Very good English language skills are required...,An online application portal will be available...,Accommodation is offered at the location of th...,The three teaching universities try to make it...,Heidelberg,1,0,0


In [26]:
programmes['Duration_in_semesters']=programmes['Programme duration'].apply(lambda x: x.split(' ')[0] if x.split(' ')[0] != 'More' else 9)

In [27]:
programmes['Duration_in_semesters'] = programmes['Duration_in_semesters'].astype('int32')

In [74]:
programmes.describe()

Unnamed: 0,Couse ID,English,German,Other Language,Duration_in_semesters
count,1705.0,1705.0,1705.0,1705.0,1705.0
mean,4929.612317,0.961877,0.175367,0.05044,4.486804
std,1084.674919,0.19155,0.380392,0.218915,1.380607
min,3589.0,0.0,0.0,0.0,2.0
25%,4058.0,1.0,0.0,0.0,4.0
50%,4578.0,1.0,0.0,0.0,4.0
75%,5631.0,1.0,0.0,0.0,6.0
max,7169.0,1.0,1.0,1.0,9.0


### Clean the beginning and keep only the season

In [28]:
programmes['Beginning'].unique()

array(['Winter semester', 'Only for doctoral programmes: any time',
       'Winter and summer semester', 'Summer semester', 'Other'],
      dtype=object)

So the mapping is:
- Winter semester and Summer semester each set their own flag
- Only for doctoral... becomes Any time
- Winter and summer semester sets 1 for both
- Other remains Other
- We encode them as dummy (0/1) columns

In [29]:
programmes['Beginning_in_Winter'] = programmes['Beginning'].apply(lambda x: 0 if re.search('Winter',x) == None else 1)
programmes['Beginning_in_Summer'] = programmes['Beginning'].apply(lambda x: 0 if re.search('(s|S)ummer',x) == None else 1)
programmes['Beginning_in_Any_time'] = programmes['Beginning'].apply(lambda x: 0 if re.search('any time',x) == None else 1)
programmes['Beginning_in_Other'] = programmes['Beginning'].apply(lambda x: 0 if re.search('Other',x) == None else 1)
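
As a quick check of the mapping, the sketch below applies the same four patterns to the five values returned by `unique()`. The helper name `beginning_flags` is hypothetical and only used for illustration; it is not part of the notebook.

```python
import re

# The five values returned by programmes['Beginning'].unique()
BEGINNING_VALUES = [
    'Winter semester',
    'Only for doctoral programmes: any time',
    'Winter and summer semester',
    'Summer semester',
    'Other',
]

def beginning_flags(text):
    """Hypothetical helper: the four 0/1 flags for one 'Beginning' value."""
    return {
        'Winter': int(re.search('Winter', text) is not None),
        'Summer': int(re.search('(s|S)ummer', text) is not None),
        'Any_time': int(re.search('any time', text) is not None),
        'Other': int(re.search('Other', text) is not None),
    }

# Print the flags for each value so the mapping can be checked by eye
for value in BEGINNING_VALUES:
    print(value, '->', beginning_flags(value))
```

Only "Winter and summer semester" should set two flags at once; every other value sets exactly one.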

In [30]:
programmes.describe()

Unnamed: 0,Couse ID,English,German,Other Language,Duration_in_semesters,Beginning_in_Winter,Beginning_in_Summer,Beginning_in_Any_time,Beginning_in_Other
count,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0
mean,4929.612317,0.961877,0.175367,0.05044,4.486804,0.893842,0.328446,0.065103,0.011144
std,1084.674919,0.19155,0.380392,0.218915,1.380607,0.308131,0.469786,0.246779,0.105005
min,3589.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0
25%,4058.0,1.0,0.0,0.0,4.0,1.0,0.0,0.0,0.0
50%,4578.0,1.0,0.0,0.0,4.0,1.0,0.0,0.0,0.0
75%,5631.0,1.0,0.0,0.0,6.0,1.0,1.0,0.0,0.0
max,7169.0,1.0,1.0,1.0,9.0,1.0,1.0,1.0,1.0


### Clean the deadline

In [31]:
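# Assign the result back: calling fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas versions
programmes['Application deadline'] = programmes['Application deadline'].fillna('No information')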

In [33]:
# There are more robust ways, but we take the simple route. It's not perfect: some months are flagged wrongly (e.g. '(M|m)ay' also matches the verb 'may')
programmes['Deadline_in_January'] = programmes['Application deadline'].apply(lambda x: 0 if re.search('(J|j)anuary',x) == None else 1)
programmes['Deadline_in_February'] = programmes['Application deadline'].apply(lambda x: 0 if re.search('(F|f)ebruary',x) == None else 1)
programmes['Deadline_in_March'] = programmes['Application deadline'].apply(lambda x: 0 if re.search('(M|m)arch',x) == None else 1)
programmes['Deadline_in_April'] = programmes['Application deadline'].apply(lambda x: 0 if re.search('(A|a)pril',x) == None else 1)
programmes['Deadline_in_May'] = programmes['Application deadline'].apply(lambda x: 0 if re.search('(M|m)ay',x) == None else 1)
programmes['Deadline_in_June'] = programmes['Application deadline'].apply(lambda x: 0 if re.search('(J|j)une',x) == None else 1)
programmes['Deadline_in_July'] = programmes['Application deadline'].apply(lambda x: 0 if re.search('(J|j)uly',x) == None else 1)
programmes['Deadline_in_August'] = programmes['Application deadline'].apply(lambda x: 0 if re.search('(A|a)ugust',x) == None else 1)
programmes['Deadline_in_September'] = programmes['Application deadline'].apply(lambda x: 0 if re.search('(S|s)eptember',x) == None else 1)
programmes['Deadline_in_October'] = programmes['Application deadline'].apply(lambda x: 0 if re.search('(O|o)ctober',x) == None else 1)
programmes['Deadline_in_November'] = programmes['Application deadline'].apply(lambda x: 0 if re.search('(N|n)ovember',x) == None else 1)
programmes['Deadline_in_December'] = programmes['Application deadline'].apply(lambda x: 0 if re.search('(D|d)ecember',x) == None else 1)
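
One possible way to reduce the wrong matches is sketched below. It assumes that month names in the deadline text are always capitalised and stand as whole words. The helper `month_flags` is hypothetical, and the counts it produces would differ from the summary table that follows.

```python
import re

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

def month_flags(deadline):
    """Hypothetical helper: one 0/1 flag per month named in a deadline string."""
    flags = {}
    for month in MONTHS:
        # \b requires a whole word; matching only the capitalised name
        # keeps 'May' from matching the verb 'may' in the middle of a sentence.
        pattern = r'\b' + month + r'\b'
        flags['Deadline_in_' + month] = int(re.search(pattern, deadline) is not None)
    return flags

example = 'Applicants may apply until 15 July for the following winter semester'
print(month_flags(example)['Deadline_in_May'])   # not flagged by the stricter pattern
print(month_flags(example)['Deadline_in_July'])  # flagged
```

A sentence that starts with the verb "May" would still be flagged, so this is an improvement rather than a complete fix.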


In [140]:
programmes.describe()

Unnamed: 0,Couse ID,English,German,Other Language,Duration_in_semesters,Beginning_in_Winter,Beginning_in_Summer,Beginning_in_Any_time,Beginning_in_Other,Deadline_in_January,...,Deadline_in_March,Deadline_in_April,Deadline_in_May,Deadline_in_June,Deadline_in_July,Deadline_in_August,Deadline_in_September,Deadline_in_October,Deadline_in_November,Deadline_in_December
count,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,...,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0
mean,4929.612317,0.961877,0.175367,0.05044,4.486804,0.893842,0.328446,0.065103,0.011144,0.219355,...,0.159531,0.157185,0.269795,0.179472,0.34956,0.057478,0.102639,0.113783,0.11437,0.093255
std,1084.674919,0.19155,0.380392,0.218915,1.380607,0.308131,0.469786,0.246779,0.105005,0.413931,...,0.366278,0.364082,0.443983,0.38386,0.476971,0.232822,0.303576,0.317641,0.318353,0.290875
min,3589.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4058.0,1.0,0.0,0.0,4.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4578.0,1.0,0.0,0.0,4.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,5631.0,1.0,0.0,0.0,6.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
max,7169.0,1.0,1.0,1.0,9.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Make tuition a number

- If None --> 0
- If Yes or Varied --> -1


In [34]:
programmes['Tuition fees per semester in EUR'].value_counts()

None      1128
1,500      107
Varied      91
Yes         68
10,000      20
          ... 
3,720        1
5,750        1
1,200        1
300          1
5,480        1
Name: Tuition fees per semester in EUR, Length: 129, dtype: int64

In [35]:
# Get rid of the ','
programmes['Tuition fees per semester in EUR']=programmes['Tuition fees per semester in EUR'].str.replace(',','', regex=True)

In [36]:

programmes['Tuition fees per semester in EUR'] = programmes['Tuition fees per semester in EUR'].apply(lambda x: 0 if x == 'None' else (-1 if ((x == 'Varied') | (x == 'Yes')) else x)).astype('int32')
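
The two tuition cells can also be read as one small function. The sketch below is an illustration, not the notebook's code; `parse_tuition` is a hypothetical name, and it assumes every entry is the string 'None', 'Varied', 'Yes', or a number with optional thousands commas.

```python
def parse_tuition(value):
    """Hypothetical helper: map one raw tuition entry to an int."""
    if value == 'None':
        return 0                         # no tuition fees
    if value in ('Varied', 'Yes'):
        return -1                        # fees exist, amount unknown
    return int(value.replace(',', ''))   # e.g. '1,500' becomes 1500

for raw in ['None', 'Varied', 'Yes', '1,500', '300']:
    print(raw, '->', parse_tuition(raw))
```

Such a function could be passed to `Series.apply` in place of the nested lambda.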

### Create a column where you can filter by Master / PhD / Bachelor / Other

- Ways to write a Master's (Master / master / MSc)
- Ways to write a PhD (Doctor / PhD / Doctorate / Dr / Doctoral)

In [37]:
programmes['Degree']

0                 Master of Science in Molecular Medicine
1               Master of Science in Software Engineering
2                                          Master of Arts
3                                                     PhD
4                              Doctoral Degree in Physics
                              ...                        
2042                       Bachelor of Engineering (BEng)
2043    Master of Science in International Management,...
2044                                  Master of Arts (MA)
2045             Bachelor of Science in Applied Chemistry
2046    Bachelor of Engineering in Electrical Engineering
Name: Degree, Length: 1705, dtype: object

In [38]:
programmes['Master'] = programmes['Degree'].apply(lambda x: 0 if re.search('(M|m)aster|(M|m)(S|s)(C|c)|MBA|MA|LLM|MCBL',x) == None else 1)
programmes['PhD'] = programmes['Degree'].apply(lambda x: 0 if re.search('(D|d)octo(r|ral|rade)|(P|p)(H|h)(D|d)|(D|d)(R|r)',x) == None else 1)
programmes['Bachelor'] = programmes['Degree'].apply(lambda x: 0 if re.search('(B|b)achelor|(B|b)(S|s)(C|c)|BEng|BA',x) == None else 1)
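
Because these patterns match anywhere in the string, short tokens can fire inside longer words: `BA` matches inside "MBA", and `(D|d)(R|r)` matches the "dr" in a word such as "Hydrology". The sketch below compares the notebook's patterns with stricter variants that use word boundaries. The stricter patterns and the degree strings are assumptions made for illustration, and they would change the counts in the summary table.

```python
import re

# Patterns copied from the cell above
BACHELOR_OLD = '(B|b)achelor|(B|b)(S|s)(C|c)|BEng|BA'
PHD_OLD = '(D|d)octo(r|ral|rade)|(P|p)(H|h)(D|d)|(D|d)(R|r)'

# Assumed alternatives: \b makes each token match only as a whole word
BACHELOR_STRICT = r'\b([Bb]achelor|[Bb][Ss][Cc]|BEng|BA)\b'
PHD_STRICT = r'\b([Dd]octor(al|ate)?|[Pp][Hh][Dd]|[Dd][Rr])\b'

def is_flagged(pattern, degree):
    """Return 1 if the pattern matches anywhere in the degree title, else 0."""
    return int(re.search(pattern, degree) is not None)

# Illustrative degree titles (the Hydrology one is made up for the example)
print(is_flagged(BACHELOR_OLD, 'MBA'), is_flagged(BACHELOR_STRICT, 'MBA'))
print(is_flagged(PHD_OLD, 'Master of Science in Hydrology'),
      is_flagged(PHD_STRICT, 'Master of Science in Hydrology'))
print(is_flagged(PHD_STRICT, 'Doctoral Degree in Physics'))
```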


### Make the semester contribution an int

- If None --> 0
- If there is more than one value, pick the highest one
- But if there is more than one value and the highest one is 100 or less, sum up all the values

In [39]:
programmes['Semester contribution'][2]

'The university charges a registration fee of 50 EUR as well as student union fees. State law requires all students to be members of the student union. These fees give students access to subsidised accommodation and meals. The fees also automatically include a free travel pass for public transport in most of the state of Hesse. In summer 2020, the student union dues amounted to approx. 335 EUR. https://www.uni-marburg.de/en/studying/life-at-umr/finance'

In [40]:
# Find every 'number + EUR' in the string of each row (raw string so that \d is not read as a string escape)
programmes['Contribution per semester']=programmes['Semester contribution'].apply(lambda x: re.findall(r'\d+.\d+ EUR|\d+ EUR',x))

In [41]:
programmes['Contribution per semester'][1261]

['80 EUR', '80 EUR']

In [42]:
# Get rid of ' EUR' and convert to int; to do that, drop whatever follows a '.' or ',' because it is almost irrelevant
# Also, if the list is [] (empty), give the value 0
programmes['Contribution per semester'] = programmes['Contribution per semester'].apply(lambda x: 0 if x ==[] else [int(i.split('-')[0].strip(' EUR').replace(',','.').split('.')[0]) for i in x])

In [43]:
# Grab the max value
programmes[programmes['Contribution per semester']!=0]['Contribution per semester'].apply(max)

0       230
1       115
2       335
3       180
4       290
       ... 
2042    170
2043    128
2044    310
2045    320
2046    320
Name: Contribution per semester, Length: 1528, dtype: int64

In [44]:
# Take the max element of the list if it is over 100; otherwise sum all the elements in the list
programmes['Contribution per semester'] = programmes['Contribution per semester'].apply(lambda x: 0 if (x ==0) else (max(x) if max(x)>100 else sum(x)))
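
Cells 40 to 44 can be read together as a single function. The sketch below restates the same steps with the same regular expression; `contribution_from_text` is a hypothetical name, and the first example string is a shortened version of the Marburg text shown above.

```python
import re

def contribution_from_text(text):
    """Hypothetical helper: semester contribution in EUR parsed from free text."""
    # Step 1: collect every 'number EUR' fragment (same pattern as cell 40)
    amounts = re.findall(r'\d+.\d+ EUR|\d+ EUR', text)
    if not amounts:
        return 0
    # Step 2: keep the part before any '-', drop ' EUR' and anything after '.' or ','
    values = [int(a.split('-')[0].strip(' EUR').replace(',', '.').split('.')[0])
              for a in amounts]
    # Step 3: largest value if it is over 100, otherwise the sum of all values
    return max(values) if max(values) > 100 else sum(values)

marburg = ('The university charges a registration fee of 50 EUR. In summer 2020, '
           'the student union dues amounted to approx. 335 EUR.')
print(contribution_from_text(marburg))              # values [50, 335], max is over 100
print(contribution_from_text('80 EUR plus 80 EUR')) # values [80, 80], max is not over 100
```

For the Marburg text the result agrees with the 335 shown for row 2 in the output of cell 43.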

### Create a total contribution
(Tuition fees per semester + contribution per semester) * number of semesters

In [45]:
programmes['Total contribution'] = (programmes['Contribution per semester']+ programmes['Tuition fees per semester in EUR'])*programmes['Duration_in_semesters']  
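
Note that the -1 sentinel used for "Varied" and "Yes" tuition flows into this total, which is why the summary further down shows a minimum of -6. The sketch below shows one possible way to treat such rows as unknown instead. The helper `total_contribution` and the choice of `None` for unknown totals are assumptions, not the notebook's behaviour.

```python
def total_contribution(tuition, contribution, semesters):
    """Hypothetical helper: total cost over the programme, or None if tuition is unknown."""
    if tuition == -1:          # sentinel for 'Varied' / 'Yes'
        return None
    return (tuition + contribution) * semesters

print(total_contribution(0, 335, 4))    # no tuition, 335 EUR contribution, 4 semesters
print(total_contribution(-1, 0, 6))     # unknown tuition
print((-1 + 0) * 6)                     # what the plain formula gives for the same row
```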

### Rename some columns

In [46]:
programmes.rename(columns={'Program':'Programme'},inplace=True)

In [47]:
programmes.rename(columns={'Couse ID':'Course ID'},inplace=True)

In [48]:
programmes.drop('Types of assessment', axis=1, inplace=True)

In [49]:
programmes.describe()

Unnamed: 0,Course ID,Tuition fees per semester in EUR,English,German,Other Language,Duration_in_semesters,Beginning_in_Winter,Beginning_in_Summer,Beginning_in_Any_time,Beginning_in_Other,...,Deadline_in_August,Deadline_in_September,Deadline_in_October,Deadline_in_November,Deadline_in_December,Master,PhD,Bachelor,Contribution per semester,Total contribution
count,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,...,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0,1705.0
mean,4929.612317,1050.983578,0.961877,0.175367,0.05044,4.486804,0.893842,0.328446,0.065103,0.011144,...,0.057478,0.102639,0.113783,0.11437,0.093255,0.747214,0.139589,0.150147,221.53783,5340.947214
std,1084.674919,2383.670294,0.19155,0.380392,0.218915,1.380607,0.308131,0.469786,0.246779,0.105005,...,0.232822,0.303576,0.317641,0.318353,0.290875,0.434737,0.346662,0.35732,136.910529,9888.235706
min,3589.0,-1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-6.0
25%,4058.0,0.0,1.0,0.0,0.0,4.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,130.0,750.0
50%,4578.0,0.0,1.0,0.0,0.0,4.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,230.0,1228.0
75%,5631.0,0.0,1.0,0.0,0.0,6.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,300.0,3272.0
max,7169.0,21750.0,1.0,1.0,1.0,9.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,900.0,61800.0


## Create a CSV file

In [50]:
programmes.to_csv('./Database/DAAD_data_base_cleaned.csv', index= False)