<center><h1> Groningen Machine Learning Competition </h1></center>
    
<center><h2> Pandas Tutorial for Reading and Storing Data in CSV Files </h2></center>

# Introduction to Pandas and CSV files

## What is Pandas?
Pandas is an open-source Python library for manipulating and storing large-scale datasets.

## Advantages of using Pandas for Machine Learning
- Pandas is optimized for handling large-scale datasets
- This optimization can result in significant speedups compared to handling the data with standard-library modules such as csv and pickle
- Pandas contains several useful features for Machine Learning such as:
    - Data cleansing
    - Data inspection
    - Statistical analysis
    - Data normalization
    - Loading and storing data
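
A few of the features listed above can be sketched on a toy table. The DataFrame below, including its column names and values, is made up for illustration and is not the competition data; filling with the median and min-max scaling are just one possible choice for cleansing and normalization.

```python
import pandas as pd

# Hypothetical toy data; the column names and values are made up for illustration.
rooms = pd.DataFrame({
    "city": ["Rotterdam", "Assen", "Maastricht", "Rotterdam"],
    "areaSqm": [14, 16, 16, 25],
    "rent": [500.0, 290.0, None, 600.0],
})

# Data inspection: size of the table and number of missing values per column.
print(rooms.shape)
print(rooms.isna().sum())

# Data cleansing: fill the missing rent with the median of the known rents.
rooms["rent"] = rooms["rent"].fillna(rooms["rent"].median())

# Statistical analysis: summary statistics and a per-city mean.
print(rooms["rent"].describe())
mean_rent_per_city = rooms.groupby("city")["rent"].mean()

# Data normalization: min-max scaling of the area to the range [0, 1].
area = rooms["areaSqm"]
rooms["areaScaled"] = (area - area.min()) / (area.max() - area.min())
```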

## What is a CSV file?
A CSV (comma-separated values) file is a text file that uses commas to separate values. An example is provided below.

id,title,city,postalCode <br>
0,West-Varkenoordseweg,Rotterdam,3074HN <br>
3,Ruiterakker,Assen,9407BG <br>
8,Brusselseweg,Maastricht,6217GX <br>
10,Donkerslootstraat,Rotterdam,3074WL <br>
12,Vorselenburgstraat,Alphen aan den Rijn,2405XJ <br>

While the above data might be difficult for a human to read at first glance, machines can parse these files quickly.
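
To make the parsing concrete, here is a minimal sketch that feeds rows like the ones above to Pandas as an in-memory string. The variable names `csv_text` and `listings` are chosen for illustration only.

```python
import io
import pandas as pd

csv_text = """id,title,city,postalCode
0,West-Varkenoordseweg,Rotterdam,3074HN
3,Ruiterakker,Assen,9407BG
8,Brusselseweg,Maastricht,6217GX
10,Donkerslootstraat,Rotterdam,3074WL
12,Vorselenburgstraat,Alphen aan den Rijn,2405XJ
"""

# io.StringIO lets read_csv treat an in-memory string like a file.
listings = pd.read_csv(io.StringIO(csv_text))
print(listings)
```

The first line of the text becomes the column names, and every following line becomes one row.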

## Why do we use CSV files?
- CSV files are plain text files which can easily be manipulated using any text editor
- Commas are visible characters, so the separator is unambiguous on every machine. This is useful when compared to a TSV file, which uses tabs to separate values: a tab is displayed with a different width depending on the editor, and some editors silently convert tabs to spaces, so the data may be read in an unintended way
- CSV files are compact and can save space when used for large datasets
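
As a small sketch of the separator difference, the snippet below reads the same two made-up rows once as CSV and once as TSV. The `sep` parameter of `read_csv()` selects the separator; the sample strings are assumptions for illustration.

```python
import io
import pandas as pd

csv_text = "id,city\n0,Rotterdam\n3,Assen\n"
tsv_text = "id\tcity\n0\tRotterdam\n3\tAssen\n"

# The comma is the default separator of read_csv.
from_csv = pd.read_csv(io.StringIO(csv_text))
# For tab-separated text the separator has to be requested explicitly.
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both inputs should produce the same table.
print(from_csv.equals(from_tsv))
```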

Now that we know what Pandas and CSV files are, we can start using them for machine learning.

# Getting Started with Pandas
## Installing Pandas
To use Pandas, we need to first install it using pip.

In [None]:
# The leading ! runs the command in the system shell instead of as Python code
# On a local machine, it is sufficient to run `pip install pandas` (without the !) in a terminal
!pip install pandas

Once Pandas is installed, we need to import it in our code to use it.

In [1]:
import pandas as pd

## Reading Data from a CSV file
Now that Pandas has been imported into our program, we can use it to load data from a CSV file. The function read_csv() takes the path of the CSV file as an argument and returns the contents as a DataFrame.

In [2]:
train = pd.read_csv("datasets/train.csv")

To verify that the dataset has been read properly, we can display its first rows using head(). This function takes the number of rows we wish to display as input (5 if omitted). We can also look at the dtypes attribute to check how Pandas interpreted the data in each column.

In [3]:
train.head(8)

Unnamed: 0,id,title,city,postalCode,latitude,longitude,areaSqm,firstSeenAt,lastSeenAt,isRoomActive,...,matchAge,matchGender,matchCapacity,matchLanguages,matchStatus,coverImageUrl,additionalCosts,rent,deposit,registrationCost
0,0,West-Varkenoordseweg,Rotterdam,3074HN,51.896601,4.514993,14,2019-07-14 11:25:46.511000+00:00,2019-07-26 22:18:23.142000+00:00,True,...,16 years - 99 years,Not important,1 person,Not important,Not important,https://resources.kamernet.nl/image/913b4b03-5...,50.0,500,500.0,0.0
1,3,Ruiterakker,Assen,9407BG,53.013494,6.561012,16,2019-07-14 11:25:46.988000+00:00,2019-07-18 22:00:31.174000+00:00,False,...,18 years - 32 years,Female,1 person,Not important,"Student, Working student",https://resources.kamernet.nl/image/84e95365-6...,,290,290.0,
2,8,Brusselseweg,Maastricht,6217GX,50.860841,5.671673,16,2019-07-14 11:25:47.814000+00:00,2019-08-10 00:14:27.130000+00:00,True,...,16 years - 40 years,Male,4 persons,Dutch English,Student,https://resources.kamernet.nl/image/6e625591-d...,,425,425.0,25.0
3,10,Donkerslootstraat,Rotterdam,3074WL,51.893195,4.516478,25,2019-07-14 11:25:48.140000+00:00,2019-07-16 06:05:32.183000+00:00,False,...,21 years - 99 years,Not important,4 persons,Dutch English Spanish French Italian German Po...,"Student, Working student, Working, Looking for...",https://resources.kamernet.nl/image/ea3aea77-0...,,600,1200.0,0.0
4,12,Vorselenburgstraat,Alphen aan den Rijn,2405XJ,52.122335,4.661434,10,2019-07-14 11:25:48.465000+00:00,2019-08-01 00:02:40.516000+00:00,True,...,22 years - 40 years,Not important,1 person,Dutch English,"Student, Working student, Working",https://resources.kamernet.nl/image/d0780298-b...,,425,425.0,
5,17,Groenhoven,Amsterdam,1103LW,52.326211,4.976048,19,2019-07-14 11:25:49.250000+00:00,2019-07-25 22:00:46.074000+00:00,False,...,21 years - 99 years,Not important,1 person,Dutch English German,"Student, Working student, Working",https://resources.kamernet.nl/image/7a90ceee-d...,,750,1500.0,
6,18,Noorderhagen,Enschede,7511EL,52.221643,6.894667,21,2019-07-14 11:25:59.227000+00:00,2020-03-03 01:00:57.909000+00:00,True,...,16 years - 99 years,Male,1 person,Not important,Student,https://resources.kamernet.nl/image/f2697bd3-4...,,240,,
7,19,Jaersveltstraat,Rotterdam,3082SJ,51.890481,4.466388,16,2019-07-14 11:25:59.390000+00:00,2019-08-22 06:00:37.445000+00:00,True,...,18 years - 30 years,Female,1 person,Not important,"Student, Working student, Working",https://resources.kamernet.nl/image/acc30b0b-4...,,500,,


In [4]:
train.dtypes

id                            int64
title                        object
city                         object
postalCode                   object
latitude                    float64
longitude                   float64
areaSqm                       int64
firstSeenAt                  object
lastSeenAt                   object
isRoomActive                 object
rawAvailability              object
postedAgo                    object
descriptionNonTranslated     object
descriptionTranslated        object
rentDetail                   object
propertyType                 object
furnish                      object
energyLabel                  object
gender                       object
internet                     object
roommates                    object
shower                       object
toilet                       object
kitchen                      object
living                       object
pets                         object
smokingInside                object
matchAge                    

Now that we have loaded our data into a Pandas DataFrame, we can feed the training data into our machine learning program and predict the outcome for each test case.
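
Before the data reaches a model, columns that Pandas read as `object` often need converting. As a minimal sketch, assuming timestamps formatted like the `firstSeenAt` values shown earlier, `pd.to_datetime()` turns such strings into real datetimes. The two sample rows and the derived column name `firstSeenMonth` are illustrative assumptions, not part of the competition pipeline.

```python
import io
import pandas as pd

# Two rows shaped like the listing data, reduced to three columns for brevity.
csv_text = """id,areaSqm,firstSeenAt
0,14,2019-07-14 11:25:46.511000+00:00
3,16,2019-07-14 11:25:46.988000+00:00
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Convert the timestamp strings into timezone-aware datetimes.
sample["firstSeenAt"] = pd.to_datetime(sample["firstSeenAt"], utc=True)
print(sample.dtypes)

# Datetime columns expose components that can serve as numeric features.
sample["firstSeenMonth"] = sample["firstSeenAt"].dt.month
```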

## Creating a submission file for the Competition
Once the machine learning model is trained, we can feed it the testing data, which does not contain the goal states (here, the rent). To evaluate our model, we need to create a submission file and upload it to the judging software. The submission file is a CSV file that consists of the IDs of the datapoints and their predicted goal states. An example of the submission file for the competition's dataset is given below.

In [5]:
# For demonstration purposes, we will load the submission file from an existing csv
# Normally, a submission dataframe would be generated when we feed the test data to the model
submission = pd.read_csv("datasets/sample_submission.csv")

submission.head(8)

Unnamed: 0,id,rent
0,1,550.0
1,2,550.0
2,4,550.0
3,5,550.0
4,6,550.0
5,7,550.0
6,9,550.0
7,11,550.0
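
When the predictions come from a model rather than from a sample file, a submission DataFrame with the same two columns could be assembled as sketched below. The names `test_ids`, `predictions` and `submission_example`, as well as the values, are hypothetical stand-ins; in practice the IDs would come from the test CSV and the predictions from the trained model.

```python
import pandas as pd

# Hypothetical stand-ins for the test IDs and the model's predicted rents.
test_ids = [1, 2, 4, 5]
predictions = [550.0, 480.5, 610.0, 395.0]

# One row per test datapoint: its ID and the predicted goal state.
submission_example = pd.DataFrame({"id": test_ids, "rent": predictions})
print(submission_example.head())
```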


Once the submission DataFrame is obtained, we can write it to a CSV file using the to_csv() function, which takes the desired file name as an argument.

In [None]:
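# index=False leaves out the DataFrame's row index, so the file contains only the DataFrame's own columns
submission.to_csv("submission.csv", index=False)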

Now that we have the submission file in CSV format, we can upload it to the judging software for evaluation.
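
Before uploading, it may be worth reading the written file back to check its columns. The sketch below does this round trip in a temporary directory with a made-up three-row DataFrame. It assumes that the judging software expects only the `id` and `rent` columns, which is not confirmed here; by default `to_csv()` also writes the row index as an extra unnamed first column, and `index=False` leaves it out.

```python
import os
import tempfile
import pandas as pd

# Made-up submission data for the round-trip check.
demo = pd.DataFrame({"id": [1, 2, 4], "rent": [550.0, 550.0, 550.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    # index=False keeps the row index out of the written file.
    demo.to_csv(path, index=False)
    # Read the file back to see what was actually written.
    check = pd.read_csv(path)

print(list(check.columns))
print(len(check) == len(demo))
```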