<center><h1> FCG - Contest Name</h1></center>
    
<center><h1> Pandas Tutorial for Reading and Storing Data in CSV files </h1></center>

# Introduction to Pandas and CSV files

## What is Pandas?
Pandas is an open-source Python library created to handle the manipulation and storage of large-scale tabular datasets.

## Advantages of using Pandas for Machine Learning
- Pandas is optimized for handling large-scale datasets
- Its CSV reader and column operations run in compiled code, which is typically much faster than processing rows one at a time with Python's built-in csv module
- Pandas contains several useful features for Machine Learning such as:
    - Data cleansing
    - Data inspection
    - Statistical analysis
    - Data normalization
    - Loading and storing data

## What is a CSV file?
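A CSV (comma-separated values) file is a text file which uses commas to separate values. The first line usually holds the column names, and each following line holds one record. An example is provided below.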

id,title,city,postalCode <br>
0,West-Varkenoordseweg,Rotterdam,3974HN <br>
3,Ruiterakker,Assen,9407BG <br>
8,Brusselseweg,Maastricht,6217GX <br>
10,Donkerslootstraat,Rotterdam,3074WL <br>
12,Vorselenburgstraat,Alphen aan den Rijn,2405XJ <br>

While the above data might be difficult for a human to read at first glance, machines can parse these files quickly.
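To see how a program reads this format, the sample rows can be parsed straight from a string. This is a minimal sketch: `io.StringIO` stands in for a file on disk, and the variable names are chosen for illustration only.

```python
import io
import pandas as pd

# The sample rows from above, held in a string instead of a file
csv_text = """id,title,city,postalCode
0,West-Varkenoordseweg,Rotterdam,3974HN
3,Ruiterakker,Assen,9407BG
8,Brusselseweg,Maastricht,6217GX
10,Donkerslootstraat,Rotterdam,3074WL
12,Vorselenburgstraat,Alphen aan den Rijn,2405XJ
"""

# read_csv accepts any file-like object, so StringIO behaves like an open file
addresses = pd.read_csv(io.StringIO(csv_text))

print(addresses.shape)             # (number of rows, number of columns)
print(addresses["city"].tolist())  # all values of one column
```

The first line of the text becomes the column names, and every later line becomes one row of the resulting dataframe.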

## Why do we use CSV files?
- CSV files are plain text files which can easily be manipulated using any text editor
- The comma is a visible, unambiguous delimiter, so the file is read the same way on every computer. A TSV file, which separates values with tabs, is easier to damage by accident: a tab looks like a run of spaces in most editors, and an editor that converts tabs to spaces silently breaks the file
- CSV files are compact: column names appear only once in the header row instead of being repeated for every record as in JSON or XML, which saves space for large datasets

Now that we know what Pandas and CSV files are, we can start using them for machine learning.

# Getting Started with Pandas
## Installing Pandas
To use Pandas, we first need to install it using pip. Once installed, we can use it in our program by importing it. By convention, it is imported under the alias pd.

In [7]:
!pip install pandas
import pandas as pd



## Reading Data from a CSV file
Now that Pandas has been imported into our program, we can use it to load data from a CSV file. The function read_csv() takes the path of the CSV file as an argument and returns the contents as a dataframe. We will use the popular Titanic dataset to illustrate the functionality of Pandas.

In [8]:
train = pd.read_csv("datasets/titanic.csv")
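read_csv() also accepts optional keyword arguments that control what is loaded. The sketch below is only an illustration: it writes a tiny stand-in file (not the real Titanic data, and with a hypothetical file name) to a temporary directory, then reads part of it back with `usecols` and `nrows`.

```python
import os
import tempfile
import pandas as pd

# Hypothetical three-row stand-in for the Titanic file
sample = (
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22.0\n"
    "2,1,1,female,38.0\n"
    "3,1,3,female,26.0\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "titanic_sample.csv")
    with open(path, "w") as f:
        f.write(sample)

    # usecols keeps only the named columns; nrows limits how many rows are read
    subset = pd.read_csv(path, usecols=["PassengerId", "Survived", "Age"], nrows=2)

print(subset)
```

Reading only the needed columns or rows can be handy for a quick look at a large file before loading all of it.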

To verify that the dataset has been read properly, we can display its first rows using head(). This function takes the number of rows we wish to display as input and shows 5 rows if no number is given. We can also look at the dtypes attribute to check how Pandas interpreted the data in each column.

In [9]:
train.head(8)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [10]:
train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
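The blank Age and Cabin fields in the rows shown earlier are loaded as missing values (NaN). A minimal sketch of how to count them, using three rows copied from the output above rather than the full dataset:

```python
import io
import pandas as pd

# Three rows copied from the head() output; blank fields are missing values
rows = """PassengerId,Name,Age,Cabin
5,"Allen, Mr. William Henry",35.0,
6,"Moran, Mr. James",,
7,"McCarthy, Mr. Timothy J",54.0,E46
"""
sample = pd.read_csv(io.StringIO(rows))

print(sample.dtypes)        # Age is float64 because NaN is a floating-point value
print(sample.isna().sum())  # number of missing values in each column
```

Note that the names contain commas but are wrapped in double quotes, so each name is still read as a single value.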

Now that we have loaded our data into a Pandas dataframe, we can feed the training data into our machine learning program and then predict the outcome for each test case.
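Most machine learning libraries expect the input features and the column to be predicted as separate objects. The following is a minimal sketch, assuming `Survived` is the column to predict; it uses a few hand-copied rows, and the choice of feature columns is illustrative only.

```python
import pandas as pd

# A few rows copied from the head() output above (subset of the columns)
train_sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# Labels: the column the model should learn to predict
y = train_sample["Survived"]

# Features: everything except the label and the identifier
X = train_sample.drop(columns=["Survived", "PassengerId"])

print(X.columns.tolist())
print(y.tolist())
```

The identifier is dropped from the features because it only names a row and carries no information about the outcome.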

## Creating a submission file for the Competition
Once the machine learning model is trained, we can feed it the test data, which does not contain the goal states (the values we want to predict). To evaluate our model, we need to create a submission file and upload it to the judging software. The submission file is a CSV file which consists of the IDs of the datapoints and their predicted goal states. An example of the submission file for the Titanic dataset is given below.

In [11]:
# For demonstration purposes, we will load the submission file from an existing csv
# Normally, a submission dataframe would be made when we feed the test data to the model
submission = pd.read_csv("datasets/titanic_submission.csv")

submission.head(8)

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
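When the predictions come from a model instead of an existing file, the submission dataframe can be assembled from the test IDs and the predicted values. This is a sketch under assumptions: `predictions` is a hypothetical placeholder for real model output, filled in here with the first values from the example above.

```python
import pandas as pd

# IDs of the test datapoints (first rows of the example above)
test_ids = [892, 893, 894, 895]

# Hypothetical placeholder for model output; a real model would produce these
predictions = [0, 1, 0, 0]

# One row per test datapoint: its ID and the predicted goal state
submission_sample = pd.DataFrame({
    "PassengerId": test_ids,
    "Survived": predictions,
})

print(submission_sample)
```

Both lists must have the same length and order, so that each prediction lines up with the ID it belongs to.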


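Once the submission dataframe is obtained, we can write it to a CSV file using the to_csv() function, which takes the desired file name as an argument. Passing index=False leaves out the dataframe's row index, so the file contains only the ID and prediction columns.

In [11]:
submission.to_csv("submission.csv", index=False)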

Now that we have the submission file in CSV format, we can upload it to the judging software for evaluation.
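Before uploading, it can be worth checking exactly which columns end up in the file, because to_csv() writes the dataframe's row index as an extra first column unless told otherwise. The sketch below compares both variants on a two-row example; whether the judging software accepts the extra column is not stated here, so treat this only as an illustration of the difference.

```python
import pandas as pd

submission_sample = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})

# With no path given, to_csv() returns the CSV text instead of writing a file
with_index = submission_sample.to_csv()
without_index = submission_sample.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an empty index column
print(without_index.splitlines()[0])  # header holds only the two data columns
```

A file written with the index and then read back with read_csv() shows that extra column under the name "Unnamed: 0".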