## Hello!
You already went through the Jupyter Notebook tutorial and want to do more cool stuff with data? You've come to the right place! This Notebook will cover the basics of a very popular package called ```pandas```, which does wonders for data analysis and exploration.

## Menu
- <a href="#Introduction">Introduction to pandas</a>
- <a href="#read_csv">Reading files</a>
- <a href="#to_csv">Saving files</a>
- <a href="#joins">Joining DataFrames</a>
- <a href="#more">More tutorials</a>

<a id='Introduction'></a>
## Introduction to pandas
What is ```pandas``` and what is it used for? ```pandas``` is a popular Python package for data analysis and exploration. It is well suited to working with tabular data, and has all the right tools for assessing, cleaning and processing it. A data table is called a DataFrame and has the following basic structure, where the darker gray areas represent the index for rows and the labels for columns:<br>
<img src="https://pandas.pydata.org/docs/_images/01_table_dataframe.svg"/><br>
Using ```pandas``` is fairly simple if you're using Anaconda: the distribution already includes ```pandas``` in its package list, so you don't need to install it again. Otherwise, you only need to type ```pip install pandas``` in your terminal to install the package.<br><br>
Now, why use ```pandas``` instead of [other software]? Getting started with Python and/or ```pandas``` might seem like an uphill battle at first, but once you get used to it, you'll find that ```pandas``` is a powerful yet intuitive tool for data manipulation. Many of its methods and functions have equivalents in other programming languages or software such as R, SQL and Excel, so whatever you're already doing somewhere else, you'll probably be able to do with ```pandas``` as well! You can also visually assess your data and showcase your results in a Notebook like this one, so you won't need any additional tools. You can read the <a href="https://pandas.pydata.org/docs/getting_started/overview.html">package overview</a> to learn more about ```pandas```, but for now, we'll jump straight into coding!

<a id="read_csv"></a>
## Reading files
We'll start at the very beginning of any data analysis process: **We need to open the file.** To do this, ```pandas``` has the <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html">```read_csv```</a> function, which will let you read any delimited text file. The function has some key arguments which will let us do a lot of cool stuff, and we'll see some examples using a CSV file with information about medical appointments in Brazil (obtained from Kaggle and modified to use in this tutorial).

In [1]:
# First, we need to import pandas, using pd as a shortcut for future reference in code
import pandas as pd 

In [2]:
# Using the read_csv function to open the file
df = pd.read_csv("noshowappointments-kagglev2-may-2016.csv")
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,0.0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,0.0,0.0,0.0,0.0,0.0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0.0,0.0,0.0,0.0,0.0,0.0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,0.0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,0.0,No


This simple function allows us to get a DataFrame from a CSV file with just one line of code. The function always takes the file path as its first argument. You can also pass ```sep``` (or its alias ```delimiter```) to specify the character that separates the values. It is ```','``` by default, but you can set it to whatever the file uses, such as ```'\t'``` for TSV files or ```';'```.<br><br>
You can set the header row in the arguments of the function. The default is ```header=0```, which means the first row contains the column names, as is generally the case. If your headers are in the 2nd row for some reason *(I don't even want to know)*, or you don't have a header row, you can change the ```header``` argument to reflect this:
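To see the effect of the delimiter argument without needing extra files, here is a minimal sketch that reads small in-memory strings instead of the appointments file. The sample text and the variable names are made up for illustration; only ```pd.read_csv``` and ```sep``` come from the tutorial.

```python
import io
import pandas as pd

# Hypothetical in-memory "files", so the example runs without any data on disk
semicolon_text = "patient_id;age;no_show\n1;62;No\n2;56;Yes\n"
tab_text = "patient_id\tage\tno_show\n1\t62\tNo\n2\t56\tYes\n"

# Telling read_csv which delimiter each source uses
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

# With the default ',' the semicolon text is not split into separate columns
df_wrong = pd.read_csv(io.StringIO(semicolon_text))

print(df_semicolon.columns.tolist())
print(df_tab.columns.tolist())
print(df_wrong.shape)
```

If the shape of a freshly loaded DataFrame shows a single column, a mismatched delimiter is a likely cause.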

In [3]:
# Setting the second row as our headers row
df = pd.read_csv("noshowappointments-kagglev2-may-2016.csv", header=1)
df.head()

Unnamed: 0,29872499824296,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0.1,0.2,0.3,0.4,No
0,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,0.0,0.0,0.0,0.0,0.0,No
1,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0.0,0.0,0.0,0.0,0.0,0.0,No
2,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,0.0,No
3,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,0.0,No
4,95985130000000.0,5626772,F,2016-04-27T08:36:51Z,2016-04-29T00:00:00Z,76,REPÚBLICA,0.0,1.0,0.0,0.0,0.0,0.0,No


In [4]:
# Setting no header row
df = pd.read_csv("noshowappointments-kagglev2-may-2016.csv", header=None)
df.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
1,29872499824296,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
2,558997776694438,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
3,4262962299951,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
4,867951213174,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No


In this case, we get messy tables because our header row is, in fact, the first row, but you can see how you can set your rows as you want depending on how your original file is configured.<br><br>
You can ignore the original headers and just replace them with your own column names, like this:

In [5]:
# Creating a list with the new names, and setting header=0 to replace the original ones
column_names = ["patient_id", "appointment_id", "gender", "scheduled_date", "appointment_date", "age", "neighbourhood", "scholarship", "hipertension", "diabetes", "alcoholism", "handcap", "sms_received", "no_show"]
df = pd.read_csv("noshowappointments-kagglev2-may-2016.csv", header=0, names=column_names)
df.head()

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,0.0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,0.0,0.0,0.0,0.0,0.0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0.0,0.0,0.0,0.0,0.0,0.0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,0.0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,0.0,No


You can also skip lines if your file contains some text at the beginning *(again, don't want to know)* and you want to start reading from a specific row. Use the ```skiprows``` argument for this: an ```int``` skips that many lines from the top of the file, while a list of integers skips the lines at those positions (counting the original file's lines from 0).

In [6]:
# Skipping file lines 2 and 3 (counting from 0, so line 0 is the original header row)
df = pd.read_csv("noshowappointments-kagglev2-may-2016.csv", header=0, skiprows=[2,3], names=column_names)
df.head()

Unnamed: 0,patient_id,appointment_id,gender,scheduled_date,appointment_date,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0.0,1.0,0.0,0.0,0.0,0.0,No
1,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0.0,0.0,0.0,0.0,0.0,0.0,No
2,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0.0,1.0,1.0,0.0,0.0,0.0,No
3,95985130000000.0,5626772,F,2016-04-27T08:36:51Z,2016-04-29T00:00:00Z,76,REPÚBLICA,0.0,1.0,0.0,0.0,0.0,0.0,No
4,733688200000000.0,5630279,F,2016-04-27T15:05:12Z,2016-04-29T00:00:00Z,23,GOIABEIRAS,0.0,0.0,0.0,0.0,0.0,0.0,Yes
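
The cell above passes a list to ```skiprows```. As a complement, here is a small sketch of the ```int``` form, using a made-up in-memory file that starts with two lines of notes. The text content and variable names are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical file content: two lines of notes, then the real header and data
raw_text = (
    "Exported from a reporting tool\n"
    "Do not edit by hand\n"
    "patient_id,age,no_show\n"
    "1,62,No\n"
    "2,56,Yes\n"
)

# skiprows=2 drops the first two lines; header=0 then refers to the first remaining line
df_notes = pd.read_csv(io.StringIO(raw_text), skiprows=2, header=0)
print(df_notes)
```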


### xls packages
It's easier and faster to save your files to CSV if you're using Excel, but sometimes a file can't be converted cleanly because of its formatting, or you want to keep its separate sheets. The <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html">```read_excel```</a> function in ```pandas``` lets you read specific sheets and has a lot of useful arguments, which you can find in the documentation. Under the hood, ```pandas``` relies on separate packages to parse Excel files: <a href="https://openpyxl.readthedocs.io/en/stable/">```openpyxl```</a> for ```.xlsx``` files and <a href="https://xlrd.readthedocs.io/en/latest/">```xlrd```</a> for older ```.xls``` files. If you get a compatibility error with ```pd.read_excel```, or you want to open Excel files outside of ```pandas``` functions, you can import either package directly, read the file, and then convert it to a DataFrame to explore.

<a id="to_csv"></a>
## Saving files
Say we're happy with the little tweaks we just made and want to save this DataFrame as a new file. We can do that using the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html">```to_csv```</a> method. As with ```read_csv```, ```to_csv``` takes the path of the new file as its first argument, and you can set the delimiter with ```sep``` (unlike ```read_csv```, it has no ```delimiter``` alias). Note that DataFrames have an index that you might not want in your file, so set the ```index``` argument to ```False``` to leave it out (it's ```True``` by default).

In [7]:
# Saving the new file with a new delimiter
df.to_csv("new_file.csv", sep=";", index=False)
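
To check what the ```index``` argument changes, here is a minimal sketch with a tiny made-up DataFrame. It relies on ```to_csv``` returning the CSV text as a string when no path is given, so nothing is written to disk; the DataFrame contents and variable names are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical small DataFrame standing in for the appointments data
small_df = pd.DataFrame({"patient_id": [1, 2], "no_show": ["No", "Yes"]})

# With no path, to_csv returns the CSV text instead of writing a file
with_index = small_df.to_csv(sep=";")
without_index = small_df.to_csv(sep=";", index=False)

# The first header line starts with an empty field for the index; the second does not
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the text back needs the same separator that was used for writing
restored = pd.read_csv(io.StringIO(without_index), sep=";")
print(restored)
```

That empty leading field is what ```read_csv``` later shows as an ```Unnamed: 0``` column if the index was saved by accident.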

<a id="joins"></a>
## Joining DataFrames

We know how to open files, but it's not always as easy as having just one DataFrame. We might work with more than one, and often we'll want to join them to get a more complete DataFrame. We'll use two simple DataFrames to show how to join them using <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html">merge</a>.<br> There are also other methods you can use to join DataFrames. Check the examples and explanations in this <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html">documentation</a> to see how they differ and which works best for you.

In [9]:
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
df1

Unnamed: 0,a,b
0,foo,1
1,bar,2


In [10]:
df2

Unnamed: 0,a,c
0,foo,3
1,baz,4


With these DataFrames, you can use ```merge``` to perform a left join on a specified column, with the ```on``` argument:

In [11]:
df1.merge(df2, how='left', on='a')

Unnamed: 0,a,b,c
0,foo,1,3.0
1,bar,2,


You can change the ```how``` argument to change the join type.

In [12]:
df1.merge(df2, how='inner', on='a')

Unnamed: 0,a,b,c
0,foo,1,3
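
Left and inner joins are shown above. As one more illustration, here is a sketch of an outer join on the same two small DataFrames, rebuilt here under the names ```left``` and ```right``` so the block runs on its own. The ```indicator=True``` argument is an addition not used elsewhere in this tutorial; it adds a ```_merge``` column saying which side each row came from.

```python
import pandas as pd

# Same data as df1 and df2 above, rebuilt so this cell is self-contained
left = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
right = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})

# An outer join keeps keys from both sides; unmatched cells become NaN
outer = left.merge(right, how='outer', on='a', indicator=True)
print(outer)
```

Other accepted values for ```how``` include ```'right'``` and ```'cross'```.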


You can define on which left and right columns to perform the join, with the ```left_on``` and ```right_on``` arguments.

In [13]:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

In [14]:
df1

Unnamed: 0,lkey,value
0,foo,1
1,bar,2
2,baz,3
3,foo,5


In [15]:
df2

Unnamed: 0,rkey,value
0,foo,5
1,bar,6
2,baz,7
3,foo,8


In [16]:
df1.merge(df2, left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,value_x,rkey,value_y
0,foo,1,foo,5
1,foo,1,foo,8
2,foo,5,foo,5
3,foo,5,foo,8
4,bar,2,bar,6
5,baz,3,baz,7
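
Two things stand out in the result above. First, ```foo``` appears twice in each DataFrame, so every left ```foo``` row is paired with every right one, giving four ```foo``` rows. Second, both DataFrames have a ```value``` column, so ```pandas``` renames them with the default suffixes ```_x``` and ```_y```. As a sketch, you could pick more readable suffixes with the ```suffixes``` argument; the smaller DataFrames and the suffix names below are made up for illustration.

```python
import pandas as pd

# Smaller, hypothetical versions of the two DataFrames above
left = pd.DataFrame({'lkey': ['foo', 'bar'], 'value': [1, 2]})
right = pd.DataFrame({'rkey': ['foo', 'bar'], 'value': [5, 6]})

# suffixes replaces the default '_x' / '_y' on overlapping column names
merged = left.merge(right, left_on='lkey', right_on='rkey',
                    suffixes=('_left', '_right'))
print(merged.columns.tolist())
```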


<a id="more"></a>
## More tutorials

Now that we've covered the basics, you're ready to learn how to actually do stuff with your data using ```pandas```! Head over to the following tutorials to learn more about this awesome package:
- <a href="https://github.com/lona9/PythonTutorials/blob/master/Assessing%20with%20pandas.ipynb">Assessing with pandas</a>
- <a href="https://github.com/lona9/PythonTutorials/blob/master/Cleaning%20with%20pandas.ipynb">Cleaning with pandas</a>