# Introduction to Pandas
*by Jonathan Frawley*

We are going to look at parsing CSV datasets with a library called [Pandas](https://pandas.pydata.org/).

First, we import our dependencies:

In [1]:
import numpy as np
import pandas as pd
from io import StringIO

## Parsing CSV

Here, we read in a simple CSV file from a string:

In [2]:
csv_string_file = StringIO("""name,age,course
Muire,23,Computer Science
Seán,19,Computer Science
Saoirse,17,English
Niamh,19,Mathematics
""")
csv = pd.read_csv(csv_string_file)
csv

|   | name | age | course |
|---|------|-----|--------|
| 0 | Muire | 23 | Computer Science |
| 1 | Seán | 19 | Computer Science |
| 2 | Saoirse | 17 | English |
| 3 | Niamh | 19 | Mathematics |
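As a small sketch of what `read_csv` does with this data, the block below inspects the inferred column types and then re-reads the same text with the `index_col` parameter, so that rows can be looked up by name. The variable names `csv_text`, `students` and `by_name` are illustrative and do not appear in the original notebook.

```python
from io import StringIO

import pandas as pd

csv_text = """name,age,course
Muire,23,Computer Science
Seán,19,Computer Science
Saoirse,17,English
Niamh,19,Mathematics
"""

# read_csv infers a type per column: age is parsed as an integer column
students = pd.read_csv(StringIO(csv_text))
print(students.dtypes)

# index_col replaces the default 0..n-1 index with the values of a column
by_name = pd.read_csv(StringIO(csv_text), index_col='name')
print(by_name.loc['Niamh', 'age'])
```

Using a column as the index is optional; the default integer index is what the rest of this notebook relies on.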


Now, say we wanted all of the students who studied Computer Science:

In [4]:
csv[csv['course'] == 'Computer Science']

|   | name | age | course |
|---|------|-----|--------|
| 0 | Muire | 23 | Computer Science |
| 1 | Seán | 19 | Computer Science |
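The filter above works because the comparison produces a boolean Series that is then used to select rows. A minimal sketch of that mechanism, which also combines two conditions, might look like this; the names `students`, `is_cs` and `young_cs`, and the age threshold of 21, are illustrative assumptions.

```python
from io import StringIO

import pandas as pd

students = pd.read_csv(StringIO("""name,age,course
Muire,23,Computer Science
Seán,19,Computer Science
Saoirse,17,English
Niamh,19,Mathematics
"""))

# The comparison yields a boolean Series, one True/False per row
is_cs = students['course'] == 'Computer Science'

# Combine masks with & (and) or | (or); wrap each comparison in parentheses
young_cs = students[is_cs & (students['age'] < 21)]
print(young_cs['name'].tolist())
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why `&` and `|` are used here.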


Use `describe()` to get a summary of the numeric columns of your DataFrame:

In [7]:
csv.describe()

|   | age |
|---|-----|
| count | 4.0 |
| mean | 19.5 |
| std | 2.516611 |
| min | 17.0 |
| 25% | 18.5 |
| 50% | 19.0 |
| 75% | 20.0 |
| max | 23.0 |
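The summary returned by `describe()` is itself a DataFrame, so individual statistics can be read back out of it. The sketch below shows one way to do that and checks two entries against the corresponding Series methods; the names `students`, `summary` and `ages` are illustrative.

```python
from io import StringIO

import pandas as pd

students = pd.read_csv(StringIO("""name,age,course
Muire,23,Computer Science
Seán,19,Computer Science
Saoirse,17,English
Niamh,19,Mathematics
"""))

summary = students.describe()
ages = students['age']

# Rows of the summary are labelled by statistic, columns by source column
print(summary.loc['mean', 'age'])

# The 50% row is the median; std is the sample standard deviation
print(summary.loc['50%', 'age'] == ages.median())
print(summary.loc['std', 'age'] == ages.std())
```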


## Using groupby

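Get the average age of the students on each course:

In [11]:
# Select the numeric column explicitly; recent pandas versions raise an error
# when asked for the mean of the text column 'name'
csv.groupby('course')[['age']].mean()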

| course | age |
|--------|-----|
| Computer Science | 21 |
| English | 17 |
| Mathematics | 19 |
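A grouped column can report several statistics at once through `agg`. The following is a sketch under the same data as above; the choice of statistics and the names `students` and `stats` are illustrative.

```python
from io import StringIO

import pandas as pd

students = pd.read_csv(StringIO("""name,age,course
Muire,23,Computer Science
Seán,19,Computer Science
Saoirse,17,English
Niamh,19,Mathematics
"""))

# One row per course, one column per requested statistic
stats = students.groupby('course')['age'].agg(['mean', 'min', 'max', 'count'])
print(stats)

# Individual cells are addressed by course name and statistic name
print(stats.loc['Computer Science', 'mean'])
```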


## Sorting data

Here, we read in a second CSV file, this time listing items and their prices:

In [13]:
csv_string_file = StringIO("""item_name,item_price
Hammer,1.90
Nail,0.20
Scissors,3.00
Lawnmower,99.90
""")
csv = pd.read_csv(csv_string_file)
csv

|   | item_name | item_price |
|---|-----------|------------|
| 0 | Hammer | 1.9 |
| 1 | Nail | 0.2 |
| 2 | Scissors | 3.0 |
| 3 | Lawnmower | 99.9 |


In [14]:
csv.sort_values('item_price')

|   | item_name | item_price |
|---|-----------|------------|
| 1 | Nail | 0.2 |
| 0 | Hammer | 1.9 |
| 2 | Scissors | 3.0 |
| 3 | Lawnmower | 99.9 |


In [15]:
csv.sort_values('item_price', ascending=False)

|   | item_name | item_price |
|---|-----------|------------|
| 3 | Lawnmower | 99.9 |
| 2 | Scissors | 3.0 |
| 0 | Hammer | 1.9 |
| 1 | Nail | 0.2 |
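Note that the sorted outputs keep their original index labels. A short sketch of what that implies: `sort_values` returns a new DataFrame and leaves the original untouched, and `reset_index(drop=True)` can renumber the rows if the old labels are not needed. The names `items`, `by_price` and `renumbered` are illustrative.

```python
from io import StringIO

import pandas as pd

items = pd.read_csv(StringIO("""item_name,item_price
Hammer,1.90
Nail,0.20
Scissors,3.00
Lawnmower,99.90
"""))

# sort_values returns a sorted copy; the original order is unchanged
by_price = items.sort_values('item_price', ascending=False)
print(items.index.tolist())
print(by_price.index.tolist())

# drop=True discards the old labels instead of keeping them as a column
renumbered = by_price.reset_index(drop=True)
print(renumbered)
```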


## Getting a column's values
Get the name of the most expensive item:

In [16]:
csv.sort_values('item_price', ascending=False)['item_name'].iloc[0]

'Lawnmower'

## Getting an index

Get the index label of the most expensive item:

In [18]:
csv.sort_values('item_price', ascending=False).index[0]

3
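The label returned above can be passed to `.loc` to fetch the whole row. As a sketch of an alternative that avoids sorting, `idxmax()` returns the label of the largest value directly; the names `items`, `top_label` and `row` are illustrative.

```python
from io import StringIO

import pandas as pd

items = pd.read_csv(StringIO("""item_name,item_price
Hammer,1.90
Nail,0.20
Scissors,3.00
Lawnmower,99.90
"""))

# Label of the most expensive item, found by sorting as in the cell above
top_label = items.sort_values('item_price', ascending=False).index[0]

# idxmax gives the same label without sorting the whole frame
print(top_label == items['item_price'].idxmax())

# .loc looks a row up by its index label
row = items.loc[top_label]
print(row['item_name'], row['item_price'])
```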

## Plotting Data

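Plot the cumulative sums of four columns of random data against a date index:

In [22]:
import matplotlib.pyplot as plt

# Assumed index: 1000 consecutive days (the original `ts` was never defined)
index = pd.date_range('2000-01-01', periods=1000)
df = pd.DataFrame(np.random.randn(1000, 4),
                  index=index, columns=list('ABCD'))

# Cumulative sums turn the random noise into four random walks
df = df.cumsum()

df.plot()
plt.show()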

## Resources
 * [10 minutes intro to Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)