Introduction

In this micro-course, you'll learn all about pandas, the most popular Python library for data analysis.

Along the way, you'll complete several hands-on exercises with real-world data. We recommend that you work on the exercises while reading the corresponding tutorials.

In this tutorial, you will learn how to create your own data, along with how to work with data that already exists.

In [3]:
import pandas as pd

Two core objects in pandas: the DataFrame and the Series

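*DataFrame*

A DataFrame is a table. Each entry holds a value and sits at the intersection of one row and one column.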

In [3]:
# here's a DataFrame built from a dict of lists
pd.DataFrame({'Yes':[50,21], 'No':[131,2]})

,Yes,No
0,50,131
1,21,2


In [10]:
# each person's name is a column name
# all lists must be the same length

pd.DataFrame({
    'Bob':['I Liked it.', 'It was awful.', 'Meh.'],
    'Sue':['Pretty good.','Bland.', 'Ok.'],
    'Pete':['Nice.','Plain.', 'Eh.'],
    'Jan':['Acceptable.','Horrible.', 'Blegh.'],
    'Jim':['Not bad.','Gross!.', 'Heh.']
})

,Bob,Sue,Pete,Jan,Jim
0,I Liked it.,Pretty good.,Nice.,Acceptable.,Not bad.
1,It was awful.,Bland.,Plain.,Horrible.,Gross!.
2,Meh.,Ok.,Eh.,Blegh.,Heh.
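
The equal-length rule noted in the cell above can be checked directly. The following is a minimal sketch using made-up reviewer data (the `mismatched` dict is hypothetical); it assumes only that pandas raises a `ValueError` when the lists in the dict differ in length.

```python
import pandas as pd

# Hypothetical data: 'Bob' has three reviews but 'Sue' has only two.
mismatched = {
    'Bob': ['I liked it.', 'It was awful.', 'Meh.'],
    'Sue': ['Pretty good.', 'Bland.'],
}

try:
    pd.DataFrame(mismatched)
    error_raised = False
except ValueError as err:
    # pandas refuses to build a table from columns of unequal length
    error_raised = True
    print(f"ValueError: {err}")
```

If a column really is shorter, one option is to pad it with `None` so that every list has the same number of entries.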


In [11]:
# assign row labels with the index parameter

pd.DataFrame({
    'Bob':['I Liked it.', 'It was awful.', 'Meh.'],
    'Sue':['Pretty good.','Bland.', 'Ok.'],
    'Pete':['Nice.','Plain.', 'Eh.'],
    'Jan':['Acceptable.','Horrible.', 'Blegh.'],
    'Jim':['Not bad.','Gross!.', 'Heh.']
},
index=['Product A', 'Product B', 'Product C'])

,Bob,Sue,Pete,Jan,Jim
Product A,I Liked it.,Pretty good.,Nice.,Acceptable.,Not bad.
Product B,It was awful.,Bland.,Plain.,Horrible.,Gross!.
Product C,Meh.,Ok.,Eh.,Blegh.,Heh.
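
To confirm where the labels end up, the sketch below builds a smaller version of the same table (the two-reviewer data is a shortened, illustrative copy) and inspects its `index` and `columns` attributes. The values passed as `index=` become the row labels, and the dict keys become the column labels.

```python
import pandas as pd

# Shortened, illustrative copy of the reviews table.
reviews = pd.DataFrame(
    {'Bob': ['I liked it.', 'It was awful.'],
     'Sue': ['Pretty good.', 'Bland.']},
    index=['Product A', 'Product B'],
)

row_labels = list(reviews.index)        # labels supplied via index=
column_labels = list(reviews.columns)   # keys of the dict
print(row_labels)
print(column_labels)
```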


*Series*

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:

In [14]:
# create a Series from a list of numbers

pd.Series([1,2,3,4,5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it has only one overall *name*:

In [15]:
pd.Series([30,35,40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64
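
The claim that a Series is essentially one column of a DataFrame can be illustrated by pulling a column out of a small table. This is a sketch under assumptions: the `sales` frame and the 'Product B' figures are invented for the example, and bracket selection (`sales['Product A']`) is used here as one common way to get a column.

```python
import pandas as pd

# Illustrative table; the 'Product B' numbers are made up.
sales = pd.DataFrame(
    {'Product A': [30, 35, 40], 'Product B': [12, 18, 25]},
    index=['2015 Sales', '2016 Sales', '2017 Sales'],
)

column = sales['Product A']    # selecting one column yields a Series
print(type(column).__name__)   # the class of the selected object
print(column.name)             # the Series name is the column label
print(list(column.index))      # the row labels carry over
```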

Reading data files (CSV example)

In [5]:

# load data file
wine_reviews = pd.read_csv("winemag-data-130k-v2.csv")

# shape is a (rows, columns) tuple; a cell displays only its last expression
wine_reviews.shape

# preview the first five rows
wine_reviews.head()


,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
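
Because a notebook cell displays only its last expression, `shape` is easiest to inspect by printing it. The sketch below uses a small invented frame (`toy`, standing in for the wine data) to show that `shape` is a `(rows, columns)` tuple and that `head()` returns the first five rows by default.

```python
import pandas as pd

# Hypothetical stand-in for the wine data: 8 rows, 2 columns.
toy = pd.DataFrame({
    'points': [87, 87, 87, 87, 87, 88, 90, 91],
    'price': [None, 15.0, 14.0, 13.0, 65.0, 20.0, 30.0, 45.0],
})

n_rows, n_cols = toy.shape   # unpack the (rows, columns) tuple
print(toy.shape)

preview = toy.head()         # first five rows by default
print(preview)
```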


In [6]:
# load the same data, but this time use the file's first column as the index

wine_reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)

wine_reviews.head()

,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
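
The effect of `index_col=0` can be reproduced without the wine file. The sketch below reads a tiny invented CSV from memory (`csv_text` is hypothetical) whose first column has an empty header, which is what a saved index looks like. Read with the defaults, that column shows up as `Unnamed: 0`; with `index_col=0`, it becomes the index instead.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first column has an empty header,
# as happens when a DataFrame is saved together with its index.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n2,US,87\n"

default_read = pd.read_csv(io.StringIO(csv_text))
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))   # includes the extra 'Unnamed: 0' column
print(list(indexed_read.columns))   # first column was used as the index
```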


Tutorial complete - now for the exercise

In [7]:
import pandas as pd
pd.set_option('display.max_rows', 5)

In [9]:
# create a fruits dataframe
fruits = pd.DataFrame({'Apples':[30], 'Bananas': [21]})

fruits

,Apples,Bananas
0,30,21


In [18]:
# create a fruits dataframe
# with a custom index
fruit_sales = pd.DataFrame({'Apples':[35,41], 'Bananas': [21,34]}, index=['2017 Sales', '2018 Sales'])
fruit_sales

,Apples,Bananas
2017 Sales,35,21
2018 Sales,41,34


Create a variable ingredients with a Series that looks like:

Flour     4 cups
Milk       1 cup
Eggs     2 large
Spam       1 can
Name: Dinner, dtype: object

In [21]:
ingredients = pd.Series(['4 cups', '1 cup', '2 large', '1 can'], index=['Flour', 'Milk', 'Eggs', 'Spam'], name='Dinner')
ingredients

Flour     4 cups
Milk       1 cup
Eggs     2 large
Spam       1 can
Name: Dinner, dtype: object

In [22]:
animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]}, index=['Year 1', 'Year 2'])
animals

,Cows,Goats
Year 1,12,22
Year 2,20,19


In [23]:
# save the DataFrame to disk as a CSV file
animals.to_csv('cows_and_goats.csv')
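
One way to check a saved file is to read it back. The sketch below writes the same `animals` table to a temporary directory (an assumption made so the example does not leave files behind) and reloads it with `index_col=0`, so that the saved row labels become the index again.

```python
import os
import tempfile
import pandas as pd

animals = pd.DataFrame({'Cows': [12, 20], 'Goats': [22, 19]},
                       index=['Year 1', 'Year 2'])

# Write to a throwaway directory, then read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cows_and_goats.csv')
    animals.to_csv(path)                        # the index is saved as the first column
    restored = pd.read_csv(path, index_col=0)   # use that column as the index again

print(restored)
```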