# <font color='#eb3483'> Introduction to Pandas </font>

Pandas is a data analysis library built on top of NumPy. Among many other things, it provides a very useful data structure called a `DataFrame`. A `pandas.DataFrame` is a table with rows and columns, similar to an Excel spreadsheet. If you have experience with the R programming language, the `pandas.DataFrame` is very similar to an R `data.frame`.

http://pandas.pydata.org/

The standard way of importing pandas is:

In [1]:
import pandas as pd

In this notebook we will cover:
<font color='#eb3483'>
1. Dataframes
1. Reading and writing a data frame
1. Inspecting a data frame
1. Indexing

    </font>


##  <font color='#eb3483'> 1. Building DataFrames </font>

There are many ways to create a dataframe.

In [2]:
# We can pass a 2D list (one inner list per row) and the column names to pd.DataFrame()
rick_morty = pd.DataFrame(
    [
        ["Rick", "Sanchez", 60],
        ["Morty", "Smith", 14]
    ], columns = ["first_name", "last_name", "age"]
)
rick_morty

Unnamed: 0,first_name,last_name,age
0,Rick,Sanchez,60
1,Morty,Smith,14
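
Another common way to build the same table is from a dictionary of columns. The block below is a minimal sketch of that alternative; the variable name `rick_morty_from_dict` is made up for this illustration and is not used elsewhere in the notebook.

```python
import pandas as pd

# Each dictionary key becomes a column name;
# each list holds the values of that column, top to bottom.
rick_morty_from_dict = pd.DataFrame(
    {
        "first_name": ["Rick", "Morty"],
        "last_name": ["Sanchez", "Smith"],
        "age": [60, 14],
    }
)
print(rick_morty_from_dict)
```

Which form to use mostly depends on how the data arrives: row by row (list of lists) or column by column (dictionary).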


In [3]:
type(rick_morty)

pandas.core.frame.DataFrame

In [4]:
# We can take a peek at our dataframe using the built-in methods .head() and .tail()

a = rick_morty.head()
b = rick_morty.tail(1)

print(a)
print(b)

  first_name last_name  age
0       Rick   Sanchez   60
1      Morty     Smith   14
  first_name last_name  age
1      Morty     Smith   14


In [5]:
# We can also ask for a specific row by referring to its index label:
print(rick_morty.loc[0])
print(rick_morty.loc[1])

first_name       Rick
last_name     Sanchez
age                60
Name: 0, dtype: object
first_name    Morty
last_name     Smith
age              14
Name: 1, dtype: object
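
Note that `.loc` looks rows up by index *label*, which only coincides with the row position here because the default index is 0, 1, 2, ... The sketch below uses a small made-up frame with string labels to show the difference between `.loc` (label) and `.iloc` (position).

```python
import pandas as pd

# Hypothetical frame with non-default index labels "a" and "b".
df = pd.DataFrame(
    {"first_name": ["Rick", "Morty"], "age": [60, 14]},
    index=["a", "b"],
)

by_label = df.loc["b"]     # the row whose index label is "b"
by_position = df.iloc[1]   # the second row, whatever its label is

print(by_label)
print(by_position)
```

Both lookups return the same row in this example, but `df.loc[1]` would raise a `KeyError` because no row is labelled `1`.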


We can create an empty dataframe

In [6]:
df3 = pd.DataFrame()


In [7]:
#It's as you would expect ... empty
print(df3)


Empty DataFrame
Columns: []
Index: []


Now let's add columns to the empty dataframe

In [8]:
df3['Colours'] = ["Red", "Yellow", "Blue"]
print(df3)

  Colours
0     Red
1  Yellow
2    Blue


In [9]:
# Assign a list of numbers to a new column 'Number' in df3

df3['Number']= [28,29, 30]
print(df3)

  Colours  Number
0     Red      28
1  Yellow      29
2    Blue      30
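
Two details of column assignment are worth knowing. The sketch below rebuilds the same small table under the name `df` (so it can run on its own) and adds a hypothetical `Doubled` column computed from an existing one; it also shows that a list of the wrong length is rejected.

```python
import pandas as pd

df = pd.DataFrame({"Colours": ["Red", "Yellow", "Blue"], "Number": [28, 29, 30]})

# A new column can be computed from an existing column.
df["Doubled"] = df["Number"] * 2

# A plain list must have exactly one value per row;
# otherwise pandas raises a ValueError.
try:
    df["Too_short"] = [1, 2]
    length_error = None
except ValueError as err:
    length_error = err

print(df)
print(type(length_error).__name__)
```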


We can see the column names with `.columns`

In [10]:
# List the column names
df3.columns

Index(['Colours', 'Number'], dtype='object')

We can see the values of a column

In [11]:
df3["Colours"]

0       Red
1    Yellow
2      Blue
Name: Colours, dtype: object

We can also sort our dataframe by our columns

In [12]:
df3.sort_values(by="Number", ascending=True)

Unnamed: 0,Colours,Number
0,Red,28
1,Yellow,29
2,Blue,30


In [13]:
# Sort by the Number column.
# To sort in descending order instead, pass ascending=False.
df3.sort_values(by="Number", ascending=False)

Unnamed: 0,Colours,Number
2,Blue,30
1,Yellow,29
0,Red,28
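
One thing that is easy to miss: `sort_values` returns a new, sorted dataframe and leaves the original untouched. The sketch below rebuilds the small table as `df` and stores the result in a hypothetical variable `sorted_df` to make that visible.

```python
import pandas as pd

df = pd.DataFrame({"Colours": ["Red", "Yellow", "Blue"], "Number": [28, 29, 30]})

sorted_df = df.sort_values(by="Number", ascending=False)

# The original keeps its order; only the returned frame is sorted.
print(list(df["Number"]))
print(list(sorted_df["Number"]))

# Each row keeps its original index label after sorting.
print(list(sorted_df.index))
```

To keep the sorted version, assign the result back, for example `df = df.sort_values(by="Number")`.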


How do we get help? Let's try a few things.
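
One way to look things up without leaving Python is to inspect a method's signature and docstring. This is only a sketch using the standard library's `inspect` module; `sort_values` is used as the example because it appears above.

```python
import inspect

import pandas as pd

# The signature lists the arguments a method accepts.
signature = inspect.signature(pd.DataFrame.sort_values)
print(signature)

# The first line of the docstring is a one-sentence summary.
summary_line = inspect.getdoc(pd.DataFrame.sort_values).splitlines()[0]
print(summary_line)
```

In a notebook, `help(pd.DataFrame.sort_values)` or `pd.DataFrame.sort_values?` displays the full documentation.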

Selecting a column that does not exist will raise a `KeyError` (the same error as when selecting a missing key in a dictionary)

In [14]:

df3["Black"]

KeyError: 'Black'
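
If we are not sure whether a column exists, there are a few ways to avoid the error. The sketch below rebuilds the small table as `df` and shows three options; the variable names are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"Colours": ["Red", "Yellow", "Blue"], "Number": [28, 29, 30]})

# Option 1: check for the column name first.
has_black = "Black" in df.columns

# Option 2: .get() returns a default (None here) instead of raising.
black = df.get("Black")

# Option 3: catch the KeyError explicitly.
try:
    df["Black"]
    caught = False
except KeyError:
    caught = True

print(has_black, black, caught)
```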

## <font color='#eb3483'> 2. Reading/Writing data with dataframes </font>

It's not very often that we have to create our own data frame, but now we know how, just in case.  
Pandas can import from and export to many file types, including CSV, JSON and Excel.

For example, we can read a CSV file containing information about the Avengers (taken from [here](https://github.com/fivethirtyeight/data/tree/master/avengers))

In [None]:
# read in data
avengers = pd.read_csv("data/avengers.csv")
print(avengers)

                                                   URL  \
0        http://marvel.wikia.com/Henry_Pym_(Earth-616)   
1    http://marvel.wikia.com/Janet_van_Dyne_(Earth-...   
2    http://marvel.wikia.com/Anthony_Stark_(Earth-616)   
3    http://marvel.wikia.com/Robert_Bruce_Banner_(E...   
4     http://marvel.wikia.com/Thor_Odinson_(Earth-616)   
..                                                 ...   
168   http://marvel.wikia.com/Eric_Brooks_(Earth-616)#   
169  http://marvel.wikia.com/Adam_Brashear_(Earth-6...   
170  http://marvel.wikia.com/Victor_Alvarez_(Earth-...   
171     http://marvel.wikia.com/Ava_Ayala_(Earth-616)#   
172         http://marvel.wikia.com/Kaluu_(Earth-616)#   

                            name  appearances current  gender  starting_date  \
0      Henry Jonathan "Hank" Pym         1269     YES    MALE           1963   
1                 Janet van Dyne         1165     YES  FEMALE           1963   
2    Anthony Edward "Tony" Stark         3068     YES    MALE  

In [None]:
# display the data frame (for a large frame, only the first and last rows are shown)
avengers

Unnamed: 0,URL,name,appearances,current,gender,starting_date,notes
0,http://marvel.wikia.com/Henry_Pym_(Earth-616),"Henry Jonathan ""Hank"" Pym",1269,YES,MALE,1963,Merged with Ultron in Rage of Ultron Vol. 1. A...
1,http://marvel.wikia.com/Janet_van_Dyne_(Earth-...,Janet van Dyne,1165,YES,FEMALE,1963,Dies in Secret Invasion V1:I8. Actually was se...
2,http://marvel.wikia.com/Anthony_Stark_(Earth-616),"Anthony Edward ""Tony"" Stark",3068,YES,MALE,1963,"Death: ""Later while under the influence of Imm..."
3,http://marvel.wikia.com/Robert_Bruce_Banner_(E...,Robert Bruce Banner,2089,YES,MALE,1963,"Dies in Ghosts of the Future arc. However ""he ..."
4,http://marvel.wikia.com/Thor_Odinson_(Earth-616),Thor Odinson,2402,YES,MALE,1963,Dies in Fear Itself brought back because that'...
...,...,...,...,...,...,...,...
168,http://marvel.wikia.com/Eric_Brooks_(Earth-616)#,Eric Brooks,198,YES,MALE,2013,
169,http://marvel.wikia.com/Adam_Brashear_(Earth-6...,Adam Brashear,29,YES,MALE,2014,
170,http://marvel.wikia.com/Victor_Alvarez_(Earth-...,Victor Alvarez,45,YES,MALE,2014,
171,http://marvel.wikia.com/Ava_Ayala_(Earth-616)#,Ava Ayala,49,YES,FEMALE,2014,


In [None]:
avengers.shape

(173, 7)

We can save the dataframe back to a CSV file with `to_csv`. By default this method writes the index as an extra column; we can avoid this by passing the argument `index=False`.

In [None]:
avengers.to_csv("avengers2.csv", index=False)

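We can also export to Excel using `to_excel`. Writing the legacy `.xls` format requires a separate package, `xlwt`, so the call below fails when that package is not installed. (Newer pandas releases no longer write `.xls` at all; the `.xlsx` format is written with the `openpyxl` package instead.)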

In [None]:
avengers.to_excel("avengers.xls", index=False)

  avengers.to_excel("avengers.xls", index=False)


ModuleNotFoundError: No module named 'xlwt'

Likewise, we can easily read from an Excel file (reading `.xls` files requires the package `xlrd`)

In [None]:
# read back the file written above
avengers_reloaded = pd.read_excel("avengers.xls")

Bothered by an extra `Unnamed: 0` column after reading a file back in? It is the old index, which was written to the file as an unnamed column when the data was saved. To avoid saving it, use `index=False` when saving: `avengers.to_csv("avengers2.csv", index=False)`.
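
The effect is easy to reproduce on a small made-up table. The sketch below avoids touching the disk: when `to_csv` is called without a path it returns the CSV text, which we read back through `io.StringIO`.

```python
import io

import pandas as pd

df = pd.DataFrame({"first_name": ["Rick", "Morty"], "age": [60, 14]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

reloaded_with_index = pd.read_csv(io.StringIO(with_index))
reloaded_without_index = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))

# If the file was already saved with its index, index_col=0 tells
# read_csv to use that first column as the index again.
repaired = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(repaired.columns))
```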

## <font color='#eb3483'> 3. Inspecting a dataframe </font>

Once we have read in a data frame, we generally have a quick look around (just as you would do in Excel).




We can see the first rows of a dataframe with `head()`

In [None]:
avengers.head()

Unnamed: 0,URL,name,appearances,current,gender,starting_date,notes
0,http://marvel.wikia.com/Henry_Pym_(Earth-616),"Henry Jonathan ""Hank"" Pym",1269,YES,MALE,1963,Merged with Ultron in Rage of Ultron Vol. 1. A...
1,http://marvel.wikia.com/Janet_van_Dyne_(Earth-...,Janet van Dyne,1165,YES,FEMALE,1963,Dies in Secret Invasion V1:I8. Actually was se...
2,http://marvel.wikia.com/Anthony_Stark_(Earth-616),"Anthony Edward ""Tony"" Stark",3068,YES,MALE,1963,"Death: ""Later while under the influence of Imm..."
3,http://marvel.wikia.com/Robert_Bruce_Banner_(E...,Robert Bruce Banner,2089,YES,MALE,1963,"Dies in Ghosts of the Future arc. However ""he ..."
4,http://marvel.wikia.com/Thor_Odinson_(Earth-616),Thor Odinson,2402,YES,MALE,1963,Dies in Fear Itself brought back because that'...


and the last ones with `tail()`

In [None]:
avengers.tail(10)

Unnamed: 0,URL,name,appearances,current,gender,starting_date,notes
163,http://marvel.wikia.com/Tony_Masters_(Earth-616)#,Tony Masters,173,NO,MALE,2013,
164,http://marvel.wikia.com/Victor_Mancha_(Earth-6...,Victor Mancha,75,YES,MALE,2013,Died in Avengers_A.I._Vol_1_4. Returned in Ave...
165,http://marvel.wikia.com/Monica_Chang_(Earth-616)#,Monica Chang,12,YES,FEMALE,2013,
166,http://marvel.wikia.com/Doombot_(Avenger)_(Ear...,,14,YES,MALE,2013,
167,http://marvel.wikia.com/Alexis_(Earth-616)#,Alexis,13,YES,FEMALE,2013,
168,http://marvel.wikia.com/Eric_Brooks_(Earth-616)#,Eric Brooks,198,YES,MALE,2013,
169,http://marvel.wikia.com/Adam_Brashear_(Earth-6...,Adam Brashear,29,YES,MALE,2014,
170,http://marvel.wikia.com/Victor_Alvarez_(Earth-...,Victor Alvarez,45,YES,MALE,2014,
171,http://marvel.wikia.com/Ava_Ayala_(Earth-616)#,Ava Ayala,49,YES,FEMALE,2014,
172,http://marvel.wikia.com/Kaluu_(Earth-616)#,Kaluu,35,YES,MALE,2015,


We can see the size of a dataframe (n_rows, n_columns) with `shape`

In [None]:
avengers.shape

(173, 7)

We can see the data type of each column with `dtypes`

In [None]:
avengers.dtypes


URL              object
name             object
appearances       int64
current          object
gender           object
starting_date     int64
notes            object
dtype: object
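
Data types matter because many operations only make sense for numeric columns. As a sketch, the block below builds a two-row frame using values from the table above and shows two related tools: `select_dtypes` to pick columns by type and `astype` to convert a column.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Thor Odinson", "Kaluu"],
        "appearances": [2402, 35],
        "current": ["YES", "YES"],
    }
)

# Keep only the numeric columns.
numeric_only = df.select_dtypes(include="number")
print(list(numeric_only.columns))

# Convert a column to another type.
df["appearances"] = df["appearances"].astype("float64")
print(df.dtypes)
```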

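We can select a range of rows by index label with `.loc`. Unlike ordinary Python slices, both endpoints are included.

In [None]: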
print(avengers.loc[110:115])

                                                   URL            name  \
110    http://marvel.wikia.com/Maria_Hill_(Earth-616)#      Maria Hill   
111  http://marvel.wikia.com/Robert_Baldwin_(Earth-...  Robbie Baldwin   
112  http://marvel.wikia.com/Sharon_Carter_(Earth-6...   Sharon Carter   
113  http://marvel.wikia.com/Eric_O%27Grady_(Earth-...    Eric O'Grady   
114    http://marvel.wikia.com/Brunnhilde_(Earth-616)#      Brunnhilde   
115  http://marvel.wikia.com/Richard_Rider_(Earth-6...   Richard Rider   

     appearances current  gender  starting_date  \
110          359     YES  FEMALE           2010   
111          299      NO    MALE           2010   
112          333      NO  FEMALE           2010   
113           88      NO    MALE           2010   
114          369      NO  FEMALE           2010   
115          380      NO    MALE           2010   

                                                 notes  
110                                                NaN  
111      

We can look at the column names using `.columns`

In [None]:
avengers.columns

Index(['URL', 'name', 'appearances', 'current', 'gender', 'starting_date',
       'notes'],
      dtype='object')

We can use `.describe()` to find statistical information about the dataframe's columns.

In [None]:
avengers.describe()

Unnamed: 0,appearances,starting_date
count,173.0,173.0
mean,414.052023,1988.445087
std,677.99195,30.374669
min,2.0,1900.0
25%,58.0,1979.0
50%,132.0,1996.0
75%,491.0,2010.0
max,4333.0,2015.0
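
By default `describe()` only summarises numeric columns, which is why the text columns are missing above. The sketch below uses a small made-up frame to show how individual statistics can be read out of the result, and how `include="all"` brings the text columns back.

```python
import pandas as pd

# Hypothetical data, chosen so the statistics are easy to check by hand.
df = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "appearances": [10, 20, 30, 40]}
)

summary = df.describe()

# The result is itself a dataframe, so .loc works on it.
mean_appearances = summary.loc["mean", "appearances"]
median_appearances = summary.loc["50%", "appearances"]
print(mean_appearances, median_appearances)

# include="all" also summarises non-numeric columns.
full_summary = df.describe(include="all")
print(list(full_summary.columns))
```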


<hr>

## <font color='#eb3483'> 4. Indexes </font>

Dataframes have an index that allows us to perform complex data manipulations. By default, the index is the row number.

In [None]:
avengers.index

RangeIndex(start=0, stop=173, step=1)

We can change the index to one of the columns using `set_index()`

In [None]:
# let's change the index to gender
avengers = avengers.set_index('gender')

In [None]:
# reset the index so that gender becomes an ordinary column again
avengers = avengers.reset_index()

In [None]:
# make appearances the index
avengers = avengers.set_index('appearances')
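
To see what `set_index` and `reset_index` do to the columns, here is a sketch on a two-row frame built from values in the table above. The variable names (`by_name`, `restored`, `lost`) are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Thor Odinson", "Kaluu"],
        "gender": ["MALE", "MALE"],
        "appearances": [2402, 35],
    }
)

# set_index moves a column out of the columns and into the index.
by_name = df.set_index("name")
print(list(by_name.columns))

# Rows can now be looked up by that label.
thor = by_name.loc["Thor Odinson"]
print(thor["appearances"])

# reset_index turns the index back into an ordinary column.
restored = by_name.reset_index()
print(list(restored.columns))

# Setting a second index without resetting first discards the old one.
lost = df.set_index("gender").set_index("appearances")
print(list(lost.columns))
```

The last step is why resetting the index before choosing a new one matters: otherwise the previous index column is dropped from the dataframe.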


We can also sort our dataframe by the index using the `sort_index` method

In [None]:
# note the parentheses: without them we only get a reference to the method, not a sorted dataframe
avengers.sort_index()

In [None]:
# to make it descending we just add the ascending=False argument
avengers.sort_index(ascending=False)
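
As a small self-contained sketch of both sort directions, the block below builds a three-row frame from values in the table above and indexes it by `appearances`. The variable names are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Thor Odinson", "Kaluu", "Ava Ayala"],
        "appearances": [2402, 35, 49],
    }
).set_index("appearances")

ascending_df = df.sort_index()
descending_df = df.sort_index(ascending=False)

print(list(ascending_df.index))
print(list(descending_df.index))

# Like sort_values, sort_index returns a new frame; df is unchanged.
print(list(df.index))
```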
