# Introduction to Pandas dataframes

### Introduction

As we saw previously, before feeding our data into a machine learning model, we need to properly format our training data.

Let's officially begin our journey into pandas so that we can learn how to properly format our data.

### Gathering our data

In [2]:
import pandas as pd
movies_df = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv')

> Press `shift + return` on the cell above.

What we have just created is called a pandas dataframe.

In [3]:
type(movies_df)

pandas.core.frame.DataFrame

A pandas dataframe is essentially a table of data, and we can slice the first few rows of the dataframe just as we would slice a list.

In [4]:
movies_df[:3]

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013$,domgross_2013$,intgross_2013$,period code,decade code
0,2013,tt1711425,21 &amp; Over,notalk,notalk,FAIL,13000000,25682380.0,42195766.0,2013FAIL,13000000,25682380.0,42195766.0,1.0,1.0
1,2012,tt1343727,Dredd 3D,ok-disagree,ok,PASS,45000000,13414714.0,40868994.0,2012PASS,45658735,13611086.0,41467257.0,1.0,1.0
2,2013,tt2024544,12 Years a Slave,notalk-disagree,notalk,FAIL,20000000,53107035.0,158607035.0,2013FAIL,20000000,53107035.0,158607035.0,1.0,1.0


We can think of our dataframe as consisting of rows and columns.

A dataframe is like a list of dictionaries, where each row is a dictionary. In fact, we can convert our dataframe to just that.

In [20]:
movie_records = movies_df.to_dict('records')
movie_records[:2]

[{'year': 2013,
  'imdb': 'tt1711425',
  'title': '21 &amp; Over',
  'test': 'notalk',
  'clean_test': 'notalk',
  'binary': 'FAIL',
  'budget': 13000000,
  'domgross': 25682380.0,
  'intgross': 42195766.0,
  'code': '2013FAIL',
  'budget_2013$': 13000000,
  'domgross_2013$': 25682380.0,
  'intgross_2013$': 42195766.0,
  'period code': 1.0,
  'decade code': 1.0},
 {'year': 2012,
  'imdb': 'tt1343727',
  'title': 'Dredd 3D',
  'test': 'ok-disagree',
  'clean_test': 'ok',
  'binary': 'PASS',
  'budget': 45000000,
  'domgross': 13414714.0,
  'intgross': 40868994.0,
  'code': '2012PASS',
  'budget_2013$': 45658735,
  'domgross_2013$': 13611086.0,
  'intgross_2013$': 41467257.0,
  'period code': 1.0,
  'decade code': 1.0}]
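The conversion also works in the other direction. As a minimal sketch, the hand-written `records` list below (a few columns copied from the output above, not loaded from the CSV) is turned into a small dataframe and then back into a list of dictionaries. The names `records`, `small_df`, and `round_trip` are illustrative only.

```python
import pandas as pd

# Two hand-written records using a few of the columns shown above
records = [
    {"year": 2013, "title": "2 Guns", "binary": "FAIL"},
    {"year": 2012, "title": "Dredd 3D", "binary": "PASS"},
]

small_df = pd.DataFrame(records)          # each dictionary becomes a row
round_trip = small_df.to_dict("records")  # each row becomes a dictionary again

print(small_df.columns.tolist())  # the dictionary keys become the column names
print(round_trip == records)      # the round trip gives back the same records
```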

Another way to think of our dataframe is as a list of lists, which we can see by calling `to_numpy` (technically, we get a two-dimensional NumPy array rather than a list of lists).

In [21]:
movies_df.to_numpy()

array([[2013, 'tt1711425', '21 &amp; Over', ..., 42195766.0, 1.0, 1.0],
       [2012, 'tt1343727', 'Dredd 3D', ..., 41467257.0, 1.0, 1.0],
       [2013, 'tt2024544', '12 Years a Slave', ..., 158607035.0, 1.0,
        1.0],
       ...,
       [1971, 'tt0067116', 'The French Connection', ..., 236848653.0,
        nan, nan],
       [1971, 'tt0067992', 'Willy Wonka &amp; the Chocolate Factory',
        ..., 23018057.0, nan, nan],
       [1970, 'tt0065466', 'Beyond the Valley of the Dolls', ...,
        53978683.0, nan, nan]], dtype=object)

Either way, it helps to think of a dataframe as a nested data structure.
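To make the nested structure concrete, here is a small sketch on a hand-built dataframe (the name `small_df` and its two rows are illustrative, with values taken from the table above). It shows that `to_numpy` returns a single two-dimensional array, and that calling `tolist` on that array gives ordinary nested Python lists.

```python
import pandas as pd

small_df = pd.DataFrame({
    "year": [2013, 2012],
    "title": ["2 Guns", "Dredd 3D"],
})

values = small_df.to_numpy()  # one two-dimensional array holding every cell

print(values.shape)     # (number of rows, number of columns)
print(values.tolist())  # nested Python lists, one inner list per row
```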

### Working with a series

Let's select a single column from our dataframe above.

> Below, we select the first column, `year`, and then we select the first two elements from it.

In [22]:
year_ser = movies_df['year']
year_ser[:2]

0    2013
1    2012
Name: year, dtype: int64

Notice that to select a column, we refer to our dataframe, `movies_df`, followed by square brackets containing the name of the column.

This column is called a pandas series.

In [23]:
type(year_ser)

pandas.core.series.Series

Essentially, a series is similar to a list in Python, and we can convert it to one.

In [24]:
year_list = year_ser.to_list()
year_list[:3]

[2013, 2012, 2013]

So to recap, we can think of a dataframe as a nested data structure in Python and a series as a Python list. 
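As a quick check of the list analogy, the sketch below builds a series from a plain Python list and converts it back. The `years` list is hand-written from the values above, and the variable names are illustrative.

```python
import pandas as pd

years = [2013, 2012, 2013]
year_ser = pd.Series(years, name="year")  # a series built from a plain list

print(len(year_ser))                 # a series has a length, like a list
print(year_ser[:2].to_list())        # slicing works as it does on a list
print(year_ser.to_list() == years)   # converting back gives the original list
```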

### The index

Let's take another look at our dataframe.

In [25]:
movies_df[:2]

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013$,domgross_2013$,intgross_2013$,period code,decade code
0,2013,tt1711425,21 &amp; Over,notalk,notalk,FAIL,13000000,25682380.0,42195766.0,2013FAIL,13000000,25682380.0,42195766.0,1.0,1.0
1,2012,tt1343727,Dredd 3D,ok-disagree,ok,PASS,45000000,13414714.0,40868994.0,2012PASS,45658735,13611086.0,41467257.0,1.0,1.0


The numbers `0` and `1` on the left are part of the dataframe's index. An index is a series-like collection of labels that identifies each row in the dataframe.

Let's take a look at the index of `movies_df`.

In [26]:
movies_df.index

RangeIndex(start=0, stop=1794, step=1)

> A series also has an index, which labels each of its values.

In [6]:
movies_df['year'][:3]

0    2013
1    2012
2    2013
Name: year, dtype: int64
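The relationship between the two indexes can be sketched on a small hand-built dataframe (`small_df` is an illustrative name, not part of the lesson's data). A column selected from a dataframe carries the same index as the dataframe itself.

```python
import pandas as pd

small_df = pd.DataFrame({
    "year": [2013, 2012, 2013],
    "title": ["2 Guns", "Dredd 3D", "12 Years a Slave"],
})

print(small_df.index)        # a RangeIndex, like the one shown for movies_df
print(list(small_df.index))  # the row labels themselves

# The series for a single column shares the dataframe's index
print(small_df["year"].index.equals(small_df.index))
```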

### Selecting Rows

We can select specific rows by their position with the `iloc` indexer.

In [29]:
movies_df.iloc[0:3]

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013$,domgross_2013$,intgross_2013$,period code,decade code
0,2013,tt1711425,21 &amp; Over,notalk,notalk,FAIL,13000000,25682380.0,42195766.0,2013FAIL,13000000,25682380.0,42195766.0,1.0,1.0
1,2012,tt1343727,Dredd 3D,ok-disagree,ok,PASS,45000000,13414714.0,40868994.0,2012PASS,45658735,13611086.0,41467257.0,1.0,1.0
2,2013,tt2024544,12 Years a Slave,notalk-disagree,notalk,FAIL,20000000,53107035.0,158607035.0,2013FAIL,20000000,53107035.0,158607035.0,1.0,1.0


`iloc` stands for integer location, and it allows us to select rows from a dataframe by position, as we would select data from a list. For example, here we'll select the first row.

In [4]:
movies_df.iloc[0]

year                       2013
imdb                  tt1711425
title             21 &amp; Over
test                     notalk
clean_test               notalk
binary                     FAIL
budget                 13000000
domgross            2.56824e+07
intgross            4.21958e+07
code                   2013FAIL
budget_2013$           13000000
domgross_2013$      2.56824e+07
intgross_2013$      4.21958e+07
period code                   1
decade code                   1
Name: 0, dtype: object

However, if we wish to select multiple rows by *slicing*, we can leave out `iloc`.

In [5]:
movies_df[2:4]

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013$,domgross_2013$,intgross_2013$,period code,decade code
2,2013,tt2024544,12 Years a Slave,notalk-disagree,notalk,FAIL,20000000,53107035.0,158607035.0,2013FAIL,20000000,53107035.0,158607035.0,1.0,1.0
3,2013,tt1272878,2 Guns,notalk,notalk,FAIL,61000000,75612460.0,132493015.0,2013FAIL,61000000,75612460.0,132493015.0,1.0,1.0
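The sketch below compares these ways of selecting rows on a small hand-built dataframe. The letter labels in `index` are hypothetical, chosen only so that row labels and row positions differ, which makes it visible that `iloc` and plain integer slicing both count positions.

```python
import pandas as pd

small_df = pd.DataFrame(
    {
        "year": [2013, 2012, 2013],
        "title": ["2 Guns", "Dredd 3D", "12 Years a Slave"],
    },
    index=["a", "b", "c"],  # hypothetical labels, so labels differ from positions
)

first_row = small_df.iloc[0]    # a single position returns a series
first_two = small_df.iloc[0:2]  # a slice returns a dataframe
also_two = small_df[0:2]        # plain slicing selects the same rows by position

print(type(first_row).__name__, first_row.name)  # the row's label becomes the series name
print(first_two.equals(also_two))                # both slices select identical rows
```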


### Summary

In this lesson, we were introduced to the dataframe and the series. We saw that we can think of a dataframe as a table, or as a nested data structure in Python, and that we can think of a series as a Python list. We saw that an index is a series-like collection of labels that identifies every row in a dataframe.

We also saw how we can select rows from a dataframe. We can select individual rows by using `iloc` followed by square brackets (like `df.iloc[0]`), and we can select multiple rows by slicing (like `df[2:4]`).
