We have now come across three important structures that Python uses to store and access data:

* arrays
* data frames
* series

Here we stop to go back over the differences between these structures, and how to convert between them.

## Data frames

We start by loading a data frame from a Comma Separated Value file (CSV
file).

The data file we will load is a table with average
<https://ratemyprofessors.com> scores across all professors teaching
a particular academic discipline.

See the [array indexing page](../03/array_indexing) for more detail.

Each row in this table corresponds to one *discipline*.  Each column corresponds to a different *rating*.

If you are running on your laptop, you should download
the [rate_my_course.csv]({{ site.baseurl }}/data/rate_my_course.csv)
file to the same directory as this notebook.

In [1]:
# Load the Numpy library, rename to "np"
import numpy as np

# Load the Pandas data science library, rename to "pd"
import pandas as pd

In [2]:
# Read the file.
courses = pd.read_csv('rate_my_course.csv')

In [3]:
# Show the first five rows.
courses.head()

| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
|---|---|---|---|---|---|---|
| 0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
| 1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
| 2 | Biology | 11774 | 3.608331 | 3.70153 | 3.657641 | 2.710459 |
| 3 | Psychology | 11179 | 3.90952 | 3.887536 | 3.900949 | 3.31621 |
| 4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |


The `pd.read_csv` function returned this table in a structure called a *data frame*.

In [4]:
type(courses)

pandas.core.frame.DataFrame

The data frame is a two-dimensional structure. It has rows and columns. We can see the number of rows and columns with:

In [5]:
courses.shape

(75, 6)

This means there are 75 rows.  In this case, each row corresponds to one discipline.

There are 6 columns: the discipline name, the number of professors, and four different student ratings.

Passing the data frame to the Python `len` function shows us the number of rows:

In [6]:
len(courses)

75
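
As a quick check of how `shape` and `len` relate, here is a minimal sketch on a tiny hand-typed data frame. The name `small` is made up for this illustration, and its values are copied from the first three rows shown above rather than loaded from the file.

```python
import pandas as pd

# A tiny stand-in data frame (hypothetical name "small"), typed in by hand.
small = pd.DataFrame({
    'Discipline': ['English', 'Mathematics', 'Biology'],
    'Easiness': [3.162754, 3.063322, 2.710459],
})

# shape gives (number of rows, number of columns).
n_rows, n_cols = small.shape
print(n_rows, n_cols)

# len gives the number of rows, the first value in shape.
print(len(small) == n_rows)
```

The same relationship should hold for `courses`: `len(courses)` matches the first value of `courses.shape`.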

### Indexing into data frames

There are two simple ways of indexing into data frames.

We index into a data frame to get a subset of the data.

To index into anything, we can give the name of the thing - in this case `courses` - followed by an opening square bracket `[`, followed by something to specify which subset of the data we want, followed by a closing square bracket `]`.

The two simple ways of indexing into a data frame are:

* Indexing with a string to get a column.
* Indexing with a Boolean sequence to get a subset of the rows.

When we index with a string, the string should be a column name:

In [7]:
easiness = courses['Easiness']

The result is a *series*:

In [8]:
type(easiness)

pandas.core.series.Series

The Series is a structure that holds the data for a single column.

In [9]:
easiness

0     3.162754
1     3.063322
2     2.710459
3     3.316210
4     3.053803
5     2.652054
6     3.379829
7     3.172033
8     3.057758
9     2.910078
10    3.115357
11    3.395819
12    3.132724
13    2.784706
14    3.277406
15    2.854413
16    2.785668
17    3.248045
18    3.430916
19    3.542273
20    3.138076
21    3.468012
22    3.344138
23    2.885714
24    3.469440
25    3.244433
26    3.194300
27    3.338846
28    3.144567
29    2.868762
        ...   
45    3.324156
46    3.276412
47    3.180846
48    3.423021
49    3.674701
50    3.314322
51    3.199716
52    2.978182
53    2.977254
54    3.471498
55    2.825019
56    3.178866
57    2.887940
58    3.323158
59    3.365544
60    2.830455
61    3.606082
62    3.002857
63    3.267099
64    3.882635
65    3.275238
66    3.402397
67    3.541439
68    3.468333
69    2.969417
70    2.863504
71    3.106727
72    3.309636
73    2.799135
74    3.109118
Name: Easiness, Length: 75, dtype: float64

We will come back to the Series soon.

Notice that if the string you give does not match a column name exactly, you will get a long error.   This gives you some practice in reading long error messages: skip to the end first, where you will often see the most helpful information.

In [10]:
# The exact column name starts with capital E
courses['easiness']

KeyError: 'easiness'
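
One way to avoid this error is to look at the exact column names before indexing. The sketch below is only an illustration on a small hand-typed data frame; the names `small`, `has_lower` and `has_upper` are made up for this example.

```python
import pandas as pd

# A tiny stand-in data frame (hypothetical), typed in by hand.
small = pd.DataFrame({
    'Discipline': ['English', 'Mathematics', 'Biology'],
    'Easiness': [3.162754, 3.063322, 2.710459],
})

# Show the exact column names, including capitalization.
print(list(small.columns))

# Test whether a string is one of the column names.
has_lower = 'easiness' in small.columns
has_upper = 'Easiness' in small.columns
print(has_lower, has_upper)
```

Column names are case-sensitive, so only the version with the capital E matches.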

You have just seen indexing into the data frame with a string to get the data for one column.

The other simple way of indexing into a data frame is with a Boolean sequence.

A Boolean sequence is a sequence of values, all of which are either True or False.  Examples of sequences are series and arrays.
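
As a small illustration of such sequences, the sketch below makes a Boolean array and a Boolean series from four hand-typed values. The names `scores`, `bool_array` and `bool_series`, and the threshold of 3, are assumptions for this example only.

```python
import numpy as np
import pandas as pd

# Four values typed in by hand (hypothetical example data).
scores = np.array([3.162754, 3.063322, 2.710459, 3.316210])

# Comparing an array to a number gives a Boolean array.
bool_array = scores > 3
print(bool_array)

# Comparing a series to a number gives a Boolean series.
bool_series = pd.Series(scores) > 3
print(bool_series)
```

Both results have one True or False value for each of the original values.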

For example, imagine we only wanted to look at courses with an easiness rating of greater than 3.25.

We first make the Boolean sequence, by asking the question `> 3.25` of the values in the "Easiness" column, like this:

In [11]:
is_easy = easiness > 3.25

This is a series that has True and False values:

In [12]:
type(is_easy)

pandas.core.series.Series

In [13]:
is_easy

0     False
1     False
2     False
3      True
4     False
5     False
6      True
7     False
8     False
9     False
10    False
11     True
12    False
13    False
14     True
15    False
16    False
17    False
18     True
19     True
20    False
21     True
22     True
23    False
24     True
25    False
26    False
27     True
28    False
29    False
      ...  
45     True
46     True
47    False
48     True
49     True
50     True
51    False
52    False
53    False
54     True
55    False
56    False
57    False
58     True
59     True
60    False
61     True
62    False
63     True
64     True
65     True
66     True
67     True
68     True
69    False
70    False
71    False
72     True
73    False
74    False
Name: Easiness, Length: 75, dtype: bool

It has True values where the corresponding row had an "Easiness" score greater than 3.25, and False values where the corresponding row had an "Easiness" score of less than or equal to 3.25.

We can index into the data frame with this Boolean series.

When we do this, we ask the data frame to give us a new version of itself, that only has the rows where there was a True value in the Boolean series:

In [14]:
easy_courses = courses[is_easy]

The result is a data frame:

In [15]:
type(easy_courses)

pandas.core.frame.DataFrame

The data frame contains only the rows where the "Easiness" score is greater than 3.25:

In [16]:
easy_courses

| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
|---|---|---|---|---|---|---|
| 3 | Psychology | 11179 | 3.90952 | 3.887536 | 3.900949 | 3.31621 |
| 6 | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |
| 11 | Sociology | 4839 | 3.74098 | 3.748169 | 3.746962 | 3.395819 |
| 14 | Languages | 3867 | 3.77278 | 3.917949 | 3.846951 | 3.277406 |
| 18 | Education | 2544 | 3.707429 | 3.806128 | 3.758211 | 3.430916 |
| 19 | Music | 2455 | 3.844509 | 3.787804 | 3.818114 | 3.542273 |
| 21 | Health | 1937 | 3.891177 | 3.884729 | 3.891213 | 3.468012 |
| 22 | Humanities | 1897 | 3.806969 | 3.816299 | 3.813569 | 3.344138 |
| 24 | Criminal Justice | 1786 | 4.056685 | 4.033779 | 4.046702 | 3.46944 |
| 27 | Social Science | 1412 | 3.683555 | 3.691133 | 3.690262 | 3.338846 |


The way this works can be easier to see when we use a smaller data frame.

Here we take the first eight rows from the data frame, by using the `head` method.

The `head` method can take an argument, which is the number of rows we want.

In [17]:
first_8 = courses.head(8)

The result is a new data frame:

In [18]:
type(first_8)

pandas.core.frame.DataFrame

In [19]:
first_8

| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
|---|---|---|---|---|---|---|
| 0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
| 1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
| 2 | Biology | 11774 | 3.608331 | 3.70153 | 3.657641 | 2.710459 |
| 3 | Psychology | 11179 | 3.90952 | 3.887536 | 3.900949 | 3.31621 |
| 4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
| 5 | Chemistry | 7346 | 3.387174 | 3.53898 | 3.465485 | 2.652054 |
| 6 | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |
| 7 | Business | 6120 | 3.640327 | 3.680503 | 3.663332 | 3.172033 |


We index into the new data frame with a string, to get the "Easiness" column:

In [20]:
easiness_first_8 = first_8["Easiness"]
easiness_first_8

0    3.162754
1    3.063322
2    2.710459
3    3.316210
4    3.053803
5    2.652054
6    3.379829
7    3.172033
Name: Easiness, dtype: float64

This Boolean series has True where the "Easiness" score is greater than 3.25, and False otherwise:

In [21]:
is_easy_first_8 = easiness_first_8 > 3.25
is_easy_first_8

0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
Name: Easiness, dtype: bool

We index into the `first_8` data frame with this Boolean series, to select the rows where `is_easy_first_8` has True, and throw away the rows where it has False.

In [22]:
easy_first_8 = first_8[is_easy_first_8]
easy_first_8

| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
|---|---|---|---|---|---|---|
| 3 | Psychology | 11179 | 3.90952 | 3.887536 | 3.900949 | 3.31621 |
| 6 | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |


Oh dear, Psychology looks pretty easy.
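
The steps above can be collected into one short sketch that runs without the data file. It assumes a hand-typed data frame with only two of the columns and the first four rows; the names `first_4`, `is_easy_4`, `easy_4` and `easy_4_from_array` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hand-typed stand-in for the first four rows (hypothetical, two columns only).
first_4 = pd.DataFrame({
    'Discipline': ['English', 'Mathematics', 'Biology', 'Psychology'],
    'Easiness': [3.162754, 3.063322, 2.710459, 3.316210],
})

# Index with a string to get the column as a series.
# Then compare to make a Boolean series.
is_easy_4 = first_4['Easiness'] > 3.25

# Index with the Boolean series to keep only the rows with True.
easy_4 = first_4[is_easy_4]
print(easy_4)

# A Boolean array of the same length selects the same rows.
easy_4_from_array = first_4[np.array(is_easy_4)]
print(easy_4_from_array)
```

Notice that the selected row keeps its original row label (3), as in the tables above.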

## Series and array

The series, as you have seen, is the structure that Pandas uses to store the data from a column:

In [23]:
first_8

| | Discipline | Number of Professors | Clarity | Helpfulness | Overall Quality | Easiness |
|---|---|---|---|---|---|---|
| 0 | English | 23343 | 3.756147 | 3.821866 | 3.791364 | 3.162754 |
| 1 | Mathematics | 22394 | 3.487379 | 3.641526 | 3.566867 | 3.063322 |
| 2 | Biology | 11774 | 3.608331 | 3.70153 | 3.657641 | 2.710459 |
| 3 | Psychology | 11179 | 3.90952 | 3.887536 | 3.900949 | 3.31621 |
| 4 | History | 11145 | 3.788818 | 3.753642 | 3.773746 | 3.053803 |
| 5 | Chemistry | 7346 | 3.387174 | 3.53898 | 3.465485 | 2.652054 |
| 6 | Communications | 6940 | 3.867349 | 3.878602 | 3.875019 | 3.379829 |
| 7 | Business | 6120 | 3.640327 | 3.680503 | 3.663332 | 3.172033 |


In [24]:
easiness_first_8 = first_8["Easiness"]
easiness_first_8

0    3.162754
1    3.063322
2    2.710459
3    3.316210
4    3.053803
5    2.652054
6    3.379829
7    3.172033
Name: Easiness, dtype: float64

You can index into a series, but this indexing is powerful and sophisticated, so we will not use that for now.

For now, you can convert the series to an array, like this:

In [25]:
easi_8 = np.array(easiness_first_8)
easi_8

array([3.16275414, 3.06332232, 2.71045949, 3.31620986, 3.0538026 ,
       2.65205418, 3.37982853, 3.17203268])

Then you can use the usual [array indexing](../03/array_indexing) to get the values you want:

In [26]:
# The first value
easi_8[0]

3.1627541447114904

In [27]:
# The first five values
easi_8[:5]

array([3.16275414, 3.06332232, 2.71045949, 3.31620986, 3.0538026 ])
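
As an aside, a series can also convert itself to an array with its `to_numpy` method. The sketch below assumes Pandas version 0.24 or later, and uses a hand-typed series; the names `easiness_3`, `from_np_array` and `from_method` are made up for this example.

```python
import numpy as np
import pandas as pd

# A short hand-typed series (hypothetical stand-in for a column).
easiness_3 = pd.Series([3.162754, 3.063322, 2.710459], name='Easiness')

# Convert with np.array, as in the text above.
from_np_array = np.array(easiness_3)

# Convert with the series' own to_numpy method.
from_method = easiness_3.to_numpy()

# Both results are arrays with the same values.
print(type(from_method))
print(np.array_equal(from_np_array, from_method))
```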

You can think of a data frame as a sequence of columns, where each column is a series.

Here I take two columns from the data frame, as series:

In [28]:
disciplines = first_8['Discipline']
disciplines

0           English
1       Mathematics
2           Biology
3        Psychology
4           History
5         Chemistry
6    Communications
7          Business
Name: Discipline, dtype: object

In [29]:
clarity = first_8['Clarity']
clarity

0    3.756147
1    3.487379
2    3.608331
3    3.909520
4    3.788818
5    3.387174
6    3.867349
7    3.640327
Name: Clarity, dtype: float64

I can make a new data frame by inserting these two columns:

In [30]:
# A new data frame
thinner_courses = pd.DataFrame()
thinner_courses['Discipline'] = disciplines
thinner_courses['Clarity'] = clarity
thinner_courses

| | Discipline | Clarity |
|---|---|---|
| 0 | English | 3.756147 |
| 1 | Mathematics | 3.487379 |
| 2 | Biology | 3.608331 |
| 3 | Psychology | 3.90952 |
| 4 | History | 3.788818 |
| 5 | Chemistry | 3.387174 |
| 6 | Communications | 3.867349 |
| 7 | Business | 3.640327 |
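
Another way to build the same kind of data frame is to pass the columns to `pd.DataFrame` in a dictionary, where each key is a column name and each value is a series. This is a sketch on hand-typed series rather than the columns from the file; the names `disciplines_3`, `clarity_3`, `inserted` and `from_dict` are made up for this example.

```python
import pandas as pd

# Two hand-typed series (hypothetical stand-ins for the two columns).
disciplines_3 = pd.Series(['English', 'Mathematics', 'Biology'])
clarity_3 = pd.Series([3.756147, 3.487379, 3.608331])

# Build by inserting columns into an empty data frame, as above.
inserted = pd.DataFrame()
inserted['Discipline'] = disciplines_3
inserted['Clarity'] = clarity_3

# Build in one step from a dictionary of column name -> series.
from_dict = pd.DataFrame({'Discipline': disciplines_3,
                          'Clarity': clarity_3})
print(from_dict)
```

Either route should give a data frame with the same two columns and the same values.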
