# Section 2: Series at a Glance

In [1]:
# Import pandas
import pandas as pd

A series consists of a one-dimensional list of values and their associated labels. The series may contain any type of data - strings, booleans, integers, floats, etc.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html


## Illustrating series

Create a simple list of student names


In [2]:
students = ["Andrew", "Brie", "Kanika"]
type(students)

list

Now create a series from this list using the `pd.Series` constructor. The resulting series holds the list's values along with their associated labels.

In [3]:

pd.Series(students)

0    Andrew
1      Brie
2    Kanika
dtype: object

Create a list of ages for these students and convert it to a series. Notice that the datatype differs from the first series: Pandas inferred the type of the data (here `int64`) and constructed the series accordingly.


In [4]:
ages = [27, 49, 37]
pd.Series(ages)

0    27
1    49
2    37
dtype: int64

Create a list of heights for these students and convert it to a series. Notice that the datatype for this series is different as well: it is `float64`.


In [5]:
heights = [167.4, 173.2, 190.0]
pd.Series(heights)

0    167.4
1    173.2
2    190.0
dtype: float64

## Unlike Numpy arrays, Pandas series support mixed datatypes
In the example below, we create a list of mixed datatypes and turn it into a series. Pandas stores such values under the generic `object` dtype, which makes the series flexible about what it can hold.

In [6]:
mixed = [True, 'say', {'my_mood' : 100}]
pd.Series(mixed)

0                True
1                 say
2    {'my_mood': 100}
dtype: object
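
As a quick check on what `dtype: object` means here, the sketch below inspects the type of each element. The names `mixed_series` and `element_types` are ours and do not appear in the original notebook.

```python
import pandas as pd

mixed = [True, 'say', {'my_mood': 100}]
mixed_series = pd.Series(mixed)

# An object series holds references to the original Python objects,
# so each element keeps its own type.
element_types = [type(value) for value in mixed_series]
print(element_types)
```

The series as a whole reports a single dtype, while the individual elements remain a bool, a string, and a dictionary.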

## Parameters vs. Arguments
* In the previous examples, we passed Python lists directly to the series constructor, which is itself a function. In doing so, we passed a list as an **argument** to the `data` parameter of `pd.Series(data=...)`
* **Parameters** are what the function expects to be passed in
* **Arguments** are the *data* being passed in to the function

In this simple example of a function called `greeting`, `something` is the parameter while *"good morning to you"* is the argument

In [7]:
def greeting(something):
    print(something)

greeting('good morning to you')

good morning to you
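
To connect this back to the series constructor, here is a small sketch showing that an argument can be supplied either by position or by naming the parameter. This variant of `greeting` returns its input instead of printing it (our change, so the results can be compared), and the variable names are illustrative.

```python
import pandas as pd

def greeting(something):
    # `something` is the parameter; the value the caller supplies is the argument.
    return something

positional = greeting('good morning to you')
keyword = greeting(something='good morning to you')

# The same two calling styles work for the Series constructor's `data` parameter.
by_position = pd.Series([27, 49, 37])
by_keyword = pd.Series(data=[27, 49, 37])

print(positional == keyword)
print(by_position.equals(by_keyword))
```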


## What's in the data?
Notice that when we create Pandas series, we only specify the *values*. Pandas automatically generates the *labels* that go with the values.
In the example below, these three book titles are assigned numerical labels.

In [8]:
books_list = ['Fooled by Randomness', 'Sapiens', 'Lenin on the Train']

In [9]:
list_s = pd.Series(books_list)

Behind the scenes, this is conceptually what's going on (illustrative notation, not valid Python):

`books_list = [0:'Fooled by Randomness', 1:'Sapiens', 2:"Lenin on the Train"]`

Each item is associated with an index that specifies its location in the series. Notice how similar this is to a Python dictionary

Let's try creating a dictionary for these book titles.

In [10]:
books_dict = {0: "Fooled by Randomness", 1: "Sapiens", 2: "Lenin on the Train"}

We can then convert this into a Pandas series! This series based on a dictionary is seemingly identical to the series that was based on a list. But is it actually identical?

In [11]:
dict_s = pd.Series(books_dict)

We will now use the `equals()` method to test whether the two series are actually equivalent. It turns out that they are indeed identical!

In [12]:
list_s.equals(dict_s)

True
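
The comparison above succeeds because both the values and the labels match. As a hedged illustration, the sketch below builds a second series with the same values but different dictionary keys (the keys 10, 11, 12 and the variable names are our own choices) and shows that `equals()` then reports a difference.

```python
import pandas as pd

values = ['Fooled by Randomness', 'Sapiens', 'Lenin on the Train']

default_labels = pd.Series(values)  # labels 0, 1, 2
shifted_labels = pd.Series({10: values[0], 11: values[1], 12: values[2]})

# The values are the same...
same_values = default_labels.tolist() == shifted_labels.tolist()
# ...but equals() also compares the index, so the series differ.
same_series = default_labels.equals(shifted_labels)

print(same_values, same_series)
```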

We can also create Pandas series by passing in scalars. Pandas will automatically assign indices. When no labels are supplied, Pandas falls back on a default integer sequence.

In [13]:
pd.Series(714)

0    714
dtype: int64

In [14]:
pd.Series('Andy')

0    Andy
dtype: object
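
If an index is supplied together with a scalar, Pandas repeats the scalar once per label. The sketch below assumes three arbitrary labels of our choosing.

```python
import pandas as pd

# The scalar 714 is repeated to match the length of the index.
repeated = pd.Series(714, index=['a', 'b', 'c'])
print(repeated)
```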

## The dtype attribute
Remember that the dtype is automatically inferred by Pandas when providing data to the Series constructor function.
We also have the option of specifying *dtype* ourselves.
Returning to our `ages` series, we can specify the dtype as "float" even though the values passed in were integers

In [15]:
pd.Series(ages)

0    27
1    49
2    37
dtype: int64

In [16]:
pd.Series(ages, dtype = 'float')

0    27.0
1    49.0
2    37.0
dtype: float64

* For the most part, you'll want to avoid specifying dtype manually because the automatic inference from Pandas is usually good enough
* Any series that contains a string becomes a `dtype: object`, or `dtype('O')`, automatically
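
When a series already exists, its dtype can also be changed after the fact. The sketch below uses the `astype` method for this; `ages_int` and `ages_float` are illustrative names.

```python
import pandas as pd

ages = [27, 49, 37]
ages_int = pd.Series(ages)             # dtype inferred from the data
ages_float = ages_int.astype('float')  # returns a new series; ages_int is unchanged

print(ages_int.dtype, ages_float.dtype)
```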

## What is `dtype`?

* Remember that NumPy arrays are homogeneous - every element has the same type and therefore the same size

Let's return to our heights series. The dtype is `float64`, so NumPy knows exactly how much memory to allocate to each of these items (8 bytes per value)


In [17]:
heights

[167.4, 173.2, 190.0]

In [18]:
pd.Series(heights)

0    167.4
1    173.2
2    190.0
dtype: float64

However, strings are variable-length and can take up different amounts of memory. So instead of saving the strings in the numpy array itself, Numpy instead saves a pointer to the object in memory

Let's create a new series in which one of the heights values is a string instead of a float. This forces Pandas/Numpy to store references to the data points themselves.

In [19]:
heights2 = [167.4, '173.2', 190.0]

Note that the datatype is simply `object`

In [20]:
pd.Series(heights2)

0    167.4
1    173.2
2      190
dtype: object
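
If the string in such a series is really a number, one way to recover a numeric dtype is `pd.to_numeric`. This is a sketch under the assumption that every element can be parsed as a number; by default the function raises an error otherwise.

```python
import pandas as pd

heights2 = [167.4, '173.2', 190.0]

as_object = pd.Series(heights2)       # mixed floats and a string: object dtype
as_float = pd.to_numeric(as_object)   # parses '173.2' and returns a float64 series

print(as_object.dtype, as_float.dtype)
```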

## Index and RangeIndex
* In Pandas, data aligns automatically by the labels, or the indices
* When constructing a series from a list, Pandas automatically generates the index for us
  * The **index** is a sequence of numbers that starts at 0 and ends at one value less than the length of the series
* However, we can generate our own labels if we want! They don't have to be just integers

In the example below, we create a new series using the previous list of books, and then set the indices to be customized values that we have passed in using a list
* Note that although we passed the data and index lists as keyword arguments here (`data` and `index`), we did not have to. `data` and `index` are the first two parameters of the `Series` constructor, so the lists could also have been passed as *positional arguments* in that order.

In [21]:
pd.Series(data = books_list, index = ['funny', 'serious and amusing', 'kinda interesting'])

funny                  Fooled by Randomness
serious and amusing                 Sapiens
kinda interesting        Lenin on the Train
dtype: object
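
A short sketch of both points: the positional form builds the same series as the keyword form, and a custom label can then be used to look up its value. The variable names are ours.

```python
import pandas as pd

books_list = ['Fooled by Randomness', 'Sapiens', 'Lenin on the Train']
labels = ['funny', 'serious and amusing', 'kinda interesting']

keyword_form = pd.Series(data=books_list, index=labels)
positional_form = pd.Series(books_list, labels)  # data first, then index

print(keyword_form.equals(positional_form))

# A custom label works as a lookup key.
print(keyword_form['funny'])
```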

Pandas version 1.0 introduced a dedicated `dtype` of "string" for text data.
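
As a minimal sketch, the string dtype can be requested explicitly through the `dtype` parameter. What Pandas infers when no dtype is given depends on the installed version, so the first printed dtype may differ from the `object` shown in this notebook.

```python
import pandas as pd

names = ['Andrew', 'Brie', 'Kanika']

inferred = pd.Series(names)                  # dtype chosen by Pandas
explicit = pd.Series(names, dtype='string')  # dedicated string dtype

print(inferred.dtype, explicit.dtype)
```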

### RangeIndex
* The index that is automatically generated is a built-in Pandas object called a `RangeIndex`
* Don't worry about the technicalities. The point is that this is a built-in object that creates a sequence of integers with fixed differences, as specified in the `step` parameter.
* By default, the index begins at 0 and stops at one integer prior to the length of the series
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.RangeIndex.html

In [22]:
list_s.index

RangeIndex(start=0, stop=3, step=1)

In [23]:
type(list_s.index)

pandas.core.indexes.range.RangeIndex

Let's create our own range indices using the `RangeIndex()` constructor. This allows us to generate a custom index for our series

In [24]:
pd.RangeIndex(start = 4, stop = 7, step = 1)
list(pd.RangeIndex(start = 4, stop = 7, step = 1))

[4, 5, 6]

Another example

In [25]:
pd.RangeIndex(start = 10, stop = -11, step = -1)
list(pd.RangeIndex(start = 10, stop = -11, step = -1))

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -10]
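
To show how such an index can be attached to a series, the sketch below passes a `RangeIndex` as the `index` argument. The names `custom_index` and `renumbered` are illustrative.

```python
import pandas as pd

books_list = ['Fooled by Randomness', 'Sapiens', 'Lenin on the Train']

# The index must have the same length as the data: 4, 5, 6.
custom_index = pd.RangeIndex(start=4, stop=7, step=1)
renumbered = pd.Series(books_list, index=custom_index)

print(renumbered)
```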

RangeIndex is **immutable**: it cannot be changed after it is created. To use different labels, we have to create a new index object and assign it in place of the old one.
Pandas relies on this immutability internally for performance optimization.
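
A small sketch of what immutability means in practice: assigning to a single position of the index raises a `TypeError`, while replacing the whole index with a new object works. The series and variable names here are made up for the illustration.

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c'])

# Changing one label in place is not allowed.
try:
    letters.index[0] = 10
    mutation_error = None
except TypeError as exc:
    mutation_error = exc

# Assigning a brand-new index object is fine.
letters.index = pd.RangeIndex(start=10, stop=13)

print(type(mutation_error).__name__, list(letters.index))
```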

## Series and Index Names


Let's revisit our book title series, which, as you'll recall, is a sequence of values and their associated indices (labels). Let's also assign it to a more **intelligible** variable name, `books_series`

In [26]:
books_series = list_s
books_series

0    Fooled by Randomness
1                 Sapiens
2      Lenin on the Train
dtype: object

### Attributes and Methods
* Objects in Python have *attributes* and *methods* (more on this if you learn OOP)
* An **attribute** is a variable that is bound to an object
* A **method** is a function that is bound to an object
* For Pandas series, `.size` and `.dtype` are both attributes

In [27]:
books_series.size

3

### Series Name Attribute

Pandas series also have a **name** attribute. However, by default, the name attribute points to the `None` object in Python

In [28]:
books_series.name

In [29]:
books_series.name is None

True

* Let's give our series a name by setting it to a string.
* After doing this, you will see that the series now has a name attribute that is reported when that series is called
* Later on, when we create *data frames*, the name of this series will become the column name of the data frame. Keep this in mind for later

In [30]:
books_series.name = 'my favorite books'
books_series

0    Fooled by Randomness
1                 Sapiens
2      Lenin on the Train
Name: my favorite books, dtype: object

* You can also name the series index! (it does not have one by default)
* After naming the series index, you will see that name appear above the indices

In [31]:
books_series.index.name is None

True

In [32]:
books_series.index.name = 'My books indices'
books_series

My books indices
0    Fooled by Randomness
1                 Sapiens
2      Lenin on the Train
Name: my favorite books, dtype: object
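
As a preview of how these names carry over to data frames, the sketch below converts a named series with `to_frame()`. Data frames are only introduced later, so treat this as a peek ahead; `books_frame` is our own name.

```python
import pandas as pd

books_series = pd.Series(
    ['Fooled by Randomness', 'Sapiens', 'Lenin on the Train'],
    name='my favorite books',
)
books_series.index.name = 'My books indices'

# The series name becomes the column name; the index name is kept.
books_frame = books_series.to_frame()

print(books_frame.columns.tolist())
print(books_frame.index.name)
```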

## Skill Challenge Instructions
1. Create a Python list of length 4 that contains some of your favorite actors. This should be a list of strings. Assign it to a variable called `actor_names`
2. Next, create another Python list of the same length that contains your guesses for how old each actor is. Use integers or floats. Call this list `actor_ages`
3. Create a series that stores actor ages and labels the ages using the actor names. To clarify, use the actor name in the index and actor age as values. Give this series a name of `actors`
4. Repeat step 3, but create the series from a Python dictionary instead of a list. As an additional challenge, do not type the dictionary manually, but instead dynamically create it using the two lists defined in Steps 1 and 2

Create the list of actors

In [33]:
actor_names = ['Sean Connery','Morgan Freeman','Tom Hanks','Harrison Ford']

Create the list of actor ages

In [34]:
# Create list of actor ages
actor_ages = [85, 79, 65, 80]

Using these two lists, create the Pandas series by passing `actor_ages` as data and `actor_names` as the index. Also assign the name of the series within the constructor itself (alternatively, you can assign it afterward using `series.name`)

In [35]:
actors = pd.Series(data = actor_ages, index = actor_names, name = 'actors')
actors

Sean Connery      85
Morgan Freeman    79
Tom Hanks         65
Harrison Ford     80
Name: actors, dtype: int64

Create a Python dictionary from the actor and ages lists using the `zip` function

In [36]:
keys = actor_names
values = actor_ages
actors_dict = dict(zip(keys,values))
actors_dict

{'Sean Connery': 85,
 'Morgan Freeman': 79,
 'Tom Hanks': 65,
 'Harrison Ford': 80}

Convert this dictionary to a Pandas series

In [37]:
pd.Series(data = actors_dict, name = 'actors')

Sean Connery      85
Morgan Freeman    79
Tom Hanks         65
Harrison Ford     80
Name: actors, dtype: int64

An alternative approach is to construct the dictionary using a dictionary comprehension

In [38]:
{name:age for name,age in zip(actor_names, actor_ages)}

{'Sean Connery': 85,
 'Morgan Freeman': 79,
 'Tom Hanks': 65,
 'Harrison Ford': 80}
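
To confirm that steps 3 and 4 produce the same result, the two series can be compared with `equals()`. In this sketch, `from_lists` and `from_dict` are illustrative names for the two versions.

```python
import pandas as pd

actor_names = ['Sean Connery', 'Morgan Freeman', 'Tom Hanks', 'Harrison Ford']
actor_ages = [85, 79, 65, 80]

from_lists = pd.Series(data=actor_ages, index=actor_names, name='actors')
from_dict = pd.Series(data=dict(zip(actor_names, actor_ages)), name='actors')

# Same values, same labels: the two construction routes agree.
print(from_lists.equals(from_dict))
```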

## The Head and Tail Methods

* The `head(n=)` method displays the first *n* rows of the series. If n is not specified, the default is 5
    * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.head.html
* The `tail(n=)` method displays the last *n* rows of the series. The default is also 5 if n is not specified
    * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.tail.html
* These methods are very useful when working with very large datasets


Start by creating a very long series object with integers 0 through 59

In [42]:
int_series = pd.Series(range(60))
int_series

0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
20    20
21    21
22    22
23    23
24    24
25    25
26    26
27    27
28    28
29    29
30    30
31    31
32    32
33    33
34    34
35    35
36    36
37    37
38    38
39    39
40    40
41    41
42    42
43    43
44    44
45    45
46    46
47    47
48    48
49    49
50    50
51    51
52    52
53    53
54    54
55    55
56    56
57    57
58    58
59    59
dtype: int64

In [43]:
int_series.head()

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [44]:
int_series.tail(n=7)

53    53
54    54
55    55
56    56
57    57
58    58
59    59
dtype: int64
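
Both methods also accept a negative `n`. In that case `head` returns everything except the last |n| rows and `tail` returns everything except the first |n| rows. A brief sketch, with variable names of our choosing:

```python
import pandas as pd

int_series = pd.Series(range(60))

all_but_last_two = int_series.head(-2)   # rows 0 through 57
all_but_first_two = int_series.tail(-2)  # rows 2 through 59

print(len(all_but_last_two), len(all_but_first_two))
```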

Note that most data science environments will already truncate the output for large data sets. See example below, where only the first 5 and last 5 records are displayed

In [45]:
pd.Series(range(100000))

0            0
1            1
2            2
3            3
4            4
         ...  
99995    99995
99996    99996
99997    99997
99998    99998
99999    99999
Length: 100000, dtype: int64

You can change how many rows Pandas shows when it truncates a long series by modifying the display options. For example:

`pd.options.display.min_rows = 40`
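
If the setting should apply only temporarily, one option is `pd.option_context`, which restores the previous values when the block ends. The sketch below uses arbitrary values of 20 and 6 chosen for the illustration.

```python
import pandas as pd

long_series = pd.Series(range(100000))
before = pd.get_option('display.min_rows')

# Options set here apply only inside the with-block.
with pd.option_context('display.max_rows', 20, 'display.min_rows', 6):
    inside = pd.get_option('display.min_rows')
    truncated_text = repr(long_series)

after = pd.get_option('display.min_rows')

print(truncated_text)
print(before, inside, after)
```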

## Extracting Data from Series using Index Position
You can extract specific rows from a Pandas series using index positions

Let's start by creating a series with the letters of the English alphabet

In [46]:
from string import ascii_lowercase

In [47]:
ascii_lowercase

'abcdefghijklmnopqrstuvwxyz'

Create a Pandas series from these letters

In [50]:
alphabet_series = pd.Series(list(ascii_lowercase))
alphabet_series.head(6)

0    a
1    b
2    c
3    d
4    e
5    f
dtype: object

To extract the data you want, **use square bracket indexing**, similar to how you would with a Python list
* Remember that the default index is always zero-based. Keep this in mind when using bracket indexing - it's very easy to be off by one position

First letter (remember that the first letter is indexed at 0)

In [51]:
alphabet_series[0]

'a'

22nd letter (remember that the 22nd letter is indexed at 21)

In [66]:
alphabet_series[21]

'v'

First three letters (remember that slicing is exclusive of the end position)

In [57]:
alphabet_series[0:3]

0    a
1    b
2    c
dtype: object

Sixth through tenth letters (remember that the sixth letter is indexed at 5, while the tenth letter is indexed at 9. To include the tenth letter, we must slice from 5 to 10)

In [65]:
alphabet_series[5:10]

5    f
6    g
7    h
8    i
9    j
dtype: object

Last six letters - we can use a negative slice for this

In [59]:
alphabet_series[-6:]

20    u
21    v
22    w
23    x
24    y
25    z
dtype: object
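
Pandas also provides the `.iloc` indexer, which always works by position regardless of what the labels are. It is not covered above, so take this as a hedged sketch of the same extractions. Note that a single negative integer in plain brackets (for example `alphabet_series[-1]`) is looked up as a label on the default index and fails, whereas `.iloc[-1]` returns the last element.

```python
import pandas as pd
from string import ascii_lowercase

alphabet_series = pd.Series(list(ascii_lowercase))

first = alphabet_series.iloc[0]                       # first letter
last = alphabet_series.iloc[-1]                       # last letter
sixth_to_tenth = alphabet_series.iloc[5:10].tolist()  # end position is exclusive

print(first, last, sixth_to_tenth)
```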