# Day 1: O'Reilly Training
>Week 1: Python Programming for Data Analysis

- Numpy
- Pandas

# Welcome and agenda

1. Getting started
    - What is data analysis?
    - What is Pandas?
    - Descriptive statistics
    - Using Pandas series
    - Broadcasting 
    - Mask arrays
    - Useful methods
2. Data types and data frames
    - dtypes (different types of data we can store)
    - `NaN` ("not a number")
    - Data frames
    - Querying with boolean/mask indexes
    - Reading CSV data
3. Real-world data
    - CSV
    - Online data
    - Sorting
    - Grouping
    - Pivot tables
    - Joining
    - Cleaning data
4. Text and dates
    - Text
    - Dates
5. Visualization
    - Line plots
    - Bar plots
    - Histograms
    - Pie plots
    - Scatter plots
    - Boxplots
    




# What is data?

If we're talking about data analysis, then we should really know what data is!

The nouns in the computer world are data.
    - Files (lots of types of files)
    - Logs (information about who did what, and when, on which computer)
    - Preferences (e.g., Netflix)
    - Store inventories and purchasing histories
    
Thanks to mobile devices, and computers, and companies all getting interconnected, there is a **LOT** of data out there.  

The problem isn't finding data. The problem is understanding what our data really means, and doing something useful with it.

The scientific method means: I ask a question, and I try to answer that question as best as possible, using techniques that others have demonstrated are reliable.  Data science is all about applying that method to the world of data.

I want to be able to ask a question, and then answer it.  I'll use data in order to do that.

Some examples:
    - What products are selling best?
    - What products are selling best in each country?  In each age demographic? In each country vs. each age demographic?
    - Which employees are bringing in the most sales?
    - Which universities' graduates earn the most money 10 years after graduating?
    - Which stocks/bonds/investments have done best over the last 10 years? 50 years?  

# Data science

The idea is: Use scientific principles to ask questions and answer them with data.

I divide the world of data science into three pieces:

- Data analysis — use the data we've collected to understand the past and present
- Data engineering — there's so much data out there, in so many different formats and sources, and getting it in a timely, organized way to our team's computers is hard -- data engineers solve these problems
- Machine learning — learn from the past, to make predictions about the future


How are we going to gather the data, and ask questions?  We're going to read it into data structures on our computer, and then use methods on those data structures to create queries.  We'll be using Python and Pandas in this course.

# Exercise: Think about data

What data does Amazon have about its products? What data does Amazon have about you? How does Amazon use this data in its business?

# Python -- why is this a language for data analytics?

In many ways, it's a very good choice:
- Easy to read
- Easy to learn
- Lots of support
- Open source (cheap or free)

In other ways, it's a very bad choice:
- It runs slowly (relative to many other languages)
- It uses lots of memory

Data analytics uses *lots* of data, often many hundreds of megabytes -- or more!   It's not at all unusual to have data sets that are several GB in size.  If the same data in Python is 10x bigger than the data in C, then you can handle more data in C, even if it's a harder language to work with.

So why is Python so popular for data analysis anyway? The reason is something called "NumPy." This is a Python package, written mostly in C.  It lets us store and process data in C format (i.e., very fast, very small), with a very thin layer of Python on top of it.  NumPy is super fast and super efficient, but it lets us work with friendly Python code.

The best of both worlds!  (Almost)

NumPy can still be a bit low level and hard to work with.  We, in this class, will be using Pandas.  Pandas is (mostly) a wrapper around NumPy, making it far friendlier and easier to work with.
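To make the memory argument a bit more concrete, here is a rough sketch comparing a Python list of integers with a NumPy array holding the same values. The variable names are made up for illustration, and the list figure is only an estimate (it depends on your Python version), so treat the numbers as approximate.

```python
import sys
import numpy as np

n = 100_000

# a plain Python list: every element is a full Python int object
py_list = list(range(n))
# estimate: the list's own storage plus each int object it refers to
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(i) for i in py_list)

# a NumPy array: one contiguous block of 8-byte integers
arr = np.arange(n, dtype=np.int64)
array_bytes = arr.nbytes

print(f"list:  about {list_bytes:,} bytes")
print(f"array: {array_bytes:,} bytes")
```

On a typical installation the list should come out several times larger than the array, which is the gap that NumPy (and therefore Pandas) closes.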

The Pandas library alone has somewhere between 5 and 10 million users.  People at companies and organizations around the world are using Pandas more and more to analyze their data:

- E-commerce companies
- Manufacturers
- Banks and financial institutions
- Marketing companies



# Data structures in Pandas

Pandas mostly ignores Python data structures, in favor of its own:

- Series (1-dimensional data)
- Data frame (2-dimensional data)

Assuming that you have installed Pandas and Jupyter on your computer, you can say:

In [4]:
import numpy as np          # we will use, occasionally, some of the low-level NumPy functionality via "np"
import pandas as pd         # load the Pandas module into memory, and make it available via the "pd" namespace
from pandas import Series   # I also want to use the Series name by itself, rather than pd.Series

In [5]:
# Let's say I want to create a series

# I use a Python list of integers to create my series
s = Series([10, 20, 30, 40, 50, 60, 70])

In [6]:
type(s)   # in Jupyter, if an expression is on the final line of a cell, we get the value back w/o print

pandas.core.series.Series

In [7]:
# A series works a lot like a list, in many ways

s[0]

10

In [8]:
s[5]

60

In [9]:
t = Series([50, 40, 20, 30, 88, 22, 16])

In [11]:
# what happens when I add together two Python lists?
mylist1 = [10, 20, 30]
mylist2 = [40, 50, 60]

mylist1 + mylist2  # we get a new list -- all elements of mylist1, followed by all elements of mylist2

[10, 20, 30, 40, 50, 60]

In [12]:
# let's look at s and t

In [13]:
s

0    10
1    20
2    30
3    40
4    50
5    60
6    70
dtype: int64

In [14]:
t

0    50
1    40
2    20
3    30
4    88
5    22
6    16
dtype: int64

In [15]:
# what happens when I add s and t together?

# we get a new series -- the new series has 7 elements, with the index 0-6
# the new series, at index 0, is s[0] + t[0]
# the new series, at index 1, is s[1] + t[1]

# the addition took place element by element, as with vectors!

s + t

0     60
1     60
2     50
3     70
4    138
5     82
6     86
dtype: int64
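To see the contrast with lists side by side, here is a small sketch. It reuses `mylist1` and `mylist2` from above; the other names are just for illustration.

```python
from pandas import Series

mylist1 = [10, 20, 30]
mylist2 = [40, 50, 60]

# with lists, + concatenates; adding element by element needs an explicit loop
list_sums = [a + b for a, b in zip(mylist1, mylist2)]

# with series, + already works element by element
series_sums = Series(mylist1) + Series(mylist2)

print(list_sums)              # [50, 70, 90]
print(series_sums.tolist())   # [50, 70, 90]
```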

In [16]:
s - t

0   -40
1   -20
2    10
3    10
4   -38
5    38
6    54
dtype: int64

In [17]:
s * t

0     500
1     800
2     600
3    1200
4    4400
5    1320
6    1120
dtype: int64

In [18]:
s / t

0    0.200000
1    0.500000
2    1.500000
3    1.333333
4    0.568182
5    2.727273
6    4.375000
dtype: float64

In [20]:
s % t  # remainder from dividing s/t

0    10
1    20
2    10
3    10
4    50
5    16
6     6
dtype: int64

# Exercise: Series operations

1. Create a series containing three elements -- your birth year, month, and day.
2. Create a second series containing three elements -- birth year, month, and day -- from someone you know.
3. Show the difference between your two ages in years, months, and days.

In [22]:
me = Series([1970, 7, 14])
sis = Series([1971, 8, 22])

In [24]:
sis - me

0    1
1    1
2    8
dtype: int64

In [25]:
me = Series([1970, 07, 14])


SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers (2946636120.py, line 1)

# Different number bases

If you want to enter numbers in a different number base, you can do that in Python:

- base 2 (binary): Use 0b, as in 0b101010
- base 16 (hex): Use 0x, as in 0xab12cd34
- base 8 (octal): Use 0o, as in 0o12345
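A quick sketch of these prefixes in action. The hex value `0x1f` and the `month` variable are my own examples, not from the session. Note that a leading zero is only a problem in an integer literal; a string such as `'07'` converts just fine.

```python
# integer literals in other bases are still ordinary Python ints
print(0b101010)    # 42
print(0o12345)     # 5349
print(0x1f)        # 31

# a leading zero is fine inside a string that we pass to int()
month = int('07')
print(month)       # 7

# int() can also parse a string in a given base
print(int('101010', 2))   # 42
```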

# Next up:

- Descriptive statistics
- Aggregate methods
- Five-number summary
- Mean and standard deviation

# Lists vs. arrays

Python normally uses lists.  Many people say to me, "Oh, those are just arrays, right?"  No, because there are two basic differences between lists and arrays:

1. Arrays have a set length, that cannot be changed once they're created.
2. All elements in an array must be of the same type.

Since Pandas series are based on NumPy arrays, you might think that they cannot have different types in them.  But it turns out that they can, thanks to a bit of Python magic!

In [26]:
s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [27]:
s = Series([10.5, 20.3, 30.4, 40, 50])
s

0    10.5
1    20.3
2    30.4
3    40.0
4    50.0
dtype: float64

In [28]:
# a dtype of "object" means: Something in the Python world, I don't know what
s = Series(['a', 'bc', 'defg', 'hi'])
s

0       a
1      bc
2    defg
3      hi
dtype: object

In [29]:
# once we have a dtype of object, anything is possible
# we can have a mix, then, and Pandas will live with it
# but we should avoid it

s = Series([10, 20, 'abc', 'def', 30.5, 60.8, [1,2,3,4]])

In [30]:
s

0              10
1              20
2             abc
3             def
4            30.5
5            60.8
6    [1, 2, 3, 4]
dtype: object

In [31]:
# you can install NumPy using pip, just as you did for Pandas:
# pip install numpy

# this is not a Python command -- you need to run it on the command line

In [33]:
# ages of people in my family

# how can we describe this?

s = Series([52, 50, 21, 19, 16])

In [35]:
# How can we describe a bunch of data points, in a way that will make sense?

# one possibility: mean (average) -- sum, divided by the number of data points

# Pandas series have a bunch of methods for this, and other *descriptive statistics*.

s.mean()   # mean is great... but also flawed, because it is easily skewed with one or two outliers

31.6

In [36]:
# very often, we don't use the mean.  Rather, we use the *median*.
# the median is calculated in this way: Take all values, from the smallest to the largest,
# and line them up, sorted.  Take the middle value or (if there is an even number of values)
# the average (mean) of the two middle values

In [37]:
s.median()

21.0
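Here is a small sketch of how an outlier pulls the mean around while the median barely moves. The numbers are invented for illustration.

```python
from pandas import Series

values = Series([10, 20, 30, 40, 50])
with_outlier = Series([10, 20, 30, 40, 50, 1000])   # same data, plus one made-up outlier

print(values.mean(), values.median())               # 30.0 30.0
print(with_outlier.mean(), with_outlier.median())   # the mean jumps to about 191.67; the median is 35.0
```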

In [38]:
# why are mean and median floats, when the numbers in s are integers?
# In Python 3, division with / always returns a float, even if the numbers are all integers
# if you want an integer result, use //, known as "floordiv," which always rounds down (toward negative infinity), rather than rounding to the closest integer

In [39]:
# how do I get help on things in Pandas?
# My favorite way: help(function)

help(s.median)

Help on method median in module pandas.core.generic:

median(axis: 'int | None | lib.NoDefault' = <no_default>, skipna=True, level=None, numeric_only=None, **kwargs) method of pandas.core.series.Series instance
    Return the median of the values over the requested axis.
    
    Parameters
    ----------
    axis : {index (0)}
        Axis for the function to be applied on.
    skipna : bool, default True
        Exclude NA/null values when computing the result.
    level : int or level name, default None
        If the axis is a MultiIndex (hierarchical), count along a
        particular level, collapsing into a scalar.
    numeric_only : bool, default None
        Include only float, int, boolean columns. If None, will attempt to use
        everything, then use only numeric data. Not implemented for Series.
    **kwargs
        Additional keyword arguments to be passed to the function.
    
    Returns
    -------
    scalar or Series (if level specified)



In [40]:
# standard deviation tells us how much the data varies from the mean
# a standard deviation of 0 means: all data points are the same
# often we talk about mean+1sd mean-1sd

s.std()   # calculate the standard deviation on our data

17.812916661793487
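If you're curious where that number comes from, here is a sketch that computes it by hand, assuming the usual "sample" formula (dividing by n - 1), which is what Pandas uses by default. The names `squared_diffs` and `manual_std` are mine.

```python
from pandas import Series

s = Series([52, 50, 21, 19, 16])

# subtract the mean from every element, then square each difference
squared_diffs = (s - s.mean()) ** 2

# Pandas divides by n - 1 by default (the "sample" standard deviation)
manual_std = (squared_diffs.sum() / (len(s) - 1)) ** 0.5

print(manual_std)      # about 17.81
print(s.std())         # the same value
print(s.std(ddof=0))   # divides by n instead, so it's a bit smaller
```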

In [41]:
s.mean()

31.6

In [42]:
s.mean() + s.std()

49.41291666179349

In [43]:
s.mean() - s.std()

13.787083338206514

In [44]:
# if the data is normally distributed, about 68% of the data points should be
# within 1 std of the mean -- here, between roughly 13.8 and 49.4
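We can check how close our tiny data set comes to that rule of thumb. This is just a sketch; `low`, `high`, and `within_one_std` are names I made up.

```python
from pandas import Series

s = Series([52, 50, 21, 19, 16])

low = s.mean() - s.std()
high = s.mean() + s.std()

# between() gives True/False per element; the mean of booleans is the fraction that are True
within_one_std = s.between(low, high).mean()
print(within_one_std)   # 0.6, i.e. 3 of the 5 ages
```

With only five values, and two clear groups of ages, we shouldn't expect to hit 68% exactly.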

In [45]:
s

0    52
1    50
2    21
3    19
4    16
dtype: int64

In [46]:
s.mean() + 2 * s.std()

67.22583332358698

In [47]:
s.mean() - 2 * s.std()

-4.025833323586973

# Descriptive statistics

We can now understand our data in a variety of ways:

- Mean, which tells us the arithmetic average — easy to understand, and often useful, and the reference point for the standard deviation. However, it's easy to skew with outliers.
- Standard deviation, which tells us how much the data spreads out from the mean. A high std means that the data is all over the place, whereas a low one means that it's concentrated around the mean.
- Median, which is the middle value.

You can also think of the median as the 50% point in our data, if we line things up from lowest to highest.  We can similarly calculate the 25% mark and the 75% mark in our data.  These are often known as the 1Q and 3Q values.  Pandas lets us calculate these pretty easily:

In [48]:
s.quantile(0.25)

19.0

In [49]:
s.quantile(0.5)  # median

21.0

In [50]:
s.quantile(0.75)

50.0

In [51]:
s

0    52
1    50
2    21
3    19
4    16
dtype: int64

In [52]:
# we can also see how much the data is spread out using the IQR (inter-quartile range)
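A minimal sketch of computing the IQR with the `quantile` method we just saw, using the family ages again. The names `q1`, `q3`, and `iqr` are mine.

```python
from pandas import Series

s = Series([52, 50, 21, 19, 16])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1   # the width of the middle half of the data

print(q1, q3, iqr)   # 19.0 50.0 31.0
```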

In [53]:
# we might also want to know the minimum and maximum values in our data:

s.min()

16

In [54]:
s.max()

52

# Descriptive statistics summary

For a series `s`, we can call:

- `s.mean()`
- `s.std()`
- `s.quantile(0.25)`
- `s.quantile(0.50)` or `s.median()`
- `s.quantile(0.75)`
- `s.min()`
- `s.max()`

# Exercise: Weather forecast

1. Go to a Web site that shows the 10-day forecast for your area. Create a series containing the high temperatures for each of the next 10 days.
2. Calculate each of the descriptive statistics.  Are there any obvious outliers (very hot or very cold) in the coming 10 days?

In [56]:
help(s.quantile)

Help on method quantile in module pandas.core.series:

quantile(q=0.5, interpolation='linear') method of pandas.core.series.Series instance
    Return value at the given quantile.
    
    Parameters
    ----------
    q : float or array-like, default 0.5 (50% quantile)
        The quantile(s) to compute, which can lie in range: 0 <= q <= 1.
    interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
        This optional parameter specifies the interpolation method to use,
        when the desired quantile lies between two data points `i` and `j`:
    
            * linear: `i + (j - i) * fraction`, where `fraction` is the
              fractional part of the index surrounded by `i` and `j`.
            * lower: `i`.
            * higher: `j`.
            * nearest: `i` or `j` whichever is nearest.
            * midpoint: (`i` + `j`) / 2.
    
    Returns
    -------
    float or Series
        If ``q`` is an array, a Series will be returned where the
        index is ``q

In [57]:
s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32])

In [58]:
s.min()

32

In [59]:
s.max()

34

In [60]:
s.mean()

33.2

In [61]:
s.median()

33.5

In [62]:
s.std()

0.9189365834726815

In [63]:
s.mean() - s.std()

32.281063416527324

In [64]:
s.mean() + s.std()

34.11893658347268

In [65]:
# we can get all of these descriptive statistics with the "describe" method

s.describe()

count    10.000000
mean     33.200000
std       0.918937
min      32.000000
25%      32.250000
50%      33.500000
75%      34.000000
max      34.000000
dtype: float64

In [66]:
# I'm going to use NumPy to generate some random numbers, and use those to create a series

np.random.seed(0)   # always start random numbers from the same place, so we all get the same values
s = Series(np.random.randint(0, 100, 10))   # get 10 random ints from 0 up to (but not including) 100, and create a series from them

In [67]:
s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [68]:
# to retrieve a value at index n, I could use s[n]
# but it's much better to use s.loc[n]

s.loc[5]

9

In [70]:
# note the double square brackets!
# outer ones are for s.loc
# inner ones mean: I'm asking for 3 values

s.loc[[3, 5, 7]]  # this is a "fancy" index, where I ask for three different values

3    67
5     9
7    21
dtype: int64

In [72]:
# I can also use a slice (as in Python, except that with .loc the end of the slice is included)

s.loc[4:7]  # starting at 4, up to and including 7

4    67
5     9
6    83
7    21
dtype: int64
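The one surprise here is that a `.loc` slice includes its end point, unlike a slice of a Python list. A small sketch of the difference (`mylist` is a name I'm adding for the comparison):

```python
import numpy as np
from pandas import Series

np.random.seed(0)
s = Series(np.random.randint(0, 100, 10))
mylist = s.tolist()            # the same 10 values, as a plain Python list

print(len(mylist[4:7]))        # 3 -- a list slice stops *before* position 7
print(len(s.loc[4:7]))         # 4 -- .loc includes the label 7
```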

In [73]:
# can I set values in this way? YES!

s.loc[5] = 999
s

0     44
1     47
2     64
3     67
4     67
5    999
6     83
7     21
8     36
9     87
dtype: int64

In [74]:
# if I assign to a fancy index, all of the elements are changed to the new value

s.loc[[3,5,7]] = 888


In [75]:
s

0     44
1     47
2     64
3    888
4     67
5    888
6     83
7    888
8     36
9     87
dtype: int64

In [76]:
# this is the series I created with the high temps in Modi'in (my city)

s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32])

In [77]:
s

0    33
1    34
2    34
3    34
4    34
5    34
6    33
7    32
8    32
9    32
dtype: int64

In [78]:
# I don't want to have an index that's 0-9.  I want an index that is easier to understand.
# For example, MMDD

# Pandas allows for this!

# pass index = and a list of values (strings, integers, anything!)
s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
          index=['0802', '0803', '0804', '0805', '0806', 
                '0807', '0808', '0809', '0810', '0811' ])

In [79]:
s

0802    33
0803    34
0804    34
0805    34
0806    34
0807    34
0808    33
0809    32
0810    32
0811    32
dtype: int64

In [80]:
# what will the weather be like on August 9th?

s.loc['0809']

32

In [81]:
# what will the high temperature be on August 8th through 10th?

s.loc['0808':'0810']  # use a slice

0808    33
0809    32
0810    32
dtype: int64

In [82]:
# or, if I prefer, just state them explicitly with a fancy index
s.loc[['0808', '0809', '0810']]

0808    33
0809    32
0810    32
dtype: int64

In [83]:
s.loc[['0808', '0809', '0810']] = 98,99,100

In [84]:
s

0802     33
0803     34
0804     34
0805     34
0806     34
0807     34
0808     98
0809     99
0810    100
0811     32
dtype: int64

In [85]:
s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
          index=['0802', '0803', '0804', '0805', '0806', 
                '0807', '0808', '0809', '0810', '0811' ])

# Retrieving via indexes

We've now seen that we can use `.loc` to retrieve one or more values from our series, using the index.  With the default index (integers starting at 0), the index values are the same as the positions. But if the index is custom, whether strings or integers, `.loc` works with it just the same.

But what if we want to retrieve by position, even though we have our own string index? We can do that with the `.iloc` accessor.  It works just like `.loc`, but uses the numeric positions (starting at 0), rather than our own custom index.

In [86]:
s

0802    33
0803    34
0804    34
0805    34
0806    34
0807    34
0808    33
0809    32
0810    32
0811    32
dtype: int64

In [87]:
s.loc['0804']  # retrieving via .loc and the index

34

In [89]:
s.iloc[2]       # retrieving via .iloc and the position

34

# Exercise: Weather, with dates

1. Recreate your weather series, using MMDD-style strings as your indexes.
2. Retrieve, via the index, the high temperature on August 5th
3. Retrieve, via the index, the high temperatures  on August 4th through 9th.
4. What is the max temp going to be from August 9th through 11th?


In [90]:
s

0802    33
0803    34
0804    34
0805    34
0806    34
0807    34
0808    33
0809    32
0810    32
0811    32
dtype: int64

In [91]:
# if I want to use my index to retrieve values, I use .loc
# if I want to use the positions to retrieve values, I use .iloc 

# .iloc and .loc are exactly the same if I don't have a special/new/custom index, if I just
# use the default.

# .iloc will always use integers, starting at 0.  
# .loc can use integers, but it can also use strings and other types

In [92]:
# make sure that your index values are strings, not integers

# '0805'  not 0805

In [93]:
s

0802    33
0803    34
0804    34
0805    34
0806    34
0807    34
0808    33
0809    32
0810    32
0811    32
dtype: int64

In [97]:
# if you assign a list (or other iterable) with the right number of values 
# to s.index, you're set

s.index = [2,4,6,8,10,12,14,16, 18, 20]

In [98]:
s

2     33
4     34
6     34
8     34
10    34
12    34
14    33
16    32
18    32
20    32
dtype: int64
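An integer index like this one is where `.loc` and `.iloc` really part ways. Here is a sketch, recreating the series with the even-numbered index in a single step:

```python
from pandas import Series

s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
           index=[2, 4, 6, 8, 10, 12, 14, 16, 18, 20])

print(s.loc[2])     # 33 -- the value whose index label is 2
print(s.iloc[2])    # 34 -- the value at position 2 (the third element)

print(s.loc[2:6].tolist())    # [33, 34, 34] -- labels 2, 4, and 6 (end included)
print(s.iloc[2:6].tolist())   # [34, 34, 34, 34] -- positions 2, 3, 4, and 5
```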

In [100]:
# here s has its MMDD string index again
s

0802    33
0803    34
0804    34
0805    34
0806    34
0807    34
0808    33
0809    32
0810    32
0811    32
dtype: int64

In [99]:
# method 1 for creating my series of temps with dates:
# create the series
# pass the keyword argument index with a list of strings

s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
          index=['0802', '0803', '0804', '0805', '0806', 
                '0807', '0808', '0809', '0810', '0811' ])

In [104]:
# method 2 for creating the series with dates

s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32])

s.index = ['0802', '0803', '0804', '0805', '0806', 
           '0807', '0808', '0809', '0810', '0811' ]

In [105]:
s.loc['0805']

34

In [106]:
s.loc['0804':'0809']

0804    34
0805    34
0806    34
0807    34
0808    33
0809    32
dtype: int64

In [108]:
s.loc['0809':'0811'].max()

32

In [109]:
s.loc['0809':'0811'].describe()

count     3.0
mean     32.0
std       0.0
min      32.0
25%      32.0
50%      32.0
75%      32.0
max      32.0
dtype: float64

# Next up

1. Broadcasting 
2. Mask arrays 
3. More about indexes
4. Useful methods

10-minute break

In [110]:
# Create a series of 10 random integers (via NumPy)

np.random.seed(0)
s = Series(np.random.randint(0, 100, 10),
          index=list('abcdefghij'))  
s

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [111]:
s + s    # we'll get a new series back, with + happening on each index

a     88
b     94
c    128
d    134
e    134
f     18
g    166
h     42
i     72
j    174
dtype: int64

In [113]:
# what if, instead of adding two series objects, I add a series to a scalar (individual) value?

s + 10   # broadcasting -- 10 is added to each element in s

a    54
b    57
c    74
d    77
e    77
f    19
g    93
h    31
i    46
j    97
dtype: int64

In [115]:
s - 3

a    41
b    44
c    61
d    64
e    64
f     6
g    80
h    18
i    33
j    84
dtype: int64

In [116]:
s * 3


a    132
b    141
c    192
d    201
e    201
f     27
g    249
h     63
i    108
j    261
dtype: int64

In [117]:
s / 3    # truediv

a    14.666667
b    15.666667
c    21.333333
d    22.333333
e    22.333333
f     3.000000
g    27.666667
h     7.000000
i    12.000000
j    29.000000
dtype: float64

In [118]:
s // 3    # floordiv

a    14
b    15
c    21
d    22
e    22
f     3
g    27
h     7
i    12
j    29
dtype: int64

In [119]:
s ** 3   # to the 3rd power

a     85184
b    103823
c    262144
d    300763
e    300763
f       729
g    571787
h      9261
i     46656
j    658503
dtype: int64

In [120]:
s % 3   # remainder = modulus

a    2
b    2
c    1
d    1
e    1
f    0
g    2
h    0
i    0
j    0
dtype: int64

# Exercise: Convert our temperatures

Assign `s` to be our 10-day forecast:

- If your forecast is in Celsius, use broadcasting to get a new series in Fahrenheit
- If your forecast is in Fahrenheit, use broadcasting to get a new series in Celsius



In [121]:
# °F = (°C × 9/5) + 32
# °C = (°F − 32) × 5/9

In [122]:
s

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [123]:
s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
          index=['0802', '0803', '0804', '0805', '0806', 
                '0807', '0808', '0809', '0810', '0811' ])

In [124]:
s

0802    33
0803    34
0804    34
0805    34
0806    34
0807    34
0808    33
0809    32
0810    32
0811    32
dtype: int64

In [125]:
s * (9/5) + 32

0802    91.4
0803    93.2
0804    93.2
0805    93.2
0806    93.2
0807    93.2
0808    91.4
0809    89.6
0810    89.6
0811    89.6
dtype: float64

In [126]:
# since the result is a series, I can run lots of other methods:

(s * (9/5) + 32).describe()

count    10.000000
mean     91.760000
std       1.654086
min      89.600000
25%      90.050000
50%      92.300000
75%      93.200000
max      93.200000
dtype: float64

In [127]:
f_temps = s * (9/5) + 32

In [128]:
(f_temps - 32) * (5/9)

0802    33.0
0803    34.0
0804    34.0
0805    34.0
0806    34.0
0807    34.0
0808    33.0
0809    32.0
0810    32.0
0811    32.0
dtype: float64

In [130]:
s  # s is an instance of Series, and it has access to the Series methods

0802    33
0803    34
0804    34
0805    34
0806    34
0807    34
0808    33
0809    32
0810    32
0811    32
dtype: int64

In [132]:
# we can use any index values we want in fancy indexing
s.loc[['0802', '0808', '0806']]

0802    33
0808    33
0806    34
dtype: int64

In [133]:
# let's try something else
s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [134]:
# what if I try fancy indexing with this s?

s.loc[[2, 3, 0]]

2    30
3    40
0    10
dtype: int64

In [135]:
# what if, instead of passing a list of integers (for the indexes), I pass a list of boolean (True/False) values?

# Pandas will return all elements of our series s, that correspond 
# to a True value.  It will drop/ignore all elements that correspond
# to a False value.

s.loc[[True, False, True, True, False]]

0    10
2    30
3    40
dtype: int64

In [136]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [137]:
s + 30  # this broadcasts with our series, giving us a new series, with everything +30

0    40
1    50
2    60
3    70
4    80
dtype: int64

In [138]:
s < 30 # this broadcasts the < operator to our series, returning... a series of booleans

0     True
1     True
2    False
3    False
4    False
dtype: bool

In [139]:
s > 30

0    False
1    False
2    False
3     True
4     True
dtype: bool

In [140]:
s < s.mean()

0     True
1     True
2    False
3    False
4    False
dtype: bool

In [142]:
# each time I run a comparison between a series and a scalar value, I get back
# a series of booleans

# what happens if I then apply that series of booleans as a mask index to my series?

s.loc[s<30]  # what are the elements of s that are less than 30?

0    10
1    20
dtype: int64

In [143]:
s.loc[2:4]   # this means: s, from index 2 to index 4 -- we're passing a slice object to s.loc[]

2    30
3    40
4    50
dtype: int64

In [144]:
s.loc[[2,3,4]]  # this means, s, indexes 2,3, and 4 -- we're passing a list object to s.loc[]

2    30
3    40
4    50
dtype: int64

# Mask indexes

If I want selected values from my series, a standard way to do this is with a "mask index":

- We create a boolean series, based on broadcasting a comparison operator (`<`, `==`, etc.)
- We apply that boolean series as an index on the series.

The result: We get only those values for which the comparison is `True`.
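It can help to give the boolean series a name before applying it. A sketch, where `mask` is just a variable name I chose:

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50])

mask = s > 25                 # boolean series: False, False, True, True, True

print(s.loc[mask].tolist())   # [30, 40, 50]
print(mask.sum())             # 3 -- True counts as 1, so this counts the matches
print(mask.mean())            # 0.6 -- the fraction of elements that match
```

Because `True` and `False` behave like 1 and 0, calling `sum` or `mean` on the mask is a handy way to ask "how many?" or "what proportion?" without retrieving the values themselves.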

In [145]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [146]:
s < s.mean()

0     True
1     True
2    False
3    False
4    False
dtype: bool

In [147]:
# show all values in s that are < mean
s[s < s.mean()]

0    10
1    20
dtype: int64

In [148]:
# show all values in s that are >= 15

s[s >= 15]

1    20
2    30
3    40
4    50
dtype: int64

In [149]:
np.random.seed(0)
s = Series(np.random.randint(-50, 50, 10), 
          index=list('abcdefghij'))
s

a    -6
b    -3
c    14
d    17
e    17
f   -41
g    33
h   -29
i   -14
j    37
dtype: int64

In [157]:
# find positive numbers in s
s.loc[s > 0]

c    14
d    17
e    17
g    33
j    37
dtype: int64

In [156]:
# find negative numbers in s
s.loc[s < 0]

a    -6
b    -3
f   -41
h   -29
i   -14
dtype: int64

In [155]:
# find even numbers in s
s.loc[s%2 == 0]

a    -6
c    14
i   -14
dtype: int64

In [154]:
# find odd numbers in s
s.loc[s%2 == 1]

b    -3
d    17
e    17
f   -41
g    33
h   -29
j    37
dtype: int64

# Exercise: Family ages

1. Create a series in which the values are ages of people in your family, and the index contains their names.
2. Find all people who are below the mean age.
3. Find all people who are above the mean age + 1 std.
4. Find all people whose ages are odd.

In [158]:
s = Series([52, 50, 21, 19, 16],
          index=['RL', 'SF', 'AMLF', 'SBLF', 'ADLF'])
s

RL      52
SF      50
AMLF    21
SBLF    19
ADLF    16
dtype: int64

In [162]:
# find people below the mean age
s.loc[s < s.mean()]

AMLF    21
SBLF    19
ADLF    16
dtype: int64

In [166]:
# find people above mean age + 1 std
s.loc[s >  s.mean() + s.std()]

RL    52
SF    50
dtype: int64

In [168]:
s.loc[s%2 == 1]

AMLF    21
SBLF    19
dtype: int64

In [171]:
# get the index back, rather than the values?
s.loc[s%2 == 1].index.values

array(['AMLF', 'SBLF'], dtype=object)

In [173]:
# combine queries

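# careful: np.random.rand(0) returns an empty array; it does *not* seed the generator,
# so the values below change on every run (np.random.seed(0) is what sets the seed)
np.random.rand(0)
s = Series(np.random.randint(0, 100, 100))  # 100 numbers, each from 0 through 99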
s

0     17
1     79
2      4
3     42
4     58
      ..
95    19
96    46
97    42
98    56
99    60
Length: 100, dtype: int64

In [179]:
# let's find even numbers greater than the mean
# first, find s>s.mean(), returning a boolean series (same index as s)
# next, find s%2==0, returning a boolean series (same index as s)

# use & to get a new boolean series with the same index as s, where the values
# are True if both input series are True

s.loc[(s>s.mean()) & 
      (s%2==0)]

4     58
13    82
21    84
23    68
25    68
28    76
29    52
30    78
34    58
39    48
44    64
47    94
49    50
52    48
55    98
63    58
67    98
68    62
70    94
72    82
77    50
81    58
85    86
90    80
92    54
98    56
99    60
dtype: int64

# Combining conditions

If you want to get a new boolean series based on two existing boolean series:

- Use `&` for `and`, `|` for `or`, and `~` for `not`
- Put `()` around every clause in your condition
- Then apply the resulting boolean series on your series
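Here is a sketch of all three operators on a small series. The parentheses matter because `&` and `|` bind more tightly than comparisons such as `<`. The variable names are mine.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50])

either_end = s.loc[(s < 20) | (s > 40)]    # "or"
not_big = s.loc[~(s > 20)]                 # "not"
middle = s.loc[(s > 10) & (s < 50)]        # "and"

print(either_end.tolist())   # [10, 50]
print(not_big.tolist())      # [10, 20]
print(middle.tolist())       # [20, 30, 40]

# Python's own "and" / "or" don't work element by element on a series
try:
    s.loc[(s < 20) or (s > 40)]
    plain_or_worked = True
except ValueError:
    plain_or_worked = False   # Pandas complains that the truth value of a series is ambiguous
print(plain_or_worked)        # False
```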

# Exercise: Very big and very small numbers

1. Generate a series with 100 random integers (as I've done, using `np.random.randint`) from 0-1000.
2. Find the numbers that are either very big (i.e., > mean + std) or very small (i.e., < mean - std)

In [180]:
np.random.seed(0)
s = Series(np.random.randint(0, 1000, 100))
s

0     684
1     559
2     629
3     192
4     835
     ... 
95    398
96    611
97    565
98    908
99    633
Length: 100, dtype: int64

In [181]:
# boolean series indicating which numbers are very big
s > s.mean() + s.std()

0     False
1     False
2     False
3     False
4      True
      ...  
95    False
96    False
97    False
98     True
99    False
Length: 100, dtype: bool

In [182]:
# boolean series indicating which numbers are very small
s < s.mean() - s.std()

0     False
1     False
2     False
3      True
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Length: 100, dtype: bool

In [184]:
# find numbers that are either very big or very small
s.loc[(s > s.mean() + s.std()) |
      (s < s.mean() - s.std())]

3     192
4     835
8       9
14     70
22     87
23    174
25    849
28    845
29     72
31    916
32    115
33    976
36    847
39    850
40     99
41    984
42    177
46    147
47    910
50    961
58    151
62    882
63    183
64     28
66    128
67    128
68    932
69     53
70    901
78     42
81    888
84    999
85    937
86     57
88    870
89    119
92     82
93     91
94    896
98    908
dtype: int64

In [185]:
# what about 2 * std?

s.loc[(s > s.mean() + 2 * s.std()) |
      (s < s.mean() - 2 * s.std())]

Series([], dtype: int64)

# Next up

1. More on indexes
2. Useful methods on our series



In [187]:
np.random.seed(0)
s = Series(np.random.randint(0, 1000, 10),
          index=list('abcdefghij'))
s

a    684
b    559
c    629
d    192
e    835
f    763
g    707
h    359
i      9
j    723
dtype: int64

In [188]:
t = Series(np.random.randint(0, 1000, 10),
          index=list('jihgfedcba'))
t

j    277
i    754
h    804
g    599
f     70
e    472
d    600
c    396
b    314
a    705
dtype: int64

In [189]:
# the addition will take place via the indexes, not the positions
s + t

a    1389
b     873
c    1025
d     792
e    1307
f     833
g    1306
h    1163
i     763
j    1000
dtype: int64

In [190]:
684+705

1389

In [192]:
# index entries don't have to be unique!

t = Series(np.random.randint(0, 1000, 10),
          index=list('abcdeabcde'))
t

a    777
b    916
c    115
d    976
e    755
a    709
b    847
c    431
d    448
e    850
dtype: int64

In [193]:
# retrieve an index label that refers to multiple values, and you'll get a series back
t.loc['a']

a    777
a    709
dtype: int64

In [194]:
t.loc[['a', 'c']]

a    777
a    709
c    115
c    431
dtype: int64

In [195]:
# what will happen when we add s (index a-j) and t (index a-e, a-e) now?
s + t

a    1461.0
a    1393.0
b    1475.0
b    1406.0
c     744.0
c    1060.0
d    1168.0
d     640.0
e    1590.0
e    1685.0
f       NaN
g       NaN
h       NaN
i       NaN
j       NaN
dtype: float64
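
Labels that exist in only one of the two series produce `NaN`, and because `NaN` is a float, the whole result becomes `float64`. If I'd rather treat a missing partner as 0, one option is the `add` method with `fill_value`. A minimal sketch, with made-up series named `left` and `right`:

```python
from pandas import Series

left = Series([1, 2, 3], index=['a', 'b', 'c'])
right = Series([10, 20], index=['a', 'b'])

plain = left + right                     # 'c' has no partner, so we get NaN there
filled = left.add(right, fill_value=0)   # treat the missing partner as 0

print(plain)
print(filled)
```

Whether 0 is a sensible stand-in depends on the data. Sometimes `NaN` is the honest answer.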

In [196]:
t

a    777
b    916
c    115
d    976
e    755
a    709
b    847
c    431
d    448
e    850
dtype: int64

In [198]:
# use iloc when the index would make things ambiguous, or impossible
t.iloc[5]

709

In [199]:
# what if I try to get t.loc['a':'c']?

t.loc['a':'c']

KeyError: "Cannot get left slice bound for non-unique label: 'a'"
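
The slice fails because the label `'a'` appears more than once and the index isn't sorted, so pandas can't tell where the slice should start. One way around this, sketched below with a small made-up series, is to sort the index first. With a sorted index, label slices work even when labels repeat.

```python
from pandas import Series

t = Series([1, 2, 3, 4, 5, 6], index=list('abcabc'))

print(t.index.is_unique)                 # are all labels distinct?
print(t.index.is_monotonic_increasing)   # are the labels in sorted order?

sorted_t = t.sort_index()                # labels are now a, a, b, b, c, c
subset = sorted_t.loc['a':'b']           # label slices include the end label

print(subset)
```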

In [202]:
t.loc[['a', 'c', 'd']]

a    777
a    709
c    115
c    431
d    976
d    448
dtype: int64

In [203]:
t.loc[['a', 'c', 'd']].describe()

count      6.000000
mean     576.000000
std      305.947708
min      115.000000
25%      435.250000
50%      578.500000
75%      760.000000
max      976.000000
dtype: float64

# Index types

- Default is numeric, starting at 0, and going until the length - 1 (much like Python lists, strings, tuples).
- You can also have a range index, specified with `range(start, end)` or `range(start, end, step)`.
- You can have a string index, specified with a list of strings.
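
Here is a short sketch that builds one series with each of the three kinds of index. The variable names are my own.

```python
from pandas import Series

default_idx = Series([10, 20, 30])                         # labels 0, 1, 2
range_idx = Series([10, 20, 30], index=range(10, 16, 2))   # labels 10, 12, 14
string_idx = Series([10, 20, 30], index=['x', 'y', 'z'])   # labels x, y, z

print(default_idx.index)
print(range_idx.index)
print(string_idx.index)

# .loc always uses the labels, whatever type they are
print(range_idx.loc[12])
print(string_idx.loc['y'])
```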

In [204]:
s.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

In [205]:
t.index

Index(['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e'], dtype='object')

In [206]:
t.index = range(10, 30, 2)

In [207]:
t

10    777
12    916
14    115
16    976
18    755
20    709
22    847
24    431
26    448
28    850
dtype: int64

In [208]:
t.index

RangeIndex(start=10, stop=30, step=2)

In [209]:
t

10    777
12    916
14    115
16    976
18    755
20    709
22    847
24    431
26    448
28    850
dtype: int64

In [210]:
t.index = list('abcdeabcde')
t

a    777
b    916
c    115
d    976
e    755
a    709
b    847
c    431
d    448
e    850
dtype: int64

In [211]:
t.loc['e'] = 5555  # this will assign to two elements
t

a     777
b     916
c     115
d     976
e    5555
a     709
b     847
c     431
d     448
e    5555
dtype: int64

In [212]:
# let's set all values in s that are less than the mean to be 50
s

a    684
b    559
c    629
d    192
e    835
f    763
g    707
h    359
i      9
j    723
dtype: int64

In [213]:
s.loc[s<s.mean()] = 50  # we can assign here, using our boolean index

In [214]:
s

a    684
b    559
c    629
d     50
e    835
f    763
g    707
h     50
i     50
j    723
dtype: int64
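
Assigning via `.loc` changes the series in place. If I want to keep the original around, one alternative is the `mask` method, which returns a new series with replacements wherever the condition is `True`. A minimal sketch, using a shorter made-up series:

```python
from pandas import Series

s = Series([684, 559, 192, 835, 9], index=list('abcde'))

below_mean = s < s.mean()         # boolean series; the mean here is 455.8
capped = s.mask(below_mean, 50)   # new series; s itself is untouched

print(capped)
print(s)
```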

# Exercise: Weekend temps

1. Recreate our series of high temperatures, but instead of dates as the index, use day names (`Mon`, `Tue`).
2. What will be the mean temperature on weekends (Sat-Sun)?
3. What will be the mean temperature on weekdays?

In [215]:
s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
          index=['Tue', 'Wed', 'Thu', 'Fri', 'Sat', 
                'Sun', 'Mon', 'Tue', 'Wed', 'Thu' ])

In [217]:
s.loc[['Sat', 'Sun']].mean()

34.0

In [219]:
s.loc[['Mon', 'Tue', 'Wed', 'Thu', 'Fri']]

Mon    33
Tue    33
Tue    32
Wed    34
Wed    32
Thu    34
Thu    32
Fri    34
dtype: int64

In [221]:
# an index has many of the same methods as a series!
# for example, we can run the "isin" method on it, which returns True/False for each label

# here, we're saying: get a boolean array indicating whether each label in s's index
#  is in ['Sat', 'Sun']
# then apply that as a mask index on s

s.loc[s.index.isin(['Sat', 'Sun'])]

Sat    34
Sun    34
dtype: int64

In [223]:
# tilde (~) flips the logic (True->False, False->True) on a boolean series
# here, we do the same as above, but flip the logic, so that we get
# the opposite elements

s.loc[~s.index.isin(['Sat', 'Sun'])]

Tue    33
Wed    34
Thu    34
Fri    34
Mon    33
Tue    32
Wed    32
Thu    32
dtype: int64
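
Putting the two masks together answers both parts of the exercise. This is a sketch of one possible solution, not the only one. The names `is_weekend`, `weekend_mean` and `weekday_mean` are mine.

```python
from pandas import Series

temps = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
               index=['Tue', 'Wed', 'Thu', 'Fri', 'Sat',
                      'Sun', 'Mon', 'Tue', 'Wed', 'Thu'])

is_weekend = temps.index.isin(['Sat', 'Sun'])   # one True/False per label

weekend_mean = temps.loc[is_weekend].mean()     # rows where the mask is True
weekday_mean = temps.loc[~is_weekend].mean()    # rows where the mask is False

print(weekend_mean, weekday_mean)
```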

In [224]:
temps = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
          index=['Tue', 'Wed', 'Thu', 'Fri', 'Sat', 
                'Sun', 'Mon', 'Tue', 'Wed', 'Thu' ])

In [225]:
np.random.seed(0)
s = Series(np.random.randint(0, 1000, 100))
s

0     684
1     559
2     629
3     192
4     835
     ... 
95    398
96    611
97    565
98    908
99    633
Length: 100, dtype: int64

In [227]:
# what if I just want to look at the first part of a series? Or the last part?
# I can use the head and tail methods.  By default, they show 5 elements, but
# I can pass an integer to change that

s.head()

0    684
1    559
2    629
3    192
4    835
dtype: int64

In [228]:
s.tail()

95    398
96    611
97    565
98    908
99    633
dtype: int64

In [229]:
s.head(7)

0    684
1    559
2    629
3    192
4    835
5    763
6    707
dtype: int64

In [231]:
# I want to know: What are the most common values in my series?
# the value_counts method does this for us

# if I call temps.value_counts(), each of the distinct values in temps becomes
# an index element, and the number of times it appears becomes the value

# it's ordered, from most to least common

# it's a series, so we can do series things on it -- like .head()

temps.value_counts()

34    5
32    3
33    2
dtype: int64

In [232]:
# what fraction of the time did each value appear in temps?

temps.value_counts(normalize=True)

34    0.5
32    0.3
33    0.2
dtype: float64
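
With `normalize=True` we get fractions that add up to 1. Since the result is a series, broadcasting turns those fractions into percentages. A small sketch (`shares` and `percent` are illustrative names):

```python
from pandas import Series

temps = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
               index=['Tue', 'Wed', 'Thu', 'Fri', 'Sat',
                      'Sun', 'Mon', 'Tue', 'Wed', 'Thu'])

shares = temps.value_counts(normalize=True)   # fractions, most common first
percent = shares * 100                        # broadcast: multiply every value

print(percent)
print(shares.sum())                           # the fractions add up to 1
```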

In [234]:
# what are the unique (different) values in a series?
s.unique() # returns a NumPy array!

array([684, 559, 629, 192, 835, 763, 707, 359,   9, 723, 277, 754, 804,
       599,  70, 472, 600, 396, 314, 705, 486, 551,  87, 174, 849, 677,
       537, 845,  72, 777, 916, 115, 976, 755, 709, 847, 431, 448, 850,
        99, 984, 177, 797, 659, 147, 910, 423, 288, 961, 265, 697, 639,
       544, 543, 714, 244, 151, 675, 510, 459, 882, 183,  28, 802, 128,
       932,  53, 901, 550, 488, 756, 273, 335, 388, 617,  42, 442, 888,
       257, 321, 999, 937,  57, 291, 870, 119, 779, 430,  82,  91, 896,
       398, 611, 565, 908, 633])

In [237]:
temps.unique()

array([33, 34, 32])

In [238]:
# another way -- get the index from value_counts()
temps.value_counts().index

Int64Index([34, 32, 33], dtype='int64')
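
The two approaches return the same values, but in a different order: `unique` lists them in order of first appearance, while `value_counts` orders them from most to least common. If I only need to know how many distinct values there are, the `nunique` method gives me the count. A quick sketch:

```python
from pandas import Series

temps = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
               index=['Tue', 'Wed', 'Thu', 'Fri', 'Sat',
                      'Sun', 'Mon', 'Tue', 'Wed', 'Thu'])

by_appearance = temps.unique().tolist()               # order of first appearance
by_frequency = temps.value_counts().index.tolist()    # most common first
how_many = temps.nunique()                            # number of distinct values

print(by_appearance, by_frequency, how_many)
```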

In [239]:
s = Series([52, 50, 21, 19, 16],
          index=['RL', 'SF', 'AMLF', 'SBLF', 'ADLF'])
s

RL      52
SF      50
AMLF    21
SBLF    19
ADLF    16
dtype: int64

In [240]:
# what is the difference between each person's age and the previous person's age?

s.diff()

RL       NaN
SF      -2.0
AMLF   -29.0
SBLF    -2.0
ADLF    -3.0
dtype: float64

In [241]:
s.pct_change()  # silly with ages!

RL           NaN
SF     -0.038462
AMLF   -0.580000
SBLF   -0.095238
ADLF   -0.157895
dtype: float64
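
Both methods compare each value with the one before it, which is why the first result is always `NaN`. One way to convince myself of this is to rebuild them by hand with `shift`, which moves every value down by one row. A sketch, with `previous`, `manual_diff` and `manual_pct` as illustrative names:

```python
from pandas import Series

s = Series([52, 50, 21, 19, 16],
           index=['RL', 'SF', 'AMLF', 'SBLF', 'ADLF'])

previous = s.shift(1)                     # first row becomes NaN
manual_diff = s - previous                # same numbers as s.diff()
manual_pct = (s - previous) / previous    # same idea as s.pct_change()

print(manual_diff)
print(manual_pct)
```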

# Exercise: Change in weather

1. Using our high-temperature series, what are the 3 most common temperatures in your 10-day forecast?
2. On what day will there be the greatest numeric (not percentage) change in temperature?  How about the smallest change?

In [242]:
s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
          index=['0802', '0803', '0804', '0805', '0806', 
                '0807', '0808', '0809', '0810', '0811' ])

In [243]:
s.value_counts()

34    5
32    3
33    2
dtype: int64

In [244]:
s.value_counts().head(2)

34    5
32    3
dtype: int64

In [245]:
s.diff()

0802    NaN
0803    1.0
0804    0.0
0805    0.0
0806    0.0
0807    0.0
0808   -1.0
0809   -1.0
0810    0.0
0811    0.0
dtype: float64

In [248]:
# find the day(s) with the greatest increase in temp (the maximum diff)

s.loc[s.diff() == s.diff().max()]

0803    34
dtype: int64

In [249]:
# find the day(s) with the most negative change in temp (the minimum diff), i.e., the biggest drop

s.loc[s.diff() == s.diff().min()]

0808    33
0809    32
dtype: int64
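
Note that `max` and `min` on the diff give the biggest rise and the biggest drop. If "greatest change" means the size of the change regardless of direction, one option is to take the absolute value first. A sketch under that interpretation (`change`, `biggest` and `smallest` are my own names):

```python
from pandas import Series

s = Series([33, 34, 34, 34, 34, 34, 33, 32, 32, 32],
           index=['0802', '0803', '0804', '0805', '0806',
                  '0807', '0808', '0809', '0810', '0811'])

change = s.diff().abs()                     # size of each day-to-day change

biggest = s.loc[change == change.max()]     # days with the largest change, up or down
smallest = s.loc[change == change.min()]    # days with the smallest change

print(biggest)
print(smallest)
```

The first day never shows up in either result, because its diff is `NaN` and comparing `NaN` with anything gives `False`.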

# Next week:

1. dtypes -- what are they, and why do we care?
2. Data frames (2D data)