# <span style="color:blue">Course Plan 11/4/2019</span>
## <span style="color:blue">(Unchanged since 10/30/2019)</span>

## Updated schedule through the rest of the semester

|  Wk   |  M    |  W     | Topic   | Notebooks |
| :---: | :---: | :----: | :------ | :----- |
|  8  |  10/21  | 23  | **Numpy:** Data Abstraction, **Numpy:** Multi-dimensional arrays | Midterm, 03-01, 03-02 |
|  9  |  28  | 30  | **Numpy:** Reading into multi-dimensional arrays, **Pandas:** Dataframes and reading into them; Merging and matching Dataframes | 03-03, 03-04, 03-05 |
|  10  |  11/4  | 6  | **Pandas:** Series and Views; Wrap Up Unit 3 | 03-06, 03-07 |
|  11 |  11  | 13   | Classification and Clustering, **Case Study:** Iris Data Set | 04-02, 04-03  |
|   |    |    | Notebooks under development&dagger;  | <del>04-04, 04-06, 04-07</del>  |
|  12 |  18  | 20  | **Case Study:** [World Happiness Report](https://worldhappiness.report/ed/2019/)  | 04-04, 05-01 |
|  13 |  25   | &mdash;  | [Geopandas](http://geopandas.org/), **Case Study:** World Happiness Map | 05-03 |
|  14 |  12/2 | 4 |  **Case Study:** Twitter Sentiment Analysis | 05-04 |
|  16 |  | 12/13 | **(Take Home) Final Exam**  | |

&dagger; We will not be covering these notebooks this semester. Feel free to peruse them if interested.

<hr/>


# <span style="color:blue">Series and Views</span>

* There are two hidden but powerful parts of Pandas `DataFrame`s
* Series is the type of one column from a `DataFrame`
   * enables column operations 
   * acts like a `numpy` `ndarray`. 
* Views are subsets of the original `DataFrame` where editing them changes the original. 
   * a new syntax creates views. 
   * This is the easiest way to edit a `DataFrame`

# The hidden type: Series

When we act on columns in a `DataFrame`, they are actually of type `Series`. 
* `Series` acts a lot like an `ndarray`.
* most `ndarray` functions supported. 
* default index is integer offset. 

But `Series` is -- in fact -- much more expressive than `ndarray`
* Can index by non-numeric data, i.e., one can "label" rows. 
* Can optimize operations by careful choices in indexing. 

Consider:

### Wait, in `03-04-dataframes-in-pandas` didn't we say that a pandas column is _exactly_ an ndarray?

Not exactly, because we had discovered that <span style="color:green">**type**</span>(data1) is `pandas.core.frame.DataFrame`. Let's dig a bit deeper:

In [34]:
import pandas as pd
import numpy as np

d1 = pd.DataFrame({ 'a': [1,2,3], 'b': [4,5,6], 'c': [7,8,9]})
d1

|   | a | b | c |
| :---: | :---: | :---: | :---: |
| 0 | 1 | 4 | 7 |
| 1 | 2 | 5 | 8 |
| 2 | 3 | 6 | 9 |


In [35]:
type(d1['a'])

pandas.core.series.Series

What is `Series`? See the pandas documentation on [`Series`](https://pandas.pydata.org/pandas-docs/version/0.25/reference/series.html). Also, see the documentation on [`Series.values`](https://pandas.pydata.org/pandas-docs/version/0.25/reference/api/pandas.Series.values.html)

In [36]:
print (d1['a'].values)
print (type(d1['a'].values))

[1 2 3]
<class 'numpy.ndarray'>


There is an additional level of "decoration" between `d1` (a `DataFrame`) and `numpy.ndarray`: a `Series`, which happens to be `d1['a']` in our case!

The essential difference between a `Series` and a `numpy.ndarray` is that while the NumPy array has an _implicitly defined_ integer index used to access the values, the Pandas `Series` has an _explicitly defined_ index associated with the values. See the **labels** discussion below.
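
As a rough illustration of the implicit versus explicit index, the sketch below wraps an array in a `Series` with made-up labels (`'x'`, `'y'`, `'z'` and the variable names are assumptions for this example, not part of the course notebooks).

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])                      # positions 0, 1, 2 are implicit
labeled = pd.Series(values, index=['x', 'y', 'z'])   # labels are explicit

print(values[1])             # positional access on the ndarray
print(labeled['y'])          # label access on the Series
print(list(labeled.index))   # the index is an object you can inspect
```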

In [37]:
d1['a']  # one column

0    1
1    2
2    3
Name: a, dtype: int64

In [38]:
type(d1['a'])  # it's still a Series

pandas.core.series.Series

In [39]:
d1['a'][1]  # [column][row]

2

In [40]:
d1['a'].sum()  # all rows 

6

In [41]:
d1['b'].mean()  # all rows 

5.0
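
To see how a `Series` behaves like an `ndarray` in calculations, here is a small sketch with hypothetical data. It applies a NumPy function and plain arithmetic to a `Series`; in both cases the result keeps the row labels.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=['p', 'q', 'r'])

roots = np.sqrt(s)    # NumPy ufunc applied elementwise; the result is still a Series
doubled = s * 2       # arithmetic is vectorized, as with an ndarray

print(roots)
print(doubled.sum())
```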

# A few caveats
1. Treat a Series obtained via the syntax `df[column]` as a copy. Assigning to it with a label that is not one of its row labels adds a new entry rather than changing the original. 

In [81]:
df = pd.DataFrame(np.random.randn(7,7), columns=list('ABCDEFG'), index=range(1,8))
foo = df['E']
foo['E'] = 42
print(type(foo))
foo

<class 'pandas.core.series.Series'>


1    -0.919602
2     0.356606
3    -0.831479
4    -1.176133
5    -1.036663
6    -0.187646
7     1.273478
E    42.000000
Name: E, dtype: float64

In other words, we just added a new label 'E' to the Series `foo`! We didn't change any of the existing values in `foo`, and `df` is untouched. Trying another way, we get an interesting warning:

In [82]:
df = pd.DataFrame(np.random.randn(7,7), columns=list('ABCDEFG'), index=range(1,8))
df[df['E'] < 0]['E'] = 42
df.query('2 < index <= 5')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


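|   | A | B | C | D | E | F | G |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 3 | 0.844677 | -0.950417 | 0.33225 | -0.072478 | 0.909121 | -0.854316 | 1.599473 |
| 4 | 1.065013 | -1.803072 | 2.219886 | 0.039003 | -1.881972 | 0.001275 | -1.851175 |
| 5 | -0.121365 | -0.653994 | 0.13457 | -0.754715 | 0.917643 | -0.174008 | 1.474087 |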
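
The warning suggests doing the selection and the assignment in a single `.loc` call. A minimal sketch of that form follows, using a small hypothetical frame `scores` with fixed values instead of random ones so the result is predictable.

```python
import pandas as pd

scores = pd.DataFrame({'E': [-1.5, 0.5, -0.2, 2.0]}, index=range(1, 5))

# One indexing step: the row mask and the column label go in the same .loc call.
scores.loc[scores['E'] < 0, 'E'] = 42
print(scores)
```

Because there is only one indexing step, there is no intermediate object for the assignment to get lost in.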


## Reminder about Python list behavior

In [83]:
a = [1, 5, 10, 3, 99, 5, 8, 20, 40]
print (a)
print (a[2:6])
b = a[2:6]
print (a[2:6][0], '\n')
a[2:6][0] = 50
print (a[2:6])
b[0] = 50
print (b, '\n')
print (a)

[1, 5, 10, 3, 99, 5, 8, 20, 40]
[10, 3, 99, 5]
10 

[10, 3, 99, 5]
[50, 3, 99, 5] 

[1, 5, 10, 3, 99, 5, 8, 20, 40]


The underlying list didn't change! However, for dataframes...

In [84]:
d1['b'][1] = 20
d1

|   | a | b | c |
| :---: | :---: | :---: | :---: |
| 0 | 1 | 4 | 7 |
| 1 | 2 | 20 | 8 |
| 2 | 3 | 6 | 9 |


## View vs Copy rules

The rules for when we have a copy and when we can see through to the underlying data structure:

* All operations generate a copy
* If `inplace=True` is provided, it will modify in-place; only some operations support this.
* An indexer that sets, e.g. `.loc`/`.iloc`/`.iat`/`.at`, will set in-place.
* An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example from above is for `.query`; this will always return a copy, as it's evaluated by numexpr.)
* An indexer that gets on a multiple-dtyped object is always a copy.

[Source](https://stackoverflow.com/a/23296545/653651)
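
The `inplace=True` rule can be sketched with `rename`, one of the operations that accepts the flag. The frame and column names here are invented for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

renamed = frame.rename(columns={'a': 'alpha'})   # returns a new DataFrame
print(list(frame.columns))      # frame itself is untouched
print(list(renamed.columns))

result = frame.rename(columns={'b': 'beta'}, inplace=True)  # modifies frame
print(result)                   # in-place calls return None
print(list(frame.columns))
```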

# Indexes
An index is a set of labels for rows. The default index is the integers 0 through n-1. Index labels need not be integers. Let's use letters. 

In [9]:
d1['labels'] = ['d', 'e', 'f']
d1

|   | a | b | c | labels |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 1 | 4 | 7 | d |
| 1 | 2 | 20 | 8 | e |
| 2 | 3 | 6 | 9 | f |


In [10]:
d2 = d1.set_index('labels')
d2

| labels | a | b | c |
| :---: | :---: | :---: | :---: |
| d | 1 | 4 | 7 |
| e | 2 | 20 | 8 |
| f | 3 | 6 | 9 |


In [11]:
d2['a']

labels
d    1
e    2
f    3
Name: a, dtype: int64

# Whoa there! What just happened?
* Labeling a `DataFrame` usually creates a new `DataFrame`.
* Series also support row labels. 
* Changing the labels on a `DataFrame` changes the labels on all Series. 
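
A short sketch of these three points, using a hypothetical frame `plain` rather than `d1`: `set_index` hands back a new `DataFrame`, the original keeps its integer index, and each column `Series` of the new frame carries the new labels.

```python
import pandas as pd

plain = pd.DataFrame({'a': [1, 2, 3], 'labels': ['d', 'e', 'f']})
labeled = plain.set_index('labels')   # a new DataFrame; plain is not modified

print(list(plain.index))          # still the default integer index
print(list(labeled.index))        # the new row labels
print(list(labeled['a'].index))   # each column Series carries the same labels
```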

We can access by column and row, as before: 

In [12]:
d2['a']['e']

2

but the following less intuitive syntax is recommended for performance reasons. 
* `:'e'` a *row range:* labels up to and including 'e'
* `'b':`  a *column range:* labels from 'b' upward. 
* `:` by itself denotes all.

In [13]:
d2.loc[:'e','b':]  # create a view of d2

| labels | b | c |
| :---: | :---: | :---: |
| d | 4 | 7 |
| e | 20 | 8 |
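
Note that label slices in `.loc` include the end label, unlike position slices. The sketch below contrasts the two on a made-up frame `grid`; the comparison with `.iloc` is added here for illustration only.

```python
import pandas as pd

grid = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
                    index=['d', 'e', 'f'])

by_label = grid.loc[:'e', 'b':]    # rows d..e inclusive, columns b..c
by_position = grid.iloc[:1, 1:]    # row 0 only: position slices exclude the stop

print(by_label.shape)
print(by_position.shape)
```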


# Not particularly intuitive, but very powerful. 
* The addressing form `.loc[]` above has significant powers. 
* Consider

In [14]:
d2.loc[:'e', 'b':] = 42
d2

| labels | a | b | c |
| :---: | :---: | :---: | :---: |
| d | 1 | 42 | 42 |
| e | 2 | 42 | 42 |
| f | 3 | 6 | 9 |


The assignment set multiple cells to a value. 
This is a special case of a more general property. 

# Copies and views

In dealing with Pandas, there are two kinds of derived data: 
* *Copies* are decoupled from the original data. 
* *Views* retain their coupling with the original data. 

The meaning of the word *view* is consistent with its use in databases. 

The key issue is again *mutability*. 
* Changing a view changes the original data. 
* Changing a copy does not. 

The curious notation `df.loc[rows, columns]` creates a *view*. 
* Not separate from the original `DataFrame`. 
* Changing it changes the original `DataFrame`! 

The more typical notation `df[columns][rows]` creates a *copy*. 
* The copy is independent of the original. 
* Changing it doesn't change the original data. 
* The first bracket does the copy. 
* This avoids confusion when using row expressions. 

Consider, e.g., 

In [15]:
v1 = d2.loc['e':,'b':]  # a view
v1



| labels | b | c |
| :---: | :---: | :---: |
| e | 42 | 42 |
| f | 6 | 9 |


In [16]:
v1.loc['e','b']=100
v1

| labels | b | c |
| :---: | :---: | :---: |
| e | 100 | 42 |
| f | 6 | 9 |


In [17]:
d2

| labels | a | b | c |
| :---: | :---: | :---: | :---: |
| d | 1 | 42 | 42 |
| e | 2 | 100 | 42 |
| f | 3 | 6 | 9 |


# Whoa there! What happened?
The view `v1` was an alias for a subset of `d2`, and changing `v1` changed `d2`. 

# Views can be partial

In [18]:
v1['foo'] = True  # a new column, not part of the view
v1

| labels | b | c | foo |
| :---: | :---: | :---: | :---: |
| e | 100 | 42 | True |
| f | 6 | 9 | True |


In [19]:
v1.loc['e', 'c'] = 200
v1


| labels | b | c | foo |
| :---: | :---: | :---: | :---: |
| e | 100 | 200 | True |
| f | 6 | 9 | True |


In [20]:
d2

| labels | a | b | c |
| :---: | :---: | :---: | :---: |
| d | 1 | 42 | 42 |
| e | 2 | 100 | 200 |
| f | 3 | 6 | 9 |


# Copies are decoupled
Consider: 

In [21]:
c1 = d2[['b', 'c']][:'e']  # copied 
c1

| labels | b | c |
| :---: | :---: | :---: |
| d | 42 | 42 |
| e | 100 | 200 |


In [22]:
c1.loc['e', 'b'] = 300  # technically a view of a copy(!)
c1  # gets changed

| labels | b | c |
| :---: | :---: | :---: |
| d | 42 | 42 |
| e | 300 | 200 |


In [23]:
d2  # doesn't reflect change of copy. 

| labels | a | b | c |
| :---: | :---: | :---: | :---: |
| d | 1 | 42 | 42 |
| e | 2 | 100 | 200 |
| f | 3 | 6 | 9 |


# Why is this so weird? 
* Pandas is an evolving library. 
* The copy syntax (e.g., `df[columns][rows]`) evolved first, to enable column operations. 
* The view syntax (e.g., `df.loc[rows, columns]`) evolved last, to enable setting cells easily (and for efficiency). 
* People were already using the copy syntax widely, and Pandas couldn't change that without breaking users' code. 
* So Pandas instituted a new, separate syntax for the different use case. 

# Labels on series
* Series can be labeled as well. 
* They inherit their labels from the `DataFrame`. 
* All Series taken from the same `DataFrame` share the same row labels. 
* Some of the `Series` queries look like `DataFrame` queries. 

Consider

In [24]:
s1 = d1['b']
s1

0     4
1    20
2     6
Name: b, dtype: int64
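
With a labeled index, the resemblance between `Series` and `DataFrame` queries is easier to see. The sketch below uses a hypothetical one-column frame with letter labels; the names are assumptions for this example.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 20, 6]}, index=['d', 'e', 'f'])
column = frame['b']           # a Series that inherits the row labels

print(column.loc['e'])        # single label, like frame.loc['e', 'b']
print(column.loc[:'e'])       # label slice, inclusive of 'e'
print(column[column > 5])     # boolean mask, as with a DataFrame
```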

# Let's put this into practice
First, let's register you for grading. 

In [85]:
# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('03-06-dataframe-views.ok')

Assignment: 03-06 Dataframe views
OK, version v1.14.15



Let's make up a test `DataFrame`: 

In [26]:
df = pd.DataFrame({
    'name': ['Garfield', 'Bill', 'Snoopy', 'Dogbert'],
    'kind': ['cat', 'cat', 'dog', 'dog'],
    'weight': [20, 10, 15, 10],
    'food': ['lasagna', 'roadkill', 'canned', 'pate']
})
df

|   | name | kind | weight | food |
| :---: | :---: | :---: | :---: | :---: |
| 0 | Garfield | cat | 20 | lasagna |
| 1 | Bill | cat | 10 | roadkill |
| 2 | Snoopy | dog | 15 | canned |
| 3 | Dogbert | dog | 10 | pate |


1. Create a new `DataFrame` `pets` from `df` that is indexed by name. 

In [27]:
# your answer: 
pets = df.set_index(df.name)
print(pets)

              name kind  weight      food
name                                     
Garfield  Garfield  cat      20   lasagna
Bill          Bill  cat      10  roadkill
Snoopy      Snoopy  dog      15    canned
Dogbert    Dogbert  dog      10      pate


In [28]:
_ = ok.grade('q01')  # run this to check your work. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



2. In `pets`, set Snoopy's weight to 16. 

In [29]:
# Your answer:
pets.loc['Snoopy', 'weight'] = 16
pets

| name | name | kind | weight | food |
| :---: | :---: | :---: | :---: | :---: |
| Garfield | Garfield | cat | 20 | lasagna |
| Bill | Bill | cat | 10 | roadkill |
| Snoopy | Snoopy | dog | 16 | canned |
| Dogbert | Dogbert | dog | 10 | pate |


In [30]:
_ = ok.grade('q02')  # run this to check your work. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



3. Create a copy `dogs` that consists of just the dogs in `pets`.

In [31]:
# Your answer: 
dogs = pets.loc[pets.kind == 'dog']
dogs

| name | name | kind | weight | food |
| :---: | :---: | :---: | :---: | :---: |
| Snoopy | Snoopy | dog | 16 | canned |
| Dogbert | Dogbert | dog | 10 | pate |


In [32]:
_ = ok.grade('q03')  # run this to check your work. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



4. In `dogs`, set Dogbert's weight to 25. This will print a warning. 

In [33]:
# Your answer: 
dogs.loc['Dogbert','weight'] = 25
dogs

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


| name | name | kind | weight | food |
| :---: | :---: | :---: | :---: | :---: |
| Snoopy | Snoopy | dog | 16 | canned |
| Dogbert | Dogbert | dog | 25 | pate |


In [34]:
_ = ok.grade('q04')  # run this to check your work. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



In [35]:
pets  # what happened to the original?

| name | name | kind | weight | food |
| :---: | :---: | :---: | :---: | :---: |
| Garfield | Garfield | cat | 20 | lasagna |
| Bill | Bill | cat | 10 | roadkill |
| Snoopy | Snoopy | dog | 16 | canned |
| Dogbert | Dogbert | dog | 10 | pate |


5. Create a Series `weights` of `dogs` with just the weights.

In [36]:
weights = dogs.loc[:,'weight']
print(weights)

name
Snoopy     16
Dogbert    25
Name: weight, dtype: int64


In [37]:
_ = ok.grade('q05')  # run this to check your work. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



6. Change Dogbert's weight to 35 in the copy `weights`. This will print a warning. 

In [38]:
# Your answer: 
weights['Dogbert'] = 35
print(weights)

name
Snoopy     16
Dogbert    35
Name: weight, dtype: int64


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [39]:
_ = ok.grade('q06')  # run this to check your work. 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



In [40]:
dogs  # Did you change the copy? 

| name | name | kind | weight | food |
| :---: | :---: | :---: | :---: | :---: |
| Snoopy | Snoopy | dog | 16 | canned |
| Dogbert | Dogbert | dog | 35 | pate |


(Ed.'s note: This is amusing. It warns me that the assignment won't change the original, and then changes it anyway. If the type of this object were `DataFrame`, the warning would be reasonable, but the `weights` object is of type `Series`, so the warning is moot.)

In [41]:
pets  # check that you didn't change the top-level original

| name | name | kind | weight | food |
| :---: | :---: | :---: | :---: | :---: |
| Garfield | Garfield | cat | 20 | lasagna |
| Bill | Bill | cat | 10 | roadkill |
| Snoopy | Snoopy | dog | 16 | canned |
| Dogbert | Dogbert | dog | 10 | pate |
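
When a decoupled subset is what you want, one common way to say so explicitly is to call `.copy()` on the selection. The sketch below uses a hypothetical frame `animals` rather than `pets`, and is one possible approach, not the graded answer to the exercises above.

```python
import pandas as pd

animals = pd.DataFrame({'kind': ['cat', 'dog', 'dog'], 'weight': [20, 15, 10]},
                       index=['Garfield', 'Snoopy', 'Dogbert'])

# .copy() states the intent: this subset is independent of animals.
only_dogs = animals.loc[animals.kind == 'dog'].copy()
only_dogs.loc['Dogbert', 'weight'] = 25

print(only_dogs)
print(animals.loc['Dogbert', 'weight'])   # the original is unchanged
```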


7. **Challenge problem:** (optional) Create a version of `dogs` that is a *view* and demonstrate that it is a view by making a change in the view that is reflected in `pets`. I have been unable to do this! I wonder if it's possible!

In [None]:
# Your answer: 
dogs = ...
dogs