## Before you start: Create the submitter function

In [None]:
import sys
sys.path.append( '../')
import dw_utils2 

ans_submit = dw_utils2.create_submitter(host='52.91.20.10', port=80, 
                       user="mateo.restrepo@yuxiglobal.com", 
                        # put your full yuxi email address here, including @yuxiglobal.com
                       ws_key="dw3", #this is the workshop key, don't change it 
                       token="1nwK1Z4kN1Rk" ) #put the token that Mateo sent to you on an e-mail on Wednesday...

In [8]:
ans_submit( "Agg1", "")

Agg1 = ''
Answer for question Agg1 is incorrect ☹



## What is "data"?

Data is any collection of (true) **facts or statements** about **objects** (or entities) in some domain. Traditionally, data described objects in nature, society or the "real" world in general. Lately, there is an increasing amount of data that refers to objects in the "virtual" world, such as web transactional events, chats, user profiles, etc.

Many times, these facts or statements take the form of definite values for certain **attributes** of the objects in question. 

**Examples:** 

  * The objects are 'users' of an app, and their attributes are name, date of birth, e-mail, hair-color, and so on...
  * The objects are 'events' in a server, and their attributes are the time, whether the event was user generated or not, whether it was an error or not, and the user associated with the event (if any).
  
Attributes are no more than (partial) **functions** that map entities to other (often simpler) data types, such as `int`, `double`, `string` or `bool`:

$$ \mathtt{name} : \mathtt{User} \rightarrow \mathtt{String} $$

$$ \mathtt{date\_of\_birth} : \mathtt{User} \rightarrow \mathtt{Date} $$

$$ \mathtt{event\_time} : \mathtt{ServerEvent} \rightarrow \mathtt{DateTime}  $$

$$ \mathtt{is\_user\_generated} : \mathtt{ServerEvent} \rightarrow \mathtt{Bool}  $$

$$ \mathtt{hair\_color} : \mathtt{User} \rightarrow \mathtt{Color} \equiv {\mathbb{R}^3} $$
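
The same idea can be sketched in plain Python. The `User` class, its fields and the sample values below are hypothetical and serve only as an illustration; a *partial* function is modeled here by returning `None` when the attribute is undefined for an object.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class User:
    """A hypothetical entity type, used only for illustration."""
    name: str
    date_of_birth: Optional[date] = None  # unknown for some users


def name(user: User) -> str:
    """Attribute name : User -> String (defined for every user)."""
    return user.name


def date_of_birth(user: User) -> Optional[date]:
    """Attribute date_of_birth : User -> Date (partial: may be undefined)."""
    return user.date_of_birth


users = [User("Ana", date(1990, 5, 17)), User("Luis")]
names = [name(u) for u in users]
birth_dates = [date_of_birth(u) for u in users]
print(names)
print(birth_dates)
```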


  
## Kinds of data 

Data can be roughly classified into one of three kinds. 

* **Structured data**: Data that can **easily** be put in tabular form that makes processing it convenient. Structured data consists of a multitude of identically formatted data elements (records) each having roughly the same set of (relatively simple) attributes that capture most of what is relevant about them. Much more on this below.  

* **Unstructured data**: Data that is hard to put in tabular form without either significant loss of information or making it harder to process. Examples: a collection of texts written by humans, a collection of high-resolution photos, audio recordings, videos, x-rays.

* **Semi-structured data**: Data that can be partly but not wholly put in tabular form. Examples: web logs, chats, tweets, maps.


**Note:** Working with unstructured or semi-structured data often starts with converting it into a structured (tabular) form.


## Structured Data: an executive summary

In what follows we shall focus on structured data. In future workshops we will revisit semi-structured and unstructured data.

Recall that structured data is data that can be represented faithfully in **tabular form** and for which such a representation is convenient for processing.

A structured data-set consists of many **records**. Each record holds data about one particular instance of an entity (or object) type. A record consists simply of values for a fixed set of attributes for that instance. In principle, the **set of attributes** is fixed and common across all records in the data set; it is the values for those attributes that vary.
It is not uncommon for some attributes to be undefined for some, sometimes many, of the objects.

We then say that all records have the same "shape", since they all contain the **same attributes** and the **type** (set of possible values) **of each attribute is fixed**.

When represented in **tabular** form, it is usual to refer to a dataset simply as a **table**. We will use the terms **dataset** and **table** interchangeably. Similarly, in the usual tabular representation, **records** become **rows** and each attribute becomes a **field**. A **column** then refers collectively to all values for a particular field across all records of a given table.
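
The vocabulary above can be illustrated with a tiny table. The three records below are invented for the example and have nothing to do with the workshop data.

```python
import pandas as pd

# Three hypothetical records, each with the same set of attributes.
records = [
    {"name": "Ana", "age": 31, "city": "Medellin"},
    {"name": "Luis", "age": 45, "city": "Bogota"},
    {"name": "Sara", "age": 28, "city": "Cali"},
]

table = pd.DataFrame(records)   # records become rows, attributes become fields

n_rows, n_cols = table.shape    # (number of records, number of fields)
fields = list(table.columns)    # the fixed set of attributes
age_column = table["age"]       # a column: all values of one field

print(table)
```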

Purists of database theory sometimes say **relation** to mean table. The technical term actually comes from [math](https://en.wikipedia.org/wiki/Finitary_relation)...

## Example of structured data: 

The following data comes from the didactic competition [House Prices: Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)

First, we import the `pandas` module (library) and alias it as `pd`. This is a very common practice in Python, as it makes code more concise without sacrificing readability.


In [None]:
import pandas as pd 

print( "Pandas version is: ", pd.__version__, 
       "  Keep this in mind when looking at the API docs!")

Then we use the function `read_csv` to load the data and return it as an object of class `DataFrame`.
Most likely, you will have to change the path in the call to the function. Remember to use `/` to separate path components, instead of `\`, which is used to escape special characters (just like in C).

In [None]:
df = pd.read_csv( "C:/_DATA/Data Workshops/DW2/house_prices_and_characteristics.csv")

In [None]:
type( df ) # this yields the type of the object. In Python 3 every type is a class

In [None]:
df  # this attempts to display the contents of the DataFrame object in an html-friendly way

**Notice:**
  * The first column, containing the numbers from 0 to 1459 in boldface, is not actually a field, but an **index** of the rows. That means that these values can be used to refer to individual rows. In other DataFrames, the index will consist of keys of a different type, for example strings or tuples, and not just mere numbers.
  * The visualization above only shows the first few columns, then ..., then the last few columns. Similarly for the rows. The number of rows and columns shown is configurable, but one rarely needs to change it.
  * Some columns contain the special value 'NaN' (Not-a-Number) in some of the entries. 'NaN' is actually a special value of the type float (64-bit floating-point numbers, called 'double' in C/C++/Java, etc.). NaN can be obtained by evaluating the expression `float("NaN")`.

### Motivational plot

With pandas, it's very easy to create a quick plot of one column against another...

**NOTE**: If the following plot doesn't show after evaluating the cell once, just run it a second time.

In [None]:
#**NOTE**: If this cell doesn't show after evaluating it once, just run it a second time

df.plot.scatter( x="GrLivArea" , y="SalePrice" )

### A note on missing values in Python/pandas

In [None]:
not_a_number = float( "NaN" )
not_a_number

In [None]:
type( not_a_number )

**ACHTUNG!** `nan` does not equal any other value, not even itself!

In [None]:
not_a_number == not_a_number 

One way to check properly whether something is NaN is to use the `isnan` function from the `math` module:

In [None]:
import math

In [None]:
math.isnan( not_a_number )

In [None]:
math.isnan( 2 )

In [None]:
math.isnan( "nan" ) 
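
The last call above raises a `TypeError`, because `math.isnan` only accepts numbers. One possible way to get a check that tolerates arbitrary values is sketched below; the helper name `is_nan` is hypothetical and not part of any library.

```python
import math


def is_nan(value) -> bool:
    """Return True only when value is a float NaN; never raises."""
    return isinstance(value, float) and math.isnan(value)


not_a_number = float("NaN")

results = {
    "nan == nan": not_a_number == not_a_number,   # NaN never equals itself
    "nan != nan": not_a_number != not_a_number,   # so it differs from itself
    "is_nan(nan)": is_nan(not_a_number),
    "is_nan(2)": is_nan(2),
    "is_nan('nan')": is_nan("nan"),               # a string, not a float
}
print(results)
```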

Questions: 

  * **(X1)** In the dataset above, what is the name of the first column (going from left to right) that contains 'NaN' values? 
  * **(X2)** What is the name of the first non-numeric column (going from left to right) that contains 'NaN' values?
  * **(X3)** Look at the last line in the output above, what is the total number of records (rows)? 
  * **(X4)** What is the total number of columns?

In [None]:
ans_submit( "X1", "...your answer here...") 
ans_submit( "X2", "...your answer here...") 
ans_submit( "X3", "...your answer here...") 
ans_submit( "X4", "...your answer here...") 

Here is how you can get the full list of column names for the DataFrame

In [None]:
df.columns

To get the columns together with their data types you can access the `dtypes` attribute:

In [None]:
df.dtypes

Notice how fields of type string appear as having type `object`. This means each data cell in the table only holds a *reference to* the actual object, an immutable string stored somewhere else. For variable-length strings this is the only way to "store" them in a table. Thus accessing a cell value that contains an object implicitly requires extra dereferencing operations and is a bit slower than accessing an **int64** or **float64**. For fixed-length strings there are other alternatives that are *faster* but at the same time might make the memory requirements of the whole dataframe grow.
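
A small sketch of the point about references, on invented data. The string column is created with `dtype=object` explicitly, since how pandas infers the type of string columns may depend on the installed version.

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [208500, 181500, 223500],                        # stored inline as int64
    "zoning": pd.Series(["RL", "RM", "RL"], dtype=object),    # references to str objects
})

first_cell = toy["zoning"].iloc[0]
cell_type = type(first_cell)      # each cell points to a regular Python str

# Shallow usage counts only the references; deep usage also follows them.
shallow = toy["zoning"].memory_usage(index=False, deep=False)
deep = toy["zoning"].memory_usage(index=False, deep=True)

print(toy.dtypes)
print(cell_type, shallow, deep)
```

The gap between the shallow and the deep figure is the memory taken by the string objects themselves, which live outside the table.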


**Question X5:** Besides `int64` and `object`, what is another data type that appears in the list above?  

In [None]:
ans_submit( "X5", "...your answer here...")

# 10 minutes to Pandas video

Take your time to go over the 10 minutes to Pandas video, available on this link: https://vimeo.com/59324550
        
    

# Data operations

There are a few data operations on data sets that any data-processing tool should implement. 

These could be roughly classified in the following categories:

  * **Enriching (or Transforming)**: adding newly calculated columns or indices to a data set.
  * **Filtering**: picking a subset of the rows or columns of a data set according to some criterion.
  * **Indexing**: adding indices to a data set.
  * **Aggregating**: grouping rows and computing a summary value for each group.
  * **Sorting**: sorting the rows of a data set according to some criterion.
  * **Merging**: merging two data sets in some way. This includes concatenation (horizontal or vertical) and also joining.
  * **Summarizing**: computing summaries of a data set.
  * **Pivoting**: This includes transposing and doing other operations so that data that originally had a vertical layout is laid out horizontally (increasing the number of columns) or vice versa (increasing the number of rows). The video above has a few cool examples to better understand this concept.
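
To make three of these operations concrete, here is a minimal sketch on two invented tables (`houses` and `zones` are hypothetical and unrelated to the workshop data). It shows one reasonable way to sort, aggregate and join with pandas, not the only one.

```python
import pandas as pd

houses = pd.DataFrame({
    "house_id": [1, 2, 3, 4],
    "zone": ["A", "B", "A", "B"],
    "price": [100, 250, 300, 150],
})
zones = pd.DataFrame({"zone": ["A", "B"], "zone_name": ["North", "South"]})

# Sorting: order the rows by price, highest first.
by_price = houses.sort_values("price", ascending=False)

# Aggregating: one summary value (the mean price) per group of rows.
mean_price = houses.groupby("zone")["price"].mean()

# Merging (join): attach the zone name to each house via the shared key.
joined = houses.merge(zones, on="zone", how="left")

print(by_price, mean_price, joined, sep="\n\n")
```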
  

## Retrieving one or more columns 

Before talking about enriching, note that to refer to a column we use "dictionary access" style notation, i.e. **square brackets**:

In [None]:
df['LotArea'] # this accesses the data from a column and also shows the corresponding row index...

In [None]:
type( df['LotArea'] )

Notice this is no longer a `DataFrame` object but a `Series` object.

`Series` is just the pandas core implementation of the concept of column or field.

A `Series` is a one-dimensional object.

The reason for the name is that pandas was initially designed to handle time-series data (financial stock price series, indexed by time). So the name is more or less a historical accident... 

We can also choose a subset of columns in the dataframe by using the notation `df[ list-of-columns ]`, where `list-of-columns` is a list literal or any expression that yields a list.

In [None]:
df[ ['Utilities', 'YearBuilt', 'TotRmsAbvGrd'] ]
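
A detail worth noticing is the difference between a single name and a one-element list inside the brackets. The sketch below uses a small invented frame with two of the column names from above.

```python
import pandas as pd

toy = pd.DataFrame({"LotArea": [8450, 9600], "YearBuilt": [2003, 1976]})

one_column = toy["LotArea"]       # a name inside the brackets -> Series
one_column_df = toy[["LotArea"]]  # a list inside the brackets -> DataFrame

# Any expression that yields a list works, not just a list literal.
wanted = [c for c in toy.columns if c.startswith("Year")]
subset = toy[wanted]

print(type(one_column).__name__, type(one_column_df).__name__, list(subset.columns))
```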

## Enriching

The simplest way to enrich a data set is to do a *row-by-row* computation involving one or more of its columns as inputs. 

For example, we can convert `LotArea` from square feet to square meters and store the result in a new column:

In [None]:
df['LotArea_m2'] = df['LotArea'] * (0.3048 * 0.3048)

Notice that `LotArea_m2` now appears as the last column.

In [None]:
df.columns

In [None]:
df['LotArea_m2'].head(10)

**Exercise E1**

Convert the **First floor area** from square feet to square meters and submit the sum of the resulting column. Refer to the "metadata" file to identify the column that stores this attribute. 

In [None]:
## Your code here to define a new column



ans_submit( 'E1',  round( df["new column"].sum(), 3 ) ) # Only change the column name in quotes!

**Exercise E2**

Convert the `LotFrontage` to meters and submit the sum of the resulting column (Hint: the correct conversion factor is not 0.3048 * 0.3048 but something simpler.)

In [None]:
## Your code here to define another new column



ans_submit( 'E2', round(df["another column"].sum(), 3) )

The class `Series` has a few methods that can transform one column by itself. The most important one is perhaps `.isna()`, which generates a new `Series` of Boolean data type having `True` at the indices where the original `Series` is `NaN`.

In [None]:
df['LotFrontage_isna'] = df['LotFrontage'].isna()

In [None]:
df[['LotFrontage', 'LotFrontage_isna']].head(15)

To see all methods offered by the class `Series` go here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html 


For columns of type string there is a special attribute called `str` that gives access to most methods of the string class:

In [None]:
df['SC_len'] = df['SaleCondition'].str.len()
df['SC_first_chr'] = df['SaleCondition'].str[0]
df['SC_upper'] = df['SaleCondition'].str.upper() 
df['SC_lower'] = df['SaleCondition'].str.lower() 
df['SC_concat_LS'] = df['SaleCondition'] + '_' +  df['LotShape'] 

df['e3_input'] = (df['SaleCondition'] + '_' +  df['LotShape']).str.lower()

df[ ['SaleCondition', 'SC_len', 'SC_first_chr', 'SC_upper', 'SC_lower', 'LotShape', 'SC_concat_LS', 'e3_input']].head(15)

**Exercise E3**

The following doesn't work as expected (in each value we want 'a' replaced by '@' and then each 'r' replaced by '', i.e. the r's removed).
Figure out why and fix it! (**Hint:** What is the type of the subexpression `df['e3_input'].str.replace('a', '@')`?)

In [None]:
df['e3_output'] = df['e3_input'].str.replace('a', '@').replace( 'r', '' )  # Fix this....
df['e3_output'].head(15)

In [None]:
ans_submit( 'E3' , df['e3_output'].str.len().sum() ) 

It is also possible to do vectorized operations between columns. For example:

In [None]:
df["sum_of_areas"] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']

This is computing the sum of columns TotalBsmtSF, 1stFlrSF and 2ndFlrSF on a **row by row basis**.

Convince yourself that this is the case by checking a few of the rows below

In [None]:
df[['TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'sum_of_areas']].head(20)

**Exercise E4**

Redefine "sum_of_areas" to include the garage area in the sum as well (refer to metadata.txt file to find out which column that is)



In [None]:
## Your code to redefine column "sum_of_areas" here 



ans_submit( "E4", df["sum_of_areas"].sum() )

Finally, you can apply any function (built-in, from a library or user-defined) to a series in a row-by-row fashion by means of the `apply` method:

In [None]:
def compute_quadratic( x ) : 
    """This computes a quadratic of a single value"""
    return 3 * (x**2) - 200 * x  -8

In [None]:
df['LotFrontage_quad'] = df['LotFrontage'].apply( compute_quadratic )
df[ ['LotFrontage', 'LotFrontage_quad']].head( 15 )

Keep in mind that doing `apply` is essentially the same as going through the series with a for-loop at *Python speed*, which is **very slow**.  

All of the other operations we saw above also loop over the data, but inside the library code, so they happen at *C speed*. 

So, whenever possible you should avoid `apply` and make use of the vectorized operations that are available. 

For instance, the operation above could have been done as follows.


In [None]:
xs = df['LotFrontage']  # alias the column for shorter code in the next line (this doesn't do a copy!)
df['LotFrontage_quad2'] = 3 * (xs ** 2) - 200 * xs - 8   # this does fast vectorized operations at C-speed!

df['equality_test']   = (df['LotFrontage_quad'] == df['LotFrontage_quad2'] )  # compare to the result previously obtained

# refined equality test: if LotFrontage is NaN then LotFrontage_quad and LotFrontage_quad2 have to be NaN as well
df['equality_test_2'] = df['equality_test']  | (df['LotFrontage'].isna() &  df['LotFrontage_quad'].isna() & 
                                                df['LotFrontage_quad2'].isna()) 

# Notice how we are using single ('|' and '&'), NOT  double (|| and &&)  operators. 
# This is because the operations are component-wise (row-by-row) which is sort of analogous to 'bit-wise' 

df[ ['LotFrontage', 'LotFrontage_quad', 'LotFrontage_quad2', 'equality_test', 'equality_test_2']].head( 20 )

How to tell whether all results in `equality_test_2` are true:

In [None]:
df['equality_test_2'].all() 

Yes, they are! Great SUCCESS!
<img src="https://media.giphy.com/media/a0h7sAqON67nO/giphy.gif"/>
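
A more compact way to run the same kind of check is sketched below on a few invented values. It relies on `Series.equals`, which treats NaN values in the same position as equal, so no separate NaN handling is needed.

```python
import pandas as pd

xs = pd.Series([65.0, 80.0, float("nan"), 60.0])

via_apply = xs.apply(lambda x: 3 * (x ** 2) - 200 * x - 8)   # Python-speed loop
vectorized = 3 * (xs ** 2) - 200 * xs - 8                     # vectorized

elementwise_all = (via_apply == vectorized).all()   # False: NaN == NaN is False
same = via_apply.equals(vectorized)                 # True: NaNs in the same place match

print(elementwise_all, same)
```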

In [None]:
df

There is also an `apply` method defined on the `DataFrame` class. That gives access to a whole row at a time. 

In other words, you define a function that takes a *whole row* as an argument, not just a value, and computes a result from it.

The result of applying this function will be a series.

Example: 

In [None]:
import math

def lot_irregularity(  row ) : 
    """Compute an irregularity factor"""
    if row["LotShape"] == "Reg" : 
        return math.sqrt( row["LotArea"] ) / row["LotFrontage"] 
    else : 
        return row["LotArea"] / ( row["LotFrontage"] ** 2 ) 

In [None]:
df['LotIrregularity'] = df.apply( lot_irregularity, axis = 1 ) 
# axis = 1 means to apply the function to each row, axis = 0 would apply the function to each column

In [None]:
df[['LotShape', 'LotArea', 'LotFrontage', 'LotIrregularity']].head(15)
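
In line with the earlier advice about avoiding `apply`, a row-wise function with a simple if/else can often be rewritten with vectorized operations. The sketch below uses `numpy.where` on three invented rows; it is one possible alternative, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotShape": ["Reg", "IR1", "Reg"],
    "LotArea": [8100.0, 10000.0, 6400.0],
    "LotFrontage": [90.0, 50.0, 80.0],
})

# np.where(condition, a, b) picks from a where condition holds, else from b.
toy["LotIrregularity"] = np.where(
    toy["LotShape"] == "Reg",
    np.sqrt(toy["LotArea"]) / toy["LotFrontage"],   # regular lots
    toy["LotArea"] / toy["LotFrontage"] ** 2,       # all other shapes
)

print(toy)
```

Note that both branches are computed for every row and `np.where` only chooses between them, which is fine here because neither branch is expensive.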

**Exercise E5**

Write a function `lot_irregularity_v2` that returns 1.0 when `LotFrontage` is `NaN` and otherwise returns the value returned by `lot_irregularity`.

You can test for `NaN` with `math.isnan( row["LotFrontage"] )`

Then apply this function to define a new column called `"LotIrregularity_v2"`

In [None]:
# Your codez here ....








ans_submit( "E5", round( df["LotIrregularity_v2"].sum(), 3) )

## Filtering

Filtering refers to extracting a *subset of a dataset*.

A 'subset' can refer to a subset of the columns, in which case the more common term is **selection** (or **projection**, for the mathematical purists).

Most commonly, however, 'filtering' refers to taking a subset of the rows according to some criterion.


## Selection (or projection) 

We have already seen the basic method for selecting a few columns: `df[ list-of-columns ]`

A slightly different method is   `df.loc[ :,  boolean-mask ]`
 
Here, 'boolean mask' means an array of booleans of length equal to the number of columns in `df`

For example, if we want to just select all columns with info related to the basement, we can do as follows. 

In [None]:
basement_related_cols = df.columns.str.contains("Bsmt")
basement_related_cols # this is the mask

In [None]:
df.loc[ :, basement_related_cols].head( 15)  

`loc` is an accessor that yields access to a group of rows and columns, selected by label or by boolean mask.

Everything about `loc` is documented here:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html


**Exercise F1**

Create a mask to identify all columns whose name ends in 'SF' (square feet, i.e. columns that hold a surface area). Hint: First run `help(df.columns.str)` and see what method could make this task easier.

Then define a new dataframe `df_sf = df.loc[: , your_mask_here]`

In [None]:
### Your code here: define mask, define df_sf 





ans_submit( "F1", df_sf.sum().sum() )

## Filtering  rows

The method for filtering rows that is used 95% of the time is simple boolean filtering as follows.

In [None]:
boolean_row_mask = (df['LotArea']  >= 10000)  # this defines a boolean Series

In [None]:
df['bool_mask'] = boolean_row_mask
df[['LotArea', 'bool_mask']]

In [None]:
boolean_row_mask.head( 20 )

In [None]:
type( boolean_row_mask )

In [None]:
boolean_row_mask.dtype

In [None]:
# this selects all houses with a lot area greater than or equal to 10000
df_biglot = df[ boolean_row_mask ] 
pd.set_option( "display.max_columns", 10 )
df_biglot.head( 25 )

In [None]:
df_biglot.shape # This yields the number of rows and columns as pair ( 2-tuple )

Naturally, it is possible and even preferable in most cases to simply use the expression that defines the mask inside the brackets, without having to store it under a name.

This has the following advantages: (1) it saves memory, (2) it avoids name pollution and (3) it makes the code more concise.

The only disadvantage is that it makes the code (slightly) less readable.

In [None]:
df_biglot = df[ df['LotArea']  >= 10000 ] 
df_biglot.shape

**Exercise F2**

Define a subset of `df` called `df_ss` containing all records which satisfy *all* of the following conditions
 
  * `HeatingQC` equals `'Ex'` 
  * `LotFrontage`  greater than or equal to `50.0`
  * `LotShape` different from `'Reg'`
  
 
**Important Warning:** when combining comparisons with Boolean operators you should always put each comparison in parentheses. In Python, `&` and `|` bind more tightly than comparison operators such as `==` and `>=`, which is not the precedence you may be used to from `&&` and `||` in other languages.
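
The effect of the parentheses can be seen in the following sketch, which uses an invented frame with hypothetical columns `rooms` and `floors` (so it does not give away the exercise).

```python
import pandas as pd

toy = pd.DataFrame({"rooms": [2, 5, 7, 4], "floors": [1, 2, 3, 1]})

# Correct: each comparison is wrapped in parentheses before combining.
mask = (toy["rooms"] >= 4) & (toy["floors"] < 3)
selected = toy[mask]

# Without parentheses, `&` is evaluated first, so Python reads this as
# toy["rooms"] >= (4 & toy["floors"]) < 3, which fails.
try:
    toy[toy["rooms"] >= 4 & toy["floors"] < 3]
    failed = False
except (ValueError, TypeError):
    failed = True

print(selected)
print("unparenthesized version failed:", failed)
```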
 
Check your result visually (displaying those columns) before submitting...

In [None]:
### Your code here 





In [None]:
ans_submit( 'F2', df_ss.shape[0] )   # df_ss.shape[0] is how you tell how many rows df_ss has.

### head / tail methods

The other basic method for taking a subset of rows from a dataset makes use of the fact that (unlike SQL tables in most implementations) dataframes have an intrinsic order. Thus one can talk about the first `n` rows or the last `n` rows. 

There are two corresponding methods: 

  * `df.head( n )` returns a new dataframe consisting of the **first** `n` rows of df 
  * `df.tail( n )` returns a new dataframe consisting of the **last** `n` rows of df
  
You have seen many examples of using `head()` above...

**Exercise F3**

Define `df_t20` to be the last 20 rows of `df`. 

In [None]:
ans_submit( 'F3', df_t20['LotArea'].sum() )

## Summarizing  / Descriptive statistics

The Series class has several predefined methods to quickly calculate summaries of the data in it.

The following methods work on Series of any datatype:

  * `ser.count()` : count of non-NA values 
  * `ser.nunique()` : count of unique values 
  * `ser.value_counts()` : yields a table (actually just a series) with the distinct values and the count for each. This is known by statisticians as the *frequency table*.
  * `ser.mode()` : The most frequent unique element in the series 

  * `ser.describe()` : computes several important statistics. Its output depends on whether the series is numeric or not. See the exercise below and the documentation.
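
As a quick illustration of the counting methods, here is a sketch on a small invented series with one missing value.

```python
import pandas as pd

ser = pd.Series(["Reg", "IR1", "Reg", None, "Reg", "IR1"])

n_values = ser.count()          # non-missing values only
n_distinct = ser.nunique()      # distinct non-missing values
freq = ser.value_counts()       # frequency table, most frequent value first

print(n_values, n_distinct)
print(freq)
```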


**Exercise S1**

Try applying `.describe()` to any non-numeric column in `df`. You will notice that one of the outputs is called `top`. The corresponding value is a value from the series. Which of the methods listed above could be used to compute this value by itself? Go to the documentation of `pandas.Series.describe` to find out... When submitting the answer just give the name of the method as a single word, with no `ser.` at the beginning and no `()` at the end.

In [None]:
ans_submit( "S1", "name of method here as a single word with no punctuation")

For numeric columns we have a few more methods: 
        
 * `ser.sum()` : computes the sum of the values 
 * `ser.mean()` : computes the mean 
 * `ser.max()` and `ser.min()`: compute the maximum and minimum value of the series 
 * `ser.median()` : The 50th percentile, i.e. the value M such that 50% of the values in the series are less than or equal to M and 50% of the values are greater than or equal to M. 
 * `ser.quantile(x)` : Returns the x-quantile of the distribution of the values in the series, i.e. the value Q such that (x*100)% of the values of the series are less than or equal to Q.
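
A minimal sketch of these methods on four invented values. Note that when the requested quantile falls between two data points, pandas interpolates linearly between them by default.

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40])

total = ser.sum()
mean = ser.mean()
lowest, highest = ser.min(), ser.max()
median = ser.median()
q25 = ser.quantile(0.25)     # linear interpolation between neighbours by default
q50 = ser.quantile(0.5)      # the 0.5-quantile is the median

print(total, mean, lowest, highest, median, q25, q50)
```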

**Exercise S2**

Try applying `.describe()` to a numeric column in `df`, you will see several outputs, such as `count`, `mean`, `min`, etc...
One of the listed outputs cannot be computed by any of the functions listed above. Which one? 



In [None]:
ans_submit("S2", "...your answer here...") 

**Exercise S3**

All methods on a numeric Series skip NaN values by default. However, they all admit an optional argument to control this behaviour. 

Go to the documentation of any of these methods to figure out what that argument is called

In [None]:
ans_submit("S3", "name of optional argument to control whether to skip NaN values when computing a summary") 

**Exercise S4**

Answer "True" or "False": when summarizing a numeric series, skipping NaN values is equivalent to treating them as if they were `0.0`. 

In [None]:
ans_submit( "S4" , "True" / "False" )  # leave just one of the two 

### Special methods for Boolean series

Boolean Series are especially simple and also especially useful. They usually give the answer to a yes/no question about each cell in another series or each row in a DataFrame.

We have seen above how boolean series are produced, by computing expressions of the form `another_series == some_value`, `another_series > some_threshold` or simply `another_series.isna()`.

There are two very useful methods for Boolean series, namely: 

  * `ser.all()`: Returns True if and only if all elements of the boolean series are 'True' 
  * `ser.any()`: Returns True if any of the elements of the series are True
  
**ACHTUNG:** boolean series cannot have any NaN values. True and False are the only allowed values. In fact, if a series of another type is coerced to boolean type, for instance by calling `ser.astype( bool )`, then `NaN` values turn into True!  
This is significantly different from SQL semantics, where NULL effectively acts as False for filtering operations.
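
The coercion behaviour described above can be checked on a tiny invented series:

```python
import pandas as pd

ser = pd.Series([1.0, 0.0, float("nan")])

as_bool = ser.astype(bool)      # 0.0 -> False, everything else (NaN included) -> True
flags = as_bool.tolist()

any_true = as_bool.any()        # at least one element is True
all_true = as_bool.all()        # every element is True

print(flags, any_true, all_true)
```

If NaN should count as False instead, one option is to combine the mask with `ser.notna()` before filtering.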

**Exercise S5**

What is the result of applying `.sum()` to a boolean Series called `ser` and having `n` elements? 

 * **a.** Since `+` is equivalent to `or` for bools the result is:  `ser[0] or ser[1] or ser[2] or ... or ser[n-1]`, 
    i.e. the same as applying `.any()`  
 * **b.** Bool is the same type as the integers modulo 2, thus  `+` is equivalent  to `xor` (1 + 1 = 0 modulo 2) and the result is `ser[0] xor ser[1] xor ... xor ser[n-1]`  
 * **c.** Pragmatism beats purity and the result is the same as first casting the whole series to type `int` (False -> 0, True -> 1) and summing the resulting elements as regular integers.

In [None]:
ans_submit( "S5", "a" / "b" or "c") # leave one

To discover all predefined methods for computing descriptive statistics go here:
https://pandas.pydata.org/pandas-docs/stable/api.html#computations-descriptive-stats

## References

  * The pandas home: https://pandas.pydata.org/
  * The pandas API:  
      * Dataframe: https://pandas.pydata.org/pandas-docs/stable/api.html#dataframe
      * Series: https://pandas.pydata.org/pandas-docs/stable/api.html#series
      * All about indexing and selection https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label
  
  