**Coursebook: Python for Data Analysts**
- Part 1 of Data Analytics Specialization
- Course Length: 15 hours
- Last Updated: May 2019

___

- Author: [Samuel Chan](https://github.com/onlyphantom)
- Developed by [Algoritma](https://algorit.ma)'s product division and instructors team

# Background

## Top-Down Approach 

The coursebook is part of the **Data Analytics Specialization** offered by [Algoritma](https://algorit.ma). It takes a more accessible approach compared to Algoritma's core educational products, by getting participants to overcome the "how" barrier first, rather than a detailed breakdown of the "why". 

This translates to an overall easier learning curve, one where the reader is prompted to write short snippets of code at frequent intervals before being offered an explanation of the underlying theoretical frameworks. Instead of mastering the syntactic design of the Python programming language, then moving on to data structures, then the `pandas` library, then the mathematical details of an imputation algorithm and its code implementation, we do the opposite: implement the imputation first, then give a succinct explanation of why it works and what to consider when applying it (what to look out for, what assumptions it makes, when _not_ to use it, etc.).

For the most part, experience in Python programming is good to have but not required. Familiarity with data manipulation and data structures in a different programming language is a welcome addition but, again, not required.

## Learn-by-Building

This coursebook is intended for participants new to the world of data analysis and / or programming. No prior programming knowledge is assumed. 

The coursebook focuses on:
- Introduction to the `pandas` library. 
- Introduction to `DataFrame`  
- Data Types
- Exploratory Data Analysis I
- Indexing and Subsetting

The final part of this course is a Graded Assignment, where you are expected to apply all that you've learned to a new dataset and attempt the given questions.

# Python for Data Analysts

## Introduction to DataFrames

We will start off by learning about a powerful Python data analysis library by the name of `pandas`. Its official documentation describes it as the "fundamental high-level building block for doing practical, real world data analysis in Python", and it strives to do so by implementing many of the key data manipulation functionalities found in R. This makes `pandas` a core member of many Python-based scientific computing environments.

From its [official documentation](https://pandas.pydata.org):

> Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

To use `pandas`, we will use Python's `import` statement. Once imported under the alias `pd`, every `pandas` function can be accessed using the *pd.function_name* notation.

In [3]:
import pandas as pd
print(pd.__version__)

0.24.2


In [4]:
elec = pd.read_csv("data_input/amazon-electronic.csv", index_col=0)
elec.head()

id,product_id,date,categories,brand,name,merchant,quantity,unit_price
1,44342,2017-03-02,Camera & Photo,Panasonic,Lumix G 25mm f/1.7 ASPH. Lens,Bestbuy.com,1,201.99
2,46876,2017-03-02,Camera & Photo,Sony,Cyber-shot DSC-WX220 Digital Camera (Black),Bestbuy.com,1,159.99
3,12136,2017-03-02,Camera & Photo,Sony,Sony - BC-TRX Battery Charger - Black,Bestbuy.com,1,26.99
6,79238,2017-03-03,Accessories & Supplies,Insignia,"Insignia - Fixed TV Wall Mount For Most 40-70""...",Bestbuy.com,1,56.99
7,46643,2017-03-03,Camera & Photo,Sony,Cyber-shot DSC-RX100 V Digital Camera,Bestbuy.com,1,849.99


In the code above, we used `.read_csv()` to read a csv file from a specified path. Notice that we set `index_col=0` so the first column in the csv is used as the index. By default, this function treats the first row as the header row. We can add `header=None` to the function call telling `pandas` to read in a CSV without headers.

You may find it curious that we use `0` to reference the first element of an axis. This is because Python uses 0-based indexing, a behavior that differs from languages such as R and MATLAB, which start counting at 1.
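The effect of `index_col` and `header` is easier to see on a tiny example. The sketch below is only an illustration: it reads a made-up CSV string held in memory (the `id`, `fruit` and `price` columns are assumptions, not part of the course dataset) rather than a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content, kept in memory instead of a file on disk
csv_text = "id,fruit,price\n10,apple,1.5\n20,banana,0.25\n"

# index_col=0: the first column (position 0) becomes the row labels,
# and the first row is used as the header by default
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# header=None: every row is treated as data, so columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_index)
print(no_header)
```

Note how the `header=None` version has one more row than the first one, because the line holding the column names is now read as data.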

#### Knowledge Check

Let's dive deeper into understanding the `index_col` parameter. From the documentation:

> `index_col` : int or sequence or `False`  
Column to use as the row labels of the DataFrame.

1. How would you change the `read_csv()` code such that the DataFrame uses `product_id` as the row label (index)? 
2. `pandas.DataFrame.head()` accepts an additional parameter, `n`, and returns the first `n` rows of the DataFrame. Set `n=8` to see the first 8 rows of your `elec` DataFrame.
3. The opposite of `.head()` is `.tail()`. It returns the last `n` rows of your DataFrame. Create a new cell below and print the last 4 rows of our DataFrame. 

*Reminder: Python uses 0-based indexing, and `product_id` is the second column in the csv*

In [3]:
## Your code below

## -- Solution code

### Keywords

Earlier it was mentioned that `index_col` accepts an integer, a sequence or a `False` value. A couple of things to note here. `False`, along with its opposite, `True`, is among a reserved list of vocabulary referred to as **Python Keywords**. We cannot use a keyword as an identifier: that is, we cannot use it as a variable name or function name, or assign a value to it. 

Interestingly, all Python keywords except **True**, **False** and **None** are in lowercase, and they must be written exactly as they are. As of Python 3.7 (the latest version of Python as of this writing), there are 35 keywords:

`True`, `False`, `None`, `and`, `as`, `assert`, `async`, `await`, `break`, `class`, `continue`, `def`, `del`, `elif`, `else`,
`except`, `finally`, `for`, `from`, `global`, `if`, `import`, `in`, `is`, `lambda`, `nonlocal`, `not`, `or`, `pass`,
`raise`, `return`, `try`, `while`, `with`, `yield`

Try inserting a new cell below and assigning the value `2` to `False`. You should see Python raise a `SyntaxError` (the exact wording of the message varies between Python versions):

> SyntaxError: can't assign to keyword
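Rather than memorizing the list, you can ask Python for it. The following is a small sketch using the standard library's `keyword` module; the number of keywords it prints depends on the Python version running the cell, so no count is assumed here.

```python
import keyword

# The keyword list depends on the interpreter version that runs this cell
print(len(keyword.kwlist))
print(keyword.kwlist)

# iskeyword() tells us whether a given string is reserved
print(keyword.iskeyword("False"))   # the capitalised spelling is reserved
print(keyword.iskeyword("false"))   # the lowercase spelling is an ordinary name

# Assigning to a keyword is rejected when the code is compiled,
# before any of it runs
try:
    compile("False = 2", "<example>", "exec")
    assignment_error = None
except SyntaxError as err:
    assignment_error = err

print(type(assignment_error).__name__)
```

Because `false` (lowercase) is not a keyword, Python treats it as a regular variable name, which matters when you pass it as an argument by mistake.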

Which of the following 4 lines of code will evaluate without raising an error?

- [ ] `pd.read_csv("data_input/amazon-electronic.csv", index_col=false)`
- [ ] `Import pandas as pd`
- [ ] `print(100-2)`
- [ ] `None = 2`

In [4]:
## Your Code Below

## -- Solution code

You'll be tempted to go through all the keywords above and try to wrap your head around each one of them. If this proves to be a tad overwhelming, my recommendation is to move along with the rest of the section. Most of us do not know the inner workings of every component of our car engine, but that shouldn't stop us from being effective drivers. 


As stated at the beginning of this coursebook, we're choosing a top-down approach, and concepts will be presented on a "need-to-know" basis. We'll no doubt come across many of the keywords again (since collectively they form the backbone of the language), but for now there is no need to stress about them if that only serves to discourage you from learning to code.

---

**What you need to know**:
- Python, like `R`, `Swift`, `C`, and many other languages, is case-sensitive. `Sales` and `sales` refer to different objects.  
- You cannot use any Python keywords as identifiers. 
- When naming your variables, start with a letter and use underscore (`_`) to join multiple words.
    - Wrong: `2019`, `2019sales`, `sales-2019`, `sales.2019`
    - Correct: `sales_2019`, `profit_after_tax`
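The naming rules above can be checked programmatically. This is a minimal sketch, assuming a hypothetical helper named `is_usable_name` that combines Python's built-in `str.isidentifier()` with the `keyword` module:

```python
import keyword

candidates = ["2019", "2019sales", "sales-2019", "sales.2019",
              "sales_2019", "profit_after_tax", "class"]

def is_usable_name(name):
    # Hypothetical helper: the name must have valid identifier syntax
    # and must not be a reserved keyword
    return name.isidentifier() and not keyword.iskeyword(name)

results = {name: is_usable_name(name) for name in candidates}
for name, ok in results.items():
    print(name, ok)
```

Only `sales_2019` and `profit_after_tax` pass both checks; `class` has valid syntax but is rejected because it is a keyword.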

`pandas` allows data analysts to create Series objects and DataFrame objects. A Series represents a one-dimensional array, whereas a DataFrame emulates the functionality of "Data Frames" in R and is useful for tabular data. 

In practice, a large proportion of our data is tabular: when we import data from a relational database (MySQL, PostgreSQL) or from spreadsheet software (Google Sheets, Microsoft Excel), we can represent the data as a DataFrame object.
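To make the distinction concrete, here is a small sketch that builds one of each. The values are made up for illustration and are not taken from the course dataset.

```python
import pandas as pd

# A Series: one-dimensional, one value per row label
quantity = pd.Series([1, 2, 1], name="quantity")

# A DataFrame: tabular, several columns sharing the same row labels
orders = pd.DataFrame({
    "brand": ["Sony", "Beats", "Denon"],
    "quantity": [1, 2, 1],
})

print(type(quantity).__name__, quantity.shape)
print(type(orders).__name__, orders.shape)

# Selecting a single column of a DataFrame gives back a Series
print(type(orders["quantity"]).__name__)
```

A useful way to think about it: a DataFrame behaves like a collection of Series that share the same index.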

### Data Types

When we called `pd.read_csv()` earlier, `pandas` tried to infer data types from the values in each column. Sometimes it gets it right, but more often than not a data analyst's intervention is required. In the following sub-section, we'll learn about various techniques an analyst has at their disposal when it comes to the treatment of pandas data types.

In [5]:
# print(elec.dtypes)

elec.dtypes

product_id      int64
date           object
categories     object
brand          object
name           object
merchant       object
quantity        int64
unit_price    float64
dtype: object

`dtypes` simply stands for "data types". Because `elec` is a `pandas` DataFrame, accessing its `dtypes` attribute returns a Series with the data type of each column. 

----
#### Knowledge check: `.dtypes` and pandas attributes
Look at the following code. What output do you expect from it, and why?
```
x = [2019, 4, 'data science']
x.dtypes
```

Hint: Try `type(x)` and verify the type for object `x`.

In [6]:
## Your code below

## -- Solution code

Let's take a look at some examples of `DataFrame.dtypes`:

In [7]:
member = pd.DataFrame({
    'name': ['Anita', 'Brian'],
    'birth': [pd.Timestamp('19931108'), pd.Timestamp('19800612')],
    'gender': ['F','M'],
    'vip': [True, False],
    'ordercount': [11, 7],
    'avgbuy': [250554.32,500004.23 ]
})
member

Unnamed: 0,name,birth,gender,vip,ordercount,avgbuy
0,Anita,1993-11-08,F,True,11,250554.32
1,Brian,1980-06-12,M,False,7,500004.23


In [7]:
member.dtypes

name                  object
birth         datetime64[ns]
gender                object
vip                     bool
ordercount             int64
avgbuy               float64
dtype: object

Let's go through the columns and their data types from the above `DataFrame`:

- `name` [`object`]: text values
- `birth` [`datetime64`]: date and time values
- `gender` [`object`]: text values
- `vip` [`bool`]: True/False values
- `ordercount` [`int`]: integer values
- `avgbuy` [`float`]: floating point values

Among these columns, only `ordercount` and `avgbuy` hold numeric values. This is a simple but important observation to make as we make our way into the Exploratory Data Analysis phase. But before we do, let's do one more exercise. Take a closer look at the DataFrame we just created.

Out of the 6 columns, one of them is of special interest to our next discussion, **categorical values**.

### Categorical and Numerical Variables

From the [main documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html):

> Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation or rating via Likert scales.

Can you spot which of our columns holds values that should be encoded in the `category` data type? Once you've spotted it, use the `astype('category')` method to perform the conversion. Remember to re-assign the new column so that the original column (`object` type) is overwritten with the new `category` type column.

Examples:

```py
# convert gender to category
member['gender'] = member['gender'].astype('category')

# convert ordercount to integer
member['ordercount'] = member['ordercount'].astype('int')
```

In [8]:
member

Unnamed: 0,name,birth,gender,vip,ordercount,avgbuy
0,Anita,1993-11-08,F,True,11,250554.32
1,Brian,1980-06-12,M,False,7,500004.23


In [10]:
## Your code below

## -- Solution code

Use `member.dtypes` to confirm that you've done the exercise above correctly:

In [11]:
## Your code below

## -- Solution code

In most real-world projects, your work as a data analyst will involve working with **categorical**, **numeric** and **datetime** values; either treating them as "features" or "target". In the case of machine learning:

- A **categorical** target represents a classification problem
- A **numeric** target represents a regression problem

### Exploratory Data Analysis Tools

In simple words, exploratory data analysis (EDA) refers to the process of performing initial investigations on data, often with the objective of becoming familiar with certain characteristics of the data. This is usually done with the aid of summary statistics and simple graphical techniques that purposefully uncover the structure of our data.

We'll start off by using some of the most convenient EDA tools built into `pandas`. In particular, here is a summary of what we'll cover in common EDA workflows:

- `.head()` and `.tail()`
- `.describe()`
- `.shape` and `.size`
- `.axes`
- `.dtypes`

In [14]:
elec.describe()

Unnamed: 0,product_id,quantity,unit_price
count,2519.0,2519.0,2519.0
mean,38543.870584,1.154426,568.36206
std,22545.966776,0.468605,620.874159
min,366.0,1.0,1.0
25%,18956.0,1.0,149.99
50%,38926.0,1.0,479.75
75%,56249.0,1.0,548.0
max,81681.0,6.0,4199.99


The `describe()` method generates descriptive statistics of our data and, by default, includes only the numeric columns of our DataFrame. The code above calls `.describe()` on `elec`, which has three numeric columns. A method is an "instruction" to perform something (a function) associated with the object. We've seen earlier how to use `.head()` and `.tail()` on our DataFrame: these are also method calls!

We can add an `include` parameter to the `.describe()` method call, which takes a list-like of dtypes to be included, or `'all'` for all columns of the DataFrame.
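As a quick illustration of how `include` changes the result, the sketch below compares the default call with `include='all'` on a small made-up DataFrame (the `toy` name and its values are assumptions for demonstration only):

```python
import pandas as pd

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({
    "brand": ["Sony", "Sony", "LG"],
    "quantity": [1, 2, 1],
    "unit_price": [159.99, 26.99, 548.0],
})

numeric_summary = toy.describe()               # default: numeric columns only
full_summary = toy.describe(include="all")     # every column, whatever its dtype

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
print(full_summary)
```

With `include='all'`, statistics that do not apply to a column (for example the mean of a text column) are shown as `NaN`.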

Add a new cell below, calling `describe()` but only on columns of `object` and `datetime` types (`['object', 'datetime']`).

In [15]:
## Your code below

## -- Solution code


Very often, we also want to know the shape of our data - i.e. how many rows and columns are there in our DataFrame? 

Our DataFrame has attributes that we can use to answer those questions. An attribute is a value stored within an object that describes an aspect of the object's characteristics. In the following call, we are asking for the `.shape` and the `.size` attributes of our `elec` DataFrame.

_Tip:_
Unlike `describe()`, which is a method call, `shape` and `size` are **attributes** of our DataFrame. From our point of view no function is called and no parentheses are needed: we simply look up a value on the object.

In [8]:
print(elec.shape)
print(elec.size)

(2519, 8)
20152


`size` returns the number of elements in the `elec` DataFrame. Because we have 2,519 rows and 8 columns, the total number of elements is 2,519 × 8 = 20,152. 
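The relationship between `shape` and `size`, and the difference between an attribute and a method, can be sketched on a small made-up DataFrame (`toy` is a hypothetical name used only here):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape          # shape is a tuple: (rows, columns)
print(toy.shape, toy.size)
print(n_rows * n_cols == toy.size)

# The same arithmetic applied to the numbers printed for `elec` above
print(2519 * 8)

# Attributes are looked up without parentheses; adding them tries to
# "call" the returned tuple, which fails
try:
    toy.shape()
    call_error = None
except TypeError as err:
    call_error = err
print(type(call_error).__name__)
```

If you ever see `'tuple' object is not callable`, a stray pair of parentheses after an attribute is a likely cause.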

Use `.shape` on the `member` DataFrame. From the resulting output, could you tell what would be the result of calling `member.size`?

In [17]:
## Your code below

## -- Solution code

One other attribute that is often useful is `.axes`, which returns a list representing the axes of our DataFrame. For a DataFrame, this is a list of length 2: one item for the row axis and one for the column axis, in that order.

Because it is ordered that way, calling `.axes[0]` would return the first item of that list, which would be the row axis (or row names if present) and calling `.axes[1]` would return the column axis, which would be equivalent to calling `elec.columns`:

In [9]:
elec.axes[1]

Index(['product_id', 'date', 'categories', 'brand', 'name', 'merchant',
       'quantity', 'unit_price'],
      dtype='object')

We've covered `.dtypes` in earlier sections, so go ahead and practice inspecting the data types of `elec` DataFrame. Are the columns in the right data types? If they are not, formulate a mental checklist of type conversion you need to perform.

In [19]:
elec.dtypes

product_id      int64
date           object
categories     object
brand          object
name           object
merchant       object
quantity        int64
unit_price    float64
dtype: object

Compare your mental checklist with the following code. We convert `date` to a `datetime` type, and perform the conversion for the categorical columns as well:

In [10]:
# convert `date` to datetime (recent pandas versions require the unit, [ns], to be spelled out)
elec['date'] = elec['date'].astype('datetime64[ns]')

# convert multiple variables to category
elec[['categories','brand','merchant']] = elec[['categories','brand','merchant']].astype('category')

elec.dtypes

product_id             int64
date          datetime64[ns]
categories          category
brand               category
name                  object
merchant            category
quantity               int64
unit_price           float64
dtype: object

#### Knowledge Check

Suppose we have a pandas DataFrame named `inventory`. 

1. We called `inventory.dtypes` and got the following output. Which of the columns likely require type conversion because they seem to have the wrong data type? Choose all that apply.

    - [ ] `units_instock`: int64
    - [ ] `discount_price`: float64
    - [ ] `item_name`: object
    - [ ] `units_sold`: object
    

2. We would like to know the number of columns in `inventory`. Which of the following code would print the number of columns in `inventory`? Choose all that apply.

    - [ ] `print(len(inventory.columns))`
    - [ ] `print(inventory.shape[1])`
    - [ ] `print(len(inventory.axes[1]))`


## Indexing and Subsetting with Pandas

Using indexing operators to select, summarize or transform only a subset of data is a critical part of any data analysis workflow. Consider the following use-cases:

- Compare the sales in Year 2018 vs Year 2019  
- Identify missed opportunities in a specific market segment
- Best quarter of the year to execute cross-selling promos / discounts
- Study the profitability of goods in the higher price range (e.g. IDR45000000+) and how competitors' positioning affects sales in that price range

Notice that in all of these use-cases, data analysts will want to use some combination of indexing and then perform the necessary computations on that specific slice or slices of data. Unsurprisingly, `pandas` comes with a number of methods to help you accomplish this task.

In the following section, we'll take a closer look at some of the most common slicing and subsetting operations in `pandas`:
- `head()` and `tail()`  
- `select_dtypes()`  
- Using `.drop()` 
- The `[]` operator
- `.loc`  
- `.iloc`
- Conditional subsetting

Say we're only really interested in the numeric columns of our data. We can use `select_dtypes` to selectively include or exclude particular data types.

In the following example, I use `select_dtypes` to _include_ only textual columns (`object`) and then chain `.head()` onto the result. Notice that when we chain two methods this way, the second method is called on the output of the first: 

In [11]:
elec.select_dtypes(include = 'object').head()

id,name
1,Lumix G 25mm f/1.7 ASPH. Lens
2,Cyber-shot DSC-WX220 Digital Camera (Black)
3,Sony - BC-TRX Battery Charger - Black
6,"Insignia - Fixed TV Wall Mount For Most 40-70""..."
7,Cyber-shot DSC-RX100 V Digital Camera


In the following code, we switch from `include` to `exclude` and chain `.describe()` instead of `.head()`. Observe how the output differs:

In [12]:
elec.select_dtypes(exclude='object').describe()

Unnamed: 0,product_id,quantity,unit_price
count,2519.0,2519.0,2519.0
mean,38543.870584,1.154426,568.36206
std,22545.966776,0.468605,620.874159
min,366.0,1.0,1.0
25%,18956.0,1.0,149.99
50%,38926.0,1.0,479.75
75%,56249.0,1.0,548.0
max,81681.0,6.0,4199.99


You can also use `include` or `exclude` with a list of data types instead of a singular value. To include all columns of data types integer and float, we can do either of these:
- `include='number'`  
- `include=['int', 'float']`

Try and do that now; Chain the `select_dtypes()` command with `.head()` to limit the output to only the first 5 rows:

In [23]:
## Your code below


## -- Solution code

Apart from using `select_dtypes` to exclude columns, we can also use `.drop()` to remove rows or columns by label names and the corresponding axis. By default, the `axis` is assumed to be 0, i.e. referring to the row. Hence the following code will drop the **row** with label `1`: 

In [13]:
elec.drop(1).head()

id,product_id,date,categories,brand,name,merchant,quantity,unit_price
2,46876,2017-03-02,Camera & Photo,Sony,Cyber-shot DSC-WX220 Digital Camera (Black),Bestbuy.com,1,159.99
3,12136,2017-03-02,Camera & Photo,Sony,Sony - BC-TRX Battery Charger - Black,Bestbuy.com,1,26.99
6,79238,2017-03-03,Accessories & Supplies,Insignia,"Insignia - Fixed TV Wall Mount For Most 40-70""...",Bestbuy.com,1,56.99
7,46643,2017-03-03,Camera & Photo,Sony,Cyber-shot DSC-RX100 V Digital Camera,Bestbuy.com,1,849.99
8,9472,2017-03-03,Camera & Photo,Yamaha,R-S202 Stereo Receiver with Bluetooth (Black),Bestbuy.com,1,125.99


We can drop multiple rows or columns by passing in a list. In the following code, we override the default `axis` value by passing `axis=1`; As a result `pandas` will drop the specified columns, while preserving all rows:

In [14]:
elec.drop(['product_id', 'brand', 'name'],axis = 1).head()

id,date,categories,merchant,quantity,unit_price
1,2017-03-02,Camera & Photo,Bestbuy.com,1,201.99
2,2017-03-02,Camera & Photo,Bestbuy.com,1,159.99
3,2017-03-02,Camera & Photo,Bestbuy.com,1,26.99
6,2017-03-03,Accessories & Supplies,Bestbuy.com,1,56.99
7,2017-03-03,Camera & Photo,Bestbuy.com,1,849.99
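One detail worth noticing is that `.drop()` returns a new DataFrame and leaves the original untouched, which is why `elec` still has all of its rows and columns after the calls above. A minimal sketch on made-up data (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {"brand": ["Sony", "LG"], "quantity": [1, 2], "unit_price": [159.99, 548.0]},
    index=[1, 2],
)

without_brand = toy.drop(["brand"], axis=1)   # drop a column by label
without_row_1 = toy.drop(1)                   # axis=0 is the default: drop the row labelled 1

print(without_brand.columns.tolist())
print(without_row_1.index.tolist())

# The original DataFrame is unchanged
print(toy.shape)
```

To keep the result, assign it to a name, for example `toy = toy.drop(["brand"], axis=1)`.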


Rather commonly, you may want to perform subsetting by slicing out a set of rows. This can be done using the `elec[start:end]` syntax, where `start` is inclusive and `end` is exclusive.

The code that follows slices out the first to fourth rows or, equivalently, the rows at positions 0, 1, 2, and 3. 

In [15]:
elec[0:4]

id,product_id,date,categories,brand,name,merchant,quantity,unit_price
1,44342,2017-03-02,Camera & Photo,Panasonic,Lumix G 25mm f/1.7 ASPH. Lens,Bestbuy.com,1,201.99
2,46876,2017-03-02,Camera & Photo,Sony,Cyber-shot DSC-WX220 Digital Camera (Black),Bestbuy.com,1,159.99
3,12136,2017-03-02,Camera & Photo,Sony,Sony - BC-TRX Battery Charger - Black,Bestbuy.com,1,26.99
6,79238,2017-03-03,Accessories & Supplies,Insignia,"Insignia - Fixed TV Wall Mount For Most 40-70""...",Bestbuy.com,1,56.99


Recalling that `end` is not inclusive and that Python uses 0-based indexing, how would we subset the **8th to 12th** rows of our data? Pick the right answer and try it in a new code cell below.

- [ ] `elec[7:12]`
- [ ] `elec[8:12]`
- [ ] `elec[7:13]`
- [ ] `elec[8:13]`

Using `.loc` and `.iloc`, we can perform slicing on both the row and column indices, offering us even greater flexibility and control over our subsetting operations.

`.iloc` requires us to pass integer positions for the rows and/or the columns. We can also use `:` to indicate no subsetting along a given axis. The following code slices out the first 4 rows but takes all columns (pay attention to the use of `:`): 

In [16]:
elec.iloc[0:4, :]

id,product_id,date,categories,brand,name,merchant,quantity,unit_price
1,44342,2017-03-02,Camera & Photo,Panasonic,Lumix G 25mm f/1.7 ASPH. Lens,Bestbuy.com,1,201.99
2,46876,2017-03-02,Camera & Photo,Sony,Cyber-shot DSC-WX220 Digital Camera (Black),Bestbuy.com,1,159.99
3,12136,2017-03-02,Camera & Photo,Sony,Sony - BC-TRX Battery Charger - Black,Bestbuy.com,1,26.99
6,79238,2017-03-03,Accessories & Supplies,Insignia,"Insignia - Fixed TV Wall Mount For Most 40-70""...",Bestbuy.com,1,56.99


`.loc`, in contrast to `.iloc`, does not subset by _integer position_ but by _label_. We can still pass integers, but they will be interpreted as _labels_.
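The difference is easiest to see when the row labels are integers that do not match the row positions. The sketch below uses a small made-up DataFrame whose index skips some values (the `toy` name and its contents are assumptions for illustration):

```python
import pandas as pd

# Row labels 1, 2, 3, 6 sit at positions 0, 1, 2, 3
toy = pd.DataFrame(
    {"brand": ["Panasonic", "Sony", "Sony", "Insignia"],
     "quantity": [1, 1, 1, 1]},
    index=[1, 2, 3, 6],
)

by_position = toy.iloc[0:2, :]   # positions 0 and 1; the end position is excluded
by_label = toy.loc[1:3, :]       # labels 1 through 3; the end label is included

print(by_position.index.tolist())
print(by_label.index.tolist())

# Label 6 exists, but position 6 does not
print(toy.loc[6, "brand"])
try:
    toy.iloc[6]
    position_error = None
except IndexError as err:
    position_error = err
print(type(position_error).__name__)
```

Note that slicing with `.loc` includes the end label, unlike `.iloc` and plain Python slices.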

Say we now want to use `product_id` as the row labels:

In [66]:
product = elec.set_index('product_id')

product.head()

product_id,date,categories,brand,name,merchant,quantity,unit_price
44342,2017-03-02,Camera & Photo,Panasonic,Lumix G 25mm f/1.7 ASPH. Lens,Bestbuy.com,1,201.99
46876,2017-03-02,Camera & Photo,Sony,Cyber-shot DSC-WX220 Digital Camera (Black),Bestbuy.com,1,159.99
12136,2017-03-02,Camera & Photo,Sony,Sony - BC-TRX Battery Charger - Black,Bestbuy.com,1,26.99
41767,2017-03-02,Headphones,Beats,Beats Solo 2 Wireless On-Ear Headphone - White...,Bestbuy.com,2,173.99
62320,2017-03-02,Home Audio,DENON - HEOS,7.2CH AVR WITH WIFI _ BLUETOOTH 2 HDMI OUTPUTS...,Bestbuy.com,1,486.99


To subset for the row of transactions corresponding to product id 62320, we can use label-based indexing (`.loc`) as such:

In [70]:
product.loc[62320, :]

product_id,date,categories,brand,name,merchant,quantity,unit_price
62320,2017-03-02,Home Audio,DENON - HEOS,7.2CH AVR WITH WIFI _ BLUETOOTH 2 HDMI OUTPUTS...,Bestbuy.com,1,486.99
62320,2017-04-04,Home Audio,DENON - HEOS,7.2CH AVR WITH WIFI _ BLUETOOTH 2 HDMI OUTPUTS...,Bestbuy.com,1,434.99
62320,2017-12-14,Home Audio,DENON - HEOS,7.2CH AVR WITH WIFI _ BLUETOOTH 2 HDMI OUTPUTS...,Electronics Expo (Authorized Dealer),1,479.75


In the following code, we read in `electronic_views.csv`. Take a peek at the data using `head` or `tail`. 

Perform _label_-based or _integer_-based indexing (whichever you deem more appropriate) to subset the rows holding the product view counts for the brand 'Apple':

In [17]:
views = pd.read_csv("data_input/electronic_views.csv", index_col=0)
views.head()

brand,name,categories,prices_before,prices_after,prices_disc,availability,total_views
Razer,Blade Pro 17.3 4K Ultra HD Touch-Screen Laptop,Computers & Accessories,USD4619.989,"USD4,199.99",10%,True,13
Sony,Alpha a6500 Mirrorless Digital Camera (Body Only),Camera & Photo,USD2295.4,"USD1,996.00",15%,True,20
Samsung,Refurbished Samsung Curved 65 4K (2160P) Smart...,TV & Video,USD3672.8895,"USD3,497.99",5%,False,28
LG,75 Class - LED - UJ6470 Series - 2160p - Smart...,TV & Video,USD2636.4,"USD2,197.00",20%,True,15
Panasonic,Leica DG Vario-Elmar 100-400mm f/4-6.3 ASPH. P...,Camera & Photo,USD1919.988,"USD1,599.99",20%,False,23


In [76]:
## Your code below


## -- Solution code

Observe how in earlier exercises, you inspect the data and consciously decide whether `.loc` or `.iloc` is the more appropriate choice here. This is a helpful mental exercise to get into - in real life data analysis, very often you're faced with the dilemma of picking from a large box of tools, and knowing which method is the best fit is a critical ingredient for efficiency and fluency.

### Conditional Subsetting

Along with `.iloc` and `.loc`, probably the most helpful type of subsetting would have to be conditional subsetting.

With conditional subsetting, we select data based on criteria we specified:
- `.categories == 'Home Audio'` to select all transactions where the category is Home Audio  
- `.unit_price >= 200` to select all transactions with unit price being equal to or greater than 200 USD. 
- `.quantity != 1` to select all transactions where quantity of purchase **is not** 1   

We can also use the `&` and `|` operators to join conditions.

For example:

`customer[(customer.gender == 'Male') & (customer.age > 21)]` subsets the rows where the customer is Male and older than 21.
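Here is a small runnable sketch of combined conditions. The `customer` DataFrame below is hypothetical and made up for this illustration; each condition is wrapped in parentheses (or stored in its own variable) because `&` and `|` bind more tightly than comparison operators such as `==` and `>`.

```python
import pandas as pd

# Hypothetical customer data
customer = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "age": [19, 34, 45, 20],
})

is_male = customer.gender == "Male"    # a boolean Series, one value per row
over_21 = customer.age > 21

both = customer[is_male & over_21]     # AND: both conditions hold
either = customer[is_male | over_21]   # OR: at least one condition holds
not_male = customer[~is_male]          # NOT: flips the condition

print(both)
print(either)
print(not_male)
```

Writing `customer[customer.gender == 'Male' & customer.age > 21]` without parentheses would not do what you expect, so keep the parentheses around each condition.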

In [84]:
elec[elec.categories == 'Home Audio'].head()

id,product_id,date,categories,brand,name,merchant,quantity,unit_price
5,62320,2017-03-02,Home Audio,DENON - HEOS,7.2CH AVR WITH WIFI _ BLUETOOTH 2 HDMI OUTPUTS...,Bestbuy.com,1,486.99
28,19069,2017-03-03,Home Audio,Definitive Technology,Definitive Technology - Wireless Audio Adapter...,Bestbuy.com,1,355.99
29,71402,2017-03-03,Home Audio,Denon,DP-300F Fully Automatic Turntable,Bestbuy.com,1,296.99
30,33811,2017-03-03,Home Audio,Kenwood,Kenwood KMM-BT315U Digital Media Receiver with...,Bestbuy.com,1,89.99
31,63566,2017-03-03,Home Audio,MartinLogan,MartinLogan - Dynamo 500 10 360-Watt Powered S...,Bestbuy.com,1,349.98


#### Knowledge Check

In the cell below, write code using conditional subsetting and answer the following questions:

1. Suppose that on May 26 last year we held a one-day-only big sale promotion. Across all of our electronics categories, how many purchases did we have on that day?
2. On that day, how many transactions did we have from `Computers & Accessories`?
3. From all transactions in our dataset, how many transactions came from `Walmart.com`?

_Tip_: You may find the `.shape` attribute convenient in extracting the number of rows / columns from a dataframe

In [97]:
## Your code below


## -- Solution code

## Referencing and Copying

In [134]:
SEO_budget = [1350, 2000, 1750]

SEO_budget

[1350, 2000, 1750]

In the cell above, we created a Python list, and our variable `SEO_budget` referenced that object accordingly.

In the following cell, we created `Socmed_budget`, and set `Socmed_budget = SEO_budget`. What do you think it does?

In [135]:
Socmed_budget = SEO_budget
print(Socmed_budget)

[1350, 2000, 1750]


Now, let's try to update our `Socmed_budget` by changing our first value to 1500 USD:

In [136]:
Socmed_budget[0] = 1500
print(Socmed_budget)

[1500, 2000, 1750]


So far, so reasonable. However, if we were to now check the values of that Python list as referenced by `SEO_budget`, we see that __the list has been updated__.

In [137]:
# print(Socmed_budget)
print(SEO_budget)

[1500, 2000, 1750]


The explanation is that when we execute the line `Socmed_budget = SEO_budget`, a new Python list **is not being created**. We truly have only one object, and that line only creates a new variable named `Socmed_budget` that references that very same object.
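We can verify that both names point at the same object using Python's `is` operator and the built-in `id()` function. A minimal sketch:

```python
SEO_budget = [1350, 2000, 1750]
Socmed_budget = SEO_budget           # a second name for the same list

# `is` checks identity (same object), not just equal contents
same_object = Socmed_budget is SEO_budget
same_id = id(Socmed_budget) == id(SEO_budget)
print(same_object, same_id)

# A separately created list with the same contents is equal but not identical
other_budget = [1350, 2000, 1750]
print(other_budget == SEO_budget, other_budget is SEO_budget)
```

Equality (`==`) compares values, while identity (`is`) tells us whether two names refer to the very same object in memory.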

This behavior of the `=` operator, which binds a new name to an existing object rather than copying it, differs from languages where assignment behaves like a copy (R, for example) and can be confusing to seasoned developers new to Python. 

The appropriate method instead is `.copy()`, which creates an actual (shallow) copy of the Python list. Notice that in the following code there are two distinct Python objects, and changing the values in one does not affect the other:

In [140]:
SEO_budget = [1350, 2000, 1750]
Socmed_budget = SEO_budget.copy()

Socmed_budget[0] = 1500

print(SEO_budget)
print(Socmed_budget)

[1350, 2000, 1750]
[1500, 2000, 1750]
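One caveat worth knowing: a list's `.copy()` is a shallow copy. It creates a new outer list, but any lists nested inside it are still shared. The sketch below uses a made-up nested list (`quarterly` is a hypothetical name) and the standard library's `copy.deepcopy()` for comparison:

```python
import copy

# Hypothetical nested budgets: one inner list per half-year
quarterly = [[1350, 2000], [1750, 1500]]

shallow = quarterly.copy()          # new outer list, same inner lists
deep = copy.deepcopy(quarterly)     # new outer list and new inner lists

shallow[0][0] = 9999                # modifies an inner list shared with `quarterly`

print(quarterly)
print(shallow)
print(deep)
```

For flat lists of numbers, such as `SEO_budget`, `.copy()` is sufficient; `deepcopy` only matters once the list contains other mutable objects.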


# Learn-by-Building

The data you will read in is `electronic_views.csv`, a small sample from a dataset of product view counts from the electronics section of an e-commerce site. Notice that the dataset has some formatting inconsistencies by design: the `prices_after` column uses a comma as the thousands separator alongside the currency prefix (`USD`), whereas the related `prices_before` column omits the separator.

In [62]:
views = pd.read_csv("data_input/electronic_views.csv")
views.head()

Unnamed: 0,brand,name,categories,prices_before,prices_after,prices_disc,availability,total_views
0,Razer,Blade Pro 17.3 4K Ultra HD Touch-Screen Laptop,Computers & Accessories,USD4619.989,"USD4,199.99",10%,True,13
1,Sony,Alpha a6500 Mirrorless Digital Camera (Body Only),Camera & Photo,USD2295.4,"USD1,996.00",15%,True,20
2,Samsung,Refurbished Samsung Curved 65 4K (2160P) Smart...,TV & Video,USD3672.8895,"USD3,497.99",5%,False,28
3,LG,75 Class - LED - UJ6470 Series - 2160p - Smart...,TV & Video,USD2636.4,"USD2,197.00",20%,True,15
4,Panasonic,Leica DG Vario-Elmar 100-400mm f/4-6.3 ASPH. P...,Camera & Photo,USD1919.988,"USD1,599.99",20%,False,23


Pay attention to the `prices_before` and `prices_after` columns. To perform arithmetic on these columns, we have to drop the 'USD' currency string and treat the values as numbers. We'll use Python's built-in string method `.replace()` for this.

How do we apply that replace function? We can call `.apply(our_function)` on our `DataFrame`, or on a single column as in the example below. What's interesting is that `our_function` could be any of Python's built-in functions, a function from a third-party module, or even a list of functions:

In [63]:
views['total_views'].apply([max, min])

max    43
min    13
Name: total_views, dtype: int64

Back to removing the currency string from `prices_before` and `prices_after` using `.apply()` and `.replace()`. We could create our own function, name it `removeUSD` for example, and then apply it in the following way:

`views['prices_before'].apply(removeUSD)`

Writing functions is a topic that is more suited for a later time, and students new to the trade of programming in Python will be gradually introduced to this aspect of Python programming.

However, given the task at hand, this seems like a reasonable time to introduce **Lambdas**.
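
A lambda is a small, unnamed function written as a single expression: `lambda x: x.replace('USD', '')` takes one argument `x` and returns whatever the expression after the colon evaluates to. The sketch below uses a toy `Series` made up for illustration (not the real dataset) to show that a lambda behaves the same as a named function, here given the `removeUSD` name suggested above:

```python
import pandas as pd

# Toy data for illustration only; not read from electronic_views.csv
prices = pd.Series(["USD4619.989", "USD2295.4"])

# A named function that strips the currency prefix from one string
def removeUSD(x):
    return x.replace("USD", "")

# .apply() calls the function once for every value in the Series
named_result = prices.apply(removeUSD)

# The same logic written inline as a lambda, with no separate definition
lambda_result = prices.apply(lambda x: x.replace("USD", ""))

print(lambda_result.tolist())
print(named_result.equals(lambda_result))
```

The lambda saves us from defining and naming a function that is only used once.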



In [65]:
views['prices_before'] = views['prices_before'].apply(lambda x: x.replace('USD', ''))
views['prices_after'] = views['prices_after'].apply(lambda x: x.replace('USD', ''))

views.head()

Unnamed: 0,brand,name,categories,prices_before,prices_after,prices_disc,availability,total_views
0,Razer,Blade Pro 17.3 4K Ultra HD Touch-Screen Laptop,Computers & Accessories,4619.989,"4,199.99",10%,True,13
1,Sony,Alpha a6500 Mirrorless Digital Camera (Body Only),Camera & Photo,2295.4,"1,996.00",15%,True,20
2,Samsung,Refurbished Samsung Curved 65 4K (2160P) Smart...,TV & Video,3672.8895,"3,497.99",5%,False,28
3,LG,75 Class - LED - UJ6470 Series - 2160p - Smart...,TV & Video,2636.4,"2,197.00",20%,True,15
4,Panasonic,Leica DG Vario-Elmar 100-400mm f/4-6.3 ASPH. P...,Camera & Photo,1919.988,"1,599.99",20%,False,23
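
Stripping `'USD'` alone is not always enough: a value such as `USD4,199.99` still contains the thousands separator afterwards. A regular expression can remove both in one pass. The pattern `[^\d.]+` matches any run of characters that are neither digits nor a decimal point. Here is a minimal sketch with Python's standard `re` module on a single sample string (the variable names are illustrative):

```python
import re

raw = "USD4,199.99"

# Plain string replacement removes the prefix but keeps the comma
only_usd_removed = raw.replace("USD", "")
print(only_usd_removed)

# [^\d.]+ : one or more characters that are NOT a digit and NOT a dot
cleaned = re.sub(r"[^\d.]+", "", raw)
print(cleaned)
```

The first line printed still has the comma; the second contains only digits and the decimal point. The `.replace(..., regex=True)` method in pandas applies the same kind of pattern to every value in the selected columns.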


In [69]:
views = pd.read_csv("data_input/electronic_views.csv")

# Remove every character that is not a digit or a decimal point ('USD' and commas)
views[['prices_before', 'prices_after']] = views[['prices_before', 'prices_after']].replace(r'[^\d.]+', '', regex=True)
views.head(3)

Unnamed: 0,brand,name,categories,prices_before,prices_after,prices_disc,availability,total_views
0,Razer,Blade Pro 17.3 4K Ultra HD Touch-Screen Laptop,Computers & Accessories,4619.989,4199.99,10%,True,13
1,Sony,Alpha a6500 Mirrorless Digital Camera (Body Only),Camera & Photo,2295.4,1996.00,15%,True,20
2,Samsung,Refurbished Samsung Curved 65 4K (2160P) Smart...,TV & Video,3672.8895,3497.99,5%,False,28


In [70]:
views.dtypes

brand            object
name             object
categories       object
prices_before    object
prices_after     object
prices_disc      object
availability       bool
total_views       int64
dtype: object

## Graded Assignment 1

### Task 1

You may have noticed that `prices_before` and `prices_after` are not stored in the right data type. You may use `views.dtypes` to confirm.

The following code uses the `pd.to_numeric` function to convert these columns to numeric data types.

```py
numeric_cols = ['prices_before', 'prices_after']
views[numeric_cols] = views[numeric_cols].apply(pd.to_numeric)
```
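
`pd.to_numeric` only succeeds when every value already looks like a number, which is why the currency string and separators had to be removed first. The sketch below uses made-up values to show the conversion, along with the optional `errors='coerce'` argument, which turns values that cannot be parsed into `NaN` instead of raising an error:

```python
import pandas as pd

# Toy values for illustration only
clean = pd.Series(["4619.989", "2295.4"])
converted = pd.to_numeric(clean)
print(converted.dtype)  # a floating-point dtype

# A value that still contains a comma cannot be parsed as a number
dirty = pd.Series(["4,199.99", "1996.00"])
coerced = pd.to_numeric(dirty, errors="coerce")

# True marks the values that became NaN
print(coerced.isna().tolist())
```

Checking for `NaN` after a coerced conversion is a quick way to spot values that were not cleaned properly.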

Add the transformation code, then compute the difference between the prices before and after the discount. Use `head` or `tail` to peek at the resulting data frame.

### Task 2

Either of the following lines selects the product with the highest `total_views`:

```py
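# Option 1: look up the row label of the (first) maximum, then select that row
views.loc[views['total_views'].idxmax()]
# Option 2: keep every row whose total_views equals the maximum
views.loc[views['total_views'] == views['total_views'].max()]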
```
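
The two approaches are not quite interchangeable. `idxmax()` returns a single index label (the first one if there is a tie), so `.loc` gives back one row as a `Series`. The boolean comparison keeps every row that matches the maximum and gives back a `DataFrame`. A small sketch on a made-up data frame with a deliberate tie:

```python
import pandas as pd

# Toy data for illustration only: products B and C tie for the maximum
toy = pd.DataFrame({"name": ["A", "B", "C"], "total_views": [13, 43, 43]})

# idxmax() gives the label of the first maximum, so .loc returns one row
first_max = toy.loc[toy["total_views"].idxmax()]

# The boolean mask keeps every row equal to the maximum
all_max = toy.loc[toy["total_views"] == toy["total_views"].max()]

print(type(first_max).__name__)
print(first_max["name"])
print(all_max["name"].tolist())
```

Here the first approach reports only product `B`, while the second reports both `B` and `C`.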

Some of the products viewed by customers are not currently available (see the `availability` column). Among the products that are not currently available, which product has the highest number of views?

_Tip_: Recall what you've learned about conditional subsetting.

### Task 3

Combine `.select_dtypes` and `.describe` to print the summary statistics of **all numeric columns** in `views`. 

_Tip_: This can be done by passing `['int','float']` or simply `'number'` as a value to the `include` parameter

### Task 4

Say we would like to print a data frame that includes only products with a price of more than USD 1,000. Which products in the sampled data frame have a price exceeding USD 1,000?