# Week 7 Group Lab

**BUSN 32100 Data Analysis with R and Python**

**Nov 8,9 2019**

## Contents

- [Week 7 Group Lab](#Week-7-Group-Lab)  
  - Midterm and HW 3
  - [Tuples](#What-are-tuples?) 
  - [Dictionaries](#Dictionaries) 
  - [Data exploration and preparation](#Data-exploration-and-preparation)  

**Outcomes**

- Understand `tuple` and  `dictionary`
- Understand what `.dtype`/`.dtypes` do  
- Know how to change part of a data frame
- Be able to select subsets of the DataFrame using multiple boolean selections  
- Be able to construct `if`/`elif`/`else` conditional blocks  
  

| Content | Items |
|------|------|
|   Interface  | Cells, Files, Tabs, Modal Editing | 
|   Syntax  | Object-oriented, Indentation, Namespace, Scope | 
|   I/O  | Read, Write, Print | 
|   Data Types  | Numerics, Logicals, Strings |
|   Data Structure  | Data Frame, Series, List, Tuple, Dictionary|
|   Tools  | Packages, Functions, Methods |
|   Control Flow  | For Loop,  If Else |
|Data Wrangling|Indexing, Subsetting, Computation, Summary |
 

## R vs Python

<img src="https://raw.githubusercontent.com/BUSN32100/figures/master/RvsP.png" style="" width=700 >




## List generation using range()

Go to [pollev.com/kanixw321](https://pollev.com/kanixw321) for this Poll

In [31]:
import IPython
url = "https://embed.polleverywhere.com/multiple_choice_polls/9EW2l8IqRKqRrKnjsZtpN?controls=none&short_poll=true" 
IPython.display.IFrame(url, 500,300)

### So what is the right direction of the path's slash under Windows
### \\ or / ?  

In [32]:
%cd

/Users/hikaru


In [33]:
%cd "Desktop/Week_7_files"

/Users/hikaru/Desktop/Week_7_files


# Basic Data Structures

### What are tuples?

Tuples are very similar to lists and hold ordered collections of items. However, there are three main differences between tuples and lists:

1. tuples are created using parentheses, `(` and `)`, instead of
  square brackets, `[` and `]`  
1. tuples are *immutable*, which is a fancy computer science word
  meaning that they can’t be changed or altered after they are created  
1. there is a tight connection between tuples and multiple return values
  from functions
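
The second and third points can be sketched as follows. This is only an illustration: `min_and_max` is a hypothetical helper written for this example, not something used elsewhere in the lab.

```python
# Tuples are immutable: trying to change an item raises a TypeError
point = (1, "hello", 3.0)
try:
    point[0] = 2
    mutated = True
except TypeError:
    mutated = False

# A function that "returns two values" really returns one tuple
def min_and_max(values):
    """Return the smallest and largest item of a non-empty list."""
    return min(values), max(values)

result = min_and_max([3, 1, 2])   # result is a tuple
low, high = result                # tuple unpacking splits it into two names
print(mutated, type(result), low, high)
```

Unpacking (`low, high = result`) is the usual way to receive several return values at once.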

In [34]:
x = [2, "hello", 3.0]

t = (1, "hello", 3.0)
print("t is a", type(t))
t

t is a <class 'tuple'>


(1, 'hello', 3.0)

We can *convert* a list to a tuple by calling the `tuple` function on
a list

In [35]:
print("x is a", type(x))
print("tuple(x) is a", type(tuple(x)))
tuple(x)

x is a <class 'list'>
tuple(x) is a <class 'tuple'>


(2, 'hello', 3.0)

We can also convert a tuple to a list using the `list` function

In [36]:
list(t)

[1, 'hello', 3.0]

As with a list, we access items in a tuple `t` using `t[N]` where
`N` is an int

In [37]:
t[0]  # still start counting at 0

1


## zip and enumerate

Two functions that can be extremely useful when working with lists and dictionaries are `zip` and `enumerate`

Both of these functions are best understood by example, so let’s see them in action and then talk about what they do:

In [38]:
gdp_data = [9.607, 10.48, 11.06]
years = [2013, 2014, 2015]
z = zip(years, gdp_data)
print("type(z)", type(z))

type(z) <class 'zip'>


To see what is inside `z`, let’s convert it to a list:

In [39]:
list(z)

[(2013, 9.607), (2014, 10.48), (2015, 11.06)]

Notice that we now have a list, where each item is a tuple

Within each tuple we have one item from each of the collections we
passed to the zip function

In particular, the first item in `z` contains the first item from
`[2013, 2014, 2015]` and the first item from `[9.607, 10.48, 11.06]`

The second item in `z` contains the second item from each collection
and so on

### enumerate

Now let’s experiment with `enumerate`

In [40]:
e = enumerate(["a", "b", "c"])
print("type(e)", type(e))
e

type(e) <class 'enumerate'>


<enumerate at 0x11ea0cd20>

Again, to see what is inside we call `list(e)`

In [41]:
list(e)

[(0, 'a'), (1, 'b'), (2, 'c')]

We again have a list of tuples, but this time the first element in each
tuple is the *index* of the second tuple element in the initial
collection

Notice that the third item is `(2, 'c')` because
`["a", "b", "c"][2]` is `'c'`
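
In practice, `zip` and `enumerate` are most often used directly in a `for` loop rather than converted to a list. A small sketch using the same GDP numbers as above:

```python
years = [2013, 2014, 2015]
gdp_data = [9.607, 10.48, 11.06]

# zip lets us walk through two lists in step, one pair at a time
lines = []
for year, gdp in zip(years, gdp_data):
    lines.append(f"{year}: {gdp}")

# enumerate gives us the position of each item as we loop
positions = []
for i, year in enumerate(years):
    positions.append((i, year))

print(lines)
print(positions)
```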


An important quirk of some iterable types that are not lists (such as the above `zip`) is that
you cannot convert the same object to a list twice

This is because `zip` and `enumerate` produce what is called an iterator (generators are one kind of iterator). An
iterator will only produce each of its elements a single time, so if you call `list` on the same
iterator a second time, it will not have any elements left to iterate over. A `range` is different: it can be iterated over as many times as you like. For more
information, refer to the [Python documentation](https://docs.python.org/3/howto/functional.html#generators)
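
A short sketch of this behavior, for illustration only. It converts a `zip` object twice and, for contrast, a `range` object twice; a `range` can be reused, while the `zip` object is empty the second time.

```python
z = zip([2013, 2014], [9.607, 10.48])
first = list(z)     # consumes the iterator
second = list(z)    # nothing is left to produce

r = range(3)
r_first = list(r)
r_second = list(r)  # a range can be iterated over again

print(first, second)
print(r_first, r_second)
```

If you need the zipped pairs more than once, store `list(z)` in a variable and reuse that list.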

## Dictionaries

A dictionary (or dict) associates `key`s with `value`s

It will feel similar to a dictionary for words, where the keys are words and
the values are the associated definitions

The most common way to create a `dict` is to use curly braces — `{`
and `}` — like this:

```python3
{"key1": value1, "key2": value2, ..., "keyN": valueN}
```

where the `...` indicates that we can have any number of additional
terms


The crucial part of the syntax is that each key-value pair is written
`key: value` and that these pairs are separated by commas — `,`

Let’s see an example using World Bank data on China in 2017

In [42]:
china_data = {"country": "China", "year": 2017, "GDP" : 12.01, "population": 1.384}
print(china_data)

{'country': 'China', 'year': 2017, 'GDP': 12.01, 'population': 1.384}


Unlike a list or tuple, a `dict` allows us to
associate a name with each field, rather than having to remember the
order of the items

Often it is easier to read the code that makes a dict if we put each
`key: value` pair on its own line (recall our earlier comment on
using whitespace effectively to improve readability!)

The code below is equivalent to what we saw above:

In [43]:
china_data = {
    "country": "China",
    "year": 2017,
    "GDP" : 12.01,
    "population": 1.384
}

Most often the keys (e.g. “country”, “year”, “GDP”, and “population”)
will be strings, but we could also use numbers (`int`, or
`float`) 

The values can be **any** type, and different from each other



<blockquote>

**Check for understanding**

Create a new dict which associates each stock ticker with its stock price.

Here are some tickers and their prices <br>

- AAPL: 175.96  
- GOOGL: 1047.43  
- TVIX: 8.38  



</blockquote>

In [44]:
# your code here
empty_dict = {}
d1 = {'AAPL':175.96,'GOOGL':1047.43,'TVIX':8.38}
d1

{'AAPL': 175.96, 'GOOGL': 1047.43, 'TVIX': 8.38}

This next example is meant to drive home the fact that values can be
*anything* – Including another dictionary.

In [45]:
companies = {"AAPL": {"bid": 175.96, "ask": 175.98},
             "GE": {"bid": 1047.03, "ask": 1048.40},
             "TVIX": {"bid": 8.38, "ask": 8.40}}

In [46]:
print(companies)

{'AAPL': {'bid': 175.96, 'ask': 175.98}, 'GE': {'bid': 1047.03, 'ask': 1048.4}, 'TVIX': {'bid': 8.38, 'ask': 8.4}}


### Getting, setting, and updating dict items

We can now ask Python to tell us what the value for a particular key is using
the syntax `d[k]`  where `d` is our `dict` and `k` is the key we want to
find the value for

For example,

In [47]:
print(china_data["year"])
print(china_data['country'], "Population =" ,china_data['population'])

2017
China Population = 1.384


If we ask for the value of a key that is not in the dict, we will get an error

In [48]:
# this line raises a KeyError because "inflation" is not a key in the dict
china_data["inflation"]

KeyError: 'inflation'

We can also add new items to a dict using the syntax `d[new_key] = new_value`:

Let’s see some examples

In [None]:
print("original dictionary",china_data)
china_data["unemployment"] = "4.05%"
print("updated dictionary",china_data)

To update the value of an existing key, we use assignment in the same way (if the
key does not exist yet, it is created)

In [None]:
print("original dictionary", china_data)
china_data["unemployment"] = "4.051%"
print("updated dictionary", china_data)

Or we could change the type

In [None]:
china_data["unemployment"] = 4.051
print(china_data)
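
A few other dict operations are handy alongside `d[k]`. The sketch below is an illustration using the `china_data` values from above; the default string `"not recorded"` is made up for the example.

```python
china_data = {"country": "China", "year": 2017, "GDP": 12.01, "population": 1.384}

# `in` checks whether a key exists, so we can avoid a KeyError
has_inflation = "inflation" in china_data

# .get returns a default value instead of raising when the key is missing
inflation = china_data.get("inflation", "not recorded")

# .items() yields (key, value) tuples, which we can unpack in a loop
for key, value in china_data.items():
    print(key, "->", value)

print(has_inflation, inflation)
```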

## Series


<a id='index-16'></a>Another main **pandas type** we will introduce is called Series

A Series is a single column of data, with row labels for each observation

Pandas refers to the row labels as the *index* of the Series

<img src="https://github.com/BUSN32100/figures/raw/master/PandasSeries.png" style="">

  
Below we create a Series which contains the US unemployment rate every other year starting in 1995

In [None]:
import pandas as pd

In [None]:
values = [5.6, 5.3, 4.3, 4.2, 5.8, 5.3, 4.6, 7.8, 9.1, 8., 5.7]

In [None]:
years = list(range(1995, 2017, 2))

In [None]:
unemp = pd.Series(data=values, index=years, name="Unemployment")

In [None]:
unemp

In [None]:
unemp.index

In [None]:
unemp.values
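
Because a Series carries an index, we can look values up by label as well as by position. A minimal sketch using the same unemployment numbers:

```python
import pandas as pd

values = [5.6, 5.3, 4.3, 4.2, 5.8, 5.3, 4.6, 7.8, 9.1, 8.0, 5.7]
years = list(range(1995, 2017, 2))
unemp = pd.Series(data=values, index=years, name="Unemployment")

rate_2009 = unemp.loc[2009]     # look up by index label
first_rate = unemp.iloc[0]      # look up by position
crisis = unemp.loc[2007:2011]   # label slices include both endpoints

print(rate_2009, first_rate)
print(crisis)
```

Note that a label slice with `.loc` includes the upper bound, unlike ordinary Python slicing.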

### What can we do with a Series object?

Series are the building blocks of pandas DataFrames. Most of the methods of DataFrames also apply to Series.

In [None]:
unemp.head()

In [None]:
unemp.tail()

In [None]:
unemp.plot()

Or more statistically oriented methods

In [None]:
unemp.describe()


<a id='user-defined-functions'></a>

# Data exploration and preparation

  
- Today we will continue exploring the Game of Thrones data and complete some data preparation tasks

<img src="https://github.com/BUSN32100/figures/raw/master/workflow.png" style="" width=700 >


## DataFrames


<a id='index-15'></a>
A DataFrame is similar to a data frame in `R`. In addition to column names, DataFrames also have row labels (which serve as an index).

<img src="https://github.com/BUSN32100/figures/raw/master/PandasDataFrame.png" style="">

  
Let’s start by importing the packages we need and then reading in the data

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline

In [None]:
got = pd.read_csv("https://raw.githubusercontent.com/BUSN32100/data_files/master/top_characters3.csv", index_col=0)
n = 10
got_top10 = got.head(n)

Our character data is stored in the file `top_characters3.csv` in the course data repository on GitHub. The first line above reads it directly from that URL and uses the first column (`index_col=0`) as the row labels.

We can select particular rows using standard Python slicing notation. Notice that in `R`, `got[2:5, ]` returns rows 2 through 5 (4 rows). Here, Python returns the 3rd, 4th and 5th rows, because Python indexing starts at 0 and a slice is a half-open interval (upper bound exclusive).

In [None]:
got[2:5]

To select columns, we can pass a list containing the names of the desired columns represented as strings

In [None]:
got[['title','isAlive']]

To select both rows and columns using **integers**, the `iloc` attribute should be used with the format `.iloc[rows,columns]`

In [None]:
got.iloc[2:5, 0:4]

To select rows and columns using **labels**, the `loc` attribute can be used in a similar way

In [None]:
got.loc[['Bran Stark', 'Hodor'], ['title','isAlive']]

In [None]:
got.loc[:, "house"]

In [None]:
# `[string]` with no `.loc` extracts a whole column
got["house"]
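
The difference between `.iloc` and `.loc` is easiest to see on a tiny table. The DataFrame below is made up for illustration (the names and numbers are not the real screen times).

```python
import pandas as pd

# Toy data, invented for this example
toy = pd.DataFrame(
    {"house": ["Stark", "Stark", "Lannister", "Targaryen"],
     "season 1": [30.0, 25.5, 40.0, 35.0]},
    index=["Arya", "Bran", "Tyrion", "Daenerys"],
)

# Positions: rows 1 and 2, column 0 (upper bounds are excluded)
by_position = toy.iloc[1:3, 0:1]

# Labels: from "Bran" through "Tyrion" (both ends are included)
by_label = toy.loc["Bran":"Tyrion", ["house"]]

print(by_position.equals(by_label))
```

Both selections return the same two rows, but they get there with different rules about the upper bound.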

### Computations with columns

Pandas can do various computations and mathematical operations on columns

Let’s take a look at a few of them

In [None]:
# Find maximum
got["season 5"].max()

In [None]:
# Find the difference between two columns
got["season 7"] - got["season 1"]

In [None]:
# Find correlation between two columns
got["season 1"].corr(got["season 7"])

>**Check for understanding**
>For each of the following, we recommend reading the documentation for help
>
>-  Use introspection (or google) to find a way to obtain a list with all of the column names in `got`
>-  Using the `plot` method, make a line plot. What does it look like now?
>- Using what you know about subseting rows in R, what would be you guess to select only characters from House Stark?
>- Use `.loc` to select the the screen time data for the `season 2` and `season 7` for House Stark.


<a id='series-ref'></a>

In [None]:
got.columns

In [None]:
(got
 .head(5)
 [['season 1','season 3', 'season 7']]
 .T
 .plot(figsize=(8,6))
 
)

In [None]:
criteria = got['house']=="House Stark"
got.loc[criteria, ['season 2','season 7']]

## Data types

Occasionally, you might need to investigate what types you have in your DataFrame when an operation is not doing what you expect it to

Now run the commands `got.dtypes` and `got['title'].dtype` and think about what these attributes return


In [None]:
got.dtypes

In [None]:
got['title'].dtype

DataFrames will only distinguish between a few types

- Booleans (`bool`)  
- Floating point numbers (`float64`)  
- Integers (`int64`)  
- Dates (`datetime64[ns]`) 
- Categorical data (`category`)  
- Everything else, including strings (`object`)  


In the future, we will often refer to the type of data stored in a column as its `dtype`

Let’s look at an example for when having an incorrect `dtype` can cause problems


In [None]:
got[['season 1', 'season 2', 'season 4']].head()

When we try to do something like compute the sum of all the columns,
we get unexpected results…

In [None]:
got[['season 1', 'season 2', 'season 4']].sum()

This happened because `.sum` effectively calls `+` on all rows in
each column

Recall that when we apply `+` to two strings, the result is the
strings mashed together

So in this case we saw that the entries in all the rows of the `season 4`
column were stitched together into one long string
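
The same effect can be reproduced on a small scale. This sketch uses the first few values visible in the `season 4` column and forces the `object` dtype explicitly so that the behavior matches the column above.

```python
import pandas as pd

as_text = pd.Series(["42.5'", "27.5'", "47.75'"], dtype=object)
as_numbers = pd.Series([42.5, 27.5, 47.75])

text_total = as_text.sum()       # strings are concatenated
number_total = as_numbers.sum()  # numbers are added

print(text_total)
print(number_total)
```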

## Changing DataFrames

We can change the data inside of a DataFrame in various ways:

- Adding new columns  
- Changing index labels or column names  
- Altering existing data (e.g. doing some arithmetic or making a column
  of strings lowercase)  


### Creating new columns

We can create new data by “assigning values to a column” similar to how
we assign values to a variable

In pandas, we create a new column of a DataFrame by writing

```python
df["New Column Name"] = new_values
```
Below we create the total screen time over the seasons (we leave out `season 4` for now because, as we saw above, its values are stored as strings)

In [None]:
got["screenTimeTotal"] = (got["season 1"] + got["season 2"] + 
                         got["season 3"] + got["season 5"] +
                         got["season 6"] + got["season 7"] )

In [None]:
got[['season 1', 'season 3', 
     'season 7', 'screenTimeTotal']].head()

### Changing values

Changing the values inside of a DataFrame should be done sparingly

However, it can be done by assigning a value to a location in the
DataFrame

`df.loc[index, column] = value`

In [None]:
got.loc['Sansa Stark', "screenTimeTotal"] = 0.0

In [None]:
got[['season 1', 'season 3', 
     'season 7', 'screenTimeTotal']].head()

### Renaming columns

We can also rename the columns of a DataFrame

This is helpful because the names that sometimes come with datasets are
unbearable…

The data providers have their reasons for using these names, but it can make our job difficult since we sometimes need to type them repeatedly

We can rename columns by passing a dictionary to the `rename` method

This dictionary contains the old names as the keys and new names as the values.

See the example below

In [None]:
names = {"season 1": "S1",
         "season 2": "S2",
         "season 3": "S3",
         "season 4": "S4"}
got.rename(columns=names)

In [None]:
got.head()

We renamed our columns… Why does the DataFrame still show the old
column names?

Many of the operations that pandas does create a copy of your data by
default

It does this in order to protect your data and make sure you don’t
overwrite information you’d like to keep


We can make these operations permanent by either

1. Assigning the output back to the variable name
  `df = df.rename(columns=rename_dict)` or  
1. Looking into whether the method has an `inplace` option. For
  example, `df.rename(columns=rename_dict, inplace=True)`  


There are times when setting `inplace=True` will make your code faster
(e.g. if you have a very large DataFrame and you don’t want to copy all
the data), but that doesn’t always happen

We recommend using the first option until you get comfortable with
pandas because operations that don’t alter your data are (usually)
easier to reason about

In [None]:
names = {"season 1": "S1",
         "season 2": "S2",
         "season 3": "S3",
         "season 4": "S4"}
got2 = got.rename(columns=names)
got2.head()

## Cleaning data

For many data projects, a [significant proportion of
time](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#74d447456f63)
is spent collecting and cleaning the data — not doing the analysis

This non-analysis work is often called “data cleaning”

Pandas provides very powerful tools for doing data cleaning, which we
will demonstrate here

What would happen if we wanted to try and compute the mean of
`season 4`?

```python3
got["season 4"].mean()
```


In [None]:
# run the above code here
got["season 4"].mean()


It throws an error!

When looking at error messages, it helps to start at the very
bottom

The final error says, `TypeError: Could not convert 42.5'27.5'47.75'32.75'3... to numeric`


**Question 1a**

Convert the string below into a number

In [None]:
minutes = "32.75'"

# your code here
minutes = minutes.replace("'", "")  # remove the trailing apostrophe
float(minutes)


## String methods

Our solution to the previous exercise was to remove the `'` by using
the `replace` string method: `float(minutes.replace("'",""))`


There is a fast way to apply a string method to an entire column of data

Most of the methods that are available to a Python string are
available to a pandas Series that has `dtype` object

We access them by doing `s.str.method_name` where `method_name` is
the name of the method

When we apply the method to a Series, it is applied to all rows in the
series in one shot!

Let’s try using a Pandas `.str` method

In [None]:
got["season_4_str"] = got["season 4"].str.replace("'", "")

We can use `.str` to access almost any string method that works on
normal strings (see the [official
documentation](https://pandas.pydata.org/pandas-docs/stable/text.html)
for more information)

In [None]:
got["title"].str.contains("p")

In [None]:
got["title"].str.lower()

Instead of replacing the " ' " sign with an empty string we can generate a substring using index slicing

In [None]:
# dropping the last character of a column
got["season_4_str2"] = got["season 4"].str[:-1]
got["season_4_str2"] == got["season_4_str"]
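
The two approaches above (replacing the apostrophe and slicing off the last character) can be compared on a small Series. This is a sketch with a few values in the style of the `season 4` column; passing `regex=False` is an optional extra that tells pandas to treat the pattern as a literal character.

```python
import pandas as pd

raw = pd.Series(["42.5'", "27.5'", "47.75'"], dtype=object)

replaced = raw.str.replace("'", "", regex=False)  # remove the apostrophe
sliced = raw.str[:-1]                             # drop the last character

print(replaced.tolist())
print(replaced.equals(sliced))
```

Slicing only works here because the unwanted character is always the last one; `replace` is the safer choice when its position can vary.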


**Question 1b**

Make a new column called `title_upper` that contains the elements of
`title` with all uppercase letters.


In [None]:
#your code here
got['title_upper'] = got['title'].str.upper()

## Type conversions

In our example above, even after we have removed the `" ' "`, the
`dtype` of the `season_4_str` column shows that pandas still treats
it as a string

We need to convert this column to numbers

The best way to do this is using the `pd.to_numeric` function

This function attempts to convert whatever is stored in a Series into
numeric values

For example, after having removed the `" ' "` from the time in column
`"season 4"`, they are ready to be converted to numbers

In [None]:
got["season 4"] = pd.to_numeric(got["season_4_str"])

In [None]:
got.dtypes

In [None]:
got[['title', 'season 1', 'season 4']]

There is support for converting to other types as well

Using the `astype` method we can convert to any of the supported
pandas `dtypes`
Below are some examples (pay attention to the reported `dtype`)

In [None]:
got["season 4"].astype(str)

In [None]:
got["season_4_str"].astype(float)


**Question 1c**

Convert the column `"popularity"` to a numeric type using `pd.to_numeric` and
save it to the DataFrame as `"popularity_num"`

Notice that there is a value that is not a number

Look at the documentation for `pd.to_numeric` and think about how to
overcome this

Try converting the column using `astype(float)`, what do you get?

Discuss with your neighbor about when to use `.astype` and when to use `pd.to_numeric`


In [None]:
# entries that cannot be parsed become NaN instead of raising an error
got['popularity_num'] = pd.to_numeric(got['popularity'], errors="coerce")
got['popularity_num']
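
The difference between the two conversion tools shows up as soon as a value cannot be parsed. A minimal sketch on made-up data (the entry `"unknown"` is invented for illustration):

```python
import pandas as pd

raw = pd.Series(["0.5", "0.25", "unknown"], dtype=object)

# pd.to_numeric with errors="coerce" turns unparseable entries into NaN
coerced = pd.to_numeric(raw, errors="coerce")

# astype(float) refuses to guess and raises an error instead
try:
    raw.astype(float)
    astype_failed = False
except ValueError:
    astype_failed = True

print(coerced)
print(astype_failed)
```

A reasonable rule of thumb is to use `astype` when the data is already clean and `pd.to_numeric` when you expect some bad entries.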


In [None]:
got.astype?

## Missing data

Many datasets have missing data

In our example, not everyone has a title. Therefore, we are missing several elements in the `title` column 

In [None]:
got['title']

We can find missing data by using the `isnull` method

In [None]:
got.isnull()

We might want to know whether particular rows or columns have any
missing data

To do this we can use the `.any` method on the boolean DataFrame
`df.isnull()`

In [None]:
got.isnull().any(axis=0)

In [None]:
got.isnull().any(axis=1)

Instead of only checking whether any values are missing, it can be more useful to take the mean of the boolean DataFrame with `.mean()`, which gives the *proportion* of missing values in each column

In [None]:
got.isnull().mean()

There are many approaches to dealing with missing data

Two that are commonly used (and the corresponding DataFrame method) are

- Exclusion: Ignore any data that is missing (`.dropna`)  
- Imputation: Compute “predicted” values for the data that is missing
  (`.fillna`)  


For the advantages and disadvantages of these (and other) approaches,
consider reading the [Wikipedia
article](https://en.wikipedia.org/wiki/Missing_data)

For now, let’s see some examples

In [None]:
# drop all rows containing a missing observation
got.dropna()

In [None]:
# fill the missing values with a specific value
got['title'].fillna(value='')

In [None]:
# use the _next_ valid observation to fill the missing data
got['culture'].bfill()

In [None]:
# use the _previous_ valid observation to fill missing data
got['culture'].ffill()
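
To see exactly what each approach does, it helps to try them on a table small enough to check by eye. The DataFrame below is made up for illustration.

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame(
    {"title": ["Lord", np.nan, "Ser"],
     "culture": [np.nan, "Northmen", np.nan]},
    index=["A", "B", "C"],
)

dropped = toy.dropna(subset=["title"])  # keep rows where title is present
filled = toy["title"].fillna("")        # replace missing titles with ""
backfilled = toy["culture"].bfill()     # copy the next valid value upward

print(len(dropped))
print(filled.tolist())
print(backfilled.tolist())
```

The last value of `culture` stays missing after the backward fill because there is no later observation to copy from.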

Notice that the missing values in `got` have not been excluded or filled yet, because none of the calls above changed `got` itself.

In [None]:
got.isnull().mean()

In order to make the missing data treatment stick, we can use the argument `inplace=True`, or we can assign the changed data frame back to the original data frame name

In [None]:
# choose one of the following four lines (the second one is active here)

#got['culture'].fillna(value='', inplace=True) #fill in empty string for missing value
got['culture'] = got['culture'].fillna(value='') #fill in empty string for missing value


#got.dropna(subset = ['culture'], inplace=True) #exclude rows with missing values in column culture
# got = got.dropna(subset = ['culture']) #exclude rows with missing values in column culture

In [None]:
got.isnull().mean()

## Boolean Selection with Multiple conditions

Last week we showed that we can use conditional statements to construct a Series of booleans from our data and then filter rows using that Series. Think "House Stark"

In the homework we saw that we can use the words `and` and `or` to combine multiple booleans into a single bool

Recall

- `True and False -> False`  
- `True and True -> True`  
- `False and False -> False`  
- `True or False -> True`  
- `True or True -> True`  
- `False or False -> False`  


Now we will do something similar in Pandas, but instead of `bool1 and bool2` we write 

```python
(bool_series1) & (bool_series2)
```


Likewise, instead of `bool1 or bool2` we write

```python
(bool_series1) | (bool_series2)
```


This is very similar to how `R` uses multiple conditions. The difference is that the parentheses "()" around each condition are **not optional**, because `&` and `|` are evaluated before comparisons such as `>` and `==`

In [None]:
Stark_alive = (got["isAlive"] >0) & (got["house"] == "House Stark")
Stark_alive.head()

In [None]:
got[Stark_alive]

In [None]:
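# without parentheses around the first condition, `&` is evaluated before `>`,
# so this does not compute the condition we intended
Stark_alive =  got["isAlive"] >0 & (got["house"] == "House Stark")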
Stark_alive.head()
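
The correctly parenthesized form is easy to verify on a small table. The rows below are invented for illustration and are not taken from the real data.

```python
import pandas as pd

toy = pd.DataFrame(
    {"house": ["House Stark", "House Stark", "House Lannister"],
     "isAlive": [1, 0, 1]},
    index=["Arya", "Robb", "Tyrion"],
)

# Each comparison gets its own parentheses before combining with & or |
alive_starks = (toy["isAlive"] > 0) & (toy["house"] == "House Stark")
alive_or_stark = (toy["isAlive"] > 0) | (toy["house"] == "House Stark")

print(toy[alive_starks].index.tolist())
print(alive_or_stark.tolist())
```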

### isin

Sometimes we will want to check whether a data point takes on one of a
fixed set of values

We could do this by writing `(df["x"] == val_1) | (df["x"] == val_2)`,
but there is a better way: the `.isin` method

In [None]:
got["culture"].isin(['Ironmen','Free Folk', 'Dothraki']).head(15)

In [None]:
# now select full rows where this series is True
got.loc[got["culture"].isin(['Ironmen','Free Folk', 'Dothraki'])]

### Built-in Data frame aggregations

Pandas already has some of the most frequently used aggregations built-in

For example:

- Mean  (`mean`)  
- Variance (`var`)  
- Standard deviation (`std`)  
- Minimum (`min`)  
- Median (`median`)  
- Maximum (`max`)  
- etc…  


When looking for standard operations, using some “tab completion” goes a
long way

In [None]:
seasons = ['season '+str(i) for i in range(1,8)]
got[seasons].mean()

As seen above, an aggregation aggregates each column by default…

However, by using `axis` keyword argument, you can do aggregations by
row as well

In [None]:
got[seasons].std(axis=1).head(10)
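
The effect of the `axis` argument can be checked on a two-by-two table. The numbers and labels below are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"season 1": [10.0, 20.0], "season 2": [30.0, 40.0]},
    index=["Character A", "Character B"],
)

per_column = toy.mean()      # one value per season (axis=0 is the default)
per_row = toy.mean(axis=1)   # one value per character

print(per_column)
print(per_row)
```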

### Writing your own aggregation

The built in aggregations will get us pretty far in our analysis, but
sometimes we need more flexibility

We can have pandas perform custom aggregations by following these two
steps:

1. Write a Python function that takes a `Series` as an input and
  outputs a single value  
1. Call the `agg` method with our new function as an argument  


For example, below we will classify characters as “major” or
“minor” roles based on whether their mean screen time is
above or below 5 minutes

In [None]:
#
# We write the (aggregation) function that we'd like to use
#
def major_or_minor(s):
    """
    This function takes a Pandas Series object and returns major
    if the mean is above 5 and minor if the mean is below 5
    """
    if s.mean() < 5:
        out = "Minor"
    else:
        out = "Major"

    return out

## Detour: Conditional Statements

Sometimes we will only want to execute some piece of code if a certain condition
is met

These conditions can be anything

We use *conditionals* to run particular pieces of code when certain criteria
are met

Conditionals are closely tied to booleans, so if you don’t remember what those are go back to the homework for a refresher

The basic syntax for conditionals is

```python
if condition:
    # code to run when condition is True
else:
    # code to run if no conditions above are True
```


Note that immediately following the condition there is a colon *and*
that the next line begins with blank spaces

Also note that the `else` clause is optional



Let’s see some simple examples

In [None]:
condition = True 
if condition:
    print("This is where `True` code is run")

This example is equivalent to just typing the print statement, but the
example below isn’t…

In [None]:
condition = False 

if condition:
    print("This is where `True` code is run")

Notice that when you run the cell above nothing is printed

That is because the condition for the `if` statement was not true, so the code
inside the indented block was never run

The next example shows us how `else` works

In [None]:
condition = False 

if condition:
    print("This is where `True` code is run")
else:
    print("This is where `False` code is run")

The `if condition: ...` part of this example is the same as the example
before, but now we added an `else:` clause

In this case because the conditional for the `if` statement was not
`True`, the if code block was not executed, but the `else` block was


**Question 1d**

Using the code cell below as a start, print `"Good afternoon"` if the
`current_time` is past noon

Otherwise do nothing

(HINT: write some conditional based on `current_time.hour`)

In [None]:
import datetime
current_time = datetime.datetime.now()

## your code here
# hour runs from 0 to 23, so 12 or later means it is past noon
if current_time.hour >= 12:
    print("Good afternoon")


**Question 1e**

Store your first name in a variable called `name`

Store your neighbor’s name in a variable called `neighbor`

If `name > neighbor` then print the message `"My name is greater"`,
otherwise print `"My neighbor's name is greater"`


In [None]:
## your code here
name = "Hikaru"

neighbor = "Arvind"

# strings are compared alphabetically (character by character), not by length
if name > neighbor:
    print("My name is greater")
else:
    print("My neighbor's name is greater")

### `elif` clauses

Sometimes you have more than one condition you want to check

For example, you might want to run a different set of code based on which
quarter a particular transaction took place in

In this case you could check if the date is in Q1, or in Q2, or in Q3, or if not
any of these it must be in Q4

The way to express this type of conditional is to use one or more `elif`
clauses in addition to the `if` and the `else`

The syntax is

```python
if condition1:
    # code to run when condition1 is True
elif condition2:
    # code to run when condition2 is True
elif condition3:
    # code to run when condition3 is True
else:
    # code to run when none of the above are true
```


You can include as many `elif` clauses as you want

As before the `else` part is optional
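
The quarter example described above could be written as follows. This is a sketch: `quarter` is a hypothetical helper that takes a month number rather than a full date.

```python
def quarter(month):
    """Return the calendar quarter (1 to 4) for a month number from 1 to 12."""
    if month <= 3:
        q = 1
    elif month <= 6:
        q = 2
    elif month <= 9:
        q = 3
    else:
        q = 4
    return q

print([quarter(m) for m in (1, 5, 9, 11)])
```

Only the first condition that is true has its block executed, which is why the later checks do not need to repeat the lower bound.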


We write the (aggregation) function that we'd like to use

In [None]:

def major_or_minor(s):
    """
    This function takes a Pandas Series object and returns major
    if the mean is above 5 and minor if the mean is below 5
    """
    if s.mean() < 5:
        out = "Minor"
    else:
        out = "Major"

    return out

In [None]:
# How does this differ from got[seasons].agg(major_or_minor)?
got[seasons].agg(major_or_minor, axis=1)
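
To see how the two calls differ, we can run the same function on a small made-up table. Without `axis`, the function receives one column at a time; with `axis=1`, it receives one row at a time.

```python
import pandas as pd

def major_or_minor(s):
    """Return "Minor" if the mean of the Series is below 5, else "Major"."""
    return "Minor" if s.mean() < 5 else "Major"

# Toy screen times, invented for this example
toy = pd.DataFrame(
    {"season 1": [12.0, 1.0, 5.0], "season 2": [10.0, 0.5, 1.5]},
    index=["Character A", "Character B", "Character C"],
)

by_season = toy.agg(major_or_minor)             # one label per column
by_character = toy.agg(major_or_minor, axis=1)  # one label per row

print(by_season)
print(by_character)
```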

Do the following exercises in separate code cells below:

**Question 1f**

- For each character, what is the median screen time across the seven seasons?

In [None]:
# median screen time by character across seasons
got[seasons].median(axis=1)

**Question 1g**

- What was the maximum screen time across the seasons and across characters in our
  sample? Which season did it happen in? Who was this character?
  - Hint 1: you can use more than one aggregation together 
  - Hint 2: Check out dataframe/series method `idxmax`  

In [None]:
# max screen time across characters and seasons
got[seasons].max().max()

In [None]:
# season in which the max screen time occurred
print(got[seasons].max().idxmax())

# character with the max screen time
got[seasons].max(axis=1).idxmax()
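
The way these aggregations chain together can be checked on a small made-up table, where the answer is easy to read off by eye.

```python
import pandas as pd

toy = pd.DataFrame(
    {"season 1": [10.0, 20.0], "season 2": [45.0, 5.0]},
    index=["Character A", "Character B"],
)

overall_max = toy.max().max()                # largest value in the whole table
season_of_max = toy.max().idxmax()           # column label of that value
character_of_max = toy.max(axis=1).idxmax()  # row label of that value

print(overall_max, season_of_max, character_of_max)
```

`max()` collapses one dimension, and `idxmax()` then reports the label where the remaining maximum sits.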



**Question 1h**

- Classify each character as high or low volatility based on whether the
  standard deviation of their screen time is above or below/equal 5 

In [None]:
# low or high volatility
def high_or_low(s):
    """
    This function takes a Pandas Series object and returns high
    if the std is above 5 and low if the std is below or equal 5
    """
    if s.std() <= 5:
        out="low"
    else:
        out = "high"
    return out

In [None]:
got[seasons].agg(high_or_low, axis=1)


## Question 2: Data preparation with Chipotle orders

We will now use data from an
[article](https://www.nytimes.com/interactive/2015/02/17/upshot/what-do-people-actually-order-at-chipotle.html)
written by The Upshot at the NYTimes

This data has order information from almost 2,000 Chipotle orders and
includes information on what was ordered and how much it cost

In [123]:
import pandas as pd
chipotle = pd.read_csv("https://raw.githubusercontent.com/BUSN32100/data_files/master/chipotle.csv")
chipotle.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


**Question 2a**

Check the data type of all the columns in data frame `chipotle`, then check the data type of column `item_price`

In [116]:
#data type of all the columns
chipotle.dtypes

order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object

In [117]:
#data type of item_price
print(chipotle['item_price'].dtypes)


object



**Question 2b** 

How many missing items are there in this dataset? What are the percentages of missing items in each column?  

In [118]:
count_missing = chipotle.isnull().sum()
percent_missing = chipotle.isnull().sum() * 100 / len(chipotle)
missing_value_df = pd.DataFrame({'column_name': chipotle.columns,
                                 'percent_missing': percent_missing})

count_missing

order_id                 0
quantity                 0
item_name                0
choice_description    1246
item_price               0
dtype: int64
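
The percentages can be read off in a similar way, because the mean of a boolean mask is the share of `True` values. A minimal sketch, assuming a small made-up frame (`toy` is hypothetical, not the Chipotle file):

```python
import numpy as np
import pandas as pd

# hypothetical toy data, used only to illustrate the pattern
toy = pd.DataFrame({
    "item_name": ["Izze", "Chips", "Chicken Bowl", "Chips"],
    "choice_description": ["[Clementine]", np.nan, "[Rice]", np.nan],
})

count_missing = toy.isnull().sum()            # number of missing values per column
percent_missing = toy.isnull().mean() * 100   # share of missing values, in percent

# put both measures side by side, one row per column of toy
summary = pd.DataFrame({"count_missing": count_missing,
                        "percent_missing": percent_missing})
print(summary)
```

The same two lines should work on `chipotle` in place of `toy`.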


**Question 2c**

Next we need to deal with the missing values in column `choice_description`. One option is to fill in the missing values. What would be an appropriate fill value for a string column?

In [119]:
import numpy as np
# an empty string is a natural placeholder for a missing string value
chipotle['choice_description'] = chipotle['choice_description'].fillna('')


chipotle.isnull().sum()


order_id              0
quantity              0
item_name             0
choice_description    0
item_price            0
dtype: int64


**Question 2d** 

How many items have 'Chicken' in their `item_name`? How many have it in their `choice_description`? <br>
Hint: use the `.str` accessor with its `contains` method, and pay attention to capital letters.

In [110]:
chipotle[chipotle['item_name'].str.contains("Chicken")].sum()

order_id                                                        1484575
quantity                                                           1654
item_name             Chicken BowlChicken BowlChicken Crispy TacosCh...
choice_description    [Tomatillo-Red Chili Salsa (Hot), [Black Beans...
item_price            $16.98 $10.98 $8.75 $8.75 $11.25 $8.49 $8.49 $...
dtype: object

In [109]:
chipotle[chipotle['choice_description'].str.contains("Chicken")].sum()

order_id                                                           1860
quantity                                                              5
item_name                          BurritoCrispy TacosBurritoSaladSalad
choice_description    [Adobo-Marinated and Grilled Chicken, Pinto Be...
item_price                               $7.40 $7.40 $7.40 $7.40 $7.40 
dtype: object
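
Calling `.sum()` on the filtered DataFrame adds up every column, which is why the outputs above show concatenated strings rather than a count. To count the matching rows, one option is to sum the boolean mask itself. A sketch on hypothetical toy data:

```python
import pandas as pd

# hypothetical toy data, not the real Chipotle file
toy = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Steak Burrito", "chicken salad", "Izze"],
})

# the mask is True where the text contains "Chicken" (case-sensitive by default)
mask = toy["item_name"].str.contains("Chicken")
n_exact = mask.sum()          # True counts as 1, False as 0

# case=False also matches the lowercase "chicken"
n_any_case = toy["item_name"].str.contains("chicken", case=False).sum()

print(n_exact, n_any_case)
```

`len(toy[mask])` gives the same count as `mask.sum()`.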


**Question 2e** 

Convert the `item_price` column to a float column. Before you can change the type of the column, you need to remove the `$` sign at the beginning.

In [124]:
# regex=False makes "$" a literal character instead of the regex end-of-string anchor
chipotle['item_price'] = chipotle['item_price'].str.replace('$', '', regex=False).astype(float)
chipotle


Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,2.39
1,1,1,Izze,[Clementine],3.39
2,1,1,Nantucket Nectar,[Apple],3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",16.98
...,...,...,...,...,...
4617,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Sour ...",11.75
4618,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...",11.75
4619,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...",11.25
4620,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Lettu...",8.75
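
The cleaning step can be tried on a few values first. A minimal sketch, assuming made-up price strings (the `prices` Series is hypothetical):

```python
import pandas as pd

prices = pd.Series(["$2.39 ", "$16.98 ", "$8.75"])  # hypothetical sample values

# regex=False treats "$" literally; strip() drops stray spaces around the number
cleaned = prices.str.replace("$", "", regex=False).str.strip().astype(float)

print(cleaned.tolist())
print(cleaned.dtype)
```

Checking `.dtype` afterwards is a quick way to confirm the conversion worked.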



**Question 2f**

What is the average price of an item with chicken? <br>
Hint: You might want to create the boolean series with multiple conditions first, then filter the data and calculate the average price.

In [132]:
chipotlechicken = chipotle[chipotle['item_name'].str.contains("Chicken")]

chipotlechicken['item_price'].mean()

#The average price is $10.13 for an item with chicken


10.133724358974309
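
The hint mentions a boolean series built from multiple conditions. One way to do that is to build each mask separately and combine them with `|` (or) or `&` (and). A sketch on hypothetical toy data, assuming we want chicken in either column:

```python
import pandas as pd

# hypothetical toy data, not the real Chipotle file
toy = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Burrito", "Steak Burrito", "Salad"],
    "choice_description": ["[Rice]", "[Grilled Chicken, Rice]", "[Beans]", "[Lettuce]"],
    "item_price": [8.0, 7.0, 9.0, 6.0],
})

in_name = toy["item_name"].str.contains("Chicken")
in_choice = toy["choice_description"].str.contains("Chicken")

# | means "or", & means "and"; both sides must be boolean Series
has_chicken = in_name | in_choice

# filter rows with the mask, then average the price column
avg_price = toy.loc[has_chicken, "item_price"].mean()
print(avg_price)
```

When the conditions are written inline, each one needs its own parentheses, for example `(a > 1) | (b < 2)`.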


**Question 2g**  

What is the average price of an item with steak? 

In [136]:
chipotlesteak = chipotle[chipotle['item_name'].str.contains("Steak")]

chipotlesteak['item_price'].mean()

#The average price is $10.52 for an item with steak


10.518888888888851


**Question 2h**
 
Did chicken or steak produce more revenue (total)? 

In [140]:
chipotlechicken['quantity'].sum()*10.13
#Revenue from chicken items is $16,755

16755.02

In [142]:
chipotlesteak['quantity'].sum()*10.52
#Revenue from steak items is $7,721

#Chicken items produced more revenue

7721.679999999999
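
One caveat: in the data preview earlier, the row with `quantity` 2 for a Chicken Bowl has an `item_price` of 16.98, which suggests that `item_price` may already be the total for the row rather than a unit price. If that assumption holds, revenue is just the sum of `item_price`, and multiplying total quantity by the mean price would overstate it. A sketch on hypothetical toy data:

```python
import pandas as pd

# hypothetical rows; item_price is ASSUMED to be the row total (unit price * quantity)
toy = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Chicken Bowl", "Steak Burrito"],
    "quantity": [2, 1, 1],
    "item_price": [16.0, 8.0, 9.0],
})

# revenue per item under the row-total assumption
revenue = toy.groupby("item_name")["item_price"].sum()
print(revenue)

# for comparison: total quantity times mean row price
chicken = toy[toy["item_name"].str.contains("Chicken")]
overstated = chicken["quantity"].sum() * chicken["item_price"].mean()
print(overstated)
```

Here the three chicken bowls bring in 24.0, while the quantity-times-mean shortcut reports 36.0.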


## Question 3: Data structure methods
Last week we talked about lists. Here are a few more examples with lists. Let's start with one list.

In [155]:
x = [2.0, 9.1, 12.5]

We can check if a list contains an element using the `in` keyword

In [156]:
2.0 in x

True

In [157]:
1.5 in x

False

For our list `x`, other common operations we might want to do are…

In [158]:
x.reverse()
x

[12.5, 9.1, 2.0]

In [159]:
number_list = [10, 25, 42, 1.0]
print(number_list)
number_list.sort()
print(number_list)

[10, 25, 42, 1.0]
[1.0, 10, 25, 42]


Note that in order to `sort` we had to have all elements in our list
be numbers (`int` and `float`)

We could actually do the same with a list of strings. In this case sort
will put the items in alphabetical order.

In [160]:
str_list = ["NY", "AZ", "TX"]
print(str_list)
str_list.sort()
print(str_list)

['NY', 'AZ', 'TX']
['AZ', 'NY', 'TX']
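
What does not work is mixing numbers and strings in one list, because Python 3 cannot compare them. The sketch below shows that, and also `sorted()`, which returns a new list instead of changing the original (the variable names are just for illustration):

```python
mixed = [10, "AZ", 2.5]
try:
    mixed.sort()
    error_message = None
except TypeError as err:
    # a str cannot be compared with an int or a float
    error_message = str(err)
print(error_message)

# sorted() returns a new list and leaves the original untouched
states = ["NY", "AZ", "TX"]
ordered = sorted(states)
print(states)   # ['NY', 'AZ', 'TX']
print(ordered)  # ['AZ', 'NY', 'TX']
```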


The `append` method adds an element to the end of an existing list

In [161]:
num_list = [10, 25, 42, 8]
print(num_list)
num_list.append(10)
print(num_list)

[10, 25, 42, 8]
[10, 25, 42, 8, 10]


However, if you call append with a list, it adds a `list` to the end,
rather than the numbers in that list

In [162]:
num_list = [10, 25, 42, 8]
print(num_list)
num_list.append([20, 4])
print(num_list)

[10, 25, 42, 8]
[10, 25, 42, 8, [20, 4]]


To combine the lists instead, use `extend`

In [151]:
num_list = [10, 25, 42, 8]
print(num_list)
num_list.extend([20, 4])
print(num_list)

[10, 25, 42, 8]
[10, 25, 42, 8, 20, 4]


### Removing elements
Two ways to remove elements from a list:

*  Remove by value:

In [152]:
a = ['a','b','c','d']
a.remove('d')
a

['a', 'b', 'c']

* Remove by index and return value: 

In [153]:
a.pop(1)
a

['a', 'c']
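
The difference between the two is easier to see when the return value of `pop` is stored. A small sketch (the list `letters` is made up for illustration):

```python
letters = ['a', 'b', 'c', 'd']

removed = letters.pop(1)   # removes the element at index 1 and returns it
print(removed)             # b
print(letters)             # ['a', 'c', 'd']

last = letters.pop()       # with no argument, pop removes the last element
print(last)                # d

try:
    letters.remove('z')    # remove raises ValueError if the value is absent
except ValueError:
    print("'z' is not in the list")
```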


**Question 3a**

In the first cell, try `y.append(z)`

In the second cell try `y.extend(z)`

Create a markdown cell below and explain the behavior

HINT: when you are trying to explain use `y.append?` and `y.extend?` to
see a description of what these methods are supposed to do


In [163]:
y = ["a", "b", "c"]
z = [1, 2, 3]
# your code here
y.append(z)

print(y)


['a', 'b', 'c', [1, 2, 3]]



In [164]:
y = ["a", "b", "c"]
z = [1, 2, 3]
# your code here
y.extend(z)
print(y)


['a', 'b', 'c', 1, 2, 3]


In [None]:
#The first cell added the list itself as a single element, rather than its contents.
#The second cell added z's contents to the original list, thereby extending it.

**Question 3b**

Verify that tuples are indeed immutable by attempting the following:  
```python
x = [2.0, 9.1, 12.5]

t = (2.0, 9.1, 12.5)
```

- Changing the first element of `t` to be `100`  


- Appending a new element `"!!"` to the end of `t` (remember with a
  list `x` we would use `x.append("!!")` to do this)


- Sort `x` with `x.sort()`, then try to sort `t`


- Reverse `x` with `x.reverse()`, then try to reverse `t`



In [167]:
# change first element of t 
x = [2.0, 9.1, 12.5]

t = (2.0, 9.1, 12.5)

t[0] = 100

TypeError: 'tuple' object does not support item assignment


In [170]:
# appending to t
t.append("!!")

AttributeError: 'tuple' object has no attribute 'append'


In [172]:
# sorting x and t
x.sort()

x

[2.0, 9.1, 12.5]

In [173]:
t.sort()

AttributeError: 'tuple' object has no attribute 'sort'


In [175]:
# reversing x and t
x.reverse()
x

[12.5, 9.1, 2.0]

In [176]:
t.reverse()

#Looks like tuples are indeed immutable

AttributeError: 'tuple' object has no attribute 'reverse'

All code that tries to modify `t` should raise an error.
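
Since a tuple cannot be changed in place, the usual workaround is to build a new tuple from the old one. A minimal sketch (the `t_...` names are made up for illustration):

```python
t = (2.0, 9.1, 12.5)

try:
    t[0] = 100                  # item assignment is not allowed on a tuple
except TypeError as err:
    print(err)

# each of these builds a new object; t itself never changes
t_changed = (100,) + t[1:]
t_appended = t + ("!!",)
t_sorted_desc = tuple(sorted(t, reverse=True))
t_reversed = t[::-1]

print(t)              # (2.0, 9.1, 12.5)
print(t_changed)      # (100, 9.1, 12.5)
print(t_appended)     # (2.0, 9.1, 12.5, '!!')
```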



**Question 3c**

Look at the [World Factbook for Australia](https://www.cia.gov/-library/publications/the-world-factbook/geos/as.html)
and create a dictionary with data containing the following value types:
float, string, integer, list, and dict.  Choose any data you wish

To confirm you have created the dictionary successfully, try looking up some values by key

In [15]:
# your code here
australia_data = {
    "PPP" : float(504000),
    "Territorial claims make" : int(1770),
    "Location" : "Oceania",
    "Ethnic groups" : ["English 25.9%", "Australian 25.4%", "Irish 7.5%", "Scottish 6.4%"],
    "Languages" : {"English": 72.7, "Mandarin": 2.5, "Arabic": 1.4, "Cantonese": 1.2},
    }
print(australia_data)

{'PPP': 504000.0, 'Territorial claims make': 1770, 'Location': 'Oceania', 'Ethnic groups': ['English 25.9%', 'Australian 25.4%', 'Irish 7.5%', 'Scottish 6.4%'], 'Languages': {'English': 72.7, 'Mandarin': 2.5, 'Arabic': 1.4, 'Cantonese': 1.2}}


In [23]:
# look up some values by a key

australia_data["PPP"]

504000.0
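
Lookups can also be chained when a value is itself a list or a dict. A small sketch, using a shortened hypothetical dict named `country`:

```python
# shortened, hypothetical version of the dictionary from the exercise
country = {
    "Location": "Oceania",
    "Languages": {"English": 72.7, "Mandarin": 2.5},
    "Ethnic groups": ["English 25.9%", "Australian 25.4%"],
}

english_share = country["Languages"]["English"]   # dict inside a dict
first_group = country["Ethnic groups"][0]         # list inside a dict

# .get returns a default instead of raising KeyError for a missing key
capital = country.get("Capital", "not recorded")

print(english_share, first_group, capital)
```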


#### Common `dict` functionality

There are a handful of common things we can do with dicts

We will demonstrate them with examples below

In [24]:
# number of key-value pairs in a dict
len(australia_data)

5

In [25]:
# get a list of all the keys
list(australia_data.keys())

['PPP', 'Territorial claims make', 'Location', 'Ethnic groups', 'Languages']

In [26]:
# get a list of all the values
list(australia_data.values())

[504000.0,
 1770,
 'Oceania',
 ['English 25.9%', 'Australian 25.4%', 'Irish 7.5%', 'Scottish 6.4%'],
 {'English': 72.7, 'Mandarin': 2.5, 'Arabic': 1.4, 'Cantonese': 1.2}]
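
Two more operations that often come up are looping over key-value pairs and checking whether a key exists. A sketch with a small hypothetical dict named `facts`:

```python
facts = {"Location": "Oceania", "PPP": 504000.0}

# .items() yields (key, value) tuples, which a for loop can unpack
for key, value in facts.items():
    print(key, "->", value)

# the in keyword checks keys, not values
has_location = "Location" in facts
has_oceania = "Oceania" in facts
print(has_location, has_oceania)
```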



**Question 3d**

Use Jupyter’s help facilities to learn how to use the `pop` method to
remove a key (and its value) from the dict.


In [29]:
# uncomment and use ?
australia_data.pop?

Docstring:
D.pop(k[,d]) -> v, remove specified key and return the corresponding value.
If key is not found, d is returned if given, otherwise KeyError is raised
Type:      builtin_function_or_method



Explain what happens to the value you popped

Experiment with calling `pop` twice on the same key


In [32]:
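# your code here
# the first call returns the popped value; running the cell again on the
# same key raises KeyError because the key is no longer in the dict
australia_data.pop('Languages')

KeyError: 'Languages'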

In [34]:
australia_data
#Languages key/value combination disappeared!

{'PPP': 504000.0,
 'Territorial claims make': 1770,
 'Location': 'Oceania',
 'Ethnic groups': ['English 25.9%',
  'Australian 25.4%',
  'Irish 7.5%',
  'Scottish 6.4%']}
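
The docstring's optional second argument `d` can be seen in action with a small sketch (the dict `d` below is hypothetical):

```python
d = {"Location": "Oceania", "Languages": {"English": 72.7}}

first = d.pop("Languages")          # returns the removed value
second = d.pop("Languages", None)   # key is gone, so the default is returned

try:
    d.pop("Languages")              # no default given, so this raises KeyError
except KeyError as err:
    print("KeyError:", err)

print(first)    # {'English': 72.7}
print(second)   # None
print(d)        # {'Location': 'Oceania'}
```

Passing a default is a handy way to remove a key without worrying about whether it is still there.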


**Question 3e**

We can also add new items to a dict using the syntax `d[new_key] = new_value`:

Let’s try adding a key-value pair and printing the dictionary `australia_data` before and after the addition

In [37]:
# print the old dictionary
australia_data = {
    "PPP" : float(504000),
    "Territorial claims make" : int(1770),
    "Location" : "Oceania",
    "Ethnic groups" : ["English 25.9%", "Australian 25.4%", "Irish 7.5%", "Scottish 6.4%"],
    "Languages" : {"English": 72.7, "Mandarin": 2.5, "Arabic": 1.4, "Cantonese": 1.2},
    }
print(australia_data)

# add a key-value pair
australia_data["random"] = "random"

# print the new dictionary

australia_data

{'PPP': 504000.0, 'Territorial claims make': 1770, 'Location': 'Oceania', 'Ethnic groups': ['English 25.9%', 'Australian 25.4%', 'Irish 7.5%', 'Scottish 6.4%'], 'Languages': {'English': 72.7, 'Mandarin': 2.5, 'Arabic': 1.4, 'Cantonese': 1.2}}


{'PPP': 504000.0,
 'Territorial claims make': 1770,
 'Location': 'Oceania',
 'Ethnic groups': ['English 25.9%',
  'Australian 25.4%',
  'Irish 7.5%',
  'Scottish 6.4%'],
 'Languages': {'English': 72.7,
  'Mandarin': 2.5,
  'Arabic': 1.4,
  'Cantonese': 1.2},
 'random': 'random'}
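
The same syntax overwrites the value when the key already exists, and `update` handles several pairs at once. A minimal sketch with a made-up dict:

```python
d = {"a": 1, "b": 2}

d["c"] = 3                    # new key: the pair is added at the end
d["a"] = 100                  # existing key: the old value is overwritten
d.update({"b": 20, "e": 5})   # update adds or overwrites several pairs at once

print(d)  # {'a': 100, 'b': 20, 'c': 3, 'e': 5}
```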

**Challenge**

For the tuple `foo` below, use a combination of `zip`,
`range`, `len` to mimic `enumerate(foo)`

First, start checking what the answer should look like by typing `list(enumerate(foo))`

Verify that your proposed solution is correct by converting each to a list
and checking equality with `==`


In [18]:
foo = ("good", "luck!")

In [19]:
# checking the behavior of enumerate
list(enumerate(foo))

[(0, 'good'), (1, 'luck!')]

In [20]:
#mimic the behavior of enumerate with zip, range and len
zipped = zip(range(len(foo)), foo)

list(zipped)


[(0, 'good'), (1, 'luck!')]


In [None]:
#Confirm the two are equivalent
foo = ("good", "luck!")
zipped = zip(range(len(foo)), foo)

list(zipped) == list(enumerate(foo))

#True
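
The same idea can be wrapped in a small helper so it works for any sequence with a length. A sketch, where `my_enumerate` is a hypothetical name and not a built-in:

```python
def my_enumerate(seq):
    """Pair each element of seq with its position, like enumerate(seq)."""
    return zip(range(len(seq)), seq)

foo = ("good", "luck!")
same = list(my_enumerate(foo)) == list(enumerate(foo))
print(same)  # True

# a zip object can only be consumed once, so convert it to a list to reuse it
pairs = list(my_enumerate("abc"))
print(pairs)  # [(0, 'a'), (1, 'b'), (2, 'c')]
```

Unlike the built-in `enumerate`, this version needs `len(seq)`, so it does not work on generators.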
