## Loading/Exploring the data

Load the iris.csv file into a pandas dataframe. Take a minute to familiarize yourself with the data.

## Import Pandas

Import the `pandas` library as `pd`.

In [30]:
import pandas as pd

Read the `iris.csv` dataset into an object named `iris`.

In [46]:
iris = pd.read_csv('iris.csv')
iris

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica


How many different species are in this dataset?

In [50]:
iris.value_counts('variety')

variety
Setosa        50
Versicolor    50
Virginica     50
Name: count, dtype: int64

What are their names?

- Setosa
- Versicolor
- Virginica
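`.value_counts()` answers both questions at once, but the number of species and their names can also be read off directly. The following is a minimal sketch, assuming a small hypothetical stand-in for `iris.csv` (the `sample` frame and its rows are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature stand-in for iris.csv; only the 'variety' column matters here
sample = pd.DataFrame({
    "variety": ["Setosa", "Setosa", "Versicolor", "Virginica", "Virginica"],
})

# Number of distinct values in the column
n_species = sample["variety"].nunique()

# The distinct values themselves, sorted for a stable order
species_names = sorted(sample["variety"].unique().tolist())

print(n_species)
print(species_names)
```

On the real data, the same two calls would be applied to `iris['variety']`.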

How many samples are there per species?

<details><summary>Hint</summary>Use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html"><code>.value_counts()</code></a> method</details>

Each species has 50 samples.


## Feature Engineering

Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length

In [66]:
iris['sepal_ratio'] = iris['sepal.width'] / iris['sepal.length']
iris

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,sepal_ratio
0,5.1,3.5,1.4,0.2,Setosa,0.686275
1,4.9,3.0,1.4,0.2,Setosa,0.612245
2,4.7,3.2,1.3,0.2,Setosa,0.680851
3,4.6,3.1,1.5,0.2,Setosa,0.673913
4,5.0,3.6,1.4,0.2,Setosa,0.720000
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica,0.447761
146,6.3,2.5,5.0,1.9,Virginica,0.396825
147,6.5,3.0,5.2,2.0,Virginica,0.461538
148,6.2,3.4,5.4,2.3,Virginica,0.548387


Create a similar column called `'petal_ratio'`: petal width / petal length

In [71]:
iris['petal_ratio'] = iris['petal.width'] / iris['petal.length']
iris

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,sepal_ratio,petal_ratio
0,5.1,3.5,1.4,0.2,Setosa,0.686275,0.142857
1,4.9,3.0,1.4,0.2,Setosa,0.612245,0.142857
2,4.7,3.2,1.3,0.2,Setosa,0.680851,0.153846
3,4.6,3.1,1.5,0.2,Setosa,0.673913,0.133333
4,5.0,3.6,1.4,0.2,Setosa,0.720000,0.142857
...,...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica,0.447761,0.442308
146,6.3,2.5,5.0,1.9,Virginica,0.396825,0.380000
147,6.5,3.0,5.2,2.0,Virginica,0.461538,0.384615
148,6.2,3.4,5.4,2.3,Virginica,0.548387,0.425926


Create 4 columns that correspond to `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)`, but expressed in inches.

In [109]:
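x = iris.iloc[:, 0:4]  # the four measurement columns (cm)

def new_fun(x):
    # Caution: dividing by 0.4 multiplies by 2.5; cm to inches would be x / 2.54
    return x / 0.4

x.apply(new_fun)  # displayed only; the result is not assigned, so x is unchanged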

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
0,12.75,8.75,3.50,0.50
1,12.25,7.50,3.50,0.50
2,11.75,8.00,3.25,0.50
3,11.50,7.75,3.75,0.50
4,12.50,9.00,3.50,0.50
...,...,...,...,...
145,16.75,7.50,13.00,5.75
146,15.75,6.25,12.50,4.75
147,16.25,7.50,13.00,5.00
148,15.50,8.50,13.50,5.75
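Since one inch is defined as 2.54 cm, a centimetre-to-inch conversion divides each measurement by 2.54. The following is a minimal sketch of adding four inch columns, assuming a hypothetical two-row stand-in for the data; the `_in` suffix and the `CM_PER_INCH` name are illustrative choices, not part of the exercise:

```python
import pandas as pd

CM_PER_INCH = 2.54  # exact by definition

# Hypothetical two-row stand-in using the column names from iris.csv
sample = pd.DataFrame({
    "sepal.length": [5.1, 4.9],
    "sepal.width": [3.5, 3.0],
    "petal.length": [1.4, 1.4],
    "petal.width": [0.2, 0.2],
})

measure_cols = ["sepal.length", "sepal.width", "petal.length", "petal.width"]
for col in measure_cols:
    # Store each converted column under a new name so the cm values are kept
    sample[col + "_in"] = sample[col] / CM_PER_INCH

print(sample.round(3))
```

Assigning the result to new columns, rather than only displaying it, is what makes the converted values available to later cells.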


In [113]:
part2 = iris.iloc[:,4:7]
part2

Unnamed: 0,variety,sepal_ratio,petal_ratio
0,Setosa,0.686275,0.142857
1,Setosa,0.612245,0.142857
2,Setosa,0.680851,0.153846
3,Setosa,0.673913,0.133333
4,Setosa,0.720000,0.142857
...,...,...,...
145,Virginica,0.447761,0.442308
146,Virginica,0.396825,0.380000
147,Virginica,0.461538,0.384615
148,Virginica,0.548387,0.425926


In [125]:
# x still holds the original cm values, because the .apply() result above was never stored
full_df = pd.concat([x, part2], axis=1)
full_df

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,sepal_ratio,petal_ratio
0,5.1,3.5,1.4,0.2,Setosa,0.686275,0.142857
1,4.9,3.0,1.4,0.2,Setosa,0.612245,0.142857
2,4.7,3.2,1.3,0.2,Setosa,0.680851,0.153846
3,4.6,3.1,1.5,0.2,Setosa,0.673913,0.133333
4,5.0,3.6,1.4,0.2,Setosa,0.720000,0.142857
...,...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica,0.447761,0.442308
146,6.3,2.5,5.0,1.9,Virginica,0.396825,0.380000
147,6.5,3.0,5.2,2.0,Virginica,0.461538,0.384615
148,6.2,3.4,5.4,2.3,Virginica,0.548387,0.425926


## Apply

Create a column called `'encoded_species'`:
- 0 for setosa
- 1 for versicolor
- 2 for virginica


<details><summary>Hint 1</summary>Create a dictionary using the species as keys and the numbers 0-2 as values.</details>

<details><summary>Hint 2</summary>Use the dictionary from hint 1 with the <code>.apply()</code> method to create the new column.</details>


In [135]:
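# Note: get_dummies one-hot encodes 'variety' into three 0/1 columns;
# it does not produce the single 0/1/2 'encoded_species' column described above
encoding = pd.get_dummies(full_df, dtype=int)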
encoding

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,sepal_ratio,petal_ratio,variety_Setosa,variety_Versicolor,variety_Virginica
0,5.1,3.5,1.4,0.2,0.686275,0.142857,1,0,0
1,4.9,3.0,1.4,0.2,0.612245,0.142857,1,0,0
2,4.7,3.2,1.3,0.2,0.680851,0.153846,1,0,0
3,4.6,3.1,1.5,0.2,0.673913,0.133333,1,0,0
4,5.0,3.6,1.4,0.2,0.720000,0.142857,1,0,0
...,...,...,...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,0.447761,0.442308,0,0,1
146,6.3,2.5,5.0,1.9,0.396825,0.380000,0,0,1
147,6.5,3.0,5.2,2.0,0.461538,0.384615,0,0,1
148,6.2,3.4,5.4,2.3,0.548387,0.425926,0,0,1
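The dictionary-plus-`.apply()` approach from the hints yields one integer column rather than three indicator columns. The following is a minimal sketch, assuming a hypothetical four-row stand-in and capitalized labels as they appear in the `variety` column; the `species_codes` name is illustrative:

```python
import pandas as pd

# Hypothetical stand-in; labels are capitalized as in the 'variety' column
sample = pd.DataFrame({"variety": ["Setosa", "Versicolor", "Virginica", "Setosa"]})

# Hint 1: species as keys, the numbers 0-2 as values
species_codes = {"Setosa": 0, "Versicolor": 1, "Virginica": 2}

# Hint 2: .apply() calls the function once per value; the lookup does the encoding
sample["encoded_species"] = sample["variety"].apply(lambda name: species_codes[name])

print(sample)
```

`sample["variety"].map(species_codes)` gives the same result here and is a common alternative to `.apply()` for dictionary lookups.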


## March Madness

Let's change up the dataset to something different than flowers: March Madness!

Read in the dataset `ncaa-seeds.csv` to an object named `seeds`.

This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:

| team_seed | opponent_seed |
|-----------|---------------|
| 01N       | 16N           |

For `team_seed`, `01` is the seed and `N` is the division (North). This row says that the 1st seed in the North division will play the 16th seed in the same division.

Using the `.apply()` method, create the following new columns:
- `team_division`
- `opponent_division`

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 01N       | 16N           | N             | N                 |
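One way to pull out the division is to take the last character of each seed string. The following is a minimal sketch, assuming every value ends in a single division letter; the rows and the `get_division` helper are hypothetical, and `ncaa-seeds.csv` itself is not loaded:

```python
import pandas as pd

# Hypothetical rows in the format described above
seeds = pd.DataFrame({
    "team_seed": ["01N", "02N", "03S"],
    "opponent_seed": ["16N", "15N", "14S"],
})

def get_division(seed):
    """Return the trailing division letter, e.g. '01N' -> 'N'."""
    return seed[-1]

seeds["team_division"] = seeds["team_seed"].apply(get_division)
seeds["opponent_division"] = seeds["opponent_seed"].apply(get_division)

print(seeds)
```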


Now that you have the divisions, change the `team_seed` and `opponent_seed` columns to just be the numbers.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 1         | 16            | N             | N                 |
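Stripping the division letter and converting the remainder with `int()` turns `'01N'` into `1`. The following is a minimal sketch under the same assumption of one trailing letter per value; the rows and the `get_seed_number` helper are hypothetical:

```python
import pandas as pd

# Hypothetical rows in the original string format
seeds = pd.DataFrame({
    "team_seed": ["01N", "02N", "03S"],
    "opponent_seed": ["16N", "15N", "14S"],
})

def get_seed_number(seed):
    """Drop the trailing division letter and convert the rest to an integer."""
    return int(seed[:-1])

# Overwrite both seed columns with their numeric values
for col in ["team_seed", "opponent_seed"]:
    seeds[col] = seeds[col].apply(get_seed_number)

print(seeds.dtypes)
```

Because this overwrites the seed columns, the division columns need to be created first, while the letter is still available.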

Create a new column called `seed_delta`, which is the team's seed minus their opponent's seed.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division | seed_delta |
|-----------|---------------|---------------|-------------------|------------|
| 1         | 16            | N             | N                 | -15        |

<br>
<details><summary>Did you get an error?</summary>
<code>team_seed</code> and <code>opponent_seed</code> must be numeric columns before you can perform mathematical operations on them.
</details>
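If the seed columns still hold digit strings, converting them with `.astype(int)` before subtracting avoids that error. The following is a minimal sketch, assuming hypothetical rows whose division letters have already been removed:

```python
import pandas as pd

# Hypothetical rows: seeds stored as digit strings rather than numbers
seeds = pd.DataFrame({
    "team_seed": ["1", "2", "3"],
    "opponent_seed": ["16", "15", "14"],
})

# Convert both columns to integers so that subtraction is arithmetic
seed_cols = ["team_seed", "opponent_seed"]
seeds[seed_cols] = seeds[seed_cols].astype(int)

# Team seed minus opponent seed
seeds["seed_delta"] = seeds["team_seed"] - seeds["opponent_seed"]

print(seeds)
```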