# Shark Tank

_Shark Tank_ is a reality TV show. Contestants present their idea for a company to a panel of investors (a.k.a. "sharks"), who then decide whether or not to invest in that company.  The investors give a certain amount of money in exchange for a percentage stake in the company ("equity"). If you are not familiar with the show, you may want to watch part of an episode [here](http://abc.go.com/shows/shark-tank) to get a sense of how it works.

The data set that you will examine in this lab covers all contestants from the first 6 seasons of the show, including:
- the name and industry of the proposed company
- whether or not it was funded (i.e., the "Deal" column)
- which sharks chose to invest in the venture (N.B. There are 7 regular sharks, not including "Guest". Each shark has a column in the data set, labeled by their last name.)
- if funded, the amount of money the sharks put in and the percentage equity they got in return

To earn full credit on this lab, you should:
- use vectorized operations instead of Python loops
- use the split-apply-combine pattern wherever possible

Of course, if you can't think of a vectorized solution, a `for` loop is still better than no solution at all!
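As a small illustration of the difference, the sketch below computes the same quantity with a loop and with a vectorized expression. The table, its column names (`amount`, `n_investors`), and its values are made up for illustration and are not part of the lab data.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({"amount": [100.0, 200.0, 300.0], "n_investors": [1, 2, 4]})

# Loop version: works, but is slow and verbose on large tables.
per_investor_loop = []
for _, row in toy.iterrows():
    per_investor_loop.append(row["amount"] / row["n_investors"])

# Vectorized version: one expression applied to whole columns at once.
per_investor_vec = toy["amount"] / toy["n_investors"]
```

Both versions produce the same numbers; the vectorized one is the style this lab asks for.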

In [1]:
import pandas as pd
import numpy as np
pd.set_option("display.max_rows", 15)

In [2]:
# This cell should be modified only by a grader.
scores = []

## Getting and Cleaning the Data

The data is stored in the CSV file `/data/sharktank.csv`. Read the data into a Pandas `DataFrame`.

In [3]:
data = pd.read_csv("/data/sharktank.csv")
data

Unnamed: 0,Season,No. in series,Company,Deal,Industry,Entrepreneur Gender,Amount,Equity,Corcoran,Cuban,Greiner,Herjavec,John,O'Leary,Harrington,Guest,Details / Notes
0,1.0,1.0,Ava the Elephant,Yes,Healthcare,Female,"$50,000",55%,1.0,,,,,,,,
1,1.0,1.0,Mr. Tod's Pie Factory,Yes,Food and Beverage,Male,"$460,000",50%,1.0,,,,1.0,,,,
2,1.0,1.0,Wispots,No,Business Services,Male,,,,,,,,,,,
3,1.0,1.0,College Foxes Packing Boxes,No,Lifestyle / Home,Male,,,,,,,,,,,
4,1.0,1.0,Ionic Ear,No,Uncertain / Other,Male,,,,,,,,,,,
5,1.0,2.0,A Perfect Pear,Yes,Food and Beverage,Female,"$500,000",50%,,,,1.0,,1.0,,,
6,1.0,2.0,Classroom Jams,Yes,Children / Education,Male,"$250,000",10%,1.0,1.0,,1.0,1.0,1.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
488,6.0,28.0,Sunscreen Mist,No,Lifestyle / Home,Male,,,,,,,,,,,
489,6.0,28.0,SynDaver Labs,Yes,Healthcare,Male,"$3,000,000",25%,,,,1.0,,,,,


There is one column for each of the sharks. A 1 indicates that they chose to invest in that company, while a missing value indicates that they did not choose to invest in that company. Notice that these missing values show up as NaNs when we read in the data. Fill in these missing values with zeros. Other columns may also contain NaNs; be careful not to fill those columns with zeros, or you may end up with strange results down the line.

_Hint:_ You should have read about Pandas' missing data capabilities in the pre-reading for this lab.

In [4]:
data.loc[:,"Corcoran":"Guest"] = data.loc[:,"Corcoran":"Guest"].fillna(0)
data

Unnamed: 0,Season,No. in series,Company,Deal,Industry,Entrepreneur Gender,Amount,Equity,Corcoran,Cuban,Greiner,Herjavec,John,O'Leary,Harrington,Guest,Details / Notes
0,1.0,1.0,Ava the Elephant,Yes,Healthcare,Female,"$50,000",55%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,1.0,1.0,Mr. Tod's Pie Factory,Yes,Food and Beverage,Male,"$460,000",50%,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
2,1.0,1.0,Wispots,No,Business Services,Male,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,1.0,1.0,College Foxes Packing Boxes,No,Lifestyle / Home,Male,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,1.0,1.0,Ionic Ear,No,Uncertain / Other,Male,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
5,1.0,2.0,A Perfect Pear,Yes,Food and Beverage,Female,"$500,000",50%,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,
6,1.0,2.0,Classroom Jams,Yes,Children / Education,Male,"$250,000",10%,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
488,6.0,28.0,Sunscreen Mist,No,Lifestyle / Home,Male,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
489,6.0,28.0,SynDaver Labs,Yes,Healthcare,Male,"$3,000,000",25%,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,


Notice that Amount and Equity are currently being treated as categorical variables (`dtype: object`). This is because they contain non-numerical characters, like $ and %, so Pandas assumes they are strings. Clean up these columns and cast them to numeric types (i.e., a `dtype` of `int` or `float`) because we'll need to perform mathematical operations on these columns.

_Hint:_ You should have read about Pandas' string processing capabilities in the pre-reading for this lab.
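One possible way to do this kind of cleanup is sketched below on made-up values shaped like the `Amount` and `Equity` columns (the two toy Series are assumptions for illustration, not the lab data). The point to watch is that `$` is a special character in regular expressions, so it should either sit inside a character class or be replaced with `regex=False`.

```python
import pandas as pd

# Toy values shaped like the Amount and Equity columns (invented for illustration).
amount = pd.Series(["$50,000", "$460,000", None])
equity = pd.Series(["55%", "50%", None])

# A character class removes "$" and "," in one pass; inside [...] the "$" is literal.
amount_clean = pd.to_numeric(amount.str.replace(r"[$,]", "", regex=True))

# regex=False treats "%" as a plain substring rather than a pattern.
equity_clean = pd.to_numeric(equity.str.replace("%", "", regex=False))
```

Missing entries stay missing (NaN) after conversion, so unfunded companies are not silently turned into zeros.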

In [5]:
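# Remove "$" and "," literally; regex=False keeps "$" from being read as a regex anchor.
data["Amount"] = data["Amount"].str.replace(",", "", regex=False)
data["Amount"] = data["Amount"].str.replace("$", "", regex=False)
data["Amount"] = pd.to_numeric(data["Amount"])
# Remove "%" and convert; anything that still fails to parse becomes NaN.
data["Equity"] = data["Equity"].str.replace("%", "", regex=False)
data["Equity"] = pd.to_numeric(data["Equity"], errors="coerce")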
data

Unnamed: 0,Season,No. in series,Company,Deal,Industry,Entrepreneur Gender,Amount,Equity,Corcoran,Cuban,Greiner,Herjavec,John,O'Leary,Harrington,Guest,Details / Notes
0,1.0,1.0,Ava the Elephant,Yes,Healthcare,Female,50000.0,55.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,1.0,1.0,Mr. Tod's Pie Factory,Yes,Food and Beverage,Male,460000.0,50.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,
2,1.0,1.0,Wispots,No,Business Services,Male,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
3,1.0,1.0,College Foxes Packing Boxes,No,Lifestyle / Home,Male,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,1.0,1.0,Ionic Ear,No,Uncertain / Other,Male,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
5,1.0,2.0,A Perfect Pear,Yes,Food and Beverage,Female,500000.0,50.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,
6,1.0,2.0,Classroom Jams,Yes,Children / Education,Male,250000.0,10.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
488,6.0,28.0,Sunscreen Mist,No,Lifestyle / Home,Male,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
489,6.0,28.0,SynDaver Labs,Yes,Healthcare,Male,3000000.0,25.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,


### Grader's Comments

- 
- 

[This question is worth 10 points.]

In [6]:
# This cell should be modified only by a grader.
scores.append(0)

## Which Company was Worth the Most?

The valuation of a company is how much it is worth. If someone invests \$10,000 for a 40% equity stake in the company, then this means the company must be valued at \$25,000, since 40% of \$25,000 is \$10,000.
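Written as a formula, with symbols introduced here only for illustration ($A$ for the amount invested in dollars, $p$ for the equity percentage, $V$ for the valuation in dollars), the relationship in the example is:

$$V \cdot \frac{p}{100} = A \quad\Longrightarrow\quad V = \frac{100\,A}{p}, \qquad \text{e.g. } V = \frac{100 \times 10{,}000}{40} = 25{,}000.$$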

Calculate the valuation of each company that was funded. Which company was most valuable? Is it the same as the company that received the largest total investment from the sharks?

In [7]:
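# Valuation = amount invested / fraction of equity received.
data["Valuation"] = data["Amount"] / (data["Equity"] / 100)
# Ignore infinite valuations (equity of 0%) when looking for the maximum.
finite_val = data["Valuation"].replace(np.inf, np.nan)
largVal = data[data["Valuation"] == finite_val.max()]
largAmount = data[data["Amount"] == data["Amount"].max()]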

In [8]:
largVal

Unnamed: 0,Season,No. in series,Company,Deal,Industry,Entrepreneur Gender,Amount,Equity,Corcoran,Cuban,Greiner,Herjavec,John,O'Leary,Harrington,Guest,Details / Notes,Valuation
421,6.0,11.0,Zipz,Yes,Food and Beverage,Male,2500000.0,10.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,with an option for another $2.5 Million for an...,25000000.0


In [9]:
largAmount

Unnamed: 0,Season,No. in series,Company,Deal,Industry,Entrepreneur Gender,Amount,Equity,Corcoran,Cuban,Greiner,Herjavec,John,O'Leary,Harrington,Guest,Details / Notes,Valuation
483,6.0,27.0,AirCar,Yes,Green/CleanTech,Male,5000000.0,50.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,Contingent on getting deal to bring to contine...,10000000.0


The company with the largest valuation was Zipz, valued at \$25,000,000. It is not the company that received the largest total investment from the sharks: that was AirCar, which received \$5,000,000 for 50% equity, implying a valuation of \$10,000,000.

In [10]:
# This cell should be modified only by a grader.
scores.append(0)

## Which Shark Invested the Most?

Calculate the total amount of money that each shark invested over the 6 seasons. Which shark invested the most total money over the 6 seasons?

_Hint:_ If $n$ sharks funded a given venture, then the amount that each shark invested is the total amount divided by $n$.
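The even split described in the hint can be sketched on a toy table. The two sharks `A` and `B` and all values below are hypothetical, chosen only to make the arithmetic easy to check.

```python
import pandas as pd

# Toy deals with two hypothetical sharks "A" and "B" (invented for illustration).
toy = pd.DataFrame({
    "Amount": [100.0, 300.0, 0.0],
    "A": [1.0, 1.0, 0.0],
    "B": [0.0, 1.0, 0.0],
})
sharks = ["A", "B"]

# Number of sharks in each deal; rows with no investors become NaN to avoid 0/0.
n_sharks = toy[sharks].sum(axis=1)
share = (toy["Amount"] / n_sharks.where(n_sharks > 0)).fillna(0)

# Multiply each 0/1 indicator column by that row's share, then total per shark.
totals = toy[sharks].multiply(share, axis=0).sum()
```

Here shark `A` ends up with 100 + 150 = 250 and shark `B` with 150, since the second deal is split between them.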

In [11]:
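# Per-shark share of each deal: the total amount split evenly among the investing sharks.
data["InvestPer"] = data["Amount"].fillna(0)/data.loc[:,"Corcoran":"Guest"].sum(axis=1)
# Weight each shark's 0/1 indicator by that share, then total over all deals (NaNs are skipped).
data.loc[:,"Corcoran":"Guest"].multiply(data["InvestPer"], axis=0).sum(axis=0)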

Corcoran       4912500.0
Cuban         17817500.0
Greiner        8170000.0
Herjavec      16297500.0
John           8154000.0
O'Leary        7952500.0
Harrington      800000.0
Guest           400000.0
dtype: float64

The shark who invested the most over the first six seasons was Mark Cuban, with a total of \$17,817,500.

### Grader's Comments

- 
- 

[This question is worth 20 points.]

In [12]:
# This cell should be modified only by a grader.
scores.append(0)

## Do the Sharks Prefer Certain Industries?

Calculate the funding rate (the proportion of companies that were funded) for each industry. Then, calculate 95% confidence intervals for the probability that a company from each industry gets funded.

Which industry gives you the best chance of being funded? Answer this question in two ways: (1) take the industry with the highest funding rate, and (2) look at the lower bounds of the 95% confidence intervals and take the industry with the highest one. Do you get the same answer from these two approaches? If not, can you explain intuitively why the industry with the highest funding rate might not have the highest lower confidence bound?

In [13]:
# Encode the outcome as 1 (funded) / 0 (not funded) so that the group mean is the funding rate.
data["Deal"] = data["Deal"].map({"Yes": 1, "No": 0})
# Funding rate per industry = funded companies / companies in that industry.
data.groupby("Industry")["Deal"].mean()

Industry
Business Services        0.230769
Children / Education     0.527273
Consumer Products        0.526316
Fashion / Beauty         0.462366
Fitness / Sports         0.575000
Food and Beverage        0.538462
Green/CleanTech          0.555556
Healthcare               0.555556
Lifestyle / Home         0.528571
Media / Entertainment    0.545455
Pet Products             0.411765
Software / Tech          0.454545
Uncertain / Other        0.384615
Name: Deal, dtype: float64

In [14]:
from symbulate import *
def calculate_interval(sample):
    # Approximate 95% confidence interval: mean +/- 2 standard errors.
    center = mean(sample)
    margin = 2 * sd(sample) / sqrt(len(sample))
    return (center - margin, center + margin)

In [15]:
data.groupby("Industry")["Deal"].apply(calculate_interval)

Industry
Business Services        (-0.00293985394238, 0.464478315481)
Children / Education        (0.392633492589, 0.661911961956)
Consumer Products           (0.297218026281, 0.755413552666)
Fashion / Beauty            (0.358964576181, 0.565766606615)
Fitness / Sports            (0.418675017991, 0.731324982009)
Food and Beverage            (0.440694013954, 0.63622906297)
Green/CleanTech             (0.224286225556, 0.886824885556)
Healthcare                  (0.321312765913, 0.789798345198)
Lifestyle / Home            (0.409243866561, 0.647898990582)
Media / Entertainment        (0.24519170042, 0.845717390489)
Pet Products                (0.173035455908, 0.650493955857)
Software / Tech             (0.281188620137, 0.627902288954)
Uncertain / Other           (0.114751378675, 0.654479390556)
Name: Deal, dtype: object

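The two approaches give different answers. By funding rate alone, Fitness / Sports is the best industry: its rate is 0.575, the center of its confidence interval. By the lower bound of the 95% confidence interval, Food and Beverage is best (0.441, compared with 0.419 for Fitness / Sports). The Fitness / Sports interval is wider because fewer companies pitched in that industry, so its estimated rate is less certain; a high rate based on a small sample can still have a low lower bound.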
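How sample size affects the comparison between the two criteria can be seen on made-up data. The sketch below uses two hypothetical industries, `X` (4 pitches) and `Y` (100 pitches), and the same "mean plus or minus 2 standard errors" interval as above, computed with pandas and NumPy instead of `symbulate`. Using the population standard deviation (`ddof=0`) is an assumption made here for the sketch.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical industries "X" and "Y" (invented for illustration).
toy = pd.DataFrame({
    "Industry": ["X"] * 4 + ["Y"] * 100,
    "Deal": [1, 1, 1, 0] + [1] * 70 + [0] * 30,
})

# Per-industry funding rate, population standard deviation, and sample size.
stats = toy.groupby("Industry")["Deal"].agg(
    rate="mean",
    sd=lambda s: s.std(ddof=0),
    n="count",
)

# Approximate 95% interval: rate +/- 2 standard errors.
margin = 2 * stats["sd"] / np.sqrt(stats["n"])
stats["lower"] = stats["rate"] - margin
stats["upper"] = stats["rate"] + margin
```

In this toy table `X` has the higher funding rate (0.75 versus 0.70), but `Y` has the higher lower bound, because its interval is based on many more pitches and is therefore much narrower.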

### Grader's Comments

- 
- 

[This question is worth 20 points.]

In [16]:
# This cell should be modified only by a grader.
scores.append(0)