# Shark Tank

## Due Tuesday, April 12 at 8:00 AM

_Shark Tank_ is a reality TV show. Contestants present their idea for a company to a panel of investors (a.k.a. "sharks"), who then decide whether or not to invest in that company. The investors give a certain amount of money in exchange for a percentage stake in the company ("equity"). If you are not familiar with the show, you may want to watch at least part of one episode [here](http://abc.go.com/shows/shark-tank) to get a sense of the format.

The data that you will examine in this lab contains the results of all contestants in the first 6 seasons of the show: 
- the name and industry of the company
- whether or not it was funded
- which sharks chose to invest (Note: There are 7 sharks, not including "Guest". They are represented by their last names.)
- if funded, the amount of money the sharks put in and the percentage equity they got in return

In [15]:
import pandas as pd
data = pd.read_csv("/data/sharktank.csv")

There is one column for each of the "sharks" (a.k.a. investors). A 1 indicates that they chose to invest in that company, while a missing value indicates that they did not choose to invest in that company. Notice that these missing values show up as NaNs when we read in the data. Learn about Pandas' [missing data handling](http://pandas.pydata.org/pandas-docs/stable/missing_data.html), and use these methods to replace all of the missing values with zeros.
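A minimal sketch of the replacement step, run on a made-up three-row frame rather than the real CSV (the company names, columns, and values are illustrative only). It fills just the shark columns, which is one possible choice; calling `fillna(0)` on the whole frame would also zero out missing amounts and notes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: 1.0 means the shark invested, NaN means they did not.
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Corcoran": [1.0, np.nan, np.nan],
    "Cuban": [np.nan, np.nan, 1.0],
})

shark_cols = ["Corcoran", "Cuban"]

# Count the missing values in each shark column before filling.
print(toy[shark_cols].isna().sum())

# Replace NaN with 0 in the shark columns only, leaving other columns untouched.
toy[shark_cols] = toy[shark_cols].fillna(0)
print(toy)
```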

In [16]:
data

Unnamed: 0,Season,No. in series,Company,Deal,Industry,Entrepreneur Gender,Amount,Equity,Corcoran,Cuban,Greiner,Herjavec,John,O'Leary,Harrington,Guest,Details / Notes
0,1.0,1.0,Ava the Elephant,Yes,Healthcare,Female,"$50,000",55%,1.0,,,,,,,,
1,1.0,1.0,Mr. Tod's Pie Factory,Yes,Food and Beverage,Male,"$460,000",50%,1.0,,,,1.0,,,,
2,1.0,1.0,Wispots,No,Business Services,Male,,,,,,,,,,,
3,1.0,1.0,College Foxes Packing Boxes,No,Lifestyle / Home,Male,,,,,,,,,,,
4,1.0,1.0,Ionic Ear,No,Uncertain / Other,Male,,,,,,,,,,,
5,1.0,2.0,A Perfect Pear,Yes,Food and Beverage,Female,"$500,000",50%,,,,1.0,,1.0,,,
6,1.0,2.0,Classroom Jams,Yes,Children / Education,Male,"$250,000",10%,1.0,1.0,,1.0,1.0,1.0,,,
7,1.0,2.0,Lifebelt,No,Consumer Products,Male,,,,,,,,,,,
8,1.0,2.0,Crooked Jaw,No,Fashion / Beauty,Male,,,,,,,,,,,
9,1.0,2.0,Sticky Note Holder,No,Lifestyle / Home,Female,,,,,,,,,,,


Notice that Amount and Equity are currently being treated as categorical variables (dtype: object). This is because they contain non-numerical characters, like $ and %, so Pandas assumes they are strings. Use [Pandas' string processing capabilities](http://pandas.pydata.org/pandas-docs/stable/text.html) to strip these characters. Then, convert these columns to quantitative variables (i.e., with a dtype of int or float).
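One way this cleaning could look, sketched on a few made-up values in the same format as the real columns. Passing `regex=False` is a defensive choice so that `$` is treated as a literal character rather than a regular-expression anchor, and `pd.to_numeric` is used here as an alternative to `astype(float)`.

```python
import pandas as pd

# Toy values in the same format as the Amount and Equity columns (illustrative only).
toy = pd.DataFrame({
    "Amount": ["$50,000", "$460,000", None],
    "Equity": ["55%", "50%", None],
})

# regex=False treats "$" as a literal character, not the regex end-of-string anchor.
amount = (toy["Amount"]
          .str.replace("$", "", regex=False)
          .str.replace(",", "", regex=False))
equity = toy["Equity"].str.replace("%", "", regex=False)

# pd.to_numeric converts the cleaned strings to floats; missing values stay missing.
toy["Amount"] = pd.to_numeric(amount)
toy["Equity"] = pd.to_numeric(equity)

print(toy.dtypes)
print(toy)
```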

In [17]:
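# Strip "$" and "," from Amount and "%" from Equity (regex=False treats "$" literally)
data.Amount = data.Amount.str.replace('$', '', regex=False)
data.Amount = data.Amount.str.replace(',', '', regex=False)
data.Equity = data.Equity.str.replace('%', '', regex=False)
# Mark deals whose equity is listed as 'n/a' with inf so they stay distinguishable
data.Equity = data.Equity.str.replace('n/a', 'inf', regex=False)
# Replace every remaining missing value (including the shark columns) with 0
data = data.fillna(0)
# Convert the cleaned strings to floats
data.Amount = data.Amount.astype(float)
data.Equity = data.Equity.astype(float)
data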



Unnamed: 0,Season,No. in series,Company,Deal,Industry,Entrepreneur Gender,Amount,Equity,Corcoran,Cuban,Greiner,Herjavec,John,O'Leary,Harrington,Guest,Details / Notes
0,1.0,1.0,Ava the Elephant,Yes,Healthcare,Female,50000.0,55.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,1.0,1.0,Mr. Tod's Pie Factory,Yes,Food and Beverage,Male,460000.0,50.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0
2,1.0,1.0,Wispots,No,Business Services,Male,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,1.0,1.0,College Foxes Packing Boxes,No,Lifestyle / Home,Male,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,1.0,1.0,Ionic Ear,No,Uncertain / Other,Male,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
5,1.0,2.0,A Perfect Pear,Yes,Food and Beverage,Female,500000.0,50.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0
6,1.0,2.0,Classroom Jams,Yes,Children / Education,Male,250000.0,10.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0
7,1.0,2.0,Lifebelt,No,Consumer Products,Male,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
8,1.0,2.0,Crooked Jaw,No,Fashion / Beauty,Male,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
9,1.0,2.0,Sticky Note Holder,No,Lifestyle / Home,Female,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


The valuation of a company is how much it is worth. If someone invests \$10,000 for a 40% equity stake in the company, then this means the company must be valued at \$25,000, since 40% of \$25,000 is \$10,000.
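The arithmetic above can be written as a one-line function. This is only a small sketch; the function name `valuation` is chosen here for illustration.

```python
def valuation(amount, equity_percent):
    """Implied company value when `amount` buys `equity_percent` percent of it."""
    return amount / (equity_percent / 100)

# The example from the text: $10,000 for a 40% stake.
example = valuation(10_000, 40)
print(example)
```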

Calculate the valuation of each funded company. Which company was most valuable? Is it the same as the company that received the largest investment from the sharks? (First write your code in the code cell below, then write your text answer in the markdown cell below.)

In [18]:
import numpy as np
# Valuation = amount invested / fraction of the company received
data["Valuation"] = data.Amount / (data.Equity / 100)
data = data.fillna(0)  # unfunded companies (0 / 0 gives NaN) get a valuation of 0
data.Valuation = data.Valuation.replace(np.inf, np.nan)  # infinite valuations become NaN
data.Valuation.max()


25000000.0

In [19]:
high = data.Valuation == data.Valuation.max()
data[high]

Unnamed: 0,Season,No. in series,Company,Deal,Industry,Entrepreneur Gender,Amount,Equity,Corcoran,Cuban,Greiner,Herjavec,John,O'Leary,Harrington,Guest,Details / Notes,Valuation
421,6.0,11.0,Zipz,Yes,Food and Beverage,Male,2500000.0,10.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,with an option for another $2.5 Million for an...,25000000.0


In [20]:
high = data.Amount == data.Amount.max()
data[high]

Unnamed: 0,Season,No. in series,Company,Deal,Industry,Entrepreneur Gender,Amount,Equity,Corcoran,Cuban,Greiner,Herjavec,John,O'Leary,Harrington,Guest,Details / Notes,Valuation
483,6.0,27.0,AirCar,Yes,Green/CleanTech,Male,5000000.0,50.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,Contingent on getting deal to bring to contine...,10000000.0


As seen in the tables above, the most valuable company was Zipz, with a valuation of \$25,000,000. The company that received the largest investment was AirCar, which received \$5,000,000 for 50% equity (a valuation of \$10,000,000), so the two are not the same company. There were a few cases where companies received an investment but had n/a listed as equity. I treated these n/a values as inf to differentiate them from the NaNs that we replaced with 0s earlier in the assignment. By treating them as inf, I was able to keep their valuation as NaN instead of 0, thus differentiating them from the companies that received no money from the sharks. I left them as NaN because I didn't want them to affect any totals, I didn't want them to be confused with companies that weren't invested in, and I didn't want to make assumptions about their valuations, since that would be extrapolating.
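Special values like these are easy to get wrong, so it may be worth checking how pandas evaluates each case. The sketch below uses made-up rows (not taken from the real data). In it, a nonzero amount divided by an infinite equity gives 0, a nonzero amount divided by zero equity gives inf, and 0 divided by 0 gives NaN. Whether that matches the intent described above depends on how the n/a rows end up encoded.

```python
import numpy as np
import pandas as pd

# Made-up rows covering the edge cases (illustrative only).
toy = pd.DataFrame({
    "Amount": [100000.0, 0.0, 100000.0, 100000.0],
    "Equity": [20.0, 0.0, np.inf, 0.0],
}, index=["normal", "unfunded", "equity_inf", "equity_zero"])

# Same formula as the lab: amount divided by the fraction of equity.
toy["Valuation"] = toy.Amount / (toy.Equity / 100)
print(toy)
```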

Calculate the total amount of money that each shark invested over the 6 seasons. (Note: If $n$ sharks funded a given venture, then the amount that each shark invested is the amount divided by $n$.)
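The equal-split rule in the note can be sketched on a small made-up table (the deals and the three shark columns below are illustrative, not the real data). Each 0/1 indicator is multiplied by that deal's per-shark share, and the results are summed by column.

```python
import pandas as pd

# Made-up deals (illustrative only): 1 means the shark took part, 0 means they did not.
toy = pd.DataFrame({
    "Amount": [100000.0, 300000.0, 0.0],
    "Cuban": [1, 1, 0],
    "Greiner": [0, 1, 0],
    "John": [0, 1, 0],
})
shark_cols = ["Cuban", "Greiner", "John"]

# n = number of sharks in each deal; deals with no sharks become NaN
# so that nothing is divided by zero.
n = toy[shark_cols].sum(axis=1)
share = toy["Amount"] / n.where(n > 0)

# Multiply each indicator by the per-shark share, then total by shark.
# sum() skips the NaN rows produced by unfunded deals.
totals = toy[shark_cols].mul(share, axis=0).sum()
print(totals)
```

The per-shark totals add up to the total amount invested, which is a quick sanity check on the split.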

In [21]:
shark_cols = ['Corcoran', 'Cuban', 'Greiner', 'Herjavec', 'John', "O'Leary", 'Harrington', 'Guest']
# Number of sharks who took part in each deal
data["Total"] = data[shark_cols].sum(axis=1)

# Each shark's total is the sum of Amount / Total over the deals they joined
for shark in shark_cols:
    invested = (data.loc[data[shark] == 1.0, "Amount"] / data.Total).sum()
    print("Across six seasons %s has invested $%f dollars" % (shark, invested))


Across six seasons Corcoran has invested $4912500.000000 dollars
Across six seasons Cuban has invested $17817500.000000 dollars
Across six seasons Greiner has invested $8170000.000000 dollars
Across six seasons Herjavec has invested $16297500.000000 dollars
Across six seasons John has invested $8154000.000000 dollars
Across six seasons O'Leary has invested $7952500.000000 dollars
Across six seasons Harrington has invested $800000.000000 dollars
Across six seasons Guest has invested $400000.000000 dollars


As you can see, Cuban and Herjavec invested far more than the other sharks, at about \$17.8 million and \$16.3 million respectively. Cuban invested the most.

What percentage of companies led by male entrepreneurs were funded? What percentage of companies led by female entrepreneurs were funded? Report 95% confidence intervals for each.
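One common choice for an interval like this is the normal-approximation (Wald) interval, $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$. The sketch below is a standalone version using only the standard library; the function name `wald_interval` and the counts passed to it are hypothetical, chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sqrt(p * (1 - p) / n)
    return p, p - margin, p + margin

# Hypothetical counts for illustration: 67 funded out of 125 pitches.
p, low, high = wald_interval(67, 125)
print("proportion %.3f, 95%% CI (%.3f, %.3f)" % (p, low, high))
```

The Wald interval is simple but can behave poorly for small samples or proportions near 0 or 1; other intervals (for example Wilson's) are sometimes preferred in those cases.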

In [22]:
from scipy.stats import norm

# Boolean masks for each group and for companies that received any money
female = data["Entrepreneur Gender"] == 'Female'
male = data["Entrepreneur Gender"] == 'Male'
funded = data.Amount != 0

# Proportion of each group that was funded
fp = female[funded].sum() / female.sum()
mp = male[funded].sum() / male.sum()
print("The percentage of companies led by females that were funded was %f " %fp)
print("The percentage of companies led by males that were funded was %f " %mp)

# 95% normal-approximation intervals: p +/- z * sqrt(p * (1 - p) / n)
z = norm.ppf(.975)
fse = np.sqrt(fp * (1 - fp) / female.sum())
mse = np.sqrt(mp * (1 - mp) / male.sum())
fpl, fph = fp - z * fse, fp + z * fse
mpl, mph = mp - z * mse, mp + z * mse

print("female confidence interval: ( %f , %f )" % (fpl ,fph))
print("male confidence interval: ( %f , %f )" % (mpl ,mph))



The percentage of companies led by females that were funded was 0.536000 
The percentage of companies led by males that were funded was 0.481356 
female confidence interval: ( 0.448575 , 0.623425 )
male confidence interval: ( 0.424339 , 0.538373 )


About 53.6% of companies led by female entrepreneurs were funded (95% confidence interval: 44.9% to 62.3%).
About 48.1% of companies led by male entrepreneurs were funded (95% confidence interval: 42.4% to 53.8%).


Find something else interesting in this data set. (Note: This question is deliberately open-ended. Extra credit is available if you come up with something really neat!)

According to [this blog post](http://jasoncochran.com/blog/8-things-you-didnt-know-about-shark-tank/), people who appear on Shark Tank have to give a percentage of their company to the producers: to be exact, 5% equity (or royalties). This raises the question: how much free equity (in dollars they didn't have to spend!) did the producers get in these companies? To investigate, we will use the valuation of each company to work out how much a 5% stake is worth. For companies that the sharks did not invest in, we will assume their valuation is the 1% quantile of the companies that had valid valuation calculations. Obviously, some companies that weren't invested in could go on to succeed or sell; in either case the producers would get money. For the companies that were invested in but had n/a for equity, we will set their valuation equal to the median of the non-missing valuations.

I'll also count the 5% as a loss and calculate how much those companies "lost" as a whole.
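The estimation plan described above can be sketched on a handful of made-up valuations (the numbers and the `funded` column are illustrative only). One detail to watch when selecting the known valuations: a comparison such as `x != np.nan` is always true, so `notna()` is the reliable filter.

```python
import numpy as np
import pandas as pd

# Made-up valuations (illustrative only). NaN marks a company whose valuation
# is unknown; `funded` marks whether the sharks invested at all.
toy = pd.DataFrame({
    "Valuation": [200000.0, 1000000.0, np.nan, np.nan, 400000.0],
    "funded": [True, True, True, False, True],
})

# Select the known valuations with notna().
known = toy.loc[toy["Valuation"].notna(), "Valuation"]
median = known.median()
low = known.quantile(0.01)

# Funded but unknown -> median; not funded -> 1% quantile.
est = toy["Valuation"].copy()
est.loc[toy["funded"] & est.isna()] = median
est.loc[~toy["funded"]] = low

# The producers' stake is 5% of the estimated total.
producers_share = 0.05 * est.sum()
print(est)
print(producers_share)
```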

In [32]:
notinvest = data.Amount == 0
# Companies with a known, nonzero valuation
invested = data[data.Valuation != 0]
invested = invested[invested.Valuation.notna()]  # `!= np.nan` is always True, so use notna()
median = invested.Valuation.median()
quantile = invested.Valuation.quantile(.01)
data.Valuation = data.Valuation.replace(np.inf, median) ## replacing inf with the median
data.Valuation = data.Valuation.replace(0.0, quantile) ## estimating the valuation of companies not invested in
Total_Val = data.Valuation.sum()
producers_net = Total_Val * .05

print("Total amount: %f" % producers_net)

losers = data[notinvest]
losses = losers.Valuation.sum() * .05
indv = losses / notinvest.sum()

print("Amount lost by participants who didn't receive investments: %f Individually: %f " % (losses, indv))


Total amount: 15809175.473921
Amount lost by participants who didn't receive investments: 567732.857143 Individually: 2307.857143 


Over the first 6 seasons of Shark Tank, if the producers had been a shark, they would have had to spend an estimated __\$15,809,175.47__ to gain the 5 percent equity that they __automatically__ get from each business that comes on the show. To put this into perspective, that is about \$2 million shy of the total invested on the show by the biggest investor, Mark Cuban.

Contestants who didn't get an investment "lost" around __\$2,307.86__ each just for appearing on the show, and as a group lost __\$567,732.86__.