Hi class, this evening we'll spend 40 minutes on hands-on work. Your task is to assume the role of an investor or venture capitalist analyzing the fundraising records of startups (data from TechCrunch). I'll reveal the dataset at the beginning of the class. Here are the questions to guide your analysis:

1. How many rows / columns do we have in the dataset?
2. How many startup fundings have there been in California compared with New York? (tip: value_counts())
3. Using describe(), what is the average size of funding (raised amount)?
4. Using describe(), what is the standard deviation in the size of funding (raised amount)?
5. What is the largest fundraising in the database? Which company raised it?

We'll do this together and add a few more techniques to our belt as revision. We will all start from a blank notebook and build up our analysis sequentially: import the libraries, call read_csv, and then work through the questions.


In [3]:
import pandas as pd
tc = pd.read_csv("data_input/techcrunch.csv", index_col=0)
tc.head()

permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
lifelock,LifeLock,,web,Tempe,AZ,1-May-07,6850000,USD,b
lifelock,LifeLock,,web,Tempe,AZ,1-Oct-06,6000000,USD,a
lifelock,LifeLock,,web,Tempe,AZ,1-Jan-08,25000000,USD,c
mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ,1-Jan-08,50000,USD,seed
flypaper,Flypaper,,web,Phoenix,AZ,1-Feb-08,3000000,USD,a


In [4]:
print(f'The dimension of the data is {tc.shape}')


The dimension of the data is (1460, 9)


In [5]:
print(f'The size of the data is {tc.size}')

The size of the data is 13140
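
The two outputs above are related: `shape` is a `(rows, columns)` tuple, while `size` is the total number of cells, which is why 1460 × 9 gives 13140. A minimal sketch on a made-up toy frame (the values below are illustrative and not taken from the TechCrunch data):

```python
import pandas as pd

# Toy frame with made-up values: 3 rows, 2 columns.
toy = pd.DataFrame({"company": ["A", "B", "C"], "raisedAmt": [1, 2, 3]})

n_rows, n_cols = toy.shape          # shape unpacks into (rows, columns)
print(n_rows, n_cols)               # 3 2
print(toy.size == n_rows * n_cols)  # True: size counts every cell
```

Note that the index (here `permalink`, because of `index_col=0`) is not counted as a column.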


In [6]:
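# Divide by 1,000,000 to read the amounts in millions; note that this also scales `count` (1460 shows up as 0.001460)
tc['raisedAmt'].describe()/1000000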

count      0.001460
mean      10.131488
std       18.661462
min        0.006000
25%        2.000000
50%        5.500000
75%       11.025000
max      300.000000
Name: raisedAmt, dtype: float64
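
Dividing the whole `describe()` result also divides `count`, which is why it prints as 0.001460 above. One way around this, sketched here on made-up amounts (not the real data), is to set `count` aside before scaling:

```python
import pandas as pd

# Made-up funding amounts, for illustration only.
raised = pd.Series([2_000_000, 5_000_000, 11_000_000], name="raisedAmt")

summary = raised.describe()
in_millions = summary.drop("count") / 1_000_000  # scale everything except count

print(summary["count"])     # 3.0, left unscaled
print(in_millions["mean"])  # 6.0, i.e. 6 million
```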

In [7]:
# Top 5 cities by number of funding rounds
tc.city.value_counts().head(5)

San Francisco    228
New York          93
Mountain View     89
Palo Alto         78
Seattle           75
Name: city, dtype: int64
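
The counts above are per city, whereas question 2 asks about California versus New York as states. A minimal sketch of the same technique applied to a `state` column, using a made-up toy frame (the counts below say nothing about the real dataset):

```python
import pandas as pd

# Made-up state codes, for illustration only.
toy = pd.DataFrame({"state": ["CA", "NY", "CA", "WA", "CA", "NY"]})

state_counts = toy["state"].value_counts()  # one count per distinct state
print(state_counts.loc[["CA", "NY"]])       # keep only the two states of interest
```

On the class dataset, the equivalent call would presumably be `tc['state'].value_counts()`.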

In [8]:
tc['raisedAmt'].mean()/1000000

10.1314875

In [9]:
tc['raisedAmt'].std()/1000000

18.66146188901684

In [10]:
tc.raisedAmt.max()

300000000

In [11]:
condition = tc.raisedAmt == tc.raisedAmt.max()
tc.loc[condition, :]

permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
facebook,Facebook,450.0,web,Palo Alto,CA,1-Oct-07,300000000,USD,c
zenimax,ZeniMax,,web,Rockville,MD,1-Oct-07,300000000,USD,a
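
The result above has two rows because the maximum is tied. That is worth keeping in mind when choosing a technique: `idxmax()` reports only the first label with the maximum, while a boolean mask (or `nlargest` with `keep="all"`) returns every tied row. A small sketch on made-up values:

```python
import pandas as pd

# Made-up amounts with a tie for the maximum, for illustration only.
toy = pd.DataFrame(
    {"company": ["A", "B", "C"], "raisedAmt": [300, 50, 300]},
    index=["a", "b", "c"],
)

first_only = toy["raisedAmt"].idxmax()                        # label of the first maximum only
all_ties = toy[toy["raisedAmt"] == toy["raisedAmt"].max()]    # every row equal to the maximum
same_ties = toy.nlargest(1, "raisedAmt", keep="all")          # also keeps ties

print(first_only)            # a
print(list(all_ties.index))  # ['a', 'c']
```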


1. Compare: `sum(tc.numEmps)` to `tc.numEmps.sum()`: what's the difference? Why? 
2. Use the square bracket indexing `dat['col_selected']` and compute a frequency table (`.value_counts`) on the `raisedCurrency` column. Are any currencies other than USD used in the fundraising dataset?
3. Perform a conditional subsetting (boolean indexing) using the syntax: `dat[cond1]`. Chain it with `.tail()` so as not to print the full returned result.
4. Create a condition, then use the condition to subset for rows where `company` is `Tesla Motors`. Pass this condition the way you did in (3) to perform the boolean indexing, but return only the following columns: `round`, `company`, `raisedCurrency` and `raisedAmt`. Use `.loc` so you can specify **column selection by label**.
5. Use `.iloc` to select the first 10 rows and only the first 5 columns in the DataFrame.
6. Go back to (4), on the resulting DataFrame, chain the `.sort_values('round')` method at the end to sort the data frame by the values in `round`.

In [12]:
sum(tc.numEmps)

nan

In [13]:
tc.numEmps.sum()

54274.0

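Python's built-in `sum(tc.numEmps)` adds the values one by one, and because `numEmps` contains missing values (NaN), the running total becomes `nan`. The pandas method `tc.numEmps.sum()` skips missing values by default (`skipna=True`), so it returns the total of the non-missing entries.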
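
The difference between the two calls comes down to how missing values are handled. A minimal sketch on a made-up Series (not the real `numEmps` column):

```python
import numpy as np
import pandas as pd

# Made-up employee counts with one missing value, for illustration only.
emps = pd.Series([7.0, np.nan, 105.0])

builtin_total = sum(emps)              # plain Python addition: NaN propagates
pandas_total = emps.sum()              # pandas skips NaN by default (skipna=True)
strict_total = emps.sum(skipna=False)  # opt out of skipping to get NaN again

print(builtin_total)  # nan
print(pandas_total)   # 112.0
print(strict_total)   # nan
```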


In [14]:
tc['raisedCurrency'].value_counts()

USD    1458
CAD       1
EUR       1
Name: raisedCurrency, dtype: int64
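
Once the frequency table shows that a few non-USD rows exist, a natural follow-up is to look at those rows directly with a "not equal" condition. A sketch on a made-up toy frame (company names and currencies below are illustrative):

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame(
    {"company": ["A", "B", "C"], "raisedCurrency": ["USD", "EUR", "USD"]}
)

non_usd = toy[toy["raisedCurrency"] != "USD"]  # keep rows in any other currency
print(non_usd)
```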

In [33]:
cond1 = (tc['round'] == 'seed')
tc[cond1].tail()

permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
loladex,Loladex,2.0,web,Leesburg,VA,1-Nov-07,350000,USD,seed
yapta,Yapta,,web,Seattle,WA,1-Jul-07,700000,USD,seed
eyejot,EyeJot,5.0,web,Seattle,WA,1-May-07,750000,USD,seed
rescuetime,RescueTime,,web,Seattle,WA,14-Oct-07,20000,USD,seed
delve-networks,Delve Networks,,web,Seattle,WA,1-Dec-06,1650000,USD,seed


In [34]:
cond1 = (tc['round'] == 'seed')
cond2 = (tc['city'] == 'Seattle')
tc[cond1 & cond2].tail()

permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round
yapta,Yapta,,web,Seattle,WA,1-Jul-07,700000,USD,seed
eyejot,EyeJot,5.0,web,Seattle,WA,1-May-07,750000,USD,seed
rescuetime,RescueTime,,web,Seattle,WA,14-Oct-07,20000,USD,seed
delve-networks,Delve Networks,,web,Seattle,WA,1-Dec-06,1650000,USD,seed
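
Conditions can be combined with `&` (and), `|` (or) and `~` (not). When the comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `==`. A small sketch on a made-up toy frame:

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame(
    {
        "round": ["seed", "a", "seed", "b"],
        "city": ["Seattle", "Seattle", "Tempe", "Tempe"],
    }
)

both = toy[(toy["round"] == "seed") & (toy["city"] == "Seattle")]        # and
either = toy[(toy["round"] == "seed") | (toy["city"] == "Seattle")]      # or
neither = toy[~((toy["round"] == "seed") | (toy["city"] == "Seattle"))]  # not (or)

print(len(both), len(either), len(neither))  # 1 3 1
```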


In [28]:
tc[(tc.company == 'Tesla Motors')].loc[:, ['round', 'company', 'raisedCurrency', 'raisedAmt']]


permalink,round,company,raisedCurrency,raisedAmt
tesla-motors,c,Tesla Motors,USD,40000000
tesla-motors,d,Tesla Motors,USD,45000000
tesla-motors,a,Tesla Motors,USD,7500000
tesla-motors,b,Tesla Motors,USD,13000000
tesla-motors,e,Tesla Motors,USD,40000000
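
The same subset can also be written as a single `.loc` call that takes the row condition and the column labels together, which avoids the intermediate DataFrame. A sketch on a made-up toy frame (the names `is_x` and `cols` are just illustrative):

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame(
    {
        "company": ["X", "Y", "X"],
        "round": ["b", "a", "a"],
        "raisedAmt": [2, 1, 3],
        "city": ["P", "Q", "P"],
    }
)

is_x = toy["company"] == "X"             # row condition (boolean Series)
cols = ["round", "company", "raisedAmt"]  # columns to keep, in display order

subset = toy.loc[is_x, cols]             # rows by condition, columns by label
print(subset)
```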


In [26]:
tc.iloc[0:10,0:5]

permalink,company,numEmps,category,city,state
lifelock,LifeLock,,web,Tempe,AZ
lifelock,LifeLock,,web,Tempe,AZ
lifelock,LifeLock,,web,Tempe,AZ
mycityfaces,MyCityFaces,7.0,web,Scottsdale,AZ
flypaper,Flypaper,,web,Phoenix,AZ
infusionsoft,Infusionsoft,105.0,software,Gilbert,AZ
gauto,gAuto,4.0,web,Scottsdale,AZ
chosenlist-com,ChosenList.com,5.0,web,Scottsdale,AZ
chosenlist-com,ChosenList.com,5.0,web,Scottsdale,AZ
digg,Digg,60.0,web,San Francisco,CA
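
One detail behind `0:10` returning exactly 10 rows: `.iloc` slices by position and excludes the end point, like regular Python slicing, whereas `.loc` slices by label and includes the end point. A sketch on a made-up toy frame:

```python
import pandas as pd

# Made-up frame with string labels on the index, for illustration only.
toy = pd.DataFrame(
    {"a": range(5), "b": range(5), "c": range(5)},
    index=list("vwxyz"),
)

by_position = toy.iloc[0:2, 0:2]      # positions 0 and 1 only (end excluded)
by_label = toy.loc["v":"w", "a":"b"]  # labels v..w and a..b (end included)

print(by_position.equals(by_label))   # True: both select the same 2 x 2 block
```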


In [35]:
tc[(tc.company == 'Tesla Motors')].loc[:, ['round', 'company', 'raisedCurrency', 'raisedAmt']].sort_values('round')

permalink,round,company,raisedCurrency,raisedAmt
tesla-motors,a,Tesla Motors,USD,7500000
tesla-motors,b,Tesla Motors,USD,13000000
tesla-motors,c,Tesla Motors,USD,40000000
tesla-motors,d,Tesla Motors,USD,45000000
tesla-motors,e,Tesla Motors,USD,40000000
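
`sort_values` also accepts a list of columns and a matching list of sort directions, which helps when the first column has repeated values. Since `round` holds strings, the ordering is alphabetical. A sketch on made-up values:

```python
import pandas as pd

# Made-up rounds and amounts, for illustration only.
toy = pd.DataFrame({"round": ["c", "a", "b", "a"], "raisedAmt": [40, 7, 13, 9]})

by_round = toy.sort_values("round")  # alphabetical by round
by_round_then_amt = toy.sort_values(
    ["round", "raisedAmt"], ascending=[True, False]
)  # round A-Z, then larger amounts first within each round

print(list(by_round_then_amt["raisedAmt"]))  # [9, 7, 13, 40]
```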
