Importing the Pandas library

In [None]:
import pandas as pd

Let's have a look at the Shark Tank India dataset. Shark Tank is a show where entrepreneurs pitch their ideas on live television to a group of investors, who then compete to invest money in these ideas. The first dataset lists the pitches made and the deal values at which they closed. All monetary values are in lakhs of rupees (1 lakh = 100,000).

In [None]:
df_pitches = pd.read_csv('shark_tank_india_pitches.csv')
df_pitches.head()

The second dataset breaks down each deal closure by the sharks (investors) who invested.

In [None]:
df_sharks = pd.read_csv('shark_tank_india_sharks.csv')
df_sharks.head()

We can see that the `pitch_number` column is common to both datasets. We can therefore merge the two DataFrames on this column.

In [None]:
df = df_pitches.merge(df_sharks, on = 'pitch_number')
df.head()
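
By default, `.merge()` performs an inner join, so only pitch numbers found in both DataFrames survive. The following is a minimal sketch on made-up data (the `toy_*` frames and their values are assumptions for illustration, not taken from the dataset) showing how the `how` and `indicator` arguments can reveal rows that fail to match.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are invented for illustration)
toy_pitches = pd.DataFrame({'pitch_number': [1, 2, 3], 'deal_amount': [50, 0, 75]})
toy_sharks = pd.DataFrame({'pitch_number': [1, 3, 4], 'aman_deal': [1, 0, 1]})

# Default inner merge: keeps only pitch numbers present in both frames
inner = toy_pitches.merge(toy_sharks, on='pitch_number')

# Outer merge: keeps every pitch number; indicator=True adds a `_merge` column
# that records whether each row came from the left frame, the right frame, or both
outer = toy_pitches.merge(toy_sharks, on='pitch_number', how='outer', indicator=True)
```

If the two real datasets cover exactly the same pitches, the inner and outer merges should have the same number of rows.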

We could also achieve the same result using the `.join()` method. However, because `.join()` matches rows on the index by default, this would require us to explicitly set our `pitch_number` column as the index.

In [None]:
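# Left commented out: without setting the index, .join() aligns on the default row index,
# and the shared 'pitch_number' column raises a ValueError (overlapping columns, no suffix).
# df2 = df_pitches.join(df_sharks)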

In [None]:
df_pitches_2 = df_pitches.set_index('pitch_number')
df_sharks_2 = df_sharks.set_index('pitch_number')
df2 = df_pitches_2.join(df_sharks_2)
df2.head()
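
As an aside, `.join()` also accepts an `on` argument, in which case only the right-hand DataFrame needs `pitch_number` as its index. A minimal sketch on made-up data (the `toy_*` frames are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are invented for illustration)
toy_pitches = pd.DataFrame({'pitch_number': [1, 2, 3], 'deal_amount': [50, 0, 75]})
toy_sharks = pd.DataFrame({'pitch_number': [1, 2, 3], 'aman_deal': [1, 0, 1]})

# Only the right-hand frame is indexed by the key;
# the left frame keeps 'pitch_number' as an ordinary column
joined = toy_pitches.join(toy_sharks.set_index('pitch_number'), on='pitch_number')

# The equivalent merge, for comparison
merged = toy_pitches.merge(toy_sharks, on='pitch_number')
```

Note that `.join()` defaults to a left join while `.merge()` defaults to an inner join, so the two only agree when every key in the left frame has a match in the right frame, as in this toy data.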

Are our two merged DataFrames equal?

In [None]:
df.equals(df2)

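The comparison fails because `pitch_number` is the index of `df2` but an ordinary column of `df`. Let's reset the index so that `pitch_number` becomes a column again.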

In [None]:
df2.reset_index(inplace = True)

In [None]:
df.equals(df2)

Looks like the column order has changed: `.reset_index()` places `pitch_number` in the first position. Let's swap the first two columns so that the order matches `df`.

In [None]:
cols = list(df2.columns)
cols[0], cols[1] = cols[1], cols[0]
df2 = df2[cols]
df2.head()

In [None]:
df.equals(df2)
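
`.equals()` only returns `True` or `False`, with no hint about what differs. As a hedged alternative, `pd.testing.assert_frame_equal` raises an error describing the first mismatch, and its `check_like=True` option ignores the order of rows and columns. A minimal sketch on made-up data (the frames `a` and `b` are assumptions for illustration):

```python
import pandas as pd

a = pd.DataFrame({'pitch_number': [1, 2], 'deal_amount': [50, 0]})
b = a[['deal_amount', 'pitch_number']]  # same data, columns reordered

# .equals() is strict about column order
strict_equal = a.equals(b)

# check_like=True ignores the order of index and columns;
# an AssertionError is raised if the frames differ in any other way
pd.testing.assert_frame_equal(a, b, check_like=True)

# Reordering b's columns to match a makes the strict comparison pass
same_after_reorder = a.equals(b[a.columns])
```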

Generally, `.merge()` tends to be far more flexible than `.join()`, so we will stick with `.merge()` from here on. Anyway, let's return to our DataFrame.

In [None]:
df.info()

In [None]:
df.describe()



How could we calculate the average difference between the pitcher ask amount and the deal amount?

In [None]:
df['ask_deal_diff'] = df['pitcher_ask_amount'] - df['deal_amount']  # making a new column with the difference
df['ask_deal_diff'].mean()

In [None]:
df['ask_deal_diff'].describe()

This seems like a very high figure. Let's see whether potential outliers in our dataset are causing the difference between ask and deal amounts to be so high. While we could manually inspect the data points in this case, for larger datasets it would be preferable to employ a statistical method. Let's try to implement one such method.

The interquartile range, or IQR, is the range in which the middle 50% of the data falls. It is the difference between the third and first quartiles of the data, $IQR = Q3 - Q1$. The first quartile (Q1) is the value below which 25% of our data lies; the third quartile (Q3) is the value below which 75% of our data lies. We can calculate these using the `.quantile()` method.

To remove potential outliers using the IQR method, we remove the data that is outside the following boundaries:

*   $Lower = Q1 - 1.5 \times IQR$
*   $Upper = Q3 + 1.5 \times IQR$

**Note:** In Pandas, `|` is used instead of `or` to perform the OR operation, because it works element-wise on whole Series rather than on single booleans (likewise, `&` replaces `and`). Furthermore, parentheses are necessary around each comparison when chaining multiple logical operations, because `|` and `&` bind more tightly than comparison operators such as `<` and `>`.
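
To illustrate the note, here is a small sketch on a made-up Series (the values are assumptions for illustration) contrasting element-wise `|` with Python's `or`:

```python
import pandas as pd

s = pd.Series([-10, 5, 20, 100])

# Element-wise OR: each comparison yields a boolean Series, combined with |
mask = (s < 0) | (s > 50)
selected = s[mask]

# Python's `or` needs a single truth value, which a multi-element Series does not have
try:
    (s < 0) or (s > 50)
    raised = False
except ValueError:
    raised = True
```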


In [None]:
# Finding the first (Q1) and third (Q3) quartiles
Q1 = df['ask_deal_diff'].quantile(0.25)  # 25% of our observations lie below this figure
Q3 = df['ask_deal_diff'].quantile(0.75)  # 75% of our observations lie below this figure
IQR = Q3 - Q1  # interquartile range (difference between third and first quartiles)

# Define the bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Finding the outliers
df_outliers = df[(df['ask_deal_diff'] < lower_bound) | (df['ask_deal_diff'] > upper_bound)]
df_outliers
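
The same steps can be wrapped in a small reusable helper. This is a sketch only: the function name `iqr_bounds`, its parameter `k`, and the toy values are assumptions introduced here for illustration.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) outlier bounds for a numeric Series using the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up differences with one obviously extreme value
toy_diff = pd.Series([0, 5, 10, 15, 20, 500])
lower, upper = iqr_bounds(toy_diff)

# .between() is inclusive on both ends by default, so ~ keeps values strictly outside the bounds
toy_outliers = toy_diff[~toy_diff.between(lower, upper)]
```

For this toy Series, Q1 is 6.25 and Q3 is 18.75, giving bounds of -12.5 and 37.5, so only the value 500 is flagged.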

As none of these deals were closed, we can choose to remove them from our dataset.

In [None]:
df_no_outliers = df[(df['ask_deal_diff'] >= lower_bound) & (df['ask_deal_diff'] <= upper_bound)]
df_no_outliers['ask_deal_diff'].mean()  # finding the mean difference between ask and deal amounts again

This feels like a reasonable figure. Now, let's find the participation rate of each shark, that is, the share of pitches in which the shark invested. We return to the original DataFrame here: the outliers we removed were defined by the difference between ask and deal amounts, and we still want to count those pitches as cases where sharks did not participate in a deal. Let's output the participation rates as percentages in a dictionary.

In [None]:
# Let's identify the sharks who participated in closing a deal
sharks = ['ashneer_deal', 'anupam_deal', 'aman_deal', 'namita_deal', 'vineeta_deal', 'peyush_deal', 'ghazal_deal']
participation_rates = {shark.split('_')[0].capitalize(): f'{df[shark].mean() * 100:.2f}%' for shark in sharks}
participation_rates
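
This works because the mean of a column of 0s and 1s is the fraction of 1s. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration) shows the same calculation done for all columns at once:

```python
import pandas as pd

# Invented 0/1 deal flags for four pitches
toy = pd.DataFrame({'aman_deal': [1, 0, 1, 0], 'namita_deal': [0, 0, 1, 0]})

# Column-wise mean of the flags, expressed as a percentage
rates = toy.mean() * 100

# Same dictionary format as above: shark name mapped to a percentage string
toy_rates = {col.split('_')[0].capitalize(): f'{rate:.2f}%' for col, rate in rates.items()}
```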

Thus, we can see that Aman and Peyush participated in the most deals, while Ghazal participated in the fewest. However, this might not be a fair assessment, as we haven't taken into account whether a shark was present for the episode in which a pitch was made. Luckily, we have columns to account for this.

In [None]:
df.columns

In [None]:
shark_presence = ['ashneer_present', 'anupam_present', 'aman_present', 'namita_present', 'vineeta_present', 'peyush_present', 'ghazal_present']

# Dictionary to store participation rates while taking presence into account
adjusted_participation_rates = {}

# As the shark deals and presence columns line up exactly, we can zip them together to perform the following iteration
for deal, present in zip(sharks, shark_presence):
    present_pitches = df[df[present] == 1]
    participation_rate = present_pitches[deal].mean() * 100
    adjusted_participation_rates[deal.split('_')[0].capitalize()] = f'{participation_rate:.2f}%'

adjusted_participation_rates
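
To see why the adjustment matters, consider a small made-up example (the `toy` frame and its values are assumptions for illustration) in which one shark attends every pitch and another attends only half of them:

```python
import pandas as pd

# Invented presence and deal flags for four pitches
toy = pd.DataFrame({
    'aman_present': [1, 1, 1, 1],
    'aman_deal': [1, 0, 1, 0],
    'ghazal_present': [0, 0, 1, 1],
    'ghazal_deal': [0, 0, 1, 0],
})

raw_rates = {}
adjusted_rates = {}
for name in ['aman', 'ghazal']:
    # Raw rate: deals as a share of all pitches
    raw_rates[name] = toy[f'{name}_deal'].mean() * 100
    # Adjusted rate: deals as a share of the pitches the shark actually attended
    attended = toy[toy[f'{name}_present'] == 1]
    adjusted_rates[name] = attended[f'{name}_deal'].mean() * 100
```

Here the shark who attended only two of four pitches has a raw rate of 25% but an adjusted rate of 50%, matching the shark who attended all four.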

After accounting for shark presence, we can see that the participation rates are far more even across the board. Peyush and Aman are the sharks with the highest participation, while Anupam and Namita are the sharks with the lowest.