
<h1 align=center><font size = 5>Introduction to Statistics With Python   </font></h1>

# Table of Contents


<div class="alert alert-block alert-info" style="margin-top: 20px">
<li><a href="#ref0">Using summary statistics to better understand Indian startup funding  </a></li>




</div>



For more data science and statistics, check out <a href="https://cognitiveclass.ai/courses/data-analysis-python">Data Analysis with Python</a> for free!

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pylab 
import scipy.stats as stats
import matplotlib.mlab as mlab
from scipy.stats import norm
%matplotlib inline

The following function plots a normal curve and shades the area under it between two points:

In [None]:
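def Plot_Normal(mu,sigma,x1,x2,c='b'):
    # Evaluate the normal pdf on a grid spanning four standard deviations each side of the mean
    x = np.linspace(mu - 4*sigma, mu + 4*sigma, 100)
    # mlab.normpdf was removed from Matplotlib; scipy.stats.norm.pdf computes the same density
    y = norm.pdf(x, mu, sigma)
    plt.plot(x, y, c)

    # Shade the area under the curve between x1 and x2 (alpha must be a number, not a string)
    inside = (x > x1) & (x < x2)
    plt.fill_between(x, 0, y, where=inside, color='grey', alpha=0.5)
    plt.show()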
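
The shaded region drawn by `Plot_Normal` corresponds to the probability that a normally distributed value falls between `x1` and `x2`. As a minimal sketch, that probability can be computed from the cumulative distribution function. The helper name `normal_area` is hypothetical and not part of the original notebook.

```python
from scipy.stats import norm

def normal_area(mu, sigma, x1, x2):
    """Probability that X ~ N(mu, sigma^2) lies between x1 and x2 (hypothetical helper)."""
    return norm.cdf(x2, loc=mu, scale=sigma) - norm.cdf(x1, loc=mu, scale=sigma)

# Area within one standard deviation of the mean of a standard normal
area = normal_area(0, 1, -1, 1)
print(round(area, 4))  # 0.6827
```

This is the familiar "68%" of the 68-95-99.7 rule.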

## <a id="ref0"></a> Using summary statistics to better understand Indian startup funding 

Large datasets are difficult to inspect row by row, so in this notebook we will use statistical methods to better understand the data. This dataset provides information about 2372 Indian startups from January 2015 to August 2017. The dataset includes columns with the date funded, the city the startup is based in, the names of the funders, the amount invested (in USD) and other information. The dataset is from Sudalai Rajkumar.


Let's load the data using the function **read_csv**. The first column of the dataset contains an index, so we set the parameter **index_col=0**.


In [None]:
df=pd.read_csv('https://raw.githubusercontent.com/jsantarc/ADMN5016_2022/master/data/startup_funding.csv',index_col=0)

In [None]:
df.head()

We can find the shape of the dataset; there are 2372 samples and nine columns.

In [None]:
df.shape

We can find the type of data for each column using the attribute **dtypes**.

In [None]:
df.dtypes

The column **AmountInUSD** is of type object, so we must convert it into a numeric type. If we convert the type using the following command, we will get an error because the dollar amounts contain commas as thousands separators.

In [None]:
#df['AmountInUSD']=df['AmountInUSD'].astype(float)


We use the **str.replace(',','')** method to remove the commas (each comma is replaced with an empty string), then cast the values to float.

In [None]:
df['AmountInUSD']=df['AmountInUSD'].str.replace(',','').astype(float)
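
To see the conversion in isolation, here is a small sketch on made-up values formatted like the **AmountInUSD** column. The numbers are illustrative and do not come from the dataset. The last two lines show `pd.to_numeric` with `errors='coerce'`, an alternative not used in this notebook, which turns values that cannot be parsed into NaN instead of raising an error.

```python
import pandas as pd

# Made-up amounts with thousands separators, similar in format to AmountInUSD
amounts = pd.Series(['1,300,000', '500,000', '2,000'])

# Remove the commas, then cast the strings to float
cleaned = amounts.str.replace(',', '', regex=False).astype(float)
print(cleaned.tolist())

# Alternative (an assumption, not the notebook's approach): coerce bad values to NaN
messy = pd.Series(['1300000', 'undisclosed'])
coerced = pd.to_numeric(messy, errors='coerce')
print(coerced.isnull().tolist())
```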


 We can see the type of **AmountInUSD** is now **float64**:

In [None]:
df.dtypes

 We can count the number of null values in each column as follows:

In [None]:
# isnull() flags missing entries; sum() counts them per column
df.isnull().sum()


As our primary concern is how factors affect the amount invested, we will drop all rows that do not contain data about the amount invested.

In [None]:
df=df[df['AmountInUSD'].notnull()]
df.head()
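
The counting and filtering steps can be checked on a toy dataframe. This is a minimal sketch with made-up rows; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: two of the four rows have no investment amount
toy = pd.DataFrame({
    'StartupName': ['A', 'B', 'C', 'D'],
    'AmountInUSD': [1000.0, np.nan, 250.0, np.nan],
})

# Count missing values per column
null_counts = toy.isnull().sum()

# Keep only the rows where the amount is present
kept = toy[toy['AmountInUSD'].notnull()]
rows_lost = len(toy) - len(kept)

print(null_counts['AmountInUSD'], rows_lost)
```

The number of rows lost equals the null count of the column used for filtering, which is a useful sanity check after dropping rows.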

If we check the shape of the dataframe, we see we lost 847 rows.

In [None]:
df.shape

 We can plot the investments over time:

In [None]:
# Let pandas create the figure; a separate plt.figure() call would leave an empty figure behind
df.plot(x='Date', y='AmountInUSD', figsize=(12,6))
plt.show()

We can calculate the mean:

In [None]:
df['AmountInUSD'].mean()

 We can calculate the median:

In [None]:
df['AmountInUSD'].median()

The mean is much larger than the median, which suggests the distribution is skewed to the right: a small number of very large investments pull the mean up. The standard deviation also suggests the data is spread out:

In [None]:
df['AmountInUSD'].std()
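
A small made-up example illustrates why a gap between mean and median points to extreme values. The two series below are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Two toy samples that differ only in their largest value
typical = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = pd.Series([1.0, 2.0, 3.0, 4.0, 500.0])

# The median ignores how large the biggest value is; the mean does not
print(typical.mean(), typical.median())            # 3.0 3.0
print(with_outlier.mean(), with_outlier.median())  # 102.0 3.0

# The standard deviation is inflated by the outlier as well
print(typical.std() < with_outlier.std())          # True
```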

 We can use the method **describe()** to view more summary statistics; it's interesting to study the quartiles:  

In [None]:
df.describe()
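
The quartiles reported by `describe()` can also be used to flag unusually large values. The sketch below applies the common 1.5 × IQR rule to made-up numbers. This rule is an illustration and is not a step in the original notebook.

```python
import pandas as pd

# Toy sample with one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# First and third quartiles (the 25% and 75% rows of describe())
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Values above the upper fence are commonly treated as outliers
upper_fence = q3 + 1.5 * iqr
outliers = values[values > upper_fence]

print(q1, q3, iqr, upper_fence)  # 3.0 7.0 4.0 13.0
print(outliers.tolist())         # [100.0]
```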

A box plot is not very useful here, because the extreme values stretch the axis and compress the box:

In [None]:
df.boxplot('AmountInUSD')
plt.show()

 Examining the histogram, we can see the data is definitely not normally distributed:

In [None]:
df['AmountInUSD'].hist(bins=100)
plt.xlabel('Amount in USD')
plt.show()

 We can verify using a Q–Q plot (quantile-quantile plot):

In [None]:
stats.probplot(df['AmountInUSD'], dist="norm", plot=pylab)
pylab.show()

As expected, the points do not fall on a straight line, which indicates the data is not normally distributed.
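
The straight-line check can also be expressed as a number. When `plot` is omitted, `stats.probplot` returns the quantile pairs together with the slope, intercept and correlation coefficient `r` of the fitted line. The sketch below compares `r` for a simulated normal sample and a simulated right-skewed sample. Both samples are synthetic assumptions, not the funding data.

```python
import numpy as np
import scipy.stats as stats

# Synthetic samples: one normal, one heavily right-skewed (lognormal)
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.lognormal(mean=0.0, sigma=2.0, size=500)

# probplot returns ((theoretical, ordered), (slope, intercept, r))
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

# r close to 1 means the points lie close to a straight line
print(r_normal > r_skewed)
```

An `r` value close to 1 is consistent with normality, while a clearly lower value matches the curved pattern seen in a Q-Q plot of skewed data.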

Let's look at the categorical variables:

In [None]:
df.drop(labels=['AmountInUSD','Date'],axis=1).describe(include='all')

 Let's look at the top values in the column **IndustryVertical**:

In [None]:
df['IndustryVertical'].value_counts()[0:10].plot(kind='bar')
plt.show()

Let's look at the average investment in each city:

In [None]:
df.groupby(['CityLocation'])[['AmountInUSD']].mean().sort_values('AmountInUSD',ascending=False)[0:15].plot(kind='bar')


We can examine how the average investment, and its standard deviation, varies with each of the other categorical columns:

In [None]:
for name in list(df)[2:-2]:
    # Ten categories with the highest mean investment
    df.groupby(name)[['AmountInUSD']].mean().sort_values('AmountInUSD',ascending=False)[0:10].plot(kind='bar')
    # Ten categories with the highest standard deviation of investment
    df.groupby(name)[['AmountInUSD']].std().sort_values('AmountInUSD',ascending=False)[0:10].plot(kind='bar',color='red')
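
When reading grouped averages, it helps to look at the group sizes too, since a category with a single large investment can top the ranking. The sketch below uses made-up rows, with only the column names borrowed from the dataset, and computes the mean, standard deviation and count per group in one call.

```python
import pandas as pd

# Toy data: city Y has a single, very large investment
toy = pd.DataFrame({
    'CityLocation': ['X', 'X', 'X', 'Y'],
    'AmountInUSD': [100.0, 200.0, 300.0, 5000.0],
})

# Mean, standard deviation and number of rows per city, largest mean first
summary = (toy.groupby('CityLocation')['AmountInUSD']
              .agg(['mean', 'std', 'count'])
              .sort_values('mean', ascending=False))
print(summary)
```

City Y ranks first by mean, but its count is 1 and its standard deviation is NaN because a sample standard deviation needs at least two values.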

### About the Authors:  

[Joseph Santarcangelo](https://www.linkedin.com/in/joseph-s-50398b136/) has a PhD in Electrical Engineering. His research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.

Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/). This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).