# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [27]:
#import numpy and pandas
import pandas as pd
import numpy as np

#...............

# Challenge 1 - The `stats` Submodule

This submodule contains statistical functions for conducting hypothesis tests, producing various distributions and other useful tools. Let's examine this submodule using the KickStarter dataset. We will load the dataset below.

In [28]:
# Run this code:

kickstarter = pd.read_csv('../ks-projects-201801.csv')

Now print the `head` function to examine the dataset.

In [29]:
# Your code here:
print(kickstarter.head(10))
kickstarter.shape


           ID                                               name  \
0  1000002330                    The Songs of Adelaide & Abullah   
1  1000003930      Greeting From Earth: ZGAC Arts Capsule For ET   
2  1000004038                                     Where is Hank?   
3  1000007540  ToshiCapital Rekordz Needs Help to Complete Album   
4  1000011046  Community Film Project: The Art of Neighborhoo...   
5  1000014025                               Monarch Espresso Bar   
6  1000023410  Support Solar Roasted Coffee & Green Energy!  ...   
7  1000030581  Chaser Strips. Our Strips make Shots their B*tch!   
8  1000034518  SPIN - Premium Retractable In-Ear Headphones w...   
9   100004195  STUDIO IN THE SKY - A Documentary Feature Film...   

         category main_category currency    deadline      goal  \
0          Poetry    Publishing      GBP  2015-10-09    1000.0   
1  Narrative Film  Film & Video      USD  2017-11-01   30000.0   
2  Narrative Film  Film & Video      USD  2013-02-26 

(378661, 15)

Import the `mode` function from `scipy.stats` and find the mode of the `country` and `currency` column.

In [30]:
# Your code here:
from scipy import stats
from scipy.stats import mode

countrymode= stats.mode(kickstarter['country'])
print('$ Country mode: ', countrymode)

currencymode= stats.mode(kickstarter['currency'])
print('$ Currency mode: ', currencymode)



$ Country mode:  ModeResult(mode=array(['US'], dtype=object), count=array([292627]))
$ Currency mode:  ModeResult(mode=array(['USD'], dtype=object), count=array([295365]))


The trimmed mean is a function that computes the mean of the data with observations removed. The most common way to compute a trimmed mean is by specifying a percentage and then removing elements from both ends. However, we can also specify a threshold on both ends. The goal of this function is to create a more robust method of computing the mean that is less influenced by outliers. SciPy contains a function called `tmean` for computing the trimmed mean. 

In the cell below, import the `tmean` function and then find the 75th percentile of the `goal` column. Compute the trimmed mean between 0 and the 75th percentile of the column. Read more about the `tmean` function [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.tmean.html#scipy.stats.tmean).

In [32]:
# Your code here:
from scipy.stats import tmean
percent75= np.percentile(kickstarter['goal'],75)
goal_tmean = stats.tmean(kickstarter['goal'], (0, percent75))
print('$ Goal percent 75 on dataset: ', percent75)
print('$ Goal Mean on dataset: ', goal_tmean)

$ Goal percent 75 on dataset:  16000.0
$ Goal Mean on dataset:  4874.150287106898


#### SciPy contains various statistical tests. One of the tests is Fisher's exact test. This test is used for contingency tables. 

The test originates from the "Lady Tasting Tea" experiment. In 1935, Fisher published the results of the experiment in his book. The experiment was based on a claim by Muriel Bristol that she can taste whether tea or milk was first poured into the cup. Fisher devised this test to disprove her claim. The null hypothesis is that the treatments do not affect outcomes, while the alternative hypothesis is that the treatment does affect outcome. To read more about Fisher's exact test, click [here](https://en.wikipedia.org/wiki/Fisher%27s_exact_test).

Let's perform Fisher's exact test on our KickStarter data. We intend to test the hypothesis that the choice of currency has an impact on meeting the pledge goal. We'll start by creating two derived columns in our dataframe. The first will contain 1 if the amount of money in `usd_pledged_real` is greater than the amount of money in `usd_goal_real`. We can compute this by using the `np.where` function. If the amount in one column is greater than the other, enter a value of 1, otherwise enter a value of zero. Add this column to the dataframe and name it `goal_met`.

In [35]:
# Your code here:
goal_met_ext= np.where(kickstarter['usd_pledged_real'] > kickstarter['usd_goal_real'],1,0)
kickstarter['goal_met'] = goal_met_ext

#print(goal_met_ext)
#df['My new column'] = 'default value'
#kickstarter=kickstarter.join(pd.DataFrame(goal_met_ext), rsuffix='_')

print(kickstarter.head(5))

           ID                                               name  \
0  1000002330                    The Songs of Adelaide & Abullah   
1  1000003930      Greeting From Earth: ZGAC Arts Capsule For ET   
2  1000004038                                     Where is Hank?   
3  1000007540  ToshiCapital Rekordz Needs Help to Complete Album   
4  1000011046  Community Film Project: The Art of Neighborhoo...   

         category main_category currency    deadline     goal  \
0          Poetry    Publishing      GBP  2015-10-09   1000.0   
1  Narrative Film  Film & Video      USD  2017-11-01  30000.0   
2  Narrative Film  Film & Video      USD  2013-02-26  45000.0   
3           Music         Music      USD  2012-04-16   5000.0   
4    Film & Video  Film & Video      USD  2015-08-29  19500.0   

              launched  pledged     state  backers country  usd pledged  \
0  2015-08-11 12:12:28      0.0    failed        0      GB          0.0   
1  2017-09-02 04:43:57   2421.0    failed       15

Next, create a column that checks whether the currency of the project is in US Dollars. Create a column called `usd` using the `np.where` function where if the currency is US Dollars, assign a value of 1 to the row and 0 otherwise.

In [36]:
# Your code here:
chusd= np.where(kickstarter['currency'] == 'USD',1,0)
kickstarter['usd'] = goal_met_ext
print(kickstarter.head(5))


           ID                                               name  \
0  1000002330                    The Songs of Adelaide & Abullah   
1  1000003930      Greeting From Earth: ZGAC Arts Capsule For ET   
2  1000004038                                     Where is Hank?   
3  1000007540  ToshiCapital Rekordz Needs Help to Complete Album   
4  1000011046  Community Film Project: The Art of Neighborhoo...   

         category main_category currency    deadline     goal  \
0          Poetry    Publishing      GBP  2015-10-09   1000.0   
1  Narrative Film  Film & Video      USD  2017-11-01  30000.0   
2  Narrative Film  Film & Video      USD  2013-02-26  45000.0   
3           Music         Music      USD  2012-04-16   5000.0   
4    Film & Video  Film & Video      USD  2015-08-29  19500.0   

              launched  pledged     state  backers country  usd pledged  \
0  2015-08-11 12:12:28      0.0    failed        0      GB          0.0   
1  2017-09-02 04:43:57   2421.0    failed       15

Now create a contingency table using the `pd.crosstab` function in the cell below to compare the `goal_met` and `usd` columns.

Import the `fisher_exact` function from `scipy.stats` and conduct the hypothesis test on the contingency table that you have generated above. You can read more about the `fisher_exact` function [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html#scipy.stats.fisher_exact). The output of the function should be the odds ratio and the p-value. The p-value will provide you with the outcome of the test.

In [41]:
# Your code here:
import scipy.stats as stats
newtable= pd.crosstab(kickstarter['goal_met'], kickstarter['usd'])

oddsratio, pvalue = stats.fisher_exact(newtable)


print('Odds ratio: ', oddsratio)
print('Pvalue: ', pvalue)


Odds ratio:  inf
Pvalue:  2e-323


# Challenge 2 - The `linalg` submodule

This submodule allows us to perform various linear algebra calculations. 

Using the solve function, find the solution of the equation system 5x + 2y = 3 and 3x + y = 2 in the cell below.

In [43]:
# Your code here:
from scipy.linalg import solve
arr1=np.array([[5,2],[3,1]])
arr2=np.array([3,2])
x=solve(arr1,arr2)
x



array([ 1., -1.])

# Challenge 3 - The `interpolate` submodule

This submodule allows us to interpolate between two points and create a continuous distribution based on the observed data.

In the cell below, import the `interp1d` function and first take a sample of 10 rows from `kickstarter`. 

In [47]:
# Your code here:
from scipy.interpolate import interp1d
kickstarter.sample(n=10)


Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real,goal_met,usd
208198,2061050514,10kW residential solar system,Technology,Technology,CAD,2015-04-26,45000.0,2015-03-27 22:52:37,0.0,failed,0,CA,0.0,0.0,37437.6,0,0
361922,91420182,Big Hearts Fund!,Product Design,Design,USD,2010-08-01,8000.0,2010-06-30 23:53:27,8370.0,successful,70,US,8370.0,8370.0,8000.0,1,1
372933,970427522,Jolly Jail,Video Games,Games,USD,2014-02-17,10000.0,2014-01-18 08:47:13,236.0,failed,17,US,236.0,236.0,10000.0,0,0
56286,1286400600,Hexglo | Bottle + Light in One,Product Design,Design,HKD,2017-03-01,80000.0,2017-01-25 05:00:48,85515.0,successful,76,HK,1257.31,11013.73,10303.44,1,1
184506,1939605743,Rowdy Rogers: The United Republics of Piracy F...,Art,Art,USD,2010-12-05,2200.0,2010-11-05 09:13:14,5.0,failed,1,US,5.0,5.0,2200.0,0,0
118364,1601375640,The Geeks Of Retro,Comedy,Film & Video,USD,2014-06-27,25000.0,2014-05-28 23:17:50,160.0,failed,3,US,160.0,160.0,25000.0,0,0
283024,510197195,MOXIE (Queer Narrative Short),Film & Video,Film & Video,USD,2017-03-02,5000.0,2017-02-08 19:42:25,1380.0,failed,13,US,630.0,1380.0,5000.0,0,0
320749,70332964,A New Beginning,Fiction,Publishing,USD,2017-04-30,6620.0,2017-03-31 23:15:42,0.0,canceled,0,US,0.0,0.0,6620.0,0,0
73484,1374329002,Puppet Antics - A range of videos involving pu...,Film & Video,Film & Video,GBP,2017-09-05,3200.0,2017-08-06 22:40:53,156.0,failed,8,GB,121.28,206.14,4228.44,0,0
94937,1482543389,2017 Purpose Planner: Make your vision a reality,Product Design,Design,USD,2016-12-29,10000.0,2016-12-13 01:00:07,1207.0,failed,32,US,159.0,1207.0,10000.0,0,0


Next, create a linear interpolation of the backers as a function of `usd_pledged_real`. Create a function `f` that generates a linear interpolation of backers as predicted by the amount of real pledged dollars.

In [48]:
# Your code here:
x = np.arange(kickstarter["usd_pledged_real"])
y = np.cos(x)

# We generate an interpolation function and then apply that function to the new data and plot
f = interp1d(x, y)
xnew = np.arange(0, 10.25, 0.25)
plt.plot(xnew, f(xnew))



ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Now create a new variable called `x_new`. This variable will contain all integers between the minimum number of backers in our sample and the maximum number of backers. The goal here is to take the dataset that contains few obeservations due to sampling and fill all observations with a value using the interpolation function. 

Hint: one option is the `np.arange` function.

In [None]:
# Your code here:



Plot function f for all values of `x_new`. Run the code below.

In [None]:
# Run this code:

%matplotlib inline
import matplotlib.pyplot as plt

plt.plot(x_new, f(x_new))

Next create a function that will generate a cubic interpolation function. Name the function `g`

In [None]:
# Your code here:



In [None]:
# Run this code:

plt.plot(x_new, g(x_new))

# Bonus Challenge - The Binomial Distribution

The binomial distribution allows us to calculate the probability of k successes in n trials for a random variable with two possible outcomes (which we typically label success and failure).  

The probability of success is typically denoted by p and the probability of failure is denoted by 1-p.

The `scipy.stats` submodule contains a `binom` function for computing the probabilites of a random variable with the binomial distribution. You may read more about the binomial distribution [here](https://en.wikipedia.org/wiki/Binomial_distribution) and about the `binom` function [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html#scipy.stats.binom).

In the cell below, compute the probability that a die lands on 5 exactly 3 times in 8 tries.

Hint: the probability of rolling a 5 is 1/6.

In [None]:
# Your code here:

