# Descriptive Statistics in Python Exercise - Module 1

In this exercise we will use a dataset related to a collection of individual fundraising campaigns created via the [GoFundMe](https://gofundme.com) website. The data comes from a [project on Github](https://github.com/lmeninato/GoFundMe/) which collected information about GoFundMe projects in 2018.

You will apply your knowledge of descriptive statistics and skills from the data wrangling course to summarize information about specific categories of projects. I've stubbed out a series of steps below. I will describe each task and leave an open code block for you to complete the task. Please use text blocks to summarize your analysis. Use your own knowledge and the [Module 1 example descriptive stats notebook](https://github.com/digitalshawn/STC551/blob/main/Module%201/Descriptive%20Stats%20Example.ipynb) as a guide, but you may use other techniques to answer the prompts.

## What to submit via Canvas

Download a copy of your completed notebook from Google Colab (File --> Download --> Download .ipynb) and upload it to Canvas for this assignment. Please make sure that you run all code blocks so I can see the output when I open the notebook file.

## Help! I have questions!

You may email me with questions or ask to set up a Zoom meeting so we can look at your code together. You may also use the Canvas discussion board to ask questions and share tips. While I ask that you do not collaborate on answers, you may discuss the assignment via Canvas. Keeping any discussions public allows everyone to benefit!

# Let's Get Started!

### Task hints

*   `instructions in this style require you to write and execute python code in a code block`
*   instructions in this style require you to write a summary, analysis, or explanation in a text block




Here we load the modules we will use in this script. They are the same modules that are used in the [example notebook](https://github.com/digitalshawn/STC551/blob/main/Module%201/Descriptive%20Stats%20Example.ipynb).

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px # accessible module for plotting graphs
from scipy.stats import skew, kurtosis # to analyze the skew of our dataset
import plotly.figure_factory as ff

# Loading the GoFundMe Data

Below we load the GoFundMe data directly via its GitHub URL. Briefly take a look [at the data file](https://raw.githubusercontent.com/lmeninato/GoFundMe/master/data-raw/GFM_data.csv). You'll see that although the file name ends in .csv, the fields are delimited (separated) by tabs rather than commas. I've flagged this for pandas' `read_csv()` function by setting the `sep` argument to a tab (`\t`).



In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/lmeninato/GoFundMe/master/data-raw/GFM_data.csv", sep="\t")


# Let's explore the data file

1.   `show the first few rows of the data file.`
2.   List and describe the meaning of each column






In [62]:
df.head()

Unnamed: 0.1,Unnamed: 0,Url,Category,Position,Title,Location,Amount_Raised,Goal,Number_of_Donators,Length_of_Fundraising,FB_Shares,GFM_hearts,Text,Latitude,Longitude
0,0,https://www.gofundme.com/3ctqm-medical-bills-f...,Medical,0,92 Yr old Man Brutally Attacked.,"LOS ANGELES, CA",327345.0,15000.0,12167,1 month,26000.0,12000.0,Rodolfo Rodriguez needs your help today! 92 Yr...,34.052234,-118.243685
1,1,https://www.gofundme.com/olivia-stoy-bone-marr...,Medical,0,Olivia Stoy:Transplant & Liv it up!,"ASHLEY, IN",316261.0,1000000.0,5598,3 months,12000.0,5700.0,Thomas Stoy needs your help today! Olivia Stoy...,41.527273,-85.065523
2,2,https://www.gofundme.com/autologous-Tcell-Tran...,Medical,1,AUTOLOGOUS T CELL TRANSPLANT,"STATEN ISLAND, NY",241125.0,250000.0,841,2 months,1800.0,836.0,Philip Defonte needs your help today! AUTOLOGO...,40.579532,-74.150201
3,3,https://www.gofundme.com/a-chance-of-rebirth,Medical,1,A chance of rebirth,"DUBLIN, CA",237424.0,225000.0,4708,1 month,9700.0,4700.0,Sriram Kanniah needs your help today! A chance...,37.702152,-121.935792
4,4,https://www.gofundme.com/teamclaire,Medical,1,Claire Wineland Needs Our Help,"GARDEN GROVE, CA",236590.0,225000.0,8393,2 months,6400.0,8900.0,Melissa Yeager needs your help today! Claire W...,33.774269,-117.937995


In [16]:
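# Helper defined here for later use: converts strings with k/m/b suffixes
# or thousands separators (e.g. "1.2k", "1,500") into floats.

def value_to_float(x):

    # Values that are already numeric (including NaN) pass through unchanged.
    if isinstance(x, (int, float)):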
        return x

    if 'k' in x:
        if len(x) > 1:
            return float(x.replace('k', '')) * 1000
        return 1000.0

    if 'K' in x:
        if len(x) > 1:
            return float(x.replace('K', '')) * 1000
        return 1000.0

    if 'm' in x:
        if len(x) > 1:
            return float(x.replace('m', '')) * 1000000
        return 1000000.0

    if 'M' in x:
        if len(x) > 1:
            return float(x.replace('M', '')) * 1000000
        return 1000000.0

    if 'b' in x:
        return float(x.replace('b', '')) * 1000000000

    if ',' in x:
        return float(x.replace(',', ''))


    return float(x)

In [17]:
df.FB_Shares = df.FB_Shares.apply(value_to_float)
df.GFM_hearts = df.GFM_hearts.apply(value_to_float)
df.Goal = df.Goal.apply(value_to_float)

df.head()

Unnamed: 0.1,Unnamed: 0,Url,Category,Position,Title,Location,Amount_Raised,Goal,Number_of_Donators,Length_of_Fundraising,FB_Shares,GFM_hearts,Text,Latitude,Longitude
0,0,https://www.gofundme.com/3ctqm-medical-bills-f...,Medical,0,92 Yr old Man Brutally Attacked.,"LOS ANGELES, CA",327345.0,15000.0,12167,1 month,26000.0,12000.0,Rodolfo Rodriguez needs your help today! 92 Yr...,34.052234,-118.243685
1,1,https://www.gofundme.com/olivia-stoy-bone-marr...,Medical,0,Olivia Stoy:Transplant & Liv it up!,"ASHLEY, IN",316261.0,1000000.0,5598,3 months,12000.0,5700.0,Thomas Stoy needs your help today! Olivia Stoy...,41.527273,-85.065523
2,2,https://www.gofundme.com/autologous-Tcell-Tran...,Medical,1,AUTOLOGOUS T CELL TRANSPLANT,"STATEN ISLAND, NY",241125.0,250000.0,841,2 months,1800.0,836.0,Philip Defonte needs your help today! AUTOLOGO...,40.579532,-74.150201
3,3,https://www.gofundme.com/a-chance-of-rebirth,Medical,1,A chance of rebirth,"DUBLIN, CA",237424.0,225000.0,4708,1 month,9700.0,4700.0,Sriram Kanniah needs your help today! A chance...,37.702152,-121.935792
4,4,https://www.gofundme.com/teamclaire,Medical,1,Claire Wineland Needs Our Help,"GARDEN GROVE, CA",236590.0,225000.0,8393,2 months,6400.0,8900.0,Melissa Yeager needs your help today! Claire W...,33.774269,-117.937995
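
To make the intent of the conversion step easier to see, here is a compact sketch of the same idea applied to a few made-up strings. The name `parse_count` and the sample values are illustrative assumptions; this is not the helper used above and the samples are not taken from the dataset.

```python
# Hypothetical compact variant of the conversion idea; not the notebook's helper.
def parse_count(value):
    """Convert strings such as '1.2k' or '3M' to floats; pass numbers through."""
    if isinstance(value, (int, float)):
        return value
    multipliers = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}
    # Drop surrounding whitespace and thousands separators before parsing.
    text = value.strip().replace(",", "")
    suffix = text[-1].lower()
    if suffix in multipliers:
        return float(text[:-1]) * multipliers[suffix]
    return float(text)

samples = ["1.2k", "26k", "3M", "1,500", 836.0]  # made-up examples
converted = [parse_count(s) for s in samples]
print(converted)
```

Checking only the final character keeps the suffix handling in one place, at the cost of assuming the suffix always comes last.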


**URL**- The URL for the GoFundMe (GFM) posting

**Category** - The GFM category for the fundraiser (Medical expenses, emergency, animal care, etc.)

**Position** - I was unable to find any documentation on what this column tracks, but I believe it to be the position at which the campaign appeared on the GFM homepage.

**Title** - Title of the GFM posting

**Location** - Location of the home community the poster is from 

**Amount Raised** - Total amount raised during the duration of the fundraiser

**Goal** - Initial goal of the GFM post. 

**Number of Donators** - Total number of financial contributors to the GFM.

**Length of Fundraising** - Length of time the GFM ran on the website.

**FB Shares** - Number of Facebook shares the listing received

**GFM Hearts** - Number of "Hearts" the campaign received on the GFM platform

**Longitude** - East-West geographical coordinate

**Latitude** - North-South geographical coordinate


# Campaigns by Category



1.   `How many campaigns are in each category?`
2.   `What is the average $ amount raised in each category?`
3.   `What is the average fundraising goal in each category?`
4.   Provide a text summary of the results

*feel free to use multiple code blocks if you'd like*



In [18]:
df["Category"].value_counts()

Medical        76
Memorial       72
Volunteer      72
Travel         72
Sports         72
Newlywed       72
Family         72
Faith          72
Event          72
Creative       72
Competition    72
Community      72
Business       72
Education      72
Charity        72
Emergency      72
Wishes         72
Animals        10
11525.0         1
-73.9495823     1
-75.3199035     1
Name: Category, dtype: int64
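
The last three labels in these counts (`11525.0`, `-73.9495823`, `-75.3199035`) look like numbers rather than category names, which suggests those rows are misaligned in the raw file. Below is a minimal sketch of one way to set such rows aside, assuming that any label that parses as a number is not a genuine category. The small frame `demo` is made up for illustration.

```python
import pandas as pd

# Made-up frame mimicking the problem: real categories plus one numeric label.
demo = pd.DataFrame({
    "Category": ["Medical", "Memorial", "-73.9495823", "Medical"],
    "Amount_Raised": [327345.0, 115000.0, None, 316261.0],
})

# pd.to_numeric(..., errors="coerce") yields NaN for real names and a number
# for numeric-looking labels, so notna() flags the suspicious rows.
looks_numeric = pd.to_numeric(demo["Category"], errors="coerce").notna()
clean = demo[~looks_numeric]

print(clean["Category"].value_counts())
```

The same mask could be applied to the full data frame before grouping, so the malformed labels do not show up as their own groups.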

In [19]:
df_grouped = df.groupby('Category')

In [20]:
for group_name, df_group in df_grouped:
  print("Mean $ Raised:", group_name, df_group["Amount_Raised"].mean())

Mean $ Raised: -73.9495823 nan
Mean $ Raised: -75.3199035 nan
Mean $ Raised: 11525.0 688.0
Mean $ Raised: Animals 98085.4
Mean $ Raised: Business 11813.430555555555
Mean $ Raised: Charity 65931.91666666667
Mean $ Raised: Community 120226.7042253521
Mean $ Raised: Competition 5570.375
Mean $ Raised: Creative 25302.347222222223
Mean $ Raised: Education 45777.86111111111
Mean $ Raised: Emergency 116201.01388888889
Mean $ Raised: Event 10978.422535211268
Mean $ Raised: Faith 12903.785714285714
Mean $ Raised: Family 63499.86111111111
Mean $ Raised: Medical 147340.40789473685
Mean $ Raised: Memorial 115498.94444444444
Mean $ Raised: Newlywed 3478.8169014084506
Mean $ Raised: Sports 19540.125
Mean $ Raised: Travel 7099.871428571429
Mean $ Raised: Volunteer 13642.472222222223
Mean $ Raised: Wishes 23230.583333333332


In [37]:
for group_name, df_group in df_grouped:
  print("Mean Goal:", group_name, df_group["Goal"].mean())

Mean Goal: -73.9495823 nan
Mean Goal: -75.3199035 nan
Mean Goal: 11525.0 141.0
Mean Goal: Animals 98500.0
Mean Goal: Business 36416.208333333336
Mean Goal: Charity 171994.73611111112
Mean Goal: Community 152429.52112676058
Mean Goal: Competition 8214.472222222223
Mean Goal: Creative 77225.13888888889
Mean Goal: Education 62325.84722222222
Mean Goal: Emergency 152998.44444444444
Mean Goal: Event 14271.830985915492
Mean Goal: Faith 55787.68571428571
Mean Goal: Family 77055.55555555556
Mean Goal: Medical 199735.76315789475
Mean Goal: Memorial 112638.88888888889
Mean Goal: Newlywed 21107.098591549297
Mean Goal: Sports 26839.458333333332
Mean Goal: Travel 58877.44285714286
Mean Goal: Volunteer 46422.083333333336
Mean Goal: Wishes 54142.930555555555


Looking at the results, the sample is fairly well balanced across categories: most have 72 campaigns (Medical has 76), which makes comparisons more credible since no category is heavily overrepresented. The exceptions are Animals, with only 10 campaigns, and three rows whose category is a number (11525.0, -73.9495823, -75.3199035), which look like misaligned records rather than real categories. "Medical" has the highest average amount raised, at about 147k. Interestingly, the average raised in the Animals category practically matches its average goal (about 98k each), which would suggest these campaigns typically come close to their targets, though with only 10 campaigns that conclusion is weak. On the other hand, "Business" raised on average only about a third of its average goal (a shortfall of roughly 68%), which would indicate business campaigns have a much harder time reaching their goals.

Comparing the mean amount raised with the mean goal for each category, Memorial is the only category that on average exceeds its goal.
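
The comparison between mean amount raised and mean goal can be made explicit as a ratio. The sketch below copies three pairs of means from the printed output (rounded to cents) and computes the amount raised as a share of the goal. The dictionary name `means` is just an illustrative choice.

```python
# Mean amount raised and mean goal, copied (rounded) from the printed output.
means = {
    "Business": (11813.43, 36416.21),
    "Memorial": (115498.94, 112638.89),
    "Animals": (98085.40, 98500.00),
}

for category, (raised, goal) in means.items():
    share = raised / goal          # fraction of the mean goal that was raised
    shortfall = 1 - share          # positive means the mean goal was missed
    print(f"{category}: raised {share:.1%} of goal, shortfall {shortfall:.1%}")

shares = {c: r / g for c, (r, g) in means.items()}
```

Note that this is a ratio of means, not the mean of each campaign's own ratio, so a few campaigns with very large goals can dominate it.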

# Looking for outliers in shares and hearts



1.   `Select 3 categories and create a boxplot of the FB shares and GFM hearts`
2.   `Plot the outliers in the boxplot`
3.   `Calculate the mean, median, mode, std deviation, and variance for the 3 categories' FB shares and GFM hearts`
4.   Summarize these results. What conclusions can you come to about these results?



In [67]:
df_medical = df_grouped.get_group('Medical')

fig = px.box(df_medical, x = "FB_Shares", title = "Distribution of Medical FB Shares")
fig.show()

fig = px.box(df_medical, x = "GFM_hearts", title = "Distribution of Medical GFM Hearts")
fig.show()

# Central tendency and spread (NaN values are skipped by default)
print("Mean FB_Shares:", df_medical["FB_Shares"].mean())
print("Median FB_Shares:", df_medical["FB_Shares"].median())
print("Mode of FB_Shares:", df_medical["FB_Shares"].mode())
print("Variance FB_Shares: ", df_medical["FB_Shares"].var())
print("Standard Deviation FB_Shares: ", df_medical["FB_Shares"].std())

print("Mean GFM_hearts:", df_medical["GFM_hearts"].mean())
print("Median GFM_hearts:", df_medical["GFM_hearts"].median())
print("Mode of GFM_hearts:", df_medical["GFM_hearts"].mode())
print("Variance GFM_hearts: ", df_medical["GFM_hearts"].var())
print("Standard Deviation GFM_hearts: ", df_medical["GFM_hearts"].std())

Mean FB_Shares: 4032.0
Median FB_Shares: 2300.0
Mode of FB_Shares: 0    1300.0
1    1600.0
dtype: float64
Variance FB_Shares:  19091734.72
Standard Deviation FB_Shares:  4369.40896689701
Mean GFM_hearts: 1636.7894736842106
Median GFM_hearts: 1050.0
Mode of GFM_hearts: 0    1100.0
dtype: float64
Variance GFM_hearts:  3320924.99508772
Standard Deviation GFM_hearts:  1822.3405266545876


In [68]:
df_emergency = df_grouped.get_group('Emergency')

fig = px.box(df_emergency , x = "FB_Shares", title = "Distribution of Emergency FB Shares")
fig.show()

fig = px.box(df_emergency , x = "GFM_hearts", title = "Distribution of Emergency GFM Hearts")
fig.show()

# Central tendency and spread (NaN values are skipped by default)
print("Mean FB_Shares:", df_emergency["FB_Shares"].mean())
print("Median FB_Shares:", df_emergency["FB_Shares"].median())
print("Mode of FB_Shares:", df_emergency["FB_Shares"].mode())
print("Variance FB_Shares: ", df_emergency["FB_Shares"].var())
print("Standard Deviation FB_Shares: ", df_emergency["FB_Shares"].std())

print("Mean GFM_hearts:", df_emergency["GFM_hearts"].mean())
print("Median GFM_hearts:", df_emergency["GFM_hearts"].median())
print("Mode of GFM_hearts:", df_emergency["GFM_hearts"].mode())
print("Variance GFM_hearts: ", df_emergency["GFM_hearts"].var())
print("Standard Deviation GFM_hearts: ", df_emergency["GFM_hearts"].std())

Mean FB_Shares: 4213.652777777777
Median FB_Shares: 2950.0
Mode of FB_Shares: 0    1100.0
1    1900.0
2    2100.0
3    3600.0
4    5300.0
dtype: float64
Variance FB_Shares:  14275977.72280907
Standard Deviation FB_Shares:  3778.3564843472714
Mean GFM_hearts: 1480.9166666666667
Median GFM_hearts: 1000.0
Mode of GFM_hearts: 0    1000.0
dtype: float64
Variance GFM_hearts:  1411913.9647887303
Standard Deviation GFM_hearts:  1188.2398599562002


In [69]:
df_memorial = df_grouped.get_group('Memorial')

fig = px.box(df_memorial , x = "FB_Shares", title = "Distribution of Memorial FB Shares")
fig.show()

fig = px.box(df_memorial , x = "GFM_hearts", title = "Distribution of Memorial GFM Hearts")
fig.show()

# Central tendency and spread (NaN values are skipped by default)
print("Mean FB Shares:", df_memorial["FB_Shares"].mean())
print("Median FB Shares:", df_memorial["FB_Shares"].median())
print("Mode of FB Shares:", df_memorial["FB_Shares"].mode())
print("Variance FB Shares: ", df_memorial["FB_Shares"].var())
print("Standard Deviation FB Shares: ", df_memorial["FB_Shares"].std())

print("Mean GFM_hearts:", df_memorial["GFM_hearts"].mean())
print("Median GFM_hearts:", df_memorial["GFM_hearts"].median())
print("Mode of GFM_hearts:", df_memorial["GFM_hearts"].mode())
print("Variance GFM_hearts: ", df_memorial["GFM_hearts"].var())
print("Standard Deviation GFM_hearts: ", df_memorial["GFM_hearts"].std())

Mean FB Shares: 5915.222222222223
Median FB Shares: 3150.0
Mode of FB Shares: 0    1300.0
dtype: float64
Variance FB Shares:  101377296.62597813
Standard Deviation FB Shares:  10068.629332038106
Mean GFM_hearts: 1663.7222222222222
Median GFM_hearts: 989.0
Mode of GFM_hearts: 0    1000.0
dtype: float64
Variance GFM_hearts:  5957926.710485126
Standard Deviation GFM_hearts:  2440.886459974148


For this section I looked at three categories: Medical, Emergency, and Memorial. All three show significant right skew and outliers in both FB shares and GFM hearts. The mean number of FB shares ranges from roughly 4k to 6k across the three categories (each well above its median), but the outliers reach from 11k all the way up to 63k shares. GFM hearts show a similar right skew, though with smaller margins: the means are around 1.5k to 1.7k, while the outliers range anywhere from 3.6k to 16k hearts.

Interestingly, Memorial is the only category where the mean amount raised exceeded the mean goal, and it also shows the most extreme skew of the three categories I observed, as well as the largest number of outliers. This suggests that the success of memorial campaigns could be correlated with FB shares and GFM hearts. It is possible that memorial campaigns appear more successful because of these outliers, that is, campaigns that went "viral".
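
Box plots conventionally flag points more than 1.5 times the interquartile range beyond the quartiles as outliers, and a mean well above the median is a quick sign of right skew. Below is a minimal sketch of both checks on invented share counts (the values in `shares` are not taken from the dataset). Plotting libraries may compute quartiles slightly differently from pandas, so the fences can differ a little from those drawn in a chart.

```python
import pandas as pd

# Invented share counts with one very large value, for illustration only.
shares = pd.Series([900, 1300, 1600, 2300, 2900, 3600, 5300, 63000], dtype=float)

q1 = shares.quantile(0.25)
q3 = shares.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = shares[(shares < lower_fence) | (shares > upper_fence)]

print("Outliers:", outliers.tolist())
print("Mean above median:", shares.mean() > shares.median())
print("Skewness:", shares.skew())  # positive values indicate right skew
```

Counting the flagged values per category would give a number to back up a statement about which category has the most outliers.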

# Explore on your own

1. Select one category and use descriptive stats to explore the success of campaigns in this category.
1. Use graphs where appropriate.
1. Provide commentary along the way on what descriptive measures you are using and why.
1. Provide a one to two paragraph summary of the success of this category.

*use as many code and text blocks along the way*
*Also make sure to consult the pandas and plotly documentation along the way*

In [30]:
# I decided to look at the medical campaigns a little closer. My metric for campaign "success" is whether a campaign reaches its established goal.

# Quick check of the raised-minus-goal difference before storing it
(df_medical.Amount_Raised - df_medical.Goal)

0     312345.0
1    -683739.0
2      -8875.0
3      12424.0
4      11590.0
        ...   
89    176152.0
90    -15275.0
91     -5746.0
92    -15280.0
93    -29899.0
Length: 76, dtype: float64

In [49]:
# Here I add a new column to my medical data frame: the amount raised minus the goal.
# Negative numbers are unsuccessful campaigns and positive numbers are successful ones.
# assign() returns a new data frame, which avoids modifying a slice of df.
df_medical = df_medical.assign(Target=df_medical.Amount_Raised - df_medical.Goal)

df_medical

Unnamed: 0.1,Unnamed: 0,Url,Category,Position,Title,Location,Amount_Raised,Goal,Number_of_Donators,Length_of_Fundraising,FB_Shares,GFM_hearts,Text,Latitude,Longitude,Target
0,0,https://www.gofundme.com/3ctqm-medical-bills-f...,Medical,0,92 Yr old Man Brutally Attacked.,"LOS ANGELES, CA",327345.0,15000.0,12167,1 month,26000.0,12000.0,Rodolfo Rodriguez needs your help today! 92 Yr...,34.052234,-118.243685,312345.0
1,1,https://www.gofundme.com/olivia-stoy-bone-marr...,Medical,0,Olivia Stoy:Transplant & Liv it up!,"ASHLEY, IN",316261.0,1000000.0,5598,3 months,12000.0,5700.0,Thomas Stoy needs your help today! Olivia Stoy...,41.527273,-85.065523,-683739.0
2,2,https://www.gofundme.com/autologous-Tcell-Tran...,Medical,1,AUTOLOGOUS T CELL TRANSPLANT,"STATEN ISLAND, NY",241125.0,250000.0,841,2 months,1800.0,836.0,Philip Defonte needs your help today! AUTOLOGO...,40.579532,-74.150201,-8875.0
3,3,https://www.gofundme.com/a-chance-of-rebirth,Medical,1,A chance of rebirth,"DUBLIN, CA",237424.0,225000.0,4708,1 month,9700.0,4700.0,Sriram Kanniah needs your help today! A chance...,37.702152,-121.935792,12424.0
4,4,https://www.gofundme.com/teamclaire,Medical,1,Claire Wineland Needs Our Help,"GARDEN GROVE, CA",236590.0,225000.0,8393,2 months,6400.0,8900.0,Melissa Yeager needs your help today! Claire W...,33.774269,-117.937995,11590.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89,89,https://www.gofundme.com/kdafoos-cancer,Medical,0,Kdafoos ... Cancer ...,"HOUSTON, TX",676152.0,500000.0,2408,1 month,330.0,2400.0,Cameron McHugh needs your help today! Kdafoos ...,29.760427,-95.369803,176152.0
90,90,https://www.gofundme.com/Nashville-shooting-vi...,Medical,4,Nashville shooting victim recovery,"MOUNT JULIET, TN",184725.0,200000.0,563,12 days,1300.0,598.0,Mark Oglesby needs your help today! Nashville ...,36.200055,-86.518605,-15275.0
91,91,https://www.gofundme.com/joy039s-breast-cancer...,Medical,12,Joy's Breast Cancer Fight,"MARMORA, NJ",124254.0,130000.0,860,1 month,2300.0,886.0,Mark Simmerman needs your help today! Joy's Br...,39.267395,-74.651700,-5746.0
92,92,https://www.gofundme.com/Nashville-shooting-vi...,Medical,4,Nashville shooting victim recovery,"MOUNT JULIET, TN",184720.0,200000.0,562,12 days,1300.0,599.0,Mark Oglesby needs your help today! Nashville ...,36.200055,-86.518605,-15280.0


In [51]:
# Descriptive statistics of the new column that holds the success metric.

print("Mean :", df_medical["Target"].mean())
print("Median:", df_medical["Target"].median())
print("Mode of:", df_medical["Target"].mode())
print("Variance: ", df_medical["Target"].var())
print("Standard Deviation: ", df_medical["Target"].std())
print("Min: ", df_medical["Target"].min())
print("Max: ", df_medical["Target"].max())

Mean : -52395.35526315789
Median: -5796.0
Mode of: 0   -15280.0
dtype: float64
Variance:  48050314831.53877
Standard Deviation:  219203.82029412437
Min:  -1366409.0
Max:  312345.0


In [59]:
# Practicing min-max normalization: rescales Target to the range [0, 1]
df_medical["normalized_target"]=(df_medical["Target"]-df_medical["Target"].min())/(df_medical["Target"].max()-df_medical["Target"].min())

In [46]:
# Scatter plot showing the successes/failures (Target on x, row index on y)

fig = px.scatter(df_medical, x = "Target", title = "Success of Campaign based on Goal Set")

fig.show()

In [61]:
# Replotting with the normalized values produces the same graph on a different scale.

fig = px.scatter(df_medical, x = "normalized_target", title = "Success of Campaign based on Goal Set")

fig.show()
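
The normalized plot has the same shape because min-max scaling is a linear transformation: it shifts and rescales every value by the same constants, so the ordering and relative spacing of the points do not change. Here is a small sketch using a few raised-minus-goal differences copied from the rows displayed above (the list name `target` is an illustrative choice):

```python
# A few raised-minus-goal differences copied from the rows displayed above.
target = [-683739.0, -15280.0, -8875.0, 11590.0, 312345.0]

lo, hi = min(target), max(target)
normalized = [(t - lo) / (hi - lo) for t in target]

# Gaps between neighbouring points shrink by the same constant factor (hi - lo).
gaps_raw = [b - a for a, b in zip(target, target[1:])]
gaps_norm = [b - a for a, b in zip(normalized, normalized[1:])]
ratios = [g_raw / g_norm for g_raw, g_norm in zip(gaps_raw, gaps_norm)]

print(normalized[0], normalized[-1])   # endpoints map to 0.0 and 1.0
print(ratios)                          # each ratio is approximately hi - lo
```

Normalization is therefore more useful for comparing variables measured on different scales than for changing how a single variable looks.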

In [66]:
fig = px.box(df_medical , x = "Target", title = "Distribution Success/Failures")
fig.show()


In order to determine the success of the medical GFM category, I decided to compare the established goal with the amount raised. Subtracting the goal from the amount raised gave me a new variable in my data frame that I called "Target". From there I used descriptive statistics to determine how successful medical campaigns were.

From my analysis, medical campaigns are generally unsuccessful at reaching their goals. The mean of "Target" was -52,395, which means that on average these campaigns fell short of their goal by about 52k. The box plot for "Target" shows considerable left skew, with a few far-reaching outliers (the minimum is -1,366,409) pulling the mean further into the negative. Even setting those outliers aside, the median of -5,796 is still below zero, so at least half of the campaigns fell short of their goal.
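
A raw dollar difference is dominated by campaigns with very large goals, so a complementary view is the share of campaigns that reached their goal and the percent of goal raised. Below is a minimal sketch using the five Medical rows displayed by `df.head()` near the top of the notebook. The column names `Reached_Goal` and `Pct_of_Goal` are illustrative additions, not columns in the dataset.

```python
import pandas as pd

# Amount raised and goal for the five rows shown by df.head().
sample = pd.DataFrame({
    "Amount_Raised": [327345.0, 316261.0, 241125.0, 237424.0, 236590.0],
    "Goal": [15000.0, 1000000.0, 250000.0, 225000.0, 225000.0],
})

# assign() returns a new frame with the derived columns added.
sample = sample.assign(
    Target=sample["Amount_Raised"] - sample["Goal"],
    Reached_Goal=sample["Amount_Raised"] >= sample["Goal"],
    Pct_of_Goal=sample["Amount_Raised"] / sample["Goal"] * 100,
)

success_rate = sample["Reached_Goal"].mean()  # fraction of campaigns at or above goal
print(sample)
print("Success rate:", success_rate)
print("Median percent of goal:", sample["Pct_of_Goal"].median())
```

These five rows are only the top of the table, so the numbers here say nothing about the category as a whole; the same three columns would need to be computed on the full Medical subset.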