<a href="https://colab.research.google.com/github/mrandolph95/STC551/blob/main/STC551_Module_1%2C_desc_stats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Descriptive Statistics in Python Exercise - Module 1

In this exercise we will use a dataset of individual fundraising campaigns created via the [GoFundMe](https://gofundme.com) website. The data comes from a [project on GitHub](https://github.com/lmeninato/GoFundMe/) which collected information about GoFundMe projects in 2018.

You will apply your knowledge of descriptive statistics and skills from the data wrangling course to summarize information about specific categories of projects. I've stubbed out a series of steps below. I will describe each task and leave an open code block for you to complete the task. Please use text blocks to summarize your analysis. Use your own knowledge and the [Module 1 example descriptive stats notebook](https://github.com/digitalshawn/STC551/blob/main/Module%201/Descriptive%20Stats%20Example.ipynb) as a guide, but you may use other techniques to answer the prompts.

## What to submit via Canvas

Download a copy of your completed notebook from Google Colab (File --> Download --> Download .ipynb) and upload it to Canvas for this assignment. Please make sure that you run all code blocks so I can see the output when I open the notebook file.

## Help! I have questions!

You may email me with questions or ask to set up a Zoom meeting so we can look at your code together. You may also use the Canvas discussion board to ask questions and share tips. While I ask that you do not collaborate on answers, you may discuss the assignment via Canvas. Keeping any discussions public allows everyone to benefit!

# Let's Get Started!

### Task hints

*   `instructions in this style require you to write and execute python code in a code block`
*   instructions in this style require you to write a summary, analysis, or explanation in a text block




Here we load the modules we will use in this script. They are the same modules that are used in the [example notebook](https://github.com/digitalshawn/STC551/blob/main/Module%201/Descriptive%20Stats%20Example.ipynb).

In [None]:
import numpy as np 
import pandas as pd 
import plotly.express as px 
from scipy.stats import skew, kurtosis
import plotly.figure_factory as ff

# Loading the GoFundMe Data

Below we load the GoFundMe data directly via its GitHub URL. Briefly take a look [at the data file](https://raw.githubusercontent.com/lmeninato/GoFundMe/master/data-raw/GFM_data.csv). You'll see that although the file name ends in .csv, the fields are delimited (separated) by a tab and not a comma. You'll see that I've flagged this for pandas' read_csv() function using the `sep` argument and setting it equal to a tab (`\t`).
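To see why the `sep` argument matters, here is a minimal offline sketch. The three-line `sample` string is made up for illustration and is not taken from the GoFundMe file; it only mimics a tab-delimited layout.

```python
import io

import pandas as pd

# Made-up sample: a header plus two rows, with fields separated by tabs.
sample = "Category\tTitle\tGoal\nMedical\tExample A\t15k\nSports\tExample B\t2500\n"

# With sep="\t", pandas splits each line into three columns.
as_tabs = pd.read_csv(io.StringIO(sample), sep="\t")

# With the default comma separator, each line is read as one wide column.
as_commas = pd.read_csv(io.StringIO(sample))

print(as_tabs.shape)
print(as_commas.shape)
```

Comparing the two shapes is a quick way to check that the delimiter was picked up as intended.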



In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/lmeninato/GoFundMe/master/data-raw/GFM_data.csv", sep="\t")

In [None]:
df = df[['Category', 'Position', 'Title', 'Location', 'Amount_Raised', 'Goal', 'Number_of_Donators', 'FB_Shares', 'GFM_hearts']]

In [None]:
df = df.dropna(how='any')

In [None]:
# the next three cells expand the k (thousands) and M (millions) suffixes in three columns
# (FB_Shares, Goal, GFM_hearts) and convert the resulting strings to floats

df['FB_Shares'] = df['FB_Shares'].replace({'k': '*1e3', 'M': '*1e6'}, regex=True).map(pd.eval).astype(float)

In [None]:
df['Goal'] = df['Goal'].replace({'k': '*1e3', 'M': '*1e6', ',':''}, regex=True).map(pd.eval).astype(float)

In [None]:
df['GFM_hearts'] = df['GFM_hearts'].replace({'k': '*1e3', 'M': '*1e6'}, regex=True).map(pd.eval).astype(float)

In [None]:
# convert the strings in the Number_of_Donators column to floats; values that cannot be parsed become NaN
df['Number_of_Donators'] = pd.to_numeric(df['Number_of_Donators'],errors='coerce')
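The k and M conversion above can also be written without evaluating strings as expressions. The following is only a sketch of an alternative, not the method used in this notebook: `parse_abbreviated` is a hypothetical helper name, and the input values are made up to resemble the raw columns.

```python
import pandas as pd


def parse_abbreviated(values: pd.Series) -> pd.Series:
    """Turn strings such as '26k', '2.5M' or '1,000,000' into floats.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Drop thousands separators and surrounding whitespace.
    cleaned = values.astype(str).str.replace(",", "", regex=False).str.strip()
    # Split each value into its numeric part and an optional k/M suffix.
    parts = cleaned.str.extract(r"^([\d.]+)([kM]?)$")
    # Map the suffix to a multiplier; no suffix means multiply by 1.
    multiplier = parts[1].map({"": 1.0, "k": 1e3, "M": 1e6})
    # Values that do not match the pattern come out as NaN.
    return pd.to_numeric(parts[0], errors="coerce") * multiplier


# Made-up examples in the style of the raw columns.
raw = pd.Series(["26k", "1.8k", "2.5M", "1,000,000", "841"])
converted = parse_abbreviated(raw)
print(converted)
```

The design choice here is to treat the suffix as data (a lookup table of multipliers) rather than as code to be evaluated, so unexpected strings become NaN instead of raising an error.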

# Let's explore the data file


1.   `show the first few rows of the data file.`
2.   List and describe the meaning of each column






In [None]:
df.head()

|   | Category | Position | Title | Location | Amount_Raised | Goal | Number_of_Donators | FB_Shares | GFM_hearts |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Medical | 0 | 92 Yr old Man Brutally Attacked. | LOS ANGELES, CA | 327345.0 | 15000.0 | NaN | 26000.0 | 12000.0 |
| 1 | Medical | 0 | Olivia Stoy:Transplant & Liv it up! | ASHLEY, IN | 316261.0 | 1000000.0 | NaN | 12000.0 | 5700.0 |
| 2 | Medical | 1 | AUTOLOGOUS T CELL TRANSPLANT | STATEN ISLAND, NY | 241125.0 | 250000.0 | 841.0 | 1800.0 | 836.0 |
| 3 | Medical | 1 | A chance of rebirth | DUBLIN, CA | 237424.0 | 225000.0 | NaN | 9700.0 | 4700.0 |
| 4 | Medical | 1 | Claire Wineland Needs Our Help | GARDEN GROVE, CA | 236590.0 | 225000.0 | NaN | 6400.0 | 8900.0 |


*list and description of column headers go here*

1.   **Index**: The key value of each row
2.   **Category**: Fundraising Category on GoFundMe
4. **Title**: The title of the GoFundMe Fundraising post
5. **Location**: The geographical location of the GoFundMe post
6. **Amount_Raised**: Amount of money raised
7. **Goal**: The fundraising goal
8. **Number_of_Donators**: The number of people who donated money
9. **FB_Shares**: The number of times the GFM post was shared on Facebook
10. **GFM_hearts**: The number of hearts (likes) on the GFM post


# Campaigns by Category



1.   `How many campaigns are in each category?`
2.   `What is the average $ amount raised in each category?`
3.   `What is the average fundraising goal in each category?`
4.   Provide a text summary of the results

*feel free to use multiple code blocks if you'd like*

In [None]:
df['Category'].value_counts()

Medical        76
Memorial       72
Emergency      72
Charity        72
Education      72
Volunteer      72
Business       72
Family         72
Creative       71
Community      71
Wishes         71
Competition    70
Sports         70
Event          68
Faith          67
Travel         65
Newlywed       49
Animals        10
Name: Category, dtype: int64

In [None]:
df.groupby('Category', as_index=False)['Amount_Raised'].mean().round()

|   | Category | Amount_Raised |
|---|---|---|
| 0 | Animals | 98085.0 |
| 1 | Business | 11813.0 |
| 2 | Charity | 65932.0 |
| 3 | Community | 120227.0 |
| 4 | Competition | 5458.0 |
| 5 | Creative | 25475.0 |
| 6 | Education | 45778.0 |
| 7 | Emergency | 116201.0 |
| 8 | Event | 11124.0 |
| 9 | Faith | 13052.0 |


In [None]:
df.groupby('Category', as_index=False)['Goal'].mean().round()

|   | Category | Goal |
|---|---|---|
| 0 | Animals | 98500.0 |
| 1 | Business | 36416.0 |
| 2 | Charity | 171995.0 |
| 3 | Community | 152430.0 |
| 4 | Competition | 8106.0 |
| 5 | Creative | 78144.0 |
| 6 | Education | 62326.0 |
| 7 | Emergency | 152998.0 |
| 8 | Event | 14607.0 |
| 9 | Faith | 57815.0 |
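The three per-category questions (count, average raised, average goal) can also be answered in a single table with named aggregation. This is a sketch on a made-up `toy` DataFrame that only borrows the column names from the GoFundMe data; the output column names `campaigns`, `mean_raised`, and `mean_goal` are my own labels.

```python
import pandas as pd

# Made-up rows; only the column names match the GoFundMe data.
toy = pd.DataFrame({
    "Category": ["Medical", "Medical", "Sports", "Sports", "Sports"],
    "Title": ["a", "b", "c", "d", "e"],
    "Amount_Raised": [1000.0, 3000.0, 200.0, 400.0, 600.0],
    "Goal": [2000.0, 2000.0, 500.0, 500.0, 500.0],
})

# One row per category: campaign count, mean amount raised, mean goal.
summary = (
    toy.groupby("Category")
    .agg(
        campaigns=("Title", "count"),
        mean_raised=("Amount_Raised", "mean"),
        mean_goal=("Goal", "mean"),
    )
    .round()
)
print(summary)
```

Having the three measures side by side makes it easier to compare each category's average goal with its average amount raised.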


*summarize output here*


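> There are 18 different fundraising categories.

> The Medical category has the most posts (76). Most other categories have between 65 and 72 posts, while Newlywed (49) and Animals (10) have noticeably fewer.

> The Newlywed category has the lowest average goal and the lowest average amount raised.

> The Medical category has the highest average goal and the highest average amount raised.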



# Looking for outliers in shares and hearts



1.   `Select 3 categories and create a boxplot of the FB shares and GFM hearts`
2.   `Plot the outliers in the boxplot`
3.   `Calculate the mean, median, mode, std deviation, and variance for the 3 categories' FB shares and GFM hearts`
4.   Summarize these results. What conclusions can you come to about these results?



In [None]:
category = df.groupby('Category')

In [None]:
newlyweds = category.get_group('Newlywed')

In [None]:
medical = category.get_group('Medical')

In [None]:
wishes = category.get_group('Wishes')

In [None]:
figtwo = px.box(newlyweds, x=['FB_Shares', 'GFM_hearts'], title = 'Distribution of Likes and Shares for the Newlyweds GoFundMe Category')
figtwo.show()

In [None]:
figone = px.box(medical, x=['FB_Shares', 'GFM_hearts'], title = 'Distribution of Likes and Shares for the Medical GoFundMe Category')
figone.show()

In [None]:
figthree = px.box(wishes, x=['FB_Shares', 'GFM_hearts'], title = 'Distribution of Likes and Shares for the Wishes GoFundMe Category')
figthree.show()
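By default, plotly's box plots draw as separate points the values lying more than 1.5 times the interquartile range (IQR) beyond the box. A minimal sketch of that rule on made-up numbers is below; plotly's own quartile calculation may differ slightly from `Series.quantile`, so treat the fences as approximate.

```python
import pandas as pd

# Made-up share counts with one very large value.
shares = pd.Series([17, 20, 45, 60, 95, 110, 1100], name="FB_Shares")

# Quartiles and interquartile range.
q1 = shares.quantile(0.25)
q3 = shares.quantile(0.75)
iqr = q3 - q1

# Fences: values outside these bounds are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = shares[(shares < lower_fence) | (shares > upper_fence)]
print(outliers)
```

Listing the flagged rows this way makes it possible to look up which campaigns sit behind the outlier points in a chart.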

In [None]:
newlyweds.FB_Shares.describe()

count      49.000000
mean       94.734694
std       175.345466
min         2.000000
25%        17.000000
50%        45.000000
75%        95.000000
max      1100.000000
Name: FB_Shares, dtype: float64

In [None]:
# range (max - min) of FB shares for Newlywed; this is the spread of the data, not the variance
newlywedfb_range = 1100 - 2
print(newlywedfb_range)

1098


In [None]:
newlyweds.GFM_hearts.describe()

count     49.000000
mean      34.448980
std       30.393983
min        2.000000
25%       18.000000
50%       25.000000
75%       39.000000
max      137.000000
Name: GFM_hearts, dtype: float64

In [None]:
# range (max - min) of GFM hearts for Newlywed; this is the spread of the data, not the variance
newlywedg_range = 137 - 2
print(newlywedg_range)

135


In [None]:
medical.FB_Shares.describe()

count       76.000000
mean      4032.000000
std       4369.408967
min        147.000000
25%       1300.000000
50%       2300.000000
75%       6100.000000
max      26000.000000
Name: FB_Shares, dtype: float64

In [None]:
# range (max - min) of FB shares for Medical; this is the spread of the data, not the variance
medicalfb_range = 26000 - 147
print(medicalfb_range)

25853


In [None]:
medical.GFM_hearts.describe()

count       76.000000
mean      1636.789474
std       1822.340527
min        106.000000
25%        703.000000
50%       1050.000000
75%       1900.000000
max      12000.000000
Name: GFM_hearts, dtype: float64

In [None]:
# range (max - min) of GFM hearts for Medical; this is the spread of the data, not the variance
medicalg_range = 12000 - 106
print(medicalg_range)

11894


In [None]:
wishes.FB_Shares.describe()

count      71.000000
mean     1029.169014
std      1395.119360
min        15.000000
25%       224.000000
50%       476.000000
75%      1150.000000
max      5800.000000
Name: FB_Shares, dtype: float64

In [None]:
# range (max - min) of FB shares for Wishes; this is the spread of the data, not the variance
wishesfb_range = 5800 - 15
print(wishesfb_range)

5785


In [None]:
wishes.GFM_hearts.describe()

count      71.000000
mean      309.774648
std       557.720749
min        11.000000
25%        95.000000
50%       166.000000
75%       321.500000
max      4400.000000
Name: GFM_hearts, dtype: float64

In [None]:
# range (max - min) of GFM hearts for Wishes; this is the spread of the data, not the variance
wishesg_range = 4400 - 11
print(wishesg_range)

4389
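The task asks for the mean, median, mode, standard deviation, and variance, while `describe()` reports only some of these. The sketch below shows one way to collect all five per category. The `toy` data is made up, and `describe_column` is a hypothetical helper name. Note that pandas computes the sample (n - 1) standard deviation and variance by default.

```python
import pandas as pd

# Made-up data; only the column names match the GoFundMe data.
toy = pd.DataFrame({
    "Category": ["Newlywed"] * 4 + ["Medical"] * 4,
    "FB_Shares": [10.0, 20.0, 20.0, 30.0, 100.0, 200.0, 200.0, 900.0],
})


def describe_column(series: pd.Series) -> pd.Series:
    """Hypothetical helper: the five measures requested in the task."""
    return pd.Series({
        "mean": series.mean(),
        "median": series.median(),
        # mode() can return several values; keep the smallest one.
        "mode": series.mode().iloc[0],
        "std": series.std(),
        "variance": series.var(),
    })


# One row of statistics per category.
stats = pd.DataFrame({
    name: describe_column(group["FB_Shares"])
    for name, group in toy.groupby("Category")
}).T
print(stats)
```

The same helper can be applied to `GFM_hearts` by swapping the column name.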


*summarize your outlier results here*

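In every box plot the outliers fall to the right of the box, at the high end of the scale, which suggests they represent GoFundMe posts that went viral. The summary statistics agree: in all three categories the mean is well above the median for both shares and hearts (for example, Medical FB shares have a mean of 4,032 and a median of 2,300), which is what a few very large values produce.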
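Outliers on the high side only are a sign of right (positive) skew, and that can be checked numerically. The sketch below uses made-up heart counts and the pandas `Series.skew()` method, which applies a bias correction and so can differ slightly from the default `scipy.stats.skew` result.

```python
import pandas as pd

# Made-up heart counts with a single large value on the right.
hearts = pd.Series([18, 20, 25, 30, 39, 137], name="GFM_hearts")

# In a right-skewed distribution the mean is pulled above the median.
mean_above_median = hearts.mean() > hearts.median()

# A positive skewness value also indicates a longer right tail.
skewness = hearts.skew()

print(mean_above_median)
print(skewness)
```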

# Explore on your own

1. Select one category and use descriptive stats to explore the success of campaigns in this category.
1. Use graphs where appropriate.
1. Provide commentary along the way on what descriptive measures you are using and why.
1. Provide a one to two paragraph summary of the success of this category.

*use as many code and text blocks along the way*
*Also make sure to consult the pandas and plotly documentation along the way*

In [None]:
# I wanted to compare the goal with the amount raised, the following code helps
# us gauge this
emergency = category.get_group('Emergency')

In [None]:
emergency.Goal.describe()

count         72.000000
mean      152998.444444
std       143050.605661
min         2000.000000
25%        83750.000000
50%       100000.000000
75%       200000.000000
max      1000000.000000
Name: Goal, dtype: float64

In [None]:
emergency.Amount_Raised.describe()

count        72.000000
mean     116201.013889
std       61271.936672
min       64270.000000
25%       73122.500000
50%       96204.500000
75%      129957.750000
max      333291.000000
Name: Amount_Raised, dtype: float64

In [None]:
figfour = px.histogram(emergency, x=['Goal', 'Amount_Raised'])
figfour.show()

In [None]:
# plot each campaign's goal (x) against the amount it raised (y)
figfive = px.scatter(emergency, x='Goal', y='Amount_Raised')
figfive.show()

The goal and amount-raised distributions overlap considerably, and there are few outliers. The highest goal is $1,000,000, and no amount-raised data point comes near that outlier: the most money raised in the Emergency category was about $333,000. Interestingly, the minimum amount raised ($64,270) surpassed the minimum goal ($2,000) by more than $62,000. The median amount raised (about $96,200) is also close to the median goal ($100,000). Based on that information, this fundraising category is successful.
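Comparing the minimum, maximum, and average of two columns does not say how individual campaigns did against their own goals. One possible way to measure that, sketched here on made-up rows, is to compute each campaign's percentage of goal reached and the share of campaigns that met their goal. The column names `Pct_of_Goal` and `Met_Goal` are my own and do not exist in the dataset.

```python
import pandas as pd

# Made-up campaigns; only Goal and Amount_Raised mirror real column names.
toy = pd.DataFrame({
    "Goal": [1000.0, 5000.0, 10000.0, 20000.0],
    "Amount_Raised": [1500.0, 4000.0, 10000.0, 5000.0],
})

# Percentage of the goal each campaign reached.
toy["Pct_of_Goal"] = toy["Amount_Raised"] / toy["Goal"] * 100

# True where a campaign raised at least its goal.
toy["Met_Goal"] = toy["Amount_Raised"] >= toy["Goal"]

# The mean of a boolean column is the fraction of True values.
share_met = toy["Met_Goal"].mean()

print(toy)
print(share_met)
```

Applied to a single category, the median of `Pct_of_Goal` and the value of `share_met` would give a per-campaign view of success to set beside the summary statistics.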