# Module 5: Data Analysis!

## Topic 1: Data Analysis Start to Finish - Part 1

### First import the library

In [None]:
import pandas as pd

### Now we can bring in our dataset as a dataframe and assign it to a variable

In [None]:
kick_all = pd.read_csv('ks-projects-201801.csv',index_col='ID')

### With that done, we can take a quick peek at the shape and some of the data in the dataframe

In [None]:
print(kick_all.shape)
kick_all.head()

### Let's look at the distribution of the `backers` column, since that's what I want to analyze

In [None]:
kick_all['backers'].value_counts(bins=10)

### It appears to be heavily skewed toward the low end, with a long tail of larger counts
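To make the skew concrete, here is a minimal sketch on made-up numbers (the `toy_backers` series is hypothetical, not taken from the Kickstarter file). It shows that `value_counts(bins=...)` cuts the value range into equal-width intervals, so one very large value pushes almost everything into the first bin.

```python
import pandas as pd

# Hypothetical stand-in for the 'backers' column: mostly small values, one huge one
toy_backers = pd.Series([0, 1, 2, 3, 5, 8, 12, 20, 40, 1000])

# bins=4 splits the range 0..1000 into four equal-width intervals;
# sort=False keeps the intervals in ascending order instead of sorting by count
width_counts = toy_backers.value_counts(bins=4, sort=False)
print(width_counts)
```

With data shaped like this, nine of the ten values land in the lowest interval, which is the same pattern the real column shows.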

### Let's break that data into quantile buckets, each holding roughly the same number of projects, and see how that looks

In [None]:
pd.qcut(kick_all['backers'], q=6)
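
As a rough illustration of what `pd.qcut` does, the sketch below runs it on a small made-up series (`toy_backers` is hypothetical). Unlike equal-width bins, the bucket edges are chosen from the data's quantiles so that each bucket receives about the same number of rows.

```python
import pandas as pd

# Hypothetical stand-in for the 'backers' column
toy_backers = pd.Series([0, 1, 2, 3, 5, 8, 12, 20, 40, 1000])

# q=5 asks for five buckets with (as near as possible) equal row counts
quantile_bins = pd.qcut(toy_backers, q=5)

# Count the rows per bucket, keeping the buckets in ascending order
bucket_sizes = quantile_bins.value_counts(sort=False)
print(bucket_sizes)
```

One caveat worth knowing: if many rows share the same value, two quantile edges can coincide and `pd.qcut` raises a `ValueError`; passing `duplicates='drop'` merges the repeated edges instead.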

### There are a lot of small projects in here that I don't want, so I'm going to cut off any project that got fewer than 20 backers

### I'm going to first create a list of the row IDs that need to be removed

In [None]:
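# Index labels (project IDs) of every row with fewer than 20 backers
kick_rows_to_delete = kick_all[kick_all["backers"] < 20].index
print(kick_rows_to_delete.shape)
# Convert the Index to a plain list and preview the first ten IDs
kick_rows_to_delete = list(kick_rows_to_delete)
print(kick_rows_to_delete[:10])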

### I can now use that list to drop the matching rows from the dataframe and assign the result to a new dataframe (it is named `kick_50_backers`, although the cutoff applied above is 20)

In [None]:
kick_50_backers = kick_all.drop(kick_rows_to_delete)
#print(kick_50_backers.head())
kick_50_backers.shape
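
The same filtering can also be written as a single boolean mask. This is only a sketch on a made-up frame (`toy` and its IDs are hypothetical); it checks that keeping rows with at least 20 backers gives the same result as dropping the rows with fewer than 20.

```python
import pandas as pd

# Hypothetical miniature of the projects table, indexed by project ID
toy = pd.DataFrame(
    {"backers": [5, 19, 20, 75, 300]},
    index=pd.Index([101, 102, 103, 104, 105], name="ID"),
)

# Approach used above: collect the index labels to remove, then drop them
rows_to_delete = toy[toy["backers"] < 20].index
dropped = toy.drop(rows_to_delete)

# Alternative: keep the rows that pass the test; .copy() gives an independent frame
kept = toy[toy["backers"] >= 20].copy()

print(kept)
print(dropped.equals(kept))
```

The two approaches can differ when the column has missing values, because a missing value fails both `< 20` and `>= 20`.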

## Topic 2: Data Analysis Start to Finish - Part 2

### I'm going to sort the new dataframe by backers to see the most backed projects

In [None]:
kick_50_backers.sort_values(by=["backers"], ascending=False, inplace=True)
kick_50_backers.head()

### I'm curious which categories of projects get the most backing, so I'm going to group by main category and sum

In [None]:
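kick_50_group_main_cat = kick_50_backers.groupby("main_category")
# Sum only the numeric columns; summed text columns would break the float formatting used later
kick_50_group_main_cat = kick_50_group_main_cat.sum(numeric_only=True)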
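
Here is a small sketch of the group-and-sum step on made-up rows (the `toy` frame and its values are hypothetical). In this sketch `numeric_only=True` restricts the sum to numeric columns, so a text column such as the project name is left out of the result.

```python
import pandas as pd

# Hypothetical miniature of the projects table
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "main_category": ["Games", "Games", "Music", "Music"],
    "pledged": [1000.0, 500.0, 200.0, 100.0],
    "backers": [50, 25, 20, 10],
})

# One row per category, each numeric column summed within the category
totals = toy.groupby("main_category").sum(numeric_only=True)
print(totals)
```

The grouping column becomes the index of the result, which is why later cells can sort and filter by category name directly.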

### These are big numbers so they'll get output in scientific notation. I'm going to override that

In [None]:
kick_50_group_main_cat.head(15).style.format("{:.1f}") #suppresses scientific notation
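
Another way to avoid scientific notation, shown here only as a sketch, is the `display.float_format` option. The frame `big` and its numbers are made up; `pd.option_context` applies the setting temporarily, so the rest of the notebook is unaffected.

```python
import pandas as pd

# Hypothetical totals large enough to be awkward to read
big = pd.DataFrame({"goal": [1.5e9, 2.25e8]}, index=["Film & Video", "Games"])

# Inside the block, floats are rendered with one decimal and thousands separators
with pd.option_context("display.float_format", "{:,.1f}".format):
    formatted = big.to_string()

print(formatted)
```

Compared with `.style.format`, this affects plain-text output as well, which can be handy outside a notebook.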

### Now I want to see which category has the highest total goal and figure out the average pledge per backer in each category

In [None]:
kick_50_group_main_cat.sort_values(by=["goal"], ascending=False, inplace=True)
kick_50_group_main_cat['pledge_per_backer'] = kick_50_group_main_cat["pledged"]/kick_50_group_main_cat['backers']
kick_50_group_main_cat.head(15).style.format("{:.1f}") #suppresses scientific notation
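
It helps to be clear about what this ratio measures. Dividing summed `pledged` by summed `backers` gives the average pledge across all backers in a category, which is not the same as averaging each project's own ratio. The sketch below uses made-up numbers (the `toy` frame is hypothetical) to show the two can differ.

```python
import pandas as pd

# Hypothetical projects: two Games projects with very different backer counts
toy = pd.DataFrame({
    "main_category": ["Games", "Games", "Music"],
    "pledged": [1000.0, 1000.0, 300.0],
    "backers": [100, 10, 30],
})

# Ratio of totals, as in the cell above: total pledged / total backers
totals = toy.groupby("main_category")[["pledged", "backers"]].sum()
ratio_of_totals = totals["pledged"] / totals["backers"]

# Mean of per-project ratios: every project counts equally, whatever its size
per_project = toy["pledged"] / toy["backers"]
mean_of_ratios = per_project.groupby(toy["main_category"]).mean()

print(ratio_of_totals)
print(mean_of_ratios)
```

For Games the ratio of totals is 2000/110 (about 18.2), while the mean of the per-project ratios is 55, because the small project with generous backers carries as much weight as the large one.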

In [None]:
kick_50_group_main_cat.tail(15).style.format("{:.1f}") #suppresses scientific notation

### Now to have a look at just the biggest categories

In [None]:
# Keep only the categories whose total goal exceeds 100 million
kick_50_group_main_cat_trimmed = kick_50_group_main_cat[kick_50_group_main_cat["goal"] > 100_000_000]

In [None]:
kick_50_group_main_cat_trimmed.head(15).style.format("{:.1f}") #suppresses scientific notation

## Topic 3: Data Analysis Start to Finish - Part 3

### It's hard to do any decent analysis on the raw backer counts because they take so many distinct values. I'm going to create buckets of backer counts so I can see the data more easily

In [None]:
kick_50_backers["backer_groups"] = pd.cut(kick_50_backers["backers"], bins=[0, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 100000000])
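
To show how `pd.cut` treats the bucket edges, here is a minimal sketch on made-up counts (`toy_backers`, `edges`, and the label strings are all hypothetical). By default each interval is open on the left and closed on the right, so a value equal to an edge falls into the lower bucket.

```python
import pandas as pd

# Hypothetical backer counts, including values that sit exactly on an edge
toy_backers = pd.Series([20, 50, 51, 100, 150, 5000])
edges = [0, 50, 100, 200, 10000]

# Default labels are the intervals themselves, e.g. (0, 50]
groups = pd.cut(toy_backers, edges)

# Optional readable labels, one per interval
labeled = pd.cut(toy_backers, edges, labels=["1-50", "51-100", "101-200", "201+"])
print(pd.DataFrame({"backers": toy_backers, "interval": groups, "label": labeled}))

# A value equal to the lowest edge is outside every interval and becomes NaN
zero_bucket = pd.cut(pd.Series([0]), edges)
print(zero_bucket.isna().iloc[0])
```

In the notebook this edge case does not bite, since rows with fewer than 20 backers were already removed, but it matters if the same bins are applied to the full dataset.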

In [None]:
kick_50_backers.head(10)

### With my new buckets, I can do some counts

In [None]:
kick_50_backers['backer_groups'].value_counts()

### Let's see if more backers make for a higher chance of success. First I'll create a dataframe that is just the columns I'm interested in

In [None]:
kick_success_by_backers = kick_50_backers.loc[:,["backer_groups","state"]]
print(kick_success_by_backers)

### With that data, I can now get counts by state by backer group

In [None]:
kick_success_by_backers.groupby('backer_groups')['state'].value_counts()

### While counts are nice, it's easier to see if we look at percentages

In [None]:
kick_success_by_backers.groupby('backer_groups')['state'].value_counts(normalize=True) *100
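
The same percentages can be laid out as a table with one row per backer group and one column per state. This is a sketch using `pd.crosstab` on made-up rows (the `toy` frame and its group names are hypothetical); `normalize="index"` makes each row sum to 1 before scaling to percent.

```python
import pandas as pd

# Hypothetical miniature of the two columns of interest
toy = pd.DataFrame({
    "backer_groups": ["small", "small", "small", "large", "large"],
    "state": ["failed", "failed", "successful", "successful", "successful"],
})

# Rows: backer group, columns: state, values: percent of the group in that state
pct = pd.crosstab(toy["backer_groups"], toy["state"], normalize="index") * 100
print(pct.round(1))
```

The wide layout makes it easier to compare one state, such as `successful`, across groups by reading down a single column.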

### It also helps to look at the overall success rate in the data without the backer groups so we have something to compare against.

In [None]:
kick_success_by_backers['state'].value_counts(normalize=True) *100

## Topic 4: Data Analysis Start to Finish - Part 4

### No code to show here. This is the part of the analysis where you think about what format and structure the data needs to be in for demonstration or visualization, which is the next module!

### There's also a bit of info about the .npy format and pickling
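
For reference, a minimal sketch of both formats follows. The `toy` frame and the file names are hypothetical, and the files are written to a temporary directory so nothing is left behind. `.npy` stores a single NumPy array, while a pickle stores the whole dataframe, including its index and column types.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical miniature of the analysis result
toy = pd.DataFrame({
    "backers": [20, 75, 300],
    "state": ["failed", "successful", "successful"],
})

with tempfile.TemporaryDirectory() as tmp:
    # .npy: save one column as a NumPy array and load it back
    npy_path = os.path.join(tmp, "backers.npy")
    np.save(npy_path, toy["backers"].to_numpy())
    backers_back = np.load(npy_path)

    # Pickle: save the whole dataframe and load it back
    pkl_path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(pkl_path)
    toy_back = pd.read_pickle(pkl_path)

print(backers_back)
print(toy_back.equals(toy))
```

Pickle files can execute code when loaded, so only unpickle files that come from a source you trust.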