In [1]:
import pandas as pd

In [12]:
df = pd.read_csv('data/Kickstarter000.csv')
df.head(6)

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,currency_symbol,currency_trailing_code,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,21,2006 was almost 7 years ago.... Can you believ...,"{""id"":43,""name"":""Rock"",""slug"":""music/rock"",""po...",802,US,1387659690,"{""id"":1495925645,""name"":""Daniel"",""is_registere...",USD,$,True,...,new-final-round-album,https://www.kickstarter.com/discover/categorie...,True,False,successful,1391899046,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",802.0,international
1,97,An adorable fantasy enamel pin series of princ...,"{""id"":54,""name"":""Mixed Media"",""slug"":""art/mixe...",2259,US,1549659768,"{""id"":1175589980,""name"":""Katherine"",""slug"":""fr...",USD,$,True,...,princess-pals-enamel-pin-series,https://www.kickstarter.com/discover/categorie...,True,False,successful,1551801611,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",2259.0,international
2,88,Helping a community come together to set the s...,"{""id"":280,""name"":""Photobooks"",""slug"":""photogra...",29638,US,1477242384,"{""id"":1196856269,""name"":""MelissaThomas"",""is_re...",USD,$,True,...,their-life-through-their-lens-the-amish-and-me...,https://www.kickstarter.com/discover/categorie...,True,True,successful,1480607932,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",29638.0,international
3,193,Every revolution starts from the bottom and we...,"{""id"":266,""name"":""Footwear"",""slug"":""fashion/fo...",49158,IT,1540369920,"{""id"":1569700626,""name"":""WAO"",""slug"":""wearewao...",EUR,€,False,...,wao-the-eco-effect-shoes,https://www.kickstarter.com/discover/categorie...,True,False,successful,1544309940,1.136525,"{""web"":{""project"":""https://www.kickstarter.com...",49075.15252,international
4,20,Learn to build 10+ Applications in this comple...,"{""id"":51,""name"":""Software"",""slug"":""technology/...",549,US,1425706517,"{""id"":1870845385,""name"":""Kalpit Jain"",""is_regi...",USD,$,True,...,apple-watch-development-course,https://www.kickstarter.com/discover/categorie...,False,False,failed,1428511019,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",549.0,domestic
5,77,'Eclipse' - A 30mm hard enamel pin in jet blac...,"{""id"":262,""name"":""Accessories"",""slug"":""fashion...",2117,GB,1515687963,"{""id"":385711367,""name"":""Jennifer Hawkyard"",""sl...",GBP,£,False,...,saluki-totem-enamel-pin,https://www.kickstarter.com/discover/categorie...,True,False,successful,1518865273,1.419368,"{""web"":{""project"":""https://www.kickstarter.com...",2141.826071,international


### Clarify Feature Meanings:

* backers_count: number of distinct backers (investors)
* blurb: short summary of the product/idea/... --> NLP algorithm to find category?
* category: composition of various data/subcategories regarding product on website (web-url, name, position, id, ...)
* converted_pledged_amount: amounts that were actually paid out in the end
* country: country of origin
* created_at: Unix timestamp (seconds since 1970; values span 2009 to 2019) --> can be reduced to a monthly timeline
* creator: creator of kickstarter campaign plus more subcategories (e.g. id, nickname, profile, links, ...)
* currency --> can be dropped because it is redundant (countries are already considered)
* currency_symbol --> drop it
* currency_trailing_code: ???
* current_currency --> drop it
* deadline: Timestamp --> create a delta of time (deadline - created_at)
* disable_communication: highly imbalanced --> have a look at True, otherwise drop feature
* friends: missing --> drop it
* fx_rate: exchange rate --> have a look at the time of investment for the same currency
* goal: target amount of money you want to get from investors
* id --> may be dropped?
* is_backing: missing --> drop it
* is_starrable: is imbalanced
* is_starred: missing --> drop it
* launched_at: time the product is launched (timestamp)
* location: city or town of origin (has to be cleaned)
* name: name of the product
* permissions: missing --> drop it
* photo: not useful --> drop it
* pledged: amount of money pledged by investors --> compare with converted_pledged_amount
* profile: profile information plus many subdata --> can be dropped
* slug --> can be dropped (data is already there)
* source_url: not needed --> category of product can be extracted easily
* spotlight: attention to products and history (marketing) --> seems important
* staff_pick: project was rated positively (picked) by Kickstarter staff members; boolean, imbalanced
* state: state of product development (successful, failed), plus three more states --> what to do with them?
* state_changed_at:
* static_usd_rate:
* urls --> drop it
* usd_pledged: redundant information
* usd_type:
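
The timestamp notes above (created_at, deadline, launched_at) could be sketched roughly as follows. This is only a sketch on toy data, not the project's final function: `add_time_features`, `created_month` and `duration_days` are hypothetical names, the timestamps are assumed to be Unix seconds, and the `deadline` and `launched_at` values are placeholders because the preview hides those columns.

```python
import pandas as pd

# Toy rows: created_at values are taken from the preview above;
# launched_at and deadline are placeholder values (assumption).
toy = pd.DataFrame({
    "created_at": [1387659690, 1549659768],
    "launched_at": [1388000000, 1549700000],
    "deadline": [1391899046, 1551801611],
})


def add_time_features(df, cols=("created_at", "launched_at", "deadline")):
    """Return a copy with datetime columns, a monthly period and a duration in days."""
    out = df.copy()
    for col in cols:
        # Interpret the integers as seconds since 1970-01-01 (Unix time).
        out[col] = pd.to_datetime(out[col], unit="s")
    # Reduce created_at to a monthly timeline.
    out["created_month"] = out["created_at"].dt.to_period("M")
    # Delta of time as noted in the list: deadline - created_at, in whole days.
    out["duration_days"] = (out["deadline"] - out["created_at"]).dt.days
    return out


toy_times = add_time_features(toy)
print(toy_times[["created_month", "duration_days"]])
```

Whether the delta should start at created_at or launched_at is still open; swapping the column in the last step is a one-line change.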

generally:
- look at correlations from summary statistics
- what does the "Phik" correlation tell us?
- get insights from Pandas Profiling
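
For the correlation check, a rank-based (Spearman) correlation is one possible starting point. The sketch below uses three columns of the six preview rows; choosing Spearman is an assumption made here, not a decision from the notes, and Phik would need the separate `phik` package, which is not shown.

```python
import pandas as pd

# Values copied from the six preview rows above.
toy = pd.DataFrame({
    "backers_count": [21, 97, 88, 193, 20, 77],
    "converted_pledged_amount": [802, 2259, 29638, 49158, 549, 2117],
    "staff_pick": [False, False, True, False, False, False],
})

# Cast the boolean to int so every column is numeric, then correlate on ranks.
corr = toy.astype({"staff_pick": int}).corr(method="spearman")
print(corr.round(2))
```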

#### EDA Observations:

* df_index can be dropped
* backers_count highly right skewed --> apply a log transform
* category should be split into its fields, which are appended as columns; the original column is then dropped
* converted_pledged_amount highly right skewed --> log
* countries can be grouped by continent
* goal --> highly right skewed
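
Two of the observations above (log transform for skewed counts, splitting category) might look like the sketch below. It is illustrative only: `log_transform` and `split_category` are hypothetical helper names, `np.log1p` is assumed so that zero backers stay finite, and the second category string is an invented record because the preview truncates the real JSON.

```python
import json

import numpy as np
import pandas as pd

# Toy data: the first category is shortened from the preview,
# the second one is an invented placeholder (assumption).
toy = pd.DataFrame({
    "backers_count": [21, 0],
    "category": [
        '{"id":43,"name":"Rock","slug":"music/rock"}',
        '{"id":1,"name":"Example","slug":"parent/example"}',
    ],
})


def log_transform(df, cols):
    """Add a log1p-transformed copy of each right-skewed column."""
    out = df.copy()
    for col in cols:
        out[f"{col}_log"] = np.log1p(out[col])
    return out


def split_category(df, col="category"):
    """Parse the JSON strings, append their fields as columns, drop the original."""
    parsed = pd.DataFrame(df[col].map(json.loads).tolist(), index=df.index)
    parsed = parsed.add_prefix(f"{col}_")
    # The part of the slug before the slash is taken as the parent category.
    parsed[f"{col}_parent"] = parsed[f"{col}_slug"].str.split("/").str[0]
    return df.drop(columns=col).join(parsed)


out = split_category(log_transform(toy, ["backers_count"]))
print(out)
```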

In [48]:
df.location.describe()

count                                                  3775
unique                                                 1296
top       {"id":2442047,"name":"Los Angeles","slug":"los...
freq                                                    183
Name: location, dtype: object

In [4]:
df.columns

Index(['backers_count', 'blurb', 'category', 'converted_pledged_amount',
       'country', 'created_at', 'creator', 'currency', 'currency_symbol',
       'currency_trailing_code', 'current_currency', 'deadline',
       'disable_communication', 'friends', 'fx_rate', 'goal', 'id',
       'is_backing', 'is_starrable', 'is_starred', 'launched_at', 'location',
       'name', 'permissions', 'photo', 'pledged', 'profile', 'slug',
       'source_url', 'spotlight', 'staff_pick', 'state', 'state_changed_at',
       'static_usd_rate', 'urls', 'usd_pledged', 'usd_type'],
      dtype='object')

### To Do

* drop columns for minimal dataset:
  - blurb
  - creator
  - currency
  - currency_symbol
  - currency_trailing_code
  - current_currency
  - friends
  - fx_rate
  - id
  - is_backing
  - is_starred
  - location
  - name
  - permissions
  - photo
  - profile
  - slug
  - source_url
  - static_usd_rate
  - urls
  - usd_type
  
  
* country: group by USA, Canada, GB, Europe, Australia, Asia, others --> Birte
* split and clean category --> Phillip
* function to convert timestamps into a month/week/day timeline (created_at, deadline, launched_at, state_changed_at) --> Julius
* function for dropping features
* show influence of disable_communication on state via plot(s) --> Birte
* have a look at the categories of the state feature and decide how to combine them (we need 2 classes at the end) --> Julius
* have a look at the influence of usd_pledged
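
Three of the to-do items (dropping features, grouping countries, reducing state to two classes) could start from a sketch like this. Everything beyond the column names is an assumption: the helper names are hypothetical, the country-to-region mapping is a partial guess that still has to be agreed on, and keeping only successful/failed rows is just one possible way to end up with 2 classes.

```python
import pandas as pd

# Columns from the drop list above.
DROP_COLS = [
    "blurb", "creator", "currency", "currency_symbol", "currency_trailing_code",
    "current_currency", "friends", "fx_rate", "id", "is_backing", "is_starred",
    "location", "name", "permissions", "photo", "profile", "slug", "source_url",
    "static_usd_rate", "urls", "usd_type",
]

# Hypothetical, incomplete mapping; unmapped codes fall into "others".
REGION_BY_COUNTRY = {
    "US": "USA", "CA": "Canada", "GB": "GB", "AU": "Australia",
    "IT": "Europe", "DE": "Europe", "FR": "Europe",
    "JP": "Asia", "HK": "Asia",
}


def drop_features(df, cols=DROP_COLS):
    """Drop the listed columns, ignoring any that are not present."""
    return df.drop(columns=cols, errors="ignore")


def group_country(country):
    """Map country codes to coarse regions."""
    return country.map(REGION_BY_COUNTRY).fillna("others")


def binary_state(df, positive="successful", negative="failed"):
    """Keep only the two named states and encode them as 1/0 in `target`."""
    kept = df[df["state"].isin([positive, negative])].copy()
    kept["target"] = (kept["state"] == positive).astype(int)
    return kept


# Toy rows; "some_other_state" stands in for the three remaining states.
toy = pd.DataFrame({
    "country": ["US", "IT", "GB", "MX"],
    "state": ["successful", "successful", "failed", "some_other_state"],
    "slug": ["a", "b", "c", "d"],
})

clean = drop_features(toy)
clean["region"] = group_country(clean["country"])
model_df = binary_state(clean)
print(model_df)
```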

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3779 entries, 0 to 3778
Data columns (total 37 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   backers_count             3779 non-null   int64  
 1   blurb                     3779 non-null   object 
 2   category                  3779 non-null   object 
 3   converted_pledged_amount  3779 non-null   int64  
 4   country                   3779 non-null   object 
 5   created_at                3779 non-null   int64  
 6   creator                   3779 non-null   object 
 7   currency                  3779 non-null   object 
 8   currency_symbol           3779 non-null   object 
 9   currency_trailing_code    3779 non-null   bool   
 10  current_currency          3779 non-null   object 
 11  deadline                  3779 non-null   int64  
 12  disable_communication     3779 non-null   bool   
 13  friends                   8 non-null      object 
 14  fx_rate 