## Identification of sustainability-focused campaigns on the Kickstarter crowdfunding platform using NLP and ML boosted with swarm intelligence
--- ------------------
<div>
Data Analysis: part 1
<br>
Submitted by: Jossin Antony<br>
Affiliation: THU Ulm<br>
Date: 11.06.2024
</div>

## Overview
- [Introduction]()
- [Details of dataset]()
- [Preparation of Dataset]()
- [Save dataset]()

### A. Introduction
--- -------------------

The aim of the project is to study how crowdfunding campaigns support sustainable initiatives. This project focuses on crowdfunded campaigns on the [Kickstarter](https://www.kickstarter.com/) platform and explores a dataset of ca. 184,186 initiatives from different domains (e.g., Technology, Music, Publishing). The goal of the analyses here is to find the features most relevant to initiatives that are both sustainable and profitable. The analyses also explore possible relationships among the features and seek insights that might contribute to a better understanding of the success or failure prospects of current and future environment-focused crowdfunded initiatives.


### B. Details of dataset
--- -------------------
1. Source: [Kickstarter_File.xlsx](Kickstarter_File.xlsx)
2. Generation mode: provided by researcher
3. Time period considered: 04-2009 to 05-2021 (ca. 146 months).
4. Total entries: 184,185

The initial data preparation consists of examining the features, eliminating redundant ones, renaming and re-ordering the remaining features, and saving the dataframe.

### C. Preparation of Dataset
--- ------------------

First we make sure the dataset is 'reasonable', i.e., it has a good structure, its columns hold data of the expected types, it is devoid of null values, etc.
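As an illustration of what such a check could look like, the sketch below summarises the dtype, null count and null share of each column and counts fully duplicated rows. The toy frame and the helper name `sanity_report` are assumptions made for this example; they are not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the Kickstarter data (values are made up).
toy = pd.DataFrame({
    "campaign_name": ["Solar lamp", "Board game", None],
    "goal_usd": [5000.0, 12000.0, 300.0],
    "is_success": ["successful", "failed", "successful"],
})

def sanity_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: dtype, null count and null share per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_null": frame.isna().sum(),
        "share_null": frame.isna().mean().round(3),
    })

report = sanity_report(toy)
n_duplicates = int(toy.duplicated().sum())  # rows identical in every column

print(report)
print(f"Fully duplicated rows: {n_duplicates}")
```

A report of this kind makes it easy to spot columns whose type or null share differs from what is expected before any feature is dropped.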

The basic information of the data is as follows:

In [24]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

In [25]:
# Load the raw dataset from the Excel file. This is slow.
df= pd.read_excel('./data/Kickstarter_File.xlsx')
#--------------------------------------------------
# Write the raw dataset to a CSV file. Do this only once, if the CSV file is not already provided.
df.dropna(how='all', inplace=True)
df.to_csv('./data/dataframe_raw.csv', index=False)
#--------------------------------------------------
# Once the CSV file exists, load the raw dataset from it instead. This is fast compared to reading the Excel file.
#df= pd.read_csv('./data/dataframe_raw.csv', low_memory=False)
#df.sample(5)
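
The manual switch between the slow Excel read and the fast CSV read could also be automated. The following is a minimal sketch under the assumption that a CSV cache next to the Excel file is acceptable; the function name `load_raw` and the demo file contents are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_raw(excel_path, csv_cache):
    """Read the CSV cache if it exists; otherwise read Excel once and create the cache."""
    csv_cache = Path(csv_cache)
    if csv_cache.exists():
        return pd.read_csv(csv_cache, low_memory=False)
    frame = pd.read_excel(excel_path).dropna(how="all")
    frame.to_csv(csv_cache, index=False)
    return frame

# Demo with a made-up cache file, so the Excel branch is never reached.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "dataframe_raw.csv"
    pd.DataFrame({"name": ["a", "b"], "goal": [1.0, 2.0]}).to_csv(cache, index=False)
    demo = load_raw("not_needed.xlsx", cache)

print(demo.shape)
```

Note that reading from the CSV cache yields a fresh 0-based index, whereas the frame read from Excel keeps the original row labels after `dropna`.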

In [36]:
df.rename_axis('index',inplace=True)
shape= df.shape
print(f'The dataframe has {shape[0]} rows and {shape[1]} columns.')
print()
print('The overall dataframe information is given below:')
print(df.info())

print()

print("We also make the preliminary observation that the columns named 'environmental', 'social' and 'unnamed: 5' \
have lots of 'NaN' values. We will deal with them later.")

The dataframe has 184187 rows and 11 columns.

The overall dataframe information is given below:
<class 'pandas.core.frame.DataFrame'>
Index: 184187 entries, 0 to 1048574
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   campaign_name       184186 non-null  object 
 1   blurb               184184 non-null  object 
 2   main_category       176465 non-null  object 
 3   sub_category        184186 non-null  object 
 4   is_environmental    2053 non-null    object 
 5   is_social           2053 non-null    object 
 6   country             184186 non-null  object 
 7   duration_in_days    184186 non-null  float64
 8   goal_usd            184186 non-null  float64
 9   pledged_amount_usd  184186 non-null  float64
 10  is_success          184186 non-null  object 
dtypes: float64(3), object(8)
memory usage: 16.9+ MB
None

We also make the preliminary observation that the columns named 'environmental', 'social' and 'unnamed: 5' have lots of 'NaN' values. We will deal with them later.

In [None]:
rows_na=df[df.isna().sum(axis=1)>2]
rows_na
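
To see where the missing values are concentrated, the row filter above can be complemented with per-column counts. The sketch below uses a made-up toy frame; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the dataset.
toy = pd.DataFrame({
    "campaign_name": ["A", "B", None, "D"],
    "main_category": ["Music", None, None, "Technology"],
    "is_environmental": [np.nan, "yes", np.nan, np.nan],
    "is_social": [np.nan, "no", np.nan, np.nan],
})

# Missing values per column, largest first.
nulls_per_column = toy.isna().sum().sort_values(ascending=False)

# Missing values per row, and the rows with more than two of them.
nulls_per_row = toy.isna().sum(axis=1)
rows_many_nulls = toy[nulls_per_row > 2]

print(nulls_per_column)
print(rows_many_nulls)
```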

Next we give the columns meaningful names that reflect the data they contain, and re-order them.

In [None]:
#Rename the columns to meaningful names
df.rename(columns={'Environmental':'is_environmental',
                   'Social':'is_social',
                   'state':'is_success',
                   'Unnamed: 5':'main_category',
                   'Subcategory':'sub_category',
                   'converted_pledged_amount':'pledged_amount_usd',
                   'goal':'goal_in_local_currency',
                   'duration':'duration_in_days',
                   'name':'campaign_name',
                   'pledged':'pledged_in_local_currency',
                   },inplace=True)
df.sample(2)
#Reorder the columns
print('column_names:',list(df.columns))
df=df[['campaign_name', 
       'blurb', 
       'slug', 
       'main_category',
       'sub_category', 
       'is_environmental', 
       'is_social', 
       'country', 
       'country_displayable_name', 
       'created_at', 
       'launched_at', 
       'deadline', 
       'duration_in_days', 
       'currency', 
       'goal_in_local_currency', 
       'pledged_in_local_currency', 
       'usd_pledged',
       'pledged_amount_usd', 
       'staff_pick', 
       'state.1', 
       'fx_rate', 
       'static_usd_rate', 
       'usd_exchange_rate',
       'is_success',]]
df.sample(4)
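
One caveat of `DataFrame.rename` is that, by default, keys that do not match any column are silently ignored, so a typo in the mapping goes unnoticed. The sketch below, on a made-up toy frame, shows how `errors="raise"` could be used to catch this; it is an optional safeguard, not what the notebook does.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({"name": ["A"], "state": ["successful"], "blurb": ["text"]})
mapping = {"name": "campaign_name", "state": "is_success"}

# errors="raise" makes pandas fail if a key in the mapping is not a column.
renamed = toy.rename(columns=mapping, errors="raise")
ordered = renamed[["campaign_name", "blurb", "is_success"]]

# A misspelled key ('Name' instead of 'name') now raises instead of passing silently.
try:
    toy.rename(columns={"Name": "campaign_name"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(ordered.columns), typo_detected)
```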

In [29]:
print('The new column_names are:\n','\n'.join(list(df.columns)))

The new column_names are:
 campaign_name
blurb
slug
main_category
sub_category
is_environmental
is_social
country
country_displayable_name
created_at
launched_at
deadline
duration_in_days
currency
goal_in_local_currency
pledged_in_local_currency
usd_pledged
pledged_amount_usd
staff_pick
state.1
fx_rate
static_usd_rate
usd_exchange_rate
is_success


Next we drop the columns which are redundant or which do not add any value to the analysis. The dropped columns are as follows:

1. <b>'country' and 'country_displayable_name':</b>

    We need only one of these; but we save the country codes for later reference.

2. <b>'created_at', 'launched_at', 'deadline', 'duration':</b>

    There is no discernible difference between 'created_at' and 'launched_at': they are at most a few days apart, which is too little to affect the results we look for. 'duration_in_days' gives the difference in days between 'launched_at' and 'deadline', and we keep this parameter (for now).

3. <b>'currency', 'goal_in_local_currency', 'pledged_in_local_currency', 'usd_pledged', 'pledged_amount_usd', 'fx_rate', 'static_usd_rate', 'usd_exchange_rate':</b>

    The goal is given only in local currency, whereas the pledged amount is given in both local currency and USD. 
    We add a new column, 'goal_usd', which gives the goal in USD as well. It is obtained by multiplying 'goal_in_local_currency' by the provided 'usd_exchange_rate' (logic: 'pledged_amount_usd', originally named 'converted_pledged_amount', is provided by the author as the product of 'usd_exchange_rate' and 'pledged_in_local_currency'). In the end we retain 'goal_usd' and 'pledged_amount_usd' and drop the other currency, exchange-rate and local-currency columns.

4. <b>'staff_pick' and 'state.1':</b>
    These columns are dropped, since 'state.1' is a repetition of the column 'is_success' and 'staff_pick' does not seem to add value to the analysis at hand.

5. <b>'slug' and 'campaign_name':</b>
    'slug' is a repetition of 'campaign_name', so it is dropped.
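
The reasoning in point 3 can be checked on a small example before it is applied to the full dataset. The sketch below uses made-up numbers and assumes that a 1% relative tolerance is adequate for rounding differences; it first tests whether the pledged amount in USD is indeed the product of the local amount and 'usd_exchange_rate', and then derives the goal in USD in the same way.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names follow the renamed dataset.
toy = pd.DataFrame({
    "goal_in_local_currency": [1000.0, 5000.0],
    "pledged_in_local_currency": [1200.0, 250.0],
    "usd_exchange_rate": [1.25, 0.5],
    "pledged_amount_usd": [1500.0, 125.0],
})

# Check the assumption: pledged (USD) ~ pledged (local) * exchange rate.
implied = toy["pledged_in_local_currency"] * toy["usd_exchange_rate"]
rate_is_consistent = np.isclose(implied, toy["pledged_amount_usd"], rtol=0.01)

# Apply the same conversion to the goal.
toy["goal_usd"] = (toy["goal_in_local_currency"] * toy["usd_exchange_rate"]).round(2)

print(rate_is_consistent.all())
print(toy[["goal_usd", "pledged_amount_usd"]])
```

If the consistency check failed for a noticeable share of rows in the real data, a different rate column would have to be considered for the conversion.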

In [None]:
#Drop columns which do not add value to the analysis
#---------------------------------------------------
#1. 'country' and 'country_displayable_name'.
# We need only one of these, but we save the country codes for later reference.
df[['country_displayable_name','country']].drop_duplicates().reset_index(drop=True).to_csv('./data/country_codes.txt',sep='\t', index=False)

#---------------------------------------------------
#2. 'created_at', 'launched_at', 'deadline', 'duration'
# There is no discernible difference between created_at and launched_at: they are at most a few days apart, which is too little
# to affect the results we look for. duration gives the difference in days between launched_at and deadline and we keep this parameter (for now).

#-------------------------------------------------
#3. 'currency', 'goal_in_local_currency', 'pledged_in_local_currency', 'usd_pledged', 'pledged_amount_usd',
# 'fx_rate', 'static_usd_rate', 'usd_exchange_rate'
# The goal is given only in local currency, whereas the pledged amount is given in both local currency and USD.
# We add a new column, 'goal_usd', which gives the goal in USD as well. It is obtained by multiplying 'goal_in_local_currency' by
# the provided 'usd_exchange_rate' (logic: 'pledged_amount_usd' is provided by the author as the product of 'usd_exchange_rate'
# and 'pledged_in_local_currency').
df['goal_usd']= df['goal_in_local_currency']*df['usd_exchange_rate'] 
# In the end we retain 'goal_usd' and 'pledged_amount_usd' and drop the other currency, exchange-rate and
# local-currency goal and pledged columns.

#-------------------------------------------------
#4. 'staff_pick' and 'state.1'
# These columns are dropped, since state.1 is a repetition of the column 'is_success' and 'staff_pick' does not seem to add value to the
# analysis at hand.

#-------------------------------------------------
#5. 'slug' and 'campaign_name'
# 'slug' is a repetition of 'campaign_name', so it is dropped.

#-------------------------------------------------
#Drop unwanted columns
columns_to_drop= ['country_displayable_name', 
       'slug',
       'created_at', 
       'launched_at', 
       'deadline', 
       'currency', 
       'goal_in_local_currency', 
       'pledged_in_local_currency', 
       'usd_pledged',
       'staff_pick', 
       'state.1', 
       'fx_rate', 
       'static_usd_rate', 
       'usd_exchange_rate',]
# errors='ignore' skips columns that are already absent, so the cell can be re-run safely
df = df.drop(columns=columns_to_drop, errors='ignore')

#-------------------------------------------------
#Reorder columns
print('column_names:',list(df.columns))
df=df[['campaign_name', 
       'blurb',
       'main_category', 
       'sub_category', 
       'is_environmental', 
       'is_social', 
       'country', 
       'duration_in_days', 
       'goal_usd',
       'pledged_amount_usd', 
       'is_success', 
       ]]
#Round floating-point values to 2 decimal places
df=df.round(2)
df.sample(3)

In [43]:
print('Overview of the selected features:\n','\n'.join(list(df.columns)))

Overview of the selected features:
 campaign_name
blurb
main_category
sub_category
is_environmental
is_social
country
duration_in_days
goal_usd
pledged_amount_usd
is_success


### D. Save dataset
--- -------------------

After this we save the data to a local file for the next set of analyses.

In [8]:
#Save the processed data
df.dropna(how='all', inplace=True)
df.to_csv('./data/dataframe_stripped_features.csv', index=False)
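
Before moving on to the next notebook, it may be worth confirming that the saved file reads back as expected. The following sketch performs such a round trip on a made-up toy frame in a temporary directory; it does not touch the project's `./data` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with made-up values; column names follow the final selection.
toy = pd.DataFrame({
    "campaign_name": ["A", "B"],
    "goal_usd": [1250.0, 2500.5],
    "is_success": ["successful", "failed"],
})

# Write and immediately re-read the file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataframe_stripped_features.csv"
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(toy.columns)
same_shape = reloaded.shape == toy.shape
print(same_columns, same_shape)
```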

In [1]:
!jupyter nbconvert --to webpdf 01.Dataset_Prep.ipynb --no-input

[NbConvertApp] Converting notebook 01.Dataset_Prep.ipynb to webpdf
[NbConvertApp] Building PDF
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 65479 bytes to 01.Dataset_Prep.pdf