## Identification of sustainability-focused campaigns on the Kickstarter crowdfunding platform using NLP and ML boosted with swarm intelligence
--- ------------------
<div>
Data Analysis: part 2
<br>
Submitted by: Jossin Antony<br>
Affiliation: THU Ulm<br>
Date: 11.06.2024
</div>

## Overview
- [Introduction]()
- [Extraction of key words]()
- [Attention!]()
- [Categorize the dataframe]()
- [To Dos]()

### A. Introduction
--- -------------------

We continue our analysis with the filtered dataset from part 1. The dataset contains the features 'is_environmental' and 'is_social', which are considered essential for the upcoming analyses. However, only about 1% of the rows hold values in these columns.
In this script, we try to populate the remaining rows with values using NLP analyses.

In [61]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)

import numpy as np
import sklearn.model_selection as ms
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
import re
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

from pprint import PrettyPrinter
pp= PrettyPrinter()

In [51]:
#load the dataset
df= pd.read_csv('./data/dataframe_stripped_features.csv', low_memory=False)

In [52]:
print('We print 2 random rows of the dataset for preliminary impressions.')
pd.set_option('display.max_colwidth', 50)
df.sample(2)


We print 2 random rows of the dataset for preliminary impressions.


Unnamed: 0,campaign_name,blurb,main_category,sub_category,is_environmental,is_social,country,duration_in_days,goal_usd,pledged_amount_usd,is_success
13946,Enchanted Woods Lenormand Oracle Deck,A hand illustrated Lenormand tarot/oracle deck...,Art,Illustration,,,US,30.0,1300.0,2880.0,successful
30689,The Black-Jack Demon,"To avenge his father's unnatural murder, Silas...",Comics,Comic Books,,,US,32.74,3000.0,3436.0,successful


In [64]:
df['funding_acquired_percent']=((df['pledged_amount_usd']/df['goal_usd'])*100).round()

In [54]:
df= df[df['goal_usd']>=1000]

In [55]:
len(df)

139608

In [None]:
df.head(50)

In [65]:
df= df[df['funding_acquired_percent'] >=50]

In [66]:
len(df)

78447

In [70]:
pd.set_option('display.max_rows', None)
df[df['funding_acquired_percent']>1000]['funding_acquired_percent'].value_counts().sort_index(ascending=False)

funding_acquired_percent
39595.0    1
39182.0    1
36554.0    1
35537.0    1
31959.0    1
31202.0    1
30912.0    1
30665.0    1
25600.0    1
24729.0    1
23866.0    1
21742.0    1
18575.0    1
18483.0    1
18294.0    1
17965.0    1
17950.0    1
17587.0    1
17071.0    1
17031.0    1
16973.0    1
16870.0    1
16641.0    1
15867.0    1
15620.0    1
15618.0    1
15537.0    1
15342.0    1
15251.0    1
15099.0    1
15007.0    1
14927.0    1
14705.0    1
14322.0    1
14070.0    1
14044.0    1
13908.0    1
13812.0    1
13776.0    1
13501.0    1
13456.0    1
12905.0    1
12902.0    1
12848.0    1
12838.0    1
12756.0    1
12750.0    1
12200.0    1
12194.0    1
12081.0    1
11714.0    1
11442.0    1
11307.0    1
11217.0    1
11103.0    1
11047.0    1
10998.0    1
10973.0    1
10844.0    1
10668.0    1
10641.0    1
10635.0    1
10541.0    1
10514.0    1
10477.0    1
10465.0    1
10425.0    1
10402.0    1
10384.0    1
10381.0    1
10196.0    1
10138.0    1
10113.0    1
9859.0     1
9811.0     1


In [72]:
df[df['funding_acquired_percent']>500].count()

campaign_name               4812
blurb                       4812
main_category               4812
sub_category                4812
is_environmental              54
is_social                     54
country                     4812
duration_in_days            4812
goal_usd                    4812
pledged_amount_usd          4812
is_success                  4812
funding_acquired_percent    4812
dtype: int64

In [85]:
df['duration_in_days']= np.ceil(df['duration_in_days']/30)

In [86]:
df['duration_in_days'].value_counts()

duration_in_days
1.0    47252
2.0    30221
3.0      943
4.0       30
5.0        1
Name: count, dtype: int64

In [90]:
df[df['goal_usd']>170000].value_counts().sort_index(ascending=False)

campaign_name                                                 blurb                                                                                                                                    main_category  sub_category  is_environmental  is_social  country  duration_in_days  goal_usd  pledged_amount_usd  is_success  funding_acquired_percent
Star Wars Toy Guide: Vol 1 - Kenner Action Figures 1977-1985  An in-depth guide to the original Kenner line                                                                                            Publishing     Art Books     No                No         GB       2.0               171735.8  193335.0            successful  113.0                       1
Lightseekers                                                  A next generation adventure role playing video game connecting smart action figures, trading cards, & comics in ways never seen before.  Games          Video Games   No                No         US       1.0               200000.0 

### B. Extraction of key words
--- -------------------

We try to extract the main keywords that help to classify the blurbs (the project descriptions) as environmentally or socially relevant.

Some of the data are manually curated and classified as socially or environmentally relevant. We start with the analysis of this data in the hopes that it might reveal some clues to understand how the data was actually classified, beyond the human notions of what is socially or environmentally relevant.

First we replace all the NaN values with the term 'unspecified'. Next we check how many samples were manually curated.

In [4]:
# fill the fields with NaN as 'unspecified'
df = df.fillna('unspecified')

# Extract the samples having values in 'is_environmental' and 'is_social' columns
df_is_envt_or_social= df[((df['is_environmental']!='unspecified')) &
                           ((df['is_social']!='unspecified'))]

df_is_envt_or_social=df_is_envt_or_social[['campaign_name','blurb','is_environmental', 'is_social']]

print(f'Observation: The Dataset has {df_is_envt_or_social.shape[0]} rows with an "Yes" or "No" value in "is_environment" and "is_social" columns.\n\
Note: Due to manual curation all the selected samples have values in both "is_environmental" and "is_social" columns.\n')

Observation: The Dataset has 1944 rows with an "Yes" or "No" value in "is_environment" and "is_social" columns.
Note: Due to manual curation all the selected samples have values in both "is_environmental" and "is_social" columns.



Next we check the proportion of 'Yes' and 'No' values.



In [6]:
#Check if the classes are balanced
print(df_is_envt_or_social['is_environmental'].value_counts())
print(df_is_envt_or_social['is_social'].value_counts())

is_environmental
No     1903
Yes      41
Name: count, dtype: int64
is_social
No     1918
Yes      26
Name: count, dtype: int64


**Observation:** The classes are heavily imbalanced. An overwhelming number of samples are classified 'No' for social or environmental relevance. Classical machine-learning classifiers are hard to apply directly here, because the 'null accuracy' (always predicting 'No') is already about 98% for both columns.
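To make the 'null accuracy' concrete, the following sketch computes it from the class counts printed above. The helper name `null_accuracy` is an illustrative assumption and not part of the notebook.

```python
def null_accuracy(class_counts):
    """Accuracy of a baseline that always predicts the majority class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Class counts taken from the value_counts() output above
counts_environmental = {'No': 1903, 'Yes': 41}
counts_social = {'No': 1918, 'Yes': 26}

print(f'null accuracy (is_environmental): {null_accuracy(counts_environmental):.3f}')
print(f'null accuracy (is_social): {null_accuracy(counts_social):.3f}')
```

Any classifier trained on these labels would have to beat this baseline to be useful.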

As an alternate (and easy) approach, we try to find the most important words that appear in the 'blurb' classified as socially/environmentally relevant. 

We start with the ['tf-idf'](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) algorithm. The aim is to calculate the mean tf-idf scores of the words that appear in the corpus marked as socially or environmentally relevant, and later to use the appearance of these words to classify uncategorized extracts.

Note:
- We use [stemming](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) to find the 'root' form of the words that appear in the corpus. We start the analysis with [snowball stemming](https://www.ibm.com/topics/stemming#Types+of+stemming+algorithms).
- The [stopwords in english](https://gist.github.com/sebleier/554280) (e.g. 'and', 'these') are omitted from the analysis. Similarly, all numbers, symbols etc. are also ignored (e.g: 'covid-19' -> 'covid').
- To increase the amount of training data, the 'campaign_name' is also considered along with 'blurb'.
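The effect of the token pattern can be checked in isolation. The sketch below applies the same regular expression as `TOKEN_PATTERN` in the next cell using Python's `re` module. The stop-word set is a small hypothetical stand-in for the scikit-learn English list, and stemming is omitted.

```python
import re

# Same pattern as TOKEN_PATTERN below: alphabetic tokens of at least two letters
TOKEN_PATTERN = r'(?u)\b[a-zA-Z]{2,}\b'

# Hypothetical, tiny stop-word set (the notebook uses scikit-learn's English list)
DEMO_STOP_WORDS = {'and', 'these', 'the', 'for', 'to', 'of'}

def tokenize(extract):
    """Lower-case the text, keep alphabetic tokens and drop stop words."""
    tokens = re.findall(TOKEN_PATTERN, extract.lower())
    return [token for token in tokens if token not in DEMO_STOP_WORDS]

print(tokenize('Safeguard public health w.r.t Covid-19'))
```

Single letters and digits are dropped, which is why 'covid-19' is reduced to 'covid'.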

In [7]:
#Easy test: tf-idf
# tf-idf 
#STOP_WORDS='english'
STOP_WORDS = list(text.ENGLISH_STOP_WORDS.union([str(i) for i in range(10)]))
MIN_DOCS= .05
TOKEN_PATTERN= '(?u)\\b[a-zA-Z]{2,}\\b'

def stem(extract):
    stemmer = SnowballStemmer("english")
    return [' '.join([stemmer.stem(token) for token in word_tokenize(text)]) for text in extract]

def transform_extract(extract):
    return stem(re.sub(r'[\W,]+', ' ', extract).replace('-', ' ').lower().split())

def rank_words(df, ranked_words, column_affirmative, column_ranked_words='ranked_words', threshold=0.05):
    df[column_affirmative] = 0
    df['combined_description']=''
    
    df.loc[:,'combined_description'] = df.loc[:,'campaign_name'] + ' ' + df.loc[:,'blurb']

    df.loc[:,column_ranked_words] = df.loc[:,'combined_description'].apply(
        lambda blurb: [word if word in ranked_words.index and ranked_words.loc[word] > threshold else '' 
                        for word in transform_extract(blurb)]
                        ).apply(lambda x: list(filter(None, x)))
    
    df.loc[:,column_affirmative]= df.loc[:,column_ranked_words].apply(len)
    df.drop(columns=['combined_description'], axis=1,inplace=True)
    return df

def get_word_count_in_classified_blurbs(df, count_column):
    return df[count_column].apply(lambda x: 'No keyword' if x == 0 else 'at least one keyword').value_counts()


In [8]:
#1. Find top ranking words in samples classified as environmental
#----------------------------------------------------------------
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df= MIN_DOCS, token_pattern=TOKEN_PATTERN)

#blurb_is_environmental= df_is_envt_or_social[df_is_envt_or_social['is_environmental'] == 'Yes']['blurb'].tolist()
blurb_is_environmental = df_is_envt_or_social[df_is_envt_or_social['is_environmental'] == 'Yes'][['campaign_name', 'blurb']].agg(' '.join, axis=1).tolist()
blurb_is_environmental = stem([text.lower() for text in blurb_is_environmental])

#Vectorization of corpus
tf_idf_vector = tf_idf_model.fit_transform(blurb_is_environmental)

# #Get original terms in the corpus
words_set = tf_idf_model.get_feature_names_out()

# #Data frame to show the TF-IDF scores of each document
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)

# Calculate the mean TF-IDF score for each word
word_importance = df_tf_idf.mean(axis=0)

# Sort words based on the mean TF-IDF scores
ranked_words_environmental = word_importance.sort_values(ascending=False)

# Print the ranked words
print(f'''No. of identified top words distinguising environmentally relevant blurbs: {len(ranked_words_environmental)}.\n
The first column represents the relevant words and the second column gives the mean tf-idf score''')
print('Note: The word are in the stemmed format. e.g "sustain" can mean "sustainability", "sustaining", "sustained" etc.')
print()
print('Top words (is_environmental)')
print('-----------------------------')
print(ranked_words_environmental)


No. of identified top words distinguising environmentally relevant blurbs: 33.

The first column represents the relevant words and the second column gives the mean tf-idf score
Note: The word are in the stemmed format. e.g "sustain" can mean "sustainability", "sustaining", "sustained" etc.

Top words (is_environmental)
-----------------------------
sustain      0.146818
organ        0.121158
natur        0.084267
friend        0.07456
design        0.07273
eco          0.069532
food         0.060402
build        0.060103
recycl       0.058756
farm          0.05819
make         0.057818
world        0.057185
healthi      0.055094
use           0.04879
produc       0.048043
small        0.046716
save         0.045356
fashion      0.044635
local          0.0441
provid       0.042886
anim         0.042846
compani      0.042332
communiti    0.041974
hand         0.040683
awar         0.040172
rais         0.040172
vegan        0.040008
mobil        0.039819
project       0.03176
creat      

Now we verify that this approach works. We expect the words identified as relevant to occur at least once in the samples manually curated as relevant, and not at all in the samples manually curated as irrelevant. From this we calculate the accuracy as the number of correctly classified samples divided by the total number of classified samples.

We try this approach first on the samples marked as 'environmentally' relevant.

In [9]:
df_envt= df_is_envt_or_social[df_is_envt_or_social['is_environmental']=='Yes'].copy()
df_envt.drop(columns=['is_social'],axis=1,inplace=True)
rank_words(df_envt, ranked_words_environmental, column_affirmative='yes_count: is_envt',threshold=0.052)
word_count_classified_envt= get_word_count_in_classified_blurbs(df_envt,'yes_count: is_envt' )
print('Categorization summary')
print('========================')
print(word_count_classified_envt)
print(f'accuracy: {word_count_classified_envt.iloc[0]/(word_count_classified_envt.iloc[0]+word_count_classified_envt.iloc[1]):.4f}')

Categorization summary
yes_count: is_envt
at least one keyword    36
No keyword               5
Name: count, dtype: int64
accuracy: 0.8780
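The figure above is computed only on samples manually marked 'Yes', so it corresponds to the recall of the keyword rule and says nothing about false positives among the 'No' samples. A minimal sketch of how both quantities could be computed is given below. The function name and the toy labels are hypothetical and only illustrate the calculation.

```python
def precision_recall(true_labels, predicted_labels, positive='Yes'):
    """Return (precision, recall) for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall

# Toy data (hypothetical): manual labels vs. "at least one keyword" predictions
manual = ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No']
keyword_rule = ['Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No']

precision, recall = precision_recall(manual, keyword_rule)
print(f'precision: {precision:.2f}, recall: {recall:.2f}')
```

Applied to the full curated set (both 'Yes' and 'No' rows), such a check would show how often generic keywords trigger on irrelevant blurbs.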


**Observation:**

Out of the 41 samples available, 36 were classified correctly and 5 incorrectly, giving us an accuracy of ~88%.
We can also inspect the data frame in detail, so that we know where the results were false.

In [10]:
df_envt

Unnamed: 0,campaign_name,blurb,is_environmental,yes_count: is_envt,ranked_words
47,Beluga tent 6-in-1 from Qaou,The first all in one highly eco-friendly tent made from recycled plastic.,Yes,3,"[eco, friend, recycl]"
71,"Thé-tis Tea : Plant-based seaweed tea, rich in minerals","Delicious tea infusion made with seaweed. Healthy, organic, plant-based, eco-friendly, and rich-mineral tea for vegans.",Yes,4,"[healthi, organ, eco, friend]"
99,baby food,"Inspired by the selection at the Grocery mart, I want to make Safe Healthy Nutritious Slurpable foods for baby. No preservatives added.",Yes,4,"[food, make, healthi, food]"
125,Chique Addiction,"High fashions made from ethical and sustainable, environmentally friendly, vegan fabrics for the modern world.",Yes,3,"[sustain, friend, world]"
163,Hearth & Market - Wood Fired Food Truck & Mobile Market,"A wood fired food truck & mobile farmers market that connects you to our farm, way of life and certified organic produce and products.",Yes,4,"[food, food, farm, organ]"
170,Sutra (Thread)Hand Dyed Hand Spinned Sustainable Yarn,To create yarn&projects out of sustainable bamboo&hemp fiber with the desert dye cochineal.,Yes,2,"[sustain, sustain]"
233,"Rebel Swim - Men's swim shorts, designed with a purpose!",Buy a pair of our beautiful men's swim shorts and protect an endangered animal!,Yes,1,[design]
284,"Ash Apothecary: Small Batch, All-Natural Simple Syrup","Small-batch simple syrups for bartending, mixology, coffee, cocktails, soda, chai, and more. Only organic and non-GMO ingredients.",Yes,2,"[natur, organ]"
331,Pawstively Droolicious,An all natural and homemade dog treats that are personalized to every dog's needs and desires.,Yes,1,[natur]
351,Stitchmill Clothing // The Perfect Henley Shirt,Redefining Henley fashion for women and men. Sustainably making the highest quality Henleys in the U.S.A. Be You. Be Confident.,Yes,2,"[sustain, make]"


**Observations:**
- Some of the terms identified (e.g. row:99->'food', row:233->'design') might not be relevant environmentally and may have to be removed from the list of ranked words. (How? -> More on this in sections below.)
- The occurrence of several different words, or of the same word multiple times, from the list of ranked words in a sample increases the likelihood that the sample is correctly classified as relevant. We will later use this feature to our advantage.
- Sample 233 is correctly classified, but for the wrong reason! It matched the word 'design' from the ranked word list, whereas the ideal match would have been 'endangered', which does not appear in our ranked word list. This again emphasizes the importance of more training data. The same can be said of the other incorrectly classified samples in the dataset. (More on how to circumvent this issue is discussed in later sections.)

We now apply the same approach to samples marked as socially relevant. The results are:

In [11]:
#----------------------------------------------------------------
#2. Find top ranking words in samples classified as social
#----------------------------------------------------------------
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df= MIN_DOCS, token_pattern=TOKEN_PATTERN)

blurb_is_social = df_is_envt_or_social[df_is_envt_or_social['is_social'] == 'Yes'][['campaign_name', 'blurb']].agg(' '.join, axis=1).tolist()
blurb_is_social = stem([text.lower() for text in blurb_is_social])

tf_idf_vector = tf_idf_model.fit_transform(blurb_is_social)
words_set = tf_idf_model.get_feature_names_out()
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)
word_importance = df_tf_idf.mean(axis=0)
ranked_words_social = word_importance.sort_values(ascending=False)

print(f'No. of identified top words distinguising socially relevant blurbs: {len(ranked_words_social)}')
print('Top words (is_social)')
print('-----------------------------')
print(ranked_words_social)


No. of identified top words distinguising socially relevant blurbs: 31
Top words (is_social)
-----------------------------
communiti    0.121427
support      0.114027
project      0.078614
build          0.0724
free         0.069167
covid         0.06393
area         0.062197
public       0.061306
hous         0.060613
card         0.059676
rais         0.055751
awar         0.055751
make         0.054066
shirt        0.052792
live         0.051397
school       0.049044
help         0.047177
fund         0.046626
solut        0.045875
film         0.044926
know         0.041608
fight        0.041572
save         0.041075
end            0.0406
main         0.038005
kid          0.036531
app          0.036265
children     0.033609
individu     0.032786
risk         0.032578
creat        0.031378
dtype: Sparse[float64, 0]


The accuracy on training data is as follows:

In [12]:
df_social= df_is_envt_or_social[df_is_envt_or_social['is_social']=='Yes'].copy()
df_social.drop('is_environmental',axis=1,inplace=True)

rank_words(df_social, ranked_words_social, column_affirmative='yes_count: is_social')

word_count_classified_social= get_word_count_in_classified_blurbs(df_social,'yes_count: is_social' )
print(word_count_classified_social)
print(f'accuracy: {word_count_classified_social.iloc[0]/(word_count_classified_social.iloc[0]+word_count_classified_social.iloc[1]):.4f}')

yes_count: is_social
at least one keyword    24
No keyword               2
Name: count, dtype: int64
accuracy: 0.9231


We now inspect the data frame in detail.

In [13]:
df_social

Unnamed: 0,campaign_name,blurb,is_social,yes_count: is_social,ranked_words
6,Surviving the Unknown,A family struggles to survive off the grid in secrecy. But it's more than just the harsh elements that are tearing them apart.,Yes,0,[]
23,The Call - a voice to the voiceless,"This is a project, which aims to save lives of unarmed men, women and children trapped in war, who reject to participate in violence!",Yes,2,"[project, live]"
55,Et al. Creatives,"A collaborative employment, resource, and community platform.",Yes,1,[communiti]
63,the breast express,pumpspotting is going cross-country to support & show up for breastfeeding moms and document the boob-venture of a lifetime.,Yes,1,[support]
88,Gay Occasions,"I was looking in a card shop for a card for my fiancée, and was struck by the lack of LGBT cards available. Let's make it happen.",Yes,4,"[card, card, card, make]"
100,MIRZ PLAYING CARDS : 2ND EDITION (feat. Hope For Justice),Change lives. End Slavery.,Yes,2,"[card, live]"
130,Seattle Streets to Main Street: End Child Trafficking.,Help me build the social impact of my award winning documentary “The Long Night” and get the film to audiences everywhere.,Yes,1,[build]
150,MizaBella After School Project,Teaching Kids How To Knit,Yes,1,[project]
210,Aegis,Aegis- A turnkey security solution that scans the area for security threats and risks to safeguard public health w.r.t Covid-19,Yes,3,"[area, public, covid]"
371,"Little Free Library in West Louisville, Kentucky","Support the creation of a little free library in West Louisville, Kentucky.",Yes,3,"[free, support, free]"


**Observations**
- The observations we made in the case of environmentally relevant samples are more or less valid in the case of socially relevant samples also.
- Row:100 is interesting. It is correctly classified, but it is questionable whether the word 'card' is really relevant (emphasis on more training data!). It is also questionable whether the manual curation is correct in this case, since the project is about playing cards.

Finally, we also derive the top-ranked words for the data frame as a whole. A small overlap between this list and the other ranked word lists would confirm that the list derived for each topic is indeed distinct and representative of that topic.

In [14]:
#----------------------------------------------------------------
#3. Find top ranking words in all samples, for the sake of completeness
#----------------------------------------------------------------
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df= .05)

blurb_all = df[['campaign_name', 'blurb']].agg(' '.join, axis=1).tolist()
blurb_all = stem([text.lower() for text in blurb_all])

tf_idf_vector = tf_idf_model.fit_transform(blurb_all)
words_set = tf_idf_model.get_feature_names_out()
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)
word_importance = df_tf_idf.mean(axis=0)
ranked_words = word_importance.sort_values(ascending=False)
print(f'No. of identified top words in all blurbs: {len(ranked_words)}')
print('Top words (all)')
print('-----------------------------')
print(ranked_words)

No. of identified top words in all blurbs: 12
Top words (all)
-----------------------------
new         0.08107
help       0.077095
book       0.062207
make       0.057526
film       0.057032
world      0.056284
art        0.053833
creat      0.053278
album      0.051541
music       0.05118
project    0.045769
need       0.036496
dtype: Sparse[float64, 0]


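As expected, the highest-ranked words in the entire dataset differ from those in the other two lists, although some generic stems (e.g. 'make', 'project', 'creat') appear in all three. The five highest-ranked words of the social and environmental lists do not overlap at all: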

In [23]:
common_words_social_environment = set(ranked_words_social.head(5).index) & set(ranked_words_environmental.head(5).index)
print(common_words_social_environment)

set()
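The check above compares only the five highest-ranked words of each list. A slightly broader comparison is sketched below on the ten highest-ranked stems copied from the two outputs in section B. The helper `jaccard` is an illustrative assumption.

```python
# Ten highest-ranked stems copied from the outputs in section B
top_environmental = ['sustain', 'organ', 'natur', 'friend', 'design',
                     'eco', 'food', 'build', 'recycl', 'farm']
top_social = ['communiti', 'support', 'project', 'build', 'free',
              'covid', 'area', 'public', 'hous', 'card']

def jaccard(words_a, words_b):
    """Share of distinct words that the two lists have in common."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / len(set_a | set_b)

shared = set(top_environmental) & set(top_social)
print(f'shared stems: {shared}')
print(f'Jaccard similarity: {jaccard(top_environmental, top_social):.3f}')
```

A value close to 0 indicates largely distinct lists; shared stems are candidates for manual pruning.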


### C. Attention!
--- -------------------
Some of the important parameters in the tf-idf algorithm relevant to our analysis are:
1.  **min-df:**

    This is the minimum number (or, as here, fraction) of documents in which a word must appear in order to be considered. To represent a topic well, min_df should ideally be large. However, we have very little training data (41 environmentally relevant and 26 socially relevant samples) and risk losing information with a higher min_df value. We initially set it to 0.05 (i.e. 5%): a term that appears in fewer than 5% of the documents (~2 documents) is ignored and not considered for analysis. This emphasizes the importance of having more training data.

    Please see *a. Illustration: min_df= 0.1* in the section below.

2. **mean tf-idf score:**
    It is possible to set a minimum threshold score so that words with scores below the threshold in the ranked word list are not considered for analysis. In the analyses above, the default value is 0.05 (0.052 was used for the environmentally relevant samples).

    Please see *b. Illustration: tf_df_threshold= 0.06* in the section below.
3. **words_num_threshold:**
    We saw in the previous sections that the more often ranked words appear in an extract, the stronger the categorisation is. To make use of this, we need to increase the confidence in the selected list of words.
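How a fractional `min_df` prunes the vocabulary can be illustrated without scikit-learn. The sketch below counts in how many documents each term occurs and keeps the terms whose share of documents reaches the threshold. The toy corpus and the function name `filter_by_min_df` are hypothetical, and the real `TfidfVectorizer` additionally applies stop-word removal and tf-idf weighting.

```python
from collections import Counter

def filter_by_min_df(documents, min_df):
    """Keep terms that occur in at least a fraction `min_df` of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document
        doc_freq.update(set(doc.lower().split()))
    n_docs = len(documents)
    return sorted(term for term, freq in doc_freq.items() if freq / n_docs >= min_df)

# Toy corpus (hypothetical), already stemmed
corpus = [
    'sustain organ farm',
    'sustain eco tent',
    'organ tea vegan',
    'sustain yarn',
]

print(filter_by_min_df(corpus, min_df=0.25))  # terms found in at least 1 of 4 documents
print(filter_by_min_df(corpus, min_df=0.5))   # terms found in at least 2 of 4 documents
```

Raising the threshold shrinks the vocabulary to the terms shared by several documents, which is the trade-off described for min-df above.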

Other methods to improve confidence:
- **Stemming and Lemmatization:** 

    We saw stemming in a previous section. We could also experiment with ['lemmatization'](https://www.ibm.com/topics/stemming-lemmatization?mhsrc=ibmsearch_a&mhq=lemmatization) and various combinations of both to try to improve the performance.
- **manual pruning of words:** 

    We saw in the previous sections that (due to insufficient training data) some top ranked words might not be relevant in the domain of investigation after all (e.g. 'card' in socially relevant topics). It is worth a try to manually prune the ranked words list and remove irrelevant words.
- **Manual inclusion of words:** 

    Similarly, it is also recommended to include words which might be relevant to the topic. For example, words such as 'tree', 'endangered', 'eBike' etc. might be relevant to environmental projects.
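Manual pruning and inclusion can be sketched as simple operations on a word-to-score mapping. In the example below the scores are copied from the environmental output in section B, while the helper `adjust_ranked_words`, the choice of pruned word and the score assigned to manually added words are assumptions made for illustration.

```python
def adjust_ranked_words(ranked_words, remove=(), add=(), added_score=1.0):
    """Return a copy of the mapping without `remove` and with `add` included."""
    adjusted = {word: score for word, score in ranked_words.items()
                if word not in remove}
    for word in add:
        # Manually added words get a fixed score so that they pass any threshold
        adjusted.setdefault(word, added_score)
    return adjusted

# A few (stem, mean tf-idf) pairs from the environmental list in section B
ranked_envt_demo = {'sustain': 0.146818, 'design': 0.07273, 'eco': 0.069532}

adjusted = adjust_ranked_words(ranked_envt_demo, remove={'design'}, add=['tree'])
print(adjusted)
```

Added words should be given in their stemmed form, because `rank_words` stems each blurb before matching it against the list.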

**a. Illustration: min_df= 0.1**

In [16]:
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df= .1, token_pattern=TOKEN_PATTERN)

#blurb_is_environmental= df_is_envt_or_social[df_is_envt_or_social['is_environmental'] == 'Yes']['blurb'].tolist()
blurb_demo = df_is_envt_or_social[df_is_envt_or_social['is_environmental'] == 'Yes'][['campaign_name', 'blurb']].agg(' '.join, axis=1).tolist()
blurb_demo = stem([text.lower() for text in blurb_demo])

#Vectorization of corpus
tf_idf_vector = tf_idf_model.fit_transform(blurb_demo)

# #Get original terms in the corpus
words_set = tf_idf_model.get_feature_names_out()

# #Data frame to show the TF-IDF scores of each document
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)

# Calculate the mean TF-IDF score for each word
word_importance = df_tf_idf.mean(axis=0)

# Sort words based on the mean TF-IDF scores
ranked_words_demo = word_importance.sort_values(ascending=False)

# Print the ranked words
print('min_df= 0.1')
print(f'''No. of identified top words distinguising environmentally relevant blurbs: {len(ranked_words_demo)}.\n
The first column represents the relevant words and the second column gives the mean tf-idf score''')
print('''We see stronger tf-idf scores, but lower number of terms which will have an effect on categorization.\
This means that there are fewer, but surer terms which indicate if the sample is relevant or not.''')
print()
print('Top words (is_environmental)')
print('-----------------------------')
print(ranked_words_demo)

min_df= 0.1
No. of identified top words distinguising environmentally relevant blurbs: 13.

The first column represents the relevant words and the second column gives the mean tf-idf score
We see stronger tf-idf scores, but lower number of terms which will have an effect on categorization.This means that there are fewer, but surer terms which indicate if the sample is relevant or not.

Top words (is_environmental)
-----------------------------
sustain    0.191448
organ      0.164938
friend     0.107089
natur      0.100755
design     0.090118
eco        0.087987
build      0.085141
farm       0.082849
food       0.072712
world      0.072143
make       0.071882
healthi    0.070624
produc     0.065731
dtype: Sparse[float64, 0]


**b. Illustration: tf_df_threshold= 0.06**

In [17]:
df_envt_demo= df_is_envt_or_social[df_is_envt_or_social['is_environmental']=='Yes'].copy()
df_envt_demo.drop('is_social',axis=1,inplace=True)
rank_words(df_envt_demo, ranked_words_environmental, column_affirmative='yes_count: is_envt',threshold=0.06)
word_count_classified_envt_demo= get_word_count_in_classified_blurbs(df_envt_demo,'yes_count: is_envt' )
print('Categorization summary')
print('========================')
print(word_count_classified_envt_demo)
print(f'accuracy: {word_count_classified_envt_demo.iloc[0]/(word_count_classified_envt_demo.iloc[0]+word_count_classified_envt_demo.iloc[1]):.4f}')

Categorization summary
yes_count: is_envt
at least one keyword    33
No keyword               8
Name: count, dtype: int64
accuracy: 0.8049


We see that the accuracy has dropped, understandably, because fewer terms are now available for categorization. But this is not necessarily bad! It is quite possible that we were overfitting on the training data and that the model would not work as expected on data it has not seen before. Therefore, it is again good to have more training data, so that we can increase the threshold confidently.
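The effect of the score threshold on the size of the keyword list can be reproduced from the scores printed in section B. The sketch below uses the fourteen highest-ranked environmental stems. The helper `words_above` is an illustrative assumption; it uses the same strict `>` comparison as `rank_words`.

```python
# Mean tf-idf scores of the highest-ranked environmental stems (from section B)
scores_environmental = {
    'sustain': 0.146818, 'organ': 0.121158, 'natur': 0.084267,
    'friend': 0.07456, 'design': 0.07273, 'eco': 0.069532,
    'food': 0.060402, 'build': 0.060103, 'recycl': 0.058756,
    'farm': 0.05819, 'make': 0.057818, 'world': 0.057185,
    'healthi': 0.055094, 'use': 0.04879,
}

def words_above(scores, threshold):
    """Words whose score is strictly greater than the threshold."""
    return [word for word, score in scores.items() if score > threshold]

for threshold in (0.05, 0.06):
    kept = words_above(scores_environmental, threshold)
    print(f'threshold={threshold}: {len(kept)} words kept')
```

Moving the threshold from 0.05 to 0.06 removes stems such as 'recycl' and 'farm' from the keyword list, which is consistent with the drop in matched samples seen above.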

### D. Categorize the dataframe
--- -------------------

We now use the selected word lists to categorize the whole data frame as socially or environmentally relevant. The parameters are kept at the default values mentioned in the previous sections. The words_num_threshold is set to 2, meaning that at least 2 occurrences of the same word or of different words must exist for the extract to be assigned to the respective category. The manually curated samples are NOT overwritten, irrespective of the resulting relevant word counts for these samples.
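The decision rule described above can be sketched as follows. The function name `label_from_keywords` and the example token lists are hypothetical; in the notebook the counting is done by `rank_words` and the comparison by `np.where`.

```python
def label_from_keywords(tokens, keywords, words_num_threshold=2):
    """Return 'Yes' if enough tokens (repeats included) are keywords, else 'No'."""
    hits = [token for token in tokens if token in keywords]
    return 'Yes' if len(hits) >= words_num_threshold else 'No'

# Keywords taken from the environmental list in section B
keywords_environmental = {'sustain', 'organ', 'natur', 'eco', 'recycl'}

# Hypothetical stemmed blurbs
print(label_from_keywords(['eco', 'friend', 'tent', 'recycl', 'plastic'], keywords_environmental))
print(label_from_keywords(['natur', 'dog', 'treat'], keywords_environmental))
```

With a threshold of 2, a single keyword hit is no longer sufficient, which trades some recall for fewer false positives.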

In [18]:
df_to_write= df.copy()
df_to_write.drop(columns=['sub_category', 'country', 'duration_in_days', 'goal_usd', 'pledged_amount_usd'], axis= 1, inplace=True)


In [19]:

df_curated= df_to_write[df_to_write['is_social']!='unspecified'].copy()

df_curated= rank_words(df_curated, ranked_words_environmental, 'count_is_envt', column_ranked_words='ranked_words_envt', threshold=0.05)
df_curated= rank_words(df_curated, ranked_words_social, 'count_is_social', column_ranked_words='ranked_words_social', threshold=0.05)
df_curated.drop(columns=['count_is_social', 'count_is_envt'],axis=1, inplace=True)

df_uncategorized= df_to_write.loc[df_to_write['is_social']=='unspecified'].copy()
df_uncategorized= rank_words(df_uncategorized, ranked_words_environmental, 'count_is_envt', column_ranked_words='ranked_words_envt', threshold=0.05)
df_uncategorized= rank_words(df_uncategorized, ranked_words_social, 'count_is_social', column_ranked_words='ranked_words_social', threshold=0.05)

df_uncategorized['is_environmental'] = np.where(df_uncategorized['count_is_envt'] >= 2, 'Yes', 'No')
df_uncategorized['is_social'] = np.where(df_uncategorized['count_is_social'] >= 2, 'Yes', 'No')
df_uncategorized.drop(columns=['count_is_social', 'count_is_envt'],axis=1, inplace=True)

df_categorized= pd.concat([df_curated, df_uncategorized])



In [20]:
#Save the processed data
#df_categorized.dropna(how='all', inplace=True)
#df_categorized.to_csv('./data/dataframe_categorized.csv', index=False)

After saving the data, we look at some metrics.

In [26]:
shape_envt= df_categorized[df_categorized['is_environmental']=='Yes'].shape
print(f'Number of samples marked as environmentally relevant: {shape_envt[0]}; i.e, {(shape_envt[0] *100/df_categorized.shape[0]):2.3f} % of total samples')

shape_social= df_categorized[df_categorized['is_social']=='Yes'].shape
print(f'Number of samples marked as socially relevant: {shape_social[0]}; i.e, {(shape_social[0] *100/df_categorized.shape[0]):2.3f} % of total samples')

shape_success= df_categorized[df_categorized['is_success']=='successful'].shape
print(f'Number of samples marked as success: {shape_success[0]}; i.e, {(shape_success[0] *100/df_categorized.shape[0]):2.3f} % of total samples')

shape_envt_success= df_categorized[(df_categorized['is_success']=='successful') & (df_categorized['is_environmental']=='Yes')].shape
print(f'Number of environmentally successful samples: {shape_envt_success[0]}; i.e, {(shape_envt_success[0] *100/df_categorized.shape[0]):2.3f} % of total samples')


shape_social_success= df_categorized[(df_categorized['is_success']=='successful') & (df_categorized['is_social']=='Yes')].shape
print(f'Number of socially successful samples: {shape_social_success[0]}; i.e, {(shape_social_success[0] *100/df_categorized.shape[0]):2.3f} % of total samples')


Number of samples marked as environmentally relevant: 11450; i.e, 7.097 % of total samples
Number of samples marked as socially relevant: 13914; i.e, 8.624 % of total samples
Number of samples marked as success: 93189; i.e, 57.757 % of total samples
Number of environmentally successful samples: 5192; i.e, 3.218 % of total samples
Number of socially successful samples: 8006; i.e, 4.962 % of total samples


Please also note that the data can of course contain false positives and false negatives. These can be reduced by suitably adjusting the parameters mentioned in the previous sections.

### E. To Dos
--- -------------------
- Critically analyze the findings of this notebook. Try different combinations of the suggested parameters and evaluate results.
- Critically analyze the categorized dataframe
- Manually enrich the training data as suggested in the previous sections and see if it brings out better results.

- Pivot:
    - summarize success to funding goal
    - categories
- Pie charts of:
    - success categories
    - goal_usd categories
    - main categories
    - sub categories


In [22]:
#!jupyter nbconvert --to HTML 02.Dataset_semantic_analysis.ipynb --no-input