## Dataset: Classification into is_environmental & is_social
--- ------------------

### A. Introduction
--- -------------------

The dataset contains the features 'is_environmental' and 'is_social', which are expected to be essential for the upcoming analyses. However, only 1% of the entries in these columns hold values.
In this script, we try to populate the remaining rows of these columns using NLP analyses.


Please make sure that the respective data files exist in the 'data' folder.
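
A quick way to verify this before running the notebook is sketched below. It is only a minimal check, assuming the single file name used in the loading cell; the helper `missing_data_files` is a hypothetical name introduced here for illustration.

```python
from pathlib import Path

def missing_data_files(data_dir, required):
    """Return the required file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in required if not (base / name).is_file()]

# File name taken from the loading cell; extend the list if more files are needed.
REQUIRED_FILES = ["dataframe_stripped_features.csv"]

missing = missing_data_files("./data", REQUIRED_FILES)
if missing:
    print(f"Missing data files: {missing}")
else:
    print("All required data files found.")
```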

#### Imports:
-- ----------

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np
import sklearn.model_selection as ms
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
import re
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

In [2]:
#load the dataset
df= pd.read_csv('./data/dataframe_stripped_features.csv', low_memory=False)
df.sample(2)

Unnamed: 0,campaign_name,blurb,main_category,sub_category,is_environmental,is_social,country,duration_in_days,goal_usd,pledged_amount_usd,is_success
73564,Alien Puppies - A Sci-Fi Card Game,🐶A Strategic Sci-Fi Card Game with Cyberpunk Pups. Collect Alien Puppies and avoid Barkmageddon!🚀 \n✅50+ Dog Breeds\n🤣Fun\n🦾Cyberpunk,Games,Tabletop Games,,,SG,28.25,5188.36,20017.0,successful
83066,BEstitchARY for MÖRK BORG,Stitch together 400 chimeric abominations!\nA third-party bestiary supplement for MÖRK BORG RPG.,Games,Tabletop Games,,,PL,14.0,606.6,3051.0,successful


## Extraction of keywords

We try to extract the main keywords that help to classify the blurbs as environmentally or socially relevant.

In [3]:
# Replace NaN values with the placeholder 'unspecified'
df = df.fillna('unspecified')

# Keep only the samples that have a value in both 'is_environmental' and 'is_social'
df_is_envt_or_social = df[(df['is_environmental'] != 'unspecified') &
                          (df['is_social'] != 'unspecified')]

df_is_envt_or_social = df_is_envt_or_social[['campaign_name', 'blurb', 'is_environmental', 'is_social']]

print(f'The dataset has {df_is_envt_or_social.shape[0]} rows with a "Yes" or "No" value in the "is_environmental" and "is_social" columns.\n\
Note: Due to manual curation, all the selected samples have values in both the "is_environmental" and "is_social" columns.\n')

The dataset has 2053 rows with a "Yes" or "No" value in the "is_environmental" and "is_social" columns.
Note: Due to manual curation, all the selected samples have values in both the "is_environmental" and "is_social" columns.



In [4]:
#Check if the classes are balanced
print(df_is_envt_or_social['is_environmental'].value_counts())
print(df_is_envt_or_social['is_social'].value_counts())

is_environmental
No     2010
Yes      43
Name: count, dtype: int64
is_social
No     2027
Yes      26
Name: count, dtype: int64


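**Observation:** The classes are heavily imbalanced: only 43 of the 2053 labelled samples (about 2.1%) are environmentally relevant, and only 26 (about 1.3%) are socially relevant.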
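
To make the imbalance concrete, the sketch below computes the share of 'Yes' labels and the accuracy that a trivial "always predict the majority class" rule would reach. The counts are copied from the output above; the function name `class_summary` is an assumption made for this illustration.

```python
# Label counts copied from the value_counts() output above.
counts = {
    "is_environmental": {"No": 2010, "Yes": 43},
    "is_social": {"No": 2027, "Yes": 26},
}

def class_summary(label_counts):
    """Return (share of 'Yes' labels, accuracy of always predicting the majority class)."""
    total = sum(label_counts.values())
    positive_share = label_counts["Yes"] / total
    majority_baseline = max(label_counts.values()) / total
    return positive_share, majority_baseline

for name, label_counts in counts.items():
    share, baseline = class_summary(label_counts)
    print(f"{name}: positive share = {share:.1%}, majority-class accuracy = {baseline:.1%}")
```

With so few positive samples, plain accuracy says little about how well the 'Yes' class is detected, so precision and recall on the 'Yes' class are the more informative figures.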

We try to find the most important words that appear in the blurbs classified as socially or environmentally relevant. We start with the tf-idf algorithm.
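
As a rough illustration of the ranking idea, the following pure-Python sketch scores each term by its mean tf-idf weight across a toy corpus. It assumes whitespace tokenization and uses the smoothed idf with L2-normalised document vectors, which is the default weighting of scikit-learn's `TfidfVectorizer`; the toy documents and the name `tfidf_mean_scores` are made up for this example and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf_mean_scores(docs):
    """Mean tf-idf score per term (smoothed idf, L2-normalised document vectors)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    totals = Counter()
    for tokens in tokenised:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:  # empty document, nothing to add
            continue
        for t, w in weights.items():
            totals[t] += w / norm
    ranked = sorted(totals.items(), key=lambda kv: -kv[1])
    return {t: total / n_docs for t, total in ranked}

toy_docs = [
    "eco friendly tent made from recycled plastic",
    "organic eco friendly tea",
    "card game with alien puppies",
]
for term, score in list(tfidf_mean_scores(toy_docs).items())[:3]:
    print(f"{term:10s} {score:.3f}")
```

Terms that occur in several documents of the class, such as `eco` and `friendly` here, accumulate the highest mean score, which is the property the keyword ranking relies on.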

In [23]:
# Simple first approach: tf-idf
# Stop words: scikit-learn's English list plus the single digits 0-9
STOP_WORDS = list(text.ENGLISH_STOP_WORDS.union([str(i) for i in range(10)]))
MIN_DOCS = .05  # a term must appear in at least 5% of the documents
TOKEN_PATTERN = r'(?u)\b[a-zA-Z]{2,}\b'  # alphabetic tokens with at least two letters

def stem(extract):
    """Stem every word of every string in `extract` and return the stemmed strings."""
    stemmer = SnowballStemmer("english")
    return [' '.join(stemmer.stem(token) for token in word_tokenize(doc)) for doc in extract]

def transform_extract(extract):
    """Lowercase a text, split it on non-word characters and return the stemmed tokens."""
    return stem(re.sub(r'\W+', ' ', extract).lower().split())

def rank_words(df, ranked_words, columns_probed, column_affirmative, column_ranked_words='ranked_words', threshold=0.05):
    """Add the matched keywords per row and their count to `df`.

    A token counts as a keyword if it appears in `ranked_words` with a score above `threshold`.
    Note: `df` is modified in place and also returned.
    """
    df[column_affirmative] = 0  # placeholder, filled below (keeps the column order)
    df['combined_description'] = ''

    # Concatenate all probed text columns into one description per row
    for column_probed in columns_probed:
        df['combined_description'] += ' ' + df[column_probed]

    df[column_ranked_words] = df['combined_description'].apply(
        lambda blurb: [word for word in transform_extract(blurb)
                       if word in ranked_words.index and ranked_words.loc[word] > threshold])

    df[column_affirmative] = df[column_ranked_words].apply(len)
    return df

def get_word_count_in_classified_blurbs(df, count_column):
    return df[count_column].apply(lambda x: 'No keyword' if x == 0 else 'at least one keyword').value_counts()
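
The core of `rank_words` is a threshold filter over the tokens of each description. The sketch below isolates that step without pandas or NLTK; it assumes the tokens are already lowercased and stemmed, and the scores in `example_scores` are illustrative values chosen for this example. The name `match_keywords` is hypothetical.

```python
def match_keywords(tokens, ranked_scores, threshold=0.05):
    """Keep the tokens whose ranking score exceeds the threshold (duplicates are kept)."""
    return [tok for tok in tokens if ranked_scores.get(tok, 0.0) > threshold]

# Illustrative scores for a handful of stemmed terms (assumed values).
example_scores = {"sustain": 0.140, "friend": 0.071, "eco": 0.066, "recycl": 0.056, "use": 0.046}

# Pre-stemmed tokens of a hypothetical blurb.
tokens = ["eco", "friend", "tent", "made", "from", "recycl", "plastic", "use"]

matched = match_keywords(tokens, example_scores, threshold=0.052)
print(matched, len(matched))
```

Because duplicates are kept, the keyword count reflects how often keywords occur in a description, not how many distinct keywords it contains.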




In [24]:
#1. Find top-ranking words in samples classified as environmental
#----------------------------------------------------------------
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df= MIN_DOCS, token_pattern=TOKEN_PATTERN)

#blurb_is_environmental= df_is_envt_or_social[df_is_envt_or_social['is_environmental'] == 'Yes']['blurb'].tolist()
blurb_is_environmental = df_is_envt_or_social[df_is_envt_or_social['is_environmental'] == 'Yes'][['campaign_name', 'blurb']].agg(' '.join, axis=1).tolist()
blurb_is_environmental = stem([text.lower() for text in blurb_is_environmental])

# Vectorize the corpus
tf_idf_vector = tf_idf_model.fit_transform(blurb_is_environmental)

# Get the (stemmed) terms retained by the vectorizer
words_set = tf_idf_model.get_feature_names_out()

# Data frame with the TF-IDF scores of each document
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)

# Calculate the mean TF-IDF score of each word across all documents
word_importance = df_tf_idf.mean(axis=0)

# Sort words by their mean TF-IDF score
ranked_words_environmental = word_importance.sort_values(ascending=False)

# Print the ranked words
print(f'No. of identified top words distinguishing environmentally relevant blurbs: {len(ranked_words_environmental)}')
print('Top words (is_environmental)')
print('-----------------------------')
print(ranked_words_environmental)

No. of identified top words distinguishing environmentally relevant blurbs: 33
Top words (is_environmental)
-----------------------------
sustain       0.14045
organ        0.116038
natur        0.080313
friend       0.071258
design       0.069359
eco          0.066331
food         0.057577
build        0.057305
recycl       0.056003
farm         0.055535
make         0.055135
world        0.054524
healthi      0.052558
use          0.046495
produc       0.045798
small        0.044453
save         0.043234
fashion      0.042483
local        0.041971
provid       0.040826
anim         0.040822
compani      0.040321
communiti    0.039976
hand         0.038725
awar         0.038294
rais         0.038294
vegan        0.038061
mobil        0.037889
project      0.030225
creat        0.028413
innov        0.027485
materi       0.027197
brand        0.026125
dtype: Sparse[float64, 0]


In [25]:
# .copy() avoids pandas' SettingWithCopyWarning when rank_words adds columns
df_envt = df_is_envt_or_social[df_is_envt_or_social['is_environmental'] == 'Yes'].drop(columns='is_social').copy()
rank_words(df_envt, ranked_words_environmental, columns_probed=['campaign_name', 'blurb'], column_affirmative='yes_count: is_envt', threshold=0.052)
word_count_classified_envt = get_word_count_in_classified_blurbs(df_envt, 'yes_count: is_envt')
print(word_count_classified_envt)
# Share of environmentally labelled samples with at least one keyword
# (this is the recall on the 'Yes' class, not an overall accuracy)
print(f"accuracy: {word_count_classified_envt.get('at least one keyword', 0) / word_count_classified_envt.sum():.4f}")
df_envt

yes_count: is_envt
at least one keyword    36
No keyword               7
Name: count, dtype: int64
accuracy: 0.8372

Unnamed: 0,campaign_name,blurb,is_environmental,yes_count: is_envt,combined_description,ranked_words
50,Beluga tent 6-in-1 from Qaou,The first all in one highly eco-friendly tent made from recycled plastic.,Yes,3,Beluga tent 6-in-1 from Qaou The first all in one highly eco-friendly tent made from recycled plastic.,"[eco, friend, recycl]"
74,"Thé-tis Tea : Plant-based seaweed tea, rich in minerals","Delicious tea infusion made with seaweed. Healthy, organic, plant-based, eco-friendly, and rich-mineral tea for vegans.",Yes,4,"Thé-tis Tea : Plant-based seaweed tea, rich in minerals Delicious tea infusion made with seaweed. Healthy, organic, plant-based, eco-friendly, and rich-mineral tea for vegans.","[healthi, organ, eco, friend]"
103,baby food,"Inspired by the selection at the Grocery mart, I want to make Safe Healthy Nutritious Slurpable foods for baby. No preservatives added.",Yes,4,"baby food Inspired by the selection at the Grocery mart, I want to make Safe Healthy Nutritious Slurpable foods for baby. No preservatives added.","[food, make, healthi, food]"
129,Chique Addiction,"High fashions made from ethical and sustainable, environmentally friendly, vegan fabrics for the modern world.",Yes,3,"Chique Addiction High fashions made from ethical and sustainable, environmentally friendly, vegan fabrics for the modern world.","[sustain, friend, world]"
167,Hearth & Market - Wood Fired Food Truck & Mobile Market,"A wood fired food truck & mobile farmers market that connects you to our farm, way of life and certified organic produce and products.",Yes,4,"Hearth & Market - Wood Fired Food Truck & Mobile Market A wood fired food truck & mobile farmers market that connects you to our farm, way of life and certified organic produce and products.","[food, food, farm, organ]"
176,Sutra (Thread)Hand Dyed Hand Spinned Sustainable Yarn,To create yarn&projects out of sustainable bamboo&hemp fiber with the desert dye cochineal.,Yes,2,Sutra (Thread)Hand Dyed Hand Spinned Sustainable Yarn To create yarn&projects out of sustainable bamboo&hemp fiber with the desert dye cochineal.,"[sustain, sustain]"
233,Catboxpro: Self-Flushing Automatic Cat Litter Box,"No monthly Subscriptions to bags, chemicals, filters or litter with the Catboxpro.",Yes,0,"Catboxpro: Self-Flushing Automatic Cat Litter Box No monthly Subscriptions to bags, chemicals, filters or litter with the Catboxpro.",[]
241,"Rebel Swim - Men's swim shorts, designed with a purpose!",Buy a pair of our beautiful men's swim shorts and protect an endangered animal!,Yes,1,"Rebel Swim - Men's swim shorts, designed with a purpose! Buy a pair of our beautiful men's swim shorts and protect an endangered animal!",[design]
294,"Ash Apothecary: Small Batch, All-Natural Simple Syrup","Small-batch simple syrups for bartending, mixology, coffee, cocktails, soda, chai, and more. Only organic and non-GMO ingredients.",Yes,2,"Ash Apothecary: Small Batch, All-Natural Simple Syrup Small-batch simple syrups for bartending, mixology, coffee, cocktails, soda, chai, and more. Only organic and non-GMO ingredients.","[natur, organ]"
345,Pawstively Droolicious,An all natural and homemade dog treats that are personalized to every dog's needs and desires.,Yes,1,Pawstively Droolicious An all natural and homemade dog treats that are personalized to every dog's needs and desires.,[natur]
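
The figure printed as "accuracy" above is computed on the 'Yes' samples only, so it corresponds to the recall of the keyword rule (36 of 43). To judge the rule as a classifier, the 'No' samples would have to be scored as well. The sketch below shows one way to do that with plain Python; the labels and keyword counts are made-up toy values, and `keyword_rule_metrics` is a hypothetical helper, not part of the notebook.

```python
def keyword_rule_metrics(labels, keyword_counts, min_keywords=1):
    """Evaluate the rule 'predict Yes if at least min_keywords keywords match'."""
    tp = fp = fn = tn = 0
    for label, count in zip(labels, keyword_counts):
        predicted_yes = count >= min_keywords
        actual_yes = label == "Yes"
        if predicted_yes and actual_yes:
            tp += 1
        elif predicted_yes and not actual_yes:
            fp += 1
        elif not predicted_yes and actual_yes:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn, "precision": precision, "recall": recall}

# Toy data: two 'Yes' samples and three 'No' samples with their keyword counts.
toy_labels = ["Yes", "Yes", "No", "No", "No"]
toy_counts = [3, 0, 1, 0, 0]
print(keyword_rule_metrics(toy_labels, toy_counts))
```

Since generic words such as `make` or `design` rank among the keywords, precision on the full labelled set may turn out much lower than the recall reported above.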


In [30]:
#----------------------------------------------------------------
#2. Find top-ranking words in samples classified as social
#----------------------------------------------------------------
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df= MIN_DOCS, token_pattern=TOKEN_PATTERN)

blurb_is_social = df_is_envt_or_social[df_is_envt_or_social['is_social'] == 'Yes'][['campaign_name', 'blurb']].agg(' '.join, axis=1).tolist()
blurb_is_social = stem([text.lower() for text in blurb_is_social])

tf_idf_vector = tf_idf_model.fit_transform(blurb_is_social)
words_set = tf_idf_model.get_feature_names_out()
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)
word_importance = df_tf_idf.mean(axis=0)
ranked_words_social = word_importance.sort_values(ascending=False)

print(f'No. of identified top words distinguishing socially relevant blurbs: {len(ranked_words_social)}')
print('Top words (is_social)')
print('-----------------------------')
print(ranked_words_social)

No. of identified top words distinguishing socially relevant blurbs: 31
Top words (is_social)
-----------------------------
communiti    0.121427
support      0.114027
project      0.078614
build          0.0724
free         0.069167
covid         0.06393
area         0.062197
public       0.061306
hous         0.060613
card         0.059676
rais         0.055751
awar         0.055751
make         0.054066
shirt        0.052792
live         0.051397
school       0.049044
help         0.047177
fund         0.046626
solut        0.045875
film         0.044926
know         0.041608
fight        0.041572
save         0.041075
end            0.0406
main         0.038005
kid          0.036531
app          0.036265
children     0.033609
individu     0.032786
risk         0.032578
creat        0.031378
dtype: Sparse[float64, 0]


In [29]:
# .copy() avoids pandas' SettingWithCopyWarning when rank_words adds columns
df_social = df_is_envt_or_social[df_is_envt_or_social['is_social'] == 'Yes'].drop(columns='is_environmental').copy()

rank_words(df_social, ranked_words_social, columns_probed=['campaign_name', 'blurb'], column_affirmative='yes_count: is_social')

word_count_classified_social = get_word_count_in_classified_blurbs(df_social, 'yes_count: is_social')
print(word_count_classified_social)
# Share of socially labelled samples with at least one keyword
# (this is the recall on the 'Yes' class, not an overall accuracy)
print(f"accuracy: {word_count_classified_social.get('at least one keyword', 0) / word_count_classified_social.sum():.4f}")
df_social

yes_count: is_social
at least one keyword    24
No keyword               2
Name: count, dtype: int64
accuracy: 0.9231

Unnamed: 0,campaign_name,blurb,is_social,yes_count: is_social,combined_description,ranked_words
6,Surviving the Unknown,A family struggles to survive off the grid in secrecy. But it's more than just the harsh elements that are tearing them apart.,Yes,0,Surviving the Unknown A family struggles to survive off the grid in secrecy. But it's more than just the harsh elements that are tearing them apart.,[]
26,The Call - a voice to the voiceless,"This is a project, which aims to save lives of unarmed men, women and children trapped in war, who reject to participate in violence!",Yes,2,"The Call - a voice to the voiceless This is a project, which aims to save lives of unarmed men, women and children trapped in war, who reject to participate in violence!","[project, live]"
58,Et al. Creatives,"A collaborative employment, resource, and community platform.",Yes,1,"Et al. Creatives A collaborative employment, resource, and community platform.",[communiti]
66,the breast express,pumpspotting is going cross-country to support & show up for breastfeeding moms and document the boob-venture of a lifetime.,Yes,1,the breast express pumpspotting is going cross-country to support & show up for breastfeeding moms and document the boob-venture of a lifetime.,[support]
91,Gay Occasions,"I was looking in a card shop for a card for my fiancée, and was struck by the lack of LGBT cards available. Let's make it happen.",Yes,4,"Gay Occasions I was looking in a card shop for a card for my fiancée, and was struck by the lack of LGBT cards available. Let's make it happen.","[card, card, card, make]"
104,MIRZ PLAYING CARDS : 2ND EDITION (feat. Hope For Justice),Change lives. End Slavery.,Yes,2,MIRZ PLAYING CARDS : 2ND EDITION (feat. Hope For Justice) Change lives. End Slavery.,"[card, live]"
134,Seattle Streets to Main Street: End Child Trafficking.,Help me build the social impact of my award winning documentary “The Long Night” and get the film to audiences everywhere.,Yes,1,Seattle Streets to Main Street: End Child Trafficking. Help me build the social impact of my award winning documentary “The Long Night” and get the film to audiences everywhere.,[build]
154,MizaBella After School Project,Teaching Kids How To Knit,Yes,1,MizaBella After School Project Teaching Kids How To Knit,[project]
217,Aegis,Aegis- A turnkey security solution that scans the area for security threats and risks to safeguard public health w.r.t Covid-19,Yes,3,Aegis Aegis- A turnkey security solution that scans the area for security threats and risks to safeguard public health w.r.t Covid-19,"[area, public, covid]"
388,"Little Free Library in West Louisville, Kentucky","Support the creation of a little free library in West Louisville, Kentucky.",Yes,3,"Little Free Library in West Louisville, Kentucky Support the creation of a little free library in West Louisville, Kentucky.","[free, support, free]"


In [12]:
#----------------------------------------------------------------
#3. Find top ranking words in all samples, for the sake of completeness
#----------------------------------------------------------------
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df= .05)

blurb_all= df['blurb'].tolist()
blurb_all= [text.lower() for text in blurb_all]

tf_idf_vector = tf_idf_model.fit_transform(blurb_all)
words_set = tf_idf_model.get_feature_names_out()
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)
word_importance = df_tf_idf.mean(axis=0)
ranked_words = word_importance.sort_values(ascending=False)
print(f'No. of identified top words in all blurbs: {len(ranked_words)}')
print('Top words (all)')
print('-----------------------------')
print(ranked_words)

No. of identified top words in all blurbs: 5
Top words (all)
-----------------------------
help     0.081314
new      0.076475
book     0.049146
world    0.048567
album    0.044756
dtype: Sparse[float64, 0]
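
Only five words survive here, most likely because `min_df=.05` is a relative threshold: a term has to occur in at least 5% of all documents. For the 43 environmental samples that means at least 3 documents (0.05 × 43 = 2.15), whereas for the full dataset the required document count is far higher. The pure-Python sketch below illustrates the filter on a toy corpus; it assumes whitespace tokenization, and `terms_passing_min_df` is a hypothetical helper name.

```python
from collections import Counter

def terms_passing_min_df(docs, min_df):
    """Return the terms that occur in at least a fraction `min_df` of the documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    min_count = min_df * len(docs)  # relative threshold turned into a document count
    return sorted(term for term, df in doc_freq.items() if df >= min_count)

toy_blurbs = [
    "help new book",
    "help new album",
    "help world",
    "book world tour",
]
# With four documents and min_df=0.5, a term must occur in at least two of them.
print(terms_passing_min_df(toy_blurbs, min_df=0.5))
```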


In [13]:
common_words_social_environment = set(ranked_words_social.head(25).index) & set(ranked_words_environmental.head(25).index)
print(common_words_social_environment)

{'make', 'build', 'awar', 'save', 'communiti'}
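
The shared words are generic and do not separate the two classes. One possible follow-up, sketched below under the assumption that class-specific keywords are wanted, is to remove the overlap from each list. The word lists are the top 25 entries copied from the two ranking outputs above; the variable names are introduced here for illustration.

```python
# Top 25 stemmed words per class, copied from the ranking outputs above.
top_environmental = {
    "sustain", "organ", "natur", "friend", "design", "eco", "food", "build", "recycl",
    "farm", "make", "world", "healthi", "use", "produc", "small", "save", "fashion",
    "local", "provid", "anim", "compani", "communiti", "hand", "awar",
}
top_social = {
    "communiti", "support", "project", "build", "free", "covid", "area", "public",
    "hous", "card", "rais", "awar", "make", "shirt", "live", "school", "help", "fund",
    "solut", "film", "know", "fight", "save", "end", "main",
}

shared = top_environmental & top_social
environmental_only = top_environmental - shared
social_only = top_social - shared

print(sorted(shared))
print(len(environmental_only), len(social_only))
```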
