## Dataset: Classification into is_environmental & is_social
--- ------------------

### A. Introduction
--- -------------------

The dataset contains the features 'is_environmental' and 'is_social', which are expected to be essential for the upcoming analyses. However, only about 1% of the rows hold a value in these columns. 
In this script, we try to populate the remaining rows with values using NLP analyses.


Please make sure that the respective data files exist in the 'data' folder.

#### Imports:
-- ----------

In [120]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np
import sklearn.model_selection as ms
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

In [121]:
#load the dataset
df= pd.read_csv('./data/dataframe_stripped_features.csv', low_memory=False)
df.sample(2)

Unnamed: 0,campaign_name,blurb,main_category,sub_category,is_environmental,is_social,country,duration_in_days,goal_usd,pledged_amount_usd,is_success
126755,The Cage: Plan the Escape...or Escape the Plan!,"""The Cage"" is a story of redemption and restoration. How does God work when it seems like you were set up to fail?",Art,Performance Art,,,US,24.81,1500.0,200.0,failed
170741,Fantasy Box Office,Fantasy Box Office allows you to compete against friends by scoring points based on both box office revenue and critical evaluation..,Games,Mobile Games,,,US,45.0,20000.0,989.0,failed


## Extraction of keywords

We try to extract the main keywords that help to classify the blurbs as environmentally or socially relevant.

In [122]:
# Replace NaN fields with the placeholder 'unspecified'
df = df.fillna('unspecified')

# Extract the samples having values in 'is_environmental' and 'is_social' columns
df_is_envt_or_social= df[((df['is_environmental']!='unspecified')) &
                           ((df['is_social']!='unspecified'))]

df_is_envt_or_social=df_is_envt_or_social[['campaign_name','blurb','is_environmental', 'is_social']]

print(f'The dataset has {df_is_envt_or_social.shape[0]} rows with a "Yes" or "No" value in the "is_environmental" and "is_social" columns.\n\
Note: Due to manual curation, all the selected samples have values in both the "is_environmental" and "is_social" columns.\n')

The dataset has 2053 rows with a "Yes" or "No" value in the "is_environmental" and "is_social" columns.
Note: Due to manual curation, all the selected samples have values in both the "is_environmental" and "is_social" columns.



In [123]:
#Check if the classes are balanced
print(df_is_envt_or_social['is_environmental'].value_counts())
print(df_is_envt_or_social['is_social'].value_counts())

is_environmental
No     2010
Yes      43
Name: count, dtype: int64
is_social
No     2027
Yes      26
Name: count, dtype: int64


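**Observation:** The classes are heavily imbalanced: only 43 of the 2053 labelled rows (about 2.1%) are environmental and only 26 (about 1.3%) are social.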
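
To put the imbalance into numbers, the following minimal sketch recomputes the class shares from the counts printed above. The helper name `class_shares` is hypothetical and not part of the notebook. The share of "No" labels is also the accuracy a trivial classifier would reach by always predicting "No", which is why plain accuracy is a weak yardstick here.

```python
# Label counts copied from the value_counts() output above.
label_counts = {
    "is_environmental": {"No": 2010, "Yes": 43},
    "is_social": {"No": 2027, "Yes": 26},
}

def class_shares(counts):
    """Return the share of each label among the labelled rows (hypothetical helper)."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

shares = {column: class_shares(counts) for column, counts in label_counts.items()}

for column, share in shares.items():
    # The "No" share equals the accuracy of a baseline that always predicts "No".
    print(f"{column}: Yes = {share['Yes']:.2%}, always-'No' baseline = {share['No']:.2%}")
```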

We try to find the most important words that appear in the campaign names and blurbs labelled as socially or environmentally relevant. We start with the TF-IDF algorithm.
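
As a rough illustration of what the ranking computes, here is a dependency-free sketch of mean TF-IDF scores on a toy corpus. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 normalisation per document, which is how scikit-learn documents the `TfidfVectorizer` defaults as far as we know. The function name `tfidf_mean_scores`, the simplified tokenizer, and the toy blurbs are assumptions made for this example; stop-word removal and `min_df` filtering are left out.

```python
import math
import re
from collections import Counter

def tfidf_mean_scores(documents):
    """Mean L2-normalised TF-IDF score per term (illustrative sketch)."""
    # Simplified tokenizer: runs of at least two lowercase ASCII letters.
    tokenised = [re.findall(r"[a-z]{2,}", doc.lower()) for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    totals = Counter()
    for tokens in tokenised:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            continue  # document without any token
        for term, weight in weights.items():
            totals[term] += weight / norm  # L2-normalise each document vector
    return {term: total / n_docs for term, total in totals.items()}

# Hypothetical toy blurbs, not taken from the dataset.
toy_blurbs = [
    "organic tea made with organic seaweed",
    "sustainable yarn from bamboo",
    "organic farm produce",
]
scores = tfidf_mean_scores(toy_blurbs)
top_terms = sorted(scores, key=scores.get, reverse=True)
print(top_terms[:3])
```

A word that occurs in many of the documents still ranks high here because the scores are averaged over all documents, so the ranking reflects frequency within the positive class rather than contrast with the negative class.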

In [143]:
# First, simple approach: TF-IDF
# Extend the English stop words with the single digits 0-9
STOP_WORDS = list(text.ENGLISH_STOP_WORDS.union([str(i) for i in range(10)]))
# Keep only terms that occur in at least 5% of the documents
MIN_DOCS = .05
# Tokens are runs of at least two ASCII letters
TOKEN_PATTERN = '(?u)\\b[a-zA-Z]{2,}\\b'


def rank_words(df, ranked_words, columns_probed, column_affirmative, column_ranked_words='ranked_words', threshold=0.05):
    """Count, per row, the keywords whose mean TF-IDF score exceeds `threshold`.

    Adds two columns to `df` in place: `column_affirmative` (number of keyword hits
    across `columns_probed`) and `column_ranked_words` (the matched keywords).
    """
    # Keep only the keywords scoring above the threshold; a set gives fast lookups.
    keywords = {word for word, score in ranked_words.items() if score > threshold}

    def matched_keywords(blurb):
        # Split hyphenated words, lowercase, and strip non-word characters from each token.
        tokens = (re.sub(r'\W+', '', word) for word in blurb.replace('-', ' ').lower().split())
        return [token for token in tokens if token in keywords]

    df[column_affirmative] = 0
    df[column_ranked_words] = ''
    for column_probed in columns_probed:
        matches = df[column_probed].apply(matched_keywords)
        df[column_affirmative] += matches.apply(len)
        df[column_ranked_words] += matches.apply(lambda words: ''.join(word + ', ' for word in words))
    return df


def get_word_count_in_classified_blurbs(df, count_column):
    return df[count_column].apply(lambda x: 'No keyword' if x == 0 else 'at least one keyword').value_counts()
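
To see what the token pattern defined in this cell keeps, the short sketch below applies it with Python's `re` module to one of the campaign texts. The helper name `tokenize` is hypothetical; the assumption is that the vectorizer applies the pattern in the same `findall` manner to the lowercased text.

```python
import re

# Same pattern as TOKEN_PATTERN above: runs of at least two ASCII letters
# delimited by word boundaries.
TOKEN_PATTERN = r"(?u)\b[a-zA-Z]{2,}\b"

def tokenize(blurb):
    """Lowercase a blurb and return the tokens the pattern keeps (hypothetical helper)."""
    return re.findall(TOKEN_PATTERN, blurb.lower())

tokens = tokenize("Beluga tent 6-in-1: highly eco-friendly, made from recycled plastic.")
print(tokens)
```

Digits such as "6" and "1" are dropped, and hyphenated terms such as "eco-friendly" are split into two tokens, which matches the separate "eco" and "friendly" entries in the rankings.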




In [144]:
#1. Find top ranking words in samples classified as environmental
#----------------------------------------------------------------
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df=MIN_DOCS, token_pattern=TOKEN_PATTERN)

# Join campaign name and blurb into one lowercase document per sample
blurb_is_environmental = df_is_envt_or_social[df_is_envt_or_social['is_environmental'] == 'Yes'][['campaign_name', 'blurb']].agg(' '.join, axis=1).tolist()
blurb_is_environmental = [blurb.lower() for blurb in blurb_is_environmental]

# Vectorization of the corpus
tf_idf_vector = tf_idf_model.fit_transform(blurb_is_environmental)

# Get the terms kept in the vocabulary
words_set = tf_idf_model.get_feature_names_out()

# Data frame holding the TF-IDF scores of each document
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)

# Calculate the mean TF-IDF score of each word across all documents
word_importance = df_tf_idf.mean(axis=0)

# Sort words by their mean TF-IDF score
ranked_words_environmental = word_importance.sort_values(ascending=False)

# Print the ranked words
print(f'No. of identified top words distinguishing environmentally relevant blurbs: {len(ranked_words_environmental)}')
print('Top words (is_environmental)')
print('-----------------------------')
print(ranked_words_environmental)



No. of identified top words distinguishing environmentally relevant blurbs: 21
Top words (is_environmental)
-----------------------------
organic        0.154815
sustainable    0.138874
friendly       0.094534
eco            0.085335
healthy        0.072135
build          0.066082
natural        0.061865
food           0.059899
awareness      0.057332
world          0.056652
save           0.050945
small          0.050785
produce        0.050398
farm           0.049858
company         0.04703
materials      0.045872
mobile         0.043656
community      0.037919
brand          0.034434
create         0.033762
innovative     0.033729
dtype: Sparse[float64, 0]
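
The vocabulary above is short because of the `min_df` filter. The sketch below works out the document count implied by `MIN_DOCS = .05`, assuming that scikit-learn treats a float `min_df` as a proportion and drops every term whose document frequency falls below `min_df * n_documents`. The helper name `min_document_count` is hypothetical.

```python
import math

MIN_DOCS = 0.05  # same value as in the notebook

def min_document_count(min_df, n_documents):
    """Smallest whole number of documents that satisfies a fractional min_df."""
    return math.ceil(min_df * n_documents)

# 43 environmental and 26 social samples, taken from the value counts above.
class_sizes = {"is_environmental": 43, "is_social": 26}
required = {label: min_document_count(MIN_DOCS, n) for label, n in class_sizes.items()}
print(required)
```

Under that assumption a word has to appear in at least 3 of the 43 environmental samples, or in at least 2 of the 26 social samples, to enter the respective ranking.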


In [145]:
blurb_is_environmental

['beluga tent 6-in-1 from qaou the first all in one highly eco-friendly tent made from recycled plastic.',
 'thé-tis tea : plant-based seaweed tea, rich in minerals delicious tea infusion made with seaweed. healthy, organic, plant-based, eco-friendly, and rich-mineral tea for vegans.',
 'baby food inspired by the selection at the grocery mart, i want to make safe healthy nutritious slurpable foods for baby. no preservatives added.',
 'chique addiction high fashions made from ethical and sustainable, environmentally friendly, vegan fabrics for the modern world.',
 'hearth & market - wood fired food truck & mobile market a wood fired food truck & mobile farmers market that connects you to our farm, way of life and certified organic produce and products.',
 'sutra (thread)hand dyed hand spinned sustainable yarn to create yarn&projects out of sustainable bamboo&hemp fiber with the desert dye cochineal.',
 'catboxpro: self-flushing automatic cat litter box no monthly subscriptions to bags, 

In [146]:
# Work on an explicit copy to avoid pandas' SettingWithCopyWarning
df_envt = df_is_envt_or_social[df_is_envt_or_social['is_environmental']=='Yes'].copy()
df_envt = df_envt.drop(columns='is_social')
rank_words(df_envt, ranked_words_environmental, columns_probed=['campaign_name', 'blurb'], column_affirmative='yes_count: is_envt')
word_count_classified_envt = get_word_count_in_classified_blurbs(df_envt, 'yes_count: is_envt')
print(word_count_classified_envt)
# Share of environmentally labelled samples containing at least one keyword (measured on the positive class only)
print(f"accuracy: {word_count_classified_envt.get('at least one keyword', 0) / word_count_classified_envt.sum():.4f}")
df_envt

yes_count: is_envt
at least one keyword    36
No keyword               7
Name: count, dtype: int64
accuracy: 0.8372


Unnamed: 0,campaign_name,blurb,is_environmental,yes_count: is_envt,ranked_words
50,Beluga tent 6-in-1 from Qaou,The first all in one highly eco-friendly tent made from recycled plastic.,Yes,2,"eco, friendly,"
74,"Thé-tis Tea : Plant-based seaweed tea, rich in minerals","Delicious tea infusion made with seaweed. Healthy, organic, plant-based, eco-friendly, and rich-mineral tea for vegans.",Yes,4,"healthy, organic, eco, friendly,"
103,baby food,"Inspired by the selection at the Grocery mart, I want to make Safe Healthy Nutritious Slurpable foods for baby. No preservatives added.",Yes,2,"food, healthy,"
129,Chique Addiction,"High fashions made from ethical and sustainable, environmentally friendly, vegan fabrics for the modern world.",Yes,3,"sustainable, friendly, world,"
167,Hearth & Market - Wood Fired Food Truck & Mobile Market,"A wood fired food truck & mobile farmers market that connects you to our farm, way of life and certified organic produce and products.",Yes,4,"food, food, organic, produce,"
176,Sutra (Thread)Hand Dyed Hand Spinned Sustainable Yarn,To create yarn&projects out of sustainable bamboo&hemp fiber with the desert dye cochineal.,Yes,2,"sustainable, sustainable,"
233,Catboxpro: Self-Flushing Automatic Cat Litter Box,"No monthly Subscriptions to bags, chemicals, filters or litter with the Catboxpro.",Yes,0,
241,"Rebel Swim - Men's swim shorts, designed with a purpose!",Buy a pair of our beautiful men's swim shorts and protect an endangered animal!,Yes,0,
294,"Ash Apothecary: Small Batch, All-Natural Simple Syrup","Small-batch simple syrups for bartending, mixology, coffee, cocktails, soda, chai, and more. Only organic and non-GMO ingredients.",Yes,4,"small, natural, small, organic,"
345,Pawstively Droolicious,An all natural and homemade dog treats that are personalized to every dog's needs and desires.,Yes,1,"natural,"
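
The table shows that some listed words, for example "farm" in the Hearth & Market row, are not counted as hits. The sketch below reproduces the matching rule on two short texts to show why: a word only counts when its score exceeds the default threshold of 0.05. The helper name `match_keywords` and the second example text are assumptions for illustration; the scores are copied from the ranking above.

```python
import re

# Scores copied from the environmental ranking above (subset).
keyword_scores = {
    "organic": 0.154815,
    "friendly": 0.094534,
    "eco": 0.085335,
    "produce": 0.050398,
    "farm": 0.049858,
}
THRESHOLD = 0.05  # default threshold of rank_words

def match_keywords(blurb, scores, threshold=THRESHOLD):
    """Return the keywords in a blurb whose score exceeds the threshold (hypothetical helper)."""
    tokens = (re.sub(r"\W+", "", word) for word in blurb.replace("-", " ").lower().split())
    return [token for token in tokens if scores.get(token, 0.0) > threshold]

tent_hits = match_keywords(
    "The first all in one highly eco-friendly tent made from recycled plastic.",
    keyword_scores,
)
farm_hits = match_keywords("Fresh produce from our farm", keyword_scores)
print(tent_hits)
print(farm_hits)
```

"farm" scores 0.049858 and therefore falls just below the threshold, while "produce" at 0.050398 passes it.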


In [147]:
#----------------------------------------------------------------
#2. Find top ranking words in samples classified as social
#----------------------------------------------------------------
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df=MIN_DOCS, token_pattern=TOKEN_PATTERN)

blurb_is_social = df_is_envt_or_social[df_is_envt_or_social['is_social'] == 'Yes'][['campaign_name', 'blurb']].agg(' '.join, axis=1).tolist()
blurb_is_social = [blurb.lower() for blurb in blurb_is_social]

tf_idf_vector = tf_idf_model.fit_transform(blurb_is_social)
words_set = tf_idf_model.get_feature_names_out()
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)
word_importance = df_tf_idf.mean(axis=0)
ranked_words_social = word_importance.sort_values(ascending=False)

print(f'No. of identified top words distinguishing socially relevant blurbs: {len(ranked_words_social)}')
print('Top words (is_social)')
print('-----------------------------')
print(ranked_words_social)


No. of identified top words distinguishing socially relevant blurbs: 27
Top words (is_social)
-----------------------------
community    0.123717
support      0.099638
awareness    0.079236
covid        0.072623
public       0.070535
free         0.068879
area         0.068648
building     0.066994
project      0.066582
shirt        0.061332
house        0.059418
film         0.057692
tree         0.052644
cards        0.049402
school       0.048636
make         0.048153
know          0.04512
raise        0.044806
lives        0.042185
end          0.041437
fighting     0.041174
save         0.040421
main         0.039012
kids         0.037597
app          0.037419
help         0.035332
children     0.034195
dtype: Sparse[float64, 0]


In [148]:
# Work on an explicit copy to avoid pandas' SettingWithCopyWarning
df_social = df_is_envt_or_social[df_is_envt_or_social['is_social']=='Yes'].copy()
df_social = df_social.drop(columns='is_environmental')

rank_words(df_social, ranked_words_social, columns_probed=['campaign_name','blurb'], column_affirmative='yes_count: is_social')

word_count_classified_social = get_word_count_in_classified_blurbs(df_social, 'yes_count: is_social')
print(word_count_classified_social)
# Share of socially labelled samples containing at least one keyword (measured on the positive class only)
print(f"accuracy: {word_count_classified_social.get('at least one keyword', 0) / word_count_classified_social.sum():.4f}")
df_social

yes_count: is_social
at least one keyword    23
No keyword               3
Name: count, dtype: int64
accuracy: 0.8846


Unnamed: 0,campaign_name,blurb,is_social,yes_count: is_social,ranked_words
6,Surviving the Unknown,A family struggles to survive off the grid in secrecy. But it's more than just the harsh elements that are tearing them apart.,Yes,0,
26,The Call - a voice to the voiceless,"This is a project, which aims to save lives of unarmed men, women and children trapped in war, who reject to participate in violence!",Yes,1,"project,"
58,Et al. Creatives,"A collaborative employment, resource, and community platform.",Yes,1,"community,"
66,the breast express,pumpspotting is going cross-country to support & show up for breastfeeding moms and document the boob-venture of a lifetime.,Yes,1,"support,"
91,Gay Occasions,"I was looking in a card shop for a card for my fiancée, and was struck by the lack of LGBT cards available. Let's make it happen.",Yes,0,
104,MIRZ PLAYING CARDS : 2ND EDITION (feat. Hope For Justice),Change lives. End Slavery.,Yes,0,
134,Seattle Streets to Main Street: End Child Trafficking.,Help me build the social impact of my award winning documentary “The Long Night” and get the film to audiences everywhere.,Yes,1,"film,"
154,MizaBella After School Project,Teaching Kids How To Knit,Yes,1,"project,"
217,Aegis,Aegis- A turnkey security solution that scans the area for security threats and risks to safeguard public health w.r.t Covid-19,Yes,3,"area, public, covid,"
388,"Little Free Library in West Louisville, Kentucky","Support the creation of a little free library in West Louisville, Kentucky.",Yes,3,"free, support, free,"
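
The figures printed as "accuracy" above (36 of 43 and 23 of 26) are computed on the positively labelled samples only, so they correspond to the recall of the rule "at least one keyword means Yes". They say nothing about how often the rule fires on "No" samples. A fuller check could look like the following sketch; the function name `keyword_rule_metrics` and the toy data are hypothetical and do not come from the dataset.

```python
def keyword_rule_metrics(labels, keyword_counts):
    """Precision and recall of the rule 'at least one keyword => Yes'."""
    predicted = [count > 0 for count in keyword_counts]
    actual = [label == "Yes" for label in labels]
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    false_neg = sum(a and not p for p, a in zip(predicted, actual))
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical toy data: 3 "Yes" rows and 5 "No" rows with their keyword counts.
labels = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
keyword_counts = [2, 1, 0, 1, 0, 0, 3, 0]
precision, recall = keyword_rule_metrics(labels, keyword_counts)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Because the keywords were extracted from the same positive samples they are evaluated on, the recall measured this way is likely to be optimistic.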


In [130]:
#----------------------------------------------------------------
#3. Find top ranking words in all samples, for the sake of completeness
#----------------------------------------------------------------
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df= .05)

blurb_all = df['blurb'].tolist()
blurb_all = [blurb.lower() for blurb in blurb_all]

tf_idf_vector = tf_idf_model.fit_transform(blurb_all)
words_set = tf_idf_model.get_feature_names_out()
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)
word_importance = df_tf_idf.mean(axis=0)
ranked_words = word_importance.sort_values(ascending=False)
print(f'No. of identified top words in all blurbs: {len(ranked_words)}')
print('Top words (all)')
print('-----------------------------')
print(ranked_words)

No. of identified top words in all blurbs: 5
Top words (all)
-----------------------------
help     0.081314
new      0.076475
book     0.049146
world    0.048567
album    0.044756
dtype: Sparse[float64, 0]


In [131]:
common_words_social_environment = set(ranked_words_social.head(25).index) & set(ranked_words_environmental.head(25).index)
print(common_words_social_environment)

{'save', 'awareness', 'community'}
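
The overlap can be reproduced directly from the two printed rankings. The sketch below also derives class-specific word lists by removing the shared words; this is one possible follow-up and not a step taken in the notebook. The variable names are chosen for this example.

```python
# Top words copied from the two rankings above (highest score first).
environmental_words = [
    "organic", "sustainable", "friendly", "eco", "healthy", "build", "natural",
    "food", "awareness", "world", "save", "small", "produce", "farm", "company",
    "materials", "mobile", "community", "brand", "create", "innovative",
]
social_words = [
    "community", "support", "awareness", "covid", "public", "free", "area",
    "building", "project", "shirt", "house", "film", "tree", "cards", "school",
    "make", "know", "raise", "lives", "end", "fighting", "save", "main", "kids",
    "app", "help", "children",
]

# Same comparison as above: intersect the top 25 words of each ranking.
shared = set(environmental_words[:25]) & set(social_words[:25])

# Words that appear in only one of the two rankings, in ranked order.
environmental_only = [word for word in environmental_words if word not in shared]
social_only = [word for word in social_words if word not in shared]

print(sorted(shared))
print(environmental_only[:5])
print(social_only[:5])
```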
