## Dataset: Classification into is_environmental & is_social
--- ------------------

### A. Introduction
--- -------------------

The dataset contains the features 'is_environmental' and 'is_social', which are expected to be essential for the upcoming analyses. However, only 1% of the rows hold a value in these columns.
In this script, we try to populate the remaining rows with values using NLP analyses.


Please make sure that the respective data files exist in the 'data' folder.

#### Imports:
-- ----------

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np
import sklearn.model_selection as ms
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

In [2]:
#load the dataset
df= pd.read_csv('./data/dataframe_stripped_features.csv', low_memory=False)
df.sample(2)

Unnamed: 0,campaign_name,blurb,main_category,sub_category,is_environmental,is_social,country,duration_in_days,goal_usd,pledged_amount_usd,is_success
52853,Go Solar: STEM and FUN,DIY solar turning clip/wooden stand and spiral,Crafts,DIY,,,HK,45.0,1290.21,1420.0,successful
78294,Heart to Heart (Omegaverse Novel),"Heart to Heart, an illustrated Omegaverse romance novel.",Publishing,Fiction,,,US,31.1,3000.0,13121.0,successful


## Extraction of keywords

We try to extract the main keywords that help to classify the blurbs as environmentally or socially relevant.

In [3]:
# Replace missing values (NaN) with the placeholder 'unspecified'
df = df.fillna('unspecified')

# Keep only the samples that have a value in both the 'is_environmental' and 'is_social' columns
df_is_envt_or_social = df[(df['is_environmental'] != 'unspecified') &
                          (df['is_social'] != 'unspecified')]

df_is_envt_or_social = df_is_envt_or_social[['campaign_name', 'blurb', 'is_environmental', 'is_social']]

print(f'The dataset has {df_is_envt_or_social.shape[0]} rows with a "Yes" or "No" value in the "is_environmental" and "is_social" columns.\n\
Note: Due to manual curation, all the selected samples have values in both the "is_environmental" and "is_social" columns.\n')

The dataset has 2053 rows with a "Yes" or "No" value in the "is_environmental" and "is_social" columns.
Note: Due to manual curation, all the selected samples have values in both the "is_environmental" and "is_social" columns.



In [4]:
#Check if the classes are balanced
print(df_is_envt_or_social['is_environmental'].value_counts())
print(df_is_envt_or_social['is_social'].value_counts())

is_environmental
No     2010
Yes      43
Name: count, dtype: int64
is_social
No     2027
Yes      26
Name: count, dtype: int64


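**Observation:** The classes are heavily imbalanced: only 43 of the 2053 labeled samples (about 2.1%) are environmentally relevant, and only 26 (about 1.3%) are socially relevant.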
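
With this much imbalance, plain accuracy says little: a rule that always answers 'No' is already right for almost every labeled sample. A small library-free sketch, using only the counts printed above (the helper name `majority_baseline` is made up for illustration):

```python
def majority_baseline(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# Label counts copied from the value_counts() output above
environmental_counts = {"No": 2010, "Yes": 43}
social_counts = {"No": 2027, "Yes": 26}

for name, counts in [("is_environmental", environmental_counts),
                     ("is_social", social_counts)]:
    print(f"{name}: always-'No' accuracy = {majority_baseline(counts):.4f}")
```

Precision and recall on the 'Yes' class are therefore likely to be more informative than accuracy for this task.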

Next, we look for the most important words in the blurbs labeled as socially or environmentally relevant. We start with the TF-IDF algorithm.
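
As a rough illustration of what TF-IDF computes, here is a library-free sketch. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each document vector, which matches the defaults of scikit-learn's `TfidfVectorizer`. Tokenisation is simplified to whitespace splitting, and the toy corpus and the helper name `tf_idf` are invented for this example.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document, L2-normalised."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    weighted = []
    for doc in docs:
        raw = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted


# Invented toy corpus
toy_corpus = [
    "organic sustainable tea",
    "organic food truck",
    "sustainable bamboo yarn",
]
scores = tf_idf(toy_corpus)

# Mean weight per term across documents, the ranking criterion used in this notebook
vocab = sorted({t for doc in scores for t in doc})
mean_score = {t: sum(doc.get(t, 0.0) for doc in scores) / len(scores) for t in vocab}
ranking = sorted(mean_score, key=mean_score.get, reverse=True)
print(ranking[:2])
```

Terms that recur across several documents ('organic', 'sustainable') end up with the highest mean weight, even though their idf is lower than that of terms seen only once.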

In [5]:
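# Quick first approach: TF-IDF
# STOP_WORDS = 'english'
# Built-in English stop words plus the single digits 0-9
STOP_WORDS = list(text.ENGLISH_STOP_WORDS.union([str(i) for i in range(10)]))
# Ignore terms that occur in fewer than 5% of the documents
MIN_DOCS = .05
# Alternative token patterns (currently unused)
# TOKEN_PATTERN = '(?u)\\b\\w*[a-zA-Z]\\w*\\b'
# TOKEN_PATTERN = '(?u)\\b\\w*[a-zA-Z]\\w*[a-zA-Z]\\w*\\b'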


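def rank_words(df, ranked_words, column_probed, column_affirmative):
    """Add a column counting how many words of each text in `column_probed` occur in `ranked_words.index`."""
    keywords = set(ranked_words.index)

    def count_keywords(blurb):
        # Split hyphenated words, lowercase, and strip non-word characters before matching
        words = blurb.replace('-', ' ').lower().split()
        return sum(re.sub(r'\W+', '', word) in keywords for word in words)

    df.loc[:, column_affirmative] = df[column_probed].apply(count_keywords)
    return df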


def get_word_count_in_classified_blurbs(df, count_column):
    return df[count_column].apply(lambda x: 'No keyword' if x == 0 else 'at least one keyword').value_counts()
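
To make the matching rule in `rank_words` concrete: hyphens are turned into spaces, the text is lowercased and split on whitespace, and punctuation is stripped from each token before it is looked up. The standalone sketch below applies the same steps to a single string. The keyword set, the sample sentences, and the helper name `count_keywords` are made up for illustration.

```python
import re


def count_keywords(blurb, keywords):
    """Count the tokens of `blurb` that exactly match one of `keywords`."""
    tokens = blurb.replace('-', ' ').lower().split()
    cleaned = (re.sub(r'\W+', '', token) for token in tokens)
    return sum(token in keywords for token in cleaned)


# Hypothetical keyword set and sample sentences
keywords = {"eco", "friendly", "sustainable"}
print(count_keywords("An eco-friendly, sustainable tent.", keywords))
print(count_keywords("Eco-Friendliness matters.", keywords))
```

The second sentence only matches 'eco': the lookup is exact, so inflected forms such as 'friendliness' are not counted unless the keyword list contains them.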


In [6]:
# 1. Find top-ranking words in samples classified as environmental
#----------------------------------------------------------------
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df=MIN_DOCS)

blurb_is_environmental = df_is_envt_or_social[df_is_envt_or_social['is_environmental'] == 'Yes']['blurb'].tolist()
blurb_is_environmental = [blurb.lower() for blurb in blurb_is_environmental]

# Vectorize the corpus
tf_idf_vector = tf_idf_model.fit_transform(blurb_is_environmental)

# Get the terms (vocabulary) retained by the vectorizer
words_set = tf_idf_model.get_feature_names_out()

# DataFrame with the TF-IDF score of each term in each document
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)

# Calculate the mean TF-IDF score of each word across all documents
word_importance = df_tf_idf.mean(axis=0)

# Sort words by their mean TF-IDF score
ranked_words_environmental = word_importance.sort_values(ascending=False)

# Print the ranked words
print(f'No. of identified top words distinguishing environmentally relevant blurbs: {len(ranked_words_environmental)}')
print('Top words (is_environmental)')
print('-----------------------------')
print(ranked_words_environmental)



No. of identified top words distinguishing environmentally relevant blurbs: 18
Top words (is_environmental)
-----------------------------
organic        0.142245
sustainable    0.138854
healthy        0.087672
friendly       0.078001
eco            0.068118
produce        0.061879
save            0.05899
awareness      0.057187
world          0.055405
company        0.054219
natural        0.053191
materials      0.053099
small          0.049525
build          0.048795
mobile         0.043798
create         0.042819
food            0.04052
innovative     0.036686
dtype: Sparse[float64, 0]


In [7]:
# Work on an explicit copy so that adding columns does not trigger SettingWithCopyWarning
df_envt = df_is_envt_or_social[df_is_envt_or_social['is_environmental'] == 'Yes'].drop(columns='is_social').copy()
rank_words(df_envt, ranked_words_environmental, column_probed='blurb', column_affirmative='yes_count: is_envt')
word_count_classified_envt = get_word_count_in_classified_blurbs(df_envt, 'yes_count: is_envt')
print(word_count_classified_envt)
# Share of environmental blurbs that contain at least one keyword (measured on the same blurbs the keywords came from)
n_with_keyword = word_count_classified_envt.get('at least one keyword', 0)
print(f'accuracy: {n_with_keyword / word_count_classified_envt.sum():.4f}')
df_envt

yes_count: is_envt
at least one keyword    35
No keyword               8
Name: count, dtype: int64
accuracy: 0.8140


Unnamed: 0,campaign_name,blurb,is_environmental,yes_count: is_envt
50,Beluga tent 6-in-1 from Qaou,The first all in one highly eco-friendly tent made from recycled plastic.,Yes,2
74,"Thé-tis Tea : Plant-based seaweed tea, rich in minerals","Delicious tea infusion made with seaweed. Healthy, organic, plant-based, eco-friendly, and rich-mineral tea for vegans.",Yes,4
103,baby food,"Inspired by the selection at the Grocery mart, I want to make Safe Healthy Nutritious Slurpable foods for baby. No preservatives added.",Yes,1
129,Chique Addiction,"High fashions made from ethical and sustainable, environmentally friendly, vegan fabrics for the modern world.",Yes,3
167,Hearth & Market - Wood Fired Food Truck & Mobile Market,"A wood fired food truck & mobile farmers market that connects you to our farm, way of life and certified organic produce and products.",Yes,4
176,Sutra (Thread)Hand Dyed Hand Spinned Sustainable Yarn,To create yarn&projects out of sustainable bamboo&hemp fiber with the desert dye cochineal.,Yes,2
233,Catboxpro: Self-Flushing Automatic Cat Litter Box,"No monthly Subscriptions to bags, chemicals, filters or litter with the Catboxpro.",Yes,0
241,"Rebel Swim - Men's swim shorts, designed with a purpose!",Buy a pair of our beautiful men's swim shorts and protect an endangered animal!,Yes,0
294,"Ash Apothecary: Small Batch, All-Natural Simple Syrup","Small-batch simple syrups for bartending, mixology, coffee, cocktails, soda, chai, and more. Only organic and non-GMO ingredients.",Yes,2
345,Pawstively Droolicious,An all natural and homemade dog treats that are personalized to every dog's needs and desires.,Yes,1
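
The 'accuracy' printed above is the share of positively labeled blurbs that contain at least one keyword, in other words the recall of the rule "at least one keyword means 'Yes'", measured on the same blurbs the keywords were extracted from. Judging the rule also requires counting how often it fires on blurbs labeled 'No'. A minimal sketch with invented labels and counts (`keyword_rule_metrics` is a hypothetical helper, not part of the notebook):

```python
def keyword_rule_metrics(labels, keyword_counts):
    """Precision and recall of the rule 'at least one keyword => Yes'."""
    predicted = [count > 0 for count in keyword_counts]
    actual = [label == "Yes" for label in labels]
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    false_neg = sum(a and not p for p, a in zip(predicted, actual))
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Invented labels and keyword counts for six blurbs
labels = ["Yes", "Yes", "Yes", "No", "No", "No"]
keyword_counts = [2, 1, 0, 1, 0, 0]
precision, recall = keyword_rule_metrics(labels, keyword_counts)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

On the real data, the recall for the environmental label corresponds to the 35 of 43 blurbs reported above; the precision can only be computed once the blurbs labeled 'No' are scored with the same keyword list.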


In [8]:
#----------------------------------------------------------------
# 2. Find top-ranking words in samples classified as social
#----------------------------------------------------------------
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df=MIN_DOCS)

blurb_is_social = df_is_envt_or_social[df_is_envt_or_social['is_social'] == 'Yes']['blurb'].tolist()
blurb_is_social = [blurb.lower() for blurb in blurb_is_social]

tf_idf_vector = tf_idf_model.fit_transform(blurb_is_social)
words_set = tf_idf_model.get_feature_names_out()
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)
word_importance = df_tf_idf.mean(axis=0)
ranked_words_social = word_importance.sort_values(ascending=False)

print(f'No. of identified top words distinguishing socially relevant blurbs: {len(ranked_words_social)}')
print('Top words (is_social)')
print('-----------------------------')
print(ranked_words_social)


No. of identified top words distinguishing socially relevant blurbs: 20
Top words (is_social)
-----------------------------
community    0.137246
support      0.120478
building     0.080387
awareness    0.079236
public       0.075793
area         0.070734
film         0.065658
project       0.06432
shirt        0.061332
lives        0.058119
help         0.055642
19           0.054839
covid        0.054839
free         0.054441
raise        0.044806
know         0.043906
save          0.04379
children     0.043449
fighting     0.042745
app          0.038728
dtype: Sparse[float64, 0]


In [9]:
# Work on an explicit copy so that adding columns does not trigger SettingWithCopyWarning
df_social = df_is_envt_or_social[df_is_envt_or_social['is_social'] == 'Yes'].drop(columns='is_environmental').copy()

rank_words(df_social, ranked_words_social, column_probed='blurb', column_affirmative='yes_count: is_social')

word_count_classified_social = get_word_count_in_classified_blurbs(df_social, 'yes_count: is_social')
print(word_count_classified_social)
# Share of social blurbs that contain at least one keyword (measured on the same blurbs the keywords came from)
n_with_keyword = word_count_classified_social.get('at least one keyword', 0)
print(f'accuracy: {n_with_keyword / word_count_classified_social.sum():.4f}')
df_social

yes_count: is_social
at least one keyword    22
No keyword               4
Name: count, dtype: int64
accuracy: 0.8462


Unnamed: 0,campaign_name,blurb,is_social,yes_count: is_social
6,Surviving the Unknown,A family struggles to survive off the grid in secrecy. But it's more than just the harsh elements that are tearing them apart.,Yes,0
26,The Call - a voice to the voiceless,"This is a project, which aims to save lives of unarmed men, women and children trapped in war, who reject to participate in violence!",Yes,4
58,Et al. Creatives,"A collaborative employment, resource, and community platform.",Yes,1
66,the breast express,pumpspotting is going cross-country to support & show up for breastfeeding moms and document the boob-venture of a lifetime.,Yes,1
91,Gay Occasions,"I was looking in a card shop for a card for my fiancée, and was struck by the lack of LGBT cards available. Let's make it happen.",Yes,0
104,MIRZ PLAYING CARDS : 2ND EDITION (feat. Hope For Justice),Change lives. End Slavery.,Yes,1
134,Seattle Streets to Main Street: End Child Trafficking.,Help me build the social impact of my award winning documentary “The Long Night” and get the film to audiences everywhere.,Yes,2
154,MizaBella After School Project,Teaching Kids How To Knit,Yes,0
217,Aegis,Aegis- A turnkey security solution that scans the area for security threats and risks to safeguard public health w.r.t Covid-19,Yes,4
388,"Little Free Library in West Louisville, Kentucky","Support the creation of a little free library in West Louisville, Kentucky.",Yes,2


In [10]:
#----------------------------------------------------------------
#3. Find top ranking words in all samples, for the sake of completeness
#----------------------------------------------------------------
tf_idf_model = TfidfVectorizer(stop_words=STOP_WORDS, min_df=MIN_DOCS)

blurb_all = df['blurb'].tolist()
blurb_all = [blurb.lower() for blurb in blurb_all]

tf_idf_vector = tf_idf_model.fit_transform(blurb_all)
words_set = tf_idf_model.get_feature_names_out()
df_tf_idf = pd.DataFrame.sparse.from_spmatrix(tf_idf_vector, columns=words_set)
word_importance = df_tf_idf.mean(axis=0)
ranked_words = word_importance.sort_values(ascending=False)
print(f'No. of identified top words in all blurbs: {len(ranked_words)}')
print('Top words (all)')
print('-----------------------------')
print(ranked_words)

No. of identified top words in all blurbs: 5
Top words (all)
-----------------------------
help     0.081314
new      0.076475
book     0.049146
world    0.048567
album    0.044756
dtype: Sparse[float64, 0]
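
Only five words survive for the full corpus because `min_df` is given as a fraction: a term is kept only if it occurs in at least that share of the documents, so the absolute threshold grows with the size of the corpus. The sketch below shows the resulting minimum document counts for the two labeled subsets. It assumes that scikit-learn keeps terms whose document frequency is at least `min_df * n_documents`; `min_document_count` is a hypothetical helper written for this example.

```python
import math


def min_document_count(min_df, n_documents):
    """Smallest integer document frequency that satisfies df >= min_df * n_documents."""
    return math.ceil(min_df * n_documents)


print(min_document_count(0.05, 43))  # 43 environmental blurbs
print(min_document_count(0.05, 26))  # 26 social blurbs
```

For the small labeled subsets a word only has to appear in a handful of blurbs, whereas for the full dataset it has to appear in 5% of all campaigns, which only very generic words such as 'help' or 'new' manage.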


In [11]:
common_words_social_environment = set(ranked_words_social.head(25).index) & set(ranked_words_environmental.head(25).index)
print(common_words_social_environment)

{'save', 'awareness'}
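
Words shared by both lists, or words that are simply common across all blurbs, do not help to tell the two labels apart. One possible follow-up, sketched here with the word lists printed above copied in by hand, is to drop them and keep only class-specific keywords. This is an illustrative suggestion and not part of the analysis above.

```python
# Word lists copied from the three outputs above
environmental_words = {
    "organic", "sustainable", "healthy", "friendly", "eco", "produce",
    "save", "awareness", "world", "company", "natural", "materials",
    "small", "build", "mobile", "create", "food", "innovative",
}
social_words = {
    "community", "support", "building", "awareness", "public", "area",
    "film", "project", "shirt", "lives", "help", "19", "covid", "free",
    "raise", "know", "save", "children", "fighting", "app",
}
overall_words = {"help", "new", "book", "world", "album"}

# Words that occur in both class-specific lists
shared = environmental_words & social_words

# Keep only words that are neither shared nor generally frequent
environmental_only = environmental_words - shared - overall_words
social_only = social_words - shared - overall_words

print(sorted(shared))
print(len(environmental_only), len(social_only))
```

Note that 'build' and 'building' are treated as different words here; merging such variants would require stemming or lemmatisation.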
