## Identification of sustainability-focused campaigns on the Kickstarter crowdfunding platform using NLP and ML boosted with swarm intelligence
--- ------------------
<div>
Data Analysis: part 3
<br>
Submitted by: Jossin Antony<br>
Affiliation: THU Ulm<br>
Date: 23.06.2024
</div>

## Overview
- [Introduction](#A.-Introduction)
- [Visualizations](#B.-Visualizations)

### A. Introduction
--- -------------------

We continue our analysis with the filtered dataset from part 2, in which the samples were categorized according to their social and environmental relevance. However, only a very small number of campaigns are socially or environmentally relevant, so deriving conclusive results on campaign success from the social/environmental point of view alone might not be feasible.

In [19]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)

import numpy as np

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import make_pipeline


import xgboost as xgb
import matplotlib.pyplot as plt

import altair as alt

In [57]:
#load the dataset
df= pd.read_csv('./data/dataframe_categorized.csv', low_memory=False)
df.rename(columns={'country':'country_code'},inplace=True)

In [58]:
df.columns

Index(['campaign_name', 'blurb', 'main_category', 'sub_category',
       'is_environmental', 'is_social', 'country_code', 'duration_in_months',
       'goal_usd_category', 'is_success'],
      dtype='object')

In [8]:
shape_envt= df[df['is_environmental']=='Yes'].shape
print(f'Number of samples marked as environmentally relevant: {shape_envt[0]}; i.e, {(shape_envt[0] *100/df.shape[0]):2.3f} % of total samples')

shape_social= df[df['is_social']=='Yes'].shape
print(f'Number of samples marked as socially relevant: {shape_social[0]}; i.e, {(shape_social[0] *100/df.shape[0]):2.3f} % of total samples')

shape_success= df[df['is_success']!='fail'].shape
print(f'Number of samples marked as success: {shape_success[0]}; i.e, {(shape_success[0] *100/df.shape[0]):2.3f} % of total samples')

shape_envt_success= df[(df['is_success']!='fail') & (df['is_environmental']=='Yes')].shape
print(f'Number of environmentally successful samples: {shape_envt_success[0]}; i.e, {(shape_envt_success[0] *100/df.shape[0]):2.3f} % of total samples')


shape_social_success= df[(df['is_success']!='fail') & (df['is_social']=='Yes')].shape
print(f'Number of socially successful samples: {shape_social_success[0]}; i.e, {(shape_social_success[0] *100/df.shape[0]):2.3f} % of total samples')


Number of samples marked as environmentally relevant: 1047; i.e, 0.751 % of total samples
Number of samples marked as socially relevant: 698; i.e, 0.500 % of total samples
Number of samples marked as success: 76642; i.e, 54.943 % of total samples
Number of environmentally successful samples: 341; i.e, 0.244 % of total samples
Number of socially successful samples: 291; i.e, 0.209 % of total samples
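
The counts and percentages above all follow the same pattern: filter, count, divide by the total. A minimal sketch of that pattern as a reusable helper is shown below. The function name `share_of` and the toy DataFrame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def share_of(mask: pd.Series) -> tuple[int, float]:
    """Return the number of True entries in a boolean mask and their percentage."""
    count = int(mask.sum())
    return count, 100.0 * count / len(mask)

# Toy data (hypothetical) with the same column names as the notebook
toy = pd.DataFrame({
    'is_environmental': ['Yes', 'No', 'No', 'Yes'],
    'is_success': ['fail', 'goal_achieved', 'fail', 'blockbuster'],
})

is_envt = toy['is_environmental'] == 'Yes'
is_successful = toy['is_success'] != 'fail'

envt_count, envt_pct = share_of(is_envt)
envt_success_count, envt_success_pct = share_of(is_envt & is_successful)
print(f'Environmentally relevant: {envt_count} ({envt_pct:.3f} % of total samples)')
print(f'Environmentally relevant and successful: {envt_success_count} ({envt_success_pct:.3f} % of total samples)')
```

Combining boolean masks with `&` keeps every percentage relative to the full dataset, which matches how the figures above are reported.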


Nevertheless, we extract and focus our analysis on the environmentally and socially relevant campaigns.

In [9]:
df_envt= df[df['is_environmental']=='Yes']
df_social= df[df['is_social']=='Yes']

### B. Visualizations
--- -------------------

Here we visualize how the social and environmental campaigns are distributed over features such as the success level achieved, the funding target and the country of the campaign.

#### B.1 Success levels of campaigns

The funding success levels of campaigns are plotted here. 

In [10]:
count_envt_fail= len(df_envt[df_envt['is_success']=='fail'])
count_envt_success= len(df_envt[df_envt['is_success']=='goal_achieved'])
count_envt_blockbuster= len(df_envt[df_envt['is_success']=='blockbuster'])

count_social_fail= len(df_social[df_social['is_success']=='fail'])
count_social_success= len(df_social[df_social['is_success']=='goal_achieved'])
count_social_blockbuster= len(df_social[df_social['is_success']=='blockbuster'])


data_envt = {'Environment campaigns': ['Failure','Success', 'Blockbuster'],
        'Count': [count_envt_fail,count_envt_success,count_envt_blockbuster],
        }
data_envt= pd.DataFrame(data_envt)

chart_envt = alt.Chart(data_envt).mark_arc(innerRadius=0).encode(
    theta=alt.Theta(field="Count", type="quantitative"),
    color=alt.Color(field="Environment campaigns", type="nominal"),
    tooltip=['Environment campaigns', 'Count']
).properties(
    width=200,
    height=200
).interactive()

data_social = {'Social campaigns': ['Failure','Success', 'Blockbuster'],
        'Count': [count_social_fail,count_social_success,count_social_blockbuster],
        }
data_social= pd.DataFrame(data_social)

chart_social = alt.Chart(data_social).mark_arc(innerRadius=0).encode(
    theta=alt.Theta(field="Count", type="quantitative"),
    color=alt.Color(field="Social campaigns", type="nominal"),
    tooltip=['Social campaigns', 'Count']
).properties(
    width=200,
    height=200
).interactive()

combined_charts = alt.hconcat(chart_envt, chart_social).configure_view(
    strokeWidth=0
)
combined_charts

Similar trends are observed for both social and environmental campaigns. Going by the counts printed in the introduction, about <span style="color: red">two-thirds of the environmental campaigns (706 of 1047)</span> and just under <span style="color: red">three-fifths of the social campaigns (407 of 698)</span> <span style="color: red">failed to acquire the required funding</span>. Most of the remaining campaigns were a <span style="color: green">success</span>: they acquired up to 300% of the requested funding. A very <span style="color: cyan">small percentage of the campaigns</span> acquired <span style="color: cyan">more than 300% of the requested funding</span>.
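
The shares of the individual success levels can be read directly off the data instead of being estimated from the pie charts. A small sketch, assuming the `is_success` labels `fail`, `goal_achieved` and `blockbuster` used in this notebook; the toy values are made up for illustration:

```python
import pandas as pd

# Hypothetical sample of success labels
toy = pd.DataFrame({
    'is_success': ['fail', 'fail', 'fail', 'goal_achieved', 'blockbuster',
                   'fail', 'goal_achieved', 'fail'],
})

# normalize=True returns relative frequencies that sum to 1
success_shares = toy['is_success'].value_counts(normalize=True)
print((success_shares * 100).round(1))
```

Applied to `df_envt` and `df_social`, the same call would give the exact percentages behind each chart segment.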

In [11]:
# Distribution of funding target amounts for environmentally relevant campaigns
target_amounts_counts_envt = df_envt['goal_usd_category'].value_counts().reset_index()
target_amounts_counts_envt.columns = ['goal_usd_category', 'count']  # Rename columns for clarity


pie_chart_envt = alt.Chart(target_amounts_counts_envt).mark_arc(innerRadius=60).encode(
    theta=alt.Theta(field="count", type="quantitative"),  # The angle of the arc
    color=alt.Color(field="goal_usd_category", type="nominal"),  # Color by goal_usd_category
    tooltip=['goal_usd_category', 'count']  # Show tooltips for additional info
).properties(
    width=250,
    height=250,
    title='Distribution of fund targets in environmental campaigns'
)

# Distribution of funding target amounts  for socially relevant campaigns
target_amounts_counts_social = df_social['goal_usd_category'].value_counts().reset_index()
target_amounts_counts_social.columns = ['goal_usd_category', 'count']  # Rename columns for clarity


pie_chart_social = alt.Chart(target_amounts_counts_social).mark_arc(innerRadius=60).encode(
    theta=alt.Theta(field="count", type="quantitative"),  # The angle of the arc
    color=alt.Color(field="goal_usd_category", type="nominal"),  # Color by goal_usd_category
    tooltip=['goal_usd_category', 'count']  # Show tooltips for additional info
).properties(
    width=250,
    height=250,
    title='Distribution of fund targets in social campaigns'
)

combined_charts = alt.vconcat(pie_chart_envt, pie_chart_social).configure_view(
    strokeWidth=0
)
combined_charts

Similar trends are observable in the funding targets. In both the social and the environmental category, <span style="color: fuchsia">a little less than one-half of all campaigns</span> targeted amounts in each of the <span style="color: fuchsia">ranges 1k-10k USD and 10k-50k USD</span>. Around <span style="color: olive">1/8th of the campaigns</span> requested <span style="color: olive">more than 50k USD</span>.

In [12]:
# Distribution of main categories for environmentally relevant campaigns
main_category_counts_envt = df_envt['main_category'].value_counts().reset_index()
main_category_counts_envt.columns = ['main_category', 'count']  # Rename columns for clarity


pie_chart_envt = alt.Chart(main_category_counts_envt).mark_arc(innerRadius=40).encode(
    theta=alt.Theta(field="count", type="quantitative"),  # The angle of the arc
    color=alt.Color(field="main_category", type="nominal"),  # Color by main_category
    tooltip=['main_category', 'count']  # Show tooltips for additional info
).properties(
    width=250,
    height=250,
    title='Distribution of categories in environmental campaigns'
)

# Distribution of main categories for socially relevant campaigns
main_category_counts_social = df_social['main_category'].value_counts().reset_index()
main_category_counts_social.columns = ['main_category', 'count']  # Rename columns for clarity


pie_chart_social = alt.Chart(main_category_counts_social).mark_arc(innerRadius=40).encode(
    theta=alt.Theta(field="count", type="quantitative"),  # The angle of the arc
    color=alt.Color(field="main_category", type="nominal"),  # Color by main_category
    tooltip=['main_category', 'count']  # Show tooltips for additional info
).properties(
    width=250,
    height=250,
    title='Distribution of categories in social campaigns'
)

combined_charts = alt.vconcat(pie_chart_envt, pie_chart_social).configure_view(
    strokeWidth=0
)
combined_charts

The distribution of the categories in which funding was requested shows interesting trends. More than <span style="color: blue">one-half </span>of the requests among the environmental campaigns are <span style="color: blue">food-related, followed by fashion</span>, showing that many campaigns aim at finding environment-friendly solutions in these two domains. The <span style="color: teal">social campaigns are also led</span> by the food category, followed by <span style="color: teal">technology, art and publishing</span>.

In [13]:
#Environmental relevance
# Step 1: Filter for successful campaigns (assuming 'is_success' indicates success)
df_successful_envt = df_envt[df_envt['is_success'] != 'fail'] 

# Step 2 & 3: Get value counts and reset index to make it a DataFrame
country_counts_envt = df_successful_envt['country_code'].value_counts().reset_index()
country_counts_envt.columns = ['country', 'count']  # Renaming columns for clarity

chart_envt = alt.Chart(country_counts_envt).mark_bar().encode(
    x=alt.X('country:N', sort='-y'),  # Sort countries based on the count, in descending order
    y=alt.Y('count:Q'),
    tooltip=['country', 'count']
).properties(
    width=250,
    height=250,
    title='Count of Environmentally successful Campaigns by Country'
)

#Social relevance
# Step 1: Filter for successful campaigns (assuming 'is_success' indicates success)
df_successful_social= df_social[df_social['is_success'] != 'fail'] 

# Step 2 & 3: Get value counts and reset index to make it a DataFrame
country_counts_social = df_successful_social['country_code'].value_counts().reset_index()
country_counts_social.columns = ['country', 'count']  # Renaming columns for clarity

chart_social = alt.Chart(country_counts_social).mark_bar().encode(
    x=alt.X('country:N', sort='-y'),  # Sort countries based on the count, in descending order
    y=alt.Y('count:Q'),
    tooltip=['country', 'count']
).properties(
    width=250,
    height=250,
    title='Count of Socially successful Campaigns by Country'
)


combined_charts = alt.hconcat(chart_envt, chart_social).configure_view(
    strokeWidth=0
)
combined_charts

The country contributions show similar trends. The <span style="color: lime">United States (US) </span>leads in both categories, followed by <span style="color: blue">Great Britain (GB), Canada (CA) and Australia (AU)</span>.

## SVM on df

In [28]:
x=df[['main_category', 'sub_category',
      'is_environmental', 'is_social',
      'country_code', 'duration_in_months',
      'goal_usd_category', 'is_success']].copy()
x['is_success'] = np.where(x['is_success']=='fail', 'fail', 'success')

In [29]:
# Step 1: Define the categorical columns to be one-hot encoded
categorical_features = ['main_category', 'sub_category',
       'is_environmental', 'is_social', 'country_code']
one_hot_encoder = OneHotEncoder(handle_unknown='ignore')
feature_names= ['main_category', 'sub_category',
      'is_environmental', 'is_social',
      'country_code', 'duration_in_months',
      'goal_usd_category']

# Step 2: Split data into training and testing sets
X = x[feature_names].copy()  # Features
y = x['is_success']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


# Step 3: Create a preprocessing step that one-hot encodes the categorical columns.
# Columns not listed in categorical_features ('duration_in_months', 'goal_usd_category')
# are dropped, since ColumnTransformer uses remainder='drop' by default.
preprocessor = ColumnTransformer(transformers=[
    ('cat', one_hot_encoder, categorical_features)
])

svc_model= LinearSVC(random_state=42, tol=1e-5)
# SVM model with pipeline
svm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler(with_mean=False)),  # Use with_mean=False for sparse matrices
    ('svc', svc_model)  # LinearSVC uses a linear kernel; SVC would be needed for 'rbf' or others
])

# Define the parameter grid to search
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
}

# Setup the stratified folds
stratified_k_fold = StratifiedKFold(n_splits=5)

# Setup the GridSearchCV object
grid_search = GridSearchCV(svm_pipeline, param_grid, cv=stratified_k_fold, scoring='accuracy')

# Fit the model
grid_search.fit(X, y)

# Print the best parameters and the best score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

Best parameters: {'svc__C': 0.1}
Best cross-validation score: 0.748991005831382
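
In the cell above the grid search is fitted on the full data `X, y`, so the rows that later serve as the test set also take part in tuning, and the evaluation that follows uses the pipeline's default `C` rather than the tuned value. One possible way to keep the test set untouched and to evaluate the tuned model is sketched below on synthetic data. The names `X_toy`, `pipe` and `search` are illustrative assumptions and the dataset is not the Kickstarter data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the campaign features (assumption, not the real data)
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('svc', LinearSVC(random_state=42, max_iter=5000)),
])

search = GridSearchCV(pipe, {'svc__C': [0.1, 1, 10]},
                      cv=StratifiedKFold(n_splits=5), scoring='accuracy')
search.fit(X_tr, y_tr)  # the test split is never seen during tuning

# With refit=True (the default), best_estimator_ is retrained on all of X_tr
test_accuracy = search.best_estimator_.score(X_te, y_te)
print('Best C:', search.best_params_['svc__C'])
print('Held-out accuracy:', round(test_accuracy, 3))
```

With this arrangement the cross-validation score and the held-out score are computed on disjoint rows, so the second is a fairer estimate of generalization.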


In [33]:
# Step 4: Train the SVM model
svm_pipeline.fit(X_train, y_train)

# Step 5: Evaluate the model
y_pred = svm_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

        fail       0.70      0.80      0.74     12570
     success       0.81      0.71      0.76     15329

    accuracy                           0.75     27899
   macro avg       0.75      0.76      0.75     27899
weighted avg       0.76      0.75      0.75     27899



In [34]:
# Accessing the coefficients
coefficients = svc_model.coef_[0]  # For a binary classification, .coef_ returns a 2D array

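# Caution: coef_ holds one weight per one-hot encoded column, whereas feature_names lists the
# original columns; zip() stops at the shorter sequence, so only the first weights get a label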
feature_importance = zip(feature_names, coefficients)

# Sorting features by absolute importance
sorted_feature_importance = sorted(feature_importance, key=lambda x: abs(x[1]), reverse=True)

for feature, importance in sorted_feature_importance:
    print(f"Feature: {feature}, Importance: {importance}")

Feature: is_environmental, Importance: -0.06059888379571127
Feature: sub_category, Importance: 0.05289902515102622
Feature: goal_usd_category, Importance: -0.019619437012716782
Feature: duration_in_months, Importance: 0.018483746365640093
Feature: main_category, Importance: -0.018155922258632936
Feature: is_social, Importance: 0.014907831451923982
Feature: country, Importance: 0.013689495882467899


In [None]:
from sklearn.model_selection import cross_val_score

# Assuming the rest of the code is the same, especially the part where you define svm_pipeline and preprocessor

# Prepare the full dataset (without explicit train-test split)
X = x[categorical_features]  # Features
y = x['is_success']  # Target variable

# Execute cross-validation
cv_scores = cross_val_score(svm_pipeline, X, y, cv=5)  # Using 5-fold cross-validation

# Calculate and print the average score
print("Average CV Score:", cv_scores.mean())

## RF on df

In [84]:
x=df[['main_category', 'sub_category',
      'is_environmental', 'is_social',
      'country_code', 'duration_in_months',
      'goal_usd_category', 'is_success']].copy()
x['is_success'] = np.where(x['is_success']=='fail', 'fail', 'success')

#x= x.head(50000)

In [86]:
categorical_features = ['main_category', 'sub_category','is_environmental',
                 'is_social', 'country_code', 'duration_in_months',
                'goal_usd_category']

one_hot_encoder = OneHotEncoder(handle_unknown='ignore')

feature_names= ['main_category', 'sub_category','is_environmental',
                 'is_social', 'country_code', 'duration_in_months',
                'goal_usd_category']
# Step 1: Encode categorical variables and split data into training and testing sets
X = x[feature_names].copy()  # Features
y = x['is_success']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 2: Create a processing pipeline
preprocessor = ColumnTransformer(transformers=[
    ('cat', one_hot_encoder, categorical_features)
])

rfc= RandomForestClassifier(random_state=42, verbose=True, n_jobs=-1)

# Random Forest model with pipeline
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler(with_mean=False)),  # Use with_mean=False for sparse matrices
    ('classifier', rfc)
])

# Define the parameter grid to search
param_grid = {
    'classifier__n_estimators': [50, 100, 150],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

# Setup the stratified folds and GridSearchCV object
stratified_k_fold = StratifiedKFold(n_splits=5)
grid_search = GridSearchCV(rf_pipeline, param_grid, cv=stratified_k_fold, scoring='accuracy', n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters and the best score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)



[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    1.7s


Best parameters: {'classifier__max_depth': None, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 150}
Best cross-validation score: 0.7663763312922951


[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:    8.9s finished


In [87]:
# Make predictions with the best model
y_pred = grid_search.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred))

[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 150 out of 150 | elapsed:    0.0s finished


              precision    recall  f1-score   support

        fail       0.72      0.79      0.75     12570
     success       0.81      0.75      0.78     15329

    accuracy                           0.77     27899
   macro avg       0.77      0.77      0.77     27899
weighted avg       0.77      0.77      0.77     27899



In [88]:
# Access the fitted RandomForestClassifier
fitted_rfc = grid_search.best_estimator_['classifier']

# Get feature importances
importances = fitted_rfc.feature_importances_

# Get the feature names after one-hot encoding
transformed_feature_names = grid_search.best_estimator_.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features)

# For numerical features, their names remain the same
#numerical_features = ['duration_in_months']
numerical_features=[]
# Combine all feature names
all_feature_names = np.concatenate([transformed_feature_names, numerical_features])

# Match feature names with their importances
feature_importances = zip(all_feature_names, importances)

# Sort the features by importance
sorted_feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

# Print the sorted feature importances
for feature, importance in sorted_feature_importances:
    print(f"{feature}: {importance}")

goal_usd_category_50k_plus: 0.05299544864705209
goal_usd_category_1k-10k: 0.03568470715708515
main_category_Food: 0.03525555099366868
sub_category_Documentary: 0.033817983776288314
sub_category_Apparel: 0.02970718688693122
duration_in_months_2: 0.029249522335045568
sub_category_Narrative Film: 0.027367485711013325
sub_category_Product Design: 0.02651519607312934
main_category_Comics: 0.02648676751909887
duration_in_months_1: 0.02564515587270092
sub_category_Nonfiction: 0.02374849574398016
sub_category_Video Games: 0.02318859218624527
sub_category_Country & Folk: 0.022696278166857594
sub_category_Shorts: 0.02199182336840339
sub_category_Children's Books: 0.021932842258149635
sub_category_Rock: 0.021751622968531534
sub_category_Web: 0.021244561655453985
sub_category_Indie Rock: 0.021147796161915294
main_category_Crafts: 0.020582111200543138
sub_category_Playing Cards: 0.020066031859875412
sub_category_Mobile Games: 0.01810061577538836
sub_category_Fiction: 0.017040380434965918
sub_catego

In [90]:
# Given feature importances from the model output
feature_importances = {feature: importance for feature, importance in sorted_feature_importances}

# Initialize a dictionary to hold the aggregated importances
aggregated_importances = {}

# Aggregate importances per original feature
for feature, importance in feature_importances.items():
    # The first two '_'-separated tokens identify the original feature,
    # e.g. 'sub_category_Rock' -> 'subcategory', 'goal_usd_category_1k-10k' -> 'goalusd'
    original_feature = ''.join(feature.split("_")[:2])
    # Sum the importances of all one-hot columns that share this prefix
    aggregated_importances[original_feature] = aggregated_importances.get(original_feature, 0) + importance

# Sort the aggregated importances
sorted_aggregated_importances = sorted(aggregated_importances.items(), key=lambda x: x[1], reverse=True)

# Display the sorted aggregated importances
for feature, importance in sorted_aggregated_importances:
    print(f"{feature}: {importance}")

subcategory: 0.634267958411183
maincategory: 0.1653715600845491
goalusd: 0.10267485572602472
durationin: 0.059014424378228215
countrycode: 0.03615633315937031
issocial: 0.0012951939709886563
isenvironmental: 0.001219674269656207
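
The keys printed above are the first two underscore-separated tokens of each encoded column joined together (`subcategory`, `goalusd`, and so on). If the full original column names are preferred, one option is to match each encoded column against the list of input columns by prefix. The sketch below uses made-up importances, and the names `original_features` and `encoded_importances` are assumptions for illustration.

```python
from collections import defaultdict

original_features = ['main_category', 'sub_category', 'duration_in_months']

# Made-up importances keyed by one-hot column name ('<feature>_<category>')
encoded_importances = {
    'main_category_Food': 0.20,
    'main_category_Art': 0.10,
    'sub_category_Rock': 0.30,
    'sub_category_Web': 0.25,
    'duration_in_months_1': 0.10,
    'duration_in_months_2': 0.05,
}

aggregated = defaultdict(float)
for encoded_name, importance in encoded_importances.items():
    # Attribute the column to the original feature whose name it starts with
    source = next(f for f in original_features if encoded_name.startswith(f + '_'))
    aggregated[source] += importance

for feature, importance in sorted(aggregated.items(), key=lambda item: item[1], reverse=True):
    print(f'{feature}: {importance:.2f}')
```

Prefix matching works here because no original column name is a prefix of another one followed by an underscore; with overlapping names a more careful mapping would be needed.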


In [48]:
# Ensure 'duration_in_months' is in the list of all feature names
print('duration_in_months' in all_feature_names)

# Directly check the importance of 'duration_in_months'
duration_index = np.where(all_feature_names == 'duration_in_months')[0]
if duration_index.size > 0:
    print(f"Importance of 'duration_in_months': {importances[duration_index[0]]}")
else:
    print("'duration_in_months' not found in the feature names.")

True


IndexError: index 189 is out of bounds for axis 0 with size 189

In [53]:
x=df_envt[['main_category','sub_category','goal_usd_category','duration_in_months','is_success']].copy()
x['is_success'] = np.where(x['is_success']=='fail', 'fail', 'success')

In [54]:

# Step 1: Select the categorical feature columns
categorical_features = ['main_category', 'sub_category','goal_usd_category', 'duration_in_months']

# Step 2: Label-encode every feature column
#X = pd.get_dummies(x[categorical_features])

label_encoder = LabelEncoder()
X = x[categorical_features].copy()  # Features; copy to avoid writing into a view of x


for col in categorical_features:
    X[col] = label_encoder.fit_transform(X[col])
# Assuming 'df_envt' is your DataFrame and it includes the 'is_success' column

y = x['is_success']  # Target variable

# Encode the target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Initialize and train the Random Forest model
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(X_train, y_train)

# Predict on the testing set
y_pred = random_forest.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7238095238095238
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.91      0.81       134
           1       0.71      0.39      0.51        76

    accuracy                           0.72       210
   macro avg       0.72      0.65      0.66       210
weighted avg       0.72      0.72      0.70       210
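
`LabelEncoder` is intended for target labels, and in the loop above it is refitted for every column, so the fitted mappings are not kept. A possible alternative is `OrdinalEncoder`, which encodes all feature columns in one step and can be fitted on the training split only. The sketch below uses made-up rows, and the variable names are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({
    'main_category': ['Food', 'Fashion', 'Food'],
    'goal_usd_category': ['1k-10k', '10k-50k', '50k_plus'],
})
test = pd.DataFrame({
    'main_category': ['Fashion', 'Games'],   # 'Games' does not occur in train
    'goal_usd_category': ['1k-10k', '50k_plus'],
})

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_encoded = encoder.fit_transform(train)
test_encoded = encoder.transform(test)

print(encoder.categories_)  # learned category order per column
print(test_encoded)
```

Keeping the fitted encoder around also makes it possible to translate the integer codes back with `encoder.inverse_transform` when inspecting the trees.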



In [56]:
# Assuming 'random_forest' is your trained RandomForestClassifier instance
# and 'X_train' is the DataFrame used for training, with column names

# Get feature importances
importances = random_forest.feature_importances_

# Convert the importances into a DataFrame
feature_importances = pd.DataFrame({'feature': X_train.columns, 'importance': importances})

# Sort the DataFrame to show the most important features at the top
feature_importances = feature_importances.sort_values(by='importance', ascending=False)

# Display the feature importances
print(feature_importances)

              feature  importance
1        sub_category    0.575083
0       main_category    0.227708
2   goal_usd_category    0.124279
3  duration_in_months    0.072930


## RF on df_envt

In [398]:
df.columns

Index(['campaign_name', 'blurb', 'main_category', 'sub_category',
       'is_environmental', 'is_social', 'country', 'duration_in_months',
       'goal_usd_category', 'is_success'],
      dtype='object')

In [399]:
x=df_envt[['main_category',
       'is_environmental', 'is_social', 'country', 'duration_in_months',
       'goal_usd_category', 'is_success']].copy()
x['is_success'] = np.where(x['is_success']=='fail', 'fail', 'success')

In [400]:

# Step 1: Select the categorical feature columns
categorical_features = ['main_category',
       'is_environmental', 'is_social', 'country', 'duration_in_months',
       'goal_usd_category',]

# Step 2: Label-encode every feature column
#X = pd.get_dummies(x[categorical_features])

label_encoder = LabelEncoder()
X = x[categorical_features].copy()  # Features; copy to avoid writing into a view of x


for col in categorical_features:
    X[col] = label_encoder.fit_transform(X[col])
# Assuming 'df_envt' is your DataFrame and it includes the 'is_success' column

y = x['is_success']  # Target variable

# Encode the target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Initialize and train the Random Forest model
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(X_train, y_train)

# Predict on the testing set
y_pred = random_forest.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7095238095238096
Classification Report:
               precision    recall  f1-score   support

           0       0.72      0.90      0.80       134
           1       0.67      0.38      0.49        76

    accuracy                           0.71       210
   macro avg       0.70      0.64      0.64       210
weighted avg       0.70      0.71      0.69       210



In [401]:
# Assuming 'random_forest' is your trained RandomForestClassifier instance
# and 'X_train' is the DataFrame used for training, with column names

# Get feature importances
importances = random_forest.feature_importances_

# Convert the importances into a DataFrame
feature_importances = pd.DataFrame({'feature': X_train.columns, 'importance': importances})

# Sort the DataFrame to show the most important features at the top
feature_importances = feature_importances.sort_values(by='importance', ascending=False)

# Display the feature importances
print(feature_importances)

              feature  importance
0       main_category    0.443573
3             country    0.254730
5   goal_usd_category    0.183919
4  duration_in_months    0.092657
2           is_social    0.025121
1    is_environmental    0.000000


## RFE

In [402]:

# Assuming 'X_train' is your features and 'y_train' is your target variable

# Step 1: Instantiate the model
random_forest = RandomForestClassifier()

# Step 2: Instantiate RFE with the model and the desired number of features
rfe = RFE(estimator=random_forest, n_features_to_select=3)  # Adjust n_features_to_select as needed

# Step 3: Fit RFE
rfe.fit(X_train, y_train)

# Optional: Transform the dataset to reduce it to the selected features
X_train_transformed = rfe.transform(X_train)

# Step 4: Inspect the ranking of the features (rank 1 = selected)
selected_features = pd.DataFrame({'Feature': X_train.columns, 
                                   'Importance': rfe.ranking_}).sort_values(by='Importance')

# Display the selected features
print(selected_features)

              Feature  Importance
0       main_category           1
3             country           1
5   goal_usd_category           1
4  duration_in_months           2
2           is_social           3
1    is_environmental           4
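
The column labelled `Importance` in the output above holds `rfe.ranking_`, which is an elimination rank rather than an importance score: rank 1 marks the selected features, and higher ranks were eliminated earlier. A small sketch on synthetic data that shows `support_` and `ranking_` side by side is given below; the feature names `f0` to `f5` and the dataset are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data with six features, three of them informative
X_toy, y_toy = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_toy, columns=[f'f{i}' for i in range(6)])

selector = RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=3)
selector.fit(X_toy, y_toy)

rfe_summary = pd.DataFrame({
    'feature': X_toy.columns,
    'selected': selector.support_,   # True for the features that were kept
    'rank': selector.ranking_,       # 1 = kept; higher = eliminated earlier
}).sort_values('rank')
print(rfe_summary)
```

Fixing `random_state` in the estimator keeps the selection reproducible between runs.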


## XGB

In [407]:
x=df_envt[['main_category',
       'is_environmental', 'is_social', 'country', 'duration_in_months',
       'goal_usd_category', 'is_success']].copy()
x['is_success'] = np.where(x['is_success']=='fail', 'fail', 'success')

In [405]:
categorical_features = ['main_category',
       'is_environmental', 'is_social', 'country', 'duration_in_months',
       'goal_usd_category',]
one_hot_encoder = OneHotEncoder()
encoded_categorical = one_hot_encoder.fit_transform(x[categorical_features]).toarray()

# Create a new DataFrame with encoded categorical features.
# Reuse the index of x so that the concatenation below aligns row by row.
encoded_x = pd.DataFrame(encoded_categorical,
                         columns=one_hot_encoder.get_feature_names_out(categorical_features),
                         index=x.index)

# Drop original categorical columns and concatenate encoded features
x.drop(columns=categorical_features, inplace=True)
x = pd.concat([x, encoded_x], axis=1)

# Split your data into features (X) and target (y)
X = x.drop('is_success', axis=1)
# XGBoost expects numeric class labels, so encode 'fail'/'success' as 0/1
y = LabelEncoder().fit_transform(x['is_success'])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an XGBoost Model
model = xgb.XGBClassifier()
model.fit(X_train, y_train)

# Visualize Feature Importance
xgb.plot_importance(model)
plt.show()
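
`df_envt` is a filtered subset of `df`, so it keeps the original row labels, whereas a DataFrame built from the encoder output receives a fresh `0..n-1` index. `pd.concat(..., axis=1)` aligns rows by label, which can introduce NaN values, including in the target column. The toy example below illustrates the effect and one way to avoid it; all names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A filtered frame keeps the row labels of its parent (here 2, 5 and 9)
subset = pd.DataFrame({'is_success': ['fail', 'success', 'fail']}, index=[2, 5, 9])
encoded = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

# Default RangeIndex (0, 1, 2): rows are matched by label, which produces NaNs
misaligned = pd.concat([subset, pd.DataFrame(encoded, columns=['a', 'b'])], axis=1)

# Reusing the index of the subset keeps the rows paired correctly
aligned = pd.concat(
    [subset, pd.DataFrame(encoded, columns=['a', 'b'], index=subset.index)], axis=1)

print(misaligned)
print(aligned)
```

A target column that mixes strings with NaN (a float) can no longer be sorted, which is one way a comparison error between `str` and `float` may arise during label handling.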

In [156]:
!jupyter nbconvert --to webpdf 03.DataAnalysis.ipynb --no-input
