<div style="text-align:center;">
  <img src="https://speedmedia.jfrog.com/08612fe1-9391-4cf3-ac1a-6dd49c36b276/https://media.jfrog.com/wp-content/uploads/2021/12/29113553/jfrog-logo-2022.svg/w_1024" width="150" height="150">
</div>

# JFrog Data Science Home Assignment


You are employed at a company, FrogAd Corp, that provides a platform for managing advertising campaigns on behalf of various businesses.

FrogAd offers three distinct product tiers: P1, P2, and P3, with P3 being the highest-tier offering, followed by P2, and then P1. We've gathered data from customers who were in either the P1 or P2 tiers and examined whether they upgraded to the P3 tier or not. Our goal is to assist the Sales department in determining which customers should be their primary focus for upgrading to the P3 tier.

Your assignment involves the following objectives:

1. Conduct an analysis of customer data pertaining to P1 and P2 tiers.
2. Develop a baseline model capable of predicting which customers are inclined to upgrade their subscription to the more advanced P3 tier. Discuss which metric(s) is/are relevant for this case.
3. Suggest action items based on your findings. E.g., for your team or for business stakeholders.
4. Prepare a summary presentation showing your work to the FrogAd staff. You will present it to JFrog's team and answer questions about it. Please submit the presentation file along with the Jupyter notebook (.ipynb) file you used.

The data available contains:
- The impressions from four different social media platforms over the past 3 months
- The client's industry type
- The number of support cases the client opened with us
- The number of technical sessions we held with the client
- The number of platform users we had in the past 3 months
- The number of people working for the customer whose contact information we have
- The client's location
- The client's subscription tier
- The client's number of employees




## Install & Import Required Libraries

In [None]:
! pip install jupyter
! pip install pandas
! pip install pyarrow
! pip install matplotlib
! pip install seaborn
! pip install scipy
! pip install scikit-learn
! pip install scikit-optimize
! pip install lightgbm
! pip install imbalanced-learn
! pip install shap

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
from sklearn.linear_model import LinearRegression
import seaborn as sns
from sklearn.model_selection import train_test_split
from scipy.stats import chisquare
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from lightgbm import LGBMClassifier
from skopt import BayesSearchCV
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import make_scorer, fbeta_score
import shap
from sklearn.metrics import confusion_matrix

# Exploratory Data Analysis (EDA)

In [None]:
# Load the data
df = pd.read_parquet('assignment_data_(3).parquet')
print(df.shape)
df.head()

### NaN Analysis

In [None]:
df.isna().sum()

In [None]:
# Check if there are missing values (NA) in accounts that are marked as sharing data (is_sharing = 1)
df[df['is_sharing'] == 1].isna().sum()
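
The per-group totals of missing values can also be computed directly instead of being read off the two outputs above. The sketch below is one way to do it on a small made-up frame; the toy values are assumptions, and only the column names `is_sharing`, `facebook_impressions_1_lag` and `x_impressions_1_lag` come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the assignment data (all values are made up)
toy = pd.DataFrame({
    "is_sharing": [0, 0, 1, 1],
    "facebook_impressions_1_lag": [np.nan, np.nan, 10.0, 12.0],
    "x_impressions_1_lag": [np.nan, 5.0, 7.0, np.nan],
})

# Count the missing cells in each row, then total them within each is_sharing group
missing_per_row = toy.drop(columns="is_sharing").isna().sum(axis=1)
missing_by_sharing = missing_per_row.groupby(toy["is_sharing"]).sum()
print(missing_by_sharing)
```

Applied to `df`, the same two lines would give the totals per group without copying numbers by hand.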

In [None]:
# Hard-coded totals of missing values per is_sharing group (not recomputed from df)
data = {'is_sharing': [0, 1], 'missing_values': [14782, 59]}
missing_df = pd.DataFrame(data)

# Plot the bar chart
fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(missing_df['is_sharing'], missing_df['missing_values'], color=['orange', 'skyblue'])

# Add labels and title
plt.title('Total Missing Values by is_sharing')
plt.xlabel('is_sharing (0 = Not Sharing, 1 = Sharing)')
plt.ylabel('Count of Missing Values')
plt.xticks([0, 1], labels=['Not Sharing', 'Sharing'])
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Annotate values on the bars
for i, v in enumerate(missing_df['missing_values']):
    ax.text(i, v + 100, str(v), ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()



### Tiers Distribution

In [None]:
distribution = df.groupby(['product', 'target']).size().unstack()
percentage_distribution = distribution.div(distribution.sum(axis=1), axis=0) * 100
print(percentage_distribution)
# Data
upgrade_rates = percentage_distribution.T.to_dict()
plot_data = pd.DataFrame(upgrade_rates)
plot_data['index'] = ['Not Upgraded', 'Upgraded']
plot_data = plot_data.set_index('index')

# Stacked bar chart
ax = plot_data.T.plot(kind='bar', stacked=True, figsize=(8, 5), color=['skyblue', 'orange'])
plt.title('Upgrade Distribution by Product Tier')
plt.xlabel('Product Tier')
plt.ylabel('Percentage (%)')
plt.xticks(rotation=0)
plt.legend(title='Upgrade Status')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
# Observed counts of not-upgraded / upgraded accounts per product tier
contingency_table = pd.crosstab(df['product'], df['target'])

# Chi-Square
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-square: {chi2}, p-value: {p}")

In [None]:
print(df['date'].min())
print(df['date'].max())

## Regression and correlation analysis

In [None]:
def plot_scatter_with_regression(x, y, title, color, ax):
    # Scatter plot
    ax.scatter(x, y, color=color, alpha=0.6, label=f'{x.name} vs {y.name}')
    
    # Fit a linear regression model
    model = LinearRegression()
    x_reshaped = x.values.reshape(-1, 1)
    model.fit(x_reshaped, y)
    y_pred = model.predict(x_reshaped)
    
    # Plot the regression line
    ax.plot(x, y_pred, color=color, label='Regression Line')
    ax.set_title(title, fontsize=14)
    ax.set_xlabel(x.name, fontsize=12)
    ax.set_ylabel(y.name, fontsize=12)
    ax.legend(fontsize=10)
    ax.grid(True)

# Plot settings
fig, axes = plt.subplots(1, 3, figsize=(18, 5), sharey=True)
no_nans_df_for_lr = df.dropna()

# Plot for facebook vs x
plot_scatter_with_regression(
    no_nans_df_for_lr['facebook_impressions_1_lag'], 
    no_nans_df_for_lr['x_impressions_1_lag'], 
    "Facebook vs X", 
    'red', 
    axes[0]
)

# Plot for facebook vs instagram
plot_scatter_with_regression(
    no_nans_df_for_lr['facebook_impressions_1_lag'], 
    no_nans_df_for_lr['instagram_impressions_1_lag'], 
    "Facebook vs Instagram", 
    'blue', 
    axes[1]
)

# Plot for x vs instagram
plot_scatter_with_regression(
    no_nans_df_for_lr['x_impressions_1_lag'], 
    no_nans_df_for_lr['instagram_impressions_1_lag'], 
    "X vs Instagram", 
    'green', 
    axes[2]
)

# Adjust layout
plt.tight_layout()
plt.show()
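
The scatter plots show the pairwise relationships visually; a correlation matrix puts a number on them. The sketch below uses synthetic impressions (the data and the strength of the relationship are assumptions, only the column names come from this notebook) and compares Pearson with Spearman, which is less sensitive to the heavy right tail typical of impression counts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed impressions; X is built to follow Facebook closely
base = rng.lognormal(mean=8, sigma=1, size=200)
toy = pd.DataFrame({
    "facebook_impressions_1_lag": base,
    "x_impressions_1_lag": base * 0.5 + rng.normal(0, 100, size=200),
    "instagram_impressions_1_lag": rng.lognormal(mean=8, sigma=1, size=200),
})

# Pearson measures linear association, Spearman measures monotonic (rank) association
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")
print(pearson.round(2))
print(spearman.round(2))
```

On the real data the same calls could be run on `no_nans_df_for_lr` restricted to the impression columns.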

### Box-Plots

In [None]:
# Melt the data for easy plotting
melted_df = pd.melt(df, id_vars="target", value_vars=["x_impressions_1_lag", "instagram_impressions_1_lag", "facebook_impressions_1_lag"],
                    var_name="Network", value_name="Usage")

# Create the boxplot
plt.figure(figsize=(10, 6))
boxplot = sns.boxplot(x="Network", y="Usage", hue="target", data=melted_df)

# Add plot titles and labels
plt.title("Usage Distribution by Network with Target Separation", fontsize=14)
plt.xlabel("Network", fontsize=12)
plt.ylabel("Usage", fontsize=12)
plt.legend(title="Target", loc="upper right")
plt.grid(axis="y", linestyle="--", alpha=0.7)

# Show the plot
plt.tight_layout()
plt.show()

In [None]:
# Apply a log transformation to a copy of the data, so rerunning the cell does not transform Usage twice
log_melted_df = melted_df.assign(Usage=np.log1p(melted_df['Usage']))  # log1p handles log(0) safely

plt.figure(figsize=(10, 6))
boxplot = sns.boxplot(x="Network", y="Usage", hue="target", data=log_melted_df)

# Add plot titles and labels
plt.title("Log-Transformed Usage Distribution by Network with Target Separation", fontsize=14)
plt.xlabel("Network", fontsize=12)
plt.ylabel("Log(Usage)", fontsize=12)
plt.legend(title="Target", loc="upper right")
plt.grid(axis="y", linestyle="--", alpha=0.7)

# Show the plot
plt.tight_layout()
plt.show()


## Categorical Analysis

In [None]:
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = [col for col in df if col not in numerical_features]

In [None]:
# Define the categorical features and the target
target = 'target'

# Iterate through each categorical feature and plot the relationship with the target
for feature in categorical_features:
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x=feature, hue=target, palette="Set2")
    plt.title(f'Relationship Between {feature} and Target', fontsize=14)
    plt.xlabel(feature, fontsize=12)
    plt.ylabel('Count', fontsize=12)
    plt.legend(title='Target')
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle="--", alpha=0.7)
    plt.tight_layout()
    plt.show()

In [None]:
categorical_features = ['industry_group', 'territory', 'product', 'n_employees_range']
target = 'target'

# Overall target distribution
overall_target_distribution = df[target].value_counts(normalize=True)

# Function to perform the Chi-Square Test for each category
def chi_square_per_category(df, feature, target, overall_dist):
    results = []
    # Skip NaN: df[feature] == NaN never matches, which would yield an empty subset
    for category in df[feature].dropna().unique():
        # Subset target distribution for the current category
        category_distribution = df[df[feature] == category][target].value_counts(normalize=True)
        
        # Align the distributions to ensure matching categories
        category_dist = category_distribution.reindex(overall_dist.index, fill_value=0)
        overall_dist_for_test = overall_dist.reindex(category_dist.index, fill_value=0)
        
        # Perform Chi-Square Test
        chi2_stat, p_value = chisquare(f_obs=category_dist * len(df[df[feature] == category]),
                                       f_exp=overall_dist_for_test * len(df[df[feature] == category]))
        
        results.append({
            "Feature": feature,
            "Category": category,
            "Chi2 Statistic": chi2_stat,
            "P-value": p_value,
            "Significant": p_value < 0.05  # True if p-value is less than 0.05
        })
    return results

# Perform the test for all categorical features
all_results = []
for feature in categorical_features:
    results = chi_square_per_category(df, feature, target, overall_target_distribution)
    all_results.extend(results)

# Convert results to a DataFrame for easier visualization
chi_square_results = pd.DataFrame(all_results)
chi_square_results
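
To make the per-category test concrete, here is a worked example with made-up numbers: a category of 100 customers in which 30 upgraded, compared against an assumed overall upgrade rate of 15%. The counts and the rate are illustrative assumptions, not values from the data.

```python
from scipy.stats import chisquare

# Hypothetical category: 100 customers, 30 of them upgraded
n_category = 100
upgraded_in_category = 30

# Hypothetical overall upgrade rate across all customers
overall_upgrade_rate = 0.15

# Observed counts [not upgraded, upgraded] in the category
f_obs = [n_category - upgraded_in_category, upgraded_in_category]

# Counts expected if the category behaved like the overall population
f_exp = [n_category * (1 - overall_upgrade_rate), n_category * overall_upgrade_rate]

# Goodness-of-fit test: (70 - 85)^2 / 85 + (30 - 15)^2 / 15
chi2_stat, p_value = chisquare(f_obs=f_obs, f_exp=f_exp)
print(f"Chi-square: {chi2_stat:.3f}, p-value: {p_value:.5f}")
```

Because the test is repeated once per category, some categories can cross the 0.05 threshold by chance alone, so the `Significant` flag is best read as a screening signal rather than a confirmed effect.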

In [None]:
# Filter significant categories
significant_results = chi_square_results[chi_square_results['Significant']]

# Prepare data for all significant categories
plot_data = []

for _, row in significant_results.iterrows():
    feature = row['Feature']
    category = row['Category']
    
    # Filter the DataFrame for the current category
    subset = df[df[feature] == category]
    
    # Calculate target distribution in percentages
    target_distribution = subset[target].value_counts(normalize=True) * 100
    
    # Add data for plotting
    plot_data.append({
        "Feature": feature,
        "Category": category,
        "Target 0 (%)": target_distribution.get(0, 0),
        "Target 1 (%)": target_distribution.get(1, 0)
    })

# Convert to DataFrame for plotting
plot_df = pd.DataFrame(plot_data)

# Combine feature and category for x-axis labels
plot_df["Feature_Category"] = plot_df["Feature"] + " (" + plot_df["Category"].astype(str) + ")"

# Add overall target distribution
overall_target_distribution = df[target].value_counts(normalize=True) * 100
overall_target_df = pd.DataFrame({
    "Feature_Category": ["Overall Target"],
    "Target 0 (%)": [overall_target_distribution.get(0, 0)],
    "Target 1 (%)": [overall_target_distribution.get(1, 0)]
})

# Combine the significant categories with the overall target
plot_df = pd.concat([plot_df, overall_target_df], ignore_index=True)

# Plot
ax = plot_df.set_index("Feature_Category")[["Target 0 (%)", "Target 1 (%)"]].plot(
    kind="bar",
    stacked=True,
    figsize=(12, 6),
    color=["skyblue", "salmon"],
    alpha=0.8
)

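# Customize overall target bar colors.
# Patches are grouped by column (all Target 0 segments, then all Target 1 segments),
# so take the last bar of each container rather than the last two patches.
overall_bars = [container[-1] for container in ax.containers]
overall_bars[0].set_color("gray")    # Target 0 for Overall Target
overall_bars[1].set_color("darkred") # Target 1 for Overall Target
overall_bars[0].set_alpha(0.9)
overall_bars[1].set_alpha(0.9)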

# Customize plot
plt.title("Target Distribution for Significant Categories and Overall Target", fontsize=14)
plt.xlabel("Feature (Category)", fontsize=12)
plt.ylabel("Percentage (%)", fontsize=12)
plt.xticks(rotation=45, ha='right')

# Place the legend horizontally centered, in the lower part of the axes
plt.legend(title="Target Value", loc='lower center', bbox_to_anchor=(0.5, 0.3), ncol=2)

plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


# Modeling

In [None]:
# Split the data into training and testing sets
df = df.set_index(['date', 'account_id'], drop=True)
features = df.drop(columns=['target'])
target = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

In [None]:
numerical_features = features.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = [col for col in features if col not in numerical_features]

constant_feature_droper = VarianceThreshold(threshold=0.05)

numerical_transformer = Pipeline(steps=[
    ("scaler", StandardScaler()),  # Standard scaling for numerical features
    ("imputer", KNNImputer()),   # KNN Imputer for numerical data
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # Impute with most frequent value
    # One-hot encode; ignore categories that appear only in a validation fold or the test set
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("drop_constants", constant_feature_droper),
    ("feature_selection", SelectKBest(score_func=f_classif)),  # Feature selection
    ("classifier", LGBMClassifier(random_state=42, class_weight="balanced"))  # Classifier
])


# Define the parameter grid for Bayesian search
param_grid = {
    # Tune threshold for constant/low-variance features
    "drop_constants__threshold": (0.0, 0.01),
    # Tune k for the KNN Imputer
    "preprocessor__num__imputer__n_neighbors": (2, 10),  # Number of neighbors for numerical imputer
    # Feature selection
    "feature_selection__k": (6, len(numerical_features) + len(categorical_features)),  # Number of features to select
    # LGBM hyperparameters
    "classifier__n_estimators": (50, 500),         # Number of boosting rounds (trees)
    "classifier__max_depth": (1, 10),              # Maximum depth of a tree
    "classifier__learning_rate": (0.01, 0.3),      # Learning rate (shrinkage)
    "classifier__subsample": (0.5, 1.0),           # Row subsample ratio (LightGBM applies it only when subsample_freq > 0)
    "classifier__colsample_bytree": (0.5, 1.0),    # Subsample ratio of columns for each tree
    "classifier__min_child_samples": (5, 30),      # Minimum number of samples per leaf
    "classifier__min_child_weight": (1, 5),        # Minimum sum of hessians per leaf
    "classifier__scale_pos_weight": (1, 10)        # Extra positive-class weight, on top of class_weight="balanced"
}

scoring_metric = make_scorer(fbeta_score, beta=0.5, average="binary")
# Define Bayesian search
bayes_search = BayesSearchCV(
    estimator=pipeline,
    search_spaces=param_grid,
    n_iter=50,  # Number of iterations for the search
    cv=5,       # Cross-validation splits
    scoring=scoring_metric,
    random_state=42,
    n_jobs=-1
)

# Fit the Bayesian search
bayes_search.fit(X_train, y_train)

# Display the best parameters and score
print("Best Parameters:", bayes_search.best_params_)
print("Best CV Score:", bayes_search.best_score_)

# Evaluate on the test set
test_score = bayes_search.score(X_test, y_test)
print("Test Score:", test_score)
y_pred = bayes_search.predict(X_test)
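
The search above optimizes F-beta with `beta=0.5`, which weights precision more heavily than recall. The sketch below works through the formula on made-up labels (the labels are assumptions for illustration) and contrasts it with `beta=2`, which would favor recall instead.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Made-up labels: 4 real upgraders, the model flags 3 accounts and gets 2 right
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true_toy, y_pred_toy)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true_toy, y_pred_toy)        # TP / (TP + FN) = 2 / 4

def f_beta(p, r, beta):
    """F-beta from precision and recall: (1 + beta^2) * p * r / (beta^2 * p + r)."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

f_half_manual = f_beta(precision, recall, beta=0.5)
f_half = fbeta_score(y_true_toy, y_pred_toy, beta=0.5)
f_two = fbeta_score(y_true_toy, y_pred_toy, beta=2)

print(f"precision={precision:.3f}, recall={recall:.3f}")
print(f"F0.5={f_half:.3f} (manual {f_half_manual:.3f}), F2={f_two:.3f}")
```

Here precision is higher than recall, so F0.5 comes out higher than F2. For a Sales team with limited outreach capacity, a precision-leaning metric rewards a short list that is mostly correct.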

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
# Compute confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix
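
For a binary target, scikit-learn lays the confusion matrix out as `[[TN, FP], [FN, TP]]`, with rows for the true class and columns for the predicted class. The sketch below unpacks the four cells on made-up labels (an illustrative assumption) and derives the two rates most relevant to Sales.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 0, 1, 0, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Share of flagged accounts that really upgraded
precision = tp / (tp + fp)
# Share of real upgraders that the model flagged
recall = tp / (tp + fn)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"precision={precision:.3f}, recall={recall:.3f}")
```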

## SHAP Analysis

In [None]:
# Extract the tuned classifier
best_pipeline = bayes_search.best_estimator_
classifier = best_pipeline.named_steps['classifier']

In [None]:
# Prepare shap explainer
# Transform the data with every pipeline step before the classifier
# (preprocessing, variance filter, feature selection), so the matrix
# has the same columns the classifier was trained on
feature_pipeline = best_pipeline[:-1]
preprocessed_X_train = feature_pipeline.transform(X_train)

# Initialize the SHAP explainer
explainer = shap.Explainer(classifier, preprocessed_X_train)

In [None]:
# Get Features Names for shap analysis
# Names of the columns that survive all transformation steps
# (prefixed with "num__" / "cat__" by the ColumnTransformer)
feature_names = list(feature_pipeline.get_feature_names_out())

In [None]:
# Compute SHAP values
preprocessed_X_test = feature_pipeline.transform(X_test)
shap_values = explainer(preprocessed_X_test)

In [None]:
# Assign feature names to SHAP values (if necessary)
shap_values.feature_names = feature_names

# Plot SHAP bar chart
shap.plots.bar(shap_values)