# Introduction
To use this notebook, upload it to Colab and run one of the scripts. Each script is self-contained: you do not need to run any other script first.

When you run a script, it prompts you to upload files. You can upload every file you have available; the script works out which snapshots of data are usable and which need to be filtered out.

The model used in each of these scripts is similar to the churn models that companies use to predict whether a customer will stop buying their product within the next x days/months/years. It uses multiple snapshots of the same repository's data to find patterns in how the features behave and ties them to a label that indicates whether that repository will be abandoned in the next x days.

# A summary of some of the functions you will see:

**extract_datetime_from_filename** - the feature Excel files all follow the same naming format, which includes the time the data was scraped, so this function reads that part of the file name and converts it into a datetime (the current implementation keeps only the date part).
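Snapshot file names look like `features_2024-09-24_H01-M36-S00.xlsx`. The following is a minimal sketch of recovering the full scrape time, including hours, minutes and seconds, assuming every file follows that pattern. `parse_snapshot_time` is a hypothetical helper used for illustration, not a function from the scripts.

```python
import re
from datetime import datetime

# Assumed pattern, based on names such as features_2024-09-24_H01-M36-S00.xlsx
_STAMP = re.compile(r"(\d{4}-\d{2}-\d{2})_H(\d{2})-M(\d{2})-S(\d{2})")

def parse_snapshot_time(filename):
    """Return the scrape time encoded in a feature-file name (hypothetical helper)."""
    match = _STAMP.search(filename)
    if match is None:
        raise ValueError(f"Invalid filename format: {filename}")
    date_part, hour, minute, second = match.groups()
    return datetime.strptime(
        f"{date_part} {hour}:{minute}:{second}", "%Y-%m-%d %H:%M:%S"
    )

# Colab may append a suffix such as " (10)" to re-uploaded files; search() ignores it.
print(parse_snapshot_time("features_2024-09-24_H01-M36-S00 (10).xlsx"))
```

Keeping the time of day only matters when two snapshots are taken on the same date; for snapshots that are days apart, the date alone gives the same ordering.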

**convert_to_numeric** - takes non-numeric entries, such as true or false, and converts them into numbers that the logistic regression can interpret. It also expands the suffixes k, M and B (thousand, million, billion). An unexpected entry, such as an empty cell, is changed to the null value NaN so it can be handled later.
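The run output further down shows the value `5k+` in the 'Repos of Owner' column failing conversion. The following is a minimal sketch of a more tolerant converter, assuming a trailing `+` means "at least" and can be read as the lower bound. `to_number` is a hypothetical name and is not part of the scripts.

```python
import math
import re

_SUFFIXES = {"k": 1e3, "M": 1e6, "B": 1e9}
# Optional sign, digits, optional k/M/B suffix, optional trailing "+"
_NUMBER = re.compile(r"^([-+]?\d*\.?\d+)([kMB]?)\+?$")

def to_number(value):
    """Convert a scraped cell to a float, or NaN if it cannot be interpreted."""
    text = str(value).replace(",", "").strip()
    lowered = text.lower()
    if lowered in ("true", "yes"):
        return 1.0
    if lowered in ("false", "no"):
        return 0.0
    match = _NUMBER.match(text)
    if match is None:
        return math.nan
    number, suffix = match.groups()
    # Assumption: "5k+" is treated as its lower bound, 5000
    return float(number) * _SUFFIXES.get(suffix, 1.0)

print(to_number("5k+"), to_number("1,234"), to_number("yes"), to_number(""))
```

Matching the whole string with one anchored pattern also avoids treating any text that merely contains the letter `k` as a thousands value.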

**load_and_process_files** - combines all of the files into one data frame, adds the timestamp of when each repository's data was scraped, and computes the number of days since each repository was last updated.

**process_churn_labels** - adds the Abandoned and Future_Abandoned labels. This is where you would change the definition of abandoned (in the scripts, more than 30 days since the last update) or how far into the future you want to predict. If we want to look forward x days, the logic for future_abandoned is as follows. For each snapshot of a repository, take the last later snapshot that is less than x days in the future: if the repository is abandoned there, future_abandoned is true. If not, take the first snapshot that is more than x days in the future: if the repository is not abandoned there, future_abandoned is false. In every other case there is not enough information to decide, so that snapshot of the repository is removed. Lastly, it removes any snapshot where abandoned = 1 and future_abandoned = 1, because these are redundant.
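The labelling rule can be easier to follow for a single snapshot in isolation. The following is a minimal sketch in plain Python, assuming the later snapshots of the same repository are given as `(timestamp, abandoned)` pairs. `label_future_abandoned` and its `horizon_days` parameter are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta

def label_future_abandoned(current_time, later_snapshots, horizon_days=14):
    """Return 1, 0, or None (undecidable) for one snapshot of a repository.

    later_snapshots: list of (timestamp, abandoned) pairs for the same repository.
    """
    cutoff = current_time + timedelta(days=horizon_days)
    later = sorted(s for s in later_snapshots if s[0] > current_time)
    within = [s for s in later if s[0] <= cutoff]
    beyond = [s for s in later if s[0] > cutoff]

    if not within:
        return None  # nothing observed inside the horizon
    if within[-1][1] == 1:
        return 1  # abandoned at the last snapshot inside the horizon
    if beyond and beyond[0][1] == 0:
        return 0  # still active at the first snapshot past the horizon
    return None  # not enough evidence either way

t0 = datetime(2024, 9, 24)
print(label_future_abandoned(t0, [(t0 + timedelta(days=5), 0),
                                  (t0 + timedelta(days=20), 0)]))
```

Snapshots for which the function returns `None` correspond to the rows that the scripts drop.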

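**logistic_regression_churn_model** - builds and evaluates the model: it drops rows with missing values, splits the data into training and test sets, standardizes the features, fits a logistic regression, and prints the feature weights, the classification report, and the ROC AUC scores. In the oversampling script it also applies SMOTE to the training set before fitting.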

# Oversampling Churn Model
This script uses the churn model and oversamples the minority class, future_abandoned, with SMOTE. This counteracts the model's strong tendency to predict not_future_abandoned for every repository, which happens because that class is so much more common.
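The script calls `SMOTE` from imbalanced-learn. To give a feel for what that kind of oversampling does, the following is a simplified standard-library sketch of the underlying idea: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. It is an illustration only, not imbalanced-learn's implementation, and `smote_like_oversample` is a hypothetical name.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified illustration of the SMOTE idea; needs at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Made-up 2-D minority samples, for illustration only
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
new_points = smote_like_oversample(minority, n_new=4)
print(new_points)
```

Because the synthetic points are built from training rows, the scripts oversample only after the train/test split, so the test set contains real repositories only.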

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from datetime import datetime, timedelta
from sklearn.metrics import classification_report, roc_auc_score
from google.colab import files
from sklearn.preprocessing import StandardScaler  # Import StandardScaler
import re
import numpy as np
from imblearn.over_sampling import SMOTE  # For oversampling

# Function to extract the timestamp from a filename
def extract_datetime_from_filename(filename):
    date_match = re.search(r'\d{4}-\d{2}-\d{2}', filename)
    if date_match:
        return datetime.strptime(date_match.group(), '%Y-%m-%d')
    else:
        raise ValueError(f"Invalid filename format: {filename}")

# Function to convert strings with 'k', 'M', 'B' to numeric values
def convert_to_numeric(value, column_name):
    value = str(value).replace(',', '')  # Remove commas
    if value.lower() in ['true', 'yes']:
        return 1.0
    elif value.lower() in ['false', 'no']:
        return 0.0
    try:
        if 'k' in value:
            return float(value.replace('k', '')) * 1e3
        elif 'M' in value:
            return float(value.replace('M', '')) * 1e6
        elif 'B' in value:
            return float(value.replace('B', '')) * 1e9
        elif re.match(r'^[-+]?[0-9]*\.?[0-9]+$', value):
            return float(value)
        else:
            return np.nan
    except Exception as e:
        print(f"Error in column '{column_name}' with value '{value}': {e}")
        return np.nan

# Function to load multiple files and build a unified DataFrame
def load_and_process_files(file_dict):
    all_data = []
    for file_name, content in file_dict.items():
        # Load file and process
        df = pd.read_excel(file_name)
        timestamp = extract_datetime_from_filename(file_name)

        # Process date columns and add timestamp
        df['last_update'] = pd.to_datetime(df['Last Update'], errors='coerce').dt.tz_localize(None)
        df['days_since_last_update'] = (timestamp - df['last_update']).dt.days
        df['timestamp'] = timestamp
        df['file_name'] = file_name
        all_data.append(df)

    # Concatenate all data and sort by repository and timestamp
    combined_df = pd.concat(all_data)
    combined_df.sort_values(by=['Project URL', 'timestamp'], inplace=True)
    combined_df.reset_index(drop=True, inplace=True)
    return combined_df

# Process all files and add "future_abandoned" indicator for churn model
def process_churn_labels(df, relevant_columns):
    # Apply numeric conversion
    df[relevant_columns] = df[relevant_columns].apply(lambda col: col.apply(lambda x: convert_to_numeric(x, col.name)))

    # Add target indicators
    df['abandoned'] = df['days_since_last_update'].apply(lambda x: 1 if x > 30 else 0)
    df['future_abandoned'] = 0  # Initialize column

    rows_to_remove = []  # To track rows to delete

    for i, row in df.iterrows():
        current_repo = row['Project URL']
        current_time = row['timestamp']

        # Find snapshots of the same repository after the current timestamp
        future_snapshots = df[(df['Project URL'] == current_repo) & (df['timestamp'] > current_time)]

        if future_snapshots.empty:
            # No future information available, mark for removal
            rows_to_remove.append(i)
            continue

        # Sort future snapshots by timestamp (ascending)
        future_snapshots = future_snapshots.sort_values(by='timestamp')

        # Find the closest snapshot within 2 weeks
        closest_within_two_weeks = future_snapshots[
            future_snapshots['timestamp'] <= current_time + timedelta(days=14)
        ]

        if not closest_within_two_weeks.empty:
            # Check if the repository is abandoned at the closest snapshot
            if closest_within_two_weeks.iloc[-1]['abandoned'] == 1:
                df.at[i, 'future_abandoned'] = 1
            else:
                # Look for the next snapshot beyond two weeks
                next_snapshot = future_snapshots[
                    future_snapshots['timestamp'] > current_time + timedelta(days=14)
                ]

                if not next_snapshot.empty:
                    # If the repository is not abandoned in the next snapshot, mark as not future-abandoned
                    if next_snapshot.iloc[0]['abandoned'] == 0:
                        df.at[i, 'future_abandoned'] = 0
                    else:
                        rows_to_remove.append(i)
                else:
                    rows_to_remove.append(i)
        else:
            rows_to_remove.append(i)

    # Drop rows with insufficient information
    df.drop(index=rows_to_remove, inplace=True)
    df = df[~((df['abandoned'] == 1) & (df['future_abandoned'] == 1))]
    return df

# Logistic Regression Model Setup
def logistic_regression_churn_model(df, relevant_columns):
    X = df[relevant_columns]
    y = df['future_abandoned']

    # Drop rows with missing values in either X or y
    df_cleaned = pd.concat([X, y], axis=1).dropna()  # Concatenate and drop NaNs
    X = df_cleaned[relevant_columns]  # Update X
    y = df_cleaned['future_abandoned']  # Update y

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


    # Apply scaling to the features
    scaler = StandardScaler()  # Initialize StandardScaler
    X_train_scaled = scaler.fit_transform(X_train)  # Fit on training data and transform
    X_test_scaled = scaler.transform(X_test)  # Use the same scaler to transform the test data

    # Oversample using SMOTE
    smote = SMOTE(random_state=42)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

    # Train model
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train_resampled, y_train_resampled)

    # Output feature weights and model evaluation
    weights = model.coef_[0]
    feature_weights = pd.DataFrame({
        'Feature': X.columns,
        'Weight': weights
    })

    print("\nClassification Report:\n", classification_report(y_test, model.predict(X_test_scaled)))
    print("ROC AUC Score for testing data:", roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1]))
    print("ROC AUC Score for training data:", roc_auc_score(y_train, model.predict_proba(X_train_scaled)[:, 1]))
    print("Feature Weights:\n", feature_weights)


# Relevant columns for analysis
relevant_columns = [
    'Size', 'Number of Stars', 'Number of Watches_x', 'Number of Open Issues',
    'Number of forks', 'Topics', 'Open Pull Requests', 'Closed Pull Requests',
    'Followers of Owner', 'Number of Commits', 'Members of Owner',
    'Repos of Owner', 'Active Pull Requests', 'Active Issues',
    'Open Issues', 'Closed Issues', 'Number of Labels', 'Current Sponsors',
    'Sponsored', 'Number of Milestones', 'Number of Dependents',
    'Number of Files', 'Depth of Files', 'Number of Contributors',
    'Number of Merges', 'Number of Branches', 'Number of Tags',
    'Number of Links', 'Verified Owner', 'Has a Wiki', 'Has Discussions',
    'Has Projects', 'Has Pages', 'Archived', 'Has README', 'Has SECURITY',
    'Has Conduct', 'Has Contributing', 'Has ISSUE_TEMPLATE', 'Has PULL_TEMPLATE'
]

# Step 1: Upload the .xlsx files
print("Please upload your .xlsx files.")
uploaded_files = files.upload()

# Step 2: Load files and process
combined_df = load_and_process_files(uploaded_files)

# Step 3: Process for churn model (future abandoned indicator)
processed_df = process_churn_labels(combined_df, relevant_columns)

# Step 4: Train and evaluate churn model
logistic_regression_churn_model(processed_df, relevant_columns)


Please upload your .xlsx files.


Saving features_2024-09-24_H01-M36-S00.xlsx to features_2024-09-24_H01-M36-S00 (10).xlsx
Saving features_2024-09-29_H22-M30-S30.xlsx to features_2024-09-29_H22-M30-S30 (10).xlsx
Saving features_2024-10-13_H17-M48-S51.xlsx to features_2024-10-13_H17-M48-S51 (10).xlsx
Saving features_2024-10-21_H09-M45-S35.xlsx to features_2024-10-21_H09-M45-S35 (10).xlsx
Saving features_2024-11-10_H22-M59-S00.xlsx to features_2024-11-10_H22-M59-S00 (10).xlsx
Saving features_2024-11-17_H23-M29-S52.xlsx to features_2024-11-17_H23-M29-S52 (10).xlsx
Saving features_2024-11-19_H12-M30-S15.xlsx to features_2024-11-19_H12-M30-S15 (10).xlsx
Error in column 'Repos of Owner' with value '5k+': could not convert string to float: '5+'
Error in column 'Repos of Owner' with value '5k+': could not convert string to float: '5+'
Feature Weights:
                    Feature    Weight
0                     Size -0.179435
1          Number of Stars -9.059994
2      Number of Watches_x  1.237375
3    Number of Open Issues  0

# Without Oversampling Churn Model
This script uses the churn model and does not oversample the minority class, future_abandoned. If you run this one, you will notice that the precision and recall of future_abandoned are 0.00, or close to it. It is mostly here as a baseline script.
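The following is a minimal sketch of why precision and recall collapse to zero on imbalanced data, using made-up labels (95 active repositories and 5 abandoned ones, not figures from this project) and a model that always predicts the majority class. `precision_recall` is a hypothetical helper written for this example.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class; 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels for illustration: 95 stay active, 5 become abandoned
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a model that always predicts "not abandoned"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(accuracy, precision, recall)
```

Accuracy looks high even though no abandoned repository is found. When a class is never predicted, its precision is undefined; scikit-learn reports it as 0 and emits a warning.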

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from datetime import datetime, timedelta
from sklearn.metrics import classification_report, roc_auc_score
from google.colab import files
from sklearn.preprocessing import StandardScaler  # Import StandardScaler
import re
import numpy as np

# Function to extract the timestamp from a filename
def extract_datetime_from_filename(filename):
    date_match = re.search(r'\d{4}-\d{2}-\d{2}', filename)
    if date_match:
        return datetime.strptime(date_match.group(), '%Y-%m-%d')
    else:
        raise ValueError(f"Invalid filename format: {filename}")

# Function to convert strings with 'k', 'M', 'B' to numeric values
def convert_to_numeric(value, column_name):
    value = str(value).replace(',', '')  # Remove commas
    if value.lower() in ['true', 'yes']:
        return 1.0
    elif value.lower() in ['false', 'no']:
        return 0.0
    try:
        if 'k' in value:
            return float(value.replace('k', '')) * 1e3
        elif 'M' in value:
            return float(value.replace('M', '')) * 1e6
        elif 'B' in value:
            return float(value.replace('B', '')) * 1e9
        elif re.match(r'^[-+]?[0-9]*\.?[0-9]+$', value):
            return float(value)
        else:
            return np.nan
    except Exception as e:
        print(f"Error in column '{column_name}' with value '{value}': {e}")
        return np.nan

# Function to load multiple files and build a unified DataFrame
def load_and_process_files(file_dict):
    all_data = []
    for file_name, content in file_dict.items():
        # Load file and process
        df = pd.read_excel(file_name)
        timestamp = extract_datetime_from_filename(file_name)

        # Process date columns and add timestamp
        df['last_update'] = pd.to_datetime(df['Last Update'], errors='coerce').dt.tz_localize(None)
        df['days_since_last_update'] = (timestamp - df['last_update']).dt.days
        df['timestamp'] = timestamp
        df['file_name'] = file_name
        all_data.append(df)

    # Concatenate all data and sort by repository and timestamp
    combined_df = pd.concat(all_data)
    combined_df.sort_values(by=['Project URL', 'timestamp'], inplace=True)
    combined_df.reset_index(drop=True, inplace=True)
    return combined_df

# Process all files and add "future_abandoned" indicator for churn model
def process_churn_labels(df, relevant_columns):
    # Apply numeric conversion
    df[relevant_columns] = df[relevant_columns].apply(lambda col: col.apply(lambda x: convert_to_numeric(x, col.name)))

    # Add target indicators
    df['abandoned'] = df['days_since_last_update'].apply(lambda x: 1 if x > 30 else 0)
    df['future_abandoned'] = 0  # Initialize column

    rows_to_remove = []  # To track rows to delete

    for i, row in df.iterrows():
        current_repo = row['Project URL']
        current_time = row['timestamp']

        # Find snapshots of the same repository after the current timestamp
        future_snapshots = df[(df['Project URL'] == current_repo) & (df['timestamp'] > current_time)]

        if future_snapshots.empty:
            # No future information available, mark for removal
            rows_to_remove.append(i)
            continue

        # Sort future snapshots by timestamp (ascending)
        future_snapshots = future_snapshots.sort_values(by='timestamp')

        # Find the closest snapshot within 2 weeks
        closest_within_two_weeks = future_snapshots[
            future_snapshots['timestamp'] <= current_time + timedelta(days=14)
        ]

        if not closest_within_two_weeks.empty:
            # Check if the repository is abandoned at the closest snapshot
            if closest_within_two_weeks.iloc[-1]['abandoned'] == 1:
                df.at[i, 'future_abandoned'] = 1
            else:
                # Look for the next snapshot beyond two weeks
                next_snapshot = future_snapshots[
                    future_snapshots['timestamp'] > current_time + timedelta(days=14)
                ]

                if not next_snapshot.empty:
                    # If the repository is not abandoned in the next snapshot, mark as not future-abandoned
                    if next_snapshot.iloc[0]['abandoned'] == 0:
                        df.at[i, 'future_abandoned'] = 0
                    else:
                        rows_to_remove.append(i)
                else:
                    rows_to_remove.append(i)
        else:
            rows_to_remove.append(i)

    # Drop rows with insufficient information
    df.drop(index=rows_to_remove, inplace=True)
    df = df[~((df['abandoned'] == 1) & (df['future_abandoned'] == 1))]
    return df

def logistic_regression_churn_model(df, relevant_columns):
    X = df[relevant_columns]
    y = df['future_abandoned']

    # Drop rows with missing values in either X or y
    df_cleaned = pd.concat([X, y], axis=1).dropna()  # Concatenate and drop NaNs
    X = df_cleaned[relevant_columns]  # Update X
    y = df_cleaned['future_abandoned']  # Update y

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Apply scaling to the features
    scaler = StandardScaler()  # Initialize StandardScaler
    X_train_scaled = scaler.fit_transform(X_train)  # Fit on training data and transform
    X_test_scaled = scaler.transform(X_test)  # Use the same scaler to transform the test data

    # Train model
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train_scaled, y_train)

    # Output feature weights and model evaluation
    weights = model.coef_[0]
    feature_weights = pd.DataFrame({
        'Feature': X.columns,
        'Weight': weights
    })

    print("Feature Weights:\n", feature_weights)
    print("\nClassification Report:\n", classification_report(y_test, model.predict(X_test_scaled)))
    print("ROC AUC Score for testing data:", roc_auc_score(y_test, model.predict_proba(X_test_scaled)[:, 1]))
    print("ROC AUC Score for training data:", roc_auc_score(y_train, model.predict_proba(X_train_scaled)[:, 1]))


# Relevant columns for analysis
relevant_columns = [
    'Size', 'Number of Stars', 'Number of Watches_x', 'Number of Open Issues',
    'Number of forks', 'Topics', 'Open Pull Requests', 'Closed Pull Requests',
    'Followers of Owner', 'Number of Commits', 'Members of Owner',
    'Repos of Owner', 'Active Pull Requests', 'Active Issues',
    'Open Issues', 'Closed Issues', 'Number of Labels', 'Current Sponsors',
    'Sponsored', 'Number of Milestones', 'Number of Dependents',
    'Number of Files', 'Depth of Files', 'Number of Contributors',
    'Number of Merges', 'Number of Branches', 'Number of Tags',
    'Number of Links', 'Verified Owner', 'Has a Wiki', 'Has Discussions',
    'Has Projects', 'Has Pages', 'Archived', 'Has README', 'Has SECURITY',
    'Has Conduct', 'Has Contributing', 'Has ISSUE_TEMPLATE', 'Has PULL_TEMPLATE'
]

# Step 1: Upload the .xlsx files
print("Please upload your .xlsx files.")
uploaded_files = files.upload()

# Step 2: Load files and process
combined_df = load_and_process_files(uploaded_files)

# Step 3: Process for churn model (future abandoned indicator)
processed_df = process_churn_labels(combined_df, relevant_columns)

# Step 4: Train and evaluate churn model
logistic_regression_churn_model(processed_df, relevant_columns)


Please upload your .xlsx files.


Saving features_2024-09-24_H01-M36-S00.xlsx to features_2024-09-24_H01-M36-S00.xlsx
Saving features_2024-09-29_H22-M30-S30.xlsx to features_2024-09-29_H22-M30-S30.xlsx
Saving features_2024-10-13_H17-M48-S51.xlsx to features_2024-10-13_H17-M48-S51.xlsx
Saving features_2024-10-21_H09-M45-S35.xlsx to features_2024-10-21_H09-M45-S35.xlsx
Saving features_2024-11-10_H22-M59-S00.xlsx to features_2024-11-10_H22-M59-S00.xlsx
Error in column 'Repos of Owner' with value '5k+': could not convert string to float: '5+'
Feature Weights:
                    Feature    Weight
0                     Size  0.002778
1          Number of Stars -3.863778
2      Number of Watches_x  0.147317
3    Number of Open Issues  0.108553
4          Number of forks -0.826268
5                   Topics -0.058752
6       Open Pull Requests  0.046839
7     Closed Pull Requests -0.125635
8       Followers of Owner -0.013040
9        Number of Commits  0.207462
10        Members of Owner  0.067955
11          Repos of Owner 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


# Recursive Feature Elimination with Oversampling
This script uses the same churn model as before. The only difference is that this time it recursively eliminates features with scikit-learn's RFECV, using 5-fold cross-validation to keep the subset of features that gives the best ROC AUC.
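The elimination loop itself is simple. The following is a minimal sketch that leaves out the model and the cross-validation, assuming an `importance_fn` that stands in for refitting the logistic regression and reading the absolute value of each coefficient. The function name, the fixed `n_keep` argument and the toy weights are all hypothetical. RFECV differs in that it chooses the number of features to keep from cross-validated scores instead of taking it as an input.

```python
def recursive_feature_elimination(features, importance_fn, n_keep):
    """Rank features by repeatedly dropping the least important one.

    importance_fn(remaining) must return {feature: importance} for the
    features still in play (a stand-in for refitting a model each round).
    """
    remaining = list(features)
    ranking = {}
    while len(remaining) > n_keep:
        importance = importance_fn(remaining)
        weakest = min(remaining, key=lambda f: importance[f])
        remaining.remove(weakest)
        # Features dropped earlier get a larger (worse) rank
        ranking[weakest] = len(remaining) - n_keep + 2
    for feature in remaining:
        ranking[feature] = 1  # every surviving feature shares rank 1
    return remaining, ranking

# Toy importances, made up for illustration (a real run refits the model each round)
toy_weights = {"Size": 0.2, "Number of Stars": 9.0,
               "Number of Watches_x": 1.2, "Topics": 0.05}
kept, ranking = recursive_feature_elimination(
    list(toy_weights), lambda rem: {f: toy_weights[f] for f in rem}, n_keep=2
)
print(kept, ranking)
```

The rank convention mirrors the one used in the script's output, where selected features have rank 1 and larger numbers mean earlier elimination.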

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from datetime import datetime, timedelta
from sklearn.metrics import classification_report, roc_auc_score
from google.colab import files
from sklearn.preprocessing import StandardScaler  # Import StandardScaler
import re
import numpy as np
from imblearn.over_sampling import SMOTE  # For oversampling
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Function to extract the timestamp from a filename
def extract_datetime_from_filename(filename):
    date_match = re.search(r'\d{4}-\d{2}-\d{2}', filename)
    if date_match:
        return datetime.strptime(date_match.group(), '%Y-%m-%d')
    else:
        raise ValueError(f"Invalid filename format: {filename}")

# Function to convert strings with 'k', 'M', 'B' to numeric values
def convert_to_numeric(value, column_name):
    value = str(value).replace(',', '')  # Remove commas
    if value.lower() in ['true', 'yes']:
        return 1.0
    elif value.lower() in ['false', 'no']:
        return 0.0
    try:
        if 'k' in value:
            return float(value.replace('k', '')) * 1e3
        elif 'M' in value:
            return float(value.replace('M', '')) * 1e6
        elif 'B' in value:
            return float(value.replace('B', '')) * 1e9
        elif re.match(r'^[-+]?[0-9]*\.?[0-9]+$', value):
            return float(value)
        else:
            return np.nan
    except Exception as e:
        print(f"Error in column '{column_name}' with value '{value}': {e}")
        return np.nan

# Function to load multiple files and build a unified DataFrame
def load_and_process_files(file_dict):
    all_data = []
    for file_name, content in file_dict.items():
        # Load file and process
        df = pd.read_excel(file_name)
        timestamp = extract_datetime_from_filename(file_name)

        # Process date columns and add timestamp
        df['last_update'] = pd.to_datetime(df['Last Update'], errors='coerce').dt.tz_localize(None)
        df['days_since_last_update'] = (timestamp - df['last_update']).dt.days
        df['timestamp'] = timestamp
        df['file_name'] = file_name
        all_data.append(df)

    # Concatenate all data and sort by repository and timestamp
    combined_df = pd.concat(all_data)
    combined_df.sort_values(by=['Project URL', 'timestamp'], inplace=True)
    combined_df.reset_index(drop=True, inplace=True)
    return combined_df

# Process all files and add "future_abandoned" indicator for churn model
def process_churn_labels(df, relevant_columns):
    # Apply numeric conversion
    df[relevant_columns] = df[relevant_columns].apply(lambda col: col.apply(lambda x: convert_to_numeric(x, col.name)))

    # Add target indicators
    df['abandoned'] = df['days_since_last_update'].apply(lambda x: 1 if x > 30 else 0)
    df['future_abandoned'] = 0  # Initialize column

    rows_to_remove = []  # To track rows to delete

    for i, row in df.iterrows():
        current_repo = row['Project URL']
        current_time = row['timestamp']

        # Find snapshots of the same repository after the current timestamp
        future_snapshots = df[(df['Project URL'] == current_repo) & (df['timestamp'] > current_time)]

        if future_snapshots.empty:
            # No future information available, mark for removal
            rows_to_remove.append(i)
            continue

        # Sort future snapshots by timestamp (ascending)
        future_snapshots = future_snapshots.sort_values(by='timestamp')

        # Find the closest snapshot within 2 weeks
        closest_within_two_weeks = future_snapshots[
            future_snapshots['timestamp'] <= current_time + timedelta(days=14)
        ]

        if not closest_within_two_weeks.empty:
            # Check if the repository is abandoned at the closest snapshot
            if closest_within_two_weeks.iloc[-1]['abandoned'] == 1:
                df.at[i, 'future_abandoned'] = 1
            else:
                # Look for the next snapshot more than 30 days ahead (the check above uses a 14-day window)
                next_snapshot = future_snapshots[
                    future_snapshots['timestamp'] > current_time + timedelta(days=30)
                ]

                if not next_snapshot.empty:
                    # If the repository is not abandoned in the next snapshot, mark as not future-abandoned
                    if next_snapshot.iloc[0]['abandoned'] == 0:
                        df.at[i, 'future_abandoned'] = 0
                    else:
                        rows_to_remove.append(i)
                else:
                    rows_to_remove.append(i)
        else:
            rows_to_remove.append(i)

    # Drop rows with insufficient information
    df.drop(index=rows_to_remove, inplace=True)
    df = df[~((df['abandoned'] == 1) & (df['future_abandoned'] == 1))]
    return df

def rfecv_feature_selection_churn_model(df, relevant_columns):
    X = df[relevant_columns]
    y = df['future_abandoned']

    # Drop rows with missing values in either X or y
    df_cleaned = pd.concat([X, y], axis=1).dropna()  # Concatenate and drop NaNs
    X = df_cleaned[relevant_columns]  # Update X
    y = df_cleaned['future_abandoned']  # Update y

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Apply scaling to the features
    scaler = StandardScaler()  # Initialize StandardScaler
    X_train_scaled = scaler.fit_transform(X_train)  # Fit on training data and transform
    X_test_scaled = scaler.transform(X_test)  # Use the same scaler to transform the test data

    # Oversample using SMOTE
    smote = SMOTE(random_state=42)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

    # Initialize the logistic regression model
    model = LogisticRegression(max_iter=5000)

    # Perform RFECV to select the optimal features
    rfecv = RFECV(
        estimator=model,
        step=1,  # Number of features to remove at each iteration
        cv=StratifiedKFold(5),  # 5-fold cross-validation
        scoring="roc_auc",  # Evaluation metric
        n_jobs=-1  # Use all available processors
    )

    rfecv.fit(X_train_resampled, y_train_resampled)

    # Get selected features
    selected_features = [feature for feature, selected in zip(relevant_columns, rfecv.support_) if selected]
    print("Optimal number of features:", rfecv.n_features_)
    print("Selected features:", selected_features)

    # Train a logistic regression model with selected features
    X_train_selected = rfecv.transform(X_train_resampled)
    X_test_selected = rfecv.transform(X_test_scaled)
    model.fit(X_train_selected, y_train_resampled)

    # Evaluate the model
    print("\nClassification Report:\n", classification_report(y_test, model.predict(X_test_selected)))
    print("ROC AUC Score for testing data:", roc_auc_score(y_test, model.predict_proba(X_test_selected)[:, 1]))
    print("ROC AUC Score for training data:", roc_auc_score(y_train_resampled, model.predict_proba(X_train_selected)[:, 1]))

    # Create a DataFrame for the feature rankings
    feature_rankings = pd.DataFrame({
        'Feature': relevant_columns,
        'Ranking': rfecv.ranking_
    }).sort_values(by='Ranking')

    print("\nFeature Rankings:\n", feature_rankings)

# Relevant columns for analysis
relevant_columns = [
    'Size', 'Number of Stars', 'Number of Watches_x', 'Number of Open Issues',
    'Number of forks', 'Topics', 'Open Pull Requests', 'Closed Pull Requests',
    'Followers of Owner', 'Number of Commits', 'Members of Owner',
    'Repos of Owner', 'Active Pull Requests', 'Active Issues',
    'Open Issues', 'Closed Issues', 'Number of Labels', 'Current Sponsors',
    'Sponsored', 'Number of Milestones', 'Number of Dependents',
    'Number of Files', 'Depth of Files', 'Number of Contributors',
    'Number of Merges', 'Number of Branches', 'Number of Tags',
    'Number of Links', 'Verified Owner', 'Has a Wiki', 'Has Discussions',
    'Has Projects', 'Has Pages', 'Archived', 'Has README', 'Has SECURITY',
    'Has Conduct', 'Has Contributing', 'Has ISSUE_TEMPLATE', 'Has PULL_TEMPLATE'
]

# Step 1: Upload the .xlsx files
print("Please upload your .xlsx files.")
uploaded_files = files.upload()

# Step 2: Load files and process
combined_df = load_and_process_files(uploaded_files)

# Step 3: Process for churn model (future abandoned indicator)
processed_df = process_churn_labels(combined_df, relevant_columns)

# Step 4: Train and evaluate churn model
rfecv_feature_selection_churn_model(processed_df, relevant_columns)


Please upload your .xlsx files.


Saving features_2024-09-24_H01-M36-S00.xlsx to features_2024-09-24_H01-M36-S00 (12).xlsx
Saving features_2024-09-29_H22-M30-S30.xlsx to features_2024-09-29_H22-M30-S30 (12).xlsx
Saving features_2024-10-13_H17-M48-S51.xlsx to features_2024-10-13_H17-M48-S51 (12).xlsx
Saving features_2024-10-21_H09-M45-S35.xlsx to features_2024-10-21_H09-M45-S35 (12).xlsx
Saving features_2024-11-10_H22-M59-S00.xlsx to features_2024-11-10_H22-M59-S00 (12).xlsx
Saving features_2024-11-17_H23-M29-S52.xlsx to features_2024-11-17_H23-M29-S52 (12).xlsx
Saving features_2024-11-19_H12-M30-S15.xlsx to features_2024-11-19_H12-M30-S15 (12).xlsx
Error in column 'Repos of Owner' with value '5k+': could not convert string to float: '5+'
Error in column 'Repos of Owner' with value '5k+': could not convert string to float: '5+'
Optimal number of features: 18
Selected features: ['Size', 'Number of Stars', 'Number of Watches_x', 'Number of forks', 'Open Pull Requests', 'Number of Commits', 'Active Pull Requests', 'Active 