# Data Preprocessing

Based on the insights from the previous analysis, the data must be cleaned and transformed before new features can be engineered and models can be built.

In [1]:
# Import necessary packages
import pandas as pd

In [2]:
# Load the data
df = pd.read_csv('./../../data/data.csv')

Preprocessing steps will be carried out as follows:
1. Select observations with a goal amount less than or equal to 1 million.
2. Select observations with a backers count less than or equal to 1000.
3. Select observations with a blurb length less than 35.
4. Select categories that have representation from both successful and failed projects.
5. Perform one-hot encoding for the following columns:
   1. Country
   2. Currency
   3. Category
6. Convert all the necessary features with object or boolean data types to numerical data types.

```{note}
Every column must be converted to a numerical data type because most model implementations in Python (for example, scikit-learn estimators) expect numerical input when fitting.
```
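
One way to see which columns still need conversion is to list those that are neither numeric nor boolean. The sketch below uses a small hypothetical frame (`toy`, with made-up values) rather than the project data, so the column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the project data; values are made up.
toy = pd.DataFrame({
    "goal": [500.0, 12000.0],
    "staff_pick": [True, False],
    "country": ["US", "GB"],
})

# Columns that are neither numeric nor boolean still need to be encoded.
non_numeric = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(non_numeric)
```

Running the same check on the real frame after preprocessing should return an empty list if every column was converted.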

In [3]:
# Select data for only successful and failed projects
df = df[df['state'].isin(['successful', 'failed'])]

In [4]:
# Select observations with a goal amount less than or equal to 1M
df = df[df['goal'] <= 1000000]

In [5]:
# Select observations with a backers count less than or equal to 1000
df = df[df['backers_count'] <= 1000]

In [6]:
# Select observations with a blurb length less than 35
df = df[df['blurb_len'] < 35]
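
The three numeric filters above could also be combined into a single boolean mask, which makes it easy to report how many rows each pass removes. This is a minimal sketch on a hypothetical frame (`toy`) with made-up values; only the column names and thresholds come from the cells above.

```python
import pandas as pd

# Hypothetical rows; only the column names and thresholds mirror the notebook.
toy = pd.DataFrame({
    "goal": [5_000, 2_000_000, 800, 1_000_000],
    "backers_count": [120, 50, 4_000, 1_000],
    "blurb_len": [20, 18, 25, 34],
})

# Each condition is parenthesised because & binds tighter than <= and <.
mask = (
    (toy["goal"] <= 1_000_000)
    & (toy["backers_count"] <= 1_000)
    & (toy["blurb_len"] < 35)
)
filtered = toy[mask]
print(f"Kept {len(filtered)} of {len(toy)} rows")
```

The result is the same as applying the filters one at a time, since every condition only removes rows.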

In [7]:
# Keep only the categories that contain both successful and failed projects

# Select the categories to keep
categories = df['category'].unique().tolist()
selected_categories = [category for category in categories if df[df['category'] == category]['state'].nunique() == 2]

# Use the selected categories to subset the data
df = df[df['category'].isin(selected_categories)]
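
As an alternative sketch, the same selection can be expressed with a single `groupby`, which avoids re-filtering the frame once per category. The frame below (`toy`) is hypothetical and exists only to illustrate the idea.

```python
import pandas as pd

# Hypothetical data: "art" has both outcomes, "music" and "games" have one each.
toy = pd.DataFrame({
    "category": ["art", "art", "music", "games", "games"],
    "state": ["successful", "failed", "failed", "successful", "successful"],
})

# Count the distinct outcomes observed in each category.
states_per_category = toy.groupby("category")["state"].nunique()

# Keep the categories in which both outcomes occur.
selected = states_per_category[states_per_category == 2].index.tolist()
toy_filtered = toy[toy["category"].isin(selected)]
print(selected)
```

This relies on the frame containing only the two target states, as ensured by the earlier filter on `state`.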

In [8]:
# One-hot encode the columns

# Select the columns to be encoded
onehot_columns = ['country', 'currency', 'category']

# Encode each column in turn, append the dummy columns, then drop the original
for column in onehot_columns:
    # Build the dummy columns; drop_first omits one level per column
    onehot_array = pd.get_dummies(df[[column]], prefix=column, drop_first=True)

    # Join the dummy columns to the main data
    df = pd.concat([df, onehot_array], axis=1)

    # Drop the original column
    df = df.drop(columns=column)
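
For comparison, `pd.get_dummies` can encode several columns in one call through its `columns` parameter, replacing the originals with their dummy columns. The sketch below runs on a hypothetical frame (`toy`); passing `dtype=int` is an assumption added here so the dummy columns are integers, since recent pandas versions return boolean dummies by default.

```python
import pandas as pd

# Hypothetical frame with two categorical columns.
toy = pd.DataFrame({
    "goal": [500, 1200, 300],
    "country": ["US", "GB", "US"],
    "currency": ["USD", "GBP", "USD"],
})

# Encode both columns at once; the original columns are replaced by the dummies.
# drop_first=True omits the first level of each column (here "GB" and "GBP").
encoded = pd.get_dummies(
    toy, columns=["country", "currency"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

Dropping the first level means a row with zeros in every dummy of a group belongs to the omitted level, which avoids perfectly collinear columns.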

In [9]:
# Convert the data types of other required features to numerical type

# Convert boolean variables to numerical data
bool_cols = ['disable_communication', 'currency_trailing_code', 'staff_pick', 'spotlight']
df[bool_cols] = df[bool_cols].astype(int)

In [10]:
# Convert the target variable to a numeric label (1 = successful, 0 = failed)
df['state'] = df['state'].map({'successful': 1, 'failed': 0})

In [11]:
print(f"The shape of the preprocessed data is: {df.shape[0]} rows, {df.shape[1]} columns")

The shape of the preprocessed data is: 14862 rows, 71 columns


In [12]:
# Save the preprocessed data
df.to_csv('./../../data/preprocessed.csv', index=False)