# Feature Engineering
Created by: Pat Pascual


In [67]:
# Import necessary libraries
import pandas as pd
import numpy as np

In [68]:
# Read the dataset
df = pd.read_csv('dataset.csv', sep='|')
print(f"Shape before dropping duplicates: {df.shape}")

Shape before dropping duplicates: (35000, 21)


In [69]:
# Drop duplicate rows from the dataset
df = df.drop_duplicates()

# Display the shape of the dataset after dropping duplicates
print(f"Shape after dropping duplicates: {df.shape}")


Shape after dropping duplicates: (34989, 21)


### Highly Correlated Features

#### Pearson Correlations (|r| > 0.7)
- Feature_ee_5 (Economic Expectations) shows strong correlations with:
  - Feature_cx_6 (Consumer Index): r = 0.776
  - Feature_em_8 (Interest Rate): r = 0.972 
  - Feature_nd_9 (Number of Employees): r = 0.907
- Feature_em_8 (Interest Rate) and Feature_nd_9 (Number of Employees): r = 0.945
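
A list like the one above can be produced by scanning the upper triangle of the Pearson correlation matrix. The following is a minimal sketch, not the code used to obtain the values reported here; the helper name `high_correlation_pairs` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(method="pearson")
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["feature_a", "feature_b", "r"]
    # NaN entries (masked cells, if any survive stacking) fail this comparison and are dropped
    pairs = pairs[pairs["r"].abs() > threshold]
    return pairs.sort_values("r", key=lambda s: s.abs(), ascending=False).reset_index(drop=True)

# Toy data (hypothetical): "b" is a noisy copy of "a", "c" is independent
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})
pairs = high_correlation_pairs(toy)
print(pairs)
```

On the real data the function would be called on the numerical columns only, for example `high_correlation_pairs(df.select_dtypes("number"))`.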

#### Cramér's V Correlations (V > 0.6)
- Contact medium (Feature_cd_16) and Month contacted (Feature_md_17)
  - Strong association (V = 0.61)
  - Suggests potential redundancy in contact channel/timing information
  - Consider combining or selecting one feature
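
For reference, Cramér's V can be computed from the chi-square statistic of the contingency table of two categorical columns. The sketch below uses the plain (uncorrected) formula; whether the value reported above used a bias-corrected variant is not stated, so treat this as one possible definition. The function name `cramers_v` and the toy series are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Uncorrected Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    table = pd.crosstab(x, y)
    # correction=False disables Yates' continuity correction for 2x2 tables
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Toy data (hypothetical): identical columns give V = 1, independent columns give V = 0
same = pd.Series(["a", "b", "c"] * 10)
v_perfect = cramers_v(same, same)

left = pd.Series(["a", "a", "b", "b"] * 5)
right = pd.Series(["c", "d", "c", "d"] * 5)
v_independent = cramers_v(left, right)

print(v_perfect, v_independent)
```

On the real data the call would look like `cramers_v(df['Feature_cd_16'], df['Feature_md_17'])`.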

| Importance Level | Feature | Key Metrics | Notes |
|-----------------|---------|-------------|--------|
| **High** | Feature_dn_1 (Call Duration) | • MI: 0.079 (highest)<br>• RF: 0.281 (highest)<br>• Point-biserial: 0.409<br>• IV: 1.971 (suspicious) | Critical predictor across all metrics |
| **High** | Feature_em_8 (Interest Rate) | • MI: 0.074<br>• RF: 0.097 (2nd highest)<br>• Strong statistical significance<br>• IV: 1.037 (suspicious) | Consistent performance across metrics |
| **High** | Feature_pd_19 (Previous Campaign Outcome) | • Chi-square: 3514.89 (highest)<br>• Strong categorical MI<br>• Extremely low p-value<br>• IV: 0.536 (suspicious) | Best categorical predictor |
| **High** | Feature_md_17 (Month Contacted) | • Chi-square: 2596.58 (2nd highest)<br>• Good categorical MI<br>• IV: 0.481 (strong) | Shows strong seasonal patterns |
| **Medium** | Feature_cx_6/7 (Price/Confidence Indices) | • MI: 0.068/0.066<br>• Moderate RF importance<br>• IV: 0.435/0.782 (strong/suspicious) | Complementary economic indicators |
| **Medium** | Feature_nd_9 (Number of Employees) | • MI: 0.062<br>• RF: 0.048<br>• IV: 1.165 (suspicious) | Strong statistical significance |
| **Medium** | Feature_ee_5 (Employment Rate) | • MI: 0.055<br>• IV: inf (suspicious) | Relevant economic indicator |
| **Medium** | Feature_cd_16 (Contact Medium) | • Chi-square: 739.85<br>• Moderate categorical MI<br>• IV: 0.256 (medium) | Important for campaign execution |
| **Low** | Feature_ae_0 (Age) | • MI: 0.015<br>• Weak point-biserial<br>• IV: 0.141 (medium) | Only moderately significant |
| **Low** | Feature_cn_2 (Call Attempts) | • MI: 0.001 (lowest)<br>• Low across metrics<br>• IV: 0.055 (weak) | Possibly redundant |
| **Low** | Feature_ps_3/4 (Previous Campaign Metrics) | • Moderate to low MI<br>• Low RF importance<br>• IV: 0.000/inf (not predictive/suspicious) | May be redundant with Feature_pd_19 |
| **Very Low** | Feature_hd_14 (Home Loan) | • Non-significant chi-square (p=0.184)<br>• Very low MI<br>• IV: 0.001 (not predictive) | No clear predictive value |
| **Very Low** | Feature_ld_15 (Personal Loan) | • Non-significant chi-square (p=0.425)<br>• Very low MI<br>• IV: 0.001 (not predictive) | No clear predictive value |
| **Very Low** | Remaining Categorical Features (14-20) | • Very low MI scores (≈0.000)<br>• Low RF importance<br>• IV: mostly not predictive | Consider only strongest categories |
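
The `inf` entries in the IV column have a mechanical explanation: Information Value sums `(events share - non-events share) * WoE` over bins, and the weight of evidence `WoE = ln(events share / non-events share)` is unbounded when a bin contains no events or no non-events. The sketch below illustrates this under the assumption of a binary 0/1 target; the function name and toy data are hypothetical, and this is not necessarily how the IV values in the table were computed.

```python
import numpy as np
import pandas as pd

def information_value(feature: pd.Series, target: pd.Series) -> float:
    """Information Value of a categorical (or pre-binned) feature against a 0/1 target."""
    counts = pd.crosstab(feature, target)
    non_events = counts[0] / counts[0].sum()  # share of all 0-labelled rows in each bin
    events = counts[1] / counts[1].sum()      # share of all 1-labelled rows in each bin
    with np.errstate(divide="ignore"):
        woe = np.log(events / non_events)     # weight of evidence per bin
    return float(((events - non_events) * woe).sum())

# Toy data (hypothetical): both bins contain events and non-events, so IV is finite
feature = pd.Series(["lo", "lo", "lo", "lo", "hi", "hi", "hi", "hi"])
target = pd.Series([0, 0, 0, 1, 0, 1, 1, 1])
iv_finite = information_value(feature, target)

# Bin "a" contains no events, so its WoE is -inf and the IV becomes inf
feature_sparse = pd.Series(["a", "a", "b", "b"])
target_sparse = pd.Series([0, 0, 1, 0])
iv_infinite = information_value(feature_sparse, target_sparse)

print(iv_finite, iv_infinite)
```

An infinite IV therefore signals an empty cell in the feature/target cross-tabulation rather than infinite predictive power; merging sparse bins before computing IV is a common way to get a finite value.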

# Feature Selection Based on the Correlation Analysis and Feature Importance Metrics

## Highly Correlated Features (Pearson |r| > 0.7)
We should remove some of these highly correlated features to reduce multicollinearity:

Feature_ee_5 correlates strongly with:
- Feature_cx_6 (r = 0.776)
- Feature_em_8 (r = 0.972) 
- Feature_nd_9 (r = 0.907)

Feature_em_8 correlates with Feature_nd_9 (r = 0.945)

### Recommendation: 
From this correlated group, keep Feature_em_8 since it has:
- Second highest RF importance (0.097)
- High MI score (0.074)
- Strong statistical significance

And drop:
- Feature_ee_5 (lower importance metrics)
- Feature_nd_9 (lower importance metrics)

Keep Feature_cx_6: its correlation with Feature_ee_5 (r = 0.776) is the weakest in this group, and it provides complementary information.

## Low Importance Features
Drop these features based on consistently low importance across metrics:

### Feature_cn_2 (Call Attempts):
- Lowest MI score (0.001)
- Low RF importance (0.039)
- Weak IV (0.055)

### Feature_ps_3 and Feature_ps_4 (Previous Campaign Metrics):
- Low RF importance
- Redundant with Feature_pd_19
- Not predictive IV scores

## Categorical Feature Redundancy
Feature_cd_16 and Feature_md_17 have strong association (V = 0.61)

### Recommendation:
Keep Feature_md_17 (Month Contacted) and drop Feature_cd_16 (Contact Medium) because:
- Feature_md_17 has higher chi-square score (2596.58 vs 739.85)
- Feature_md_17 shows stronger seasonal patterns
- Feature_md_17 has higher importance metrics

## Very Low Importance Categorical Features
Drop these features based on consistently poor performance:
- All features marked as "Very Low" importance in the table
- Features with near-zero MI scores
- Features with non-significant chi-square tests

In [70]:
# Drop low importance, redundant and highly correlated features
features_to_drop = [
    # Highly correlated features
    'Feature_ee_5',  # Correlated with Feature_em_8 and lower importance
    'Feature_nd_9',  # Correlated with Feature_em_8 and lower importance
    
    # Low importance features
    'Feature_cn_2',  # Lowest MI score, low importance
    'Feature_ps_3',  # Redundant with Feature_pd_19
    'Feature_ps_4',  # Redundant with Feature_pd_19
    
    # Redundant categorical features
    'Feature_cd_16',  # Redundant with Feature_md_17
    
    # Very low importance features
    'Feature_hd_14',  # Non-significant chi-square
    'Feature_ld_15',  # Non-significant chi-square
    'Feature_dd_13',  # Low importance metrics
    'Feature_dd_18',  # Low importance metrics
    'Feature_jd_10',  # Low importance metrics
    'Feature_md_11',  # Low importance metrics
    'Feature_ed_12'   # Low importance metrics
]

df = df.drop(columns=features_to_drop)
print(f"Shape after dropping features: {df.shape}")


Shape after dropping features: (34989, 8)


In [71]:
# Get numerical features
numerical_features = ['Feature_ae_0', 'Feature_dn_1', 'Feature_cx_6', 'Feature_cx_7', 
                     'Feature_em_8']

# Display unique values and their counts for each numerical feature after removing duplicates
for feature in numerical_features:
    print(f"\n{feature} unique values after removing duplicates:")
    print(df[feature].value_counts().sort_index())
    print(f"Total unique values: {df[feature].nunique()}")
    print("-" * 50)


Feature_ae_0 unique values after removing duplicates:
Feature_ae_0
17     4
18    23
19    36
20    54
21    91
      ..
91     2
92     3
94     1
95     1
98     2
Name: count, Length: 77, dtype: int64
Total unique values: 77
--------------------------------------------------

Feature_dn_1 unique values after removing duplicates:
Feature_dn_1
0        4
1        3
2        1
3        3
4       11
        ..
3509     1
3631     1
3643     1
3785     1
4918     1
Name: count, Length: 1488, dtype: int64
Total unique values: 1488
--------------------------------------------------

Feature_cx_6 unique values after removing duplicates:
Feature_cx_6
92.201     645
92.379     224
92.431     375
92.469     149
92.649     302
92.713     144
92.756       9
92.843     231
92.893    4964
92.963     627
93.075    2079
93.200    3071
93.369     226
93.444    4354
93.749     151
93.798      55
93.876     183
93.918    5709
93.994    6590
94.027     192
94.055     192
94.199     254
94.215     262
9

# Feature Analysis and Recommendations

## Feature_ae_0 (Age)
- Distribution: Normal/bell-shaped
- Range: 17-98
- 77 unique values
- **Recommendation**: Bin into age groups (e.g., 10-year bins like 20-30, 30-40, etc.) as this is a common practice for age and will make the analysis more interpretable.

## Feature_dn_1 (Latest call duration)
- Distribution: Highly right-skewed
- Many unique values (1488)
- **Recommendation**: Bin into meaningful duration ranges (e.g., short calls <1min, medium 1-5min, long >5min) to handle the skewness and make it more interpretable.

## Feature_cx_6 & Feature_cx_7 (Consumer price/confidence index)
- Distribution: Multimodal with distinct peaks
- 26 unique values each
- **Recommendation**: Keep as is since they have meaningful distinct values.

## Feature_em_8 (Interest rate)
- Distribution: Multimodal
- 314 unique values
- **Recommendation**: Consider binning into ranges that make business sense (e.g., low, medium, high interest rates).

In [72]:
# Create binned versions of selected features
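# Age bins (pd.cut intervals are right-closed by default, so '<30' includes age 30, '30-40' covers 31-40, and so on)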
df['age_binned'] = pd.cut(df['Feature_ae_0'], 
                         bins=[0, 30, 40, 50, 60, 70, 100],
                         labels=['<30', '30-40', '40-50', '50-60', '60-70', '70+'])

# Call duration bins (assuming seconds)
df['duration_binned'] = pd.cut(df['Feature_dn_1'],
                              bins=[0, 60, 300, float('inf')],
                              labels=['short', 'medium', 'long'])

# Interest rate bins
df['interest_rate_binned'] = pd.qcut(df['Feature_em_8'], 
                                    q=5,  # quintiles
                                    labels=['very_low', 'low', 'medium', 'high', 'very_high'])

# Display the distribution of binned features and check for any NaN values
for col in ['age_binned', 'duration_binned', 'interest_rate_binned']:
    print(f"\n{col} distribution:")
    print(df[col].value_counts(dropna=False).sort_index())  # Include NaN counts
    print(f"NaN values: {df[col].isna().sum()}")
    print("-" * 50)

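# Fill any NaN values with appropriate defaults
# Only duration_binned has NaNs: the 4 zero-second calls fall outside the first interval (0, 60]
df['age_binned'] = df['age_binned'].fillna('30-40')  # Safeguard only (no NaNs above); '30-40' is the median age group
df['duration_binned'] = df['duration_binned'].fillna('short')  # Zero-second calls belong with the short calls
df['interest_rate_binned'] = df['interest_rate_binned'].fillna('medium')  # Safeguard only (no NaNs above); middle quintile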

# Verify no NaN values remain
print("\nVerifying no NaN values remain:")
for col in ['age_binned', 'duration_binned', 'interest_rate_binned']:
    print(f"{col}: {df[col].isna().sum()} NaN values")


age_binned distribution:
age_binned
<30       6295
30-40    13882
40-50     8695
50-60     5347
60-70      410
70+        360
Name: count, dtype: int64
NaN values: 0
--------------------------------------------------

duration_binned distribution:
duration_binned
short      3628
medium    21865
long       9492
NaN           4
Name: count, dtype: int64
NaN values: 4
--------------------------------------------------

interest_rate_binned distribution:
interest_rate_binned
very_low     7343
low          7166
medium       7170
high         7251
very_high    6059
Name: count, dtype: int64
NaN values: 0
--------------------------------------------------

Verifying no NaN values remain:
age_binned: 0 NaN values
duration_binned: 0 NaN values
interest_rate_binned: 0 NaN values
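
The 4 NaN values in `duration_binned` match the 4 rows with a duration of 0 in the value counts shown earlier. `pd.cut` builds right-closed intervals by default, so the first bin is `(0, 60]` and excludes 0 itself. A minimal sketch of the effect, and of the `include_lowest=True` option that closes the first interval on the left, using hypothetical toy durations:

```python
import pandas as pd

# Toy durations in seconds (hypothetical), chosen to sit on and around the bin edges
durations = pd.Series([0, 45, 60, 61, 300, 301])
edges = [0, 60, 300, float("inf")]
labels = ["short", "medium", "long"]

# Default: first interval is (0, 60], so a duration of exactly 0 gets no bin
default_bins = pd.cut(durations, bins=edges, labels=labels)

# include_lowest=True makes the first interval include its left edge
closed_bins = pd.cut(durations, bins=edges, labels=labels, include_lowest=True)

print(pd.DataFrame({"duration": durations, "default": default_bins, "include_lowest": closed_bins}))
```

With `include_lowest=True` no fill step is needed for the zero-duration rows, since they are assigned to `short` directly.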


In [73]:
df.head()

Unnamed: 0,Feature_ae_0,Feature_dn_1,Feature_cx_6,Feature_cx_7,Feature_em_8,Feature_md_17,Feature_pd_19,Response,age_binned,duration_binned,interest_rate_binned
0,57,371,92.893,-46.2,1.299,Cat_6_m***y,Cat_0_f***e,0,50-60,long,very_low
1,55,285,93.994,-36.4,4.86,Cat_6_m***y,Cat_1_n***t,0,50-60,medium,medium
2,33,52,92.893,-46.2,1.313,Cat_6_m***y,Cat_0_f***e,0,30-40,short,low
3,36,355,94.465,-41.8,4.967,Cat_4_j***n,Cat_1_n***t,0,30-40,long,very_high
4,27,189,93.918,-42.7,4.963,Cat_3_j***l,Cat_1_n***t,0,<30,medium,very_high


In [74]:
# Drop original features that have been binned
columns_to_drop = ['Feature_ae_0',    # Age
                  'Feature_dn_1',      # Call duration
                  'Feature_em_8']      # Interest rate

df = df.drop(columns=columns_to_drop)


In [75]:
df.head()

Unnamed: 0,Feature_cx_6,Feature_cx_7,Feature_md_17,Feature_pd_19,Response,age_binned,duration_binned,interest_rate_binned
0,92.893,-46.2,Cat_6_m***y,Cat_0_f***e,0,50-60,long,very_low
1,93.994,-36.4,Cat_6_m***y,Cat_1_n***t,0,50-60,medium,medium
2,92.893,-46.2,Cat_6_m***y,Cat_0_f***e,0,30-40,short,low
3,94.465,-41.8,Cat_4_j***n,Cat_1_n***t,0,30-40,long,very_high
4,93.918,-42.7,Cat_3_j***l,Cat_1_n***t,0,<30,medium,very_high


In [76]:
df.shape

(34989, 8)

# Standardization

In [77]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Create standardized versions of Feature_cx_6 and Feature_cx_7
features_to_standardize = ['Feature_cx_6', 'Feature_cx_7']
standardized_features = scaler.fit_transform(df[features_to_standardize])

# Add standardized features to dataframe with new names
df['Feature_cx_6_std'] = standardized_features[:, 0]
df['Feature_cx_7_std'] = standardized_features[:, 1]

# Drop original features
df = df.drop(columns=features_to_standardize)

# Verify the changes
print("Shape after standardization:", df.shape)
print("\nFirst few rows of standardized features:")
print(df[['Feature_cx_6_std', 'Feature_cx_7_std']].head())

Shape after standardization: (34989, 8)

First few rows of standardized features:
   Feature_cx_6_std  Feature_cx_7_std
0         -1.180303         -1.229706
1          0.722754          0.891772
2         -1.180303         -1.229706
3          1.536869         -0.277205
4          0.591390         -0.472035
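
`StandardScaler` computes `z = (x - mean) / std` per column using the population standard deviation (`ddof=0`), whereas `pandas.Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below reproduces the z-score by hand on hypothetical toy values to make the difference visible; it does not use the dataset.

```python
import pandas as pd

# Toy values (hypothetical), not taken from the dataset
values = pd.Series([92.0, 93.0, 94.0, 95.0])

# Population standard deviation (ddof=0): the convention StandardScaler uses
z_population = (values - values.mean()) / values.std(ddof=0)

# pandas default (ddof=1) gives slightly smaller absolute z-scores
z_sample = (values - values.mean()) / values.std()

print(pd.DataFrame({"value": values, "z_ddof0": z_population, "z_ddof1": z_sample}))
```

When checking scaled columns by hand, pass `ddof=0` to `std()`; otherwise the manual result will not match the scaler output exactly.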


In [78]:
df.head()

Unnamed: 0,Feature_md_17,Feature_pd_19,Response,age_binned,duration_binned,interest_rate_binned,Feature_cx_6_std,Feature_cx_7_std
0,Cat_6_m***y,Cat_0_f***e,0,50-60,long,very_low,-1.180303,-1.229706
1,Cat_6_m***y,Cat_1_n***t,0,50-60,medium,medium,0.722754,0.891772
2,Cat_6_m***y,Cat_0_f***e,0,30-40,short,low,-1.180303,-1.229706
3,Cat_4_j***n,Cat_1_n***t,0,30-40,long,very_high,1.536869,-0.277205
4,Cat_3_j***l,Cat_1_n***t,0,<30,medium,very_high,0.59139,-0.472035


# Encoding

In [79]:
# pandas is already imported as pd; the ordinal encodings below use explicit mappings, so no encoder class is needed

# 1. Feature_md_17 (Month Contacted) - One-Hot Encoding
# Reason: Categorical with no inherent order, seasonal patterns matter
month_dummies = pd.get_dummies(df['Feature_md_17'], prefix='month', drop_first=False)
df = pd.concat([df, month_dummies], axis=1)

# 2. Feature_pd_19 (Previous Campaign Outcome) - One-Hot Encoding
# Reason: Categorical variable with distinct categories
previous_campaign_dummies = pd.get_dummies(df['Feature_pd_19'], prefix='previous_campaign', drop_first=False)
df = pd.concat([df, previous_campaign_dummies], axis=1)

# 3. age_binned - Ordinal Encoding (explicit mapping)
# Reason: Has natural order, bins are meaningful intervals
age_mapping = {
    '<30': 0,
    '30-40': 1,
    '40-50': 2,
    '50-60': 3,
    '60-70': 4,
    '70+': 5
}
# The binned columns are categorical, so cast the mapped result to plain integers
df['age_encoded'] = df['age_binned'].map(age_mapping).astype(int)

# 4. duration_binned - Ordinal Encoding (explicit mapping)
# Reason: Has natural order of duration lengths
duration_mapping = {
    'short': 0,
    'medium': 1,
    'long': 2
}
df['duration_encoded'] = df['duration_binned'].map(duration_mapping).astype(int)

# 5. interest_rate_binned - Ordinal Encoding (explicit mapping)
# Reason: Has natural order of rate levels
interest_mapping = {
    'very_low': 0,
    'low': 1,
    'medium': 2,
    'high': 3,
    'very_high': 4
}
df['interest_rate_encoded'] = df['interest_rate_binned'].map(interest_mapping).astype(int)

# Drop original columns
columns_to_drop = ['Feature_md_17', 'Feature_pd_19', 'age_binned', 
                  'duration_binned', 'interest_rate_binned']
df = df.drop(columns=columns_to_drop)

In [80]:
df.head()

Unnamed: 0,Response,Feature_cx_6_std,Feature_cx_7_std,month_Cat_0_a***r,month_Cat_1_a***g,month_Cat_2_d***c,month_Cat_3_j***l,month_Cat_4_j***n,month_Cat_5_m***r,month_Cat_6_m***y,month_Cat_7_n***v,month_Cat_8_o***t,month_Cat_9_s***p,previous_campaign_Cat_0_f***e,previous_campaign_Cat_1_n***t,previous_campaign_Cat_2_s***s,age_encoded,duration_encoded,interest_rate_encoded
0,0,-1.180303,-1.229706,False,False,False,False,False,False,True,False,False,False,True,False,False,3,2,0
1,0,0.722754,0.891772,False,False,False,False,False,False,True,False,False,False,False,True,False,3,1,2
2,0,-1.180303,-1.229706,False,False,False,False,False,False,True,False,False,False,True,False,False,1,0,1
3,0,1.536869,-0.277205,False,False,False,False,True,False,False,False,False,False,False,True,False,1,2,4
4,0,0.59139,-0.472035,False,False,False,True,False,False,False,False,False,False,False,True,False,0,1,4


In [81]:
df.to_csv('df_cleaned.csv', index=False)

# Standardization and Encoding of Raw Dataset

In [82]:
# Read the dataset
df = pd.read_csv('dataset.csv', sep='|')
print(f"Shape before dropping duplicates: {df.shape}")

Shape before dropping duplicates: (35000, 21)


In [83]:
# Drop duplicate rows from the dataset
df = df.drop_duplicates()

# Display the shape of the dataset after dropping duplicates
print(f"Shape after dropping duplicates: {df.shape}")


Shape after dropping duplicates: (34989, 21)


In [84]:
df.head()


Unnamed: 0,Feature_ae_0,Feature_dn_1,Feature_cn_2,Feature_ps_3,Feature_ps_4,Feature_ee_5,Feature_cx_6,Feature_cx_7,Feature_em_8,Feature_nd_9,...,Feature_md_11,Feature_ed_12,Feature_dd_13,Feature_hd_14,Feature_ld_15,Feature_cd_16,Feature_md_17,Feature_dd_18,Feature_pd_19,Response
0,57,371,1,999,1,-1.8,92.893,-46.2,1.299,5099.1,...,Cat_1_m***d,Cat_3_h***l,Cat_0_n***o,Cat_0_n***o,Cat_2_y***s,Cat_0_c***r,Cat_6_m***y,Cat_1_m***n,Cat_0_f***e,0
1,55,285,2,999,0,1.1,93.994,-36.4,4.86,5191.0,...,Cat_1_m***d,Cat_7_u***n,Cat_1_u***n,Cat_2_y***s,Cat_0_n***o,Cat_1_t***e,Cat_6_m***y,Cat_2_t***u,Cat_1_n***t,0
2,33,52,1,999,1,-1.8,92.893,-46.2,1.313,5099.1,...,Cat_1_m***d,Cat_2_b***y,Cat_0_n***o,Cat_0_n***o,Cat_0_n***o,Cat_0_c***r,Cat_6_m***y,Cat_0_f***i,Cat_0_f***e,0
3,36,355,4,999,0,1.4,94.465,-41.8,4.967,5228.1,...,Cat_1_m***d,Cat_3_h***l,Cat_0_n***o,Cat_0_n***o,Cat_0_n***o,Cat_1_t***e,Cat_4_j***n,Cat_0_f***i,Cat_1_n***t,0
4,27,189,2,999,0,1.4,93.918,-42.7,4.963,5228.1,...,Cat_1_m***d,Cat_3_h***l,Cat_0_n***o,Cat_2_y***s,Cat_0_n***o,Cat_0_c***r,Cat_3_j***l,Cat_0_f***i,Cat_1_n***t,0


In [85]:
# Standardize numerical features except Response
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.drop('Response')
scaler = StandardScaler()
df[numerical_features] = scaler.fit_transform(df[numerical_features])

In [86]:
df.head()


Unnamed: 0,Feature_ae_0,Feature_dn_1,Feature_cn_2,Feature_ps_3,Feature_ps_4,Feature_ee_5,Feature_cx_6,Feature_cx_7,Feature_em_8,Feature_nd_9,...,Feature_md_11,Feature_ed_12,Feature_dd_13,Feature_hd_14,Feature_ld_15,Feature_cd_16,Feature_md_17,Feature_dd_18,Feature_pd_19,Response
0,1.627455,0.437448,-0.564554,0.195573,1.668929,-1.197216,-1.180303,-1.229706,-1.338106,-0.941368,...,Cat_1_m***d,Cat_3_h***l,Cat_0_n***o,Cat_0_n***o,Cat_2_y***s,Cat_0_c***r,Cat_6_m***y,Cat_1_m***n,Cat_0_f***e,0
1,1.435649,0.104904,-0.203244,0.195573,-0.351001,0.648782,0.722754,0.891772,0.714889,0.332017,...,Cat_1_m***d,Cat_7_u***n,Cat_1_u***n,Cat_2_y***s,Cat_0_n***o,Cat_1_t***e,Cat_6_m***y,Cat_2_t***u,Cat_1_n***t,0
2,-0.674223,-0.79606,-0.564554,0.195573,1.668929,-1.197216,-1.180303,-1.229706,-1.330035,-0.941368,...,Cat_1_m***d,Cat_2_b***y,Cat_0_n***o,Cat_0_n***o,Cat_0_n***o,Cat_0_c***r,Cat_6_m***y,Cat_0_f***i,Cat_0_f***e,0
3,-0.386513,0.375579,0.519376,0.195573,-0.351001,0.839748,1.536869,-0.277205,0.776577,0.846082,...,Cat_1_m***d,Cat_3_h***l,Cat_0_n***o,Cat_0_n***o,Cat_0_n***o,Cat_1_t***e,Cat_4_j***n,Cat_0_f***i,Cat_1_n***t,0
4,-1.249642,-0.266309,-0.203244,0.195573,-0.351001,0.839748,0.59139,-0.472035,0.774271,0.846082,...,Cat_1_m***d,Cat_3_h***l,Cat_0_n***o,Cat_2_y***s,Cat_0_n***o,Cat_0_c***r,Cat_3_j***l,Cat_0_f***i,Cat_1_n***t,0


In [87]:
# One-hot encode categorical features
categorical_features = df.select_dtypes(include=['object']).columns
df = pd.get_dummies(df, columns=categorical_features, drop_first=True)
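
Note that this cell uses `drop_first=True`, while the earlier encoding of the selected features used `drop_first=False`. The sketch below shows the difference on a hypothetical three-level column: with `drop_first=True` the first level (in sorted order) becomes the implicit baseline and gets no column, which avoids perfectly collinear dummies in linear models.

```python
import pandas as pd

# Toy column (hypothetical category names)
toy = pd.DataFrame({"outcome": ["a", "b", "c", "a"]})

# One column per level
full = pd.get_dummies(toy, columns=["outcome"], drop_first=False)

# First level ("a") is dropped and represented by all-False rows
reduced = pd.get_dummies(toy, columns=["outcome"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

This is why the raw-dataset output has, for example, only two `Feature_pd_19_*` columns, whereas the cleaned dataset has three `previous_campaign_*` columns.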

In [88]:
df.head()

Unnamed: 0,Feature_ae_0,Feature_dn_1,Feature_cn_2,Feature_ps_3,Feature_ps_4,Feature_ee_5,Feature_cx_6,Feature_cx_7,Feature_em_8,Feature_nd_9,...,Feature_md_17_Cat_6_m***y,Feature_md_17_Cat_7_n***v,Feature_md_17_Cat_8_o***t,Feature_md_17_Cat_9_s***p,Feature_dd_18_Cat_1_m***n,Feature_dd_18_Cat_2_t***u,Feature_dd_18_Cat_3_t***e,Feature_dd_18_Cat_4_w***d,Feature_pd_19_Cat_1_n***t,Feature_pd_19_Cat_2_s***s
0,1.627455,0.437448,-0.564554,0.195573,1.668929,-1.197216,-1.180303,-1.229706,-1.338106,-0.941368,...,True,False,False,False,True,False,False,False,False,False
1,1.435649,0.104904,-0.203244,0.195573,-0.351001,0.648782,0.722754,0.891772,0.714889,0.332017,...,True,False,False,False,False,True,False,False,True,False
2,-0.674223,-0.79606,-0.564554,0.195573,1.668929,-1.197216,-1.180303,-1.229706,-1.330035,-0.941368,...,True,False,False,False,False,False,False,False,False,False
3,-0.386513,0.375579,0.519376,0.195573,-0.351001,0.839748,1.536869,-0.277205,0.776577,0.846082,...,False,False,False,False,False,False,False,False,True,False
4,-1.249642,-0.266309,-0.203244,0.195573,-0.351001,0.839748,0.59139,-0.472035,0.774271,0.846082,...,False,False,False,False,False,False,False,False,True,False


In [89]:
df.to_csv('df_raw.csv', index=False)