# What Is Feature Extraction?

![](https://miro.medium.com/max/800/0*sQzmiOf8Yb_18HX1.png)

- Feature extraction is the process of transforming raw data into a reduced representation of its most informative features. It involves selecting or creating a set of features that are relevant and useful for a particular task or problem. Feature extraction techniques aim to capture the most important information from the data while reducing its dimensionality.

- Feature extraction is commonly used in machine learning and data analysis tasks to improve efficiency, reduce noise, and enhance predictive models. It involves applying various mathematical and statistical methods to identify patterns, relationships, or properties in the data that are relevant to the task at hand.

- Common feature extraction techniques include principal component analysis (PCA), linear discriminant analysis (LDA), and signal processing methods such as the Fourier transform. These techniques reduce the dimensionality of the data, remove irrelevant or redundant features, and create new features that better represent the underlying patterns in the data.

- Overall, feature extraction plays a crucial role in data preprocessing and feature engineering, enabling more effective and efficient analysis, modeling, and prediction.
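
As a rough illustration of the dimensionality reduction described above, the sketch below implements a bare-bones PCA with NumPy's SVD on synthetic data. The `pca_project` helper and the toy matrix are assumptions made for this example and are not part of the notebook.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return X_centered @ Vt[:n_components].T, explained_ratio[:n_components]

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
# Three columns that are noisy multiples of one hidden variable
X = np.hstack([base, 2 * base, -base]) + 0.01 * rng.normal(size=(100, 3))

Z, ratio = pca_project(X, n_components=1)
print(Z.shape, ratio)
```

Because the three columns are nearly multiples of a single hidden variable, one component should retain almost all of the variance.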

# Binary Features: Flag, Bool, True-False

In [1]:
# import Required Libraries

import numpy as np
import pandas as pd
import seaborn as sns

from datetime import date
from statsmodels.stats.proportion import proportions_ztest



In [2]:
# Adjusting Row Column Settings

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.width', 500)

In [3]:
# Loading the Data Set

df_titanic = pd.read_csv('/kaggle/input/data-science-day1-titanic/DSB_Day1_Titanic_train.csv')

In [4]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.283,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
# Create a new binary feature from the 'Cabin' column

df_titanic["NEW_CABIN_BOOL"] = df_titanic["Cabin"].notnull().astype('int')

In [6]:
# Calculate the survival rate based on the new 'NEW_CABIN_BOOL' feature

df_titanic.groupby("NEW_CABIN_BOOL").agg({"Survived": "mean"})

NEW_CABIN_BOOL,Survived
0,0.3
1,0.667
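
The same two steps (flag the non-missing values, then compare group means) can be tried on a small frame. The four rows below are made up purely for illustration.

```python
import pandas as pd

# Hypothetical mini data set: None marks a missing cabin
toy = pd.DataFrame({"Cabin": ["C85", None, "E46", None],
                    "Survived": [1, 0, 1, 1]})

# 1 if a cabin value is recorded, 0 if it is missing
toy["NEW_CABIN_BOOL"] = toy["Cabin"].notnull().astype("int")

# Mean of a 0/1 target per group is the survival rate of that group
rates = toy.groupby("NEW_CABIN_BOOL").agg({"Survived": "mean"})
print(rates)
```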


In [7]:
# Perform a two-proportion z-test to compare the survival rates between passengers with and without cabin information

test_stat, pvalue = proportions_ztest(count=[df_titanic.loc[df_titanic["NEW_CABIN_BOOL"] == 1, "Survived"].sum(),
                                             df_titanic.loc[df_titanic["NEW_CABIN_BOOL"] == 0, "Survived"].sum()],

                                      nobs=[df_titanic.loc[df_titanic["NEW_CABIN_BOOL"] == 1, "Survived"].shape[0],
                                            df_titanic.loc[df_titanic["NEW_CABIN_BOOL"] == 0, "Survived"].shape[0]])

In [8]:
# Test Statistic and p-value for Two-Proportion Z-Test Comparison of Survival Rates

print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))

Test Stat = 9.4597, p-value = 0.0000


- The two-proportion z-test gives a test statistic of 9.4597 and a p-value that rounds to 0.0000 (that is, below 0.00005).
  The survival rates of passengers with a recorded cabin (0.667) and without one (0.300) therefore differ significantly.
  A p-value this small is strong evidence against the null hypothesis of no difference in survival rates.
  This shows an association between having a recorded cabin and survival; it does not show that the cabin itself caused the difference.
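
For intuition about what a two-proportion z-test computes, the sketch below works the statistic out by hand. It assumes the pooled-variance form of the test; the helper name `two_proportion_ztest` and the counts are hypothetical and are not taken from the Titanic data.

```python
import math

def two_proportion_ztest(count1, nobs1, count2, nobs2):
    """Two-sided z-test for equal proportions using the pooled variance."""
    p1, p2 = count1 / nobs1, count2 / nobs2
    pooled = (count1 + count2) / (nobs1 + nobs2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 60 of 100 survivors in one group, 30 of 100 in the other
z, p = two_proportion_ztest(60, 100, 30, 100)
print('Test Stat = %.4f, p-value = %.4f' % (z, p))
```

The sign of the statistic only reflects which group is passed first; swapping the groups flips the sign and leaves the p-value unchanged.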

In [9]:
# Mark passengers travelling with at least one relative (SibSp + Parch > 0) as not alone

df_titanic.loc[((df_titanic['SibSp'] + df_titanic['Parch']) > 0), "NEW_IS_ALONE"] = "NO"

In [10]:
# Mark passengers with no relatives aboard (SibSp + Parch == 0) as alone

df_titanic.loc[((df_titanic['SibSp'] + df_titanic['Parch']) == 0), "NEW_IS_ALONE"] = "YES"
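
The two `.loc` assignments can also be collapsed into one step. The following is a minimal sketch using `np.where` on a made-up frame; the column values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers: siblings/spouses and parents/children aboard
toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

relatives = toy["SibSp"] + toy["Parch"]
# "YES" when no relatives are aboard, "NO" otherwise
toy["NEW_IS_ALONE"] = np.where(relatives == 0, "YES", "NO")
print(toy)
```

Both branches are assigned in a single pass, so no row can be left unlabelled by accident.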

In [11]:
# Survival Rate Based on "NEW_IS_ALONE" Feature

df_titanic.groupby("NEW_IS_ALONE").agg({"Survived": "mean"})

NEW_IS_ALONE,Survived
NO,0.506
YES,0.304


In [12]:
# Proportions Z-Test Results for Survival Rate based on "NEW_IS_ALONE" Feature

test_stat, pvalue = proportions_ztest(count=[df_titanic.loc[df_titanic["NEW_IS_ALONE"] == "YES", "Survived"].sum(),
                                             df_titanic.loc[df_titanic["NEW_IS_ALONE"] == "NO", "Survived"].sum()],

                                      nobs=[df_titanic.loc[df_titanic["NEW_IS_ALONE"] == "YES", "Survived"].shape[0],
                                            df_titanic.loc[df_titanic["NEW_IS_ALONE"] == "NO", "Survived"].shape[0]])

In [13]:
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))

Test Stat = -6.0704, p-value = 0.0000


- The test statistic is -6.0704 and the p-value rounds to 0.0000. The statistic is negative because the first group passed to the test (passengers travelling alone, survival rate 0.304) has a lower survival rate than the second group (0.506). Since the p-value is below 0.05, the null hypothesis of equal survival rates is rejected: survival differs significantly between passengers who travelled alone and those who did not.

# Text Features

In [14]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NEW_CABIN_BOOL,NEW_IS_ALONE
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,NO
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.283,C85,C,1,NO
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,YES
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,NO
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,YES


In [15]:
# Character count (includes spaces and punctuation, not only letters)

df_titanic["NEW_NAME_COUNT"] = df_titanic["Name"].str.len()

In [16]:
# Word Count

df_titanic["NEW_NAME_WORD_COUNT"] = df_titanic["Name"].apply(lambda x: len(str(x).split(" ")))
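
To make the difference between the two counts concrete, the sketch below applies both to two names taken from the table above. It uses the vectorised `.str` accessor for the word count, which is an alternative to the `apply` call and not the notebook's own code.

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# Number of characters, including spaces, commas and periods
char_count = names.str.len()

# Number of whitespace-separated tokens
word_count = names.str.split().str.len()

print(char_count.tolist(), word_count.tolist())
```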

In [17]:
# Capturing special structures: count the words in each name that start with "Dr"

df_titanic["NEW_NAME_DR"] = df_titanic["Name"].apply(lambda name: len([word for word in name.split() if word.startswith("Dr")]))

In [18]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NEW_CABIN_BOOL,NEW_IS_ALONE,NEW_NAME_COUNT,NEW_NAME_WORD_COUNT,NEW_NAME_DR
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,NO,23,4,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.283,C85,C,1,NO,51,7,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,YES,22,3,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,NO,44,7,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,YES,24,4,0


In [19]:
df_titanic.groupby("NEW_NAME_DR").agg({"Survived": ["mean","count"]})

NEW_NAME_DR,Survived mean,Survived count
0,0.383,881
1,0.5,10
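
A prefix check such as `startswith("Dr")` flags any word beginning with those two letters, not only the title. The sketch below contrasts it with a stricter pattern that requires the trailing period; the three names are invented for illustration.

```python
import pandas as pd

# Hypothetical names: only the first one carries the title "Dr."
names = pd.Series(["Smith, Dr. Alan", "Drew, Mr. James", "Brown, Miss. Ann"])

# Prefix check: 1 if any word starts with "Dr"
prefix_flag = names.apply(
    lambda name: int(any(word.startswith("Dr") for word in name.split()))
)

# Stricter check: the whole word "Dr" followed by a period
title_flag = names.str.contains(r"\bDr\.", regex=True).astype(int)

print(prefix_flag.tolist(), title_flag.tolist())
```

Which variant is appropriate depends on whether surnames like the hypothetical "Drew" should count.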


# Regex Features

In [20]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NEW_CABIN_BOOL,NEW_IS_ALONE,NEW_NAME_COUNT,NEW_NAME_WORD_COUNT,NEW_NAME_DR
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,NO,23,4,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.283,C85,C,1,NO,51,7,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,YES,22,3,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,NO,44,7,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,YES,24,4,0


In [21]:
# Extracting Titles from Names to Create a New Feature "NEW_TITLE"

df_titanic['NEW_TITLE'] = df_titanic.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

In [22]:
df_titanic.head(20)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NEW_CABIN_BOOL,NEW_IS_ALONE,NEW_NAME_COUNT,NEW_NAME_WORD_COUNT,NEW_NAME_DR,NEW_TITLE
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,NO,23,4,0,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.283,C85,C,1,NO,51,7,0,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,YES,22,3,0,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,NO,44,7,0,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,YES,24,4,0,Mr
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.458,,Q,0,YES,16,3,0,Mr
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.862,E46,S,1,YES,23,4,0,Mr
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,0,NO,30,4,0,Master
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.133,,S,0,NO,49,7,0,Mrs
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.071,,C,0,NO,35,5,0,Mrs
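
The title pattern can be checked in isolation with the standard `re` module. The pattern matches a space, captures a run of letters, and requires a literal period right after it. The `extract_title` helper and the last test string are assumptions added for this sketch.

```python
import re

# Space, one or more letters (captured), then a literal period
TITLE_PATTERN = re.compile(r' ([A-Za-z]+)\.')

def extract_title(name):
    """Return the first title-like token in a name, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "No title here",  # hypothetical input without a period
]
titles = [extract_title(name) for name in names]
print(titles)
```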


In [23]:
# Analysis of Survival Rate and Age Distribution by Title

df_titanic[["NEW_TITLE", "Survived", "Age"]].groupby(["NEW_TITLE"]).agg({"Survived": "mean", "Age": ["count", "mean"]})

NEW_TITLE,Survived mean,Age count,Age mean
Capt,0.0,1,70.0
Col,0.5,2,58.0
Countess,1.0,1,33.0
Don,0.0,1,40.0
Dr,0.429,6,42.0
Jonkheer,0.0,1,38.0
Lady,1.0,1,48.0
Major,0.5,2,48.5
Master,0.575,36,4.574
Miss,0.698,146,21.774


# Date Features

In [24]:
# Loading the Data Set

df_course_reviews = pd.read_csv('/kaggle/input/course-reviewscsv/course_reviews.csv')

In [25]:
df_course_reviews.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0


In [26]:
df_course_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4323 entries, 0 to 4322
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rating              4323 non-null   float64
 1   Timestamp           4323 non-null   object 
 2   Enrolled            4323 non-null   object 
 3   Progress            4323 non-null   float64
 4   Questions Asked     4323 non-null   float64
 5   Questions Answered  4323 non-null   float64
dtypes: float64(4), object(2)
memory usage: 202.8+ KB


In [27]:
# Convert Timestamp to datetime (the format string must also cover the time part)

df_course_reviews['Timestamp'] = pd.to_datetime(df_course_reviews["Timestamp"], format="%Y-%m-%d %H:%M:%S")
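
As a quick check of the conversion, the sketch below parses two timestamp strings from the table above with a format that covers both the date and the time, then reads a few components through the `.dt` accessor.

```python
import pandas as pd

raw = pd.Series(["2021-02-05 07:45:55", "2021-02-04 21:05:32"])

# The format string describes the full value, including hours, minutes and seconds
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

years = parsed.dt.year.tolist()
months = parsed.dt.month.tolist()
day_names = parsed.dt.day_name().tolist()
print(years, months, day_names)
```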

In [28]:
# Year

df_course_reviews['year'] = df_course_reviews['Timestamp'].dt.year

In [29]:
df_course_reviews.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered,year
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0,2021
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0,2021
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0,2021
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0,2021
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0,2021


In [30]:
# Month

df_course_reviews['month'] = df_course_reviews['Timestamp'].dt.month

In [31]:
df_course_reviews.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered,year,month
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0,2021,2
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0,2021,2
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0,2021,2
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0,2021,2
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0,2021,2


In [32]:
# Year difference between today and the review date (the result depends on when the notebook is run)

df_course_reviews['year_diff'] = date.today().year - df_course_reviews['Timestamp'].dt.year

In [33]:
df_course_reviews.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered,year,month,year_diff
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0,2021,2,2
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0,2021,2,2
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0,2021,2,2
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0,2021,2,2
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0,2021,2,2


In [34]:
# Month difference between today and the review date: (year difference * 12) + month difference

df_course_reviews['month_diff'] = (date.today().year - df_course_reviews['Timestamp'].dt.year) * 12 + date.today().month - df_course_reviews['Timestamp'].dt.month

In [35]:
df_course_reviews.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered,year,month,year_diff,month_diff
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0,2021,2,2,29
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0,2021,2,2,29
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0,2021,2,2,29
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0,2021,2,2,29
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0,2021,2,2,29
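
Since `date.today()` changes with every run, a fixed reference date makes the difference features reproducible. The sketch below uses only the standard library; the `month_diff` helper is hypothetical, and the reference date of 2023-07-01 is an assumption chosen so that the toy result lines up with the output shown above.

```python
from datetime import date, datetime

def month_diff(reference, timestamp):
    """Calendar-month difference; the day of the month is ignored."""
    return (reference.year - timestamp.year) * 12 + (reference.month - timestamp.month)

reference_date = date(2023, 7, 1)               # assumed fixed reference date
review_time = datetime(2021, 2, 5, 7, 45, 55)   # first Timestamp in the table

year_diff = reference_date.year - review_time.year
months = month_diff(reference_date, review_time)
print(year_diff, months)
```

Because only the year and month are compared, two dates one day apart can still differ by a full month if they straddle a month boundary.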


In [36]:
# Day name

df_course_reviews['day_name'] = df_course_reviews['Timestamp'].dt.day_name()

In [37]:
df_course_reviews.head()

Unnamed: 0,Rating,Timestamp,Enrolled,Progress,Questions Asked,Questions Answered,year,month,year_diff,month_diff,day_name
0,5.0,2021-02-05 07:45:55,2021-01-25 15:12:08,5.0,0.0,0.0,2021,2,2,29,Friday
1,5.0,2021-02-04 21:05:32,2021-02-04 20:43:40,1.0,0.0,0.0,2021,2,2,29,Thursday
2,4.5,2021-02-04 20:34:03,2019-07-04 23:23:27,1.0,0.0,0.0,2021,2,2,29,Thursday
3,5.0,2021-02-04 16:56:28,2021-02-04 14:41:29,10.0,0.0,0.0,2021,2,2,29,Thursday
4,4.0,2021-02-04 15:00:24,2020-10-13 03:10:07,10.0,0.0,0.0,2021,2,2,29,Thursday


# Feature Interaction

In [38]:
# Loading the Data Set

df_titanic = pd.read_csv('/kaggle/input/data-science-day1-titanic/DSB_Day1_Titanic_train.csv')

In [39]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.283,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [40]:
# Interaction of age and passenger class (NaN where Age is missing)

df_titanic["NEW_AGE_PCLASS"] = df_titanic["Age"] * df_titanic["Pclass"]

In [41]:
df_titanic["NEW_FAMILY_SIZE"] = df_titanic["SibSp"] + df_titanic["Parch"] + 1
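
The two interaction features above can be reproduced on a tiny frame to see how missing values behave. The rows are hypothetical; the second one has no age.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [22.0, np.nan], "Pclass": [3, 1],
                    "SibSp": [1, 0], "Parch": [0, 0]})

# A product involving a missing Age stays missing
toy["NEW_AGE_PCLASS"] = toy["Age"] * toy["Pclass"]

# Relatives aboard plus the passenger themselves
toy["NEW_FAMILY_SIZE"] = toy["SibSp"] + toy["Parch"] + 1
print(toy)
```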

In [42]:
# Combine sex and age band into one categorical feature; rows with a missing Age stay NaN

df_titanic.loc[(df_titanic['Sex'] == 'male') & (df_titanic['Age'] <= 21), 'NEW_SEX_CAT'] = 'youngmale'

In [43]:
df_titanic.loc[(df_titanic['Sex'] == 'male') & (df_titanic['Age'] > 21) & (df_titanic['Age'] < 50), 'NEW_SEX_CAT'] = 'maturemale'

In [44]:
df_titanic.loc[(df_titanic['Sex'] == 'male') & (df_titanic['Age'] >= 50), 'NEW_SEX_CAT'] = 'seniormale'

In [45]:
df_titanic.loc[(df_titanic['Sex'] == 'female') & (df_titanic['Age'] <= 21), 'NEW_SEX_CAT'] = 'youngfemale'

In [46]:
df_titanic.loc[(df_titanic['Sex'] == 'female') & (df_titanic['Age'] > 21) & (df_titanic['Age'] < 50), 'NEW_SEX_CAT'] = 'maturefemale'

In [47]:
df_titanic.loc[(df_titanic['Sex'] == 'female') & (df_titanic['Age'] >= 50), 'NEW_SEX_CAT'] = 'seniorfemale'
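
The six assignments share one rule: an age band (up to 21, between 21 and 50, 50 and over) joined with the sex. A minimal sketch of that rule as a single function follows; the `sex_age_category` helper and the toy rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def sex_age_category(sex, age):
    """Return e.g. 'youngmale' or 'seniorfemale'; NaN when the age is missing."""
    if pd.isna(age):
        return np.nan
    if age <= 21:
        band = "young"
    elif age < 50:
        band = "mature"
    else:
        band = "senior"
    return band + sex

# Hypothetical rows that sit on or next to the band boundaries
toy = pd.DataFrame({"Sex": ["male", "female", "male", "female", "male"],
                    "Age": [21, 22, 50, 49, np.nan]})
toy["NEW_SEX_CAT"] = toy.apply(
    lambda row: sex_age_category(row["Sex"], row["Age"]), axis=1
)
print(toy)
```

Keeping the boundaries in one place makes it harder for the male and female branches to drift apart.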

In [48]:
df_titanic.head(30)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,NEW_AGE_PCLASS,NEW_FAMILY_SIZE,NEW_SEX_CAT
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,66.0,2,maturemale
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.283,C85,C,38.0,2,maturefemale
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,78.0,1,maturefemale
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,35.0,2,maturefemale
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,105.0,1,maturemale
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.458,,Q,,1,
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.862,E46,S,54.0,1,seniormale
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,6.0,5,youngmale
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.133,,S,81.0,3,maturefemale
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.071,,C,28.0,2,youngfemale


In [49]:
df_titanic.groupby("NEW_SEX_CAT")["Survived"].mean()

NEW_SEX_CAT
maturefemale   0.774
maturemale     0.199
seniorfemale   0.909
seniormale     0.135
youngfemale    0.679
youngmale      0.250
Name: Survived, dtype: float64